US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • AGIS cleanup of the "Associated CE queues" section of PQs, requested by Tadashi for Harvester and applicable to other pilot submission channels as well:
        • only production-functional queues should be listed there
    • 13:20 13:25
      OSG software issues 5m
      Speaker: Brian Lin (University of Wisconsin)
    • 13:25 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Today is the OSG area coordinators meeting focusing on networking.  All are welcome to join: https://opensciencegrid.github.io/management/area-coordinators/

      We are getting the word out on two items related to perfSONAR:

      1.  Please update your instances to CentOS 7 ASAP
      2. OSG networking services will be migrating from grid.iu.edu over to opensciencegrid.org (and moving from GOC to AGLT2) by May 31, 2018

      The next HEPiX NFV working group meeting is coming up April 25:  https://indico.cern.ch/event/715631/ 


    • 13:45 13:50
      XCache, R&D for the Data Delivery Layer 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky, Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 13:55
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • production
          • fluctuation in the number of running job slots at the beginning of the week, due to a central service issue (expired certificate)
        • overlay option disabled in singularity on all WNs, in light of the reported vulnerability
        • migration to SL7 ongoing; newly purchased WNs are on SL7 (with a singularity SL6 image). The rest will be migrated together with the ToR switch migration, to minimize downtime.
        • dCache pool nodes upgraded from 3.0.11 to 3.0.43, to fix a login handshake problem with newer xrootd clients (> 4.7.0)
        • transition to the LCMAPS VOMS plugin ongoing on the CE side (in rolling fashion). The SE (dCache) is almost done (only a corner case, grid-proxy-init users, is not covered).
        • BNL is in the process of creating a dump of the namespace. A program is currently running to fill an additional database column with the full path, which will be used to create the dump. BNL currently has about 130M files, and filling the column is going a bit slower than 100 Hz, so it will take three weeks or so. But once the table is filled and kept up to date, creating the dump will be much faster.
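The multi-week estimate above is simple rate arithmetic; a quick sketch (the 90 Hz fill rate is an illustrative stand-in for "a bit slower than 100 Hz"):

```python
# Rough ETA for filling the full-path column at BNL.
# 130M files at "a bit slower than 100 Hz" -- 90 Hz is an assumed stand-in.
n_files = 130_000_000
fill_rate_hz = 90  # rows filled per second (assumption)

seconds = n_files / fill_rate_hz
days = seconds / 86_400
weeks = days / 7
print(f"{days:.1f} days ~= {weeks:.1f} weeks")  # -> 16.7 days ~= 2.4 weeks
```

At rates somewhat below 90 Hz this stretches toward the quoted three weeks.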
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        C6420 chassis and R6420 sleds (56 HT cores) have been delivered to UM and MSU.  7 sleds are currently online and running HTCondor jobs, and 12 additional sleds should be configured by COB Friday.  The balance awaits the installation of a 10Gb switch at MSU.

        These C6420s are current hogs.  See the attached plots.

        The UM site has a coolant leak that will be addressed tomorrow.  Unfortunately, all cooling must be shut down to find the leak, so today we began idling down the WNs in order to power them off, minimizing the room heat load.  Service VMs and machines, along with dCache storage, will remain online, hopefully within the cooling available from a portable unit that will be put in place during the repairs.  The repairs are expected to be completed by COB tomorrow.

        Approximately 2/3 of our WN total are now running SL7.  The last 1/3 will transition more slowly as operation of the muon-calibration center needs to be carefully transitioned.

        Our space reporting via Ruby script transitioned to SL7 on Friday and broke, as the Ruby ActiveRecord version jumped by two major versions, from 2.3.18 to 4.2.6.  This was fixed by Shawn over the weekend, and our space reporting is once again up to date.  The Wiki page below has been updated with the changed code.
        https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DcacheSpaceReportingJsonViaRuby

        The xrootd problem reported a few days ago and summarized at this URL
        https://github.com/dCache/dcache/pull/3562
        has not been observed here, to the best of my knowledge.  We are running dCache 4.0.5 with a mix of xrootd RPMs, depending on whether the WN is SL6 (4.6.1) or SL7 (4.8.1).

        AGLT2 now has a full suite of SL7 PandaQueues in operation.  SL6 PQs are also still enabled, and will remain so until the last SL6 WN is retired.

        To quote Xin, "overlay option disabled in singularity on all WNs, in light of the reported vulnerability".


      • 14:05
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        MWT2 Update

        • UC
          • C6420 deployment: all 20 new workers are now online and in production
          • Elasticsearch: received new hardware to expand ES cluster
          • dCache: updated to 3.1.32, applied LCMAPS configuration change
          • Gatekeepers: reconfigured for LCMAPS
        • IU
          • C6420 deployment: working on getting the first C6420 built 
        • UIUC
          • GPFS stale file handle on mwt2-gk and workers

        Stampede2 Update

        • On Saturday/Sunday (3/24-3/26) Stampede2 was nearly empty
        • Stampede PanDA Qs had many activated jobs available
          • CONNECT_STAMPEDE_MCORE
          • CONNECT_ES_STAMPEDE_MCORE
        • Started a large run on S2
          • 640 nodes (out of 1680 nodes)
          • Peaked at 30720 cores (out of 80640 cores)
          • 3840 simultaneous jobs
          • Standard and ES MCORE jobs (simulation)
          • Processed 70M events with 85K jobs over about 48 hours
          • 90% efficiency
        • Steady state operation will be much lower
          • 300 nodes with 14400 cores (1800 jobs)
          • Run started yesterday (4/10-4/11)
          • 9M events, 10.5K jobs
          • 90% efficiency
          • Asked by TACC admins to limit the number of jobs due to potential I/O load issues
        • All PandaQs are now tagged with resource type HPC
        • Using a Singularity image based on the one in the ATLAS repository
        • LSM with pCache gets an 85% hit rate on stage-in files
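The large-run figures above are mutually consistent; a quick cross-check of the implied per-job throughput (all input numbers taken from the report, derived values rounded):

```python
# Cross-check of the reported Stampede2 large run.
events = 70_000_000   # events processed
jobs = 85_000         # jobs completed
hours = 48            # wall-clock duration of the run
slots = 3840          # simultaneous jobs at peak

events_per_job = events / jobs            # ~823 events per job
avg_job_hours = (slots * hours) / jobs    # ~2.2 hours per job if fully packed
event_rate_hz = events / (hours * 3600)   # ~405 events/s aggregate
print(events_per_job, avg_job_hours, event_rate_hz)
```

So the run averaged roughly 800 events per job at about two hours each, i.e. an aggregate rate of about 400 Hz across the 3840 slots.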
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        NET3 "Northeast Tier 3" successfully launched.  38 users from BU, Harvard, and UMASS/Amherst.  

        BU+MIT+ESnet are working on re-establishing our LHCONE peering.  The current theory is that a bad card at MANLAN is causing the problem.  We have been jumping on and off LHCONE over the past week for testing.

        Working on finishing the LCMAPS migration (with Brian Lin helping), turning off GRAM at BU, and migrating from Bestman to Wei's GridFTP with an Adler32 callout.
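An Adler32 checksum of the kind such a callout reports can be reproduced with Python's zlib module. This is a generic sketch (function name and chunk size are illustrative), not the actual callout code:

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler32 checksum of a file, reading in 1 MiB chunks."""
    checksum = 1  # Adler32 is defined to start from 1, not 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    # Report as 8 zero-padded hex digits, the form stored in the Rucio catalog
    return f"{checksum & 0xFFFFFFFF:08x}"
```

Zero-padding matters: a checksum whose leading byte is zero must still be reported as 8 hex digits, or it will mismatch the catalog value.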

        Lots of NESE activity: the first NESE deployment has been ordered (10.8 PB).  Its first major test will be an ATLAS storage endpoint.

        Production is smooth, with a noticeable switchover from mcore domination to production domination.

        Starting to plan the RH7 migration.

        Annual MGHPCC maintenance down day is May 22.  

      • 14:15
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - Lucille is having A/C issues, though, so running at reduced capacity

        - still waiting for the Rucio server update in order to take advantage of the local xrootd redirector for jobs running on OSCER

        - singularity tests currently on hold because we can't run without overlay turned on


      • 14:20
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • Finished updating storage at the site; now at 425 TB
        • Walked a rebuild through the cluster to fix a Rocks issue
        • Fixing an issue with SAM.

        SWT2_CPB:

        • HTCondor transition now at 3,000 cores; scaling issues have appeared and are being investigated
        • Seeing a potential XRootD problem when checksumming long filenames
        • SWT2_CPB_GROUPDISK has been retired.

        General:

        • CPU time of jobs in Torque seems to be wrong: startup/shutdown is accounted for, but the main processing time is not seen.
      • 14:25
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:30 14:35
      AOB 5m