US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:10
      Singularity / centos 7 deployment in the US cloud 5m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:10 13:15
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:15 13:20
      OSG software issues 5m
      Speakers: Brian Lin (University of Wisconsin), Patrick Mcguigan (University of Texas at Arlington (US))

      HTCondor-CE transitions

      • BU - upgrading the CE to latest OSG 3.4, had questions about error messages
      • Harvard - troubleshooting Slurm draining after Slurm controller issue
      • OU - OCHEP migration targeted in a couple of weeks
      • UTA - questions about LCMAPS VOMS
    • 13:20 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:25 13:30
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:30 13:35
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:35 13:40
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Lots of work ongoing in OSG networking to reorganize the networking services.  New VM monitoring perfSONAR RSV at http://psrsv.grid.iu.edu/rsv/

    • 13:40 13:45
      XCache 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:45 13:50
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))

      Titan continues production using custom backfill and allocation queues with 15.2M events processed in the last two weeks. Harvester deployment is advancing, aiming for production with mini-pilots by end of February. Singularity container testing is continuing using the current custom workflows. Some discussion will be needed on how to integrate containers in the Harvester workflow in a unified way.

      NERSC continues production with standard grid pilots on Edison with 5.7M events. Harvester is currently being tested on the Cori queues with the mini-pilot, which is why Cori is no longer in production. The goal is similar to Titan: to have Harvester in production by end of February, using Shifter containers.

      ALCF processed 9.5M simulated events in the last two weeks. Harvester is being used with mini-pilots here. We are in the process of testing Yoda with jumbo jobs, but are currently ironing out the details. One aspect of jumbo jobs that we did not anticipate is that all 150k input files must reach the source RSE before Harvester will begin launching jobs. We have been using Rucio at Theta lately, which transfers one file at a time and therefore takes a long time. We are now swapping in the Globus Online plugins to make this transfer much faster. Hopefully this will be in place by next week and we can get some idea of the scaling on Theta using Harvester + Yoda (Event Service) + Jumbo Jobs. We are also going to begin using Singularity containers on Theta, as Singularity has been installed on all nodes.

    • 13:50 14:25
      Site Reports
      • 13:50
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 13:55
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        AGLT2_SL7 queue has been working fine for the past 2 weeks.

        Yesterday added AGLT2_MCORE_SL7, which is online, but is getting no brokered jobs.  Investigating.

        Working to go to LCMAPS mapping for dCache, which will in turn allow us to turn off our GUMS servers.  We expect to do that later this afternoon as the initial change-over seems to be working.

        We have now ordered hardware from Dell: C6420 sleds plus some needed switch infrastructure to support them.  Delivery is expected near the end of February.

         

      • 14:00
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Site is now performing well and full of jobs

         

        Various software updates

        • HEP_OSlibs 1.1.4 has been fully deployed
        • CVMFS 2.4.4
        • HTCondor 8.4.12
        • OSG 3.3.32
        • Judith will be working on the LCMAPS conversion

         

        Stampede2 integration via CONNECT

        • Stratum-R replication of repositories to disk
        • CONNECT Stampede PanDA queues created in AGIS
        • Will use LSM for data transfer
        • Skylakes with 96 cores, 192GB, 2GB/core, CentOS7
          • No swap space on workers
          • Limit usage to 48 cores to avoid OOM problems
        • Using Singularity (CentOS6) container with additions for
          • Symlink of /cvmfs to repositories
          • Additional RPMS for LSM and pcache
          • Additional mounts for /home1 and /work
          • HEP_OSlibs 1.1.4 etc
        • CONNECT SSH glidein starts Singularity container on worker node
          • HTCondor overlay starts inside Singularity
          • Pilot runs in overlay
          • Network issue to Italy on all of S2 caused validation jobs to fail (now fixed)
          • Hammer Cloud jobs running
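
        The Stampede2 worker-node figures quoted above (96 cores, 192GB, no swap, usage limited to 48 cores) can be checked with a short calculation. This is only a sketch: the function name is hypothetical, and it assumes memory is shared evenly across the job slots in use, which the notes do not state.

```python
def memory_per_slot_gb(total_memory_gb: float, slots_in_use: int) -> float:
    """Memory available to each job slot, assuming RAM is shared evenly."""
    if slots_in_use <= 0:
        raise ValueError("slots_in_use must be positive")
    return total_memory_gb / slots_in_use


# Figures taken from the notes above.
NODE_MEMORY_GB = 192
NODE_CORES = 96
CORE_LIMIT = 48

all_cores_gb = memory_per_slot_gb(NODE_MEMORY_GB, NODE_CORES)
limited_gb = memory_per_slot_gb(NODE_MEMORY_GB, CORE_LIMIT)

print(f"Using all {NODE_CORES} cores: {all_cores_gb:.1f} GB per core")
print(f"Limited to {CORE_LIMIT} cores: {limited_gb:.1f} GB per core")
```

        With no swap on the workers, halving the number of cores in use doubles the memory headroom per job, which is consistent with the stated aim of avoiding OOM problems.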

         

        Two unrelated incidents took Illinois offline

        • Juniper network switch failed on Monday February 5
          • Was part of a redundant pair but failure took out both
          • Removal of bad switch allowed site to come back online
          • Some type of hardware failure coupled with a firmware bug
          • Overnighted replacement switch was DOA
          • Redundancy resumed on Wednesday February 7
        • GPFS lockup at 2 AM on Saturday February 10
          • Nightly snapshot locked up, hanging the filesystem
          • Required a complete reboot of the cluster
          • System resumed operation by 10:30 AM

         

        Illinois will have the monthly PM on Wednesday February 20

        • Part of PM will involve extensive network testing
        • Will require a shutdown of GPFS
        • MWT2 will take brief downtime to move GridFTP door VM

         

         

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Ongoing issues:

        1. Brian Lin is helping us track down HTCondor issues with Slurm on the Harvard side and with SGE on the BU side.

        2. Working on dropping our use of edg-mkgridmap.pl and migrating from BeStMan to GridFTP.

        3. Ongoing GPFS operations - repairing some bad LUNs, file system maintenance, migration of system pool to warrantied equipment.

        4. Preparations for first big NESE deployment.

        5. HU squid issue just came up this morning.  Dan is working on it. 

        6. Re-enabled LHCONE peering, but there were immediate problems.  Networking team is working with MANLAN to investigate and fix.

      • 14:10
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - Mostly smooth running

        - Some storage issues at OSCER; Dell is investigating the storage server that keeps crashing

        - Still seeing 'Auth failed' stagein errors at OSCER. These are most likely not site related, since they also happened at Lucille. According to Wei, the error appears because the xrdcp command does not have a VOMS proxy

        - Lucille still working on reconfiguring their storage, have HA issues

        - Lucille A/R numbers for January are incorrect, since most of the time it was in a scheduled downtime, so the R number should be higher. Opening ticket.
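
        The reasoning behind the A/R remark above can be illustrated with a small calculation. This is a sketch only: it assumes the commonly used definitions in which availability counts all hours in the period while reliability excludes scheduled downtime from the denominator, and the hour counts below are hypothetical, not Lucille's actual January figures.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period during which the site was up."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Fraction of the period, excluding scheduled downtime, that the site was up."""
    countable_hours = total_hours - scheduled_down_hours
    if countable_hours <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / countable_hours


# Hypothetical month: 31 days, mostly spent in a scheduled downtime.
TOTAL_HOURS = 31 * 24          # 744
SCHEDULED_DOWN_HOURS = 500     # assumed value for illustration
UP_HOURS = 220                 # assumed value for illustration

a = availability(UP_HOURS, TOTAL_HOURS)
r = reliability(UP_HOURS, TOTAL_HOURS, SCHEDULED_DOWN_HOURS)

print(f"Availability: {a:.1%}")
print(f"Reliability:  {r:.1%}")
```

        Under these assumptions, a site that spent most of the month in a scheduled downtime shows a low A but should show a much higher R, which is the discrepancy the ticket is meant to address.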

         

      • 14:15
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        Updated HEP_OSlibs.

        Recent Dell purchase has been delivered.

        Work continues on HTCondor-CE

         

      • 14:20
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 14:30
      AOB 5m