US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Note that the CHEP 2024 site is up: https://indico.cern.ch/event/1338689/. The submission deadline is May 10, but ATLAS will have a deadline one month earlier!

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      DOMA/Token Transition

      The plot below shows the breakdown of token-vs-certificate transfers to date (Monit link; filtered on destination=USA, grouped by auth_method). During DC24, token-based transfers peaked at more than 50% of transfer volume.
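
      As a rough illustration of the aggregation behind that plot, a quick Python sketch (the record layout is hypothetical; the real Monit schema may differ):

          # Sketch: token share of transfer volume per day, grouped by
          # auth_method. Hypothetical record layout, not the Monit schema.
          from collections import defaultdict

          def token_share_by_day(records):
              """records: dicts with 'day', 'auth_method', and 'bytes' keys."""
              totals = defaultdict(float)  # day -> total bytes transferred
              tokens = defaultdict(float)  # day -> bytes moved with token auth
              for r in records:
                  totals[r["day"]] += r["bytes"]
                  if r["auth_method"] == "token":
                      tokens[r["day"]] += r["bytes"]
              return {day: tokens[day] / totals[day] for day in totals}

          # Example: a day where tokens exceed 50% of volume, as seen in DC24.
          records = [
              {"day": "2024-02-20", "auth_method": "token", "bytes": 6e12},
              {"day": "2024-02-20", "auth_method": "x509",  "bytes": 4e12},
          ]
          print(token_share_by_day(records))  # {'2024-02-20': 0.6}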

      Many thanks to all the sites for their hard work!

       

      Post-DC24 retrospectives are currently ongoing. Request: can sites please send us any issues they observed with tokens during the DC24 period? We would like to sort through them and make sure each one is either worked on by us or pushed upstream.

       

      Pacing items to watch for this year:

      • CERN IAM services to be migrated to a new infrastructure.

      • Mature / release the FTS version that supports tokens.

       

      Working with WLCG to update a community timeline.

      Software

      • Release

        • XRootD 5.6.8 expected within the week

      • Kubernetes Accounting

        • How flexible is the wording of the milestone “Deploy monitoring, alerting and APEL accounting for UTA k8s cluster using Prometheus”?
        • Effort is beginning.

        • Met with others working on the same things (AUDITOR, KAPEL). There are certainly differences:

          • The existing KAPEL only uses summarized data, not the per-job data that GRACC expects (see the sketch after this list).

        • But we can certainly build on their existing code.

        • ATLAS k8s access for the Software Team
          • Working on access to NET2
          • Need to verify access to SWT2/UTA Google k8s
          • Status on UTA creds?
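
        Since KAPEL aggregates while GRACC wants per-job records, one option is to pull per-pod usage straight from the Prometheus HTTP API. A minimal sketch, assuming a Prometheus instance scraping cAdvisor metrics (the endpoint and namespace are hypothetical):

            # Sketch: per-pod CPU seconds from Prometheus -- the per-job
            # granularity GRACC expects. Endpoint and namespace are
            # illustrative assumptions.
            import requests

            PROM = "http://prometheus.example.org:9090"  # hypothetical

            def per_pod_cpu_seconds(namespace="atlas"):
                query = (
                    "sum by (pod) (container_cpu_usage_seconds_total"
                    f'{{namespace="{namespace}"}})'
                )
                resp = requests.get(f"{PROM}/api/v1/query",
                                    params={"query": query})
                resp.raise_for_status()
                result = resp.json()["data"]["result"]
                # One record per pod (i.e., per job), not a site summary.
                return {r["metric"]["pod"]: float(r["value"][1])
                        for r in result}

            for pod, cpu in per_pod_cpu_seconds().items():
                print(f"{pod}: {cpu:.0f} CPU-seconds")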

       

    • 13:10 13:25
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Infrastructure and Compute Farm 5m
        Speaker: Thomas Smith
      • 13:15
        Storage 5m
        Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
      • 13:20
        Tier1 Services 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • GGUS:165414 - Staging failures at BNL-OSG2_MCTAPE
          • Due to NET2 congestion
          • Could be solved with multi-hop transfers, but that would occupy space on the source DATADISK.
        • IPv6, ALMA9 and ARM tests - ongoing
        • Blacklisted over the weekend due to missing cvmfs on some nodes (see the check sketch below)
          • Looking to upgrade the cvmfs client
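
        A minimal sketch of the kind of per-node check that catches this, wrapping the standard cvmfs_config probe (the repository list is illustrative):

            # Sketch: probe ATLAS cvmfs repositories on a worker node and
            # flag any failures so the node can be drained before it starts
            # black-holing jobs. The repo list is an assumption.
            import subprocess

            REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]

            def cvmfs_ok(repo):
                try:
                    result = subprocess.run(["cvmfs_config", "probe", repo],
                                            capture_output=True, timeout=60)
                    return result.returncode == 0
                except subprocess.TimeoutExpired:
                    return False

            bad = [r for r in REPOS if not cvmfs_ok(r)]
            if bad:
                raise SystemExit("cvmfs probe failed for: " + ", ".join(bad))
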
    • 13:25 13:35
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Running was affected by DC24 - see section 2.3.5 for details. In general, the sites have been pretty stable over the past 30 days.
        • AGLT2 is fighting cvmfs problems where cvmfs hangs servers in a way that requires a reboot. They also had some minor squid/varnish issues.
        • MWT2 is working on the network upgrade at IU. Some production was lost to two FTS incidents. A dCache parameter that had been tuned to allow many movers proved to be set too high during DC24 for the older, dense storage servers.
        • NET2 is working on bringing the remaining compute servers online.
        • OU had various transfer issues. Some of the tickets received were marked "won't fix".
        • CPB needs to implement tokens. CPB got a ticket for data transfer issues.
      • Held Tier 2 Technical Meeting last week 
        • Lots of discussion about stuck/queued transfers (overran the end of the meeting).
          • CPB got blocked from receiving new work when the queue of pending transfers got too big.
          • Considerable follow up in an email thread started by Ivan.

        • Saw that the CPB site is now running high-memory jobs in the Google cloud.

        • Sites are preparing for the update to EL9 by June, with some sites further along than others.

      • Still chasing issues with the current version of cvmfs.

      • The zombie-pilot situation has improved through a zombie-pilot killer built into the pilot wrapper, but we still don't understand the underlying cause (a sketch of the general idea follows).
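
      A rough sketch of the killer's general idea, assuming (purely for illustration) that a zombie is a pilot process whose payload children have all exited and which has outlived a timeout; the real logic lives in the pilot wrapper:

          # Sketch (illustrative): terminate pilot processes with no live
          # payload children that have exceeded a maximum lifetime.
          # The threshold and the "pilot" match are assumptions.
          import time
          import psutil

          MAX_LIFETIME = 48 * 3600  # seconds; hypothetical threshold

          def looks_like_zombie(proc):
              try:
                  too_old = time.time() - proc.create_time() > MAX_LIFETIME
                  no_payload = not proc.children(recursive=True)
                  return too_old and no_payload
              except (psutil.NoSuchProcess, psutil.AccessDenied):
                  return False

          for proc in psutil.process_iter(["cmdline"]):
              cmdline = " ".join(proc.info["cmdline"] or [])
              if "pilot" in cmdline and looks_like_zombie(proc):
                  proc.terminate()  # escalate to kill() after a grace period
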
    • 13:35 13:40
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))

      Running at TACC has started.

      NERSC throughput has increased but is now held up by a poor transfer success rate between BNL and Glasgow.

      With the help of NERSC staff, we are now measuring HS23 values for various configurations using HEPScore from cvmfs:

      When running with 256 threads (the entire machine): HS23 result 1592.4074, or 6.2 per logical core.

      When running with 8 threads (still whole-node scheduling): HS23 result 145.7546, or 18.2 per logical core.

       

      We are measuring other configurations to determine the optimal configuration and values.
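
      The per-logical-core figures above are just the HS23 result divided by the thread count; a quick check:

          # Quick check of the per-logical-core numbers quoted above.
          for threads, hs23 in [(256, 1592.4074), (8, 145.7546)]:
              print(f"{threads:3d} threads: {hs23 / threads:.1f} "
                    "HS23 per logical core")
          # 256 threads: 6.2 HS23 per logical core
          #   8 threads: 18.2 HS23 per logical core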

    • 13:40 13:55
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:40
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • IRIS-HEP AGC Demo Day #4 this Friday (link)
      • 13:45
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Updating HTCondor queue configs. Previously we had short and long queues running on separate sets of nodes, which led to resource underutilization when only certain queues were used. We are now updating the queues/deployments so that both are configured with a Horizontal Pod Autoscaler (HPA); a minimal sketch follows. The deployments have node affinity to their partitions but can be scheduled across partitions. We also updated the HPA metrics so that scaling happens in a more controlled fashion.
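
        A minimal sketch of attaching an HPA to one of these queue deployments with the official Kubernetes Python client (the deployment name, namespace, and thresholds are hypothetical; the actual configs are managed in the cluster deployments):

            # Sketch: attach a Horizontal Pod Autoscaler to an HTCondor
            # worker deployment. Names, namespace, and thresholds are
            # hypothetical.
            from kubernetes import client, config

            config.load_kube_config()

            hpa = client.V1HorizontalPodAutoscaler(
                metadata=client.V1ObjectMeta(name="condor-short-queue-hpa"),
                spec=client.V1HorizontalPodAutoscalerSpec(
                    scale_target_ref=client.V1CrossVersionObjectReference(
                        api_version="apps/v1",
                        kind="Deployment",
                        name="condor-short-queue",  # hypothetical
                    ),
                    min_replicas=2,
                    max_replicas=50,
                    target_cpu_utilization_percentage=80,
                ),
            )

            client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
                namespace="af-condor", body=hpa)
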
    • 13:55 14:10
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 13:55
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • ADC:
          • Missing CVMFS repos still showing up in HC exclusions at BNL and SWT2_CPB
            • CVMFS check in wrapper activated 4 hours ago (Github)
            • Killed pilots monitoring: here
            • Will eventually prevent HC exclusions, but is there a potential worry about "black hole" nodes?  Do sites have any automated detection/response?
          • Trying to create a “US News of the day” summary mail, but it is too much work for one person. Feel free to add your observations to it (Link)
        • NET2
          • Overwhelmed with transfers / stageouts
          • 10 Gbit link (will be upgraded to 100 Gbit in the summer)
          • This should not happen
          • On the DDM to-do list: find a way to take into account the queue length at the destination (ideally also at the source and per link) when proposing destination storage for staging (see the sketch after this list).
        • SWT2
          • Blacklisted due to missing cvmfs on some nodes. The CVMFS check should solve that.
          • Slow deletions (Monit)
        • OU_OSCER
          • Removed the PQ.environ:"XRD_LOGLEVEL=Debug" from the CRIC settings. It was filling the Harvester discs over the weekend.
          • Slowest deletions in US cloud (Monit)
        • All transfers failing at WT2 (GGUS)
        • iut2-slate squid not reporting after switch maintenance (GGUS)
        • Still seeing notable xcache bypass levels at VP sites, including BNL after lowering storage watermarks (link)
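
        On the DDM item above, a toy sketch of queue-aware destination selection (entirely illustrative; a real change would live in Rucio/DDM):

            # Toy sketch (not Rucio code): choose a staging destination by
            # penalizing candidates with long pending-transfer queues.
            def pick_destination(candidates):
                """candidates: dicts with 'rse', 'free_tb', 'queued'."""
                def score(c):
                    # More free space helps; a long queue is a penalty.
                    return c["free_tb"] / (1.0 + c["queued"])
                return max(candidates, key=score)

            candidates = [
                {"rse": "NET2_DATADISK", "free_tb": 500, "queued": 20000},
                {"rse": "MWT2_DATADISK", "free_tb": 300, "queued": 150},
            ]
            # Picks MWT2_DATADISK despite less free space, because NET2's
            # pending-transfer queue dominates the score.
            print(pick_destination(candidates)["rse"])
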
      • 14:00
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • MWT2 has one more XCache
        • AGLT2 still has issues with its SLATE instances. Working on adding one xcache for them in NRP.
        • The Wuppertal node, which had OOMed, was fixed.

         

        ServiceX

          • Installing ServiceXLite on NRP and FAB.

         

      • 14:05
        Facility R&D 5m
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • Joint ATLAS / IRIS-HEP Kubernetes Hackathon coming up April 24-26 at UChicago
        • MWT2, AGLT2, and UVic (Canadian cloud) have already provided hardware and login access for the stretched platform.
        • Someone should come up with a clever name for it :)
    • 14:10 14:20
      AOB 10m