US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
13:00 → 13:05
WBS 2.3 Facility Management News (5m) - Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
13:05 → 13:10
OSG-LHC (5m) - Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
DOMA/Token Transition
The plot below shows the current breakdown of token-vs-certificate transfers (Monit link; filtered on destination=USA, grouped by auth_method). During DC24, token-based transfers peaked at more than 50% of transfer volume.
Many thanks to all the sites for their hard work!
Post-DC24 retrospectives are currently ongoing. Request: can sites please send us any issues they observed with tokens during the DC24 period? We would like to sort through the issues and make sure we upstream or work on them.
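For context, a minimal sketch of the kind of aggregation behind that plot, assuming a simplified transfer-record format (the field names are illustrative, not the actual Monit schema):

```python
# Hedged sketch, not the actual Monit query: group transfer records by
# auth_method and compute each method's share of the transferred volume,
# analogous to the plot filtered on destination=USA, grouped by auth_method.
from collections import defaultdict

def share_by_auth_method(transfers):
    """transfers: iterable of dicts with 'auth_method' and 'bytes' keys (assumed schema)."""
    volume = defaultdict(int)
    for t in transfers:
        volume[t["auth_method"]] += t["bytes"]
    total = sum(volume.values())
    return {method: vol / total for method, vol in volume.items()} if total else {}

# Made-up numbers illustrating >50% of volume via tokens, as seen during DC24.
sample = [
    {"auth_method": "token", "bytes": 6 * 10**12},
    {"auth_method": "x509", "bytes": 4 * 10**12},
]
print(share_by_auth_method(sample))  # {'token': 0.6, 'x509': 0.4}
```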
Pacing items for this year to watch out for:
- CERN IAM services to be migrated to a new infrastructure.
- Mature / release the FTS version that supports tokens.
- Working with WLCG to update a community timeline.
Software
- Release
  - XRootD 5.6.8 expected within the week
Kubernetes Accounting
- How flexible is the wording of the milestone “Deploy monitoring, alerting and APEL accounting for UTA k8s cluster using Prometheus”?
- Effort is beginning.
- Met with others working on the same things (AUDITOR, KAPEL). There are certainly differences:
  - Existing KAPEL only uses summarized data, not the per-job data that GRACC expects (see the sketch after this list).
  - But we can certainly build off their existing code.
- ATLAS k8s access for the Software Team
- Working on access to NET2
- Need to verify access to SWT2/UTA Google k8s
- Status on UTA creds?
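To illustrate the difference noted above between per-job and summarized accounting data, a minimal sketch with hypothetical field and site names (not the real GRACC or APEL schemas):

```python
# Hedged sketch: per-job records (the granularity GRACC expects) collapsed
# into per-site, per-month summaries (the granularity existing KAPEL reports).
# Field names and the site name are illustrative only.
from collections import defaultdict

def summarize(per_job_records):
    """Collapse per-job records into (site, YYYY-MM) summaries."""
    summary = defaultdict(lambda: {"njobs": 0, "wall_hours": 0.0, "cpu_hours": 0.0})
    for job in per_job_records:
        key = (job["site"], job["end_time"][:7])  # group by site and month
        s = summary[key]
        s["njobs"] += 1
        s["wall_hours"] += job["wall_hours"]
        s["cpu_hours"] += job["cpu_hours"]
    return dict(summary)

jobs = [
    {"site": "UTA_k8s", "end_time": "2024-03-01T10:00:00Z", "wall_hours": 2.0, "cpu_hours": 7.5},
    {"site": "UTA_k8s", "end_time": "2024-03-02T11:00:00Z", "wall_hours": 3.0, "cpu_hours": 11.0},
]
print(summarize(jobs))
# {('UTA_k8s', '2024-03'): {'njobs': 2, 'wall_hours': 5.0, 'cpu_hours': 18.5}}
```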
13:10 → 13:25
WBS 2.3.1: Tier1 Center - Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Infrastructure and Compute Farm (5m) - Speaker: Thomas Smith
13:15
Storage (5m) - Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
13:20
Tier1 Services (5m) - Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- GGUS:165414 - Staging failures at BNL-OSG2_MCTAPE
  - Due to the NET2 congestion
  - Can be solved with multi-hop, but that would occupy space at the source DATADISK.
- IPv6, ALMA9 and ARM tests: ongoing
- Blacklisted over the weekend due to missing CVMFS on some nodes
  - Looking to upgrade the CVMFS client
13:25 → 13:35
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Running affected by DC24 - see section 2.3.5 for details. In general, over the past 30 days the sites have been pretty stable.
- AGLT2 is fighting cvmfs problems where cvmfs hangs servers in a way that requires a reboot. They also had some minor squid/varnish issues.
- MWT2 is working on a network upgrade at IU. Some production loss was caused by two FTS incidents. A dCache parameter that had been tuned to allow a lot of movers proved to be set too high during DC24 on the older, dense storage servers.
- NET2 is working on bringing the remaining compute servers online.
- OU had various transfer issues. Some of the tickets received were marked as "won't fix".
- CPB needs to implement tokens. CPB got a ticket for data transfer issues.
- Held Tier 2 Technical Meeting last week
  - Lots of discussion about stuck/queued transfers (overran the end of the meeting).
    - CPB got blocked from receiving new work when the queue of pending transfers got too big.
    - Considerable follow-up in an email thread started by Ivan.
  - Saw that the CPB site is now running high-memory jobs in the Google cloud.
  - Sites are preparing for the update to EL9 by June, with some sites further along than others.
  - Still chasing issues with the current version of cvmfs.
  - The zombie pilot situation is improved by a zombie-pilot killer built into the pilot wrapper, but the underlying cause is still not understood.
13:35 → 13:40
WBS 2.3.3 HPC Operations (5m) - Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- TACC running started.
- NERSC throughput increased but is now held up by a poor transfer success rate between BNL and Glasgow.
- With the help of NERSC staff, now measuring HS23 values for various configurations using HEPscore from CVMFS:
  - Running with 256 threads (entire machine): HS23 result 1592.4074, or 6.2/logical core.
  - Running with 8 threads (still whole-node scheduling): HS23 result 145.7546, or 18.2/logical core.
  - Measuring other configurations to determine the optimal configuration and values.
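As a quick check, the per-logical-core numbers above are just the quoted HS23 totals divided by the thread counts:

```python
# Per-logical-core HS23 from the totals quoted in the notes above.
for threads, hs23 in [(256, 1592.4074), (8, 145.7546)]:
    print(f"{threads} threads: {hs23 / threads:.1f} HS23 per logical core")
# 256 threads: 6.2 HS23 per logical core
# 8 threads: 18.2 HS23 per logical core
```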
13:40 → 13:55
WBS 2.3.4 Analysis Facilities - Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
13:40
Analysis Facilities - BNL (5m) - Speaker: Ofer Rind (Brookhaven National Laboratory)
- IRIS-HEP AGC Demo Day #4 this Friday (link)
13:45
Analysis Facilities - SLAC (5m) - Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - Chicago (5m) - Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- Updating HTCondor queue configs. Previously we had a short and a long queue running on separate sets of nodes, which led to resource underutilization when only certain queues got picked. We are now updating the queues/deployments so that both are configured with an HPA. The deployments have affinity to node partitions but can be scheduled across partitions. Also updated the HPA metrics so that scaling happens in a more controlled fashion (a sketch of the kind of HPA object involved follows below).
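A minimal sketch of such an HPA, assuming a hypothetical HTCondor worker Deployment and CPU-utilization scaling (names, namespace and thresholds are illustrative, not the actual UChicago AF configuration):

```python
# Hedged sketch of an autoscaling/v2 HorizontalPodAutoscaler manifest for a
# hypothetical HTCondor worker Deployment. Dump to JSON and apply with
# `kubectl apply -f`. All names and thresholds here are illustrative.
import json

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "condor-workers-short", "namespace": "af-condor"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "condor-workers-short",  # hypothetical worker Deployment
        },
        "minReplicas": 1,
        "maxReplicas": 50,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            }
        ],
    },
}

print(json.dumps(hpa, indent=2))
```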
13:55 → 14:10
WBS 2.3.5 Continuous Operations - Convener: Ofer Rind (Brookhaven National Laboratory)
13:55
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m) - Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- ADC:
  - Missing CVMFS repos still showing up in HC exclusions at BNL and SWT2_CPB
  - Trying to create a “US News of the day” summary mail, but it is too much work for one person. Feel free to add your observations to it (Link)
- NET2
  - Overwhelmed with transfers / stage-outs
  - 10 Gbit link (will be upgraded to 100 Gbit in the summer)
  - This should not happen
  - On the DDM to-do list: find a way to take into account the queue length at the destination (ideally also at the source and per link) when proposing destination storage for staging (see the sketch after this list).
- SWT2
  - Blacklisted due to missing CVMFS on some nodes. The CVMFS check should solve that.
  - Slow deletions (Monit)
- OU_OSCER
  - Removed PQ.environ:"XRD_LOGLEVEL=Debug" from the CRIC settings. It was filling the Harvester discs over the weekend.
  - Slowest deletions in the US cloud (Monit)
- All transfers failing at WT2 (GGUS)
- iut2-slate squid not reporting after switch maintenance (GGUS)
- Still seeing notable XCache bypass levels at VP sites, including BNL after lowering storage watermarks (link)
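Not actual DDM/Rucio code, just a hedged sketch of what taking the destination queue length into account could look like when ranking candidate destination storages (field names, weights and numbers are illustrative):

```python
# Hedged sketch: rank candidate destination RSEs for staging by free space,
# penalized by the length of their pending transfer queue. Not Rucio/DDM code;
# field names, weights and example values are illustrative only.
def rank_destinations(candidates, queue_weight=1.0):
    """candidates: list of dicts with 'rse', 'free_tb' and 'queued_transfers' keys."""
    def score(c):
        # More free space is better; a long pending queue pushes the score down.
        return c["free_tb"] - queue_weight * c["queued_transfers"] / 1000.0
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"rse": "NET2_DATADISK", "free_tb": 900, "queued_transfers": 250_000},
    {"rse": "MWT2_DATADISK", "free_tb": 700, "queued_transfers": 20_000},
]
print([c["rse"] for c in rank_destinations(candidates)])
# ['MWT2_DATADISK', 'NET2_DATADISK'] -- the congested site drops below the less-busy one
```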
14:00
Service Development & Deployment (5m) - Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
14:05
Facility R&D (5m) - Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Joint ATLAS / IRIS-HEP Kubernetes Hackathon coming up April 24-26 at UChicago
  - Recent presentation here
  - Please sign up if you are interested in attending.
  - Training will be remote-friendly; the rest of the workshop will be in person.
  - https://indico.cern.ch/event/1384683/
  - MWT2, AGLT2 and UVic (Canadian cloud) have already provided hardware and logins for the stretched platform.
  - Someone should come up with a clever name for it :)
14:10 → 14:20
AOB (10m)