US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00 → 13:05
WBS 2.3 Facility Management News 5m. Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
We are working on the updated milestones, using input from our WBS 2.3 copy and the scrubbing presentations, with the goal of finishing by the end of this week.
Scrubbing outcomes are still being discussed and final descopes and changes should be known in a few weeks.
Even though we have procurement plans for FY25, all Tier-2s should hold off on any purchases until we can discuss them with WBS 2.3 and get approval from Paolo and Verena. We first need to understand the impact of the scrubbing and of the plan to not have equipment funds for the Tier-2s in FY26 and FY27.
-
13:05 → 13:10
OSG-LHC 5m. Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m. Speaker: Jason Smith
-
13:15
Compute Farm 5m. Speaker: Thomas Smith
-
13:20
Storage 5m. Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Very good running at all Tier 2 sites for the last two weeks.
- There were two production reductions:
- The MWT2 Illinois site was down on July 16 for preventive maintenance.
- On July 22, most sites suffered a high failure rate while bad derivation requests were running.
- The failures were caused by memory leaks; in some cases servers ran out of memory and rebooted.
- Rod paused the requests and pushed on a PanDA ticket about developing an automated procedure to catch failing tasks.
- EL9/FY24 Equipment.
- AGLT2 finished installing their new equipment and updating to RHEL9 at the MSU site.
- This completed AGLT2 milestones #307 (EL9) and #313 (FY24 equipment) and also allowed marking milestone #311 (all sites deploy their FY24 compute) as complete.
- SWT2 CPB has set up its new storage running AlmaLinux 9. The CPB team is now draining sets of older storage servers and updating them from CentOS 7 to AlmaLinux 9; once a group of servers is drained it is updated, and the process will be repeated until all older servers are done, except for the very oldest storage servers, which are out of warranty and will be retired.
- All other sites have completed their EL9 updates and deployed their FY24 purchases.
- Scrubbing results:
- We did OK and the Tier 2 sites received full funding for FY25.
- The paperwork for the last increment of FY25 funding is in process right now.
- The outlook for FY26 is better than people had feared.
- It looks like the sites will get full funding for personnel plus $50k for equipment, supplies, and travel.
- Any spending beyond this (e.g., large equipment purchases) will be approved on a case-by-case basis.
- The current outlook for FY27 is that the situation will be the same: personnel + $50k.
- Due to delays in the INI spending and the need to spend the grant down to zero by January 31, 2027, when the current cooperative agreement (CA) ends, it still looks reasonable to expect substantial end-of-grant infrastructure funding. The current estimate is that this funding will be about $3 million, split among the four Tier 2 federations.
- We need to revisit the procurement plans to see if modifications are required.
- The likely answer is yes.
- Shawn is checking on when the FY25 equipment must be spent.
- It might be sensible to hold off purchasing for now and/or to buy longer warranties.
- In any case, there is a hard, unchangeable deadline: equipment purchases must be made at least 90 days before the end of the current CA.
- This date is approximately November 1, 2026.
- Shawn and Alexei have entered a proposed list of FY26 milestones, but these need to be discussed in detail.
- These milestones are one of the main outputs of the scrubbing.
-
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
HPC Operations 5m. Speaker: Rui Wang (Argonne National Laboratory (US))
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m. Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
-
13:55
Analysis Facilities - SLAC 5m. Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m. Speaker: Fengping Hu (University of Chicago (US))
- Coffea-casa configuration updates
- Updated the link on the portal to point to the instance that runs coffea-casa as AF users.
- Updated the auth configuration to use the production Keycloak instance with an IdP hint (globus) to bypass the Keycloak login page's IdP selection screen (see the first sketch after this list).
- Updated user mapping to use the POSIX claim rather than a callout to the Connect API server.
- ServiceX
- Transformer OOMs occur only on the ADS nodes; memory peaked at 2.4 GB there, while on other nodes it stays below 500 MB.
- Increased the memory limit as a workaround (see the second sketch after this list); troubleshooting is ongoing.
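A minimal sketch of the IdP-hint change mentioned above: Keycloak supports a kc_idp_hint query parameter on its authorization endpoint which pre-selects an identity provider and skips the IdP chooser on the login page. The hostname, realm, and client ID below are illustrative placeholders, not the actual AF configuration.

    # Illustration only: host, realm, and client ID are placeholders, not the AF's real values.
    from urllib.parse import urlencode

    KEYCLOAK_BASE = "https://keycloak.example.org"   # placeholder production Keycloak host
    REALM = "af"                                     # placeholder realm
    CLIENT_ID = "coffea-casa"                        # placeholder OIDC client

    def authorize_url(redirect_uri: str, state: str) -> str:
        """Build an OIDC authorization URL that hints Keycloak to use Globus.

        kc_idp_hint is a documented Keycloak parameter: when present, Keycloak
        forwards the user directly to the named identity provider instead of
        showing the IdP selection screen.
        """
        params = {
            "client_id": CLIENT_ID,
            "response_type": "code",
            "scope": "openid profile email",
            "redirect_uri": redirect_uri,
            "state": state,
            "kc_idp_hint": "globus",
        }
        return f"{KEYCLOAK_BASE}/realms/{REALM}/protocol/openid-connect/auth?" + urlencode(params)

    print(authorize_url("https://coffea.example.org/hub/oauth_callback", "example-state"))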
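For the memory-limit workaround, the generic Kubernetes pattern is to raise the container memory limit on the affected workload. The sketch below only illustrates that pattern with the official Kubernetes Python client; the deployment, namespace, container name, and 3Gi value are placeholders, and the real ServiceX transformer limits are managed through its own deployment configuration.

    # Illustration only: names and the limit value are placeholders, not the AF's real settings.
    from kubernetes import client, config

    def raise_memory_limit(deployment: str, namespace: str, container: str, limit: str) -> None:
        """Patch a Deployment so the named container gets a higher memory limit."""
        config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
        patch = {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": container, "resources": {"limits": {"memory": limit}}}
                        ]
                    }
                }
            }
        }
        client.AppsV1Api().patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

    raise_memory_limit("servicex-transformer", "servicex", "transformer", "3Gi")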
-
14:10 → 14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- Ongoing migration to new Frontier service for Varnish (MWT2 yesterday, AGLT2 today, others to come)
- Greedy deletion deactivated on site storage after completion of SCRATCHDISK to DATADISK migration
- OOM node crashes at OU due to job memory overruns sparked discussion of how best to prevent such incidents
- Discussion of plans for ESNET xcache service and how best to integrate it into shifter monitoring
- LHC: Flawless p-p operation. We will probably move 2 PB from CERN datadisk to tzdisk for contingency.
- Low level of ADC Ops support this week due to several key people being on leave (Andreu, Rod, Timo, Fabio, Alex (HC), Dario (ES)). For the list of ADC people on leave, refer to this gdoc.
- Varnish deployment is proceeding. We also have a new, upgraded (latest Tomcat) k8s-based Frontier instance that is being tested.
- An evgen campaign overloaded the Rucio and IAM DBs with many small input files; it has been throttled.
- FTS:
- All transfers to US storage endpoints use the BNL FTS with tokens.
- Started thinking about storing/integrating the full FTS configuration in CRIC.
- CRIC: We are losing our main developer at the beginning of September. WLCG is going to take over the CRIC effort needed by ATLAS.
- Tokens: Added ADC recipes on how to implement tokens for CEs/SEs
-
14:15
Services DevOps 5m. Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- added Oxford back to testing
- new images with supervisord for ESnet node
- VP
- NTR
- Varnish
- added UAM local varnish
- moved the MWT2 and AGLT2 Varnish caches to use the new Frontier (CERN OpenStack k8s based); see the sketch after this list
- NET, ucsc, SWT2, BNL are already set to use it
- Removed SWT2_CPB from varnish testing
- Still using squids: BNL, SWT2_CPB, FZK_LCG2, Beijing
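As background to the Varnish-to-new-Frontier moves above, the frontier_client used by ATLAS jobs selects its server and caching proxies from the FRONTIER_SERVER setting, which lists the central Frontier launchpad and the site-local caches to route requests through. The sketch below only illustrates the shape of that setting; the URLs are placeholders rather than the real ATLAS Frontier or site Varnish endpoints, and production values are distributed to sites via CRIC.

    # Placeholders only: these are not the real ATLAS Frontier or site cache URLs.
    import os

    FRONTIER_SERVER = (
        "(serverurl=http://frontier.example.org:8000/atlr)"   # central Frontier launchpad
        "(proxyurl=http://varnish.site.example.org:6081)"     # site-local Varnish cache
        "(proxyurl=http://squid.site.example.org:3128)"       # fallback squid proxy
    )

    # frontier_client reads this environment variable and routes its requests to the
    # server through the listed proxies, in order.
    os.environ["FRONTIER_SERVER"] = FRONTIER_SERVER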
-
14:20
Facility R&D 5m. Speaker: Lincoln Bryant (University of Chicago (US))
- Lincoln back from vacation this week, NTR
- Aidan has done some very nice work demonstrating Kueue and Multi-Kueue between RP1 and UC AF.
- See Facility R&D notes
- He will start looking into sending HTCondor pods between clusters.
- David solved HTCondor auth issues on RP1
- Single shared HTCondor queue, maintained between Jupyter sessions.
- Now to solve the more challenging problem of syncing users into all HTCondor pods, so shared filesystems can be used.
- Looking into Keycloak --> LDAP sync tools developed by IceCube folks for a longer-term solution
- Meanwhile, Lincoln will put together a shim to allow the existing user sync scripts to work, a la UC AF
- NTR on EOS work, but it is still on our plate
-
14:25 → 14:35
AOB 10m