US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Scrubbing prep continues.  WBS 2.3 scrubbing is Tuesday, July 15th (L2 and L3s).

      Quarterly reports for L3 areas in WBS 2.3 will be due Friday July 18th.  Please plan ahead to have them completed.

      Next week is the ATLAS S&C week, focused on R2R4: https://indico.cern.ch/event/1509063/ 

      We need to plan for the next capacity tests as well as capability tests. See https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link

      Alexei: possible collaboration with Weka; details to be discussed. Also noted that a new Google contact for ATLAS has been assigned.

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (probably next week)

      • cvmfs-2.13.1: fixes a bug, present since cvmfs-2.12.0, that prevents periodic resets to the closest stratum 1, causing performance degradation. Anyone who has upgraded to 2.12.0 or later is encouraged to update as soon as possible (one way to check which stratum 1 a client is using is sketched after this list).
      • The previously mentioned Frontier Squid 6.13 has been delayed and moved into osg-upcoming because it requires manual changes.
      • XRootD 5.8.3-1.3: same as 5.8.3-1.2 except with https://github.com/xrootd/xrootd/pull/2472
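
      For reference, one minimal way to check which stratum 1 a client is currently using, and to force a re-probe, is sketched below. It assumes the cvmfs client tools (cvmfs_talk, typically run as root) are installed and uses atlas.cern.ch only as an example repository name.

      """Check which stratum 1 a CVMFS client is using and force a re-probe (sketch)."""
      import subprocess

      REPO = "atlas.cern.ch"  # example repository; adjust for the repos mounted on the node

      def talk(*args):
          """Run a cvmfs_talk command for REPO and return its output."""
          cmd = ["cvmfs_talk", "-i", REPO, *args]
          return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

      print(talk("host", "info"))    # list the configured stratum-1 hosts and the active one
      print(talk("host", "probe"))   # re-probe so the client reorders hosts by round-trip time
      print(talk("host", "info"))    # confirm the closest stratum 1 is active again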

       

    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))

      WBS 2.3.1.2 Tier-1 Infrastructure - Jason

      • A GGUS ticket on Varnish monitoring is ongoing.

      WBS 2.3.1.3 Tier-1 Compute - Tom

      • Two black-hole nodes (acas0927 and acas0950), both with hardware issues, were taken offline by Oszkar.
      • VP queue
        • No longer runs merge jobs (thanks to the WFMS team)
        • Due to a misconfiguration it was running only on Intel WNs; now fixed. 

      WBS 2.3.1.4 Tier-1 Storage - Ivan (Carlos is on leave)

      • Internal Tape discussion:
        • Increased the limit on queued files for staging from BNL tape in the Data Carousel settings to 200k
        • Tim is compiling a list of files grouped by tape
        • Increasing the staging request timeout was discussed, but it is not possible from Rucio (the timeout is hardcoded for all sites) 

      WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan

      • NTR
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonably good running over the past two weeks, with some problems.
        • AGLT2 had two partial drains on June 25-26 and June 29-30.
        • MWT2 was affected by a mistake made by IU Networking on June 17, causing a significant drain on June 17-18.
        • MWT2 took a downtime on June 24-25 to upgrade to dCache version 10.2.13.
        • NET2 had its tape cache fill up; transfers to NET2 tape are currently off while data already in the cache is written to tape and then removed from the cache.
        • OU was completely drained June 18-20. More recently a black-hole node caused some trouble.
        • TW-FTT continued to have job failures and periods of low CPU efficiency.
      • Work continues at MSU and UTA on updating to EL9 and putting FY24 purchases online.
        • MSU has made significant progress over the last two weeks and has their install process working.
        • UTA continues setting up a test storage cluster using EL9 and their new storage servers.
      • The quarterly report is due July 18 so that management can review the L3 reports while writing their own reports.
      • Rafael and I are still discussing various possible funding scenarios and how we would respond.
        • It takes careful thinking to understand how to minimize the effect of sharp budget decreases.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: running smoothly; both CPU and GPU are above expectations.

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • ~6K dedicated cores added to Tier-3 Alma 9 pool (retired from Tier-1)

        • SL7 submit hosts will be shut down next Monday
        • Progress on dCache data display issue within a Pod
          • By configuring the OpenShift worker node to retrieve LDAP information like the other nodes, the pod is able to correctly display string-based user and group names for dCache data. This solution is simpler because it does not require a sidecar container (a minimal check of the name lookup is sketched after this list).

          • We have manually applied and verified this approach on a test node. However, there are still some issues related to the configuration file setup. 
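
          For reference, a minimal check of the name lookup this fix enables is sketched below. Run inside the pod, it assumes only a placeholder path on the dCache mount; the standard pwd/grp modules resolve names through the same NSS/LDAP configuration the worker node now provides.

          """Check whether numeric owners of dCache-mounted files resolve to names (sketch)."""
          import grp
          import os
          import pwd

          PATH = "/path/to/dcache/mount"  # placeholder; use the actual mount point in the pod

          st = os.stat(PATH)
          try:
              user = pwd.getpwuid(st.st_uid).pw_name   # resolved via NSS (LDAP if configured)
          except KeyError:
              user = str(st.st_uid)                    # falls back to the raw UID
          try:
              group = grp.getgrgid(st.st_gid).gr_name
          except KeyError:
              group = str(st.st_gid)
          print(f"{PATH}: owner={user} group={group}")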

        • Had a meeting to discuss AF status, current issues, and the following work:

          • Started looking at KubeSpawner to understand how to use UID/GID to start containers as a non-root user (see the configuration sketch after this list).

          • Work on the HTTPS proxy for external access to JupyterHub running within a pod
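
          As a rough illustration of the KubeSpawner item above, a possible jupyterhub_config.py fragment is sketched below. It is not a tested configuration: the account-lookup helper and the UID/GID values are placeholders, and the exact option names should be checked against the KubeSpawner documentation for the deployed version.

          # jupyterhub_config.py fragment (sketch only)
          c = get_config()  # noqa -- provided by JupyterHub's config loader

          c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

          def lookup_account(username):
              """Placeholder: look the user up in LDAP or a site database and
              return their numeric UID/GID; the values here are made up."""
              return {"uid": 12345, "gid": 6789}

          # KubeSpawner accepts callables (taking the spawner) for these options,
          # so each single-user container starts with the user's own UID/GID
          # instead of running as root.
          c.KubeSpawner.uid = lambda spawner: lookup_account(spawner.user.name)["uid"]
          c.KubeSpawner.gid = lambda spawner: lookup_account(spawner.user.name)["gid"]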

           

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Dask-Gateway / HTCondor Integration

        • Deployment is now functional.

        • Implemented a backend that runs the scheduler pod in Kubernetes while spawning workers as HTCondor jobs (a standalone sketch of the worker-submission side is shown after this list).

        • Currently, the code has some coupling with our user management (CIConnect); efforts are underway to decouple this before making the repository public so that other interested sites can adopt it as well.
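
        To illustrate the worker-side idea only (this is not the backend code itself, which is not yet public), the sketch below submits Dask workers as HTCondor jobs that dial back to a scheduler address, e.g. one exposed from a Kubernetes pod. The scheduler address, resource requests, and worker command are placeholders, and it assumes the htcondor Python bindings plus a Dask installation on the worker nodes.

        """Spawn Dask workers as HTCondor jobs pointing at an external scheduler (sketch)."""
        import htcondor

        SCHEDULER_ADDRESS = "tcp://dask-scheduler.example.org:8786"  # placeholder address

        submit = htcondor.Submit({
            "executable": "/usr/bin/env",
            "arguments": f"dask-worker {SCHEDULER_ADDRESS} --nthreads 4 --memory-limit 8GB",
            "request_cpus": "4",
            "request_memory": "8GB",
            "output": "dask-worker.$(Cluster).$(Process).out",
            "error": "dask-worker.$(Cluster).$(Process).err",
            "log": "dask-worker.$(Cluster).log",
        })

        schedd = htcondor.Schedd()               # local schedd
        result = schedd.submit(submit, count=2)  # start two workers
        print("Submitted worker cluster", result.cluster())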

    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        ADC Operations

        • First sites are moving to 16-core job slots under CMS pressure (IN2P3-CC). The impact on ADC workflows is under evaluation.
        • Transferring space from SCRATCHDISK to DATADISK is ongoing. Greedy deletions were enabled for BNL, SWT2 and MWT2 today (see the sketch after this list for how this is typically set per RSE).
        • Cleaned up Rucio VOMS roles today (Details)
        • Varnish deployment is ongoing.
        • Work still ongoing to add VP queue at NET2 pointing to the ESNET xcache
        • Transfer failures, seemingly related to certificate issues, are still ongoing at TW-FTT (GGUS)
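
        For reference, greedy deletion is normally switched on per RSE through an RSE attribute; a sketch using the Rucio client is below. The attribute name (greedyDeletion) and the RSE name are assumptions/placeholders, and admin credentials are required.

        """Enable greedy deletion on an RSE via the Rucio client (sketch)."""
        from rucio.client import Client

        client = Client()
        rse = "BNL-OSG2_SCRATCHDISK"  # placeholder RSE name

        # With this attribute set, the reaper deletes unlocked replicas on the RSE
        # without waiting for the usual free-space thresholds to be reached.
        client.add_rse_attribute(rse=rse, key="greedyDeletion", value=True)
        print(client.list_rse_attributes(rse))
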
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • EOS deployment moved to bare metal and Puppetized; working on understanding how to properly authorize users
        • HTCondor pool setup on rp1 is ongoing; working through auth issues between the schedd, collector, and startds 
        • Armada integration is ongoing; the executor is deployed on the UChicago AF so that work can ostensibly be sent from RP1, but the scheduler seems to be non-functional for reasons TBD 
    • 14:25 14:35
      AOB 10m