US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      • XRootD 5.8.4 (initial build passed testing, we're working on syncing up our RPM spec)
      • HTCondor 24.10 incoming
      • OSG 25 targeted for September. OSG 23 will EOL upon OSG 25's release
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Moved 285 TB from SCRATCHDISK to DATADISK at ADC's request.
        • Shared pool:
          • Job submission was not succeeding; rebooting the CE helped.
          • One CE is down for an upgrade.
        • VP PanDA Queue was not working due to a stuck ESnet xCache. The xCache was restarted and VP is back in operation.
        • FTS hosts ran out of disk space. Solved. (GGUS: 3843)
        • Deletion performance during greedy deletions on SCRATCHDISK: 485 TB deleted within 4.5 hours at a mean rate of 140k deletions per hour
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Pretty good running during the last couple of weeks.
        • AGLT2 had two outages because the storage filled up; there was also dark data. Philippe resolved the second outage by restarting dCache, which made most of the dark data vanish. The root cause is still under investigation.
        • MWT2 IU had two racks turned off for power work early this morning. The work was successful and the down servers were returned to service.
          • MWT2 Illinois will be offline tomorrow for their quarterly preventive maintenance.
        • OU drained and refilled on Monday.
        • TW-FTT has struggled recently to keep its usual number of slots (~600) in service.
      • I am working on the manager's quarterly report.
        • Thanks to all 4 T2s: all of you provided your reporting before the scrubbing.
      • Things are looking up on the EL9/FY24 equipment front:
        • At MSU all FY24 gear has been put online running EL9.
          • We closed the AGLT2 FY24 equipment milestone.
          • Looks like Philippe will finish moving the MSU servers to RHEL9 within a couple of weeks; the process is partially complete.
        • At UTA all of the new FY24 servers are in racks and running Alma Linux 9.
          • The new gear is entirely storage servers and is being tested.
          • UTA is very close to completing the EL9 milestone; all that remains is updating the older storage to EL9.
            • The FY24 equipment milestone will also be complete once the new storage is fully online.
      • It was stated at the scrubbing that we would know soon afterward how much FY25 money each site will get.
    • 13:40 13:50
      WBS 2.3.3 Heterogenous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: CPU & GPU usage is well above expectations.

        • Some job failures are seen: stage-out timeouts when copying output to BNL, to be investigated.
        • CFS disk space freed from some GPU users

         

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

        Work continues with Rod Walker on the overlay batch system for NERSC. The next step is to test scaling up to tens of nodes, then to 50 nodes, then to 100 nodes.

        Received information from NERSC on how to release nodes from a SLURM job. This will require additional coding in the wrapper and a change in the way the SLURM srun command is used (i.e., multiple srun commands, one per node, which is not really scalable above 100 nodes), plus a monitoring loop with a 1-5 minute sleep; a sketch of this pattern follows below.
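        For reference, a minimal sketch of the pattern described above (one srun per node plus a monitoring loop), assuming a Python-based wrapper; the payload command and the 2-minute polling interval are placeholders, and the actual node-release step from NERSC is only indicated by a comment:

        ```python
        import subprocess
        import time

        # Hypothetical sketch, not the actual wrapper code: expand the job's node
        # list, start one srun per node, and poll the payloads in a monitoring loop.
        nodelist = subprocess.run(
            ["scontrol", "show", "hostnames"],   # expands $SLURM_JOB_NODELIST
            capture_output=True, text=True, check=True,
        ).stdout.split()

        payload = ["./run_payload.sh"]           # placeholder payload command

        procs = {}
        for node in nodelist:
            # One srun per node; as noted above, this does not really scale past ~100 nodes.
            cmd = ["srun", "--nodes=1", "--ntasks=1", f"--nodelist={node}"] + payload
            procs[node] = subprocess.Popen(cmd)

        # Monitoring loop with a 1-5 minute sleep, as described above.
        while procs:
            time.sleep(120)
            for node, proc in list(procs.items()):
                if proc.poll() is not None:
                    print(f"payload on {node} exited with code {proc.returncode}")
                    del procs[node]
                    # Here the wrapper would invoke the NERSC-provided procedure
                    # to release the now-idle node from the allocation.
        ```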

    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Generated the LOCALGROUPDISK data access list with last access times to support Mayuko in data management:
          /pnfs/usatlas.bnl.gov/LOCALGROUPDISK/localgroupdisk_dump_06252025 (3,642,326 files)

        • Worked on the design and development of a custom KubeSpawner class for JupyterHub to launch containers with a specific UID/GID, enabling non-root user sessions with mounts to dCache and GPFS (see the sketch after this list).

          • Conducted testing with YAML configurations to start non-root containers.

        • Tested and verified the resolution of the dCache data display issue (ownership showing as 99:99) within a container environment; the fix has the OpenShift nodes connect to LDAP to load user information.

          • Successfully applied the configuration on the test cluster to enable proper LDAP integration.

          • Rolling out these changes to production will require careful planning to avoid disruption to online services.
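        A minimal sketch of the custom spawner idea described above, assuming KubeSpawner from the kubespawner package; the UID/GID lookup, the example account, and the GPFS mount path are placeholders (only the /pnfs/usatlas.bnl.gov path appears in these notes):

        ```python
        from kubespawner import KubeSpawner

        # Placeholder UID/GID lookup; a real deployment would query LDAP or a site map.
        def lookup_uid_gid(username: str) -> tuple[int, int]:
            demo_map = {"alice": (12345, 31000)}           # hypothetical values
            return demo_map.get(username, (65534, 65534))  # fall back to nobody/nogroup

        class NonRootUserSpawner(KubeSpawner):
            """Launch each user's container with their own UID/GID (non-root)
            and mount dCache and GPFS into the session."""

            async def start(self):
                uid, gid = lookup_uid_gid(self.user.name)
                self.uid = uid
                self.gid = gid
                self.volumes = [
                    {"name": "dcache", "hostPath": {"path": "/pnfs/usatlas.bnl.gov"}},
                    {"name": "gpfs", "hostPath": {"path": "/gpfs"}},
                ]
                self.volume_mounts = [
                    {"name": "dcache", "mountPath": "/pnfs/usatlas.bnl.gov", "readOnly": True},
                    {"name": "gpfs", "mountPath": "/gpfs"},
                ]
                return await super().start()

        # Assumed wiring in jupyterhub_config.py:
        # c.JupyterHub.spawner_class = NonRootUserSpawner
        ```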

             

             

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Dask-Gateway/HTCondor
          • Implemented a configurable user-mapping function to map Dask-Gateway users to HTCondor users, making the controller more general and reusable across different sites (see the sketch after this list).
          • When used with JupyterHub auth, users can also configure username_claim to map accounts.
        • ServiceX updated to 1.7.1 
          • Ran into a transformer OOM issue: new upstream ATLAS software (release 25) requires 2 GB+ of memory during the compilation step.
          • The OOMs are intermittent, depending on the node's memory pressure.
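          A minimal sketch of the user-mapping idea above; the function name and the mapping policy are assumptions, not the actual UChicago controller configuration:

          ```python
          import re

          # Hypothetical mapping hook: turn a Dask-Gateway (e.g. JupyterHub/OIDC)
          # username into a local HTCondor account name.
          def map_gateway_user_to_condor(gateway_username: str) -> str:
              local = gateway_username.split("@", 1)[0]            # drop any email domain
              return re.sub(r"[^a-zA-Z0-9_.-]", "_", local)        # sanitize remaining characters

          # The controller would be pointed at such a function through its configuration,
          # so the same code can be reused at sites with different account schemes.
          print(map_gateway_user_to_condor("alice@uchicago.edu"))  # -> "alice"
          ```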
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Thanks to those who have submitted their quarterly report updates; need the rest by TOMORROW, please.
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Heavy-ion running was very intense for the computing infrastructure (but successful). Constant running at that rate is not sustainable: the design data write throughput is 7 GB/s, and we were writing at up to 15 GB/s.
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCache
          • ESnet instance used by BNL and NET2. Still not stressed at ~20k accesses/h.
          • A few more issues with xCaches in the UK cloud.
        • ServiceX/Y
          • Both upgraded to 1.7.1.
          • Release 25 seems to have higher memory requirements: ServiceX requests 1 core and 2 GB and has issues on nodes where memory is tight; ServiceY requests 1 core and 8 GB and shows no problems.
        • Varnish
          • SWT2 has a test setup up and running; the production setup should be in place today.
          • Technion has an instance that serves all three IL sites.
          • NET2 and MSU no longer use the official Frontier as a backend; they use a k8s Frontier instance.
          • Still missing: BNL, FZK, Beijing, TW.
        • AI
          • Testing a new ES MCP for Varnish.
          • Moving the frontend to TypeScript.
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Still working on debugging the fully Kubernetes-based HTCondor pool on RP1. Having particular trouble integrating with K8s networking to submit remotely to the schedd (see the sketch after this list).
        • Taking another look at Kueue, with MultiKueue for multi-cluster support. Armada scheduling between two clusters remains challenging. Will discuss the two side by side at the next meeting on July 24.
        • Some folks from G-Research have offered to speak about cloud batch scheduling and Armada at the Facility R&D meeting on Aug 7.
        • Working on QR
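        For reference, a minimal sketch of remote submission to the schedd through the HTCondor Python bindings, which is the path that currently runs into the K8s networking trouble; the pool address and schedd name are placeholders, not the actual RP1 endpoints:

        ```python
        import htcondor

        # Contact the collector exposed by the Kubernetes service from outside the cluster.
        collector = htcondor.Collector("condor-collector.example.org:9618")      # placeholder address
        schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, "schedd@pool")  # placeholder name
        schedd = htcondor.Schedd(schedd_ad)

        # Submit a trivial test job; this step fails if the schedd address advertised
        # by the collector is not reachable from outside the cluster's network.
        job = htcondor.Submit({
            "executable": "/bin/sleep",
            "arguments": "60",
            "output": "test.out",
            "error": "test.err",
            "log": "test.log",
        })
        result = schedd.submit(job)
        print(result.cluster())
        ```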
    • 14:25 14:35
      AOB 10m