US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2025-06-18T12:00:00-04:00
End: 2025-06-18T13:35:00-04:00
Location: No location set

Wednesday 18 Jun 2025, 12:00 → 13:35 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 993 2967 7148

Meeting password: 452400

Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

- 12:00 → 12:05
  
  WBS 2.3 Facility Management News 5m
  
  Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
  
  Main focus is on pre-scrubbing for Friday. We still need completed presentations from 2.3.1, 2.3.4 and 2.3.5 ASAP.
  
  ATLAS OTP is again due. (ATLAS ADC DDM OTP request just sent...look for others soon)
- 12:05 → 12:10
  OSG-LHC 5m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release (impending)
  
  Ilija: status of XRootD 5.8.3-1.2 testing?
  
  Frontier squid 6.13-1.3: major update that requires manual intervention for installations with multiple workers (https://opensciencegrid.atlassian.net/issues/SOFTWARE-6171)
  
  Miscellaneous
  
  Kuantifier
  
  OKD unpriv Prometheus notes passed along to NET2
  
  Kuantifier and docs ready
  
  EL10
  
  What're the US ATLAS plans for upgrading?
  
  Does anyone use the HTCondor keyboard daemon?
  
  Who uses nftables?
- 12:10 → 12:30
  WBS 2.3.1: Tier1 Center
  
  Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
  
  Upcoming BNL / HTCondor / USATLAS meeting: 25 JUN @ 15:00 eastern / 14:00 central time:
  
  This is the USATLAS version of the meeting. Please forward along to USATLAS folks as you see fit
  
  Zoom link: https://bnl.zoomgov.com/j/1614360980?pwd=AQm6x3reOaNtEze9H7GjadACjvWaBE.1
  
  Google doc link: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0
  - 12:10
    
    Tier-1 Infrastructure 5m
    
    Speaker: Jason Smith
    
    NTR
  - 12:15
    Compute Farm 5m
    
    Speaker: Thomas Smith
    
    Operations this week have been very smooth, no problems, no interruptions
    
    Work has been done this week on physical retirement of old ATLAS T1 hardware
    
    ATLAS T1 has now fully vacated our old datacenter
  - 12:20
    Storage 5m
    
    Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
    
    Smooth operation of disk and tape systems
    
    Per agreement with ADC and HPSS storage experts, the FIFO batch queue management threshold has been reduced from 3 to 2 days. Staging mode now switches to FIFO when requests older than 2 days are detected.
    
    dCache Doors and Pools updated to version 9.2.35.
    
    Kafka-based monitoring data collectors enabled
    
    Scheduled dCache maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
    
    Database hardware and release update to PostgreSQL 16
    
    Minor dCache releases will be normalized to 9.2.35 on a few nodes
    
    Scheduled HPSS gateway maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
    
    ATLASDADISK temp space rolled back to nominal pledge values
    
    Extra space added (~190TB) to minimize DDM blacklisting
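
The FIFO threshold change described in the storage notes can be illustrated with a minimal sketch. It assumes only what the notes state: staging switches to FIFO once any pending request is older than 2 days. The function name, the `submit_times` input and the label `"high_to_low"` for the default mode are hypothetical, and this is not the actual HPSS batch queue implementation.

```python
from datetime import datetime, timedelta

# Threshold from the notes: reduced from 3 days to 2 days.
FIFO_THRESHOLD = timedelta(days=2)


def choose_staging_mode(submit_times, now, threshold=FIFO_THRESHOLD):
    """Pick a staging mode from the age of the oldest pending request.

    submit_times: hypothetical list of datetimes at which pending
    staging requests were submitted.
    Returns "fifo" if any request has waited longer than `threshold`,
    otherwise "high_to_low" (a placeholder label for the default mode,
    whose ordering rule is not described in the notes).
    """
    oldest_wait = max((now - t for t in submit_times), default=timedelta(0))
    return "fifo" if oldest_wait > threshold else "high_to_low"


if __name__ == "__main__":
    now = datetime(2025, 6, 18, 12, 0)
    pending = [now - timedelta(hours=6), now - timedelta(days=2, hours=3)]
    print(choose_staging_mode(pending, now))
```

With the previous 3-day threshold, the same request list would have stayed in the default mode; passing `threshold=timedelta(days=3)` reproduces that behavior.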
  - 12:25
    
    Tier1 Operations and Monitoring 5m
    
    Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    
    NTR
- 12:30 → 12:40
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  It has been a tough period recently as several sites have had problems that affected production.
  
  All sites were affected by a Rucio issue on Jun 6.
  
  MWT2 was put offline by an unintended side effect of an unrelated change made by IU network engineers.
  
  MWT2 had its DATADISK fill because there were many small files and the deletion process could not keep up.
  
  NET2 was down for a couple of days for the MGHPCC annual maintenance.
  
  NET2 also had some production loss in the last week because of network problems.
  
  OU was affected by the Great Plains Network replacing switches and forgetting to turn on jumbo frames.
  
  In the last day OU had their storage filled to capacity by transfers that apparently did not honor the size limit of their endpoint.
  
  CPB had trouble with their gatekeepers.
  
  There was progress on EL9 update/FY24 installation
  
  MSU updated their Satellite/Capsule software but still have trouble
  
  CPB is still working on their storage.
  
  Rafael will give the Tier 2 scrubbing (WBS 2.3.2) and the slides are complete.
  
  A tip of the hat to Rafael for jumping on this.
- 12:40 → 12:50
  WBS 2.3.3 Heterogeneous Integration and Operations
  
  HIOPS
  
  Convener: Rui Wang (Argonne National Laboratory (US))
  - 12:40
    HPC Operations 5m
    
    Speaker: Rui Wang (Argonne National Laboratory (US))
    
    Slides preparation for pre-scrubbing
    
    Perlmutter: Good CPU&GPU usage
    
    NERSC-10 allocation proposal: collaborate with HEP-CCE
    
    Charles is collecting the workflow list. Related to HPC ops: MC simulation, GNN4ITK (GPU recon), FastChain (Sim+recon).
  - 12:45
    
    Integration of Complex Workflows on Heterogeneous Resources 5m
    
    Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
- 12:50 → 13:10
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 12:50
    Analysis Facilities - BNL 5m
    
    Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
    
    Investigating a fix for the nobody:nobody ownership issue for dCache data within a non-root pod
    
    Built a new image with nfs-utils installed so that idmapd can run, and uploaded it to the SDCC Quay service.
    
    Looking into a sidecar container that runs idmapd with root privileges while the main application container runs with fewer privileges, to satisfy OpenShift's restricted SCC rules.
  - 12:55
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 13:00
    Analysis Facilities - Chicago 5m
    
    Speaker: Fengping Hu (University of Chicago (US))
    
    Ceph File System Capacity
    
    Ceph file systems have exceeded 80% usage of the 1PB capacity, beginning to impact performance and user quota allocation.
    
    Dask-Gateway with HTCondor Integration
    
    Ongoing development to enable Dask-Gateway scheduling on HTCondor.
    
    Exploring two backend integration approaches:
    
    Extending the existing Kubernetes backend to support Condor workers.
    
    Adapting the JobQueue backend model to interface with HTCondor.
- 13:10 → 13:25
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  - 13:10
    ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
    
    Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    
    DDM is reallocating space at some sites from SCRATCHDISK to DATADISK
    
    BNL-OSG2_SCRATCHDISK: 935TB -> 650TB
    
    SWT2_CPB_SCRATCHDISK: 500TB -> 200TB
    
    MWT2_UC_SCRATCHDISK: 550TB -> 350TB
    
    Issues with DATADISK space at several US sites for different reasons (OU drained because of this)
    
    BNL tape storage admins met with ADC to discuss staging issues during the recent data carousel campaign
    
    Decided to change retrieval algorithm so that it switches strategy from "High to Low" to FIFO after 2 days instead of 3
    
    Rod is working with Doug on OverlayBS deployment, and is also adding GPU scheduling features to WFMS (see slides)
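
As a quick check on the SCRATCHDISK figures listed in this item, the sketch below totals the space released at each endpoint. The numbers come from the notes. That the full reduction is reassigned to DATADISK at the same site is an assumption the notes do not confirm, and the variable names are illustrative.

```python
# (before_tb, after_tb) per endpoint, as listed in the notes.
reallocations = {
    "BNL-OSG2_SCRATCHDISK": (935, 650),
    "SWT2_CPB_SCRATCHDISK": (500, 200),
    "MWT2_UC_SCRATCHDISK": (550, 350),
}

# Space released by each endpoint, in TB.
freed_tb = {name: before - after for name, (before, after) in reallocations.items()}
total_freed_tb = sum(freed_tb.values())

if __name__ == "__main__":
    for name, tb in freed_tb.items():
        print(f"{name}: {tb} TB released")
    print(f"Total: {total_freed_tb} TB")
```

This gives 285 TB, 300 TB and 200 TB respectively, for a total of 785 TB.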
  - 13:15
    Services DevOps 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    XCache
    
    new testing image created
    
    issues with caches in UK
    
    VP queues at MWT2 and AGLT2 disabled for now.
    
    VP
    
    NTR
    
    Varnish
    
    Got a server at IN2P3-CC. This morning all of the FR cloud (except far-flung sites) moved to the Cloudflare DNS load balancer.
    
    Tracking CRIC overwrites at CERN-PROD, lxplus, LRZ, Beijing...
    
    Next on DE cloud.
    
    Concerning US:
    
    BNL - varnish server works but lacks monitoring so not in use.
    
    NET2 - using NRP varnish. Just got email from Derek asking if we can deploy on NET2 instead of NRP.
    
    SWT2 - need to get varnish installed.
    
    AI
    
    have 3 summer students to work on an agentic assistant.
    
    Will test ADK, LangGraph, and OpenAI approaches.
    
    ServiceX/Y
    
    NTR
  - 13:20
    Facility R&D 5m
    
    Speaker: Lincoln Bryant (University of Chicago (US))
    
    Have been working on scrubbing slides
    
    Armada
    
    Auth issues fixed
    
    Working on connecting to a second cluster (RP1 --> UChicago AF) with minimal privileges
    
    Coffea Casa
    
    User login should be working (https://coffea-casa.hl-lhc.io/, login with ATLAS IAM); each user gets persistent /home and /scratch from CephFS
    
    Working on adding HTCondor / Dask support
    
    EOS
    
    Work continues on authentication. How to inject users into the MGM pod in production is not yet clear and may require feature contributions to the Helm chart.
    
    HTCondor overlay container
    
    Have a container that runs unprivileged and connects back to UChicago AF HTCondor pool
    
    Will be working with Doug to test this at NERSC
- 13:25 → 13:35
  
  AOB 10m
