US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
12:00
→
12:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
12:05
→
12:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (impending)
- Ilija: status of XRootD 5.8.3-1.2 testing?
- Frontier squid 6.13-1.3: major update that requires manual intervention for installations with multiple workers (https://opensciencegrid.atlassian.net/issues/SOFTWARE-6171)
Miscellaneous
- Kuantifier
- OKD unpriv Prometheus notes passed along to NET2
- Kuantifier and docs ready
- EL10
- What're the US ATLAS plans for upgrading?
- Does anyone use the HTCondor keyboard daemon?
- Who uses nftables?
-
12:10
→
12:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
Upcoming BNL / HTCondor / USATLAS meeting: 25 JUN @ 15:00 eastern / 14:00 central time:
This is the USATLAS version of the meeting. Please forward along to USATLAS folks as you see fit
Zoom link: https://bnl.zoomgov.com/j/1614360980?pwd=AQm6x3reOaNtEze9H7GjadACjvWaBE.1
Google doc link: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0
-
12:20
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Smooth operation of Disk and Tape systems
- Per agreement with ADC and HPSS storage experts, the FIFO batch queue management threshold has been reduced from 3 to 2 days. Staging mode now switches to FIFO when requests older than 2 days are detected.
- dCache Doors and Pools updated to version 9.2.35.
- Kafka-based monitoring data collectors enabled.
- Scheduled dCache maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- Database hardware and release update to PostgreSQL 16.
- dCache minor releases will be normalized to 9.2.35 on a few nodes.
- Scheduled HPSS gateway maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- ATLASDATADISK temp space rolled back to nominal pledge values.
- Extra space added (~190TB) to minimize DDM blacklisting.
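The threshold rule in the first item above can be sketched in a few lines of Python. This is only an illustration, not the HPSS batch configuration: the function names `staging_mode` and `fifo_order`, the mode label `DEFAULT`, and the idea of passing request submission times as a list are assumptions made for the example. Only the 2-day threshold and the switch to FIFO come from the notes.

```python
from datetime import datetime, timedelta, timezone

# From the notes: switch to FIFO once a request older than 2 days is detected.
FIFO_THRESHOLD = timedelta(days=2)


def staging_mode(submit_times, now, threshold=FIFO_THRESHOLD):
    """Return "FIFO" if any pending request is older than the threshold.

    "DEFAULT" is a placeholder label for whatever mode is normally used.
    """
    if any(now - t > threshold for t in submit_times):
        return "FIFO"
    return "DEFAULT"


def fifo_order(submit_times):
    """In FIFO mode the oldest request is served first."""
    return sorted(submit_times)


now = datetime(2025, 6, 18, 12, 0, tzinfo=timezone.utc)
pending = [now - timedelta(hours=30), now - timedelta(days=2, hours=1)]
print(staging_mode(pending, now))  # the 49-hour-old request triggers FIFO
```

With the previous 3-day threshold (`threshold=timedelta(days=3)`), the same pending list would stay in the default mode.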
-
12:30
→
12:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- It has been a tough period recently as several sites have had problems that affected production.
- All sites were affected by a Rucio issue on Jun 6.
- MWT2 was taken offline by an unintended side effect of an unrelated change made by IU network engineers.
- MWT2 had its DATADISK fill up because there were many small files and the deletion process could not keep up.
- NET2 was down for a couple of days for the MGHPCC annual maintenance.
- NET2 also had some production loss in the last week because of network problems.
- OU was affected by the Great Plains Network replacing switches and forgetting to turn on jumbo frames.
- In the last day OU had its storage filled to capacity by transfers that apparently did not honor the size limit of its endpoint.
- CPB had trouble with their gatekeepers.
- There was progress on EL9 update/FY24 installation
- MSU updated their Satellite/Capsule software but still have trouble
- CPB is still working on their storage.
- Rafael will give the Tier 2 scrubbing presentation (WBS 2.3.2); the slides are complete.
- A tip of the hat to Rafael for jumping on this.
-
12:40
→
12:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
12:40
HPC Operations 5m
Speaker: Rui Wang (Argonne National Laboratory (US))
-
12:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
12:50
→
13:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
12:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Investigating a fix for the nobody:nobody ownership issue for dCache data within a non-root pod
- Built a new image with nfs-utils installed so that idmapd can run, and uploaded it to the SDCC Quay service.
- Looking into a sidecar container to get idmapd running in a non-root pod
- The sidecar would run idmapd with root privileges while the main application container runs with fewer privileges, to satisfy OpenShift’s restricted SCC rules.
-
12:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
Ceph File System Capacity
- Ceph file systems have exceeded 80% usage of the 1PB capacity, beginning to impact performance and user quota allocation.
Dask-Gateway with HTCondor Integration
- Ongoing development to enable Dask-Gateway scheduling on HTCondor.
- Exploring two backend integration approaches:
  - Extending the existing Kubernetes backend to support Condor workers.
  - Adapting the JobQueue backend model to interface with HTCondor.
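Whichever approach is chosen, each requested Dask worker ultimately has to become an HTCondor job. The sketch below illustrates only that one step, rendering a submit description as text. It is not Dask-Gateway's backend API and not the UChicago implementation: the function name `worker_submit_description`, its parameters, the scheduler address, and the `/usr/bin/dask` path are all hypothetical, chosen for the example.

```python
def worker_submit_description(scheduler_address, cores=1, memory_mb=2048,
                              n_workers=1, dask_executable="/usr/bin/dask"):
    """Render an HTCondor submit description for Dask workers (illustrative).

    The executable path is a hypothetical default; a real deployment would
    more likely start workers from a container image or a site-specific
    environment.
    """
    lines = [
        "universe = vanilla",
        f"executable = {dask_executable}",
        # The executable is assumed to already exist on the worker node.
        "transfer_executable = false",
        # `dask worker <address>` starts a worker that connects to the scheduler.
        f"arguments = worker {scheduler_address} "
        f"--nthreads {cores} --memory-limit {memory_mb}MiB",
        f"request_cpus = {cores}",
        f"request_memory = {memory_mb}MB",
        # One HTCondor job per requested worker.
        f"queue {n_workers}",
    ]
    return "\n".join(lines)


print(worker_submit_description("tcp://scheduler.example.org:8786",
                                cores=2, memory_mb=4096, n_workers=3))
```

Keeping the resource requests and the worker's own limits derived from the same arguments avoids a worker believing it has more memory than the slot HTCondor granted.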
-
13:10
→
13:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
13:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- DDM is reallocating space at some sites from SCRATCHDISK to DATADISK
- BNL-OSG2_SCRATCHDISK: 935TB -> 650TB
- SWT2_CPB_SCRATCHDISK: 500TB -> 200TB
- MWT2_UC_SCRATCHDISK: 550TB -> 350TB
- Issues with DATADISK space at several US sites for different reasons (OU drained because of this)
- BNL tape storage admins met with ADC to discuss staging issues during recent data carousel campaign
- Decided to change retrieval algorithm so that it switches strategy from "High to Low" to FIFO after 2 days instead of 3
- Rod is working with Doug on OverlayBS deployment and is also adding GPU scheduling features to WFMS (see slides)
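As a quick check on the SCRATCHDISK reductions listed above, the space released per endpoint and in total can be computed directly. The snippet uses only the figures quoted in the notes; the assumption that all released space goes to DATADISK at the same site is mine, not stated in the minutes.

```python
# (endpoint, old size in TB, new size in TB), as quoted in the notes
changes = [
    ("BNL-OSG2_SCRATCHDISK", 935, 650),
    ("SWT2_CPB_SCRATCHDISK", 500, 200),
    ("MWT2_UC_SCRATCHDISK", 550, 350),
]

# Space released from each SCRATCHDISK endpoint
released = {name: old - new for name, old, new in changes}
total_released = sum(released.values())

for name, tb in released.items():
    print(f"{name}: {tb} TB released")
print(f"Total: {total_released} TB")  # 285 + 300 + 200 = 785 TB
```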
-
13:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- new testing image created
- issues with caches in UK
- VP queues at MWT2 and AGLT2 disabled for now.
- VP
- NTR
- Varnish
- Got a server at IN2P3-CC. This morning all of the FR cloud (except far-flung sites) moved to the Cloudflare DNS load balancer.
- Tracking CRIC overwrites at CERN-PROD, lxplus, LRZ, Beijing...
- Next on DE cloud.
- Concerning US:
- BNL - Varnish server works but lacks monitoring, so it is not in use.
- NET2 - using NRP Varnish. Just got email from Derek asking if we can deploy on NET2 instead of NRP.
- SWT2 - need to get Varnish installed.
- AI
- have 3 summer students to work on an agentic assistant.
- Will test ADK, LangGraph, and OpenAI approaches.
- ServiceX/Y
- NTR
-
13:20
Facility R&D 5m
Speaker: Lincoln Bryant (University of Chicago (US))
- Have been working on scrubbing slides
- Armada
- Auth issues fixed
- Working on connecting to a second cluster (RP1 --> UChicago AF) with minimal privileges
- Coffea Casa
- User login should be working (https://coffea-casa.hl-lhc.io/ login with ATLAS IAM), user gets persistent /home and /scratch from CephFS
- Working on adding HTCondor / Dask support
- EOS
- Work continues on authentication; how to inject users into the MGM pod in production is not clear and may require feature contributions to the Helm chart
- HTCondor overlay container
- Have a container that runs unprivileged and connects back to UChicago AF HTCondor pool
- Will be working with Doug to test this at NERSC
-
13:25
→
13:35
AOB 10m