US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
12:00 → 12:05 WBS 2.3 Facility Management News (5m)
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
12:05 → 12:10 OSG-LHC (5m)
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (impending)
- Ilija: status of XRootD 5.8.3-1.2 testing?
- Frontier squid 6.13-1.3: major update that requires manual intervention for installations with multiple workers (https://opensciencegrid.atlassian.net/issues/SOFTWARE-6171)

Miscellaneous
- Kuantifier
  - OKD unprivileged Prometheus notes passed along to NET2
  - Kuantifier and docs ready
- EL10
  - What are the US ATLAS plans for upgrading?
  - Does anyone use the HTCondor keyboard daemon?
  - Who uses nftables?

12:10 → 12:30 WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
Upcoming BNL / HTCondor / USATLAS meeting: 25 JUN @ 15:00 eastern / 14:00 central time:
This is the USATLAS version of the meeting. Please forward along to USATLAS folks as you see fit
Zoom link: https://bnl.zoomgov.com/j/1614360980?pwd=AQm6x3reOaNtEze9H7GjadACjvWaBE.1
Google doc link: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0
12:20 Storage (5m)
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Smooth operation of disk and tape systems
- Per agreement with ADC and HPSS storage experts, the FIFO batch queue management threshold has been reduced from 3 to 2 days. Staging mode now switches to FIFO when requests older than 2 days are detected.
- dCache doors and pools updated to version 9.2.35.
- Kafka-based monitoring data collectors enabled.
- Scheduled dCache maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- Database hardware and release update to PostgreSQL 16.
- Minor dCache releases will be normalized to 9.2.35 on a few nodes.
- Scheduled HPSS gateway maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- ATLASDATADISK temporary space rolled back to nominal pledge values.
- Extra space added (~190 TB) to minimize DDM blacklisting.
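
The FIFO threshold rule in the first item can be illustrated with a minimal sketch. This is not the HPSS or dCache implementation: the `StageRequest` type, its `priority` field, and the priority-based "default" ordering are hypothetical stand-ins, since the notes only state that staging switches to FIFO once a request older than the threshold is detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold from the notes: reduced from 3 days to 2 days.
FIFO_THRESHOLD = timedelta(days=2)


@dataclass
class StageRequest:
    """Hypothetical staging request; field names are illustrative only."""
    name: str
    submitted: datetime
    priority: int  # assumed: a larger value means more urgent


def choose_mode(requests, now, threshold=FIFO_THRESHOLD):
    """Return "fifo" if any request has waited longer than threshold."""
    if any(now - r.submitted > threshold for r in requests):
        return "fifo"
    return "default"


def order_queue(requests, now, threshold=FIFO_THRESHOLD):
    """Order requests oldest-first in FIFO mode, else by assumed priority."""
    if choose_mode(requests, now, threshold) == "fifo":
        return sorted(requests, key=lambda r: r.submitted)
    return sorted(requests, key=lambda r: -r.priority)
```

With a 3-day threshold, a request that has waited 2 days and 1 hour leaves the queue in its default ordering. With the new 2-day threshold, the same request triggers the switch to FIFO.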
 
12:30 → 12:40 WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- It has been a tough period recently, as several sites have had problems that affected production.
  - All sites were affected by a Rucio issue on Jun 6.
  - MWT2 was put offline by an unintended side effect of an unrelated change made by IU network engineers.
  - MWT2 had its DATADISK fill up because there were many small files and the deletion process could not keep up.
  - NET2 was down for a couple of days for the MGHPCC annual maintenance.
  - NET2 also had some production loss in the last week because of network problems.
  - OU was affected by the Great Plains Network replacing switches and forgetting to turn on jumbo frames.
  - In the last day, OU had its storage filled to capacity by transfers that apparently did not honor the size limit of the endpoint.
  - CPB had trouble with their gatekeepers.
- There was progress on the EL9 update/FY24 installation.
  - MSU updated their Satellite/Capsule software but is still having trouble.
  - CPB is still working on their storage.
- Rafael will give the Tier 2 scrubbing (WBS 2.3.2) and the slides are complete.
  - A tip of the hat to Rafael for jumping on this.
12:40 → 12:50 WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))

12:40 HPC Operations (5m)
Speaker: Rui Wang (Argonne National Laboratory (US))

12:45 Integration of Complex Workflows on Heterogeneous Resources (5m)
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

12:50 → 13:10 WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))

12:50 Analysis Facilities - BNL (5m)
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Investigating a fix for the nobody:nobody issue for dCache data within a non-root pod.
  - Built a new image with nfs-utils installed so that idmapd can run, and uploaded it to the SDCC Quay service.
  - Looking into a sidecar container to get idmapd running in a non-root pod:
    - the sidecar runs idmapd with root privileges while the main application container runs with fewer privileges, to satisfy OpenShift's restricted SCC rules.

12:55 Analysis Facilities - SLAC (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

13:00 Analysis Facilities - Chicago (5m)
Speaker: Fengping Hu (University of Chicago (US))
Ceph File System Capacity
- Ceph file systems have exceeded 80% usage of the 1 PB capacity, which is beginning to impact performance and user quota allocation.

Dask-Gateway with HTCondor Integration
- Ongoing development to enable Dask-Gateway scheduling on HTCondor.
- Exploring two backend integration approaches:
  - Extending the existing Kubernetes backend to support Condor workers.
  - Adapting the JobQueue backend model to interface with HTCondor.

13:10 → 13:25 WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)

13:10 ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m)
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- DDM is reallocating space at some sites from SCRATCHDISK to DATADISK:
  - BNL-OSG2_SCRATCHDISK: 935 TB → 650 TB
  - SWT2_CPB_SCRATCHDISK: 500 TB → 200 TB
  - MWT2_UC_SCRATCHDISK: 550 TB → 350 TB
- Issues with DATADISK space at several US sites for different reasons (OU was drained because of this).
- BNL tape storage admins met with ADC to discuss staging issues during the recent data carousel campaign.
  - Decided to change the retrieval algorithm so that it switches strategy from "High to Low" to FIFO after 2 days instead of 3.
- Rod is working with Doug on OverlayBS deployment and is also adding GPU scheduling features to WFMS (see slides).
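
For a quick sense of scale, the SCRATCHDISK figures quoted in the first item can be tallied as below. This is only arithmetic on the numbers in these notes; the dictionary layout and the `freed_tb` helper are illustrative, and the sketch assumes the full difference at each site is what moves to DATADISK.

```python
# (before, after) SCRATCHDISK sizes in TB, as quoted in the notes.
reallocations = {
    "BNL-OSG2_SCRATCHDISK": (935, 650),
    "SWT2_CPB_SCRATCHDISK": (500, 200),
    "MWT2_UC_SCRATCHDISK": (550, 350),
}


def freed_tb(table):
    """Return the space released per endpoint (before minus after), in TB."""
    return {name: before - after for name, (before, after) in table.items()}


freed = freed_tb(reallocations)
total_freed = sum(freed.values())

for name, tb in freed.items():
    print(f"{name}: {tb} TB released")
print(f"Total: {total_freed} TB")
```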
 
13:15 Services DevOps (5m)
Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
  - New testing image created.
  - Issues with caches in UK.
  - VP queues at MWT2 and AGLT2 disabled for now.
- VP
  - NTR
- Varnish
  - Got a server at IN2P3-CC. This morning all of the FR cloud (except far-flung sites) moved to the Cloudflare DNS load balancer.
  - Tracking CRIC overwrites at CERN-PROD, lxplus, LRZ, Beijing...
  - Next on DE cloud.
  - Concerning US:
    - BNL: Varnish server works but lacks monitoring, so it is not in use.
    - NET2: using NRP Varnish. Just got an email from Derek asking if we can deploy on NET2 instead of NRP.
    - SWT2: need to get Varnish installed.
- AI
  - Have 3 summer students to work on an agentic assistant.
  - Will test ADK, LangGraph, and OpenAI approaches.
- ServiceX/Y
  - NTR

13:20 Facility R&D (5m)
Speaker: Lincoln Bryant (University of Chicago (US))
- Have been working on scrubbing slides.
- Armada
  - Auth issues fixed.
  - Working on connecting to a second cluster (RP1 --> UChicago AF) with minimal privileges.
- Coffea Casa
  - User login should be working (https://coffea-casa.hl-lhc.io/, log in with ATLAS IAM); the user gets persistent /home and /scratch from CephFS.
  - Working on adding HTCondor / Dask support.
- EOS
  - Work continues on authentication. How to inject users into the MGM pod in production is not clear and may require feature contributions to the Helm chart.
- HTCondor overlay container
  - Have a container that runs unprivileged and connects back to the UChicago AF HTCondor pool.
  - Will be working with Doug to test this at NERSC.

13:25 → 13:35 AOB (10m)
 
12:00 → 12:05