US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
12:00 → 12:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
12:05 → 12:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (impending)
- Ilija: status of XRootD 5.8.3-1.2 testing?
- Frontier squid 6.13-1.3: major update that requires manual intervention for installations with multiple workers (https://opensciencegrid.atlassian.net/issues/SOFTWARE-6171)
Miscellaneous
- Kuantifier
- OKD unprivileged Prometheus notes passed along to NET2
- Kuantifier and docs ready
 
- EL10
- What are the US ATLAS plans for upgrading?
- Does anyone use the HTCondor keyboard daemon?
- Who uses nftables?
 
 
12:10 → 12:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
Upcoming BNL / HTCondor / USATLAS meeting: 25 JUN @ 15:00 Eastern / 14:00 Central time. This is the USATLAS version of the meeting. Please forward along to USATLAS folks as you see fit.
Zoom link: https://bnl.zoomgov.com/j/1614360980?pwd=AQm6x3reOaNtEze9H7GjadACjvWaBE.1
Google doc link: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0
12:20
Storage 5m
Speakers: Carlos Fernando Gamboa (Department of Physics-Brookhaven National Laboratory (BNL)-Unkno), Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Smooth operation of Disk and Tape systems
- Per agreement with ADC and HPSS storage experts, the FIFO batch queue management threshold has been reduced from 3 to 2 days. Staging mode now switches to FIFO when requests older than 2 days are detected.
- dCache Doors and Pools updated to version 9.2.35.
- Kafka-based monitoring data collectors enabled.
- Scheduled dCache maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- Database hardware and release update to PostgreSQL 16.
- Minor dCache releases on a few nodes will be normalized to 9.2.35.
- Scheduled HPSS gateway maintenance: 06/24/2025, from 13:00 to 19:00 CEST.
- ATLASDADISK temp space rolled back to nominal pledge values.
- Extra space added (~190TB) to minimize DDM blacklisting.
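The FIFO switch in the first storage item can be illustrated with a minimal sketch. It assumes only what the notes state: staging falls back to FIFO once any pending request is older than 2 days. The function names, the `"DEFAULT"` label for the normal ordering, and the request tuples are hypothetical and are not taken from the HPSS or dCache configuration.

```python
from datetime import datetime, timedelta

# Threshold from the notes above: 2 days (previously 3).
FIFO_THRESHOLD = timedelta(days=2)


def choose_staging_mode(submit_times, now, threshold=FIFO_THRESHOLD):
    """Return "FIFO" if any pending request is older than `threshold`.

    "DEFAULT" is a hypothetical label for whatever ordering is used otherwise.
    """
    oldest_age = max((now - t for t in submit_times), default=timedelta(0))
    return "FIFO" if oldest_age > threshold else "DEFAULT"


def order_requests(requests, now, threshold=FIFO_THRESHOLD):
    """Order (name, submit_time) pairs; FIFO mode sorts oldest first."""
    mode = choose_staging_mode([t for _, t in requests], now, threshold)
    if mode == "FIFO":
        return mode, sorted(requests, key=lambda r: r[1])
    # The normal ordering is not described in the notes, so it is left untouched here.
    return mode, list(requests)


# Illustrative data only: one request is just over 2 days old.
now = datetime(2025, 6, 24, 12, 0)
pending = [
    ("req-b", now - timedelta(hours=6)),
    ("req-a", now - timedelta(days=2, hours=3)),
]
mode, ordered = order_requests(pending, now)
print(mode, [name for name, _ in ordered])
```

With the previous 3-day threshold the same queue would stay in the normal mode, which is the practical effect of the change.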
 
12:30 → 12:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- It has been a tough period recently as several sites have had problems that affected production.
- All sites were affected by a Rucio issue on Jun 6.
- MWT2 was put offline by an unintended side effect of an unrelated change made by IU network engineers.
- MWT2 had its DATADISK fill up because there were many small files and the deletion process could not keep up.
- NET2 was down for a couple of days for the MGHPCC annual maintenance.
- NET2 also had some production loss in the last week because of network problems.
- OU was affected by the Great Plains Network replacing switches and forgetting to turn on jumbo frames.
- In the last day OU had its storage filled to capacity by transfers that apparently did not honor the size limit of its endpoint.
- CPB had trouble with their gatekeepers.
 
- There was progress on EL9 update/FY24 installation
- MSU updated their Satellite/Capsule software but still have trouble
- CPB is still working on their storage.
 
- Rafael will give the Tier 2 scrubbing (WBS 2.3.2) and the slides are complete.
- A tip of the hat to Rafael for jumping on this.
 
 
12:40 → 12:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
12:40
HPC Operations 5m
Speaker: Rui Wang (Argonne National Laboratory (US))
12:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
 
12:50 → 13:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
12:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Investigating and working on a fix for the nobody:nobody issue for dCache data within a non-root pod
- Built a new image with nfs-utils installed so that idmapd can run, and uploaded it to the SDCC Quay service.
- Looking into a sidecar container to get idmapd running for a non-root pod
- The sidecar would run idmapd with root privileges while the main application container runs with fewer privileges, to satisfy OpenShift's restricted SCC rules.
 
 
 
12:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
13:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
Ceph File System Capacity
- Ceph file systems have exceeded 80% usage of the 1PB capacity, which is beginning to impact performance and user quota allocation.
Dask-Gateway with HTCondor Integration
- Ongoing development to enable Dask-Gateway scheduling on HTCondor.
- Exploring two backend integration approaches:
- Extending the existing Kubernetes backend to support Condor workers.
- Adapting the JobQueue backend model to interface with HTCondor.
 
13:10 → 13:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
13:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- DDM is reallocating space at some sites from SCRATCHDISK to DATADISK
- BNL-OSG2_SCRATCHDISK: 935TB -> 650TB
- SWT2_CPB_SCRATCHDISK: 500TB -> 200TB
- MWT2_UC_SCRATCHDISK: 550TB -> 350TB
 
- Issues with DATADISK space at several US sites for different reasons (OU drained because of this)
- BNL tape storage admins met with ADC to discuss staging issues during recent data carousel campaign
- Decided to change retrieval algorithm so that it switches strategy from "High to Low" to FIFO after 2 days instead of 3
 
- Rod is working with Doug on OverlayBS deployment and is also adding GPU scheduling features to WFMS (see slides)
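For a quick sense of scale, the SCRATCHDISK reductions listed above can be totaled with a few lines of Python. This is a minimal sketch: the sizes are copied from the three endpoints in these notes, while the assumption that each reduction becomes available to DATADISK at the same site is ours and is not stated in the notes.

```python
# (old_tb, new_tb) SCRATCHDISK sizes copied from the list above.
SCRATCHDISK_CHANGES = {
    "BNL-OSG2_SCRATCHDISK": (935, 650),
    "SWT2_CPB_SCRATCHDISK": (500, 200),
    "MWT2_UC_SCRATCHDISK": (550, 350),
}


def freed_space_tb(changes):
    """Return the space removed from each endpoint, in TB."""
    return {name: old - new for name, (old, new) in changes.items()}


freed = freed_space_tb(SCRATCHDISK_CHANGES)
total_freed = sum(freed.values())

for name, tb in freed.items():
    print(f"{name}: {tb} TB")
print(f"Total: {total_freed} TB")
```

The three reductions come to 285 TB, 300 TB and 200 TB, or 785 TB in total.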
 
13:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- new testing image created
- issues with caches in UK
- VP queues at MWT2 and AGLT2 disabled for now.
 
- VP
- NTR
 
- Varnish
- Got a server at IN2P3-CC. This morning all of the FR cloud (except far-flung sites) moved to the Cloudflare DNS load balancer.
- Tracking CRIC overwrites at CERN-PROD, lxplus, LRZ, Beijing...
- Next on DE cloud.
- Concerning US:
- BNL - Varnish server works but lacks monitoring, so it is not in use.
- NET2 - using NRP Varnish. Just got an email from Derek asking if we can deploy on NET2 instead of NRP.
- SWT2 - need to get Varnish installed.
 
 
- AI
- Have 3 summer students working on an agentic assistant.
- Will test ADK, LangGraph, and OpenAI approaches.
 
- ServiceX/Y
- NTR
 
 
13:20
Facility R&D 5m
Speaker: Lincoln Bryant (University of Chicago (US))
- Have been working on scrubbing slides
- Armada
- Auth issues fixed
- Working on connecting to a second cluster (RP1 --> UChicago AF) with minimal privileges
 
- Coffea Casa
- User login should be working (https://coffea-casa.hl-lhc.io/ login with ATLAS IAM). Each user gets persistent /home and /scratch from CephFS
- Working on adding HTCondor / Dask support
 
- EOS
- Work continues on authentication. How to inject users into the MGM pod in production is not yet clear and may require feature contributions to the Helm chart
 
- HTCondor overlay container
- Have a container that runs unprivileged and connects back to UChicago AF HTCondor pool
- Will be working with Doug to test this at NERSC
 
 
 
13:25 → 13:35
AOB 10m
 