US ATLAS Computing Facility (Replaced Tech Presentation)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
13:00 → 13:05  WBS 2.3 Facility Management News (5m)
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn McKee (University of Michigan (US))
Today is a regular facility meeting (we had no Topical Presentation lined up). Please let us know if you have a topic you would like to present at a future meeting.
There are a lot of things going on.
- February 2025 is a "Capabilities" Testing and Demonstration month.   See current list of topics at https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link 
- Please consider participating in one or more and feel free to edit existing documents or add new ones
 
 - The Tier-2s need to come up with a plan for how to use extra funds during this calendar year.
- Highest priority is ensuring each of our Tier-2s will have 400 Gbps links by the end of 2029 (but it may be too early to spend directly on that now)
 - Each Tier-2 should be engaging the relevant campus and regional networks to discuss their upgrade plans and timelines
 - Also consider needs for the funds to fix infrastructure issues (power, cooling)
 - First version of a WBS 2.3.2 document is due by the end of this month, with details needed by July scrubbing
 
 - Ongoing Jumbo frames testing is proceeding smoothly.
- Today is the last "regular" frames transfer test from CERN-PROD_PILOT to both NET2 and BNL; tomorrow and Friday will be Jumbo frame testing
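A quick end-to-end check for jumbo frames is a non-fragmenting ping carrying the largest payload the MTU allows. The arithmetic below is a minimal sketch; the only assumptions are the standard 20-byte IPv4 header (no options) and 8-byte ICMP header, neither of which is stated in the notes.

```python
# Largest ICMP echo payload that fits a given MTU without fragmentation:
# the IPv4 header (20 bytes, no options) and the ICMP header (8 bytes)
# must also fit inside the MTU.
IPV4_HEADER = 20
ICMP_HEADER = 8

def max_ping_payload(mtu: int) -> int:
    """Return the largest `ping -s` payload for a non-fragmenting probe."""
    return mtu - IPV4_HEADER - ICMP_HEADER

# Standard Ethernet: ping -M do -s 1472 <host>
print(max_ping_payload(1500))  # 1472
# Jumbo frames:      ping -M do -s 8972 <host>
print(max_ping_payload(9000))  # 8972
```

If the 8972-byte probe with `-M do` (don't fragment) succeeds host-to-host, every hop on the path is passing jumbo frames.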
 
 - Upcoming Meetings
- LHCONE/LHCOPN meeting https://indico.cern.ch/event/1479019/
 - WLCG DOMA https://indico.cern.ch/event/1511535/
 - HEPiX https://indico.cern.ch/event/1477299/
 
 - Also for your calendar: we plan to have a USATLAS facilities meeting as part of HTC25 in Madison, Wisconsin, June 2-6, 2025.
- Meeting site is https://agenda.hep.wisc.edu/event/2297/overview
 
 - USATLAS scrubbing dates have been decided: July 14/15 at Stony Brook (possibly moved to 15/16 for European travel needs)
- While many of you won't need to attend, you may be asked for input or slides for the scrubbing
 
 
13:05 → 13:10  OSG-LHC (5m)
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (this week)
- vo-client
 - XRootD shoveler
 - xrdcl-pelican
 
Release (aiming for next week)
- XRootD 5.7.3
 - CVMFS 2.12.6: new release (currently released version is 2.11.5) with various client features and bug fixes. See details here https://cvmfs.readthedocs.io/en/stable/cpt-releasenotes.html
 
Other projects
- ARM package integration testing: made some progress getting ARM VMs started by HTCondor; working through some minor invocation issues
 - Kuantifier: waiting on NET2 authenticated Prometheus dev instance
- Eduardo has nodes for this and is working on setting up the cluster
 
 
13:10 → 13:30  WBS 2.3.1: Tier-1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10  Tier-1 Infrastructure (5m)  Speaker: Jason Smith
13:15  Compute Farm (5m)  Speaker: Thomas Smith
13:20  Storage (5m)  Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
13:25  Tier-1 Operations and Monitoring (5m)  Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
 
WBS 2.3.1.3 Tier-1 Compute - Tom
- Testing Condor v24 LTS configuration on gridgk03
- Some issues with jobs being evicted after 2 hours. Condor developers have been contacted and are providing support
 
 - All WNs upgraded to Condor 24.0 LTS and AlmaLinux 9.5; operation of the workers has been smooth
 
WBS 2.3.1.4 Tier-1 Storage - Carlos
- Database hardware issue affecting Pinmanager, Bulk, TransferManager and SpaceManager services
- Degradation of service mainly affecting WRITEs (02/01/25 5PM EST)
 - Service recovered 02/02/25
 - Activity on synchronizing internal accounting (spacemanager) tables after restoring the service
 
 - Enabling JumboFrames on all doors and storage servers for ongoing Capabilities testing
 - Bulk service restarted on 02/09/25
- 130k staging requests stuck in QUEUE state
 - After restarting the service the requests were submitted to HPSS. The entire workflow is working as expected. A follow-up ticket was created for the dCache devs: https://github.com/dCache/dcache/issues/7746
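Draining a backlog of that size is usually done in bounded batches rather than one bulk submission. The sketch below is generic illustration only; the 5,000-request batch size and the integer request IDs are made up and are not dCache or HPSS API calls.

```python
def batched(items, size):
    """Yield successive fixed-size chunks of a request backlog."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

backlog = list(range(130_000))           # stand-in for the stuck request IDs
batches = list(batched(backlog, 5_000))  # resubmit 5k at a time (illustrative)
print(len(batches))  # 26
```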
 
 
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
 
13:30 → 13:40  WBS 2.3.2 Tier-2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Some reduction in production in the last 30 days.
- Two central outages:
- 1/14/25-1/16/25: A change at CERN caused BNL to fail and sites drained until they were moved to the CERN FTS instance.
 - 2/6/25: One of the two harvester instances at CERN had a database issue; US sites using HTCondor-CE drained.
- This did not affect NET2 or the Kubernetes part of CPB.
 
 
 - For the month of January the Illinois site of MWT2 was offline, reducing MWT2 production by about 1/3.
- Jan 2-15: the site was down for a move to a new building.
 - Jan 16-22 (approximately): authentication was not working.
 - Jan 23-31 (approximately): systems were rebuilt as RHEL9 using the new Puppet setup.
- There were also various hardware and power balance issues.
 
 
 - NET2 had a couple of interruptions to get their 400G uplink working.
- The good news is the 400G is in service and working well!
 
 - OU_OSCER_ATLAS is generally stable, with lots of opportunistic jobs.
- Some draining on 2/11/25.
 
 - SWT2_CPB worked most of January to get their site up and running on AlmaLinux 9.
- Things stabilized on 2/3/25.
- CPB did not refill for one whole day last week after the harvester issue was fixed.
- The cause of the slow refilling is under investigation.
 
 
 - Procurement Planning
- We need to come up with a list of the extra network gear on which to spend the $2-$4 million split between the Tier-2 sites, by the end of February.
 - Procurement plans will likely be due by the end of March now that the equipment funding levels are known.
 
 - Operations Planning
- Now that we are past the EL9 updates (except MSU), we need to plan what we do going forward.
- Clearly storage tokens will need to be supported at all sites.
 - Some sites need to update to OSG 24 / Condor 24.
 - All sites have all public-facing servers dual-stacked and supporting IPv6, except the CE at OU.
 - AGLT2 and CPB still need to move to jumbo frames.
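For dual-stack checks like the one above, a quick sanity test is to classify the addresses a lookup (e.g. `socket.getaddrinfo`) returns for each public-facing server. A minimal standard-library sketch; the two addresses are made-up documentation examples, not real site addresses:

```python
import ipaddress

def split_by_family(addrs):
    """Partition address strings into (ipv4, ipv6) lists."""
    v4, v6 = [], []
    for a in addrs:
        (v4 if ipaddress.ip_address(a).version == 4 else v6).append(a)
    return v4, v6

# Hypothetical dual-stacked CE: one A record, one AAAA record.
v4, v6 = split_by_family(["192.0.2.10", "2001:db8::10"])
print(bool(v4) and bool(v6))  # True -> dual-stacked
```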
 
 
13:40 → 13:50  WBS 2.3.3 Heterogeneous Integration and Operations (HIOPS)
Convener: Rui Wang (Argonne National Laboratory (US))
13:40  HPC Operations (5m)  Speaker: Rui Wang (Argonne National Laboratory (US))
13:45  Integration of Complex Workflows on Heterogeneous Resources (5m)  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
 
13:50 → 14:10  WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
13:55  Analysis Facilities - SLAC (5m)  Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00  Analysis Facilities - Chicago (5m)  Speaker: Fengping Hu (University of Chicago (US))
- ServiceX updated to 1.5.6. It’s expected to be reliable, and Ben is confident that it’s ready for broader use.
 - Added Dask-Gateway support to the AB image (currently in a branch). Since it requires JupyterHub for launching, we are setting up BinderHub as the launching platform.
 - coffea-casa cull timeout adjusted from 1 hour to 1 day; this is to support users launching computations from the terminal.
 - Maintenance is scheduled for late February or early March.
 
 
14:10 → 14:25  WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
14:10  ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News (5m)  Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- ADC Operations:
- 05.02.2025: One Harvester (out of two) had DB lock timeouts.
 - 29.01.2025: PanDA issue due to a token issuer change (ATLASPANDA-1291).
 - DDM Ops/US Ops: Fabio is back; his priorities have been defined.
 - GPUs: Need CUDA > 12.8 on all PQs. Expect Helpdesk tickets.
 - SAM tests moved from python2@SL7 to python3@EL9.
 
 - US Cloud Operations
- SWT2: Failed transfers due to ACT access problem. Ongoing.
 - Ongoing JumboFrames tests.
 
 - USATLAS Helpdesk Tickets (Link)
 
14:15  Services DevOps (5m)  Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches 
- Several issues I should look at.
 - Still have not debugged the gStream issue.
 
 - VP
- working fine
 - need to follow up on NET2 VP queue mails.
 
 - Varnishes
- all working fine
 - There was a discussion on a wholesale move from squid to varnish.
 - Now adding instances at NRP in NL and CZ to serve Frontier data.
 
 - ServiceY
- retesting FAB server-side delivery.
 - new datasets, new cluster
 
 - ServiceX
- upgraded to 1.5.6
 - new code gen images.
 
 - AI
- now WFMS assistant 'knows' most of the panda task table columns. wfms-assistant.af.atlas-ml.org
 
 
14:20  Facility R&D (5m)  Speaker: Lincoln Bryant (University of Chicago (US))
The rp1 Ceph storage is bottlenecked on the WireGuard interface at IU. The equipment there is much older (R720?), and the CPU might not be fast enough to handle the encryption overhead. Two solutions were implemented:
- Increasing the k8s MTU from 1280 to 8780 increased iperf throughput from 1 Gbps to 4 Gbps.
 - Adding a non-WireGuard backhaul network for Ceph increased performance to 10 Gbps (line rate).
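The tunnel MTU ceiling here is set by WireGuard's encapsulation overhead: 60 bytes over an IPv4 underlay (20 IP + 8 UDP + 32 WireGuard), 80 over IPv6. These constants are the standard WireGuard figures, not taken from the meeting notes; a sketch of the arithmetic:

```python
# WireGuard encapsulation overhead per packet:
#   outer IP header (20 for IPv4, 40 for IPv6) + UDP header (8)
#   + WireGuard data-message header and auth tag (32) = 60 or 80 bytes.
def wg_inner_mtu(outer_mtu: int, outer_ipv6: bool = False) -> int:
    """Largest tunnel-interface MTU that avoids fragmenting the underlay."""
    return outer_mtu - (80 if outer_ipv6 else 60)

print(wg_inner_mtu(1500, outer_ipv6=True))  # 1420, the wg-quick default
print(wg_inner_mtu(9000, outer_ipv6=True))  # 8920 ceiling on a jumbo-frame underlay
```

wg-quick's default MTU of 1420 corresponds to the worst case (IPv6 underlay) on standard Ethernet, which is why raising the k8s MTU is safe once the underlay supports jumbo frames.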
 
Testing the feasibility of unprivileged WireGuard on a VM at UChicago: podman seems to let us create tunnel interfaces in containers without root privileges on current (EL9+) kernels. This might have interesting implications for jobs.
Ongoing re-testing of ServiceY on FAB. Fengping will present at the KNIT10 conference in March.
Flocking tests from the UChicago AF to MWT2 are ongoing, to be tested at large scale during the upcoming MWT2 storage downtime.
 
14:25 → 14:35  AOB (10m)