US ATLAS Computing Facility (Replaced Tech Presentation)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
1:00 PM → 1:05 PM
WBS 2.3 Facility Management News 5m. Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Today is a regular facility meeting (we had no Topical Presentation lined up). Please let us know if you have a topic you would like to present at a future meeting.
There are a lot of things going on.
- February 2025 is a "Capabilities" Testing and Demonstration month. See current list of topics at https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link
- Please consider participating in one or more and feel free to edit existing documents or add new ones
- The Tier-2s need to come up with a plan for how to use extra funds during this calendar year.
- Highest priority is ensuring each of our Tier-2s will have 400 Gbps links by the end of 2029 (but it may be too early to spend directly on that now)
- Each Tier-2 should be engaging the relevant campus and regional networks to discuss their upgrade plans and timelines
- Also consider using the funds to fix infrastructure issues (power, cooling)
- First version of a WBS 2.3.2 document is due by the end of this month, with details needed by the July scrubbing
- Ongoing Jumbo frames testing is proceeding smoothly.
- Today is the last day of "regular" frame transfer testing from CERN-PROD_PILOT to both NET2 and BNL; tomorrow and Friday will be Jumbo-frame testing
- Upcoming Meetings
- LHCONE/LHCOPN meeting https://indico.cern.ch/event/1479019/
- WLCG DOMA https://indico.cern.ch/event/1511535/
- HEPiX https://indico.cern.ch/event/1477299/
- Also for your calendar, we plan to have a USATLAS facilities meeting as part of HTC25 in Madison, Wisconsin, June 2-6, 2025.
- Meeting site is https://agenda.hep.wisc.edu/event/2297/overview
- USATLAS scrubbing dates are set: July 14/15 at Stony Brook (possibly moving to 15/16 to accommodate European travel)
- While many of you won't need to attend, you may be asked for input or slides for the scrubbing
1:05 PM → 1:10 PM
OSG-LHC 5m. Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (this week)
- vo-client
- XRootD shoveler
- xrdcl-pelican
Release (aiming for next week)
- XRootD 5.7.3
- CVMFS 2.12.6: new release (currently released version is 2.11.5) with various client features and bug fixes. See details here https://cvmfs.readthedocs.io/en/stable/cpt-releasenotes.html
Other projects
- ARM package integration testing: made some progress in getting ARM VMs started by HTCondor and are working through some minor invocation issues
- Kuantifier: waiting on the NET2 authenticated Prometheus dev instance (a hedged query sketch follows this list)
- Eduardo has nodes for this and is working on setting up the cluster
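As a hedged illustration of what Kuantifier-style queries against an authenticated Prometheus instance look like: the sketch below hits the standard Prometheus HTTP API with a bearer token. The URL, token path, and metric are hypothetical placeholders, not the actual NET2 endpoints.

```python
# Sketch: query an authenticated Prometheus over its HTTP API.
# PROM_URL, the token path, and the metric are assumptions for illustration.
import requests

PROM_URL = "https://prometheus-dev.example.net"            # hypothetical endpoint
TOKEN = open("/etc/kuantifier/prom-token").read().strip()  # hypothetical token file

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "sum(rate(container_cpu_usage_seconds_total[5m]))"},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```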
1:10 PM → 1:30 PM
WBS 2.3.1: Tier1 Center. Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
1:10 PM Tier-1 Infrastructure 5m. Speaker: Jason Smith
1:15 PM Compute Farm 5m. Speaker: Thomas Smith
1:20 PM Storage 5m. Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
1:25 PM Tier1 Operations and Monitoring 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- Testing Condor v24 LTS configuration on gridgk03
- Some issues with jobs being evicted after 2 hours; the Condor developers have been contacted and are providing support (a hedged diagnostic sketch follows this subsection)
- All WNs upgraded to HTCondor 24.0 LTS and AlmaLinux 9.5; operation of the workers has been smooth
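As a hedged aside (not the BNL team's actual debugging procedure): one way to spot evicted jobs from the schedd side is to use the HTCondor Python bindings and query history for jobs that started more than once. The attributes below are standard ClassAd attributes; the match limit is arbitrary.

```python
# Sketch: list recently evicted/restarted jobs via the HTCondor Python bindings.
import htcondor

schedd = htcondor.Schedd()  # default schedd on the local host

# Jobs with NumJobStarts > 1 were vacated/evicted at least once.
for ad in schedd.history(
    "NumJobStarts > 1",                                         # history constraint
    ["ClusterId", "ProcId", "NumJobStarts", "LastVacateTime"],  # projection
    20,                                                         # max matches
):
    print(ad.get("ClusterId"), ad.get("ProcId"),
          "starts:", ad.get("NumJobStarts"),
          "last vacated:", ad.get("LastVacateTime"))
```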
WBS 2.3.1.4 Tier-1 Storage - Carlos
- Database hardware issue affecting the PinManager, Bulk, TransferManager and SpaceManager services
- Degradation of service mainly affecting WRITEs (02/01/25 5PM EST)
- Service recovered 02/02/25
- Work to synchronize the internal accounting (SpaceManager) tables after restoring the service
- Enabling jumbo frames on all doors and storage servers for the ongoing Capabilities testing
- Bulk service restarted on 02/09/25
- 130k staging requests stuck in QUEUE state
- After restarting the service, the requests were submitted to HPSS and the entire workflow is working as expected. A follow-up ticket was created for the dCache devs: https://github.com/dCache/dcache/issues/7746
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
1:30 PM → 1:40 PM
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Some reduction in production in the last 30 days.
- Two central outages:
- 1/14/25-1/16/25: A change at CERN caused BNL to fail and sites drained until they were moved to the CERN FTS instance.
- 2/6/25: One of the two Harvester instances at CERN had a database issue; US sites using HTCondor-CE drained.
- This did not affect NET2 or the Kubernetes part of CPB.
- For the month of January the Illinois site of MWT2 was offline, reducing MWT2 production by about 1/3.
- Jan 2-15: the site was down for a move to a new building.
- Jan 16-22 (approximately): authentication was not working.
- Jan 23-31 (approximately): systems were rebuilt as RHEL9 using the new Puppet setup.
- There were also various hardware and power balance issues.
- NET2 had a couple of interruptions to get their 400G uplink working.
- The good news is the 400G is in service and working well!
- OU_OSCER_ATLAS is generally stable with lots of opportunistic jobs.
- Some draining on 2/11/25.
- SWT2_CPB worked most of January to get their site up and running on AlmaLinux 9.
- Things stabilized on 2/3/25.
- CPB did not refill for one whole day last week after the Harvester issue was fixed.
- The cause of the slow refilling is under investigation.
- Procurement Planning
- We need to come up with a list of extra network gear on which to spend the $2-4M (split between the Tier-2 sites) by the end of February.
- Procurement plans will likely be due by the end of March now that the equipment funding levels are known.
- Operations Planning
- Now that we are past the EL9 updates (except MSU), we need to plan for what we do going forward.
- Clearly storage tokens will need to be supported at all sites.
- Some sites need to update to OSG24/Condor24.
- All sites have all public-facing servers dual-stacked and supporting IPv6, except the CE at OU (a quick dual-stack check sketch follows this list).
- AGLT2 and CPB still need to go to jumbo frames.
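As a hedged sketch of the kind of dual-stack sanity check involved: the snippet below simply asks whether a host publishes an AAAA record. The host name is a hypothetical placeholder, not the actual OU CE.

```python
# Sketch: check whether a service host resolves over IPv6 (has an AAAA record).
import socket

def has_ipv6(host: str) -> bool:
    try:
        return bool(socket.getaddrinfo(host, None, socket.AF_INET6))
    except socket.gaierror:
        return False

print(has_ipv6("gate01.example.edu"))  # hypothetical CE hostname
```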
1:40 PM → 1:50 PM
WBS 2.3.3 Heterogeneous Integration and Operations (HIOPS)
Convener: Rui Wang (Argonne National Laboratory (US))
1:40 PM HPC Operations 5m. Speaker: Rui Wang (Argonne National Laboratory (US))
1:45 PM Integration of Complex Workflows on Heterogeneous Resources 5m. Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
1:50 PM → 2:10 PM
WBS 2.3.4 Analysis Facilities. Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
1:55 PM Analysis Facilities - SLAC 5m. Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
2:00 PM Analysis Facilities - Chicago 5m. Speaker: Fengping Hu (University of Chicago (US))
- ServiceX updated to 1.5.6. It’s expected to be reliable, and Ben is confident that it’s ready for broader use.
- Added Dask-Gateway support to the AB image (currently in a branch). Since it requires JupyterHub for launching, we are preparing BinderHub as the launching platform.
- coffea-casa cull timeout adjusted from 1 hour to 1 day, to let users launch long-running computations from the terminal (a hedged config sketch follows this list).
- Maintenance is scheduled for late February or early March.
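As a hedged illustration of the cull-timeout change (coffea-casa is deployed via Helm, so the real knob is most likely the chart's culling value rather than this file): in a plain jupyterhub_config.py, the jupyterhub-idle-culler service would be configured roughly like this.

```python
# Sketch: jupyterhub_config.py fragment raising the idle-cull timeout
# from 1 hour (3600 s) to 1 day (86400 s) via jupyterhub-idle-culler.
import sys

c.JupyterHub.load_roles = [{
    "name": "jupyterhub-idle-culler-role",
    "scopes": ["list:users", "read:users:activity", "read:servers", "delete:servers"],
    "services": ["jupyterhub-idle-culler-service"],
}]
c.JupyterHub.services = [{
    "name": "jupyterhub-idle-culler-service",
    "command": [sys.executable, "-m", "jupyterhub_idle_culler",
                "--timeout=86400"],  # was 3600 (1 hour)
}]
```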
2:10 PM → 2:25 PM
WBS 2.3.5 Continuous Operations. Convener: Ofer Rind (Brookhaven National Laboratory)
2:10 PM ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- ADC Operations:
- 05.02.2025: One Harvester (out of two) had DB lock timeouts.
- 29.01.2025: Panda issue due to token issuer change (ATLASPANDA-1291)
- DDM Ops/US Ops: Fabio is back. His priorities were defined.
- GPUs: Need CUDA > 12.8 on all PQs. Expect Helpdesk tickets.
- SAM tests moved from python2@SL7 to python3@EL9.
- US Cloud Operations
- SWT2: Failed transfers due to ACT access problem. Ongoing.
- Ongoing jumbo-frame tests.
- USATLAS Helpdesk Tickets (Link)
2:15 PM Services DevOps 5m. Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches
- several issues I should look at.
- still have not debugged the gStream issue.
- VP
- working fine
- need to follow up on NET2 VP queue mails.
- Varnishes
- all working fine
- there was a discussion on a wholesale move from Squid to Varnish.
- now adding instances at NRP in NL and CZ to serve Frontier data (a hedged cache-check sketch follows this list).
- ServiceY
- retesting FAB server-side delivery.
- new datasets, new cluster
- ServiceX
- upgraded to 1.5.6
- new code gen images.
- AI
- the WFMS assistant now 'knows' most of the PanDA task table columns: wfms-assistant.af.atlas-ml.org
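On the Varnish point above, a hedged sketch of a quick cache check: send the same request twice through the proxy and watch the Age header, which should be nonzero on a cache hit. The proxy and backend URLs are hypothetical placeholders, not the actual NRP instances.

```python
# Sketch: verify a Varnish instance is caching Frontier-style HTTP requests.
import requests

proxies = {"http": "http://varnish-frontier.example.org:6081"}  # hypothetical proxy
url = "http://frontier-backend.example.org/Frontier"            # hypothetical backend

for attempt in (1, 2):
    r = requests.get(url, proxies=proxies, timeout=30)
    # A nonzero Age header on the second request suggests a cache hit.
    print(f"attempt {attempt}: status={r.status_code} Age={r.headers.get('Age')}")
```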
2:20 PM Facility R&D 5m. Speaker: Lincoln Bryant (University of Chicago (US))
The rp1 Ceph storage is bottlenecked on the WireGuard interface at IU. The equipment there is much older (R720?), and the CPU may not be fast enough to handle the encryption overhead. Two solutions were implemented (an MTU-probe sketch follows this list):
- increasing the k8s MTU from 1280 to 8780 increased iperf throughput from 1 Gbps to 4 Gbps.
- adding a non-WireGuard backhaul network for Ceph increased performance to 10 Gbps (line rate).
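A hedged sketch of how one might verify that a raised-MTU path actually passes large packets end to end, using ping with the don't-fragment bit (Linux iputils flags). The host name and MTU list are illustrative, not the actual rp1 nodes.

```python
# Sketch: probe whether a network path carries packets at a given MTU,
# by pinging with the don't-fragment bit set (Linux "ping -M do").
import subprocess

def path_supports_mtu(host: str, mtu: int) -> bool:
    payload = mtu - 28  # ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
    result = subprocess.run(
        ["ping", "-c", "3", "-M", "do", "-s", str(payload), host],
        capture_output=True,
    )
    return result.returncode == 0

for mtu in (1500, 8780, 9000):
    status = "ok" if path_supports_mtu("ceph-node.example.edu", mtu) else "blocked"
    print(mtu, status)  # ceph-node.example.edu is a hypothetical host
```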
Testing the feasibility of unprivileged WireGuard in a VM at UChicago: podman seems to let us create tunnel interfaces in containers without root privileges on current (EL9+) kernels. This might have interesting implications for jobs.
Ongoing re-testing of ServiceY on FAB. Fengping will present at the KNIT10 conference in March.
Flocking tests from the UChicago AF to MWT2 are ongoing, to be run at large scale during the upcoming MWT2 storage downtime.
2:25 PM → 2:35 PM
AOB 10m