US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
11:00 → 11:10
Introduction (10m)
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
News:
- Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
- Very important IB meeting today at 1pm Eastern about the Genesis project. Please join!
Upcoming meetings:
Unchanged from last meeting
- LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
- HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
- dCache workshop [May 6-7 at NIKHEF]
- CHEP 2026 [May 23-29 in Bangkok, Thailand]
Open tickets:
Unchanged for more than 1 month; please provide an update in the minutes.
- ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
- ggus:3559 SWT2/OU: Dual-stack [on hold]
- ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
Operations:
- AGLT2
- MWT2
- NET2
- SWT2/CPB
- SWT2/OU
-
11:10 → 11:20
TW-FTT (10m)
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
- After the site recovered from the shutdown on March 10, the site is running smoothly.
- ggus:1001382: There are no new updates.
-
11:20 → 11:30
AGLT2 (10m)
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
IPv4 switched off for LHCONE on 3/10/2026 - no problems noticed so far
Sunday 15-Mar: our central management NFS server got stuck.
This indirectly froze all computing jobs.
HC set AGLT2 offline around 4pm EDT.
We noticed just before 7pm and restarted that VM.
HC set AGLT2 back online around 6am.
A/R critical 7h
Fix afterwards: updated the kernel, increased the CPU/memory, increased the NFS server threads (8->256), and changed the client mount options to allow metadata caching for 60s.
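A minimal sketch of how the two NFS changes could be checked after the fact. It assumes a Linux NFS server, where the `th` line of `/proc/net/rpc/nfsd` begins with the number of nfsd threads. The minutes do not say which client mount option was used for the 60s metadata cache; `actimeo=60` below is only an assumed example, and both sample strings are hypothetical rather than taken from the AGLT2 hosts.

```python
from typing import Dict, Optional


def nfsd_thread_count(stats_text: str) -> Optional[int]:
    """Return the nfsd thread count from /proc/net/rpc/nfsd-style text."""
    for line in stats_text.splitlines():
        fields = line.split()
        # The "th" line starts with the number of nfsd threads.
        if len(fields) >= 2 and fields[0] == "th":
            return int(fields[1])
    return None


def mount_options(options_field: str) -> Dict[str, str]:
    """Split a comma-separated mount options field into a dict."""
    parsed = {}
    for item in options_field.split(","):
        key, _, value = item.partition("=")
        parsed[key] = value  # flags without "=" map to an empty string
    return parsed


# Hypothetical sample inputs for illustration only.
sample_stats = "rc 0 0 0\nth 256 0 0.000 0.000\n"
sample_options = "rw,vers=4.2,actimeo=60"  # actimeo is an assumed option

print(nfsd_thread_count(sample_stats))
print(mount_options(sample_options).get("actimeo"))
```

On a real server the first function would be fed the contents of `/proc/net/rpc/nfsd`, and the second the options column of the relevant line in `/proc/mounts` on a client.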
SciTag/firefly
Continuing testing and patching dCache 11.2.1
Several bugs fixed
Private 11.2.2 RC3 version with pull requests to developers
Now shows on ESnet dashboard
Comparison of SciTags dCache throughput and WLCG Site Network Monitoring throughput:


You need to "invert" the WLCG site data and compare by converting between Gb/s and GB/s (a factor of 8).
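The unit conversion mentioned above can be sketched as follows. This covers only the factor of 8 between Gb/s and GB/s; the "invert" step is not defined in the minutes and is left out. The function names and the sample rates are hypothetical, not measurements from either dashboard.

```python
BITS_PER_BYTE = 8


def gbit_to_gbyte(rate_gbit_s: float) -> float:
    """Convert a rate in Gb/s (gigabits) to GB/s (gigabytes)."""
    return rate_gbit_s / BITS_PER_BYTE


def gbyte_to_gbit(rate_gbyte_s: float) -> float:
    """Convert a rate in GB/s (gigabytes) to Gb/s (gigabits)."""
    return rate_gbyte_s * BITS_PER_BYTE


def relative_difference(a: float, b: float) -> float:
    """Relative difference between two rates expressed in the same unit."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


# Hypothetical example: one source reports Gb/s, the other GB/s.
network_gbit_s = 40.0
storage_gbyte_s = 4.0
print(relative_difference(gbit_to_gbyte(network_gbit_s), storage_gbyte_s))
```

Converting both series to the same unit before comparing avoids reading an apparent factor-of-8 gap as a real discrepancy.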
-
11:30 → 11:40
MWT2 (10m)
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
11:40 → 11:50
NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
-
11:50 → 12:00
SWT2 (10m)
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- We rebuilt four R740 storage servers from EL7 to EL9 while preserving data.
- After the rebuilds, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
- The servers have been returned to production, and we have not observed any issues so far.
- We are currently backing up additional R740 storage servers before migrating them from EL7 to EL9.
- So far, we have not observed any transfer issues related to these rebuilds.
- Logs are still unavailable for most jobs that fail with a SIGTERM error.
- We changed the default walltime limit from 48 to 78 hours on both of our Condor-CEs.
- We changed the KillWait value in our Slurm configuration from 5 to 10 minutes.
- A second, backup Varnish server, gk12.atlas-swt2.org, was added to our site’s proxy list in CRIC.
- Its priority was set to the second position in case our main Varnish server fails.
- We experienced roughly 156 stage-out errors and 133 stage-in errors on 3/17.
- We noticed three storage servers had very high load.
- We are still investigating this issue.
- The errors seem to be mostly associated with these storage servers.
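For reference on the KillWait change in the list above: `slurm.conf` expresses `KillWait` in seconds, so a move from 5 to 10 minutes corresponds to 300 and 600. A minimal sketch of that conversion follows; the helper name is hypothetical, and the rendered line is an illustration rather than the site's actual configuration.

```python
SECONDS_PER_MINUTE = 60


def killwait_setting(minutes: int) -> str:
    """Render a slurm.conf KillWait line for a delay given in minutes."""
    if minutes < 0:
        raise ValueError("KillWait cannot be negative")
    # slurm.conf takes KillWait as a whole number of seconds.
    return f"KillWait={minutes * SECONDS_PER_MINUTE}"


print(killwait_setting(5))   # previous value reported in the minutes
print(killwait_setting(10))  # new value reported in the minutes
```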
OU:
- Site running well
- Network monitoring: have contacted OneNet to get access to OFFN switch data to publish
- XRootD version / storage migration: about to create a 1 PB OURdisk partition, then will contact the Rucio team to start draining and migrating
- Dual stack: trying to get more OFFN IPv4 addresses, in order to avoid having to split IPv4/IPv6 between the OU and OFFN networks
- Still need to address the issue with SAM/ETF not taking scheduled maintenance into account in the A/R reports