US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
11:00 → 11:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
News:
- Keep the capacity and services spreadsheets updated, and keep CRIC and OSG Topology updated when servers are added or retired.
- Discussions are ongoing about FY2026 and infrastructure funds. Please send paperwork related to personnel to SBU ASAP.
Upcoming meetings:
Unchanged from last meeting
- LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
- HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
- dCache workshop [May 6-7 at NIKHEF]
- CHEP 2026 [May 23-29 in Bangkok, Thailand]
Open tickets:
Unchanged from last meeting
- ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
- ggus:3559 SWT2/OU: Dual-stack [on hold]
- ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
Operations:
- AGLT2
- MWT2
- NET2
- SWT2/CPB
- SWT2/OU
11:10 → 11:20
TW-FTT 10m
Speakers: Eric Yen, Felix Hung-Te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
- ggus:1001382: Some sites are failing to fetch the CRL from our CRL web server, and there is an asymmetric route. There are no new updates.
- Scheduled downtime: the site will be shut down from 09:00 UTC on 6 Mar for routine power system maintenance at Academia Sinica. Site recovery will start on Monday or Tuesday morning (local time).
11:20 → 11:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn McKee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
Follow-up on partially puzzling results from the mini data challenge:
Reminder:
- Reading from AGLT2 to MWT2 showed the expected saturation for UM -> MWT2
- But lower and uneven throughput for MSU -> MWT2 (unstable? packet loss?)
- Possible explanation: MSU storage is currently only ~1/3 of the AGLT2 total
- On a read, files must come from a pool holding the requested file, independently of the queue length for that pool
Repeat test with more controlled site targeting:
- Hiro generated a list of (large) candidate files from AGLT2 datadisk
- We sorted that list into 3 lists: MSU / UM / both (both = cached copies; see the sorting sketch below)
- Test started reading only from MSU -> saturation near 100G, as expected
- Then only from UM -> saturation near 80G, as expected
- Then from both (with a mixed list of files matching the MSU/UM storage ratio) -> both sides saturated
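For the record, a minimal sketch of the list-sorting step, assuming a replica dump where each line is a file path followed by its comma-separated pool list, and that pool names encode the hosting site. The input file name and the pool-naming convention are hypothetical.

```python
# Sort a replica dump into MSU-only / UM-only / both lists for the repeat test.
# Input format assumed: "<path> <pool1>,<pool2>,..." per line (hypothetical).

def site_of(pool: str) -> str:
    """Map a dCache pool name to a site label (hypothetical naming convention)."""
    return "MSU" if pool.startswith("msu") else "UM"

msu_only, um_only, both = [], [], []

with open("aglt2_datadisk_candidates.txt") as dump:  # hypothetical file name
    for line in dump:
        path, pools = line.split()
        sites = {site_of(p) for p in pools.split(",")}
        if sites == {"MSU"}:
            msu_only.append(path)
        elif sites == {"UM"}:
            um_only.append(path)
        else:  # replicas (cached copies) at both sites
            both.append(path)

for name, files in [("msu.txt", msu_only), ("um.txt", um_only), ("both.txt", both)]:
    with open(name, "w") as out:
        out.write("\n".join(files) + "\n")
```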
03-Mar: dCache updated from 11.2.0 to 11.2.1
- This version adds SciTags/fireflies for HTTPS transfers (a rough sketch of a firefly packet follows below)
- No problems with the update
- ESnet dashboard for SciTags
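For context on what dCache now emits: a SciTags firefly is a small UDP packet describing a transfer flow. The sketch below is purely illustrative; the JSON field names, the example experiment/activity IDs, and the collector port reflect our reading of the scitags.org spec (the real packet also carries a syslog-style header), so check the spec before relying on any of them.

```python
# Illustrative SciTags "firefly" UDP packet. dCache 11.2.1 emits these itself
# for HTTPS transfers; this sketch only shows roughly what travels on the wire.
# Field names, IDs, and port are assumptions to be checked against scitags.org.
import json
import socket
from datetime import datetime, timezone

firefly = {
    "version": 1,
    "flow-lifecycle": {
        "state": "start",
        "current-time": datetime.now(timezone.utc).isoformat(),
    },
    "flow-id": {
        "afi": "ipv4", "protocol": "tcp",
        "src-ip": "192.0.2.10", "src-port": 2811,     # example addresses
        "dst-ip": "198.51.100.20", "dst-port": 443,
    },
    "context": {"experiment-id": 2, "activity-id": 1},  # placeholder IDs
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# 10514 is the collector port we believe the spec names; confirm before use.
sock.sendto(json.dumps(firefly).encode(), ("127.0.0.1", 10514))
```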
- Problems with one data switch at the UM site: it went down twice in the past 2 days and had to be power cycled to bring it back. This caused 6 Tier 2 worker nodes to lose their connections.
- An out-of-warranty UPS in the UM Tier 3 room had a leaking battery, which affected some Tier 2 services as well. We were able to find some old batteries to replace it, and the alert is gone; we are still observing before placing an order for new batteries.
11:30 → 11:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
11:40 → 11:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Slow transfers on one of the storage servers led to a couple of short blacklistings. We suspect a hardware issue, and the Harvard side is looking into it.
Unscheduled tape downtime over the past few days. The first message stated it was due to an issue with the inventory database caused by a firmware update; the most recent message indicated that a replacement part is needed. Either way, the file-to-tape association will not be affected, so there won't be any data loss.
11:50 → 12:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- We rebuilt six R740 storage servers from EL7 to EL9 while preserving data.
- After the rebuild, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
- The servers have been returned to production, and we have not observed any issues so far.
- We are currently creating backups of additional R740 storage servers in preparation for migrating them from EL7 to EL9.
- So far, we have not observed any transfer issues related to these rebuilds.
- We performed various tests to speed up the data-copying process, including removing empty directories (see the sketch below).
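As a concrete illustration of the empty-directory cleanup, a minimal sketch so the data copy does not waste time traversing empty trees. The mount point is hypothetical, and the dry-run default is deliberate.

```python
# Bottom-up walk that counts (and optionally removes) empty directories.
# Removing children first lets parents that held only empty subdirectories
# also be detected as empty in the same pass.
import os

def prune_empty_dirs(root: str, remove: bool = False) -> int:
    count = 0
    for dirpath, _, _ in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            count += 1
            if remove:
                os.rmdir(dirpath)
    return count

# Dry run first to see how many directories would be removed.
print(prune_empty_dirs("/xrd/atlasdatadisk", remove=False))  # hypothetical path
```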
- Following news that StoRM has resolved its issues with full-chain certificates, we are gradually replacing leaf-only certificates with full-chain certificates.
- We reverted one of the two servers that had leaf-only certificates back to a full-chain certificate.
- Currently three of the four XRootD proxy servers have full-chain certificates (a sketch of assembling a full-chain certificate follows below).
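A hedged sketch of the certificate work, using the `cryptography` Python package and hypothetical file names: concatenate the leaf certificate and its issuing CA (leaf first) into a full-chain PEM, after a basic issuer check.

```python
# Build a full-chain PEM (leaf first, then the issuing CA) and sanity-check
# that the intermediate really is the leaf's issuer before installing the
# result on an XRootD proxy. File names are hypothetical examples.
from cryptography import x509
from cryptography.hazmat.primitives import serialization

leaf = x509.load_pem_x509_certificate(open("hostcert.pem", "rb").read())
intermediate = x509.load_pem_x509_certificate(open("issuing-ca.pem", "rb").read())

# Basic sanity check: the leaf's issuer must match the intermediate's subject.
assert leaf.issuer == intermediate.subject, "chain order/issuer mismatch"

with open("hostcert-chain.pem", "wb") as out:
    for cert in (leaf, intermediate):
        out.write(cert.public_bytes(encoding=serialization.Encoding.PEM))
```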
- One R740xd2 server lost network connectivity on 2/27/2026 from 9:00 p.m. to 12:00 a.m. (2/28/2026 3:00 a.m. to 6:00 a.m. UTC).
- This may have caused some transfer issues during that window. We have fixed it for now but are still investigating.
- We have been experiencing failed jobs with the SIGTERM error.
- These include long jobs that hit the three-day walltime limit set in CRIC as well as shorter jobs (a sketch for separating the two follows below).
- The logs are not accessible most of the time.
- We are thinking about changing the job limits at our site to see if that helps make the logs available.
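To separate walltime kills from other SIGTERM failures, a small sketch; the job records and field names are hypothetical stand-ins for whatever the monitoring export provides, and the three-day limit is the one set in CRIC.

```python
# Split SIGTERM failures into "hit the 3-day CRIC walltime limit" vs
# "killed early for another reason". Job records are hypothetical stand-ins.
MAXTIME = 3 * 24 * 3600  # three-day limit set in CRIC, in seconds
MARGIN = 600             # treat jobs within 10 min of the limit as walltime kills

failed_jobs = [
    {"pandaid": 4567000001, "walltime": 259180},  # example records only
    {"pandaid": 4567000002, "walltime": 81543},
]

walltime_kills = [j for j in failed_jobs if j["walltime"] >= MAXTIME - MARGIN]
other_kills = [j for j in failed_jobs if j["walltime"] < MAXTIME - MARGIN]

print(f"{len(walltime_kills)} at the limit, {len(other_kills)} killed early")
```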
OU:
- Running well
- Working on migrating from the old xrootd storage to the new cephfs storage (see the verified-copy sketch below)
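A minimal sketch of the kind of verified copy such a migration involves, assuming both the old xrootd namespace and the new cephfs volume are POSIX-mounted; the mount points are hypothetical, and Adler-32 is used since that is the checksum ATLAS storage typically tracks.

```python
# Checksum-verified copy from the old xrootd storage to the new cephfs
# storage, assuming both are POSIX-mounted (mount points are hypothetical).
import shutil
import zlib
from pathlib import Path

SRC = Path("/xrd/atlas")     # hypothetical old xrootd mount
DST = Path("/cephfs/atlas")  # hypothetical new cephfs mount

def adler32(path: Path, chunk: int = 1 << 20) -> int:
    """Compute the Adler-32 checksum of a file in 1 MiB chunks."""
    value = 1  # Adler-32 seed
    with open(path, "rb") as f:
        while data := f.read(chunk):
            value = zlib.adler32(data, value)
    return value

for src in SRC.rglob("*"):
    if not src.is_file():
        continue
    dst = DST / src.relative_to(SRC)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    if adler32(src) != adler32(dst):
        raise RuntimeError(f"checksum mismatch after copy: {src}")
```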