US ATLAS Tier 2 Technical

Name: US ATLAS Tier 2 Technical
Start: 2026-03-04T11:00:00-05:00
End: 2026-03-04T12:00:00-05:00
Location: No location set

Wednesday 4 Mar 2026, 11:00 → 12:00 US/Eastern

Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))

Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Fred Luehring

luehring@iu.edu

+1 812 855 1025

- 11:00 → 11:10
  Introduction 10m
  
  Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  News:
  
  Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
  
  Discussions are ongoing about FY2026 and infrastructure funds. Please send paperwork related to personnel to SBU ASAP.
  
  Upcoming meetings:
  
  Unchanged from last meeting
  
  LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
  
  HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
  
  dCache workshop [May 6-7 at NIKHEF]
  
  CHEP 2026 [23-29 May in Bangkok, Thailand]
  
  Open tickets:
  
  Unchanged from last meeting
  
  ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
  
  ggus:3559 SWT2/OU: Dual-stack [on hold]
  
  ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
  
  Operations:
  
  Site production during the previous 2 weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
  
  TW
  
  AGLT2
  
  MWT2
  
  NET2
  
  SWT2/CPB
  
  SWT2/OU
- 11:10 → 11:20
  TW-FTT 10m
  
  Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
  ggus:1001382: Some sites are failing to fetch the CRL from our CRL web server, and there is an asymmetric route. There are no new updates.
  
  Scheduled downtime: the site will be shut down from 09:00 UTC on 6 Mar for routine power system maintenance at Academia Sinica. Site recovery will start on Monday or Tuesday morning (local time).
- 11:20 → 11:30
  
  AGLT2 10m
  
  Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
  
  Follow up on partially puzzling results from mini-data challenge:
  Reminder:
  Reading from AGLT2 to MWT2 showed expected saturation for UM->MWT2
  But lower and uneven throughput for MSU->MWT2 (unstable? packet loss?)
  Possible explanation: MSU storage is currently only ~1/3 of the AGLT2 total
  On read, a file has to come from the pool that holds it,
  regardless of the queue length for that pool
  
  Repeat test with more controlled site targeting:
  Hiro generated a list of (large) candidate files from AGLT2 datadisk
  We sorted that list into 3 lists: MSU/UM/both (both = cached copies)
  Test started reading only from MSU -> saturation near 100G, as expected
  Then only from UM -> saturation near 80G, as expected
  Then from both (with mixed list of files matching MSU/UM storage ratio) -> both sides saturated
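
The controlled site targeting described above can be sketched roughly as follows. This is only an illustration, not the scripts used for the test: the `catalog` mapping (file name to the set of sites holding a copy), the function names, and the sample size are all hypothetical. The MSU fraction of 1/3 comes from the storage ratio quoted above.

```python
import random

def split_by_site(catalog):
    """Sort candidate files into MSU-only, UM-only, and both (cached copies).

    `catalog` is a hypothetical dict mapping file name -> set of site names.
    """
    lists = {"MSU": [], "UM": [], "both": []}
    for name, sites in catalog.items():
        if sites == {"MSU", "UM"}:
            lists["both"].append(name)
        elif sites == {"MSU"}:
            lists["MSU"].append(name)
        elif sites == {"UM"}:
            lists["UM"].append(name)
    return lists

def mixed_list(msu_files, um_files, msu_fraction, total, seed=0):
    """Pick up to `total` files, about `msu_fraction` of them from MSU."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_msu = min(round(total * msu_fraction), len(msu_files))
    n_um = min(total - n_msu, len(um_files))
    picked = rng.sample(msu_files, n_msu) + rng.sample(um_files, n_um)
    rng.shuffle(picked)  # interleave the two sites in the read order
    return picked

# Toy catalog (made-up file names, for illustration only)
catalog = {
    "f1": {"MSU"}, "f2": {"UM"}, "f3": {"UM"}, "f4": {"MSU", "UM"},
    "f5": {"UM"}, "f6": {"MSU"}, "f7": {"UM"},
}
lists = split_by_site(catalog)
mix = mixed_list(lists["MSU"], lists["UM"], msu_fraction=1 / 3, total=3)
```

Files with copies at both sites are kept out of the mixed list here so that each read is served by a known site; whether the actual test did the same is not stated in these notes.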
  
  03-Mar dCache update 11.2.0 -> 11.2.1
  This version added SciTag/fireflies for https transfers
  No problem with update
  ESNet dashboard for SciTags
  
  There were problems with one data switch at the UM site: it went down twice in the past 2 days and had to be power cycled to bring it back. This caused 6 Tier2 worker nodes to lose their connections.
  
  An out-of-warranty UPS in the UM Tier3 room had a leaking battery, which affected some Tier2 services as well. We were able to find some old batteries as replacements and the alert is gone; we are still observing before placing an order for new batteries.
- 11:30 → 11:40
  MWT2 10m
  
  Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
  Working on procurement and getting quotes
  
  Holding off on scheduling the dCache update until the 11.2.1 WebDAV bug is fixed
- 11:40 → 11:50
  
  NET2 10m
  
  Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
  
  Slow transfers on one of the storage servers led to a couple of short blacklistings. We suspect a hardware issue, and the Harvard side is looking into it.
  
  Unscheduled tape downtime over the past few days. The first message stated it was due to an issue with the inventory database caused by a firmware update. The most recent message indicated that a replacement part was needed. Either way, the file-to-tape association will not be affected, so there won't be any data loss.
- 11:50 → 12:00
  SWT2 10m
  
  Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
  SWT2_CPB:
  
  We rebuilt six R740 storage servers from EL7 to EL9 while preserving data.
  
  After the rebuild, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
  
  The server has been returned to production, and we have not observed any issues so far.
  
  We are currently creating backups of additional R740 storage servers in preparation for migrating them from EL7 to EL9.
  
  So far, we have not observed any issues with transfers related to these rebuilds.
  
  We ran several tests to try to speed up the data copy, including removing empty directories.
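
As an illustration of the empty-directory cleanup mentioned above, here is a minimal sketch. It is not the procedure used at SWT2_CPB; the function name and the choice to leave the top-level directory in place are assumptions made for this example.

```python
import os

def remove_empty_dirs(root):
    """Delete empty directories below `root`, deepest first.

    Returns the list of removed paths. `root` itself is never removed.
    """
    removed = []
    # topdown=False visits children before their parents, so a directory
    # that only contained empty subdirectories is empty when we reach it.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        if not os.listdir(dirpath):  # re-check on disk after child removals
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Running this against a scratch copy first is advisable, since `os.rmdir` only removes directories that are truly empty but the removal is not reversible.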
  
  Following news that StoRM has resolved its issues with full-chain certificates, we are gradually replacing leaf-only certificates with full-chain certificates.
  
  We reverted one of the two servers that had leaf-only certificates back to a full-chain certificate.
  
  Currently three of four XRootD Proxy servers have full chain certificates.
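
A quick way to tell a leaf-only certificate file from a full-chain one is to count the PEM blocks it contains. The sketch below assumes the host certificate is stored as a PEM text file; the function names are made up for this example, and the check only counts certificates, it does not validate the chain.

```python
import re

# One PEM-encoded certificate, from its BEGIN marker to its END marker.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)

def count_pem_certificates(pem_text):
    """Return how many certificates a PEM bundle contains."""
    return len(PEM_CERT.findall(pem_text))

def chain_kind(pem_text):
    """Classify a PEM bundle by certificate count only (no validation)."""
    n = count_pem_certificates(pem_text)
    if n == 0:
        return "none"
    # Assumption: one certificate means leaf-only; more than one means the
    # file also carries intermediate certificates.
    return "leaf-only" if n == 1 else "full chain"
```

For example, `chain_kind(open(path).read())` could be run over each proxy server's certificate file, where `path` is wherever that site keeps its host certificate.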
  
  One R740xd2 server lost network connection on 2/27/2026 from 9:00 p.m. to 12:00 a.m. (2/28/2026 3:00 a.m. to 6:00 a.m. UTC).
  
  This may have caused some transfer issues at this time. We fixed it for now, but are still investigating this issue.
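
For cross-checking the outage window against transfer logs in UTC, the local-to-UTC conversion quoted above can be reproduced as below. The fixed UTC-6 offset is an assumption (US Central Standard Time, which applies in late February); the helper name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: site local time is US Central Standard Time (UTC-6).
CST = timezone(timedelta(hours=-6), "CST")

def to_utc(year, month, day, hour, minute=0, tz=CST):
    """Convert a local wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)

start = to_utc(2026, 2, 27, 21)  # 9:00 p.m. local on 2/27
end = to_utc(2026, 2, 28, 0)     # 12:00 a.m. local, i.e. start of 2/28

print(start.isoformat())  # 2026-02-28T03:00:00+00:00
print(end.isoformat())    # 2026-02-28T06:00:00+00:00
```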
  
  We have been experiencing jobs failing with a SIGTERM error.
  
  The failures include both long jobs that hit the three-day limit set in CRIC and shorter jobs.
  
  Logs are not accessible most of the time.
  
  We are thinking about changing the job limits at our site to see if it helps with making logs available.
  
  OU:
  
  Running well
  
  Working on migrating from the old XRootD storage to the new CephFS storage
