US ATLAS Tier 2 Technical

Name: US ATLAS Tier 2 Technical
Start: 2026-03-18T11:00:00-04:00
End: 2026-03-18T12:00:00-04:00

Wednesday 18 Mar 2026, 11:00 → 12:00 US/Eastern

Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))

Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Fred Luehring

luehring@iu.edu

+1 812 855 1025

67453565657


- 11:00 → 11:10
  Introduction 10m
  
  Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  News:
  
  Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
  
  Very important IB meeting today at 1pm Eastern about the Genesis project. Please join!
  
  Upcoming meetings:
  
  Unchanged from last meeting
  
  LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
  
  HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
  
  dCache workshop [May 6-7 at NIKHEF]
  
  CHEP 2026 [23-29 May in Bangkok, Thailand]
  
  Open tickets:
  
  Unchanged for more than 1 month; please provide an update in the minutes.
  
  ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
  
  ggus:3559 SWT2/OU: Dual-stack [on hold]
  
  ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
  
  Operations:
  
  Site production during the previous 2 weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
  
  TW
  
  AGLT2
  
  MWT2
  
  NET2
  
  SWT2/CPB
  
  SWT2/OU
- 11:10 → 11:20
  TW-FTT 10m
  
  Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
  Since recovering from the shutdown on March 10, the site has been running smoothly.
  
  ggus:1001382: There are no new updates.
- 11:20 → 11:30
  
  AGLT2 10m
  
  Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
  
  IPv4 switched off for LHCONE on 3/10/2026 - no problems noticed so far
  
  Sunday 15-Mar: Our central management NFS server got stuck
  This indirectly froze all computing jobs.
  HC set AGLT2 offline around 4pm EDT
  We noticed just before 7pm and restarted that VM.
  HC set AGLT2 back online around 6am
  A/R critical 7h
  
  Fix afterwards: updated the kernel, increased the CPU/memory and the number of NFS server threads (8->256), and changed the client mount options to allow metadata caching for 60s
  
  SciTag/firefly
  Continuing testing and patching dCache 11.2.1
  Several bugs fixed
  Private 11.2.2 RC3 version with pull requests to developers
  Now shows on ESnet dashboard
  
  Comparison of SciTags dCache throughput and WLCG Site Network Monitoring throughput:
  
  You need to "invert" the WLCG site data and compare by converting between Gb/s and GB/s (a factor of 8).
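
A minimal sketch of the unit conversion mentioned above, assuming one source reports throughput in gigabits per second (Gb/s) and the other in gigabytes per second (GB/s). The function names and the sample value are hypothetical and do not come from either dashboard.

```python
# Hypothetical helpers for comparing throughput reported in different units.
# Which dashboard uses which unit is an assumption to be checked by the reader.

BITS_PER_BYTE = 8


def gbit_to_gbyte(rate_gbit_per_s: float) -> float:
    """Convert a rate in Gb/s (gigabits per second) to GB/s (gigabytes per second)."""
    return rate_gbit_per_s / BITS_PER_BYTE


def gbyte_to_gbit(rate_gbyte_per_s: float) -> float:
    """Convert a rate in GB/s (gigabytes per second) to Gb/s (gigabits per second)."""
    return rate_gbyte_per_s * BITS_PER_BYTE


if __name__ == "__main__":
    sample_gbit_per_s = 40.0  # illustrative value only, not measured data
    print(gbit_to_gbyte(sample_gbit_per_s))
```

The sketch covers only the factor of 8; the "inversion" of the WLCG site data is not modeled here.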
- 11:30 → 11:40
  MWT2 10m
  
  Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
  Continuing to work on procurement
  
  MWT2_TEST was running too many jobs for ~1 week
  
  Bug in MaxWorkers
  
  Took a couple of days to drain out the running jobs
- 11:40 → 11:50
  
  NET2 10m
  
  Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
  
  Blacklisted briefly due to a storage pool with a bad disk, now replaced.
  
  Numerous staging errors following the end of the tape downtime, due to lots of tape users trying to access the system at once.
- 11:50 → 12:00
  SWT2 10m
  
  Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
  SWT2_CPB:
  
  We rebuilt four R740 storage servers from EL7 to EL9 while preserving data.
  
  After the rebuilds, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
  
  The servers have been returned to production, and we have not observed any issues so far.
  
  We are currently creating backups of additional R740 storage servers for migrating them from EL7 to EL9.
  
  So far, we have not observed any issues with transfers related to these rebuilds.
  
  Logs are still unavailable for most jobs that fail with a SIGTERM error.
  
  We changed the default walltime limit from 48 to 78 hours on both of our Condor-CEs.
  
  We changed the KillWait value in our Slurm configuration from 5 to 10 minutes.
  
  A new second backup Varnish server, gk12.atlas-swt2.org, was added to our site’s proxy list in CRIC.
  
  Its priority was set to the second position in case our main Varnish server fails.
  
  We experienced roughly 156 stage-out errors and 133 stage-in errors on 3/17.
  
  We noticed three storage servers had very high load.
  
  We are still investigating this issue.
  
  Most of the errors appear to be associated with these storage servers.
  
  OU:
  
  Site running well
  
  Network monitoring: have contacted OneNet to get access to OFFN switch data to publish
  
  XRootD version / storage migration: about to create a 1 PB OURdisk partition, then will contact the Rucio team to start draining and migrating
  
  Dual stack: trying to get more OFFN IPv4 addresses, in order to avoid having to split IPv4/IPv6 between the OU and OFFN networks
  
  Still need to address the issue with SAM/ETF not taking scheduled maintenance into account in the A/R reports
