US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
11:00 → 11:10 Introduction (10m)
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
News:
- Tentative next procurement meeting is next Friday (2/27). An email to the Tier-2 managers has been sent. We will try to avoid clashing with the US ATLAS management meeting like last time.
- dCache version 11.2 only includes fireflies for XRootD transfers. dCache version 11.2.1 is supposed to have fireflies for all protocols; it also adds tape metadata capability.
- If your site is testing it and has results, please share them during the round table (and enter them in the minutes).
- Short mini-capacity challenge happening today between PRG-NET2 [notes]
- Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
Certificate issues:
- The transfer problems between SWT2 and IFIC/TECHNION were due to X.509 certificates being presented with the full chain.
- StoRM systems only accept leaf certificates, which we agreed is not the correct behavior.
- In practice, WLCG and/or ATLAS have no way to force sites to accept full-chain certificates.
- A workaround has been applied at SWT2 (see more details in Zach's minutes below) by using a leaf certificate on some servers. Apparently other sites do the same.
- This is not the correct procedure, and it will cause problems in the future with systems that only accept full-chain certificates (as was the case for Google, for instance).
- Horst will open a ticket with OSG to understand what kind of certificate is being used.
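A quick way to tell a leaf-only certificate file from a full chain is to count the PEM certificate blocks: a leaf-only file has one, a full chain has two or more (leaf plus intermediate CA(s)). A minimal sketch; the PEM contents below are placeholders, not real certificates:

```python
# Count PEM certificate blocks to distinguish a leaf-only file
# from a full-chain file (leaf + intermediate CA certificates).
def count_cert_blocks(pem_text: str) -> int:
    return pem_text.count("-----BEGIN CERTIFICATE-----")

# Placeholder contents standing in for a host certificate file:
leaf_only = "-----BEGIN CERTIFICATE-----\n...leaf...\n-----END CERTIFICATE-----\n"
full_chain = leaf_only + "-----BEGIN CERTIFICATE-----\n...intermediate CA...\n-----END CERTIFICATE-----\n"

print(count_cert_blocks(leaf_only))   # → 1
print(count_cert_blocks(full_chain))  # → 2
```

On a real host the same check is simply `grep -c 'BEGIN CERTIFICATE'` on the certificate file the server is configured to present.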
Upcoming meetings:
Just finished: ATLAS S&C meeting [Feb 9-13 at CERN]
Upcoming:
- LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
- HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
- dCache workshop [May 6-7 at Nikhef]
- CHEP 2026 [May 23-29 in Bangkok, Thailand]
Open tickets:
- ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
- ggus:3559 SWT2/OU: Dual-stack [on hold]
- ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
Operations:
- AGLT2
- MWT2
- NET2
- SWT2/CPB
- SWT2/OU
11:10 → 11:20 TW-FTT (10m)
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
11:20 → 11:30 AGLT2 (10m)
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- 05-Feb: Upgraded to dCache 11.2.0 (from 10.2.18)
  - First golden release to support flow marking
  - But bug/omission: we see marking for XRootD but not WebDAV transfers
  - 11.2.1 coming soon; will upgrade when available
- 05-Feb: repeated mini-challenge MWT2<>AGLT2 (plots)
  - MWT2->AGLT2 very smooth, with 90/95 Gbps to UM/MSU
  - AGLT2->MWT2 somewhat puzzling
    - good: verified the dCache config bug is fixed: doors are redirecting on read
    - good: reads from UM saturated at the 80G bottleneck
    - odd?: reads from MSU "choppy", 10-90G
  - Will re-test with separate reads from just MSU and just UM
- Noticed cvmfs file /cvmfs/sft.cern.ch/lcg/lastUpdate stopped updating after Friday 06-Feb
  - Found no issue on our side; same at MWT2 and on lxplus
  - Contacted Dave & Valentin
  - Quickly verified it was not a cvmfs issue, but just that file; started a JIRA ticket
  - CCed Andre, who recognized the problem came from an omission while moving gitlab repositories
  - For reference, some useful commands to debug repo updating issues:
    - To check health and updates of the local repo:
      cvmfs_config status sft.cern.ch
    - To check the current revision time stamp on stratum 1:
      curl http://cvmfs-stratum-one.cern.ch:8000/cvmfs/sft.cern.ch/.cvmfspublished -s -o - | grep -a ^T
    - And/or to check the current revision number:
      https://cvmfs-monitor-frontend.web.cern.ch/sft.cern.ch
- Requested correction for the January A/R ticket
  - Reminder: we fixed a condor config bug that kept test jobs idle when too many merge jobs were present
  - AGLT2 should now have more consistent/correct reporting
- After a successful 2-day test of removing IPv4 from LHCONE, planning on removing it permanently
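The stratum-1 freshness check via `curl` can also be scripted. A minimal sketch of parsing the `.cvmfspublished` manifest, assuming the usual one-letter-key-per-line layout (`N` name, `S` revision, `T` unix timestamp of last publish, entries ending at the `--` signature marker); the sample manifest below is made up:

```python
import datetime

def parse_cvmfspublished(text: str) -> dict:
    """Parse a CernVM-FS .cvmfspublished manifest: each line is a
    one-letter key followed by its value; fields end at the '--' marker."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("--"):   # signature section follows; stop here
            break
        if line:
            fields[line[0]] = line[1:]
    return fields

# Made-up sample manifest for illustration:
sample = "Nsft.cern.ch\nS12345\nT1770000000\n--\n(signature)"
m = parse_cvmfspublished(sample)
print(m["N"], m["S"])  # repository name and revision
print(datetime.datetime.fromtimestamp(int(m["T"]), datetime.timezone.utc))
```

Fed with the output of the `curl` command above, this gives the last-publish time and revision without grepping by hand.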
11:30 → 11:40 MWT2 (10m)
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Discussing the procurement plan: what to purchase, estimating retirements
- Fred to work on quotes. The UC team will be setting up a meeting with the Dell rep.
- Discussing an elasticsearch analytics refresh in the plan as well
- IU network team updated the VRF for Technion to fix asymmetric routes between IU and Technion. Monitoring.
- Reran the mini capacity challenge on Feb 5, after the Feb 3 network and job saturation
- Waiting on dCache 11.2.1 for fireflies for WebDAV
11:40 → 11:50 NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Some issues over the last two weeks related to tape. Last week the dCache pool facing the tape froze, leading to failures as transfers accumulated. The previous week a couple of pools were not delivering enough throughput; the resulting errors filled the dCache head node's disk with logs, causing it to crash, and returning to normal operations took some effort.
The IPv4-off test on LHCONE went without problems.
11:50 → 12:00 SWT2 (10m)
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- We rebuilt another R740 storage server from EL7 to EL9 while preserving the existing data.
- After the rebuild, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
- The server has been returned to production, and we have not observed any issues so far.
- We are currently creating backups of four R740 storage servers in preparation for migrating them from EL7 to EL9.
- On 2/12/2026, we rebuilt and upgraded one of our four XRootD proxy servers with improved hardware to increase performance.
- We have not observed any related transfer issues so far.
- We understand the cause of the transfer problems affecting IFIC-LCG2 and TECHNION-HEP and have closed GGUS-Ticket-ID: #1001633.
- We replaced the full-chain certificates on two of our four XRootD proxy servers with leaf-only certificates and have been monitoring.
OU:
- Site running well
- Some storage overload and high-memory jobs during the last week, but only sporadically.