US ATLAS Tier 2 Technical

Name: US ATLAS Tier 2 Technical
Start: 2026-05-13T11:00:00-04:00
End: 2026-05-13T12:00:00-04:00
Location: No location set

Wednesday 13 May 2026, 11:00 → 12:00 US/Eastern

Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))

Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Fred Luehring

luehring@iu.edu

+1 812 855 1025

67453565657

- 11:00 → 11:10
  Introduction 10m
  
  Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  News:
  
  Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
  
  We need to schedule a procurement meeting.
  
  Tentative date: Friday, May 15th, 10 am Eastern. We expect the Tier-2 managers to attend; please let Fred and Rafael know if this time doesn't work for you.
  
  Others are welcome to attend if interested.
  
  Initial assumption for discussion on Friday: \$10/HS and \$80/TB.
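As a rough aid for the procurement discussion, the planning rates quoted above can be turned into a quick calculation. This is only a sketch: the function names, the example budget of $100,000, and the 50/50 split are illustrative assumptions and not figures from the meeting; only the $10/HS and $80/TB rates come from the notes.

```python
# Planning rates quoted in the notes as the initial assumption for discussion.
COST_PER_HS_USD = 10.0  # dollars per HS unit of CPU capacity
COST_PER_TB_USD = 80.0  # dollars per TB of storage


def purchase_cost(cpu_hs: float, storage_tb: float) -> float:
    """Total cost in USD of buying the given CPU and storage capacity."""
    return cpu_hs * COST_PER_HS_USD + storage_tb * COST_PER_TB_USD


def split_budget(budget_usd: float, cpu_fraction: float) -> tuple:
    """Return (HS, TB) purchasable when cpu_fraction of the budget goes to CPU."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    cpu_hs = budget_usd * cpu_fraction / COST_PER_HS_USD
    storage_tb = budget_usd * (1.0 - cpu_fraction) / COST_PER_TB_USD
    return cpu_hs, storage_tb


if __name__ == "__main__":
    # Hypothetical budget and split, chosen only to illustrate the arithmetic.
    hs, tb = split_budget(100_000, 0.5)
    print(f"CPU: {hs:.0f} HS, storage: {tb:.0f} TB")
```

Changing the two constants is enough to rerun the numbers if the assumed rates are revised at the meeting.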
  
  Upcoming meetings:
  
  CHEP 2026 [May 23rd-29th in Bangkok, Thailand]
  
  HTC2026 [June 9th-12th in Madison, Wisconsin] - US ATLAS face-to-face on Tuesday and Wednesday (June 9th and 10th).
  
  ATLAS S&C week #84 [June 29th - July 3rd, CERN]
  
  Open tickets:
  
  ggus:1002244 NET2: RSE basepath prefix (on-hold until the next downtime in June)
  
  ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
  
  ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
  
  Operations:
  
  Site production during the previous 2 weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
  
  TW
  
  AGLT2
  
  MWT2
  
  NET2
  
  SWT2/CPB
  
  SWT2/OU
- 11:10 → 11:20
  TW-FTT 10m
  
  Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
  The site has been running smoothly since it recovered from the downtime on May 1.
  
  Mitigations for both CopyFail and DirtyFrag have been applied.
  
  ggus:1001382: DDM Ops helped remove distances for the multi-hop transfer configuration (ATLDDMOPS-5836).
- 11:20 → 11:30
  AGLT2 10m
  
  Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
  GGUS-1002723 (deletion failures on S3 storage): resolved. The xrootd-s3gw VM had problems during the VMware cluster migration.
  
  Applied mitigations for both the copy.fail and dirtyfrag exploits.
  
  5h downtime on 4/30
  
  Reconfigured dcache to use a unified base path for any/all accesses, via default variable dcache.root.
  
  Applied firmware and software updates on all nodes and rebooted them.
  
  Smooth ramp-up after the downtime ended. There was a short window of low transfer efficiency due to a zookeeper cluster issue, which was spotted and fixed within 30 minutes.
  
  05/05: updated dcache from 11.2.3 to 11.2.4; the update went smoothly.
  
  About 186 TB of dark data remains on datadisk.
  We think some or most of it is understood.
  Some of it must be left over from the late-2024 dcache database problem:
  some files are pinned on disk but have no pnfs entry.
  We will try to find those through database queries.
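One way to look for files that are pinned on disk but have no pnfs entry is a set difference between the IDs found on the pools and the IDs known to the namespace. The sketch below assumes both lists have already been exported, for example by the database queries mentioned above. The function names, sample IDs, and sizes are hypothetical, and no dCache API is used.

```python
from typing import Dict, Iterable, Set


def find_orphans(pool_ids: Iterable[str], namespace_ids: Iterable[str]) -> Set[str]:
    """IDs present on the pools that have no matching namespace entry."""
    return set(pool_ids) - set(namespace_ids)


def total_size_tb(ids: Iterable[str], size_bytes: Dict[str, int]) -> float:
    """Sum the sizes of the given IDs in TB (10**12 bytes); unknown IDs count as 0."""
    return sum(size_bytes.get(file_id, 0) for file_id in ids) / 1e12


if __name__ == "__main__":
    # Made-up sample data standing in for the exported ID lists.
    on_pools = ["0000A", "0000B", "0000C"]
    in_namespace = ["0000A", "0000C"]
    sizes = {"0000A": 2 * 10**12, "0000B": 3 * 10**12, "0000C": 10**12}

    orphans = find_orphans(on_pools, in_namespace)
    print(sorted(orphans), total_size_tb(orphans, sizes))
```

Summing the sizes of the orphaned IDs gives a quick check of how much of the ~186 TB this category could explain.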
- 11:30 → 11:40
  MWT2 10m
  
  Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
  dCache upgraded to 11.2.4
  
  XRootD path updated to /. This resolved https://helpdesk.ggus.eu/#ticket/zoom/1002243
  
  Patched for both CopyFail and DirtyFrag
  
  Entire site rebooted for CopyFail kernel mitigation
  
  GGUS ticket from when dCache was rebooting: https://helpdesk.ggus.eu/#ticket/zoom/1002545
- 11:40 → 11:50
  NET2 10m
  
  Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
  
  Transfers occasionally fail because FTS transfer requests arrive too long after the files are staged out from tape. Eduardo and Fabio are consulting on a workaround; the long-term solution is FTS4. A ticket is open for this issue.
  
  The Copy.fail exploit is less urgent for us because jobs run in pods and the exploit doesn't work in the pod environment, probably because the modules in question are not exposed in the pods by default. We still plan to fix it soon, but this will require an OKD upgrade.
  
  A mini-mini data challenge between us and Prague is happening today (right now, in fact). The problem of trans-Atlantic transfers being capped at 100 Gb/s should be fixed, so we will try to reach 400 Gb/s.
- 11:50 → 12:00
  SWT2 10m
  
  Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
  SWT2_CPB:
  
  Migration of one of the last two EL7 ME4084 storage servers is complete. Once checks are done, we will work on migrating data from the last EL7 storage server.
  
  We rebuilt our last R740 storage server to EL9.
  
  We created a second XRootD redirector, running EL9, in the production cluster, but the service is not running yet. We are still performing other tests before putting this server into full use.
  
  Applied the copy.fail2/dirtyfrag fix to all nodes in the production and test clusters.
  
  Yesterday we survived a short campus-wide power failure (~1 minute) and a slightly longer (~2 hours) chilled water outage without any SWT2 issues. The new alerting system worked well.
  
  OU:
  
  Had brief internal network glitch early this morning; resolved.
  
  Applied mitigation to both CVEs.
  
  Dual-stack ticket resolved; all nodes now have both IPv4 and IPv6.
  
  Still working on the new storage; there are issues re-mounting the cephfs file system with the correct size.
  
  Making slow progress with site network monitoring: we have identified the OneNet OFFN switch to monitor and requested access to its SNMP information, and will repurpose slate01 as the host that collects it.
