US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      News:

      • Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
      • We need to schedule a procurement meeting
        • Tentative date: Friday, May 15th, 10 am Eastern. We expect the Tier-2 managers to attend; please let Fred and Rafael know if this time doesn't work for you.
        • Others are invited, if you are interested.
        • Initial assumption for discussion on Friday: $10/HS and $80/TB.
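As a rough aid for the procurement discussion, the sketch below shows how the two assumed unit prices translate into capacity. The unit prices are the initial assumptions quoted above; the budget figure, the CPU/storage split, and the function name `purchasable_capacity` are hypothetical and chosen only for illustration.

```python
# Illustrative only: unit prices are the initial assumptions from the minutes;
# the budget and the CPU/storage split used below are hypothetical.
PRICE_PER_HS = 10.0  # USD per HS (benchmark unit)
PRICE_PER_TB = 80.0  # USD per TB of storage


def purchasable_capacity(budget_usd: float, cpu_fraction: float) -> tuple[float, float]:
    """Return (HS, TB) purchasable when cpu_fraction of the budget goes to CPU."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    cpu_budget = budget_usd * cpu_fraction
    storage_budget = budget_usd - cpu_budget
    return cpu_budget / PRICE_PER_HS, storage_budget / PRICE_PER_TB


if __name__ == "__main__":
    # Hypothetical 100k USD budget split evenly between CPU and storage.
    hs, tb = purchasable_capacity(100_000, 0.5)
    print(f"{hs:.0f} HS, {tb:.0f} TB")
```

Changing `cpu_fraction` gives a quick view of how a different CPU/storage balance shifts the purchasable capacity under the same price assumptions.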

      Upcoming meetings:

      • CHEP 2026 [May 23rd-29th in Bangkok, Thailand]
      • HTC2026 [June 9th-12th in Madison, Wisconsin] - US ATLAS face-to-face on Tuesday and Wednesday (June 9th and 10th).
      • ATLAS S&C week #84 [June 29th - July 3rd, CERN]

      Open tickets:

      • ggus:1002244 NET2: RSE basepath prefix (on-hold until the next downtime in June)
      • ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
      • ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue

      Operations:

      • AGLT2

      • MWT2

      • NET2

      • SWT2/CPB

      • SWT2/OU

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • After the site recovered from the downtime on May 1, the site is running smoothly.
      • Mitigations for both CopyFail and DirtyFrag have been applied.
      • ggus:1001382: DDM Ops helped remove distances for the multi-hop transfer configuration (ATLDDMOPS-5836).
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
      • GGUS-1002723 deletion failures on S3 storage
        • Resolved: the xrootd-s3gw VM had problems during the VMware cluster migration.
      • Applied mitigations for both the copy.fail and dirtyfrag exploits.
      • 5-hour downtime on 4/30:
        • Reconfigured dCache to use a unified base path for all accesses, via the default variable dcache.root.
        • Applied firmware and software updates on all nodes, then rebooted.
        • Smooth ramp-up after the downtime ended. A short window of low transfer efficiency, caused by a ZooKeeper cluster issue, was spotted and fixed within 30 minutes.
      • 05/05: updated dCache from 11.2.3 to 11.2.4; the update went smoothly.
      • Remaining ~186 TB of dark data on datadisk.
        • We think some or most of it is understood.
        • Some of it must be left over from the late-2024 dCache database problem: some files are pinned on disk but have no pnfs entry.
        • We will try to find those through database queries.
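The search for files that exist on disk but have no namespace entry amounts to a set difference. The sketch below illustrates that logic only; it assumes two hypothetical lists of file identifiers have already been exported (one from the disk pools, one from the namespace) and does not reflect the actual dCache database schema or the queries the site will run.

```python
def find_orphans(on_disk_ids, namespace_ids):
    """Return identifiers present on disk but absent from the namespace, sorted.

    Both arguments are hypothetical exports: any iterables of hashable IDs.
    """
    return sorted(set(on_disk_ids) - set(namespace_ids))


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    on_disk = ["id-001", "id-002", "id-003"]
    in_namespace = ["id-001", "id-003"]
    print(find_orphans(on_disk, in_namespace))
```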
    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • dCache upgraded to 11.2.4
        • XRootD path updated to /. Resolved https://helpdesk.ggus.eu/#ticket/zoom/1002243
      • Patched for both CopyFail and DirtyFrag 
        • Entire site rebooted for CopyFail kernel mitigation
        • GGUS ticket from when dCache was rebooting: https://helpdesk.ggus.eu/#ticket/zoom/1002545
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Occasional transfer failures occur because FTS transfer requests arrive too long after files are staged from tape.  Eduardo and Fabio are consulting on a workaround; the long-term solution is FTS4.  A ticket for this issue is open.

      The Copy.fail exploit is less urgent for us because jobs run in pods, and the exploit doesn't work in the pod environment, probably because the modules in question are not exposed in the pods by default.  We still plan to apply the fix soon, but this will require an OKD upgrade.

      Mini-mini-data challenge between us and Prague happening today (right now, in fact).  The problem of trans-Atlantic transfers being capped at 100 G should be fixed, so we will try to reach 400 G.

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • Migration of one of the last two ME4084 storage servers still running EL7 is complete. Once checks are done, we will work on migrating data from the last EL7 storage server.

      • We rebuilt our last R740 storage server to EL9. 

      • We created a second XRootD redirector, running EL9, in the production cluster, but the service is not running yet. We are still performing other tests before fully using this server.

      • Applied the copy.fail2/dirtyfrag fix to all nodes in the production and test clusters.

      • Survived a short campus-wide power failure (about a minute) and a slightly longer (~2 hours) chilled-water outage yesterday without any SWT2 issues. The new alerting system worked well.

       

      OU:

      • Had a brief internal network glitch early this morning; resolved.
      • Applied mitigations for both CVEs.
      • Dual-stack ticket resolved; all nodes now have both IPv4 and IPv6.
      • Still working on the new storage; there are issues re-mounting the CephFS file system with the correct size.
      • Making slow progress with site network monitoring: we have identified the OneNet OFFN switch to monitor and requested access to its SNMP information, and will re-purpose slate01 as the host to collect that information.
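Once SNMP access is available, link throughput is typically derived from two successive readings of an interface octet counter (for example the standard IF-MIB `ifHCInOctets`, a 64-bit counter). The sketch below shows only that arithmetic, including counter wrap-around; it assumes the readings have already been collected by some poller, and the function name `rate_bps` and the sample values are hypothetical.

```python
def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             counter_bits: int = 64) -> float:
    """Average bit rate between two octet-counter readings.

    The modulo handles a single wrap of the counter between readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


if __name__ == "__main__":
    # Hypothetical readings taken 10 seconds apart.
    print(rate_bps(1_000, 2_000, 10))
```

For devices that expose only 32-bit counters, passing `counter_bits=32` applies the same wrap handling, although fast links can wrap such counters more than once per polling interval.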