US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      News:

      • Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
      • Discussions are ongoing about FY2026 and infrastructure funds. Please send paperwork related to personnel to SBU ASAP.

      Upcoming meetings:

      Unchanged from last meeting

      Open tickets:

      Unchanged from last meeting

      • ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
      • ggus:3559 SWT2/OU: Dual-stack [on hold]
      • ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue

      Operations:

      • AGLT2

      • MWT2

      • NET2

      • SWT2/CPB

      • SWT2/OU

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • ggus:1001382: Some sites are failing to fetch the CRL from our CRL web server, and there is an asymmetric route issue. There are no new updates.
      • Scheduled downtime: the site will be shut down from 9:00 UTC on 6 Mar. because of routine power system maintenance at Academia Sinica. Site recovery will start on Monday or Tuesday morning (local time).
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Follow up on partially puzzling results from mini-data challenge:
        Reminder:
          Reading from AGLT2 to MWT2 showed expected saturation for UM->MWT2
          But lower and uneven throughput for MSU->MWT2 (unstable? packet loss?) 
          Possible explanation: MSU storage currently only ~1/3 of AGLT2 total
            On read, files have to come from the pool that holds the requested file,
            regardless of the queue length for that pool


        Repeat test with more controlled site targeting:
          Hiro generated a list of (large) candidate files from AGLT2 datadisk
          We sorted that list into 3 lists: MSU/UM/both (both = cached copies)
          Test started reading only from MSU -> saturation near 100G, as expected
          Then only from UM -> saturation near 80G, as expected
          Then from both (with mixed list of files matching MSU/UM storage ratio) -> both sides saturated 
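
      The list preparation for the repeat test can be sketched roughly as follows. This is only an illustrative sketch, not the script used at AGLT2: the input format (a mapping from file name to the set of sites holding a copy), the function names, and the fixed random seed are all assumptions. The MSU fraction is a parameter; the notes above give it as roughly 1/3.

```python
import random


def split_by_site(file_sites):
    """Partition files into MSU-only, UM-only, and both (cached copies) lists.

    file_sites is a hypothetical mapping: file name -> set of site labels.
    Files with any other combination of labels are skipped.
    """
    msu, um, both = [], [], []
    for name, sites in sorted(file_sites.items()):
        if sites == {"MSU"}:
            msu.append(name)
        elif sites == {"UM"}:
            um.append(name)
        elif sites == {"MSU", "UM"}:
            both.append(name)
    return msu, um, both


def mixed_list(msu, um, total, msu_fraction, seed=0):
    """Draw up to `total` files, about `msu_fraction` of them from MSU."""
    rng = random.Random(seed)  # fixed seed so the list is reproducible
    n_msu = min(len(msu), round(total * msu_fraction))
    n_um = min(len(um), total - n_msu)
    picked = rng.sample(msu, n_msu) + rng.sample(um, n_um)
    rng.shuffle(picked)  # interleave sites so neither is read in one burst
    return picked
```

      Calling `mixed_list(msu, um, total, 1/3)` would then give a read list whose site mix approximates the storage ratio, which is the property the third test step relies on.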

      03-Mar dCache update 11.2.0 -> 11.2.1
        This version added SciTag/fireflies for https transfers
        No problem with update
        ESNet dashboard for SciTags

      Problems with one data switch at the UM site: it went down twice in the past 2 days and had to be power cycled to bring it back. This caused 6 Tier 2 worker nodes to lose their connections.

      An out-of-warranty UPS in the UM Tier 3 room had a leaking battery, which affected some Tier 2 services as well. We were able to find some old batteries to replace the leaking ones, and the alert is gone. We are still observing before placing an order for new batteries.

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Working on procurement and getting quotes
      • Holding off on scheduling the dCache update until the 11.2.1 WebDAV bug is fixed
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Slow transfers on one of the storage servers led to a couple of short blacklistings. We suspect a hardware issue, and the Harvard side is looking into it.

      Unscheduled tape downtime over the past few days. The first message stated it was due to an issue with the inventory database caused by a firmware update. The most recent message indicated that a replacement part was needed. Either way, the file-to-tape association will not be affected, so there won't be any data loss.

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • We rebuilt six R740 storage servers from EL7 to EL9 while preserving data. 

        • After the rebuild, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss. 

        • The server has been returned to production, and we have not observed any issues so far.

        • We are currently creating backups of additional R740 storage servers in preparation for migrating them from EL7 to EL9. 

        • So far, we have not observed any issues with transfers related to these rebuilds. 

        • We ran several tests to try to speed up the data copy, including removing empty directories. 

      • Following news that StoRM has resolved its issues with full-chain certificates, we are gradually replacing leaf-only certificates with full-chain certificates. 

        • We reverted one of the two servers that had leaf-only certificates back to a full-chain certificate. 

        • Currently three of four XRootD Proxy servers have full chain certificates. 

      • One R740xd2 server lost its network connection on 2/27/2026 from 9:00 p.m. to 12:00 a.m. (2/28/2026 3:00 a.m. to 6:00 a.m. UTC). 

        • This may have caused some transfer issues during that window. We have fixed it for now but are still investigating the cause.

      • We have been experiencing failed jobs with error SIGTERM. 

        • These include long jobs that hit the three-day limit set in CRIC as well as shorter jobs. 

        • Logs are not accessible most of the time.

        • We are thinking about changing the job limits at our site to see if it helps with making logs available. 
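
      For the leaf-only versus full-chain certificate replacement noted in the list above, a quick way to tell the two apart is to count the certificate blocks in a PEM file. The sketch below is only an illustration and not the procedure used at SWT2_CPB: the function names are hypothetical, and counting blocks shows only how many certificates are bundled, not whether the chain is valid or in the right order.

```python
import re

# Matches one PEM-encoded certificate block, including its delimiters.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def count_pem_certificates(pem_text):
    """Return the number of certificate blocks found in PEM text."""
    return len(PEM_CERT.findall(pem_text))


def looks_full_chain(pem_text):
    """Heuristic: a leaf-only file has one block, a chain has two or more."""
    return count_pem_certificates(pem_text) >= 2
```

      A host certificate file can be read with `open(path).read()` and passed to `looks_full_chain`; a proper check would additionally verify the chain with a TLS library or an `openssl verify` run.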

       

      OU:

      • Running well
      • Working on migrating from old xrootd storage to new cephfs storage