US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      News:

      • Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
      • Very important IB meeting today at 1pm Eastern about the Genesis project. Please join!

      Upcoming meetings:

      Unchanged from last meeting

      Open tickets:

      Unchanged for more than 1 month; please provide an update in the minutes.

      • ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
      • ggus:3559 SWT2/OU: Dual-stack [on hold]
      • ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue

      Operations:

      • AGLT2

      • MWT2

      • NET2

      • SWT2/CPB

      • SWT2/OU

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • Since recovering from the shutdown on March 10, the site has been running smoothly.
      • ggus:1001382: no new updates.
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      IPv4 switched off for LHCONE on 3/10/2026 - no problems noticed so far

      Sunday 15-Mar: our central management NFS server got stuck
        Indirectly froze all computing jobs.
        HC set AGLT2 offline around 4pm EDT
        Noticed just before 7pm and restarted that VM.
        HC set AGLT2 back online around 6am
        A/R critical 7h

      Fix afterwards: updated the kernel, increased the CPU/memory, increased the NFS server threads (8->256), and changed the client mount options to allow metadata caching for 60s
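
      The minutes do not say which client mount options were changed. As a rough sketch only, assuming the 60s metadata cache was obtained with the standard NFS attribute-cache options (actimeo, or the individual acregmin/acregmax/acdirmin/acdirmax values), the hypothetical helper below reports the effective attribute-cache timeouts for a given mount option string. The option names and defaults follow nfs(5); the function name and the example option strings are illustrative and not taken from the site configuration.

```python
# Default NFS attribute-cache timeouts in seconds, as documented in nfs(5).
NFS_DEFAULTS = {"acregmin": 3, "acregmax": 60, "acdirmin": 30, "acdirmax": 60}


def attr_cache_timeouts(options: str) -> dict:
    """Return the attribute-cache timeouts implied by a mount option string.

    Hypothetical helper: it only looks at noac, actimeo and the four ac*
    options, and ignores everything else in the string.
    """
    timeouts = dict(NFS_DEFAULTS)
    for opt in options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "noac":
            # noac disables attribute caching entirely.
            return {name: 0 for name in timeouts}
        if key == "actimeo":
            # actimeo sets all four timeouts to the same value.
            timeouts = {name: int(value) for name in timeouts}
        elif key in timeouts:
            timeouts[key] = int(value)
    return timeouts


if __name__ == "__main__":
    # Example option string (an assumption, not the actual AGLT2 setting).
    print(attr_cache_timeouts("rw,vers=4.2,actimeo=60"))
```

      With actimeo=60, files and directories alike keep cached attributes for 60s, which reduces metadata requests to the server at the cost of clients seeing changes up to a minute late.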

       

      SciTag/firefly
        Continuing testing and patching dCache 11.2.1
        Several bugs fixed
        Private 11.2.2 RC3 version with pull requests to developers
        Now shows on ESnet dashboard
         

      Comparison of SciTags dCache throughput and WLCG Site Network Monitoring throughput:

      You need to "invert" the WLCG site data and compare by converting between Gb/s and GB/s (a factor of 8).
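
      A minimal sketch of that comparison follows. It makes two assumptions that the minutes do not state: that "invert" means swapping the inbound and outbound directions, and that the WLCG Site Network Monitoring numbers are in Gb/s while the dCache numbers are in GB/s. The function names and sample values are hypothetical.

```python
BITS_PER_BYTE = 8


def gbit_to_gbyte(rate_gbit_s: float) -> float:
    """Convert a rate from gigabits per second (Gb/s) to gigabytes per second (GB/s)."""
    return rate_gbit_s / BITS_PER_BYTE


def wlcg_to_dcache_view(wlcg_in_gbit_s: float, wlcg_out_gbit_s: float) -> dict:
    """Express WLCG site network rates in the assumed dCache convention.

    Assumption: "invert" means the two data sources label the directions
    from opposite viewpoints, so inbound and outbound are swapped here.
    """
    return {
        "in_gbyte_s": gbit_to_gbyte(wlcg_out_gbit_s),
        "out_gbyte_s": gbit_to_gbyte(wlcg_in_gbit_s),
    }


if __name__ == "__main__":
    # Made-up sample rates, only to show the factor of 8 and the swap.
    print(wlcg_to_dcache_view(wlcg_in_gbit_s=16.0, wlcg_out_gbit_s=40.0))
```

      If the directions in the two plots turn out to use the same convention, drop the swap and keep only the factor-of-8 conversion.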

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Continuing to work on procurement
      • MWT2_TEST was running too many jobs for ~1 week
        • Bug in MaxWorkers
        • Took a couple of days to drain out the running jobs
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Blacklisted briefly due to a storage pool with a bad disk, now replaced.

      Numerous staging errors following the end of the tape downtime, due to lots of tape users trying to access the system at once.

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • We rebuilt four R740 storage servers from EL7 to EL9 while preserving data. 

        • After the rebuilds, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss. 

        • The servers have been returned to production, and we have not observed any issues so far.

        • We are currently creating backups of additional R740 storage servers for migrating them from EL7 to EL9. 

        • So far, we have not observed any issues with transfers related to these rebuilds. 

      • Logs are still unavailable for most jobs that fail with a SIGTERM error.

        • We changed the default walltime limit from 48 to 78 hours on both of our Condor-CEs.

        • We changed the KillWait value in our Slurm configuration from 5 to 10 minutes. 

      • A new second backup Varnish server, gk12.atlas-swt2.org, was added to our site’s proxy list in CRIC. 

        • Its priority was set to the second position in case our main Varnish server fails. 

      • We experienced roughly 156 stage-out errors and 133 stage-in errors on 3/17.

        • We noticed three storage servers had very high load. 

        • We are still investigating this issue. 

        • Most of the errors appear to be associated with these storage servers.


      OU:

      • Site running well
      • Network monitoring: have contacted OneNet to get access to OFFN switch data to publish
      • XRootD version / storage migration: about to create a 1 PB OURdisk partition, then will contact the Rucio team to start draining and migrating
      • Dual stack: trying to get more OFFN IPv4 addresses, in order to avoid having to split IPv4/IPv6 between the OU and OFFN networks
      • Still need to address the issue with SAM/ETF not taking scheduled maintenance into account in the A/R reports