US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00-11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      News:

      • Remember to submit operations plans before the EOY break! Upload them here (no site has submitted one yet).
      • Add the results of the December 2025 capacity mini-challenge here (SWT2 is missing).
      • Capability mini-challenge planned for January 2026 (planning document here)
        • Node tuning
        • Rucio/SENSE
        • IPv4 blackout
      • Please keep the capacity and services spreadsheets updated.

      Operations:

      • Site production during the last two weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
      • Open tickets: 
        • NET2
          • GGUS 3255 NET2_Amherst: jobs failing with "Job has reached the specified backoff limit" (BackoffLimitExceeded)
            • New pilot deployed. Discussion of the results is ongoing in the GGUS ticket.

      Upcoming meetings:

      • CHEP 2026 [23-29 May in Bangkok, Thailand]

      • ISGC 2026 [15-20 March 2026 in Taiwan]
        • For both CHEP and ISGC, the formal period for ATLAS submissions is already closed. If you have a last-minute abstract to submit, don't hesitate to contact the S&C speakers committee.
      • The next LHCOPN/ONE meeting is proposed for 14-16 April 2026 in Canada. It will be hosted by CANARIE in a city not yet confirmed, possibly Ottawa.
      • ATLAS S&C meeting [Feb 9-13 at CERN]
        • If you would like to request a talk, please contact the S&C coordinators and specify whether you are looking for a plenary or a parallel slot.

      Discussion items:

      • [Ivan] On AlmaLinux minor versions (from here): "Each minor version reaches end of life when the new version is released". This is a motivation to keep the RHEL 9.X version up to date.

      • [Shawn] AGLT2 has an updated script for applying Fasterdata settings and is looking for feedback (see here). It has been tested on a single storage server, and the idea is to use it for the capability mini-challenge.
    • 11:10-11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • Dec 01: Migrated the HTCondor-CE to the production PQ.
      • Dec 09: Unset the distance TW-QMUL due to an MTU issue (GGUS).
      • Working on WLCG Accounting configuration of the HTCondor CE.
    • 11:20-11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

       

      The UM site has updated to RHEL 9.7. The worker nodes and dCache nodes will wait for the next firmware update (next quarter) before rebooting into the new kernel.

      Fixed the PDU issue at the UM Tier 2 and brought all the worker nodes back online.

      Reported and resolved the UNK A/R site results for November 2025.

      Mini data challenge: identified some storage servers with saturated disks during writes, and more storage servers during reads. Working on collecting more monitoring data on I/O performance.

      AGLT2 has a fiber issue this morning:
      Service degradation: Merit is reporting a fiber issue between Kalamazoo and Jackson.
      This is affecting the direct connection between UM and ESnet,
      but our network resiliency for UM via MSU (over our 100G Research Triangle with WSU) is working and we aren't seeing any failures.

      Note on the host tuning script being developed for WLCG perfSONAR: https://osg-htc.org/networking/perfsonar/tools_scripts/fasterdata-tuning/ This script is also intended to audit or apply Fasterdata settings. It has been used on UM's perfSONAR hosts (psum01/psum02/psum02-100g.aglt2.org) and on one of our dCache hosts, umfs20.aglt2.org. The script will be upgraded to "save" and "restore" configs for the upcoming capability mini-challenge in January 2026. Feedback on the script and docs is welcome.
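
      To make the audit / apply / save / restore idea concrete, here is a minimal Python sketch. It is not the AGLT2 script and does not reflect its interface. It works on plain dictionaries of sysctl-style settings rather than reading or writing the live kernel, and the target values in DESIRED are hypothetical placeholders, not the Fasterdata recommendations.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical target values for illustration only.
# These are NOT the Fasterdata recommendations; consult the Fasterdata
# documentation and the real script for actual values.
DESIRED: Dict[str, str] = {
    "net.core.rmem_max": "536870912",
    "net.ipv4.tcp_congestion_control": "bbr",
}


def audit(current: Dict[str, str],
          desired: Dict[str, str]) -> List[Tuple[str, Optional[str], str]]:
    """Return (key, current_value, desired_value) for every setting that differs."""
    return [
        (key, current.get(key), want)
        for key, want in sorted(desired.items())
        if current.get(key) != want
    ]


def apply_settings(current: Dict[str, str],
                   desired: Dict[str, str]):
    """Return (new_settings, saved); saved holds the prior values for a later restore."""
    saved = {key: current.get(key) for key in desired}
    new_settings = dict(current)
    new_settings.update(desired)
    return new_settings, saved


def restore(current: Dict[str, str],
            saved: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Undo apply_settings using the saved snapshot."""
    restored = dict(current)
    for key, old in saved.items():
        if old is None:
            restored.pop(key, None)  # the key did not exist before the change
        else:
            restored[key] = old
    return restored
```

      In this sketch, a site would call audit first to see what differs, keep the snapshot returned by apply_settings for the duration of a challenge, and pass it to restore afterwards. A real tool would additionally have to read and write the live kernel settings and handle NIC-level tuning, which is omitted here.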

    • 11:30-11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

      Increased meanRSS to 3GB

      SCRATCHDISK filled up due to the capacity mini-challenge, which set us offline temporarily on December 3rd

      Collected network metrics from the capacity mini-challenge and added them to the Google Drive

      Tested the cgroups changes in OSG25 on the MWT2_TEST and MWT2 queues before they were added to the production pilot

      Working on our operations plan

    • 11:40-11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      The fix for the backoff issue is in the pilot. We have observed very few backoff errors over the last few days, though the count is still non-zero. (There are also some jobs that throw the backoff limit error but are still marked as finished, not failed, which I don't understand.) Thanks to Fred and Michal for following up! Discussion is continuing on GGUS and JIRA.

      Tape downtime due to NESE tape system maintenance.

      A pool whose I/O was blocked by a kernel panic caused a lot of job failures.

      The site was blacklisted the weekend after Thanksgiving. Recovery was slow because HC test 1293 did not submit a new job.

       

    • 11:50-12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • We rebuilt the remaining three XRootD Proxy servers from EL7 to EL9. All four are now EL9. 

        • We experienced issues with XRootD 5.7.1. We tried various changes, ran different tests, and consulted XRootD experts, who recommended switching to 5.7.2 because of performance issues in 5.7.1. The developers stopped using std::regex, which was running very slowly; this reduced CPU load and improved throughput.

      • Our site may experience a downtime on 1/2/2026. Maintenance work requires the building’s power to be switched over to the generator. We will declare a warning downtime for this and be in the data center to monitor during the switchover.

      • We have started implementing Zabbix to move away from Ganglia for alerting. 


      OU:

      • Running well, with just some occasional storage overload caused by simultaneous heavy data influx and high-I/O jobs
      • OU_OSCER_GPU now in production, running actual GPU jobs
      • New SLURM version 25.11 being tested now, so we're making progress with cgroups and killing high-memory jobs
      • Making progress with IPv6 dual-stacking; only the CE is missing now
      • More network/transfer testing of new storage on Friday