US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US)), Verena Ingrid Martinez Outschoorn (University of Massachusetts (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 - 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Verena Ingrid Martinez Outschoorn (University of Massachusetts (US))

      News:

      • Procurement + operations plan for FY26:
        • Operations plans still needed in December
        • We held a meeting last Friday about procurements with FY25 funds and about the situation with FY26.
          • We decided to delay purchases with FY25 funds until February (barring urgent needs).
          • We will have another meeting in early February.
          • FY26 procurement plans should just describe any updates to purchases with FY25 funds and, as usual, retirements and estimated resources.

      Operations:

      • Site production during the last two weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
      • Open tickets: 
        • NET2
          • GGUS 3255 NET2_Amherst: jobs failing with "Job has reached the specified backoff limit BackoffLimitExceeded". A JIRA ticket was created to follow this issue.

          • NET2_LOCALGROUPDISK blacklisted in DDM (DISKSPACE).
        • SWT2_CPB
          • SWT2_CPB_TEST blacklisted in DDM (FT).

      Upcoming meetings:

      • SuperComputing25 [Nov 16-25]
      • Registration and abstract submission for CHEP 2026 [23-29 May in Bangkok, Thailand] are open. ATLAS abstracts are due to the CSC by November 19th, and the conference abstract deadline is December 19th.

      • ISGC 2026 will be held March 15-20, 2026, and is now open for submissions. The deadline is November 17th; again, all abstracts should be sent to the CSC as soon as possible.
      • The next LHCOPN/ONE meeting is proposed to be held in Canada on April 14-16, 2026. CANARIE will host the meeting in a city not yet confirmed, possibly Ottawa.
      • ATLAS S&C meeting [Feb 9-13 at CERN]
    • 11:10 - 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • 10/30: Low number of running slots caused by a PQ misconfiguration.
      • Lower transfer efficiency mostly due to issues at Glasgow and QMUL.
      • Deployed local Varnish server for Frontier and CVMFS access.
      • The maximum storage space has been adjusted to the actual available capacity of 2.2 PB.
      • The Condor and OS migration is in progress (thanks to Judith for the Puppet manifests!).
    • 11:20 - 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Updated HTCondor to 25.0.3, together with firmware, kernel, AFS, Lustre, and ZFS updates.

      The UM site has updated CVMFS to 2.13.3 and no longer sees the cvmfs_probe hang error. The MSU site will follow soon.

      Migrated the Tier 2 NFS server (which provides the Tier 2 user home area) to EL9 without interrupting the service.

       

    • 11:30 - 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

      Increased our meanRSS to 2800

      IU had a compute downtime on Friday (11/07/2025) to do a circuit test.

      Upgraded CVMFS at IU to 2.13.3. Planning to upgrade Condor and CVMFS together at UC and UIUC.

    • 11:40 - 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Verena Ingrid Martinez Outschoorn (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
      • Network and storage optimization has concluded; no more brief exclusions have been observed.
      • Brief problem with 41 servers that needed to be rebooted. They are back to normal.
    • 11:50 - 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Experienced a power outage on 11/6/2025. We followed our standard procedures for bringing the site back online and were back six hours later.

        • Experienced an issue with the Varnish server, but brought it back online.

        • Fixed the test cluster later.

        • Communicated with campus facilities about preparing better for future power incidents.

        • Continuing to communicate with campus networking concerning better alerting during power events. 

        • Used the downtime to make improvements that we could not make otherwise.

      • Continuing the EL7 to EL9 migration. Testing rebuilds of EL7 storage as EL9 and testing Puppet modules.

      • Continuing to enable safety shutdown of iDRAC on worker nodes in the event of a power outage. 

      • Experienced some issues with the test cluster not receiving the latest CRLs, but we resolved this. 

      • Finalized the purchase of replacement 1G switches. We are waiting for delivery.


      OU:

      • Still seeing nodes crash because of misbehaving high-memory jobs.
      • Waiting for an update from the OSCER admins on the status of the new SLURM controller so that we can test killing jobs via cgroups v2.
      • CVMFS 2.13.3 seems to behave nicely