US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      News:

      • Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
      • AMD CPU benchmarking ongoing by Fred.

      Upcoming meetings:

      • dCache workshop [May 6th-7th at NIKHEF]
      • CHEP 2026 [May 23rd-29th in Bangkok, Thailand]
      • HTC2026 [June 9th-12th in Madison, Wisconsin] - US ATLAS face-to-face on Tuesday and Wednesday (June 9th and 10th).
      • ATLAS S&C week #84: end of June, more information to come

      Open tickets:

      • Infrastructure tickets [all on hold until the next downtime, when CRIC and Rucio will be updated]
      • ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
      • ggus:3559 SWT2/OU: Dual-stack

      Operations:

      • AGLT2

      • MWT2

      • NET2

      • SWT2/CPB

      • SWT2/OU

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • Scheduled downtime: site shutdown from 01:00 UTC on 28 April for high-voltage switchgear maintenance.
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
      • Deleted 105 TB of dark data from datadisk, including the rucio, SAM, and DUMPS directories.
      • Preparing for the downtime on 4/30, 09:00-14:00.
        • Plan to update the firmware and kernel on all worker nodes and storage nodes, then reboot them.
        • The new dCache release is not yet out, but we will go ahead with the downtime for the planned work and do the dCache update at another time, without a downtime.
    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • UIUC PM on 04/15/2026
      • dCache pools were overloaded on 04/17/2026, mainly by one user's jobs.
      • Set offline briefly on 04/19/2026. A few worker switches maxed out briefly
      • IU Networking updated campus to LHCONE VRF BGP peerings on 04/21/2026
      • dCache upgrade is planned for 05/04, assuming the patched version is released Thursday
      • Dark data on 04/28: Hiro's tests in /pnfs/uchicago.edu/atlasdatadisk/hiro/DAVS. Cleaned up; now down to less than 10 TB.
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Smooth running except that a power sag on the 19th at the MGHPCC caused the cooling to shut down, putting us into an unscheduled downtime.  We recovered fine, though we did find a bug in CRIC which caused the storage to continue to be marked as in downtime for a couple of days after the downtime ended.

      Load balancing on the ESnet international links has been fixed for NET2. Mini-data challenges with PRG are to resume next week (or the week after).

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • We rebuilt four storage servers from EL7 to EL9. 

        • There were some transfer errors caused by these rebuilds. 

        • No data has been lost. Backups were made before rebuilds. 

        • We checked and verified data was not lost after rebuilds were complete. 

        • We have one R740xd2 storage server left to rebuild, which will be rebuilt today (4/29). 

        • We have two ME4084 storage arrays (connected to R640) we plan to migrate data from, rebuild to EL9, then put back into production. 

          • Once data is migrated from these servers, all of our storage nodes in production will be EL9. 

      • Changes were made to CRIC to remove the reliability and availability monitoring of SWT2_CPB_SE_TEST-WEBDAV-gridftp.swt2.uta.edu and SWT2_CPB-CE-HTCONDOR-CE-test03.swt2.uta.edu (part of the test cluster) so that they do not affect the availability and reliability shown for the SWT2_CPB production site. 

        • We changed the in_report and in_monitored values in CRIC to “False”. 

      OU:

      • Running smoothly
      • Still waiting on feedback from OSCER admins and OneNet folks on the three open tickets. Will follow up again and ask for updates