US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 10:10 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      MSU finished the migration of Tier 2 nodes to EL9.

      Smooth operation; only one node had CVMFS issues, which were not detected by our check procedures.

      UPS issue in the UM Tier 2 room (the same issue happened in May this year): we opened a case with APC on July 31st but have received a slow response. No date has been scheduled yet to fix the broken circuit due to a lack of parts.

       

    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant
      • Updating CVMFS to 2.13.2
      • Updating PERC H355 firmware on our newest UC storage
      • ~Half a rack of IU compute is offline due to a PDU failure
      • Collecting job information in Elasticsearch. Working on dashboards to compare requested memory vs. job utilization (a sketch of such a query follows this list)
      • Adding an additional XRootD door to the MWT2 dCache
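
      As a rough illustration of the comparison such a dashboard could make, the sketch below queries Elasticsearch for average requested memory versus average measured memory usage per queue. The endpoint, index pattern (jobs-*), and field names (panda_queue, memory_requested_mb, max_rss_mb) are hypothetical placeholders, not the actual MWT2 schema.

        import requests

        # Hypothetical Elasticsearch endpoint and index pattern; substitute
        # the real MWT2 values.
        ES_URL = "http://localhost:9200/jobs-*/_search"

        query = {
            "size": 0,
            "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
            "aggs": {
                "by_queue": {
                    "terms": {"field": "panda_queue", "size": 20},
                    "aggs": {
                        "avg_requested_mb": {"avg": {"field": "memory_requested_mb"}},
                        "avg_used_mb": {"avg": {"field": "max_rss_mb"}},
                    },
                }
            },
        }

        resp = requests.post(ES_URL, json=query, timeout=30)
        resp.raise_for_status()

        # Print requested vs. used memory side by side for each queue.
        for bucket in resp.json()["aggregations"]["by_queue"]["buckets"]:
            requested = bucket["avg_requested_mb"]["value"] or 0.0
            used = bucket["avg_used_mb"]["value"] or 0.0
            print(f"{bucket['key']}: requested ~{requested:.0f} MB, used ~{used:.0f} MB")
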
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Starting on the 2nd, the VP queue was receiving very few jobs, with almost no files accessed in the xcache. As of last night we are getting some jobs again, though the number of file accesses recorded in the xcache monitoring is still fairly low compared to what it was before. HammerCloud jobs continued to be submitted and to run successfully, so the queue was never blacklisted. The xcache appears to have been working throughout this time; the various monitoring pages all show green. It is unclear what was stopping jobs from picking the queue; any insight from ADC would be welcome.

      On the 23rd we had to restart the cluster's CVMFS daemonset, as CVMFS access problems were causing many jobs to fail.  This resulted in a loss of running jobs, but we have seen no CVMFS issues since then.
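
      For reference, a restart like the one described above can be triggered by patching the DaemonSet's pod-template annotations, the same mechanism kubectl rollout restart uses. The sketch below assumes the official Kubernetes Python client and hypothetical object names (DaemonSet cvmfs-csi in namespace cvmfs); the actual NET2 names may differ.

        from datetime import datetime, timezone
        from kubernetes import client, config

        # Hypothetical DaemonSet name and namespace; substitute the real ones.
        NAMESPACE = "cvmfs"
        DAEMONSET = "cvmfs-csi"

        # Use load_incluster_config() instead when running inside the cluster.
        config.load_kube_config()
        apps = client.AppsV1Api()

        # Updating this annotation changes the pod template, so the DaemonSet
        # controller replaces its pods according to the update strategy
        # (a rolling restart with the default RollingUpdate strategy).
        patch = {
            "spec": {
                "template": {
                    "metadata": {
                        "annotations": {
                            "kubectl.kubernetes.io/restartedAt":
                                datetime.now(timezone.utc).isoformat()
                        }
                    }
                }
            }
        }

        apps.patch_namespaced_daemon_set(name=DAEMONSET, namespace=NAMESPACE, body=patch)
        print(f"Triggered rolling restart of daemonset/{DAEMONSET} in {NAMESPACE}")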

      On the 3rd we were briefly blacklisted due to a storage issue.

    • 10:50 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • EL9 Migration
        • We are currently focusing on deploying new storage and migrating data so that the old storage can be retired.
        • We plan to continue the EL9 migration work as time allows.
        • We have new EL9 builds, but they still need to be tested and improved.
      • New Storage Deployment

        • 31 files totaling 1.2 GB were lost during our migration on 7/23/2025 due to a rare error condition that was not caught by the original migration script.

        • We contacted Fabio Luchetti and Petr Vokac for help in reporting this. 

        • We paused migration temporarily to test and improve our migration script in order to prevent further issues.

        • We have significantly improved the script, adding more safeguards and functionality; it is now more robust and handles the rare error condition. We are continuing to develop and test it in our test cluster (a minimal sketch of this kind of safeguard appears after the SWT2_CPB items below).

        • We will resume the migration once the script improvements and testing are complete.

      • GGUS-Ticket-ID: #683657: Varnish

        • We gradually increased the priority of our Varnish server.

        • We restarted Varnish services for the new frontier, as requested by Ilija. 

        • We communicated with Ilija about monitoring and other information.

        • We performed a one-hour test with Varnish as the primary proxy in CRIC on 8/5/2025 from 11:00 a.m. to 12:00 p.m. CT, then reverted Varnish back to position 1 (second priority).

          • We evaluated the monitoring provided by Ilija and the PanDA job logs.

          • We will perform the same test today (8/6), but for five hours instead, and will monitor and check the results afterwards.

      • New Hardware

        • We received new hardware for the head nodes and plan to replace the old ones for better performance and safety.

      • Declared Downtime

        • We declared a downtime of severity "warning" for 8/1 and 8/2.

        • The external power supplier for the campus performed work that temporarily cut power to the data center for one minute on each day. Our UPS kept us online, and we had prepared for a potential outage.

        • Fortunately, we did not experience an outage during this time. 

      • UPS Issue 

        • After the brief power losses on 8/1 and 8/2, we noticed alerts from our UPS system.

        • We are communicating with the UPS vendor to gather more information and recommendations going forward, to ensure we do not experience any serious issues with our UPS.

      • Storage Issue

        • One storage node is experiencing issues with one drive bay. Replacing the drive did not resolve the issue. 

        • We are working with the vendor to resolve this issue, since the node is still covered under warranty.

      • Dark data on SCRATCHDISK

        • We sent the latest dark data dump on 7/23. The dump file covers 30 TB of files and was built from three dumps: a full dump from SWT2_CPB on 7/8, a Rucio data dump on 7/18, and a second SWT2_CPB dump on 7/21. We are waiting for the dark data to be cleaned by the DPM managers.
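
      For context, the comparison behind such a dark-data list can be sketched as a set difference between a storage dump and a Rucio replica dump: files present on storage but unknown to Rucio are candidate dark data. The file names and the one-path-per-line format below are illustrative assumptions, not the actual SWT2_CPB dump layout.

        # Minimal sketch: candidate dark data = storage dump minus Rucio dump.
        def load_paths(filename):
            """Read one file path per line into a set, skipping blank lines."""
            with open(filename) as f:
                return {line.strip() for line in f if line.strip()}

        storage_dump = load_paths("swt2_cpb_storage_dump.txt")  # hypothetical name
        rucio_dump = load_paths("rucio_replica_dump.txt")       # hypothetical name

        dark = sorted(storage_dump - rucio_dump)
        print(f"{len(dark)} candidate dark files")

        with open("dark_data_candidates.txt", "w") as out:
            for path in dark:
                out.write(path + "\n")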
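
      As referenced in the storage-migration item above, here is a minimal sketch of the kind of safeguard a migration script can apply: only treat a source file as removable after the destination copy has been verified to exist and to match in both size and checksum. The paths and the adler32-style checksum are illustrative assumptions, not the actual SWT2_CPB script.

        import os
        import zlib

        def adler32(path, chunk_size=1 << 20):
            """Stream a file and compute its adler32 checksum."""
            value = 1
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    value = zlib.adler32(chunk, value)
            return value & 0xFFFFFFFF

        def safe_to_remove_source(src, dst):
            """Return True only if the destination copy exists and matches the
            source in size and checksum; any mismatch keeps the source file."""
            if not os.path.exists(dst):
                return False
            if os.path.getsize(src) != os.path.getsize(dst):
                return False
            return adler32(src) == adler32(dst)

        # Hypothetical usage inside a migration loop:
        # if safe_to_remove_source("/old/pool/file1", "/new/pool/file1"):
        #     os.remove("/old/pool/file1")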

      OU:

      • Site running well, nothing to report
      • I'm still in the ACP planning meeting; I will switch over as soon as that is over.