US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • November 20 and 22: The site had several hours of network interruption due to unplanned maintenance by the network provider.
      • HTCondor, HTCondor-CE, and OS Migration Status:
        • Updated HTCondor to 25.3.1 and OS to EL9

        • Currently have 1872 CPUs running on EL9

        • Set the test PQ online and set the TW-FTT queue to BROKEROFF

      • Started using a local Varnish server for Frontier and CVMFS
      • Plan to replace ARC-CE with HTCondor-CE, upgrade all remaining EL7 worker nodes to EL9, move the site infrastructure to EL9, HTCondor-CE, and Varnish, and then decommission ARC-CE and Squid.
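
      Switching to a local Varnish cache means the clients must be pointed at it. A minimal sketch of the client-side settings, assuming a Varnish instance listening at varnish.example.tw:6081 (hostname and port are placeholders, not from the report):

      ```
      # /etc/cvmfs/default.local -- route CVMFS traffic through the local Varnish cache
      CVMFS_HTTP_PROXY="http://varnish.example.tw:6081"

      # Frontier client: use the same proxy ahead of the central servers
      # (set in the job environment or the Frontier server/proxy string)
      export FRONTIER_PROXY="http://varnish.example.tw:6081"
      ```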
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Ticket 1001213
        Jobs were failing: writing new files failed because our storage appeared to be full.
          This happened after too many pools were left set rdonly for too long.
            The rdonly pools slowly drained (file system showing 66%) following DDM deletions,
            while the remaining RW pools slowly filled up (to 98%) with new files.
            We had forgotten to set half of the pools at one site back to RW
            after the on-the-fly rolling dCache update 4 weeks earlier.
          We already had a cron job alerting when pools become offline;
            it has now been upgraded to also flag rdonly pools.
          We also re-balanced all pools site-wide
            to re-spread the unused space among all pools;
            that space is used as a temporary cache between UM and MSU.
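
      A minimal sketch of such a pool-health check, assuming a dCache frontend REST endpoint /api/v1/pools that returns a JSON list of objects with "name" and "poolMode" fields (the endpoint path and field names are assumptions; adjust to the actual schema of your dCache instance):

      ```python
      """Sketch of a cron check that flags dCache pools that are offline
      OR stuck read-only (rdonly), as in AGLT2's upgraded alerting.

      Assumption (not from the report): the frontend REST call
      GET <frontend>/api/v1/pools returns a JSON list of objects with
      "name" and "poolMode" fields."""
      import json
      from urllib.request import urlopen


      def unhealthy_pools(pools):
          """Return names of pools whose mode is anything other than plain
          'enabled' -- covers 'disabled(...)' (offline) and modes carrying
          the rdonly flag, e.g. 'enabled,rdonly'."""
          return [p["name"] for p in pools if p.get("poolMode") != "enabled"]


      def check(frontend_url):
          # e.g. frontend_url = "https://dcache-head.example.edu:3880" (placeholder)
          with urlopen(frontend_url + "/api/v1/pools") as resp:
              return unhealthy_pools(json.load(resp))


      if __name__ == "__main__":
          # Offline sample: one healthy pool, one read-only, one disabled.
          sample = [
              {"name": "pool01", "poolMode": "enabled"},
              {"name": "pool02", "poolMode": "enabled,rdonly"},
              {"name": "pool03", "poolMode": "disabled(manual)"},
          ]
          print(unhealthy_pools(sample))  # -> ['pool02', 'pool03']
      ```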
            
      Condor/Condor-CE updates to OSG25
        Condor on 25.0.3
        Condor-CE on 25.0.1

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      GGUS:1001113: error caused by an issue with a new storage machine; rapidly fixed.

      Blacklisted on the 15th and 16th, once due to a network issue and once for a CVMFS problem.

      Downtime around SC 2025, followed by more CVMFS issues coming out of the downtime, probably because the dense nodes rapidly filled up with Event Index jobs and overloaded their CVMFS instances. We have been planning to switch from a direct mount to accessing CVMFS through the cvmfs-csi Container Storage Interface Kubernetes plugin, which ought to prevent these issues; this will now happen sooner rather than later. Unfortunately, the image registry was on one of the dense nodes that needed to be rebooted, and it failed to clone the images again automatically, so jobs hung in Harvester until the cloning was done by hand.
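      One common way to wire this up with the cvmfs-csi driver is a read-only claim against its storage class, mounted where jobs expect /cvmfs. A sketch, assuming the driver's usual defaults (claim name, storage class, and image are illustrative, not from the report):

      ```yaml
      # PVC sketch for the cvmfs-csi driver (names are assumptions)
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: cvmfs
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 1
        storageClassName: cvmfs
      ---
      # Pod fragment: mount the claim where jobs expect /cvmfs
      apiVersion: v1
      kind: Pod
      metadata:
        name: cvmfs-demo
      spec:
        containers:
          - name: job
            image: example/payload:latest   # placeholder image
            volumeMounts:
              - name: cvmfs
                mountPath: /cvmfs
                mountPropagation: HostToContainer
        volumes:
          - name: cvmfs
            persistentVolumeClaim:
              claimName: cvmfs
              readOnly: true
      ```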

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Campus facilities performed power tests on Saturday 11/22, exercising the backup generator that takes over if the building loses power; the tests succeeded.

      • We rebuilt one XRootD proxy server to EL9 after testing in the test cluster. We are seeing performance issues and have not yet resolved them:

        • Communicating with XRootD experts.

        • Performing different tests.

        • Researching potential causes.

        • Tried upgrading to the newer version 5.9.0 and rebuilding on new hardware.
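
      A crude throughput check of the kind one might run while chasing a proxy performance issue is to time an xrdcp transfer through the proxy and convert it to MB/s. A sketch, assuming the standard xrootd client tools are installed; the endpoint URL is a placeholder, not SWT2's:

      ```python
      """Time an xrdcp transfer and report mean throughput (sketch)."""
      import subprocess
      import time


      def throughput_mbps(nbytes, seconds):
          """Mean throughput in MB/s (10**6 bytes per second)."""
          return nbytes / seconds / 1e6


      def time_xrdcp(source, dest="/dev/null"):
          """Copy `source` with xrdcp (requires the xrootd client tools)
          and return the elapsed wall-clock seconds."""
          start = time.monotonic()
          subprocess.run(["xrdcp", "--force", source, dest], check=True)
          return time.monotonic() - start


      if __name__ == "__main__":
          # Hypothetical use against a test file behind the proxy:
          # secs = time_xrdcp("root://proxy.example.edu:1094//store/test/1GB.dat")
          # print(throughput_mbps(1_000_000_000, secs))
          print(throughput_mbps(1_000_000_000, 8.0))  # -> 125.0
      ```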

      • We are continuing to migrate data off older storage. The most recent server was a PowerEdge R740, the model that makes up the majority of our storage servers.

        • We have not retired any storage yet, as we may need to use certain storage to complete the migration of data. 

      • We are testing Zabbix in the test cluster.


      OU:

      • Running well, no issues
      • Still waiting for a new SLURM version in order to start testing cgroup v2 RAM killing
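
      Once the newer SLURM arrives, cgroup v2 memory enforcement is typically switched on in cgroup.conf; a minimal sketch (values are illustrative, not OU's):

      ```
      # cgroup.conf -- sketch for cgroup/v2 memory enforcement
      CgroupPlugin=cgroup/v2
      ConstrainRAMSpace=yes     # jobs exceeding their memory limit get OOM-killed
      ConstrainSwapSpace=yes
      AllowedRAMSpace=100       # percent of the allocated memory

      # slurm.conf must also select the cgroup plugins:
      # ProctrackType=proctrack/cgroup
      # TaskPlugin=task/cgroup
      ```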