US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 10:10 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))

      Dear All,

      I have to apologize for missing the meeting today, again because of another engagement out of town.

      There is no update on the international network for TW-FTT so far.

    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Running well overall
        Some issues with cvmfs; removed automated 'reload'; gave some nodes to experts.
        One dip to 60% occupancy when only single-core (score) jobs were present
          Trying to understand the underlying limitation

      Availability report for March showed 95% while our monitoring shows over 99%
         We need to create a ticket to follow up on the discrepancy

      A few more lost files from the Dec 2024 incident
        Noticed as stage-in errors (pilot:1099)
        10 files from one dataset
        Checked the whole dataset (mc23_13p6TeV:AOD.42171985.*)
        Of 351 total files, 39 were registered/created at AGLT2
        37 total missing; declared bad/lost
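
        A minimal sketch of this kind of consistency check, assuming a working Rucio client environment (rucio.cfg plus valid authentication); the dataset DID below is a placeholder for the real dataset matched by mc23_13p6TeV:AOD.42171985.*, and the result is only printed, since declaring replicas bad/lost goes through the usual DDM channels.

          # Sketch: list which files of the affected dataset have replicas
          # registered at AGLT2, as a starting point for checking them
          # against the local storage.
          from rucio.client import Client

          client = Client()
          dataset = {"scope": "mc23_13p6TeV",
                     "name": "mc23_13p6TeV.AOD.42171985_placeholder"}  # hypothetical DID

          at_aglt2 = []
          for rep in client.list_replicas([dataset]):
              # 'rses' maps each RSE holding a replica to its PFNs
              if any(rse.startswith("AGLT2") for rse in rep.get("rses", {})):
                  at_aglt2.append(rep["name"])

          print(f"{len(at_aglt2)} files of the dataset have replicas registered at AGLT2")
          # Files confirmed missing on the storage are then declared bad/lost.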

      EL9 at MSU
        Correction to last report: there was one more hurdle, now solved
        Needed one more allowance through the MSU firewall for the AGLT2 subnet to reach the capsule on port 443
        That allowed the node being provisioned to register itself during the build
        via subscription-manager and an HTTP proxy from the private subnet to the capsule's public HTTPS port
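
        A minimal sketch of checking that path from a node on the private subnet, assuming Python is available; the proxy and capsule hostnames are placeholders, and the request simply exercises the same CONNECT-through-the-proxy route to the capsule's HTTPS port that subscription-manager uses during registration.

          # Sketch: confirm the private subnet can reach the capsule's HTTPS
          # port through the HTTP proxy. Hostnames/ports are placeholders.
          import http.client

          PROXY_HOST, PROXY_PORT = "proxy.example.private", 3128    # hypothetical proxy
          CAPSULE_HOST, CAPSULE_PORT = "capsule.example.edu", 443    # hypothetical capsule

          conn = http.client.HTTPSConnection(PROXY_HOST, PROXY_PORT, timeout=10)
          conn.set_tunnel(CAPSULE_HOST, CAPSULE_PORT)   # CONNECT capsule:443 via the proxy
          try:
              conn.request("GET", "/")                  # any response proves the path is open
              print("reached capsule through proxy, HTTP status:", conn.getresponse().status)
          except OSError as exc:
              print("path blocked (firewall/proxy rule missing?):", exc)
          finally:
              conn.close()
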
      Next steps:
        Building the first VM for the perfSONAR infrastructure.
        Will turn the first node built into a worker node.
        Also starting on the new storage nodes.

    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant
      • Working on updating cvmfs on the compute machines. Currently draining machines so that cvmfs can be restarted for the update (see the sketch after this list).
      • Starting to discuss operations and procurement plans for this year
      • IU network config change to fix route asymmetry along LHCONE
      • A storage node at UC was down for a short time while we replaced dead optics
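
      A rough sketch of the drain-and-update cycle mentioned in the first bullet, assuming HTCondor worker nodes reachable over ssh; the commands (condor_off/condor_on, cvmfs_config) are standard, but the host list, package manager invocation, and polling interval are placeholders, and the real MWT2 procedure may differ.

        # Sketch: peacefully drain an HTCondor worker, update and reload
        # CernVM-FS, then return the node to the pool.
        import subprocess
        import time

        NODES = ["wn001.example.edu", "wn002.example.edu"]   # hypothetical worker nodes

        def run(host, cmd, check=True):
            return subprocess.run(["ssh", host, cmd], check=check)

        def drained(host):
            # No condor_starter processes means no jobs are still running.
            return run(host, "pgrep -f condor_starter", check=False).returncode != 0

        for node in NODES:
            run(node, "condor_off -peaceful -startd")   # accept no new jobs, let running ones finish
            while not drained(node):
                time.sleep(600)                         # poll until the node is empty
            run(node, "yum -y update cvmfs")            # install the new cvmfs packages
            run(node, "cvmfs_config reload")            # reload the client with the new version
            run(node, "cvmfs_config probe")             # sanity-check that repositories mount
            run(node, "condor_on -startd")              # return the node to the pool
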
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Now connected via Cambridge with redundant 400+400 Gbps links.

      Working on procurement plans, a bit late due to a lack of external inputs.

      Other than that, nothing to report.

    • 10:50 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • Operations

        • We experienced a significant drain on 4/3/2025. Investigations are still ongoing, but between local debugging and help from others we are gathering information, and the site has since filled back up. We noticed a significant drop in multicore jobs during this period.

        • We increased our Slurm maximum job limit from 10000 to 12000.

        • We created a second CE, but have not put it into production yet.

        • Timo and Rod helped us point both SWT2_CPB_TEST and SWT2_CPB at gk10 and enable submission of 16-core jobs.

        • We have reached out to experts and shared request logs from our CE for review.

        • Timo helped investigate this and found errors in the apfmon logs showing "The job's remote status is unknown… known again". It is still unclear whether this is a central issue or a bug in our version of HTCondor-CE, but it appears to be some kind of handshake/status problem.

        • We have been focused on finding out why this issue occurred and how to prevent it in the future.

        • We are now draining SWT2_CPB_TEST to go back to running jobs only on the SWT2_CPB queue.

        • We recently experienced a spike in errors due to jobs hitting the 2-day walltime limit on our CE. We are discussing changes to these limits (see the sketch after this list).

          • Last update from ADC OPS meeting:

            • Request all sites move to at least 96h maxwalltime

            • ATLAS VO Card includes a 5760-minute walltime limit (= 96 hours)
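
        As a starting point for the limit discussion above, a minimal sketch using the HTCondor Python bindings (assumed to be installed on the CE host) that lists running jobs approaching a configurable walltime; the 96-hour value comes from the ADC request, while the warning fraction and everything else here are placeholders.

          # Sketch: report running CE jobs that are approaching a walltime limit.
          # JobStatus and JobCurrentStartDate are standard HTCondor job attributes.
          import time
          import htcondor

          WALLTIME_LIMIT_H = 96     # requested ADC minimum (5760 minutes)
          WARN_FRACTION = 0.9       # flag jobs past 90% of the limit

          schedd = htcondor.Schedd()    # the local schedd behind the CE
          running = schedd.query(
              constraint="JobStatus == 2",    # 2 = running
              projection=["ClusterId", "ProcId", "JobCurrentStartDate"],
          )

          now = time.time()
          for job in running:
              start = job.get("JobCurrentStartDate")
              if start is None:
                  continue
              hours = (now - start) / 3600.0
              if hours > WARN_FRACTION * WALLTIME_LIMIT_H:
                  print(f"{job['ClusterId']}.{job['ProcId']}: {hours:.1f} h of wall time so far")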

      • Monitoring

        • We are currently working on better monitoring of our site, to include additional information from our Slurm and CE servers (see the sketch below).
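
        A minimal sketch of the kind of collector this could start from, assuming shell access to a host where squeue works; it only counts Slurm jobs by state, and where the numbers would be shipped (and the matching CE-side queries) is left out.

          # Sketch: count Slurm jobs by state, the kind of number the extra
          # monitoring would track alongside information from the CE.
          import subprocess
          from collections import Counter

          # -h drops the header, -o '%T' prints only the job state (RUNNING, PENDING, ...)
          out = subprocess.run(
              ["squeue", "-h", "-o", "%T"],
              check=True, capture_output=True, text=True,
          ).stdout

          states = Counter(out.split())
          for state, count in sorted(states.items()):
              print(f"{state:10s} {count}")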

      • EL9 Migration Updates

        • We built test storage nodes in the test cluster. There are more tests we still want to perform.

        • Improving the storage module in Puppet/Foreman.

      • GGUS Ticket - Enable Network Monitoring

        • Followed up with campus networking. It appears internal changes caused them to lose track of our request.

        • They have added the manager of their Operations Center and are discussing this.

      OU:

      • May not be able to join because of a conflicting meeting, sorry.
      • Running well, only occasional storage overloads.
      • We think the lscratch deletion issue has been fixed, so we can migrate from EL7 containers to EL9 containers.