US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Rafael is busy today so I am stepping in.
      • Great running over the last couple of weeks
        • Chasing a problematic request (64837) with Ewelina L. that may be causing CVMFS issues. The request definitely has jobs that repeatedly reach the wall time limit.
        • Various 
      • Please don't order equipment until we have a better understanding of our finances.
        • For equipment we currently have FY25 funding and $0 for FY26; the infrastructure money is unknown.
          • John Hobbs reports that the markup of the FY26 spending bill is more favorable than the administration's budget request.
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))

      1. Generally, the site is running smoothly, except for low incoming data in recent days.

      2. Scheduled downtime: the site will be shut down for 60 hours starting 23:00 on 20 Sep. (UTC) for routine power-system maintenance at Academia Sinica. Site recovery will start on Monday morning (local time) and the site might be back online earlier than planned.

    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

        Ticket GGUS-1000556
          quickly found that one pool node (mufs17) had restarted
          dCache was started and the problem was resolved

        Noticed some worker nodes had a higher fraction of idle CPU.
          BOINC should at least be using cycles left unused by jobs.
          Found BOINC zombie processes that we could not clear (see the sketch below).
          Drained and rebooted 8 nodes.
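
        Zombie processes can only be reaped by their parent process (or cleared by a reboot), but spotting affected nodes is easy to script. Below is a minimal sketch, assuming a Linux /proc filesystem and that the BOINC client processes have "boinc" in their name; it is illustrative, not our production tooling.

          #!/usr/bin/env python3
          # Minimal sketch: list zombie (state 'Z') processes whose name contains
          # "boinc", together with the parent PID that should be reaping them.
          # Assumes a Linux /proc filesystem; the "boinc" name match is an assumption.
          import os

          def zombie_boinc():
              found = []
              for entry in os.listdir("/proc"):
                  if not entry.isdigit():
                      continue
                  try:
                      with open(f"/proc/{entry}/stat") as f:
                          stat = f.read()
                  except OSError:
                      continue  # process exited while we were scanning
                  # Format is "pid (comm) state ppid ..."; split on the closing
                  # parenthesis because comm may itself contain spaces.
                  head, _, rest = stat.partition(") ")
                  name = head.split("(", 1)[-1]
                  fields = rest.split()
                  state, ppid = fields[0], fields[1]
                  if state == "Z" and "boinc" in name.lower():
                      found.append((int(entry), int(ppid), name))
              return found

          if __name__ == "__main__":
              for pid, ppid, name in zombie_boinc():
                  print(f"zombie {name} pid={pid} ppid={ppid}")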

        CVMFS switching issue much improved, likely resolved
          after Ilija increased the maximum number of objects deleted during cache cleaning

        Updated BOINC control scripts to run on leftover EL7 nodes
          (MSU T3 and retired T2 nodes);
          these do not run Condor jobs but are allowed to run BOINC until they are updated to EL9.

        PDU issue at UM not resolved,
          but we managed to power on 5 more worker nodes;
          only 3 worker nodes remain shut down.

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant
      • Testing new pilot
        • Set the pilot memory limit to 1.4x the job request; will set the local Condor limit to 2x.
        • Found that the pilot does not seem to be running the job's processes in the proper cgroup, although the memory maximum is set (see the sketch after this list).
      • Will be rolling out kernel updates for CVE-2025-38352 at IU and UC
      • IU UPS maintenance today. Some compute will be down, but nothing worth declaring a downtime.
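
      To debug the cgroup question it helps to look directly at the slot cgroup: check that a memory limit is actually set and that the payload PIDs are attached to it. Below is a minimal sketch assuming cgroup v2; the cgroup path passed on the command line is site-specific and hypothetical.

        #!/usr/bin/env python3
        # Minimal sketch (cgroup v2 assumed): check that a job slot's cgroup has a
        # memory limit set and that payload processes are actually attached to it.
        # The cgroup path given on the command line is site-specific/hypothetical.
        import sys

        def read(path):
            with open(path) as f:
                return f.read().strip()

        def check_cgroup(cg):
            mem_max = read(f"{cg}/memory.max")        # the string "max" means unlimited
            mem_cur = int(read(f"{cg}/memory.current"))
            procs = read(f"{cg}/cgroup.procs").split()
            print(f"memory.max     = {mem_max}")
            print(f"memory.current = {mem_cur / 2**30:.2f} GiB")
            print(f"attached PIDs  = {len(procs)} {procs[:10]}")
            if mem_max == "max":
                print("WARNING: no memory limit applied to this cgroup")
            if not procs:
                print("WARNING: no processes attached to this cgroup")

        if __name__ == "__main__":
            check_cgroup(sys.argv[1].rstrip("/"))
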
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
      • Operations

      Efficiency dips in the past two weeks are due to small OKD upgrades (much easier to do if we are not also trying to do firmware upgrades on the servers). We expect another one of these, for an important bugfix, sometime in the next couple of weeks. After that, it should be a while before we need another one.

      We are seeing many fewer backoff errors following the upgrade, but they have not entirely gone away; we believe a fix is also needed in the pilot to ensure that it correctly identifies the available resources. We will resume discussions with Paul and Fernando on this topic this week.

      • Tape

      Total usage is at 11.5 PB and increasing: https://monit-grafana.cern.ch/goto/Spl7J2CHR?orgId=17

      • Varnish

      Using new local Varnish deployment as suggested by Ilija, working well so far.  Also serving BNL now.

      • Tickets

      GGUS 3255: discussion with Paul and Fernando is ongoing; it was paused for vacation time and will be resumed this week.

      • SENSE deployment

      SENSE tests with a dCache door dedicated to the SENSE workflow worked, connecting to the production instance. Performance tests are ongoing. Images for dCache doors and for gfal clients have been built and are available in our registry to automate the tests.
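
      For the automated tests, a minimal smoke test can be as simple as copying one file through the dedicated door and removing it again. The sketch below assumes only standard gfal-copy / gfal-rm command-line usage; the door hostname and destination path are placeholders, not our real endpoints.

        #!/usr/bin/env python3
        # Minimal smoke-test sketch: copy one file through the SENSE-dedicated dCache
        # door and delete it again. The door hostname and destination path are
        # placeholders; only basic gfal-copy / gfal-rm command-line usage is assumed.
        import subprocess
        import uuid

        DOOR = "davs://sense-door.example.org:2880"                   # hypothetical door
        DEST = "/pnfs/example.org/data/atlasscratchdisk/sense-tests"  # hypothetical path

        def run(cmd):
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        def smoke_test(local_file="/etc/hostname"):
            remote = f"{DOOR}{DEST}/smoke-{uuid.uuid4()}"
            run(["gfal-copy", f"file://{local_file}", remote])  # upload via the door
            run(["gfal-rm", remote])                            # clean up the test file

        if __name__ == "__main__":
            smoke_test()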

      • Network

      Testing ECMP deployment (working with Juniper to make sure we did the right thing). We are reaching 400 Gb/s far too often in production: https://dashboard.stardust.es.net/goto/DI9jBhjHR?orgId=2

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Data Migration

        • The new migration script is working as intended in terms of safety and logging, but the speed is slower than we expected. We have not lost any data since we resumed data migration. We are investigating this slowness and trying to find ways to speed it up. We are contacting XRootD experts for any advice and suggestions.

          • We have researched options for using more threads with XRootD, running multiple instances of the script with separate lists of data to transfer, and other methods for speeding up the process (see the sketch after this list).

          • We are still researching and testing. 

          • Maintaining stats on each run for review.

        • Since we have completed the migration of data off of one MD3460, we have started the data migration of another MD3460. We are monitoring this closely and performing checks after each portion is done.

        • We are using the MD3460 that no longer holds real data for tests and for planning the EL9 update of the rest of the storage.

          • Attempting to update this storage and preserve fake data as a test.

          • Upgrading to EL9 and noting any issues we may need to address. 

          • Considering our options going forward based on what we learn from this EL9 upgrade process.

          • We will need a separate partition template for our MD4084s compared to our R740s/R760s due to differences in partitioning. 
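
        One possible way to speed up the migration without touching the script's safety checks is to fan the transfer list out over several concurrent xrdcp processes. Below is a minimal sketch; the endpoints, list-file format, and worker count are assumptions, and only basic xrdcp command-line usage is assumed.

          #!/usr/bin/env python3
          # Minimal sketch: fan a transfer list out over several concurrent xrdcp
          # processes. The endpoints, list-file format, and worker count are
          # assumptions; only basic "xrdcp src dst" command-line usage is assumed.
          import subprocess
          import sys
          from concurrent.futures import ThreadPoolExecutor

          SRC = "root://old-storage.example.org/"   # hypothetical source (old MD3460 host)
          DST = "root://new-storage.example.org/"   # hypothetical destination server

          def copy(path):
              # --posc keeps the destination only on a successful close; --nopbar quiets output.
              cmd = ["xrdcp", "--nopbar", "--posc", SRC + path, DST + path]
              result = subprocess.run(cmd, capture_output=True, text=True)
              return path, result.returncode, result.stderr.strip()

          def main(list_file, workers=8):
              with open(list_file) as f:
                  paths = [line.strip() for line in f if line.strip()]
              failures = 0
              with ThreadPoolExecutor(max_workers=workers) as pool:
                  for path, rc, err in pool.map(copy, paths):
                      if rc != 0:
                          failures += 1
                          print(f"FAILED {path}: {err}", file=sys.stderr)
              print(f"{len(paths) - failures}/{len(paths)} files copied")

          if __name__ == "__main__":
              main(sys.argv[1])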

      • Failed Jobs SIGTERM

        • Jobs are continuing to fail due to SIGTERM. 

        • We increased the maxtime limit for the SWT2_CPB PQ to sixty hours, but jobs are hitting this limit. 

      • GGUS-Ticket-ID: #1000094 - Reallocate Scratchdisk Space to Datadisk

        • Waiting for DDM Ops to finish cleaning dark data, then will revisit this.

      • Misc

        • A circuit breaker on one bank of a PDU tripped during poor weather conditions. We investigated this and are redistributing where WNs are connected in this rack to prevent future issues. We are monitoring this PDU for further issues in case a replacement is required.

        • Gathering additional information and improving our documentation on power draw for future improvements and planning. 

        • We converted an internal monitoring map of compute slots from EL7 to EL9 and implemented it in our new monitoring system.

        • LEARN is performing maintenance on 9/18 and 9/19, which may briefly impact our network traffic. We scheduled a warning downtime in case we experience an outage during this window.

        • Scheduled routine preventative maintenance for our UPS on 10/6. 

        • The site has been running very well the past two weeks.


      OU:

      • Nothing to report, site running well
      • Some lost heartbeat jobs, but probably from the Harvester side, since we don't see any SLURM issues on our end
      • Sorry, can't make it in person this week