US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Rafael is busy today so I am stepping in.
      • Great running over the last couple of weeks
        • Chasing a problematic request (64837) with Ewelina L. that may be causing CVMFS issues. The request definitely has jobs that repeatedly reach the wall time limit.
        • Various 
      • Please don't order equipment until we have a better understanding of our finances.
        • For equipment we currently have FY25 funding and $0 for FY26; the infrastructure money is unknown.
          • John Hobbs reports that the markup of the FY26 spending bill is more favorable than the administration's budget request.
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))

      1. Generally, the site is running smoothly, except for low incoming data in recent days.

      2. Scheduled downtime: the site will be shut down for 60 hours starting 23:00 on 20 Sep. (UTC) for routine power-system maintenance at Academia Sinica. Site recovery will start on Monday morning (local time) and the site might be back online earlier than planned.

    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

        Ticket GGUS-1000556
          quickly found that one pool node (mufs17) had restarted
          dCache was started and the problem was resolved

        Noticed some worker nodes had a higher fraction of idle CPU.
          BOINC should at least be using cycles left unused by jobs.
          Found BOINC zombie processes that we could not clear (see the sketch below).
          Drained and rebooted 8 nodes.
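
        Zombie processes can only be reaped by their parent process (or cleared by a reboot), but spotting affected nodes is easy to script. Below is a minimal sketch, assuming a Linux /proc filesystem and that the BOINC client processes have "boinc" in their name; it is illustrative, not our production tooling.

          #!/usr/bin/env python3
          # Minimal sketch: list zombie (state 'Z') processes whose name contains
          # "boinc", together with the parent PID that should be reaping them.
          # Assumes a Linux /proc filesystem; the "boinc" name match is an assumption.
          import os

          def zombie_boinc():
              found = []
              for entry in os.listdir("/proc"):
                  if not entry.isdigit():
                      continue
                  try:
                      with open(f"/proc/{entry}/stat") as f:
                          stat = f.read()
                  except OSError:
                      continue  # process exited while we were scanning
                  # Format is "pid (comm) state ppid ..."; split on the closing
                  # parenthesis because comm may itself contain spaces.
                  head, _, rest = stat.partition(") ")
                  name = head.split("(", 1)[-1]
                  fields = rest.split()
                  state, ppid = fields[0], fields[1]
                  if state == "Z" and "boinc" in name.lower():
                      found.append((int(entry), int(ppid), name))
              return found

          if __name__ == "__main__":
              for pid, ppid, name in zombie_boinc():
                  print(f"zombie {name} pid={pid} ppid={ppid}")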

        CVMFS switching issue much improved, likely resolved
          after Ilija increased the maximum number of objects deleted during cache cleaning

        Updated BOINC control scripts to run on leftover EL7 nodes
          (MSU T3 and retired T2 nodes);
          these do not run Condor jobs but are allowed to run BOINC until they are updated to EL9.

        PDU issue at UM not resolved,
          but we managed to power on 5 more worker nodes;
          only 3 worker nodes remain shut down.

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant
      • Testing new pilot
        • Set the pilot memory limit to 1.4x the job request; will set the local Condor limit to 2x.
        • Found that the pilot does not seem to be running the job's processes in the proper cgroup, although the memory maximum is set (see the sketch after this list).
      • Will be rolling out kernel updates for CVE-2025-38352 at IU and UC
      • IU UPS maintenance today. Some compute will be down, but nothing worth declaring a downtime.
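
      To debug the cgroup question it helps to look directly at the slot cgroup: check that a memory limit is actually set and that the payload PIDs are attached to it. Below is a minimal sketch assuming cgroup v2; the cgroup path passed on the command line is site-specific and hypothetical.

        #!/usr/bin/env python3
        # Minimal sketch (cgroup v2 assumed): check that a job slot's cgroup has a
        # memory limit set and that payload processes are actually attached to it.
        # The cgroup path given on the command line is site-specific/hypothetical.
        import sys

        def read(path):
            with open(path) as f:
                return f.read().strip()

        def check_cgroup(cg):
            mem_max = read(f"{cg}/memory.max")        # the string "max" means unlimited
            mem_cur = int(read(f"{cg}/memory.current"))
            procs = read(f"{cg}/cgroup.procs").split()
            print(f"memory.max     = {mem_max}")
            print(f"memory.current = {mem_cur / 2**30:.2f} GiB")
            print(f"attached PIDs  = {len(procs)} {procs[:10]}")
            if mem_max == "max":
                print("WARNING: no memory limit applied to this cgroup")
            if not procs:
                print("WARNING: no processes attached to this cgroup")

        if __name__ == "__main__":
            check_cgroup(sys.argv[1].rstrip("/"))
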
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
      • Operations

      Efficiency dips in the past two weeks are due to small OKD upgrades (much easier to do if we are not also trying to do firmware upgrades on the servers). We expect another one of these, for an important bugfix, sometime in the next couple of weeks. After that, it should be a while before we need another one.

      We are seeing many fewer backoff errors following the upgrade, but they have not entirely gone away; we believe a fix is also needed in the pilot to ensure that it correctly identifies the available resources. We will resume discussions with Paul and Fernando on this topic this week.

      • Tape

      Total usage is at 11.5 PB and increasing: https://monit-grafana.cern.ch/goto/Spl7J2CHR?orgId=17

      • Varnish

      Using new local Varnish deployment as suggested by Ilija, working well so far.  Also serving BNL now.

      • Tickets

      GGUS 3255: discussion with Paul and Fernando is ongoing; it was paused for vacation time and will be resumed this week.

      • SENSE deployment

      SENSE tests with a dCache door dedicated to the SENSE workflow worked, connecting to the production instance. Performance tests are ongoing. Images for dCache doors and for gfal clients have been built and are available in our registry to automate the tests.
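
      For the automated tests, a minimal smoke test can be as simple as copying one file through the dedicated door and removing it again. The sketch below assumes only standard gfal-copy / gfal-rm command-line usage; the door hostname and destination path are placeholders, not our real endpoints.

        #!/usr/bin/env python3
        # Minimal smoke-test sketch: copy one file through the SENSE-dedicated dCache
        # door and delete it again. The door hostname and destination path are
        # placeholders; only basic gfal-copy / gfal-rm command-line usage is assumed.
        import subprocess
        import uuid

        DOOR = "davs://sense-door.example.org:2880"                   # hypothetical door
        DEST = "/pnfs/example.org/data/atlasscratchdisk/sense-tests"  # hypothetical path

        def run(cmd):
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        def smoke_test(local_file="/etc/hostname"):
            remote = f"{DOOR}{DEST}/smoke-{uuid.uuid4()}"
            run(["gfal-copy", f"file://{local_file}", remote])  # upload via the door
            run(["gfal-rm", remote])                            # clean up the test file

        if __name__ == "__main__":
            smoke_test()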

      • Network

      Testing ECMP deployment (working with Juniper to make sure we did the right thing). We are reaching 400 Gb/s far too often in production: https://dashboard.stardust.es.net/goto/DI9jBhjHR?orgId=2

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Data Migration

        • The new migration script is working as intended in terms of safety and logging, but the speed is slower than we expected. We have not lost any data since we resumed data migration. We are investigating this slowness and trying to find ways to speed it up. We are contacting XRootD experts for any advice and suggestions.

          • We have researched options for using more threads with XRootD, running multiple instances of the script with separate lists of data to transfer, and other methods for speeding up the process (see the sketch after this list).

          • We are still researching and testing. 

          • Maintaining stats on each run for review.

        • Since we have completed the migration of data off of one MD3460, we have started the data migration of another MD3460. We are monitoring this closely and performing checks after each portion is done.

        • We are using the MD3460 that no longer holds real data for tests and for planning the EL9 update of the rest of the storage.

          • Attempting to update this storage and preserve fake data as a test.

          • Upgrading to EL9 and noting any issues we may need to address. 

          • Considering our options going forward based on what we learn from this EL9 upgrade process.

          • We will need a separate partition template for our MD4084s compared to our R740s/R760s due to differences in partitioning. 
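
        One possible way to speed up the migration without touching the script's safety checks is to fan the transfer list out over several concurrent xrdcp processes. Below is a minimal sketch; the endpoints, list-file format, and worker count are assumptions, and only basic xrdcp command-line usage is assumed.

          #!/usr/bin/env python3
          # Minimal sketch: fan a transfer list out over several concurrent xrdcp
          # processes. The endpoints, list-file format, and worker count are
          # assumptions; only basic "xrdcp src dst" command-line usage is assumed.
          import subprocess
          import sys
          from concurrent.futures import ThreadPoolExecutor

          SRC = "root://old-storage.example.org/"   # hypothetical source (old MD3460 host)
          DST = "root://new-storage.example.org/"   # hypothetical destination server

          def copy(path):
              # --posc keeps the destination only on a successful close; --nopbar quiets output.
              cmd = ["xrdcp", "--nopbar", "--posc", SRC + path, DST + path]
              result = subprocess.run(cmd, capture_output=True, text=True)
              return path, result.returncode, result.stderr.strip()

          def main(list_file, workers=8):
              with open(list_file) as f:
                  paths = [line.strip() for line in f if line.strip()]
              failures = 0
              with ThreadPoolExecutor(max_workers=workers) as pool:
                  for path, rc, err in pool.map(copy, paths):
                      if rc != 0:
                          failures += 1
                          print(f"FAILED {path}: {err}", file=sys.stderr)
              print(f"{len(paths) - failures}/{len(paths)} files copied")

          if __name__ == "__main__":
              main(sys.argv[1])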

      • Failed Jobs SIGTERM

        • Jobs are continuing to fail due to SIGTERM. 

        • We increased the maxtime limit for the SWT2_CPB PQ to sixty hours, but jobs are hitting this limit. 

      • GGUS-Ticket-ID: #1000094 - Reallocate Scratchdisk Space to Datadisk

        • Waiting for DDM Ops to finish cleaning dark data, then will revisit this.

      • Misc

        • A circuit breaker on one bank of a PDU tripped during poor weather conditions. We investigated this and are redistributing where WNs are connected in this rack to prevent future issues. We are monitoring this PDU for further issues in case a replacement is required.

        • Gathering additional information and improving our documentation on power draw for future improvements and planning. 

        • We converted an internal monitoring map of compute slots from EL7 to EL9 and implemented it in our new monitoring system.

        • LEARN is performing maintenance on 9/18 and 9/19, which may briefly impact our network traffic. We scheduled a warning downtime in case we experience an outage during this window.

        • Scheduled routine preventative maintenance for our UPS on 10/6. 

        • The site has been running very well the past two weeks.


      OU:

      • Nothing to report, site running well
      • Some lost heartbeat jobs, but probably from the Harvester side, since we don't see any SLURM issues on our end
      • Sorry, can't make it in person this week