US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
-
11:00
→
11:10
Introduction (10m). Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Rafael is busy today so I am stepping in.
- Great running over the last couple of weeks
- Chasing a problematic request (64837) with Ewelina L. that may be causing CVMFS issues. The request definitely has jobs that repeatedly reach the wall time limit.
- Various
- Please don't order equipment until we have a better understanding of our finances.
- Currently for equipment we have FY25 funding, $0 for FY26, and the infrastructure money is unknown.
- John Hobbs reports the markup of the FY26 spending bill is more favorable than the administration's budget request.
-
11:10
→
11:20
TW-FTT (10m). Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
1. Generally, the site is running smoothly, except for a low volume of incoming data in recent days.
2. Scheduled downtime: the site will be shut down for 60 hours from 23:00 on 20 Sep. (UTC) because of routine power system maintenance at Academia Sinica. Site recovery will start on Monday morning (local time), and the site might come back online earlier than planned.
-
11:20
→
11:30
AGLT2 (10m). Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Ticket GGUS-1000556: quickly found that one pool node (mufs17) had restarted; dCache was started and the problem was resolved.
- Noticed some worker nodes had a higher fraction of idle CPU. BOINC should at least be using the cycles unused by jobs. Found BOINC zombie processes that we couldn't clear (see the sketch after these notes); drained and rebooted 8 nodes.
- CVMFS switching issue much improved, and likely resolved, after Ilija increased the maximum number of objects deleted during cache cleaning.
- Updated the BOINC control scripts to run on leftover EL7 nodes (MSU T3, retired T2 nodes) that are not running Condor jobs but are allowed to run BOINC until they are updated to EL9.
- PDU issue at UM not resolved, but we managed to power on 5 more worker nodes; only 3 worker nodes remain shut down.
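For reference, a minimal sketch (not AGLT2's actual tooling; the process-name match is an assumption) of how one might list defunct BOINC processes on a worker node and identify the parent that would need restarting to reap them:

#!/usr/bin/env python3
"""List zombie (defunct) BOINC processes and their parents via /proc."""
import os

def read_status(pid):
    """Return /proc/<pid>/status fields as a dict, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            fields = dict(line.split(":\t", 1) for line in f if ":\t" in line)
        return {k: v.strip() for k, v in fields.items()}
    except OSError:
        return None

def find_zombies(name_substr="boinc"):   # name match is a guess, adjust as needed
    """Yield (pid, name, ppid) for zombie processes whose name contains name_substr."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        st = read_status(pid)
        if st and name_substr in st.get("Name", "") and st.get("State", "").startswith("Z"):
            yield pid, st["Name"], st.get("PPid", "?")

if __name__ == "__main__":
    for pid, name, ppid in find_zombies():
        parent = read_status(ppid) or {}
        print(f"zombie {name} (pid {pid}), parent {ppid} ({parent.get('Name', '?')})")
-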
11:30
→
11:40
MWT2 (10m). Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Testing new pilot
- Set the pilot memory limit to 1.4x the job request. Will set the local Condor limit to 2x.
- Found that the pilot does not appear to be running the job's processes in the proper cgroup, although the maximum memory is set (a diagnostic sketch follows this list).
- Will be rolling out kernel updates for CVE-2025-38352 at IU and UC
- IU UPS maintenance today. Some compute will be down, but nothing that warrants declaring a downtime.
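For reference, a minimal diagnostic sketch (not pilot or Condor code; assumes cgroup v2 mounted at /sys/fs/cgroup) of how one can check which cgroup a payload process is actually in and what memory.max applies there:

#!/usr/bin/env python3
"""Show the cgroup (v2) of a process and the memory.max limit set on that cgroup."""
import sys

def cgroup_of(pid):
    """Return the cgroup v2 path of a process (the '0::' line in /proc/<pid>/cgroup)."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            if line.startswith("0::"):
                return line.strip().split("::", 1)[1]
    return None

def memory_max(cgroup_path):
    """Return memory.max for the cgroup ('max' means no limit; the root cgroup has none)."""
    with open(f"/sys/fs/cgroup{cgroup_path}/memory.max") as f:
        return f.read().strip()

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"   # pass the payload PID to check
    cg = cgroup_of(pid)
    print(f"pid {pid} is in cgroup {cg}")
    print(f"memory.max = {memory_max(cg)}")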
-
11:40
→
11:50
NET2 (10m). Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
- Operations
Efficiency dips in the past two weeks are due to small OKD upgrades (much easier to do if we are not also trying to do firmware upgrades on the servers). We expect another one of these upgrades, an important bugfix, sometime in the next couple of weeks. After that, it should be a while before we need another one.
We are seeing many fewer backoff errors so far following the upgrade, but they have not entirely gone away. We believe a fix is also needed in the pilot to ensure that it correctly identifies the available resources; we will resume discussions with Paul and Fernando on this topic this week.
- Tape
Total usage is at 11.5 PB and increasing: https://monit-grafana.cern.ch/goto/Spl7J2CHR?orgId=17
- Varnish
Using the new local Varnish deployment suggested by Ilija; it is working well so far and is also serving BNL now (a simple cache check is sketched after the NET2 items below).
- Tickets
GGUS 3255: discussion with Paul and Fernando is ongoing; it was paused during vacation time and will be resumed this week.
- Sense deployment
SENSE tests with a dCache door dedicated to the SENSE workflow worked when connected to the production instance. Performance tests are ongoing. Images for dCache doors and for gfal clients have been built and are available in our registry to automate the tests.
- Network
Testing the ECMP deployment (working with Juniper to make sure we did the right thing). We are reaching 400 Gb/s way too often in production: https://dashboard.stardust.es.net/goto/DI9jBhjHR?orgId=2
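For reference, a minimal sketch (the proxy address and test URL are placeholders, not the actual NET2 deployment; header names depend on the Varnish configuration) of checking that a repeated request through the local Varnish instance is served from cache:

#!/usr/bin/env python3
"""Fetch the same URL twice through a Varnish proxy and print cache-related headers."""
import urllib.request

PROXY = "http://varnish.example.local:6081"   # placeholder proxy address
URL = "http://example.org/"                   # placeholder object to fetch

opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": PROXY}))

for attempt in (1, 2):
    with opener.open(URL, timeout=10) as resp:
        headers = {k.lower(): v for k, v in resp.getheaders()}
        # On the second attempt, Age should be > 0 if the object was cached.
        print(f"attempt {attempt}: Age={headers.get('age', '-')} "
              f"Via={headers.get('via', '-')} X-Varnish={headers.get('x-varnish', '-')}")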
-
11:50
→
12:00
SWT2 (10m). Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Data Migration
  - The new migration script is working as intended in terms of safety and logging, but it is slower than we expected. We have not lost any data since we resumed data migration. We are investigating this slowness and trying to find ways to speed it up, and we are contacting XRootD experts for advice and suggestions.
  - We have researched options for using more threads with XRootD, running multiple instances of the script with separate lists of data to transfer, and other methods for speeding up the process (a rough sketch of one such approach is included after the SWT2_CPB items below).
  - We are still researching and testing.
  - We are maintaining statistics on each run for review.
  - Since we have completed the migration of data off of one MD3460, we have started the data migration of another MD3460. We are monitoring this closely and performing checks after each portion is done.
  - We are using the MD3460 that no longer holds real data for tests and for planning the EL9 update of the rest of the storage:
    - Attempting to update this storage while preserving the fake data as a test.
    - Upgrading to EL9 and noting any issues we may need to address.
    - Considering our options going forward based on what we learn from this EL9 upgrade process.
    - We will need a separate partition template for our MD4084s compared to our R740s/R760s due to differences in partitioning.
- Failed Jobs (SIGTERM)
  - Jobs are continuing to fail due to SIGTERM.
  - We increased the maxtime limit for the SWT2_CPB PQ to sixty hours, but jobs are still hitting this limit.
- GGUS-Ticket-ID: #1000094 - Reallocate Scratchdisk Space to Datadisk
  - Waiting for DDM Ops to finish cleaning dark data; we will then revisit this.
- Misc
  - One of the circuit breakers on a PDU bank tripped during poor weather conditions. We investigated this and are redistributing how the WNs in this rack are connected to prevent future issues. We are monitoring this PDU for any further issues in case a replacement is required.
  - Gathering additional information and improving our documentation on power draw for future improvements and planning.
  - We converted an internal monitoring map of compute slots from EL7 to EL9 and integrated it into our new monitoring system.
  - LEARN is performing maintenance on 9/18 and 9/19, which may briefly impact our network traffic. We scheduled a warning downtime in case we experience an outage during this window.
  - Scheduled routine preventive maintenance for our UPS on 10/6.
  - The site has been running very well over the past two weeks.
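For illustration only, and not the actual SWT2_CPB migration script: a rough sketch of the multiple-instances-over-separate-lists idea, running one xrdcp process per file from a small thread pool. Endpoints, list files, and the worker count are placeholders, and a real migration would also need the checksum verification, safety checks, and logging the production script already provides.

#!/usr/bin/env python3
"""Copy files named in one or more list files between XRootD endpoints in parallel."""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

SRC = "root://old-se.example.edu:1094"   # placeholder source endpoint
DST = "root://new-se.example.edu:1094"   # placeholder destination endpoint
WORKERS = 4                              # tune to what the storage arrays can sustain

def copy_one(path):
    """Copy a single file path from SRC to DST with xrdcp, overwriting if present."""
    cmd = ["xrdcp", "--force", f"{SRC}/{path}", f"{DST}/{path}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return path, result.returncode, result.stderr.strip()

def main(list_files):
    paths = []
    for list_file in list_files:          # one list file per "instance" of work
        with open(list_file) as f:
            paths += [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for path, rc, err in pool.map(copy_one, paths):
            status = "OK" if rc == 0 else f"FAILED ({err})"
            print(f"{status}  {path}")

if __name__ == "__main__":
    main(sys.argv[1:])   # e.g. python3 migrate_sketch.py list1.txt list2.txt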
-
OU:
- Nothing to report; site running well.
- Some lost heartbeat jobs, but probably from the Harvester side, since we don't see any SLURM issues on our end.
- Sorry, can't make it in person this week.
-