US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
-
11:00
→
11:10
Introduction (10m). Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
Sorry, no slides today. Top of the meeting introduction here.
News:
- Procurement and operations plans for FY26 are due in December, before the break.
- We should also review the procurement plan for FY25 (so that it can be approved and sites can buy equipment).
- Delaying the OSG 25 installation due to an HTCondor problem.
- New version of CVMFS in production: CernVM-FS 2.13.3
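A quick way for a site to confirm a worker node has picked up the new release; the repository name is only an example:

    # Report the installed client version on an RPM-based node
    rpm -q cvmfs

    # Verify the mounted repositories are healthy after the update
    cvmfs_config probe atlas.cern.ch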
Some quick links:
NET2
ggus 1000970 NET2 transfers fail with "Failed to select pool: All pools are full"
ggus 1000848 NET2: Transfer and Staging Errors
ggus 3255 NET2_Amherst: jobs failing with "Job has reached the specified backoff limit" (BackoffLimitExceeded)
NET2_LOCALGROUPDISK blacklisted in DDM (DISKSPACE).
SWT2_CPB
SWT2_CPB_TEST blacklisted in DDM (FT).
Upcoming meetings:
- SuperComputing25 [Nov 16-25]:
- Rucio Workshop [Nov 3-7]
- Registration and abstract submission for CHEP 2026 [23-29 May in Bangkok, Thailand] is open. ATLAS abstracts are due to the CSC by November 19th, and the conference abstract deadline is December 19th.
- ISGC 2026 will be held March 15-20, 2026, and is now open for abstract submission. The deadline is November 17th, and again all abstracts should be sent to the CSC as soon as possible.
- The next LHCOPN/ONE meeting is proposed for April 14-16, 2026 in Canada. The meeting will be hosted by CANARIE in a city not yet confirmed, possibly Ottawa.
-
11:10
→
11:20
TW-FTT (10m). Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
-
11:20
→
11:30
AGLT2 (10m). Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Updated dCache from 10.2.12 to 10.2.18, combined with firmware and other system updates
- Doing a rolling update from HTCondor 24.0.12 to 25.0.2 on the UM worker nodes
- One team ticket about transfer failures: the cause was missing SHA-1 support on the dCache pool nodes; fixed while rebooting the dCache pool nodes to apply the firmware update (see the sketch after this list)
- 15-day total of 9 files reported with pilot:1099 or pilot:1361 errors; declared lost. They were from datasets with multiple copies worldwide, and some were replaced by DDM.
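The ticket does not record the exact change, but if the pool nodes run EL9 (where the DEFAULT crypto policy rejects SHA-1 signatures), a likely shape of the fix is re-enabling SHA-1 in the system crypto policy; a minimal sketch, assuming an RPM-based dCache install managed by systemd:

    # Show the active system-wide crypto policy
    update-crypto-policies --show

    # Re-enable SHA-1 signature support via the SHA1 subpolicy, then restart dCache on the pool node
    update-crypto-policies --set DEFAULT:SHA1
    systemctl restart dcache.target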
-
11:30
→
11:40
MWT2 (10m). Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Updated Elasticsearch 9.0 -> 9.2
- Enabling MWT2_TEST queue to test the new condor_chirp functionality in the test pilot
- Updating non-condor machines to OSG25
- waiting for fixes before we start looking at updating other machines
- Discussed retirements and procurement at UIUC
- Drained and rebooted UIUC workers after the PM to pick up the updated kernel
- Continuing to gradually raise MeanRSS in CRIC (currently 2600)
- IU compute down for power circuit testing on 11/7 (No site downtime, just less capacity for the day)
- Tests of using condor_chirp to include the PanDA ID as a ClassAd are about to begin (see the sketch after this list).
- Currently the MWT2_TEST queue is online and configured similarly to the main MWT2 queue.
- MWT2_TEST is set to use the most recent development version of the pilot (3.11.0.29), which is believed to be capable of using condor_chirp. Apparently the previous officially released version of the pilot did not work correctly.
- However, no jobs are being submitted to the MWT2_TEST queue.
- The lack of submitted jobs will be discussed with Ivan later today or tomorrow.
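As an illustration of the mechanism being tested, a minimal sketch of pushing the PanDA id into the job ClassAd with condor_chirp; the attribute and environment variable names are hypothetical, not necessarily what the pilot uses:

    # From inside a running job, add the PanDA job id to the job's ClassAd
    condor_chirp set_job_attr PandaID "$PANDA_JOB_ID"

    # On the submit side, the attribute can then be queried alongside the HTCondor job ids
    condor_q -af:jh ClusterId ProcId PandaID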
-
11:40
→
11:50
NET2 (10m). Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Two tickets currently open:
683120, backoff limit failures: we have converged on a fix, which requires that the pod storage limit be exposed as an environment variable so that the pilot can use it to determine the available space. Fernando has added the necessary environment variables; now we are waiting on Paul (see the sketch after this list).
1000970, tape transfers failing: we will put the tape endpoint into a write downtime to allow enough space to clear up for the transfers to resume.
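For reference, a minimal sketch of how a pod's storage limit can be exposed to the pilot through the Kubernetes downward API; the variable and container names are hypothetical and may differ from what Fernando added:

    # Fragment of the pilot container spec (hypothetical names)
    env:
      - name: POD_STORAGE_LIMIT              # hypothetical variable name
        valueFrom:
          resourceFieldRef:
            containerName: pilot             # assumes the pilot container is named "pilot"
            resource: limits.ephemeral-storage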
Some job failures due to ongoing dCache/network upgrades and SENSE tests. Over the past week these have been eliminated thanks to the (temporary) addition of two powerful servers borrowed from NESE, which have made rebalancing and draining pools for upgrades go much more smoothly.
800 Gbps connection to NESE now available; the next step is to get 800 Gbps to ESnet.
-
11:50
→
12:00
SWT2 (10m). Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Continuing the EL7 to EL9 migration: migrating data, testing the rebuild of EL7 storage to EL9, and testing modules.
- Purchasing new switches to replace our 1G switches that reached end of life after the recent power incidents.
- Making progress on setting temperature thresholds on the iDRAC devices for safety; when a threshold is reached, the server will shut down to protect itself (see the sketch after this list).
- We met with a UPS vendor for preventative maintenance on 10/22/2025. Reviewing the reports and discussing any improvements we may need to make.
- Fixed part of a problem with emergency alerting, and are working to improve it further to avoid rare power incidents such as the one on 10/4/2025. Working on delivering alerts even if the network goes down, and added a new channel for emergency alerts.
- Added additional alerts for better visibility of operations.
- Investigated transfer issues that showed jobs failing at SWT2_CPB. It turned out to be an issue with the INFN-ROMA1 site, not SWT2: the jobs actually succeed and the files are stored on CPB_DATADISK. We suggested improvements for the future and believe these jobs should not be reported as failing at our site.
- Continuing to improve the inventory dump script.
- Continuing work on moving to a new alerting system, away from Nagios.
- Found failed FTS transfers across US sites and informed Hiro, who fixed an overfull log file problem at BNL.
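Related to the iDRAC temperature item above, a minimal sketch of how inlet-temperature readings and thresholds can be checked from the host or remotely via the iDRAC; sensor names, host, and credentials are placeholders, and this is not the actual SWT2 procedure:

    # Read temperature sensors and their thresholds through the host's IPMI interface
    ipmitool sdr type temperature

    # Or list sensors remotely through the iDRAC (host and credentials are placeholders)
    racadm -r <idrac-host> -u root -p '<password>' getsensorinfo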
OU:
- Still seeing crashing nodes because they run out of RAM.
- Will follow up with OSCER admins about the new SLURM version, which supports cgroups v2 (see the sketch after this list).
- Have updated the Salt configuration from osg23 to osg25.
- Will update CVMFS to the latest version to see whether that fixes the stuck D-state processes.
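For reference, a minimal sketch of the cgroup settings in newer SLURM releases that keep jobs within their memory allocation; the values are illustrative and not OSCER's actual configuration:

    # /etc/slurm/cgroup.conf (illustrative; slurm.conf must also set TaskPlugin=task/cgroup)
    CgroupPlugin=cgroup/v2
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0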
-
-