US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
-
11:00
→
11:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Verena Ingrid Martinez Outschoorn (University of Massachusetts (US))
News:
- Procurement + operations plan for FY26:
- Operations plans still needed in December
- We held a meeting last Friday about procurements with FY25 funds and about the situation with FY26.
- We decided to delay purchases with FY25 funds until February, unless there are urgent needs.
- We will have another meeting early February
- FY26 procurement plans should just describe any updates to purchases with FY25 funds and, as usual, retirements and estimated resources.
Operations:
- Site production during the last two weeks: AGLT2, MWT2, NET2, SWT2 (CPB, OU), TW
- Open tickets:
- NET2
- GGUS 3255 NET2_Amherst: jobs failing with "Job has reached the specified backoff limit" (BackoffLimitExceeded)
- A JIRA ticket was created to follow this issue.
Upcoming meetings:
- SuperComputing25 [Nov 16-25]
- Registration and abstract submission for CHEP 2026 [23-29 May in Bangkok, Thailand] are open. ATLAS abstracts are due to the CSC by November 19th, and the conference abstract deadline is the 19th of December.
- ISGC 2026 will be held from March 15-20 2026 and is now open for submission. The deadline is November 17th and again all abstracts should be sent to the CSC as soon as possible.
- The next LHCOPN/ONE meeting is proposed for 14-16 April 2026 in Canada. It will be hosted by CANARIE in a city not yet confirmed, possibly Ottawa.
- ATLAS S&C meeting [Feb 9-13 at CERN]
-
11:10
→
11:20
TW-FTT 10m
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
- 10/30: Low number of running slots caused by a PQ misconfiguration.
- Lower transfer efficiency mostly due to issues at Glasgow and QMUL.
- Deployed local Varnish server for Frontier and CVMFS access.
- The maximum storage space has been adjusted to the actual available capacity of 2.2 PB.
- The Condor and OS migration is in progress (thanks to Judith for the Puppet manifests!).
-
11:20
→
11:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Updated HTCondor to 25.0.3, together with firmware, kernel, AFS, Lustre, and ZFS updates.
- The UM site has updated CVMFS to 2.13.3 and no longer sees the cvmfs_probe hang error. The MSU site will follow soon.
- Migrated the Tier2 NFS server (which provides the Tier2 user home area) to EL9 without interrupting the service.
-
11:30
→
11:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
11:40
→
11:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Verena Ingrid Martinez Outschoorn (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
-
11:50
→
12:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Experienced a power outage on 11/6/2025. We performed our standard procedures for bringing the site back online and were back online six hours later.
- Experienced an issue with the Varnish server, but brought it back online.
- Fixed the test cluster at a later time.
- Communicated with campus facilities to work toward better preparing for power incidents in the future.
- Continuing to communicate with campus networking concerning better alerting during power events.
- Made improvements during the downtime that we could not have made otherwise.
- Continuing the EL7 to EL9 migration. Performing tests with rebuilding EL7 storage as EL9 and testing Puppet modules.
- Continuing to enable safety shutdown of iDRAC on worker nodes in the event of a power outage.
- Experienced some issues with the test cluster not receiving the latest CRLs, but we resolved this.
- Finalized the purchase of replacement 1G switches. We are waiting for delivery.
OU:
- Still seeing crashed nodes from misbehaving hi-mem jobs
- Waiting to hear an update from OSCER admins about status of new SLURM controller to test killing jobs via cgroups v2
- CVMFS 2.13.3 seems to behave nicely