US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
10:00 → 10:10 Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
Discussion points during the meeting:
- TW-FTT: The CRL was updated and Fred reported no errors since then; 150 TB transferred in the last 7 days without errors.
- AGLT2: Bug in the RH Satellite at MSU, with clear evidence of the bug in the installations. The plan is to move to v6.15, the version used at UofM, but that will only happen on June 10th.
- AGLT2: Trying to understand whether dCache 10 can be patched to enable production-version fireflies; otherwise the feature is only available in v11.
- OU_OSCER: Recurrent problems with storage overload will only be fixed when the new Ceph storage is online. The recent new DTN improved the situation a bit, but transfers are limited to 4 GB/s.
- CPB: Only site running 16-core jobs. Ivan says those are favored by CMS. The queue should be filled with 8-core jobs when 16-core jobs are not available.
- CPB: Still configuring the test cluster in CRIC. A backup CE is configured, but it is not used in load-balancing mode so as not to mask problems. The main problem was not the CE but a CRIC configuration.
- NET2: Studying which services are not ready for IPv6-only operation. Eduardo says that a test cluster should be set up just for that; otherwise no one will do it (see the connectivity-check sketch after this list).
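As a starting point for the IPv6-only readiness survey, a small probe along the following lines could flag services that either lack AAAA records or do not accept connections over IPv6. This is a minimal sketch; the hostnames and ports are hypothetical examples, not NET2's actual service list.

#!/usr/bin/env python3
"""Minimal IPv6-only readiness probe (sketch; endpoints below are made-up examples)."""
import socket

# Hypothetical endpoints to survey; replace with the site's real service list.
SERVICES = [("gridftp.example.edu", 2811), ("xrootd.example.edu", 1094), ("ce01.example.edu", 9619)]

def ipv6_ok(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the host has an AAAA record and accepts a TCP connection over IPv6."""
    try:
        # Restrict resolution to IPv6 (AAAA) addresses only.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record at all
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue
    return False

if __name__ == "__main__":
    for host, port in SERVICES:
        status = "OK" if ipv6_ok(host, port) else "NOT IPv6-ready"
        print(f"{host}:{port} -> {status}")

Running this from an IPv6-only test node (rather than a dual-stack one) would give the most realistic picture of which services would actually break.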
10:10 → 10:20
10:20 → 10:30 AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
EL9 at MSU
Satellite problems: a freshly defined node usually builds correctly the first time but fails on subsequent builds
(it should copy the user SSH keys to the root account, create bonded interfaces, etc., but does not; see the verification sketch at the end of this report)
Last week asked MSU IT to update RH Satellite from v6.13 to v6.15
The current release is 6.17, but the upgrade will pause at 6.15 (the version used at UM)
However, MSU IT is currently in the middle of "production patching", this week and next
Submitted a change request: upgrade to 6.14 on Tue 6/10, then verify, then to 6.15 on 6/12
Will find ways to limp along and make progress in the meantime.
OpenSearch: improving condor/boinc/squid monitoring
PDU problem: still looking for a solution
All EL9 worker nodes, the Condor head node, and the Condor CEs are now on OSG 24
Deleted dark data from aglt2datadisk (FTS test data and old monitoring info)
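Since the observed failure mode is nodes coming up without the root SSH keys or the bonded interface, a small post-build check along the following lines could flag partially provisioned nodes before they go into service. This is only a sketch; the authorized_keys path and the bond name bond0 are assumptions, not the actual AGLT2/MSU configuration.

#!/usr/bin/env python3
"""Post-build sanity check for Satellite-provisioned nodes (sketch).

Assumptions: admin public keys are expected in /root/.ssh/authorized_keys and
the bonded interface is named bond0; adjust both to the site's real setup.
"""
import os
import sys

AUTHORIZED_KEYS = "/root/.ssh/authorized_keys"   # assumed location of root's keys
BOND_INTERFACE = "bond0"                          # assumed bond name

def root_keys_present() -> bool:
    """True if root's authorized_keys exists and contains at least one key entry."""
    try:
        with open(AUTHORIZED_KEYS) as f:
            return any(line.strip() and not line.startswith("#") for line in f)
    except OSError:
        return False

def bond_configured() -> bool:
    """True if the kernel reports the bonded interface."""
    return os.path.isdir(f"/sys/class/net/{BOND_INTERFACE}")

if __name__ == "__main__":
    problems = []
    if not root_keys_present():
        problems.append(f"no SSH keys in {AUTHORIZED_KEYS}")
    if not bond_configured():
        problems.append(f"interface {BOND_INTERFACE} not present")
    if problems:
        print("Provisioning incomplete: " + "; ".join(problems))
        sys.exit(1)
    print("Provisioning checks passed.")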
10:30 → 10:40 MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
10:40 → 10:50 NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Efforts to upgrade OKD are ongoing. Not all targeted issues have been resolved yet, and investigations are continuing.
Tape operations are ongoing. Efforts are being made to improve throughput using the existing hardware, but it is difficult to predict total performance gains at this time.
Additionally, there is a focused effort to enhance the Quality of Service (QoS), as the level acceptable for a Tier-2 site may not be sufficient for current needs.
10:50 → 11:00 SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Actively communicating with OSG experts and the PanDA team about a bug in Condor: jobs are getting stuck in the Condor queue after completion. Recent discussions suggest this may be an issue with Harvester repeatedly losing contact with certain jobs (testing different condor-ce routing settings on the test cluster; see the condor_q sketch after this list). Separately, the primary issue with our site not filling up properly with jobs was due to a CRIC setting, which Fred is helping us with.
- Implemented a second CE as a backup and to allow downtime maintenance of our main CE whenever needed. This second CE is operating with a 100-job maximum limit.
- Performed tests of our EL9 storage on the test module, using an environment similar to the production cluster.
- Coordinating with DDM Ops to add a second RSE and other components so we can use our test cluster separately to simulate production when staging changes. They added the RSE on Monday (5/26/2025), so we are close to having this completed.
- Our internal monitoring for Slurm and both CEs is now in place (thanks to Judith for the links and help).
- For EL9, developed other appliances, but waiting to test them in the test cluster before implementing.
- Concerning the GGUS ticket to enable network monitoring, I sent a follow-up message to multiple members of campus networking last week. They said they would discuss it last week and move forward on our request. Waiting for a follow-up.
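While the Condor bug is being investigated, one quick way to spot jobs lingering in the queue after completion is to ask the schedd for completed jobs older than some threshold. A rough sketch, assuming condor_q is on the PATH and using standard ClassAd attributes (JobStatus 4 = Completed, EnteredCurrentStatus = time the job entered that state); the one-hour threshold is an arbitrary example value.

#!/usr/bin/env python3
"""List completed jobs still sitting in the Condor queue (sketch)."""
import subprocess
import time

THRESHOLD_SECONDS = 3600  # flag completed jobs still queued after one hour (example value)

def stuck_completed_jobs():
    # Query the local schedd for completed jobs and print ClusterId, ProcId, EnteredCurrentStatus.
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 4",
         "-af", "ClusterId", "ProcId", "EnteredCurrentStatus"],
        capture_output=True, text=True, check=True,
    ).stdout
    now = time.time()
    stuck = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip malformed or undefined entries
        cluster, proc, entered = fields
        age = now - int(entered)
        if age > THRESHOLD_SECONDS:
            stuck.append((f"{cluster}.{proc}", int(age)))
    return stuck

if __name__ == "__main__":
    for job_id, age in stuck_completed_jobs():
        print(f"{job_id} completed {age // 60} min ago but is still in the queue")

Run periodically (e.g. from cron on the CE or submit host), this would give a simple count of how often the stuck-job condition recurs while the condor-ce routing changes are being tested.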
OU:
- Occasional storage overload, caused by either heavy I/O jobs or massive WAN transfers. Should subside again eventually.
- Other than that, stable operations.