US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
10:00 → 10:10
Top of the meeting discussion (10m)
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
- Good production for the last couple of weeks.
- Good progress on EL9 updates.
- AGLT2 MSU is getting closer.
- MWT2 Illinois finished on Jan 31.
- CPB finished all servers except storage.
- It looks like we are in generally good shape for software updates - see the services table: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
- Sites using Puppet got a nasty surprise a couple of weeks ago, so there is one service item that is still an issue.
- Subject to management approval, the operations and procurement plans will be due on March 31.
- The funding levels are known for this year (subject to DOGE effects).
- We have a deadline of 28 Feb to provide estimates of how much money it will take to establish 400 G WAN connectivity by 2029 (i.e. for Run 4/HL-LHC).
10:10 → 10:20
TW-FTT (10m)
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
Status report of TW-FTT:
1. Networking
1) Submarine cable problem between 08:26 and 09:32 UTC on 11 Feb 2025.
2) Scheduled maintenance between 08:14 and 14:07 UTC on 18 Feb 2025.
2. Data transmission in Feb 2025 (through 18 Feb): total inbound and outbound traffic reached 191.7 TB, of which 6% was inbound.
3. The plan to bring another 1,200 CPU cores online under AlmaLinux 9 was delayed due to the manpower situation.
10:20 → 10:30
AGLT2 (10m)
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Continuing to implement jumbo frames at UM:
- Solved the problematic nodes connected to an old switch that lacked some jumbo frame configuration.
- Still seeing problems with iDRAC.
- need to update SLATE and NRP nodes at UM
- EL9 provisioning at MSU:
MSU Satellite permissions granted.
MSU Satellite and AGLT2-MSU Capsule configuration done.
First worker node definition finally successful in Satellite.
Currently working on one more DNS workaround for a bug/limitation.
Expect to have first node built today.
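A quick sanity check for a jumbo-frame rollout like the one above is a do-not-fragment ping sized to the full 9000-byte MTU; this is a generic sketch, not AGLT2's actual procedure, and the target host name is a placeholder:

```shell
# A 9000-byte MTU leaves 9000 - 20 (IP header) - 8 (ICMP header) = 8972
# bytes of ICMP payload.
MTU=9000
PAYLOAD=$((MTU - 28))
echo "max ICMP payload for MTU $MTU: $PAYLOAD"
# With "do not fragment" set, the ping fails at any hop that lacks
# jumbo-frame configuration ("remote-storage-host" is a placeholder):
# ping -M do -c 3 -s "$PAYLOAD" remote-storage-host
```

A hop that silently drops oversized frames shows up here as 100% packet loss rather than an ICMP "frag needed" error, which is why problem switches can be hard to spot from normal traffic.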
10:30 → 10:40
MWT2 (10m)
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Testing cgroups configuration for condor sites to relay back to Paul
- Waiting on rails to rack the UChicago storage purchase
- IU downtime scheduled for tomorrow
- Migrating our configuration management to OpenVox
- UIUC workers upgraded to EL9 as part of the NCSA datacenter move
- Storage filled up on Feb 4 due to slow Rucio deletions. Deletion rates appear to have improved since one of the most recent Rucio patches
- Working with the UC and IU networking teams to discuss the 400Gbps networking plans
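The cgroups testing mentioned above typically centers on a couple of condor_startd knobs; a minimal sketch of worker-node configuration (the values are illustrative, not MWT2's actual settings):

```
# Place each job in its own cgroup under this parent (illustrative value).
BASE_CGROUP = htcondor
# Enforce the job's memory request as a hard cgroup limit;
# "soft" or "none" relax the enforcement.
CGROUP_MEMORY_LIMIT_POLICY = hard
```

Whether "hard" kills jobs promptly at their memory request or lets them burst into swap differs across HTCondor versions and cgroup v1 vs v2, which is presumably part of what site testing needs to pin down before reporting back.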
10:40 → 10:50
NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
1) During the Jumbo Frame Capacity challenge last week, changing the maximum allowed concurrent transfers in the FTS configuration for NET2 revealed an issue with the dCache load-balancing policy when a large number of requests (large relative to the number of dCache pools) arrive simultaneously. We are currently investigating this issue. For now, we have reverted to the previous maximum in the FTS configuration for NET2, and since making this change yesterday we have not observed any further errors of this nature. We are planning a test sequence for the production WebDAV doors using some parameters suggested by Judith.
2) This is done for NET2
3) Three servers (one computing, two storage) are racked but still not available in the pools because they are being used for evaluation. The work is a bit late, but we will make them available as soon as possible.
10:50 → 11:00
SWT2 (10m)
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
OU:
- I won't be able to join, sorry
- We are having a scheduled OSCER maintenance Wed 8am till 11pm; ceph upgrade among other things; I scheduled an OSG downtime.
- Other than that, nothing to report.
SWT2_CPB:
Network
- Met with campus networking to discuss plans for the network upgrade.
- Ongoing internal discussions and planning for internal network improvements.
EL9 Migration
- Major EL9 upgrades for Condor-CE, Slurm, and worker nodes have been running smoothly.
- Have been consistently running roughly 18K job slots.
- Production jobs have been experiencing very low error rates.
- Discussing and planning next steps.
Transfer Issue
- Discovered transfer requests incorrectly using SWT2_DATADISK as both the source and destination, causing errors.
- Ivan connected us with ACT experts for support. Waiting for further details.
Harvester Issue - Drain
- The site was drained on Wednesday (2/5) due to an issue with one of the harvesters. Compared to other sites, ours remained drained for an additional twelve hours.
- We started receiving jobs again on Friday (2/7).
- No changes were made before or during the issue; it resolved on its own.
- Waiting for expert analysis to determine the cause.
GGUS Tickets
- 162991: Continuing to work with campus networking to address this request. We previously held a meeting, opened a ticket in their system to improve their tracking of our request, and maintain regular follow-ups. Awaiting further assistance from their team.
- 168756: Waiting on more information from ESnet and someone from the state network provider (LEARN). They have concluded the issue is likely in the DE cloud routing.
Storage
- Continuing work on storage deployment.