US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
10:00 → 10:10
Top of the meeting discussion (10m)
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
- Good production for the last couple of weeks.
- Good progress on EL9 updates.
- AGLT2 MSU is getting closer.
- MWT2 Illinois finished on Jan 31.
- CPB finished all servers except storage.
- It looks like we are in generally good shape for software updates - see the services table: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
- Sites using Puppet got a nasty surprise a couple of weeks ago, so there is one service item that is still an issue.
- Subject to management approval, the operations and procurement plans will be due on March 31.
- The funding levels are known for this year (subject to DOGE effects).
- We have a deadline of 28 Feb to provide estimates of how much money it will take to establish 400 G WAN connectivity by 2029 (i.e. for Run 4/HL-LHC).
10:10 → 10:20
TW-FTT (10m)
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
Status report of TW-FTT:
1. Networking
1) Submarine cable problem between 08:26 and 09:32 UTC on 11 Feb 2025.
2) Scheduled maintenance between 08:14 and 14:07 UTC on 18 Feb 2025.
2. Data transmission in Feb 2025 (through 18 Feb): total inbound and outbound traffic reached 191.7 TB, of which 6% was inbound.
3. The plan to bring another 1,200 CPU cores online under AlmaLinux 9 was delayed due to the manpower situation.
10:20 → 10:30
AGLT2 (10m)
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Continuing to implement jumbo frames at UM:
- Solved the problematic nodes connected to an old switch that lacked some jumbo frame configuration.
- Still seeing problems with iDRAC.
- need to update SLATE and NRP nodes at UM
- EL9 provisioning at MSU:
MSU Satellite permissions granted.
MSU Satellite and AGLT2-MSU Capsule configuration done.
First worker node definition finally successful in Satellite.
Currently working on one more DNS workaround for a bug/limitation.
Expect to have first node built today.
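A quick sanity check for a jumbo-frame rollout like the one above is a do-not-fragment ping sized to the full 9000-byte MTU; this is a generic sketch, not AGLT2's actual procedure, and the target host name is a placeholder:

```shell
# A 9000-byte MTU leaves 9000 - 20 (IP header) - 8 (ICMP header) = 8972
# bytes of ICMP payload.
MTU=9000
PAYLOAD=$((MTU - 28))
echo "max ICMP payload for MTU $MTU: $PAYLOAD"
# With "do not fragment" set, the ping fails at any hop that lacks
# jumbo-frame configuration ("remote-storage-host" is a placeholder):
# ping -M do -c 3 -s "$PAYLOAD" remote-storage-host
```

A hop that silently drops oversized frames shows up here as 100% packet loss rather than an ICMP "frag needed" error, which is why problem switches can be hard to spot from normal traffic.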
10:30 → 10:40
MWT2 (10m)
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Testing cgroups configuration for condor sites to relay back to Paul
- Waiting on rails to rack the UChicago storage purchase
- IU downtime scheduled for tomorrow
- Migrating our configuration management to OpenVox
- UIUC workers upgraded to EL9 as part of the NCSA datacenter move
- Storage filled up on Feb 4 due to slow Rucio deletions. Deletion rates appear to have improved since one of the most recent Rucio patches
- Working with the UC and IU networking teams to discuss the 400Gbps networking plans
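The cgroups testing mentioned above typically centers on a couple of condor_startd knobs; a minimal sketch of worker-node configuration (the values are illustrative, not MWT2's actual settings):

```
# Place each job in its own cgroup under this parent (illustrative value).
BASE_CGROUP = htcondor
# Enforce the job's memory request as a hard cgroup limit;
# "soft" or "none" relax the enforcement.
CGROUP_MEMORY_LIMIT_POLICY = hard
```

Whether "hard" kills jobs promptly at their memory request or lets them burst into swap differs across HTCondor versions and cgroup v1 vs v2, which is presumably part of what site testing needs to pin down before reporting back.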
10:40 → 10:50
NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
1) During the Jumbo Frame Capacity challenge last week, changing the maximum allowed concurrent transfers in the FTS configuration for NET2 revealed an issue with the dCache load-balancing policy when a large number of requests (large relative to the number of dCache pools) arrive simultaneously. We are currently investigating this issue. For now, we have reverted to the previous maximum in the FTS configuration for NET2, and since making this change yesterday we have not observed any further errors of this nature. We are planning a test sequence for the production WebDAV doors using some parameters suggested by Judith.
2) This is done for NET2
3) Three servers (one computing, two storage) are racked but still not available in the pools because they are being used for evaluation. The work is a bit late, but we will make them available as soon as possible.
10:50 → 11:00
SWT2 (10m)
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
OU:
- I won't be able to join, sorry
- We are having a scheduled OSCER maintenance Wed 8am till 11pm; ceph upgrade among other things; I scheduled an OSG downtime.
- Other than that, nothing to report.
SWT2_CPB:
Network
- Met with campus networking to discuss plans for the network upgrade.
- Ongoing internal discussions and planning for internal network improvements.
EL9 Migration
- Major EL9 upgrades for Condor-CE, Slurm, and worker nodes have been running smoothly.
- Have been consistently running roughly 18K job slots.
- Production jobs have been experiencing very low error rates.
- Discussing and planning next steps.
Transfer Issue
- Discovered transfer requests incorrectly using SWT2_DATADISK as both the source and destination, causing errors.
- Ivan connected us with ACT experts for support. Waiting for further details.
Harvester Issue - Drain
- The site was drained on Wednesday (2/5) due to an issue with one of the harvesters. Compared to other sites, ours remained drained for an additional twelve hours.
- We started receiving jobs again on Friday (2/7).
- No changes were made before or during the issue; it resolved on its own.
- Waiting for expert analysis to determine the cause.
GGUS Tickets
- 162991: Continuing to work with campus networking to address this request. We previously held a meeting, opened a ticket in their system to improve their tracking of our request, and maintain regular follow-ups. Awaiting further assistance from their team.
- 168756: Waiting on more information from ESnet and someone from the state network provider (LEARN). They have concluded the issue is likely in the DE cloud routing.
Storage
- Continuing work on storage deployment.