US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID: 67453565657
Host: Fred Luehring
    • 10:00 → 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 10:10 → 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix Hung-Te Lee (Academia Sinica (TW))

      Networking: 

      1. The bandwidth connecting to LHCONE through ESnet will be upgraded to 5 Gbps in April 2025.
      2. Solutions for using the TW-US link, the TW-JP link (shared 10 Gbps), and the TW-SG-AMS link (shared 10 Gbps) for WLCG at the same time require further discussion.

      Compute: Will speed up bringing all job slots (2.2K) online after the migration from CentOS to AlmaLinux 9.

      Site: No particular problems with site functionality were found.

    • 10:20 → 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      EL9 at MSU
        all satellite/capsule configurations seem resolved
        all infrastructure/firewall issues seem resolved
        still working on node and build configurations
        initial test with R410 hit a problem (no install support for its disk controller)
        switched target to R630
        now builds to completion
        sorting out an issue: nodes are not self-registering via subscription-manager as part of the build
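
      A minimal sketch of the kind of post-build check that could catch and repair the
      registration gap above, assuming registration via an activation key (the org and
      key values are placeholders, not AGLT2's actual ones):

        #!/usr/bin/env python3
        """Verify a freshly built node registered itself via subscription-manager."""
        import subprocess
        import sys

        ORG = "example-org"          # placeholder organization ID
        ACTIVATION_KEY = "el9-node"  # placeholder activation key

        def is_registered() -> bool:
            # `subscription-manager identity` exits non-zero when unregistered.
            return subprocess.run(["subscription-manager", "identity"],
                                  capture_output=True).returncode == 0

        if not is_registered():
            print("node not registered; registering now", file=sys.stderr)
            subprocess.run(["subscription-manager", "register",
                            f"--org={ORG}", f"--activationkey={ACTIVATION_KEY}"],
                           check=True)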

      Fallout from the dCache issue last December has subsided
        last problem noticed was 1 lost file a week ago

      Using nftables on EL9
        ported the iptables configuration (managed via Cobbler) to nftables (managed via Ansible)
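
      A minimal sketch of how the ported ruleset could be sanity-checked after an
      Ansible run; the expected rule fragments are illustrative, not the actual
      AGLT2 ruleset:

        #!/usr/bin/env python3
        """Check that expected rules survived the iptables-to-nftables port."""
        import subprocess
        import sys

        # Illustrative fragments expected somewhere in the live ruleset.
        EXPECTED = ["tcp dport 22 accept",    # ssh
                    "tcp dport 1094 accept"]  # xrootd, as an example

        ruleset = subprocess.run(["nft", "list", "ruleset"],
                                 capture_output=True, text=True, check=True).stdout

        missing = [frag for frag in EXPECTED if frag not in ruleset]
        if missing:
            sys.exit(f"missing rules: {missing}")
        print("all expected rules present")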

      Question: do we need to do something now about the SAM test hiccup last week?
      We still see a 31-hour gap in monitoring.

    • 10:30 → 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

      Updated HTCondor to 24.0.6-1 for the security release
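
      One way to confirm the whole pool picked up the patched build, sketched with
      the HTCondor Python bindings (assumes the script runs where it can query the
      pool collector):

        #!/usr/bin/env python3
        """List startds not yet reporting the patched HTCondor version."""
        import htcondor

        PATCHED = "24.0.6"

        ads = htcondor.Collector().query(htcondor.AdTypes.Startd,
                                         projection=["Machine", "CondorVersion"])
        stale = sorted({ad["Machine"] for ad in ads
                        if PATCHED not in ad.get("CondorVersion", "")})

        print(f"{len(ads)} startd ads, {len(stale)} machines not yet on {PATCHED}")
        for machine in stale:
            print(" ", machine)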

      Updated ingress-nginx on our k8s clusters to address https://kubernetes.io/blog/2025/03/24/ingress-nginx-cve-2025-1974/
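
      A quick check, sketched with kubectl via Python, that a cluster's controller
      image is one of the patched releases for that CVE (namespace and deployment
      name are the upstream defaults and may differ per cluster):

        #!/usr/bin/env python3
        """Report the running ingress-nginx controller image tag."""
        import subprocess

        NAMESPACE = "ingress-nginx"              # upstream default namespace
        DEPLOYMENT = "ingress-nginx-controller"  # upstream default name
        PATCHED = ("v1.11.5", "v1.12.1")         # fixed releases for CVE-2025-1974

        image = subprocess.run(
            ["kubectl", "-n", NAMESPACE, "get", "deploy", DEPLOYMENT, "-o",
             "jsonpath={.spec.template.spec.containers[0].image}"],
            capture_output=True, text=True, check=True).stdout.strip()

        print("controller image:", image)
        if not any(tag in image for tag in PATCHED):
            print("WARNING: not a patched release", PATCHED)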

      Troubleshooting 'Stale file handle' container issues, possibly related to our CVMFS configuration on high-core-count workers
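
      A small probe of the kind that can help localize the stale handles on a worker
      (the repository list is illustrative; ESTALE detection confirms the symptom,
      not the CVMFS configuration cause):

        #!/usr/bin/env python3
        """Probe CVMFS mounts for stale file handles."""
        import errno
        import os

        REPOS = ["/cvmfs/atlas.cern.ch",        # illustrative repository list
                 "/cvmfs/atlas-condb.cern.ch"]

        for repo in REPOS:
            try:
                os.stat(repo)
                print(f"{repo}: ok")
            except OSError as exc:
                tag = "stale file handle" if exc.errno == errno.ESTALE else str(exc)
                print(f"{repo}: {tag}")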

      Updating CVMFS to 2.12.7

    • 10:40 → 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      The site drained last weekend owing to the Harvester certificate expiring; it had to be renewed by hand. Eduardo and Fernando are working on automating this process for Kubernetes sites.
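
      A minimal sketch of the expiry check such automation could run, assuming the
      certificate is readable as a PEM file (the path and threshold are placeholders,
      and a real job would trigger renewal rather than just warn):

        #!/usr/bin/env python3
        """Warn when the Harvester certificate is close to expiry."""
        import datetime
        import sys

        from cryptography import x509  # pip install 'cryptography>=42'

        CERT_PATH = "/etc/harvester/hostcert.pem"  # placeholder location
        WARN_DAYS = 14

        with open(CERT_PATH, "rb") as fh:
            cert = x509.load_pem_x509_certificate(fh.read())

        left = cert.not_valid_after_utc - datetime.datetime.now(datetime.timezone.utc)
        if left < datetime.timedelta(days=WARN_DAYS):
            sys.exit(f"certificate expires in {left.days} days; renew now")
        print(f"certificate valid for another {left.days} days")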

    • 10:50 → 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • Operations

        • No major site issues to mention. We have been running very smoothly.

        • We noticed a slight decrease in running job slots and an increase in wrapper faults on 3/30 (weekend), but this was brief and is now resolved.

        • Running jobs also appear to have dipped briefly last night (4/1). We will look into this.

        • Usual data center maintenance (replacing storage drives, addressing two problematic worker nodes, monitoring).

      • EL9 Migration Updates

        • Continuing to make slight improvements to our modules in both the test and production clusters. 

        • Continuing to develop and test modules for XRootD proxy and storage in the test cluster. 

      • New Storage

        • We physically installed additional new storage in racks.

        • Ran into a few issues while trying to deploy new storage. 

          • Tested new third-party rails but decided against using them. Contacted Dell for suggestions on solutions while we research and plan.

          • DHCP request issues interfered with Rocks' ability to provision with EL7; this has been resolved. The iDRAC devices were set to DHCP. We temporarily removed the module that manages this setting so it can be fixed and tested in the test cluster before being added back into production, and for now we manually set these devices to static addresses, which resolved the issue (see the sketch after this list).

          • TFTP server issues: provisioning nodes led to TFTP timeouts with Rocks. Investigated this and resolved it.

          • Configured and tested provisioning of the new storage in the test cluster. Rocks does not seem to support UEFI but does work with BIOS boot; however, setting the new storage nodes to BIOS boot mode causes the M.2 drive in the BOSS-N1 boot controller to not be detected. Working on a solution and continuing to test for now.

        • We plan on getting some of our new storage online and in service once we overcome these issues. 
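
      A sketch of the manual DHCP-to-static iDRAC change scripted over racadm; the
      addresses and credentials are placeholders, and the iDRAC.IPv4.* attribute
      names are Dell's standard ones:

        #!/usr/bin/env python3
        """Switch a batch of iDRACs from DHCP to static addressing."""
        import subprocess

        # Placeholder inventory: current iDRAC address -> desired static address.
        IDRACS = {"10.0.1.51": "10.0.2.51",
                  "10.0.1.52": "10.0.2.52"}
        NETMASK, GATEWAY = "255.255.255.0", "10.0.2.1"
        USER, PASSWORD = "root", "changeme"  # placeholders

        def racadm(host, *args):
            subprocess.run(["racadm", "-r", host, "-u", USER, "-p", PASSWORD, *args],
                           check=True)

        for current, static in IDRACS.items():
            racadm(current, "set", "iDRAC.IPv4.DHCPEnable", "0")
            racadm(current, "set", "iDRAC.IPv4.Netmask", NETMASK)
            racadm(current, "set", "iDRAC.IPv4.Gateway", GATEWAY)
            # Set the address last: the session drops once the IP changes.
            racadm(current, "set", "iDRAC.IPv4.Address", static)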


      OU:

      • Not much to report, running well; just some occasional storage overloads.
      • The EL9 /lscratch deletion bug seems to be fixed, so we will start migrating worker nodes from EL7 containers to EL9 bare metal, a few nodes at a time.