US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • November 20 and 22: The site had several hours of network interruption due to unplanned maintenance by the network provider.
      • HTCondor, HTCondor-CE, and OS Migration Status:
        • Updated HTCondor to 25.3.1 and OS to EL9

        • Currently have 1872 CPUs running on EL9

        • Set the test PQ online and set the TW-FTT queue to BROKEROFF

      • Started using a local Varnish server for Frontier and CVMFS
      • Plan to replace ARC-CE with HTCondor-CE, upgrade all remaining EL7 worker nodes to EL9, move the site infrastructure to EL9, HTCondor-CE, and Varnish, and then decommission ARC-CE and Squid.
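
      Switching to a local Varnish cache means the clients must be pointed at it. A minimal sketch of the client-side settings, assuming a Varnish instance listening at varnish.example.tw:6081 (hostname and port are placeholders, not from the report):

      ```
      # /etc/cvmfs/default.local -- route CVMFS traffic through the local Varnish cache
      CVMFS_HTTP_PROXY="http://varnish.example.tw:6081"

      # Frontier client: use the same proxy ahead of the central servers
      # (set in the job environment or the Frontier server/proxy string)
      export FRONTIER_PROXY="http://varnish.example.tw:6081"
      ```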
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Ticket 1001213
        Jobs were failing: writing new files failed because our storage appeared to be full.
          This happened after too many pools were left set rdonly for too long.
            The rdonly pools slowly drained (file system showing 66%) following DDM deletions,
            while the remaining RW pools slowly filled up (to 98%) with new files.
            We had forgotten to set half of the pools at one site back to RW
            after the on-the-fly rolling dCache update 4 weeks earlier.
          We already had a cron job alerting when pools become offline;
            it has now been upgraded to also flag rdonly pools.
          We also re-balanced all pools site-wide
            to re-spread the unused space among all pools;
            that space is used as a temporary cache between UM and MSU.
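
      A minimal sketch of such a pool-health check, assuming a dCache frontend REST endpoint /api/v1/pools that returns a JSON list of objects with "name" and "poolMode" fields (the endpoint path and field names are assumptions; adjust to the actual schema of your dCache instance):

      ```python
      """Sketch of a cron check that flags dCache pools that are offline
      OR stuck read-only (rdonly), as in AGLT2's upgraded alerting.

      Assumption (not from the report): the frontend REST call
      GET <frontend>/api/v1/pools returns a JSON list of objects with
      "name" and "poolMode" fields."""
      import json
      from urllib.request import urlopen


      def unhealthy_pools(pools):
          """Return names of pools whose mode is anything other than plain
          'enabled' -- covers 'disabled(...)' (offline) and modes carrying
          the rdonly flag, e.g. 'enabled,rdonly'."""
          return [p["name"] for p in pools if p.get("poolMode") != "enabled"]


      def check(frontend_url):
          # e.g. frontend_url = "https://dcache-head.example.edu:3880" (placeholder)
          with urlopen(frontend_url + "/api/v1/pools") as resp:
              return unhealthy_pools(json.load(resp))


      if __name__ == "__main__":
          # Offline sample: one healthy pool, one read-only, one disabled.
          sample = [
              {"name": "pool01", "poolMode": "enabled"},
              {"name": "pool02", "poolMode": "enabled,rdonly"},
              {"name": "pool03", "poolMode": "disabled(manual)"},
          ]
          print(unhealthy_pools(sample))  # -> ['pool02', 'pool03']
      ```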
            
      Condor/Condor-CE updates to OSG25
        Condor on 25.0.3
        Condor-CE on 25.0.1

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      GGUS:1001113: error caused by an issue with a new storage machine; rapidly fixed.

      Blacklisted on the 15th and 16th, once due to a network issue and once for a CVMFS problem.

      Downtime around SC 2025, followed by more CVMFS issues coming out of the downtime, probably because the dense nodes rapidly filled up with Event Index jobs and overloaded their CVMFS instances. We have been planning to switch from a direct mount to accessing CVMFS through the cvmfs-csi Container Storage Interface Kubernetes plugin, which ought to prevent these issues; this will now happen sooner rather than later. Unfortunately, the image registry was on one of the dense nodes that needed to be rebooted, and it failed to clone the images again automatically, so jobs hung in Harvester until the cloning was done by hand.
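      One common way to wire this up with the cvmfs-csi driver is a read-only claim against its storage class, mounted where jobs expect /cvmfs. A sketch, assuming the driver's usual defaults (claim name, storage class, and image are illustrative, not from the report):

      ```yaml
      # PVC sketch for the cvmfs-csi driver (names are assumptions)
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: cvmfs
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 1
        storageClassName: cvmfs
      ---
      # Pod fragment: mount the claim where jobs expect /cvmfs
      apiVersion: v1
      kind: Pod
      metadata:
        name: cvmfs-demo
      spec:
        containers:
          - name: job
            image: example/payload:latest   # placeholder image
            volumeMounts:
              - name: cvmfs
                mountPath: /cvmfs
                mountPropagation: HostToContainer
        volumes:
          - name: cvmfs
            persistentVolumeClaim:
              claimName: cvmfs
              readOnly: true
      ```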

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Campus facilities performed power tests on Saturday 11/22, exercising the backup generator that takes over if the building loses power; the tests succeeded.

      • We rebuilt one XRootD proxy server to EL9 after testing in the test cluster. We are seeing performance issues and have not yet resolved them:

        • Communicating with XRootD experts.

        • Performing different tests.

        • Researching potential causes.

        • Tried upgrading to the newer version 5.9.0 and rebuilding on new hardware.
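
      A crude throughput check of the kind one might run while chasing a proxy performance issue is to time an xrdcp transfer through the proxy and convert it to MB/s. A sketch, assuming the standard xrootd client tools are installed; the endpoint URL is a placeholder, not SWT2's:

      ```python
      """Time an xrdcp transfer and report mean throughput (sketch)."""
      import subprocess
      import time


      def throughput_mbps(nbytes, seconds):
          """Mean throughput in MB/s (10**6 bytes per second)."""
          return nbytes / seconds / 1e6


      def time_xrdcp(source, dest="/dev/null"):
          """Copy `source` with xrdcp (requires the xrootd client tools)
          and return the elapsed wall-clock seconds."""
          start = time.monotonic()
          subprocess.run(["xrdcp", "--force", source, dest], check=True)
          return time.monotonic() - start


      if __name__ == "__main__":
          # Hypothetical use against a test file behind the proxy:
          # secs = time_xrdcp("root://proxy.example.edu:1094//store/test/1GB.dat")
          # print(throughput_mbps(1_000_000_000, secs))
          print(throughput_mbps(1_000_000_000, 8.0))  # -> 125.0
      ```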

      • We are continuing to migrate data off older storage. The most recent server was a PowerEdge R740, the model that makes up the majority of our storage servers.

        • We have not retired any storage yet, as we may need to use certain storage to complete the migration of data. 

      • We are testing Zabbix in the test cluster.


      OU:

      • Running well, no issues
      • Still waiting for a new SLURM version in order to start testing cgroup v2 RAM killing
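
      Once the newer SLURM arrives, cgroup v2 memory enforcement is typically switched on in cgroup.conf; a minimal sketch (values are illustrative, not OU's):

      ```
      # cgroup.conf -- sketch for cgroup/v2 memory enforcement
      CgroupPlugin=cgroup/v2
      ConstrainRAMSpace=yes     # jobs exceeding their memory limit get OOM-killed
      ConstrainSwapSpace=yes
      AllowedRAMSpace=100       # percent of the allocated memory

      # slurm.conf must also select the cgroup plugins:
      # ProctrackType=proctrack/cgroup
      # TaskPlugin=task/cgroup
      ```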