US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
11:00
→
11:10
Introduction 10m Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Mario Lassnig just posted this into the ADC ops MatterMost channel:
:alarm: Security Notice: GitLab has uncovered a widespread malicious npm supply-chain attack that can potentially destroy user data. All teams are required to audit packages. Further information can be found in the Computer Security Report for November 2025 and the original blog post at https://about.gitlab.com/blog/gitlab-discovers-widespread-npm-supply-chain-attack/ (see the package-audit sketch after this list).
- Please get your operations plans ready before you leave for the end-of-year holidays.
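For sites auditing their npm packages per the notice above, a minimal first-pass sketch follows; it assumes a locally maintained list of suspect package names taken from the advisory (the name below is a placeholder, not a real indicator of compromise):

    #!/usr/bin/env python3
    """First-pass scan of package-lock.json files for suspect npm packages."""
    import json
    import sys
    from pathlib import Path

    # Placeholder entry; fill this set from the actual security advisory.
    SUSPECT = {"example-compromised-package"}

    def scan(lockfile: Path) -> list:
        """Return suspect name@version hits found in one lockfile."""
        data = json.loads(lockfile.read_text())
        hits = []
        # npm v7+ lockfiles keep a flat "packages" map keyed by install
        # path, e.g. "node_modules/foo" or "node_modules/@scope/pkg".
        for path, meta in data.get("packages", {}).items():
            name = path.rsplit("node_modules/", 1)[-1]
            if name in SUSPECT:
                hits.append(f"{name}@{meta.get('version', '?')} in {lockfile}")
        return hits

    if __name__ == "__main__":
        root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
        found = [hit for lf in root.rglob("package-lock.json") for hit in scan(lf)]
        print("\n".join(found) if found else "no suspect packages found")
        sys.exit(1 if found else 0)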
-
11:10
→
11:20
TW-FTT 10m Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
- November 20 and 22: The site had several hours of network interruption due to unplanned maintenance by the network provider.
- HTCondor, HTCondor-CE, and OS migration status:
  - Updated HTCondor to 25.3.1 and the OS to EL9.
  - Currently have 1872 CPUs running on EL9.
  - Set the test PQ online; the TW-FTT queue is set to BROKEROFF.
- Started using a local Varnish server for Frontier and CVMFS (a quick proxy health-check sketch follows this list).
- Plan to replace ARC-CE with HTCondor-CE, upgrade all EL7 worker nodes to EL9, update the site infrastructure to EL9, HTCondor-CE, and Varnish, and then decommission ARC-CE and Squid.
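A quick way to confirm the local Varnish instance answers proxy-style requests and serves repeat requests from cache; a minimal sketch, with placeholder host/port and a standard CVMFS stratum-one URL as the test target (Frontier and CVMFS clients would point at the same endpoint):

    #!/usr/bin/env python3
    """Probe a local Varnish HTTP proxy and report caching headers."""
    import urllib.request

    PROXY = "http://varnish.example.tw:6081"  # placeholder endpoint
    TEST_URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

    def fetch_via_proxy(url: str):
        """Fetch url through PROXY and return (status, Age header)."""
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": PROXY}))
        with opener.open(url, timeout=10) as resp:
            # Age > 0 on a repeat request indicates a cache hit;
            # some VCL setups also add an X-Cache header.
            return resp.status, resp.headers.get("Age")

    if __name__ == "__main__":
        for attempt in (1, 2):  # the second request should hit the cache
            status, age = fetch_via_proxy(TEST_URL)
            print(f"attempt {attempt}: status={status} Age={age}")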
-
11:20
→
11:30
AGLT2 10m Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
Ticket 1001213
- Jobs were failing: writing new files failed because our storage appeared to be full.
- This happened after too many pools were left set rdonly for too long: the rdonly pools slowly drained (file system showing 66%) following DDM deletions, while the remaining RW pools slowly filled up (to 98%) with new files.
- We had forgotten to set half of the pools at one site back to RW after the on-the-fly rolling dCache update 4 weeks earlier.
- We already had a cron job alerting on pools becoming offline; it has now been upgraded to also flag rdonly pools (a sketch of such a check follows below).
- We also re-balanced all pools site-wide to re-spread the unused space among all pools; this space is used as a temporary cache between UM and MSU.
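A minimal sketch of the upgraded check, assuming a plain-text dump of pool states (e.g. captured from the dCache admin shell) is piped in; the state keywords and line format here are assumptions to adapt to the real output:

    #!/usr/bin/env python3
    """Cron-style check flagging dCache pools that are offline or rdonly."""
    import re
    import sys

    # Assumed state keywords; adjust to the actual admin-shell output.
    BAD_STATE = re.compile(r"\b(rdonly|disabled|down|offline)\b", re.I)

    def check(dump: str) -> list:
        """Return one alert line per pool found in a bad state."""
        alerts = []
        for line in dump.splitlines():
            m = BAD_STATE.search(line)
            if m and line.split():
                alerts.append(f"ALERT: pool {line.split()[0]} is {m.group(1)}")
        return alerts

    if __name__ == "__main__":
        alerts = check(sys.stdin.read())
        if alerts:
            print("\n".join(alerts))  # cron mails any output it sees
        sys.exit(1 if alerts else 0)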
Condor/Condor-CE updated to OSG25:
- Condor on 25.0.3
- Condor-CE on 25.0.1
-
11:30
→
11:40
MWT2 10m Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Completed update to OSG25
- Working on an RFP for a refresh of the Elasticsearch hardware
- An HTCondor change to cgroup handling broke the pilot's cgroup implementation (see the sketch below): https://opensciencegrid.atlassian.net/browse/HTCONDOR-3008
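For context, a minimal sketch of the kind of cgroup introspection a pilot performs, assuming a pure cgroups v2 (unified hierarchy) host; this is illustrative only, not the actual ATLAS pilot code:

    #!/usr/bin/env python3
    """Locate the current process's cgroup (v2) and read its memory limit."""
    from pathlib import Path

    def current_cgroup() -> Path:
        # Under pure cgroups v2, /proc/self/cgroup holds one line: "0::<path>"
        entry = Path("/proc/self/cgroup").read_text().strip()
        rel = entry.split("::", 1)[1]
        return Path("/sys/fs/cgroup") / rel.lstrip("/")

    def memory_limit(cg: Path):
        """Return the memory.max limit in bytes, or None if unlimited."""
        # memory.max exists only where the memory controller is enabled.
        raw = (cg / "memory.max").read_text().strip()
        return None if raw == "max" else int(raw)

    if __name__ == "__main__":
        cg = current_cgroup()
        print(f"cgroup:     {cg}")
        print(f"memory.max: {memory_limit(cg)}")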
-
11:40
→
11:50
NET2 10m Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
GGUS:1001113: an error due to an issue with a new storage machine; rapidly fixed.
Blacklisted on the 15th/16th, once due to a network issue and once for a cvmfs problem.
Downtime around SC 2025, then more cvmfs issues coming out of the downtime, probably because the dense nodes rapidly filled up with Event Index jobs, overloading their cvmfs instances.
We have been planning to switch to accessing cvmfs through the cvmfs-csi Container Storage Interface kubernetes plugin rather than a direct mount, which ought to prevent these issues; this will now happen sooner rather than later.
Unfortunately the image registry was on one of the dense nodes that needed to be rebooted, and it failed to clone the images again automatically, so jobs subsequently hung in harvester until the cloning was done by hand. (A node-level cvmfs health probe sketch follows below.)
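A node-level cvmfs health probe could look like the following sketch, which checks that each repository mounts and is readable and reports the revision xattr exposed by the cvmfs client (the repository list is an example):

    #!/usr/bin/env python3
    """Node-level CVMFS health probe: mountability plus revision xattr."""
    import os
    import sys

    REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]  # example list

    def probe(repo: str) -> str:
        path = f"/cvmfs/{repo}"
        try:
            os.listdir(path)  # triggers the autofs mount, surfaces I/O errors
            rev = os.getxattr(path, "user.revision").decode()
            return f"OK   {repo} revision={rev}"
        except OSError as exc:
            return f"FAIL {repo}: {exc}"

    if __name__ == "__main__":
        results = [probe(r) for r in REPOS]
        print("\n".join(results))
        sys.exit(1 if any(r.startswith("FAIL") for r in results) else 0)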
-
11:50
→
12:00
SWT2 10m Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Campus facilities performed power tests on Saturday 11/22: they tested the backup generator that takes over if the building loses power, and the tests succeeded.
- We rebuilt one XRootD proxy server on EL9 after testing in the test cluster. We are seeing performance issues and are working to resolve them (a rough benchmarking sketch follows this list):
  - Communicating with XRootD experts.
  - Performing different tests.
  - Researching potential causes.
  - Tried upgrading to the newer version 5.9.0 and rebuilding with new hardware.
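A rough way to quantify the proxy overhead is to time the same copy direct from storage and through the proxy; a sketch, with placeholder hostnames and test-file path:

    #!/usr/bin/env python3
    """Time xrdcp of one test file direct from storage and via the proxy."""
    import subprocess
    import time

    TEST_FILE = "atlas/testfile"  # placeholder path on the storage
    ENDPOINTS = {
        "direct": "root://storage.example.edu",  # placeholder hosts
        "proxy": "root://proxy.example.edu",
    }

    def timed_copy(url: str) -> float:
        """Copy url to /dev/null with xrdcp and return elapsed seconds."""
        start = time.monotonic()
        subprocess.run(["xrdcp", "-f", url, "/dev/null"], check=True)
        return time.monotonic() - start

    if __name__ == "__main__":
        for label, host in ENDPOINTS.items():
            print(f"{label:6s} {timed_copy(f'{host}//{TEST_FILE}'):7.1f} s")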
- We are continuing to migrate data off storage; the most recent server was a PowerEdge R740, the model that makes up the majority of our storage servers.
- We have not retired any storage yet, as we may need certain storage to complete the data migration.
- We are testing Zabbix in the test cluster.
OU:
- Running well, no issues.
- Still waiting for a new SLURM version in order to start testing cgroups v2 RAM killing (a post-job OOM-check sketch follows below).
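Once the new SLURM is in place, verifying that RAM killing works can be as simple as checking the oom_kill counter in a job cgroup; a sketch, noting that the cgroup path layout shown is an assumption that depends on SLURM's cgroup plugin configuration:

    #!/usr/bin/env python3
    """Read the cgroups v2 oom_kill counter from a job cgroup."""
    import sys
    from pathlib import Path

    def oom_kills(cgroup: Path) -> int:
        """Return the oom_kill count from the cgroup's memory.events file."""
        for line in (cgroup / "memory.events").read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "oom_kill":
                return int(value)
        return 0

    if __name__ == "__main__":
        # Example (assumed) path; the actual layout depends on the Slurm config:
        # /sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345
        cg = Path(sys.argv[1])
        print(f"oom_kill events: {oom_kills(cg)}")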