US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
10:00 → 10:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
10:10 → 10:20
10:20 → 10:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
Running well overall
Some issues with cvmfs; removed automated 'reload'; gave some nodes to experts.
One dip to 60% occupancy when only single-core (score) jobs were present
Trying to understand the underlying limitation
Availability report for March showed 95% while monitoring shows over 99%
We need to create a ticket
A few more lost files from the Dec 2024 incident
Noticed as stage-in errors (pilot:1099)
10 files from one data set
checked whole data set (mc23_13p6TeV:AOD.42171985.*)
Of the 351 files in total, 39 were registered/created at AGLT2
37 total missing; declared bad/lost
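A check like this can be scripted against the Rucio catalog. The following is only a rough sketch, assuming a working Rucio client environment; "AGLT2_DATADISK" is a placeholder RSE name (substitute the actual AGLT2 endpoint) and the list_dids filter syntax can vary between Rucio client versions.

```python
# Rough sketch: for the affected dataset(s), flag files that have a replica
# registered at the site but not in an AVAILABLE state. Assumes a configured
# Rucio client; "AGLT2_DATADISK" is a placeholder RSE name.
from rucio.client import Client

client = Client()
scope = "mc23_13p6TeV"
rse = "AGLT2_DATADISK"  # placeholder -- use the real AGLT2 RSE

# Expand the wildcard pattern quoted above into concrete dataset names
# (filter syntax may differ between Rucio client versions).
for ds in client.list_dids(scope, {"name": "AOD.42171985.*"}, did_type="dataset"):
    total, at_site, missing = 0, 0, []
    for rep in client.list_replicas([{"scope": scope, "name": ds}], all_states=True):
        total += 1
        state = rep.get("states", {}).get(rse)
        if state is not None:          # file has a replica registered at the site
            at_site += 1
            if state != "AVAILABLE":   # candidate for the declare bad/lost procedure
                missing.append(rep["name"])
    print(f"{ds}: {total} files, {at_site} registered at {rse}, {len(missing)} not available")
```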
EL9 at MSU
Correction to last report: there was one more hurdle, now solved
Needed one more allowance through the MSU firewall for the aglt2 subnet to reach the capsule on port 443
That allowed the node being provisioned to register itself during build
Via subscription-manager and an http-proxy from the private subnet to the capsule's public https port
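A minimal reachability check for that provisioning path might look like the sketch below. The host names and proxy are hypothetical placeholders ("capsule.example.org", "proxy.example.org:3128"), not the real AGLT2/MSU names.

```python
# Smoke test for the two legs described above: direct TCP 443 to the capsule
# (the firewall allowance) and HTTPS via the http-proxy (the path
# subscription-manager takes from the private subnet). Hostnames are placeholders.
import socket
import ssl
import urllib.request

CAPSULE = "capsule.example.org"          # hypothetical capsule hostname
PROXY = "http://proxy.example.org:3128"  # hypothetical site http-proxy

# 1) The firewall allowance: the aglt2 subnet must reach the capsule on TCP 443.
with socket.create_connection((CAPSULE, 443), timeout=5):
    print("direct TCP 443 to capsule: ok")

# 2) HTTPS through the proxy. Certificate verification is disabled here only to
#    keep the smoke test short; a real check should trust the capsule CA instead.
ctx = ssl._create_unverified_context()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"https": PROXY}),
    urllib.request.HTTPSHandler(context=ctx),
)
print("via proxy:", opener.open(f"https://{CAPSULE}/", timeout=10).status)
```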
Next steps:
Building first VM for perfSONAR infrastructure.
Will make the first node built into a worker node.
Will also start with new storage nodes.
10:30 → 10:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Working on updating cvmfs on the compute machines. Currently draining machines so cvmfs can be restarted for the update (see the sketch after this list)
- Starting to discuss operations and procurement plans for this year
- IU network config change to fix route asymmetry along LHCONE
- A storage node at UC was down for a short time while we replaced dead optics
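A rough per-node sketch of that drain-then-update ordering follows. It assumes the node has already been fully drained of jobs and uses yum/dnf for packages; it is an illustration only, not MWT2's actual tooling.

```python
# Rough per-node sketch of the drain-then-update flow above. Assumes the node has
# already been drained of jobs and uses yum/dnf for packages -- illustration only.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["cvmfs_config", "umount"])        # unmount all CVMFS repos so the client can be replaced
run(["yum", "-y", "update", "cvmfs"])  # update the CVMFS client package
run(["cvmfs_config", "probe"])         # remount via autofs and verify the repos respond
```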
10:40 → 10:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
10:50 → 11:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Operations
  - We experienced a significant drain on 4/3/2025. Investigations are still ongoing, but between local debugging and help from others we are gathering information, and the site has filled back up. We noticed a significant drop in multicore jobs during this time.
  - We increased our Slurm max job limit from 10000 to 12000 (see the sketch after this list).
  - We created a second CE, but have not put it into production yet.
  - Timo and Rod have helped us get both SWT2_CPB_TEST and SWT2_CPB using gk10 and enable submission of 16-core jobs.
  - We have reached out to experts and shared the requested logs from our CE for review.
  - Timo has helped investigate this and found errors in the apfmon logs showing "The job's remote status is unknown… known again". It is still unclear whether this is a central issue or a bug in this version of HTCondor-CE, but it appears to be some kind of handshake/status problem.
  - We have been focused on finding out why this issue occurred and how to prevent it from happening in the future.
  - We are now draining SWT2_CPB_TEST to go back to running jobs only on the SWT2_CPB queue.
  - We recently experienced a spike in errors due to jobs hitting the 2-day limit on our CE. We are discussing changes to these limits.
    - Last update from the ADC OPS meeting:
      - Request that all sites move to at least 96h maxwalltime
      - The ATLAS VO Card includes a 5760 minute walltime limit = 96 hours
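A quick way to confirm the new job limit took effect, assuming the change was made via the slurm.conf MaxJobCount parameter (whose stock default is 10000); this is only an illustrative check, not SWT2's actual procedure.

```python
# Sketch: confirm the live Slurm controller limit after the change.
# Assumes the change was made through the slurm.conf MaxJobCount parameter
# and that scontrol is available on the host running this.
import subprocess

cfg = subprocess.run(["scontrol", "show", "config"],
                     capture_output=True, text=True, check=True).stdout
for line in cfg.splitlines():
    if line.strip().startswith("MaxJobCount"):
        print(line.strip())  # expect "MaxJobCount = 12000" after the reconfigure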
- Monitoring
  - We are currently working on developing better monitoring of our site, including additional information from our Slurm and CE servers (see the sketch below).
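As a very rough illustration of the kind of data being considered (the actual tooling and metrics backend are still under discussion), a simple poller could collect job counts from Slurm and the CE; the use of squeue and condor_ce_q below is an assumption about what is available on the collector host.

```python
# Illustrative poller only: gather simple job counts from Slurm and the HTCondor-CE.
# Assumes squeue and condor_ce_q are on PATH where this runs; the real monitoring
# design (metrics backend, dashboards) is still being worked out.
import subprocess

def count_lines(cmd):
    """Run a command and count its non-empty output lines."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return sum(1 for line in out.splitlines() if line.strip())

slurm_running = count_lines(["squeue", "-h", "-t", "RUNNING", "-o", "%i"])
slurm_pending = count_lines(["squeue", "-h", "-t", "PENDING", "-o", "%i"])
ce_idle = count_lines(["condor_ce_q", "-constraint", "JobStatus == 1",
                       "-format", "%d\n", "ClusterId"])

print(f"slurm_running={slurm_running} slurm_pending={slurm_pending} ce_idle={ce_idle}")
```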
- EL9 Migration Updates
  - Built test storage nodes in the test cluster. There are still more tests we want to perform.
  - Improving the storage module in Puppet/Foreman.
- GGUS Ticket - Enable Network Monitoring
  - Followed up with campus networking. It appears there were internal changes that caused them to lose track of our request.
  - They have added their Operations Center manager to the discussion.
OU:
- May not be able to join because of a conflicting meeting, sorry.
- Running well, only occasional storage overloads.
- We think the lscratch deletion issue has been fixed, so we can migrate over from el7 containers to el9 containers.