US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
10:00 → 10:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
10:10 → 10:20
10:20 → 10:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
Running well overall
Some issues with cvmfs; removed automated 'reload'; gave some nodes to experts.
One dip to 60% occupancy when only single-core (score) jobs were present
Trying to understand the underlying limitation
Availability report for March showed 95% while monitoring shows over 99%
We need to create a ticket
A few more lost files from the Dec 2024 incident
Noticed as stage-in errors (pilot:1099)
10 files from one data set
checked whole data set (mc23_13p6TeV:AOD.42171985.*)
Of the 351 files in total, 39 were registered/created at AGLT2
37 total missing; declared bad/lost
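A check like this can be scripted against the Rucio catalog. The following is only a rough sketch, assuming a working Rucio client environment; "AGLT2_DATADISK" is a placeholder RSE name (substitute the actual AGLT2 endpoint) and the list_dids filter syntax can vary between Rucio client versions.

```python
# Rough sketch: for the affected dataset(s), flag files that have a replica
# registered at the site but not in an AVAILABLE state. Assumes a configured
# Rucio client; "AGLT2_DATADISK" is a placeholder RSE name.
from rucio.client import Client

client = Client()
scope = "mc23_13p6TeV"
rse = "AGLT2_DATADISK"  # placeholder -- use the real AGLT2 RSE

# Expand the wildcard pattern quoted above into concrete dataset names
# (filter syntax may differ between Rucio client versions).
for ds in client.list_dids(scope, {"name": "AOD.42171985.*"}, did_type="dataset"):
    total, at_site, missing = 0, 0, []
    for rep in client.list_replicas([{"scope": scope, "name": ds}], all_states=True):
        total += 1
        state = rep.get("states", {}).get(rse)
        if state is not None:          # file has a replica registered at the site
            at_site += 1
            if state != "AVAILABLE":   # candidate for the declare bad/lost procedure
                missing.append(rep["name"])
    print(f"{ds}: {total} files, {at_site} registered at {rse}, {len(missing)} not available")
```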
EL9 at MSU
Correction to last report: there was one more hurdle, now solved
Needed one more allowance through the MSU firewall for the aglt2 subnet to reach the capsule on port 443
That allowed the node being provisioned to register itself during build
Via subscription-manager and an http-proxy from the private subnet to the capsule's public https port
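A minimal reachability check for that provisioning path might look like the sketch below. The host names and proxy are hypothetical placeholders ("capsule.example.org", "proxy.example.org:3128"), not the real AGLT2/MSU names.

```python
# Smoke test for the two legs described above: direct TCP 443 to the capsule
# (the firewall allowance) and HTTPS via the http-proxy (the path
# subscription-manager takes from the private subnet). Hostnames are placeholders.
import socket
import ssl
import urllib.request

CAPSULE = "capsule.example.org"          # hypothetical capsule hostname
PROXY = "http://proxy.example.org:3128"  # hypothetical site http-proxy

# 1) The firewall allowance: the aglt2 subnet must reach the capsule on TCP 443.
with socket.create_connection((CAPSULE, 443), timeout=5):
    print("direct TCP 443 to capsule: ok")

# 2) HTTPS through the proxy. Certificate verification is disabled here only to
#    keep the smoke test short; a real check should trust the capsule CA instead.
ctx = ssl._create_unverified_context()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"https": PROXY}),
    urllib.request.HTTPSHandler(context=ctx),
)
print("via proxy:", opener.open(f"https://{CAPSULE}/", timeout=10).status)
```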
Next steps:
Building first VM for perfSONAR infrastructure.
Will make the first node built into a worker node.
Will also start with new storage nodes.
10:30 → 10:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Working on updating cvmfs on the compute machines. Currently draining machines so cvmfs can be restarted for the update (see the sketch after this list)
- Starting to discuss operations and procurement plans for this year
- IU network config change to fix route asymmetry along LHCONE
- A storage node at UC was down for a short time while we replaced dead optics
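A rough per-node sketch of that drain-then-update ordering follows. It assumes the node has already been fully drained of jobs and uses yum/dnf for packages; it is an illustration only, not MWT2's actual tooling.

```python
# Rough per-node sketch of the drain-then-update flow above. Assumes the node has
# already been drained of jobs and uses yum/dnf for packages -- illustration only.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["cvmfs_config", "umount"])        # unmount all CVMFS repos so the client can be replaced
run(["yum", "-y", "update", "cvmfs"])  # update the CVMFS client package
run(["cvmfs_config", "probe"])         # remount via autofs and verify the repos respond
```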
10:40 → 10:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
10:50 → 11:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- Operations
  - We experienced a significant drain on 4/3/2025. Investigations are still ongoing, but between local debugging and help from others we are gathering information, and the site has filled back up. We noticed a significant drop in multicore jobs during this time.
  - We increased our Slurm max job limit from 10000 to 12000 (see the sketch after this list).
  - We created a second CE, but have not put it into production yet.
  - Timo and Rod have helped us get both SWT2_CPB_TEST and SWT2_CPB using gk10 and enable submission of 16-core jobs.
  - We have reached out to experts and shared the requested logs from our CE for review.
  - Timo has helped investigate this and found errors in the apfmon logs showing "The job's remote status is unknown… known again". It is still unclear whether this is a central issue or a bug in this version of HTCondor-CE, but it appears to be some kind of handshake/status problem.
  - We have been focused on finding out why this issue occurred and how to prevent it from happening in the future.
  - We are now draining SWT2_CPB_TEST to go back to running jobs only on the SWT2_CPB queue.
  - We recently experienced a spike in errors due to jobs hitting the 2-day limit on our CE. We are discussing changes to these limits.
    - Last update from the ADC OPS meeting:
      - Request that all sites move to at least 96h maxwalltime
      - The ATLAS VO Card includes a 5760 minute walltime limit = 96 hours
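A quick way to confirm the new job limit took effect, assuming the change was made via the slurm.conf MaxJobCount parameter (whose stock default is 10000); this is only an illustrative check, not SWT2's actual procedure.

```python
# Sketch: confirm the live Slurm controller limit after the change.
# Assumes the change was made through the slurm.conf MaxJobCount parameter
# and that scontrol is available on the host running this.
import subprocess

cfg = subprocess.run(["scontrol", "show", "config"],
                     capture_output=True, text=True, check=True).stdout
for line in cfg.splitlines():
    if line.strip().startswith("MaxJobCount"):
        print(line.strip())  # expect "MaxJobCount = 12000" after the reconfigure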
- Monitoring
  - We are currently working on developing better monitoring of our site, including additional information from our Slurm and CE servers (see the sketch below).
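As a very rough illustration of the kind of data being considered (the actual tooling and metrics backend are still under discussion), a simple poller could collect job counts from Slurm and the CE; the use of squeue and condor_ce_q below is an assumption about what is available on the collector host.

```python
# Illustrative poller only: gather simple job counts from Slurm and the HTCondor-CE.
# Assumes squeue and condor_ce_q are on PATH where this runs; the real monitoring
# design (metrics backend, dashboards) is still being worked out.
import subprocess

def count_lines(cmd):
    """Run a command and count its non-empty output lines."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return sum(1 for line in out.splitlines() if line.strip())

slurm_running = count_lines(["squeue", "-h", "-t", "RUNNING", "-o", "%i"])
slurm_pending = count_lines(["squeue", "-h", "-t", "PENDING", "-o", "%i"])
ce_idle = count_lines(["condor_ce_q", "-constraint", "JobStatus == 1",
                       "-format", "%d\n", "ClusterId"])

print(f"slurm_running={slurm_running} slurm_pending={slurm_pending} ce_idle={ce_idle}")
```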
- EL9 Migration Updates
  - Built test storage nodes in the test cluster. There are still more tests we want to perform.
  - Improving the storage module in Puppet/Foreman.
- GGUS Ticket - Enable Network Monitoring
  - Followed up with campus networking. It appears there were internal changes that caused them to lose track of our request.
  - They have added their Operations Center manager to the discussion.
OU:
- May not be able to join because of a conflicting meeting, sorry.
- Running well, only occasional storage overloads.
- We think the lscratch deletion issue has been fixed, so we can migrate over from el7 containers to el9 containers.