US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
10:00 → 10:10
Top of the meeting discussion 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
10:10 → 10:20
TW-FTT 10m
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
In general, the site is running smoothly.
Some updates on the network status:
- The issue of the 1Gbps cap on a single data stream was solved; the whole 3x1Gbps links are now available for each stream (a quick throughput-check sketch follows below).
- The 3x1Gbps links will be upgraded to ~5Gbps in the coming weeks.
- When the submarine cable across the Pacific Ocean was in trouble, some traffic went through the TW-JP-US route and nearly saturated the 10Gbps pipe between AS and JP.
- A further upgrade of the international bandwidth between TW and US to 100Gbps by TWAREN is also possible later this year.
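A minimal sketch of how the single-stream versus multi-stream throughput could be verified with iperf3, assuming an iperf3 server is available at the far end; the hostname here is a placeholder, not a site endpoint mentioned in the notes.

    # Compare 1-stream and 3-stream throughput to confirm the single-stream
    # cap is gone. "perfsonar.example.org" is a hypothetical test endpoint.
    import json
    import subprocess

    SERVER = "perfsonar.example.org"  # placeholder iperf3 server

    def run_iperf3(streams: int, seconds: int = 10) -> float:
        """Run iperf3 with the given number of parallel streams and
        return the received throughput in Gbps."""
        out = subprocess.run(
            ["iperf3", "-c", SERVER, "-P", str(streams), "-t", str(seconds), "-J"],
            capture_output=True, text=True, check=True,
        )
        result = json.loads(out.stdout)
        return result["end"]["sum_received"]["bits_per_second"] / 1e9

    if __name__ == "__main__":
        print(f"1 stream : {run_iperf3(1):.2f} Gbps")
        print(f"3 streams: {run_iperf3(3):.2f} Gbps")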
10:20 → 10:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Follow-up on the 03-Mar job errors with missing input files:
  - 60 pilot:1094 job errors over the previous 24h
  - from 4 data sets mc23_13p6TeV:AOD.42228900/42228920/42228991/42229014.*
  - 2+4+23+1 = 30 lost files, which had indeed been created Dec 7-8
  - scanned all 1521 files in these 4 data sets (a check sketch follows below)
  - 26+50+11+36 = 123 additional lost files
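A minimal sketch of one way such a scan could be done, assuming a plain-text list of the files the catalog expects and a dump of the storage namespace; both filenames are placeholders, and this is not necessarily AGLT2's actual procedure.

    # Report files expected for the four datasets but absent from storage.
    EXPECTED_DUMP = "rucio_expected_files.txt"     # placeholder: expected LFNs, one per line
    NAMESPACE_DUMP = "storage_namespace_dump.txt"  # placeholder: files found on storage

    def load_lfns(path: str) -> set[str]:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    expected = load_lfns(EXPECTED_DUMP)
    present = load_lfns(NAMESPACE_DUMP)

    missing = sorted(expected - present)
    print(f"checked {len(expected)} expected files, {len(missing)} missing")
    for lfn in missing:
        print(lfn)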
- EL9 at MSU, aka RH Satellite provisioning via Capsule at AGLT2: still only frustratingly close.
  - We had identified one port (5646) on the Satellite that was not reachable from the Capsule.
  - The MSU IT Satellite team submitted a ticket to the MSU IT firewall team.
  - The port was opened 6 days later, but the request was off by one (5747).
  - The correction was supposed to happen last night, but it was still failing this morning (05-Mar).
  - We have already double-checked that all other needed ports, in both directions, are open, so this should be the last connectivity issue (a minimal reachability check is sketched below).
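A minimal reachability check that could be run from the Capsule host, given as an illustration rather than the actual AGLT2/MSU procedure; the Satellite hostname is a placeholder and the port list simply echoes the two port numbers in the notes.

    # Try a TCP connection from the Capsule to the Satellite ports in question.
    import socket

    SATELLITE = "satellite.example.edu"  # hypothetical Satellite hostname
    PORTS = [5646, 5747]                 # the two ports mentioned in the notes

    for port in PORTS:
        try:
            with socket.create_connection((SATELLITE, port), timeout=5):
                print(f"{SATELLITE}:{port} reachable")
        except OSError as exc:
            print(f"{SATELLITE}:{port} NOT reachable ({exc})")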
10:30 → 10:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- The IU downtime ran into a second day to make sure everything came back up cleanly.
- The UC storage rails finally came in. The machines are racked, cabled, and currently going through benchmarks.
  - Planned to be in production this week.
- Transitioned completely from Puppet to OpenVox.
- A cgroups program was written and sent to Paul Nilsson to test (a rough sketch of a cgroup read-out is below).
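The notes do not describe what the cgroups program does, so the following is only an assumption: a minimal cgroup v2 read-out of the kind a pilot-side check might perform, reporting current and maximum memory for the cgroup the process runs in.

    # Read memory.current and memory.max for the current process's cgroup
    # (assumes a cgroup v2 host; not the actual program sent to Paul Nilsson).
    from pathlib import Path

    def current_cgroup() -> Path:
        # On cgroup v2, /proc/self/cgroup has a single line like "0::/some/path".
        rel = Path("/proc/self/cgroup").read_text().strip().split("::", 1)[1]
        return Path("/sys/fs/cgroup") / rel.lstrip("/")

    cg = current_cgroup()
    mem_current = int((cg / "memory.current").read_text())
    mem_max_raw = (cg / "memory.max").read_text().strip()

    print(f"cgroup        : {cg}")
    print(f"memory.current: {mem_current / 2**20:.1f} MiB")
    if mem_max_raw == "max":
        print("memory.max    : unlimited")
    else:
        print(f"memory.max    : {int(mem_max_raw) / 2**20:.1f} MiB")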
10:40 → 10:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
- There is an ongoing discussion about pilots trying to use more space than requested on NET2. We will move away from the Overlay graph driver to a different one, and that alone should mask the problem (see the sketch at the end of this section).
- We are investigating tuning values for handling large numbers of transfer requests arriving at the dCache doors at the same time: we saw this during the challenge and are seeing it again on transfers started by the worker nodes. Thanks, Judith, for the help.
- Two new rd760 servers are ready to be racked at NESE. They are currently being used for the ongoing ZFS performance investigation and will be put into production very soon.
- We reported in the ticket that we implemented BGP tagging of the LHCONE prefixes. We are waiting for Edoardo to confirm that it is working on their end.
- The first 1PB data flow is being set up by Fabio to start using the tape. This is the last stage of the setup.
- The OKD cluster is installed on virtual machines to be used by the OSG folks for the kauntifier development. There are still some errors to be ironed out, but it should be ready by the end of this week.
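Related to the graph driver item above: a minimal sketch of how one could confirm which container storage driver a worker node is actually configured with. It assumes the nodes run containers via Podman, which the notes do not state.

    # Query the configured storage graph driver and flag nodes still on overlay.
    import json
    import subprocess

    info = subprocess.run(
        ["podman", "info", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    store = json.loads(info.stdout).get("store", {})
    driver = store.get("graphDriverName", "unknown")
    print(f"configured graph driver: {driver}")
    if driver == "overlay":
        print("still on overlay -- the driver itself does not enforce per-job space limits")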
10:50 → 11:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- DNS Issue (External Change) - Drain
  - Campus networking performed work early Sunday morning (2/22/25) that caused inbound packets to the data center to be blocked. It was a routing problem that indirectly impacted DNS, which led to various issues and draining.
  - We noticed this on Sunday morning, investigated, found the DNS problems, and implemented a temporary fix that afternoon so the site could receive jobs again (a small resolution-check sketch follows after this item).
  - We contacted campus networking on Monday and held a meeting to troubleshoot together; they resolved the routing issue on the campus router.
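A minimal sketch of how the DNS symptom could be monitored, given purely as an illustration rather than SWT2's actual procedure; the hostnames are placeholders.

    # Check that a few hostnames the cluster depends on still resolve.
    import socket

    HOSTS = ["gate01.example.edu", "se01.example.edu", "atlas-condor.example.edu"]

    for host in HOSTS:
        try:
            addrs = {ai[4][0] for ai in socket.getaddrinfo(host, None)}
            print(f"{host}: {', '.join(sorted(addrs))}")
        except socket.gaierror as exc:
            print(f"{host}: DNS lookup FAILED ({exc})")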
- EL9 Performance
  - Other than the DNS issue, the new EL9 nodes have been running extremely well: between 16K and 18K cores with a very low error rate for production jobs.
- EL9 Next Steps
  - Continuing to develop the EL9 test cluster so we are in a better position to build and test the rest of the EL9 appliances. Currently it is a hybrid of EL7 and EL9, similar to the production cluster.
  - Working on testing EL9 with storage.
- New Storage Deployment
  - The rails sent by Dell are too long for our racks. We installed one storage node to test them, and are purchasing third-party rails to see whether they work better for us.
  - We have 8 MD3460 RAID arrays to replace and 12 new storage nodes.
  - The plan is still under discussion, but we expect to deploy four of the new storage nodes as EL7, put two in the test cluster for various testing of the EL7-to-EL9 migration, and use the remaining four to gradually replace the old MD3460s.
  - Plan to have the four EL7 storage nodes deployed within the next month; the rest of the deployment will be more gradual.
- Procurement
  - Planning a potential purchase of new hardware to replace head nodes and improve the network infrastructure (switches).
OU:
- Sorry, can't attend because of a conflicting meeting
- OU_OSCER_ATLAS site running well
- Short downtime on Thursday morning to move a network switch and some compute nodes
- OU_OSCER_ATLAS_TEST jobs are running fine now, but we are still getting HC jobs that exceed memory (see the query sketch after this list):
- https://bigpanda.cern.ch/jobs/?hours=12&computingsite=OU_OSCER_ATLAS_TEST&jobtype=prod&jobstatus=failed
- ANALY_OU_OSCER_GPU_TEST still has issues (possibly container related); we continue to investigate.
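A hedged sketch of pulling the same BigPanDA query as the link above in JSON form, to look at the failed HC jobs programmatically. It assumes the monitor returns JSON when asked via the Accept header; the exact response layout is not guaranteed here, so the sketch only counts entries defensively.

    # Fetch the failed-job list for OU_OSCER_ATLAS_TEST from BigPanDA as JSON.
    import requests

    URL = ("https://bigpanda.cern.ch/jobs/?hours=12"
           "&computingsite=OU_OSCER_ATLAS_TEST&jobtype=prod&jobstatus=failed")

    resp = requests.get(URL, headers={"Accept": "application/json",
                                      "Content-Type": "application/json"},
                        timeout=60)
    resp.raise_for_status()
    data = resp.json()
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    print(f"failed jobs returned: {len(jobs)}")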