US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
10:00 → 10:10
Top of the meeting discussion 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
10:10 → 10:20
TW-FTT 10m
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
In general, the site is running smoothly.
Some updates on the network status:
- The issue of the 1Gbps cap on a single data stream was solved; the whole 3x1Gbps links are now available for each stream (a quick throughput-check sketch follows below).
- The 3x1Gbps links will be upgraded to ~5Gbps in the coming weeks.
- When the submarine cable across the Pacific Ocean was in trouble, some traffic went through the TW-JP-US route and nearly saturated the 10Gbps pipe between AS and JP.
- A further upgrade of the international bandwidth between TW and US to 100Gbps by TWAREN is also possible later this year.
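A minimal sketch of how the single-stream versus multi-stream throughput could be verified with iperf3, assuming an iperf3 server is available at the far end; the hostname here is a placeholder, not a site endpoint mentioned in the notes.

    # Compare 1-stream and 3-stream throughput to confirm the single-stream
    # cap is gone. "perfsonar.example.org" is a hypothetical test endpoint.
    import json
    import subprocess

    SERVER = "perfsonar.example.org"  # placeholder iperf3 server

    def run_iperf3(streams: int, seconds: int = 10) -> float:
        """Run iperf3 with the given number of parallel streams and
        return the received throughput in Gbps."""
        out = subprocess.run(
            ["iperf3", "-c", SERVER, "-P", str(streams), "-t", str(seconds), "-J"],
            capture_output=True, text=True, check=True,
        )
        result = json.loads(out.stdout)
        return result["end"]["sum_received"]["bits_per_second"] / 1e9

    if __name__ == "__main__":
        print(f"1 stream : {run_iperf3(1):.2f} Gbps")
        print(f"3 streams: {run_iperf3(3):.2f} Gbps")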
10:20 → 10:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- Follow-up on the 03-Mar job errors with missing input files:
  - 60 pilot:1094 job errors over the previous 24h
  - from 4 data sets mc23_13p6TeV:AOD.42228900/42228920/42228991/42229014.*
  - 2+4+23+1 = 30 lost files, which had indeed been created Dec 7-8
  - scanned all 1521 files in these 4 data sets (a check sketch follows below)
  - 26+50+11+36 = 123 additional lost files
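A minimal sketch of one way such a scan could be done, assuming a plain-text list of the files the catalog expects and a dump of the storage namespace; both filenames are placeholders, and this is not necessarily AGLT2's actual procedure.

    # Report files expected for the four datasets but absent from storage.
    EXPECTED_DUMP = "rucio_expected_files.txt"     # placeholder: expected LFNs, one per line
    NAMESPACE_DUMP = "storage_namespace_dump.txt"  # placeholder: files found on storage

    def load_lfns(path: str) -> set[str]:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    expected = load_lfns(EXPECTED_DUMP)
    present = load_lfns(NAMESPACE_DUMP)

    missing = sorted(expected - present)
    print(f"checked {len(expected)} expected files, {len(missing)} missing")
    for lfn in missing:
        print(lfn)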
- EL9 at MSU, aka RH Satellite provisioning via Capsule at AGLT2: still only frustratingly close.
  - We had identified one port (5646) on the Satellite that was not reachable from the Capsule.
  - The MSU IT Satellite team submitted a ticket to the MSU IT firewall team.
  - The port was opened 6 days later, but the request was off by one (5747).
  - The correction was supposed to happen last night, but it was still failing this morning (05-Mar).
  - We have already double-checked that all other needed ports, in both directions, are open, so this should be the last connectivity issue (a minimal reachability check is sketched below).
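A minimal reachability check that could be run from the Capsule host, given as an illustration rather than the actual AGLT2/MSU procedure; the Satellite hostname is a placeholder and the port list simply echoes the two port numbers in the notes.

    # Try a TCP connection from the Capsule to the Satellite ports in question.
    import socket

    SATELLITE = "satellite.example.edu"  # hypothetical Satellite hostname
    PORTS = [5646, 5747]                 # the two ports mentioned in the notes

    for port in PORTS:
        try:
            with socket.create_connection((SATELLITE, port), timeout=5):
                print(f"{SATELLITE}:{port} reachable")
        except OSError as exc:
            print(f"{SATELLITE}:{port} NOT reachable ({exc})")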
10:30 → 10:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- The IU downtime ran into a second day to make sure everything came back up cleanly.
- The UC storage rails finally came in. The machines are racked, cabled, and currently going through benchmarks.
  - Planned to be in production this week.
- Transitioned completely from Puppet to OpenVox.
- A cgroups program was written and sent to Paul Nilsson to test (a rough sketch of a cgroup read-out is below).
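The notes do not describe what the cgroups program does, so the following is only an assumption: a minimal cgroup v2 read-out of the kind a pilot-side check might perform, reporting current and maximum memory for the cgroup the process runs in.

    # Read memory.current and memory.max for the current process's cgroup
    # (assumes a cgroup v2 host; not the actual program sent to Paul Nilsson).
    from pathlib import Path

    def current_cgroup() -> Path:
        # On cgroup v2, /proc/self/cgroup has a single line like "0::/some/path".
        rel = Path("/proc/self/cgroup").read_text().strip().split("::", 1)[1]
        return Path("/sys/fs/cgroup") / rel.lstrip("/")

    cg = current_cgroup()
    mem_current = int((cg / "memory.current").read_text())
    mem_max_raw = (cg / "memory.max").read_text().strip()

    print(f"cgroup        : {cg}")
    print(f"memory.current: {mem_current / 2**20:.1f} MiB")
    if mem_max_raw == "max":
        print("memory.max    : unlimited")
    else:
        print(f"memory.max    : {int(mem_max_raw) / 2**20:.1f} MiB")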
10:40 → 10:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
- There is an ongoing discussion about pilots trying to use more space than requested on NET2. We will move away from the Overlay graph driver to a different one, and that alone should mask the problem (see the sketch at the end of this section).
- We are investigating tuning values for handling large numbers of transfer requests arriving at the dCache doors at the same time: we saw this during the challenge and are seeing it again on transfers started by the worker nodes. Thanks, Judith, for the help.
- Two new rd760 servers are ready to be racked at NESE. They are currently being used for the ongoing ZFS performance investigation and will be put into production very soon.
- We reported in the ticket that we implemented BGP tagging of the LHCONE prefixes. We are waiting for Edoardo to confirm that it is working on their end.
- The first 1PB data flow is being set up by Fabio to start using the tape. This is the last stage of the setup.
- The OKD cluster is installed on virtual machines to be used by the OSG folks for the kauntifier development. There are still some errors to be ironed out, but it should be ready by the end of this week.
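Related to the graph driver item above: a minimal sketch of how one could confirm which container storage driver a worker node is actually configured with. It assumes the nodes run containers via Podman, which the notes do not state.

    # Query the configured storage graph driver and flag nodes still on overlay.
    import json
    import subprocess

    info = subprocess.run(
        ["podman", "info", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    store = json.loads(info.stdout).get("store", {})
    driver = store.get("graphDriverName", "unknown")
    print(f"configured graph driver: {driver}")
    if driver == "overlay":
        print("still on overlay -- the driver itself does not enforce per-job space limits")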
10:50 → 11:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- DNS Issue (External Change) - Drain
  - Campus networking performed work early Sunday morning (2/22/25) that caused inbound packets to the data center to be blocked. It was a routing problem that indirectly impacted DNS, which led to various issues and draining.
  - We noticed this on Sunday morning, investigated, found the DNS problems, and implemented a temporary fix that afternoon so the site could receive jobs again (a small resolution-check sketch follows after this item).
  - We contacted campus networking on Monday and held a meeting to troubleshoot together; they resolved the routing issue on the campus router.
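A minimal sketch of how the DNS symptom could be monitored, given purely as an illustration rather than SWT2's actual procedure; the hostnames are placeholders.

    # Check that a few hostnames the cluster depends on still resolve.
    import socket

    HOSTS = ["gate01.example.edu", "se01.example.edu", "atlas-condor.example.edu"]

    for host in HOSTS:
        try:
            addrs = {ai[4][0] for ai in socket.getaddrinfo(host, None)}
            print(f"{host}: {', '.join(sorted(addrs))}")
        except socket.gaierror as exc:
            print(f"{host}: DNS lookup FAILED ({exc})")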
- EL9 Performance
  - Other than the DNS issue, the new EL9 nodes have been running extremely well: between 16K and 18K cores with a very low error rate for production jobs.
- EL9 Next Steps
  - Continuing to develop the EL9 test cluster so we are in a better position to build and test the rest of the EL9 appliances. Currently it is a hybrid of EL7 and EL9, similar to the production cluster.
  - Working on testing EL9 with storage.
- New Storage Deployment
  - The rails sent by Dell are too long for our racks. We installed one storage node to test them, and are purchasing third-party rails to see whether they work better for us.
  - We have 8 MD3460 RAID arrays to replace and 12 new storage nodes.
  - The plan is still under discussion, but we expect to deploy four of the new storage nodes as EL7, put two in the test cluster for various testing of the EL7-to-EL9 migration, and use the remaining four to gradually replace the old MD3460s.
  - Plan to have the four EL7 storage nodes deployed within the next month; the rest of the deployment will be more gradual.
- Procurement
  - Planning a potential purchase of new hardware to replace head nodes and improve the network infrastructure (switches).
OU:
- Sorry, can't attend because of a conflicting meeting
- OU_OSCER_ATLAS site running well
- Short downtime on Thursday morning to move a network switch and some compute nodes
- OU_OSCER_ATLAS_TEST jobs are running fine now, but we are still getting HC jobs that exceed memory (see the query sketch after this list):
- https://bigpanda.cern.ch/jobs/?hours=12&computingsite=OU_OSCER_ATLAS_TEST&jobtype=prod&jobstatus=failed
- ANALY_OU_OSCER_GPU_TEST still has issues (possibly container related); we continue to investigate.
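A hedged sketch of pulling the same BigPanDA query as the link above in JSON form, to look at the failed HC jobs programmatically. It assumes the monitor returns JSON when asked via the Accept header; the exact response layout is not guaranteed here, so the sketch only counts entries defensively.

    # Fetch the failed-job list for OU_OSCER_ATLAS_TEST from BigPanDA as JSON.
    import requests

    URL = ("https://bigpanda.cern.ch/jobs/?hours=12"
           "&computingsite=OU_OSCER_ATLAS_TEST&jobtype=prod&jobstatus=failed")

    resp = requests.get(URL, headers={"Accept": "application/json",
                                      "Content-Type": "application/json"},
                        timeout=60)
    resp.raise_for_status()
    data = resp.json()
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    print(f"failed jobs returned: {len(jobs)}")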