US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
WBS 2.3 Facility Management News
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Thanks to everyone for getting in their WBS 2.3 quarterly reports.
WBS 2.3 top-level quarterly should be done soon.
WLCG/HSF meeting coming up in early May
Tier-2s need to work on finalizing procurement and ops plans (discuss in WBS 2.3.2)
- After procurement plans are ready, we need to work on 5-year estimator
Milestone updates are still needed for WBS 2.3: https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?gid=1906829311#gid=1906829311
- #117 (Feb 2025, Delayed by SWT2): updates? WLCG site network monitoring is roughly two years delayed so far...
- #374 (Apr 2025, On Schedule, waiting on BNL?): needs an updated comment?
- #279 (Apr 2025, Delayed, Tier-1): needs an updated comment?
- #392 (Jan 2025, "On Schedule", Tier-1): needs update
- #393 (Jan 2025, "On Schedule", Tier-1): needs update
- #191 (Apr 2025, Delayed, Tier-1): update comment?
- #310 (Feb 2025, Delayed, SWT2): update estimated date and comment
- #316 (Mar 2025, Delayed, SWT2): update estimated date and comment
- #363 (Mar 2025, On Schedule): update status or estimated date/comment
- #410 (Apr 2025, Delayed, WBS 2.3.4): update comment?
- #414 (Apr 2025, On Schedule, WBS 2.3.4): but is this a real milestone?
- #328 (Apr 2025, Delayed, WBS 2.3.5.1): see comment, update estimated date
- #415 (Mar 2025, WBS 2.3.5.2): update estimated date and comment OR retire?
- #416 (Jun 2025, WBS 2.3.5.2): is the estimated date correct? Update comment?
- #419 (Mar 2025, On Schedule, WBS 2.3.5.2): new estimated date needed; change status to Delayed
- #428 (Mar 2025, Delayed, WBS 2.3.5.3): new estimated date, update comment
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
Tier-1 Infrastructure
Speaker: Jason Smith
Compute Farm
Speaker: Thomas Smith
Storage
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Tier1 Operations and Monitoring
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
- NTR
WBS 2.3.1.3 Tier-1 Compute - Tom
- New compute racks were added; the Tier-1 CPU count is temporarily raised to ~45k CPUs
- Retirement of older equipment / donation to a Tier 3 will happen soon; the Tier-1 core count will then show a small net decrease, but there will still be a net gain in HEPscore23, since the new hardware is better/faster core for core (a worked example follows this subsection)
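To illustrate the core-count vs. HEPscore23 point above, here is a short worked example with purely made-up numbers; the real rack sizes and per-core scores were not quoted in the meeting:

```python
# Purely illustrative numbers (not the real BNL procurement figures):
# a replacement that loses cores can still raise total HEPscore23.
old_cores, old_score_per_core = 10_000, 10.0  # assumed retiring hardware
new_cores, new_score_per_core = 8_000, 15.0   # assumed replacement hardware

delta_cores = new_cores - old_cores                       # -2,000 cores
delta_score = (new_cores * new_score_per_core
               - old_cores * old_score_per_core)          # +20,000 HEPscore23

print(f"core change: {delta_cores:+d}, HEPscore23 change: {delta_score:+,.0f}")
```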
WBS 2.3.1.4 Tier-1 Storage - Carlos
- 5280 TB of DISK space added to the 2025 pledge
- 10 pool hosts commissioned into production
- 25030 TB of TAPE space added to the 2025 pledge
WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan
- Emptying of the cluster today due to a user assigning all his jobs to BNL only (~100k jobs); mitigation steps below (a sketch of the logic follows this subsection):
- Killing all of the user's assigned jobs at BNL
- Unsetting the site for all his jobs
- Temporarily limiting the number of SCORE jobs at BNL
- The site started to recover in the last hour
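A minimal sketch of the mitigation logic above. The helper functions are hypothetical stand-ins, not the actual PanDA/Harvester admin tooling used at BNL; only the control flow mirrors the steps in the minutes:

```python
# Hypothetical helpers stand in for the real PanDA admin tooling; only the
# control flow mirrors the mitigation steps described in the minutes.

def list_assigned_jobs(site):
    """Return jobs currently assigned to `site` (stub)."""
    return []

def kill_job(job_id):
    """Kill one assigned job (stub)."""

def unset_site(job_id):
    """Clear the explicit site assignment so the job can be brokered elsewhere (stub)."""

def set_site_job_cap(site, job_type, cap):
    """Temporarily cap the number of jobs of `job_type` at `site` (stub)."""

def drain_pinned_user(site, user, score_cap):
    # 1. Kill the user's jobs already assigned to the site ...
    for job in list_assigned_jobs(site):
        if job["user"] == user:
            kill_job(job["id"])
            # 2. ... and clear the site pin so reassignment spreads the load.
            unset_site(job["id"])
    # 3. Temporarily cap single-core (SCORE) jobs at the site.
    set_site_job_cap(site, "SCORE", score_cap)

# Usage (cap value is illustrative): drain_pinned_user("BNL", "some_user", score_cap=10_000)
```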
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running for the last two weeks.
- AGLT2 continued work on understanding why cvmfs hangs at their sites.
- Still trying to understand why AGLT2 does not seem to be able to run more than 6000 SCORE jobs at a time; this caused a small draining on one day.
- MWT2 had reduced production last week due to a rolling drain to remount the cvmfs repos.
- The drain/remount ended the cvmfs aborts and seems to have activated the fix for the bug causing the aborts.
- It also finally caused the increased number of file descriptors specified in the configuration file to take effect.
- I recommend that all sites update to cvmfs version 2.12.7 (a quick version/configuration check is sketched at the end of this list).
- OU had problems with their scratch area setup and had more failures than usual.
- Some issues were fixed, but the problem still occasionally appears on some servers.
- SWT2_CPB had trouble staying full.
- ADC tried submitting 16-core MCORE jobs.
- Set up a second gatekeeper.
- Seems better?
- Finished the quarterly reporting.
- Now focusing on the Operations and Procurement plans.
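As a follow-up to the cvmfs recommendation above, here is a sketch of a quick per-node check, assuming an RPM-based worker node with the standard client configuration file /etc/cvmfs/default.local (CVMFS_NFILES is the cvmfs client parameter controlling the file-descriptor limit):

```python
# Sketch: verify the cvmfs client version and CVMFS_NFILES on an RPM-based node.
# Assumes the standard client configuration file /etc/cvmfs/default.local.
import re
import subprocess

RECOMMENDED = (2, 12, 7)  # version recommended in the minutes

def installed_cvmfs_version():
    out = subprocess.run(
        ["rpm", "-q", "--queryformat", "%{VERSION}", "cvmfs"],
        capture_output=True, text=True, check=True,
    ).stdout
    return tuple(int(x) for x in out.strip().split("."))

def configured_nfiles(path="/etc/cvmfs/default.local"):
    try:
        with open(path) as f:
            for line in f:
                m = re.match(r"\s*CVMFS_NFILES\s*=\s*(\d+)", line)
                if m:
                    return int(m.group(1))
    except FileNotFoundError:
        pass
    return None

if __name__ == "__main__":
    version = installed_cvmfs_version()
    status = "OK" if version >= RECOMMENDED else "older than 2.12.7, please update"
    print("cvmfs", ".".join(map(str, version)), status)
    print("CVMFS_NFILES:", configured_nfiles())
```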
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
HPC Operations
Speaker: Rui Wang (Argonne National Laboratory (US))
- TACC: job submission was suspended during the weekend; the Harvester instance is stopped for now. ~1.5k SUs.
- Perlmutter: maintenance last week. CPU usage is slightly below expectation; the MCORE job rate is quite stable (not Premium). Suggestion from NERSC (on Rucio): reduce the number of jobs in the queue to improve throughput (a throttling sketch follows below).
- ACCESS: need to discuss the details with Doug.
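A minimal sketch of the queued-job throttling suggested above, under the assumption that the goal is to cap idle jobs in the batch queue rather than total submissions; count_queued_jobs() and submit_job() are hypothetical stand-ins, not Harvester's actual interface:

```python
# Hypothetical throttle: keep at most MAX_QUEUED idle jobs at the HPC site.
# count_queued_jobs() and submit_job() stand in for real batch/Harvester calls.
import time

MAX_QUEUED = 50       # assumed cap on idle jobs in the batch queue
POLL_SECONDS = 60

def count_queued_jobs():
    """Return the number of idle (not yet running) jobs (stub)."""
    return 0

def submit_job():
    """Submit one pilot/job to the batch system (stub)."""

def submission_loop(pending_work):
    while pending_work > 0:
        headroom = MAX_QUEUED - count_queued_jobs()
        for _ in range(max(0, min(headroom, pending_work))):
            submit_job()
            pending_work -= 1
        time.sleep(POLL_SECONDS)  # re-poll instead of flooding the scheduler
```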
Integration of Complex Workflows on Heterogeneous Resources
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- Rucio DB overload on Wednesday due to multiple hanging ART job queries
- The problem has been mitigated
- Ongoing:
- DB experts are working on DB optimization (ATDBOPS-406)
- ART workflow should be optimized (ATLINFR-5755)
- HC: starting tomorrow, PFT_MCORE tests will be able to auto-exclude Production-only PQs.
- Working on automatic storage blacklisting based on functional test transfers (the idea is sketched after this list).
- A campaign is underway to verify that all pledged compute resources allow 96-hour jobs.
- Fred found some problem tasks:
- leaky Exotics derivations, which triggered a discussion on automatic stopping of leaky tasks
- failing evgen
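A minimal sketch of the functional-test-based storage blacklisting mentioned above: compute a recent success rate per storage endpoint from functional-test transfers and exclude endpoints that fall below a threshold. The threshold, sample window, and data layout are assumptions, not the actual implementation:

```python
# Sketch: blacklist storage endpoints whose functional-test transfer success
# rate drops below a threshold. Threshold/window/data layout are assumptions.
from collections import defaultdict

SUCCESS_THRESHOLD = 0.5   # assumed: blacklist below 50% success
MIN_SAMPLES = 20          # assumed: require enough recent tests to judge

def endpoints_to_blacklist(transfers):
    """transfers: [{'endpoint': str, 'ok': bool}, ...] from recent functional tests."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for t in transfers:
        total[t["endpoint"]] += 1
        ok[t["endpoint"]] += t["ok"]
    return [ep for ep, n in total.items()
            if n >= MIN_SAMPLES and ok[ep] / n < SUCCESS_THRESHOLD]
```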
Facility R&D
Speaker: Lincoln Bryant (University of Chicago (US))
- Armada work continues on stretched k8s; there are some deficiencies in how to securely store the postgres password in the deployment (see the sketch after this list).
- Ticket for clarification / request for improvement will be filed
- Coffea Casa deployment work continues; debugging a 'client not found' issue between JupyterHub and Keycloak.
- Moving various AF/K8S services to keycloak-prod, deprecating keycloak-dev, and syncing AF users into Keycloak periodically.
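On the postgres-password point above: one conventional Kubernetes pattern is to mount the credential as a Secret and have the application read it from a file (or an environment variable) at startup, so it never lands in the image or the deployment manifest. A minimal sketch of the consuming side; the mount path and variable name are illustrative, not Armada's actual configuration:

```python
# Sketch: read a postgres password from a mounted Kubernetes Secret file,
# falling back to an env var. Path and env-var name are illustrative assumptions.
import os
from pathlib import Path

SECRET_FILE = Path("/etc/secrets/postgres-password")  # assumed Secret mount point

def postgres_password():
    if SECRET_FILE.exists():
        # File-mounted Secrets can be rotated without rebuilding or redeploying.
        return SECRET_FILE.read_text().strip()
    try:
        return os.environ["POSTGRES_PASSWORD"]  # assumed env-var name
    except KeyError:
        raise RuntimeError("no postgres credential provided") from None
```

Reading from a mounted file is generally preferred over an environment variable, since env vars are easier to leak via process inspection or debug logs.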
AOB