US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
1. WBS 2.3 Facility Management News
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
- Spring HEPiX was last week and there are lots of interesting talks attached to the timetable: https://indico.cern.ch/event/1377701/ Note that Fall HEPiX will be at OU in early November. Start getting your talks ready!
- Quarterly reports are due. In WBS 2.3 we are missing: 2.3.2, MWT2, SWT2, HPC (2.3.3), 2.3.4.1, 2.3.4.3, CIOPS (2.3.5*).
- Milestone updates are due. This link has the list of known milestone updates/changes/additions. Let Rob, Alexei and Shawn know of others.
- Today through Friday is the Joint ATLAS / IRIS-HEP Kubernetes Hackathon (https://indico.cern.ch/event/1384683/), so quite a few people are unable to join our meeting today.
- The DC24 analysis for USATLAS sites is mostly complete but still needs a summary. We have already merged the site results into the WLCG DOMA final report.
- Working on L3/L4 management team composition.
Tier-1
- Discussion with ADC about bulk data rebalancing [BNL received ~the same data volume to archive in 30 days as for all of 2023].
  - From communication with ADC Coordination: "I see how this information might not propagate to the right places. We agreed that for the next time a similar tape campaign starts, an approximation of the data amount expected per site and the duration of the campaign is going to be given directly to the sites in advance."
- Rolling upgrade of dCache at BNL: now running 9.2.17.
- Scheduled for next week: upgrade the firmware of the two Oracle SL8500 ATLAS libraries. Oracle needs 2-3 hours of downtime and suggests next Monday or Tuesday (April 29 or 30) [green light from ADC Coordination].
- Ongoing discussion about HC auto-exclusion of sites.
2. OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
3. Infrastructure and Compute Farm
Speaker: Thomas Smith
4. Storage
Speaker: Jason Smith
5. Tier1 Services
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- Issues:
  - CVMFS wrapper issues on one node. Solved.
  - Failed pilots due to HTCondor retries. Ongoing.
  - Blacklisted on 04/12/24, 20:20 CET due to a dCache pool problem. Solved.
- Interventions:
  - dCache rolling upgrade to v9.2.17 (GitHub: 7525). Finished on 04/24/2024.
  - Tape library firmware update, planned for 04/29. No downtime required.
- Misc:
  - DC24 report finished.
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Good running over the past few weeks with minor problems.
  - AGLT2: job failures involving CVMFS and certain jobsets.
  - NET2: almost all BU servers in production (a success, not a problem!).
  - SWT2 CPB: working on LSM and several other issues.
- Working on reporting and milestones.
- Reworking the capacity sheet to make it easier to enter data.
- I am really worried that some sites will not make the June 30 deadline to retire EL7.
- Also need to replace OSG version 3.6 with version 23 by June 30 (a minimal per-node check for both deadlines is sketched after this list).
- Still no news on the final funding increment?
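As a small aid for tracking the two June 30 deadlines above, here is a minimal per-node check sketch in Python. It assumes a standard /etc/os-release and that the OSG series can be read from the version of an "osg-release" RPM; both are assumptions about the node setup, so adapt paths and package names to each site's configuration management.

```python
# Hypothetical per-node check for the June 30 deadlines: flags hosts still
# on EL7 or still carrying the OSG 3.6 series. Assumes a standard
# /etc/os-release and that the OSG series is the version of the
# "osg-release" RPM; adjust for local tooling.
import re
import subprocess

def el_major_version(path="/etc/os-release"):
    """Return the EL major version (7, 8, 9, ...) from os-release, or None."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("VERSION_ID="):
                    return int(line.split("=", 1)[1].strip().strip('"').split(".")[0])
    except (OSError, ValueError):
        pass
    return None

def osg_series():
    """Return the installed OSG series ('3.6', '23', ...) via rpm, or None."""
    try:
        out = subprocess.run(
            ["rpm", "-q", "--qf", "%{VERSION}", "osg-release"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.match(r"\d+(\.\d+)?", out)
    return match.group(0) if match else None

if __name__ == "__main__":
    el, osg = el_major_version(), osg_series()
    print(f"EL major version: {el}, OSG series: {osg}")
    if el == 7:
        print("WARNING: still on EL7 -- must be retired by June 30")
    if osg is not None and osg.startswith("3.6"):
        print("WARNING: still on OSG 3.6 -- migrate to OSG 23 by June 30")
```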
6. WBS 2.3.3 HPC Operations
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- TACC
  - Job failure due to input file:
    - Pilot fails to create the symbolic link under the PanDA_Pilot-* working area (no error message from pilot.copytool.mv); the PanDA_Pilot-* permission had been set to 750 instead of 770 (see the permission-check sketch after this list).
      - Switched to pilot v3.5.2.12. A local test in the debug queue succeeded.
    - Requesting a new test task.
      - The previous one was broken because of a wrong setting in the SLURM parameters for the regular queue; jobs failed to be submitted afterwards.
- NERSC
  - Starting to get user requests (based on an invite from Mike Hance).
  - Currently running 5-node SLURM submissions for up to 20 hours.
  - Seeing both evgen and simulation jobs.
  - Still need more work to keep up with the uniform-usage line.
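The 750-vs-770 problem noted for TACC is easy to spot from the filesystem side. Below is a minimal sketch (not the actual pilot or Harvester code) that scans a scratch area for PanDA_Pilot-* working directories and restores group rwx; the scratch path is a hypothetical placeholder for the site's real working area.

```python
# Sketch only (not pilot/Harvester code): find PanDA_Pilot-* working
# directories whose group permissions dropped to 750 and restore 770 so
# that symbolic-link creation by group members works again.
import glob
import os
import stat

SCRATCH = "/scratch/atlas"  # hypothetical scratch area; use the site's real path

for workdir in glob.glob(os.path.join(SCRATCH, "PanDA_Pilot-*")):
    if not os.path.isdir(workdir):
        continue
    mode = stat.S_IMODE(os.stat(workdir).st_mode)
    if mode & 0o070 != 0o070:  # group rwx not fully set (e.g. 0o750)
        print(f"fixing {workdir}: {oct(mode)} -> 0o770")
        os.chmod(workdir, 0o770)
```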
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
7. Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
8. Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
9. Analysis Facilities - Chicago
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
10. ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- ADC
  - LHC operation started (ramped up to 2000 bunches). No problems on the data-processing side.
  - RPG / Data Consolidation campaign is currently running, which adds additional traffic to the tape systems.
  - ARM resources at CERN increased to 2k slots (Monit).
  - HC blacklisting:
    - 2 of 3 AFTs were relying on a single input file, effectively testing only the one disk server storing that file. These tests will be replaced by the same test with a newer software version and more, smaller files (DAODs instead of AODs).
  - IGTF Root CAs still use SHA-1 (deprecated 10 years ago), which forces EL9 WNs to re-enable SHA-1 (a sketch for spotting SHA-1-signed trust anchors follows this list).
    - ATLAS officially requested an upgrade of the CAs (IGTF) from WLCG.
    - For OSG, Alma9 nodes switch to SHA-1 (OSG documentation).
    - WLCG Service Report (link): "Being followed up, but a quick solution looks very unlikely".
    - Relevant GitHub issue: https://github.com/dlgroep/fetch-crl/issues/4
- USATLAS
  - SWT2
    - Higher load from high-I/O jobs overloads the storage, which makes HC jobs fail, which in turn blacklists the site.
      - Could limit the average I/O per PQ in CRIC.
      - It is not transparent what the configuration behind the internal xrootd door is.
    - No news on LSM decommissioning.
  - AGLT2 PQs reconfigured (agreed with AGLT2 admins; see the queue-split illustration after this list):
    - AGLT2: maxrss: 32000
    - AGLT2_VHIMEM: minrss: 32001, maxrss: 48000
  - SLAC
    - Wei managed to configure the local Frontier.
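As a small illustration of the IGTF SHA-1 item above, the sketch below lists which CA certificates on a worker node are still SHA-1-signed. It assumes the Python "cryptography" package and the conventional /etc/grid-security/certificates trust-anchor directory; adjust the path for non-standard installations.

```python
# Illustration of the SHA-1 CA issue: report trust-anchor certificates that
# are still signed with SHA-1. Assumes the 'cryptography' package and the
# conventional /etc/grid-security/certificates layout.
import glob

from cryptography import x509

CERT_DIR = "/etc/grid-security/certificates"

for path in sorted(glob.glob(f"{CERT_DIR}/*.pem")):
    try:
        with open(path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read())
    except (OSError, ValueError):
        continue  # skip unreadable files and non-certificate PEMs
    algo = cert.signature_hash_algorithm
    if algo is not None and algo.name == "sha1":
        print(f"SHA-1 signed: {path} ({cert.subject.rfc4514_string()})")
```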
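And as a simple illustration of the reconfigured AGLT2 queue split (illustration only, not the actual PanDA brokering code), a job's maxRSS request in MB decides which of the two PQs it can land in:

```python
# Illustration only (not PanDA brokering code): route a job between the two
# reconfigured AGLT2 panda queues by its maxRSS request in MB.
def target_queue(maxrss_mb: int) -> str:
    """Pick the AGLT2 PQ for a job given its maxRSS request (MB)."""
    if maxrss_mb <= 32000:
        return "AGLT2"           # maxrss: 32000
    if maxrss_mb <= 48000:
        return "AGLT2_VHIMEM"    # minrss: 32001, maxrss: 48000
    return "no matching AGLT2 queue"

if __name__ == "__main__":
    for rss in (8000, 32000, 40000, 64000):
        print(f"{rss} MB -> {target_queue(rss)}")
```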
11. Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
12. Facility R&D
Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
13. AOB