US ATLAS Computing Facility (Possible Topical)
→
US/Eastern
Description
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148
13:00 → 13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
The Blueprint Workshop "Towards a National-Scale AI Collaboration in HEP" was held on Monday and Tuesday this week: https://indico.flatironinstitute.org/event/4120/timetable/
- Closeout slides summarize the workshop.
Upcoming events: CHEP 2026 next week, Facility F2F in Madison, ATLAS S&C, Scrubbing
Tier-2s should be working on a succinct procurement plan
13:05 → 13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Release
- HTCondor 25.11 is in the hopper
- We've been told to avoid XRootD 6 and 5.9.3
- CRIC contact updates
- We support mailing lists
- We need to add support for API key access to Topology before CRIC can get auto-updated
- OSG CE central collector used to provide contact information but it doesn't do that anymore.
- It also used to advertise site queue information if the site CE administrator configured their CE to do so. Should we continue doing that?
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Tier-1 Infrastructure 5m
Speaker: Jason Smith
- 13:15
13:20
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
- No major operational issues to report this week.
- Last week, a patching campaign applied OS updates, including the required hardware reboots.
- The integration instance is now enabled with the HPSS testbed to validate tape workflows with dCache 11.2.(3→x).
- The tape area has been populated with approximately 40 TB of written data. Preliminary staging tests involving more than 2K files were successfully completed.
- kpatch-based security package management has been deployed on the integration instance, with selected components designated as “canary” systems for validation and monitoring.
13:25
Tier1 Operations and Monitoring 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Anomalous activity, caused by sPHENIX, on one of the SCDF NetApp appliances severely degraded the CVMFS Stratum-1 performance starting at midnight ET Tuesday morning. This lasted for ~10 hrs before the issue was identified and mitigations were put into place. During that time BNL and BNL_OPP queues were taken offline by HC for a few hours. Tier-2 sites were also impacted.
- We are in the process of deploying new hardware for the Stratum-1 (just recently received) and that will eliminate this kind of shared storage dependency going forward.
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Good running over the past two weeks
- MWT2 downtime on Monday 11-May to update to dCache 11.2.4
- AGLT2 also updated to dCache 11.2.4 and did general cleaning like firmware updates.
- Firefly is now running at AGLT2 and MWT2. See the Firefly monitoring at: https://dashboard.stardust.es.net/goto/fflyj1vei3gg0a?orgId=2
- SWT2_CPB is migrating (or may already have migrated) the very last of its servers to Alma Linux 9.
- All sites have mitigated the recent CVEs (copy fail, etc.).
- Held a meeting to discuss procurement last Friday at 11 am EDT.
- Some notes in a presentation I wrote to guide the meeting:
https://docs.google.com/presentation/d/1E6bkrvOblZPwTM0mjwqVxYNQfztt-V2KSEcKszrtt8U/
- Discussed the plan for writing procurement plans, which are due in just over a week:
- When writing the plan, estimate CPU at 10.00/HS (was 4.50/HS) and disk at 100.00/usable TB (was 45/usable TB)
- Our priorities are:
- First: Infrastructure and other items affecting large numbers of servers: networking, power (UPSs & PDUs), head nodes / gatekeeper nodes.
- Second: Storage: Meet the 2027 pledges and if possible buy enough to meet an estimate of the 2028 pledge too.
- Third: CPU: Lower priority and easier to bring into service at the last minute before HL-LHC starts.
- When writing the procurement plan, be sure to account for forced retirements of network switches and storage servers.
- Second round of NET2 <-> PRG mini-challenge was last week. ESnet load balancing for NET2 transatlantic links seems to be fixed. Tests topped at almost 380 Gb/s. More details coming after data analysis.
- There is no news about the equipment funding.
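As a rough illustration of what the revised planning figures in the procurement notes above imply, the sketch below compares one purchase at the old and new unit costs. The requested capacities (1000 HS of CPU, 500 usable TB of disk) and all names in the code are made up for the example. The cost units are whatever the planning figures use; the notes do not state a currency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    """Planning unit costs; units as quoted in the notes (currency not stated)."""
    cpu_per_hs: float        # cost per HS of CPU
    disk_per_tb: float       # cost per usable TB of disk


# Figures quoted in the procurement discussion.
NEW = UnitCosts(cpu_per_hs=10.00, disk_per_tb=100.00)
OLD = UnitCosts(cpu_per_hs=4.50, disk_per_tb=45.00)


def estimate_cost(hs: float, usable_tb: float, costs: UnitCosts) -> float:
    """Total estimated cost for a CPU + disk purchase at the given unit costs."""
    return hs * costs.cpu_per_hs + usable_tb * costs.disk_per_tb


def purchasing_power(old_unit_cost: float, new_unit_cost: float) -> float:
    """Fraction of the former capacity that a fixed budget buys at the new cost."""
    return old_unit_cost / new_unit_cost


if __name__ == "__main__":
    # Hypothetical request, for illustration only.
    hs, tb = 1000, 500
    print("old estimate:", estimate_cost(hs, tb, OLD))
    print("new estimate:", estimate_cost(hs, tb, NEW))
    print("CPU purchasing power:", purchasing_power(OLD.cpu_per_hs, NEW.cpu_per_hs))
    print("disk purchasing power:", purchasing_power(OLD.disk_per_tb, NEW.disk_per_tb))
```

For both CPU and disk the old-to-new ratio is 0.45, so a fixed budget buys a little under half of the capacity it would have bought under the earlier estimates.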
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40
HPC Operations 5m
Speaker: Rui Wang (Argonne National Laboratory (US))
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
AF Cluster Updates
- Addressed additional CVEs, including DirtyFrag and ssh-keysign-pwn vulnerabilities
- Added three head nodes to provide dedicated capacity for infrastructure services, helping separate system workloads from user batch workloads
- Drafting migration plans to transition the cluster to Kubespray-based management and a highly available (HA) control plane
14:10 → 14:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Kaushik De (University of Texas at Arlington (US))
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
14:20
Facility R&D 5m
Speaker: Robert William Gardner Jr (University of Chicago (US))
RP1
- Provisioned the initial RP1 production cluster on IU hardware and migrated all services from rp1-dev.
- The full service stack is now deployed on production, validating the GitOps deployment pattern and cluster overlay approach.
- Deployed the public documentation site at docs.rp1.hl-lhc.io, built with Zensical and served via nginx + git-sync.
ODF on RP1
- The ODF cluster was deployed on University of Chicago hardware and is currently offering services similar to RP1.
- Different upstream identity providers, such as CILogon or Globus, are being considered to open access to a wider audience.
- Successfully replicated a 10 TB test dataset and tutorial datasets to the MWT2_OPENDATA RSE.
LLM Assisted Infrastructure Management and Analysis Workflows
- Agentic infrastructure management and agentic analysis workflows are beginning to move into practical implementation.
- Work is underway to assess open source solutions for managing agentic systems across infrastructure and analysis environments.
- Active discussions are focused on how to implement these systems safely, reliably, and with appropriate operational control.
14:25
Cybersecurity plan(s) 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
14:30 → 14:40
AOB 10m