US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
1
WBS 2.3 Facility Management News
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Facility is exploring purchasing options to determine costs and timelines
Meetings coming up:
- HEPiX
- LHCOPN/LHCONE
- CHEP
Work is progressing to debug SciTags support in dCache
Genesis NOFO may be out Friday
HTC26 planning is underway (Jun 9-12 in Madison). Put a calendar hold. We will have 2 of the 4 days for US ATLAS-focused meetings and discussion (pre-scrubbing, data challenges, purchasing, etc.)
=============AI NOTES======================
Quick recap
The meeting focused on facility coordination and updates across various teams. Shawn provided updates on HTC26 planning, noting uncertainty about meeting space availability on Monday and discussing potential costs for room rental. Brian reported on OSG-LHC activities, including Frontier Squid testing and issues with Google's certificate changes affecting X.509 client authentication. The Tier 1 report from Carlos indicated smooth operations with ongoing work on an HTCondor-related code refactor and new SELinux access controls. Fred provided updates on Tier 2 operations, including procurement efforts and network equipment pricing. Rui reported on HPC operations and data transfer requests, while Qiulan discussed work on the JupyterHub authentication workflow. Fengping shared plans for a maintenance window on March 23rd for system updates. Ofer provided updates on continuous integration and operations, including data25 reprocessing testing and FTS4 testing with CERN. Ilija presented on analytics operations, including work on AI assistants for monitoring and the implementation of new Varnish caching setups. The conversation ended with a discussion about the FTS4 deployment timeline and testing requirements for the upcoming DC27 data challenge.
Next steps
- Shawn: Ask Janet what it would cost to book a meeting room on Monday for the HTC26/US ATLAS meeting and report back to the group.
- Brian: Send a brief email to Kaushik (and CC Ivan) about the Google certificate changes and their possible impact on Harvester and other services; Kaushik to forward to the Harvester team.
- Brian: Contact the CRIC team (cricDevs) to coordinate on topology fetching and authentication changes, and offer a ticket for further discussion.
- Horst: Send an email to help@osghc.org to investigate possible impacts of the new certificate usage rules on their XRootD service.
- Judith: Double-check with Rob and, if approved, send the post-mortem write-up regarding the recent incident to Ofer and the relevant group.
- Ilija: Complete work on restoring user branch analytics data for Attila within the next week or two, pending final input from Attila.
- Carlos: Send Ilija the contact information for the dCache group working on AI for operations.
- Ivan: Keep Shawn and the group informed about the status and readiness of FTS4 for DC27, especially regarding the timeline for deployment at BNL.
- Hiro (and BNL team): Provide a rough estimate (e.g., one week) for the time needed to deploy FTS4 at BNL once it is released, for DC27 planning.
- Shawn (and/or relevant planning group): Consider and communicate if DC27 schedule needs to be adjusted based on FTS4 release and deployment timelines.
Summary
Meeting Space Planning and NOFO
The team discussed facility space options for upcoming meetings, including HTC26, HEPiX, LHCOPN/LHCONE, and CHEP. Shawn reported that while Monday space is unavailable, they could either pay for a room on Monday or meet on Tuesday and Wednesday, with the possibility of using hotel space. Alexei suggested booking a hotel room, and Rafael humorously proposed meeting at the pier with beer and sausage. Shawn mentioned that the Genesis NOFO might be released on Friday, requiring a quick response from multiple organizations.
US ATLAS Meeting and Updates
Alexei discussed uncertainty around whether the upcoming funding opportunity would be a NOFO or a different type of call, with potential for around 100 awards. Kaushik and Shawn discussed the need for a US ATLAS meeting, with Shawn still awaiting confirmation from Paolo and Verena about hosting it. Brian provided updates on OSG-LHC, including testing of Frontier Squid and recent changes to Google host certificates, which could affect various services. The Tier 1 report, presented by Carlos, highlighted smooth operations, an ongoing HTCondor-related code refactor, and new strict SELs for NFS access, along with some storage issues that were being addressed.
Operational Updates and AI Monitoring
The meeting covered updates on various operational aspects, including procurement, funding profiles, and maintenance schedules. Fred discussed ongoing procurement efforts and shared a cost-effective Arista network equipment quote. Rui reported on HPC operations, highlighting stable job production and requests for group disk storage. The team discussed updates to the analysis facility, including authentication workflow improvements and potential simplifications to the JupyterHub frontend. Fengping outlined a scheduled maintenance window for system updates at Chicago. Ofer and Ilija discussed AI monitoring efforts, with Ilija presenting an AI assistant for monitoring various services and communications. The team also addressed concerns about Varnish caching at Southwest Tier 2 and discussed the timeline for deploying FTS4 in preparation for Data Challenge 27 (DC27). Ivan noted that FTS4 testing is ongoing, with plans to move production transfers to FTS4 once ready, and Shawn emphasized the importance of ensuring FTS4 meets the requirements for DC27.
2
OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Release: frontier-squid 6.14-1.3 undergoing testing and should be released in the next week or so. Addresses security issue: https://github.com/squid-cache/squid/security/advisories/GHSA-c8cc-phh7-xmxr
- US ATLAS has completely moved from Frontier Squid -> Varnish
- Client X.509 usage going away
- We're seeing HTCondor-CEs dropping from https://collector.opensciencegrid.org/
- We're going to need to disable X.509 authentication for other services, like Topology. Who's a good contact for CRIC these days?
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
3
Tier-1 Infrastructure
Speaker: Jason Smith
4
Compute Farm
Speaker: Thomas Smith
- Mostly smooth operations on T1 for the last week
- Currently looking into some job failures
- Nearly finished with the HTCondor-related Puppet code refactor: scope complete, but untested
- This includes changes to the way HTCondor-related config files are distributed, and will remove the dependency on NFS for some of these files
- Other improvements to the general management of the pool infrastructure
5
Storage
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Really good running recently with only a few minor outages.
- AGLT2 small reduction after a dCache upgrade.
- NET2 had short offline periods caused by disks that had slow readback speed.
- OU had a maintenance downtime.
- TW-FTT was down last weekend for an annual power maintenance.
- Working on procurement.
- Waiting for Dell to provide access to test systems to benchmark compute systems.
- Now that Shawn and Alexei have given us the expected amount of funding, we need to put details into our purchasing plans.
- There will be another meeting to discuss this "soon".
- In my own experience, pricing for Arista networking is slightly down over the last year.
- So now could be a good time to buy networking gear.
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
7
HPC Operations
Speaker: Rui Wang (Argonne National Laboratory (US))
8
Integration of Complex Workflows on Heterogeneous Resources
Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
10
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
11
Analysis Facilities - Chicago
Speaker: Fengping Hu (University of Chicago (US))
- A maintenance window is scheduled for March 23.
- During this maintenance period we will perform routine system updates, including upgrades to firmware, operating systems, Kubernetes, Rook/Ceph, and NVIDIA drivers, along with other standard infrastructure maintenance tasks.
- Services will be temporarily unavailable during the maintenance window and are expected to be restored once the updates are completed.
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
12
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
Speaker: Kaushik De (University of Texas at Arlington (US))
- Started first tests of data25 pp reprocessing (DATREP-381). The start of the full reprocessing will be announced at the ADC weekly meeting.
- Continuing FTS4 tests.
- Need planning to ensure readiness and assess performance improvement for DC27
- Currently planning for a CY Q4 release
- How long do we need for BNL deployment once it's released? A: less than a month
- Run4 ADC TDR work ongoing
- AGLT2 has switched off IPv4 on LHCONE - no issues observed
- OU re-raised concerns about SAM/ETF accounting of scheduled downtimes - to be followed up next month
13
Services DevOps
Speaker: Ilija Vukotic (University of Chicago (US))
- Analytics
- Working with Attila on getting Event-loop to report branch accesses.
- XCaches
- stable
- Oxford had downtime
- Varnish for Conditions
- added a second instance for SWT2 - the only site having two :)
- Frontiers
- Switched US, CA, ND, DE clouds to use new Frontier servers (k8s based).
- The rest of the sites will be switched tomorrow.
- AI Assistance
- Now have a monitoring/alerting system that can monitor the k8s cluster and individual applications (ServiceX/Y, HTCondor, Ceph, PanDA queue, Varnishes, user emails, Mattermost, ...) and send email, Slack, and Mattermost messages. Will add Discourse integration. Still growing in scope; it is almost trivial to add other apps. Who will help me define a good prompt for checking dCache? Powered by Claude Sonnet. The price is not trivial: roughly 4 cents/check.
- Investigating OpenAI gpt-5.4, which now also has computer use and tool search options.
- NVIDIA is expected to announce an open-source NemoClaw. Will give it a test once it's out.
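The per-check price quoted above adds up as more applications are monitored. A rough back-of-the-envelope sketch follows; only the 4 cents/check figure comes from these notes, while the function name, the number of monitored targets, and the check frequency are hypothetical values chosen for illustration.

```python
def monthly_check_cost_usd(num_targets, checks_per_day, cents_per_check=4, days=30):
    """Estimate the spend in USD for LLM-backed monitoring checks.

    cents_per_check defaults to the ~4 cents/check quoted in the notes;
    every other input is an assumption supplied by the caller.
    """
    # Work in integer cents so the arithmetic stays exact, then convert to USD.
    total_cents = num_targets * checks_per_day * days * cents_per_check
    return total_cents / 100


if __name__ == "__main__":
    # Hypothetical load: 10 monitored services, each checked once per hour.
    cost = monthly_check_cost_usd(num_targets=10, checks_per_day=24)
    print(f"Estimated cost over 30 days: ${cost:.2f}")
```

Under these assumed numbers the estimate scales linearly, so halving the check frequency or the number of monitored targets halves the cost.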
14
Facility R&D
Speaker: Robert William Gardner Jr (University of Chicago (US))
- Facility R&D Biweekly meeting last week (notes)
- Discussion of UC AF security incident - Judith preparing a post mortem
- Discussion of potential infrastructure reconfiguration
- Strong progress on RP1 development
15
Cybersecurity plan(s)
Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
16
AOB