US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Moving 1 hour earlier to avoid USATLAS IB Meeting
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 878527
Invite link: https://umich.zoom.us/j/99329677148
-
-
12:00
→
12:05
WBS 2.3 Facility Management News 5mSpeakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
The main topic for today is scrubbing preparation!
- We have partial drafts ready for most WBS 2.3 L3 areas, but most have significant gaps that need covering
- We have an updated and correct effort spreadsheet (verified by Verena) that we need to match for FY26 and potential updates/changes for FY27
- If anyone needs to change FTE levels for FY27, please mark changes in bold and add a comment for the change.
- Most L3 still need to address risks and FY27 plans.
- Copy of the risk registry spreadsheet is in the 2026 scrubbing folder
- Milestone updates and additions should be modified on our WBS 2.3 copy: https://docs.google.com/spreadsheets/d/11FMFmiCHdkkyWvMmNIHiyMc0bBUQI6sS4AzE0FR5TM8/edit?usp=sharing
- If you modify a field, change it to BOLD
- If you add a new milestone, all content in the added row should be bold
- There are AI summaries from last week's face to face in the 2026 scrubbing folder that may be helpful
- Note that some people are still adding milestones and goals to the face-to-face meeting notes: https://docs.google.com/document/d/1_wYAVoJH2KVsOv8W5gZg9ZA2gRrj92I-RyTtLOPOyzw/edit?usp=sharing
- ACTION ITEM: L3s should schedule a review slot with Shawn and Alexei ASAP so we can discuss content.
Note: We need to provide Paolo and Verena Tier-2 availability for the last 3 months.
Upcoming meetings:
- ATLAS S&C at CERN June 28-July 3
- USATLAS Scrubbing (WBS 2.3) at Stony Brook on Tuesday July 14
ZOOM Notes for Meeting
Quick recap
The meeting focused on scrubbing preparation for WBS 2.3, with discussions about missing forward-looking plans and priority items in various L3 areas. Shawn emphasized the need for consistency across different WBS versions (2.3, 2.4, 2.5) to avoid discrepancies in presentations, particularly regarding FY26 and FY27 plans. The team discussed updates to effort spreadsheets, milestone tracking, and the transition from WBS 2.3 to 2.5 for FY27 activities. Frederick reported on Tier 2 operations, including a recent production decrease at MWT2 and downtime at NET2, while Rui provided updates on HPC operations and Harvester thread-pool testing. The conversation ended with a discussion about potential changes to data transfer policies at NERSC, where Douglas raised concerns about current workarounds that may not be permitted long-term, leading to a decision to develop a backup plan for future HPC operations.
Next steps
Brian
- Register Alexei as a US ATLAS VO contact as per instructions provided.
Douglas
- Develop a backup plan for data transfer at NERSC and other HPCs in case direct data movement is no longer allowed, and evaluate the required work and potential impact.
Frederick
- Transfer all milestone changes made in the central copy of the milestone spreadsheet to the correct (facility) copy, and remove bold formatting from the original as appropriate.
- Provide Tier 2 availability and reliability numbers (for March, April, May) for each Tier 2 to Shawn, including both aggregate and individual site data.
- Ping John Hobbs for a definitive answer on Midwest Tier 2 funding status in time for the next management meeting.
Gamboa
- Check with the dCache developers/organizers about the possibility of opening the dCache developers meeting to Tier 2 dCache admins and report back to Ofer.
Qiulan
- After 30 days from notification, archive then delete inactive users' data according to policy, and await further guidance from Viviana regarding the 24 users whose notification emails bounced back.
Rui
- Add machine names to the CPU time information in the scrubbing slides for ALCF.
Collaboration
- L3 leads: Schedule a meeting with Alexei and Shawn to review current slides, discuss what's missing/needs changes, and ensure content is ready for final review with Paolo and Verena.
- All L3s: Update the FY27 effort spreadsheet with any needed changes, marking updates in bold with comments for traceability.
- All L3s: Address risks and FY27 plans in their slide decks and update milestone information in our WBS 2.3 copy, marking new/changed fields in bold.
- Kaushik and Zach: Meet with L3s (especially 2.3, 2.2, 2.5, and 2.4) to discuss and clarify the transition of milestones and research/operations thrusts between WBS areas, ensuring consistency across slides.
- Aidan (and Frederick): Begin testing CVMFS 2.14.0 pre-release on CIT nodes and confirm to Brian when testing starts.
- Kaushik, Rafael: Review the Tier 2 FTE sheet in the scrubbing slides and confirm accuracy, especially for Midwest Tier 2 and any other areas of concern.
- All PIs (Tier 2 leads): Check and validate that their effort numbers in presentations match the official effort spreadsheet, and update as necessary.
Summary
Scrubbing Meeting Preparation Updates
The team discussed preparation for an upcoming scrubbing meeting, with most L3 areas having partial or complete drafts but several significant gaps still needing coverage. Shawn noted that the effort spreadsheet for FY26 has been verified by Verena, and requested that any needed changes for FY27 be marked in bold with comments explaining the changes. The team also needs to address risks and FY27 plans, with Joe Boudreau's risk registry spreadsheet provided as reference material, and milestone updates should be made in our WBS 2.3 copy of the milestone spreadsheet.
WBS Version Consistency Planning
The team discussed ensuring consistency across different WBS versions (2.3, 2.5, 2.2) to avoid discrepancies in presentations and documentation. Kaushik suggested that items for FY27 should initially be placed in WBS 2.3 and later moved to 2.5, with team members notified via email about items that should move to 2.5. Douglas emphasized the need to maintain consistent milestones and research thrusts across all versions to ensure coherence, particularly regarding HPC-related items and facility requirements. Kaushik acknowledged that while the transition would be challenging, they would need to have discussions about all versions including 2.2 and 2.5, as the level 3 structure for 2.5 is not yet established.
Milestone Template and Data Updates
Shawn and Frederick discussed issues with the milestone template copy, where Frederick had made changes to the wrong version. Frederick agreed to transfer his changes to the correct copy and remove bold formatting. They also discussed the need to gather availability numbers for Tier 2 systems over the last three months for future funding agency reports, with Frederick offering to provide both aggregate and individual availability and reliability data.
Project Updates and Software Releases
Shawn noted upcoming meetings including ATLAS Software and Computing in late June and WBS 2.3 scrubbing on July 14th. Frederick clarified that the three-month period would cover March, April, and May due to delayed June data. Brian announced the release of IGTF 1.143, which adds the new InCommon IGTF v4 CA, and mentioned ongoing work with the XRootD team on a better patch set. Frederick and Aidan agreed to install CVMFS 2.14.0 on CIT nodes after Brian's reminder.
US ATLAS Development Updates Meeting
Frederick confirmed he would notify the team when starting a test and updated them on US ATLAS VO contacts, noting that Alexei needs to register as a contact. Brian reported that API key support for Topology is still targeted for the end of the month despite Matyas being on vacation next week. Gamboa discussed an upcoming stakeholders and users technical meeting about storage technologies and mentioned a new minor release of dCache (11.2.05) that improves NFS components, with plans to upgrade the integration instance soon. Ofer inquired about opening the developers meeting to Tier 2 dCache admins, and Gamboa agreed to check with the meeting organizers and provide feedback next week.
Tier Operations Update Meeting
Frederick provided updates on Tier 1 and Tier 2 operations, noting that Midwest Tier 2 experienced a production decrease due to server relocation and power consumption issues, while NET2 took downtime for maintenance. Frederick is working on Tier 2 scrubbing slides and requested Kaushik and Rafael to review the FTE sheet for accuracy. He also mentioned plans to follow up with John Hobbs about funding for tomorrow's management meeting.
Effort Spreadsheet and Data Updates
The team discussed effort spreadsheet validation, with Frederick agreeing to review FY26 and FY27 numbers and asking PIs to verify their numbers match the official spreadsheet. Rui reported on HPC operations, Harvester thread-pool testing, and data transfer issues, while Douglas explained NERSC's policy on Globus data transfers and potential future changes that could affect their current workaround methods. The team also reviewed updates from WBS 2.3.4, including storage configuration changes and management node implementation, and Ivan provided comprehensive updates on various project activities including LHC Run 3 completion and data recovery efforts.
-
12:05
→
12:10
OSG-LHC 5mSpeakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Release
- IGTF 1.143: adds new InCommon IGTF v4 CA
- We're still working with the XRootD team on a better patch set
- US ATLAS CVMFS 2.14.0 testing status?
- Topology
- US ATLAS VO contacts updated but we need Alexei to register (https://osg-htc.org/docs/common/contact-registration/)
- API keys: still aiming for support by the end of this month
-
12:10
→
12:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
12:10
Tier-1 Infrastructure 5mSpeaker: Jason Smith
-
12:15
Compute Farm 5mSpeaker: Thomas Eric Smith (Brookhaven National Laboratory (US))
-
12:20
Storage 5mSpeaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Note:
The next SCDF Stakeholders and Users Technical meeting has been
scheduled for Thursday, June 18th @ 11 am ET (10 am CT). Dmitry
Litvintsev will speak about the evaluation of storage technologies
at FNAL.
https://indico.bnl.gov/event/33234/
Alternate Zoom link:
https://bnl.zoomgov.com/j/1608396009?pwd=ITv56L0MNU91N5P7lrhwpQV1PaDwro.1
Operations
No major issues to report
-
12:25
Tier1 Operations and Monitoring 5mSpeaker: Ofer Rind (Brookhaven National Laboratory)
-
12:30
→
12:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Good running for most sites over the past couple of weeks.
- MWT2 had a small decrease in production due to work at the Illinois site to relocate servers within the machine room.
- NET2 had a 1 day downtime for an OKD update and is currently down for 3 days for the annual MGHPCC maintenance period.
- Working on the tier 2 scrubbing slides: https://docs.google.com/presentation/d/1EC9GNy8I93zoIYI23uBIgaZzDB3WRHSJoIXjZOdiL9g
- Each site should check slides 4-7 to make sure that I interpreted your procurement correctly.
- Each site should also check that the FTEs shown on slide 2 are correct.
- I entered all the milestone updates but on the wrong sheet.
- I will move the changes to the correct temporary sheet.
- There is no news on the funding.
-
12:40
→
12:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
12:40
HPC Operations 5mSpeaker: Rui Wang (Argonne National Laboratory (US))
- Perlmutter: scheduled maintenance today
- preparing the NERSC_Perlmutter_Test queue for Nurcan's CREST test
- TACC: Harvester can run on the Stampede3 login node when the thread pool is limited to 15, but we still need to check whether this is enough once jobs are flowing
- Contacted the SCDF storage group about giving Globus write permission on BNL production dCache for HPC jobs' data transfer
- Qiulan has passed the request to the group for discussion. Waiting for updates
-
12:45
Integration of Complex Workflows on Heterogeneous Resources 5mSpeaker: Doug Benjamin (Brookhaven National Laboratory (US))
-
12:50
→
13:10
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
12:50
Analysis Facilities - BNL 5mSpeaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Jupyter deployment updates
- Working on puppetizing the JupyterHub configuration
- Testing of the new federated JupyterHub instance is still ongoing
- Storage clean up updates
- Since the notification was sent 2 weeks ago, we have not received any requests from inactive users to access their data.
- The Glance status for the 24 users whose notification emails bounced back has been sent to Viviana, and we are awaiting further guidance on how to proceed.
-
12:55
Analysis Facilities - SLAC 5mSpeaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:00
Analysis Facilities - Chicago 5mSpeaker: Fengping Hu (University of Chicago (US))
Ceph Data Durability Configuration Change (Planning)
Transitioning from 3x replication to 4+2 erasure coding (EC):
- ~2× improvement in usable storage capacity
- Comparable large-write performance
- Increased CPU overhead during writes
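The capacity figure above follows from simple arithmetic, sketched here under idealized assumptions: it ignores metadata, full-ratio headroom, and any other pool overhead, and the function names are hypothetical helpers rather than part of any Ceph tooling.

```python
from fractions import Fraction


def replication_efficiency(copies: int) -> Fraction:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return Fraction(1, copies)


def erasure_coding_efficiency(k: int, m: int) -> Fraction:
    """Usable fraction of raw capacity with k data chunks plus m coding chunks."""
    return Fraction(k, k + m)


rep3 = replication_efficiency(3)        # 3x replication: 1/3 of raw is usable
ec42 = erasure_coding_efficiency(4, 2)  # 4+2 EC: 4/6 = 2/3 of raw is usable
gain = ec42 / rep3                      # ratio of usable capacity, EC vs replication

print(f"3x replication usable fraction: {rep3}")
print(f"4+2 erasure coding usable fraction: {ec42}")
print(f"Usable capacity gain: {gain}x")
```

In this idealized model both layouts survive the loss of any two copies or chunks, so the gain in usable space is paid for mainly in the extra CPU work noted above.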
Management Nodes in production
3 management nodes now in production. System components are being migrated (via weighted affinity) to isolate them from user workloads (HTCondor, Jupyter notebook servers, etc.), improving cluster stability through fault isolation and more effective resource management.
-
13:10
→
13:30
WBS 2.3.5 Continuous Operations
Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
- Top
- Still working on 2.3.5 scrubbing slides
- LHC
- LHC Run 3 physics data taking ended on Sunday, June 14th.
- Machine development tests (HL-LHC High Intensity) continuing until the end of June.
- Celebratory BBQ scheduled for tomorrow
- ADC Ops
- S&C week (Indico:1681871): 6/29/2026 - 7/3/2026
- Let Ivan know if you want to present something on one of the ADC sessions (Indico:1681882)
- Mario presented ADC on ATLAS Weekly (Link)
- Working on OTP
- TDR ADC section was finished.
- USATLAS
- BNL
- recuperated 486TB temporarily added to DATADISK a few days ago.
- Tape recalls started last week for data25 VBF delayed stream reprocessing
- TACC - DDM is fine with the Globus-BNL-Rucio data transfer model proposed by Doug. Doug is working on a Globus plugin for Rucio
- NET2 - tape was not being filled in recent months. Filling restarted last week (Link)
- Due to overload of the US FTS (caused by opendata transfers), Rucio-FTS communication was timing out from 24th to 27th May. This triggered multiple transfers for the same file, leaving files on storage that are not known to Rucio. All US sites produced storage dumps so that these files can be removed. The procedure is ongoing on the DDM side
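The cleanup idea behind the storage dumps can be illustrated with a set difference. This is a minimal sketch only: the actual DDM procedure is not described in these notes, and the function name and file paths below are hypothetical.

```python
def find_unknown_files(storage_dump, catalog_replicas):
    """Return paths present on storage but absent from the catalog listing.

    Both arguments are iterables of path strings (hypothetical format).
    A real check would also need to skip files that are still being
    transferred, for example by ignoring very recent entries.
    """
    return sorted(set(storage_dump) - set(catalog_replicas))


# Hypothetical example data: one extra copy left behind by a repeated transfer.
storage_dump = [
    "/atlas/data25/file_a.root",
    "/atlas/data25/file_b.root",
    "/atlas/data25/file_b.root.dup1",
]
catalog_replicas = [
    "/atlas/data25/file_a.root",
    "/atlas/data25/file_b.root",
]

unknown = find_unknown_files(storage_dump, catalog_replicas)
for path in unknown:
    print(path)
```

Only paths that appear in the dump and not in the catalog listing are reported, which is why each site has to supply a full dump of its storage.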
-
13:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5mSpeaker: Kaushik De (University of Texas at Arlington (US))
-
13:15
Services DevOps 5mSpeaker: Ilija Vukotic (University of Chicago (US))
-
13:20
Facility R&D 5mSpeaker: Robert William Gardner Jr (University of Chicago (US))
-
13:25
Cybersecurity plan(s) 5mSpeakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
-
13:30
→
13:40
AOB 10m