US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Moving 1 hour earlier to avoid USATLAS IB Meeting

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 878527

Invite link:  https://umich.zoom.us/j/99329677148

 

 

    • 12:00 12:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      The main topic for today is scrubbing preparation!

      • We have partial drafts ready for most WBS 2.3 L3 areas, but most have significant gaps that need covering
      • We have an updated and correct effort spreadsheet (verified by Verena) that we need to match for FY26 and potential updates/changes for FY27
        • If anyone needs to change FTE levels for FY27, please mark changes in bold and add a comment for the change.
      • Most L3 still need to address risks and FY27 plans.
        • Copy of the risk registry spreadsheet is in the 2026 scrubbing folder
      • Milestone updates and additions should be modified on our WBS 2.3 copy:  https://docs.google.com/spreadsheets/d/11FMFmiCHdkkyWvMmNIHiyMc0bBUQI6sS4AzE0FR5TM8/edit?usp=sharing
        • If you modify a field, change it to BOLD
        • If you add a new milestone, all content in the added row should be bold
      • There are AI summaries from last week's face to face in the 2026 scrubbing folder that may be helpful
      • ACTION ITEM:  L3s should schedule a review slot with Shawn and Alexei ASAP so we can discuss content.

       

      Note: We need to provide Paolo and Verena Tier-2 availability for the last 3 months.

      Upcoming meetings:

      • ATLAS S&C at CERN June 28-July 3
      • USATLAS Scrubbing (WBS 2.3) at Stony Brook on Tuesday July 14

       

      ZOOM Notes for Meeting

      Quick recap

      The meeting focused on scrubbing preparation for WBS 2.3, with discussions about missing forward-looking plans and priority items in various L3 areas. Shawn emphasized the need for consistency across the different WBS areas (2.3, 2.4, 2.5) to avoid discrepancies in presentations, particularly regarding FY26 and FY27 plans. The team discussed updates to effort spreadsheets, milestone tracking, and the transition from WBS 2.3 to 2.5 for FY27 activities. Frederick reported on Tier 2 operations, including recent downtime at the Midwest and NET2 sites, while Rui provided updates on HPCs and parameter testing. The conversation ended with a discussion about potential changes to data transfer policies at NERSC, where Douglas raised concerns that the current workarounds may not be permitted long-term, leading to a decision to develop a backup plan for future HPC operations.

      Next steps

      Brian

      • Register Alexei as a US ATLAS VO contact, per the instructions provided.

      Douglas

      • Develop a backup plan for data transfer at NERSC and other HPCs in case direct data movement is no longer allowed, and evaluate the required work and potential impact.

      Frederick

      • Transfer all milestone changes made in the central copy of the milestone spreadsheet to the correct (facility) copy, and remove bold formatting from the original as appropriate.
      • Provide Tier 2 availability and reliability numbers (for March, April, May) for each Tier 2 to Shawn, including both aggregate and individual site data.
      • Ping John Hobbs for a definitive answer on Midwest Tier 2 funding status in time for the next management meeting.

      Gamboa

      • Check with the dCache developers/organizers about the possibility of opening the dCache developers meeting to Tier 2 dCache admins and report back to Ofer.

      Qiulan

      • After 30 days from notification, archive then delete inactive NFT user data according to policy, and await further guidance from Viviana regarding the 24 users who have not responded.

      Rui

      • Add machine names to the CPU time information in the scrubbing slides for ALCF.

      Collaboration

      • L3 leads: Schedule a meeting with Alexei and Shawn to review current slides, discuss what's missing/needs changes, and ensure content is ready for final review with Paolo and Verena.
      • All L3s: Update the FY27 effort spreadsheet with any needed changes, marking updates in bold with comments for traceability.
      • All L3s: Address risks and FY27 plans in their slide decks and update milestone information in our WBS 2.3 copy, marking new/changed fields in bold.
      • Kaushik and Zach: Meet with L3s (especially 2.3, 2.2, 2.5, and 2.4) to discuss and clarify the transition of milestones and research/operations thrusts between WBS areas, ensuring consistency across slides.
      • Aidan (and Frederick): Begin testing CVMFS 2.14.0 pre-release on CIT nodes and confirm to Brian when testing starts.
      • Kaushik, Rafael: Review the Tier 2 FTE sheet in the scrubbing slides and confirm accuracy, especially for Midwest Tier 2 and any other areas of concern.
      • All PIs (Tier 2 leads): Check and validate that their effort numbers in presentations match the official effort spreadsheet, and update as necessary.

      Summary

      Scrubbing Meeting Preparation Updates

      The team discussed preparation for the upcoming scrubbing meeting. Most L3 areas have partial or complete drafts, but several significant gaps still need coverage. Shawn noted that the effort spreadsheet for FY26 has been verified by Verena, and requested that any needed changes for FY27 be marked in bold with comments explaining the changes. The team also needs to address risks and FY27 plans, with Joe Boudreau's risk registry spreadsheet provided as reference material, and milestone updates should be made in our WBS 2.3 copy.

      WBS Version Consistency Planning

      The team discussed ensuring consistency across different WBS versions (2.3, 2.5, 2.2) to avoid discrepancies in presentations and documentation. Kaushik suggested that items for FY27 should initially be placed in WBS 2.3 and later moved to 2.5, with team members notified via email about items that should move to 2.5. Douglas emphasized the need to maintain consistent milestones and research thrusts across all versions to ensure coherence, particularly regarding HPC-related items and facility requirements. Kaushik acknowledged that while the transition would be challenging, they would need to have discussions about all versions including 2.2 and 2.5, as the level 3 structure for 2.5 is not yet established.

      Milestone Template and Data Updates

      Shawn and Frederick discussed issues with the milestone template copy, where Frederick had made changes to the wrong version. Frederick agreed to transfer his changes to the correct copy and remove bold formatting. They also discussed the need to gather availability numbers for Tier 2 systems over the last three months for future funding agency reports, with Frederick offering to provide both aggregate and individual availability and reliability data.

      Project Updates and Software Releases

      Shawn noted upcoming meetings including ATLAS Software and Computing in late June and WBS 2.3 scrubbing on July 14th. Frederick clarified that the three-month period would cover March, April, and May due to delayed June data. Brian announced the release of IGTF 1.143, which adds the new InCommon IGTF v4 CA, and mentioned ongoing work with the XRootD team on a better patch set. Frederick and Aidan agreed to install CVMFS 2.14.0 on CIT nodes after Brian's reminder.

      US ATLAS Development Updates Meeting

      Frederick confirmed he would notify the team when starting a test and updated them on US ATLAS VO contacts, noting that Alexei needs to register as a contact. Brian reported that API key development for topology is still targeted for the end of the month despite Matyas being on vacation next week. Gamboa discussed an upcoming stakeholders and users technical meeting about storage technologies and mentioned a new minor release of dCache (11.2.05) that improves NFS components, with plans to upgrade the integration instance soon. Ofer inquired about opening the developers meeting to Tier 2 dCache admins, and Gamboa agreed to check with the meeting organizers and provide feedback next week.

      Tier Operations Update Meeting

      Frederick provided updates on Tier 1 and Tier 2 operations, noting that Midwest Tier 2 experienced a production decrease due to server relocation and power consumption issues, while NET2 took downtime for maintenance. Frederick is working on the Tier 2 scrubbing slides and asked Kaushik and Rafael to review the FTE sheet for accuracy. He also mentioned plans to follow up with John Hobbs about funding ahead of tomorrow's management meeting.

      Effort Spreadsheet and Data Updates

      The team discussed effort spreadsheet validation, with Frederick agreeing to review FY26 and FY27 numbers and asking PIs to verify that their numbers match the official spreadsheet. Rui reported on HPC parameter testing and data transfer issues, while Douglas explained NERSC's policy on Globus data transfers and potential future changes that could affect their current workaround methods. The team also reviewed updates from WBS 2.3.4, including storage configuration changes and management node implementation, and Ivan provided comprehensive updates on various project activities including LHC Run 3 completion and data recovery efforts.



       

    • 12:05 12:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release
        • IGTF 1.143: adds new InCommon IGTF v4 CA
        • We're still working with the XRootD team on a better patch set
        • US ATLAS CVMFS 2.14.0 testing status?
      • Topology
    • 12:10 12:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 12:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 12:15
        Compute Farm 5m
        Speaker: Thomas Eric Smith (Brookhaven National Laboratory (US))
      • 12:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        Note:

        The next SCDF Stakeholders and Users Technical meeting has been
        scheduled for Thursday, June 18th @ 11 am ET (10 am CT). Dmitry
        Litvintsev will speak about the evaluation of storage technologies
        at FNAL.

        https://indico.bnl.gov/event/33234/

        Alternate Zoom link

        https://bnl.zoomgov.com/j/1608396009?pwd=ITv56L0MNU91N5P7lrhwpQV1PaDwro.1

         

        Operations

        No major issues to report

         

         

         

         

         

         

      • 12:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
    • 12:30 12:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running for most sites over the past couple of weeks.
        • MWT2 had a small decrease in production due to work at the Illinois site to relocate servers within the machine room.
        • NET2 had a 1 day downtime for an OKD update and is currently down for 3 days for the annual MGHPCC maintenance period.
      • Working on the tier 2 scrubbing slides: https://docs.google.com/presentation/d/1EC9GNy8I93zoIYI23uBIgaZzDB3WRHSJoIXjZOdiL9g
        • Each site should check slides 4-7 to make sure that I interpreted your procurement correctly.
        • Each site should also check that the FTEs shown on slide 2 are correct.
      • I entered all the milestone updates but on the wrong sheet.
        • I will move the changes to the correct temporary sheet.
      • There is no news on the funding.
    • 12:40 12:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 12:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: scheduled maintenance today
          • preparing the NERSC_Perlmutter_Test queue for Nurcan's CREST test
        • TACC: Harvester can run on the Stampede3 login node when the thread pool is limited to 15; we still need to check whether this is enough once jobs are flowing
        • Contacted the SCDF storage group about giving Globus write permission on BNL production dCache for HPC jobs' data transfer
          • Qiulan has passed the request to the group for discussion. Waiting for updates
      • 12:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 12:50 13:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 12:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Jupyter deployment updates

          • Working on puppetizing the JupyterHub configuration

          • Testing of the new federated JupyterHub instance is still ongoing

        • Storage clean up updates
          • Since the notification was sent 2 weeks ago, we have not received any requests from inactive users to access their data.
            The Glance status for the 24 users whose notification emails bounced back has been sent to Viviana, and we are awaiting further guidance on how to proceed.
      • 12:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Ceph Data Durability Configuration Change (Planning)

        Transitioning from 3x replication to 4+2 erasure coding (EC):

        • ~2× improvement in usable storage capacity
        • Comparable large-write performance
        • Increased CPU overhead during writes
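The "~2×" capacity figure follows from the usable fraction of raw storage under each scheme. A minimal sketch of that arithmetic, assuming ideal layouts and ignoring Ceph metadata, allocation overhead, and any reserved headroom (the function names are illustrative, not Ceph APIs):

```python
from fractions import Fraction


def replication_efficiency(copies: int) -> Fraction:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return Fraction(1, copies)


def erasure_coding_efficiency(k: int, m: int) -> Fraction:
    """Usable fraction of raw capacity for k data chunks plus m coding chunks."""
    return Fraction(k, k + m)


rep3 = replication_efficiency(3)        # 3x replication
ec42 = erasure_coding_efficiency(4, 2)  # 4+2 erasure coding
gain = ec42 / rep3                      # ratio of usable capacity on the same raw storage

print(f"3x replication usable fraction: {rep3}")  # 1/3
print(f"4+2 EC usable fraction: {ec42}")          # 2/3
print(f"Usable capacity gain: {gain}x")           # 2x
```

Under these assumptions both layouts survive the loss of any two copies or chunks of an object, so the capacity gain is paid for mainly in the extra CPU work during writes noted above.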

         

        Management Nodes in production

        3 management nodes now in production. System components are being migrated (via weighted affinity) to isolate them from user workloads (HTCondor, Jupyter notebook servers, etc.), improving cluster stability through fault isolation and more effective resource management.

    • 13:10 13:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • Top
        • Still working on 2.3.5 scrubbing slides
      • LHC
        • LHC Run 3 physics data taking ended on Sunday, June 14th.
          • Machine development tests (HL-LHC High Intensity) continuing until the end of June.
          • Celebratory BBQ scheduled for tomorrow
      • ADC Ops
        • S&C week (Indico:1681871): 6/29/2026 - 7/3/2026
          • Let Ivan know if you want to present something on one of the ADC sessions (Indico:1681882)
        • Mario presented ADC on ATLAS Weekly (Link)
        • Working on OTP
        • The TDR ADC section was finished.
      • USATLAS
        • BNL
          • Recuperated the 486 TB temporarily added to DATADISK a few days ago.
          • Tape recalls started last week for data25 VBF delayed stream reprocessing
        • TACC - DDM is fine with the Globus-BNL-Rucio data transfer model proposed by Doug. Doug is working on a Globus plugin for Rucio
        • NET2 - tape had not been filling in recent months. Writing restarted last week (Link)
        • An overload of the US FTS (caused by open data transfers) made Rucio-FTS communication time out from 24th to 27th May. The timeouts triggered multiple transfers of the same file, leaving files at the sites that are not known to Rucio. All US sites have produced storage dumps so that these files can be removed. The procedure is ongoing on the DDM side
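The storage-dump cleanup amounts to comparing what is physically on a site's storage with what the catalog knows about. A minimal sketch of that comparison, assuming both sides can be reduced to plain lists of file paths; the function name and the example paths are hypothetical and do not represent the actual DDM tooling:

```python
def find_unknown_files(storage_dump, catalog_replicas):
    """Return paths present on storage but absent from the catalog, sorted."""
    return sorted(set(storage_dump) - set(catalog_replicas))


# Hypothetical example: a duplicated transfer left an extra copy on disk.
storage_dump = [
    "/atlas/datadisk/a.root",
    "/atlas/datadisk/b.root",
    "/atlas/datadisk/b.root.dup1",
]
catalog_replicas = [
    "/atlas/datadisk/a.root",
    "/atlas/datadisk/b.root",
]

unknown = find_unknown_files(storage_dump, catalog_replicas)
print(unknown)
```

A production check would also need to guard against files that were still being written or registered while the dump was taken, which this sketch does not attempt.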
      • 13:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
      • 13:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 13:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 13:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 13:30 13:40
      AOB 10m