US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      The Blueprint Workshop "Towards a National-Scale AI Collaboration in HEP" was held Monday and Tuesday this week: https://indico.flatironinstitute.org/event/4120/timetable/

        - Closeout slides summarize the workshop.

      Upcoming events:  CHEP 2026 next week, Facility F2F in Madison, ATLAS S&C, Scrubbing

      Tier-2s should be working on a succinct procurement plan

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release
        • HTCondor 25.11 is in the hopper
        • We've been told to avoid XRootD 6 and 5.9.3
      • CRIC contact updates
        • We support mailing lists
        • We need to add support for API key access to Topology before CRIC can get auto-updated
      • OSG CE central collector used to provide contact information but it doesn't do that anymore.
        • It also used to advertise site queue information if the site CE administrator configured their CE to do so. Should we continue doing that?
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Eric Smith (Brookhaven National Laboratory (US))

        Operations for last week have been generally smooth, with the exception of the event on 19 May (details to follow from ops/monitoring)

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
        • No major operational issues to report this week.
          • Last week, a patching campaign was carried out to address OS updates, including required hardware reboots
        • The integration instance is now enabled with the HPSS testbed to validate tape workflows with dCache 11.2.(3→x).
          • The tape area has been populated with approximately 40 TB of written data. Preliminary staging tests involving more than 2,000 files completed successfully.
          • kpatch-based security package management has been deployed on the integration instance, with selected components designated as “canary” systems for validation and monitoring.
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Anomalous activity caused by sPHENIX on one of the SCDF NetApp appliances severely degraded CVMFS Stratum-1 performance starting at midnight ET on Tuesday. The degradation lasted ~10 hours before the issue was identified and mitigations were put in place. During that time the BNL and BNL_OPP queues were taken offline by HC for a few hours. Tier-2 sites were also impacted.
        • We are deploying newly received hardware for the Stratum-1, which will eliminate this kind of shared-storage dependency going forward.
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running over the past two weeks
        • MWT2 downtime on Monday 11-May to update to dCache 11.2.4
        • AGLT2 also updated to dCache 11.2.4 and did general cleaning like firmware updates.
      • Firefly is now running at AGLT2 and MWT2. See the Firefly monitoring at: https://dashboard.stardust.es.net/goto/fflyj1vei3gg0a?orgId=2
      • SWT2_CPB is migrating (or may already have migrated) the very last of its servers to AlmaLinux 9.
      • All sites have mitigated the copy fail and related CVEs.
      • Held a meeting to discuss procurement last Friday at 11 am EDT.
        • Some notes in a presentation I wrote to guide the meeting:
          https://docs.google.com/presentation/d/1E6bkrvOblZPwTM0mjwqVxYNQfztt-V2KSEcKszrtt8U/
        • Discussed how to write the procurement plans, which are due in just over a week:
          • When writing the plan, estimate CPU at 10.00/HS (was 4.50/HS) and disk at 100.00/usable TB (was 45/usable TB)
          • Our priorities are:
            • First: Infrastructure and other items affecting large numbers of servers: networking, power (UPSs & PDUs), head nodes / gatekeeper nodes.
            • Second: Storage: Meet the 2027 pledges and if possible buy enough to meet an estimate of the 2028 pledge too.
            • Third: CPU: Lower priority and easier to bring into service at the last minute before HL-LHC starts.
          • When writing the procurement plan, be sure to account for forced retirements of network switches and storage servers.
      • The second round of the NET2 <-> PRG mini-challenge took place last week. ESnet load balancing for the NET2 transatlantic links seems to be fixed. Tests peaked at almost 380 Gb/s. More details will follow after data analysis.
      • There is no news about the equipment funding.
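
      As a rough illustration of the planning figures above, the following sketch compares a purchase priced at the new unit estimates (10.00/HS, 100.00/usable TB) with the same purchase at the previous ones (4.50/HS, 45/usable TB). The purchase size (10,000 HS of CPU and 1,000 usable TB of disk) and all names in the code are hypothetical, chosen only for the example; the notes do not state a currency, so the totals are in the same unspecified unit as the quoted prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrices:
    """Planning prices in the (unstated) currency used in the notes."""
    cpu_per_hs: float          # cost per HS of CPU capacity
    disk_per_usable_tb: float  # cost per usable TB of disk


# Unit prices quoted in the meeting notes.
NEW_PRICES = UnitPrices(cpu_per_hs=10.00, disk_per_usable_tb=100.00)
OLD_PRICES = UnitPrices(cpu_per_hs=4.50, disk_per_usable_tb=45.00)


def estimate_cost(cpu_hs: float, disk_tb: float, prices: UnitPrices) -> float:
    """Return the estimated cost of buying cpu_hs of CPU and disk_tb of disk."""
    return cpu_hs * prices.cpu_per_hs + disk_tb * prices.disk_per_usable_tb


# Hypothetical purchase, not taken from any site's actual plan.
EXAMPLE_CPU_HS = 10_000
EXAMPLE_DISK_TB = 1_000

new_total = estimate_cost(EXAMPLE_CPU_HS, EXAMPLE_DISK_TB, NEW_PRICES)
old_total = estimate_cost(EXAMPLE_CPU_HS, EXAMPLE_DISK_TB, OLD_PRICES)

print(f"estimate at new prices: {new_total:,.2f}")
print(f"estimate at old prices: {old_total:,.2f}")
print(f"ratio new/old: {new_total / old_total:.2f}")
```

      Both unit prices rose by the same factor of roughly 2.2, so any mix of CPU and disk costs about 2.2 times more under the new estimates than under the old ones.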
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: Maintenance today. Production jobs have been running smoothly since the 18th.

        • (Doug): Keep watching the empty Slurm jobs. The rate is much lower after tuning the allowed failures.

        TACC: installing and updating local alrb, harvester and pilot on Stampede3

         

         

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Updates on the Jupyter development and deployment
          • Testing on the new Jupyter instance is still ongoing
          • Started to work on puppetizing the deployment
        • Updates on user storage cleanup
          • The dCache patch has been applied
          • Will enable the ban file and send email notifications to inactive users
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        AF Cluster Updates

        • Addressed additional CVEs, including DirtyFrag and ssh-keysign-pwn vulnerabilities
        • Added three head nodes to provide dedicated capacity for infrastructure services, helping separate system workloads from user batch workloads
        • Drafting migration plans to transition the cluster to Kubespray-based management and a highly available (HA) control plane
    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • Coordinating an audit/update of site security contacts in CRIC 
        • Not completely clear where all of those fields need to be changed
        • CRIC should be able to sync from OSG Topology using the OSG API
          • Contact name vs. email alias?
      • Admins have been kept busy responding to a stream of Linux kernel CVEs
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))

        RP1

        • Provisioned the initial RP1 production cluster on IU hardware and migrated all services from rp1-dev
        • Full service stack is now deployed on production, validating the GitOps deployment pattern and cluster overlay approach.
        • Deployed the public documentation site at docs.rp1.hl-lhc.io, built with Zensical and served via nginx + git-sync.

        ODF on RP1

        • The ODF cluster was deployed on University of Chicago hardware and is currently offering services similar to RP1.
        • Several upstream identity providers, including CILogon and Globus, are being considered to open access to a wider audience.
        • Successfully replicated a 10 TB test dataset and tutorial datasets to the MWT2_OPENDATA RSE.

        LLM Assisted Infrastructure Management and Analysis Workflows

        • Agentic infrastructure management and agentic analysis workflows are beginning to move into practical implementation.
        • Work is underway to assess open source solutions for managing agentic systems across infrastructure and analysis environments.
        • Active discussions are focused on how to implement these systems safely, reliably, and with appropriate operational control.
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m