US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09


    • 12:00 12:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Main focus is on pre-scrubbing for Friday. We still need completed presentations from 2.3.1, 2.3.4 and 2.3.5 ASAP.

      ATLAS OTP is due again. (ATLAS ADC DDM OTP request just sent; look for others soon.)

       

    • 12:05 12:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (impending)

       

      Miscellaneous

      • Kuantifier
        • OKD unpriv Prometheus notes passed along to NET2
        • Kuantifier and docs ready
      • EL10
        • What're the US ATLAS plans for upgrading?
        • Does anyone use the HTCondor keyboard daemon?
        • Who uses nftables?

       

       

    • 12:10 12:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))

      Upcoming BNL / HTCondor / USATLAS meeting: 25 JUN @ 15:00 eastern / 14:00 central time:

      This is the USATLAS version of the meeting. Please forward along to USATLAS folks as you see fit

      Zoom link: https://bnl.zoomgov.com/j/1614360980?pwd=AQm6x3reOaNtEze9H7GjadACjvWaBE.1

      Google doc link: https://docs.google.com/document/d/1zTl-HIB07SEWgwB8hLH5O9deX4O-z_ezq1YQI5__VKI/edit?tab=t.0

      • 12:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith

        NTR

      • 12:15
        Compute Farm 5m
        Speaker: Thomas Smith
        • Operations this week have been very smooth, no problems, no interruptions

        • Work has been done this week on physical retirement of old ATLAS T1 hardware

          • ATLAS T1 has now fully vacated our old datacenter

      • 12:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        Smooth operation of disk and tape systems.

               

        • Per agreement with ADC and HPSS storage experts, the FIFO batch queue management threshold has been reduced from 3 to 2 days. Staging mode now switches to FIFO when requests older than 2 days are detected.

        • dCache Doors and Pools updated to version 9.2.35.

          • Kafka-based monitoring data collectors enabled 

        • Scheduled dCache maintenance: 06/24/2025, from 13:00 to 19:00 CEST.

          • Database hardware and release update to PostgreSQL 16

          • Minor dCache releases will be normalized to 9.2.35 on a few nodes

        • Scheduled HPSS gateway maintenance: 06/24/2025, from 13:00 to 19:00 CEST.

        • ATLASDATADISK temp space rolled back to nominal pledge values

          • Extra space added (~190TB) to minimize DDM blacklisting 
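
        The FIFO threshold change in the first bullet above can be sketched as follows. This is only an illustration of the stated rule (switch to FIFO once a request older than 2 days is detected); the function name, the "default" label for the non-FIFO mode, and the input format are assumptions and do not reflect the actual HPSS batch queue code.

```python
from datetime import datetime, timedelta

# Threshold from the notes: reduced from 3 days to 2 days.
FIFO_THRESHOLD = timedelta(days=2)


def choose_staging_mode(submit_times, now, threshold=FIFO_THRESHOLD):
    """Return "fifo" if any pending request is older than `threshold`.

    `submit_times` is an iterable of datetime objects, one per pending
    staging request (hypothetical input format). "default" stands in for
    whatever non-FIFO ordering is normally used.
    """
    oldest_age = max((now - t for t in submit_times), default=timedelta(0))
    return "fifo" if oldest_age > threshold else "default"


# Example: a request that has waited 2.5 days triggers FIFO under the new
# 2-day threshold, but would not have under the previous 3-day threshold.
now = datetime(2025, 6, 18, 12, 0)
pending = [now - timedelta(hours=3), now - timedelta(days=2, hours=12)]
mode_new = choose_staging_mode(pending, now)
mode_old = choose_staging_mode(pending, now, threshold=timedelta(days=3))
```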

      • 12:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        NTR

    • 12:30 12:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • It has been a tough period recently, as several sites have had problems that affected production.
        • All sites were affected by a Rucio issue on Jun 6.
        • MWT2 was taken offline because of an unintended side effect of an unrelated change made by IU network engineers.
        • MWT2 had its DATADISK fill up because there were many small files and the deletion process could not keep up.
        • NET2 was down for a couple of days for the MGHPCC annual maintenance.
        • NET2 also had some production loss in the last week because of network problems.
        • OU was affected by the Great Plains Network replacing switches and forgetting to turn on jumbo frames.
        • In the last day, OU had its storage filled to capacity by transfers that apparently did not honor the size limit of the endpoint.
        • CPB had trouble with their gatekeepers.
      • There was progress on the EL9 update/FY24 installation.
        • MSU updated their Satellite/Capsule software but is still having trouble.
        • CPB is still working on their storage.
      • Rafael will give the Tier 2 scrubbing presentation (WBS 2.3.2); the slides are complete.
        • A tip of the hat to Rafael for jumping on this.
    • 12:40 12:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 12:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Slides preparation for pre-scrubbing

        Perlmutter: Good CPU and GPU usage

        NERSC-10 allocation proposal: collaborate with HEP-CCE

        • Charles is collecting the workflow list. Related to HPC ops: MC simulation, GNN4ITK (GPU recon), FastChain (Sim+recon).

         

      • 12:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 12:50 13:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 12:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Investigating a fix for the nobody:nobody ownership issue for dCache data within a non-root pod
          • Built a new image with nfs-utils installed so that idmapd can run, and uploaded it to the SDCC Quay service.
          • Looking into a sidecar container to get idmapd running in a non-root pod
            • The sidecar would run idmapd with root privileges while the main application container runs with fewer privileges, to satisfy OpenShift’s restricted SCC rules.
      • 12:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Ceph File System Capacity

        • Ceph file systems have exceeded 80% usage of the 1PB capacity, which is beginning to impact performance and user quota allocation.
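
        A minimal sketch of how the 80% mark mentioned above could be checked on a mounted file system. The function names and the idea of polling a mount point are assumptions for illustration; the numbers reported by `shutil.disk_usage` on a CephFS mount may reflect quotas rather than raw cluster capacity.

```python
import shutil

# Usage level from the notes above which problems were observed.
USAGE_THRESHOLD = 0.80


def usage_fraction(used_bytes, total_bytes):
    """Return the used fraction of a file system as a float in [0, 1]."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def over_threshold(path, threshold=USAGE_THRESHOLD):
    """Return True if the file system holding `path` is above `threshold`."""
    usage = shutil.disk_usage(path)
    return usage_fraction(usage.used, usage.total) > threshold
```

        For example, `over_threshold("/path/to/cephfs/mount")` (a placeholder path) could be called from a periodic check to warn before quota allocation is affected.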

        Dask-Gateway with HTCondor Integration

        • Ongoing development to enable Dask-Gateway scheduling on HTCondor.

        • Exploring two backend integration approaches:

          • Extending the existing Kubernetes backend to support Condor workers.

          • Adapting the JobQueue backend model to interface with HTCondor.

    • 13:10 13:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 13:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • DDM is reallocating space at some sites from SCRATCHDISK to DATADISK
          • BNL-OSG2_SCRATCHDISK: 935TB -> 650TB
          • SWT2_CPB_SCRATCHDISK: 500TB -> 200TB
          • MWT2_UC_SCRATCHDISK: 550TB -> 350TB
        • Issues with DATADISK space at several US sites for different reasons (OU drained because of this)
        • BNL tape storage admins met with ADC to discuss staging issues during recent data carousel campaign
          • Decided to change retrieval algorithm so that it switches strategy from "High to Low" to FIFO after 2 days instead of 3
        • Rod is working with Doug on OverlayBS deployment, and is also adding GPU scheduling features to WFMS (see slides)
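
        The SCRATCHDISK reallocations listed above can be totaled with a short calculation. The values are copied from the list; the variable and function names are illustrative only.

```python
# (before_TB, after_TB) per SCRATCHDISK endpoint, copied from the list above.
SCRATCHDISK_TB = {
    "BNL-OSG2_SCRATCHDISK": (935, 650),
    "SWT2_CPB_SCRATCHDISK": (500, 200),
    "MWT2_UC_SCRATCHDISK": (550, 350),
}


def freed_tb(allocations):
    """Return the space (in TB) removed from each SCRATCHDISK endpoint."""
    return {name: before - after for name, (before, after) in allocations.items()}


freed = freed_tb(SCRATCHDISK_TB)
total_freed = sum(freed.values())

for name, tb in freed.items():
    print(f"{name}: {tb} TB")
print(f"Total: {total_freed} TB")  # 285 + 300 + 200 = 785 TB
```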
      • 13:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCache
          • new testing image created
          • issues with caches in UK
          • VP queues at MWT2 and AGLT2 disabled for now.
        • VP
          • NTR
        • Varnish
          • Got a server at IN2P3-CC. This morning all of the FR cloud (except far-flung sites) moved to the Cloudflare DNS load balancer.
          • Tracking CRIC overwrites at CERN-PROD, lxplus, LRZ, Beijing...
          • Next on DE cloud.
          • Concerning US:
            • BNL - varnish server works but lacks monitoring so not in use.
            • NET2 - using NRP varnish. Just got email from Derek asking if we can deploy on NET2 instead of NRP.
            • SWT2 - need to get varnish installed.
        • AI
          • have 3 summer students to work on an agentic assistant.
          • Will test ADK, LangGraph, and OpenAI approaches.
        • ServiceX/Y
          • NTR
      • 13:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Have been working on scrubbing slides
        • Armada
          • Auth issues fixed
          • Working on connecting to a second cluster (RP1 --> UChicago AF) with minimal privileges 
        • Coffea Casa
          • User login should be working (https://coffea-casa.hl-lhc.io/ login with ATLAS IAM), user gets persistent /home and /scratch from CephFS
          • Working on adding HTCondor / Dask support
        • EOS
          • Work continues on authentication. How to inject users into the MGM pod in production is not yet clear and may require feature contributions to the Helm chart.
        • HTCondor overlay container
          • Have a container that runs unprivileged and connects back to UChicago AF HTCondor pool
          • Will be working with Doug to test this at NERSC
    • 13:25 13:35
      AOB 10m