US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Quarterly reporting is due at the end of the week (Oct 24).

      • Missing 2.3.2, 2.3.2.3, 2.3.4.1 (BNL) and all 2.3.5 sections
      • Please get them in ASAP

       

      Work on next CA continues

      • For WBS 2.3, we need to update the Basis of Estimate (BOE) for the next CA submission
      • Updates are needed for each Tier-2, for WBS 2.3.4 (NSF-funded parts), and for WBS 2.3.5 (NSF-funded parts)
      • Look for emails from Shawn shortly, requesting updated text and confirmation of NSF-funded effort

       

      Please check/verify milestones https://docs.google.com/spreadsheets/d/1FkVDqLh_5PaHQDP-bfefBJ-PloIfD7LLw3sbP_vTgB0/edit?gid=1361093330#gid=1361093330

      Discussion today on live compiling GPU code on every job

      Below is our updated WBS 2.3 Organigram

      WBS 2.3 Organigram

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • OSG 25 released yesterday! 
        • https://osg-htc.org/docs/release/osg-25/
        • Adds EL10 support, updated versions of HTCondor + HTCondor-CE
        • Note that there are many packages missing from EPEL 10.0
        • Container images are on the way -- we're basing them on EL9
      • XRootD 5.9.0 is available in testing repos
      • gfal2: the plan is to build it for EL10
      • Do any US ATLAS sites support JLab / EIC / CLAS12?

       

       

    • 13:05 13:45
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:05
        ADC survey: GPU code installation 10m
        Speakers: Costin Caramarcu, John Steven De Stefano Jr (Brookhaven National Laboratory (US))

        ADC is surveying sites about GPU code installation (e.g., nvcc) on the WNs/GPU nodes used for GPU queues – they want to live-compile GPU code with every job?
        Questions were raised about the efficiency of live compilation and the waste of limited GPU resources.
        JD: Jobs are still trying to access the CVMFS Projects repo outside of CERN.
        This will be discussed at today's WBS 2.3 meeting.
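
        A minimal sketch (purely illustrative, not from the meeting) of what per-job "live" compilation could look like on a worker node; the file names and compile flags below are assumptions:

            # Hypothetical per-job compile step; assumes nvcc is installed on the WN.
            if ! command -v nvcc >/dev/null 2>&1; then
                echo "nvcc not found on this worker node" >&2
                exit 1
            fi
            nvcc -O2 -o saxpy saxpy.cu   # recompiled at the start of every job
            ./saxpy
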
      • 13:15
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:20
        Compute Farm 15m
        Speaker: Thomas Smith
        • A brief VM interruption yesterday (21 Oct) on a subset of VMs affected the ATLAS T1 Condor central manager
          • No real impact: the pool is resilient to brief interruptions in CM activity, and the CEs continued to schedule and run jobs
          • CEs (gridgk03,4,6,7) were unaffected
          • Operations for the past week have been completely smooth, even with this event
        • Looking at condor_chirp, so that job ClassAd attributes can be modified on the fly for running jobs
          • condor_chirp is available; I've successfully tested it myself
          • It lives in a non-standard location because it is meant to be invoked from within running jobs rather than run from the command line
          • /usr/libexec/condor/condor_chirp
          • It may not be in your PATH, so keep this in mind if you wish to use it (a minimal usage sketch follows below)
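
        A minimal usage sketch, assuming the job has chirp/IO-proxy enabled (on some HTCondor versions this requires want_io_proxy = true in the submit file); the attribute name below is made up for illustration:

            #!/bin/bash
            # Run from inside the job's executable on the worker node.
            CHIRP=/usr/libexec/condor/condor_chirp   # not on PATH by default
            # Set (or update) a job ClassAd attribute on the running job:
            "$CHIRP" set_job_attr MyProgressPercent 50
            # Read it back to confirm:
            "$CHIRP" get_job_attr MyProgressPercent
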
      • 13:35
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        HPSS drive failure detected this morning; an IBM engineer has been contacted. Restores are partially impacted; resolution is expected by end of day or by tomorrow.

        No major issues to report for dCache storage.

      • 13:40
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Smooth running, albeit at reduced compute capacity due to PDU issue
          • A PDU intervention is scheduled for next Monday (10/27); capacity will be briefly reduced by another ~30%
          • One more intervention needed, schedule TBD
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running in the last two weeks....
        • Minor planned and unplanned disruptions at MWT2, NET2, and CPB
        • Another fiber break knocked out TW-FTT's connection this week.
      • Almost done with the Tier 2 reporting.
      • Given how busy people are now, I (Fred) propose pushing the equipment discussion off to November.
        • I did not consult with Rafael on this proposal.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: Finished the allocation on Monday; the UCORE queue was taken offline

        • Tested MC/Track Overlay, RawToAll, RDOtoRDOTrig production

        Perlmutter: ~8% of the CPU and ~32% of the GPU allocation remain; running stably

        • GPU usage is still low

        ACCESS: Explorer allocation extended to Oct 2026

        Doug & Rob are writing up a note on the Overlay cluster setup

         

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Carlos has provided a detailed summary of dCache local storage usage/allocation by AF/Tier-3 users.  To be discussed:  how to handle this appropriately going forward.
        • Continue work on the new federated frontend for ATLAS, DUNE, etc.
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • /data Access Issues – Experienced access interruptions last week, caused by a combination of high load and OSD pool performance issues. The problem was resolved after addressing laggy placement groups (PGs) and making some adjustments to MDS handling.
        • HTCondor Configuration Update – Implemented new restrictions to improve stability. Job submissions from /data and file transfers to/from /data are now disallowed, to reduce load on CephFS and prevent scheduler (schedd) disruptions (see the sketch after this list).
        • Triton Deployment – Refreshed deployment with an updated server version. Work is in progress to produce user documentation for the updated service.
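
        A minimal sketch of one way such a submit-side restriction could be expressed; the actual knobs used on the UChicago schedds are not given in these notes, so the config file name, requirement name, and pattern below are assumptions:

            # /etc/condor/config.d/99-no-data-submits.conf (illustrative name)
            # Reject submissions whose initial working directory (Iwd) is under /data:
            SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) NotFromData
            SUBMIT_REQUIREMENT_NotFromData = !regexp("^/data(/|$)", Iwd)
            SUBMIT_REQUIREMENT_NotFromData_REASON = "Submitting jobs from /data is not allowed; please submit from your home area."
            # Apply on the schedd with: condor_reconfig
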
    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • QR input has been received, planning to upload this afternoon
      • HEPiX takes place in China in 1.5 weeks; there is a possibility of getting slots for a remote presentation if someone is interested: https://indico.cern.ch/event/1536836/
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • Welcome to Yi-Ru (Jennifer) Chen from TW-FTT, who is currently visiting CERN and has been attending ATLAS and US Ops meetings.  We are looking forward to having her provide additional ops support for the US Cloud from the Asia time zone!
          • Jennifer and the TW-FTT site are also interested in migrating from ARC-CE to HTCondor-CE 
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Analytics

        • Running out of space on the ES cluster.
        • As a temporary solution, an old storage node will be added to serve as cold storage (see the sketch below).
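
        A minimal sketch of one way to attach such a node as a cold tier (the actual analytics-cluster layout and Elasticsearch version are not described here, so the role/tier mechanism and index name below are assumptions):

            # On the repurposed node, give it only the cold data role (elasticsearch.yml):
            #   node.roles: [ data_cold ]
            # Then pin an older index onto the cold tier explicitly:
            curl -X PUT 'localhost:9200/old-index-2024/_settings' \
                 -H 'Content-Type: application/json' \
                 -d '{ "index.routing.allocation.include._tier_preference": "data_cold" }'
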

         

        Caching physics data

        • Discussing HTTP proxy testing with Raphael K. (Wuppertal). They will try ATC, nginx, and xroot.
        • All xcaches now have new certs 

         

        Caching conditions

        • The new system is running stably.
        • One of the two k8s clusters running Frontier lost connectivity and had to be rebuilt entirely. No impact on operations.
        • lxplus has been migrated off the squids.
        • Still need to migrate NERSC off the squids.
        • Still need to set up a local Varnish for BNL.
        • The backup proxies (both Fermilab and CERN) will be removed this week.
        • The CERN ITS-operated Varnishes will be moved to k8s.

         

        Caching CVMFS

        • UC now has Prometheus monitoring of all the CVMFS clients. A lot of interesting data (see the sketch below).
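
        A minimal sketch of one way to expose CVMFS client counters to Prometheus via the node_exporter textfile collector (the actual UC setup is not described here; repository, metric names, and output path are assumptions):

            # CVMFS exposes client statistics as extended attributes on the mountpoint.
            REPO=atlas.cern.ch
            OUT=/var/lib/node_exporter/textfile_collector/cvmfs.prom
            {
              echo "cvmfs_cache_hitrate{repo=\"$REPO\"} $(attr -qg hitrate /cvmfs/$REPO)"
              echo "cvmfs_io_errors{repo=\"$REPO\"} $(attr -qg nioerr /cvmfs/$REPO)"
            } > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"
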

         

        AI

        • A lot of small improvements on AF Assistant.
        • Now using OpenAI AgentBuilder and ChatKit for frontend. 
        • Will present tomorrow at ATLASE Scope Kick-off meeting.

         

      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:25 14:35
      AOB 10m