US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 8728527

Invite link:  https://umich.zoom.us/j/99329677148

    • 1
      WBS 2.3 Facility Management News
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
      • NSF finally approved the re-allocation for end-of-CA funds to be used for Tier-2 equipment.   As Fred noted, this is a necessary but insufficient step in that we also require the paperwork allowing us to get/spend the new funds ($1.7M/Tier-2).
        • John noted that we do not yet have our final 4 months of funding for FY26 and presumably, the assignment of those funds would be when we also get the end-of-CA funds.
      • We have a procurement discussion coming up on Friday.   Plans are hosted in a Google drive:  https://drive.google.com/drive/folders/1XPHgTxZ29QfK6o_p2vU0FuR1hjNTwBbx?usp=drive_link
        • Homework for everyone:  Please read and comment on the plans.  Are there suggestions for changes?
      • Next week is the face-to-face meeting.    Please look at the draft detailed agenda at https://agenda.hep.wisc.edu/event/2432/timetable/#20260609.detailed (Tuesday) and https://agenda.hep.wisc.edu/event/2432/timetable/#20260610.detailed (Wednesday).   
        • We are seeking topic "introducers" to seed discussions for the various areas:  Fabrics and Storage for USATLAS, USATLAS Ops & Monitoring, Facility Planning including GENESIS projects, USATLAS Topics for the WLCG-HSF meeting
        • Check the agenda: you may have already been "volunteered" :)
      • Work to prepare for the scrubbing should be underway.   Please see templates and documents in our Google folder
        • All L3s should try to have completed their draft presentations by Friday, June 12  
        • Today and tomorrow we are contacting those we want present in person for the scrubbing.   Venue: Stony Brook. Dates: July 13-15 (WBS 2.3 will likely be on a specific day in that period, TBD)
    • 2
      OSG-LHC
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release
        • cvmfs 2.14.0 client in testing: major rewrite to reduce privileges throughout the codebase
        • Holding back on XRootD releases
        • HTCondor 25.11.0,  25.0.11, 24.12.21, 24.0.21. Interesting features in 25.0.11:
          • condor_config_val --trace finds a definition in a complex configuration
          • It is now possible to set a duration on a user's running-jobs ceiling
          • Initial support for EP health checks
      • Topology security contacts
    • WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 3
        Tier-1 Infrastructure
        Speaker: Jason Smith
      • 4
        Compute Farm
        Speaker: Thomas Eric Smith (Brookhaven National Laboratory (US))
      • 5
        Storage
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
        • No major issues to report
        • dCache Integration instance upgraded from 11.2.0.3 --> 11.2.0.4
      • 6
        Tier1 Operations and Monitoring
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Some issues with CERN EOS mounts on one of the Tier-3 submit hosts
        • A 1-2 hour downtime is scheduled for tomorrow at 2pm to replace the tape library fiber switches.  Only LTO8 staging requests are affected (they will be queued; no staging errors are expected).
    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running recently.
        • NET2 down for an update to the compute servers.
      • I got this information from John Hobbs about when the equipment funding will become available.
        • Well, we have approval to make the change, but not the funding to actually do it! 
        • So we will probably have to make a special equipment-only 1030s request.
          • But that's a guesstimate.
        • If we get the rest of the funding next week, then we'll put everything in one allocation.
      • Finished benchmarking the 5th-generation (Turin) EPYC CPUs:

        AMD EPYC CPU   Cores/Threads   HT On (2 CPUs)   HT Off (2 CPUs)   HT On / HT Off
        9355           32C/64T         3,644            2,801             130%
        9374F          32C/64T         4,018            3,027             133%
        9555           64C/128T        6,563            5,086             129%
        9755           128C/256T       11,403           8,849             129%
        9965           192C/384T       13,990           11,414            123%
      • Benchmarks were run under these conditions:
        • ALICE benchmark disabled.
        • NUMA=4 per cpu.
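
As a quick cross-check of the last column, the HT On / HT Off ratios can be recomputed from the two score columns. This is only a sketch: the scores are copied from the table above, and the names `scores`, `ht_gain_percent`, and `gains` are illustrative, not part of any benchmark tooling.

```python
# Benchmark scores copied from the table above: (HT on, HT off), each for 2 CPUs.
scores = {
    "9355": (3644, 2801),
    "9374F": (4018, 3027),
    "9555": (6563, 5086),
    "9755": (11403, 8849),
    "9965": (13990, 11414),
}


def ht_gain_percent(ht_on, ht_off):
    """Return the HT-on score as a rounded percentage of the HT-off score."""
    return round(100 * ht_on / ht_off)


# Hypothetical helper mapping: CPU model -> HT on / HT off in percent.
gains = {cpu: ht_gain_percent(on, off) for cpu, (on, off) in scores.items()}

for cpu, gain in gains.items():
    print(f"{cpu}: {gain}%")
```

The recomputed values agree with the percentages reported in the table, with the highest-core-count part (9965) showing the smallest gain from hyper-threading.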
    • WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 7
        HPC Operations
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: Many jobs (on the order of thousands) failed due to SLURM job timeouts:

        • Especially during periods of high job parallelism (e.g., 3,600 pilots running on 230,400 cores at once). The job finishes, but the pilot is still running when the SLURM time limit is reached.
        • (Doug) This is expected behavior because the pilot is still processing work on jobs when the Slurm time limit is triggered. We get decent throughput with jobs running 6 hrs vs. 24 hrs. We allow any Slurm job to run for a minimum of 6 hrs and a maximum of 24 hrs.

        TACC: TACC is OK; we ran Harvester on the Stampede3 login node. We need to reduce the number of threads used in the modules to keep the total under the user limit.

        There is an MR for a flag (shared by Serhan) that can ask Athena to stop during event processing on a given type of signal. This could be helpful if a SLURM timeout occurs in the middle of the execution stage. It is being tested on Perlmutter.

        • What is the flag?  We have to see how the flag will be propagated to the pilot inside a container.
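
The general idea of stopping an event loop on a signal can be illustrated with a small generic sketch. This is not the Athena flag or the pilot's implementation; `StopRequest`, `run_event_loop`, and the choice of `SIGUSR1` are all assumptions made for illustration. Slurm can deliver a signal ahead of the time limit (for example through the `--signal` option of `sbatch`), and the handler below only sets a flag so that the event in progress is allowed to finish.

```python
import signal


class StopRequest:
    """Records that a stop signal arrived; the event loop polls this flag."""

    def __init__(self, signum=signal.SIGUSR1):  # SIGUSR1 is an assumed choice
        self.requested = False
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only set a flag here; the loop finishes the current event first.
        self.requested = True


def run_event_loop(n_events, stop, process_event):
    """Process up to n_events, stopping cleanly between events if asked."""
    processed = 0
    for index in range(n_events):
        if stop.requested:
            break
        process_event(index)
        processed += 1
    return processed
```

In a containerized setup the signal would still have to reach the process running the loop, which is the propagation question raised above; the sketch assumes the signal is delivered directly to that process.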
      • 8
        Integration of Complex Workflows on Heterogeneous Resources
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 9
        Analysis Facilities - BNL
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Check the ATLAS Glance status of the 24 bounce-back users and send it to Viviana and Ofer to ask for guidance on how to proceed
          Glance status   Count
          Inactive        16
          Author          4
          Active          1
          Not found       2
        • Discussion about how to provide Jupyter notebooks for the upcoming BNL traineeship week in August
      • 10
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 11
        Analysis Facilities - Chicago
        Speaker: Fengping Hu (University of Chicago (US))
    • WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 12
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • ADC Ops
          • New pilot (version 3.13.2.3) released. This should solve the directIO problems from last week and the BNL shared home issue
          • A huge number of open data transfers were triggered last week. They have been paused and will be investigated tomorrow (when Fabio is back)
        •  USATLAS
          • AGLT2: debugging CE accounting
          • BNL: A/R was low in the May report (similar to all other tape sites) due to a SAM tape test problem. The report is to be redone.
          • CRIC: chasing IPv4 leftovers (ADCINFR-277): (Removed Hyak RC). Waiting for Alessandra to continue.
          • Security:
            • created and populated a common (temporary) US ATLAS security mailing list - usatlas-security@cern.ch
            • OSG topology is the authoritative source for Security Contacts
            • CRIC synchronization is waiting for OSG API development (SOFTWARE-4134)
          • Slurm 25 with HTCondor-CE at OU:  Fred, Ofer to follow up with David Akin to deploy/test while Horst is on vacation.
      • 13
        Services DevOps
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14
        Facility R&D
        Speaker: Robert William Gardner Jr (University of Chicago (US))

        RP1

        • Added CVMFS integration to RP1 managed JupyterHub instances using cvmfs-csi

        ODF on RP1

        • Initial analysis environment deployed at jupyterhub.odf.uchicago.edu
        • Persistent home now uses node-local NVMe storage via the OpenEBS Local PV ZFS CSI driver
        • Verified setupATLAS/ALRB, voms-proxy-init in jupyterhub-singleuser servers
        • Verified data access from MWT2_OPENDATA in ODF notebooks with xrootd and http
      • 15
        Cybersecurity plan(s)
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 16
      AOB