US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      WLCG Open Technical Forum (OTF) meeting #6 took place over the last two days: https://indico.cern.ch/event/1562124/

      ATLAS S&C is in ~2 weeks; the agenda is still evolving: https://indico.cern.ch/event/1509065/

      This week's ADC Coordination meeting postponed the discussion of the ATLAS walltime limit until the next meeting (Sep 16), when Ivan should be able to attend.

      On Friday, 9-10 AM Eastern, there will be a discussion about the Trusted-CI engagement for those interested: https://www.google.com/url?q=https://umich.zoom.us/j/93713387827?pwd%3D0NbAN2tYXlRMHKxjbpJmv3jqgmwopu.1&sa=D&source=calendar&ust=1757937587257560&usg=AOvVaw2gFop8OdF05hMz2nE2_rmL

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      OSG 25

      • Aiming for the end of the month
      • OSG 23 will go EOL upon release of OSG 25
      • Aiming for EL10 support, but may defer it until later in the release if we don't think it's ready
      • No major changes outside of the move to HTCSS 25
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
        • Smooth operations throughout the last week

        • Nothing major to report

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        No major issues to report.

        Feedback provided on GGUS #1000475.

      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Great running recently...
        • Only item of note was a planned downtime at NET2 for an OKD update.
      • Various minor issues:
        • CVMFS problems at AGLT2
        • Jobs hitting the wall time limit
      • Set up MWT2 to allow Paul to test a new pilot version that enables the sub-cgroup memory limit.
        • Need to email Paul so he can start his testing.
      • Progressing at CPB on migrating data to new servers and getting storage updated to EL9.
        • Future XRootD updates require EL9.
      • Please do not buy any equipment until we have guidance from management.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: ~23.5K SUs remain, running well

        • Only two files failed to stage on the first day (Sept. 3rd). All other data transfers succeeded
        • Setting up a test Harvester instance to try Varnish (thanks Ilija!)

        Perlmutter: 9%/2% CPU/GPU allocation remains. Stable running

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • AF Debrief/Planning meeting with Lincoln last week (notes)
      • Updates to AF Docs (need to sort out GitHub roles/permissions)
      • At last week's 2.3/5 meeting: discussion of AI tools (Shuwei), GPU metrics, and Heavy-Ion storage requests at BNL
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Maintenance completed on Aug 27
          • Operating system: AlmaLinux 9.5 → 9.6

          • Kubernetes: v1.29 → v1.31

            • The in-tree CephFS volume plugin was removed in v1.31; updated various places to use HostPath instead
          • HTCondor: 23.0.22 → 24.0.11

          • Other components: Server firmware, NVIDIA drivers.

    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • WLCG OTF #6 meeting this week
          • Day 1:  Environmental Sustainability; Day 2: Network Updates & Challenges
        • Ongoing work to configure external Varnish service for BNL
          • Numerous issues with network routing and configuration of servers at NET2
          • Ilija has some measurements of the performance effect of large Frontier cache latency
        • TW-FTT is sending an engineer to CERN in October/November to help integrate ASGC into the US Cloud support team
        • WLCG Ops Meeting last week: updates on CRIC and HC status, AVX2 policy
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        XCache

        • ESnet will be upgraded
        • OX XCache issues fixed by manual restarts
        • New XCache service certificate made and given to UK users
        • Raphael (Wuppertal) is thinking of testing HTTP caching

        AI

        • Deployed OpenWebUI as a frontend for AF assistant
        • Setting up all the functionality will be quite involved

        Varnish

        • We have a backup Frontier cluster and service up and running
        • US Varnishes (except SWT2_CPB) updated to see the backup too.
        • NET2 now using their own Varnish
        • BNL uses an NRP node at MGHPCC (very large latency; performance is half that of a local squid). Trying to get it to connect to the NET2 Varnish
        • Working with Asoka and Chris on changes to Tier3 and lxplus settings.
        • Waiting on Ryu to see how to change the settings at TACC.
        • Starting to rebuild CREST
        • Changes made to CVMFS Varnishes to set a larger nuke number. Working on it with Wenjing. Initial results are very promising.

      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Facility R&D Biweekly last week (minutes)
    • 14:25 14:35
      AOB 10m