US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Capacity challenge for US underway

      Upcoming meetings:

        - LHCOPN/LHCONE https://indico.cern.ch/event/1534556/

        - ATLAS S&C week https://indico.cern.ch/event/1509065/

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:30
      Presentation: Power Measurements and Benchmarking from MWT2 20m
      Speaker: Aidan Rosberg (Indiana University (US))

      Slides

    • 13:30 13:50
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:30
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:35
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:40
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:45
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:50 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Great running over the past two weeks.
        • NET2 was set offline because of a network problem, and it then took the better part of a day to bring the site back online because HammerCloud was not submitting new tests.
          • NET2 tried to manually force the site online, but this failed because the HammerCloud probes that put a site offline were not disabled.
        • Otherwise low unplanned downtime.
      • NET2 is down today and tomorrow to update OKD.
      • The mini-challenge is ongoing, and a larger-scale test is running today.
        • The challenge is showing up clearly on the MWT2 network plots.
      • Work continues at CPB to migrate data to servers running Alma Linux 9.
        • Zach Booth tells me that they are close to having the migration procedure complete.
      • I have not heard anything about the final scrubbing report, so please continue to hold off on starting procurement.
    • 14:00 14:10
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 14:00
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: 30K SUs added, with the allocation extended to Dec. 31, 2025. Frontera will be shut down on May 31, 2026; no LSCP next year.

        • They are defining allocation plans for AY26, which are still to be announced.

        Perlmutter: ~18% of the CPU allocation and ~8% of the GPU allocation remain. Stable.

      • 14:05
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 14:10 14:30
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:10
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
    • 14:30 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:30
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • CRIC: We have started handing over support to Natalia & Panus (CERN IT).
        • CVMFS: Looking at a server-client revision comparison at the wrapper level.
        • Helpdesk: Tickets that were marked only as ATLAS (without a site) were not reaching ADC. Ongoing.
      • 14:35
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        XCaches

        • Issues at Oxford and Birmingham; both solved.

        Varnish

        • Resolved several site configuration issues (e.g. MPPMU).
        • Resolved an issue with truncated responses. The solution requires a parameter in a startup script. The Varnishes that are part of the new Frontier Launchpad have been updated, as have all NRP-deployed Varnishes. Sites with self-deployed Varnishes will be asked to update and restart.
        • Resolved an issue with stale data in one condition directory; the fix was a simple Frontier config change.
        • Created a spare k8s cluster for the new Frontier Launchpad. There are still some issues with the LB.
        • With Ivan's blessing, I moved BNL to use the Cloudflare proxy an hour ago. Everything seems to be fine. Monitoring it today.

        Analytics

        • The storage on the UC Elasticsearch cluster nodes has been changed, which took a week. We now have 30% less storage capacity.
        • Several alarms have been updated.

        AI

        • Added a Gmail MCP. Testing it for reading and responding to users' emails.
        • Got an email from OpenAI that the Assistants API is being shut down today, even though they had said it would work until early 2026. Moving to the new API will require a few days of work.
      • 14:40
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • This week is Lincoln's last US ATLAS Computing Facility meeting before moving on to his new position :) Thank you for being great colleagues over the years!
      • 14:45
        AOB 10m