US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09


    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Unfortunately the IB conflicts with this meeting https://indico.cern.ch/event/1558310/ (as well as Supercomputing 2025).

         - Alexei will miss today's meeting because he is giving a talk at SC25.

      The next NSF CA proposal was successfully submitted on Monday, a little before 2 PM Eastern. Now we wait to hear...

      We still have to confirm the USATLAS presenter for https://indico.cern.ch/event/1526077/timetable/

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith

        Happy to report, ATLAS T1 condor pool is back up to full strength

        All worker nodes are on condor v25.0.3.

        The remaining condor CE upgrades will be done this week to finish the pool upgrades (for the time being).

        Other news: Alma 9.7 dropped, and will be coming soon to the pool, date TBD

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        No major issues were encountered this week.

        Work continues with the BNL dCache storage team on user analysis accounts, with additional SDFC members involved in the discussion of policy and dCache user space management.

        A verification of dCache user-management features is also in progress with the dCache developers (see: https://github.com/dCache/dcache/issues/7947).

      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • T1 farm restored to full capacity yesterday after replacement of PDU circuit breaker
        • Some job failures due to (8) black hole nodes last week that resulted from the HTCondor upgrade procedure (cleared by reboot)
        • Ongoing issues with ARM queue (exclusions last week, draining this week)
        • Discussing GPU queue reconfiguration to allow for 16 CPU/1 GPU jobs
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonable running over the last 2 weeks
        • NET2 is in downtime to support a demo at SC25
        • CPB had another power failure.
      • New condor and cvmfs versions installed at AGLT2 and MWT2
        • Some reductions in number of slots running to do the underlying rolling updates.
        • MWT2 also updated AlmaLinux to version 9.7 at UC & IU. UIUC is still running RHEL 9.4.
      • SWT2 CPB continues to work on the EL9 updates.
      • TW-FTT / Yi-Ru continue to make good progress.
        • Still having transfer failures, but seemingly only to certain sites.
          • The mean transfer success rate has been ~90% but the failures are concentrated at certain sites.
          • For now the site is being restricted to simulation jobs to reduce the amount of data that is transferred.
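The point above, that a ~90% mean success rate can hide failures concentrated at a few sites, can be illustrated with a small sketch. The site names and counts are invented for illustration and are not TW-FTT monitoring data; `success_rate_by_site` is a hypothetical helper, not part of any existing tool.

```python
from collections import defaultdict


def success_rate_by_site(transfers):
    """Return {site: fraction of successful transfers} from (site, ok) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, ok in transfers:
        totals[site] += 1
        if ok:
            successes[site] += 1
    return {site: successes[site] / totals[site] for site in totals}


# Hypothetical sample: 100 transfers in total, 90 of them successful.
sample = (
    [("SITE_A", True)] * 45
    + [("SITE_B", True)] * 40
    + [("SITE_C", True)] * 5
    + [("SITE_C", False)] * 10
)

rates = success_rate_by_site(sample)
overall = sum(ok for _, ok in sample) / len(sample)

print(f"overall: {overall:.0%}")
# List the worst-performing sites first.
for site, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{site}: {rate:.0%}")
```

In this made-up sample the overall rate is 90%, yet every failure comes from one destination, which is the kind of pattern a per-site breakdown makes visible.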
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: possible extra CPU hours from NERSC

        • Eric: DOE has no extra allocation left. NERSC Management said probably, but they’re at SC25 so will get back to us after that.
        • Multi-year proposal: Doudna CPU performance is estimated at 2-4x that of a Perlmutter node (assuming Nvidia CPU). Projection for 2028-2030.
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Executed first step in review of local user storage: zeroed storage tokens for users with deactivated SDCC accounts and dedicated, but empty, dCache directories.  This recovered ~209 TB.  Next steps are under discussion.
        • Also analyzed and reviewed the other two categories of users:
          • 41 deactivated users with space usage, using ~79.01 TB: move the data to archived storage and then release those storage tokens?
          • 23 active users with zero usage
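The user categories in this review can be sketched as a simple classification. This is an illustration only: the `Account` fields, the category labels, and the sample accounts are assumptions, not the actual SDCC or dCache data model.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    active: bool    # assumed flag: the SDCC account is still active
    used_tb: float  # assumed: space used in the user's dCache directory, in TB


def categorize(account):
    """Map an account to one of the review categories (labels are hypothetical)."""
    if not account.active and account.used_tb == 0:
        return "deactivated_empty"      # storage token can be zeroed directly
    if not account.active:
        return "deactivated_with_data"  # candidate for archiving before release
    if account.used_tb == 0:
        return "active_empty"
    return "active_with_data"


# Made-up accounts, one per category.
accounts = [
    Account("user_a", active=False, used_tb=0.0),
    Account("user_b", active=False, used_tb=1.5),
    Account("user_c", active=True, used_tb=0.0),
    Account("user_d", active=True, used_tb=3.2),
]

for acct in accounts:
    print(acct.name, categorize(acct))
```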
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        GitHub Actions Runner Controller Added

        • Giordon is leading the deployment effort. This new controller will replace the existing cron jobs used to manage AF benchmarks through GitHub Actions.

        • The system automatically scales the runners based on demand.

        • It includes a self-maintained runner image and configuration setup to provide an execution environment tailored for benchmarking and other workflow needs.

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • Functional testing of WAN transfers to dual-homed dCache pools was successful; however, the configuration changes were completely reverted due to some misconfiguration that led to staging problems.  We would like to redo the functional test asap in preparation for Dec. capability testing.
        • WLCG DOMA BDT meeting earlier today (link) - test tape+tokens at BNL early next year?
        • Requested add'l information from Pilot condor_chirp (jira)
        • Invited Canadian and South American facility teams to join our daily ops meeting - not much response yet
        • BNLHPC disks decommissioning is progressing - only several hundred files left on scratchdisk.  
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        XCache

        • lost one AF xcache, sent for repairs
        • issues at BHAM
        • there is a version 5.9.1 that should include some bugfixes. Will be testing it this week.

         

        Varnish

        • CERN local Varnish instances are being hammered with 1.9 GB requests 4 times per hour. Nurcan is checking who issues them.
        • Everything works fine despite CloudFlare issues.
        • Remaining traffic on old Frontiers: CYFRONET, Mainz, three people's GitLab cron jobs, NERSC
        • ECDF is independently deploying Varnish for CVMFS
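For a sense of scale, the request pattern hitting the CERN Varnish instances can be turned into a daily volume. This is a back-of-the-envelope sketch that assumes each of the 4 hourly requests transfers the full 1.9 GB and that the rate is constant over the day; neither assumption is confirmed in the notes.

```python
REQUESTS_PER_HOUR = 4    # from the notes
REQUEST_SIZE_GB = 1.9    # from the notes; assumed to be the size of each request
HOURS_PER_DAY = 24


def daily_volume_gb(requests_per_hour, request_size_gb):
    """Data volume per day, in GB, for a constant hourly request rate."""
    return requests_per_hour * HOURS_PER_DAY * request_size_gb


volume = daily_volume_gb(REQUESTS_PER_HOUR, REQUEST_SIZE_GB)
print(f"{volume:.1f} GB/day")
```

Under these assumptions the pattern amounts to roughly 182 GB per day.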

         

        AI

        • updated documentation for AF. GitHub Action is now doing most of it.
        • Exported all the requests up to now and manually reviewed/classified them.

         

        Analytics

        • new alarms and alerts.
        • updated A&A frontend packages
        • More ES cleanup, moves to cold storage node


      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
        • ADC TCB meeting tomorrow to discuss PHYSLITE distribution (link)
        • Facility R&D meeting notes from last week 
          • Final presentation from Armen re: SWT2 k8s cluster (MS427)

          • Review of previous week’s integration challenge

          • Updates on RP1 and Kuantifier for Jupyter notebooks

          • Discussed impact of MkDocs EOL announcement (Zensical?)

            • CERN is not making an immediate change, but will probably move to Zensical at some point in the future
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m