US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Thanks to everyone for getting in their WBS 2.3 quarterly reports.

      WBS 2.3 top-level quarterly should be done soon.

      WLCG/HSF meeting coming up in early May

      Tier-2s need to work on finalizing procurement and ops plans (discuss in WBS 2.3.2)

      • After procurement plans are ready, we need to work on 5-year estimator

       

      Milestone updates still needed for WBS 2.3 https://docs.google.com/spreadsheets/d/1Y0-KdvsRVCXYGd2t-SqCEFlppZn_PjvUUVDGp2vJjc4/edit?gid=1906829311#gid=1906829311 

      • #117 Feb 2025 Delayed (by SWT2)  Updates?  WLCG site network monitoring 2 years delayed so far...
      • #374 Apr 2025 On Schedule (waiting on BNL?)  Need updated comment?
      • #279 Apr 2025 Delayed   Need updated comment? Tier-1
      • #392 Jan 2025 "On Schedule" Needs update Tier-1
      • #393 Jan 2025 "On Schedule" Needs update Tier-1
      • #191 Apr 2025 Delayed  Tier-1 Update comment?
      • #310 Feb 2025 Delayed SWT2 update estimated date and comment
      • #316 Mar 2025 Delayed SWT2 update estimated date and comment
      • #363 Mar 2025 On Schedule  update status or estimate date/comment
      • #410 Apr 2025 Delayed WBS 2.3.4 update comment?
      • #414 Apr 2025 On Schedule but is this a real milestone (WBS 2.3.4)
      • #328 Apr 2025 Delayed WBS 2.3.5.1 see comment, update estimated date
      • #415 Mar 2025 WBS 2.3.5.2 Update estimated date and comment OR retire?
      • #416 Jun 2025 WBS 2.3.5.2 Is estimated date correct?  Update comment?
      • #419 Mar 2025 On schedule WBS 2.3.5.2 New estimated date needed.  Change Status to Delayed
      • #428 Mar 2025 Delayed  WBS 2.3.5.3 New estimated date, update comment
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • XRootD 5.8.1 in osg-testing
      • ATLAS NRP deployments
      • Zeek
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • NTR

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • New compute racks added; the Tier-1 CPU count is temporarily raised to ~45K.
          • Older equipment will soon be retired or donated to the Tier-3. The Tier-1 core count will then show a small net decrease, but there will still be a net gain in HEPscore23, since the new hardware is faster core for core.

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • 5280 TB of disk space added to the 2025 pledge
        • 10 pool hosts commissioned into production
        • 25030 TB of tape space added to the 2025 pledge

        WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan

        • The cluster emptied today because a user assigned all of his jobs (~100k) to BNL only. Actions taken:
          • Killing all user jobs assigned to BNL
          • Unsetting the site for all of his jobs
          • Temporarily limiting the number of SCORE jobs at BNL
          • The site started to recover in the last hour
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonable running for the last two weeks.
        • AGLT2 continued work on understanding why cvmfs hangs at their sites
          • Still trying to understand why AGLT2 does not seem to be able to run more than 6000 SCORE jobs at a time. This did cause a small draining on one day.
        • MWT2 had reduced production last week due to rolling draining to remount cvmfs repos.
          • The draining/remount did end the cvmfs aborts and seemed to activate the fix for the bug causing the aborts.
          • It also finally caused the increased number of file descriptors specified in the configuration file to be used.
          • I recommend that all sites update to cvmfs version 2.12.7
        • OU had problems with their scratch area setup and had more failures than usual.
          • Fixed some issues, but the problem still occasionally appears on some servers.
        • SWT2_CPB had trouble staying full
          • ADC tried submitting 16-core MCORE jobs.
          • Set up a second gatekeeper.
          • Seems better?
      • Finished the quarterly reporting
      • Now focussing on the Operations and Procurement plans.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: job submission was suspended during the weekend. The Harvester instance is stopped for now. ~1.5K SU.

        Perlmutter: maintenance last week. CPU usage is slightly below expectation. The MCORE job rate is quite stable (not Premium). NERSC suggested (regarding Rucio) reducing the number of jobs in the queue to improve throughput.

        ACCESS: need to discuss with Doug on details

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Working on the Jupyter testbed deployment to evaluate authentication and retrieve UID/GID dynamically.
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Binderhub service launching preps

        • Identity reconfigurations: switched to the keycloak-prod instance and replaced the connect lookup (security, performance, and reliability issues) with POSIX claims
        • Adding monitoring
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Rucio DB overload on Wednesday due to multiple hanging ART job queries
          • Problem is mitigated
          • Ongoing:
        • HC - Starting tomorrow PFT_MCORE tests will be able to auto-exclude Production-only PQs.
        • Working on automatic storage blacklisting based on functional tests transfers
        • A campaign is under way to verify that all pledged compute resources allow 96-hour jobs.
        • Fred found some:
          • leaky Exotics derivations - triggered discussion on automatic stopping of leaky tasks
          • failing evgen
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • Waiting on ingress access to test new Varnish container at BNL
        • MS415 ??  EventLoop data access monitoring
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Armada work continues on stretched k8s; found some deficiencies in how the Postgres password can be stored securely in the deployment
          • Ticket for clarification / request for improvement will be filed
        • Coffea Casa deployment work continues, debugging 'client not found' issue between JupyterHub and Keycloak
        • Moving various AF/K8S services to keycloak-prod, deprecating keycloak-dev, syncing AF users into Keycloak periodically
    • 14:25 14:35
      AOB 10m