US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 993 2967 7148

Meeting password: 452400

Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      OTP is due. Please check your entries and update ASAP.

      We are in the middle of a mini capacity challenge. Each site should capture notes, logs, and diagrams in its folder: https://drive.google.com/drive/folders/1E7Xiox_SniBsbHXeb5rLFkt8fvdK6q4p?usp=drive_link (see the Google Doc in this folder for details)

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • XRootD 5.9.1 will be available in osg-testing shortly
      • HTCondor 25.5.50 is being rolled out onto OSPool this week; after about a week of stress testing, it will get a full release
      • Having some success with JupyterLab accounting, with contributions from Sam Albin (UNL)
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
        • The first small batch of worker node upgrades to AlmaLinux 9.7 is underway to test the Jenkins upgrade pipeline; so far so good.

          • Pending success, we'll kick off the full automated drain and upgrade, targeting a batch size of 40 workers.
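As a rough illustration of the batch size mentioned above, the sketch below splits a list of worker nodes into groups of at most 40. It is not the Jenkins pipeline itself; the host names and the `batches` helper are hypothetical, and the real drain and upgrade steps are not shown.

```python
from typing import Iterator, List


def batches(nodes: List[str], batch_size: int = 40) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size nodes."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


# Hypothetical host names, for illustration only.
workers = [f"worker{i:04d}" for i in range(1, 101)]
plan = list(batches(workers))

for number, batch in enumerate(plan, start=1):
    # A real pipeline would drain, upgrade, and validate each batch here.
    print(f"batch {number}: {len(batch)} nodes, {batch[0]} .. {batch[-1]}")
```

With 100 hypothetical workers this yields two full batches of 40 and a final batch of 20.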

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        • BNLHPC_SCRATCHDISK decommissioned
          • The associated storage resources (~1 PB) are being commissioned into dCache production.
        • NFS v4 dCache protocol
          • The current plan is to isolate access to ATTSUBXX, spoolXXXX, sgpu0001/2, and spoolsub01/02; no ACASXXX nodes.

      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • John D successfully deployed a local Varnish server, but the logging monitor (required by the lab) is not yet implemented
          • Ilija put the server into production last Wednesday nonetheless. The service has not been particularly stressed.
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running over the last two weeks.
        • AGLT2 & MWT2 did rolling updates of HTCondor, causing a small reduction in production.
        • NET2 had some downtime to support activity at SC25.
        • CPB had some minor draining ~10 days ago.
        • TW-FTT recovered from another network outage.
      • CPB has drained 5 of the 8 MD3640 servers that they will retire; they are fully updated to EL9.
      • TW-FTT is in the process of putting all 2.5k job slots online after converting to HTCondor and Varnish.
        • The site has been running much better recently: kudos to YiRu!
      • Please file your operations plans before leaving for the holidays.
      • I need to look at the Tier 2 OTP entries.
      • I am meeting with Andrey, Mayuko, and Kaushik later today to discuss using a script that Andrey wrote to list files in LOCALGROUPDISK that have not been accessed in a "long" time.
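To make the idea of listing long-unaccessed files concrete, here is a minimal sketch of the kind of filter such a dump might apply. It is not Andrey's script: the record layout (`name`, `accessed_at`), the `stale_files` helper, the 365-day threshold, and the sample data are all assumptions for illustration, and a real version would read replica metadata from the data management system.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

Record = Dict[str, Optional[object]]


def stale_files(records: List[Record], now: datetime,
                max_idle_days: int = 365) -> List[str]:
    """Return names of files whose last access is older than the cutoff.

    Assumption: a file with no recorded access time counts as stale.
    """
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for record in records:
        accessed_at = record["accessed_at"]
        if accessed_at is None or accessed_at < cutoff:
            stale.append(str(record["name"]))
    return stale


# Hypothetical sample data and an arbitrary fixed reference time.
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
records = [
    {"name": "user.a/file1.root",
     "accessed_at": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"name": "user.b/file2.root",
     "accessed_at": datetime(2025, 11, 20, tzinfo=timezone.utc)},
    {"name": "user.c/file3.root", "accessed_at": None},
]

result = stale_files(records, now)
for name in result:
    print(name)
```

The threshold is a parameter because the notes leave "long" undefined; whatever value is agreed in the meeting would replace the placeholder.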
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: running stably, at a slightly lower rate than before the pause

        ACCESS: the Stampede3 account needs Gordon (PI) to set up a TACC account

        -----

        Fixed the expired X.509 credential on Monday. The NERSC CPU queue is running at full steam (100 nodes per SLURM job), with 5 SLURM jobs in the queue at a time.

        Restarted the NERSC GPU queue. Now debugging why HC jobs are failing.

        BNLHPC_DATADISK and BNLHPC_SCRATCHDISK RSEs decommissioned and drained.

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Maintenance

        • Scheduled for December 10.

        • Planned activities include routine updates to firmware, operating system, Kubernetes, Rook/Ceph, NVIDIA drivers, and other core components.

        IaaS / Inference-as-a-Service

        • Continuing work with Xiangyang Ju (LBL) on testing Inference-as-a-Service for DAOD production.

        • Evaluation areas include memory footprint, Triton server capacity, and data throughput.

        • A functional deployment is now running at UC AF, and testing is currently in progress.

    • 14:10 14:30
      WBS 2.3.5 Continuous Operations
      Conveners: Ivan Glushkov (Brookhaven National Laboratory (US)), Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Kaushik De (University of Texas at Arlington (US))
        • Hiro is conducting the capacity challenge
        • Progress in deploying IPv6 at OU
        • Dev pilot is working again at MWT2 with cgroups OOM management in HTCondor 25
        • Performance issues with the XRootD 5.9 proxy server deployment on EL9 at SWT2
        • Updates from DDM at the ADC weekly: beginning to test FTS4; also news about tape archive metadata and monitoring
        • CERN network outage on Sunday broke Rucio service and caused a mass exclusion event (link)
        • Leslie Groer (Waterloo T2 manager) joined our cloud daily ops meeting today
        • Heterogeneous Architectures meeting began today
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCache
          • upgraded to 5.9.0
          • will upgrade to 5.9.1 once it is available in OSG and test it at UC AF
          • some issues with BHAM
        • AF
          • a lot of throughput testing for the integration challenge
        • Varnish
          • BNL in operation now
          • One brief issue over the weekend, when the CERN load balancers went down and the new Frontier servers were unavailable.
          • Will give a tutorial on Varnish monitoring next week
        • CREST
          • completely reworked Dev documentation and redeployed it.
          • fixed TLS on production clusters.
        • AI
          • NTR
      • 14:20
        Facility R&D 5m
        Speaker: Robert William Gardner Jr (University of Chicago (US))
      • 14:25
        Cybersecurity plan(s) 5m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Shigeki Misawa (Brookhaven National Laboratory (US))
    • 14:30 14:40
      AOB 10m