
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Planning has started for the annual US ATLAS Technical and Pre-Scrubbing meeting.

      This will be held June 5-7, 2023, at Indiana University Bloomington.  

      Block agenda, TBC:

      June 5 (all day) - Technical S&C talks I

      June 6 (half day) - Technical S&C talks II

      June 6 (pm) - Pre-Scrubbing (closed)

      June 7 (all day) - Scrubbing (closed)

      • Please let Shawn and me know if you'd like to present
      • Next week's facility coordination meeting will get started on pre-scrubbing

      ATLAS S&C Plenary tomorrow: https://indico.cern.ch/event/1268248/

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      1. Initial EL9 release expected tomorrow!
        1. Contains HTCondor-CE 6 and the HTCondor feature series (10.3.0)
        2. We don't have osg-ca-certs with any SHA1 CA workarounds yet, so sites should at least temporarily downgrade the default crypto policy (except for compute services like CEs/local condor); see the sketch after this list
      2. XRootD 5.5.4 with xrdcl-http is available in osg-testing
      3. Gratia Probe 2.8.4 (maybe this week) fixing issues with HTCondor APs introduced in 2.8.1
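      For EL9 sites that need the temporary crypto-policy downgrade mentioned above, a minimal sketch follows; the choice of the DEFAULT:SHA1 subpolicy (versus the broader LEGACY policy) and the Python wrapper are illustrative assumptions, not an OSG recommendation, and CE/local condor hosts should be left alone.

```python
#!/usr/bin/env python3
"""Sketch: temporarily relax the EL9 crypto policy so SHA1-signed CAs keep working.

Assumptions: run as root on an EL9 host that is NOT a CE/local condor node, and
that the DEFAULT:SHA1 subpolicy is acceptable at the site (LEGACY is the broader
alternative). Revert later with `update-crypto-policies --set DEFAULT`.
"""
import subprocess


def current_policy() -> str:
    # `update-crypto-policies --show` prints the active policy, e.g. "DEFAULT".
    return subprocess.run(
        ["update-crypto-policies", "--show"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def relax_for_sha1(policy: str = "DEFAULT:SHA1") -> None:
    # Applies the relaxed policy system-wide; affected services may need a restart.
    subprocess.run(["update-crypto-policies", "--set", policy], check=True)


if __name__ == "__main__":
    print("Active crypto policy:", current_policy())
    # relax_for_sha1()  # uncomment once the site has confirmed its policy choice
```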
    • 13:20 13:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Updates on DC24 plans in DOMA General Meeting today
        • Planning document
        • Some question of T2-T2 component
      • S&C Plenary Demonstrators discussion tomorrow
      • Milestone #240 to be delayed by 1 month due to SWT2_CPB cluster network hardware upgrade and reconfig
      • Condor queue quota reconfiguration at BNL today (see T1 report)
      • BNL XRootd doors update in progress (GGUS)
      • GGUS down today 
      • 13:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • 13:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
        • Some failures with stage-in errors, discussed in a DPA email thread. The 300 s timeout on rucio get is a known problem, fixed in the Rucio 1.30.8 release, which is under testing right now and will probably be deployed this week; hopefully that will resolve the failures.
        • The next big step is to recreate the SWT2_CPB_K8S cluster within the SWT2_CPB main cluster network. We have two compute nodes (one master + one worker) to start a new K8S cluster; once it is up, we will migrate the current K8S cluster and add more worker nodes.
        • Optimizing the job CPU-request coefficient sent from Harvester (the default scale-down value is 0.9) so that the node CPU is not overcommitted. The value in CRIC was further changed from 0.94 to 0.98, which, as verified, addresses the issue; things are running fine so far (see the sketch below).
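        To make the effect of the coefficient concrete, a back-of-the-envelope sketch is given below; the node and job sizes in it are hypothetical, not SWT2_CPB values. The pod CPU request is roughly coefficient × corecount, so a smaller coefficient lets Kubernetes pack more pods onto a node than the payloads' real CPU use allows.

```python
"""Sketch: how the Harvester/CRIC CPU-request coefficient affects node packing.

The numbers below (allocatable cores, job core count) are hypothetical; the point
is only that request = coefficient * corecount, and that a smaller coefficient can
let the scheduler overcommit the node's real CPU.
"""


def packing(allocatable_cores: float, job_cores: int, coefficient: float):
    """Return (pods scheduled, per-pod CPU request, total payload CPU demand)."""
    request = coefficient * job_cores          # CPU request Harvester attaches to the pod
    pods = int(allocatable_cores // request)   # how many pods Kubernetes will place
    demand = pods * job_cores                  # cores the payloads actually try to use
    return pods, request, demand


if __name__ == "__main__":
    ALLOCATABLE = 94.0   # hypothetical allocatable cores on a worker node
    JOB_CORES = 8        # typical multicore PanDA job
    for coeff in (0.90, 0.94, 0.98):
        pods, req, demand = packing(ALLOCATABLE, JOB_CORES, coeff)
        status = "fits" if demand <= ALLOCATABLE else f"overcommits by {demand - ALLOCATABLE:g} cores"
        print(f"coeff={coeff:.2f}: request={req:.2f} cores/pod, {pods} pods, demand={demand} cores -> {status}")
```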
    • 13:40 13:45
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Stable running
      • Still waiting for the CPU delivery (due by the end of next week) to meet the April 2023 pledge.
      • Today, Chris Hollowell updated the ATLAS Tier 1 quota so all production jobs have the same quota (prod.all), including updating all the gatekeepers.
        • Already tested on one gatekeeper since Friday. 
    • 13:45 14:05
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running with some issues in the last 30 days.
        • AGLT2 did some rolling updates (e.g., HTCondor 10) that resulted in partial draining.
        • MWT2 had some trouble receiving enough work from the production system.
        • OU was down for several days to install a new gatekeeper.
      • NET2.1 is progressing
      • I need to write the global tier 2 operations plan today.
      • Please report on your procurement activities.
      • 13:45
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

         

        Updated CVMFS to use a Varnish-style cache server from SLATE.
        Each site (MSU/UM) now has its own Varnish instance running on the site's own SLATE cluster.
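        For reference, a minimal sketch of the kind of check that can be run from a worker node to confirm CVMFS traffic really flows through the local cache; the Varnish endpoint and the Stratum-1 URL are placeholders, not the actual AGLT2 SLATE services.

```python
"""Sketch: verify a CVMFS HTTP cache (squid/varnish-style) answers for a repository.

The proxy endpoint and Stratum-1 URL below are placeholders; substitute the site's
own SLATE Varnish service and a Stratum-1 the site actually uses. This mimics what
the CVMFS client does: fetch .cvmfspublished through the configured proxy.
"""
import urllib.request

PROXY = "http://varnish.example.aglt2.org:6081"   # hypothetical cache endpoint
URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY})
)

with opener.open(URL, timeout=10) as resp:
    body = resp.read(512)
    print("HTTP status:", resp.status)
    print("X-Cache header:", resp.headers.get("X-Cache", "<none>"))  # hit/miss, if the cache exposes it
    print("First bytes of .cvmfspublished:", body[:40])
```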

         

        3/10/2023
        MSU upgraded the Juniper OS in all rack data switches to fix a bug that prevented access to a read-only account on our rack switches.
        Some MSU worker nodes had a burst of failed jobs from a couple of sources:
        some expected, because some of the redundant/bonded cabling was still missing (solved; all cabled now);
        some unexpected, related to the force-on setting needed for provisioning (avoidable in the future).

         

        3/16

        A new security kernel update became available, so we applied the new kernel to all of our worker nodes and interactive login nodes and rebooted them into it. We also took the opportunity to update the worker-node firmware. This process required draining the HTCondor cluster, and for most of the time BOINC backfill jobs filled the draining job slots. Only for two days, while we were draining a small batch of worker nodes, did the BOINC queue have no available jobs, so some draining job slots remained unfilled.

         

        2023 equipment orders placed at MSU and UM:
        18x R740xd2 with 20 TB drives for an estimated 6.8 PB in dCache (minus retirements)
        12x R6525 with AMD 7443 for 1152 cores or an estimated ~2k HS06 (minus retirements)
        a second NVMe storage node for the MSU VMware cluster
        a second NVMe storage node for MSU SLATE
        another NVMe storage node for UM
        also a storage node and a GPU node for the UM T3

      • 13:50
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgrading condor to 10.x and adding two additional gatekeepers

        Investigated squid failovers that were being misattributed to MWT2. 

        Investigating CVMFS failures and stuck jobs

        Site struggling to stay full during the lack of simulation work

        Updated CRIC settings for MWT2_VHIMEM_UCORE

        Preparing RFQ for storage at UC; UIUC working on worker node purchase; IU preparing purchase now that sub-award is fully executed.

      • 13:55
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

        Working with ATLAS experts towards final configuration for dCache.

        Setting up the DNS configuration: coordinating between UMass and Harvard to have both machines under net2.mghpcc.org. Configuration is ongoing.

        With the DNS work ongoing, the IGTF certificate requests for both the Harvard and UMass machines are also in progress.

      • 14:00
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU

        • New AlmaLinux9 osg-36 HTCondor-CE GK working well and stably
        • Last new compute nodes should be brought online this week
        • Brought up new 10 TB /xrd_test/ xrootd-cephfs test instance on se1 on port 64000
        • Initial tests successful, thanks to Wei and Hiro and Andy, but more testing needed to try to improve throughput
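        For the further throughput tests, a sketch along these lines could time xrdcp transfers against the new instance; the se1 hostname and the test file are placeholders, and only the port and mount point come from the item above.

```python
"""Sketch: time xrdcp transfers to the new xrootd-cephfs test instance.

Assumes xrdcp (from the xrootd-clients package) is installed and that the endpoint
and path below match the real se1 host; both are placeholders here.
"""
import subprocess
import time

ENDPOINT = "root://se1.example.ou.edu:64000"   # hypothetical FQDN; port from the test setup
REMOTE_DIR = "/xrd_test"


def timed_copy(local_file: str, size_gb: float) -> float:
    """Copy local_file to the test instance and return the achieved rate in Gb/s."""
    dest = f"{ENDPOINT}/{REMOTE_DIR}/throughput_test.dat"
    start = time.monotonic()
    subprocess.run(["xrdcp", "--force", local_file, dest], check=True)
    elapsed = time.monotonic() - start
    return size_gb * 8 / elapsed


if __name__ == "__main__":
    # e.g. a 4 GB test file created with: dd if=/dev/urandom of=test.dat bs=1M count=4096
    print(f"throughput: {timed_copy('test.dat', 4.0):.2f} Gb/s")
```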

        UTA

        • Network update underway
          • Will attempt to do two or three racks per day until the old network components are all replaced
          • Also performing updates to nodes as we go
        • Provisioning for the updated K8s cluster
          • New master node and compute node on CPB's network are in place
          • Need to get rucio mover working with stage-out to internal xrootd door before scaling up 
        • Scheduling a meeting with UTA Networking to discuss several items including getting the network monitoring in place

         

    • 14:05 14:10
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      NERSC

      • Moved Harvester into the 'workflow queue' - it now runs as a very long-running (30-day) job.
        • There may be bugs - still evaluating
      • NERSC_Perlmutter_GPU configuration merged into the GitHub repo for the Perlmutter Harvester config
        • Rui has tested it with a simple TensorFlow test end to end (https://bigpanda.cern.ch/job?pandaid=5799977799)
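      For context, the end-to-end check is of the kind sketched below; this is an illustrative stand-in, not the actual payload of the BigPanDA job linked above.

```python
"""Sketch: minimal TensorFlow GPU sanity check of the kind used to validate
NERSC_Perlmutter_GPU end to end. Illustrative only; not the actual test payload.
"""
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, visible GPUs: {gpus}")

# Run a small matrix multiplication on the first GPU (falls back to CPU if none).
device = "/GPU:0" if gpus else "/CPU:0"
with tf.device(device):
    a = tf.random.normal((2048, 2048))
    b = tf.random.normal((2048, 2048))
    c = tf.linalg.matmul(a, b)
print(f"matmul on {device}: result norm = {tf.norm(c).numpy():.3e}")
```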

      TACC

      • cvmfsexec is working well. Working on scaling up but stuck behind very long queue times.
      • Modified the Harvester code to not special-case log files. It may have helped; still checking files.
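      As a reminder of the pattern in use at TACC, a minimal cvmfsexec sketch follows; the repository list and payload command are examples, not the actual Harvester wrapper.

```python
"""Sketch: run a command under cvmfsexec (github.com/cvmfs/cvmfsexec) so the job
sees /cvmfs without a system-wide mount. The repositories and payload below are
examples, not the actual TACC/Harvester wrapper.
"""
import subprocess

REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]
PAYLOAD = ["ls", "/cvmfs/atlas.cern.ch"]

# cvmfsexec takes the repositories to mount, then `--`, then the command to run.
subprocess.run(["./cvmfsexec", *REPOS, "--", *PAYLOAD], check=True)
```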

       

    • 14:10 14:25
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:10
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Congratulations to Doug on being named AMG Analysis Workflow Containerization Activity Contact along with Matt Feickert
          • The goal is to put together infrastructure and examples of containers for analysis workflows for wide use in ATLAS. This effort connects AMG and other software groups with ADC. The focus is on analysis steps that could be carried out at a range of resources including lxplus/lxbatch or local analysis resources at various institutes and facilities (cloud, etc).
        • Useful talk on lxplus/lxbatch at last week's AF Forum
        • Pre-CHEP workshop timetable pretty much finalized
        • I will need help with the CHEP presentation; ATLAS seems to want it for review 30 days in advance
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • ServiceX deployment at CERN FABRIC site
          • Working with FABRIC team on getting external IPV6 networking service to work on the FABRIC slice
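          A quick way to test this from inside the slice is sketched below; the target host is just a well-known example, not part of the ServiceX deployment.

```python
"""Sketch: check that a node (e.g. on a FABRIC slice) has working outbound IPv6.
The target host is only an example of a dual-stacked service."""
import socket


def has_external_ipv6(host: str = "www.cern.ch", port: int = 443, timeout: float = 5.0) -> bool:
    """Try to open an IPv6 TCP connection to a well-known host."""
    try:
        addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)[0][4]
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(addr)
        return True
    except OSError:
        return False


print("external IPv6 reachable:", has_external_ipv6())
```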
    • 14:25 14:35
      AOB 10m