US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 1:00 PM 1:10 PM
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Proposed milestones to be added by COB Friday: https://docs.google.com/spreadsheets/d/1CF5nSKi2UWiiF4hJpLbJIba_A-2aM00jS14lDDFcplY/edit#gid=634097696

       - Note we need more "detailed" milestones in EACH L3 area to cover all of calendar year 2024

      Quarterly reports deadline is Friday.   All L3 WBS quarterlies should be in by COB today

      Working on scrubbing responses due ASAP.

    • 1:10 PM 1:20 PM
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • New osg-xrootd + xcache versions
      • HTCondor 10.0.6 in EL7 & EL8 release
      • HTCondor 10.6.0 in upcoming (EL7, EL8) and release (EL9)
      • NO XRootD 5.6.0 or 5.6.1: we caught issues in our integration testing

      OSG 23

    • 1:20 PM 1:40 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • QR complete
      • ANALY_BNL_VP queue maxWorkers doubled to 10000
        • Usage has remained below ~500 slots.  Multicore scheduling issue?
      • Issue with home disks filling up at OU
      • CVMFS squid failover issue at SLAC (GGUS) - Wei may have solved
      • 1:20 PM
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 1:25 PM
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches

        • running fine
        • some bypass at MWT2 and AGLT2 once traffic goes above 2TB/h

        VP

        • Next step of the Rucio integration is now in a PR.
        • Probability of a dataset having a virtual replica at BNL increased by a factor of 5. We will need to look at the VP queue CRIC settings to get it to continuously run more jobs.

        ServiceX

        • working fine on AF
        • more performance optimizations merged
        • Running fine on FAB. Getting all ServiceX images to ship with a special gai.conf.

        Analytics

        • all services work fine.
      • 1:30 PM
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
        • Understanding the performance of the new cluster and looking into the details wherever something doesn't look right. Overall, production is running fine.
        • Noticed a couple of drops in production level, but they do not appear to be specific to the K8s cluster and look to have been caused by overloaded storage servers.
        • With the new hardware, noticed that the nodes with more CPU cores (64/72/96) are overcommitting the node CPU. For the previous cluster I solved this by optimizing the job CPU request coefficient sent from Harvester. Need to look into this and probably readjust it.
        • Noticed that K8s was trying to schedule production jobs on the master node. A NoSchedule taint was in place initially but appears to have been lost at some point; it has been reinstated.
        • Working on reinstalling Prometheus on a dedicated node; next is setting up job accounting reporting.
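        The CPU overcommit point above can be illustrated with a rough sketch. It assumes, hypothetically, that the coefficient simply scales the CPU request of each job before the scheduler packs jobs onto a node; the function names, the 8-core job size, and the coefficient values are illustrative and not taken from the Harvester configuration.

```python
import math


def jobs_that_fit(node_cores: int, job_cores: int, request_coefficient: float) -> int:
    """Jobs the scheduler places when each job requests job_cores * coefficient CPUs."""
    requested_cpu = job_cores * request_coefficient
    return math.floor(node_cores / requested_cpu)


def overcommit_ratio(node_cores: int, job_cores: int, request_coefficient: float) -> float:
    """Cores the placed jobs could use at full load, divided by the physical cores."""
    jobs = jobs_that_fit(node_cores, job_cores, request_coefficient)
    return jobs * job_cores / node_cores


if __name__ == "__main__":
    # Hypothetical 8-core jobs on the node sizes mentioned in the notes.
    for cores in (64, 72, 96):
        for coeff in (1.0, 0.75, 0.5):
            print(cores, coeff, jobs_that_fit(cores, 8, coeff),
                  round(overcommit_ratio(cores, 8, coeff), 2))
```

        In this simplified model a coefficient below 1 lets more jobs land on a node than it has cores for, so the same coefficient can produce more absolute oversubscription on larger nodes.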
    • 1:40 PM 1:45 PM
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Working fine
      • Preparing answers to the post-scrubbing
      • Milestones and Risks have been provided
      • VP limit increased to 5,000 jobs (as discussed at OSG AHM) and reached 5,000 running jobs; the limit was then raised to 10,000, but the queue has never run more than 500 jobs since. To be understood.
      • Quarterly report published
    • 1:45 PM 2:05 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • The last 30 days had good running.
        • CPB got most of their FY22 compute online but I leave it to Patrick to describe the status.
        • NET2 is pretty close to being online but again I leave it to Eduardo to describe the status.
      • Working on info for scrubbing response.
      • Also doing the quarterly reporting in parallel.
      • Looked at the Tier 2 milestones; they match what I was aware of.
      • 1:45 PM
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        Incidents:

        We had two incidents with dCache. On July 8th, the PostgreSQL partition of the head node was filled by the billing database, and it took us over 24 hours over the weekend to recover. We are planning to rebuild an R6525 worker node with larger NVMe cards as the new head node to host a bigger PostgreSQL partition (6TB vs 1TB).

        The second incident was on July 19th: two dCache nodes had all their pools offline, which caused some transfer failures. Restarting the pools fixed the issue.

        System update:

        We updated HTCondor from 9.0.17 to 10.0.5, and took the opportunity to apply firmware and kernel updates that required a system reboot. We ran into token issues because in HTCondor 10 the default value of TRUST_DOMAIN changed to $(UID_DOMAIN), and the tokens used for daemon authentication must be signed with the same TRUST_DOMAIN. Our fix was to set TRUST_DOMAIN back to the old value.

         

         

      • 1:50 PM
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • Hardware issues on storage node at UC. Replaced controller and putting it back into service shortly.
        • UIUC preventative maintenance today (7/19). Will make sure nodes all come back online in production once it ends.
        • Planning to start setting up the WLCG SOC network monitoring hardware next week. Minor disruptions in the network could occur, but no downtime should be needed.
        • Building our first set of el9 (AlmaLinux9) worker nodes at IU. Have one in production at UC at the moment and seems to be OK.
        • UIUC compute has mostly come in (waiting on a couple chassis). Looking to install what has arrived by the end of the PM, but may have to wait a little longer.
      • 1:55 PM
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

        Storage
        Load tests
        - Manual transfers from lxplus and the CERN FTS service worked.
        - Transfers from the BNL FTS service revealed some SSL issues related to the use of SHA1 for signing.
        - Tested an experimental package made by the OSG team; it had no effect.
        - A second problem, related to the FTS transfer mode (pull), was revealed when transferring from BNL. It was working from CERN because streaming mode was allowed there. The setting webdav.authn.require-client-cert true was preventing HTTP-TPC from working.
        - With FTS transfers working correctly, we were able to saturate our network link. Both WebDAV and XRootD were tested.
        - We are talking with Fabio about publishing our storage endpoints:
          webdav.data.net2.mghpcc.org
          xrootd.data.net2.mghpcc.org

         

        Openshift

        - Progressing; configuring X509 credentials for Kubernetes cluster access

        - Many problems due to the dual stack setup (network policy controllers not working, Security Context Constraints not working)

      • 2:00 PM
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA

        • Completed installation of available new machines
          • A few machines need repair
          • More than 20,500 cores, additional 450 in repair, K8 cluster using 1,000
        • Mostly have balanced power and cooling in the data center
          • Have deployed one new rack and preparing to deploy another for additional space
          • Further work should be invisible as we move machines in groups of one or two at a time to new rack
        • Looking at replacing admin node in cluster

        OU

        • Completed installation of new machines; now 5300 slots plus opportunistic OSCER nodes
        • Ordered 3 more R6525, expected to arrive soon
        • Have installed slate01.oscer.ou.edu with RockyLinux 9.2, in the process of configuring it
        • OSCER maintenance today: upgrading SLURM from v20 to v23 (or v22, if there are issues with v23)

         

    • 2:05 PM 2:10 PM
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
    • 2:10 PM 2:25 PM
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:10 PM
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • QR complete
        • CHEP paper in progress - need help from authors
        • Container development work ongoing (Shuwei), discussed at last week's 2.3/5 meeting
        • What should the procedure be for announcing downtimes?
      • 2:15 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:20 PM
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Downtime coming up next week for various upgrades (firmware, OS, Kubernetes, rook-ceph)
        • ServiceX on FABRIC
          • Slice stability issue (VMs disappear) raised with the FABRIC team. It seems to be a known issue for which they will deploy a fix.
          • Should have found a solution for IPv6 preference (gai.conf to set the preference; the default config prefers IPv4 over private IPv6)
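          The IPv6 preference point can be sketched with a simplified model of the precedence step in destination address selection. The table below is the RFC 6724 default policy table, used here as an assumption; the table actually in effect on the ServiceX images may differ, and the override value of 45 is hypothetical, not the value used in the team's gai.conf.

```python
import ipaddress

# RFC 6724 default policy table: (prefix, precedence). Assumed for illustration.
RFC6724_DEFAULT = [
    ("::1/128", 50),
    ("::/0", 40),
    ("::ffff:0:0/96", 35),   # IPv4-mapped addresses, i.e. all IPv4 destinations
    ("2002::/16", 30),
    ("2001::/32", 5),
    ("fc00::/7", 3),         # unique local (private) IPv6
    ("::/96", 1),
    ("fec0::/10", 1),
    ("3ffe::/16", 1),
]


def _as_v6(addr: str) -> ipaddress.IPv6Address:
    """Represent IPv4 addresses as IPv4-mapped IPv6 so one table covers both families."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return ipaddress.IPv6Address("::ffff:" + str(ip))
    return ip


def precedence(addr: str, table) -> int:
    """Precedence of the longest matching prefix in the policy table."""
    ip = _as_v6(addr)
    best_len, best_prec = -1, 0
    for prefix, prec in table:
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_len, best_prec = net.prefixlen, prec
    return best_prec


def sort_destinations(addrs, table=RFC6724_DEFAULT):
    """Order candidate destinations by descending precedence (stable for ties)."""
    return sorted(addrs, key=lambda a: -precedence(a, table))


if __name__ == "__main__":
    candidates = ["10.0.0.5", "fd00::5"]
    print(sort_destinations(candidates))
    # Hypothetical override raising unique local IPv6 above IPv4-mapped addresses.
    override = [(p, 45 if p == "fc00::/7" else v) for p, v in RFC6724_DEFAULT]
    print(sort_destinations(candidates, override))
```

          This models only the precedence rule; the real selection algorithm also considers scope, labels, and reachability before precedence is applied.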
    • 2:25 PM 2:35 PM
      AOB 10m