US ATLAS Computing Facility

US/Eastern

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      • blahp (3.5 upcoming, 3.6): fixes for LSF, PBS, and Slurm
      • osg-token-renewer: package for GlideinWMS frontends + Harvester to manage token renewal from OIDC providers like IAM

      Other

      Register and attend the token transition workshop! https://indico.fnal.gov/event/50597/overview

      • Hear details about the OSG plans and timeline
      • Learn where other VOs are in their transition
      • Start receiving token-based pilots at your CE during the Technical Working Sessions (if you've updated to HTCondor-CE 5 + HTCondor 9)
      • Learn about bearer token technology and how to request your own bearer tokens
      • Participate in the Open Discussion and Policy Working Session to discuss timelines and concerns
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Good running
        • MWT2 had a downtime, and the IU servers are still down.
      • XRootD still has issues. Could the affected sites report on the status?
      • Still having significant trouble getting a clear story on delivery from Dell.
      • IPv6: could sites not yet supporting IPv6 report on their progress?
      • SRR using dCache is now working at BNL; we need to get it going at AGLT2 and NET2.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        Data Challenge:

          No problem

        Ongoing IPv6 issues with new UM networking:

          17 of the previous 31 worker nodes have now recovered from IPv6 issues and are back in Condor
          (after migrating the gateway from shinano to the new Cisco pair).
          The other 14 nodes still fail to reach some internal IPv6 addresses (i.e. they still run only BOINC).
          The UM network team has been helping and has now opened a support ticket with Cisco.

        MSU migration to campus data center:

          Signing a formal agreement, which includes that we will not be charged for space or power.

        Purchase Status:

          R740xd2 storage nodes are arriving this week at UM and next week at MSU.
          This is two months earlier than the early-December estimate at PO time
          and one month earlier than the sales rep's mid-November estimate at pre-order time.

          The rest is still shown as mid-January on the Dell order support website.

         

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Site downtime yesterday

        • Updated dCache to 6.2.30
        • CRAC maintenance at UChicago

        Migrating Elasticsearch to the new data center

        UIUC - Q4 planned maintenance next Wednesday (Oct 20, 2021)

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Switched live to XRootD 5.3.0/WebDAV for both GPFS and NESE Ceph storage for the Data Challenge, as promised to ADC.

        Bumpy first week, but smooth operations by the end of the week.

        Concentrating on NESE Tape now.  

        Planning for new workers, networking upgrades, more NESE Ceph storage...

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        Waiting on DDM-ops to remove the last remaining data from the DATADISK area (~60 TB).

        SWT2_CPB:

        • Setting up a meeting with OIT this week to go over the latest IPv6 updates. We expect to at least set up the perfSONAR machines with IPv6 next week.
        • The WebDAV door has had problems; working with Wei and Andy to debug them.
        • I/O loads are an issue: GridFTP, which is normally not used much, was being used along with WebDAV during last week's network data challenge.

         

        OU:

        - XRootD (5.3.1) on the proxy gateway (se1.oscer.ou.edu) kept hanging. Increased the stack size (xrd.sched stksz 4m) and the limits in

        /etc/security/limits.d/20-nfile.conf
        /etc/security/limits.d/20-nproc.conf

        and switched to jemalloc. That seems to have stabilized it for the moment. Still waiting for 5.3.2.

        - Kept seeing ES job stage-in failures, but we don't see anything wrong here. If there were problems locally, non-ES stage-ins would presumably fail, too.

        The issue seems to have gone away again now, so it was probably on the other end.

        - write_lan is still not working in the rucio copy tool, causing failover to write_wan and unnecessary load on the proxy. Waiting for a Rucio fix.

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
      • TACC - 
        • Spoke to TACC consultants about the ongoing queue time problems. 2 factors:
          • Special "Gordon Bell" tasks prioritized in the queue for the last few weeks. These jobs are nearing completion.
          • We are projected to use up 123% of our allocation, so the scheduler has correspondingly down-prioritized us.
      • NERSC -
        • Power outage over the weekend brought down the workers + Harvester. Restarted now.
        • Initial Perlmutter setup work ongoing
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Smooth operations
        • Working with ServiceX developers (Ben Galewsky) to test at SDCC
        • Status of RBT?
        • Interesting Accelerator Forum today
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

        David: Nothing new. Prepping for bootcamp.

        Ilija:

        • ML platform - all working fine.
        • ServiceX
          • two instances up and running (uproot and xaod)
          • stress tested
          • a lot of development 
    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Quarterly reports!
      • Network data challenge successful - latest summary of results from today's GDB
      • Tape challenge is ongoing, in phase 2 A-DT mode
      • Xin has deployed and tested HTCondor-CE with token auth at BNL
        • Some important testing notes re: fallback to X509
      • OSG has draft of documentation for XRootD HTTP-TPC endpoint deployment (thanks Brian!)
      • Working on BNL dCache SRR reporting issues and reconfiguration of LAKE endpoints in CRIC
        • Tape downtime for next week will test new Topology/CRIC configuration
      • F/S DevOps:  GitOps alerts now going to FederatedOps email list - working on improving the content
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Analytics platform

        • ES working fine. 
        • Almost all the analytics services are now auto-rebuilt and auto-deployed using FluxCD.
        • Discussing Fluentd pilot monitoring
        • Development ongoing on Rucio traces analysis and alerting

        XCache

        • all working fine

        VP

        • all working fine

        ServiceX

        • a lot of development (tuning, debugging)
      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        UTA_SWT2 decommissioning is nearing completion; the hardware will be used for a Kubernetes cluster at CPB (see Mark's report).

    • 14:55 15:05
      AOB 10m