US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 - 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 - 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.33 (this week)

      • XRootD 5.1.1 (upcoming)
        • Known issues with xrootd-multiuser and a memory leak detected at SLAC and UNL
        • Includes updated Rucio and SciTokens plugins

      OSG 3.6/3.5.34 (next week?)

      • HTCondor 8.8.13
      • gratia-probe 1.23.2
      • CVMFS 2.8.1
    • 13:20 - 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 - 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Per ADC ops suggestion, increased the maxrss limit from 16 GB to 32 GB for the UCORE PQ, so there is no gap in covering himem jobs.
      • For dCache high availability, and in response to the dCache outage on 3/11 (when the dCache admin service was overloaded by massive staging requests), multiple instances of several critical dCache services have been deployed: 3 PnfsManager, 2 PinManager, 2 gPlazma and 2 SpaceManager.
    • 13:40 - 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Smoothish running
        • New Juniper gear at the IU part of MWT2 caused several short HammerCloud outages, and a bad server at UIUC took MWT2 offline last night.
        • Short power failure at NET2.
        • CPB tried turning their machine room into a swimming pool twice.
      • Dave Dykstra at the Frontier Squid meeting pointed out that we need to do some tuning of our squids. I've asked Lincoln to summarize it for us.
        • We did meet the milestone due today of having the SLATE-deployed Frontier Squid in production at the Tier 3 sites.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

          Quiet 2 weeks.

          Services:
            dCache updated from 6.2.10 to 6.2.17. No downtime, no issues.

          Maintenance:
            UM retiring C6420 worker nodes and rebuilding with bigger /tmp for BOINC jobs.
            MSU used the dCache update of pool nodes to catch up on all OS and firmware updates.

          Infrastructure:
            UM received all Dell rack switches. All powered on in a temporary location.
            Using/learning Ansible to provision/update OS and configure/manage.
            Accidental damage to 1 data and 1 mgmt switch. Trying to get them replaced.

         

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        dCache upgraded to 6.2.16. Site filled slowly after we came back online.

        Two IU switches replaced. Working with IU networking to debug network issues that have arisen since the replacement.

        ElasticSearch upgraded to 7.11

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        A power failure interrupted NET2 operations on 2021-03-29 for an hour or two; it then took hours of HC jobs for the site to ramp back up. NESE storage is on UPS and was unaffected.

        NESE endpoints getting a bit saturated. Will add a couple of new nodes.

        Upgrading to CentOS 7.9.

        XRootD 5.1 testing progressing.

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, running well.

        - In the data center today, racking new equipment, so I have to miss today's meeting, sorry.

         

        UTA:

        - Water leak issue in the machine room - repairs in progress

        - When dry, systems running o.k.

         

    • 14:00 - 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant

      NERSC is humming along.

      TACC is quiet.

      TACC has switched over to using BNL Globus endpoint.

      BNL Globus endpoint had an issue with a JBOD that was identified and fixed by power cycling it.

      Need to attach WebDAV to BNLHPC_DATADISK and BNLHPC_SCRATCHDISK so we can move away from GridFTP.

       

    • 14:05 - 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Nothing to report

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        All working fine.

    • 14:20 - 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • Confusion persists re: declaring downtimes, cf. SWT2 downtime not blacklisting storage
      • Interesting update from Wei at today's WLCG DOMA-TPC meeting about XRootD 5.1.1 libcurl memory leak issues
      • Still investigating issues that prevent BNL HTTP-TPC tape endpoint from working as a source with the FTS DOMA instance (Petr Vokac)
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        * Discovered that a large percentage of storage was filled with dark data. Not sure what caused it; investigating with Andy and Matevz. Wrote code to automatically clean it up at startup.

        * Discovered that some connections are not closed. Made k8s service changes; monitoring.

        * moved to 5.1.1

        VP

        * Waiting for changes in Rucio to go into production

        Squids

        * covered by Lincoln

    • 14:40 - 14:45
      AOB 5m