US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 1:00 PM 1:10 PM
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)) , Dr Shawn Mc Kee (University of Michigan (US))
    • 1:10 PM 1:20 PM
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin) , Matyas Selmeci

      Release

      • XRootD 5.3.1 (including async.io fix)
      • gratia-probe 1.24.0 (some bug fixes for condor batch)

      Misc

      Add SRM tape vs disk service types in Topology: https://opensciencegrid.atlassian.net/browse/SOFTWARE-4732

    • 1:20 PM 1:35 PM
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 1:20 PM
        TBD 10m
    • 1:35 PM 1:40 PM
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)) , Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Condor pool happily full (except for a brief outage, see below)
      • In the middle of the HPSS upgrade; will know tomorrow whether the downtime needs to be extended
        • Need to work with OSG so that a tape-only downtime does not take all other storage offline
      • Increased the amount of space in the dCache Tape staging pools to account for increasing use of Data Carousel.
        • Need better Data Carousel monitoring to understand why some transfer tasks do not finish within 1 week.
      • Before the HPSS downtime, copied 210k DAOD and AOD files off tape where BNLLAKE held the only copy
        • Data was copied from BNLLAKE to BNL-OSG2_DATADISK
        • 16864 DAOD and 193889 AOD files remain to be copied off tape onto disk; this will be done when
      • Reconfigured BNLLAKE and DATATAPE/MCTAPE pools to increase DATATAPE/MCTAPE pool to 1.8 PB
      • Preparatory work done to split the tape file families in two for BNLLAKE: one file family associated with the local group disk and one with the data disk.
      • Setting up test stand for SRM + HTTPS tests.
        • Once initial tests succeed, plan to use SRM + HTTPS for BNLLAKE and treat it like a tape endpoint until Rucio QoS goes into production sometime in FY22
    • 1:40 PM 2:00 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonably good running; one CERN FTS issue affected most of the grid on 7/26-7/27.
      • Lots of work preparing for the FY21 purchases.
      • 1:40 PM
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)) , Dr Shawn Mc Kee (University of Michigan (US)) , Prof. Wenjing Wu (University of Michigan)

        1) Updated to the new CentOS7 security kernel 1160.36,
        updated firmware on all nodes and rebooted them into the new kernel.
        Also in the process of rebuilding all remaining SL7 WNs to CentOS7, for uniformity.

        2) Did the 2 condor security updates (8.8.13->8.8.14->8.8.15)
        and updated the gatekeepers' condor-ce to 4.5.24.

        3) Since a reboot was needed for the new kernel, also updated dCache from 6.2.23 to 6.2.25;
        smooth update (also FW/BIOS).

        5) MSU site is moving the last batch of WNs to the Data Center today. 
        All nodes moved, powered, currently getting connected. Will be done by end of day.

        6) Still have IPv6 issues at UM. We see them happening on the data switches too.

        7) Had one instance of job draining because pending transfer jobs exceeded 4000.

        8) Adjusted space tokens in dCache; the result is an increase of AGLT2DATADISK by 290 TB.

      • 1:45 PM
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)) , Jess Haney (Univ. Illinois at Urbana Champaign (US)) , Judith Lorraine Stephen (University of Chicago (US))

        UC

        • Power outage at UC took out some of our equipment, including our 40kW UPS. Running in bypass mode for now while we relocate datacenters
        • More relocation hardware received. David is working on setup so that we can start data migrations
        • Network downtime scheduled for Aug 16. Need to schedule in CRIC

        IU

        • Nothing notable to report

        UIUC

        • SLATE node received and in the process of being built
        • perfSONAR nodes purchased, estimated delivery Aug 16
      • 1:50 PM
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        0 GGUS tickets 

        1 HC bump yesterday: stage-in timeouts

        MGHPCC annual maintenance down day coming up: August 9

        XRootD containers working at BU; HTTP-TPC tests working with local adler callout. Some deletion errors, which will probably disappear when we expand. Next steps:

              Expand atlas-xrootd.bu.edu to all current gridftp endpoints

              Do likewise for NESE storage endpoints (NESE_DATADISK, NESE_SCRATCHDISK)

              Do likewise for NESE Tape endpoints

        Need to update OIM with Mark

        16 DTN endpoints arrived at NESE, racked and cabled. 

        IPv6 set up on perfSONAR nodes. We still need to test, then expand.

        Preparing for major worker node purchase ASAP

        Start planning for NESE Ceph storage purchase in the Fall

        UMass joining NET2: new person to help with day-to-day operations; expand into UMass space at MGHPCC; collaborate with BU, Harvard, and UMass people on a large shared pool of worker nodes, roughly along the lines of the shared-storage NESE project.

      • 1:55 PM
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)) , Mark Sosebee (University of Texas at Arlington (US)) , Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Running well

        - Had CVMFS problems on one worker node, which caused strange Rucio errors. A reboot fixed that.

        - Now using HTTP-TPC in production, seems to run well.

        UTA:

        - Two site issues:

            (i) squid outage on 7/24

            (ii) site drained on 7/30

        - Equipment from the most recent purchase is arriving. Will need to schedule a downtime for the installation of the LAN revamp.

        - Ongoing work / testing:

            (i) IPv6

            (ii) xrootd-TPC

    • 2:00 PM 2:05 PM
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))
      • TACC:
        • We have used just over 50% of our TACC allocation.
        • I messed up my proxy on 7/27, which led to the large number of failures; otherwise TACC is running fine.
      • NERSC:
        • Generally running well.
        • Only 3.5M hours remaining at NERSC. Do we need to dial it back more? 
    • 2:05 PM 2:20 PM
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:05 PM
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 2:10 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:15 PM
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)) , Ilija Vukotic (University of Chicago (US))
        • Working on backend and getting ready for actual users with the team.
        • Added a couple of pre-alpha users to poke around internally.
        • Waiting for our permanent networking to be configured and connected. Hopefully this will be done in the next couple of weeks.
        • Dell GPU we purchased is slated to arrive in November due to shortages.
        • The "old" ML platform works fine. There were two instances of users hoarding GPUs.
    • 2:20 PM 2:40 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Discussing adding a new OSG-CRIC mapping to distinguish between tape and disk SE downtimes
      • Attempting to avoid site job starvation when FTS problems cause a backlog of jobs transferring out
        • Updated PQ transferringlimit parameter to 5000 at MWT2, 3000 at other T2s (OU still default). BNL is 8000. BU at 20000?
      • Enabling IPv6 on BNL gridftp and davs doors
      • HTTP-TPC: OU looks good in Paul's smoke tests. BU still failing? What is the status of CPB?
      • Would like to deploy SLATE Squid at OU to avoid reliance on the network connection to UTA (Frontier-Squid seg faulting on failed DNS lookup).
      • BNL VP queue re-enabled with new gStream monitoring
        • Activity on disk cache still seems low - investigating
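
The transferringlimit change noted above can be pictured with a minimal Python sketch. It assumes that brokerage stops sending new jobs to a queue once the number of its jobs in the transferring state reaches that queue's transferringlimit; the exact comparison is an assumption, and the function and dictionary names are hypothetical rather than PanDA or CRIC calls. Only the numeric limits come from the notes. BU is left out because its value is marked as uncertain, and OU is left out because it stays at an unspecified default, so the fallback value below does not describe either site.

```python
# Illustrative only: names are hypothetical, not PanDA or CRIC API calls.
# Limits as quoted in the notes; "other T2s" share a single value.
TRANSFERRING_LIMIT = {"MWT2": 5000, "BNL": 8000}
OTHER_T2_LIMIT = 3000


def transferring_limit(site: str) -> int:
    """Return the quoted limit for a site, falling back to the other-T2 value."""
    return TRANSFERRING_LIMIT.get(site, OTHER_T2_LIMIT)


def is_throttled(site: str, transferring_jobs: int) -> bool:
    """Assumed rule: no new jobs are brokered once the backlog reaches the limit."""
    return transferring_jobs >= transferring_limit(site)


if __name__ == "__main__":
    # Hypothetical backlog values, chosen only to show both outcomes.
    for site, backlog in [("MWT2", 4200), ("AGLT2", 4200), ("BNL", 8000)]:
        state = "throttled" if is_throttled(site, backlog) else "accepting jobs"
        print(f"{site}: {backlog} transferring -> {state}")
```

Under this assumed rule, a higher limit lets a queue absorb a larger transfer backlog before it starts to drain.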
      • 2:20 PM
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)) , Xin Zhao (Brookhaven National Laboratory (US))
      • 2:25 PM
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)) , Robert William Gardner Jr (University of Chicago (US))

        Back from vacation.

        Found that everything was working fine without me.

        XCaches - all running fine.

        VP - all fine. The ANALY_BNL_VP queue was put online yesterday. It will take quite some time to fill the cache with only 100 workers.

        This week testing Rucio changes related to replica ordering based on GeoIP.

         

      • 2:30 PM
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Nothing new since what we discussed last time.

    • 2:40 PM 2:45 PM
      AOB 5m