US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Description: Bi-weekly Facilities meeting
Extension: 109263008
Owner: Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      WLCG related:

       

      Facility related meetings

      • US ATLAS Facility meeting co-located with OSG All Hands
        • March 16-19, 2020
        • University of Oklahoma, Norman, OK (Horst hosting)
      • We would like to hold a Kubernetes training event for site operators, either before this meeting or co-located with it (TBD).

       

      Facility milestones

      • In the CIOPS area, in the next quarter we would like to focus attention on two deliverables:
      1. A federated-ops Frontier-Squid infrastructure
      2. An analysis caching demonstrator
      • Details to be defined

       

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.4.41/3.5.7

      Targeted for this week or next:

      Is anyone using the rolling release repository? https://opensciencegrid.org/docs/release/notes/

      OSG 3.4.42/3.5.8

      Targeted for Jan 2020

      • XRootD 4.11.1
      • XRootD 5.0.0 (upcoming)
      • HTCondor 8.9.5 (upcoming)
      • Singularity 3.5.2 (OSG 3.4)
      • Enable TPC for osg-xrootd-standalone and macaroons for XCache/osg-xrootd-standalone by default
      • Disabling insecure ciphers in VOMS server
      • Dropping and/or moving OSG patches for remaining Globus packages upstream and to OSG metapackages
    • 13:20 13:35
      Topical Report
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        Status update on SWT2 12m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
    • 13:35 13:40
      Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Some critical DNS alias records were mistakenly deleted yesterday by ITD, causing requests to the CVMFS and Frontier squids to fail. They were recovered shortly afterwards.
      • A separate squid proxy has been added to the AGIS configuration for BNL; it will be used to download user images from the CERN central registry.
    • 13:40 14:00
      Tier2 Centers
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        service

        One open ticket about failed jobs at UCORE, due to the rucio copytool.

        hardware

        All dCache storage nodes scheduled for retirement at the UM site have finished migrating their data to the new storage nodes.
      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        HTCondor-CE upgraded sitewide to 4.0.1-1

        UC

        • Three new GPU nodes added to the ML platform
        • Storage, compute, and analytics nodes built, waiting on network cables

        IU

        • Edge node built and registered in SLATE
        • New compute nodes racked, in the process of being built

        UIUC

        • POs submitted for new compute and edge node
        • IPv6 testing in progress; estimated end date for all of the UIUC IPv6 services Feb 2020
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        SWT2_CPB:

        A mount from an MD3XXX is problematic; we will try to recover it today, or else declare ~30 TB lost.

        Squids have been updated to latest version from OSG.

         

        UTA_SWT2:

        A campus network disruption isolated the cluster for half of yesterday.

        Squids have been updated to latest version from OSG.

         

        OU:

        OSCER maintenance today.

        Other than that, no issues.

         

    • 14:00 14:05
      HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))

      Testing new development pilot and new Singularity container at BNL KNL Cluster.

      The Harvester instance (python v2) is currently running two test jobs. Will likely need to fix the stage out.

      Since Cori came back from the 12/5-12/6 shutdown, NERSC has run only 6580 jobs (2.942 M events). In the 6 days prior to that, NERSC processed 12.25 M events (22K jobs).

      We have used over 101 M NERSC hours out of an initial allocation of 122 M hours. Because of Jumbo Job running, we have been charged only 88 M hours and have 38 M hours remaining.

      We might not use all of our time by 10-Jan.

      Deploying Harvester at Stampede2, Frontera:

      • Implementation details updated: https://docs.google.com/document/d/14eNw-3moIwC41lHOJ5Kfg90JliVHLiX9LCOvx1gjEOA/edit#heading=h.clckwez0g7jd
      • Created new OpenStack VM from scratch, installed and configured CVMFS, HTCondor, VOMS, and Harvester
      • New VM can submit successful jobs to hosted CE
      • Still having cert issues; tried a couple of different CA certs, still debugging
      • Can we use Midway as target of HTCondor-CE?
        • Probably can’t get around 2FA, but may be able to test single jobs
    • 14:05 14:20
      Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
        • Fixed HTCondor and resumed Panda production and Panda analysis
        • Switched from LSM to the Rucio mover. The Rucio mover prefers to use GridFTP for uploads; Martin agreed to change this.
        • UCSC wants to produce 300M simulation at SLAC. 
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Three new GPU nodes are installed, up and running in our ML platform.

        Two nodes have 8 x 2080Ti each and one node has 4 x V100.

        108 registered users of the platform. Most people sign in with their CERN account as that's easiest. 

    • 14:20 14:40
      Continuous Operations
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        There has been a lot of discussion on how to change the perfSONAR data ingest to ease data analysis. We are now changing the indexing. Once that is done, we will have to replay the raw data from tape. This will take significant time, as replaying one day of data takes a few hours. All other platforms are working fine. There are smaller issues with ES (a dead disk). Soon we will add 4 more data nodes and upgrade ES to 7.5.

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        A lot of debugging of an xcache issue where the xcache server "forgets" its proxy and can't authenticate against origin servers. The issue does not appear to be related to network state or load on the node.

        It has been reproduced many times, and Andy is looking at the very detailed logs. A lot of analysis jobs are failing at MWT2 for this reason. Will forward the mail thread to Wei.

    • 14:40 14:45
      AOB 5m