US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Name
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.8 + 3.4.42

      Next week at the earliest

      • HTCondor 8.8.7
      • HTCondor 8.9.5 (osg-upcoming)
      • XRootD 4.11.1
      • osg-xrootd-standalone with TPC and HTTP/S support by default
      • 3.5-only: GridFTP 13.20 (OSG-specific patches moved to osg-gridftp)

      Other

      Working with WLCG IAM folks to request WLCG tokens for testing HTCondor-CE and job submission via Harvester

    • 13:20 13:35
      Topical Report
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))

      Happy New Year! 

      No major issues over the holiday break. 

    • 13:40 14:00
      Tier2 Centers
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

            Tickets:  
              - currently none
              - since last meeting: 4 tickets, all closed/solved now, caused by short network issues at UM
            Operation:
              - switch problems at UM
                adding a new switch to upgrade the T3-T2 link from 40 Gbps to 100 Gbps
                caused some spanning tree and management interface access issues
                current status: seems solved after a second firmware update
              - currently evolving issue with 1 of 2 Liebert units at UM.  
                Currently operating at reduced but sufficient capacity.
                Repair will need either an 8h partial downtime or waiting for the 3rd planned unit to come online.
              - as usual misc memory and disk issues for hardware under warranty or self-supported
            New hardware:
              - last of the new R740XD2 dCache servers almost online
                still fighting with MSU IT automated SSL certificate issuance to get an IGTF-signed cert.
              - will allow us to retire the 4 oldest MSU dCache servers
                and free up one MD3260 for spares on self-supported medium-old storage

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Apologies if nobody is present. Judith is OOTO and I am working on network equipment to bring up new purchases.

        -David

        GGUS Ticket #144542: Stage-in issues. Getting little help from CERN with debugging. As of the last update, Judith removed the secondary lsm mover from the production queue, as requested.

        UC:

        • New Workers are all up and added to the pool.
        • A few new storage nodes added to pool, waiting on network equipment updates to finish the rest
        • Working on getting network equipment online for the rest of the new purchases

        IU

        • New workers added to pool
        • New SLATE node is set up

        UIUC

        • Purchases have been made. Still waiting on their arrival.
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Relatively smooth operations over the break, although some of our C6100 workers are starting to die for various hardware reasons.  We'll be investigating.

        Low-level DDM issue resolved by migrating away from "Let's Encrypt" host certs for NET2 and NESE gridftp endpoints. 

        NESE DDM started over the break with containers running on NESE gateways.  (NESE_DATADISK).  Performance looks good so far.  Adding gridftp endpoints and operations infrastructure.  

        NET2 storage for DELL has arrived.  One of the r740xd2's will be grabbed for a SLATE node.  We'll be in touch with the SLATE team as soon as that's up and running.  Need to expand UPS to three new racks for this.  Management switches still haven't arrived, but everything else is at Holyoke. 

        Still need to make a plan for IPv6.  We see that about 50% of DDM sites have IPv6 addresses now.  
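        As a possible starting point for that plan, here is a minimal sketch of how one might check which endpoint hostnames resolve to an IPv6 address. The function names and the example hostnames are hypothetical, not part of any NET2 or DDM tooling. The resolver is injectable so the logic can be tried without network access. Resolving to an IPv6 address only shows that an AAAA record exists, not that transfers over IPv6 actually work.

```python
import socket


def has_ipv6(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one IPv6 address."""
    try:
        # Restrict the lookup to the IPv6 address family.
        results = resolver(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        # Name does not resolve for IPv6 (or does not resolve at all).
        return False
    return len(results) > 0


def ipv6_fraction(hostnames, resolver=socket.getaddrinfo):
    """Fraction of hostnames with an IPv6 address (0.0 for an empty list)."""
    hosts = list(hostnames)
    if not hosts:
        return 0.0
    return sum(has_ipv6(h, resolver) for h in hosts) / len(hosts)
```

        For example, ipv6_fraction(["se1.example.org", "se2.example.org"]) would give the share of those two placeholder hosts that have IPv6 addresses.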

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2

        • Had a downtime due to relocating network equipment in facility.

        SWT2_CPB

        • Had an issue related to XRootD configurations that limited selection of hosts.

        OU:

        - Nothing to report, all running well.

         

    • 14:00 14:05
      HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))

      In the last 30 days at NERSC we produced 25.5 million events, using 8.5 M NERSC hours.
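      For a rough sense of throughput, the two figures above can be combined. This is only a back-of-the-envelope sketch: it assumes all 8.5 M NERSC hours went into producing the 25.5 million events, which the report does not state explicitly. The variable names are illustrative.

```python
# Figures quoted in the report for the last 30 days at NERSC.
events = 25.5e6        # events produced
nersc_hours = 8.5e6    # NERSC hours charged

# Assumption: all charged hours went into this event production.
events_per_hour = events / nersc_hours
print(f"{events_per_hour:.1f} events per NERSC hour")
```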

      Very bursty usage. About once per week we get up to > 2k nodes (almost 300K cores) for a short time. Currently running with a modified pilot (will want to switch over to the new pilot in the next allocation cycle). 

      The 2019 ERCAP allocation ends Jan 14, 2020 07:00 PST.  We will have some hours left over.  Cori downtime will be from Jan 14, 2020 07:00 PST to Jan 15, 2020 07:00 PST. After this downtime Python 2 will not be supported.

      On 20-Dec-2019, Lincoln, Marc and DB worked together at Univ of Chicago to produce a Docker container to run Harvester on the edge.  This will be useful for the OLCF-Slate instance.

       

       

       

    • 14:05 14:20
      Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
    • 14:20 14:40
      Continuous Operations
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:40 14:45
      AOB 5m