US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.4.30-2 (released May 30)

      New IGTF CA and VO client updates

      OSG 3.4.31

      • Singularity 3.2.1 in the OSG mainline release
      • xrootd-scitokens 1.0.0
      • HTCondor 8.8.3 (upcoming)
      • VO client update adding the sPhenix VO

      OSG 3.4.32

      XCache 1.1, including an RPM for ATLAS XCache. The PR here could use some review of the XRootD configuration.

    • 13:20 14:00
      Topical Report
      • 13:20
        TBN 15m
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))

        DNS issue over the weekend

        • The BNL campus-wide DNS servers were updated on Saturday and stopped receiving requests from the SDCC DNS caching servers. Both the computing and storage services of the T1 facility were affected, including the monitoring infrastructure. The T1 computing farm drained and transfers failed during the outage, and monitoring alerts could not be sent out either. The problem was fixed early Saturday afternoon. We will evaluate how to strengthen the reliability of the external services we rely on and how to enhance the monitoring system.
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Hardware

        Replaced the 2 cables connecting 2 major switches (shinano); they are duplicate links. One of the links still shows 0.21% error packets.

        Services

        Renewing our host certificates with InCommon IGTF certificates: half done, half in progress.

        No tickets for the sites; services are running well. T2 jobs have a high utilization rate in HTCondor, delivering 11.5K cores on average, up from 10.8K cores on average.

        Unscheduled at-risk downtime for 2 hours during the network maintenance work.

      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Upgraded the Elasticsearch cluster to 6.8 and added X-Pack security configuration for both Elasticsearch and Kibana.

        In the process of upgrading perfSONAR nodes. UC is done; UIUC and IU are still in progress and should be done this week.

        Working on adding IPv6 to our dCache nodes.

        Added ANALY_MWT2_GPU queue for GPU testing.

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Had a problem with the gridftp endpoints being overloaded while rucioSiteMover was running

        Added 4 gridftp endpoints

        Switched from LSM to rucio-mover (LSM kept as backup)

        SL7 transition on all worker nodes done

        HC jobs succeeded, waiting for ADC to ramp us up

        Obsolete tickets

        Preparing for NESE testing

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - all sites running well

        - Lucille in long downtime for EL7 upgrade; may or may not come back to ATLAS production

        - Switched OU sites over to InCommon host certs

        UTA:

        - Migration to CentOS7 ongoing. Start of downtime Monday, 6/10. 

        - perfSONAR hosts upgraded to CentOS7 / pS v4.1.6-1.el7

        - test hosts will be updated by Friday

        - SLATE node PO submitted; awaiting delivery

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        Theta remains offline until we can get an official ATLAS container that does not require patches to AthenaMP that live outside of the container. We have three patches that need to be added, and one of them needs to be created before SVN access is cut off. This is for release 21.0.15.

        Cori is running again, with 3 tasks: one has finished running jumbo jobs and the other two are still running.

        We are using the FLEX quality of service at NERSC. It means that jobs run at a discount but also at lower priority.

        OLCF Titan is running the last 10 Mevents in backfill mode.

        See attached image.

      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 14:30
      AOB 5m