US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Topical presentations, https://docs.google.com/document/d/1NIc67p3AB2RkYjJsP6Nx_lwPXFX03w1n2SFOgCU47ro/edit

      Reminder to update http://bit.ly/usatlas-capacity with new procurements and to inform Shawn.

      Meetings/workshop at FNAL next week:

      - GDB (9/10-11): https://indico.fnal.gov/event/21232/

      - pre-GDB (9/1): https://indico.cern.ch/event/739896/

      - FIM4R (9/12): https://indico.cern.ch/event/739896/

       

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5

      OSG 3.4

      ATLAS XCache

      OSG 3.5.0 and 3.4.34 included ATLAS XCache RPMs based on Ilija's configuration. Our RPM does not reflect the configuration of the XCaches at BNL, SLAC, etc.

    • 13:20 14:00
      Topical Report
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Newly purchased compute nodes will be delivered this Thursday
          • 97 Supermicro AS-1023US-TR4 nodes
        • Instability of the dCache Chimera name server
          • Solved by adding an additional name server host
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Service:

        1) Running smoothly; no new incidents or tickets.

        2) Follow-up on the jobs failing with SIGSEGV errors: still an average of 20 failed jobs per day. We plan to remove the local installation of the gfal libraries.

        3) Working on integrating more of the site's service monitoring into check_mk.

         

        Hardware

        1) Replaced 2 dCache database replication servers with newer hardware (Dell R610 and R710 nodes).

        2) Placed an order for 3 Dell R740xd storage nodes for Tier2 usage.

         

         

      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Network instability between UC and IU due to a flaky 100G interface

        • Began 28 August and has been recurring since then
        • UChicago network engineers are working on troubleshooting
      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Minor problem with GPFS getting wedged by PanDA jobs with many inputs.  

        Smooth operations otherwise.

        Lots of NESE work happening.  Setting up Globus infrastructure for endpoints. 

        Will probably buy a couple more gateways for NET2 traffic to and from NESE. 

        Massive expansions happening at MGHPCC:

            1. New Harvard CANNON cluster: 100k x86 cores, 40 PB storage, >1M CUDA cores

            2. $12M new MIT/IBM cluster 

            3. MIT Supercloud expansion: 450 nodes, each with 2 CPUs, 2 NVIDIA GPUs, and lots of RAM

         

         

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - No problems, all sites running well

        - We were slowly draining over the weekend, which seemed to be related to Condor-CE losing track of jobs. We restarted Condor-CE and cleaned out all spool files, which caused all running jobs to fail, but things now look much better and we are full again.

        UTA:

        1) SLATE node is installed. Still need to finalize some configuration steps.

        2) Investigating some event index job failures at SWT2_CPB. Some of these were related to a storage issue over the weekend (that was fixed), but not all.

        3) Planning hardware deployment from our most recent purchase. 

        4) Backup A/C unit being installed this week in the SWT2_CPB machine room.

         

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        Nothing significant to report. Reports are due for the ALCC allocations that we have not used yet.

        Need to recompile mpi4py at NERSC and test new container before we can resume running there.

         

      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Nothing to report.

    • 14:25 14:30
      AOB 5m