US ATLAS Computing Facility

US/Eastern
    • 13:00 - 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 - 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
      • XRootD RC5 available in osg-testing
      • xrootd-lcmaps 1.7.0 (now available on EL6) going out in the next release (tentatively tomorrow)
      • OSG ATLAS XCache preliminary image available (https://hub.docker.com/r/opensciencegrid/atlas-xcache/). Working with the SLATE team and Ilija to test it.
    • 13:20 - 13:40
      Topical Report
      • 13:20
        WBS 2.3.1 Tier1 Operations 10m
        Speakers: Eric Christian Lancon (CEA/IRFU, Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:40 - 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 13:50
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        AGLT2 had its storage blacklisted for three days, even though the original problem was just a brief glitch introduced by our VMware migration/upgrade. This prevented our site from being put back online by HammerCloud until the blacklisting was removed.

        On the positive side, we finally managed to upgrade our VMware infrastructure from v5.5 running on old R630 nodes to v6.7 running on new R740 hardware. There is still a lot of tuning to do, but services are running much better now.

        A lot of cabling work is ongoing as well, including correcting and updating labels, port descriptions in switches, socket descriptions on PDUs, and the corresponding Visio diagrams.

        New hardware (nine C6420 servers at UM) is cabled and will be built out soon.

        We keep seeing high-load HTCondor worker nodes; 2-3 nodes are killed every day due to high load (>100 per core). This might be caused by specific jobs, usually OSG/CMS jobs.
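
        For illustration only, a minimal sketch of the kind of per-core load check that could flag such worker nodes; the threshold constant and the alert action are assumptions for the example, not AGLT2's actual monitoring.

        #!/usr/bin/env python3
        # Sketch: flag a worker node whose 1-minute load average per core
        # exceeds a limit (the ">100 per core" level mentioned in the note above).
        import os

        LOAD_PER_CORE_LIMIT = 100.0  # assumed threshold, taken from the note above

        def load_per_core() -> float:
            one_min_load = os.getloadavg()[0]   # 1-minute load average
            cores = os.cpu_count() or 1         # logical cores on this node
            return one_min_load / cores

        if __name__ == "__main__":
            ratio = load_per_core()
            if ratio > LOAD_PER_CORE_LIMIT:
                print(f"ALERT: load per core {ratio:.1f} exceeds {LOAD_PER_CORE_LIMIT}")
            else:
                print(f"OK: load per core {ratio:.1f}")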

        The HTCondor head node (a virtual machine) was unreachable for a few hours during the VMware update, but this did not affect the running jobs.

        The dCache head node was upgraded from 4.2.21 to 4.2.23 to fix the gPlazma authentication bug (authentication would fail every couple of days). The other pool/door nodes still run 4.2.21.

      • 13:55
        MWT2 5m
        Speaker: Lincoln Bryant (University of Chicago (US))

        GPFS filesystem issues at Illinois on Sunday; the filesystem was restored yesterday and the UIUC nodes were brought back online.

        Compute node purchases at both IU (Dell) and UIUC (HP), using mostly FY18 funds, will be submitted shortly.

        Storage expansion, an edge node for k8s/XCache/SLATE, an ML node, and a network switch expansion at UC have all been submitted (some already delivered).

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Will start the process to purchase the XCache box for SWT2_CPB.
        • Otherwise, things are running smoothly.

        OU:

        Nothing to report; everything is running smoothly.

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        Jumbo/co-jumbo Event Service task 16368172 has duplicate events.

        A JIRA ticket was created to track the debugging progress:

        https://its.cern.ch/jira/browse/ATLASES-73

        Until the problem is solved, no more jumbo/co-jumbo ES tasks will be run.

        This will cause Theta to be paused (we have 9.5 M Theta core hours to go; 88% of the allocation has been used).

        OLCF has used 86 M Titan core hours, 107% of its allocation.

        NERSC (ERCAP allocation): 4.2 M NERSC hours used out of 120 M (3.5%). We need to use 12 M hours by April 10th or we lose 25% of the unused balance.
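
        As a quick sanity check on the figures above, a back-of-the-envelope calculation; the implied Theta and Titan totals are derived from the quoted percentages and are approximations, not official allocation numbers.

        # Back-of-the-envelope check of the allocation figures quoted above
        # (all numbers in millions of core hours, taken from this report).

        theta_remaining = 9.5                        # M core hours left with 88% used
        theta_total = theta_remaining / (1 - 0.88)   # implied total, roughly 79 M

        titan_used = 86.0                            # M core hours, reported as 107% of allocation
        titan_total = titan_used / 1.07              # implied total, roughly 80 M

        nersc_used, nersc_total = 4.2, 120.0         # M NERSC hours
        nersc_pct = 100.0 * nersc_used / nersc_total # about 3.5%

        print(f"Theta implied allocation ~ {theta_total:.0f} M core hours")
        print(f"Titan implied allocation ~ {titan_total:.0f} M core hours")
        print(f"NERSC usage so far: {nersc_pct:.1f}%")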

      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Nothing to report. The pool is quite busy.

    • 14:25 - 14:30
      AOB 5m