US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.4.26

      Singularity 3.1.0 in osg-upcoming

      OSG 3.4.27

      HTCondor-CE Upgrade

      Sites that run opportunistic jobs need to update to at least HTCondor-CE 3.2.0 OR add the following line to the top of /etc/condor-ce/condor_mapfile:

      GSI ".*,/[/A-Za-z0-9\.]*/Role=[A-Za-z0-9\.]*/Capability=NULL$" GSS_ASSIST_GRIDMAP

      For example: https://github.com/opensciencegrid/htcondor-ce/blob/master/config/condor_mapfile.osg
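
      For illustration only (the linked OSG example above is authoritative), the top of the mapfile on an older HTCondor-CE would then look roughly like the sketch below, where the second and third lines are merely stand-ins for a site's existing entries:

      GSI ".*,/[/A-Za-z0-9\.]*/Role=[A-Za-z0-9\.]*/Capability=NULL$" GSS_ASSIST_GRIDMAP
      GSI (.*) GSS_ASSIST_GRIDMAP
      FS (.*) \1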

      Other

      • Last call for feedback on our community testing policy. We plan to publish it by COB today.
      • We're working on a container release policy and will open the document up to feedback in the next week or so.
      • HOW 2019 site admin training Thursday morning
    • 13:20 13:40
      Topical Report
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 13:50
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Hardware:

        Nine new C6420 worker nodes at the AGLT2 UM site have been put online, adding 56*9 = 504 cores; another 7 at MSU are still waiting to be racked.

        Site Service:

        All AGLT2 PanDA queues have been updated in AGIS to use Singularity.

        AGLT2_UCORE is in test mode; we are ready to move to Harvester.

        We found a low utilization rate in our Condor system, mostly due to configuration: about 10% of cores were still statically partitioned, and some worker nodes were misconfigured so that not all of their available cores were claimed by Condor. A usage analysis based on both the Condor system and the ATLAS job archive (90% of AGLT2 cores can be used by ATLAS) showed that Condor wall-time utilization was below 50%. To improve this, we reconfigured over 100 worker nodes and now use ATLAS@home (BOINC) jobs to backfill; as a result, the CPU utilization of the cluster reaches about 90%. AGLT2 is the largest site contributing to ATLAS@home, providing on average 5200 CPU-days per day and simulating 1.7M events per day.
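
        For reference, switching the remaining statically partitioned worker nodes to a single partitionable slot is a small startd configuration change; a minimal sketch, assuming stock HTCondor knobs (the actual AGLT2 settings may differ):

        # advertise the whole node as one partitionable slot so jobs of any
        # core count can carve out what they need, instead of fixed static slots
        NUM_SLOTS = 1
        NUM_SLOTS_TYPE_1 = 1
        SLOT_TYPE_1 = 100%
        SLOT_TYPE_1_PARTITIONABLE = TRUE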

        Site downtime between the 2nd and 3rd of March due to a power outage. We updated the firmware on all nodes and switches and upgraded dCache to 5.05. During the downtime, we ran ATLAS@home jobs once power was back on.

      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        UC and UIUC are running well and are full of jobs. We are currently debugging a problem with one of the hypervisors at IU that is affecting scheduling efficiency there.

        In the past week, one of our dCache servers started having issues. We were in a site downtime over the weekend while bringing dCache back up. We are now back online (using spare MD1200s in place of the faulty hardware), but we are still debugging with Dell to get appropriate warranty replacements.

        During the dCache downtime, we also updated our gatekeepers to htcondor-ce-3.2.1-1.osg34.el7.

        IU received new compute equipment last week. Fred and Neeha are in the process of bringing the new workers online.

        The new storage at UC was put online a couple of weeks ago. We are in the process of bringing the rest of the new UC hardware online. UC is still waiting for the machine learning server to arrive, but otherwise we've received everything in the most recent equipment order.

        UC and IU switches have been reconfigured for IPv6; we are still testing. Adding IPv6 to the UIUC switches is still in progress.

      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        We are dealing with a GPFS hardware issue in the pool that contains the GPFS metadata.  It's being repaired, but we stopped the GPFS metadata scan (and thus updating of the JSON file) and borrowed some space temporarily to evacuate the affected LUNs and rebuild.  

        We had a meeting with Wei as part of the preparation for setting up a new NESE DDM endpoint. We're planning essentially a copy of what we're using now, but in Docker containers on the NESE side: multiple GridFTP endpoints with Wei's Adler callout and DNS round robin for load sharing.
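
        The load-sharing piece is ordinary DNS round robin; a hypothetical zone-file sketch (the hostname and addresses are placeholders, not the actual NESE hosts):

        ; one service name resolving to several GridFTP doors; resolvers rotate
        ; the records, spreading transfers across the doors
        gridftp.nese.example.   300   IN   A   192.0.2.11
        gridftp.nese.example.   300   IN   A   192.0.2.12
        gridftp.nese.example.   300   IN   A   192.0.2.13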


      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU: Nothing to report, everything running well.

        SWT2_CPB: updated xrootd and torque configs; OS update performed

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        The ALCC allocation is exhausted at both OLCF and ALCF; we used more than 125% of the allocation at OLCF.

        We are now running at significantly reduced priority at OLCF.

        At ALCF we will run backfill for jobs of fewer than 802 nodes. Initial testing at 1024 nodes shows that there are scaling issues between Harvester and Yoda.

        We are running scaling tests at NERSC to understand the issues; up to 500 nodes works fine most of the time.

        We will test 750 nodes per worker when Cori comes back tomorrow.

        There are issues with the SLAC storage element, so we are testing BNL and AGLT2 instead.

        RWG: introduced Marc Weinberg to the group; he will work on NSF HPC.


      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 14:30
      AOB 5m