US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Attending: Ilija, Lincoln, Saul, Rob, Horst, Xin, Mark, Armen, Patrick, Wei, Wenjing, Brian, Ofer, Brian L, William, Fred

      Apologies: Eric, Doug, Shawn

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      3.4.23 (Released 2019-01-23)

      3.4.24

      • XRootD 4.9.0 RC4 just released upstream
      • Singularity 3.0.3 (upcoming)

      Other Projects

      • Base XCache docker image pushed to Docker Hub. Still working on the ATLAS XCache implementation.
      • Updated suggested account for supporting opportunistic ATLAS jobs (documentation)
    • 13:20 13:40
      Topical Report
      • 13:20
        WBS 2.3.5 Continuous Integration & Operations (CIOPS) 10m
        Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Lincoln Bryant

        Wei: "High availability"? It's only a cache; you can lose the data with no problem, and you can have multiple caches to back it up. Worried about the perception of the HA term.

        Ilija: if we go for a model where all sites have these caches, it will become an important service. Updates, new features, want to refresh the site. Want service to come back quickly.

        Wei: reboots should be okay. And you might have a backup xcache anyway. It should be flexible. 

        Rob: Understood. We need a better term.

        Wei: Page 4: concerns about stability goals, and what's possible for access via cache or direct to the origin.

        Xin: where should it be located within the site?  Ans: close to compute.

    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • dCache upgrade (v3.0 to v4.2) done on 01/22
          • NFS4.1 interface not working after the upgrade, under investigation with dCache developers. Affected local users.
        • CEs are all updated to HTCondor-CE version 3.2.0 
        • CentOS7 migration
          • moving to native SL7 hosts from local containers in March (probably combined with UCORE migration)
        • SCRATCHDISK space
          • 1.5 PB. Long-standing issue with slow deletion.
          • ADC suggests reducing the size by 1 PB (moving that space to DATADISK). Under discussion.
        • IPv6
          • Done. SE is dual stack.
      • 13:50
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Services:

        Services are running smoothly, with no incidents during the past 2 weeks.

        The high-load problem on Condor worker nodes occurred only once, on one worker node, in the past 2 weeks, much less frequently than before.

        Hardware:

        Retired a Dell M610 blade to make space for the new worker nodes (9 Dell C6420 worker nodes, each with 56 HT CPUs, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz). The new worker nodes are still being brought online.

      • 13:55
        MWT2 5m
        Speaker: Lincoln Bryant (University of Chicago (US))

        Equipment orders for UC have been submitted.

        A network downtime has been scheduled for February 6.

        Facility milestones - to be updated for next time, for all three sites (UC, IU, UIUC)

        • CentOS7 migration
        • SCRATCHDISK space
        • IPv6
      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Transitioned to single UCORE queue for NET2.

        Networking from NET2 to NESE at 2 x 100G working.  Testing NESE as an ATLAS DDM endpoint to follow.

        On deck:

        Preparing to purchase worker nodes, probably more C6420s.

        Finish retiring old Harvard Tier 3

        Finish switching from custom LSM to Rucio (we got stuck on this with a mysterious Globus-related error in PanDA).

        Buy & install SLATE node

        Migration to SL7

        IPv6

        Smooth operations with full site otherwise. 

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Not much to report, operating smoothly

        - Updated Squid configuration at all sites

        - Scheduled OSCER maintenance today; should be transparent to PanDA, jobs will just be held (queued) in SLURM

        UTA:

        Updated Squid configuration at both sites.

        Low-level deletion issue observed at SWT2_CPB (hard to replicate).

        There will be a short power outage on 1/4 power feeds at UTA_SWT2 on Monday morning.  We expect that this will only affect some compute nodes.

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        Production at US HPCs for the past 14 days is attached as an image to the agenda.

        We have exhausted our allocation at OLCF and are now in the over-burn period.

        Kibana at Chicago reports a different number of events than BigPanda monitoring. Jira ticket: https://its.cern.ch/jira/browse/ATLASES-68

      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Nothing to report

    • 14:25 14:30
      AOB 5m