US ATLAS Computing Facility

US/Eastern
    • 1
      WBS 2.3 Facility Management News
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 2
      OSG-LHC
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
    • Topical Report: Data Carousel

      Data Carousel Update

      Convener: Xin Zhao (Brookhaven National Laboratory (US))
    • 3
      WBS 2.3.1 Tier1 Center
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • dCache update to 5.2 planned, after reprocessing

      • MAS progressing well with some delays due to lack of effort

        • ~1 PB of unused data moved to intermediate storage; data moved to MAS will be deleted from DATADISK

        • BNL_LAKE_UCORE PQ now running production jobs, usage will be monitored

        • Presentation at the WLCG QoS workshop this Friday

      • Reprocessing had some operational issues (interference with data consolidation, T2 issues affecting T1 stage-in, etc.). The whole chain may need to be revisited.

    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Sign up for the OSG All Hands Meeting (AHM):
        https://opensciencegrid.org/all-hands/2020/
      • At Shawn McKee's request, I asked each Tier 2 to look over the description of their site in the management document. Please do this.
      • I have been pinging sites about mysteries that I found in the 2019 LCG accounting numbers
        • High memory jobs caused very low efficiency in December at BNL
        • Still need to work out how to account for non-ATLAS production jobs (BOINC, OSG, etc.)
      • 4
        AGLT2
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Personnel update/correction:

        - Wenjing came back from China before the full travel ban took effect.
          Now working from home during self-quarantine period.

        Tickets:

        - old / now solved ticket   144783  12-Jan-2020   AGLT2: lost heartbeat

        - new / assigned ticket   144982  28-Jan-2020   AGLT2: lost heartbeat.
          Found and retired one particular worker node failing all jobs
          for what looked like a file system problem, but probably not related.
          No other acute problem found.
          Still suspect that most of these errors came from the global pilot problem active around that time.
          Currently only 5% of failures come from lost heartbeat.
          https://bigpanda.cern.ch/errors/?computingsite=AGLT2_UCORE&jobstatus=failed

        Hardware:

        - Last R740XD2 online and in production for dCache.
          Finishing migration and retirement of oldest dCache disk shelves at MSU.

        Services:

        - xrootd.aglt2.org certificate SANs restored.
         

      • 5
        MWT2
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • GGUS Ticket 144840 “Auth failed” on xroot downloads
          • Very low failure rate on download tests (less than 1%) and a small failure rate on jobs.
          • Caused by a bug in the deployed xrootd version; fixed in a newer version (need to update).
        • GGUS Ticket 145103
          • Jobs failing with stage-out (permission denied) error.
          • SRM access log shows srmLS failing for the files, which are then written to disk, after which srmLS succeeds.
          • Probably a similar issue to ticket 144840. Need to update dCache and xrootd, then keep an eye on it.

        UC:

        Extra network equipment connected. The new storage nodes that were waiting to be added have now been added.

        IU:

        Running smoothly

        UIUC:

        Running smoothly

      • 6
        NET2
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Possible stage-in issue; investigating now. Brief DDM problem for ~1 hour in the past week. Otherwise smooth operations.

      • 7
        SWT2
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        1) Generally smooth operations over the past two weeks

        2) Almost done with the replacement of the UPS batteries (finishing this evening).

        3) Investigating accounting irregularities.

         

        OU:

        - Not much to report, smooth operations

        - xrootd file system is under-reporting space for '?oss.cgroup=ATLASDATADISK'; investigating.

        - Also, xrootd daemons are still not rotating logs correctly; need help with that.

         

    • 8
      WBS 2.3.3 HPC Operations
      Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))

      NERSC is running, with spiky usage (see plot).

    • WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 9
        Analysis Facilities - BNL
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Updates:

        One further short batch downtime (< 2 hours) last weekend due to T3 GPFS filesystem issues. No jobs were running in T3 at the time, so nothing was affected.

      • 10
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 11
        ATLAS ML Platform & User Support
        Speaker: Ilija Vukotic (University of Chicago (US))

        ML platform working fine. Most usage from David Miller physics tasks. 

    • WBS 2.3.5 Continuous Operations
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 12
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 13
        Analytics Infrastructure & User Support
        Speaker: Ilija Vukotic (University of Chicago (US))

        4 data nodes added to ES and one old data node removed. CERN users reported some slowness; investigating. New Dash-based platform for analytics on perfSONAR data: I wrote all the boilerplate code, and Petya should customize it. Some issues with rucio-events data, and some changes to jobs_archive data. All collection services and platforms are running fine.

      • 14
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x)
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        4 SLATE-deployed XCaches are operational and serving VP-scheduled jobs. Most of the jobs run at MWT2 and Prague. Not sure why BNL and AGLT2 get far fewer jobs. Discussing using VP to schedule jobs to the Vienna T2; working with Rod and Dimitrios.

    • 15
      AOB