US ATLAS Computing Facility

US/Eastern
    • 13:00 → 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      - Follow-up from the scrubbing to be discussed within the next few weeks

      - GDB meeting at FNAL in September: https://indico.fnal.gov/event/21232/

      - A call for nominations for WBS 2.2 & 2.3 will be issued (the new term starts Oct. 1st)

    • 13:10 → 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.0/3.4.34

      To be released next week; instructions for upgrading between release series will be provided. More details will be in the release announcement/notes.

      • cvmfs 2.6.2
      • XCache 1.1 (including ATLAS/CMS RPMs)
      • xrootd-voms-plugin will be renamed back to vomsxrd in OSG 3.5

      XCache

      ATLAS input needed for the unified XCache doc: https://docs.google.com/document/d/1Cxuzy6onOgcjTalkpkT5sBqO2yQqt6ko3zGEk3whMVI/edit?usp=sharing

      IRIS-HEP deadline: August 31!

      New mailing lists

      Retirement of the old mailing lists will be announced to each list, with details and a grace period before the old lists are removed.

      • osg-sites (potentially renamed to sites-announce) will only allow owner-posting and will be used to announce software releases, packages ready for testing, and OSG operations issues pertaining to sites
      • software-discuss@opensciencegrid.org for OSG Software discussion, replacing osg-software
      • Retiring osg-int@opensciencegrid.org


    • 13:20 → 14:00
      Topical Report
      • 13:20
        NET2 Evolution 15m
        Speaker: Prof. Saul Youssef (Boston University (US))
    • 13:40 → 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • massive staging from tape for the 2018 reprocessing campaign
          • 600k files staged from BNL tape, ~20% of the total volume in this campaign
          • almost done at BNL now (~900 files left)
          • postmortem ongoing on the performance of the dCache and HPSS systems
        • dCache has not been stable recently
          • a pool crashed and the Chimera name server became unresponsive
          • causing SAM test failures and other production issues
            • reason under investigation; suspected to be related to the recent high number of staging requests
          • system brought back up, with a lower setting on the staging limit (see the sketch after this list)
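
        Lowering the staging limit amounts to capping how many stage requests run at
        once. A generic, hedged sketch of such a throttle (stage_file() and the limit
        of 50 are hypothetical, purely to illustrate the idea; the real limit is a
        dCache-side setting):

          import concurrent.futures

          MAX_CONCURRENT_STAGES = 50   # hypothetical cap, standing in for the lowered limit

          def stage_file(path):
              # Placeholder for a real bring-online request against the storage system.
              return path

          files = ["file%04d" % i for i in range(1000)]  # toy stand-in for the campaign
          # The pool size bounds how many stage requests execute concurrently.
          with concurrent.futures.ThreadPoolExecutor(MAX_CONCURRENT_STAGES) as pool:
              for _ in pool.map(stage_file, files):
                  pass  # record staging outcomes here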
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        2 Open Tickets

        - 142370 (opened 22-Jul-2019): AGLT2 timeout transfer errors.
        The dCache door fails to send the message that the transfer has completed,
        so the globus client remains stuck until the 360 s timeout kicks in.
        This happens before the checksum is requested.
        Already reported by CMS.
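
        As a minimal illustration of this failure mode (hypothetical host and port,
        not the actual AGLT2 door), a client that waits on the control channel for
        the completion reply stalls for the full timeout when the door stays silent:

          import socket

          HOST, PORT = "dcache-door.example.edu", 2811  # hypothetical door endpoint
          TIMEOUT_S = 360                               # the timeout seen in the ticket

          sock = socket.create_connection((HOST, PORT), timeout=TIMEOUT_S)
          try:
              # After the data channel closes, the client still waits for the
              # control-channel "226 Transfer complete" reply before moving on to
              # the checksum. If the door never sends it, recv() blocks here until
              # the socket timeout fires.
              reply = sock.recv(4096)
              print("control-channel reply:", reply.decode(errors="replace"))
          except socket.timeout:
              print("no completion reply within %ds -- transfer marked failed" % TIMEOUT_S)
          finally:
              sock.close()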

        - 142695 (opened 13-Aug-2019): HC jobs failing for the analysis queue.
        A fraction of jobs fail (2-10/hour), leaving condor_starter running.
        The pilot receives a continuous stream of SIGSEGV.
        The investigation is now converging on libgfal_plugin_http.so, at least as
        the trigger of the problem.
        The instance from cvmfs works as expected, but pilot2 at AGLT2 uses the local
        version from EPEL, which yum updated on July 19, matching the start of this
        problem. At least CERN and AGLT2 are affected.
        The new pilot2 v2.1.21 fixes the endless waiting on the continuous signal
        stream triggered via Rucio (see the sketch below).
        The Rucio team may also have to address this bug.
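
        A sketch of the bounded-wait pattern the pilot fix implies (our reading, not
        the actual pilot2 code; SIGUSR1 stands in for SIGSEGV, which Python cannot
        safely catch when raised by faulting C code). Because Python auto-restarts
        system calls interrupted by handled signals (PEP 475), the handler itself
        must enforce the retry budget:

          import os
          import signal
          import time

          MAX_INTERRUPTS = 10  # hypothetical retry budget

          interrupts = 0

          def on_signal(signum, frame):
              # Count signals; raising here is what breaks the automatic retry of
              # the wait below once the stream looks endless.
              global interrupts
              interrupts += 1
              if interrupts >= MAX_INTERRUPTS:
                  raise RuntimeError("continuous signal stream, aborting wait")

          signal.signal(signal.SIGUSR1, on_signal)

          child = os.fork()
          if child == 0:
              # Child misbehaves like the broken plugin: signals the parent repeatedly.
              for _ in range(100):
                  os.kill(os.getppid(), signal.SIGUSR1)
                  time.sleep(0.05)
              os._exit(0)

          try:
              os.waitpid(child, 0)           # restarted after each handled signal...
              print("child finished normally")
          except RuntimeError as exc:
              print("wait aborted:", exc)    # ...until the handler gives up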

        Operation otherwise stable

        Planned purchase:
          - Storage: 6x R740xd2
          - Infrastructure: PDUs and fan doors


      • 13:55
        MWT2 5m
        Speakers: David Allen Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
        • GGUS ticket 142653 (solved): mwt2-gk (the UIUC gatekeeper) had filesystem issues a couple of weeks ago (Aug 10-11). Our colleagues there got it back up and running.
        • Because of the downed gatekeeper, our other GKs were taking on extra work and were also crashing from going OOM. We are investigating and believe it is a memory leak.
          • In the meantime, we are allotting more memory to the GKs (a watchdog sketch follows this list).
        • Currently drained of jobs, as our GKs killed them all earlier this morning. We are investigating whether that was related to the memory issues. We are refilling now.
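
        While the leak is investigated, a watchdog that flags gatekeeper processes
        whose resident memory crosses a threshold can catch the growth before the
        OOM killer does. A hedged sketch (the process-name pattern and the 8 GiB
        limit are assumptions, not MWT2's actual values):

          import psutil

          RSS_LIMIT_GB = 8           # hypothetical alert threshold
          NAME_PATTERN = "condor"    # assumed match for gatekeeper service processes

          for proc in psutil.process_iter(["name", "memory_info"]):
              name = proc.info["name"] or ""
              if NAME_PATTERN in name:
                  rss_gb = proc.info["memory_info"].rss / 1024**3
                  if rss_gb > RSS_LIMIT_GB:
                      print("WARNING: %s (pid %d) RSS %.1f GiB over %d GiB -- possible leak"
                            % (name, proc.pid, rss_gb, RSS_LIMIT_GB))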
      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        1. Production steady, site full.

        2. Pilot 2/singularity working successfully after an ADC configuration fix (which briefly caused a DDM ticket).

        3. New squid installed; failover problem solved.

        4. NESE gridftp container working for transfers between NESE and NET2.

        5. CephFS space for NET2 is ready in NESE. 

        6. Setting up the NESE endpoint in AGIS (getting help to do that). The gridftp FQDN is gridftp.nese.mghpcc.org; a quick endpoint check is sketched below.
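
        A minimal sketch for sanity-checking the new endpoint before registering it
        in AGIS, using the gfal2 Python bindings (the storage path is hypothetical,
        and a valid grid proxy is assumed):

          import gfal2

          # Hypothetical path; the real one comes from the NESE/AGIS configuration.
          SURL = "gsiftp://gridftp.nese.mghpcc.org/nese/atlas/"

          ctx = gfal2.creat_context()
          try:
              for entry in ctx.listdir(SURL):
                  print(entry)
              print("endpoint reachable and listable")
          except gfal2.GError as exc:
              print("endpoint check failed:", exc)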

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, sites working well.

        - Space group assignment was implemented successfully; still working on proper xrootd space group reporting.


        UTA:

        Everything is running well at UTA_SWT2.

        We received equipment from the latest purchase. The first compute node is racked and being tested. Storage will be worked on in September.

        We are also deploying our SLATE machine.

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))

        A plan has been written to bring the NSF HPCs online, with the work split
        between Doug Benjamin, Marc Weinberg, and Lincoln Bryant. The basic idea is
        to use a Hosted HTCondor-CE (with ssh) to submit jobs to the HPC centers.
        Details can be seen at this link: NSF HPC 2019.08.13 Workflow Plan.

        Pilot v2 will be used on these HPCs.
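
        As a smoke test of the hosted-CE route, a hedged sketch using the htcondor
        Python bindings (the hostname is hypothetical; authentication, GSI at the
        time, is assumed to be configured; newer bindings replace the transaction
        with schedd.submit()):

          import htcondor

          CE_HOST = "hosted-ce.example.org"  # hypothetical hosted-CE hostname

          # An HTCondor-CE serves its collector and schedd on port 9619.
          coll = htcondor.Collector(CE_HOST + ":9619")
          ce_ad = coll.locate(htcondor.DaemonTypes.Schedd, CE_HOST)
          schedd = htcondor.Schedd(ce_ad)

          job = htcondor.Submit({
              "executable": "/bin/hostname",  # trivial payload to exercise the ssh route
              "output": "smoke.out",
              "error": "smoke.err",
              "log": "smoke.log",
          })

          with schedd.transaction() as txn:   # bindings circa 2019
              cluster_id = job.queue(txn)
          print("submitted test job, cluster", cluster_id)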


        What is the status of the CE in front of the BNL IC queue?

        There are issues creating the job work directory on the shared filesystem.
        We are using the ARC-CE RPMs.

        Do we need to test HTCondor-CE?

      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 → 14:30
      AOB 5m