US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2019-08-21T13:00:00-04:00
End: 2019-08-21T15:00:00-04:00
Location: No location set

Wednesday 21 Aug 2019, 13:00 → 15:00 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  - Follow-up from the scrubbing to be discussed in the coming weeks
  
  - GDB meeting at FNAL in September: https://indico.fnal.gov/event/21232/
  
  - A call for nominations for WBS 2.2 & 2.3 will be issued (new term starts Oct. 1)
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.5.0/3.4.34
  
  To be released next week; instructions for upgrading between release series will be provided. More information will be in the release announcement and notes.
  
  cvmfs 2.6.2
  
  XCache 1.1 (including ATLAS/CMS RPMs)
  
  xrootd-voms-plugin will be renamed back to vomsxrd in OSG 3.5
  
  XCache
  
  ATLAS input needed for the unified XCache doc: https://docs.google.com/document/d/1Cxuzy6onOgcjTalkpkT5sBqO2yQqt6ko3zGEk3whMVI/edit?usp=sharing
  
  IRIS-HEP deadline: August 31!
  
  New mailing lists
  
  Retirement of the old mailing lists will be announced on those lists, with information and a grace period before they are removed
  
  osg-sites (potentially renamed to sites-announce) will only allow owner-posting and will be used to announce software releases, packages ready for testing, and OSG operations issues pertaining to sites
  
  software-discuss@opensciencegrid.org for OSG Software discussion, replacing osg-software
  
  Retiring osg-int@opensciencegrid.org
- 13:20 → 14:00
  Topical Report
  - 13:20
    
    NET2 Evolution 15m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
- 13:40 → 14:25
  US Cloud Status
  - 13:40
    
    US Cloud Operations Summary 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
    
    US-cloud-summary-8_14_19.pdf
    
    US-cloud-summary-8_21_19.pdf
  - 13:45
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    Massive staging from tape for the 2018 reprocessing campaign
    
    - 600k files staged from BNL tape, ~20% of the total amount in this campaign
    - Almost done at BNL now (~900 left)
    - Postmortem ongoing on the performance of the dCache and HPSS systems
    
    dCache hasn't been stable recently
    
    - Pool crashed, and Chimera name server unresponsive
    - Causing SAM test failures and other production issues
    - Cause under investigation; suspected to be related to the recent high number of staging requests
    - System brought back up with a lower staging limit
  - 13:50
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    2 Open Tickets
    
    - 142370 from 22-Jul-2019 AGLT2 timeout transfer errors.
    dCache door fails to send the information that the transfer is completed
    so the globus client remains stuck until the timeout of 360s kicks in.
    This is happening before asking for the checksum.
    Already reported by CMS.
    
    - 142695 from 13-Aug-2019 HC jobs failing for analysis queue.
    Fraction of jobs failing (2-10/hour), leaving condor_starter running.
    The pilot is receiving a continuous stream of SIGSEGV.
    Investigation now converging on libgfal_plugin_http.so, at least for initiating the problem.
    The instance from cvmfs works as expected, but pilot2 at AGLT2 uses the local version from EPEL,
    which yum updated on July 19, matching the start of this problem. At least CERN and AGLT2 are affected.
    New Pilot2 v2.1.21 fixes the endless waiting on the continuous signal thrown by rucio.
    The Rucio team may also have to address this bug.
    
    Operation otherwise stable
    
    Planned purchase
    - Storage: 6x R740Xd2
    - Infrastructure: PDUs and fan doors
  - 13:55
    MWT2 5m
    
    Speakers: David Allen Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
    
    GGUS Ticket 142653 (solved): mwt2-gk (UIUC gatekeeper) had filesystem issues a couple of weeks ago (Aug 10-11). Our colleagues there got it back up and running.
    
    Because of the downed gatekeeper, our other GKs took on extra work and were also crashing after running out of memory (OOM). We're investigating and believe it's a memory leak.
    
    In the meantime, we're allotting more memory to the GKs.
    
    Currently drained of jobs, as our GKs killed them all earlier this morning. We're investigating whether that was related to the memory issues. We're refilling now.
  - 14:00
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    1. Production steady, site full.
    
    2. Pilot 2/singularity successfully working after an ADC config fix (which briefly caused a DDM ticket).
    
    3. New squid installed, failover problem solved.
    
    4. NESE gridftp container working for transfers between NESE<->NET2.
    
    5. CephFS space for NET2 is ready in NESE.
    
    6. Setting up the NESE endpoint in AGIS (getting help to do that). The GridFTP FQDN is gridftp.nese.mghpcc.org.
  - 14:05
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Nothing to report, sites working well.
    
    - Still working on proper xrootd space group reporting, though; space group assignment has been implemented successfully.
    
    UTA:
    
    Everything running well at UTA_SWT2
    
    We received equipment from the latest purchase. The first compute node is racked and being tested. Storage will be worked on in September.
    
    We are also deploying our SLATE machine.
  - 14:10
    
    HPC Operations 5m
    
    Speaker: Doug Benjamin (Duke University (US))
    
    Written a plan to bring NSF HPCs online. Work is split between DB, Marc Weinberg and Lincoln Bryant. The basic idea is to use a Hosted HTCondor-CE (with ssh) to submit jobs to HPC centers. Details can be seen at this link: NSF HPC 2019.08.13 Workflow Plan
    
    Pilot v2 will be used on these HPCs.
    
    What is the status of the CE in front of the BNL IC queue?
    
    Issues creating the job work directory on the shared filesystem. We are using ARC-CE RPMs.
    
    Do we need to test HTCondor-CE?
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
- 14:25 → 14:30
  
  AOB 5m