US ATLAS Computing Facility
13:00 → 13:10
WBS 2.3 Facility Management News 10m Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- Follow-up from the scrubbing to be discussed within the next few weeks
- GDB meeting at FNAL in September: https://indico.fnal.gov/event/21232/
- A call for nominations for WBS 2.2 & 2.3 will be issued (new term starts Oct. 1st)
13:10 → 13:20
OSG-LHC 10m Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.5.0 / 3.4.34
To be released next week; instructions for upgrading between release series will be provided. More information will be in the release announcement/notes. (A minimal check of the installed release series is sketched after this list.)
- cvmfs 2.6.2
- XCache 1.1 (including ATLAS/CMS RPMs)
- xrootd-voms-plugin will be renamed back to vomsxrd in OSG 3.5
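As a hedged illustration of the series upgrade mentioned above, here is a minimal sketch of checking which OSG release series a host is currently on. The rpm query is standard, but the authoritative upgrade procedure will come from the release notes, not from this snippet.

```python
import subprocess

# Query the installed osg-release package to learn which release series
# (3.4 vs 3.5) this host is on before attempting the series upgrade.
version = subprocess.run(
    ["rpm", "-q", "--qf", "%{VERSION}", "osg-release"],
    capture_output=True, text=True, check=True,
).stdout
series = ".".join(version.split(".")[:2])
print(f"installed OSG release series: {series}")
if series != "3.5":
    print("not yet on 3.5; follow the upgrade steps in the release notes")
```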
XCache
ATLAS input needed for the unified XCache doc: https://docs.google.com/document/d/1Cxuzy6onOgcjTalkpkT5sBqO2yQqt6ko3zGEk3whMVI/edit?usp=sharing
IRIS-HEP deadline: August 31!
New mailing lists
The retirement of the old mailing lists will be announced to each list, with information and a grace period before the old lists are removed.
- osg-sites (potentially renamed to sites-announce) will only allow owner-posting and will be used to announce software releases, packages ready for testing, and OSG operations issues pertaining to sites
- software-discuss@opensciencegrid.org for OSG Software discussion, replacing osg-software
- Retiring osg-int@opensciencegrid.org
13:20 → 14:00
Topical Report
13:20
NET2 Evolution 15m Speaker: Prof. Saul Youssef (Boston University (US))
13:40 → 14:25
US Cloud Status
13:40
US Cloud Operations Summary 5m Speaker: Mark Sosebee (University of Texas at Arlington (US))
13:45
BNL 5m Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Massive staging from tape for the 2018 reprocessing campaign
  - 600k files staged from BNL tape, ~20% of the total for this campaign (so roughly 3M files overall)
  - almost done at BNL now (~900 files left)
  - postmortem ongoing on the performance of the dCache and HPSS systems
- dCache hasn't been stable recently
  - a pool crashed and the Chimera name server became unresponsive
  - causing SAM test failures and other production issues
  - reason under investigation; suspected to be related to the recent high volume of staging requests
  - system brought back up with a lower staging-limit setting (illustrated in the sketch below)
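To illustrate why lowering the staging limit helps, here is a hedged Python sketch; this is not dCache's actual mechanism or configuration, just the generic idea of capping concurrent tape recalls so a burst of requests queues up instead of overwhelming the HSM interface. The limit value is an assumption.

```python
import threading
import time

STAGE_LIMIT = 50  # assumed value; the real limit lives in dCache settings
slots = threading.Semaphore(STAGE_LIMIT)

def stage(filename: str) -> None:
    # At most STAGE_LIMIT recalls run concurrently; extra requests
    # block here until a slot frees, instead of piling onto the HSM.
    with slots:
        time.sleep(0.01)  # stand-in for the actual tape recall

threads = [threading.Thread(target=stage, args=(f"file-{i}",)) for i in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"staged {len(threads)} files with at most {STAGE_LIMIT} concurrent recalls")
```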
13:50
AGLT2 5m Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
2 Open Tickets
- 142370 (opened 22-Jul-2019): AGLT2 timeout transfer errors. The dCache door fails to send the notification that the transfer has completed, so the globus client remains stuck until the 360 s timeout kicks in. This happens before the checksum is requested. Already reported by CMS. (A probe for this symptom is sketched after this list.)
- 142695 (opened 13-Aug-2019): HC jobs failing for the analysis queue. A fraction of jobs fail (2-10/hour), leaving condor_starter running. The pilot is receiving a continuous stream of SIGSEGV. Investigation is now converging on libgfal_plugin_http.so, at least as the initiator of the problem. The instance from cvmfs works as expected, but pilot2 at AGLT2 uses the local version from EPEL, which yum updated on July 19, matching the start of this problem. At least CERN and AGLT2 are affected. The new pilot2 v2.1.21 fixes the endless waiting on the continuous signal thrown by Rucio; the Rucio team may also have to address this bug.
Operation otherwise stable.
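For ticket 142370, a hedged sketch of a probe that would surface the stuck-at-completion behavior: run a small test copy under a deadline well below the 360 s client timeout and flag doors that stall. The endpoint and paths are placeholders, not AGLT2's real namespace.

```python
import subprocess

# Placeholders; substitute a real dCache door and a small test file.
SRC = "gsiftp://dcache-door.example.org/pnfs/test/1MB-probe"
DST = "file:///tmp/1MB-probe"

try:
    # A healthy door answers well within 120 s for a tiny file; a door
    # showing the 142370 symptom hangs after the data arrives, before
    # the completion reply, and trips this shorter deadline.
    subprocess.run(["gfal-copy", "-f", SRC, DST], check=True, timeout=120)
    print("transfer completed promptly")
except subprocess.TimeoutExpired:
    print("transfer stalled past 120 s: likely stuck before the completion reply")
except subprocess.CalledProcessError as exc:
    print("transfer failed outright, exit code", exc.returncode)
```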
Planned purchase
- Storage: 6x R740Xd2
- Infrastructure: PDUs and fan doors
13:55
MWT2 5m Speakers: David Allen Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
- GGUS Ticket 142653 (solved): mwt2-gk (the UIUC gatekeeper) had filesystem issues a couple of weeks ago (Aug 10-11). Our colleagues there got it back up and running.
- Because of the downed gatekeeper, our other GKs were taking on extra work and were also crashing after going OOM. We're investigating and believe it's a memory-leak issue (a watchdog sketch follows this list).
- In the meantime, we're allotting more memory to the GKs.
- Currently drained of jobs, as our GKs killed them all earlier this morning. We're investigating whether that was related to the memory issues. We're refilling now.
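A hedged sketch of the kind of watchdog one might run while chasing the suspected leak; the process-name match and the 8 GiB threshold are assumptions, not MWT2's actual tooling.

```python
import psutil

THRESHOLD = 8 * 1024**3  # assumed alert threshold: 8 GiB resident

# Walk the process table and warn about any condor-related process whose
# resident set has grown past the threshold, so the gatekeeper can be
# drained before the OOM killer takes its jobs down with it.
for proc in psutil.process_iter(["name", "memory_info"]):
    name = proc.info["name"] or ""
    mem = proc.info["memory_info"]
    if "condor" in name and mem and mem.rss > THRESHOLD:
        print(f"pid {proc.pid} ({name}): rss {mem.rss / 1024**3:.1f} GiB over threshold")
```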
14:00
NET2 5m Speaker: Prof. Saul Youssef (Boston University (US))
1. Production steady, site full.
2. Pilot 2/singularity successfully working after an ADC config fix (which briefly caused a DDM ticket).
3. New squid installed; failover problem solved.
4. NESE gridftp container working for transfers between NESE<->NET2.
5. CephFS space for NET2 is ready in NESE.
6. Setting up the NESE endpoint in AGIS (getting help to do that). The gridftp FQDN is gridftp.nese.mghpcc.org (a basic liveness check is sketched below).
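A minimal liveness check for the new endpoint once it is registered, assuming the gfal2 command-line utilities and a valid grid proxy are available; the /nese/test path is a placeholder, not the real namespace.

```python
import subprocess

# Placeholder path under the published FQDN; adjust to the real namespace.
ENDPOINT = "gsiftp://gridftp.nese.mghpcc.org/nese/test"

result = subprocess.run(["gfal-ls", ENDPOINT],
                        capture_output=True, text=True, timeout=60)
if result.returncode == 0:
    print("endpoint responding, first entries:", result.stdout.splitlines()[:5])
else:
    print("endpoint check failed:", result.stderr.strip())
```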
14:05
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Nothing to report, sites working well.
- Still working on proper xrootd space-group reporting, after successfully implementing space-group assignment (a verification sketch follows).
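A hedged sketch of how the reported numbers could be spot-checked, assuming the xrootd client tools are installed; the redirector host and path are placeholders.

```python
import subprocess

HOST = "xrootd.example.org:1094"   # placeholder redirector
PATH = "/xrd/atlasdatadisk"        # placeholder space-group path

# xrdfs spaceinfo reports total/free/used for the path, which can be
# compared against what the configured space-group assignment implies.
out = subprocess.run(["xrdfs", HOST, "spaceinfo", PATH],
                     capture_output=True, text=True, check=True).stdout
print(out)
```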
UTA:
Everything is running well at UTA_SWT2.
We received equipment from the latest buy. The first compute node is racked and being tested; storage will be worked on in September.
We are also deploying our SLATE machine.
14:10
HPC Operations 5m Speaker: Doug Benjamin (Duke University (US))
A plan has been written to bring the NSF HPCs online. The work is split between DB (Doug Benjamin), Marc Weinberg, and Lincoln Bryant. The basic idea is to use a Hosted HTCondor-CE (with SSH) to submit jobs to the HPC centers; a minimal submission sketch follows below. Details can be seen at this link: NSF HPC 2019.08.13 Workflow Plan.
Pilot v2 will be used on these HPCs.
What is the status of the CE in front of the BNL IC queue? There are issues creating the job work directory on the shared filesystem; we are using the ARC-CE RPMs. We still need to test HTCondor-CE.
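A hedged sketch of the Hosted CE idea using the HTCondor Python bindings: a grid-universe probe job handed to the CE's schedd, which then forwards it onward (via the SSH route in the hosted setup) to the HPC batch system. The CE hostname is a placeholder and a valid grid credential is assumed to be in place.

```python
import htcondor

CE = "hosted-ce.example.org"  # placeholder Hosted HTCondor-CE

# Grid-universe job: the local schedd hands the job to the CE's schedd
# listening on port 9619; the CE's job routes then push it to the HPC.
sub = htcondor.Submit({
    "universe": "grid",
    "grid_resource": f"condor {CE} {CE}:9619",
    "executable": "/usr/bin/hostname",
    "output": "probe.out",
    "error": "probe.err",
    "log": "probe.log",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = sub.queue(txn)
print("submitted probe job, cluster", cluster_id)
```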
14:15
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:20
Analysis Facilities - BNL 5m Speaker: William Strecker-Kellogg (Brookhaven National Lab)
14:25 → 14:30
AOB 5m