- Follow-up from scrubbing to be discussed within the next few weeks
- GDB meeting at FNAL in September: https://indico.fnal.gov/event/21232/
- Call for nominations for WBS 2.2 & 2.3 will be issued (new term starts Oct. 1st)
OSG 3.5.0/3.4.34
To be released next week; instructions for upgrading between release series will be provided. More info will be in the release announcement/notes.
XCache
ATLAS input needed for the unified XCache doc: https://docs.google.com/document/d/1Cxuzy6onOgcjTalkpkT5sBqO2yQqt6ko3zGEk3whMVI/edit?usp=sharing
IRIS-HEP deadline: August 31!
New mailing lists
Retirement of the old mailing lists will be announced to each list, with information and a grace period, before the old lists are removed
2 Open Tickets
- 142370 from 22-Jul-2019: AGLT2 timeout transfer errors.
The dCache door fails to send the notification that the transfer is complete,
so the globus client remains stuck until the 360 s timeout kicks in.
This happens before the checksum is requested (a sketch of this stuck-wait
pattern follows the ticket list).
Already reported by CMS.
- 142695 from 13-Aug-2019: HC jobs failing for the analysis queue.
A fraction of jobs fail (2-10/hour), leaving the condor_starter running.
The pilot is receiving a continuous stream of SIGSEGV.
The investigation is now converging on libgfal_plugin_http.so, at least as the trigger of the problem.
The instance from CVMFS works as expected, but pilot2 at AGLT2 uses the local version from EPEL,
which yum updated on July 19, matching the start of this problem. At least CERN and AGLT2 are affected.
The new Pilot2 v2.1.21 fixes the endless waiting on the continuous signal thrown by Rucio
(the second sketch after this list illustrates the general pattern).
The Rucio team may also have to address this bug.
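For ticket 142370, a minimal reproducer-style sketch of the stuck wait, assuming a placeholder hostname and the standard GridFTP control-channel reply code (226 = transfer complete); this is not the actual globus client code:

```python
# Hypothetical reproducer: wait for the GridFTP control-channel
# "226 Transfer complete" reply, as the client does after the data
# transfer finishes. If the door never sends it, recv() blocks until
# the 360 s timeout fires.
import socket

DOOR_HOST = "dcache-door.example.org"  # placeholder, not the AGLT2 door
DOOR_PORT = 2811                       # standard GridFTP control port
TIMEOUT_S = 360                        # the timeout seen in the ticket

def wait_for_transfer_complete(sock: socket.socket) -> bool:
    """Read control-channel replies until '226' arrives or we time out."""
    sock.settimeout(TIMEOUT_S)
    buf = b""
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return False  # door closed the connection without a 226
            buf += chunk
            if any(line.startswith(b"226") for line in buf.splitlines()):
                return True   # transfer-complete reply received
    except socket.timeout:
        # Failure mode from the ticket: no completion reply, so the
        # client sits here for the full 360 s before giving up.
        return False

# Usage (requires a reachable door):
#   with socket.create_connection((DOOR_HOST, DOOR_PORT)) as s:
#       ok = wait_for_transfer_complete(s)
```

For ticket 142695, the actual fix is in Pilot2 v2.1.21; purely as an illustration of the general pattern (bounded reaction to a repeating signal instead of waiting forever), a sketch using SIGUSR1 as a stand-in, since SIGSEGV cannot be handled safely from pure Python:

```python
# Illustration only, not the pilot2 code: cap how many times we react
# to a repeating signal instead of waiting on it indefinitely.
import signal
import sys
import time

MAX_SIGNALS = 10  # assumed threshold: give up after this many signals
count = 0

def handler(signum, frame):
    global count
    count += 1
    if count >= MAX_SIGNALS:
        # Instead of looping forever while the signal keeps arriving,
        # abort after a bounded number of occurrences.
        sys.exit(f"giving up after {count} signals (signum={signum})")

signal.signal(signal.SIGUSR1, handler)  # stand-in for the SIGSEGV stream

while True:
    time.sleep(1)  # main work loop, interrupted by each incoming signal
```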
Operation otherwise stable
Planned purchase
- Storage: 6x R740xd2
- Infrastructure: PDUs and fan doors
NET2:
1. Production steady, site full.
2. Pilot 2/singularity successfully working after an ADC config fix (which briefly caused a DDM ticket).
3. New squid installed, failover problem solved.
4. NESE gridftp container working for transfers between NESE<->NET2.
5. CephFS space for NET2 is ready in NESE.
6. Setting up the NESE endpoint in AGIS (getting help to do that). The gridftp FQDN is gridftp.nese.mghpcc.org (a quick endpoint-check sketch follows this list).
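As a quick sanity check of the new endpoint, a sketch using the gfal2 Python bindings; the storage path below is a placeholder, not the real namespace:

```python
# List a directory on the new NESE gridftp endpoint with the gfal2
# Python bindings (requires a valid grid proxy in the environment).
import gfal2

ctx = gfal2.creat_context()
# Hypothetical path; substitute the actual NESE namespace.
for entry in ctx.listdir("gsiftp://gridftp.nese.mghpcc.org/atlas/test"):
    print(entry)
```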
OU:
- Nothing to report, sites working well.
- Still working on proper xrootd space group reporting, after successfully implementing space group assignment.
UTA:
Everything running well at UTA_SWT2.
We received equipment from the latest buy. The first compute node is racked and being tested. Storage will be worked on in September.
We are also deploying our SLATE machine.
A plan has been written to bring the NSF HPCs online. Work is split between DB, Marc Weinberg, and
Lincoln Bryant. The basic idea is to use a Hosted HTCondor-CE (with ssh) to submit jobs
to the HPC centers; a submission sketch follows below. Details can be seen at this link - NSF HPC 2019.08.13 Workflow Plan.
Pilot v2 will be used on these HPCs.
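For context, a minimal sketch of the kind of submission such a CE would route, using the htcondor Python bindings (8.8-era API); the CE hostname is a placeholder, and the Hosted-CE/ssh plumbing into the HPC batch system is not shown:

```python
# Minimal sketch (placeholder hostname): submit a test job through an
# HTCondor-CE with the htcondor Python bindings. A Hosted CE would then
# forward the job to the HPC's batch system over ssh.
import htcondor

CE_HOST = "hosted-ce.example.org"  # hypothetical Hosted CE endpoint

sub = htcondor.Submit({
    "universe": "grid",
    # Route the job to the CE's schedd on the standard CE port 9619.
    "grid_resource": f"condor {CE_HOST} {CE_HOST}:9619",
    "executable": "/bin/hostname",
    "output": "test.out",
    "error": "test.err",
    "log": "test.log",
})

schedd = htcondor.Schedd()         # local submit host's schedd
with schedd.transaction() as txn:  # transaction API as of HTCondor 8.8
    cluster_id = sub.queue(txn)
print(f"submitted cluster {cluster_id}")
```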
What is the status of the CE in front of the BNL IC queue?
There are issues creating the job work directory on the shared filesystem. We are using the ARC-CE RPMs.
Do we need to test HTCondor-CE instead?