BrianL out until Oct 15. Mátyás Selmeci will be attending the facilities meetings until then.
globus-gssapi-gsi breakage
OSG 3.4.18
Release tomorrow (9/27).
OSG 3.4.19
XRootD Overhaul
OSG Topology (formerly OIM)
Work on the HEP blueprinting effort with ESnet and CMS is ongoing. The next step is to have 2 sites on each side of the transatlantic link so that traffic estimates from ESnet and from the experiments can be compared. For ATLAS we have identified AGLT2 and NET2 on the North American side and IN2P3-LPSC and WUPPERTAL on the EU side. CMS is using:
T2_US_Nebraska
T2_US_UCSD
CIEMAT-LCG2 (Spain)
INFN-Roma (Italy)
Response from ESnet:
Hello Garhan, Shawn, all,
I have completed the first run for the given sites for the month of Aug 2018. The results can be seen here: https://docs.google.com/spreadsheets/d/1o78o_SujmZ3TtnMnSQFy1aVZ1DR7ODth0Qa7-RNf9l4/edit?usp=sharing
On the first page of the spreadsheet I have summarized the prefixes used.
Two quick observations:
- Interestingly, I don’t see any traffic from or to NET2 - probably not what we expected.
- There is a lot of traffic from IN2P3-LPSC to IN2P3-LPSC.
How does it look to you?
Best regards,
Richard
Is NET2 not using LHCONE? We may need to switch to SWT2 UTA.
Shawn
We have had a LOT of trouble with SCORE-dedicated WNs since the upgrade to the foreshadow kernel 3.10.0-862.11.6.el7.x86_64. This was done simultaneously with an upgrade of HTCondor to 8.6.12. The symptoms fall into a few different classifications:
The load suddenly starts to increase linearly (this could be an artifact of some cron tasks), but the underlying symptom seems to be that a lock of some kind is taken on the file system and never released. "top" shows no real work ongoing any longer, and cvmfs and HTCondor seem to argue over who should have access. Stopping HTCondor commonly clears all the race conditions. Downgrading to HTCondor 8.4.11 helps, but issues still occur, so the foreshadow kernel upgrade itself is suspect.
Another related symptom is that swap usage suddenly surges until all available swap is consumed. No processes seem to be killed by the OOM killer, but again the load rises and no real processing goes on.
Often the condor startd goes unresponsive and is killed by the HTCondor watchdogs, but that does not necessarily clear the issue. HTCondor will sometimes crash outright and stop, which usually leaves an idle WN that must be restarted.
In a few cases, cvmfs was stuck in a tight loop trying to access disks, or cache, or.... Jakob put out a pre-release of 2.5.2 that addressed this overly aggressive behavior. It helped, but did not resolve anything, and his final conclusion was that cvmfs itself was not primarily at fault.
I am not aware of any solution to the problem. We _could_ back out of the foreshadow kernel....
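As a rough illustration of the kind of node-level checks that expose the symptoms above (the stuck filesystem lock and the swap surge), the following sketch reads the standard /proc interfaces. It is an assumption about how one might monitor this, not our actual tooling; thresholds and any alerting are left out.

```python
#!/usr/bin/env python
"""Sketch of worker-node health checks for a stuck-lock / swap-surge WN."""
import os


def d_state_count():
    """Count processes in uninterruptible sleep (D state).

    A growing D-state count while "top" shows no real work is the
    classic signature of a held, unreleased filesystem lock.
    """
    count = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        # The state field is the first character after the ')' that
        # closes the comm field (comm itself may contain spaces).
        state = data[data.rindex(')') + 2]
        if state == 'D':
            count += 1
    return count


def swap_usage_kb():
    """Return (used, total) swap in kB, parsed from /proc/meminfo."""
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, val = line.split(':', 1)
            info[key] = int(val.split()[0])
    return info['SwapTotal'] - info['SwapFree'], info['SwapTotal']


if __name__ == '__main__':
    print('D-state processes:', d_state_count())
    used, total = swap_usage_kb()
    print('Swap used: %d kB of %d kB' % (used, total))
```

Run periodically (e.g. from cron), a jump in D-state processes together with a flat CPU would point at the lock symptom rather than genuine load.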
We had a fiber cut on our private network a week or so back, which dropped half the HTCondor slots out of production. It was repaired after about 16 hours. At the same time, two of our internal switches stopped talking to each other, which crashed our VMware infrastructure.
Beyond this, we have been proceeding along fine.
0. Confirmed that we're on LHCONE and in the ESNet monitoring... https://my.es.net/lhcone/view/NET2/flow
1. Deletion errors appeared on Monday (the same error as at SWT2), presumably due to the middleware update that Brian B. told us about on Friday. We fixed this by switching deletion to atlas-gridftp.bu.edu. The fact that this worked bodes well for the switch from Bestman2 to GridFTP with DNS & 5 endpoints.
2. We're in the process of retiring the HU queues and absorbing the old equipment into the BU side. This will simplify operations.
3. Still in the process of moving some Harvard users from their ancient Tier 3 to NET3 (BU/Harvard/UMASS).
4. Preparing a networking plan to connect to NESE storage.
5. Getting quotes from DELL. Interested to hear what the Rob/Shawn/DELL discussion comes up with.
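Since the Bestman2-to-GridFTP plan in item 1 relies on a DNS alias in front of 5 endpoints, a quick sanity check is to enumerate the A records behind the alias. This is only an illustrative helper (the GridFTP control port 2811 and the use of a round-robin alias are assumptions); localhost is used in the example so the sketch runs anywhere.

```python
import socket


def endpoints_behind(alias, port=2811):
    """Return the distinct IPv4 addresses a DNS alias resolves to.

    For a round-robin GridFTP alias this should list every endpoint
    currently published behind the name.
    """
    infos = socket.getaddrinfo(alias, port, socket.AF_INET,
                               socket.SOCK_STREAM)
    return sorted({addr[0] for _, _, _, _, addr in infos})


if __name__ == '__main__':
    # In practice one would point this at the alias, e.g.
    # endpoints_behind('atlas-gridftp.bu.edu'); localhost is used
    # here only so the example resolves everywhere.
    print(endpoints_behind('localhost'))
```

Comparing the returned list against the intended 5 endpoints would confirm the DNS side of the migration before cutting deletions over.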
UTA