US ATLAS Computing Integration and Operations
-
1
Top of the Meeting
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- WBS 2.3 re-shuffling ongoing
- Tier2 computing review requested by management, still being organized.
- Facility workshop at the end of this year
- https://doodle.com/poll/hzx53r5qm8769ux7 ; location TBD
- 3 days, possibly with ADC attendance
- Preliminary list of FY19 milestones https://docs.google.com/spreadsheets/d/1MxVS8T47znFhzPyBtIV8nO1Hod-rvUSdxGO35HWKqcU/edit?usp=sharing
- Deletion issues at BU and UTA: how can this issue be better handled centrally by us?
- SLATE deployment? Positive feedback from ITD security at BNL
- Pilot role needs to be enabled for RW on US storage
-
2
ADC news and issues
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- problem with a new version of the Globus RPM requiring a newer TLS version, breaking BeStMan
- more info here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMeetingWeek180924
- moving away from BeStMan
- addition of a gridftp test to SAM, as part of ATLAS_CRITICAL
- more info here: https://indico.cern.ch/event/755513/contributions/3131553/attachments/1717584/2771546/adc_weekly_sam.pdf
- Is ANLASC the only US ATLAS gridftp-only site in this test?
- A more general ADC question : "do we want to increase/improve the test in ETF/SAM, or do we want to have other external sources to publish results in the result DB and use (also) these to monitor the real usability of the sites?"
- Harvester migration happening on other clouds; the US will be on the list after the S&C week
- more info here: https://docs.google.com/presentation/d/19Bp_EpcwZM4hNZtqE9bjgdIAq-IPDhglHTdifIUIzkk/edit#slide=id.p
-
3
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
BrianL out until Oct 15. Mátyás Selmeci will be attending the facilities meetings until then.
globus-gssapi-gsi breakage
- New version of globus-gssapi-gsi (v13.10-1) in EPEL forces TLS 1.2, breaking Bestman. Fix in config by setting MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED in /etc/grid-security/gsi.conf
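The gsi.conf workaround noted above can be sketched as a small script. This is a hedged sketch, not an official OSG procedure: it operates on a local copy (override with `CONF=...`) so it can be tried safely, whereas on a real storage element the file is /etc/grid-security/gsi.conf and the edit would be made as root.

```shell
# Sketch of applying the MIN_TLS_PROTOCOL workaround for globus-gssapi-gsi
# v13.10-1 forcing TLS 1.2. Uses a local file by default so it is safe to try.
CONF=${CONF:-gsi.conf}            # real path: /etc/grid-security/gsi.conf
[ -f "$CONF" ] || touch "$CONF"
# Append the override only if no MIN_TLS_PROTOCOL line is present yet,
# so re-running the script does not duplicate the setting.
grep -q '^MIN_TLS_PROTOCOL=' "$CONF" || \
    echo 'MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED' >> "$CONF"
# Show the effective setting.
grep '^MIN_TLS_PROTOCOL' "$CONF"
```

The idempotency check matters on a production SE: configuration management may re-run the fix, and duplicate keys in gsi.conf would be ambiguous.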
OSG 3.4.18
Release tomorrow (9/27).
- CVMFS 2.5.1
- XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
- HTCondor-CE bug fixes
- Updating globus-gridftp-server packages to match the EPEL versions
- gratia-probe bug fixes for slurm and condor probes
- RSV bug fix for bogus "GRACC server not responding" warnings
OSG 3.4.19
- autopyfactory 2.4.9
- gsi-openssh update
XRootD Overhaul
- JIRA Epic
- We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache
OSG Topology (formerly OIM)
- Topology and Downtime registration instructions are live: https://opensciencegrid.org/docs/common/registration/
- Downtime registration form is live: https://topology.opensciencegrid.org/generate_downtime
- Basic OSG contacts page is live: https://topology.opensciencegrid.org/contacts
- Please send us your GitHub username if you plan to make topology updates
- Topology (including downtime) updates still require human review but we hope to automate downtime review in the next few weeks
-
4
Production
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
5
Data Management
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
6
Networking
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Work on the HEP blueprinting effort with ESnet and CMS is ongoing. The next step is to have two sites on each side of the transatlantic link to compare traffic estimates between ESnet and the experiments. For ATLAS we have identified AGLT2 and NET2 on the North America side and IN2P3-LPSC and WUPPERTAL on the EU side. CMS is using:
- T2_US_Nebraska
- T2_US_UCSD
- CIEMAT-LCG2 (Spain)
- INFN-Roma (Italy)
Response from ESnet:
Hello Garhan, Shawn, all,
I have completed the first run for the given sites for the month of Aug 2018. The results can be seen here: https://docs.google.com/spreadsheets/d/1o78o_SujmZ3TtnMnSQFy1aVZ1DR7ODth0Qa7-RNf9l4/edit?usp=sharing
On the first page of the spreadsheet I have summarized the prefixes used.
Two quick observations:
- Interestingly, I don’t see any traffic from or to NET2 - probably not what we expected.
- There is a lot of traffic from IN2P3-LPSC to IN2P3-LPSC.
How does it look to you?
Best regards,
Richard
Is NET2 not using LHCONE? May need to switch to SWT2 UTA?
Shawn
-
Site Reports
-
9
BNL
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Running fine in general
- L1TF security patch applied to all CEs. Risk for WNs considered low.
- dCache Storage server issue
- one disk failure triggered a restart of the JBOD I/O module
- caused stage-in errors for jobs, GGUS 137367
- disk size on dCache tape pools will be increased from 200 TB to 2 PB by the end of this week, with a target of 5 PB by the end of the year
-
10
AGLT2
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
We have had a LOT of trouble with SCORE-dedicated WNs since the upgrade to the Foreshadow (L1TF) kernel 3.10.0-862.11.6.el7.x86_64. This was done simultaneously with an upgrade of HTCondor to 8.6.12. The symptoms fall into a few different classes:
The load suddenly starts to increase linearly (this could be an artifact of some cron tasks), but the underlying pattern seems to be that a lock of some kind is taken on the file system and never released. "top" shows no real work ongoing any longer, and cvmfs and HTCondor seem to argue about who should have access. Commonly, stopping HTCondor will clear all the race conditions. Downgrading to HTCondor 8.4.11 helps, but issues still happen, so the whole Foreshadow kernel upgrade is suspect.
Another related symptom is that swap usage suddenly surges to consume all available swap. No processes seem to be killed by the OOM killer, but again the load rises and no real processing goes on.
Often the HTCondor startd goes unresponsive and is killed by the HTCondor watchdogs, but that does not necessarily clear the issue. HTCondor will sometimes crash outright and stop, which usually leaves an idle WN that must be restarted.
In a few cases, cvmfs was stuck in a tight loop trying to access disks, or cache, or.... Jakob put out a pre-release of cvmfs 2.5.2 that addressed this too-aggressive behavior. It helped, but did not resolve anything, and his final conclusion was that cvmfs itself was not primarily at fault.
There is no solution to this problem that I'm aware of. We _could_ back out of the Foreshadow kernel....
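A minimal health probe for the two symptoms described above (load climbing with no real work, and swap suddenly exhausted) might look like the following. This is a sketch, not AGLT2's actual monitoring, and the thresholds (2x core count, <10% swap free) are illustrative assumptions.

```shell
# Sketch of a WN health probe for the symptoms above; thresholds are
# illustrative assumptions, not site policy. Linux-only (reads /proc).
loadavg=$(cut -d' ' -f1 /proc/loadavg)             # 1-minute load average
ncpu=$(nproc)                                      # core count
swap_total=$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)   # kB
swap_free=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)     # kB
# Symptom 1: load well past the core count while nothing completes.
awk -v l="$loadavg" -v n="$ncpu" 'BEGIN {exit !(l > 2 * n)}' &&
    echo "WARN: load $loadavg is above 2x the $ncpu cores"
# Symptom 2: swap surge; flag when less than ~10% of swap remains free.
if [ "${swap_total:-0}" -gt 0 ] && [ $((swap_free * 10)) -lt "$swap_total" ]; then
    echo "WARN: swap nearly exhausted ($swap_free of $swap_total kB free)"
fi
```

Run from cron on each WN, a probe like this could drain the node (or just alert) before the startd is killed by the watchdogs.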
We had a fiber cut on our private network a week or so back, which dropped half the HTCondor slots out of production. This was repaired after about 16 hours. At the same time, two of our internal switches stopped talking to each other, and this crashed our VMware infrastructure.
Beyond this, we have been proceeding along fine.
-
12
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
0. Confirmed that we're on LHCONE and in the ESNet monitoring... https://my.es.net/lhcone/view/NET2/flow
1. Deletion errors appeared on Monday (same error as at SWT2), presumably due to the middleware update that Brian B. told us about on Friday. We fixed this by switching to atlas-gridftp.bu.edu for deletions. The fact that this worked bodes well for switching from Bestman2 to GridFTP with DNS and 5 endpoints.
2. We're in the process of retiring the HU queues and absorbing the old equipment into the BU side. This will simplify operations.
3. Still in the process of moving some Harvard users from their ancient Tier 3 to NET3 (BU/Harvard/UMASS).
4. Preparing a networking plan to connect to NESE storage.
5. Getting quotes from Dell. Interested to hear what the Rob/Shawn/Dell discussion comes up with.
-
13
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- BeStMan issues are causing problems with the deletion service at both clusters
- New gridFTP service being rolled out.
- Awaiting new host certificates from UTA OpSec
- Network changeover will now occur Sept. 30th (02:00-07:00)
- UTA will now peer directly with LEARN on our campus
- Science DMZ traffic is affected by the change
- Will declare a downtime for the 5 hour window
-
14
AOB
-