US ATLAS Computing Integration and Operations
-
1
Top of the Meeting
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- WBS 2.3 re-shuffling ongoing
- Tier2 computing review requested by management, still being organized.
- Facility workshop at the end of this year
- https://doodle.com/poll/hzx53r5qm8769ux7 ; location TBD
- 3 days, possibly with ADC attendance
- Preliminary list of FY19 milestones https://docs.google.com/spreadsheets/d/1MxVS8T47znFhzPyBtIV8nO1Hod-rvUSdxGO35HWKqcU/edit?usp=sharing
- Deletion issues at BU and UTA: how can this issue be better handled centrally by us?
- SLATE deployment? Positive feedback from ITD security at BNL
- Pilot role needs to be enabled for RW on US storage
-
2
ADC news and issues
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- problem with a new version of the Globus RPM requiring a newer TLS version, breaking BeStMan
- more info here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMeetingWeek180924
- moving away from BeStMan
- addition of a gridftp test to SAM, as part of ATLAS_CRITICAL
- more info here: https://indico.cern.ch/event/755513/contributions/3131553/attachments/1717584/2771546/adc_weekly_sam.pdf
- Is ANLASC the only US ATLAS gridftp-only site in this test?
- A more general ADC question : "do we want to increase/improve the test in ETF/SAM, or do we want to have other external sources to publish results in the result DB and use (also) these to monitor the real usability of the sites?"
- Harvester migration happening on other clouds; the US will be on the list after the S&C week
- more info here: https://docs.google.com/presentation/d/19Bp_EpcwZM4hNZtqE9bjgdIAq-IPDhglHTdifIUIzkk/edit#slide=id.p
-
3
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
BrianL out until Oct 15. Mátyás Selmeci will be attending the facilities meetings until then.
globus-gssapi-gsi breakage
- New version of globus-gssapi-gsi (v13.10-1) in EPEL forces TLS 1.2, breaking Bestman. Fix in config by setting MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED in /etc/grid-security/gsi.conf
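The gsi.conf workaround noted above can be sketched as a small script. This is a hedged sketch, not an official OSG procedure: it operates on a local copy (override with `CONF=...`) so it can be tried safely, whereas on a real storage element the file is /etc/grid-security/gsi.conf and the edit would be made as root.

```shell
# Sketch of applying the MIN_TLS_PROTOCOL workaround for globus-gssapi-gsi
# v13.10-1 forcing TLS 1.2. Uses a local file by default so it is safe to try.
CONF=${CONF:-gsi.conf}            # real path: /etc/grid-security/gsi.conf
[ -f "$CONF" ] || touch "$CONF"
# Append the override only if no MIN_TLS_PROTOCOL line is present yet,
# so re-running the script does not duplicate the setting.
grep -q '^MIN_TLS_PROTOCOL=' "$CONF" || \
    echo 'MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED' >> "$CONF"
# Show the effective setting.
grep '^MIN_TLS_PROTOCOL' "$CONF"
```

The idempotency check matters on a production SE: configuration management may re-run the fix, and duplicate keys in gsi.conf would be ambiguous.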
OSG 3.4.18
Release tomorrow (9/27).
- CVMFS 2.5.1
- XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
- HTCondor-CE bug fixes
- Updating globus-gridftp-server packages to match the EPEL versions
- gratia-probe bug fixes for slurm and condor probes
- RSV bug fix for bogus "GRACC server not responding" warnings
OSG 3.4.19
- autopyfactory 2.4.9
- gsi-openssh update
XRootD Overhaul
- JIRA Epic
- We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache
OSG Topology (formerly OIM)
- Topology and Downtime registration instructions are live: https://opensciencegrid.org/docs/common/registration/
- Downtime registration form is live: https://topology.opensciencegrid.org/generate_downtime
- Basic OSG contacts page is live: https://topology.opensciencegrid.org/contacts
- Please send us your GitHub username if you plan to make topology updates
- Topology (including downtime) updates still require human review but we hope to automate downtime review in the next few weeks
-
4
Production
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
5
Data Management
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
6
Networking
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Work on the HEP blueprinting effort with ESnet and CMS is ongoing. The next step is to have two sites on each side of the transatlantic link to compare traffic estimates between ESnet and the experiments. For ATLAS we have identified AGLT2 and NET2 on the North America side and IN2P3-LPSC and WUPPERTAL on the EU side. CMS is using:
- T2_US_Nebraska
- T2_US_UCSD
- CIEMAT-LCG2 (Spain)
- INFN-Roma (Italy)
Response from ESnet:
Hello Garhan, Shawn, all,
I have completed the first run for the given sites for the month of Aug 2018. The results can be seen here: https://docs.google.com/spreadsheets/d/1o78o_SujmZ3TtnMnSQFy1aVZ1DR7ODth0Qa7-RNf9l4/edit?usp=sharing
On the first page of the spreadsheet I have summarized the prefixes used.
Two quick observations:
- Interestingly, I don’t see any traffic from or to NET2 - probably not what we expected.
- There is a lot of traffic from IN2P3-LPSC to IN2P3-LPSC.
How does it look to you?
Best regards,
Richard
Is NET2 not using LHCONE? May need to switch to SWT2 UTA?
Shawn
-
Site Reports
-
9
BNL
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Running fine in general
- L1TF security patch applied to all CEs. Risk for WNs considered low.
- dCache Storage server issue
- one disk failure triggered a restart of the JBOD I/O module
- caused stage-in errors for jobs, GGUS 137367
- disk size on dCache tape pools will be increased from 200 TB to 2 PB by the end of this week, with a target of 5 PB by the end of the year
-
10
AGLT2
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
We have had a LOT of trouble with SCORE-dedicated WNs since the upgrade to the Foreshadow (L1TF) kernel 3.10.0-862.11.6.el7.x86_64. This was done simultaneously with an upgrade of HTCondor to 8.6.12. The symptoms fall into a few different classes:
The load suddenly starts to increase linearly (this could be an artifact of some cron tasks), but the underlying pattern seems to be that a lock of some kind is taken on the file system and never released. "top" shows no real work ongoing any longer, and cvmfs and HTCondor seem to argue about who should have access. Commonly, stopping HTCondor will clear all the race conditions. Downgrading to HTCondor 8.4.11 helps, but issues still happen, so the whole Foreshadow kernel upgrade is suspect.
Another related symptom is that swap usage suddenly surges to consume all available swap. No processes seem to be killed by the OOM killer, but again the load rises and no real processing goes on.
Often the HTCondor startd goes unresponsive and is killed by the HTCondor watchdogs, but that does not necessarily clear the issue. HTCondor will sometimes crash outright and stop, which usually leaves an idle WN that must be restarted.
In a few cases, cvmfs was stuck in a tight loop trying to access disks, or cache, or.... Jakob put out a pre-release of cvmfs 2.5.2 that addressed this too-aggressive behavior. It helped, but did not resolve anything, and his final conclusion was that cvmfs itself was not primarily at fault.
There is no solution to this problem that I'm aware of. We _could_ back out of the Foreshadow kernel....
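A minimal health probe for the two symptoms described above (load climbing with no real work, and swap suddenly exhausted) might look like the following. This is a sketch, not AGLT2's actual monitoring, and the thresholds (2x core count, <10% swap free) are illustrative assumptions.

```shell
# Sketch of a WN health probe for the symptoms above; thresholds are
# illustrative assumptions, not site policy. Linux-only (reads /proc).
loadavg=$(cut -d' ' -f1 /proc/loadavg)             # 1-minute load average
ncpu=$(nproc)                                      # core count
swap_total=$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)   # kB
swap_free=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)     # kB
# Symptom 1: load well past the core count while nothing completes.
awk -v l="$loadavg" -v n="$ncpu" 'BEGIN {exit !(l > 2 * n)}' &&
    echo "WARN: load $loadavg is above 2x the $ncpu cores"
# Symptom 2: swap surge; flag when less than ~10% of swap remains free.
if [ "${swap_total:-0}" -gt 0 ] && [ $((swap_free * 10)) -lt "$swap_total" ]; then
    echo "WARN: swap nearly exhausted ($swap_free of $swap_total kB free)"
fi
```

Run from cron on each WN, a probe like this could drain the node (or just alert) before the startd is killed by the watchdogs.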
We had a fiber cut on our private network a week or so back, which dropped half the HTCondor slots out of production. This was repaired after about 16 hours. At the same time, two of our internal switches stopped talking to each other, and this crashed our VMware infrastructure.
Beyond this, we have been proceeding along fine.
-
12
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
0. Confirmed that we're on LHCONE and in the ESNet monitoring... https://my.es.net/lhcone/view/NET2/flow
1. Deletion errors appeared on Monday (same error as at SWT2), presumably due to the middleware update that Brian B. told us about on Friday. We fixed this by switching to atlas-gridftp.bu.edu for deletions. The fact that this worked bodes well for switching from Bestman2 to GridFTP with DNS and 5 endpoints.
2. We're in the process of retiring the HU queues and absorbing the old equipment into the BU side. This will simplify operations.
3. Still in the process of moving some Harvard users from their ancient Tier 3 to NET3 (BU/Harvard/UMASS).
4. Preparing a networking plan to connect to NESE storage.
5. Getting quotes from Dell. Interested to hear what the Rob/Shawn/Dell discussion comes up with.
-
13
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- BeStMan issues are causing problems with the deletion service at both clusters
- New gridFTP service being rolled out.
- Awaiting new host certificates from UTA OpSec
- Network changeover will now occur Sept. 30th (02:00-07:00)
- UTA will now peer directly with LEARN on our campus
- Science DMZ traffic is affected by the change
- Will declare a downtime for the 5 hour window
-
14
AOB
-