US ATLAS Computing Integration and Operations
-
-
13:00
→
13:05
Top of the Meeting 5m
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:05
→
13:15
Singularity / CentOS 7 deployment in the US cloud 10m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:15
→
13:20
ADC news and issues 5m
Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:20
→
13:25
Production 5m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5m
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5m
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Today's OSG area coordinators meeting is on networking. Attend if you want to see the details: https://opensciencegrid.github.io/management/area-coordinators/
The new HEPiX Network Function Virtualization working group is starting up and will have its kick-off meeting by the end of this month; Marian Babik and Shawn McKee are co-chairs. Sign up if you are interested in participating at: https://listserv.in2p3.fr/cgi-bin/wa?SUBED1=HEPIX-NFV-WG
Primary ESnet link to CERN for BNL had a brief 2 minute outage this morning. [ESNET-20180117-001]
CERN-513-CR5 <BCYP1046> WASH-CR5 circuit outage around 4:35 AM Eastern time.
No major networking issues for US sites that I know of.
-
13:45
→
13:50
XCache 5m
Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
13:55
HPCs integration 5m
Speaker: Taylor Childers (Argonne National Laboratory (US))
-
13:55
→
14:30
Site Reports
-
13:55
BNL 5m
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Meltdown & Spectre
  - Interactive nodes and CEs are patched; farm nodes will follow in a rolling fashion.
  - Performance degradation:
    - almost none in our HS06 measurements
    - ADC reports a ~7% hit in KV tests
- Extra CPUs (~15 kHS06) were added to the ATLAS farm during the holiday break to help with reprocessing and other campaigns.
- The new compute node purchase has arrived and will be brought online in the coming weeks.
-
14:00
AGLT2 5m
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
Taking advantage of the fact that all MSU WN were powered off for BPS building circuit work, we started the Meltdown/Spectre rpm updates there around January 5. All WN at both sites have now been updated with this fix, along with all gatekeepers, desktops, interactive machines and dCache pool servers. For the latter, we will be examining some kernel parameter changes that would hopefully gain back performance lost due to the kernel and other rpm updates.
See: https://community.centminmod.com/threads/linux-kernel-security-updates-for-spectre-meltdown-vulnerabilities.13648/
It suggests adding the following to the kernel boot line: noibrs noibpb nopti
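As a purely illustrative sketch (not an AGLT2 procedure), assuming an SL/CentOS host whose boot entries are managed with grubby, those flags could be appended to the kernel command line roughly as follows; note that they disable the mitigations, trading security for performance:

    #!/usr/bin/env python
    # Illustrative only: append the suggested flags to every installed kernel
    # via grubby. A reboot is needed for the new command line to take effect.
    import subprocess

    FLAGS = "noibrs noibpb nopti"

    def append_boot_flags(flags=FLAGS):
        # grubby edits the boot-loader entry of each installed kernel
        subprocess.check_call(["grubby", "--update-kernel=ALL", "--args=" + flags])

    if __name__ == "__main__":
        append_boot_flags()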
HS06 runs indicate less than a 1% decrease in performance on modern processors from the kernel updates. This decrease seems to be more than offset by an increase in performance when the same machine is updated to SL7. Older processors seem to either be unchanged, or perform slightly better on HS06 with the kernel updates.
As John Hover points out, though, I/O will be the real sticking point. We are trying to obtain some data on this from muon calibration runs, but it is not yet available. The jobs consist of running Athena to convert a calibstream fragment to a calib ntuple. Results (expected later today, or perhaps tomorrow) will be posted to the usatlas-t2-l list when they are ready.
We now have a small SL7 gatekeeper/cluster running at AGLT2 (~100 cores) and have created an SCORE production queue (AGLT2_SL7) for testing. As of this writing, we have seen only a few software jobs (nagrun.sh -v..., about 130 such jobs in 3 days' time) and do not otherwise appear to be getting many pilots. We will follow up on this.
Otherwise operation has been smooth.
-
14:05
MWT2 5m
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
-
14:10
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Tested the kernel patches; they do not break GPFS.
We'll have a DDM downtime next week (~24 hours) to perform GPFS maintenance, rebuild our SRM (to resolve a low-level proxy ticket), and apply kernel updates.
Lots of NESE activity preparing for first major deployment. Quotes coming in. Power/space/cooling/WAN ready.
LHCONE peering still needs to be resumed; we are not currently peering with LHCONE, but this causes no immediate problems.
Smooth NET2 running otherwise.
-
14:15
SWT2-OU 5m
Speaker: Dr Horst Severini (University of Oklahoma (US))
- I will be late for the meeting since I'll be in a proposal meeting from 11 am till ...?
- New OSCER SE now in production; 700 TB xrootd filesystem.
- Switched all OU PQs from Lucille_SE to OU_OSCER_ATLAS_SE.
- Seems to work well for the most part; some jobs fail, still debugging errors.
- OU_OCHEP_SWT2 jobs fail at stage-out with a failed gfal2 dependency; not sure why, since they should be using xrdcp instead, which works fine for HC jobs (see the sketch after this list). We have asked for help.
- Singularity tests successful on OU_OSCER_ATLAS_TEST; awaiting further tests.
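A minimal sketch of the kind of stage-out fallback being debugged, assuming the gfal2 Python bindings and the xrdcp client are available; the paths and destination URL are hypothetical placeholders, and this is not the actual pilot mover code:

    # Illustrative only: try the gfal2 bindings and fall back to xrdcp,
    # which is what works for HC jobs on these nodes.
    import subprocess

    def stage_out(local_path, dest_url):
        # local_path must be an absolute path on the worker node
        try:
            import gfal2                      # fails where the gfal2 dependency is broken
            ctx = gfal2.creat_context()
            params = ctx.transfer_parameters()
            ctx.filecopy(params, "file://" + local_path, dest_url)
        except ImportError:
            subprocess.check_call(["xrdcp", "-f", local_path, dest_url])

    # Hypothetical usage:
    # stage_out("/tmp/out.root", "root://xrootd.example.edu//atlas/scratch/out.root")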
-
14:20
SWT2-UTA 5m
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
New kernels and microcode have been installed on the CEs and compute nodes at both clusters to mitigate the recently disclosed CPU flaws.
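A small sketch of one way to confirm the mitigation state on a patched node, assuming a kernel new enough to expose /sys/devices/system/cpu/vulnerabilities (not part of the SWT2 procedure):

    # Illustrative only: print the kernel's reported Meltdown/Spectre status.
    import glob, os

    for path in sorted(glob.glob("/sys/devices/system/cpu/vulnerabilities/*")):
        with open(path) as f:
            print("%s: %s" % (os.path.basename(path), f.read().strip()))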
SWT2_CPB:
- Suffered a power outage on Saturday (1/13) during a test of the backup generator
- We are following up with our facilities personnel to determine the root cause of the issue and when it will be resolved
- We have a lingering issue with one data server that is causing problems when reporting used space. Will look at this later today
- Production activities are fine
- Updated XRootD to version 4.8.0.1 on dataservers/redirector
UTA_SWT2:
- Moved old storage from SWT2_CPB to this cluster and brought it online.
- We have an open GGUS ticket concerning network outages. We are waiting to hear back from our network manager to understand why this happened; it has not repeated.
- Updated XRootD to version 4.8.0.1 on dataservers/redirector
- Suffered a power outage on Saturday (1/13) during a test of the backup generator
-
14:25
WT2 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
→
14:35
AOB 5m