US ATLAS Computing Facility
1. WBS 2.3 Facility Management News
   Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
US ATLAS Computing Facility Assessment
https://docs.google.com/document/d/1y-3OtJKn52xsLZze3iURMiie3nrekEslrcdY0GYW--I/edit?usp=sharing
2. OSG-LHC
   Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
Topical Report
3. WBS 2.3.1 Tier1 Operations
   Speakers: Eric Christian Lancon (CEA/IRFU, Centre d'etude de Saclay, Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
US Cloud Status
4. US Cloud Operations Summary
   Speaker: Mark Sosebee (University of Texas at Arlington (US))
5. BNL
   Speaker: Xin Zhao (Brookhaven National Laboratory (US))
6. AGLT2
   Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
AGLT2 had its storage blacklisted for three days, even though the original problem was just a brief glitch introduced by our VMware migration/upgrade. The blacklisting prevented HammerCloud from putting our site back online until it was removed.
On the positive side, we finally managed to upgrade our VMware infrastructure from v5.5 running on old R630 nodes to v6.7 running on new R740 hardware. There is still a lot of tuning to do, but services are running much better now.
A lot of cabling work is ongoing as well, including correcting and updating labels, switch port descriptions, PDU socket descriptions, and the corresponding Visio diagrams.
New hardware (9 C6420 servers at UM) is cabled and ready to be built soon.
We keep seeing high-load HTCondor worker nodes; 2-3 nodes are killed every day due to high load (>100 per core). This may be caused by specific jobs, usually OSG/CMS jobs.
The HTCondor head node (a virtual machine) was unreachable for a few hours during the VMware update, but this did not affect running jobs.
The dCache head node was upgraded from 4.2.21 to 4.2.23 to fix gPlazma authentication bugs (authentication would fail every couple of days). The other pool/door nodes still run 4.2.21.
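As a minimal sketch of the per-core load criterion mentioned above (the >100-per-core threshold used when killing overloaded worker nodes), the check could look like the following. The function names and the use of `/proc`-style load averages are our illustration, not AGLT2's actual tooling:

```python
# Hypothetical sketch of the per-core load check described above: flag a
# worker node whose 1-minute load average exceeds a per-core threshold
# (the report cites >100 per core as the kill criterion).
import os

LOAD_PER_CORE_LIMIT = 100.0  # threshold quoted in the report


def load_per_core(loadavg_1min, n_cores):
    """Return the 1-minute load average normalized per core."""
    return loadavg_1min / n_cores


def should_flag(loadavg_1min, n_cores, limit=LOAD_PER_CORE_LIMIT):
    """True if the node exceeds the per-core load limit."""
    return load_per_core(loadavg_1min, n_cores) > limit


if __name__ == "__main__":
    one_min = os.getloadavg()[0]  # current 1-minute load average
    cores = os.cpu_count() or 1
    status = "FLAG" if should_flag(one_min, cores) else "ok"
    print(f"load/core = {load_per_core(one_min, cores):.2f} ({status})")
```

On a 64-core node, a load of 6500 (about 101.6 per core) would be flagged, while a fully loaded but healthy node (load 64, i.e. 1.0 per core) would not.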
7. MWT2
   Speaker: Lincoln Bryant (University of Chicago (US))
GPFS filesystem issues at Illinois on Sunday; the filesystem was restored yesterday and the UIUC nodes were brought back online.
Compute node purchases at both IU (Dell) and UIUC (HP), mostly with FY18 funds, are to be submitted shortly.
Storage expansion, edge node for k8s/xcache/slate, ML node, network switch expansion at UC all submitted (some delivered).
8. NET2
   Speaker: Prof. Saul Youssef (Boston University (US))
9. SWT2
   Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
10. HPC Operations
   Speaker: Doug Benjamin (Duke University (US))
Jumbo/co-jumbo Event Service task 16368172 has duplicate events. A Jira ticket was created to track the debugging progress:
https://its.cern.ch/jira/browse/ATLASES-73
Until the problem is solved, no more jumbo/co-jumbo ES tasks will be run.
This will cause Theta to be paused (we have 9.5M Theta core hours to go; 88% of the allocation has been used).
OLCF has used 86M Titan core hours, 107% of the allocation.
NERSC (ERCAP allocation): 4.2M NERSC hours used out of 120M (~3.5%). We need to use 12M hours by April 10th or we lose 25% of the unused balance.
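The NERSC figures above can be checked with a few lines of arithmetic. This sketch uses the numbers from the report; the forfeiture rule is our reading (25% of whatever balance is unused if the 12M-hour checkpoint is missed), so treat the "at risk" figure as an assumption-laden illustration:

```python
# Sketch of the NERSC ERCAP allocation arithmetic from the report.
ALLOCATION_MHRS = 120.0   # total allocation, millions of NERSC hours
used_mhrs = 4.2           # hours used so far
checkpoint_mhrs = 12.0    # usage required by April 10th

pct_used = 100.0 * used_mhrs / ALLOCATION_MHRS
unused_mhrs = ALLOCATION_MHRS - used_mhrs
# Assumed penalty: 25% of the unused balance is taken back if the
# checkpoint is missed.
at_risk_mhrs = 0.25 * unused_mhrs

print(f"Used: {pct_used:.1f}% of allocation")        # Used: 3.5% of allocation
print(f"At risk: {at_risk_mhrs:.2f}M NERSC hours")   # At risk: 28.95M NERSC hours
```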
11. Analysis Facilities - SLAC
   Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
12. Analysis Facilities - BNL
   Speaker: William Strecker-Kellogg (Brookhaven National Lab)
Nothing to report; the pool is quite busy.
13. AOB