US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 – 13:15
      Top of the Meeting 15m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))

      NSF Tier2 funding (Kaushik and Paolo)

      • FY2017-FY2021
      • Need to understand how to adjust to the SLAC T2 phase-out, meet pledges, and remain productive.
      • Eric has been developing spreadsheets for a five-year plan.
      • Note this comes at the start of new NSF funding, which has constraints. There is also the impact of the R&D needed for the upgrade, in particular for the high-luminosity LHC. That funding is not expected to start for two or three years, so the question is how to fund the R&D in the meantime.
      • Pressure falls on the computing budget, since it is the largest share.
      • We've had a couple of meetings to look at pledges.
      • In order to make the exercise realistic, we need to understand the profile of retirements at the Tier2s.
      • cf. the tables below.
      • Need to take inventory of old equipment.  We need information from you.
      • FY16 spending on hardware? (should be in the table below)
        • AGLT2 - not exactly; some parts replacement.
        • MWT2 - no
        • NET2 - no 
        • SWT2 - no
        • WT2 - $18,000
      • Questions?
        • none
      Table 4: FY16 Budget activity and capacity increments as of June 2016
      Center | FY16 equipment budget ($) | FY16 installed equipment purchases ($) | Unspent FY16 ($) | CPU capacity increase with FY16 purchases (HS06) | Job slots (single logical threads) increase with FY16 purchases | Storage capacity increase with FY16 purchases (TB) | COMMITTED (but not installed) FY16 equipment funds ($) | CPU capacity increase with COMMITTED FY16 purchases (HS06) | Job slots (single logical threads) increase with COMMITTED FY16 purchases | Expected date CPU available to ATLAS | Storage capacity increase with COMMITTED FY16 purchases (TB) | Expected date storage available to ATLAS
      Tier1 | $2,217,000 | $360,313 | $1,856,687 | - | - | 2,450 | $650,000 | - | - | - | 7,526 | 6/30/2016
      AGLT2 | $250,000 | - | $250,000 | - | - | - | $0 | - | - | - | - | -
      MWT2 | $457,491 | - | $457,491 | - | - | - | $0 | - | - | - | - | -
      NET2 | $68,286 | - | $68,286 | - | - | - | $0 | - | - | - | - | -
      SWT2 | $260,000 | - | $260,000 | - | - | - | $0 | - | - | - | - | -
      WT2 | $200,000 | - | $200,000 | - | - | - | $0 | - | - | - | - | -
      USATLAS FACILITY | $3,452,777 | $360,313 | $3,092,464 | 0 | 0 | 2,450 | $650,000 | 0 | 0 | - | 7,526 | -
      USATLAS TIER2 | $1,235,777 | $0 | $1,235,777 | 0 | 0 | 0 | $0 | 0 | 0 | - | 0 | -
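
      As a quick cross-check of Table 4 (not from the meeting), the unspent and summary rows follow directly from the per-center figures; a small Python sketch that reproduces them:

        # Reproduce the derived columns of Table 4: unspent = budget - installed purchases,
        # and the FACILITY / TIER2 summary rows as sums over the per-center rows.
        budgets = {  # center: (FY16 equipment budget $, installed purchases $)
            "Tier1": (2_217_000, 360_313),
            "AGLT2": (250_000, 0),
            "MWT2": (457_491, 0),
            "NET2": (68_286, 0),
            "SWT2": (260_000, 0),
            "WT2": (200_000, 0),
        }

        for center, (budget, installed) in budgets.items():
            print(f"{center:6s} unspent = ${budget - installed:>12,}")

        tier2_centers = [c for c in budgets if c != "Tier1"]
        print(f"USATLAS TIER2 budget     = ${sum(budgets[c][0] for c in tier2_centers):,}")  # $1,235,777
        print(f"USATLAS FACILITY budget  = ${sum(b for b, _ in budgets.values()):,}")        # $3,452,777
        print(f"USATLAS FACILITY unspent = ${sum(b - i for b, i in budgets.values()):,}")    # $3,092,464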

       

       

      Equipment Retirements

      • Need to fold in retirement of aging equipment in the Facility
      • Should include CPU, disk, and networking
      • Please update the capacity spreadsheet ASAP

       

      Table 5: Tier 2 planned equipment retirements (ending FY16)
      Center | Total CPU to be retired (HS06) | Job slots to be retired (single logical threads) | Total disk to be retired (TB) | Comment
      AGLT2 | 9,542 | 0 | 0 | -
      MWT2 | 7,925 | 988 | 250 | -
      NET2 | 9,717 | 0 | 0 | -
      SWT2 | 8,224 | 1,002 | 400 | -
      WT2 | 0 | 0 | 0 | -
      USATLAS TIER2 | 35,407 | 1,990 | 650 | -
               
               
      Table 6: Tier 2 CPU retirements by year (HS06)
      Center 2016 2017 2018 2019
      AGLT2 9,542      
      MWT2 7,925 3,237 34,387 33,229
      NET2 9,717      
      SWT2 8,224 2,880    
      WT2 0 53,289 0 0
      SUM 35,407 6,117 34,387 33,229
               
               
      Table 7: Tier 2 storage retirements by year (TB)
      Center 2016 2017 2018 2019
      AGLT2 0 0 0 0
      MWT2 228 504 984 1680
      NET2 0 0 0 0
      SWT2 400 0 0 0
      WT2 0 0 0 0
      SUM 628 504 984 1,680
               
               
      Table 8: Tier 2 network gear upgrade cost
      Center 2016 2017 2018 2019
      AGLT2 $0 $0 $0 $0
      MWT2 $0 $50,000 $100,000 $100,000
      NET2 $0 $0 $0 $0
      SWT2 $0 $0 $0 $0
      WT2 $0 $0 $0 $0
      SUM $0 $50,000 $100,000 $100,000

       

      USATLAS LHCONE Status

      • Reported yesterday on the status of LHCONE peering
      • Slides here

       

    • 13:25 – 13:35
      Capacity News: Procurements & Retirements 10m
    • 13:35 – 13:45
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:45 – 13:50
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:50 – 13:55
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:55 – 14:00
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 14:00 – 14:05
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 – 15:25
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Michael Ernst
        • Smooth operations of Tier-1 services with the exception of network performance problems on the primary OPN circuit between BNL and CERN
          • The problem went unnoticed between April 6 and April 25
            • perfSONAR plots clearly show the problem
            • Brought up at yesterday's ESnet site coordinator meeting
              • The ESnet CTO suggested having ESnet engineers work with Shawn on monitoring improvements (a query sketch against the perfSONAR archive follows this report)
          • Connectivity (via BGP session) maintained over the entire period
          • Throughput significantly impaired leading to job completion delays
          • Switching to secondary OPN circuit solved the problem
            • ESnet engineering investigated the circuit and found packet loss caused by components close to the Virginia landing point
          • Another OPN network performance issue was reported for transfers from BNL to SARA
            • Independent from the BNL - CERN issue
        • PO for ~7.5 PB (usable) of magnetic disk arrived at vendor (RAID Inc)
          • Expect delivery in ~4 weeks
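
        A minimal sketch of the kind of loss check that could complement the perfSONAR plots mentioned above, assuming a perfSONAR measurement archive (esmond) is reachable. The archive host and test endpoints below are placeholders, not the actual BNL or CERN nodes, and the API fields may differ by deployment.

          # Query a perfSONAR measurement archive (esmond) for recent packet-loss results.
          import requests

          ARCHIVE = "https://ps-archive.example.org"   # hypothetical archive host
          WEEK = 7 * 24 * 3600                         # look back one week, in seconds

          meta = requests.get(
              ARCHIVE + "/esmond/perfsonar/archive/",
              params={
                  "source": "ps-lan.example.org",        # placeholder source test node
                  "destination": "ps-wan.example.org",   # placeholder destination test node
                  "event-type": "packet-loss-rate",      # loss measurements only
                  "time-range": WEEK,
              },
              timeout=30,
          )
          meta.raise_for_status()

          # Each metadata record points at its stored time series via a base-uri.
          for record in meta.json():
              for et in record.get("event-types", []):
                  if et.get("event-type") == "packet-loss-rate" and et.get("base-uri"):
                      points = requests.get(ARCHIVE + et["base-uri"],
                                            params={"time-range": WEEK}, timeout=30).json()
                      lossy = [p for p in points if p.get("val", 0) > 0.001]
                      print(record.get("source"), "->", record.get("destination"),
                            f"{len(lossy)}/{len(points)} samples above 0.1% loss")
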
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        We tuned the LMEM queue job slot allocation so as not to waste CPUs when LMEM jobs land on smaller-memory worker nodes. Operations are stable and near capacity for our site.

        We are scheduling a brief "at risk" OIM outage on Thursday morning to do reboots associated with the nss and nspr security updates, plus OSG RPM updates on our gatekeepers, as per OSG-SEC-2016-04-27.

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well.

        Testing CVMFS 2.2.1 (a quick per-node probe check is sketched after the list below)

        • Installed on all nodes
        • So far no problems seen
        • CVMFS 2.2.2 will be released soon (bug fixes for server, no client changes)
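
        A minimal per-node verification sketch, not the actual MWT2 test procedure: probe the mounted repositories with the standard cvmfs_config tool and report the installed client version. The repository names are assumptions.

          # Probe CVMFS repositories and report the installed client version.
          import subprocess

          REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch"]   # assumed repositories of interest

          # "cvmfs_config probe" checks that each repository mounts and answers.
          probe = subprocess.run(["cvmfs_config", "probe"] + REPOS,
                                 capture_output=True, text=True)
          print(probe.stdout.strip() or probe.stderr.strip())

          # The client version should read 2.2.1 while this test is in progress.
          version = subprocess.run(["cvmfs2", "--version"], capture_output=True, text=True)
          print(version.stdout.strip() or version.stderr.strip())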

         

        Disk pledge

        • dCache
          • Found dark space in dCache (space not allocated to any token)
          • Found dark space on backing store pools (let dCache autosize pools)
          • Added two new servers with 260 TB
          • A time sync problem on the dCache head nodes prevented allocation of free space to tokens
          • Net of 750 TB added to DATADISK
          • Brings MWT2 up to the 2015 pledge of 3300 TiB on DATADISK, GROUPDISK, USERDISK
        • To meet our remaining 2016 pledge of 4500 TiB we are attacking on three fronts
          • Bringing up an S3 object store on the Ceph system (can be 1200 TiB on day one); a short access sketch follows this list
          • Add RBD block devices on Ceph to be used by dCache
            • Appears as a disk device which dCache uses as a pool
            • Can immediately add to all space tokens
            • Usable through all dCache doors (SRM, WebDAV, xrootd)
            • Performance needs to be monitored
            • Future - dCache will directly support Ceph objects
          • Migrate all space tokens except DATADISK from dCache to Ceph
            • DATADISK will occupy all dCache space of 3654 TiB
            • GROUPDISK (812 TiB) and USERDISK (400 TiB) will put us over pledge
          • As dCache RBD pools or space tokens migrate, the S3 size can be reduced
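
        A minimal access sketch for the planned S3 object store, assuming a Ceph RADOS Gateway speaking the S3 protocol. The endpoint URL, credentials, and bucket name are placeholders, not MWT2's actual configuration.

          # Smoke-test an S3 endpoint served by a Ceph RADOS Gateway using boto3.
          import boto3

          s3 = boto3.client(
              "s3",
              endpoint_url="https://s3.mwt2.example.org",   # hypothetical RGW endpoint
              aws_access_key_id="ACCESS_KEY",               # placeholder credentials
              aws_secret_access_key="SECRET_KEY",
          )

          s3.create_bucket(Bucket="atlas-scratch")          # create a test bucket
          s3.put_object(Bucket="atlas-scratch", Key="smoke/hello.txt",
                        Body=b"object store smoke test")    # write a small object
          obj = s3.get_object(Bucket="atlas-scratch", Key="smoke/hello.txt")
          print(obj["Body"].read())                         # read it back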

         

        dCache to Ceph migration

        • The plan to migrate a space token from dCache to Ceph
          • Bestman SRM server (ceph-srm.mwt2.org:8443/srm/v2/server?SFN=)
          • gfal-sync to synchronize the backing-store copy of the space token on dCache with a copy on Ceph (see the copy sketch below)
          • Disable the current space token
          • Final sync
          • Enable the space token with the new SRM server
        • Still need WebDAV and xrootd doors on Ceph
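
        A rough sketch of the per-file copy step, assuming the gfal2 Python bindings. The dCache endpoint and both file paths are placeholders (only the Ceph SRM endpoint string above is from the notes); a real gfal-sync pass would walk the space-token namespace and copy whatever is missing or stale on the Ceph side.

          # Copy one file from the dCache SRM door to the Bestman/Ceph SRM door.
          import gfal2

          SRC = ("srm://dcache.mwt2.example.org:8443/srm/managerv2?SFN="
                 "/pnfs/example/userdisk/user/somefile.root")    # hypothetical source
          DST = ("srm://ceph-srm.mwt2.org:8443/srm/v2/server?SFN="
                 "/cephfs/userdisk/user/somefile.root")          # hypothetical destination path

          ctx = gfal2.creat_context()
          params = ctx.transfer_parameters()
          params.overwrite = True    # replace partial copies left by an earlier pass
          params.timeout = 3600      # per-file transfer timeout, in seconds

          ctx.filecopy(params, SRC, DST)
          print("copied", SRC, "->", DST)
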
      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations + full sites.

        Dan has worked through the SLURM/Condor issues and is in the last stages of switching us over to HTCondor-CE on the Harvard side. He's getting help from OSG and coordinating with Jose to get test pilots. There are still a couple of snags, but we should be able to switch over completely later this week. We are similarly in the late stages of doing the same on the BU side.

        We have the Mass Open Cloud hardware-as-a-service software working and can successfully build new ATLAS worker nodes on the MOC hardware at MGHPCC. There is unexpectedly low network bandwidth between the HaaS workers and GPFS, which we're tracking down. After that, we will test production jobs and then expand.

        We're getting lined up to purchase an additional 576 TB of usable storage, which would exhaust our hardware funds through the end of September.

        We've also made progress with BU networking on possible short-term WAN upgrades for NET2 (either 40 Gb/s or 4x10 Gb/s). The critical issue is actually the fees that the university pays to the NoX.

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - smooth operations

        - continuing with OSCER cluster commissioning; ready for PanDA test pilots now

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        1) Generally smooth operations

        2) An odd file size problem caused us to be set offline for a brief period; now resolved

        3) Feedback from campus networking staff regarding LHCONE / Science DMZ status

        4) Deciding on schedule(s) for downtime(s) (software upgrades, adding hardware, ...)

      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Smooth operation. 

    • 15:25 – 15:30
      AOB 5m