US ATLAS Computing Integration and Operations
13:00 → 13:15
Top of the Meeting 15m
Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))
NSF Tier2 funding (Kaushik and Paolo)
- FY2017-FY2021
- Need to understand how to adjust to the SLAC T2 phase-out, meet pledges, and remain productive.
- Eric has been developing spreadsheets for a five year plan.
- Note this comes at the start of the new NSF funding period, which has constraints, along with the impact of the R&D needed for the upgrade, in particular for the high-luminosity LHC. That funding is not expected to start for two or three years, so the question is how to fund the R&D in the meantime.
- The pressure falls on the computing budget, since it is the largest share.
- We've had a couple of meetings to look at pledges.
- In order to make the exercise realistic, we need to understand the profile of retirements at the Tier2s.
- cf. the tables below.
- Need to take inventory of old equipment. We need information from you.
- FY16 spending on hardware? (should be in the table below)
- AGLT2 - not exactly; some parts replacement.
- MWT2 - no
- NET2 - no
- SWT2 - no
- WT2 - $18,000
- Questions?
- none
Table 4: FY16 budget activity and capacity increments as of June 2016
- Tier1: FY16 equipment budget $2,217,000; installed FY16 equipment purchases $360,313; unspent FY16 $1,856,687; storage capacity increase with FY16 purchases 2,450 TB; COMMITTED (but not installed) FY16 equipment funds $650,000; storage capacity increase with COMMITTED purchases 7,526 TB; expected date storage available to ATLAS 6/30/2016
- AGLT2: FY16 equipment budget $250,000; unspent FY16 $250,000; COMMITTED FY16 funds $0
- MWT2: FY16 equipment budget $457,491; unspent FY16 $457,491; COMMITTED FY16 funds $0
- NET2: FY16 equipment budget $68,286; unspent FY16 $68,286; COMMITTED FY16 funds $0
- SWT2: FY16 equipment budget $260,000; unspent FY16 $260,000; COMMITTED FY16 funds $0
- WT2: FY16 equipment budget $200,000; unspent FY16 $200,000; COMMITTED FY16 funds $0
- USATLAS FACILITY: budget $3,452,777; installed $360,313; unspent $3,092,464; CPU increase with FY16 purchases 0 HS06; job slot increase 0; storage increase 2,450 TB; COMMITTED funds $650,000; COMMITTED CPU increase 0 HS06; COMMITTED job slot increase 0; COMMITTED storage increase 7,526 TB
- USATLAS TIER2: budget $1,235,777; installed $0; unspent $1,235,777; CPU/job slot/storage increases 0; COMMITTED funds $0; COMMITTED increases 0

Equipment Retirements
- Need to fold in retirement of aging equipment in the Facility
- Should include CPU, disk, and networking
- Please update the capacity spreadsheet ASAP (see the aggregation sketch after the tables below)
Table 5: Tier 2 planned equipment retirements (ending FY16)
Center | Total CPU to be retired (HS06) | Job slots to be retired (single logical threads) | Total disk to be retired (TB) | Comment
AGLT2 | 9,542 | 0 | 0 |
MWT2 | 7,925 | 988 | 250 |
NET2 | 9,717 | 0 | 0 |
SWT2 | 8,224 | 1,002 | 400 |
WT2 | 0 | 0 | 0 |
USATLAS TIER2 | 35,407 | 1,990 | 650 |

Table 6: Tier 2 CPU retirements by year (HS06)
Center | 2016 | 2017 | 2018 | 2019
AGLT2 | 9,542 | | |
MWT2 | 7,925 | 3,237 | 34,387 | 33,229
NET2 | 9,717 | | |
SWT2 | 8,224 | 2,880 | |
WT2 | 0 | 53,289 | 0 | 0
SUM | 35,407 | 6,117 | 34,387 | 33,229

Table 7: Tier 2 storage retirements by year (TB)
Center | 2016 | 2017 | 2018 | 2019
AGLT2 | 0 | 0 | 0 | 0
MWT2 | 228 | 504 | 984 | 1,680
NET2 | 0 | 0 | 0 | 0
SWT2 | 400 | 0 | 0 | 0
WT2 | 0 | 0 | 0 | 0
SUM | 628 | 504 | 984 | 1,680

Table 8: Tier 2 network gear upgrade cost
Center | 2016 | 2017 | 2018 | 2019
AGLT2 | $0 | $0 | $0 | $0
MWT2 | $0 | $50,000 | $100,000 | $100,000
NET2 | $0 | $0 | $0 | $0
SWT2 | $0 | $0 | $0 | $0
WT2 | $0 | $0 | $0 | $0
SUM | $0 | $50,000 | $100,000 | $100,000
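To make the retirement profile easier to fold into the five-year planning spreadsheets, here is a minimal aggregation sketch in Python, using only the figures from Tables 6 and 7 above; it reports per-year and cumulative Tier 2 capacity loss. Replacement purchases would be layered on top of this in the actual spreadsheet.

```python
# Minimal sketch: aggregate the Tier 2 retirement schedules from Tables 6 and 7
# to show how much CPU (HS06) and disk (TB) drops out of the facility each year.
# Figures are copied from the tables above; blank cells are treated as 0.

from itertools import accumulate

YEARS = [2016, 2017, 2018, 2019]

# Table 6: CPU retirements by year (HS06)
cpu_retirements = {
    "AGLT2": [9542, 0, 0, 0],
    "MWT2":  [7925, 3237, 34387, 33229],
    "NET2":  [9717, 0, 0, 0],
    "SWT2":  [8224, 2880, 0, 0],
    "WT2":   [0, 53289, 0, 0],  # per Table 6; the table's own 2017 SUM (6,117) omits this entry
}

# Table 7: storage retirements by year (TB)
disk_retirements = {
    "AGLT2": [0, 0, 0, 0],
    "MWT2":  [228, 504, 984, 1680],
    "NET2":  [0, 0, 0, 0],
    "SWT2":  [400, 0, 0, 0],
    "WT2":   [0, 0, 0, 0],
}

def yearly_totals(table):
    """Sum retirements over all sites for each year."""
    return [sum(row[i] for row in table.values()) for i in range(len(YEARS))]

cpu_by_year = yearly_totals(cpu_retirements)
disk_by_year = yearly_totals(disk_retirements)

print("Year  CPU retired (HS06)  cumulative  Disk retired (TB)  cumulative")
for year, cpu, cpu_cum, disk, disk_cum in zip(
        YEARS, cpu_by_year, accumulate(cpu_by_year),
        disk_by_year, accumulate(disk_by_year)):
    print(f"{year}  {cpu:>18,}  {cpu_cum:>10,}  {disk:>17,}  {disk_cum:>10,}")
```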
USATLAS LHCONE Status
- Yesterday we reported on the status of LHCONE peering
- Slides here
13:25 → 13:35
Capacity News: Procurements & Retirements 10m
13:35 → 13:45
Production 10m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
13:45 → 13:50
Data Management 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
13:50 → 13:55
Data transfers 5m
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
13:55 → 14:00
Networks 5m
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
14:00 → 14:05
FAX and Xrootd Caching 5m
Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
14:25 → 15:25
Site Reports
14:25
BNL 5m
Speaker: Michael Ernst
- Smooth operations of Tier-1 services with the exception of network performance problems on the primary OPN circuit between BNL and CERN
- Unnoticed between April 6 and April 25
- perfSONAR plots clearly show the problem
- Brought up at yesterday's ESnet site coordinator meeting
- The ESnet CTO suggested having ESnet engineers work with Shawn on monitoring improvements
- Connectivity (via BGP session) maintained over the entire period
- Throughput significantly impaired leading to job completion delays
- Switching to secondary OPN circuit solved the problem
- ESnet engineering investigated the circuit and found packet loss caused by components close to the Virginia landing point
- Another OPN network performance issue was reported for transfers from BNL to SARA
- Independent from the BNL - CERN issue
- Unnoticed between April 6 and April 25
- PO for ~7.5 PB (usable) of magnetic disk arrived at vendor (RAID Inc)
- Expect delivery in ~4 weeks
14:30
AGLT2 5m
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
We tuned the LMEM queue job slot allocation so as not to waste CPUs when LMEM jobs land on smaller-memory WNs. Operations are stable and near capacity for our site.
We are scheduling a brief "at risk" OIM outage on Thursday morning to do reboots associated with the nss and nspr security updates, plus OSG RPM updates on our gatekeepers, as per OSG-SEC-2016-04-27.
14:35
MWT2 5m
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site has been running well
Testing CVMFS 2.2.1
- Installed on all nodes
- So far no problems seen (see the spot-check sketch below)
- CVMFS 2.2.2 will be released soon (bug fixes for server, no client changes)
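As an illustration of the kind of spot-check involved, here is a minimal sketch assuming password-less ssh to the workers and a hypothetical node list; it queries the installed cvmfs RPM and probes the repositories with the standard cvmfs_config tool.

```python
#!/usr/bin/env python3
# Sketch: spot-check the installed CVMFS client version and repository health on
# a few worker nodes over ssh. Node names are placeholders; assumes the cvmfs
# RPM and the standard cvmfs_config tool are present on the workers.

import subprocess

NODES = ["worker01.example.org", "worker02.example.org"]  # hypothetical hosts

def run_remote(node, command):
    """Run a command on a node via ssh and return stdout (or the error text)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout.strip() if result.returncode == 0 else "ERROR: " + result.stderr.strip()

for node in NODES:
    version = run_remote(node, "rpm -q cvmfs")        # e.g. cvmfs-2.2.1-<release>
    probe = run_remote(node, "cvmfs_config probe")    # mounts and checks each repository
    print(f"{node}: {version}")
    print(f"  {probe}")
```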
Disk pledge
- dCache
- Found dark space in dCache (space not allocated to any token)
- Found dark space on backing store pools (let dCache autosize pools)
- Added two new servers with 260 TB
- Time sync problem on dCache head nodes prevented allocation of free space to tokens
- Net of 750TB added to DATADISK
- Brings MWT2 up to 2015 pledge of 3300TiB on DATADISK, GROUPDISK, USERDISK
- To meet our remaining 2016 pledge of 4500 TiB, we are attacking on three fronts:
- Bringing up S3 object store on Ceph system (can be 1200TiB on day one)
- Add RBD block devices on Ceph to be used by dCache (see the sketch after this list)
- Appears as a disk device which dCache uses as a pool
- Can immediately add to all space tokens
- Use all dCache doors (srm, webdav, xrootd)
- Performance needs to be monitored
- Future - dCache will directly support Ceph objects
- Migrate all space tokens except DATADISK from dCache to Ceph
- DATADISK will occupy all dCache space of 3654TiB
- GROUPDISK (812TiB) and USERDISK (400TiB) will put us over pledge
- As dCache RBD pools or space tokens migrate, the S3 size can be reduced
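For the RBD route mentioned above, here is a minimal sketch using the rados/rbd Python bindings to carve out a block image on the Ceph cluster; the RADOS pool and image names are illustrative, and mapping the image on a dCache pool node (rbd map, mkfs, mount) plus the dCache pool definition are separate steps not shown here.

```python
#!/usr/bin/env python3
# Sketch: create an RBD image on the Ceph cluster that a dCache pool node can
# later map as a block device and use as backing store for a pool.
# Requires the rados and rbd Python bindings and a readable ceph.conf.
# The RADOS pool and image names below are illustrative placeholders.

import rados
import rbd

RADOS_POOL = "dcache-rbd"          # hypothetical RADOS pool backing the images
IMAGE_NAME = "mwt2-dcache-pool01"  # hypothetical image name
IMAGE_SIZE = 100 * 1024**4         # 100 TiB, thin-provisioned

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(RADOS_POOL)
    try:
        rbd.RBD().create(ioctx, IMAGE_NAME, IMAGE_SIZE)
        print(f"created rbd image {RADOS_POOL}/{IMAGE_NAME} "
              f"({IMAGE_SIZE / 1024**4:.0f} TiB)")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

# On the dCache pool node the image would then be mapped (rbd map), formatted,
# mounted, and pointed at by a dCache pool definition; dCache sees it as an
# ordinary local disk.
```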
dCache to Ceph migration
- The plan to migrate a space token from dCache to Ceph (copy step sketched below):
- Bestman SRM server (ceph-srm.mwt2.org:8443/srm/v2/server?SFN=)
- gfal-sync to synchronize the backing-store copy of the space token on dCache to a copy on Ceph
- disable current spacetoken
- final sync
- enable space token with new SRM server
- Still need webdav and xrootd doors on Ceph
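Here is a minimal sketch of the per-file copy step behind such a sync, using the gfal2 Python bindings. The destination follows the Bestman endpoint quoted above; the source endpoint, paths, and file list are illustrative placeholders (in practice the listing would come from Rucio or from walking the dCache namespace), and this is not the gfal-sync tool itself.

```python
#!/usr/bin/env python3
# Sketch: copy files belonging to one space token from the dCache SRM door to
# the new Bestman/Ceph SRM endpoint using the gfal2 Python bindings.
# The source host name, paths, and file list are placeholders.

import gfal2

SRC_BASE = "srm://dcache.mwt2.org:8443/srm/managerv2?SFN=/pnfs/mwt2.org/atlas/userdisk"  # placeholder
DST_BASE = "srm://ceph-srm.mwt2.org:8443/srm/v2/server?SFN=/atlas/userdisk"              # Bestman endpoint above

# In practice this list would come from Rucio or a namespace walk.
FILES = ["user/someuser/data/file1.root"]

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.overwrite = False        # do not clobber files already synced
params.create_parent = True     # build the directory tree on the Ceph side
params.timeout = 3600

for relpath in FILES:
    src = f"{SRC_BASE}/{relpath}"
    dst = f"{DST_BASE}/{relpath}"
    try:
        ctx.filecopy(params, src, dst)
        print(f"copied {relpath}")
    except gfal2.GError as err:
        print(f"failed  {relpath}: {err}")
```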
14:40
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Smooth operations + full sites.
Dan has worked through the SLURM/Condor issues and is in the last stages of switching us over to HTCondor-ce on the Harvard side. He's getting help from OSG and coordinating with Jose to get test pilots. There are a couple of snags still, but we should be able to switch over completely later this week. We are similarly in the late stages of doing the same on the BU side.
We have the Mass Open Cloud hardware-as-a-service software working and can successfully build new ATLAS worker nodes in the MOC hardware at MGHPCC. There is a mystery with unexpectedly low network bandwidth between the HaaS worker and GPFS which we're tracking down. After that, we will be testing production jobs and then expanding.
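A minimal sketch of the kind of point-to-point bandwidth check used to track this down, assuming iperf3 is available on both ends and using a hypothetical server host name; it runs a short parallel TCP test from the HaaS worker and reports the received rate.

```python
#!/usr/bin/env python3
# Sketch: measure TCP throughput from a MOC/HaaS worker to a host on the GPFS
# side with iperf3 and report the received rate in Gb/s.
# Assumes "iperf3 -s" is already running on the server host; the host name
# below is a placeholder.

import json
import subprocess

GPFS_SIDE_HOST = "gpfs-gw.example.org"   # hypothetical iperf3 server

result = subprocess.run(
    ["iperf3", "-c", GPFS_SIDE_HOST, "-t", "10", "-P", "4", "-J"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"received {gbps:.2f} Gb/s from {GPFS_SIDE_HOST}")
```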
We're getting lined up to purchase an additional 576 TB of usable storage, which would exhaust our hardware funds through the end of September.
We've also made progress with BU networking about possible short-term WAN upgrades of NET2 (either 40Gb/s or 4x10Gb/s). The critical issue is actually fees that the University pays to the NoX.
14:45
14:50
SWT2-UTA 5m
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
14:55
15:25 → 15:30
AOB 5m