US ATLAS Computing Integration and Operations
-
-
13:00
→
13:15
Top of the Meeting 15m. Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))
US ATLAS Computing Facility Bi-weekly meeting
April 13, 2016
- Will be summarizing Facility Capacity (http://bit.ly/usatlas-capacity) for the quarterly report, pending final updates from NET2.
- New column defined to account for local storage: Installed Disk - Local Group Disk - Pledge (2016)
- Overall the Facility is meeting the April 2016 pledge in both storage and CPU
- Large storage increment coming from the Tier1
- SWT2 still has a significant CPU increment coming June 1.
- MWT2 is cheating a bit, as the installed Ceph storage is included but is still in transition to Rucio-managed space tokens. Getting experience with SRM LOCALGROUPDISK (LGD) over Ceph now. Full transition before June 30.
- MWT2 LGD is anomalously high compared to other centers.
- NET2 pending updates today/tomorrow
Table 1: Installed capacities as of March 2016, compared to the 2015 and 2016 pledges. CPU in HS06, job slots as single logical threads, disk in TB. BP = Beyond Pledge, LGD = Local Group Disk.

Center | CPU Installed | Job Slots | Disk Installed | LGD Allocated | BP CPU 2015 | BP Slots 2015 | BP Disk 2015 | BP CPU 2016 | BP Slots 2016 | BP Disk 2016 | Installed Disk-LGD-Pledge 2016
Tier1 | 132,627 | 13,884 | 11,600 | 500 | 22,627 | 2,369 | 2,600 | 4,627 | 484 | 600 | 100
AGLT2 | 73,738 | 7,500 | 3,712 | 265 | 51,738 | 5,262 | 1,312 | 48,738 | 4,957 | 712 | 447
MWT2 | 133,303 | 13,500 | 5,028 | 518 | 100,303 | 10,158 | 1,428 | 95,303 | 9,652 | 528 | 10
NET2 | 61,038 | 6,056 | 3,000 | 357 | 39,038 | 3,873 | 600 | 36,038 | 3,576 | 0 | -357
SWT2 | 62,375 | 6,826 | 3,530 | 164 | 40,375 | 4,418 | 1,130 | 37,375 | 4,090 | 530 | 366
WT2 | 53,289 | 4,464 | 3,890 | 175 | 31,289 | 2,621 | 1,490 | 28,289 | 2,370 | 890 | 715
USATLAS FACILITY | 516,370 | 52,230 | 30,760 | 1,979 | 285,370 | 28,702 | 8,560 | 250,370 | 25,129 | 3,260 | 1,281
USATLAS TIER2 | 383,742 | 38,346 | 19,160 | 1,479 | 262,742 | 26,333 | 5,960 | 245,742 | 24,644 | 2,660 | 1,181
- New ADC Technical Coordination Board launched
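The new "Installed Disk - LGD - Pledge" column can be cross-checked against the rest of the table. A minimal sketch (not part of the minutes), assuming the implied 2016 disk pledge is Installed Disk minus Beyond Pledge Disk (2016):

```python
# Sanity-check the new capacity column in Table 1.
# Disk figures in TB per site:
# (installed, lgd, beyond_pledge_2016, reported_new_column)
sites = {
    "Tier1": (11600, 500, 600, 100),
    "AGLT2": (3712, 265, 712, 447),
    "MWT2":  (5028, 518, 528, 10),
    "NET2":  (3000, 357, 0, -357),
    "SWT2":  (3530, 164, 530, 366),
    "WT2":   (3890, 175, 890, 715),
}

for name, (installed, lgd, beyond_2016, reported) in sites.items():
    # Assumption: pledged disk = installed disk - beyond-pledge disk (2016)
    pledge = installed - beyond_2016
    derived = installed - lgd - pledge  # the new column's definition
    assert derived == reported, name    # matches the value quoted in Table 1
```

Equivalently, the new column reduces to Beyond Pledge Disk (2016) minus LGD, which is why NET2 (0 TB beyond pledge, 357 TB of LGD) comes out negative.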
- First meeting yesterday, https://indico.cern.ch/event/517357/
- Open
- New ADC organization announced yesterday (slide 7)
- https://indico.cern.ch/event/512533/contribution/2025382/attachments/1256719/1855508/adcreorg-20160412.pdf
- Have requested that we nominate a US ATLAS Computing Facility person to fill the vacancy for "Infrastructure and Facilities". Let me know if you're interested.
-
13:15
→
13:25
Jupyter and the ATLAS Analytics Platform 10m. Speaker: Ilija Vukotic (University of Chicago (US))
-
13:25
→
13:35
Capacity News: Procurements & Retirements 10m
-
13:35
→
13:45
Production 10m. Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:45
→
13:50
Data Management 5m. Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:50
→
13:55
Data transfers 5m. Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:55
→
14:00
Networks 5m. Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
14:00
→
14:05
FAX and Xrootd Caching 5m. Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:25
→
15:25
Site Reports
-
14:25
BNL 5m. Speaker: Michael Ernst
Smooth operations at capacity over the course of the last 2 weeks.
- observe >5000 low priority Event Server jobs in MCORE queue
- HI reconstruction MCORE jobs require >24GB of memory
Working on procurement
- Secondary Disk
- Solution based on 14 RAID Inc 84-bay chassis with 8 TB Seagate PMR drives providing ~7.5 PB usable capacity. Status: ordered
- Compute
- In the process of ordering ~40 kHS06 based on Intel Broadwell-equipped servers
- Now offered in quantity by Dell and HP
AWS scaling
- Working with HTCondor team on scaling issues observed during the 100k core scaling test in March
- HTCondor team applied changes to their HTCondor <=> EC2 interaction protocol
- Demonstrated ability to increase number of acquired VMs from 5k to 10k
- They are now working on integrating the modified protocol component into a full release
-
14:30
AGLT2 5m. Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
It was the best of times, it was the worst of times.... Actually it has been a quiet time.
We have worked hard to bring offline machines back into the fold, and so we are now running close to our maximum job capacity. Overall we have been reasonably full most of the time, running a large number of LMEM jobs at any given time as well.
We had an incident on Sunday, lasting about 3 hours, where a tomcat6 update/restart on the GUMS servers picked up a new certificate that, unfortunately, was the http service cert rather than a copy of the hostcert. This was quickly corrected and the associated downtime was kept short.
-
14:35
MWT2 5m. Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site has been running well
Testing CVMFS 2.2.0
- Installed on all nodes
- So far no problems seen
- 2.2.1 has been released as part of OSG 3.3.11 (will upgrade soon)
RSV Service Certificate from CILogon
- Renewed host/service certificates for the CE now come from CILogon
- Subject changes from "DigiCert" to "opensciencegrid"
- Need to create group mapping in GUMS for certificates with new subject
dCache
- dCache upgraded to 2.13.29
- No major problems with upgrade
- WebDav and XrootD doors now on their own VMs (were previously on pool nodes)
- Some issues with WebDav using uct2-s13.mwt2.org vs webdav.mwt2.org (Fixed in AGIS)
DDM
- Deletion errors tracked to incorrect ownership/permissions on files in GROUPDISK
- Many files owned by root
- Found about 300 directories owned by root preventing access by usatlas1 account
- chown/chmod to fix
- deletions are now succeeding
- USERDISK constantly filling
- Tracked to two of Fred's students
- They will clean up
- Will be adding an additional 280 TB to dCache, some of which will be given to USERDISK
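The ownership repair above amounts to finding root-owned directories under GROUPDISK and handing them to the production account. A rough sketch, not the operators' actual script; the path, account name, and helper names here are hypothetical:

```python
import os
import pwd

def find_dirs_owned_by(root_path, uid=0):
    """Return directories under root_path whose owner matches `uid` (default: root)."""
    hits = []
    for dirpath, dirnames, _ in os.walk(root_path):
        for d in dirnames:
            full = os.path.join(dirpath, d)
            try:
                if os.lstat(full).st_uid == uid:
                    hits.append(full)
            except OSError:
                pass  # directory vanished mid-walk; skip it
    return hits

def reassign(dirs, account="usatlas1", dry_run=True):
    """chown/chmod the listed directories to `account` (running this needs root)."""
    uid = pwd.getpwnam(account).pw_uid
    gid = pwd.getpwnam(account).pw_gid
    for d in dirs:
        if dry_run:
            print(f"would chown {account} {d}")
        else:
            os.chown(d, uid, gid)
            os.chmod(d, 0o755)  # let the production account traverse and delete

# Example (hypothetical mount point):
# bad = find_dirs_owned_by("/pnfs/mwt2.org/atlasgroupdisk")
# reassign(bad, dry_run=False)
```

With ownership restored to the production account, Rucio-driven deletions succeed again, matching the outcome reported above.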
Disk pledge
- New disk on Ceph: 1.4 PB
- Bringing up Bestman SRM server, etc using Ceph as backing store (srm-ceph.mwt2.org:8443)
- Working on a migration scheme to move LOCALGROUPDISK to new system
-
14:40
NET2 5m. Speaker: Prof. Saul Youssef (Boston University (US))
Our new 550 TB is on-line and in space tokens.
Last weekend we had a GPFS/SRM problem appearing on Saturday, fixed on Sunday morning. This was related to the new 550 TB.
Unusually large numbers of "No such file or directory" errors, for deletions only, appeared and were ticketed. Nothing was wrong on our side. Resolved by:
Issue resolved, from Cedric: "The problem is that the Dark Reaper was using the lcg-utils implementation for the deletion (which is slightly broken) instead of the gfal one. I switched back to gfal and it's working now."
GGUS ticket 120723 was closed.
Smooth operations otherwise.
NESE proposal submitted to NSF (with Harvard, MIT, Northeastern, UMASS) :)
I just added 100 TB of free space from LOCALGROUPDISK (not in pledge) to DATADISK.
We are set up to add an additional 570 TB relatively inexpensively (two additional 60 drive MD3060e) if need be.
-
14:50
SWT2-UTA 5m. Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
-
15:25
→
15:30
AOB 5m