US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 - 13:15
      Top of the Meeting 15m
      Speakers: Kaushik De (University of Texas at Arlington (US)), Robert William Gardner Jr (University of Chicago (US))

      US ATLAS Computing Facility Bi-weekly meeting
      April 13, 2016

      • Will be summarizing Facility Capacity (http://bit.ly/usatlas-capacity) for the quarterly report, pending final updates from NET2.
      • New column defined to account for local storage: Installed Disk - Local Group Disk - Pledge (2016). A worked example of this arithmetic appears after Table 1 below.
        • Overall the Facility is meeting the April 2016 pledge in both storage and CPU.
        • Large storage increment coming from the Tier1.
        • SWT2 still has a significant CPU increment, coming June 1.
        • MWT2 is cheating a bit: the installed Ceph storage is included but is still in transition to Rucio-managed space tokens. Getting experience with SRM-fronted LGD over Ceph now; full transition before June 30.
        • MWT2 LGD is anomalously high compared to the other centers.
        • NET2 updates pending today/tomorrow.
      Table 1: Installed capacities as of March 2016, compared to the 2015 and 2016 pledges.
      Columns: CPU = Total CPU Installed (HS06); Slots = Job Slots Installed (single logical threads);
      Disk = Total Disk Installed (TB); LGD = Local Group Disk Allocated (TB);
      BP-CPU/BP-Slots/BP-Disk = Beyond-Pledge CPU (HS06) / Job Slots / Disk (TB) for the year shown;
      Extra = Installed Disk - LGD - Pledge (2016), in TB.

      Center            CPU      Slots   Disk    LGD    BP-CPU15  BP-Slots15  BP-Disk15  BP-CPU16  BP-Slots16  BP-Disk16  Extra
      Tier1             132,627  13,884  11,600  500    22,627    2,369       2,600      4,627     484         600        100
      AGLT2             73,738   7,500   3,712   265    51,738    5,262       1,312      48,738    4,957       712        447
      MWT2              133,303  13,500  5,028   518    100,303   10,158      1,428      95,303    9,652       528        10
      NET2              61,038   6,056   3,000   357    39,038    3,873       600        36,038    3,576       0          -357
      SWT2              62,375   6,826   3,530   164    40,375    4,418       1,130      37,375    4,090       530        366
      WT2               53,289   4,464   3,890   175    31,289    2,621       1,490      28,289    2,370       890        715
      USATLAS FACILITY  516,370  52,230  30,760  1,979  285,370   28,702      8,560      250,370   25,129      3,260      1,281
      USATLAS TIER2     383,742  38,346  19,160  1,479  262,742   26,333      5,960      245,742   24,644      2,660      1,181
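
      A worked example of the new "Installed Disk - LGD - Pledge" column, using the AGLT2 row of Table 1 (a sketch only; the 3,000 TB figure below is an assumed 2016 disk pledge inferred from the table, not a number stated in these notes):

          # AGLT2 values in TB, taken from Table 1
          installed_disk = 3712   # Total Disk Installed
          lgd            = 265    # Local Group Disk allocated
          pledge_2016    = 3000   # assumed 2016 disk pledge (not listed in the table)

          extra = installed_disk - lgd - pledge_2016
          print(extra)            # 447, matching the "Extra" column of Table 1 for AGLT2
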
      • New ADC Technical Coordination Board launched 
        • First meeting yesterday, https://indico.cern.ch/event/517357/
        • Open
      • New ADC organization announced yesterday (slide 7)
        • https://indico.cern.ch/event/512533/contribution/2025382/attachments/1256719/1855508/adcreorg-20160412.pdf
        • Have requested that we nominate a US ATLAS Computing Facility person to fill the vacancy for "Infrastructure and Facilities". Let me know if you're interested.
    • 13:15 - 13:25
      Jupyter and the ATLAS Analytics Platform 10m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:25 - 13:35
      Capacity News: Procurements & Retirements 10m
    • 13:35 - 13:45
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:45 - 13:50
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:50 - 13:55
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:55 - 14:00
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 14:00 - 14:05
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 - 15:25
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Michael Ernst

        Smooth operations at capacity over the course of the last 2 weeks.

        • Observing >5,000 low-priority Event Service jobs in the MCORE queue
        • Heavy-ion (HI) reconstruction MCORE jobs require >24 GB of memory

        Working on procurement

        • Secondary Disk
          • Solution based on 14 RAID Inc 84-bay chassis with 8 TB Seagate PMR drives providing ~7.5 PB usable capacity. Status: ordered
        • Compute
          • In the process of ordering ~40 kHS06 of Intel Broadwell-based servers
            • Now offered in quantity by Dell and HP

        AWS scaling

        • Working with HTCondor team on scaling issues observed during the 100k core scaling test in March
        • HTCondor team applied changes to their HTCondor <=> EC2 interaction protocol
          • Demonstrated ability to increase number of acquired VMs from 5k to 10k
          • They are now working on integrating the modified protocol component into a full release 
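
        An illustrative way to watch the scale of such an acquisition test (a sketch only, not the BNL/HTCondor tooling; assumes the boto3 package, configured AWS credentials, and a placeholder region name):

            # Count currently running EC2 instances in one region
            import boto3

            ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region
            paginator = ec2.get_paginator("describe_instances")
            filters = [{"Name": "instance-state-name", "Values": ["running"]}]

            running = 0
            for page in paginator.paginate(Filters=filters):
                for reservation in page["Reservations"]:
                    running += len(reservation["Instances"])

            print("running instances:", running)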

         

      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        It was the best of times, it was the worst of times....  Actually it has been a quiet time.

        We have worked hard to bring offline machines back into the fold, and so we are now running close to our maximum job capacity.  Overall we have been reasonably full most of the time, running a large number of LMEM jobs at any given time as well. 

        We had an incident on Sunday, lasting about 3 hours, where a tomcat6 update/restart on the GUMS servers picked up a new certificate that, unfortunately, was the http service certificate and not a copy of the host certificate. This was quickly corrected and the associated downtime was kept short.

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well.

        Testing CVMFS 2.2.0

        • Installed on all nodes
        • So far no problems seen
        • 2.2.1 has been released as part of OSG 3.3.11 (will upgrade soon)

         

        RSV Service Certificate from CILogon

        • Renewed host/service certificates for the CE now come from CILogon
        • The certificate subject changes from "DigiCert" to "opensciencegrid"
        • Need to create a group mapping in GUMS for certificates with the new subject
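
        A quick way to confirm which subject a renewed certificate actually carries before adding the GUMS mapping (a sketch only; the certificate path is a placeholder):

            # Print the subject and issuer of a host/service certificate
            import subprocess

            cert = "/etc/grid-security/rsvcert.pem"   # placeholder path
            out = subprocess.check_output(
                ["openssl", "x509", "-in", cert, "-noout", "-subject", "-issuer"])
            print(out.decode())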

         

        dCache

        • dCache upgraded to 2.13.29
        • No major problems with upgrade
        • WebDAV and xrootd doors now on their own VM (previously on pool nodes)
        • Some issues with WebDAV using uct2-s13.mwt2.org vs. webdav.mwt2.org (fixed in AGIS)

         

        DDM

        • Deletion errors tracked to incorrect ownership/permissions on files in GROUPDISK (a sketch of the fix follows this list)
          • Many files owned by root
          • Found about 300 directories owned by root, preventing access by the usatlas1 account
          • chown/chmod to fix
          • Deletions are now succeeding
        • USERDISK constantly filling
          • Tracked to two of Fred's students
          • They will clean up
          • Will be adding an additional 280 TB to dCache, some of which will be given to USERDISK
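
        A minimal sketch of the GROUPDISK ownership fix referenced above (illustrative only, not the exact commands used at MWT2; the mount point is a placeholder):

            # Find directories owned by root under the GROUPDISK area and hand them
            # back to usatlas1 so deletions can succeed
            import os
            import pwd

            groupdisk = "/atlas/atlasgroupdisk"             # placeholder path
            usatlas1 = pwd.getpwnam("usatlas1")

            for dirpath, dirnames, filenames in os.walk(groupdisk):
                if os.stat(dirpath).st_uid == 0:            # directory owned by root
                    os.chown(dirpath, usatlas1.pw_uid, usatlas1.pw_gid)
                    os.chmod(dirpath, 0o775)                # group-writable, as in the chmod fix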

         

        Disk pledge

        • New disk on Ceph: 1.4 PB
        • Bringing up a Bestman SRM server, etc., using Ceph as the backing store (srm-ceph.mwt2.org:8443)
        • Working on a migration scheme to move LOCALGROUPDISK to the new system
      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Our new 550 TB is on-line and in space tokens.

        Last weekend we had a GPFS/SRM problem that appeared on Saturday and was fixed on Sunday morning. This was related to the new 550 TB.

        An unusually large number of "No such file or directory" errors, for deletions only, appeared and was ticketed. Nothing was wrong on our side.

        Resolution from Cedric: "The problem is that the Dark Reaper was using the lcg-utils implementation for the deletion (which is slightly broken) instead of the gfal one. I switched back to gfal and it's working now." GGUS 120723 was closed.
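
        For reference, a single-file deletion through the gfal2 Python bindings looks roughly like the sketch below (illustrative only, not the Rucio reaper code; the SURL is a placeholder):

            # Delete one replica via gfal2 instead of lcg-utils
            import gfal2

            ctx = gfal2.creat_context()
            surl = "srm://example.site.edu/pnfs/path/to/replica"   # placeholder SURL
            ctx.unlink(surl)                                       # remove the file at that SURL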

        Smooth operations otherwise.  

        NESE proposal submitted to NSF (with Harvard, MIT, Northeastern, UMASS) :)

        I just added 100 TB of free space from LOCALGROUPDISK (not in the pledge) to DATADISK.

        We are set up to add an additional 570 TB relatively inexpensively (two additional 60-drive MD3060e enclosures) if need be.

         

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - smooth operations

        - Lucille scheduled maintenance for OS and firmware updates

        - still validating new OSCER SLURM CE

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Testing batch VMs on OpenStack. There are I/O issues on both the OpenStack VMs and bare-metal machines. Continuing to investigate how to address this issue.
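
        A rough way to compare sequential write throughput between an OpenStack VM and a bare-metal node (an illustrative sketch only, not the actual WT2 investigation; the scratch path and sizes are placeholders):

            # Time a ~1 GiB sequential write and report throughput
            import os
            import time

            path = "/tmp/io_test.dat"               # placeholder scratch location
            block = b"\0" * (4 * 1024 * 1024)       # 4 MiB per write
            n_blocks = 256                          # ~1 GiB total

            start = time.time()
            with open(path, "wb") as f:
                for _ in range(n_blocks):
                    f.write(block)
                f.flush()
                os.fsync(f.fileno())
            elapsed = time.time() - start

            print("%.1f MiB/s" % (n_blocks * 4 / elapsed))
            os.remove(path)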

    • 15:25 - 15:30
      AOB 5m