US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)

Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 – 13:15
      Top of the Meeting 15m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
    • 13:15 – 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 – 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 – 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 – 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 – 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 – 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:15 – 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Except for a brief outage of the SE, stable operations at the Tier-1
          • the dCache core domain (admin node) ran out of memory in the early morning of Dec 31. This was quickly fixed by Hiro with a restart of the domain. As this had not happened before, we suspect the situation was related to unusual operations in conjunction with performance optimization of the replica-creation processes.
        • The Tier-1 was flagged by ADC operations for DDM transfer errors caused by missing files.
          • Our investigation has shown that this is not a problem at the BNL site. All files reported as lost in the context of this ticket were created by jobs running at the ORNL_Titan site, which uses the BNL SE to store its job output files. For the data transfer between ORNL_Titan and BNL, the pilot uses a specific site mover to move the produced files to the BNL SE. We found that some transfers suffer from a high failure rate and need many retries until, according to the site mover, they eventually succeed. However, even when they are reported as successfully transferred, the files do not exist at the destination SE (BNL). We suspect a race condition in the site mover code, most likely due to timing issues in the transfer failure-recovery section, that leads to the deletion of a file that was in fact successfully transferred to BNL; see the illustrative sketch at the end of this report. Note that FTS handles such cases correctly, but FTS is not managing these particular transfers.
          • Missing files were declared lost by the T1 Storage Management Group
        • Updated the FY16 capacity/procurement table.
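
        The following is a minimal, hypothetical sketch of how such a race could arise; it is not the actual pilot site mover code, and the copy/cleanup commands (gfal-copy, gfal-rm) and function names are placeholders. The point is only that a failure-recovery path which cleans up the destination asynchronously can remove a copy that in fact completed, after success has already been reported.

            # Hypothetical illustration of the suspected race (placeholder names only).
            import subprocess
            import threading

            def cleanup_destination(dst):
                # Failure-recovery path: remove what is assumed to be a partial copy.
                subprocess.run(["gfal-rm", dst], check=False)

            def transfer(src, dst, timeout=600):
                # Watchdog removes the destination if the transfer "takes too long".
                watchdog = threading.Timer(timeout, cleanup_destination, args=(dst,))
                watchdog.start()
                try:
                    subprocess.run(["gfal-copy", src, dst], check=True)
                    return True  # success is reported here ...
                finally:
                    # ... but if the timer fired between the copy finishing and this
                    # cancel(), the cleanup has already removed the good copy at the SE.
                    watchdog.cancel()
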
      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        With few exceptions, AGLT2 ran smoothly over the holiday break.

        A disk shelf at MSU lost 3 disks of a RAID-6 configuration just prior to New Year's Day, and we were unable to recover it (RAID-6 tolerates only two concurrent disk failures).  There were 16087 files on the pool, 3900 of which had replicas elsewhere at AGLT2.  Those replicas were promoted to primary copies; the remaining files were declared lost.  Shawn documented the procedure he followed on our Wiki:

        https://www.aglt2.org/wiki/bin/view/AGLT2/RecoverFromLostPool
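
        Purely as an illustration of the bookkeeping step in that procedure (the Wiki page above is the authoritative reference), the recovery amounts to splitting the files that lived on the dead pool into those with a surviving replica, which are promoted to primary copies, and those without, which are declared lost; the function and argument names below are hypothetical:

            # Hypothetical sketch: partition the files that were on the lost pool.
            def split_lost_pool(pool_files, replica_map):
                """pool_files: LFNs that were stored on the dead pool.
                replica_map: dict mapping LFN -> set of other pools holding a replica."""
                promote, declare_lost = [], []
                for lfn in pool_files:
                    if replica_map.get(lfn):
                        promote.append(lfn)       # replica elsewhere: promote to primary copy
                    else:
                        declare_lost.append(lfn)  # no surviving copy: declare lost to DDM
                return promote, declare_lost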

        The pool itself consisted of 750GB drives that are now in short supply at AGLT2, so we chose to permanently retire this shelf to supplement available spares.

        It was expected that the 10 MSU R630s would be in production by New Year's, but they are not yet ready; perhaps they will be installed by the end of this week.  Multiple infrastructure changes were required to bring these machines online, involving IT support at MSU as well as AGLT2 moves, and this has all taken longer to effect than expected.


        Some adjustments were made to the tables in the WLCG-v37 tab to reflect the actual deployment of the final 2015 Dell R630 WN purchases.

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is running well

        • Full of ATLAS jobs (MCORE, SCORE, Analy and Opport)
        • Rolling update to Condor 8.4.2
        • Renaming some Illinois nodes

         

        Networking

        • Illinois ⇔ Indiana high latency fixed
        • Was taking inefficient path via I2
        • Now direct route via ESnet

         

        New hardware status

        • UChicago
          • 18 Ceph Servers
          • Rolling online of new servers / upgrade of existing servers.
        • Indiana
          • 24 R630 (E5-2650 v3, 128GB) have been delivered
          • Waiting on PDU to provide power (110 V ⇒ 220 V)
      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Reported pre-Christmas reprocessing experience at the latest ADC meeting https://indico.cern.ch/event/469712/

        We had a few hours of downtime yesterday due to a bad LUN problem in GPFS.  This affects GPFS metadata response times and can cause BeStMan to slow down or become unresponsive.  The problem is over as of today.

        Smooth running other than that.

        No capacity updates.  550 TB + 24 worker nodes are still on the way from Dell.

         

      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - sites running well, no problems over the holidays

        - network quite stable as well

        - seeing same lost heartbeat jobs as Wei at SLAC, at both OU and LU, so this cannot be a site issue

         

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        We had a shutdown at UTA_SWT2 for a couple of days for electrical work in the facility.  The system came back with few problems.

         

        No major problems observed for SWT2_CPB during the break.

         

        Working on putting storage online.

      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        smooth operation most of the time

        have to limit s-core (single-core) production due to a high rate of "lost heartbeat" failures, an issue seen at many sites.

        5 disks in one storage node failed.  No data was lost, but we have to move data around.

        have one new CPU node online.  Will use Elasticsearch to compare the configuration of bare-metal nodes and variously sited VMs.
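
        A rough sketch of what such a comparison might look like, assuming configuration snapshots are indexed one document per host; the endpoint, the index name ("node-config") and the field names ("hostname", "node_type", "kernel") are assumptions for illustration, not the actual WT2 schema:

            # Query Elasticsearch for configuration documents of one class of nodes
            # and compare a field of interest (e.g. kernel version) across classes.
            from elasticsearch import Elasticsearch

            es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

            def configs_by_type(node_type, size=1000):
                resp = es.search(
                    index="node-config",
                    body={"query": {"term": {"node_type": node_type}}, "size": size},
                )
                return {h["_source"]["hostname"]: h["_source"] for h in resp["hits"]["hits"]}

            bare_metal = configs_by_type("bare-metal")
            vms = configs_by_type("vm")
            print(sorted({c.get("kernel") for c in bare_metal.values()}))
            print(sorted({c.get("kernel") for c in vms.values()}))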

        storage PO received by vendor

        working on new batch node procurement: 14 M630s, ~$110K

    • 15:15 – 15:20
      AOB 5m