US ATLAS Computing Facility
1. WBS 2.3 Facility Management News
   Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
US ATLAS Computing Facility Assessment
https://docs.google.com/document/d/1y-3OtJKn52xsLZze3iURMiie3nrekEslrcdY0GYW--I/edit?usp=sharing
2. OSG-LHC
   Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
Topical Report
3. WBS 2.3.1 Tier1 Operations
   Speakers: Eric Christian Lancon (CEA/IRFU, Centre d'etude de Saclay, Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
US Cloud Status
4. US Cloud Operations Summary
   Speaker: Mark Sosebee (University of Texas at Arlington (US))
5. BNL
   Speaker: Xin Zhao (Brookhaven National Laboratory (US))
6. AGLT2
   Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
AGLT2 had its storage blacklisted for three days, even though the original problem was just a brief glitch introduced by our VMware migration/upgrade. The blacklisting prevented HammerCloud from putting our site back online until it was removed.
On the positive side, we finally managed to upgrade our VMware infrastructure from v5.5 running on old R630 nodes to v6.7 running on new R740 hardware. There is still a lot of tuning to do, but services are running much better now.
A lot of cabling work is ongoing as well, including correcting and updating labels, switch port descriptions, PDU socket descriptions, and the corresponding Visio diagrams.
New hardware (9 C6420 servers at UM) is cabled and ready to be built soon.
We keep seeing high-load HTCondor worker nodes; 2-3 nodes are killed every day due to high load (>100 per core). This may be caused by specific jobs, usually OSG/CMS jobs.
The HTCondor head node (a virtual machine) was unreachable for a few hours during the VMware update, but this did not affect running jobs.
The dCache head node was upgraded from 4.2.21 to 4.2.23 to fix gPlazma authentication bugs (authentication would fail every couple of days). The other pool/door nodes still run 4.2.21.
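As a minimal sketch of the per-core load criterion mentioned above (the >100-per-core threshold used when killing overloaded worker nodes), the check could look like the following. The function names and the use of `/proc`-style load averages are our illustration, not AGLT2's actual tooling:

```python
# Hypothetical sketch of the per-core load check described above: flag a
# worker node whose 1-minute load average exceeds a per-core threshold
# (the report cites >100 per core as the kill criterion).
import os

LOAD_PER_CORE_LIMIT = 100.0  # threshold quoted in the report


def load_per_core(loadavg_1min, n_cores):
    """Return the 1-minute load average normalized per core."""
    return loadavg_1min / n_cores


def should_flag(loadavg_1min, n_cores, limit=LOAD_PER_CORE_LIMIT):
    """True if the node exceeds the per-core load limit."""
    return load_per_core(loadavg_1min, n_cores) > limit


if __name__ == "__main__":
    one_min = os.getloadavg()[0]  # current 1-minute load average
    cores = os.cpu_count() or 1
    status = "FLAG" if should_flag(one_min, cores) else "ok"
    print(f"load/core = {load_per_core(one_min, cores):.2f} ({status})")
```

On a 64-core node, a load of 6500 (about 101.6 per core) would be flagged, while a fully loaded but healthy node (load 64, i.e. 1.0 per core) would not.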
7. MWT2
   Speaker: Lincoln Bryant (University of Chicago (US))
GPFS filesystem issues at Illinois on Sunday; the filesystem was restored yesterday and the UIUC nodes were brought back online.
Compute node purchases at both IU (Dell) and UIUC (HP), mostly with FY18 funds, are to be submitted shortly.
Storage expansion, edge node for k8s/xcache/slate, ML node, network switch expansion at UC all submitted (some delivered).
8. NET2
   Speaker: Prof. Saul Youssef (Boston University (US))
9. SWT2
   Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
10. HPC Operations
   Speaker: Doug Benjamin (Duke University (US))
Jumbo/co-jumbo Event Service task 16368172 has duplicate events. A Jira ticket was created to track the debugging progress:
https://its.cern.ch/jira/browse/ATLASES-73
Until the problem is solved, no more jumbo/co-jumbo ES tasks will be run.
This will cause Theta to be paused (we have 9.5M Theta core hours to go; 88% of the allocation has been used).
OLCF has used 86M Titan core hours, 107% of the allocation.
NERSC (ERCAP allocation): 4.2M NERSC hours used out of 120M (~3.5%). We need to use 12M hours by April 10th or we lose 25% of the unused balance.
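The NERSC figures above can be checked with a few lines of arithmetic. This sketch uses the numbers from the report; the forfeiture rule is our reading (25% of whatever balance is unused if the 12M-hour checkpoint is missed), so treat the "at risk" figure as an assumption-laden illustration:

```python
# Sketch of the NERSC ERCAP allocation arithmetic from the report.
ALLOCATION_MHRS = 120.0   # total allocation, millions of NERSC hours
used_mhrs = 4.2           # hours used so far
checkpoint_mhrs = 12.0    # usage required by April 10th

pct_used = 100.0 * used_mhrs / ALLOCATION_MHRS
unused_mhrs = ALLOCATION_MHRS - used_mhrs
# Assumed penalty: 25% of the unused balance is taken back if the
# checkpoint is missed.
at_risk_mhrs = 0.25 * unused_mhrs

print(f"Used: {pct_used:.1f}% of allocation")        # Used: 3.5% of allocation
print(f"At risk: {at_risk_mhrs:.2f}M NERSC hours")   # At risk: 28.95M NERSC hours
```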
11. Analysis Facilities - SLAC
   Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
12. Analysis Facilities - BNL
   Speaker: William Strecker-Kellogg (Brookhaven National Lab)
Nothing to report; the pool is quite busy.
13. AOB