US ATLAS Computing Integration and Operations
-
-
13:00
→
13:05
Top of the Meeting 5m
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:05
→
13:15
Singularity / CentOS 7 deployment in the US cloud 10m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:15
→
13:20
ADC news and issues 5m
Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:20
→
13:25
Production 5m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5m
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5m
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Today's OSG area coordinators meeting is on networking. Attend if you want to see the details: https://opensciencegrid.github.io/management/area-coordinators/
The new HEPiX Network Function Virtualization working group is starting up and will have its kick-off meeting by the end of this month; Marian Babik and Shawn McKee are co-chairs. Sign up if you are interested in participating at: https://listserv.in2p3.fr/cgi-bin/wa?SUBED1=HEPIX-NFV-WG
Primary ESnet link to CERN for BNL had a brief 2 minute outage this morning. [ESNET-20180117-001]
CERN-513-CR5 <BCYP1046> WASH-CR5 circuit outage around 4:35 AM Eastern time.
No major networking issues for US sites that I know of.
-
13:45
→
13:50
XCache 5m
Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
13:55
HPCs integration 5m
Speaker: Taylor Childers (Argonne National Laboratory (US))
-
13:55
→
14:30
Site Reports
-
13:55
BNL 5m
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Meltdown & Spectre
  - Interactive nodes and CEs are patched; farm nodes will follow in a rolling fashion.
  - Performance degradation:
    - almost none in our HS06 measurements
    - ADC reports a ~7% hit in KV tests
- Extra CPUs (~15 kHS06) were added to the ATLAS farm during the holiday break to help with reprocessing and other campaigns.
- The new compute node purchase has arrived and will be brought online in the coming weeks.
-
14:00
AGLT2 5m
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
Taking advantage of the fact that all MSU WN were powered off for BPS building circuit work, we started the Meltdown/Spectre rpm updates there around January 5. All WN at both sites have now been updated with this fix, along with all gatekeepers, desktops, interactive machines and dCache pool servers. For the latter, we will be examining some kernel parameter changes that would hopefully gain back performance lost due to the kernel and other rpm updates.
See: https://community.centminmod.com/threads/linux-kernel-security-updates-for-spectre-meltdown-vulnerabilities.13648/
It suggests adding the following to the kernel boot line: noibrs noibpb nopti
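As a purely illustrative sketch (not an AGLT2 procedure), assuming an SL/CentOS host whose boot entries are managed with grubby, those flags could be appended to the kernel command line roughly as follows; note that they disable the mitigations, trading security for performance:

    #!/usr/bin/env python
    # Illustrative only: append the suggested flags to every installed kernel
    # via grubby. A reboot is needed for the new command line to take effect.
    import subprocess

    FLAGS = "noibrs noibpb nopti"

    def append_boot_flags(flags=FLAGS):
        # grubby edits the boot-loader entry of each installed kernel
        subprocess.check_call(["grubby", "--update-kernel=ALL", "--args=" + flags])

    if __name__ == "__main__":
        append_boot_flags()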
HS06 runs indicate less than a 1% decrease in performance on modern processors from the kernel updates. This decrease seems to be more than offset by an increase in performance when the same machine is updated to SL7. Older processors seem to either be unchanged, or perform slightly better on HS06 with the kernel updates.
As John Hover points out, though, I/O will be the real sticking point. We are trying to obtain some data on this from muon calibration runs, but it is not yet available. The jobs consist of running Athena to convert a calibstream fragment to a calib ntuple. Results (expected later today, or perhaps tomorrow) will be posted to the usatlas-t2-l list when they are ready.
We now have a small SL7 gatekeeper/cluster running at AGLT2 (~100 cores) and have created an SCORE production queue (AGLT2_SL7) for testing. As of this writing, we have seen only a few software jobs (nagrun.sh -v..., about 130 such jobs in 3 days' time) and do not otherwise appear to be getting many pilots. We will follow up on this.
Otherwise operation has been smooth.
-
14:05
MWT2 5m
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
-
14:10
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Tested the kernel patches; they do not break GPFS.
We'll have a DDM downtime next week (~24 hours) to perform GPFS maintenance, rebuild our SRM (to resolve a low-level proxy ticket), and apply kernel updates.
Lots of NESE activity preparing for first major deployment. Quotes coming in. Power/space/cooling/WAN ready.
LHCONE peering still needs to be resumed; we are not currently peering with LHCONE, but this causes no immediate problems.
Smooth NET2 running otherwise.
-
14:15
SWT2-OU 5m
Speaker: Dr Horst Severini (University of Oklahoma (US))
- I will be late for the meeting since I'll be in a proposal meeting from 11 am till ...?
- New OSCER SE now in production; 700 TB xrootd filesystem.
- Switched all OU PQs from Lucille_SE to OU_OSCER_ATLAS_SE.
- Seems to work well for the most part; some jobs fail, still debugging errors.
- OU_OCHEP_SWT2 jobs fail at stage-out with a failed gfal2 dependency; not sure why, since they should be using xrdcp instead, which works fine for HC jobs (see the sketch after this list). We have asked for help.
- Singularity tests successful on OU_OSCER_ATLAS_TEST; awaiting further tests.
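A minimal sketch of the kind of stage-out fallback being debugged, assuming the gfal2 Python bindings and the xrdcp client are available; the paths and destination URL are hypothetical placeholders, and this is not the actual pilot mover code:

    # Illustrative only: try the gfal2 bindings and fall back to xrdcp,
    # which is what works for HC jobs on these nodes.
    import subprocess

    def stage_out(local_path, dest_url):
        # local_path must be an absolute path on the worker node
        try:
            import gfal2                      # fails where the gfal2 dependency is broken
            ctx = gfal2.creat_context()
            params = ctx.transfer_parameters()
            ctx.filecopy(params, "file://" + local_path, dest_url)
        except ImportError:
            subprocess.check_call(["xrdcp", "-f", local_path, dest_url])

    # Hypothetical usage:
    # stage_out("/tmp/out.root", "root://xrootd.example.edu//atlas/scratch/out.root")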
-
14:20
SWT2-UTA 5m
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
New kernels and microcode have been installed on the CEs and compute nodes at both clusters to mitigate the recently disclosed CPU flaws.
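A small sketch of one way to confirm the mitigation state on a patched node, assuming a kernel new enough to expose /sys/devices/system/cpu/vulnerabilities (not part of the SWT2 procedure):

    # Illustrative only: print the kernel's reported Meltdown/Spectre status.
    import glob, os

    for path in sorted(glob.glob("/sys/devices/system/cpu/vulnerabilities/*")):
        with open(path) as f:
            print("%s: %s" % (os.path.basename(path), f.read().strip()))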
SWT2_CPB:
- Suffered a power outage on Saturday (1/13) during a test of the backup generator
- We are following up with our facilities personnel to determine the root cause of the issue and when it will be resolved
- We have a lingering issue with one data server that is causing problems when reporting used space. Will look at this later today
- Production activities are fine
- Updated XRootD to version 4.8.0.1 on dataservers/redirector
UTA_SWT2:
- Moved old storage from SWT2_CPB to this cluster and brought it online.
- We have an open GGUS ticket concerning network outages. We are waiting to hear back from our network manager to understand why this happened; it has not repeated.
- Updated XRootD to version 4.8.0.1 on dataservers/redirector
- Suffered a power outage on Saturday (1/13) during a test of the backup generator
-
14:25
WT2 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
→
14:35
AOB 5m