US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Alessandro Di Girolamo:

      As you might remember, we take the “installed capacity” from Rebus to get the average 
      power of each site's logical CPUs; we then use this in the ATLAS job monitoring 
      dashboard to calculate the HS06 provided by each site. With the BDII stopped, those 
      numbers are no longer in Rebus. We have discussed a possible solution, which is to 
      populate Rebus with the numbers from
      
      
      http://myosg.grid.iu.edu/rgsummary/xml?datasource=summary&summary_attrs_showwlcg=on&all_resources=on&gridtype=on&gridtype_1=on&active=on&active_value=1&disable_value=1%22
      
      <HEPSPEC>
      <APELNormalFactor>

       

      Basically, these are the numbers that we enter into our OSG resources for HS06 and APEL Normalization Factors.  If this is going to be used going forward, then each of us has to keep these numbers up to date.

      ====== Are they correct for your site? ============

      Please make sure that they ARE correct, now and going forward.
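      As a sketch of how those two tags could be pulled out of the MyOSG summary feed: the snippet below parses the XML with Python's standard library and collects HEPSPEC and APELNormalFactor per resource. The element layout in the inline sample (a Resource with a Name and a WLCGInformation block) is an assumption for illustration, not a verified schema; check it against the live feed before relying on it.

```python
# Hedged sketch: extract HS06 (HEPSPEC) and APEL normalization factors
# from MyOSG summary XML.  The nesting used in `sample` is assumed.
import xml.etree.ElementTree as ET

def extract_wlcg_numbers(xml_text):
    """Return {resource_name: (hepspec, apel_factor)} from MyOSG summary XML."""
    root = ET.fromstring(xml_text)
    numbers = {}
    for res in root.iter("Resource"):          # every <Resource> element
        name = res.findtext("Name")
        hepspec = res.findtext(".//HEPSPEC")   # search any descendant
        apel = res.findtext(".//APELNormalFactor")
        if name and hepspec:
            numbers[name] = (float(hepspec), float(apel) if apel else None)
    return numbers

# Small illustrative sample; resource name and values are made up.
sample = """<ResourceSummary>
  <ResourceGroup>
    <Resources>
      <Resource>
        <Name>MWT2_CE</Name>
        <WLCGInformation>
          <HEPSPEC>11.5</HEPSPEC>
          <APELNormalFactor>2875</APELNormalFactor>
        </WLCGInformation>
      </Resource>
    </Resources>
  </ResourceGroup>
</ResourceSummary>"""

print(extract_wlcg_numbers(sample))  # -> {'MWT2_CE': (11.5, 2875.0)}
```

      In practice the same function would be fed the body of the myosg.grid.iu.edu URL above after fetching it over HTTP.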

       

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 14:00
      Site movers 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      1. Internal/External xrootd door (doc is still not fully ready; see Mario's comment)

      https://docs.google.com/document/d/1FKbXCHZ-NA__nFlELUpm_D32OOdXGgq786GeEQotOUA/edit#

      2. Site movers: what needs to be done in AGIS:

      • use_newmover=True
      • deprecate_oldmover=True
      • wansinklimit=0 (or None?)

      All production US sites use the new site movers. US sites not using the new site mover ("use_newmover" != True):

      • ANALY_BNL_CLOUD
      • ANALY_BNL_EC2E1
      • ANALY_BNL_EC2W1
      • ANALY_BNL_EC2W2
      • ANALY_BNL_T3-condor
      • ANALY_NERSC
      • ANALY_ORNL_Titan
      • ANALY_TEST-APF-condor
      • ANALY_TEST-APF2-condor
      • ANALY_WISC_ATLAS
      • BNL_CLOUD
      • BNL_CLOUD_MCORE
      • BNL_EC2E1
      • BNL_EC2E1_MCORE
      • BNL_EC2W1
      • BNL_EC2W1_MCORE
      • BNL_EC2W2
      • BNL_EC2W2_MCORE
      • ES_ORNL_Titan
      • GOOGLE_COMPUTE_ENGINE
      • NERSC-PDSF-sge
      • ORNL_Titan_MCORE (does it matter?)
      • TESTGLEXEC
      • TWTEST
      • TestPilot
      • Titan_Harvester_MCORE
      • UMESHTEST
      • WT2_Install

      Sites using new site movers but not deprecating old movers:

      • ANALY_AGLT2_TEST_SL6-condor
      • ANALY_OU_OCHEP_SWT2-condor
      • BNL_LOCAL-condor
      • Lucille_CE
      • Lucille_MCORE
      • NERSC_Cori_2
      • OUHEP_OSG
      • OU_OCHEP_SWT2-condor
      • OU_OSCER_ATLAS
      • OU_OSCER_ATLAS_MCORE
      • OU_OSCER_ATLAS_OPP

      Sites using the new site mover but without wansinklimit set to 0 (or None):

      • ANALY_AGLT2_TEST_SL6-condor
      • ANALY_CONNECT
      • ANALY_CONNECT_SHORT
      • ANALY_CONNECT_TEST
      • ANALY_MWT2_HIMEM_MCORE
      • ANALY_MWT2_MCORE
      • ANALY_OU_OCHEP_SWT2-condor
      • NERSC_Cori
      • NERSC_Cori_2
      • NERSC_Edison
      • NERSC_Edison_2
      • SLAC_ES-lsf
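      The three checks above can be expressed as a small audit. The snippet below is a hypothetical sketch: it assumes the per-queue parameters have already been fetched from AGIS into a plain dict, and it mirrors the tiered lists above (a queue is reported only for the first check it fails).

```python
# Hedged sketch of the AGIS flag audit described above.  The input format
# {queue_name: {param: value}} is an assumption for illustration; the
# parameter names match the AGIS fields quoted in the text.
def audit_mover_flags(queues):
    """Return {problem: [queue names]} for the three site-mover checks."""
    problems = {"no_newmover": [], "old_mover_active": [], "wansinklimit": []}
    for name, params in sorted(queues.items()):
        if params.get("use_newmover") is not True:
            problems["no_newmover"].append(name)        # first list above
        elif params.get("deprecate_oldmover") is not True:
            problems["old_mover_active"].append(name)   # second list above
        elif params.get("wansinklimit") not in (0, None):
            problems["wansinklimit"].append(name)       # third list above
    return problems

# Illustrative input; the parameter values here are made up.
example = {
    "BNL_CLOUD": {"use_newmover": False},
    "OU_OSCER_ATLAS": {"use_newmover": True, "deprecate_oldmover": False},
    "ANALY_CONNECT": {"use_newmover": True, "deprecate_oldmover": True,
                      "wansinklimit": 1000},
}
print(audit_mover_flags(example))
```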

       

    • 14:00 14:10
      OS performance testing 10m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:10 14:25
      HPCs integration 15m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 14:25 16:00
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • dCache:  
          • slow release of free space: fixed by forcing the dCache cleaner to run more often
          • high memory usage on admin node: fixed by increasing memory on the server
          • IPv6 fully functional now
        • AFS phase-out on Tier1 computing farm:  ongoing, in rolling mode
        • singularity test:  
          • new condor queue and PanDA queues are being tested via HC jobs
          • individual production jobs also succeeded
          • expect to go in production soon
        • AGIS cleanup:  deleted some obsolete BNL PanDA queues
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Last week our "OtherVO" gatekeeper suddenly stopped working.  I first noticed this on April 19, when our space-usage.json file had stopped updating as of mid-afternoon on April 18.  Then I noticed that incoming CE jobs were going into Hold and never emerging as real batch-system jobs.  Basically, all services were timing out.

        Chased this hard, even going so far as to totally rebuild that gatekeeper (which is good in the sense that GRAM should now be gone from it entirely).  Nothing worked until I temporarily moved it from the grid.umich.edu domain to the aglt2.org domain, at which point both gfal-copy and condor_ce_ping began to work.

        Subsequently found that UM subscribes to an IPS (Intrusion Prevention System) service that had deployed a complete block on TLS/SSL traffic.  Once our subnet was scanned to their satisfaction it was white-listed, and the gatekeeper then came back online.

        Beware of the possibility of a similar situation at your home universities.  It has also affected the MWT2 and LIGO collaborators at UM.

        A dCache OOM issue hit us when the number of simultaneous SRM transfers climbed above anything we had seen here before (a peak of 18k per 10 minutes was observed a few days ago in srmwatch), in conjunction with dCache 3.0.11 simply using more memory.  We increased the dCacheDomain Java instance memory from 2/4 GB to 4/8 GB and the issue was resolved.
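        For reference, a per-domain memory increase like this would typically go in the dCache layout file. The fragment below is a hedged sketch: dcache.java.memory.heap/direct are the standard dCache property names, but which of the two quoted numbers maps to heap vs. direct memory is an assumption.

```
# Hypothetical layout-file fragment raising the dCacheDomain JVM memory.
# The heap vs. direct split shown here is assumed, not taken from the site.
[dCacheDomain]
dcache.java.memory.heap=8192m
dcache.java.memory.direct=4096m
```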

        Beyond this, services are stable.  Today we will complete the transition to the last of our new N2048 switches for public 1 Gb NIC attachment.

        For the queue ANALY_AGLT2_TEST_SL6-condor, the wansinklimit and deprecate_oldmover flags are now properly set (as of 1:30 pm today).

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is now full of jobs and operating well

         

        Minor problems/fixes during the last two weeks

        • Site was running nearly all SCORE HIMEM jobs even though many MCORE jobs were activated
          • Because we had run so many MCORE jobs for so long (and no SCORE), the fair-share weighting was skewed
          • Changed the condor knob "PRIORITY_HALFLIFE" from half a day to 5 minutes so priorities balance out faster
        • The three MWT2 Squids were heavily loaded by Frontier requests
          • This impacted access to the CVMFS repositories, causing slow mounts and slow data access
          • Created three CVMFS-only squids on the local CVMFS Stratum-1 servers
        • NSS bug with certificates
          • Fix pushed out to all nodes on Tuesday
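        For reference, PRIORITY_HALFLIFE is set in the HTCondor negotiator's configuration and is expressed in seconds, so a change like the one described would look like:

```
# 5-minute user-priority half-life; a half-day setting would have been
# 43200 (HTCondor's default is 86400 = 1 day).
PRIORITY_HALFLIFE = 300
```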

         

        OSG 3.3.23-1 installed on all nodes

         

        USERDISK and GROUPDISK decommissioning continuing

        • Waiting on ADC to change Panda Q to use SCRATCHDISK for output by ANALY Qs
        • Reducing size of GROUPDISK and adding freed space to DATADISK

         

        Storage decommissioning has begun

        • In FY17 we are scheduled to retire over 1PB of old storage
        • The first server was retired, reducing available storage by 120 TB
        • 6 more servers to retire

         

        At UC, SciDMZ upgrade work continues.  New fiber and an Arista switch are being deployed this week.  This should improve WAN transfers currently limited by the distribution switch between uct2 and the campus SciDMZ border router.

        At UC, the CRAC2 unit compressor was replaced; the system is fully back up and running well.  The CRAC1 and CRAC3 units have been assessed by an outside vendor; we meet tomorrow to discuss additional needed repairs.

        Greg is building a hot-aisle containment system to improve cooling efficiency.

         

      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        New 1.5PB storage in production.

        WAN and MGHPCC networking upgrades are underway: 100G to NoX/LHCONE and multiple 40G links to the NESE fabric.  The equipment is partly here.  We are involved in a re-design of the networking on the MGHPCC floor.  Networking from the Northeastern University pods to NET2 has been improved so that we can ramp up Mass Open Cloud production.

        There were 5000 transferring jobs from HU_ATLAS_Tier2 this morning.  That is high, but it seems to be gradually going down; we're watching the situation.

        We have smooth operations and full sites otherwise. 

         

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - The Langston machine room had an A/C failure, which caused a brief downtime. It is not completely resolved yet, but the head and storage nodes and a few compute nodes are back up, so we are running at reduced capacity

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        Both sites were rebuilt for the recent security patch from Red Hat.

        UTA_SWT2

        • Updated the deprecate_oldmover flag in AGIS for all PanDA queues.  No impact observed as of yet
        • Production is running fine

         

        SWT2_CPB

        • Updated the deprecate_oldmover flag in AGIS for all PanDA queues.  We don't expect any impact, but are still watching.
        • Production running fine.
        • Still need to modify AGIS to set up the internal Xrootd redirector to support direct reads for analysis jobs.  We now have rights to modify AGIS
      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:00 16:05
      AOB 5m