From Wei:
Vincent Garonne has moved back to Oslo and is still working on DDM. Mario Lassnig is now in charge of DDM, and Martin Barisits is in charge of RUCIO.
From Bob:
A concrete plan is nearly in place for implementing WLCG diskless sites for production. Such sites would utilize storage at a "nearby" T2 site. See: https://indico.cern.ch/event/642836/contributions/2608398/attachments/1467335/2268911/Diskless_28May.pdf
North American Throughput Meeting
=================================
31-May-2017, 10-11 AM Eastern
Attending: Dave, Ilija, Shawn, Marian, Philippe, Saul, Duncan, Garhan, Andy
https://indico.cern.ch/event/640627/
perfSONAR v4.0
- Update progress and issues
Shawn reported on OSG networking upgrades and data loss
Network Measurement Platform status and updates
Marian reported on ps_etf and meshconfig.grid.iu.edu, with a review of the services monitored (https://etf-ps.cern.ch/etf/check_mk). A sketch of consuming a meshconfig follows below.
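As a rough illustration of how a meshconfig service is consumed, here is a minimal Python sketch that downloads a mesh JSON and prints the measurement hosts it contains. The URL path is hypothetical, and the organizations -> sites -> hosts -> addresses nesting is an assumption about the classic meshconfig layout, not something stated in the meeting.

    # Sketch: list measurement hosts from a perfSONAR meshconfig JSON.
    # MESH_URL is a hypothetical example path; the nesting below assumes
    # the classic meshconfig layout (organizations -> sites -> hosts).
    import json
    import urllib.request

    MESH_URL = "https://meshconfig.grid.iu.edu/example-mesh.json"  # hypothetical

    with urllib.request.urlopen(MESH_URL) as resp:
        mesh = json.load(resp)

    for org in mesh.get("organizations", []):
        for site in org.get("sites", []):
            for host in site.get("hosts", []):
                for addr in host.get("addresses", []):
                    print(addr)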
Update on Analytics
Ilija reported on work to detect changes in packet loss, throughput, etc.; see the paper https://arxiv.org/pdf/1508.01280.pdf and the toy sketch below.
This method is being tried on the CERN-BNL link analysis. Machine learning is also being tried on perfSONAR data to find anomalies in
our data. (Someone working on Titan...need details)
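For intuition only, here is a toy Python sketch of change detection on a packet-loss series. It is a simple two-window mean-shift test, not the method from the paper above, and the window and threshold values are arbitrary assumptions.

    # Toy change detector: flag the first point where the mean of the next
    # `window` samples differs from the mean of the previous `window`
    # samples by more than `threshold` (absolute packet-loss fraction).
    def find_shift(series, window=10, threshold=0.02):
        for i in range(window, len(series) - window + 1):
            before = sum(series[i - window:i]) / window
            after = sum(series[i:i + window]) / window
            if abs(after - before) > threshold:
                return i
        return None

    # Example: packet loss steps from 0% to 5% at index 30.
    loss = [0.0] * 30 + [0.05] * 30
    print(find_shift(loss))  # flags a point near index 30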
Round-table
Saul mentioned that MGHPCC is down for maintenance and this was an opportunity to go to 100G. When the site is back up it will be
using 100G. Shawn asked about using that path for LHCONE; Saul: yes, it should be used.
Andy: minor update to pScheduler in the next few days (intermittent lock-up fix). IPv6 may be having some issues.
Marian: question about Docker support for the full toolkit? Andy: whether this will be done is being discussed and will be taken up
at next week's face-to-face in Ann Arbor.
Lots of Q&A and account setup for meshconfig.
AOB and next meeting
Demo of OpenvSwitch / OpenFlow + OpenStack for our next meeting.
Watch email for next meeting date.
From Wei and Andy:
The inverse RUCIO Name2Name component is ready. It is a plugin that identifies file replicas at different sites as the same file and thus improves the Xrootd proxy cache's hit rate; a toy illustration follows below. It requires support from Xrootd release 4.7, which will be ready soon.
Working with the RUCIO team to report the proxy cache's contents back --- in progress.
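For intuition, here is a minimal Python sketch of what an inverse name-to-name mapping buys a cache. The real component is an Xrootd plugin following RUCIO's deterministic naming; the site prefixes and paths below are purely hypothetical.

    # Sketch: strip a (hypothetical) site-specific prefix from each replica's
    # physical file name so all replicas of a file share one cache key.
    SITE_PREFIXES = [
        "/pnfs/site-a.example/atlas/rucio",    # hypothetical dCache prefix
        "/xrootd/site-b.example/atlas/rucio",  # hypothetical Xrootd prefix
    ]

    def inverse_n2n(pfn):
        for prefix in SITE_PREFIXES:
            if pfn.startswith(prefix):
                return pfn[len(prefix):]
        return pfn  # unknown prefix: pass through unchanged

    a = inverse_n2n("/pnfs/site-a.example/atlas/rucio/mc16/ab/cd/file.root")
    b = inverse_n2n("/xrootd/site-b.example/atlas/rucio/mc16/ab/cd/file.root")
    assert a == b  # both replicas map to the same cache entry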
All completed except BNL_LOCAL-condor, which needs "deprecate_oldmover = True" set. According to Xin:
We can't change it for BNL_LOCAL-condor for the time being, as the pilot
running ES jobs there isn't ready for it. The patch is said to be in
already; we can do the switch after it's released to production.
I guess we won't need this item on the agenda in the future.
Over the Memorial Day weekend dCache suddenly "crashed": everything looked "normal", but writes were timing out. This was tracked to "too many locks" and cleared by running a vacuum on all of the postgres DBs (a sketch of that step follows). A secondary issue then surfaced: the dCacheDomain was running out of memory (at 2 GB). Increasing this to 3 GB resolved the problem, and we have been running stably since then.
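A minimal sketch of that vacuum step, assuming psycopg2 and passwordless local access to the postgres instance; the connection details are placeholders, not our actual configuration.

    # Sketch: VACUUM every non-template PostgreSQL database on the head node.
    import psycopg2

    def vacuum_all(host="localhost"):
        admin = psycopg2.connect(host=host, dbname="postgres")
        admin.autocommit = True
        with admin.cursor() as cur:
            cur.execute("SELECT datname FROM pg_database WHERE NOT datistemplate")
            dbs = [row[0] for row in cur.fetchall()]
        admin.close()
        for db in dbs:
            conn = psycopg2.connect(host=host, dbname=db)
            conn.autocommit = True  # VACUUM cannot run inside a transaction
            with conn.cursor() as cur:
                cur.execute("VACUUM")
            conn.close()
            print("vacuumed", db)

    vacuum_all()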
Reminder that we will be down for a power outage from noon on Friday, June 23 until sometime on Monday, June 26, when all services can be restarted. We will do some software updates and dCache maintenance during this period.
Site is now full of jobs and operating well
Updated site to OSG 3.3.24
Retirement of several storage nodes in dCache
Illinois Campus Cluster lost hypervisors
USERDISK down to only 12 TB in use
SCRATCHDISK deletion is still an issue
We had the annual one-day MGHPCC-wide power shutdown last week. Notable improvements made:
1. Migrated NFS to new servers (mostly a Tier 3 issue)
2. 100G WAN gear was installed and configured. Use of 100G now waits only for NoX to switch us over.
3. USERDISK is almost empty, according to plan. Storage was moved to other tokens as requested by Armen.
4. Lots of NESE activity. A CEPH cluster built from Harvard-contributed equipment is serving as a test ATLAS DDM endpoint.
Smooth operations with only minor problems. High level of LIGO jobs for a few days.
- Nothing to report; all sites are running well
- We're seeing some lost-heartbeat jobs, but we believe they are not site-related: we're seeing them at multiple sites, BU is seeing them as well (right now, Tuesday afternoon), and in the past we've never been able to find a local source for them. We believe they are PanDA-related.