ATLAS Sites Jamboree

Name: ATLAS Sites Jamboree
Start: 2018-03-05T09:00:00+01:00
End: 2018-03-07T19:00:00+01:00
Location: CERN

5 Mar 2018, 09:00 → 7 Mar 2018, 19:00 Europe/Zurich

40/S2-D01 - Salle Dirac (CERN)

Alessandro Di Girolamo (CERN), Johannes Elmsheuser (Brookhaven National Laboratory (US))

Description

Details and notes of the ATLAS Sites Jamboree, March 2018

This ATLAS Sites Jamboree is part of the ATLAS SW&C technical week

And on Thursday there is the ADC&SW Joint session

Monday 5 March
- Intro and overviews
  
  Conveners: Alessandro Di Girolamo (CERN), Johannes Elmsheuser (Brookhaven National Laboratory (US))
  - 1
    
    ATLAS Sites Jamboree welcome
    
    Speaker: Alessandro Di Girolamo (CERN)
    
    ATLAS Sites Jamboree - Intro.pdf
  - 2
    
    SW & Computing overview
    
    Speakers: Davide Costanzo (University of Sheffield (GB)), Torre Wenaus (Brookhaven National Laboratory (US))
    
    SCWeek-Intro-Wenaus-201803.pdf
  - 3
    
    Software overview
    
    Speakers: Edward Moyse (University of Massachusetts (US)), Walter Lampl (University of Arizona (US))
    
    SW intro talk 05032018.pdf
  - 4
    
    Reports from: MC production, Reprocessing, Derivation, Distributed Analysis Shifts/DAST
    
    Speakers: David Michael South (Deutsches Elektronen-Synchrotron (DE)), Douglas Gingrich (University of Alberta (CA)), Dr Eirik Gramstad (University of Oslo (NO)), Farida Fassi (Universite Mohammed V (MA)), Josh McFayden (CERN), Mayuko Kataoka (University of Texas at Arlington (US)), Michael Ughetto (Stockholm University (SE)), Monica Dobre (IFIN-HH (RO))
    
    DAST_ReportFarida.pdf
    
    derivations_gramstad.pdf
    
    mccoord_050318.pdf
    
    ReprocForSitesJamboree.pdf
- Sites plans and reports
  - Availability reports (SAM), small sites consolidation, problematic sites follow up (40mins?)
  - Site reports, list of questions to be sent
  Conveners: Alessandra Forti (University of Manchester (GB)), Xin Zhao (Brookhaven National Laboratory (US))
  - 5
    
    Technical evolutions at CMS sites towards Run 3
    
    Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
    
    CMS-Site-Evolution.pdf
  - 6
    
    Tier3 + Tier2 consolidation
    
    Speaker: Dr Stephane Jezequel (LAPP (CNRS-USMB))
    
    Jamboree_5March_T2_3.pdf
  - 7
    
    HPC in ATLAS overview
    
    Speaker: Doug Benjamin (Duke University (US))
    
    ATLAS HPC
  - 14:50
    
    Coffee
  - 8
    Clouds/Sites report
    
    a) IT cloud (INFN-T1)
    
    Speaker: Luca dell'Agnello (INFN)
    
    INFN-T1-20180305.pdf
    
    b) UK cloud
    
    Speaker: Alastair Dewhurst (STFC-Rutherford Appleton Laboratory (GB))
    
    ADCSite20180305.pdf
    
    c) ES cloud (PIC)
    
    Speaker: Aresh Vedaee (The Barcelona Institute of Science and Technology (BIST) (ES))
    
    05032018_ATLAS_Site_Jamboree_ES_Cloud_Aresh.pdf
    
    d) CA cloud
    
    Speaker: Reda Tafirout (TRIUMF (CA))
    
    CA_report.pdf
    
    e) ND cloud
    
    Speaker: Francesco Giovanni Sciacca (Universitaet Bern (CH))
    
    ATLAS-NDcloud-report.pdf
    
    f) DE Cloud
    
    Speaker: Michal Svatos (Acad. of Sciences of the Czech Rep. (CZ))
    
    DE_cloud_report.pdf
    
    g) US cloud
    
    Speaker: Robert William Gardner Jr (University of Chicago (US))
    
    2018.03.05 US ATLAS Sites Report.pdf
    
    Google doc slides
Tuesday 6 March
- General support & Compute
  
  Minutes
  
  Conveners: Ivan Glushkov (University of Texas at Arlington (US)), Mario Lassnig (CERN)
  
  >>> Compute & Support Session Minutes
  
  > DPA Overview - Ivan
  
  > Site Support - Xin
  >> Baseline requirements
  - Storage - SRM, diskless
  >> Other services
  - Computing - 25/75 - Analy/Production or UCORE
  - Migration to CentOS
  >> Trends
  - Saving manpower for sites
     - Pilot improvements
     - Go UCORE
     - Go Diskless
     - Go Containers
     - Go Standard software
     - Still the flexibility should be kept - One Box Challenge, LSM, hybrid staging model
  >> AGIS
     - Get rid of obsolete data
     - Keep the documentation up-to-date
  >> Site support
  - Go to ADC Meetings
  - Follow DPA mailing list
  - Respond to GGUS
  - Contact your cloud support list atlas-adc-cloud-XX
  - Site Documentation - FAQ TWiki started
  
  > Site Monitoring - Ryu
  >> Availability & Reliability definition
  >> Probes
  - SE and CE Services
  >> Monitoring
  - WLCG dashboard
  - MONIT in preparation
  - Monthly reports are available
  - Re-computation can be requested
  >> Recent Changes
     - etf_default in AGIS to select which queue to be used
  >> Future Improvements
  >> HS06
  - Data Selection
  - CPU Ranking
  - HS06 vs events/sec per site - there are sites which are off from the values predicted by HS06 in AGIS
  >> Summary
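The HS06 vs events/sec comparison in the site monitoring notes could be checked along the following lines. This is a minimal sketch, not the method shown in the talk: the site names, the numbers, the assumption that events/sec scales linearly with HS06, and the 30% tolerance are all hypothetical.

```python
from statistics import median


def flag_off_sites(sites, tolerance=0.3):
    """Return the sites whose events/sec per HS06 is far from the median ratio.

    sites maps a site name to (hs06, events_per_sec); both values are
    illustrative inputs, not real AGIS or monitoring fields.
    """
    ratios = {name: ev / hs06 for name, (hs06, ev) in sites.items()}
    reference = median(ratios.values())
    return sorted(
        name
        for name, ratio in ratios.items()
        if abs(ratio - reference) / reference > tolerance
    )


# Hypothetical per-core HS06 values and measured events/sec.
sites = {
    "SITE_A": (10.0, 1.00),
    "SITE_B": (12.0, 1.25),
    "SITE_C": (10.0, 0.50),
}
off = flag_off_sites(sites)
print(off)
```

Using the median as the reference keeps a single badly configured site from shifting the baseline that the other sites are compared against.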
  
  > EventService for Sites - Wen
  >> Events produced - peak - 33M/day
  >> Events produced per site
  >> Example of ES using resources - BNL_LOCAL, CONNECT*
  >> Storage - much more stable
  >> EventService
  - Storage - fixed. Now focusing on es_merge_failed errors.
  - HammerCloud blacklists files
  >> ES Commissioning
  >> ES brokerage/priority/share when scaling
  >> ES sites when scaling
  
  > Site Description - AleDS
  >> Strategy
  - Using GLUE 2 or info from jobs
  - Plugins to different batch systems to fetch info
  - Initial prototype - ready
  - Supported - LSF, PBS, HTCondor
  - Info already available via kibana. Complete view of the nodes - available.
  >> Checking of data
     - LSF is matching with local monitoring data
     - PBS - matching too
         - Not possible to derive the total number of slots
         - Sites do not allow usage of qstat
     - HTCondor - matching too. One can reconstruct the total number of possible slots.
  >> What we get:
  - number of slots, nodes shared among different PQs..
  - Some wrong configurations in AGIS, e.g. PBS is set but UGE is actually used
  >> Conclusions
  - Next steps
  
  > Workflow Analysis - Thomas
  >> Introduction
  - Grid, finished jobs Nov 2017 - Jan 2018 from UC Kibana. Separation by resource type. TWiki available.
  >> Sum of Walltime x Cores
  >> Processingtype
  >> Walltime x Cores
  - User jobs are usually very short
  >> Execution Time x cores / n_events
  >> CPU Efficiency per Core
  >> Max PSS per Core
  >> I/O Intensity
  >> Stagein Time & Inputfile Size
  >> Stageout Time & Outputfile Size
  >> Kibana Dashboards & Jupyter are available
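The per-job quantities listed in the workflow analysis notes can be written down as simple ratios. The sketch below is an illustration only: the `Job` record and its field names are hypothetical (they are not PanDA or Kibana field names), and execution time is assumed to mean wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    walltime_s: float   # wall-clock duration of the job, in seconds
    cpu_time_s: float   # CPU time summed over all cores, in seconds
    cores: int          # number of cores allocated to the job
    n_events: int       # number of events processed
    max_pss_mb: float   # peak proportional set size of the job, in MB


def walltime_x_cores(job):
    """Core-seconds occupied by the job."""
    return job.walltime_s * job.cores


def time_per_event(job):
    """Core-seconds spent per processed event."""
    return walltime_x_cores(job) / job.n_events


def cpu_efficiency_per_core(job):
    """Fraction of the occupied core-seconds actually spent on CPU."""
    return job.cpu_time_s / walltime_x_cores(job)


def max_pss_per_core(job):
    """Peak memory footprint divided by the number of cores."""
    return job.max_pss_mb / job.cores


# Hypothetical 8-core job running for one hour.
job = Job(walltime_s=3600, cpu_time_s=25920, cores=8, n_events=1000, max_pss_mb=12000)
print(walltime_x_cores(job), time_per_event(job), cpu_efficiency_per_core(job), max_pss_per_core(job))
```

Normalising by cores makes single-core and multi-core jobs comparable on the same plot, which is presumably why the notes quote the efficiency and memory values per core.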
  
  > Unified Panda Queue - Rod
  >> Definition
  - one queue per physical resource
  >> Motivation
  - Gshare and priorities should decide what job starts
  - Run order based on priority only
  - Evgen currently is manually capped
  - Incorporate Analysis
  >> Analysis
  - We should be running more according to the gshare. This will be solved by UCORE
  >> Challenges
  - Needs unpartitioned cluster
  - Changes in monitoring
  >> Brokerage & Monitoring
  - Not exposed in bigpanda
  - HC PFT
  >> Deployment
  - Initially only possible via aCT
  - Now also Cream CE
  - Possible to submit to ANALY UCORE ANALY MCORE or HIMEM
  - jobs on a UCORE site are following the gshare exactly
  >> Need to limit jobs
  - We need limits on the physical properties of the jobs
  - Add a parameter per sum of running jobs?
  >> Conclusions
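The motivation noted for the unified queue, that gshare and priorities should decide which job starts, can be illustrated with a toy selection rule. This is a sketch under stated assumptions, not PanDA's brokerage: the share names, target fractions, job identifiers and the "largest shortfall first" rule are all hypothetical.

```python
def pick_next_job(targets, running, queued):
    """Pick the next job for a free slot on a unified queue.

    targets: share name -> target fraction of running slots
    running: share name -> slots currently running for that share
    queued:  share name -> list of (priority, job_id) waiting to run
    Returns (share, job_id), or None if nothing is queued.
    """
    total = sum(running.values()) or 1
    candidates = [share for share in targets if queued.get(share)]
    if not candidates:
        return None
    # Assumed rule: serve the share furthest below its target fraction.
    share = max(candidates, key=lambda s: targets[s] - running.get(s, 0) / total)
    # Within the chosen share, the highest priority job goes first.
    _priority, job_id = max(queued[share])
    return share, job_id


# Hypothetical state of one unified queue.
targets = {"production": 0.75, "analysis": 0.25}
running = {"production": 90, "analysis": 10}
queued = {
    "production": [(900, "p1")],
    "analysis": [(500, "a1"), (700, "a2")],
}
choice = pick_next_job(targets, running, queued)
print(choice)
```

With separate production and analysis queues this decision is fixed by the site partitioning; on a single queue per physical resource it can be made centrally for every free slot.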
  
  > Payload Scheduler - Fernando & Tadashi
  >> Central control of resources for UCORE
  >> Dimensions to be controlled
  - Global shares, WQ alignment, local shares, site capabilities, data distribution
  >> Global Share Definition - reminder
  >> Panda flow scheme
  - Global shares are taken into account in the last level - job dispatching
  - For each share n_queued = 2*n_running
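The relation n_queued = 2*n_running noted for the Panda flow can be read as a per-share target for the queued backlog. The sketch below shows that arithmetic only; reading the relation as a top-up target is an assumption, and the share names and numbers are hypothetical.

```python
def jobs_to_queue(n_running, n_queued, factor=2):
    """Number of extra jobs to queue so that n_queued reaches factor * n_running."""
    return max(0, factor * n_running - n_queued)


# Hypothetical (n_running, n_queued) per global share.
shares = {
    "MC production": (1000, 1500),
    "Reprocessing": (200, 600),
    "Analysis": (0, 0),
}
topup = {name: jobs_to_queue(r, q) for name, (r, q) in shares.items()}
print(topup)
```

A share that already has more than twice its running jobs queued gets no top-up, so the backlog per share stays proportional to what that share is actually running.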
  >> Alignment with JEDI WQ
  - There is a work queue per share per type (mcore, score..)
  - There are ad-hoc WQ for big resources for example
  >> Global shares scheme
  >> Global shares in BigPanda
  >> Changing shares UI in ProdSys.
  - Adding is still developer’s action
  >> Local partitioning
  - No way to say if mcore / score partitioning is dynamic
  >> Local partitioning ANALY vs PROD
  >> Global Shares Summary
  - Working well. We already have operational experience
  - Things to change in order for the shares to work perfectly
  >>> Harvester
  >> Motivation
  - Pilot submission to all resources
  - Pilot stream control
  - Clear separation of functionalities with pilot (Pilot 2)
  - Resource discovery & monitoring
  >> Design
  - Everything fully configurable / pluggable
  - Push & Pull modes
  - HPC mode
     - Runs on edge node - light and fulfilling edge node restrictions
     - WN communication via the shared FS
  - Grid/cloud mode
  >> Architecture
  >> Status - HPC
  - In use in most of the US HPCs
  >> Status - GRID
  - HTCondor plugin
     - Scalability tests are ongoing. How many jobs can be run simultaneously? Can it run on all sites?
     - Running on 200 PQs, a few jobs per PQ.
  >> First Pilot Streaming commissioning
  >> Status - Cloud
  - Mixed mode: with cloud scheduler
  - Harvester only - Harvester is handling the VMs itself
     - Trying condor-let approach
  >> Status - Monitoring
  - First implementation - ready
  >> Plans
  
  > Blacklisting Overview - Jarka
  >> How to receive PFT/AFT HC tests
  - New: HC tests UCORE
  - New: Testing can start within 1 hour
  >> HC Blacklisting
  - New: AGIS controller in production
  >> How to:
  - Set PQ status to ONLINE/TEST/BROKEROFF/OFFLINE
  - Set PQ status to AUTO, i.e. control is taken over by HC
  - Adding a new default queue
  - Adding new PQ
  - Decommission a PQ
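The difference between a manually set PQ status and the AUTO setting from the blacklisting notes can be sketched as follows. This is an illustration, not the AGIS controller: the function name is hypothetical, and the mapping of HammerCloud test results to ONLINE or TEST under AUTO is an assumption.

```python
from enum import Enum


class PQStatus(Enum):
    ONLINE = "ONLINE"
    TEST = "TEST"
    BROKEROFF = "BROKEROFF"
    OFFLINE = "OFFLINE"


AUTO = "AUTO"  # setting that hands control of the status to HammerCloud (HC)


def effective_status(setting, hc_tests_ok):
    """Return the status a queue ends up with for a given setting.

    setting is either AUTO or one of the manual status names.
    hc_tests_ok is the (hypothetical) outcome of the latest HC tests.
    """
    if setting == AUTO:
        # Assumption: passing tests keep the queue ONLINE, failing tests move it to TEST.
        return PQStatus.ONLINE if hc_tests_ok else PQStatus.TEST
    # A manual setting is kept as is, whatever the HC tests report.
    return PQStatus(setting)


print(effective_status("AUTO", False), effective_status("BROKEROFF", False))
```

The point of the sketch is only that a manual status overrides the tests, while AUTO lets the test results drive the queue state.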
  - 9
    
    DPA overview
    
    Speaker: Ivan Glushkov (University of Texas at Arlington (US))
    
    DPA Overview
  - 10
    
    Site configuration and support
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    jamboree-2018-site_config.pdf
    
    jamboree-2018-site_config.ppt
  - 11
    
    Site monitoring
    
    Speaker: Ryu Sawada (University of Tokyo (JP))
    
    2018_03_05_SC.pdf
  - 12
    
    EventService for sites
    
    Speaker: Wen Guan (University of Wisconsin (US))
    
    EventService
    
    EventService for sites_20180306.pdf
  - 13
    
    Site Description update, WN map
    
    Speaker: Alessandro De Salvo (Sapienza Universita e INFN, Roma I (IT))
    
    Site description 20180306.pdf
    
    Site description 20180306.ppt
  - 10:30
    
    Coffee break
  - 14
    
    Jobs types overview
    
    Speakers: Friedrich Hoenig (Ludwig Maximilians Universitat (DE)), Guenter Duckeck (Ludwig Maximilians Universitat (DE)), Thomas Maier (Ludwig Maximilians Universitat (DE))
    
    tmaier_workflow_analysis.pdf
    
    Workflow Analysis
  - 15
    
    Unified Panda queues
    
    Speaker: Rodney Walker (Ludwig Maximilians Universitat (DE))
    
    UnifiedPandaQueue.pdf
    
    UQ
  - 16
    
    Payload scheduler
    
    Speakers: Fernando Harald Barreiro Megino (University of Texas at Arlington), Tadashi Maeno (Brookhaven National Laboratory (US))
    
    Payload scheduler
  - 17
    
    Overview on blacklisting
    
    Speaker: Jaroslava Schovancova (CERN)
    
    BlacklistingOverview_ATLASSiteJamboree20180306_jschovan.pdf
- Storage
  
  Conveners: Fernando Harald Barreiro Megino (University of Texas at Arlington), Martin Barisits (CERN)
  - 18
    Overview from DDM
    
    General overview, plus things like (not only!):
    
    Movers and protocols strategy,
    
    disk space gathering info
    
    Localgroupdisk management
    
    Consistency checks
    
    Sites migration
    
    Warm hot cold storage
    
    Srm-less storage
    
    Speakers: Cedric Serfon (University of Oslo (NO)), Mario Lassnig (CERN), Martin Barisits (CERN)
    
    2018-03-06 DDM overview.pdf
  - 19
    
    FTS
    
    Speaker: Andrea Manzi (CERN)
    
    FTS@Atlas Jamboree.pdf
    
    FTS@Atlas Jamboree.pptx
  - 20
    
    Data transfer latencies and transfer queue studies
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
    
    ATLAS-FTS
  - 21
    
    ROOT/HTTP third party copy
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    ASCW58.pdf
    
    ASCW58.pptx
  - 22
    
    DDM/sites open questions
  - 16:00
    
    Coffee break
  - 23
    
    Tape
    
    Speakers: Cedric Serfon (University of Oslo (NO)), David Yu (Brookhaven National Laboratory (US)), Tomas Javurek (Albert Ludwigs Universitaet Freiburg (DE))
    
    ATLAS Jamboree 2018 BNL Tapes.pdf
    
    Tapes 2018.pdf
  - 24
    
    Caches: xrootd caches and others
    
    Speakers: Andrej Filipcic (Jozef Stefan Institute (SI)), Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), David Cameron (University of Oslo (NO)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    ARC Cache, ADC Jamboree, 6.3.18(2).pdf
    
    Cache recommendations, ADC Jamboree, 6.3.18.pdf
    
    Xcache
  - 25
    
    WAN vs LAN , direct I/O vs copy to scratch
    
    Speakers: Nicolo Magini (INFN e Universita Genova (IT)), Thomas Maier (Ludwig Maximilians Universitat (DE))
    
    20180306-WANLAN-magini-maier.pdf
    
    Slides
Wednesday 7 March
- Monitoring
  
  Conveners: Dario Barberis (Università e INFN Genova (IT)), Ilija Vukotic (University of Chicago (US))
  - 26
    
    From old to new dashboards: which tool for what?
    
    Speaker: Dario Barberis (Università e INFN Genova (IT))
    
    ADC_Mon_Report-Mar2018.pdf
  - 27
    
    Site dashboards based on ES+Kibana at UC
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    Analytics for sites
  - 28
    
    Monitoring and accounting for cloud resources
    
    Speaker: Frank Berghaus (University of Victoria (CA))
    
    Cloud and Dynafed Accoutning.pdf
    
    Cloud Monitoring and Accounting
  - 29
    
    Pilot monitor update
    
    Speaker: Peter Love (Lancaster University (GB))
    
    atlas-jamboree-apfdash
- Evolution
  
  Conveners: Robert William Gardner Jr (University of Chicago (US)), Johannes Elmsheuser (Brookhaven National Laboratory (US))
  - 30
    Singularity and Containers
    
    Speakers: Alessandra Forti (University of Manchester (GB)), Andrej Filipcic (Jozef Stefan Institute (SI))
    
    a) Containers deployment status
    
    Speaker: Alessandra Forti (University of Manchester (GB))
    
    20180307_jamboree_containers.pdf
    
    b) Containers Site point of view
    
    Speaker: Dr Emmanouil Vamvakopoulos (Centre de Calcul IN2P3 (FR))
    
    contairner_CC-IN2P3b.pdf
    
    c) Discussion
    
    Speaker: All
  - 10:45
    
    Coffee break
  - 31
    
    Lightweight sites
    
    Speaker: Mayank Sharma (CERN)
    
    WLCG Lightweight Sites - ATLAS Jamboree final.pdf
    
    WLCG Lightweight Sites - ATLAS Jamboree final.pptx
  - 32
    
    One box challenge
    
    Speakers: Rodney Walker (Ludwig Maximilians Universitat (DE)), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    ATLAS@home report_site_jamboree.pdf
    
    OneBoxChallenge
  - 33
    
    Object store stress testing
    
    Speaker: Peter Love (Lancaster University (GB))
    
    atlas-jamboree-objectstores
  - 34
    
    Google cloud and CSCS HPC
    
    Speaker: Alexei Klimentov (Brookhaven National Laboratory (US))
    
    Google_CSCS_SWnC_Mar2018.pdf
  - 35
    
    Services Layer at the Edge (SLATE)
    
    Speaker: Robert William Gardner Jr (University of Chicago (US))
    
    Google doc slides
    
    SLATE for ATLAS Sites Jamboree.pdf
  - 36
    
    WLCG Evolution
    
    Speaker: Ian Bird (CERN)
    
    WLCG-ATLAS-290218.pdf
    
    WLCG-ATLAS-290218.pptx
  - 37
    
    Summary
    
    Speaker: Alessandro Di Girolamo (CERN)
