ATLAS Sites Jamboree

40/S2-D01 - Salle Dirac (CERN)

Alessandro Di Girolamo (CERN), Johannes Elmsheuser (Brookhaven National Laboratory (US))

Details and notes of the ATLAS Sites Jamboree, March 2018

This ATLAS Sites Jamboree is part of the ATLAS SW&C technical week

And on Thursday there is the ADC&SW Joint session


    • Intro and overviews
      Conveners: Alessandro Di Girolamo (CERN), Johannes Elmsheuser (Brookhaven National Laboratory (US))
      • 1
        ATLAS Sites Jamboree welcome
        Speaker: Alessandro Di Girolamo (CERN)
      • 2
        SW & Computing overview
        Speakers: Davide Costanzo (University of Sheffield (GB)), Torre Wenaus (Brookhaven National Laboratory (US))
      • 3
        Software overview
        Speakers: Edward Moyse (University of Massachusetts (US)), Walter Lampl (University of Arizona (US))
      • 4
        Reports from: MC production, Reprocessing, Derivation, Distributed Analysis Shifts/DAST
        Speakers: David Michael South (Deutsches Elektronen-Synchrotron (DE)), Douglas Gingrich (University of Alberta (CA)), Dr Eirik Gramstad (University of Oslo (NO)), Farida Fassi (Universite Mohammed V (MA)), Josh McFayden (CERN), Mayuko Kataoka (University of Texas at Arlington (US)), Michael Ughetto (Stockholm University (SE)), Monica Dobre (IFIN-HH (RO))
    • Sites plans and reports
      • Availability reports (SAM), small sites consolidation, problematic sites follow up (40mins?)
      • sites reports, list of questions to be sent
      Conveners: Alessandra Forti (University of Manchester (GB)), Xin Zhao (Brookhaven National Laboratory (US))
    • General support & Compute
      Conveners: Ivan Glushkov (University of Texas at Arlington (US)), Mario Lassnig (CERN)

      >>> Compute & Support Session Minutes

      > DPA Overview - Ivan

      > Site Support - Xin
      >> Baseline requirements
      - Storage - SRM, diskless
      >> Other services
      - Computing - 25/75 - Analy/Production or UCORE
      - Migration to CentOS
      >> Trends
      - Saving manpower for sites
          - Pilot improvements
          - Go UCORE
          - Go Diskless
          - Go Containers
          - Go Standard software
          - Still the flexibility should be kept - One Box Challenge, LSM, hybrid staging model
      >> AGIS
          - Get rid of obsolete data
          - Keep the documentation up-to-date
      >> Site support
      - Go to ADC Meetings
      - Follow DPA mailing list
      - Respond to GGUS tickets
      - Contact your cloud support list atlas-adc-cloud-XX
      - Site Documentation - FAQ TWiki started

      > Site Monitoring - Ryu
      >> Availability & Reliability definition
      >> Probes
      - SE and CE Services
      >> Monitoring
      - WLCG dashboard
      - MONIT in preparation
      - Monthly reports are available
      - Re-computation can be requested
      >> Recent Changes
          - etf_default in AGIS to select which queue to be used
      >> Future Improvements
      >> HS06
      - Data Selection
      - CPU Ranking
      - HS06 vs events/sec per site - there are sites which are off from the values predicted by HS06 in AGIS
      > Summary
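
The HS06 comparison above can be illustrated with a minimal sketch. It assumes each site has a declared HS06-per-core value and a measured events/sec value, and flags sites whose ratio deviates from the median by more than a chosen tolerance. The site names, numbers, and the 20% tolerance are hypothetical and are not taken from the talk.

```python
import statistics

# Hypothetical per-site values for illustration only; not real measurements.
sites = {
    "SITE_A": {"hs06_per_core": 10.0, "events_per_sec": 0.50},
    "SITE_B": {"hs06_per_core": 12.0, "events_per_sec": 0.61},
    "SITE_C": {"hs06_per_core": 10.0, "events_per_sec": 0.30},
}

def flag_outliers(sites, tolerance=0.2):
    """Return sites whose events/sec per HS06 is off the median ratio by more than `tolerance`."""
    # Events/sec delivered per declared HS06 unit, per site.
    ratios = {name: s["events_per_sec"] / s["hs06_per_core"] for name, s in sites.items()}
    median = statistics.median(ratios.values())
    # A site is flagged when its relative deviation from the median exceeds the tolerance.
    return sorted(name for name, r in ratios.items() if abs(r / median - 1.0) > tolerance)

outliers = flag_outliers(sites)
print(outliers)
```

Comparing against the median rather than the mean is an assumption made here so that a single badly off site does not shift the reference value.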

      > EventService for Sites - Wen
      >> Events produced - peak - 33M/day
      >> Events produced per site
      >> Example of ES using resources - BNL_LOCAL, CONNECT*
      >> Storage - much more stable
      >> EventService
      - Storage - fixed. Now focusing on es_merge_failed errors.
      - HammerCloud blacklists files
      >> ES Commissioning
      >> ES brokerage/priority/share when scaling
      >> ES sites when scaling

      > Site Description - AleDS
      >> Strategy
      - Using GLUE 2 or info from jobs
      - Plugins to different batch system to fetch info
      - Initial prototype - ready
      - Supported - LSF, PBS, HTCondor
      - Info already available via kibana. Complete view of the nodes - available.
      >> Checking of data
          - LSF is matching with local monitoring data
          - PBS - matching too
              - Not possible to derive the total number of slots
              - Sites do not allow usage of qstat
          - HTCondor - matching too. One can reconstruct the total number of possible slots.
      >> What we get:
      - number of slots, nodes shared among different PQs..
      - Some wrong configurations in AGIS. Ex: Set PBS, used UGE
      >> Conclusions
      - Next steps

      > Workflow Analysis - Thomas
      >> Introduction
      - Grid, finished jobs Nov 2017 - Jan 2018 from UC Kibana. Separation by resource type. TWiki available.
      >> Sum of Walltime x Cores
      >> Processingtype
      >> Walltime x Cores
      - User jobs are usually very short
      >> Execution Time x cores / n_events
      >> CPU Efficiency per Core
      >> Max PSS per Core
      >> I/O Intensity
      >> Stagein Time & Inputfile Size
      >> Stageout Time & Outputfile Size
      >> Kibana Dashboards & Jupyter are available
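
The per-job quantities listed above can be sketched as follows. This is a minimal illustration, assuming a job record carries walltime, CPU time, core count, maximum PSS, and number of events. The field names are hypothetical and do not reflect the actual Kibana index schema, and walltime is used here as a stand-in for execution time.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    # Hypothetical field names; the real job records may differ.
    walltime_s: float   # wall-clock duration of the job in seconds
    cpu_time_s: float   # CPU time summed over all cores, in seconds
    cores: int          # number of cores allocated to the job
    max_pss_kb: float   # maximum proportional set size of the job, in kB
    n_events: int       # number of events processed

def job_metrics(job):
    """Derive the per-job quantities named in the talk outline."""
    core_seconds = job.walltime_s * job.cores
    return {
        "walltime_x_cores": core_seconds,
        # Core-seconds spent per event; undefined for jobs without events.
        "time_per_event": core_seconds / job.n_events if job.n_events else None,
        # Fraction of the allocated core time actually spent on CPU.
        "cpu_eff_per_core": job.cpu_time_s / core_seconds if core_seconds else None,
        "max_pss_per_core": job.max_pss_kb / job.cores,
    }

example = JobRecord(walltime_s=3600, cpu_time_s=25200, cores=8,
                    max_pss_kb=16_000_000, n_events=1000)
metrics = job_metrics(example)
print(metrics)
```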

      > Unified Panda Queue - Rod
      >> Definition
      - one queue per physical resource
      >> Motivation
      - Gshare and priorities should decide what job starts
      - Run order based on priority only
      - Evgen currently is manually capped
      - Incorporate Analysis
      >> Analysis
      - We should be running more according to the gshare. This will be solved by UCORE
      >> Challenges
      - Needs unpartitioned cluster
      - Changes in monitoring
      >> Brokerage & Monitoring
      - Not exposed in bigpanda
      - HC PFT
      >> Deployment
      - Initially only possible via aCT
      - Now also Cream CE
      - Possible to submit to ANALY UCORE ANALY MCORE or HIMEM
      - jobs on a UCORE site are following the gshare exactly
      >> Need to limit jobs
      - We need limits on the physical properties of the jobs
      - Add parameter per sum of running jobs?
      >> Conclusions

      > Payload Scheduler - Fernando & Tadashi
      >> Central control of resources for UCORE
      >> Dimensions to be controlled
      - Global shares, WQ alignment, local shares, site capabilities, data distribution
      >> Global Share Definition - reminder
      >> Panda flow scheme
      - Global shares are taken into account in the last level - job dispatching
      - For each share n_queued  = 2*n_running
      >> Alignment with JEDI WQ
      - There is a work queue per share per type (mcore, score..)
      - There are ad-hoc WQ for big resources for example
      >> Global shares scheme
      >> Global shares in BigPanda
      >> Changing shares UI in ProdSys.
      - Adding is still developer’s action
      >> Local partitioning
      - No way to say if mcore / score partitioning is dynamic
      >> Local partitioning ANALY vs PROD
      >> Global Shares Summary
      - Working well. We already have operational experience
      - Things to change in order for the shares to work perfectly
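
The rule noted under the Panda flow scheme, n_queued = 2*n_running for each share, can be sketched as below. This only illustrates the arithmetic, assuming per-share counts of running and queued jobs are available. The function name and share names are hypothetical, and this is not the PanDA implementation.

```python
def queue_targets(running_by_share, queued_by_share, factor=2):
    """Return how many more jobs each share needs queued to reach factor * n_running."""
    targets = {}
    for share, n_running in running_by_share.items():
        wanted = factor * n_running              # target queue depth for this share
        have = queued_by_share.get(share, 0)     # jobs already queued for this share
        targets[share] = max(0, wanted - have)   # never a negative top-up
    return targets

# Hypothetical share names and counts for illustration only.
running = {"MC Simulation": 1000, "Analysis": 400}
queued = {"MC Simulation": 1500, "Analysis": 900}
top_up = queue_targets(running, queued)
print(top_up)
```

A share that already has more than twice its running jobs queued gets a top-up of zero rather than having jobs removed; that choice is an assumption of this sketch.
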
      >>> Harvester
      >> Motivation
      - Pilot submission to all resources
      - Pilot stream control
      - Clear separation of functionalities with pilot (Pilot 2)
      - Resource discovery & monitoring
      >> Design
      - Everything fully configurable / pluggable
      - Push & Pull modes
      - HPC mode
          - Runs on edge node - lightweight and compliant with edge node restrictions
          - WN communication via the shared FS
      - Grid/cloud mode
      >> Architecture
      >> Status - HPC
      - In use in most of the US HPCs
      >> Status - GRID
      - HTCondor plugin
          - Scalability tests are ongoing. How many jobs can be run simultaneously? Can it run on all sites?
          - Running on 200 PQs, a few jobs per PQ.
      >> First Pilot Streaming commissioning
      >> Status - Cloud
      - Mixed mode: with cloud scheduler
      - Harvester only - Harvester is handling the VMs itself
          - Trying condor-let approach
      >> Status - Monitoring
      - First implementation - ready
      >> Plans

      > Blacklisting Overview - Jarka
      >> How to receive PFT/AFT HC tests
      - New: HC tests UCORE
      - New: Testing can start within 1 hour
      >> HC Blacklisting
      - New: AGIS controller in production
      >> How to:
      - Set up PQ status to AUTO, i.e. taken over by HC
      - Adding a new default queue
      - Adding a new PQ
      - Decommission a PQ

    • Storage
      Conveners: Fernando Harald Barreiro Megino (University of Texas at Arlington), Martin Barisits (CERN)
      • 18
        Overview from DDM
        • General overview, plus things like (not only!):
        • Movers and protocols strategy,
        • disk space gathering info
        • Localgroupdisk management
        • Consistency checks
        • Sites migration
        • Warm hot cold storage
        • SRM-less storage
        Speakers: Cedric Serfon (University of Oslo (NO)), Mario Lassnig (CERN), Martin Barisits (CERN)
      • 19
        Speaker: Andrea Manzi (CERN)
      • 20
        Data transfer latencies and transfer queue studies
        Speakers: Ilija Vukotic (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • 21
        ROOT/HTTP third party copy
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 22
        DDM/sites open questions
      • 4:00 PM
        Coffee break
      • 23
        Speakers: Cedric Serfon (University of Oslo (NO)), David Yu (Brookhaven National Laboratory (US)), Tomas Javurek (Albert Ludwigs Universitaet Freiburg (DE))
      • 24
        Caches: xrootd caches and others
        Speakers: Andrej Filipcic (Jozef Stefan Institute (SI)), Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), David Cameron (University of Oslo (NO)), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 25
        WAN vs LAN, direct I/O vs copy to scratch
        Speakers: Nicolo Magini (INFN e Universita Genova (IT)), Thomas Maier (Ludwig Maximilians Universitat (DE))