ATLAS Sites Jamboree

Europe/Zurich
40/S2-D01 - Salle Dirac (CERN)

Alessandro Di Girolamo (CERN) , Johannes Elmsheuser (Brookhaven National Laboratory (US))
Description

Details and notes of the ATLAS Sites Jamboree, March 2018

This ATLAS Sites Jamboree is part of the ATLAS SW&C technical week

And on Thursday there is the ADC&SW Joint session

 

    • 10:00 12:00
      Intro and overviews
      Conveners: Alessandro Di Girolamo (CERN) , Johannes Elmsheuser (Brookhaven National Laboratory (US))
      • 10:00
        ATLAS Sites Jamboree welcome 10m
        Speaker: Alessandro Di Girolamo (CERN)
      • 10:10
        SW & Computing overview 20m
        Speakers: Davide Costanzo (University of Sheffield (GB)) , Torre Wenaus (Brookhaven National Laboratory (US))
      • 10:30
        Software overview 20m
        Speakers: Edward Moyse (University of Massachusetts (US)) , Walter Lampl (University of Arizona (US))
      • 10:50
        Reports from: MC production, Reprocessing, Derivation, Distributed Analysis Shifts/DAST 1h
        Speakers: David Michael South (Deutsches Elektronen-Synchrotron (DE)) , Douglas Gingrich (University of Alberta (CA)) , Dr Eirik Gramstad (University of Oslo (NO)) , Farida Fassi (Universite Mohammed V (MA)) , Josh McFayden (CERN) , Mayuko Kataoka (University of Texas at Arlington (US)) , Michael Ughetto (Stockholm University (SE)) , Monica Dobre (IFIN-HH (RO))
    • 13:30 17:50
      Sites plans and reports
      • Availability reports (SAM), small sites consolidation, problematic sites follow up (40mins?)
      • sites reports, list of questions to be sent
      Conveners: Alessandra Forti (University of Manchester (GB)) , Xin Zhao (Brookhaven National Laboratory (US))
    • 09:00 13:00
      General support & Compute
      Conveners: Ivan Glushkov (University of Texas at Arlington (US)) , Mario Lassnig (CERN)

      >>> Compute & Support Session Minutes

      > DPA Overview - Ivan

      > Site Support - Xin
      >> Baseline requirements
      - Storage - SRM, diskless
      >> Other services
      - Computing - 25/75 - Analy/Production or UCORE
      - Migration to CentOS
      >> Trends
      - Saving manpower for sites
          - Pilot improvements
          - Go UCORE
          - Go Diskless
          - Go Containers
          - Go Standard software
          - The flexibility should still be kept - One Box Challenge, LSM, hybrid staging model
      >> AGIS
          - Get rid of obsolete data
          - Keep the documentation up-to-date
      >> Site support
      - Go to ADC Meetings
      - Follow DPA mailing list
      - Respond to GGUS tickets
      - Contact your cloud support list atlas-adc-cloud-XX
      - Site Documentation - FAQ TWiki started

      > Site Monitoring - Ryu
      >> Availability & Reliability definition (a toy calculation is sketched at the end of this block)
      >> Probes
      - SE and CE Services
      >> Monitoring
      - WLCG dashboard
      - MONIT in preparation
      - Monthly reports are available
      - Re-computation can be requested
      >> Recent Changes
          - etf_default flag in AGIS to select which queue is to be used
      >> Future Improvements
      >> HS06
      - Data Selection
      - CPU Ranking
      - HS06 vs events/sec per site - there are sites which are off from the values predicted by HS06 in AGIS
      > Summary
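
      A toy illustration of the availability/reliability numbers mentioned above, assuming WLCG-style definitions (OK time over known time, with scheduled-downtime intervals excluded from reliability). This is a sketch, not the actual SAM/ETF aggregation code; status names and the downtime handling are assumptions.

      # Illustrative only: toy availability/reliability from per-interval probe results.
      from collections import Counter

      def site_availability(samples):
          """samples: list of (status, in_scheduled_downtime) per probe interval,
          where status is 'OK', 'CRITICAL' or 'UNKNOWN'."""
          counts = Counter()
          in_downtime = 0
          for status, downtime in samples:
              counts[status] += 1
              if downtime and status != 'UNKNOWN':
                  in_downtime += 1          # intervals excluded from reliability
          known = counts['OK'] + counts['CRITICAL']
          availability = counts['OK'] / known if known else None
          rel_denominator = known - in_downtime
          reliability = counts['OK'] / rel_denominator if rel_denominator else None
          return availability, reliability

      samples = [('OK', False)] * 20 + [('CRITICAL', True)] * 2 + [('UNKNOWN', False)] * 2
      print(site_availability(samples))     # availability ~0.91, reliability 1.0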

      > EventService for Sites - Wen
      >> Events produced - peak of 33M/day
      >> Events produced per site
      >> Example of ES using resources - BNL_LOCAL, CONNECT*
      >> Storage - much more stable
      >> EventService
      - Storage - fixed. Now focusing on es_merge_failed errors.
      - HammerCloud blacklists files
      >> ES Commissioning
      >> ES brokerage/priority/share when scaling
      >> ES sites when scaling

      > Site Description - AleDS
      >> Strategy
      - Using GLUE 2 or info from jobs
      - Plugins for different batch systems to fetch the info (see the sketch after this block)
      - Initial prototype - ready
      - Supported - LSF, PBS, HTCondor
      - Info is already available via Kibana; a complete view of the nodes is available.
      >> Checking of data
          - LSF is matching with local monitoring data
          - PBS - matching too
              - Not possible to derive the total number of slots
              - Sites do not allow usage of qstat
          - HTCondor - matching too. One can reconstruct the total number of possible slots.
      >> What we get:
      - number of slots, nodes shared among different PQs..
      - Some wrong configurations in AGIS, e.g. PBS set while UGE is actually used
      >> Conclusions
      - Next steps
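
      A minimal sketch of the plugin-per-batch-system idea described above: each plugin returns the number of job slots, and a dispatcher picks the right one per site. The commands and output parsing are assumptions for illustration, not the actual collector code.

      # Sketch only: one plugin per batch system, each returning the number of slots.
      import subprocess

      def _run(cmd):
          return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

      def slots_htcondor():
          # one line per slot -> count the lines
          return len(_run(["condor_status", "-af", "Name"]).splitlines())

      def slots_pbs():
          # 'pbsnodes -a' prints 'np = <n>' per node; sum them (qstat is often not allowed)
          return sum(int(line.split("=")[1]) for line in _run(["pbsnodes", "-a"]).splitlines()
                     if line.strip().startswith("np ="))

      def slots_lsf():
          # 'bhosts' prints the MAX slots in the fourth column of each host line
          return sum(int(fields[3]) for fields in
                     (l.split() for l in _run(["bhosts"]).splitlines()[1:]) if len(fields) > 3)

      PLUGINS = {"htcondor": slots_htcondor, "pbs": slots_pbs, "lsf": slots_lsf}

      def total_slots(batch_system):
          return PLUGINS[batch_system]()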

      > Workflow Analysis - Thomas
      >> Introduction
      - Grid, finished jobs Nov 2017 - Jan 2018 from UC Kibana. Separation by resource type. TWiki available.
      >>  Sum of Walltime x Cores
      >> Processingtype
      >> Walltime x Cores
      - User jobs are usually very short
      >> Execution Time x n_cores / n_events
      >> CPU Efficiency per Core
      >> Max PSS per Core
      >> I/O Intensity
      >> Stagein Time & Inputfile Size
      >> Stageout Time & Outputfile Size
      >> Kibana dashboards & Jupyter notebooks are available (a toy version of these metrics is sketched below)
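
      A hedged pandas sketch of the derived per-job quantities listed above (walltime x cores, CPU efficiency per core, max PSS per core, I/O intensity). The column names and numbers are made up for illustration and do not match the real PanDA/Kibana field names.

      # Sketch only: derived per-job metrics like the ones shown in the talk.
      import pandas as pd

      jobs = pd.DataFrame({
          "corecount":   [1, 8, 8],
          "walltime_s":  [1200, 36000, 40000],    # seconds
          "cputime_s":   [1100, 250000, 300000],  # summed over cores
          "maxpss_kb":   [1.8e6, 14e6, 15e6],
          "nevents":     [100, 20000, 25000],
          "inputsize_b": [2e9, 1e10, 1.2e10],
      })

      jobs["walltime_x_cores"]    = jobs["walltime_s"] * jobs["corecount"]
      jobs["cpu_eff_per_core"]    = jobs["cputime_s"] / jobs["walltime_x_cores"]
      jobs["maxpss_per_core_kb"]  = jobs["maxpss_kb"] / jobs["corecount"]
      jobs["exec_time_per_event"] = jobs["walltime_x_cores"] / jobs["nevents"]
      jobs["io_intensity_bps"]    = jobs["inputsize_b"] / jobs["walltime_s"]

      print(jobs[["walltime_x_cores", "cpu_eff_per_core", "maxpss_per_core_kb"]])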

      > Unified Panda Queue - Rod
      >> Definition
      - one queue per physical resource
      >> Motivation
      - Gshare and priorities should decide which job starts
      - Run order based on priority only
      - Evgen currently is manually capped
      - Incorporate Analysis
      >> Analysis
      - We should be running more according to the gshare; this will be solved by UCORE
      >> Challenges
      - Needs unpartitioned cluster
      - Changes in monitoring
      >> Brokerage & Monitoring
      - Not exposed in bigpanda
      - HC PFT
      >> Deployment
      - Initially only possible via aCT
      - Now also Cream CE
      - Possible to submit to ANALY, UCORE, ANALY MCORE or HIMEM
      - Jobs on a UCORE site follow the gshare exactly
      >> Need to limit jobs
      - We need limits on the physical properties of the jobs
      - Add a parameter per sum of running jobs? (see the sketch below)
      >> Conclusions
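
      A sketch of what a unified PQ description with limits on the physical properties of jobs could look like, following the points above. The field names are assumptions, not the actual AGIS schema, and the brokerage check is a toy.

      # Illustrative only: one queue per physical resource, with resource types
      # and caps expressed on physical properties rather than per-queue job counts.
      UNIFIED_PQ = {
          "name": "EXAMPLE-SITE_UCORE",            # hypothetical queue name
          "resource_types": ["SCORE", "MCORE", "SCORE_HIMEM", "MCORE_HIMEM"],
          "capability": "ucore",
          "corecount": 8,                          # max cores a single payload may request
          "maxrss": 2000 * 8,                      # MB, scaled per requested core
          "limits": {
              "max_running_cores": 5000,
              "max_queued_cores": 10000,
          },
      }

      def fits(job, pq=UNIFIED_PQ):
          """Toy brokerage check: a job fits if its resource type and size are allowed."""
          return (job["resource_type"] in pq["resource_types"]
                  and job["corecount"] <= pq["corecount"]
                  and job["rss_mb"] <= pq["maxrss"])

      print(fits({"resource_type": "MCORE", "corecount": 8, "rss_mb": 12000}))  # True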

      > Payload Scheduler - Fernando & Tadashi
      >> Central control of resources for UCORE
      >> Dimensions to be controlled
      - Global shares, WQ alignment, local shares, site capabilities, data distribution
      >> Global Share Definition - reminder
      >> Panda flow scheme
      - Global shares are taken into account at the last level - job dispatching
      - For each share, n_queued = 2*n_running (see the sketch after this block)
      >> Alignment with JEDI WQ
      - There is a work queue per share per type (mcore, score..)
      - There are ad-hoc WQs, for example for big resources
      >> Global shares scheme
      >> Global shares in BigPanda
      >> Changing shares UI in ProdSys.
      - Adding a share is still a developer’s action
      >> Local partitioning
      - No way to say if mcore / score partitioning is dynamic
      >> Local partitioning ANALY vs PROD
      >> Global Shares Summary
      - Working well. We already have operation experience
      - Things to change in order for the shares to work perfectly
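
      A worked sketch of the two numbers mentioned above: resolving nested global shares into absolute fractions, and the per-share queue target n_queued = 2*n_running. The share tree and percentages are made up for illustration.

      # Sketch only: nested global shares -> absolute fractions, plus the queue target.
      SHARES = {                       # parent share -> {child: relative percentage}
          "ATLAS": {"Production": 80, "Analysis": 20},
          "Production": {"MC": 70, "Reprocessing": 30},
      }

      def absolute_shares(node="ATLAS", fraction=1.0, out=None):
          out = {} if out is None else out
          children = SHARES.get(node)
          if not children:                       # leaf share
              out[node] = fraction
              return out
          total = sum(children.values())
          for child, weight in children.items():
              absolute_shares(child, fraction * weight / total, out)
          return out

      def queue_target(n_running):
          return 2 * n_running                   # keep twice as many jobs queued as running

      print(absolute_shares())                   # {'MC': ~0.56, 'Reprocessing': ~0.24, 'Analysis': 0.2}
      print(queue_target(1500))                  # 3000
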
      >>> Harvester
      >> Motivation
      - Pilot submission to all resources
      - Pilot stream control
      - Clear separation of functionalities with pilot (Pilot 2)
      - Resource discovery & monitoring
      >> Design
      - Everything fully configurable / pluggable (a toy submitter sketch follows this block)
      - Push & Pull modes
      - HPC mode
          - Runs on edge node - light and fulfilling edge node restrictions
          - WN communication  via the shared FS
      - Grid/cloud mode
      >> Architecture
      >> Status - HPC
      - In use in most of the US HPCs
      >> Status - GRID
      - HTCondor plug-in
          - Scalability tests are ongoing. How many jobs can be run simultaneously? Can it run on all sites?
          - Running on 200 PQs, a few jobs per PQ.
      >> First Pilot Streaming commissioning
      >> Status - Cloud
      - Mixed mode: with cloud scheduler
      - Harvester only - Harvester is handling the VMs itself
          - Trying condor-let approach
      >> Status - Monitoring
      - First implementation - ready
      >> Plans
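
      A minimal sketch of the pluggable-submitter idea: one interface with interchangeable back-ends (grid vs shared-filesystem HPC mode) selected from configuration. Class and method names are assumptions and do not match the real Harvester plugin API; the condor_submit usage is illustrative.

      # Sketch only: interchangeable submitter back-ends chosen from configuration.
      import json, pathlib, subprocess

      class BaseSubmitter:
          def submit(self, workspec):            # workspec: dict describing one pilot/worker
              raise NotImplementedError

      class HTCondorSubmitter(BaseSubmitter):
          def submit(self, workspec):
              sub = f"executable = {workspec['pilot_wrapper']}\nqueue {workspec['n_workers']}\n"
              # hand the submit description to condor_submit on stdin (illustrative)
              subprocess.run(["condor_submit"], input=sub, text=True, check=True)

      class SharedFSSubmitter(BaseSubmitter):
          """HPC edge-node style: no batch command from the edge node, just drop a
          JSON descriptor on the shared filesystem for the worker-side agent."""
          def __init__(self, shared_dir):
              self.shared_dir = pathlib.Path(shared_dir)
          def submit(self, workspec):
              (self.shared_dir / f"{workspec['worker_id']}.json").write_text(json.dumps(workspec))

      def make_submitter(cfg):
          if cfg["mode"] == "grid":
              return HTCondorSubmitter()
          return SharedFSSubmitter(cfg["shared_dir"])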

      > Blacklisting Overview - Jarka
      >> How to receive PFT/AFT HC tests
      - New: HC tests UCORE
      - New: Testing can start within 1 hour
      >> HC Blacklisting
      - New: AGIS controller in production
      >> How to:
      - Set the PQ status to ONLINE/TEST/BROKEROFF/OFFLINE
      - Set the PQ status to AUTO, i.e. taken over by HC (a toy state model is sketched below)
      - Adding a new default queue
      - Adding a new PQ
      - Decommission a PQ
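
      A toy state model of the PQ status handling described above. The status names come from the talk; the controller logic below is an illustration only, not the actual HammerCloud/AGIS controller behaviour.

      # Sketch only: manual states are pinned by the admin, AUTO is driven by HC results.
      MANUAL_STATES = {"ONLINE", "TEST", "BROKEROFF", "OFFLINE"}

      def next_status(current, hc_test_passed, control="AUTO"):
          """If a queue is under manual control, keep whatever the admin set.
          Under AUTO control, HammerCloud results drive ONLINE <-> TEST transitions."""
          if control in MANUAL_STATES:
              return control                      # admin pinned the state
          if hc_test_passed:
              return "ONLINE"                     # HC puts a healthy queue back online
          return "TEST"                           # failing queues are blacklisted to TEST

      print(next_status("ONLINE", hc_test_passed=False))                       # TEST
      print(next_status("TEST", hc_test_passed=True))                          # ONLINE
      print(next_status("ONLINE", hc_test_passed=False, control="BROKEROFF"))  # BROKEROFF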

    • 14:00 18:00
      Storage
      Conveners: Fernando Harald Barreiro Megino (University of Texas at Arlington) , Martin Barisits (CERN)
      • 14:00
        Overview from DDM 1h
        • General overview, plus things like (not only!):
        • Movers and protocols strategy
        • Disk space info gathering
        • Localgroupdisk management
        • Consistency checks
        • Sites migration
        • Warm/hot/cold storage
        • SRM-less storage
        Speakers: Cedric Serfon (University of Oslo (NO)) , Mario Lassnig (CERN) , Martin Barisits (CERN)
      • 15:00
        FTS 15m
        Speaker: Andrea Manzi (CERN)
      • 15:15
        Data transfer latencies and transfer queue studies 15m
        Speakers: Ilija Vukotic (University of Chicago (US)) , Shawn Mc Kee (University of Michigan (US))
      • 15:30
        ROOT/HTTP third party copy 15m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER) , Wei Yang (SLAC National Accelerator Laboratory (US))
      • 15:45
        DDM/sites open questions 15m
      • 16:00
        Coffee break 30m
      • 16:30
        Tape 20m
        Speakers: Cedric Serfon (University of Oslo (NO)) , David Yu (Brookhaven National Laboratory (US)) , Tomas Javurek (Albert Ludwigs Universitaet Freiburg (DE))
      • 16:50
        Caches: xrootd caches and others 30m
        Speakers: Andrej Filipcic (Jozef Stefan Institute (SI)) , Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)) , Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER) , David Cameron (University of Oslo (NO)) , Wei Yang (SLAC National Accelerator Laboratory (US))
      • 17:20
        WAN vs LAN , direct I/O vs copy to scratch 20m
        Speakers: Nicolo Magini (INFN e Universita Genova (IT)) , Thomas Maier (Ludwig Maximilians Universitat (DE))