CMS's Nano-AOD - Andrea Rizzi (INFN Sezione di Pisa, Università e Scuola Normale Superiore, Pisa)

  • analysis data access pattern
    - event processing at analysis level differs from reconstruction
    - analysis “variations” needed
    - complex event selection (skimming)
  • nanoAOD format
    - ~1-2 kB / event
    - systematic variations not stored
    - full-split (ntuple-like) of collection attributes
    - columnar compression (LZMA) in “baskets”
    - a typical analysis reads ~10% of the per-event size (up to ~30-40%)
  • analysis-skimming steps
    - analysis skimming can typically reduce the number of events to handle by a factor of 100
    - some systematic variations are computed only after the skimming step
    - with “basket-based” compression, or without compression, skimming is necessary to avoid reading the whole column
  • baskets and compression
    - compression is part of the trick used to store floats with reduced precision
    - better/different option: write a data-format layer; use opaque types in ROOT 6.18
  • HDD (cold vs warm cache) vs SSD (cold cache) comparison in the slides
  • spreadsheet to study IO vs CPU boundaries and costs in the slides
    - IO is the bottleneck when using optimised code
    - LZMA remains the best trade-off
    - few advantages in having persistent intermediate formats
  • network IO
    - latency hiding to address network latency
    - if data are served from a single (or a few) HDDs, total seek time cannot be hidden
    - in computing centres, nanoAOD is a small fraction of the data and can be spread over several disks
    - current experience: network IO at analysis level is better handled with “lazy download”-like solutions => concrete technologies need to be tested here
  • conclusions
    - analysis access patterns typically do not bulk process the events
    - analysis access patterns cherry-pick the information to use
    - per-column read savings are in place
    - per-event read savings are not possible
    - options for uncompressed formats using opaque ROOT types
    - network access with latency hiding should be demonstrated for analysis use cases
  • see slides for more details 
  • from comments:
    - default basket size 32 kB - actual size to be checked
    - full Run-2 data (100%) in nanoAOD (~50 TB)
    - factor of 10 difference between CMS (~50 TB) and ATLAS
    - nanoAOD targeting ~60-80% of the CMS analyses
    - miniAOD will still be present
    - this topic started as a PB-scale problem to solve and is now at the TB level: official numbers need to be obtained from computing coordination, then wrap up and rescope if needed
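
Editorial sketch (not from the slides): a back-of-the-envelope check of the "per-column read savings, per-event not possible" conclusion, using the figures quoted above (32 kB baskets, skimming by a factor of 100, ~50 TB of nanoAOD, ~10% up to ~30-40% of the event content read). The assumptions are ours: the column holds one 4-byte float per event, the 32 kB refers to the uncompressed basket, selected events are scattered uniformly and independently, and the read fraction applies to the whole ~50 TB. All function and variable names are illustrative; these are not official numbers.

```python
# Figures quoted in the notes above
BASKET_BYTES = 32 * 1024       # default basket size (actual size to be checked)
SKIM_FRACTION = 1.0 / 100.0    # skimming keeps roughly 1 event in 100
DATASET_TB = 50.0              # full Run-2 nanoAOD, approximate
READ_FRACTIONS = (0.10, 0.40)  # typical and upper fraction of the event content read

# Assumption (not from the notes): one 4-byte float per event in the column
BYTES_PER_VALUE = 4


def events_per_basket(basket_bytes, bytes_per_value):
    """Number of events whose values fit into one uncompressed basket."""
    return basket_bytes // bytes_per_value


def fraction_of_baskets_touched(selection_fraction, n_per_basket):
    """Probability that a basket holds at least one selected event,
    assuming each event is selected independently with the same probability."""
    return 1.0 - (1.0 - selection_fraction) ** n_per_basket


def column_volume_tb(dataset_tb, read_fraction):
    """Data volume read when only a fraction of the columns is accessed."""
    return dataset_tb * read_fraction


n_per_basket = events_per_basket(BASKET_BYTES, BYTES_PER_VALUE)
touched = fraction_of_baskets_touched(SKIM_FRACTION, n_per_basket)
typical_tb, upper_tb = (column_volume_tb(DATASET_TB, f) for f in READ_FRACTIONS)

print("events per basket:", n_per_basket)
print("fraction of baskets touched by the skim:", touched)
print("volume read per pass (TB): typical", typical_tb, "upper", upper_tb)
```

Under these assumptions essentially every basket contains at least one selected event, so reading only the skimmed events still decompresses almost the whole column; the saving comes from reading fewer columns, not fewer events.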


HL-LHC review document preparations - Xavier Espinal

  • need input from the DOMA ACCESS community
  • please, contribute to the document
  • community input needed, e.g.:
    - update on XCache initiatives (US, DE, FR, IT) after 6 months of experience: collected metrics and operational experience
    - baseline computing model estimates for HL-LHC data
