DOMA / ACCESS Meeting

Start: 2020-02-11T17:30:00+01:00
End: 2020-02-11T18:50:00+01:00
Location: CERN

Tuesday 11 Feb 2020, 17:30 → 18:50 Europe/Zurich

513/1-024 (CERN)

Frank Wuerthwein (Univ. of California San Diego (US)), Ilija Vukotic (University of Chicago (US)), Markus Schulz (CERN), Stephane Jezequel (LAPP-Annecy CNRS/USMB (FR)), Xavier Espinal (CERN)

Present: Xavier Espinal, Frank Wuerthwein, Ilija Vukotic, Stephane Jezequel, Markus Schulz, Riccardo Di Maria, Andrea Sciabà, Andrea Rizzi, Daniele Spiga, Diego Ciangottini, Gonzalo Merino, Laurent Duflot, Michael Helmut Holzbock, Nikola Hardi, Nikolai Marcel Hartmann, Teng

CMS's Nano-AOD - Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, Pisa)

analysis data access pattern
- event processing at analysis level differs from reconstruction
- analysis “variations” needed
- complex event selection (skimming)
nanoAOD format
- ~1-2 kB/event
- systematic variations not stored
- full-split (ntuple-like) of collection attributes
- columnar compression (LZMA) in “baskets”
- a typical analysis reads ~10% of the per-event size (up to ~30-40%)
analysis-skimming steps
- analysis skimming typically reduces the number of events to handle by a factor of 100
- some systematic variations are computed only after the skimming step
- with “basket-based” compression, or without compression, skimming is necessary to avoid reading the whole column
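As a rough illustration of the numbers above, the sketch below estimates how many bytes one analysis pass reads. The event count is a hypothetical placeholder, not a CMS figure; the event size, read fractions and skim factor are the approximate values quoted in these minutes.

```python
def bytes_read(n_events, event_size_bytes, fraction_read):
    """Bytes fetched when only a fraction of each event's columns is read."""
    return n_events * event_size_bytes * fraction_read


# Hypothetical sample size (assumption, for illustration only).
n_events = 1_000_000_000
# Middle of the ~1-2 kB/event range quoted above.
event_size = 1500

typical = bytes_read(n_events, event_size, 0.10)   # ~10% of each event
heavy = bytes_read(n_events, event_size, 0.35)     # ~30-40% of each event
# After a skim that keeps 1 event in 100, later passes touch far less data.
skimmed = bytes_read(n_events // 100, event_size, 0.10)

for label, size in [("typical", typical), ("heavy", heavy), ("skimmed", skimmed)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

The estimate ignores compression and basket granularity, so it only shows the relative effect of column selection and skimming.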
baskets and compression
- compression is part of the trick used to store floats with reduced precision
- better/different option: write a data-format layer; use opaque types in ROOT 6.18
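One common way to combine reduced precision with compression, sketched here under assumptions, is to zero the least-significant mantissa bits of each float so that the compressor sees long runs of identical bits. This is an illustration using only Python's standard library, not the nanoAOD or ROOT implementation; the column of values and the choice of 10 kept mantissa bits are hypothetical.

```python
import lzma
import random
import struct


def truncate_mantissa(value, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of a float32 value."""
    drop = 23 - keep_bits
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    (out,) = struct.unpack("<f", struct.pack("<I", bits))
    return out


def compressed_size(values):
    """Size in bytes of the LZMA-compressed float32 encoding of values."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(lzma.compress(raw))


rng = random.Random(42)
# Hypothetical column of 20000 values (assumption, not real data).
column = [rng.uniform(20.0, 200.0) for _ in range(20000)]

full = compressed_size(column)
reduced = compressed_size([truncate_mantissa(v, 10) for v in column])
print(f"full precision: {full} bytes, 10-bit mantissa: {reduced} bytes")
```

Keeping 10 mantissa bits bounds the relative error at roughly one part in a thousand, which is the trade-off against the smaller compressed size.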
HDD (cold vs warm cache) vs SSD (cold cache) comparison in the slides
spreadsheet to study IO vs CPU boundaries and costs in the slides
- IO is the bottleneck when using optimised code
- LZMA remains the best trade-off
- few advantages in having persistent intermediate formats
network IO
- latency hiding to address network latency
- if data served from a single (or few) HDD, total seek time cannot be hidden
- in computing centres, nanoAOD files are a small fraction and can be spread over several disks
- current experience: network IO at analysis level is better handled with “lazy download”-like solutions => concrete technologies to test are needed here
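The latency-hiding idea above can be sketched as a read-ahead loop that keeps several requests in flight while earlier results are consumed. This is a minimal illustration, not a proposal for a specific technology: `fetch_basket`, its simulated latency and the prefetch depth are hypothetical stand-ins for a real remote read.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fetch_basket(index, latency=0.02):
    """Stand-in for a remote read: sleeps to mimic a network round trip."""
    time.sleep(latency)
    return bytes([index % 256]) * 4


def read_with_prefetch(indices, depth=8):
    """Yield baskets in order while up to `depth` requests are in flight."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for i in indices:
            pending.append(pool.submit(fetch_basket, i))
            if len(pending) >= depth:
                # Consume the oldest request; newer ones keep running.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


baskets = list(read_with_prefetch(range(20)))
print(f"read {len(baskets)} baskets")
```

Overlapping requests like this hides round-trip latency only; as noted above, it does not help when the data sit on a single HDD and total seek time dominates.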
conclusions
- analysis access patterns typically do not bulk process the events
- analysis access patterns cherry-pick the information to use
- per-column read savings are in place
- per-event read savings are not possible
- options for uncompressed formats using opaque ROOT types
- network access with latency hiding should be demonstrated for analysis use cases
see slides for more details
from comments:
- default basket size 32 kB - actual size to be checked
- full Run-2 data (100%) in nanoAOD (~50TB)
- factor of 10 difference between CMS (~50TB) and ATLAS
- nanoAOD targeting ~60-80% of the CMS analyses
- miniAOD will still be present
- this topic started as a PB-scale problem and is now at the TB level: official numbers are needed from computing coordination, in order to wrap up and rescope if necessary

HL-LHC review document preparations - Xavier Espinal

need input from the DOMA ACCESS community
please contribute to the document
community input needed, e.g.:
- XCache initiatives update after 6 months of experience: US, DE, FR, IT: collected metrics and operational experience
- baseline computing model estimates for HL-LHC data

- 17:30 → 17:35
  
  Introduction 5m
  
  Speakers: Frank Wuerthwein (Univ. of California San Diego (US)), Ilija Vukotic (University of Chicago (US)), Stephane Jezequel (LAPP-Annecy CNRS/USMB (FR))
- 17:35 → 18:05
  
  CMS's Nano-AOD 30m
  
  Speaker: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
  
  slides
- 18:05 → 18:20
  HL-LHC review document preparations 15m
  Need input from the DOMA ACCESS community to prepare the document for the HL-LHC review.
  
  Please contribute to the document: link
  
  Community input needed, e.g.:
  - XCache initiatives update after 6 months of experience: US, DE, FR, IT: collected metrics and operational experience.
  - Baseline computing model estimates for HL-LHC data
- 18:20 → 18:30
  
  AOB 10m