36th ROOT Parallelism, Performance and Programming Model

Name: 36th ROOT Parallelism, Performance and Programming Model
Start: 2018-03-22T16:00:00+01:00
End: 2018-03-22T17:45:00+01:00
Location: CERN

Thursday 22 Mar 2018, 16:00 → 17:45 Europe/Zurich

4/S-020 (CERN)


Danilo Piparo (CERN)

Present: Xavi, Guilherme, Danilo, Giulio, Enric, Enrico, Gerri, Jakob, Philippe, Pere, Stefan, Jim, Axel

An Apache Arrow TDataSource

- O2 needs new technologies, most notably to adopt zero-copy memory buffers. Interoperability with ROOT is key, for example for analysis.

- The current version of the data source requires the full dataset to be in memory. A natural extension is to read rows in batches.
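The batch-reading extension mentioned above can be illustrated with a minimal sketch in plain C++. It does not use the ROOT or Arrow APIs: `MakeBatches` and `SumInBatches` are hypothetical names, and the `std::vector<double>` only stands in for a column that a real source would deliver one batch at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of ROOT or Arrow: split nRows rows into
// consecutive half-open ranges [begin, end) of at most batchSize rows each.
std::vector<std::pair<std::size_t, std::size_t>>
MakeBatches(std::size_t nRows, std::size_t batchSize)
{
   assert(batchSize > 0 && "batch size must be positive");
   std::vector<std::pair<std::size_t, std::size_t>> batches;
   for (std::size_t begin = 0; begin < nRows; begin += batchSize) {
      // The last batch may be shorter than batchSize.
      const std::size_t end = std::min(begin + batchSize, nRows);
      batches.emplace_back(begin, end);
   }
   return batches;
}

// Sum a column by visiting one batch of rows at a time. The vector is an
// in-memory stand-in (assumption) for data that would arrive batch by batch.
double SumInBatches(const std::vector<double> &column, std::size_t batchSize)
{
   double sum = 0.;
   for (const auto &range : MakeBatches(column.size(), batchSize))
      for (std::size_t i = range.first; i < range.second; ++i)
         sum += column[i];
   return sum;
}
```

With 10 rows and a batch size of 4, `MakeBatches` yields the ranges [0, 4), [4, 8) and [8, 10), so only one such range needs to be resident while it is processed.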

Giulio's remarks about TDF and TDS:

1. Extension: do not set entries one by one, but rather extract values for bunches of them. There are two optimisations here. The first is to read in bunches within the TDS itself, without changing interfaces. The second would be to expose vectors to the user (TDF) and let it navigate through them, as through a memory page.

2. Extension: support "non-rectangular sources", i.e. columns with different numbers of entries. Perhaps the way the slow track limits the fast track could be improved.

3. Add a TDataSink, for example to refill an Arrow table without having to fill a table in a Foreach, with a copy.

4. Have a way to load two different "events" in order to decide whether an object is owned by one event or the other, in an "untriggered environment". For example, have a column that is filled with N entries for a single entry of k other columns.

5a. Support combinations: given two columns of sizes N and M, execute a function f(n, m).

5b. Support iteration over all possible associations.
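Items 5a and 5b can be pictured with a minimal sketch in plain C++ (C++14). It is not a TDF or TDS interface: `ApplyToAllPairs` is a hypothetical name, and the two `std::vector` arguments stand in for the columns of sizes N and M within one entry.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper, not a ROOT interface: call f(a, b) for every
// association of an element of `as` (size N) with an element of `bs`
// (size M) and collect the N * M results, with `as` as the outer loop.
template <typename A, typename B, typename F>
auto ApplyToAllPairs(const std::vector<A> &as, const std::vector<B> &bs, F f)
{
   // Store results by value even if f returns a reference.
   using Result = std::decay_t<decltype(f(as[0], bs[0]))>;
   std::vector<Result> results;
   results.reserve(as.size() * bs.size());
   for (const auto &a : as)
      for (const auto &b : bs)
         results.push_back(f(a, b));
   return results;
}
```

For columns {1, 2} and {10, 20, 30} and f(n, m) = n + m, this yields six values in the order 11, 21, 31, 12, 22, 32. A real implementation would probably iterate lazily rather than materialise all N * M results.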

Axel asks whether Arrow supports nested collections: it does.
Jim stresses that polymorphism is possible via unions, a bit like std::variant. He also stresses the lack of pointers.
Jim remarks that Arrow and Parquet C++ are being developed together, with the idea that data read from Parquet fills Arrow in memory.
Jim remarks that TDF is more functional than declarative. Giulio underlines that TDF is more an abstraction of the event loop than a data frame.
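
Jim's std::variant analogy can be made concrete with a small C++17 sketch. It is only an analogy in plain C++ and does not use the Arrow API: the `Value` alias and the `Describe` function are hypothetical names chosen for illustration.

```cpp
#include <string>
#include <variant>

// Hypothetical value type: each element holds exactly one of the listed
// alternatives, by value and without pointers to a polymorphic base class.
using Value = std::variant<int, double, std::string>;

// Report which alternative is currently stored in v.
std::string Describe(const Value &v)
{
   switch (v.index()) { // index() is the position of the active alternative
   case 0: return "int";
   case 1: return "double";
   case 2: return "string";
   default: return "valueless"; // only reachable after a throwing assignment
   }
}
```

A column of such values is "polymorphic" in the sense that each element carries a tag saying which type it holds, and code dispatches on that tag instead of following a pointer.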

- 16:00 → 16:25
  
  TArrowDS: Motivation, Status and Feedback 25m
  
  Speaker: Giulio Eulisse (CERN)
  
  slides.pdf
- 16:25 → 16:45
  Round Table and Discussion 20m
  - Towards 6.14
  - TDF string filters/defines: support of branch.subbranch.[...].leaf syntax
  - PyROOT