36th ROOT Parallelism, Performance and Programming Model

Europe/Zurich
4/S-020 (CERN)

Danilo Piparo (CERN)

Present: Xavi, Guilherme, Danilo, Giulio, Enric, Enrico, Gerri, Jakob, Philippe, Pere, Stefan, Jim, Axel

An Apache Arrow TDataSource

- O2 needs new technologies, most notably to adopt zero-copy memory buffers. Interoperability with ROOT is key, for example for analysis.

- The current version of the data source requires the full dataset to be in memory. A natural extension is to read rows in batches.
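
As a rough illustration of the batch-reading idea, here is a minimal sketch in plain Python. It is not the TArrowDS or Arrow API: the names `iter_batches` and `sum_in_batches` are hypothetical, and a standard-library `array` stands in for an in-memory column. Each batch is a `memoryview` slice, so the values are viewed in place rather than copied.

```python
from array import array
from typing import Iterator


def iter_batches(column: array, batch_size: int) -> Iterator[memoryview]:
    """Yield consecutive zero-copy views of at most batch_size entries.

    Hypothetical helper: `column` stands in for one in-memory column.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    view = memoryview(column)  # shares the column's buffer, no copy
    for start in range(0, len(view), batch_size):
        # Slicing a memoryview yields another view on the same buffer.
        yield view[start:start + batch_size]


def sum_in_batches(column: array, batch_size: int) -> float:
    """Process the column one batch at a time instead of entry by entry."""
    total = 0.0
    for batch in iter_batches(column, batch_size):
        total += sum(batch)
    return total
```

Because the batches are views, a write through a batch is visible in the original column, which is the property a zero-copy buffer is meant to provide.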

Remarks of Giulio about TDF and TDS:

1. Extension: do not set entries one by one, but rather extract values for bunches of them. This covers two optimisations. The first is to read in bunches within the TDS itself, without changing interfaces. The second is to expose vectors to the user (TDF) and let them navigate through each vector as a memory page.

2. Extension: support "non-rectangular sources", i.e. columns with different numbers of entries. Perhaps the limitation that the slow track imposes on the fast track could be improved.

3. Add a TDataSink, for example to refill an Arrow table directly, rather than filling a table in a foreach with a copy.

4. Have a way to load 2 different "events" in order to decide if an object is owned by one event or the other, in an "untriggered environment". For example have a column which is filled by N entries for a single entry of k other columns.

5a. Support combinations: given 2 columns, sized N and M, execute a function f(n,m).

5b. Support iteration of all possible associations.
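
Points 5a and 5b can be sketched as follows, under one possible reading: the two columns hold, for a given entry, collections of N and M elements, and f is called on every pair. The function names below are hypothetical and nothing here is a TDF interface. The sketch only uses `itertools.product` from the Python standard library.

```python
from itertools import product
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def iter_associations(col_a: Sequence[A], col_b: Sequence[B]) -> Iterator[Tuple[A, B]]:
    """5b (assumed meaning): iterate lazily over all N*M pairs (a, b)."""
    return product(col_a, col_b)


def apply_to_combinations(
    f: Callable[[A, B], R], col_a: Sequence[A], col_b: Sequence[B]
) -> List[R]:
    """5a (assumed meaning): evaluate f(n, m) for every element pair."""
    return [f(a, b) for a, b in iter_associations(col_a, col_b)]
```

For example, `apply_to_combinations(lambda n, m: n + m, [1, 2], [10, 20, 30])` evaluates f on 2 x 3 = 6 pairs, with the first column varying slowest.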

  • Axel asks whether Arrow supports nested collections: it does.
  • Jim stresses that one can have polymorphism via unions, a bit like std::variant. He also stresses the lack of pointers.
  • Jim remarks that Arrow and Parquet C++ are being developed together, with the idea of having Arrow in memory filled by Parquet being read.
  • Jim remarks that TDF is more functional than declarative. Giulio underlines that TDF is more an abstraction of the event loop than a data frame.
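
Jim's remark about unions and the lack of pointers can be illustrated with a simplified, pure-Python model of a dense-union layout: a buffer of type ids selects a child array for each slot, and a buffer of offsets gives the position inside that child. This is only a sketch, not the Arrow library. The function name is hypothetical, and the type id is assumed to equal the child index.

```python
from array import array
from typing import Any, List, Sequence


def read_dense_union(
    type_ids: Sequence[int],
    offsets: Sequence[int],
    children: Sequence[Sequence[Any]],
    i: int,
) -> Any:
    """Return the value in slot i of a dense-union-like column.

    Assumption: type id k refers directly to children[k].
    """
    child = children[type_ids[i]]  # which alternative is stored in slot i
    return child[offsets[i]]       # where it sits inside that child array


def to_list(
    type_ids: Sequence[int],
    offsets: Sequence[int],
    children: Sequence[Sequence[Any]],
) -> List[Any]:
    """Materialise every slot, e.g. for inspection."""
    return [
        read_dense_union(type_ids, offsets, children, i)
        for i in range(len(type_ids))
    ]
```

Each slot is resolved through integer indices only, so no pointers are needed, and each child array stays homogeneous while the column as a whole holds values of different types, much like a column of std::variant.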
    • 16:00 → 16:25
      TArrowDS: Motivation, Status and Feedback 25m
      Speaker: Giulio Eulisse (CERN)
    • 16:25 → 16:45
      Round Table and Discussion 20m
      • Towards 6.14
      • TDF string filters/defines: support of branch.subbranch.[...].leaf syntax
      • PyROOT