Speaker
Description
The declarative approach to data analysis provides high-level abstractions that let users operate on their datasets far more ergonomically than imperative interfaces do. ROOT offers such a tool with RDataFrame, which builds a computation graph from the operations issued by the user and executes it lazily, only when the final results are queried. It has always been oriented towards parallelisation, with native support for multithreaded execution on a single machine.
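The lazy-execution idea described above can be illustrated with a toy sketch in plain Python. It is not ROOT's implementation or API: `LazyFrame`, `LazyResult` and their methods are hypothetical names chosen for this illustration, and the only point is that booking operations records them, while the event loop runs once the result is requested.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not ROOT's API)."""

    def __init__(self, source, ops=None):
        self._source = source          # any iterable of numbers
        self._ops = list(ops or [])    # recorded operations: the "computation graph"

    def filter(self, predicate):
        # Booking a filter only records it; no data is read here.
        return LazyFrame(self._source, self._ops + [("filter", predicate)])

    def define(self, func):
        # Booking a transformation only records it as well.
        return LazyFrame(self._source, self._ops + [("map", func)])

    def sum(self):
        # Returns a handle to the future result instead of the result itself.
        return LazyResult(self)

    def _run(self):
        # Single pass over the data, applying the recorded operations in order.
        total = 0
        for value in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "filter":
                    if not fn(value):
                        keep = False
                        break
                else:
                    value = fn(value)
            if keep:
                total += value
        return total


class LazyResult:
    """Handle that triggers the computation on first access and caches the value."""

    def __init__(self, frame):
        self._frame = frame
        self._value = None
        self.executed = False

    def get_value(self):
        if not self.executed:
            self._value = self._frame._run()
            self.executed = True
        return self._value


# Booking the operations does not touch the data.
result = LazyFrame(range(10)).filter(lambda x: x % 2 == 0).define(lambda x: x * x).sum()
# The loop over the data runs only here, when the value is queried.
print(result.get_value())
```

In this sketch each booking call returns a new object rather than mutating the old one, so several results can share a common prefix of operations; that is a design choice of the toy, not a statement about RDataFrame internals.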
Recently, RDataFrame has been extended with a Python layer that can steer and execute the RDataFrame computation graph over a set of distributed resources, requiring minimal code changes for an existing RDataFrame application to run in a distributed fashion. The new tool has a modular design that supports multiple backends: a single interface can be connected to several distributed computing frameworks.
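As a rough illustration of the modular design, the following sketch shows one way a single interface could sit in front of interchangeable execution backends. It assumes a simple map-reduce model over entry ranges and uses only the Python standard library; `Backend`, `SequentialBackend`, `ThreadPoolBackend`, `split_range` and `sum_of_squares` are hypothetical names, and a local thread pool stands in for a real distributed framework.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class Backend(ABC):
    """Hypothetical backend interface: run a mapper per partition, then merge."""

    @abstractmethod
    def map_reduce(self, mapper, reducer, partitions):
        ...


class SequentialBackend(Backend):
    """Processes the partitions one after another in the current process."""

    def map_reduce(self, mapper, reducer, partitions):
        return reduce(reducer, map(mapper, partitions))


class ThreadPoolBackend(Backend):
    """Stand-in for a distributed scheduler, using a local thread pool."""

    def __init__(self, max_workers=4):
        self.max_workers = max_workers

    def map_reduce(self, mapper, reducer, partitions):
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            partials = list(pool.map(mapper, partitions))
        return reduce(reducer, partials)


def split_range(n_entries, n_partitions):
    """Split [0, n_entries) into n_partitions contiguous (start, stop) ranges."""
    step, remainder = divmod(n_entries, n_partitions)
    bounds, start = [], 0
    for i in range(n_partitions):
        stop = start + step + (1 if i < remainder else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sum_of_squares(backend, n_entries, n_partitions):
    """The 'analysis': identical user code regardless of the chosen backend."""

    def mapper(bounds):
        start, stop = bounds
        return sum(x * x for x in range(start, stop))

    partitions = split_range(n_entries, n_partitions)
    return backend.map_reduce(mapper, lambda a, b: a + b, partitions)


# Only the backend object changes between the two runs.
for backend in (SequentialBackend(), ThreadPoolBackend(max_workers=2)):
    print(type(backend).__name__, sum_of_squares(backend, 100, 4))
```

The analysis function never refers to a concrete backend, which is the property the modular design is after: swapping the execution engine is a one-line change at the call site.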
Distributed RDataFrame has been included in ROOT as an experimental feature since v6.24 and is currently under heavy development. This presentation will show current performance figures from running real analyses with two different computing frameworks: Apache Spark and Dask. It will also discuss the performance optimisations being applied to Distributed RDataFrame, namely caching, exploitation of RDataFrame's native multithreading, and compilation of C++ kernels in the distributed worker processes.
Speaker time zone
Compatible with Europe