19–25 Oct 2024
Europe/Zurich timezone

Distributed analysis in production with RDataFrame

24 Oct 2024, 16:15
18m
Large Hall B

Talk Track 9 - Analysis facilities and interactive computing Parallel (Track 9)

Speaker

Marta Czurylo (CERN)

Description

The ROOT software package provides the data format used in High Energy Physics by the LHC experiments. It offers a data analysis interface called RDataFrame, which has proven to adapt well to the requirements of modern physics analyses. However, as the volume of data collected by the LHC experiments increases, performing an efficient analysis becomes more challenging. One way to ease this challenge is to leverage modern high-performance distributed computing environments, for which RDataFrame provides an easy-to-use interface layer: the distributed RDataFrame.

In this talk, we show that the distributed RDataFrame has left the experimental testing phase and is now ready for production thanks to a stabilized user interface. We delve into recent improvements of the distributed RDataFrame, including Pythonizations of the interface that allow workflows to run seamlessly (for example, together with the XGBoost library). As a variety of distributed environments is available across different geographical locations, we demonstrate reproducibility and compare performance across several of them.

Primary author

Co-authors

Andrea Maria Ola Mejicanos (University of Wisconsin Madison (US))
Danilo Piparo (CERN)
Dr Vincenzo Eduardo Padulano (CERN)
