Data analysis in high energy physics experiments requires processing large amounts of data. Because the main objective is to find interesting events among those recorded by the detectors, the typical operations are filtering the data by applying cuts and producing histograms. A typical offline data analysis scenario for the TOTEM experiment at the LHC at CERN involves processing hundreds of ROOT ntuples of 1-2 GB each, which amounts to up to 1 TB of data per analysis. Individual events are relatively small (1 KB to 1 MB), with most events being 1 KB in size.
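The two operations named above, applying cuts and filling histograms, can be illustrated with a minimal sketch in plain Python. The events are synthetic, and the field names `x` and `theta`, the cut thresholds and the binning are all hypothetical choices made for illustration; they are not taken from the TOTEM analysis.

```python
import random


def make_events(n, seed=0):
    """Generate synthetic events; the fields 'x' and 'theta' are hypothetical."""
    rng = random.Random(seed)
    return [
        {"x": rng.gauss(0.0, 1.0), "theta": rng.uniform(0.0, 1.0)}
        for _ in range(n)
    ]


def passes_cuts(event):
    """Illustrative cut: keep events inside assumed windows on both fields."""
    return abs(event["x"]) < 2.0 and event["theta"] > 0.1


def histogram(values, n_bins, lo, hi):
    """Fixed-width histogram over [lo, hi); values outside the range are dropped."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


events = make_events(10_000)
selected = [e for e in events if passes_cuts(e)]       # filtering by cuts
counts = histogram([e["x"] for e in selected], 8, -2.0, 2.0)  # histogramming
```

Every selected event satisfies |x| < 2, so each one falls into exactly one of the eight bins and the bin contents sum to the number of selected events.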
The goal of our work is to investigate the usability of one of the modern big data toolkits, namely Apache Spark, as an interactive environment for parallel data analysis. As a proof-of-concept solution we employed Apache Spark 2.0 combined with Spark ROOT for accessing ROOT files. To make the environment interactive, we coupled it with Zeppelin, a web-based notebook environment that allows analysis code written in Scala, which accesses the Spark API, to be combined with Python code for creating plots. This environment was deployed on the Prometheus cluster at the Academic Computer Centre Cyfronet AGH and integrated with the SLURM resource management system. We developed scripts that combine these tools in a user-friendly way and a set of notebooks showing sample analyses.
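The data-parallel pattern that makes such a toolkit a natural fit for histogramming can be sketched without Spark itself. The example below is plain Python and does not use the Spark or Spark ROOT API: a thread pool stands in for cluster executors, and the partition count, binning and random input values are assumptions made for illustration. Each partition is reduced to a small partial histogram, and the partial histograms are merged by bin-wise addition.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Assumed binning for the illustration.
N_BINS, LO, HI = 10, 0.0, 1.0


def partial_histogram(values):
    """Histogram one partition of the data (the 'map' step)."""
    counts = [0] * N_BINS
    width = (HI - LO) / N_BINS
    for v in values:
        if LO <= v < HI:
            counts[min(int((v - LO) / width), N_BINS - 1)] += 1
    return counts


def merge(a, b):
    """Combine two partial histograms bin by bin (the 'reduce' step)."""
    return [x + y for x, y in zip(a, b)]


rng = random.Random(1)
values = [rng.random() for _ in range(8000)]

# Split the data into four partitions, as a cluster framework would do with files.
partitions = [values[i::4] for i in range(4)]

# The thread pool is only a stand-in for distributed workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_histogram, partitions))

merged = reduce(merge, partials)
serial = partial_histogram(values)  # single-pass reference result
```

Because bin-wise addition is associative and commutative, the merged result equals the histogram computed in a single pass, regardless of how the data are partitioned.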
Our plans include the implementation of selected data analysis pipelines, performance analysis, integration with Jupyter notebooks and SWAN (Service for Web based ANalysis) from CERN, and the development of high-level, user-friendly data analysis tools and libraries dedicated to high energy physics.
This research was supported in part by PLGrid Infrastructure.