CS3 2018 - Workshop on Cloud Storage Synchronization and Sharing Services

Name: CS3 2018 - Workshop on Cloud Storage Synchronization and Sharing Services
Start: 2018-01-29T08:00:00+01:00
End: 2018-01-31T18:00:00+01:00
Location: AGH Computer Science Building D-17

Europe/Zurich timezone

Towards interactive data analysis for TOTEM experiment using Apache Spark

29 Jan 2018, 11:30

30m

AGH WIET, Department of Computer Science, Building D-17, Street Kawiory 21, Krakow

Presentation User Voice: Novel Applications

Mr Grzegorz Bogdał (AGH University of Science and Technology); Mr Piotr Gawryś (AGH University of Science and Technology); Mr Paweł Nowak (AGH University of Science and Technology); Mr Łukasz Plewnia (AGH University of Science and Technology); Leszek Grzanka (AGH University of Science and Technology (PL)); Maciej Malawski (AGH University of Science and Technology)

Data analysis in High Energy Physics experiments requires processing large amounts of data. Because the main objective is to find interesting events among those recorded by the detectors, the typical operations are filtering data by applying cuts and producing histograms. The typical offline data analysis scenario for the TOTEM experiment at the LHC at CERN involves processing hundreds of ROOT ntuples of 1-2 GB each, which gives up to 1 TB of data per analysis. Events are relatively small (1 KB to 1 MB), and most are about 1 KB in size.
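The two operations named above, applying cuts and filling a histogram, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Event` fields (`theta_x`, `n_tracks`), the cut thresholds, and the sample values are hypothetical and are not taken from the TOTEM analysis.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical per-event quantities; names and units are illustrative only.
    theta_x: float
    n_tracks: int


def passes_cuts(ev, max_abs_theta=1.0, min_tracks=2):
    """Return True when the event survives every selection cut."""
    return abs(ev.theta_x) <= max_abs_theta and ev.n_tracks >= min_tracks


def histogram(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins over [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


# Made-up sample events, not real detector data.
events = [
    Event(-0.8, 3),
    Event(0.1, 1),   # rejected: too few tracks
    Event(0.3, 2),
    Event(1.7, 4),   # rejected: angle outside the cut
    Event(0.9, 2),
]

selected = [ev for ev in events if passes_cuts(ev)]
counts = histogram((ev.theta_x for ev in selected), lo=-1.0, hi=1.0, n_bins=4)
print(len(selected), counts)
```

In a real analysis the events would be read from ROOT ntuples rather than built by hand, and the cuts would be defined by the physics selection.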

The goal of our work is to investigate the usability of one modern big data toolkit, Apache Spark, as an interactive environment for parallel data analysis. As a proof-of-concept solution we employed Apache Spark 2.0 combined with Spark ROOT for accessing ROOT files. To make the environment interactive, we coupled it with Zeppelin, a Web-based notebook environment that lets users combine analysis code in Scala, which accesses the Spark API, with Python code for creating plots. This environment was deployed on the Prometheus cluster at the Academic Computer Centre Cyfronet AGH and integrated with the SLURM resource management system. We developed scripts that combine these tools in a user-friendly way, along with a set of notebooks showing sample analyses.
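The parallel pattern that makes histogramming a natural fit for an engine such as Spark is that each data partition can be histogrammed independently and the partial results merged by element-wise addition. The sketch below mimics that pattern in plain Python. It is an assumption-laden illustration, not the authors' code and not the Spark API: the partitions, bin edges, and function names are hypothetical.

```python
from functools import reduce


def partial_histogram(values, lo, hi, n_bins):
    """Histogram one partition into n_bins equal-width bins over [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def merge(a, b):
    """Combine two partial histograms with identical binning."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical partitions, standing in for chunks of data spread over workers.
partitions = [
    [0.1, 0.4, 0.9],
    [0.2, 0.6],
    [0.75, 0.95, 0.05],
]

# "Map" step: each partition is processed on its own.
partials = [partial_histogram(p, 0.0, 1.0, 4) for p in partitions]

# "Reduce" step: partial results are summed bin by bin.
total = reduce(merge, partials, [0] * 4)

# The merged result matches a single pass over all the data.
all_values = [v for p in partitions for v in p]
single_pass = partial_histogram(all_values, 0.0, 1.0, 4)
print(total, single_pass)
```

Because the merge is associative and commutative, the partial histograms can be combined in any order, which is what allows the work to be distributed across cluster nodes.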

Our plans include implementing selected data analysis pipelines, analysing their performance, integrating with Jupyter notebooks and SWAN (Service for Web based ANalysis) from CERN, and developing high-level, user-friendly data analysis tools and libraries dedicated to high energy physics.

This research was supported in part by PLGrid Infrastructure.

References:

https://spark.apache.org/
https://github.com/diana-hep/spark-root
https://zeppelin.apache.org/
http://www.plgrid.pl/en
https://swan.web.cern.ch/

CS3-TOTEM-SPARK-v2.pdf
