EP-IT Data Science Seminars

Big Data Tools and Pipelines for Machine Learning in HEP

by Luca Canali (CERN)

40/S2-B01 - Salle Bohr (CERN)

Applying complex ML techniques to HEP at large scale poses several technological challenges, most importantly the implementation of dedicated end-to-end data pipelines. Tools and platforms from the “Big Data” ecosystem, developed by industry and the open-source community, can be profitably used for HEP use cases. This talk reports on one such example: the deployment of an ML pipeline for training a particle classifier based on neural networks. Apache Spark is used for data preparation and feature engineering, with the corresponding Python code run interactively in Jupyter notebooks. A neural network model, defined using the Keras API, is trained in a distributed fashion on Spark clusters, using BigDL with Analytics Zoo and TensorFlow. The talk will describe the solutions from the Big Data ecosystem deployed at CERN IT that are relevant for ML use cases, with an outlook on the future evolution of this work.

Organized by

M. Girone, M. Elsing, L. Moneta, M. Pierini
Coffee will be served at 10h30