Machine learning has become a hot topic, and Spark is being adopted rapidly as an engine and framework for machine learning problems at scale. In particular, Spark provides distributed computing, integration with the rest of the Hadoop ecosystem, and a specialized library for machine learning (MLlib).
In this tutorial, participants will learn why Apache Spark is a good fit for big data analysis and how to use Apache Spark with Python for machine learning. As an example, we will use the data from the Higgs Boson Machine Learning Challenge, published on Kaggle by the ATLAS experiment. The goal of the challenge is to explore the potential of machine learning methods to improve the discovery significance of the experiment.
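To make "discovery significance" concrete: the challenge scored submissions with the approximate median significance (AMS), computed from the summed event weights of true signal (s) and true background (b) among the events a classifier selects as signal. The following is a minimal plain-Python sketch of that metric, not the tutorial's own code. The helper names `ams` and `selected_weights` are hypothetical, and the labels are assumed to be the strings "s" and "b" as in the challenge data.

```python
import math


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate median significance.

    s: summed weight of true signal events selected as signal.
    b: summed weight of true background events selected as signal.
    b_reg: regularization term; the challenge used 10.
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def selected_weights(labels, weights, predictions):
    """Sum the weights of events predicted as signal, split by true label.

    Hypothetical helper: labels and predictions are sequences of "s"/"b".
    """
    s = sum(w for y, w, p in zip(labels, weights, predictions) if p == "s" and y == "s")
    b = sum(w for y, w, p in zip(labels, weights, predictions) if p == "s" and y == "b")
    return s, b


# Toy usage with made-up weights (not real challenge data).
labels = ["s", "b", "s", "b"]
weights = [1.5, 20.0, 0.5, 30.0]
predictions = ["s", "s", "b", "b"]
s, b = selected_weights(labels, weights, predictions)
score = ams(s, b)
```

A higher AMS means the selected region separates signal from background more convincingly, which is why the metric rewards classifiers that keep heavily weighted background events out of the signal selection.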
We will guide participants through the complete analysis pipeline using MLlib, Spark's built-in machine learning library, starting with data preparation and feature selection and ending with model evaluation techniques such as cross-validation.
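As a rough illustration of the cross-validation step, the sketch below shows k-fold splitting in plain Python, deliberately without Spark so that the idea is visible on its own. The function names `k_fold_indices` and `cross_validate` and the `fit_and_score` callback are hypothetical. In MLlib, the `CrossValidator` class in `pyspark.ml.tuning` handles this splitting and averaging through its `numFolds` parameter.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average the score returned by fit_and_score over all k folds.

    fit_and_score(training, validation) is a caller-supplied function that
    trains on the training indices and returns a score on the validation ones.
    """
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the validation fraction of each split.
mean_score = cross_validate(10, 5, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Each sample lands in exactly one validation fold, so every event is used for evaluation once and for training k - 1 times, which gives a less optimistic estimate than scoring on the training data.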