21-25 August 2017
University of Washington, Seattle
US/Pacific timezone

Exploiting Apache Spark platform for CMS computing analytics

22 Aug 2017, 15:40
20m
Auditorium (Alder Hall)

Oral, Track 1: Computing Technology for Physics Research

Speaker

Marco Meoni (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)

Description

CERN IT provides a set of Hadoop clusters featuring more than 5 PB of raw storage, with a range of open-source user-level tools installed for analytics purposes. For this reason, since early 2015 the CMS experiment has been storing a large set of computing metadata there, including a massive number of dataset access logs. Several streamers have registered billions of traces from heterogeneous providers. These trace logs represent a valuable yet scarcely investigated set of information that needs to be cleansed, categorized and correlated; in the case of the CMS dataset access information, this work may lead to the discovery of useful patterns that enhance the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. This work presents an evaluation of the Apache Spark platform for CMS needs. We demonstrate, through a few use cases, how to efficiently process metadata stored on the CERN HDFS system in a scalable manner, using a variety of programming languages. Among them, Scala and Python offer the best approach to CMS use cases, both for executing extremely I/O-intensive queries that leverage Spark's in-memory and persistence APIs and for assessing streamlined predictive models that can learn dataset properties using machine learning approaches.
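As a rough illustration of the "cleanse, categorize and correlate" step described above, the sketch below aggregates dataset access traces per dataset in plain Python. It is not the authors' code: the record fields (`dataset`, `site`, `cpu_seconds`), the sample values and both function names are assumptions made for this example, and the real trace schema is not given in the abstract. In Spark, the same shape of query would typically be expressed as a `filter` followed by `groupBy(...).agg(...)` on a DataFrame read from HDFS.

```python
from collections import defaultdict

# Hypothetical access traces: field names and values are illustrative only,
# not the actual CMS trace schema.
traces = [
    {"dataset": "/sample_a/run_x/AOD", "site": "site_1", "cpu_seconds": 120.0},
    {"dataset": "/sample_a/run_x/AOD", "site": "site_2", "cpu_seconds": 80.0},
    {"dataset": "/sample_b/run_y/MINIAOD", "site": "site_1", "cpu_seconds": 40.0},
    {"dataset": None, "site": "site_1", "cpu_seconds": 5.0},  # malformed record
]


def summarize_accesses(records):
    """Drop malformed records, then count accesses and sum CPU time per dataset."""
    summary = defaultdict(
        lambda: {"accesses": 0, "cpu_seconds": 0.0, "sites": set()}
    )
    for rec in records:
        # Cleansing: skip traces that carry no dataset name.
        if not rec.get("dataset"):
            continue
        entry = summary[rec["dataset"]]
        entry["accesses"] += 1
        entry["cpu_seconds"] += rec["cpu_seconds"]
        entry["sites"].add(rec["site"])
    return dict(summary)


def rank_by_popularity(summary):
    """Return dataset names ordered from most to least accessed."""
    return sorted(summary, key=lambda name: summary[name]["accesses"], reverse=True)


summary = summarize_accesses(traces)
ranking = rank_by_popularity(summary)
```

Per-dataset counts of this kind are one plausible input for the popularity-style features that a predictive model could learn from; which features the authors actually used is not stated here.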

Primary authors

Prof. Daniele Bonacorsi (University of Bologna)
Valentin Y Kuznetsov (Cornell University (US))
Tommaso Boccali (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
Marco Meoni (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
Luca Menichetti (CERN)
