14–18 Oct 2013
Amsterdam, Beurs van Berlage
Europe/Amsterdam timezone

Running a typical ROOT HEP analysis on Hadoop/MapReduce

17 Oct 2013, 14:14
22m
Graanbeurszaal (Amsterdam, Beurs van Berlage)

Oral presentation to parallel session Distributed Processing and Data Handling A: Infrastructure, Sites, and Virtualization

Speaker

Mr Stefano Alberto Russo (Universita degli Studi di Udine (IT))

Description

Hadoop/MapReduce is a widely used and well-supported distributed computing framework. It consists of a scalable programming model, MapReduce, and a locality-aware distributed file system, HDFS. Its defining feature is data locality: by combining computing and storage resources on the same nodes and exploiting the locality-awareness of HDFS, computation can be scheduled on the nodes where the data resides, thereby avoiding network bottlenecks and congestion. This makes a Hadoop cluster easy to scale out; together with its wide adoption and support, it considerably simplifies the management and processing of very large amounts of data.

The main difference between Hadoop's original use case and High Energy Physics (HEP) analyses is that the former processes plain-text files with simple programs, while in the latter the data is highly structured and complex programs are required to access the physics information and perform the analysis. We investigated how a typical HEP analysis can be performed with Hadoop/MapReduce in a way that is completely transparent to ROOT, to the data and to the user. The method we propose relies on a "conceptual middleware" that allows ROOT to run without any modification, stores the data on HDFS in its original format, and lets the user interact with Hadoop/MapReduce in a classical way that requires no specific knowledge of this model. Hadoop's features, and in particular data locality, are fully exploited. The resulting workflow and solutions can be easily adopted for almost any HEP analysis code, and more generally for any complex code that works on binary data and can be split into independent sub-problems.

The proposed approach has been tested on a real case: an analysis measuring the top quark pair production cross section with the ATLAS experiment. The test worked as expected, reducing the network transfers required to perform the analysis by several orders of magnitude with respect to a classic computing model.
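To illustrate the general idea of wrapping an unmodified ROOT analysis in a MapReduce job, the following is a minimal sketch of a Hadoop Streaming-style mapper in Python. It is not the authors' middleware: the macro name analysis.C, the convention that each input line is the path of one ROOT file, and the assumption that the macro prints a partial result to stdout are all illustrative assumptions.

#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper: one ROOT file per input line.

Illustrative sketch only; macro name and output convention are assumed,
not taken from the abstract.
"""
import subprocess
import sys

for line in sys.stdin:
    rootfile = line.strip()
    if not rootfile:
        continue
    # Run the unmodified ROOT analysis macro on this single input file.
    # The (hypothetical) macro analysis.C is assumed to print its partial
    # result, e.g. selected event counts, to stdout.
    result = subprocess.run(
        ["root", "-l", "-b", "-q", 'analysis.C("%s")' % rootfile],
        capture_output=True, text=True, check=True,
    )
    # Emit one key/value pair per file so a reducer can merge partial results.
    sys.stdout.write("partial\t%s\n" % result.stdout.strip().replace("\n", " "))

A mapper like this could be submitted through the standard Hadoop Streaming interface with a list of HDFS file paths as job input and a trivial reducer that merges the per-file outputs. Note that a plain list-of-paths input does not by itself give the data locality emphasized in the abstract; that would require a locality-aware input format, which is beyond this sketch.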

Primary author

Mr Stefano Alberto Russo (Universita degli Studi di Udine (IT))

Co-authors

Marina Cobal (Universita degli Studi di Udine (IT)), Michele Pinamonti (Universita degli Studi di Udine (IT))

Presentation materials