1–5 Sept 2014
Faculty of Civil Engineering
Europe/Prague timezone

The ALICE analysis train system

1 Sept 2014, 17:25
25m
C217 (Faculty of Civil Engineering)

Faculty of Civil Engineering, Czech Technical University in Prague Thakurova 7/2077 Prague 166 29 Czech Republic
Oral, Computing Technology for Physics Research

Speaker

Markus Bernhard Zimmermann (CERN and Westfaelische Wilhelms-Universitaet Muenster (DE))

Description

To cope with the large data volumes recorded at the LHC (around 10 PB per year), analysis within ALICE is carried out by hundreds of users on a GRID system. High throughput and short turnaround times are achieved by a centralized system called the 'LEGO' trains. This system combines the analyses of different users into so-called analysis trains, which are then executed within the same GRID jobs, reducing the number of times the data must be read from the storage systems. To prevent a single failing analysis from jeopardizing the results of all users within a train, an automated testing procedure has been developed: each analysis is tested separately for functionality and performance before it is allowed to be submitted to the GRID. The analysis train system steers the job management and the merging of the output, and sends notifications to the users. Clear advantages of such a centralized system are improved performance, good usability, and bookkeeping, which is important for the reproducibility of the results.

The train system builds upon existing ALICE tools, i.e. the analysis framework as well as the GRID submission and monitoring infrastructure. Its entry point is a web interface through which the analyses and the desired datasets are configured and the trains are tested and submitted. While the analysis configuration is done directly by the users, datasets and train submission are controlled by a smaller group of operators.

The analysis train system has been operational since early 2012 and has quickly gained popularity, with usage increasing continuously. Throughout 2013, about 4800 trains were submitted, consuming about 2600 CPU years while analyzing 75 PB of data. This constitutes about 57% of the resources consumed for analysis in ALICE in 2013.

Within the GRID environment, where the availability of resources changes by nature, achieving a fast turnaround time has been very challenging. Various measures have been implemented, e.g. to speed up the merging process and to prevent a few problematic GRID jobs from stalling the completion of a train.

The talk will introduce the analysis train system, which has become very important for daily analysis within ALICE. It will then focus on bottlenecks that have been identified and addressed by dedicated improvements. Finally, the lessons learned in setting up an organized analysis system for a user group numbering in the hundreds will be discussed.
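
The train idea described above can be illustrated with a minimal Python sketch. It is not the ALICE implementation: the names `Wagon`, `passes_test` and `run_train`, the dictionary-based events and the "raises an exception" failure criterion are all assumptions made for illustration. The sketch only shows the two principles from the abstract: each analysis is tested on its own before it may join the train, and the accepted analyses then share a single pass over the data.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Dict[str, float]           # hypothetical stand-in for one recorded event
Wagon = Callable[[Event], float]   # hypothetical: one user's analysis task


def passes_test(wagon: Wagon, sample: Iterable[Event]) -> bool:
    """Run a single wagon alone on a small sample; reject it if it raises."""
    try:
        for event in sample:
            wagon(event)
    except Exception:
        return False
    return True


def run_train(
    wagons: Dict[str, Wagon], events: List[Event], sample_size: int = 2
) -> Tuple[Dict[str, List[float]], int]:
    """Test each wagon separately, then run the accepted ones in one data pass."""
    sample = events[:sample_size]
    accepted = {name: w for name, w in wagons.items() if passes_test(w, sample)}
    results: Dict[str, List[float]] = {name: [] for name in accepted}
    reads = 0
    for event in events:                      # each event is read once ...
        reads += 1
        for name, wagon in accepted.items():  # ... and shared by all wagons
            results[name].append(wagon(event))
    return results, reads


# Illustrative data and analyses (assumptions, not real ALICE quantities).
events = [
    {"pt": 1.0, "eta": 0.1},
    {"pt": 2.5, "eta": -0.4},
    {"pt": 0.5, "eta": 0.9},
]
wagons = {
    "pt_doubled": lambda e: 2 * e["pt"],
    "abs_eta": lambda e: abs(e["eta"]),
    "broken": lambda e: e["missing_key"],  # fails the separate pre-test
}
results, reads = run_train(wagons, events)
```

In this toy setup the two accepted analyses would need six event reads if run independently, whereas the train needs three, and the failing analysis is excluded before it can affect the others.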

Primary author

Markus Bernhard Zimmermann (CERN and Westfaelische Wilhelms-Universitaet Muenster (DE))
