Speaker
Mr
Tim Muenchen
(Bergische Universitaet Wuppertal)
Description
With the Large Hadron Collider (LHC) at CERN, Geneva, having begun operation in September, the large-scale LHC Computing Grid (LCG) is meant to process and store the large amounts of data produced in simulating, measuring and analyzing particle physics experiments. Data acquired by ATLAS, one of the four big experiments at the LHC, are analyzed by compute jobs that run on the grid and use the ATLAS software framework 'Athena'. The analysis algorithms themselves are written in C++ by the physicists using Athena and the ROOT toolkit.
Identifying the reason for a job failure (or even noticing that a failure occurred) in this context is a tedious, repetitive and, more often than not, unsuccessful task. Often, failures in the RUNNING stage (as opposed to job submission failures or compilation errors in the user algorithms) are dealt with by simply resubmitting the job. Debugging such problems is made even more difficult by the fact that the output sandbox, which contains the job's output and error logs, is discarded by the grid middleware if the job failed, so valuable information that could aid in finding the failure reason is lost. These issues result in high job failure rates and less than optimal resource usage.
As part of the High Energy Particle Physics Community Grid project (HEPCG) of the German D-Grid Initiative, the University of Wuppertal has developed the Job Execution Monitor (JEM). JEM helps to find the reasons for job failures in two ways: it periodically provides vital worker node system data, and it collects job run-time monitoring data. To gather this data, the user job is executed line by line under supervision. JEM thus provides new possibilities to find problems in large distributed computing grids and to analyze them in nearly real time.
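The idea of supervised line-by-line execution can be illustrated with a minimal Python sketch. This is not JEM's implementation; it only assumes that a job can be expressed as a Python callable and uses the standard `sys.settrace` hook to record each executed line and any exception raised. The names `sample_system`, `run_supervised` and `user_job` are hypothetical, and the system data is reduced to a single snapshot before and after the run rather than true periodic sampling.

```python
import shutil
import sys
import time


def sample_system(path="."):
    """Take one snapshot of node state (illustrative choice of metrics)."""
    usage = shutil.disk_usage(path)
    return {"time": time.time(), "disk_free_bytes": usage.free}


def run_supervised(func, *args, **kwargs):
    """Run func, recording every executed line and any raised exception."""
    events = []

    def tracer(frame, event, arg):
        name = frame.f_code.co_name
        if event == "line":
            events.append(("line", name, frame.f_lineno))
        elif event == "exception":
            exc_type = arg[0]
            events.append(("exception", name, frame.f_lineno, exc_type.__name__))
        return tracer  # keep tracing inside this frame

    snapshots = [sample_system()]
    previous = sys.gettrace()
    sys.settrace(tracer)
    result, error = None, None
    try:
        result = func(*args, **kwargs)
    except Exception as exc:  # keep the failure instead of losing it
        error = exc
    finally:
        sys.settrace(previous)
    snapshots.append(sample_system())
    return {"result": result, "error": error,
            "events": events, "system": snapshots}


def user_job(values):
    """Hypothetical user job that fails on a zero input."""
    total = 0.0
    for v in values:
        total += 1 / v
    return total


report = run_supervised(user_job, [1, 0])
print(report["error"], report["events"][-1])
```

In this sketch the last recorded event points at the line on which the job failed, which is the kind of information that is lost when only the job's exit status survives.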
All monitored information is presented to the user almost instantaneously and is additionally stored in the job's output sandbox for further analysis. As a first step, JEM has been seamlessly integrated into ATLAS' and LHCb's grid user interface 'ganga'. In this way, submitted jobs are monitored transparently, requiring no additional effort from the user.
In this work, the functionality of and the concepts behind JEM are presented, together with examples of typical problems that are easily discovered. Furthermore, we present ongoing work on classifying problems automatically using expert systems.
Author
Mr
Tim Muenchen
(Bergische Universitaet Wuppertal)
Co-authors
Prof.
Erich Ehses
(University of applied sciences, Koeln)
Mr
Markus Mechtel
(University of Wuppertal)
Mr
Martin Rau
(University of applied sciences, Koeln)
Prof.
Nikolaus Wulff
(University of applied sciences, Muenster)
Mr
Peer Ueberholz
(Hochschule Niederrhein)
Prof.
Peter Maettig
(University of Wuppertal)
Dr
Torsten Harenberg
(University of Wuppertal)