BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CERN//INDICO//EN
BEGIN:VEVENT
SUMMARY:Job Centric Monitoring for ATLAS jobs in the LHC Computing Grid
DTSTART;VALUE=DATE-TIME:20081104T165000Z
DTEND;VALUE=DATE-TIME:20081104T171500Z
DTSTAMP;VALUE=DATE-TIME:20130519T192336Z
UID:indico-contribution-190@cern.ch
DESCRIPTION:Speakers: Mr. MUENCHEN\, Tim (Bergische Universitaet Wuppertal
 )\nSince the Large Hadron Collider (LHC) at CERN\, Geneva\, began
  operation in September\, the large-scale computing grid LCG (LHC
  Computing Grid) is meant to process and store the large amount of data
  produced in the simulation\, measurement and analysis of particle
  physics experiments. Data acquired by ATLAS\, one of the four big
  experiments at the LHC\, are analyzed by compute jobs that run on the
  grid and use the ATLAS software framework 'Athena'. The analysis
  algorithms themselves are written in C++ by the physicists using Athena
  and the ROOT toolkit.\n\nIdentifying the reason for a job failure (or
  even the occurrence of the failure itself) in this context is a
  tedious\, repetitive and - more often than not - unsuccessful task.
  Failures in the RUNNING stage (as opposed to job submission failures
  or compilation errors in the user algorithms) are often handled by
  simply resubmitting the job. Debugging such problems is made even more
  difficult by the fact that the output sandbox\, which contains the
  job's output and error logs\, is discarded by the grid middleware if
  the job fails. As a result\, valuable information that could help in
  finding the reason for the failure is lost. These issues result in
  high job failure rates and less
  than optimal resource usage.\n\nAs part of the High Energy Particle
  Physics Community Grid project (HEPCG) of the German D-Grid Initiative\,
  the University of Wuppertal has developed the Job Execution Monitor
  (JEM). JEM helps to find the reasons for job failures in two ways: it
  periodically provides vital worker node system data and collects job
  run-time monitoring data. To gather this data\, the user job is
  executed line by line under supervision. JEM provides new ways to find
  problems in widely distributed computing grids and to analyze them in
  near real time.\n\nAll monitored information is presented to the user
  almost instantaneously and is also stored in the job's output sandbox
  for further analysis. As a first step\, JEM has been seamlessly
  integrated into ATLAS' and
  LHCb's grid user interface 'ganga'. In this way\, submitted jobs are
  monitored transparently\, with no additional effort required from the
  user.\n\nIn this work\, the functionality of JEM and the concepts
  behind it are presented\, together with examples of typical problems
  that are easily discovered. Furthermore\, we present ongoing work on
  classifying problems automatically
  using expert systems.\n\nhttp://indico.cern.ch/contributionDisplay.py?con
 tribId=190&sessionId=29&confId=34666
LOCATION:Ettore Majorana Foundation and Centre for Scientific Culture
URL:http://indico.cern.ch/contributionDisplay.py?contribId=190&sessionId=2
 9&confId=34666
END:VEVENT
END:VCALENDAR
