25–29 Sept 2006
CICG
Europe/Zurich timezone

Job reliability

26 Sept 2006, 14:00
5h 30m

CICG, 17 rue de Varembé, CH - 1211 Geneva 20 Switzerland
Board: 28
Poster Users & Applications Poster session

Speaker

Pablo Saiz (CERN)

Description

Errors are always frustrating, and even more so when their cause is unclear. The GRID is no exception: submitting a job and getting back an error is frustrating, and not knowing whether the error came from something you did, a middleware glitch, or a site problem makes it worse. Our goal was to tackle this problem. The first step is to understand the error messages reported back to users. We went through the most common error messages: first investigating the underlying problems; then categorizing them and, where possible, helping those responsible to fix them; and finally monitoring whether each error message disappeared. One common reason for job failures is site misconfiguration. Detecting such a misconfiguration as early as possible helps in several ways: it minimizes the time needed to bring the site back to a normal state, and it makes debugging easier, since the problem occurred in the recent past. In the next chapters we describe in more detail the study we carried out for some of the error messages, as well as the tools we created to monitor site efficiency.

Authors

Andrea Sciaba (CERN), Benjamin Gaidioz (CERN), Birger Koblitz (CERN), Hurng-Chun Lee (CERN), Juha Herrala (CERN), Julia Andreeva (CERN), Massimo Lamanna (CERN), Pablo Saiz (CERN)
