Speaker
Pablo Saiz
(CERN)
Description
Errors are always frustrating, and even more so when their cause is not clear. The GRID is no exception: submitting a job to the GRID and getting back an error is frustrating, and not knowing whether the error was due to something you did, a middleware glitch, or a site problem makes it even worse.
Our goal was to tackle this problem. The first step is to understand the different error messages reported back to users. We went through the most common error messages: first, investigating the underlying problems; then categorizing them and, where possible, helping the responsible party to fix them; and finally monitoring whether each error message disappeared.
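The categorization step could be sketched as follows. This is only an illustration: the three categories (user, middleware, site) come from the description above, while the regular expressions, the sample messages, and the function names are invented for the example and do not reflect the actual messages or tools studied.

```python
import re
from collections import Counter

# Hypothetical patterns: the categories mirror the three possible causes named
# in the text, but the expressions themselves are invented for illustration.
CATEGORY_PATTERNS = [
    ("user", re.compile(r"proxy expired|executable not found", re.IGNORECASE)),
    ("middleware", re.compile(r"broker|match-?making", re.IGNORECASE)),
    ("site", re.compile(r"disk full|batch system|cannot write", re.IGNORECASE)),
]


def categorize(message):
    """Return the first category whose pattern matches, else 'unknown'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            return category
    return "unknown"


def count_by_category(messages):
    """Count how many messages fall into each category."""
    return Counter(categorize(message) for message in messages)


# Invented sample messages, not taken from real job logs.
SAMPLE_MESSAGES = [
    "Proxy expired before job submission",
    "Resource broker did not respond",
    "Cannot write output: disk full on worker node",
    "Something unexpected happened",
]

if __name__ == "__main__":
    for category, count in sorted(count_by_category(SAMPLE_MESSAGES).items()):
        print(category, count)
```

Messages that land in the "unknown" bucket are the ones that would still need investigation, and re-running the count over time gives a simple way to see whether a given category is shrinking.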
One common reason for job failures is site misconfiguration. Detecting such a misconfiguration as soon as possible helps in several ways: first of all, it minimizes the time needed to bring the site back to a normal state; moreover, it makes debugging easier, since the problem occurred in the recent past.
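One simple way to spot a possibly misconfigured site early is to compute the fraction of successful jobs per site over a recent window and flag sites that fall below a threshold. The sketch below assumes job records are available as (site, succeeded) pairs; the record format, the threshold, the minimum job count, and the site names are all assumptions made for the example, not a description of the tools presented here.

```python
from collections import defaultdict


def site_efficiency(job_records):
    """Map each site to its fraction of successful jobs.

    job_records is an iterable of (site, succeeded) pairs, where succeeded
    is a bool. This record format is an assumption made for the example.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, succeeded in job_records:
        totals[site] += 1
        if succeeded:
            successes[site] += 1
    return {site: successes[site] / totals[site] for site in totals}


def suspicious_sites(job_records, threshold=0.5, min_jobs=3):
    """Return, sorted by name, the sites whose efficiency is below threshold.

    Sites with fewer than min_jobs records are skipped, so that a single
    failed job does not flag a site. Both defaults are arbitrary.
    """
    records = list(job_records)
    job_counts = defaultdict(int)
    for site, _ in records:
        job_counts[site] += 1
    efficiency = site_efficiency(records)
    return sorted(
        site
        for site, value in efficiency.items()
        if job_counts[site] >= min_jobs and value < threshold
    )


# Invented records for three hypothetical sites.
RECENT_JOBS = [
    ("SITE_A", True), ("SITE_A", True), ("SITE_A", True),
    ("SITE_B", False), ("SITE_B", False), ("SITE_B", True), ("SITE_B", False),
    ("SITE_C", False),
]

if __name__ == "__main__":
    print(site_efficiency(RECENT_JOBS))
    print(suspicious_sites(RECENT_JOBS))
```

Restricting the input to jobs from the last few hours keeps the signal tied to recent problems, which matches the point that a recent failure is easier to debug.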
In the following sections we describe in more detail our study of some of the error messages, as well as the tools that we created to monitor site efficiency.
Authors
Andrea Sciaba
(CERN)
Benjamin Gaidioz
(CERN)
Birger Koblitz
(CERN)
Hurng-Chun Lee
(CERN)
Juha Herrala
(CERN)
Julia Andreeva
(CERN)
Massimo Lamanna
(CERN)
Pablo Saiz
(CERN)