Speaker
Pablo Saiz
(CERN)
Description
Thanks to the grid, users have access to computing resources distributed all over the
world. The grid hides the complexity and the differences of its heterogeneous
components. In order for this to work, it is vital that all the elements are set up
properly, and that they can interact with each other. It is also very important that
errors are detected as soon as possible, and that the procedure to solve them is well
established.
Our goal is to improve the performance of the grid. In order to do this, we studied
two of its main elements: the workload and the data management systems. We developed
the tools needed to investigate the efficiency of the different centres. Furthermore,
our tools can be used to categorize the most common error messages and to measure how
they evolve over time.
One common reason for job failures is site misconfiguration. Being able to detect
such a misconfiguration as soon as possible helps in several ways: first of all, it
minimizes the time that it takes to bring the site back to a normal state; moreover,
debugging it is easier, since the problem happened in the recent past. This can be
especially helpful for new centres, since the tools provide the material needed to get
a better understanding of the grid's complexity.
In this contribution we will describe all the tools that we have developed to monitor
the grid efficiency. These tools are currently used by the four LHC experiments. We
will also describe the results and benefits that the tools have provided.
Authors
Benjamin Gaidioz
(CERN)
Gerhild Maier
(CERN)
Juha Herrala
(CERN)
Julia Andreeva
(CERN)
Pablo Saiz
(CERN)
Ricardo Rocha
(CERN)
Catalin Cirstoiu
(CERN)