Sep 2 – 9, 2007
Victoria, Canada
Europe/Zurich timezone
Please book accommodation as soon as possible.

Grid reliability

Sep 3, 2007, 2:00 PM
Carson Hall C (Victoria, Canada)


oral presentation Grid middleware and tools


Pablo Saiz (CERN)


Thanks to the grid, users have access to computing resources distributed all over the world. The grid hides the complexity and the differences of its heterogeneous components. For this to work, it is vital that all the elements are set up properly and that they can interact with each other. It is also very important that errors are detected as soon as possible and that the procedure to solve them is well established. Our goal is to improve the performance of the grid. To do this, we studied two of its main elements: the workload management system and the data management system. We developed the tools needed to investigate the efficiency of the different centres. Furthermore, our tools can be used to categorize the most common error messages and to measure how their frequency evolves over time. One common reason for job failures is site misconfiguration. Detecting such a misconfiguration as soon as possible helps in several ways: first, it minimizes the time needed to bring the site back to a normal state; moreover, debugging is easier, since the problem happened in the recent past. This can be especially helpful for new centres, since the tools provide the material needed to get a better understanding of the grid's complexity. In this contribution we will describe the tools that we have developed to monitor grid efficiency. These tools are currently used by the four LHC experiments. We will also describe the results and benefits that the tools have provided.
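
As a rough illustration of the two measurements named in the abstract (per-centre efficiency, and error messages categorized and followed over time), the following is a minimal sketch. It is not the authors' tooling: the record layout, the site names and the error strings are all hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical job records: (site, finish time, error message or None on success).
# Site names and error strings are invented for illustration only.
JOBS = [
    ("SiteA", "2007-09-01T10:05", None),
    ("SiteA", "2007-09-01T10:40", "Cannot read catalogue entry"),
    ("SiteA", "2007-09-01T11:10", None),
    ("SiteB", "2007-09-01T10:15", "Proxy expired"),
    ("SiteB", "2007-09-01T10:50", "Proxy expired"),
    ("SiteB", "2007-09-01T11:20", None),
]


def site_efficiency(jobs):
    """Return {site: fraction of jobs that finished without an error}."""
    total, ok = Counter(), Counter()
    for site, _, error in jobs:
        total[site] += 1
        if error is None:
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}


def errors_per_hour(jobs):
    """Return {hour: Counter of error messages} to follow their time evolution."""
    bins = defaultdict(Counter)
    for _, finished, error in jobs:
        if error is not None:
            # Truncate the timestamp to the start of its hour.
            hour = datetime.fromisoformat(finished).replace(minute=0)
            bins[hour][error] += 1
    return dict(bins)


efficiency = site_efficiency(JOBS)
evolution = errors_per_hour(JOBS)
```

In this sketch a site whose efficiency drops while a single error message dominates a recent time bin would be a candidate for a misconfiguration check; how the real tools make that decision is not stated in the abstract.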

Primary authors
