Mr Antonio Retico (CERN)
Grids have the potential to revolutionise computing by providing ubiquitous, on-demand access to computational services and resources, and by allowing services offered by multiple independent providers to be composed. Grids can also provide unprecedented levels of parallelism for high-performance applications. On the other hand, grid characteristics such as high heterogeneity, complexity and distribution create many new technical challenges. Among these, failure management is a key area that still demands much progress. A recent survey revealed that fault diagnosis remains a major problem for grid users. When a failure appears on the user's screen, it is very difficult for the user to identify whether the problem lies in the application being used, somewhere in the grid middleware, or even lower, in the fabric that comprises the grid. In this paper we present a tool that checks whether a given grid service works as expected for a given set of users (a Virtual Organisation, or VO) on the different resources available on a grid. Our solution treats grid services as single components that should produce an expected output for a pre-defined input, which is quite similar to unit testing. The tool, called Service Availability Monitoring (SAM), is currently used by several Virtual Organisations to monitor more than 300 grid sites belonging to the largest grids available today. We also discuss how some of those VOs use the tool and how it helps in the operation of the EGEE/WLCG grid.
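The unit-testing analogy can be illustrated with a minimal Python sketch. It is not SAM's actual implementation or interface: the names ServiceTest, run_test and fake_invoke, the site host names and the VO name are all hypothetical, and the grid service is replaced by a local stand-in function. The sketch only shows the idea of sending a pre-defined input to a service on each site, on behalf of one VO, and comparing the result with the expected output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServiceTest:
    """A pre-defined input and the output a healthy service should return."""
    name: str
    test_input: str
    expected_output: str


def run_test(
    test: ServiceTest,
    vo: str,
    sites: List[str],
    invoke: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Run one test against every site for one VO and record a status per site.

    `invoke(site, vo, test_input)` is a hypothetical callable standing in for
    whatever mechanism actually submits the request to the grid service.
    """
    results: Dict[str, str] = {}
    for site in sites:
        try:
            observed = invoke(site, vo, test.test_input)
        except Exception:
            # The service could not be reached or raised an error.
            results[site] = "error"
            continue
        results[site] = "ok" if observed == test.expected_output else "failed"
    return results


def fake_invoke(site: str, vo: str, payload: str) -> str:
    """Stand-in for a real grid service (assumption, for illustration only)."""
    if site == "site-b.example.org":
        return "unexpected"  # service answers, but with the wrong output
    if site == "site-c.example.org":
        raise ConnectionError("service unreachable")
    return payload.upper()  # healthy behaviour: upper-case the input


test = ServiceTest(name="echo-upper", test_input="hello", expected_output="HELLO")
sites = ["site-a.example.org", "site-b.example.org", "site-c.example.org"]
results = run_test(test, "demo-vo", sites, fake_invoke)

for site, status in results.items():
    print(f"{site}: {status}")
```

Distinguishing a wrong answer ("failed") from an unreachable service ("error") is a design choice of this sketch; it reflects the diagnostic problem described above, where the user needs to know at which layer a failure occurred.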