Speaker
Mr
Antonio Retico
(CERN)
Description
Grids have the potential to revolutionise computing by providing ubiquitous,
on-demand access to computational services and resources. They promise on-demand
access to, and composition of, computational services provided by multiple
independent sources. Grids can also provide unprecedented levels of parallelism for
high-performance applications. On the other hand, grid characteristics such as high
heterogeneity, complexity and distribution create many new technical challenges.
Among these technical challenges, failure management is a key area that demands much
progress. A recent survey revealed that fault diagnosis is still a major problem for
grid users. When a failure appears on the user's screen, it is very difficult for
her to identify whether the problem lies in the application being used, somewhere in
the grid middleware, or even lower, in the fabric that comprises the grid.
In this paper we present a tool that checks whether a given grid service works as
expected for a given set of users (a Virtual Organisation, or VO) on the different
resources available on a grid. Our solution treats grid services as single components
that should produce an expected output for a pre-defined input, which is quite similar
to unit testing. The tool, called Service Availability Monitoring (SAM), is currently
used by several Virtual Organisations to monitor more than 300 grid sites belonging
to the largest grids available today. We also discuss how some of those VOs use the
tool and how it helps in the operation of the EGEE/WLCG grid.
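
The unit-testing analogy above can be illustrated with a minimal Python sketch. It is not SAM's actual code or interface: the names ServiceCheck, run_checks and fake_probe, the status strings, and the example.org host names are all hypothetical, chosen only to show the idea of sending a pre-defined input to a service on each resource and comparing the result with the expected output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServiceCheck:
    """One unit-test-like check: a fixed input and the output it should produce."""
    name: str
    request: str
    expected: str


# Hypothetical probe signature: (resource, vo, request) -> raw service output.
Probe = Callable[[str, str, str], str]


def run_checks(probe: Probe, resources: List[str], vo: str,
               checks: List[ServiceCheck]) -> Dict[str, Dict[str, str]]:
    """Run every check on every resource for one VO and record a status."""
    report: Dict[str, Dict[str, str]] = {}
    for resource in resources:
        statuses: Dict[str, str] = {}
        for check in checks:
            try:
                output = probe(resource, vo, check.request)
            except Exception:
                # The service could not be exercised at all.
                statuses[check.name] = "error"
                continue
            # The service answered: compare with the expected output.
            statuses[check.name] = "ok" if output == check.expected else "mismatch"
        report[resource] = statuses
    return report


def fake_probe(resource: str, vo: str, request: str) -> str:
    """Stand-in for a real service call (assumption: illustration only)."""
    if resource == "site-b.example.org":
        raise ConnectionError("service unreachable")
    if resource == "site-c.example.org":
        return request  # wrong answer: input echoed back unchanged
    return request.upper()


if __name__ == "__main__":
    checks = [ServiceCheck("echo-upper", "ping", "PING")]
    sites = ["site-a.example.org", "site-b.example.org", "site-c.example.org"]
    for site, statuses in run_checks(fake_probe, sites, "example-vo", checks).items():
        print(site, statuses)
```

Keeping "error" (the service could not be reached) separate from "mismatch" (the service answered incorrectly) is a design choice of this sketch; it hints at how a per-service result can help narrow down where a failure lies.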
Primary authors
Mr
Alexandre Duarte
(CERN/Federal University of Campina Grande)
Mr
Antonio Retico
(CERN)
Mr
Domenico Vicinanza
(CERN/University of Salerno)
Mr
Piotr Nyczyk
(CERN)