Speaker
Dr Sinisa Veseli (Fermilab)
Description
SAMGrid presently relies on a centralized database to provide several services
vital to system operation. These services are all encapsulated in the SAMGrid
Database Server and include access to file metadata and replica catalogs, dataset
and processing bookkeeping, and runtime support for the SAMGrid station
services. Access to the centralized database and DB Servers represents a single point
of failure in the system and limits its scalability.
To address this issue, we have created a prototype of a peer-to-peer
information service. It allows the system to keep operating when the central DB
is unavailable for any reason (e.g., network failures or scheduled downtimes),
and it improves system performance during periods of extremely high load, when
central DB access is slow and/or has a high failure rate. Our prototype uses
Distributed Hash Tables to create a fault-tolerant and self-healing service. We
believe that this is the first peer-to-peer information service designed to
become part of an in-use grid system.
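As a rough illustration of the general idea behind a replicated Distributed Hash Table (not the prototype's actual design, which is not detailed here), the sketch below places peers on a hash ring and stores each record on the next few live peers clockwise from the key's position. The class name ToyDHT, the peer names, the key format, and the replication factor are all hypothetical; a real service would also re-replicate data after a failure, which this sketch omits.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the hash ring (SHA-1, 160 bits)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    """In-process toy DHT: hypothetical, for illustration only."""

    def __init__(self, peer_names, replicas=2):
        self.replicas = replicas
        # Peers sorted by their position on the ring.
        self.ring = sorted((ring_position(n), n) for n in peer_names)
        self.stores = {n: {} for n in peer_names}
        self.alive = set(peer_names)

    def _successors(self, key):
        """Yield all peers clockwise, starting at the key's position."""
        start = bisect.bisect_left(self.ring, (ring_position(key), ""))
        for i in range(len(self.ring)):
            yield self.ring[(start + i) % len(self.ring)][1]

    def put(self, key, value):
        """Store the value on the first `replicas` live successors."""
        live = [p for p in self._successors(key) if p in self.alive]
        for peer in live[: self.replicas]:
            self.stores[peer][key] = value

    def get(self, key):
        """Return the value from the first live peer that holds it."""
        for peer in self._successors(key):
            if peer in self.alive and key in self.stores[peer]:
                return self.stores[peer][key]
        return None

    def fail(self, peer):
        """Simulate a peer becoming unreachable."""
        self.alive.discard(peer)


# Hypothetical usage: a metadata record survives the loss of its first owner.
dht = ToyDHT(["station-a", "station-b", "station-c", "station-d"])
dht.put("file:example.raw", {"size_bytes": 1024})
dht.fail(next(dht._successors("file:example.raw")))
record = dht.get("file:example.raw")
```

Because every record is written to more than one peer, a lookup can still be answered after a single peer fails, which is the property that lets such a service stand in for an unreachable central database.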
We describe the prototype architecture and its existing and planned
functionality, and show how it can be integrated into the SAMGrid system. We
also present a performance study of the new service under a range of conditions.
Our results demonstrate the feasibility and usefulness of the proposed
architecture.
Authors
Dr Matthew Leslie (Oxford University)
Dr Sinisa Veseli (Fermilab)