1–3 Mar 2006
CERN
Europe/Zurich timezone

A service to update and replicate biological databases

1 Mar 2006, 15:15
15m
40-SS-C01 (CERN)

Oral contribution Life Science 1a: Life Sciences

Speaker

Mr Jean Salzemann (IN2P3/CNRS)

Description

One of the main challenges in molecular biology is the management of data and databases. A large fraction of the biological data produced is publicly available on web sites or via FTP. These public databases are internationally known and play a key role in the majority of public and private research, but their exponential growth raises a usage problem: scientists need easy access to the latest version of each database in order to apply bioinformatics or data mining algorithms. Frequent, regular updating of the databases is a recurring issue for all host and mirror centres, and also for scientists who use the databases locally for confidentiality reasons.
We propose a solution for updating these distributed databases. It takes the form of a service embedded in the grid, which uses the grid's own mechanisms and performs updates automatically. We have developed a set of web services that rely on the grid to manage this task, with the aim of deploying them under any grid middleware with a minimum of adaptation. They include a client/server application with a set of rules and a protocol to update a database from a given repository and distribute the update across the grid storage elements. The design goals are to optimize network bandwidth use, file transfer size and fault tolerance, and to offer a transparent, automated service that requires no user intervention.
These goals summarize the challenges of updating databases in a grid environment. Our solution is to define two types of storage on the grid storage elements: reference storage, where the update is first performed, and working storage, where jobs pick up the data. The update is then replicated across the grid from the reference points to the working storage elements.
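The two-tier storage idea can be illustrated with a minimal Python sketch. It is not the service's actual protocol: the `Storage` class is a hypothetical in-memory stand-in for a grid storage element, and comparing SHA-256 digests to transfer only changed files is an assumed way of limiting transfer size.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class Storage:
    """Hypothetical in-memory stand-in for a grid storage element."""
    name: str
    files: dict[str, bytes] = field(default_factory=dict)

    def manifest(self) -> dict[str, str]:
        # Map each file path to a digest of its content.
        return {p: hashlib.sha256(d).hexdigest() for p, d in self.files.items()}


def plan_transfer(reference: Storage, working: Storage) -> tuple[list[str], list[str]]:
    """Return (paths to copy, paths to delete) to bring `working` up to date."""
    ref, work = reference.manifest(), working.manifest()
    to_copy = sorted(p for p, h in ref.items() if work.get(p) != h)
    to_delete = sorted(p for p in work if p not in ref)
    return to_copy, to_delete


def replicate(reference: Storage, working_storages: list[Storage]) -> dict[str, int]:
    """Propagate the reference state; report files copied per working storage."""
    report = {}
    for working in working_storages:
        to_copy, to_delete = plan_transfer(reference, working)
        for path in to_copy:
            working.files[path] = reference.files[path]
        for path in to_delete:
            del working.files[path]
        report[working.name] = len(to_copy)
    return report
```

In this sketch the update is applied to the reference storage first, and `replicate` then pushes only the files whose digests differ. A real deployment would replace the in-memory copies with grid file transfers and add retries for fault tolerance.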
From the service point of view, the grid information system must be able to locate the sites that host a given database, so that the service can benefit from dynamic database replication and location. From the user point of view, location information must be available for each database in order to achieve scalability and to find replicas on the grid. This means keeping a metadata record for each database that refers to several physical locations on the grid and carries additional descriptive information, because a replica is not a single file but a whole database comprising several files and/or directories.
The service is being deployed on two French grid infrastructures: RUGBI (based on Globus Toolkit 4) and Auvergrid (based on EGEE). We plan a future deployment on EGEE itself, especially in the Biomed VO. The real issue is that the service needs to be deployed and managed as a grid service, so some members of the VO should be able to deploy and administer it alongside the site administrators, a role that runs up against the limits of current VO management. The service is meant to be embedded in the grid, not merely an application layered on top of it. It could eventually be offered as an application, but its use would then be neither mandatory nor automated, and it would lose its benefits and transparency, since users would have to invoke the service explicitly in their workflows.
There are also plans to optimize the deployment of the databases: for example, splitting a database so that each part is stored on a different storage element, or allowing several reference storages per database, which would require synchronizing them with one another. The service will mature through its deployment on grid middleware and is expected to improve as it is used in production environments.
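The per-database metadata described above, one logical database mapped to several physical locations, can be sketched as a small Python catalogue. All names here (`Replica`, `DatabaseRecord`, `Catalogue`, the `example.org` hosts, the `example-db` database) are hypothetical illustrations, not part of the service or of any grid middleware API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    storage_element: str        # hypothetical storage element host name
    base_path: str              # root directory of the database copy
    is_reference: bool = False  # True for reference storage, False for working


@dataclass
class DatabaseRecord:
    """Metadata for one logical database: a set of files, many locations."""
    name: str
    version: str
    files: list[str]            # files and/or directories, relative to base_path
    replicas: list[Replica] = field(default_factory=list)


class Catalogue:
    """Maps a logical database name to its metadata record."""

    def __init__(self) -> None:
        self._records: dict[str, DatabaseRecord] = {}

    def register(self, record: DatabaseRecord) -> None:
        self._records[record.name] = record

    def locate(self, name: str) -> list[str]:
        # Prefer working replicas; fall back to reference storage if none exist.
        record = self._records[name]
        working = [r for r in record.replicas if not r.is_reference]
        chosen = working or record.replicas
        return [f"{r.storage_element}:{r.base_path}" for r in chosen]
```

A job would call `locate("example-db")` and receive the working-storage locations for the whole database rather than resolving each file separately, which is the reason the record describes a set of files and directories instead of a single file.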

Primary author

Mr Jean Salzemann (IN2P3/CNRS)
