Speaker
Mr Jean Salzemann (IN2P3/CNRS)
Description
One of the main challenges in molecular biology is the management of data and
databases. A large fraction of the biological data produced is publicly available on
web sites or via FTP. These public databases are internationally known and
play a key role in the majority of public and private research, but their exponential
growth raises a usability problem. Scientists need easy access to the latest
update of the databases in order to apply bioinformatics or data mining algorithms.
Frequent and regular updating of the databases is a recurrent issue for all host or
mirror centres, and also for scientists who use the databases locally for
confidentiality reasons.
We propose a solution for updating these distributed databases. The solution
comes as a service embedded in the grid, which uses the grid's own mechanisms and
performs updates automatically. We developed a set of web services that rely on the
grid to manage this task, with the aim of deploying the services under any grid
middleware with a minimum of adaptation. This includes a client/server application
with a set of rules and a protocol to update a database from a given repository and
distribute the update through the grid storage elements, while trying to optimize
network bandwidth, file transfer size and fault tolerance, and finally to offer a
transparent, automated service that does not require user intervention. These are
the challenges of database updates in a grid environment. The solution we propose is
basically to define two types of storage on the grid storage elements: reference
storage, where the update is first performed, and working storage spaces, from which
the jobs pick up the information. The idea is to replicate the update on the grid
from these reference points to the storage elements. From the service point of view,
the grid information system must be able to locate the sites that host a given
database in order to benefit from dynamic database replication and location. From
the user point of view, location information must be available for each database in
order to achieve scalability and to find replicas on the grid. This means having
metadata for each database that can refer to several physical locations on the grid
and carry certain additional information as well, because a replica is not a single
file but a whole database made of several files and/or directories.
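The two storage types and the per-database metadata described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the service's actual data model or protocol: the class names, the `version` field, the host names and the paths are all hypothetical, and the real service would trigger grid file transfers where this sketch only updates the metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """One physical copy of a whole database (several files and/or directories)."""
    storage_element: str  # hypothetical storage element host name
    path: str             # hypothetical directory holding the database files
    version: str          # hypothetical release tag of the copy stored here


@dataclass
class DatabaseRecord:
    """Metadata entry: one logical database name, several physical locations."""
    name: str
    reference: Replica                                    # where the update is first performed
    working: List[Replica] = field(default_factory=list)  # storage spaces read by jobs

    def stale_replicas(self) -> List[Replica]:
        """Working replicas whose version differs from the reference copy."""
        return [r for r in self.working if r.version != self.reference.version]

    def propagate(self) -> List[str]:
        """Bring stale working replicas to the reference version.

        Assumption: a real service would copy files between storage elements
        here; this sketch only records the new version and returns the
        storage elements that were touched.
        """
        updated = []
        for replica in self.stale_replicas():
            replica.version = self.reference.version
            updated.append(replica.storage_element)
        return updated

    def locations(self) -> List[str]:
        """Up-to-date working locations that a job could read from."""
        return [
            f"{r.storage_element}:{r.path}"
            for r in self.working
            if r.version == self.reference.version
        ]
```

A job would resolve the logical database name to one of the entries returned by `locations()`, while the update service would call `propagate()` after refreshing the reference copy from the source repository.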
This service is being deployed on two French grid infrastructures: RUGBI (based on
Globus Toolkit 4) and Auvergrid (based on EGEE). We therefore plan a future
deployment of this service on EGEE, especially in the Biomed VO. The real issue is
that the service needs to be deployed and managed as a grid service, so some people
from the VO should be able to deploy and administer the service alongside the site
administrators, a role that reaches the limits of current VO management. The service
is meant to be embedded in the grid, not just an application layered on top of it.
It will eventually be possible to offer this service as an application, but its use
would then be neither mandatory nor automated, which means losing its benefits and
transparency, since users would need to specify the use of the service in their
workflows. There are also plans to optimise the deployment of the databases: for
example, splitting databases to store each part on a different storage element, or
offering several reference storages per database, which would require synchronizing
these storages with each other. The service will mature through its deployment on
grid middleware and should improve as it is used in production environments.
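As an illustration of the planned splitting of a database across storage elements, here is a minimal Python sketch. The abstract does not say how parts would be assigned, so the size-balancing rule (largest file first, to the least-loaded storage element), the function name and all inputs are assumptions made for the example.

```python
from typing import Dict, List


def split_database(file_sizes: Dict[str, int],
                   storage_elements: List[str]) -> Dict[str, List[str]]:
    """Assign each database file to a storage element, balancing total size.

    Assumed rule: files are placed largest first, each on the storage element
    with the smallest load so far (ties go to the earliest element in the list).
    """
    if not storage_elements:
        raise ValueError("at least one storage element is required")

    load = {se: 0 for se in storage_elements}
    placement: Dict[str, List[str]] = {se: [] for se in storage_elements}

    # Sort by decreasing size, then by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(storage_elements, key=lambda se: load[se])
        placement[target].append(name)
        load[target] += size
    return placement
```

The resulting mapping would have to be recorded in the database metadata, so that jobs can find every part of a split database.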
Primary author
Mr Jean Salzemann (IN2P3/CNRS)