Speaker
Dr Ivan Porro (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Description
In bioinformatics laboratory research, measurements from experiments can vary dramatically in accuracy and reproducibility, forcing researchers to design experiments with more biological replicates. Statistical processing systems can overcome this problem by widening the amount of data they are able to consider, but cost remains a strong limit on the size of experiments. As a more general solution, similar data may be collected across several acquisition facilities; however, to reproduce or compare different experimental setups, the side conditions associated with experiments must be accurately tracked. Moreover, end users may be offered different analysis algorithms by different providers, and search tools may be needed to find data and applications. Finally, data and experimental setups, as well as results from experiments, should be collected through a user-friendly web interface.
Starting from these considerations, we decided to implement a Grid-based data storage and management system for data concerning bioinformatics experiments. A Grid-service-based approach may provide a shared, standardized and reliable solution for storage and analysis of the above-mentioned biological data. Moreover, a Grid portal may allow inexperienced users to store their experimental data on a complex storage system and to access distributed data and services. Instead of developing a database, data and experiment annotations can be stored using metadata management tools, which provide high flexibility and ensure that experiments can be replicated for biological research activities. Security and privacy issues can be addressed using the certificate-based authentication scheme that Grid technology provides at no extra cost, and sensitive data can be federated or accessed either without moving them or through volatile copies.
The described environment relies on storage services (with replication and catalog services) provided by the gLite Grid middleware. Through the AMGA metadata catalog, gLite is able to exploit the added value of metadata to let users better classify and search experiments. The key feature of our solution is that data files can be searched and accessed just by providing their descriptive metadata. Several keywords (metadata fields) are associated with each data file, and the metadata catalog collects these high-level descriptions. Files are physically stored on the Grid, and the metadata catalog also holds the information needed to access them through their logical file names, without regard to the underlying filesystem structure. This way, files can be replicated across disks for greater reliability while the file collection remains consistent.
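The relationship between descriptive metadata, logical file names and physical replicas can be illustrated with a minimal in-memory sketch. It is not the AMGA or gLite API: the `MetadataCatalog` class, its methods, the metadata fields and the example file names and storage URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data file as seen by the (hypothetical) metadata catalog."""
    lfn: str                 # logical file name, independent of physical location
    metadata: dict           # keyword -> value description of the file
    replicas: list = field(default_factory=list)  # physical copies on storage


class MetadataCatalog:
    """In-memory stand-in for a metadata catalog; not the AMGA API."""

    def __init__(self):
        self._entries = {}

    def register(self, lfn, metadata, replica):
        """Record a file under its logical name with one physical copy."""
        self._entries[lfn] = CatalogEntry(lfn, dict(metadata), [replica])

    def add_replica(self, lfn, replica):
        """Add a physical copy; the logical name and metadata do not change."""
        self._entries[lfn].replicas.append(replica)

    def search(self, **criteria):
        """Return the logical names of files whose metadata match every criterion."""
        return sorted(
            lfn for lfn, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )

    def replicas(self, lfn):
        """Return the physical locations known for a logical file name."""
        return list(self._entries[lfn].replicas)


# Illustrative data only: names, fields and URLs are assumptions.
catalog = MetadataCatalog()
catalog.register("lfn:/grid/bio/exp001/run1.dat",
                 {"organism": "mouse", "technique": "microarray"},
                 "srm://storage-a.example.org/data/0001")
catalog.register("lfn:/grid/bio/exp001/run2.dat",
                 {"organism": "mouse", "technique": "imaging"},
                 "srm://storage-a.example.org/data/0002")
catalog.register("lfn:/grid/bio/exp002/run1.dat",
                 {"organism": "human", "technique": "microarray"},
                 "srm://storage-b.example.org/data/0003")

mouse_files = catalog.search(organism="mouse")

# Replicating a file changes where it lives, not how it is found.
catalog.add_replica("lfn:/grid/bio/exp001/run1.dat",
                    "srm://storage-b.example.org/data/0001")
mouse_files_after = catalog.search(organism="mouse")
```

In this sketch a search never touches physical paths: the user supplies metadata, receives logical file names, and only then resolves replicas, which is why adding a copy leaves the search result unchanged.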
From a functional point of view, the adopted framework is deployed as a Web Grid application that users see as traditional web pages, but it is also ready to be deployed as a Grid service exposed to the public through a standard interface (WSDL). A Web interface has been implemented to hide the complexity of the framework and to let users navigate a Grid portal and access the available data services with a standard browser. From a data point of view, the proposed environment lets users upload their data and results to, and download them from, the Grid portal, and store them on Grid storage resources. The filesystem complexity is hidden by the AMGA representation, which also allows data collections to be accessed from multiple perspectives.
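One way to picture multiple-perspective access is to group the same set of logical file names by different metadata fields, so that each field yields its own browsable view without any file being moved. The sketch below is a hypothetical illustration, not a description of how AMGA implements it; the `build_view` function, the metadata fields and the file names are assumptions.

```python
from collections import defaultdict


def build_view(records, key):
    """Group logical file names by the value of one metadata field.

    `records` maps a logical file name to its metadata dictionary.
    Files lacking the field are grouped under "unknown".
    """
    view = defaultdict(list)
    for lfn, metadata in records.items():
        view[metadata.get(key, "unknown")].append(lfn)
    return {value: sorted(lfns) for value, lfns in sorted(view.items())}


# Illustrative data only: names and fields are assumptions.
records = {
    "lfn:/grid/bio/exp001/run1.dat": {"organism": "mouse", "technique": "microarray"},
    "lfn:/grid/bio/exp001/run2.dat": {"organism": "mouse", "technique": "imaging"},
    "lfn:/grid/bio/exp002/run1.dat": {"organism": "human", "technique": "microarray"},
}

by_organism = build_view(records, "organism")
by_technique = build_view(records, "technique")
```

Both views are built from the same records, so a file appears once per perspective while remaining a single stored object.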
The same framework can be adopted in a biomedical scenario, combining text data on patients, studies, and reports with medical imaging acquisition volumes, time-series signals, or genomic information.
Summary
We present the architecture of a Grid-based data storage and management system for data concerning bioinformatics experiments. The proposed system is based on current gLite Grid services and may provide a shared, standardized and reliable solution for storage and analysis of biological data, with particular emphasis on integration between the clinical and research domains.
Author
Dr Ivan Porro (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Co-authors
Dr Adam Papadimitropoulos (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Dr Antonio Calanducci (National Institute of Nuclear Physics (INFN), sez. Catania)
Dr Federica Viti (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Prof. Francesco Beltrame (Italian National Research Council (CNR))
Mr Luca Corradi (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Prof. Marco Fato (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Dr Silvia Scaglione (Department of Communication, Computer and System Sciences (DIST), University of Genoa)