25–29 Sept 2006
CICG
Europe/Zurich timezone

A fully gLite metadata approach to access files for bioinformatics applications

26 Sept 2006, 14:00
5h 30m

CICG, 17 rue de Varembé, CH - 1211 Geneva 20 Switzerland
Board: 16
Poster Users & Applications Poster session

Speaker

Dr Ivan Porro (Department of Communication, Computer and System Sciences (DIST), University of Genoa)

Description

In bioinformatics laboratory research, measurements from experiments can vary dramatically in accuracy and reproducibility, forcing researchers to design experiments with more biological replicates. Statistical processing systems can mitigate this problem by widening the amount of data they consider, but cost remains a strong limit on the size of experiments. As a more general solution, similar data may be collected across several acquisition facilities; however, to reproduce or compare different experimental setups, the side conditions associated with each experiment must be accurately tracked. Moreover, end users may be offered different analysis algorithms by different providers, and search tools may be needed to find data and applications. Finally, data, experimental setups, and results should be collected through a user-friendly web interface.

Starting from these considerations, we decided to implement a Grid-based storage and management system for data from bioinformatics experiments. A Grid-service-based approach may provide a shared, standardized and reliable solution for storing and analysing such biological data, and a Grid portal may allow inexperienced users to store their experimental data on a complex storage system and to access distributed data and services. Instead of developing a dedicated database, data and experiment annotations can be stored using metadata management tools, which provide high flexibility and support the replication of experiments in biological research. Security and privacy issues can be addressed using the certificate-based authentication scheme that Grid technology provides at no extra cost, and sensitive data can be federated, or accessed either in place without being moved or via volatile copies. The described environment relies on storage services (with replication and catalog services) provided by the gLite Grid middleware.
Through the AMGA metadata catalog, gLite can exploit the added value of metadata, letting users classify and search experiments more effectively. The key feature of our solution is that data files can be searched and accessed just by providing their descriptive metadata. Several keywords (metadata fields) are associated with each data file, and the metadata catalog collects these high-level descriptions. Files are physically stored on the Grid, and the metadata catalog also holds the information needed to access them through their logical file names, so users need not deal with filesystem structures. In this way, files can be replicated across disks for greater reliability while the file collection is kept consistent.

From a functional point of view, the adopted framework is deployed as a Web Grid application that users see as traditional web pages, but it is also ready to be deployed as a Grid service exposed to the public through a standard interface (WSDL). A Web interface has been implemented to hide the complexity of the framework and to let users navigate the Grid portal and access the available data services with a standard browser. From a data point of view, the proposed environment lets users upload their data and results to the Grid portal, download them from it, and store them on Grid storage resources. The filesystem complexity is hidden by the AMGA representation, which also allows data collections to be accessed from multiple perspectives. The same framework can be adopted in a biomedical scenario that combines text data for patients, studies, and reports with medical imaging acquisition volumes, time-series signals, and genomic information.
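
The idea of locating files purely through their descriptive metadata, and resolving them to logical file names and replicas, can be illustrated with a small in-memory sketch. This is not the AMGA or gLite API: the class names, method names, metadata fields, file paths, and storage host names below are all hypothetical, chosen only to show the lookup pattern under the assumption that each file carries a flat set of keyword/value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One file known to the catalog (hypothetical structure)."""
    logical_name: str                 # logical file name, independent of storage layout
    metadata: Dict[str, str]          # descriptive keywords attached to the file
    replicas: List[str] = field(default_factory=list)  # storage hosts holding a copy


class MetadataCatalog:
    """Toy in-memory stand-in for a metadata catalog; not the AMGA API."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, logical_name: str, metadata: Dict[str, str], replica: str) -> None:
        """Record a new file with its metadata and its first physical copy."""
        self._entries[logical_name] = CatalogEntry(logical_name, dict(metadata), [replica])

    def add_replica(self, logical_name: str, replica: str) -> None:
        """Record an additional copy; the logical name and metadata stay unchanged."""
        self._entries[logical_name].replicas.append(replica)

    def find(self, **criteria: str) -> List[str]:
        """Return the logical names of files whose metadata match every criterion."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(entry.metadata.get(key) == value for key, value in criteria.items())
        )

    def replicas_of(self, logical_name: str) -> List[str]:
        """Return the storage hosts that hold a copy of the file."""
        return list(self._entries[logical_name].replicas)


# Hypothetical experiment files and storage hosts, for illustration only.
catalog = MetadataCatalog()
catalog.register(
    "/grid/bio/exp001/array.dat",
    {"organism": "human", "technique": "microarray"},
    "se01.example.org",
)
catalog.register(
    "/grid/bio/exp002/array.dat",
    {"organism": "mouse", "technique": "microarray"},
    "se02.example.org",
)
catalog.add_replica("/grid/bio/exp001/array.dat", "se02.example.org")

# The user supplies only metadata; the catalog resolves names and replicas.
human_files = catalog.find(organism="human", technique="microarray")
print(human_files)
print(catalog.replicas_of(human_files[0]))
```

In this sketch, adding a replica changes only the list of physical copies, so a metadata query returns the same logical file names regardless of how many copies exist, which mirrors the consistency property described above.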

Summary

We present the architecture of a Grid-based data storage and management system for data concerning bioinformatics experiments. The proposed system is based on actual gLite Grid services and may provide a shared, standardized and reliable solution for storage and analysis of biological data with a particular emphasis on integration between clinical and research domains.

Author

Dr Ivan Porro (Department of Communication, Computer and System Sciences (DIST), University of Genoa)

Co-authors

Dr Adam Papadimitropoulos (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Dr Antonio Calanducci (National Institute of Nuclear Physics (INFN), sez. Catania)
Dr Federica Viti (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Prof. Francesco Beltrame (Italian National Research Council (CNR))
Mr Luca Corradi (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Prof. Marco Fato (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
Dr Silvia Scaglione (Department of Communication, Computer and System Sciences (DIST), University of Genoa)
