Speaker
Dr
Christophe Blanchet
(Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1;)
Description
Bioinformatics analysis of data produced by high-throughput
biology, for instance genome projects [1], is one of
the major challenges for the coming years. Some of the
requirements of this analysis are access to up-to-date
databanks (of sequences, patterns, 3D structures, etc.) and
relevant algorithms (for sequence similarity, multiple
alignment, pattern scanning, etc.) [2]. Since 1998, we have been
developing the NPS@ Web server (Network
Protein Sequence Analysis [3]), which provides biologists with
many of the most common resources for protein
sequence analysis, integrated into a common workflow. These
methods and data can be accessed through simple
web browsing and HTTP connections, or through high-level
bioinformatics interfaces such as the MPSA program [4] or
AntheProt [5].
Today, the computing resources available behind the NPS@ Web
portal limit the capabilities of our server, as is
also the case for other genomics and proteomics Web portals.
Indeed, some methods are very demanding in computing time
and memory. Our NPS@ portal faces an
increasing demand for CPU and disk resources and has to
manage numerous bioinformatics resources (algorithms,
databanks).
NPS@ [3] provides biologists with a Web form to input
their data (protein sequences) in order to run a BLAST
analysis against a given protein sequence database. The user
pastes a protein sequence into the corresponding
field, then chooses the database that will be scanned with
the query sequence. All the protein databases available
on NPS@ can be selected from a multi-valued list in the form.
The GPS@ grid web portal (Grid Protein Sequence Analysis,
http://gpsa-pbil.ibcp.fr) is the grid release of the NPS@
bioinformatics portal. GPS@ hides the mechanisms required
for submitting bioinformatics analyses to the grid
infrastructure. Selecting the “EGEE” check-box schedules
the BLAST job for submission to the EGEE grid [6]
when the “submit” button is clicked. The bioinformatics
algorithms and databases available on GPS@ have been
distributed and registered on the grid, and GPS@ runs its own
EGEE interface to the grid.
First, the job description in the Web form is converted into
a JDL file, which can then be submitted to the workload
management system of EGEE. The GPS@ sub-process that
submitted the job also periodically checks
the status of this job by querying the resource broker with
the appropriate commands. Each step is reported to the user
on the Web page of the submission, with the time
and the duration of the current step. When the job
completes, i.e. reaches the “Done” step, the GPS@ automaton
downloads the BLAST result file. This raw
result file in BLAST format is then processed and converted into
an HTML page showing, in a colored and graphical way,
the list of similar protein sequences, together with a graph and
their pairwise alignments. This formatting process is
directly inherited from the original NPS@ portal, providing
biologists with a well-known interface and way of
displaying results.
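The first two steps above (building a JDL description, then polling the job status) can be sketched as follows. This is only an illustration under stated assumptions, not the GPS@ implementation: the wrapper script name `run_blast.sh`, the file names, the set of terminal states other than “Done”, and the `get_status` callable (standing in for whatever command queries the resource broker) are all hypothetical.

```python
import time
from typing import Callable, List


def build_jdl(executable: str, arguments: str,
              input_files: List[str], output_files: List[str]) -> str:
    """Render a minimal JDL job description as text."""
    def quoted_list(items: List[str]) -> str:
        # JDL lists are written as {"a", "b", ...}
        return "{" + ", ".join('"' + item + '"' for item in items) + "}"

    outputs = ["std.out", "std.err"] + output_files
    lines = [
        'Executable = "' + executable + '";',
        'Arguments = "' + arguments + '";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        "InputSandbox = " + quoted_list(input_files) + ";",
        "OutputSandbox = " + quoted_list(outputs) + ";",
    ]
    return "\n".join(lines) + "\n"


def wait_until_finished(get_status: Callable[[], str],
                        poll_seconds: float = 60.0,
                        notify: Callable[[str], None] = print,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a job until it reaches a terminal state and return that state.

    get_status is a hypothetical callable that asks the resource broker
    for the current state; every state seen is passed to notify so the
    caller can report progress to the user.
    """
    # "Done" comes from the text; the other terminal states are assumed.
    terminal_states = {"Done", "Aborted", "Cancelled"}
    started = time.time()
    while True:
        status = get_status()
        elapsed = time.time() - started
        notify(f"{status} (after {elapsed:.0f} s)")
        if status in terminal_states:
            return status
        sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical BLAST job: the wrapper script and file names are made up.
    print(build_jdl("run_blast.sh", "query.fasta",
                    ["run_blast.sh", "query.fasta"], ["result.blast"]))
```

Passing the status query in as a callable keeps the polling loop independent of the actual middleware command, which differs between grid releases.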
The GPS@ portal makes bioinformatics job submission to the grid
easier, and gives biologists the benefit of
the EGEE grid infrastructure for analyzing large biological
datasets, e.g. including several protein secondary structure
predictions in a multiple alignment, or clustering a
sequence set by analyzing, with BLAST or SSEARCH, each
sequence against the others.
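To illustrate why the clustering use case benefits from a grid, the sketch below splits a sequence set into one independent comparison job per query sequence, each run against all the others. It is a minimal example under assumptions: the FASTA parsing is simplified, and the function names and the job representation (a query name plus a list of target names) are hypothetical, not part of GPS@.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping from sequence name to residues."""
    sequences: Dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word of the header line.
            header = line[1:].split()
            name = header[0] if header else f"seq{len(sequences) + 1}"
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def all_against_all_jobs(sequences: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Build one job per query sequence, targeting every other sequence.

    Each job is independent of the others, so the jobs can be submitted
    to the grid separately and run in parallel.
    """
    jobs = []
    for query in sequences:
        targets = [other for other in sequences if other != query]
        jobs.append((query, targets))
    return jobs


if __name__ == "__main__":
    example = ">a first protein\nMKV\nLLA\n>b\nMKT\n>c\nGGG\n"
    for query, targets in all_against_all_jobs(parse_fasta(example)):
        print(query, "vs", ", ".join(targets))
```

For n sequences this yields n jobs covering n(n-1) ordered comparisons, which is the quadratic growth that makes distributing the work attractive.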
Acknowledgements
This work has been funded by the GriPPS project (ACI GRID
PPL02-05), the EGEE project (EU FP6, ref. INFSO-508833)
and the EMBRACE Network of Excellence (EU FP6, LHSG-CT-2004-512092).
References
[1] Bernal, A., Ear, U., Kyrpides, N.: Genomes OnLine
Database (GOLD): a monitor of genome projects world-wide.
Nucleic Acids Res., 29 (2001) 126-127.
[2] Perrière, G., Combet, C., Penel, S., Blanchet, C.,
Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M.,
Duret, L. and Deléage, G.: Integrated databanks access and
sequence/structure analysis services at the PBIL. Nucleic
Acids Res., 31 (2003) 3393-3399.
[3] Combet, C., Blanchet, C., Geourjon, C. and Deléage, G.:
NPS@: Network Protein Sequence Analysis. Trends Biochem. Sci., 25
(2000) 147-150.
[4] Blanchet, C., Combet, C., Geourjon, C. and Deléage, G.:
MPSA: Integrated System for Multiple Protein
Sequence Analysis with client/server capabilities.
Bioinformatics, 16 (2000) 286-287.
[5] Deléage, G., Combet, C., Blanchet, C., Geourjon, C.:
ANTHEPROT: an integrated protein sequence analysis
software with client/server capabilities. Comput. Biol. Med.,
31 (2001) 259-267.
[6] EGEE: Enabling Grids for E-science in Europe;
http://www.eu-egee.org
Primary author
Dr
Christophe Blanchet
(Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1;)
Co-authors
Dr
Christophe Combet
(Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1;)
Prof.
Gilbert Deleage
(Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1;)
Mr
Rémi Mollon
(Institut de Biologie et Chimie des Protéines (IBCP UMR 5086); CNRS; Univ. Lyon 1;)