EGEE User Forum

Europe/Zurich
CERN

Description

The EGEE (Enabling Grids for E-sciencE) project provides the largest production grid infrastructure for applications. In the first two years of the project, an increasing number of diverse user communities have been attracted by the possibilities offered by EGEE and have joined the initial user communities. The EGEE user community feels it is now appropriate to meet, to share experiences, and to set new targets for the future, including both the evolution of the existing applications and the development and deployment of new applications on the EGEE infrastructure.

The EGEE User Forum will provide an important opportunity for innovative applications to establish contacts with EGEE and with other user communities, to plan the future usage of the EGEE grid infrastructure, to learn about the latest advances, and to discuss the future evolution of the grid middleware. The main goal is to create a dynamic user community, starting from the base of existing users, which can increase the effectiveness of the current EGEE applications and promote the fast and efficient uptake of grid technology by new disciplines. EGEE fosters pioneering usage of its infrastructure by encouraging collaboration between diverse scientific disciplines. It does this to evolve and expand the services offered to the EGEE user community, maximising the scientific, technological and economic relevance of grid-based activities.

We would like to invite hands-on users of the EGEE Grid Infrastructure to submit an abstract for this event, following the suggested template.

EGEE User Forum Web Page
Participants
  • Adrian Vataman
  • Alastair Duncan
  • Alberto Falzone
  • Alberto Ribon
  • Ales Krenek
  • Alessandro Comunian
  • Alexandru Tudose
  • Alexey Poyda
  • Algimantas Juozapavicius
  • Alistair Mills
  • Alvaro del Castillo San Felix
  • Andrea Barisani
  • Andrea Caltroni
  • Andrea Ferraro
  • Andrea Manzi
  • Andrea Rodolico
  • Andrea Sciabà
  • Andreas Gisel
  • Andreas-Joachim Peters
  • Andrew Maier
  • Andrey Kiryanov
  • Aneta Karaivanova
  • Antonio Almeida
  • Antonio De la Fuente
  • Antonio Laganà
  • Antony Wilson
  • Arnaud PIERSON
  • Arnold Meijster
  • Benjamin Gaidioz
  • Beppe Ugolotti
  • Birger Koblitz
  • Bjorn Engsig
  • Bob Jones
  • Boon Low
  • Catalin Cirstoiu
  • Cecile Germain-Renaud
  • Charles Loomis
  • CHOLLET Frédérique
  • Christian Saguez
  • Christoph Langguth
  • Christophe Blanchet
  • Christophe Pera
  • Claudio Arlandini
  • Claudio Grandi
  • Claudio Vella
  • Claudio Vuerli
  • Claus Jacobs
  • Craig Munro
  • Cristian Dittamo
  • Cyril L'Orphelin
  • Daniel JOUVENOT
  • Daniel Lagrava
  • Daniel Rodrigues
  • David Colling
  • David Fergusson
  • David Horn
  • David Smith
  • David Weissenbach
  • Davide Bernardini
  • Dezso Horvath
  • Dieter Kranzlmüller
  • Dietrich Liko
  • Dmitry Mishin
  • Doina Banciu
  • Domenico Vicinanza
  • Dominique Hausser
  • Eike Jessen
  • Elena Slabospitskaya
  • Elena Tikhonenko
  • Elisabetta Ronchieri
  • Emanouil Atanassov
  • Eric Yen
  • Erwin Laure
  • Esther Acción García
  • Ezio Corso
  • Fabrice Bellet
  • Fabrizio Pacini
  • Federica Fanzago
  • Fernando Felix-Redondo
  • Flavia Donno
  • Florian Urmetzer
  • Florida Estrella
  • Fokke Dijkstra
  • Fotis Georgatos
  • Fotis Karayannis
  • Francesco Giacomini
  • Francisco Casatejón
  • Frank Harris
  • Frederic Hemmer
  • Gael Youinou
  • Gaetano Maron
  • Gavin McCance
  • Gergely Sipos
  • Giorgio Maggi
  • Giorgio Pauletto
  • Giovanna Stancanelli
  • Giuliano Pelfer
  • Giuliano Taffoni
  • Giuseppe Andronico
  • Giuseppe Codispoti
  • Hannah Cumming
  • Hannelore Hammerle
  • Hans Gankema
  • Harald Kornmayer
  • Horst Schwichtenberg
  • Huard Helene
  • Hugues BENOIT-CATTIN
  • Hurng-Chun LEE
  • Ian Bird
  • Ignacio Blanquer
  • Ilyin Slava
  • Iosif Legrand
  • Isabel Campos Plasencia
  • Isabelle Magnin
  • Jacq Florence
  • Jakub Moscicki
  • Jan Kmunicek
  • Jan Svec
  • Jaouher KERROU
  • Jean Salzemann
  • Jean-Pierre Prost
  • Jeremy Coles
  • Jiri Kosina
  • Joachim Biercamp
  • Johan Montagnat
  • John Walk
  • John White
  • Jose Antonio Coarasa Perez
  • José Luis Vazquez
  • Juha Herrala
  • Julia Andreeva
  • Kerstin Ronneberger
  • Kiril Boyanov
  • Konstantin Skaburskas
  • Ladislav Hluchy
  • Laura Cristiana Voicu
  • Laura Perini
  • Leonardo Arteconi
  • Livia Torterolo
  • Losilla Guillermo Anadon
  • Luciano Milanesi
  • Ludek Matyska
  • Lukasz Skital
  • Luke Dickens
  • Malcolm Atkinson
  • Marc Rodriguez Espadamala
  • Marc-Elian Bégin
  • Marcel Kunze
  • Marcin Plociennik
  • Marco Cecchi
  • Mariusz Sterzel
  • Marko Krznaric
  • Markus Schulz
  • Martin Antony Walker
  • Massimo Lamanna
  • Massimo Marino
  • Miguel Cárdenas Montes
  • Mike Mineter
  • Mikhail Zhizhin
  • Mircea Nicolae Tugulea
  • Monique Petitdidier
  • Muriel Gougerot
  • Nadezda Fialko
  • Nadine Neyroud
  • Nick Brook
  • Nicolas Jacq
  • Nicolas Ray
  • Nils Buss
  • Nuno Santos
  • Osvaldo Gervasi
  • Othmane Bouhali
  • Owen Appleton
  • Pablo Saiz
  • Panagiotis Louridas
  • Pasquale Pagano
  • Patricia Mendez Lorenzo
  • Pawel Wolniewicz
  • Pedro Andrade
  • Peter Kacsuk
  • Peter Praxmarer
  • Philippa Strange
  • Philippe Renard
  • Pier Giovanni Pelfer
  • Pietro Lio
  • Pietro Liò
  • Rafael Leiva
  • Remi Mollon
  • Ricardo Brito da Rocha
  • Riccardo di Meo
  • Robert Cohen
  • Roberta Faggian Marque
  • Roberto Barbera
  • Roberto Santinelli
  • Rolandas Naujikas
  • Rolf Kubli
  • Rolf Rumler
  • Romier Genevieve
  • Rosanna Catania
  • Sabine ELLES
  • Sandor Suhai
  • Sergio Andreozzi
  • Sergio Fantinel
  • Shkelzen RUGOVAC
  • Silvano Paoli
  • Simon Lin
  • Simone Campana
  • Soha Maad
  • Stefano Beco
  • Stefano Cozzini
  • Stella Shen
  • Stephan Kindermann
  • Steve Fisher
  • Tao-Sheng Chen
  • Texier Romain
  • Toan Nguyen
  • Todor Gurov
  • Tomasz Szepieniec
  • Tony Calanducci
  • Torsten Antoni
  • Tristan Glatard
  • Valentin Vidic
  • Valerio Venturi
  • Vangelis Floros
  • Vaso Kotroni
  • Venicio Duic
  • Vicente Hernandez
  • Victor Lakhno
  • Viet Tran
  • Vincent Breton
  • Vincent LEFORT
  • Vladimir Voznesensky
  • Wei-Long Ueng
  • Ying-Ta Wu
  • Yury Ryabov
  • Ákos Frohner
    • 1:00 PM 2:00 PM
      Lunch 1h
    • 2:00 PM 6:30 PM
      1a: Life Sciences 40-SS-C01

      CERN

      • 2:00 PM
        GPS@: Bioinformatics grid portal for protein sequence analysis on EGEE grid 15m
        One of the current major challenges in the bioinformatics field is to derive valuable information from the complete genome sequencing projects, which provide the bioinformatics community with a large number of unknown sequences. The first prerequisite step in this process is to access up-to-date sequence and 3D-structure databanks (EMBL, GenBank, SWISS-PROT, Protein Data Bank...) maintained by several bio-computing centres (NCBI, EBI, EMBL, SIB, INFOBIOGEN, PBIL, ...). For efficiency reasons, sequences should be analyzed using the maximal number of methods on a minimal number of different web sites. To achieve this, we developed a web server called NPS@ [1] (Network Protein Sequence Analysis) that provides biologists with many of the most common tools for protein sequence analysis, through a classic web browser like Netscape or through a networked protein client application like MPSA [2].

        Today, the available genomic and post-genomic web portals have to make do with their local CPU and storage resources, which is why portal administrators usually place restrictions on the methods and databanks offered. Grid computing [3], as in the European EGEE project [4], is a viable solution to overcome these limitations and to bring computing resources suited to the genomic research field. Nevertheless, the current job submission process on the EGEE platform is relatively complex and unsuitable for automation. The user has to install an EGEE user interface machine on a Linux computer (or ask for an account on a public one), log on to it remotely, manually initialize a certificate proxy for authentication, describe the job to the grid middleware using the Job Description Language (JDL), and then submit it through a command-line interface. Next, the grid user has to check the resource broker periodically for the status of the job: "Submitted", "Ready", "Scheduled", "Running", etc., until the "Done" status is reached. As a final command, he has to fetch his results with a raw file transfer from the remote storage area to his local file system. This procedure is off-putting for most scientists who are not versed in advanced computing techniques.

        Thus, we decided to provide biologists with a user-friendly interface to the EGEE computing and storage resources by adapting our NPS@ web site. We have called this new portal GPS@, for "Grid Protein Sequence Analysis"; it can be reached online at http://gpsa.ibcp.fr, currently for experimental tests only. In GPS@, the grid analysis query is simplified: the GPS@ web portal runs its own low-level EGEE interface and provides biologists with the same interface they use daily in NPS@. They only have to paste their protein sequences or patterns into the corresponding field of the submission web page; pressing the "submit" button then launches the execution of these jobs on the EGEE platform. The whole EGEE job submission is encapsulated in the GPS@ back office: scheduling and status tracking of the submitted jobs. Finally, the results of the bioinformatics jobs are displayed in a new web page, ready for further analyses or for download in the appropriate data format.

        [1] Combet C., Blanchet C., Geourjon C. and Deléage G.: NPS@: Network Protein Sequence Analysis. TIBS, 2000, 25, 147-150. [2] Blanchet C., Combet C., Geourjon C. and Deléage G.: MPSA: Integrated System for Multiple Protein Sequence Analysis with client/server capabilities. Bioinformatics, 2000, 16, 286-287. [3] Foster, I. and Kesselman, C. (eds.): The Grid 2: Blueprint for a New Computing Infrastructure, 2004. [4] Enabling Grids for E-sciencE (EGEE), online at www.eu-egee.org
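        As an illustration of the manual submission cycle the portal hides, a minimal LCG2-era session might look as follows. The JDL attributes and the grid-proxy-init/edg-job-* commands are the standard user-interface tools named in the abstract; the executable, file names and job identifier are hypothetical:

        ```
        # analysis.jdl -- a minimal, illustrative job description
        Executable    = "run_analysis.sh";
        Arguments     = "sequence.fasta";
        StdOutput     = "std.out";
        StdError      = "std.err";
        InputSandbox  = {"run_analysis.sh", "sequence.fasta"};
        OutputSandbox = {"std.out", "std.err", "result.txt"};

        # From the user interface machine:
        $ grid-proxy-init                # initialize the certificate proxy
        $ edg-job-submit analysis.jdl    # returns a job identifier
        $ edg-job-status <jobID>         # Submitted, Ready, Scheduled, Running, ... Done
        $ edg-job-get-output <jobID>     # raw transfer of results to the local machine
        ```

        GPS@ performs every step of this cycle on the user's behalf, which is why the biologist only ever sees the familiar NPS@-style web form.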
        Speakers: Dr Christophe Blanchet (CNRS IBCP), Mr Vincent Lefort (CNRS IBCP)
        Slides
      • 2:15 PM
        Encrypted File System on the EGEE grid applied to Protein Sequence Analysis 15m
        Introduction: Biomedical applications are among the pilot applications of the EGEE project [1][2] and have their own virtual organization, the "biomed" VO. They share common security requirements, such as an electronic certificate system, authentication and secured transfer, but they also have specific ones, such as fine-grained access to data, encrypted storage of data and anonymity. The certificate system provides biomedical entities (users, services or web portals) with a secure, individual electronic certificate for authentication and authorization management; one key quality of such a system is the capacity to renew and revoke these certificates across the whole grid. Biomedical applications also need fine-grained access (with Access Control Lists, ACLs) to the data stored on the grid: biologists and biochemists can then, for example, share data with colleagues working on the same project in other places. Biomedical data must therefore be gridified with a high level of confidentiality, because they may concern patients or sensitive scientific or industrial experiments. The solution is to encrypt the data on the grid storage resources, while providing authorized users and applications with transparent, unencrypted access.

        Biological data and protein sequence analysis applications: Biological data and bioinformatics programs both have special formats and behaviours, which become especially visible when they are used on a distributed computing platform such as a grid [2]. Biological data represent very large datasets of different natures, from different sources, with heterogeneous models: protein three-dimensional structures, functional signatures, expression arrays, etc. Bioinformatics experiments use numerous methods and algorithms to analyze the biological data available to the community [3], and for each domain of bioinformatics there are several different high-quality programs available to compute the same dataset in as many ways. But most bioinformatics programs are not adapted to distributed platforms. One important limitation is that they access data only through a local file system interface, both to get input data and to store results; another is that these data must be unencrypted.

        The European EGEE grid: The Enabling Grids for E-sciencE (EGEE) project [4], funded by the European Commission, aims to build on recent advances in grid technology and to develop a service grid infrastructure such as that described by Foster et al. at the end of the 1990s [5]. The EGEE middleware provides grid users with a "user interface" (UI) to launch jobs. Among the components of the EGEE grid, the "workload management system" (WMS) is responsible for job scheduling; its central piece is the scheduler (or "resource broker"), which determines where and when to send a job on the "computing elements" (CE) and to get data from the "storage elements" (SE). The "data management system" (DMS) is a key service for our bioinformatics applications: an efficient usage of the DMS is synonymous with a good distribution of our protein sequence analysis applications. Inside the DMS, the "replica manager system" (RMS) provides users with data replication functionality. But there is no encryption service available on the EGEE production grid, which is built upon the LCG2 middleware.

        "EncFile" encrypted file manager: We have developed EncFile, an encrypted file management system, to provide our bioinformatics applications with facilities for computing sensitive data on the EGEE grid. The AES (Advanced Encryption Standard) cipher is used with 256-bit keys. To bring fault-tolerance properties to the platform, we have also applied the M-of-N technique described by Shamir for secret sharing [6]: we split a key into N shares, each stored on a different server. To rebuild a key, exactly M of the N shares are needed; with fewer than M shares it is impossible to deduce even a single bit of it. The EncFile system is composed of these N key servers and one client. The client performs the decryption of the file for the legacy application and is the only component able to rebuild the keys, securing their confidentiality. The transfer of the key shares between the M servers and the client is secured with encryption and mutual authentication. In order to determine user authorization, the EncFile client sends the user proxy to authenticate itself; nonetheless, to prevent a malicious person from creating a fake EncFile client (e.g. to retrieve key shares), a second authentication is required with a specific certificate of the EncFile system.

        As noted above, most bioinformatics programs are only able to access their data through a local file system interface, and only unencrypted. To answer these two strong requirements, we have combined the EncFile client with the Parrot software [7]. The resulting client (called Perroquet in Figure 1) acts as a launcher for applications, catching all their standard I/O calls and replacing them with equivalent calls to remote files. Perroquet understands the logical file name (LFN) locators of our biological resources on the EGEE grid and performs on-the-fly decryption. This has two main consequences: (i) a higher security level, because decrypted file copies could endanger data; and (ii) better performance, because files are not read twice (once to make a local copy and once to decrypt). Thus, the EncFile client permits any application to transparently read and write remote files, encrypted or not, as if they were local plain-text files. We are using the EncFile system to secure sensitive biological data on the EGEE production platform and to analyze them with well-known legacy bioinformatics applications such as BLAST, SSearch or ClustalW.

        Conclusion: We have developed the EncFile system for encrypted file management and deployed it on the production platform of the EGEE project. We thus provide grid users with a user-friendly component that does not require any user privileges and that is fault-tolerant thanks to the M-of-N technique used to deploy key shares on several key servers. The EncFile client provides legacy bioinformatics applications, such as the ones used daily for genome analyses, with remote data access.

        Acknowledgement: This work was supported by the European Union (EGEE project, ref. INFSO-508833). The authors thank Douglas Thain for the interesting discussions about the Parrot tool.

        References: [1] Jacq, N., Blanchet, C., Combet, C., Cornillot, E., Duret, L., Kurata, K., Nakamura, H., Silvestre, T. and Breton, V.: Grid as a bioinformatics tool. Parallel Computing, special issue: High-performance parallel bio-computing, Vol. 30, 2004. [2] Breton, V., Blanchet, C., Legré, Y., Maigne, L. and Montagnat, J.: Grid Technology for Biomedical Applications. In M. Daydé et al. (eds.): VECPAR 2004, Lecture Notes in Computer Science 3402, pp. 204-218, 2005. [3] Combet, C., Blanchet, C., Geourjon, C. and Deléage, G.: NPS@: Network Protein Sequence Analysis. TIBS, 25 (2000) 147-150. [4] Enabling Grids for E-sciencE (EGEE), online at www.eu-egee.org. [5] Foster, I. and Kesselman, C. (eds.): The Grid 2: Blueprint for a New Computing Infrastructure, 2004. [6] Shamir, A.: How to share a secret. Communications of the ACM, 22(11):612-613, Nov. 1979. [7] Thain, D. and Livny, M.: Parrot: an application environment for data-intensive computing. Scalable Computing: Practice and Experience 6 (2005) 9-18.
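        To make the M-of-N key-sharing scheme concrete, here is a minimal self-contained sketch of Shamir secret sharing over a prime field, assuming a 256-bit AES key. It illustrates the technique only; it is not the EncFile implementation and omits the authenticated share transport described above:

        ```python
        # Shamir M-of-N secret sharing: split a key into N shares such that any
        # M of them reconstruct it, while fewer than M reveal nothing.
        import random

        PRIME = 2**521 - 1  # a Mersenne prime, larger than any 256-bit key

        def split(secret: int, m: int, n: int):
            """Split `secret` into n shares; any m of them reconstruct it."""
            # Random polynomial of degree m-1 whose constant term is the secret.
            coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
            shares = []
            for x in range(1, n + 1):
                y = 0
                for c in reversed(coeffs):      # Horner evaluation mod PRIME
                    y = (y * x + c) % PRIME
                shares.append((x, y))
            return shares

        def reconstruct(shares):
            """Lagrange interpolation at x = 0 recovers the constant term."""
            secret = 0
            for i, (xi, yi) in enumerate(shares):
                num, den = 1, 1
                for j, (xj, _) in enumerate(shares):
                    if i != j:
                        num = (num * -xj) % PRIME
                        den = (den * (xi - xj)) % PRIME
                # pow(den, PRIME-2, PRIME) is the modular inverse (Fermat).
                secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
            return secret

        key = random.getrandbits(256)            # a 256-bit AES key
        shares = split(key, m=3, n=5)            # 5 key servers, any 3 suffice
        assert reconstruct(shares[:3]) == key
        ```

        In EncFile, only the client ever holds M shares at once, so no single key server can reconstruct a key on its own.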
        Speakers: Dr Christophe Blanchet (CNRS IBCP), Mr Rémi Mollon (CNRS IBCP)
        Slides
      • 2:30 PM
        BIOINFOGRID: Bioinformatics Grid Application for life science 15m
        Project description: The European Commission promotes the Bioinformatics Grid Application for life science (BIOINFOGRID) project. The BIOINFOGRID project web site will be available at http://www.itb.cnr.it/bioinfogrid. The project aims to connect many European computer centres in order to carry out bioinformatics research and to develop new applications in the sector, using a network of services based on Grid technology, the natural evolution of the Web. More specifically, the BIOINFOGRID project will make research in the fields of genomics, proteomics and transcriptomics, and applications in molecular dynamics, much easier, reducing computation times thanks to the distribution of the calculation at any one time over thousands of computers across Europe and the world. Furthermore, it will give thousands of European users access to many different databases and hundreds of applications, by exploiting the potential of the Grid infrastructure created by the EGEE European project and coordinated by CERN in Geneva. The BIOINFOGRID project foresees an investment of over one million euros, funded through the European Commission's "Research Infrastructures" budget.

        Grid networking promises to be a very important step forward in the field of information technology. Grid technology will make possible a global network made up of hundreds of thousands of interconnected computers, allowing the shared use of computing power, data storage and structured compression of data. This goes beyond simple communication between computers and aims instead to transform the global network of computers into one vast, joint computational resource; it is a major step forward from the Web, which simply allows the sharing of information over the internet. The massive potential of Grid technology will be indispensable when dealing with both the complexity of models and the enormous quantities of data involved, for example, in searching the human genome or when carrying out molecular dynamics simulations for the study of new drugs.

        Grid collaborative and application aspects: The BIOINFOGRID project proposes to combine the bioinformatics services and applications for molecular biology users with the Grid infrastructure created by EGEE (6th Framework Programme). In the BIOINFOGRID initiative we plan to evaluate genomics, transcriptomics, proteomics and molecular dynamics application studies based on GRID technology:

        Genomics applications in GRID:
        • Analysis of the W3H task system for GRID.
        • GRID analysis of cDNA data.
        • GRID analysis of the NCBI and Ensembl databases.
        • GRID analysis of rule-based multiple alignments.

        Proteomics applications in GRID:
        • Pipeline analysis for protein functional domain search.
        • Surface protein analysis on the GRID platform.

        Transcriptomics and phylogenetics applications in GRID:
        • Data analysis specific to microarrays, allowing the GRID user to store and search this information, with direct access to the data files stored on Data Storage Elements on GRID servers.
        • Validation of an infrastructure for phylogenetic applications, based on the execution of tree-estimating phylogenetic methods.

        Database and functional genomics applications:
        • Offering the possibility to manage and access biological databases using the EGEE GRID.
        • Clustering gene products by their functionality, as an alternative to the commonly used comparison by sequence similarity.

        Molecular dynamics applications:
        • Improving the scalability of molecular dynamics simulations.
        • Performing simulations of the folding and aggregation of peptides and small proteins, investigating structural properties of proteins and protein-DNA complexes, and studying the effect of mutations in proteins of biomedical interest.
        • Performing a challenge of the Wide In Silico Docking On Malaria initiative.

        EGEE and EGEE-II future plans: BIOINFOGRID will evaluate Grid usability in a wide variety of applications, with the aim of building a strong and united BIOINFOGRID community and of exploring and exploiting common solutions. The BIOINFOGRID collaboration will be able to establish a very large bioinformatics user group in Europe, and this cooperation will promote bioinformatics and GRID applications in EGEE and EGEE-II. The aim of the BIOINFOGRID project is to bridge the gap by letting people from bioinformatics and the life sciences become aware of the power of Grid computing simply by trying to use it. We intend to pursue this goal by taking a number of key bioinformatics applications and getting them to run on the European Grid infrastructure. The most natural and important spin-off of the BIOINFOGRID project will then be a strong dissemination action within the user communities and across them: on one side, application experts will meet Grid experts and learn how to re-engineer and adapt their applications to "run on the Grid"; on the other side (and at the same time), application experts will meet experts from other applications, with a high probability that one group's expertise can be exploited as another's solution. The BIOINFOGRID project will provide EGEE-II with very useful input and feedback on the quality and efficiency of the deployed infrastructure and on the usefulness and effectiveness of the Grid services made available at the continental scale. Indeed, having several bioinformatics scientific applications using these Grid services is a key opportunity to stress the generality of the services themselves.
        Speaker: Dr Luciano Milanesi (National Research Council - Institute of Biomedical Technologies)
        Slides
      • 2:45 PM
        BioDCV: a grid-enabled complete validation setup for functional profiling 15m
        Abstract: BioDCV is a distributed computing system for the complete validation of gene profiles. The system is composed of a suite of software modules that allow the definition, management and analysis of a complete experiment on DNA microarray data. The BioDCV system is grid-enabled on LCG/EGEE middleware in order to build predictive classification models and to extract the most important genes in large-scale molecular oncology studies. Performance is evaluated on a set of 6 cancer microarray datasets of different sizes and complexity, and then compared with results obtained on a standard Linux cluster facility.

        Introduction: The scientific objective of BioDCV is a large-scale comparison of prognostic gene signatures from cancer microarray datasets, realized by a complete validation system and run on the Grid. The models will constitute a reference experimental landscape for new studies. The outcomes of BioDCV consist of a predictive model, a straightforward evaluation of its accuracy, lists of genes ranked by importance, and the identification of patient subtypes. Molecular oncologists from medical research centers and collaborating bioinformaticians are currently the target end-users of BioDCV. The comparisons presented in this paper demonstrate the feasibility of this approach on public data as well as on original microarray data from IFOM-Firc. The complete validation scheme developed in our system involves an intensive replication of a basic classification task on resampled versions of the dataset. About 5×10^5 base models are developed, which may become 2×10^6 if the experiment is replicated with randomized output labels. The scheme must ensure that no selection-bias effect contaminates the experiment; the cost of this caution is high computational complexity.

        Porting to the Grid: To guarantee fast, slim and robust code, and relational access to data and model descriptions, BioDCV was written in C and interfaced with SQLite (http://www.sqlite.org), a database engine which supports the concurrent access and transactions useful in a distributed environment, where a dataset may be replicated for up to a few million models. In this paper, we present the porting of our application to grid systems, namely the Egrid (http://www.egrid.it) computational grid. The Egrid infrastructure is based on Globus/EDG/LCG2 middleware and is integrated as an independent virtual organization within Grid.it, the INFN production grid. The porting requires just two wrappers: one shell script to submit jobs and one C MPI program. When the user submits a BioDCV job to the grid, the grid middleware looks for the CE (Computing Element, where user tasks are delivered) and the WNs (Worker Nodes, the machines where the grid user's programs are actually executed) required to run the parallel program. As soon as the resources (CPUs on WNs) are available, the shell script wrapper is executed on the assigned CE. This script distributes the microarray dataset from the SE (Storage Element, which stores user data on the grid) to all the involved WNs. It then starts the C MPI wrapper, which spawns several instances of the BioDCV program itself. When all BioDCV instances have completed, the wrapper copies all outputs, including model and diagnostic data, from the WNs back to the starting SE. Finally, the process outputs are returned, allowing the reconstruction of a complete data archive for the study (a rough sketch of this staging cycle is given below).

        Experiments and results: Two experiments were designed to measure the performance of the BioDCV parallel application in two different available computing environments: a standard Linux cluster and a computational grid. In Benchmark 1, we study the scalability of our application as a function of the number of CPUs. The benchmark is executed on a Linux cluster of 8 Xeon 3.0 CPUs and on the EGEE grid infrastructure, ranging from 1 to 64 Xeon CPUs. Two DNA microarray datasets are considered: LiverCanc (213 samples, ATAC-PCR, 1993 genes) and PedLeuk (327 samples, Affymetrix, 12625 genes). On both datasets we obtain a speed-up curve very close to linear; the speed-up factor for n CPUs is defined as the user time for one CPU divided by the user time for n CPUs. In Benchmark 2, we characterize the BioDCV application for different values of d (number of features) and N (number of samples) in a complete validation experiment, executing one task per dataset on the EGEE grid infrastructure with a fixed number of CPUs. The benchmark was run on a suite of six microarray datasets: LiverCanc, PedLeuk, BRCA (62 samples, cDNA, 4000 genes), Sarcoma (35 samples, cDNA, 7143 genes), Wang (286 samples, Affymetrix, 17816 genes) and Chang (295 samples, cDNA, 25000 genes). The effective execution time (total execution time minus queueing time at the grid site) increases linearly with the dataset footprint, i.e. the product of the number of genes and the number of samples. The performance penalty paid with respect to a standard parallel run on a local cluster is limited, and is mainly due to data transfer from the user machine to the grid site and between WNs.

        Discussion and Conclusions: The two experiments, which sum to 139 CPU-days on the Egrid infrastructure, indicate that the BioDCV system behaves well enough on LCG/EGEE computational grids to be used in practical large-scale experiments. The overall effort for gridification was limited to three months. We will investigate whether substituting the model of one single job asking for N CPUs (the MPI approach) with a model that submits N different single-CPU jobs can overcome some limitations. The next step is porting our system to EGEE's biomed VO. BioDCV is an open-source application, currently distributed under the GPL (SubVersion repository at http://biodcv.itc.it).
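        The staging cycle described above might look roughly as follows, with Python standing in for the original shell script and C MPI wrapper. The VO name, LFNs and file names are hypothetical; lcg-cp and lcg-cr are the standard LCG2 data-management commands:

        ```python
        # Sketch of the BioDCV wrapper cycle: stage data in, run, stage results out.
        # Assumes the LCG2 client tools and an MPI launcher are on the Worker Node.
        import subprocess

        DATASET_LFN = "lfn:/grid/egrid/biodcv/dataset.db"   # hypothetical LFN

        def run(cmd):
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        # 1. Distribute the microarray dataset from the Storage Element to the WN.
        run(["lcg-cp", "--vo", "egrid", DATASET_LFN, "file:///tmp/dataset.db"])

        # 2. Spawn the BioDCV instances (the real C MPI wrapper does this step).
        run(["mpirun", "-np", "16", "./biodcv", "/tmp/dataset.db", "/tmp/out"])

        # 3. Copy model and diagnostic outputs back to the Storage Element.
        run(["tar", "czf", "/tmp/out.tar.gz", "-C", "/tmp", "out"])
        run(["lcg-cr", "--vo", "egrid",
             "-l", "lfn:/grid/egrid/biodcv/out.tar.gz", "file:///tmp/out.tar.gz"])
        ```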
        Speaker: Silvano Paoli (ITC-irst)
        Slides
      • 3:00 PM
        Application of GRID resources for modeling charge transfer in DNA 15m
        Recently, at the interface of physics, chemistry and biology, a new and rapidly developing research field has emerged, concerned with charge transfer in biomacromolecules. Of special interest to researchers is electron and hole transfer along a chain of base pairs, since the migration of radicals over a DNA molecule plays a crucial role in the processes of mutagenesis and carcinogenesis. Moreover, understanding the mechanism of charge transfer is necessary for the development of a new field concerned with charge transfer in organic conductors and their possible application in computing technology. To use biomolecules as conductors, one should know the charge mobility. We calculate theoretical values of charge mobility on the basis of a quantum-classical model of charge transfer in various synthesized polynucleotides at varying temperature T of the environment. To take temperature fluctuations into account, a random force with specified statistical characteristics (a Langevin force) is added to the classical equations of site motion. (See e.g.: V.D. Lakhno, N.S. Fialko: Hole mobility in a homogeneous nucleotide chain. JETP Letters, 2003, v. 78 (5), pp. 336-338; V.D. Lakhno, N.S. Fialko: Bloch oscillations in a homogeneous nucleotide chain. Pisma v ZhETF, 2004, v. 79 (10), pp. 575-578.)

        As is well known, the results of most biophysical experiments are values of macroscopic physical parameters averaged, in our case, over a great many DNA fragments in a solution. When modeling charge transfer in DNA at finite temperature, calculations should therefore be carried out for a great many realizations in order to find the average values of the macroscopic physical parameters. This formulation of the problem enables parallelization of the program over realizations, as "one processor, one realization". A sequential algorithm is used for each individual realization. Initial site velocities and displacements are set randomly from the requirement of equilibrium distribution at a given temperature; in calculating an individual realization, at each step a random number with specified characteristics is generated for the Langevin term.

        To make the problem of modeling charge transfer in a given DNA sequence at a prescribed temperature suitable for computation on GRID resources, the original program was divided into two parts. The first program calculates one realization for given parameters; at its input it receives files with parameters and initial data. A peculiarity of the task is that we are interested in the dynamics of charge transfer, so the program outputs several dozen MB of results. Using a special script, 100-150 copies of the program are run with the same parameters and random initial data. Upon completion of the computations, the result files are compressed and transmitted to a predefined SE. When an appropriate number of realizations has been calculated, the second program runs once; it calculates average values of the charge probabilities, the site displacements from equilibrium, etc. A special script sends this program for execution on a WN. The WN takes the realization result files from the SE in series of 10 items; for each series the averaging program runs (producing, at its output, data averaged over 10 realizations). If the output file of a given realization is absent or defective, it is ignored and the next output file is taken. The files obtained are then processed by this averaging program again. This makes our results independent of chance failures in the calculation of individual realizations. Using GRID resources in this way, we have carried out calculations of the hole mobility at temperatures in the range from 10 to 300 K for (GG) and (GC) polynucleotide sequences (several thousand realizations).
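        The two-pass averaging over realizations described above can be sketched as follows. This is a minimal illustration, assuming one plain-text output file per realization; the file names and formats are hypothetical:

        ```python
        # Average realization outputs in series of 10, skipping absent or
        # defective files, then average the series means (second pass).
        import glob
        import numpy as np

        def load(path):
            """Return the realization data, or None if absent/defective."""
            try:
                data = np.loadtxt(path)
                return data if data.size else None
            except (OSError, ValueError):
                return None

        files = sorted(glob.glob("realization_*.dat"))
        series_means = []
        for i in range(0, len(files), 10):            # series of 10 realizations
            batch = [d for d in map(load, files[i:i + 10]) if d is not None]
            if batch:
                series_means.append(np.mean(batch, axis=0))

        if series_means:
            np.savetxt("averaged.dat", np.mean(series_means, axis=0))
        ```

        Because defective files are simply skipped, a failed grid job degrades the statistics slightly instead of invalidating the whole run.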
        Speaker: Ms Nadezhda Fialko (research fellow)
        Slides
      • 3:15 PM
        A service to update and replicate biological databases 15m
        One of the main challenges in molecular biology is the management of data and databases. A large fraction of the biological data produced is publicly available on web sites or via FTP. These public databases are internationally known and play a key role in the majority of public and private research, but their exponential growth raises a usage problem: scientists need easy access to the latest update of each database in order to apply bioinformatics or data-mining algorithms. The frequent and regular updating of the databases is a recurrent issue for all host and mirror centres, and also for scientists using the databases locally for confidentiality reasons.

        We propose a solution for updating these distributed databases. It comes as a service embedded in the grid, which uses the grid's mechanisms and performs updates automatically. We therefore developed a set of web services that rely on the grid to manage this task, with the aim of deploying the services under any grid middleware with a minimum of adaptation. This includes a client/server application with a set of rules and a protocol to update a database from a given repository and distribute the update through the grid storage elements, while trying to optimize network bandwidth, transfer sizes and fault tolerance, and finally to offer a transparent, automated service that does not require user intervention. These are the challenges of database updating in a grid environment, and the solution we propose is basically to define two types of storage on the grid storage elements: reference storages, where the update is first performed, and working storage spaces, from which jobs pick up the data. The idea is to replicate the update on the grid from these reference points to the other storage elements. From the service point of view, the grid information system must be able to locate the sites hosting a given database, in order to benefit from dynamic database replication and location. From the user point of view, the location information for each database must be available in order to achieve scalability and find replicas on the grid. This means having metadata for each database that can refer to several physical locations on the grid and carry additional information as well, because a replica is not a single file but a whole database with several files and/or directories (a sketch of such a metadata record is given below).

        This service is being deployed on two French grid infrastructures: RUGBI (based on Globus Toolkit 4) and Auvergrid (based on EGEE). We therefore plan a future deployment of this service on EGEE, especially in the biomed VO. The real issue is that the service needs to be deployed and managed as a grid service, so some people from the VO should be able to deploy and administer it besides the site administrators, a role which is finding its limits in current VO management. The service is meant to be embedded in the grid, not just a pure application laid on top of it. It would eventually be possible to offer this service as an application, but then its use would be neither mandatory nor automated, which is synonymous with losing its benefits and transparency, since the user would need to specify the use of the service in his workflow.

        There are also future plans to optimise the deployment of the databases: for example, splitting databases so as to store each part on a different storage element, or offering several reference storages per database, which would require synchronizing these storages with each other. The service will mature through its deployment on grid middlewares and will surely improve as it is used in production environments.
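        A sketch of what such per-database metadata might contain, under our own assumptions about field names; the record below is illustrative, not the service's actual schema:

        ```python
        # One logical database name mapped to a reference storage and several
        # working replicas, plus the upstream source driving the update protocol.
        from dataclasses import dataclass, field

        @dataclass
        class DatabaseRecord:
            name: str          # logical database name, e.g. "swissprot"
            release: str       # version currently deployed on the grid
            source: str        # upstream repository the update is pulled from
            reference: str     # SE where the update is performed first
            replicas: list = field(default_factory=list)  # working SEs for jobs

        record = DatabaseRecord(
            name="swissprot",
            release="2006_02",
            source="ftp://ftp.example.org/databases/swissprot/",  # illustrative
            reference="se-ref.example.org",
            replicas=["se1.example.org", "se2.example.org"],
        )

        # An update cycle refreshes the reference, bumps the release, then
        # re-replicates to every working storage:
        for se in record.replicas:
            print(f"replicate {record.name} {record.release} "
                  f"from {record.reference} to {se}")
        ```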
        Speaker: Mr Jean Salzemann (IN2P3/CNRS)
        Slides
      • 3:30 PM
        Questions and discussion 30m

        Questions and Discussion

      • 4:00 PM
        COFFEE 30m

        COFFEE

      • 4:30 PM
        Using Grid Computation to Accelerate Structure-based Design Against Influenza A Neuraminidases 15m
        The potential re-emergence of influenza pandemics has been a great threat since the report that the avian influenza A virus (H5N1) has acquired the ability to be transmitted to humans. An increase in transmission incidents suggests the risk of human-to-human transmission, and reports of drug-resistant variants are another potential concern. At present, there are two effective antiviral drugs available, oseltamivir (Tamiflu) and zanamivir (Relenza). Both drugs were discovered through structure-based drug design targeting influenza neuraminidase (NA), a viral enzyme that cleaves terminal sialic acid residues from glycoconjugates. The action of NA is essential for virus proliferation and infectivity; therefore, blocking its active site generates antiviral effects. To minimize non-productive trial-and-error approaches and to accelerate the discovery of novel potent inhibitors, medicinal chemists can take advantage of modeled NA variant structures and structure-based design. A key task in structure-based design is to model complexes of candidate compounds with the structures of receptor binding sites. The computational tools for this work are docking tools, such as AutoDock, which carry out quick conformational searches of small compounds in the binding sites, fast calculation of the binding energies of possible binding poses, prompt selection of probable binding modes, and precise ranking and filtering of good binders. Although docking tools can be run automatically, one must control the dynamic conformation of the macromolecular binding site (rigid or flexible) and the spectrum of the screened small organics (building blocks and/or scaffolds; natural and/or synthetic compounds; diversified and/or "drug-like" filtered libraries). This process is characterized by a computational and storage load which poses a great challenge to the resources a single institute can afford. For example, using AutoDock to evaluate one compound structure for 10 poses within the target enzyme takes about 200 kilobytes of storage and 15 minutes on an average PC; the task of evaluating 1 million compound structures at 100 poses each would therefore cost about 2 terabytes and more than a hundred years.

        To support such computing demands, this project was initiated to develop a service prototype for distributing huge numbers of computational docking requests by taking advantage of the LCG/EGEE Grid infrastructure. From what we have learned from both the high-energy physics experiments and the biomedical community, effective use of the large-scale computing offered by the Grid is very promising but calls for a robust infrastructure and careful preparation. Important points are distributed job handling, data collection and error tracking: in many cases these can be a limitation, due to the need for grid-expert personnel. Our final goal is to deliver an effective service to academic researchers, most of whom are not Grid experts; we therefore adopted a light-weight, easy-to-use framework for distributing docking jobs on the Grid. We expect this decision to benefit future deployment efforts and improve application usability. The DIANE framework was introduced in building the service to handle Grid applications in the master-worker model, the natural computing model for distributing docking jobs on the Grid. With this skeletal parallelism, applications plugged into the framework inherit the intrinsic DIANE features of distributed job handling, such as automatic load balancing and failure recovery. The Python-based implementation also lowers the development effort of controlling application jobs on the Grid. With the composition of JDL and the submission of jobs hidden, users can easily distribute their application jobs on the Grid without having Grid knowledge. In addition, this system can seamlessly merge local guaranteed resources (like a dedicated cluster) with the on-demand power provided by the Grid, allowing researchers to concentrate on setting up their application without facing a heavy entry barrier when moving into production mode, where more resources are needed.

        In a preliminary study, we arranged the work into six tasks: (1) target 3D structure preparation; (2) compound 3D structure preparation and refinement; (3) compound property calculation and filtering; (4) AutoDock runs; (5) probable-hit analysis and selection; and (6) complex optimization and affinity re-calculation. The DIANE framework has been applied to distribute about 75,000 time-consuming AutoDock processes on LCG for screening possible inhibitor candidates against neuraminidases. In addition to the distribution efficiency, the advantages of adopting the DIANE framework for the AutoDock application are also discussed in terms of usability, stability and scalability.
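        The master-worker model itself can be illustrated in a few lines of Python. This is a generic sketch using a local process pool, not the DIANE API; the ligand file names and the AutoDock command line are assumptions, and the binaries/input files are presumed present:

        ```python
        # Master-worker distribution of docking tasks: the master hands one
        # ligand to each free worker and collects results as they complete.
        import multiprocessing as mp
        import subprocess

        COMPOUNDS = [f"compound_{i:06d}" for i in range(100)]  # hypothetical ligands

        def dock(ligand: str) -> str:
            # Each worker runs one AutoDock job (command line is illustrative).
            subprocess.run(["autodock4", "-p", ligand + ".dpf",
                            "-l", ligand + ".dlg"], check=True)
            return ligand + ".dlg"

        if __name__ == "__main__":
            # Locally these are processes; on the Grid, DIANE's workers are WNs.
            with mp.Pool(processes=8) as pool:
                for result in pool.imap_unordered(dock, COMPOUNDS):
                    print("collected", result)   # master-side result collection
        ```

        The point of the pattern is that a slow or failed worker only delays its own task; the master can reschedule it elsewhere, which is exactly the load balancing and failure recovery the framework automates.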
        Speaker: Dr Ying-Ta Wu (Academia Sinica Genomic Research Center)
        Slides
      • 4:45 PM
        In silico docking on EGEE infrastructure: the case of WISDOM 15m
        Advances in combinatorial chemistry have paved the way for synthesizing large numbers of diverse chemical compounds. There are thus millions of chemical compounds available in laboratories, but it is nearly impossible and very expensive to screen such a high number of compounds experimentally by high-throughput screening (HTS). Besides the high cost, the hit rate in HTS is quite low, about 10 to 100 per 100,000 compounds when screened against targets such as enzymes. An alternative is high-throughput virtual screening by molecular docking, a technique which can screen millions of compounds rapidly, reliably and cost-effectively. Screening millions of chemical compounds in silico is a complex process: screening each compound, depending on structural complexity, can take from a few minutes to hours on a standard PC, which means screening all the compounds in a single database can take years. Computation time can be reduced very significantly with a large grid gathering thousands of computers.

        WISDOM (World-wide In Silico Docking On Malaria) is a European initiative to enable the in silico drug discovery pipeline on a grid infrastructure. Initiated and implemented by the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) in Germany and the Corpuscular Physics Laboratory (CNRS/IN2P3) of Clermont-Ferrand in France, WISDOM has deployed a large-scale docking experiment on the EGEE infrastructure. Three goals motivated this first experiment. The biological goal was to propose new inhibitors for a family of proteins produced by Plasmodium falciparum. The biomedical informatics goal was the deployment of in silico virtual docking on a grid infrastructure. The grid goal was the deployment of a CPU-consuming application generating large data flows, to test the grid operation and services. Relevant information can be found at http://wisdom.eu-egee.fr and http://public.eu-egee.org/files/battles-malaria-grid-wisdom.pdf.

        With the help of the grid, large-scale in silico experimentation is possible. Large resources are needed in order to test, in a transparent way, a family of targets, a sufficiently large set of possible drug candidates, and different virtual screening tools with different parameter and scoring settings. The grid's added value lies not only in the computing resources made available, but also in the permanent storage of the data with transparent and secure access. A reliable Workload Management System, Information Service and Data Management Services are absolutely necessary for a large-scale process; accounting, security and license management services are also essential to reach the pharmaceutical community. In the near future, we expect improved data-management middleware services to allow automatic updates of the compound database and the design of a grid knowledge space where biologists can analyze output data. Finally, key issues in promoting the grid to the pharmaceutical community include cost and time reduction in drug discovery and development, security and data protection, fault-tolerant and robust services and infrastructure, and transparent, easy-to-use interfaces.

        The first biomedical data challenge ran on the EGEE grid production service from 11 July 2005 until 19 August 2005. The challenge saw over 46 million ligands docked, the equivalent of 80 years of work on a single PC, in about 6 weeks; usually, in silico docking carried out on classical computer clusters results in around 100,000 docked ligands. This type of scientific challenge would not be possible without the grid infrastructure: 1700 computers were used simultaneously in 15 countries around the world. The WISDOM data challenge demonstrated how grid computing can help drug discovery research by speeding up the whole process and reducing the cost of developing new drugs to treat diseases such as malaria. The sheer amount of data generated indicates the potential benefits of grid computing for drug discovery and, indeed, other life science applications. Commercial software with a server license was successfully deployed on more than 1000 machines at the same time. The first docking results show that 10% of the compounds of the studied database may be hits. Top-scoring compounds possess basic chemical groups such as thiourea and guanidino groups or an amino-acrolein core structure; the identified compounds are non-peptidic, low-molecular-weight compounds. The future plans of the WISDOM initiative are first to process the hits again with molecular dynamics simulations. A WISDOM demonstration will be designed with the aim of showing the large-scale submission of docking jobs on the grid. A second data challenge, planned for the fall of 2006, is also under preparation, to improve the quality of service and the quality of usage of the data challenge process on gLite.
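        As a rough consistency check of these figures (our back-of-the-envelope estimate, not a number from the abstract), the quoted throughput corresponds to an effective speed-up of

        ```latex
        \text{effective speed-up} \;=\; \frac{80\ \text{CPU-years}}{6\ \text{weeks}}
        \;\approx\; \frac{80 \times 52}{6} \;\approx\; 690
        ```

        which, set against the 1700 machines used simultaneously, suggests an average utilization of roughly 40%, the remainder plausibly going to queueing, scheduling and data-transfer overheads.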
        Speaker: Mr Nicolas Jacq (CNRS/IN2P3)
        Slides
      • 5:00 PM
        Early Diagnosis of Alzheimer’s Disease Using a Grid Implementation of Statistical Parametric Mapping Analysis 15m
        A voxel-based statistical analysis of perfusional medical images may provide powerful support to the early diagnosis of Alzheimer's Disease (AD). A Statistical Parametric Mapping (SPM) algorithm, based on the comparison of the candidate with normal cases, has been validated by the neurological research community to quantify hypometabolic patterns in brain PET/SPECT studies. Since suitable "normal patient" PET/SPECT images are rare and usually sparse and scattered across hospitals and research institutions, the Data Grid distributed-analysis paradigm ("move code rather than input data") is well suited for implementing a remote statistical analysis use case, described as follows.
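        The use case itself is not included in the abstract as published. As a hedged illustration of the kind of voxel-wise comparison against a normal group that SPM-style analyses perform (our sketch with dummy data, not the authors' pipeline):

        ```python
        # Voxel-wise z-map: compare a candidate volume against a set of normal
        # volumes, all assumed already registered to a common space.
        import numpy as np

        normals = np.random.rand(20, 64, 64, 32)  # 20 "normal" volumes (dummy data)
        patient = np.random.rand(64, 64, 32)      # candidate volume, same space

        mu = normals.mean(axis=0)
        sigma = normals.std(axis=0, ddof=1)
        z = (patient - mu) / np.where(sigma > 0, sigma, np.inf)  # voxel-wise z-scores

        hypo = z < -2.0   # voxels markedly hypometabolic relative to the normals
        print("suspect voxels:", int(hypo.sum()))
        ```

        In the grid setting, this comparison code travels to the sites holding the scattered normal images rather than the images being centralized.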
        Speaker: Mrs Livia Torterolo (Bio-Lab, DIST, University of Genoa)
        Slides
      • 5:15 PM
        SIMRI@Web: An MRI Simulation Web Portal on the EGEE Grid Architecture 15m
        In this paper, we present a web portal that enables the simulation of MRI images on the grid. The simulations are done using the SIMRI MRI simulator, which is implemented on the grid using MPI. MRI simulations are useful for better understanding MRI physics, for studying MRI sequences (parameterisation), and for validating image processing algorithms. The web portal's client/server architecture is mainly based on a Java thread that screens a database of simulation jobs: the thread submits new jobs to the grid and updates the status of the running jobs, and when a job terminates, it sends the simulated image to the user. Through the client web interface, the user can submit new simulation jobs, get a detailed status of the running jobs, and view the history of all terminated jobs together with their status and corresponding simulated images. As MRI simulation is computationally very expensive, grid technologies appear to be a real added value for the MRI simulation task. Nevertheless, grid access should be simplified to enable end users to run MRI simulations; that is why we developed this specific web portal, offering a user-friendly interface for MRI simulation on the grid.
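        The job-watcher thread described above might be sketched as follows, in Python rather than the portal's Java, with the grid back-end stubbed out; the table layout and helper names are hypothetical:

        ```python
        # Poll a job table: submit NEW jobs, refresh SUBMITTED ones, deliver
        # the simulated image when a job reaches Done.
        import sqlite3
        import time

        # Stand-ins for the grid back-end (in the real portal these would wrap
        # the EGEE submission/status/output machinery); stubbed to stay runnable.
        def submit_to_grid(job_id):  return f"grid-{job_id}"
        def grid_status(grid_id):    return "Done"
        def fetch_output(grid_id):   print("delivering image for", grid_id)

        def watch(db_path="jobs.db", period=60):
            db = sqlite3.connect(db_path)
            db.execute("CREATE TABLE IF NOT EXISTS jobs "
                       "(id INTEGER PRIMARY KEY, grid_id TEXT, status TEXT)")
            while True:
                # 1. Submit jobs newly entered through the web interface.
                for (job_id,) in db.execute(
                        "SELECT id FROM jobs WHERE status='NEW'").fetchall():
                    db.execute("UPDATE jobs SET status='SUBMITTED', grid_id=? "
                               "WHERE id=?", (submit_to_grid(job_id), job_id))
                # 2. Refresh running jobs; deliver finished images to users.
                for job_id, grid_id in db.execute(
                        "SELECT id, grid_id FROM jobs WHERE status='SUBMITTED'"
                        ).fetchall():
                    if grid_status(grid_id) == "Done":
                        fetch_output(grid_id)
                        db.execute("UPDATE jobs SET status='DONE' WHERE id=?",
                                   (job_id,))
                db.commit()
                time.sleep(period)
        ```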
        Speaker: Prof. Hugues BENOIT-CATTIN (CREATIS - UMR CNRS 5515 - U630 Inserm)
        Slides
      • 5:30 PM
        Application of the Grid to Pharmacokinetic Modelling of Contrast Agents in Abdominal Imaging 15m
        The liver is the largest organ of the abdomen, and a large number of lesions affect it: both benign and malignant tumours arise within it, and it is also the target organ for metastases of most solid tumours. Angiogenesis is an important marker of tumour aggressiveness and response to therapy. The blood supply of the liver is derived jointly from the hepatic arteries and the portal venous system. Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is extensively used for the detection of primary and metastatic hepatic tumours. However, the assessment of early stages of malignancy, and of other diseases such as cirrhosis, requires the quantitative evaluation of the hepatic arterial supply. To achieve this goal, it is important to develop precise pharmacokinetic approaches to the analysis of hepatic perfusion. The influence of breathing, the large number of pharmacokinetic parameters and the fast variation of contrast concentration in the first moments after contrast injection reduce the efficiency of traditional approaches. On the other hand, traditional radiological analysis requires the acquisition of images covering the whole liver, which greatly reduces the time resolution of the pharmacokinetic curves. The combination of all these adverse factors makes the analytical study of liver DCE-MRI data very challenging. The final objective of the work presented here is to provide users with a tool to optimally select the parameters that describe the pharmacokinetic model of the liver. This tool will use the Grid as a source of computing power and will offer a simple, user-friendly interface. The tool enables the execution of large sets of co-registration actions, varying the values of the different parameters, and eases the process of transferring the source data and the results (a sketch of such a sweep is given below). Since the Grid concept is mainly batch-oriented (and the co-registration is not an interactive process, due to its long duration), the tool must provide a simple way to monitor the status of the processing. Finally, the process must be completed in the shortest possible time, considering the resources available.
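        A minimal sketch of the kind of parameter sweep the tool automates; the parameter names and values are invented for illustration, and in practice each combination would become one grid job:

        ```python
        # Enumerate all parameter combinations for the co-registration sweep;
        # each combination corresponds to one batch job on the Grid.
        from itertools import product

        params = {
            "smoothing_mm": [2, 4, 8],
            "iterations":   [50, 100],
            "metric":       ["mutual_information", "correlation"],
        }

        jobs = []
        for combo in product(*params.values()):
            settings = dict(zip(params.keys(), combo))
            jobs.append(settings)   # in practice: write a job description, submit

        print(f"{len(jobs)} co-registration jobs to submit")  # 3 * 2 * 2 = 12 here
        ```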
        Speaker: Dr Ignacio Blanquer (Universidad Politécnica de Valencia)
        Slides
      • 5:45 PM
        Construction of a Mathematical Model of a Cell as a Challenge for Science in the 21st Century and the EGEE Project 15m
        As recently as a few years ago, the possibility of constructing a mathematical model of life seemed absolutely fantastic. However, at the beginning of the 21st century, several research teams announced the creation of a minimal model of life: to be more specific, not life in general, but an elementary brick of life, the living cell. The most well known of these are the US Virtual Cell (V-Cell) project, NIH (http://www.nrcam.uchc.edu/vcellR3/login/login.jsp); the Japanese E-Cell project (http://ecell.sourceforge.net/); and the Dutch ViC (Virtual Cell) project (http://www.bio.vu.nl/hwconf/Silicon/index.html). These projects deal mainly with the kinetics of cell processes. New approaches to modeling imply the development of imitation models to simulate the functioning of cell mechanisms, and the devising of software to simulate complexes of interrelated and interdependent processes (such as gene networks). With the emergence of the opportunity to use the GRID infrastructure for solving such problems, new and bright prospects have opened up. The aim of the Mathematical Cell project (http://www.mathcell.ru), carried out at the Joint Center for Computational Biology and Bioinformatics (www.jcbi.ru) of the IMPB RAS, is to develop an integrated model of a more complex object than the prokaryotic cell: the eukaryotic cell. The functioning of a cell is simulated on the premise that cell life is mainly determined by the processes of charge transfer in all of its constituent elements. Since (much as, in physics, the universe is thought to have arisen as a result of a Big Bang) life originated from a DNA molecule, modeling should be started from the DNA. The MathCell model repository includes software to calculate charge transfer in an arbitrary nucleotide sequence of a DNA molecule. A sequence to be analyzed may be specified by the user or taken from the databanks presented at the site of the Joint Center for Computational Biology and Bioinformatics (http://www.jcbi.ru). Presently, the MathCell site demonstrates a simple model of charge transfer. In the framework of the EGEE GRID project, any user registered and certified in the EGEE infrastructure can use both the program and the computational resources offered by EGEE. In the near future, IMPB RAS is planning to deploy in EGEE a software tool to calculate charge transfer on the inner membranes of some compartments of eukaryotic cells (mitochondria and chloroplasts), through direct simulation of charge transfer with regard to the detailed structure of biomembranes containing various molecular complexes. Next on the agenda is a software tool to calculate metabolic reaction pathways in the compartments of a cell, as well as the dynamics of gene networks. Further development of the MathCell project implies the integration of the individual components of the model into an integrated program system which would enable the modeling of cell processes at all levels, from microscopic to macroscopic scales, and from picoseconds to time scales comparable with the cell lifetime. Such modeling will naturally require combining the computational and communication resources provided by the EGEE project and merging them into an integrated computational medium.
        Speaker: Prof. Victor Lakhno (IMPB RAS, Russia)
        Slides
      • 6:00 PM
        Wind-up questions and discussion 30m