Oct 10 – 14, 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Using HEP Computing Tools, Grid and Supercomputers for Genome Sequencing Studies

Oct 11, 2016, 2:15 PM
15m
GG C2 (San Francisco Marriott Marquis)

Oral Track 3: Distributed Computing

Speakers

Alexei Klimentov (Brookhaven National Laboratory (US))
Ruslan Mashinistov (National Research Centre Kurchatov Institute (RU))

Description

PanDA, the Production and Distributed Analysis workload management system (WMS), was developed to address the data processing and analysis challenges of the ATLAS experiment at the LHC. Recently PanDA has been extended to run HEP scientific applications on Leadership Class Facilities and supercomputers. The success of the projects that use PanDA beyond HEP and the Grid has drawn attention from other compute-intensive sciences such as bioinformatics.

Modern biology uses complex algorithms and sophisticated software that cannot be run without access to significant computing resources. Recent advances in Next Generation Genome Sequencing (NGS) technology have led to increasing streams of sequencing data that need to be processed, analysed and made available to bioinformaticians worldwide. Analysis of ancient genome sequencing data with the popular software pipeline PALEOMIX can take a month even when it is run on a powerful computing resource. PALEOMIX includes a typical set of software used to process NGS data, covering adapter trimming, read filtering, sequence alignment, genotyping and phylogenetic or metagenomic analysis. A sophisticated WMS and efficient usage of supercomputers can greatly speed up this process.

In this paper we describe the adaptation of the PALEOMIX pipeline to run on a distributed computing environment powered by PanDA. We used PanDA to manage computational tasks on a multi-node parallel supercomputer. To run the pipeline, we split the input files into chunks, which are processed separately on different nodes as separate inputs for PALEOMIX, and finally merge the output files. This is very similar to what ATLAS does to process and simulate data. We dramatically decreased the total walltime thanks to the automation of job (re)submission and brokering within PanDA, as was demonstrated earlier for ATLAS applications on the Grid. Using software tools developed initially for HEP and the Grid can reduce payload execution time for mammoth DNA samples from weeks to days.
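The split, process and merge pattern described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and uses neither the PanDA nor the PALEOMIX API: the function names (`split_fastq`, `process_chunk`, `scatter_gather`), the assumption that the input is FASTQ with four lines per read, and the toy length filter standing in for a pipeline job are all hypothetical. A local thread pool stands in for the supercomputer nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

FASTQ_LINES_PER_READ = 4  # header, sequence, '+', qualities (assumed input format)


def split_fastq(lines, reads_per_chunk):
    """Yield lists of lines, each holding at most reads_per_chunk whole reads."""
    it = iter(lines)
    size = reads_per_chunk * FASTQ_LINES_PER_READ
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Hypothetical stand-in for one pipeline job: keep reads of length >= 4."""
    kept = []
    for i in range(0, len(chunk), FASTQ_LINES_PER_READ):
        read = chunk[i:i + FASTQ_LINES_PER_READ]
        if len(read) == FASTQ_LINES_PER_READ and len(read[1]) >= 4:
            kept.extend(read)
    return kept


def scatter_gather(lines, reads_per_chunk, workers=4):
    """Split the input, process chunks independently, merge outputs in order."""
    chunks = list(split_fastq(lines, reads_per_chunk))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the submitted chunks.
        results = pool.map(process_chunk, chunks)
        return [line for part in results for line in part]


if __name__ == "__main__":
    reads = [
        "@r1", "ACGTAC", "+", "IIIIII",
        "@r2", "AC", "+", "II",
        "@r3", "GGGTTT", "+", "IIIIII",
    ]
    print(scatter_gather(reads, reads_per_chunk=2))
```

Because each chunk contains only whole reads and is processed independently, the merged result does not depend on the chunk size. That independence is what allows a workload manager to broker and resubmit chunks as separate jobs.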

Primary Keyword (Mandatory): Data processing workflows and frameworks/pipelines
Secondary Keyword (Optional): Experience/plans from outside experimental HEP/NP
Tertiary Keyword (Optional): High performance computing

Primary authors

Alexei Klimentov (Brookhaven National Laboratory (US))
Alexey Poyda (National Research Centre Kurchatov Institute (RU))
Ruslan Mashinistov (National Research Centre Kurchatov Institute (RU))

Co-authors

Alexander Novikov (National Research Centre Kurchatov Institute (RU))
Anton Teslyuk (National Research Centre Kurchatov Institute (RU))
Artem Nedoluzhko (National Research Centre Kurchatov Institute (RU))
Eygene Ryabinkin (National Research Centre Kurchatov Institute (RU))
Fedor Sharko (National Research Centre Kurchatov Institute (RU))
Ivan Tertychnyy (National Research Centre Kurchatov Institute (RU))
Kaushik De (University of Texas at Arlington (US))
Tadashi Maeno (Brookhaven National Laboratory (US))
Torre Wenaus (Brookhaven National Laboratory (US))