1–5 Sept 2014
Faculty of Civil Engineering
Europe/Prague timezone

Next Generation Workload Management System for Big Data on Heterogeneous Distributed Computing

2 Sept 2014, 11:15
35m

Faculty of Civil Engineering, Czech Technical University in Prague Thakurova 7/2077 Prague 166 29 Czech Republic
Plenary Computing Technology for Physics Research Plenary

Speaker

Dr Alexei Klimentov (Brookhaven National Laboratory (US))

Description

The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited with the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment relies on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses the PanDA (Production and Data Analysis) Workload Management System to manage the workflow for all data processing at hundreds of data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data centers are physically scattered all over the world. The scale is demonstrated by the following numbers: PanDA manages O(10^2) sites, O(10^5) cores, O(10^8) jobs per year, and O(10^3) users, and the ATLAS data volume is O(10^17) bytes. In 2013 we started an ambitious program to expand PanDA to all available computing resources, including opportunistic use of commercial and academic clouds and Leadership Computing Facilities (LCF). The project, titled ‘Next Generation Workload Management and Analysis System for Big Data’ (BigPanDA), is funded by DOE ASCR and DOE HEP. Extending PanDA to clouds and LCF presents new challenges in managing heterogeneity and supporting workflow. The BigPanDA project is underway to set up and tailor PanDA at the Oak Ridge Leadership Computing Facility (OLCF) and at the National Research Center "Kurchatov Institute", together with ALICE Distributed Computing and ORNL computing professionals. Our approach to integrating the HPC platforms at OLCF and elsewhere is to reuse, as much as possible, existing components of the PanDA system.
The next generation of PanDA will allow many data-intensive sciences employing a variety of computing platforms to benefit from the ATLAS and ALICE experience and proven tools in highly scalable processing. We will present our current accomplishments in running the PanDA WMS at OLCF and other supercomputers, and demonstrate our ability to use PanDA as a portal, independent of the computing facilities' infrastructure, for High Energy and Nuclear Physics as well as other data-intensive science applications.

Author

Dr Alexei Klimentov (Brookhaven National Laboratory (US))

Co-authors

Alexandre Vaniachine (ATLAS)
Danila Oleynik (Joint Inst. for Nuclear Research (RU))
Dr Jack Wells (ORNL)
Jeff Porter (Lawrence Berkeley National Lab. (US))
Kaushik De (University of Texas at Arlington (US))
Kenneth Read (Oak Ridge National Laboratory (US))
Paul Nilsson (University of Texas at Arlington (US))
Predrag Buncic (CERN)
Dr Richard Philip Mount (SLAC National Accelerator Laboratory (US))
Sergey Panitkin (Brookhaven National Laboratory (US))
Prof. Shantenu Jha (Rutgers University)
Tadashi Maeno (Brookhaven National Laboratory (US))
Dr Torre Wenaus (Brookhaven National Laboratory (US))
