21–25 Aug 2017
University of Washington, Seattle
US/Pacific timezone

Present and future of PanDA WMS integration with Titan supercomputer at OLCF

24 Aug 2017, 16:00
45m
The Commons (Alder Hall)

Poster | Track 1: Computing Technology for Physics Research | Poster Session

Speaker

Dr Siarhei Padolski (BNL)

Description

Modern physics experiments collect peta-scale volumes of data and utilize a vast, geographically distributed computing infrastructure that serves thousands of scientists around the world.
Requirements for rapid, near real-time data processing and fast analysis cycles, together with the need to run massive detector simulations in support of data analysis, place a special premium on the efficient use of available computational resources.
A sophisticated Workload Management System (WMS) is needed to coordinate the distribution and processing of data and jobs in such an environment.
In this talk we will discuss the PanDA WMS developed by the ATLAS experiment at the LHC.
Even though PanDA was originally designed for workload management in a Grid environment, it has been successfully extended to include cloud resources and supercomputers.
In particular, we will describe the current state of PanDA integration with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF).
Our approach utilizes a modified PanDA pilot framework for job submission to Titan's batch queues and for data transfers to and from OLCF.
The system employs lightweight MPI wrappers to run multiple independent, single-node payloads in parallel on Titan's multi-core worker nodes.
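As an illustration of this wrapper approach, the following is a minimal sketch using mpi4py, in which each MPI rank launches one independent single-node payload in its own working directory; the script name run_payload.sh, the directory layout, and the use of mpi4py are assumptions made for the sketch, not the actual PanDA pilot code.

    # Minimal sketch of a lightweight MPI wrapper: each rank runs one independent,
    # single-node payload.  run_payload.sh and the per-rank directory layout are
    # illustrative assumptions, not the actual PanDA pilot implementation.
    import os
    import subprocess
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Give every rank its own working directory so the payloads cannot interfere.
    workdir = os.path.join(os.getcwd(), "rank_%05d" % rank)
    os.makedirs(workdir, exist_ok=True)

    # Launch the single-node payload; the wrapper itself performs no inter-rank
    # communication apart from the final barrier.
    ret = subprocess.call(["/bin/bash", "run_payload.sh", str(rank)], cwd=workdir)

    # Wait for all payloads so the batch job terminates cleanly.
    comm.Barrier()
    if rank == 0:
        print("all payloads finished")

Because each payload is self-contained, a wrapper of this kind scales to whatever node count is requested, and a payload failure on one node does not affect the payloads running on the others.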
It also gives PanDA a new capability to collect, in real time, information about unused worker nodes on Titan, which allows the size and duration of jobs submitted to Titan to be matched precisely to the free resources available.
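For illustration, here is a minimal sketch of that sizing step, assuming backfill availability is read from a Moab-style showbf command; the parsed column layout, the shape_job() policy, and the node and walltime caps are assumptions made for the sketch, not the PanDA pilot's actual implementation.

    # Hypothetical sketch: read free backfill slots from `showbf` and choose a
    # node count and walltime that fit inside the largest slot.
    import subprocess

    def to_seconds(hms):
        """Convert an HH:MM:SS walltime string to seconds; 'INFINITY' becomes a large value."""
        if hms.upper().startswith("INFIN"):
            return 10**9
        h, m, s = (int(x) for x in hms.split(":"))
        return h * 3600 + m * 60 + s

    def query_backfill():
        """Return (free_nodes, walltime_string) slots parsed from showbf output."""
        out = subprocess.check_output(["showbf"], text=True)
        slots = []
        for line in out.splitlines():
            fields = line.split()
            # Assumed row shape: partition, tasks, nodes, duration, ...
            if len(fields) >= 4 and fields[2].isdigit():
                slots.append((int(fields[2]), fields[3]))
        return slots

    def shape_job(slots, max_nodes=300, min_walltime_s=1800):
        """Pick the widest slot lasting at least min_walltime_s; cap nodes and walltime."""
        usable = [(n, to_seconds(t)) for n, t in slots if to_seconds(t) >= min_walltime_s]
        if not usable:
            return None
        nodes, seconds = max(usable)
        # The caps below are an arbitrary illustrative policy.
        return min(nodes, max_nodes), min(seconds, 6 * 3600)

    if __name__ == "__main__":
        print(shape_job(query_backfill()))

In practice such a query would be repeated shortly before each submission, so that the requested job shape tracks the backfill availability in near real time.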
The initial implementation of this system already allowed more than 70M core-hours of otherwise unused resources on Titan to be harvested in 2016 and tens of millions of PanDA jobs to be executed.
Based on the experience gained on Titan, the PanDA development team is exploring designs for next-generation components and services for workload management on HPC, Cloud, and Grid resources.
In this talk we will give an overview of these new components and discuss their properties and benefits.

Primary authors

Dr Alessio Angius (Rutgers University), Mr Danila Oleynik (Joint Institute for Nuclear Research (RU)), Prof. Matteo Turilli (Rutgers University), Dr Sergey Panitkin (Brookhaven National Lab), Prof. Shantenu Jha (Rutgers University (US))

Co-authors

Dr Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Fernando Harald Barreiro Megino (University of Texas at Arlington), Dr Jack Wells (Oak Ridge National Laboratory), Prof. Kaushik De (University of Texas at Arlington (US)), Dr Paul Nilsson (Brookhaven National Laboratory (US)), Dr Pavlo Svirin (National Academy of Sciences of Ukraine (UA)), Dr Ruslan Mashinistov (Russian Academy of Sciences (RU)), Dr Sarp Oral (Oak Ridge National Lab), Dr Siarhei Padolski (BNL), Dr Tadashi Maeno (Brookhaven National Laboratory (US)), Torre Wenaus (Brookhaven National Laboratory (US))

Presentation materials