9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

PanDA and RADICAL-Pilot Integration: Enabling the Pilot Paradigm on HPC Resources

9 Jul 2018, 12:00
15m
Hall 7 (National Palace of Culture)

Presentation, Track 3 – Distributed computing

Speaker

Pavlo Svirin

Description

PanDA executes millions of ATLAS jobs a month on Grid systems with more than
300k cores. Currently, PanDA is compatible with only a few HPC resources due to
differing edge services and operational policies, does not implement the pilot
paradigm on HPC, and does not dynamically optimize resource allocation among
queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP)
system to overcome these limitations and enable the execution of ATLAS,
Molecular Dynamics and other workflows on HPC resources.

Harvester is a commonality layer that brings coherence to diverse HPC
systems, providing integration with PanDA workflows at job and event level. RP
is a pilot system capable of executing short- and long-running, single- and
many-core tasks on diverse HPC machines, supporting CPUs, GPUs, and multiple
MPI implementations.

We integrated Harvester and RP on Titan at ORNL, prototyping a Next Generation
Executor (NGE) to expose RP capabilities and manage the execution of PanDA
workflows. RP acquires Titan resources via queues and backfill capabilities
and publishes the available resources to NGE. Harvester requests available
resources and submits tasks for execution to NGE. NGE uses RP to execute those
tasks, managing input and output staging, and holding the states of resources
and tasks in a dedicated database.
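
The interaction described above can be illustrated with a minimal Python
sketch. It is a toy model under stated assumptions, not the actual NGE,
Harvester, or RP interface: the class ToyNGE, its method names, the table
layout, and the use of an in-memory SQLite database are all hypothetical.
Task execution is simulated, and input/output staging is omitted.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    cores: int


class ToyNGE:
    """Hypothetical stand-in for the NGE; not the real interface."""

    def __init__(self):
        # An in-memory SQLite database stands in for the dedicated database
        # that holds the states of resources and tasks.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE resources (pilot_id TEXT PRIMARY KEY, free_cores INTEGER)"
        )
        self.db.execute(
            "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, cores INTEGER, state TEXT)"
        )

    def publish_resources(self, pilot_id, cores):
        """Pilot side (RP's role): publish cores acquired on the machine."""
        self.db.execute(
            "INSERT OR REPLACE INTO resources VALUES (?, ?)", (pilot_id, cores)
        )

    def available_cores(self):
        """Submitter side (Harvester's role): ask what is available."""
        row = self.db.execute(
            "SELECT COALESCE(SUM(free_cores), 0) FROM resources"
        ).fetchone()
        return row[0]

    def submit(self, task):
        """Submitter side: hand a task over for execution."""
        self.db.execute(
            "INSERT INTO tasks VALUES (?, ?, 'PENDING')", (task.task_id, task.cores)
        )

    def run_pending(self):
        """One scheduling pass: mark tasks that fit in the free cores as DONE."""
        free = self.available_cores()
        pending = self.db.execute(
            "SELECT task_id, cores FROM tasks WHERE state = 'PENDING' ORDER BY rowid"
        ).fetchall()
        for task_id, cores in pending:
            if cores <= free:
                free -= cores  # cores stay claimed for the rest of this pass
                self.db.execute(
                    "UPDATE tasks SET state = 'DONE' WHERE task_id = ?", (task_id,)
                )

    def state(self, task_id):
        row = self.db.execute(
            "SELECT state FROM tasks WHERE task_id = ?", (task_id,)
        ).fetchone()
        return row[0] if row else None


nge = ToyNGE()
nge.publish_resources("pilot.0000", 16)  # a pilot publishes 16 cores
nge.submit(Task("t1", 8))                # the submitter sends two tasks
nge.submit(Task("t2", 16))
nge.run_pending()
print(nge.state("t1"), nge.state("t2"))
```

In this sketch the second task stays pending because the first one has
already claimed half of the published cores within the same pass; a real
system would release cores as tasks finish and schedule again.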

Primary authors

Andre Merzky
Pavlo Svirin
Prof. Matteo Turilli (Rutgers University)
Danila Oleynik (Joint Institute for Nuclear Research (RU))
Sergey Panitkin (Brookhaven National Laboratory (US))
Kaushik De (University of Texas at Arlington (US))
Shantenu Jha
Alexei Klimentov (Brookhaven National Laboratory (US))
