Modern physics experiments collect peta-scale volumes of data and utilize vast, geographically distributed computing infrastructure that serves thousands of scientists around the world.
Requirements for rapid, near-real-time data processing, fast analysis cycles, and massive detector simulations in support of data analysis place a premium on the efficient use of available computational resources.
A sophisticated Workload Management System (WMS) is needed to coordinate the distribution and processing of data and jobs in such an environment.
In this talk we will discuss the PanDA WMS, developed by the ATLAS experiment at the LHC.
Although PanDA was originally designed for workload management in a Grid environment, it has been successfully extended to include cloud resources and supercomputers.
In particular, we will describe the current state of PanDA's integration with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF).
Our approach utilizes a modified PanDA pilot framework for job submission to Titan's batch queues and for data transfers to and from OLCF.
The system employs lightweight MPI wrappers to run multiple independent single-node payloads in parallel on Titan's multi-core worker nodes.
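The idea behind such a wrapper can be sketched as follows. This is an illustrative example, not the production PanDA pilot code: each MPI rank independently launches its own single-node payload as a subprocess, so N ranks execute N unrelated jobs side by side. The serial fallback when mpi4py is unavailable is an assumption added for portability.

```python
# Minimal sketch of a lightweight MPI wrapper (hypothetical; the actual
# PanDA pilot wrapper differs). Each MPI rank runs one or more independent
# single-node payloads; ranks share no state with each other.
import subprocess


def get_rank_size():
    """Return (rank, size); use mpi4py when available, else run serially."""
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed on the host
        comm = MPI.COMM_WORLD
        return comm.Get_rank(), comm.Get_size()
    except ImportError:
        return 0, 1  # serial fallback for environments without MPI


def run_payloads(payloads):
    """Run the payloads assigned to this rank, round-robin over ranks."""
    rank, size = get_rank_size()
    results = []
    for i in range(rank, len(payloads), size):
        proc = subprocess.run(payloads[i], capture_output=True, text=True)
        results.append((i, proc.returncode))
    return results


if __name__ == "__main__":
    # Two independent single-node payloads (placeholders for real jobs).
    jobs = [["echo", "payload-0"], ["echo", "payload-1"]]
    print(run_payloads(jobs))
```

Launched under `aprun`/`mpirun` with N ranks, the same script distributes the payload list across N worker nodes without any inter-node communication.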
It also gives PanDA a new capability: collecting, in real time, information about unused worker nodes on Titan, which makes it possible to precisely match the size and duration of submitted jobs to the available free resources.
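The sizing step can be sketched as below. The thresholds, safety margin, and the parsing of the scheduler's backfill report are assumptions for illustration, not the production PanDA logic; on Titan the free-resource information comes from the Moab scheduler (e.g. via the `showbf` utility), whose output format is site-specific.

```python
# Illustrative backfill-aware job sizing (hypothetical parameters; the
# production PanDA logic on Titan differs in detail).

def size_job(free_nodes, window_minutes,
             max_nodes=300, min_minutes=30, margin_minutes=15):
    """Fit a pilot job into the free resources reported by the scheduler.

    free_nodes:     currently idle worker nodes in the backfill window
    window_minutes: how long those nodes stay free before the next
                    scheduled large job claims them
    """
    nodes = min(free_nodes, max_nodes)          # cap the request size
    walltime = window_minutes - margin_minutes  # leave a safety margin
    if nodes < 1 or walltime < min_minutes:
        return None  # window too small to be worth a submission cycle
    return {"nodes": nodes, "walltime_min": walltime}


# A large backfill window yields a capped, margin-adjusted request:
print(size_job(free_nodes=1500, window_minutes=120))
# {'nodes': 300, 'walltime_min': 105}
```

Because the request never exceeds the reported window, these jobs slot into gaps left by the regular batch schedule instead of competing with it.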
The initial implementation of this system harvested more than 70 million core-hours of otherwise unused resources on Titan in 2016, executing tens of millions of PanDA jobs.
Based on the experience gained on Titan, the PanDA development team is exploring designs for next-generation components and services for workload management on HPC, cloud, and Grid resources.
We will give an overview of these new components and discuss their properties and benefits.