4–8 Nov 2019
Adelaide Convention Centre
Australia/Adelaide timezone

Large scale fine grain simulation workflows ("Jumbo Jobs") on HPC's by the ATLAS experiment

7 Nov 2019, 11:30
15m
Riverbank R1 (Adelaide Convention Centre)

Oral Track 9 – Exascale Science

Speaker

Doug Benjamin (Argonne National Laboratory (US))

Description

The ATLAS experiment is using large High Performance Computers (HPCs) and fine-grained simulation workflows (Event Service) to produce fully simulated events efficiently. ATLAS has developed a new software component (Harvester) which provides resource provisioning and workload shaping. In order to run effectively on the largest HPC machines, ATLAS developed the Yoda-Droid software to orchestrate the MPI communication between Harvester and the simulation payload running on over 1000 nodes simultaneously. In this way over 130,000 cores can simultaneously produce simulated Monte Carlo events for ATLAS. The PanDA system also had to be changed to produce "jumbo jobs" capable of simulating over 1 million events per submission to the local HPC scheduling systems.
This presentation will describe in detail the changes to PanDA to enable jumbo jobs and the Yoda-Droid software. Scaling and efficiency measurements will be presented. Results from deployment, integration and operation of the new software at the Titan, Cori and Theta HPC machines will be shown.
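As a rough illustration of the fine-grained idea behind a jumbo job, the Python sketch below cuts one large request into small event ranges and hands them to workers one at a time. It is not the Yoda-Droid or PanDA code: the function names, the 100-events-per-range granularity and the round-robin loop (which stands in for MPI message passing) are all assumptions made for this example. Only the scale (about 1 million events, about 1000 workers) is taken from the abstract.

```python
from collections import deque


def split_into_event_ranges(total_events, events_per_range):
    """Cut one large job into small, contiguous (first, last) event ranges."""
    return [
        (start, min(start + events_per_range, total_events) - 1)
        for start in range(0, total_events, events_per_range)
    ]


def dispatch(total_events, events_per_range, n_workers):
    """Hand out ranges one at a time and count events per worker.

    Hypothetical stand-in: a round-robin loop replaces the MPI exchange
    between a coordinating rank and the worker ranks.
    """
    pending = deque(split_into_event_ranges(total_events, events_per_range))
    processed = {worker: 0 for worker in range(n_workers)}
    while pending:
        for worker in range(n_workers):
            if not pending:
                break
            first, last = pending.popleft()
            processed[worker] += last - first + 1
    return processed


if __name__ == "__main__":
    # Scale from the abstract; events_per_range=100 is an assumed value.
    counts = dispatch(total_events=1_000_000, events_per_range=100, n_workers=1000)
    print(sum(counts.values()))
```

Because each range is small, a worker that stops early loses at most one range of work, which is the general motivation for fine-grained dispatch on large machines.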

Consider for promotion: Yes

Primary authors

Doug Benjamin (Argonne National Laboratory (US))
Wen Guan (University of Wisconsin (US))
Tadashi Maeno (Brookhaven National Laboratory (US))
Nicolo Magini (Iowa State University (US))
Paul Nilsson (Brookhaven National Laboratory (US))
Danila Oleynik (Joint Institute for Nuclear Research (RU))
Vakho Tsulaia (Lawrence Berkeley National Lab. (US))
Taylor Childers (Argonne National Laboratory (US))
Martina Javurkova (Albert Ludwigs Universitaet Freiburg (DE))
