19–25 Oct 2024
Europe/Zurich timezone

Checkpoint-Restart for HPC

Not scheduled
18m
Talk Track 7 - Computing Infrastructure Parallel (Track 7)

Speaker

Dr Madan Timalsina (NERSC/LBNL)

Description

This presentation covers the implementation and optimization of checkpoint-restart (C/R) mechanisms in High-Performance Computing (HPC) environments, with a particular focus on Distributed MultiThreaded CheckPointing (DMTCP). We explore the use of DMTCP both inside and outside containerized environments, emphasizing its application on NERSC's Perlmutter supercomputer. The discussion highlights the benefits of C/R techniques for managing complex, long-duration computations and showcases the efficiency and reliability of these methods. The techniques have been thoroughly tested and validated using Geant4, a crucial tool for High Energy and Nuclear Physics. We further examine the integration of HPC containers, such as Shifter and Podman-HPC, which enhance computational task management and ensure consistent performance across environments. Through real-world application examples, we illustrate the advantages of DMTCP in multi-threaded and distributed computing scenarios. Finally, we present our methods and results, demonstrating the impact of C/R on resource utilization, and discuss the future directions of this research and its potential across scientific domains.

Author

Dr Madan Timalsina (NERSC/LBNL)

Co-authors

Dr Johannes Blaschke (NERSC/LBNL), Dr Lisa Gerhardt (NERSC/LBNL), Dr Nicholas Tyler (NERSC/LBNL), Urjoshi Sinha (NERSC/LBNL), William Arndt (NERSC/LBNL)
