
23–28 Oct 2022
Villa Romanazzi Carducci, Bari, Italy
Europe/Rome timezone

Hyperparameter optimization, multi-node distributed training and benchmarking of AI-based HEP workloads using HPC

26 Oct 2022, 11:00
30m
Area Poster (Floor -1) (Villa Romanazzi)

Poster
Track 2: Data Analysis - Algorithms and Tools
Poster session with coffee break

Speaker

Eric Wulff (CERN)

Description

In the European Center of Excellence in Exascale Computing "Research on AI- and Simulation-Based Engineering at Exascale" (CoE RAISE), researchers from science and industry develop novel, scalable Artificial Intelligence technologies towards Exascale. In this work, we leverage European High-Performance Computing (HPC) resources to perform large-scale hyperparameter optimization (HPO), multi-node distributed data-parallel training, and benchmarking, using multiple compute nodes, each equipped with multiple GPUs.
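As a minimal illustration of the kind of resource-efficient search strategy used for large-scale HPO, the pure-Python sketch below implements successive halving, the principle underlying ASHA-style schedulers: many configurations start on a small training budget, and only the best fraction survives to train longer. The toy objective and all names are hypothetical stand-ins for a real training run, not the actual MLPF HPO setup.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def train_step(lr, epochs):
    """Toy objective standing in for a real training run: returns a
    validation loss that improves with more epochs and depends on the
    learning rate. The optimum of this synthetic function is near lr = 0.1."""
    return (lr - 0.1) ** 2 + 1.0 / (1 + epochs)

def successive_halving(n_trials=16, min_epochs=1, eta=2, rounds=4):
    # Sample initial hyperparameter configurations at random.
    trials = [{"lr": random.uniform(0.001, 1.0)} for _ in range(n_trials)]
    epochs = min_epochs
    for _ in range(rounds):
        # Evaluate every surviving trial at the current epoch budget.
        scored = sorted(trials, key=lambda t: train_step(t["lr"], epochs))
        # Keep the best 1/eta fraction and give survivors eta times the budget.
        trials = scored[: max(1, len(scored) // eta)]
        epochs *= eta
    return trials[0]

best = successive_halving()
print(best["lr"])
```

Because poorly performing configurations are dropped early, most of the compute budget is spent on promising candidates, which is what makes such schedulers attractive on HPC systems where many trials can run in parallel.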

Training and HPO of deep-learning-based AI models are often compute-intensive and call for the use of large-scale distributed resources as well as scalable and resource-efficient hyperparameter search algorithms. We evaluate the benefits of HPC for HPO by comparing different search algorithms and approaches and by performing scaling studies. Furthermore, the scaling and benefits of multi-node distributed data-parallel training using Horovod are presented, showing significant speed-ups in model training. In addition, we present results from the development of a containerized benchmark based on an AI model for event reconstruction, which allows us to compare and assess the suitability of different hardware accelerators for training deep neural networks. A graph neural network (GNN) known as MLPF, developed for the task of Machine-Learned Particle-Flow reconstruction in High Energy Physics (HEP), acts as the base model for these studies.
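At the heart of data-parallel training is the allreduce step in which per-worker gradients are averaged across nodes, which frameworks such as Horovod perform over MPI or NCCL. The pure-Python sketch below simulates that averaging for a toy one-parameter least-squares model with two simulated workers; all names and the model are illustrative and do not reflect Horovod's API or the MLPF training code.

```python
def local_gradients(worker_shard, w):
    # Each worker computes gradients on its own shard of the data
    # for a 1-D least-squares model with loss (w*x - y)^2.
    return [2 * (w * x - y) * x for x, y in worker_shard]

def allreduce_mean(per_worker_grads):
    # Average the per-worker mean gradients, as an allreduce would
    # across nodes; here the "workers" are just entries of a list.
    n = len(per_worker_grads)
    return sum(sum(g) / len(g) for g in per_worker_grads) / n

# Two simulated workers, each holding half of a dataset with y = 3*x,
# so the optimal weight is w = 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    grads = [local_gradients(shard, w) for shard in shards]
    w -= 0.01 * allreduce_mean(grads)  # synchronized SGD update
print(round(w, 3))  # converges to 3.0
```

Because every worker applies the same averaged gradient, all replicas stay synchronized while each processes only its own data shard, which is the source of the near-linear speed-up observed when scaling to multiple nodes.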

Further development of AI models in CoE RAISE has the potential to greatly impact the field of High Energy Physics by efficiently processing the very large amounts of data that particle detectors will produce in the coming decades. To do this efficiently, techniques that leverage modern HPC systems, such as multi-node training, large-scale distributed HPO, and standardized benchmarking, will be of great use.

Significance

We present the latest work in hyperparameter optimization (HPO) of MLPF and the first HPO results using a generator-level ground-truth definition for training a machine-learned algorithm for Particle-Flow reconstruction. In addition, we show multi-node scaling of MLPF training for the first time as well as an AI benchmark based on the MLPF training workload.

References

https://indico.cern.ch/event/855454/contributions/4598499/
https://arxiv.org/abs/2203.00330
https://arxiv.org/abs/2101.08578

Experiment context: CMS
