
23–28 Oct 2022
Villa Romanazzi Carducci, Bari, Italy
Europe/Rome timezone

Hyperparameter optimization, multi-node distributed training and benchmarking of AI-based HEP workloads using HPC

26 Oct 2022, 11:00
30m
Area Poster (Floor -1) (Villa Romanazzi)

Poster
Track 2: Data Analysis - Algorithms and Tools
Poster session with coffee break

Speaker

Eric Wulff (CERN)

Description

In the European Center of Excellence in Exascale Computing "Research on AI- and Simulation-Based Engineering at Exascale" (CoE RAISE), researchers from science and industry develop novel, scalable Artificial Intelligence technologies towards Exascale. In this work, we leverage European High-Performance Computing (HPC) resources to perform large-scale hyperparameter optimization (HPO), multi-node distributed data-parallel training, and benchmarking, using multiple compute nodes, each equipped with multiple GPUs.
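As a minimal illustration of the kind of resource-efficient search strategy used for large-scale HPO, the pure-Python sketch below implements successive halving, the principle underlying ASHA-style schedulers: many configurations start on a small training budget, and only the best fraction survives to train longer. The toy objective and all names are hypothetical stand-ins for a real training run, not the actual MLPF HPO setup.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def train_step(lr, epochs):
    """Toy objective standing in for a real training run: returns a
    validation loss that improves with more epochs and depends on the
    learning rate. The optimum of this synthetic function is near lr = 0.1."""
    return (lr - 0.1) ** 2 + 1.0 / (1 + epochs)

def successive_halving(n_trials=16, min_epochs=1, eta=2, rounds=4):
    # Sample initial hyperparameter configurations at random.
    trials = [{"lr": random.uniform(0.001, 1.0)} for _ in range(n_trials)]
    epochs = min_epochs
    for _ in range(rounds):
        # Evaluate every surviving trial at the current epoch budget.
        scored = sorted(trials, key=lambda t: train_step(t["lr"], epochs))
        # Keep the best 1/eta fraction and give survivors eta times the budget.
        trials = scored[: max(1, len(scored) // eta)]
        epochs *= eta
    return trials[0]

best = successive_halving()
print(best["lr"])
```

Because poorly performing configurations are dropped early, most of the compute budget is spent on promising candidates, which is what makes such schedulers attractive on HPC systems where many trials can run in parallel.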

Training and HPO of deep-learning-based AI models are often compute-intensive and call for the use of large-scale distributed resources as well as scalable and resource-efficient hyperparameter search algorithms. We evaluate the benefits of HPC for HPO by comparing different search algorithms and approaches and by performing scaling studies. Furthermore, the scaling and benefits of multi-node distributed data-parallel training using Horovod are presented, showing significant speed-ups in model training. In addition, we present results from the development of a containerized benchmark based on an AI model for event reconstruction, which allows us to compare and assess the suitability of different hardware accelerators for training deep neural networks. A graph neural network (GNN) known as MLPF, developed for the task of Machine-Learned Particle-Flow reconstruction in High Energy Physics (HEP), acts as the base model for these studies.
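At the heart of data-parallel training is the allreduce step in which per-worker gradients are averaged across nodes, which frameworks such as Horovod perform over MPI or NCCL. The pure-Python sketch below simulates that averaging for a toy one-parameter least-squares model with two simulated workers; all names and the model are illustrative and do not reflect Horovod's API or the MLPF training code.

```python
def local_gradients(worker_shard, w):
    # Each worker computes gradients on its own shard of the data
    # for a 1-D least-squares model with loss (w*x - y)^2.
    return [2 * (w * x - y) * x for x, y in worker_shard]

def allreduce_mean(per_worker_grads):
    # Average the per-worker mean gradients, as an allreduce would
    # across nodes; here the "workers" are just entries of a list.
    n = len(per_worker_grads)
    return sum(sum(g) / len(g) for g in per_worker_grads) / n

# Two simulated workers, each holding half of a dataset with y = 3*x,
# so the optimal weight is w = 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    grads = [local_gradients(shard, w) for shard in shards]
    w -= 0.01 * allreduce_mean(grads)  # synchronized SGD update
print(round(w, 3))  # converges to 3.0
```

Because every worker applies the same averaged gradient, all replicas stay synchronized while each processes only its own data shard, which is the source of the near-linear speed-up observed when scaling to multiple nodes.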

Further development of AI models in CoE RAISE has the potential to greatly impact the field of High Energy Physics by efficiently processing the very large amounts of data that particle detectors will produce in the coming decades. To do this efficiently, techniques that leverage modern HPC systems, such as multi-node training, large-scale distributed HPO, and standardized benchmarking, will be of great use.

Significance

We present the latest work in hyperparameter optimization (HPO) of MLPF and the first HPO results using a generator-level ground-truth definition for training a machine-learned algorithm for Particle-Flow reconstruction. In addition, we show multi-node scaling of MLPF training for the first time as well as an AI benchmark based on the MLPF training workload.

References

https://indico.cern.ch/event/855454/contributions/4598499/
https://arxiv.org/abs/2203.00330
https://arxiv.org/abs/2101.08578

Experiment context: CMS
