Speaker
Description
Machine Learning (ML) plays an important role in physics analysis in High Energy Physics. To achieve better physics performance, physicists are training larger and larger models on larger and larger datasets. As a result, many workflow developments focus on distributed training of large ML models, introducing techniques such as model pipeline parallelism. However, not all physics analyses need to train large models. On the contrary, some emerging analysis techniques, such as OmniFold and Neural Simulation-Based Inference (NSBI), need to train thousands of small models to quantify systematic uncertainties, and each model undergoes hyperparameter optimization under physics-performance constraints. Similarly, ML-powered online hardware often favors small, performant models for data compression and intelligent data filtering, so extensive automated model search is crucial for designing intelligent hardware. These use cases present a unique challenge for developing HPC workflows.
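To make the pattern concrete, the minimal sketch below illustrates this many-small-models regime: each model is an independent task that runs its own hyperparameter scan and keeps only trials satisfying a physics-performance constraint. The toy ridge-regression model, the `train_one` and `physics_score` names, and all numerical settings are illustrative assumptions, not the actual analysis code.

```python
# Hypothetical sketch: train O(1000) small models, each with its own
# random hyperparameter search, as independent tasks in a process pool.
# The model, names, and thresholds are placeholders for illustration only.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def physics_score(weights, X, y):
    """Toy stand-in for a physics-performance metric (negative MSE here)."""
    return -float(np.mean((X @ weights - y) ** 2))


def train_one(task_id, n_trials=20, score_threshold=-0.1):
    """Fit one small model (ridge regression as a placeholder), scanning the
    regularisation strength and keeping the best trial that satisfies the
    physics-performance constraint."""
    rng = np.random.default_rng(task_id)
    X = rng.normal(size=(256, 8))
    y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=256)

    best = None
    for _ in range(n_trials):
        lam = 10 ** rng.uniform(-4, 1)          # hyperparameter: L2 strength
        w = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
        score = physics_score(w, X, y)
        if score >= score_threshold and (best is None or score > best[0]):
            best = (score, lam)
    return task_id, best


if __name__ == "__main__":
    # One task per model; in a real analysis each task would correspond to,
    # e.g., one systematic variation in an NSBI or OmniFold ensemble.
    with ProcessPoolExecutor() as pool:
        for task_id, best in pool.map(train_one, range(1000)):
            print(task_id, best)
```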
We will present an HPC-friendly workflow that tackles these challenges simultaneously. The workflow will be applied to a realistic NSBI physics analysis on the Perlmutter platform at NERSC. The design of the workflow and its scaling behavior will be presented in detail.
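As a hedged illustration of how such an ensemble of small training tasks could be spread over many nodes, the sketch below uses a simple MPI round-robin task assignment with mpi4py. This is an assumed pattern for exposition, not the workflow that will be presented, and `train_one` is a placeholder for the per-model training step.

```python
# Hypothetical sketch: distribute the ensemble of small-model training tasks
# across HPC nodes with MPI, one worker per rank. This is an assumed pattern
# for illustration only, not the workflow presented in the contribution.
from mpi4py import MPI


def train_one(task_id):
    """Placeholder for training one small model (e.g. one systematic
    variation in an NSBI or OmniFold ensemble); returns a summary dict."""
    return {"task": task_id, "status": "done"}


comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_models = 1000
# Round-robin task assignment: rank r trains models r, r+size, r+2*size, ...
local_results = [train_one(i) for i in range(rank, n_models, size)]

# Collect all per-model results on rank 0 for bookkeeping and
# downstream uncertainty quantification.
gathered = comm.gather(local_results, root=0)
if rank == 0:
    flat = [r for chunk in gathered for r in chunk]
    print(f"trained {len(flat)} models across {size} MPI ranks")
```

On a Slurm-based system such as Perlmutter, a script of this kind would typically be launched with `srun -n <ranks> python <script>.py`, with the number of ranks chosen to match the available CPUs or GPUs.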
References
https://cds.cern.ch/record/2915316
Significance
Novel ML-based data analysis techniques such as Neural Simulation-Based Inference and OmniFold impose unique computing challenges, and current infrastructures do not scale well at High Performance Computing facilities. Our research addresses both hyperparameter tuning and the distributed training of O(1000) medium-sized ML models.