8–12 Sept 2025
Hamburg, Germany
Europe/Berlin timezone

High-Performance Computing Workflow for Distributed Hyperparameter Search in Medium-Sized Machine Learning Models

Not scheduled
30m

Poster
Track 1: Computing Technology for Physics Research
Poster session with coffee break

Speaker

Xiangyang Ju (Lawrence Berkeley National Lab. (US))

Description

Machine Learning (ML) plays an important role in physics analysis in High Energy Physics. To achieve better physics performance, physicists are training ever larger models on ever larger datasets. Consequently, much workflow development focuses on distributed training of large ML models, with techniques such as model pipeline parallelism. However, not all physics analyses need to train large models. On the contrary, some emerging analysis techniques, such as OmniFold and Neural Simulation-Based Inference (NSBI), need to train thousands of small models to quantify systematic uncertainties. At the same time, each model undergoes hyperparameter optimization subject to physics-performance constraints. Similarly, ML-powered online hardware often favors small, performant models for data compression and intelligent data filtering, so extensive automated model search is crucial for designing intelligent hardware. These use cases present a unique challenge for developing HPC workflows.

We will present an HPC-friendly workflow that simultaneously tackles the aforementioned challenges. The workflow will be applied to a realistic NSBI physics analysis on the Perlmutter platform at NERSC. The workflow design and its scaling will be presented in detail.

References

https://cds.cern.ch/record/2915316

Significance

Novel ML-based data analysis techniques such as Neural Simulation-Based Inference and OmniFold impose unique computing challenges. Current infrastructures do not scale well at High Performance Computing facilities. Our research solves both problems: tuning hyperparameters and the distributed training of O(1000) medium-sized ML models.

Authors

Aishik Ghosh (University of California Irvine (US))
Dennis Bollweg
Tae Hyoun Park (Max Planck Society (DE))
Xiangyang Ju (Lawrence Berkeley National Lab. (US))

Co-authors

Habib Salman (ANL)
Paolo Calafiura (Lawrence Berkeley National Lab. (US))
Walter Hopkins (Argonne National Laboratory (US))
