As the era of the High-Luminosity Large Hadron Collider (HL-LHC) approaches, the GPU-accelerated High-Level Trigger (HLT) of the CMS experiment must reduce the 100 kHz Level-1 readout stream to 5 kHz, a twenty-fold reduction required to stay within archival bandwidth constraints [1], [2]. Meeting this demand requires highly efficient real-time charged-particle tracking.
In the recent release of the ACTS Traccc pipeline, the Kalman-filter Fit kernel has emerged as the dominant latency bottleneck, primarily due to excessive register pressure and the serialization of matrix-inversion operations within GPU warp execution [3]. To address these limitations, we propose two synergistic GPU optimizations that increase kernel throughput.
First, we refactor the Fit kernel into three distinct computational phases (Predict, Update, and Finalize), each delineated by a single __syncthreads() synchronization barrier. This restructuring enables more efficient compiler-driven register allocation, shortens register lifetimes, and reduces register spill traffic, thereby improving kernel throughput.
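The phase structure can be sketched as below; this is an illustrative CUDA skeleton, not the actual Traccc kernel, and the kernel name, buffer layout, and the scalar stand-in for the Kalman gain are all hypothetical. The point is that temporaries used in one phase die at the barrier, so the compiler can reuse their registers in later phases instead of spilling:

```cuda
// Hypothetical sketch of a three-phase Kalman Fit kernel.
// Predication (rather than early return) keeps all threads at the barriers.
__global__ void fit_kernel(const float* meas, float* state, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = (i < n);

    // Phase 1: Predict -- propagate the state; temporaries used here
    // are dead at the barrier, freeing their registers for later phases.
    float pred = 0.f;
    if (active) pred = state[i];  // plus propagation terms in the real kernel
    __syncthreads();

    // Phase 2: Update -- fold the measurement into the predicted state.
    float upd = 0.f;
    if (active) upd = pred + 0.5f * (meas[i] - pred);  // 0.5f stands in for the gain
    __syncthreads();

    // Phase 3: Finalize -- write fitted parameters back to global memory.
    if (active) state[i] = upd;
}
```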
Second, we replace the computationally intensive analytic matrix inversions with a quantization-aware INT8 multilayer perceptron (MLP). This surrogate, with three hidden layers, is trained to approximate the 6×2 Kalman-gain matrix. By leveraging NVIDIA's __dp4a integer dot-product instruction together with extensive compile-time optimization via C++ constexpr, the majority of INT8 operations execute directly in registers, dramatically reducing shared-memory traffic and further cutting overall latency.
Evaluations on Geant4-simulated Open Data Detector events demonstrate substantial performance gains [4]. Kernel refactoring alone yields a throughput improvement of approximately 15%, and the MLP surrogate reproduces the analytic Kalman-gain matrix with a mean-squared error below 8 × 10⁻⁵. On an NVIDIA RTX 2080 Ti GPU, the combined optimizations improve the Fit-kernel reciprocal throughput by 5.22× (from 11.5 ms to 2.2 ms) and reduce the end-to-end pipeline reciprocal throughput from 23.5 ms to 8.23 ms, an overall speed-up of 2.86×. These changes shift the kernel from being memory- and special-function-unit (SFU)-bound to predominantly compute-bound.
In conclusion, combining meticulous kernel refactoring with a lightweight, quantization-aware MLP surrogate significantly accelerates GPU-based track fitting. The resulting pipeline approaches the stringent computational-latency budget imposed by the HL-LHC HLT, marking a critical step toward efficient real-time particle tracking at future high-luminosity collider experiments.