As the era of the High-Luminosity Large Hadron Collider (HL-LHC) approaches, the GPU-accelerated High-Level Trigger (HLT) of the CMS experiment must reduce the 100 kHz Level-1 readout stream to 5 kHz, a twenty-fold reduction required to stay within archival bandwidth constraints [1], [2]. Meeting this demand requires highly efficient real-time charged-particle tracking.
In the recent release of the ACTS Traccc pipeline, the Kalman-filter Fit kernel has emerged as the dominant latency bottleneck, primarily due to excessive register pressure and the serialization of matrix-inversion operations within GPU warp execution [3]. To address these limitations, we propose two synergistic GPU optimizations that increase kernel throughput.
First, we refactor the Fit kernel into three distinct computational phases (Predict, Update, and Finalize), each delineated by a single __syncthreads() synchronization barrier. This restructuring enables more efficient compiler-driven register allocation, shortens register lifetimes, and reduces register spill traffic, thereby improving kernel throughput.
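The phase structure can be sketched as below; this is an illustrative CUDA skeleton, not the actual Traccc kernel, and the kernel name, buffer layout, and the scalar stand-in for the Kalman gain are all hypothetical. The point is that temporaries used in one phase die at the barrier, so the compiler can reuse their registers in later phases instead of spilling:

```cuda
// Hypothetical sketch of a three-phase Kalman Fit kernel.
// Predication (rather than early return) keeps all threads at the barriers.
__global__ void fit_kernel(const float* meas, float* state, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = (i < n);

    // Phase 1: Predict -- propagate the state; temporaries used here
    // are dead at the barrier, freeing their registers for later phases.
    float pred = 0.f;
    if (active) pred = state[i];  // plus propagation terms in the real kernel
    __syncthreads();

    // Phase 2: Update -- fold the measurement into the predicted state.
    float upd = 0.f;
    if (active) upd = pred + 0.5f * (meas[i] - pred);  // 0.5f stands in for the gain
    __syncthreads();

    // Phase 3: Finalize -- write fitted parameters back to global memory.
    if (active) state[i] = upd;
}
```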
Second, we replace the computationally intensive analytic matrix inversions with a quantization-aware INT8 multilayer perceptron (MLP). This surrogate, with three hidden layers, is trained to approximate the 6×2 Kalman-gain matrix. By leveraging NVIDIA's __dp4a integer dot-product instruction together with extensive compile-time optimization via C++ constexpr, the majority of INT8 operations execute directly in registers, dramatically reducing shared-memory traffic and further cutting overall latency.
Evaluations on Geant4-simulated Open Data Detector events demonstrate substantial performance gains [4]. Kernel refactoring alone yields a throughput improvement of approximately 15%, and the MLP surrogate reproduces the analytic Kalman-gain matrix with a mean-squared error below 8 × 10⁻⁵. On an NVIDIA RTX 2080 Ti GPU, the combined optimizations improve the Fit-kernel reciprocal throughput by 5.22× (from 11.5 ms to 2.2 ms) and reduce the end-to-end pipeline reciprocal throughput from 23.5 ms to 8.23 ms, an overall speed-up of 2.86×. These changes shift the kernel from being memory- and special-function-unit (SFU)-bound to predominantly compute-bound.
In conclusion, combining meticulous kernel refactoring with a lightweight, quantization-aware MLP surrogate significantly accelerates GPU-based track fitting. The resulting pipeline approaches the stringent computational-latency budget imposed by the HL-LHC HLT, marking a critical step toward efficient real-time particle tracking at future high-luminosity collider experiments.