Description
Deploying machine learning models in environments with high throughput, low latency, and strict memory constraints is challenging, especially when those environments evolve rapidly and demand simple user control, minimal dependencies, and long-term maintainability. In high-energy physics, and particularly within the trigger systems of the major LHC experiments, similar requirements arise for real-time data processing. While ML models offer significant opportunities, running them for inference under these constraints remains difficult.
SOFIE (System for Optimized Fast Inference code Emit) translates trained ML models into low-latency, high-performance C++ inference code that depends only on BLAS for matrix operations. Recent developments include improved CPU inference that surpasses state-of-the-art ONNX Runtime performance on several LHC models. Through optimized kernels for common ML operations, enhanced memory-reuse mechanisms, and improved dynamic-tensor support, SOFIE delivers high CPU performance while remaining extremely lightweight.
The upcoming High-Luminosity LHC era highlights the growing need for co-processors to accelerate workloads that benefit from massive parallelism, such as ML inference. However, heterogeneous devices introduce complications such as non-uniform memory formats, diverse inference configurations, and costly data movement between host and device.
To support ML inference across heterogeneous architectures, SOFIE can generate C++ code that uses alpaka data buffers for its memory management. By leveraging alpaka's abstraction of heterogeneous programming, the generated code can run on multiple backends with minimal modification. Because the code is architecture-agnostic, integration into existing HEP workflows becomes easier.
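The pattern can be sketched as follows. This is a deliberately simplified stand-in for the alpaka-based approach, not the real alpaka API: the backend tag, buffer type, and copy function are all illustrative. The point is that the generated inference code only ever sees an abstract buffer and explicit host-to-device copies, so switching backend does not change the generated code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical backend tag; real alpaka selects CPU/CUDA/HIP/SYCL
// accelerators via template parameters in a similar spirit.
struct CpuSerialTag {};

// Backend-parameterized device buffer. On a GPU backend the storage
// would live in device memory; on CPU it is plain host memory.
template <typename TBackend, typename T>
struct DeviceBuffer {
    std::vector<T> data;
    explicit DeviceBuffer(std::size_t n) : data(n) {}
};

// Explicit host-to-device copy. A no-op copy on the CPU backend,
// a cudaMemcpy-like transfer on GPU backends.
template <typename TBackend, typename T>
void memcpyToDevice(DeviceBuffer<TBackend, T>& dst, const std::vector<T>& src) {
    dst.data = src;
}

// Backend-agnostic operator, standing in for a generated SOFIE kernel
// that works on abstract buffers rather than raw device pointers.
template <typename TBackend>
void scaleKernel(DeviceBuffer<TBackend, float>& buf, float factor) {
    for (float& v : buf.data) v *= factor;
}
```

In the real alpaka-backed code the same single-source kernel is compiled for whichever backend is enabled, which is what makes the generated inference code portable across devices.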
While SOFIE relies on BLAS libraries for optimized matrix operations, heterogeneous architectures typically provide vendor-specific BLAS implementations with limited portability, which complicates the generation of fully architecture-agnostic code. To address this, we introduce sofieBLAS, a lightweight abstraction layer that exposes a unified BLAS interface and selects the appropriate backend at runtime. This preserves portability while still exploiting vendor-optimized BLAS implementations.
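A minimal sketch of this runtime-dispatch idea, with hypothetical names (the real sofieBLAS interface may differ): a single sgemv-style entry point is backed by a function pointer that could be rebound to a vendor library (cuBLAS, rocBLAS, oneMKL, ...) when one is detected. Here only a portable reference kernel is registered.

```cpp
#include <cstddef>

// Unified signature for y = A x with row-major A of shape m x n.
using SgemvFn = void (*)(std::size_t m, std::size_t n,
                         const float* A, const float* x, float* y);

// Portable reference implementation, used when no vendor BLAS is found.
static void sgemv_reference(std::size_t m, std::size_t n,
                            const float* A, const float* x, float* y) {
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.f;
        for (std::size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

// Runtime-selected backend. Real code would probe for vendor libraries
// (e.g. via dlopen) at startup and overwrite this pointer accordingly;
// the generated inference code only ever calls sofieblas_sgemv.
static SgemvFn sofieblas_sgemv = sgemv_reference;
```

The generated code thus stays identical across platforms; only the pointer binding changes, which is what keeps the emitted C++ architecture-agnostic.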
With ML models in HEP research growing more complex and sophisticated every day, compression techniques are becoming increasingly useful. To address this need, SOFIE is now integrated with PQuant, a library for end-to-end hardware-aware model compression using pruning and quantization. This integration allows SOFIE to generate C++ inference code from quantized ML models.
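To make the quantization idea concrete, here is a generic sketch of symmetric int8 post-training quantization, the kind of representation such a compression pipeline can produce (this illustrates the technique only, not PQuant's actual API): each float tensor is stored as int8 values plus a single float scale, shrinking weights by roughly 4x.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric int8 quantization: real_value ~= q * scale.
struct QuantizedTensor {
    std::vector<std::int8_t> q;
    float scale;
};

// Map floats into [-127, 127] using the tensor's max absolute value.
QuantizedTensor quantize(const std::vector<float>& w) {
    float maxAbs = 0.f;
    for (float v : w) maxAbs = std::max(maxAbs, std::fabs(v));
    QuantizedTensor t;
    t.scale = maxAbs > 0.f ? maxAbs / 127.f : 1.f;
    for (float v : w)
        t.q.push_back(static_cast<std::int8_t>(std::lround(v / t.scale)));
    return t;
}

// Recover approximate float values; integer kernels can also consume
// the int8 data directly and apply the scale once at the end.
std::vector<float> dequantize(const QuantizedTensor& t) {
    std::vector<float> w;
    for (std::int8_t q : t.q) w.push_back(q * t.scale);
    return w;
}
```

The generated inference code can then either dequantize on the fly or run integer arithmetic throughout, trading a small accuracy loss for memory and latency gains.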
We present the recent developments in SOFIE, including the CPU optimizations and the new heterogeneous inference capabilities. We also show benchmarking results for SOFIE's generated code on both CPUs and GPUs, comparing its performance against other ML inference libraries, including PyTorch and ONNX Runtime, on models commonly used in HEP research such as ATLAS GN2, CMS ParticleNet, and diffusion models for fast simulation.