Description
Real-time inference with sub-microsecond latency is critical for the Level-1 trigger systems at the High-Luminosity LHC. We present an end-to-end, open-source framework that spans model optimization, quantization, and FPGA deployment, enabling the translation of high-level neural network or generic dataflow models into resource-efficient FPGA implementations.
Within the workflow, we introduce High-Granularity Quantization (HGQ), a quantization framework that jointly optimizes a model's resource utilization and accuracy through quantization-aware training with differentiable bitwidths, at training speeds close to those of native Keras. The framework supports both conventional matmul-based neural network architectures, ranging from classical dense layers to multi-head attention blocks, and fabric-native architectures that map efficiently to FPGA Look-Up Table (LUT) primitives. Users can use either kind of architecture or combine both in a single model to reach the desired trade-off between accuracy, resource usage, and latency.
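To illustrate why fine-grained bitwidths can save resources, the following minimal sketch quantizes a few weights to a fixed-point grid, once with a single shared bitwidth and once with one bitwidth per weight. It is an illustration only and does not use the HGQ API: the function names, the example weights, and the use of the summed fractional bits as a crude resource proxy are all assumptions made for this example, and no training or gradient computation is shown.

```python
import math

def quantize_fixed(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits (a fixed-point grid)."""
    scale = 2.0 ** frac_bits
    return math.floor(x * scale + 0.5) / scale

def quantize_per_weight(weights, frac_bits):
    """Quantize each weight with its own number of fractional bits."""
    return [quantize_fixed(w, f) for w, f in zip(weights, frac_bits)]

def total_frac_bits(frac_bits):
    """Hypothetical resource proxy: the sum of fractional bits over all weights."""
    return sum(frac_bits)

# Illustrative values only, not taken from any real model.
weights = [0.5, -0.25, 0.8125, 0.3]
uniform_bits = [4, 4, 4, 4]      # one bitwidth shared by every weight
per_weight_bits = [1, 2, 4, 4]   # one bitwidth per weight

q_uniform = quantize_per_weight(weights, uniform_bits)
q_fine = quantize_per_weight(weights, per_weight_bits)

print(q_uniform, total_frac_bits(uniform_bits))
print(q_fine, total_frac_bits(per_weight_bits))
```

In this toy case both settings yield identical quantized weights, but the per-weight assignment spends 11 fractional bits instead of 16. Quantization-aware training with differentiable bitwidths aims to find such assignments automatically during training rather than by hand.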
On the backend, we present da4ml, an HLS compiler that optimizes unrolled static dataflow graphs, such as machine learning models for the Level-1 trigger (L1T), and converts them into RTL firmware in either Verilog or VHDL. Specifically, the framework can optimize constant-matrix-vector multiplication (CMVM) operations into efficient adder graphs, enabling DSP-free implementations for a wide range of models. The package also provides a precise resource surrogate that requires no compilation, as well as bit-exact emulation of the compiled models via a C++-based interpreter, allowing for rapid design space exploration and model validation.
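The idea of replacing a constant-matrix-vector multiplication with shifts and additions can be sketched as follows. This is a naive illustration and not da4ml's algorithm or API: each integer constant is rewritten in canonical signed-digit (CSD) form and every non-zero digit becomes one shifted copy of the input, with no sharing of intermediate sums between terms or rows. The function names and the example matrix are assumptions chosen for this sketch.

```python
def to_csd(n: int):
    """Signed digits (-1, 0, +1) of n, least significant first.

    The result is the canonical signed-digit form: no two adjacent
    digits are non-zero, which minimizes the number of non-zero digits.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)  # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
        digits.append(d)
        n = (n - d) // 2
    return digits

def cmvm_shift_add(matrix, x):
    """Compute matrix @ x for integer inputs using only shifts, adds and subtracts.

    Returns the output vector and the number of two-input adders or
    subtractors a direct, unshared implementation would need.
    """
    outputs, adders = [], 0
    for row in matrix:
        terms = []
        for c, xj in zip(row, x):
            for k, d in enumerate(to_csd(c)):
                if d != 0:
                    terms.append(d * (xj << k))  # a shifted copy of xj, added or subtracted
        outputs.append(sum(terms))
        adders += max(len(terms) - 1, 0)
    return outputs, adders

# Illustrative constants only.
matrix = [[7, -3], [5, 0]]
x = [2, -1]
y, adders = cmvm_shift_add(matrix, x)
print(y, adders)
```

For this example the result matches the direct product, [17, 10], and the unshared implementation needs 4 adders and no multipliers. An optimizing compiler can lower the count further by reusing common partial sums across terms and rows, which this sketch does not attempt.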
To facilitate adoption, the HGQ and da4ml packages are designed with user-friendly APIs that integrate seamlessly with each other. Furthermore, both packages can interface directly with hls4ml, allowing users to leverage the strengths of all three frameworks and to reuse existing workflows without friction.