1–5 Sept 2025
ETH Zurich
Europe/Zurich timezone

Designing and Deploying Low-Latency Neural Networks on FPGAs with HGQ and da4ml

1 Sept 2025, 11:00
1h 30m
COPL Common Room HIT F 23.2

Tutorial

Speaker

Chang Sun (California Institute of Technology (US))

Description

Neural networks with a latency requirement on the order of microseconds are widely used at the CERN Large Hadron Collider, particularly in the low-level trigger system. To satisfy this latency requirement, these neural networks are often deployed on FPGAs.

This tutorial aims to provide a practical, hands-on guide to a software-hardware co-design workflow using the HGQ2 and da4ml libraries. Compared with existing workflows, this approach has been shown to reduce the resource consumption of the resulting hardware designs by up to two orders of magnitude while maintaining the same accuracy. In particular, the following topics are covered:

  1. Setup and Basic Concepts

     - Environment: We will cover installing the HGQ2 and da4ml packages via pip, configuring Keras v3 backends, and understanding the basics of the Numba JIT compilation used in da4ml so as to avoid common pitfalls.

     - HGQ Methodology: The key concepts of HGQ will be introduced, including the use of a surrogate gradient to make bit-widths differentiable and the construction of a differentiable hardware resource estimate that is incorporated into the loss function for efficient model training.

     - da4ml Methodology: An overview of da4ml's two-stage hybrid algorithm will be provided, covering the coarse-grained graph-based reduction and the fine-grained common subexpression elimination used to create multiplier-free designs. We will explain how this process aligns with HGQ's training objective by effectively reducing the number of non-zero digits in the weight matrix.

  2. The Co-Design Workflow

     - Training with HGQ: We will define and train neural networks from scratch in HGQ, covering the basics of configuring fixed-point quantizers and applying HGQ to architectures ranging from simple DNNs to MLP-Mixers. Best practices for defining models that can be converted to FPGA firmware with bit-exactness will be discussed, along with guidance on emulating QKeras behavior in HGQ2 when necessary.

     - Synthesis with hls4ml and da4ml: We will demonstrate how to convert an HGQ-trained model with hls4ml for bit-exact firmware generation, and explain how this is achieved behind the scenes through model-wide symbolic precision propagation. We will also show how to enable and configure da4ml via the distributed_arithmetic strategy in hls4ml.

  3. Analysis and Advanced Techniques

     - RTL Generation: For compatible network architectures, we will explore da4ml's ability to generate fully pipelined Verilog directly from a trained model, and demonstrate how to verify the design's correctness with streamlined Verilator emulation.

     - Performance Review: We will analyze and compare key hardware metrics (initiation interval, latency, Fmax, and resource utilization) from both the hls4ml and standalone RTL workflows and discuss their trade-offs.

     - Tuning Techniques: We will cover more advanced techniques, such as beta scheduling, targeting a specific resource budget in HGQ2 via PID control of beta, automatically logging models on the Pareto front to explore the accuracy-resource trade-off, and debugging common issues such as divergent bit-widths during conversion.
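To make the training step above concrete: the fixed-point quantizers configured in HGQ round each value to a grid determined by its integer and fractional bit allocation, saturating on overflow. The helper below is a minimal illustrative sketch of that arithmetic in plain Python; it is not HGQ2's API, whose quantizers are differentiable Keras layers with learned bit-widths.

```python
def fixed_point_quantize(x, integer_bits, fractional_bits, signed=True):
    """Round x to a fixed-point grid with the given bit allocation.

    Illustrative sketch of what a fixed-point quantizer computes;
    HGQ2's actual quantizers are differentiable layers, not this helper.
    """
    scale = 2 ** fractional_bits                   # LSB = 2^-fractional_bits
    lo = -(2 ** integer_bits) if signed else 0.0   # lowest representable value
    hi = 2 ** integer_bits - 1.0 / scale           # highest representable value
    q = round(x * scale) / scale                   # round to the nearest LSB
    return min(max(q, lo), hi)                     # saturate on overflow
```

For example, with 3 integer and 2 fractional bits, 0.3 rounds to 0.25 and any value above 7.75 saturates; during training, HGQ learns how many of these bits each quantizer actually needs.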
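The link between HGQ's training objective and da4ml's multiplier-free designs can be seen through canonical signed-digit (CSD) arithmetic: multiplying by a constant weight costs roughly one adder per non-zero CSD digit, so reducing non-zero digits during training directly shrinks the hardware. The sketch below computes CSD digits and an adder-cost proxy; it illustrates the idea only and is not da4ml's actual two-stage algorithm, which additionally shares common subexpressions across matrix rows.

```python
def csd_digits(w):
    """Canonical signed-digit form of an integer weight (LSB first).

    Digits are in {-1, 0, +1}; each non-zero digit corresponds to one
    shifted add or subtract when multiplying by w without a multiplier.
    """
    digits = []
    while w != 0:
        if w % 2:
            d = 2 - (w % 4)   # pick +1 or -1 so the next bit becomes zero
            w -= d
        else:
            d = 0
        digits.append(d)
        w //= 2
    return digits

def adder_cost(w):
    """Adders needed for a shift-and-add multiply by constant w (proxy)."""
    nonzero = sum(d != 0 for d in csd_digits(abs(w)))
    return max(nonzero - 1, 0)
```

For instance, 7 = 0b111 needs two adders in plain binary, but its CSD form 8 - 1 needs only one; HGQ's resource-aware training pushes weights toward such sparse-digit values.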
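For the synthesis step, enabling da4ml goes through the distributed_arithmetic strategy in the hls4ml configuration, as mentioned above. The fragment below sketches what such a configuration might look like; the exact keys and precision strings should be checked against the installed hls4ml version, and the commented conversion call is the standard hls4ml entry point rather than anything specific to this tutorial.

```python
# Hedged sketch of an hls4ml model-level configuration enabling the
# distributed_arithmetic strategy (which routes constant-matrix multiplies
# through da4ml). Key names follow hls4ml conventions; verify per version.
config = {
    "Model": {
        "Precision": "fixed<16,6>",            # fallback precision
        "ReuseFactor": 1,                      # fully parallel for low latency
        "Strategy": "distributed_arithmetic",  # enable da4ml-based synthesis
    },
}

# hls_model = hls4ml.converters.convert_from_keras_model(
#     model, hls_config=config, backend="Vitis", output_dir="fw")
# hls_model.compile()  # C simulation, bit-exact against the HGQ2 model
```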
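Finally, the tuning item above mentions hitting a resource budget via PID control of beta, the weight on the differentiable resource estimate in the loss. A generic PID update for this is sketched below; the controller class, its gains, and its update rule are hypothetical illustrations of the idea, not HGQ2's implementation.

```python
class BetaPID:
    """Toy PID controller nudging the resource penalty weight beta so the
    model's estimated resource usage tracks a target budget.

    Hypothetical sketch; gains and update rule are illustrative only.
    """
    def __init__(self, target, kp=1e-6, ki=1e-7, kd=0.0, beta0=1e-5):
        self.target, self.kp, self.ki, self.kd = target, kp, ki, kd
        self.beta, self.integral, self.prev_err = beta0, 0.0, 0.0

    def update(self, resource_estimate):
        err = resource_estimate - self.target  # over budget -> positive error
        self.integral += err                   # accumulate steady-state error
        deriv = err - self.prev_err            # react to the trend
        self.prev_err = err
        # Raise beta when over budget (penalize resources more), lower it
        # when under budget; beta must stay non-negative.
        self.beta = max(
            self.beta + self.kp * err + self.ki * self.integral + self.kd * deriv,
            0.0,
        )
        return self.beta
```

Called once per epoch with the current resource estimate, this steers training toward the budget while the usual task loss preserves accuracy, which is also how the Pareto-front logging mentioned above becomes useful.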

Author

Chang Sun (California Institute of Technology (US))

Co-authors

Jennifer Ngadiuba (FNAL), Maria Spiropulu (California Institute of Technology (US)), Thea Aarrestad (ETH Zurich (CH)), Vladimir Loncar (CERN), Wayne Luk, Zhiqiang (Walkie) Que (Imperial College London)

Presentation materials