1–2 Jul 2025
CERN
Europe/Zurich timezone

Adaptive Floating-Point Quantization for Efficient Neural Networks

2 Jul 2025, 10:30
20m
40/S2-B01 - Salle Bohr (CERN)

Speaker

Nicolo Ghielmetti (CERN)

Description

The rapid growth of deep learning models, particularly Large Language Models (LLMs), whose parameter counts have grown nearly tenfold per year since 2018, has intensified the need for more efficient, power-aware deployment strategies. Quantization is a widely adopted technique for reducing the computational and memory footprint of neural networks by lowering the numerical precision of their weights and activations.
This work investigates a floating-point quantization approach that adaptively reduces the bit widths of weights and activations while preserving model accuracy. A quantization-oriented methodology is presented that analyzes the distribution of tensor values to guide the design of custom floating-point formats. Experimental results on Recurrent Neural Networks show that, with quantization-aware training (QAT), this approach achieves an average 3.5× reduction in bit usage at the cost of only a 0.5% drop in top-1 accuracy.
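
As a rough illustration of the kind of analysis such a methodology might perform, the PyTorch snippet below fake-quantizes a tensor to a custom floating-point format with a chosen number of exponent and mantissa bits, and picks the exponent width from the tensor's observed dynamic range. This is a hedged sketch, not the authors' code: the format model ignores subnormals, reserves no codes for infinities or NaNs, and all function and variable names are invented for this example.

# Sketch only: fake-quantize a tensor to a custom float format with
# exp_bits exponent bits and man_bits mantissa bits (no subnormals,
# no inf/nan codes), and choose exp_bits from the tensor statistics.
import torch

def float_fake_quant(x: torch.Tensor, exp_bits: int, man_bits: int) -> torch.Tensor:
    """Round x to the nearest value representable in the custom format."""
    bias = 2 ** (exp_bits - 1) - 1                 # standard exponent bias
    max_exp = 2 ** (exp_bits - 1)                  # largest unbiased exponent (no reserved codes)
    # per-element exponent, clamped to the representable range
    e = torch.floor(torch.log2(x.abs().clamp_min(1e-38)))
    e = e.clamp(min=-bias + 1, max=max_exp)
    scale = 2.0 ** (e - man_bits)                  # spacing of representable values in each binade
    x_q = torch.round(x / scale) * scale
    # saturate to the largest representable magnitude
    max_val = (2 - 2.0 ** (-man_bits)) * 2.0 ** max_exp
    return x_q.clamp(-max_val, max_val)

def suggest_exp_bits(x: torch.Tensor) -> int:
    """Smallest exponent width whose range covers the tensor's dynamic range."""
    mags = x.abs()
    mags = mags[mags > 0]
    span = torch.log2(mags.max()) - torch.log2(mags.min())   # dynamic range in octaves
    for e in range(2, 9):
        if 2 ** e - 2 >= span:                     # usable exponent codes cover the span
            return e
    return 8

w = torch.randn(256, 256) * 0.05
e = suggest_exp_bits(w)
w_q = float_fake_quant(w, exp_bits=e, man_bits=3)
print(f"exponent bits: {e}, max abs error: {(w - w_q).abs().max().item():.3g}")

More exponent bits widen the covered dynamic range, while more mantissa bits resolve values within each binade more finely; a per-tensor distribution analysis of this kind trades the two off against the overall bit budget.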
Building on this work, a follow-up contribution extended the AMD/Xilinx deployment flow by adding support for arbitrary floating-point formats to the Quantized Neural Network exchange format QONNX, complementing the existing support in the QAT library Brevitas and completing the quantization path toward hardware acceleration with the AMD FPGA neural-network framework FINN.
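
For context on the deployment path this extends, a minimal Brevitas QAT model and its QONNX export might look as follows. This is a sketch under assumptions: the layer choices, bit widths, and file name are invented, the default integer quantizers stand in for the custom floating-point quantizers described above, and the export call follows recent Brevitas releases (check the API of the version you use).

# Sketch only: generic Brevitas QAT model exported to QONNX for
# downstream FPGA tooling; not the contribution's actual code.
import torch
import torch.nn as nn
from brevitas.nn import QuantLinear, QuantReLU
from brevitas.export import export_qonnx

class TinyQuantNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 4-bit weights and activations, trained with quantization in the loop (QAT)
        self.fc1 = QuantLinear(16, 32, bias=True, weight_bit_width=4)
        self.act = QuantReLU(bit_width=4, return_quant_tensor=True)
        self.fc2 = QuantLinear(32, 5, bias=True, weight_bit_width=4)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

model = TinyQuantNet()
# ... quantization-aware training loop would go here ...

model.eval()
# Export to QONNX, the format consumed by downstream FPGA flows such as FINN
export_qonnx(model, args=torch.randn(1, 16), export_path="tiny_quant_net.onnx")

The exported QONNX graph carries the quantization information alongside the network topology, which is what allows FPGA back ends such as FINN to pick it up for hardware acceleration.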
