Description
The rising popularity of large language models (LLMs) has led to a growing demand for efficient model deployment. In this context, the combination of post-training quantization (PTQ) and low-precision floating-point formats such as FP4, FP6, and FP8 has emerged as an important technique: it enables rapid and accurate quantization that can capture outlier values in LLMs without the extensive Quantization-Aware Training (QAT) typically required by fixed-point formats. Nevertheless, a notable challenge remains: the gap between quantized models produced by research frameworks, a standardized representation for those models, and compatibility with downstream tools. In this work, we help bridge this gap by extending the Quantized Open Neural Network Exchange (QONNX) representation format to formally support arbitrary-precision minifloat quantization through the newly introduced FloatQuant operator. We also propose a novel cost function for minifloat quantization, informed by the architecture of floating-point multiply-accumulate (FPMAC) nodes on FPGAs, to guide quantization decisions in a cost-conscious manner. Finally, we expand the QONNX model zoo with a series of example models quantized with FloatQuant to facilitate practical implementation and testing. Our contributions are upstreamed into the QONNX GitHub repositories.
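To make the minifloat quantization discussed above concrete, the sketch below fake-quantizes a tensor to a minifloat grid defined by exponent and mantissa bit widths (e.g. E2M1 for FP4). This is an illustrative NumPy round-to-nearest implementation under assumed conventions (IEEE-style exponent bias, subnormal support, no inf/NaN encodings); it is not the QONNX FloatQuant operator itself, and the function and parameter names are hypothetical.

```python
import numpy as np

def minifloat_quantize(x, exp_bits=2, man_bits=1, exp_bias=None):
    """Round-to-nearest fake quantization onto a minifloat grid.

    Illustrative sketch only; parameter names and conventions are
    assumptions, not the FloatQuant specification.
    """
    if exp_bias is None:
        exp_bias = 2 ** (exp_bits - 1) - 1  # assumed IEEE-style bias

    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    # Largest representable magnitude, assuming all exponent codes
    # encode values (no inf/NaN reserved).
    max_exp = (2 ** exp_bits - 1) - exp_bias
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)
    mag = np.clip(mag, 0.0, max_val)

    # Per-element exponent, clamped at the low end to the subnormal range.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.maximum(exp, 1 - exp_bias)

    # Spacing of representable values at this exponent; round to the grid.
    lsb = 2.0 ** (exp - man_bits)
    return sign * np.round(mag / lsb) * lsb

# Example: FP4 (E2M1) quantization of a small tensor
print(minifloat_quantize([0.07, 0.3, 1.4, 5.2, -2.6], exp_bits=2, man_bits=1))
```

For E2M1 with bias 1, the representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so the example maps 5.2 to 6 and -2.6 to -3, illustrating the coarse grid that the proposed cost function would weigh against FPMAC hardware cost.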