Description
The rising popularity of large language models (LLMs) has led to a growing demand for efficient model deployment. In this context, the combination of post-training quantization (PTQ) and low-precision floating-point formats such as FP4, FP6, and FP8 has emerged as an important technique: it enables rapid and accurate quantization that can capture outlier values in LLMs without the extensive Quantization-Aware Training (QAT) typically required by fixed-point formats. Nevertheless, a notable challenge remains: the gap between quantized models produced by research frameworks, a standardized representation for those models, and compatibility with downstream tools. In this work, we help bridge this gap by extending the Quantized Open Neural Network Exchange (QONNX) representation format to formally support arbitrary-precision minifloat quantization through the newly introduced FloatQuant operator. We also propose a novel cost function for minifloat quantization, informed by the architecture of floating-point multiply-accumulate (FPMAC) nodes on FPGAs, to guide quantization decisions in a cost-conscious manner. Finally, we expand the QONNX model zoo with a series of example models quantized with FloatQuant to facilitate practical implementation and testing. Our contributions are upstreamed into the QONNX GitHub repositories.
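To make the minifloat quantization discussed above concrete, the sketch below fake-quantizes a tensor to a minifloat grid defined by exponent and mantissa bit widths (e.g. E2M1 for FP4). This is an illustrative NumPy round-to-nearest implementation under assumed conventions (IEEE-style exponent bias, subnormal support, no inf/NaN encodings); it is not the QONNX FloatQuant operator itself, and the function and parameter names are hypothetical.

```python
import numpy as np

def minifloat_quantize(x, exp_bits=2, man_bits=1, exp_bias=None):
    """Round-to-nearest fake quantization onto a minifloat grid.

    Illustrative sketch only; parameter names and conventions are
    assumptions, not the FloatQuant specification.
    """
    if exp_bias is None:
        exp_bias = 2 ** (exp_bits - 1) - 1  # assumed IEEE-style bias

    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    # Largest representable magnitude, assuming all exponent codes
    # encode values (no inf/NaN reserved).
    max_exp = (2 ** exp_bits - 1) - exp_bias
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)
    mag = np.clip(mag, 0.0, max_val)

    # Per-element exponent, clamped at the low end to the subnormal range.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.maximum(exp, 1 - exp_bias)

    # Spacing of representable values at this exponent; round to the grid.
    lsb = 2.0 ** (exp - man_bits)
    return sign * np.round(mag / lsb) * lsb

# Example: FP4 (E2M1) quantization of a small tensor
print(minifloat_quantize([0.07, 0.3, 1.4, 5.2, -2.6], exp_bits=2, man_bits=1))
```

For E2M1 with bias 1, the representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so the example maps 5.2 to 6 and -2.6 to -3, illustrating the coarse grid that the proposed cost function would weigh against FPMAC hardware cost.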