Speaker
Vladimir Loncar
(CERN)
Description
We present ultra-low-latency Deep Neural Networks with large convolutional layers on FPGAs using the hls4ml library. Taking benchmark models trained on public datasets, we discuss several options for reducing the model size and, consequently, the FPGA resource consumption: pruning, quantization to fixed precision, and extreme quantization down to binary or ternary precision. We demonstrate how inference latencies of O(10) microseconds can be obtained while maintaining high accuracy.
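As a minimal sketch of the fixed-precision quantization workflow mentioned above, the following converts a trained Keras model to an FPGA firmware project with hls4ml; the model file, FPGA part number, and precision choice are illustrative assumptions, not the settings used in the talk.

```python
import hls4ml
from tensorflow import keras

# Hypothetical trained benchmark model (placeholder filename)
model = keras.models.load_model('benchmark_cnn.h5')

# Generate a baseline hls4ml configuration from the Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

# Quantize the whole model to a compact fixed-point type:
# 16 total bits, 6 integer bits (an illustrative choice)
config['Model']['Precision'] = 'ap_fixed<16,6>'

# Convert to an HLS project targeting a specific FPGA
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls4ml_prj',
    part='xcvu9p-flga2104-2-e',  # example Xilinx part, assumed
)

# Compile a bit-accurate C emulation and check accuracy before synthesis
hls_model.compile()
# y_pred = hls_model.predict(X_test)
```

Narrowing the fixed-point width trades accuracy for FPGA resources, so the emulated predictions are typically compared against the floating-point model before running synthesis.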
Author
Vladimir Loncar
(CERN)