Speaker
Description
Neural networks with latency requirements on the order of $\mu$s, such as those used at the CERN Large Hadron Collider, are typically deployed fully unrolled on FPGAs. A major bottleneck for deploying such neural networks is area utilization, which is driven largely by the constant matrix-vector multiplications (CMVMs) performed in the networks. In this work, we implement an algorithm that reduces the area consumption of such neural networks on FPGAs by performing the CMVMs with distributed arithmetic (DA), and we integrate it into hls4ml, a FOSS library for running real-time neural network inference on FPGAs. The resulting resource usage and latency are compared with those of the original hls4ml implementation on several networks. The results show that the proposed optimization can reduce on-chip resource usage by up to half for realistic quantized neural networks while reducing latency by up to 40\%, all while maintaining bit-accurate output values.
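To illustrate the idea behind DA, the sketch below is a minimal, table-based Python example of a constant matrix-vector multiply: each bit position of the quantized input vector addresses a precomputed partial-sum table, and the results are shift-accumulated, so no multipliers are needed. This is only an illustrative sketch of the general technique under assumed small sizes (LUT of size $2^N$); it is not the actual implementation integrated into hls4ml, which optimizes the CMVM structure rather than materializing such tables.

```python
import numpy as np

def da_cmvm(C, x, n_bits=8):
    """Constant matrix-vector multiply via distributed arithmetic (illustrative sketch).

    C : (M, N) integer constant matrix.
    x : (N,) unsigned integers, each at most n_bits wide.

    Instead of N multiplications per output, each bit slice of the input
    vector selects a precomputed partial sum, which is shifted and accumulated.
    """
    M, N = C.shape
    # Precompute the table: for every N-bit pattern p, the sum of the
    # columns of C selected by the set bits of p. Size 2^N, so this is
    # only practical for small N; real designs partition wide inputs.
    lut = np.zeros((1 << N, M), dtype=np.int64)
    for p in range(1 << N):
        for j in range(N):
            if (p >> j) & 1:
                lut[p] += C[:, j]

    y = np.zeros(M, dtype=np.int64)
    for b in range(n_bits):
        # Bit slice b of the input vector forms the table address.
        pattern = 0
        for j in range(N):
            pattern |= ((int(x[j]) >> b) & 1) << j
        y += lut[pattern] << b  # shift-and-accumulate, no multipliers
    return y

# Quick check against a direct matrix-vector product.
rng = np.random.default_rng(0)
C = rng.integers(-8, 8, size=(3, 4))
x = rng.integers(0, 256, size=4)
assert np.array_equal(da_cmvm(C, x), C @ x)
```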
| Talk's Q&A | During the talk |
| --- | --- |
| Talk duration | 20'+10' |
| Will you be able to present in person? | Yes |