

# Compressing deep NN on FPGAs to ultra-low precision

8 October 2019, CERN

# Efficient NN design



DSPs can be a limiting resource → how to fit my model on an FPGA?

Many zero weights are welcome, but sparse matrix multiplication is not straightforward to implement!

The FPGA can optimize the zeros away, but not in all cases.



#### Ultra-low precision arithmetic

#### Replace 32-bit floating-point multiplications with 1- or 2-bit arithmetic, with limited loss in accuracy:

- 1-bit: binary NN (<u>arXiv:1602.02830</u>)
- 2-bit: ternary NN (<u>arXiv:1605.04711</u>)

N.B. Only the weights and activations are binarized; the gradients used to update the parameters during backpropagation are not.
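A minimal NumPy sketch of this idea, under stated assumptions: the helper names `binarize` and `ternarize`, the threshold value, and the toy gradient are all hypothetical and chosen only for illustration. The forward pass sees quantized weights, while the update is applied in full precision to a latent copy of the weights.

```python
import numpy as np

def binarize(x):
    """Map every entry to +1 or -1 (zero goes to +1 by convention here)."""
    return np.where(x >= 0, 1.0, -1.0)

def ternarize(x, threshold):
    """Map entries to -1, 0 or +1; entries with |x| <= threshold become 0."""
    return np.where(x > threshold, 1.0, np.where(x < -threshold, -1.0, 0.0))

# Full-precision "latent" weights kept during training (toy values).
w = np.array([0.8, -0.05, 0.3, -1.2])

w_bin = binarize(w)        # weights used in a binary forward pass
w_ter = ternarize(w, 0.1)  # weights used in a ternary forward pass (threshold assumed)

# The gradient stays in floating point and updates the latent weights,
# not the quantized ones. Both values below are made up for the example.
grad = np.array([0.5, -1.0, 0.0, 0.2])
lr = 0.1
w_new = w - lr * grad

# After the update the second latent weight crosses zero,
# so its binarized value flips from -1 to +1.
w_bin_new = binarize(w_new)
```

Small full-precision updates accumulate in the latent weights, which is why a quantized weight can change sign only after several steps.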

Extremely attractive from a hardware perspective! BNNs and TNNs are computationally efficient at low power.
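One way to see the hardware appeal: with weights and activations restricted to ±1, a dot product needs no multipliers and reduces to an XNOR followed by a bit count. The sketch below is illustrative only. The encoding (+1 as bit 1, -1 as bit 0) and the helper names `pack_bits` and `binary_dot` are assumptions, not taken from any particular FPGA implementation.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask   # XNOR: bit set where the signs agree
    matches = bin(agree).count("1")     # popcount
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

Here `result` and `reference` agree, so the multiply-accumulate is replaced by purely logical operations.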



### A bit of literature: Xilinx BNN

#### FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Yaman Umuroglu<sup>\*†</sup>, Nicholas J. Fraser<sup>\*‡</sup>, Giulio Gambardella<sup>\*</sup>, Michaela Blott<sup>\*</sup>, Philip Leong<sup>‡</sup>, Magnus Jahre<sup>†</sup> and Kees Vissers<sup>\*</sup> \*Xilinx Research Labs; <sup>†</sup>Norwegian University of Science and Technology; <sup>‡</sup>University of Sydney

yamanu@idi.ntnu.no

• Demonstrated that binarizing dense and Conv2D layers shrinks the memory footprint enough to keep all parameters on-chip, removing the off-chip memory bottleneck even for large networks!

| Neurons/layer | Binary<br>Err. (%) | Float<br>Err. (%) | # Params   | Ops/frame  |
|---------------|--------------------|-------------------|------------|------------|
| 128           | 6.58               | 2.70              | 134,794    | 268,800    |
| 256           | 4.17               | 1.78              | 335,114    | 668,672    |
| 512           | 2.31               | 1.25              | 932,362    | 1,861,632  |
| 1024          | 1.60               | 1.13              | 2,913,290  | 5,820,416  |
| 2048          | 1.32               | 0.97              | 10,020,874 | 20,029,440 |
| 4096          | 1.17               | 0.91              | 36,818,954 | 73,613,312 |

Table 1: Accuracy results - BNN vs NN.
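The parameter and operation counts in Table 1 can be reproduced with a short calculation. The topology used here (784 inputs, three equally sized hidden layers, 10 outputs, biases counted as parameters, two operations per weight) is an inference from the numbers rather than something stated on the slide, and `mlp_counts` is a hypothetical helper.

```python
def mlp_counts(n_hidden, n_in=784, n_out=10, n_hidden_layers=3):
    """Return (parameters, ops per frame) for a fully connected network.

    Assumed layout: n_in -> n_hidden x n_hidden_layers -> n_out.
    Parameters = weights + biases; ops = 2 per weight (multiply and add).
    """
    sizes = [n_in] + [n_hidden] * n_hidden_layers + [n_out]
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases, 2 * weights

for n in (128, 256, 512, 1024, 2048, 4096):
    params, ops = mlp_counts(n)
    print(f"{n:5d} neurons/layer: {params:>10,d} params, {ops:>10,d} ops/frame")
```

Under these assumptions the output matches the "# Params" and "Ops/frame" columns row by row.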

Table 3: Summary of results from FINN 200 MHz prototypes.

| Name    | Thr.put<br>(FPS) | Latency<br>(µs)              | LUT   | BRAM  | $P_{\rm chip}$<br>(W)    | $P_{\rm wall}$<br>(W)                               |
|---------|------------------|------------------------------|-------|-------|--------------------------|-----------------------------------------------------|
| SFC-max | 12361 k          | 0.31                         | 91131 | 4.5   | 7.3                      | 21.2                                                |
| LFC-max | 1561 k           | 2.44                         | 82988 | 396   | 8.8                      | 22.6                                                |
| CNV-max | 21.9 k           | 283                          | 46253 | 186   | 3.6                      | 11.7                                                |
| SFC-fix | 12.2 k           | 240                          | 5155  | 16    | 0.4                      | 8.1                                                 |
| LFC-fix | 12.2 k           | 282                          | 5636  | 114.5 | 0.8                      | 7.9                                                 |
| CNV-fix | 11.6 k           | 550                          | 29274 | 152.5 | 2.3                      | 10                                                  |



# Benchmark models: jet tagging

A multi-class classification task on highly energetic (boosted) **q, g, W, Z, t** initiated jets





#### Benchmark models: MNIST





Average accuracy ~ 0.98; AUC per class > 99%

### Binary/Ternary architectures

(First tests and implementation for MLP)



### hls4ml implementation





- Run Bayesian hyperparameter optimization over: number of neurons, number of layers, batch size, learning rate, choice of optimizer
- Recover performance with a 16x448x224x224x5 model (7 times more neurons)
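As a rough check of the "7 times more neurons" figure, the sketch below compares the optimized architecture with the 16x64x32x32x5 baseline listed in the results table. The helper names are hypothetical. The weight count ignores biases, which is a simplifying assumption.

```python
def parse_arch(spec):
    """Turn a string such as '16x64x32x32x5' into a list of layer sizes."""
    return [int(s) for s in spec.split("x")]

def hidden_neurons(spec):
    """Total number of neurons in the hidden layers (inputs and outputs excluded)."""
    return sum(parse_arch(spec)[1:-1])

def n_weights(spec):
    """Number of weights in a fully connected network (biases ignored)."""
    sizes = parse_arch(spec)
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

baseline = "16x64x32x32x5"
optimized = "16x448x224x224x5"

neuron_ratio = hidden_neurons(optimized) / hidden_neurons(baseline)
weight_ratio = n_weights(optimized) / n_weights(baseline)
print(f"hidden neurons: x{neuron_ratio:.0f}, weights: x{weight_ratio:.1f}")
```

The hidden-neuron ratio comes out at exactly 7, while the number of weights grows considerably faster because each dense layer scales with the product of adjacent layer widths.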





| Architecture                         | AUCs<br>[%] | Average<br>accuracy | Minimum<br>latency<br>[µs] | DSPs<br>[%] | LUTs<br>[%] | FFs<br>[%] | BRAMs<br>[%] |
|--------------------------------------|-------------|---------------------|----------------------------|-------------|-------------|------------|--------------|
| float model<br>(16x64x32x32x5)       | 90 - 96     | 0.75                | 0.060                      | 60          | 7           | 1          | 0            |
| float model<br>(compressed)          | 91 - 96     | 0.75                | 0.090                      | 15          | 1.7         | 0.7        | 0.3          |
| Small BNN<br>(16x64x32x32x5)         | 75 - 89     | 0.62                | 0.040                      | 0           | 0.8         | 0.1        | 0            |
| Optimized BNN<br>(16x448x224x224x5)  | 88 - 94     | 0.72                | 0.210                      | 0           | 15          | 7          | 0            |
| BNN + ReLU<br>(16x128x64x64x5)       | 88 - 93     | 0.70                | 0.140                      | 4           | 6           | 1          | 0            |
| Optimized TNN<br>(16x128x64x64x64x5) | 88 - 94     | 0.72                | 0.110                      | 0           | 6           | 1          | 0            |
| TNN + ReLU<br>(16x64x32x32x5)        | 88 - 92     | 0.68                | 0.060                      | 2           | 2           | 0.2        | 0            |

### Results: MNIST

| Architecture<br>784x128x128x128x10 | Average<br>accuracy | Minimum<br>latency<br>[µs] | DSPs<br>[%] | LUTs<br>[%] | FFs<br>[%] | BRAMs [%] |
|------------------------------------|---------------------|----------------------------|-------------|-------------|------------|-----------|
| float model                        | 0.98                | 0.56                       | 100         | 134         | 23         | 54        |
| Binary model<br>(binary tanh)      | 0.93                | 0.21                       | 0           | 34          | 11         | 16        |
| Ternary model<br>(ternary tanh)    | 0.95                | 0.21                       | 0           | 34          | 11         | 16        |

