# Benchmarking High Level Synthesis for Machine Learning Implementations versus Hand-optimized SystemVerilog

Waiz Khan, Caroline Johnson, Scott Hauck, Shih-Chieh Hsu, Geoff Jones

## Introduction

High Level Synthesis for Machine Learning (HLS4ML) enables rapid prototyping of Machine Learning models into hardware designs.

Performance can be optimized by adjusting parameters such as Compression, Precision, and Resource Reuse factors.

Are HLS4ML's implementations efficient compared to lower-level implementations?


Figure: custom firmware design [1]

Figure (reuse = 1): use 4 multipliers 1 time each [1]
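The reuse caption above can be illustrated with a small sketch. It assumes the usual meaning of the reuse factor in HLS4ML, namely the number of times each physical multiplier is used, so that the multiplier count is the number of multiplications divided by the reuse factor. The function name `multipliers_needed` is hypothetical and chosen only for illustration.

```python
import math


def multipliers_needed(n_multiplications: int, reuse_factor: int) -> int:
    """Physical multipliers required when each one is used `reuse_factor` times.

    Assumption: multiplications are shared evenly across multipliers.
    """
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be at least 1")
    return math.ceil(n_multiplications / reuse_factor)


if __name__ == "__main__":
    # Four multiplications, as in the caption's example.
    for reuse in (1, 2, 4):
        count = multipliers_needed(4, reuse)
        print(f"reuse = {reuse}: use {count} multiplier(s) {reuse} time(s) each")
```

A higher reuse factor trades latency for area: fewer multipliers are instantiated, but each one is busy for more cycles.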

## 2D Conv Layer, Stride 2

### Reuse Factor of 9:

Performance:

| Model        | Min Period (ns) | Latency (cycles) | Latency (ns) | II (cycles) | II (ns)  |
|--------------|-----------------|------------------|--------------|-------------|----------|
| HLS4ML - 256 | 6.2             | 592.5            | 3673.5       | 459.75      | 2850.45  |
| HLS4ML - 128 | 6.9             | 463.5            | 3198.15      | 395.25      | 2727.225 |
| Us           | 6.9             | 219              | 1511.1       | 105.75      | 729.675  |

Resources:

| Model                    | Total LUTs | DFFs   | RAM   | DSP   |
|--------------------------|------------|--------|-------|-------|
| HLS4ML - 256             | 7073       | 11031  | 2     | 16    |
| HLS4ML - 128             | 6502       | 8799   | 3.5   | 24    |
| Us                       | 2318       | 865    | 0.5   | 24    |
| Available On Chip        | 693120     | 86640  | 1470  | 3600  |
| HLS4ML - 256 (% of chip) | 1.02%      | 12.73% | 0.14% | 0.44% |
| HLS4ML - 128 (% of chip) | 0.94%      | 10.16% | 0.24% | 0.67% |
| Us (% of chip)           | 0.33%      | 1.00%  | 0.03% | 0.67% |
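The derived columns in these tables follow from two simple relations: time in nanoseconds is the cycle count multiplied by the minimum clock period, and utilization is the resource count divided by what is available on chip. The sketch below recomputes a few entries as a consistency check; the function names are hypothetical, and only the LUT figures are included to keep it short.

```python
def cycles_to_ns(cycles: float, period_ns: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock period."""
    return cycles * period_ns


def utilization_percent(used: float, available: float) -> float:
    """Share of an on-chip resource that a design occupies, in percent."""
    return 100.0 * used / available


# (min period ns, latency cycles, II cycles), copied from the performance table.
PERFORMANCE = {
    "HLS4ML - 256": (6.2, 592.5, 459.75),
    "HLS4ML - 128": (6.9, 463.5, 395.25),
    "Us": (6.9, 219, 105.75),
}

# Total LUTs per design and on chip, copied from the resources table.
TOTAL_LUTS = {"HLS4ML - 256": 7073, "HLS4ML - 128": 6502, "Us": 2318}
AVAILABLE_LUTS = 693120

if __name__ == "__main__":
    for name, (period, latency_cycles, ii_cycles) in PERFORMANCE.items():
        latency_ns = cycles_to_ns(latency_cycles, period)
        ii_ns = cycles_to_ns(ii_cycles, period)
        luts = utilization_percent(TOTAL_LUTS[name], AVAILABLE_LUTS)
        print(f"{name}: latency {latency_ns:.3f} ns, II {ii_ns:.3f} ns, LUTs {luts:.2f}%")
```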

## Percentage of Max Resource Utilization vs. Iteration Interval



The hand-implemented Conv2D layer (stride of 2) with a reuse factor of 9 achieves better performance than the HLS4ML implementations. Compared with HLS4ML - 128, it has:

- 52.8% lower latency
- 73.2% faster iteration interval
- 64.3% fewer total LUTs used
- 90.2% fewer DFFs used
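These percentages can be reproduced from the table entries for HLS4ML - 128 and the hand-written design. The sketch below is only an arithmetic check; the helper `percent_reduction` and the dictionary keys are hypothetical names.

```python
def percent_reduction(baseline: float, ours: float) -> float:
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return (1.0 - ours / baseline) * 100.0


# Values copied from the performance and resource tables.
HLS4ML_128 = {"latency_ns": 3198.15, "ii_ns": 2727.225, "total_luts": 6502, "dffs": 8799}
HAND_WRITTEN = {"latency_ns": 1511.1, "ii_ns": 729.675, "total_luts": 2318, "dffs": 865}

REDUCTIONS = {
    metric: percent_reduction(HLS4ML_128[metric], HAND_WRITTEN[metric])
    for metric in HLS4ML_128
}

if __name__ == "__main__":
    for metric, value in REDUCTIONS.items():
        print(f"{metric}: {value:.1f}% lower")
```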



### Batch Norm Layer algorithm:

Currently under development; a small-scale model is functional.

| Cycle   | 0-3                   | 4-7                     | 8-11                     |
|---------|-----------------------|-------------------------|--------------------------|
| Batch 1 | Input first batch (4) | Process first batch (4) | Output first batch (4)   |
| Batch 2 |                       | Input second batch (4)  | Process second batch (4) |
| Batch 3 |                       |                         | Input third batch (4)    |
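The overlap in the timing diagram above can be expressed as a simple schedule: each batch enters a new four-cycle stage as soon as the previous batch leaves it. The sketch below is a software model of that schedule only, not the hardware itself. The stage names, in particular the third one, are assumed labels, and the function names are hypothetical.

```python
STAGES = ("input", "process", "output")  # assumed labels for the three stages
CYCLES_PER_STAGE = 4  # each stage takes four cycles for a batch of 4


def stage_window(batch_index: int, stage_index: int) -> tuple:
    """Return (first_cycle, last_cycle) during which a batch occupies a stage."""
    start = (batch_index + stage_index) * CYCLES_PER_STAGE
    return start, start + CYCLES_PER_STAGE - 1


def active_stages(cycle: int, n_batches: int) -> list:
    """List the (batch_index, stage_name) pairs that are busy in a given cycle."""
    busy = []
    for batch in range(n_batches):
        for stage, name in enumerate(STAGES):
            first, last = stage_window(batch, stage)
            if first <= cycle <= last:
                busy.append((batch, name))
    return busy


if __name__ == "__main__":
    for cycle in (0, 5, 9):
        print(cycle, active_stages(cycle, n_batches=3))
```

Once the pipeline is full, all three stages are busy in every cycle, so a new batch completes every four cycles.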



## Conclusion

These results demonstrate that the Conv2D implementation in HLS4ML can be made faster and more resource-efficient. The lower-level implementation required fewer resources and produced a model with lower latency.

The Batch Normalization layer can be implemented efficiently in hardware, but it will require large LUTs to accelerate some parts of the computation.

### Next Steps

- Implement an HLS4ML-inspired SystemVerilog version of Conv2D (stride 2) to improve performance

## References



Batch size of 4, pipelined into three major stages, each taking four cycles.

Values are processed in the following order (each step takes one cycle): Batch Mean, Batch Variance, Normalize Value, Scale & Shift.

Pipelined for efficiency, allowing parallel use of resources.
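The four processing steps can be sketched in floating point as a reference model. This is not the SystemVerilog implementation. The scale `gamma`, shift `beta`, and stabilizer `eps` are assumed parameters that the poster does not specify, and the variance is taken over the batch itself (division by the batch size), which is also an assumption.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Reference batch normalization over one batch of values.

    gamma, beta, and eps are illustrative defaults, not values from the design.
    """
    n = len(batch)
    mean = sum(batch) / n                                # step 1: batch mean
    variance = sum((x - mean) ** 2 for x in batch) / n   # step 2: batch variance
    std = math.sqrt(variance + eps)
    normalized = [(x - mean) / std for x in batch]       # step 3: normalize value
    return [gamma * v + beta for v in normalized]        # step 4: scale & shift


if __name__ == "__main__":
    # Example batch of 4; the input values are chosen for illustration.
    print([round(v, 3) for v in batch_norm([1.0, 2.0, 3.0, 4.0])])
```

In hardware, the division and square root in step 3 are the expensive parts, which fits the conclusion's remark that some parts of the computation need large LUTs.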

Small Scale Model Results: (values are in fixed point, 8 integer bits, 8 fraction bits)

Figure: simulation waveform of the small-scale model (time axis marked at 200, 400, 600, and 800 ns). Hexadecimal values visible in the trace, in order of appearance: 0100, 0200, 0300, 04…; 0a00; 0280; 0240, 0040, 0040, 0240; 0500; 0140.

Output values, converted to decimal:

| Output (hex) | Output (decimal) |
|--------------|------------------|
| fea8         | -1.343           |
| ff8d         | -0.449           |
| 0072         | 0.494            |
| 0157         | 1.340            |
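The hexadecimal values in these results use the fixed-point format stated above (8 integer bits, 8 fraction bits). The sketch below converts between that format and decimal, assuming 16-bit signed two's-complement words, which is consistent with negative outputs such as fea8. The helper names are hypothetical.

```python
FRACTION_BITS = 8   # 8 fraction bits, as stated for the small-scale model
WORD_BITS = 16      # assumed total width: 8 integer bits + 8 fraction bits


def from_q8_8(raw: int) -> float:
    """Decode a 16-bit two's-complement Q8.8 word into a float."""
    if raw & (1 << (WORD_BITS - 1)):
        raw -= 1 << WORD_BITS  # sign-extend negative values
    return raw / (1 << FRACTION_BITS)


def to_q8_8(value: float) -> int:
    """Encode a float as a 16-bit Q8.8 word, rounding to the nearest step."""
    return round(value * (1 << FRACTION_BITS)) & ((1 << WORD_BITS) - 1)


if __name__ == "__main__":
    for word in (0x0100, 0x0280, 0x0140, 0xFEA8, 0x0157):
        print(f"{word:04x} -> {from_q8_8(word)}")
```

For example, `from_q8_8(0xFEA8)` evaluates to -1.34375, which the results list truncated to -1.343, and `from_q8_8(0x0280)` evaluates to 2.5.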

- Implement a scaled-up batch norm layer to compare with HLS4ML

