## **Co-Design for Efficient & Adaptive ML**

CERN Seminar, 2024-06-26

Dr. Yaman Umuroğlu, Senior Member of Technical Staff, AMD Research & Advanced Development



## AMD Research and Advanced Development (RAD)

## Integrated Comms and AI Lab (RADICAL)

- Established 15 years ago
- ~20 researchers plus university program
  - 5 different locations
- Highly active internship program

## Focus: Communications and AI

- Building systems, architectural exploration, algorithmic optimizations, benchmarking
- In collaboration with partners, customers, and universities
  - ETH Zürich, Paderborn University, Imperial College, KIT, NTNU, Politecnico di Milano, NUS, University of Sydney



together we advance\_

[Public]

## **Pervasive AI**

[Figure: ImageNet, ChatGPT, recommenders]



- Low latency (sub-millisecond)
- Combine with signal processing
- Everything in flux (MLPs -> CNNs -> Transformers...)

Pervasive AI needs efficient and adaptive solutions.

## **Specialization** is essential

Efficient & Adaptive ML Inference via **Co-Design** 




- FPGAs: the chameleon amongst the semiconductors...
  - Customize IO interfaces
  - Customize functionality
  - Customize compute architectures & memory subsystems to meet performance or efficiency targets
- Flexible, adaptive, mostly homogeneous hardware architecture
  - Enable post-production customization at the architectural level






- Sea of programmable lookup tables (LUTs), ~millions
- Programmable interconnect
- Programmable IO


## **Specialized FPGA Inference via Co-Design**

Increased specialization, high performance, and efficiency
















## **Running Example: Network Intrusion Detection System (NIDS)**



Minfps: million inferences per second, assuming 64 B/packet.
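The 64 B/packet assumption is what turns a link speed into an inference rate, if every packet triggers one inference. A minimal sketch of that arithmetic follows; the link rates are illustrative assumptions and framing overhead (preamble, inter-frame gap) is ignored.

```python
def required_minfps(link_gbps: float, packet_bytes: int = 64) -> float:
    """Million inferences per second needed to classify every packet on a link."""
    packets_per_second = link_gbps * 1e9 / (packet_bytes * 8)
    return packets_per_second / 1e6


# Hypothetical link rates, chosen only to illustrate the scaling.
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbit/s -> {required_minfps(gbps):.1f} Minfps")
```

Smaller packets raise the required rate, so minimum-size packets are the demanding case for a per-packet classifier.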

## **NIDS Results**

Increased specialization, high performance, and efficiency

### Matrix of Processing Engines

| | Vitis AI |
|---|---|
| Topology / #layers / #OPs | MLP / 3 / 92 kOPs |
| Datatype | 8b & 8b |
| Accuracy | 92.3% |
| **Performance** | |
| Throughput | 22 kinfps |
| Latency (compute only) | 26 µs |
| **Resources** | |
| Compute (kLUTs, DSPs*) | 122, 1124 |
| Memory (BRAM, URAM**) | 290, 92 |
| Clock | 300/600 MHz |

Mapped on UltraScale+, 16nm FPGA, all within the same SLR.

\*DSPs: 8b or 16b multiply-accumulates. \*\*BRAM: 36 kbit, URAM: 288 kbit embedded SRAM blocks.



## **Specialized FPGA Inference via Co-Design**

Increased specialization, high performance, and efficiency




- Hardware architecture mimics the topology
- All weights need to be accessible in parallel, but limited activation buffering needed
- Customize *everything* to the specifics of the DNN
- Benefits
  - Improved efficiency
  - Low fixed latency




Dataflow can scale performance to meet the application requirements














- Scale performance & resources to meet the application requirements
- If resources allow, we can fully unfold the NN to create a circuit that inferences at clock speed
  - Enables extra optimizations for fine-granular quantization and sparsity
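The trade-off between resources and throughput can be sketched with a simple folding model. This is an illustration under stated assumptions, not the tool's cost model: the layer sizes and the `pe`/`simd` parallelism values are hypothetical, and each layer is assumed to need `(rows / pe) * (cols / simd)` cycles per inference, with the slowest layer setting the pipeline rate.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    rows: int  # output neurons (weight matrix height)
    cols: int  # inputs per neuron (weight matrix width)
    pe: int    # neurons computed in parallel
    simd: int  # inputs consumed per neuron per cycle

    def cycles(self) -> int:
        """Cycles this layer needs for one inference under the folding model."""
        if self.rows % self.pe or self.cols % self.simd:
            raise ValueError("pe must divide rows and simd must divide cols")
        return (self.rows // self.pe) * (self.cols // self.simd)


def throughput(layers, clock_hz: float) -> float:
    """Inferences per second: the slowest layer limits the streaming pipeline."""
    return clock_hz / max(layer.cycles() for layer in layers)


# Hypothetical 3-layer MLP, folded so that every layer takes 8 cycles.
folded = [Layer(64, 64, 8, 64), Layer(32, 64, 4, 64), Layer(1, 32, 1, 4)]
# Fully unfolded: one multiplier per synapse, one cycle per layer.
unfolded = [Layer(l.rows, l.cols, l.rows, l.cols) for l in folded]

print(throughput(folded, 200e6))    # 25 million inferences per second
print(throughput(unfolded, 200e6))  # one inference per clock cycle
```

In this model, full unfolding makes every layer a single-cycle stage, which is why the circuit can deliver one inference per clock.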

## **Specialized FPGA Inference via Co-Design**

Increased specialization, high performance, and efficiency





AMD UltraScale+ MPSoC ZU19EG (conservative estimates)

| Precision | Approx. Peak GOPS | On-chip weights |
|-----------|-------------------|-----------------|
| 1b        | 64 000            | ~64 M           |
| 4b        | 16 000            | ~16 M           |
| 8b        | 4 000             | ~8 M            |
| 32b       | 300               | ~2 M            |

Trillions of quantized operations per second. Weights can stay entirely on-chip.
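The on-chip weight counts in the table are consistent with a fixed memory budget divided by the weight precision. A small sketch of that arithmetic follows; the 64 Mbit budget is an assumption inferred from the 1b row, not a quoted device specification.

```python
# Peak throughput per precision, copied from the table (GOPS).
PEAK_GOPS = {1: 64_000, 4: 16_000, 8: 4_000, 32: 300}

# Assumed on-chip weight memory budget, inferred from "~64 M" weights at 1 bit.
ON_CHIP_BITS = 64_000_000


def weights_on_chip(precision_bits: int) -> int:
    """Number of weights that fit on-chip at a given precision."""
    return ON_CHIP_BITS // precision_bits


for bits, gops in PEAK_GOPS.items():
    speedup = gops / PEAK_GOPS[32]
    millions = weights_on_chip(bits) / 1e6
    print(f"{bits:>2}b: {speedup:6.1f}x the 32b peak, ~{millions:.0f} M weights")
```

Going from 32b to 1b thus buys roughly two orders of magnitude in peak compute and a 32x larger model that still fits on-chip.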

## **Granularity of Customizing Arithmetic**

Dataflow architectures can exploit custom arithmetic at a finer granularity: with full unfolding, even per-neuron and per-synapse custom arithmetic is possible.

## **Specialized FPGA Inference via Co-Design**

Increased specialization, high performance, and efficiency

## Sparsity

- DNNs are naturally sparse
  - Zero- or near-zero weights, ReLU activations...
  - Multiplications with zero can be skipped => reduces compute load
- Sparse topologies result in irregular compute & memory access patterns
  - Hard to accelerate on vector- or matrix-based execution units
  - Structured sparsity is better, but limits the benefits
- Fully-unrolled streaming dataflow can also exploit unstructured sparsity
  - Each neuron & synapse has its own hardware




[Figure: Dataflow on FPGA vs. Sparse Dataflow on FPGA]

[Figure: Error vs Compute Cost for different network topologies]



## *FINN* Framework: From DNN to FPGA Deployment






## A Brevitas showcase: Accumulator-Aware Quantization (A2Q)

- Cost of accumulators can be dominant for few-bit quantization
- Can we constrain weights to bound the max accumulator size?
  - Yes! Via Hölder's inequality or zero-centered range analysis
  - See A2Q [7] and A2Q+ [8] for details ☺
- A2Q and A2Q+ implementations are open-sourced as part of Brevitas:

      >>> from brevitas.nn import QuantConv2d
      >>> from brevitas.quant import Int8AccumulatorAwareWeightQuant
      >>> conv = QuantConv2d(4, 4, 3, weight_quant=Int8AccumulatorAwareWeightQuant)
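The Hölder-style argument can be illustrated without any library: the magnitude of a dot product is at most the largest input magnitude times the L1 norm of the weights, so bounding the L1 norm bounds the accumulator. The sketch below is plain Python under stated assumptions (integer weights, a signed accumulator, hypothetical example values). It is not the Brevitas implementation.

```python
def l1_norm(weights) -> int:
    """Sum of absolute integer weight values."""
    return sum(abs(w) for w in weights)


def worst_case_dot(weights, input_bits: int, signed_input: bool = False) -> int:
    """Upper bound on |sum(x_i * w_i)| via |x . w| <= max|x| * ||w||_1."""
    max_abs_x = 2 ** (input_bits - 1) if signed_input else 2 ** input_bits - 1
    return max_abs_x * l1_norm(weights)


def min_accumulator_bits(weights, input_bits: int, signed_input: bool = False) -> int:
    """Smallest signed accumulator width P with 2**(P-1) - 1 >= worst case."""
    bound = worst_case_dot(weights, input_bits, signed_input)
    p = 1
    while 2 ** (p - 1) - 1 < bound:
        p += 1
    return p


# Hypothetical integer weights and 4-bit unsigned inputs.
weights = [3, -2, 1, 0]
print(worst_case_dot(weights, input_bits=4))
print(min_accumulator_bits(weights, input_bits=4))
```

Read in the other direction, fixing the accumulator width gives a budget for the L1 norm of each neuron's weights, which is the kind of constraint applied during training.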





[Figure: Super Resolution, ESPCN on BSDS300]

[Figure: Image Classification, ResNet18 on CIFAR10]

Train from checkpoint!

| Network                       | Method           | P (accumulator bits) | Top-1 | Sparsity |
|-------------------------------|------------------|----------------------|-------|----------|
| **ResNet50** (Float: 76.13%)  | Base             | 32                   | 75.9% | 25.8%    |
|                               | A2Q              | 16                   | 76.0% | 56.1%    |
|                               |                  | 14                   | 73.8% | 77.2%    |
|                               |                  | 12                   | 55.0% | 90.7%    |
|                               | A2Q (w/ EP-init) | 16                   | 76.0% | 56.1%    |
|                               |                  | 14                   | 74.5% | 77.1%    |
|                               |                  | 12                   | 66.7% | 88.6%    |
|                               | A2Q+             | 16                   | 76.0% | 44.0%    |
|                               |                  | 14                   | 75.7% | 67.7%    |
|                               |                  | 12                   | 72.0% | 84.4%    |

4-bit weights and activations, using 8-bit residuals and visible layers.

## QONNX: Flexible quantized NNs in ONNX + related tools

https://github.com/fastmachinelearning/qonnx



- Custom ONNX ops to represent arbitrary-bit uniform quantization
  - Standard ONNX only supports 8/16-bit
- ONNX-based common exchange format for QNNs
  - Meeting point between quantization frameworks and backends
- Infrastructure for manipulating + verifying custom ONNX graphs
- Includes its own model zoo of quantized models
- Co-maintained by AMD RAD & FastML
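What "arbitrary-bit uniform quantization" means numerically can be sketched in plain Python: scale, shift by a zero point, round, clip to the integer range of the chosen bit width, then map back. The function name and parameters below are illustrative assumptions, not the QONNX operator definition. Consult the QONNX documentation for the exact semantics.

```python
def quant(x: float, scale: float, zero_point: int, bitwidth: int,
          signed: bool = True, narrow: bool = False) -> float:
    """Quantize x to `bitwidth` bits and return the dequantized value."""
    if signed:
        q_min = -(2 ** (bitwidth - 1)) + (1 if narrow else 0)
        q_max = 2 ** (bitwidth - 1) - 1
    else:
        q_min = 0
        q_max = 2 ** bitwidth - 1 - (1 if narrow else 0)
    q = round(x / scale + zero_point)   # Python rounds halves to even
    q = max(q_min, min(q_max, q))       # clip to the representable range
    return (q - zero_point) * scale


# 3-bit signed quantization with a step of 0.25 covers [-1.0, 0.75].
print(quant(0.3, scale=0.25, zero_point=0, bitwidth=3))
print(quant(10.0, scale=0.25, zero_point=0, bitwidth=3))
```

Because the bit width is just a parameter, the same description covers 2-bit, 3-bit or 7-bit tensors equally well.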

## *FINN* Compiler: From QONNX to hardware

    # Build configuration
    build.DataflowBuildConfig(
        # target performance and clock frequency
        target_fps=100_000_000,
        synth_clk_period_ns=5.0,
        # target FPGA part number (e.g. for ZCU104)
        fpga_part="xczu7ev-ffvc1156-2-e",
        # ...
    )

- Network optimizations: constant folding, streamlining
- Compute folding with respect to throughput and resource constraints
- Operator mapping and synthesis (via HLS and RTL op library)
- Assembly of pipelined dataflow IP with AXI stream interfaces

https://github.com/Xilinx/finn
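How a throughput target and a clock period translate into a folding choice can be sketched as a small search. This is an illustration only and not the FINN folding pass: the layer size is hypothetical, a layer is assumed to take `(rows / pe) * (cols / simd)` cycles, and the number of parallel multipliers `pe * simd` stands in for resource cost.

```python
def cycles_budget(target_fps: float, clk_period_ns: float) -> int:
    """Clock cycles available per inference at the target rate."""
    clock_hz = 1e9 / clk_period_ns
    return max(1, int(clock_hz // target_fps))


def divisors(n: int):
    return [d for d in range(1, n + 1) if n % d == 0]


def cheapest_folding(rows: int, cols: int, budget: int):
    """Return (pe, simd, cycles) with the fewest multipliers that meets the budget."""
    best = None
    for pe in divisors(rows):
        for simd in divisors(cols):
            cycles = (rows // pe) * (cols // simd)
            cost = pe * simd
            if cycles <= budget and (best is None or cost < best[0]):
                best = (cost, pe, simd, cycles)
    return best[1:]


# Values from the build configuration: 100 M inferences/s at a 5 ns clock.
budget = cycles_budget(100_000_000, 5.0)
print(budget)
# Hypothetical 64x64 fully connected layer.
print(cheapest_folding(64, 64, budget))
```

A looser target yields a larger cycle budget and therefore a smaller, more heavily folded layer.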


Many similarities and differences versus **hls4ml**. Ongoing collaboration around a common frontend (**QONNX**), knowledge sharing, and joint publications since 2020.

## **NIDS Results**

Increased specialization, high performance, and efficiency

| | Vitis AI (Matrix of Processing Engines) | Fold 8 (Dataflow + Quantization + Sparsity) |
|---|---|---|
| **Performance** | | |
| Throughput | 22 kinfps | 25.3 Minfps |
| Latency (compute only) | 26 µs | 160 ns |
| **Resources** | | |
| Compute (kLUTs, DSPs*) | 122, 1124 | 44, 0 (no DSPs) |
| Memory (BRAM, URAM**) | 290, 92 | 166, 0 |
| Clock | 300/600 MHz | 203 MHz |

Mapped on UltraScale+, 16nm FPGA, all within the same SLR.

\*DSPs: 8b or 16b multiply-accumulates. \*\*BRAM: 36 kbit, URAM: 288 kbit embedded SRAM blocks.

## **Specialized FPGA Inference via Co-Design**

Increased specialization, high performance, and efficiency






#### **Bottom-Up: What maps to a 6:1 LUT?**





[Figure: PyTorch vs. FPGA view of a 6:1 LUT. Total dynamic input *in\_bits*: 6 bits; total output *out\_bits*: 1 bit.]
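Any function with 6 input bits and 1 output bit has only 64 possible input patterns, so it can be enumerated once and stored as a truth table, which is exactly what a 6:1 LUT holds. The sketch below illustrates this with a hypothetical neuron (made-up weights, bias and step activation). It is not taken from the slides.

```python
from itertools import product


def neuron(bits, weights, bias) -> int:
    """Hypothetical 1-bit-output neuron: weighted sum of binary inputs, then a step."""
    acc = sum(w * b for w, b in zip(weights, bits)) + bias
    return 1 if acc >= 0 else 0


def to_truth_table(weights, bias, in_bits: int = 6):
    """Enumerate all 2**in_bits input patterns; bits[0] is the most significant."""
    return [neuron(bits, weights, bias) for bits in product((0, 1), repeat=in_bits)]


def lut_lookup(table, bits) -> int:
    """Evaluate the 'hardware' version: index the table with the input bits."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]


# Made-up parameters for illustration.
WEIGHTS = [1.5, -2.0, 0.5, 1.0, -1.0, 0.25]
BIAS = -0.5
TABLE = to_truth_table(WEIGHTS, BIAS)
print(len(TABLE))  # 64 entries: the full contents of one 6:1 LUT
```

Once the table exists, the floating-point weights are no longer needed: whatever arithmetic the neuron performed costs the same single LUT.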

![](_page_92_Figure_2.jpeg)

![](_page_93_Figure_2.jpeg)

![](_page_94_Figure_1.jpeg)

![](_page_95_Figure_1.jpeg)

| | Vitis AI | Fold 8 | Unfolded | | |
|---|---|---|---|---|---|
| **Performance** | | | | | |
| Throughput | 22 kinfps | 25.3 Minfps | 300 Minfps | +1.5x | 471 Minfps |
| Latency (compute only) | 26 µs | 160 ns | 18 ns | | 9 ns |
| **Resources** | | | | | |
| Compute (kLUTs, DSPs*) | 122, 1124 | 44, 0 | 10, 0 | -1.6x | 16, 0 |
| Memory (BRAM, URAM**) | 290, 92 | 166, 0 | 0, 0 | | 0, 0 |
| Clock | 300/600 MHz | 203 MHz | 300 MHz | +1.5x | 471 MHz |

Mapped on UltraScale+, 16nm FPGA, all within the same SLR.

\*DSPs: 8b or 16b multiply-accumulates. \*\*BRAM: 36 kbit, URAM: 288 kbit embedded SRAM blocks.
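The size of the gains is easier to judge as ratios against the Vitis AI baseline. The short sketch below recomputes them from the numbers in the table; the dictionary layout and key names are illustrative choices.

```python
# Values copied from the results table (throughput in Minfps, latency in ns).
RESULTS = {
    "Vitis AI": {"minfps": 0.022, "latency_ns": 26_000, "kluts": 122},
    "Fold 8":   {"minfps": 25.3,  "latency_ns": 160,    "kluts": 44},
    "Unfolded": {"minfps": 300.0, "latency_ns": 18,     "kluts": 10},
}


def gains(name: str, baseline: str = "Vitis AI") -> dict:
    """Improvement factors of one design point over the baseline."""
    cur, base = RESULTS[name], RESULTS[baseline]
    return {
        "throughput": cur["minfps"] / base["minfps"],
        "latency": base["latency_ns"] / cur["latency_ns"],
        "luts": base["kluts"] / cur["kluts"],
    }


for name in ("Fold 8", "Unfolded"):
    g = gains(name)
    print(f"{name}: {g['throughput']:.0f}x throughput, "
          f"{g['latency']:.0f}x lower latency, {g['luts']:.1f}x fewer LUTs")
```

Throughput and latency improve by three to four orders of magnitude while the LUT count shrinks, before even counting the DSP and memory blocks that are no longer used.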

#### Related Work

| Config.                  | Acc. [%] | LUT            | F<sub>max</sub> [MHz] | Latency [ns] |
|--------------------------|----------|----------------|-----------------------|--------------|
| **High accuracy ≥73%**   |          |                |                       |              |
| Duarte et al. [2]        | 75       | 88k + 1k DSPs  |                       | 50           |
| FINN W8A8                | 75.5     | 581k           | 200                   | 115          |
| FINN W4A4                | 73.6     | 47k            |                       | 85           |
| NullaNet-L [3]           | 73.4     | 11.8k          | 436                   | -            |
| **Medium accuracy ≥71%** |          |                |                       |              |
| FINN W2A2                | 71.0     | 3k             | 200                   | 75           |
| NullaNet-M [3]           | 72.2     | 1.6k           | 841                   | -            |
| **Low accuracy <71%**    |          |                |                       |              |
| NullaNet-S [3]           | 69.7     | 39             | 2,079                 | -            |

![](_page_97_Figure_4.jpeg)

![](_page_98_Figure_4.jpeg)

![](_page_99_Figure_4.jpeg)

#### LogicNets for Vision: MNIST

![](_page_100_Figure_2.jpeg)

#### Related Work

| Config.          | Acc. [%] | LUT | F<sub>max</sub> [MHz] | Latency [ns] | FPS   |
|------------------|----------|-----|-----------------------|--------------|-------|
| FINN [5] LFC-max | 98.4     | 83k | 200                   | 2,440        | 1.6M  |
| FINN [5] SFC-max | 95.8     | 91k |                       | 310          | 12.4M |
| LUTNet [4]       | 97.9     | 58k |                       |              | 200M  |
| Logic-shrunk [4] | 97.8     | 55k |                       | -            | 200M  |

#### LogicNets

| Config. | Acc. [%] | LUT | F<sub>max</sub> [MHz] | Latency [ns] | FPS  |
|---------|----------|-----|-----------------------|--------------|------|
| M       | 97.7     | 45k | 517                   | 38           | 517M |
| S       | 95.8     | 12k | 458                   | 9            | 458M |

Work in progress – already over 2x faster and 20% smaller, at similar accuracy

*"FINN [...] the <u>fastest method</u> for classifying MNIST at an accuracy of 98.4%,"* Petersen et al., NeurIPS'22 [6]

#### Conclusion

![](_page_101_Figure_2.jpeg)

- Co-design of NNs and FPGA HW can yield orders of magnitude more efficient inference
  - Combination of streaming dataflow, quantization and sparsity
  - Essential ingredients for the "long tail" of Pervasive AI
- Two key ingredients make NN/FPGA co-design technology accessible
  - Open-source tools like Brevitas, FINN, hls4ml and LogicNets
  - Ecosystem to build & share the technical expertise
- Fruitful AMD-FastML collaboration strengthens the ecosystem
  - QONNX active with Thea Aarrestad, Sioni Summers ++
  - MLPerf Tiny joint submission
  - Multiple joint papers
  - ...more to come!

Internships available at AMD RADICAL Dublin! Talk to me or e-mail your CV: <u>yamanu@amd.com</u>

#### References

1. Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." 2015 Military Communications and Information Systems Conference (MilCIS). IEEE, 2015.
2. Duarte et al. "Fast inference of deep neural networks in FPGAs for particle physics." Journal of Instrumentation, vol. 13, no. 07, 2018.
3. Nazemi et al. "NullaNet Tiny: Ultra-low-latency DNN inference through fixed-function combinational logic." FCCM, 2021.
4. Wang et al. "Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks." ACM TRETS, 2023.
5. Umuroglu, Yaman, et al. "FINN: A framework for fast, scalable binarized neural network inference." FPGA, 2017.
6. Petersen et al. "Deep Differentiable Logic Gate Networks." NeurIPS, 2022.
7. Colbert et al. "A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance." ICCV, 2023.
8. Colbert et al. "A2Q+: Improving Accumulator-Aware Weight Quantization." ICML, 2024 (to appear).

## **COPYRIGHT AND DISCLAIMER**

©2024 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
