## MathWorks

## Artificial Intelligence workflows for Edge FPGA & SoC using a Deep Learning Processor

Stephan van Beek svanbeek@mathworks.com

European Technical Specialist SoC/FPGA/ASIC Design Flows

© 2024 The MathWorks, Inc.









#### Artificial Intelligence on Embedded Devices



**Satellite Navigation** 






#### Industry Trends

#### **Designs with AI accelerator cores increasing**



Source: Wilson Research Group and Siemens EDA, 2022 Functional Verification Study




#### Embedded development makes use of advanced technology capabilities

Embedded AI and machine learning attract the most attention, followed by embedded vision and speech capabilities



Source: ASPENCORE embedded survey. Survey questions:

27. Which of the following advanced technologies are you <u>currently using</u> in your embedded systems?
28. Which of the following advanced technologies are you <u>considering using</u> in your future embedded systems?



#### Airbus Designs Onboard FPGA-Based Deep Learning Processor Using MATLAB

Using the workflow provided by Deep Learning HDL Toolbox, Airbus engineers implemented an FPGA-based anomaly detection system for spacecraft employing deep learning models.

#### **Key Outcomes/Results:**

- Workflow for rapid prototyping and verification of deep neural networks on FPGAs
- Enabling collaboration between hardware, systems, and deep learning engineers
- Detected potential satellite failure modes earlier compared to traditional thresholding-based methods
- Produced deep learning processor for use and deployment with any FPGA vendor with FreeRTOS or other operating systems



Real-world anomalies detected by the deep learning network running on an FPGA.

"The MATLAB deep learning processor IP core is essentially platform-agnostic, which allowed for its incorporation into a real-time operating system that could be certified for space. A major challenge was to develop an application that interacted with it, but MathWorks support helped us a lot in this."

- Andreas C. Koch, onboard software engineer, Airbus

#### Link to user story



#### Challenges of Deploying Deep Learning to FPGA Hardware



- How to get the AI model to run on the edge device in the first place?
- How to make the AI model fit and perform well on an edge device?



#### Customizable Deep Learning Processor





#### Deep Learning HDL Processor steps

[Figure: Deep learning processor architecture and workflow.

- Workflow steps shown: Quantize, Compile & Deploy, Profile, Analyze, Customize, Estimate, Build Processor, Download.
- Inputs shown: application logic; network; weights and activations.
- Processor blocks shown: memory access arbitrator modules (activation data read/write arbitrator, weight data read arbitrator, debugger/instruction data read/write arbitrator); processing modules (Conv kernel, FC kernel, custom kernel); layer control instructions; top-level scheduler module; profiler and debugger.
- Outputs shown: FPGA deep learning processor IP, built through HDL Coder into an FPGA bitstream or DL processor HDL with an IP core interface.]

[Screenshot: MATLAB Live Editor running `crater_detect_YOLOv2_FPGAV3.mlx`. After programming the FPGA bitstream has completed successfully and the weights are loaded, the script runs prediction for one image on the FPGA with profiling enabled (`wobj.predict(img_pre, 'Profile','on')`), post-processes the output, and displays the annotated image with `imshow` under the title "FPGA (single): Craters Detected!". Detected craters are labeled with confidence scores between about 0.50 and 0.80.]

Deep Learning Processor profiler results recoverable from the screenshot:

| Metric | Value |
|---|---|
| Profiled layers | conv_1, maxpool1, conv_2, maxpool2, conv_3, maxpool3, conv_4, yolov2Conv1, yolov2Conv2, yolov2ClassConv |
| Frames | 1 |
| Total latency (cycles) | 1731094 |
| Throughput | 127.1 frames/s |
| DL processor clock frequency | 220 MHz |

Slide annotation: Analyze profiling metrics: 127.1 frames/sec
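The reported throughput follows from the cycle count and the clock frequency in the profiler output. A minimal sketch of that arithmetic, using only the figures recoverable from the screenshot (1731094 total cycles for one frame at a 220 MHz DL processor clock); the variable names are illustrative:

```matlab
% Figures taken from the profiler output above.
clockHz     = 220e6;     % DL processor clock frequency (220 MHz)
totalCycles = 1731094;   % total latency of the profiled run, in clock cycles
numFrames   = 1;         % number of frames in the profiled run

% Latency in seconds is cycles divided by clock rate; throughput is frames per latency.
latencySec   = totalCycles / clockHz;
framesPerSec = numFrames / latencySec;

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', latencySec, framesPerSec);
```

The same relation shows the two levers for raising throughput: fewer cycles per frame (a smaller or better-mapped network) or a higher processor clock.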



#### Challenges of Deploying Deep Learning to FPGA Hardware



- How to get the AI model to run on the edge device in the first place?
- How to make the AI model fit and perform well on an edge device?



#### **Two Compression Techniques**





**Pruning** deep neural networks

**Quantization** of deep neural networks



#### **Taylor Approximation Pruning**



`prunableNetwork = taylorPrunableNetwork(dlnet)`
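Taylor pruning ranks convolution filters by a first-order Taylor estimate of how much the loss would change if a filter were removed. The sketch below illustrates that general criterion in base MATLAB on random stand-in data. It is not the toolbox's internal implementation; the array shapes, the variable names, and the averaging scheme are assumptions made for illustration.

```matlab
rng(0);                          % reproducible stand-in data
A = randn(4, 4, 3, 5);           % activations: height x width x filters x batch
G = randn(4, 4, 3, 5);           % gradients of the loss with respect to those activations
A(:, :, 2, :) = 0;               % filter 2 never activates, so removing it costs nothing

% First-order Taylor importance: |spatial mean of activation .* gradient|,
% averaged over the batch. One score per filter.
perSample = abs(mean(A .* G, [1 2]));     % 1 x 1 x filters x batch
score     = squeeze(mean(perSample, 4));  % filters x 1

[~, pruneIdx] = min(score);      % the lowest-scoring filter is the first pruning candidate
disp(score.');
fprintf('Prune filter %d first\n', pruneIdx);
```

In an actual pruning loop the activations and gradients would come from mini-batches of training data, and the network would be fine-tuned between pruning steps.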










#### **Projected Layer Pruning**



Technical article on projected layer pruning



#### **Deep Network Quantizer - Int8 Quantization**
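Int8 quantization replaces floating-point weights and activations with 8-bit integers plus a scale factor. As a rough illustration of the idea, the sketch below applies a generic symmetric scheme in base MATLAB to made-up weight values. It is an assumption for illustration and not necessarily the scaling that Deep Network Quantizer applies.

```matlab
w = [-1.5 -0.234 0 0.713 2.54];    % hypothetical floating-point weights

scale = max(abs(w)) / 127;         % map the largest magnitude to the int8 limit
q     = int8(round(w / scale));    % quantized weights (int8 saturates at -128..127)
wHat  = double(q) * scale;         % values the int8 weights actually represent

maxErr = max(abs(w - wHat));       % rounding error is at most half a quantization step
fprintf('scale = %.4f, max error = %.4f\n', scale, maxErr);
```

Each weight now needs one byte instead of four (single precision), which is the main reason quantization reduces memory footprint on an FPGA.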






#### Deep Learning Processor (DLP) Configuration



#### Under the hood:





#### Estimate Resource Utilization and Performance for Custom Processor Configuration

Reference zcu102\_int8 bitstream configuration:

- Possible performance of 13982 frames per second (FPS) on a Xilinx ZCU102 ZU9EG device
- Digital signal processor (DSP) slice count: 805 used of 2520 available
- Block random access memory (BRAM) count: 388 used of 912 available

Requirements:

- Target performance of 500 frames per second (FPS) on a Xilinx ZCU102 ZU4CG device
- Digital signal processor (DSP) slice count 240 (available)
- Block random access memory (BRAM) count 128 (available)
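Comparing the two lists shows why the reference configuration cannot simply be reused: it needs about three times the DSP slices and block RAMs that the smaller device offers, while delivering far more throughput than required. A short sketch of that arithmetic, using only the numbers above (the struct and variable names are illustrative):

```matlab
% Resources used and throughput reached by the reference zcu102_int8 bitstream.
ref    = struct('dsp', 805, 'bram', 388, 'fps', 13982);
% Resources available on, and throughput required from, the smaller device.
target = struct('dsp', 240, 'bram', 128, 'fps', 500);

fitsAsIs    = ref.dsp <= target.dsp && ref.bram <= target.bram;
dspOveruse  = ref.dsp  / target.dsp;    % how many times over the DSP budget
bramOveruse = ref.bram / target.bram;   % how many times over the BRAM budget
fpsHeadroom = ref.fps  / target.fps;    % throughput that can be traded for area

fprintf('Fits as is: %d, DSP x%.2f, BRAM x%.2f, FPS headroom x%.1f\n', ...
    fitsAsIs, dspOveruse, bramOveruse, fpsHeadroom);
```

The large throughput headroom is what makes a smaller custom processor configuration plausible: parallelism and memory can be reduced as long as the estimate stays above 500 FPS.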



#### Estimate Resource Utilization and Performance for Custom DLP

```
% Start from the default processor configuration and shrink it for the smaller device.
customhPC = dlhdl.ProcessorConfig;
customhPC.ProcessorDataType = 'int8';
customhPC.setModuleProperty('conv','ConvThreadNumber',4);          % default: 16
customhPC.setModuleProperty('conv','InputMemorySize',[30 30 1]);   % default: [227 227 3]
customhPC.setModuleProperty('conv','OutputMemorySize',[30 30 1]);  % default: [227 227 3]
% Report the estimated DSP and BRAM usage of this configuration.
customhPC.estimateResources
```





#### optimizeConfigurationForNetwork

✓ Generate Optimized Processor Configuration for MobileNetV2 Network

1. Create a `dlhdl.ProcessorConfig` object.

   `net = mobilenetv2;`

   `hPC = dlhdl.ProcessorConfig;`

2. To retrieve an optimized processor configuration, call the `optimizeConfigurationForNetwork` method.

   `hPC.optimizeConfigurationForNetwork(net)`

Command output:

    ### Optimizing processor configuration for deep learning network begin.
    ### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
    ### Note: Processing module "conv" property "InputMemorySize" changed from "[227 227 3]" to "[224 224 3]".
    ### Note: Processing module "conv" property "OutputMemorySize" changed from "[227 227 3]" to "[112 112 32]".
    ### Note: Processing module "conv" property "FeatureSizeLimit" changed from "2048" to "1280".
    ### Note: Processing module "conv" property "LRNBlockGeneration" changed from "on" to "off" because there is no LRN layer in the deep learning network.
    ### Note: Processing module "fc" property "InputMemorySize" changed from "25088" to "1280".
    ### Note: Processing module "fc" property "OutputMemorySize" changed from "4096" to "1000".

    Processing Module "conv"
        ModuleGeneration: 'on'
        LRNBlockGeneration: 'off'
        ConvThreadNumber: 16
        InputMemorySize: [224 224 3]
        OutputMemorySize: [112 112 32]
        FeatureSizeLimit: 1280

    Processing Module "fc"
        ModuleGeneration: 'on'
        SoftmaxBlockGeneration: 'off'
        FCThreadNumber: 4
        InputMemorySize: 1280
        OutputMemorySize: 1000
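The notes in the log can be read as before-and-after sizes for each memory. The base-MATLAB sketch below tabulates the element count implied by each size. It is a rough proxy for how the buffers change, not a BRAM estimate, and the variable names are illustrative.

```matlab
% Sizes copied from the optimization log: {module and property, default, optimized}.
changes = {
    'conv InputMemorySize',  [227 227 3], [224 224 3]
    'conv OutputMemorySize', [227 227 3], [112 112 32]
    'fc InputMemorySize',    25088,       1280
    'fc OutputMemorySize',   4096,        1000
};

before = cellfun(@prod, changes(:, 2));   % elements implied by the default size
after  = cellfun(@prod, changes(:, 3));   % elements implied by the optimized size
ratio  = after ./ before;

for k = 1:size(changes, 1)
    fprintf('%-22s %7d -> %7d (x%.2f)\n', changes{k, 1}, before(k), after(k), ratio(k));
end
```

The conv output memory grows while the other three shrink, so the optimization fits the memories to the network rather than only reducing them.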



#### Solutions for Deploying Deep Learning to FPGA Hardware



Configurable Deep Learning Processor enables:

- Fast prototyping to assess AI model performance
- Adaptation to smaller edge devices



#### **Network Examples**

| Network Examples           | Application Area         | Type | Release |
|----------------------------|--------------------------|------|---------|
| VGG16/VGG19                | Classification           | CNN  |         |
| ResNet18/ResNet50          | Classification/Detection | CNN  |         |
| YOLO v2                    | Object detection         | CNN  | R2021b  |
| MobileNet v2               | Classification/Detection | CNN  |         |
| 1-Dimensional CNN networks | Classification/Detection | CNN  | R2022a  |
| Segmentation networks      | Segmentation             | CNN  | R2022a  |
| LSTM networks              | Signal processing        | RNN  | R2022b  |
| YOLO v3                    | Object detection         | CNN  | R2022b  |
| GRU network                | Signal processing        | RNN  | R2023a  |
| YAMNet (Audio Toolbox)     | Classification/Detection | CNN  | R2023a  |
| Projected LSTM             | Signal processing        | RNN  | R2023b  |
| YOLO v4 tiny               | Object detection         | CNN  | R2024a  |