

# Access Intel® FPGAs for Acceleration

**CERN 2018** 

## Agenda

- High-level synthesis with the Intel® HLS Compiler
- Intel® FPGA SDK for OpenCL<sup>™</sup>
- Acceleration Stack for Intel® Xeon CPUs and FPGAs
- Deep Learning Inference on FPGAs



## **FPGA** Overview

- Field Programmable Gate Array (FPGA)
  - Millions of logic elements
  - Thousands of embedded memory blocks
  - Thousands of DSP blocks
  - Programmable routing
  - High speed transceivers
  - Various built-in hardened IP
- Used to create Custom Hardware!



## **Traditional FPGA Design Process**

### Potentially Time-Consuming Effort



#### **Behavioral Simulation**





Designing IP at a higher level of abstraction increases productivity

- Debugging software is much faster than hardware
- Easier to specify functions in software
- Productivity tool for RTL designers







## **HLS Procedure**





## **Emulation Mode**

- Just like executing any other software
- Debug with
  - printf/cout
  - gdb
  - Valgrind
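As a minimal sketch of what an emulation debug session might look like, the block below defines a hypothetical component `mymult` and a hypothetical testbench helper `run_testbench`. It assumes that `HLS/hls.h` provides `component` as a macro; the fallback definition is only there so the sketch also builds with an ordinary C++ compiler.

```cpp
#include <cstdio>

// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
// The fallback keeps this sketch buildable with a plain C++ compiler.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

// Hypothetical component: multiplies two integers.
component int mymult(int a, int b) {
  return a * b;
}

// Hypothetical testbench body, meant to be called from main().
// In emulation the component is an ordinary function, so printf output
// and gdb breakpoints inside mymult() work as in any other program.
// Returns the number of mismatches.
int run_testbench() {
  const int a[] = {3, 7, 0};
  const int b[] = {6, 5, 9};
  int errors = 0;
  for (int i = 0; i < 3; ++i) {
    const int got = mymult(a[i], b[i]);
    const int expected = a[i] * b[i];
    std::printf("mymult(%d, %d) = %d (expected %d)\n",
                a[i], b[i], got, expected);
    if (got != expected) {
      ++errors;
    }
  }
  return errors;
}
```

A `main` that simply returns `run_testbench()` would be enough to run this under gdb or Valgrind.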





## Cosimulation: Synthesize Component Function into RTL





## **The Cosimulation Flow**



`a` is the default output name; use the `-o` option to specify a different one



## Cosimulation Verifying HLS IP

The Intel<sup>®</sup> HLS Compiler automatically compiles the C++ testbench and links it with an instance of the component running in an RTL simulator

- To verify the RTL behavior of the IP, just run the executable generated by the HLS compiler when targeting the FPGA architecture
  - Any call to the component function becomes a call to the simulator through the Direct Programming Interface (DPI)
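The sketch below illustrates the idea with a self-checking testbench: a hypothetical component `sum_of_squares` is compared against a plain software reference. The names `sum_of_squares`, `sum_of_squares_reference`, and `verify_component` are invented for illustration, and the fallback `component` definition is an assumption that lets the file build without the HLS headers.

```cpp
#include <cstdio>

// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

// Hypothetical component under test.
component int sum_of_squares(int a, int b) {
  return a * a + b * b;
}

// Plain software reference; it is not marked as a component.
int sum_of_squares_reference(int a, int b) {
  return a * a + b * b;
}

// Testbench body. Each call to sum_of_squares() is an ordinary function
// call in emulation; when the same source is compiled for an FPGA target,
// the call is expected to be routed to the RTL simulator instead.
// Returns 0 on success so it can be used as the process exit code.
int verify_component() {
  int failures = 0;
  for (int a = -2; a <= 2; ++a) {
    for (int b = -2; b <= 2; ++b) {
      if (sum_of_squares(a, b) != sum_of_squares_reference(a, b)) {
        std::printf("MISMATCH a=%d b=%d\n", a, b);
        ++failures;
      }
    }
  }
  std::printf("%s\n", failures == 0 ? "PASSED" : "FAILED");
  return failures == 0 ? 0 : 1;
}
```

Assuming the file is saved under the hypothetical name `sos.cpp` with a `main` that returns `verify_component()`, `i++ -march=x86-64 sos.cpp` would build the emulation executable and `i++ -march=Arria10 sos.cpp` the cosimulation one. The testbench source stays the same in both cases.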



## C/C++ Functions to Dataflow Circuits

Each component function is converted into custom dataflow hardware

- Gain the benefits of Intel<sup>®</sup> FPGAs without the lengthy design process
- Implement C/C++ operators as circuits
  - HDL code located in <HLS Installation>\ip
  - Load Store units to read/write memory
  - Arithmetic units to perform calculations
  - Flow control units
  - Connect circuits according to data flow in the function

*Figure: file listing of the `<HLS Installation>\ip` directory, containing Verilog, SystemVerilog, VHDL, and Tcl sources such as `acl_staging_reg.v`, `acl_stream_fifo.v`, `barrier_fifo.v`, `cra_ring_node.sv`, `dotp_core.vhd`, and `dp_addpipe.vhd`.*



## **Compilation Example**

Software compiled into dataflow circuit with flow control
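To make the mapping concrete, here is a small sketch of a function whose statements line up with the unit types named earlier (load/store, arithmetic, flow control). The component `scale_clamp_store` and the helper `demo_scale_clamp_store` are hypothetical, and the comments describe the expected mapping rather than verified compiler output.

```cpp
// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

// Hypothetical component: scales one element, clamps it, and writes it back.
component int scale_clamp_store(const int *src, int *dst, int idx, int limit) {
  int v = src[idx];        // load unit reads memory
  int scaled = v * 3 + 1;  // arithmetic units (multiply, add)
  if (scaled > limit) {    // comparison feeding a flow control unit
    scaled = limit;
  }
  dst[idx] = scaled;       // store unit writes memory
  return scaled;
}

// Hypothetical host-side helper that exercises the component once.
// Returns the stored value, or -1 if it differs from the returned value.
int demo_scale_clamp_store(int value, int limit) {
  int src[1] = {value};
  int dst[1] = {0};
  const int returned = scale_clamp_store(src, dst, 0, limit);
  return (returned == dst[0]) ? dst[0] : -1;
}
```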



## The Default Interfaces

```
component int add(int a, int b) {
    return a+b;
}
```



| C++ Construct             | HDL Interface                                             |
|---------------------------|-----------------------------------------------------------|
| Scalar arguments          | Conduits associated with the default start/busy interface |
| Pointer arguments         | Avalon memory master interface                            |
| Global scalars and arrays | Avalon memory master interface                            |

Note: more on interfaces later
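A sketch that puts all three constructs from the table into one component may help. The names `gain`, `weighted_sum`, and `demo_weighted_sum` are hypothetical, and the interface noted in each comment is simply the default listed in the table, not something checked against generated RTL.

```cpp
// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

// Global scalar: Avalon memory master interface by default (per the table).
int gain = 2;

// samples : pointer argument -> Avalon memory master interface
// count   : scalar argument  -> conduit tied to the start/busy interface
// bias    : scalar argument  -> conduit tied to the start/busy interface
component int weighted_sum(int *samples, int count, int bias) {
  int acc = bias;
  for (int i = 0; i < count; ++i) {
    acc += gain * samples[i];
  }
  return acc;
}

// Hypothetical host-side helper: 10 + 2 * (1 + 2 + 3).
int demo_weighted_sum() {
  int samples[3] = {1, 2, 3};
  return weighted_sum(samples, 3, 10);
}
```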



## **Other Custom Interfaces**

- Customizable Avalon Streaming Interfaces
  - Explicit ready/valid signals for each data argument
- Explicit Memory-Mapped Master interfaces
  - Create a number of master interfaces with customizable features
- Slave Registers
  - Slave port for scalar values
- Slave Memory
- Slave Control
  - Call/Return interface done through register
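As a hedged sketch of the slave register and slave control options, the component below uses the `hls_avalon_slave_component` and `hls_avalon_slave_register_argument` attributes. It assumes both are macros supplied by `HLS/hls.h`, so check the exact spelling and placement against the compiler reference for your version. The component name `scale_offset` is hypothetical, and the empty fallback definitions exist only so the sketch builds with a plain C++ compiler.

```cpp
// Assumption: the Intel HLS Compiler header supplies these names as macros.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif
#ifndef hls_avalon_slave_component
#  define hls_avalon_slave_component
#endif
#ifndef hls_avalon_slave_register_argument
#  define hls_avalon_slave_register_argument
#endif

// Slave control: call/return is expected to go through registers in the
// component's slave interface instead of the start/busy conduits.
hls_avalon_slave_component
component int scale_offset(
    // Slave registers: each scalar is written through the slave port.
    hls_avalon_slave_register_argument int scale,
    hls_avalon_slave_register_argument int offset,
    hls_avalon_slave_register_argument int x) {
  return scale * x + offset;
}
```

In emulation the attributes have no functional effect, so the testbench still calls `scale_offset(3, 4, 5)` like any other function.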







## Viewing Waveforms in Modelsim

*Figure: ModelSim - Intel FPGA Edition wave window for the testbench instance `/tb/mymult_inst`, with the signals `clock`, `resetn`, `start`, `busy`, `a`, `b`, `returndata`, `done`, and `stall` added to the waveform.*



## Intel® Quartus<sup>®</sup> Software Integration

- The `a.prj/components` directory contains all the files needed for integration
  - One subdirectory for each component
    - Portable: can be moved to a different location if desired
- Two usage scenarios
  1. Instantiate in HDL
  2. Add the IP to a system in the Platform Designer system integration tool

HDL instantiation template for the `add` component:

    add add_inst (
      // Interface: clock (clock end)
      .clock      ( ), // 1-bit clk input
      // Interface: reset (reset end)
      .resetn     ( ), // 1-bit reset n input
      // Interface: call (conduit sink)
      .start      ( ), // 1-bit valid input
      .busy       ( ), // 1-bit stall output
      // Interface: return (conduit source)
      .done       ( ), // 1-bit valid output
      .stall      ( ), // 1-bit stall input
      // Interface: returndata (conduit source)
      .returndata ( ), // 32-bit data output
      // Interface: a (conduit sink)
      .a          ( ), // 32-bit data input
      // Interface: b (conduit sink)
      .b          ( )  // 32-bit data input
    );

*Figure: Platform Designer System Contents tab for system `top`, showing `clock_in` (Clock Bridge), `reset_in` (Reset Bridge), `top_hpf_0` (`hpf_internal`), and `top_lpf_0` (`lpf_internal`). Each component exposes `alpha`, `call`, `return`, `returndata`, and `x` conduits plus clock and reset inputs.*

## Main HTML Optimization Report

### Fast generation of optimization report

The **View reports** menu offers:

- Summary
- Loops analysis
- Area analysis of system
- Area analysis of source
- Component viewer
- Component memory viewer
- Verification statistics

The Summary view lists the project name, target family, `i++` version, Quartus version, and the compile command (here beginning with `i++ -march=Arria10`).



Serial loop execution hinders function dataflow circuit performance

- Use the Loops analysis report to see if and how each loop is optimized
  - Helps identify component pipeline bottlenecks





## Loop Unrolling

Loop unrolling: Replicate hardware to execute multiple loop iterations at once

- Simple loops are unrolled by the compiler automatically
- Use `#pragma unroll` to control loop unrolling explicitly
- Dependencies are resolved through scheduling of operations
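A minimal sketch of both forms of the pragma follows: a full unroll and a partial unroll by a factor of 2. The component names, the constant `kTaps`, and the helper `demo_dot_products` are hypothetical. An ordinary C++ compiler may ignore the pragma or warn about it, which does not change the result.

```cpp
// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

constexpr int kTaps = 8;  // fixed trip count, known at compile time

// Full unroll: intended to replicate the multiply-add hardware kTaps times.
component int dot_full(const int *a, const int *b) {
  int sum = 0;
  #pragma unroll
  for (int i = 0; i < kTaps; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Partial unroll: intended to process two iterations per pass,
// trading less area for more passes.
component int dot_partial(const int *a, const int *b) {
  int sum = 0;
  #pragma unroll 2
  for (int i = 0; i < kTaps; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Hypothetical helper: returns the shared result, or -1 if the two disagree.
int demo_dot_products() {
  const int a[kTaps] = {1, 2, 3, 4, 5, 6, 7, 8};
  const int b[kTaps] = {2, 2, 2, 2, 2, 2, 2, 2};
  const int full = dot_full(a, b);
  const int partial = dot_partial(a, b);
  return (full == partial) ? full : -1;
}
```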




## **Loop-Pipelining and Dependencies**

- Execute next iteration as soon as possible
- Dependencies can be resolved by the compiler
  - Values transferred between loop iterations with FPGA resources
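The loop below is a small sketch of a loop-carried data dependency: each iteration needs the value of `acc` produced by the previous one. The component `rolling_checksum` and the helper `demo_rolling_checksum` are hypothetical names, and the comments state the expected behavior, not measured results.

```cpp
// Assumption: the Intel HLS Compiler header supplies `component` as a macro.
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#  endif
#endif
#ifndef component
#  define component
#endif

// Hypothetical component with a loop-carried dependency on `acc`.
component unsigned rolling_checksum(const unsigned *x, int n) {
  unsigned acc = 0;  // value carried from one iteration to the next
  for (int i = 0; i < n; ++i) {
    // Iteration i + 1 can start as soon as this new `acc` is available.
    // The carried value is expected to live in FPGA registers, not memory.
    acc = acc * 31u + x[i];
  }
  return acc;
}

// Hypothetical helper: ((0 * 31 + 1) * 31 + 2) * 31 + 3.
unsigned demo_rolling_checksum() {
  const unsigned x[3] = {1u, 2u, 3u};
  return rolling_checksum(x, 3);
}
```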





## Loop Pipeline Analysis

- Automatically Generated
- Reports status of loop pipelining
- Displays dependency information

Loops analysis view for `MGS.cpp`, with "Show fully unrolled loops" enabled:

| Loop | Pipelined | II | Bottleneck | Details |
|------|-----------|----|------------|---------|
| Coalesced loop (MGS.cpp:42) | n/a | n/a | n/a | Coalesced |
| qrd.B4 (MGS.cpp:49) | Yes | >=1 | n/a | Serial execution: memory dependency |
| Fully unrolled loop (MGS.cpp:55) | n/a | n/a | n/a | Unrolled by `#pragma unroll` |
| Fully unrolled loop (MGS.cpp:65) | n/a | n/a | n/a | Unrolled by `#pragma unroll` |
| qrd.B5 (MGS.cpp:71) | Yes | 1 | n/a | |

Details for qrd.B4:

- Iteration executed serially across qrd.B5. Only a single loop iteration will execute inside this region due to memory dependency:
  - From: Load Operation (MGS.cpp:58)
  - To: Store Operation

The source pane next to the table shows the main loop of MGS at `MGS.cpp:49` and the inner loop marked with `#pragma unroll` at line 55.

- Part of HTML Report
  - <.prj folder>\reports\report.html



## Loop Pipelining Optimization Report

The report shows the pipeline status of each loop

- Minimizing II is the key to loop pipelining optimization
- Report shows
  - If loops are pipelined
    - Reason given if loop not pipelined
  - Initiation interval of pipelined loops
    - If II>1, shows the operations that contribute to the loop-carried dependency
      - Data computation or memory dependencies
      - Dependencies increase II
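As a minimal sketch of what a loop-carried memory dependency looks like in source code, the two functions below compute the same running sum. The function names are hypothetical and the code is ordinary C++, not a component from the deck. The first version reads back the value the previous iteration stored, which is the load-after-store pattern the report flags. The second keeps the running value in a local variable. Whether this lowers II in a real component depends on the compiler and target, so treat it as an illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Version 1: each iteration loads the value the previous iteration stored
// (out[i - 1]), so a load depends on an earlier store to memory.
std::vector<int> running_sum_through_memory(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i == 0 ? 0 : out[i - 1]) + in[i];
    }
    return out;
}

// Version 2: the running value lives in a local variable, so the only
// loop-carried dependency is one addition on a scalar.
std::vector<int> running_sum_in_register(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        acc += in[i];
        out[i] = acc;
    }
    return out;
}

// The rewrite must not change results; this check documents that.
void check_equivalence(const std::vector<int>& in) {
    assert(running_sum_through_memory(in) == running_sum_in_register(in));
}
```

Because both versions run as plain software, they can be compared in emulation before any hardware compile.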




## **Arbitrary Precision Datatypes**

- Algorithmic C (AC) datatypes
  - From Mentor Graphics under the Apache License
  - User Guide shipped with the HLS tool
    - <path\_to\_HLS\_installation>/include/ref/ac\_datatypes\_ref.pdf
- Templated classes that allow instantiation of arbitrary-sized integers and arbitrary-precision fixed-point datatypes
- ac\_int and ac\_fixed are supported by the Intel® HLS Compiler
- Two implementations shipped with the Intel® HLS Compiler
  - ref/ac\_int.h, ac\_fixed.h: Mentor Graphics reference implementation
  - HLS/ac\_int.h, ac\_fixed.h: Intel-optimized implementation for HLS
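To make the idea of an arbitrary-sized integer concrete, here is a small stand-in written in plain C++. `UIntW` is a hypothetical class invented for this sketch. It is not the AC datatypes API and does not reproduce their operator rules, which are described in the reference guide listed above. It only shows what keeping a value in exactly W bits means.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an unsigned W-bit integer; NOT the ac_int API.
template <unsigned W>
class UIntW {
    static_assert(W >= 1 && W <= 32, "this sketch supports widths of 1 to 32 bits");

public:
    // Values are reduced to W bits when stored.
    UIntW(std::uint64_t v = 0) : bits_(static_cast<std::uint32_t>(v & mask())) {}

    std::uint32_t value() const { return bits_; }

    // The sum is stored back into W bits, so it wraps modulo 2^W.
    UIntW operator+(UIntW other) const {
        return UIntW(static_cast<std::uint64_t>(bits_) + other.bits_);
    }

private:
    static constexpr std::uint64_t mask() { return (std::uint64_t{1} << W) - 1; }
    std::uint32_t bits_;
};
```

A 5-bit value holds 0 to 31, so storing 30 + 5 in five bits leaves 3. Sizing each variable to the bits it actually needs is what lets the generated hardware use narrower adders and registers.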



## **Local Component Memories**

- Local component memories implemented with on-chip RAM resources
  - Much better performance than off-chip system memories
- Local memory system is customized to your application at compile time
  - Dependent on data type and usage
  - Banking configuration (number of banks, width), and interconnect customized to minimize contention
  - Big advantage over fixed-architecture accelerators
- Note: local memory cannot be dynamically allocated inside the component
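The note about dynamic allocation can be illustrated with a sketch in ordinary C++. The function and the constant `kTaps` are hypothetical. The point is only that the local buffer has a size fixed at compile time, with no `new` or `malloc`, so its size and access pattern are known statically.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTaps = 4;  // hypothetical window size, fixed at compile time

// Returns the sum of the last kTaps samples seen (missing samples count as 0).
// The window is a fixed-size local array rather than a heap allocation.
int moving_sum(const int* samples, std::size_t count) {
    assert(samples != nullptr || count == 0);
    std::array<int, kTaps> window{};  // zero-initialized local storage
    int last = 0;
    for (std::size_t i = 0; i < count; ++i) {
        window[i % kTaps] = samples[i];  // overwrite the oldest entry
        int sum = 0;
        for (std::size_t t = 0; t < kTaps; ++t) sum += window[t];
        last = sum;
    }
    return last;
}
```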




## Intel<sup>®</sup> FPGA SDK for OpenCL<sup>™</sup> Usage





## Compiling OpenCL<sup>™</sup> Kernel to Intel<sup>®</sup> FPGA

Using similar concepts and optimization techniques as HLS

```
__kernel void increment(__global float *a, float c, int N)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] + c;
}
```

Compiler invocation shown on the slide: `aoc -board=a10_ref`

Figure: the kernel compiled into a pipeline of Load and Store units, connected to the Host Interface and DDR/QDR memory.

## Benefits of OpenCL<sup>™</sup> on FPGAs

- For software developers
- Faster software-centric development flow
  - C-based design leads to shorter architectural exploration and development time
- Obtain performance and power advantages of an FPGA
- Portability between different HW accelerators (CPU, GPGPU, FPGA, etc)



## FPGA Architecture for OpenCL<sup>™</sup> Implementation






## aoc Output Files

- <kernel file>.aoco
  - Intermediate object file representing the created hardware system
- <kernel file>.aocx
  - Kernel executable file used to program FPGA
- Inside <kernel file> folder
  - <kernel file folder>\reports\report.html
    - Interactive HTML report
    - Static report showing optimization, detailed area, and architectural information
  - <kernel file>.log compilation log
  - Intel<sup>®</sup> Quartus<sup>®</sup> Prime software generated source and report files



## **Example Host Program**

```
int main()
{ ...
   // 1. Create then build the program from the .aocx binaries
   cl::Program myprogram(... mybinaries of aocx ...);
   err = myprogram.build(mydevlist);
   // 2. Create kernels from the program
   cl::Kernel mykernel(myprogram, "increment", &err);
   // 3. Transfer buffers to the device (non-blocking write)
   err = myqueue.enqueueWriteBuffer(a_device, CL_FALSE, 0, size, a_host);
   // 4. Set up the kernel argument list
   err = mykernel.setArg(0, a_device);
   // 5. Launch the kernel
   err = myqueue.enqueueTask(mykernel);
   // 6. Transfer result buffer back (blocking read)
   err = myqueue.enqueueReadBuffer(a_device, CL_TRUE, 0, NUM_ELEMENTS*sizeof(cl_float), a_host);
}
```
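Once the result buffer has been read back, a host-side reference can confirm the device output. The sketch below is plain C++ with no OpenCL calls. The helper names `increment_reference` and `results_match`, and the default tolerance, are assumptions made for illustration and are not part of the SDK.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference of the "increment" kernel: a[i] = a[i] + c.
std::vector<float> increment_reference(std::vector<float> a, float c) {
    for (float& x : a) x += c;
    return a;
}

// Compare the values read back from the device with the reference,
// element by element, within an absolute tolerance.
bool results_match(const std::vector<float>& device,
                   const std::vector<float>& reference,
                   float tol = 1e-6f) {
    assert(tol >= 0.0f);
    if (device.size() != reference.size()) return false;
    for (std::size_t i = 0; i < device.size(); ++i) {
        if (std::fabs(device[i] - reference[i]) > tol) return false;
    }
    return true;
}
```

In a host program this would run after the blocking read in step 6, with the contents of `a_host` passed as the device vector.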

## Compiling the Host Program

- Include CL/opencl.h or CL/cl.hpp
- Use a conventional C compiler (Visual Studio\*/GCC)
- Add \$INTELFPGASDKROOT/host/include to your file search path
  - Recommended to use aocl compile-config
- Link to Intel<sup>®</sup> FPGA OpenCL<sup>™</sup> libraries
  - Link to libraries located in the \$INTELFPGASDKROOT/host/<OS>/lib directory
    - Recommended to use aocl link-config

```
main()
{
    read_data( ... );
    manipulate( ... );
    clEnqueueWriteBuffer( ... );
    clEnqueueNDRangeKernel(..., sum, ...);
    clEnqueueReadBuffer( ... );
    display_result( ... );
}
```

Figure: the host source is built with a standard C compiler and linked against the Intel FPGA libraries.

## Kernel Development Flow and Tools





## Intel<sup>®</sup> FPGA-Specific Features

- Single Work-Item Execution
- Channels
- Controlling Hardware Generation with Attributes
  - Autorun Kernels, Vectorization Factor, Compute Unit replication, etc...
- Libraries (Calling custom RTL)
- SoC Platforms
- Shared Virtual Memory
- Custom Boards





#### Acceleration Stack for Intel® Xeon® CPU with FPGAs



#### Intel® delivers a system-optimized solution stack for your data center workloads

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Some names pending final approval and may change in the future. Logos and names provided for illustrative purposes only. Current availability may be different.



## Intel® Xeon® with FPGA Virtualization Framework



Simplifies the use of FPGAs in virtualized cloud environments



## Intel® Programmable Acceleration Card with Intel Arria® 10 GX FPGA

#### Intel's 1st versatile FPGA PCIe acceleration card that offers inline & look-aside acceleration for workloads requiring up to 45W



#### 1<sup>st</sup> acceleration card to offer the Acceleration Stack for Intel Xeon CPU with FPGAs enabling broader FPGA adoption in data center

Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA




## **Open Programmable Acceleration Engine (OPAE)**

#### **Consistent API across product generations and platforms**

- Abstraction for hardware-specific FPGA resource details

#### Designed for minimal software overhead and latency

- Lightweight user-space library (libfpga)

#### Open ecosystem for industry and developer community

License: FPGA API (BSD), FPGA driver (GPLv2)

#### FPGA driver being upstreamed into Linux kernel

Supports both virtual machines and bare metal platforms

Faster development and debugging of Accelerator Functions with the included AFU Simulation Environment (ASE)

Includes guides, command-line utilities and sample code

Simplified FPGA Programming Model for Application Developers



#### Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE



#### OPAE FPGA API – Enumerate, Manage & Access




#### Two Development Approaches

#### **HDL Programming**





## OpenCL<sup>™</sup> Flow

- Usage no different from traditional OpenCL<sup>™</sup> flow
  - C-based development and optimization flow to create AFUs and the host application
  - Standard OpenCL FPGA application using the Intel® FPGA SDK for OpenCL
    - FPGA OpenCL debug and profiling tools supported
- The Acceleration Stack is abstracted away from the user
  - OPAE is part of the host run-time

#### OpenCL<sup>™</sup> Support Package for Intel® PAC





## **RTL AFU**



- Develop RTL AFU with standard FPGA development tools
- Interface with the acceleration stack through the Core Cache Interface (CCI-P)
  - Provides a base platform memory interface
    - Simple request/response interface (Simple Read/Write)
    - Physical addresses
    - No order guarantees
  - These minimal requirements satisfy major classes of algorithms, e.g.:
    - Double buffered kernels that read from and write to different buffers
    - Streaming kernels that read from one memory-mapped FIFO and write to another



- AFU Simulation Environment (ASE) enables seamless portability to real HW
  - Allows fast verification of OPAE software together with AFU RTL without HW
    - SW Application loads ASE library and connects to RTL simulation
  - For execution on HW, application loads Runtime library and RTL is compiled by Intel® Quartus into FPGA bitstream






## **Design Flow with Machine Learning**



- Choose network topology
  - Use a framework (e.g. Caffe, TensorFlow)
- Train network
  - A high-performance computing (HPC) workload on a large dataset
  - Weeks to months process
- Inference engine (FPGA focus)
  - Implementation of the neural network performing real-time inferencing




# INTEL<sup>®</sup> AI PORTFOLIO





## Solving Machine Learning Challenges with FPGA









#### **EASE-OF-USE** SOFTWARE ABSTRACTION, PLATFORMS & LIBRARIES

Intel FPGA solutions enable software-defined programming of customized machine learning accelerator libraries.

#### **REAL-TIME** DETERMINISTIC LOW LATENCY

Intel FPGA hardware implements a deterministic low latency data path unlike any other competing compute device.

#### **FLEXIBILITY** CUSTOMIZABLE HARDWARE FOR NEXT GEN DNN ARCHITECTURES

Intel FPGAs can be customized to enable advances in machine learning algorithms.



## Why Intel® FPGAs for Machine Learning?


#### **Convolutional Neural Networks are Compute Intensive**



| Feature                                                     | Benefit                                                                                                           |
|-------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| Highly parallel architecture                                | Facilitates efficient low-batch video stream processing and reduces latency                                       |
| Configurable distributed floating-point DSP blocks          | FP32 9 TFLOPS, FP16, FP11. Accelerates computation by tuning compute performance                                  |
| Tightly coupled high-bandwidth memory                       | >50 TB/s on-chip SRAM bandwidth, random access, reduces latency, minimizes external memory access                 |
| Programmable data path                                      | Reduces unnecessary data movement, improving latency and efficiency                                               |
| Configurability                                             | Support for variable precision (trade-off between throughput and accuracy), future-proof designs, and system connectivity |



## FPGAs Provide Deterministic System Latency

FPGAs can leverage parallelism across the entire chip, reducing the compute time to a fraction.

System Latency = I/O Latency + Compute Latency






## Intel® FPGA Deep Learning Acceleration Suite

- CNN acceleration engine for common topologies executed in a graph loop architecture
  - AlexNet, GoogleNet, LeNet, SqueezeNet, VGG16, ResNet, Yolo, SSD, LSTM...
- Software Deployment
  - No FPGA compile required
  - Run-time reconfigurable
- Customized Hardware Development
  - Custom architecture creation w/ parameters
  - Custom primitives using OpenCL<sup>™</sup> flow



## Intel® FPGA DLA Suite Usage





## Mapping a Topology to the Architecture in FPGA

Use the Intel® DL Deployment Toolkit component of the OpenVINO<sup>™</sup> toolkit to deploy trained models on all Intel® architectures

- CPU, GPU, FPGA, ...
- Optimize for best execution
- Enable users to validate and tune
- Easy-to-use runtime API across all devices



## Using the Inference Engine API








#### Add a custom primitive into crossbar

Three primitive types supported:

- Unary (ReLU, Tanh)
- Binary (Eltwise Add, Mult)
- Window (Pool, LRN, Norm)
- Unary w/ coefficients
  - Scale/Dropout (a couple of coefficients per layer: coefficients loaded via layer config)
  - BatchNorm (dozen or more coefficients per layer: coefficients loaded via DDR)

## Machine Learning on Intel® FPGA Platform

Figure: the software stack (Application, ML Framework (Caffe\*, TensorFlow\*), DL Deployment Toolkit, DLA Runtime Engine, OpenCL<sup>™</sup> Runtime, Acceleration Stack) shown alongside the hardware platform and IP (DLA Workload, BBS, Intel® Xeon CPU, PAC Family Boards).

Acceleration Stack Platform Solution

For more information on the Acceleration Stack for Intel® Xeon® CPU with FPGAs on the Intel® Programmable Acceleration Card, visit the Intel® FPGA Acceleration Hub



## **DLA Architecture: Built for Performance**

- Maximize Parallelism on the FPGA
  - Filter Parallelism (Processing Elements)
  - Input-Depth Parallelism
  - Winograd Transformation
  - Batching
  - Feature Stream Buffer
  - Filter Cache
- Choosing FPGA Bitstream
  - Data Type / Design Exploration
  - Primitive Support










#### AlexNet Graph








#### **Efficient Parallel Execution of Convolutions**



- Parallel Convolutions
  - Different filters of the same convolution layer processed in parallel in different processing elements (PEs)
- Vectored Operations
  - Across the depth of feature map
- PE Array geometry can be customized to hyperparameters of given topology
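The two forms of parallelism above can be sketched in software. This is a functional model only, with hypothetical sizes (`kNumPEs`, `kDepth`) and names. It is not the DLA implementation. Each processing element computes a dot product across the depth of the feature map, and every PE applies a different filter to the same feature vector. In hardware both loops would be spatially parallel, while here they simply iterate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumPEs = 4;  // hypothetical number of processing elements
constexpr std::size_t kDepth = 8;   // hypothetical vector width across feature-map depth

using DepthVector = std::array<float, kDepth>;

// One PE: dot product of the feature vector with that PE's filter weights.
float pe_dot(const DepthVector& features, const DepthVector& weights) {
    float acc = 0.0f;
    for (std::size_t d = 0; d < kDepth; ++d) acc += features[d] * weights[d];
    return acc;
}

// All PEs receive the same feature vector; each applies a different filter.
std::array<float, kNumPEs> pe_array(const DepthVector& features,
                                    const std::array<DepthVector, kNumPEs>& filters) {
    std::array<float, kNumPEs> out{};
    for (std::size_t pe = 0; pe < kNumPEs; ++pe) {
        out[pe] = pe_dot(features, filters[pe]);
    }
    return out;
}
```

Changing `kNumPEs` and `kDepth` mirrors, in a loose sense, customizing the PE array geometry to a topology.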


# Winograd Transformation

- Perform convolutions with fewer multiplications
  - Allows more convolutions to be done on FPGA
- Take 6 input feature elements and 3 filter elements
  - Standard convolution requires 12 multiplies
  - Transformed convolution requires just 6 multiplies
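The counts above (12 versus 6 multiplies for 6 inputs and 3 filter taps) match the four-output form usually written F(4,3). As a shorter sketch of the same idea, the code below uses the smaller F(2,3) form, which produces 2 outputs from 4 inputs with 4 multiplies instead of 6. The function names are hypothetical and this is a numerical illustration, not the DLA's implementation.

```cpp
#include <array>
#include <cassert>

// Direct 1-D convolution: 2 outputs x 3 taps = 6 multiplies.
std::array<float, 2> conv_direct(const std::array<float, 4>& d,
                                 const std::array<float, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Winograd F(2,3): 4 multiplies on transformed data (m1..m4).
// The filter transform (u0..u3) depends only on the weights, so it can be
// computed once per filter and reused for every input tile.
std::array<float, 2> conv_winograd_f23(const std::array<float, 4>& d,
                                       const std::array<float, 3>& g) {
    const float u0 = g[0];
    const float u1 = (g[0] + g[1] + g[2]) * 0.5f;
    const float u2 = (g[0] - g[1] + g[2]) * 0.5f;
    const float u3 = g[2];

    const float m1 = (d[0] - d[2]) * u0;
    const float m2 = (d[1] + d[2]) * u1;
    const float m3 = (d[2] - d[1]) * u2;
    const float m4 = (d[1] - d[3]) * u3;

    return {m1 + m2 + m3, m2 - m3 - m4};
}
```

The saving comes from trading multiplies for additions, which are cheaper in FPGA logic than the DSP-block multiplies they replace.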





# Fully Connected Computation and Batching

- Fully Connected Layer computation does not allow for data reuse of weights
  - Different from convolutions
  - Very memory bandwidth intensive
- Solution: Batch up images
  - Weights reused across multiple images
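A small functional model shows why batching helps. The struct and function names are hypothetical, and `weight_fetches` is just a counter standing in for external memory reads. Each weight is read once and then applied to every image in the batch, so the number of weight reads does not grow with the batch size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FcResult {
    std::vector<std::vector<float>> outputs;  // outputs[image][neuron]
    std::size_t weight_fetches = 0;           // modeled external-memory weight reads
};

// weights[neuron][input]; images[image][input].
FcResult fully_connected_batched(const std::vector<std::vector<float>>& weights,
                                 const std::vector<std::vector<float>>& images) {
    FcResult r;
    r.outputs.assign(images.size(), std::vector<float>(weights.size(), 0.0f));
    for (std::size_t n = 0; n < weights.size(); ++n) {
        for (std::size_t i = 0; i < weights[n].size(); ++i) {
            const float w = weights[n][i];  // fetched once ...
            ++r.weight_fetches;
            for (std::size_t b = 0; b < images.size(); ++b) {  // ... reused per image
                assert(images[b].size() == weights[n].size());
                r.outputs[b][n] += w * images[b][i];
            }
        }
    }
    return r;
}
```

With a 2x2 weight matrix and a batch of two images, the model counts 4 weight reads. Processing the same two images one at a time would read the weights twice, for 8 reads.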





#### Feature Cache

Feature data cached on-chip

- Streamed to a daisy chain of parallel processing elements
- Double buffered
  - Overlap convolution with cache updates
  - Output of one subgraph becomes input of another
  - Eliminates unnecessary external memory accesses





#### **Filter Cache**

Filter weights cached in each processing element

- Double buffered in order to support prefetching
  - While one set is used to calculate output feature maps, another set is prefetched
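Double buffering can be sketched as a two-entry buffer whose halves swap roles. The function below is a software model with hypothetical names. Summing the weights stands in for the real convolution work, and the prefetch and compute steps run back to back here, whereas in hardware they would overlap in time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Processes each filter set from the active half of a ping-pong buffer while
// the next set is loaded into the other half.
std::vector<float> process_with_double_buffer(
        const std::vector<std::vector<float>>& filter_sets) {
    std::array<std::vector<float>, 2> buffer;  // ping-pong storage
    std::vector<float> results;
    if (filter_sets.empty()) return results;

    std::size_t active = 0;
    buffer[active] = filter_sets[0];  // initial fill before compute starts
    for (std::size_t s = 0; s < filter_sets.size(); ++s) {
        const std::size_t next = 1 - active;
        assert(next < 2);
        // Prefetch the following set into the idle half.
        if (s + 1 < filter_sets.size()) buffer[next] = filter_sets[s + 1];

        float acc = 0.0f;
        for (float w : buffer[active]) acc += w;  // stand-in for the convolution
        results.push_back(acc);

        active = next;  // swap roles for the next iteration
    }
    return results;
}
```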



### **DLA Architecture Selection**

- Find ideal FPGA image that meets your needs
- Create custom FPGA image based on need

| Arch Name                  | ALEXNET | GOOGLENET | SQUEEZENET | VGG | RESNET18 | RESNET18_MANUAL | RESNET50 | RESNET101 |
|----------------------------|---------|-----------|------------|-----|----------|-----------------|----------|-----------|
| 0-8-1_rc_fp32_8x8_arch02   | YES     | YES       | YES        | YES | YES      | YES             |          |           |
| 0-8-1_rc_fp16_4x4_arch03   | YES     | YES       | YES        | YES | YES      | YES             | YES      | YES       |
| 0-8-1_rc_fp16_8x32_arch09  |         |           | YES        |     | YES      |                 | YES      | YES       |
| 0-8-1_rc_fp16_8x32_arch10  | YES     |           |            |     |          |                 |          |           |
| 0-8-1_rc_fp16_8x32_arch11  | YES     | YES       | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp16_8x32_arch12  | YES     | YES       | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp11_16x32_arch17 | YES     | YES       | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp11_16x32_arch18 |         |           | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp11_16x32_arch16 | YES     | YES       | YES        | YES | YES      |                 |          |           |
| 0-8-1_rc_fp11_16x32_arch20 |         |           | YES        |     | YES      | YES             | YES      | YES       |
| 0-8-1_rc_fp10_16x32_arch23 | YES     | YES       | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp9_16x32_arch25  | YES     | YES       | YES        |     |          |                 |          |           |
| 0-8-1_rc_fp8_16x32_arch26  | YES     | YES       | YES        |     |          |                 |          |           |

# Support for Different Topologies

#### Tradeoff between features and performance



# Supported Primitives and Topologies

#### Primitives

| ✓ batch norm                         | ✓ concat                                                        | ✓ flatten                                        |
|--------------------------------------|-----------------------------------------------------------------|--------------------------------------------------|
| ✓ max pool                           | ✓ relu, leaky relu                                              | ✓ lrn normalization                              |
| ✓ average pool                       | ✓ scale                                                         | ✓ softmax                                        |
| ✓ inner product                      | ✓ permute                                                       | ✓ prelu                                          |
| ✓ reshape                            | ✓ detection output                                              | ✓ conv                                           |
| ✓ priorbox                           | ✓ fully connected                                               | ✓ eltwise                                        |
| bias                                 | group conv                                                      | depthwise conv                                   |
| local conv                           | sigmoid                                                         | elu                                              |
| power                                | crop                                                            | proposal                                         |
| slice                                | depthwise conv                                                  | roi pooling                                      |
|                                      | dilated conv                                                    |                                                  |

#### **Topologies**

| Topology       | Variant |
|----------------|---------|
| ✓ AlexNet      |         |
| ✓ GoogleNet v1 | ✓ SSD   |
| ✓ ResNet18     | ✓ SSD   |
| ✓ ResNet50     |         |
| ✓ ResNet101    |         |
| ✓ SqueezeNet   | ✓ SSD   |
| ✓ VGG16        |         |
| ✓ Tiny Yolo    |         |
| ✓ LeNet        |         |

Legend (by check-mark color on the slide): ✓ Supported, ✓ Upon Request, ✓ Future

# **Design Exploration with Reduced Precision**

Tradeoff between performance and accuracy

- Reduced precision allows more processing to be done in parallel
- Using a smaller floating-point format does not require retraining the network
- FP11 benefit over using INT8/9
  - No need to retrain, better performance, less accuracy loss



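| Format | Layout                                |
|--------|---------------------------------------|
| FP16   | Sign, 5-bit exponent, 10-bit mantissa |
| FP11   | Sign, 5-bit exponent, 5-bit mantissa  |
| FP10   | Sign, 5-bit exponent, 4-bit mantissa  |
| FP9    | Sign, 5-bit exponent, 3-bit mantissa  |
| FP8    | Sign, 5-bit exponent, 2-bit mantissa  |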
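To get a feel for how much a shorter mantissa changes a value, the sketch below keeps only the top bits of an FP32 mantissa. It is an approximation made under stated assumptions. It truncates rather than rounds, and it does not model the 5-bit exponent range. The rounding behavior of the actual hardware is not given in the deck, and `truncate_mantissa` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Keeps only the top `mantissa_bits` of the 23-bit FP32 mantissa
// (truncation toward zero). Sign and exponent are left unchanged.
float truncate_mantissa(float x, int mantissa_bits) {
    assert(mantissa_bits >= 0 && mantissa_bits <= 23);
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bit pattern
    const std::uint32_t drop = 23u - static_cast<std::uint32_t>(mantissa_bits);
    const std::uint32_t mask = ~((std::uint32_t{1} << drop) - 1u);
    bits &= mask;  // clear the low `drop` mantissa bits
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

With a 5-bit mantissa, as in the FP11 layout, 1 + 2^-5 is preserved while 1 + 2^-6 collapses to 1.0. Running trained weights through such a function is one rough way to explore the accuracy side of the trade-off before choosing a bitstream.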

# Summary

- Use HLS Compiler to generate acceleration IP for HW developers
- Use OpenCL to accelerate for software developers
  - May be used on top of the Acceleration Stack
- Acceleration stack enables data center acceleration
  - Supports RTL and OpenCL flow, HLS in the future
- Use Deep Learning Acceleration Suite to easily deploy inference tasks on the FPGA
  - Supported for the Acceleration Stack
  - In the future will support custom platforms




#### Legal Disclaimers/Acknowledgements

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

Intel, the Intel logo, Intel Inside, the Intel Inside logo, MAX, Stratix, Cyclone, Arria, Quartus, HyperFlex, Intel Atom, Intel Xeon and Enpirion are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

OpenCL is the trademark of Apple Inc. used by permission by Khronos.

\*Other names and brands may be claimed as the property of others

© Intel Corporation

