## Real-time AI Systems (Academia)



Giuseppe Di Guglielmo

giuseppe@cs.columbia.edu

**Columbia University** 

Fast Machine Learning, Fermilab, September 10-13, 2019



# Technology Trends



- ... but Dennard's Scaling has stopped
  - On-chip power density cannot continue to grow

### Moore's Law has had many lives

- 2004: The end of Moore's Law?
- 2015: Beyond Moore's Law
- The Economist

- 2016: After Moore's Law
- 2017: A new way to extend Moore's Law



# Emerging Computing Platforms for AI

#### Heterogeneous Multi-Core Systems

- A mix of processor cores and specialized hardware accelerators yields more energy-efficient computing
- The approach to AI is becoming heterogeneous
  - DSP, GPU, CPU, FPGA, custom hardware...
    - ... whichever piece of hardware handles an AI task in the most power-efficient way






# Accelerator

- Special-purpose hardware that is optimized to perform one or more dedicated functions as part of a general-purpose computational system
  - While being part of a larger system, it spans a scale from being closely integrated with a general-purpose processor to being attached to a computing system
  - Increasingly, accelerators are migrating into the chip, thus leading to the rise of heterogeneous multi-core SoC architectures
- Implemented as
  - Application Specific Integrated Circuit (ASIC)
  - Field-Programmable Gate Array (FPGA)



# AI > ML > NN

#### Artificial Intelligence (AI)

- Computer Vision
- Pattern Recognition
- ...
- Machine Learning (ML)
  - Linear Regression
  - K-Means Clustering
  - Decision Trees
  - ...
  - Neural Networks (NN)
    - Convolutional Neural Networks
    - Binary Neural Networks
    - Recurrent Neural Networks
    - ...

AI: "Machine mimics cognitive functions such as learning and problem solving"

ML: "Gives computers the ability to learn without being explicitly programmed"


# (Hardware) Real-Time System



- Deadline-driven design
  - Constraints
    - Period/Frequency
    - Deadline
- Predictable system
  - "all tasks meet all deadlines"
- Deterministic system
  - A predictable system where the "timing behavior can be predetermined"
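The deadline-driven view above can be illustrated with a small feasibility check. This is a minimal sketch, not a scheduling analysis: the `Task` fields and all numbers are hypothetical, and it assumes each task runs on its own dedicated, non-pipelined hardware block, so a task is feasible when its worst-case latency fits within both its deadline and its period.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Task:
    name: str
    period_us: float       # a new input arrives every period_us microseconds
    deadline_us: float     # the result must be ready this long after arrival
    worst_case_us: float   # worst-case latency of the hardware block (assumed known)


def meets_deadline(task: Task) -> bool:
    """True if the task always finishes on time on a dedicated, non-pipelined block."""
    return task.worst_case_us <= min(task.deadline_us, task.period_us)


def is_predictable(tasks: Iterable[Task]) -> bool:
    """A system is predictable when all tasks meet all deadlines."""
    return all(meets_deadline(t) for t in tasks)


# Hypothetical task set, for illustration only.
tasks = [
    Task("trigger_nn", period_us=10.0, deadline_us=5.0, worst_case_us=4.2),
    Task("monitoring", period_us=1000.0, deadline_us=1000.0, worst_case_us=150.0),
]
print(is_predictable(tasks))
```

Because the worst-case latency of a neural network accelerator can be bounded at design time, a check like this can run before any hardware is built.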



- HW design optimization space
  - Pareto optimality: "we cannot improve performance without paying costs, and vice versa"
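Pareto optimality can be made concrete with a few lines of code. The sketch below filters a list of candidate designs down to the non-dominated ones; the design names and the latency and area figures are made up for illustration, and lower is assumed to be better for both metrics.

```python
def pareto_front(designs):
    """Return the designs that are not dominated by any other design.

    Each design is a tuple (name, latency, area); lower is better for both.
    A design is dominated if another one is no worse in both metrics
    and strictly better in at least one.
    """
    front = []
    for name, lat, area in designs:
        dominated = any(
            (l2 <= lat and a2 <= area) and (l2 < lat or a2 < area)
            for _, l2, a2 in designs
        )
        if not dominated:
            front.append((name, lat, area))
    return front


# Hypothetical design points: (name, latency, area).
designs = [
    ("A", 10.0, 100.0),
    ("B", 5.0, 200.0),
    ("C", 12.0, 150.0),  # slower and larger than A
    ("D", 2.0, 400.0),
]
front_names = [name for name, _, _ in pareto_front(designs)]
print(front_names)
```

Every design left on the front trades one metric for the other: moving from A to B to D buys latency at the cost of area.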

# A Too Short Blanket Problem



Source: http://dogatesketchbook.blogspot.com/2008/02/illustration-friday-blanket.html

# NN = Deterministic Performance

- We can statically compute
  - Memory occupation
  - Number of operations
  - Data movement
  - ..

#### Convolutional layer

- $W_i = \mathit{IN\_Ch}_i \times \mathit{OUT\_Ch}_i \times F_i^2 \times \mathit{bits}$
- $A_i = \mathit{BATCH} \times F_i^2 \times \mathit{IN\_Ch}_i \times \mathit{bits}$
- $A_{i+1} = \mathit{BATCH} \times F_i^2 \times \mathit{OUT\_Ch}_i \times \mathit{bits}$
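Since these quantities depend only on the layer shape, they can be evaluated before any hardware exists. The sketch below simply transcribes the three formulas above; the function name, argument names, and example layer are assumptions, and it uses the single symbol $F_i$ exactly as the formulas do.

```python
def conv_layer_footprint(in_ch, out_ch, f, bits, batch=1):
    """Evaluate the three formulas above for one convolutional layer.

    Returns (weights, input activations, output activations), all in bits.
    """
    weights = in_ch * out_ch * f**2 * bits
    act_in = batch * f**2 * in_ch * bits
    act_out = batch * f**2 * out_ch * bits
    return weights, act_in, act_out


# Hypothetical layer: 3 input channels, 16 output channels, F = 3, 8-bit values.
w, a_in, a_out = conv_layer_footprint(in_ch=3, out_ch=16, f=3, bits=8)
print(w, a_in, a_out)
```

Summing these values over all layers gives a static estimate of the memory the accelerator must provide on-chip or fetch from main memory.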





# Architecture of a ML Accelerator





Source: Architecture for Accelerating DNNs, Michaela Blott (Principal Engineer at Xilinx), Hot Chips 2018


## Cost & Performance Optimization Techniques



- Accelerating CNN Inference on FPGAs: A Survey, K. Abdelouahab et al., Jan 2018
  - https://arxiv.org/pdf/1806.01683
- Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions, S. I. Venieris et al., Mar 2018
  - https://arxiv.org/pdf/1803.05900
- A Survey of FPGA Based Neural Network Accelerator, K. Guo et al., May 2018
  - https://arxiv.org/pdf/1712.08934
- Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs, C. Li et al., 2016
  - https://arxiv.org/pdf/1610.03618

# A Good Start is Half of the Work

- Kernel (Task)
  - Functionally and computationally important portion of an application
  - Well-defined interface (inputs and outputs)
  - E.g. Conv2D
- Algorithm
  - An algorithm solves a particular kernel
  - E.g. Conv2D via GEMM, Winograd, FFT
- Implementation
  - Different implementations of the same algorithm may exist
  - HLS is very sensitive to the coding style

Source: MachSuite: Benchmarks for Accelerator Design and Customized Architectures, David Brooks, 2014
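To make the kernel/algorithm distinction concrete, the sketch below solves the same Conv2D kernel with two algorithms: a direct nested-loop computation and a GEMM formulation via im2col. It is a simplified illustration in plain Python, assuming a single input channel, a single square filter, stride 1, and no padding; the function names are made up for this example. As is usual for CNNs, "convolution" here means cross-correlation (the filter is not flipped).

```python
def conv2d_direct(image, kernel):
    """Direct algorithm: four nested loops over output pixels and filter taps."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    out = [[0] * (w - k + 1) for _ in range(h - k + 1)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            for dy in range(k):
                for dx in range(k):
                    out[y][x] += image[y + dy][x + dx] * kernel[dy][dx]
    return out


def conv2d_gemm(image, kernel):
    """GEMM algorithm: unroll input patches into rows (im2col), then multiply."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    out_h, out_w = h - k + 1, w - k + 1
    # im2col: each row holds one flattened k x k input patch.
    patches = [
        [image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
        for y in range(out_h)
        for x in range(out_w)
    ]
    flat_kernel = [kernel[dy][dx] for dy in range(k) for dx in range(k)]
    # With one filter the GEMM degenerates to a matrix-vector product.
    flat_out = [sum(p * q for p, q in zip(row, flat_kernel)) for row in patches]
    return [flat_out[y * out_w:(y + 1) * out_w] for y in range(out_h)]


image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [1, 0, 1, 3]]
kernel = [[1, 0], [0, -1]]
print(conv2d_direct(image, kernel) == conv2d_gemm(image, kernel))
```

Both functions honor the same interface and produce the same result, but they differ in memory traffic and in how naturally they map to hardware, which is why the choice of algorithm matters before any implementation tuning starts.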

- Loop transformations to minimize memory access
  - Memory layout
- Create local memory buffers and keep data on-chip as much as possible


Source: Bandwidth Optimization Through On-Chip Memory Restructuring for HLS, Jason Cong et al., 2017
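The effect of a loop transformation combined with local buffers can be shown by counting off-chip reads. The sketch below is an illustration under simple assumptions, not the method of the cited paper: `OffChipMemory` is a hypothetical stand-in for DRAM that counts element reads, the workload is a square matrix multiplication, and the matrix size is assumed to be a multiple of the tile size.

```python
class OffChipMemory:
    """Wraps a 2D list and counts element reads, standing in for DRAM accesses."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, i, j):
        self.reads += 1
        return self.data[i][j]


def matmul_naive(a, b, n):
    """Every operand is fetched from off-chip memory each time it is used."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a.read(i, k) * b.read(k, j)
    return c


def matmul_tiled(a, b, n, t):
    """Loop tiling: copy t x t tiles into local buffers, then compute on-chip."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Load one tile of each operand into local (on-chip) buffers.
                buf_a = [[a.read(i0 + i, k0 + k) for k in range(t)] for i in range(t)]
                buf_b = [[b.read(k0 + k, j0 + j) for j in range(t)] for k in range(t)]
                for i in range(t):
                    for j in range(t):
                        for k in range(t):
                            c[i0 + i][j0 + j] += buf_a[i][k] * buf_b[k][j]
    return c


n, t = 8, 4
values = [[(i * n + j) % 7 for j in range(n)] for i in range(n)]
a1, b1 = OffChipMemory(values), OffChipMemory(values)
a2, b2 = OffChipMemory(values), OffChipMemory(values)
same_result = matmul_naive(a1, b1, n) == matmul_tiled(a2, b2, n, t)
naive_reads = a1.reads + b1.reads
tiled_reads = a2.reads + b2.reads
print(same_result, naive_reads, tiled_reads)
```

In this model the tiled version reads each operand element once per tile pass instead of once per multiply, so off-chip traffic drops by a factor equal to the tile size.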

# Reducing Bit-Precision

- Arithmetic
  - Floating Point
    - FP64, FP32, FP16, FP11 ...
  - Fixed Point (Linear Quantization)
    - ✓ A lot of freedom
    - "Essentially" integer arithmetic
    - × Overflow and underflow problems
  - Binary quantization
- Linear reduction in memory footprint
  - Reduce the amount of data transfer
  - Model may fit local buffers (on-chip)
- Reduction of the arithmetic logic
  - Improve area, power, latency
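Fixed-point (linear) quantization and its overflow and underflow behavior can be sketched in a few lines. The function names and the 8-bit format with 4 fractional bits are assumptions chosen for illustration; the sketch saturates on overflow, which is one common choice (wrapping is another).

```python
def quantize(x, total_bits, frac_bits):
    """Quantize x to signed fixed point; returns the integer code."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(x * scale)
    return max(lo, min(hi, code))  # saturate instead of wrapping on overflow


def dequantize(code, frac_bits):
    """Map an integer code back to the real value it represents."""
    return code / (1 << frac_bits)


# 8-bit word with 4 fractional bits: range [-8.0, 7.9375], step 0.0625.
print(dequantize(quantize(1.30, 8, 4), 4))   # nearest representable value
print(dequantize(quantize(100.0, 8, 4), 4))  # overflow: saturates at the maximum
print(dequantize(quantize(0.01, 8, 4), 4))   # underflow: rounds to zero
```

Once values are stored as integer codes, multiplications and additions use plain integer arithmetic, which is where the area, power, and latency savings come from.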





# Pruning

- Goal: Reduce storage, data transfer, and computation
- Reduction of the model size without loss of prediction accuracy
  - AlexNet 9x, VGG-16 13x





Source: Learning both Weights and Connections for Efficient Neural Networks, Song Han, 2015
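One simple way to prune is by weight magnitude: connections whose weights are closest to zero are removed. The sketch below shows only that thresholding step on a made-up weight list; the function name and the sparsity value are assumptions, and the retraining that is needed in practice to recover accuracy is omitted.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by |w|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights of one layer.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9]
pruned = prune_by_magnitude(weights, sparsity=0.5)
kept = sum(1 for w in pruned if w != 0.0)
print(pruned, f"compression: {len(weights) / kept:.1f}x")
```

The zeroed weights need neither storage nor multiplications, provided the hardware or the storage format can skip them.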

# Hardware Architecture

- Streaming
  - One distinct hardware block for each CNN layer, where each block is optimized separately to exploit the inner parallelism of the layer
    - Weights: on-chip or off-chip
    - Controller: software or hardware
- Single Computation Engine
  - Single computation engine that executes the CNN layers sequentially
    - Software controller of the scheduling of operations (may be inefficient)
    - Flexibility and reusability over customization
    - "One-size-fits-all" approach has higher performance on NN with a uniform structure





Source: Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions, S. I. Venieris et al., 2018

# Data Layout in Main Memory

- CNNs work on 4-dimensional tensors (batch N, channels C, height H, width W)
  - Data can be stored in memory in 24 (=4!) different ways
- Data layout determines the memory-access patterns and has critical performance impact
- Algorithms and data layout should be designed to minimize the number of DMA transactions
  - Rule of thumb:
    - «Fewer and longer transactions»



Fig. 1. Performance comparison between the CHWN layout (cuda-convnet2) and NCHW layout (cuDNNv4) on convolutional and pooling layers in AlexNet [12]

Source: Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs, C. Li et al., 2016
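How a layout determines the access pattern can be seen from the element strides it implies. The sketch below enumerates the 24 possible orderings of the four dimensions and compares the strides of the NCHW and CHWN layouts; the helper names and the tensor shape are assumptions for illustration, and a dense tensor with no padding is assumed.

```python
from itertools import permutations


def strides(layout, shape):
    """Element strides for a dense tensor; the last letter of `layout` varies fastest."""
    result, step = {}, 1
    for dim in reversed(layout):
        result[dim] = step
        step *= shape[dim]
    return result


def offset(layout, shape, index):
    """Linear offset of the element at `index` (a dict keyed by dimension letter)."""
    s = strides(layout, shape)
    return sum(index[d] * s[d] for d in layout)


shape = {"N": 128, "C": 3, "H": 224, "W": 224}  # hypothetical batch of images
layouts = ["".join(p) for p in permutations("NCHW")]
print(len(layouts))             # number of possible layouts
print(strides("NCHW", shape))   # W is contiguous: rows of one image are adjacent
print(strides("CHWN", shape))   # N is contiguous: the same pixel across the batch
```

A transfer that walks along the stride-1 dimension can be issued as one long burst, whereas walking along any other dimension breaks into many short transactions.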




# OS, Memory Hierarchy, RT-CPU

- Real-time operating system
  - Predictable / Deterministic
  - Concept of running tasks
    - Read in data over an interface
    - Perform an operation on the data
    - ...
- "Caches are evil"
  - Disable caches
  - Cache partitioning

Figure: block diagram of the Xilinx Zynq UltraScale+ MPSoC, showing the processing system (quad ARM Cortex-A53 application processing unit, dual ARM Cortex-R5 real-time processing unit, ARM Mali-400 MP graphics processing unit) and the programmable logic.

# Conclusions

- Designing a real-time AI system is more complicated than just meeting all of the deadlines
- Timing constraints must be traded off against area and power costs
- An exciting area to work in because of the constantly increasing importance of AI and the variety of application domains



Source: http://dogatesketchbook.blogspot.com/2008/02/illustration-friday-blanket.html

# Q/A Thank you!