## Next-Generation Deep-Learning Accelerators: From Hardware to System



ysshao@berkeley.edu Electrical Engineering and Computer Sciences



## Growing Demand in Computing

#### Two Distinct Eras of Compute Usage in Training AI Systems



OpenAI



## **Slowing Supply in Computing**

AMD, HotChips, 2019





## **Domain-Specific Accelerators**

Growing demand in computing meets slowing supply in computing.

- Customized hardware designed for a domain of applications.



Apple M1 Chip 2020



#### **Full-Stack Optimization for DL Accelerators**

Design of Accelerators

- Simba [MICRO'19 Best Paper Award, CACM RH, VLSI'20, JSSC'20 Best Paper Award]

Integration of Accelerators

- Chipyard [IEEE Micro'20]
- Gemmini [DAC'21, Best Paper Award]

Scheduling of Accelerators


## **Scalable Inference Accelerators**

#### Motivation

- Need for fast and efficient inference accelerators from mobile to datacenter.

#### Challenge

- High design cost of building unique hardware for each design target.

#### Opportunities

- Deep learning inference is intrinsically scalable with abundant parallelism.
- Recent advances in package-level integration for multi-chip-module-based designs.

## The Multi-Chip-Module Approach

#### Advantages

- Build systems larger than the reticle limit
- Smaller chips are cheaper to design
- Smaller chips have higher yield
- Faster time-to-market

#### Challenges

- Area, energy, and latency costs of chip-to-chip communication
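The yield advantage of smaller chips can be illustrated with the standard Poisson defect model, in which yield is exp(-area × defect density). The sketch below is not from the talk: the defect density and the 600 mm² monolithic die area are hypothetical values chosen only for illustration, and the 6 mm² chiplet area is the Simba chiplet size quoted elsewhere in this deck.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of defect-free dies under the Poisson yield model."""
    return math.exp(-area_mm2 * defects_per_mm2)


# Hypothetical numbers, for illustration only.
DEFECT_DENSITY = 0.001   # defects per mm^2 (0.1 per cm^2), assumed
MONOLITHIC_AREA = 600.0  # mm^2, an assumed large monolithic die
CHIPLET_AREA = 6.0       # mm^2, chiplet-sized die

monolithic_yield = poisson_yield(MONOLITHIC_AREA, DEFECT_DENSITY)
chiplet_yield = poisson_yield(CHIPLET_AREA, DEFECT_DENSITY)

# Silicon that must be fabricated per unit of good silicon.
monolithic_cost = 1.0 / monolithic_yield
chiplet_cost = 1.0 / chiplet_yield

print(f"monolithic yield: {monolithic_yield:.3f}")
print(f"chiplet yield:    {chiplet_yield:.3f}")
print(f"fabricated / good silicon: {monolithic_cost:.2f} vs {chiplet_cost:.2f}")
```

The model ignores packaging and chip-to-chip communication costs, which are exactly the challenges listed above.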



#### Simba: Scaling Inference with MCM-based Architecture

Best Paper Award at MICRO'2019, CACM Research Highlights

- Simba Testchip:
  - Package and chiplet architecture
  - Processing element design
  - Baseline uniform tiling across chiplets and PEs
- Simba Characterization:
  - Comparison with GPUs
  - NoP bandwidth sensitivity
  - NoP latency sensitivity
- Simba NoP-Aware Tiling:
  - Non-uniform work partitioning
  - Communication-aware data placement
  - Cross-layer pipelining

#### Simba: Scalable MCM-Based Architecture

Figure dimension labels: 47.5 mm (package), 2.4 mm (chiplet)

#### Package and chiplet spec

- 6 mm<sup>2</sup> chiplet in TSMC 16 nm
- 36 chiplets/package

#### Chip-to-chip interconnect

- Ground-Referenced Signaling

#### **Efficient compute tiles**

- 128 TOPS
- 0.11 pJ/Op
- 8-bit integer datapath

#### Operating range and memory

- Voltage: 0.52-1.1 V
- Frequency: 0.48-1.8 GHz
- SRAM: 624KB/chip, 23MB/package
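As a quick consistency check on the figures above, the per-chip SRAM can be scaled to the package and the energy per operation converted to TOPS/W. This is only arithmetic on the quoted numbers. Reading "KB" as 1024 bytes is an assumption, and the talk does not say at which operating point the 0.11 pJ/Op figure applies.

```python
CHIPLETS_PER_PACKAGE = 36
SRAM_PER_CHIPLET_KIB = 624   # "624KB/chip", read here as KiB (assumption)
ENERGY_PER_OP_PJ = 0.11      # quoted energy per operation

# Total on-package SRAM in decimal megabytes.
package_sram_bytes = CHIPLETS_PER_PACKAGE * SRAM_PER_CHIPLET_KIB * 1024
package_sram_mb = package_sram_bytes / 1e6

# Energy per op -> ops per joule -> tera-ops per second per watt.
ops_per_joule = 1.0 / (ENERGY_PER_OP_PJ * 1e-12)
tops_per_watt = ops_per_joule / 1e12

print(f"package SRAM: {package_sram_mb:.1f} MB")
print(f"efficiency:   {tops_per_watt:.1f} TOPS/W")
```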

#### **Simba Characterization**

- Comparison with GPUs running ResNet-50





## Simba Characterization

- Layer sensitivity
  - Running three ResNet-50 layers across different numbers of chiplets.
  - Increasing the number of active chiplets does not always translate to performance gains.
  - The cost of communication hinders the ability to exploit parallelism.
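The effect can be reproduced with a toy latency model in which compute divides evenly across chiplets while communication cost grows with the number of chiplets. This is a sketch under assumed constants, not Simba's measured data; `layer_latency`, `COMPUTE_TIME`, and `COMM_COST` are hypothetical names and values.

```python
def layer_latency(num_chiplets: int, compute_time: float, comm_cost: float) -> float:
    """Toy latency: compute splits evenly, communication grows with chiplet count."""
    return compute_time / num_chiplets + comm_cost * (num_chiplets - 1)


COMPUTE_TIME = 100.0  # time units for the layer on one chiplet (assumed)
COMM_COST = 1.0       # added communication time per extra chiplet (assumed)

chiplet_counts = [1, 2, 4, 8, 16, 32]
latencies = {n: layer_latency(n, COMPUTE_TIME, COMM_COST) for n in chiplet_counts}
best = min(latencies, key=latencies.get)

for n, t in latencies.items():
    print(f"{n:2d} chiplets: latency {t:6.2f}, speedup {latencies[1] / t:4.2f}x")
print(f"best chiplet count in this toy model: {best}")
```

In this model, latency falls up to a point and then rises again as communication dominates. A layer with less compute per unit of communication would peak at a smaller chiplet count.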



#### **Full-Stack Optimization for DL Accelerators**



#### Accelerators don't exist in isolation.





http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photoanalysis/

## Mobile SoC Usecase

- Mainstream architecture has long focused on general-purpose CPUs and GPUs.
- In an SoC, multiple IP blocks are active at the same time and communicate frequently with each other.
- Example:
  - Recording a 4K video
  - Camera -> ISP
    - "Preview stream" for display
    - "Video stream" for storage
  - DRAM for data sharing



Two Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture, IEEE Micro'2018

## **Full-System Visibility for DL Accelerators**





#### **Full-System Visibility: Memory Hierarchy**



#### Shared L2 Cache

Performance Impacts

Resource contention, cache coherence, etc.

#### **Full-System Visibility: Virtual Addresses**



#### **Full-System Visibility: Host CPUs**



#### **Gemmini: Full-System Co-Design of Hardware Accelerators**

- Full-stack
  - Includes OS
  - End-to-end workloads
  - "Multi-level" API
- Full-SoC
  - Host CPUs
  - Shared memory hierarchies
  - Virtual address translation



| Category                       | Property                          | NVDLA           | VTA    | PolySA         | DNNBuilder | MAGNet | DNNWeaver | MAERI         | Gemmini         |
|--------------------------------|-----------------------------------|-----------------|--------|----------------|------------|--------|-----------|---------------|-----------------|
| Hardware Architecture Template | Multiple Datatypes                | Int/Float       | Int    | Int            | Int        | Int    | Int       | Int           | Int/Float       |
|                                | Multiple Dataflows                | ✗               | ✗      | ✓              | ✓          | ✓      |           | ✓             |                 |
|                                | Spatial Array                     | vector          | vector | systolic       | systolic   | vector | vector    | vector        | vector/systolic |
|                                | Direct convolution                |                 | ✗      | ✗              |            |        |           |               |                 |
| Programming Support            | Software Ecosystem                | Custom Compiler | TVM    | Xilinx SDAccel | Caffe      | C      | Caffe     | Custom Mapper | ONNX/C          |
|                                | Hardware-Supported Virtual Memory | ✗               | ✗      | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |
| System Support                 | Full SoC                          | ✗               | ✗      | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |
|                                | OS Support                        |                 |        | ✗              | ✗          | ✗      | ✗         | ✗             | ✓               |

https://github.com/ucb-bar/gemmini

[DAC'2021 Best Paper Award]

## Gemmini Case Study: Allocating on-chip SRAM

- Where to allocate SRAM?
  - Private within each IP
  - Shared
- Application dependent.
- SoC configuration dependent.

#### **Full-Stack Optimization for DL Accelerators**



## Large Space of Mapping Algorithms to ML Hardware

#### Algorithm





#### Hardware



## Navigating the Mapping Space
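A rough count shows why the space is hard to search exhaustively. The sketch below counts only the ways to split each loop bound into per-memory-level tile factors, then multiplies in per-level loop orderings. The layer shape and the three-level hierarchy are hypothetical, and the function name is illustrative rather than taken from any tool.

```python
from math import factorial


def ordered_factorizations(n: int, parts: int) -> int:
    """Number of ways to write n as an ordered product of `parts` positive integers."""
    if parts == 1:
        return 1
    return sum(
        ordered_factorizations(n // d, parts - 1)
        for d in range(1, n + 1)
        if n % d == 0
    )


# Hypothetical layer and memory hierarchy, for illustration only.
layer = {"R": 3, "S": 3, "P": 56, "Q": 56, "C": 64, "K": 64, "N": 1}
MEMORY_LEVELS = 3

# Tiling: each loop bound is split into one factor per memory level.
tilings = 1
for bound in layer.values():
    tilings *= ordered_factorizations(bound, MEMORY_LEVELS)

# Each level can also order its seven loops. This counts some equivalent
# orderings more than once, so treat the product as a loose upper bound.
loop_orders = factorial(len(layer)) ** MEMORY_LEVELS

print(f"tiling choices:      {tilings:,}")
print(f"with loop orderings: {tilings * loop_orders:.3e}")
```

Spatial mapping choices and capacity constraints are left out here. The first enlarges the space and the second prunes it.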





#### **CoSA: Constrained-Optimization for Spatial Architecture**

**ML Operator** (inputs, weights, outputs)

- R, S: weight width and height
- P, Q: output width and height
- C: input channel size
- K: output channel size
- N: batch size
- Input width: (P - 1) × Stride + R

**Spatial Accelerator**

- DRAM
- Global Buffer
- Router
- Processing Element: weight buffer, input buffer, accumulation buffer, multipliers, adder, reduction

**CoSA**: variables, constraints, and objectives produce a schedule.
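With the dimension names from the figure labels (R, S, P, Q, C, K, N), the ML operator can be written as a seven-level loop nest. The version below is a plain-Python sketch for illustration, not CoSA code. The tensor layouts and the function name `conv2d` are assumptions. A schedule then decides how such loops are tiled, ordered, and spread across the accelerator's buffers and processing elements.

```python
def conv2d(inputs, weights, N, K, C, P, Q, R, S, stride=1):
    """Direct convolution as a seven-level loop nest.

    inputs:  [N][C][H][W], W = (P - 1) * stride + R, H = (Q - 1) * stride + S
    weights: [K][C][S][R]
    returns: [N][K][Q][P]
    """
    outputs = [
        [[[0 for _ in range(P)] for _ in range(Q)] for _ in range(K)]
        for _ in range(N)
    ]
    for n in range(N):                      # batch
        for k in range(K):                  # output channels
            for q in range(Q):              # output height
                for p in range(P):          # output width
                    for c in range(C):      # input channels
                        for s in range(S):  # weight height
                            for r in range(R):  # weight width
                                outputs[n][k][q][p] += (
                                    inputs[n][c][q * stride + s][p * stride + r]
                                    * weights[k][c][s][r]
                                )
    return outputs


# Tiny all-ones example: every output element sums C * R * S products.
N, K, C, P, Q, R, S, stride = 1, 2, 3, 2, 2, 3, 3, 1
W = (P - 1) * stride + R
H = (Q - 1) * stride + S
inputs = [[[[1] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
weights = [[[[1] * R for _ in range(S)] for _ in range(C)] for _ in range(K)]
outputs = conv2d(inputs, weights, N, K, C, P, Q, R, S, stride)
print(outputs[0][0])
```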



#### Results



| Metric                   | CoSA | Random (5×) | Timeloop Hybrid |
|--------------------------|------|-------------|-----------------|
| Avg. Runtime / Layer     | 4.2s | 4.6s        | 379.9s          |
| Avg. Samples / Layer     | 1    | 20K         | 67M             |
| Avg. Evaluations / Layer | 1    | 5           | 16K             |

#### Acknowledgement









Jenny Huang

Seah Kim

- Collaborators from UC Berkeley and NVIDIA!
- Sponsored by DARPA, a Facebook Research Award, a Google Research Award, and ADEPT/SLICE industry sponsors!

#### **Full-Stack Optimization for DL Accelerators**

Design of Accelerators

- Simba [MICRO'19 Best Paper Award, CACM RH, VLSI'20, JSSC'20 Best Paper Award]

Integration of Accelerators

- Chipyard [IEEE Micro'20]
- Gemmini [DAC'21, Best Paper Award]

Scheduling of Accelerators