A3D3 Seminar, 11/7/2022

## Closing the Virtuous Cycle of AI for IC and IC for AI

David Z. Pan ECE Department, UT Austin https://www.ece.utexas.edu/~dpan

## AI and the "ABC" Behind



- Big data
- Chips: CPU, GPU, ASIC, FPGA, and dedicated AI accelerators

#### **Nanometer IC Design/Manufacturing Complexity**



Divide large chip into smaller partitions, e.g., 1~2M cells each

Still, 1 backend iteration for one partition could take days!



#### 8,000 Engineer-Year!

What you see (at design) is not (necessarily) what you get (at fab)!



### **IC Design/Manufacturing Flow**



#### **Nanometer Design/Manufacturing Challenges**



## **AI – IC Interactions**

Two key themes:

#### AI for IC

- How to leverage AI techniques to enable agile and intelligent IC design
  - Equivalent scaling of Moore's law
  - Democratizing IC and EDA R&D

#### IC for AI

- Customized IC/FPGA for AI applications
- Efficient/hardware-aware ML

#### **Closing the virtuous cycle!**

**AL Arxiv Papers** 

#### Interestingly ...






## Outline

- Introduction
- AI for IC
- IC for AI
- Conclusion

## Case Study 1

DREAMPlace: <u>Deep Learning Toolkit-Enabled</u> GPU <u>Acceleration for Modern VLSI Placement</u> [Lin+, DAC'19 Best Paper Award; IEEE TCAD 2021 Donald O. Pederson Best Paper Award]

Source code release: <u>https://github.com/limbo018/DREAMPlace</u>

Widely used by industry (Google, Nvidia, Intel, ...) and academia




#### **Challenges of VLSI Placement**

- A classical NP-hard problem!
- Have to deal with huge designs: 10M+ cells in modern ICs
- Plays a central role in IC design closure as it is in the middle of the entire design flow
  - Placement determines the interconnect to the first order
  - Modern designs are interconnect-centric



Courtesy of RePlAce from UCSD

#### **Typical SOTA Nonlinear Placement Algorithm**

$$\begin{array}{cl}
\min_{\mathbf{x},\mathbf{y}} & \sum_{e \in E} \mathrm{WL}(e; \mathbf{x}, \mathbf{y}) \\
\text{s.t.} & D(\mathbf{x}, \mathbf{y}) \leq t_d
\end{array}$$

Objective of nonlinear placement:

$$\min_{\mathbf{x},\mathbf{y}} \; \underbrace{\sum_{e \in E} \mathrm{WL}(e; \mathbf{x}, \mathbf{y})}_{\text{Wirelength}} + \lambda \underbrace{D(\mathbf{x}, \mathbf{y})}_{\text{Density}}$$

Huge development effort and runtime for high-quality placement of modern ASIC/SoC designs

## What is your **Dream** Placement Engine?

- ✓ Best quality: wirelength → congestion, timing, power, …
- ✓ Ultrafast: placement is at the center of the entire design flow → faster design turn-around time
- ✓ Low development overhead: from 1 year to a month?
- ✓ Extensible: to new algorithms and acceleration techniques



We propose a novel analogy that casts nonlinear placement optimization as a neural network training problem

- Leverage deep learning hardware (GPU) and open-source software toolkits (e.g., PyTorch)
- Enable ultra-high parallelism and acceleration while achieving state-of-the-art results

#### **Analogy Between NN Training and Placement**

$$\min_{\mathbf{w}} \sum_{i}^{n} f(\phi(x_i; \mathbf{w}), y_i) + \lambda R(\mathbf{w})$$

Forward Propagation (Compute obj)



Backward Propagation (Compute Gradient  $\frac{\partial obj}{\partial w}$ )

Train a neural network

$$\min_{\mathbf{w}} \sum_{i}^{n} \mathrm{WL}(e_i; \mathbf{w}) + \lambda D(\mathbf{w})$$

Forward Propagation (Compute obj): Net instance $(e_i, 0)$ → Network $\mathrm{WL}(\cdot; \mathbf{w})$ → $\mathrm{WL}(e_i; \mathbf{w})$

Backward Propagation (Compute Gradient  $\frac{\partial obj}{\partial w}$ )

Solve a placement
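The analogy can be made concrete with a toy example. The sketch below is a minimal 1-D illustration, not DREAMPlace's implementation: cell positions play the role of the weights $\mathbf{w}$, the forward pass evaluates wirelength plus $\lambda$ times a density term, and the backward pass returns the gradient used by plain gradient descent. The squared-distance wirelength, the pairwise spreading penalty, and all names (`objective`, `gradient`, `place`, `min_gap`) are assumptions chosen for brevity.

```python
# Toy 1-D "placement as training": cell positions act as the trainable weights w.
# Assumptions (not from the slides): squared-distance wirelength for 2-pin nets
# and a pairwise spreading penalty standing in for the density term D(w).

def objective(w, nets, fixed, lam, min_gap):
    """Forward pass: wirelength + lam * density."""
    pos = {**fixed, **w}
    wirelength = sum((pos[u] - pos[v]) ** 2 for u, v in nets)
    names = sorted(w)
    density = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = max(0.0, min_gap - abs(w[a] - w[b]))
            density += overlap ** 2
    return wirelength + lam * density

def gradient(w, nets, fixed, lam, min_gap):
    """Backward pass: analytic d(obj)/dw for every movable cell."""
    pos = {**fixed, **w}
    grad = {c: 0.0 for c in w}
    for u, v in nets:
        d = 2.0 * (pos[u] - pos[v])
        if u in grad:
            grad[u] += d
        if v in grad:
            grad[v] -= d
    names = sorted(w)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diff = w[a] - w[b]
            overlap = min_gap - abs(diff)
            if overlap > 0.0:
                sign = 1.0 if diff >= 0.0 else -1.0
                # d/dw_a of overlap^2 is -2 * overlap * sign (and the opposite for b)
                grad[a] -= lam * 2.0 * overlap * sign
                grad[b] += lam * 2.0 * overlap * sign
    return grad

def place(w, nets, fixed, lam=1.0, min_gap=1.0, lr=0.05, steps=200):
    """Gradient descent on cell positions, mirroring an NN training loop."""
    w = dict(w)
    for _ in range(steps):
        g = gradient(w, nets, fixed, lam, min_gap)
        for c in w:
            w[c] -= lr * g[c]
    return w

fixed = {"pad_l": 0.0, "pad_r": 10.0}            # fixed I/O pads
nets = [("pad_l", "a"), ("a", "b"), ("b", "pad_r")]
start = {"a": 5.0, "b": 5.2}                     # overlapping initial positions
final = place(start, nets, fixed)
```

In a deep learning toolkit the hand-written `gradient` would typically be replaced by automatic differentiation and the update loop by a built-in optimizer, which is where the low development overhead comes from.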

#### **DREAMPlace Architecture**

#### Leverage highly optimized deep learning toolkit






#### **Global Placement Result Comparison**

#### RePlAce [Cheng+, TCAD'18]

- CPU: 24-core 3GHz Intel Xeon
- 64GB memory allocated
- Current state-of-the-art

#### **34**× speedup by DREAMPlace

[Figure: runtime in seconds (log scale) on ISPD 2005 benchmarks (200K~2M cells), comparing RePlAce with 1, 10, 20, and 40 threads against DREAMPlace on a V100]

#### **DREAMPlace** [Lin+, DAC'19]

- CPU: Intel E5-2698 v4 @2.20GHz
- GPU: 1 NVIDIA Tesla V100
- Single CPU thread was used

#### 43× speedup by DREAMPlace



## Same placement quality of results!

#### 10M-cell design finishes in minutes instead of 3+ hours

#### **Dreams for DREAMPlace**



#### **Beyond DREAMPlace**



## Case Study 2

#### **MAGICAL:** <u>Machine</u> <u>Generated</u> <u>Analog</u> <u>IC</u> <u>Layout</u>

As part of DARPA ERI (IDEA/POSH) effort



Open source MAGICAL (v1.0) released

https://github.com/magical-eda/MAGICAL

## **Analog IC Layout**

- DREAMPlace is mainly for digital ICs
- Analog ICs interface with the outside world
- Analog IC layout design is still mostly manual
  - Very tedious and error-prone
  - Prior design automation (DA) not as successful as in digital IC

### **MAGICAL** Mission:



- Develop a fully-automated analog layout system, leveraging human and machine intelligence
- Promising results [ISPD'19, DAC'19, ICCAD'19, ASPDAC'20, DATE'20, DAC'20, ICCAD'20, D&T'20, JoS'20, CICC'21, DAC'21, ICCAD'21, ASPDAC'22, DATE'22, ISPD'22, ICCAD'22]

#### **MAGICAL Layout System Framework**



#### **MAGICAL 1.0 Hierarchical Framework [Chen+, CICC'21]**

#### Hierarchical layout synthesis framework



### **MAGICAL 1.0 Tapeout**

#### [Chen+, CICC'21]

- 1 GS/s 3rd-order high-performance continuous-time ΔΣ modulator
- Includes various sub-block types
  - Three integrators: one passive, two active
  - Two FIR-based feedback DACs
  - One comparator
  - Digital logic
- TSMC 40nm
- SOTA performance cf. the original manual design [IEEE SSC-L'20]





### **Comparison with SOTA CTDSM ADCs**

 MAGICAL 1.0 layout even slightly outperforms manual layout (SSCL'20) in power, performance, and area



[Chen+, CICC'21]

## **MAGICAL Extension: OpenSAR**

#### End-to-end SAR ADC compilation

[Figure: OpenSAR flow combining template-based generation, MAGICAL, digital APR, and route planning]

Tape-out validated in TSMC 40nm

[Liu+, ICCAD'21]

### MAGICAL Extension: AutoCRAFT [Chen+, ISPD'22]

- Tech-agnostic FinFET layout style using primitives (w/ Nvidia)
- Auto custom layout generation → Very promising results obtained



#### AutoCRAFT

## **Case Study 3**

## AI for IC Manufacturability, Reliability, Security

#### **Bottleneck in IC Manufacturing: Lithography**





- What you see (at design) is NOT what you get (at fab)
- Need to make sure the design is manufacturable with high yield
- Litho-simulations are extremely CPU intensive

#### **Lithography Hotspot Detection**

**Question 1**: Without going through detailed litho-simulations, can we directly predict lithography hotspot to avoid poor yield?

- Our work [Ding+, ICICDT 2009 Best Paper] is among the first to use machine learning (SVM) for litho-hotspot detection
  - Very active research topic in the last 12+ years
  - Inspired ICCAD 2012 CAD Contest, run by Mentor Graphics
  - Meta-classification combining machine learning (ML) and pattern matching (PM) [Ding+, ASPDAC'12 BPA]
  - Deep neural network [Yang+, DAC'17]
  - Big data vs. small data: transfer learning, active learning, semi-supervised learning [Lin+, ISPD'18], [Chen+, ASPDAC'19] ...
  - Litho-GPA: confidence estimation [Ye+, DATE 2019]




#### LithoGAN: End-to-End Lithography Modeling with Generative Adversarial Networks [Ye+, DAC'19 Best Paper Finalist]

**Question 2 (much harder):** Without going through litho-simulations, can we directly get printed images?

#### **Image Translation for Litho Modeling**

[Ye+, DAC'19]



- Different elements encoded on different image channels
- Resist pattern zoomed in for high resolution/accuracy

#### **LithoGAN Results**

[Ye+, DAC'19]





LithoGAN is **1800x** faster than rigorous simulations, with acceptable error (in consultation with industry)

#### **Another LAPD**

- To bridge design and manufacturing → Lithography Aware Physical Design (LAPD)
  - Litho Hotspot Detection
  - Litho Hotspot Correction
- My group has made many seminal contributions in LAPD
- LithoGAN opens new directions with tremendous potential
- Similar principles apply to other EDA (reliability, 3D-IC, ...)




#### **Bridge Design/Manufacturing for Security**

IC supply chains of design, manufacture, test, package, ...



Image source: https://depositphotos.com/2801291/stock-illustration-gray-detailed-world-map.html

#### **Design/Manufacturing for Hardware Security**

- Arms race between attack and protection
- Hardware IP reverse engineering using learning techniques
- Intelligent IC camouflaging [Li+, ICCAD'16, TCAD'17, HOST'17 BPA]
- Former PhD student Meng Li won ACM SRC Grand Finals First Place in 2018



## Outline

- Introduction
- AI for IC
- IC for AI
- Conclusion

#### **Photonic AI Chips**

- Based on optics/photonics → photonic ICs



#### **Optical Computing Basics**



#### **ONN Background: Photonics GEMM**

DNNs: linear projection + nonlinear activation

- Matrix multiplication is computation-intensive
- Photonics is good at ultra-fast linear operations



Photonic tensor unit for analog GEMM [MIT's Nature Photonics'17]



#### **Device-Circuit-Arch-Algorithm Co-Design Stack**



Jiaqi Gu won ACM Student Research Competition Grand Finals 1<sup>st</sup> Place 2021

## Case Study 4: FFT-based ONN [Gu+, ASPDAC'20 BPA]

- Efficient circulant matrix multiplication in Fourier domain

$$y = Wx \;\Longrightarrow\; y = \mathcal{F}^{-1}(\mathcal{F}(w) \odot \mathcal{F}(x))$$

- 2.2~3.7× area reduction, no accuracy loss

$$O(m^2+n^2) \longrightarrow O\left(\frac{mn}{k}\log_2 k\right)$$
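The Fourier-domain identity can be checked numerically. The sketch below assumes $W$ is a circulant matrix defined by its first column $w$, so that $W_{ij} = w_{(i-j) \bmod n}$. It compares the direct product $Wx$ with $\mathcal{F}^{-1}(\mathcal{F}(w) \odot \mathcal{F}(x))$. The function names are illustrative, and the naive $O(n^2)$ DFT is only a stand-in for a fast (or optical) Fourier transform.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circulant_matvec_direct(w, x):
    """Reference y = Wx with W[i][j] = w[(i - j) mod n]."""
    n = len(w)
    return [sum(w[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def circulant_matvec_fourier(w, x):
    """y = F^{-1}(F(w) * F(x)) elementwise; only the first column w is stored."""
    fw, fx = dft(w), dft(x)
    y = dft([a * b for a, b in zip(fw, fx)], inverse=True)
    return [val.real for val in y]  # inputs are real, so the imaginary part is round-off

w = [1.0, 2.0, 0.0, -1.0]   # first column of the circulant weight block
x = [0.5, -1.0, 3.0, 2.0]   # input vector
direct = circulant_matvec_direct(w, x)
fourier = circulant_matvec_fourier(w, x)
```

Only the $n$ entries of `w` are stored instead of the $n^2$ entries of a general weight block, which is the source of the parameter savings in a circulant formulation.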





[Figure legend: Inputs, Coupler, Phase Shifter, Attenuator, Combiner, Crossing]

#### **Our OSNN Neural Chip Tapeout & Measurement**

- Experimental demonstration
  - Compute density: 225 TOPS/mm<sup>2</sup>
  - Energy efficiency: 9.5 TOPS/W

### Won the Robert S. Hilbert Memorial Optical Design Competition, July 2022



C. Feng, J. Gu, H. Zhu, Z. Ying, Z. Zhao, D.Z. Pan, R.T. Chen, Under Submission

## Case Study 5: FLOPS [DAC'20 BPC] [NSF Workshop'20, BPA]

- ONN on-chip learning via stochastic zeroth-order optimization
  - **Efficiency**: WDM-based forward-only gradient estimation
  - **Accuracy**: Two-stage learning protocol (FLOPS+) with high accuracy
  - **Robustness**: Robust learning under *in situ* device variations
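The idea of forward-only gradient estimation can be sketched in a few lines. The example below is a generic random-direction zeroth-order estimator on a toy quadratic loss, not the FLOPS algorithm itself: it only queries the loss (forward passes) and never backpropagates. The names `zo_gradient` and `zo_train`, the Gaussian perturbations, and all hyperparameters are assumptions for illustration.

```python
import random

def zo_gradient(loss, w, mu=1e-3, samples=20, rng=random):
    """Estimate the gradient from loss evaluations only (no backpropagation)."""
    base = loss(w)
    grad = [0.0] * len(w)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in w]              # random probe direction
        probe = loss([wi + mu * ui for wi, ui in zip(w, u)])
        scale = (probe - base) / (mu * samples)           # finite-difference slope along u
        for i, ui in enumerate(u):
            grad[i] += scale * ui
    return grad

def zo_train(loss, w, lr=0.05, steps=300, seed=0):
    """Gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    w = list(w)
    for _ in range(steps):
        g = zo_gradient(loss, w, rng=rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy stand-in for an on-chip loss measurement: distance to a hidden target.
target = [0.3, -1.2, 0.8]

def loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

w0 = [0.0, 0.0, 0.0]
w_final = zo_train(loss, w0)
```

Each step costs `samples + 1` loss evaluations, so such estimators trade extra forward passes for not needing gradient circuitry. That cost grows with the number of parameters.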



#### **Robust On-Chip Learning**

- Thermal crosstalk variations
  - Typically not considered in software training
  - Time-consuming
  - Inaccurate
- Built-in robustness handling on-chip
  - Ultra-fast: ~1 μs
  - Accurate: physical noise model





#### **Experimental Results [Gu+, DAC'20]**

- Robust learning under in situ thermal variations
  - **5%** more accurate than hardware-agnostic software training
  - **3%** more robust than previous on-chip training approaches



ONN config: 10-24-24-6 (960 MZIs)

#### L<sup>2</sup>ight – Scalable On-Chip Training [Gu+, NeurIPS'21]

- Gradient-free methods → first-order gradient-based methods
- Can handle million-parameter ONNs
  - Scalability: 1000× more scalable than [Gu+, DAC'20]
  - Efficiency: multi-level sparsity boosts efficiency by $30\times$
- In-situ noise consideration for noise-resilient ONNs



#### Outline

- Introduction
- AI for IC
- IC for AI
- Conclusion

#### **To Recap: AI for IC**



### **To Recap: [Photonic] IC for AI**

- How to build ultra-fast (light-speed) and ultra-efficient optical neural accelerators with photonic integrated circuits
  - Software and hardware co-design is KEY
- FFT-ONN (ASP-DAC 2020 Best Paper Award)
- FLOPS (DAC 2020 Best Paper Finalist; NSF'20 Workshop BPA)
- PhD student Jiaqi Gu won ACM SRC Grand Finals 1<sup>st</sup> Place in 2021
- Won the Robert S. Hilbert Memorial Optical Design Competition, July 2022



#### Conclusion

- Advances in AI algorithms/software → Agile IC/hardware design
- Advances in IC/hardware → Enhanced AI capability



### **Closing the Virtuous Cycle!**

#### Acknowledgment

- Funding support / collaborations from NSF, DARPA, MURI, Intel, Nvidia, Google, Synopsys, Toshiba Memory (Kioxia), AMD/Xilinx, VMware, etc.
- Many students/post-docs who do the real work
- Many collaborators
  - Dr. Haoxing Ren from NVIDIA for DREAMPlace, AutoCRAFT
  - Prof. Nan Sun at UT Austin (now at Tsinghua) for MAGICAL
  - Dr. Nojima et al. from Toshiba Memory (KIOXIA) on DFM
  - Prof. Ray Chen at UT Austin for optical interconnect/computing
  - ...

## Thanks!

# Q&A?

