# Introduction to FPGA, FINN and Brevitas

Dr. Mario Ruiz, AMD University Program



#### **AUP** Vision

Empower academics with AMD technology to enhance teaching and learning experiences and advance state-of-the-art research.

#### **Our Team**

Dedicated worldwide technical team

Supporting High Performance and Adaptive Compute

25+ years of experience working with academia



#### What We Offer

- Research Programs
- Donation Program
- Teaching Resources
- Training
- Academic Solutions
- Support

#### **HACCs: Heterogeneous Accelerated Compute Clusters**



together we advance\_


### **HACC Adaptive Computing Hardware**



HACC hardware consists of:

- Compute and Alveo<sup>™</sup> nodes (initially U250 and U280 with HBM)
- Latest heterogeneous nodes (SMC 4124GS) include:
  - 2 EPYC<sup>™</sup> 3rd generation CPUs
  - 4 AMD Instinct<sup>™</sup> MI210 GPUs
  - 2 Alveo U55C FPGAs with HBM
  - 2 VCK5000 Versal Adaptive SoCs with AIEs
  - Run-time via AMD ROCm™, XRT
  - SW development via HIP, Vitis, frameworks
- 100G network
- Community hub for researchers
  - Support from in-house AMD research groups
  - Reproducible results & experiments





#### **Contact Us**

**Email us:** aup@amd.com

#### Visit our website to:

- Discover our research programs
- Access educational resources
- Submit a donation request
- Find training & other events




www.amd.com/AUP

## What is Adaptive Computing?

Adaptive Hardware ("FPGA") Conceptual Representation

#### **Optimize for the Workload**

Domain-Specific Architecture for your exact requirements, accelerating the whole application

#### Adapt as Algorithms Change

Re-implement the silicon after deployment, adapting to evolving use cases

#### **Accelerate Pace of Innovation**

Keep pace with fast-moving markets and rapid innovation cycles, e.g., AI algorithms



Matching the Architecture to the Application

Custom Data Flow, Custom Memory Hierarchy, Custom Precision
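To make "custom precision" concrete, here is a minimal sketch of uniform quantization to an arbitrary bit width. The function names `quantize` and `dequantize` and the chosen scale are illustrative assumptions, not part of any AMD tool or library API. On adaptive hardware the bit width can be picked per layer or per signal, because the datapath is built to match.

```python
def quantize(x: float, bits: int, scale: float) -> int:
    """Map x to a signed integer code of the given bit width.

    Hypothetical helper for illustration: rounds x / scale to the
    nearest integer (Python's round(), ties go to even) and saturates
    to the range of a two's-complement number with `bits` bits.
    """
    qmin = -(1 << (bits - 1))
    qmax = (1 << (bits - 1)) - 1
    q = round(x / scale)
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float) -> float:
    """Recover the real value represented by integer code q."""
    return q * scale


# A 4-bit signed code with scale 0.125 covers [-1.0, 0.875].
code = quantize(0.5, bits=4, scale=0.125)
saturated = quantize(10.0, bits=4, scale=0.125)
```

Narrower codes need fewer logic resources and less memory bandwidth, at the cost of a coarser grid and earlier saturation.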




#### **Evolution to Heterogeneous Platforms**

- From FPGAs to adaptive SoCs → matching the engine to the workload
- Balancing diverse technologies for domain-specific requirements



#### **Domain-Specific Optimization**

[Public]

#### Field Programmable Gate Array (FPGA)

- Semiconductor devices
- Programmed and reprogrammed by a user
  - Configuration attributes manipulated after manufacturing
  - Matrix of configurable logic blocks (CLBs)
  - Dedicated specialized logic
  - Flexible programmable interconnects
- Ideal fit for many different workloads
  - Massive parallelism
- Hardware adaptability is a unique differentiator from CPUs and GPUs
- Invented in 1985
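
As a rough illustration of what "configurable logic" means, the sketch below models a single lookup table (LUT), the basic element inside a CLB. It is a deliberate simplification: real CLBs also contain flip-flops, carry logic, and multiplexers, and the names `LUT` and `program` are hypothetical, chosen only for this example. A k-input LUT is a 2^k-bit memory addressed by its inputs, so "programming" it means writing a truth table.

```python
class LUT:
    """A k-input lookup table: 2**k stored bits, addressed by the inputs."""

    def __init__(self, k: int, init: int):
        self.k = k
        # Keep only the 2**k configuration bits.
        self.init = init & ((1 << (1 << k)) - 1)

    def __call__(self, *inputs: int) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        # Input i supplies bit i of the address into the stored table.
        addr = 0
        for i, bit in enumerate(inputs):
            addr |= (bit & 1) << i
        return (self.init >> addr) & 1


def program(k: int, fn) -> LUT:
    """Build the configuration bits for any k-input Boolean function."""
    init = 0
    for addr in range(1 << k):
        bits = [(addr >> i) & 1 for i in range(k)]
        if fn(*bits):
            init |= 1 << addr
    return LUT(k, init)


# The same "hardware" becomes a different function by changing its bits.
xor3 = program(3, lambda a, b, c: a ^ b ^ c)
maj3 = program(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Reconfiguring the device amounts to loading new table contents (and new interconnect settings), which is why the same silicon can implement different circuits after manufacturing.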

#### Applications

- Automotive
- Broadcast & Pro AV
- Consumer Electronics
- Data Center
- High Performance Computing and Data Storage
- Industrial
- Medical
- Video & Image Processing
- Wired Communications
- Wireless Communications


# **Core Adaptable Hardware Technologies**

| FPGAs | SoCs | Adaptive SoCs |
|-------|------|---------------|
| From high-bandwidth connectivity to massive compute engines | Multi-processing subsystem with Arm<sup>®</sup> cores and integrated FPGA logic | Adaptive Compute Acceleration Platforms for any application, any developer |
| AMD Spartan, Artix, Kintex, Virtex | | AMD Versal |

#### **Three Ages of FPGAs**

- A Retrospective on the First Thirty Years of FPGA Technology
- S. M. Trimberger, "Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology," in Proceedings of the IEEE, vol. 103, no. 3, pp. 318-331, March 2015, DOI: 10.1109/JPROC.2015.2392104

| INVITED FAFER                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 |                                                                                                                                         | Trimberger: Three Ages of FPGAs                                                                                                | Trimberger: Three Ages of FPGAs                                                                                         |                                                                                                                     |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 |                                                                                                                                         | The disadvantage of the FPGA per-unit cost premium<br>over ASIC diminished over time as NRE costs became a                     | <br>3 macrocells                                                                                                        | the AND array grows with the square of the number of<br>inputs (more precisely, inputs times product terms). Pro-   |
| Three Ages of FPGAs: A                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | ă t                                                                                                                                     | larger fraction of the total cost of ownership of ASIC. The                                                                    |                                                                                                                         | cess scaling delivers more transistors with the square of the                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 | ASIC Crossover                                                                                                                          | dashed lines in Fig. 2 indicate the total cost at some process                                                                 |                                                                                                                         | shrink factor. However, the quadratic increase in the AND                                                           |
| Potrognoctive on the First Thirty                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                 | point,<br>generation n+1                                                                                                                | node. The solid lines depict the situation at the next process<br>node, with increased NRE cost, but lower cost per chip, Both | Product terms                                                                                                           | array limits PALs to grow logic only linearly with the<br>shrink factor. PAL input and product-term lines are also  |
| Retrospective on the First Thirty                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                 | Crossover | FPGA and ASIC took advantage of lower cost manufacturing, | | heavily loaded, so delay grows rapidly as size increases. A |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 | FPGA generation n                                                                                                                       | while ASIC NRE charges continued to climb, pushing the                                                                         |                                                                                                                         | PAL, like any memory of this type, has word lines and bit                                                           |
| Years of FPGA Technology                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                 |                                                                                                                                         | crossover point higher. Eventually, the crossover point grew                                                                   |                                                                                                                         | lines that span the entire die. With every generation, the                                                          |
|
                                                 | Number of Units | so high that for the majority of customers, the number of<br>units no longer justified an ASIC. Custom silicon was | | ratio of the drive of the programmed transistor to the<br>loading decreased. More inputs or product terms increased |
| This paper reflects on how Moore's Law has driven the design of FPGAs through
                                                 | Fig. 2. FPGA versus ASIC Crossover Point. Graph shows total cost | warranted only for very high performance or very high volume; | | loading on those lines. Increasing transistor size to lower |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 | versus number of units. FPGA lines are darker and start at the lower<br>left corner. With the adoption of the next process node (arrows | all others could use a programmable solution.                                                                                  | 2 Inputs                                                                                                                | resistance also raised total capacitance. To maintain speed,                                                        |
| three epochs: the age of invention, the age of expansion, and the age of accumulation.
                                                 | from the earlier node in dashed lines to later node in solid lines).                                                                    | This insight, that Moore's Law [33] would eventually<br>propel FPGA capability to cover ASIC requirements, was a               | 2 inputs                                                                                                                | power consumption rose dramatically. Large PALs were<br>impractical in both area and performance. In response, in   |
| By Stephen M. (Steve) Trimberger, Fellow IEEE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 | the crossover point, indicated by the vertical dotted line, grew larger. | fundamental early insight in the programmable logic | Fig. 3. Generic PAL architecture. | the 1980s, Altera pioneered the Complex Programmable |
|
                                                 | | business. Today, device cost is less of a driver in the FPGA | | Logic Device (CPLD), composed of several PAL-type blocks |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 | 2) Age of Expansion 1992–1999; | versus ASIC decision than performance, time-to-market,<br>power consumption, I/O capacity and other capabilities. | B. FPGA Versus PAL | with smaller crossbar connections among them. But FPGAs<br>had a more scalable solution. |
|
                                                 | 3) Age of Accumulation 2000–2007. | Many ASIC customers use older process technology, | Programmable logic was well established before the | The FPGA innovation was the elimination of the AND |
| ABSTRACT Since their introduction, field programmable gate
                                                 | | lowering their NRE cost, but reducing the per-chip cost | FPGA. EPROM-programmed Programmable Array Logic | array that provided the programmability. Instead, |
| arrays (FPGAs) have grown in capacity by more than a factor of
                                                 | II. PREAMBLE: WHAT WAS THE<br>BIG DEAL ABOUT FPGAs? | advantage. | | configuration memory cells were distributed around the array to |
| 10 000 and in performance by a factor of 100. Cost and energy
                                                 | | Not only did FPGAs eliminate the up-front masking<br>charges and reduce inventory costs, but they also reduced | However, FPGAs had an architectural advantage. To understand the FPGA advantage, we first look at the simple | |
| per operation have both decreased by more than a factor of 1000
                                                 | A. FPGA Versus ASIC                                                                                                                     | design costs by eliminating whole classes of design prob-                                                                      | programmable logic structures of these early 1980s de-                                                                  |                                                                                                                     |
| caling, but the FPGA story is much more complex than simple                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                 |                                                                                                                                         | lems. These design problems included transistor-level de-                                                                      | vices. A PAL device, as depicted in Fig. 3, consists of a two-                                                          |                                                                                                                     |
| echnology scaling. Quantitative effects of Moore's Law have Speed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                 | (ASIC) companies brought an amazing product to the<br>electronics market; the built-to-order custom integrated                          |                                                                                                                                | level logic structure [6], [38]. Inputs are shown entering at<br>the bottom. On the left side, a programmable AND array |                                                                                                                     |
| driven qualitative changes in FPGA architecture, applications 100 —Price Power 100 Price 100 Price 100 Price 100 Power 100 Pow | circuit. By the mid-1980s, dozens of companies were sell-                                                                               |                                                                                                                                | generates product terms, ANDS of any combination of the                                                                 |                                                                                                                     |
| ral distinct phases of development. These phases, termed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                 | ing ASICs, and in the fierce competition, the winning at-                                                                               |                                                                                                                                | inputs and their inverses. A fixed on gate in the block at                                                              | AND array. Not every function was an output of the chip, so                                                         |
| 'Ages" in this paper, are The Age of Invention, The Age of 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 | tributes were low cost, high capacity and high speed. When<br>FPGAs appeared, they compared poorly on all of these                      |                                                                                                                                | the right completes the combinational logic function of the<br>macrocell's product terms. Every macrocell output is an  |                                                                                                                     |
| Expansion and The Age of Accumulation. This paper summa-<br>rizes each and discusses their driving pressures and funda-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                 |                                                                                                                                         | time. With wafer-fabrication turnaround times in the                                                                           | output of the chip. An optional register in the macrocell                                                               |                                                                                                                     |
| west each and consists the paper concludes with a vision of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 |                                                                                                                                         | weeks or months, silicon re-spins impacted schedules sig-                                                                      | and feedback to the input of the AND array enable a very                                                                |                                                                                                                     |
| apcoming Age of FPGAs. 1985 1990 1995 2000 2005 2010                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                 | tooling. ASIC customers paid for those masks with an up-<br>front non-recurring engineering (NRE) charge. Because                       |                                                                                                                                | flexible state machine implementation.<br>Not every function could be implemented in one pass                           |                                                                                                                     |
| KEYWORDS Application-specific integrated circuit (ASIC): Fig. 1. Xilinx FPGA attributes relative to 1988. Capacity is logic cell                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                 | they had no custom tooling, FPGAs reduced the up-front                                                                                  |                                                                                                                                | through the PAL's macrocell array, but nearly all common                                                                |                                                                                                                     |
| commercialization; economies of scale; field-programmable<br>price/s per logic coll. Prover is per logic coll. Price and power are scaled                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                 | cost and risk of building custom digital logic. By making                                                                               | an FPGA can be reworked in minutes, FPGA designs in-                                                                           | functions could be, and those that could not were realized                                                              | <b>*** ** ** **</b>                                                                                                 |
| ate array (FPGA); industrial economics; Moore's Law; pro- up by 10 000 ×. Data: Xilinx published data.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | one custom silicon device that could be used by hundreds or<br>thousands of customers, the FPGA vendor effectively                      |                                                                                                                                | in two passes through the array. The delay through the PAL<br>array is the same regardless of the function performed or |                                                                                                                     |
| rammable logic                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 | amortized the NRE costs over all customers, resulting in                                                                                |                                                                                                                                | where it is located in the array. PALs had simple fitting                                                               |                                                                                                                     |

[Figure: page excerpts from an article in Proceedings of the IEEE, Vol. 103, No. 3, March 2015, reviewing three "Ages" of FPGAs, starting with the Age of Invention (1984–1991).]

12

#### **FPGA: 7-Series Architecture**

- Logic elements distributed on regular columns
  - Scalability from low-cost to high-performance
- High-speed IO
- Clock management
- Interconnect matrix
  - Routing resources





Artix-7 Architecture Overview

## **Configurable Logic Block (CLB)**

- Primary resource for design in AMD FPGAs
  - Combinatorial functions
  - Flip-flops
- CLB contains two slices
- Connected to switch matrix for routing to other FPGA resources
  - Carry chain runs vertically



#### **Two Types of CLB Slices**

- SLICEM: Full slice
  - Can be used for logic, memory and shift register LUT
  - Has wide multiplexers and carry chain
- SLICEL: Logic and arithmetic only
  - LUT can only be used for logic (not memory)
  - Has wide multiplexers and carry chain



#### **Slice Resource**

- Four six-input Look-Up Tables (LUT)
- Multiplexers
- Carry chains
- Four flip-flops/latches
  - Four additional flip-flops
- The implementation tool will pack multiple slices in the same CLB if certain rules are followed



#### **6-Input LUT with Dual Output**

- Each LUT can be configured as two 5-input LUTs with common inputs
  - Minimal speed impact to a 6-input LUT
  - One or two outputs
- Any combinatorial function of six variables or two functions of five variables
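
A LUT is simply a small truth table addressed by its inputs. The following Python sketch illustrates the two modes listed above. The helper names (`make_init`, `lut_read`, `lut6_dual`) are hypothetical, and the split into a lower and an upper 32-bit half is an assumption made for illustration rather than a statement about the exact primitive semantics.

```python
def make_init(func, n_inputs):
    """Pack the truth table of `func` into an integer, one bit per input combination."""
    init = 0
    for addr in range(2 ** n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]  # bits[0] is input I0
        if func(*bits):
            init |= 1 << addr
    return init


def lut_read(init, inputs):
    """Look up one output bit: the inputs form the address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> addr) & 1


def lut6_dual(init64, inputs5):
    """Dual-output mode: two 5-input functions sharing the same five inputs.

    Assumption: the lower 32 table bits drive one output and the upper
    32 bits drive the other.
    """
    o5 = lut_read(init64 & 0xFFFFFFFF, inputs5)
    o6 = lut_read(init64 >> 32, inputs5)
    return o5, o6


# One 6-input function: parity (XOR) of all six inputs.
parity6 = make_init(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f, 6)

# Two 5-input functions in one table: AND in the lower half, OR in the upper half.
and5 = make_init(lambda a, b, c, d, e: a & b & c & d & e, 5)
or5 = make_init(lambda a, b, c, d, e: a | b | c | d | e, 5)
packed = (or5 << 32) | and5

print(lut_read(parity6, [1, 0, 1, 1, 0, 0]))
print(lut6_dual(packed, [1, 1, 1, 1, 0]))
```

The same 64 configuration bits serve both cases; only the way the inputs address them differs.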



#### **Slice Flip-Flops and Flip-Flop/Latches**

- Each slice has four flip-flop/latches (FF/L)
  - Can be configured as either flip-flops or latches
- Each slice also has four flip-flops (FF)



#### **Slice Flip-Flop Capabilities**

- All flip-flops are D type
  - Q output
- All flip-flops have a single clock input (CK)
- All flip-flops have an active high chip enable (CE)
- All flip-flops have an active high SR input
  - Input can be synchronous or asynchronous
  - Sets the flip-flop value to a predetermined state




## 7-Series FPGA I/O

- Wide range of voltages
  - 1.2V to 3.3V operation
- Wide I/O standards support
  - Single ended and differential
  - Referenced voltage inputs
  - 3-state capability
- Very high performance
  - Up to 1600 Mbps LVDS
  - Up to 1866 Mbps single-ended for DDR3
- Easy memory interfacing
  - Hardware support for QDRII+ and DDR3
- Digitally controlled impedance
- Power reduction features



## 7-Series Block RAM and FIFO

- Fully synchronous operation
  - Outputs are latched
- Optional internal pipeline register
  - Higher frequency operation
- Two independent ports access common data
  - Individual address, clock, write enable, clock enable
  - Independent data widths for each port



#### 7-Series Block RAM and FIFO

- Multiple configuration options
  - True dual-port, simple dual-port, single-port
- Integrated cascade logic
- Byte-write enable in wider configurations
- Integrated control for fast and efficient FIFOs
- Integrated 64/72-bit Hamming error correction



#### 7-Series DSP48E1 Slice




## **7-Series FPGAs Clock Management**

- Global clock buffers
  - High fanout clock distribution buffer
- Low-skew clock distribution
  - Regional clock routing
- Clock regions
  - Each clock region is 50 CLBs high and spans half the device
- Clock management tile (CMT)
  - One Mixed-Mode Clock Manager (MMCM) and one Phase-Locked Loop (PLL) in each CMT
  - Performs frequency synthesis, clock de-skew, and jitter-filtering
  - High input frequency range
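
Frequency synthesis in a clock manager follows the usual PLL relation: the input clock is divided, multiplied up to a VCO frequency, and each output is divided down again. The sketch below assumes this generic relation; the parameter names (`mult`, `div`, `out_div`) are illustrative, and the legal ranges for these values and for the VCO frequency depend on the device and speed grade and are not checked here.

```python
def mmcm_output_mhz(f_in_mhz, mult, div, out_div):
    """Return (f_vco, f_out) in MHz for a multiply/divide clock synthesizer."""
    f_vco = f_in_mhz * mult / div   # VCO frequency after feedback multiply and input divide
    f_out = f_vco / out_div         # each output has its own divider
    return f_vco, f_out


# Example: derive a 250 MHz clock from a 100 MHz reference.
f_vco, f_out = mmcm_output_mhz(100.0, mult=10, div=1, out_div=4)
print(f_vco, f_out)
```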



#### **Programming Model**

#### Hardware Description Languages (HDL)

- Verilog
- VHDL
- SystemVerilog
- Closer to the metal
  - Low level abstraction
  - Describe the behaviour

#### **High-Level Synthesis (HLS)**

- C/C++
- High level of abstraction
  - Write algorithms
- Vitis HLS generates the architecture
  - Guided by user directives

AMD Vivado



## **VHDL/Verilog counter**

#### VHDL

```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity counter is
    Port ( clk  : in  std_logic;
           rst  : in  std_logic;
           cout : out std_logic_vector(3 downto 0)
    );
end counter;

architecture rtl of counter is
    -- internal 4-bit count register
    signal counter_up : std_logic_vector(3 downto 0);
begin
    -- synchronous, active-high reset; increment on every rising clock edge
    process(clk)
    begin
        if (rising_edge(clk)) then
            if (rst = '1') then
                counter_up <= x"0";
            else
                counter_up <= counter_up + x"1";
            end if;
        end if;
    end process;

    -- drive the output port from the register (concurrent assignment)
    cout <= counter_up;
end rtl;
```

#### Verilog

```
module counter(
    input clk,
    input rst,
    output reg [7:0] count
    );
    // synchronous, active-high reset; 8-bit count wraps around after 255
    always @(posedge clk) begin
        if (rst)
            count <= 0;
        else
            count <= count + 1;
    end
endmodule
```

#### **Vitis HLS Vector addition**

#### What is AMD Vitis<sup>™</sup> HLS and HLS Benefits



## AI on FPGA

#### **DNNs and their Potential**

1. Requires little domain expertise
2. NNs are a "universal approximation function"
3. If you make it big enough and train it long enough
   - Can outperform humans and existing algorithms on specific tasks

Will not only increasingly replace other algorithms, but also...



Nature, Oct 2021

# ... solve previously unsolved problems

- ChatGPT, Copilot
- Stable diffusion
- Protein folding



Stable Diffusion Prompt: "Pencil sketch of an international group of semiconductor research scientists, studio Ghibli"

#### Spectrum of ML use case with very different requirements




#### **DNN Compute Requirements are Outpacing Moore's Law**




# Innovation is needed to provide the necessary performance scalability

### **Specialization Is #1 Industry Approach to Achieve Performance Scalability and Energy Efficiency**




#### Adaptive Computing or Dedicated Silicon for DPUs



- With increasing specialization of the device, potential sales volume decreases
  - Hard to amortize the increasing NRE costs involved in building ASSPs
  - FPGAs become more attractive
- Increasing specialization scales performance for both ASSPs and FPGAs

The opportunity for FPGAs lies in their ability to specialize

## Vitis AI - ML in general

### **Customization levels on Adaptive Computing**



Specialization/Performance/Efficiency

### Popular Approach: Matrix of Processing Engines (MPEs) Specializing for AI in general

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
  - At latency cost (latency ~ batch size)
- Specialized processing engines
  - Operators
  - ALU types
    - tensor-, matrix- or vector-based



- Customized for ML in general
  - Designed to run any DNN
- Works really well for computer vision and natural language processing (tens of thousands of inferences per second)
- Popular approach: Vitis AI (FPGA or AIE) as well as majority of AI accelerators

### **AMD** Vitis<sup>™</sup> AI Integrated Development Environment

A Complete AI Stack for Adaptable AMD Targets



Your Platform

### **AI Model Zoo – Expanding to Diverse AI Applications**

- A comprehensive AI model repository
  - Open and free to download for any user
  - State-of-the-art models from PyTorch, TF & TF2
  - Retrainable and applicable to various datasets and scenarios
  - Deployable on AMD FPGA and Versal Adaptive SoC
- New models in each release



[Public]

### **Extensive Application Coverage**

| Category                   | Models                                                                                                                                                   |
|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Classification             | Inception, Mobilenet, Resnet, VGG, EfficientNet, MLPerf ResNet50, OFA ResNet, Vision Transformer, Car Type classification, Car Color classification     |
| Detection                  | ssd_mobilenet, Yolov3, Yolov4, YoloX, Refinedet, EfficientDet, Pointpillars, Centerpoint, CLOCs, Pointpainting, Multi-taskv3, OFA-Yolo                   |
| Segmentation               | ENet, Semantic FPN, Salsanext, Salsanextv2, SOLO, Mobilenetv2, HardNet, Sa-Gate, Unet-Chaos-CT                                                           |
| Industrial Vision/Robotics | FADNet, Superpoint, PSMNet, HFNet, PMG                                                                                                                   |
| Medical Image              | RCAN, DRUnet, SESR, SSR, OFA-RCAN, C2D2lite                                                                                                              |
| NLP                        | Bert-base, Sentiment detection, Customer satisfaction, Open-information-extraction                                                                       |

### Video Analytics



- Face Recognition
- Face Quality
- Face ReID
- Person ReID
- FairMOT
- FaceMask Detection
- MoveNet
- Textmountain, OCR

### **Compiling for DPU - an XIR-based Toolchain**

- Xilinx Intermediate Representation (XIR)
  - Graph-based intermediate representation of the AI algorithms
  - Designed for compilation and efficient deployment of the DPU on the FPGA platform.
- XIR-based compilation flow
  - Transform the input model into XIR format
  - Break up the computing graph into subgraphs
  - Compile the DPU subgraphs into an xmodel file



Techniques for Further Specialization with Adaptive Compute Architectures

### **Specialization beyond MPEs**





### **Dataflow - Specializing for Individual Topologies**

- Hardware instantiates the topology as a dataflow architecture
  - Customize everything to the specifics of the given DNN, any operation, any connectivity
- Benefits:
  - Improved efficiency
  - Low fixed latency
- Scale performance & resources to meet the application requirements
  - If resources allow, we can completely unfold to create a circuit that performs inference at clock speed and thereby meets these new throughput requirements

Dataflow can scale performance to meet the application requirements






### **Customizing Arithmetic to Minimum Precision**

- Popular approach which reduces bits in the data representation of weights and activations while preserving accuracy
- Reducing precision shrinks hardware cost/ scales performance
  - Instantiate n-times more compute within the same fabric, thereby scale performance n-times
- Reduces memory footprint
  - NN model can stay on-chip => no memory bottlenecks
- With dataflow: every layer has dedicated compute resources, we can mix and match precision across layers
  - Exploit custom arithmetic at a greater degree than MPEs
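
To give a feel for what reducing precision does to the numbers themselves, here is a minimal plain-Python sketch of symmetric uniform quantization. It is an illustration only: the function names and the example weights are made up, and quantization-aware training libraries use more elaborate schemes (learned scales, other rounding modes) than this.

```python
def quantize(values, bits):
    """Symmetric uniform quantization of floats to signed `bits`-bit integer codes.

    Returns (codes, scale) such that value is approximately code * scale.
    """
    q_max = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.8, -0.35, 0.1, -0.9, 0.02]      # hypothetical weights
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    approx = dequantize(codes, scale)
    worst = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"{bits}b codes={codes} max error={worst:.3f}")
```

Fewer bits mean fewer distinct codes and a larger rounding error, which is why training has to take the quantization into account to preserve accuracy.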



$C = f(\text{size of accumulator}, \text{size of weight}, \text{size of activation})$, where $C$ is the complexity (bit products)

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |



Reducing precision saves resources, scales performance, and reduces the memory footprint. However, it requires quantization support in the training software.
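
The model sizes quoted for ResNet50 follow directly from the parameter count times the bits per weight. The sketch below reproduces that arithmetic approximately; the parameter count of about 25.6 million is an assumed round figure, so the results match the quoted sizes only to within rounding.

```python
def model_size_mb(num_params, bits_per_weight):
    """Weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumption: ResNet50 has roughly 25.6 million weights.
RESNET50_PARAMS = 25.6e6

for bits in (32, 8, 2, 1):
    print(f"{bits:>2}b: {model_size_mb(RESNET50_PARAMS, bits):6.1f} MB")
```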




### Sparsity

- DNNs are naturally sparse
- Sparse topologies result in irregular compute patterns which are difficult to accelerate on vector- or matrix-based execution units
- With streaming dataflow architectures, where every neuron and synapse is represented in the hardware, we can fully exploit this







Optimized Dataflow on FPGA

## Taking it to the Extreme: LogicNets




## LogicNets with Adaptive Computing






### How much do we get out of the different specializations?



**Goal:** Implement an **NN-based traffic classifier** delivering 100G **line-rate** throughput = 150 Mips (million inferences per second)

Latency sensitive (buffer 10s of MB/msec)
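
The figure of roughly 150 million classifications per second is consistent with classifying every minimum-size packet on a 100 Gb/s Ethernet link. The sketch below shows that arithmetic under stated assumptions that are not on the slide: 64-byte minimum frames and 20 bytes of per-packet wire overhead (preamble, start-of-frame delimiter, inter-frame gap).

```python
def packets_per_second(link_bps, frame_bytes=64, overhead_bytes=20):
    """Maximum packet rate on an Ethernet link.

    overhead_bytes covers preamble, start-of-frame delimiter and inter-frame gap
    (7 + 1 + 12 bytes), which occupy the wire but are not part of the frame.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


rate = packets_per_second(100e9)
print(f"{rate / 1e6:.1f} M packets/s")
print(f"{1e9 / rate:.2f} ns per packet")
```

Under these assumptions the classifier has only a few nanoseconds per packet, which is why a fully unfolded, pipelined circuit is attractive here.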

(UNSW-NB15 network data set)." 2015 military communications and information systems conference (MilCIS). IEEE, 2015.

### **Results – Implementations**

### Specialization




### **Results – Throughput and Latency**



### **Resource Cost - Compute, Memory**




### **Deep Network Intrusion Detection System (NIDS) Results**

- This example illustrates the trade-offs between specialization and performance and efficiency
- Custom arithmetic is effective for scaling performance, and dataflow for reducing latency
  - If application is amenable, custom arithmetic can meet extreme throughput requirements such as in NIDS
- Reduced precision, fine-granular sparsity & learned circuits can shrink the resource requirements despite speedup
- These are some of the opportunities which make most sense to exploit with FPGAs


### **General Introduction to FINN**


## **Project Mission and Key Techniques**



### **FINN – Project Mission**



- Custom Specialization
  - for creating high-throughput, ultra-low-latency DNN inference engines
- End-to-End
  - flow for the easy creation of specialized hardware architectures for FPGAs
- Open Source
  - for full transparency and flexibility to adapt to end-user applications, and for easy customer interactions

### **Two Key Techniques for Customization in FINN**



### Custom Precision: Few-bit Weights and Activations




### Customized Dataflow Processing versus More Generic Architectures

### Matrix of Processing Engines (MPE) (Vitis AI, TPUs, GPUs)



### Dataflow Architectures with FPGAs and FINN: Customized Data Path



## Matrix of Processing Engines (MPEs) Specializing for AI in General

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
  - At latency cost (latency ~ batch size)
- Customized for ML in general
  - Designed to run any DNN
  - Specialized processing engines
    - Operators
    - ALU types
- Works really well for computer vision and natural language processing
- Popular approach: Vitis AI (FPGA or AIE) as well as majority of AI accelerators



### **Dataflow - Specializing for Individual Topologies**

- Hardware instantiates the topology as a dataflow architecture
- Customize everything to the specifics of the given DNN, any operation, any connectivity
- Benefits
  - Improved efficiency
  - Low fixed latency
- Scale performance and resources to meet the application requirements






Dataflow can scale performance to meet the application requirements

### **Dataflow Processing: Scaling to Meet Performance and Resource Requirements**

Scale performance and resources to meet the application requirements. If resources allow, we unfold completely, creating a circuit for inference at clock speed.
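
The trade-off between unfolding and resources can be sketched with a simple cycle-count model for one matrix-vector layer. The model below is a simplified assumption, not the compiler's exact cost function: `pe` stands for the number of output rows computed in parallel and `simd` for the number of inputs consumed per row per cycle; the layer shape and the 200 MHz clock are hypothetical.

```python
import math


def mvau_cycles(rows, cols, pe, simd):
    """Clock cycles for one matrix-vector product of a rows x cols weight matrix.

    pe   : number of output rows computed in parallel
    simd : number of input columns consumed per row per cycle
    """
    return math.ceil(rows / pe) * math.ceil(cols / simd)


def throughput(clock_hz, cycles):
    """Matrix-vector products completed per second at the given clock."""
    return clock_hz / cycles


rows, cols = 64, 576   # hypothetical layer: 64 outputs, 576 inputs
for pe, simd in [(1, 1), (8, 16), (64, 576)]:
    cycles = mvau_cycles(rows, cols, pe, simd)
    multipliers = pe * simd   # parallel multipliers instantiated in hardware
    print(pe, simd, cycles, multipliers, throughput(200e6, cycles))
```

With `pe = rows` and `simd = cols` the layer is fully unfolded and produces one result per clock cycle, at the cost of instantiating one multiplier per weight.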

### **Customized Dataflow Processing versus More Generic Architectures**

### Matrix of Processing Engines (MPE) (Vitis AI, TPUs, GPUs)



- Customized for typical DNN operations
  - e.g., multiply accumulate
- Lower throughput (~10KRps)
- Flexibility through programming
- Applications: CV, Speech



- Customized/adapted for specific DNN topologies
- Streaming interfaces
- Specialization -> higher efficiency
- Lower latency (no intermediate buffering)
- Higher throughput (~100MRps)
- Flexibility through reconfiguration
- Applications: radio, networking, material science, particle physics – smaller DNNs

### Quantization

- Reducing precision shrinks hardware cost/scales performance
  - For integer datatypes, LUT cost proportional to both bitwidths in weight and activations (e.g., INT8 : INT1 ≈ 70×)
  - n-times more compute fits into the same fabric, thereby, scaling performance n-times or shrinking hardware cost accordingly
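
A first-order way to see the scaling is to count 1-bit by 1-bit partial products per multiplication. This is a deliberately crude model, stated here as an assumption: it ignores the adder tree and accumulator, so it gives 64× for INT8 versus 1-bit operands, in the same range as the ≈ 70× quoted above rather than an exact match.

```python
def bit_products(weight_bits, activation_bits):
    """Number of 1-bit x 1-bit partial products in one multiplication."""
    return weight_bits * activation_bits


baseline = bit_products(8, 8)
for w, a in [(8, 8), (4, 4), (2, 2), (1, 1)]:
    cost = bit_products(w, a)
    print(f"W{w}A{a}: {cost:2d} bit products, "
          f"{baseline // cost}x more multipliers in the same budget")
```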

### Energy

- Faster execution or smaller footprint  $\rightarrow$  less energy ( $E = P \cdot time$ )
- Using reduced precision operators saves energy
- Reduces memory footprint
  - ResNet50 @ 32b: 102.5 MB, ResNet50 @ 2b: 6.4 MB
  - NN model can stay on-chip  $\rightarrow$  no external memory access  $\rightarrow$  saves energy
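
As a rough illustration of why the memory footprint matters for energy, the sketch below estimates the energy needed to read all weights once from external DDR memory. It uses the 1300 pJ per 64-bit DDR3/4 access from the energy table in this section and the two ResNet50 sizes listed above; reading every weight exactly once per inference, with no caching or reuse, is a simplifying assumption.

```python
PJ_PER_64BIT_DRAM_ACCESS = 1300   # DDR3/4 figure taken from the energy table


def weight_fetch_energy_mj(model_mb, pj_per_access=PJ_PER_64BIT_DRAM_ACCESS):
    """Energy in millijoules to read all weights once, 64 bits (8 bytes) at a time."""
    accesses = model_mb * 1e6 / 8
    return accesses * pj_per_access * 1e-12 * 1e3


for label, size_mb in [("32b", 102.5), ("2b", 6.4)]:
    print(f"ResNet50 @ {label}: {weight_fetch_energy_mj(size_mb):.2f} mJ per full weight read")
```

Keeping the quantized model entirely on-chip avoids this external traffic altogether.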

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |





|      | Operation              | Picojoules per operation, 45 nm | Picojoules per operation, 7 nm | Ratio 45/7 |
|------|------------------------|---------------------------------|--------------------------------|------------|
| +    | Int 8                  | 0.03                            | 0.007                          | 4.3        |
| +    | Int 32                 | 0.1                             | 0.03                           | 3.3        |
| +    | BFloat 16              |                                 | 0.11                           |            |
| +    | IEEE FP 16             | 0.4                             | 0.16                           | 2.5        |
| +    | IEEE FP 32             | 0.9                             | 0.38                           | 2.4        |
| ×    | Int 8                  | 0.2                             | 0.07                           | 2.9        |
| ×    | Int 32                 | 3.1                             | 1.48                           | 2.1        |
| ×    | BFloat 16              |                                 | 0.21                           |            |
| ×    | IEEE FP 16             | 1.1                             | 0.34                           | 3.2        |
| ×    | IEEE FP 32             | 3.7                             | 1.31                           | 2.8        |
| SRAM | 8 KB SRAM              | 10                              | 7.5                            | 1.3        |
| SRAM | 32 KB SRAM             | 20                              | 8.5                            | 2.4        |
| SRAM | 1 MB SRAM <sup>1</sup> | 100                             | 14                             | 7.1        |
|      | GeoMean <sup>1</sup>   |                                 |                                | 2.6        |
|      |                        | Circa 45 nm                     | Circa 7 nm                     |            |
| DRAM | DDR3/4                 | 1300 <sup>2</sup>               | 1300 <sup>2</sup>              | 1.0        |
| DRAM | HBM2                   |                                 | 250-450 <sup>2</sup>           |            |
| DRAM | GDDR6                  |                                 | 350-480 <sup>2</sup>           |            |
Memory energy (SRAM and DRAM rows) is pJ per 64-bit access.

### **The FINN Framework**



### FINN Framework: From DNN to FPGA Deployment




### Brevitas: A PyTorch Library for Quantization-Aware Training



# FINN Compiler: Transforming a DNN into a Custom Dataflow Architecture

QONNX representation of the quantized DNN

#### FINN

- Uses an ONNX-based network description as intermediate representation (IR)
- Is a Python library of graph transformations
- Generates a synthesizable description of each layer (HLS/RTL) encapsulated as an IP block
- Produces a synthesized stitched IP block representing the complete network

# **FINN Compiler - Network preparation**



W (64×3×3×3)




# **FINN Passes - ONNX Graph Transformations**



Optimization, lowering, code generation... are all transformations
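
The idea that every compiler step is a graph transformation can be sketched as a small pass pattern: each pass rewrites the graph and reports whether it changed anything, and a driver re-applies it until nothing changes. The toy below is a self-contained illustration only. The graph is a flat list of operator names rather than an ONNX graph, and the class and function names (`RemoveIdentity`, `FuseMulAdd`, `transform`) are hypothetical.

```python
class RemoveIdentity:
    """Toy pass: drop the first 'Identity' node found in the graph."""

    def apply(self, graph):
        for i, node in enumerate(graph):
            if node == "Identity":
                return graph[:i] + graph[i + 1:], True   # graph changed, run again
        return graph, False


class FuseMulAdd:
    """Toy pass: fuse an adjacent 'Mul', 'Add' pair into one 'MulAdd' node."""

    def apply(self, graph):
        for i in range(len(graph) - 1):
            if graph[i] == "Mul" and graph[i + 1] == "Add":
                return graph[:i] + ["MulAdd"] + graph[i + 2:], True
        return graph, False


def transform(graph, transformation):
    """Apply one pass repeatedly until it reports no further change."""
    changed = True
    while changed:
        graph, changed = transformation.apply(graph)
    return graph


graph = ["Conv", "Identity", "Mul", "Add", "Identity", "Relu"]
for step in (RemoveIdentity(), FuseMulAdd()):
    graph = transform(graph, step)
print(graph)
```

Because each pass has the same interface, a compilation flow is just an ordered list of passes, and users can insert or replace individual steps.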


# **FINN Hardware Folding**



# FINN HLS/RTL Library - Parameterizable Kernel Library

- Kernels representing individual layers, a.k.a. Operators
- Flexible parametrization of
  - Degree of parallelism (output channels, input channels, kernel dimensions ...)
  - Datatypes (INT8, ternary, INT2, ...)
  - Behaviour (activation function)
- Instantiated and stitched by FINN compiler with AXI-Stream data path
- Implemented as synthesizable C++ (Vitis HLS) or SystemVerilog




# **FINN Compiler: IP Generation Flow**





# Deployment with **PYNQ** for Python Productivity

```
import numpy as np
from finn_examples import models

# instantiate the accelerator
accel = models.cnv_w2a2_cifar10()
# generate an empty numpy array to use as input
dummy_in = np.empty(accel.ishape_normal, dtype=np.uint8)
# perform inference and get output
dummy_out = accel.execute(dummy_in)
```



- Use PYNQ-provided Python abstractions and drivers
- User provides NumPy array input, calls driver, retrieves NumPy array output
  - Internally uses the PYNQ DMA driver to write/read NumPy arrays into I/O streams

- https://github.com/Xilinx/PYNQ
- https://github.com/Xilinx/finn-examples



# **FINN Infrastructure and Workflow**


# **The FINN Ecosystem and Software Stack**



**FINN** project landing page: <u>https://xilinx.github.io/finn</u>

- Quick Start, Documentation, Examples (Jupyter Notebooks)
- Links to Repos




# **FINN Workflow**



FINN and Brevitas can be used as co-design tools to implement your DNN use case on an FPGA.

- Train a quantized neural network in PyTorch using Brevitas
- Convert the trained QNN to a Vivado IP
- Fine-tune model to meet resource/performance targets
- Integrate generated IP into a larger design

But you can leverage the infrastructure beyond that...




# **Research in the FINN Ecosystem**

Infrastructure to enable research on advanced quantization schemes and analysis of quantized neural networks

Enables early design space exploration

Infrastructure for research on neural network hardware design



**System integration** 

**FINN library FPGA dataflow specific** HW components

Explore new optimized neural network layer implementations


# **Status and Outlook**



# **Status Summary**

#### Open-Source Adoption

- 2k+ GitHub stars summed across repos
- 250k+ Brevitas downloads
- ~200k QONNX downloads
- 17k+ FINN compiler downloads

#### Academic Results

- ACM TRETS 2020, FPL'2020, DFT'2019 Best Paper awards
- 1000+ citations on original paper

#### University Classes on computer architecture for ML with FINN

- Stanford, UNC Charlotte, NTNU in Norway, EPFL in Switzerland
- Regular tutorials, also available on YouTube: <u>https://www.youtube.com/watch?v=zw2aG4PhzmA</u>
#### Business units providing customer support

- Lead engineering team: Custom and Strategic Engineering, Dublin

"The FINN toolset is showing **huge potential using it in upcoming SICK products**. It is **easy to use** and with an **extraordinary performance** and very promising results. In the future, flexible implementations of ML in our products with FINN can be a great advantage and even replace static architectures as they are currently used. Thanks to the FINN team for the great cooperation" – Sick AG

- https://github.com/Xilinx/brevitas
- https://github.com/Xilinx/finn
- https://github.com/Xilinx/finn-hlslib
- https://github.com/Xilinx/finn-examples
- https://github.com/fastmachinelearning/qonnx

# **FINN Layer Support**

| Layer                   | Current Support                                | Outlook                 |
|-------------------------|------------------------------------------------|-------------------------|
| GEMM                    | $\checkmark$                                   |                         |
| Conv1D and Conv2D       | $\checkmark$                                   |                         |
| - Dense                 | $\checkmark$                                   |                         |
| - Depthwise             | $\checkmark$                                   |                         |
| - Separable (pointwise) | $\checkmark$                                   |                         |
| Elementwise (add, sub)  | $\checkmark$                                   | others easily doable    |
| Activation              | ReLU, SeLU                                     |                         |
| BatchNorm               | $\checkmark$ (absorbed by streamlining)        |                         |
| Pooling                 | $\checkmark$                                   |                         |
| Scale                   | $\checkmark$ (absorbed by streamlining)        |                         |
| Concat                  | $\checkmark$                                   |                         |
| Reshape                 | $\checkmark$ (must be streamlinable)           |                         |
| Transpose               | $\checkmark$ (must be streamlinable)           |                         |
| Clip by Value           | $\checkmark$ (absorbed by streamlining)        |                         |
| TransposeConv2D         | $\checkmark$                                   | optimized version (WIP) |
| UpSample                | $\checkmark$                                   |                         |
| DownSample              | $\checkmark$                                   |                         |

# **Brevitas Updates**

- Targets the entire AMD product range
- First-class support for integer datatypes
  - prototype support for minifloats (e.g., FP8)
- Supports PTQ and QAT
- Out-of-the-box support for distributed training (e.g., DDP, interoperability with HuggingFace Accelerate (PP))
- Interoperability with HuggingFace Transformers



# **FINN Compiler Updates**

FINN v0.10.1 Release

- Refactoring of operator instantiation infrastructure
  - FINN compiler used to assume that hardware blocks are synthesized from HLS code
  - New class hierarchy to facilitate integration of RTL components
  - Provide users with an interface to override the compiler's choice for HLS vs. RTL implementation on a per-layer basis
- **RTL component** library optimizing the implementations of critical layers
  - Efficient implementation of 4-bit and 8-bit compute leveraging DSP slices
  - Efficient implementation of multi-level thresholding
  - Eradication of (regularly long) HLS synthesis times for layers with an RTL option
- Compiler optimization pass for accumulator and weight bit width minimization
- Added board support in system integration flow
  - RFSoC 4x2 and U55C (contributed by University of Paderborn)

# **FINN Technical Roadmap: Capabilities**

- Operator Hardening
  - Revised RTL Thresholding by binary search
    - Ingestion of fp32 inputs
  - DSP-enabled Generalized Datatype Support
    - Efficient higher-precision integer compute: int4, int8, ..., int16
    - Small standard floating-point formats: float16, bfloat16
    - Custom MiniFloats: fp4, fp8
  - Internal clock pumping of DSP datapaths to increase their operational density
    - We are aiming at a standard operational frequency around 500 MHz
- New Operators
  - Optimized transposed convolution
  - Fallback float layers to mitigate streamlining limits


# FINN Technical Roadmap: Ease of Use

### FINN Library

- Refactoring of streamed layer interfaces
  - Packed flat `ap_uint<W>` $\rightarrow$ explicit `hls::vector<T, N>`
- Combining HLS and RTL components into one FINN Library

## FINN Examples

- MobileNet-v1 and VGG10-RadioML with efficient DSP compute
- New example: German Traffic Sign Recognition Benchmark

FINN-examples v0.0.7 Release

## Resources

- https://github.com/Xilinx/brevitas
- https://github.com/Xilinx/finn
- https://github.com/Xilinx/finn-hlslib
- https://github.com/Xilinx/finn-examples
- https://github.com/fastmachinelearning/qonnx
- https://amd.com/aup

**Q & A** 



# **COPYRIGHT AND DISCLAIMER**

#### ©2024 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
