

## Fast inference on FPGAs

Javier Duarte, Sergo Jindariani, Ben Kreis, Ryan Rivera, Nhan Tran (Fermilab)

Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers, **Vladimir Loncar** (CERN)

Edward Kreinar (Hawkeye 360)

Phil Harris, Song Han, Dylan Rankin (MIT)

Zhenbin Wu (University of Illinois at Chicago)

Giuseppe di Guglielmo (Columbia University)

CERN



## Challenges in LHC

At the LHC proton beams collide at a frequency of 40 MHz

Extreme data rates of O(100 TB/s)

"Triggering" - Filter events to reduce data rates to manageable levels





#### **DATA FLOW**

40 MHz in / 100 kHz out ⇒ absorbs 100s of TB/s

Trigger decision to be made in $\sim 10\,\mu$s

FPGAs / Hardware implemented
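The arithmetic behind these numbers can be sketched in a few lines. The rates and the latency are the ones quoted above; the function and variable names are illustrative only, and the "events in flight" figure is a simple rate-times-latency estimate rather than a description of the actual detector buffers.

```python
# Numbers quoted on the slide; names are illustrative.
COLLISION_RATE_HZ = 40e6    # 40 MHz collision rate
L1_OUTPUT_RATE_HZ = 100e3   # 100 kHz kept by the first trigger stage
L1_LATENCY_S = 10e-6        # ~10 microseconds per trigger decision


def rejection_factor(rate_in_hz, rate_out_hz):
    """How many incoming events there are for each event kept."""
    return rate_in_hz / rate_out_hz


def events_in_flight(rate_in_hz, latency_s):
    """Events arriving while a single decision is still pending."""
    return rate_in_hz * latency_s


l1_rejection = rejection_factor(COLLISION_RATE_HZ, L1_OUTPUT_RATE_HZ)
l1_buffered = events_in_flight(COLLISION_RATE_HZ, L1_LATENCY_S)

print(f"keep roughly 1 in {l1_rejection:.0f} events")
print(f"about {l1_buffered:.0f} events arrive during one decision")
```

Under these assumptions the first stage has to be pipelined: a new event arrives long before the previous decision is finished.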



#### **DATA FLOW**

100 kHz in / 1 kHz out ⇒ ~500 KB/event

Processing time ~ 300 ms

Software implemented on CPUs



#### **DATA FLOW**

Output: max. 1 MB/event

Processing time ~ 20 s

Software implemented on CPUs



**Deploy ML algorithms very early** 

**Challenge: strict latency constraints!** 

# hls4ml: high level synthesis for machine learning

User-friendly tool to automatically build and optimize DL models for FPGAs:

- Reads as input models trained with standard DL libraries
- Uses Xilinx HLS software
- Comes with implementation of common ingredients (layers, activation functions, binary NN ...)





### On-chip weights

- Much faster access times
- For longer latency applications, weight storage in on-chip block memory is possible
- No loading weights from external source (e.g. DDR, PCIe)
- Not reconfigurable without reprogramming device

User controllable trade-off between resource usage and latency/throughput

Tuned via "reuse factor"

### Fully extensible through API

Custom layers, custom HLS code, user-defined model transformations...



## A handle to control resource usage and latency

Can be specified per-layer

### **Reuse = 1**: Fully unroll everything

Fastest, most resource intensive

### **Reuse > 1**: reuse one DSP for several operations

- Increases latency, but uses fewer resources
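A minimal software model of the reuse factor, assuming a dense layer in which every weight needs one multiplication. The function names are hypothetical and this is not the hls4ml HLS code. It only shows that the same result can be computed with fewer multipliers by spreading the products over several sequential passes.

```python
import math


def multipliers_needed(n_in, n_out, reuse_factor):
    """Parallel multipliers required when each one is reused `reuse_factor` times."""
    return math.ceil(n_in * n_out / reuse_factor)


def dense_with_reuse(x, weights, reuse_factor):
    """Matrix-vector product computed in at most `reuse_factor` sequential passes.

    weights[i][j] connects input i to output j.
    """
    n_in, n_out = len(weights), len(weights[0])
    products = [(i, j) for i in range(n_in) for j in range(n_out)]
    per_pass = math.ceil(len(products) / reuse_factor)
    out = [0.0] * n_out
    for start in range(0, len(products), per_pass):
        # In hardware, the products in this slice would run in parallel.
        for i, j in products[start:start + per_pass]:
            out[j] += x[i] * weights[i][j]
    return out


x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
fast = dense_with_reuse(x, w, reuse_factor=1)  # one pass, 6 multipliers
slow = dense_with_reuse(x, w, reuse_factor=3)  # three passes, 2 multipliers
print(fast, slow)
print(multipliers_needed(16, 64, 1), multipliers_needed(16, 64, 4))
```

Both calls give the same output. Only the number of passes (latency) and the number of multipliers (resources) change.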







# hls4ml: exploiting FPGA hardware

Parallelization (reuse): Control the inference latency versus utilization of FPGA resources

**Quantization**: Reduce precision of the calculations

**Compression**: Drop unnecessary weights (zero or close to zero) to reduce the number of DSPs used

#### 70% compression ~ 70% fewer DSPs
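The link between compression and DSP count can be illustrated with a small magnitude-pruning sketch. It assumes a fully unrolled design in which every non-zero weight occupies one multiplier. The weight values are made up, and the pruning rule (drop the smallest magnitudes) is one common choice, not necessarily the procedure used for the results above.

```python
def prune_smallest(weights, fraction):
    """Set the `fraction` of weights with the smallest magnitude to zero."""
    n_drop = round(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda k: abs(weights[k]))
    dropped = set(order[:n_drop])
    return [0.0 if k in dropped else w for k, w in enumerate(weights)]


def multiplications_needed(weights):
    """A multiplication by zero can be removed from the design entirely."""
    return sum(1 for w in weights if w != 0.0)


# Made-up weights for illustration.
weights = [0.5, -0.01, 0.02, 1.2, -0.8, 0.003, 0.001, 0.04, -0.3, 0.07]
pruned = prune_smallest(weights, 0.7)

print(multiplications_needed(weights), "multiplications before pruning")
print(multiplications_needed(pruned), "multiplications after pruning")
```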





#### Scan integer bits

fixed to 8 fractional bits

Full performance at 6 fractional bits

[Figure: hls4ml, g tagger, versus fixed-point precision from `<10,2>` to `<40,32>`]

#### Scan fractional bits

fixed to 6 integer bits
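The effect of a given precision can be emulated in software. The sketch below assumes the precision labels follow the `ap_fixed<total bits, integer bits>` convention of Xilinx HLS, with the sign bit counted among the integer bits. Rounding toward minus infinity and saturating on overflow are modelling choices made for this example; `to_fixed` is a hypothetical helper, not an hls4ml function.

```python
import math


def to_fixed(value, total_bits, integer_bits):
    """Emulate a signed fixed-point value with the given bit widths."""
    frac_bits = total_bits - integer_bits
    step = 2.0 ** -frac_bits                  # smallest representable increment
    lowest = -(2.0 ** (integer_bits - 1))     # sign bit is one of the integer bits
    highest = 2.0 ** (integer_bits - 1) - step
    quantized = math.floor(value / step) * step
    return min(max(quantized, lowest), highest)


# 6 integer bits, 8 fractional bits: fine resolution, range of about +/-32.
print(to_fixed(0.3, 14, 6))
# Too few fractional bits: the value is coarsely rounded.
print(to_fixed(0.3, 8, 6))
# Too few integer bits for the value: the result saturates.
print(to_fixed(100.0, 14, 6))
```

Scanning `integer_bits` and `total_bits` over the weights and activations of a trained model is one way to reproduce this kind of precision scan offline.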





### Supported architectures:

- MLP
  - Numerous activation functions
  - Support for very large layers



#### Binary and Ternary MLP

- 1- or 2-bit precision with limited loss of performance
- Computation without using DSPs, only LUTs
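One widely used way to avoid multipliers in a binary network is to replace the dot product with XNOR and a bit count. The sketch below illustrates that general trick for values in {+1, -1}. It is an assumption that this matches the slide's intent, and it is not taken from the hls4ml source.

```python
def encode(values):
    """Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1)."""
    word = 0
    for k, v in enumerate(values):
        if v == 1:
            word |= 1 << k
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors using only logic operations."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask    # XNOR: 1 where the signs match
    matches = bin(agree).count("1")      # population count
    return 2 * matches - n               # matches contribute +1, mismatches -1


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
reference = sum(p * q for p, q in zip(a, b))
result = binary_dot(encode(a), encode(b), len(a))
print(reference, result)
```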

#### Convolutional NNs

- 1D and 2D with pooling
- Currently limited to very small layers



#### Other:

- Batch normalization
- Merge layers (concatenation, addition, subtraction, etc.)



### **Convolutional layers**

Support for "large" convolutional layers



- Express convolution as matrix multiplication
- im2col algorithm
- Reuse "large" matrix multiplication algorithm from MLP
- Quantized (binary and ternary) weights
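The idea of expressing a convolution as a matrix multiplication can be sketched as follows. The sketch assumes a single-channel 2D input, stride 1 and no padding, and the function names are illustrative. Each image patch is unfolded into one row, so the convolution becomes a product between the patch matrix and the flattened kernel.

```python
def im2col(image, k):
    """Unfold every k x k patch of a 2D image into one row."""
    height, width = len(image), len(image[0])
    rows = []
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            rows.append([image[r + dr][c + dc] for dr in range(k) for dc in range(k)])
    return rows


def conv2d_as_matmul(image, kernel):
    """Convolution (deep-learning convention, no kernel flip) via im2col."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, k)
    # Matrix-vector product: one dot product per output pixel.
    flat = [sum(p * q for p, q in zip(patch, flat_kernel)) for patch in patches]
    out_width = len(image[0]) - k + 1
    return [flat[i:i + out_width] for i in range(0, len(flat), out_width)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
output = conv2d_as_matmul(image, kernel)
print(output)
```

Once the layer has this form, any matrix multiplication routine, including one with a reuse factor, can be applied to it.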

### Depthwise separable convolution

- First step: depthwise convolution
- Second step: pointwise convolution
- For 3x3 kernels this can yield 8-9 times fewer multiplications
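The saving can be checked by counting multiplications per output pixel. The channel counts below are arbitrary examples, and the function names are illustrative.

```python
def standard_mults(k, c_in, c_out):
    """Multiplications per output pixel for a standard k x k convolution."""
    return k * k * c_in * c_out


def separable_mults(k, c_in, c_out):
    """Multiplications per output pixel for depthwise followed by pointwise."""
    depthwise = k * k * c_in    # one k x k filter per input channel
    pointwise = c_in * c_out    # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 256     # example sizes, not from the slides
ratio = standard_mults(k, c_in, c_out) / separable_mults(k, c_in, c_out)
print(f"{ratio:.2f} times fewer multiplications")
```

The ratio simplifies to `k*k*c_out / (k*k + c_out)`, which approaches 9 for a 3x3 kernel as the number of output channels grows.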





Credit: Jennifer Ngadiuba, Sioni Paris Summers

Images source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728



#### **Boosted decision trees**

Q4 2019

- BDTs have been popular for a long time in HEP reconstruction and analysis
- Suitable for highly parallel implementation in FPGAs
- Implementation in hls4ml optimised for low latency
- No 'if/else' statement in FPGAs → evaluate all options and select the right outcome
  - Compare all features against thresholds, chain together outcomes to make the 'tree'
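A branch-free evaluation of a single tree can be sketched as below. The tree (depth 2, three nodes, four leaves) and all its numbers are made up for illustration, and the layout is an assumption rather than the hls4ml data structure. Every comparison is computed regardless of the path, and the outcomes are then combined to select one leaf.

```python
# One tree of depth 2 stored as flat arrays (all values are made up).
FEATURE = [0, 1, 2]                 # feature tested at the root, left node, right node
THRESHOLD = [0.5, 1.0, -2.0]
LEAF_VALUE = [0.1, 0.2, 0.3, 0.4]   # leaves from left to right


def evaluate_tree(x):
    # Step 1: evaluate every comparison, whether or not its node lies on the path.
    go_right = [x[f] > t for f, t in zip(FEATURE, THRESHOLD)]
    # Step 2: chain the outcomes into one activation flag per leaf.
    active = [
        (not go_right[0]) and (not go_right[1]),
        (not go_right[0]) and go_right[1],
        go_right[0] and (not go_right[2]),
        go_right[0] and go_right[2],
    ]
    # Step 3: exactly one flag is set, so a sum of products selects the leaf.
    return sum(value * int(flag) for value, flag in zip(LEAF_VALUE, active))


print(evaluate_tree([0.0, 2.0, 0.0]))
print(evaluate_tree([1.0, 0.0, -3.0]))
```

In hardware the comparisons in step 1 run in parallel and steps 2 and 3 reduce to simple logic, which is consistent with the absence of DSPs and BRAMs in the test reported on this slide.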

Test for model with 16 inputs, 5 classes, 100 trees, depth 3 on VU9P FPGA:

- 4% LUTs, 1% FFs (0 DSPs, 0 BRAMs)
- 25 ns latency with II=1

Credit: Sioni Paris Summers



#### Recurrent neural networks

Q4 2019

Simple RNN, LSTM, GRU

#### Two implementations:

- Fully unrolled:
  - Latency optimized with II=1
  - Large resource usage
- **Static:** same resources used for weights and multiplications
  - N (N=latency of layer) copies can go through at the same time
  - Latency is larger and II limited to clock time for each layer

Supports small networks → scale it up using "large" matrix multiplication algorithm





Credit: Phil Harris, Nhan Tran, Richa Rao



### **Graph networks**

H1 2020

Natural solution for reconstructing the trajectories of charged particles



#### Preliminary implementation:

- Implemented as an HLS project, not supported in conversion tools
- Successfully tested a small example with 4 tracks, 4 layers
- Major effort required to scale up to larger graphs

Credit: Javier Duarte and Kazi Asif Ahmed Fuad



#### Multi-FPGA inference

H1 2020

- Main idea: place layers onto multiple FPGAs and pipeline the execution

### Leverage Galapagos framework (https://github.com/tarafdar/galapagos)

- "...a framework for creating network FPGA clusters in a heterogeneous cloud data center."

- Given a description of how a group of FPGA kernels are to be connected, creates a ready-to-use network device

- Possible to use MPI programming model



Credit: Naif Tarafdar, Phil Harris



### **Training on FPGAs**

H2 2020

- Build on top of multi-FPGA idea

Use synthetic gradients (SG) to remove the update lock

Individual layers to learn in isolation

### Train SGs by another NN

- Each SG generator is only trained using the SGs generated from the next layer
- Only the last layer trains on the data







Autoencoders

GarNet graph NN (https://arxiv.org/abs/1902.07987)

Alternate HLS implementations

- Intel HLS
- Mentor Catapult HLS

Inference engine for CPUs based on hls4ml

Targeting integration into CMSSW

Probably more...

## Conclusions

**hls4ml** - software package for translation of trained neural networks into synthesizable FPGA firmware

- Tunable trade-off between resource usage and latency/throughput
- Fast inference times, O(1µs) latency

#### More information:

- Website: https://hls-fpga-machine-learning.github.io/hls4ml/
- Paper: https://arxiv.org/abs/1804.06913
- Code: https://github.com/hls-fpga-machine-learning/hls4ml

