#### Studies on track finding algorithms based on machine learning with GPU and FPGA

F.A. Di Bello on behalf of the ATLAS TDAQ collaboration


CTD, Mini-workshop on Real time Tracking







### Introduction

We aim to study the performance of ML track-finding algorithms in the HLT for muon triggers.

We look at commercially available FPGAs: Xilinx Alveo cards.

Comparative study between Alveo cards, GPU and CPU.

#### CPU and GPU Hardware used:

GPU: NVIDIA RTX A5000 board with 24 GB of GDDR6 memory

CPU: single-CPU server based on an AMD EPYC 7302 processor running at 2.9 GHz







How do commercial FPGAs perform?

#### F. A. Di Bello <sup>2</sup> U. Di Genova

# Hardware platforms

Two concepts for machine learning acceleration:

**Direct HLS implementation into FPGAs:**

Requires implementation of a neural network in VHDL or similar.

Significant effort to do so. A platform developed and maintained for the HEP community exists: [hls4ml](https://fastmachinelearning.org/hls4ml/)

The main advantage is that it is fast and suitable for a level-0 trigger.

**Use commercial accelerator cards that offer an integrated platform for deployment:**

Commercially available, no ad-hoc maintenance.

Dedicated hardware and related software to translate high-level Python code into code executable on the dedicated hardware.

Not as fast as the other approach, suitable for an HLT trigger.

Bound to the supported architecture.

### Vitis-AI overview

Xilinx offers several accelerator cards designed and built to accelerate ML algorithms (mostly CNNs): [xilinx](https://www.xilinx.com/products/boards-and-kits/alveo.html#overview)

The claim is that inference latency and throughput improve over standard CPUs and GPUs.

Improvements are also expected in terms of power consumption.






### The hardware tested

Several accelerator cards are commercially available:





[U250](https://www.xilinx.com/products/boards-and-kits/alveo/u250.html): designed for machine learning inference, video transcoding, and database search and analytics.

[U50](https://www.xilinx.com/products/boards-and-kits/alveo/u50.html): designed for financial computing, machine learning, computational storage, and data search and analytics.

[VCK5000](https://www.xilinx.com/products/boards-and-kits/vck5000.html): an AI development card, more versatile than the other two.


## The toy model used in this study

Figure: $\dot{\gamma}$-Cluster rate versus luminosity [$10^{34}$ cm$^{-2}$ s$^{-1}$]. *ATLAS NSW Preliminary*, $p+p$ $\sqrt{s} = 13.6$ TeV, year 2022, sTGC strips, Sector A06, 1st strip, 1st layer, run number 440199.

To speed up the R&D part of the study, a toy model is simulated.

The toy model is inspired by a muon system.

Four samples are produced with different noise rates: 2, 5, 10 and 15 kHz/cm$^2$.

The effect of correlated background is also emulated.

We now discuss the main reconstruction steps for tracking, clustering and pattern recognition, and their performance on CPU/GPU and FPGAs.




#### Cluster reconstruction

A cluster is formed from neighbouring hits

Typically, the weighted centroid of the cluster is used

$$
x_c = \frac{\sum_i x_i q_i}{\sum_i q_i}
$$
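The centroid formula above can be written in a few lines. This is a minimal sketch: the function name and the three-strip example values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def weighted_centroid(positions, charges):
    """Charge-weighted centroid x_c = sum(x_i * q_i) / sum(q_i)."""
    x = np.asarray(positions, dtype=float)
    q = np.asarray(charges, dtype=float)
    total = q.sum()
    if total <= 0:
        raise ValueError("cluster has no positive total charge")
    return float((x * q).sum() / total)

# Hypothetical three-strip cluster: strip centres and charges in arbitrary units.
x_c = weighted_centroid([10.0, 13.2, 16.4], [1.0, 2.0, 1.0])
```

For a symmetric charge profile such as this one, the centroid coincides with the position of the central strip.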

The known challenges with the standard approach are:

1. Depending on the incidence angle of the muon, a degradation is expected

2. "Correlated" background that originates from interaction with material prior to the active layers

ML is a good candidate to improve the clustering performance



# The Deep Neural Net approach

A deep neural network is used (similar to what is done in the inner silicon tracker, ref)

Inputs are:

1. The total number of hits belonging to the cluster
2. The charge of the strip with highest charge
3. The charge of its two left-right closest neighbours
4. The position of the strip with highest charge
5. The position of its two left-right closest neighbours

NB: if the cluster has fewer than 5 strips, zero-padding is employed
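A possible way to assemble these inputs is sketched below. It assumes a window of five strips (the highest-charge strip plus two neighbours on each side), which gives 11 features; the window size, the feature ordering and the function name are assumptions made for illustration, not the layout used in the study.

```python
import numpy as np

def cluster_features(positions, charges, half_window=2):
    """Build a fixed-length input vector from one cluster (assumed layout)."""
    x = np.asarray(positions, dtype=float)
    q = np.asarray(charges, dtype=float)
    peak = int(np.argmax(q))                 # strip with the highest charge
    width = 2 * half_window + 1
    q_win = np.zeros(width)                  # zero-padded charge window
    x_win = np.zeros(width)                  # zero-padded position window
    for slot in range(width):
        idx = peak - half_window + slot
        if 0 <= idx < len(q):
            q_win[slot] = q[idx]
            x_win[slot] = x[idx]
    neighbours = [s for s in range(width) if s != half_window]
    # Order: n_hits, peak charge, neighbour charges, peak position, neighbour positions
    return np.concatenate((
        [len(q)],
        [q_win[half_window]], q_win[neighbours],
        [x_win[half_window]], x_win[neighbours],
    ))

# Hypothetical three-strip cluster: the outer window slots are zero-padded.
features = cluster_features([10.0, 13.2, 16.4], [1.0, 2.0, 1.0])
```

Padding positions with zero is the simplest choice; a real detector coordinate system may need a different sentinel value.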



Standard regression, using the true crossing position of the muon as target



### Run the DNN inference with Alveo cards

Standard workflow for deployment onto the FPGA. Pruning tools are also available but were not tested.



Quantization converts 32-bit floating-point weights and activations to fixed-point INT8.

Many quantization approaches are available; here no retraining was performed, accepting a small degradation of performance.
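To illustrate what post-training quantization to fixed-point INT8 does to a set of weights, here is a generic NumPy sketch. It assumes a power-of-two scale chosen per tensor; it does not reproduce the Vitis-AI quantizer, and all function names are hypothetical.

```python
import numpy as np

def choose_frac_bits(values):
    """Largest number of fractional bits for which max |value| still fits in INT8."""
    max_abs = float(np.max(np.abs(values)))
    if max_abs == 0.0:
        return 7
    return int(np.floor(np.log2(127.0 / max_abs)))

def quantize_int8(values, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits and store as INT8."""
    scaled = np.round(np.asarray(values, dtype=np.float32) * (2.0 ** frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def dequantize(q, frac_bits):
    """Map INT8 codes back to floating point."""
    return q.astype(np.float32) / (2.0 ** frac_bits)

# Hypothetical weights; no retraining, so the rounding error is simply accepted.
weights = np.array([0.5, -0.25, 0.1, 0.9], dtype=np.float32)
frac_bits = choose_frac_bits(weights)
q_weights = quantize_int8(weights, frac_bits)
max_error = float(np.max(np.abs(dequantize(q_weights, frac_bits) - weights)))
```

The rounding error per weight is bounded by half a quantization step, which is the origin of the small performance degradation mentioned above.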




### Run performance evaluation with GPU/CPU

Usage of ONNX on CPU [link](https://onnx.ai/)

Usage of TensorRT [link](https://developer.nvidia.com/tensorrt) on GPU

Both offer high-performance deep learning inference, including an inference optimizer and a runtime that deliver low latency and high throughput for inference applications.
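Inference time on any of these backends can be measured with a small harness like the one below. It is a sketch under stated assumptions: `time_inference` is a hypothetical helper, and a NumPy dense layer stands in for the real model so that the example runs without a model file. With ONNX Runtime, `run` would instead wrap a call such as `session.run(None, {input_name: batch})`.

```python
import time
import statistics
import numpy as np

def time_inference(run, batch, n_warmup=10, n_repeat=100):
    """Return (median, worst) latency in ms of run(batch) after a warm-up phase."""
    for _ in range(n_warmup):
        run(batch)                           # warm-up calls are not timed
    samples_ms = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        run(batch)
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms), max(samples_ms)

# Stand-in model (assumption): a single dense layer evaluated with NumPy.
rng = np.random.default_rng(0)
weights = rng.normal(size=(11, 1)).astype(np.float32)

def toy_model(x):
    return x @ weights

batch = rng.normal(size=(64, 11)).astype(np.float32)
median_ms, worst_ms = time_inference(toy_model, batch)
```

Reporting the median together with the worst case is useful for a trigger, where the tail of the latency distribution matters as much as the typical value.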




### Inference time results

CPU is already well within the latency requirements

GPU with TensorRT improves a bit further

No significant gain is observed with the FPGA cards over either of these two architectures




### Pattern recognition

Pattern recognition is currently based on the Hough transform (HT)



Within each sector, it is possible to approximate the muon trajectory with a straight line and run 3 HTs

In a second step, a functional fit is run to extract the $p_T$ of the crossing muon
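For reference, a straight-line Hough transform can be sketched as follows. This is a generic textbook version in the $(\theta, \rho)$ parametrisation, not the implementation used in the trigger; the binning, the coordinate convention (layer position, strip coordinate) and the toy hits are assumptions.

```python
import numpy as np

def hough_line(hits, rho_max, n_theta=180, n_rho=200):
    """Vote in (theta, rho) space with rho = z*cos(theta) + x*sin(theta)."""
    pts = np.asarray(hits, dtype=float)      # shape (n_hits, 2): columns (z, x)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho = pts[:, 0:1] * np.cos(thetas) + pts[:, 1:2] * np.sin(thetas)
    bins = np.floor((rho + rho_max) / (2.0 * rho_max) * n_rho).astype(int)
    bins = np.clip(bins, 0, n_rho - 1)
    theta_idx = np.broadcast_to(np.arange(n_theta), bins.shape)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    np.add.at(acc, (theta_idx, bins), 1)     # one vote per hit and per angle
    i_theta, i_rho = np.unravel_index(np.argmax(acc), acc.shape)
    rho_centre = (i_rho + 0.5) * 2.0 * rho_max / n_rho - rho_max
    return float(thetas[i_theta]), float(rho_centre), int(acc[i_theta, i_rho])

# Toy event: eight hits on the line x = z + 2, plus two uncorrelated noise hits.
track = [(float(z), z + 2.0) for z in range(1, 9)]
noise = [(3.0, 15.0), (6.0, -12.0)]
theta, rho, votes = hough_line(track + noise, rho_max=20.0)
```

The accumulator maximum identifies the line that collects the most hits; hits voting for that bin are the track candidates passed to the subsequent fit.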

ML can help here as well, treating all layers simultaneously, profiting from their inter-correlations



### Alternative: a CNN approach

In order to test the algorithm with Alveo cards, a CNN was also developed

A CNN is not an optimal approach for pattern recognition tasks but it is useful for testing FPGA performance

An event display is translated into a 3000x16 pixel 2D image, and convolution/deconvolution operations are used

The output is an image whose intensity indicates the probability of the hits being associated to the muon
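The translation of an event into an image can be sketched as below. Only the 3000x16 size comes from the text; the assignment of the axes (3000 strips by 16 layers), the use of charge as pixel intensity and the hit format are assumptions.

```python
import numpy as np

N_STRIPS, N_LAYERS = 3000, 16   # image size from the text; axis meaning is assumed

def event_to_image(hits):
    """Fill a 2D image from (strip index, layer index, charge) triplets."""
    image = np.zeros((N_STRIPS, N_LAYERS), dtype=np.float32)
    for strip, layer, charge in hits:
        if 0 <= strip < N_STRIPS and 0 <= layer < N_LAYERS:
            image[strip, layer] += charge    # hits outside the image are ignored
    return image

# Hypothetical event: three valid hits and one hit outside the strip range.
hits = [(10, 0, 1.0), (11, 0, 2.0), (2999, 15, 0.5), (3000, 3, 9.0)]
image = event_to_image(hits)
```

The CNN output would then be an image of the same shape, with each pixel giving the probability that the corresponding hit belongs to the muon.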





![](_page_12_Figure_7.jpeg)


### Comparison CNN

CNN model successfully tested on CPU, GPU and several FPGAs

Overall, the CPU already meets the requirement imposed by the HLT latency

The largest improvement is seen with TensorRT on GPU. CPU load still needs to be studied, as well as power dissipation

![](_page_13_Figure_5.jpeg)


### Pattern recognition with an RNN

The ML models are limited to what Xilinx commercially releases

An RNN layer is expected to become available, but is not yet supported

Inputs are the outputs of the cluster DNN

More sophisticated ML approaches, such as GNNs and/or transformers, are not yet supported

In the RNN approach, consecutive layers are ordered based on their position.

Two possibilities: outside-in or inside-out
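The two orderings can be illustrated with a short sketch. It assumes that each layer contributes one feature vector (for instance the cluster DNN output) and that "inside-out" means increasing layer position; both the function name and this convention are assumptions.

```python
import numpy as np

def order_layers(layer_positions, layer_features, inside_out=True):
    """Sort per-layer feature vectors into the sequence fed to the RNN."""
    pos = np.asarray(layer_positions, dtype=float)
    feats = np.asarray(layer_features, dtype=float)
    order = np.argsort(pos)                  # increasing position: inside-out
    if not inside_out:
        order = order[::-1]                  # decreasing position: outside-in
    return feats[order]

# Hypothetical three-layer example with one feature per layer.
positions = [30.0, 10.0, 20.0]
features = [[3.0], [1.0], [2.0]]
seq_in_out = order_layers(positions, features, inside_out=True)
seq_out_in = order_layers(positions, features, inside_out=False)
```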

![](_page_14_Figure_7.jpeg)

## RNN performance results

A simplified Hough transform is implemented. The detector is split into three sectors and pattern recognition is performed in each of them separately.

The RNN model instead works on the whole detector.

The HT inference time was estimated to be around 1 ms

#### **Hough Transform RNN**

![](_page_15_Figure_5.jpeg)

![](_page_15_Figure_6.jpeg)


# RNN performance results

**M. Carnesale PhD Thesis**

Performance evaluated for different rates

Generally, a decrease of performance is seen at higher rates, as expected

The model is evaluated on CPU only. Versal boards do not support RNNs yet (support is expected). On GPU, TensorRT did not support the timing layer.

Using ONNX on CPU, the inference time for a single event is $O(1 \text{ ms})$, well within the latency requirement for an HLT trigger

![](_page_16_Figure_6.jpeg)


# Conclusions

DNN model for cluster position successfully tested on CPU, GPU and several Alveo cards.

A CNN model has been implemented to study the FPGA performance.

RNN model proposed as an alternative to HT methods. The inference time on CPU is already well within the design constraint.

RNN models are yet to be supported by Vitis-AI; support is expected in the next Vitis-AI versions.

CPU load and power consumption still need to be studied, as well as cost

![](_page_17_Picture_8.jpeg)

### Workflow to use Alveo cards

![](_page_18_Figure_1.jpeg)


# Varying the batch size

The main point is that TensorRT does not work with dynamic batch sizes
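When an inference engine only accepts one fixed batch size, a common workaround is to pad the last incomplete batch and discard the padded outputs. The sketch below illustrates this idea only; it is not taken from the study, `run_fixed_batch` is a hypothetical helper, and a NumPy function stands in for the engine.

```python
import numpy as np

def run_fixed_batch(run, inputs, batch_size):
    """Evaluate a non-empty input array with a model that needs a fixed batch size."""
    x = np.asarray(inputs)
    outputs = []
    for start in range(0, len(x), batch_size):
        chunk = x[start:start + batch_size]
        n_valid = len(chunk)
        if n_valid < batch_size:
            # Pad the last chunk with zeros up to the fixed batch size
            pad = np.zeros((batch_size - n_valid,) + chunk.shape[1:], dtype=chunk.dtype)
            chunk = np.concatenate([chunk, pad])
        outputs.append(run(chunk)[:n_valid])  # drop outputs of padded rows
    return np.concatenate(outputs)

def fixed_model(batch):
    """Stand-in engine (assumption) that rejects any batch size other than 4."""
    assert batch.shape[0] == 4
    return batch.sum(axis=1, keepdims=True)

events = np.ones((10, 3), dtype=np.float32)
results = run_fixed_batch(fixed_model, events, batch_size=4)
```

The cost of padding is wasted compute on the last batch, which is why the batch size is worth scanning as a parameter.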

![](_page_19_Figure_2.jpeg)

### Models we are interested in

#### RNN:

![](_page_20_Picture_27.jpeg)

![](_page_20_Picture_28.jpeg)

Total params: 41,309

#### CNN:

Trainable params: 41,309<br>
Non-trainable params: 0

### ATLAS HL-LHC trigger system

The work here is relevant for future Run 3 operations, but most importantly for triggering at HL-LHC

High luminosity and pile-up make trigger decisions much more challenging

We mostly consider the muon system as use case, for future applications to muon tracking

![](_page_21_Figure_4.jpeg)

![](_page_21_Figure_5.jpeg)

[TDR trigger HL-LHC](https://cds.cern.ch/record/2802799?ln=en)