

# **Versal ACAP Processing for HL-LHC Calorimeters Signal Reconstruction**

27th Conference on Computing in High Energy and Nuclear Physics (CHEP) 24th October, 2024

Francisco Hervas, Alberto Valero, Luca Fiorini, Hector Gutierrez





Funded by the European Union NextGenerationEU









# **Table of contents**





HL-LHC calorimeter signal reconstruction





## LHC calorimeter read-out

- In the LHC, Bunch Crossings (BC) happen at 40 MHz (every 25 ns)
- The processing happens after the Level-1 Trigger, at 100 kHz (every 10 µs)
- Signals are processed online using the **Optimal Filtering** (OF) algorithm
  - The processing is performed by **Digital Signal Processors** (DSPs)
  - Therefore, it is **sequential**
  - Fixed point arithmetic is used
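As a rough illustration of the read-out described above, the sketch below computes an Optimal Filtering amplitude as a weighted sum of digitized samples, $A = \sum_i a_i S_i$, using a sequential multiply-accumulate in integer (fixed point) arithmetic. The weights, the number of samples, and the 12 fractional bits are illustrative assumptions, not the values used in the detector; the weights are chosen to sum to zero so that a constant pedestal cancels.

```python
def to_fixed(x, frac_bits):
    """Quantize a real number to a signed integer with `frac_bits` fractional bits."""
    return int(round(x * (1 << frac_bits)))


def optimal_filter_amplitude(samples, weights_fx, frac_bits):
    """Sequential multiply-accumulate A = sum_i a_i * S_i in integer arithmetic."""
    acc = 0
    for sample, weight in zip(samples, weights_fx):
        acc += sample * weight          # one MAC per sample, as a DSP would do
    return acc >> frac_bits             # drop the fractional bits of the weights


FRAC_BITS = 12                                   # assumed word format
WEIGHTS = [-0.25, 0.0, 0.5, 0.0, -0.25]          # hypothetical weights, sum to zero
WEIGHTS_FX = [to_fixed(w, FRAC_BITS) for w in WEIGHTS]

# Hypothetical ADC samples: pedestal of 50 counts with a 200-count pulse in the centre.
samples = [50, 50, 250, 50, 50]
amplitude = optimal_filter_amplitude(samples, WEIGHTS_FX, FRAC_BITS)
print(amplitude)
```

Because the loop handles one sample per iteration, the cost grows with the number of samples, which is the sequential behaviour the slide attributes to DSP-based processing.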








# **HL-LHC calorimeter signal reconstruction**

- In the HL-LHC, signals will be reconstructed for every BC at 40 MHz (25 ns) before the trigger
  - Signals need to be processed by FPGAs due to their low and deterministic latency for signal synchronization
  - Multiple simultaneous signals will produce pile-up
- There is a need for more sophisticated algorithms for signal reconstruction
  - **Deep learning** algorithms (**Neural Networks**)



**Francisco Hervas Alvarez** 

# **1** Introduction

## Real time processing

- Real time applications are a better fit for **FPGAs**
- Algorithm replication must fit within the available area (occupancy)
- Latency must be cycle accurate in order to interconnect modules
- **Real time** requirements:
  - Maximum algorithms in parallel: **162**
  - Maximum sample latency: 200 ns
  - Sample frequency: 40 MHz

## **KU115 Chip resources:**

| Resource        | Available |
|-----------------|-----------|
| FFs             | 1326720   |
| LUTs            | 663360    |
| Block RAMs (Mb) | 75.9      |
| DSP Slices      | 5520      |
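The real time requirements and the KU115 resources can be turned into a simple budget. The sketch below is a back-of-the-envelope calculation, assuming that all 162 algorithm instances share a single KU115 and that no resources are reserved for infrastructure logic; both assumptions are simplifications and not statements from the slides.

```python
BC_FREQUENCY_HZ = 40e6            # sample frequency from the requirements
MAX_LATENCY_NS = 200              # maximum sample latency
MAX_PARALLEL_ALGORITHMS = 162     # maximum algorithms in parallel

# KU115 resources as listed in the table (Block RAM omitted: it is given in Mb).
KU115 = {"FFs": 1326720, "LUTs": 663360, "DSP slices": 5520}

bc_period_ns = 1e9 / BC_FREQUENCY_HZ                 # time between bunch crossings
latency_budget_bc = MAX_LATENCY_NS / bc_period_ns    # latency expressed in BCs

# Upper bound on the resources each algorithm instance may use (integer share).
per_algorithm = {
    name: total // MAX_PARALLEL_ALGORITHMS for name, total in KU115.items()
}

print(f"BC period: {bc_period_ns} ns, latency budget: {latency_budget_bc} BCs")
for name, share in per_algorithm.items():
    print(f"{name}: at most {share} per algorithm")
```

Under these assumptions the latency budget corresponds to 8 bunch crossings, and each instance has to stay within roughly 8189 FFs, 4094 LUTs and 34 DSP slices.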





A. Valero\*, F. Carrió, L. Fiorini, A. Cervelló, D. Hernandez and A. Ruiz Martinez, "The PreProcessor module for the ATLAS Tile calorimeter at the HL-LHC", Instituto de Física Corpuscular, University of Valencia-CSIC, Valencia, Spain

**CHEP 2024**



# 2 Methodology

# Versal ACAP device for algorithm testing

- Versal ACAP System on Chip (SoC) for algorithm testing
  - PL FPGA: Algorithm hardware acceleration and data movement between memories
  - **PS CPU:** Management and control of the accelerators
  - Memory: Data buffering
  - Connectivity: External data reception and data transmission

## VC1902 Chip resources:

| Category     | Resource          |
|--------------|-------------------|
| CPU          | ARM A72 Dual core |
| CPU          | ARM R5F Dual core |
| Memory       | OCM 256 KB        |
| Connectivity | Ethernet x2       |
| Connectivity | USB 2.0 x2        |
| Connectivity | UART x2           |
| Connectivity | SPI x2            |
| Connectivity | I2C x2            |
| Connectivity | CAN-FD x2         |

| Resource    | Quantity |
|-------------|----------|
| AI Engines  | 400      |
| DSP Engines | 1968     |
| LUTs        | 899840   |
| NoC Ports   | 28       |
| DDR MC      | 4        |
| PCIe – DMA  | 1        |






## Setup development

- Driver implementation in host CPU for communication with **XDMA**
- Driver implementation in device CPU for managing internal DMAs
- NoC configuration for internal communication
- Interrupt system development
- Multiple cores executing algorithms












# **Modified perceptron RTL**

- Written in VHDL
- Synthesized and implemented in Vivado
- Fixed point arithmetic
- Activation function tanh(x) quantized over 5000 values
- Latency of 34 clock cycles
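To make the quantized activation concrete, the sketch below models in Python a single neuron with fixed point arithmetic and a tanh look-up table of 5000 entries. It is only a behavioural illustration: the input range `X_MAX`, the 12 fractional bits, and the plain weighted-sum structure are assumptions, and whatever modification the slide title refers to is not reproduced here.

```python
import math

LUT_SIZE = 5000        # number of quantized tanh values, as stated on the slide
X_MAX = 4.0            # assumed LUT input range [-X_MAX, X_MAX]
FRAC_BITS = 12         # assumed number of fractional bits
SCALE = 1 << FRAC_BITS
X_MAX_FX = int(X_MAX * SCALE)

# Precompute tanh on a uniform grid, stored as fixed point integers.
TANH_LUT = [
    round(math.tanh(-X_MAX + 2 * X_MAX * i / (LUT_SIZE - 1)) * SCALE)
    for i in range(LUT_SIZE)
]


def tanh_fixed(x_fx):
    """Look up tanh for a fixed point input, saturating outside the LUT range."""
    x_fx = max(-X_MAX_FX, min(X_MAX_FX, x_fx))
    # Map the input to the nearest LUT index using integer arithmetic only.
    index = ((x_fx + X_MAX_FX) * (LUT_SIZE - 1) + X_MAX_FX) // (2 * X_MAX_FX)
    return TANH_LUT[index]


def perceptron_fixed(inputs_fx, weights_fx, bias_fx):
    """Weighted sum plus bias in integer arithmetic, followed by the tanh LUT."""
    acc = bias_fx << FRAC_BITS              # align bias with the products
    for x, w in zip(inputs_fx, weights_fx):
        acc += x * w                        # products carry 2 * FRAC_BITS
    return tanh_fixed(acc >> FRAC_BITS)     # back to FRAC_BITS before the LUT


def to_fixed(x):
    return round(x * SCALE)


# Hypothetical inputs, weights and bias: 0.5 * 1.0 - 0.25 * 2.0 + 0.1 = 0.1
y_fx = perceptron_fixed([to_fixed(0.5), to_fixed(-0.25)],
                        [to_fixed(1.0), to_fixed(2.0)],
                        to_fixed(0.1))
print(y_fx / SCALE, math.tanh(0.1))
```

With this table size the look-up error stays well below the grid spacing of the table, so the fixed point output tracks the floating point tanh closely within the assumed input range.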

VC1902:

| Resource | Utilization | Available | Utilization (%) |
|----------|-------------|-----------|-----------------|
| LUT      | 72          | 899840    | 0.01%           |
| FF       | 203         | 1799680   | 0.01%           |
| BRAM     | 4           | 967       | 0.41%           |
| DSP58    | 6           | 1968      | 0.30%           |
| BUFG     | 1           | 980       | 0.10%           |

KU115:

| Resource | Utilization | Available | Utilization (%) |
|----------|-------------|-----------|-----------------|
| LUT      | 65          | 663360    | 0.01%           |
| FF       | 203         | 1326720   | 0.02%           |
| BRAM     | 3.5         | 2160      | 0.16%           |
| DSP48E2  | 6           | 5520      | 0.11%           |
| BUFG     | 1           | 1248      | 0.08%           |
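The utilization figures above can be scaled to estimate how many copies of the perceptron fit in each device. The sketch below is a naive estimate: it assumes resources scale linearly with the number of copies, leaves out the clock buffer (which would normally be shared), and ignores routing, timing closure and infrastructure logic, so the real limit is expected to be lower.

```python
MAX_PARALLEL_ALGORITHMS = 162     # real time requirement from the introduction

# (used by one perceptron, available in the device), taken from the tables.
DEVICES = {
    "VC1902": {"LUT": (72, 899840), "FF": (203, 1799680),
               "BRAM": (4, 967), "DSP": (6, 1968)},
    "KU115": {"LUT": (65, 663360), "FF": (203, 1326720),
              "BRAM": (3.5, 2160), "DSP": (6, 5520)},
}


def replicated_utilization(device, copies):
    """Utilization in percent if `copies` identical perceptrons are instantiated."""
    return {
        name: 100.0 * used * copies / available
        for name, (used, available) in DEVICES[device].items()
    }


def max_copies(device):
    """Largest number of copies allowed by the scarcest resource."""
    return min(int(available // used)
               for used, available in DEVICES[device].values())


for device in DEVICES:
    usage = replicated_utilization(device, MAX_PARALLEL_ALGORITHMS)
    summary = ", ".join(f"{name} {value:.1f}%" for name, value in usage.items())
    print(f"{device}: {summary}; naive limit {max_copies(device)} copies")
```

Under these assumptions Block RAM is the limiting resource on both devices, and 162 copies remain within the available resources in either case.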










[Figure not recovered: results for the fixed point implementation]









## **Multi-core implementation**

The **number of cores** depends on:

- PL resources used for each core (LUTs, FFs, BRAMs, DSPs, ...)
- NoC and DDR bandwidth
- NoC OT transactions

If **processing bandwidth** is greater than NoC and DDR bandwidth, backpressure is going to happen:

- Source is ready to send data, but the consumer is **not ready to receive**, so data is going to be lost
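The backpressure effect can be illustrated with a toy model of a source feeding a consumer through a bounded buffer. The sketch below is not a model of the Versal NoC or DDR controller: the rates, the buffer depth and the rule that a real time source drops words when the buffer is full are all assumptions chosen for illustration.

```python
def simulate_stream(producer_rate, consumer_rate, fifo_depth, cycles):
    """Count words consumed and dropped when a source cannot be stalled.

    Rates are in words per cycle; `fifo_depth` is the buffer capacity in words.
    """
    occupancy = 0
    consumed = 0
    dropped = 0
    for _ in range(cycles):
        # The source produces every cycle, regardless of the buffer state.
        for _ in range(producer_rate):
            if occupancy < fifo_depth:
                occupancy += 1
            else:
                dropped += 1        # consumer not ready to receive: word is lost
        # The consumer drains what its bandwidth allows.
        drained = min(consumer_rate, occupancy)
        occupancy -= drained
        consumed += drained
    return consumed, dropped


# Hypothetical rates: the source is twice as fast as the consumer.
print(simulate_stream(producer_rate=2, consumer_rate=1, fifo_depth=4, cycles=10))
# Matched rates: nothing is lost.
print(simulate_stream(producer_rate=1, consumer_rate=1, fifo_depth=4, cycles=10))
```

In this model the buffer only delays the loss: once it fills, every cycle in which production exceeds consumption drops data, which is why the number of cores has to be matched to the available bandwidth and not only to the PL resources.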





# **Timing comparison**

- [...] transmission and setup
- For more than 10<sup>6</sup> events, the FPGA has a better performance than the CPU
- The **speed up factor** remains stable (**x3.2**) for more than 10<sup>8</sup> events
- For 10<sup>12</sup> events, the FPGA is 7.17 hours faster than the CPU
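The two figures quoted for large event counts are enough to estimate the absolute processing times. The sketch below assumes that the speed up factor is the ratio of total CPU time to total FPGA time and that, at 10^12 events, the fixed transmission and setup cost is negligible; the resulting numbers are derived estimates, not measurements reported on the slide.

```python
SPEED_UP = 3.2                     # stable speed up factor from the slide
EVENTS = 10**12                    # event count for which the saving is quoted
TIME_SAVED_S = 7.17 * 3600         # FPGA is 7.17 hours faster than the CPU

# With t_cpu = SPEED_UP * t_fpga and t_cpu - t_fpga = TIME_SAVED_S:
t_fpga_s = TIME_SAVED_S / (SPEED_UP - 1)
t_cpu_s = SPEED_UP * t_fpga_s

per_event_fpga_ns = t_fpga_s / EVENTS * 1e9
per_event_cpu_ns = t_cpu_s / EVENTS * 1e9

print(f"FPGA: {t_fpga_s / 3600:.2f} h total, {per_event_fpga_ns:.2f} ns per event")
print(f"CPU:  {t_cpu_s / 3600:.2f} h total, {per_event_cpu_ns:.2f} ns per event")
```

Under these assumptions the FPGA run takes a little over 3 hours and the CPU run a little over 10 hours for 10^12 events.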





## Summary and future work

## Summary:

- FPGA implementation of deep learning algorithms improves processing time over a traditional CPU (speed up factor of x3.2 for large numbers of events)
- Algorithm optimization for real time applications is a trade-off between latency/throughput and area usage

## Future work:

- Evaluate more complex deep learning algorithms for real time implementation
- Power consumption will be measured and monitored
- AI Engines utilization and optimization
- Optimization in terms of latency and power consumption













## "This work is supported by the Ministerio de Ciencia, Innovación y Universidades with Next Generation funds and funds from the Plan de Recuperación, Transformación y Resiliencia (project TED2021-130852B-100)"






























Backup

- 400 AI Engine Tiles
- Frequency
  - 1.25 GHz working
  - 312.5 MHz transport
- Latency
  - Input net: 12 cycles
  - Output net: 8 cycles








Backup







Fixed-Point Format

[Figure: fixed-point word layout; only the "fractional part" label and its last digit $d_{-n}$ survive]









