



### New possible use-cases of FPGAs in HEP High-Level-Trigger systems



Christian Färber, CERN openlab Fellow, LHCb Online group



On behalf of the LHCb Online group and the HTC Collaboration







## HTCC

- High Throughput Computing Collaboration
- Members from Intel<sup>®</sup> and CERN LHCb/IT
- Test Intel technology for use in trigger and data acquisition (TDAQ) systems
- Projects
  - Intel<sup>®</sup> KNL computing accelerator
  - Intel<sup>®</sup> Omni-Path Architecture 100 Gbit/s network
  - Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA computing accelerator







#### **General HEP Readout Chain**

- Readout electronics for detectors (custom)
  - Mainly ASICs; FPGAs in low-radiation areas
  - Distribution of ECS/TFC
- Optical links
- Back-end electronics (custom)
  - Many FPGAs and CCPCs
  - Pre-processing, zero suppression, L0 trigger
- Fast networks

Christian Färber,

CERN EP-ESE electronics seminar, Geneva – 12.12.2017



- Computing farms (commercial)
  - FPGA usage under investigation for the Event Filter Farms (HLT)!






#### **Example: Control Electronics**

- LHCb Outer Tracker control box
- Distribution of ECS/TFC
  - Clk, I<sup>2</sup>C
  - Test signals
- Using ACTEL ProASIC







#### Example: Readout Electronics

- Future LHCb SciFi readout electronics
- Using Microsemi IGLOO2 120kLE
- Drive data from PACIFIC to GBT
- Clusterization and Zero-suppression

Hervé Chanal "The readout electronics of the SciFi Tracker for LHCb detector Upgrade" TWEPP 2015







#### **Example: Back-end Electronics**

- LHCb TELL1
  - 4x Stratix GX for pre-processing, zero suppression, error checking
  - 1x Stratix GX for event building, data flow monitoring and preparing data packets (4x 1 Gbit/s Ethernet)







- LHCb L0 Muon trigger: searches for the two muons with the highest transverse momentum
  - Receiving 130 GB/s
  - Every 25 ns
  - Max. latency 1.2 µs!
  - Using 248 Stratix GX
  - Running 18432 tracking algorithms in parallel
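The L0 muon trigger figures above can be put into per-bunch-crossing terms with a few lines of arithmetic. This is only a sketch: it assumes 1 GB = 10<sup>9</sup> bytes and an even spread of the algorithms over the FPGAs, neither of which is stated in the slides, and all variable names are illustrative.

```python
# Figures quoted for the LHCb L0 muon trigger.
INPUT_RATE_BYTES_PER_S = 130e9   # 130 GB/s (assumption: 1 GB = 1e9 bytes)
BUNCH_SPACING_S = 25e-9          # one bunch crossing every 25 ns
MAX_LATENCY_S = 1.2e-6           # latency budget of 1.2 us
N_FPGAS = 248                    # Stratix GX devices
N_ALGORITHMS = 18432             # tracking algorithms running in parallel

# Data volume that arrives for every single bunch crossing.
bytes_per_crossing = INPUT_RATE_BYTES_PER_S * BUNCH_SPACING_S

# Number of bunch crossings that are "in flight" inside the latency budget.
crossings_in_flight = MAX_LATENCY_S / BUNCH_SPACING_S

# Average number of tracking algorithms per FPGA (assumes an even spread).
algorithms_per_fpga = N_ALGORITHMS / N_FPGAS

print(f"{bytes_per_crossing:.0f} bytes per crossing")
print(f"{crossings_in_flight:.0f} crossings within the latency budget")
print(f"{algorithms_per_fpga:.1f} algorithms per FPGA on average")
```

Under these assumptions the system has to absorb roughly 3 kB every 25 ns and keep about 48 crossings in its pipelines at once, which illustrates why this stage is built from deeply pipelined FPGA logic.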









#### Detector Example: LHCb



- Single-arm spectrometer designed to search for new physics by measuring CP violation and rare decays of heavy-flavour mesons.
- 40 MHz proton-proton collisions
- Trigger at 1 MHz, upgrade to 40 MHz
- Bandwidth after upgrade up to 40 Tbit/s







### **Future Challenges**

- Higher luminosity from LHC
- Upgraded sub-detector Front-Ends
- Removal of hardware trigger
- Software trigger has to handle
  - Larger event size (50 KB to 100 KB)
  - Larger event rate (1 MHz to 40 MHz)
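The two numbers the software trigger has to handle combine into a data rate. A minimal sketch of that arithmetic, assuming 1 KB = 1000 bytes (the slides do not say which convention is used); the function name is illustrative:

```python
def bandwidth_tbit_per_s(event_size_bytes: float, event_rate_hz: float) -> float:
    """Return the data rate in Tbit/s for a given event size and event rate."""
    return event_size_bytes * 8 * event_rate_hz / 1e12

# Before the upgrade: 50 KB events at 1 MHz.
current = bandwidth_tbit_per_s(50e3, 1e6)

# After the upgrade: 100 KB events at 40 MHz.
upgrade = bandwidth_tbit_per_s(100e3, 40e6)

print(f"current: {current:.1f} Tbit/s, upgrade: {upgrade:.1f} Tbit/s")
print(f"increase: x{upgrade / current:.0f}")
```

With these inputs the upgrade corresponds to about 32 Tbit/s, which is the same order as the "up to 40 Tbit/s" quoted for the upgraded detector, and an increase of roughly a factor 80 over the current system.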









#### **Upgrade Readout Schematic**

- Raw data input ~ 40 Tbit/s
- The EFF needs fast processing of trigger algorithms; different technologies are being explored.
- Test FPGA compute accelerators for usage in:
  - Event building
    - Decompressing and re-formatting packed binary data from detector
  - Event filtering
    - Tracking
    - Particle identification
- Compare with: GPUs, Intel<sup>®</sup> Xeon Phi<sup>™</sup> and other compute accelerators






#### **FPGAs as Compute Accelerators**

- Microsoft Catapult and Bing
  - Improve performance, reduce power consumption



- Reduce the number of von Neumann abstraction layers
  - Bit level operations
- Only the logic cells and registers actually needed are powered
- Current test devices in LHCb
  - Nallatech PCIe with OpenCL
  - Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA





## **FPGA** compute accelerators

- Typical PCIe 3.0 card with high performance FPGA
  - NIC or GPU size
- On board memory e.g. 16 GB DDR4
- Some cards also have network ports, e.g. QSFP 10/40 GbE, ...
  - More flexible than GPUs
- Programming in OpenCL
  - OpenCL compiler  $\rightarrow$  HDL
- Power consumption below GPU, price higher than GPU
- Use cases: Machine Learning, Gene Sequencing, Real-time Network Analytics









- Second: Intel<sup>®</sup> Stratix<sup>®</sup> V GX A7 FPGA
  - 234'720 ALMs, 940'000 Registers, 256 DSPs
- Host Interface: high-bandwidth and low latency
- Memory: Cache-coherent access to main memory
- Programming model: Verilog and OpenCL





## Mandelbrot on Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA


#### Mandelbrot with floating point precision

- Implemented 22 fpMandel pipelines running at 200 MHz, each handles 16 pixels in parallel (total: 352 pixels)
- The FPGA is 12x faster than the Intel<sup>®</sup> Xeon<sup>®</sup> running 20 threads in parallel
- Used 72/256 DSPs
- High reuse of data on the FPGA
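For readers unfamiliar with the benchmark, the per-pixel work is the standard Mandelbrot escape-time iteration. The sketch below is plain Python, not the fpMandel pipeline of the talk; `max_iter` and the escape radius of 2 are the usual textbook choices and are assumptions here.

```python
def mandel_iterations(c: complex, max_iter: int = 256) -> int:
    """Count iterations of z -> z*z + c until |z| exceeds 2 (or max_iter is hit)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

# Parallelism quoted in the slide: 22 pipelines, 16 pixels each.
N_PIPELINES = 22
PIXELS_PER_PIPELINE = 16
pixels_in_parallel = N_PIPELINES * PIXELS_PER_PIPELINE

print(mandel_iterations(0j), mandel_iterations(2 + 2j), pixels_in_parallel)
```

Each pixel only needs its own `c` and the running value `z`, so almost all data stays on the chip between iterations. That is consistent with the high data reuse noted above and helps explain why this kernel maps well to an FPGA.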







## Test case: LHCb Calorimeter Raw Data Decoding

- Two types of calorimeters in LHCb: ECAL/HCAL
- 32 ADC channels on each of the 238 front-end boards (FEBs)
- Raw data format:
  - ADC data is sent using 4 bits or 12 bits per channel
  - A 32-bit word stores which channels use the short or the long encoding

LHCb Calorimeter raw data bank (32-bit words):

| Word | Fields |
|------|--------|
| Header | Control word (9b), Crate (5b), Card (4b), Length ADC (7b), Length trigger (7b) |
| Trigger bit pattern | 32b |
| Trigger data | Zero padding, Trigger (8b), Trigger (8b), Trigger (8b) |
| ADC bit pattern | 32b |
| ADC data | ADC low, ADC long (12b), ADC long (12b), ADC (4b) |
| ADC data (last word) | Zero padding at the end, ADC long (12b), ADC high (8b) |
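The variable-width decoding described above can be sketched in software as follows. This is a simplified illustration, not the LHCb decoder: it assumes that a set bit in the pattern word marks a 12-bit (long) value, that values are packed LSB-first in channel order, and that no offset is subtracted from the stored values. The helper names `pack_adc` and `decode_adc` are hypothetical.

```python
def field_width(pattern: int, channel: int) -> int:
    """Assumption: bit = 1 means a long (12-bit) value, bit = 0 a short (4-bit) one."""
    return 12 if (pattern >> channel) & 1 else 4

def pack_adc(values: list[int], pattern: int) -> bytes:
    """Pack values LSB-first and pad with zeros to a whole number of 32-bit words."""
    bits, pos = 0, 0
    for channel, value in enumerate(values):
        width = field_width(pattern, channel)
        bits |= (value & ((1 << width) - 1)) << pos
        pos += width
    n_bytes = (pos + 31) // 32 * 4
    return bits.to_bytes(n_bytes, "little")

def decode_adc(pattern: int, payload: bytes, n_channels: int = 32) -> list[int]:
    """Walk through the payload, reading 4 or 12 bits per channel."""
    bits = int.from_bytes(payload, "little")
    pos = 0
    values = []
    for channel in range(n_channels):
        width = field_width(pattern, channel)
        values.append((bits >> pos) & ((1 << width) - 1))
        pos += width
    return values

# Round trip with four channels: short, long, short, long.
example_values = [3, 1000, 15, 4095]
example_pattern = 0b1010
example_payload = pack_adc(example_values, example_pattern)
print(decode_adc(example_pattern, example_payload, n_channels=4))
```

On a CPU the position of each value depends on the widths of all previous ones, which serialises the loop. On an FPGA the offsets for all 32 channels can be derived from the pattern word in parallel, which is one plausible reason why this task suits the hardware.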

## Results Calorimeter Raw Data Decoding: Ivy Bridge + StratixV

On FPGAs the decoding can be realized more efficiently




The bottleneck is the bandwidth between CPU and FPGA
 → add more cores; tested Broadwell (BDW) + Arria 10 GX FPGA

#### FPGA resources:

| FPGA Resource Type | FPGA resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 58                      | 30                     |
| DSPs               | 0                       | 0                      |
| Registers          | 15                      | 5                      |




## Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA with Arria<sup>®</sup> 10 FPGA

- Multi-chip package including:
  - Intel<sup>®</sup> Xeon<sup>®</sup> E5-2600 v4
  - Intel<sup>®</sup> Arria<sup>®</sup> 10 GX 1150 FPGA
    - 427'200 ALMs, 1'708'800 Registers, 1'518 DSPs
- Hardened floating point add/mult blocks (HFB)
- Host Interface: Bandwidth target 5x higher than Stratix<sup>®</sup> V version
- Memory: Cache-coherent access to main memory
- Programming model: Verilog, soon also OpenCL



## Results Calorimeter Raw Data Decoding: BDW+Arria10




The higher bandwidth of the newest Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA results in an impressive acceleration by a factor of 180.

| FPGA Resource Type | FPGA resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 57                      | 18                     |
| DSPs               | 0                       | 0                      |
| Registers          | 19                      | 5                      |




# Sorting with Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA


#### Sorting of INT arrays with 32 elements

- Implemented a pipeline with 32 array stages
- FPGA sort is up to 117x faster than a single Xeon<sup>®</sup> thread
- Bandwidth through the FPGA is the bottleneck
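The slides do not say which sorting network the 32 stages implement. One common way to build a sorter with one stage per element is a linear insertion-sort pipeline, simulated below in software as an illustration only. `pipeline_sort` is a hypothetical name, and the inner loop runs sequentially here, whereas hardware stages would compare and shift concurrently.

```python
def pipeline_sort(values: list[int]) -> list[int]:
    """Simulate a linear insertion-sort pipeline with one stage per element.

    Each stage keeps the smallest value it has seen so far and passes the
    larger one on to the next stage.
    """
    stages: list = [None] * len(values)   # every stage holds at most one value
    for value in values:                  # one new value enters per step
        carry = value
        for i in range(len(stages)):
            if stages[i] is None:         # first empty stage stores the carry
                stages[i] = carry
                break
            if carry < stages[i]:         # keep the smaller, pass on the larger
                stages[i], carry = carry, stages[i]
    return stages

print(pipeline_sort([5, 3, 9, 1]))
```

In such a design a new array can start streaming in as soon as the previous one has drained, so throughput is set by how fast data reaches the stages. That fits the observation above that bandwidth, not the sorting logic, is the limit.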

Time ratio for sorting with Xeon only to Xeon with FPGA







## **Test Case: RICH PID Algorithm**

- Calculate the Cherenkov angle θ for each track t and detection point **D**; not a typical FPGA algorithm
- RICH PID is not processed for every event because the processing time is too long!



#### **Calculations:**

- solve quartic equation
- cube root
- complex square root
- rotation matrix
- scalar/cross products
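To give a feel for the building blocks listed above, here are software versions of three of them. They are plain Python stand-ins for illustration, not the hardware blocks from the talk; the quartic solver and the rotation matrix are left out for brevity and all function names are hypothetical.

```python
import cmath
import math

def cube_root(x: float) -> float:
    """Real cube root that also works for negative input."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def complex_sqrt(z: complex) -> complex:
    """Principal square root of a complex number."""
    return cmath.sqrt(z)

def dot(a, b) -> float:
    """Scalar product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b) -> tuple:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

print(cube_root(-27.0), complex_sqrt(-4), cross((1, 0, 0), (0, 1, 0)))
```

Each of these is a single library call on a CPU, but on an FPGA every one has to be built as its own pipelined floating-point block, which is why the algorithm is described as atypical for FPGAs.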







#### Implementation of Cherenkov Angle Reconstruction Stratix<sup>®</sup> V

- 748 clock cycle long pipeline written in Verilog
  - Additional blocks developed: cube root, complex square root, rot. matrix, cross/scalar product, ...
  - Lengthy task in Verilog with all test benches (implementation took 2.5 months)
- Pipeline running at 200 MHz $\rightarrow$ 5 ns per photon
- FPGA resources:

| FPGA Resource Type | FPGA resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 88                      | 30                     |
| DSPs               | 67                      | 0                      |
| Registers          | 48                      | 5                      |







## Implementation of Cherenkov Angle Reconstruction Arria<sup>®</sup> 10

- 259 clock cycle long pipeline written in Verilog
  - Stratix<sup>®</sup> V blocks ported using HFB: complex square root, rot. matrix, cross/scalar product, ...
- Pipeline running at 200 MHz $\rightarrow$ 5 ns per photon
  - With an Arria<sup>®</sup> 10 GT FPGA, 400 MHz is possible
- FPGA resources:

| FPGA Resource Type | FPGA resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 32                      | 18                     |
| DSPs               | 15                      | 0                      |
| Registers          | 12                      | 5                      |
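The pipeline depth (748 cycles on Stratix V, 259 cycles on Arria 10) sets the latency, while the clock rate sets the throughput. The sketch below separates the two, assuming a fully pipelined design that accepts one photon per clock cycle, which is what "200 MHz → 5 ns per photon" suggests. The function name and the batch size of one million photons are illustrative choices.

```python
def pipeline_timing(depth_cycles: int, clock_hz: float, n_items: int):
    """Return (latency, total time) in seconds for a fully pipelined design."""
    period = 1.0 / clock_hz
    latency = depth_cycles * period                  # time for the first result
    total = (depth_cycles + n_items - 1) * period    # time until the last result
    return latency, total

N_PHOTONS = 1_000_000
stratix_latency, stratix_total = pipeline_timing(748, 200e6, N_PHOTONS)
arria_latency, arria_total = pipeline_timing(259, 200e6, N_PHOTONS)

print(f"Stratix V: latency {stratix_latency * 1e6:.3f} us, total {stratix_total * 1e3:.4f} ms")
print(f"Arria 10:  latency {arria_latency * 1e6:.3f} us, total {arria_total * 1e3:.4f} ms")
```

For large batches the total time is dominated by the 5 ns per photon in both cases, so under this assumption the shorter Arria 10 pipeline mainly saves resources and latency rather than raising throughput at the same clock.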



## Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA Results

Compare runtime for Cherenkov angle reconstruction with Intel<sup>®</sup> Xeon<sup>®</sup> CPU and Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA



- Acceleration of up to a factor of 35 with Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA
- Theoretical limit of the photon pipeline: a factor of 64 with respect to a single Intel<sup>®</sup> Xeon<sup>®</sup> thread; for Arria<sup>®</sup> 10 a factor of ~300
- Bottleneck: data transfer bandwidth to the FPGA; caching can improve this, tests are ongoing




#### Open Computing Language (OpenCL)

- Developed by Apple, later Khronos Group, based on C99, first release 2009
- Standard to run code on heterogeneous platforms
  - CPUs, GPUs, FPGAs, ...
- Program: host code controls execution, kernels run on GPU, FPGA, ...
  - Compiled at run time
- Memory hierarchy: global (main memory), read-only (for kernel), local (shared by a group of processing elements), per-element private memory
- For the FPGA case, a BSP is needed and synthesis is done in advance (OpenCL kernel $\rightarrow$ HDL $\rightarrow$ bitstream)
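The host/kernel split can be illustrated without any OpenCL installation by emulating the index space in plain Python. This is a conceptual sketch, not the OpenCL API: `run_ndrange` and `vector_add` are hypothetical names, and real work-items execute concurrently rather than in a loop.

```python
def run_ndrange(kernel, global_size: int, local_size: int, *args) -> None:
    """Call `kernel` once per work-item of a one-dimensional index space."""
    if global_size % local_size:
        raise ValueError("global size must be a multiple of the local size")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Global index of a work-item (zero global offset assumed).
            global_id = group_id * local_size + local_id
            kernel(global_id, *args)

def vector_add(gid: int, a, b, out) -> None:
    """Kernel body: each work-item handles exactly one element."""
    out[gid] = a[gid] + b[gid]

# "Host" side: prepare buffers, launch the kernel, read back the result.
a = list(range(8))
b = [10 * x for x in range(8)]
out = [0] * 8
run_ndrange(vector_add, 8, 4, a, b, out)
print(out)
```

The kernel only describes the work for one index, which is what lets the same source be mapped to GPU threads or, after synthesis, to an FPGA pipeline.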



[Figure: Verilog code example, beginning with `module ADD #(parameter width=32)`]

#### Code compare OpenCL



- No interface to write, using Board Support Package (BSP)
- Using high-level language
- Far less code  $\rightarrow$  easier to develop and to maintain



FPGA resource usage [%]:

| FPGA Resource Type | Verilog | OpenCL |
|--------------------|---------|--------|
| ALMs               | 88      | 63     |
| DSPs               | 67      | 82     |
| Registers          | 48      | 24     |



# Nallatech 385 Board

- FPGA: Intel<sup>®</sup> Stratix<sup>®</sup> V GX A7
  - 234'720 ALMs, 940'000 Registers
  - 256 DSPs
- Programming model: OpenCL
- Host Interface: 8-lane PCIe Gen3
  - Up to 7.5 GB/s
- Memory: 8 GB DDR3 SDRAM
- Network enabled with two SFP+ 10 GbE ports
- Power usage: $\leq 25$ W (GPU up to 300 W)



#### Compare PCIe – QPI Interconnect


- Nallatech 385 PCIe vs. Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA QPI
- Both Intel<sup>®</sup> Stratix<sup>®</sup> V A7 FPGA with 256 DSPs
- Programming model: OpenCL
- Reconstruct 1'000'000 photons

[Figure: RICH kernel, acceleration with Nallatech 385 compared to Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA]







## Nallatech 385A Board

- FPGA: Intel<sup>®</sup> Arria<sup>®</sup> 10 GX 1150
  - 427'200 ALMs, 1'708'800 Registers
  - 1'518 DSPs
- Programming model: OpenCL
- Host Interface: 8-lane PCIe Gen3
  - Up to 7.9 GB/s
- Memory: 8 GB DDR3 SDRAM
- Network enabled with two QSFP 10/40 GbE ports
- Power usage: full FPGA firmware ~ 40 W






[Figure: RICH CPU core scaling using OpenMP on 2x Xeon E5-2630 v4; 16777216 random photons, multi loop factor 160, 40 CPU threads used]





## **RICH with Nallatech 385A**





## RICH w/o Nallatech 385A OpenMP



16777216 random photons; multi loop factor: 160; used CPU threads: 40



#### Compare energy consumption

Processing $2.7 \times 10^9$ photons:

- 2x Xeon<sup>®</sup> E5-2630 v4 using 40 threads, OpenMP, no vectorization ⇒ $29\,\mathrm{s} \times 102\,\mathrm{W} \approx 2960\,\mathrm{J}$
- 1x Arria<sup>®</sup> 10 GX 1150 ⇒ $35\,\mathrm{s} \times 52\,\mathrm{W} = 1820\,\mathrm{J}$ (x1.6 less energy)
  - FPGA uses 40 W idle + ~12 W (single thread pushing data into the PCIe card)
- Check for better firmware to avoid idle state
- Use vectorization and OpenCL
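The comparison is simply energy = run time × average power. The sketch below recomputes the numbers from the slide and adds the energy per photon; the variable names are illustrative, and the 52 W figure is taken to be the 40 W idle plus ~12 W quoted above.

```python
def energy_joule(run_time_s: float, power_w: float) -> float:
    """Energy in joules for a run at constant average power."""
    return run_time_s * power_w

N_PHOTONS = 2.7e9

cpu_energy = energy_joule(29, 102)    # 2x Xeon E5-2630 v4, 40 OpenMP threads
fpga_energy = energy_joule(35, 52)    # 1x Arria 10 GX 1150 PCIe card
ratio = cpu_energy / fpga_energy

print(f"CPU:  {cpu_energy:.0f} J, {cpu_energy / N_PHOTONS * 1e6:.2f} uJ per photon")
print(f"FPGA: {fpga_energy:.0f} J, {fpga_energy / N_PHOTONS * 1e6:.2f} uJ per photon")
print(f"CPU/FPGA energy ratio: {ratio:.2f}")
```

The FPGA run takes longer (35 s against 29 s) but still uses about 1.6 times less energy, and 40 W of the 52 W is idle power. This is why better firmware to avoid the idle state is listed as a next step.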





### Reached and possible run time for RICH photon reconstruction

Reached and possible run time for single RICH photon reconstruction with different platforms



- The difference between reached and possible time is due to the limited bandwidth between CPU and FPGA; in both cases the FPGA could process the photons faster. The same holds for the PCIe accelerator, only worse.
- Caching could reduce the bandwidth gap; this is possible for the RICH kernel
- Between Ivy Bridge and BDW the bandwidth improved by a factor of 2




#### Compare run time variation

- Standard deviation of the run time: CPU 1.06 %, FPGA 0.29 %
- FPGA run time is more predictable
- Very important for safety-critical control systems
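A relative spread like the percentages above can be obtained from repeated timing measurements as standard deviation divided by mean. The slides do not say whether the population or the sample standard deviation was used; the sketch assumes the population form, and the run times in it are made-up placeholder values, not measurements from the talk.

```python
import statistics

def relative_stddev_percent(samples: list[float]) -> float:
    """Population standard deviation as a percentage of the mean."""
    return 100.0 * statistics.pstdev(samples) / statistics.fmean(samples)

# Hypothetical run times in seconds, for illustration only.
hypothetical_runs_s = [10.0, 10.1, 9.9, 10.0]
spread = relative_stddev_percent(hypothetical_runs_s)
print(f"relative spread: {spread:.2f} %")
```

Expressing the spread relative to the mean makes platforms with very different absolute run times comparable.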






## Future Tests

- Implement additional CERN algorithms
  - Tracking Kalman filter, CNNs
- Compare performance with Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA system with Skylake + Arria<sup>®</sup> 10 FPGA
  - Waiting for missing software and firmware
- Power measurements
- Compare Verilog vs. OpenCL
- Long term: measurements of Stratix 10 PCIe accelerators and Intel<sup>®</sup> Xeon<sup>®</sup> + Stratix 10

[Figure: Nallatech 520 board, ~10 TFlops: 4x 8 GB DDR4 memory, 16-lane PCIe Gen 3, 4x QSFP28, 2x USB]




## **FPGA-based CNN Inference**

- For CNN inference single precision is not always needed
- Take advantage of using precision as needed on FPGA
- This increases the operations per second dramatically
- This could be interesting for Monte Carlo production (e.g. Geant V)
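The idea of using only as much precision as needed can be illustrated with a small quantization experiment. It is a generic sketch using symmetric linear quantization to 8-bit integers, which is one common scheme and not necessarily the one used in the cited work. The weights and inputs are arbitrary example numbers and `quantize` is a hypothetical helper.

```python
def quantize(values: list[float], bits: int = 8):
    """Symmetric linear quantization to signed integers of the given width.

    Returns the integer values and the scale needed to convert them back.
    Assumes at least one value is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

# Arbitrary example data standing in for one neuron's weights and inputs.
weights = [0.12, -0.5, 0.33, 0.9]
inputs = [1.0, 2.0, -1.5, 0.25]

q_weights, w_scale = quantize(weights)
q_inputs, x_scale = quantize(inputs)

# Dot product in floating point and in 8-bit integer arithmetic.
exact = sum(w * x for w, x in zip(weights, inputs))
approx = sum(qw * qx for qw, qx in zip(q_weights, q_inputs)) * w_scale * x_scale

print(f"float: {exact:.4f}, int8: {approx:.4f}")
```

Integer multiply-accumulate units of this width need far less FPGA logic than single-precision ones, so many more fit on the same device. That is the mechanism behind the increase in operations per second mentioned above.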





Source: FPGA Datacenters -The New Supercomputer, Andrew Putnam – Microsoft Catapult\_ACAT\_2017\_Public


#### **FPGA** development

- The FPGA potential for general compute acceleration increased considerably with Arria 10 and its hardened floating-point DSP blocks
  - Future FPGAs will have several tens of thousands of these DSPs (today already ~6k)
- FPGA transceivers will enable very high bandwidth into the chip, tightly coupled to RAM
- The programming model is now shifting towards HLS and OpenCL, even for standard FPGA designs
  - Intel recommends using HLS for Stratix 10






# Challenges to use FPGA accelerators

- Compute-heavy blocks have to be identified and ported to the FPGA
- For PCIe accelerators an off-load model is used (larger latency)
  - $\rightarrow$ Intel<sup>®</sup> Xeon<sup>®</sup> + FPGA advantage (streaming)
- Kernel size is limited by FPGA resources
  - Intel will reduce the FPGA programming time from O(s) to O(µs) in the future, which makes kernel swapping during run time practical




### Summary



- Comparing the energy consumption with CPUs shows better performance for FPGAs (towards a greener CERN computing?)
- The programming model with OpenCL is very attractive and convenient for the HEP field; HLS is now also available
- Other experiments also want to test the Intel<sup>®</sup> Xeon<sup>®</sup>+FPGA with Arria 10
- High-bandwidth interconnect coupled with the Arria<sup>®</sup> 10 FPGA suggests excellent performance per joule for HEP algorithms! Don't forget Stratix<sup>®</sup> 10, ...!






## Thank you

