

# Particle identification on an FPGA accelerated compute platform for the LHCb Upgrade

- New High Level Trigger farm for raw data input of ~ 40 Tbit/s!
- Different technologies are explored to realize fast and efficient processing of trigger algorithms.
- Test FPGA compute accelerators for the usage in:
  - Event building
    - Decompressing and re-formatting packed binary data from detector
  - Event filtering
    - Tracking
    - Particle identification
- Test system is the new Intel® Xeon/FPGA prototype!





Christian Färber, 20th Real Time Conference – 07.06.2016



# Poster 192





Particle identification on an FPGA accelerated compute platform for the LHCb Upgrade

Christian Färber<sup>1</sup>, Niko Neufeld<sup>1</sup>, Rainer Schwemmer<sup>1</sup>, Jonathan Machen<sup>2</sup>

{christian.faerber, niko.neufeld, rainer.schwemmer}@cern.ch, jonathan.machen@intel.com

<sup>1</sup>Experimental Physics Department, CERN, Geneva, Switzerland

<sup>2</sup>DCG IPAG, EU Exascale Labs, Intel, Switzerland

On behalf of the HTC Collaboration - June 2016

- Current situation
  - Raw data output ~ 10 Tbit/s (not zero-suppressed)
  - Hardware and software trigger for selecting events
  - Application example: process ~ 10<sup>11</sup> B<sup>0</sup> decays to detect one $B_{s}^{0} \rightarrow \mu^{+}\mu^{-}$
- Upgrade program foresees 10x higher pp collision rate



- After the Upgrade
  - In 2018 LHCb will change its detector to a trigger-free readout that reads every collision (one every 25 ns), with a much more flexible software-based trigger system, the Event Filter Farm (EFF).
  - Events will be processed and triggered on an event-by-event basis by the Event Filter Farm.

# The Event Filter Farm

- Raw data input ~ 40 Tbit/s (already zero-suppressed by the front-end electronics)
- For selecting the events, the EFF needs fast processing of trigger algorithms (decision within O(10) µs).
- Different technologies have to be explored.
- High-speed interconnect technology has to be investigated and used.
- Test FPGA compute accelerators for the usage in event building, tracking and particle identification of the upgraded High-Level-Trigger farm and compare with GPUs, Intel® Xeon/Phi and other computing accelerators.
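
As a rough cross-check, the two figures quoted on the poster (about 40 Tbit/s of input and one collision every 25 ns) imply an average event size of roughly 1 Mbit. The sketch below only redoes that arithmetic; the variable and function names are illustrative, and it assumes that every 25 ns slot contributes one event.

```python
# Back-of-the-envelope check using only numbers quoted on the poster.
INPUT_RATE_BIT_PER_S = 40e12   # ~40 Tbit/s into the Event Filter Farm
COLLISION_PERIOD_S = 25e-9     # one collision every 25 ns


def average_event_size_bits(rate_bit_per_s, period_s):
    """Average number of bits arriving per collision (assumes one event per slot)."""
    return rate_bit_per_s * period_s


event_bits = average_event_size_bits(INPUT_RATE_BIT_PER_S, COLLISION_PERIOD_S)
event_rate_hz = 1.0 / COLLISION_PERIOD_S

print(f"event rate: {event_rate_hz / 1e6:.0f} MHz")
print(f"average event size: {event_bits / 8 / 1e3:.0f} kB")
```

Under these assumptions the farm has to absorb on the order of a hundred kilobytes every 25 ns, which is why the interconnect and the per-event processing cost both matter.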

# **First Test Cases**

- Sorting
  - Runtime scales on CPU with n × log(n), n = number of elements
  - On FPGA with pipeline and parallel compare it depends only on the pipeline clock frequency
  - FPGA sort is a factor x50 faster than a single Intel® Xeon thread
- Cubic root
  - Implementation for floats with shifting n-th root algorithm
  - Implemented 7 root pipelines for parallel processing (200 MHz)
  - FPGA cubic root is a factor x35 faster than a single Intel® Xeon thread
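
To illustrate the shifting n-th root idea, here is a minimal sketch for n = 3 in base 2. It is a simplified integer version, not the floating-point Verilog implementation from the poster, and the function name `icbrt_shifting` is made up for this example. Each loop iteration consumes three bits of the input and produces one bit of the root using only shifts, adds and one comparison, which is the property that makes the method a natural fit for a hardware pipeline.

```python
def icbrt_shifting(x: int) -> int:
    """Integer cube root, floor(x ** (1/3)), via the binary shifting n-th root method."""
    if x < 0:
        raise ValueError("x must be non-negative")
    # Number of 3-bit groups needed to cover all bits of x.
    groups = (x.bit_length() + 2) // 3
    root = 0  # root of the bits consumed so far
    rem = 0   # (bits consumed so far) - root**3
    for i in reversed(range(groups)):
        # Shift the next three bits of the radicand into the remainder.
        rem = (rem << 3) | ((x >> (3 * i)) & 0b111)
        # Cost of appending a 1 bit to the root: (2y + 1)^3 - (2y)^3.
        trial = 12 * root * root + 6 * root + 1
        root <<= 1
        if trial <= rem:
            rem -= trial
            root |= 1
    return root


print(icbrt_shifting(27), icbrt_shifting(26), icbrt_shifting(1000))
```

In hardware, one plausible layout is to unroll the loop so that each iteration becomes one pipeline stage; whether the poster's design does exactly this is not stated in the source.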

- Mandelbrot
  - Floating point precision
  - Implemented 22 fpMandel pipelines running at 200 MHz, each handles 16 pixels in parallel (total: 352 pixels)
  - FPGA is a factor x12 faster than an Intel® Xeon CPU running 20 threads
  - Used 72/256 DSPs
  - Reuse of data on FPGA high!
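
For readers unfamiliar with the Mandelbrot test case, the sketch below shows the standard escape-time iteration for a single pixel, together with the pixel count quoted above. It is a plain Python illustration, not the fpMandel pipeline itself; the iteration limit `max_iter=256` and the function name are assumptions made for this example. The per-pixel state (`z_re`, `z_im`) is updated many times without fetching new input, which is one way to read the "reuse of data" remark.

```python
def mandel_iterations(c_re, c_im, max_iter=256):
    """Escape-time count for the point c = c_re + i*c_im (max_iter is an assumed limit)."""
    z_re = z_im = 0.0
    for i in range(max_iter):
        # The orbit has escaped once |z|^2 exceeds 4.
        if z_re * z_re + z_im * z_im > 4.0:
            return i
        # z <- z^2 + c, written out in real arithmetic.
        z_re, z_im = z_re * z_re - z_im * z_im + c_re, 2.0 * z_re * z_im + c_im
    return max_iter


# Parallelism quoted on the poster: 22 pipelines with 16 pixels each.
PIPELINES = 22
PIXELS_PER_PIPELINE = 16
pixels_in_flight = PIPELINES * PIXELS_PER_PIPELINE

print(pixels_in_flight, mandel_iterations(0.0, 0.0), mandel_iterations(2.0, 2.0))
```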





# **High Throughput Computing Collaboration**

- Members from Intel and CERN LHCb/IT
- Test Intel technology for the usage in trigger and data acquisition (TDAQ) systems
- Projects:
  - Intel® Omni-Path 100 Gbit/s network
  - Intel® Xeon/Phi computing accelerator
  - Intel® Xeon/FPGA computing accelerator

# Intel® Xeon/FPGA


- Prototype
  - Two-socket system:
    - First: Intel® Xeon® E5-2680 v2
    - Second: Altera Stratix V GX A7 FPGA (234'720 ALMs, 940'000 registers, 256 DSPs)
  - Host interface: high-bandwidth and low latency (QPI)
  - Memory: cache-coherent access to main memory
  - Programming model: Verilog, now also OpenCL
  - Power usage: FPGAs are very power efficient, up to a factor x10 lower than GPUs → measurements will follow soon

- Intel® Xeon CPU and FPGA in one package
  - Including newest high performance Altera FPGA: Arria 10
  - Faster interface for interconnect of CPU and FPGA

# **Cherenkov Angle Reconstruction**

- Algorithm
  - A particle travelling faster than the speed of light in a medium emits Cherenkov radiation at an angle that depends on the particle speed
  - Calculate the Cherenkov angle θ knowing points D, C, E and particle track t
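
The dependence of the emission angle on the particle speed follows the textbook relation cos θ = 1 / (n·β), where n is the refractive index of the medium and β = v/c. The sketch below evaluates only this relation; it is not the geometric reconstruction from D, C, E and t that the FPGA pipeline performs, and the refractive index used in the example call is an illustrative value, not a number taken from the poster.

```python
import math


def cherenkov_angle(beta, n):
    """Cherenkov angle in radians for speed beta = v/c in a medium of index n.

    Returns None below threshold (n * beta < 1), where no light is emitted.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    if n * beta < 1.0:
        return None
    return math.acos(1.0 / (n * beta))


# Illustrative refractive index for a gas radiator (assumed value).
N_GAS = 1.0014
print(cherenkov_angle(1.0, N_GAS), cherenkov_angle(0.5, N_GAS))
```

Because the angle depends on β, combining it with a momentum measurement from the tracking constrains the particle mass, which is the basis of the particle identification.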

- 748 clock cycle long pipeline written in Verilog
  - Additional blocks developed: cubic root, complex square root, rotation matrix, cross/scalar product, ...
  - Lengthy task in Verilog with all test benches
- Pipeline running at 200 MHz → 5 ns per photon
- Implementation took 2.5 months
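
The pipeline figures above can be turned into a latency and a throughput estimate. The sketch below assumes a fully pipelined design that accepts one new photon per clock cycle, which is what "5 ns per photon" at 200 MHz suggests; the helper name `time_for_photons_us` is made up for this example.

```python
CLOCK_HZ = 200e6              # pipeline clock quoted on the poster
PIPELINE_DEPTH_CYCLES = 748   # pipeline length quoted on the poster

cycle_ns = 1e9 / CLOCK_HZ                              # time per clock cycle
latency_us = PIPELINE_DEPTH_CYCLES * cycle_ns / 1e3    # first result appears after this


def time_for_photons_us(n_photons):
    """Time until the last of n_photons leaves the pipeline (one photon per cycle assumed)."""
    return (PIPELINE_DEPTH_CYCLES + n_photons - 1) * cycle_ns / 1e3


print(f"cycle: {cycle_ns:.0f} ns, latency: {latency_us:.2f} us")
print(f"1000 photons: {time_for_photons_us(1000):.3f} us")
```

The fixed latency of a few microseconds is paid once per batch, so for large photon counts the cost per photon approaches the 5 ns cycle time.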

| FPGA resources used [%] | For QPI used [%] |
|------------------------|------------------|
| 88                     | 30               |
| 67                     | 0                |
| 48                     | 5                |
|                        | 67               |

# Results so far very encouraging

- Acceleration of a factor up to 35 with Intel® Xeon/FPGA
- pipeline: a factor 64
- Bottleneck: data transfer bandwidth to FPGA

- Implement additional LHCb HLT algorithms
  - Tracking, decompressing and re-formatting packed binary data from detector
- Hardened floating point mult/accumulate blocks
- Test Nallatech CAPI (cache-coherent)
- Compare Verilog and OpenCL AFUs
- Power measurements → compare with GPUs!





