# Allen: A High Level Trigger on GPUs for LHCb

Physics and throughput performance

#### **Dorothea vom Bruch**

on behalf of the LHCb collaboration

LPNHE, CNRS

Sorbonne University, Paris Diderot University

November 6<sup>th</sup> 2019, CHEP 2019, Adelaide










#### LHC @ CERN



General purpose detector in the forward region specialized in beauty and charm hadrons



## Reaching the MHz signal era



Run 3: Luminosity of $2 \times 10^{33}$ cm<sup>-2</sup>s<sup>-1</sup>, $\sqrt{s} = 14$ TeV


## Change in trigger paradigm



#### Access as much information about the collision as early as possible

#### Tracks in the LHCb detector



Need information from many subdetectors  $\rightarrow$  read out full detector

# Trigger upgrade for Run 3 (2021)



# Trigger in Run 3 (2021)



#### Architecture for high level trigger?



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

#### **Graphics Processing Units (GPUs) have thousands of cores**

#### Amdahl's law



Speedup in latency = $1 / (S + P/N)$

- $S$: sequential part of program
- $P$: parallel part of program
- $N$: number of processors
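Amdahl's law can be evaluated numerically to see how quickly the sequential part limits the gain from adding processors. The following is a minimal sketch; the function name and the 95% parallel fraction are illustrative assumptions, not numbers from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup in latency according to Amdahl's law: 1 / (S + P/N).

    parallel_fraction is P; the sequential part is S = 1 - P.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Illustrative assumption: 95% of the program can run in parallel.
    for n in (1, 10, 100, 1000, 10000):
        print(f"N = {n:5d}: speedup = {amdahl_speedup(0.95, n):.2f}")
```

However many processors are added, the speedup stays below $1/S$, which is why a workload needs a very small sequential part to benefit from thousands of GPU cores.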

#### Can we use the FLOPS available on a GPU to run HLT1 @ 30 MHz?

#### Where to place the GPUs?




# LHCb HLT1 elements



| Characteristics of LHCb HLT1                                                                    | Characteristics of GPUs                                                                    |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Intrinsically parallel problem:<br>- Run events in parallel<br>- Reconstruct tracks in parallel | Good for<br>- Data-intensive parallelizable applications<br>- High throughput applications |
| Huge compute load                                                                               | Many TFLOPS                                                                                |
| Full data stream from all detectors is read out<br>→ no stringent latency requirements          | GPUs have higher latency than CPUs, not as predictable as FPGAs                            |
| Small raw event data (~100 kB)                                                                  | Connection via PCIe $\rightarrow$ limited I/O bandwidth                                    |
| Small raw event data (~100 kB)                                                                  | Thousands of events fit into O(10) GB of memory                                            |
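The last row of the table can be checked with a short calculation. This sketch assumes decimal units (1 kB = 1000 bytes), takes 10 GB as a representative value for "O(10) GB", and ignores the memory needed for intermediate reconstruction products; the function name is illustrative.

```python
def events_fitting_in_memory(memory_bytes: int, event_size_bytes: int) -> int:
    """Number of whole raw events of a given size that fit into a memory budget."""
    if event_size_bytes <= 0:
        raise ValueError("event_size_bytes must be positive")
    return memory_bytes // event_size_bytes


# Assumed representative values taken from the table: ~100 kB per event, O(10) GB per GPU.
GPU_MEMORY_BYTES = 10 * 10**9
RAW_EVENT_BYTES = 100 * 10**3

if __name__ == "__main__":
    n_events = events_fitting_in_memory(GPU_MEMORY_BYTES, RAW_EVENT_BYTES)
    print(f"Raw events fitting into GPU memory: {n_events}")
```

Under these assumptions the raw data of about 10<sup>5</sup> events would fit, so processing thousands of events at once leaves room for intermediate data.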

# The Allen R&D project

- Fully standalone software project: https://gitlab.cern.ch/lhcb/Allen
- Only requirements:
  - C++17 compliant compiler, CUDA v10, boost, ZeroMQ
- Built-in physics validation
- Configurable sequence, custom memory manager
- Cross-architecture compatibility
- Project started in February 2018
- After 15 months of development time: project reviewed as viable solution for Run 3 (starting in 2021)
- Talk on software challenges by D. Cámpora: Monday, Track 5



- Named after Frances E. Allen

## HLT1 on GPUs



#### Velo detector





#### Velo detector: track reconstruction



D. Campora, N. Neufeld, A. Riscos Núñez: "A fast local algorithm for track reconstruction on parallel architectures", IPDPSW 2019

#### Velo detector: primary vertex reconstruction



## UT detector



#### UT detector: track reconstruction



P. Fernandez Declara, D. Campora Perez, J. Garcia-Blas, D. vom Bruch, J. Daniel Garcia, N. Neufeld, IEEE Access 7 (2019)

Figure: number of events [a.u.]; panel label UTaX

#### SciFi detector



- 12 layers of scintillating fibres
- Efficiency of fibres ~98-99%
- Describe trajectories in magnetic field with parameterizations
  - → no need to load large field map into GPU memory





#### SciFi detector: track reconstruction



#### Muon chambers



Four multi-wire proportional chambers interleaved with iron walls



Muon identification efficiency

#### Ingredients for selections



| Trigger          | Rate [kHz]   |
|------------------|--------------|
| 1-Track          | $249 \pm 18$ |
| 2-Track          | $663 \pm 30$ |
| High-$p_T$ muon  | $1 \pm 1$    |
| Displaced dimuon | $50 \pm 8$   |
| High-mass dimuon | $101 \pm 12$ |
| Total            | $971 \pm 36$ |

| Signal                       | GEC        | TIS -OR- TOS | TOS         | $\operatorname{GEC} \times \operatorname{TOS}$ |
|------------------------------|------------|--------------|-------------|------------------------------------------------|
| $B^0 \to K^{*0} \mu^+ \mu^-$ | $89 \pm 2$ | $85 \pm 2$   | $78 \pm 3$  | $69 \pm 3$                                     |
| $B^0 \to K^{*0} e^+ e^-$     | $84 \pm 3$ | $69 \pm 4$   | $62 \pm 4$  | $53 \pm 3$                                     |
| $B_s^0 \to \phi \phi$        | $83 \pm 3$ | $70 \pm 3$   | $65 \pm 4$  | $54 \pm 3$                                     |
| $D_s^+ \to K^+ K^- \pi^+$    | $82 \pm 4$ | $62 \pm 5$   | $38 \pm 5$  | $32 \pm 4$                                     |
| $Z \to \mu^+ \mu^-$          | $78 \pm 1$ | $97 \pm 1$   | $97 \pm 1$  | $75 \pm 1$                                     |

- GEC: Global event cut
- TIS: Trigger independent from signal
- TOS: Trigger on signal
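The last column of the efficiency table can be cross-checked against the product of the GEC and TOS columns. The sketch below assumes that the table entries are efficiencies in percent and uses only the quoted central values, so small differences from rounding are expected; the dictionary and function names are illustrative.

```python
# (GEC, TOS, quoted GEC x TOS): central values copied from the table above,
# assumed to be efficiencies in percent.
EFFICIENCIES = {
    "B0 -> K*0 mu+ mu-": (89, 78, 69),
    "B0 -> K*0 e+ e-": (84, 62, 53),
    "Bs0 -> phi phi": (83, 65, 54),
    "Ds+ -> K+ K- pi+": (82, 38, 32),
    "Z -> mu+ mu-": (78, 97, 75),
}


def combined_efficiency(gec_percent: float, tos_percent: float) -> float:
    """Product of two efficiencies given in percent, returned in percent."""
    return gec_percent * tos_percent / 100.0


def max_deviation(table: dict) -> float:
    """Largest absolute difference between the recomputed and the quoted product."""
    return max(
        abs(combined_efficiency(gec, tos) - quoted)
        for gec, tos, quoted in table.values()
    )


if __name__ == "__main__":
    for signal, (gec, tos, quoted) in EFFICIENCIES.items():
        product = combined_efficiency(gec, tos)
        print(f"{signal:20s} recomputed {product:5.1f}  quoted {quoted}")
```

With the rounded inputs, the recomputed products agree with the quoted column to within one percentage point, which is smaller than the quoted uncertainties.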

# Event rate reduced from 30 MHz to 1 MHz

**Consistent physics performance with TDR,** which assumed running on x86 architecture
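The quoted reduction from 30 MHz to about 1 MHz can be related to the total selection rate of $971 \pm 36$ kHz from the trigger rate table. The following sketch uses only the central value; the function names are illustrative.

```python
def reduction_factor(input_rate_hz: float, output_rate_hz: float) -> float:
    """Factor by which the event rate is reduced."""
    if output_rate_hz <= 0:
        raise ValueError("output_rate_hz must be positive")
    return input_rate_hz / output_rate_hz


def retention_fraction(input_rate_hz: float, output_rate_hz: float) -> float:
    """Fraction of input events kept by the selections."""
    return output_rate_hz / input_rate_hz


INPUT_RATE_HZ = 30e6   # 30 MHz input rate
TOTAL_RATE_HZ = 971e3  # central value of the total selection rate

if __name__ == "__main__":
    factor = reduction_factor(INPUT_RATE_HZ, TOTAL_RATE_HZ)
    kept = retention_fraction(INPUT_RATE_HZ, TOTAL_RATE_HZ)
    print(f"reduction factor: {factor:.1f}, retained fraction: {kept:.2%}")
```

The central value corresponds to a reduction by a factor of about 31, that is, roughly 3% of the input events are kept.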

# Full HLT1 running on GPUs

Physics performance matches HLT1 requirements

What about the throughput performance?



## Throughput on various GPUs

#### Throughput of the full HLT1 sequence



HLT1 can run on 500 GPUs → Buy GPUs instead of expensive network
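The figure of 500 GPUs implies an average throughput that each card has to sustain. This is a minimal sketch assuming the 30 MHz input rate is shared evenly across all cards, with no spare capacity; the function name is illustrative.

```python
def required_throughput_per_gpu(total_rate_hz: float, n_gpus: int) -> float:
    """Average event rate each GPU must process if the load is shared evenly."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return total_rate_hz / n_gpus


if __name__ == "__main__":
    rate_hz = required_throughput_per_gpu(30e6, 500)
    print(f"required throughput per GPU: {rate_hz / 1e3:.0f} kHz")
```

Under this assumption each GPU has to process 60 kHz of events on average.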

#### Allen scalability with GPU model



# The Allen team



# Summary

- Allen is the first complete high-throughput trigger implementation on GPUs
- Developed a compact, modular and scalable framework
- Baseline HLT1 can run on GPUs
- Scaling of GPU performance  $\rightarrow$  maximize physics discovery potential of LHCb
- Integration tests ongoing (see talk by D. Cámpora, Monday Track 5)
- HLT1 on GPUs is being considered as an alternative to the baseline x86 architecture



# Backup

## LHC Schedule



# **Graphics requirements**

#### **Graphics pipeline**

- Huge amount of arithmetic on independent data:
  - Transforming positions
  - Generating pixel colors
  - Applying material properties and light situation to every pixel

#### Hardware needs

- Access memory simultaneously and contiguously
- Bandwidth more important than latency
- Floating point and fixed-function logic

$\rightarrow$ Single instruction applied to multiple data by many threads in lockstep: SIMT (single instruction, multiple threads)





## Beauty and charm decays



- B<sup>±/0</sup> mass ~5.3 GeV
  - $\rightarrow$  Daughter p<sub>T</sub> O(1 GeV)
- $\tau \sim 1.6 \text{ ps} \rightarrow \text{flight distance } \sim 1 \text{ cm}$
- Detached muons from  $B \rightarrow J/\Psi X$ ,  $J/\Psi \rightarrow \mu^+\mu^-$
- Displaced tracks with high $p_T$



- D<sup>±/0</sup> mass ~1.9 GeV
  - $\rightarrow$ Daughter p<sub>T</sub> O(700 MeV)
- $\tau \sim 0.4 \text{ ps} \rightarrow \text{flight distance } \sim 4 \text{ mm}$
- Also produced from B decays

- PV: Primary vertex
- SV: Secondary vertex
- IP: Impact parameter, the distance between the point of closest approach of a track and a PV
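The quoted flight distances follow from the lifetimes once the relativistic boost is taken into account: the mean flight distance is $\beta\gamma c\tau$ with $\beta\gamma = p/m$. The sketch below uses the masses and lifetimes listed above; the hadron momenta (106 GeV and 63 GeV) are illustrative assumptions chosen to give boosts typical of the forward region, not values from the talk.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def boost(momentum_gev: float, mass_gev: float) -> float:
    """Relativistic boost factor beta * gamma = p / m (natural units)."""
    return momentum_gev / mass_gev


def mean_flight_distance_mm(momentum_gev: float, mass_gev: float, lifetime_ps: float) -> float:
    """Mean flight distance beta * gamma * c * tau in millimetres."""
    c_tau_mm = SPEED_OF_LIGHT_M_PER_S * lifetime_ps * 1e-12 * 1e3
    return boost(momentum_gev, mass_gev) * c_tau_mm


if __name__ == "__main__":
    # Momenta are assumptions for illustration only.
    b_distance = mean_flight_distance_mm(106.0, 5.3, 1.6)
    d_distance = mean_flight_distance_mm(63.0, 1.9, 0.4)
    print(f"B hadron: {b_distance:.1f} mm, D hadron: {d_distance:.1f} mm")
```

With these assumed momenta the results are close to 1 cm for the B hadron and 4 mm for the D hadron, matching the orders of magnitude quoted above.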

# Why no low level trigger?

Low level trigger on  ${\rm E}_{\rm T}$  from the calorimeter

Low level trigger on muon $p_T$, $B \rightarrow K^* \mu \mu$



Need track reconstruction at first trigger stage

Improved track description  $\rightarrow$  better impact parameter resolution



- Simple: Simplified Kalman filter with constant momentum assumption
- Param.: Parameterized Kalman filter with momentum estimate from SciFi track reconstruction
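To illustrate the general idea of a Kalman filter track fit, here is a minimal sketch for a straight line in one projection, with state $(x, t_x)$ and position measurements on successive detector planes. It is a generic textbook filter, not the Allen implementation: the function name, the initial slope variance and the constant process-noise term `slope_noise` (standing in for a momentum-independent scattering term) are all assumptions.

```python
def kalman_line_fit(zs, xs, sigma, slope_noise=0.0, initial_slope_variance=1.0):
    """Fit x(z) = x0 + tx * z with a two-state Kalman filter.

    zs, xs: plane positions and measured coordinates; sigma: hit resolution.
    Returns the filtered (x, tx) at the last plane.
    """
    x, tx = xs[0], 0.0
    # Symmetric 2x2 covariance stored as (p00, p01, p11).
    p00, p01, p11 = sigma**2, 0.0, initial_slope_variance
    z_prev = zs[0]
    for z, measurement in zip(zs[1:], xs[1:]):
        dz = z - z_prev
        # Prediction: straight-line transport, P -> F P F^T + Q.
        x = x + tx * dz
        p00 = p00 + 2.0 * dz * p01 + dz * dz * p11
        p01 = p01 + dz * p11
        p11 = p11 + slope_noise
        # Update with a position measurement, H = (1, 0).
        s = p00 + sigma**2
        k0, k1 = p00 / s, p01 / s
        residual = measurement - x
        x += k0 * residual
        tx += k1 * residual
        p00, p01, p11 = (1.0 - k0) * p00, (1.0 - k0) * p01, p11 - k1 * p01
        z_prev = z
    return x, tx


if __name__ == "__main__":
    planes = [0.0, 10.0, 20.0, 30.0, 40.0]
    hits = [1.0 + 0.5 * z for z in planes]  # noise-free hits on a known line
    print(kalman_line_fit(planes, hits, sigma=0.01))
```

On noise-free hits the filter recovers the slope and the position at the last plane; the quality of the resulting track state is what drives the impact parameter resolution discussed above.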

#### GPU in a nutshell

- Core: multiple SIMT threads grouped together
- GPU: many cores grouped together







| PCIe generation | Bandwidth (16 lanes) | Year |
|-----------------|----------------------|------|
| 3.0             | 15.75 GB/s           | 2010 |
| 4.0             | 31.5 GB/s            | 2017 |
#### Selections

| Selection name   | Criteria                                                   |
|------------------|------------------------------------------------------------|
| 1-Track          | Single displaced track with high $p_T$                     |
| 2-Track          | Two-track vertex with significant displacement and $p_T$   |
| High-$p_T$ muon  | Single muon with high $p_T$                                |
| Displaced dimuon | Displaced di-muon vertex                                   |
| High-mass dimuon | Di-muon vertex with mass near or larger than the J/$\Psi$  |

Criteria applied to signal decays in efficiency calculations

| Particles                   | Criteria                                                    |
|-----------------------------|-------------------------------------------------------------|
| $b$ and $c$ hadrons         | $p_{\rm T} > 2 { m ~GeV}$                                   |
|                             | $\tau > 0.2 \text{ ps}$                                     |
| $b$ and $c$ hadron children | $p_{\rm T} > 200 { m ~MeV}$                                 |
|                             | $2 < \eta < 5$                                              |
|                             | reconstructible in the Velo and SciFi detector (long track) |
| Z children                  | $p_{\rm T} > 20 { m ~GeV}$                                  |
|                             | $2 < \eta < 5$                                              |
|                             | reconstructible in the Velo and SciFi detector (long track) |
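The criteria listed above can be written as a simple predicate on a simulated signal candidate. The data classes and function names below are hypothetical and only serve to illustrate the cuts; they do not correspond to any Allen or LHCb software interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Child:
    """Hypothetical decay product with the quantities used in the criteria."""
    pt_gev: float
    eta: float
    is_long_track: bool  # reconstructible in the Velo and SciFi detector


@dataclass
class HadronCandidate:
    """Hypothetical b or c hadron with its decay products."""
    pt_gev: float
    lifetime_ps: float
    children: List[Child] = field(default_factory=list)


def child_passes(child: Child, min_pt_gev: float) -> bool:
    """pT threshold, 2 < eta < 5, and long-track reconstructibility."""
    return child.pt_gev > min_pt_gev and 2.0 < child.eta < 5.0 and child.is_long_track


def heavy_flavour_passes(candidate: HadronCandidate) -> bool:
    """Criteria for b and c hadrons and their children."""
    return (
        candidate.pt_gev > 2.0
        and candidate.lifetime_ps > 0.2
        and all(child_passes(child, 0.2) for child in candidate.children)
    )


def z_children_pass(children: List[Child]) -> bool:
    """Criteria for the children of a Z boson."""
    return all(child_passes(child, 20.0) for child in children)


if __name__ == "__main__":
    kaon = Child(pt_gev=0.8, eta=3.1, is_long_track=True)
    pion = Child(pt_gev=0.5, eta=2.7, is_long_track=True)
    print(heavy_flavour_passes(HadronCandidate(4.2, 1.1, [kaon, pion])))
```

Only signal decays that pass such a predicate enter the denominator of the efficiency calculation.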

# HLT1 algorithms in Allen



## Throughput versus occupancy



- Data volume proportional to occupancy
- Small performance decrease at high occupancy
  - $\rightarrow$  will be able to handle real data (likely higher in occupancy than simulation)

# Algorithm breakdown

Algorithms shown in the breakdown plot:

- search by triplet
- lf triplet seeding
- pv beamline multi fitter
- muon add coords crossing maps
- lf collect candidates
- pv beamline peak
- scifi direct decoder v4
- lf quality filter x
- lf triplet keep best
- estimate input size
- compass ut
- masked velo clustering
- lf extend tracks x
- ut search windows
- calculate phi and sort
- lf fit
- lf search initial windows



Showing only algorithms contributing  $\geq 2\%$ 

#### GPUs for throughput measurement



| Card                | # cores | Max freq. (GHz) | L2 cache (MiB) | DRAM (GiB) | DRAM type | CUDA cap. | Allen settings |
|---------------------|---------|-----------|-----------|-------|-------|------|----------|
| Geforce GTX 670     | 1344    | 1.06      | 0.5       | 1.95  | GDDR5 | 3.0  | Low      |
| Geforce GTX 680     | 1536    | 1.14      | 0.5       | 1.95  | GDDR5 | 3.0  | Low      |
| Geforce GTX 780 Ti  | 2880    | 0.93      | 1.5       | 2.95  | GDDR5 | 3.5  | Low      |
| Geforce GTX 980     | 2048    | 1.29      | 2         | 2.01  | GDDR5 | 5.2  | Low      |
| Geforce GTX TITAN X | 3072    | 1.08      | 3         | 11.92 | GDDR5 | 5.2  | High     |
| Geforce GTX 1060 6G | 1280    | 1.81      | 1.5       | 5.94  | GDDR5 | 6.1  | Low      |
| Geforce GTX 1080 Ti | 3584    | 1.67      | 2.75      | 10.92 | GDDR5 | 6.1  | High     |
| Geforce RTX 2080 Ti | 4352    | 1.545     | 6         | 10.92 | GDDR6 | 7.5  | High     |
| Tesla T4            | 2560    | 1.59      | 4         | 15.72 | GDDR6 | 7.5  | High     |
| Tesla V100 32GB     | 5120    | 1.37      | 6         | 32    | HBM2  | 7.0  | High     |

# Throughput of x86 HLT1

