# A heterogeneous software-only trigger for the upgraded LHCb experiment

### **Dorothea vom Bruch**

Center for Particle Physics Marseille (CPPM), Aix-Marseille University, IN2P3 / CNRS

September 12<sup>th</sup>, 2022, Vistas on Detector Physics, Heidelberg







### "Trigger": Real-time data analysis and reduction






# Match trigger to hardware

#### First: Hardware trigger

- Data obtained directly from detector
- Decision taken in fixed time, low latency
- Based on local information from a subdetector
- Chip constraints → calculations cannot be too complex

#### Field Programmable Gate Arrays (FPGAs)

- Low & deterministic latency
- Connectivity to any data source → high bandwidth
- Intermediate floating point performance



#### Second: Software trigger

- Data already transferred to a server
- Decision taken with medium latency
- Based on information from several subdetectors
- Processor constraints less stringent

#### **CPUs and GPUs**

- Higher latency
- Very good floating point performance
- Connected to server (via PCIe connection for GPU)





### Efficient signal selection






## The LHCb experiment at CERN

#### LHC @ CERN



General-purpose detector in the forward region, specialized in beauty and charm physics



# Beauty and charm decays



- B<sup>±/0</sup> mass ~5.3 GeV
  - → Daughter $p_T \sim \mathcal{O}(1 \text{ GeV})$
- $\tau \sim 1.6 \text{ ps} \rightarrow \text{flight distance} \sim 1 \text{ cm}$
- Detached muons from $B \rightarrow J/\psi X$, $J/\psi \rightarrow \mu^+\mu^-$
- Displaced tracks with high  $p_T$



- D<sup>±/0</sup> mass ~1.9 GeV
  - → Daughter $p_T \sim \mathcal{O}(700 \text{ MeV})$
- $\tau \sim 0.4 \text{ ps} \rightarrow \text{flight distance} \sim 4 \text{ mm}$
- Also produced from B decays
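The flight distances quoted above follow from the mean decay length $L = \beta\gamma\, c\tau$ with $\beta\gamma = p/m$. The sketch below is a back-of-the-envelope check only: the hadron momenta (100 GeV and 60 GeV) are assumed, illustrative values that do not appear in the slides, while the masses and lifetimes are the rounded numbers from the lists above.

```python
# Speed of light in mm per picosecond.
C_MM_PER_PS = 0.299792458


def mean_flight_distance_mm(momentum_gev, mass_gev, lifetime_ps):
    """Mean lab-frame flight distance L = (p / m) * c * tau."""
    beta_gamma = momentum_gev / mass_gev  # dimensionless boost factor
    return beta_gamma * C_MM_PER_PS * lifetime_ps


# Momenta are assumptions for illustration; masses and lifetimes are the slide values.
b_distance = mean_flight_distance_mm(momentum_gev=100.0, mass_gev=5.3, lifetime_ps=1.6)
d_distance = mean_flight_distance_mm(momentum_gev=60.0, mass_gev=1.9, lifetime_ps=0.4)
print(f"B hadron: {b_distance:.1f} mm, D hadron: {d_distance:.1f} mm")
```

With these assumed momenta the results land close to the ~1 cm and ~4 mm quoted for beauty and charm hadrons.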

- PV: Primary vertex
- SV: Secondary vertex
- IP: Impact parameter, the distance between a PV and the point of closest approach of a track to that PV
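As an illustration of the impact parameter definition, here is a minimal sketch that assumes a straight-line track given by a point and a direction vector. The function name and the track parametrization are hypothetical and are not taken from LHCb software.

```python
import math


def impact_parameter(track_point, track_direction, primary_vertex):
    """Distance from the PV to the closest point on a straight-line track.

    Uses |(PV - point) x direction| / |direction|.
    """
    d = [pv - p for pv, p in zip(primary_vertex, track_point)]
    u = track_direction
    cross = (
        d[1] * u[2] - d[2] * u[1],
        d[2] * u[0] - d[0] * u[2],
        d[0] * u[1] - d[1] * u[0],
    )
    norm_cross = math.sqrt(sum(c * c for c in cross))
    norm_u = math.sqrt(sum(c * c for c in u))
    return norm_cross / norm_u


# A track parallel to the z axis, displaced by 0.1 (arbitrary units) in x from a PV at the origin.
ip = impact_parameter((0.1, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
print(ip)
```

A track that passes through the PV has an IP of zero, whereas tracks from a displaced secondary vertex typically have a non-zero IP.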

# LHCb Run 1 & 2 trigger



# Why no low level trigger for LHCb in Run 3?



Low-level trigger on muon $p_T$, $B \rightarrow K^* \mu \mu$



#### Need track reconstruction at first trigger stage

# Change in trigger paradigm



### Access as much information about the collision as early as possible

# LHCb data processing in Run 3



# Real-time software challenges in HEP



- LHC Run 3 (2022)
  - LHCb: pp collisions at 30 MHz → 5 TB/s processed in software
  - ALICE: PbPb collisions at 50 kHz → 3.5 TB/s processed in software
- LHC Run 4 (~2029)
  - CMS & ATLAS: pp collisions at 40 MHz, hardware trigger rate increased from 100 kHz to 1 MHz → 6 TB/s processed in software
- LHC Run 5 (~2035)
  - LHCb undergoes Upgrade II → 25 TB/s processed in software

Courtesy Alex Cerri, LHCP 2022

Global mobile data traffic in 2020: 40 exabytes / month

LHCb experiment in 2022
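To put the two figures on the same footing, the monthly mobile traffic can be converted into an average rate. This is a rough sketch that assumes a 30-day month and decimal units (1 EB = 10<sup>18</sup> bytes, 1 TB = 10<sup>12</sup> bytes); the 5 TB/s for LHCb is the Run 3 value quoted above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def exabytes_per_month_to_tb_per_s(exabytes_per_month):
    """Convert a monthly data volume in exabytes into an average rate in TB/s."""
    bytes_per_second = exabytes_per_month * 1e18 / SECONDS_PER_MONTH
    return bytes_per_second / 1e12


mobile_tb_per_s = exabytes_per_month_to_tb_per_s(40)
lhcb_tb_per_s = 5.0  # LHCb Run 3 rate processed in software, from the slide
ratio = lhcb_tb_per_s / mobile_tb_per_s
print(f"Mobile traffic: {mobile_tb_per_s:.1f} TB/s, LHCb / mobile: {ratio:.2f}")
```

Under these assumptions the LHCb detector output in 2022 is roughly a third of the average global mobile data rate of 2020.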



### A closer look at LHCb



### What do we reconstruct at LHCb?



### What does track reconstruction imply?

- Pattern recognition
- Track fit: $f(x) = \dots \pm \dots$

Huge computing challenge for 10<sup>9</sup> – 10<sup>10</sup> tracks / second



- High Level Trigger 1 (HLT1):
  - Full charged particle track and vertex reconstruction
  - Electron and muon identification
  - Few inclusive single and two-track selections
- High Level Trigger 2 (HLT2):
  - Aligned and calibrated detector
  - Offline-quality pattern recognition
  - Full particle identification, including RICH reconstruction
  - Full track fit, requires detailed magnetic field and detector description




HLT1:

- Manageable number of algorithms
- Highly parallel tasks
- No detailed knowledge of magnetic field & detector required

HLT2:

- Exclusive selections using full PID information
- Best knowledge of alignment & calibration
- Reconstruction algorithms optimized for different track types
- Full track fit



# Computing performance challenge @ CERN



Courtesy Dr. Bernd Panzer-Steindel (CERN/IT, CTO)

- Estimated performance improvement: 10-15% per year for the same budget
- Computing needs are not met

# Trend towards heterogeneous solutions: TOP500



# Graphics Processing Unit (GPU)

Developed for graphics-oriented workloads





# GPU compared to CPU



CPU:

- Low core count / powerful ALU
- Complex control unit
- Large caches

→ Latency optimized

GPU:

- High core count
- No complex control unit
- Small caches

→ **Throughput optimized**

### When to go parallel? Amdahl's law



$\text{Speedup in latency} = \dfrac{1}{S + P/N}$

- $S$: sequential fraction of the program
- $P$: parallel fraction of the program ($S + P = 1$)
- $N$: number of processors






Consider how much of the problem can actually be parallelized!
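A short numerical sketch of Amdahl's law makes the point concrete. The parallel fractions and the processor count below are illustrative values, not numbers from the talk.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup 1 / (S + P / N), with S = 1 - P the sequential fraction."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_processors)


for p in (0.50, 0.95, 0.99):
    speedup = amdahl_speedup(p, n_processors=1000)
    limit = 1.0 / (1.0 - p)  # asymptotic speedup for N -> infinity
    print(f"P = {p:.2f}: speedup with 1000 processors = {speedup:.1f}, limit = {limit:.0f}")
```

Even with 1000 processors, a program that is 95% parallel gains less than a factor of 20, because the sequential 5% dominates the run time.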



# How does HLT1 map to GPUs?

| Characteristics of LHCb HLT1                                                                    | Characteristics of GPUs                                                                    |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Intrinsically parallel problem:<br>- Run events in parallel<br>- Reconstruct tracks in parallel | Good for<br>- Data-intensive parallelizable applications<br>- High throughput applications |
| Huge compute load                                                                               | Many TFLOPS                                                                                |
| Full data stream from all detectors is read out<br>→ no stringent latency requirements          | Higher latency than CPUs, not as predictable as FPGAs                                      |
| Small raw event data (~100 kB)                                                                  | Connection via PCIe → limited I/O bandwidth                                                |
| Small raw event data (~100 kB)                                                                  | Thousands of events fit into O(10) GB of memory                                            |

### Minimize copies to / from GPU



# Three levels of parallelization



- Allen: framework developed for processing LHCb's HLT1 on GPUs, named after Frances E. Allen
- Fully standalone software project: https://gitlab.cern.ch/lhcb/Allen, Sphinx documentation
- Cross-architecture compatibility via macros & a few coding guidelines
  - GPU code written in CUDA, runs on CPUs, Nvidia GPUs (CUDA), AMD GPUs (HIP)
- Algorithm sequences defined in Python and generated at run-time
- Multi-event processing with dedicated scheduler
- Memory manager allocates large chunk of GPU memory at start-up
- Reconstruction algorithms re-designed for parallelism and low memory usage: O(MB) per core
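The idea of reserving one large block of memory at start-up and handing out pieces of it, rather than allocating dynamically during event processing, can be sketched with a simple bump allocator. This is a generic illustration in host-side Python: the class `Arena`, its methods, and the 256-byte alignment are hypothetical and do not describe Allen's actual memory manager.

```python
class Arena:
    """Bump allocator over a single buffer reserved at start-up (hypothetical sketch)."""

    def __init__(self, size_bytes, alignment=256):
        self._buffer = bytearray(size_bytes)  # the only real allocation
        self._view = memoryview(self._buffer)
        self._alignment = alignment
        self._offset = 0

    @property
    def used(self):
        """Number of bytes consumed so far, including alignment padding."""
        return self._offset

    def allocate(self, n_bytes):
        """Return a view on the next free, aligned region of the buffer."""
        start = -(-self._offset // self._alignment) * self._alignment  # round up
        end = start + n_bytes
        if end > len(self._buffer):
            raise MemoryError("arena exhausted")
        self._offset = end
        return self._view[start:end]

    def reset(self):
        """Release everything at once, e.g. before the next batch of events."""
        self._offset = 0


arena = Arena(size_bytes=1024)
hits = arena.allocate(100)    # occupies bytes 0..99
tracks = arena.allocate(100)  # starts at the next 256-byte boundary
print(arena.used)
arena.reset()
```

Allocation becomes a pointer increment and freeing becomes a single reset, which avoids the cost and fragmentation of repeated dynamic allocations.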



# Common intra-event parallelization techniques

#### Raw data decoding

- Transform binary payload from subdetector raw banks into collections of hits (x,y,z) in global coordinate system
- Parallelize over all readout units

#### Track reconstruction

- Consists of two steps:
  - Pattern recognition: Parallelize across hit combinations
  - Track fitting: Parallelize across track candidates

#### Vertex finding

- Reconstruct primary and secondary vertices
- Parallelize across combinations of tracks and vertex seeds









- Build "triplets" of three hits on consecutive layers → parallelization
- Choose them based on alignment in phi
- Hits sorted by phi → memory accesses as contiguous as possible: data locality
- Extend triplets to next layer → parallelization
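The triplet-building step can be sketched as follows. This is a deliberately simplified illustration, not the algorithm used in HLT1: it assumes each layer is just a phi-sorted list of hit angles, seeds from the middle layer, ignores phi wrap-around and the extension to further layers, and uses hypothetical function names.

```python
from bisect import bisect_left


def closest_in_phi(sorted_phis, phi):
    """Index of the hit closest to `phi` in a non-empty, phi-sorted layer."""
    i = bisect_left(sorted_phis, phi)  # binary search works because hits are sorted
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_phis)]
    return min(candidates, key=lambda j: abs(sorted_phis[j] - phi))


def seed_triplets(layer0, layer1, layer2, max_dphi):
    """Build (i0, i1, i2) index triplets of phi-aligned hits on three consecutive layers."""
    if not layer0 or not layer2:
        return []
    triplets = []
    # Every middle-layer hit is an independent task, so this loop could run in parallel.
    for i1, phi in enumerate(layer1):
        i0 = closest_in_phi(layer0, phi)
        i2 = closest_in_phi(layer2, phi)
        if abs(layer0[i0] - phi) <= max_dphi and abs(layer2[i2] - phi) <= max_dphi:
            triplets.append((i0, i1, i2))
    return triplets


# Toy hits (phi in radians), already sorted within each layer.
layer0 = [0.10, 1.00, 2.50]
layer1 = [0.11, 1.02, 2.00]
layer2 = [0.12, 1.03, 2.52]
print(seed_triplets(layer0, layer1, layer2, max_dphi=0.05))  # [(0, 0, 0), (1, 1, 1)]
```

Because the hits are sorted by phi, each search touches only neighbouring entries of the array, which is the data-locality argument made above.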

# HLT1: Track reconstruction performance



LHCb-FIGURE-2020-014

# HLT1: Computing throughput



# GPU HLT1 within data acquisition system





## HLT1 commissioning: Allen within the DAQ system



# HLT1 commissioning: Towards first collisions




July 2022: First collisions @ 13.6 TeV at the LHC

Happy trigger commissioning team



# Looking at the physics performance



- KstEEMD, Hlt1TwoTrackMVADecision
- KstMuMuMD, Hlt1TwoTrackMVADecision

CERN-LHCC-2020-006

Selection efficiencies for electron and muon final states similar

In Run 2: Electron selection efficiency roughly a factor of two worse than for muons, due to the hardware-level trigger


# Physics prospects with the all-software trigger

- Understand the current pattern of flavor anomalies
- Exploit the higher statistics and larger phase space of electrons
- Precision measurements of rare decays with electrons: $b \rightarrow s e e$, $b \rightarrow d e e$
  - Branching fractions, ratios of branching fractions to muon modes, angular analyses
- Semileptonic decays with electrons: $b \rightarrow c e \nu$
  - Ratios of branching fractions to tauonic mode, angular analyses
- Exploit higher statistics at low momentum
  - Decays with multiple tracks in the final state
  - Charm decays
- Adding on to the trigger in the future
  - Reconstruct tracks of long-lived particles: K<sub>S</sub> studies
  - Fill histograms directly in the trigger, for example for dark photon searches



arXiv:1808.08865


- Real-time analysis systems of HEP experiments are entering the exascale computing era
- Need to exploit modern computing technologies to face this challenge
- LHCb is commissioning a fully software-based trigger for Run 3 (started in 2022)
- First full trigger stage entirely on GPUs @ 30 MHz  $\rightarrow$  a first in HEP
- Developed Allen: heterogeneous software framework for multi-event processing
- Gain expertise in heterogeneous DAQ systems
  - $\rightarrow$  Preparing to exploit emerging new architectures entering the market
- Physics performance opens new options for physics analyses
- In good position to prepare for LHCb Upgrade II with 400 Tbit/s of data rate



# Backup

# HLT2 on CPUs



- Fully aligned & calibrated detector, offline-quality track fit & particle identification @ 1 MHz
- HLT2 throughput significantly improved over the last years
- Hundreds of exclusive selections being written for specific analyses, using new multi-threaded framework



# Selective persistency: "Turbo stream"



# Recurrent tasks in real-time data analysis

#### Raw data decoding

- Transform binary payload from subdetector raw banks into collections of hits (x,y,z) in LHCb coordinate system

#### Track reconstruction

- Consists of two steps:
  - Pattern recognition: Which hits were produced by the same particle? → "Track"
  - Track fitting: Describe track with mathematical model

#### Vertex finding

- Where did proton-proton collisions take place?
- Where did particles decay within the detector volume?

#### Particle identification

- Reconstruct clusters in the calorimeter / muon detectors
- Reconstruct rings in the RICH detectors
- Match tracks to clusters / RICH signals







#### What about the cost?



https://arxiv.org/pdf/2003.11491.pdf

# Heterogeneous solutions & sustainability: Green500

| Rank | TOP500<br>Rank | System                                                                                                                                                                | Cores   | Rmax<br>(TFlop/s) | Power<br>(kW) | Power Efficiency<br>(GFlops/watts) |
|------|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|-------------------|---------------|------------------------------------|
| 1    | 301            | MN-3 - MN-Core Server, Xeon Platinum<br>8260M 24C 2.4GHz, Preferred Networks<br>MN-Core, MN-Core DirectConnect,<br>Preferred Networks<br>Preferred Networks<br>Japan  | 1,664   | 2,181.2           | 55            | 39.379                             |
| 2    | 291            | SSC-21 Scalable Module - Apollo 6500<br>Gen10 plus, AMD EPYC 7543 32C 2.8GHz,<br>NVIDIA A100 80GB, Infiniband HDR200,<br>HPE<br>Samsung Electronics<br>South Korea    | 16,704  | 2,274.1           | 103           | 33.983                             |
| 3    | 295            | Tethys - NVIDIA DGX A100 Liquid Cooled<br>Prototype, AMD EPYC 7742 64C 2.25GHz,<br>NVIDIA A100 80GB, Infiniband HDR, Nvidia<br>NVIDIA Corporation<br>United States    | 19,840  | 2,255.0           | 72            | 31.538                             |
| 4    | 280            | Wilkes-3 - PowerEdge XE8545, AMD<br>EPYC 7763 64C 2.45GHz, NVIDIA A100<br>80GB, Infiniband HDR200 dual rail, DELL<br>EMC<br>University of Cambridge<br>United Kingdom | 26,880  | 2,287.0           | 74            | 30.797                             |
| 5    | 30             | HiPerGator AI - NVIDIA DGX A100, AMD<br>EPYC 7742 64C 2.25GHz, NVIDIA A100,<br>Infiniband HDR, Nvidia<br>University of Florida<br>United States                       | 138,880 | 17,200.0          | 583           | 29.521                             |

- All top 5 Green500 systems use accelerators
- 4/5 use Nvidia GPUs combined with AMD EPYC CPUs
- MN-3 uses an accelerator optimized for matrix arithmetic
- Of the top 30 Green500:
  - 26 use Nvidia GPUs
  - 3 use A64FX vector-processors (ARM)
  - 1 uses a many-core microprocessor (PEZY-SC3)

# Multi-core versus many-core architecture

#### Multi-core

- O(10) cores
- Flexible: designed for both serial and parallel code
- Larger caches
- Emphasis on single thread performance



#### Many-core

- O(100-1000) cores
- Designed for parallel code
- Small caches
- Simpler cores



|                  | Scientific GPUs | Gaming GPUs |
|------------------|-----------------|-------------|
| Precision        | ~3 times more single precision TFLOPS than double precision<br>→ suited for double precision | ~40 times more single precision TFLOPS than double precision<br>→ not well suited for double precision |
| Error correction | Available | Not available |
| Connection       | NVLink & PCIe | Only PCIe |
| Price            | ~5-6 x the price of gaming GPUs | Several hundred dollars, depending on model (and year) |

|                          | AMD Ryzen Threadripper 3990X | Nvidia A100                      |
|--------------------------|------------------------------|----------------------------------|
| Core count               | 64 cores / 128 threads       | 6912 cores                       |
| Frequency                | 2.9 GHz                      | 1.41 GHz                         |
| Peak compute performance | 3.7 TFLOPs                   | 19.5 TFLOPs (single precision)   |
| Memory bandwidth         | Max. 95 GB/s                 | 1.6 TB/s                         |
| Memory capacity          | Max. O(1) TB                 | 40/80 GB                         |
| Technology               |                              | 7 nm                             |
| Die size                 | 717 mm<sup>2</sup>           | 826 mm<sup>2</sup>               |
| Transistor count         | 3.8 billion                  | 54.2 billion                     |
| Model                    | Minimize latency             | Hide latency through parallelism |
# Connectivity with GPU: PCIe connection

| PCIe generation | 1 lane    | 16 lanes   | Year of announcement |
|-----------------|-----------|------------|----------------------|
| 2.0             | 500 MB/s  | 8 GB/s     | 2007                 |
| 3.0             | 985 MB/s  | 15.75 GB/s | 2010                 |
| 4.0             | 1.97 GB/s | 31.5 GB/s  | 2011                 |
| 5.0             | 3.94 GB/s | 63 GB/s    | 2017                 |
| 6.0             | 7.56 GB/s | 121 GB/s   | 2019                 |

Source: https://en.wikipedia.org/wiki/PCI_Express
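To see how these link speeds compare with the HLT1 input, the per-GPU bandwidth requirement can be estimated. This is an order-of-magnitude sketch only: it takes the 30 MHz event rate, the ~100 kB raw event size and the O(250) GPUs quoted elsewhere in these slides at face value, assumes events are shared evenly between GPUs, and compares against theoretical 16-lane limits rather than achieved transfer rates.

```python
# Theoretical 16-lane bandwidths in GB/s, from the table above.
PCIE_X16_GB_PER_S = {"3.0": 15.75, "4.0": 31.5, "5.0": 63.0}


def required_gb_per_s(event_rate_hz, event_size_kb, n_gpus):
    """Input bandwidth each GPU needs if events are distributed evenly."""
    bytes_per_s = event_rate_hz * event_size_kb * 1e3 / n_gpus
    return bytes_per_s / 1e9


need = required_gb_per_s(event_rate_hz=30e6, event_size_kb=100, n_gpus=250)
print(f"Required per GPU: {need:.1f} GB/s")
for generation, limit in PCIE_X16_GB_PER_S.items():
    print(f"PCIe {generation} x16: {limit} GB/s, sufficient: {need < limit}")
```

With these inputs each GPU needs about 12 GB/s, which is below the theoretical limit of a 16-lane link for every generation listed in the dictionary but leaves little margin on PCIe 3.0.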


#### CPU – GPU – FPGA

|      | Latency         | Connection             | Engineering cost | FP performance | Serial / parallel | Memory | Backward compatibility |
|------|-----------------|------------------------|------------------|----------------|-------------------|--------|------------------------|
| CPU  | O(10) µs        | Ethernet, USB, PCIe    | Low entry level: programmable with C++, Python, etc. | O(1-10) TFLOPs | Optimized for serial, increasingly vector processing | O(100) GB RAM | Compatible, except for vector instruction sets |
| GPU  | O(100) µs       | PCIe, NVLink           | Low to medium entry level: programmable with CUDA, OpenCL, etc. | O(10) TFLOPs | Optimized for parallel performance | O(10) GB | Compatible, except for specific features |
| FPGA | Fixed O(100) ns | Any connection via PCB | High entry level: traditionally hardware description languages, some high-level syntax available | Optimized for fixed point performance | Optimized for parallel performance | O(10) MB on the FPGA itself | Not easily backward compatible |

## Overview of GPU usage in various HEP experiments

| Experiment | Main tasks processed on GPU | Event / data rate | Number of GPUs | Deployment date |
|------------|-----------------------------|-------------------|----------------|-----------------|
| Mu3e       | Track & vertex reconstruction | 20 MHz / 32 Gbit/s | O(10) | 2023 |
| CMS        | Decoding, clustering, pattern recognition in pixel detector | 100 kHz | O(400) | 2022 |
| ALICE      | Track reconstruction in three sub-detectors | 50 kHz Pb-Pb or < 5 MHz p-p / 30 Tbit/s | O(2000) | 2022 |
| LHCb       | Decoding, clustering, track reconstruction in three sub-detectors, vertex reconstruction, muon ID, selections | 30 MHz / 40 Tbit/s | O(250) | 2022 |

https://arxiv.org/pdf/2003.11491.pdf


# Common characteristics of software frameworks

- Same code base compiled for various computing architectures: GPUs, x86,...
- Memory management system for GPU memory: avoid dynamic memory allocation
- Schedule pipelines of GPU (and CPU) algorithms → hide memory copies
- Integration into experiments' main software frameworks



- Allen framework at LHCb
- Patatrack at CMS
- O2 at ALICE

# History: HLT1 architecture choice





- Developed two solutions simultaneously
- Both the multi-threaded CPU & the GPU HLT1 fulfilled the requirements from the 2014 TDR
- Detailed cost-benefit analysis (arXiv:2105.04031)
- GPU solution leads to cost savings on processors and the network
- Throughput headroom for additional features
- Decision: A GPU-based software trigger will allow LHCb to expand its physics reach in Run 3 and beyond.



See also arXiv:2106.07701 on LHCb's energy efficiency with a CPU and GPU HLT1

## Parallelization of reconstruction tasks





Split problem into independent tasks

Example: primary vertex (PV) reconstruction



- Kalman filter: one method for track fitting
- Iterates sequentially over all hits on a track
- For every hit, estimate the state of the track at that location:
  - First: predict it based on the previous state
  - Second: update it based on the measurement (hit)



- At last plane: best linear estimator for track state



- Only parallelizable across the tracks in one event: the hits of a single track have to be processed one after the other
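The predict/update scheme described above is that of a Kalman filter. The following is a minimal sketch under simplifying assumptions that are not taken from the slides: a straight-line track model $x(z) = x_0 + t_x z$ in a single projection, no magnetic field, no material effects, and one position measurement per plane with a common resolution `sigma`.

```python
def kalman_fit(z_planes, x_hits, sigma, initial_variance=1e6):
    """Fit a straight line with a Kalman filter.

    Returns the state (x, tx) and covariance (cxx, cxt, ctt) at the last plane.
    """
    x, tx = x_hits[0], 0.0
    # Covariance [[cxx, cxt], [cxt, ctt]], started very wide (almost no prior knowledge).
    cxx, cxt, ctt = initial_variance, 0.0, initial_variance
    z_prev = z_planes[0]
    for z, measured in zip(z_planes, x_hits):
        # Predict: propagate state and covariance along a straight line to this plane.
        dz = z - z_prev
        x = x + tx * dz
        cxx = cxx + 2.0 * dz * cxt + dz * dz * ctt
        cxt = cxt + dz * ctt
        # Update: weigh the prediction against the measured hit position.
        residual = measured - x
        s = cxx + sigma * sigma
        kx, kt = cxx / s, cxt / s  # Kalman gain
        x, tx = x + kx * residual, tx + kt * residual
        cxx, cxt, ctt = (1.0 - kx) * cxx, (1.0 - kx) * cxt, ctt - kt * cxt
        z_prev = z
    return (x, tx), (cxx, cxt, ctt)


# Toy track: hits lying exactly on x = 1 + 0.5 * z.
z_planes = [0.0, 1.0, 2.0, 3.0, 4.0]
x_hits = [1.0 + 0.5 * z for z in z_planes]
state, covariance = kalman_fit(z_planes, x_hits, sigma=0.1)
print(state)
```

The loop over hits is inherently sequential because each prediction depends on the previous state, whereas separate calls to `kalman_fit` for different tracks are independent and can run in parallel.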