# The Allen Project: a GPU trigger for LHCb

Daniel Craik (MIT), on behalf of the LHCb collaboration, 2020-01-27











#### The LHCb detector



- Located at point 8 of the LHC
- General-purpose detector in the forward region
- Specialised in studying b- and c-decays

Instrumented in the forward region to exploit forward production of c- and b-hadrons





- Instrumentation in the forward region (2 < η < 5)
- Excellent secondary vertex reconstruction
- Precise tracking before and after magnet
- Good PID separation up to  $\sim 100 \, {\rm GeV}/c$

## LHCb timeline



# The LHCb detector: Run III upgrade





- New vertex locator
- New tracking (UT, SciFi)
- New front-end electronics
- Run at 5× higher luminosity

# Challenges in Run III



- At increased luminosity, charm (beauty) in 24 % (2 %) of bunch crossings
  - Cannot write out charm at 7 MHz
- Trigger must distinguish signal from less-interesting signal as well as from background
- No longer feasible to have first trigger based on calorimeters and muon detectors alone
- Need as much information about an event as soon as possible → run tracking



# Tracking at LHCb

- Tracking requires readout of several sub-detectors
- Tracks must be extrapolated between VELO, UT and SciFi (T1–T3)
- Also match to muon stations for muon particle ID



# LHCb trigger in Run III



- Hardware trigger to be removed for Run III
- HLT1 software trigger must perform at 30× higher rate with 5× the pileup
- Buffer will shrink from O(weeks) to O(days)
- Significant increase in data transfer rates
- New trigger setup offers up to $\sim 10\times$ efficiency improvement for some physics channels

## Alternative trigger for Run III?



- Option to move to a GPU-based HLT1 with GPUs installed on the Event Builder servers
- Free up full CPU farm for HLT2 and save on networking between event builders and CPU farm
- Demonstrated technical feasibility
- Decision due in the next few months

# Why GPUs?



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2017 by K. Rupp.

- Moore's law still holds but single-thread performance has levelled off
- Gains now to be made through parallelisation
- GPUs specialised for massively parallel operations (100s–1000s of cores)

## GPU architecture



- Kernel executed in many threads
- Threads run same algorithm on different parts of the data
- Threads arranged within blocks within a grid
- Threads within a block share memory and can be synchronised
- Block and grid dimensions optimised for each kernel
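The thread hierarchy above can be illustrated without a GPU. The following is a minimal Python sketch that emulates how a one-dimensional grid of blocks maps threads onto data elements. The function names and the grid-stride loop are illustrative assumptions, not code taken from Allen.

```python
def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Flat index of a thread in a 1D grid.

    Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x.
    """
    return block_idx * block_dim + thread_idx


def emulate_kernel(data, grid_dim: int, block_dim: int):
    """Emulate a kernel in which every thread doubles the elements it owns.

    On a GPU the loops over blocks and threads run concurrently; here they
    are executed one after another purely to show which thread touches
    which element.
    """
    out = list(data)
    total_threads = grid_dim * block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            start = global_thread_index(block_idx, block_dim, thread_idx)
            # Grid-stride loop: a thread handles elements
            # start, start + total_threads, start + 2 * total_threads, ...
            for i in range(start, len(data), total_threads):
                out[i] = 2 * data[i]
    return out


# Ten elements processed by 2 blocks of 4 threads each (8 threads in total).
result = emulate_kernel(list(range(10)), grid_dim=2, block_dim=4)
```

Because every thread runs the same code on a different index, the result does not depend on the order in which blocks and threads execute, which is what allows the hardware to schedule them freely.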



Significant gains require large fraction of sequence to be parallelised
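This statement is Amdahl's law: if a fraction $p$ of the work can be parallelised over $N$ workers, the overall speedup is $S = 1 / \big((1 - p) + p/N\big)$. A minimal sketch of the formula follows; the parallel fractions and worker count are illustrative values, not measurements from Allen.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when `parallel_fraction` of the work is spread over `n_workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Illustrative values only: 1000 workers and three different parallel fractions.
for p in (0.5, 0.9, 0.99):
    print(f"parallel fraction {p:.2f}: speedup = {amdahl_speedup(p, 1000):.1f}")
```

With 1000 workers the speedup stays close to $1/(1-p)$, so the serial part of the sequence dominates unless nearly all of it is parallelised.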

#### External constraints

- $\sim 250$ Event Builder servers $\to$ $\sim 500$ GPUs
- 16–32 GB/s PCIe rate $\to$ sufficient for 5 TB/s input
- Small raw event size $\sim 100$ kB $\to$ process several 1000 events at once per GPU
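The figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch assuming decimal units, an even split of the input over all GPUs, the lower PCIe figure of 16 GB/s, and a slice of 1000 events; the variable names are illustrative.

```python
N_GPUS = 500                 # ~250 Event Builder servers -> ~500 GPUs
TOTAL_INPUT_TB_PER_S = 5.0   # total input rate quoted above
PCIE_GB_PER_S = 16.0         # assumption: use the lower quoted PCIe figure
RAW_EVENT_KB = 100.0         # approximate raw event size
EVENTS_PER_SLICE = 1000      # assumption: one slice of 1000 events

# Input bandwidth each GPU must absorb if the load is shared evenly.
per_gpu_gb_per_s = TOTAL_INPUT_TB_PER_S * 1000.0 / N_GPUS

# Ratio of available PCIe bandwidth to required bandwidth.
headroom = PCIE_GB_PER_S / per_gpu_gb_per_s

# Data volume of one slice of events held on the device.
slice_size_mb = EVENTS_PER_SLICE * RAW_EVENT_KB / 1000.0

print(f"required per GPU: {per_gpu_gb_per_s:.0f} GB/s, PCIe headroom: {headroom:.1f}x")
print(f"one slice of {EVENTS_PER_SLICE} events: {slice_size_mb:.0f} MB")
```

Under these assumptions each GPU needs about 10 GB/s, which fits within the PCIe rate, and a slice of 1000 events occupies about 100 MB.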

- Generic configurable framework for GPU-based execution of an algorithm sequence
- Stand-alone software package: https://gitlab.cern.ch/lhcb/Allen
- Dependencies: C++17 compiler, CUDA v10.2, boost, ZeroMQ
- Built-in validation and monitoring (requires ROOT)
- Process thousands of events in a single sequence
  - Opportunity for massive parallelisation
- Cross-platform compatibility with CPU architectures
- Named for Frances E. Allen
- Implement HLT1 on GPUs







## Algorithm sequence

- Multiple "streams" on each GPU process "slices" of  $\mathcal{O}(1000)$  events
- Single transfer of data to GPU device
  - Data passed to device
  - All algorithms executed in order
  - Results passed back to the host
- Configurable sequences at compile time
- Configurable algorithms at run time via JSON
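The pattern described above (a fixed sequence of algorithms applied to slices of events, with parameters read from JSON) can be sketched as follows. All names here (`SEQUENCE`, `scale`, `threshold`, the JSON keys) are hypothetical and do not reflect Allen's actual configuration schema; the fixed tuple merely stands in for a sequence chosen at compile time.

```python
import json


def make_slices(events, slice_size):
    """Split the event list into consecutive slices of at most `slice_size` events."""
    return [events[i:i + slice_size] for i in range(0, len(events), slice_size)]


def scale(slice_, params):
    """Toy algorithm: multiply every value by a configurable factor."""
    return [x * params["factor"] for x in slice_]


def threshold(slice_, params):
    """Toy algorithm: keep only values above a configurable minimum."""
    return [x for x in slice_ if x > params["min_value"]]


# Fixed order of algorithms, standing in for a sequence defined at compile time.
SEQUENCE = (("scale", scale), ("threshold", threshold))

# Algorithm parameters supplied at run time (hypothetical keys).
CONFIG_JSON = '{"scale": {"factor": 2}, "threshold": {"min_value": 10}}'


def run_sequence(slice_, config):
    """Run every algorithm in order on one slice and return the final result."""
    data = slice_
    for name, algorithm in SEQUENCE:
        data = algorithm(data, config[name])
    return data


config = json.loads(CONFIG_JSON)
events = list(range(10))
outputs = [run_sequence(s, config) for s in make_slices(events, slice_size=4)]
```

Changing `CONFIG_JSON` alters the behaviour of each algorithm without touching the sequence itself, which is the separation the two bullets on configurability describe.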

# Memory management

- No dynamic memory allocation
- Data dependencies and memory assignments resolved at compile time
- Host and device memory handled by custom memory manager
  - All memory allocated on startup
  - Assigned on demand
- Failsafe mechanism to sub-divide data slices with unusually large memory requirements and pass through problematic events
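A minimal sketch of this scheme is given below: one buffer reserved at startup, regions handed out on demand, and a failsafe that halves a slice when it does not fit and passes through any single event that still does not fit. The class and function names are hypothetical, and the bump allocator is a simplification rather than Allen's memory manager.

```python
class PreallocatedPool:
    """Bump allocator over a buffer that is reserved once at startup."""

    def __init__(self, capacity: int):
        self.buffer = bytearray(capacity)  # the only allocation
        self.offset = 0

    def reserve(self, size: int) -> memoryview:
        """Hand out the next `size` bytes, or fail if the pool is exhausted."""
        if self.offset + size > len(self.buffer):
            raise MemoryError("pool exhausted")
        view = memoryview(self.buffer)[self.offset:self.offset + size]
        self.offset += size
        return view

    def reset(self) -> None:
        """Make the whole buffer available again for the next slice."""
        self.offset = 0


def process_with_failsafe(pool, events):
    """Process a slice given as a list of (event_id, required_bytes).

    Returns (processed_ids, passed_through_ids). A slice that does not fit
    is split in half; a single event that does not fit is passed through.
    """
    pool.reset()
    try:
        for _, size in events:
            pool.reserve(size)
        return [event_id for event_id, _ in events], []
    except MemoryError:
        if len(events) == 1:
            return [], [events[0][0]]
        middle = len(events) // 2
        done_a, passed_a = process_with_failsafe(pool, events[:middle])
        done_b, passed_b = process_with_failsafe(pool, events[middle:])
        return done_a + done_b, passed_a + passed_b


pool = PreallocatedPool(capacity=100)
processed, passed_through = process_with_failsafe(
    pool, [(0, 40), (1, 40), (2, 150), (3, 30)]
)
```

In this toy run event 2 needs more memory than the pool holds, so it is passed through while the remaining events are processed in smaller sub-slices.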




## Integration

- I/O performed asynchronously by separate CPU thread
  - Input data banks may be read from binary files or decoded from MDF or MEP formats
  - Only selected events sent to output
  - Selection decisions and reconstructed objects added to output data
- Monitoring also performed in dedicated thread
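The separation of I/O from processing can be sketched with a producer thread and a bounded queue. This is an illustration of the general pattern only: the function names, the sentinel convention, and the toy selection are assumptions, not Allen's interfaces.

```python
import queue
import threading


def input_thread(source_slices, work_queue):
    """Producer: feed slices into the queue, then signal the end with None."""
    for slice_ in source_slices:
        work_queue.put(slice_)  # blocks while the queue is full
    work_queue.put(None)


def select(slice_):
    """Placeholder selection: keep even-numbered events only."""
    return [event for event in slice_ if event % 2 == 0]


def run(source_slices):
    """Consume slices while the input thread keeps reading ahead."""
    work_queue = queue.Queue(maxsize=2)  # bounded buffer between I/O and processing
    reader = threading.Thread(target=input_thread, args=(source_slices, work_queue))
    reader.start()
    output = []
    while True:
        slice_ = work_queue.get()
        if slice_ is None:
            break
        output.extend(select(slice_))  # only selected events reach the output
    reader.join()
    return output


selected = run([[1, 2, 3], [4, 5, 6], [7, 8]])
```

The bounded queue lets the reader stay a little ahead of the processing loop without letting unprocessed input grow without limit.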

- HLT1 involves decoding, clustering and track reconstruction for all tracking subdetectors
- Algorithms also perform Kalman filter and trigger selection
- All stages of the process may be parallelised














Run each event in one block

- Decoding → parallelise by readout unit
- Clustering → parallelise in (overlapping) detector regions
- Tracking → parallelise by track
- Vertexing → parallelise by combination

# Example: VELO clustering

#### 26 planes of silicon pixel detectors



#### Clustering with bit masks



# Example: VELO tracking








# Example: VELO vertexing

Record z of closest approach to beamline for each track

Peaks in distribution identify PVs
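The idea can be sketched as a one-dimensional histogram followed by a simple peak search. The bin width, the z range, the minimum track count, and the local-maximum criterion below are illustrative assumptions (with z taken to be in mm), not the parameters or the peak finder used in Allen.

```python
import math


def find_pv_candidates(track_z, z_min, z_max, bin_width, min_tracks):
    """Return the bin centres of local maxima in the histogram of track z values."""
    n_bins = int(round((z_max - z_min) / bin_width))
    histogram = [0] * n_bins
    for z in track_z:
        bin_index = math.floor((z - z_min) / bin_width)
        if 0 <= bin_index < n_bins:  # ignore tracks outside the range
            histogram[bin_index] += 1

    peaks = []
    for bin_index, count in enumerate(histogram):
        left = histogram[bin_index - 1] if bin_index > 0 else 0
        right = histogram[bin_index + 1] if bin_index + 1 < n_bins else 0
        # A peak has enough tracks, exceeds its left neighbour and is not
        # exceeded by its right neighbour (a plateau yields its first bin).
        if count >= min_tracks and count > left and count >= right:
            peaks.append(z_min + (bin_index + 0.5) * bin_width)
    return peaks


# Toy input: two clusters of tracks and one isolated track.
track_z = [-20.2, -20.1, -19.9, -19.8, -19.7, 35.1, 35.3, 35.4, 35.6, 80.0]
pv_candidates = find_pv_candidates(
    track_z, z_min=-100.0, z_max=100.0, bin_width=1.0, min_tracks=3
)
```

Filling the histogram is independent for every track, so that step parallelises naturally over tracks.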









[Figure: efficiency for electrons and non-electrons, overlaid on the corresponding momentum distributions (axis in MeV).]

#### **Selections**

. . .

- One track
- Two tracks
- Single muon
- Two muons (displaced)
- Two muons (high-mass)

Secondary vertices











| Trigger          | Rate [kHz]   |
|------------------|--------------|
| 1-Track          | $215\pm18$   |
| 2-Track          | $659\pm31$   |
| High-$p_T$ muon  | $5\pm3$      |
| Displaced dimuon | $74\pm10$    |
| High-mass dimuon | $134\pm14$   |
| Total            | $999 \pm 38$ |

- Total rate reduced from 30 MHz to 1 MHz
- Physics performance consistent with x86 baseline

| Signal                       | GEC      | TIS -OR- TOS | TOS      | GEC $\times$ TOS |
|------------------------------|----------|--------------|----------|------------------|
| $B^0 \to K^{*0} \mu^+ \mu^-$ | $89\pm2$ | $91\pm2$     | $89\pm2$ | $79\pm3$         |
| $B^0 \to K^{*0} e^+ e^-$     | $84\pm3$ | $69\pm4$     | $62\pm4$ | $52\pm4$         |
| $B_s^0 \to \phi \phi$        | $83\pm3$ | $76\pm3$     | $69\pm3$ | $57\pm3$         |
| $D_s^+ \to K^+ K^- \pi^+$    | $82\pm4$ | $59\pm5$     | $43\pm5$ | $35\pm4$         |
| $Z \to \mu^+ \mu^-$          | $78\pm1$ | $99\pm0$     | $99\pm0$ | $77\pm1$         |

GEC = global event cut, TIS = trigger independent of signal, TOS = trigger on signal
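The last column of the table can be cross-checked against the GEC and TOS columns. The sketch below assumes the entries are percentages and that the combined efficiency is the simple product of the two, rounded to the nearest percent; the central values are copied from the table and the uncertainties are ignored.

```python
# (GEC, TOS, quoted GEC x TOS), all assumed to be in percent.
efficiencies = {
    "B0 -> K*0 mu+ mu-": (89, 89, 79),
    "B0 -> K*0 e+ e-": (84, 62, 52),
    "Bs0 -> phi phi": (83, 69, 57),
    "Ds+ -> K+ K- pi+": (82, 43, 35),
    "Z -> mu+ mu-": (78, 99, 77),
}

# Product of the two efficiencies, expressed in percent and rounded.
products = {
    name: round(gec * tos / 100)
    for name, (gec, tos, _) in efficiencies.items()
}

# True if every rounded product matches the quoted combined efficiency.
consistent = all(
    products[name] == quoted for name, (_, _, quoted) in efficiencies.items()
)
print(products, consistent)
```

Under this assumption every rounded product reproduces the quoted central value in the final column.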



- Full HLT1 algorithm can be run on $\sim 500$ current GPUs
- Buy GPUs instead of networking



- Performance scales with GPU capability, so more can be expected from 2021 GPUs
  - Not yet limited by Amdahl's law
  - Potential to perform more tasks within HLT1

## Summary

- Allen project offers a GPU implementation of LHCb HLT1
- Full track reconstruction and selection performed
- Generic framework allows for configurable algorithm sequence
- Feasibility for possible use in Run III already demonstrated