# Investigation of Many-Core Scalability of the Track Reconstruction in the CBM Experiment

S. A. Baginyan<sup>1</sup>, V. V. Ivanov<sup>1</sup>, <u>P. I. Kisel<sup>1</sup></u>, I. S. Kulakov<sup>2,3</sup>

1. Joint Institute for Nuclear Research, Russia
2. Goethe University Frankfurt am Main, Germany
3. National Taras Shevchenko University of Kyiv, Ukraine





## **CBM experiment**

- Fixed-target heavy-ion experiment
- 1000 charged particles/collision
- Non-homogeneous magnetic field
- 85% fake combinatorial space points in STS
- $10^7$ events/s
- Track reconstruction and displaced vertex search required in the first trigger level



#### Simulated central Au-Au collision at 25 AGeV.

# **Cellular Automaton (CA) Track Finder**





## **Cellular Automaton:**

1. Build short track segments.
2. Connect segments according to the track model.
3. Tree structures appear; collect segments into track candidates.
4. Select the best track candidates.
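The four steps can be illustrated with a toy sketch. It is not the CBM code: it assumes straight tracks in one coordinate, stations at unit spacing, no magnetic field and no missing hits, and the name `find_tracks` and the cut parameters are hypothetical.

```python
from collections import defaultdict

def find_tracks(hits, max_slope=0.35, max_slope_diff=0.05, min_hits=3):
    """Toy CA track finder. hits: list of (station, x); returns lists of hit indices."""
    by_station = defaultdict(list)
    for idx, (station, _x) in enumerate(hits):
        by_station[station].append(idx)

    # Step 1: build short segments (hit pairs) between adjacent stations.
    segments = []  # (first_hit, second_hit, slope), in ascending station order
    for station in sorted(by_station):
        for a in by_station[station]:
            for b in by_station.get(station + 1, []):
                slope = hits[b][1] - hits[a][1]
                if abs(slope) <= max_slope:
                    segments.append((a, b, slope))

    # Step 2: connect neighbouring segments that share a hit and have similar
    # slopes (the "track model" here is just a straight line).
    starts_at = defaultdict(list)
    for s, (a, _b, _slope) in enumerate(segments):
        starts_at[a].append(s)
    level = [1] * len(segments)        # length of the longest chain starting here
    next_seg = [None] * len(segments)  # best downstream neighbour
    for s in range(len(segments) - 1, -1, -1):  # downstream segments first
        _a, b, slope = segments[s]
        for n in starts_at[b]:
            compatible = abs(segments[n][2] - slope) <= max_slope_diff
            if compatible and level[n] + 1 > level[s]:
                level[s] = level[n] + 1
                next_seg[s] = n

    # Step 3: follow the links to collect segments into track candidates.
    candidates = []
    for s in range(len(segments)):
        chain = [segments[s][0]]
        cur = s
        while cur is not None:
            chain.append(segments[cur][1])
            cur = next_seg[cur]
        if len(chain) >= min_hits:
            candidates.append(chain)

    # Step 4: select the best candidates: longest first, no shared hits.
    candidates.sort(key=len, reverse=True)
    used, tracks = set(), []
    for chain in candidates:
        if used.isdisjoint(chain):
            tracks.append(chain)
            used.update(chain)
    return tracks

# Two straight tracks over four stations plus one noise hit at (2, 5.0).
example_hits = [(0, 0.0), (0, 1.0), (1, 0.1), (1, 0.8),
                (2, 0.2), (2, 0.6), (2, 5.0), (3, 0.3), (3, 0.4)]
tracks = find_tracks(example_hits)
```

Each segment only looks at hits on its own and the next station, which is what makes the method local with respect to data and easy to parallelize.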

#### CA advantages:

- Local w.r.t. data
- Intrinsically parallel
- Perfect for many-core CPU/GPU
- Extremely simple





- Takes into account the detector inefficiency
- Highly optimized code
  - Single precision calculations
  - Magnetic field approximation
  - Reconstruction in several iterations
- Highly parallelized code
  - Data level (SIMD instructions, 4 single-precision floating point calculations in parallel)
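The data-level idea, four single-precision values processed by one instruction, can be mimicked as follows. Python's standard library has no SIMD, so this only emulates the programming model. `Float4`, `to_f32` and `extrapolate` are hypothetical names, and the straight-line extrapolation is an assumed stand-in for the real track model.

```python
import struct

LANES = 4  # four single-precision values per vector

def to_f32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

class Float4:
    """Four single-precision lanes that are always operated on together."""

    def __init__(self, values):
        values = list(values)
        if len(values) != LANES:
            raise ValueError("Float4 needs exactly 4 values")
        self.lanes = tuple(to_f32(v) for v in values)

    def __add__(self, other):
        return Float4(a + b for a, b in zip(self.lanes, other.lanes))

    def __mul__(self, other):
        return Float4(a * b for a, b in zip(self.lanes, other.lanes))

def extrapolate(x, tx, dz):
    """Straight-line extrapolation x + tx * dz, written once for 4 tracks."""
    return x + tx * dz

# Pack the same parameter of four different tracks into one vector.
x = Float4([0.0, 1.0, 2.0, 3.0])
tx = Float4([0.5, 0.25, -0.5, 0.0])
dz = Float4([2.0, 2.0, 2.0, 2.0])
x_next = extrapolate(x, tx, dz)
```

The point of the layout is that `extrapolate` contains no loop over tracks: the same arithmetic expression serves one track or four, depending only on the operand type.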



\*Time measured using 1 core of cuda.jinr.ru

[Figure labels: "Efficiency"; "Number of logical cores"]

## **Many-Core Scalability**

### **Minimum bias events**





- 2 CPUs Intel E5640
- 4 cores per CPU
- Hyper-Threading
- 2.7 GHz
- 12 MB L3 cache
- 48 GB RAM

|            |            |            |            | NUMANode P#1 (24GB) |            |            |            |
|------------|------------|------------|------------|---------------------|------------|------------|------------|
| Socket P#0 |            |            |            | Socket P#1          |            |            |            |
| L3 (12MB)  |            |            |            | L3 (12MB)           |            |            |            |
| L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB)          | L2 (256KB) | L2 (256KB) | L2 (256KB) |
| L1 (32KB)  | L1 (32KB)  | L1 (32KB)  | L1 (32KB)  | L1 (32KB)           | L1 (32KB)  | L1 (32KB)  | L1 (32KB)  |
| Core P#0   | Core P#1   | Core P#9   | Core P#10  | Core P#0            | Core P#1   | Core P#9   | Core P…    |
| PU P#0     | PU P#1     | PU P#2     | PU P#3     | PU P#4              | PU P#5     | PU P#6     | PU P…      |
| PU P#8     | PU P#9     | PU P#10    | PU P#11    | PU P#12             | PU P#13    | PU P#14    | PU P…      |
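The numbering in the topology table above suggests a simple rule: logical units (PUs) `p` and `p + 8` share a physical core, and PUs 0-3 and 8-11 sit on socket 0. The sketch below encodes that rule. It assumes the truncated last column follows the same pattern, and the helper names are hypothetical.

```python
N_SOCKETS = 2
CORES_PER_SOCKET = 4
THREADS_PER_CORE = 2  # Hyper-Threading

N_PHYSICAL = N_SOCKETS * CORES_PER_SOCKET   # 8 physical cores
N_LOGICAL = N_PHYSICAL * THREADS_PER_CORE   # 16 logical cores

def socket_of(pu):
    """Socket index of a logical unit (assumed numbering pattern)."""
    return (pu % N_PHYSICAL) // CORES_PER_SOCKET

def sibling_of(pu):
    """The other hyper-thread on the same physical core (assumed pattern)."""
    return (pu + N_PHYSICAL) % N_LOGICAL

def pus_for_threads(n_threads):
    """PUs for n threads: one per physical core first, hyper-thread siblings after."""
    if not 0 <= n_threads <= N_LOGICAL:
        raise ValueError("thread count out of range")
    return list(range(n_threads))
```

Under this numbering, filling PUs in ascending order uses all 8 physical cores before any core runs two threads, which matters when reading a scalability curve plotted against the number of logical cores.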



#### Strong linear many-core scalability
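One way to quantify linear scalability is to fit throughput against the number of cores and to compute the parallel efficiency at each point. The sketch below uses synthetic, perfectly linear numbers as placeholders. They are not measurements from this work, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def parallel_efficiency(cores, throughput):
    """Throughput per core relative to the first measurement (1.0 = ideal scaling)."""
    per_core_ref = throughput[0] / cores[0]
    return [t / (n * per_core_ref) for n, t in zip(cores, throughput)]

# Synthetic placeholder data: exactly 10 events/s per logical core.
cores = [1, 2, 4, 8, 16]
throughput = [10.0 * n for n in cores]

intercept, slope = linear_fit(cores, throughput)
efficiency = parallel_efficiency(cores, throughput)
```

With real measurements, a slope close to the single-core throughput and efficiencies close to 1.0 would indicate linear scaling; a drop beyond the number of physical cores would point to the limits of Hyper-Threading.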

#### **Central events**