Alice Weekly Meeting: Software for Hardware Accelerators
-
10:00 → 10:20
Discussion 20m Speaker: David Rohr (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
GPU Benchmarks in HS23: Contribution from ALICE
- Had a meeting 2 weeks ago, Gabriele will report on the status
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- We can now set the GPU architectures in the environment variable field of Jenkins builds.
- Managed to run the o2-gpu-standalone-benchmark from an async build on CVMFS in the default GRID job container on the NERSC Perlmutter site, running on their A100 GPUs.
- Async reco tests (using a 2024 Pb-Pb setup I copied from the GRID) were successful.
- Improved O2DPG scripts and added ALIEN_JDL_SITEARCH setting to select between MI50, MI100, NERSC (so we do not need custom run scripts per site).
GPU ROCm / compiler topics:
- Issues that disappeared but are not yet understood: random server reboots with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch at the end of this list).
- LLVM Bump to 20.1: status?
- ROCm 6.4.1 status:
- AMD is checking the reproducer. I have some ideas on how to narrow down where it miscompiles, using different compile flags in per-kernel mode.
- New problem with ROCm 6.5 / 7.0 after bumping clang: the new clang encounters an internal compiler error when processing our code...
- Improved the Standalone Benchmark CI; can now run the RTC test for CUDA even with no GPU installed.
- Updated alidist/gpu-system to be build_requires only, then reverted that since it broke the dailies. Now generating a dummy modulefile (even if not used), as requested by Giulio.
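As an illustration of the __device__ inline issue above, a toy sketch (our construction, not the O2 code): a device helper defined inline in a shared header is also seen by the host compilation pass, so a weak symbol for it can end up in every host object that includes the header; internal linkage is one candidate mitigation.
```cpp
// Hypothetical header-defined helper (illustrative only):
__device__ inline float scaleCharge(float q) { return 0.5f * q; }
// -> a weak symbol for scaleCharge may appear in the host objects of every
//    translation unit including this header, leaking into the host code.

// One candidate fix: internal linkage, so no symbol escapes the TU.
namespace {
__device__ inline float scaleChargeLocal(float q) { return 0.5f * q; }
} // namespace
```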
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch at the end of this list): https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object: a draft version by Sergey exists but is still WIP.
- Matthias Kleiner might look into commissioning this.
- Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows.
- Need to check the problem with ONNX external memory allocator. Status?
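For reference, a minimal sketch of the alignas() idea from the first item of this list (the struct layout is a made-up example, not the actual O2 MC label type from O2-5314):
```cpp
#include <cstdint>

// Hypothetical label layout: forcing 8-byte alignment on the element type
// guarantees that arrays of labels never require misaligned 64-bit accesses
// on the GPU (natural alignment here would only be 4 bytes).
struct alignas(8) MCLabel {
  int32_t trackID;
  int16_t eventID;
  int16_t sourceID;
};

static_assert(alignof(MCLabel) == 8, "GPU code relies on 8-byte alignment");
```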
EPN GPU Topics:
-
10:20 → 10:25
TPC ML Clustering 5m Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Framework
- Optimized kernel speeds. Best solution (for now): pad-aligned memory accesses per thread, as sketched below.
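A hedged sketch of what pad-aligned accesses per thread can look like (our construction under an assumed time-major charge layout, not the actual clusterizer kernel): consecutive threads of a warp handle consecutive pads, so every load is coalesced.
```cpp
// Illustrative CUDA kernel: one thread per pad, charges stored time-major
// as charge[t * nPads + pad], so neighbouring threads read neighbouring
// addresses in each loop iteration (coalesced, pad-aligned accesses).
__global__ void copyChargesPadAligned(const float* __restrict__ charge,
                                      float* __restrict__ out,
                                      int nPads, int nTimeBins)
{
  int pad = blockIdx.x * blockDim.x + threadIdx.x;
  if (pad >= nPads) return;
  for (int t = 0; t < nTimeBins; ++t) {
    out[t * nPads + pad] = charge[t * nPads + pad]; // placeholder work
  }
}
```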
Computing
Evaluated different network sizes: 5 x (2 TF, FST, only classification network (+ CF regression), float16) per NN
- debugLevel=0, batchsize=8192
- debugLevel=1, batchsize=8192
- Did an OPTUNA optimization for batch size: for all networks, higher batch size seems better, although there is some irregularity (sometimes a non-maximal batch size was best -> irregular optimization surface)
- Probably will stick with 32 neurons per layer
- Ideally would like to find a pattern, something like ((neurons per layer) × (input size) × (batch size)) mod (32·n) ≈ 0 (-> an analytic formula for when an execution becomes efficient); see the toy check after this list
- RTC compilation working now (update from this morning), thanks to a collaboration with Gabriele
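As a placeholder for the analytic formula sought two items above, a toy check of the divisibility pattern (the 32·n granularity is the conjecture from that bullet, nothing established):
```cpp
#include <cstdio>

// Conjectured efficiency criterion: the product of neurons per layer,
// input size and batch size should be a multiple of 32*n.
bool isEfficient(long neurons, long inputSize, long batchSize, long n = 1)
{
  return (neurons * inputSize * batchSize) % (32 * n) == 0;
}

int main()
{
  std::printf("%d\n", isEfficient(32, 1000, 8192) ? 1 : 0); // prints 1
}
```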
Question
- Would like to run optimisation for one chosen NN with launch bounds
- Which launch bounds to take? What exactly do they mean (see the sketch after this list)? Maybe these:
#define GPUCA_LB_GPUTPCNNClusterizerKernels_runCfClusterizer 512
#define GPUCA_LB_GPUTPCNNClusterizerKernels_fillInputNNGPU 512
#define GPUCA_LB_GPUTPCNNClusterizerKernels_determineClass1Labels 512
#define GPUCA_LB_GPUTPCNNClusterizerKernels_publishClass1Regression 512
#define GPUCA_LB_GPUTPCNNClusterizerKernels_publishDeconvolutionFlags 512
- What should be the optimisation objective?
- debugLevel=1: mean(NN kernel times) + fill time
- debugLevel=0: Cluster finder time -> Does not take into account the benefits in track merger due to cluster rejection
- debugLevel=0: Wall time -> This could depend on the number of accepted clusters for tracking, and the threshold might change from simulation to simulation and network to network
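For what the launch bounds control, a minimal CUDA sketch (kernel body and wiring are our illustration, not the O2 code): the GPUCA_LB_* value is the maximum block size the kernel is compiled for, and __launch_bounds__ lets the compiler cap register usage so a block of that size is guaranteed to launch.
```cpp
#define GPUCA_LB_GPUTPCNNClusterizerKernels_fillInputNNGPU 512

// __launch_bounds__(N) promises the compiler the kernel never runs with
// more than N threads per block, so it may limit register use to keep
// occupancy achievable at that block size.
__global__ void __launch_bounds__(GPUCA_LB_GPUTPCNNClusterizerKernels_fillInputNNGPU)
fillInputNNGPU(float* out, const float* in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i]; // placeholder body
  }
}
```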
-
10:25 → 10:30
GPU Parameter Optimizations 5m Speaker: Gabriele Cimador (Universita e INFN Torino (TO))
HS23 Contribution from ALICE
- Tested with Docker on EPN MI50 node
- Started from GPU CI container
- Load dependencies via CVMFS
- Clone O2 into the container
- Patch /home/gcimador/alice/O2/GPU/GPUTracking/Standalone/cmake/config.cmake:
- turn off the ENABLE_[CUDA/HIP/OCL] CMake variables
- comment out event visualization
- For the test:
set(ENABLE_HIP ON)
set(HIP_AMDGPUTARGET "gfx906")
ALIBUILD_O2_FORCE_GPU=0
cmake -DCMAKE_INSTALL_PREFIX=../ -DGLFW_DIR=/cvmfs/alice.cern.ch/el9-x86_64/Packages/GLFW/3.3.2-19/lib/cmake/glfw3/ ../../O2/GPU/GPUTracking/Standalone/
make -j32 install
Questions
ERROR reading events/o2-pbpb-5kHz-32/tpctransform.dump, invalid size: 4552 (4568 expected)
terminate called after throwing an instance of 'std::runtime_error'
what(): invalid size
- Is there a quick way to adjust my datasets, or do I have to simulate new ones again?
- We need to find a way to define the GPU architecture to build for. My idea: provide it as arguments of the entrypoint script of the container. It will receive two arguments: CUDA/HIP to enable the ENABLE_CUDA/ENABLE_HIP variables, and the architecture, e.g. gfx906, for setting the HIP_AMDGPUTARGET variable. Might this be ok?
- Do I need to use aliBuild init, or can I just clone the O2 repository alone?
News from GPU parameter tuning
- Tuned datasets:
- PbPb: 42 kHz, 47 kHz, 50 kHz
- pp: 750kHz, 1MHz
- Is the "sweet spot" always the same, or it is a range of values?
- To answer this, tuned the same pp 750 kHz dataset three times
| Kernel Name | Best (block, grid) - search 1 | Best (block, grid) - search 2 | Best (block, grid) - search 3 |
|---|---|---|---|
| CompressionKernels_step1unattached | 384, 120 | 512, 120 | 384, 120 |
| GMMergerCollect | 512, 540 | 640, 600 | 448, 600 |
| GMMergerSectorRefit | 64, 660 | 64, 840 | 64, 900 |
| GMMergerTrackFit | 256, 840 | 256, 900 | 256, 660 |
| CompressionKernels_step0attached | 192, 60 | 192, 60 | 192, 60 |
| GMMergerFollowLoopers | 128, 840 | 64, 660 | 64, 780 |
| TrackletConstructor | 512, 900 | 512, 780 | 512, 780 |
| TrackletSelector | 128, 720 | 1024, 900 | 704, 900 |
| CFClusterizer | 448, 900 | 512, 780 | 448, 360 |

- Seems that the sweet spot lies in a range of values
Todos:
- Investigate why grid_size=600 is such an important number (10 × the 60 CUs of the MI50)
- Correct the number of trials (minimum reached with fewer evaluations)
- Implement dynamic parameters in O2?
-
10:30 → 10:35
Efficient Data Structures 5m Speaker: Dr Oliver Gregor Rietmann (CERN)
Summer Student Milla Bramsted
- She is working on benchmarking SoA code on GPUs
- We will track her project in this google doc
- She is adding CUDA kernels to this repo
- A few kernels were integrated into the benchmarking framework
- Next steps:
- Add more kernels (bitonic sort); see the sketch after this list
- Add the other NGT SoA approaches again (were temporarily removed)
- Re-organize the code
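Since bitonic sort is the next kernel on the list, a minimal single-block CUDA version of the kind that could be added (our sketch with an assumed signature, not Milla's code); n must be a power of two and at most blockDim.x:
```cpp
// Sorts n floats ascending in shared memory, single block.
// Launch as: bitonicSort<<<1, n, n * sizeof(float)>>>(d_data, n);
__global__ void bitonicSort(float* data, int n)
{
  extern __shared__ float s[];
  int tid = threadIdx.x;
  if (tid < n) s[tid] = data[tid];
  __syncthreads();

  for (int k = 2; k <= n; k <<= 1) {        // bitonic sequence length
    for (int j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance
      int partner = tid ^ j;
      if (tid < n && partner > tid) {       // lower index does the swap
        bool ascending = (tid & k) == 0;
        if ((s[tid] > s[partner]) == ascending) {
          float tmp = s[tid];
          s[tid] = s[partner];
          s[partner] = tmp;
        }
      }
      __syncthreads();
    }
  }
  if (tid < n) data[tid] = s[tid];
}
```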
ALICE O2 CI-Pipelines on NGT Cluster
- A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
- It has a GitHub action running the standalone benchmark on NGT GPUs (H100)
- Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
- The O2 standalone benchmark now works on the following ngt-resources:
- Nvidia: H100 188GB NVL, L40S 48GB
- AMD: Instinct MI300X
- Not working yet on the AMD Radeon Pro W7900 due to a different warp size / wavefront size (RDNA GPUs default to 32-wide wavefronts rather than the 64 of the Instinct line).
- Not working yet on the latest commits because TPC changed the data format of the event dumps
- Next steps:
- Fix the problems above
- Get the full CI to work
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo
- Simplified and optimized the SoA code in the last few weeks
- Everything is running and we have identified the classes to apply our SoA code to
- Next steps:
- Implement the SoA code on those classes
- Remark: Jolly wants to implement her reflection approach in the O2 standalone benchmark (for a conference)
- 10:35 → 10:40
-
10:40 → 10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles 5m Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
OpenCL
- Finished the standalone Zero Suppression decoder: can't reproduce the issues from O2...
Clang printf
Fixed OpenCL printf by not mangling the symbol:
```diff
diff --git a/clang/lib/Headers/opencl-c-base.h b/clang/lib/Headers/opencl-c-base.h
index b6bcf32c09c0..14fa7654849a 100644
--- a/clang/lib/Headers/opencl-c-base.h
+++ b/clang/lib/Headers/opencl-c-base.h
@@ -688,7 +688,15 @@ template <typename _Tp> struct __remove_address_space<__constant _Tp> {
 #if defined(__OPENCL_CPP_VERSION__) || (__OPENCL_C_VERSION__ >= CL_VERSION_1_2)
 // OpenCL v1.2 s6.12.13, v2.0 s6.13.13 - printf
-int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));
+#ifdef __OPENCL_CPP_VERSION__
+#define CLINKAGE extern "C"
+#else
+#define CLINKAGE
+#endif
+
+CLINKAGE int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));
+
+#undef CLINKAGE
 #endif
 #ifdef cl_intel_device_side_avc_motion_estimation
```
-> Allows printf usage in OpenCL again
TODO: open upstream PR
With printf, found a third issue in ZS decoder:
Sometimes statements like
if (iThread == 0) printf("...");
are printed by all threads. Can be fixed with a barrier after the print.
-> Can be reproduced in both the O2 and standalone decoders. Unsure how this is connected to the other issues.
-
10:45 → 10:50
ITS Tracking 5m Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
# ITS tracking status
1. Quick overview of current CPU refactoring
2. New features currently in development
3. Status of the GPU part
---
## 1.1 Current CPU refactoring
---
## 1.2 Memory consumption
Here for one particularly parasitic TF in Pb-Pb with two central collisions in one readout frame.
Plot shows before (blue) / after (red) refactoring.
Plot shows a special mode `perPrimaryVertexProcessing`, which reduces the combinatorics significantly but has some side effects currently under study. Blue is with this mode enabled, red as it was before.
## 2. New/Improved features: `DeltaROF` tracking
In contrived MC with an exaggerated slope in the chip response function, one can clearly see the drop of efficiency towards the beginning/end of the readout frame.
In analysis these collisions are completely cut away (standard selections), amounting to ~20% data loss.
This feature should cure this.
Comes with an increased cost in memory and time (but worth it)
## GPU Part
After the refactoring of the reco code, checked again whether the GPU part is now messed up. After some minor fixes, in debug mode the output is 1:1 the one from the CPU code (10 TFs in 40 kHz Pb-Pb):
```cpp
#define GPU_BLOCKS GPUCA_DETERMINISTIC_CODE(1, 99999)
#define GPU_THREADS GPUCA_DETERMINISTIC_CODE(1, 99999)
// calling kernels like this -> effectively serializes the code (very slow)
gpu::computeLayerTrackletsMultiROFKernel<true><<<o2::gpu::CAMath::Min(nBlocks, GPU_BLOCKS),
o2::gpu::CAMath::Min(nThreads, GPU_THREADS),
0,
streams[iLayer].get()>>>(...)
// now this is not needed anymore and the det. mode is `fast-ish`
```
TODOs (help welcome)
- validate that the new features also work on GPU
- port the Vertexing code to GPU
- optimize GPU part:
- started using multiple streams, very much in its infancy
- optimise load times/processing etc. (probably [not measured] parts could be improved)
- optimise the number of blocks/threads per kernel (for Pb-Pb and pp separately; fairly small number of kernels to be optimised)
- Have a CI procedure/periodic check that runs the ITS GPU reconstruction and produces some automatic checks and numbers (deterministic mode is more important)
- Ensure that gpu-reco-workflow with ITS enabled gives the same results as its-reco-workflow --gpu
-
10:50 → 10:55
System Run Coordination Topics 5m Speaker: Ernst Hellbar (CERN)