10:00 → 10:20

Discussion 20m

Speaker: David Rohr (CERN)

Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

Sync reconstruction

Async reconstruction

Need to investigate short GPU stall problem.
Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
New GPU architecture selection for async in O2DPG looks good, should be merged.
Test with GPU GRID jobs at NERSC pending.
Asked DPG to run first test with ITS tracking on GPU on EPNs.

GPU ROCm / compiler topics:

Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
Problem with building ONNXRuntime with MigraphX support.
Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
LLVM Bump to 20.1: status?
Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- Compilation errors fixed in ROCm trunk, to go into ROCm 7.
- Validated standalone benchmark in deterministic mode, and ran FST.
- Performance regression of ~1.5% without RTC, no regression with RTC. Reported to AMD. Not clear if this will be fixed.
- Serialization bug on MI100 still not fixed, still requires the workaround.

TPC / GPU Processing

WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
Final solution: merging transformation maps on the fly into a single flat object:
- Compilation issues on GPU fixed.
- Matthias did some checks, found two minor bugs to be fixed by Sergey.
- Speed seems ok, 0.25s for merging 2 maps.
Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
Need to check the problem with ONNX external memory allocator.
Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Now has general improvements in all aspects: higher efficiency, lower clone rate, more clusters rejected online, better cluster to track association, faster on AMD GPUs (though slightly slower on NVIDIA and CPU).
- Trying to understand one problem found by Ruben where it gets stuck processing a TF, then we might merge it to benefit from the improevements.
- Further developments in different branch on top.
Bug in multi-threaded pipeline when timeframes do not arrive in order. Happened again, Ernst and Giulio obtained a stack trace and attached a gdb. Not fully clear why, but the completion policy is stuck. Tentative fix for a possible deadlook between the completion policy and DPL is here https://github.com/AliceO2Group/AliceO2/pull/14640. Need to test online if the problem goes away.
gpu-reconstruction quitting with error in some async jobs due to running out of buffers, need to check but probably bad TPC data.

EPN GPU Topics:

10:20 → 10:25

TPC ML Clustering 5m

Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

Online runs - summary

Tested multiple configurations
Almost all failed
- Identified reasons (out-of-bound writes due to more thread-spawning than requested), fixed and PR merged yesterday: https://github.com/AliceO2Group/AliceO2/pull/14657

Physics: ITS-TPC matching

First investigation on its-tpc debug workflow was a bit inconclusive
Second investigation on debug trees leads to more results:
A = Simulation with CF regression, B = simulation with regression network (should show matching losses)
Shown: ITS-TPC matched tracks found in one simulation but not in the other. The plotted parameters will then be from the first simulation.
(selected by the same isPhysicalPrimary as the histograms that show the losses)

Reminder: Major losses observed in 0.2 < | q/pT | < 2, ...

... as found in the debug trees:

Z:

sin(phi):

tan(lambda):

Reason why I didn't see them in the eta distribution: Window of the plot was too small.

nClustersTPC:

So many of the lost tracks are at NCL -> 60. How about the ones with NCL > 100?

Let's have a look at the inverse situation: ITS-TPC matches found in B but not in A:

-> NCL > 100 similar here (so maybe its just an MC label mismatch: {track ID, event ID, source ID})

Finally, the loss is found:

So the matching efficiency loss is now understood and comes from the high inclination region. Q: Anything to do about it?

There is also no (significant) loss in TPC efficiency at high eta, independent of the threshold:

-> Conclusion: This must be an effect of changed track parameters.

10:25 → 10:30

GPU Parameter Optimizations 5m

Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

News from GPU parameter tuning

Name: Alice Weekly Meeting: Software for Hardware Accelerators
Start: 2025-09-10T10:00:00+02:00
End: 2025-09-10T11:00:00+02:00
Location: No location set

Changes after the mismatch fix

Tuned PbPb 50 KHz after fixing the HIP/CUDA mismatch
Lowered grid sizes to maximum of 600 (kept max 900 only for some small kernels)
Lowered number of trials
New constraint in the optimization function:
if min_block_per_cu * max_threads / 256 > 10: return float("inf")
Penalize configurations that would make the second parameter > 10
In this way, avoid kernels with no register spillage (since the compiler ignores min_active_warps > 10) but with high grid size
Forces the algorithm to focus on small but good configurations

Best parameters perf [ms]	Before mismatch fix	After mismatch fix + constraints
Clusterizer step	732.18	757.02
CompressionKernels_step1unattached	523.69	524.31
GMMergerCollect	93.86	94.40
GMMergerSectorRefit	284.58	267.72
GMMergerTrackFit	761.78	704.26
FollowLoopers + CompressionKernels_step0attached	508.43	480.44
TrackletConstructor step	1091.24	> 4000

For the tracklet constructor step, since it is highly dimensional, it was more probable to fall into the condition min_block_per_cu * max_threads / 256 > 10, leading to many inf values and few acceptable values. Removing the constraint lead to an optimal solution of 1128.05.

Version	Mean sync time [ms]
Default - before mismatch fix	4679.83 ms ± 3.64 ms
Default - after mismatch fix	4682.61 ms ± 5.77 ms
Tuned - before mismatch fix	4317.97 ms ± 4.17 ms (7,73%)
Tuned - after mismatch fix + extra constraints	4315.11 ms ± 3.06 ms (7,79%)

Reduced grid sizes in best parameters. Some examples:

Kernel	Best conf now	Best conf before
GMMergerCollect	1024x120	768x600
GMMergerSectorRefit	64x480	448x840
GMMergerTrackFit	64x360	64x900
CompressionKernels_step0attached	64x180	192x60
GMMergerFollowLoopers	128x240	64x900

NVIDIA GPUs tuning

Got access to the NGT cluster. Started to adapt the tuner for Nvidia GPUs.

ToDos:

Investigate why big grid sizes with low occupancy are preferred rather than the small grid sizes with same occupancy
Decide if how to integrate new default parameters in O2

ALICE contribution to HS23

Created repository for the Dockerfile: https://github.com/cima22/O2GPU_HS23
Waiting reply by Robin for testing

10:30 → 10:35

Efficient Data Structures 5m

Speaker: Dr Oliver Gregor Rietmann (CERN)

10:35 → 10:40

Following up GPU to-dos 5m

Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

Working with the TODOs:-

A Compiler directive (GCC diagnostic need to be removed in GPU/GPUTracking/Base/hip/GPUReconstructionHIPIncludesSystem.h class,

but checked that the entire class not needed. If I remove the class then also, no make error.

In this class there are few includes only from HIP and Thrust. So we may remove it fully.

10:40 → 10:45

TPC Clusterization / OpenCL / Highly Ionizing Particles 5m

Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

OpenCL

crash when compiling ZS decoder

Submitted reproducer
Last update from PoCL devs: Not a trivial fix, requires rewriting one step in compiler pipeline

Memory corruption in ZS decoder

Looks like miscompilation after all
Issue disappears when running single-thread decoder via OpenCL
Kernel consists of two parts
- Decode header region in page into shared memory

- Decode adc data from page in global memory
Replacing header decoding with single-threaded variant: Issue almost disappears, but some wrong values still written
Dumping written values with printf:
- OpenCL version writes correct values except for memory corruption
- So compared to CPU version: expect identical output + bunch of additional writes
- Actual output: OpenCL writes less values than CPU + almost no overlap in output???
- -> printf affected by miscompilation, so useless here?

New GPU server

Final configuration:

Dev machine with 64 core threadripper + RTX 5080 (& MI50)
New 2U server for alibi, repurpose old alibi to CI GPU machine

Order with Guy. Waiting for update.

10:45 → 10:50

ITS Tracking 5m

Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

Gabriele

Merged ITS launch bounds into O2
Studied ITS GPU code, next week meeting with Matteo for clarifications + coordination with Felix when he's back

10:50 → 10:55

System Run Coordination Topics 5m

Speaker: Ernst Hellbar (CERN)

Choose timezone

Alice Weekly Meeting: Software for Hardware Accelerators