ALICE Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 → 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate the short GPU stall problem.
      • The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • The test of the new O2DPG was set up incorrectly; now it seems to work. Catalin wants to do some extra checks before merging.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run a first test with ITS tracking on GPU on the EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • LLVM bump to 20.1: status?
      • Miscompilation / internal compiler error fixed in the new clang for ROCm 7.x; the SDMA engine synchronization bug is still not fixed.
        • ROCm 7 with full serialization passes validation in deterministic mode.
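
      For context, a minimal sketch of one conventional mitigation for header-defined inline functions emitting weak symbols in every translation unit (the function name is hypothetical, not O2 code): internal linkage keeps the symbol from escaping the object file. Whether this fits the O2 case is exactly the open question above.

      // Hypothetical example, not the O2 code: an unnamed namespace gives the
      // function internal linkage, so no symbol escapes the translation unit.
      namespace {
      __device__ inline float scaleCharge(float q)
      {
        return 0.5f * q; // placeholder body
      }
      } // namespace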

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix the bogus TPC transformations for good; then we can revert the workaround.
      • Waiting for TPC to check the PR that uses full cluster errors, including average charge and occupancy-map errors, during seeding.
      • Final solution: merging the transformation maps on the fly into a single flat object. A draft version by Sergey exists but is still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to a confirmed bug in clang; a fix is being prepared. This blocks further debugging for now.
        • Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf must be fixed first to debug it.
        • Even with the optimization disabled, it crashes later in TPC merging; printf is needed to debug this as well.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Probably a bug in the multi-threaded pipeline when timeframes do not arrive in order; trying to reproduce it.
      • Bug in the reading of MC data; a temporary fix by Ruben has been applied, a proper fix still needs to be implemented.
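
      As a reference for the alignas() idea above, a minimal sketch (the struct is illustrative, not the actual O2 MC label class): fixing the alignment explicitly makes host and device compilers agree on the layout, and static_asserts catch any deviation at compile time.

      // Illustrative struct, not the O2 MC label class: pin the alignment so
      // all compilers (host and device) produce the same layout.
      #include <cstdint>

      struct alignas(8) MCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      static_assert(alignof(MCLabel) == 8, "unexpected alignment");
      static_assert(sizeof(MCLabel) == 8, "unexpected size");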

       

      EPN GPU Topics:

       

    • 10:20 → 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Physics

      • Reran on a partial set of digits from https://alimonitor.cern.ch/catalogue/index.jsp?path=%2Falice%2Fsim%2F2025%2FLHC25b8a_v8b3c%2F0%2F559781#/alice/sim/2025/LHC25b8a_v8b3c/0/559781
      • Produced own anchored simulation for the 0-5% centrality simulations (finished last night: 10 simulations with and without SC)

       

      Comparison: Left: old, 50 events, 50 kHz PbPb without proper correction; Right: new simulation from the grid, no centrality enforced (40x statistics)

      (For completeness: the QA plots on the left were made with different thresholds for the NN and GPU CF regressions. This explains the strong difference; it is not related to a "worse" behaviour of the regression net.)

      Plots shown:

      • Clusters: fake fraction
      • Primary tracks: efficiency
      • Primary tracks: fake rate

      New plots (only for the new sim for now):

      • Chi2_red for good tracks
      • Chi2_red for fake tracks
      • Z resolution (improves with the network -> better CoG time estimate)
      • Tracks vs. RowsWClusters
      • Tracks vs. pT (full)
      • Tracks vs. pT (ratio)
      • Ratio of pT RMS (improvement at low pT, otherwise compatible)

      Framework

      • Currently debugging tpcdigits.root writing from the o2-tpc-reco-workflow and the ChunkedDigitWriter. If it still fails by noon, I will switch to custom dumping.

    • 10:25 → 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      HIP/CUDA __launch_bounds__ mismatch problem

      • Almost fixed the mismatch between the HIP and CUDA launch bounds
      • Strategy is to edit the second parameter if the length of the tuple is >= 2
      • At the moment, able to split the strategy and deal with kernels with one parameter
      • Currently stuck on kernels with >= 2 parameters: cannot expand the last macro to inject the launch bounds into the kernel definition (a possible workaround is sketched below).
        E.g., preprocessor output for a kernel with 2 params:
         __attribute__((global)) void GPUCA_KRNL_REG2 1024, 1 krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) {...
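
      One possible way around the expansion problem, sketched with hypothetical macro names (not the actual O2 macros): keep the bounds as a parenthesized tuple, so that a function-like macro placed directly before it consumes the tuple as its argument list and the launch bounds expand inside the kernel definition.

      // Hypothetical macros: the parenthesized tuple "(1024, 1)" becomes the
      // argument list of KRNL_REG, which expands to __launch_bounds__(1024, 1).
      #include <cstdint>

      #define KRNL_REG(maxThreads, minBlocks) __launch_bounds__(maxThreads, minBlocks)
      #define DEFINE_KRNL(name, bounds) \
        __global__ void KRNL_REG bounds krnl_##name(int32_t _iSector_internal)

      DEFINE_KRNL(GPUTPCNeighboursFinder, (1024, 1)) { /* kernel body */ }
      // expands to:
      // __global__ void __launch_bounds__(1024, 1)
      //   krnl_GPUTPCNeighboursFinder(int32_t _iSector_internal) { ... }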

      ITS-GPU-tracker parameter tuning

      • Added launch bounds to 6 kernels
      • Original version: nThreads=256, nBlocks=30
      • Changed to nThreads=256, nBlocks=60 to have one block per SM (60 SMs/CUs in total on the MI50)
      • Set __launch_bounds__(256, 1) for those 6 kernels (see the sketch after this list)
      • Processed 1091 pp TFs (alien:///alice/data/2025/LHC25ac/563430/)
      • Measured the following metric from the tracker: "TimeFrame 1091 processing completed in: 300.93 ms using 20 thread(s)"

        Mean TF processing times on AMD MI50:

                                                   No __launch_bounds__   With __launch_bounds__   Gain     Speedup
        Before overlapping memcpy with compute     853.1 ± 191.9 ms       561.2 ± 155.5 ms         34.2 %   1.5
        After overlapping memcpy with compute      651.6 ± 167.2 ms       454.6 ± 137.0 ms         30.2 %   1.4
      • This is with simple manual tuning. If this metric is valid, I will use the tuner to search for the best configurations for the individual kernels.
      • Will have to think about how to deal with different GPU architectures, as these parameters are architecture-dependent (suggestions?)
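
      For reference, a sketch of the manual tuning described above, with a hypothetical kernel (CUDA semantics; in HIP the second __launch_bounds__ argument has a different meaning): cap the block size at 256 threads, ask the compiler to keep at least one block resident per SM/CU, and launch one block per CU.

      // Hypothetical kernel illustrating the tuning: 256 threads per block,
      // at least 1 resident block per SM/CU, launched with one block per CU.
      __global__ __launch_bounds__(256, 1) void exampleKernel(float* data, int n)
      {
        // grid-stride loop: the fixed 60-block grid covers any n
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
          data[i] *= 2.0f; // placeholder for the per-element work
        }
      }
      // launch: exampleKernel<<<60, 256>>>(d_data, n); // 60 CUs on the MI50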

      ALICE contribution to HS23

      Currently on hold; priority is given to the __launch_bounds__ fix and the ITS tuning, and Robin is on vacation anyway.

    • 10:30 → 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Summer Student Milla Bramsted

      • She is working on benchmarking SoA code on GPUs
      • We track her project in this Google doc
      • She is adding CUDA kernels to this repo
      • 5 CUDA kernels are now running in our benchmark framework
      • They run in both AoS and SoA data layouts (see the sketch after this list)
      • Milla presented her project at the AIP meeting
      • This is her last week
      • Next steps:
        • Finish the written report
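
      To make the AoS vs. SoA comparison concrete, a minimal sketch (the types are hypothetical, not the benchmark code) of the same operation in both layouts: the SoA variant lets consecutive threads read consecutive floats, while the AoS variant reads with a struct-sized stride.

      // Hypothetical types: the same operation in AoS and SoA layout.
      struct HitAoS { float x, y, z; };      // array of structs
      struct HitsSoA { float *x, *y, *z; };  // struct of arrays

      __global__ void shiftAoS(HitAoS* hits, int n, float dx)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) hits[i].x += dx;  // 12-byte stride between threads
      }

      __global__ void shiftSoA(HitsSoA hits, int n, float dx)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) hits.x[i] += dx;  // fully coalesced access
      }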

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds of O2 and its dependencies in /cvmfs/alice.cern.ch/ (the pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • The O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Next steps
        • Generate optimized .par files with Gabriele
      • Possible next steps
        • Add new architectures to the O2 hardcoded ones?
        • Store results in CSV format and plot them?
        • Display the plots in the GitHub web GUI?

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker, no problems until now

      ACAT Conference Preparation

      • Presenting our joint work with Jolly on AoS vs. SoA.
      • Added some manual AoS benchmarks as baselines.
    • 10:35 → 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

      Build the O2 GPU TPC standalone benchmark on two local machines:

      Motivation: want to reduce my use of the EPNs for routine activities like generating PRs and learning the framework.

       

      On my local workstation: copied the datasets generated by David from the EPN machine, but the benchmark shows a size mismatch error.
       
       
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pp-10 --debug 1
      Reading events from Directory o2-pp-10
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pp-10/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pbpb-50 --debug 1
      Reading events from Directory o2-pbpb-50
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pbpb-50/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$

       

      Checked the dump files with md5sum; there is no difference.
      Checked the OS versions on the EPN and on my setups. I have two setups, one with Debian and one with AlmaLinux, but it did not work on either. (Since the md5sums match, the files themselves are intact; the mismatch of 4552 vs. 4568 bytes may instead come from building the benchmark from a different O2 version than the one that wrote the dumps.)

       

      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ diff md5EpnO2-pp-10 md5VikasO2-pp-10 
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$
      
      [vsinghal@epn000 o2-pp-10]$ lsb_release -a
      LSB Version:    :core-4.1-amd64:core-4.1-noarch
      Distributor ID:    AlmaLinux
      Description:    AlmaLinux release 9.5 (Teal Serval)
      Release:    9.5
      Codename:    TealServal
      [vsinghal@epn000 o2-pp-10]$
      
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ lsb_release -a
      No LSB modules are available.
      Distributor ID:    Debian
      Description:    Debian GNU/Linux 11 (bullseye)
      Release:    11
      Codename:    bullseye
      vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ 
      
      trainee@gpu-compute:~/a_standalone/events/o2-pp-10$ lsb_release -a
      LSB Version:    n/a
      Distributor ID:    AlmaLinux
      Description:    AlmaLinux 9.6 (Sage Margay)
      Release:    9.6
      Codename:    n/a
      trainee@gpu-compute:~/a_standalone/events/o2-pp-10$

       

      Tried to generate a dataset myself, but this needs O2sim, CCDB access, alien-token-init, etc.
      Is there a way to use the EPN datasets, or some other method?

      vikas@vsinghal:/localdata/standalone$ ~/AliceGPU/sw/SOURCES/O2/daily-20250808-0000/daily-20250808-0000/prodtests/full_system_test.sh 
      Missing O2sim environment
      vikas@vsinghal:/localdata/standalone$ alienv enter O2sim/latest
      ERROR: O2sim/latest was not found
      vikas@vsinghal:/localdata/standalone$

       

    • 10:40 → 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      Open issues with PoCL:

      • Memory corruption in the ZS decoder
      • Crash when compiling the stream compaction kernel
      • New: crash in the ZS decoder
        • Somehow missed this last week; it does not happen when OpenCL optimizations are disabled

      Looking into the crash in stream compaction -> a small kernel, but not straightforward to reproduce; it seems to require the identical layout as the kernel in O2. (For context, a generic stream-compaction sketch follows.)
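
      A generic illustration of what such a stream-compaction kernel does (written here in CUDA; it is not the O2 kernel, whose layout and implementation differ): keep the elements that pass a predicate and write them contiguously to the output.

      // Generic stream compaction, not the O2 kernel: an atomic counter hands
      // out output slots for the elements that pass the predicate.
      // Note: the order of the kept elements is not preserved.
      __global__ void compact(const int* in, int* out, int* outCount, int n)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {            // predicate: keep positive values
          int pos = atomicAdd(outCount, 1);  // claim the next output slot
          out[pos] = in[i];
        }
      }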

      GPU Servers

      Possible configurations that would allow buying a 64-core Threadripper:

      1. Stripped-down components (less storage, 5070 Ti instead of 5080 for the dev machine)
      2. Get a Ryzen 9950X instead as the CPU for the CI machine
        • Only 1 mainboard available with an AM5 socket, IPMI, and 2 GPU slots
        • The mainboard has an expected delivery time of 2-3 months...
      3. Reuse alibi after all (currently has an RTX 2080 + Radeon VII)
        • The machine needs at least an OS upgrade anyway
        • Buy a new (regular 2U) server for simulation jobs instead

      Spreadsheet of all components + cost: https://docs.google.com/spreadsheets/d/1CcPUBvk4QVq344NOnXja-OjEBX1mm0_sFebN-5t1iz0

      (TODO: update spreadsheet for option 3)

    • 10:45 → 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
    • 10:50 → 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)