Alice Weekly Meeting: Software for Hardware Accelerators
-
-
10:00
→
10:20
Discussion 20m Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test of new O2DPG was set up incorrectly, now seems to work. Catalin wants to do some extra checks before merging.
- Test with GPU GRID jobs at NERSC pending.
- Asked DPG to run first test with ITS tracking on GPU on EPNs.
GPU ROCm / compiler topics:
- Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
- Problem with building ONNXRuntime with MIGraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- LLVM Bump to 20.1: status?
- Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
- ROCm 7 with full serialization passes validation in deterministic mode.
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
- Final solution: merging transformation maps on the fly into a single flat object: Draft version by Sergey exists but still WIP.
- Pending OpenCL2 issues:
- printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
- Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
- Also with optimization disabled, crashing later in TPC merging, need printf to debug.
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Probably a bug in multi-threaded pipeline when timeframes do not arrive in order, trying to reproduce.
- Bug in reading of MC data, temporary fix by Ruben applied, will need to implement a proper fix.
EPN GPU Topics:
-
10:20
→
10:25
TPC ML Clustering 5m Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Physics
- Reran on partial set of digits from https://alimonitor.cern.ch/catalogue/index.jsp?path=%2Falice%2Fsim%2F2025%2FLHC25b8a_v8b3c%2F0%2F559781#/alice/sim/2025/LHC25b8a_v8b3c/0/559781
- Produced own anchored simulation for the 0-5% centrality simulations (finished last night with 10 sims with and w/o SC)
Comparison: Left: Old, 50Ev, 50kHz PbPb without proper correction; Right: New sim from grid, no centrality enforced (40x statistics)
(For completeness: The qa plots on the left were made with different thresholds between the NN and GPU CF regressions. This explains the strong difference. It is not related to a "worse" behaviour of the regression net.)
Clusters fake fraction


Primary tracks: Efficiency


Primary tracks: Fake rate


New plots (only for the new sim for now)
Chi2_red for good tracks

Chi2_red for fake tracks

Z resolution (improves with network -> Better CoG time estimate)

Tracks vs. RowsWClusters

Tracks vs. pT (full)

Tracks vs. pT (ratio)

Ratio of pT RMS (improvement at low pT, otherwise compatible)

Framework
- Currently debugging tpcdigits.root writing from the o2-tpc-reco-workflow and the ChunkedDigitWriter. If that still fails by noon, I will switch to custom dumping.
-
10:25
→
10:30
GPU Parameter Optimizations 5m Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
News from GPU parameter tuning
HIP/CUDA __launch_bounds__ mismatch problem
- Almost fixed mismatch with HIP and CUDA launch bounds
- Strategy is to edit the second parameter if the length of the tuple is >= 2
- At the moment, able to split the strategy and deal with kernels with 1 parameter
- Currently stuck for kernels with >= 2 parameters: can't expand the last macro to inject the launch bounds into the kernel definition.
E.g., preprocessor output for a kernel with 2 params:
__attribute__((global)) void GPUCA_KRNL_REG2 1024, 1 krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) {...
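The tuple rule described above (leave one-parameter kernels alone, edit the second parameter when there are two or more) can be sketched outside the preprocessor as follows. This is only an illustration of the rule, not the macro code used in O2: the function names, the backend labels, the choice of which backend gets edited, and in particular the `placeholder_convert` rule are all assumptions, since the notes do not say how the second parameter is edited.

```python
def rewrite_launch_bounds(params, backend, convert_second):
    """Return the launch-bounds tuple to emit for one backend.

    params         -- tuple of launch-bounds values from the kernel definition
    backend        -- "cuda" or "hip" (labels chosen for this sketch)
    convert_second -- hypothetical callable (max_threads, second) -> new second value
    """
    params = tuple(params)
    if backend == "hip" and len(params) >= 2:
        # Only the second parameter is edited; everything else passes through.
        return (params[0], convert_second(params[0], params[1])) + params[2:]
    # One-parameter kernels (and the other backend) are left untouched.
    return params


def format_attribute(params):
    """Render the tuple as the attribute text placed before the kernel name."""
    return "__launch_bounds__(" + ", ".join(str(p) for p in params) + ")"


def placeholder_convert(max_threads, second):
    # Placeholder rule for illustration only; it is NOT taken from the notes.
    return second * max_threads // 32


for bounds in [(256,), (1024, 1)]:
    for backend in ("cuda", "hip"):
        rewritten = rewrite_launch_bounds(bounds, backend, placeholder_convert)
        print(backend, format_attribute(rewritten))
```

The point of the sketch is the dispatch on tuple length. In the preprocessor the same dispatch has to be followed by one more expansion step, which is where the `GPUCA_KRNL_REG2 1024, 1` output shows the macro being left unexpanded.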
ITS-GPU-tracker parameter tuning
- Added launch bounds to 6 kernels
- Original version: nThreads=256, nBlocks=30
- Changed to nThreads=256, nBlocks=60 to have one block per SM (total of 60 SMs on MI50)
- Set __launch_bounds__(256, 1) for those 6 kernels
- Processed 1091 pp TFs (alien:///alice/data/2025/LHC25ac/563430/)
- Measured the following metric from the tracker:
TimeFrame 1091 processing completed in: 300.93 ms using 20 thread(s)
Mean TF processing times [ms] on AMD MI50
| | No __launch_bounds__ | With __launch_bounds__ | Gain | Speedup |
| Before overlapping memcpy with compute kernels | 853.1 ± 191.9 | 561.2 ± 155.5 | 34.2% | 1.5 |
| After overlapping memcpy with compute kernels | 651.6 ± 167.2 | 454.6 ± 137.0 | 30.2% | 1.4 |
- This is with simple manual tuning. If this metric is valid, I will use the tuner to look for the best configurations for the individual kernels.
- Will have to think how to deal with different GPU architectures, as these parameters are architecture dependent (suggestions?)
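For reference, the Gain and Speedup columns of the timing table above are consistent with the usual definitions, gain = (t_without - t_with) / t_without and speedup = t_without / t_with, applied to the mean times. The short check below uses only the mean values quoted in the table; the function and label names are chosen for this sketch, and the quoted spreads (± values) are not propagated.

```python
def gain_percent(t_without, t_with):
    """Relative reduction of the mean processing time, in percent."""
    return 100.0 * (t_without - t_with) / t_without


def speedup(t_without, t_with):
    """Ratio of mean processing times (without / with launch bounds)."""
    return t_without / t_with


# Mean TF processing times in ms, copied from the table: (without, with) launch bounds.
mean_times_ms = {
    "before memcpy/compute overlap": (853.1, 561.2),
    "after memcpy/compute overlap": (651.6, 454.6),
}

for label, (t_without, t_with) in mean_times_ms.items():
    print(f"{label}: gain {gain_percent(t_without, t_with):.1f} %, "
          f"speedup {speedup(t_without, t_with):.1f}")
```

Rounded to one decimal, this gives the 34.2 % / 1.5 and 30.2 % / 1.4 entries of the table.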
ALICE contribution to HS23
Currently on hold, as priority was given to the __launch_bounds__ fix and ITS tuning; Robin is on vacation anyway.
-
10:30
→
10:35
Efficient Data Structures 5m Speaker: Dr Oliver Gregor Rietmann (CERN)
Summer Student Milla Bramsted
- She is working on benchmarking SoA code on GPUs
- We will track her project in this google doc
- She is adding CUDA kernels to this repo
- 5 CUDA kernels are now running in our benchmark framework
- They are running in AoS and SoA data layout
- Milla presented her project at the AIP meeting
- This is her last week
- Next steps:
- Finish the written report
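For readers less familiar with the two layouts being benchmarked, the difference between AoS (array of structures) and SoA (structure of arrays) can be illustrated with a minimal sketch. The three-float record and the field names `x`, `y`, `z` are made up for this example and are not taken from the benchmark code; the actual kernels are CUDA, while this only shows the memory ordering.

```python
import struct
from array import array

# Hypothetical record with three float fields (names are illustrative only).
FIELDS = ("x", "y", "z")
RECORD = struct.Struct("<3f")  # one AoS element: x, y, z packed together


def pack_aos(points):
    """Array of Structures: x0 y0 z0 x1 y1 z1 ..."""
    return b"".join(RECORD.pack(*p) for p in points)


def pack_soa(points):
    """Structure of Arrays: x0 x1 ... | y0 y1 ... | z0 z1 ..."""
    return {name: array("f", (p[i] for p in points)) for i, name in enumerate(FIELDS)}


def sum_x_aos(buf):
    # Reading only x still strides over the y and z of every record.
    return sum(RECORD.unpack_from(buf, off)[0] for off in range(0, len(buf), RECORD.size))


def sum_x_soa(soa):
    # Reading only x touches a single contiguous array.
    return sum(soa["x"])


points = [(1.0, 10.0, 100.0), (2.0, 20.0, 200.0), (3.0, 30.0, 300.0)]
aos = pack_aos(points)
soa = pack_soa(points)
print(sum_x_aos(aos), sum_x_soa(soa))
```

Both layouts hold the same data and give the same result; what differs is the access pattern when a kernel reads only some of the fields, which is the kind of effect an AoS-versus-SoA benchmark measures.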
ALICE O2 CI-Pipelines on NGT Cluster
- A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
- It has a GitHub action running the standalone benchmark on NGT GPUs
- Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
- Different GPUs are tested in parallel on different VMs
- O2 standalone benchmark works on all NGT resources:
- Nvidia H100 188GB NVL
- AMD Instinct MI300X
- AMD Radeon Pro W7900
- Nvidia L40S
- We are now using custom .par files
- Next steps
- Generate optimized .par files with Gabriele
- Possible next steps
- Add new architectures to O2 hardcoded ones?
- Store results in csv format and plot them?
- Display the plot in the github web gui?
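If the CSV option is pursued, a minimal sketch of what storing and re-reading per-GPU results could look like is given below. The column names are hypothetical, and the timing column is deliberately left empty: no measurements are given in these notes, so none are invented here. Only the GPU names come from the list above.

```python
import csv
import io

FIELDNAMES = ["gpu", "mean_time_ms"]  # hypothetical column names


def results_to_csv(results):
    """Serialize a list of result dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()


def load_results(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# GPU names from the list above; timings are left empty (None), not measured values.
rows = [
    {"gpu": "Nvidia H100 188GB NVL", "mean_time_ms": None},
    {"gpu": "AMD Instinct MI300X", "mean_time_ms": None},
    {"gpu": "AMD Radeon Pro W7900", "mean_time_ms": None},
    {"gpu": "Nvidia L40S", "mean_time_ms": None},
]

text = results_to_csv(rows)
print(text)
```

In a CI job the same text could be written to a file and kept as a build artifact, so that a later step can plot it.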
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo
- Simplified the AoS-SoA wrapper code
- Started to apply our AoS-SoA code to:
  - Merger
    - GPUTPCGMSectorTrack
    - GPUTPCGMTrackParam
  - SectorTracker
    - GPUTPCBaseTrackParam
    - GPUTPCTrackParam
    - GPUTPCTrack
- Started with SectorTracker, no problems so far
ACAT Conference Preparation
- Presenting with Jolly our joint work on AoS vs SoA.
- Added some manual AoS benchmarks as baselines.
-
10:35
→
10:40
Following up GPU to-dos 5m Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
Build the O2 GPU TPC standalone benchmark on two local machines:
Motivation: Want to reduce my use of the EPNs for routine activities such as preparing PRs and learning the framework.
On my local workstation: Copied the datasets generated by David from an EPN machine, but reading them fails with a size mismatch error.
    vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pp-10 --debug 1
    Reading events from Directory o2-pp-10
    Created GPUReconstruction instance for device type CPU (1)
    ERROR reading events/o2-pp-10/tpctransform.dump, invalid size: 4552 (4568 expected)
    terminate called after throwing an instance of 'std::runtime_error'
    what(): invalid size
    Aborted
    vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pbpb-50 --debug 1
    Reading events from Directory o2-pbpb-50
    Created GPUReconstruction instance for device type CPU (1)
    ERROR reading events/o2-pbpb-50/tpctransform.dump, invalid size: 4552 (4568 expected)
    terminate called after throwing an instance of 'std::runtime_error'
    what(): invalid size
    Aborted
Checked the dump files with md5sum, there is no difference.
    vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ diff md5EpnO2-pp-10 md5VikasO2-pp-10
    vikas@vsinghal:/localdata/standalone/events/o2-pp-10$
Checked the OS versions of the EPN and my setups. I have two setups, one with Debian and one with AlmaLinux, but it did not work on either.
    [vsinghal@epn000 o2-pp-10]$ lsb_release -a
    LSB Version: :core-4.1-amd64:core-4.1-noarch
    Distributor ID: AlmaLinux
    Description: AlmaLinux release 9.5 (Teal Serval)
    Release: 9.5
    Codename: TealServal
    vikas@vsinghal:/localdata/standalone/events/o2-pp-10$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Debian
    Description: Debian GNU/Linux 11 (bullseye)
    Release: 11
    Codename: bullseye
    trainee@gpu-compute:~/a_standalone/events/o2-pp-10$ lsb_release -a
    LSB Version: n/a
    Distributor ID: AlmaLinux
    Description: AlmaLinux 9.6 (Sage Margay)
    Release: 9.6
    Codename: n/a
Tried generating the dataset locally, but this needs O2sim, CCDB, alien-token-init, etc.
    vikas@vsinghal:/localdata/standalone$ ~/AliceGPU/sw/SOURCES/O2/daily-20250808-0000/daily-20250808-0000/prodtests/full_system_test.sh
    Missing O2sim environment
    vikas@vsinghal:/localdata/standalone$ alienv enter O2sim/latest
    ERROR: O2sim/latest was not found
Is there a way to use the EPN datasets, or some other method?
-
10:40
→
10:45
TPC Clusterization / OpenCL / Highly Ionizing Particles 5m Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
OpenCL
Open issues with PoCL:
- Memory corruption in ZS decoder
- crash when compiling stream compaction kernel
- new: crash in ZS decoder
- somehow missed this last week, doesn't happen when OpenCL optimizations are disabled
looking into crash in stream compaction -> small kernel but not straightforward to reproduce, seems to require identical layout as kernel in O2
GPU Servers
Possible configurations that allow buying a 64-core Threadripper:
- Stripped down components (less storage, 5070 Ti instead of 5080 for dev machine)
- Get Ryzen 9950X instead as CPU for CI machine
- Only 1 mainboard available with AM5 socket, IPMI + 2 GPU slots
- Mainboard has expected delivery time of 2-3 months...
- Reuse alibi after all (currently has RTX 2080 + Radeon VII)
- Machine needs at least OS upgrade anyway
- Buy a new (regular 2U) server for simulation jobs instead
Spreadsheet of all components + cost: https://docs.google.com/spreadsheets/d/1CcPUBvk4QVq344NOnXja-OjEBX1mm0_sFebN-5t1iz0
(TODO: update spreadsheet for option 3)
-
10:45
→
10:50
ITS Tracking 5m Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
-
10:50
→
10:55
System Run Coordination Topics 5m Speaker: Ernst Hellbar (CERN)