Alice Weekly Meeting: Software for Hardware Accelerators
Europe/Zurich
10:00 AM → 10:20 AM
Discussion (20m). Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand and fix crash on RTX Pro 6000 reported by Oliver.
- Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver.
- Understand deterministic mode issue on NVIDIA Blackwell.
- Performance issue on Blackwell fixed (see below).
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yielding correct results, but a 1.5x performance regression when running on GPUs.
- PR is now green in the CI; POD version of the FastTransform and merging of the maps integrated (thanks to Ruben and Matthias, and of course Sergey!)
- Remaining regression with the new maps vs old maps is due to reduced L2 cache hit rate (to be understood why, perhaps larger metadata?).
- This is compensated when the maps are merged since we query only 1 map, but could still be improved.
- Discussing with Sergey why new version is worse with respect to cache, Sergey is interested in reproducing and improving it.
- If we still want to change it, I would wait with merging, since otherwise we need yet another compatibility layer to load maps that were created in between.
- On MI50, still miscompiles when RTC and dEdx are enabled.
- New code without RTC is 14% faster than old code with RTC on MI50.
- New maps fix performance issues on new NVIDIA GPUs. 4090 is roughly 3x faster (new vs. old maps, with RTC in both cases).
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Check for unnecessary f64 instructions in GPU code.
- Want to retune NVIDIA Blackwell after the performance issue was fixed, and manually look at what can be optimized, to get a first realistic estimate of how many GPUs we need.
- Working on some improvements for general GPU code and the TPC POD FastTransform: get rid of defines and use constexpr; unify multiply-defined constants and use global GPU constants in FastTransform; move TPCFastTransformGeo to GPU constant memory.
- Matthias is working to get rid of the FlatObject dependency in the FastTransform splines, which will reduce the memory footprint slightly and hopefully improve cache efficiency.
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Test at NERSC still ongoing; all jobs so far failed for non-GPU-related reasons. Currently jobs have been waiting for up to 6 days.
EPN GPU Topics:
10:20 AM → 10:25 AM
TPC ML Clustering (5m). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Cluster error parameterization
- Test optimization on default cluster error mode with scaling
- Best parameters found within bounds:
- scaleChiY1: 1.7779086962105004
- scaleChiY2: 0.5157460329068477
- scaleChiY3: 1.343090130640468
- scaleChiZ1: 1.0739495259708887
- scaleChiZ2: 0.5274259679551516
- scaleChiZ3: 1.1181900592137066
- The "improvement": +0.15% efficiency, +0.2% clone-rate, -0.5% fake-rate
- All of them constrained to [0.5, 2.0] by the optimization bounds
Reminder (GPUTPCGMPropagator.h)

That means, e.g. for the strongest changes (scaleChiY2 and scaleChiZ2): even a factor of ~2 does not change the behaviour of the tracking much. Suspected reason: a chi2 of 9 for a cluster must be an extreme outlier -> the cluster is far away and probably not track-attached in the first place, so the cut only takes effect when the scaling factor (e.g. scaleChi*) is large.
10:25 AM → 10:30 AM
GPU Parameter Optimizations (5m). Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
Polishing of the tuning tool almost ready
- Auto-detects GPU vendors
- Installs all dependencies in a separate Python environment (except the profiler)
- README with instructions in progress
- Idea: set a desired duration for the tuning and derive the number of iterations from that
Talk on ALICE experience with GPUs accepted at the INFN workshop on computing (11-15 May)
- Will contact Maxim for slides on GPU usage & monitoring on the GRID
10:30 AM → 10:35 AM
Efficient Data Structures (5m). Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- CHEP talk together with Jolly got accepted.
- Co-supervision of a summer student. Topic: Improve the clustering algorithm CLUE.
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- GPUTPCTrack
- Benchmarking:
- Two independent measurements for each of the 4 NGT GPUs (standalone + profiler)
- Integrated in CI-pipeline with a comparison to the unaltered code
- Results are now much more consistent, at least with the profiler
- Ran all the benchmarks with and without monitoring: results looked very consistent.
- Next Steps:
- Build and run O2 with a C++26 compiler to integrate reflection (needed for the CHEP benchmark)
- Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
- Make better use of SoA to improve performance
- Try David's suggestion
10:35 AM → 10:40 AM
Following up GPU to-dos (5m). Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
10:40 AM → 10:45 AM
TPC Clusterization / OpenCL / Highly Ionizing Particles (5m). Speaker: Felix Weiglhofer (CERN)
10:45 AM → 10:50 AM
10:50 AM → 10:55 AM
System Run Coordination Topics (5m). Speaker: Ernst Hellbar (CERN)