Alice Weekly Meeting: Software for Hardware Accelerators
Europe/Zurich
10:00 AM → 10:20 AM
Discussion (20m). Speaker: David Rohr (CERN)
Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
Sync reconstruction
Async reconstruction
- Need to investigate short GPU stall problem.
- Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
- Test with GPU GRID jobs at NERSC pending.
- Will tune existing 16-core settings, add a SITEARCH for 16-core CPU, and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
- Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
GPU ROCm / compiler topics:
- Problem with building ONNXRuntime with MigraphX support.
- Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
- Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
- Need to check ROCm 7.2 correctness.
- Need to understand and fix crash on RTX Pro 6000 reported by Oliver.
- Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver.
- Understand deterministic mode issue on NVIDIA Blackwell.
- Performance issue on Blackwell fixed (see below).
TPC / GPU Processing
- WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
- Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
- Final solution: merging transformation maps on the fly into a single flat object:
- Maps now yielding correct results, but a 1.5x performance regression when running on GPUs.
- PR is now green in the CI; POD version of the FastTransform and merging of the maps integrated (thanks to Ruben and Matthias, and of course Sergey!)
- Remaining regression with the new maps vs old maps is due to reduced L2 cache hit rate (to be understood why, perhaps larger metadata?).
- This is compensated when the maps are merged since we query only 1 map, but could still be improved.
- Discussing with Sergey why new version is worse with respect to cache, Sergey is interested in reproducing and improving it.
- If we still want to change it, I would wait with merging, since otherwise we need yet another compatibility layer to load maps that were created in between.
- On MI50, still miscompiles when RTC and dEdx are enabled.
- New code without RTC is 14% faster than old code with RTC on MI50.
- New maps fix performance issues on new NVIDIA GPUs. 4090 is roughly 3x faster (new vs. old maps, with RTC in both cases).
- Need to check the problem with ONNX external memory allocator.
- Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
- Check for unnecessary f64 instructions in GPU code.
- Want to retune NVIDIA Blackwell after the performance issue was fixed, and manually look at what can be optimized, to get a first realistic estimate of how many GPUs we need.
- Working on some improvements for general GPU code and the TPC POD FastTransform: get rid of defines and use constexpr; unify multiply-defined constants and use global GPU constants in FastTransform; move TPCFastTransformGeo to GPU constant memory.
- Matthias is working to get rid of the FlatObject dependency in the FastTransform splines, which will reduce the memory footprint slightly and hopefully improve cache efficiency.
Other topics:
- Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
- Test at NERSC still ongoing; all jobs so far failed for non-GPU-related reasons. Currently jobs have been waiting for up to 6 days.
EPN GPU Topics:
10:20 AM → 10:25 AM
TPC ML Clustering (5m). Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Cluster error parameterization
- Test optimization on default cluster error mode with scaling
- Best parameters found within bounds:
- scaleChiY1: 1.7779086962105004
- scaleChiY2: 0.5157460329068477
- scaleChiY3: 1.343090130640468
- scaleChiZ1: 1.0739495259708887
- scaleChiZ2: 0.5274259679551516
- scaleChiZ3: 1.1181900592137066
- The "improvement": +0.15% efficiency, +0.2% clone-rate, -0.5% fake-rate
- All of them constrained to [0.5, 2.0] by the optimization bounds
Reminder (GPUTPCGMPropagator.h)

That means, e.g. for the strongest changes (scaleChiY2 and scaleChiZ2): even a factor of ~2 does not change the behaviour of the tracking much. Suspected reason: a chi2 of 9 for a cluster must be an extreme outlier -> the cluster is far away and probably not track-attached in the first place, so the cut only takes effect when the scaling factor (e.g. scaleChi*) is large.
10:25 AM → 10:30 AM
GPU Parameter Optimizations (5m). Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
Polishing of the tuning tool almost ready
- Auto-detects GPU vendors
- Installs all dependencies in a separate Python environment (except the profiler)
- README with instructions in progress
- Idea: set a desired duration for the tuning and derive the number of iterations from that
Talk on ALICE experience with GPUs accepted at the INFN workshop on computing (11-15 May)
- Will contact Maxim for slides on GPU usage & monitoring on the GRID
10:30 AM → 10:35 AM
Efficient Data Structures (5m). Speaker: Dr Oliver Gregor Rietmann (CERN)
NextGenTrigger Task 1.7
- CHEP talk together with Jolly got accepted.
- Co-supervision of a summer student. Topic: Improve the clustering algorithm CLUE.
Implement NGT SoA Code in O2 standalone benchmark
- Working on this fork of the AliceO2 repo, with a CI pipeline:
- Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
- Extended CI pipeline to fail if GPU.out changes
- Implemented SoA in:
- SectorTracker
- GPUTPCBaseTrackParam
- GPUTPCTrackParam
- GPUTPCTracklet
- GPUTPCTrack
- Benchmarking:
- Two independent measurements for each of the 4 NGT GPUs (standalone + profiler)
- Integrated in CI-pipeline with a comparison to the unaltered code
- Results are now much more consistent, at least with the profiler
- Ran all the benchmarks with and without monitoring: results looked very consistent.
- Next Steps:
- Build and run O2 with a C++26 compiler to integrate reflection (needed for the CHEP benchmark)
- Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
- Make better use of SoA to improve performance
- Try David's suggestion
10:35 AM → 10:40 AM
Following up GPU to-dos (5m). Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
10:40 AM → 10:45 AM
TPC Clusterization / OpenCL / Highly Ionizing Particles (5m). Speaker: Felix Weiglhofer (CERN)
10:45 AM → 10:50 AM
10:50 AM → 10:55 AM
System Run Coordination Topics (5m). Speaker: Ernst Hellbar (CERN)