Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x; SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.
      • ROCm 7.2 released, waiting for EPN to set it up with new dev2 server.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey is checking.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
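      A minimal sketch of the alignas() approach mentioned above. The struct layout here is a hypothetical stand-in, not the real MC label type from O2:

      ```cpp
      #include <cstdint>

      // Hypothetical stand-in for the Monte Carlo label type; the real
      // layout lives in O2. Without an explicit alignment, this 8-byte
      // struct only requires 4-byte alignment, so arrays of labels may
      // straddle 8-byte boundaries.
      struct MCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      // alignas(8) raises the alignment so that 64-bit loads of a whole
      // label are safe on both host and device.
      struct alignas(8) AlignedMCLabel {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      static_assert(alignof(MCLabel) == 4, "plain struct is 4-byte aligned");
      static_assert(alignof(AlignedMCLabel) == 8, "alignas raises alignment");
      ```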

       

      Other topics:

      • Molex connectors for GPU CI server arrived, still waiting for cables.
      • Removed bogus setting of the ROCm path in LD_LIBRARY_PATH by the ONNXRuntime recipe. Seems to have no side effects.

       

      EPN GPU Topics:

      • Status of power cable problem for dev2 server and MI210?

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
      • Fixed problem of cuda and rocm library detection: https://github.com/alisw/alidist/pull/6093
      • Highlighted a problem with GPU architecture recognition (PR by David)
      • Highlighted a problem with the aliBuild virtual py-env not being used for building (PR by Giulio)

       

      Currently working on dumping the cluster errors in the right place (work ongoing again now, since I was on vacation for a few days).

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Benchmarks table

      About L40S

      Note: L40S uses the AMPERE default parameters

      For real PbPb data:

      Sync mean time default: 2018.66 ms ± 167.23 ms

      Sync mean time optimised: 1300.35 ms ± 92.61 ms

      Performance gain with tuned params: 35.58%

      For sim PbPb data:

      Sync mean time default: 1677.37 ms ± 2.52 ms

      Sync mean time optimised: 1031.51 ms ± 2.85 ms

      Performance gain with tuned params: 38.50%
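      For reference, the quoted gains follow from gain = (default − tuned) / default; a quick sanity check:

      ```cpp
      // Relative speedup in percent: the fraction of the default
      // processing time saved by the tuned parameters.
      double gainPercent(double defaultMs, double tunedMs) {
        return 100.0 * (defaultMs - tunedMs) / defaultMs;
      }
      // gainPercent(2018.66, 1300.35) ≈ 35.58 (real PbPb)
      // gainPercent(1677.37, 1031.51) ≈ 38.50 (sim PbPb)
      ```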

      New GPU parametrization

      • PR is online [link]
      • I will update with the logic of the new GPU detection once it is merged
      • Is it possible to suppress clang for that JSON?
      • MI100 bug?
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Maybe go back from CRTP to classical inheritance
      • Better implementation of iterators to support
        • std::sort
        • thrust::sort
        • our custom sort function
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion
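      A minimal sketch of the AoS-to-SoA transformation underlying this work (the names are illustrative, not the actual O2 or NGT classes):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Array-of-structs: one track's fields are contiguous, which
      // scatters memory accesses when many GPU threads each read the
      // same field of consecutive tracks.
      struct TrackAoS {
        float x, y, z;
      };

      // Struct-of-arrays: each field is stored contiguously, so thread i
      // reading x[i] produces coalesced loads.
      struct TracksSoA {
        std::vector<float> x, y, z;

        std::size_t size() const { return x.size(); }

        void push_back(const TrackAoS& t) {
          x.push_back(t.x);
          y.push_back(t.y);
          z.push_back(t.z);
        }
      };
      ```

      Supporting std::sort / thrust::sort on such a container requires proxy-reference iterators, which is what the iterator work above is about.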
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Dev machine: no news.

      Highly Ionizing Particles

      Quick followup to last week:

      Finished vectorizing kernel (GPU threads load 4-8 charges at once).

      Performance barely changes...

      Also, the old kernel was assuming the wrong memory layout (pad-major instead of time-major).

       

      The pad filter kernel parallelises across pads, so switching to the time-major layout and vectorizing the memory access also reduces the number of threads by the same amount...

       

      Fixed parallelization across rows:

      • Fixed memory access to time-major layout
      • Fixed bank conflicts when accessing shared memory

       

      This increases throughput from 80 GB/s to 128 GB/s, basically identical performance to the current version (130 GB/s).

      TODO: Needs testing on Nvidia.

      Pad filter and cluster finder currently assume 64-byte cache lines; Nvidia moved to 128-byte cache lines a couple of generations ago.
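      To illustrate the layout change discussed above (sizes and names are illustrative, not the real TPC ones): with a time-major layout, threads that sweep over pads at a fixed time bin touch contiguous addresses, which is what lets the vectorized accesses coalesce.

      ```cpp
      #include <cstddef>

      // Illustrative charge-map dimensions.
      constexpr std::size_t kNPads = 152;
      constexpr std::size_t kNTimeBins = 1000;

      // Pad-major: all time bins of one pad are contiguous.
      constexpr std::size_t idxPadMajor(std::size_t pad, std::size_t t) {
        return pad * kNTimeBins + t;
      }

      // Time-major: all pads of one time bin are contiguous, so
      // neighbouring pads at the same time bin share a cache line.
      constexpr std::size_t idxTimeMajor(std::size_t pad, std::size_t t) {
        return t * kNPads + pad;
      }

      static_assert(idxTimeMajor(1, 0) - idxTimeMajor(0, 0) == 1,
                    "adjacent pads are adjacent in memory");
      static_assert(idxPadMajor(1, 0) - idxPadMajor(0, 0) == kNTimeBins,
                    "adjacent pads are a full time column apart");
      ```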

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix:

      Found a TF which cannot be processed on the GPU but survives the CPU code (peak RSS ~15 GB).

      Implemented more memory clearing in https://github.com/AliceO2Group/AliceO2/pull/14999; now the same TF survives on the GPU (peak allocation ~18 GB).

      Note, though, that this comes at the price of more device<->host<->device data movement.

      This should take care of the bulk of the (smallish) difference. Remaining: verify with deterministic mode (and maybe check the impact on timing, although recovery of the TF is more important).

       

      To bring back a thing of the past: in the CI and also in local compilations on EPNs, I always observe warnings of this kind:

      ```

      In file included from /opt/rocm-6.3.2/lib/llvm/lib/clang/18/include/cuda_wrappers/bits/basic_string.h:7:
      In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/basic_string.h:40:
      In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/ext/alloc_traits.h:34:
      /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:272:8: warning: 'destroy<o2::its::TrackITSExt>' is deprecated: use 'allocator_traits::destroy' instead [-Wdeprecated-declarations]
        272 |         { __a.destroy(__p); }
            |               ^
      /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:378:4: note: in instantiation of function template specialization 'std::allocator_traits<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>>::_S_destroy<std::pmr::polymorphic_allocator<o2::its::TrackITSExt>, o2::its::TrackITSExt>' requested here
      

      ```

      Notice the c++/11/... in the paths: these are the libstdc++ headers shipped with the system GCC 11, not the ones from our GCC toolchain.

      This is of course just a deprecation warning, so not actually critical (for now...).

      But David and I had a chat about this a while back and had the following problem:

      Almost pseudo CMake code:

      ```

      set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc")

      message(STATUS "SETTING CMAKE_HIP_HOST_COMPILER to ${CMAKE_HIP_HOST_COMPILER}") 

      enable_language(HIP)

      message(STATUS "HIP language enabled: ${CMAKE_HIP_COMPILER}")

      message(STATUS "COMPILER to ${CMAKE_HIP_HOST_COMPILER}")

      ```

      Output

      ```

      -- SETTING CMAKE_HIP_HOST_COMPILER to /home/fschlepp/alice/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-2/bin/gcc

      -- HIP language enabled: /opt/rocm/llvm/bin/clang++

      -- COMPILER to

      ```

      Somehow, after enable_language, the host compiler is reset. To what? Probably the system compiler.

      It would be nice to repair this somehow.
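      One possible (untested) direction: CMake's compiler-selection variables are only reliably honoured when they are cache entries (or are set before the first project()/enable_language() call); a plain set() can be clobbered by the compiler-detection step. Something along these lines might survive enable_language:

      ```cmake
      # Sketch, not verified: store the host compiler as a cache entry so
      # the HIP compiler-detection step does not reset it.
      set(CMAKE_HIP_HOST_COMPILER "$ENV{GCC_TOOLCHAIN_ROOT}/bin/gcc"
          CACHE FILEPATH "Host compiler used for HIP" FORCE)

      enable_language(HIP)
      message(STATUS "HIP host compiler: ${CMAKE_HIP_HOST_COMPILER}")
      ```

      Alternatively, passing -DCMAKE_HIP_HOST_COMPILER=... on the cmake command line creates the cache entry before any detection runs.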

    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)