Alice Weekly Meeting: Software for Hardware Accelerators

Name: Alice Weekly Meeting: Software for Hardware Accelerators
Start: 2026-05-20T12:45:00+02:00
End: 2026-05-20T14:15:00+02:00
Location: No location set

Wednesday 20 May 2026, 12:45 → 14:15 Europe/Zurich

61230224927

David Rohr

Join via phone

- 12:45 → 13:05
  Discussion 20m
  
  Speaker: David Rohr (CERN)
  Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
  
  Sync reconstruction
  
  Async reconstruction
  
  Need to investigate short GPU stall problem.
  
  Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
  
  Will tune existing 16-core settings, add a SITEARCH for 16core CPU, and 16coreCPU + generic NVIDIA / AMD GPU, like for 8 core.
  
  Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
  
  GPU ROCm / compiler topics:
  
  Problem with building ONNXRuntime with MigraphX support.
  
  Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
  
  Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
  
  Need to check ROCm 7.2 corrtecness.
  
  Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?
  
  TPC / GPU Processing
  
  WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
  
  Need to check the problem with ONNX external memory allocator.
  
  Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
  
  Check for unnecessary f64 instructions in GPU code.
  
  Other topics:
  
  Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
  
  Test at NERSC successfull, first results presented at CHEP and LHCP.
  
  GPU TPC + ITS tracking working nicely on A100 GPU, no backpressure from GPU, but workflow is CPU-bound, as on EPNs.
  
  For some reasons, the GPUs at NERSC perform much slower than on EPNs (even though both are 64 cores), should be investigated.
  
  EPN GPU Topics:
- 13:05 → 13:10
  
  TPC ML Clustering 5m
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- 13:10 → 13:15
  GPU Parameter Optimizations 5m
  
  Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
  Slightly changed strategy for incorrect kernel configurations:
  
  Query GPU constants and limits:
  warp size, max threads / SM, registers / SM, shared mem / SM & block, max blocks / SM
  
  Define parameters range:
  Block size: typically from warp size to 20 * warp size
  Min blocks/SM: sampled as fraction of SM capacity ∈ [0, 1]
  
  Loop:
  
  Sample parameters from Optuna study
  
  RT-compile with sampled parameters; min_blocks_per_sm = 1
  
  Inspect fatbins files to query kernels resource usage:
  
  Function krnl_GPUTPCTrackletSelector: REG:64 STACK:600 SHARED:8976 LOCAL:0 CONSTANT[0]:900 TEXTURE:0 SURFACE:0 SAMPLER:0
  
  Compute per kernel max_blocks_per_sm based on resource constraints
  
  RT-compile with min_blocks_per_sm = sm_fraction × max_blocks_per_sm
  
  Measure tuned step duration and report result to Optuna study
  
  This way ensures always good configurations
  
  Except when block size exceeds limits due to static shared memory dependent on it, example:
  
  class GPUTPCTrackletSelector : public GPUKernelTemplate { public: struct GPUSharedMemory { int32_t mItr0; // index of the first track in the block int32_t mNThreadsTotal; // total n threads int32_t mNTracklets; // n of tracklets int32_t mReserved; // for alignment reasons static_assert(GPUTPCGeometry::NROWS >= GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE); GPUTPCHitId mHits[GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE][GPUCA_GET_THREAD_COUNT(GPUCA_LB_GPUTPCTrackletSelector)]; };
  
  Thus if the first RT compilation fails:
  
  Logs from compilation are inspected
  
  Failing kernels identified
  
  Steps with failing kernels are marked as bad, RT again without those steps
  
  Update per kernel cache with max block size that can be sampled
  
  SUCCESS: Tuner now finds optimal sets within GPU limits
- 13:15 → 13:20
  Efficient Data Structures 5m
  
  Speaker: Dr Oliver Gregor Rietmann (CERN)
  NextGenTrigger Task 1.7
  
  CHEP talk together with Jolly got accepted.
  
  Co-Supervision of a summer student. Topic: Imrpove clustering algorithm CLUE.
  
  Implement NGT SoA Code in O2 standalone benchmark
  
  Working on this fork of the AliceO2 repo, with a CI pipeline:
  
  Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
  
  Extended CI-pipline to fail if GPU.out changes
  
  Implemented SoA in:
  
  SectorTracker
  
  GPUTPCBaseTrackParam
  
  GPUTPCTrackParam
  
  GPUTPCTracklet
  
  GPUTPCTrack
  
  Added GPU benchmarks to the repo of synthetic benchmarks for our SoA library.
  
  Implemented reflections approach in O2 standalone benchmark.
  
  Prepared CHEP 2026 presentation.
  
  Next Steps:
  
  Make better use of SoA to improve performance
  
  Try David's suggestion
- 13:20 → 13:25
  
  Following up GPU to-dos 5m
  
  Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
- 13:25 → 13:30
  TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
  
  Speaker: Felix Weiglhofer (CERN)
  OpenCL
  
  No news.
  
  GPU Servers
  
  CI Server: Sergio finished CI integration.
  
  Looking into setting up workflow to run standalone GPU tool.
  
  Need new job variant ci-jobs? Repo is private.
  
  Highly Ionizing Particles
  
  Biggest changes since last meeting:
  
  Fixed issue where filter missed saturated bins (-> All bins with ADC == 1023 are now cleared)
  
  Added rising edge detection
  
  Added exponential filter for tail cutoff
  
  Cluster creation is now working
  
  Filter Performance
  
  Rough parameter sweep over threshold and exp filter alpha:
  
  Performance increased to 60% - 70% depending on tolerated false positive rate.
  
  Filter performance looks better when checking #ADC instead of #digit cleared. -> Reach +80% of ADCs cleared
  
  Caveats: cluster pad width appears to be a fat tail distribution. Filter currently goes +- 5 pads, but ~10% digits and ADC outside of that window and never seen by filter (=> SOL is 90% for filter performance)
  
  Low alpha (slow update rate): Higher recall, but false positive rate explodes
  
  Cluster Creation
  
  Short example for cluster creation. Also shows failure modes of HIP filter.
  
  MC:
  
  Cluster + MC:
  
  Runtime
  
  Cluster creation currently parallel across rows. Still need some idea how to parallelize across tails, or at least make tail -> cluster matching faster single threaded.
- 13:30 → 13:35
  
  ITS Tracking 5m
  
  Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
- 13:35 → 13:40
  
  System Run Coordination Topics 5m
  
  Speaker: Ernst Hellbar (CERN)

Alice Weekly Meeting: Software for Hardware Accelerators

NextGenTrigger Task 1.7

Implement NGT SoA Code in O2 standalone benchmark

OpenCL

GPU Servers

Highly Ionizing Particles

Filter Performance

Cluster Creation

Runtime