Alice Weekly Meeting: Software for Hardware Accelerators

Name: Alice Weekly Meeting: Software for Hardware Accelerators
Start: 2026-06-17T10:00:00+02:00
End: 2026-06-17T11:30:00+02:00
Location: No location set

Wednesday 17 Jun 2026, 10:00 → 11:30 Europe/Zurich

61230224927

David Rohr

- 10:00 → 10:20
  Discussion 20m
  
  Speaker: David Rohr (CERN)
  Color code: (critical, news from this week: blue, news from last week: purple, no news: black)
  
  Sync reconstruction
  
  Async reconstruction
  
  Need to investigate short GPU stall problem.
  
  Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
  
  Will tune existing 16-core settings, add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
  
  Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.
  
  GPU ROCm / compiler topics:
  
  Problem with building ONNXRuntime with MigraphX support.
  
  Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
  
  Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
  
  Need to check ROCm 7.2 correctness.
  
  Need to understand deterministic mode issue on AMD Pro 9700 reported by Oliver - Status?
  
  TPC / GPU Processing
  
  WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
  
  Need to check the problem with ONNX external memory allocator.
  
  Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
  
  Check for unnecessary f64 instructions in GPU code.
  
  PR with changes for qTot-reading / dEdx code, to support >16 bit range for saturated signals.
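
Two of the items above (alignas() for the label alignment, and the check for unnecessary f64 instructions) can be illustrated with a small sketch. The label structs and the function names are hypothetical stand-ins, not O2 classes, and whether an explicit alignas() is the right fix for O2-5314 is still open as noted above.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a Monte Carlo label record (not the O2 class).
struct PackedLabel {
  std::uint8_t bytes[8];
};

// Same payload, but with an explicit 8-byte alignment requirement.
struct alignas(8) AlignedLabel {
  std::uint8_t bytes[8];
};

static_assert(alignof(PackedLabel) == 1, "a byte array has no alignment requirement");
static_assert(alignof(AlignedLabel) == 8, "alignas raises the requirement to 8 bytes");
static_assert(sizeof(AlignedLabel) == sizeof(PackedLabel), "no padding is added here");

// The literal 0.5 is a double, so x is promoted and the multiply is done in f64.
constexpr auto halfViaDouble(float x) { return x * 0.5; }

// The f suffix keeps the whole expression in f32.
constexpr auto halfViaFloat(float x) { return x * 0.5f; }

static_assert(std::is_same<decltype(halfViaDouble(1.0f)), double>::value, "result type is double");
static_assert(std::is_same<decltype(halfViaFloat(1.0f)), float>::value, "result type is float");
```

Unsuffixed floating-point literals like the one in halfViaDouble are a common source of unintended f64 instructions in GPU code.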
  
  Other topics:
  
  Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
  
  EPN GPU Topics:
  
  Test of MI50-EPN equipped with 1 RTX Pro 6000 quite successful.
  
  GPU temperature saturates at 78°C under full load.
  
  At high temperature, seeing up to 2% degradation in GPU clocks and GPU performance.
  
  1 RTX Pro 6000 can run at full speed in FST shared by both CPU sockets / NUMA domains.
  
  Right now, an EPN with 1 RTX Pro 6000 Max-Q (300W TDP) reaches 5/6 of the performance of one 8-MI50 EPN.
  
  (From CPU side, would need ~280 64-core EPNs, but then GPU performance should be slightly above 8*MI50).
  
  Perhaps 1 RTX Pro Server GPU (600W) might already be enough, 2 * RTX Pro 5000 might be another option. Almost certainly, next gaming GPU generation will be fast enough.
- 10:20 → 10:25
  TPC ML Clustering 5m
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
  Cluster finder
  
  Successful Pb--Pb runs last Friday 🥳
  
  Some preliminary results from cpass0:
  
  16% CTF size reduction
  
  NN improves separation power by ~10%!
  
  DCA distributions maintained
  
  Shared cluster map very similar
  
  NCl/track shows the expected reduction: Approx. 2-3 clusters for long tracks
  
  GPU benchmarking
  
  PR: https://github.com/AliceO2Group/AliceO2/pull/15514
  
  Benchmark single GPU performance across servers (testing on epn000 and dev00 server)
  
  Multithreading the CPU processes for ITS reduces the downtime between gpu-reco processes enough to make the measurement possible
  
  Top: Default reco, Bottom: NN full reco. One TF of LHC24ar. TFLOOP=100
  
  Allowed downtime between gpu processing steps: 50ms
  
  All detectors, default settings
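
One possible reading of the 50 ms allowance, as a minimal sketch: a TF is only accepted if no gap between consecutive GPU steps exceeds the limit, the wall time includes the gaps, and the busy time does not. This is an assumption about the procedure and not taken from the PR. The types Step and TfTiming and the function measureTf are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One GPU processing step of a time frame, times in seconds (hypothetical type).
struct Step {
  double start;
  double end;
};

struct TfTiming {
  bool accepted;   // false if any gap between consecutive steps exceeds maxGap
  double wallTime; // last end minus first start, gaps included
  double busyTime; // sum of the step durations, gaps excluded
};

// Steps must be sorted by start time and must not overlap.
inline TfTiming measureTf(const std::vector<Step>& steps, double maxGap = 0.050)
{
  assert(!steps.empty());
  TfTiming t{true, steps.back().end - steps.front().start, 0.0};
  for (std::size_t i = 0; i < steps.size(); ++i) {
    t.busyTime += steps[i].end - steps[i].start;
    if (i > 0 && steps[i].start - steps[i - 1].end > maxGap) {
      t.accepted = false; // downtime too large, exclude this TF
    }
  }
  return t;
}
```

With this reading, gaps of 80-130 ms would lead to the TF being excluded from the sample.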
  
  epn000
  
  pdp-dev00 (RTX5080), single GPU:
  
  [1] Downtimes too large for default reco (80-130 ms even when setting CPU process threads to 24) -> Best guess-timate
  
  Wall-time mean including allowed gaps: 0.390235 s
  Individual duration sample mean: 0.364311 s
  Individual duration sample sigma: 0.00405494 s
  
  [2]
  
  Wall-time mean including allowed gaps: 0.607306 s
  Individual duration sample mean: 0.587373 s
  Individual duration sample sigma: 0.0101604 s
  
  Summary:
  
  Processing time increase using MI50 NN / default reco: ~1.5
  
  Processing time increase using RTX5080 NN / default reco (to be taken with a grain of salt): ~1.5
  
  -> No improvement (so far) for NN using RTX5080 over MI50 (concerning hardware acceleration)
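
The RTX5080 ratio can be recomputed from the sample means quoted above. This sketch assumes that [1] is the default reco and [2] the NN full reco; the constant names are made up for the illustration.

```cpp
// Sample means in seconds, copied from the pdp-dev00 (RTX5080) numbers above.
// Assumption: [1] is the default reco, [2] is the NN full reco.
constexpr double kDefaultWall = 0.390235; // [1] wall-time mean including allowed gaps
constexpr double kDefaultStep = 0.364311; // [1] individual duration sample mean
constexpr double kNnWall = 0.607306;      // [2] wall-time mean including allowed gaps
constexpr double kNnStep = 0.587373;      // [2] individual duration sample mean

constexpr double ratio(double nn, double def) { return nn / def; }

constexpr double kWallRatio = ratio(kNnWall, kDefaultWall); // about 1.56
constexpr double kStepRatio = ratio(kNnStep, kDefaultStep); // about 1.61

static_assert(kWallRatio > 1.5 && kWallRatio < 1.7, "same ballpark as the ~1.5 quoted");
static_assert(kStepRatio > 1.5 && kStepRatio < 1.7, "same ballpark as the ~1.5 quoted");
```

Both values are somewhat above 1.5, which fits the caveat that the default-reco number on this machine is only a best guess.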
  
  Measurement with "WORKFLOW_DETECTORS": "TPC,CTP"
  
  epn000
  
  pdp-dev00 (RTX5080):
  
  Summary:
  
  Seems much more precise and with fewer losses than using all detectors. Can load the GPU much better (far fewer TFs lost due to exclusions...)
- 10:25 → 10:30
  GPU Parameter Optimizations 5m
  
  Speaker: Gabriele Cimador (CERN, Università and INFN Torino)
  PR with opt params
  
  PR is online
  
  Added two architectures to FindO2GPU.cmake
  
  Hopper
  
  MI300 (GCNA 4.0)
  
  Changed Ampere detection from cc 8.6 to 8.0 (why was it like this?)
  
  if(CUDA_FIRST_TARGET GREATER_EQUAL 120)
    set(CUDA_TARGET BLACKWELL)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 90)
    set(CUDA_TARGET HOPPER)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 89)
    set(CUDA_TARGET ADA)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 80) # was 86 before
    set(CUDA_TARGET AMPERE)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 75)
    set(CUDA_TARGET TURING)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 60)
    set(CUDA_TARGET PASCAL)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 30)
    set(CUDA_TARGET KEPLER)
  elseif(CUDA_FIRST_TARGET GREATER_EQUAL 20)
    set(CUDA_TARGET FERMI)
  else()
    set(CUDA_TARGET TESLA)
  endif()
  
  Best parameters comparison - Nvidia
  
  Kernels in single stream steps
  
  Kernels in multi stream steps
  
  Next steps
  
  Validate against manual tuning
- 10:30 → 10:35
  Efficient Data Structures 5m
  
  Speaker: Dr Oliver Gregor Rietmann (CERN)
  NextGenTrigger Task 1.7
  
  Implement NGT SoA Code in O2 standalone benchmark
  
  Working on this fork of the AliceO2 repo, with a CI pipeline:
  
  Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
  
  Extended CI pipeline to fail if GPU.out changes
  
  Implemented SoA in:
  
  SectorTracker
  
  GPUTPCBaseTrackParam
  
  GPUTPCTrackParam
  
  GPUTPCTracklet
  
  GPUTPCTrack
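
For readers less familiar with the term, the difference between the array-of-structures (AoS) and structure-of-arrays (SoA) layouts can be sketched as below. This is a generic illustration with hypothetical types and does not use the NGT SoA code or the actual GPUTPC classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): the fields of one track sit next to each other.
struct TrackAoS {
  float x;
  float y;
  float z;
};

// Structure of arrays (SoA): each field is stored in its own contiguous array.
struct TracksSoA {
  std::vector<float> x;
  std::vector<float> y;
  std::vector<float> z;
};

// Copies an AoS container into the SoA layout.
inline TracksSoA toSoA(const std::vector<TrackAoS>& tracks)
{
  TracksSoA out;
  out.x.reserve(tracks.size());
  out.y.reserve(tracks.size());
  out.z.reserve(tracks.size());
  for (const TrackAoS& t : tracks) {
    out.x.push_back(t.x);
    out.y.push_back(t.y);
    out.z.push_back(t.z);
  }
  return out;
}

// Reads only the z array, so no memory bandwidth is spent on x and y.
inline float sumZ(const TracksSoA& tracks)
{
  assert(tracks.x.size() == tracks.z.size() && tracks.y.size() == tracks.z.size());
  float sum = 0.f;
  for (float z : tracks.z) {
    sum += z;
  }
  return sum;
}
```

On GPUs the SoA layout typically lets neighbouring threads read neighbouring addresses of the same field, which is where a performance gain would be expected.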
  
  Writing ACAT Proceedings
  
  Next Steps:
  
  Make better use of SoA to improve performance
- 10:40 → 10:45
  TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
  
  Speaker: Felix Weiglhofer (CERN)
  OpenCL
  
  No news.
  
  GPU Servers
  
  CI Server: No update since Sergio left.
  
  Highly Ionizing Particles
  
  MC label forwarding for saturated clusters done
  
  This uncovered a bug where last digit of a sector had wrong labels (fixed now)
  
  Working on CPU version of tail filter now
- 10:45 → 10:50
  ITS Tracking 5m
  
  Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
  Gabriele
  
  Tested new parallel algorithm on CPU, similar to current implementation but with a parallel approach:
  
  For each track, find z component of DCA to beamline
  
  Sort tracks per z value
  
  For each track, count how many tracks fall into a window centered in z (and are time compatible) --> density
  
  For each track, check if it is a local maximum (look at neighbours)
  
  For each local maximum, fit the tracks which fall into the window
  
  For each seed, check if it is not a duplicate with neighboring seeds
  
  Surviving seeds promoted to vertices
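
The steps above can be sketched on the CPU as follows. This is a minimal serial illustration under several assumptions: the z of the DCA to the beamline is already computed per track, the "fit" is replaced by a plain mean of the contributing tracks, and all names (TrackZ, Seed, findSeeds, minContributors) are hypothetical and not taken from the ITS code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TrackZ { // hypothetical input: z of the DCA to the beamline, track time
  float z;
  float time;
};

struct Seed { // hypothetical output: a vertex candidate
  float z;
  float time;
  int nContributors;
};

inline std::vector<Seed> findSeeds(std::vector<TrackZ> tracks, float halfWindow, float maxDt, int minContributors)
{
  assert(halfWindow > 0.f && maxDt >= 0.f);
  // Sort by z so that the window of a track is a contiguous index range.
  std::sort(tracks.begin(), tracks.end(), [](const TrackZ& a, const TrackZ& b) { return a.z < b.z; });
  const std::size_t n = tracks.size();

  // Calls f(j) for every time-compatible track j (including i) in the z window of track i.
  auto forEachNeighbour = [&](std::size_t i, auto&& f) {
    for (std::size_t j = i;; --j) {
      if (tracks[i].z - tracks[j].z > halfWindow) break;
      if (std::fabs(tracks[i].time - tracks[j].time) <= maxDt) f(j);
      if (j == 0) break;
    }
    for (std::size_t j = i + 1; j < n && tracks[j].z - tracks[i].z <= halfWindow; ++j) {
      if (std::fabs(tracks[i].time - tracks[j].time) <= maxDt) f(j);
    }
  };

  // Density: each track only writes its own slot, so this loop could run in parallel.
  std::vector<int> density(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    forEachNeighbour(i, [&](std::size_t) { ++density[i]; });
  }

  std::vector<Seed> seeds;
  for (std::size_t i = 0; i < n; ++i) {
    if (density[i] < minContributors) continue;
    // Local maximum check; ties are resolved in favour of the lower index.
    bool isMax = true;
    forEachNeighbour(i, [&](std::size_t j) {
      if (density[j] > density[i] || (density[j] == density[i] && j < i)) isMax = false;
    });
    if (!isMax) continue;
    // Stand-in for the fit: mean z and time of the contributing tracks.
    double sumZ = 0., sumT = 0.;
    forEachNeighbour(i, [&](std::size_t j) { sumZ += tracks[j].z; sumT += tracks[j].time; });
    const Seed s{static_cast<float>(sumZ / density[i]), static_cast<float>(sumT / density[i]), density[i]};
    // Duplicate check: of two nearby seeds, keep the one with more contributors.
    bool duplicate = false;
    for (Seed& other : seeds) {
      if (std::fabs(s.z - other.z) <= halfWindow && std::fabs(s.time - other.time) <= maxDt) {
        duplicate = true;
        if (s.nContributors > other.nContributors) other = s;
      }
    }
    if (!duplicate) seeds.push_back(s); // surviving seed
  }
  return seeds;
}
```

The seed loop is kept serial here for brevity; in a parallel version the local maximum check and the fit are independent per track, and only the final duplicate removal needs coordination.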
  
  Current version:
  
  [1727121:its-tracker]: [17:12:06][INFO] - Vertex finding: found 132 vertices (total 132) in 50.63 ms
  
  Parallel version (20 threads):
  
  [1725878:its-tracker]: [17:05:17][INFO] - Vertex finding: found 132 vertices (total 132) in 14.35 ms
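
As a quick check of what the two log lines imply, assuming the current version runs the vertex finding on a single thread (this is not stated above) and keeping in mind that these are single measurements:

```cpp
// Timings in milliseconds, copied from the two log lines above.
constexpr double kCurrentMs = 50.63;
constexpr double kParallelMs = 14.35;
constexpr int kThreads = 20;

// Speedup of the parallel version and the resulting parallel efficiency.
constexpr double kSpeedup = kCurrentMs / kParallelMs; // about 3.5
constexpr double kEfficiency = kSpeedup / kThreads;   // about 0.18

static_assert(kSpeedup > 3.5 && kSpeedup < 3.6, "roughly 3.5x faster");
static_assert(kEfficiency > 0.17 && kEfficiency < 0.18, "roughly 18% of ideal scaling");
```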
  
  Preliminary results from 4 simulated Pb-Pb 50kHz TF:
  
  Next steps
  
  Validate with more TFs
  
  Optionally tune parameters
  
  Implement the GPU version
- 10:50 → 10:55
  
  System Run Coordination Topics 5m
  
  Speaker: Ernst Hellbar (CERN)