Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 1
      Discussion
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

      • Crash after receiving corrupt TPC data: fixed. The handling of the bad data itself was correct, but the multi-threaded pipeline did not handle the error correctly, so the next TF ran into an error.

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboots with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code.
      • Miscompilation / internal compiler error fixed in the new clang for ROCm 7.x; the SDMA engine synchronization bug is still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • AMD is changing their support structure: we are to file reports via GitHub (which I like, since it is also better traceable). However, they will no longer assign an engineer to follow up on all our issues; instead, issues have to go through their normal support process. In particular, this means their compiler team might not be able to fix issues if we do not provide a minimal reproducer.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and the compatibility layer; it currently fails in the CI. Once fixed, Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • New safer, dynamic cluster protection working and deployed at P2.

       

      Other topics:

      • GRID memory monitoring: discussed with Maksim; the problem with incorrect values from smaps is already fixed, since they switched to cgroup-based monitoring.

       

      EPN GPU Topics:

       

    • 2
      TPC ML Clustering
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 3
      GPU Parameter Optimizations
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      No news

      ToDo list:

      • General report on various GPUs
      • CSV parameters
    • 4
      Efficient Data Structures
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Changed the SoA / AoS code to better fit ALICE O2
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Performance is still the same (or maybe 2% slower)
      • Next Steps:
        • Check if AoS has no overhead due to the new abstraction
        • Make better use of SoA to improve performance
    • 5
      Following up GPU to-dos
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 6
      TPC Clusterization / OpenCL / Highly Ionizing Particles
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      GPU Servers

      Waiting for parts.

      OpenCL

      No news.

      Highly Ionizing Particle

      Current version:

      • A single warp streams neighboring pads within a cache line
      • rocprof: 65% Memory Unit utilization (throughput: 130 GB/s ???)

       

      New version:

      • 576 threads to read full row (140 pads * 8 timebins)
      • Most rows smaller than 140 pads -> wasted threads
      • Memory Unit utilization drops to 36% 
      • Slower by a factor of 1.57 (close to the utilization ratio 65 / 36 ≈ 1.8)
    • 7
      ITS Tracking
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: ITS GPU tests pending, since no EPNs are available right now; the first batch of jobs was successful, but there are not enough statistics yet to conclude whether the memory problem is resolved. Also adapted Ruben's refit/reseeding for the GPU part.

      Gabriele: no news. To-do: try the Gaussian vertexer using partial ITS tracks instead of lines.

    • 8
      System Run Coordination Topics
      Speaker: Ernst Hellbar (CERN)