Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 → 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      GPU Benchmarks in HS23 Contribution from ALICE

      • Had a meeting two weeks ago; Gabriele will report on the status

       

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate the short GPU stall problem.
      • The limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • We can now set the GPU architectures to build for in the environment variable field of Jenkins builds.
      • Managed to run the o2-gpu-standalone-benchmark from an async build on CVMFS in the default GRID job container on the NERSC Perlmutter site, running on their A100 GPUs.
      • Have a new build (finished this morning after all the build issues) with async reco settings to be enabled via JDL, to be tested at NERSC. Then we can try to run an async reco on GPUs there.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support, to be checked.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • LLVM bump to 20.1: status?
      • ROCm 6.4.1 status:
        • AMD is checking the reproducer. I have some ideas for narrowing down where it miscompiles, using different compile flags in per-kernel mode.
      • New problem with ROCm 6.5 / 7.0 after bumping clang: the new clang encounters an internal compiler error processing our code...
      • Improved the standalone-benchmark CI; the RTC test for CUDA can now also run with no GPU installed.
      • Updated alidist/gpu-system to be build_requires only, then reverted that since it broke the dailies. Now generating a dummy modulefile (even if not used), as requested by Giulio.
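
      As a reference for the symbol-leak item above, a minimal sketch of the pattern (the macro and function names are hypothetical, not the actual O2 ones). When the same header is compiled for both host and device, the host pass sees a plain inline function and can emit a weak symbol for it in every translation unit; internal linkage is one way to keep it from escaping:

        // Hypothetical macro mirroring a shared host/device header.
        #if defined(__CUDACC__) || defined(__HIPCC__)
        #define GPU_DEVICE_INLINE __device__ inline
        #else
        #define GPU_DEVICE_INLINE inline
        #endif

        // Intended as device-only, but the host pass still compiles it as a
        // plain inline function and may emit a weak symbol per translation unit.
        GPU_DEVICE_INLINE float clampWeight(float w) { return w < 0.f ? 0.f : w; }

        // One possible mitigation: force internal linkage (static or an
        // anonymous namespace) so no symbol escapes the translation unit.
        namespace {
        GPU_DEVICE_INLINE float clampWeightLocal(float w) { return w < 0.f ? 0.f : w; }
        } // namespace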

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix the bogus TPC transformations for good; then we can revert the workaround.
      • Waiting for TPC to check the PR that uses full cluster errors, including average charge and occupancy-map errors, during seeding.
      • Final solution: merging the transformation maps on the fly into a single flat object. A draft version by Sergey exists but is still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to a confirmed bug in clang; a fix is being prepared. This prevents further debugging for now.
        • Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf needs to be fixed first to debug it.
        • Even with optimization disabled, it crashes later in the TPC merging; printf is needed to debug.
      • Next high-priority topic: improvements for cluster sharing and cluster attachment at lower TPC pad rows.
      • Need to check the problem with the ONNX external memory allocator. Status?
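
      For context on the alignas() item above, a minimal sketch of the idea (the field layout is hypothetical, not the actual O2 label type): pinning the alignment at the type level lets static_assert catch layouts where labels packed into a raw byte buffer would be read misaligned on the GPU:

        #include <cstdint>

        // Hypothetical label type: alignas() pins the alignment so elements
        // placed in a raw byte buffer stay readable without misaligned access.
        struct alignas(4) MCLabel {
          int32_t trackID;
          uint16_t eventID;
          uint16_t sourceID;
        };

        static_assert(alignof(MCLabel) == 4, "unexpected alignment");
        static_assert(sizeof(MCLabel) % alignof(MCLabel) == 0, "array stride breaks alignment");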

       

      EPN GPU Topics:

       

    • 10:20 → 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 10:25 → 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (Università e INFN Torino (TO))
    • 10:30 → 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Summer Student Milla Bramsted

      • She is benchmarking SoA code on GPUs (a sketch of the pattern follows this list)
      • We will track her project in this Google doc
      • She is adding CUDA kernels to this repo
      • A first kernel was integrated into the benchmarking framework
      • Next steps:
        • Add more kernels
        • Re-add the other NGT SoA approaches (they were temporarily removed)
        • Re-organize the code
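
      As a rough illustration of what such a benchmark kernel compares (the data layout and kernels are hypothetical, not the code in the repo): with SoA, neighboring threads read neighboring floats, so the loads coalesce, while the AoS layout strides over whole structs:

        #include <cuda_runtime.h>

        struct HitAoS { float x, y, z, q; };   // array-of-structs element

        struct HitsSoA {                       // struct-of-arrays view
          float *x, *y, *z, *q;
        };

        __global__ void scaleChargeAoS(HitAoS* hits, int n, float s) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) hits[i].q *= s;           // 16-byte stride between neighboring threads
        }

        __global__ void scaleChargeSoA(HitsSoA hits, int n, float s) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) hits.q[i] *= s;           // neighboring threads touch adjacent floats
        }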

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs (H100)
      • Uses the builds of O2 and its dependencies in /cvmfs/alice.cern.ch/ (the pipeline takes about 7 minutes)
      • Next steps:
        • Run on the different GPUs that are now available; see ngt-resources for details:
            • Nvidia: H100 188GB NVL (12x), H100 80GB SXM (6x), L40S 48GB (7x)
            • AMD: Instinct MI300X 192GB (2x), Radeon Pro W7900 48GB (6x)
        • Get the full CI to work
      • Remark: Most NGT GPUs are busy these days

      Implement NGT SoA Code in the O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified and optimized the SoA code in the last few weeks
      • Everything is running and we have identified the classes to apply our SoA code to
      • Next steps:
        • Implement the SoA code in those classes
    • 10:35 → 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 → 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      Common issues

      • printf in OpenCL: still broken as of Clang 20.1
      • C++20 support:
        • No one actively working on it
        • But also not opposed to it
        • If we want it, we would have to do it ourselves (the post contains some pointers to start)
      • PoCL improved error on recursion: merged

      ZS Decoder

      Two issues:

      • Corruption in the padding region by the kernel
      • The kernel marks a single charge as invalid

       

      -> Seems to be somehow related to shared memory (see the sketch below):

      • Zeroing shared memory at kernel start fixes the invalid charge position, but the charge is still output in the wrong pad
      • It doesn't fix the padding corruption
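
      For illustration, a minimal sketch of the zeroing experiment (CUDA syntax and a hypothetical buffer size; the actual decoder is OpenCL): shared memory is not cleared between blocks, so stale contents left behind by a previous block on the same compute unit can masquerade as decoded data.

        __global__ void decodeZS(/* ... decoder arguments ... */) {
          __shared__ unsigned int buf[256];

          // Shared memory starts with whatever the previous block left behind;
          // zero it cooperatively before the decoder reads from it.
          for (int i = threadIdx.x; i < 256; i += blockDim.x) {
            buf[i] = 0;
          }
          __syncthreads();

          // ... actual ZS decoding using buf ...
        }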

       

      Currently working on a standalone reproducer for the decoder kernel

    • 10:45 → 10:50
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
    • 10:50 → 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbär (CERN)