Alice Weekly Meeting: Software for Hardware Accelerators

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: critical, news from this week: blue, news from last week: purple, no news: black

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for both CUDA and ROCm.
      • Try to find a better solution for the problem of __device__ inline functions leaking symbols into the host code (see the sketch after this list).
      • LLVM Bump to 20.1: status?
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • ROCm 7 with full serialization passes validation in deterministic mode. Received instructions from AMD on how to build the current compiler so we can check the performance.
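
      A minimal sketch (hypothetical macro and function names, not the actual O2 definitions) of one way the __device__ inline symbol leak mentioned above can arise: on the host compilation pass the device attribute compiles away, and the function body is then emitted as a weak host symbol in every translation unit that includes the header.

        // shared_header.h -- hypothetical names sketching the leak mechanism
        #if defined(__CUDACC__) || defined(__HIPCC__)
        #define GPU_DEVICE_INLINE __device__ inline // device compilation pass
        #else
        #define GPU_DEVICE_INLINE inline // host pass: the attribute is gone, but the
                                         // body is still emitted as a weak host
                                         // symbol in every including translation unit
        #endif

        GPU_DEVICE_INLINE float squared(float x) { return x * x; }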

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (see the sketch after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix the bogus TPC transformations for good; then we can revert the workaround.
      • Waiting for TPC to check the PR that uses full cluster errors, including average-charge and occupancy-map errors, during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object. A draft version by Sergey exists but is still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to a confirmed bug in clang; a fix is being prepared. This prevents further debugging for now.
        • Crash in the merger, which can be worked around by disabling the clang SPIR-V optimization. Probably a bug in clang, but printf must be fixed first before this can be debugged.
        • Even with the optimization disabled, it crashes later in TPC merging; printf is needed to debug this as well.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • Probably a bug in the multi-threaded pipeline when timeframes do not arrive in order. Cannot reproduce it: in my test with non-monotonic timeframe order, the pipeline works well. Not following up further for now.
      • Bug in reading of MC data, temporary fix by Ruben applied. Implemented a proper fix.
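
      A minimal sketch, with a hypothetical stand-in type, of the alignas() approach from the first item of this list: enforce the required alignment in the type itself instead of relying on the surrounding buffer layout.

        #include <cstdint>

        // Hypothetical stand-in for the MC label type; alignas() guarantees
        // that elements, and arrays of them, start on a 4-byte boundary.
        struct alignas(4) MCLabelSketch {
          int32_t trackID;
          int16_t eventID;
          uint8_t sourceID; // trailing 1-byte member; alignas keeps sizeof a
                            // multiple of 4, so consecutive elements stay aligned
        };

        static_assert(alignof(MCLabelSketch) == 4 && sizeof(MCLabelSketch) % 4 == 0,
                      "layout assumed by the (hypothetical) GPU-side reader");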

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Mostly thesis writing now, but I have some questions / points for discussion:

      ITS-TPC matching efficiency

      Context:

      • TPC clusters and tracks are created with given algorithmic settings from the plot label
      • ITS tracks are kept the same -> They come from the previous reconstruction

       

      -> For high-pT (close to the center of the plot), the matching efficiency is systematically higher for the algorithms with the GPU CF regression.
      Since these are primaries, I would expect most high-pT particles to be at central eta. But against eta I see a reversed trend: there the matching efficiencies are the same as or only slightly higher than with the GPU CF (see below).

      I then tested if this is a matter of the threshold setting, but it is not:

      This is puzzling to me. Two options come to my mind:

      1. Is the ITS-TPC alignment done with TPC tracks or with cosmics?
      2. Can the ITS tracks change with a different TPC reconstruction (i.e. different number of TPC tracks, different track parameters)?

       

      As a ratio to the TPC tracks, this efficiency improves with the new algorithm.

       

      ---------------

      And since I have never actually shown it in this meeting: The width estimation of qTot using the different algorithms

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Still fighting with the launch-bounds fix (a generic sketch of what the launch-bounds parameters control follows this list):

      • Need to update dumpGPUDefParam:

        void dumpGPUDefParam(const char* outputfile = "parameters.out")
        {
          // Load the current default parameters and print them for inspection.
          auto param = o2::gpu::internal::GPUDefParametersLoad();
          printf("Loaded params:\n%s\nWriting them to %s\n", o2::gpu::internal::GPUDefParametersExport(param, false).c_str(), outputfile);
          // Raw binary dump of the parameter struct.
          FILE* fp = fopen(outputfile, "w+b");
          if (fp != nullptr) {
            fwrite(&param, 1, sizeof(param), fp);
            fclose(fp);
          }
        }

        Proposed solution: define the WARP size in the header file, add minBlockFactor to the call to GPUDefParametersExport, and find a way to dump that export.

      • RTC fails with @David's patch:
         29797 | extern "C" {__attribute__((global)) void GPUCA_KRNL_REG_DEFAULT((GPUCA_LB_GPUTPCNeighboursFinder)) krnl_GPUTPCNeighboursFinder( int32_t _iSector_internal ) { __attribute__((shared)) typename GPUTPCNeighboursFinder::GPUSharedMemory smem; GPUTPCNeighboursFinder::template Thread<GPUTPCNeighboursFinder::defaultKernel>((gridDim.x), (blockDim.x), (blockIdx.x), (threadIdx.x), smem, GPUTPCNeighboursFinder::Processor((gGPUConstantMemBuffer.v))[_iSector_internal] ); }}

        The macro fails to expand; I suspect it is not defined at that point and am investigating.
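
      For context, a generic CUDA/HIP sketch of what the launch-bounds parameters control: the cap on threads per block and the minimum number of resident blocks per multiprocessor, which the compiler uses to budget registers. This is an assumption about the shape of what GPUCA_KRNL_REG / GPUCA_LB_* expand to, not their actual definitions.

        #include <cstdint>

        // Generic CUDA/HIP launch-bounds usage; the kernel name is hypothetical.
        __global__ void __launch_bounds__(256 /* max threads per block */,
                                          2 /* min resident blocks per SM/CU */)
          exampleKernel(float* data, int32_t n)
        {
          int32_t i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
            data[i] *= 2.f; // register usage is capped so that 2 blocks of up
                            // to 256 threads can be resident at the same time
          }
        }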

       

      ALICE contribution to HS23

      • Completed build.sh. To build, just run
        docker run -v /cvmfs:/cvmfs alice_gpu_hs23 --backend HIP --arch gfx908
      • Need to test build for multiple archs
      • Warnings:

        CMake Warning (dev) at /opt/rocm/lib/cmake/hip/hip-config-amd.cmake:91 (message):
           AMDGPU_TARGETS was not set, and system GPU detection was unsuccsesful.

           The amdgpu-arch tool failed:
           Error: 'Failed to get device count'
           Output: ''

           As a result, --offload-arch will not be set for subsuqent
           compilations, and the default architecture
           (gfx906 for dynamic build / gfx942 for static build) will be used

        Call Stack (most recent call first):
          /opt/rocm/lib/cmake/hip/hip-config.cmake:149 (include)
          /alice_hs23/O2/dependencies/FindO2GPU.cmake:258 (find_package)
          CMakeLists.txt:130 (find_package)
        This warning is for project developers.  Use -Wno-dev to suppress it.

        But then:

        -- Building GPUTracking with HIP support (GPU Target gfx908)
        -- Using optimized HIP settings for MI100 GPU

        And --offload-arch=gfx908 does appear in the actual compilation commands

      • First draft of run.sh will use a dataset within the container
      • When Robin is back from vacation, we will test on their machines
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Next steps
        • Generate optimized .par files with Gabriele
      • Possible next steps
        • Add the new architectures to the ones hardcoded in O2?
        • Store results in CSV format and plot them?
        • Display the plot in the GitHub web GUI?

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code (a minimal sketch of the idea follows this list)
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker, no problems until now
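
      A minimal sketch, with hypothetical names (C++17), of the AoS-SoA wrapper idea referenced above: client code accesses members through one interface while the storage layout is selected by a template parameter.

        #include <cstddef>
        #include <vector>

        enum class Layout { AoS, SoA };

        template <Layout L>
        struct TrackStore; // storage layout selected at compile time

        template <>
        struct TrackStore<Layout::AoS> {
          struct Track { float x, y, z; };
          std::vector<Track> tracks;
          float& x(std::size_t i) { return tracks[i].x; } // same accessor...
        };

        template <>
        struct TrackStore<Layout::SoA> {
          std::vector<float> xs, ys, zs;
          float& x(std::size_t i) { return xs[i]; } // ...independent of layout
        };

        // Client code is layout-agnostic: the same loop works for both layouts.
        template <Layout L>
        void shiftX(TrackStore<L>& store, std::size_t n, float dx)
        {
          for (std::size_t i = 0; i < n; ++i) {
            store.x(i) += dx;
          }
        }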

      Preparation for ACAT Conference (8. - 12. September)

      • Presenting our joint work on AoS/SoA together with Jolly
        • Jolly: SoA using reflection (C++26)
        • Oliver: SoA using templates (C++17)
      • We will present benchmark results comparing the different AoS/SoA implementations
      • Fixed some issues with an experimental compiler that supports reflection
      • Problems:
        • Example where AoS is faster than SoA on CPU? (one candidate case is sketched after this list)
        • Sometimes loops over SoA are not vectorized with the new experimental reflection compiler
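
      On the first open question above, one candidate case (an assumption for illustration, not a measured result): random indexed access that touches all fields of an element tends to favor AoS, because each element sits in a single cache line while the SoA version touches one line per field.

        #include <cstddef>
        #include <vector>

        struct P { float x, y, z, w; }; // 16 bytes: whole element in one cache line

        float sumRandomAoS(const std::vector<P>& v, const std::vector<std::size_t>& idx)
        {
          float s = 0.f;
          for (std::size_t i : idx) { // random order defeats hardware prefetching
            s += v[i].x + v[i].y + v[i].z + v[i].w; // one cache line per element
          }
          return s;
        }

        float sumRandomSoA(const std::vector<float>& x, const std::vector<float>& y,
                           const std::vector<float>& z, const std::vector<float>& w,
                           const std::vector<std::size_t>& idx)
        {
          float s = 0.f;
          for (std::size_t i : idx) { // the same accesses now hit 4 distinct lines
            s += x[i] + y[i] + z[i] + w[i];
          }
          return s;
        }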
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      Resolved issues:

      • crash when compiling stream compaction kernel
        • Fixed upstream
        • Fun use case for LLMs: struggled for multiple days to reproduce this in a standalone kernel. Gave Claude Opus the kernel source, the disassembled SPIR-V, and the source of the PoCL function that crashes -> a few seconds later I had a working reproducer...

       

      Open issues with PoCL:

      • crash when compiling ZS decoder:
        • Submitted reproducer, investigated by PoCL devs
      • Memory corruption in ZS decoder
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Still at the HCP school, then on vacation for 2 weeks; no updates except for what Gabriele will report. (-Felix)

      ITS-GPU-tracker parameter tuning (Gabriele)

      • Adjusted the tuner for the ITS tracking
      • Need to evaluate the results
      • Objective function: output from the ITS tracker (no profiler)
        • Seems a bit unreliable; need to investigate whether the profiler is needed
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)