Alice Weekly Meeting: Software for Hardware Accelerators

Name: Alice Weekly Meeting: Software for Hardware Accelerators
Start: 2025-07-30T10:00:00+02:00
End: 2025-07-30T11:00:00+02:00
Location: No location set

Wednesday 30 Jul 2025, 10:00 → 11:00 Europe/Zurich

61230224927

David Rohr

Join via phone

- 10:00 → 10:20
  Discussion 20m
  
  Speaker: David Rohr (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  Sync reconstruction
  
  Async reconstruction
  
  Need to investigate short GPU stall problem.
  
  Limiting factor for pp workflow is now the TPC time series, which is to slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
  
  Async reco test (using some 2024 Pb-Pb setup I copied from the GRID) were successful.
  
  Improved O2DPG scripts and added ALIEN_JDL_SITEARCH setting to select between MI50, MI100, NERSC (so we do not need custom run scripts per site). Test jobs with new O2DPG are failing, need to investigate why.
  
  For tests at NERSC, GRID people need to set up a queue such that we can submit GRID jobs that request 64 cores and GPU access.
  
  GPU ROCm / compiler topics:
  
  Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
  
  Problem with building ONNXRuntime with MigraphX support.
  
  Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
  
  Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
  
  LLVM Bump to 20.1: status?
  
  ROCm 6.4.1 status:
  
  AMD is checking the reproducer. I have some idea how to narrow down where it miscompiles using different compile flags in per-kernel mode.
  
  New problem with ROCm 6.5 / 7.0 after bumping clang: New clang encounters internal compiler error processing our code...
  
  ROCm 7.0 introduces some non-backward-compatible changes. Might need code changes in O2, and synchronized bump of ROCm on EPNs, in build and CI container, and in O2/dev.
  
  TPC / GPU Processing
  
  WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
  
  Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
  
  Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
  
  Final solution: merging transformation maps on the fly into a single flat object: Draft version by Sergey exists but still WIP.
  
  Matthias Kleiner might look into commissioning this.
  
  Pending OpenCL2 issues:
  
  printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
  
  Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
  
  Also with optimization disabled, crashing later in TPC merging, need printf to debug.
  
  Need to check the problem with ONNX external memory allocator.
  
  Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
  
  EPN GPU Topics:
- 10:20 → 10:25
  TPC ML Clustering 5m
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
  debugLevel0
  
  debugLevel0_16TF
  
  Total_Wall_Time_heatmap.pdf
  
  TPC_Cluster_Finding_heatmap.pdf
  
  TPC_Sector_Tracking_heatmap.pdf
  
  debugLevel1
  
  GPUTPCNNClusterizerKernels_determineClass1Labels_heatmap.pdf
  
  GPUTPCNNClusterizerKernels_fillInputNNGPU_heatmap.pdf
  
  GPUTPCNNClusterizerKernels_publishDeconvolutionFlags_heatmap.pdf
  
  GPUTPCNNClusterizer_ONNXClassification_heatmap.pdf
  
  debugLevel1_16TF
  
  GPUTPCNNClusterizerKernels_determineClass1Labels_heatmap.pdf
  
  GPUTPCNNClusterizerKernels_fillInputNNGPU_heatmap.pdf
  
  GPUTPCNNClusterizerKernels_publishDeconvolutionFlags_heatmap.pdf
  
  GPUTPCNNClusterizer_ONNXClassification_heatmap.pdf
  Framework
  
  ONNX: Non-deterministic compute on GPU
  
  More global issue. Tried all variations, but even in float32 using
  
  (mPImplOrt->sessionOptions).AddConfigEntry("session_options.use_deterministic_compute", "1");
  the compute was not deterministic.
  
  Investigation in GitHub threads and forums shows that this is indeed not guaranteed by ONNX (or any other framework), not even during training!
  
  SetDeterministicCompute: "... If set to true, this will enable deterministic compute for GPU kernels where possible. ..."
  
  Variations (using only the classification network with fixed threshold):
  
  L2, N16: 276782342 \pm 105 ~ 3.8x10e-5% deviation
  
  L5, N128: 265601210 \pm 111 ~ 4.2x10e-5% deviation
  
  GPU timing test
  
  (see debugLevel=0 folder) Classification network + CF regression
  
  -> Larger batch size is better, but also: Almost no difference in wall time from L2_N16 to L5_N32 for almost any batchsize -> L5_N32 will be the network of choice
  
  (see debugLevel=1 folder) Classification network + CF regression
  
  -> Larger batch size is better. Increase from L2_N16 to L5_N32 is factor 1.4 but typically largest jumps noticeable from 32 to 64 neurons per layer (potentially connected to cross-warp computations?)
  
  Memory consumption increases with larger batchsize: Input = BS x sizeof(float16) x (e.g. 3x9x9 + 3) x #streams + ONNX. For batchsize of 262144, 3 streams: Input ~387MB + ONNX
  
  Physics
  
  Did the QA on the ideal clusters again now compared to ~1.5 years ago
  
  Changes in between:
  
  Added looper tagging
  
  Changed peak finder to the peak finder of the GPU cluster finder (for training data generation)
  
  Using native clusters from reconstruction, not the clusters generate in the QA macro
  
  Using simulation which enforces 0-5% centrality, so efficiencies go down a bit, fake-rates go up a bit
  
  Network size = classification network size, Regression algorithm = GPU CF regression
  
  Raw efficiency without loopers: How many clusters have been attached to at least one ideal (MC) cluster, removing all (local) regions in loopers
  
  NN can never attach more clusters than GPU CF because it won't produce more clusters (it can only reject)
  
  Another possibility: #(attached ideal clusters, not in looper tagged region) / #reconstructed clusters
  
  Can be artificially inflated with number of loopers, but in all cases network should reject looping clusters better, so efficiency should go up
  
  To-Do
  
  Create standalone reproducer for ONNX non-determinism
  
  Create Efficiency-Fake plots with regression network
- 10:25 → 10:30
  GPU Parameter Optimizations 5m
  
  Speaker: Gabriele Cimador (Universita e INFN Torino (TO))
  News from GPU parameter tuning
  
  Used GPUCA_KERNEL_RESOURCE_USAGE_VERBOSE 1
  
  For 750kHz pp dataset, tuner selected as best configuration block_size = 512, grid_size = 540 for GMMergerCollect
  
  This translated to 512, 9 on the parameter header --> try to fit 9 blocks of 512 threads each on each CU
  
  What GPUCA_KERNEL_RESOURCE_USAGE_VERBOSE said: Occupancy [waves/SIMD]: 9
  
  On MI50: 4 SIMD per CU
  
  Number of threads residing on the GPU at the same time: 64 threads per wave * 9 waves per SIMD * 4 SIMD per CU * 60 CUs in a MI50 = 64 * 9 * 4 * 60 = 138 240
  
  Number of requested threads: 512 threads per block * 9 blocks per CU * 60 CUs in a MI50 = 276 480
  
  Actual threads / Requested threads = 138 240 / 276 480 = 0.5
  
  So half grid can reside on the GPU
  
  HS23 Contribution from ALICE
  
  Chatted with Robin yesterday. They are prone to use the CI container directly.
  
  CUDA compatibility issue
  
  From Nvidia docs:
  
  CUDA Compatibility guarantees allow for upgrading only certain components.
  
  Backwards compatibility ensures that a newer NVIDIA driver can be used with an older CUDA Toolkit.
  
  Minor version and forward compatibility ensure that an older NVIDIA driver can be used with a newer CUDA Toolkit (until certain version).
  
  FAQ: Does CUDA compatibility work with containers? Yes, when using containers that are based on the official CUDA base images.
  
  What's next
  
  Write script to run the benchmark within the CI container
  
  Test this POC on one of their GPUs:
  
  GPU model Driver version
  
  Tesla T4 575.57.08 (CUDA Version: 12.9)
  
  A100 575.57.08 (CUDA Version: 12.9)
  
  V100S 560.35.05 (CUDA Version: 12.6)
  
  Question
  
  Differences between sync and async run of the benchmark, except from RTC and compression/decompression?
- 10:30 → 10:35
  Efficient Data Structures 5m
  
  Speaker: Dr Oliver Gregor Rietmann (CERN)
  Summer Student Milla Bramsted
  
  She is working on benchmarking SoA code on GPUs
  
  We will track her project in this google doc
  
  She is adding CUDA kernels to this repo
  
  Three CUDA kernels are now running in our benchmark framework
  
  Next steps:
  
  Run the same kernels for different data layouts
  
  Add the other NGT SoA approaches again (were temporarily removed)
  
  ALICE O2 CI-Pipelines on NGT Cluster
  
  A fork of the AliceO2 repo is not in the NextGenTrigggers (NGT) GitHub organization
  
  It has a GitHub action running the standalone benchmark on NGT GPUs (H100)
  
  Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
  
  Different GPUs are tested in parallel on different VMs
  
  O2 standalone benchmarks now works on the following ngt-resources:
  
  Nvidia H100 188GB NVL
  
  AMD Instinct MI300X
  
  AMD Radeon Pro W7900: Works, but is a bit hacky, need to improve
  
  Nvidia L40S: Works fine, but needs to be added to the CI
  
  Next steps:
  
  Find a better solution for W7900
  
  Add L40S to CI (trivial)
  
  Avoid some code repetition in the github workflow file
  
  Implement NGT SoA Code in O2 standalone benchmark
  
  Working on this fork of the AliceO2 repo
  
  Simplified and optimized the SoA code in the last few weeks
  
  Everything is running and we have identified the classes apply our SoA code to
  
  Next steps:
  
  Discuss these classes with David
  
  Implement the SoA code on those classes
- 10:35 → 10:40
  
  Following up GPU to-dos 5m
  
  Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
- 10:40 → 10:45
  
  TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
  
  Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
  
  OpenCL
  
  Clang printf
  
  Upstreamed fix to LLVM: https://github.com/llvm/llvm-project/pull/150210
  
  -> Merged into main branch (LLVM 22) and backported to LLVM 21
  
  -> Branch for LLVM 20 closed, no more releases planned
  
  PoCL
  
  > Sometimes statements like if (iThread == 0) printf("..."); are printed by all threads. Can be fixed with a barrier after the print.
  
  Bug report: https://github.com/pocl/pocl/issues/1977
  
  Bug confirmend, but PoCL devs are rewriting their loop vectorizer.
  
  New loop vectorizer: Fixes this particular issue.
  
  However, full decoder kernel runs into endless loop / deadlock with it, so can't test original issues yet.
  
  Working on reproducer for deadlock now.
- 10:45 → 10:50
  
  ITS Tracking 5m
  
  Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)
  
  TODOs (help welcome)
  - validate new features if they also work on GPU
  - deltaROF tracking currently not working have to check why and fix
  - otherwise everything used in production should be there
  - porting Vertexing code to GPU (in progress)
  - optimize GPU part:
  - `started` using multiple streams very much in its infancy
  - optimise load times/processing etc. (probably [not measured] parts could be improved)
  - optimise number of blocks/threads per kernel (for Pb-Pb + pp sep., fairly small number of kernels to be optimised)
  - Aranged meeting with Gabriele (after his vacation)
  - How difficult is it to also set up RTC for ITS kernels?
  - Have a CI procedure/periodic check that runs ITS GPU reconstruction and then gets some automatic checks and numbers (deterministic mode is more important)
  - Ensure gpu-reco-workflow with ITS enabled are the same that one gets from its-reco-workflow --gpu
  - Erroneous assumption in reproducer, for ITS cluster squashing (merging of clusters across readout-frames) has been disabled online, effectively disabling perfect compression and increasing storage requirements, in async prod. squashing is enabled though (of course)
  - added env var to reclusterize in dpl-workflow.sh
  - could test for pp processed 1000 TFs with TPC, ITS successfully
  - PbPb might be problematic, need to understand when GPU memory is freed using the framework allocator
  - saw an increase in dropped TFs (dropped meaning ITS code exceeded avail. memory, caught the std::bad_alloc and outputted 0 tracks, not crashing processing) using TPC, ITS GPU in high int. rate (40kHz) Pb-Pb
  
  (request by David&Vasco for some ALICE3 hardware estimates with current algorithm, we have some ideas but need to refine a bit)
- 10:50 → 10:55
  
  System Run Coordination Topics 5m
  
  Speaker: Ernst Hellbar (CERN)