Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
      • Test of new O2DPG was set up incorrectly, now seems to work. Catalin wants to do some extra checks before merging.
      • For tests at NERSC, GRID people need to set up a queue such that we can submit GRID jobs that request 64 cores and GPU access.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • LLVM Bump to 20.1: status?
      • ROCm 6.4.1 status:
        • AMD is checking the reproducer. I have some idea how to narrow down where it miscompiles using different compile flags in per-kernel mode.
      • New problem with ROCm 6.5 / 7.0 after bumping clang: the new clang encounters an internal compiler error when processing our code...
      • ROCm 7.0 introduces some non-backward-compatible changes. Might need code changes in O2, and synchronized bump of ROCm on EPNs, in build and CI container, and in O2/dev.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (a minimal sketch follows after this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object: Draft version by Sergey exists but still WIP.
      • Pending OpenCL2 issues:
        • printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
        • Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
        • Also with optimization disabled, crashing later in TPC merging, need printf to debug.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. WIP PR: https://github.com/AliceO2Group/AliceO2/pull/14542
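
      A minimal sketch of the alignas() idea from the first item above (hypothetical struct, not the actual O2 MC label class):

      #include <cstdint>

      // Hypothetical label type, only to illustrate the alignas() approach;
      // the real O2 MC label class and its required alignment may differ.
      struct alignas(8) MCLabelSketch {
        int32_t trackID;
        int16_t eventID;
        int16_t sourceID;
      };

      static_assert(alignof(MCLabelSketch) == 8, "labels are 8-byte aligned");
      static_assert(sizeof(MCLabelSketch) == 8, "no padding surprises");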

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Flag training

      • Network finds a correlation, but not very convincing yet

      • 2D Conv net performs significantly better

       

      Sim data

      • Questions:
        • When running the full reco workflow dumped in reco_NOGPU.log, I get more tracks (approx. 7-8%) than when running with the standalone o2-tpc-reco-workflow
        • I currently run with the options from the reco_NOGPU.log, in particular for SC distortions:
          • --lumi-type 1
          • For real data I use (confirmed by Alex): 
            --lumi-type 2 --corrmap-lumi-mode 1
        • I found this statement in one of the MC scripts:
            # Setup the TPC correction scaling options for reco; They come from the anchoring setup
            # Some useful comments from Ruben:
            # - lumi-type == 0 means no-scaling of corrections with any measure of the lumi rather than no corrections at all.
            # - The "no corrections" mode is imposed by the TPCCorrMap.lumiMean configurable being negative, in this case all other options in the corrections treatment are ignored.
            # - But if the MC simulation was done with distortions, then the reco needs --lumy-type 1 (i.e. scale with the CTP lumi) even if the corresponding anchor run reco was using --lumy-type 2
            #   (i.e. scaling according to the TPC IDC, which don't exist in the MC).
        • My question: the real-data reco has a significantly different chi2/(2*NCL - 5) distribution from the one on simulated data with SC distortions

        • All my results concerning tracking efficiencies etc. will be incorrect / worse compared to real data
        • Is this a problem?
      • New strategy: run over digits produced with central MC productions
        • alien_find /alice/sim/2025/LHC25b8a_v8b3c/0/559781 debug_full_archive.tgz
        • Current problem in reco: IR 49149/0 is ahead of the reference IR 0/47553440

       

      o2-tpc-reco-workflow --session 9 --severity info --shm-segment-id 9 --shm-segment-size 90000000000 --timeframes-rate-limit-ipcid 9 --input-type digits --tpc-digit-reader "--infile $SIMDIR/tpcdigits.root" --output-type tracks --lumi-type 1 --corrmap-lumi-mode 2 --resources-monitoring 2 --configKeyValues "GPU_QA.output=histograms.root;;GPU_proc.runQA=1;GPU_global.rundEdx=1;keyval.input_dir=$SIMDIR;keyval.output_dir=/lustre/alice/users/csonnab/PhD/jobs/clusterization/QA/output/12082025_monalisa_25b8a_v8bc3/cfreg_30/199/reco;GPU_global.deviceType=CPU;GPU_proc.debugLevel=0;GPU_global.synchronousProcessing=1;GPU_proc.clearO2OutputFromGPU=1;GPU_proc.ompThreads=56;GPU_proc.deviceNum=-2;GPU_proc.forceHostMemoryPoolSize=110000000000;GPU_global.overrideNHbfPerTF=128"

       

      Thesis writing

      • MC chapter almost ready for first reading: Expect to hand it to David end of next week, afterwards to Silvia.

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      Investigated the optimal configurations found by the tuner.

      Launch bounds "issue"

      HIP definition of __launch_bounds__ (link):

      __global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
      MyKernel(hipGridLaunch lp, ...)
      ...

      MIN_WARPS_PER_EXECUTION_UNIT - directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multi-processor.

      A Compute Unit (CU) on AMD corresponds to a Streaming Multiprocessor (SM) on Nvidia. A CU is composed of one or more Execution Units (EUs, i.e. SIMDs), which are responsible for executing waves. For the MI50: 60 CUs, 4 EUs per CU.

      CUDA definition of __launch_bounds__:

      __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)

      "The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors"

      Implication

      This means that when the launch bounds are set in the parameter file and compiled for AMD GPUs, the second value sets the minimum number of active warps each EU has to sustain.
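
      As an illustration, a minimal sketch (hypothetical names and values, not the actual O2 macros or kernels) of converting a CUDA-style "min blocks per SM" hint into HIP's "min warps per EU" before passing it to __launch_bounds__:

      // Illustrative only: CUDA reads the 2nd __launch_bounds__ argument as
      // MIN_BLOCKS_PER_MULTIPROCESSOR, HIP reads it as MIN_WARPS_PER_EXECUTION_UNIT.
      constexpr int kThreadsPerBlock = 512;
      constexpr int kMinBlocksPerSM  = 2;   // CUDA-style occupancy hint (assumed value)

      #if defined(__HIP_PLATFORM_AMD__)
      constexpr int kWaveSize = 64;         // MI50 wavefront size
      constexpr int kEUsPerCU = 4;          // MI50: 4 EUs (SIMDs) per CU
      // Same occupancy expressed as waves per EU: blocks * (waves per block) / EUs per CU
      constexpr int kMinWarpsPerEU =
          kMinBlocksPerSM * (kThreadsPerBlock / kWaveSize) / kEUsPerCU;  // = 4
      __global__ void __launch_bounds__(kThreadsPerBlock, kMinWarpsPerEU)
      exampleKernel(float* out) { out[threadIdx.x] = 0.f; }
      #else
      __global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
      exampleKernel(float* out) { out[threadIdx.x] = 0.f; }
      #endif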

      Example: with #define GPUCA_LB_GPUTPCGMMergerCollect 512, 9

      SGPRs: 62
      VGPRs: 28
      ScratchSize [bytes/lane]: 22464
      Dynamic Stack: False
      Occupancy [waves/EU]: 9
      SGPRs Spill: 28
      VGPRs Spill: 111
      LDS Size [bytes/block]: 0

      So if GetGridAuto is used to launch the kernel, the grid will be MIN_WARPS_PER_EXECUTION_UNIT * CUs, and the compiler will reserve registers according to the specified MIN_WARPS_PER_EXECUTION_UNIT

      On the MI50, the hw limit is 10 active warps per EU --> when the second launch-bounds parameter is > 10, the compiler discards the hint and usually tries not to spill any registers

      Example: with #define GPUCA_LB_GPUTPCGMMergerCollect 512, 15

      SGPRs: 104
      VGPRs: 73
      ScratchSize [bytes/lane]: 22272
      Dynamic Stack: False
      Occupancy [waves/SIMD]: 3
      SGPRs Spill: 0
      VGPRs Spill: 0
       LDS Size [bytes/block]: 0

      Each thread uses 73 VGPRs out of the 256 VGPRs available per SIMD lane --> can accommodate 3 warps without spilling

      Grid size = 600 mystery solved

      This is why grid size 600 is such a magic number in the tuning, and why the grid size is so important: in reality, it is steering the number of active warps per EU. Beyond 10 active warps (600 / 60 CUs = 10), the compiler discards the hint and thus changes its optimization strategy.
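
      A back-of-the-envelope check of the two numbers above (assumed MI50 figures, illustrative only):

      constexpr int kCUs           = 60;   // compute units on the MI50
      constexpr int kMaxWavesPerEU = 10;   // hardware limit of active waves per EU
      constexpr int kVGPRsPerLane  = 256;  // VGPR budget per SIMD lane
      constexpr int kVGPRsUsed     = 73;   // from the "512, 15" example above

      // 73 VGPRs per thread -> at most floor(256 / 73) = 3 waves fit without spilling.
      constexpr int kWavesFromVGPRs = kVGPRsPerLane / kVGPRsUsed;
      // Grid size 600 = warpsPerEU * CUs -> 600 / 60 = 10 warps per EU, exactly the hw limit.
      constexpr int kWarpsPerEUAt600 = 600 / kCUs;

      static_assert(kWavesFromVGPRs == 3, "3 waves per SIMD without spilling");
      static_assert(kWarpsPerEUAt600 == kMaxWavesPerEU, "grid 600 saturates the EU wave limit");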

      Sometimes spilling and increasing occupancy helps, sometimes it is better to reduce the spilling.

      What to do with this

      • Correct the usage of launch bounds for AMD compilation
      • Maybe use three parameters per kernel, i.e. max threads per block, active warps and grid size, and tune these together

       

    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Summer Student Milla Bramsted

      • She is working on benchmarking SoA code on GPUs
      • We will track her project in this google doc
      • She is adding CUDA kernels to this repo
      • 5 CUDA kernels are now running in our benchmark framework
      • They are running in AoS and SoA data layout (a minimal layout sketch follows after this list)
      • Next steps:
        • Plot the results
        • Interpret the results
        • Prepare presentation for AIP meeting next Monday
        • Add manual AoS and SoA as baseline
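
      A minimal illustration of the two layouts being compared (illustrative types, not the actual benchmark classes):

      #include <vector>

      // Array of Structures (AoS): one struct per element, fields interleaved in memory.
      struct HitAoS { float x, y, z; };
      using HitsAoS = std::vector<HitAoS>;

      // Structure of Arrays (SoA): one contiguous array per field,
      // which typically gives coalesced accesses in GPU kernels.
      struct HitsSoA {
        std::vector<float> x, y, z;
      };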

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • The O2 standalone benchmark works on all the NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Next steps
        • Generate optimized .par files with Gabriele
      • Possible next steps
        • Add the new architectures to the ones hardcoded in O2?
        • Store the results in CSV format and plot them?
        • Display the plot in the GitHub web GUI?

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code
      • Possible classes to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

      Build the O2 GPU TPC Standalone benchmark on two local machines:

      Motivation: Want to reduce my use of the EPN for routine activities like generating PRs and learning the framework.

      1: On one node of our grid-peer Tier-3 cluster:
           An entry-level graphics card is available on this machine (NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER]).

      trainee@gpu-compute:~$ du -sh AliDist
      31G    AliDist

      2: On my local workstation:

      vikas@vsinghal:~/AliceGPU$ du -sh *
      14M    alidist
      40G    sw
      I will create a few PRs from this local version. 
       
      3: On the login node of the EPN:
      [PROD][vsinghal@login ~]$ du -sh alisoft/
      25G    alisoft/
      [PROD][vsinghal@login ~]$ du -sh NewAliSoft
      22G    NewAliSoft
      [PROD][vsinghal@login ~]$
      The AliDist software occupies a different amount of space on different machines.
       
       
      2: On my local workstation: copied the generated dataset from the EPN machine, but it shows a size-mismatch error.
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pp-10 --debug 1
      Reading events from Directory o2-pp-10
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pp-10/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$ ./ca -e o2-pbpb-50 --debug 1
      Reading events from Directory o2-pbpb-50
      Created GPUReconstruction instance for device type CPU (1)
      ERROR reading events/o2-pbpb-50/tpctransform.dump, invalid size: 4552 (4568 expected)
      terminate called after throwing an instance of 'std::runtime_error'
        what():  invalid size
      Aborted
      vikas@vsinghal:/localdata/standalone$

       

      Tried to generate the dataset:

      vikas@vsinghal:/localdata/standalone$ ~/AliceGPU/sw/SOURCES/O2/daily-20250808-0000/daily-20250808-0000/prodtests/full_system_test.sh 
      Missing O2sim environment
      vikas@vsinghal:/localdata/standalone$ alienv enter O2sim/latest
      ERROR: O2sim/latest was not found
      vikas@vsinghal:/localdata/standalone$
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      PoCL 

      Full decoder kernel gets stuck in an endless loop with the new loop vectorizer. Minimal reproducer:

      __kernel void decode(uchar flags)
      {
        const ushort N = 1;
      
        for (ushort i = 0; i < N; i++) {
          if (i != N - 1 || flags) {
            work_group_reduce_add(0);
          }
        }
      }

      PoCL devs already have a patch set that also fixes this issue.

      New issue: Crash when compiling stream compaction kernel to machine code in clusterizer. (TODO: reproducer)

      Original issues in ZS decoder still persist even with new compiler (Maybe caused by O2 after all?)

       

      New GPU Servers

      Wanted to buy new GPU machines for a) development and b) CI Tests that run GPU code.

      Both machines should have AMD + Nvidia GPU.

      Ideally, just buy two new servers -> not possible due to budget constraints.

      Dev machine -> buy parts and build the server from scratch

      CI machine -> Add GPUs to existing machine

      Two candidates for CI: alibi and alinsure

      alibi

      • already has GPUs, but is occupied by long MC jobs -> unsuited for running CI jobs in parallel.


      alinsure

      • Ancient machine (mainboard + CPU from 2013), unclear if BIOS even supports GPUs with >4GB VRAM.
      • Haven't checked if PSU is strong enough yet (base model only has 460W)

       

       

      How far can we get if we build both machines from scratch instead?

      Peter mentioned a remaining budget of 8000 CHF; going 4000 CHF over budget should be OK -> assume 12000 CHF

      CI Machine

      Can have a smaller CPU and GPU; it only needs to compile the GPU standalone benchmark:

      Minimal configuration: 

      • Threadripper 7000 (24 cores)
      • 128GB RAM
      • RTX 5060 TI (16GB)
      • 2TB NVMe 

       

      Total price incl. basic components (mainboard, case, etc.): ~5000 CHF

       

      GPU Machine

      Max configuration possible with the remaining budget:

      • Threadripper 9000 (32 cores)
      • 128GB RAM
      • RTX 5080 (16GB)
      • 4 TB NVMe
      • 2x 20TB HDD (for RAID 0)

       

      Total price incl. basic components: ~7000 CHF

      Upgrades possible if we don't need to buy a CI machine:

      • CPU: Threadripper 7000 (64 cores) (maybe 9000 series)
      • Increase RAM to 256 GB
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      TODOs (help welcome)
          - validate new features if they also work on GPU
              - deltaROF tracking currently not working, have to check why and fix (FIXED)
              - otherwise everything used in production should be there
          - porting Vertexing code to GPU (in progress)
          - optimize GPU part:
              - `started` using multiple streams, still very much in its infancy (see below)
              - optimise load times/processing etc. (probably [not measured] parts could be improved)
              - optimise number of blocks/threads per kernel (for Pb-Pb + pp sep., fairly small number of kernels to be optimised)
                  - meeting with Gabriele & David, Gabriele will now start digging in
              - Have a CI procedure/periodic check that runs ITS GPU reconstruction and then gets some automatic checks and numbers (deterministic mode is more important) 
              - Have a test with TPC & ITS for async production on EPNs
                      - PbPb might be problematic, need to understand when GPU memory is freed using the framework allocator
                          - saw an increase in dropped TFs (dropped meaning the ITS code exceeded the available memory, caught the std::bad_alloc and output 0 tracks rather than crashing the processing; see the sketch below) when running TPC + ITS GPU at high interaction rate (40 kHz) Pb-Pb
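
      A minimal sketch of the graceful-degradation behaviour described in the last item (hypothetical types and function names, not the actual ITS tracker code):

      #include <cstdio>
      #include <new>
      #include <vector>

      // Hypothetical stand-ins for the ITS types; not the actual O2 classes.
      struct Track {};
      struct TimeFrame {};

      std::vector<Track> trackOnGPU(const TimeFrame&) {
        throw std::bad_alloc();  // stand-in for running out of GPU/host memory
      }

      std::vector<Track> processTimeFrame(const TimeFrame& tf) {
        try {
          return trackOnGPU(tf);
        } catch (const std::bad_alloc&) {
          // Drop this TF: report zero tracks but keep the processing alive.
          std::fprintf(stderr, "ITS tracking: out of memory, outputting 0 tracks\n");
          return {};
        }
      }

      int main() {
        TimeFrame tf;
        std::printf("tracks: %zu\n", processTimeFrame(tf).size());
      }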

       

       

      (request by David & Vasco for some ALICE3 hardware estimates with the current algorithm; we have some ideas but need to refine them a bit)

       

       

      AliceO2#14563

      Total time in its_time_benchmarks_old.txt: 108960.237939
      Total time in its_time_benchmarks_new.txt: 91904.873816
      
        Difference (file2 - file1): -17055.364123
        Relative change: -15.65%
      
      Metric: cell_finding
        its_time_benchmarks_old.txt: 5229.111419 (4.80% of total)
        its_time_benchmarks_new.txt: 3776.888696 (4.11% of total)
        Difference (file2 - file1): -1452.222723
        Relative change: -27.77%
      
      Metric: neighbour_finding
        its_time_benchmarks_old.txt: 15058.566217 (13.82% of total)
        its_time_benchmarks_new.txt: 13635.751449 (14.84% of total)
        Difference (file2 - file1): -1422.814768
        Relative change: -9.45%
      
      Metric: road_finding
        its_time_benchmarks_old.txt: 54911.194296 (50.40% of total)
        its_time_benchmarks_new.txt: 53212.793432 (57.90% of total)
        Difference (file2 - file1): -1698.400864
        Relative change: -3.09%
      
      Metric: timeframe_initialisation
        its_time_benchmarks_old.txt: 1323.038784 (1.21% of total)
        its_time_benchmarks_new.txt: 1205.854911 (1.31% of total)
        Difference (file2 - file1): -117.183873
        Relative change: -8.86%
      
      Metric: tracklet_finding
        its_time_benchmarks_old.txt: 32438.327223 (29.77% of total)
        its_time_benchmarks_new.txt: 20073.585328 (21.84% of total)
        Difference (file2 - file1): -12364.741895
        Relative change: -38.12%
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)