Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
        • New tracking is actually faster in the refit but slower in the looper following. Raw data has more loopers than MC, so overall it is faster in MC and slower in raw data. Can disable looper following to gain some speed at 50 kHz.
        • TPC laser calib issue fixed; it was a bug in the tracking developments affecting only triggered data.
        • Large memory usage in pp on the GRID - fixed.
        • Gain calib issue was due to new miscompilation.
        • Next iteration of tracking improvements in draft PR: https://github.com/AliceO2Group/AliceO2/pull/14651
      • Should get Gabriele's new parameters merged for Pb-Pb.

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Optimization of cluster finder filling kernel

      • Current filling kernel: Same speed as one neural network evaluation
        • Inherently expensive due to memory operations
        • For benchmark: Fully load GPU (loop over filling kernel) and compare current implementation to new implementation in debugLevel=1, FST
        • Read-writes per thread:
          • Reads: 1x peak (2 x int16_t = 4 byte) + 1x chargeMap[peak] (uint16_t = 2 byte) + 9x chargeMap[position] (uint16_t = 2 byte each)
          • Writes: 9x outputCharge (FP16: float16 = 2 byte each, FP32: float = 4 bytes each)
          • Sum: FP16: 43 bytes, FP32: 61 bytes
        • Batch size: 2097152 = 2^21 (just to load the GPU fully)
        • Kernel execution count: 8640
        • Number of threads that do the dominant read/writes: 3*9 = 27

                          Time [s]   Bandwidth [GB/s]
          Old, float16      63.345              332.1
          Old, float32      60.261              495.2
          New, float16      52.519              400.6
          New, float32      51.794              576.2
      • Conclusion:
        • Reads and writes in FP32 are probably about the same speed per operation (not per byte!), and/or casting to FP16 creates overhead
        • Time for evaluation: NN, float32 ≈ 2x NN, float16

       

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Tuned based on the first 3 events of the LHC24ar_raw dataset
      • Measured sync time before the launch-bounds fix commit, with RTC enabled and default parameters
      • Measured sync time after the fix commit, with RTC enabled and tuned parameters
      • Results:

       

      [Plots: Default vs Tuned, LHC24ar_raw, per event, on MI50. Shown as Gain% = (default - tuned)/default * 100, and as saved time per event.]
      Follow-up

      • Currently tuning MI100
      • Merge in O2 and test in Replay PbPb?
      • Discuss offline about the dynamic parameters implementation
    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      ALICE O2 CI-Pipelines on NGT Cluster

      • A fork of the AliceO2 repo is now in the NextGenTriggers (NGT) GitHub organization
      • It has a GitHub action running the standalone benchmark on NGT GPUs
      • Uses the builds in /cvmfs/alice.cern.ch/ of O2 and dependencies (pipeline takes about 7 minutes)
      • Different GPUs are tested in parallel on different VMs
      • O2 standalone benchmark works on all NGT resources:
        • Nvidia H100 188GB NVL
        • AMD Instinct MI300X
        • AMD Radeon Pro W7900
        • Nvidia L40S
      • We are now using custom .par files
      • Gabriele generated even more optimized .par files
      • Next steps
        • Use Gabriele's .par files
      • Possible next steps
        • Add the new architectures to O2's hardcoded ones?
        • Store results in CSV format and plot them?
        • Display the plot in the GitHub web GUI?

      NextGenTrigger Task 1.7

      • Workshop November 19 - 21
        • Need to prepare talk about Task 1.7.
        • Collected summaries from people working on Task 1.7 last week
        • Next Steps: 
          • Interview task leads
          • Create the slides
      • Discussing implementation of our template SoA code in MadGraph with Stefan Roiser
        • Had another discussion with Daniele Massaro about this 
        • Next Steps:
          • Help Daniele Massaro by turning the first class into SoA (beginning of November)
      • Held a meeting with Jolly and Axel from Task 1.7
        • Topic: Merge reflections and template approaches to SoA
        • Problems with template approach: Not working with derived classes or private members
        • Merged with reflections: https://godbolt.org/z/P1xohnxdv
        • Same without, but with boilerplate code: https://godbolt.org/z/ovavdKba7
        • Next Steps:
          • Implement new ideas in template approach

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker
        • Get same results (same GPU.out files)
        • Execution time was the same =(
          • Changed more classes to SoA
          • Running with large event set (thanks to Gabriele)
          • Performance still same (or maybe 2% slower)
          • Explicit template instantiations are too verbose
      • Next Steps:
        • Make Merger use SoA too
        • Store benchmark results in another format (e.g. CSV) to plot them
        • Change CI-Pipeline so that it plots and compares to a baseline
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      OpenCL

      crash when compiling ZS decoder

      • Submitted reproducer, no new update

      Memory corruption in ZS decoder

      • Reproducer kernel that compiles to basically identical SPIR-V
      • Currently working on host code to reproduce setup in O2

      New GPU server

      • Servers have been delivered
      • Logistics: need to organize transport to P2 for the dev machine?
      • Schedule downtime for alibi to swap GPUs and install the new server

      Other

      • Ernst discovered a crash in the ZS decoder caused by invalid TPC input
      • Improved error handling in the decoder kernel to handle these cases
    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      News from ITS vertexing (Gabriele)

      • Currently porting last step of vertexing

       

       

      Felix: no news

      Just some questions:

      1. How to proceed with validating the GPU tracking in pp, e.g. fixing this TF in-flight issue on the EPNs?
      2. Did you (@David) have time for this memory clearing?
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)