Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Start: 2025-01-22T10:00:00+01:00
End: 2025-01-22T11:30:00+01:00
Wednesday 22 Jan 2025, 10:00 → 11:30 Europe/Zurich

David Rohr

- 1
  
  Discussion
  
  Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
- 2
  
  Following up JIRA tickets
  
  Speaker: Ernst Hellbar (CERN)
- 3
  
  TPC ML Clustering
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- 4
  
  ITS Tracking
  
  Speaker: Matteo Concas (CERN)
- 5
  TPC Track Model Decoding on GPU
  
  Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
  Global Parameter Optimisation
  
  Context:
  
  Tried manual tuning of GMMergerTrackFit. This kernel is called twice:
  
  First with
  
  block size: 128
  
  grid size s.t. grid size*block size >= #tracks
  
  Second with
  
  block size: 128
  
  grid size: 120
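
The grid size rule of the first launch can be sketched in Python as a ceiling division. This is only an illustration, assuming the smallest grid that satisfies the inequality is used; the function names are hypothetical and not taken from the O2 code.

```python
def grid_size_for_tracks(n_tracks, block_size=128):
    """Smallest grid size with grid_size * block_size >= n_tracks."""
    if n_tracks < 1:
        raise ValueError("need at least one track to launch a block")
    return (n_tracks + block_size - 1) // block_size  # ceiling division


def track_range_for_grid(grid_size, block_size=128):
    """Inclusive range of track counts that maps to exactly this grid size."""
    return ((grid_size - 1) * block_size + 1, grid_size * block_size)


# Hypothetical example: 1000 tracks with 128 threads per block.
print(grid_size_for_tracks(1000))
```

With 1000 tracks and 128 threads per block this gives 8 blocks, since 7 * 128 = 896 is too small and 8 * 128 = 1024 is enough.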
  
  The two mergers are located here in the GPUChain (sync chain in the image below):
  
  Tuning approach:
  
  Used the same configuration for both kernel calls (instead of two separate configurations). Kept 128 threads per block and increased the grid size: 120 * {1,2,3,4,5,6,7}, i.e. from 120 to 840 blocks.
  
  Results:
  
  Tested on MI100.
  
  Keep in mind: in the following plots "Normal" for Merger 1 means grid size s.t. grid size*block size >= #tracks. In practice:
  
  grid size = 492 for pp 100kHz
  
  grid size = 10907 for pp 2MHz
  
  grid size = 1795 for PbPb 5kHz
  
  grid size = 19709 for PbPb 50kHz
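
To put these "Normal" grid sizes next to the fixed grid sizes tried in the tuning, one can express them as an average number of blocks per compute unit. The sketch below is only arithmetic on the numbers in these minutes, assuming the 120 CUs of the MI100 and an even distribution of blocks. It says nothing about actual occupancy or scheduling, and the helper name is hypothetical.

```python
NUM_CUS = 120  # MI100 compute units, as quoted in these minutes

# "Normal" Merger 1 grid sizes listed above, keyed by dataset.
normal_grid_sizes = {
    "pp 100kHz": 492,
    "pp 2MHz": 10907,
    "PbPb 5kHz": 1795,
    "PbPb 50kHz": 19709,
}


def blocks_per_cu(grid_size, num_cus=NUM_CUS):
    """Average number of blocks per compute unit (even distribution assumed)."""
    return grid_size / num_cus


ratios = {name: blocks_per_cu(g) for name, g in normal_grid_sizes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} blocks per CU")
```

For comparison, the largest fixed grid size tried (840) corresponds to 7 blocks per CU.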
  
  pp, sync
  
  The first merger benefits from larger grid sizes, but it seems to match the normal configuration at 840 blocks, so there is no need to scale the grid size up to the order of ten thousand blocks.
  
  The second merger benefits from grid sizes larger than the normal one (120 blocks).
  
  pp, async
  
  More or less the same results as sync, for both merger 1 and merger 2.
  
  PbPb, sync
  
  For low IR, merger 1 seems to benefit from grid sizes lower than normal (normal for 5kHz is 1795). For high IR it is difficult to reach the performance of the normal configuration (480 seems promising for both).
  
  Merger 2 also benefits from bigger grid sizes, for both IRs.
  
  PbPb, async
  
  Same observations for the async reco as for the sync.
  
  Grid search
  
  Attempted a grid search on the MI100. The parameter search span is block_size = {32, 64, 128} and grid_size = {120, 240, 360, 480, 600, 840}. Block sizes are multiples of the warp size (64), except for 32, which was added experimentally to see what happens with a non-optimal block size. Grid sizes are multiples of the number of Compute Units of the MI100 (120 CUs).
  
  Thus the parameter search space is {32, 64, 128} x {120, 240, 360, 480, 600, 840}.
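
The search space is the Cartesian product of the two parameter sets. A minimal Python sketch of how it could be enumerated, with the two stated constraints checked explicitly, is shown here. The variable names are illustrative, and the actual benchmarking scripts may be organised differently.

```python
from itertools import product

WAVEFRONT_SIZE = 64  # warp size quoted in these minutes
NUM_CUS = 120        # MI100 compute units quoted in these minutes

block_sizes = [32, 64, 128]
grid_sizes = [120, 240, 360, 480, 600, 840]

# Every (block_size, grid_size) pair that has to be timed.
search_space = list(product(block_sizes, grid_sizes))

# Block sizes that are not a multiple of the warp size (the experimental ones).
non_multiple_blocks = [b for b in block_sizes if b % WAVEFRONT_SIZE != 0]

# All grid sizes should be a multiple of the number of compute units.
all_grids_multiple_of_cus = all(g % NUM_CUS == 0 for g in grid_sizes)

print(len(search_space), non_multiple_blocks, all_grids_multiple_of_cus)
```

This yields 3 x 6 = 18 configurations per kernel and dataset, with 32 as the only block size below the warp size.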
  
  Heatmaps are plotted. Every mean execution time is normalised to the mean execution time with the current standard parameters. Hence:
  
  cell < 1 (red cell): better than the current configuration
  
  cell = 1 (white cell): equal to the current configuration
  
  cell > 1 (blue cell): worse than the current configuration
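
The normalisation behind each heatmap cell can be sketched as follows. The timing values are placeholders chosen for illustration, not measurements, and the function names are hypothetical.

```python
from statistics import mean


def normalise(timings, baseline):
    """Map each (block_size, grid_size) to mean time / mean baseline time."""
    base = mean(baseline)
    return {cfg: mean(samples) / base for cfg, samples in timings.items()}


def classify(ratio, tol=0.0):
    """Label a cell relative to the current configuration (ratio of 1)."""
    if abs(ratio - 1.0) <= tol:
        return "equal"
    return "better" if ratio < 1.0 else "worse"


# Placeholder timings in arbitrary units (NOT measured values).
baseline_times = [2.0, 2.0]
timings = {
    (128, 840): [1.0, 2.0],
    (128, 120): [2.0, 2.0],
    (32, 120): [3.0, 3.0],
}

cells = normalise(timings, baseline_times)
labels = {cfg: classify(r) for cfg, r in cells.items()}
print(cells, labels)
```

In practice measured means are rarely exactly equal, so a small non-zero `tol` would be needed for a cell to count as "equal".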
  
  pp
  
  For merger 1, for both low and high IR and for both sync and async, the {128,840} configuration reaches the same performance as the dynamic configuration, which (based on #tracks) results in {128,492} for 100kHz and {128,10907} for 2MHz.
  
  For merger 2, low IR seems to prefer smaller configurations, while for high IR bigger configurations work better. In either case there is room for improvement.
  
  PbPb
  
  For Merger 1, configuration {128,840} runs faster than {128,1795} for low IR, while for high IR its performance is equal to that of {128,19709}.
  
  For Merger 2, several configurations perform better than the current one.
  
  To-do:
  
  Based on these observations:
  
  Take measurements also on MI50
  
  Try even higher grid size
  
  Measure other kernels
  
  Understand how to properly time kernels without serializing them
  
  Investigate the SliceTracker part (concurrent kernels)
- 6
  
  Efficient Data Structures
  
  Speaker: Dr Oliver Gregor Rietmann (CERN)
