Global Parameter Optimisation

Context:

Tried manual tuning of GMMergerTrackFit. This kernel is called twice:

  1. First with
    1. block size: 128
    2. grid size s.t. grid size*block size >= #tracks
  2. Second with
    1. block size: 128
    2. grid size: 120
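The two launch configurations above can be sketched as follows. This is a minimal illustration, not the actual O2 launch code: the function name `grid_size_for` is hypothetical, and the track count 62900 is an invented value chosen only because it reproduces the {128,492} configuration quoted for pp at 100kHz further down.

```python
def grid_size_for(n_tracks: int, block_size: int = 128) -> int:
    """Smallest grid size such that grid_size * block_size >= n_tracks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Assumption: launch at least one block even when there are no tracks.
    if n_tracks <= 0:
        return 1
    # Integer ceiling division.
    return (n_tracks + block_size - 1) // block_size


# First launch: grid size derived from the number of tracks.
first_launch = (128, grid_size_for(62900))  # 62900 is a hypothetical track count
# Second launch: fixed grid size.
second_launch = (128, 120)

print(first_launch, second_launch)
```

With this rule the first launch scales with the event size, while the second launch always uses the same 120 blocks.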

The two mergers are located here in the GPUChain (sync chain in the image below):

Tuning approach:

Used the same configuration for both kernel launches (instead of two separate configurations). Kept 128 threads per block and increased the grid size: 120 * {1,2,3,4,5,6,7}
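The sweep described above can be enumerated as in the sketch below. The helper name `manual_sweep` is hypothetical; the value 120 is the number of Compute Units of the MI100 given later in these notes. The "blocks per CU" figure assumes blocks are spread evenly over the CUs, which is an assumption, since the actual scheduling is up to the hardware.

```python
NUM_CUS = 120     # Compute Units on the MI100, as stated in these notes
BLOCK_SIZE = 128  # threads per block, kept fixed during manual tuning


def manual_sweep(multipliers=range(1, 8)):
    """Return (block_size, grid_size) pairs with grid_size = 120 * k."""
    return [(BLOCK_SIZE, NUM_CUS * k) for k in multipliers]


sweep = manual_sweep()
for block_size, grid_size in sweep:
    # Assumption: an even distribution of blocks over the CUs.
    blocks_per_cu = grid_size // NUM_CUS
    print(f"block={block_size} grid={grid_size} blocks/CU={blocks_per_cu}")
```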

Results:

Tested on MI100.

Keep in mind: in the following plots "Normal" for Merger 1 means grid size s.t. grid size*block size >= #tracks. In practice:

pp, sync

pp, async

For async, Merger 1 and Merger 2 give more or less the same results as for sync.

PbPb, sync

PbPb, async

The same observations hold for the asynchronous reco as for the synchronous one.

Grid search

Attempted a grid search approach on MI100. The parameter search span is defined as block_size = {32, 64, 128} and grid_size = {120, 240, 360, 480, 600, 840}. Block sizes 64 and 128 are multiples of the warp (wavefront) size, which is 64 on the MI100. I also included 32 experimentally, to see what happens with a non-optimal block size. Every grid size is a multiple of the number of Compute Units of the MI100 (120 CUs).

Thus the parameter search space is {32, 64, 128} x {120, 240, 360, 480, 600, 840}.
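The search space above is a plain Cartesian product and can be built as in this sketch. The constant names are illustrative only; the values are the ones listed in these notes.

```python
from itertools import product

WAVEFRONT_SIZE = 64  # warp (wavefront) size on the MI100
NUM_CUS = 120        # Compute Units on the MI100

BLOCK_SIZES = (32, 64, 128)
GRID_SIZES = (120, 240, 360, 480, 600, 840)

# All (block_size, grid_size) pairs to benchmark.
search_space = list(product(BLOCK_SIZES, GRID_SIZES))

# Block sizes that do not fill whole wavefronts (here only the experimental 32).
partial_wavefront = [b for b in BLOCK_SIZES if b % WAVEFRONT_SIZE != 0]

print(len(search_space), "configurations")
print("block sizes below a full wavefront:", partial_wavefront)
```

This gives 3 x 6 = 18 configurations per merger and per dataset.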

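Heatmaps are plotted. Every mean execution time is normalised to the mean execution time with the current standard parameters. Hence a value below 1 means the configuration is faster than the standard one, and a value above 1 means it is slower.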
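The normalisation used for the heatmaps can be sketched as below. The function name `normalise` is hypothetical, and the timing values are invented placeholders that only illustrate the arithmetic; they are not measurements from the MI100.

```python
from statistics import mean


def normalise(timings, baseline_samples):
    """Divide each configuration's mean time by the baseline mean time.

    timings: dict mapping (block_size, grid_size) -> list of execution times
    baseline_samples: execution times with the current standard parameters
    """
    baseline_mean = mean(baseline_samples)
    return {cfg: mean(samples) / baseline_mean for cfg, samples in timings.items()}


# Placeholder numbers, NOT real measurements.
baseline = [2.0, 2.0]
timings = {
    (128, 840): [1.0, 1.0],
    (64, 120): [3.0, 5.0],
}

normalised = normalise(timings, baseline)
best = min(normalised, key=normalised.get)  # configuration with the lowest ratio
print(normalised, best)
```

A ratio of 1 corresponds to the standard configuration itself.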

pp

For Merger 1, at both low and high IR and for both sync and async, the {128,840} configuration reaches the same performance as the dynamic configuration, which results in {128,492} at 100kHz and {128,10907} at 2MHz (based on #tracks).

For Merger 2, low IR seems to prefer smaller configurations, while for high IR bigger configurations work better. In any case there is room for improvement.

PbPb

For Merger 1, configuration {128,840} runs faster than {128,1795} for low IR, while for high IR the performance is equal (w.r.t. {128,19709}).

For Merger 2, several configurations perform better than the current standard one.

To-do:

Based on these observations: