Global Parameter Optimisation

Context:

Tried manual tuning of GMMergerTrackFit. This kernel is called twice:

First with
1. block size: 128
2. grid size s.t. grid size*block size >= #tracks
Second with
1. block size: 128
2. grid size: 120

The two mergers are located here in the GPUChain (sync chain in the image below):

Tuning approach:

Used same configuration for both kernels (instead of two separate configurations). Kept 128 threads per block, increased block size: 120 * {1,2,3,4,5,6,7}

Results:

Tested on MI100.

Keep in mind: in the following plots "Normal" for Merger 1 means grid size s.t. grid size*block size >= #tracks. In practice:

grid size = 492 for pp 100kHz
grid size = 10907 for pp 2MHz
grid size = 1795 for PbPb 5kHz
grid size = 19709 for PbPb 50kHz

pp, sync

First merger benefits from large block sizes, but it seems to reach normal configuration at 840 blocks, no need to scale grid size up to 10 thousands
Second merger benefits from larger block sizes than normal (120 blocks)

pp, async

More or less same result as sync for async merger 1 and 2

PbPb, sync

For low IR, merger 1 seems to benefit with lower grid sizes (Normal for 5kHz is 1795), for high IR difficult to reach normal configuration (480 seems promising for both)
Merger 2 also benfits from bigger grid sizes for both IRs

PbPb, async

Same observations for the asynchronous reco as the sync

Grid search

Attempted a grid search approach on MI100. The parameter search span is defined as block_size = {32, 64, 128} and grid_size = {120, 240, 360, 480, 600, 840}. Block size is a multiple of warp size (64). I put also 32 experimentally, to see what happens with a non-optimal block size. Grid size is a multiple of the number of Compute Units of the MI100 (120 CUs).

Thus the parameter search space is {32, 64, 128} x {120, 240, 360, 480, 600, 840}.

Heatmaps are plotted. Every mean execution time is normalised to the mean execution time with the current standard parameters. Hence:

cell < 1 (red cell) better configuration than current conf
cell = 1 (white cell) equal configuration than current conf

cell > 1 (blue cell) worse configuration than current conf

pp

For merger 1, both for low and high IRs and for sync and async, same performance are reached with the {128,840}configuration, instead of the dynamic configuration which results in {128,492} for 100kHz and {128, 10907} for 2MHz (based on #tracks).

For merger 2, low IR seems to prefer smaller configurations, while for high IR bigger configurations works better. In any case there is room for improvement.

PbPb

For Merger 1, configuration {128,840} runs faster for low IR rather than {128,1795}, while for high IR the performance is equal ( w.r.t {128,19709}).

Merger 2 can be leveraged better with several configurations.

To-do:

Based on these observations:

Take measurments also on MI50
Try even higher grid size
Measure other kernels
- Understand how to properly time kernels without serialize them
- Investigate on the SliceTracker part (concurrent kernels)