GPU param optimisation

Setup

Measured on Alma 9.4, ROCm 6.3.1, MI50 GPU

A grid search was executed for the kernels discussed in the sections below.

These are the longest single-stream kernels. Their parameters are independent, which makes them easier to optimise. A custom search space is defined for each kernel, since some kernels cannot use large block sizes.
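As an illustration of a per-kernel search space with a block-size cap, here is a minimal Python sketch. The helper `make_search_space`, the kernel names and all numeric values are hypothetical and are not taken from the actual script.

```python
from itertools import product

def make_search_space(block_sizes, grid_sizes, max_block_size=None):
    """List every (block_size, grid_size) pair, dropping blocks above the cap."""
    if max_block_size is not None:
        block_sizes = [b for b in block_sizes if b <= max_block_size]
    return list(product(block_sizes, grid_sizes))

# Hypothetical kernel names and values, for illustration only
search_spaces = {
    "kernel_a": make_search_space([64, 128, 256, 512], [60, 120, 240]),
    # This kernel is assumed not to support blocks larger than 256 threads
    "kernel_b": make_search_space([64, 128, 256, 512], [60, 120], max_block_size=256),
}
```

Storing each space as a plain list of pairs keeps the spaces independent, so each kernel can have a different number of points.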

Each mean time is normalised to the mean time of the current (block_size, grid_size) configuration. A value < 1 therefore means a better configuration, > 1 a worse one, and = 1 the same performance as the current one.
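The normalisation can be sketched in Python as follows. The function name `normalise_timings` and the timing values are invented for illustration and are not measured results.

```python
from statistics import mean

def normalise_timings(timings, current):
    """Divide the mean time of every configuration by the mean time of `current`.

    timings: dict mapping (block_size, grid_size) -> list of measured times
    current: the (block_size, grid_size) pair presently in use
    """
    baseline = mean(timings[current])
    return {config: mean(samples) / baseline for config, samples in timings.items()}

# Made-up timings (arbitrary units), for illustration only
timings = {
    (512, 60): [2.0, 2.0],   # assumed current configuration
    (64, 960): [1.0, 2.0],
    (256, 120): [3.0, 3.0],
}
normalised = normalise_timings(timings, current=(512, 60))

# The configuration with the smallest normalised time is the best candidate
best = min(normalised, key=normalised.get)
```

With these invented numbers the current configuration maps to 1.0 by construction, and any configuration below 1.0 is an improvement.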

MergerTrackFit

The kernel is executed twice (Merger 1 and Merger 2).

pp

Merger 1

Merger 2

PbPb

Merger 1

Merger 2

MergerSliceRefit

Kernel is executed 36 times (once per TPC sector).

MergerCollect

pp

Some measurements must be retaken due to unknown problems. Overall, the best performance is given by (64, 960), while the current configuration is (512, 60).

PbPb

Roughly the same as pp.

MergerFollowLoopers

Best configuration uses 900 or 960 as grid size. Current configuration is (256,200).

Compression kernels

Step 0 attached clusters

No significant improvements when changing grid and block sizes.

Step 1 unattached clusters

For high IR, (192, 180) performs better than the current configuration (512, 120).

Grid search script

Since these kernels are not executed concurrently, their parameters are independent. A Python script was therefore written to perform multiple grid searches at once:

  1. A custom grid search space is defined for each kernel
  2. At each iteration, take a new point, i.e. (block_size, grid_size), from each search space
    1. Automatically modify the code, plugging each new configuration into the corresponding kernel call in O2
    2. Compile
    3. Execute and measure the kernel timings
    4. Iterate until the largest search space is exhausted. Skip sampling a new point for any kernel whose search space has been completely explored.
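The loop described above could look roughly like the following Python sketch. It is not the actual script: `lockstep_points`, `run_grid_search` and the callables `apply_config`, `build` and `measure` are hypothetical names that stand in for the code patching, compilation and timing steps.

```python
from itertools import zip_longest

def lockstep_points(search_spaces):
    """Yield one dict per iteration mapping kernel name -> (block_size, grid_size).

    Kernels whose search space is exhausted are left out of the dict, so the
    loop keeps running until the largest space has been fully explored.
    """
    names = list(search_spaces)
    for points in zip_longest(*(search_spaces[n] for n in names)):
        yield {n: p for n, p in zip(names, points) if p is not None}

def run_grid_search(search_spaces, apply_config, build, measure):
    """Run all grid searches at once; the three callables are caller-supplied."""
    results = {name: {} for name in search_spaces}
    for configs in lockstep_points(search_spaces):
        for name, config in configs.items():
            apply_config(name, config)      # patch the kernel launch parameters
        build()                             # recompile once per iteration
        timings = measure(list(configs))    # kernel name -> measured time
        for name, config in configs.items():
            results[name][config] = timings[name]
    return results

# Stub usage with made-up search spaces and a constant fake timing
spaces = {"kernel_a": [(64, 60), (128, 60), (256, 60)], "kernel_b": [(64, 120)]}
log = []
results = run_grid_search(
    spaces,
    apply_config=lambda name, config: log.append((name, config)),
    build=lambda: None,
    measure=lambda names: {n: 1.0 for n in names},
)
```

In this sketch an exhausted kernel simply keeps its last applied configuration and is no longer timed, which is one possible reading of the "skip" rule in step 4.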

Pros: Multiple grid searches possible per single run

Cons: Works effectively only with non-concurrent kernels

Next things to do