General summary on GPU param optimisation

Can we optimize parameters individually, and which parameters do we have to optimize globally?

The image below shows the GPU sync TPC processing chain. Each colored box is a GPU kernel; time flows from left to right.

The following conclusions were drawn:

Are the optimal parameters the same for different input data (pp vs PbPb) and for low vs high interaction rates (IR)?

Measured on AlmaLinux 9.4, ROCm 6.3.1, with an MI50 GPU. Four different configurations were tested: pp 100 kHz, pp 2 MHz, PbPb 5 kHz, and PbPb 50 kHz, using simulated TFs with 128 LHC orbits.

Independent params optimisation

MergerTrackFit

Executed twice (Merger 1 and Merger 2).

[Plots: Merger 1 and Merger 2 grid-search timings, for pp and for PbPb]

MergerSliceRefit

The kernel is executed 36 times (once per TPC sector).

MergerCollect

pp

Overall best performance is given by (64, 960), while the current configuration is (512, 60).

PbPb

Roughly the same as for pp.

MergerFollowLoopers

The best configurations use 900 or 960 as grid size; the current configuration is (256, 200).

Compression kernels

Step 0 attached clusters

No significant improvements when changing grid and block sizes.

Step 1 unattached clusters

No significant improvements when changing grid and block sizes.

After grid search

Create a set of best parameters per beam type (pp, PbPb) and per IR (100 kHz and 2 MHz for pp; 5 kHz and 50 kHz for PbPb). How to choose the best configuration:

  1. compute conf_mean_time - default_conf_mean_time
  2. propagate the error (std dev) of the difference and compute the 95% confidence interval
  3. if 0 is in the interval, we cannot tell with confidence whether the candidate configuration is better than the default
  4. if one or more CIs have an upper bound < 0, choose the one with the smallest mean (i.e. the best)
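The selection steps above can be sketched as follows. This is a minimal pure-Python sketch, not the actual analysis code: the function names, the Welch-style standard error of the difference, and the example timing samples are my assumptions.

```python
import math

def ci_of_difference(cand_times, default_times, z=1.96):
    """95% CI for mean(cand) - mean(default), propagating the std errors
    of the two means into the error of the difference (step 2)."""
    n1, n2 = len(cand_times), len(default_times)
    m1 = sum(cand_times) / n1
    m2 = sum(default_times) / n2
    v1 = sum((x - m1) ** 2 for x in cand_times) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in default_times) / (n2 - 1)
    diff = m1 - m2                       # step 1
    se = math.sqrt(v1 / n1 + v2 / n2)    # propagated error of the difference
    return diff - z * se, diff + z * se

def pick_best(candidates, default_times):
    """Keep only candidates whose CI upper bound is < 0 (step 4);
    among those, return the one with the smallest mean time."""
    best, best_mean = None, float("inf")
    for name, times in candidates.items():
        _, hi = ci_of_difference(times, default_times)
        if hi < 0:  # confidently faster than the default
            mean = sum(times) / len(times)
            if mean < best_mean:
                best, best_mean = name, mean
    return best  # None: nothing confidently beats the default (step 3)
```

If no candidate's interval lies entirely below zero, `pick_best` returns `None`, i.e. the default configuration is kept.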

Plug in the best parameters for each beam type / IR configuration and check whether there is a noticeable improvement in the whole sync / async chain (work in progress).

Dependent params optimisation

Possible ideas for optimization after the manual step:

  1. Isolate the parameters which are dependent, i.e. kernels from the same task which run in parallel (e.g. Clusterizer step, SliceTracking slice)
  2. Apply known optimization techniques to such kernel groups
    1. Grid/random search
    2. Bayesian optimization?
      See: F.-J. Willemsen, R. Van Nieuwpoort, and B. Van Werkhoven, “Bayesian Optimization for auto-tuning GPU kernels”, International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) at Supercomputing (SC21), 2021. Available: https://arxiv.org/abs/2111.14991
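A minimal sketch of idea 2.1, a random search over the launch parameters of a group of dependent kernels. The candidate block/grid values and the `measure` callback are placeholders for the real benchmark, not the actual tuning code.

```python
import random

# Hypothetical candidate values; the real search space comes from the benchmark.
BLOCK_SIZES = [64, 128, 256, 512]
GRID_SIZES = [60, 200, 480, 900, 960]

def random_search(measure, kernels, n_trials=50, seed=0):
    """measure(params) -> mean time, where params maps each kernel in the
    dependent group to a (block, grid) pair sampled jointly, since kernels
    from the same task run in parallel and interact."""
    rng = random.Random(seed)
    best_params, best_time = None, float("inf")
    for _ in range(n_trials):
        params = {k: (rng.choice(BLOCK_SIZES), rng.choice(GRID_SIZES))
                  for k in kernels}
        t = measure(params)
        if t < best_time:
            best_params, best_time = params, t
    return best_params, best_time
```

The same loop with an exhaustive iterator instead of `rng.choice` gives the grid-search variant; Bayesian optimization would replace the uniform sampling with a model-guided proposal, as in the reference above.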

Possible bug spotted

HIP_AMDGPUTARGET set to "default" in GPU/GPUTracking/Standalone/cmake/config.cmake translates into HIP_AMDGPUTARGET=gfx906;gfx908 and forces the use of the MI50 parameters.

Basically, here HIP_AMDGPUTARGET=gfx906;gfx908 enters the first if clause (for MI50) even when compiling for MI100. As a workaround, I commented out set(HIP_AMDGPUTARGET "default") in the config.cmake of the standalone benchmark and forced the usage of MI100 parameters via

cmake -DCMAKE_INSTALL_PREFIX=../ -DHIP_AMDGPUTARGET="gfx908" ~/alice/O2/GPU/GPUTracking/Standalone/

Did not investigate further on this.
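For illustration, a hypothetical sketch of the kind of clause that could produce this behavior; the actual config.cmake logic and variable names may differ, and the parameter-file names below are placeholders.

```cmake
# Hypothetical: with the expanded default, HIP_AMDGPUTARGET is the list
# "gfx906;gfx908", so a substring match on gfx906 (MI50) fires first,
# even when the build is actually meant for MI100 (gfx908).
if(HIP_AMDGPUTARGET MATCHES "gfx906")       # MI50 branch: always taken
  set(GPUCA_PARAM_FILE "params_mi50.h")     # placeholder name
elseif(HIP_AMDGPUTARGET MATCHES "gfx908")   # MI100 branch: never reached
  set(GPUCA_PARAM_FILE "params_mi100.h")    # placeholder name
endif()
```

Passing a single explicit target, as in the cmake invocation above, sidesteps the list matching entirely.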