Multi grid search script
- Script for measuring kernel durations using different grid_size block_size configurations
- Allows for multiple measurements at once
- Limited to kernels defined in GPUDefGPUParameters.h
- It uses the standalone TPC benchmark
How much it is automated? 🤖
Pretty much, i.e. it is sufficient to specify the kernel name and the space search in the script and it is ready to go

What is the input?
Just the dictionary. For the moment the dataset on which to measure is fixed, and loops between four datasets.
What is the output?
Folder with a csv file for each measured kernel, e.g.

Nice features 👍
- Independent from O2, just compile the standalone benchmark with the desired version
- Efficiently executes multiple grid searches, sampling one point in the search space for each kernel at each measurments
- Duration of the measurments dependent on the biggest search space, not on how many kernels are evaluated
Limitations 👎
- Currently llimited to AMD GPUs as it exploits the
rocprof
profiler from AMD
- Takes a lot of time as each time new points in the search spaces are evaluated, it must compile the benchmark
- Example: Took 23 hours on MI50 when biggest search space was 32 space points
- Useful just for kernels which do not run in parallel with other kernels
Currently doing / to do 🚧
- Select best parameters for each dataset and check correctness / improvement
- Benchmark with other datasets
- Try to use runtime compilation (need a bit of guidance here 😬)
- Tackle concurrent kernels
- Make the search less brute force and more "intelligent"
- See feasibility of other optimisation techniques (optuna, bayesian opt, MCMC...)