Some problems

Tried plugging in the new optimal parameters for 6 different kernels after the grid search
Parameters should be optimal for 50 kHz IR
Turns out sync time is almost 7% slower with the new parameters
In grid search, kernels measured individually using rocprof
In overall evaluation, total sync and async time considered
Maybe there are hidden slowdowns which have not been measured during the grid search?
I will measure kernel durations to see which one is slower compared to the data taken during the grid search
If needed, redo the individual grid searches and see whether the optimal parameters change
If needed, find another way to measure kernel durations instead of using the ROCm profiler
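
A minimal sketch of the per-kernel comparison planned above, assuming the durations from both runs have already been reduced to one mean value per kernel. The kernel names and numbers are placeholders, not measurements, and the helper name relative_slowdown is hypothetical.

```python
def relative_slowdown(baseline_us, current_us):
    """Return (kernel, percent change) pairs, largest slowdown first.

    baseline_us: mean duration per kernel from the grid search
    current_us:  mean duration per kernel from the overall evaluation
    Kernels missing from either run are skipped.
    """
    changes = []
    for name, base in baseline_us.items():
        if name in current_us and base > 0:
            percent = 100.0 * (current_us[name] - base) / base
            changes.append((name, percent))
    return sorted(changes, key=lambda item: item[1], reverse=True)


# Placeholder values only, chosen to show the output shape.
grid_search = {"kernel_a": 100.0, "kernel_b": 200.0}
full_run = {"kernel_a": 125.0, "kernel_b": 190.0}

for kernel, percent in relative_slowdown(grid_search, full_run):
    print(f"{kernel}: {percent:+.1f}%")
```

A positive value means the kernel is slower in the full run than it was during the grid search, which is where a hidden slowdown would show up.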

Idea for a more efficient grid search:

Apply Latin Hypercube Sampling (LHS)
1. Divide each dimension of the search space into M intervals (bins)
2. Sample M points s.t. each interval (bin) of every dimension contains exactly one sample point
3. This way the search space should be explored evenly
Select the configuration with the best result
Recursively apply LHS in a more fine-grained search space around that sample
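
The steps above can be sketched as follows. This is only an illustration under assumptions: parameters are treated as continuous (real launch parameters would likely need rounding to valid integer values), the objective is a toy function standing in for the benchmark time, and the names latin_hypercube, recursive_lhs and shrink are hypothetical.

```python
import random


def latin_hypercube(bounds, m, rng):
    """Draw m points so that each of the m bins of every dimension holds exactly one."""
    columns = []
    for lo, hi in bounds:
        width = (hi - lo) / m
        # One random position inside each bin, then shuffle the bin order
        # independently for this dimension.
        column = [lo + (b + rng.random()) * width for b in range(m)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose: one tuple per sample point.
    return [tuple(column[i] for column in columns) for i in range(m)]


def recursive_lhs(objective, bounds, m, rounds, shrink=0.5, seed=0):
    """Minimise objective by repeated LHS in a box that shrinks around the best point."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(rounds):
        for x in latin_hypercube(bounds, m, rng):
            y = objective(x)
            if y < best_y:
                best_x, best_y = x, y
        # New box: 'shrink' times the old width, centred on the best point,
        # clipped so it never leaves the current box.
        new_bounds = []
        for (lo, hi), centre in zip(bounds, best_x):
            half = (hi - lo) * shrink / 2
            new_bounds.append((max(lo, centre - half), min(hi, centre + half)))
        bounds = new_bounds
    return best_x, best_y


def toy_objective(x):
    # Stand-in for the measured time; minimum at 0.3 in every dimension.
    return sum((v - 0.3) ** 2 for v in x)
```

Each round costs M evaluations, so the total number of benchmark builds is M times the number of rounds, independent of the number of dimensions.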

Should I try this type of optimisation or should I just try to apply a known external optimisation framework and somehow adapt it to this problem?

SliceTracker step has 8 main kernels, 16 parameters --> 16-dimensional search space (vs. 2 dimensions for each independent kernel)
Each sample point in the search space takes 15 minutes to evaluate, because the standalone benchmark has to be recompiled
Will this heuristic be enough to keep the runtime of the search feasible?
External frameworks might be faster?
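
A back-of-the-envelope estimate of the search cost, counting only the 15 minutes of compilation per sample point stated above (benchmark run time is ignored). The choices of 3 values per dimension, M = 20 and 4 rounds are illustrative assumptions, not figures from these notes.

```python
COMPILE_MINUTES = 15  # from the notes: one benchmark build per sample point
DIMENSIONS = 16       # 8 kernels x 2 parameters


def full_grid_points(values_per_dim, dims=DIMENSIONS):
    """Number of points in an exhaustive grid over all dimensions."""
    return values_per_dim ** dims


def search_hours(samples, minutes_per_sample=COMPILE_MINUTES):
    """Total compile time in hours for a given number of sample points."""
    return samples * minutes_per_sample / 60


def lhs_samples(m, rounds):
    """Sample points needed for recursive LHS with m points per round."""
    return m * rounds


# Assumed settings for illustration only.
print("full grid, 3 values per dimension:", full_grid_points(3), "points")
print("recursive LHS, M=20, 4 rounds:", search_hours(lhs_samples(20, 4)), "hours")
```

The exhaustive grid grows exponentially with the number of dimensions, while the LHS budget is set directly by M and the number of rounds, so the open question is whether that few samples explore 16 dimensions well enough.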