Should I try this type of optimisation, or should I just apply a known external optimisation framework and somehow adapt it to this problem?
#define GPUCA_LB_GPUTPCCompressionKernels_step0attached 192, 2
where 192 is block_size and grid_size is 2 * Available Compute Units
grid_size which is not a multiple of Compute Units?
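
To make the launch configuration described above concrete, here is a minimal C++ sketch of the arithmetic. It assumes the reading given in the post: the first value of the define is the block size and the second is a multiplier on the number of compute units. The names LaunchBounds, LaunchConfig, make_config and leftover_blocks are hypothetical helpers for illustration only and are not part of any framework; the compute-unit count is whatever the device reports.

```cpp
#include <cassert>

// Hypothetical mirror of the two values in the #define, read the way the
// post reads them (an assumption, not taken from framework documentation).
struct LaunchBounds {
    int block_size;     // threads per block, 192 in the post
    int blocks_per_cu;  // multiplier on the compute-unit count, 2 in the post
};

struct LaunchConfig {
    int grid_size;       // number of blocks launched
    int block_size;      // threads per block
    long total_threads;  // grid_size * block_size
};

// Derive the grid from the number of compute units: grid = multiplier * CUs.
LaunchConfig make_config(LaunchBounds lb, int compute_units) {
    assert(lb.block_size > 0 && lb.blocks_per_cu > 0 && compute_units > 0);
    int grid = lb.blocks_per_cu * compute_units;
    return {grid, lb.block_size, static_cast<long>(grid) * lb.block_size};
}

// Blocks left over after dealing blocks out evenly across compute units.
// 0 means grid_size is an exact multiple of the compute-unit count.
int leftover_blocks(int grid_size, int compute_units) {
    assert(compute_units > 0);
    return grid_size % compute_units;
}
```

For example, with an assumed 60 compute units, make_config({192, 2}, 60) gives a grid of 120 blocks and 23040 threads in total, and leftover_blocks(120, 60) is 0. A grid of 125 blocks on the same device would leave 5 blocks over, which is the kind of non-multiple grid_size the question is about.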