Slightly changed strategy to avoid invalid kernel configurations:
warp size, max threads / SM, registers / SM, shared mem / SM & block, max blocks / SM
Block size: typically from warp size to 20 * warp size
Min blocks/SM: sampled as fraction of SM capacity ∈ [0, 1]
min_blocks_per_sm = 1
Function krnl_GPUTPCTrackletSelector:
REG:64 STACK:600 SHARED:8976 LOCAL:0 CONSTANT[0]:900 TEXTURE:0 SURFACE:0 SAMPLER:0
max_blocks_per_sm based on resource constraints
min_blocks_per_sm = sm_fraction × max_blocks_per_sm
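The two steps above can be sketched roughly as follows. This is a minimal illustration, not the tuner's actual code: the `DeviceLimits` and `KernelUsage` structs and both function names are hypothetical, register and shared-memory allocation granularity is ignored, and rounding the result to the nearest integer with a floor of 1 is an assumption.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical container for the queried GPU limits (names are illustrative).
struct DeviceLimits {
  int warpSize;
  int maxThreadsPerSM;
  int regsPerSM;
  int sharedMemPerSM;    // bytes
  int sharedMemPerBlock; // bytes
  int maxBlocksPerSM;
};

// Per-kernel resource usage, e.g. the REG and SHARED values of a resource report.
struct KernelUsage {
  int regsPerThread;
  int staticSharedBytes;
};

// Upper bound on resident blocks per SM for a given block size.
// Returns 0 if a single block already violates a limit.
// Simplification: allocation granularity of registers/shared memory is ignored.
int maxBlocksPerSM(const DeviceLimits& d, const KernelUsage& k, int blockSize)
{
  if (blockSize <= 0 || blockSize > d.maxThreadsPerSM ||
      k.staticSharedBytes > d.sharedMemPerBlock) {
    return 0;
  }
  const int byThreads = d.maxThreadsPerSM / blockSize;
  const int byRegs = k.regsPerThread > 0 ? d.regsPerSM / (k.regsPerThread * blockSize)
                                         : d.maxBlocksPerSM;
  const int byShared = k.staticSharedBytes > 0 ? d.sharedMemPerSM / k.staticSharedBytes
                                               : d.maxBlocksPerSM;
  return std::min({d.maxBlocksPerSM, byThreads, byRegs, byShared});
}

// min_blocks_per_sm = sm_fraction x max_blocks_per_sm, kept within [1, maxBlocks].
// The rounding and the lower bound of 1 are assumptions of this sketch.
int minBlocksPerSM(double smFraction, int maxBlocks)
{
  if (maxBlocks <= 0) {
    return 0; // no valid configuration for this block size
  }
  const double f = std::clamp(smFraction, 0.0, 1.0);
  const int blocks = static_cast<int>(std::lround(f * maxBlocks));
  return std::clamp(blocks, 1, maxBlocks);
}
```

With REG:64 and SHARED:8976 from the report above, and made-up device limits of 65536 registers and 2048 threads per SM, a block size of 256 would be register-limited in this sketch.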
This strategy ensures that the sampled configurations are always valid,
except when the block size exceeds limits because the static shared memory size depends on it. Example:
class GPUTPCTrackletSelector : public GPUKernelTemplate
{
 public:
  struct GPUSharedMemory {
    int32_t mItr0;          // index of the first track in the block
    int32_t mNThreadsTotal; // total n threads
    int32_t mNTracklets;    // n of tracklets
    int32_t mReserved;      // for alignment reasons
    static_assert(GPUTPCGeometry::NROWS >= GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE);
    // Second array dimension is the kernel's thread count, so the struct size grows with the block size
    GPUTPCHitId mHits[GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE][GPUCA_GET_THREAD_COUNT(GPUCA_LB_GPUTPCTrackletSelector)];
  };
  // ... (rest of class omitted)
};
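To make the dependence concrete, the shared memory footprint of a layout like the one above can be estimated as a function of the block size. This is only a sketch under stated assumptions: the function names are hypothetical, the element size `hitIdBytes` is a parameter because the size of `GPUTPCHitId` is not given here, and padding is ignored.

```cpp
#include <cstddef>
#include <cstdint>

// Estimated static shared memory in bytes for a layout with four int32_t
// header fields followed by a hitsRegSize x blockSize array of hit IDs.
// hitIdBytes is an assumed element size; padding is ignored.
constexpr std::size_t staticSharedBytes(std::size_t hitsRegSize, std::size_t blockSize,
                                        std::size_t hitIdBytes)
{
  return 4 * sizeof(std::int32_t) + hitsRegSize * blockSize * hitIdBytes;
}

// Largest block size (a multiple of warpSize) whose static shared memory
// still fits into sharedLimitBytes; returns 0 if not even one warp fits.
constexpr std::size_t maxBlockSizeForShared(std::size_t hitsRegSize, std::size_t hitIdBytes,
                                            std::size_t sharedLimitBytes, std::size_t warpSize)
{
  const std::size_t header = 4 * sizeof(std::int32_t);
  if (sharedLimitBytes < header || hitsRegSize == 0 || hitIdBytes == 0 || warpSize == 0) {
    return 0;
  }
  const std::size_t threads = (sharedLimitBytes - header) / (hitsRegSize * hitIdBytes);
  return threads / warpSize * warpSize; // round down to a whole number of warps
}
```

A tuner could use such a bound to cap the sampled block size before compiling, rather than discovering the violation at compile time. The numbers used to exercise it (register size 20, 4-byte hit IDs, 48 KiB per block) are illustrative only.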
Thus, if the first run-time (RT) compilation fails:
SUCCESS: Tuner now finds optimal sets within GPU limits