Slightly changed strategy for handling invalid kernel configurations:

  1. Query GPU constants and limits:
    warp size, max threads / SM, registers / SM, shared mem / SM and / block, max blocks / SM
  2. Define parameter ranges:
    Block size: typically from warp size to 20 * warp size
    Min blocks/SM: sampled as a fraction of SM capacity ∈ [0, 1]
  3. Loop:
    1. Sample parameters from the Optuna study
    2. RT-compile with the sampled block sizes, keeping min_blocks_per_sm = 1
    3. Inspect the fatbin files to query each kernel's resource usage, e.g.:
      Function krnl_GPUTPCTrackletSelector:
        REG:64 STACK:600 SHARED:8976 LOCAL:0 CONSTANT[0]:900 TEXTURE:0 SURFACE:0 SAMPLER:0
    4. Compute the per-kernel max_blocks_per_sm from the resource constraints
    5. RT-compile with min_blocks_per_sm = sm_fraction × max_blocks_per_sm
    6. Measure the tuned step duration and report the result to the Optuna study
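
Steps 3 to 5 of the loop can be sketched as follows. This is a minimal illustration, not the tuner's actual code: the `GpuLimits` values are placeholders rather than queried device properties, all function names are hypothetical, and the occupancy estimate ignores register allocation granularity and any shared memory reserved by the runtime. Rounding the fraction down and clamping to at least 1 is also an assumption.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLimits:
    # Placeholder values for illustration; the real ones come from step 1.
    warp_size: int = 32
    max_threads_per_sm: int = 2048
    regs_per_sm: int = 65536
    shared_per_sm: int = 102400
    max_blocks_per_sm: int = 32


def parse_resource_usage(line):
    """Turn 'REG:64 STACK:600 SHARED:8976 ...' into {'REG': 64, ...}."""
    pairs = re.findall(r"([A-Z]+(?:\[\d+\])?):(\d+)", line)
    return {key: int(value) for key, value in pairs}


def max_blocks_per_sm(block_size, usage, limits):
    """Resident blocks per SM, bounded by threads, registers and shared memory."""
    # Threads are scheduled in whole warps, so round the block up to a warp multiple.
    threads = math.ceil(block_size / limits.warp_size) * limits.warp_size
    by_threads = limits.max_threads_per_sm // threads
    regs_per_block = usage["REG"] * threads
    by_regs = (limits.regs_per_sm // regs_per_block
               if regs_per_block else limits.max_blocks_per_sm)
    by_shared = (limits.shared_per_sm // usage["SHARED"]
                 if usage["SHARED"] else limits.max_blocks_per_sm)
    return min(by_threads, by_regs, by_shared, limits.max_blocks_per_sm)


def min_blocks_from_fraction(sm_fraction, max_blocks):
    """Map sm_fraction in [0, 1] to an integer min_blocks_per_sm >= 1."""
    return max(1, math.floor(sm_fraction * max_blocks))


usage = parse_resource_usage(
    "REG:64 STACK:600 SHARED:8976 LOCAL:0 CONSTANT[0]:900 "
    "TEXTURE:0 SURFACE:0 SAMPLER:0"
)
blocks = max_blocks_per_sm(256, usage, GpuLimits())
print(blocks, min_blocks_from_fraction(0.5, blocks))
```

With these placeholder limits and a block size of 256, registers are the binding constraint (64 × 256 registers per block), so the requested min_blocks_per_sm can never exceed what the kernel can actually reach.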

 

This ensures that the sampled configurations always respect the GPU resource limits,

except when the block size itself pushes a kernel over a limit because its static shared memory depends on it. Example:

class GPUTPCTrackletSelector : public GPUKernelTemplate
{
 public:
  struct GPUSharedMemory {
    int32_t mItr0;          // index of the first track in the block
    int32_t mNThreadsTotal; // total n threads
    int32_t mNTracklets;    // n of tracklets
    int32_t mReserved;      // for alignment reasons
    static_assert(GPUTPCGeometry::NROWS >= GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE);
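    // one column per thread: the array, and thus static shared memory, grows with the block size
    GPUTPCHitId mHits[GPUCA_PAR_TRACKLET_SELECTOR_HITS_REG_SIZE][GPUCA_GET_THREAD_COUNT(GPUCA_LB_GPUTPCTrackletSelector)];
  };
  // ... (rest of the class omitted)
};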
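
To make the dependence concrete, the arithmetic can be sketched as below. The numbers are assumptions for illustration only: a 4-byte `GPUTPCHitId`, a hypothetical HITS_REG_SIZE of 20, no struct padding, and a 48 KiB static shared memory limit per block. The function names are hypothetical as well.

```python
HEADER_BYTES = 4 * 4  # the four int32_t members of GPUSharedMemory


def static_shared_bytes(block_size, hits_reg_size, hit_id_bytes=4):
    """Static shared memory of the struct for a given block size (padding ignored)."""
    return HEADER_BYTES + hits_reg_size * block_size * hit_id_bytes


def largest_fitting_block_size(shared_limit, hits_reg_size, warp_size=32, hit_id_bytes=4):
    """Largest warp-multiple block size whose mHits array still fits the limit."""
    per_thread = hits_reg_size * hit_id_bytes
    threads = (shared_limit - HEADER_BYTES) // per_thread
    return (threads // warp_size) * warp_size


limit = 48 * 1024  # assumed static shared memory limit per block
print(largest_fitting_block_size(limit, hits_reg_size=20))
print(static_shared_bytes(640, hits_reg_size=20) > limit)
```

Under these assumptions the largest usable block size is 608, so sampling the top of the range (20 * warp size = 640) would exceed the limit no matter which min_blocks_per_sm is chosen.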

Thus, if the first RT compilation fails:

  1. Inspect the compilation logs
  2. Identify the failing kernels
  3. Mark the steps containing failing kernels as bad and RT-compile again without those steps
  4. Update the per-kernel cache with the max block size that can be sampled
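
The fallback above could be sketched roughly as follows. All names are hypothetical (including the step names and the second kernel), the set of failing kernels is assumed to have already been extracted from the logs, and lowering the bound by one warp per failure is just one possible update rule.

```python
class BlockSizeCache:
    """Per-kernel upper bound on the block size that may still be sampled."""

    def __init__(self, warp_size, global_max):
        self.warp_size = warp_size
        self.global_max = global_max
        self._max = {}

    def upper_bound(self, kernel):
        return self._max.get(kernel, self.global_max)

    def record_failure(self, kernel, failed_block_size):
        # The failed size is known to be too large; allow at most one warp less,
        # and never raise a bound that an earlier failure already lowered.
        new_bound = max(self.warp_size, failed_block_size - self.warp_size)
        self._max[kernel] = min(self.upper_bound(kernel), new_bound)


def split_steps(steps, failing_kernels):
    """Separate steps that can be recompiled from those containing a failing kernel."""
    good = {name: kernels for name, kernels in steps.items()
            if not failing_kernels & set(kernels)}
    bad = sorted(set(steps) - set(good))
    return good, bad


cache = BlockSizeCache(warp_size=32, global_max=20 * 32)
steps = {
    "step_a": ["krnl_GPUTPCTrackletSelector"],  # hypothetical step names
    "step_b": ["krnl_other"],                   # hypothetical kernel
}
failing = {"krnl_GPUTPCTrackletSelector"}

good, bad = split_steps(steps, failing)
for kernel in failing:
    cache.record_failure(kernel, failed_block_size=640)
print(bad, cache.upper_bound("krnl_GPUTPCTrackletSelector"))
```

The cached bound could then serve as the `high` argument of `trial.suggest_int` for that kernel's block size in later trials, so the same failure is not sampled twice.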

 

SUCCESS: the tuner now finds optimal parameter sets within GPU limits