News from GPU parameter tuning

Used GPUCA_KERNEL_RESOURCE_USAGE_VERBOSE 1
For the 750 kHz pp dataset, the tuner selected block_size = 512, grid_size = 540 as the best configuration for GMMergerCollect
This translates to 512, 9 in the parameter header (540 blocks / 60 CUs = 9 blocks per CU) --> try to fit 9 blocks of 512 threads each on each CU
What GPUCA_KERNEL_RESOURCE_USAGE_VERBOSE said: Occupancy [waves/SIMD]: 9
On MI50: 4 SIMD per CU
Number of threads residing on the GPU at the same time: 64 threads per wave * 9 waves per SIMD * 4 SIMD per CU * 60 CUs in a MI50 = 64 * 9 * 4 * 60 = 138 240
Number of requested threads: 512 threads per block * 9 blocks per CU * 60 CUs in a MI50 = 276 480
Resident threads / requested threads = 138 240 / 276 480 = 0.5
So only half of the grid can reside on the GPU at the same time
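
The arithmetic above can be reproduced with a short script. This is only a sketch of the calculation in these notes, not part of the tuner: the function and constant names are made up for illustration, and the hardware figures (64 threads per wave, 4 SIMDs per CU, 60 CUs on an MI50) are the ones quoted above.

```python
# Hardware figures for the MI50, as quoted in the notes above.
WAVE_SIZE = 64      # threads per wave
SIMDS_PER_CU = 4    # SIMD units per compute unit
NUM_CUS = 60        # compute units on the GPU


def blocks_per_cu(grid_size, num_cus=NUM_CUS):
    """Blocks per CU implied by a total grid size (hypothetical helper)."""
    return grid_size // num_cus


def resident_threads(waves_per_simd, wave_size=WAVE_SIZE,
                     simds_per_cu=SIMDS_PER_CU, num_cus=NUM_CUS):
    """Threads that can reside on the GPU at once for a given occupancy."""
    return wave_size * waves_per_simd * simds_per_cu * num_cus


def requested_threads(block_size, n_blocks_per_cu, num_cus=NUM_CUS):
    """Threads requested by the launch configuration."""
    return block_size * n_blocks_per_cu * num_cus


# Values from the GMMergerCollect example: block_size = 512, grid_size = 540,
# reported occupancy of 9 waves per SIMD.
per_cu = blocks_per_cu(540)
resident = resident_threads(9)
requested = requested_threads(512, per_cu)
fraction = resident / requested

print(f"blocks per CU:     {per_cu}")
print(f"resident threads:  {resident}")
print(f"requested threads: {requested}")
print(f"resident fraction: {fraction}")
```

With the numbers above this gives 9 blocks per CU, 138 240 resident threads, 276 480 requested threads, and a resident fraction of 0.5.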

HS23 Contribution from ALICE

Chatted with Robin yesterday. They are inclined to use the CI container directly.

CUDA Compatibility guarantees allow for upgrading only certain components.
- Backwards compatibility ensures that a newer NVIDIA driver can be used with an older CUDA Toolkit.
- Minor version compatibility and forward compatibility ensure that an older NVIDIA driver can be used with a newer CUDA Toolkit (up to a certain version).
FAQ: Does CUDA compatibility work with containers? Yes, when using containers that are based on the official CUDA base images.

Are there differences between the sync and async runs of the benchmark, apart from RTC and compression/decompression?