GPU model speeds comparison
1000 evaluations of batches of size 262144, with an input size of (3, 9, 9) per element. 10 warmup evaluations were run first and excluded from the measurement. Colour is normalised per column.
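The measurement protocol above (warmup evaluations excluded, then timed evaluations) can be sketched roughly as follows. This is a generic illustration, not the actual benchmark script; `model` and `batch` are placeholders, and on a GPU you would additionally need to synchronise the device before reading the clock (e.g. `torch.cuda.synchronize()`).

```python
import time

def benchmark(model, batch, n_warmup=10, n_eval=1000):
    """Return mean seconds per evaluation, excluding warmup runs."""
    # Warmup: absorbs one-off costs (compilation, caching) so they
    # do not skew the measurement.
    for _ in range(n_warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_eval):
        model(batch)
    return (time.perf_counter() - start) / n_eval

# Usage with a trivial stand-in for the network:
mean_t = benchmark(lambda x: [v * 2 for v in x], list(range(64)))
```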


Unfortunately, model compilation for the CNN is not supported on the older AMD GPU (blank spots in the table).
Overall fastest for our deployment case: the MI300X, with the H100 in second place. The MI300X outperforms the H100 by a factor of ~2.
FP32: the CPU (32 threads) is still a factor of ~30 slower than even the slowest GPU.