GPU model speeds comparison
1000 evaluations of batches of size 262144, with an input size of (3, 9, 9) per element. 10 warmup evaluations were run first and excluded from the measurement. Colour is normalised per column.
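The measurement protocol above (warmup evaluations excluded, then timed evaluations) can be sketched roughly as follows. This is a generic illustration, not the actual benchmark script; `model` and `batch` are placeholders, and on a GPU you would additionally need to synchronise the device before reading the clock (e.g. `torch.cuda.synchronize()`).

```python
import time

def benchmark(model, batch, n_warmup=10, n_eval=1000):
    """Return mean seconds per evaluation, excluding warmup runs."""
    # Warmup: absorbs one-off costs (compilation, caching) so they
    # do not skew the measurement.
    for _ in range(n_warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_eval):
        model(batch)
    return (time.perf_counter() - start) / n_eval

# Usage with a trivial stand-in for the network:
mean_t = benchmark(lambda x: [v * 2 for v in x], list(range(64)))
```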


Unfortunately, model compilation for the CNN is not supported on the older AMD GPU (blank spots in the table).
Overall fastest for our deployment case: the MI300X, with the H100 in second place. The MI300X outperforms the H100 by a factor of ~2.
FP32: the CPU (32 threads) is still a factor of ~30 slower than even the slowest GPU.