CuBLonD usage and performance
- comparison of kick, drift, histogram, etc. on a single CPU vs. a single GPU, in single and double precision
- speed-up of up to a factor of 5 for large enough input
- mixed CPU/GPU arrays that can be accessed and updated from either
- multi-GPU scaling: 2 GPUs give less than a factor of 2
- limited by some non-parallel regions and communication time
- MPI experiments with multiple workers
- each worker calculates a local histogram; reduce_all() combines them into the global histogram
- use of approximate methods: update only every x turns ("SRP"), or assume every worker holds a representative subset ("RDS")
- 20xCPU + 1xGPU -> up to 60x speed-up with approximations
- comparison of 2013 and 2017 GPU architectures -> good scaling, future-proof
- Future work
- distributed initialisation (e.g. if the beam doesn't fit in a single memory)
- OpenCL for multi-GPU
- Experiments with equal-capability CPU & GPU -> test of load-balancing in heterogeneous HW
The code is easy to merge, so it should be integrated into the basic BLonD code. It can be used in a mixed MPI multi-CPU and GPU environment. A cluster with 4 or more GPUs would be ideal (in most heavy simulations, the CPUs were used only for small calculations).
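
The histogram scheme from the MPI experiments above can be illustrated with a minimal sketch. It is not the CuBLonD implementation: the workers are simulated in a single process, all_reduce_sum() is a hypothetical stand-in for an MPI all-reduce with a sum operation, and rds_estimate() reflects one possible reading of "RDS" (each worker scales its own local histogram by the number of workers instead of communicating). All function names, bin settings and particle counts are assumptions chosen for illustration.

```python
import random


def local_histogram(values, n_bins, lo, hi):
    """Count the values that fall into n_bins equal-width bins on [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against rounding pushing a value into bin n_bins
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


def all_reduce_sum(per_worker):
    """Hypothetical stand-in for an MPI all-reduce (sum): the element-wise
    sum of all local histograms, which every worker would receive."""
    return [sum(column) for column in zip(*per_worker)]


def rds_estimate(local_counts, n_workers):
    """Assumed reading of 'RDS': treat the local particles as representative
    of the whole beam and scale the local histogram, with no communication."""
    return [c * n_workers for c in local_counts]


random.seed(0)
n_workers, n_bins = 4, 16  # illustrative values only
particles = [random.gauss(0.0, 1.0) for _ in range(40_000)]

# Round-robin split of the beam, one chunk per simulated worker
chunks = [particles[w::n_workers] for w in range(n_workers)]

local = [local_histogram(chunk, n_bins, -4.0, 4.0) for chunk in chunks]
exact = all_reduce_sum(local)               # exact global histogram
approx = rds_estimate(local[0], n_workers)  # worker 0's approximation
```

The exact reduction reproduces the histogram of the full beam, while the scaled local histogram only approximates it; the trade-off is that the approximation needs no communication in that turn.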