Description
The ability to strong-scale is crucial for Lattice QCD simulations, and it has been a key development goal since the creation of the QUDA library for Lattice QCD on NVIDIA GPUs. Technologies such as NVLink and GPUDirect RDMA enable fast intra-node and inter-node data transfer, and QUDA makes extensive use of both. However, API overheads and the necessary synchronizations between GPU and CPU increasingly limit strong scaling with MPI communication. Fine-grained, GPU-centric communication provides a way out: it removes these bottlenecks entirely by moving communication into the GPU kernels themselves. We will discuss the techniques QUDA implements to achieve the best possible scaling with MPI, as well as novel improvements that use NVSHMEM for GPU-centric communication. Finally, we will show scaling results on x86 and POWER systems.
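
To illustrate the idea of GPU-centric communication, the following is a minimal, hypothetical sketch (not QUDA's actual implementation) of a one-dimensional halo exchange in which the GPU kernel itself pushes boundary data into a neighboring process's buffer via NVSHMEM, so no host-side MPI call or CPU/GPU synchronization is needed between the compute kernel and the data transfer. Buffer names and sizes are illustrative assumptions.

    // Hypothetical NVSHMEM halo-exchange sketch (CUDA C++), not QUDA code.
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    // Each thread writes one boundary element directly into the halo
    // region of the neighboring PE's symmetric buffer.
    __global__ void push_halo(float *field, int local_sites, int halo_sites,
                              int neighbor_pe) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < halo_sites) {
        nvshmem_float_p(field + local_sites + i,             // remote halo slot i
                        field[local_sites - halo_sites + i], // local boundary value
                        neighbor_pe);
      }
    }

    int main() {
      nvshmem_init();
      int mype = nvshmem_my_pe();
      int npes = nvshmem_n_pes();
      int neighbor = (mype + 1) % npes;  // simple ring of PEs (illustrative)

      const int local_sites = 1 << 20, halo_sites = 1024;
      // Symmetric allocation: same size on every PE, remotely accessible.
      float *field =
          (float *)nvshmem_malloc((local_sites + halo_sites) * sizeof(float));

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      push_halo<<<(halo_sites + 255) / 256, 256, 0, stream>>>(
          field, local_sites, halo_sites, neighbor);

      // Complete all device-initiated puts and synchronize all PEs on the
      // stream before the next kernel consumes the halo -- still no host MPI.
      nvshmemx_barrier_all_on_stream(stream);
      cudaStreamSynchronize(stream);

      nvshmem_free(field);
      nvshmem_finalize();
      return 0;
    }

In a host-driven MPI variant, the boundary data would first be packed by a kernel, the CPU would wait for that kernel, post MPI_Isend/MPI_Irecv, wait again, and then launch the consumer kernel; the GPU-initiated put above collapses those steps into the kernel itself, which is the bottleneck removal the talk refers to.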