Speaker
Description
In this talk I will present technical details of the project that won the ACM Gordon Bell Prize at Supercomputing 2018. In this work, a joint team from NERSC and NVIDIA scaled Deep Learning training across 27,000+ GPUs on Summit (the world’s largest HPC system) and obtained a high fraction of peak performance. The research showed that realistic scientific applications can leverage mixed precision with FP16 without loss of accuracy. A Deep Learning approach to segmentation of Atmospheric Rivers (AR) and Tropical Cyclones (TC) was applied to achieve state-of-the-art pattern detection for characterizing extreme weather. Training reached a peak (sustained) FP16 performance of 1.13 EF/s (1.0 EF/s). Achieving this Exascale breakthrough required several innovations in both software and hardware. The software (TensorFlow, NCCL, etc.) and hardware (Volta, Tensor Cores, NVLink, etc.) technologies involved will be explained during the presentation.