20–25 Oct 2019
America/Mexico_City timezone

Scaling Deep Learning to Exascale – ACM Gordon Bell Prize 2018

21 Oct 2019, 12:30
30m
Oral Plenary

Speaker

Pedro Mario Cruz e Silva (NVIDIA)

Description

In this talk I will present technical details of the project that won the ACM Gordon Bell Prize at Supercomputing 2018. In this work, a joint team from NERSC and NVIDIA scaled deep learning training across 27,000+ GPUs on Summit (the world’s largest HPC system) and obtained a high fraction of peak performance. This research showed that realistic scientific applications can leverage mixed precision with FP16 without loss of accuracy. A deep learning approach to the segmentation of Atmospheric Rivers (AR) and Tropical Cyclones (TC) was applied to achieve state-of-the-art pattern detection for characterizing extreme weather. Training reached a peak FP16 performance of 1.13 EF/s and a sustained FP16 performance of 1.0 EF/s. Achieving this exascale breakthrough required several innovations in both software and hardware. The software technologies (TensorFlow, NCCL, etc.) and hardware technologies (Volta, Tensor Cores, NVLink, etc.) involved will be explained during the presentation.
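
The abstract does not say how accuracy was preserved under FP16. One widely used ingredient of mixed-precision training is loss scaling, and the sketch below illustrates only that general idea; it is not taken from the prize-winning code. The helper names (to_fp16, fp16_gradient), the gradient value, and the scale factor 2**16 are all assumptions chosen for illustration. It uses only the Python standard library, whose struct module can round a value to half precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_gradient(grad: float, loss_scale: float = 1.0) -> float:
    """Store a scaled gradient in FP16, then unscale it in higher precision.

    Hypothetical helper: the multiplication stands in for scaling the loss
    before backpropagation, and the division stands in for unscaling before
    the update of higher-precision master weights.
    """
    stored = to_fp16(grad * loss_scale)  # value an FP16 tensor would hold
    return stored / loss_scale           # unscale in a Python float (FP64)


true_grad = 1e-8  # illustrative gradient, below the smallest FP16 subnormal

naive = fp16_gradient(true_grad)                       # no loss scaling
scaled = fp16_gradient(true_grad, loss_scale=65536.0)  # scale by 2**16

print(naive)   # 0.0: the gradient underflows in FP16 and is lost
print(scaled)  # close to 1e-8: scaling keeps it representable
```

Without scaling, the small gradient rounds to zero in FP16, so the corresponding weight would never be updated. With scaling, the stored value stays inside the FP16 range and the original magnitude is recovered to within FP16 rounding error after unscaling.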
