Oct 19 – 23, 2020
Europe/Zurich timezone

Distributed training of graph neural networks at HPC

Oct 22, 2020, 11:25 AM
Lightning talk 6: ML infrastructure (Hardware and Software for Machine Learning Workshop)


Xiangyang Ju (Lawrence Berkeley National Lab. (US))


Graph Neural Networks (GNNs) are trainable functions that operate on a graph: they learn latent graph attributes and form a parameterized message passing by which information is propagated across the graph, ultimately learning sophisticated graph attributes. Their application in High Energy Physics has grown rapidly in recent years, ranging from event reconstruction to data analysis, and from precision measurements to searches for new physics. The size and complexity of the graphs are also growing. Because the graph data structure is irregular and sparse, it imposes non-trivial computational challenges. Current AI hardware primarily focuses on accelerating dense 1D or 2D arrays, to some extent neglecting sparse and irregular tensor calculations. In this talk, we take the GNN architecture used by the Exa.TrkX collaboration for track reconstruction, together with the TrackML challenge dataset, as the benchmark for evaluating distributed training strategies and Artificial Intelligence (AI) accelerators. We study different AI accelerators, both in the cloud and at a High Performance Computing center, as well as different distributed training strategies for GNNs and the scalability of these strategies on the different accelerators. Finally, the talk ends with an outlook on deploying GNNs for real-time data processing.
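To make the message-passing idea concrete, the following is a minimal sketch of one round of message passing on a toy graph, written with plain NumPy. It is an illustration only: the weight matrices, the sum aggregation, and the tanh update are generic assumptions, not the actual Exa.TrkX tracking architecture described in the talk.

```python
import numpy as np

# Toy graph: 4 nodes with 3-dim features, directed edges (src -> dst).
x = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 0.]])
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])  # (src, dst) pairs

rng = np.random.default_rng(0)
W_msg = rng.normal(size=(3, 3))   # "learnable" message weights (random here)
W_upd = rng.normal(size=(6, 3))   # "learnable" update weights

def message_passing(x, edges):
    """One round of message passing: each node sums transformed
    messages from its incoming neighbors, then updates its state."""
    src, dst = edges[:, 0], edges[:, 1]
    msgs = np.zeros_like(x)
    # Messages are a linear transform of the source-node features,
    # scatter-added into the destination nodes (handles repeated dst).
    np.add.at(msgs, dst, x[src] @ W_msg)
    # Update: combine current state with aggregated messages.
    return np.tanh(np.concatenate([x, msgs], axis=1) @ W_upd)

h = message_passing(x, edges)
print(h.shape)  # one new latent attribute vector per node: (4, 3)
```

The irregularity the abstract refers to shows up in the scatter-add (`np.add.at`): unlike a dense matrix multiply, its memory access pattern depends on the edge list, which is why sparse GNN workloads map poorly onto accelerators tuned for dense arrays.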

Primary author

Xiangyang Ju (Lawrence Berkeley National Lab. (US))

Presentation materials