MPI-learn and MPI-opt are libraries to perform large-scale training and hyper-parameter optimization for deep neural networks. The two libraries, based on Message Passing Interface, allows to perform these tasks on GPU clusters, through different kinds of parallelism. The main characteristic of these libraries is their flexibility: the user has complete freedom in building her own model, thanks to the multi-backend support. In addition, the library supports several cluster architectures, allowing a deployment on multiple platforms. This generality can make this the basis for a train & optimise service for the HEP community. We present scalability results obtained from two typical HEP use-case: jet identification from raw data and shower generation from a GAN model. Results on GPU clusters were obtained at the ORNL TITAN supercomputer ad other HPC facilities, as well as exploiting commercial cloud resources and OpenStack. A comprehensive comparisons of scalability performance across platforms will be presented, together with a detailed description of the libraries and their functionalities.
|Consider for promotion||Yes|