19–25 Oct 2024

Improving overall GPU sharing and usage efficiency with Kubernetes

23 Oct 2024, 14:42
18m
Room 2.A (Seminar Room)

Talk | Track 7 - Computing Infrastructure | Parallel (Track 7)

Speaker

Diana Gaponcic (IT-PW-PI)

Description

GPUs and accelerators are changing traditional High Energy Physics (HEP) deployments while also being key to enabling efficient machine learning. The challenge remains to improve the overall efficiency and sharing of what are currently expensive and scarce resources.

In this paper we describe the common patterns of GPU usage in HEP, including interactive access with spiky requirements and low overall usage, as well as more predictable but potentially bursty workloads such as distributed machine learning. We then explore the mechanisms available to share and partition GPUs, covering time slicing, virtualization, physical partitioning (Multi-Instance GPU, MIG) and the Multi-Process Service (MPS) on NVIDIA devices.
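
As an illustration of the physical-partitioning approach, the sketch below (not taken from the paper) shows how a Kubernetes workload can request a single MIG slice instead of a whole GPU, using the official kubernetes Python client. The resource name nvidia.com/mig-1g.5gb assumes an A100 partitioned with MIG and exposed through the device plugin's "mixed" strategy; the pod, image and namespace names are placeholders only.

# Minimal sketch: request one MIG slice (1g.5gb profile) rather than a full GPU.
# Assumes a cluster where the NVIDIA device plugin advertises MIG devices
# under the "mixed" strategy; pod/image/namespace names are illustrative.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-test",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi", "-L"],
                resources=client.V1ResourceRequirements(
                    # One 1g.5gb MIG slice (roughly 1/7 of an A100) instead of a whole GPU.
                    limits={"nvidia.com/mig-1g.5gb": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)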

We conclude with the results of an extensive set of benchmarks covering multiple representative HEP use cases, including traditional GPU workloads as well as machine learning. We highlight the limitations of each option and the use cases where it fits best. Finally, we cover the deployment aspects and the different options available for a centralized GPU pool that can significantly improve overall GPU usage efficiency.
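
As a hedged sketch of one such deployment option, the snippet below publishes a time-slicing configuration for the NVIDIA Kubernetes device plugin as a ConfigMap, so that each physical GPU in a shared pool is advertised as several schedulable devices. The sharing.timeSlicing schema follows the plugin's documented format; the ConfigMap name, key and namespace are illustrative and not taken from the paper.

# Sketch: enable time slicing on a shared GPU pool by publishing the NVIDIA
# device plugin's sharing config as a ConfigMap. ConfigMap name, key and
# namespace are placeholders; the replica count is an example value.
from kubernetes import client, config

config.load_kube_config()

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
"""

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="nvidia-device-plugin-config"),
    data={"config.yaml": TIME_SLICING_CONFIG},
)

client.CoreV1Api().create_namespaced_config_map(
    namespace="gpu-operator", body=configmap
)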
