11–15 Mar 2024
Charles B. Wang Center, Stony Brook University
US/Eastern timezone

Pinpoint resource allocation for GPU batch applications

14 Mar 2024, 15:10
20m
Theatre ( Charles B. Wang Center, Stony Brook University )

100 Circle Rd, Stony Brook, NY 11794
Oral Track 1: Computing Technology for Physics Research

Speaker

Tim Voigtlaender (KIT - Karlsruhe Institute of Technology (DE))

Description

With the increasing use of Machine Learning (ML) in High Energy Physics (HEP), the breadth of new analyses has grown, along with a large spread in their compute resource requirements, especially when it comes to GPU resources. For institutes like the Karlsruhe Institute of Technology (KIT), which provide GPU compute resources to HEP via their batch systems or the Grid, high throughput as well as energy-efficient usage of their systems is of the essence. For low-intensity GPU analyses in particular, standard scheduling creates inefficiencies, as resources are over-assigned to such workflows. An approach that is flexible enough to cover the entire spectrum, from multiple processes per GPU to multiple GPUs per process, is necessary. As a follow-up to the techniques presented at ACAT 2022, this time we study Nvidia's Multi-Process Service (MPS), its ability to securely distribute device memory, and its interplay with the KIT HTCondor batch system. A number of ML applications were benchmarked using this less demanding and more flexible approach to illustrate the performance implications regarding throughput and energy efficiency.
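To make the MPS mechanism concrete: an MPS control daemon can be started on a node and given a per-client device memory cap, so several batch jobs can share one GPU without any single client exhausting its memory. The sketch below is illustrative, not the authors' actual KIT configuration; the directory paths and the 2 GB limit are assumptions, and the `nvidia-cuda-mps-control` calls are guarded so the script is harmless on a machine without the tool.

```shell
#!/bin/sh
# Illustrative MPS setup sketch (paths and limits are assumptions).
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe   # clients find the daemon here
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Only attempt to start MPS where the control tool actually exists.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    # Start the control daemon in background (daemon) mode.
    nvidia-cuda-mps-control -d
    # Cap each MPS client at 2 GB of device 0 memory
    # (requires CUDA >= 11.5 and a Volta-or-newer GPU).
    echo "set_default_device_pinned_mem_limit 0 2G" | nvidia-cuda-mps-control
fi
```

Jobs launched with the same `CUDA_MPS_PIPE_DIRECTORY` then run as MPS clients on the shared GPU; `echo quit | nvidia-cuda-mps-control` shuts the daemon down when the node drains.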

References

https://indico.cern.ch/event/1106990/contributions/4991345/

Significance

Batch systems are crucial for the efficient and high-throughput computing that is required in modern high energy physics. Often, these batch systems are limited by their coarse granularity. Especially for GPU resources, the safe sharing of high performance datacenter GPUs is necessary to avoid gross over-allocation of costly hardware, while still allowing for workflows that require multiple GPUs at once. Nvidia's multi-process service (MPS) enables this kind of flexibility, and we therefore consider it a valuable tool for our goal of high throughput and high energy efficiency.
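On the batch-system side, finer granularity means a job can state how much of a GPU it actually needs. The submit-description fragment below is a hypothetical sketch of such a request, not the authors' configuration; the executable name and the 4096 MB threshold are illustrative assumptions, while `request_gpus` and `require_gpus` are standard knobs in recent HTCondor releases.

```
# Hypothetical HTCondor submit description for a low-intensity GPU job.
universe       = vanilla
executable     = train.sh
request_cpus   = 2
request_memory = 8 GB
request_gpus   = 1
# Match only GPUs with at least 4 GB of device memory; attributes such as
# GlobalMemoryMb come from condor_gpu_discovery.
require_gpus   = GlobalMemoryMb >= 4096
queue
```

With MPS on the execute node, several such jobs can be mapped onto one physical GPU instead of each monopolizing a full device.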

Experiment context: CMS

Primary author

Tim Voigtlaender (KIT - Karlsruhe Institute of Technology (DE))

Co-authors

Gunter Quast (KIT - Karlsruhe Institute of Technology (DE))
Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
Matthias Jochen Schnepf
Roger Wolf (KIT - Karlsruhe Institute of Technology (DE))

Presentation materials