25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

Enabling monitoring of GPU accelerators in the ALICE Grid

28 May 2026, 17:09
18m
MHMK 202

Oral Presentation Track 4 - Distributed computing

Speaker

Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))

Description

The ALICE Collaboration actively relies on accelerators, such as GPUs, to handle increasingly complex workflows and data rates. Such resources have rapidly risen in importance across a number of use cases, and their emergence is reflected in their availability in the WLCG. Through broader vendor support, as well as improved matching techniques, the ALICE Grid middleware can allocate and use these resources like any other. Yet unlike traditional CPU workloads, the utilisation of GPU resources cannot be trivially tracked through the kernel alone, and generally requires interacting with various drivers and kernel modules. These vary not only between vendors, but also between architectures and driver versions. This poses challenges for providing both accurate resource accounting and monitoring of GPU workloads across the Grid.

This contribution outlines an updated middleware stack for ALICE, capable not only of allocating individual GPUs, but also of providing a monitoring interface that works across GPU resources. Specifically, it describes how the stack exposes these resources in a unified manner that is agnostic to both vendor and driver version, avoiding the need to tailor to multiple vendor-specific APIs. Furthermore, it examines how the resulting monitoring data can be exposed to the MonALISA monitoring infrastructure of ALICE, in turn allowing both GPU load and virtual memory to be tracked across the ALICE Grid, just as for any other resource.
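
As a rough illustration of what a unified, vendor-agnostic view of GPU metrics could look like, the sketch below flattens per-device samples into name/value pairs of the kind a monitoring service typically ingests. It is not taken from the ALICE middleware: the `GpuSample` record, the `to_monitoring_params` helper and the parameter names are all hypothetical, and the way samples are collected from drivers or forwarded to MonALISA is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GpuSample:
    """Vendor-neutral snapshot of one GPU (hypothetical record layout)."""
    index: int                # device index on the worker node
    load_percent: float       # utilisation, 0-100
    memory_used_bytes: int    # device memory currently in use
    memory_total_bytes: int   # device memory capacity


def to_monitoring_params(samples: Iterable[GpuSample]) -> Dict[str, float]:
    """Flatten samples into name/value pairs and add node-level aggregates.

    The key names are assumptions made for this sketch only.
    """
    params: Dict[str, float] = {}
    total_load = 0.0
    total_memory_used = 0
    count = 0
    for sample in samples:
        prefix = f"gpu{sample.index}"
        params[f"{prefix}_load_percent"] = sample.load_percent
        params[f"{prefix}_memory_used_bytes"] = float(sample.memory_used_bytes)
        total_load += sample.load_percent
        total_memory_used += sample.memory_used_bytes
        count += 1
    if count:
        # Aggregates are only meaningful when at least one GPU was sampled.
        params["gpu_mean_load_percent"] = total_load / count
        params["gpu_total_memory_used_bytes"] = float(total_memory_used)
    return params


if __name__ == "__main__":
    demo = [
        GpuSample(0, 50.0, 1024, 4096),
        GpuSample(1, 100.0, 2048, 4096),
    ]
    for name, value in sorted(to_monitoring_params(demo).items()):
        print(name, value)
```

Keeping the record free of vendor-specific fields means that whichever component gathers the data can change without affecting what is reported upstream.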

Author

Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))
