28th Conference on Computing in High Energy and Nuclear Physics (CHEP 2026)

Name: 28th Conference on Computing in High Energy and Nuclear Physics (CHEP 2026)
Start: 2026-05-25T08:00:00+07:00
End: 2026-05-29T14:00:00+07:00
Location: Chulalongkorn University

Asia/Bangkok timezone

Enabling monitoring of GPU accelerators in the ALICE Grid

28 May 2026, 17:09

18m

MHMK 202

Oral Presentation, Track 4 - Distributed computing

Speaker: Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))

The ALICE Collaboration actively relies on accelerators, such as GPUs, to handle increasingly complex workflows and data rates. Such resources have rapidly risen in importance across a number of use cases, and their emergence is reflected in their availability in the WLCG. Through broader vendor support, as well as improved matching techniques, the ALICE Grid middleware may allocate and use these resources like any other. Yet unlike traditional CPU workloads, the utilisation of GPU resources cannot be trivially tracked solely through the kernel, and generally requires interacting with various drivers and kernel modules. These vary not only between vendors, but also between architectures and driver versions. This poses challenges to providing both accurate resource accounting and monitoring for GPU workloads across the Grid.

This contribution outlines an updated middleware stack for ALICE, capable of not only allocating individual GPUs, but also providing a monitoring interface that works across GPU resources. Specifically, it describes how the stack exposes these resources in a unified manner that is agnostic to both vendor and driver version, avoiding the need to tailor to multiple vendor-specific APIs. Furthermore, it examines how the resulting monitoring data can be exposed to the MonALISA monitoring infrastructure of ALICE. This, in turn, allows tracking of both GPU load and virtual memory across the ALICE Grid, just as for any other resource.
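As a rough illustration of what a vendor-agnostic monitoring layer could look like, the sketch below puts each vendor's driver behind a common adapter and flattens the readings into name/value pairs. It is not the ALICE middleware: every name in it (`GpuSample`, `GpuBackend`, `StaticBackend`, `collect`, `to_metrics`) and the metric naming scheme are assumptions made for this example, and the backend returns fixed values rather than querying a real driver or publishing to MonALISA.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class GpuSample:
    """Vendor-neutral reading for one device (hypothetical schema)."""
    device_id: str
    load_percent: float
    memory_used_mib: float


class GpuBackend(ABC):
    """Hypothetical adapter that hides one vendor's driver interface."""

    @abstractmethod
    def available(self) -> bool:
        """Return True if this vendor's devices can be queried on this host."""

    @abstractmethod
    def sample(self) -> List[GpuSample]:
        """Return one reading per visible device."""


class StaticBackend(GpuBackend):
    """Stand-in backend with fixed readings, used here in place of a real driver."""

    def __init__(self, samples: Iterable[GpuSample], present: bool = True):
        self._samples = list(samples)
        self._present = present

    def available(self) -> bool:
        return self._present

    def sample(self) -> List[GpuSample]:
        return list(self._samples)


def collect(backends: Iterable[GpuBackend]) -> List[GpuSample]:
    """Gather readings from every backend that reports itself available."""
    readings: List[GpuSample] = []
    for backend in backends:
        if backend.available():
            readings.extend(backend.sample())
    return readings


def to_metrics(samples: Iterable[GpuSample]) -> Dict[str, float]:
    """Flatten readings into name -> value pairs for a monitoring publisher."""
    metrics: Dict[str, float] = {}
    for s in samples:
        metrics[f"gpu_{s.device_id}_load"] = s.load_percent
        metrics[f"gpu_{s.device_id}_mem_mib"] = s.memory_used_mib
    return metrics
```

In a layout like this, the caller only ever sees `GpuSample` objects, so supporting another vendor or driver version would mean adding one more `GpuBackend` subclass without touching the collection or publishing code.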

Author: Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))

Co-author: Latchezar Betev (CERN)

Presentation materials: CHEP2026_ALICE_GPUMon.pdf
