Description
The ALICE Collaboration actively relies on accelerators, such as GPUs, to handle increasingly complex workflows and data rates. Such resources have rapidly risen in importance across a number of use cases, and their emergence is reflected in their availability in the WLCG. Through broader vendor support, as well as improved matching techniques, the ALICE Grid middleware can allocate and use these resources like any other. Yet unlike traditional CPU workloads, the utilisation of GPU resources cannot be trivially tracked through the kernel alone, and generally requires interacting with various drivers and kernel modules. These vary not only between vendors, but also between architectures and driver versions. This poses challenges for both accurate resource accounting and monitoring of GPU workloads across the Grid.
This contribution outlines an updated middleware stack for ALICE, capable not only of allocating individual GPUs, but also of providing a monitoring interface that works across GPU resources. Specifically, it describes how the stack exposes these resources in a unified manner that is agnostic to both vendor and driver version, avoiding the need to tailor to multiple vendor-specific APIs. Furthermore, it examines how the resulting monitoring data can be exposed to the MonALISA monitoring infrastructure of ALICE. This, in turn, allows both GPU load and virtual memory to be tracked across the ALICE Grid, just as for any other resource.