19–25 Oct 2024
Europe/Zurich timezone

Unprivileged subdivision of job resources within the ALICE Grid

23 Oct 2024, 14:06
18m
Room 2.B (Conference Room)


Talk Track 4 - Distributed Computing Parallel (Track 4)

Speaker

Maksim Melnik Storetvedt (CERN)

Description

Job pilots in the ALICE Grid are increasingly responsible for managing the resources given to each job slot. With the emergence of more complex, multicore-oriented workflows, this has become increasingly challenging, as users often request arbitrary amounts of resources, in particular CPU and memory. The problem is further exacerbated by several user payloads often running in parallel in the same slot, and by useful management utilities generally needing elevated privileges to function.

To ease resource management within each job slot, the ALICE Grid has begun utilising features introduced in recent Linux kernels, such as Cgroups v2, which provide fine-grained resource controls. Because specific controllers can be delegated down a Cgroup hierarchy, users can access and tune these resource controls as needed - unprivileged. Used in conjunction with the ALICE job pilot, this allows each job slot to be subpartitioned. In turn, the pilot can act as its own local resource management system within its given slot - with a full “box-in” of each subjob to its own subset of the given resources.
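As a rough illustration of the subpartitioning idea, the sketch below creates a child Cgroup per subjob and writes CPU and memory limits through the standard Cgroups v2 interface files (cgroup.subtree_control, cpu.max, memory.max, cgroup.procs). It is not the ALICE pilot's implementation: the function names, the "subjob-1" label and the chosen limits are hypothetical, and the demo runs against a scratch directory rather than a real delegated Cgroup.

```python
import tempfile
from pathlib import Path


def create_subjob_group(slot_dir, name, cpu_cores, memory_bytes, period_us=100_000):
    """Create a child cgroup under slot_dir with CPU and memory limits.

    slot_dir is assumed to be a cgroup v2 directory delegated to the
    current (unprivileged) user. On a real system the slot cgroup must
    not hold processes itself once controllers are enabled for its
    children, so the pilot would first move into a leaf cgroup.
    """
    slot_dir = Path(slot_dir)
    # Make the cpu and memory controllers available to child cgroups.
    (slot_dir / "cgroup.subtree_control").write_text("+cpu +memory")

    group = slot_dir / name
    group.mkdir(exist_ok=True)

    # cpu.max takes "<quota> <period>" in microseconds; 2 cores with a
    # 100000 us period becomes "200000 100000".
    quota_us = int(cpu_cores * period_us)
    (group / "cpu.max").write_text(f"{quota_us} {period_us}")

    # memory.max is a hard limit in bytes.
    (group / "memory.max").write_text(str(memory_bytes))
    return group


def attach_process(group, pid):
    """Move a process into the subjob's cgroup."""
    (Path(group) / "cgroup.procs").write_text(str(pid))


if __name__ == "__main__":
    # Dry run against a scratch directory (hypothetical stand-in for a
    # delegated cgroup path such as one below /sys/fs/cgroup).
    with tempfile.TemporaryDirectory() as scratch:
        g = create_subjob_group(scratch, "subjob-1", cpu_cores=2, memory_bytes=4 * 1024**3)
        print(g.name, (g / "cpu.max").read_text(), (g / "memory.max").read_text())
```

Taking the slot directory as a parameter keeps the sketch testable without privileges; on a real worker node the same writes would only succeed if the site has delegated the relevant controllers to the pilot's user.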

This contribution describes the updated ALICE job pilot and its management and delegation process. Specifically, it covers how the pilot utilises kernel features to create individual resource groups for its jobs while accommodating the variety of configurations and computing elements used across participating sites, enabling these features to be used across the ALICE Grid.
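Since sites differ in what they expose, one plausible first step is to check which controllers are actually available in the slot's Cgroup before attempting any subdivision. The sketch below reads the Cgroups v2 cgroup.controllers file, which lists the available controllers separated by spaces. The helper names and the default set of required controllers are assumptions for illustration, not the pilot's actual logic.

```python
from pathlib import Path


def available_controllers(cgroup_dir):
    """Return the set of controllers listed in cgroup.controllers.

    An empty set is returned when the file is missing, for example on a
    host without a cgroup v2 hierarchy at the given path.
    """
    controllers_file = Path(cgroup_dir) / "cgroup.controllers"
    if not controllers_file.is_file():
        return set()
    return set(controllers_file.read_text().split())


def can_subdivide(cgroup_dir, required=("cpu", "memory")):
    """Check whether every required controller is available here."""
    return set(required) <= available_controllers(cgroup_dir)
```

A pilot following this pattern could fall back to running without per-subjob limits whenever can_subdivide returns False, rather than failing on sites with older kernels or without delegation.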
