Description
Job pilots in the ALICE Grid are increasingly responsible for managing the resources given to each job slot. With the emergence of more complex, multicore-oriented workflows, this has become increasingly challenging, as users often request arbitrary amounts of resources, in particular CPU and memory. The problem is exacerbated by several user payloads often running in parallel in the same slot, and by useful management utilities generally needing elevated privileges to function.
To ease resource management within each job slot, the ALICE Grid has begun utilising features introduced in more recent Linux kernels, such as Cgroups v2, which provide fine-grained resource controls. Because specific controllers can be delegated down a Cgroup hierarchy, users can access and tune these resource controls as needed, without elevated privileges. Used in conjunction with the ALICE job pilot, this allows each job slot to be subpartitioned. In turn, the pilot can act as its own local resource management system within its slot - with a full “box-in” of each subjob to its own subset of the given resources.
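As an illustration only, and not the ALICE pilot's actual code, the subpartitioning idea can be sketched against the standard cgroup v2 interface files (`cgroup.subtree_control`, `cpu.max`, `memory.max`, `cgroup.procs`). The function names, the `subjob-1` group name and the chosen limits are hypothetical. The sketch assumes that the slot's cgroup directory has already been delegated to the unprivileged user, and that the pilot itself sits in a leaf group, since a non-root cgroup v2 group cannot both hold processes and enable these controllers for its children.

```python
import os
import tempfile
from pathlib import Path


def cpu_max_value(cores: float, period_us: int = 100_000) -> str:
    """Format a cpu.max entry as '<quota> <period>', both in microseconds."""
    return f"{int(cores * period_us)} {period_us}"


def create_subjob_group(slot: Path, name: str, cores: float, memory_bytes: int) -> Path:
    """Create a child cgroup under the delegated slot cgroup and cap its CPU and memory.

    `slot` is assumed to be a cgroup v2 directory delegated to the current user.
    """
    # Make the cpu and memory controllers available to children of the slot.
    (slot / "cgroup.subtree_control").write_text("+cpu +memory")

    # On a real cgroup filesystem, creating the directory creates the group
    # and the kernel populates its interface files.
    group = slot / name
    group.mkdir(exist_ok=True)

    # Box the subjob in: a CPU bandwidth limit and a hard memory limit.
    (group / "cpu.max").write_text(cpu_max_value(cores))
    (group / "memory.max").write_text(str(memory_bytes))
    return group


def attach(group: Path, pid: int) -> None:
    """Move a process into the group by writing its PID to cgroup.procs."""
    (group / "cgroup.procs").write_text(str(pid))


if __name__ == "__main__":
    # Dry run against a temporary directory standing in for the slot cgroup,
    # so the sketch can be executed without a delegated hierarchy.
    with tempfile.TemporaryDirectory() as tmp:
        slot = Path(tmp)
        group = create_subjob_group(slot, "subjob-1", cores=2, memory_bytes=4 * 1024**3)
        attach(group, os.getpid())
        print("cpu.max    =", (group / "cpu.max").read_text())
        print("memory.max =", (group / "memory.max").read_text())
```

On a real system, `slot` would point at the delegated directory under the cgroup v2 mount point (commonly `/sys/fs/cgroup`), and the PID passed to `attach` would be that of the launched payload rather than the pilot itself.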
This contribution describes the updated ALICE job pilot and its management and delegation process. Specifically, it shows how the pilot utilises kernel features to create individual resource groups for its jobs while accommodating the variety of configurations and computing elements used across participating sites, enabling these features to be used across the ALICE Grid.