25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

Job scheduling optimization for heterogeneous resources in the ALICE Grid

25 May 2026, 16:33
18m
Chulalongkorn University

Oral Presentation, Track 4 - Distributed computing

Speaker

Maria-Elena Mihailescu (National University of Science and Technology POLITEHNICA Bucharest (RO))

Description

Authors: Maria-Elena Mihăilescu (National University of Science and Technology Politehnica Bucharest, maria.mihailescu@upb.ro), Costin Grigoraș (CERN, costin.grigoras@cern.ch), Latchezar Betev (CERN, latchezar.betev@cern.ch), Mihai Carabaș (National University of Science and Technology Politehnica Bucharest, mihai.carabas@upb.ro)
on behalf of the ALICE Collaboration

JAliEn functions as the middleware backbone for the ALICE Grid, managing modules such as job scheduling, accounting, monitoring, and isolated execution environments. Currently, every payload is assigned a strict Time To Live (TTL), while execution hosts operate within fixed availability windows (typically 24 to 72 hours).

Because Grid resources are heterogeneous (varying in hardware, software, and system load), job TTLs are statically configured to accommodate the slowest hosts. This approach is suboptimal: high-performance hosts are often unable to schedule payloads with these inflated TTLs because the remaining time in their open slots is shorter than the requested TTL. Conversely, when these jobs do run on faster hosts, they finish significantly earlier than the static TTL, resulting in underutilized slot time.

This contribution presents scheduling optimizations implemented in JAliEn to enhance Grid resource efficiency. We first analyze the current scheduling algorithm, demonstrating that job rejection is frequently caused by a mismatch between the requested TTL and the remaining slot time, rather than a lack of hardware resources (CPU/Disk). At the same time, historical data indicate that most jobs complete well within their assigned TTL.
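The mismatch described above can be illustrated with a minimal sketch of a matching check. This is not JAliEn code: the `Slot` and `Job` types, their fields, and `rejection_reason` are hypothetical names chosen for illustration, and the real matching logic may consider more criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Hypothetical view of an open execution slot on a host."""
    remaining_s: int      # time left in the host's availability window
    free_cpu: int         # free CPU cores
    free_disk_gb: float   # free scratch disk


@dataclass
class Job:
    """Hypothetical view of a waiting payload."""
    ttl_s: int            # requested Time To Live
    cpu: int              # requested CPU cores
    disk_gb: float        # requested scratch disk


def rejection_reason(job: Job, slot: Slot) -> Optional[str]:
    """Return None if the job fits the slot, else the first failing criterion."""
    if job.cpu > slot.free_cpu:
        return "cpu"
    if job.disk_gb > slot.free_disk_gb:
        return "disk"
    if job.ttl_s > slot.remaining_s:
        # Hardware is sufficient; only the requested TTL exceeds the slot time.
        return "ttl"
    return None


# A slot with 6 hours left and plenty of hardware headroom.
slot = Slot(remaining_s=6 * 3600, free_cpu=8, free_disk_gb=100.0)
inflated = Job(ttl_s=12 * 3600, cpu=8, disk_gb=10.0)   # static TTL for slow hosts
realistic = Job(ttl_s=4 * 3600, cpu=8, disk_gb=10.0)   # TTL closer to actual runtime
print(rejection_reason(inflated, slot), rejection_reason(realistic, slot))
```

In this sketch the same payload is rejected with the inflated TTL and accepted with the shorter one, even though CPU and disk requirements are identical.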

To address this, we propose two optimization strategies for Monte Carlo simulations and I/O-intensive payloads: historical prediction, which predicts job TTL based on the execution history of jobs with similar characteristics (production type, CPU model, and site configuration), and data scaling, which scales the TTL based on the ratio of assigned input data to the maximum possible data load.
Combined, these approaches maximize the utilization of batch queue slots and improve global resource usage across the ALICE Grid.
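The two strategies can be sketched as follows. The abstract does not specify how the prediction or the scaling is computed, so everything concrete here is an assumption made for illustration: the `TtlHistory` class, the use of a high quantile of past wall times with a safety margin, the minimum sample count, the linear scaling, and the lower bound `floor_s`.

```python
import math
from collections import defaultdict


class TtlHistory:
    """Hypothetical store of past wall times, keyed by job characteristics."""

    def __init__(self):
        self._runs = defaultdict(list)

    def record(self, key, wall_s):
        # key is assumed to be (production type, CPU model, site)
        self._runs[key].append(wall_s)

    def predict(self, key, static_ttl_s, quantile=0.95, margin=1.25,
                min_samples=20):
        """Historical prediction: a high quantile of past runs plus a margin.

        Falls back to the static TTL when the history is too short, and
        never returns more than the static TTL.
        """
        runs = sorted(self._runs.get(key, ()))
        if len(runs) < min_samples:
            return static_ttl_s
        idx = min(len(runs) - 1, math.ceil(quantile * len(runs)) - 1)
        return min(static_ttl_s, math.ceil(runs[idx] * margin))


def scale_ttl_by_data(static_ttl_s, input_bytes, max_input_bytes, floor_s=600):
    """Data scaling: shrink the TTL by the ratio of assigned to maximum input.

    Linear scaling and the floor are assumptions; the ratio is clamped to [0, 1].
    """
    if max_input_bytes <= 0:
        return static_ttl_s
    ratio = min(1.0, max(0.0, input_bytes / max_input_bytes))
    return max(min(static_ttl_s, floor_s), math.ceil(static_ttl_s * ratio))


history = TtlHistory()
key = ("MC-production", "example-cpu-model", "example-site")  # placeholder values
for _ in range(20):
    history.record(key, 1000)

predicted = history.predict(key, static_ttl_s=86400)
scaled = scale_ttl_by_data(86400, input_bytes=25, max_input_bytes=100)
print(predicted, scaled)
```

Capping both results at the static TTL keeps the sketch conservative: a shortened TTL can make a job eligible for more slots, while the existing upper bound is never exceeded.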

Authors

Costin Grigoras (CERN), Latchezar Betev (CERN), Maria-Elena Mihailescu (National University of Science and Technology POLITEHNICA Bucharest (RO)), Mihai Carabas (National University of Science and Technology POLITEHNICA Bucharest (RO))
