Conference on Computing in High Energy and Nuclear Physics

Name: Conference on Computing in High Energy and Nuclear Physics
Start: 2024-10-19T08:00:00+02:00
End: 2024-10-25T18:30:00+02:00

19–25 Oct 2024

Europe/Zurich timezone

Contact Program Chairs

chep2024-pc@cern.ch

Towards more efficient job scheduling in ALICE: predicting job execution time using machine learning

THU 31

24 Oct 2024, 15:18

57m

Exhibition Hall

Poster Track 4 - Distributed Computing Poster session

Tomasz Marcin Lelek (AGH University of Krakow (PL))

The ALICE Grid processes up to one million computational jobs daily, leveraging approximately 200,000 CPU cores distributed across about 60 computing centers. Enhancing the prediction accuracy for job execution times could significantly optimize job scheduling, leading to better resource allocation and increased throughput of job execution. We present results of applying machine learning techniques to predicting the execution time of ALICE computational jobs. To this end, we focus on the following main challenges in this prediction task:
(1) Feature extraction and selection: extracting the relevant features from the collected data and selecting the ones that are most important for model training and inference.
(2) Model selection: identifying an ML model that is accurate and robust for our prediction problem.
(3) Model decay: making sure that the model accuracy does not deteriorate over time as new data arrives, possibly from an evolving data distribution.
(4) Near-real-time processing: predictions need to be made in near-real-time.
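As an illustration of challenge (3), the following is a minimal sketch of one common way to watch for model decay: tracking the mean absolute error over a sliding window of recent jobs and flagging when it exceeds a threshold. It is not the method used in this work. The class name `RollingErrorMonitor`, the window size, and the threshold are all hypothetical choices made for the example.

```python
from collections import deque


class RollingErrorMonitor:
    """Tracks the mean absolute error over the most recent predictions.

    Hypothetical helper for illustration only; the window size and
    threshold are assumptions, not values from the ALICE work.
    """

    def __init__(self, window: int, threshold_s: float) -> None:
        # deque with maxlen drops the oldest error automatically
        self.errors = deque(maxlen=window)
        self.threshold_s = threshold_s

    def update(self, predicted_s: float, actual_s: float) -> None:
        """Record the absolute error (in seconds) for one finished job."""
        self.errors.append(abs(predicted_s - actual_s))

    def mean_error(self) -> float:
        """Mean absolute error over the current window (0.0 if empty)."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retraining(self) -> bool:
        """Flag decay only once the window is full, to avoid noisy early alarms."""
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.mean_error() > self.threshold_s


# Usage: three accurate predictions, then three poor ones.
monitor = RollingErrorMonitor(window=3, threshold_s=10.0)
for predicted, actual in [(100, 105), (100, 95), (100, 102)]:
    monitor.update(predicted, actual)
flag_before = monitor.needs_retraining()

for predicted, actual in [(100, 150), (100, 160), (100, 40)]:
    monitor.update(predicted, actual)
flag_after = monitor.needs_retraining()
```

A flag raised by such a monitor could trigger retraining or an incremental update; which of the two is appropriate depends on the model and is not specified here.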
Our goal is to develop a solution capable of predicting job execution times for batches of hundreds of elements in less than 100 milliseconds, without compromising accuracy or hindering continuous learning. This requires striking a delicate balance between computational complexity and real-time performance. By addressing these challenges within the framework of the ALICE experiment at CERN, we can enhance job scheduling efficiency and optimize resource allocation, ultimately advancing scientific research in particle physics.
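The latency target stated above (batches of hundreds of elements in under 100 milliseconds) can be checked with a simple timing harness. The sketch below is an assumption-laden illustration: `toy_predict` is a hypothetical stand-in for a trained model, and the two features it uses (number of input files, requested cores) are invented for the example and are not taken from the ALICE feature set.

```python
import time
from typing import Callable, List, Sequence, Tuple

BUDGET_S = 0.100  # 100 ms target per batch, as stated in the abstract

Job = Tuple[int, int]  # (n_input_files, requested_cores): hypothetical features


def toy_predict(batch: Sequence[Job]) -> List[float]:
    """Hypothetical stand-in for a trained model: a fixed linear formula."""
    return [30.0 + 2.0 * n_files + 0.5 * cores for n_files, cores in batch]


def timed_batch_predict(
    predict: Callable[[Sequence[Job]], List[float]],
    batch: Sequence[Job],
) -> Tuple[List[float], float]:
    """Run one batch prediction and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = predict(batch)
    elapsed_s = time.perf_counter() - start
    return predictions, elapsed_s


# Usage: a batch of 500 synthetic jobs.
batch = [(i % 20, 1 + i % 8) for i in range(500)]
predictions, elapsed_s = timed_batch_predict(toy_predict, batch)
within_budget = elapsed_s <= BUDGET_S
```

In practice the measurement would be repeated many times and a high percentile compared against the budget, since a single timing says little about tail latency.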

Acknowledgements. This work is supported in part by the Ministry of Science and Higher Education (Agreement Nr 2023/WK/07) and by the program of the Ministry of Science and Higher Education entitled PMW.

Dr Bartosz Balis (AGH University of Krakow (PL)), Costin Grigoras (CERN), Dr Marcin Kurdziel (AGH), Mr Michał Faciszewski (AGH), Mr Mikołaj Zasada (AGH), Ms Sara Świętek (AGH), Tomasz Marcin Lelek (AGH University of Krakow (PL)), on behalf of the ALICE Collaboration

CHEP 2024 aliceML poster (1).pdf
