28th Conference on Computing in High Energy and Nuclear Physics (CHEP 2026)

Start: 2026-05-25T08:00:00+07:00
End: 2026-05-29T14:00:00+07:00
Location: Chulalongkorn University

25–29 May 2026

Asia/Bangkok timezone

A look inside the ALICE Grid: visualisation tools to better understand how the system operates

28 May 2026, 16:51

18m

MHMK 202

Oral Presentation, Track 4 - Distributed computing

Speaker: Marta Bertran Ferrer (CERN)

The ALICE Grid comprises a large pool of heterogeneous resources: nodes with varied CPU and GPU configurations, different operating system versions, and different hardware architectures. The Central Grid Operation team has no direct access to the individual clusters and nodes that compose the Grid, which makes it difficult to fully understand and optimize the middleware workflow. Tools that streamline debugging in such a complex environment are therefore valuable to Grid managers, site administrators, and users alike: they allow a faster response to problems and foster a more collaborative and efficient environment.

This contribution focuses on advanced dashboards of Grid parameters that have been instrumental in identifying issues and areas for improvement. The first is the job-to-core allocation view, which shows graphically how running jobs are distributed across the CPU resources of the Grid nodes. It depicts currently running and recently executed jobs (with a history retention of five days), showing the lifetime from the batch queue slot down to the allocated CPU resources on any given Grid node. To understand why some resources remain underused, we have developed specialized views that analyze sampled job match requests and identify the specific conditions preventing jobs from matching the advertised resources.
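As an illustration only, the idea of analyzing sampled match requests against advertised resources can be sketched as a tally of unmet conditions. Everything in this sketch is an assumption made for the example: the field names (`cpu_cores`, `memory_gb`, `disk_gb`, `ttl_s`), the sample values, and the simple "offer must be at least the request" rule are hypothetical and are not taken from the ALICE middleware.

```python
from collections import Counter

# Hypothetical resources advertised by one Grid slot (illustrative values).
advertised = {"cpu_cores": 8, "memory_gb": 16, "disk_gb": 100, "ttl_s": 72000}

# Hypothetical sampled job match requests: minimum requirements per job.
requests = [
    {"cpu_cores": 8, "memory_gb": 32, "disk_gb": 50, "ttl_s": 36000},
    {"cpu_cores": 1, "memory_gb": 2, "disk_gb": 10, "ttl_s": 86400},
    {"cpu_cores": 16, "memory_gb": 32, "disk_gb": 10, "ttl_s": 3600},
    {"cpu_cores": 4, "memory_gb": 8, "disk_gb": 20, "ttl_s": 3600},
]


def unmet_conditions(request, offer):
    """Return the requirement names that the offer does not satisfy."""
    return [name for name, need in request.items() if offer.get(name, 0) < need]


def tally_blockers(sampled_requests, offer):
    """Count how often each condition prevents a sampled request from matching."""
    counts = Counter()
    for request in sampled_requests:
        counts.update(unmet_conditions(request, offer))
    return counts


blockers = tally_blockers(requests, advertised)
for name, count in blockers.most_common():
    print(f"{name}: blocks {count} of {len(requests)} sampled requests")
```

Ranking the blocking conditions by frequency is one plausible way to decide which mismatch between requests and advertised resources deserves attention first.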

We also present statistical visualizations of the success rates and failure reasons of jobs executed under different conditions, such as oversubscribed jobs or jobs optimized for Time-To-Live (TTL). We demonstrate how these visualizations have helped diagnose various problems and how they have led directly to the optimization of our middleware workflows.
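A minimal sketch of the kind of aggregation that could feed such statistics is given here, assuming job records that carry an execution condition, a final status, and a failure reason. The record layout, the condition labels, the `DONE`/`ERROR` status names, and the failure reasons are all hypothetical placeholders, not the actual ALICE job states or data.

```python
from collections import Counter, defaultdict

# Hypothetical job records: (condition, final_status, failure_reason or None).
jobs = [
    ("oversubscribed", "DONE", None),
    ("oversubscribed", "ERROR", "memory limit exceeded"),
    ("ttl_optimized", "DONE", None),
    ("ttl_optimized", "DONE", None),
    ("ttl_optimized", "ERROR", "TTL expired"),
    ("oversubscribed", "ERROR", "memory limit exceeded"),
]


def summarise(job_records):
    """Group jobs by condition; compute success rate and failure-reason counts."""
    totals = Counter()
    successes = Counter()
    failures = defaultdict(Counter)
    for condition, status, reason in job_records:
        totals[condition] += 1
        if status == "DONE":
            successes[condition] += 1
        else:
            failures[condition][reason] += 1
    return {
        condition: {
            "success_rate": successes[condition] / totals[condition],
            "failures": dict(failures[condition]),
        }
        for condition in totals
    }


summary = summarise(jobs)
for condition, stats in summary.items():
    print(condition, f"{stats['success_rate']:.0%}", stats["failures"])
```

Keeping the failure reasons per condition, rather than one global count, is what allows a comparison such as oversubscribed versus TTL-optimized jobs.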

Author: Marta Bertran Ferrer (CERN)

Co-author: Costin Grigoras (CERN)

Presentation materials: CHEP2026Visualisations.pdf
