25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

Real-time supervision and control of Grid workflows

28 May 2026, 14:57
18m
Chulalongkorn University

Oral Presentation Track 4 - Distributed computing

Speaker

Marta Bertran Ferrer (CERN)

Description

Effective tools for monitoring Grid workflow executions are crucial for the prompt identification of issues, which in turn facilitates the design and deployment of appropriate solutions. The ALICE Grid middleware JAliEn uses the MonALISA framework to monitor all its Grid components, which collectively generate an enormous amount of data: about 200,000 monitored parameters per second across the entire Grid. The contemporary Grid environment is characterized by execution nodes with increasing CPU core counts and larger batch queue slot sizes, sometimes encompassing whole nodes with hundreds of cores. In such an environment, the efficient extraction of monitoring parameters becomes a critical operation, as a single monitoring agent must fetch and transmit all the monitoring data for tens of thousands of concurrently executing jobs.

To address this challenge, we have achieved significant performance improvements by leveraging cgroups v2. Cgroups are used to set boundaries on resource utilization, and their accounting metrics are used to monitor all the middleware components and executing payloads. This new methodology has reduced the time required to monitor the resource utilization of JAliEn agents from the order of tens of seconds to the order of milliseconds on large whole-node Grid slots with complex process trees.
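The speed-up described above is plausible because cgroups v2 exposes aggregated accounting for a whole process tree in a few small files, so a monitor does not need to walk every process individually. The following is a minimal sketch of that idea, not the JAliEn implementation: the function names and the cgroup directory passed in are illustrative assumptions, while `cpu.stat` (with its `usage_usec` key) and `memory.current` are standard cgroup v2 interface files.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroup v2 'flat keyed' file (e.g. cpu.stat) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each line is "<key> <integer value>"
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_usage(cgroup_dir: str) -> dict[str, int]:
    """Read aggregated CPU and memory usage for one cgroup (hypothetical helper).

    The values cover every process in the cgroup, so the cost does not
    grow with the size of the process tree.
    """
    base = Path(cgroup_dir)
    cpu = parse_flat_keyed((base / "cpu.stat").read_text())
    result = {"cpu_usage_usec": cpu.get("usage_usec", 0)}

    # memory.current is absent when the memory controller is not enabled
    # for this cgroup (and on the root cgroup), so treat it as optional.
    mem_file = base / "memory.current"
    if mem_file.exists():
        result["memory_bytes"] = int(mem_file.read_text().strip())
    return result
```

A caller would pass the directory of the cgroup that contains the job, for example a path under `/sys/fs/cgroup`; the exact layout depends on how the site and the middleware create their cgroups.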

Complementing this monitoring enhancement is the remote logging system in JAliEn. This system sends logs generated by the Grid middleware in real time directly to the JAliEn Central Services. This capability enables live supervision of executions at various Grid sites and has proven to be an exceptionally effective tool for debugging issues and identifying areas for potential improvement. Furthermore, the system includes a crucial feature for severe cases: it ensures offline persistence of logs if nodes become inaccessible for any reason. Considering that our agents generate logs at an average rate of 15 kHz Grid-wide, enabling the system on all instances would result in a substantial increase in traffic to the Central Services. Therefore, the remote logging tool has been designed to allow users to selectively "cherry-pick" the desired logs based on criteria such as site, host, and JAliEn version. This selective logging has been particularly valuable for debugging and reacting to major issues effectively.
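The "cherry-pick" selection described above can be illustrated with a small matching routine. This is a sketch under stated assumptions rather than the JAliEn design: the `LogSource` and `LogSelector` classes, the `should_stream` function, and the example site, host, and version strings are all hypothetical. Only the three criteria (site, host, JAliEn version) come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LogSource:
    """Identity of an agent that produces logs (hypothetical model)."""
    site: str
    host: str
    version: str


@dataclass(frozen=True)
class LogSelector:
    """One selection rule; a field left as None matches any value."""
    site: Optional[str] = None
    host: Optional[str] = None
    version: Optional[str] = None

    def matches(self, source: LogSource) -> bool:
        pairs = (
            (self.site, source.site),
            (self.host, source.host),
            (self.version, source.version),
        )
        return all(wanted is None or wanted == actual for wanted, actual in pairs)


def should_stream(source: LogSource, selectors: Iterable[LogSelector]) -> bool:
    """Stream logs only if at least one rule selects this source.

    With no rules configured nothing is streamed, which keeps the
    default traffic to the central services at zero.
    """
    return any(selector.matches(source) for selector in selectors)


# Example: follow one whole site plus one specific middleware version.
rules = [LogSelector(site="SITE_A"), LogSelector(version="1.9.0")]
agent = LogSource(site="SITE_B", host="node01.example.org", version="1.9.0")
stream_this_agent = should_stream(agent, rules)
```

Defaulting to "off" when no rule matches reflects the traffic concern raised in the paragraph above: remote logging is switched on only for the agents under investigation.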

Combining JAliEn remote logging with the detailed Grid monitoring data provides a broader, real-time understanding of the complex workflows executing across our heterogeneous sites. Having all this customizable data readily available is a powerful resource for building a more robust and adaptable middleware framework.

Author

Co-authors

Costin Grigoras (CERN) Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))
