24–26 Mar 2025
CERN
Europe/Zurich timezone
There is a live webcast for this event.

A Distributed Probe for EOS: Real-Time Availability Monitoring and Alerting

25 Mar 2025, 10:10
20m
40/S2-D01 - Salle Dirac (CERN)


Speaker

Gianmaria Del Monte (CERN)

Description

Keeping EOS instances available is crucial for large-scale storage operations. To improve monitoring and incident response, we have developed a new distributed probe that detects instance malfunctions and alerts operators in real time.

This talk introduces the architecture and functionality of the probe, which runs across multiple nodes for redundancy and reliability. Alerts are dispatched through several channels, including SMS, email, Mattermost, and CERN IT’s General Services Availability. In addition, all availability events are published on a NATS-based pub-sub channel, enabling future integrations with operational tools such as the EOS Diagnostic Tool.
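
To make the alerting flow described above concrete, here is a minimal sketch in Python. It is not the probe's actual implementation: the `AvailabilityEvent` fields, the `dispatch` function, the subject name `eos.availability.<instance>`, and the rule that alerts are sent only when an instance is unavailable are all assumptions made for illustration. The `publish` argument is a stand-in for whatever pub-sub client is used, so the sketch runs without a NATS server.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AvailabilityEvent:
    """Hypothetical event shape; the real probe's schema is not given in the abstract."""
    instance: str      # EOS instance name
    probe_node: str    # node that ran the check
    available: bool    # result of the availability check
    timestamp: float   # Unix time in seconds


# An alert channel (SMS, email, Mattermost, ...) is modelled as any
# callable that accepts a formatted message.
Channel = Callable[[str], None]


def dispatch(
    event: AvailabilityEvent,
    channels: Dict[str, Channel],
    publish: Callable[[str, bytes], None],
) -> List[str]:
    """Publish the event, then alert on every channel if the instance is down.

    Returns the names of the channels that raised an error.
    """
    # Every event is published, available or not (assumed subject naming).
    subject = f"eos.availability.{event.instance}"
    publish(subject, json.dumps(asdict(event)).encode())

    failed: List[str] = []
    if not event.available:
        message = (
            f"EOS instance {event.instance} unavailable "
            f"(seen by {event.probe_node})"
        )
        for name, send in channels.items():
            try:
                send(message)
            except Exception:
                # A broken channel must not prevent the others from firing.
                failed.append(name)
    return failed
```

In this sketch each channel is tried independently, so a failure in one (for example an SMS gateway outage) does not stop the remaining alerts. With a real NATS client, `publish` would be a small wrapper around the client's own publish call.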
