Speaker
Gianmaria Del Monte
(CERN)
Description
Ensuring the availability of EOS instances is crucial for large-scale storage operations. To enhance monitoring and incident response, we have developed a new distributed probe that detects instance malfunctions and alerts operators in real time.
This talk will introduce the architecture and functionality of the probe, which runs across multiple nodes to provide redundancy and reliability. Alerts are dispatched via multiple channels, including SMS, email, Mattermost, and CERN IT’s General Services Availability. Additionally, all availability events are published on a NATS-based pub-sub channel, enabling future integrations with operational tools such as the EOS Diagnostic Tool.
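As a rough illustration of the fan-out described above, the following minimal Python sketch shows how one availability event might be serialized and handed to several alert channels, with a failure in one channel not blocking the others. It is not the probe's actual implementation: the `AvailabilityEvent` fields, the `Dispatcher` class, and the channel names are all hypothetical, and the channels are plain callables standing in for real SMS, email, Mattermost, or NATS publishers.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict


@dataclass
class AvailabilityEvent:
    """Hypothetical event shape; the real probe's schema is not given in the abstract."""
    instance: str      # name of the EOS instance being probed (assumed field)
    available: bool    # outcome of the availability check (assumed field)
    probe_node: str    # node that ran the check (assumed field)
    timestamp: float   # Unix time of the check (assumed field)


def encode_event(event: AvailabilityEvent) -> bytes:
    """Serialize an event to JSON bytes, a common payload format for pub-sub messages."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


class Dispatcher:
    """Fans one event out to every registered channel (hypothetical helper)."""

    def __init__(self) -> None:
        self._channels: Dict[str, Callable[[bytes], None]] = {}

    def register(self, name: str, send: Callable[[bytes], None]) -> None:
        # `send` stands in for a real sender, e.g. a pub-sub publish call.
        self._channels[name] = send

    def dispatch(self, event: AvailabilityEvent) -> Dict[str, bool]:
        """Send the event to all channels; return per-channel success flags."""
        payload = encode_event(event)
        results: Dict[str, bool] = {}
        for name, send in self._channels.items():
            try:
                send(payload)
                results[name] = True
            except Exception:
                # One failing channel must not prevent delivery to the others.
                results[name] = False
        return results
```

In this sketch, a pub-sub publisher would simply be one more registered channel, so downstream tools could subscribe to the same serialized events that drive the alerts.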
Authors
Gianmaria Del Monte
(CERN)
Octavian-Mihai Matei
(CERN)