Monitoring of IT infrastructure and services is essential to maximize availability and minimize disruption: detecting failures and emerging issues early allows rapid intervention.
The HEP group at Liverpool have been working on a project to modernize local monitoring infrastructure (previously provided by Nagios and Ganglia), with the goals of increasing coverage, improving visualization capabilities, and streamlining configuration and maintenance. Here we discuss some of the tools evaluated, the different approaches they take, and how they can be combined to form a comprehensive monitoring infrastructure. An overview of the resulting system and of progress on implementation to date will be presented. The current state is as follows:
The system is configured with Puppet. Basic system checks are configured in Puppet using Hiera, and managed by Sensu. Centralised logging is managed with Elasticsearch, together with Logstash and Filebeat. Kibana provides an interface for interactive analysis, including visualization and dashboards. Metric collection is also configured in Puppet, with Ganglia, Sensu, riemann.io, and collectd amongst the tools being considered. Metrics are sent to Graphite, with Grafana providing visualization and dashboards. Additional checks on the collated logs and on metric trends are also configured in Puppet and managed by Sensu.
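As an illustration of the metric path described above, the following is a minimal Python sketch rather than the configuration used at Liverpool. It formats a metric in Carbon's plaintext protocol (one `path value timestamp` line, by default on TCP port 2003) and maps a value onto the Nagios-compatible exit statuses that Sensu checks use. The host name `graphite.example.org`, the metric path, and the function names are hypothetical.

```python
import socket
import time

# Nagios-compatible exit statuses, which Sensu checks also use.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Carbon plaintext line: path, value, timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host="graphite.example.org", port=2003, timeout=5.0):
    """Send one plaintext line to a Carbon listener (host is a placeholder)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


def check_threshold(value, warn, crit):
    """Return a check status for a metric where higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

A check script built on this sketch would print a short status message and exit with the value returned by `check_threshold`, for example via `sys.exit(check_threshold(load, warn=4, crit=8))`.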
The Uchiwa dashboard for Sensu provides a web interface for viewing infrastructure status. Alert capabilities are provided via external handlers. Liverpool are developing a custom handler to provide an easily configurable, extensible and maintainable alert facility.
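To make the handler mechanism concrete, here is a minimal Python sketch of a pipe-style handler; it is not the custom handler under development at Liverpool. It assumes the classic Sensu event layout, in which the event arrives as JSON on standard input with `client`, `check`, and `action` fields. The function names and the message format are hypothetical.

```python
import json
import sys

# Map Sensu/Nagios check statuses to readable names.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_alert(event):
    """Build a one-line alert message from a Sensu event dictionary."""
    client = event.get("client", {}).get("name", "unknown-client")
    check = event.get("check", {})
    name = check.get("name", "unknown-check")
    status = STATUS_NAMES.get(check.get("status"), "UNKNOWN")
    # Sensu marks a cleared alert with action "resolve".
    action = "RESOLVED" if event.get("action") == "resolve" else "ALERT"
    output = str(check.get("output", "")).strip()
    return f"{action} {status} {client}/{name}: {output}"


def handle(stream=sys.stdin, out=sys.stdout):
    """Read one JSON event from the stream and write the alert line."""
    event = json.load(stream)
    out.write(format_alert(event) + "\n")
```

A handler script would simply call `handle()`. Routing the message to e-mail, chat, or a ticketing system would replace the `out.write` call, which is where a configurable and extensible design would add its delivery back ends.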
|Primary Keyword (Mandatory)|Monitoring|
|Secondary Keyword (Optional)|Computing facilities|