14-18 May 2018
University of Wisconsin-Madison
Monitoring Infrastructure for the CERN Data Centre

May 17, 2018, 2:20 PM
Asier Aguado Corman (Universidad de Oviedo (ES))


Since early 2017, the MONIT infrastructure has provided services for monitoring the CERN data centre and the WLCG grid resources. It is progressively replacing in-house technologies, such as LEMON and SLS, with consolidated open source solutions for monitoring and alarms.

The infrastructure collects data from more than 30k data centre hosts at the Meyrin and Wigner sites, with a total volume of 3 TB/day and a rate of 65k documents/sec. The data includes OS and hardware metrics, as well as specific IT service metrics. Log and metric collection is deployed by default on every machine in the data centre, together with alert reporting. Each machine has a default configuration that can be extended with service-specific data (e.g. to monitor a database server). Service managers can send custom metrics and logs from their applications to the infrastructure through generic endpoints, and they are provided with an out-of-the-box discovery and visualization interface, data analysis tools and integrated notifications.
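
As a rough sanity check on the figures quoted above, the following sketch derives the implied average document size and the average per-host document rate. It assumes that 1 TB means 10^12 bytes, that both figures are sustained daily averages, that "more than 30k hosts" can be approximated as exactly 30,000, and that all documents originate from data centre hosts. None of these assumptions is stated in the abstract, so the results are order-of-magnitude estimates only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_document_bytes(volume_bytes_per_day, docs_per_second):
    """Average size of one document, given daily volume and document rate."""
    return volume_bytes_per_day / (docs_per_second * SECONDS_PER_DAY)


def docs_per_host_per_second(docs_per_second, hosts):
    """Average document rate per host, if every document came from a host."""
    return docs_per_second / hosts


# Figures from the abstract; unit interpretation is an assumption (see above).
volume = 3 * 10**12   # 3 TB/day, taking 1 TB = 10^12 bytes
rate = 65_000         # 65k documents/sec
hosts = 30_000        # "more than 30k" hosts, approximated as 30,000

avg_bytes = average_document_bytes(volume, rate)
per_host = docs_per_host_per_second(rate, hosts)

print(f"average document size: about {avg_bytes:.0f} bytes")   # about 534 bytes
print(f"average rate per host: about {per_host:.2f} docs/sec")  # about 2.17 docs/sec
```

Under these assumptions a typical document is roughly half a kilobyte, and each host contributes on the order of two documents per second.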

The infrastructure stack relies on open source technologies that are developed and widely used by industry and research leaders. Our architecture uses collectd for metric collection, Flume and Kafka for transport, Spark for stream and batch processing, Elasticsearch, HDFS and InfluxDB for search and storage, Kibana and Grafana for visualization, and Zeppelin for analytics. The modularity of collectd gives infrastructure users the flexibility to configure default and service-specific monitoring, and allows them to develop and deploy custom plugins.
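
To illustrate what a custom collectd plugin can look like, here is a minimal sketch written against collectd's Python plugin interface (`collectd.register_read`, `collectd.Values`, `Values.dispatch`). It is not taken from the MONIT code base: the plugin name `example_spool`, the spool directory path and the metric itself are hypothetical, chosen only to show the shape of a service-specific read callback. The `collectd` module exists only inside the collectd daemon, so the sketch falls back to `None` when imported elsewhere, which keeps the measurement function testable on its own.

```python
import os

try:
    import collectd  # injected by the collectd daemon's Python plugin at runtime
except ImportError:
    collectd = None  # running outside collectd, e.g. in a local test

PLUGIN_NAME = "example_spool"            # hypothetical plugin name
SPOOL_DIR = "/var/spool/example-service"  # hypothetical path, an assumption


def count_spool_entries(path):
    """Return the number of entries in a directory, or 0 if it is missing."""
    try:
        return len(os.listdir(path))
    except FileNotFoundError:
        return 0


def read_callback():
    """Called by collectd once per interval; dispatches a single gauge value."""
    metric = collectd.Values(
        plugin=PLUGIN_NAME,
        type="gauge",
        type_instance="spool_entries",
    )
    metric.dispatch(values=[count_spool_entries(SPOOL_DIR)])


# Register the callback only when actually loaded by collectd.
if collectd is not None:
    collectd.register_read(read_callback)
```

Keeping the measurement (`count_spool_entries`) separate from the dispatch logic is a design choice of this sketch: the measurement can be unit-tested without a running collectd daemon.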

This contribution is an updated overview of the monitoring service for the CERN data centre. We present our main use cases for the collection of metrics and logs. Given that the proposed stack of technologies is widely used and the MONIT architecture is well consolidated, a main objective is to share the lessons learned and find common monitoring solutions within the community.

Primary authors

Asier Aguado Corman (Universidad de Oviedo (ES))

