A fully High-availability logs/metrics collector @ CSCS

May 17, 2018, 2:00 PM
Dino Conciatore (CSCS (Swiss National Supercomputing Centre))


As systems grow in complexity and scale, the amount of system-level data they record grows with them.
Managing such vast amounts of log data is a challenge that CSCS solved by introducing a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana.
This is a fundamental service at CSCS: it makes it easy to correlate events, bridging the gap from the computation workload to the nodes and enabling failure diagnosis.
Currently, the Elasticsearch cluster at CSCS handles more than 22'000'000'000 online documents (one year of data) and another 20'000'000'000 archived documents. The integrated environment, from logging to graphical representation, enables powerful dashboards and monitoring displays.


