9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

Developing a monitoring system for Cloud-based distributed datacenters

10 Jul 2018, 16:00
1h

National Culture Palace, Boulevard "Bulgaria", 1463 NDK, Sofia, Bulgaria
Poster Track 8 – Networks and facilities Posters

Speaker

Gioacchino Vino (Universita e INFN, Bari (IT))

Description

Besides their increasing complexity and the variety of resources and services they provide, large data centers nowadays often belong to a distributed network and need non-conventional monitoring tools. This contribution describes the implementation of a monitoring system that actively supports system administrators in problem solving.
The key components are information collection and analysis. Information is acquired from multiple levels so that algorithms can recognize a malfunction and suggest possible root causes, thereby reducing service downtime.
The project has been developed using the Bari ReCaS data center as a testbed. The information is gathered from Zabbix, OpenStack and HTCondor, which serve as the local monitoring system, cloud platform and batch system, respectively.
Big Data solutions belonging to the Hadoop ecosystem have been selected: Flume and Kafka as the transport layer and Spark as the analysis component. Multiple tools are used to store data, such as the Hadoop Distributed File System, HBase and Neo4j. InfluxDB-Grafana and Elasticsearch-Kibana are used as visualization components. Event extraction, correlation and propagation algorithms have also been implemented using Artificial Intelligence and graph libraries to provide the root-cause feature. Results are forwarded to experts by email or Slack, using Riemann.
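
The abstract does not spell out the propagation algorithm, so the following is only a minimal sketch of one way root-cause candidates might be derived from a dependency graph. It assumes that an alarmed component is a candidate when none of the components it depends on, directly or transitively, is also alarmed. The graph contents, component names and the functions `upstream` and `root_cause_candidates` are hypothetical and are not taken from the project.

```python
from collections import deque

# Hypothetical dependency graph: each component maps to the
# components it depends on (assumed for illustration only).
DEPENDS_ON = {
    "batch-job": ["worker-node"],
    "vm-instance": ["hypervisor"],
    "worker-node": ["storage", "network-switch"],
    "hypervisor": ["storage", "network-switch"],
    "storage": ["network-switch"],
    "network-switch": [],
}


def upstream(node, graph):
    """Return every component that `node` depends on, directly or transitively."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph.get(current, []))
    return seen


def root_cause_candidates(alarmed, graph):
    """Keep only alarmed components that have no alarmed upstream dependency."""
    alarmed = set(alarmed)
    return sorted(n for n in alarmed if not (upstream(n, graph) & alarmed))


# Alarms raised at several levels (batch system, cloud platform, local monitoring).
alarms = ["batch-job", "worker-node", "storage", "vm-instance"]
print(root_cause_candidates(alarms, DEPENDS_ON))  # ['storage']
```

In this toy case the batch job, worker node and virtual machine all depend on the alarmed storage component, so only the storage is reported. A production system would more likely keep such a graph in a graph database and weight events by time and severity, which this sketch leaves out.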

Primary authors

Gioacchino Vino (Universita e INFN, Bari (IT)), Domenico Elia (INFN Bari), Giacinto Donvito (Universita e INFN, Bari (IT)), Marica Antonacci
