9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

Towards the integrated ALICE Online-Offline (O2) monitoring subsystem

10 Jul 2018, 12:00
15m
Hall 7 (National Palace of Culture)

Presentation, Track 3 – Distributed computing

Speaker

Adam Wegrzynek (CERN)

Description

ALICE (A Large Ion Collider Experiment) is preparing a major upgrade of its detector, readout system and computing for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To operate the experiment efficiently, we are designing a monitoring subsystem that will provide a complete overview of the overall health of O2 and detect performance degradation and component failures. The subsystem will receive and collect performance parameters at a rate of up to 600 kHz. It consists of a custom monitoring library and distributed server-side software covering five main functional tasks: parameter collection, processing, storage, visualization and alarms.
To select the most appropriate tools for these tasks, we evaluated three server-side systems: MonALISA, Zabbix and a “Modular stack”. The last is a toolkit comprising collectd, Apache Flume, Apache Spark, InfluxDB, Grafana and Riemann.
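As an illustration of what a single monitored value might look like on its way to the storage component of the Modular stack, the sketch below encodes one metric in the InfluxDB line protocol. The abstract does not say how the O2 monitoring library serializes its data, so this is an assumption made purely for illustration; the function names, the metric name `cpu_load` and the tag `host=node01` are hypothetical.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    """Render a field value using the line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted, with backslashes and quotes escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(
    measurement: str,
    fields: Mapping[str, FieldValue],
    tags: Optional[Mapping[str, str]] = None,
    timestamp_ns: Optional[int] = None,
) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    # Tags are emitted in sorted key order.
    for key in sorted(tags or {}):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # nanoseconds since the Unix epoch
    return line


# Hypothetical metric, for illustration only.
example = to_line_protocol(
    "cpu_load", {"value": 0.75}, {"host": "node01"}, 1531216800000000000
)
```

At the quoted rate of up to 600 kHz, values would in practice be sent in batches of many such lines per request; this sketch only covers the encoding of a single line.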
This paper describes the functional architecture of the monitoring subsystem. It presents a complete evaluation of the three candidate solutions, the selection process, the risk assessment and the justification for the final decision. The in-depth comparison covers functional features as well as latency and throughput measurements, which verify that each solution delivers the required processing and storage performance.

Primary author

Co-authors

Vasco Chibante Barroso (CERN)
Costin Grigoras (CERN)
Andres Gomez Ramirez (Johann-Wolfgang-Goethe Univ. (DE))
Gioacchino Vino (Universita e INFN, Bari (IT))
