Nov 4 – 8, 2019
Adelaide Convention Centre
Australia/Adelaide timezone

The evolution of the ALICE O2 monitoring system

Nov 5, 2019, 3:30 PM
1h
Hall F (Adelaide Convention Centre)

Poster Track 1 – Online and Real-time Computing Posters

Speaker

Adam Wegrzynek (CERN)

Description

ALICE (A Large Ion Collider Experiment) is currently undergoing a major upgrade of its detector, read-out and computing systems for LHC Run 3. A new facility called O2 (Online-Offline) will perform data acquisition and event processing.
To operate the experiment and the O2 facility efficiently, a new observability system has been developed. It will provide a complete overview of the overall system health and detect performance degradation and component failures by collecting, processing, storing and visualizing values from hardware and software sensors and probes. The core of the system is based on Apache Big Data tools, InfluxData time-series components and Grafana.
Recent major changes, such as adopting Apache Kafka as a metric collector and processor, have led to a more generic system design that, in addition to monitoring, can handle logs and request-tracing data.
This paper describes the system design and its evolution, the reasoning behind adopting new components, performance and latency measurements, scaling challenges, stability tests and the automatic deployment process.

Consider for promotion No

Primary author

Co-author

Gioacchino Vino (INFN Bari (IT))
