ALICE (A Large Ion Collider Experiment) is currently undergoing a major upgrade of its detector, read-out and computing systems for LHC Run 3. A new facility called O2 (Online-Offline) will perform data acquisition and event processing.
To operate the experiment and the O2 facility efficiently, a new observability system has been developed. It will provide a complete overview of overall health and detect performance degradation and component failures by collecting, processing, storing and visualizing values from hardware and software sensors and probes. The core of the system is based on Apache Big Data tools, InfluxData time-series components and Grafana.
Recent major changes, such as adopting Apache Kafka as a metric collector and processor, have led to a more generic system design that, in addition to monitoring, can handle logs and request-tracing data.
This paper describes the system design and its evolution, the reasoning behind adopting new components, performance and latency measurements, scaling challenges, stability tests and the automatic deployment process.