19–25 Oct 2024
Europe/Zurich timezone

Advanced monitoring capabilities of the CMS Experiment for LHC Run3 and beyond

23 Oct 2024, 13:30
18m
Room 2.B (Conference Room)

Talk Track 4 - Distributed Computing Parallel (Track 4)

Speaker

Brij Kishor Jashal (Rutherford Appleton Laboratory)

Description

The CMS computing infrastructure, spread globally over 150 WLCG sites, forms an intricate ecosystem of computing resources, software, and services. In 2024, the number of production computing cores passed the half-million mark, and storage capacity stands at 250 petabytes on disk and 1.20 exabytes on tape. To monitor these resources in real time, CMS, working closely with CERN IT, has developed a multifaceted monitoring system that provides real-time insights through about 100 production dashboards.

In preparation for Run 3, the CMS monitoring infrastructure underwent significant evolution to broaden the scope of monitored applications and services while improving sustainability and ease of operation. Leveraging open-source solutions, either provided by the CERN IT department or managed internally, monitoring applications have transitioned from bespoke solutions to standardized data-flow and visualization services. Notably, monitoring applications for distributed workload management and data handling have migrated to technologies such as OpenSearch, VictoriaMetrics, InfluxDB, and HDFS, with access provided through programmatic APIs, Apache Spark, or Sqoop jobs, and visualization primarily via Grafana.

The majority of CMS monitoring applications are now deployed as microservices on Kubernetes clusters. This contribution presents the full stack of CMS monitoring services, showing how the integration of common technologies enables versatile monitoring applications and addresses the computational demands of LHC Run 3. It also explores the incorporation of analytics into the monitoring framework, demonstrating how these insights contribute to the operational efficiency and scientific output of the CMS experiment.

Primary authors

Brij Kishor Jashal (Rutherford Appleton Laboratory)
CMS Collaboration
Federica Legger (Universita e INFN Torino (IT))
Nikodemas Tuckus (CERN)
