25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

CMS Monitoring infrastructure beyond Run 3

28 May 2026, 16:15
18m
MHMK 202

Oral Presentation Track 4 - Distributed computing

Speaker

Carlos Borrajo Gomez (CERN)

Description

As part of Run 3 of the Large Hadron Collider (LHC), the CMS experiment generates large amounts of data that have to be processed and stored efficiently. The complex distributed computing infrastructure used for these purposes has to be highly available, which makes a reliable and comprehensive monitoring setup essential. The CMS monitoring team is responsible for providing the necessary monitoring services.

CMS monitoring services are based partly on open-source solutions provided by the CERN IT MONIT infrastructure and partly on custom applications, mainly devoted to data mining, deployed on Kubernetes clusters at CERN. We report on recent improvements that increase the productivity and efficiency of the services offered by the CMS monitoring team, with a strong focus on data popularity monitoring, HTCondor job monitoring, and Infrastructure-as-Code integration.

Data popularity is one of the key metrics for CMS because of the distributed nature of its storage infrastructure. Keeping a close eye on which datasets have not been accessed recently, and which ones receive the most accesses over time, is essential for decisions such as scheduling data center maintenance or choosing where popular datasets should be hosted.
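As a rough illustration of the two popularity views mentioned above (most-accessed datasets and datasets not accessed recently), the following minimal Python sketch aggregates a list of access records. The function name, the record layout, the dataset names, and the 90-day staleness threshold are all assumptions made for this example; they do not describe the actual CMS popularity services.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_popularity(accesses, now, stale_after_days=90):
    """Rank datasets by access count and list those not accessed recently.

    `accesses` is an iterable of (dataset_name, access_time) pairs.
    The record layout and the default threshold are hypothetical.
    """
    counts = Counter()
    last_access = {}
    for dataset, when in accesses:
        counts[dataset] += 1
        # Track the most recent access seen for each dataset.
        if dataset not in last_access or when > last_access[dataset]:
            last_access[dataset] = when

    cutoff = now - timedelta(days=stale_after_days)
    # Datasets whose latest access is older than the cutoff.
    stale = sorted(d for d, t in last_access.items() if t < cutoff)
    # Dataset names ordered from most to least accessed.
    most_accessed = [d for d, _ in counts.most_common()]
    return most_accessed, stale


# Hypothetical records, for illustration only.
records = [
    ("/A/x/AOD", datetime(2026, 5, 1)),
    ("/A/x/AOD", datetime(2026, 5, 20)),
    ("/B/y/RAW", datetime(2025, 1, 1)),
]
ranked, stale = summarize_popularity(records, now=datetime(2026, 5, 28))
```

Note that a dataset with no access records at all never appears in the input, so a real system would also need the full dataset catalogue to detect datasets that were never read.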

HTCondor is a central piece of software for processing and analysing data from the CMS experiment, and most applications and users that interact directly with such data do so through HTCondor jobs. Because a large number of HTCondor jobs run concurrently across all CMS sites, we are completely refactoring the architecture of our current HTCondor job monitoring application in favor of a more scalable and flexible solution. The new design runs on Kubernetes and uses NATS (Neural Autonomic Transport System) as the message queue and KEDA (Kubernetes Event-Driven Autoscaling) for horizontal autoscaling.
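The idea behind queue-driven horizontal autoscaling can be sketched with a simple proportional rule: the number of consumer replicas grows with the number of messages waiting in the queue and is clamped to configured bounds. The Python function below is a simplified model written for this illustration, with hypothetical parameter names and defaults; it is not KEDA's implementation, and the real Kubernetes autoscaler applies additional behaviour (such as tolerances and stabilization windows) that is not modeled here.

```python
import math


def desired_replicas(pending_messages, target_per_replica,
                     min_replicas=1, max_replicas=10):
    """Return a replica count proportional to the queue backlog.

    All parameter names and defaults are hypothetical. The rule is:
    ceil(pending / target), clamped to [min_replicas, max_replicas].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(pending_messages / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# An empty queue falls back to the minimum, a large backlog hits the cap.
idle = desired_replicas(0, 100)
busy = desired_replicas(250, 100)
capped = desired_replicas(5000, 100)
```

Scaling on backlog rather than CPU usage is what lets the consumers follow bursts in job-monitoring traffic: the replica count reacts to work that is waiting, not only to work already in progress.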

To improve the efficiency of the team's operators, we are migrating the CMS monitoring infrastructure to OpenTofu as an Infrastructure-as-Code solution. This will enable better automation with more complex CI/CD pipeline integrations, as well as easier maintenance of separate environments for the different stages of development.

This contribution will go through these projects, covering the challenges and the adopted solutions, which could serve as examples for similar issues faced by other HEP experiments or by the broader physics community.
