The Compact Muon Solenoid (CMS) experiment makes a vast use of alignment and calibration measurements in several crucial workflows: in the event selection at the High Level Trigger (HLT), in the processing of the recorded collisions and in the production of simulated events. A suite of services addresses the key requirements for the handling of the alignment and calibration conditions such as: recording the status of the experiment and of the ongoing data taking, accepting conditions data updates provided by the detector experts, aggregating and navigating the calibration scenarios, and distributing conditions for consumption by the collaborators. Since a large fraction of such services is critical for the data taking and event filtering in the HLT, a comprehensive monitoring and alarm generating system had to be developed. Such monitoring system has been developed based on the open source industry standard for monitoring and alerting services (Nagios) to monitor: the database back-end, the hosting nodes and key heart-beat functionalities for all the services involved. This paper describes the design, implementation and operational experience with the monitoring system developed and deployed at CMS in 2016.
|Primary Keyword (Mandatory)||Monitoring|