21-25 May 2012
New York City, NY, USA
US/Eastern timezone

Monitor and alarm system for time-critical conditions data handling

24 May 2012, 13:30
4h 45m
Rosenthal Pavilion (10th floor) (Kimmel Center)

Poster Software Engineering, Data Stores and Databases (track 5) Poster Session

Speaker

Salvatore Di Guida (CERN)

Description

With the LHC producing collisions at ever higher luminosity, CMS must be able to take high-quality data and process them reliably. These tasks require not only correct conditions, but also that the conditions data be promptly available. The CMS condition infrastructure relies on many different components, such as hardware, networks, and services, which must be constantly monitored; any fault must be recorded and notified at the appropriate alarm level. In this talk, we describe EasyMon, a fast, simple, web-based application for monitoring the CMS condition infrastructure. It is based on the Nagios framework, in which all checks on the different components of the system are implemented, and from which the web server retrieves their status. In case of failure, the Nagios backend evaluates the severity of the issue and sends alarms or warnings via email and/or SMS to the stakeholders identified for each component of the infrastructure. Finally, the EasyMon GUI publishes the results on the web, using jQuery plugins that are also optimized for browsing on mobile devices, without exposing any sensitive information. In this way, all experts involved in CMS condition operations can easily be informed of the status of the system and take action as soon as an incident occurs.
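
As an illustration of the check-and-notify flow described above, the following is a minimal sketch in Python and not the EasyMon implementation. The only convention taken from Nagios itself is the set of plugin status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The monitored quantity (an upload latency in seconds), the thresholds, and the names `check_latency` and `channels_for` are all hypothetical, chosen only to show how a severity could be derived and mapped to notification channels.

```python
from dataclasses import dataclass

# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


@dataclass
class CheckResult:
    code: int      # Nagios status code
    message: str   # one-line status text, as a plugin would print it


def check_latency(latency_s, warn_s=300.0, crit_s=900.0):
    """Classify a hypothetical conditions-upload latency (in seconds).

    The thresholds are illustrative assumptions, not CMS values.
    """
    if latency_s is None or latency_s < 0:
        return CheckResult(UNKNOWN, "UNKNOWN - latency not available")
    if latency_s >= crit_s:
        code = CRITICAL
    elif latency_s >= warn_s:
        code = WARNING
    else:
        code = OK
    return CheckResult(code, f"{STATUS_NAMES[code]} - latency {latency_s:.0f}s")


def channels_for(code):
    """Hypothetical routing of a severity to notification channels."""
    if code == CRITICAL:
        return ["email", "sms"]
    if code in (WARNING, UNKNOWN):
        return ["email"]
    return []
```

In a real Nagios plugin, the script would print `message` and exit with `code`; in a standard Nagios setup, who is notified and by which channel is configured in Nagios rather than in the plugin, so `channels_for` only stands in for that configuration here.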
