21–25 May 2012
New York City, NY, USA
US/Eastern timezone

Health and performance monitoring of the large and diverse online computing cluster of CMS

22 May 2012, 13:30
4h 45m
Rosenthal Pavilion (10th floor) (Kimmel Center)

Poster Computer Facilities, Production Grids and Networking (track 4) Poster Session

Speaker

Olivier Raginel (Massachusetts Inst. of Technology (US))

Description

The CMS experiment online cluster consists of 2300 computers and 170 switches or routers operating around the clock. This infrastructure must be monitored so that administrators are warned proactively of any failure or degradation, in order to avoid or minimize downtime that could lead to a loss of data taking. The number of metrics monitored per host varies from 20 to 40 and ranges from basic host checks (disk, network, load) and application-specific checks (service running) to hardware monitoring (through IPMI). The sheer number of hosts and checks per host stretches the limits of many monitoring tools and requires careful use of various configuration optimizations in order to work reliably. The initial monitoring system used in the CMS online cluster was based on Nagios, but it suffered from various drawbacks and did not work reliably in the recently expanded cluster. The CMS cluster administrators investigated the available open source tools and chose Icinga, a fork of Nagios, with several plugin modules to enhance its scalability. The Gearman module provides a queuing system for all checks and their results, allowing easy load balancing across worker nodes. Supported modules allow checks to be grouped into a single request, significantly reducing the network overhead of running a set of checks on a group of nodes. The PNP4nagios module provides graphing capability to Icinga and stores its data in round robin database (RRD) files. Additional software (rrdcached) optimizes access to the RRD files and is vital for achieving the required number of operations.
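The queuing model described above for the Gearman module can be illustrated with a minimal sketch. It is not the authors' setup and does not use the Gearman API: Python's standard-library queue and threads stand in for the job server and the worker nodes, and the host names, check names, and worker names are hypothetical.

```python
import queue
import threading


def run_checks(checks, worker_names):
    """Distribute checks over workers through one shared job queue.

    checks: list of (host, check_name) tuples.
    Returns a dict mapping (host, check_name) to the worker that handled it.
    """
    jobs = queue.Queue()
    results = queue.Queue()
    for check in checks:
        jobs.put(check)

    def worker(name):
        # Each worker pulls the next available job, so faster or less
        # loaded workers naturally take on more of the work.
        while True:
            try:
                host, check_name = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, worker stops
            # A real worker would run the check plugin here (assumption).
            results.put(((host, check_name), name))

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    done = {}
    while not results.empty():
        key, name = results.get()
        done[key] = name
    return done


# Hypothetical hosts and checks, for illustration only.
hosts = [f"node-{i:03d}" for i in range(5)]
checks = [(h, c) for h in hosts for c in ("disk", "network", "load")]
workers = ["worker-a", "worker-b", "worker-c"]
done = run_checks(checks, workers)
print(f"{len(done)} checks completed")
```

Because workers pull jobs rather than having jobs assigned to them, adding capacity in this model amounts to starting another worker against the same queue.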
Furthermore, to make the best use of the monitoring information and notify the appropriate communities of any issues with their systems, much work went into grouping the checks according to, for example, the function of the machine, the services running, the sub-detector it belongs to, and the criticality of the computer. An automated system that generates the configuration of the monitoring system has been produced to facilitate its evolution and maintenance. The use of these performance-enhancing modules and the work on grouping the checks have yielded impressive performance improvements over the previous Nagios infrastructure, allowing for the monitoring of X metrics per second (compared to Y on the previous system). Furthermore, the design allows the infrastructure to grow easily without the need to rethink the monitoring system as a whole.
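As an illustration of generating grouped configuration automatically, the sketch below derives host groups from an inventory and renders them in the Nagios/Icinga object definition syntax. It is only a sketch under stated assumptions: the inventory, its attribute names, and the group naming scheme are hypothetical, and the abstract does not describe how the authors' generator works.

```python
from collections import defaultdict

# Hypothetical inventory: host names and attribute values are invented.
INVENTORY = [
    {"host": "node-001", "function": "builder", "subdetector": "tracker", "criticality": "high"},
    {"host": "node-002", "function": "builder", "subdetector": "ecal", "criticality": "high"},
    {"host": "node-003", "function": "storage", "subdetector": "tracker", "criticality": "low"},
]


def group_hosts(inventory, keys=("function", "subdetector", "criticality")):
    """Return a dict mapping a group name such as 'function-builder' to its hosts."""
    groups = defaultdict(list)
    for entry in inventory:
        for key in keys:
            groups[f"{key}-{entry[key]}"].append(entry["host"])
    return dict(groups)


def render_hostgroups(groups):
    """Render one 'define hostgroup' block per group, in sorted order."""
    blocks = []
    for name in sorted(groups):
        members = ",".join(sorted(groups[name]))
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {name}\n"
            f"    members         {members}\n"
            "}\n"
        )
    return "\n".join(blocks)


groups = group_hosts(INVENTORY)
config = render_hostgroups(groups)
print(config)
```

Keeping the inventory as the single source of truth means that adding a machine or changing its role only requires regenerating the files, which is the kind of maintenance benefit the abstract attributes to automated generation.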

Primary author

Olivier Raginel (Massachusetts Inst. of Technology (US))

Co-authors

Mr Alexander Flossdorf (DESY)
Andre Georg Holzner (Univ. of California San Diego (US))
Andrea Petrucci (CERN)
Andrei Cristian Spataru (CERN)
Dr Attila Racz (CERN)
Aymeric Arnaud Dupont (CERN)
Christian Deldicque (CERN)
Christian Hartl (CERN)
Christoph Paus (Massachusetts Inst. of Technology (US))
Christoph Schwick (CERN)
Dennis Shpakov (Fermi National Accelerator Lab. (US))
Dominique Gigi (CERN)
Emilio Meschi (CERN)
Frank Glege (CERN)
Frans Meijers (CERN)
Gerry Bauer (Massachusetts Inst. of Technology (US))
Dr Giovanni Polese (CERN)
Hannes Sakulin (CERN)
James Branson (Univ. of California San Diego (US))
Dr Jeroen Hegeman (CERN)
Dr Jose Antonio Coarasa Perez (CERN)
Konstanty Sumorok (Massachusetts Inst. of Technology (US))
Lorenzo Masetti (CERN)
Luciano Orsini (CERN)
Dr Marc Dobson (CERN)
Marco Pieri (Univ. of California San Diego (US))
Marek Ciganek (CERN)
Matteo Sani (Univ. of California San Diego (US))
Matthew Bowen (University of the West of England)
Michal Simon
Olivier Bouffet (CERN)
Remi Mommsen (Fermi National Accelerator Lab. (US))
Robert Gomez-Reino Garrido (CERN)
Samim Erhan (Univ. of California Los Angeles (US))
Sebastian Bukowiec (CERN)
Sergio Cittolin (Univ. of California San Diego (US))
Ulf Behrens (Deutsches Elektronen-Synchrotron (DE))
Vivian O'Dell (Fermi National Accelerator Laboratory (FNAL))
Yi Ling Hwong (CERN)
