Speaker
Ray Spence
Description
Lawrence Berkeley National Laboratory/NERSC Division
Developing Nagios code to suspend checks during planned outages.
Raymond E. Spence
NERSC currently supports more than 13,000 computation nodes spread over six supercomputing or clustered systems. Together these systems access more than 13.5 PB of disk space via thousands of network interfaces. This environment enables scientists anywhere on the planet to log in, run code, and thereby conduct science at elite levels. Scientists depend on NERSC for 24x7 availability, and NERSC personnel in turn depend on industrial-strength system administration tools for our support efforts. Since monitoring everything from our largest system to the last network uplink is a chief concern at NERSC, we chose several years ago to employ Nagios as our monitoring solution. Nagios is a mature product with a great degree of flexibility. Although NERSC has found the free, open source version of Nagios sufficient in many ways, we eventually tired of one specific hole in this tool’s arsenal: planned downtime.
Any Nagios user eventually comes to know where to point and click to acknowledge alerts and twiddle other Nagios switches. However, when it comes to running large systems with multiple monitored services per node, point-and-click solutions do not scale. Like any supercomputing center, NERSC has many planned downtimes of varying size throughout the year. Unfortunately, we found no obvious way to configure Nagios to temporarily turn off checks on a resource that is about to be taken down. NERSC therefore began writing code to communicate directly with Nagios to suspend these checks. Over the past year NERSC has produced scripts that, respectively, schedule a planned downtime in Nagios, remove a planned downtime, and list scheduled downtimes. Further, each downtime can cover any number of services running on any number of nodes. We used our dedicated Physics cluster, PDSF, as our test bed and first production system for the scripts. Managing planned outages on PDSF helped us debug the code and learn how to avoid misuse of its various configuration options.
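This abstract does not say how the scripts talk to Nagios. As an illustration only, one documented route is the Nagios external command file, which accepts SCHEDULE_SVC_DOWNTIME commands. The sketch below builds one such command per node and service pair. The function names, the example host and service names, and the command file path are assumptions made for this sketch; they are not taken from the NERSC scripts.

```python
import time


def svc_downtime_commands(hosts, services, start, end, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME line per (host, service) pair.

    Hypothetical helper: start and end are Unix timestamps. The downtime is
    scheduled as fixed (fixed=1) and untriggered (trigger_id=0).
    """
    now = int(time.time()) if now is None else int(now)
    start, end = int(start), int(end)
    if end <= start:
        raise ValueError("downtime must end after it starts")
    duration = end - start
    lines = []
    for host in hosts:
        for service in services:
            # Semicolons separate fields, so they cannot appear in names.
            if ";" in host or ";" in service:
                raise ValueError("host and service names must not contain ';'")
            lines.append(
                f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
                f"{start};{end};1;0;{duration};{author};{comment}"
            )
    return lines


def submit(lines, command_file):
    """Write command lines to the Nagios external command file.

    The location of the command file is site specific (assumption: the caller
    knows it). It is normally a named pipe, so it is opened for writing.
    """
    with open(command_file, "w") as fh:
        for line in lines:
            fh.write(line + "\n")
```

A call such as `svc_downtime_commands(["node01", "node02"], ["SSH", "Load"], start, end, "ops", "planned maintenance")` would yield four command lines, one per node and service combination, which `submit` could then pass to Nagios in a single step.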
Today NERSC system managers can use our Nagios downtime scripts to quickly and easily accommodate downtime for anything Nagios monitors. Our downtime tool has saved a mountain of point-and-click work and spared us the risky last resort of manually disabling Nagios checks.
NERSC wishes to present these Nagios downtime scripts and describe more fully how this code has aided our support efforts.
Summary
NERSC has created and implemented original code to directly suspend Nagios monitors to accommodate planned outages.
Primary author
Ray Spence