Speakers
Description
URL for further information
https://cs-egee.srce.hr/nagios/
Detailed analysis
The new monitoring system comprises three levels:
• The site level, which consists of Nagios, the Nagios Config Generator (NCG), and probes for monitoring grid services. NCG uses information from grid information systems to create the appropriate Nagios configuration automatically. The main purpose of the site-level instance is to notify site administrators immediately when grid services have problems.
• The regional level, which monitors the set of sites in a single EGEE region. This instance, also configured by NCG, executes higher-level checks of grid services and publishes its monitoring results on the message bus. Its main purpose is to track site availability.
• The project level, where availability and reliability calculations take place and reports are generated.
This tiered approach to monitoring is being introduced in collaboration with other operational tools such as Gstat, GOCDB, and BDII.
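To make the role of a site-level probe concrete, the following is a minimal sketch of a check written to the standard Nagios plugin convention (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, plus a one-line status message). It is not one of the actual grid probes: the names check_tcp, classify and main, the plain TCP connection test, and the default thresholds are all assumptions chosen for illustration.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(elapsed, warn_seconds):
    """Map the time of a successful connection to OK or WARNING."""
    return WARNING if elapsed > warn_seconds else OK


def check_tcp(host, port, warn_seconds=2.0, timeout=10.0):
    """Attempt a TCP connection; return (exit_code, one-line message).

    Hypothetical stand-in for a real grid service probe.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused connections, timeouts and name-resolution failures
        # are all subclasses of OSError.
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port} ({exc})"
    code = classify(elapsed, warn_seconds)
    return code, f"{LABELS[code]} - connected to {host}:{port} in {elapsed:.3f}s"


def main(argv):
    """Command-line entry point: prints the status line, returns the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_tcp_sketch.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code
```

A script built this way would end with sys.exit(main(sys.argv)), so that Nagios reads the state from the process exit code and displays the printed line as the status text.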
Keywords
SAM Nagios NCG monitoring grid availability
Conclusions and Future Work
Grids are inherently distributed systems, but the adoption of a distributed monitoring solution must be balanced against the need for reliable and comparable metrics across regions. Having shown that the SAM and Nagios-based solutions produce equivalent availability calculations, we are confident in the validity of our new distributed monitoring infrastructure. It will actively contribute to enhancing site and grid availability through the greater participation of site administrators.
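The kind of equivalence check described above can be sketched as follows. This is only an illustration under stated assumptions: the availability definition used here (fraction of known test results that are OK, with UNKNOWN results excluded), the tolerance values, and the sample data are hypothetical and are not taken from the project's actual calculation.

```python
def availability(statuses):
    """Fraction of known samples that are 'OK'.

    Assumed simplified definition: 'UNKNOWN' samples are excluded.
    Returns None when no known samples exist.
    """
    known = [s for s in statuses if s != "UNKNOWN"]
    if not known:
        return None
    return sum(1 for s in known if s == "OK") / len(known)


def equivalent(a, b, tolerance=0.01):
    """True when two availability figures differ by at most `tolerance`."""
    if a is None or b is None:
        return a is b
    return abs(a - b) <= tolerance


# Hypothetical results for one site as seen by the two systems.
sam_results = ["OK", "OK", "CRITICAL", "OK", "UNKNOWN", "OK"]
nagios_results = ["OK", "OK", "CRITICAL", "OK", "OK", "OK"]

sam_avail = availability(sam_results)        # 4 OK out of 5 known samples
nagios_avail = availability(nagios_results)  # 5 OK out of 6 known samples

print(f"SAM: {sam_avail:.3f}  Nagios: {nagios_avail:.3f}")
print("equivalent within 5%:", equivalent(sam_avail, nagios_avail, tolerance=0.05))
```

In practice the tolerance and the treatment of unknown results and scheduled downtime would have to follow the project's agreed availability algorithm; the sketch only shows the shape of the comparison.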
Impact
Grid service availability on the EGEE grid is currently monitored by SAM (Service Availability Monitoring), which probes all services from a central location. Experience has shown certain drawbacks with this approach.
With the new system, we are able to achieve more precise and reliable monitoring. Because site administrators are alerted to problems immediately, services can be restored faster, leading to a more available infrastructure. The overall system also scales better, allowing for an increased testing frequency.
By using popular and well-proven open-source software solutions, we can rely on support from the worldwide community – which is not the case with in-house solutions. In addition, a plethora of standard probes and add-ons for visualization already exist for Nagios. Use of a standard messaging architecture ensures a more robust coupling with other components, such as Gstat, GOCDB and BDII. The overriding goal is to achieve an efficient regional monitoring infrastructure.