
2–6 Mar 2009
Le Ciminiere, Catania, Sicily, Italy
Europe/Rome timezone

Evolution of SAM in an enhanced model for monitoring the EGEE grid

3 Mar 2009, 12:00
20m
Michelangelo (120) (Le Ciminiere, Catania, Sicily, Italy)

Viale Africa 95100 Catania
Oral Grid Services exploiting and extending gLite middleware Monitoring

Speakers

Emir Imamagic (SRCE)

Description

The SAM monitoring system enables grid administrators to track the availability of resources and to receive alarms when services fail. We describe an enhanced, distributed, multi-level monitoring system for the EGEE grid. The core of the system consists of commodity technologies: Nagios for the monitoring framework, and the ActiveMQ messaging system for interconnecting components. We integrated them tightly with the grid information system, and extended them with grid-specific probes.

Conclusions and Future Work

Grids are inherently distributed systems, but the adoption of a distributed monitoring solution must be balanced against the need for reliable and comparable metrics across regions. Having shown that the SAM and Nagios-based solutions produce equivalent availability calculations, we are confident in the validity of our new distributed monitoring infrastructure. It will actively contribute to enhancing site and grid availability through the greater participation of site administrators.
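
As an illustration of what such an equivalence check might look like, the following minimal sketch compares availability figures derived from two series of test results. The abstract does not give the actual availability algorithm, so the formula here (fraction of OK results among results with a known status), the status names, and the 1% tolerance are all assumptions made for the example.

```python
def availability(samples):
    """Fraction of OK results among results with a known status.

    Assumption: results marked "UNKNOWN" are excluded from the
    calculation. The real SAM/EGEE algorithm is not described in the
    abstract and may differ.
    """
    known = [s for s in samples if s != "UNKNOWN"]
    if not known:
        return None  # nothing to base a figure on
    return sum(1 for s in known if s == "OK") / len(known)


def equivalent(a, b, tolerance=0.01):
    """True if both figures exist and differ by at most `tolerance`.

    The tolerance value is a hypothetical choice for illustration.
    """
    return a is not None and b is not None and abs(a - b) <= tolerance


# Hypothetical result series for one site over the same period.
sam_results = ["OK", "OK", "CRITICAL", "OK"]
nagios_results = ["OK", "UNKNOWN", "OK", "CRITICAL", "OK"]

sam_avail = availability(sam_results)
nagios_avail = availability(nagios_results)
print(sam_avail, nagios_avail, equivalent(sam_avail, nagios_avail))
```

Comparing per-site figures in this way over a common period is one simple way to check that two monitoring paths agree before one replaces the other.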

Keywords

SAM Nagios NCG monitoring grid availability

Detailed analysis

The new monitoring system comprises three levels:
• The site level, which consists of Nagios, the Nagios Config Generator (NCG), and probes for monitoring grid services. NCG uses information from the grid information systems to create the appropriate Nagios configuration automatically. The main purpose of the site-level instance is to notify site administrators immediately when grid services have problems.
• The regional level, which monitors the set of sites in a single EGEE region. This instance, also configured by NCG, executes higher-level checks of grid services. The regional instance publishes its monitoring results on the message bus. The main purpose of this instance is to track site availability.
• The project level, where availability and reliability calculations take place and reports are generated.
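
To make the configuration step concrete, here is a minimal sketch of how a generator in the spirit of NCG could turn service records into Nagios object definitions. It is not NCG's actual code: the `GridService` record, the probe command names in `PROBES`, and the `generic-host` / `generic-service` template names are hypothetical, and a real generator would obtain its records from the grid information system rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridService:
    """One service endpoint as it might be listed in an information system."""
    hostname: str
    service_type: str  # e.g. "CE" or "SRM"


# Hypothetical mapping from service type to probe command names.
PROBES = {
    "CE": ["check_ce_job_submit"],
    "SRM": ["check_srm_put", "check_srm_get"],
}


def render_nagios_config(services):
    """Return Nagios host and service definitions as a single string."""
    lines = []
    # One host definition per distinct hostname.
    for host in sorted({s.hostname for s in services}):
        lines += [
            "define host {",
            f"    host_name {host}",
            f"    address {host}",
            "    use generic-host",  # assumed template name
            "}",
            "",
        ]
    # One service definition per (service, probe) pair.
    for svc in sorted(services, key=lambda s: (s.hostname, s.service_type)):
        for probe in PROBES.get(svc.service_type, []):
            lines += [
                "define service {",
                f"    host_name {svc.hostname}",
                f"    service_description {svc.service_type}-{probe}",
                f"    check_command {probe}",
                "    use generic-service",  # assumed template name
                "}",
                "",
            ]
    return "\n".join(lines)


# Hypothetical input: two services running on the same host.
example = [
    GridService("grid01.example.org", "CE"),
    GridService("grid01.example.org", "SRM"),
]
config = render_nagios_config(example)
print(config)
```

Deriving the configuration from a list of services, rather than writing it by hand, means that a site-level and a regional instance can be regenerated whenever the set of published services changes.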

This tiered approach to monitoring is being introduced in collaboration with other operational tools such as Gstat, GOCDB, and BDII.

Impact

Grid service availability on the EGEE grid is currently monitored by SAM (Service Availability Monitoring), which probes all services from a central location. Experience has shown certain drawbacks with this approach.
With the new system, we are able to achieve more precise and reliable monitoring. Because site administrators are alerted to problems immediately, service can be restored faster, leading to a more available infrastructure. The overall system also scales better, allowing for an increased testing frequency.
By using popular and well-proven open-source software solutions, we can rely on support from the worldwide community – which is not the case with in-house solutions. In addition, a plethora of standard probes and add-ons for visualization already exist for Nagios. Use of a standard messaging architecture ensures a more robust coupling with other components, such as Gstat, GOCDB and BDII. The overriding goal is to achieve an efficient regional monitoring infrastructure.

URL for further information

https://cs-egee.srce.hr/nagios/

Primary authors

Emir Imamagic (SRCE) John Shade (CERN)
