4th EGEE User Forum/OGF 25 and OGF Europe's 2nd International Event

Name: 4th EGEE User Forum/OGF 25 and OGF Europe's 2nd International Event
Start: 2009-03-02T09:00:00+01:00
End: 2009-03-06T17:30:00+01:00
Location: Le Ciminiere, Catania, Sicily, Italy

Europe/Rome timezone

Support

Kristina.Ulrika.Gunne@cern.ch

Evolution of SAM in an enhanced model for monitoring the EGEE grid

3 Mar 2009, 12:00

20m

Michelangelo (120) (Le Ciminiere, Catania, Sicily, Italy)

Viale Africa 95100 Catania

Oral

Grid Services exploiting and extending gLite middleware

Monitoring

Emir Imamagic (SRCE)

The SAM monitoring system enables grid administrators to track the availability of resources and to receive alarms when services fail. We describe an enhanced, distributed, multi-level monitoring system for the EGEE grid. The core of the system consists of commodity technologies: Nagios as the monitoring framework and the ActiveMQ messaging system for interconnecting components. We integrated them tightly with the grid information system and extended them with grid-specific probes.

URL for further information

https://cs-egee.srce.hr/nagios/

Detailed analysis

The new monitoring system comprises three levels:
• The site level, which consists of Nagios, the Nagios Config Generator (NCG), and probes for monitoring grid services. NCG uses information from the grid information systems to create the appropriate Nagios configuration automatically. The main purpose of the site-level instance is to notify site administrators immediately when grid services have problems.
• The regional level, which monitors the set of sites in a single EGEE region. This instance, also configured by NCG, executes higher-level checks of grid services and publishes its monitoring results on the message bus. Its main purpose is to track site availability.
• The project level, where availability and reliability calculations take place and reports are generated.
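
As a rough illustration of the site-level step described above, the sketch below turns a list of discovered service endpoints into Nagios object definitions. It is not NCG itself: the `Endpoint` record, the `CHECKS` mapping, the check command names, and the `generic-host` / `generic-service` templates are assumptions made for the example, and the endpoint list is passed in directly where a real generator would query the grid information system. Only the `define host` / `define service` object syntax is standard Nagios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One service endpoint, as it might be discovered in the information system."""
    hostname: str
    service_type: str  # e.g. "CE" or "SRM" (illustrative labels)


# Hypothetical mapping from service type to a Nagios check command name.
CHECKS = {
    "CE": "check_ce_job_submit",
    "SRM": "check_srm_put_get",
}


def render_object(kind: str, fields: dict) -> str:
    """Render one Nagios object definition block."""
    body = "\n".join(f"    {key:<24}{value}" for key, value in fields.items())
    return f"define {kind} {{\n{body}\n}}\n"


def generate_config(endpoints: list) -> str:
    """Emit one host object per hostname and one service object per known check."""
    blocks = []
    seen_hosts = set()
    for ep in endpoints:
        if ep.hostname not in seen_hosts:
            seen_hosts.add(ep.hostname)
            blocks.append(render_object("host", {
                "use": "generic-host",        # assumed host template
                "host_name": ep.hostname,
                "address": ep.hostname,
            }))
        command = CHECKS.get(ep.service_type)
        if command is None:
            continue  # no probe known for this service type; skip it
        blocks.append(render_object("service", {
            "use": "generic-service",     # assumed service template
            "host_name": ep.hostname,
            "service_description": f"{ep.service_type}-check",
            "check_command": command,
        }))
    return "\n".join(blocks)
```

Calling `generate_config([Endpoint("ce01.example.org", "CE")])` yields one host block and one service block; regenerating the file whenever the information system changes is what would keep the monitoring configuration in step with the services a site actually publishes.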

This tiered approach to monitoring is being introduced in collaboration with other operational tools such as Gstat, GOCDB, and BDII.

Keywords

SAM Nagios NCG monitoring grid availability

Conclusions and Future Work

Grids are inherently distributed systems, but the adoption of a distributed monitoring solution must be balanced against the need for reliable and comparable metrics across regions. By showing equivalence in the availability calculations produced by the SAM and Nagios-based solutions, we are confident in the validity of our new distributed monitoring infrastructure. It will actively contribute to enhancing site and grid availability through the greater participation of site administrators.
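
One way to picture the equivalence check mentioned above is to compute availability for the same site from two independent streams of test results and compare the figures within a tolerance. The sketch below assumes a deliberately simple definition (the fraction of samples with known status that are OK); the actual EGEE availability algorithm, the status labels, and the 0.01 tolerance are not taken from the abstract and are illustrative only.

```python
from typing import Iterable, Optional


def availability(statuses: Iterable[str]) -> Optional[float]:
    """Fraction of known samples that are "OK"; None if nothing is known."""
    known = [s for s in statuses if s != "UNKNOWN"]
    if not known:
        return None
    return sum(1 for s in known if s == "OK") / len(known)


def equivalent(a: Iterable[str], b: Iterable[str], tolerance: float = 0.01) -> bool:
    """True if both sources yield availability figures within `tolerance`."""
    av_a, av_b = availability(a), availability(b)
    if av_a is None or av_b is None:
        return False  # cannot claim equivalence without data from both sources
    return abs(av_a - av_b) <= tolerance
```

For instance, a central stream `["OK", "OK", "CRITICAL", "OK"]` and a regional stream with the same results plus one `"UNKNOWN"` sample both give 0.75 under this definition, so `equivalent` would report them as matching.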

Impact

Grid service availability on the EGEE grid is currently monitored by SAM (Service Availability Monitoring), which probes all services from a central location. Experience has shown certain drawbacks with this approach.
With the new system, we are able to achieve more precise and reliable monitoring. Because site administrators are alerted to problems immediately, service can be restored faster, which leads to a more available infrastructure. The overall system also scales better, allowing for an increased testing frequency.
By using popular and well-proven open-source software, we can rely on support from the worldwide community, which is not the case with in-house solutions. In addition, a plethora of standard probes and visualization add-ons already exists for Nagios. Use of a standard messaging architecture ensures a more robust coupling with other components, such as Gstat, GOCDB and BDII. The overriding goal is to achieve an efficient regional monitoring infrastructure.
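
To make the messaging-based coupling more concrete, the sketch below shows a check result being serialized into a self-describing text message, placed on a queue, and decoded by a consumer. Everything here is an assumption for illustration: the abstract does not specify the message format, so the JSON field names are hypothetical, and an in-memory `queue.Queue` stands in for a broker destination rather than a real ActiveMQ connection.

```python
import json
import queue
from datetime import datetime, timezone


def encode_result(site: str, service: str, status: str, timestamp: datetime) -> str:
    """Serialize one check result as a JSON text message (hypothetical schema)."""
    return json.dumps({
        "site": site,
        "service": service,
        "status": status,
        "timestamp": timestamp.isoformat(),
    }, sort_keys=True)


def decode_result(message: str) -> dict:
    """Parse a message produced by encode_result back into a record."""
    record = json.loads(message)
    record["timestamp"] = datetime.fromisoformat(record["timestamp"])
    return record


# In-memory stand-in for a broker destination; not an ActiveMQ client.
bus = queue.Queue()

# Producer side: a regional instance publishes a result.
bus.put(encode_result("SITE-A", "CE", "OK",
                      datetime(2009, 3, 3, 12, 0, tzinfo=timezone.utc)))

# Consumer side: another component reads and decodes it.
received = decode_result(bus.get())
```

The point of the design is that producer and consumer agree only on the message schema, so either side can be replaced without the other needing to change.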

Emir Imamagic (SRCE), John Shade (CERN)

Slides

Evolution_of_SAM_in_an_enhanced_model_for_monitoring_the_EGEE_grid.pdf

Evolution_of_SAM_in_an_enhanced_model_for_monitoring_the_EGEE_grid.ppt
