9–11 May 2007
Manchester, United Kingdom
Europe/Zurich timezone

EGEE grid infrastructure monitoring based on Nagios

11 May 2007, 15:20
20m
Manchester, United Kingdom

oral presentation Grid Monitoring and Accounting

Speaker

Mr Emir Imamagic (University Computing Centre (SRCE))

Describe the scientific/technical community and the scientific/technical activity using (planning to use) the EGEE infrastructure. A high-level description is needed (neither a detailed specialist report nor a list of references).

We extended the Nagios monitoring framework with grid-specific features in order to
implement an efficient grid monitoring system. The main goal of this system is to
improve the availability of grid hosts and services through precise problem detection
and instant notification. The most important extensions we implemented are sensors
for various EGEE services, an advanced sensor hierarchy and certificate-based
authorization on the web interface. The system is intended for various types of grid
operators.

Report on the experience (or the proposed activity). It would be very important to mention key services which are essential for the success of your activity on the EGEE infrastructure.

Our monitoring system does not require any additional services or software to be
deployed on EGEE resources. A single Nagios service, deployed on a central server,
performs all the checks. So far, we have developed over 20 sensors for various EGEE
services. Some sensors perform basic checks, while others perform complex functional
checks (e.g. job submission). Sensors are organized hierarchically so that simple
checks are performed more often and, when a simple check fails, the dependent complex
check is not performed. To form as accurate a picture of the overall system as
possible, we utilize several available EGEE information services. The most important
of these is the Grid Operational Database (GOCDB), from which we gather all the
information about hosts, services and site administrators. GOCDB is also used for
importing scheduled downtimes into our system. Besides GOCDB, we utilize the central
and site BDII services to obtain additional information about sites.
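
The sensor hierarchy described above can be illustrated with a minimal sketch. It is
not the actual implementation and does not use Nagios itself: the Sensor class, the
run_sensors function and the "SKIPPED" label are hypothetical names introduced only
for illustration, and the sketch assumes that each simple check is listed before the
complex checks that depend on it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sensor:
    """Hypothetical sensor: a named check with an optional parent check."""
    name: str
    check: Callable[[], bool]      # returns True when the service is healthy
    parent: Optional[str] = None   # name of the simple check this one depends on


def run_sensors(sensors: List[Sensor]) -> Dict[str, str]:
    """Run sensors in order, skipping any sensor whose parent did not pass.

    Assumes parents appear in the list before their dependants.
    """
    results: Dict[str, str] = {}
    for sensor in sensors:
        if sensor.parent is not None and results.get(sensor.parent) != "OK":
            # The simple check failed (or was itself skipped), so the
            # dependent complex check is not executed at all.
            results[sensor.name] = "SKIPPED"
            continue
        results[sensor.name] = "OK" if sensor.check() else "CRITICAL"
    return results


# Usage: a failing basic check suppresses the dependent functional check.
sensors = [
    Sensor("host-alive", check=lambda: False),
    Sensor("job-submission", check=lambda: True, parent="host-alive"),
]
status = run_sensors(sensors)
# status == {"host-alive": "CRITICAL", "job-submission": "SKIPPED"}
```

Skipping the dependent check avoids spending time on an expensive functional test,
such as job submission, when a cheaper check has already located the problem.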

Describe the added value of the Grid for the scientific/technical activity you (plan to) do on the Grid. This should include the scale of the activity and of the potential user community and the relevance for other scientific or business applications.

This system has been deployed for monitoring core EGEE services (e.g. BDII, Resource
Broker) in the Central Europe (CE) Federation since May 2006. In September 2006, the
system was extended to monitor all grid sites in the CE Federation. Currently the
system monitors 67 nodes and over 550 services. System status can be seen at the
following address: http://nagios.ce-egee.org. The system has been very well accepted
in the CE Federation. Since deployment, it has been used by core services managers
and first-line support personnel. Besides service monitoring, our system is used for
certification (e.g. functionality testing) of new middleware installations at sites.
Grid sites are provided with two options: receiving instant notifications and
retrieving information through the web interface. Besides EGEE monitoring, we use
this system for monitoring resources on the Croatian national grid CRO-GRID, where a
further extension in the form of an automatic recovery mechanism was implemented and
successfully utilized.

With a forward look to future evolution, discuss the issues you have encountered (or that you expect) in using the EGEE infrastructure. Wherever possible, point out the experience limitations (both in terms of existing services or missing functionality).

Currently our system performs only external monitoring of hosts and services.
However, the system could easily be extended to monitor the local fabric. Such a
deployment would combine the external and internal views of individual hosts and
services and thus give a more accurate status of monitored objects. It would also
make it possible to use the system's mechanisms for automatic recovery of services.
Furthermore, our system is open and capable of supporting new EGEE grid services as
they emerge.

Primary author

Mr Emir Imamagic (University Computing Centre (SRCE))

Co-author

Mr Dobrisa Dobrenic (University Computing Centre (SRCE))
