Integration of Nagios plug-in into a data model to improve the Grid stability from the user point of view

Dr Silvio Pardi (INFN - Naples Unit)


In this work, we propose a new approach to publishing and consuming monitoring information about Grid sites within a gLite-based infrastructure. Starting from a set of tests that are crucial for a Grid site or for a Virtual Organization, we created a data model to represent them in the standard gLite information system. Using Nagios, we periodically performed sanity checks and published the results for use in resource monitoring and resource discovery. Preliminary tests have shown the benefits of this approach in terms of successful jobs.

In the EGEE framework, many monitoring tools provide a large amount of information oriented towards site administrators, e.g. GridICE, GSTAT, GridMap, Service Availability Monitoring (SAM) and the Site-Nagios.
Our approach aims to use monitoring information at the user level, in order to minimize the effects of Grid service failures and improve the user's perception of the Grid. To this end, we selected a set of tests crucial for the success of Grid jobs and data-management activities. We then created a data model that represents this information as attribute-value pairs. In particular, each result is represented by a string containing the test status, followed by the deadline until which the check remains valid. We implemented a set of Nagios plug-ins that publish the test results on the site-BDII. The plug-ins are designed to test the crucial aspects within the pilot virtual organization matisse, and Nagios schedules them periodically. In this scenario, each site publishes the status of its Grid services in real time to the top-level information system. Users can then specify the monitoring metrics as job requirements at submission time.
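The result string described above (test status followed by the validity deadline) could be encoded and checked along the following lines. This is only a minimal sketch: the abstract does not give the exact string layout, so the space separator, the UTC timestamp format and the function names are assumptions made for illustration. Only the mapping of exit codes to OK, WARNING, CRITICAL and UNKNOWN follows the usual Nagios plug-in convention.

```python
from datetime import datetime, timedelta, timezone

# Usual Nagios plug-in exit codes and their status labels.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed timestamp layout (UTC); the real data model may differ.
TIME_FORMAT = "%Y%m%d%H%M%SZ"


def encode_result(exit_code, checked_at, validity):
    """Build the attribute value: '<STATUS> <deadline of validity>'."""
    status = NAGIOS_STATUS.get(exit_code, "UNKNOWN")
    deadline = checked_at + validity
    return f"{status} {deadline.strftime(TIME_FORMAT)}"


def decode_result(value):
    """Split a published value back into (status, deadline)."""
    status, stamp = value.split(" ", 1)
    deadline = datetime.strptime(stamp, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return status, deadline


def is_usable(value, now):
    """A result counts only if the test passed and has not yet expired."""
    status, deadline = decode_result(value)
    return status == "OK" and now <= deadline


if __name__ == "__main__":
    checked = datetime(2010, 4, 12, 10, 0, 0, tzinfo=timezone.utc)
    value = encode_result(0, checked, timedelta(minutes=30))
    print(value)
    print(is_usable(value, checked + timedelta(minutes=10)))
```

Carrying the deadline inside the value lets a consumer discard stale results on its own, even if the plug-in that produced them has stopped running.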

Preliminary tests show interesting improvements from the user’s point of view. In future work we plan to deploy this approach on a large-scale Grid and gather feedback in order to identify the best metrics to publish, improve the data model, and maximize the positive impact on Grid stability. Finally, we plan to investigate the introduction of a site reputation concept as a new metric to use during the resource discovery process.


The introduction of monitoring metrics into the information system opens up interesting new scenarios for improving Grid stability and facilitating operations in a production Grid infrastructure.
We obtained interesting stability improvements from the user’s point of view. This effect emerged in a set of preliminary tests, which showed the positive impact of the proposed approach in terms of the number of jobs completed successfully. By adding a requirement tag to the JDL at submission time, users can transparently avoid unstable resources that are still present in the information system. Finally, regarding operations, the expected impact is to support administrators, who can benefit from a more stable system while recovering from problems as they arise.
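As an illustration of the requirement tag mentioned above, the sketch below generates a JDL text whose Requirements expression keeps only resources whose published metric string starts with OK. The attribute name passed to the functions is hypothetical, since the abstract does not name the attribute used, and matching with the JDL RegExp function is an assumption about how such a string could be tested. The expression does not check the validity deadline.

```python
def metric_requirement(attribute, extra=None):
    """Return a JDL Requirements line for a published monitoring metric.

    'attribute' is a hypothetical attribute name, not one taken from
    the abstract or from the GLUE schema.
    """
    clause = f'RegExp("^OK", other.{attribute})'
    if extra:
        # Combine with any requirement the user already had.
        clause = f"({extra}) && {clause}"
    return f"Requirements = {clause};"


def build_jdl(executable, attribute, extra=None):
    """Assemble a small JDL job description as plain text."""
    lines = [
        f'Executable = "{executable}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        metric_requirement(attribute, extra),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_jdl("/bin/hostname", "HypotheticalMetricAttribute"))
```

Because the filter lives in the job description, the user's submission procedure is otherwise unchanged; matchmaking simply skips resources that do not satisfy the expression.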

URL for further information: http://people.na.infn.it/spardi
Keywords: Monitoring, Information System, User interface



Dr Andrea Apicella (University of Naples Federico II); Dr Francesco Palmieri (University of Naples Federico II); Prof. Guido Russo (University of Naples Federico II); Prof. Leonardo Merola (University of Naples Federico II); Dr Simone Celestino (University of Naples Federico II)

