Integration of Nagios plug-in into a data model to improve the Grid stability from the user point of view
Presented by Dr. Silvio PARDI on 12 Apr 2010 from 17:30 to 17:33
Session: Poster session
Track: Software services exploiting and/or extending grid middleware (gLite, ARC, UNICORE etc)
In this work, we propose a new approach to publishing and consuming monitoring information about Grid sites within a gLite-based infrastructure. Starting from a set of tests that are crucial for a Grid site or for a Virtual Organization, we created a data model to represent them in the standard gLite information system. Using Nagios, we periodically performed sanity checks and published the results for use in resource monitoring or resource discovery. Preliminary tests have shown the benefits of this approach in terms of successful jobs.
In the EGEE framework, many monitoring tools provide a large amount of information oriented towards site administrators, e.g. GridICE, GSTAT, Gridmap, Service Availability Monitoring (SAM) and the site-level Nagios. Our approach aims instead to use monitoring information at the user level, in order to minimize the effects of grid service failures and improve the user's perception of the Grid. To this end, we selected a set of tests that are crucial for the success of Grid jobs and data-management activities, and created a data model that represents this information as attribute-value pairs. In particular, each result is represented by a string containing the test status followed by the deadline until which the check is valid. We implemented a set of Nagios plug-ins that publish the test results on the site-BDII. The plug-ins were designed to test the crucial aspects within the pilot virtual organization matisse, and they are periodically scheduled by Nagios. In this scenario, each site publishes the status of its grid services in real time to the top-level information system. Users can then specify the monitoring metrics as job requirements at submission time.
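The result string described above (test status followed by a validity deadline) can be illustrated with a minimal Python sketch. The abstract does not give the exact layout, so the space-separated format, the ISO 8601 UTC timestamp, and the function names `encode_result`, `decode_result` and `is_usable` are all assumptions made for illustration; only the numeric exit codes follow the standard Nagios plug-in convention.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plug-in exit codes and their conventional names.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed timestamp layout for the validity deadline (not taken from the abstract).
DEADLINE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def encode_result(exit_code, checked_at, validity):
    """Build the published value '<STATUS> <deadline>' (hypothetical layout)."""
    deadline = checked_at + validity
    return f"{STATUS_NAMES[exit_code]} {deadline.strftime(DEADLINE_FORMAT)}"


def decode_result(value):
    """Split a published value back into (status, deadline as aware UTC datetime)."""
    status, deadline_text = value.split(" ", 1)
    deadline = datetime.strptime(deadline_text, DEADLINE_FORMAT)
    return status, deadline.replace(tzinfo=timezone.utc)


def is_usable(value, now):
    """A resource counts as usable only if the test passed and the result has not expired."""
    status, deadline = decode_result(value)
    return status == "OK" and now <= deadline


if __name__ == "__main__":
    checked_at = datetime(2010, 4, 12, 17, 30, tzinfo=timezone.utc)
    value = encode_result(0, checked_at, timedelta(minutes=30))
    print(value)
    print(is_usable(value, checked_at + timedelta(minutes=15)))
```

Carrying the deadline inside the value lets a consumer discard stale results even if the publishing plug-in stops running, which is one plausible reason for pairing the status with an expiry time.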
The introduction of monitoring metrics into the information system opens up interesting new scenarios for improving grid stability and facilitating operations in a production grid infrastructure. We obtain stability improvements from the user’s point of view. This effect emerged from a set of preliminary tests that showed the positive impact of the proposed approach in terms of the number of jobs successfully completed. By adding a requirement expression to the JDL at submission time, users can transparently avoid unstable resources that are still present in the information system. Finally, regarding operations, the expected impact is to support administrators, who can take advantage of a more stable system while recovering from problems as they occur.
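As a rough illustration of how a user might attach such a requirement to a job description, the Python sketch below generates a small JDL text. The attribute name `HypotheticalMatisseJobStatus` is a placeholder, since the abstract does not say under which information-system attribute the metrics are published, and the helper names `metric_requirement` and `build_jdl` are likewise invented here. The sketch also assumes that the published value starts with the status word and that the match-maker accepts a `RegExp` expression on that attribute; it does not evaluate the validity deadline.

```python
def metric_requirement(attribute, status="OK"):
    """Return a ClassAd-style expression selecting resources whose published
    value for `attribute` starts with `status` (assumed value layout)."""
    return f'RegExp("^{status} ", other.{attribute})'


def build_jdl(executable, requirements):
    """Assemble a minimal JDL text with all requirements combined by logical AND."""
    combined = " && ".join(requirements)
    lines = [
        f'Executable = "{executable}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        f"Requirements = {combined};",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder attribute name; the real one depends on the published data model.
    requirement = metric_requirement("HypotheticalMatisseJobStatus")
    print(build_jdl("/bin/hostname", [requirement]))
```

Keeping the requirement builder separate from the JDL assembly makes it easy to combine several monitoring metrics in one job description.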
Preliminary tests show interesting improvements from the user’s point of view. In future work we plan to deploy this approach in a large-scale Grid and gather feedback in order to identify the best metrics to publish, improve the data model and maximize the positive impact on Grid stability. Finally, we plan to investigate the introduction of a site reputation concept as a new metric for use during the resource discovery process.
Monitoring, Information System, User interface
Location: Uppsala University