The EGEE framework offers many monitoring tools that provide a large amount of information oriented to site administrators, e.g. GridICE, GSTAT, Gridmap, Service Availability Monitoring (SAM) and the Site-Nagios.
Our approach aims at using monitoring information at the user level, in order to minimize the effects of grid service failures and improve the user's perception of the Grid. To this end, we selected a set of tests that are crucial for the success of Grid jobs and data-management activities. We then created a data model that represents this information as attribute-value pairs. In particular, each result is represented by a string containing the test status, followed by the deadline up to which the check is considered valid. We implemented a set of plug-ins for Nagios that publish the test results on the site-BDII. The plug-ins were designed to test the crucial aspects within the pilot virtual organization matisse, and they are periodically scheduled by Nagios. In this scenario, each site publishes the status of its grid services in real time to the top-level information system. Users can then specify the monitoring metrics as job requirements at submission time.
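The result string described above (test status followed by the validity deadline) can be illustrated with a minimal Python sketch. The abstract does not give the concrete encoding, so the separator, the timestamp format, the status label "OK" and all function names below are assumptions made only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Hypothetical encoding: the abstract does not specify the separator
# or the timestamp format used in the published string.
SEPARATOR = "|"
TIME_FORMAT = "%Y%m%d%H%M%SZ"


def encode_result(status: str, checked_at: datetime, validity: timedelta) -> str:
    """Build the published value: test status followed by the validity deadline.

    `checked_at` must be timezone-aware; the deadline is written in UTC.
    """
    deadline = (checked_at + validity).astimezone(timezone.utc)
    return f"{status}{SEPARATOR}{deadline.strftime(TIME_FORMAT)}"


def decode_result(value: str) -> Tuple[str, datetime]:
    """Split a published value back into (status, deadline in UTC)."""
    status, raw_deadline = value.split(SEPARATOR, 1)
    deadline = datetime.strptime(raw_deadline, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return status, deadline


def is_usable(value: str, now: datetime) -> bool:
    """A result counts only if the test passed and its deadline has not expired."""
    status, deadline = decode_result(value)
    return status == "OK" and now <= deadline
```

Carrying the deadline inside the value lets a consumer discard stale results without knowing the plug-in's scheduling interval: a result whose deadline has passed is treated the same as a failed test.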
Conclusions and Future Work
Preliminary tests show interesting improvements from the user's point of view. In future work we plan to deploy this approach on a large-scale Grid and gather feedback in order to identify the best metrics to publish, improve the data model and maximize the positive impact on Grid stability. Finally, we plan to investigate the introduction of a site reputation concept as a new metric to use during the resource discovery process.
Introducing monitoring metrics into the information system opens up interesting new scenarios for improving grid stability and facilitating operations in a production grid infrastructure.
We obtained interesting stability improvements from the user's point of view. This effect emerged from a set of preliminary tests that showed the positive impact of the proposed approach in terms of the number of jobs successfully completed. By adding the requirement tag to the JDL at submission time, users can transparently avoid the unstable resources that are still present in the information system. Finally, regarding operations, the expected impact is to support administrators, who can rely on a more stable system while they recover from emerging problems.
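The effect of such a requirement can be sketched as a simple filter over the published metrics. This is not the matchmaking performed by the Grid middleware, only a simulation under stated assumptions: the `Metric` record, the metric names, the site names and the status label "OK" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Metric:
    """One already-parsed published result (hypothetical structure)."""
    status: str
    deadline: datetime


def matching_sites(
    published: Dict[str, Dict[str, Metric]],
    required: List[str],
    now: datetime,
) -> List[str]:
    """Return the sites where every required metric is OK and still valid."""
    selected = []
    for site, metrics in published.items():
        healthy = all(
            name in metrics
            and metrics[name].status == "OK"
            and now <= metrics[name].deadline
            for name in required
        )
        if healthy:
            selected.append(site)
    return sorted(selected)


if __name__ == "__main__":
    now = datetime(2010, 1, 1, 12, 0, tzinfo=timezone.utc)
    later = datetime(2010, 1, 1, 13, 0, tzinfo=timezone.utc)
    earlier = datetime(2010, 1, 1, 11, 0, tzinfo=timezone.utc)
    published = {
        "site-a": {"job-submission": Metric("OK", later)},
        "site-b": {"job-submission": Metric("CRITICAL", later)},
        "site-c": {"job-submission": Metric("OK", earlier)},  # stale result
    }
    print(matching_sites(published, ["job-submission"], now))
```

A site that fails a required test, publishes no result for it, or publishes only an expired result is excluded, which mirrors how a user avoids unstable resources that are still listed in the information system.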
URL for further information: http://people.na.infn.it/spardi
Keywords: Monitoring, Information System, User interface