Description
Detailed analysis
Several monitoring systems are in place (R-GMA, IC-RTM, MonALISA), some of which also deliver the error codes of failed Grid jobs. However, an error code does not always identify the actual source of the error, so a more sophisticated methodology is required to locate problematic Grid elements. We propose to mine Grid monitoring data using association rules. This approach yields additional knowledge about the behavior of Grid elements by taking into account correlations and dependencies between the characteristics of failed Grid jobs.
Beyond detection, errors must also be interpreted and understood before the underlying problems can be solved. This crucial task is carried out by the experienced users and administrators involved in everyday Grid operations.
Our contribution is the design of an expert system that combines the error patterns found by mining the monitoring data with the experts' knowledge of the underlying problem and its solution.
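As an illustration of association rule mining on failed-job records, the following minimal Python sketch enumerates attribute combinations and keeps the rules whose support and confidence exceed given thresholds. The job attributes (`site`, `queue`, `error`), their values, and the thresholds are invented for this example and are not taken from any of the monitoring systems named above. The brute-force enumeration stands in for a production algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical failed-job records; attribute names and values are invented.
FAILED_JOBS = [
    {"site": "SITE_A", "queue": "long", "error": "STAGEOUT_FAILED"},
    {"site": "SITE_A", "queue": "short", "error": "STAGEOUT_FAILED"},
    {"site": "SITE_A", "queue": "long", "error": "STAGEOUT_FAILED"},
    {"site": "SITE_B", "queue": "long", "error": "WALLTIME_EXCEEDED"},
    {"site": "SITE_C", "queue": "long", "error": "WALLTIME_EXCEEDED"},
    {"site": "SITE_B", "queue": "short", "error": "STAGEOUT_FAILED"},
]


def mine_rules(jobs, min_support=0.3, min_confidence=0.8):
    """Return rules (antecedent, consequent, support, confidence).

    Each job is treated as a transaction of (attribute, value) items.
    Only rules with a single-item consequent are generated.
    """
    n = len(jobs)
    counts = Counter()
    for job in jobs:
        items = sorted(job.items())
        # Count every non-empty subset of the job's items (brute force).
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1

    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) < 2 or support < min_support:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # confidence = P(consequent | antecedent)
            confidence = count / counts[antecedent]
            if confidence >= min_confidence:
                rules.append(
                    (dict(antecedent), dict([consequent]), support, confidence)
                )
    return rules


if __name__ == "__main__":
    for antecedent, consequent, support, confidence in mine_rules(FAILED_JOBS):
        print(f"{antecedent} -> {consequent} "
              f"(support={support:.2f}, confidence={confidence:.2f})")
```

In this toy data set, every failed job at `SITE_A` carries the same error, so the sketch reports a rule linking the site to that error. A correlation of this kind points at a Grid element rather than at an individual job.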
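The pairing of mined error patterns with expert knowledge can be sketched as a small lookup structure. This is only an assumed illustration, not the design presented in the contribution: the `KnowledgeEntry` class, the `diagnose` function, and all patterns, problems, and solutions below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeEntry:
    """One expert-annotated error pattern (all contents hypothetical)."""
    pattern: dict   # attribute values that characterise the error pattern
    problem: str    # expert's explanation of the underlying problem
    solution: str   # expert's recommended recovery action


# Hypothetical knowledge base; in practice experts would annotate mined rules.
KNOWLEDGE_BASE = [
    KnowledgeEntry(
        {"site": "SITE_A", "error": "STAGEOUT_FAILED"},
        "Storage element at SITE_A rejects writes",
        "Notify the site administrators and redirect output elsewhere",
    ),
    KnowledgeEntry(
        {"error": "STAGEOUT_FAILED"},
        "Output could not be copied to storage",
        "Check storage availability and retry the stage-out",
    ),
    KnowledgeEntry(
        {"error": "WALLTIME_EXCEEDED"},
        "Job ran longer than the queue limit",
        "Resubmit to a queue with a longer wall-time limit",
    ),
]


def diagnose(job, knowledge_base):
    """Return the entries whose pattern matches the job, most specific first."""
    matches = [
        entry for entry in knowledge_base
        if all(job.get(key) == value for key, value in entry.pattern.items())
    ]
    # A pattern with more attributes is treated as more specific.
    return sorted(matches, key=lambda entry: len(entry.pattern), reverse=True)


if __name__ == "__main__":
    failed_job = {"site": "SITE_A", "queue": "long", "error": "STAGEOUT_FAILED"}
    for entry in diagnose(failed_job, KNOWLEDGE_BASE):
        print(f"{entry.problem}: {entry.solution}")
```

Ranking matches by the number of attributes in the pattern is one simple design choice: a site-specific diagnosis is offered before a generic one for the same error.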
URL for further information
http://twiki.cern.ch/twiki/bin/view/ArdaGrid/AutomaticFaultDetection
Keywords
Grid job monitoring, fault detection, association rule mining, expert system
Conclusions and Future Work
The proposed design combines machine-generated knowledge with human knowledge in an expert system that detects problematic Grid elements and reacts accordingly. To evaluate the design, we implemented a prototype and tested it with ATLAS production shifters to verify the correctness of the error pattern-solution pairs.
In the future, we plan to derive SAM (Service Availability Monitoring) tests from very frequent error patterns so that problems can be prevented in advance.
Impact
The first part of our proposal, mining the monitoring data, finds error patterns automatically and quickly, which helps trace errors back to their origin. This can significantly decrease the time needed to detect a problematic Grid element.
The second part adds an expert system on top of the discovered rules. It provides a collection of pairs, each consisting of an error pattern and its recovery procedure, to less experienced people in charge of error recovery in a Grid environment. This contributes to the reliability of the Grid.