2–6 Mar 2009
Le Ciminiere, Catania, Sicily, Italy
Europe/Rome timezone

Design of an Expert System for Enhancing Grid Fault Detection based on Grid Monitoring Data

2 Mar 2009, 17:30
20m
Machiavelli (40) (Le Ciminiere, Catania, Sicily, Italy)


Viale Africa 95100 Catania
Oral
Planned or on-going scientific work using the grid
Grid Research

Speaker

Ms Gerhild Maier (Johannes Kepler Universität Linz)

Description

Grid computing operates in a complex, large-scale, heterogeneous, and distributed environment. Combining different Grid infrastructures, middleware implementations, and job submission tools into one reliable production system is a challenging task. Because an absolutely fail-safe system is impracticable, strong error reporting and handling are crucial in Grid computing.

Detailed analysis

Various monitoring systems are in place (R-GMA, IC-RTM, MonALISA), some of which can also deliver the error codes of failed Grid jobs. However, an error code does not always denote the actual source of the error, so a more sophisticated methodology is required to locate problematic Grid elements. We propose to mine Grid monitoring data using association rules. This approach yields additional knowledge about the behavior of Grid elements by taking into account correlations and dependencies between the characteristics of failed Grid jobs.
Beyond detection, errors must also be interpreted and understood before the underlying problems can be solved. This crucial task is currently carried out by the experienced users and administrators involved in everyday Grid operations.
In our contribution we present the design of an expert system that combines the error patterns found by mining the monitoring data with the experts' knowledge about the underlying problem and its solution.
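
The mining step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the job records, the attribute names (`site`, `queue`, `status`), and the thresholds are hypothetical, and the search is a brute-force enumeration of small attribute combinations rather than an optimized algorithm such as Apriori. It derives rules of the form "attribute combination implies failed job" and scores each rule by support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical monitoring records; attribute names and values are
# illustrative only and do not come from any real monitoring system.
JOBS = [
    {"site": "SITE_A", "queue": "long", "status": "failed"},
    {"site": "SITE_A", "queue": "short", "status": "failed"},
    {"site": "SITE_A", "queue": "long", "status": "failed"},
    {"site": "SITE_B", "queue": "long", "status": "done"},
    {"site": "SITE_B", "queue": "short", "status": "done"},
    {"site": "SITE_B", "queue": "long", "status": "failed"},
]


def mine_failure_rules(jobs, min_support=0.3, min_confidence=0.8, max_len=2):
    """Return rules (itemset, support, confidence) for 'itemset => failed'.

    support    = fraction of all jobs that match the itemset and failed
    confidence = fraction of jobs matching the itemset that failed
    """
    total = len(jobs)
    antecedent_counts = Counter()  # jobs matching the itemset
    joint_counts = Counter()       # jobs matching the itemset that failed

    for job in jobs:
        # Every attribute except the outcome is a candidate rule item.
        items = sorted((k, v) for k, v in job.items() if k != "status")
        failed = job["status"] == "failed"
        for size in range(1, max_len + 1):
            for itemset in combinations(items, size):
                antecedent_counts[itemset] += 1
                if failed:
                    joint_counts[itemset] += 1

    rules = []
    for itemset, joint in joint_counts.items():
        support = joint / total
        confidence = joint / antecedent_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, support, confidence))

    # Strongest rules first: by confidence, then support, then itemset.
    rules.sort(key=lambda rule: (-rule[2], -rule[1], rule[0]))
    return rules


if __name__ == "__main__":
    for itemset, support, confidence in mine_failure_rules(JOBS):
        conditions = " AND ".join(f"{k}={v}" for k, v in itemset)
        print(f"{conditions} => failed "
              f"(support={support:.2f}, confidence={confidence:.2f})")
```

In this toy data set, a rule that points at a single site with high confidence suggests that the site itself, rather than the reported error code, is the element to inspect first.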

URL for further information

http://twiki.cern.ch/twiki/bin/view/ArdaGrid/AutomaticFaultDetection

Keywords

Grid job monitoring, fault detection, association rule mining, expert system

Conclusions and Future Work

The proposed design combines machine-generated knowledge with human knowledge in an expert system that detects problematic Grid elements and reacts accordingly. To evaluate the design, a prototype was implemented and tested with ATLAS production shifters to verify the correctness of the pairs of error pattern and solution.
In the future we plan to derive SAM (Service Availability Monitoring) tests from very frequent error patterns so that problems can be prevented in advance.

Impact

The first part of our proposal, mining the monitoring data, finds error patterns automatically and quickly, which helps to trace errors back to their origin. As a result, the time needed to detect a problematic Grid element can be decreased significantly.
The second part adds an expert system on top of the rules found and provides a collection of pairs, each consisting of an error pattern and its recovery, to less experienced people in charge of error recovery in a Grid environment. Together, the two parts contribute to the reliability of the Grid.
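
The collection of pairs described above can be pictured as a small knowledge base that is queried with the attributes of a failed job. The following is only a sketch under stated assumptions: the `KnowledgeEntry` structure, the `advise` function, and every pattern, problem, and recovery text are hypothetical placeholders, not content from the prototype.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeEntry:
    """One expert-annotated pair: an error pattern and its recovery."""
    pattern: dict   # attribute -> required value (hypothetical attributes)
    problem: str    # expert's explanation of the underlying problem
    solution: str   # expert's suggested recovery


# Hypothetical entries; real ones would be written by Grid experts
# for patterns found by the mining step.
KNOWLEDGE_BASE = [
    KnowledgeEntry({"site": "SITE_A"},
                   "placeholder: site-wide problem",
                   "placeholder: notify the site administrators"),
    KnowledgeEntry({"site": "SITE_A", "error_code": "E1"},
                   "placeholder: specific failure mode at the site",
                   "placeholder: apply the recovery for this failure mode"),
    KnowledgeEntry({"site": "SITE_B"},
                   "placeholder: another site-wide problem",
                   "placeholder: resubmit the jobs elsewhere"),
]


def advise(job, knowledge_base):
    """Return all entries whose pattern matches the job, most specific first."""
    matches = [
        entry for entry in knowledge_base
        if all(job.get(key) == value for key, value in entry.pattern.items())
    ]
    # A pattern with more conditions is considered more specific.
    matches.sort(key=lambda entry: len(entry.pattern), reverse=True)
    return matches


if __name__ == "__main__":
    failed_job = {"site": "SITE_A", "error_code": "E1", "queue": "long"}
    for entry in advise(failed_job, KNOWLEDGE_BASE):
        print(entry.pattern, "->", entry.solution)
```

Ranking the most specific pattern first is one possible design choice; it lets a less experienced operator see the narrowest applicable recovery before the more general ones.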

Author

Ms Gerhild Maier (Johannes Kepler Universität Linz)

Co-authors

Dr Benjamin Gaidioz (CERN)
Prof. Dieter Kranzlmüller (Ludwig-Maximilians-Universität München)
