21-25 September 2009
Hotel Barcelo Sants
Europe/Zurich timezone

Enhancing Grid Fault Detection and Recovery with an Expert System

Not scheduled

Barcelona
Poster

Speaker

Ms Gerhild Maier (CERN)

Special requirements other than the set up mentioned in the CfA text.

none.

Abstract

Error handling is a crucial task in an infrastructure as complex as a grid. Existing monitoring tools report faulty grid behavior and the error codes of failed grid jobs, but an error code does not necessarily indicate the actual source of an error. A more sophisticated methodology is proposed to locate grid problems and offer solutions. The system, called QAOES (Quick Analysis of Error Sources), operates in two phases. First, problematic grid components are detected automatically using association rule mining, a data mining method that takes dependencies between the characteristics of failed grid jobs into account. Second, expert knowledge about each problem and its solution is collected and transformed into generic rules. Based on these rules, QAOES presents a list of current problems and suggested solutions in a web interface. This reduces the time needed to detect and solve grid problems and improves the overall reliability of the grid.
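
The two phases can be illustrated with a small sketch. This is not the QAOES implementation, which the abstract does not describe in detail. It is a toy example under stated assumptions: the job records, the attribute names (`site`, `ce`), the thresholds, and the expert advice text are all hypothetical. Phase one counts how often attribute combinations co-occur with failed jobs and keeps those with sufficient support and confidence; phase two matches the mined combinations against hand-written expert rules.

```python
from collections import Counter
from itertools import combinations

# Hypothetical job records; attribute names and values are invented for illustration.
JOBS = (
    [{"site": "SITE_A", "ce": "ce-a1", "status": "failed"}] * 2
    + [{"site": "SITE_A", "ce": "ce-a2", "status": "failed"}]
    + [{"site": "SITE_A", "ce": "ce-a2", "status": "done"}]
    + [{"site": "SITE_B", "ce": "ce-b1", "status": "done"}] * 2
    + [{"site": "SITE_B", "ce": "ce-b1", "status": "failed"}]
    + [{"site": "SITE_C", "ce": "ce-c1", "status": "done"}] * 3
)


def mine_failure_rules(jobs, min_support=0.2, min_confidence=0.8, max_size=2):
    """Phase 1: find attribute sets X for which the rule X => failed is strong.

    support    = fraction of all jobs that contain X and failed
    confidence = fraction of jobs containing X that failed
    """
    total, failed = Counter(), Counter()
    for job in jobs:
        items = sorted((k, v) for k, v in job.items() if k != "status")
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                total[itemset] += 1
                if job["status"] == "failed":
                    failed[itemset] += 1

    rules = []
    for itemset, n_failed in failed.items():
        support = n_failed / len(jobs)
        confidence = n_failed / total[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, support, confidence))
    # Strongest rules first; shorter itemsets win ties.
    rules.sort(key=lambda r: (-r[2], -r[1], len(r[0]), r[0]))
    return rules


# Phase 2: hypothetical expert rules, each a condition plus a suggested solution.
EXPERT_RULES = [
    (frozenset({("ce", "ce-a1")}),
     "Hypothetical advice: check the batch system behind ce-a1."),
]


def suggest_solutions(rules, expert_rules):
    """Attach each expert suggestion to the first mined rule that satisfies it."""
    report, seen = [], set()
    for itemset, _support, _confidence in rules:
        for condition, advice in expert_rules:
            if condition <= set(itemset) and advice not in seen:
                seen.add(advice)
                report.append((itemset, advice))
    return report


if __name__ == "__main__":
    mined = mine_failure_rules(JOBS)
    for itemset, advice in suggest_solutions(mined, EXPERT_RULES):
        print(dict(itemset), "->", advice)
```

In this toy data, `SITE_A` as a whole fails in three of four jobs, which is below the confidence threshold, while the computing element `ce-a1` fails in every job. The mined rule therefore points to the specific component instead of the whole site, which is the kind of dependency between job characteristics that a single error code would not reveal.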

Project(s) or EGEE activity presenting the demo or poster (project or activity names only)

WLCG and EGEE/NA4-HEP

Primary author

Ms Gerhild Maier (CERN)

Co-authors

Mr Daniel van der Ster (CERN)
Prof. Dieter Kranzlmüller (Ludwig-Maximilians-Universität München)
