Error handling is a crucial task in an infrastructure as complex as a grid. Monitoring tools report faulty grid behavior and the error codes of failed grid jobs, but an error code does not necessarily indicate the actual source of an error. A more sophisticated methodology is proposed to locate grid problems and offer solutions. The system, called QAOES (Quick Analysis of Error Sources), operates in two phases. First, problematic grid components are detected automatically by association rule mining, a data mining method that takes dependencies between the characteristics of failed grid jobs into account. Second, expert knowledge about each problem and its solution is collected and transformed into generic rules. Based on these rules, QAOES presents a list of current problems and suggested solutions in a web interface. This reduces the time needed to detect and solve grid problems and improves the overall reliability of the grid.
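The first phase can be illustrated with a minimal sketch of association rule mining over job records. This is not the QAOES implementation: the job attributes (`site`, `ce`, `vo`), the sample records, and the thresholds are hypothetical and chosen only to show how rules of the form "attribute combination implies job failure" are ranked by support and confidence.

```python
from collections import Counter
from itertools import combinations


def mine_failure_rules(jobs, min_support=0.2, min_confidence=0.8, max_items=2):
    """Find attribute combinations that are strongly associated with failure.

    Each rule has the form: itemset -> job failed.
    support    = fraction of all jobs that contain the itemset and failed
    confidence = fraction of jobs containing the itemset that failed
    """
    total = len(jobs)
    seen = Counter()    # jobs containing the itemset
    failed = Counter()  # jobs containing the itemset that also failed

    for job in jobs:
        # Every (attribute, value) pair except the outcome flag is an item.
        items = sorted((k, v) for k, v in job.items() if k != "failed")
        for size in range(1, max_items + 1):
            for itemset in combinations(items, size):
                seen[itemset] += 1
                if job["failed"]:
                    failed[itemset] += 1

    rules = []
    for itemset, n_failed in failed.items():
        support = n_failed / total
        confidence = n_failed / seen[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, support, confidence))

    # Highest confidence first, then highest support, then shortest itemset.
    rules.sort(key=lambda r: (-r[2], -r[1], len(r[0])))
    return rules


# Hypothetical job records; real records would come from grid monitoring data.
jobs = [
    {"site": "SiteA", "ce": "ce01", "vo": "atlas", "failed": True},
    {"site": "SiteA", "ce": "ce01", "vo": "cms", "failed": True},
    {"site": "SiteA", "ce": "ce02", "vo": "atlas", "failed": False},
    {"site": "SiteB", "ce": "ce03", "vo": "atlas", "failed": False},
    {"site": "SiteB", "ce": "ce03", "vo": "cms", "failed": True},
    {"site": "SiteB", "ce": "ce03", "vo": "cms", "failed": False},
]

rules = mine_failure_rules(jobs, min_support=0.3, min_confidence=0.8)
for itemset, support, confidence in rules:
    print(itemset, round(support, 2), round(confidence, 2))
```

In this toy data set the site `SiteA` alone is not flagged, because one of its three jobs succeeded, whereas the computing element `ce01` is, because every job that ran on it failed. Taking such dependencies between job characteristics into account is what allows the problematic component to be narrowed down.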
Project(s) or EGEE activity presenting the demo or poster (project or activity names only)
WLCG and EGEE/NA4-HEP
Special requirements other than the set up mentioned in the CfA text.