Description
Provide a set of generic keywords that define your contribution (e.g. Data Management, Workflows, High Energy Physics)
Fault detection, Statistics
1. Short overview
Both Grid middleware services and applications face failures, and the more widely deployed they are, the higher the price of not detecting failures early (lost jobs, wasted resources, etc.). Automated detection, diagnosis, and ultimately management of software and hardware problems define autonomic dependability. This work reports on a generic mechanism for the autonomic detection of EGEE failures that involve abrupt changes in the behaviour of quantities of interest, and on some of its applications.
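To make the notion of detecting an abrupt change concrete, here is a minimal sketch in Python. This overview does not name the statistical test used, so the one-sided CUSUM detector below is a stand-in chosen purely for illustration, not the method of this work. The function name `cusum_alarm`, its parameters (`target_mean`, `slack`, `threshold`) and the sample failure rates are all hypothetical.

```python
def cusum_alarm(samples, target_mean, slack, threshold):
    """Return the index of the first sample at which a one-sided CUSUM
    statistic exceeds `threshold`, or None if it never does.

    Illustrative stand-in only: the actual test is not specified here.
    """
    s = 0.0
    for i, x in enumerate(samples):
        # Accumulate deviations above target_mean + slack; reset at zero
        # so that normal fluctuations do not build up over time.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# Hypothetical per-interval job failure rates for one site (invented data):
# stable around 5%, then an abrupt jump.
rates = [0.05, 0.04, 0.06, 0.05, 0.05, 0.60, 0.70, 0.65, 0.80]
alarm_index = cusum_alarm(rates, target_mean=0.05, slack=0.05, threshold=1.0)
print(alarm_index)
```

With these invented values the alarm is raised at the second interval after the jump. The slack and threshold trade detection delay against false alarms and would need calibration on real data.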
4. Conclusions / Future plans
The implementation of the statistics per se is fairly straightforward. The code for running the test on archived data, including both the extraction of the quantities of interest and the test itself, will be released through the Grid Observatory, in order to demonstrate the performance and scalability levels required for the production environment. Full integration into gLite raises the usual technical issues, and appropriate tools (for triggering alarms, etc.) remain to be developed.
3. Impact
Fast and reliable detection of failures can both raise alarms that prompt operator intervention and trigger automatic reactions, e.g. avoiding job submission to blackhole sites. The proposed method is quite general and can be applied at various points in the middleware, including the site level, or by end-user software. Nonetheless, the gLite Logging and Bookkeeping (LB) service, which concentrates information on job processing, would be the most effective target. The approach of influencing job scheduling with LB-computed statistics has been used before. Experimental validation and comparison are thus desirable, and for this a significant dataset of “challenge examples” should be available. However, examples tagged by system administrators are rare. The Job Provenance service (an archive of LB data and more) provides the required information in two respects: easy access to filtered LB data, and valuable information for calibrating and evaluating failure detection methods against known and well-understood past events.