CHEP 2009

Name: CHEP 2009
Start: 2009-03-21T08:00:00+01:00
End: 2009-03-27T13:30:00+01:00
Location: Prague

Europe/Prague timezone

Support

chep2009@particle.cz

LQCD Workflow Execution Framework: Models, Provenance, and Fault-Tolerance

23 Mar 2009, 08:00

Prague

Prague Congress Centre 5. května 65, 140 00 Prague 4, Czech Republic

Board: Monday 090

Poster | Distributed Processing and Analysis | Poster session

Luciano Piccoli (Fermilab)

Large computing clusters used for scientific processing suffer from systemic failures when operated over long continuous periods for executing workflows. Diagnosing job problems and faults that lead to eventual failures in this complex environment is difficult, especially when the success of a whole workflow might be affected by a single job failure. In this paper, we introduce a model-based, hierarchical, reliable execution framework that encompasses workflow specification, data provenance, execution tracking, and online monitoring of each workflow task, also referred to as a participant. The sequence of participants is described in an abstract parameterized view, which is translated into a concrete, data-dependency-based sequence of participants with defined arguments. As participants belonging to a workflow are mapped onto machines and executed, periodic and on-demand monitoring of vital health parameters on allocated nodes is enabled according to pre-specified rules. These rules specify conditions that must be true before, during, and after execution. Monitoring information for each participant is propagated upwards through the reflex and healing architecture, which consists of a hierarchical network of decentralized fault management entities called reflex engines. They are instantiated as state machines or timed automata that change state and initiate reflexive mitigation actions upon the occurrence of certain faults. We describe how this cluster reliability framework is combined with the workflow execution framework using formal rules and actions specified within a structure of first-order predicate logic, which enables a dynamic management design that reduces manual administrative workload and increases cluster productivity. Preliminary results from a virtual setup with injected failures are shown.
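
To make the idea of a reflex engine more concrete, the following is a minimal sketch of how phase-specific rules and upward propagation could fit together. It is not the authors' implementation: the class names (`Rule`, `ReflexEngine`), the two states, the health keys (`disk_free_gb`, `cpu_temp_c`), and the thresholds are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical health readings for a node, e.g. {"disk_free_gb": 12.0}.
Health = Dict[str, float]


@dataclass
class Rule:
    """A named condition that must hold in one phase of a participant."""
    name: str
    phase: str  # "pre", "during", or "post"
    holds: Callable[[Health], bool]


@dataclass
class ReflexEngine:
    """Toy two-state machine: NOMINAL -> FAULT when a rule is violated."""
    name: str
    rules: List[Rule]
    parent: Optional["ReflexEngine"] = None
    state: str = "NOMINAL"
    log: List[str] = field(default_factory=list)

    def observe(self, phase: str, health: Health) -> List[str]:
        """Check the rules for one phase; return the names of violated rules."""
        violated = [r.name for r in self.rules
                    if r.phase == phase and not r.holds(health)]
        if violated:
            self.state = "FAULT"
            self.mitigate(violated)
        return violated

    def mitigate(self, violated: List[str]) -> None:
        """Record a local reflex action, then propagate the event upwards."""
        self.log.append(f"{self.name}: mitigate {violated}")
        if self.parent is not None:
            self.parent.report(self.name, violated)

    def report(self, child: str, violated: List[str]) -> None:
        """Receive a fault report from an engine lower in the hierarchy."""
        self.log.append(f"{self.name}: report from {child}: {violated}")


# Hypothetical two-level hierarchy: one node-level engine under a cluster-level one.
cluster = ReflexEngine("cluster", rules=[])
node = ReflexEngine("node01", parent=cluster, rules=[
    Rule("enough_disk", "pre", lambda h: h["disk_free_gb"] >= 10.0),
    Rule("cpu_temp_ok", "during", lambda h: h["cpu_temp_c"] < 85.0),
])

ok = node.observe("pre", {"disk_free_gb": 50.0, "cpu_temp_c": 40.0})
bad = node.observe("during", {"disk_free_gb": 50.0, "cpu_temp_c": 91.0})
```

In this sketch the pre-execution check passes, while the during-execution check violates `cpu_temp_ok`, moves `node01` to `FAULT`, and leaves a report in the parent's log. Timed automata, real mitigation actions, and rules expressed in predicate logic are deliberately left out.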

Presentation type (oral | poster): oral

Authors: Abhishek Dubey (Vanderbilt University), James Kowalkowsky (Fermilab), James Simone (Fermilab), Luciano Piccoli (Fermilab)

Poster

CHEP09-Poster.pdf

CHEP09-Poster.ppt
