September 27, 2004 to October 1, 2004
Interlaken, Switzerland
Europe/Zurich timezone

Fault Tolerance and Fault Adaption for High Performance Large Scale Embedded Systems

Sep 29, 2004, 2:40 PM
Jungfrau (Interlaken, Switzerland)



Oral presentation, Track 1 - Online Computing




The BTeV experiment, a proton/antiproton collider experiment at the Fermi National Accelerator Laboratory, will have a trigger that performs complex computations (to reconstruct vertices, for example) on every collision, in contrast to the more traditional approach of employing a first-level hardware-based trigger. This trigger requires large-scale, fault-adaptive embedded software: with thousands of processors performing event filtering in the trigger farm, fault conditions must be handled properly. Without fault mitigation, the trigger system could fail often enough to have an unacceptable impact on BTeV's physics goals. The RTES (Real Time Embedded Systems) collaboration is a group of physicists, engineers, and computer scientists working to address the problem of reliability in large-scale clusters with real-time constraints such as this one. The resulting infrastructure must be highly scalable, verifiable, extensible by users, and dynamically changeable. An initial prototype has been built to test design ideas and methods for the final system, and a larger, more ambitious prototype is currently under construction. I will discuss the lessons learned from these prototypes as well as the overall design and deliverables for the BTeV experiment.


