Coding approaches to fault tolerance in dynamic systems
Author(s)Hadjicostis, Christoforos N. (Christoforos Nikos)
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
George C. Verghese.
MetadataShow full item record
A fault-tolerant system tolerates internal failures while preserving desirable overall behavior. Fault tolerance is necessary in life-critical or inaccessible applications, and also enables the design of reliable systems out of unreliable, less expensive components. This thesis discusses fault tolerance in dynamic systems, such as finite-state controllers or computer simulations, whose internal state influences their future behavior. Modular redundancy (system replication) and other traditional techniques for fault tolerance are expensive, and rely heavily -- particularly in the case of dynamic systems operating over extended time horizons -- on the assumption that the error-correcting mechanism (e.g., voting) is fault-free. The thesis develops a systematic methodology for adding structured redundancy to a dynamic system and introducing associated fault tolerance. Our approach exposes a wide range of possibilities between no redundancy and full replication. Assuming that the error-correcting mechanism is fault-free, we parameterize the different possibilities in various settings, including algebraic machines, linear dynamic systems and Petri nets. By adopting specific error models and, in some cases, by making explicit connections with hardware implementation, we demonstrate how the redundant systems can be designed to allow detection/correction of a fixed number of failures. We do not explicitly address optimization criteria that could be used in choosing among different redundant implementations, but our examples illustrate how such criteria can be investigated in future work. The last part of the thesis relaxes the traditional assumption that error-correction be fa.ult-free. We use unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve in time with low probability of failure. Our approach generalizes modular redundancy by using distributed voting schemes. Combining these techniques with low-complexity error-correcting coding, we are able to efficiently protect identical unreliable linear finite-state machines that operate in parallel on distinct input sequences. The approach requires only a constant amount of redundant hardware per machine to achieve a probability of failure that remains below any pre-specified bound over any given finite time interval.
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 189-196).
DepartmentMassachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.