CAPRI : a common architecture for distributed probabilistic Internet fault diagnosis

Lee, George J. (George Janbing), 1979-

Alternative title

Common architecture for distributed probabilistic Internet fault diagnosis

Other Contributors

Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.

Advisor

David D. Clark.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract

This thesis presents a new approach to root cause localization and fault diagnosis in the Internet based on a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI) in which distributed, heterogeneous diagnostic agents efficiently conduct diagnostic tests and communicate observations, beliefs, and knowledge to probabilistically infer the cause of network failures. Unlike previous systems that can only diagnose a limited set of network component failures using a limited set of diagnostic tests, CAPRI provides a common, extensible architecture for distributed diagnosis that allows experts to improve the system by adding new diagnostic tests and new dependency knowledge. To support distributed diagnosis using new tests and knowledge, CAPRI must overcome several challenges including the extensible representation and communication of diagnostic information, the description of diagnostic agent capabilities, and efficient distributed inference. Furthermore, the architecture must scale to support diagnosis of a large number of failures using many diagnostic agents.

To address these challenges, this thesis presents a probabilistic approach to diagnosis based on an extensible, distributed component ontology to support the definition of new classes of components and diagnostic tests; a service description language for describing new diagnostic capabilities in terms of their inputs and outputs; and a message processing procedure for dynamically incorporating new information from other agents, selecting diagnostic actions, and inferring a diagnosis using Bayesian inference and belief propagation. To demonstrate the ability of CAPRI to support distributed diagnosis of real-world failures, I implemented and deployed a prototype network of agents on PlanetLab for diagnosing HTTP connection failures. Approximately 10,000 user agents and 40 distributed regional and specialist agents on PlanetLab collect information from over 10,000 users and diagnose over 140,000 failures using a wide range of active and passive tests, including DNS lookup tests, connectivity probes, Rockettrace measurements, and user connection histories.

I show how to improve accuracy and cost by learning new dependency knowledge and introducing new diagnostic agents. I also show that agents can manage the cost of diagnosing many similar failures by aggregating related requests and caching observations and beliefs.
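
As an illustration of the kind of Bayesian inference the abstract refers to, the following is a minimal sketch and is not taken from the thesis. It assumes a single failed HTTP connection with three hypothetical candidate causes and two hypothetical tests whose results are conditionally independent given the cause. All names and probabilities are invented for illustration; CAPRI's actual ontology, message formats, and belief propagation procedure are not reproduced here.

```python
from typing import Dict

# Hypothetical prior probability of each candidate cause of an HTTP failure.
PRIOR: Dict[str, float] = {
    "dns_failure": 0.2,
    "server_down": 0.3,
    "local_link_down": 0.5,
}

# Hypothetical likelihoods P(test passes | cause) for two diagnostic tests.
P_PASS: Dict[str, Dict[str, float]] = {
    "dns_lookup": {
        "dns_failure": 0.05,
        "server_down": 0.95,
        "local_link_down": 0.10,
    },
    "connectivity_probe": {
        "dns_failure": 0.90,
        "server_down": 0.10,
        "local_link_down": 0.05,
    },
}


def posterior(observations: Dict[str, bool]) -> Dict[str, float]:
    """Return P(cause | observations) using Bayes' rule.

    Assumes exactly one cause is present and that test results are
    conditionally independent given the cause (a naive Bayes model).
    """
    scores: Dict[str, float] = {}
    for cause, prior in PRIOR.items():
        p = prior
        for test, passed in observations.items():
            p_pass = P_PASS[test][cause]
            # Multiply in the likelihood of the observed outcome.
            p *= p_pass if passed else 1.0 - p_pass
        scores[cause] = p
    total = sum(scores.values())
    # Normalize so the posterior sums to one.
    return {cause: score / total for cause, score in scores.items()}


if __name__ == "__main__":
    # The DNS lookup succeeded but the connectivity probe failed.
    beliefs = posterior({"dns_lookup": True, "connectivity_probe": False})
    for cause, belief in sorted(beliefs.items(), key=lambda kv: -kv[1]):
        print(f"{cause}: {belief:.3f}")
```

With these invented numbers, a successful DNS lookup combined with a failed connectivity probe makes server_down the most probable cause. An agent that cached the resulting beliefs could reuse them for later requests about the same destination, in the spirit of the caching the abstract mentions.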

Description

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.

This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.

Includes bibliographical references (p. 215-222).

Date issued

2007

URI

http://hdl.handle.net/1721.1/40316

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Doctoral Theses