CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis

Lee, George J.

dc.contributor.advisor	David Clark
dc.contributor.author	Lee, George J.
dc.contributor.other	Advanced Network Architecture
dc.date.accessioned	2007-06-05T14:21:56Z
dc.date.available	2007-06-05T14:21:56Z
dc.date.issued	2007-06-04
dc.identifier.other	MIT-CSAIL-TR-2007-031
dc.identifier.uri	http://hdl.handle.net/1721.1/37595
dc.description	PhD thesis
dc.description.abstract	This thesis presents a new approach to root cause localization and fault diagnosis in the Internet based on a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI) in which distributed, heterogeneous diagnostic agents efficiently conduct diagnostic tests and communicate observations, beliefs, and knowledge to probabilistically infer the cause of network failures. Unlike previous systems that can only diagnose a limited set of network component failures using a limited set of diagnostic tests, CAPRI provides a common, extensible architecture for distributed diagnosis that allows experts to improve the system by adding new diagnostic tests and new dependency knowledge. To support distributed diagnosis using new tests and knowledge, CAPRI must overcome several challenges including the extensible representation and communication of diagnostic information, the description of diagnostic agent capabilities, and efficient distributed inference. Furthermore, the architecture must scale to support diagnosis of a large number of failures using many diagnostic agents. To address these challenges, this thesis presents a probabilistic approach to diagnosis based on an extensible, distributed component ontology to support the definition of new classes of components and diagnostic tests; a service description language for describing new diagnostic capabilities in terms of their inputs and outputs; and a message processing procedure for dynamically incorporating new information from other agents, selecting diagnostic actions, and inferring a diagnosis using Bayesian inference and belief propagation. To demonstrate the ability of CAPRI to support distributed diagnosis of real-world failures, I implemented and deployed a prototype network of agents on Planetlab for diagnosing HTTP connection failures.
Approximately 10,000 user agents and 40 distributed regional and specialist agents on Planetlab collect information from over 10,000 users and diagnose over 140,000 failures using a wide range of active and passive tests, including DNS lookup tests, connectivity probes, Rockettrace measurements, and user connection histories. I show how to improve accuracy and cost by learning new dependency knowledge and introducing new diagnostic agents. I also show that agents can manage the cost of diagnosing many similar failures by aggregating related requests and caching observations and beliefs.
dc.format.extent	222 p.
dc.relation.ispartofseries	Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
dc.title	CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis

Files in this item

Name: MIT-CSAIL-TR-2007-031.ps
Size: 29.25Mb
Format: Postscript

Name: MIT-CSAIL-TR-2007-031.pdf
Size: 1.212Mb
Format: PDF

This item appears in the following Collection(s)

CSAIL Technical Reports (July 1, 2003 - present)
