
dc.contributor.advisor  Hadfield-Menell, Dylan
dc.contributor.author  Casper, Stephen
dc.date.accessioned  2024-03-15T19:22:36Z
dc.date.available  2024-03-15T19:22:36Z
dc.date.issued  2024-02
dc.date.submitted  2024-02-21T17:10:02.907Z
dc.identifier.uri  https://hdl.handle.net/1721.1/153769
dc.description.abstract  The most common way to evaluate AI systems is by analyzing their performance on a test set. However, test sets can fail to identify some problems (such as out-of-distribution failures) and can actively reinforce others (such as dataset biases). Identifying problems like these requires techniques that are not simply based on passing a dataset through a black-box model. In practice, this challenge lies at the confluence of two fields: interpreting and attacking deep neural networks. Both fields help to improve oversight of AI. However, existing techniques are often not competitive for practical debugging in real-world applications. This thesis is dedicated to identifying and addressing gaps between research and practice. I focus on evaluating diagnostic tools based on how useful they are for identifying problems with networks under realistic assumptions. Specifically, this thesis introduces a benchmark for these tools based on their usefulness for identifying trojans: specific bugs that are deliberately implanted into networks.

I present the following thesis:
1. Trojan discovery is a practical benchmarking task for diagnostic tools, applicable to both dataset-based and dataset-free techniques.
2. State-of-the-art feature attribution methods often perform poorly relative to an edge detector at discovering trojans, even under permissive conditions with access to data containing trojan triggers.
3. Feature synthesis methods, particularly ones that leverage the latent representations of models, can be used more effectively for diagnostics in dataset-free contexts.

Chapter 1 adopts an engineer's perspective on techniques for studying AI systems. It reviews motivations for building a versatile toolbox of model-diagnostic tools, which hinge on these tools' unique ability to help humans understand models without being limited to some readily accessible dataset. Chapter 2 surveys the literature on interpretable AI, adversarial attacks, feature attribution, feature synthesis methods, and evaluation methods for these tools. It also reviews connections between research on interpretability tools, adversarial examples, continual learning, modularity, network compression, and biological brains.

Chapter 3 presents a benchmark for diagnostic tools based on helping humans discover trojans. This can be done either (a) under permissive assumptions that allow access to data containing the trojan triggers or (b) under stringent assumptions where no such access is available.

Chapter 4 demonstrates the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution tools, which reveals two of their shortcomings. First, because these tools can only explain model decisions on specific examples, they are not equipped to help diagnose bugs without data that trigger them. Second, even under idealized conditions where examples containing a trojan trigger are available, most feature attribution methods consistently fail to identify the triggers better than an edge detector.

Chapter 5 focuses on dataset-free feature synthesis methods. It introduces two novel techniques for studying networks with feature-level adversarial attacks; both use model latents to produce interpretable adversarial attacks. Compared with other state-of-the-art feature synthesis tools, these techniques are the most useful for trojan discovery. However, there remains room for improvement on this benchmark: no technique helps humans identify trojans in more than 50% of 8-option multiple-choice questions.

Finally, Chapter 6 analyzes gaps between research and practical applications. It argues that a lack of clear and consistent criteria for assessing the real-world competitiveness of techniques has hampered progress. I conclude by discussing directions for future work, emphasizing benchmarking, interdisciplinarity, and building a dynamic AI interpretability toolbox.
dc.publisher  Massachusetts Institute of Technology
dc.rights  In Copyright - Educational Use Permitted
dc.rights  Copyright retained by author(s)
dc.rights.uri  https://rightsstatements.org/page/InC-EDU/1.0/
dc.title  Practical Diagnostic Tools for Deep Neural Networks
dc.type  Thesis
dc.description.degree  S.M.
dc.contributor.department  Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree  Master
thesis.degree.name  Master of Science in Electrical Engineering and Computer Science

