Adversarial Examples and Distribution Shift: A Representations Perspective
Author(s)
Nadhamuni, Kaveri
Advisor
Madry, Aleksander
Abstract
Adversarial attacks cause machine learning models to produce incorrect predictions by minimally perturbing their input. In this thesis, we take a step towards understanding how these perturbations affect the model's intermediate data representations. Specifically, we compare standard and adversarial representations for models of varying robustness using a variety of similarity metrics. We find that it is possible to detect adversarial examples by examining nearby examples, though this method can be circumvented by an adaptive attack. We then explore methods for improving generalization to natural distribution shift and hypothesize that models trained with different notions of feature bias learn fundamentally different representations. We find that combining such diverse representations can provide a more comprehensive view of the input data, potentially allowing better generalization to novel domains. Finally, we find that representation similarity metrics can be used to predict how well a model will transfer between tasks.
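The abstract does not name the specific similarity metrics used to compare standard and adversarial representations. As a hedged illustration only, the sketch below computes linear CKA (centered kernel alignment), one commonly used representation similarity measure, between layer activations on clean inputs and on perturbed inputs; the data and variable names are hypothetical and not drawn from the thesis.

```python
# Illustrative sketch (not from the thesis): linear CKA, one common metric for
# comparing two sets of intermediate representations, e.g. activations of the
# same layer on clean vs. adversarially perturbed inputs.
import numpy as np


def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices.

    X: (n_examples, d1) activations on clean inputs (hypothetical).
    Y: (n_examples, d2) activations on perturbed inputs (hypothetical).
    Returns a value in [0, 1]; higher means more similar representations.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based formulation for the linear kernel.
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator


# Toy usage: near-identical representations score close to 1, unrelated ones close to 0.
rng = np.random.default_rng(0)
clean = rng.normal(size=(512, 64))
perturbed = clean + 0.05 * rng.normal(size=(512, 64))  # small perturbation, purely illustrative
print(linear_cka(clean, perturbed))                    # close to 1.0
print(linear_cka(clean, rng.normal(size=(512, 64))))   # close to 0.0
```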
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology