Adversarial robustness without perturbations
Author(s)
Rodríguez Muñoz, Adrián
Advisor
Torralba, Antonio
Abstract
Models resistant to adversarial perturbations are stable around the neighbourhoods of input images, such that small changes, known as adversarial attacks, cannot dramatically change the prediction. Currently, this stability is obtained with Adversarial Training, which directly teaches models to be robust by training on the perturbed examples themselves. In this work, we show that surprisingly similar performance can be obtained by instead regularizing the model's input-gradients on unperturbed examples only. Regularizing the input-gradient norm is commonly believed to be significantly worse than Adversarial Training. Our experiments show that the performance of Gradient Norm regularization critically depends on the smoothness of the model's activation functions, and that it is in fact highly performant on modern vision transformers, which natively use smooth GeLUs rather than piecewise-linear ReLUs. On ImageNet-1K, Gradient Norm regularization achieves more than 90% of the performance of state-of-the-art Adversarial Training with PGD-3 (52% vs. 56%) in 60% of the training time and without complex inner maximization. Further experiments shed light on additional properties relating model robustness to the input-gradients of unperturbed images, such as asymmetric color statistics. Surprisingly, we also show that significant adversarial robustness can be obtained by simply conditioning gradients to focus on image edges, without explicit regularization of the norm.
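For illustration, the following is a minimal sketch of input-gradient norm regularization as described above, assuming a PyTorch-style classifier. The function name gradient_norm_loss, the weight lam, and the L2 norm with mean reduction are illustrative assumptions, not the thesis's exact formulation or hyperparameters.

import torch
import torch.nn.functional as F

def gradient_norm_loss(model, x, y, lam=1.0):
    # Illustrative sketch: cross-entropy plus a penalty on the norm of the
    # input-gradient, computed on unperturbed images only (no PGD inner loop).
    x = x.clone().requires_grad_(True)   # track gradients w.r.t. the input pixels
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Gradient of the loss w.r.t. the input; create_graph=True keeps the
    # computation differentiable so the penalty itself can be backpropagated.
    (grad_x,) = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.flatten(1).norm(p=2, dim=1).mean()

    return ce + lam * penalty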
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology