Understanding the Robustness of Vision Models and Humans to Occlusion-Based Corruptions
Author(s)
Lu, David
Advisor
Katz, Boris
Abstract
Humans are excellent object recognizers. Not only can they identify fully visible objects, but they can also recognize objects that are partially blocked from view (i.e., occluded). Moreover, vision models have made substantial progress in object recognition over the past decade. However, their proficiency in identifying occluded objects has not been thoroughly investigated. In this work, we analyze the robustness of models and humans to occlusions by building artificial occlusion transforms that mask out parts of images. We design occlusion transforms to model a diverse range of occlusion scenarios, varying two key factors: (1) the percentage of the image that is occluded, and (2) the granularity of the occlusion pattern, from large chunks to fine-grained pepper noise. We then evaluate the performance of humans and models on these occluded images. Our experiments yield several key findings. Intriguingly, pretrained models exhibit a U-shaped accuracy curve, with medium-granularity occlusions posing the greatest challenge. This pattern closely aligns with the one observed in our human experiments, which is particularly surprising, considering the substantial disparities between human visual systems and machine-based perception. Additionally, we explore whether performance losses caused by occlusions can be mitigated through two approaches: finetuning using occluded images and inpainting occluded pixels before classification. We discover that finetuning leads to a considerable increase in accuracy, but we suspect that finetuned models are relying on a different set of features. Inpainting helps significantly for mid- and high-frequency occlusions, but has the disadvantage of misleading both models and humans at low frequencies. Lastly, we introduce a new adversarial occlusion task, and propose two attack methods based on differential evolution and Grad-CAM. We find that occluding fewer than 10% of pixels is enough to fool vision classifiers. This demonstrates that adversarial attacks can be executed by eliminating image content rather than introducing perturbations. Complementing our analysis of a variety of state-of-the-art models, we offer our occlusion benchmark as a resource for researchers to evaluate the performance of future models intended for real-world deployment.
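To illustrate the kind of occlusion transform described in the abstract, the sketch below masks out a target fraction of an image using square blocks of a configurable size, where large blocks approximate coarse "chunk" occlusions and 1-pixel blocks approximate fine-grained pepper noise. This is a minimal illustrative sketch; the function and parameter names (occlude, fraction, block_size) are hypothetical and not taken from the thesis.

# Hypothetical occlusion transform: mask ~`fraction` of the image area using
# square blocks of side `block_size`. Coarse blocks model "large chunk"
# occlusions; block_size=1 models fine-grained pepper noise.
import numpy as np

def occlude(image, fraction, block_size, fill_value=0.0, rng=None):
    """Return a copy of `image` (H x W x C) with roughly `fraction` of its area masked."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    grid_h, grid_w = h // block_size, w // block_size
    n_blocks = grid_h * grid_w
    n_masked = int(round(fraction * n_blocks))
    # Choose which grid cells to mask, uniformly at random.
    masked = rng.choice(n_blocks, size=n_masked, replace=False)
    out = image.copy()
    for idx in masked:
        r, c = divmod(int(idx), grid_w)
        out[r * block_size:(r + 1) * block_size,
            c * block_size:(c + 1) * block_size] = fill_value
    return out

# Example: occlude 30% of a 224x224 RGB image, once with coarse 32x32 blocks
# and once with 1-pixel "pepper" occlusion at the same coverage.
img = np.random.rand(224, 224, 3)
coarse = occlude(img, fraction=0.3, block_size=32)
pepper = occlude(img, fraction=0.3, block_size=1)

Varying block_size while holding the masked fraction fixed separates the two factors studied in the thesis: the percentage of the image occluded and the granularity of the occlusion pattern.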
Date issued
2023-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology