Towards Deployable Robust Text Classifiers
Author(s)
Xu, Lei
Advisor
Veeramachaneni, Kalyan
Abstract
Text classification has been studied for decades as a fundamental task in natural language processing. Deploying classifiers enables more efficient information processing, which is useful for various applications, including decision-making. However, classifiers also present challenging and long-standing problems. As their use increases, expectations about their level of robustness, fairness, accuracy, and other metrics increase in turn.
In this dissertation, we aim to develop more deployable and robust text classifiers, focusing on improving classifier robustness against adversarial attacks by developing both attack and defense approaches. Adversarial attacks are a security concern for text classifiers: a malicious user takes a sentence and perturbs it slightly to manipulate the classifier's output. To design more effective attack methods, we focus first on improving adversarial sentence quality. Unlike existing methods that prioritize misclassification while ignoring sentence similarity and fluency, we synthesize these three criteria into a combined critique score. We then introduce a rewrite-and-rollback framework for optimizing this score, achieving state-of-the-art attack success rates while improving similarity and fluency. We focus second on computational requirements. Existing methods typically use combinatorial search to find adversarial examples that alter multiple words, a process that is inefficient and requires many queries to the classifier. We overcome this problem by proposing a single-word adversarial perturbation attack, which only needs to replace a single word in the original sentence with a high-adversarial-capacity word, significantly improving efficiency while keeping the attack success rate similar to that of existing methods.
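To make the combined-criterion idea concrete, the sketch below shows one plausible way to score a candidate adversarial sentence by jointly rewarding misclassification, similarity to the original, and fluency. The additive weighting, the component scores, and all names are illustrative assumptions, not the exact formulation used in the dissertation.

```python
# Minimal sketch of a combined critique score, assuming the three component
# scores are computed elsewhere. Weights and the combination rule are
# illustrative assumptions only.

def critique_score(
    misclassification_margin: float,  # e.g. P(wrong label) - P(true label) from the victim classifier
    similarity: float,                # e.g. cosine similarity of original vs. perturbed sentence embeddings
    fluency: float,                   # e.g. a normalized language-model score in [0, 1]
    w_sim: float = 1.0,
    w_flu: float = 1.0,
) -> float:
    """Higher is better for the attacker: flip the classifier while staying
    similar to the original sentence and remaining fluent."""
    return misclassification_margin + w_sim * similarity + w_flu * fluency


# Example: rank two candidate adversarial sentences by their critique score.
candidates = [
    {"text": "the movie was not good", "margin": 0.3, "sim": 0.9, "flu": 0.8},
    {"text": "the movie was garbage",  "margin": 0.6, "sim": 0.7, "flu": 0.9},
]
best = max(candidates, key=lambda c: critique_score(c["margin"], c["sim"], c["flu"]))
print(best["text"])
```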
We then turn to defense. Currently, the most common approach to defending against attacks is to train classifiers with adversarial examples as data augmentation, a method limited by the inefficiency of many attack methods. We show that training classifiers with data augmentation generated by our efficient single-word perturbation attack improves their robustness against other attack methods. We also design in situ data augmentation to counteract adversarial perturbations in the classifier input: we use the gradient norm to identify the keywords the classifier relies on, then use a pre-trained language model to replace them. This in situ augmentation effectively improves robustness and does not require tuning the classifier.
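The following sketch illustrates the general recipe of gradient-norm keyword identification followed by masked-language-model replacement, assuming a BERT-style classifier and masked language model from the Hugging Face transformers library. The model names, the top-1 keyword choice, and the single-best replacement are assumptions for illustration and not necessarily the dissertation's exact procedure.

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
)

# Hypothetical model choices; any classifier / MLM pair sharing a vocabulary works.
clf_name = "textattack/bert-base-uncased-SST-2"
mlm_name = "bert-base-uncased"

tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)
mlm = AutoModelForMaskedLM.from_pretrained(mlm_name)

sentence = "the movie was surprisingly good"
enc = tok(sentence, return_tensors="pt")

# Step 1: rank tokens by the L2 norm of the loss gradient w.r.t. their input embeddings.
embeds = clf.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
out = clf(inputs_embeds=embeds, attention_mask=enc["attention_mask"],
          labels=torch.tensor([1]))               # assume label 1 = "positive"
out.loss.backward()
grad_norm = embeds.grad.norm(dim=-1).squeeze(0)   # one norm per token
grad_norm[0] = grad_norm[-1] = 0.0                # ignore [CLS] and [SEP]
keyword_pos = int(grad_norm.argmax())

# Step 2: mask the highest-scoring keyword and replace it with the MLM's top prediction.
masked_ids = enc["input_ids"].clone()
masked_ids[0, keyword_pos] = tok.mask_token_id
with torch.no_grad():
    logits = mlm(input_ids=masked_ids, attention_mask=enc["attention_mask"]).logits
masked_ids[0, keyword_pos] = int(logits[0, keyword_pos].argmax())

print(tok.decode(masked_ids[0], skip_special_tokens=True))
```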
Finally, we explore the vulnerability of a very recent text classification architecture, prompt-based classifiers, and find that they are vulnerable to attacks as well. We also develop a library called Fibber to facilitate adversarial robustness research.
Date issued
2023-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology