Prompt Injection Generation Using Small Language Models with Reinforcement Learning with Artificial Intelligence Feedback
Author(s)
Gupta, Aneesh
Advisor
Gupta, Amar
Abstract
Large language models (LLMs) have become an integral part of many fields, from customer support automation to research assistants. Despite their growing adoption, they face significant challenges, particularly around safety in sensitive contexts. Existing methods such as Reinforcement Learning with Human Feedback (RLHF) and keyword filtering have improved the robustness of these models, but they are resource-intensive, and the models remain vulnerable to malicious attacks such as prompt injection and jailbreaking. One notable limitation in testing defenses against such attacks is the scarcity of appropriate datasets. This thesis investigates the use of small language models (SLMs) to generate goal hijacking messages, a subset of prompt injection messages. Techniques such as LoRA fine-tuning and full fine-tuning of even smaller models are employed for this short-form text generation task. We also introduce a fine-tuned SLM enhanced with Reinforcement Learning with Artificial Intelligence Feedback (RLAIF), which removes reliance on slow human feedback by using faster AI-generated feedback instead. By optimizing the reference model and reward functions, we improve alignment with ground truth prompt injection messages while addressing issues such as mode collapse and overfitting. These findings show promise, but further research is necessary to determine how well the approach generalizes to other domains and performs in real-world scenarios. Future work is likely to focus on multilingual datasets and distributed computation to further extend the applicability and efficiency of the method.
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology