Prompt Injection Generation Using Small Language Models with Reinforcement Learning with Artificial Intelligence Feedback

Author(s)

Gupta, Aneesh

Advisor

Gupta, Amar

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Large language models (LLMs) have become integral to many applications, from customer support automation to research assistants. Despite their growing adoption, they face significant challenges, particularly around safety in sensitive contexts. Existing methods such as Reinforcement Learning with Human Feedback (RLHF) and keyword filtering have improved the robustness of these models, but they are resource-intensive, and the models remain vulnerable to malicious attacks such as prompt injection and jailbreaking. One notable limitation in testing defenses against such attacks is the scarcity of appropriate datasets. This thesis investigates the use of small language models (SLMs) to generate goal hijacking messages, a subset of prompt injection messages. Techniques such as LoRA fine-tuning and full fine-tuning of even smaller models are applied to this short-form text generation task. We also introduce a fine-tuned SLM enhanced with Reinforcement Learning with Artificial Intelligence Feedback (RLAIF), which replaces slow human feedback with faster AI-generated feedback. By optimizing the reference model and reward functions, we improve alignment with ground-truth prompt injection messages while addressing issues such as mode collapse and overfitting. These findings show promise, but further research is necessary to determine how well the approach generalizes to other domains and performs in real-world scenarios. Future work is likely to focus on multilingual datasets and distributed computation to extend the applicability and efficiency of the method.
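
As a rough illustration of the contrast the abstract draws between LoRA fine-tuning and full fine-tuning, the sketch below shows the commonly used low-rank update W + (alpha / r) * B A in plain Python. It is not taken from the thesis: the matrix sizes, the rank `r`, the scaling `alpha`, and the helper names `matmul`, `lora_effective_weight`, and `trainable_params` are all hypothetical choices made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)            # A has shape (r, d_in)
    delta = matmul(b, a)  # B has shape (d_out, r), so delta is (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Count trainable parameters for full fine-tuning versus a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}


# Hypothetical toy sizes, chosen only to keep the example readable.
d_out, d_in, r, alpha = 4, 6, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]  # trainable, zero-initialised so the adapter starts as a no-op

w_eff = lora_effective_weight(w, a, b, alpha)
counts = trainable_params(d_out, d_in, r)
```

With `B` initialised to zero, the effective weight equals the frozen base weight before any training step. The saving in trainable parameters is small at these toy sizes and grows as the layer dimensions increase relative to the rank.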

Date issued

2025-02

URI

https://hdl.handle.net/1721.1/159142

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses