Adversarial Prompt Transformation for Systematic Jailbreaks of LLMs
Author(s)
Awoufack, Kevin E.
Advisor
Kagal, Lalana
Abstract
The rapid integration of Large Language Models (LLMs) like OpenAI's GPT series into diverse sectors has significantly enhanced digital interactions but also introduced new security challenges, notably the risk of "jailbreaking," where inputs cause models to deviate from their operational guidelines. This vulnerability poses risks such as the spread of misinformation and privacy breaches, highlighting the need for robust security measures. Traditional red-teaming methods, which rely on manually crafted prompts to probe model vulnerabilities, are labor-intensive and do not scale. This thesis proposes a novel automated approach that uses Reinforcement Learning from Human Feedback (RLHF) to transform unsuccessful adversarial prompts into successful jailbreaks. The approach learns a policy, grounded in a prompt's relation to existing jailbreak prompts, that informs the generator LLM of what makes an adversarial prompt successful. It was implemented with Proximal Policy Optimization (PPO) and evaluated with both a classifier and a judge reward model, attaining at best a 16% attack success rate against a target model. The method can be applied to any prompt at the word level, and the resulting transformations can be further analyzed for toxicity characteristics. This work contributes to advancing LLM security measures, supporting their safer deployment across applications.
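To make the described pipeline concrete, the sketch below illustrates one way such a PPO loop could be wired up, assuming Hugging Face TRL's PPOTrainer interface (API as of TRL ≤0.11). The thesis does not specify its implementation details, so the model name ("gpt2"), the keyword-based judge heuristic, and all hyperparameters here are illustrative placeholders, not the author's actual configuration.

```python
# Illustrative sketch only: a PPO loop that rewrites a failed adversarial
# prompt and rewards transformations a (placeholder) judge deems successful.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5,
                   batch_size=1, mini_batch_size=1)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Generator policy that rewrites unsuccessful prompts, plus a frozen reference copy.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

def judge_reward(target_response: str) -> torch.Tensor:
    """Placeholder judge: rewards non-refusals. The thesis instead uses a
    trained classifier or an LLM judge to score jailbreak success."""
    refused = any(p in target_response.lower()
                  for p in ("i can't", "i cannot", "sorry"))
    return torch.tensor(0.0 if refused else 1.0)

failed_prompt = "Example unsuccessful adversarial prompt goes here."
query_tensor = tokenizer(failed_prompt, return_tensors="pt").input_ids[0]

for _ in range(4):  # a few PPO steps, purely for illustration
    # Policy proposes a transformed prompt.
    response_tensor = ppo_trainer.generate([query_tensor], return_prompt=False,
                                           max_new_tokens=32)[0]
    transformed = tokenizer.decode(response_tensor, skip_special_tokens=True)

    # In the real pipeline the transformed prompt would be sent to the target
    # LLM; here the transformed text stands in for the target's reply.
    target_reply = transformed
    reward = judge_reward(target_reply)

    # PPO update toward transformations the judge scores as successful.
    ppo_trainer.step([query_tensor], [response_tensor], [reward])
```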
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology