Adversarial Prompt Transformation for Systematic Jailbreaks of LLMs
Author(s)
Awoufack, Kevin E.
Advisor
Kagal, Lalana
Abstract
The rapid integration of Large Language Models (LLMs) like OpenAI's GPT series into diverse sectors has significantly enhanced digital interactions but also introduced new security challenges, notably the risk of "jailbreaking," where inputs cause models to deviate from their operational guidelines. This vulnerability poses risks such as the spread of misinformation and privacy breaches, highlighting the need for robust security measures. Traditional red-teaming methods, which rely on manually crafted prompts to probe model vulnerabilities, are labor-intensive and do not scale. This thesis proposes a novel automated approach that uses Reinforcement Learning from Human Feedback (RLHF) to transform unsuccessful adversarial prompts into successful jailbreaks. The approach learns a policy, grounded in a prompt's relation to existing jailbreak prompts, that informs the generator LLM of what makes an adversarial prompt successful. It was implemented with Proximal Policy Optimization (PPO) and evaluated with both a classifier and a judge reward model, attaining at best a 16% attack success rate against a target model. The method can be applied to any prompt at the word level, and the resulting transformations can be further analyzed for toxicity characteristics. This work contributes to advancing LLM security measures, supporting their safer deployment across applications.
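To make the described pipeline concrete, the sketch below illustrates one way such a PPO loop could be wired up, assuming Hugging Face TRL's PPOTrainer interface (API as of TRL ≤0.11). The thesis does not specify its implementation details, so the model name ("gpt2"), the keyword-based judge heuristic, and all hyperparameters here are illustrative placeholders, not the author's actual configuration.

```python
# Illustrative sketch only: a PPO loop that rewrites a failed adversarial
# prompt and rewards transformations a (placeholder) judge deems successful.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5,
                   batch_size=1, mini_batch_size=1)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Generator policy that rewrites unsuccessful prompts, plus a frozen reference copy.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

def judge_reward(target_response: str) -> torch.Tensor:
    """Placeholder judge: rewards non-refusals. The thesis instead uses a
    trained classifier or an LLM judge to score jailbreak success."""
    refused = any(p in target_response.lower()
                  for p in ("i can't", "i cannot", "sorry"))
    return torch.tensor(0.0 if refused else 1.0)

failed_prompt = "Example unsuccessful adversarial prompt goes here."
query_tensor = tokenizer(failed_prompt, return_tensors="pt").input_ids[0]

for _ in range(4):  # a few PPO steps, purely for illustration
    # Policy proposes a transformed prompt.
    response_tensor = ppo_trainer.generate([query_tensor], return_prompt=False,
                                           max_new_tokens=32)[0]
    transformed = tokenizer.decode(response_tensor, skip_special_tokens=True)

    # In the real pipeline the transformed prompt would be sent to the target
    # LLM; here the transformed text stands in for the target's reply.
    target_reply = transformed
    reward = judge_reward(target_reply)

    # PPO update toward transformations the judge scores as successful.
    ppo_trainer.step([query_tensor], [response_tensor], [reward])
```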
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology