MIT Libraries DSpace@MIT

Adversarial Prompt Transformation for Systematic Jailbreaks of LLMs

Author(s)
Awoufack, Kevin E.
Advisor
Kagal, Lalana
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
The rapid integration of Large Language Models (LLMs) such as OpenAI’s GPT series into diverse sectors has significantly enhanced digital interactions but has also introduced new security challenges, notably the risk of "jailbreaking," in which inputs cause models to deviate from their operational guidelines. This vulnerability poses risks such as the spread of misinformation and privacy breaches, highlighting the need for robust security measures. Traditional red-teaming methods, which rely on manually crafted prompts to test model vulnerabilities, are labor-intensive and do not scale. This thesis proposes a novel automated approach that uses Reinforcement Learning from Human Feedback (RLHF) to transform unsuccessful adversarial prompts into successful jailbreaks. The approach learns a policy, based on relation to existing jailbreak prompts, that informs the generator LLM of what makes an adversarial prompt successful. It was implemented using Proximal Policy Optimization (PPO) and tested with both a classifier reward model and a judge reward model, attaining at best a 16% attack success rate on a target model. The method can be applied to any prompt at the word level and further analyzed with respect to characteristics of toxicity. This work contributes to advancing LLM security measures and supports their safer deployment across various applications.
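For readers unfamiliar with the metric, attack success rate is conventionally the fraction of attempted prompts that an evaluator marks as having elicited a policy-violating response. The short sketch below assumes that conventional definition; the abstract does not state the thesis's exact evaluation protocol. The function name and the sample verdicts are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the fraction of attempts the evaluator marked as successful.

    Assumption: each verdict is a binary success/failure label produced by
    some evaluator (for example, a classifier or a judge model).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


# Hypothetical data for illustration only: 4 of 25 attempts marked successful.
example_verdicts = [True] * 4 + [False] * 21
print(f"{attack_success_rate(example_verdicts):.0%}")  # prints "16%"
```

Under this definition, a 16% rate means roughly one attempt in six succeeded against the target model.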
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/157167
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
