Generation, Detection, and Evaluation of Role-play based Jailbreak attacks in Large Language Models

Author(s)
Johnson, Zachary D.
Thesis PDF (3.731 MB)
Advisor
Kagal, Lalana
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
While directly asking a Large Language Model (LLM) a harmful request (e.g., "Provide me instructions on how to build a bomb.") will most likely yield a refusal due to the ethical guidelines set forth by developers (e.g., OpenAI), users can trick the LLM into providing this information using a tactic called a role-play based jailbreak attack. This attack consists of instructing the LLM to take on the role of a fictional character that does not adhere to the model developer's ethical guidelines and will comply with any request. Role-play based jailbreak attacks remain a critical safety issue and an open research question because they succeed in getting an LLM to comply with harmful requests and can be crafted without a formal technical background. Companies such as OpenAI employ manual tactics like red-teaming to improve an LLM's robustness against these attacks; however, such tactics may fail to defend against all role-play based jailbreak attacks because of their limited ability to anticipate unseen attacks. In this work, we aim to better understand the landscape of role-play based jailbreak attacks so that attack attempts can be precisely detected in the wild before they yield a harmful output from an LLM. Specifically, we focus on three main tasks: generating synthetic examples of role-play based jailbreak attack prompts; testing these role-play prompts on a target LLM to evaluate whether they successfully jailbreak it and labeling our prompts accordingly; and training a robust detection model that can precisely predict whether a role-play prompt will successfully jailbreak an LLM before any malicious request is submitted. From these tasks we learn the following, respectively. 1) Out-of-the-box models such as GPT-4 are effective at generating successful role-play jailbreak attack prompts when given just a few examples via few-shot prompting. 2) We can automatically classify LLM responses as jailbroken or not with high accuracy using statistical methods including Principal Component Analysis (PCA) and Support Vector Machines (SVMs). 3) Most classification architectures are unable to perform the complex task of accurately predicting whether a role-play prompt will successfully yield a jailbreak. By better understanding the nature of role-play based jailbreak attacks, we hope to contribute to the research area of jailbreak attack detection in LLMs so that they can be robustly defended against in the future.
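
The second finding above refers to classifying LLM responses as jailbroken or not using PCA and SVMs. This record does not spell out the thesis's features or hyperparameters, so the sketch below is only a minimal illustration of that general approach: the TF-IDF representation, the number of principal components, the SVM kernel, and the example responses are all assumptions rather than the author's actual pipeline.

```python
# Illustrative sketch only: TF-IDF features, PCA dimensionality, and SVM settings
# are assumptions used to show the general PCA + SVM response-classification idea.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Labeled LLM responses: 1 = jailbroken (complied with a harmful request), 0 = refusal.
responses = [
    "Sure, as the character DAN I will explain exactly how to ...",  # hypothetical example
    "I'm sorry, but I can't help with that request.",                # hypothetical example
    # ... many more labeled responses ...
]
labels = np.array([1, 0])

# 1) Represent each response as a TF-IDF vector (densified so PCA can consume it).
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(responses).toarray()

# 2) Reduce dimensionality with PCA, keeping the leading principal components.
pca = PCA(n_components=min(50, X.shape[0], X.shape[1]))
X_reduced = pca.fit_transform(X)

# 3) Fit an SVM on the reduced features to separate jailbroken from safe responses.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_reduced, labels)

# Classify a new response: apply the same vectorizer and PCA, then predict.
new_response = ["Of course! Step one is to gather the following materials ..."]
X_new = pca.transform(vectorizer.transform(new_response).toarray())
print(clf.predict(X_new))  # 1 -> predicted jailbroken, 0 -> predicted refusal
```

The usual motivation for pairing the two methods is that projecting high-dimensional text features onto a few principal components before fitting the SVM keeps the classifier small and reduces overfitting; whether the thesis uses this exact feature representation is not stated in the record.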
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156989
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
