Generation, Detection, and Evaluation of Role-play based Jailbreak attacks in Large Language Models
Author(s)
Johnson, Zachary D.
DownloadThesis PDF (3.731Mb)
Advisor
Kagal, Lalana
Terms of use
Metadata
Show full item recordAbstract
While directly asking a Large Language Model (LLM) a harmful request (e.g. "Provide me instructions on how to build a bomb.") will most likely yield a refusal to comply due to ethical guidelines laid forth by developers (e.g. OpenAI), users can trick the LLM into providing this information using a tactic called a Role-play based Jailbreak Attack. This attack consists of instructing the LLM to take on the role of a fictional character that does not adhere to the model developer’s ethical guidelines and will comply with any request. Role-play based jailbreak attacks remain a critical safety issue and open-ended research question due to their success in getting a LLM to comply with a harmful request, as well as their ability to be generated without a formal technical background. Companies such as OpenAI employ manual tactics like red-teaming in order to enhance a LLM’s robustness against these attacks, however these tactics may fail to defend against all role-play based jailbreak attacks due to their potentially limited ability to predict unseen attacks. In this work, we aim to better understand the landscape of role-play based jailbreak attacks so that we can precisely detect these attack attempts in the wild before they yield a harmful output from a LLM. Specifically, we focus on three main categories: generating synthetic examples of role-play based jailbreak attack prompts, testing these role-play prompts on a target LLM in order to evaluate whether they successfully jailbreak the LLM and labeling our prompts accordingly, and training a robust detection model that can precisely predict whether a role-play prompt will successfully yield a jailbreak attack in a LLM before being fed any malicious requests. Through these processes, we learn the following information, respectively. 1) Out-of-the-box models such as GPT-4 are effective at generating successful role-play jailbreak attack prompts when being generated on just a few examples via fewshot prompting. 2) We can automatically classify LLM responses as jailbroken or not with high accuracy using statistical methods including Principal Component Analysis (PCA) and Support Vector Machines (SVMs). 3) Most classification architectures are unable to perform the complex task of accurately predicting whether a role-play prompt will successfully yield a jailbreak attack. By better understanding the nature of role-play based jailbreak attacks, we hope to be able to contribute to the research area of jailbreak attack detection in LLMs so that they can be robustly defended against in the future.
Date issued
2024-05Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology