Investigating Model Editing for Unlearning in Large Language Models
Author(s)
Hossain, Shariqah
Advisor
Kagal, Lalana
Abstract
Data regulations on the Right to be Forgotten, such as the European Union's General Data Protection Regulation (GDPR), protect users' right to have their private information removed from organizations. With the increasing usage and influence of large language models (LLMs) trained on personal data, the question arises of how to implement such removal within these models. In addition, LLMs are trained on large corpora of data usually scraped from the Web. A current challenge in ensuring reliable and safe outputs from LLMs is that false, toxic, harmful, or biased information from Web data is captured in the knowledge of the model. Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for models with large numbers of parameters or fail to remove the entire scope of the information without harming performance on the knowledge that is to be retained. Model editing algorithms solve a similar problem of changing information in LLMs, but they focus on redirecting inputs to a new target rather than removing that information altogether. Despite the parallels between model editing and unlearning, there has yet to be a thorough investigation of the potential of model editing approaches in this setting. In this work, we explore the ROME, IKE, and WISE editing algorithms and design new editing targets for an unlearning setting. To evaluate the potential of the model editing algorithms, we focus on unlearning fictitious information using the Task of Fictitious Unlearning (TOFU) benchmark. Through this investigation, we show that, depending on the setting, model editing approaches can exceed the performance of current unlearning methods at removing information. They share with traditional unlearning the limitation of being unable to encapsulate the scope of what is to be unlearned without damaging overall model performance. We hope to leverage these findings to improve methods for unlearning model knowledge and thereby improve the reliability of LLMs.
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology