Interpreting and Editing Memory in Large Transformer Language Models
Author(s)
Meng, Kevin
Advisor
Andreas, Jacob D.
Abstract
This thesis investigates the mechanisms of factual recall in large language models. We first apply causal interventions to identify neuron activations that are decisive in a model’s factual predictions; surprisingly, we find that factual recall corresponds to a sparse, localizable computation in the MLP weights of the GPT models we study. Harnessing this insight, we then develop methods for efficiently and surgically inserting up to 10,000 new memories into a transformer; these methods perform well in terms of both generalization and specificity. We conclude with some directions for future work.
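The memory-insertion step described above can be pictured as a rank-one update to an MLP projection matrix that rewrites a single key-to-value association. The sketch below is an illustration only, not the thesis's implementation: the function name `rank_one_edit`, the identity-covariance default, and the tensor shapes are assumptions made for this example.

```python
import torch

def rank_one_edit(W, k_star, v_star, C=None):
    """Insert one (key -> value) memory into an MLP weight matrix via a
    rank-one update, in the spirit of the editing methods in the abstract.

    W      : (d_out, d_in) MLP down-projection weight
    k_star : (d_in,)  key vector representing the fact's subject
    v_star : (d_out,) value vector encoding the new association
    C      : (d_in, d_in) key second-moment matrix E[k k^T]; an identity
             matrix is used if omitted (an assumption for this sketch)
    """
    if C is None:
        C = torch.eye(W.shape[1], dtype=W.dtype)
    u = torch.linalg.solve(C, k_star)    # preconditioned direction C^{-1} k*
    resid = v_star - W @ k_star          # gap between desired and current output
    # After the update, W_new @ k_star == v_star, while keys orthogonal
    # to u are mapped exactly as before.
    return W + torch.outer(resid, u) / (u @ k_star)

# Toy usage: edit a random 8x8 "MLP layer" so that k_star maps to v_star.
W = torch.randn(8, 8)
k_star, v_star = torch.randn(8), torch.randn(8)
W_new = rank_one_edit(W, k_star, v_star)
assert torch.allclose(W_new @ k_star, v_star, atol=1e-5)
```

Scaling this idea from one memory to thousands, and doing so while preserving generalization and specificity, is the subject of the thesis itself; the sketch only conveys the shape of a single surgical edit.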
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology