dc.contributor.advisor | Andreas, Jacob D. | |
dc.contributor.author | Meng, Kevin | |
dc.date.accessioned | 2024-09-16T13:49:41Z | |
dc.date.available | 2024-09-16T13:49:41Z | |
dc.date.issued | 2024-05 | |
dc.date.submitted | 2024-07-11T14:36:44.224Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/156794 | |
dc.description.abstract | This thesis investigates the mechanisms of factual recall in large language models. We first apply causal interventions to identify the neuron activations that are decisive in a model’s factual predictions; surprisingly, we find that factual recall corresponds to a sparse, localizable computation in the MLP weights of the GPT models we study. Harnessing this insight, we then develop methods for efficiently and surgically inserting up to 10,000 new memories into a transformer; these methods perform well on measures of both generalization and specificity. We conclude with directions for future work. | |
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) | |
dc.rights | Copyright retained by author(s) | |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | |
dc.title | Interpreting and Editing Memory in Large Transformer Language Models | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |
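The abstract above describes locating factual recall by applying causal interventions to intermediate activations and measuring their effect on a model's factual predictions. As a rough illustration of that style of experiment (a minimal sketch, not the thesis's actual procedure), the code below zeroes the output of one GPT-2 MLP layer at a single token position and checks how the probability of a factual completion changes; the model, prompt, target token, layer index, and token position are all placeholder assumptions chosen for illustration.

    # Hypothetical sketch: ablate one MLP output in GPT-2 and measure the effect
    # on a factual prediction. The layer/position choices are illustrative only.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    tok = GPT2Tokenizer.from_pretrained("gpt2")

    prompt = "The Eiffel Tower is located in the city of"
    inputs = tok(prompt, return_tensors="pt")
    target_id = tok(" Paris")["input_ids"][0]  # " Paris" is a single BPE token

    def p_target(logits):
        # Probability assigned to the target token at the final position.
        return torch.softmax(logits[0, -1], dim=-1)[target_id].item()

    with torch.no_grad():
        baseline = p_target(model(**inputs).logits)

    layer, pos = 6, 3  # assumed layer and subject-token position
    def zero_mlp_output(module, inp, out):
        out[:, pos] = 0.0  # knock out the MLP contribution at one position
        return out

    handle = model.transformer.h[layer].mlp.register_forward_hook(zero_mlp_output)
    with torch.no_grad():
        ablated = p_target(model(**inputs).logits)
    handle.remove()

    print(f"p(' Paris'): baseline={baseline:.4f}, ablated={ablated:.4f}")

A large drop in the target probability under this kind of intervention would suggest that the ablated site carries information the model needs for the factual prediction, which is the intuition behind localizing recall to specific MLP computations.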