
Understanding Concept Representations and their Transformations in Transformer Models

Author(s)
Kearney, Matthew
Download
Thesis PDF (13.89 MB)
Advisor
Andreas, Jacob
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
As transformer language models are used in an ever wider range of applications, methods for understanding their internal reasoning processes become more critical. One category of such methods, called neuron labeling, identifies salient directions in a model’s internal representation space and asks what features of the input these directions represent and how those features evolve. While research using these methods has focused on finding labels and automating the labeling process, a prerequisite is identifying which directions are the salient ones in the model’s computation. Theoretical arguments suggest that the activations of the first layer of the multi-layer perceptrons (MLPs) in transformers form the salient basis for representing the information the model uses in computation, but no empirical studies have compared these internal representations to others used in prior work. This thesis addresses that gap by comparing several directions in the internal representation space of transformers in terms of how well they represent basic linguistic concepts we expect the model to use in computation. We find that the empirical evidence supports the theoretical arguments: the first layer of the MLP modules is the most representative basis for these concepts. We extend this exploration by examining the connections between MLP neurons and developing a method for determining which neurons have the potential to communicate information with one another. In the process we discover specialized neurons for erasing and preserving information in the model’s hidden state and characterize this phenomenon.
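The abstract’s central object, the first-layer MLP activations, is straightforward to extract in practice. Below is a minimal sketch, assuming the Hugging Face transformers library and the gpt2 checkpoint (both illustrative choices, not specified by the thesis), that records the post-nonlinearity activations of each MLP’s first layer via forward hooks; each coordinate of these vectors is one candidate neuron direction for labeling.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Illustrative model choice; the thesis abstract does not name a checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

mlp_acts = {}  # layer index -> (batch, seq_len, 4 * hidden) activations

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `module` is the MLP's activation function, so `output` is the
        # post-nonlinearity value of the MLP's first linear layer.
        mlp_acts[layer_idx] = output.detach()
    return hook

for i, block in enumerate(model.h):
    block.mlp.act.register_forward_hook(make_hook(i))

enc = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**enc)

# Each of the 4 * hidden coordinates is one candidate neuron direction.
print(mlp_acts[0].shape)  # e.g. torch.Size([1, 4, 3072]) for gpt2-small
```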
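The thesis’s method for finding neurons that can communicate is not spelled out in the abstract. One hedged illustration of the general idea: because consecutive MLPs read from and write to the shared residual stream, the inner product between one neuron’s output (“write”) weights and a later neuron’s input (“read”) weights gives a rough measure of potential direct communication, ignoring LayerNorm and attention-mediated mixing. The sketch below computes these scores for GPT-2; it is an assumption-laden stand-in, not the thesis’s actual procedure.

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")  # illustrative checkpoint

def communication_scores(model, layer):
    """Inner products between MLP write vectors at `layer` and
    MLP read vectors at `layer + 1`, via the residual stream."""
    # GPT-2's Conv1D stores weights as (in_features, out_features):
    # c_proj.weight is (4h, h), so row i is neuron i's write vector.
    w_write = model.h[layer].mlp.c_proj.weight
    # c_fc.weight is (h, 4h), so column j is neuron j's read vector.
    w_read = model.h[layer + 1].mlp.c_fc.weight
    # scores[i, j] is large in magnitude when neuron i of this layer
    # writes along a direction that neuron j of the next layer reads.
    # This sketch ignores LayerNorm scaling and attention mixing.
    return w_write @ w_read  # (4h, 4h)

scores = communication_scores(model, layer=0)
vals, idx = scores.abs().flatten().topk(5)
pairs = [(int(k) // scores.shape[1], int(k) % scores.shape[1]) for k in idx]
print(list(zip(pairs, vals.tolist())))  # top candidate neuron pairs
```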
Date issued
2023-06
URI
https://hdl.handle.net/1721.1/151276
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
