Simple item record

dc.contributor.advisor    Tegmark, Max
dc.contributor.author    Liao, Isaac C.
dc.date.accessioned    2024-09-16T13:49:06Z
dc.date.available    2024-09-16T13:49:06Z
dc.date.issued    2024-05
dc.date.submitted    2024-07-11T14:36:47.288Z
dc.identifier.uri    https://hdl.handle.net/1721.1/156787
dc.description.abstract    Mechanistic interpretability research aims to deconstruct the underlying algorithms that neural networks use to perform computations, so that we can modify their components and change their behavior in predictable and positive ways. This thesis details three novel methods for automating the interpretation of neural networks that are too large to interpret manually. First, we detect inherently multidimensional representations of data; we discover that large language models use circular representations to perform modular addition tasks. Second, we introduce methods to penalize complexity in neural circuitry; we discover the automatic emergence of interpretable properties such as sparsity, weight tying, and circuit duplication. Finally, we apply neural network symmetries to put networks into a simplified normal form for conversion into human-readable Python; we introduce a program synthesis benchmark for this task and successfully convert 32 of its 62 networks.
dc.publisher    Massachusetts Institute of Technology
dc.rights    Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights    Copyright retained by author(s)
dc.rights.uri    https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title    Automated Mechanistic Interpretability for Neural Networks
dc.type    Thesis
dc.description.degree    M.Eng.
dc.contributor.department    Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree    Master
thesis.degree.name    Master of Engineering in Electrical Engineering and Computer Science
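
The abstract above notes that language models use circular representations to perform modular addition. The following is a minimal illustrative sketch of that general idea only, not the thesis's actual construction; the modulus p and the helper names embed and add_mod are chosen here purely for the example. Residues mod p are placed on the unit circle, and modular addition then corresponds to composing rotations.

import numpy as np

p = 7  # assumed modulus for this illustration

def embed(k: int) -> np.ndarray:
    # Map residue k to a point on the unit circle (a "circular representation").
    angle = 2 * np.pi * k / p
    return np.array([np.cos(angle), np.sin(angle)])

def add_mod(a: int, b: int) -> int:
    # Add two residues by composing their rotations, then read the result off the angle.
    za = embed(a)
    zb = embed(b)
    # Angle addition via multiplication of the corresponding unit complex numbers.
    z = (za[0] + 1j * za[1]) * (zb[0] + 1j * zb[1])
    angle = np.angle(z) % (2 * np.pi)
    return int(round(angle * p / (2 * np.pi))) % p

# Sanity check: rotation-based addition agrees with ordinary modular arithmetic.
assert all(add_mod(a, b) == (a + b) % p for a in range(p) for b in range(p))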