Show simple item record

dc.contributor.advisor     Bertsimas, Dimitris J.
dc.contributor.author      Gurnee, Robert Wesley
dc.date.accessioned        2025-03-24T18:48:37Z
dc.date.available          2025-03-24T18:48:37Z
dc.date.issued             2025-02
dc.date.submitted          2025-01-28T00:52:39.368Z
dc.identifier.uri          https://hdl.handle.net/1721.1/158869
dc.description.abstract    The growing deployment of neural language models demands greater understanding of their internal mechanisms. The goal of this thesis is to make progress on understanding the latent computations within large language models (LLMs) to lay the groundwork for monitoring, controlling, and aligning future powerful AI systems. We explore four areas using open-source language models: concept encoding across neurons, universality of learned features and components across model initializations, presence of spatial and temporal representations, and basic dynamical systems modeling. In Chapter 2, we adapt optimal sparse classification methods to neural network probing, allowing us to study how concepts are represented across multiple neurons. This sparse probing technique reveals both monosemantic neurons (dedicated to single concepts) and polysemantic neurons (representing multiple concepts in superposition) in full-scale LLMs, confirming predictions from toy models. In Chapter 3, we identify and exhaustively catalog universal neurons across different model initializations by computing pairwise correlations of neuron activations over large datasets. Our findings show that 1-5% of neurons are universal, often with clear interpretations, and we taxonomize them into distinct neuron families. To investigate spatial and temporal representations, we analyze LLM activations on carefully curated datasets of real-world entities in Chapter 4. We discover that models learn linear representations of space and time across multiple scales, which are robust to prompting variations and unified across different entity types. We identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. In Chapter 5, we use optimal sparse regression techniques to improve the sparse identification of nonlinear dynamics (SINDy) framework, demonstrating improved sample efficiency and support recovery in canonical differential systems. We then leverage this improvement to study the ability of LLMs to in-context learn dynamical systems and find internal representations that track the underlying system state.
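
The abstract's Chapter 2 summary centers on sparse probing of neuron activations. The snippet below is a purely illustrative sketch of that general idea, not the thesis's optimal sparse classification formulation: it substitutes scikit-learn's L1-regularized logistic regression, and the activation matrix, concept labels, and variable names are synthetic stand-ins invented for this example.

    # Illustrative sparse probe on synthetic "neuron activations".
    # NOTE: a sketch of the general idea only; the thesis uses optimal sparse
    # classification methods, which the L1 penalty here merely approximates.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical stand-in data: one row of MLP-neuron activations per token,
    # plus a binary concept label (e.g. "this token is French") per token.
    n_tokens, d_mlp = 4000, 1024
    X = rng.standard_normal((n_tokens, d_mlp))
    true_neurons = [3, 77, 512]                       # planted concept neurons
    y = (X[:, true_neurons].sum(axis=1) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The L1 penalty drives most coefficients to exactly zero, so the surviving
    # nonzero weights identify a small set of neurons that encode the concept.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(X_train, y_train)

    selected = np.flatnonzero(probe.coef_[0])
    print(f"test accuracy: {probe.score(X_test, y_test):.3f}")
    print(f"selected neurons: {selected}")

In practice the rows of X would be activations cached from a real language model at chosen token positions, and the probe's sparsity level would be controlled directly (e.g. a fixed number of allowed nonzero coefficients) rather than through a regularization constant.
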
dc.publisher               Massachusetts Institute of Technology
dc.rights                  Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights                  Copyright retained by author(s)
dc.rights.uri              https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title                   Towards an Artificial Neuroscience: Analytics for Language Model Interpretability
dc.type                    Thesis
dc.description.degree      Ph.D.
dc.contributor.department  Massachusetts Institute of Technology. Operations Research Center
dc.contributor.department  Sloan School of Management
mit.thesis.degree          Doctoral
thesis.degree.name         Doctor of Philosophy

