DSpace@MIT (MIT Libraries)


Towards an Artificial Neuroscience: Analytics for Language Model Interpretability

Author(s)
Gurnee, Robert Wesley
Advisor
Bertsimas, Dimitris J.
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
The growing deployment of neural language models demands greater understanding of their internal mechanisms. The goal of this thesis is to make progress on understanding the latent computations within large language models (LLMs) to lay the groundwork for monitoring, controlling, and aligning future powerful AI systems. We explore four areas using open-source language models: concept encoding across neurons, universality of learned features and components across model initializations, presence of spatial and temporal representations, and basic dynamical systems modeling. In Chapter 2, we adapt optimal sparse classification methods to neural network probing, allowing us to study how concepts are represented across multiple neurons. This sparse probing technique reveals both monosemantic neurons (dedicated to single concepts) and polysemantic neurons (representing multiple concepts in superposition) in full-scale LLMs, confirming predictions from toy models. In Chapter 3, we identify and exhaustively catalog universal neurons across different model initializations by computing pairwise correlations of neuron activations over large datasets. Our findings show that 1-5% of neurons are universal, often with clear interpretations, and we taxonomize them into distinct neuron families. To investigate spatial and temporal representations, we analyze LLM activations on carefully curated datasets of real-world entities in Chapter 4. We discover that models learn linear representations of space and time across multiple scales, which are robust to prompting variations and unified across different entity types. We identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. In Chapter 5, we use optimal sparse regression techniques to improve the sparse identification of nonlinear dynamics (SINDy) framework, demonstrating improved sample efficiency and support recovery in canonical differential systems. We then leverage this improvement to study the ability of LLMs to in-context learn dynamical systems and find internal representations which track the underlying system state.
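As an illustration of the Chapter 3 idea of comparing neurons across model initializations by pairwise activation correlations, the following is a minimal sketch and not the thesis code. It assumes two activation matrices recorded on the same tokens from two hypothetical models; the function name `max_pairwise_correlation`, the synthetic data, and the 0.5 threshold are all assumptions made for the example.

```python
import numpy as np

def max_pairwise_correlation(acts_a, acts_b):
    """For each neuron in model A, find its best-correlated neuron in model B.

    acts_a: array of shape (n_tokens, n_neurons_a)
    acts_b: array of shape (n_tokens, n_neurons_b)
    Both must be recorded on the same tokens, in the same order.
    Assumes no neuron has constant activation (zero variance).
    """
    # Standardize each neuron so the dot product below is a Pearson correlation.
    za = (acts_a - acts_a.mean(axis=0)) / acts_a.std(axis=0)
    zb = (acts_b - acts_b.mean(axis=0)) / acts_b.std(axis=0)
    corr = za.T @ zb / acts_a.shape[0]  # shape (n_neurons_a, n_neurons_b)
    return corr.max(axis=1), corr.argmax(axis=1)

# Synthetic stand-in for real activations: one shared feature is planted in
# neuron 2 of model A and neuron 5 of model B; everything else is noise.
rng = np.random.default_rng(0)
n_tokens = 5000
shared = rng.normal(size=n_tokens)
acts_a = rng.normal(size=(n_tokens, 8))
acts_b = rng.normal(size=(n_tokens, 8))
acts_a[:, 2] = shared + 0.1 * rng.normal(size=n_tokens)
acts_b[:, 5] = shared + 0.1 * rng.normal(size=n_tokens)

best_corr, best_match = max_pairwise_correlation(acts_a, acts_b)

# Hypothetical cutoff: call a neuron "universal" if some neuron in the other
# model correlates with it above 0.5.
universal = np.flatnonzero(best_corr > 0.5)
print(universal, best_match[universal])
```

On real models the full correlation matrix can be too large to hold in memory, so a practical version would likely accumulate the required sums in batches over the dataset; that detail is omitted here.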
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/158869
Department
Massachusetts Institute of Technology. Operations Research Center; Sloan School of Management
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
