Towards an Artificial Neuroscience: Analytics for Language Model Interpretability
Author(s)
Gurnee, Robert Wesley
Advisor
Bertsimas, Dimitris J.
Abstract
The growing deployment of neural language models demands greater understanding of their internal mechanisms. The goal of this thesis is to make progress on understanding the latent computations within large language models (LLMs) to lay the groundwork for monitoring, controlling, and aligning future powerful AI systems. We explore four areas using open-source language models: concept encoding across neurons, universality of learned features and components across model initializations, the presence of spatial and temporal representations, and basic dynamical systems modeling.
In Chapter 2, we adapt optimal sparse classification methods to neural network probing, allowing us to study how concepts are represented across multiple neurons. This sparse probing technique reveals both monosemantic neurons (dedicated to a single concept) and polysemantic neurons (representing multiple concepts in superposition) in full-scale LLMs, confirming predictions from toy models. In Chapter 3, we identify and exhaustively catalog universal neurons across different model initializations by computing pairwise correlations of neuron activations over large datasets. Our findings show that 1-5% of neurons are universal, often with clear interpretations, and we taxonomize them into distinct neuron families.
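As an illustration of the Chapter 2 setup, the following is a minimal sparse-probing sketch, not the thesis code: an L1-penalized logistic probe stands in for the optimal sparse classification step, and the activations and concept labels are randomly generated placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))                    # stand-in for cached MLP activations
y = (X[:, 7] + 0.1 * rng.normal(size=1000) > 0).astype(int)  # toy binary concept label

# L1-penalized logistic probe: a common proxy for the optimal sparse
# classifiers used in the thesis; small C forces a sparse weight vector.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X, y)

support = np.flatnonzero(probe.coef_[0])  # neurons selected by the probe
print(len(support), "active neurons:", support[:10])
# A support of size 1 points to a monosemantic neuron; a larger support
# suggests the concept is distributed over several neurons.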
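The Chapter 3 universality measurement can be sketched in the same spirit. Assuming activations of a matched layer from two differently seeded models, recorded on the same token stream, one computes all pairwise Pearson correlations and flags a neuron as universal when its best match in the other model is strong; the synthetic arrays and the 0.5 cutoff below are illustrative.

import numpy as np

def max_cross_correlation(acts_a, acts_b):
    """acts_a, acts_b: (n_tokens, n_neurons) activations on identical inputs."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)          # full (n_a, n_b) Pearson correlation matrix
    return np.abs(corr).max(axis=1)  # best counterpart for each model-A neuron

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(10_000, 256))                  # stand-in for model A
acts_b = acts_a + 0.5 * rng.normal(size=(10_000, 256))   # toy "second seed"
universal = max_cross_correlation(acts_a, acts_b) > 0.5  # illustrative threshold
print(f"{universal.mean():.1%} of neurons exceed the correlation threshold")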
To investigate spatial and temporal representations, we analyze LLM activations on carefully curated datasets of real-world entities in Chapter 4. We discover that models learn linear representations of space and time across multiple scales, which are robust to prompting variations and unified across different entity types. We identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. In Chapter 5, we use optimal sparse regression techniques to improve the sparse identification of nonlinear dynamics (SINDy) framework, demonstrating improved sample efficiency and support recovery on canonical differential systems. We then leverage this improvement to study the ability of LLMs to in-context learn dynamical systems, and we find internal representations that track the underlying system state.
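The Chapter 4 claim of linear representations is the kind of thing a linear probe can test. The sketch below is illustrative rather than the thesis pipeline: a ridge probe is fit from synthetic stand-in activations to synthetic coordinate targets, and high held-out R^2 on real entity activations would be the evidence of interest.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1024))                   # stand-in for entity activations
w = rng.normal(size=(1024, 2))
coords = X @ w + 0.1 * rng.normal(size=(2000, 2))   # synthetic (lat, lon) targets

X_tr, X_te, y_tr, y_te = train_test_split(X, coords, random_state=0)
probe = Ridge(alpha=10.0).fit(X_tr, y_tr)           # linear probe onto coordinates
print("held-out R^2:", round(probe.score(X_te, y_te), 3))
# Probe weights concentrated on a few input dimensions would flag candidate
# "space neurons" carrying most of the coordinate signal.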
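For Chapter 5, the SINDy framework itself can be sketched with its classic sequentially thresholded least-squares solver, which is the sparse-regression step the thesis replaces with optimal methods. The toy below recovers dx/dt = -y, dy/dt = x from a noiseless circular trajectory; the library of candidate terms is an illustrative choice.

import numpy as np

t = np.linspace(0, 10, 2000)
x, y = np.cos(t), np.sin(t)                  # trajectory of the unit circle
dX = np.stack([-y, x], axis=1)               # true derivatives dx/dt, dy/dt
Theta = np.stack([np.ones_like(t), x, y, x * y, x**2, y**2], axis=1)  # term library

def stlsq(Theta, dX, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: zero small coefficients, refit."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        Xi[np.abs(Xi) < threshold] = 0.0
        for j in range(dX.shape[1]):
            big = np.abs(Xi[:, j]) >= threshold
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dX[:, j], rcond=None)[0]
    return Xi

print(np.round(stlsq(Theta, dX), 3))  # nonzero rows give the recovered ODE terms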
Date issued
2025-02
Department
Massachusetts Institute of Technology. Operations Research Center; Sloan School of Management
Publisher
Massachusetts Institute of Technology