AutoDiff: A Scalable Framework for Automated Model Comparison
Name
woo-kwoo1-meng-eecs-2025-thesis.pdf
Description
Thesis PDF
Size
1.49 MB
Format
Adobe PDF
Checksum (MD5)
d45d8f6381df70307ee41fd4a1c4b277
Author(s)
Woo, Andrew Kyoungwan
Advisor(s)
Torralba, Antonio
Date Issued
May 2025
Publisher
Massachusetts Institute of Technology
Abstract
Post-training adaptations such as supervised fine-tuning, quantization, and reinforcement learning can cause large language models (LLMs) with identical architectures to exhibit divergent behaviors. However, the mechanisms driving these behavioral shifts remain largely opaque, limiting the reliability and interpretability of adapted models. AutoDiff is a scalable, automated framework for tracing model divergence on a per-neuron basis. It exhaustively profiles every feed-forward (MLP) unit across a pair of models, identifies the neurons with the largest activation gaps, and links these differences to downstream behavioral changes. The pipeline identifies exemplars that maximize between-model activation divergence and clusters the highest-gap neurons into an interpretable, queryable difference report. Proof-of-concept experiments on GPT-2 small validate AutoDiff’s ability to rediscover synthetic perturbations without manual supervision. A larger case study on Llama3.1–8B contrasts the base model with several adapted variants, surfacing neurons whose behavioral shifts align with observed topic-level gains and losses. By uncovering these mechanistic divergences, AutoDiff transforms black-box model updates into actionable insights, enabling safer deployment, principled debugging, and interpretable model evaluation.
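The per-neuron comparison described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function name `top_gap_neurons`, the use of mean absolute difference as the gap score, and the toy random activations are all assumptions made for illustration. It assumes MLP activations from both models have already been collected on the same inputs, as two arrays of shape (inputs, neurons).

```python
import numpy as np


def top_gap_neurons(acts_a, acts_b, k=3, n_exemplars=2):
    """Rank neurons by mean absolute activation gap between two models.

    acts_a, acts_b: arrays of shape (n_inputs, n_neurons) holding the
    activations of corresponding neurons on the same inputs.
    Hypothetical helper; the gap metric is an assumption, not the
    thesis's definition.
    """
    acts_a = np.asarray(acts_a, dtype=float)
    acts_b = np.asarray(acts_b, dtype=float)
    if acts_a.shape != acts_b.shape:
        raise ValueError("activation arrays must have the same shape")

    gap = np.abs(acts_a - acts_b)        # per-input, per-neuron gap
    score = gap.mean(axis=0)             # one score per neuron
    order = np.argsort(-score, kind="stable")[:k]

    report = []
    for j in order:
        # Exemplars: the inputs on which this neuron diverges most.
        ex = np.argsort(-gap[:, j], kind="stable")[:n_exemplars]
        report.append({
            "neuron": int(j),
            "score": float(score[j]),
            "exemplars": [int(i) for i in ex],
        })
    return report


# Toy stand-in for a synthetic-perturbation check: copy the "base"
# activations, then shift one neuron on every input and another
# neuron on a single input.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
adapted = base.copy()
adapted[:, 5] += 3.0    # large, systematic shift in neuron 5
adapted[10, 2] += 1.0   # small shift in neuron 2, input 10 only

report = top_gap_neurons(base, adapted, k=2)
for row in report:
    print(row)
```

In this toy setup the perturbed neurons rank first, and the exemplar list for the lightly perturbed neuron starts with the single input that was changed. A real pipeline would still need to gather activations from the two models and cluster the resulting neurons, which this sketch does not attempt.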
MIT Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Terms of Use
In Copyright - Educational Use Permitted
Copyright retained by author(s)
Persistent DSpace Link