| dc.description.abstract | In this thesis, we advance AI safety through mechanistic interpretability and oversight methodologies across three key areas: mathematical reasoning in large language models (LLMs), the validity of sparse autoencoders, and scalable oversight. First, we reverse-engineer addition in mid-sized LLMs and discover that LLMs represent numbers as helices. We demonstrate that LLMs perform addition by manipulating these helices with a "Clock" algorithm, providing the first representation-level explanation of mathematical reasoning in LLMs, verified through causal interventions on model activations. Next, we rigorously evaluate sparse autoencoders (SAEs), a popular interpretability tool, by testing their effectiveness on the downstream task of probing under challenging conditions, including data scarcity, class imbalance, label noise, and covariate shift. While SAEs occasionally outperform baseline methods, they fail to consistently enhance task performance, underscoring a potentially critical limitation of SAEs. Lastly, we introduce a quantitative framework for evaluating scalable oversight, a promising approach in which weaker AI systems supervise stronger ones, as a function of model intelligence. Applying our framework to four oversight games ("Mafia," "Debate," "Backdoor Code," and "Wargames"), we identify clear scaling patterns and extend our findings through a theoretical analysis of Nested Scalable Oversight (NSO), deriving conditions for optimal oversight structures. Together, these studies advance our understanding of AI interpretability and alignment, providing insights and frameworks that help progress AI safety. | |