DSpace@MIT

Inverse Constitutional AI

Author(s)
Kostolansky, Timothy H.
Download: Thesis PDF (1.012Mb)
Advisor
Hadfield-Menell, Dylan
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
The alignment of large language models (LLMs) to human values has become increasingly pressing as their scale and capabilities have grown. One important part of alignment is understanding the preference datasets used to finetune LLMs. Inverse Constitutional AI (ICAI) is presented as a novel interpretability framework for discovering the principles underlying preference datasets. Motivated by the Constitutional AI training paradigm of instilling principles in models, ICAI aims to extract a succinct "constitution" of natural language principles from data. This thesis contributes an initial attempt at realizing ICAI through a clustering-based methodology applied to preference datasets. The proposed approach involves four steps: embedding preference pairs into vector representations, clustering the embeddings to group related preferences, generating interpretable principles for each cluster using language models, and validating these principles against held-out samples. Empirical evaluation is conducted on the hh-rlhf dataset for training helpful and harmless AI assistants, as well as on a synthetic dataset constructed by relabeling hh-rlhf samples with predefined principles. Results demonstrate promising capabilities in clustering semantically coherent topics and generating human-interpretable principles, while also highlighting limitations in achieving fully disentangled, principle-based clustering. Directions for future work are discussed, including soft clustering, bottom-up principle extraction, prompt optimization approaches, and sparse dictionary learning methods. In this work, I argue the following thesis: ICAI shows promise as a strategy to disentangle and explain the preferences represented in preference data. A clustering-based approach to ICAI, however, fails to extract a constitution of principles from preference data, because clustering occurs along the topics in the data rather than along the preferences themselves.
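The embed-then-cluster steps described in the abstract, and the topic-versus-preference failure mode it reports, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the eight toy preference pairs are invented (they are not hh-rlhf samples), a normalized bag-of-words vector stands in for a learned embedding, and a hand-written k-means with deterministic farthest-point initialization stands in for whatever clustering method the thesis uses. The language-model principle generation and held-out validation steps are not implemented; the most frequent word per cluster is only a crude placeholder for a generated principle.

```python
import math
from collections import Counter

# Hypothetical stop-word list so that function words do not dominate similarity.
STOPWORDS = {"how", "to", "the", "a", "in", "and", "be", "with", "for"}

# Invented toy pairs: (prompt, chosen, rejected, topic, principle).
# The topic and principle labels are NOT used for clustering; they only
# let us inspect what each cluster ends up containing.
PAIRS = [
    ("how to bake bread", "please bake the bread in a hot oven", "bake the bread", "cooking", "polite"),
    ("how to bake cake", "be careful with the hot oven and bake the cake", "bake the cake", "cooking", "cautious"),
    ("how to bake cake", "please bake the cake in a hot oven", "bake the cake", "cooking", "polite"),
    ("how to bake bread", "be careful with the hot oven and bake the bread", "bake the bread", "cooking", "cautious"),
    ("how to book a flight", "please book the flight for a cheap trip", "book the flight", "travel", "polite"),
    ("how to book a hotel", "be careful with the cheap trip and book the hotel", "book the hotel", "travel", "cautious"),
    ("how to book a hotel", "please book the hotel for a cheap trip", "book the hotel", "travel", "polite"),
    ("how to book a flight", "be careful with the cheap trip and book the flight", "book the flight", "travel", "cautious"),
]


def tokens(pair):
    """Content words of prompt + chosen + rejected, with stop words removed."""
    prompt, chosen, rejected = pair[:3]
    return [w for w in f"{prompt} {chosen} {rejected}".split() if w not in STOPWORDS]


def embed(pairs):
    """Stand-in embedding: L2-normalized word-count vector per preference pair."""
    vocab = sorted({w for p in pairs for w in tokens(p)})
    vectors = []
    for p in pairs:
        counts = Counter(tokens(p))
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        # Assign each vector to its nearest centroid, then recompute the means.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def summarize(pairs, labels, k):
    """Per cluster: topic counts, principle counts, and the most frequent word."""
    report = []
    for j in range(k):
        members = [p for p, lab in zip(pairs, labels) if lab == j]
        words = Counter(w for p in members for w in tokens(p))
        report.append({
            "topics": Counter(p[3] for p in members),
            "principles": Counter(p[4] for p in members),
            "top_word": words.most_common(1)[0][0],
        })
    return report


labels = kmeans(embed(PAIRS), k=2)
report = summarize(PAIRS, labels, k=2)
for j, r in enumerate(report):
    print(j, dict(r["topics"]), dict(r["principles"]), r["top_word"])
```

On this toy data the two clusters separate cooking from travel, and each cluster contains an even mix of the "polite" and "cautious" pairs. The words that carry the preference signal ("please", "careful") contribute far less to the vectors than the topic words do, so a summary written per cluster would describe a topic rather than a principle. This is only a constructed example of the effect the abstract describes, not evidence for it.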
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156804
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses