DSpace@MIT
Building Blocks for Human-AI Alignment: Specify, Inspect, Model, and Revise

Author(s)
Booth, Serena Lynn
Download
Thesis PDF (36.42 MB)
Advisor
Shah, Julie A.
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
The learned behaviors of AI systems and robots should align with the intentions of their human designers. In service of this goal, people—especially experts—must be able to easily specify, inspect, model, and revise AI system and robot behaviors. These four interactions are critical building blocks for human-AI alignment. In this thesis, I study each of these problems. First, I study how experts write reward function specifications for reinforcement learning (RL). I find that these specifications are written with respect to the RL algorithm, not independently, and I find that experts often write erroneous specifications that fail to encode their true intent, even in a trivial setting [22]. Second, I study how to support people in inspecting the agent’s learned behaviors. To do so, I introduce two related Bayesian inference methods to find examples or environments which invoke particular system behaviors; viewing these examples and environments is helpful for conceptual model formation and for system debugging [25, 213]. Third, I study cognitive science theories that govern how people build conceptual models to explain these observed examples of agent behaviors. While I find that some foundations of these theories are employed in typical interventions to support humans in learning about agent behaviors, I also find there is significant room to build better curricula for interaction—for example, by showing counterexamples of alternative behaviors [24]. I conclude by speculating about how these building blocks of human-AI interaction can be combined to enable people to revise their specifications, and, in doing so, create better aligned agents.
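To make the first finding concrete, below is a minimal, hypothetical sketch of the failure mode the abstract describes: a designer adds a per-step "progress" bonus intending it to mean "reach the goal quickly," but under the resulting reward the optimal policy loops forever and never finishes the episode. This is an illustrative toy, not code or a domain from the thesis or [22]; the chain MDP, constants, and names are all assumptions chosen for the example.

```python
# Toy chain MDP: states 0..4, deterministic left/right moves, goal at state 4.
# The reward below is MISSPECIFIED: it pays +1 for any rightward "progress"
# and +10 at the goal, but the designer's intent was "reach the goal fast."

GAMMA = 0.99           # discount factor
N_STATES = 5           # chain of states 0..4
GOAL = 4               # terminal goal state
ACTIONS = (-1, +1)     # move left / move right

def step(s: int, a: int) -> int:
    """Deterministic chain dynamics, clipped at the edges."""
    return max(0, min(N_STATES - 1, s + a))

def reward(s: int, s_next: int) -> float:
    """Misspecified reward: +1 for rightward 'progress', +10 at the goal."""
    return (1.0 if s_next > s else 0.0) + (10.0 if s_next == GOAL else 0.0)

# Value iteration on the toy MDP (in-place updates converge here).
V = [0.0] * N_STATES
for _ in range(5000):
    for s in range(N_STATES):
        if s == GOAL:
            continue  # terminal state: no future return
        V[s] = max(reward(s, step(s, a)) + GAMMA * V[step(s, a)]
                   for a in ACTIONS)

# Read off the induced policy: does it ever choose to finish the episode?
for s in range(N_STATES - 1):
    best = max(ACTIONS,
               key=lambda a, s=s: reward(s, step(s, a)) + GAMMA * V[step(s, a)])
    print(f"state {s}: {'RIGHT' if best > 0 else 'LEFT'}")

# Prints RIGHT for states 0..2 but LEFT at state 3: bouncing between states
# 2 and 3 collects the +1 bonus worth ~1/(1 - GAMMA**2) ≈ 50 in discounted
# return, which beats the +11 for actually entering the goal. The optimal
# agent under this reward therefore never terminates.
```

Running the sketch shows the gap between the written specification and the designer's intent: the reward looks reasonable line by line, yet the behavior it induces is only visible after optimization, which is exactly why inspecting learned behaviors (the second building block above) matters.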
Date issued
2024-02
URI
https://hdl.handle.net/1721.1/153862
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
