Building Blocks for Human-AI Alignment: Specify, Inspect, Model, and Revise

Booth, Serena Lynn

Author(s)

Booth, Serena Lynn

DownloadThesis PDF (36.42Mb)

Advisor

Shah, Julie A.

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Metadata

Show full item record

Abstract

The learned behaviors of AI systems and robots should align with the intentions of their human designers. In service of this goal, people—especially experts—must be able to easily specify, inspect, model, and revise AI system and robot behaviors. These four interactions are critical building blocks for human-AI alignment. In this thesis, I study each of these problems. First, I study how experts write reward function specifications for reinforcement learning (RL). I find that these specifications are written with respect to the RL algorithm, not independently, and I find that experts often write erroneous specifications that fail to encode their true intent, even in a trivial setting [22]. Second, I study how to support people in inspecting the agent’s learned behaviors. To do so, I introduce two related Bayesian inference methods to find examples or environments which invoke particular system behaviors; viewing these examples and environments is helpful for conceptual model formation and for system debugging [25, 213]. Third, I study cognitive science theories that govern how people build conceptual models to explain these observed examples of agent behaviors. While I find that some foundations of these theories are employed in typical interventions to support humans in learning about agent behaviors, I also find there is significant room to build better curricula for interaction—for example, by showing counterexamples of alternative behaviors [24]. I conclude by speculating about how these building blocks of human-AI interaction can be combined to enable people to revise their specifications, and, in doing so, create better aligned agents.

Date issued

2024-02

URI

https://hdl.handle.net/1721.1/153862

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses