dc.contributor.advisor: Shah, Julie A.
dc.contributor.author: Booth, Serena Lynn
dc.date.accessioned: 2024-03-21T19:11:23Z
dc.date.available: 2024-03-21T19:11:23Z
dc.date.issued: 2024-02
dc.date.submitted: 2024-02-21T17:18:40.234Z
dc.identifier.uri: https://hdl.handle.net/1721.1/153862
dc.description.abstract: The learned behaviors of AI systems and robots should align with the intentions of their human designers. In service of this goal, people—especially experts—must be able to easily specify, inspect, model, and revise AI system and robot behaviors. These four interactions are critical building blocks for human-AI alignment. In this thesis, I study each of these problems. First, I study how experts write reward function specifications for reinforcement learning (RL). I find that these specifications are written with respect to the RL algorithm, not independently, and I find that experts often write erroneous specifications that fail to encode their true intent, even in a trivial setting [22]. Second, I study how to support people in inspecting the agent’s learned behaviors. To do so, I introduce two related Bayesian inference methods to find examples or environments which invoke particular system behaviors; viewing these examples and environments is helpful for conceptual model formation and for system debugging [25, 213]. Third, I study cognitive science theories that govern how people build conceptual models to explain these observed examples of agent behaviors. While I find that some foundations of these theories are employed in typical interventions to support humans in learning about agent behaviors, I also find there is significant room to build better curricula for interaction—for example, by showing counterexamples of alternative behaviors [24]. I conclude by speculating about how these building blocks of human-AI interaction can be combined to enable people to revise their specifications, and, in doing so, create better aligned agents.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Building Blocks for Human-AI Alignment: Specify, Inspect, Model, and Revise
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

