Steerable Alignment with Conditional Multiobjective Preference Optimization
Author(s)
Manyika, Julian
Advisor
Hadfield-Menell, Dylan
Abstract
As the scale, capabilities, and use cases of large language models (LLMs) continue to grow, it is imperative that these systems are aligned with human preferences. Current state-of-the-art alignment strategies such as Reinforcement Learning from Human Feedback (RLHF) have provided useful paradigms for finetuning LLMs to produce outputs that are more consistent with human preferences. These approaches, however, assume that preferences are formed by a single underlying reward model, which is likely insufficient for representing an individual’s preferences, certainly unable to represent diverse group preferences, and inflexible for users at inference time. To address these limitations, we propose Conditional Multiobjective Preference Optimization (CMPO), a novel alignment strategy that trains a user-steerable LLM along multiple attributes of text, such as helpfulness and humor. CMPO simulates the Pareto front of multiple single-attribute preference-optimized models through structural plurality and finetuning with Direct Preference Optimization (DPO), and allows users to condition outputs on the predefined attributes at inference time. Experiments show that CMPO generates responses that are preferred to those from separate attribute-specific DPO models and from models trained with SteerLM, an alternative model-steering approach. CMPO empirically shows promise as a scalable and flexible finetuning strategy for creating LLMs that are attribute-steerable from parameterized preferences.
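
The full method is described in the thesis PDF; as a rough illustration only, the Python sketch below shows one common way an attribute-conditioned DPO objective can be set up, with conditioning carried by a control tag in the prompt and one preference term per attribute. The function names, the control-tag idea, and the equal weighting across attributes are assumptions of this sketch, not details taken from the thesis.

# A minimal sketch (not the thesis implementation) of a DPO-style objective
# conditioned on a text attribute. It assumes per-sequence log-probabilities
# for chosen/rejected responses have already been computed under both the
# policy and a frozen reference model.

import torch
import torch.nn.functional as F


def conditional_dpo_loss(policy_chosen_logps: torch.Tensor,
                         policy_rejected_logps: torch.Tensor,
                         ref_chosen_logps: torch.Tensor,
                         ref_rejected_logps: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on preference pairs; in this sketch the attribute
    'conditioning' lives in the prompt, e.g. a control tag such as
    '<attr: humor>' prepended before the log-probs above are computed."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def multiattribute_loss(per_attribute_logps: dict, beta: float = 0.1) -> torch.Tensor:
    """Hypothetical multi-attribute training step: each attribute contributes
    a DPO term computed from its own preference pairs, so a single steerable
    policy is trained instead of one model per attribute."""
    total = 0.0
    for attr, lp in per_attribute_logps.items():
        total = total + conditional_dpo_loss(lp["policy_chosen"],
                                             lp["policy_rejected"],
                                             lp["ref_chosen"],
                                             lp["ref_rejected"],
                                             beta=beta)
    return total / max(len(per_attribute_logps), 1)

One design point this sketch is meant to surface: because every attribute's preference pairs update the same conditioned policy, steering at inference time reduces to changing the control tag rather than swapping between separately finetuned models.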
Date issued
2024-05

Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher
Massachusetts Institute of Technology