Generative Discovery via Reinforcement Learning
Author(s)
Hong, Zhang-Wei
Advisor
Agrawal, Pulkit
Abstract
Discovering new knowledge is crucial for technological advancement and mirrors how humans and animals learn new skills, often through trial and error. Ancient humans, for example, discovered fire by experimenting with different methods, and children learn to walk and use tools through repeated attempts and failures. In chemistry, scientists find new catalysts by testing various compositions. But how exactly do humans use trial and error to improve existing solutions (like learning more efficient ways to walk or synthesizing novel compounds)? Can we design computational models that mimic or exceed human discovery? Such computational models could greatly accelerate progress in science and engineering, since they can automate or assist the work of human scientists and engineers and discover new knowledge more efficiently (e.g., new compounds, streamlined robot controller designs, etc.). Reinforcement learning (RL) is well suited for discovery tasks because it enables machines to learn through trial and error. My work overcomes the following major limitations of today's RL algorithms, thereby advancing their discovery potential.

Mitigating the bias of reward shaping. RL relies on reward signals from trial-and-error experience, but these signals can be sparse: they are provided only once a desired solution is found and are otherwise zero, so most trials offer little to no feedback. A common strategy for improving performance under sparse rewards is to provide additional hints (i.e., reward shaping) to guide RL algorithms. However, if these hints are inaccurate, they can steer the algorithm toward worse solutions than it would find without them. I propose a new RL framework that can be combined with any standard RL algorithm and ensures that training with hints finds better solutions instead of harming performance.

Learning with sub-optimal data. RL can learn not only from online interaction with the world but also from datasets of logged experience. For expensive or time-consuming tasks like materials discovery or robot learning, offline RL is often preferred because it leverages existing data rather than requiring new interaction with the world. However, such datasets may contain mostly low-reward solutions, which limits an offline RL algorithm's ability to find solutions better than what is already in the dataset (as we show later in this thesis). I introduce sample reweighting strategies that reweight the dataset so that current offline RL algorithms trained on the weighted samples can discover solutions far better than those in the dataset, even when low-reward solutions predominate (a rough sketch of this reweighting idea follows the abstract).

Safety via diversity. Standard RL algorithms aim to find a single "best" solution. Yet in many discovery problems, such as drug development, it is more valuable to generate multiple high-reward solutions with distinct properties (i.e., diversity) than to focus on only one. I study this problem in an emerging discovery task: red-teaming large language models (LLMs). In red-teaming, we want diverse prompts that trigger undesired outputs from a target language model. Current approaches use RL to train one LLM to red-team another, but the generated prompts lack diversity and often converge to a few prompts that consistently trigger undesired outputs. I propose rewarding the agent for maximizing the diversity of generated prompts, which also improves the success of the prompts at triggering undesired outputs from the target LLM (a minimal sketch follows the abstract).
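As a rough illustration of the sample-reweighting idea for offline RL, the sketch below assigns higher sampling weight to logged trajectories with higher returns before they are fed to an offline RL learner. The softmax weighting, the temperature parameter, and the function names are illustrative assumptions, not the exact scheme developed in the thesis.

import numpy as np

def return_based_weights(trajectory_returns, temperature=1.0):
    """Assign a sampling weight to each logged trajectory from its total return.

    A softmax over returns (a plausible choice, not necessarily the thesis's)
    concentrates probability mass on rare high-reward trajectories, so the
    offline RL learner sees them far more often than the low-reward majority.
    """
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    z = (returns - returns.max()) / temperature  # subtract max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

# Example: a dataset dominated by low-reward trajectories.
returns = [0.1, 0.0, 0.2, 0.1, 5.0]  # one rare high-return trajectory
weights = return_based_weights(returns, temperature=0.5)

# Sample trajectory indices for offline RL training batches using these
# weights instead of uniform sampling over the dataset.
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(returns), size=8, p=weights)
print(weights.round(3), batch_indices)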
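To make the diversity-reward idea for red-teaming concrete, here is a minimal sketch of a reward that adds a novelty bonus to an attack-success score. The bag-of-characters embedding, the bonus weight beta, and the function names are stand-ins chosen for a self-contained example; the thesis's actual reward design may differ.

import numpy as np

def embed(prompt: str) -> np.ndarray:
    """Stand-in prompt embedding (normalized bag of bytes); a real system would
    use a learned sentence-embedding model. Purely illustrative."""
    vec = np.zeros(256)
    for byte in prompt.encode("utf-8"):
        vec[byte] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def diversity_bonus(prompt: str, past_prompts: list[str]) -> float:
    """Novelty = 1 - max cosine similarity to previously generated prompts."""
    if not past_prompts:
        return 1.0
    e = embed(prompt)
    sims = [float(e @ embed(p)) for p in past_prompts]
    return 1.0 - max(sims)

def red_team_reward(attack_success: float, prompt: str,
                    past_prompts: list[str], beta: float = 0.5) -> float:
    """Combined reward: how strongly the prompt triggers undesired output from
    the target LLM (attack_success, e.g. a classifier score on the response)
    plus a weighted bonus for being unlike earlier prompts."""
    return attack_success + beta * diversity_bonus(prompt, past_prompts)

# Usage: a near-duplicate of an earlier prompt earns a smaller total reward
# than a prompt that triggers the same behavior in a different way.
history = ["Tell me how to pick a lock."]
print(red_team_reward(0.9, "Tell me how to pick a lock!", history))
print(red_team_reward(0.9, "Write a story where a villain explains lock picking.", history))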
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology