Generalizable Robot Manipulation through Unified Perception, Policy Learning, and Planning
Author(s)
Fang, Xiaolin
Advisor
Kaelbling, Leslie Pack
Lozano-Pérez, Tomás
Abstract
Advancing robotic manipulation to achieve generalization across diverse goals, environments, and embodiments is a critical challenge in robotics research. While the availability of data and large-scale training has brought exciting progress in robotic manipulation, current methods often struggle to generalize to unseen, unstructured environments and to solve long-horizon tasks. In this thesis, I present my work in robot learning and planning that enables multi-step manipulation in partially observable environments, toward general-purpose embodied agents. Specifically, I describe 1) a modular framework that combines learned perception models for affordance estimation with task-and-motion planning (TAMP) for object rearrangement in unstructured scenes, 2) generative diffusion models of robot skills, which can be composed through inference-time optimization to solve unseen combinations of environmental constraints, and 3) the use of large vision-language models (VLMs) to build task-oriented visual abstractions, allowing skills to generalize across different environments with only 5 to 10 demonstrations. Together, these approaches contribute to the generality and scalability of embodied agents toward solving real-world manipulation tasks in unstructured environments.
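The second thread composes skill-level diffusion models at inference time. As a loose, hypothetical illustration of that general idea (not the thesis implementation), the Python sketch below samples from the product of two toy one-dimensional "constraint" models by summing their score functions inside a Langevin sampler; all functions, parameters, and values here are made up for illustration.

# Hedged sketch, assuming the general compositional-diffusion idea of
# sampling from a product of constraint distributions by summing scores.
# The two "skill constraints" are stand-in analytic Gaussian scores.
import numpy as np

def score_a(x):
    # gradient of log p_a(x): hypothetical constraint "stay near 1.0"
    return -(x - 1.0) / 0.25

def score_b(x):
    # gradient of log p_b(x): hypothetical constraint "stay near -0.5"
    return -(x + 0.5) / 0.25

def composed_langevin_sample(n_steps=500, step=1e-2, seed=None):
    # Langevin dynamics with the summed (product-of-experts) score,
    # so the final sample approximately satisfies both constraints.
    rng = np.random.default_rng(seed)
    x = rng.normal()
    for _ in range(n_steps):
        grad = score_a(x) + score_b(x)
        x = x + 0.5 * step * grad + np.sqrt(step) * rng.normal()
    return x

print(composed_langevin_sample(seed=0))  # concentrates near 0.25, between the two means

In this toy setting, the product of the two Gaussian constraints is itself a Gaussian centered between their means, so the composed sampler lands near a value neither model alone would prefer; the thesis applies the analogous composition to learned diffusion models over robot skill trajectories.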
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology