Building a Language Conditioned System for 6-DoF Tabletop Manipulation

Parakh, Meenal

Author(s)

Parakh, Meenal

Advisor

Agrawal, Pulkit

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting or rearrangement, and can learn new skills. It consists primarily of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact through carefully designed interfaces, which are also key contributions of this work. We further evaluate different parts of the system, belonging to perception and execution, and showcase performance on some example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or for fine-tuning. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that incorporates vast common-sense knowledge, in contrast to traditional approaches. These large models have been trained on internet-scale data comprising billions of data points, allowing zero-shot application in our system without large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics.

Date issued

2023-09

URI

https://hdl.handle.net/1721.1/152838

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses