Building a Language Conditioned System for 6-DoF Tabletop Manipulation
Author(s)
Parakh, Meenal
Advisor
Agrawal, Pulkit
Abstract
We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting and rearrangement, and can learn new skills. It consists primarily of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact through carefully designed interfaces, which are themselves crucial contributions of this work. We further evaluate the perception and execution parts of the system and showcase performance on example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or for fine-tuning. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that, unlike traditional approaches, incorporates vast common-sense knowledge. These large models have been trained on internet-scale data comprising billions of data points, which allows zero-shot use in our system and removes the need for large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics.
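The three-component structure described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: every name below (`DetectedObject`, `PickPlaceStep`, `run_task`, and the stub functions) is hypothetical, and the stubs only stand in for the large pretrained models that the abstract mentions. The sketch shows only the general idea of components that communicate through small, explicit interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float, float]


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical perception output: a labelled object and its position."""
    name: str
    position: Position


@dataclass(frozen=True)
class PickPlaceStep:
    """Hypothetical planner output: move one named object to a target."""
    object_name: str
    target: Position


# Each component is just a callable with a fixed signature (the "interface").
Perceive = Callable[[], List[DetectedObject]]
Plan = Callable[[str, List[DetectedObject]], List[PickPlaceStep]]
Execute = Callable[[PickPlaceStep, Dict[str, DetectedObject]], bool]


def run_task(instruction: str, perceive: Perceive, plan: Plan, execute: Execute) -> int:
    """Run perception, then planning, then execution; return steps completed."""
    objects = perceive()
    steps = plan(instruction, objects)
    scene = {obj.name: obj for obj in objects}
    completed = 0
    for step in steps:
        # Stop if the plan refers to an unseen object or execution fails.
        if step.object_name not in scene or not execute(step, scene):
            break
        completed += 1
    return completed


# Stub components (assumptions for illustration only).
def stub_perceive() -> List[DetectedObject]:
    return [
        DetectedObject("red block", (0.1, 0.2, 0.0)),
        DetectedObject("bowl", (0.4, 0.0, 0.0)),
    ]


def stub_plan(instruction: str, objects: List[DetectedObject]) -> List[PickPlaceStep]:
    # Placeholder rule: move every object that is not the bowl into the bowl.
    bowl = next(obj for obj in objects if obj.name == "bowl")
    return [PickPlaceStep(obj.name, bowl.position) for obj in objects if obj.name != "bowl"]


def stub_execute(step: PickPlaceStep, scene: Dict[str, DetectedObject]) -> bool:
    # A real executor would command a robot; this stub always succeeds.
    return True


steps_done = run_task("put the red block in the bowl", stub_perceive, stub_plan, stub_execute)
```

Because each component is passed in as a plain callable, any one of them can be swapped for a different model without changing the others, which is the property a modular design relies on.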
Date issued
2023-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology