Learning to understand spatial language for robotic navigation and mobile manipulation
Author(s): Kollar, Thomas (Thomas Fleming)
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
This thesis focuses on understanding task-constrained natural language commands, where a person gives a natural language command to the robot and the robot infers and executes the corresponding plan. Understanding natural language is difficult because a system must infer the location of landmarks such as "the computer cluster," and actions corresponding to spatial relations such as "to" or "around" and verbs such as "put" or "take," each of which may be composed in complex ways. In addition, different people may give very different types of commands to perform the same action. The first chapter of this thesis focuses on simple natural language commands such as "Find the computer," where a person commands the robot to find an object or place and the robot must infer a corresponding plan. This problem would be easy if we constrained the set of words that the robot might need to reason about. However, if a person says, "find the computer," and the robot has not previously detected a "computer," then it is not clear where the robot should look. We present a method that uses previously detected objects and places in order to bias the search process toward areas of the environment where a previously unseen object is likely to be found. The system uses a semantic map of the environment together with a model of contextual relationships between objects to infer this plan, which finds the query object with minimal travel time. The contextual relationships are learned from the captions of a large dataset of photos downloaded from Flickr. Simulated and real-world experiments show that a small subset of detectable objects and scenes is able to predict the location of previously unseen objects and places. In the second chapter, we take steps toward building a robust spatial language understanding system for three different domains: route directions, visual inspection, and indoor mobility.
We take as input a natural language command such as "Go through the double doors and down the hallway," extract a semantic structure called a Spatial Description Clause (SDC) from the language, and ground each SDC in a partial or complete semantic map of the environment. By extracting a flat sequence of SDCs, we are able to ground the language by using a probabilistic graphical model that is factored into three key components. First, a landmark component grounds novel noun phrases such as "the computers" in the perceptual frame of the robot by exploiting object co-occurrence statistics between unknown noun phrases and known perceptual features. These statistics are learned from a large database of tagged images such as Flickr, and build on the model developed in the first component of the thesis. Second, a spatial reasoning component judges how well spatial relations such as "past the computers" describe the path of the robot relative to a landmark. Third, a verb understanding component judges how well spatial verb phrases such as "follow," "meet," "avoid," and "turn right" describe how an agent moves on its own or in relation to another agent. Once trained, our model requires only a metric map of the environment together with the locations of detected objects in order to follow directions through it. This map can be given a priori or created on the fly as the robot explores the environment. In the final chapter of the thesis, we focus on understanding mobile manipulation commands such as, "Put the tire pallet on the truck." The first contribution of this chapter is the Generalized Grounding Graph (G3), which connects language to grounded aspects of the environment. In this chapter, we relax the assumption that the language has a fixed and flat structure and provide a method for constructing a hierarchical probabilistic graphical model that connects each element in a natural language command to an object, place, path, or event in the environment.
The structure of the G3 model is dynamically instantiated according to the compositional and hierarchical structure of the command, enabling efficient learning and inference. The second contribution of this chapter is to formulate the problem as a discriminative learning problem that maps from language directly onto a robot plan. This probabilistic model is represented as a conditional random field (CRF) that learns the correspondence of robot plans and the language and is able to learn the meanings of complex verbs such as "put" and "take," as well as spatial relations such as "on" and "to."
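As a rough illustration of the co-occurrence idea described in the abstract (biasing the search for an unseen object toward places where related, already-detected objects were found), the following is a minimal sketch and not the thesis's actual model. The function names, the add-one smoothing, the toy captions, and the toy detections are all assumptions made for this example, and the sketch ignores travel time, which the thesis's planner takes into account.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_model(captions):
    """Count how often each word, and each unordered word pair, appears in a caption."""
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        words = set(caption.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_counts, pair_counts

def p_query_given_context(query, context, word_counts, pair_counts, alpha=1.0):
    """Smoothed estimate of P(query in caption | context in caption).

    The add-alpha smoothing is an assumption of this sketch.
    """
    joint = pair_counts[frozenset((query, context))]
    return (joint + alpha) / (word_counts[context] + 2 * alpha)

def rank_places(query, detections, word_counts, pair_counts):
    """Order places by the best co-occurrence score of any object detected there."""
    scores = {
        place: max(
            (p_query_given_context(query, obj, word_counts, pair_counts)
             for obj in objects),
            default=0.0,  # a place with no detections gets the lowest score
        )
        for place, objects in detections.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data standing in for photo captions and a semantic map.
captions = [
    "computer desk monitor",
    "desk computer keyboard",
    "sink refrigerator kitchen",
    "kitchen refrigerator microwave",
]
detections = {
    "office": ["desk", "monitor"],
    "kitchenette": ["refrigerator", "sink"],
}

word_counts, pair_counts = cooccurrence_model(captions)
ranking = rank_places("computer", detections, word_counts, pair_counts)
print(ranking)
```

With this toy data the office is ranked ahead of the kitchenette, because "computer" co-occurs with "desk" and "monitor" in the captions but never with "refrigerator" or "sink".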
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 103-108).
Department: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.