A computational model to connect gestalt perception and natural language
Author(s)
Dhande, Sheel Sanjay, 1979-
Other Contributors
Massachusetts Institute of Technology. Dept. of Architecture. Program In Media Arts and Sciences.
Advisor
Deb K. Roy.
Abstract
We present a computational model that connects gestalt visual perception and language. The model grounds the meaning of natural language words and phrases in terms of the perceptual properties of visually salient groups. We focus on the semantics of a class of words that we call conceptual aggregates (e.g., pair, group, stuff), which inherently refer to groups of objects. The model provides an explanation for how the semantics of these natural language terms interact with gestalt processes in order to connect referring expressions to visual groups. Our computational model can be divided into two stages. The first stage performs grouping on visual scenes. It takes a visual scene segmented into block objects as input, and creates a space of possible salient groups arising from the scene. This stage also assigns a saliency score to each group. In the second stage, visual grounding, the space of salient groups, which is the output of the previous stage, is taken as input along with a linguistic scene description. The visual grounding stage comes up with the best match between a linguistic description and a set of objects. Parameters of the model are trained on the basis of observed data from a linguistic description and visual selection task. The proposed model has been implemented in the form of a program that takes as input a synthetic visual scene and linguistic description, and as output identifies likely groups of objects within the scene that correspond to the description. We present an evaluation of the performance of the model on a visual referent identification task. This model may be applied in natural language understanding and generation systems that utilize visual context, such as scene description systems for the visually impaired and functionally illiterate.
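The two-stage structure described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not say which grouping cues, saliency formula, or lexicon the model uses, so everything below is an assumption made for illustration. Grouping is done by spatial proximity alone (single-linkage at several distance thresholds), saliency is taken to be the inverse of the threshold at which a group first appears, and the names `Block`, `candidate_groups`, `ground`, and `SIZE_WORDS` are hypothetical. Nothing here is trained from data, whereas the thesis fits its parameters to observations.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class Block:
    """A block object in a synthetic scene (hypothetical representation)."""
    name: str
    x: float
    y: float
    color: str


def proximity_groups(blocks, threshold):
    """Single-linkage grouping: blocks within `threshold` of each other are linked."""
    parent = {b: b for b in blocks}

    def find(b):
        while parent[b] != b:
            b = parent[b]
        return b

    for a, b in combinations(blocks, 2):
        if dist((a.x, a.y), (b.x, b.y)) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for b in blocks:
        groups.setdefault(find(b), []).append(b)
    # Only sets of two or more blocks count as groups.
    return [frozenset(g) for g in groups.values() if len(g) > 1]


def candidate_groups(blocks, thresholds):
    """Stage 1 (assumed form): map each candidate group to a saliency score.

    A group that forms at a smaller threshold is tighter, so it gets a higher
    score (1 / threshold). This scoring rule is an illustrative assumption.
    """
    scored = {}
    for t in sorted(thresholds):
        for g in proximity_groups(blocks, t):
            scored.setdefault(g, 1.0 / t)  # keep the score from the smallest threshold
    return scored


# Hypothetical lexicon: size constraints for two conceptual aggregates.
SIZE_WORDS = {"pair": lambda n: n == 2, "group": lambda n: n >= 3}


def ground(description, scored):
    """Stage 2 (assumed form): most salient group consistent with the description."""
    words = description.lower().split()
    colors = {b.color for g in scored for b in g}
    best, best_score = None, 0.0
    for g, score in scored.items():
        size_ok = all(SIZE_WORDS[w](len(g)) for w in words if w in SIZE_WORDS)
        color_ok = all(b.color == w for w in words if w in colors for b in g)
        if size_ok and color_ok and score > best_score:
            best, best_score = g, score
    return best


scene = [
    Block("a", 0.0, 0.0, "red"),
    Block("b", 1.0, 0.0, "red"),
    Block("c", 10.0, 0.0, "green"),
    Block("d", 10.0, 1.5, "green"),
    Block("e", 11.5, 0.0, "green"),
]
scored = candidate_groups(scene, thresholds=[1.0, 2.0, 20.0])
referent = ground("the red pair", scored)
print(sorted(b.name for b in referent))
```

In this toy scene the description "the red pair" selects blocks a and b, and "the green group" selects c, d, and e. The point of the split is that the grounding stage never looks at raw coordinates: it only ranks the groups the first stage proposed.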
Description
Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 79-82).
Date issued
2003
Department
Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Publisher
Massachusetts Institute of Technology
Keywords
Architecture. Program In Media Arts and Sciences.