dc.contributor.author: Liu, Chenxi
dc.contributor.author: Lin, Zhe
dc.contributor.author: Shen, Xiaohui
dc.contributor.author: Yang, Jimei
dc.contributor.author: Lu, Xin
dc.contributor.author: Yuille, Alan L.
dc.date.accessioned: 2018-05-15T15:51:29Z
dc.date.available: 2018-05-15T15:51:29Z
dc.date.issued: 2018-05-10
dc.identifier.uri: http://hdl.handle.net/1721.1/115374
dc.description.abstract: In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.
dc.description.sponsorship: This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
dc.language.iso: en_US
dc.publisher: Center for Brains, Minds and Machines (CBMM)
dc.relation.ispartofseries: CBMM Memo Series;079
dc.title: Recurrent Multimodal Interaction for Referring Image Segmentation
dc.type: Technical Report
dc.type: Working Paper
dc.type: Other
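
The abstract above describes a convolutional multimodal LSTM that, at each word of the referring expression, fuses the word embedding with visual and spatial information over the image grid. The following is a minimal, illustrative sketch of such a cell, assuming a PyTorch-style implementation; the module name ConvMultimodalLSTMCell, the feature dimensions, the 1x1 gate convolution, and the toy usage are assumptions for illustration, not the authors' released code.

# Minimal sketch of a convolutional multimodal LSTM step (illustrative only).
# At each word step t, the cell fuses the word embedding w_t (tiled spatially),
# the visual feature map V, and a spatial-coordinate map S, and updates a
# convolutional hidden state defined over the feature-map grid.
import torch
import torch.nn as nn


class ConvMultimodalLSTMCell(nn.Module):
    def __init__(self, word_dim, vis_dim, spat_dim, hidden_dim, kernel_size=1):
        super().__init__()
        in_dim = word_dim + vis_dim + spat_dim + hidden_dim
        # One convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_dim, 4 * hidden_dim, kernel_size,
                               padding=kernel_size // 2)
        self.hidden_dim = hidden_dim

    def forward(self, word_t, visual, spatial, state):
        # word_t: (B, word_dim); visual: (B, vis_dim, H, W); spatial: (B, spat_dim, H, W)
        h, c = state
        B, _, H, W = visual.shape
        word_map = word_t[:, :, None, None].expand(B, word_t.size(1), H, W)
        x = torch.cat([word_map, visual, spatial, h], dim=1)
        i, f, o, g = torch.chunk(self.gates(x), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c


# Toy usage: run the cell over a 3-word expression and predict a segmentation mask.
if __name__ == "__main__":
    B, H, W = 2, 10, 10
    cell = ConvMultimodalLSTMCell(word_dim=300, vis_dim=64, spat_dim=8, hidden_dim=32)
    head = nn.Conv2d(32, 1, 1)                     # per-pixel foreground score
    visual = torch.randn(B, 64, H, W)
    spatial = torch.randn(B, 8, H, W)
    h = torch.zeros(B, 32, H, W)
    c = torch.zeros(B, 32, H, W)
    for word_t in torch.randn(3, B, 300):          # one embedding per word
        h, c = cell(word_t, visual, spatial, (h, c))
    mask_logits = head(h)                          # (B, 1, H, W)
    print(mask_logits.shape)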

