dc.contributor.advisor	Fredo Durand and Aude Oliva.	en_US
dc.contributor.author	Bylinskii, Zoya	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2019-02-14T15:22:24Z
dc.date.available	2019-02-14T15:22:24Z
dc.date.copyright	2018	en_US
dc.date.issued	2018	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/120375
dc.description	Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.	en_US
dc.description	This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.	en_US
dc.description	Cataloged from student-submitted PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 171-192).	en_US
dc.description.abstract	Multimodal documents occur in a variety of forms: as graphs in technical reports, diagrams in textbooks, and graphic designs in bulletins. Humans can efficiently process the visual and textual information contained within to make decisions on topics including business, healthcare, and science. Building the computational tools to understand multimodal documents can have important applications for web search, information retrieval, captioning and summarization, and automated design. This thesis makes contributions on two fronts: (i) to the development of data collection methods for measuring how humans perceive multimodal documents (i.e., where they look, what they find important), and (ii) to the development of computer vision tools for automatically parsing and making predictions about multimodal documents (i.e., the subject matter they are about). Specifically, the crowdsourced attention data captured from our novel user interfaces is used to train neural network models to predict where people look in graphic designs and information visualizations, with demonstrated applications to thumbnailing, design retargeting, and interactive feedback within graphic design tools. Separately, our models for detecting visual elements and parsing text elements in infographics (information graphics) are used for topic prediction and to present a system for automatic summarization. This thesis makes contributions at the interface of human and computer vision, with applications to human-computer interfaces and design.	en_US
dc.description.statementofresponsibility	by Zoya Bylinskii.	en_US
dc.format.extent	192 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source, but further reproduction or distribution in any format is prohibited without written permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Computational perception for multi-modal document understanding	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	1084273965	en_US