Parsing with sparse annotated resources

Author(s)

Zhang, Yuan, Ph. D. Massachusetts Institute of Technology

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Regina Barzilay.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract

This thesis focuses on algorithms for parsing within the context of sparse annotated resources. Despite recent progress in parsing techniques, existing methods require significant resources for training. Therefore, current technology is limited when it comes to parsing sentences in new languages or new grammars. We propose methods for parsing when annotated resources are limited. In the first scenario, we explore an automatic method for mapping language-specific part-of-speech (POS) tags into a universal tagset. Universal tagsets play a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Our central assumption is that a high-quality mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function. Given the exponential size of the mapping space, we propose a novel method for optimizing the objective over mappings. Our results demonstrate that automatically induced mappings rival their manually designed counterparts when evaluated in the context of multilingual parsing. In the second scenario, we consider the problem of cross-formalism transfer in parsing. We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data annotated in the target formalisms and a large quantity of coarse CFG annotations from the Penn Treebank. While the trees annotated in all of the target formalisms share a similar basic syntactic structure with the Penn Treebank CFG, they also encode additional constraints and semantic features. To handle this apparent difference, we design a probabilistic model that jointly generates CFG and target formalism parses. The model includes features of both parses, enabling transfer between the formalisms, and preserves parsing efficiency. Experimental results show that across a range of formalisms, our model benefits from the coarse annotations.

Description

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.

This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.

Cataloged from student-submitted PDF version of thesis.

Includes bibliographical references (p. 67-73).

Date issued

2013

URI

http://hdl.handle.net/1721.1/82180

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses