Parsing with sparse annotated resources
Author(s)
Zhang, Yuan, Ph. D. Massachusetts Institute of Technology
DownloadFull printable version (918.2Kb)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Regina Barzilay.
Terms of use
Metadata
Show full item recordAbstract
This thesis focuses on algorithms for parsing within the context of sparse annotated resources. Despite recent progress in parsing techniques, existing methods require significant resources for training. Therefore, current technology is limited when it comes to parsing sentences in new languages or new grammars. We propose methods for parsing when annotated resources are limited. In the first scenario, we explore an automatic method for mapping language-specific part of- speech (POS) tags into a universal tagset. Universal tagsets play a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Our central assumption is that a high-quality mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function. Given the exponential size of the mapping space, we propose a novel method for optimizing the objective over mappings. Our results demonstrate that automatically induced mappings rival their manually designed counterparts when evaluated in the context of multilingual parsing. In the second scenario, we consider the problem of cross-formalism transfer in parsing. We are interested in parsing constituency-based grammars such as HPSG and CCG using a small amount of data annotated in the target formalisms and a large quantity of coarse CFG annotations from the Penn Treebank. While the trees annotated in all of the target formalisms share a similar basic syntactic structure with the Penn Treebank CFG, they also encode additional constraints and semantic features. To handle this apparent difference, we design a probabilistic model that jointly generates CFG and target formalism parses. The model includes features of both parses, enabling transfer between the formalisms, and preserves parsing efficiency. Experimental results show that across a range of formalisms, our model benefits from the coarse annotations.
Description
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 67-73).
Date issued
2013Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.