DSpace@MIT

Groundtruth budgeting : a novel approach to semi-supervised relation extraction in medical language

dc.contributor.advisor Özlem Uzuner and Peter Szolovits. en_US
dc.contributor.author Ryan, Russell J. (Russell John Wyatt) en_US
dc.contributor.other Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. en_US
dc.date.accessioned 2011-10-17T21:28:01Z
dc.date.available 2011-10-17T21:28:01Z
dc.date.copyright 2011 en_US
dc.date.issued 2011 en_US
dc.identifier.uri http://hdl.handle.net/1721.1/66456
dc.description Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. en_US
dc.description Cataloged from PDF version of thesis. en_US
dc.description Includes bibliographical references (p. 67-69). en_US
dc.description.abstract We address the problem of weakly-supervised relation extraction in hospital discharge summaries. Sentences with pre-identified concept types (for example: medication, test, problem, symptom) are labeled with the relationship between the concepts. We present a novel technique for weakly-supervised bootstrapping of a classifier for this task: Groundtruth Budgeting. For highly overlapping, self-similar datasets such as the 2010 i2b2/VA challenge corpus, the performance of classifiers on the minority classes is often poor. To address this, we set aside a random portion of the groundtruth at the beginning of bootstrapping, which is gradually added as the classifier is bootstrapped. The classifier chooses which groundtruth samples to add by measuring the confidence of its predictions on them and selecting the samples for which its predictions are least confident. By adding samples in this fashion, the classifier is able to increase its coverage of the decision space while not adding too many majority-class examples. We evaluate this approach on the 2010 i2b2/VA challenge corpus, consisting of 477 patient discharge summaries, and show that, with a training corpus of 349 discharge summaries, budgeting 10% of the corpus achieves results equivalent to those of a bootstrapping classifier starting with the entire corpus. We compare our results to those of other papers published in the proceedings of the 2010 Fourth i2b2/VA Shared-Task and Workshop. en_US
dc.description.statementofresponsibility by Russell J. Ryan. en_US
dc.format.extent 69 p. en_US
dc.language.iso eng en_US
dc.publisher Massachusetts Institute of Technology en_US
dc.rights M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. en_US
dc.rights.uri http://dspace.mit.edu/handle/1721.1/7582 en_US
dc.subject Electrical Engineering and Computer Science. en_US
dc.title Groundtruth budgeting : a novel approach to semi-supervised relation extraction in medical language en_US
dc.title.alternative Ground truth budgeting en_US
dc.title.alternative Novel approach to semi-supervised relation extraction in medical language en_US
dc.type Thesis en_US
dc.description.degree M.Eng. en_US
dc.contributor.department Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. en_US
dc.identifier.oclc 756040752 en_US
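
The budgeting procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`split_budget`, `least_confident`, `budgeted_bootstrap`), the `train` and `confidence_of` callables, and the parameters `k` and `rounds` are all hypothetical, and the step that adds unlabeled samples during bootstrapping is omitted because the abstract does not specify it. The sketch only shows the two ideas the abstract states: reserving a random portion of the groundtruth, then releasing the samples on which the current classifier is least confident.

```python
import random


def split_budget(labeled, fraction, seed=0):
    """Set aside a random fraction of the groundtruth as the budget.

    Returns (initial_training_set, budget). Hypothetical helper.
    """
    rng = random.Random(seed)
    shuffled = list(labeled)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]


def least_confident(budget, confidence, k):
    """Return the k budget samples with the lowest prediction confidence."""
    return sorted(budget, key=confidence)[:k]


def budgeted_bootstrap(labeled, train, confidence_of,
                       fraction=0.1, k=2, rounds=3):
    """Retrain repeatedly, releasing k low-confidence budget samples per round.

    `train(samples)` returns a model and `confidence_of(model, sample)`
    returns a number; both are assumptions supplied by the caller.
    """
    training, budget = split_budget(labeled, fraction)
    model = train(training)
    for _ in range(rounds):
        if not budget:
            break
        picked = least_confident(
            budget, lambda sample: confidence_of(model, sample), k)
        for sample in picked:
            budget.remove(sample)
        training.extend(picked)
        model = train(training)  # retrain with the newly released samples
    return model, training, budget
```

Selecting by lowest confidence, rather than at random, is what the abstract credits with widening coverage of the decision space without adding too many majority-class examples.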


Files in this item

Name Size Format Description
756040752-MIT.pdf 3.344Mb PDF Full printable version
