DSpace@MIT

Active duplicate detection with Bayesian nonparametric models

dc.contributor.advisor Leslie Pack Kaelbling. en_US
dc.contributor.author Matsakis, Nicholas E. (Nicholas Elias), 1976- en_US
dc.contributor.other Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. en_US
dc.date.accessioned 2010-08-30T14:31:49Z
dc.date.available 2010-08-30T14:31:49Z
dc.date.copyright 2010 en_US
dc.date.issued 2010 en_US
dc.identifier.uri http://hdl.handle.net/1721.1/57679
dc.description Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. en_US
dc.description Cataloged from PDF version of thesis. en_US
dc.description Includes bibliographical references (p. 129-137). en_US
dc.description.abstract When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, and so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. This task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries are used to guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering. It also presents experimental results demonstrating the effectiveness of this method in a variety of data domains. en_US
dc.description.statementofresponsibility by Nicholas Elias Matsakis. en_US
dc.format.extent 137 p. en_US
dc.language.iso eng en_US
dc.publisher Massachusetts Institute of Technology en_US
dc.rights M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. en_US
dc.rights.uri http://dspace.mit.edu/handle/1721.1/7582 en_US
dc.subject Electrical Engineering and Computer Science. en_US
dc.title Active duplicate detection with Bayesian nonparametric models en_US
dc.type Thesis en_US
dc.description.degree Ph.D. en_US
dc.contributor.department Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. en_US
dc.identifier.oclc 631212387 en_US


Files in this item

Name               Size     Format  Description
631212387.pdf      10.83Mb  PDF     Preview, non-printable (open to all)
631212387-MIT.pdf  10.83Mb  PDF     Full printable version (MIT only)
