Active duplicate detection with Bayesian nonparametric models
Author(s)Matsakis, Nicholas E. (Nicholas Elias), 1976-
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Leslie Pack Kaelbling.
MetadataShow full item record
When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, and so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. This task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries are used to guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering. It also present experimental results demonstrating the effectiveness of this method in a variety of data domains.
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 129-137).
DepartmentMassachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.