Approximate string joins with abbreviations
Author(s)
Tao, Wenbo, Ph. D. Massachusetts Institute of Technology
DownloadFull printable version (4.544Mb)
Alternative title
ASJ with abbreviations
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Michael Stonebraker.
Terms of use
Metadata
Show full item recordAbstract
String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this thesis, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to the exponential time complexity. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. Cataloged from PDF version of thesis. Includes bibliographical references (pages 61-63).
Date issued
2018Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.