Efficient near duplicate document detection for specialized corpora
Author(s)
Seshasai, Shreyes
DownloadFull printable version (8.515Mb)
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
David Spencer and Regina Barzilay.
Terms of use
Metadata
Show full item recordAbstract
Knowledge of near duplicate documents can be adventagous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, and demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy through utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines.
Description
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 75-77).
Date issued
2009Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.