Efficient near duplicate document detection for specialized corpora

Seshasai, Shreyes

Author(s)

Seshasai, Shreyes

DownloadFull printable version (8.515Mb)

Other Contributors

Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.

Advisor

David Spencer and Regina Barzilay.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

Knowledge of near duplicate documents can be adventagous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, and demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy through utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines.

Description

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.

Includes bibliographical references (p. 75-77).

Date issued

2009

URI

http://hdl.handle.net/1721.1/53116

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses