dc.contributor.advisor | David Spencer and Regina Barzilay. | en_US |
dc.contributor.author | Seshasai, Shreyes | en_US |
dc.contributor.other | Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. | en_US |
dc.date.accessioned | 2010-03-25T15:03:11Z | |
dc.date.available | 2010-03-25T15:03:11Z | |
dc.date.copyright | 2009 | en_US |
dc.date.issued | 2009 | en_US |
dc.identifier.uri | http://hdl.handle.net/1721.1/53116 | |
dc.description | Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. | en_US |
dc.description | Includes bibliographical references (p. 75-77). | en_US |
dc.description.abstract | Knowledge of near duplicate documents can be advantageous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy by utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines. | en_US |
dc.description.statementofresponsibility | by Shreyes Seshasai. | en_US |
dc.format.extent | 77 p. | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Massachusetts Institute of Technology | en_US |
dc.rights | M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. | en_US |
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
dc.subject | Electrical Engineering and Computer Science. | en_US |
dc.title | Efficient near duplicate document detection for specialized corpora | en_US |
dc.type | Thesis | en_US |
dc.description.degree | M.Eng. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
dc.identifier.oclc | 503131434 | en_US |