Show simple item record

dc.contributor.advisorDavid Spencer and Regina Barzilay.en_US
dc.contributor.authorSeshasai, Shreyesen_US
dc.contributor.otherMassachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.en_US
dc.date.accessioned2010-03-25T15:03:11Z
dc.date.available2010-03-25T15:03:11Z
dc.date.copyright2009en_US
dc.date.issued2009en_US
dc.identifier.urihttp://hdl.handle.net/1721.1/53116
dc.descriptionThesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.en_US
dc.descriptionIncludes bibliographical references (p. 75-77).en_US
dc.description.abstractKnowledge of near duplicate documents can be adventagous to search engines, even those that only cover a small enterprise or specialized corpus. In this thesis, we investigate improvements to simhash, a signature-based method which can be used to efficiently detect near duplicate documents. We implement simhash in its original form, and demonstrate its effectiveness on a small corpus of newspaper articles, and improve its accuracy through utilizing external metadata and altering its feature selection approach. We also demonstrate the fragility of simhash towards changes in the weighting of features by applying novel changes to the weights. As motivation for performing this near duplicate detection, we discuss the impact it can have on search engines.en_US
dc.description.statementofresponsibilityby Shreyes Seshasai.en_US
dc.format.extent77 p.en_US
dc.language.isoengen_US
dc.publisherMassachusetts Institute of Technologyen_US
dc.rightsM.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.en_US
dc.rights.urihttp://dspace.mit.edu/handle/1721.1/7582en_US
dc.subjectElectrical Engineering and Computer Science.en_US
dc.titleEfficient near duplicate document detection for specialized corporaen_US
dc.typeThesisen_US
dc.description.degreeM.Eng.en_US
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc503131434en_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record