
dc.contributor.author: Nahnsen, Thade
dc.contributor.author: Uzuner, Ozlem
dc.contributor.author: Katz, Boris
dc.date.accessioned: 2005-12-22T02:29:37Z
dc.date.available: 2005-12-22T02:29:37Z
dc.date.issued: 2005-05-19
dc.identifier.other: MIT-CSAIL-TR-2005-034
dc.identifier.other: AIM-2005-017
dc.identifier.uri: http://hdl.handle.net/1721.1/30546
dc.description.abstract: We present a system to determine content similarity of documents. More specifically, our goal is to identify book chapters that are translations of the same original chapter; this task requires identification of not only the different topics in the documents but also the particular flow of these topics. We experiment with different representations employing n-grams of lexical chains and test these representations on a corpus of approximately 1000 chapters gathered from books with multiple parallel translations. Our representations include the cosine similarity of attribute vectors of n-grams of lexical chains, the cosine similarity of tf*idf-weighted keywords, and the cosine similarity of unweighted lexical chains (unigrams of lexical chains) as well as multiplicative combinations of the similarity measures produced by these approaches. Our results identify fourgrams of unordered lexical chains as a particularly useful representation for text similarity evaluation.
dc.format.extent: 9 p.
dc.format.extent: 17827888 bytes
dc.format.extent: 7011726 bytes
dc.format.mimetype: application/postscript
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.relation.ispartofseries: Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
dc.subject: AI
dc.subject: Natural Language Processing
dc.subject: N-grams
dc.subject: Text Similarity
dc.subject: Lexical Chains
dc.title: Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection
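
The abstract describes comparing documents by the cosine similarity of n-grams of lexical chains. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes each chapter has already been reduced to a sequence of lexical-chain labels (the labels here are made up for illustration), and it interprets "unordered" n-grams as windows whose labels are sorted before counting, which is an assumption rather than something the record states. The function names chain_ngrams and cosine are hypothetical.

```python
from collections import Counter
from math import sqrt


def chain_ngrams(chain_ids, n, ordered=False):
    """Count n-grams over a sequence of lexical-chain labels.

    With ordered=False, labels inside each window are sorted, so two
    windows holding the same chains in a different order count as the
    same n-gram (an assumed reading of "unordered").
    """
    grams = Counter()
    for i in range(len(chain_ids) - n + 1):
        window = tuple(chain_ids[i:i + n])
        if not ordered:
            window = tuple(sorted(window))
        grams[window] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no defined direction
    return dot / (norm_a * norm_b)


# Hypothetical chain-label sequences for two chapters.
chapter_a = ["war", "family", "travel", "war", "money"]
chapter_b = ["family", "war", "travel", "war", "money"]

unigram_sim = cosine(chain_ngrams(chapter_a, 1), chain_ngrams(chapter_b, 1))
fourgram_sim = cosine(chain_ngrams(chapter_a, 4), chain_ngrams(chapter_b, 4))

# One possible multiplicative combination of two similarity measures.
combined_sim = unigram_sim * fourgram_sim
```

In this toy input the two chapters contain the same chains, so the unigram similarity is close to 1, while the unordered four-gram similarity is roughly 0.5 because only one of the two windows matches. That difference illustrates how n-grams of chains capture the flow of topics as well as their presence.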


