Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection
Author(s)
Nahnsen, Thade; Uzuner, Ozlem; Katz, Boris
Abstract
We present a system to determine content similarity of documents. More specifically, our goal is to identify book chapters that are translations of the same original chapter; this task requires identification of not only the different topics in the documents but also the particular flow of these topics. We experiment with different representations employing n-grams of lexical chains and test these representations on a corpus of approximately 1000 chapters gathered from books with multiple parallel translations. Our representations include the cosine similarity of attribute vectors of n-grams of lexical chains, the cosine similarity of tf*idf-weighted keywords, and the cosine similarity of unweighted lexical chains (unigrams of lexical chains) as well as multiplicative combinations of the similarity measures produced by these approaches. Our results identify fourgrams of unordered lexical chains as a particularly useful representation for text similarity evaluation.
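The abstract's comparison of unordered n-grams of lexical chains can be illustrated with a minimal sketch. It assumes each chapter has already been reduced to a sequence of lexical-chain labels in reading order; the chain extraction step, the exact attribute vectors, and any weighting used in the report are not described in this record, so the labels and the function names (unordered_ngrams, cosine, combined) are hypothetical.

```python
from collections import Counter
from math import sqrt


def unordered_ngrams(chain_ids, n):
    """Count n-grams over a chain sequence, ignoring order inside each window."""
    return Counter(
        tuple(sorted(chain_ids[i:i + n]))
        for i in range(len(chain_ids) - n + 1)
    )


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined(scores):
    """Multiplicative combination of several similarity scores."""
    product = 1.0
    for score in scores:
        product *= score
    return product


# Hypothetical chain sequences for two translations of the same chapter.
chapter_a = ["sea", "ship", "storm", "captain", "ship", "sea"]
chapter_b = ["ship", "sea", "captain", "storm", "sea", "ship"]

# Four-grams of unordered chains, and unigrams of chains for comparison.
fourgram_sim = cosine(unordered_ngrams(chapter_a, 4), unordered_ngrams(chapter_b, 4))
unigram_sim = cosine(unordered_ngrams(chapter_a, 1), unordered_ngrams(chapter_b, 1))
overall = combined([fourgram_sim, unigram_sim])
```

Sorting the labels inside each window is what makes the n-grams unordered: two translations that mention the same topics in a slightly different local order still share n-grams, while the sliding window preserves the larger flow of topics across the chapter.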
Date issued
2005-05-19
Other identifiers
MIT-CSAIL-TR-2005-034
AIM-2005-017
Series/Report no.
Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
Keywords
AI, Natural Language Processing, N-grams, Text Similarity, Lexical Chains