Human Document Classification Using Bags of Words
Author(s)
Wolf, Florian; Poggio, Tomaso; Sinha, Pawan
Other Contributors
Center for Biological and Computational Learning (CBCL)
Advisor
Tomaso Poggio
Abstract
Humans are remarkably adept at classifying text documents into categories. For instance, while reading a news story, we can rapidly assess whether it belongs to the domain of finance, politics, or sports. Automating this task would have applications in content-based search and filtering of digital documents. To this end, it is interesting to investigate the nature of the information humans use to classify documents. Here we report experimental results suggesting that this information might, in fact, be quite simple. Using a paradigm of progressive revealing, we determined classification performance as a function of the number of words. We found that subjects achieve similar classification accuracy with or without syntactic information across a range of passage sizes. These results have implications for models of human text understanding and also allow us to estimate what level of performance we can expect, in principle, from a system that does not require a prior step of complex natural language processing.
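To make the two conditions in the abstract concrete, the following is a minimal sketch, not the authors' experimental code. It assumes that progressive revealing means showing the first n words of a passage, and that "without syntactic information" means showing those same words in scrambled order; the abstract does not spell out either detail. The function names and the example passage are hypothetical.

```python
from collections import Counter
import random


def bag_of_words(words):
    """Count each word, ignoring order (and therefore syntax)."""
    return Counter(w.lower() for w in words)


def reveal(words, n):
    """Return the first n words: one step of progressive revealing (assumed)."""
    return words[:n]


def scramble(words, seed=0):
    """Return the same words in shuffled order (assumed no-syntax condition)."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical example passage, not taken from the report.
passage = "the central bank raised interest rates to curb inflation".split()

for n in (3, 6, 9):
    ordered = reveal(passage, n)
    unordered = scramble(ordered)
    # Both conditions carry exactly the same bag of words.
    assert bag_of_words(ordered) == bag_of_words(unordered)
```

Under these assumptions, the ordered and scrambled conditions differ only in word order, so any classifier that sees just the word counts receives identical input in both.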
Date issued
2006-08-09
Other identifiers
MIT-CSAIL-TR-2006-054
CBCL-263
Series/Report no.
Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
Keywords
text classification