Human Document Classification Using Bags of Words

Title: Human Document Classification Using Bags of Words
Author: Wolf, Florian; Poggio, Tomaso; Sinha, Pawan
Other Contributors: Center for Biological and Computational Learning (CBCL)
Advisor: Tomaso Poggio
Issue Date: 2006-08-09
Abstract: Humans are remarkably adept at classifying text documents into categories. For instance, while reading a news story, we are rapidly able to assess whether it belongs to the domain of finance, politics or sports. Automating this task would have applications for content-based search or filtering of digital documents. To this end, it is interesting to investigate the nature of information humans use to classify documents. Here we report experimental results suggesting that this information might, in fact, be quite simple. Using a paradigm of progressive revealing, we determined classification performance as a function of number of words. We found that subjects are able to achieve similar classification accuracy with or without syntactic information across a range of passage sizes. These results have implications for models of human text-understanding and also allow us to estimate what level of performance we can expect, in principle, from a system without requiring a prior step of complex natural language processing.
URI: http://hdl.handle.net/1721.1/33789
Other Identifiers: MIT-CSAIL-TR-2006-054; CBCL-263
Series/Report no.: Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
Keywords: text classification

Files in this item

Files Size Format
MIT-CSAIL-TR-2006-054.ps 1.617Mb Postscript
MIT-CSAIL-TR-2006-054.pdf 134.0Kb PDF
