Human Document Classification Using Bags of Words

Wolf, Florian; Poggio, Tomaso; Sinha, Pawan

Download: MIT-CSAIL-TR-2006-054.ps (1579 Kb)

Other Contributors

Center for Biological and Computational Learning (CBCL)

Advisor

Tomaso Poggio

Abstract

Humans are remarkably adept at classifying text documents into categories. For instance, while reading a news story, we are rapidly able to assess whether it belongs to the domain of finance, politics or sports. Automating this task would have applications for content-based search or filtering of digital documents. To this end, it is interesting to investigate the nature of information humans use to classify documents. Here we report experimental results suggesting that this information might, in fact, be quite simple. Using a paradigm of progressive revealing, we determined classification performance as a function of number of words. We found that subjects are able to achieve similar classification accuracy with or without syntactic information across a range of passage sizes. These results have implications for models of human text-understanding and also allow us to estimate what level of performance we can expect, in principle, from a system without requiring a prior step of complex natural language processing.
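
As a rough illustration of the two ideas named in the abstract, the sketch below builds a bag-of-words representation (word counts with order discarded) and reveals a passage a few words at a time. It is not the authors' experimental code: the example sentence and the helper names `bag_of_words` and `reveal` are assumptions made for illustration, and the report's actual stimuli and procedure may differ.

```python
from collections import Counter
import random


def bag_of_words(words):
    """Count word occurrences; word order (and so syntax) is discarded."""
    return Counter(w.lower() for w in words)


def reveal(words, n):
    """Return the first n words of a passage (hypothetical revealing step)."""
    return words[:n]


# Hypothetical example passage, not taken from the report.
passage = "the central bank raised interest rates as markets fell".split()

# Scramble the word order to remove syntactic information.
scrambled = passage[:]
random.Random(0).shuffle(scrambled)

# Both versions contain the same words, so their bags are identical.
same_bag = bag_of_words(passage) == bag_of_words(scrambled)

# Progressive revealing: the bag grows as more words become visible.
for n in (3, 6, len(passage)):
    visible = reveal(passage, n)
    print(n, dict(bag_of_words(visible)))
```

A classifier that sees only such counts has no access to syntax, which is the sense in which the abstract compares performance with and without syntactic information.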

Date issued

2006-08-09

URI

http://hdl.handle.net/1721.1/33789

Other identifiers

MIT-CSAIL-TR-2006-054

CBCL-263

Series/Report no.

Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory

Keywords

text classification

Collections

CSAIL Technical Reports (July 1, 2003 - present)