Machine learning on Web documents

Title: Machine learning on Web documents
Author: Shih, Lawrence Kai, 1974-
Other Contributors: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor: David R. Karger.
Department: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Publisher: Massachusetts Institute of Technology
Issue Date: 2004
Abstract: The Web is a tremendous source of information: so tremendous that it becomes difficult for human beings to select meaningful information without support. We discuss tools that help people deal with web information, by, for example, blocking advertisements, recommending interesting news, and automatically sorting and compiling documents. We adapt and create machine learning algorithms for use with the Web's distinctive structures: large-scale, noisy, varied data with potentially rich, human-oriented features. We adapt two standard classification algorithms, the slow but powerful support vector machine and the fast but inaccurate Naive Bayes, to make them more effective for the Web. The support vector machine, which cannot currently handle the large amount of Web data potentially available, is sped up by "bundling" the classifier inputs to reduce the input size. The Naive Bayes classifier is improved through a series of three techniques aimed at fixing some of the severe, inaccurate assumptions Naive Bayes makes. Classification can also be improved by exploiting the Web's rich, human-oriented structure, including the visual layout of links on a page and the URL of a document. These "tree-shaped features" are placed in a Bayesian mutation model and learning is accomplished with a fast, online learning algorithm for the model. These new methods are applied to a personalized news recommendation tool, "the Daily You." The results of a 176-person user study of news preferences indicate that the new Web-centric techniques outperform classifiers that use traditional text algorithms and features. We also show that our methods produce an automated ad-blocker that performs as well as a hand-coded commercial ad-blocker.
Description: Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (leaves 111-115).
URI: http://hdl.handle.net/1721.1/28331
Keywords: Electrical Engineering and Computer Science.

Files in this item

File                                    Size      Format
Preview, non-printable (open to all)    11.94Mb   PDF
Full printable version (MIT only)       11.95Mb   PDF
