Show simple item record

dc.contributor.advisor: David R. Karger. [en_US]
dc.contributor.author: Shen, Yuan Kui [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. [en_US]
dc.date.accessioned: 2007-01-10T16:47:43Z
dc.date.available: 2007-01-10T16:47:43Z
dc.date.copyright: 2005 [en_US]
dc.date.issued: 2006 [en_US]
dc.identifier.uri: http://hdl.handle.net/1721.1/35609
dc.description: Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2006. [en_US]
dc.description: Includes bibliographical references (p. 149-152). [en_US]
dc.description.abstract: As the amount of information on the World Wide Web grows, there is an increasing demand for software that can automatically process and extract information from web pages. Despite the fact that the underlying data on most web pages is structured, we cannot automatically process these web sites/pages as structured data. We need robust technologies that can automatically understand human-readable formatting and induce the underlying data structures. In this thesis, we are focused on solving a specific facet of this general unsupervised web information extraction problem. Structured data can appear in diverse forms from lists to trees to even semi-structured graphs. However, much of the information on the web appears in a flat format we call "records". In this work, we will describe a system, MURIEL, that uses supervised and unsupervised learning techniques to effectively extract records from webpages. [en_US]
dc.description.statementofresponsibility: by Yuan Kui Shen. [en_US]
dc.format.extent: 152 p. [en_US]
dc.format.extent: 10456463 bytes
dc.format.extent: 10997527 bytes
dc.format.mimetype: application/pdf
dc.format.mimetype: application/pdf
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582
dc.subject: Electrical Engineering and Computer Science. [en_US]
dc.title: Automatic record extraction for the World Wide Web [en_US]
dc.type: Thesis [en_US]
dc.description.degree: S.M. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. [en_US]
dc.identifier.oclc: 75289843 [en_US]

