MIT Libraries: DSpace@MIT


Extracting fields from free-text

Author(s)
Cattori, Pedro
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Samuel Madden.
Terms of use
M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
The Field Extraction Library (FEL) provides functions for named-entity extraction within free text. FEL models the content structure of the specified named entities rather than relying on brittle, context-specific separator logic. Users specify the names of the fields they wish to extract, which determine the number of states for an underlying Hidden Markov Model. The observable emission set is pre-determined by FEL's tokenizer. Once the model topology is set, users provide training examples of the form: x = raw text, y = {field1: val1, field2: val2, ...}. FEL learns the parameters of the underlying Hidden Markov Model by maximum-likelihood estimation on the training examples. FEL is designed to operate on small, sparse training data. As a result, users can provide only a few (fewer than 10) training examples to bootstrap the model. FEL offers three iterative mechanisms for scaling data quality as users provide guidance through additional feedback: (1) accept more training examples, (2) create landmark states, and (3) bridge related states with state bridges. FEL detects ambiguities both in its internal model and in the extraction results to prompt users for more feedback. Once the model yields acceptable result quality, users can extract fields into a table for easy querying and exporting.
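The approach described in the abstract can be illustrated with a minimal sketch. This is not FEL's code or API: every name below (tokenize, emission, label, train, extract, the "_other" background state, the five emission symbols) is hypothetical. It assumes one HMM state per field plus a background state, labels training tokens by matching each field value against the raw text, estimates parameters by counting, and decodes with the Viterbi algorithm. Add-one smoothing is an added assumption so that unseen events keep nonzero probability; pure maximum-likelihood counts would assign them zero.

```python
import math
import re
from collections import Counter, defaultdict

OTHER = "_other"  # hypothetical background state for tokens outside every field
SYMBOLS = ["NUM", "CAP", "LOW", "PUNCT", "MIXED"]  # assumed fixed emission set

def tokenize(text):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def emission(token):
    # Map a token to one of SYMBOLS (an assumed stand-in for FEL's tokenizer).
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "LOW"
    return "PUNCT" if len(token) == 1 and not token.isalnum() else "MIXED"

def label(text, fields):
    # Tag each token with the field whose value it belongs to, else OTHER.
    tokens = tokenize(text)
    tags = [OTHER] * len(tokens)
    for name, value in fields.items():
        target = tokenize(value)
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                tags[i:i + len(target)] = [name] * len(target)
                break
    return tokens, tags

def log_probs(counts, keys):
    # Relative frequencies with add-one smoothing, in log space.
    total = sum(counts[k] for k in keys) + len(keys)
    return {k: math.log((counts[k] + 1) / total) for k in keys}

def train(examples, field_names):
    # Count start, transition, and emission events over the labeled examples.
    states = list(field_names) + [OTHER]
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for text, fields in examples:
        tokens, tags = label(text, fields)
        if not tags:
            continue
        start[tags[0]] += 1
        for tok, tag in zip(tokens, tags):
            emit[tag][emission(tok)] += 1
        for a, b in zip(tags, tags[1:]):
            trans[a][b] += 1
    return {
        "states": states,
        "start": log_probs(start, states),
        "trans": {s: log_probs(trans[s], states) for s in states},
        "emit": {s: log_probs(emit[s], SYMBOLS) for s in states},
    }

def extract(model, text):
    # Viterbi decoding: most likely state sequence, then group tokens by field.
    tokens = tokenize(text)
    if not tokens:
        return {}
    states = model["states"]
    obs = [emission(t) for t in tokens]
    score = {s: model["start"][s] + model["emit"][s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + model["trans"][p][s])
            score[s] = prev[best] + model["trans"][best][s] + model["emit"][s][o]
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    result = defaultdict(list)
    for tok, s in zip(tokens, path):
        if s != OTHER:
            result[s].append(tok)
    return {name: " ".join(toks) for name, toks in result.items()}

# Made-up training examples in the x = raw text, y = {field: value} form.
examples = [
    ("Alice Smith, 42", {"name": "Alice Smith", "age": "42"}),
    ("Bob Jones, 37", {"name": "Bob Jones", "age": "37"}),
]
model = train(examples, ["name", "age"])
print(extract(model, "Carol White, 29"))
```

The sketch covers only the basic train-then-extract loop on a handful of examples. The landmark states, state bridges, and ambiguity detection mentioned in the abstract are not modeled here.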
Description
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
 
Cataloged from PDF version of thesis.
 
Includes bibliographical references (pages 86-87).
 
Date issued
2016
URI
http://hdl.handle.net/1721.1/106077
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Graduate Theses
