MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Automated Pipelines for Information Extraction from Semi-Structured Documents in Structured Format

Author(s)
Chu, Jung Soo
Thumbnail
DownloadThesis PDF (3.330Mb)
Advisor
Gupta, Amar
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Metadata
Show full item record
Abstract
As documents are one of the main tools for storing and communicating information, there have been a large amount of eff orts towards developing methods to parse information from them automatically. While many parts of this industry are automated, there are still scenarios where certain types of documents cannot be read by machine with high accuracy and throughput. It becomes especially more difficult when the documents are semi-structured, or in other words have widely varying formats. With the significant leaps in optical character recognition, computer vision, and natural language processing, there have been great progress towards this problem. In this paper, we propose two pipeline designs that utilize these newer techniques to extract information from semi-structured documents in a structured output format. The two pipelines are the fully automated pipeline and semi automated pipeline. The fully automated pipeline has a region detection module that finds the location of text blocks and table blocks regardless of the format of the document and a region extraction module that extracts information from each of the text and table blocks. The semi automated pipeline on the other hand has a classification module and an extraction module. The classification module determines the format class of the input document, while the extraction module has templates that can parse information from the documents in each format class. We evaluate the two pipelines in four key metrics: accuracy, coverage, time efficiency, and scalability. The fully automated pipeline shows a strong result in coverage and scalability, while the semi automated pipeline succeeds in accuracy and time efficiency.
Date issued
2023-06
URI
https://hdl.handle.net/1721.1/151614
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.