Automated Pipelines for Information Extraction from Semi-Structured Documents in Structured Format

Author(s)

Chu, Jung Soo

Advisor

Gupta, Amar

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Documents are one of the main tools for storing and communicating information, and a large amount of effort has gone into developing methods to parse information from them automatically. While many parts of this industry are automated, there are still scenarios where certain types of documents cannot be read by machine with high accuracy and throughput. The problem becomes especially difficult when the documents are semi-structured, in other words when their formats vary widely. With the significant leaps in optical character recognition, computer vision, and natural language processing, there has been great progress on this problem. In this paper, we propose two pipeline designs that use these newer techniques to extract information from semi-structured documents in a structured output format: a fully automated pipeline and a semi-automated pipeline. The fully automated pipeline has a region detection module, which finds the location of text blocks and table blocks regardless of the format of the document, and a region extraction module, which extracts information from each of the text and table blocks. The semi-automated pipeline, on the other hand, has a classification module and an extraction module. The classification module determines the format class of the input document, while the extraction module has templates that can parse information from the documents in each format class. We evaluate the two pipelines on four key metrics: accuracy, coverage, time efficiency, and scalability. The fully automated pipeline shows strong results in coverage and scalability, while the semi-automated pipeline succeeds in accuracy and time efficiency.
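
The control flow of the two pipelines described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: every name here (`Region`, `fully_automated`, `semi_automated`, and the `toy_*` helpers) is hypothetical, and the toy modules use simple string rules in place of the OCR, computer vision, and NLP components the abstract mentions. It only shows how the modules might be composed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical illustrations; the abstract specifies no API.


@dataclass
class Region:
    kind: str      # "text" or "table"
    content: str   # raw content of the detected block


def fully_automated(
    document: str,
    detect_regions: Callable[[str], List[Region]],
    extract_region: Callable[[Region], Dict[str, str]],
) -> Dict[str, str]:
    """Region detection followed by per-region extraction."""
    record: Dict[str, str] = {}
    for region in detect_regions(document):
        record.update(extract_region(region))
    return record


def semi_automated(
    document: str,
    classify: Callable[[str], str],
    templates: Dict[str, Callable[[str], Dict[str, str]]],
) -> Dict[str, str]:
    """Format classification followed by template-based extraction."""
    format_class = classify(document)
    if format_class not in templates:
        raise KeyError(f"no template for format class {format_class!r}")
    return templates[format_class](document)


# Toy stand-ins for the real modules (assumed, string-based rules only).
def toy_detect_regions(document: str) -> List[Region]:
    regions = []
    for block in document.split("\n\n"):
        block = block.strip()
        if block:
            kind = "table" if "|" in block else "text"
            regions.append(Region(kind, block))
    return regions


def toy_extract_region(region: Region) -> Dict[str, str]:
    sep = "|" if region.kind == "table" else ":"
    fields: Dict[str, str] = {}
    for line in region.content.splitlines():
        if sep in line:
            key, value = line.split(sep, 1)
            fields[key.strip()] = value.strip()
    return fields


def toy_classify(document: str) -> str:
    return "invoice" if document.startswith("INVOICE") else "unknown"


def toy_invoice_template(document: str) -> Dict[str, str]:
    # The template assumes the invoice number sits on the second line.
    second_line = document.splitlines()[1]
    return {"number": second_line.split(":", 1)[1].strip()}


SAMPLE = "INVOICE\nnumber: 42\n\nitem | widget\nqty | 3"
TEMPLATES = {"invoice": toy_invoice_template}

full_result = fully_automated(SAMPLE, toy_detect_regions, toy_extract_region)
semi_result = semi_automated(SAMPLE, toy_classify, TEMPLATES)
```

In this toy setup the semi-automated path raises `KeyError` for any document whose format class has no template, while the fully automated path still attempts extraction on any input. That contrast is in line with the coverage result reported in the abstract, although the sketch itself measures nothing.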

Date issued

2023-06

URI

https://hdl.handle.net/1721.1/151614

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses