Automated Pipelines for Information Extraction from Semi-Structured Documents in Structured Format

Chu, Jung Soo

dc.contributor.advisor	Gupta, Amar
dc.contributor.author	Chu, Jung Soo
dc.date.accessioned	2023-07-31T19:52:54Z
dc.date.available	2023-07-31T19:52:54Z
dc.date.issued	2023-06
dc.date.submitted	2023-06-06T16:35:41.762Z
dc.identifier.uri	https://hdl.handle.net/1721.1/151614
dc.description.abstract	As documents are one of the main tools for storing and communicating information, there has been a large amount of effort towards developing methods to parse information from them automatically. While many parts of this industry are automated, there are still scenarios where certain types of documents cannot be read by machine with high accuracy and throughput. The problem becomes especially difficult when the documents are semi-structured, or in other words have widely varying formats. With the significant leaps in optical character recognition, computer vision, and natural language processing, there has been great progress on this problem. In this paper, we propose two pipeline designs that utilize these newer techniques to extract information from semi-structured documents in a structured output format. The two pipelines are the fully automated pipeline and the semi-automated pipeline. The fully automated pipeline has a region detection module that finds the location of text blocks and table blocks regardless of the format of the document, and a region extraction module that extracts information from each of the text and table blocks. The semi-automated pipeline, on the other hand, has a classification module and an extraction module. The classification module determines the format class of the input document, while the extraction module has templates that can parse information from the documents in each format class. We evaluate the two pipelines on four key metrics: accuracy, coverage, time efficiency, and scalability. The fully automated pipeline shows strong results in coverage and scalability, while the semi-automated pipeline succeeds in accuracy and time efficiency.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Automated Pipelines for Information Extraction from Semi-Structured Documents in Structured Format
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science
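
The classify-then-extract structure that the abstract describes for the semi-automated pipeline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the format classes (`invoice_a`, `invoice_b`), the marker strings, the field names, and the use of regular expressions as templates are all hypothetical.

```python
import re

# Hypothetical templates: one set of field patterns per format class.
# These classes and patterns are illustrative, not the thesis's templates.
TEMPLATES = {
    "invoice_a": {
        "invoice_number": re.compile(r"Invoice No\.\s*(\S+)"),
        "total": re.compile(r"Total:\s*\$?([\d.,]+)"),
    },
    "invoice_b": {
        "invoice_number": re.compile(r"INV#\s*(\S+)"),
        "total": re.compile(r"Amount Due\s+([\d.,]+)"),
    },
}

# Hypothetical marker string that identifies each format class.
SIGNATURES = {"invoice_a": "Invoice No.", "invoice_b": "INV#"}


def classify(text):
    """Classification module: return the first class whose marker appears, else None."""
    for format_class, marker in SIGNATURES.items():
        if marker in text:
            return format_class
    return None


def extract(text):
    """Extraction module: apply the template of the detected format class."""
    format_class = classify(text)
    if format_class is None:
        return None  # no template exists for this format
    fields = {}
    for name, pattern in TEMPLATES[format_class].items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return {"format_class": format_class, "fields": fields}


if __name__ == "__main__":
    print(extract("Invoice No. 123\nTotal: $45.60"))
    print(extract("A document in an unseen format"))
```

In this sketch a document whose format has no template yields `None`, which is one plausible reading of why a template-based design would be weaker on coverage than a format-agnostic one.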

Files in this item

Name: chu-jschu99-meng-eecs-2023-the ...
Size: 3.330Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
