dc.contributor.advisor: Gupta, Amar
dc.contributor.author: Chu, Jung Soo
dc.date.accessioned: 2023-07-31T19:52:54Z
dc.date.available: 2023-07-31T19:52:54Z
dc.date.issued: 2023-06
dc.date.submitted: 2023-06-06T16:35:41.762Z
dc.identifier.uri: https://hdl.handle.net/1721.1/151614
dc.description.abstract: As documents are one of the main tools for storing and communicating information, a large amount of effort has gone into developing methods to parse information from them automatically. While many parts of this industry are automated, there are still scenarios where certain types of documents cannot be read by machine with high accuracy and throughput. This becomes especially difficult when the documents are semi-structured, or in other words have widely varying formats. With the significant leaps in optical character recognition, computer vision, and natural language processing, there has been great progress on this problem. In this paper, we propose two pipeline designs that utilize these newer techniques to extract information from semi-structured documents in a structured output format. The two pipelines are the fully automated pipeline and the semi-automated pipeline. The fully automated pipeline has a region detection module that finds the location of text blocks and table blocks regardless of the format of the document, and a region extraction module that extracts information from each of the text and table blocks. The semi-automated pipeline, on the other hand, has a classification module and an extraction module. The classification module determines the format class of the input document, while the extraction module has templates that can parse information from the documents in each format class. We evaluate the two pipelines on four key metrics: accuracy, coverage, time efficiency, and scalability. The fully automated pipeline shows strong results in coverage and scalability, while the semi-automated pipeline excels in accuracy and time efficiency.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Automated Pipelines for Information Extraction from Semi-Structured Documents in Structured Format
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

