A Hybrid Approach for Key-Value Extraction from Technical Specification Documents
Author(s)
Lee, Samuel S.
Advisor
Gupta, Amar
Abstract
As the number of documents processed by businesses across the world increases daily, the demand for streamlined and automated document processing methods grows. However, commercial methods for information extraction from documents do not generalize well across different document formats, as each solution is tailored to specific types of documents. This thesis provides an overview of a hybrid document processing pipeline designed to extract key-value pairs from technical specification documents with high accuracy. Two iterations of the pipeline are introduced, both employing rule-based methods and machine learning to cover a variety of document types. The first is an earlier iteration that extracts information from a simpler collection of documents, and the second is the current iteration, designed to handle a much larger dataset containing more complex documents. Lastly, the initial stages of a module designed for key-value extraction from a specific type of technical specification document are also proposed.
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology