Leveraging Large Language Models (LLMs) for Automated Extraction and Processing of Complex Ordering Forms

Daqqah, Bilal H.

Advisor

Gupta, Amar

Terms of use

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/

Abstract

Data extraction from business documents is a critical but under-exploited area that could unlock significant value from vast document archives. Traditional methods that rely on manual intervention or outsourcing are inefficient, error-prone, and costly, and commercial deep learning-based and OCR solutions still struggle with highly unstructured documents. This thesis explores the use of Large Language Models (LLMs) to automate the extraction and processing of ordering forms and procurement documents, in collaboration with SiliconExperts. These documents contain complex codes used in electronic component procurement, which guide the manufacture and specification of parts. We developed an end-to-end pipeline comprising four key modules: Page Classification, OCR and Table Extraction, LLM Inference, and Code Combination Generation. Two approaches to key-value extraction were compared: one-shot prompting with in-context learning using GPT-4 Turbo with Vision (GPT-4V), and a fine-tuned GPT-3.5 model. The GPT-4V approach performed better. The pipeline generated correct code combinations with high accuracy, although data quality issues affected precision and performance. This research highlights the potential of LLMs to transform document processing workflows, bridging the gap between academic advancements and practical business applications.
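The abstract does not spell out how the Code Combination Generation module works. As a rough illustration only, the sketch below assumes that the extraction step yields an ordered list of fields, each with a set of allowed option codes, and that a full ordering code is the concatenation of one option per field. The function name, the data layout, and the example values are all hypothetical and are not taken from the thesis.

```python
from itertools import product


def generate_code_combinations(fields):
    """Return every ordering code formed by picking one option per field.

    `fields` is an ordered list of (field_name, option_codes) pairs.
    This layout is an assumption standing in for the key-value output
    of the LLM inference step; the thesis may structure it differently.
    """
    option_lists = [codes for _, codes in fields]
    # Cartesian product: one option from each field, in field order.
    return ["".join(combo) for combo in product(*option_lists)]


# Made-up example fields for illustration, not data from the thesis.
example_fields = [
    ("series", ["ABC"]),
    ("voltage", ["05", "12"]),
    ("package", ["T", "R"]),
]

codes = generate_code_combinations(example_fields)
for code in codes:
    print(code)
```

With one series, two voltage options, and two package options, this enumerates four candidate codes. Real ordering forms often include constraints between fields (for example, an option valid only for certain packages), which a plain Cartesian product does not capture.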

Date issued

2024-05

URI

https://hdl.handle.net/1721.1/156567

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses