From Structured Document To Structured Knowledge
Author(s)
Qian, Yujie
DownloadThesis PDF (15.69Mb)
Advisor
Barzilay, Regina
Terms of use
Metadata
Show full item recordAbstract
Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents.
First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text.
Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose Rxn- Scribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams.
Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs.
Date issued
2023-06Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology