Using Language Models to Understand Molecular Structures

Fan, Vincent K.

Advisor

Barzilay, Regina

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

In data-rich modalities such as text and images, large foundation models have demonstrated remarkable capabilities. In the life sciences, however, datasets of comparable scale are prohibitively costly to assemble, which motivates leveraging advances in language modelling to improve machine learning techniques for the life sciences. This thesis details research in two such directions: information extraction and text retrieval. Information extraction from the chemistry literature is vital for constructing up-to-date reaction databases. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this thesis, I present OpenChemIE to address this challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities with specialized neural models, and then integrating the results via chemistry-informed algorithms to obtain a final list of reactions. To evaluate OpenChemIE, I meticulously annotated a challenging dataset of reaction schemes with R-groups, on which it achieves an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is best suited to information extraction from organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into SMILES format. Additionally, I detail preliminary research on developing a tool to retrieve full-text documents that are relevant to specific protein sequences. I describe the dataset, which is currently under construction, as well as experiments that point to the promise of this approach.

Date issued

2024-05

URI

https://hdl.handle.net/1721.1/156795

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses