MIT Libraries logoDSpace@MIT

Using Language Models to Understand Molecular Structures

Author(s)
Fan, Vincent K.
Advisor
Barzilay, Regina
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
In data-rich modalities such as text and images, large foundation models have demonstrated remarkable capabilities. In the life sciences, however, datasets of comparable scale are prohibitively costly to assemble, which creates a pressing need to leverage advances in language modelling to improve machine learning techniques for the life sciences. This thesis details research in two such directions: information extraction and text retrieval. Information extraction from chemistry literature is vital for constructing up-to-date reaction databases. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this thesis, I present OpenChemIE to address this challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: it first extracts relevant information from individual modalities with specialized neural models, then integrates the results via chemistry-informed algorithms to obtain a final list of reactions. To evaluate OpenChemIE, I annotated a challenging dataset of reaction schemes with R-groups, on which it achieves an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy of 64.3% when compared directly against the Reaxys chemical database. OpenChemIE is best suited to information extraction from organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. I also detail preliminary research on a tool that retrieves full-text documents relevant to specific protein sequences. I describe the dataset, which is currently under construction, as well as experiments that point to the promise of this approach.
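As an illustration of how an F1 score over extracted reactions can be computed, here is a minimal Python sketch. It is not the thesis's evaluation code: the `Reaction` representation, the helper names `make_reaction` and `precision_recall_f1`, the exact-match criterion on SMILES strings, and the toy reactions are all assumptions made for this example. The abstract does not state how predictions are matched against annotations.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation: a reaction is a pair (reactants, products),
# each an unordered set of SMILES strings. Exact string matching assumes the
# SMILES have already been canonicalized, which this sketch does not do.
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def make_reaction(reactants: Iterable[str], products: Iterable[str]) -> Reaction:
    """Build an order-insensitive, hashable reaction record."""
    return (frozenset(reactants), frozenset(products))


def precision_recall_f1(
    predicted: Set[Reaction], gold: Set[Reaction]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 using exact reaction matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {
    make_reaction(["CCO", "CC(=O)O"], ["CCOC(C)=O"]),
    make_reaction(["Brc1ccccc1"], ["Oc1ccccc1"]),
    make_reaction(["CC=O"], ["CCO"]),
}
predicted = {
    make_reaction(["CC(=O)O", "CCO"], ["CCOC(C)=O"]),  # matches: reactant order is ignored
    make_reaction(["Brc1ccccc1"], ["Oc1ccccc1"]),      # matches
    make_reaction(["CC=O"], ["CC(=O)O"]),              # wrong product, no match
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted reactions match the annotations, so precision, recall, and F1 all equal 2/3. A real evaluation would also need a policy for partial matches and for R-group resolution, neither of which is modelled here.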
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156795
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
