DSpace@MIT (MIT Libraries)

Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra

Author(s)
Goldman, Samuel Lucas
Advisor
Coley, Connor W.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Small molecule metabolites mediate myriad biological and environmental phenomena across host-microbiome interactions, plant chemistry, cancer biology, and various other processes. Mass spectrometry is often used as an analytical technique to investigate the small molecules present in a sample, measuring both their masses and fragmentation spectra. However, the complexity and high dimensionality of spectral data make it difficult to identify unknown metabolites and their roles, and a large majority of detected metabolites remain unidentified in public data. This thesis proposes a suite of new computational methodologies for higher-accuracy annotation of small molecule metabolites from mass spectrometry data that integrate chemistry-informed priors with modern deep learning advancements. I begin by decomposing the metabolite annotation pipeline into four key tasks well suited to supervised deep learning: (A) molecular formula prediction, (B) spectrum-to-molecule property prediction, (C) molecule-to-spectrum prediction, and (D) de novo generation of molecular candidates. To address these tasks, I first introduce the Molecular Formula Transformer, which predicts molecular property fingerprints from spectra by changing the tandem mass spectrum input basis from scalar mass values to plausible molecular formula annotations. This method is then extended to an energy-based-model formulation to predict the molecular formula of an unknown molecule from its tandem mass spectrum. Following these initial efforts to learn better representations of fragmentation spectra, I develop new neural networks capable of generating fragmentation spectra from small molecules through two-step autoregressive modeling. I show how this can be accomplished by generating either molecular formula peaks or molecular fragment peaks. Downstream of metabolite prediction, a separate key question is to identify the function of discovered small molecules.
To this end, I study and probe the ability to model enzyme-substrate compatibility from high-throughput screens within a single enzyme family. In a final collaborative work, I further demonstrate how a new method for epistemic uncertainty quantification, evidential deep learning, can be applied to molecular property prediction. Altogether, this work outlines a path forward to a fully neuralized pipeline for the high-throughput identification of small molecule metabolites and their functions.
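
The change of input basis mentioned in the abstract, from scalar peak masses to plausible molecular formula annotations, can be illustrated with a small sketch. This is not the thesis's implementation: the function names (`formula_mass`, `candidate_subformulas`), the mass tolerance, and the brute-force enumeration are all assumptions made for illustration, and the sketch works with neutral monoisotopic masses, ignoring adducts and the electron mass.

```python
from itertools import product

# Monoisotopic masses in Da for a few common elements.
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}


def formula_mass(formula):
    """Neutral monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO_MASS[element] * count for element, count in formula.items())


def candidate_subformulas(parent, peak_mass, tol=0.005):
    """Hypothetical helper: list every sub-formula of `parent` whose mass
    lies within `tol` Da of `peak_mass`.

    A sub-formula uses at most as many atoms of each element as the parent.
    Brute-force enumeration is fine for small molecules only.
    """
    elements = sorted(parent)
    count_ranges = [range(parent[element] + 1) for element in elements]
    hits = []
    for counts in product(*count_ranges):
        sub = {e: c for e, c in zip(elements, counts) if c > 0}
        if not sub:
            continue  # skip the empty formula
        if abs(formula_mass(sub) - peak_mass) <= tol:
            hits.append(sub)
    return hits


# Example: annotate one peak of a spectrum whose parent formula is C2H6O.
parent = {"C": 2, "H": 6, "O": 1}
annotations = candidate_subformulas(parent, peak_mass=31.0184)
print(annotations)
```

In a sketch like this, each peak is represented by its candidate formulas rather than by a bare mass value, so a downstream model could in principle consume element counts as features. How the thesis encodes and consumes these annotations is not described in the abstract.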
Date issued
2024-02
URI
https://hdl.handle.net/1721.1/154037
Department
Massachusetts Institute of Technology. Computational and Systems Biology Program
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted.