Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra

Author(s)

Goldman, Samuel Lucas

Advisor

Coley, Connor W.

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Small molecule metabolites mediate myriad biological and environmental phenomena across host-microbiome interactions, plant chemistry, cancer biology, and various other processes. Mass spectrometry is often used as an analytical technique to investigate the small molecules present in a sample, measuring both their masses and fragmentation spectra. However, the complexity and high dimensionality of spectral data make it difficult to identify unknown metabolites and their roles, with a large majority of detected metabolites remaining unidentified in public data. This thesis proposes a suite of new computational methodologies for higher-accuracy annotation of small molecule metabolites from mass spectrometry data that integrate chemistry-informed priors with modern deep learning advancements. I begin by decomposing the metabolite annotation pipeline into four key tasks well suited to supervised deep learning: (A) molecular formula prediction, (B) spectrum-to-molecule property prediction, (C) molecule-to-spectrum prediction, and (D) de novo generation of molecular candidates. To address these tasks, I first introduce the Molecular Formula Transformer to predict molecular property fingerprints from spectra by changing the tandem mass spectrum input basis from scalar mass values to plausible molecular formula annotations. This method is then extended to an energy-based-model formulation to predict the molecular formula of an unknown molecule from its tandem mass spectrum. Following these initial efforts to learn better representations of fragmentation spectra, I develop new neural networks capable of generating fragmentation spectra from small molecules through two-step autoregressive modeling. I show how this can be accomplished by generating either molecular formula peaks or molecular fragment peaks. Downstream of metabolite prediction, a separate key question is identifying the function of discovered small molecules. To this end, I study and probe the ability to model enzyme-substrate compatibility from high-throughput screens within a single enzyme family. In a final collaborative work, I further demonstrate how a new method for epistemic uncertainty quantification, evidential deep learning, can be applied to molecular property prediction. Altogether, this work outlines a path forward to a fully neuralized pipeline for the high-throughput identification of small molecule metabolites and their functions.

Date issued

2024-02

URI

https://hdl.handle.net/1721.1/154037

Department

Massachusetts Institute of Technology. Computational and Systems Biology Program

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses