Application of foundation models for molecular representation in cancer drug discovery and precision oncology
Author(s)
Khokhlov, Khrystofor
Advisor
Zhang, Bin
Getz, Gad
Abstract
Drug discovery is a resource-intensive and time-consuming process, often requiring decades of effort and substantial financial investment, with a high risk of failure. Despite advances in high-throughput screening technologies, the size of chemical space presents a significant challenge: it is not feasible to experimentally screen all potential drug-like molecules. Most commercially available chemical libraries consist of molecules that are synthesized on demand from pre-existing building blocks, further limiting the exploration of novel chemotypes. This thesis explores whether drug discovery can be accelerated by using deep learning (DL) models to identify promising hit candidates and improve the prediction of drug response in cancer. Developing cancer drugs that are effective on a predictable set of targets remains a major challenge. We develop a DL model intended to identify potentially novel cancer drug chemotypes and to reliably predict drug response on cancer cell line targets. Building on recent progress in transformer-based architectures and graph neural networks, we use molecular language models, graph models, and cell foundation models to embed both molecular and genomic data into low-dimensional spaces, and then apply standard machine learning (ML) tools in these spaces to predict the efficacy of the molecules in particular cell lines. We use the large-scale drug repurposing and oncology datasets from the PRISM project at the Broad Institute, whose scale enables robust training of ML models. We show that these vector embeddings enable more accurate drug response predictions than existing methods. The first part of this thesis is dedicated to the development of a deep learning cancer drug discovery model, focused on in silico screening of chemical space to search for cancer drug candidates.
The second part focuses on the development of a precision oncology model based on a multichannel neural network architecture. Our pipeline involves training single-target models on drug molecular structures, followed by integrating genomic data to enhance biological context and train a hybrid model capable of predicting drug response for novel drug:target pairs. Our results demonstrate that vector embeddings produced by the proposed framework outperform existing approaches, offering a more accurate and efficient means of exploring chemical space. This work highlights the transformative potential of ML/DL methods in drug discovery, enabling targeted, cost-effective exploration of chemical libraries, and advancing the development of precision oncology treatments.
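The embed-then-predict idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: random vectors stand in for the molecular and cell-line embeddings that foundation models would produce, the response values are synthetic, and closed-form ridge regression stands in for the "standard ML tools". All function names, dimensions, and data below are assumptions made for illustration.

```python
import numpy as np

def make_pair_features(drug_emb, cell_emb):
    """Concatenate a drug embedding with a cell-line embedding for every pair."""
    n_drugs, n_cells = drug_emb.shape[0], cell_emb.shape[0]
    drug_part = np.repeat(drug_emb, n_cells, axis=0)  # each drug row repeated once per cell line
    cell_part = np.tile(cell_emb, (n_drugs, 1))       # cell-line block repeated once per drug
    return np.hstack([drug_part, cell_part])

def add_intercept(X):
    """Append a column of ones so the model can learn a bias term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is left unpenalized."""
    Xb = add_intercept(X)
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict(X, weights):
    """Predict responses for drug:cell-line feature rows."""
    return add_intercept(X) @ weights

# Synthetic stand-ins (assumption): 20 drugs with 8-dim embeddings,
# 15 cell lines with 6-dim embeddings, and a noisy linear response.
rng = np.random.default_rng(0)
drug_emb = rng.normal(size=(20, 8))
cell_emb = rng.normal(size=(15, 6))

X = make_pair_features(drug_emb, cell_emb)  # one row per drug:cell-line pair
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + 0.1 * rng.normal(size=X.shape[0])

weights = fit_ridge(X, y, alpha=1.0)
pred = predict(X, weights)
corr = np.corrcoef(pred, y)[0, 1]
print(f"pairs: {X.shape[0]}, features: {X.shape[1]}, train correlation: {corr:.3f}")
```

A linear model on concatenated embeddings is additive in the drug and the cell line, so it cannot capture drug-specific sensitivity patterns. The abstract describes a multichannel neural network for the hybrid model, which this sketch does not attempt to reproduce.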
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Chemistry
Publisher
Massachusetts Institute of Technology