Investigating the Capacity of Generative AI to Learn Genotype-by-Environment Interactions in Brachypodium distachyon
Author(s)
Neufeldt, Charlie
Advisor
Marais, Dave Des
Abstract
Climate change exacerbates environmental stressors such as drought, challenging the resilience of agricultural systems and highlighting the need to understand plant genomic architecture and its responses to environmental variation. A key molecular mechanism underlying these responses is transcriptional plasticity: environment-induced changes in gene expression that vary among genotypes, representing one way that genotype-by-environment (GxE) interactions manifest at the molecular level. While transcriptomic data offer a powerful view into these responses, traditional modeling approaches often rely on linear assumptions, limiting their ability to detect complex, nonlinear patterns of regulation. This thesis investigates whether generative machine learning models, specifically transformers, can extract biologically meaningful representations of gene expression dynamics in plants. Inspired by the success of the scGPT model for human genomics, I developed and trained a compact transformer architecture, the PlantGeneEncoder, on bulk RNA-seq data from two natural accessions of Brachypodium distachyon grown under drought and control conditions. The model was trained on binned expression values using both a baseline configuration and a set of regularized variants incorporating noise injection, co-expression preservation, entropy-based sample weighting, and masked gene modeling as a self-supervised objective. While baseline models achieved perfect reconstruction accuracy, they failed to preserve meaningful biological structure in the latent space. Regularized models achieved a better trade-off, maintaining high reconstruction fidelity while improving genotype classification performance and aligning modestly better with the original expression structure. However, environmental condition signals remained difficult to capture across all configurations, with classification accuracies only marginally above random chance.
These findings highlight the promise and limitations of transformer-based generative modeling for plant transcriptomics and provide a flexible framework for future efforts to model transcriptional plasticity and regulatory responses to environmental stress.
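As an illustration of two preprocessing ideas named in the abstract, binned expression values and masked gene modeling, the following is a minimal Python sketch. It is not the thesis code: the per-sample quantile binning scheme, the function names `bin_expression` and `mask_genes`, the 15% mask fraction, and the `-1` mask token are all assumptions chosen for illustration.

```python
import numpy as np


def bin_expression(expr, n_bins=10):
    """Discretize a (samples x genes) expression matrix into integer bins.

    Assumed scheme: zeros stay in bin 0; nonzero values in each sample are
    assigned to bins 1..n_bins using that sample's own quantiles.
    """
    expr = np.asarray(expr, dtype=float)
    binned = np.zeros(expr.shape, dtype=np.int64)
    for i, row in enumerate(expr):
        nonzero = row > 0
        if not nonzero.any():
            continue
        # Interior quantile edges computed from this sample's nonzero values.
        edges = np.quantile(row[nonzero], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned[i, nonzero] = np.digitize(row[nonzero], edges) + 1
    return binned


def mask_genes(binned, mask_frac=0.15, mask_token=-1, seed=0):
    """Hide a random subset of gene bins for a masked-gene-modeling objective.

    Returns the masked matrix and the boolean mask marking hidden positions;
    a model would be trained to predict the original bins at those positions.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(binned.shape) < mask_frac
    masked = np.where(mask, mask_token, binned)
    return masked, mask


# Toy matrix: 2 samples x 5 genes (hypothetical counts, not real data).
toy = np.array([[0, 1, 2, 3, 4],
                [5, 0, 0, 10, 20]])
toy_binned = bin_expression(toy, n_bins=2)
toy_masked, toy_mask = mask_genes(toy_binned)
```

Binning each sample against its own quantiles makes the tokens comparable across samples with different sequencing depth, which is one common motivation for discretizing expression before feeding it to a transformer.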
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Civil and Environmental Engineering
Publisher
Massachusetts Institute of Technology