Characterizing population-level variation in mRNA splicing and implications for human genetic interpretation
Author(s)
Jacobs, Hannah N.
Advisor
Burge, Christopher
Abstract
Alternative splicing occurs when a single gene sequence gives rise to multiple RNA sequences. DNA mutations in the gene sequence can alter this process, shifting the relative usage of these RNA sequences. This relative usage is called percent spliced in (PSI). Sometimes changes in PSI trigger a change in function at the level of a cell, an organism, or fitness. The consequences of splicing variability, and the contribution of genetic variation to this process, remain incompletely characterized.
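As a minimal sketch of the PSI quantity described above, the snippet below computes PSI for a single exon from read counts. It assumes the common junction-read convention (reads supporting inclusion divided by all informative reads); the abstract does not state how reads are counted in the thesis, and the function and parameter names are hypothetical.

```python
def percent_spliced_in(inclusion_reads: int, exclusion_reads: int) -> float:
    """Return PSI as a fraction in [0, 1] (multiply by 100 for a percentage).

    Assumption: PSI = inclusion / (inclusion + exclusion), a common convention
    that may differ from the exact counting scheme used in the thesis.
    """
    if inclusion_reads < 0 or exclusion_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + exclusion_reads
    if total == 0:
        # PSI is undefined when no reads cover the event.
        raise ValueError("no informative reads for this splicing event")
    return inclusion_reads / total
```

For example, an exon supported by 30 inclusion reads and 10 exclusion reads would have a PSI of 0.75 under this convention.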
In this thesis, we seek to characterize the splicing events specifically present in a subset of the human population. We use the Genotype-Tissue Expression project (GTEx), which encompasses genomic DNA sequence information and bulk mRNA data from 49 tissues in 838 individuals. In this dataset we implement a 3-component beta-binomial model using RNA-sequencing reads, at a tissue-specific level, to reliably call splicing events present in a subset of the samples within a tissue. We call these naturally variable exons (NVEs), and identify a total of 57,271 unique NVEs in GTEx. We find NVEs in a large portion of the transcriptome, existing in 75% of all protein-coding genes.
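To illustrate what a 3-component beta-binomial mixture over read counts can look like, here is a hedged sketch using only the Python standard library. It is not the thesis's implementation: the component weights and shape parameters are made-up illustrative values, the function names are hypothetical, and no fitting procedure is shown.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Log of the Beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(k inclusion reads out of n) when PSI ~ Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def mixture_likelihood(k, n, weights, params):
    """Likelihood of one sample's counts under the mixture."""
    return sum(w * beta_binomial_pmf(k, n, a, b) for w, (a, b) in zip(weights, params))

def responsibilities(k, n, weights, params):
    """Posterior probability that a sample belongs to each component."""
    parts = [w * beta_binomial_pmf(k, n, a, b) for w, (a, b) in zip(weights, params)]
    total = sum(parts)
    return [p / total for p in parts]

# Illustrative (assumed) parameters: a low-PSI, a broad, and a high-PSI component.
WEIGHTS = [0.85, 0.10, 0.05]
PARAMS = [(1.0, 30.0), (1.0, 1.0), (30.0, 1.0)]
```

In a sketch like this, a sample with 0 inclusion reads out of 20 is assigned mostly to the low-PSI component, while a sample with 20 out of 20 is assigned mostly to the high-PSI component.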
The beta-binomial model generates a population distribution for each NVE, and we leverage that to estimate an NVE frequency at a PSI level of interest. This enables us to compare NVEs by their frequencies. We find that NVEs tend to be either rare in the population (frequency ≤ 10%) or quite common (frequency ≥ 90%). We find that higher-frequency NVEs tend to be in 5' untranslated regions, and lower-frequency NVEs tend to be in coding regions.
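One way to read "NVE frequency at a PSI level of interest" is as the fraction of the modeled population whose PSI is at or above a chosen threshold. The sketch below estimates that fraction by Monte Carlo sampling from a beta mixture. This interpretation, the mixture parameters, and the function name are all assumptions made for illustration, not details taken from the thesis.

```python
import random

def nve_frequency(weights, params, psi_threshold, n_draws=200_000, seed=0):
    """Estimate P(PSI >= psi_threshold) under a mixture of Beta distributions.

    weights: mixing proportions (summing to 1).
    params: list of (alpha, beta) shape pairs, one per component.
    psi_threshold: PSI cutoff expressed as a fraction in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_draws):
        # Pick a component according to its weight, then draw a PSI from it.
        alpha, beta = rng.choices(params, weights=weights, k=1)[0]
        if rng.betavariate(alpha, beta) >= psi_threshold:
            hits += 1
    return hits / n_draws

# Illustrative (assumed) population: most individuals near PSI = 0.
freq = nve_frequency([0.85, 0.10, 0.05], [(1.0, 30.0), (1.0, 1.0), (30.0, 1.0)], 0.10)
```

Sampling is used here only to keep the example dependency-free; with a numerical library, the same quantity could be computed from the Beta survival function of each component, weighted by the mixing proportions.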
Sixty percent of NVEs have been previously found to be modulated by genetic variants. We find that proximity to a splice site is one of the most important predictors of whether a genetic variant will impact splicing in GTEx, which enables better predictions than existing methods (an increase in AUC of 0.39). Surprisingly, we find that NVEs tend to be in genetically constrained genes (depleted of loss-of-function mutations), with the lowest-frequency NVEs occurring in the most constrained genes. We find a subset of genetically modulated NVEs that target genes in a manner consistent with inducing nonsense-mediated decay (NMD). We highlight a couple of such variants linked to diseases, such as those associated with heart disease.
These findings demonstrate that quantifying the population frequency of splicing events can reveal novel axes of molecular variability, and provide potential insight into the evolution of alternative splicing.
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Biology
Publisher
Massachusetts Institute of Technology