Dimensionality reduction in immunology: from viruses to cells
Massachusetts Institute of Technology. Department of Chemical Engineering.
Arup K. Chakraborty.
Developing successful prophylactic and therapeutic strategies against infections by RNA viruses like HIV requires a combined understanding of the evolutionary constraints on the virus and of the immunologic determinants associated with effective viremic control. Recent technologies enable viral and immune parameters to be measured at an unprecedented scale and resolution across multiple patients, and the resulting data could be harnessed towards these goals. Such datasets typically involve a large number of parameters; the goal of analysis is to infer, from the data, the underlying biological relationships that connect these parameters. This dissertation combines principles and techniques from the physical and the computational sciences to "reduce the dimensionality" of such data in order to reveal novel biological relationships of relevance to vaccination and therapeutic strategies. Much of our work is concerned with HIV.
1. How can collective evolutionary constraints be inferred from viral sequences derived from infected patients? Using principles of Random Matrix Theory, we derive a low-dimensional representation of HIV proteins based on circulating sequence data and identify independent groups of residues within viral proteins that are coordinately linked. One such group of residues within the polyprotein Gag exhibits statistical signatures indicative of strong constraints that limit the viability of a higher proportion of strains bearing multiple mutations in this group. We validate these predictions against independent experimental data and, based on our results, propose candidate immunogens for the Caucasian American population that target these vulnerabilities.
2. To what extent do mutational patterns observed in circulating viral strains accurately reflect intrinsic fitness constraints of viral proteins? Each strain is the result of evolution against an immune background, which is highly diverse across patients. Spin models constructed to reproduce the prevalence of sequences have compared favorably with intrinsic fitness assays (where immune selection is absent). Why "prevalence" should correlate with "replicative fitness" under such complex evolutionary dynamics is conceptually puzzling. We combine computer simulations and analytical theory to show that the prevalence can correctly reflect the fitness rank order of mutant viral strains that are proximal in sequence space. Our analysis suggests that incorporating a "phylogenetic correction" in the parameters might improve the predictive power of these models.
3. Can cellular phenotypes be discovered in an unbiased way from high-dimensional protein expression data in single cells? Mass cytometry, in which more than 40 protein parameters can be quantified in single cells, affords a route, but analyzing such high-dimensional data can be challenging. Traditional "gating approaches" do not scale, and computational methods that account for multivariate relationships among different proteins are needed. High-dimensional clustering and principal component analysis, two approaches that have been explored so far, suffer from important limitations. We propose a computational tool rooted in nonlinear dimensionality reduction that overcomes these limitations and automatically identifies phenotypes based on a two-dimensional distillation of the cellular data; the latter feature facilitates unbiased visualization of high-dimensional relationships. Our tool reveals a previously unappreciated phenotypic complexity within murine CD8+ T cells, and identifies a novel phenotype that is conflated by traditional approaches.
4. Antigen-specific immune cells that mediate efficacious antiviral responses in infections like HIV involve complex phenotypes and typically constitute a small fraction of the population. In such circumstances, seeking correlative features in bulk expression levels of key proteins can be misleading. Using the approach introduced in part 3, we analyze multiparameter flow cytometry data of CD4+ T-cell samples from 20 patients representing diverse clinical groups, and identify cellular phenotypes whose proportion in patients is strongly correlated with quantitative clinical parameters. Many of these correlations are inconsistent with bulk signals. Furthermore, a number of correlative phenotypes are characterized by the expression of multiple proteins at individually modest levels; such subsets are likely to be missed by conventional gating strategies. Using the in-patient proportions of different phenotypes as predictors, a cross-validated, sparse linear regression model explains 87% of the variance in the viral load across the 20 patients. Our approach is scalable to datasets involving dozens of parameters.
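The cross-validated, sparse linear regression mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not state which sparsity penalty or cross-validation scheme was used, so the L1 (lasso) penalty, the coordinate-descent solver, and the leave-one-out scheme below are assumptions, and all function names are hypothetical. The data are synthetic stand-ins for phenotype proportions and viral load, not the thesis data, and the sketch makes no attempt to reproduce the reported result.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X, y, lam, n_iter=300):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent.

    X is a list of n rows with p predictors each; returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of predictor j with the residual that excludes its own term.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
        # Re-centre the residuals to update the (unpenalised) intercept.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
    return intercept, beta


def loo_r2(X, y, lam):
    """Leave-one-out cross-validated fraction of variance explained."""
    n = len(y)
    y_mean = sum(y) / n
    ss_res = 0.0
    for k in range(n):
        X_train = X[:k] + X[k + 1:]
        y_train = y[:k] + y[k + 1:]
        b0, beta = lasso_fit(X_train, y_train, lam)
        pred = b0 + sum(b * x for b, x in zip(beta, X[k]))
        ss_res += (y[k] - pred) ** 2
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def make_synthetic(n=20, p=8, seed=0):
    """Synthetic data (assumption): only predictors 0 and 3 drive the response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


X, y = make_synthetic()
intercept, beta = lasso_fit(X, y, lam=0.02)
print("coefficients:", [round(b, 3) for b in beta])
print("leave-one-out R^2:", round(loo_r2(X, y, lam=0.02), 3))
```

In practice the penalty strength `lam` would itself be chosen by cross-validation over a grid of values; it is fixed here only to keep the sketch short.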
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Chemical Engineering, February 2015. Cataloged from PDF version of thesis. Includes bibliographical references (pages 301-318).
Department: Massachusetts Institute of Technology. Department of Chemical Engineering.
Massachusetts Institute of Technology