Predicting genetic interactions in Caenorhabditis elegans using machine learning
Author(s): Missiuro, Patrycja Vasilyev, 1976-
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Tommi S. Jaakkola and Hui Ge.
This work develops a set of machine learning and other computational techniques to investigate and predict gene properties across a variety of biological datasets. Our main goal is the discovery of genetic interactions based on sparse and incomplete information. We use gene data from two model organisms, Caenorhabditis elegans and Saccharomyces cerevisiae. Our first method, information flow, uses circuit theory to evaluate the importance of a protein in an interactome. We find that proteins with high information flow (i-flow) scores mediate information exchange between functional modules. We also show that higher information flow scores correlate strongly with the likelihood of observing lethality or pleiotropy, as well as with the likelihood of observing genetic interactions. Our metric significantly outperforms established network metrics such as degree and betweenness. Next, we show how Bayesian sets can be applied to gain intuition as to which datasets are the most relevant for predicting genetic interactions. To apply this method directly to microarray data, we extend Bayesian sets to handle continuous variables. Using Bayesian sets, we show that genetically interacting genes tend to share phenotypes but are not necessarily co-localized. Additionally, they have similar temporal expression profiles during development and aging. One of the major difficulties in dealing with biological data is the problem of incomplete datasets. We describe a novel application of collaborative filtering (CF) to predict missing values in biological datasets. We adapt factorization-based and neighborhood-aware CF methods to deal with a mixture of continuous and discrete entries. We use collaborative filtering to impute missing values, to assess how much information relevant to genetic interactions is present, and, finally, to predict genetic interactions. We also show how CF can reduce input dimensionality.
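To make the idea of neighborhood-based imputation concrete, the following is a minimal sketch, not the thesis's actual method. It assumes a small gene-by-feature matrix in which missing entries are marked with `None`, measures row similarity by cosine similarity over the features both genes have observed, and fills each gap with a similarity-weighted mean of the `k` most similar genes. The function names, the choice of cosine similarity, and the parameter `k` are all illustrative assumptions; the sketch also treats every entry as a plain number, so it does not reproduce the adaptation to mixed continuous and discrete entries described above.

```python
import math


def cosine_on_shared(a, b):
    """Cosine similarity over the positions where both rows are observed."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def impute_missing(matrix, k=2):
    """Return a copy of matrix with None entries filled from similar rows.

    Each missing entry (i, j) becomes the similarity-weighted mean of
    feature j over the k most similar rows that observed feature j.
    Entries with no positively similar neighbor are left as None.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: every other row that observed feature j.
            candidates = [
                (cosine_on_shared(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            ]
            # Keep only positively similar rows, most similar first.
            neighbors = sorted(
                (c for c in candidates if c[0] > 0.0),
                key=lambda c: c[0],
                reverse=True,
            )[:k]
            total = sum(weight for weight, _ in neighbors)
            if total > 0.0:
                filled[i][j] = sum(w * v for w, v in neighbors) / total
    return filled


# Hypothetical toy data: rows are genes, columns are features.
genes = [
    [1.0, 0.0, 1.0, None],  # last feature is unobserved
    [1.0, 0.0, 1.0, 1.0],   # same observed profile as the first gene
    [0.0, 1.0, 0.0, 0.0],   # no overlap with the first gene's profile
]
completed = impute_missing(genes, k=2)
```

In this toy example only the second gene has positive similarity to the first, so the gap is filled from that gene alone. A factorization-based variant would instead fit low-rank gene and feature factors to the observed entries, which is also what allows CF to act as a dimensionality reduction step.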
Our last development is the application of Support Vector Machines (SVMs), an adapted machine learning classification method, to predicting genetic interactions. We find that an SVM with a nonlinear radial basis function (RBF) kernel has greater predictive power than CF. Its performance, however, benefits greatly from using CF to fill in missing entries in the input data. We show that SVM performance improves further if we constrain the group of genes to a specific functional category. Throughout this thesis, we emphasize the features of the studied datasets and explain our findings from a biological perspective. In this respect, we hope that this work has independent biological significance. The final step would be to confirm our predictions experimentally. This would allow us to gain new insights into C. elegans biology, such as the specific genes orchestrating developmental and regulatory pathways and the response to stress.
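For readers unfamiliar with the RBF kernel mentioned above, the sketch below shows the standard definition, K(x, y) = exp(-gamma * ||x - y||^2), in plain Python. It is an illustration only: the function names, the default `gamma`, and the toy feature vectors are assumptions, and the thesis's actual features and SVM configuration are not specified in this abstract. Because the kernel needs a numeric value in every position of both vectors, a missing entry (represented here as `None`) makes the computation fail, which is one way to see why filling in missing entries first matters.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(rows, gamma=0.5):
    """Pairwise kernel values, the input a kernel SVM is trained on."""
    return [[rbf_kernel(a, b, gamma) for b in rows] for a in rows]


# Hypothetical feature vectors, one per gene pair.
pairs = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.0, 0.1],
]
K = gram_matrix(pairs, gamma=0.5)

# A vector with a missing entry cannot be compared at all.
try:
    rbf_kernel([None, 1.0], [0.0, 1.0])
    missing_entry_failed = False
except TypeError:
    missing_entry_failed = True
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as vectors move apart; `gamma` controls how quickly that decay happens.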
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 191-204).
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.