Continuous representations and models from random walk diffusion limits
Author(s)Hashimoto, Tatsunori B. (Tatsunori Benjamin)
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Tommi S. Jaakkola and David K. Gifford.
MetadataShow full item record
Structured data such as sequences and networks pose substantial difficulty for traditional statistical theory which has focused on data drawn independently from a vector space. A popular and empirically effective technique for dealing with such data is to map elements of the data to a vector space and to operate over the embedding as a summary statistic. Such a vector representation of discrete objects is known as a 'continuous representation'. Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Even in cases that the embedding is not explicit, many algorithms operate over similarity measures which implicitly embed the original dataset. In this thesis, we attempt to understand the intuition behind continuous representations. Can we construct a general theory of continuous representations? Are there general principles for semantically meaninguful representations? In order to answer these questions, we develop a framework for analyzing continuous representations through diffusion limits of random walks. We show that measureable quantities of discrete random walks with a latent metric structure have closed form diffusion limits. These diffusion limits allow us to approximate attributes of the discrete random walk such as the stationary distribution, hitting time, or co-occurrence using closed-form expressions from diffusions. We establish limits which guarantee asymptotic consistency of such estimators, and show they work well in practice. Using this new approach, we solve three classes of problems: first, we derive principled network algorithms which connect statistical estimation tasks such as density estimation to network algorithms such as PageRank. Next, we demonstrate that continuous representations of words are a type of random walk metric estimator with close connections to manifold learning. Finally, we apply our theory to single-cell RNA seq data, and derive a way to learn time-series models without trajectories by using stochastic recurrent neural networks.
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.Cataloged from PDF version of thesis.Includes bibliographical references (pages 193-202).
DepartmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.