Missing data imputation in a clinical registry with deep generative models
Author(s)
Dai, Wangzhi (Scientist in electrical engineering and computer science), Massachusetts Institute of Technology.
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Collin M. Stultz.
Abstract
Missing data is a common problem for data-driven algorithms. An incomplete dataset can bias the trained model or cause failures when deploying models that require a complete input. A clinical registry is a record of patients' information about their health history, status, and the healthcare they receive over various periods of time. Because data collection is challenging and patients' information is unstructured, missing data is ubiquitous and can lead to serious problems. Traditional imputation techniques for coping with missing data include simple mean or zero imputation, and multivariate imputation, which requires more complex modeling. With the explosion of data and advances in machine learning techniques, more advanced deep generative models have shown the ability to learn complex distributions in high-dimensional space. In this work, we explored two deep generative models, the Restricted Boltzmann Machine (RBM) and the Variational Autoencoder (VAE), as potential modeling and imputation techniques for missing data. We examined training the models on incomplete datasets with mixed variable types. For the VAE, we further discussed a robust and efficient Markov Chain Monte Carlo (MCMC) sampling technique to estimate the probability density of a given point. Two Markov chains, random walk Metropolis and the Hamiltonian Markov chain, were compared by their convergence speed. For imputation, we conducted synthetic experiments with a Gaussian mixture model. We also applied the proposed methods to a real-world clinical dataset, the Global Registry of Acute Coronary Events (GRACE), and compared the imputation performance to traditional methods such as imputation under a multivariate normal distribution.
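The traditional baselines named in the abstract can be illustrated with a minimal sketch. The code below is not taken from the thesis: the function names, the toy data, and the choice to treat the mean vector and covariance matrix as already known are all assumptions made for illustration (in practice they would have to be estimated from the incomplete data). It contrasts column-mean imputation with imputation by the conditional mean of a multivariate normal distribution.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def mvn_conditional_impute(X, mu, sigma):
    """Fill the NaNs in each row with the conditional mean of the missing
    entries given the observed ones, under a N(mu, sigma) model.

    mu and sigma are assumed to be known here (an illustrative assumption).
    """
    X_filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        if not observed.any():
            # Nothing observed: the conditional mean is the marginal mean.
            X_filled[i] = mu
            continue
        # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        s_oo = sigma[np.ix_(observed, observed)]
        s_mo = sigma[np.ix_(missing, observed)]
        diff = row[observed] - mu[observed]
        X_filled[i, missing] = mu[missing] + s_mo @ np.linalg.solve(s_oo, diff)
    return X_filled

# Toy data (hypothetical, not from GRACE): two correlated variables.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
X = np.array([[2.0, np.nan],
              [np.nan, np.nan],
              [1.0, 3.0]])

X_mean = mean_impute(X)
X_mvn = mvn_conditional_impute(X, mu, sigma)
```

Mean imputation ignores the relationship between variables, whereas the conditional-mean version uses the observed entries of a row to inform the missing ones; this is the sense in which multivariate imputation needs a more complex model.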
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021. Cataloged from the official PDF version of thesis. Includes bibliographical references (pages 51-52).
Date issued
2021
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.