DSpace@MIT

Missing data imputation in a clinical registry with deep generative models

Author(s)
Dai, Wangzhi (Scientist in electrical engineering and computer science), Massachusetts Institute of Technology.
Download: 1252062525-MIT.pdf (2.708Mb)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Collin M. Stultz.
Terms of use
MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582
Abstract
Missing data is a common problem for all data-driven algorithms. An incomplete dataset can bias the trained model or cause failures when deploying models that require complete input. A clinical registry is a record of patients' health history, status, and the healthcare they receive over various periods of time. Because data collection is challenging and patient information is unstructured, missing data is ubiquitous in such registries and can lead to serious problems. Traditional techniques for imputing missing data include simple mean or zero imputation and multivariate imputation, which requires more complex modeling. With the explosion of data and advances in machine learning, deep generative models have shown the ability to learn complex distributions in high-dimensional spaces. In this work, we explored two deep generative models, the Restricted Boltzmann Machine (RBM) and the Variational Autoencoder (VAE), as potential modeling and imputation techniques for missing data. We examined training these models on incomplete datasets with mixed variable types. For the VAE, we further discussed a robust and efficient Markov Chain Monte Carlo (MCMC) sampling technique for estimating the probability density at a given point. Two Markov chains, random walk Metropolis and the Hamiltonian Markov chain, were compared by their convergence speed. For imputation, we conducted synthetic experiments with a Gaussian mixture model. We also applied the proposed methods to a real-world clinical dataset, the Global Registry of Acute Coronary Events (GRACE), and compared their imputation performance to that of traditional methods such as imputation under a multivariate normal distribution.
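
The two traditional baselines named in the abstract can be illustrated on a toy example. The sketch below is not taken from the thesis: the function names, the two-variable setup, and the choice to estimate parameters from fully observed rows only are assumptions made for illustration. It contrasts simple mean imputation with imputation by the conditional mean of a bivariate normal model.

```python
from statistics import mean


def mean_impute(rows):
    """Replace a missing y (None) with the mean of the observed y values."""
    y_mean = mean(y for _, y in rows if y is not None)
    return [(x, y_mean if y is None else y) for x, y in rows]


def conditional_gaussian_impute(rows):
    """Replace a missing y with E[y | x] under a bivariate normal model.

    Assumptions for this sketch: x is never missing, and the mean and
    covariance are estimated from the fully observed rows only.
    """
    complete = [(x, y) for x, y in rows if y is not None]
    n = len(complete)
    mu_x = mean(x for x, _ in complete)
    mu_y = mean(y for _, y in complete)
    # Sample variance of x and sample covariance of (x, y).
    var_x = sum((x - mu_x) ** 2 for x, _ in complete) / (n - 1)
    cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in complete) / (n - 1)
    slope = cov_xy / var_x
    # Conditional mean of a bivariate normal: mu_y + (cov_xy / var_x) * (x - mu_x).
    return [
        (x, (mu_y + slope * (x - mu_x)) if y is None else y)
        for x, y in rows
    ]


# Hypothetical toy data in which y = 2 * x and the last y is missing.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None)]
mean_filled = mean_impute(rows)
gauss_filled = conditional_gaussian_impute(rows)
```

On this toy data, mean imputation fills the missing entry with 4.0, the average of the observed values, while the conditional-mean approach uses the correlation with x and fills in 8.0. Exploiting dependence between variables in this way is the property that the more expressive generative models discussed in the abstract aim to capture for complex, high-dimensional data.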
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021
 
Cataloged from the official PDF version of thesis.
 
Includes bibliographical references (pages 51-52).
 
Date issued
2021
URI
https://hdl.handle.net/1721.1/130776
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Graduate Theses
