Missing data imputation in a clinical registry with deep generative models

Dai, Wangzhi (Scientist in electrical engineering and computer science), Massachusetts Institute of Technology.

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Collin M. Stultz.

Terms of use

MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582

Abstract

Missing data is a common problem for data-driven algorithms. An incomplete dataset can bias the trained model or cause failures when deploying models that require complete input. A clinical registry is a record of patients' health history, status, and the healthcare they receive over various periods of time. Because data collection is challenging and patient information is often unstructured, missing data is ubiquitous and can lead to serious problems. Traditional techniques for handling missing data include simple mean or zero imputation and multivariate imputation, which requires more complex modeling. With the explosion of data and advances in machine learning, deep generative models have shown the ability to learn complex distributions in high-dimensional spaces. In this work, we explored two deep generative models, the Restricted Boltzmann Machine (RBM) and the Variational Autoencoder (VAE), as potential modeling and imputation techniques for missing data. We examined training these models on incomplete datasets with mixed variable types. For the VAE, we further discussed a robust and efficient Markov chain Monte Carlo (MCMC) sampling technique for estimating the probability density at a given point. Two Markov chains, the random walk Metropolis and the Hamiltonian Markov chain, were compared by their convergence speed. For imputation, we conducted synthetic experiments with a Gaussian mixture model. We also applied the proposed methods to a real-world clinical dataset, the Global Registry of Acute Coronary Events (GRACE), and compared their imputation performance with that of traditional methods such as the multivariate normal distribution.
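
The traditional baselines named in the abstract (mean imputation and imputation under a multivariate normal distribution) can be illustrated with a minimal sketch. This is not the thesis code: the function names `mean_impute` and `mvn_conditional_impute`, the synthetic three-variable Gaussian data, the 30% missing-completely-at-random mask, and the choice to estimate the mean and covariance from fully observed rows are all assumptions made for illustration.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN with the observed mean of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def mvn_conditional_impute(X, mu, sigma):
    """Replace NaNs in each row with the conditional mean of the missing
    entries given the observed ones, under a N(mu, sigma) model."""
    X_imp = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue  # nothing to impute in this row
        if miss.all():
            X_imp[i] = mu  # no observed values: fall back to the mean
            continue
        obs = ~miss
        s_oo = sigma[np.ix_(obs, obs)]   # covariance of observed entries
        s_mo = sigma[np.ix_(miss, obs)]  # cross-covariance missing/observed
        # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        X_imp[i, miss] = mu[miss] + s_mo @ np.linalg.solve(s_oo, X[i, obs] - mu[obs])
    return X_imp


def rmse(X_imp, X_true, mask):
    """Root-mean-square error over the entries that were masked out."""
    return float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))


# Synthetic, correlated Gaussian data (illustrative only, not GRACE).
rng = np.random.default_rng(0)
true_mu = np.zeros(3)
true_sigma = np.array([[1.0, 0.8, 0.6],
                       [0.8, 1.0, 0.7],
                       [0.6, 0.7, 1.0]])
X_full = rng.multivariate_normal(true_mu, true_sigma, size=2000)

# Hide roughly 30% of entries completely at random.
mask = rng.random(X_full.shape) < 0.3
X = X_full.copy()
X[mask] = np.nan

# Assumption: fit the normal model on fully observed rows only.
complete = ~np.isnan(X).any(axis=1)
mu_hat = X[complete].mean(axis=0)
sigma_hat = np.cov(X[complete], rowvar=False)

X_mean = mean_impute(X)
X_mvn = mvn_conditional_impute(X, mu_hat, sigma_hat)

rmse_mean = rmse(X_mean, X_full, mask)
rmse_mvn = rmse(X_mvn, X_full, mask)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"MVN conditional-mean imputation RMSE: {rmse_mvn:.3f}")
```

On correlated data like this, the conditional-mean approach uses the observed variables in each row and should give a lower error than column means. A deep generative model such as the RBM or VAE would replace the Gaussian assumption with a learned distribution.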

Description

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021

Cataloged from the official PDF version of thesis.

Includes bibliographical references (pages 51-52).

Date issued

2021

URI

https://hdl.handle.net/1721.1/130776

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses