
dc.contributor.advisor | Collin M. Stultz. | en_US
dc.contributor.author | Dai, Wangzhi (Scientist in electrical engineering and computer science) Massachusetts Institute of Technology. | en_US
dc.contributor.other | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. | en_US
dc.date.accessioned | 2021-05-24T20:23:34Z
dc.date.available | 2021-05-24T20:23:34Z
dc.date.copyright | 2021 | en_US
dc.date.issued | 2021 | en_US
dc.identifier.uri | https://hdl.handle.net/1721.1/130776
dc.description | Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021 | en_US
dc.description | Cataloged from the official PDF version of thesis. | en_US
dc.description | Includes bibliographical references (pages 51-52). | en_US
dc.description.abstract | Missing data is a common problem in all data-driven algorithms. An incomplete dataset can bias the trained model or cause failures when deploying models that require complete input. A clinical registry is a record of patients' information about their health history, status, and the healthcare they receive during various periods of time. Due to the challenge of data collection and the unstructured nature of patients' information, missing data is ubiquitous and can lead to serious problems. Traditional imputation techniques for coping with missing data include simple mean or zero imputation and multivariate imputation, which requires more complex modeling. With the explosion of data and advances in machine learning techniques, more advanced deep generative models have shown the ability to learn complex distributions in high-dimensional space. In this work, we explored two deep generative models, the Restricted Boltzmann Machine (RBM) and the Variational Autoencoder (VAE), as potential modeling and imputation techniques for missing data. We examined the training of the models with incomplete datasets and mixed variable types. For the VAE, we further discussed a robust and efficient Markov Chain Monte Carlo (MCMC) sampling technique to estimate the probability density of a given point. Two different Markov chains, the random walk Metropolis and the Hamiltonian Markov Chain, were compared by their convergence speed. For imputation, we conducted synthetic experiments with a Gaussian mixture model. We also applied the proposed methods to a real-world clinical dataset, the Global Registry of Acute Coronary Events (GRACE), and compared the imputation performance to traditional methods such as the multivariate normal distribution. | en_US
dc.description.statementofresponsibility | by Wangzhi Dai. | en_US
dc.format.extent | 52 pages | en_US
dc.language.iso | eng | en_US
dc.publisher | Massachusetts Institute of Technology | en_US
dc.rights | MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. | en_US
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US
dc.subject | Electrical Engineering and Computer Science. | en_US
dc.title | Missing data imputation in a clinical registry with deep generative models | en_US
dc.type | Thesis | en_US
dc.description.degree | S.M. | en_US
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US
dc.identifier.oclc | 1252062525 | en_US
dc.description.collection | S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science | en_US
dspace.imported | 2021-05-24T20:23:34Z | en_US
mit.thesis.degree | Master | en_US
mit.thesis.department | EECS | en_US

