Synthesizing tabular data using conditional GAN

Xu, Lei (Electrical and computer scientist) Massachusetts Institute of Technology.

dc.contributor.advisor	Kalyan Veeramachaneni.	en_US
dc.contributor.author	Xu, Lei (Electrical and computer scientist) Massachusetts Institute of Technology.	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2020-11-03T20:32:18Z
dc.date.available	2020-11-03T20:32:18Z
dc.date.copyright	2020	en_US
dc.date.issued	2020	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/128349
dc.description	Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2020	en_US
dc.description	Cataloged from PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 89-93).	en_US
dc.description.abstract	In data science, the ability to model the distribution of rows in tabular data and generate realistic synthetic data enables various important applications including data compression, data disclosure, and privacy-preserving machine learning. However, because tabular data usually contains a mix of discrete and continuous columns, building such a model is a non-trivial task. Continuous columns may have multiple modes, while discrete columns are sometimes imbalanced, making modeling difficult. To address this problem, I took two major steps. (1) I designed SDGym, a thorough benchmark, to compare existing models, identify different properties of tabular data and analyze how these properties challenge different models. Our experimental results show that statistical models, such as Bayesian networks, that are constrained to a fixed family of available distributions cannot model tabular data effectively, especially when both continuous and discrete columns are included. Recently proposed deep generative models are capable of modeling more sophisticated distributions, but cannot outperform Bayesian network models in practice, because the network structure and learning procedure are not optimized for tabular data which may contain non-Gaussian continuous columns and imbalanced discrete columns. (2) To address these problems, I designed CTGAN, which uses a conditional generative adversarial network to address the challenges in modeling tabular data. Because CTGAN uses reversible data transformations and is trained by re-sampling the data, it can address common challenges in synthetic data generation. I evaluated CTGAN on the benchmark and showed that it consistently and significantly outperforms existing statistical and deep learning models.	en_US
dc.description.statementofresponsibility	by Lei Xu.	en_US
dc.format.extent	93 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Synthesizing tabular data using conditional GAN	en_US
dc.type	Thesis	en_US
dc.description.degree	S.M.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.oclc	1202001437	en_US
dc.description.collection	S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science	en_US
dspace.imported	2020-11-03T20:32:17Z	en_US
mit.thesis.degree	Master	en_US
mit.thesis.department	EECS	en_US

Files in this item

Name: 1202001437-MIT.pdf
Size: 7.589 MB
Format: PDF

This item appears in the following Collection(s)

Graduate Theses