Generating Differentially Private Synthetic Text
Author(s)
Park, YeonHwan
Advisor
Kagal, Lalana
Abstract
The advent of more powerful cloud compute over the past decade has made it possible to train the deep neural networks that now power applications in almost every domain. However, data in sensitive settings, such as hospital records, remains scarce and will likely remain so for the foreseeable future. Without high-quality data, neural networks cannot perform high-quality inference.
To aid in training models when existing data is limited, we train existing deep neural network architectures to generate synthetic text that resembles the text they were trained on, without memorizing one-to-one mappings or leaking any sensitive data. To achieve this goal, we fine-tune our models to satisfy a strong notion of differential privacy, a mathematical framework that bounds how much any single training record can influence the model's output, and hence the extent to which an adversary can reconstruct the original dataset.
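For reference, the standard definition from the differential privacy literature (this notation is not taken from the thesis itself): a randomized mechanism M is (epsilon, delta)-differentially private if, for every pair of datasets D and D' differing in a single record and every set of outputs S,

    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta

Fine-tuning under such a guarantee is commonly implemented with DP-SGD, which clips each example's gradient and adds Gaussian noise before the optimizer step. The abstract does not specify which mechanism the thesis uses, so the following Python sketch, built on the Opacus library with toy stand-in names (model, loader, noise_multiplier), is only one plausible instantiation:

    # Minimal DP-SGD fine-tuning sketch using Opacus; all model/data names
    # are illustrative assumptions, not the thesis's actual setup.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from opacus import PrivacyEngine

    # Toy stand-ins for a language model and its private training data.
    model = nn.Linear(16, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    loader = DataLoader(data, batch_size=8)

    # Wrap training in DP-SGD: per-example gradient clipping plus Gaussian noise.
    privacy_engine = PrivacyEngine()
    model, optimizer, loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=loader,
        noise_multiplier=1.0,  # scale of noise added to clipped gradients
        max_grad_norm=1.0,     # per-example gradient clipping bound
    )

    criterion = nn.CrossEntropyLoss()
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

    # The accountant reports the privacy budget spent so far.
    epsilon = privacy_engine.get_epsilon(delta=1e-5)

In such a setup, noise_multiplier and max_grad_norm trade off the strength of the privacy guarantee (a smaller epsilon) against the utility of the generated text.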
Because we ultimately want to use these differentially private models to generate mixed-type tabular datasets that include unstructured text, we also conduct a survey to better understand how our algorithm might supplement existing neural networks.
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology