Generating Differentially Private Synthetic Text
Author(s)
Park, YeonHwan
Advisor
Kagal, Lalana
Abstract
The advent of more powerful cloud compute over the past decade has made it possible to train the deep neural networks that now power applications in almost every domain. However, data in sensitive settings, such as hospital records, remains scarce and will likely remain so for the foreseeable future. Without high-quality data, neural networks cannot perform high-quality inference.
To aid in training models when existing data is limited, we train existing deep neural network architectures to generate synthetic text that resembles the text they were trained on, without memorizing one-to-one mappings or leaking any sensitive data. To achieve this goal, we fine-tune our models to satisfy a strong notion of differential privacy, a mathematical framework that bounds how much any single training record can influence the model's output, and hence the extent to which an adversary can reconstruct the original dataset.
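For reference, the standard definition from the differential privacy literature (this notation is not taken from the thesis itself): a randomized mechanism M is (epsilon, delta)-differentially private if, for every pair of datasets D and D' differing in a single record and every set of outputs S,

    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta

Fine-tuning under such a guarantee is commonly implemented with DP-SGD, which clips each example's gradient and adds Gaussian noise before the optimizer step. The abstract does not specify which mechanism the thesis uses, so the following Python sketch, built on the Opacus library with toy stand-in names (model, loader, noise_multiplier), is only one plausible instantiation:

    # Minimal DP-SGD fine-tuning sketch using Opacus; all model/data names
    # are illustrative assumptions, not the thesis's actual setup.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from opacus import PrivacyEngine

    # Toy stand-ins for a language model and its private training data.
    model = nn.Linear(16, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    loader = DataLoader(data, batch_size=8)

    # Wrap training in DP-SGD: per-example gradient clipping plus Gaussian noise.
    privacy_engine = PrivacyEngine()
    model, optimizer, loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=loader,
        noise_multiplier=1.0,  # scale of noise added to clipped gradients
        max_grad_norm=1.0,     # per-example gradient clipping bound
    )

    criterion = nn.CrossEntropyLoss()
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

    # The accountant reports the privacy budget spent so far.
    epsilon = privacy_engine.get_epsilon(delta=1e-5)

In such a setup, noise_multiplier and max_grad_norm trade off the strength of the privacy guarantee (a smaller epsilon) against the utility of the generated text.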
Because we ultimately want to use these differentially private models to generate mixed-type tabular datasets that include unstructured text, we also conduct a survey to better understand how our algorithm might supplement existing neural networks.
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology