MIT Libraries, DSpace@MIT

Privacy-Preserving Natural Language Dataset Generation

Author(s)
Chen, Ashley
Advisor
Kagal, Lalana
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
As we depend on data more heavily to power the insights made by machine learning systems, it becomes imperative that we design guarantees for protecting the privacy of such data. Recent research has shown the ease with which attacks such as membership inference or model inversion can extract potentially sensitive training data given the model alone. To prevent curious or malevolent users from gleaning training data through these attacks, we propose the generation of private synthetic datasets to replace the original datasets in training and testing the model. These synthetic datasets will have the same semantic and statistical distribution as the original dataset, but will be differentially private, thus preventing individuals in the dataset from being identified. This would guarantee that no sensitive information from the original dataset can be extracted from the generated synthetic dataset. Compared to related works that dealt with either structured data or unstructured data separately, our work developed a pipeline for generating synthetic datasets given a complex dataset consisting of structured and unstructured text, as well as numerical data. We used a number of metrics to evaluate the generation pipeline according to its statistical similarity to the original dataset, its utility, and its privacy. Our experiments focused on varying the degree of privacy across the sub-modules of the pipeline. We found that we can generate differentially private synthetic datasets whose structured and unstructured components each achieve good performance in similarity, utility, and privacy.
Date issued
2023-06
URI
https://hdl.handle.net/1721.1/151313
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
