Dataset Deduplication with Datamodels
Author(s)
Liao, Yunxing
Advisor
Mądry, Aleksander
Abstract
Large curated datasets have been essential to the development of deep learning models across many disciplines. Consequently, the properties of these datasets have a large impact on the behavior of these models. As machine learning pipelines increasingly rely on unlabelled datasets, which tend to undergo less curation than labelled datasets, controlling data quality becomes even more important. We focus on a particular aspect of data quality: train-test leakage, or duplicate examples. These can cause overestimation of models’ performance on benchmarks, among other issues. In this work, we apply datamodels, a framework for analyzing the behavior of a model class as a function of its training data, to deduplicate unlabelled datasets. Inspired by the recent CLIP model, we focus on detecting duplicates between YFCC15M and the ImageNet validation dataset. Our results demonstrate how to adapt datamodels effectively to these filtering tasks in unsupervised, large-scale settings. We conclude by discussing the challenges of our method and of duplicate detection more broadly.
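To make the high-level idea in the abstract concrete, below is a minimal sketch (not the thesis's actual pipeline) of how datamodel weights might be used to surface candidate train-test duplicates: since a datamodel expresses a test example's prediction as a (roughly linear) function of which training examples are included, a single training example with an outsized weight on a test example is a natural leakage candidate. The array shapes, the z-score threshold, and the function name `flag_candidate_duplicates` are illustrative assumptions, not the thesis's API.

```python
import numpy as np


def flag_candidate_duplicates(datamodel_weights, test_ids, train_ids,
                              z_threshold=6.0, top_k=3):
    """Flag (test, train) pairs whose datamodel weight is an extreme outlier.

    datamodel_weights: array of shape (n_test, n_train); entry [i, j] is the
    estimated effect of training example j on the model's output for test
    example i (the datamodel coefficient for that pair).
    """
    candidates = []
    for i, test_id in enumerate(test_ids):
        w = datamodel_weights[i]
        # z-score the weights for this test example; extreme positive
        # outliers suggest one training point dominates the prediction.
        z = (w - w.mean()) / (w.std() + 1e-12)
        top = np.argsort(z)[::-1][:top_k]
        for j in top:
            if z[j] >= z_threshold:
                candidates.append((test_id, train_ids[j], float(z[j])))
    # Highest-scoring pairs first; in practice these would be verified
    # manually or with an image-similarity check before removal.
    return sorted(candidates, key=lambda t: -t[2])


if __name__ == "__main__":
    # Synthetic demo: random weights plus one planted "duplicate" pair.
    rng = np.random.default_rng(0)
    n_test, n_train = 100, 5000
    weights = rng.normal(0.0, 0.01, size=(n_test, n_train))
    weights[7, 1234] = 0.5  # training example 1234 dominates test example 7
    pairs = flag_candidate_duplicates(weights,
                                      test_ids=list(range(n_test)),
                                      train_ids=list(range(n_train)))
    print(pairs[:5])
```

The per-test-example z-scoring is one simple way to operationalize "outsized influence"; any outlier criterion over the datamodel weights could be substituted.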
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology