
dc.contributor.advisor: Mądry, Aleksander
dc.contributor.author: Liao, Yunxing
dc.date.accessioned: 2022-08-29T16:20:00Z
dc.date.available: 2022-08-29T16:20:00Z
dc.date.issued: 2022-05
dc.date.submitted: 2022-05-27T16:18:46.252Z
dc.identifier.uri: https://hdl.handle.net/1721.1/144905
dc.description.abstract: Large curated datasets have been essential to the development of deep learning models across many disciplines. Consequently, the properties of these datasets have a large impact on the behavior of these models. As machine learning pipelines increasingly leverage unlabelled datasets—which tend to undergo less curation than labelled datasets—controlling data quality becomes even more important. We focus on a particular aspect of data quality: train-test leakage, i.e., duplicate examples shared between the training and test sets. Among other issues, such duplicates can cause overestimation of models' performance on benchmarks. In this work, we apply datamodels, a framework for analyzing the behavior of a model class as a function of its training data, to deduplicate unlabelled datasets. Inspired by the recent CLIP model, we focus on detecting duplicates between YFCC15M and the ImageNet validation dataset. Our results demonstrate how to adapt datamodels effectively for these filtering tasks in unsupervised, large-scale settings. We finish by discussing the challenges of our method and of duplicate detection more broadly.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Dataset Deduplication with Datamodels
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


