
dc.contributor.advisor: Mądry, Aleksander
dc.contributor.author: Liao, Yunxing
dc.date.accessioned: 2022-08-29T16:20:00Z
dc.date.available: 2022-08-29T16:20:00Z
dc.date.issued: 2022-05
dc.date.submitted: 2022-05-27T16:18:46.252Z
dc.identifier.uri: https://hdl.handle.net/1721.1/144905
dc.description.abstract: Large curated datasets have been essential to the development of deep learning models across many disciplines. Consequently, the properties of these datasets have a large impact on the behavior of these models. As machine learning pipelines increasingly leverage unlabelled datasets—which tend to undergo less curation than labelled datasets—controlling data quality becomes even more important. We focus on a particular aspect of data quality: train-test leakage, i.e., duplicate examples shared between the training and test sets. Among other issues, such duplicates can cause overestimation of models' performance on benchmarks. In this work, we apply datamodels, a framework for analyzing the behavior of a model class as a function of its training data, to deduplicate unlabelled datasets. Inspired by the recent CLIP model, we focus on detecting duplicates between YFCC15M and the ImageNet validation dataset. Our results demonstrate how to adapt datamodels effectively for these filtering tasks in unsupervised, large-scale settings. We finish by discussing the challenges of our method and of duplicate detection more broadly.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Dataset Deduplication with Datamodels
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


