
dc.contributor.advisor	Jegelka, Stefanie
dc.contributor.advisor	Torralba, Antonio
dc.contributor.author	Chuang, Ching-Yao
dc.date.accessioned	2023-11-02T20:14:23Z
dc.date.available	2023-11-02T20:14:23Z
dc.date.issued	2023-09
dc.date.submitted	2023-09-21T14:26:22.982Z
dc.identifier.uri	https://hdl.handle.net/1721.1/152764
dc.description.abstract	The field of machine learning has witnessed growing interest in learning from uncurated data, that is, training models on data that has not been carefully curated or labeled. Such data is typically noisy, incomplete, and riddled with errors, making it challenging for machine learning algorithms to learn effectively. This thesis develops robust learning methods that can effectively leverage uncurated data while remaining resilient to its inherent noise and errors. Specifically, we investigate the robustness of contrastive learning, a prominent self-supervised representation learning technique that contrasts semantically similar and dissimilar pairs of samples. First, we examine the fundamental challenge inherent in learning from unlabeled data and find that eliminating false negatives and encouraging hard negatives notably enhance downstream performance and training efficiency. Next, we turn to the noise that pervades such datasets, paying particular attention to false positive pairs, a phenomenon especially prevalent in multimodal contrastive learning settings. In the final part of the thesis, we consider how to efficiently remove biases from large-scale models: when models are pretrained on biased, uncurated data, they frequently inherit inappropriate biases that consequently skew their predictions. To rectify this, we devise a debiasing algorithm that requires neither data nor additional training. Throughout the dissertation, the common thread tying these three components together is a robust and comprehensive approach to mitigating the distinct error types associated with unlabeled, noisy, and biased data, respectively, offering substantial contributions to machine learning research. (A minimal illustrative sketch of the contrastive objective referenced here is given after the record fields below.)
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Robust Learning from Uncurated Data
dc.type	Thesis
dc.description.degree	Ph.D.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Doctoral
thesis.degree.name	Doctor of Philosophy
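
The abstract above refers to contrastive objectives that pull semantically similar pairs of samples together and push dissimilar pairs apart. As a point of reference only, the following is a minimal sketch of a standard InfoNCE-style contrastive loss in PyTorch; it is not the thesis's specific debiased or hard-negative estimator, and the function and parameter names (info_nce_loss, temperature) are illustrative assumptions.

# Minimal sketch of a standard InfoNCE-style contrastive loss (assumption:
# this is generic contrastive learning, not the thesis's exact objective).
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor,
                  positives: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """anchors, positives: (batch, dim) embeddings of two views of the same
    samples; the other in-batch samples serve as negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature                      # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positive pair
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Random embeddings, purely to show the call signature.
    x = torch.randn(32, 128)
    y = x + 0.01 * torch.randn(32, 128)   # a slightly perturbed "second view"
    print(info_nce_loss(x, y).item())

In this formulation, every non-matching in-batch pair is treated as a negative, which is exactly where the false-negative and hard-negative issues discussed in the abstract arise.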

