LANCET: labeling complex data at scale

Zhang, Huayi; Cao, Lei; Madden, Samuel; Rundensteiner, Elke

dc.contributor.author	Zhang, Huayi
dc.contributor.author	Cao, Lei
dc.contributor.author	Madden, Samuel
dc.contributor.author	Rundensteiner, Elke
dc.date.accessioned	2022-07-15T16:19:12Z
dc.date.available	2022-07-15T16:19:12Z
dc.date.issued	2021
dc.identifier.uri	https://hdl.handle.net/1721.1/143771
dc.description.abstract	<jats:p>Cutting-edge machine learning techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques provide, at best, isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label as its nearby labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aiming to ensure that unlabeled objects always have some sufficiently close labeled neighbors. These two strategies jointly maximize the accuracy of the automatically produced labels and the prediction accuracy of the machine learning models trained on these labels. Lastly, LANCET uses a distribution matching network to verify whether both the Covariate-shift and Continuity conditions hold, in which case it is safe to terminate the labeling process.
Our experiments on diverse public data sets demonstrate that LANCET consistently outperforms state-of-the-art methods, from Snuba to GOGGLES, and other baselines by a large margin, with an increase in accuracy of up to 30 percentage points.</jats:p>	en_US
dc.language.iso	en
dc.publisher	VLDB Endowment	en_US
dc.relation.isversionof	10.14778/3476249.3476269	en_US
dc.rights	Creative Commons Attribution-NonCommercial-NoDerivs License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
dc.source	VLDB Endowment	en_US
dc.title	LANCET: labeling complex data at scale	en_US
dc.type	Article	en_US
dc.identifier.citation	Zhang, Huayi, Cao, Lei, Madden, Samuel and Rundensteiner, Elke. 2021. "LANCET: labeling complex data at scale." Proceedings of the VLDB Endowment, 14 (11).
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.relation.journal	Proceedings of the VLDB Endowment	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2022-07-15T16:03:56Z
dspace.orderedauthors	Zhang, H; Cao, L; Madden, S; Rundensteiner, E	en_US
dspace.date.submission	2022-07-15T16:03:57Z
mit.journal.volume	14	en_US
mit.journal.issue	11	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: 3476249.3476269.pdf
Size: 695.0Kb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
