Scaling up crowd-sourcing to very large datasets: a case for active learning

Mozafari, Barzan; Sarkar, Purna; Franklin, Michael; Jordan, Michael; Madden, Samuel

dc.contributor.author	Mozafari, Barzan
dc.contributor.author	Sarkar, Purna
dc.contributor.author	Franklin, Michael
dc.contributor.author	Jordan, Michael
dc.contributor.author	Madden, Samuel R.
dc.date.accessioned	2016-01-20T18:35:50Z
dc.date.available	2016-01-20T18:35:50Z
dc.date.issued	2014-10
dc.identifier.issn	21508097
dc.identifier.uri	http://hdl.handle.net/1721.1/100958
dc.description.abstract	Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1-2 orders of magnitude fewer questions than the baseline, and 4.5-44× fewer than existing active learning algorithms.	en_US
dc.language.iso	en_US
dc.publisher	Association for Computing Machinery (ACM)	en_US
dc.relation.isversionof	http://dx.doi.org/10.14778/2735471.2735474	en_US
dc.rights	Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/	en_US
dc.source	MIT web domain	en_US
dc.title	Scaling up crowd-sourcing to very large datasets: a case for active learning	en_US
dc.type	Article	en_US
dc.identifier.citation	Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up crowd-sourcing to very large datasets: a case for active learning. Proc. VLDB Endow. 8, 2 (October 2014), 125-136.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.mitauthor	Madden, Samuel R.	en_US
dc.relation.journal	Proceedings of the VLDB Endowment	en_US
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dspace.orderedauthors	Mozafari, Barzan; Sarkar, Purna; Franklin, Michael; Jordan, Michael; Madden, Samuel	en_US
dc.identifier.orcid	https://orcid.org/0000-0002-7470-3265
mit.license	PUBLISHER_CC	en_US
mit.metadata.status	Complete

Files in this item

Name: Madden_Scaling up.pdf
Size: 1.625Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles