dc.contributor.author: Mozafari, Barzan
dc.contributor.author: Sarkar, Purna
dc.contributor.author: Franklin, Michael
dc.contributor.author: Jordan, Michael
dc.contributor.author: Madden, Samuel R.
dc.date.accessioned: 2016-01-20T18:35:50Z
dc.date.available: 2016-01-20T18:35:50Z
dc.date.issued: 2014-10
dc.identifier.issn: 21508097
dc.identifier.uri: http://hdl.handle.net/1721.1/100958
dc.description.abstract: Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1–2 orders of magnitude fewer questions than the baseline, and 4.5–44× fewer than existing active learning algorithms. [en_US]
dc.language.iso: en_US
dc.publisher: Association for Computing Machinery (ACM) [en_US]
dc.relation.isversionof: http://dx.doi.org/10.14778/2735471.2735474 [en_US]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/ [en_US]
dc.source: MIT web domain [en_US]
dc.title: Scaling up crowd-sourcing to very large datasets: a case for active learning [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up crowd-sourcing to very large datasets: a case for active learning. Proc. VLDB Endow. 8, 2 (October 2014), 125-136. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.contributor.mitauthor: Madden, Samuel R. [en_US]
dc.relation.journal: Proceedings of the VLDB Endowment [en_US]
dc.eprint.version: Author's final manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dspace.orderedauthors: Mozafari, Barzan; Sarkar, Purna; Franklin, Michael; Jordan, Michael; Madden, Samuel [en_US]
dc.identifier.orcid: https://orcid.org/0000-0002-7470-3265
mit.license: PUBLISHER_CC [en_US]
mit.metadata.status: Complete

