Show simple item record

dc.contributor.advisorIsaac L. Chuang.en_US
dc.contributor.authorNorthcutt, Curtis Georgeen_US
dc.contributor.otherMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.en_US
dc.date.accessioned2017-10-18T14:43:00Z
dc.date.available2017-10-18T14:43:00Z
dc.date.copyright2017en_US
dc.date.issued2017en_US
dc.identifier.urihttp://hdl.handle.net/1721.1/111870
dc.descriptionThesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.en_US
dc.descriptionThis electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.en_US
dc.descriptionCataloged from student-submitted PDF version of thesis.en_US
dc.descriptionIncludes bibliographical references (pages 113-122).en_US
dc.description.abstractMassive Open Online Courses (MOOCs) have the potential to enhance socioeconomic mobility through education. Yet, the viability of this outcome largely depends on the reputation of MOOC certificates as a credible academic credential. I describe a cheating strategy that threatens this reputation and holds the potential to render the MOOC certificate valueless. The strategy, Copying Answers using Multiple Existences Online (CAMEO), involves a user who gathers solutions to assessment questions using one or more harvester accounts and then submits correct answers using one or more separate master accounts. To estimate a lower bound for CAMEO prevalence among 1.9 million course participants in 115 HarvardX and MITx courses, I introduce a filter-based CAMEO detection algorithm and use a small-scale experiment to verify CAMEO use with certainty. I identify preventive strategies that can decrease CAMEO rates and show evidence of their effectiveness in science courses. Because the CAMEO algorithm functions as a lower bound estimate, it fails to detect many CAMEO cheaters. As a novelty of this thesis, instead of improving the shortcomings of the CAMEO algorithm directly, I recognize that we can think of the CAMEO algorithm as a method for producing noisy predicted cheating labels. Then a solution to the more general problem of binary classification with noisy labels ( ~ P̃̃̃ Ñ learning) is a solution to CAMEO cheating detection. ~ P̃ Ñ learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate 1 for positive examples and 0 for negative examples. I propose Rank Pruning to solve ~ P ~N learning and the open problem of estimating the noise rates. Unlike prior solutions, Rank Pruning is efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. I prove Rank Pruning achieves consistent noise estimation and equivalent expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. Rank Pruning achieves state-of-the-art noise rate estimation and F1, error, and AUC-PR on the MNIST and CIFAR datasets, regardless of noise rates. To highlight, Rank Pruning with a CNN classifier can predict if a MNIST digit is a one or not one with only 0:25% error, and 0:46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples. Rank Pruning achieves similarly impressive results when as large as 50% of training examples are actually just noise drawn from a third distribution. Together, the CAMEO and Rank Pruning algorithms allow for a robust, general, and time-efficient solution to the CAMEO cheating detection problem. By ensuring the validity of MOOC credentials, we enable MOOCs to achieve both openness and value, and thus take one step closer to the greater goal of democratization of education.en_US
dc.description.statementofresponsibilityby Curtis George Northcutt.en_US
dc.format.extent122 pagesen_US
dc.language.isoengen_US
dc.publisherMassachusetts Institute of Technologyen_US
dc.rightsMIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission.en_US
dc.rights.urihttp://dspace.mit.edu/handle/1721.1/7582en_US
dc.subjectElectrical Engineering and Computer Science.en_US
dc.titleClassification with noisy labels : "Multiple Account" cheating detection in Open Online Coursesen_US
dc.typeThesisen_US
dc.description.degreeS.M.en_US
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc1005737327en_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record