dc.contributor.advisor | Samuel Madden. | en_US |
dc.contributor.author | Siegel, Kathryn I. (Kathryn Iris) | en_US |
dc.contributor.other | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. | en_US |
dc.date.accessioned | 2016-12-22T16:29:12Z | |
dc.date.available | 2016-12-22T16:29:12Z | |
dc.date.copyright | 2016 | en_US |
dc.date.issued | 2016 | en_US |
dc.identifier.uri | http://hdl.handle.net/1721.1/106105 | |
dc.description | Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. | en_US |
dc.description | Cataloged from PDF version of thesis. | en_US |
dc.description | Includes bibliographical references (page 53). | en_US |
dc.description.abstract | The random forest is a machine learning algorithm that has gained popularity due to its resistance to noise, good performance, and training efficiency. Random forests are typically constructed using a static dataset; to accommodate new data, random forests are usually regrown. This thesis presents two main strategies for updating random forests incrementally, rather than entirely rebuilding the forests. I implement these two strategies-incrementally growing existing trees and replacing old trees-in Spark Machine Learning(ML), a commonly used library for running ML algorithms in Spark. My implementation draws from existing methods in online learning literature, but includes several novel refinements. I evaluate the two implementations, as well as a variety of hybrid strategies, by recording their error rates and training times on four different datasets. My benchmarks show that the optimal strategy for incremental growth depends on the batch size and the presence of concept drift in a data workload. I find that workloads with large batches should be classified using a strategy that favors tree regrowth, while workloads with small batches should be classified using a strategy that favors incremental growth of existing trees. Overall, the system demonstrates significant efficiency gains when compared to the standard method of regrowing the random forest. | en_US |
dc.description.statementofresponsibility | by Kathryn I. Siegel. | en_US |
dc.format.extent | 53 pages | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Massachusetts Institute of Technology | en_US |
dc.rights | M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. | en_US |
dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
dc.subject | Electrical Engineering and Computer Science. | en_US |
dc.title | Incremental random forest classifiers in spark | en_US |
dc.type | Thesis | en_US |
dc.description.degree | M. Eng. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
dc.identifier.oclc | 965551002 | en_US |