Incremental random forest classifiers in spark

Siegel, Kathryn I. (Kathryn Iris)

dc.contributor.advisor	Samuel Madden.	en_US
dc.contributor.author	Siegel, Kathryn I. (Kathryn Iris)	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2016-12-22T16:29:12Z
dc.date.available	2016-12-22T16:29:12Z
dc.date.copyright	2016	en_US
dc.date.issued	2016	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/106105
dc.description	Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.	en_US
dc.description	Cataloged from PDF version of thesis.	en_US
dc.description	Includes bibliographical references (page 53).	en_US
dc.description.abstract	The random forest is a machine learning algorithm that has gained popularity due to its resistance to noise, good performance, and training efficiency. Random forests are typically constructed using a static dataset; to accommodate new data, random forests are usually regrown. This thesis presents two main strategies for updating random forests incrementally, rather than entirely rebuilding the forests. I implement these two strategies-incrementally growing existing trees and replacing old trees-in Spark Machine Learning(ML), a commonly used library for running ML algorithms in Spark. My implementation draws from existing methods in online learning literature, but includes several novel refinements. I evaluate the two implementations, as well as a variety of hybrid strategies, by recording their error rates and training times on four different datasets. My benchmarks show that the optimal strategy for incremental growth depends on the batch size and the presence of concept drift in a data workload. I find that workloads with large batches should be classified using a strategy that favors tree regrowth, while workloads with small batches should be classified using a strategy that favors incremental growth of existing trees. Overall, the system demonstrates significant efficiency gains when compared to the standard method of regrowing the random forest.	en_US
dc.description.statementofresponsibility	by Kathryn I. Siegel.	en_US
dc.format.extent	53 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Incremental random forest classifiers in spark	en_US
dc.type	Thesis	en_US
dc.description.degree	M. Eng.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	965551002	en_US

Files in this item

Name:: 965551002-MIT.pdf
Size:: 4.617Mb
Format:: PDF
Description:: Full printable version

View/Open

This item appears in the following Collection(s)

Graduate Theses

Show simple item record