What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

Raykov, Yordan P.; Boukouvalas, Alexis; Baig, Fahd; Little, Max A.

dc.contributor.author	Raykov, Yordan P.
dc.contributor.author	Boukouvalas, Alexis
dc.contributor.author	Baig, Fahd
dc.contributor.author	Little, Max
dc.date.accessioned	2017-05-16T18:43:38Z
dc.date.available	2017-05-16T18:43:38Z
dc.date.issued	2016-09
dc.date.submitted	2016-01
dc.identifier.issn	1932-6203
dc.identifier.uri	http://hdl.handle.net/1721.1/109129
dc.description.abstract	The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.	en_US
dc.language.iso	en_US
dc.publisher	Public Library of Science	en_US
dc.relation.isversionof	http://dx.doi.org/10.1371/journal.pone.0162259	en_US
dc.rights	Creative Commons Attribution 4.0 International License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en_US
dc.source	PLoS	en_US
dc.title	What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm	en_US
dc.type	Article	en_US
dc.identifier.citation	Raykov, Yordan P.; Boukouvalas, Alexis; Baig, Fahd and Little, Max A. “What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm.” Edited by Byung-Jun Yoon. PLOS ONE 11, no. 9 (September 2016): e0162259. © 2016 Raykov et al	en_US
dc.contributor.department	Program in Media Arts and Sciences (Massachusetts Institute of Technology)	en_US
dc.contributor.mitauthor	Little, Max
dc.relation.journal	PLOS ONE	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Raykov, Yordan P.; Boukouvalas, Alexis; Baig, Fahd; Little, Max A.	en_US
dspace.embargo.terms	N	en_US
mit.license	PUBLISHER_CC	en_US

Files in this item

Name: Raykov-2016-What to Do When ...
Size: 3.449Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles