MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain

Author(s)
Yu, Shangdi; Wang, Yiqiu; Gu, Yan; Dhulipala, Laxman; Shun, Julian
Thumbnail
DownloadPublished version (1.492Mb)
Publisher with Creative Commons License

Publisher with Creative Commons License

Creative Commons Attribution

Terms of use
Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/
Metadata
Show full item record
Abstract
<jats:p>This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused.</jats:p> <jats:p>Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8--110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75--54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.</jats:p>
Date issued
2021
URI
https://hdl.handle.net/1721.1/143883
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Journal
Proceedings of the VLDB Endowment
Publisher
VLDB Endowment
Citation
Yu, Shangdi, Wang, Yiqiu, Gu, Yan, Dhulipala, Laxman and Shun, Julian. 2021. "ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain." Proceedings of the VLDB Endowment, 15 (2).
Version: Final published version

Collections
  • MIT Open Access Articles

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.