dc.contributor.author: Yu, Shangdi
dc.contributor.author: Wang, Yiqiu
dc.contributor.author: Gu, Yan
dc.contributor.author: Dhulipala, Laxman
dc.contributor.author: Shun, Julian
dc.date.accessioned: 2022-07-20T15:02:25Z
dc.date.available: 2022-07-20T15:02:25Z
dc.date.issued: 2021
dc.identifier.uri: https://hdl.handle.net/1721.1/143883
dc.description.abstract: This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle. [en_US]
dc.language.iso: en
dc.publisher: VLDB Endowment [en_US]
dc.relation.isversionof: 10.14778/3489496.3489509 [en_US]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivs License [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [en_US]
dc.source: VLDB Endowment [en_US]
dc.title: ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Yu, Shangdi, Wang, Yiqiu, Gu, Yan, Dhulipala, Laxman and Shun, Julian. 2021. "ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain." Proceedings of the VLDB Endowment, 15 (2).
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.relation.journal: Proceedings of the VLDB Endowment [en_US]
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2022-07-20T14:38:52Z
dspace.orderedauthors: Yu, S; Wang, Y; Gu, Y; Dhulipala, L; Shun, J [en_US]
dspace.date.submission: 2022-07-20T14:38:53Z
mit.journal.volume: 15 [en_US]
mit.journal.issue: 2 [en_US]
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

