Bridging Theory and Practice in Parallel Clustering
Author(s)
Shi, Jessica
Advisor
Shun, Julian
Abstract
Large-scale graph processing is a fundamental tool in modern data mining, with wide-ranging applications in domains including social network analysis, bioinformatics, and machine learning. In particular, graph clustering, or community detection, is an important problem in graph processing that addresses tangible problems including fraud and threat detection, recommendation and search system design, and the detection of functional contributions of proteins and genes in biological systems. At its core, identifying the underlying substructures of a graph can indicate essential functional groups, such as people with similar interests, news articles on similar topics, or proteins with similar utilities, which can then be synthesized for a variety of applications. However, as the need to analyze larger and larger data sets increases, graph processing poses a major computational challenge, and designing scalable algorithms that can handle billions of edges while maintaining fast performance and high quality becomes crucial.
This thesis addresses the challenges of designing highly scalable graph clustering solutions by bridging theory and practice in parallel algorithms. The thesis takes a two-pronged approach: first, we develop algorithms with strong theoretical guarantees, which often translate to significant performance improvements in practice, and second, we apply performance engineering techniques on top of these theoretically efficient algorithms to achieve fast implementations on real-world data sets. The results are highly scalable and provably efficient algorithms for a broad class of computationally challenging graph clustering problems, and the first practical solutions to a number of problems on graphs with hundreds of billions of edges. Some of the implementations in this thesis are used in production environments in industry, and have significant real-world impact.
The first part of this thesis studies the efficient counting and enumeration of small subgraphs, including small cycles and cliques, which has applications in clustering metrics and graph statistics. We design new theoretically efficient parallel algorithms for exact and approximate butterfly (four-cycle), five-cycle, and k-clique counting, and demonstrate significant performance improvements over the prior state-of-the-art small subgraph counting implementations. This part of the thesis also provides algorithms for low out-degree orientations, which are crucial as subroutines in our counting and enumeration algorithms to reduce the required work. Notably, we are the first to report four-clique counts for the largest publicly available graph with over two hundred billion undirected edges. We also explore the batch-dynamic setting, in which graph properties are maintained over batches of multiple edge updates applied simultaneously, and present a novel parallel batch-dynamic data structure that we leverage in a wide variety of classic graph processing applications, including the k-core decomposition, low out-degree orientations, and k-clique counting. Importantly, our algorithm is the first parallel batch-dynamic algorithm for k-clique counting to achieve polylogarithmic span, or a polylogarithmic longest chain of sequential dependencies.
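To illustrate why an orientation helps with clique counting, the following is a minimal sequential sketch, not the thesis's parallel algorithm. It assumes integer vertex ids and uses a simple degree ordering as a stand-in for a low out-degree orientation; the function names `orient_by_degree` and `count_k_cliques` are hypothetical. Because every edge points from a lower-ranked to a higher-ranked endpoint, each clique is discovered exactly once, starting from its lowest-ranked vertex.

```python
from collections import defaultdict


def orient_by_degree(edges):
    """Direct each undirected edge from its lower-ranked endpoint to its
    higher-ranked one, ranking vertices by (degree, id).

    Assumption: degree ordering is a simple heuristic used here for
    illustration; it is not the orientation algorithm from the thesis.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    rank = {v: (len(adj[v]), v) for v in adj}
    return {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}


def count_k_cliques(edges, k):
    """Count k-cliques by repeatedly intersecting out-neighborhoods."""
    out = orient_by_degree(edges)

    def extend(candidates, remaining):
        # `candidates` holds vertices adjacent to every vertex chosen so far.
        if remaining == 0:
            return 1
        return sum(extend(candidates & out[v], remaining - 1)
                   for v in candidates)

    return extend(set(out), k)


# A 4-clique on vertices 0..3 plus a pendant edge (3, 4).
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
triangles = count_k_cliques(example_edges, 3)
four_cliques = count_k_cliques(example_edges, 4)
```

The work of the recursion is governed by the out-degrees, since each level only intersects with an out-neighborhood, which is why keeping out-degrees low reduces the required work.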
The second part of this thesis addresses a class of problems relating to the discovery and classification of dense substructures within a graph, focusing on hierarchical decompositions that reveal structural properties of the underlying graph with different notions or levels of density. We address bi-core decomposition and butterfly peeling, which are specialized algorithms for bipartite graphs, and we study k-clique peeling and nucleus decomposition, which generalize classic decomposition algorithms to higher order structures. This part leverages our subgraph counting algorithms to present new theoretically efficient parallel algorithms for these decomposition problems, and shows that many of these problems are P-complete, suggesting that solutions that take polylogarithmic span are unlikely. We also explore approximation algorithms as a result, and conduct thorough experimental evaluations comparing speed and accuracy trade-offs between our exact and approximate implementations.
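The peeling idea behind these decompositions can be seen in the classic k-core decomposition, which k-clique peeling and nucleus decomposition generalize. The sketch below is a minimal sequential version under stated assumptions (integer vertex ids, a simple undirected graph); the name `core_numbers` is hypothetical, and the code is not the parallel algorithm developed in the thesis. It repeatedly removes a vertex of minimum remaining degree and records the largest minimum degree seen so far as that vertex's core number.

```python
import heapq
from collections import defaultdict


def core_numbers(edges):
    """Return a dict mapping each vertex to its core number via peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(adj[v]) for v in adj}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)

    core = {}
    current = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in core or d != degree[v]:
            continue  # stale heap entry left over from an earlier degree
        current = max(current, d)
        core[v] = current
        for w in adj[v]:
            if w not in core:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
    return core


# A 4-clique on vertices 0..3 plus a pendant vertex 4 attached to vertex 3.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
cores = core_numbers(example_edges)
```

Each removal depends on the degrees left by the previous ones, which gives some intuition for the long chains of sequential dependencies that make low-span exact solutions difficult.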
The final part of this thesis focuses on highly scalable graph clustering algorithms that are effective in practice and give good quality clusters compared to ground truth data, considering a broad class of classification tasks. We study classic graph clustering algorithms including correlation clustering, modularity clustering, and hierarchical agglomerative clustering. This part develops heuristic and approximation algorithms for classic graph clustering objective functions, and additionally demonstrates important relationships between graph clustering algorithms and their counterparts in pointset clustering.
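As a concrete example of a graph clustering objective, the sketch below evaluates the standard modularity score of a given clustering, Q = sum over clusters c of (m_c / m) - (d_c / 2m)^2, where m is the number of edges, m_c the number of edges inside c, and d_c the total degree of vertices in c. It assumes an unweighted simple graph with at least one edge and the default resolution; the function name `modularity` and the input format are hypothetical, and the code only scores a clustering rather than optimizing one.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Score a clustering; `cluster_of` maps each vertex to a cluster label."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the cluster
    degree_sum = defaultdict(int)  # total degree of the cluster's vertices
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}
q_two = modularity(example_edges, two_clusters)
q_one = modularity(example_edges, one_cluster)
```

Splitting the graph at the bridge scores higher than placing every vertex in one cluster, which matches the intuition that the two triangles form separate communities.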
Date issued
2023-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology