ScaleGPS: Scalable Graph Parallel Sampling via Data-centric Performance Engineering
Author(s)
Cai, Miranda J.
Advisor
Chen, Xuhao
Abstract
Graph sampling extracts representative samples of a graph, so that approximate graph algorithms can be used in place of expensive, exact algorithms while still achieving high-quality results. Thus, graph sampling plays an important role in many modern graph-based applications, such as graph machine learning and graph data mining. However, because of the unstructured sparsity in graph data and the randomness in sampling algorithms, graph sampling is often the computational bottleneck. To accelerate it, parallel graph sampling methods exist for both multicore CPUs and GPUs, but each side has limitations. CPU implementations are much slower than GPU ones due to lower throughput, while limited GPU memory capacity restricts GPU implementations to small input graphs. We present the idea behind ScaleGPS, a scalable graph sampling framework that supports high-performance graph sampling on huge graphs on a single machine with a CPU and a GPU. The key idea is to cooperatively employ data caching and compression to reduce memory footprint and data movement overhead, and thus achieve high performance and scalability. The challenge in applying caching and compression to graph sampling is twofold. First, the randomness in sampling leads to redundant computation and memory accesses, and thus low work efficiency. Second, real-world graphs often exhibit skewed degree distributions, where a fixed strategy cannot optimally handle all cases. We propose a hybrid and adaptive strategy to address this challenge. First, we split the vertices in the graph into two groups based on their degrees. For each group, we store the neighbor lists in a different format, to make full use of the scarce GPU memory resources. Based on this hybrid compression method, we use the GPU memory as a cache for the CPU memory, and adaptively cache hot data to minimize the data movement overhead between the CPU and GPU. We implement our strategy in ScaleGPS and evaluate it on a single machine with a 48-core CPU and an A100 GPU. Our experimental results on various sampling algorithms show that ScaleGPS can handle billion-edge graphs (up to 84 billion edges) on a single machine. While the performance benefits on these large graphs are still undetermined, ScaleGPS achieves an average speedup of 33.4× (up to 93×) on smaller graphs over state-of-the-art parallel CPU implementations.
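The abstract does not include implementation details, but the degree-based vertex split it describes can be sketched roughly as follows. This is a minimal illustrative sketch, not ScaleGPS's actual code: the CSR layout, the threshold, and all identifiers (CSRGraph, split_by_degree, degree_threshold) are assumptions made for this example; the real system would additionally choose a storage format per group (e.g. compressed versus uncompressed neighbor lists) and manage GPU-resident caching, which is omitted here.

```cpp
// Hypothetical sketch of a degree-based vertex split (not ScaleGPS's real API).
#include <cstdint>
#include <cstdio>
#include <vector>

struct CSRGraph {
    std::vector<int64_t> row_ptr;  // offsets into col_idx, size |V|+1
    std::vector<int32_t> col_idx;  // concatenated neighbor lists
    int64_t degree(int32_t v) const { return row_ptr[v + 1] - row_ptr[v]; }
    int64_t num_vertices() const { return (int64_t)row_ptr.size() - 1; }
};

// Partition vertices into a high-degree and a low-degree group so that each
// group's neighbor lists can later be stored in a different format (e.g.
// keeping hot, high-degree lists resident in GPU memory and compressing the rest).
void split_by_degree(const CSRGraph& g, int64_t degree_threshold,
                     std::vector<int32_t>& high_degree,
                     std::vector<int32_t>& low_degree) {
    for (int32_t v = 0; v < g.num_vertices(); ++v) {
        if (g.degree(v) >= degree_threshold)
            high_degree.push_back(v);
        else
            low_degree.push_back(v);
    }
}

int main() {
    // Tiny example graph: vertex 0 has 3 neighbors, vertices 1-3 have 1 each.
    CSRGraph g{{0, 3, 4, 5, 6}, {1, 2, 3, 0, 0, 0}};
    std::vector<int32_t> hi, lo;
    split_by_degree(g, /*degree_threshold=*/2, hi, lo);
    std::printf("high-degree vertices: %zu, low-degree vertices: %zu\n",
                hi.size(), lo.size());  // prints 1 and 3
    return 0;
}
```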
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology