Show simple item record

dc.contributor.author	Tang, Haotian
dc.contributor.author	Yang, Shang
dc.contributor.author	Liu, Zhijian
dc.contributor.author	Hong, Ke
dc.contributor.author	Yu, Zhongming
dc.contributor.author	Li, Xiuyu
dc.contributor.author	Dai, Guohao
dc.contributor.author	Wang, Yu
dc.contributor.author	Han, Song
dc.date.accessioned	2024-01-02T19:51:01Z
dc.date.available	2024-01-02T19:51:01Z
dc.date.issued	2023-10-28
dc.identifier.isbn	979-8-4007-0329-4
dc.identifier.uri	https://hdl.handle.net/1721.1/153260
dc.description.abstract	Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g., implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9×, 3.3×, 2.2×, and 1.7× measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse, and SpConv v2 in inference; and is 1.2-1.3× faster than SpConv v2 in mixed precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6× faster inference speed compared with state-of-the-art graph deep learning libraries. Our code is publicly released at https://github.com/mit-han-lab/torchsparse.	en_US
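The gather-GEMM-scatter dataflow that the abstract contrasts with implicit GEMM can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the TorchSparse++ API: the function name, argument names, and index maps are invented for illustration, and it shows only one weight-offset step of a sparse convolution.

```python
import numpy as np

def gather_gemm_scatter(features, weight, in_idx, out_idx, n_out):
    """One weight-offset step of sparse convolution (illustrative sketch).

    features: (N_in, C_in) features of the active (non-empty) input points
    weight:   (C_in, C_out) kernel weights for one spatial offset
    in_idx / out_idx: input-to-output index map for this offset
    """
    gathered = features[in_idx]           # gather: collect matched input rows
    partial = gathered @ weight           # GEMM: one dense matrix multiply
    out = np.zeros((n_out, weight.shape[1]))
    np.add.at(out, out_idx, partial)      # scatter: accumulate into outputs
    return out

# Toy example: 4 active input points, 3 output points, C_in = C_out = 2.
feats = np.arange(8, dtype=float).reshape(4, 2)
w = np.eye(2)
out = gather_gemm_scatter(feats, w,
                          in_idx=np.array([0, 2, 3]),
                          out_idx=np.array([1, 1, 2]),
                          n_out=3)
```

The simplicity the abstract mentions is visible here: each of the three phases is a separate pass over memory, which is easy to write but leaves the gather and scatter steps memory-bound, whereas implicit-GEMM-style kernels fuse these phases to overlap computation with memory access.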
dc.publisher	ACM|56th Annual IEEE/ACM International Symposium on Microarchitecture	en_US
dc.relation.isversionof	https://doi.org/10.1145/3613424.3614303	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.title	TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs	en_US
dc.type	Article	en_US
dc.identifier.citation	Tang, Haotian, Yang, Shang, Liu, Zhijian, Hong, Ke, Yu, Zhongming et al. 2023. "TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs."
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.mitlicense	PUBLISHER_CC
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2024-01-01T08:47:54Z
dc.language.rfc3066	en
dc.rights.holder	The author(s)
dspace.date.submission	2024-01-01T08:47:55Z
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US


Files in this item


This item appears in the following Collection(s)
