Simple item record

dc.contributor.advisor: Han, Song
dc.contributor.author: Tang, Haotian
dc.date.accessioned: 2025-03-27T16:58:40Z
dc.date.available: 2025-03-27T16:58:40Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-03-04T17:26:06.142Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158928
dc.description.abstract: Deep learning models are becoming increasingly complex, expanding from 1D text and 2D images to 3D point clouds, while their size continues to grow exponentially. This trend highlights the need for greater efficiency. This thesis systematically explores efficiency in two resource-intensive domains—autonomous driving and generative AI—by focusing on fundamental model compression techniques: sparsity and quantization, alongside the co-optimization of systems and algorithms. Sparsity is crucial for autonomous vehicle (AV) applications. LiDAR processing, which requires 3D sparse computation, is inefficiently handled by current GPU libraries, creating a performance bottleneck in AV perception. To address this, we propose TorchSparse++, a high-performance GPU system for 3D sparse convolution, achieving 1.7-3.3× speedups over state-of-the-art libraries. Additionally, we introduce BEVFusion, an efficient multi-sensor fusion framework that fuses information in bird’s-eye-view (BEV) space, reducing computation by 1.9× while enhancing accuracy compared to prior methods. Generative AI is constrained by the massive size of models, necessitating quantization for efficient deployment. This thesis presents two GPU systems for accelerating large language models (LLMs): TinyChat for edge LLM deployment and QServe for cloud-based LLM serving. TinyChat boosts edge LLM inference by 3× using activation-aware weight quantization (AWQ). QServe further improves performance with activation and KV cache quantization, enhancing the throughput of NVIDIA TensorRT-LLM by 1.2-2.4× on A100 GPUs. Finally, we introduce HART, an efficient autoregressive image generation method that achieves 4.5-7.7× higher throughput compared to diffusion models while maintaining visual quality. HART achieves this improvement by leveraging quantized, or discrete, visual tokens to capture the high-level structure of images, while a lightweight diffusion model is used for fast inference of finer details.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Co-Designing Efficient Systems and Algorithms for Sparse and Quantized Deep Learning Computing
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy
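
The activation-aware weight quantization (AWQ) mentioned in the abstract above can be pictured with a minimal sketch: weights are quantized to low bit-width group by group, after scaling up input channels that see large activations so their rounding error shrinks. The function name, scaling heuristic, and group size below are illustrative assumptions, not the implementation described in the thesis.

    import torch

    def quantize_weight_activation_aware(weight, act_sample, n_bits=4, group_size=128):
        """Quantize a linear weight [out, in] to n_bits, scaling salient input
        channels (identified from activation magnitudes) before rounding."""
        # Per-input-channel saliency from a calibration activation sample [tokens, in].
        saliency = act_sample.abs().mean(dim=0)                         # [in]
        scale_ch = (saliency / saliency.mean()).clamp(min=1e-4).sqrt()  # heuristic per-channel scale

        # Scale salient channels up so their rounding error shrinks; the inverse
        # scale would be folded into the preceding activation at runtime.
        w = weight * scale_ch                                           # [out, in]

        # Group-wise symmetric quantization along the input dimension.
        out_f, in_f = w.shape
        groups = w.reshape(out_f, in_f // group_size, group_size)
        qmax = 2 ** (n_bits - 1) - 1
        step = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)  # per-group step
        w_q = torch.clamp(torch.round(groups / step), -qmax - 1, qmax)

        # Dequantized weights, as a kernel would reconstruct them on the fly.
        w_dq = (w_q * step).reshape(out_f, in_f) / scale_ch
        return w_q, step, scale_ch, w_dq

In this toy setup, channels with large calibration activations are effectively quantized at finer resolution, which is the intuition behind the activation-aware quantization the abstract attributes to TinyChat/AWQ.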

