Co-Designing Efficient Systems and Algorithms for Sparse and
Quantized Deep Learning Computing

Tang, Haotian

Author(s)

Tang, Haotian

DownloadThesis PDF (20.32Mb)

Advisor

Han, Song

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Metadata

Show full item record

Abstract

Deep learning models are becoming increasingly complex, expanding from 1D text and 2D images to 3D point clouds, while their size continues to grow exponentially. This trend highlights the need for greater efficiency. This thesis systematically explores efficiency in two resource-intensive domains—autonomous driving and generative AI—by focusing on fundamental model compression techniques: sparsity and quantization, alongside the co-optimization of systems and algorithms. Sparsity is crucial for autonomous vehicle (AV) applications. LiDAR processing, which requires 3D sparse computation, is inefficiently handled by current GPU libraries, creating a performance bottleneck in AV perception. To address this, we propose TorchSparse++, a high-performance GPU system for 3D sparse convolution, achieving 1.7-3.3× speedups over state-of-the-art libraries. Additionally, we introduce BEVFusion, an efficient multi-sensor fusion framework that fuses information in bird’s-eye-view (BEV) space, reducing computation by 1.9× while enhancing accuracy compared to prior methods. Generative AI is constrained by the massive size of models, necessitating quantization for efficient deployment. This thesis presents two GPU systems for accelerating large language models (LLMs): TinyChat for edge LLM deployment and QServe for cloud-based LLM serving. TinyChat boosts edge LLM inference by 3× using activation-aware weight quantization (AWQ). QServe further improves performance with activation and KV cache quantization, enhancing the throughput of NVIDIA TensorRT-LLM by 1.2-2.4× on A100 GPUs. Finally, we introduce HART, an efficient autoregressive image generation method that achieves 4.5-7.7× higher throughput compared to diffusion models while maintaining visual quality. HART achieves this improvement by leveraging quantized, or discrete, visual tokens to capture the high-level structure of images, while a lightweight diffusion model is used for fast inference of finer details.

Date issued

2025-02

URI

https://hdl.handle.net/1721.1/158928

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses