Co-Designing Efficient Systems and Algorithms for Sparse and Quantized Deep Learning Computing
Author(s)
Tang, Haotian
Advisor
Han, Song
Abstract
Deep learning models are becoming increasingly complex, expanding from 1D text and 2D images to 3D point clouds, while their size continues to grow exponentially. This trend highlights the need for greater efficiency. This thesis systematically explores efficiency in two resource-intensive domains, autonomous driving and generative AI, by focusing on fundamental model compression techniques: sparsity and quantization, alongside the co-optimization of systems and algorithms.

Sparsity is crucial for autonomous vehicle (AV) applications. LiDAR processing, which requires 3D sparse computation, is inefficiently handled by current GPU libraries, creating a performance bottleneck in AV perception. To address this, we propose TorchSparse++, a high-performance GPU system for 3D sparse convolution, achieving 1.7-3.3× speedups over state-of-the-art libraries. Additionally, we introduce BEVFusion, an efficient multi-sensor fusion framework that fuses information in bird’s-eye-view (BEV) space, reducing computation by 1.9× while enhancing accuracy compared to prior methods.

Generative AI is constrained by the massive size of models, necessitating quantization for efficient deployment. This thesis presents two GPU systems for accelerating large language models (LLMs): TinyChat for edge LLM deployment and QServe for cloud-based LLM serving. TinyChat boosts edge LLM inference by 3× using activation-aware weight quantization (AWQ). QServe further improves performance with activation and KV cache quantization, enhancing the throughput of NVIDIA TensorRT-LLM by 1.2-2.4× on A100 GPUs.

Finally, we introduce HART, an efficient autoregressive image generation method that achieves 4.5-7.7× higher throughput compared to diffusion models while maintaining visual quality. HART achieves this improvement by leveraging quantized, or discrete, visual tokens to capture the high-level structure of images, while a lightweight diffusion model is used for fast inference of finer details.
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology