Co-Designing Efficient Systems and Algorithms for Sparse and Quantized Deep Learning Computing

Author(s)
Tang, Haotian
Thesis PDF (20.32 MB)
Advisor
Han, Song
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Deep learning models are becoming increasingly complex, expanding from 1D text and 2D images to 3D point clouds, while their size continues to grow exponentially. This trend highlights the need for greater efficiency. This thesis systematically explores efficiency in two resource-intensive domains, autonomous driving and generative AI, by focusing on fundamental model compression techniques: sparsity and quantization, alongside the co-optimization of systems and algorithms.

Sparsity is crucial for autonomous vehicle (AV) applications. LiDAR processing, which requires 3D sparse computation, is handled inefficiently by current GPU libraries, creating a performance bottleneck in AV perception. To address this, we propose TorchSparse++, a high-performance GPU system for 3D sparse convolution that achieves 1.7-3.3× speedups over state-of-the-art libraries. Additionally, we introduce BEVFusion, an efficient multi-sensor fusion framework that fuses information in bird's-eye-view (BEV) space, reducing computation by 1.9× while improving accuracy over prior methods.

Generative AI is constrained by massive model sizes, necessitating quantization for efficient deployment. This thesis presents two GPU systems for accelerating large language models (LLMs): TinyChat for edge LLM deployment and QServe for cloud-based LLM serving. TinyChat speeds up edge LLM inference by 3× using activation-aware weight quantization (AWQ). QServe further improves performance with activation and KV cache quantization, raising the throughput of NVIDIA TensorRT-LLM by 1.2-2.4× on A100 GPUs. Finally, we introduce HART, an efficient autoregressive image generation method that achieves 4.5-7.7× higher throughput than diffusion models while maintaining visual quality. HART leverages quantized, or discrete, visual tokens to capture the high-level structure of images, while a lightweight diffusion model rapidly refines the finer details.
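To make the activation-aware weight quantization (AWQ) idea behind TinyChat concrete, the PyTorch sketch below illustrates the core mechanism: weight channels that see large activations are scaled up before rounding and the scale is folded back out afterwards, so those salient channels are quantized on a finer grid. This is a minimal sketch, not the thesis's implementation: the function name, the fixed alpha = 0.5 scaling exponent, the group size, and the act_scale input (per-channel average activation magnitude gathered from a small calibration set) are all assumptions made for this example; the published AWQ method additionally searches for the best scaling exponent.

import torch

def awq_style_quantize(weight, act_scale, n_bits=4, group_size=128):
    """Hypothetical sketch of AWQ-style weight-only quantization.

    weight:    (out_features, in_features) linear-layer weight,
               with in_features divisible by group_size
    act_scale: (in_features,) mean |activation| per input channel,
               assumed collected from calibration data
    Returns a fake-quantized weight for accuracy simulation.
    """
    # Channels with large activations are scaled up before rounding,
    # shrinking their effective quantization step (alpha=0.5 heuristic;
    # the real method searches for the best exponent, skipped here).
    s = act_scale.clamp(min=1e-5) ** 0.5
    w = weight * s
    # Group-wise asymmetric uniform quantization to n_bits.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    step = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((wg - w_min) / step), 0, qmax)
    w_deq = (q * step + w_min).reshape(out_f, in_f)
    # Fold the scales back out so the layer computes the same function
    # up to quantization error; at runtime 1/s would instead be fused
    # into the preceding operator.
    return w_deq / s

The design intuition, as described in the AWQ line of work, is that a small fraction of salient channels dominates the quantization error, so protecting them recovers most of the accuracy lost to 4-bit weights while keeping the memory savings that make weight-only quantization attractive for memory-bound edge inference.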
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/158928
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
