Simple item record

dc.contributor.advisor: Han, Song
dc.contributor.author: Tang, Haotian
dc.date.accessioned: 2025-03-27T16:58:40Z
dc.date.available: 2025-03-27T16:58:40Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-03-04T17:26:06.142Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158928
dc.description.abstract: Deep learning models are becoming increasingly complex, expanding from 1D text and 2D images to 3D point clouds, while their size continues to grow exponentially. This trend highlights the need for greater efficiency. This thesis systematically explores efficiency in two resource-intensive domains—autonomous driving and generative AI—by focusing on fundamental model compression techniques: sparsity and quantization, alongside the co-optimization of systems and algorithms. Sparsity is crucial for autonomous vehicle (AV) applications. LiDAR processing, which requires 3D sparse computation, is inefficiently handled by current GPU libraries, creating a performance bottleneck in AV perception. To address this, we propose TorchSparse++, a high-performance GPU system for 3D sparse convolution, achieving 1.7-3.3× speedups over state-of-the-art libraries. Additionally, we introduce BEVFusion, an efficient multi-sensor fusion framework that fuses information in bird’s-eye-view (BEV) space, reducing computation by 1.9× while enhancing accuracy compared to prior methods. Generative AI is constrained by the massive size of models, necessitating quantization for efficient deployment. This thesis presents two GPU systems for accelerating large language models (LLMs): TinyChat for edge LLM deployment and QServe for cloud-based LLM serving. TinyChat boosts edge LLM inference by 3× using activation-aware weight quantization (AWQ). QServe further improves performance with activation and KV cache quantization, enhancing the throughput of NVIDIA TensorRT-LLM by 1.2-2.4× on A100 GPUs. Finally, we introduce HART, an efficient autoregressive image generation method that achieves 4.5-7.7× higher throughput compared to diffusion models while maintaining visual quality. HART achieves this improvement by leveraging quantized, or discrete, visual tokens to capture the high-level structure of images, while a lightweight diffusion model is used for fast inference of finer details.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Co-Designing Efficient Systems and Algorithms for Sparse and Quantized Deep Learning Computing
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy
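
The activation-aware weight quantization (AWQ) mentioned in the abstract above can be pictured with a minimal sketch: weights are quantized to low bit-width group by group, after scaling up input channels that see large activations so their rounding error shrinks. The function name, scaling heuristic, and group size below are illustrative assumptions, not the implementation described in the thesis.

    import torch

    def quantize_weight_activation_aware(weight, act_sample, n_bits=4, group_size=128):
        """Quantize a linear weight [out, in] to n_bits, scaling salient input
        channels (identified from activation magnitudes) before rounding."""
        # Per-input-channel saliency from a calibration activation sample [tokens, in].
        saliency = act_sample.abs().mean(dim=0)                         # [in]
        scale_ch = (saliency / saliency.mean()).clamp(min=1e-4).sqrt()  # heuristic per-channel scale

        # Scale salient channels up so their rounding error shrinks; the inverse
        # scale would be folded into the preceding activation at runtime.
        w = weight * scale_ch                                           # [out, in]

        # Group-wise symmetric quantization along the input dimension.
        out_f, in_f = w.shape
        groups = w.reshape(out_f, in_f // group_size, group_size)
        qmax = 2 ** (n_bits - 1) - 1
        step = (groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)  # per-group step
        w_q = torch.clamp(torch.round(groups / step), -qmax - 1, qmax)

        # Dequantized weights, as a kernel would reconstruct them on the fly.
        w_dq = (w_q * step).reshape(out_f, in_f) / scale_ch
        return w_q, step, scale_ch, w_dq

In this toy setup, channels with large calibration activations are effectively quantized at finer resolution, which is the intuition behind the activation-aware quantization the abstract attributes to TinyChat/AWQ.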

