dc.description.abstract | Deep learning for visual perception on edge devices has become increasingly critical, driven by emerging applications in autonomous driving and AR/VR. In particular, sparse convolution on 3D point clouds and Visual Language Models (VLMs) for image processing are two important methods for visual understanding and reasoning. However, the limited compute resources and memory on edge devices pose significant challenges, necessitating specialized system support for deep learning models. The efficiency challenges for edge visual perception are twofold: first, the sparsity and inherent irregularity of point cloud data introduce substantial complexity for parallel processing; second, the colossal model sizes and computational demands of LLMs and VLMs make edge deployment particularly challenging. In this thesis, we address the efficiency issues of on-device deep learning through system-algorithm co-design. We first introduce TorchSparse++, a high-performance inference engine for sparse convolution on GPUs. Unlike existing sparse convolution systems, TorchSparse++ balances efficiency and implementation simplicity, achieving the best performance across different application scenarios. Specifically, we create a highly efficient Sparse Kernel Generator that produces performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9×, 3.3×, 2.2× and 1.7× measured end-to-end inference speedup on an NVIDIA A100 GPU over the state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2, and is 1.2-1.3× faster than SpConv v2 in mixed-precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6× faster inference than state-of-the-art graph deep learning libraries. Furthermore, to democratize the power of large foundation models in edge AI, we propose AWQ and TinyChat, a hardware-friendly full-stack solution for efficient on-device LLM and VLM deployment. AWQ is a novel quantization method built on the insight that not all weights in an LLM are equally important: protecting only 1% of the salient weights can greatly reduce quantization error. Specifically, AWQ applies an equivalent transformation that scales up the salient weight channels to reduce weight quantization error, with the scales determined offline from activation statistics. Alongside AWQ, we further introduce TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLMs and VLMs. With on-the-fly dequantization, extensive kernel fusion and platform-aware weight packing, TinyChat offers a 2.7-3.7× speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also enables the deployment of the 70B Llama-2 model on mobile GPUs. Together, these techniques significantly reduce the computational and memory costs of deploying deep learning models on edge devices, increasing the accessibility of deep learning for practical applications. We hope that this thesis will inspire future research on efficient edge AI across diverse modalities. | |