Efficient ML Inference via Matrix-Vector Approximations
Author(s)
Li, Daniel D.
Advisor
Lynch, Jayson
Abstract
Efficient inference is a growing priority in deep learning, where large model sizes and increasing deployment demands pose challenges for latency, memory, and energy usage. This thesis presents a unified framework for evaluating approximation methods that accelerate inference by modifying weight matrices. We model each method as a function f_c(A) that approximates a weight matrix A under a compression rate c, and assess its impact on both matrix–vector accuracy and downstream task performance. We conduct empirical evaluations across two representative models, AlexNet on CIFAR-10 and DistilBERT on AG News, comparing quantization, sparsification, and low-rank approximations. Our analysis spans four perspectives: (1) how different methods trade off ℓ₂ error and compression, (2) how weight statistics and input distributions shape error, (3) how well ℓ₂ error predicts classification accuracy, and (4) how idealized compression differs from real memory savings. We find that sparsification offers a strong trade-off between storage and accuracy, particularly because it preserves task-relevant structure in the weights. We also show that ℓ₂ error is not always a reliable proxy for accuracy, especially when input data lie on low-dimensional manifolds. These results suggest that approximation quality must be evaluated not only by global distortion metrics, but also by how the method interacts with model structure and input distributions. Our findings offer practical guidance for deploying efficient deep learning models and shed light on how compression affects performance in real-world settings.
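The three approximation families compared in the abstract can be sketched as follows. This is an illustrative reimplementation, not the thesis's actual experimental code: the matrix shape, random inputs, and the specific settings (rank 32, 25% density, 8-bit quantization, each giving roughly 4× idealized compression of a 256×256 float32 matrix) are assumptions chosen for the example.

```python
import numpy as np

def low_rank(A, rank):
    # Truncated SVD: keep only the top-`rank` singular components.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def sparsify(A, keep_frac):
    # Magnitude pruning: zero all but the largest-magnitude entries.
    k = int(A.size * keep_frac)
    thresh = np.sort(np.abs(A), axis=None)[-k]
    return np.where(np.abs(A) >= thresh, A, 0.0)

def quantize(A, bits):
    # Uniform symmetric quantization onto 2**bits - 1 levels.
    scale = np.abs(A).max() / (2 ** (bits - 1) - 1)
    return np.round(A / scale) * scale

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))   # stand-in weight matrix
x = rng.standard_normal(256)          # stand-in input vector

for name, fA in [("low-rank r=32", low_rank(A, 32)),
                 ("sparse 25%",    sparsify(A, 0.25)),
                 ("quant 8-bit",   quantize(A, 8))]:
    # Compare distortion of the matrix itself vs. of the matrix-vector product.
    mat_err = np.linalg.norm(A - fA) / np.linalg.norm(A)
    vec_err = np.linalg.norm(A @ x - fA @ x) / np.linalg.norm(A @ x)
    print(f"{name}: matrix rel-err {mat_err:.3f}, matvec rel-err {vec_err:.3f}")
```

The loop mirrors the abstract's distinction between matrix-level ℓ₂ error and matrix–vector accuracy: for structured inputs the two can diverge, which is why the thesis argues ℓ₂ error alone is an unreliable proxy for downstream performance.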
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology