Sparse Expansion and Neuronal Disentanglement
Author(s)
Kong, Linghao
Advisor
Shavit, Nir N.
Abstract
We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the same weights, one-shot pruned for a specific cluster of input values. We call this approach Sparse Expansion. We show that for models like Llama 2 7B, as we increase the number of experts, Sparse Expansion outperforms all other one-shot sparsification approaches for the same FLOPs budget, and this gap grows as sparsity increases. But why? To answer this, we provide strong evidence that the mixture of sparse experts is effectively disentangling the input-output relationship of every individual neuron. Sparse experts approximate a neuron’s dense output distribution with fewer weights by decomposing the distribution into a collection of simpler ones, each covered by its own sparse dot product. Interestingly, we show that the Wasserstein distance between a neuron’s output distribution and a Gaussian distribution is an indicator of its entanglement level and of its contribution to the accuracy of the model. Every layer of an LLM has highly entangled neurons, and model performance suffers more when these are sparsified than when others are. We believe that these neurons may have implications beyond sparsity in understanding the performance of LLMs.
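
The expansion step described above can be sketched in a few lines. Below is a minimal, illustrative Python version for a single linear layer: calibration inputs are clustered with k-means, and each expert starts from the same dense weight matrix and is one-shot pruned against the inputs routed to its own cluster. A Wanda-style input-aware magnitude score stands in for the one-shot pruner used in the thesis, and all names and shapes here are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans


def input_aware_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot prune a (d_in, d_out) weight matrix for one cluster's inputs X,
    scoring weights by |W_ij| * ||X[:, i]||_2 (a Wanda-style criterion standing
    in for the thesis's one-shot pruner)."""
    score = np.abs(W) * np.linalg.norm(X, axis=0)[:, None]
    k = int(W.size * sparsity)  # number of weights to zero out
    if k == 0:
        return W.copy()
    thresh = np.partition(score.ravel(), k - 1)[k - 1]
    return np.where(score > thresh, W, 0.0)


class SparseExpansionLayer:
    """One dense layer expanded into a mixture of sparse experts: every expert
    is a copy of the SAME dense weights, pruned separately for the cluster of
    inputs it will serve."""

    def __init__(self, W, n_experts, sparsity, calib_X):
        # Route inputs by k-means fit on a calibration set.
        self.router = KMeans(n_clusters=n_experts, n_init=10).fit(calib_X)
        labels = self.router.labels_
        # Prune one copy of W per cluster, using that cluster's inputs.
        self.experts = [
            input_aware_prune(W, calib_X[labels == e], sparsity)
            for e in range(n_experts)
        ]

    def forward(self, x):
        """Send each input through the sparse expert of its nearest cluster."""
        ids = self.router.predict(x)
        out = np.empty((x.shape[0], self.experts[0].shape[1]))
        for e, W_e in enumerate(self.experts):
            m = ids == e
            if m.any():
                out[m] = x[m] @ W_e
        return out


# Illustrative usage with random data (shapes only; not real LLM weights).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
calib = rng.standard_normal((512, 64))
layer = SparseExpansionLayer(W, n_experts=4, sparsity=0.9, calib_X=calib)
y = layer.forward(rng.standard_normal((8, 64)))  # shape (8, 32)
```

Note the design point implied by the abstract's fixed FLOPs budget: each token is routed to exactly one expert, so the per-token cost stays that of a single sparse matrix product regardless of how many experts the layer is expanded into.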
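
The entanglement indicator can likewise be sketched: compare a neuron's empirical output distribution against a moment-matched Gaussian via the 1-D Wasserstein distance. The sketch below uses `scipy.stats.wasserstein_distance` with a sampled Gaussian reference; the function name `gaussianity_gap` and the sampling-based comparison are illustrative assumptions, not the thesis's code.

```python
import numpy as np
from scipy.stats import norm, wasserstein_distance


def gaussianity_gap(outputs: np.ndarray, n_ref: int = 100_000, seed: int = 0) -> float:
    """W1 distance between a neuron's outputs and a Gaussian fit to their
    mean and std. Per the abstract, larger values indicate a more entangled
    neuron, whose sparsification hurts the model more."""
    mu, sigma = outputs.mean(), outputs.std()
    ref = norm.rvs(loc=mu, scale=sigma, size=n_ref,
                   random_state=np.random.default_rng(seed))
    return wasserstein_distance(outputs, ref)


# Example: a bimodal output distribution scores a much larger gap than a
# near-Gaussian one -- the kind of distribution a single sparse dot product
# covers poorly but a collection of per-cluster sparse experts covers well.
rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(-2, 0.5, 5000), rng.normal(2, 0.5, 5000)])
print(gaussianity_gap(bimodal))                  # large gap
print(gaussianity_gap(rng.normal(0, 1, 10000)))  # near zero
```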
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology