The Lottery Ticket Hypothesis: On Sparse, Trainable Neural Networks
Author(s)
Frankle, Jonathan
Download: Thesis PDF (13.00 MB)
Advisor
Carbin, Michael
Abstract
In this thesis, I show that, from an early point in training, typical neural networks for computer vision contain subnetworks capable of training in isolation to the same accuracy as the original unpruned network. These subnetworks, which I find retroactively by pruning after training and rewinding the surviving weights to their values from earlier in training, are the same size as those produced by state-of-the-art pruning techniques applied after training. They rely on a combination of structure and initialization: if either is modified (by reinitializing the network or by shuffling which weights are pruned within each layer), accuracy drops.
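The procedure described in the paragraph above (train, prune the smallest-magnitude weights, then rewind the surviving weights to their early-training values) can be summarized in a short sketch. This is a hedged illustration, not the thesis's implementation: the `train_fn` helper, the use of a single global magnitude threshold, and the `rewind_step` and `prune_fraction` values are placeholder assumptions.

```python
# A minimal sketch of "prune after training, rewind to early weights" in PyTorch.
# The training loop (`train_fn`) and hyperparameters are hypothetical placeholders.
import copy
import torch
import torch.nn as nn

def global_magnitude_mask(model: nn.Module, fraction: float) -> dict:
    """Mask the `fraction` of weights with the smallest magnitudes, pooled across layers."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for n, p in model.named_parameters() if "weight" in n])
    k = int(fraction * all_weights.numel())
    threshold = all_weights.kthvalue(k).values if k > 0 else torch.tensor(0.0)
    return {n: (p.detach().abs() > threshold).float()
            for n, p in model.named_parameters() if "weight" in n}

def apply_mask(model: nn.Module, mask: dict) -> None:
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in mask:
                p.mul_(mask[n])

def find_matching_subnetwork(model, train_fn, rewind_step=1000, prune_fraction=0.2):
    """Train, prune by magnitude, then rewind surviving weights to their early values."""
    train_fn(model, steps=rewind_step)                 # train briefly...
    rewind_state = copy.deepcopy(model.state_dict())   # ...and save the early weights
    train_fn(model, steps=None)                        # placeholder: train to completion
    mask = global_magnitude_mask(model, prune_fraction)
    model.load_state_dict(rewind_state)                # rewind to the early point
    apply_mask(model, mask)                            # subnetwork = early weights + mask
    return model, mask
```

When the subnetwork is then trained in isolation, the mask would be reapplied after each optimizer step so that pruned weights remain at zero.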
In small-scale settings, I show that these subnetworks exist from initialization; in large-scale settings, I show that they exist early in training (< 5% of the way through). In general, I find these subnetworks when the outcome of optimizing them becomes robust to the sample of SGD noise used to train them; that is, when they train to the same convex region of the loss landscape regardless of data order. This occurs at initialization in small-scale settings and early in training in large-scale settings.
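One common way to make "robust to the sample of SGD noise" concrete is to train two copies of the same subnetwork with different data orders and evaluate the loss along the straight line between the two solutions; if the loss stays low along that path, both runs reached the same region of the loss landscape. The sketch below is only a proxy for the analysis described above, not the thesis's exact procedure, and the `train_copy` and `evaluate_loss` helpers are hypothetical.

```python
# A hedged sketch: train two copies of a subnetwork with different data orders,
# then evaluate the loss along the linear path between their solutions.
# `train_copy` and `evaluate_loss` are hypothetical helpers, not a real API.
import copy
import torch

def interpolation_losses(model, train_copy, evaluate_loss, num_points=11):
    run_a = train_copy(copy.deepcopy(model), data_order_seed=0)  # one sample of SGD noise
    run_b = train_copy(copy.deepcopy(model), data_order_seed=1)  # a different data order
    state_a, state_b = run_a.state_dict(), run_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        # Blend only floating-point tensors; leave integer buffers untouched.
        blended = {k: ((1 - alpha) * v + alpha * state_b[k]) if v.is_floating_point() else v
                   for k, v in state_a.items()}
        probe = copy.deepcopy(model)
        probe.load_state_dict(blended)
        losses.append(evaluate_loss(probe))  # a flat curve suggests a shared region
    return losses
```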
The implication of these findings is that it may be possible to prune neural networks early in training, which would create an opportunity to substantially reduce the cost of training from that point forward. In service of this goal, I establish a framework for what success would look like in solving this problem and survey existing techniques for pruning neural networks at initialization and early in training. I find that magnitude pruning at initialization matches state-of-the-art performance for this task. In addition, the only information that existing techniques extract is the set of per-layer proportions in which to prune the network; in the case of magnitude pruning, this means that the only signals necessary to achieve state-of-the-art results are the per-layer widths used by variance-scaled initialization techniques.
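Magnitude pruning at initialization, as referenced above, keeps only the largest-magnitude weights within each layer according to a chosen per-layer proportion. The sketch below is a minimal illustration under that reading; the `per_layer_fractions` argument and the example network are assumptions, not the thesis's configuration.

```python
# A minimal sketch of layer-wise magnitude pruning at initialization. The per-layer
# pruning fractions are assumed to be chosen elsewhere; only the fresh initialization
# is used to decide which weights to remove.
import torch
import torch.nn as nn

def prune_at_init(model: nn.Module, per_layer_fractions: dict) -> dict:
    """Zero out the smallest-magnitude fraction of each layer's initialized weights."""
    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            frac = per_layer_fractions.get(name)
            if frac is None or frac <= 0.0:
                continue
            k = int(frac * param.numel())
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            mask = (param.abs() > threshold).float()
            param.mul_(mask)          # keep only the largest-magnitude weights
            masks[name] = mask        # reapply after each update during training
    return masks

# Example usage with a small fully connected network (fractions are illustrative):
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = prune_at_init(model, {"0.weight": 0.8, "2.weight": 0.5})
```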
Date issued
2023-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology