The Lottery Ticket Hypothesis: On Sparse, Trainable Neural Networks
Author(s)
Frankle, Jonathan
Download: Thesis PDF (13.00 MB)
Advisor
Carbin, Michael
Abstract
In this thesis, I show that, from an early point in training, typical neural networks for computer vision contain subnetworks capable of training in isolation to the same accuracy as the original unpruned network. These subnetworks, which I find retroactively by pruning after training and rewinding the surviving weights to their values from earlier in training, are the same size as those produced by state-of-the-art pruning techniques applied after training. They rely on a combination of structure and initialization: if either is modified (by reinitializing the network or by shuffling which weights are pruned within each layer), accuracy drops.
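The procedure described in the paragraph above (train, prune the smallest-magnitude weights, then rewind the surviving weights to their early-training values) can be summarized in a short sketch. This is a hedged illustration, not the thesis's implementation: the `train_fn` helper, the use of a single global magnitude threshold, and the `rewind_step` and `prune_fraction` values are placeholder assumptions.

```python
# A minimal sketch of "prune after training, rewind to early weights" in PyTorch.
# The training loop (`train_fn`) and hyperparameters are hypothetical placeholders.
import copy
import torch
import torch.nn as nn

def global_magnitude_mask(model: nn.Module, fraction: float) -> dict:
    """Mask the `fraction` of weights with the smallest magnitudes, pooled across layers."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for n, p in model.named_parameters() if "weight" in n])
    k = int(fraction * all_weights.numel())
    threshold = all_weights.kthvalue(k).values if k > 0 else torch.tensor(0.0)
    return {n: (p.detach().abs() > threshold).float()
            for n, p in model.named_parameters() if "weight" in n}

def apply_mask(model: nn.Module, mask: dict) -> None:
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in mask:
                p.mul_(mask[n])

def find_matching_subnetwork(model, train_fn, rewind_step=1000, prune_fraction=0.2):
    """Train, prune by magnitude, then rewind surviving weights to their early values."""
    train_fn(model, steps=rewind_step)                 # train briefly...
    rewind_state = copy.deepcopy(model.state_dict())   # ...and save the early weights
    train_fn(model, steps=None)                        # placeholder: train to completion
    mask = global_magnitude_mask(model, prune_fraction)
    model.load_state_dict(rewind_state)                # rewind to the early point
    apply_mask(model, mask)                            # subnetwork = early weights + mask
    return model, mask
```

When the subnetwork is then trained in isolation, the mask would be reapplied after each optimizer step so that pruned weights remain at zero.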
In small-scale settings, I show that these subnetworks exist from initialization; in large-scale settings, I show that they exist early in training (< 5% of the way through). In general, I find these subnetworks when the outcome of optimizing them becomes robust to the sample of SGD noise used to train them; that is, when they train to the same convex region of the loss landscape regardless of data order. This occurs at initialization in small-scale settings and early in training in large-scale settings.
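One common way to make "robust to the sample of SGD noise" concrete is to train two copies of the same subnetwork with different data orders and evaluate the loss along the straight line between the two solutions; if the loss stays low along that path, both runs reached the same region of the loss landscape. The sketch below is only a proxy for the analysis described above, not the thesis's exact procedure, and the `train_copy` and `evaluate_loss` helpers are hypothetical.

```python
# A hedged sketch: train two copies of a subnetwork with different data orders,
# then evaluate the loss along the linear path between their solutions.
# `train_copy` and `evaluate_loss` are hypothetical helpers, not a real API.
import copy
import torch

def interpolation_losses(model, train_copy, evaluate_loss, num_points=11):
    run_a = train_copy(copy.deepcopy(model), data_order_seed=0)  # one sample of SGD noise
    run_b = train_copy(copy.deepcopy(model), data_order_seed=1)  # a different data order
    state_a, state_b = run_a.state_dict(), run_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        # Blend only floating-point tensors; leave integer buffers untouched.
        blended = {k: ((1 - alpha) * v + alpha * state_b[k]) if v.is_floating_point() else v
                   for k, v in state_a.items()}
        probe = copy.deepcopy(model)
        probe.load_state_dict(blended)
        losses.append(evaluate_loss(probe))  # a flat curve suggests a shared region
    return losses
```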
The implication of these findings is that it may be possible to prune neural networks early in training, which would create an opportunity to substantially reduce the cost of training from that point forward. In service of this goal, I establish a framework for what success would look like in solving this problem and survey existing techniques for pruning neural networks at initialization and early in training. I find that magnitude pruning at initialization matches state-of-the-art performance for this task. In addition, the only information that existing techniques extract is the set of per-layer proportions in which to prune the network; in the case of magnitude pruning, this means that the only signals necessary to achieve state-of-the-art results are the per-layer widths used by variance-scaled initialization techniques.
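Magnitude pruning at initialization, as referenced above, keeps only the largest-magnitude weights within each layer according to a chosen per-layer proportion. The sketch below is a minimal illustration under that reading; the `per_layer_fractions` argument and the example network are assumptions, not the thesis's configuration.

```python
# A minimal sketch of layer-wise magnitude pruning at initialization. The per-layer
# pruning fractions are assumed to be chosen elsewhere; only the fresh initialization
# is used to decide which weights to remove.
import torch
import torch.nn as nn

def prune_at_init(model: nn.Module, per_layer_fractions: dict) -> dict:
    """Zero out the smallest-magnitude fraction of each layer's initialized weights."""
    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            frac = per_layer_fractions.get(name)
            if frac is None or frac <= 0.0:
                continue
            k = int(frac * param.numel())
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            mask = (param.abs() > threshold).float()
            param.mul_(mask)          # keep only the largest-magnitude weights
            masks[name] = mask        # reapply after each update during training
    return masks

# Example usage with a small fully connected network (fractions are illustrative):
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = prune_at_init(model, {"0.weight": 0.8, "2.weight": 0.5})
```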
Date issued
2023-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology