Deep learning: a statistical viewpoint

Bartlett, Peter L; Montanari, Andrea; Rakhlin, Alexander

dc.contributor.author	Bartlett, Peter L
dc.contributor.author	Montanari, Andrea
dc.contributor.author	Rakhlin, Alexander
dc.date.accessioned	2021-12-03T16:28:58Z
dc.date.available	2021-12-03T16:28:58Z
dc.date.issued	2021-05
dc.identifier.uri	https://hdl.handle.net/1721.1/138312
dc.description.abstract	The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.	en_US
dc.language.iso	en
dc.publisher	Cambridge University Press (CUP)	en_US
dc.relation.isversionof	10.1017/s0962492921000027	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	arXiv	en_US
dc.title	Deep learning: a statistical viewpoint	en_US
dc.type	Article	en_US
dc.identifier.citation	Bartlett, Peter L, Montanari, Andrea and Rakhlin, Alexander. 2021. "Deep learning: a statistical viewpoint." Acta Numerica, 30.
dc.contributor.department	Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences
dc.contributor.department	Statistics and Data Science Center (Massachusetts Institute of Technology)
dc.relation.journal	Acta Numerica	en_US
dc.eprint.version	Original manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2021-12-03T16:24:47Z
dspace.orderedauthors	Bartlett, PL; Montanari, A; Rakhlin, A	en_US
dspace.date.submission	2021-12-03T16:24:49Z
mit.journal.volume	30	en_US
mit.license	OPEN_ACCESS_POLICY
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: 2103.09177.pdf
Size: 1.397Mb
Format: PDF
Description: Submitted version

This item appears in the following Collection(s)

MIT Open Access Articles
