
dc.contributor.advisor: Sra, Suvrit
dc.contributor.advisor: Jadbabaie, Ali
dc.contributor.author: Yun, Chulhee
dc.date.accessioned: 2022-02-07T15:15:54Z
dc.date.available: 2022-02-07T15:15:54Z
dc.date.issued: 2021-09
dc.date.submitted: 2021-09-21T19:30:42.053Z
dc.identifier.uri: https://hdl.handle.net/1721.1/139962
dc.description.abstract: Deep learning has produced impressive empirical breakthroughs, but many theoretical questions remain unsolved. For example, despite the nonconvexity of training objectives, deep neural networks can be reliably trained to fully memorize training datasets, yet perform very well on unseen test data. Although progress has been made in recent years, the gap between theory and practice remains wide open; this thesis takes a step towards closing this gap. First, we discuss the optimization landscape of neural networks. We prove the existence of spurious local minima for general datasets and activation functions, which suggests that the convergence of optimization methods on neural networks cannot be explained solely via the training objectives. Next, we establish tight bounds on the memorization capacity of ReLU networks. We present results showing that width-$\Theta(\sqrt{n})$ ReLU networks can memorize any $n$ data points, bringing the existing width requirement of $n$ down to a much more realistic number. The third part discusses implicit bias in training neural networks, an area that seeks to understand generalization through the optimization trajectory. Through a unified analysis using a tensor representation of neural networks, we show how different linear neural network architectures lead to different global minima in overparameterized regimes. Lastly, we address a major theory-practice gap in stochastic finite-sum optimization: practical algorithms shuffle and iterate through the component indices, while most theoretical analyses assume the indices are sampled uniformly at random. To close this gap, we develop tight convergence rates for shuffling-based SGD that are faster than the rate of its uniform-sampling counterpart. (A toy sketch contrasting these two sampling schemes appears after this record.)
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Optimization for Deep Learning: Bridging the Theory-Practice Gap
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.orcid: https://orcid.org/0000-0002-3504-4690
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy
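
The abstract contrasts shuffling-based SGD with the uniform-sampling variant assumed by most classical analyses. The minimal sketch below is not taken from the thesis; the toy least-squares problem and the names n, d, lr, epochs are illustrative assumptions. It only shows the sampling difference: drawing an index uniformly with replacement at every step versus taking a fresh random permutation of all indices each epoch.

```python
# Toy sketch (not the thesis's algorithm): contrast the two index-sampling
# schemes for finite-sum SGD on f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2.
# All problem sizes and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, epochs = 100, 5, 0.01, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def grad_i(w, i):
    """Gradient of the i-th component 0.5 * (x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

def sgd_uniform(w, epochs):
    """With-replacement SGD: each step draws an index uniformly at random
    (the scheme most theoretical analyses assume)."""
    for _ in range(epochs * n):
        i = rng.integers(n)
        w = w - lr * grad_i(w, i)
    return w

def sgd_shuffling(w, epochs):
    """Shuffling-based SGD (random reshuffling): each epoch permutes the
    indices and visits every component exactly once
    (the scheme practical implementations use)."""
    for _ in range(epochs):
        for i in rng.permutation(n):
            w = w - lr * grad_i(w, i)
    return w

w0 = np.zeros(d)
for name, algo in [("uniform", sgd_uniform), ("shuffling", sgd_shuffling)]:
    w = algo(w0.copy(), epochs)
    print(f"{name:9s} final error: {np.linalg.norm(w - w_true):.2e}")
```

Both loops take the same number of stochastic gradient steps; the shuffling variant simply guarantees each component is used exactly once per epoch, which is the without-replacement setting whose convergence rates the thesis studies.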

