dc.description.abstract | It has long been known that SGD exhibits higher fluctuations at convergence than GD. It has also often been reported that SGD in deep ReLU networks has a low-rank bias in the weight matrices. A recent theoretical analysis linked SGD noise with the low-rank bias induced by SGD updates with small minibatch sizes [1]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs. GD, first for deep ReLU networks and then for linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the components of the matrix W corresponding to the null space of the data matrix X converge to zero for both SGD and GD, provided the regularization term is non-zero (in the case of the square loss; for the exponential loss the result holds independently of regularization). The convergence rate, however, is exponential for SGD and linear for GD. Thus SGD has a much stronger bias than GD towards solutions in which the weight matrix W has low rank and high fluctuations, provided the initialization is a random matrix (but not if W is initialized as the zero matrix). In summary, SGD under the exponential loss, or under the square loss with non-zero regularization, exhibits the coupled phenomena of low rank and asymptotic noise. | en_US
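
Below is a minimal numerical sketch, not code from the paper, of the linear-regression setting mentioned in the abstract: square loss with non-zero L2 regularization, a random initialization of the weight vector, and a comparison of full-batch GD against small-minibatch SGD that tracks the norm of the weight's projection onto the null space of the data matrix X. All names, hyperparameters, and the NumPy-based setup are assumptions made for illustration; the sketch only illustrates the convergence of the null-space component to zero under non-zero regularization, not the rate comparison between SGD and GD established in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed, not from the paper): fewer samples than
# dimensions, so the data matrix X has a nontrivial null space.
n, d = 20, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

# Orthogonal projector onto the null space of X, via the SVD.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
V_null = Vt[np.linalg.matrix_rank(X):].T
P_null = V_null @ V_null.T

w0 = rng.standard_normal(d)        # random (non-zero) initialization

def train(batch_size, lr=1e-2, lam=1e-1, steps=5000):
    # Square loss plus L2 regularization lam * ||w||^2; returns the norm of
    # the null-space component of w after each update.
    w = w0.copy()
    norms = []
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size + 2 * lam * w
        w -= lr * grad
        norms.append(np.linalg.norm(P_null @ w))
    return norms

gd_norms = train(batch_size=n)     # full-batch GD
sgd_norms = train(batch_size=2)    # small-minibatch SGD

print(f"GD : final null-space norm = {gd_norms[-1]:.3e}")
print(f"SGD: final null-space norm = {sgd_norms[-1]:.3e}")

With non-zero lam, both runs drive the null-space component of w towards zero from a random start, consistent with the abstract's claim for the regularized square loss; with lam = 0 the null-space component stays frozen at its initial value in this simple setting.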