Smoothness and Adaptivity in Nonlinear Optimization for Machine Learning Applications
Author(s)
Li, Haochuan
Advisor
Jadbabaie, Ali
Rakhlin, Alexander
Abstract
Nonlinear optimization has become the workhorse of machine learning. However, our theoretical understanding of optimization in machine learning remains limited. For example, classical optimization theory relies on assumptions, such as bounded Lipschitz smoothness of the loss function, that are rarely met in machine learning. Moreover, existing theory cannot adequately explain why adaptive methods outperform gradient descent on certain machine learning tasks, such as training transformers. To bridge this gap, this thesis proposes more general smoothness conditions that are closer to machine learning practice and studies the convergence of popular classical and adaptive methods under these conditions. Our convergence results improve over existing ones and provide new insights into the role of adaptivity in optimization for machine learning applications.

First, inspired by recent works and insights from deep neural network training, we propose a generalized non-uniform smoothness condition in which the Hessian norm is bounded by a function of the gradient norm almost everywhere. We develop a simple yet powerful analysis technique that bounds the gradients along the optimization trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for gradient descent (GD), stochastic gradient descent (SGD), and Nesterov's accelerated gradient method (NAG) in the convex or non-convex settings under this general smoothness condition.

In addition, the new analysis technique allows us to obtain an improved convergence result for the Adaptive Moment Estimation (Adam) method. Despite the popularity and efficiency of Adam in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show convergence to stationary points. In this thesis, we show that Adam provably converges to stationary points under far more realistic conditions: we do not require the strong assumptions made in previous works, and we also consider the generalized smoothness condition.

However, the above results cannot explain why adaptive methods like Adam significantly outperform SGD in machine learning applications such as training transformers, since the convergence rate we obtain for Adam is no faster than that of SGD. Previous research has empirically observed that adaptive methods tend to exhibit much smaller directional smoothness along the training trajectory than SGD. In this thesis, we formalize this observation into a rigorous theoretical explanation. Specifically, we propose a directional smoothness condition under which we prove faster convergence of memoryless Adam and RMSProp in the deterministic setting. Notably, our convergence rate is faster than the typical rate of gradient descent, providing new insights into the benefits of adaptivity in training transformers.
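For concreteness, the generalized non-uniform smoothness condition summarized above can be sketched (in notation assumed here for illustration, not quoted from the thesis body) as a bound of the form

    \| \nabla^2 f(x) \| \le \ell( \| \nabla f(x) \| )   for almost every x,

where f is the loss function and \ell is a non-decreasing function of the gradient norm. The classical L-Lipschitz-smoothness assumption corresponds to the constant choice \ell(u) = L, and the (L_0, L_1)-smoothness condition studied in recent work corresponds to \ell(u) = L_0 + L_1 u, so both appear as special cases of this framework.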
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology