DSpace@MIT

Escaping Saddle Points with Adaptive Gradient Methods

Author(s)
Staib, Matthew; Reddi, Sashank; Kale, Satyen; Kumar, Sanjiv; Sra, Suvrit
Download
Published version (621.3 KB)

Terms of use
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Abstract
© 2019 International Machine Learning Society (IMLS). Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
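To make the "preconditioned SGD" view in the abstract concrete, here is a minimal sketch (not the authors' code) of an RMSProp-style update written explicitly as SGD with a diagonal preconditioner estimated online from an exponential moving average of squared gradients. The hyperparameters, helper names, and toy saddle objective below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-2, beta=0.999,
                                  eps=1e-8, num_steps=500):
    """RMSProp-style update, written explicitly as preconditioned SGD.

    grad_fn(x) returns a stochastic gradient at x. The diagonal
    preconditioner is an online estimate of the gradient second moment;
    dividing by its square root rescales the stochastic gradient (and its
    noise) toward isotropy near stationary points, which is the mechanism
    the abstract credits for escaping saddle points faster than plain SGD.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # online estimate of E[g^2], per coordinate
    for _ in range(num_steps):
        g = grad_fn(x)
        v = beta * v + (1.0 - beta) * g ** 2      # online preconditioner estimate
        precond = 1.0 / (np.sqrt(v) + eps)        # diagonal preconditioner
        x = x - lr * precond * g                  # preconditioned SGD step
    return x

# Toy usage on a saddle-shaped objective f(x, y) = x^2 - y^2 with small gradient noise.
rng = np.random.default_rng(0)
grad = lambda z: np.array([2 * z[0], -2 * z[1]]) + 0.01 * rng.normal(size=2)
print(rmsprop_as_preconditioned_sgd(grad, x0=[0.5, 1e-3]))
```

Started near the saddle at the origin along the unstable y-direction, the rescaled noise helps the iterate drift away from the saddle, illustrating (informally) the behavior analyzed in the paper.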
Date issued
2019
URI
https://hdl.handle.net/1721.1/137532
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
36th International Conference on Machine Learning, ICML 2019
Citation
Staib, Matthew, Reddi, Sashank, Kale, Satyen, Kumar, Sanjiv, and Sra, Suvrit. 2019. "Escaping Saddle Points with Adaptive Gradient Methods." 36th International Conference on Machine Learning, ICML 2019, June 2019.
Version: Final published version

Collections
  • MIT Open Access Articles
