Greed, hedging, and acceleration in convex optimization
Author(s)
Altschuler, Jason (Jason M.)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Pablo A. Parrilo.
Abstract
This thesis revisits the well-studied and practically motivated problem of minimizing a strongly convex, smooth function given first-order information. The first main message of the thesis is that, surprisingly, algorithms which are individually suboptimal can be combined to achieve accelerated convergence rates. This phenomenon can be intuitively understood as "hedging" between safe strategies (e.g., slowly converging algorithms) and aggressive strategies (e.g., divergent algorithms), since bad cases for the former are good cases for the latter, and vice versa. Concretely, we implement the optimal hedging by simply running Gradient Descent (GD) with prudently chosen stepsizes. This result goes against the conventional wisdom that acceleration is impossible without momentum.

The second main message is a universality result for quadratic optimization. We show that, roughly speaking, "most" Krylov-subspace algorithms are asymptotically optimal (in the worst case) and "most" quadratic functions are asymptotically worst-case functions (for all algorithms). From an algorithmic perspective, this goes against the conventional wisdom that accelerated algorithms require extremely careful parameter tuning. From a lower-bound perspective, this goes against the conventional wisdom that there are relatively few "worst functions in the world" and that they have lots of structure. It also goes against the conventional wisdom that a quadratic function is easier to optimize when the initialization error is more concentrated on certain eigenspaces: counterintuitively, we show that so long as this concentration is not "pathologically" extreme, it only leads to faster convergence in the initial iterations and is irrelevant asymptotically.

Part I of the thesis shows the algorithmic side of this universality by leveraging tools from potential theory and harmonic analysis. The main result is a characterization of the non-adaptive randomized Krylov-subspace algorithms that asymptotically achieve the so-called "accelerated rate" in the worst case. As a special case, this recovers the known fact that GD accelerates when its inverse stepsizes are drawn i.i.d. from the Arcsine distribution. This distribution has a remarkable "equalizing" property: every quadratic function is equally easy to optimize. We interpret this as "optimal hedging" since there is no worst-case function. Leveraging the equalizing property also provides other new insights, including asymptotic isotropy of the iterates around the optimum and uniform convergence guarantees for extending our analysis to ℓ2.

Part II of the thesis shows the lower-bound side of this universality by connecting quadratic optimization to the universality of orthogonal polynomials. We also characterize, for every finite number of iterations n, all worst-case quadratic functions for n iterations of any Krylov-subspace algorithm. Previously, no tight constructions were known. (Note that the classical construction of [Nemirovskii and Yudin, 1983] is only tight asymptotically.) As a corollary, this result also proves that randomness does not help Krylov-subspace algorithms. Combining the results in Parts I and II uncovers a duality between optimal Krylov-subspace algorithms and worst-case quadratic functions. It also reveals close new connections between quadratic optimization, orthogonal polynomials, Gaussian quadrature, Jacobi operators, and their spectral measures.
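As a concrete illustration of the Part I algorithm, here is a minimal numerical sketch, assuming the standard numpy stack; the function name arcsine_gd, the toy diagonal quadratic, and the parameter choices are illustrative assumptions rather than code from the thesis. It runs GD on a quadratic whose Hessian spectrum lies in [m, M], with each inverse stepsize drawn i.i.d. from the Arcsine distribution on [m, M].

import numpy as np

# Minimal sketch (assumed, not from the thesis): GD on a quadratic
# f(x) = 0.5 x^T A x - b^T x whose Hessian eigenvalues lie in [m, M],
# with inverse stepsizes drawn i.i.d. from the Arcsine distribution on [m, M].
rng = np.random.default_rng(0)

def arcsine_gd(A, b, x0, m, M, iters):
    x = x0.copy()
    for _ in range(iters):
        # Arcsine on [m, M]: a Beta(1/2, 1/2) draw rescaled to the interval.
        inv_step = m + (M - m) * rng.beta(0.5, 0.5)
        x = x - (A @ x - b) / inv_step   # stepsize 1/inv_step may exceed 2/M
    return x

# Toy problem (illustrative): diagonal quadratic with spectrum in [m, M], minimizer x* = 0.
m, M, d = 1.0, 100.0, 50
A = np.diag(np.linspace(m, M, d))
b = np.zeros(d)
x = arcsine_gd(A, b, rng.standard_normal(d), m, M, iters=200)
print("distance to optimum:", np.linalg.norm(x))

The Part I result recalled above is that this random schedule attains the accelerated rate, which scales with the square root of the condition number M/m, even though individual stepsizes can exceed the classical stability threshold 2/M for fixed-stepsize GD.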
Part III of the thesis extends the algorithmic techniques in Part I to convex optimization. We first show that running the aforementioned random GD algorithm accelerates on separable convex functions. This is the first convergence rate that exactly matches the classical quadratic-optimization lower bound of [Nemirovskii and Yudin, 1983] on any class of convex functions richer than quadratics, providing partial evidence that convex optimization might be no harder than quadratic optimization. However, these techniques (provably) do not extend to general convex functions. This is roughly because they do not enforce that all observed data be consistent with a single valid convex function, a requirement we call "stitching." We therefore turn to a semidefinite programming formulation of the worst-case rate, due to [Taylor et al., 2017], which ensures stitching. Using this formulation, we compute the optimal GD stepsize schedules for 1, 2, and 3 iterations and show that they partially accelerate on general convex functions. These optimal schedules for convex optimization are remarkably different from the optimal schedules for quadratic optimization. The rate improves as the number of iterations increases, but the algebraic systems become increasingly complicated to solve, and the general case eludes us.
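To make the semidefinite programming formulation concrete, here is a minimal sketch of the performance-estimation SDP, assuming the smooth-convex interpolation inequalities of [Taylor et al., 2017] and the cvxpy/numpy stack; the helper name worst_case_gd, the normalization L = R = 1, and the unit-stepsize schedule are illustrative assumptions, not the schedules or code from the thesis. It computes the exact worst-case value of f(x_N) - f(x*) for N steps of GD with stepsizes h_k / L over all L-smooth convex functions started within distance R of a minimizer.

import numpy as np
import cvxpy as cp

# Minimal sketch (assumed, not from the thesis) of the performance-estimation SDP:
# worst case of f(x_N) - f(x*) for GD steps x_{k+1} = x_k - (h_k / L) * grad f(x_k)
# over L-smooth convex f with ||x_0 - x*|| <= R.
def worst_case_gd(h, L=1.0, R=1.0):
    N = len(h)
    dim = N + 2                          # Gram basis: [x_0 - x*, g_0, ..., g_N]
    e = np.eye(dim)
    x = [e[0]]                           # x_k - x* written in the Gram basis
    for k in range(N):
        x.append(x[k] - (h[k] / L) * e[k + 1])
    g = [e[k + 1] for k in range(N + 1)]            # gradients at x_0, ..., x_N
    F = cp.Variable(N + 1)                          # f(x_k) - f(x*)
    G = cp.Variable((dim, dim), PSD=True)           # Gram matrix of the basis vectors
    pts = [(np.zeros(dim), np.zeros(dim), 0)]       # the minimizer: g(x*) = 0, f(x*) = 0
    pts += [(x[k], g[k], F[k]) for k in range(N + 1)]

    cons = [e[0] @ G @ e[0] <= R ** 2]              # initial distance constraint
    for i, (xi, gi, fi) in enumerate(pts):
        for j, (xj, gj, fj) in enumerate(pts):
            if i == j:
                continue
            # Smooth-convex interpolation:
            # f_i >= f_j + <g_j, x_i - x_j> + ||g_i - g_j||^2 / (2L)
            cons.append(fi >= fj + gj @ G @ (xi - xj)
                        + (gi - gj) @ G @ (gi - gj) / (2 * L))
    prob = cp.Problem(cp.Maximize(F[N]), cons)
    prob.solve(solver=cp.SCS)
    return prob.value

# Unit stepsizes h_k = 1 recover the classical tight bound L R^2 / (4N + 2).
print(worst_case_gd([1.0, 1.0, 1.0]))   # approximately 1/14 for L = R = 1

Optimizing such worst-case values over the stepsize schedule (h_0, ..., h_{N-1}) is the outer problem whose solutions for 1, 2, and 3 iterations are reported in Part III.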
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. Cataloged from PDF version of thesis. Includes bibliographical references (pages 153-156).
Date issued
2018
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.