ACCEPTED MANUSCRIPT

Factor-√2 Acceleration of Accelerated Gradient Methods

This Accepted Manuscript (AM) is a PDF file of the manuscript accepted for publication after peer review, when applicable, but does not reflect post-acceptance improvements, or any corrections. Use of this AM is subject to the publisher's embargo period and AM terms of use. Under no circumstances may this AM be shared or distributed under a Creative Commons or other form of open access license, nor may it be reformatted or enhanced, whether by the Author or third parties. By using this AM (for example, by accessing or downloading) you agree to abide by Springer Nature's terms of use for AM versions of subscription articles: https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms

The Version of Record (VOR) of this article, as published and maintained by the publisher, is available online at: https://doi.org/10.1007/s00245-023-10047-9. The VOR is the version of the article after copy-editing and typesetting, and connected to open research data, open protocols, and open code where available. Any supplementary information can be found on the journal website, connected to the VOR. For research integrity purposes it is best practice to cite the published Version of Record (VOR), where available (for example, see ICMJE's guidelines on overlapping publications). Where users do not have access to the VOR, any citation must clearly indicate that the reference is to an Accepted Manuscript (AM) version.

Chanwoo Park · Jisun Park · Ernest K. Ryu

Received: date / Accepted: date

Abstract The optimized gradient method (OGM) provides a factor-$\sqrt{2}$ speedup upon Nesterov's celebrated accelerated gradient method in the convex (but non-strongly convex) setup.
However, this improved acceleration mechanism has not been well understood; prior analyses of OGM relied on a computer-assisted proof methodology, so the proofs were opaque for humans despite being verifiable and correct. In this work, we present a new analysis of OGM based on a Lyapunov function and linear coupling. These analyses are developed and presented without the assistance of computers and are understandable by humans. Furthermore, we generalize OGM's acceleration mechanism and obtain a factor-$\sqrt{2}$ speedup in other setups: acceleration with a simpler rational stepsize, the strongly convex setup, and the mirror descent setup.

1 Introduction

Nesterov's celebrated accelerated gradient method (AGM) solves the problem of finding the minimum of an $L$-smooth convex function with an "optimal" accelerated $O(1/k^2)$ complexity [38]. Surprisingly, AGM turned out to be not exactly optimal, but optimal only up to a constant. The optimized gradient method (OGM) has a factor-2 smaller (better) worst-case guarantee and thereby requires factor-$\sqrt{2}$ fewer iterations to guarantee the same accuracy [22,26].

Chanwoo Park, Department of Statistics, Seoul National University. E-mail: chanwoo.park@snu.ac.kr
Jisun Park, Department of Mathematical Sciences, Seoul National University. E-mail: colleenp0515@snu.ac.kr
Ernest K. Ryu, Department of Mathematical Sciences, Seoul National University. E-mail: eryu@snu.ac.kr

However, this remarkable discovery has not been well understood. OGM was originally obtained through a computer-assisted methodology based on the performance estimation problem (PEP). The resulting convergence analyses involve arduous but elementary calculations that are verifiable but arguably not understandable by humans.

Contribution. In this work, we present human-understandable analyses of OGM.
First, we show that the improved acceleration mechanism of OGM can be understood and analyzed through an unconventional Lyapunov function. We then use this insight to propose a new method that obtains the factor-$\sqrt{2}$ speedup in the strongly convex setup. Finally, we present a human-understandable derivation of OGM based on refining the linear coupling analysis of Allen-Zhu and Orecchia [5], and generalize OGM to the mirror descent setup.

As minor contributions, we analyze the primary and secondary sequences of OGM through a single unified analysis; to the best of our knowledge, prior works provide two separate convergence proofs for the $x$- and $y$-sequences. Moreover, we present a unified class of accelerated methods containing AGM and OGM through the linear coupling analysis.

1.1 Definitions and prior work

For $L > 0$, a differentiable convex function $f\colon \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth with respect to a norm $\|\cdot\|$ if
\[
\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\| \qquad \forall x, y \in \mathbb{R}^n,
\]
where $\|\cdot\|_*$ denotes the dual norm. A convex function $f\colon \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex if $f(x) - (\mu/2)\|x\|^2$ is convex [39,47].

Throughout this paper, we consider the problem
\[
\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad f(x)
\]
and make the following assumptions on $f\colon \mathbb{R}^n \to \mathbb{R}$:

(A1) $f$ is convex, differentiable, and $L$-smooth with respect to $\|\cdot\|$, and
(A2) $f$ has a minimizer (not necessarily unique).

We write $x_\star$ for a minimizer of $f$ and $f_\star = f(x_\star)$ for the optimal value. To clarify, the proofs of Section 2 do not require the minimizer $x_\star$ to be unique.

Nesterov's AGM. Nesterov's AGM is
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k),
\end{align*}
where $y_0 = x_0$, $\theta_0 = 1$, and $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ for $k = 0, 1, \dots$ [38]. We can equivalently write AGM as
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
z_{k+1} &= z_k - \frac{\theta_k}{L}\nabla f(x_k) \\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1}
\end{align*}
with $z_0 = x_0$ [40]. AGM can be generalized to use the relaxed parameter requirement $\theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ on the positive sequence $\{\theta_k\}_{k=0}^\infty$.
The choice $\theta_k = (k + 2)/2$ is a common instance.

In the setup where $f$ is furthermore $\mu$-strongly convex, Nesterov's AGM for the strongly convex setup (SC-AGM) is
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}(y_{k+1} - y_k)
\end{align*}
for $k = 0, 1, \dots$, where $\kappa = L/\mu$ and $y_0 = x_0$ [39].

Optimized gradient method. OGM is
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k)
\end{align*}
for $k = 0, 1, \dots$, where $y_0 = x_0$ and the sequence $\{\theta_k\}$ is the same as that of AGM [22,26]. We refer to $\frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k)$ as the momentum term and $\frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k)$ as the correction term. The added correction term is the difference between AGM and OGM. We can equivalently write OGM as
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
z_{k+1} &= z_k - \frac{2\theta_k}{L}\nabla f(x_k) \\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1},
\end{align*}
where $z_0 = x_0$ [26]. The factor 2 in $z_{k+1}$ is the difference compared to AGM.

The $y_k$-sequence of OGM exhibits a rate faster than that of AGM by a factor of $\sqrt{2}$. This rate was proved in [27], and we also state it in Corollary 1. To clarify, the guarantee on the function value is smaller (better) by a factor of 2, and, combined with the $O(1/k^2)$ iteration dependence, this represents a factor-$\sqrt{2}$ reduction in the number of iterations necessary to reach a given accuracy.

Furthermore, OGM's original presentation [22,26] involves what we refer to as the last-step modification on the secondary sequence
\begin{align*}
\tilde{x}_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\varphi_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\varphi_{k+1}}(y_{k+1} - x_k) \\
&= \left(1 - \frac{1}{\varphi_{k+1}}\right) y_{k+1} + \frac{1}{\varphi_{k+1}} z_{k+1},
\end{align*}
where $\varphi_k^2 - \varphi_k - 2\theta_{k-1}^2 = 0$. The $\tilde{x}_k$-sequence of OGM exhibits a rate slightly better than OGM's $y_k$-sequence and is in fact exactly optimal [19] under the smooth (non-strongly) convex function class. This rate was proved in the original presentation of OGM [22,26], and we also state it in Corollary 3. In this work, we present the first variant of OGM for the strongly convex setup.

$\theta_k$-sequence asymptotic characterization.
Throughout the exposition of this work, we will use the following asymptotic characterization: if $\theta_0 = 1$ and $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ for $k = 0, 1, \dots$, then
\[
\theta_k = \frac{k + \zeta + 1}{2} + \frac{\log k}{4} + o(1) \tag{1}
\]
as $k \to \infty$, where $\zeta \approx 0.646$. While we suspect this result may be known, we could not find it in any reference. Therefore, we formally state and prove (1) as Lemma 7 in the appendix.

Computer-assisted derivation and analysis of OGM. OGM was originally obtained through a computer-assisted methodology based on the performance estimation problem (PEP); it was first discovered numerically [22] and then its analytical form and convergence analysis were found [26]. The PEP methodology's key insight is to optimize over the class of fixed-step first-order gradient methods, with the objective being the convergence guarantee. Surprisingly, this problem is semidefinite programming- (SDP-) representable and has a tightness guarantee [54]. OGM was re-discovered by using the PEP to find a greedy first-order method simplified with a "subspace-search elimination procedure" [21]. However, these prior analyses of OGM, generated by computers, are verifiable but arguably not understandable by humans. Moreover, as the analyses rely on finding analytical solutions to the SDPs arising from the PEP, they are inaccessible to those unfamiliar with the methodology.

Lyapunov analysis of AGM. Nesterov's original 1983 paper established the celebrated $O(1/k^2)$ rate using a Lyapunov analysis [38]. Subsequent works [11,12,32,39–41,43,55] analyzed AGM and its variants through the "estimate sequence" technique, which many consider to be less transparent than Lyapunov analyses. In recent years, there has been a renewed interest in studying accelerated methods via Lyapunov analyses [1,7–9,13,16,50,52]. In this work, we present the first Lyapunov analysis of OGM.

Linear coupling analysis of AGM.
The interpretation of AGM as a linear coupling between gradient descent and mirror descent was presented in [5]. Specifically, AGM can be written as
\begin{align*}
y_{k+1} &= \operatorname*{argmin}_{y} \left\{ \langle \nabla f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|^2 \right\} \\
z_{k+1} &= \operatorname*{argmin}_{y} \left\{ V_{z_k}(y) + \langle \alpha_{k+1}\nabla f(x_k), y - x_k \rangle \right\} \\
x_{k+1} &= (1 - \tau_{k+1}) y_{k+1} + \tau_{k+1} z_{k+1},
\end{align*}
where $V_z$ is a Bregman divergence. The $y_k$-update can be viewed as a gradient descent update and the $z_k$-update can be viewed as a mirror descent update.

Mirror descent [37] was originally presented as a method that maps the current point to a dual space, performs a gradient update, and maps the point back to the primal space. An alternate proximal form of mirror descent (which we use) was presented in [15]. An alternate "dual averaging" interpretation of mirror descent as a method that constructs a lower bound of the function was presented in [42]. The key insight of linear coupling is to carefully interpolate between mirror descent and gradient descent to obtain AGM.

Linear coupling has been used to obtain and analyze many extensions of AGM [2–4,6], but whether the linear coupling argument itself can be further refined seems not to have been studied. In this work, we show that refining the linear coupling analysis naturally leads to OGM.

Tight inequalities. We informally say an inequality is tight if it cannot be improved without further assumptions and formally if it satisfies the "interpolation conditions" [54]. The recent literature on the performance estimation problem focuses on using tight inequalities to obtain proofs that provably cannot be improved [17,24,25,33,46,52,53]. The tight inequality we use is
\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|_*^2
\]
for all $L$-smooth convex functions $f$ and $x, y \in \mathbb{R}^n$. The linear coupling analysis of AGM uses strictly weaker inequalities, as discussed in Section 4. By refining the analysis by replacing the non-tight inequalities with tight ones, we obtain OGM.

Accelerated methods for smooth strongly convex minimization.
For the problem setup of minimizing smooth strongly convex functions, Nesterov's SC-AGM [39] achieves the convergence rate $O(\exp(-k/\sqrt{\kappa}))$. Recently, the triple momentum method (TMM) [31] and the information-theoretic exact method [51] were presented with an improved $O(\exp(-2k/\sqrt{\kappa}))$-rate, and their optimality was established through the matching $\Theta(\exp(-2k/\sqrt{\kappa}))$-lower bound of [20], which improves upon the classical $\Theta(\exp(-4k/\sqrt{\kappa}))$-lower bound of [35,36]. The SC-OGM method we present in this work has a rate of $O(\exp(-\sqrt{2}k/\sqrt{\kappa}))$, between the rates of SC-AGM and TMM. For strongly convex quadratic functions, the heavy ball method exhibits the rate $O(\exp(-4k/\sqrt{\kappa}))$ [39] and OGM-q exhibits the rate $O(\exp(-2\sqrt{2}k/\sqrt{\kappa}))$ [28]. The heavy ball method's rate matches the classical $\Theta(\exp(-4k/\sqrt{\kappa}))$-lower bound of [35,36].

2 Lyapunov analysis of OGM

In this section, we present a Lyapunov analysis of OGM. Our key insight is to use
\[
f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2,
\]
which is nonnegative due to $L$-smoothness, instead of $(f(x_k) - f_\star)$ or $(f(y_k) - f_\star)$ in the construction of the Lyapunov function. Throughout this section, $\|\cdot\| = \|\cdot\|_*$ denotes the Euclidean norm.

Based on this insight, we present: (i) a more human-understandable analysis of OGM and (ii) a unified analysis of both the primary and secondary sequences of OGM that admits simpler $\theta_k$-choices.

2.1 Nesterov's AGM

Nesterov's AGM has the rate
\begin{align*}
f(y_k) - f_\star &\le \frac{L\|x_0 - x_\star\|^2}{2\theta_{k-1}^2} \\
&= \frac{2L\|x_0 - x_\star\|^2}{(k + \zeta)^2} - \frac{2L\|x_0 - x_\star\|^2 \log k}{(k + \zeta)^3} + o\!\left(\frac{1}{k^3}\right)
\end{align*}
for $k = 0, 1, \dots$. (We derived the equality in Appendix E.) This rate can be established through the following Lyapunov analysis [38]: for $k = 0, 1, \dots$, define
\[
U_k = \theta_{k-1}^2 (f(y_k) - f_\star) + \frac{L}{2}\|z_k - x_\star\|^2
\]
with $\theta_{-1} = 0$ and show $U_k \le \dots \le U_0$.
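Conclude with
\[
\theta_{k-1}^2 (f(y_k) - f_\star) \le U_k \le U_0 = \frac{L}{2}\|x_0 - x_\star\|^2.
\]

2.2 Primary sequence analysis of OGM

We now analyze OGM's convergence through an analogous Lyapunov analysis.

Theorem 1 Assume (A1) and (A2). Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ for $k = 0, 1, \dots$. OGM's $y_k$-sequence exhibits the rate
\[
f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{4\theta_{k-1}^2}
\]
for $k = 1, 2, \dots$.

Proof Set $\theta_{-1} = 0$ and $x_{-1} = x_0$. For $k = -1, 0, 1, \dots$, define
\[
U_k = 2\theta_k^2 \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) + \frac{L}{2}\|z_{k+1} - x_\star\|^2.
\]
We can show that $\{U_k\}_{k=-1}^\infty$ is nonincreasing. Using $f(y_k) \le f(x_{k-1}) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2$, which follows from $L$-smoothness, we conclude the rate with
\begin{align*}
2\theta_{k-1}^2 (f(y_k) - f_\star) &\le 2\theta_{k-1}^2 \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) \\
&\le U_{k-1} \le U_{-1} = \frac{L}{2}\|z_0 - x_\star\|^2
\end{align*}
for $k = 1, 2, \dots$.

Now we complete the proof by showing that $\{U_k\}_{k=-1}^\infty$ is nonincreasing. For $k = -1, 0, 1, \dots$, we have
\begin{align*}
&U_k - U_{k+1} \\
&= 2\theta_k^2 \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) - 2\theta_{k+1}^2 \left( f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + \frac{L}{2}\|z_{k+1} - x_\star\|^2 - \frac{L}{2}\|z_{k+2} - x_\star\|^2 \\
&= 2\theta_k^2 \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) - 2\theta_{k+1}^2 \left( f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle - \frac{2}{L}\theta_{k+1}^2 \|\nabla f(x_{k+1})\|^2 \\
&= 2\theta_k^2 \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) - 2\theta_{k+1}^2 \left( f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&\ge 2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) \\
&\quad - 2\theta_{k+1}^2 \left( f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 - f(x_{k+1}) + f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad - 2\theta_{k+1} \left( f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + 2\theta_{k+1} \left( f_\star - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle \right) \\
&\quad + 2\theta_{k+1} \langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle \\
&\ge 2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + 2\theta_{k+1} \langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle,
\end{align*}
where the inequalities follow from the cocoercivity of $f$.

Consider two separate cases $k = -1$ and $k = 0, 1, \dots$. In the case of $k = -1$, $\theta_{k+1}^2 - \theta_{k+1} = 1 - 1 = 0$ and $z_{k+1} - x_{k+1} = z_0 - x_0 = 0$. The last formula becomes zero, so $U_{-1} - U_0 \ge 0$. For $k = 0, 1, \dots$,
\begin{align*}
&2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + 2\theta_{k+1} \langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1}) \left( f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + 2\theta_{k+1}(\theta_{k+1} - 1) \left\langle \nabla f(x_{k+1}), x_{k+1} - x_k + \frac{1}{L}\nabla f(x_k) \right\rangle \\
&= (2\theta_{k+1}^2 - 2\theta_{k+1}) \left( f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k) - \nabla f(x_{k+1})\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_k \rangle \right) \\
&\ge 0,
\end{align*}
where the inequalities follow from the cocoercivity of $f$. □

As with AGM, the optimal $\{\theta_k\}_{k=0}^\infty$ is given by $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$, which was used in the original presentation of OGM [22,26].

Corollary 1 Under the setup of Theorem 1, the choice $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ leads to the rate
\[
f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{4\theta_{k-1}^2} = \frac{L\|x_0 - x_\star\|^2}{(k + \zeta)^2} - \frac{L\|x_0 - x_\star\|^2 \log k}{(k + \zeta)^3} + o\!\left(\frac{1}{k^3}\right)
\]
for $k = 1, 2, \dots$.

Proof This follows from Theorem 1 and (1). □

The relaxed parameter requirement $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ of Theorem 1 is reminiscent of the requirement for AGM. We note that [30] had presented a generalized analysis with requirement $\theta_{k+1}^2 \le \sum_{i=1}^{k+1} \theta_i$ based on the performance estimation problem methodology.

The relaxed parameter requirement allows us to use the simpler rational coefficients $\theta_k = (k + 2)/2$. This leads to
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{k}{k + 3}(y_{k+1} - y_k) + \frac{k + 2}{k + 3}(y_{k+1} - x_k),
\end{align*}
which we call Simple-OGM.

Corollary 2 Assume (A1) and (A2).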
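Simple-OGM's $y_k$-sequence exhibits the rate
\[
f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{(k + 1)^2}
\]
for $k = 1, 2, \dots$.

Proof This follows from Theorem 1. □

2.3 Secondary sequence analysis of OGM

We now analyze the convergence of OGM's secondary sequence with last-step modification through a unified Lyapunov analysis.

Theorem 2 Assume (A1) and (A2). Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ for $k = 0, 1, \dots$. Let the positive sequence $\{\varphi_k\}_{k=0}^\infty$ satisfy $0 \le \varphi_k^2 - \varphi_k \le 2\theta_{k-1}^2$ for $k = 0, 1, \dots$, where we define $\theta_{-1} = 0$. OGM's $\tilde{x}_k$-sequence, the secondary sequence with last-step modification, exhibits the rate
\[
f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\varphi_k^2}
\]
for $k = 0, 1, \dots$.

Proof Let $\{U_k\}_{k=-1}^\infty$ be as defined in the proof of Theorem 1. Define $\{\tilde{U}_k\}_{k=0}^\infty$ as
\[
\tilde{U}_k = \varphi_k^2 (f(\tilde{x}_k) - f_\star) + \frac{L}{2}\left\| z_k - \frac{1}{L}\varphi_k \nabla f(\tilde{x}_k) - x_\star \right\|^2.
\]
We can show that $\tilde{U}_k \le U_{k-1}$, and we conclude the rate with
\[
\varphi_k^2 (f(\tilde{x}_k) - f_\star) \le \tilde{U}_k \le U_{-1} = \frac{L}{2}\|x_0 - x_\star\|^2
\]
for $k = 0, 1, \dots$.

Now we complete the proof by showing that $\tilde{U}_k \le U_{k-1}$. For $k = 0, 1, \dots$, we have
\begin{align*}
&U_{k-1} - \tilde{U}_k \\
&= 2\theta_{k-1}^2 \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \varphi_k^2 (f(\tilde{x}_k) - f_\star) \\
&\quad + \frac{L}{2}\|z_k - x_\star\|^2 - \frac{L}{2}\left\| z_k - \frac{1}{L}\varphi_k \nabla f(\tilde{x}_k) - x_\star \right\|^2 \\
&= 2\theta_{k-1}^2 \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \varphi_k^2 (f(\tilde{x}_k) - f_\star) \\
&\quad - \langle \varphi_k \nabla f(\tilde{x}_k), x_\star - z_k \rangle - \frac{1}{2L}\varphi_k^2 \|\nabla f(\tilde{x}_k)\|^2 \\
&= 2\theta_{k-1}^2 \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) \\
&\quad - \varphi_k^2 \left( f(\tilde{x}_k) - f_\star + \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 \right) - \langle \varphi_k \nabla f(\tilde{x}_k), x_\star - z_k \rangle \\
&\ge (\varphi_k^2 - \varphi_k) \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) \\
&\quad - \varphi_k^2 \left( f(\tilde{x}_k) - f_\star + \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 \right) - \langle \varphi_k \nabla f(\tilde{x}_k), x_\star - z_k \rangle \\
&= (\varphi_k^2 - \varphi_k) \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - f(\tilde{x}_k) + f_\star - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 \right) \\
&\quad + \varphi_k \left( f_\star - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 + \langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_\star \rangle \right) \\
&\quad + \langle \varphi_k \nabla f(\tilde{x}_k), z_k - \tilde{x}_k \rangle \\
&\ge (\varphi_k^2 - \varphi_k) \left( f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 \right) \\
&\quad + \langle \varphi_k \nabla f(\tilde{x}_k), z_k - \tilde{x}_k \rangle \\
&= (\varphi_k^2 - \varphi_k) \left( f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 \right) \\
&\quad + \varphi_k(\varphi_k - 1) \left\langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_{k-1} + \frac{1}{L}\nabla f(x_{k-1}) \right\rangle \\
&= (\varphi_k^2 - \varphi_k) \left( f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1}) - \nabla f(\tilde{x}_k)\|^2 + \langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_{k-1} \rangle \right) \\
&\ge 0,
\end{align*}
where the inequalities follow from the cocoercivity of $f$. □

Corollary 3 Under the setup of Theorem 2, the choice $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ and $\varphi_k^2 - \varphi_k = 2\theta_{k-1}^2$ leads to the rate
\[
f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\varphi_k^2} = \frac{L\|x_0 - x_\star\|^2}{(k + \zeta + 1/\sqrt{2})^2} - \frac{L\|x_0 - x_\star\|^2 \log k}{(k + \zeta + 1/\sqrt{2})^3} + o\!\left(\frac{1}{k^3}\right)
\]
for $k = 0, 1, \dots$.

Proof This follows from (1), which implies $\varphi_k = \frac{k + \zeta + \frac{1}{\sqrt{2}}}{\sqrt{2}} + \frac{\sqrt{2}\log k}{4} + o(1)$, and Theorem 2. □

Simple-OGM with the last-step modification is
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{k}{k + 3}(y_{k+1} - y_k) + \frac{k + 2}{k + 3}(y_{k+1} - x_k) \\
\tilde{x}_{k+1} &= y_{k+1} + \frac{k}{\sqrt{2}(k + 2) + 1}(y_{k+1} - y_k) + \frac{k + 2}{\sqrt{2}(k + 2) + 1}(y_{k+1} - x_k),
\end{align*}
where $x_0 = y_0$.

Corollary 4 Assume (A1) and (A2). Simple-OGM's $\tilde{x}_k$-sequence, the secondary sequence with last-step modification, exhibits the rate
\[
f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{(k + 1 + 1/\sqrt{2})^2}
\]
for $k = 0, 1, \dots$.

Proof Use Corollary 3 with $\theta_k = \frac{k + 2}{2}$ and $\varphi_k = \frac{k + 1 + \frac{1}{\sqrt{2}}}{\sqrt{2}}$.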
□

2.4 Discussion

We clarify that the presented Lyapunov analysis is a novel contribution, while the results themselves are mostly known [26,27,30]. We emphasize two key points. The first is the somewhat unusual construction of the Lyapunov function. This key insight will be used in the following section to present a novel method for the strongly convex setup. The second point we emphasize is that we present a unified analysis of the primary and last-step-modified secondary sequences using the Lyapunov functions $U_k$ and $\tilde{U}_k$. Prior works on the two sequences of AGM and OGM rely on two separate analyses [26,27].

3 Strongly convex OGM

In this section, we present strongly convex OGM (SC-OGM), a novel method that provides a factor-$\sqrt{2}$ improvement over Nesterov's SC-AGM. The method and its analysis are obtained by following the key insight of Section 2: use the OGM-type correction term in the method and use
\[
f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2
\]
in the construction of the Lyapunov function. Throughout this section, $\|\cdot\| = \|\cdot\|_*$ denotes the Euclidean norm.

Based on this insight, we present: (i) a novel method SC-OGM and (ii) a unified analysis of both the primary and secondary sequences of SC-OGM.

3.1 Nesterov's SC-AGM

Further assume $f$ is $\mu$-strongly convex and write $\kappa = L/\mu$. SC-AGM's convergence rate
\[
f(y_k) - f_\star \le \left( 1 + \frac{1}{\sqrt{\kappa} - 1} \right)^{-k} \frac{\mu + L}{2}\|x_0 - x_\star\|^2 = O\!\left( \exp\!\left( -\frac{k}{\sqrt{\kappa}} \right) \right)
\]
can be established through the following Lyapunov analysis [13]. For $k = 0, 1, \dots$, define
\[
U_k = \left( 1 + \frac{1}{\sqrt{\kappa} - 1} \right)^{k} \left( f(y_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2 \right)
\]
with $z_k = (\sqrt{\kappa} + 1)x_k - \sqrt{\kappa}\,y_k$ and show $U_k \le \dots \le U_0 \le \frac{\mu + L}{2}\|x_0 - x_\star\|^2$.

3.2 Primary-sequence analysis of SC-OGM

We newly propose SC-OGM:
\begin{align*}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{1}{2\gamma + 1}(y_{k+1} - y_k) + \frac{1}{2\gamma + 1}(y_{k+1} - x_k)
\end{align*}
for $k = 0, 1, \dots$, where $y_0 = x_0$ and $\gamma = \frac{\sqrt{8\kappa + 1} + 3}{2\kappa - 2}$.

Theorem 3 Assume (A1), (A2), and that $f$ is $\mu$-strongly convex.
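SC-OGM's $y_k$-sequence exhibits the rate
\[
f(y_k) - f_\star \le (1 + \gamma)^{-k+1} \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2 = O\!\left( \exp\!\left( -\frac{\sqrt{2}k}{\sqrt{\kappa}} \right) \right)
\]
for $k = 1, 2, \dots$.

Proof For $k = 0, 1, \dots$, define
\[
z_k = \frac{2\gamma + 1}{\gamma} x_k - \frac{\gamma + 1}{\gamma} y_k
\]
and
\[
U_k = (1 + \gamma)^k \left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2 \right).
\]
We can show that $\{U_k\}_{k=0}^\infty$ is nonincreasing and $U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$. Using $f(y_k) \le f(x_{k-1}) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2$, which follows from $L$-smoothness, we conclude the rate with
\begin{align*}
(1 + \gamma)^{k-1} (f(y_k) - f_\star) &\le (1 + \gamma)^{k-1} \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) \\
&\le U_{k-1} \le U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2
\end{align*}
for $k = 1, 2, \dots$. Now we complete the proof by showing $U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$, showing some relationships between $x_k$ and $z_k$, and showing that $\{U_k\}_{k=0}^\infty$ is nonincreasing.

First, we have
\begin{align*}
U_0 &= f(x_0) - f_\star - \frac{1}{2L}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\|z_1 - x_\star\|^2 \\
&= f(x_0) - f_\star - \frac{1}{2L}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\left\| x_0 - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_0) - x_\star \right\|^2 \\
&= f(x_0) - f_\star + \frac{1}{2L}\frac{1}{\gamma + 1}\|\nabla f(x_0)\|^2 - \frac{\gamma}{1 + \gamma}\langle \nabla f(x_0), x_0 - x_\star \rangle + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \frac{1}{\gamma + 1}(f(x_0) - f_\star) + \frac{1}{2L}\frac{1}{1 + \gamma}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \frac{2}{1 + \gamma}(f(x_0) - f_\star) + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \left( L + \frac{\mu}{2} \right)\|x_0 - x_\star\|^2.
\end{align*}

Second, let $X_k = x_k - x_\star$ and $Z_k = z_k - x_\star$ for $k = 0, 1, \dots$. We will prove
\begin{align*}
(x_{k+1} - x_k) + \frac{1}{L}\nabla f(x_k) + \gamma X_{k+1} &= \frac{1}{1 + \gamma}\left( \gamma Z_{k+1} + \gamma^2 X_{k+1} \right) \tag{2} \\
Z_{k+1} &= \frac{1}{\gamma + 1} Z_k + \frac{\gamma}{\gamma + 1} X_k - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_k) \tag{3}
\end{align*}
for $k = 0, 1, \dots$.

Plug $y_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$ into the definition of $z_{k+1}$. (We remind the reader that $z_k$ was defined in the beginning of the proof.) Then we obtain (2).

For (3), from the definitions of $z_k$ and $z_{k+1}$,
\begin{align*}
z_{k+1} &= \frac{2\gamma + 1}{\gamma} x_{k+1} - \frac{\gamma + 1}{\gamma} x_k + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) \\
z_k &= \frac{2\gamma + 1}{\gamma} x_k - \frac{\gamma + 1}{\gamma} x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_{k-1}),
\end{align*}
and from the definition of $x_k$, we have
\begin{align*}
x_{k+1} &= \frac{2\gamma + 2}{2\gamma + 1} y_{k+1} - \frac{1}{2\gamma + 1} y_k - \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_k) \\
&= \frac{2\gamma + 2}{2\gamma + 1} x_k - \frac{1}{2\gamma + 1} x_{k-1} - \frac{1}{L}\frac{2\gamma + 3}{2\gamma + 1}\nabla f(x_k) + \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_{k-1}).
\end{align*}
Therefore,
\begin{align*}
z_{k+1} - \frac{1}{\gamma + 1} z_k &= \frac{2\gamma + 1}{\gamma} x_{k+1} - \frac{\gamma + 1}{\gamma} x_k + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) \\
&\quad - \frac{1}{\gamma + 1}\left( \frac{2\gamma + 1}{\gamma} x_k - \frac{\gamma + 1}{\gamma} x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_{k-1}) \right) \\
&= \frac{2\gamma + 1}{\gamma} x_{k+1} - \frac{\gamma^2 + 4\gamma + 2}{\gamma(\gamma + 1)} x_k + \frac{1}{\gamma} x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) - \frac{1}{L}\frac{1}{\gamma}\nabla f(x_{k-1}) \\
&= \frac{2\gamma + 1}{\gamma}\left( \frac{2\gamma + 2}{2\gamma + 1} x_k - \frac{1}{2\gamma + 1} x_{k-1} - \frac{1}{L}\frac{2\gamma + 3}{2\gamma + 1}\nabla f(x_k) + \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_{k-1}) \right) \\
&\quad - \frac{\gamma^2 + 4\gamma + 2}{\gamma(\gamma + 1)} x_k + \frac{1}{\gamma} x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) - \frac{1}{L}\frac{1}{\gamma}\nabla f(x_{k-1}) \\
&= \frac{\gamma}{\gamma + 1} x_k - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_k),
\end{align*}
so we obtain (3).

Lastly, we will show that $\{U_k\}_{k=0}^\infty$ is nonincreasing. It suffices to show that for $k = 0, 1, \dots$, $(1 + \gamma)^{-k}(U_k - U_{k+1}) \ge 0$, which is equivalent to showing
\begin{align*}
&\left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) - (1 + \gamma)\left( f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&\quad + \frac{\mu}{2}\left( \|z_{k+1} - x_\star\|^2 - (1 + \gamma)\|z_{k+2} - x_\star\|^2 \right) \ge 0.
\end{align*}
By $L$-smoothness of $f$, we have
\[
f(x_{k+1}) - f(x_k) \le -\frac{1}{2L}\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_k \rangle
\]
and from strong convexity,
\[
f(x_{k+1}) - f_\star \le \langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle - \frac{\mu}{2}\|x_{k+1} - x_\star\|^2.
\]
For $k = 0, 1, \dots$, using the above two inequalities, (2), and (3),
\begin{align*}
&\left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) - (1 + \gamma)\left( f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 \right) \\
&= (f(x_k) - f(x_{k+1})) - \gamma(f(x_{k+1}) - f_\star) + \frac{1 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 - \frac{1}{2L}\|\nabla f(x_k)\|^2 \\
&\ge \left( \frac{1}{2L}\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_{k+1}), x_k - x_{k+1} \rangle \right) \\
&\quad - \gamma\left( \langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle - \frac{\mu}{2}\|x_{k+1} - x_\star\|^2 \right) \\
&\quad + \frac{1 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 - \frac{1}{2L}\|\nabla f(x_k)\|^2 \\
&= \left\langle \nabla f(x_{k+1}), -\frac{1}{L}\nabla f(x_k) - x_{k+1} + x_k - \gamma(x_{k+1} - x_\star) \right\rangle \\
&\quad + \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|x_{k+1} - x_\star\|^2 \\
&= \left\langle \nabla f(x_{k+1}), -\frac{1}{1 + \gamma}\left( \gamma Z_{k+1} + \gamma^2 X_{k+1} \right) \right\rangle \\
&\quad + \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|x_{k+1} - x_\star\|^2.
\end{align*}
In addition,
\begin{align*}
&\frac{\mu}{2}\left( (1 + \gamma)\|Z_{k+2}\|^2 - \|Z_{k+1}\|^2 \right) \\
&= \frac{\mu}{2}\left( (1 + \gamma)\left\| \frac{1}{1 + \gamma} Z_{k+1} + \frac{\gamma}{1 + \gamma} X_{k+1} - \frac{1}{L}\frac{2 + \gamma}{\gamma}\nabla f(x_{k+1}) \right\|^2 - \|Z_{k+1}\|^2 \right) \\
&= \frac{\mu}{2}\bigg( -\frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + \frac{\gamma^2}{1 + \gamma}\|X_{k+1}\|^2 + (1 + \gamma)\frac{1}{L^2}\frac{(2 + \gamma)^2}{\gamma^2}\|\nabla f(x_{k+1})\|^2 \\
&\qquad + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle - 2\frac{2 + \gamma}{L\gamma}\langle \nabla f(x_{k+1}), Z_{k+1} \rangle - 2\frac{2 + \gamma}{L}\langle \nabla f(x_{k+1}), X_{k+1} \rangle \bigg).
\end{align*}
Since
\[
\mu\,\frac{2 + \gamma}{L\gamma^2} = \frac{1}{1 + \gamma},
\]
the inner product terms involving $\nabla f(x_{k+1})$ cancel in $U_k - U_{k+1}$. For $k = 0, 1, \dots$, we have
\begin{align*}
&(1 + \gamma)^{-k}(U_k - U_{k+1}) \\
&\ge \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|X_{k+1}\|^2 \\
&\quad - \frac{\mu}{2}\bigg( -\frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + \frac{\gamma^2}{1 + \gamma}\|X_{k+1}\|^2 \\
&\qquad + (1 + \gamma)\frac{1}{L^2}\frac{(2 + \gamma)^2}{\gamma^2}\|\nabla f(x_{k+1})\|^2 + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle \bigg) \\
&= -\frac{\mu}{2}\left( -\frac{\gamma}{1 + \gamma}\|X_{k+1}\|^2 - \frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle \right) \\
&= \frac{\mu}{2}\frac{\gamma}{1 + \gamma}\|Z_{k+1} - X_{k+1}\|^2 \ge 0.
\end{align*}
□

3.3 Secondary sequence analysis

We now analyze the convergence of SC-OGM's secondary sequence with a unified Lyapunov analysis. We note that SC-OGM does not require the last-step modification, unlike the non-strongly convex counterpart.

Theorem 4 Assume (A1), (A2), and that $f$ is $\mu$-strongly convex. SC-OGM's $x_k$-sequence, the secondary sequence without last-step modification, exhibits the rate
\[
f(x_k) - f_\star \le \frac{(1 + \gamma)^{-k+2}}{2\gamma}\left( \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2 \right)
\]
for $k = 1, 2, \dots$.

Proof Let $\{z_k\}_{k=0}^\infty$ and $\{U_k\}_{k=0}^\infty$ be defined as in the proof of Theorem 3. For $k = 0, 1, \dots$, define
\[
\tilde{U}_k = (1 + \gamma)^{k-1}\left( \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) + \frac{\mu}{2}\left\| z_k - \frac{\gamma + 2}{\gamma}\frac{1}{L}\nabla f(x_k) - x_\star \right\|^2 \right).
\]
We can show that $\tilde{U}_k \le U_{k-1}$. We conclude the rate with
\[
(1 + \gamma)^{k-1}\frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \le \tilde{U}_k \le U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2
\]
for $k = 1, 2, \dots$. Now we complete the proof by showing that $\tilde{U}_k \le U_{k-1}$. Note that $\frac{\gamma + 1}{\gamma}\left( (x_k - x_{k-1}) + \frac{1}{L}\nabla f(x_{k-1}) \right) = Z_k - X_k$. Then we have
\begin{align*}
&\left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)}\|z_k - x_\star\|^2 - \frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)}\left\| z_k - \frac{\gamma + 2}{\gamma}\frac{1}{L}\nabla f(x_k) - x_\star \right\|^2 \\
&= \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{\gamma}{1 + \gamma}\langle Z_k, \nabla f(x_k) \rangle - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{\gamma}{1 + \gamma}\left\langle \frac{\gamma + 1}{\gamma}\left( (x_k - x_{k-1}) + \frac{1}{L}\nabla f(x_{k-1}) \right) + X_k, \nabla f(x_k) \right\rangle - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \langle x_k - x_{k-1}, \nabla f(x_k) \rangle + \frac{1}{L}\langle \nabla f(x_{k-1}), \nabla f(x_k) \rangle + \frac{\gamma}{1 + \gamma}\langle X_k, \nabla f(x_k) \rangle \\
&\quad - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left( f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_k), x_k - x_{k-1} \rangle \right) \\
&\quad + \frac{1}{2L}\frac{\gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 + \frac{1}{1 + \gamma}\left( f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 \right) \\
&\quad + \frac{\gamma}{1 + \gamma}\left( f_\star - f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \langle X_k, \nabla f(x_k) \rangle \right) \\
&\ge 0.
\end{align*}
Since $\frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)} = \frac{\mu}{2}$, the above inequality indicates that
\begin{align*}
&\left( f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 \right) + \frac{\mu}{2}\|z_k - x_\star\|^2 \\
&\ge \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) + \frac{\mu}{2}\left\| z_k - \frac{\gamma + 2}{\gamma}\frac{1}{L}\nabla f(x_k) - x_\star \right\|^2.
\end{align*}
□

3.4 Discussion

The factor-$\sqrt{2}$ improvement of SC-OGM over SC-AGM is consistent with the factor-$\sqrt{2}$ improvement of OGM over AGM. AGM and OGM share the same momentum term while OGM has the additional "correction term". In contrast, the momentum coefficients differ in the strongly convex case: SC-AGM has
\[
\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = 1 - \frac{2}{\sqrt{\kappa}} + O\!\left( \frac{1}{\kappa} \right)
\]
while SC-OGM has
\[
\frac{1}{2\gamma + 1} = 1 - \frac{2\sqrt{2}}{\sqrt{\kappa}} + O\!\left( \frac{1}{\kappa} \right).
\]
Of course, SC-OGM also has the correction term, which is essential in the analysis.

We clarify that SC-OGM is not an optimal algorithm for the setup of minimizing smooth strongly convex functions, as discussed in Section 1.1.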
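As a rough numerical illustration of the SC-OGM iteration of Section 3.2 and the bound of Theorem 3, the following Python sketch runs the method on a diagonal strongly convex quadratic and compares $f(y_k) - f_\star$ with $(1+\gamma)^{-k+1}\frac{\mu+2L}{2}\|x_0 - x_\star\|^2$. The test problem, the helper name `sc_ogm`, and the use of NumPy are assumptions made for illustration and are not part of the paper; the sketch assumes $\kappa > 1$ so that $\gamma$ is well defined.

```python
import numpy as np

def sc_ogm(grad, x0, L, mu, num_iters):
    """Hypothetical helper: run SC-OGM and return the y-iterates and gamma."""
    kappa = L / mu  # condition number, assumed > 1
    gamma = (np.sqrt(8.0 * kappa + 1.0) + 3.0) / (2.0 * kappa - 2.0)
    c = 1.0 / (2.0 * gamma + 1.0)  # shared momentum/correction coefficient
    x = np.array(x0, dtype=float)
    y = x.copy()  # y_0 = x_0
    ys = []
    for _ in range(num_iters):
        y_next = x - grad(x) / L  # gradient step
        # momentum term plus OGM-type correction term
        x = y_next + c * (y_next - y) + c * (y_next - x)
        y = y_next
        ys.append(y)
    return ys, gamma

# Illustrative problem (an assumption): f(v) = 0.5 * sum(d * v**2),
# so x_star = 0, f_star = 0, mu = min(d), L = max(d).
d = np.linspace(1.0, 100.0, 20)
f = lambda v: 0.5 * float(np.dot(d * v, v))
grad = lambda v: d * v
mu, L = float(d.min()), float(d.max())
x0 = np.ones_like(d)

ys, gamma = sc_ogm(grad, x0, L, mu, num_iters=200)
gaps = np.array([f(y) for y in ys])  # f(y_k) - f_star for k = 1, ..., 200
k = np.arange(1, len(ys) + 1)
bounds = (1.0 + gamma) ** (-(k - 1.0)) * 0.5 * (mu + 2.0 * L) * float(np.dot(x0, x0))
print(bool(np.all(gaps <= bounds)))  # whether the Theorem 3 bound holds on this run
```

A single quadratic instance only checks consistency with the worst-case bound; it says nothing about tightness, and quadratics may converge faster than the bound suggests.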
Another interesting line of research is to extend the faster rates to the composite minimization setup, which minimizes $f + g$ with a smooth strongly convex $f$ and a convex but possibly non-smooth $g$, as has been pursued in [49] and [10]. Interestingly, the algorithm of [10, Theorem 6] is different from SC-OGM, but achieves the same $O(\exp(-\sqrt{2}k/\sqrt{\kappa}))$-rate as SC-OGM, while having an extension to the composite minimization setup.

4 Linear coupling analysis

While the Lyapunov analyses of Sections 2 and 3 do provide insight into the acceleration mechanism of OGM, they do not shed light onto the provenance of the method. Originally, OGM was generated through a computer-assisted proof methodology as the exactly optimal first-order method, but this approach is arguably opaque to humans.

In this section, we present a human-understandable derivation of OGM based on linear coupling. Specifically, we obtain OGM by refining the linear coupling analysis of Allen-Zhu and Orecchia [5] through replacing the use of non-tight inequalities with tight inequalities. We specifically provide: (i) a natural (and non-computer-assisted) derivation of OGM, (ii) a generalization of OGM to the mirror descent setup, and (iii) a unification of AGM and OGM. We moreover provide (iv) a generalization of SC-OGM to the mirror descent setup in the appendix, in Section D.

Assumption and notation. In this section, assume

(A3) $\|x\| = \sqrt{x^T Q x}$ is a quadratic norm, where $Q$ is a symmetric positive definite matrix.

Assumption (A1) is to be interpreted as $L$-smoothness with respect to the norm $\|\cdot\|$. Write $\|x\|_* = \sqrt{x^T Q^{-1} x}$ for the dual norm of $\|\cdot\|$. However, $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product (unrelated to $Q$). Let $w\colon \mathbb{R}^n \to \mathbb{R}$ be a "distance generating function" that is differentiable and 1-strongly convex with respect to $\|\cdot\|$, and let
\[
V_x(y) = w(y) - \langle \nabla w(x), y - x \rangle - w(x) \qquad \forall x, y \in \mathbb{R}^n
\]
be the Bregman divergence generated by $w$.
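To make the notation above concrete, the following Python sketch evaluates the quadratic norm, its dual norm, and the Bregman divergence for one particular distance generating function, $w(x) = \frac{1}{2}x^T Q x$, for which $V_x(y) = \frac{1}{2}\|y - x\|^2$. This choice of $w$, the random matrix $Q$, and the helper names `q_norm`, `q_dual_norm`, and `bregman` are assumptions made for illustration; the paper allows any differentiable $w$ that is 1-strongly convex with respect to $\|\cdot\|$.

```python
import numpy as np

def q_norm(x, Q):
    """Quadratic norm ||x|| = sqrt(x^T Q x)."""
    return float(np.sqrt(x @ Q @ x))

def q_dual_norm(g, Q):
    """Dual norm ||g||_* = sqrt(g^T Q^{-1} g), computed via a linear solve."""
    return float(np.sqrt(g @ np.linalg.solve(Q, g)))

def bregman(w, grad_w, x, y):
    """Bregman divergence V_x(y) = w(y) - <grad w(x), y - x> - w(x)."""
    return float(w(y) - grad_w(x) @ (y - x) - w(x))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)  # symmetric positive definite (illustrative)

# Assumed distance generating function: w(v) = 0.5 * v^T Q v.
w = lambda v: 0.5 * float(v @ Q @ v)
grad_w = lambda v: Q @ v

x, y, g = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
V = bregman(w, grad_w, x, y)
print(V, 0.5 * q_norm(y - x, Q) ** 2)  # the two values should agree
```

With this $w$ the Bregman divergence is symmetric in $x$ and $y$, which is not true for general distance generating functions.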
4.1 Linear coupling analysis of AGM

We briefly outline the linear coupling analysis of AGM presented in [5] and point out where the analysis can be refined. Consider the problem of minimizing $f$ under assumptions (A1), (A2), and (A3). The linear coupling method is
\begin{align*}
y_{k+1} &= x_k - L^{-1}Q^{-1}\nabla f(x_k) \\
z_{k+1} &= \operatorname*{argmin}_{y \in \mathbb{R}^n} \left\{ V_{z_k}(y) + \langle \alpha_{k+1}\nabla f(x_k), y - x_k \rangle \right\} \tag{LC} \\
x_{k+1} &= (1 - \tau_{k+1}) y_{k+1} + \tau_{k+1} z_{k+1}
\end{align*}
for $k = 0, 1, \dots$, where $x_0 = z_0$ and $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are positive sequences to be determined.

We obtain AGM by performing a non-tight analysis of (LC) and letting the analysis inform the choices of $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$. The first step of this analysis is
\begin{align*}
\alpha_{k+1}\langle \nabla f(x_k), z_k - x_\star \rangle &\le \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star) \\
&\le \alpha_{k+1}^2 L (f(x_k) - f(y_{k+1})) + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star).
\end{align*}
The second inequality follows from
\[
f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2 + \underline{\frac{1}{2L}\|\nabla f(y_{k+1})\|_*^2},
\]
but the underscored term $\frac{1}{2L}\|\nabla f(y_{k+1})\|_*^2$ is not used, i.e., the proof utilizes the weaker and non-tight inequality
\[
f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2.
\]
The second step of this analysis is to choose $\tau_k = \frac{1}{\alpha_{k+1}L}$ to eliminate $f(x_k)$ and to show
\[
\alpha_{k+1}^2 L \left( f(y_{k+1}) - f_\star \right) + V_{z_{k+1}}(x_\star) \le \left( \alpha_{k+1}^2 L - \alpha_{k+1} \right)(f(y_k) - f_\star) + V_{z_k}(x_\star).
\]
The inequality follows from
\[
f(x_k) - f_\star \le \langle \nabla f(x_k), x_k - x_\star \rangle - \underline{\frac{1}{2L}\|\nabla f(x_k)\|_*^2}
\]
and
\[
\langle \nabla f(x_k), y_k - x_k \rangle \le f(y_k) - f(x_k) - \underline{\frac{1}{2L}\|\nabla f(y_k) - \nabla f(x_k)\|_*^2},
\]
but the underscored terms are not used. Finally, convergence is established through a telescoping sum argument as in Appendix C.

4.2 Linear coupling analysis of OGM

We now derive OGM through performing a tight analysis of (LC) and letting the analysis inform the choices of $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=0}^\infty$.

In the first step of our linear coupling analysis, we follow the same arguments but do not take the step utilizing the non-tight inequality.

Lemma 1 Assume (A1) and (A2).
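The iterates (LC) satisfy
\[
\alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle \le \frac{\alpha_{k+1}^2}{2} \|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)
\]
for $k = 0, 1, \dots$.

Proof This is exactly the first part of Lemma 4.2 of [5]. ⊓⊔

In the second step of our linear coupling analysis, we choose $\tau_k = \frac{2}{\alpha_{k+1} L}$ to allow for a telescoping sum argument and show the following lemma.

Lemma 2 Assume (A1), (A2) and (A3). Let $0 < \tau_k = \frac{2}{\alpha_{k+1} L} \le 1$ for $k = 0, 1, \dots$, $\alpha_1 = \frac{2}{L}$, and $x_{-1} = x_0$. Set $h(x) = f(x) - f_\star - \frac{1}{2L} \|\nabla f(x)\|_*^2$. The iterates (LC) satisfy
\[
\frac{\alpha_{k+1}^2 L}{2} h(x_k) + V_{z_{k+1}}(x_\star) \le \frac{\alpha_{k+1}^2 L - 2\alpha_{k+1}}{2} h(x_{k-1}) + V_{z_k}(x_\star)
\]
for $k = 0, 1, \dots$.

Proof For $k = 1, 2, \dots$, we have
\[
\begin{aligned}
&\alpha_{k+1} \big( f(x_k) - f_\star \big) \\
&\le \alpha_{k+1} \langle \nabla f(x_k), x_k - x_\star \rangle - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2 && (4) \\
&= \alpha_{k+1} \langle \nabla f(x_k), x_k - z_k \rangle + \alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2 \\
&= \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \langle \nabla f(x_k), y_k - x_k \rangle + \alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2 \\
&= \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \Big\langle \nabla f(x_k), x_{k-1} - x_k - \frac{1}{L} Q^{-1} \nabla f(x_{k-1}) \Big\rangle \\
&\quad + \alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2 && (5) \\
&\le \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \Big( f(x_{k-1}) - f(x_k) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) && (6) \\
&\quad + \alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2 \\
&\le \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \Big( f(x_{k-1}) - f(x_k) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) && (7) \\
&\quad + \frac{\alpha_{k+1}^2}{2} \|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star) - \frac{\alpha_{k+1}}{2L} \|\nabla f(x_k)\|_*^2.
\end{aligned}
\]
(4) and (6) follow from Lemma 11, (5) follows from the definition of linear coupling, and (7) follows from Lemma 1. The case of $k = 0$ follows from $\alpha_1 = \frac{2}{L}$ and $f_\star - f(x_0) - \langle \nabla f(x_0), x_\star - x_0 \rangle - \frac{1}{2L} \|\nabla f(x_0)\|_*^2 \ge 0$ with Lemma 1. ⊓⊔

Theorem 5 Assume (A1), (A2), and (A3). Let the positive sequence $\{\alpha_k\}_{k=1}^\infty$ satisfy $0 \le \alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$ for $k = 1, 2, \dots$ and $\alpha_1 = \frac{2}{L}$. Let $\tau_k = \frac{2}{\alpha_{k+1} L}$ for $k = 1, 2, \dots$. The $y_k$-sequence of (LC) exhibits the rate
\[
f(y_k) - f_\star \le \frac{2 V_{x_0}(x_\star)}{L \alpha_k^2}
\]
for $k = 1, 2, \dots$.

Proof Sum the inequality of Lemma 2 from $0$ to $(k - 1)$. Then use $V_{z_k}(x_\star) \ge 0$ and $f(y_k) \le f(x_{k-1}) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2$ to conclude the rate.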
⊓⊔

The $\{\theta_k\}_{k=0}^\infty$ of the original OGM formulation is related to $\{\alpha_k\}_{k=1}^\infty$ through $\alpha_{k+1} = 2\theta_k / L$ for $k = 0, 1, \dots$. The seemingly different parameter choices $\tau_k = \frac{1}{\alpha_{k+1} L}$ for AGM and $\tau_k = \frac{2}{\alpha_{k+1} L}$ for OGM actually turn out to be the same as $\{\alpha_k\}_{k=1}^\infty$ for AGM and OGM differ by a factor of 2.

The parameters $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are chosen to make the telescoping sum argument work and to make it work tightly, as described in Section C. Specifically, one starts with the form
\[
M_k \Big( f(x_k) - f_\star - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) + V_{z_{k+1}}(x_\star)
\le N_{k-1} \Big( f(x_{k-1}) - f_\star - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 \Big) + V_{z_k}(x_\star),
\]
where the scalar coefficients $M_k$, $N_{k-1}$ are determined by (7). Comparing the coefficients of $\|\nabla f(x_k)\|_*^2$, we have
\[
-\frac{1}{2L} \Big( \alpha_{k+1} + \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \Big) = -\frac{\alpha_{k+1}^2}{2} + \frac{1}{2L} \Big( \alpha_{k+1} + \frac{1 - \tau_k}{\tau_k} \alpha_{k+1} \Big).
\]
Solving this equation leads to the choice $\tau_k = \frac{2}{L \alpha_{k+1}}$. The requirement $\alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$ is needed for the telescoping sum argument to work, and the choice $\alpha_{k+1}^2 L - 2\alpha_{k+1} = \alpha_k^2 L$ makes the argument tight.

4.3 Secondary sequence analysis

In the linear coupling context, the last-step modification can be expressed as
\[
\tilde{x}_k = (1 - \tilde{\tau}_k) y_k + \tilde{\tau}_k z_k \tag{8}
\]
for $k = 0, 1, \dots$, where $\{\tilde{\tau}_k\}_{k=0}^\infty$ is a positive sequence to be determined.

Lemma 3 Assume (A1), (A2) and (A3). Let $0 < \tilde{\tau}_k = \frac{1}{\tilde{\alpha}_{k+1} L} \le 1$ for $k = 0, 1, \dots$, $\tilde{\alpha}_1 = \frac{1}{L}$, and $x_{-1} = x_0$. Then the $\tilde{x}_k$-sequence of (8), the secondary sequence with last-step modification of (LC), satisfies
\[
\tilde{\alpha}_{k+1}^2 L \big( f(\tilde{x}_k) - f_\star \big) + V_{z_{k+1}}(x_\star) \le \big( \tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1} \big) h(x_{k-1}) + V_{z_k}(x_\star)
\]
for $k = 0, 1, \dots$.

Proof The proof is identical to that of Lemma 2 with $\tau_k$ replaced by $\tilde{\tau}_k$. ⊓⊔

Theorem 6 In the setup of Theorem 5, let $0 \le \tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1} \le \frac{1}{2} \alpha_k^2 L$ and $\tilde{\alpha}_1 = \frac{1}{L}$. Then the $\tilde{x}_k$-sequence, the secondary sequence with last-step modification, of the linear coupling method (LC) exhibits the rate
\[
f(\tilde{x}_k) - f_\star \le \frac{V_{x_0}(x_\star)}{L \tilde{\alpha}_{k+1}^2}
\]
for $k = 0, 1, \dots$
Proof Sum the inequality of Lemma 2 from $0$ to $(k - 2)$ and the inequality of Lemma 3 with $k - 1$. Then use $V_{z_k}(x_\star) \ge 0$ to conclude the rate. ⊓⊔

4.4 Comparison of the linear coupling analyses of AGM and OGM

The linear coupling analysis of Allen-Zhu and Orecchia [5], which derives AGM, relies on the following two key lemmas.

Lemma 4 [5, Lemma 4.2] In the linear coupling setup,
\[
\begin{aligned}
\alpha_{k+1} \langle \nabla f(x_k), z_k - x_\star \rangle
&\le \frac{\alpha_{k+1}^2}{2} \|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star) \\
&\le \alpha_{k+1}^2 L \big( f(x_k) - f(y_{k+1}) \big) + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)
\end{aligned}
\]
for $k = 0, 1, \dots$.

Lemma 5 [5, Lemma 4.3] (Coupling Lemma) In the linear coupling setup,
\[
\alpha_{k+1}^2 L \big( f(y_{k+1}) - f_\star \big) + V_{z_{k+1}}(x_\star) \le \big( \alpha_{k+1}^2 L - \alpha_{k+1} \big) \big( f(y_k) - f_\star \big) + V_{z_k}(x_\star)
\]
for $k = 0, 1, \dots$.

As discussed, the proof of [5, Lemma 4.2] uses the non-tight inequality
\[
f(x_k) - f(y_{k+1}) \ge \frac{1}{2L} \|\nabla f(x_k)\|_*^2,
\]
and the proof of [5, Lemma 4.3] follows steps similar to that of Lemma 2, but uses the non-tight inequalities
\[
f(x_k) - f_\star \le \langle \nabla f(x_k), x_k - x_\star \rangle
\]
and
\[
\langle \nabla f(x_k), y_k - x_k \rangle \le f(y_k) - f(x_k).
\]
In both linear coupling analyses, for OGM and AGM, the telescoping sum argument is made tight by choosing $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ appropriately. However, the analysis of Allen-Zhu and Orecchia [5] uses non-tight inequalities before the telescoping sum argument, while our analysis uses tight inequalities in all steps.

4.5 Unification of AGM and OGM

If we choose $w(y) = \frac{1}{2t} \|y\|^2$, so that $V_x(y) = \frac{1}{2t} \|x - y\|^2$, and $0 < t \le 1$, so that $w$ is 1-strongly convex, and substitute $\alpha_{k+1} = 2\theta_k / L$, (LC) becomes
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
z_{k+1} &= z_k - \frac{2 t \theta_k}{L} \nabla f(x_k) \\
x_{k+1} &= \Big( 1 - \frac{1}{\theta_{k+1}} \Big) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1}
\end{aligned}
\]
for $k = 0, 1, \dots$. We also express this method with the momentum and correction terms and without the $z_k$-iterates in Lemma 6. This method unifies AGM and OGM through the constant $t$; AGM and OGM respectively correspond to $t = 1/2$ and $t = 1$.

Corollary 5 Assume (A1), (A2) and (A3). Let 0 0.
HAL Archives Ouvertes (2017)
8. Aujol, J.F., Dossal, C., Fort, G., Moulines, É.: Rates of convergence of perturbed FISTA-based algorithms. HAL Archives Ouvertes (2019)
9. Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. SIAM Journal on Optimization 29(4), 3131–3153 (2019)
10. Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the heavy-ball method for quasi-strongly convex optimization (2021)
11. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization 16(3), 697–725 (2006)
12. Baes, M.: Estimate sequence methods: extensions and approximations. Tech. rep., Institute for Operations Research, ETH, Zürich, Switzerland (2009)
13. Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory of Computing 15(4), 1–32 (2019)
14. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research 42(2), 330–348 (2017)
15. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31(3), 167–175 (2003)
16. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
17. De Klerk, E., Glineur, F., Taylor, A.B.: Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation. SIAM Journal on Optimization 30(3), 2053–2082 (2020)
18. Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Mathematical Programming (2021)
19. Drori, Y.: The exact information-based complexity of smooth convex minimization. Journal of Complexity 39, 1–16 (2017)
20.
Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. Journal of Complexity 68, 101590 (2022)
21. Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming 184(1), 183–220 (2020)
22. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming 145(1-2), 451–482 (2014)
23. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156(1-2), 59–99 (2016)
24. Gu, G., Yang, J.: Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM Journal on Optimization 30(3), 1905–1921 (2020)
25. Kim, D.: Accelerated proximal point method for maximally monotone operators. Mathematical Programming (2021)
26. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Mathematical Programming 159(1-2), 81–107 (2016)
27. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. Journal of Optimization Theory and Applications 172(1), 187–205 (2017)
28. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. Journal of Optimization Theory and Applications 178(1), 240–263 (2018)
29. Kim, D., Fessler, J.A.: Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM Journal on Optimization 28(1), 223–250 (2018)
30. Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM Journal on Optimization 28(2), 1920–1950 (2018)
31. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization 26(1), 57–95 (2016)
32.
Li, B., Coutiño, M., Giannakis, G.B.: Revisit of estimate sequence for accelerated gradient methods. ICASSP (2020)
33. Lieder, F.: On the convergence rate of the Halpern-iteration. Optimization Letters pp. 1–14 (2020)
34. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization 28(1), 333–354 (2018)
35. Nemirovsky, A.S.: On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity 7(2), 121–130 (1991)
36. Nemirovsky, A.S.: Information-based complexity of linear operator equations. Journal of Complexity 8(2), 153–175 (1992)
37. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. (1983)
38. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Proceedings of the USSR Academy of Sciences 269, 543–547 (1983)
39. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course (2004)
40. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152 (2005)
41. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming 112(1), 159–181 (2008)
42. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Mathematical Programming 120(1), 221–259 (2009)
43. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
44. Nesterov, Y., Stich, S.U.: Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization 27(1), 110–123 (2017)
45. Rockafellar, R.T.: Convex Analysis (1970)
46. Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: Tight contraction factors and optimal parameter selection. SIAM Journal on Optimization 30(3), 2251–2271 (2020)
47.
Ryu, E.K., Yin, W.: Large-Scale Convex Optimization via Monotone Operators. Draft (2021)
48. Shi, B., Du, S.S., Su, W., Jordan, M.I.: Acceleration via symplectic discretization of high-resolution differential equations. NeurIPS (2019)
49. Siegel, J.W.: Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671 (2019)
50. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. NeurIPS (2014)
51. Taylor, A., Drori, Y.: An optimal gradient method for smooth strongly convex minimization. Mathematical Programming (2022)
52. Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT (2019)
53. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM Journal on Optimization 27(3), 1283–1313 (2017)
54. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming 161(1-2), 307–345 (2017)
55. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences 113(47), E7351–E7358 (2016)

A Method reference

For reference, we restate all aforementioned methods. In all methods, we assume that $f$ is an $L$-smooth function, $\{\theta_k\}_{k=0}^\infty$ and $\{\varphi_k\}_{k=0}^\infty$ are sequences of positive scalars, and $x_0 = y_0 = z_0$.

OGM.
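One form of OGM is
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}} (y_{k+1} - y_k) + \frac{\theta_k}{\theta_{k+1}} (y_{k+1} - x_k)
\end{aligned}
\]
and an equivalent form with $z$-iterates is
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
z_{k+1} &= z_k - \frac{2\theta_k}{L} \nabla f(x_k) \\
x_{k+1} &= \Big( 1 - \frac{1}{\theta_{k+1}} \Big) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1}
\end{aligned}
\]
for $k = 0, 1, \dots$. The last-step modification on the secondary sequence can be written as
\[
\begin{aligned}
\tilde{x}_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\varphi_{k+1}} (y_{k+1} - y_k) + \frac{\theta_k}{\varphi_{k+1}} (y_{k+1} - x_k) \\
&= \Big( 1 - \frac{1}{\varphi_{k+1}} \Big) y_{k+1} + \frac{1}{\varphi_{k+1}} z_{k+1}
\end{aligned}
\]
where $k = 0, 1, \dots$.

OGM-simple. OGM-simple is a simpler variant of OGM with $\theta_k = \frac{k+2}{2}$ and $\varphi_k = \frac{k + 1 + \frac{1}{\sqrt{2}}}{\sqrt{2}}$. One form of OGM-simple is
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{k}{k+3} (y_{k+1} - y_k) + \frac{k+2}{k+3} (y_{k+1} - x_k)
\end{aligned}
\]
and an equivalent form with $z$-iterates is
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
z_{k+1} &= z_k - \frac{k+2}{L} \nabla f(x_k) \\
x_{k+1} &= \Big( 1 - \frac{2}{k+3} \Big) y_{k+1} + \frac{2}{k+3} z_{k+1}
\end{aligned}
\]
for $k = 0, 1, \dots$. The last-step modification on the secondary sequence is written as
\[
\tilde{x}_{k+1} = y_{k+1} + \frac{k}{\sqrt{2}(k+2) + 1} (y_{k+1} - y_k) + \frac{k+2}{\sqrt{2}(k+2) + 1} (y_{k+1} - x_k)
\]
where $k = 0, 1, \dots$.

SC-OGM. Here, we assume that $f$ is a $\mu$-strongly convex function, the condition number of $f$ is $\kappa = L/\mu$, and $\gamma = \frac{\sqrt{8\kappa + 1} + 3}{2\kappa - 2}$. SC-OGM is written as
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{1}{2\gamma + 1} (y_{k+1} - y_k) + \frac{1}{2\gamma + 1} (y_{k+1} - x_k)
\end{aligned}
\]
for $k = 0, 1, \dots$.

LC-OGM. LC-OGM (Linear Coupling OGM) is defined as
\[
\begin{aligned}
y_{k+1} &= x_k - L^{-1} Q^{-1} \nabla f(x_k) \\
z_{k+1} &= \operatorname*{argmin}_{y \in \mathbb{R}^n} \left\{ V_{z_k}(y) + \langle \alpha_{k+1} \nabla f(x_k), y - x_k \rangle \right\} \\
x_{k+1} &= (1 - \tau_{k+1}) y_{k+1} + \tau_{k+1} z_{k+1}
\end{aligned}
\]
for $k = 0, 1, \dots$, where $V_z(y)$ is a Bregman divergence, $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are nonnegative sequences defined as $\alpha_1 = \frac{2}{L}$, $0 \le \alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$, $\tau_k = \frac{2}{\alpha_{k+1} L}$, and $Q$ is a positive definite matrix defining $\|x\|^2 = x^T Q x$.

For the last-step modification, we define positive sequences $\{\tilde{\alpha}_k\}_{k=1}^\infty$ and $\{\tilde{\tau}_k\}_{k=1}^\infty$ as $\tilde{\alpha}_1 = \frac{1}{L}$, $0 \le \tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1} \le \frac{1}{2} \alpha_k^2 L$, and $\tilde{\tau}_k = \frac{1}{\tilde{\alpha}_{k+1} L}$, and also define
\[
\tilde{x}_k = (1 - \tilde{\tau}_k) y_k + \tilde{\tau}_k z_k
\]
for $k = 1, 2, \dots$.

Unification of AGM and OGM.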
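Using LC-OGM, we can unify AGM and OGM as
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
z_{k+1} &= z_k - \frac{2 t \theta_k}{L} \nabla f(x_k) \\
x_{k+1} &= \Big( 1 - \frac{1}{\theta_{k+1}} \Big) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1}
\end{aligned}
\]
for $k = 0, 1, \dots$. This is equivalent to
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} \nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}} (y_{k+1} - y_k) + (2t - 1) \frac{\theta_k}{\theta_{k+1}} (y_{k+1} - x_k).
\end{aligned}
\]

LC-SC-OGM. LC-SC-OGM (Linear Coupling Strongly Convex OGM) is
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} Q^{-1} \nabla f(x_k) \\
z_{k+1} &= \frac{1}{1 + \gamma} \Big( z_k + \gamma x_k - \frac{\gamma}{\mu} Q^{-1} \nabla f(x_k) \Big) \\
x_{k+1} &= \tau z_{k+1} + (1 - \tau) y_{k+1},
\end{aligned}
\]
for $k = 0, 1, \dots$, where $Q$ is a positive definite matrix.

B Co-coercivity inequality in general norm

Lemma 7 Let $f$ be a closed convex proper function. Then,
\[
0 \le f(x) + f^*(u) - \langle x, u \rangle
\]
and
\[
\begin{aligned}
\inf_x \{ f(x) + f^*(u) - \langle x, u \rangle \} &= 0 \\
\inf_u \{ f(x) + f^*(u) - \langle x, u \rangle \} &= 0.
\end{aligned}
\]

Proof By the definition of the conjugate function,
\[
-f^*(u) = \inf_x \{ f(x) - \langle x, u \rangle \}
\]
and
\[
\inf_x \{ f(x) + f^*(u) - \langle x, u \rangle \} = 0.
\]
Therefore,
\[
0 \le f(x) + f^*(u) - \langle x, u \rangle \qquad \forall x.
\]
The statement with $u$ follows from the same argument and the fact that $f^{**} = f$. ⊓⊔

Lemma 8 Consider a norm $\|\cdot\|$ and its dual norm $\|\cdot\|_*$. Then,
\[
0 \le \frac{1}{2} \|x\|^2 + \frac{1}{2} \|u\|_*^2 - \langle x, u \rangle
\]
and
\[
\begin{aligned}
\inf_{x \in \mathbb{R}^n} \Big\{ \frac{1}{2} \|x\|^2 + \frac{1}{2} \|u\|_*^2 - \langle x, u \rangle \Big\} &= 0 \\
\inf_{u \in \mathbb{R}^n} \Big\{ \frac{1}{2} \|x\|^2 + \frac{1}{2} \|u\|_*^2 - \langle x, u \rangle \Big\} &= 0.
\end{aligned}
\]

Proof This follows from Lemma 7 with $f(x) = \frac{1}{2} \|x\|^2$ and $\big( \frac{1}{2} \|\cdot\|^2 \big)^* = \frac{1}{2} \|\cdot\|_*^2$. ⊓⊔

Lemma 9 Let
\[
\mathrm{Grad}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ \frac{L}{2} \|y - x\|^2 + \langle \nabla f(x), y - x \rangle \Big\}.
\]
Then,
\[
\langle \nabla f(x), \mathrm{Grad}(x) - x \rangle + \frac{L}{2} \|\mathrm{Grad}(x) - x\|^2 = -\frac{1}{2L} \|\nabla f(x)\|_*^2.
\]

Proof Let $z = L(\mathrm{Grad}(x) - x)$. By the definition of $\mathrm{Grad}(x)$ and Lemma 8, we have
\[
\begin{aligned}
&\frac{1}{2L} \|\nabla f(x)\|_*^2 + \frac{L}{2} \|\mathrm{Grad}(x) - x\|^2 + \langle \nabla f(x), \mathrm{Grad}(x) - x \rangle \\
&= \inf_{z \in \mathbb{R}^n} \Big\{ \frac{1}{2L} \|\nabla f(x)\|_*^2 + \frac{1}{2L} \|z\|^2 + \frac{1}{L} \langle \nabla f(x), z \rangle \Big\} \\
&= 0.
\end{aligned}
\]
⊓⊔

Lemma 10 Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a differentiable convex function such that
\[
\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|
\]
for all $x, y \in \mathbb{R}^n$. Then
\[
f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2.
\]

Proof Since a differentiable convex function is continuously differentiable [45, Theorem 25.5],
\[
\begin{aligned}
f(y) - f(x) &= \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \, dt \\
&= \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt + \langle \nabla f(x), y - x \rangle \\
&\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|_* \|y - x\| \, dt + \langle \nabla f(x), y - x \rangle \\
&\le \int_0^1 t L \|y - x\|^2 \, dt + \langle \nabla f(x), y - x \rangle = \frac{L}{2} \|y - x\|^2 + \langle \nabla f(x), y - x \rangle.
\end{aligned}
\]
⊓⊔

Lemma 11 (Co-coercivity inequality with general norm) Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a differentiable convex function such that
\[
\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|
\]
for all $x, y \in \mathbb{R}^n$. Then
\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_*^2.
\]

Proof Set $\phi(y) = f(y) - \langle \nabla f(x), y - x \rangle$. Then $x \in \operatorname{argmin} \phi$. So by Lemmas 9 and 10,
\[
\begin{aligned}
\phi(x) &\le \phi(\mathrm{Grad}(y)) \\
&\le \phi(y) + \langle \nabla \phi(y), \mathrm{Grad}(y) - y \rangle + \frac{L}{2} \|\mathrm{Grad}(y) - y\|^2 \\
&= \phi(y) - \frac{1}{2L} \|\nabla \phi(y)\|_*^2.
\end{aligned}
\]
Substituting $f$ back in $\phi$ yields the co-coercivity inequality. ⊓⊔

C Telescoping sum argument

Suppose we established the inequality
\[
a_i F_i + b_i G_i \le c_i F_{i-1} + d_i G_{i-1} - E_i
\]
for $i = 1, 2, \dots$, where $E_i$, $F_i$, $G_i$ are nonnegative quantities and $a_i$, $b_i$, $c_i$, and $d_i$ are nonnegative scalars. Assume $c_i \le a_{i-1}$ and $d_i \le b_{i-1}$. By summing the inequalities for $i = 1, 2, \dots, k$, we obtain
\[
\begin{aligned}
a_k F_k &\le -b_k G_k - \sum_{i=2}^k (a_{i-1} - c_i) F_{i-1} - \sum_{i=2}^k (b_{i-1} - d_i) G_{i-1} - \sum_{i=1}^k E_i + c_1 F_0 + d_1 G_0 \\
&\le c_1 F_0 + d_1 G_0.
\end{aligned}
\]
However, note that the
\[
-b_k G_k - \sum_{i=2}^k (a_{i-1} - c_i) F_{i-1} - \sum_{i=2}^k (b_{i-1} - d_i) G_{i-1} - \sum_{i=1}^k E_i
\]
terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose parameters so that
\[
a_{i-1} = c_i, \qquad b_{i-1} = d_i
\]
and $E_i = 0$ for $i = 1, 2, \dots$. Not having wasted terms may be an indication that the analysis is tight.

D SC-OGM via linear coupling

In this section, we analyze SC-OGM through the linear coupling analysis. We consider the linear coupling form
\[
\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L} Q^{-1} \nabla f(x_k) \\
z_{k+1} &= \frac{1}{1 + \gamma} \Big( z_k + \gamma x_k - \frac{\gamma}{\mu} Q^{-1} \nabla f(x_k) \Big) \\
x_{k+1} &= \tau z_{k+1} + (1 - \tau) y_{k+1},
\end{aligned}
\]
where $\tau$ is a coupling coefficient to be determined.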
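As an aside, we can view $z_{k+1}$ as a mirror descent update of the form
\[
z_{k+1} = \operatorname*{argmin}_z \Big\{ \frac{1}{2} \|z - z_k\|^2 + \frac{\gamma}{2} \|z - x_k\|^2 + \frac{\gamma}{\mu} \langle \nabla f(x_k), z \rangle \Big\},
\]
which is similar to what was considered in [6].

Lemma 12 Assume (A1), (A2) and (A3). Then,
\[
\frac{\gamma}{\mu} \langle \nabla f(x_k), z_{k+1} - x_\star \rangle - \frac{\gamma}{2} \|x_k - x_\star\|^2
\le -\frac{\gamma^2}{2(1 + \gamma)\mu^2} \|\nabla f(x_k)\|_*^2 + \frac{1}{2} \|z_k - x_\star\|^2 - \frac{1 + \gamma}{2} \|z_{k+1} - x_\star\|^2
\]
for $k = 0, 1, \dots$.

Proof This proof follows steps similar to that of [6, Lemma 5.4]. From the definition of $z_{k+1}$, we have
\[
\begin{aligned}
0 &= \Big\langle \frac{\partial}{\partial z} \Big\{ \frac{1}{2} \|z - z_k\|^2 + \frac{\gamma}{2} \|z - x_k\|^2 + \frac{\gamma}{\mu} \langle \nabla f(x_k), z \rangle \Big\} \Big|_{z = z_{k+1}}, z_{k+1} - x_\star \Big\rangle \\
&= \langle Q(z_{k+1} - z_k), z_{k+1} - x_\star \rangle + \frac{\gamma}{\mu} \langle \nabla f(x_k), z_{k+1} - x_\star \rangle + \gamma \langle Q(z_{k+1} - x_k), z_{k+1} - x_\star \rangle.
\end{aligned}
\]
By the three point equation,
\[
\frac{\gamma}{\mu} \langle \nabla f(x_k), z_{k+1} - x_\star \rangle + \gamma \Big( \frac{1}{2} \|x_k - z_{k+1}\|^2 - \frac{1}{2} \|x_k - x_\star\|^2 \Big)
= -\frac{1}{2} \|z_k - z_{k+1}\|^2 + \frac{1}{2} \|z_k - x_\star\|^2 - \frac{1 + \gamma}{2} \|z_{k+1} - x_\star\|^2.
\]
Plugging in the definition of $z_{k+1}$,
\[
\begin{aligned}
&\frac{\gamma}{2} \|x_k - z_{k+1}\|^2 + \frac{1}{2} \|z_k - z_{k+1}\|^2 \\
&= \frac{\gamma}{2} \Big\| \frac{1}{1 + \gamma} (x_k - z_k) + \frac{\gamma}{(1 + \gamma)\mu} Q^{-1} \nabla f(x_k) \Big\|^2 + \frac{1}{2} \Big\| -\frac{\gamma}{1 + \gamma} (x_k - z_k) + \frac{\gamma}{(1 + \gamma)\mu} Q^{-1} \nabla f(x_k) \Big\|^2 \\
&\ge \frac{\gamma^2}{2(1 + \gamma)\mu^2} \|\nabla f(x_k)\|_*^2.
\end{aligned}
\]
Combining the results above, we get
\[
\frac{\gamma}{\mu} \langle \nabla f(x_k), z_{k+1} - x_\star \rangle - \frac{\gamma}{2} \|x_k - x_\star\|^2
\le -\frac{\gamma^2}{2(1 + \gamma)\mu^2} \|\nabla f(x_k)\|_*^2 + \frac{1}{2} \|z_k - x_\star\|^2 - \frac{1 + \gamma}{2} \|z_{k+1} - x_\star\|^2.
\]
⊓⊔

Lemma 13 (Coupling lemma in SC-OGM) Assume (A1), (A2) and (A3). Then
\[
(1 + \gamma) \Big( f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 + \frac{\mu}{2} \|z_k - x_\star\|^2 \Big)
\le \Big( f(x_{k-1}) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 + \frac{\mu}{2} \|z_{k-1} - x_\star\|^2 \Big)
\]
holds for $k = 1, 2, \dots$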
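Proof We have
\[
\begin{aligned}
&\gamma \big( f(x_k) - f(x_\star) \big) \\
&\le \gamma \langle \nabla f(x_k), x_k - x_\star \rangle - \frac{\mu\gamma}{2} \|x_k - x_\star\|^2 \\
&= \gamma \langle \nabla f(x_k), x_k - z_k \rangle + \gamma \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\mu\gamma}{2} \|x_k - x_\star\|^2 \\
&= \frac{1 - \tau}{\tau} \gamma \langle \nabla f(x_k), y_k - x_k \rangle + \gamma \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\mu\gamma}{2} \|x_k - x_\star\|^2 \\
&= \frac{1 - \tau}{\tau} \gamma \Big\langle \nabla f(x_k), x_{k-1} - x_k - \frac{1}{L} Q^{-1} \nabla f(x_{k-1}) \Big\rangle + \gamma \langle \nabla f(x_k), z_k - x_\star \rangle - \frac{\mu\gamma}{2} \|x_k - x_\star\|^2 \\
&\le \Big( \frac{1 - \tau}{\tau} \gamma - 1 \Big) \Big\langle \nabla f(x_k), x_{k-1} - x_k - \frac{1}{L} Q^{-1} \nabla f(x_{k-1}) \Big\rangle \\
&\quad + \Big( f(x_{k-1}) - f(x_k) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) \\
&\quad + \gamma \langle \nabla f(x_k), z_k - z_{k+1} \rangle + \gamma \langle \nabla f(x_k), z_{k+1} - x_\star \rangle - \frac{\mu\gamma}{2} \|x_k - x_\star\|^2 \\
&\le \Big( \frac{1 - \tau}{\tau} \gamma - 1 \Big) \langle \nabla f(x_k), y_k - x_k \rangle + \Big( f(x_{k-1}) - f(x_k) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) \\
&\quad + \gamma \langle \nabla f(x_k), z_k - z_{k+1} \rangle - \frac{\gamma^2}{2(1 + \gamma)\mu} \|\nabla f(x_k)\|_*^2 + \frac{\mu}{2} \|z_k - x_\star\|^2 - \frac{(1 + \gamma)\mu}{2} \|z_{k+1} - x_\star\|^2,
\end{aligned}
\]
where the last inequality is an application of Lemma 12. Note that
\[
\begin{aligned}
z_k - z_{k+1} &= z_k - \frac{1}{1 + \gamma} \Big( z_k + \gamma x_k - \frac{\gamma}{\mu} Q^{-1} \nabla f(x_k) \Big) \\
&= \frac{\gamma}{1 + \gamma} (z_k - x_k) + \frac{\gamma}{(1 + \gamma)\mu} Q^{-1} \nabla f(x_k) \\
&= \frac{\gamma}{1 + \gamma} \frac{1 - \tau}{\tau} (x_k - y_k) + \frac{\gamma}{(1 + \gamma)\mu} Q^{-1} \nabla f(x_k).
\end{aligned}
\]
To eliminate the $\langle \nabla f(x_k), \cdot \rangle$ term, we choose $\tau$ to satisfy
\[
\frac{1 - \tau}{\tau} \gamma - 1 = \frac{\gamma^2}{1 + \gamma} \frac{1 - \tau}{\tau}. \tag{9}
\]
Plugging this in, the inequality above is
\[
\begin{aligned}
\gamma \big( f(x_k) - f(x_\star) \big)
&\le \Big( f(x_{k-1}) - f(x_k) - \frac{1}{2L} \|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L} \|\nabla f(x_k)\|_*^2 \Big) \\
&\quad + \frac{\gamma^2}{2(1 + \gamma)\mu} \|\nabla f(x_k)\|_*^2 + \frac{\mu}{2} \|z_k - x_\star\|^2 - \frac{(1 + \gamma)\mu}{2} \|z_{k+1} - x_\star\|^2.
\end{aligned}
\]
In order to make the telescoping form such as
\[
M_k \Big( f(x_k) - B_k \|\nabla f(x_k)\|_*^2 + C_k \|z_{k+1} - x_\star\|^2 \Big)
\le N_{k-1} \Big( f(x_{k-1}) - B_{k-1} \|\nabla f(x_{k-1})\|_*^2 + C_{k-1} \|z_k - x_\star\|^2 \Big),
\]
we chose $B_k = \frac{1}{2L}$ and $C_k = \frac{\mu}{2}$, which leads to the choice of $\gamma$ satisfying
\[
\frac{2 + \gamma}{2L} = \frac{\gamma^2}{2(1 + \gamma)\mu}. \tag{10}
\]
We get the desired result by plugging (9) and (10) in the above inequality. ⊓⊔

E Asymptotic characterization of $\theta_k$

Theorem 7 Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ for $k = 0, 1, \dots$. Then,
\[
\theta_k = \frac{k + \zeta + 1}{2} + \frac{\log k}{4} + o(1).
\]

Proof Let $\theta_k = \frac{k+2}{2} + c_k \log k$. The proof consists of the following 3 steps:
1. If $c_k < \frac{1}{4}$, then $c_{k+1} < \frac{1}{4}$.
2. $c_k \to \frac{1}{4}$ as $k \to \infty$.
3. If $\theta_k = \frac{k+2}{2} + \frac{\log k}{4} + e_k$, then $e_k$ is convergent.

First step. If $c_k < \frac{1}{4}$, then $c_{k+1} < \frac{1}{4}$.
For our convenience, let $c_0 = 0$ with $c_0 \log 0 = 0$. Plugging this in $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$, we have
\[
\Big( \frac{k+2}{2} + c_{k+1} \log(k+1) \Big)^2 = \Big( \frac{k+2}{2} + c_k \log k \Big)^2 + \frac{1}{4},
\]
so
\[
\big( c_{k+1} \log(k+1) - c_k \log k \big) \big( k + 2 + c_{k+1} \log(k+1) + c_k \log k \big) = \frac{1}{4}.
\]
Assume $c_{k+1} \ge 1/4$. Then
\[
\begin{aligned}
\frac{1}{4} &= \big( c_{k+1} \log(k+1) - c_k \log k \big) \big( k + 2 + c_{k+1} \log(k+1) + c_k \log k \big) \\
&\ge \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) (k + 2) \\
&> \frac{1}{4},
\end{aligned}
\]
which is a contradiction and proves the first claim.

Second step. $c_k \to \frac{1}{4}$ as $k \to \infty$.
Put $d_k = \frac{1}{4} - c_k$, then $0 < d_k \le \frac{1}{4}$.
\[
\begin{aligned}
\frac{1}{4} &= \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) - d_{k+1} \log(k+1) + d_k \log k \Big) \Big( k + 2 + \frac{1}{4} \log k(k+1) - d_{k+1} \log(k+1) - d_k \log k \Big) \\
&\le \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) - d_{k+1} \log(k+1) + d_k \log k \Big) \Big( k + 2 + \frac{1}{2} \log(k+1) \Big).
\end{aligned}
\]
Therefore
\[
d_{k+1} \log(k+1) - d_k \log k \le \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) - \frac{1}{4} \cdot \frac{1}{k + 2 + \frac{1}{2} \log(k+1)}.
\]
By Taylor expansion,
\[
d_{k+1} \log(k+1) - d_k \log k \le \frac{1}{4} \Big( \frac{3 + 2 \log k}{2k^2} + O\Big( \frac{1}{k^2} \Big) \Big).
\]
So, by summing the above inequality from $1$ to $k$, $d_{k+1} \log(k+1) \le C$ for some constant $C$, so $d_{k+1} < \frac{C}{\log(k+1)}$. In conclusion, as $k \to \infty$, $d_k \to 0$.

Third step. If $\theta_k = \frac{k+2}{2} + \frac{\log k}{4} + e_k$, then $e_k$ converges.
From the previous claim, we can say that for sufficiently large $k$, $|e_k| < \frac{1}{6} \log k$. We have
\[
\Big( \frac{k+2}{2} + \frac{1}{4} \log(k+1) + e_{k+1} \Big)^2 = \Big( \frac{k+2}{2} + \frac{1}{4} \log k + e_k \Big)^2 + \frac{1}{4}.
\]
Then,
\[
\begin{aligned}
\frac{1}{4} &= \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) + e_{k+1} - e_k \Big) \Big( k + 2 + \frac{1}{4} \log k(k+1) + e_{k+1} + e_k \Big) \\
&\le \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) + e_{k+1} - e_k \Big) \Big( k + 2 + \frac{5}{6} \log(k+1) \Big).
\end{aligned}
\]
So,
\[
e_{k+1} - e_k \ge \frac{1}{4 \big( k + 2 + \frac{5}{6} \log(k+1) \big)} - \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) = -\frac{\frac{5}{6} \log k + \frac{3}{2}}{k^2} + O\Big( \frac{1}{k^2} \Big).
\]
Summing this from $1$ to $k$, we get that $e_{k+1} > D$ for some constant $D$. Moreover,
\[
\begin{aligned}
\frac{1}{4} &= \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) + e_{k+1} - e_k \Big) \Big( k + 2 + \frac{1}{4} \log k(k+1) + e_{k+1} + e_k \Big) \\
&\ge \Big( \frac{1}{4} \log\Big( 1 + \frac{1}{k} \Big) + e_{k+1} - e_k \Big) (k + 2) > \frac{1}{4} + (k + 2)(e_{k+1} - e_k),
\end{aligned}
\]
which indicates that $e_{k+1} < e_k$. Since $\{e_k\}_{k=0}^\infty$ is a decreasing sequence with a lower bound, it converges.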
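⊓⊔

Proof of equality in Section 2.1 We have
\[
\begin{aligned}
\frac{L \|x_0 - x_\star\|^2}{2\theta_{k-1}^2}
&= \frac{L \|x_0 - x_\star\|^2}{2 \Big( \frac{k + \zeta}{2} + \frac{\log(k-1)}{4} + o(1) \Big)^2} \\
&= \frac{2L \|x_0 - x_\star\|^2}{(k + \zeta)^2 \Big( 1 + \frac{\log(k-1)}{2(k + \zeta)} + o(1/k) \Big)^2} \\
&= \frac{2L \|x_0 - x_\star\|^2}{(k + \zeta)^2} \Big( 1 - 2 \frac{\log(k-1)}{2(k + \zeta)} + o(1/k) \Big) \\
&= \frac{2L \|x_0 - x_\star\|^2}{(k + \zeta)^2} - \frac{2L \|x_0 - x_\star\|^2 \log k}{(k + \zeta)^3} + o\Big( \frac{1}{k^3} \Big),
\end{aligned}
\]
which verifies the equality in Section 2.1.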
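The asymptotic characterization of Theorem 7 can be checked numerically. The following is a minimal sketch, not part of the original proof: it iterates the recursion $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ from $\theta_0 = 1$ by taking the positive root, and prints the residual $e_k = \theta_k - \frac{k+2}{2} - \frac{\log k}{4}$ from the third step of the proof. The function names and the chosen values of $k$ are illustrative assumptions. If the theorem holds, the printed residuals should decrease and settle toward a constant, which by the statement of Theorem 7 would equal $(\zeta - 1)/2$.

```python
import math

def theta_sequence(n):
    """Return [theta_0, ..., theta_n] with theta_0 = 1 and
    theta_{k+1}^2 - theta_{k+1} - theta_k^2 = 0 (positive root)."""
    thetas = [1.0]
    for _ in range(n):
        t = thetas[-1]
        # Positive root of s^2 - s - t^2 = 0.
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0)
    return thetas

def residual(thetas, k):
    """e_k = theta_k - (k + 2)/2 - log(k)/4, for k >= 1."""
    return thetas[k] - (k + 2) / 2.0 - math.log(k) / 4.0

thetas = theta_sequence(100_000)
for k in (10, 100, 1_000, 10_000, 100_000):
    print(k, residual(thetas, k))
```

The slow stabilization of the residual is consistent with the per-step change of order $\log k / k^2$ obtained in the third step.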