ACCEPTED MANUSCRIPT                                      
Factor-√2 Acceleration of Accelerated Gradient Methods
This Accepted Manuscript (AM) is a PDF file of the manuscript accepted for publication after peer review, when applicable, but
does not reflect post-acceptance improvements, or any corrections. Use of this AM is subject to the publisher's embargo period
and AM terms of use. Under no circumstances may this AM be shared or distributed under a Creative Commons or other form of
open access license, nor may it be reformatted or enhanced, whether by the Author or third parties. By using this AM (for
example, by accessing or downloading) you agree to abide by Springer Nature's terms of use for AM versions of subscription
articles: https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms
The Version of Record (VOR) of this article, as published and maintained by the publisher, is available online at:
https://doi.org/10.1007/s00245-023-10047-9. The VOR is the version of the article after copy-editing and typesetting, and
connected to open research data, open protocols, and open code where available. Any supplementary information can be found on
the journal website, connected to the VOR.
For research integrity purposes it is best practice to cite the published Version of Record (VOR), where available (for example,
see ICMJE’s guidelines on overlapping publications). Where users do not have access to the VOR, any citation must clearly
indicate that the reference is to an Accepted Manuscript (AM) version.
Factor-√2 Acceleration of Accelerated Gradient Methods
Chanwoo Park · Jisun Park · Ernest K. Ryu
Received: date / Accepted: date
Abstract The optimized gradient method (OGM) provides a factor-$\sqrt{2}$ speedup upon Nesterov’s celebrated accelerated gradient method in the convex (but non-strongly convex) setup. However, this improved acceleration mechanism has not been well understood; prior analyses of OGM relied on a computer-assisted proof methodology, so the proofs were opaque for humans despite being verifiable and correct. In this work, we present a new analysis of OGM based on a Lyapunov function and linear coupling. These analyses are developed and presented without the assistance of computers and are understandable by humans. Furthermore, we generalize OGM’s acceleration mechanism and obtain a factor-$\sqrt{2}$ speedup in other setups: acceleration with a simpler rational stepsize, the strongly convex setup, and the mirror descent setup.
1 Introduction
Nesterov’s celebrated accelerated gradient method (AGM) solves the problem of finding the minimum of an $L$-smooth convex function with an “optimal” accelerated $O(1/k^2)$ complexity [38]. Surprisingly, AGM turned out to be not exactly optimal, but optimal only up to a constant. The optimized gradient method (OGM) has a factor-2 smaller (better) worst-case guarantee and thereby requires factor-$\sqrt{2}$ fewer iterations to guarantee the same accuracy [22,26].
Chanwoo Park
Department of Statistics, Seoul National University
E-mail: chanwoo.park@snu.ac.kr
Jisun Park
Department of Mathematical Sciences, Seoul National University
E-mail: colleenp0515@snu.ac.kr
Ernest K. Ryu
Department of Mathematical Sciences, Seoul National University
E-mail: eryu@snu.ac.kr
However, this remarkable discovery has not been well understood. OGM
was originally obtained through a computer-assisted methodology based on the
performance estimation problem (PEP). The resulting convergence analyses
involve arduous but elementary calculations that are verifiable but arguably
not understandable by humans.
Contribution. In this work, we present human-understandable analyses of OGM. First, we show that the improved acceleration mechanism of OGM can be understood and analyzed through an unconventional Lyapunov function. We then use this insight to propose a new method that obtains the factor-$\sqrt{2}$ speedup in the strongly convex setup. Finally, we present a human-understandable derivation of OGM based on refining the linear coupling analysis of Allen-Zhu and Orecchia [5], and generalize OGM to the mirror descent setup.
As minor contributions, we analyze the primary and secondary sequences of OGM through a single unified analysis; to the best of our knowledge, prior works provide two separate convergence proofs for x- and y-sequences. Moreover, we present a unified class of accelerated methods containing AGM and OGM through the linear coupling analysis.
1.1 Definitions and prior work
For $L > 0$, a differentiable convex function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth with respect to a norm $\|\cdot\|$ if
$$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\| \qquad \forall x, y \in \mathbb{R}^n,$$
where $\|\cdot\|_*$ denotes the dual norm. A convex function $f : \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex if $f(x) - (\mu/2)\|x\|^2$ is convex [39,47].
Throughout this paper, we consider the problem
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad f(x)$$
and make the following assumptions on $f : \mathbb{R}^n \to \mathbb{R}$:
(A1) $f$ is convex, differentiable, and $L$-smooth with respect to $\|\cdot\|$ and
(A2) $f$ has a minimizer (not necessarily unique).
We write $x_\star$ for a minimizer of $f$ and $f_\star = f(x_\star)$ for the optimal value. To clarify, the proofs of Section 2 do not require the minimizer $x_\star$ to be unique.
Nesterov’s AGM. Nesterov’s AGM is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k),
\end{aligned}$$
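where $y_0 = x_0$, $\theta_0 = 1$, and $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ for $k = 0, 1, \dots$ [38]. We can equivalently write AGM as
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
z_{k+1} &= z_k - \frac{\theta_k}{L}\nabla f(x_k) \\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1}
\end{aligned}$$
with $z_0 = x_0$ [40].
AGM can be generalized to use the relaxed parameter requirement $\theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ on the positive sequence $\{\theta_k\}_{k=0}^\infty$. The choice $\theta_k = (k + 2)/2$ is a common instance.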
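In the setup where $f$ is furthermore $\mu$-strongly convex, Nesterov’s AGM for the strongly convex setup (SC-AGM) is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}(y_{k+1} - y_k)
\end{aligned}$$
for $k = 0, 1, \dots$, where $\kappa = L/\mu$ and $y_0 = x_0$ [39].
Optimized gradient method. OGM is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k)
\end{aligned}$$
for $k = 0, 1, \dots$, where $y_0 = x_0$ and $\{\theta_k\}_{k=1}^\infty$ is the same as that of AGM [22,26]. We refer to $\frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k)$ as the momentum term and $\frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k)$ as the correction term. The added correction term is the difference between AGM and OGM. We can equivalently write OGM as
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
z_{k+1} &= z_k - \frac{2\theta_k}{L}\nabla f(x_k) \\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right) y_{k+1} + \frac{1}{\theta_{k+1}} z_{k+1},
\end{aligned}$$
where $z_0 = x_0$ [26]. The factor 2 in $z_{k+1}$ is the difference compared to AGM.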
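As a minimal sketch (not the authors' code), the following Python snippet runs OGM in both the momentum-plus-correction form and the z-sequence form on a small diagonal quadratic and measures how far apart the x-iterates are. The test function, the starting point, the iteration count, and the helper names `theta_sequence`, `ogm_momentum`, and `ogm_z_form` are illustrative assumptions.

```python
import math

def theta_sequence(n):
    """theta_0 = 1 and theta_{k+1}^2 - theta_{k+1} = theta_k^2 for k = 0, ..., n-1."""
    thetas = [1.0]
    for _ in range(n):
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * thetas[-1] ** 2)) / 2.0)
    return thetas

def ogm_momentum(grad, x0, L, n):
    """OGM written with the momentum term and the correction term."""
    th = theta_sequence(n)
    x, y = list(x0), list(x0)  # y_0 = x_0
    xs = [list(x)]
    for k in range(n):
        g = grad(x)
        y_new = [xi - gi / L for xi, gi in zip(x, g)]
        a = (th[k] - 1.0) / th[k + 1]  # momentum coefficient
        b = th[k] / th[k + 1]          # correction coefficient
        x = [yn + a * (yn - yo) + b * (yn - xo)
             for yn, yo, xo in zip(y_new, y, x)]
        y = y_new
        xs.append(list(x))
    return xs

def ogm_z_form(grad, x0, L, n):
    """OGM written with the auxiliary z-sequence (factor 2 in the z-update)."""
    th = theta_sequence(n)
    x, z = list(x0), list(x0)  # z_0 = x_0
    xs = [list(x)]
    for k in range(n):
        g = grad(x)
        y_new = [xi - gi / L for xi, gi in zip(x, g)]
        z = [zi - 2.0 * th[k] * gi / L for zi, gi in zip(z, g)]
        t = 1.0 / th[k + 1]
        x = [(1.0 - t) * yn + t * zn for yn, zn in zip(y_new, z)]
        xs.append(list(x))
    return xs

# Assumed test problem: f(x) = 0.5 * sum(e_i * x_i^2), which is L-smooth with L = max(e_i).
EIGS = [1.0, 0.3, 0.05]

def grad(x):
    return [e * xi for e, xi in zip(EIGS, x)]

xs_a = ogm_momentum(grad, [1.0, -2.0, 3.0], 1.0, 30)
xs_b = ogm_z_form(grad, [1.0, -2.0, 3.0], 1.0, 30)
max_gap = max(abs(u - v) for xa, xb in zip(xs_a, xs_b) for u, v in zip(xa, xb))
print(max_gap)  # expected to be at the level of floating-point rounding
```

Replacing the factor `2.0` in the z-update by `1.0` and dropping the correction term `b * (yn - xo)` would give the two corresponding forms of AGM.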
The $y_k$-sequence of OGM exhibits a rate faster than that of AGM by a factor of $\sqrt{2}$. This rate was proved in [27], and we also state it in Corollary 1. To clarify, the guarantee on the function value is smaller (better) by a factor of 2, and, combined with the $O(1/k^2)$ iteration dependence, this represents a factor-$\sqrt{2}$ reduction in the number of iterations necessary to reach a given accuracy.
Furthermore, OGM’s original presentation [22,26] involves what we refer to as the last-step modification on the secondary sequence
$$\begin{aligned}
\tilde{x}_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\varphi_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\varphi_{k+1}}(y_{k+1} - x_k) \\
&= \left(1 - \frac{1}{\varphi_{k+1}}\right) y_{k+1} + \frac{1}{\varphi_{k+1}} z_{k+1},
\end{aligned}$$
where $\varphi_k^2 - \varphi_k - 2\theta_{k-1}^2 = 0$. The $\tilde{x}_k$-sequence of OGM exhibits a rate slightly better than OGM’s $y_k$-sequence and is in fact exactly optimal [19] under the smooth (non-strongly) convex function class. This rate was proved in the original presentation of OGM [22,26], and we also state it in Corollary 3. In this work, we present the first variant of OGM for the strongly convex setup.
$\theta_k$-sequence asymptotic characterization. Throughout the exposition of this work, we will use the following asymptotic characterization: if $\theta_0 = 1$ and $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ for $k = 0, 1, \dots$, then
$$\theta_k = \frac{k + \zeta + 1}{2} + \frac{\log k}{4} + o(1) \qquad (1)$$
as $k \to \infty$, where $\zeta \approx 0.646$. While we suspect this result may be known, we could not find it in any reference. Therefore, we formally state and prove (1) as Lemma 7 in the appendix.
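The characterization (1) can be probed numerically. The sketch below, which is only an illustration and not part of the paper, generates the sequence from its recursion and solves (1) for $\zeta$ while ignoring the $o(1)$ term; the helper names `theta_sequence` and `zeta_estimate` and the chosen indices are assumptions.

```python
import math

def theta_sequence(n):
    """theta_0 = 1 and theta_{k+1} = (1 + sqrt(1 + 4 theta_k^2)) / 2."""
    thetas = [1.0]
    for _ in range(n):
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * thetas[-1] ** 2)) / 2.0)
    return thetas

thetas = theta_sequence(100000)

def zeta_estimate(k):
    """Rearrange (1) as zeta = 2 theta_k - k - 1 - (log k) / 2, dropping the o(1) term."""
    return 2.0 * thetas[k] - k - 1.0 - 0.5 * math.log(k)

for k in (10, 1000, 100000):
    print(k, thetas[k], zeta_estimate(k))
```

According to (1), the printed estimates should settle as $k$ grows; the closed-form update in `theta_sequence` is the positive root of the quadratic $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$.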
Computer-assisted derivation and analysis of OGM. OGM was originally obtained through a computer-assisted methodology based on the performance estimation problem (PEP); it was first discovered numerically [22] and then its analytical form and convergence analysis were found [26]. The PEP methodology’s key insight is to optimize over the class of fixed-step first-order gradient methods, with the objective being the convergence guarantee. Surprisingly, this problem is semidefinite programming- (SDP-) representable and has a tightness guarantee [54]. OGM was re-discovered by using the PEP to find a greedy first-order method simplified with a “subspace-search elimination procedure” [21].
However, these prior analyses of OGM, generated by computers, are verifiable but arguably not understandable by humans. Moreover, as the analyses rely on finding analytical solutions to the SDPs arising from the PEP, they are inaccessible to those unfamiliar with the methodology.
Lyapunov analysis of AGM. Nesterov’s original 1983 paper established the celebrated $O(1/k^2)$ rate using a Lyapunov analysis [38]. Subsequent works [11,12,32,39–41,43,55] analyzed AGM and its variants through the “estimate sequence” technique, which many consider to be less transparent than Lyapunov analyses. In recent years, there has been a renewed interest in studying accelerated methods via Lyapunov analyses [1,7–9,13,16,50,52]. In this work, we present the first Lyapunov analysis of OGM.
Linear coupling analysis of AGM. The interpretation of AGM as a linear coupling between gradient descent and mirror descent was presented in [5]. Specifically, AGM can be written as
$$\begin{aligned}
y_{k+1} &= \operatorname*{argmin}_{y} \left\{ \langle \nabla f(x_k), y - x_k \rangle + \frac{L}{2}\|y - x_k\|^2 \right\} \\
z_{k+1} &= \operatorname*{argmin}_{y} \left\{ V_{z_k}(y) + \langle \alpha_{k+1}\nabla f(x_k), y - x_k \rangle \right\} \\
x_{k+1} &= (1 - \tau_{k+1}) y_{k+1} + \tau_{k+1} z_{k+1},
\end{aligned}$$
where $V_z$ is a Bregman divergence. The $y_k$-update can be viewed as a gradient descent update and the $z_k$-update can be viewed as a mirror descent update. Mirror descent [37] was originally presented as a method that maps the current point to a dual space, performs a gradient update, and maps the point back to the primal space. An alternate proximal form of mirror descent (which we use) was presented in [15]. An alternate “dual averaging” interpretation of mirror descent as a method that constructs a lower bound of the function was presented in [42]. The key insight of linear coupling is to carefully interpolate between mirror descent and gradient descent to obtain AGM.
Linear coupling has been used to obtain and analyze many extensions of AGM [2–4,6], but whether the linear coupling argument itself can be further refined seems not to have been studied. In this work, we show that refining the linear coupling analysis naturally leads to OGM.
Tight inequalities. We informally say an inequality is tight if it cannot be improved without further assumptions and formally if it satisfies the “interpolation conditions” [54]. The recent literature on the performance estimation problem focuses on using tight inequalities to obtain proofs that provably cannot be improved [17,24,25,33,46,52,53].
The tight inequality we use is
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|_*^2$$
for all $L$-smooth convex functions $f$ and $x, y \in \mathbb{R}^n$. The linear coupling analysis of AGM uses strictly weaker inequalities, as discussed in Section 4. By refining the analysis by replacing the non-tight inequalities with tight ones, we obtain OGM.
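To make the tight inequality concrete, here is a small numerical sketch on a one-dimensional example. The choice of the softplus function $f(x) = \log(1 + e^x)$ (convex and $L$-smooth with $L = 1/4$, Euclidean norm), the grid, and the helper names are assumptions made for illustration only.

```python
import math

L = 0.25  # smoothness constant of softplus, since f''(x) <= 1/4

def f(x):
    """Softplus log(1 + exp(x)), evaluated in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def df(x):
    """Derivative of softplus, i.e., the logistic sigmoid."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def tight_gap(x, y):
    """Slack in f(y) >= f(x) + f'(x)(y - x) + |f'(x) - f'(y)|^2 / (2L)."""
    return f(y) - f(x) - df(x) * (y - x) - (df(x) - df(y)) ** 2 / (2.0 * L)

def convexity_gap(x, y):
    """Slack in the weaker first-order convexity inequality f(y) >= f(x) + f'(x)(y - x)."""
    return f(y) - f(x) - df(x) * (y - x)

grid = [-5.0 + 0.5 * i for i in range(21)]
min_tight_gap = min(tight_gap(x, y) for x in grid for y in grid)
print(min_tight_gap, tight_gap(0.0, 0.5), convexity_gap(0.0, 0.5))
```

The slack of the tight inequality is never larger than the slack of the plain convexity inequality, which is the sense in which the former carries more information.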
Accelerated methods for smooth strongly convex minimization. For the problem setup of minimizing smooth strongly convex functions, Nesterov’s SC-AGM [39] achieves the convergence rate $O(\exp(-k/\sqrt{\kappa}))$. Recently, the triple momentum method (TMM) [31] and the information-theoretic exact method [51] were presented with an improved $O(\exp(-2k/\sqrt{\kappa}))$-rate, and their optimality was established through the matching $\Theta(\exp(-2k/\sqrt{\kappa}))$-lower bound of [20], which improves upon the classical $\Theta(\exp(-4k/\sqrt{\kappa}))$-lower bound of [35,36]. The SC-OGM method we present in this work has a rate of $O(\exp(-\sqrt{2}k/\sqrt{\kappa}))$, between the rates of SC-AGM and TMM. For strongly convex quadratic functions, the heavy ball method exhibits the rate $O(\exp(-4k/\sqrt{\kappa}))$ [39] and OGM-q exhibits the rate $O(\exp(-2\sqrt{2}k/\sqrt{\kappa}))$ [28]. The heavy ball method’s rate matches the classical $\Theta(\exp(-4k/\sqrt{\kappa}))$-lower bound of [35,36].
2 Lyapunov analysis of OGM
In this section, we present a Lyapunov analysis of OGM. Our key insight is to use
$$f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2,$$
which is nonnegative due to $L$-smoothness, instead of $(f(x_k) - f_\star)$ or $(f(y_k) - f_\star)$ in the construction of the Lyapunov function. Throughout this section, $\|\cdot\| = \|\cdot\|_*$ denotes the Euclidean norm.
Based on this insight, we present: (i) a more human-understandable analysis of OGM and (ii) a unified analysis of both the primary and secondary sequences of OGM that admits simpler $\theta_k$-choices.
2.1 Nesterov’s AGM
Nesterov’s AGM has the rate
$$f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\theta_{k-1}^2} = \frac{2L\|x_0 - x_\star\|^2}{(k + \zeta)^2} - \frac{2L\|x_0 - x_\star\|^2 \log k}{(k + \zeta)^3} + o\left(\frac{1}{k^3}\right)$$
for $k = 0, 1, \dots$. (We derived the equality in Appendix E.) This rate can be established through the following Lyapunov analysis [38]: for $k = 0, 1, \dots$, define
$$U_k = \theta_{k-1}^2 (f(y_k) - f_\star) + \frac{L}{2}\|z_k - x_\star\|^2$$
with $\theta_{-1} = 0$ and show $U_k \le \dots \le U_0$. Conclude with
$$\theta_{k-1}^2 (f(y_k) - f_\star) \le U_k \le U_0 = \frac{L}{2}\|x_0 - x_\star\|^2.$$
2.2 Primary scequence analysis of OGM
We now analyze OGM’s convergence through an analogous Lyapunov analysis.
Theorem 1 Assume (A1) and (A2). Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ for $k = 0, 1, \dots$. OGM’s $y_k$-sequence exhibits the rate
$$f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{4\theta_{k-1}^2}$$
for $k = 1, 2, \dots$.
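Proof Set $\theta_{-1} = 0$ and $x_{-1} = x_0$. For $k = -1, 0, 1, \dots$, define
$$U_k = 2\theta_k^2\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) + \frac{L}{2}\|z_{k+1} - x_\star\|^2.$$
We can show that $\{U_k\}_{k=-1}^\infty$ is nonincreasing. Using $f(y_k) \le f(x_{k-1}) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2$, which follows from $L$-smoothness, we conclude the rate with
$$2\theta_{k-1}^2 (f(y_k) - f_\star) \le 2\theta_{k-1}^2\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) \le U_{k-1} \le U_{-1} = \frac{L}{2}\|z_0 - x_\star\|^2$$
for $k = 1, 2, \dots$. Now we complete the proof by showing that $\{U_k\}_{k=-1}^\infty$ is nonincreasing. For $k = -1, 0, 1, \dots$, we have
$$\begin{aligned}
U_k - U_{k+1}
&= 2\theta_k^2\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - 2\theta_{k+1}^2\left(f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad + \frac{L}{2}\|z_{k+1} - x_\star\|^2 - \frac{L}{2}\|z_{k+2} - x_\star\|^2 \\
&= 2\theta_k^2\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - 2\theta_{k+1}^2\left(f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle - \frac{2}{L}\theta_{k+1}^2\|\nabla f(x_{k+1})\|^2 \\
&= 2\theta_k^2\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - 2\theta_{k+1}^2\left(f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&\ge 2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - 2\theta_{k+1}^2\left(f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 - f(x_{k+1}) + f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad - 2\theta_{k+1}\left(f(x_{k+1}) - f_\star + \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) - \langle 2\theta_{k+1}\nabla f(x_{k+1}), x_\star - z_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad + 2\theta_{k+1}\left(f_\star - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle\right) \\
&\quad + 2\theta_{k+1}\langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle \\
&\ge 2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad + 2\theta_{k+1}\langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle,
\end{aligned}$$
where the inequalities follow from the cocoercivity of $f$.
Consider two separate cases $k = -1$ and $k = 0, 1, \dots$. In the case $k = -1$, $\theta_{k+1}^2 - \theta_{k+1} = 1 - 1 = 0$ and $z_{k+1} - x_{k+1} = z_0 - x_0 = 0$. The last formula becomes zero, so $U_{-1} - U_0 \ge 0$. For $k = 0, 1, \dots$,
$$\begin{aligned}
&2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) + 2\theta_{k+1}\langle \nabla f(x_{k+1}), z_{k+1} - x_{k+1} \rangle \\
&= 2(\theta_{k+1}^2 - \theta_{k+1})\left(f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k)\|^2 - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&\quad + 2\theta_{k+1}(\theta_{k+1} - 1)\left\langle \nabla f(x_{k+1}), x_{k+1} - x_k + \frac{1}{L}\nabla f(x_k) \right\rangle \\
&= (2\theta_{k+1}^2 - 2\theta_{k+1})\left(f(x_k) - f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_k) - \nabla f(x_{k+1})\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_k \rangle\right) \ge 0,
\end{aligned}$$
where the inequalities follow from the cocoercivity of $f$. ⊔⊓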
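The monotonicity of the Lyapunov function can also be observed numerically. The following sketch, under the assumption of a small diagonal quadratic test problem with minimizer $x_\star = 0$ and $f_\star = 0$, evaluates $U_{-1}, U_0, U_1, \dots$ along the OGM iterates in the z-sequence form; the function name `ogm_lyapunov` and all problem data are illustrative.

```python
import math

EIGS = [1.0, 0.3, 0.05]  # assumed test problem: f(x) = 0.5 * sum(e_i * x_i^2)
L = max(EIGS)            # smoothness constant; x_star = 0 and f_star = 0

def f(x):
    return 0.5 * sum(e * xi * xi for e, xi in zip(EIGS, x))

def grad(x):
    return [e * xi for e, xi in zip(EIGS, x)]

def sq_norm(v):
    return sum(vi * vi for vi in v)

def ogm_lyapunov(x0, n):
    """Return [U_{-1}, U_0, ..., U_{n-1}] along the OGM iterates."""
    x, z = list(x0), list(x0)        # z_0 = x_0
    theta = 1.0                      # theta_0
    values = [0.5 * L * sq_norm(z)]  # U_{-1}, because theta_{-1} = 0
    for _ in range(n):
        g = grad(x)
        y_next = [xi - gi / L for xi, gi in zip(x, g)]
        z = [zi - 2.0 * theta * gi / L for zi, gi in zip(z, g)]  # z_{k+1}
        values.append(2.0 * theta ** 2 * (f(x) - sq_norm(g) / (2.0 * L))
                      + 0.5 * L * sq_norm(z))                    # U_k
        theta = (1.0 + math.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0  # theta_{k+1}
        x = [(1.0 - 1.0 / theta) * yn + zn / theta
             for yn, zn in zip(y_next, z)]                       # x_{k+1}
    return values

U = ogm_lyapunov([1.0, -2.0, 3.0], 50)
print(U[:5])
```

Each entry combines the shifted function-value term with the distance of the z-iterate to the minimizer, mirroring the definition of $U_k$ used in the proof.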
As with AGM, the optimal $\{\theta_k\}_{k=0}^\infty$ is given by $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$, which was used in the original presentation of OGM [22,26].
Corollary 1 Under the setup of Theorem 1, the choice $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ leads to the rate
$$f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{4\theta_{k-1}^2} = \frac{L\|x_0 - x_\star\|^2}{(k + \zeta)^2} - \frac{L\|x_0 - x_\star\|^2 \log k}{(k + \zeta)^3} + o\left(\frac{1}{k^3}\right)$$
for $k = 1, 2, \dots$.
Proof This follows from Theorem 1 and (1). ⊔⊓
The relaxed parameter requirement $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ of Theorem 1 is reminiscent of the requirement for AGM. We note that [30] had presented a generalized analysis with requirement $\theta_{k+1}^2 \le \sum_{i=1}^{k+1} \theta_i$ based on the performance estimation problem methodology.
The relaxed parameter requirement allows us to use the simpler rational coefficients $\theta_k = (k + 2)/2$. This leads to
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{k}{k + 3}(y_{k+1} - y_k) + \frac{k + 2}{k + 3}(y_{k+1} - x_k),
\end{aligned}$$
which we call Simple-OGM.
Corollary 2 Assume (A1) and (A2). Simple-OGM’s $y_k$-sequence exhibits the rate
$$f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{(k + 1)^2}$$
for $k = 1, 2, \dots$.
Proof This follows from Theorem 1. ⊔⊓
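A minimal sketch of Simple-OGM follows, together with a numerical comparison against the bound of Corollary 2. The diagonal quadratic test problem (with $x_\star = 0$ and $f_\star = 0$), the starting point, the iteration count, and the function name `simple_ogm` are assumptions chosen for illustration.

```python
EIGS = [1.0, 0.1, 0.01, 0.001]  # assumed test problem: f(x) = 0.5 * sum(e_i * x_i^2)
L = max(EIGS)                   # smoothness constant; x_star = 0 and f_star = 0

def f(x):
    return 0.5 * sum(e * xi * xi for e, xi in zip(EIGS, x))

def grad(x):
    return [e * xi for e, xi in zip(EIGS, x)]

def simple_ogm(x0, n):
    """Run Simple-OGM (theta_k = (k + 2) / 2) and return [y_1, ..., y_n]."""
    x, y = list(x0), list(x0)  # y_0 = x_0
    ys = []
    for k in range(n):
        g = grad(x)
        y_next = [xi - gi / L for xi, gi in zip(x, g)]
        a = k / (k + 3)        # momentum coefficient (theta_k - 1) / theta_{k+1}
        b = (k + 2) / (k + 3)  # correction coefficient theta_k / theta_{k+1}
        x = [yn + a * (yn - yo) + b * (yn - xo)
             for yn, yo, xo in zip(y_next, y, x)]
        y = y_next
        ys.append(y)
    return ys

x0 = [1.0, 1.0, 1.0, 1.0]
ys = simple_ogm(x0, 200)
r2 = sum(xi * xi for xi in x0)  # ||x_0 - x_star||^2 with x_star = 0
bound_holds = all(f(y) <= L * r2 / (k + 1) ** 2 + 1e-12
                  for k, y in enumerate(ys, start=1))
print(bound_holds, f(ys[-1]), L * r2 / (200 + 1) ** 2)
```

The coefficients need only integer arithmetic in $k$, which is the practical appeal of the rational choice $\theta_k = (k+2)/2$ over the square-root recursion.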
2.3 Secondary sequence analysis of OGM
We now analyze the convergence of OGM’s secondary sequence with last-step modification through a unified Lyapunov analysis.
Theorem 2 Assume (A1) and (A2). Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $0 \le \theta_{k+1}^2 - \theta_{k+1} \le \theta_k^2$ for $k = 0, 1, \dots$. Let the positive sequence $\{\varphi_k\}_{k=0}^\infty$ satisfy $0 \le \varphi_k^2 - \varphi_k \le 2\theta_{k-1}^2$ for $k = 0, 1, \dots$, where we define $\theta_{-1} = 0$. OGM’s $\tilde{x}_k$-sequence, the secondary sequence with last-step modification, exhibits the rate
$$f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\varphi_k^2}$$
for $k = 0, 1, \dots$.
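Proof Let $\{U_k\}_{k=-1}^\infty$ be as defined in the proof of Theorem 1. Define $\{\tilde{U}_k\}_{k=0}^\infty$ as
$$\tilde{U}_k = \varphi_k^2 (f(\tilde{x}_k) - f_\star) + \frac{L}{2}\left\|z_k - \frac{1}{L}\varphi_k\nabla f(\tilde{x}_k) - x_\star\right\|^2.$$
Once we show that $\tilde{U}_k \le U_{k-1}$, we conclude the rate with
$$\varphi_k^2 (f(\tilde{x}_k) - f_\star) \le \tilde{U}_k \le U_{-1} = \frac{L}{2}\|x_0 - x_\star\|^2$$
for $k = 0, 1, \dots$. Now we complete the proof by showing that $\tilde{U}_k \le U_{k-1}$. For $k = 0, 1, \dots$, we have
$$\begin{aligned}
U_{k-1} - \tilde{U}_k
&= 2\theta_{k-1}^2\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \varphi_k^2 (f(\tilde{x}_k) - f_\star) \\
&\quad + \frac{L}{2}\|z_k - x_\star\|^2 - \frac{L}{2}\left\|z_k - \frac{1}{L}\varphi_k\nabla f(\tilde{x}_k) - x_\star\right\|^2 \\
&= 2\theta_{k-1}^2\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \varphi_k^2 (f(\tilde{x}_k) - f_\star) \\
&\quad - \langle \varphi_k\nabla f(\tilde{x}_k), x_\star - z_k \rangle - \frac{1}{2L}\varphi_k^2\|\nabla f(\tilde{x}_k)\|^2 \\
&= 2\theta_{k-1}^2\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) \\
&\quad - \varphi_k^2\left(f(\tilde{x}_k) - f_\star + \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2\right) - \langle \varphi_k\nabla f(\tilde{x}_k), x_\star - z_k \rangle \\
&\ge (\varphi_k^2 - \varphi_k)\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) \\
&\quad - \varphi_k^2\left(f(\tilde{x}_k) - f_\star + \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2\right) - \langle \varphi_k\nabla f(\tilde{x}_k), x_\star - z_k \rangle \\
&= (\varphi_k^2 - \varphi_k)\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - f(\tilde{x}_k) + f_\star - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2\right) \\
&\quad + \varphi_k\left(f_\star - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2 + \langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_\star \rangle\right) \\
&\quad + \langle \varphi_k\nabla f(\tilde{x}_k), z_k - \tilde{x}_k \rangle \\
&\ge (\varphi_k^2 - \varphi_k)\left(f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2\right) \\
&\quad + \langle \varphi_k\nabla f(\tilde{x}_k), z_k - \tilde{x}_k \rangle \\
&= (\varphi_k^2 - \varphi_k)\left(f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2 - \frac{1}{2L}\|\nabla f(\tilde{x}_k)\|^2\right) \\
&\quad + \varphi_k(\varphi_k - 1)\left\langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_{k-1} + \frac{1}{L}\nabla f(x_{k-1}) \right\rangle \\
&= (\varphi_k^2 - \varphi_k)\left(f(x_{k-1}) - f(\tilde{x}_k) - \frac{1}{2L}\|\nabla f(x_{k-1}) - \nabla f(\tilde{x}_k)\|^2 + \langle \nabla f(\tilde{x}_k), \tilde{x}_k - x_{k-1} \rangle\right) \\
&\ge 0,
\end{aligned}$$
where the inequalities follow from the cocoercivity of $f$. ⊔⊓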
Corollary 3 Under the setup of Theorem 2, the choice $\theta_{k+1}^2 - \theta_{k+1} = \theta_k^2$ and $\varphi_k^2 - \varphi_k = 2\theta_{k-1}^2$ leads to the rate
$$f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\varphi_k^2} = \frac{L\|x_0 - x_\star\|^2}{(k + \zeta + 1/\sqrt{2})^2} - \frac{L\|x_0 - x_\star\|^2 \log k}{(k + \zeta + 1/\sqrt{2})^3} + o\left(\frac{1}{k^3}\right)$$
for $k = 0, 1, \dots$.
Proof This follows from (1), which implies $\varphi_k = \frac{k + \zeta + 1/\sqrt{2}}{\sqrt{2}} + \frac{\sqrt{2}\log k}{4} + o(1)$, and Theorem 2. ⊔⊓
Simple-OGM with the last-step modification is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{k}{k + 3}(y_{k+1} - y_k) + \frac{k + 2}{k + 3}(y_{k+1} - x_k) \\
\tilde{x}_{k+1} &= y_{k+1} + \frac{k}{\sqrt{2}(k + 2) + 1}(y_{k+1} - y_k) + \frac{k + 2}{\sqrt{2}(k + 2) + 1}(y_{k+1} - x_k),
\end{aligned}$$
where $x_0 = y_0$.
Corollary 4 Assume (A1) and (A2). Simple-OGM’s $\tilde{x}_k$-sequence, the secondary sequence with last-step modification, exhibits the rate
$$f(\tilde{x}_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{(k + 1 + 1/\sqrt{2})^2}$$
for $k = 0, 1, \dots$.
Proof Use Corollary 3 with $\theta_k = \frac{k+2}{2}$ and $\varphi_k = \frac{k + 1 + 1/\sqrt{2}}{\sqrt{2}}$. ⊔⊓
2.4 Discussion
We clarify that the presented Lyapunov analysis is a novel contribution, while the results themselves are mostly known [26,27,30].
We emphasize two key points. First is the somewhat unusual construction of the Lyapunov function. This key insight will be used in the following section to present a novel method for the strongly convex setup.
The second point we emphasize is that we present a unified analysis of the primary and last-step-modified secondary sequences using the Lyapunov functions $U_k$ and $\tilde{U}_k$. Prior works on the two sequences of AGM and OGM rely on two separate analyses [26,27].
3 Strongly convex OGM
In this section, we present strongly convex OGM (SC-OGM), a novel method that provides a factor-$\sqrt{2}$ improvement over Nesterov’s SC-AGM. The method and its analysis are obtained by following the key insight of Section 2: use the OGM-type correction term in the method and use
$$f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2$$
in the construction of the Lyapunov function. Throughout this section, $\|\cdot\| = \|\cdot\|_*$ denotes the Euclidean norm.
Based on this insight, we present: (i) a novel method SC-OGM and (ii) a unified analysis of both the primary and secondary sequences of SC-OGM.
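3.1 Nesterov’s SC-AGM
Further assume $f$ is $\mu$-strongly convex and write $\kappa = L/\mu$. SC-AGM’s convergence rate
$$f(y_k) - f_\star \le \left(1 + \frac{1}{\sqrt{\kappa} - 1}\right)^{-k} \frac{\mu + L}{2}\|x_0 - x_\star\|^2 = O\left(\exp\left(-\frac{k}{\sqrt{\kappa}}\right)\right)$$
can be established through the following Lyapunov analysis [13]. For $k = 0, 1, \dots$, define
$$U_k = \left(1 + \frac{1}{\sqrt{\kappa} - 1}\right)^{k}\left(f(y_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2\right)$$
with $z_k = (\sqrt{\kappa} + 1)x_k - \sqrt{\kappa}y_k$ and show $U_k \le \dots \le U_0 \le \frac{\mu + L}{2}\|x_0 - x_\star\|^2$.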
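3.2 Primary-sequence analysis of SC-OGM
We newly propose SC-OGM:
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\
x_{k+1} &= y_{k+1} + \frac{1}{2\gamma + 1}(y_{k+1} - y_k) + \frac{1}{2\gamma + 1}(y_{k+1} - x_k)
\end{aligned}$$
for $k = 0, 1, \dots$, where $y_0 = x_0$ and $\gamma = \frac{\sqrt{8\kappa + 1} + 3}{2\kappa - 2}$.
Theorem 3 Assume (A1), (A2), and that $f$ is $\mu$-strongly convex. SC-OGM’s $y_k$-sequence exhibits the rate
$$f(y_k) - f_\star \le (1 + \gamma)^{-k+1}\,\frac{\mu + 2L}{2}\|x_0 - x_\star\|^2 = O\left(\exp\left(-\frac{\sqrt{2}k}{\sqrt{\kappa}}\right)\right)$$
for $k = 1, 2, \dots$.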
Proof For $k = 0, 1, \dots$, define
$$z_k = \frac{2\gamma + 1}{\gamma}x_k - \frac{\gamma + 1}{\gamma}y_k$$
and
$$U_k = (1 + \gamma)^k\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2\right).$$
We can show that $\{U_k\}_{k=0}^\infty$ is nonincreasing and $U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$. Using $f(y_k) \le f(x_{k-1}) - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2$, which follows from $L$-smoothness, we conclude the rate with
$$(1 + \gamma)^{k-1}(f(y_k) - f_\star) \le (1 + \gamma)^{k-1}\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) \le U_{k-1} \le U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$$
for $k = 1, 2, \dots$. Now we complete the proof by showing $U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$, showing some relationships between $x_k$ and $z_k$, and showing that $\{U_k\}_{k=0}^\infty$ is nonincreasing.
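Firstly, we have
$$\begin{aligned}
U_0 &= f(x_0) - f_\star - \frac{1}{2L}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\|z_1 - x_\star\|^2 \\
&= f(x_0) - f_\star - \frac{1}{2L}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\left\|x_0 - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_0) - x_\star\right\|^2 \\
&= f(x_0) - f_\star + \frac{1}{2L}\frac{1}{\gamma + 1}\|\nabla f(x_0)\|^2 - \frac{\gamma}{1 + \gamma}\langle \nabla f(x_0), x_0 - x_\star \rangle + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \frac{1}{\gamma + 1}(f(x_0) - f_\star) + \frac{1}{2L}\frac{1}{1 + \gamma}\|\nabla f(x_0)\|^2 + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \frac{2}{1 + \gamma}(f(x_0) - f_\star) + \frac{\mu}{2}\|x_0 - x_\star\|^2 \\
&\le \left(L + \frac{\mu}{2}\right)\|x_0 - x_\star\|^2.
\end{aligned}$$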
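Second, let $X_k = x_k - x_\star$ and $Z_k = z_k - x_\star$ for $k = 0, 1, \dots$. We will prove
$$(x_{k+1} - x_k) + \frac{1}{L}\nabla f(x_k) + \gamma X_{k+1} = \frac{1}{1 + \gamma}(\gamma Z_{k+1} + \gamma^2 X_{k+1}) \qquad (2)$$
$$Z_{k+1} = \frac{1}{\gamma + 1}Z_k + \frac{\gamma}{\gamma + 1}X_k - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_k) \qquad (3)$$
for $k = 0, 1, \dots$.
Plug $y_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$ into the definition of $z_{k+1}$. (We remind the reader that $z_k$ was defined in the beginning of the proof.) Then we obtain (2).
For (3), from the definition of $z_k$ and $z_{k+1}$
$$\begin{aligned}
z_{k+1} &= \frac{2\gamma + 1}{\gamma}x_{k+1} - \frac{\gamma + 1}{\gamma}x_k + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) \\
z_k &= \frac{2\gamma + 1}{\gamma}x_k - \frac{\gamma + 1}{\gamma}x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_{k-1})
\end{aligned}$$
and the definition of $x_k$, we have
$$\begin{aligned}
x_{k+1} &= \frac{2\gamma + 2}{2\gamma + 1}y_{k+1} - \frac{1}{2\gamma + 1}y_k - \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_k) \\
&= \frac{2\gamma + 2}{2\gamma + 1}x_k - \frac{1}{2\gamma + 1}x_{k-1} - \frac{1}{L}\frac{2\gamma + 3}{2\gamma + 1}\nabla f(x_k) + \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_{k-1}).
\end{aligned}$$
Therefore,
$$\begin{aligned}
z_{k+1} - \frac{1}{\gamma + 1}z_k
&= \frac{2\gamma + 1}{\gamma}x_{k+1} - \frac{\gamma + 1}{\gamma}x_k + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) \\
&\quad - \frac{1}{\gamma + 1}\left(\frac{2\gamma + 1}{\gamma}x_k - \frac{\gamma + 1}{\gamma}x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_{k-1})\right) \\
&= \frac{2\gamma + 1}{\gamma}x_{k+1} - \frac{\gamma^2 + 4\gamma + 2}{\gamma(\gamma + 1)}x_k + \frac{1}{\gamma}x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) - \frac{1}{L}\frac{1}{\gamma}\nabla f(x_{k-1}) \\
&= \frac{2\gamma + 1}{\gamma}\left(\frac{2\gamma + 2}{2\gamma + 1}x_k - \frac{1}{2\gamma + 1}x_{k-1} - \frac{1}{L}\frac{2\gamma + 3}{2\gamma + 1}\nabla f(x_k) + \frac{1}{L}\frac{1}{2\gamma + 1}\nabla f(x_{k-1})\right) \\
&\quad - \frac{\gamma^2 + 4\gamma + 2}{\gamma(\gamma + 1)}x_k + \frac{1}{\gamma}x_{k-1} + \frac{1}{L}\frac{1 + \gamma}{\gamma}\nabla f(x_k) - \frac{1}{L}\frac{1}{\gamma}\nabla f(x_{k-1}) \\
&= \frac{\gamma}{\gamma + 1}x_k - \frac{1}{L}\frac{\gamma + 2}{\gamma}\nabla f(x_k),
\end{aligned}$$
so we obtain (3).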
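Lastly, we will show that $\{U_k\}_{k=0}^\infty$ is nonincreasing. It suffices to show that for $k = 0, 1, \dots$,
$$(1 + \gamma)^{-k}(U_k - U_{k+1}) \ge 0,$$
which is equivalent to showing
$$\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - (1 + \gamma)\left(f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) + \frac{\mu}{2}\left(\|z_{k+1} - x_\star\|^2 - (1 + \gamma)\|z_{k+2} - x_\star\|^2\right) \ge 0.$$
By $L$-smoothness of $f$, we have
$$f(x_{k+1}) - f(x_k) \le -\frac{1}{2L}\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_{k+1}), x_{k+1} - x_k \rangle$$
and from strong convexity,
$$f(x_{k+1}) - f_\star \le \langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle - \frac{\mu}{2}\|x_{k+1} - x_\star\|^2.$$
For $k = 0, 1, \dots$, using the above two inequalities, (2), and (3),
$$\begin{aligned}
&\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) - (1 + \gamma)\left(f(x_{k+1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k+1})\|^2\right) \\
&= (f(x_k) - f(x_{k+1})) - \gamma(f(x_{k+1}) - f_\star) + \frac{1 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 - \frac{1}{2L}\|\nabla f(x_k)\|^2 \\
&\ge \left(\frac{1}{2L}\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_{k+1}), x_k - x_{k+1} \rangle\right) \\
&\quad - \gamma\left(\langle \nabla f(x_{k+1}), x_{k+1} - x_\star \rangle - \frac{\mu}{2}\|x_{k+1} - x_\star\|^2\right) + \frac{1 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 - \frac{1}{2L}\|\nabla f(x_k)\|^2 \\
&= \left\langle \nabla f(x_{k+1}), -\frac{1}{L}\nabla f(x_k) - x_{k+1} + x_k - \gamma(x_{k+1} - x_\star) \right\rangle + \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|x_{k+1} - x_\star\|^2 \\
&= \left\langle \nabla f(x_{k+1}), -\frac{1}{1 + \gamma}(\gamma Z_{k+1} + \gamma^2 X_{k+1}) \right\rangle + \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|x_{k+1} - x_\star\|^2.
\end{aligned}$$
In addition,
$$\begin{aligned}
&\frac{\mu}{2}\left((1 + \gamma)\|Z_{k+2}\|^2 - \|Z_{k+1}\|^2\right) \\
&= \frac{\mu}{2}\left((1 + \gamma)\left\|\frac{1}{1 + \gamma}Z_{k+1} + \frac{\gamma}{1 + \gamma}X_{k+1} - \frac{1}{L}\frac{2 + \gamma}{\gamma}\nabla f(x_{k+1})\right\|^2 - \|Z_{k+1}\|^2\right) \\
&= \frac{\mu}{2}\Bigg(-\frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + \frac{\gamma^2}{1 + \gamma}\|X_{k+1}\|^2 + (1 + \gamma)\frac{1}{L^2}\frac{(2 + \gamma)^2}{\gamma^2}\|\nabla f(x_{k+1})\|^2 \\
&\qquad + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle - 2\frac{2 + \gamma}{L\gamma}\langle \nabla f(x_{k+1}), Z_{k+1} \rangle - 2\frac{2 + \gamma}{L}\langle \nabla f(x_{k+1}), X_{k+1} \rangle\Bigg).
\end{aligned}$$
Since
$$\mu\,\frac{2 + \gamma}{L\gamma^2} = \frac{1}{1 + \gamma},$$
the inner products involving $\nabla f(x_{k+1})$ cancel in $U_k - U_{k+1}$.
For $k = 0, 1, \dots$, we have
$$\begin{aligned}
(1 + \gamma)^{-k}(U_k - U_{k+1})
&\ge \frac{2 + \gamma}{2L}\|\nabla f(x_{k+1})\|^2 + \frac{\mu\gamma}{2}\|X_{k+1}\|^2 \\
&\quad - \frac{\mu}{2}\left(-\frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + \frac{\gamma^2}{1 + \gamma}\|X_{k+1}\|^2 + (1 + \gamma)\frac{1}{L^2}\frac{(2 + \gamma)^2}{\gamma^2}\|\nabla f(x_{k+1})\|^2 + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle\right) \\
&= -\frac{\mu}{2}\left(-\frac{\gamma}{1 + \gamma}\|X_{k+1}\|^2 - \frac{\gamma}{1 + \gamma}\|Z_{k+1}\|^2 + 2\frac{\gamma}{1 + \gamma}\langle Z_{k+1}, X_{k+1} \rangle\right) \\
&= \frac{\mu}{2}\frac{\gamma}{1 + \gamma}\|Z_{k+1} - X_{k+1}\|^2 \ge 0.
\end{aligned}$$
⊔⊓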
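The following is a minimal sketch of SC-OGM on a strongly convex diagonal quadratic, with a numerical comparison against the bound of Theorem 3. The values of `MU`, `L`, and `EIGS`, the starting point, the iteration count, and the function name `sc_ogm` are illustrative assumptions; the minimizer is $x_\star = 0$ with $f_\star = 0$.

```python
import math

MU, L = 0.01, 1.0        # assumed strong convexity and smoothness constants
EIGS = [1.0, 0.2, 0.01]  # f(x) = 0.5 * sum(e_i * x_i^2) with eigenvalues in [MU, L]
KAPPA = L / MU
GAMMA = (math.sqrt(8.0 * KAPPA + 1.0) + 3.0) / (2.0 * KAPPA - 2.0)

def f(x):
    return 0.5 * sum(e * xi * xi for e, xi in zip(EIGS, x))

def grad(x):
    return [e * xi for e, xi in zip(EIGS, x)]

def sc_ogm(x0, n):
    """Run SC-OGM and return the primary iterates [y_1, ..., y_n]."""
    x, y = list(x0), list(x0)      # y_0 = x_0
    c = 1.0 / (2.0 * GAMMA + 1.0)  # shared momentum and correction coefficient
    ys = []
    for _ in range(n):
        g = grad(x)
        y_next = [xi - gi / L for xi, gi in zip(x, g)]
        x = [yn + c * (yn - yo) + c * (yn - xo)
             for yn, yo, xo in zip(y_next, y, x)]
        y = y_next
        ys.append(y)
    return ys

x0 = [1.0, -2.0, 3.0]
ys = sc_ogm(x0, 300)
r2 = sum(xi * xi for xi in x0)  # ||x_0 - x_star||^2 with x_star = 0

def bound(k):
    """Right-hand side of Theorem 3 at iteration k."""
    return (1.0 + GAMMA) ** (-k + 1) * 0.5 * (MU + 2.0 * L) * r2

bound_holds = all(f(y) <= bound(k) * (1.0 + 1e-9)
                  for k, y in enumerate(ys, start=1))
print(bound_holds, f(ys[-1]), bound(300))
```

Unlike OGM, the coefficients here are constant in $k$, so the method needs only $\kappa$ (through $\gamma$) and $L$ as inputs.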
3.3 Secondary sequence analysis
We now analyze the convergence of SC-OGM’s secondary sequence with a unified Lyapunov analysis. We note that SC-OGM does not require the last-step modification, unlike the non-strongly convex counterpart.
Theorem 4 Assume (A1), (A2), and that $f$ is $\mu$-strongly convex. SC-OGM’s $x_k$-sequence, the secondary sequence without last-step modification, exhibits the rate
$$f(x_k) - f_\star \le \frac{(1 + \gamma)^{-k+2}}{2\gamma}\left(\frac{\mu + 2L}{2}\right)\|x_0 - x_\star\|^2$$
for $k = 1, 2, \dots$.
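Proof Let $\{z_k\}_{k=0}^\infty$ and $\{U_k\}_{k=0}^\infty$ be defined as in the proof of Theorem 3. For $k = 0, 1, \dots$, define
$$\tilde{U}_k = (1 + \gamma)^{k-1}\left(\frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) + \frac{\mu}{2}\left\|z_k - \left(\frac{\gamma + 2}{\gamma}\right)\frac{1}{L}\nabla f(x_k) - x_\star\right\|^2\right).$$
We can show that $\tilde{U}_k \le U_{k-1}$. We conclude the rate with
$$(1 + \gamma)^{k-1}\frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \le \tilde{U}_k \le U_0 \le \frac{\mu + 2L}{2}\|x_0 - x_\star\|^2$$
for $k = 1, 2, \dots$. Now we complete the proof by showing that $\tilde{U}_k \le U_{k-1}$. Note that $\left(\frac{\gamma + 1}{\gamma}\right)\left((x_k - x_{k-1}) + \frac{1}{L}\nabla f(x_{k-1})\right) = (Z_k - X_k)$. Then we have
$$\begin{aligned}
&\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)}\|z_k - x_\star\|^2 - \frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)}\left\|z_k - \left(\frac{\gamma + 2}{\gamma}\right)\frac{1}{L}\nabla f(x_k) - x_\star\right\|^2 \\
&= \left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{\gamma}{1 + \gamma}\langle Z_k, \nabla f(x_k) \rangle - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \frac{\gamma}{1 + \gamma}\left\langle \left(\frac{\gamma + 1}{\gamma}\right)\left((x_k - x_{k-1}) + \frac{1}{L}\nabla f(x_{k-1})\right) + X_k, \nabla f(x_k) \right\rangle - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) - \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) \\
&\quad + \langle x_k - x_{k-1}, \nabla f(x_k) \rangle + \frac{1}{L}\langle \nabla f(x_{k-1}), \nabla f(x_k) \rangle + \frac{\gamma}{1 + \gamma}\langle X_k, \nabla f(x_k) \rangle - \frac{1}{2L}\frac{2 + \gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 \\
&= \left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1}) - \nabla f(x_k)\|^2 + \langle \nabla f(x_k), x_k - x_{k-1} \rangle\right) \\
&\quad + \frac{1}{2L}\frac{\gamma}{1 + \gamma}\|\nabla f(x_k)\|^2 + \frac{1}{1 + \gamma}\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2\right) \\
&\quad + \frac{\gamma}{1 + \gamma}\left(f_\star - f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \langle X_k, \nabla f(x_k) \rangle\right) \\
&\ge 0.
\end{aligned}$$
Since $\frac{L\gamma^2}{2(1 + \gamma)(2 + \gamma)} = \frac{\mu}{2}$, the above inequality indicates that
$$\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|^2\right) + \frac{\mu}{2}\|z_k - x_\star\|^2 \ge \frac{2\gamma}{1 + \gamma}(f(x_k) - f_\star) + \frac{\mu}{2}\left\|z_k - \left(\frac{\gamma + 2}{\gamma}\right)\frac{1}{L}\nabla f(x_k) - x_\star\right\|^2.$$
⊔⊓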
3.4 Discussion
The factor-$\sqrt{2}$ improvement of SC-OGM over SC-AGM is consistent with the factor-$\sqrt{2}$ improvement of OGM over AGM. AGM and OGM share the same momentum term while OGM has the additional “correction term”. In contrast, the momentum coefficients differ in the strongly convex case: SC-AGM has
$$\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = 1 - \frac{2}{\sqrt{\kappa}} + O\left(\frac{1}{\kappa}\right)$$
while SC-OGM has
$$\frac{1}{2\gamma + 1} = 1 - \frac{2\sqrt{2}}{\sqrt{\kappa}} + O\left(\frac{1}{\kappa}\right).$$
Of course, SC-OGM also has the correction term, which is essential in the analysis. We clarify that SC-OGM is not an optimal algorithm for the setup of minimizing smooth strongly convex functions, as discussed in Section 1.1.
Another interesting line of research is to extend the faster rates to the composite minimization setup, which minimizes $f + g$ with a smooth strongly convex $f$ and a convex but possibly non-smooth $g$, as has been pursued in [49] and [10]. Interestingly, the algorithm of [10, Theorem 6] is different from SC-OGM, but achieves the same $O(\exp(-\sqrt{2}k/\sqrt{\kappa}))$-rate as SC-OGM, while having an extension to the composite minimization setup.
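The two momentum-coefficient expansions stated in this discussion can be checked numerically. In the sketch below, which is only an illustration, the error of each leading-order approximation is multiplied by $\kappa$; if the remainder is $O(1/\kappa)$, these products should stay bounded as $\kappa$ grows. The function names and the sampled values of $\kappa$ are assumptions.

```python
import math

def sc_agm_momentum(kappa):
    """Momentum coefficient of SC-AGM."""
    return (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)

def sc_ogm_momentum(kappa):
    """Momentum coefficient 1 / (2 gamma + 1) of SC-OGM."""
    gamma = (math.sqrt(8.0 * kappa + 1.0) + 3.0) / (2.0 * kappa - 2.0)
    return 1.0 / (2.0 * gamma + 1.0)

scaled_errors = []
for kappa in (1e2, 1e4, 1e6):
    agm_err = sc_agm_momentum(kappa) - (1.0 - 2.0 / math.sqrt(kappa))
    ogm_err = sc_ogm_momentum(kappa) - (1.0 - 2.0 * math.sqrt(2.0) / math.sqrt(kappa))
    scaled_errors.append((kappa * agm_err, kappa * ogm_err))
    print(kappa, kappa * agm_err, kappa * ogm_err)
```

For the same $\kappa$, the SC-OGM coefficient is the smaller of the two, so SC-OGM relies less on momentum and compensates through the correction term.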
4 Linear coupling analysis c
While the Lyapunov analyses of Sections 2 and 3 do provide insight into the
acceleration mechanism of OGM, they do not shed light onto the provenance of
the method. Originally, OGM was generated through a computer-assisted proof
methodology as the exactly optimal first-order method, but this approach is
arguably opaque to humans.
In this section, we present a human-understandable derivation of OGM
based on linear coupling. Specifically, we obtain OGM by refining the linear
coupling analysis of Allen-Zhu and Orecchia [5] through replacing the use of
non-tight inequalities with tight inequalities.
We specifically provide: (i) a natural (and non-computer assisted) derivation
of OGM, (ii) a generalization of OGM to the mirror descent setup, and (iii) a
unification of AGM and OGM. We moreover provide (iv) a generalization of
SC-OGM to the mirror descent setup in the appendix, in Section D.
Assumption and notation. In this section, assume
(A3) $\|x\| = \sqrt{x^\top Q x}$ is a quadratic norm, where $Q$ is a symmetric positive definite matrix.
Assumption (A1) is to be interpreted as $L$-smoothness with respect to the norm $\|\cdot\|$. Write $\|x\|_* = \sqrt{x^\top Q^{-1} x}$ for the dual norm of $\|\cdot\|$. However, $\langle\cdot,\cdot\rangle$ is the standard Euclidean inner product (unrelated to $Q$). Let $w\colon \mathbb{R}^n \to \mathbb{R}$ be a “distance generating function” that is differentiable and 1-strongly convex with respect to $\|\cdot\|$, and let
$$V_x(y) = w(y) - \langle \nabla w(x), y - x\rangle - w(x) \qquad \forall x, y \in \mathbb{R}^n$$
be the Bregman divergence generated by $w$.
4.1 Linear coupling analysis of AGM
We briefly outline the linear coupling analysis of AGM presented in [5] and
point out where the analysis can be refined.
Consider the problem of minimizing f under assumptions (A1), (A2), and
(A3). The linear coupling method is
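$$\begin{aligned}
y_{k+1} &= x_k - L^{-1}Q^{-1}\nabla f(x_k)\\
z_{k+1} &= \operatorname*{argmin}_{y\in\mathbb{R}^n}\left\{V_{z_k}(y) + \langle \alpha_{k+1}\nabla f(x_k), y - x_k\rangle\right\}\\
x_{k+1} &= (1-\tau_{k+1})y_{k+1} + \tau_{k+1}z_{k+1}
\end{aligned}\tag{LC}$$
for $k = 0, 1, \dots$, where $x_0 = z_0$ and $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are positive sequences
to be determined.
We obtain AGM by performing a non-tight analysis of (LC) and letting
the analysis inform the choices of $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$. The first step of this
analysis is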
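$$\begin{aligned}
\alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle &\le \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)\\
&\le \alpha_{k+1}^2 L\left(f(x_k) - f(y_{k+1})\right) + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star).
\end{aligned}$$
The second inequality follows from
$$f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2 + \underline{\frac{1}{2L}\|\nabla f(y_{k+1})\|_*^2},$$
but the underscored term $\frac{1}{2L}\|\nabla f(y_{k+1})\|_*^2$ is not used, i.e., the proof utilizes the
weaker and non-tight inequality
$$f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2.$$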
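The second step of this analysis is to choose $\tau_k = \frac{1}{\alpha_{k+1}L}$ to eliminate $f(x_k)$ and to show
$$\alpha_{k+1}^2 L\left(f(y_{k+1}) - f_\star\right) + V_{z_{k+1}}(x_\star) \le \left(\alpha_{k+1}^2 L - \alpha_{k+1}\right)\left(f(y_k) - f_\star\right) + V_{z_k}(x_\star).$$
The inequality follows from
$$f(x_k) - f_\star \le \langle\nabla f(x_k), x_k - x_\star\rangle - \underline{\frac{1}{2L}\|\nabla f(x_k)\|_*^2}$$
and
$$\langle\nabla f(x_k), y_k - x_k\rangle \le f(y_k) - f(x_k) - \underline{\frac{1}{2L}\|\nabla f(y_k) - \nabla f(x_k)\|_*^2},$$
but the underscored terms are not used. Finally, convergence is established
through a telescoping sum argument as in Appendix C.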
4.2 Linear coupling analysis of OGM
We now derive OGM through performing a tight analysis of (LC) and letting
the analysis inform the choices of $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=0}^\infty$.
In the first step of our linear coupling analysis, we follow the same arguments
but do not take the step utilizing the non-tight inequality.
Lemma 1 Assume (A1) and (A2). The iterates (LC) satisfy
$$\alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle \le \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)$$
for $k = 0, 1, \dots$.
Proof This is exactly the first part of Lemma 4.2 of [5]. ⊓⊔
In the second step of our linear coupling analysis, we choose $\tau_k = \frac{2}{\alpha_{k+1}L}$ to
allow for a telescoping sum argument and show the following lemma.
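Lemma 2 Assume (A1), (A2) and (A3). Let $0 < \tau_k = \frac{2}{\alpha_{k+1}L} \le 1$ for $k = 0, 1, \dots$, $\alpha_1 = \frac{2}{L}$, and $x_{-1} = x_0$. Set $h(x) = f(x) - f_\star - \frac{1}{2L}\|\nabla f(x)\|_*^2$. The
iterates (LC) satisfy
$$\frac{\alpha_{k+1}^2 L}{2}h(x_k) + V_{z_{k+1}}(x_\star) \le \frac{\alpha_{k+1}^2 L - 2\alpha_{k+1}}{2}h(x_{k-1}) + V_{z_k}(x_\star)$$
for $k = 0, 1, \dots$.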
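Proof For $k = 1, 2, \dots$, we have
\begin{align*}
&\alpha_{k+1}\left(f(x_k) - f_\star\right)\\
&\le \alpha_{k+1}\langle\nabla f(x_k), x_k - x_\star\rangle - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2 \tag{4}\\
&= \alpha_{k+1}\langle\nabla f(x_k), x_k - z_k\rangle + \alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2\\
&= \frac{1-\tau_k}{\tau_k}\alpha_{k+1}\langle\nabla f(x_k), y_k - x_k\rangle + \alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2\\
&= \frac{1-\tau_k}{\tau_k}\alpha_{k+1}\left\langle\nabla f(x_k), x_{k-1} - x_k - \frac{1}{L}Q^{-1}\nabla f(x_{k-1})\right\rangle\\
&\quad + \alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2 \tag{5}\\
&\le \frac{1-\tau_k}{\tau_k}\alpha_{k+1}\left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right) \tag{6}\\
&\quad + \alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2\\
&\le \frac{1-\tau_k}{\tau_k}\alpha_{k+1}\left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right) \tag{7}\\
&\quad + \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star) - \frac{\alpha_{k+1}}{2L}\|\nabla f(x_k)\|_*^2.
\end{align*}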
(4) and (6) follow from Lemma 11, (5) follows from the definition of linear
coupling, and (7) follows from Lemma 1.
The case of $k = 0$ follows from $\alpha_1 = \frac{2}{L}$ and $f_\star - f(x_0) - \langle\nabla f(x_0), x_\star - x_0\rangle - \frac{1}{2L}\|\nabla f(x_0)\|_*^2 \ge 0$ with Lemma 1. ⊓⊔
Theorem 5 Assume (A1), (A2), and (A3). Let the positive sequence $\{\alpha_k\}_{k=1}^\infty$
satisfy $0 \le \alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$ for $k = 1, 2, \dots$ and $\alpha_1 = \frac{2}{L}$. Let $\tau_k = \frac{2}{\alpha_{k+1}L}$
for $k = 1, 2, \dots$. The $y_k$-sequence of (LC) exhibits the rate
$$f(y_k) - f_\star \le \frac{2V_{x_0}(x_\star)}{L\alpha_k^2}$$
for $k = 1, 2, \dots$.
Proof Sum the inequality of Lemma 2 from $0$ to $(k-1)$. Then use $V_{z_k}(x_\star) \ge 0$
and $f(y_k) \le f(x_{k-1}) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2$ to conclude the rate. ⊓⊔
The $\{\theta_k\}_{k=0}^\infty$ of the original OGM formulation is related to $\{\alpha_k\}_{k=1}^\infty$ through
$\alpha_{k+1} = 2\theta_k/L$ for $k = 0, 1, \dots$. The seemingly different parameter choices
$\tau_k = \frac{1}{\alpha_{k+1}L}$ for AGM and $\tau_k = \frac{2}{\alpha_{k+1}L}$ for OGM actually turn out to be the
same, as $\{\alpha_k\}_{k=1}^\infty$ for AGM and OGM differ by a factor of 2.
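The parameters $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are chosen to make the telescoping
sum argument work and to make it work tightly, as described in Section C.
Specifically, one starts with the form
$$\begin{aligned}
&M_k\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right) + V_{z_{k+1}}(x_\star)\\
&\qquad\le N_{k-1}\left(f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2\right) + V_{z_k}(x_\star),
\end{aligned}$$
where the scalar coefficients $M_k$, $N_{k-1}$ are determined by (7). Comparing the
coefficients of $\|\nabla f(x_k)\|_*^2$, we have
$$-\frac{1}{2L}\left(\alpha_{k+1} + \alpha_{k+1}\frac{1-\tau_k}{\tau_k}\right) = -\frac{\alpha_{k+1}^2}{2} + \frac{1}{2L}\left(\alpha_{k+1} + \alpha_{k+1}\frac{1-\tau_k}{\tau_k}\right).$$
Solving this equation leads to the choice $\tau_k = \frac{2}{L\alpha_{k+1}}$. The requirement $\alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$ is needed for the telescoping sum argument to work, and the
choice $\alpha_{k+1}^2 L - 2\alpha_{k+1} = \alpha_k^2 L$ makes the argument tight.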
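To make the parameter choices concrete, here is a minimal sketch of (LC) with the tight choice $\alpha_{k+1}^2 L - 2\alpha_{k+1} = \alpha_k^2 L$, which solves to $\alpha_{k+1} = (1 + \sqrt{1 + \alpha_k^2 L^2})/L$. The sketch assumes the Euclidean special case $Q = I$ and $w = \frac{1}{2}\|\cdot\|^2$, so the mirror step reduces to $z_{k+1} = z_k - \alpha_{k+1}\nabla f(x_k)$. The function `lc_ogm` and the toy quadratic `f`, `grad_f` are hypothetical names used only for illustration.

```python
import math

def lc_ogm(grad, x0, L, num_iters):
    """Linear coupling (LC) with tau_k = 2/(alpha_{k+1} L), assuming Q = I and w = 0.5*||.||^2.

    Returns ([y_1, ..., y_n], [alpha_1, ..., alpha_n]).
    """
    x, z = list(x0), list(x0)
    alpha = 2.0 / L  # alpha_1 = 2/L
    ys, alphas = [], []
    for _ in range(num_iters):
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]        # gradient step
        z = [zi - alpha * gi for zi, gi in zip(z, g)]    # mirror step (Euclidean case)
        ys.append(y)
        alphas.append(alpha)
        # Tight choice: alpha_next^2 L - 2 alpha_next = alpha^2 L.
        alpha_next = (1.0 + math.sqrt(1.0 + (alpha * L) ** 2)) / L
        tau = 2.0 / (alpha_next * L)                     # tau_{k+1}
        x = [(1.0 - tau) * yi + tau * zi for yi, zi in zip(y, z)]
        alpha = alpha_next
    return ys, alphas

# Toy problem (an assumption for illustration): diagonal quadratic with L = 1,
# minimizer x_star = 0 and f_star = 0.
D = [1.0, 0.1, 0.01]

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad_f(x):
    return [d * xi for d, xi in zip(D, x)]
```

On this toy problem, $V_{x_0}(x_\star) = \frac{1}{2}\|x_0\|^2$, so the bound of Theorem 5 reads $f(y_k) \le \|x_0\|^2/(L\alpha_k^2)$ and can be compared directly against the returned iterates.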
4.3 Secondary sequence analysis
In the linear coupling context, the last-step modification can be expressed as
$$\tilde{x}_k = (1-\tilde{\tau}_k)y_k + \tilde{\tau}_k z_k \tag{8}$$
for $k = 0, 1, \dots$, where $\{\tilde{\tau}_k\}_{k=0}^\infty$ is a positive sequence to be determined.
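Lemma 3 Assume (A1), (A2) and (A3). Let $0 < \tilde{\tau}_k = \frac{1}{\tilde{\alpha}_{k+1}L} \le 1$ for $k = 0, 1, \dots$, $\tilde{\alpha}_1 = \frac{1}{L}$, and $x_{-1} = x_0$. Then the $\tilde{x}_k$-sequence of (8), the secondary
sequence with last-step modification of (LC), satisfies
$$\tilde{\alpha}_{k+1}^2 L\left(f(\tilde{x}_k) - f_\star\right) + V_{z_{k+1}}(x_\star) \le \left(\tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1}\right)h(x_{k-1}) + V_{z_k}(x_\star)$$
for $k = 0, 1, \dots$.
Proof The proof is identical to that of Lemma 2 with $\tau_k$ replaced by $\tilde{\tau}_k$. ⊓⊔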
Theorem 6 In the setup of Theorem 5, let $0 \le \tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1} \le \frac{1}{2}\alpha_k^2 L$
and $\tilde{\alpha}_1 = \frac{1}{L}$. Then the $\tilde{x}_k$-sequence, the secondary sequence with last-step
modification, of the linear coupling method (LC) exhibits the rate
$$f(\tilde{x}_k) - f_\star \le \frac{V_{x_0}(x_\star)}{L\tilde{\alpha}_{k+1}^2}$$
for $k = 0, 1, \dots$.
Proof Sum the inequality of Lemma 2 from $0$ to $(k-2)$ and the inequality of
Lemma 3 with $k-1$. Then use $V_{z_k}(x_\star) \ge 0$ to conclude the rate. ⊓⊔
4.4 Comparison of the linear coupling analyses of AGM and OGM
The linear coupling analysis of Allen-Zhu and Orecchia [5], which derives AGM,
relies on the following two key lemmas.
Lemma 4 [5, Lemma 4.2] In the linear coupling setup,
$$\begin{aligned}
\alpha_{k+1}\langle\nabla f(x_k), z_k - x_\star\rangle &\le \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)\\
&\le \alpha_{k+1}^2 L\left(f(x_k) - f(y_{k+1})\right) + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)
\end{aligned}$$
for $k = 0, 1, \dots$.
Lemma 5 [5, Lemma 4.3] (Coupling Lemma) In the linear coupling setup,
$$\alpha_{k+1}^2 L\left(f(y_{k+1}) - f_\star\right) + V_{z_{k+1}}(x_\star) \le \left(\alpha_{k+1}^2 L - \alpha_{k+1}\right)\left(f(y_k) - f_\star\right) + V_{z_k}(x_\star)$$
for $k = 0, 1, \dots$.
As discussed, the proof of [5, Lemma 4.2] uses the non-tight inequality
$$f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2,$$
and the proof of [5, Lemma 4.3] follows steps similar to that of Lemma 2, but
uses the non-tight inequalities
$$f(x_k) - f_\star \le \langle\nabla f(x_k), x_k - x_\star\rangle$$
and
⟨∇f(xk), yk − xk⟩ ≤ f(yk)− f(xk).
In both linear coupling analyses, for OGM and AGM, the telescoping
sum argument is made tight by choosing $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ appropriately.
However, the analysis of Allen-Zhu and Orecchia [5] uses non-tight inequalities
before the telescoping sum argument, while our analysis uses tight inequalities
in all steps.
4.5 Unification of AGM and OGM
If we choose $w(y) = \frac{1}{2t}\|y\|^2$, so that $V_x(y) = \frac{1}{2t}\|x - y\|^2$, and $0 < t \le 1$, so
that $w$ is 1-strongly convex, and substitute $\alpha_{k+1} = 2\theta_k/L$, (LC) becomes
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
z_{k+1} &= z_k - \frac{2t\theta_k}{L}\nabla f(x_k)\\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}
\end{aligned}$$
for $k = 0, 1, \dots$. We also express this method with the momentum and correction
terms and without the $z_k$-iterates in Lemma 6. This method unifies AGM
and OGM through the constant $t$; AGM and OGM respectively correspond to
$t = 1/2$ and $t = 1$.
Corollary 5 Assume (A1), (A2) and (A3). Let $0 < t \le 1$. Then
$$f(y_k) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{4t\theta_{k-1}^2}$$
for $k = 1, 2, \dots$.
Proof This follows from Theorem 5 with $\alpha_{k+1} = \frac{2\theta_k}{L}$. ⊓⊔
The rates of Corollary 5 at $t = \frac{1}{2}$ and $t = 1$ exactly match the previously
discussed rates of AGM and OGM.
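As an illustration of the unified family, the sketch below runs the $z$-iterate form for a given $t$ and returns the $y_k$ iterates together with the $\theta_k$ sequence, so that the bound of Corollary 5 can be evaluated. It assumes the Euclidean norm and uses the recursion $\theta_0 = 1$, $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ (the sequence studied in Appendix E). The helper names and the toy quadratic are hypothetical.

```python
import math

def theta_sequence(n):
    """theta_0 = 1 and theta_{k+1} = (1 + sqrt(1 + 4 theta_k^2)) / 2, for k = 0..n-1."""
    thetas = [1.0]
    for _ in range(n):
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * thetas[-1] ** 2)) / 2.0)
    return thetas

def unified_method(grad, x0, L, t, num_iters):
    """z-iterate form of the unified method; t = 0.5 gives AGM and t = 1 gives OGM.

    Returns ([y_1, ..., y_n], [theta_0, ..., theta_n]).
    """
    thetas = theta_sequence(num_iters)
    x, z = list(x0), list(x0)
    ys = []
    for k in range(num_iters):
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]
        z = [zi - 2.0 * t * thetas[k] * gi / L for zi, gi in zip(z, g)]
        w = 1.0 / thetas[k + 1]
        x = [(1.0 - w) * yi + w * zi for yi, zi in zip(y, z)]
        ys.append(y)
    return ys, thetas

# Toy problem (an assumption for illustration): L = 1, x_star = 0, f_star = 0.
D = [1.0, 0.1, 0.01]

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad_f(x):
    return [d * xi for d, xi in zip(D, x)]
```

With `ys[j]` holding $y_{j+1}$ and `thetas[j]` holding $\theta_j$, the bound to compare against is `L * ||x0||^2 / (4 * t * thetas[j] ** 2)`.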
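Lemma 6 The unified form is equivalent to
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + (2t-1)\frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k).
\end{aligned}$$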
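Proof To prove the equivalency, we show that the above sequence leads to
$$x_{k+1} = \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}.$$
That is,
\begin{align*}
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}y_{k+1} - \frac{\theta_k - 1}{\theta_{k+1}}y_k - (2t-1)\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}\left(x_k - \frac{1}{L}\nabla f(x_k)\right) - \frac{\theta_k - 1}{\theta_{k+1}}y_k - (2t-1)\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}x_k - \frac{\theta_k - 1}{\theta_{k+1}}y_k - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} - \frac{\theta_k - 1}{\theta_{k+1}}y_k - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&\quad + \frac{\theta_k}{\theta_{k+1}}\left(y_k + \frac{\theta_{k-1} - 1}{\theta_k}(y_k - y_{k-1}) - (2t-1)\frac{\theta_{k-1}}{\theta_k}\frac{1}{L}\nabla f(x_{k-1})\right)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \left(\frac{\theta_k}{\theta_{k+1}} + \frac{\theta_{k-1} - 1}{\theta_{k+1}} - \frac{\theta_k - 1}{\theta_{k+1}}\right)y_k - \frac{\theta_{k-1} - 1}{\theta_{k+1}}y_{k-1}\\
&\quad - (2t-1)\frac{\theta_{k-1}}{\theta_{k+1}}\frac{1}{L}\nabla f(x_{k-1}) - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}y_k - \frac{\theta_{k-1} - 1}{\theta_{k+1}}y_{k-1}\\
&\quad - (2t-1)\frac{\theta_{k-1}}{\theta_{k+1}}\frac{1}{L}\nabla f(x_{k-1}) - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}\left(x_{k-1} - \frac{1}{L}\nabla f(x_{k-1})\right) - \frac{\theta_{k-1} - 1}{\theta_{k+1}}y_{k-1}\\
&\quad - (2t-1)\frac{\theta_{k-1}}{\theta_{k+1}}\frac{1}{L}\nabla f(x_{k-1}) - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}x_{k-1} - \frac{\theta_{k-1} - 1}{\theta_{k+1}}y_{k-1}\\
&\quad - 2t\frac{\theta_k}{\theta_{k+1}}\frac{1}{L}\nabla f(x_k) - 2t\frac{\theta_{k-1}}{\theta_{k+1}}\frac{1}{L}\nabla f(x_{k-1})\\
&\;\;\vdots\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{\theta_0}{\theta_{k+1}}x_0 - \frac{\theta_0 - 1}{\theta_{k+1}}y_0 - \frac{1}{\theta_{k+1}}\sum_{i=0}^{k}2t\theta_i\frac{1}{L}\nabla f(x_i)\\
&= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}.
\end{align*}
⊓⊔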
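The equivalence of Lemma 6 can also be observed numerically. The sketch below, which is only an illustration and assumes the Euclidean norm and $x_0 = y_0 = z_0$, runs the $z$-iterate form and the momentum form side by side and reports the largest coordinate-wise difference between their $x_k$ iterates. All function names and the toy quadratic are hypothetical.

```python
import math

def theta_sequence(n):
    """theta_0 = 1 and theta_{k+1} = (1 + sqrt(1 + 4 theta_k^2)) / 2."""
    thetas = [1.0]
    for _ in range(n):
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * thetas[-1] ** 2)) / 2.0)
    return thetas

def z_form(grad, x0, L, t, n):
    """Unified method written with z-iterates; returns [x_1, ..., x_n]."""
    th = theta_sequence(n)
    x, z, xs = list(x0), list(x0), []
    for k in range(n):
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]
        z = [zi - 2.0 * t * th[k] * gi / L for zi, gi in zip(z, g)]
        x = [(1.0 - 1.0 / th[k + 1]) * yi + zi / th[k + 1] for yi, zi in zip(y, z)]
        xs.append(x)
    return xs

def momentum_form(grad, x0, L, t, n):
    """Unified method written with momentum and correction terms; returns [x_1, ..., x_n]."""
    th = theta_sequence(n)
    x, y_prev, xs = list(x0), list(x0), []  # y_0 = x_0
    for k in range(n):
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]
        a = (th[k] - 1.0) / th[k + 1]             # momentum coefficient
        b = (2.0 * t - 1.0) * th[k] / th[k + 1]   # correction coefficient
        x = [yi + a * (yi - ypi) + b * (yi - xi) for yi, ypi, xi in zip(y, y_prev, x)]
        y_prev = y
        xs.append(x)
    return xs

def max_difference(grad, x0, L, t, n):
    """Largest coordinate-wise gap between the x-iterates of the two forms."""
    pairs = zip(z_form(grad, x0, L, t, n), momentum_form(grad, x0, L, t, n))
    return max(abs(u - v) for xa, xb in pairs for u, v in zip(xa, xb))

# Toy gradient (an assumption for illustration): diagonal quadratic with L = 1.
D = [1.0, 0.1, 0.01]

def grad_f(x):
    return [d * xi for d, xi in zip(D, x)]
```

Up to floating-point rounding, `max_difference` should be zero for any $0 < t \le 1$; note that the correction term vanishes at $t = 1/2$, which recovers the usual momentum form of AGM.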
4.6 Discussion
By identifying OGM as an instance of linear coupling, we generalized OGM to
the setup with quadratic norms and mirror descent steps while maintaining
the factor-$\sqrt{2}$ improvement. However, we do point out that the generalization
is narrower than that of [5], which allows non-quadratic norms and constrained
$y_k$- and $z_k$-updates. The analysis of the strongly convex case follows from a similar
line of reasoning and is presented in Appendix D.
In addition to the human-understandable derivation of OGM, this section
provides two non-obvious observations, which we point out again. The first
is that AGM and OGM can be unified into a single parameterized family of
accelerated gradient methods, all achieving the $O(1/k^2)$ rate. Another is that
the linear coupling analysis of Allen-Zhu and Orecchia [5] was suboptimal in
the same way that AGM is suboptimal and can be improved.
5 Conclusion
In this work, we presented human-understandable analyses of OGM. The first
key insight is to use a Lyapunov function with $f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|^2$,
a somewhat unusual term in Lyapunov analyses. The second key insight is
to obtain OGM by refining the linear coupling analysis of Allen-Zhu and
Orecchia [5] through replacing non-tight inequalities with tight ones. With
these insights, we extended the factor-$\sqrt{2}$ acceleration to other setups.
In our view, the most significant contribution of this work is the improved
understanding of OGM’s acceleration mechanism. While Nesterov’s acceleration
mechanism has been utilized as a component in a wide range of setups, OGM’s
acceleration mechanism has not yet seen any external use. Through the
understanding provided by the analysis of this work, we hope OGM’s acceleration
becomes more widely utilized to gain a (perhaps factor-$\sqrt{2}$) speedup
compared to what can be achieved with AGM’s acceleration. For example, whether
accelerated coordinate gradient methods [6, 44] or non-convex stochastic
optimization [23] can be improved with OGM’s acceleration mechanism would be
an interesting question to address in future work. Improving FISTA [16]
and the more general mirror descent setup [14,34] are also interesting directions,
although there are known limitations [18,29].
Finally, studying how OGM’s acceleration interacts with other techniques
used to analyze AGM, such as the continuous-time analysis [50], high-resolution
ODEs [48], and the variational perspective [55], is also an interesting direction.
Acknowledgements
JP and EKR were supported by the Samsung Science and Technology Foundation
(Project Number SSTF-BA2101-02) and the National Research Foundation
of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010].
We thank Gyumin Roh for reviewing the manuscript and
providing valuable feedback. We thank Bryan Van Scoy and Suvrit Sra for the
discussions regarding the triple momentum method and estimate sequences,
respectively.
Conflict of interest
The authors declare that they have no conflict of interest.
References
1. Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. COLT (2020)
2. Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. STOC (2017)
3. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. ICML (2016)
4. Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. SODA (2016)
5. Allen-Zhu, Z., Orecchia, L.: Linear coupling: An ultimate unification of gradient and mirror descent. ITCS (2017)
6. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. ICML (2016)
7. Aujol, J., Dossal, C.: Optimal rate of convergence of an ODE associated to the fast gradient descent schemes for b > 0. HAL Archives Ouvertes (2017)
8. Aujol, J.F., Dossal, C., Fort, G., Moulines, É.: Rates of convergence of perturbed FISTA-based algorithms. HAL Archives Ouvertes (2019)
9. Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. SIAM Journal on Optimization 29(4), 3131–3153 (2019)
10. Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the heavy-ball method for quasi-strongly convex optimization (2021)
11. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization 16(3), 697–725 (2006)
12. Baes, M.: Estimate sequence methods: extensions and approximations. Tech. rep., Institute for Operations Research, ETH, Zürich, Switzerland (2009)
13. Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory of Computing 15(4), 1–32 (2019)
14. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research 42(2), 330–348 (2017)
15. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31(3), 167–175 (2003)
16. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
17. De Klerk, E., Glineur, F., Taylor, A.B.: Worst-case convergence analysis of inexact gradient and newton methods through semidefinite programming performance estimation. SIAM Journal on Optimization 30(3), 2053–2082 (2020)
18. Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Mathematical Programming (2021)
19. Drori, Y.: The exact information-based complexity of smooth convex minimization. Journal of Complexity 39, 1–16 (2017)
20. Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. Journal of Complexity 68, 101590 (2022)
21. Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming 184(1), 183–220 (2020)
22. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming 145(1-2), 451–482 (2014)
23. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156(1-2), 59–99 (2016)
24. Gu, G., Yang, J.: Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM Journal on Optimization 30(3), 1905–1921 (2020)
25. Kim, D.: Accelerated proximal point method for maximally monotone operators. Mathematical Programming (2021)
26. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Mathematical Programming 159(1-2), 81–107 (2016)
27. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. Journal of Optimization Theory and Applications 172(1), 187–205 (2017)
28. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. Journal of Optimization Theory and Applications 178(1), 240–263 (2018)
29. Kim, D., Fessler, J.A.: Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM Journal on Optimization 28(1), 223–250 (2018)
30. Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM Journal on Optimization 28(2), 1920–1950 (2018)
31. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization 26(1), 57–95 (2016)
32. Li, B., Coutiño, M., Giannakis, G.B.: Revisit of estimate sequence for accelerated gradient methods. ICASSP (2020)
33. Lieder, F.: On the convergence rate of the halpern-iteration. Optimization Letters pp. 1–14 (2020)
34. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization 28(1), 333–354 (2018)
35. Nemirovsky, A.S.: On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity 7(2), 121–130 (1991)
36. Nemirovsky, A.S.: Information-based complexity of linear operator equations. Journal of Complexity 8(2), 153–175 (1992)
37. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. (1983)
38. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Proceedings of the USSR Academy of Sciences 269, 543–547 (1983)
39. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course (2004)
40. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152 (2005)
41. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming 112(1), 159–181 (2008)
42. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Mathematical Programming 120(1), 221–259 (2009)
43. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
44. Nesterov, Y., Stich, S.U.: Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization 27(1), 110–123 (2017)
45. Rockafellar, R.T.: Convex Analysis (1970)
46. Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: Tight contraction factors and optimal parameter selection. SIAM Journal on Optimization 30(3), 2251–2271 (2020)
47. Ryu, E.K., Yin, W.: Large-Scale Convex Optimization via Monotone Operators. Draft (2021)
48. Shi, B., Du, S.S., Su, W., Jordan, M.I.: Acceleration via symplectic discretization of high-resolution differential equations. NeurIPS (2019)
49. Siegel, J.W.: Accelerated first-order methods: Differential equations and lyapunov functions. arXiv preprint arXiv:1903.05671 (2019)
50. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. NeurIPS (2014)
51. Taylor, A., Drori, Y.: An optimal gradient method for smooth strongly convex minimization. Mathematical Programming (2022)
52. Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT (2019)
53. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM Journal on Optimization 27(3), 1283–1313 (2017)
54. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and
exact worst-case performance of first-order methods. Mathematical Programming 161(1-
2), 307–345 (2017)
55. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated
methods in optimization. Proceedings of the National Academy of Sciences 113(47),
E7351–E7358 (2016)
A Method reference
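For reference, we restate all aforementioned methods. In all methods, we assume that $f$
is an $L$-smooth function, $\{\theta_k\}_{k=0}^\infty$ and $\{\varphi_k\}_{k=0}^\infty$ are sequences of positive scalars, and
$x_0 = y_0 = z_0$.
OGM. One form of OGM is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k)
\end{aligned}$$
and an equivalent form with $z$-iterates is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
z_{k+1} &= z_k - \frac{2\theta_k}{L}\nabla f(x_k)\\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}
\end{aligned}$$
for $k = 0, 1, \dots$. The last-step modification on the secondary sequence can be written as
$$\begin{aligned}
\tilde{x}_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\varphi_{k+1}}(y_{k+1} - y_k) + \frac{\theta_k}{\varphi_{k+1}}(y_{k+1} - x_k)\\
&= \left(1 - \frac{1}{\varphi_{k+1}}\right)y_{k+1} + \frac{1}{\varphi_{k+1}}z_{k+1}
\end{aligned}$$
where $k = 0, 1, \dots$.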
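OGM-simple. OGM-simple is a simpler variant of OGM with $\theta_k = \frac{k+2}{2}$ and $\varphi_k = \frac{k + 1 + \frac{1}{\sqrt{2}}}{\sqrt{2}}$. One form of OGM-simple is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{k}{k+3}(y_{k+1} - y_k) + \frac{k+2}{k+3}(y_{k+1} - x_k)
\end{aligned}$$
and an equivalent form with $z$-iterates is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
z_{k+1} &= z_k - \frac{k+2}{L}\nabla f(x_k)\\
x_{k+1} &= \left(1 - \frac{2}{k+3}\right)y_{k+1} + \frac{2}{k+3}z_{k+1}
\end{aligned}$$
for $k = 0, 1, \dots$. The last-step modification on the secondary sequence is written as
$$\tilde{x}_{k+1} = y_{k+1} + \frac{k}{\sqrt{2}(k+2) + 1}(y_{k+1} - y_k) + \frac{k+2}{\sqrt{2}(k+2) + 1}(y_{k+1} - x_k)$$
where $k = 0, 1, \dots$.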
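SC-OGM. Here, we assume that $f$ is a $\mu$-strongly convex function, the condition number of $f$
is $\kappa = L/\mu$, and $\gamma = \frac{\sqrt{8\kappa+1}+3}{2\kappa-2}$. SC-OGM is written as
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{1}{2\gamma+1}(y_{k+1} - y_k) + \frac{1}{2\gamma+1}(y_{k+1} - x_k)
\end{aligned}$$
for $k = 0, 1, \dots$.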
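The SC-OGM recursion above translates almost line by line into code. The following is a minimal sketch under the assumptions of this appendix ($f$ is $L$-smooth and $\mu$-strongly convex with $\kappa = L/\mu > 1$, Euclidean norm, $x_0 = y_0$); the function name `sc_ogm` and the toy quadratic are hypothetical and serve only as an illustration.

```python
import math

def sc_ogm(grad, x0, L, mu, num_iters):
    """Run SC-OGM for num_iters steps and return the last y-iterate (requires L > mu > 0)."""
    kappa = L / mu
    gamma = (math.sqrt(8.0 * kappa + 1.0) + 3.0) / (2.0 * kappa - 2.0)
    c = 1.0 / (2.0 * gamma + 1.0)  # shared momentum / correction coefficient
    x, y_prev = list(x0), list(x0)  # y_0 = x_0
    for _ in range(num_iters):
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]
        x = [yi + c * (yi - ypi) + c * (yi - xi) for yi, ypi, xi in zip(y, y_prev, x)]
        y_prev = y
    return y_prev

# Toy problem (an assumption for illustration): L = 1, mu = 0.01, minimizer 0.
D = [1.0, 0.1, 0.01]

def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(D, x))

def grad_f(x):
    return [d * xi for d, xi in zip(D, x)]
```

For example, `f(sc_ogm(grad_f, [1.0, 1.0, 1.0], 1.0, 0.01, 200))` should be many orders of magnitude below the initial value `f([1.0, 1.0, 1.0])`, in line with the linear rate discussed in Section 3.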
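LC-OGM. LC-OGM (Linear Coupling OGM) is defined as
$$\begin{aligned}
y_{k+1} &= x_k - L^{-1}Q^{-1}\nabla f(x_k)\\
z_{k+1} &= \operatorname*{argmin}_{y\in\mathbb{R}^n}\left\{V_{z_k}(y) + \langle \alpha_{k+1}\nabla f(x_k), y - x_k\rangle\right\}\\
x_{k+1} &= (1-\tau_{k+1})y_{k+1} + \tau_{k+1}z_{k+1}
\end{aligned}$$
for $k = 0, 1, \dots$, where $V_z(y)$ is a Bregman divergence, $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ are nonnegative
sequences defined as $\alpha_1 = \frac{2}{L}$, $0 \le \alpha_{k+1}^2 L - 2\alpha_{k+1} \le \alpha_k^2 L$, $\tau_k = \frac{2}{\alpha_{k+1}L}$, and $Q$ is a positive
definite matrix defining $\|x\|^2 = x^\top Q x$.
For last step modification, we define positive sequences $\{\tilde{\alpha}_k\}_{k=1}^\infty$ and $\{\tilde{\tau}_k\}_{k=1}^\infty$ as $\tilde{\alpha}_1 = \frac{1}{L}$,
$0 \le \tilde{\alpha}_{k+1}^2 L - \tilde{\alpha}_{k+1} \le \frac{1}{2}\alpha_k^2 L$, and $\tilde{\tau}_k = \frac{1}{\tilde{\alpha}_{k+1}L}$, and also define
$$\tilde{x}_k = (1 - \tilde{\tau}_k)y_k + \tilde{\tau}_k z_k$$
for $k = 1, 2, \dots$.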
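Unification of AGM and OGM. Using LC-OGM, we can unify AGM and OGM as
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
z_{k+1} &= z_k - \frac{2t\theta_k}{L}\nabla f(x_k)\\
x_{k+1} &= \left(1 - \frac{1}{\theta_{k+1}}\right)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}
\end{aligned}$$
for $k = 0, 1, \dots$. This is equivalent to
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + (2t-1)\frac{\theta_k}{\theta_{k+1}}(y_{k+1} - x_k).
\end{aligned}$$
LC-SC-OGM. LC-SC-OGM (Linear Coupling Strongly Convex OGM) is
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}Q^{-1}\nabla f(x_k)\\
z_{k+1} &= \frac{1}{1+\gamma}\left(z_k + \gamma x_k - \frac{\gamma}{\mu}Q^{-1}\nabla f(x_k)\right)\\
x_{k+1} &= \tau z_{k+1} + (1-\tau)y_{k+1},
\end{aligned}$$
for $k = 0, 1, \dots$, where $Q$ is a positive definite matrix.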
B Co-coercivity inequality in general norm
Lemma 7 Let $f$ be a closed convex proper function. Then,
$$0 \le f(x) + f^*(u) - \langle x, u\rangle$$
and
$$\inf_x\{f(x) + f^*(u) - \langle x, u\rangle\} = 0, \qquad \inf_u\{f(x) + f^*(u) - \langle x, u\rangle\} = 0.$$
Proof By the definition of the conjugate function,
$$-f^*(u) = \inf_x\{f(x) - \langle x, u\rangle\}$$
and
$$\inf_x\{f(x) + f^*(u) - \langle x, u\rangle\} = 0.$$
Therefore,
$$0 \le f(x) + f^*(u) - \langle x, u\rangle \qquad \forall x.$$
The statement with $u$ follows from the same argument and the fact that $f^{**} = f$. ⊓⊔
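Lemma 8 Consider a norm $\|\cdot\|$ and its dual norm $\|\cdot\|_*$. Then,
$$0 \le \frac{1}{2}\|x\|^2 + \frac{1}{2}\|u\|_*^2 - \langle x, u\rangle$$
and
$$\inf_{x\in\mathbb{R}^n}\left\{\frac{1}{2}\|x\|^2 + \frac{1}{2}\|u\|_*^2 - \langle x, u\rangle\right\} = 0, \qquad \inf_{u\in\mathbb{R}^n}\left\{\frac{1}{2}\|x\|^2 + \frac{1}{2}\|u\|_*^2 - \langle x, u\rangle\right\} = 0.$$
Proof This follows from Lemma 7 with $f(x) = \frac{1}{2}\|x\|^2$ and $\left(\frac{1}{2}\|\cdot\|^2\right)^* = \frac{1}{2}\|\cdot\|_*^2$. ⊓⊔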
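Lemma 9 Let
$$\mathrm{Grad}(x) = \operatorname*{argmin}_{y\in\mathbb{R}^n}\left\{\frac{L}{2}\|y - x\|^2 + \langle\nabla f(x), y - x\rangle\right\}.$$
Then,
$$\langle\nabla f(x), \mathrm{Grad}(x) - x\rangle + \frac{L}{2}\|\mathrm{Grad}(x) - x\|^2 = -\frac{1}{2L}\|\nabla f(x)\|_*^2.$$
Proof Let $z = L(\mathrm{Grad}(x) - x)$. By the definition of $\mathrm{Grad}(x)$ and Lemma 8, we have
$$\begin{aligned}
&\frac{1}{2L}\|\nabla f(x)\|_*^2 + \frac{L}{2}\|\mathrm{Grad}(x) - x\|^2 + \langle\nabla f(x), \mathrm{Grad}(x) - x\rangle\\
&\qquad= \inf_{z\in\mathbb{R}^n}\left\{\frac{1}{2L}\|\nabla f(x)\|_*^2 + \frac{1}{2L}\|z\|^2 + \frac{1}{L}\langle\nabla f(x), z\rangle\right\}\\
&\qquad= 0.
\end{aligned}$$
⊓⊔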
Lemma 10 Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a differentiable convex function such that
$$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$$
for all $x, y \in \mathbb{R}^n$. Then
$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2.$$
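Proof Since a differentiable convex function is continuously differentiable [45, Theorem 25.5],
$$\begin{aligned}
f(y) - f(x) &= \int_0^1\langle\nabla f(x + t(y - x)), y - x\rangle\,dt\\
&= \int_0^1\langle\nabla f(x + t(y - x)) - \nabla f(x), y - x\rangle\,dt + \langle\nabla f(x), y - x\rangle\\
&\le \int_0^1\|\nabla f(x + t(y - x)) - \nabla f(x)\|_*\|y - x\|\,dt + \langle\nabla f(x), y - x\rangle\\
&\le \int_0^1 tL\|y - x\|^2\,dt + \langle\nabla f(x), y - x\rangle = \frac{L}{2}\|y - x\|^2 + \langle\nabla f(x), y - x\rangle.
\end{aligned}$$
⊓⊔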
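Lemma 11 (Co-coercivity inequality with general norm) Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a
differentiable convex function such that
$$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$$
for all $x, y \in \mathbb{R}^n$. Then
$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|_*^2.$$
Proof Set $\phi(y) = f(y) - \langle\nabla f(x), y - x\rangle$. Then $x \in \operatorname{argmin}\phi$. So by Lemmas 9 and 10, with $\mathrm{Grad}$ defined with respect to $\phi$,
$$\begin{aligned}
\phi(x) &\le \phi(\mathrm{Grad}(y))\\
&\le \phi(y) + \langle\nabla\phi(y), \mathrm{Grad}(y) - y\rangle + \frac{L}{2}\|\mathrm{Grad}(y) - y\|^2\\
&= \phi(y) - \frac{1}{2L}\|\nabla\phi(y)\|_*^2.
\end{aligned}$$
Substituting $f$ back in $\phi$ yields the co-coercivity inequality. ⊓⊔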
C Telescoping sum argument
Suppose we established the inequality
$$a_iF_i + b_iG_i \le c_iF_{i-1} + d_iG_{i-1} - E_i$$
for $i = 1, 2, \dots$, where $E_i$, $F_i$, $G_i$ are nonnegative quantities and $a_i$, $b_i$, $c_i$, and $d_i$ are
nonnegative scalars. Assume $c_i \le a_{i-1}$ and $d_i \le b_{i-1}$. By summing the inequalities for
$i = 1, 2, \dots, k$, we obtain
$$\begin{aligned}
a_kF_k &\le -b_kG_k - \sum_{i=2}^{k}(a_{i-1} - c_i)F_{i-1} - \sum_{i=2}^{k}(b_{i-1} - d_i)G_{i-1} - \sum_{i=1}^{k}E_i + c_1F_0 + d_1G_0\\
&\le c_1F_0 + d_1G_0.
\end{aligned}$$
However, note that the
$$-b_kG_k - \sum_{i=2}^{k}(a_{i-1} - c_i)F_{i-1} - \sum_{i=2}^{k}(b_{i-1} - d_i)G_{i-1} - \sum_{i=1}^{k}E_i$$
terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose
parameters so that
$$a_{i-1} = c_i, \qquad b_{i-1} = d_i$$
and $E_i = 0$ for $i = 1, 2, \dots$. Not having wasted terms may be an indication that the analysis
is tight.
D SC-OGM via linear coupling
In this section, we analyze SC-OGM through the linear coupling analysis. We consider the
linear coupling form
$$\begin{aligned}
y_{k+1} &= x_k - \frac{1}{L}Q^{-1}\nabla f(x_k)\\
z_{k+1} &= \frac{1}{1+\gamma}\left(z_k + \gamma x_k - \frac{\gamma}{\mu}Q^{-1}\nabla f(x_k)\right)\\
x_{k+1} &= \tau z_{k+1} + (1-\tau)y_{k+1},
\end{aligned}$$
where $\tau$ is a coupling coefficient to be determined. As an aside, we can view $z_{k+1}$ as a mirror
descent update of the form
$$z_{k+1} = \operatorname*{argmin}_z\left\{\frac{1}{2}\|z - z_k\|^2 + \frac{\gamma}{2}\|z - x_k\|^2 + \frac{\gamma}{\mu}\langle\nabla f(x_k), z\rangle\right\},$$
which is similar to what was considered in [6].
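The claim that the explicit $z_{k+1}$ formula solves the displayed minimization problem can be checked through the first-order optimality condition $Q(z - z_k) + \gamma Q(z - x_k) + \frac{\gamma}{\mu}\nabla f(x_k) = 0$. The sketch below does this for a diagonal positive definite $Q$, which is an assumption made only to keep the example dependency-free; the function names are hypothetical.

```python
def z_update(z, x, g, q_diag, gamma, mu):
    """Closed form z_{k+1} = (z_k + gamma*x_k - (gamma/mu) * Q^{-1} g) / (1 + gamma), Q diagonal."""
    return [
        (zi + gamma * xi - (gamma / mu) * gi / qi) / (1.0 + gamma)
        for zi, xi, gi, qi in zip(z, x, g, q_diag)
    ]

def stationarity_residual(z_new, z, x, g, q_diag, gamma, mu):
    """Gradient of 0.5*||v - z||^2 + (gamma/2)*||v - x||^2 + (gamma/mu)*<g, v> at v = z_new,
    where ||v||^2 = v^T Q v with Q = diag(q_diag)."""
    return [
        qi * (zn - zi) + gamma * qi * (zn - xi) + (gamma / mu) * gi
        for zn, zi, xi, gi, qi in zip(z_new, z, x, g, q_diag)
    ]
```

Because the objective is strongly convex in $z$, a vanishing residual at `z_new` confirms that the closed form is the unique minimizer.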
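Lemma 12 Assume (A1), (A2) and (A3). Then,
$$\begin{aligned}
&\frac{\gamma}{\mu}\langle\nabla f(x_k), z_{k+1} - x_\star\rangle - \frac{\gamma}{2}\|x_k - x_\star\|^2\\
&\qquad\le -\frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2 + \frac{1}{2}\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2
\end{aligned}$$
for $k = 0, 1, \dots$.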
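Proof This proof follows steps similar to that of [6, Lemma 5.4].
From the definition of $z_{k+1}$, we have
$$\begin{aligned}
0 &= \left\langle\frac{\partial}{\partial z}\left\{\frac{1}{2}\|z - z_k\|^2 + \frac{\gamma}{2}\|z - x_k\|^2 + \frac{\gamma}{\mu}\langle\nabla f(x_k), z\rangle\right\}\bigg|_{z = z_{k+1}}, z_{k+1} - x_\star\right\rangle\\
&= \langle Q(z_{k+1} - z_k), z_{k+1} - x_\star\rangle + \frac{\gamma}{\mu}\langle\nabla f(x_k), z_{k+1} - x_\star\rangle + \gamma\langle Q(z_{k+1} - x_k), z_{k+1} - x_\star\rangle.
\end{aligned}$$
By the three-point equation,
$$\begin{aligned}
&\frac{\gamma}{\mu}\langle\nabla f(x_k), z_{k+1} - x_\star\rangle + \gamma\left(\frac{1}{2}\|x_k - z_{k+1}\|^2 - \frac{1}{2}\|x_k - x_\star\|^2\right)\\
&\qquad= -\frac{1}{2}\|z_k - z_{k+1}\|^2 + \frac{1}{2}\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2.
\end{aligned}$$
Plugging in the definition of $z_{k+1}$,
$$\begin{aligned}
&\frac{\gamma}{2}\|x_k - z_{k+1}\|^2 + \frac{1}{2}\|z_k - z_{k+1}\|^2\\
&\qquad= \frac{\gamma}{2}\left\|\frac{1}{1+\gamma}(x_k - z_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k)\right\|^2 + \frac{1}{2}\left\|-\frac{\gamma}{1+\gamma}(x_k - z_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k)\right\|^2\\
&\qquad\ge \frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2.
\end{aligned}$$
Combining the results above, we get
$$\begin{aligned}
&\frac{\gamma}{\mu}\langle\nabla f(x_k), z_{k+1} - x_\star\rangle - \frac{\gamma}{2}\|x_k - x_\star\|^2\\
&\qquad\le -\frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2 + \frac{1}{2}\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2.
\end{aligned}$$
⊓⊔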
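Lemma 13 (Coupling lemma in SC-OGM) Assume (A1), (A2) and (A3). Then
$$\begin{aligned}
&(1+\gamma)\left(f(x_k) - f_\star - \frac{1}{2L}\|\nabla f(x_k)\|_*^2 + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2\right)\\
&\qquad\le f(x_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 + \frac{\mu}{2}\|z_k - x_\star\|^2
\end{aligned}$$
holds for $k = 1, 2, \dots$, with the indices of the $z$-iterates and the $f_\star$ terms written as they arise in the proof below.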
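Proof We have
\begin{align*}
&\gamma\left(f(x_k) - f(x_\star)\right)\\
&\le \gamma\langle\nabla f(x_k), x_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \gamma\langle\nabla f(x_k), x_k - z_k\rangle + \gamma\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \frac{1-\tau}{\tau}\gamma\langle\nabla f(x_k), y_k - x_k\rangle + \gamma\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \frac{1-\tau}{\tau}\gamma\left\langle\nabla f(x_k), x_{k-1} - x_k - \frac{1}{L}Q^{-1}\nabla f(x_{k-1})\right\rangle + \gamma\langle\nabla f(x_k), z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&\le \left(\frac{1-\tau}{\tau}\gamma - 1\right)\left\langle\nabla f(x_k), x_{k-1} - x_k - \frac{1}{L}Q^{-1}\nabla f(x_{k-1})\right\rangle\\
&\quad + \left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right)\\
&\quad + \gamma\langle\nabla f(x_k), z_k - z_{k+1}\rangle + \gamma\langle\nabla f(x_k), z_{k+1} - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&\le \left(\frac{1-\tau}{\tau}\gamma - 1\right)\langle\nabla f(x_k), y_k - x_k\rangle + \left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right)\\
&\quad + \gamma\langle\nabla f(x_k), z_k - z_{k+1}\rangle - \frac{\gamma^2}{2(1+\gamma)\mu}\|\nabla f(x_k)\|_*^2 + \frac{\mu}{2}\|z_k - x_\star\|^2 - \frac{(1+\gamma)\mu}{2}\|z_{k+1} - x_\star\|^2,
\end{align*}
where the last inequality is an application of Lemma 12. Note that
$$\begin{aligned}
z_k - z_{k+1} &= z_k - \frac{1}{1+\gamma}\left(z_k + \gamma x_k - \frac{\gamma}{\mu}Q^{-1}\nabla f(x_k)\right)\\
&= \frac{\gamma}{1+\gamma}(z_k - x_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k)\\
&= \frac{\gamma}{1+\gamma}\frac{1-\tau}{\tau}(x_k - y_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k).
\end{aligned}$$
To eliminate the $\langle\nabla f(x_k), \cdot\rangle$ term, we collect the coefficients of $\langle\nabla f(x_k), y_k - x_k\rangle$, including the factor $\gamma$ multiplying $\langle\nabla f(x_k), z_k - z_{k+1}\rangle$, and choose $\tau$ to satisfy
$$\frac{1-\tau}{\tau}\gamma - 1 = \frac{\gamma^2}{1+\gamma}\frac{1-\tau}{\tau}. \tag{9}$$
Plugging this in, the inequality above is
$$\begin{aligned}
\gamma\left(f(x_k) - f(x_\star)\right) &\le \left(f(x_{k-1}) - f(x_k) - \frac{1}{2L}\|\nabla f(x_{k-1})\|_*^2 - \frac{1}{2L}\|\nabla f(x_k)\|_*^2\right)\\
&\quad + \frac{\gamma^2}{2(1+\gamma)\mu}\|\nabla f(x_k)\|_*^2 + \frac{\mu}{2}\|z_k - x_\star\|^2 - \frac{(1+\gamma)\mu}{2}\|z_{k+1} - x_\star\|^2.
\end{aligned}$$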
To obtain a telescoping form such as
\[
M_k\left( f(x_k) - B_k\|\nabla f(x_k)\|_*^2 + C_k\|z_{k+1} - x_\star\|^2 \right)
\le N_{k-1}\left( f(x_{k-1}) - B_{k-1}\|\nabla f(x_{k-1})\|_*^2 + C_{k-1}\|z_k - x_\star\|^2 \right),
\]
we choose $B_k = \frac{1}{2L}$ and $C_k = \frac{\mu}{2}$, which leads to the choice of $\gamma$ satisfying
\[
\frac{2+\gamma}{2L} = \frac{\gamma^2}{2(1+\gamma)\mu}. \tag{10}
\]
We get the desired result by plugging (9) and (10) into the above inequality. ⊓⊔
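As a small numerical sketch of how the two parameter conditions can be solved in closed form, the snippet below computes $\gamma$ from (10) and $\tau$ from (9). It reads (9) as $\gamma\frac{1-\tau}{\tau} - 1 = \gamma\cdot\frac{\gamma}{1+\gamma}\frac{1-\tau}{\tau}$, which is the condition that cancels the $\langle\nabla f(x_k), y_k - x_k\rangle$ terms in the proof. The function names and the sample values $L = 100$, $\mu = 1$ are illustrative assumptions, not part of the paper.

```python
import math


def sc_ogm_parameters(L, mu):
    """Solve conditions (9) and (10) for (gamma, tau), assuming L > mu > 0.

    Hypothetical helper for illustration only.
    """
    kappa = L / mu
    # (10): (2 + g) / (2L) = g^2 / (2 (1 + g) mu)
    #   <=> (kappa - 1) g^2 - 3 g - 2 = 0; take the positive root.
    gamma = (3.0 + math.sqrt(8.0 * kappa + 1.0)) / (2.0 * (kappa - 1.0))
    # (9): with r = (1 - tau) / tau, gamma * r - 1 = gamma^2 / (1 + gamma) * r
    #   <=> r = (1 + gamma) / gamma <=> tau = gamma / (1 + 2 gamma).
    tau = gamma / (1.0 + 2.0 * gamma)
    return gamma, tau


def residuals(L, mu, gamma, tau):
    """Return (left minus right) for conditions (9) and (10)."""
    r = (1.0 - tau) / tau
    res9 = (gamma * r - 1.0) - gamma * (gamma / (1.0 + gamma)) * r
    res10 = (2.0 + gamma) / (2.0 * L) - gamma**2 / (2.0 * (1.0 + gamma) * mu)
    return res9, res10


gamma, tau = sc_ogm_parameters(L=100.0, mu=1.0)
res9, res10 = residuals(100.0, 1.0, gamma, tau)
print(gamma, tau, res9, res10)
```

Both residuals should vanish up to floating-point rounding, which gives a quick check that a chosen pair $(\gamma, \tau)$ is consistent with the two conditions.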
E Asymptotic characterization of $\theta_k$
Theorem 7 Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ for $k = 0, 1, \dots$. Then,
\[
\theta_k = \frac{k + \zeta + 1}{2} + \frac{\log k}{4} + o(1).
\]
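The statement can be explored numerically. The following sketch iterates the recursion using the positive root $\theta_{k+1} = \frac{1 + \sqrt{1 + 4\theta_k^2}}{2}$ and prints the remainder $e_k = \theta_k - \frac{k+2}{2} - \frac{\log k}{4}$ at a few indices. It is only an illustration under floating-point arithmetic, not a substitute for the proof, and the function names are hypothetical.

```python
import math


def theta_sequence(n):
    """Return [theta_0, ..., theta_n] with theta_0 = 1 and
    theta_{k+1}^2 - theta_{k+1} - theta_k^2 = 0 (positive root)."""
    thetas = [1.0]
    for _ in range(n):
        t = thetas[-1]
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0)
    return thetas


def remainder(thetas, k):
    """e_k = theta_k - (k + 2)/2 - log(k)/4, defined for k >= 1."""
    return thetas[k] - (k + 2) / 2.0 - math.log(k) / 4.0


thetas = theta_sequence(100_000)
sample_indices = (10, 100, 1_000, 10_000, 100_000)
samples = [remainder(thetas, k) for k in sample_indices]
for k, e in zip(sample_indices, samples):
    print(k, thetas[k], e)
```

If the theorem holds, the printed remainders should decrease and level off, with the limit corresponding to $(\zeta - 1)/2$.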
Proof Let $\theta_k = \frac{k+2}{2} + c_k \log k$. The proof consists of the following 3 steps:
1. If $c_k < \frac{1}{4}$, then $c_{k+1} < \frac{1}{4}$.
2. $c_k \to \frac{1}{4}$ as $k \to \infty$.
3. If $\theta_k = \frac{k+2}{2} + \frac{\log k}{4} + e_k$, then $e_k$ is convergent.
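First step. If $c_k < \frac{1}{4}$, then $c_{k+1} < \frac{1}{4}$.
For convenience, let $c_0 = 0$ with $c_0 \log 0 = 0$. Plugging this into $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$, we have
\[
\left( \frac{k+2}{2} + c_{k+1}\log(k+1) \right)^2 = \left( \frac{k+2}{2} + c_k \log k \right)^2 + \frac{1}{4},
\]
so
\[
\left( c_{k+1}\log(k+1) - c_k \log k \right)\left( k + 2 + c_{k+1}\log(k+1) + c_k \log k \right) = \frac{1}{4}.
\]
Assume $c_{k+1} \ge 1/4$. Then
\begin{align*}
\frac{1}{4} &= \left( c_{k+1}\log(k+1) - c_k \log k \right)\left( k + 2 + c_{k+1}\log(k+1) + c_k \log k \right) \\
&\ge \frac{1}{4}\log\left(1 + \frac{1}{k}\right)(k+2) \\
&> \frac{1}{4},
\end{align*}
where the last step uses $\log(1 + 1/k) > 1/(k+1)$. This is a contradiction, which proves the first claim.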
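Second step. $c_k \to \frac{1}{4}$ as $k \to \infty$.
Put $d_k = \frac{1}{4} - c_k$; then $0 < d_k \le \frac{1}{4}$, and
\begin{align*}
\frac{1}{4} &= \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) - d_{k+1}\log(k+1) + d_k \log k \right)\left( k + 2 + \frac{1}{4}\log\bigl(k(k+1)\bigr) - d_{k+1}\log(k+1) - d_k \log k \right) \\
&\le \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) - d_{k+1}\log(k+1) + d_k \log k \right)\left( k + 2 + \frac{1}{2}\log(k+1) \right).
\end{align*}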
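Therefore
\[
d_{k+1}\log(k+1) - d_k \log k \le \frac{1}{4}\log\left(1 + \frac{1}{k}\right) - \frac{1}{4}\cdot\frac{1}{k + 2 + \frac{1}{2}\log(k+1)} .
\]
By Taylor expansion,
\[
d_{k+1}\log(k+1) - d_k \log k \le \frac{1}{4}\left( \frac{3 + 2\log k}{2k^2} + O\left(\frac{1}{k^2}\right) \right).
\]
Since the right-hand side is summable, summing the above inequality over the indices $1$ through $k$ gives
\[
d_{k+1}\log(k+1) \le C
\]
for some constant $C$, so $d_{k+1} \le \frac{C}{\log(k+1)}$. In conclusion, $d_k \to 0$ as $k \to \infty$.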
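Third step. If $\theta_k = \frac{k+2}{2} + \frac{\log k}{4} + e_k$, then $e_k$ converges.
From the previous claim, for all sufficiently large $k$ we have $|e_k| < \frac{1}{6}\log k$. Plugging this form of $\theta_k$ into the recursion gives
\[
\left( \frac{k+2}{2} + \frac{1}{4}\log(k+1) + e_{k+1} \right)^2 = \left( \frac{k+2}{2} + \frac{1}{4}\log k + e_k \right)^2 + \frac{1}{4}.
\]
Then,
\begin{align*}
\frac{1}{4} &= \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) + e_{k+1} - e_k \right)\left( k + 2 + \frac{1}{4}\log\bigl(k(k+1)\bigr) + e_{k+1} + e_k \right) \\
&\le \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) + e_{k+1} - e_k \right)\left( k + 2 + \frac{5}{6}\log(k+1) \right).
\end{align*}
So,
\[
e_{k+1} - e_k \ge \frac{1}{4\left( k + 2 + \frac{5}{6}\log(k+1) \right)} - \frac{1}{4}\log\left(1 + \frac{1}{k}\right)
= -\frac{\frac{5}{6}\log k + \frac{3}{2}}{4k^2} + O\left(\frac{1}{k^2}\right).
\]
Summing this over the indices $1, \dots, k$, we get that $e_{k+1} > D$ for some constant $D$. Moreover,
\begin{align*}
\frac{1}{4} &= \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) + e_{k+1} - e_k \right)\left( k + 2 + \frac{1}{4}\log\bigl(k(k+1)\bigr) + e_{k+1} + e_k \right) \\
&\ge \left( \frac{1}{4}\log\left(1 + \frac{1}{k}\right) + e_{k+1} - e_k \right)(k+2) \\
&> \frac{1}{4} + (k+2)(e_{k+1} - e_k),
\end{align*}
which indicates that $e_{k+1} < e_k$. Since $\{e_k\}_{k=0}^\infty$ is a decreasing sequence with a lower bound, it converges. ⊓⊔
Proof of equality in Section 2.1 We have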
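\begin{align*}
\frac{L\|x_0 - x_\star\|^2}{2\theta_{k-1}^2}
&= \frac{L\|x_0 - x_\star\|^2}{2\left( \frac{k+\zeta}{2} + \frac{\log(k-1)}{4} + o(1) \right)^2} \\
&= \frac{2L\|x_0 - x_\star\|^2}{(k+\zeta)^2\left( 1 + \frac{\log(k-1)}{2(k+\zeta)} + o(1/k) \right)^2} \\
&= \frac{2L\|x_0 - x_\star\|^2}{(k+\zeta)^2}\left( 1 - 2\,\frac{\log(k-1)}{2(k+\zeta)} + o(1/k) \right) \\
&= \frac{2L\|x_0 - x_\star\|^2}{(k+\zeta)^2} - \frac{2L\|x_0 - x_\star\|^2 \log k}{(k+\zeta)^3} + o\left(\frac{1}{k^3}\right),
\end{align*}
which verifies the equality in Section 2.1.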
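The expansion can also be sanity-checked numerically. The sketch below compares $\frac{L\|x_0 - x_\star\|^2}{2\theta_{k-1}^2}$ with the two-term approximation, taking $L\|x_0 - x_\star\|^2 = 1$ for simplicity. Since the value of $\zeta$ is not given in this part of the appendix, it is estimated from a late term of the sequence via Theorem 7; that estimate, the cutoff index `K`, and all function names are assumptions made for illustration.

```python
import math


def theta_sequence(n):
    """theta_0 = 1 and theta_{k+1} = (1 + sqrt(1 + 4 theta_k^2)) / 2."""
    thetas = [1.0]
    for _ in range(n):
        t = thetas[-1]
        thetas.append((1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0)
    return thetas


K = 200_000  # assumed cutoff used only to estimate zeta
thetas = theta_sequence(K)
# Assumption: invert theta_K = (K + zeta + 1)/2 + log(K)/4 + o(1), dropping o(1).
zeta_est = 2.0 * (thetas[K] - math.log(K) / 4.0) - K - 1.0


def exact_bound(k):
    """L ||x_0 - x_*||^2 / (2 theta_{k-1}^2) with L ||x_0 - x_*||^2 = 1."""
    return 1.0 / (2.0 * thetas[k - 1] ** 2)


def two_term(k):
    """Two leading terms of the expansion, using the estimated zeta."""
    n = k + zeta_est
    return 2.0 / n**2 - 2.0 * math.log(k) / n**3


# An o(1/k^3) error means k^3 * |error| should shrink as k grows.
scaled_errors = [k**3 * abs(exact_bound(k) - two_term(k)) for k in (100, 1_000, 10_000)]
print(zeta_est, scaled_errors)
```

A shrinking sequence of scaled errors is consistent with the $o(1/k^3)$ remainder, although a finite computation cannot confirm the limit.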