ELE 522: Large-Scale Optimization for Data Science
Accelerated gradient methods
Yuxin Chen
Princeton University, Fall 2019
Outline
• Heavy-ball methods
• Nesterov’s accelerated gradient methods
• Accelerated proximal gradient methods (FISTA)
• Convergence analysis
• Lower bounds
(Proximal) gradient methods
Iteration complexities of (proximal) gradient methods:

• strongly convex and smooth problems: O(κ log(1/ε))
• convex and smooth problems: O(1/ε)
Can one still hope to further accelerate convergence?
Accelerated GD 7-3
Issues and possible solutions
Issues:
• GD focuses on improving the cost per iteration, which might sometimes be too “short-sighted”
• GD might sometimes zigzag or experience abrupt changes

Solutions:
• exploit information from the history (i.e. past iterates)
• add buffers (like momentum) to yield a smoother trajectory
Accelerated GD 7-4
Heavy-ball methods— Polyak ’64
Heavy-ball method
B. Polyak
minimizex∈Rn f(x)
$$x_{t+1} = x_t - \eta_t \nabla f(x_t) + \underbrace{\theta_t (x_t - x_{t-1})}_{\text{momentum term}}$$

• add inertia to the “ball” (i.e. include a momentum term) to mitigate zigzagging
Accelerated GD 7-6
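To make the update rule concrete, here is a minimal Python sketch of the heavy-ball iteration (the function name and the constant stepsize/momentum parameters are illustrative choices, not part of the original method statement):

```python
import numpy as np

def heavy_ball(grad_f, x0, eta, theta, num_iters=100):
    """Heavy-ball iteration: x_{t+1} = x_t - eta*grad_f(x_t) + theta*(x_t - x_{t-1}).

    Constant stepsize/momentum are assumed for simplicity;
    grad_f is a user-supplied gradient oracle.
    """
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        # momentum term: theta * (x - x_prev)
        x_next = x - eta * grad_f(x) + theta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```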
Heavy-ball method
(figure: iterate trajectories of gradient descent (left) vs. the heavy-ball method (right), illustrating how momentum damps the zigzagging)
Accelerated GD 7-7
State-space models
$$\text{minimize}_{x} \quad \frac{1}{2} (x - x^*)^\top Q (x - x^*)$$

where Q ≻ 0 has condition number κ
One can understand heavy-ball methods through dynamical systems
Accelerated GD 7-8
State-space models
Consider the following dynamical system

$$\begin{bmatrix} x_{t+1} \\ x_t \end{bmatrix}
= \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix}
\begin{bmatrix} x_t \\ x_{t-1} \end{bmatrix}
- \begin{bmatrix} \eta_t \nabla f(x_t) \\ 0 \end{bmatrix}$$

or equivalently,

$$\underbrace{\begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix}}_{\text{state}}
= \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix}
- \begin{bmatrix} \eta_t \nabla f(x_t) \\ 0 \end{bmatrix}
= \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t Q & -\theta_t I \\ I & 0 \end{bmatrix}}_{\text{system matrix}}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix}$$
Accelerated GD 7-9
System matrix
$$\begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix}
= \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t Q & -\theta_t I \\ I & 0 \end{bmatrix}}_{=:\,H_t \ \text{(system matrix)}}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix} \tag{7.1}$$

implication: convergence of heavy-ball methods depends on the spectrum of the system matrix H_t

key idea: find appropriate stepsizes η_t and momentum coefficients θ_t to control the spectrum of H_t
Accelerated GD 7-10
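Since convergence hinges on the spectrum of H_t, it is easy to examine numerically. Below is a small sketch that builds the system matrix of (7.1) for a given Hessian Q and computes its spectral radius (the function name and inputs are illustrative):

```python
import numpy as np

def heavy_ball_spectral_radius(Q, eta, theta):
    """Spectral radius of the heavy-ball system matrix H_t in (7.1)
    for a quadratic with Hessian Q and constant eta, theta."""
    n = Q.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    H = np.block([[(1 + theta) * I - eta * Q, -theta * I],
                  [I,                         Z]])
    return np.abs(np.linalg.eigvals(H)).max()
```

With the stepsize and momentum choices of the next slide, this quantity matches (√κ − 1)/(√κ + 1).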
Convergence for quadratic problems
Theorem 7.1 (Convergence of heavy-ball methods for quadratic functions)

Suppose f is an L-smooth and µ-strongly convex quadratic function. Set η_t ≡ 4/(√L + √µ)², θ_t ≡ max{|1 − √(η_t L)|, |1 − √(η_t µ)|}², and κ = L/µ. Then

$$\left\| \begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix} \right\|_2
\lesssim \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^{t}
\left\| \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix} \right\|_2$$

• iteration complexity: O(√κ log(1/ε))
• significant improvement over GD: O(√κ log(1/ε)) vs. O(κ log(1/ε))
• relies on knowledge of both L and µ
Accelerated GD 7-11
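A quick numerical check of Theorem 7.1; this is a sketch on an arbitrary diagonal quadratic (the test matrix, dimension, and iteration count are illustrative choices):

```python
import numpy as np

L, mu = 10.0, 1.0
Q = np.diag(np.linspace(mu, L, 5))                 # eigenvalues spread over [mu, L]
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2        # stepsize from Theorem 7.1
theta = max(abs(1 - np.sqrt(eta * L)),
            abs(1 - np.sqrt(eta * mu))) ** 2       # momentum from Theorem 7.1
kappa = L / mu

x_star = np.ones(5)
grad = lambda x: Q @ (x - x_star)                  # gradient of the quadratic
x_gd = np.zeros(5)                                 # gradient descent iterate
x_hb = x_hb_prev = np.zeros(5)                     # heavy-ball iterates

for _ in range(200):
    x_gd = x_gd - grad(x_gd) / L
    x_hb, x_hb_prev = x_hb - eta * grad(x_hb) + theta * (x_hb - x_hb_prev), x_hb

print(np.linalg.norm(x_gd - x_star),               # ~ (1 - 1/kappa)^t decay
      np.linalg.norm(x_hb - x_star),               # ~ ((sqrt(k)-1)/(sqrt(k)+1))^t decay
      (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))
```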
Proof of Theorem 7.1

In view of (7.1), it suffices to control the spectrum of H_t (which is time-invariant). Let λ_i be the ith eigenvalue of Q and set Λ := diag(λ_1, …, λ_n); then the spectral radius ρ(·) of H_t obeys

$$\rho(H_t) = \rho\!\left( \begin{bmatrix} (1+\theta_t) I - \eta_t \Lambda & -\theta_t I \\ I & 0 \end{bmatrix} \right)
\le \max_{1 \le i \le n} \rho\!\left( \begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right)$$

To finish the proof, it suffices to show

$$\max_i \ \rho\!\left( \begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \tag{7.2}$$
Accelerated GD 7-12
Proof of Theorem 7.1
To show (7.2), note that the two eigenvalues of
$$\begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix}$$
are the roots of

$$z^2 - (1+\theta_t-\eta_t\lambda_i) z + \theta_t = 0 \tag{7.3}$$

If (1 + θ_t − η_tλ_i)² ≤ 4θ_t, then the roots of this equation share the same magnitude √θ_t (they are either a complex-conjugate pair or a repeated real root). In addition, one can easily check that (1 + θ_t − η_tλ_i)² ≤ 4θ_t is satisfied if

$$\theta_t \in \Big[ \big(1-\sqrt{\eta_t\lambda_i}\big)^2,\ \big(1+\sqrt{\eta_t\lambda_i}\big)^2 \Big], \tag{7.4}$$

which would hold if one picks θ_t = max{(1 − √(η_t L))², (1 − √(η_t µ))²}
Accelerated GD 7-13
Proof of Theorem 7.1
With this choice of θ_t, we have (from (7.3) and the fact that the two eigenvalues have identical magnitudes)

$$\rho(H_t) \le \sqrt{\theta_t}$$

Finally, setting η_t = 4/(√L + √µ)² ensures 1 − √(η_t L) = −(1 − √(η_t µ)), which yields

$$\theta_t = \max\left\{ \Big(1-\frac{2\sqrt{L}}{\sqrt{L}+\sqrt{\mu}}\Big)^2,\ \Big(1-\frac{2\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\Big)^2 \right\}
= \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^2$$

This in turn establishes

$$\rho(H_t) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$$
Accelerated GD 7-14
Nesterov’s accelerated gradient methods
Convex case
minimizex∈Rn f(x)
For a positive definite quadratic function f, including momentum terms improves the iteration complexity from O(κ log(1/ε)) to O(√κ log(1/ε))

Can we obtain improvement for more general convex cases as well?
Accelerated GD 7-16
Nesterov’s idea
Y. Nesterov
— Nesterov ’83
$$x_{t+1} = y_t - \eta_t \nabla f(y_t)$$
$$y_{t+1} = x_{t+1} + \frac{t}{t+3} \big( x_{t+1} - x_t \big)$$

• alternates between gradient updates and proper extrapolation
• each iteration has nearly the same cost as GD
• not a descent method (i.e. we may not have f(x_{t+1}) ≤ f(x_t))
• one of the most beautiful and mysterious results in optimization ...

Accelerated GD 7-17
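A minimal Python sketch of the scheme (grad_f and the stepsize eta = 1/L are supplied by the caller; indexing starts at t = 0, so the first extrapolation coefficient is 0):

```python
import numpy as np

def nesterov_agd(grad_f, x0, eta, num_iters=100):
    """Nesterov '83: gradient step at the extrapolated point y_t,
    then extrapolate with coefficient t/(t+3)."""
    x = y = np.array(x0, dtype=float)
    for t in range(num_iters):
        x_next = y - eta * grad_f(y)                 # gradient update at y_t
        y = x_next + (t / (t + 3.0)) * (x_next - x)  # extrapolation
        x = x_next
    return x
```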
Convergence of Nesterov’s accelerated gradient method

Suppose f is convex and L-smooth. If η_t ≡ η = 1/L, then

$$f(x_t) - f^{\mathrm{opt}} \le \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$

• iteration complexity: O(1/√ε)
• much faster than gradient methods
• we’ll provide a proof for the (more general) proximal version later
Accelerated GD 7-18
Interpretation using differential equations
Nesterov’s momentum coefficient t/(t+3) ≈ 1 − 3/t is particularly mysterious
Accelerated GD 7-19
Interpretation using differential equations

(figure: a mass-spring-damper interpretation of the ODE below, with time-varying damping force −(3/τ)·Ẋ and spring force −∇f(X))
To develop insight into why Nesterov’s method works so well, it’s helpful to look at its continuous limit (η_t → 0), which is given by a second-order ordinary differential equation (ODE)

$$\ddot{X}(\tau) + \underbrace{\frac{3}{\tau}}_{\text{damping coefficient}} \dot{X}(\tau) + \underbrace{\nabla f(X(\tau))}_{\text{potential}} = 0$$

— Su, Boyd, Candes ’14

Accelerated GD 7-20
Heuristic derivation of ODE

To begin with, Nesterov’s update rule is equivalent to

$$\frac{x_{t+1} - x_t}{\sqrt{\eta}} = \frac{t-1}{t+2} \cdot \frac{x_t - x_{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\, \nabla f(y_t) \tag{7.5}$$

Let t = τ/√η. Set X(τ) ≈ x_{τ/√η} = x_t and X(τ + √η) ≈ x_{t+1}. Then the Taylor expansion gives

$$\frac{x_{t+1} - x_t}{\sqrt{\eta}} \approx \dot{X}(\tau) + \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta}, \qquad
\frac{x_t - x_{t-1}}{\sqrt{\eta}} \approx \dot{X}(\tau) - \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta}$$

which combined with (7.5) yields

$$\dot{X}(\tau) + \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \approx \Big(1 - \frac{3\sqrt{\eta}}{\tau}\Big) \Big( \dot{X}(\tau) - \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \Big) - \sqrt{\eta}\, \nabla f\big(X(\tau)\big)$$

$$\Longrightarrow \quad \ddot{X}(\tau) + \frac{3}{\tau} \dot{X}(\tau) + \nabla f\big(X(\tau)\big) \approx 0$$
Accelerated GD 7-21
Convergence rate of ODE
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0 \tag{7.6}$$

Standard ODE theory reveals that

$$f(X(\tau)) - f^{\mathrm{opt}} \le O\Big( \frac{1}{\tau^2} \Big) \tag{7.7}$$

which somehow explains Nesterov’s O(1/t²) convergence
Accelerated GD 7-22
Proof of (7.7)
Define the Lyapunov function (or energy function)

$$\mathcal{E}(\tau) := \tau^2 \big( f(X) - f^{\mathrm{opt}} \big) + 2 \big\| X + \tfrac{\tau}{2} \dot{X} - X^* \big\|_2^2$$

This obeys

$$\dot{\mathcal{E}} = 2\tau \big( f(X) - f^{\mathrm{opt}} \big) + \tau^2 \big\langle \nabla f(X), \dot{X} \big\rangle
+ 4 \Big\langle X + \tfrac{\tau}{2} \dot{X} - X^*,\ \tfrac{3}{2} \dot{X} + \tfrac{\tau}{2} \ddot{X} \Big\rangle$$
$$\overset{(i)}{=} 2\tau \big( f(X) - f^{\mathrm{opt}} \big) - 2\tau \big\langle X - X^*, \nabla f(X) \big\rangle \le 0 \quad \text{(by convexity)}$$

where (i) follows by replacing τẌ + 3Ẋ with −τ∇f(X)

This means 𝓔 is non-increasing in τ, and hence

$$f(X(\tau)) - f^{\mathrm{opt}} \overset{\text{(defn of } \mathcal{E})}{\le} \frac{\mathcal{E}(\tau)}{\tau^2} \le \frac{\mathcal{E}(0)}{\tau^2} = O\Big( \frac{1}{\tau^2} \Big)$$
Accelerated GD 7-23
Magic number 3
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0$$

• 3 is the smallest constant that guarantees O(1/τ²) convergence, and can be replaced by any other α ≥ 3
• in some sense, 3 minimizes the pre-constant in the convergence bound O(1/τ²) (see Su, Boyd, Candes ’14)
Accelerated GD 7-24
Numerical example (taken from UCLA EE236C)

$$\text{minimize}_{x} \quad \log\Big( \sum_{i=1}^m \exp(a_i^\top x + b_i) \Big)$$

• two randomly generated problems with m = 2000, n = 1000
• same fixed stepsize used for the gradient method and FISTA
• figures show (f(x^{(k)}) − f^*)/f^*

(figure: two semilog plots of relative suboptimality vs. iteration count k for “gradient” and “FISTA”; FISTA reaches 10⁻⁶ accuracy in far fewer iterations)
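A sketch reproducing this experiment (random data and a hand-picked stepsize, so the curves will only qualitatively match the figure; 1/L is not known in closed form here, and eta below is a guess):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 1000
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def f(x):                       # log-sum-exp, computed stably
    z = A @ x + b
    zmax = z.max()
    return zmax + np.log(np.exp(z - zmax).sum())

def grad_f(x):                  # gradient = A' * softmax(Ax + b)
    z = A @ x + b
    w = np.exp(z - z.max())
    return A.T @ (w / w.sum())

eta = 1e-4                      # assumed stepsize, used for both methods
x_gd = np.zeros(n)
x = y = np.zeros(n)
for t in range(200):
    x_gd = x_gd - eta * grad_f(x_gd)                  # gradient method
    x_next = y - eta * grad_f(y)                      # accelerated (h = 0, prox = identity)
    y = x_next + (t / (t + 3.0)) * (x_next - x)
    x = x_next
print(f(x_gd), f(x))            # the accelerated iterate is typically much lower
```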
Accelerated GD 7-25
Extension to composite models
$$\text{minimize}_{x} \quad F(x) := f(x) + h(x) \qquad \text{subject to } x \in \mathbb{R}^n$$

• f: convex and smooth
• h: convex (may not be differentiable)

let F^opt := min_x F(x) be the optimal cost
Accelerated GD 7-26
FISTA (Beck & Teboulle ’09)
Fast iterative shrinkage-thresholding algorithm
$$x_{t+1} = \mathrm{prox}_{\eta_t h}\big( y_t - \eta_t \nabla f(y_t) \big)$$
$$y_{t+1} = x_{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} \big( x_{t+1} - x_t \big)$$

where y_0 = x_0, θ_0 = 1, and θ_{t+1} = (1 + √(1 + 4θ_t²))/2

• adopts the momentum coefficients originally proposed by Nesterov ’83
Accelerated GD 7-27
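A compact FISTA sketch (prox_h(v, eta) is assumed to return prox_{eta·h}(v); eta = 1/L is the standard stepsize):

```python
import numpy as np

def fista(grad_f, prox_h, x0, eta, num_iters=100):
    """FISTA (Beck & Teboulle '09): prox step at y_t, then extrapolate
    with coefficient (theta_t - 1)/theta_{t+1}."""
    x = y = np.array(x0, dtype=float)
    theta = 1.0                                              # theta_0 = 1
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
        x, theta = x_next, theta_next
    return x

# e.g. for h(x) = lam * ||x||_1 (lam is an illustrative parameter),
# the prox is soft-thresholding:
def soft_threshold(v, eta, lam=0.1):
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
```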
Momentum coefficient
(figure: θ_t grows roughly linearly in t; the coefficient (θ_t − 1)/θ_{t+1} increases toward 1)

$$\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2} \quad \text{with } \theta_0 = 1$$

coefficient: (θ_t − 1)/θ_{t+1} = 1 − 3/t + o(1/t) (homework)

• asymptotically equivalent to t/(t+3)

Fact 7.2
For all t ≥ 1, one has θ_t ≥ (t+2)/2 (homework)
Accelerated GD 7-28
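Both claims are easy to check numerically; a small sketch:

```python
thetas = [1.0]                                    # theta_0 = 1
for _ in range(41):
    thetas.append((1.0 + (1.0 + 4.0 * thetas[-1] ** 2) ** 0.5) / 2.0)

for t in (10, 20, 40):                            # (theta_t - 1)/theta_{t+1} vs. 1 - 3/t
    print(t, (thetas[t] - 1.0) / thetas[t + 1], 1.0 - 3.0 / t)

# Fact 7.2: theta_t >= (t + 2)/2
assert all(th >= (t + 2) / 2 for t, th in enumerate(thetas))
```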
Convergence analysis
Convergence for convex problems
Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems)

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

$$F(x_t) - F^{\mathrm{opt}} \le \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$

• improved iteration complexity (i.e. O(1/√ε)) compared with the proximal gradient method (i.e. O(1/ε))
• fast if the prox can be efficiently implemented
Accelerated GD 7-30
Recap: the fundamental inequality for the proximal method

Recall the following fundamental inequality shown in the last lecture:

Lemma 7.4
Let y⁺ = prox_{(1/L)h}( y − (1/L)∇f(y) ). Then

$$F(y^+) - F(x) \le \frac{L}{2} \|x - y\|_2^2 - \frac{L}{2} \|x - y^+\|_2^2$$
Accelerated GD 7-31
Proof of Theorem 7.3

1. build a discrete-time version of the “Lyapunov function”
2. magic happens!
   ◦ the “Lyapunov function” is non-increasing when Nesterov’s momentum coefficients are adopted
Accelerated GD 7-32
Proof of Theorem 7.3

Key lemma: monotonicity of a certain “Lyapunov function”

Lemma 7.5
Let u_t = θ_{t−1} x_t − (x^* + (θ_{t−1} − 1) x_{t−1}), or equivalently u_t = θ_{t−1}(x_t − x^*) − (θ_{t−1} − 1)(x_{t−1} − x^*). Then

$$\|u_{t+1}\|_2^2 + \frac{2}{L} \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big)
\le \|u_t\|_2^2 + \frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)$$

• quite similar to the Lyapunov function 2‖X + (τ/2)Ẋ − X^*‖₂² + τ²(f(X) − f^opt) discussed before (think about θ_t ≈ t/2)
Accelerated GD 7-33
Proof of Theorem 7.3

With Lemma 7.5 in place, one has

$$\frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)
\le \|u_1\|_2^2 + \frac{2}{L} \theta_0^2 \big( F(x_1) - F^{\mathrm{opt}} \big)
= \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big)$$

To bound the RHS of this inequality, we use Lemma 7.4 and y_0 = x_0 to get

$$\frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big) \le \|y_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2
= \|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2$$
$$\Longleftrightarrow \quad \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big) \le \|x_0 - x^*\|_2^2$$

As a result,

$$\frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)
\le \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big)
\le \|x_0 - x^*\|_2^2$$
$$\Longrightarrow \quad F(x_t) - F^{\mathrm{opt}} \le \frac{L \|x_0 - x^*\|_2^2}{2 \theta_{t-1}^2}
\overset{\text{(Fact 7.2)}}{\le} \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$
Accelerated GD 7-34
Proof of Lemma 7.5
Take x = (1/θ_t) x^* + (1 − 1/θ_t) x_t and y = y_t in Lemma 7.4 to get

$$F(x_{t+1}) - F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big) \tag{7.8}$$
$$\le \frac{L}{2} \big\| \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t - y_t \big\|_2^2
- \frac{L}{2} \big\| \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t - x_{t+1} \big\|_2^2$$
$$= \frac{L}{2\theta_t^2} \big\| x^* + (\theta_t - 1) x_t - \theta_t y_t \big\|_2^2
- \frac{L}{2\theta_t^2} \big\| \underbrace{x^* + (\theta_t - 1) x_t - \theta_t x_{t+1}}_{=-u_{t+1}} \big\|_2^2$$
$$\overset{(i)}{=} \frac{L}{2\theta_t^2} \big( \|u_t\|_2^2 - \|u_{t+1}\|_2^2 \big), \tag{7.9}$$

where (i) follows from the definition of u_t and y_t = x_t + ((θ_{t−1} − 1)/θ_t)(x_t − x_{t−1})
Accelerated GD 7-35
Proof of Lemma 7.5 (cont.)
We will also lower bound (7.8). By convexity of F,

$$F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big)
\le \theta_t^{-1} F(x^*) + (1-\theta_t^{-1}) F(x_t)
= \theta_t^{-1} F^{\mathrm{opt}} + (1-\theta_t^{-1}) F(x_t)$$
$$\Longleftrightarrow \quad F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big) - F(x_{t+1})
\le (1-\theta_t^{-1}) \big( F(x_t) - F^{\mathrm{opt}} \big) - \big( F(x_{t+1}) - F^{\mathrm{opt}} \big)$$

Combining this with (7.9) and θ_t² − θ_t = θ_{t−1}² yields

$$\frac{L}{2} \big( \|u_t\|_2^2 - \|u_{t+1}\|_2^2 \big)
\ge \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big) - \big( \theta_t^2 - \theta_t \big) \big( F(x_t) - F^{\mathrm{opt}} \big)
= \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big) - \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big),$$

thus finishing the proof
Accelerated GD 7-36
Convergence for strongly convex problems
$$x_{t+1} = \mathrm{prox}_{\eta_t h}\big( y_t - \eta_t \nabla f(y_t) \big)$$
$$y_{t+1} = x_{t+1} + \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \big( x_{t+1} - x_t \big)$$

Theorem 7.6 (Convergence of accelerated proximal gradient methods for the strongly convex case)

Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

$$F(x_t) - F^{\mathrm{opt}} \le \Big( 1 - \frac{1}{\sqrt{\kappa}} \Big)^{t} \Big( F(x_0) - F^{\mathrm{opt}} + \frac{\mu \|x_0 - x^*\|_2^2}{2} \Big)$$
Accelerated GD 7-37
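A sketch of this variant (it differs from FISTA only in the constant momentum coefficient; both L and µ, hence κ, must be known):

```python
import numpy as np

def accel_prox_strongly_convex(grad_f, prox_h, x0, L, mu, num_iters=100):
    """Accelerated proximal gradient with constant momentum
    (sqrt(kappa)-1)/(sqrt(kappa)+1), where kappa = L/mu."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    eta = 1.0 / L
    x = y = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x
```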
A practical issue
Fast convergence requires knowledge of κ = L/µ
• in practice, estimating µ is typically very challenging
A common observation: ripples / bumps in the traces of cost values
Accelerated GD 7-38
Rippling behavior

Numerical example: take y_{t+1} = x_{t+1} + ((1 − √q)/(1 + √q)) (x_{t+1} − x_t), where q is an estimate of q^* := 1/κ

(Figure 1 of O’Donoghue & Candes ’12: convergence of f(x_k) − f^* for q ∈ {0, q^*/10, q^*/3, q^*, 3q^*, 10q^*, 1}; mis-estimated q slows convergence or produces ripples)

Interpretation: the optimal momentum depends on the condition number; a function with a higher condition number requires higher momentum. Underestimating the required momentum leads to slower convergence. More often, however, momentum is overestimated (e.g. q = 0, for which the momentum coefficient tends to 1): the trajectory then overshoots the minimum and oscillates around it, producing ripples in the function values whose period is proportional to the square root of the (local) condition number. Moreover, the condition number is a global parameter; even under the optimal choice q = q^*, the iterates may enter regions that are locally better conditioned (say, near the optimum), where this q again amounts to overestimated momentum and rippling re-emerges.
period of ripples is often proportional to √(L/µ)

— O’Donoghue, Candes ’12

• when q > q^*: we underestimate momentum → slower convergence
• when q < q^*: we overestimate momentum ((1 − √q)/(1 + √q) is large) → overshooting / rippling behavior

Accelerated GD 7-39
Adaptive restart (O’Donoghue, Candes ’12)
When a certain criterion is met, restart running FISTA with

x_0 ← x_t,  y_0 ← x_t,  θ_0 = 1

• take the current iterate as a new starting point
• erase all memory of previous iterates and reset the momentum back to zero
Accelerated GD 7-40
Numerical comparisons of adaptive restart schemes
(Figure 3 of O’Donoghue & Candes ’12: f(x_k) − f^* vs. k for gradient descent, FISTA with no restart, fixed restarts every 100/400/1000 iterations, the optimal fixed restart interval (700), the scheme with q = µ/L, and the adaptive function/gradient restart schemes; the adaptive schemes perform on par with the best fixed interval)
• function scheme: restart when f(x_t) > f(x_{t−1})
• gradient scheme: restart when ⟨∇f(y_{t−1}), x_t − x_{t−1}⟩ > 0, i.e. when the momentum leads us in a bad direction (see the sketch below)

Accelerated GD 7-41
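A sketch combining FISTA with both restart tests (the gradient test below uses ∇f(y_{t−1}) as on this slide; prox_h(v, eta) is assumed to return prox_{eta·h}(v)):

```python
import numpy as np

def fista_adaptive_restart(f, grad_f, prox_h, x0, eta,
                           num_iters=500, scheme="gradient"):
    """FISTA with adaptive restart (O'Donoghue & Candes '12).
    scheme='function': restart when f(x_t) > f(x_{t-1});
    scheme='gradient': restart when <grad_f(y_{t-1}), x_t - x_{t-1}> > 0."""
    x = y = np.array(x0, dtype=float)
    theta = 1.0
    for _ in range(num_iters):
        g = grad_f(y)                                  # gradient at y_{t-1}
        x_next = prox_h(y - eta * g, eta)
        restart = (f(x_next) > f(x)) if scheme == "function" \
                  else (g @ (x_next - x) > 0)
        if restart:
            y, theta = x_next, 1.0                     # reset momentum to zero
        else:
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
            theta = theta_next
        x = x_next
    return x
```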
Illustration
(Figure 4 of O’Donoghue & Candes ’12: sequence trajectories in the (x₁, x₂) plane under q = q^*, q = 0, and adaptive restart)
• with overestimated momentum (e.g. q = 0), one sees a spiralling trajectory
• adaptive restart helps mitigate this issue
Accelerated GD 7-42
Lower bounds
Optimality of Nesterov’s method
Interestingly, no first-order methods can improve upon Nesterov’sresults in general
More precisely, there exists a convex and L-smooth function f s.t.

$$f(x_t) - f^{\mathrm{opt}} \ge \frac{3L \|x_0 - x^*\|_2^2}{32 (t+1)^2}$$

as long as

$$x_k \in x_0 + \underbrace{\mathrm{span}\{\nabla f(x_0), \cdots, \nabla f(x_{k-1})\}}_{\text{definition of first-order methods}} \quad \text{for all } 1 \le k \le t$$
— Nemirovski, Yudin ’83
Accelerated GD 7-44
Example
$$\text{minimize}_{x \in \mathbb{R}^{2n+1}} \quad f(x) = \frac{L}{4} \Big( \frac{1}{2} x^\top A x - e_1^\top x \Big)$$

where

$$A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{(2n+1)\times(2n+1)}$$

• f is convex and L-smooth
• the optimizer x^* is given by x_i^* = 1 − i/(2n+2) (1 ≤ i ≤ 2n+1), obeying

$$f^{\mathrm{opt}} = \frac{L}{8} \Big( \frac{1}{2n+2} - 1 \Big) \quad \text{and} \quad \|x^*\|_2^2 \le \frac{2n+2}{3}$$

• ∇f(x) = (L/4) A x − (L/4) e_1
• span{∇f(x_0), · · · , ∇f(x_{k−1})} =: K_k satisfies K_k = span{e_1, · · · , e_k} if x_0 = 0
  ◦ every iteration of a first-order method expands the search space by at most one dimension
Accelerated GD 7-45
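For concreteness, a sketch constructing this worst-case instance and illustrating the span property (the dimension and L below are arbitrary):

```python
import numpy as np

def worst_case_instance(n, L=1.0):
    """Tridiagonal worst-case quadratic on R^(2n+1):
    f(x) = (L/4) * (0.5 * x'Ax - e_1'x)."""
    d = 2 * n + 1
    A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
    e1 = np.eye(d)[:, 0]
    grad_f = lambda x: (L / 4.0) * (A @ x - e1)
    return A, grad_f

A, grad_f = worst_case_instance(5)
g0 = grad_f(np.zeros(11))
print(np.nonzero(g0)[0])   # -> [0]: the first gradient is supported on e_1 only,
                           # and each multiplication by A widens the support by one
                           # coordinate -- the mechanism behind the lower bound
```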
Example (cont.)
If we start with x_0 = 0, then

$$f(x_n) \ge \inf_{x \in K_n} f(x) = \frac{L}{8} \Big( \frac{1}{n+1} - 1 \Big)$$

$$\Longrightarrow \quad \frac{f(x_n) - f^{\mathrm{opt}}}{\|x_0 - x^*\|_2^2}
\ge \frac{ \frac{L}{8} \big( \frac{1}{n+1} - \frac{1}{2n+2} \big) }{ \frac{1}{3} (2n+2) }
= \frac{3L}{32 (n+1)^2}$$
Accelerated GD 7-46
Summary: accelerated proximal gradient

                                    stepsize rule   convergence rate      iteration complexity
  convex & smooth problems          η_t = 1/L       O(1/t²)               O(1/√ε)
  strongly convex &
    smooth problems                 η_t = 1/L       O((1 − 1/√κ)^t)       O(√κ log(1/ε))
Accelerated GD 7-47
Reference
[1] “Some methods of speeding up the convergence of iteration methods,” B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.
[2] “A method of solving a convex programming problem with convergence rate O(1/k²),” Y. Nesterov, Soviet Mathematics Doklady, 1983.
[3] “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.
[4] “First-order methods in optimization,” A. Beck, Vol. 25, SIAM, 2017.
[5] “Mathematical optimization, MATH301 lecture notes,” E. Candes, Stanford.
[6] “Gradient methods for minimizing composite functions,” Y. Nesterov, Technical Report, 2007.
Accelerated GD 7-48
Reference
[7] “Large-scale numerical optimization, MS&E318 lecture notes,” M. Saunders, Stanford.
[8] “Analysis and design of optimization algorithms via integral quadratic constraints,” L. Lessard, B. Recht, A. Packard, SIAM Journal on Optimization, 2016.
[9] “Proximal algorithms,” N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[10] “Convex optimization: algorithms and complexity,” S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[11] “Optimization methods for large-scale systems, EE236C lecture notes,” L. Vandenberghe, UCLA.
[12] “Problem complexity and method efficiency in optimization,” A. Nemirovski, D. Yudin, Wiley, 1983.
Accelerated GD 7-49
Reference
[13] “Introductory lectures on convex optimization: a basic course,” Y. Nesterov, 2004.
[14] “A differential equation for modeling Nesterov’s accelerated gradient method,” W. Su, S. Boyd, E. Candes, NIPS, 2014.
[15] “On accelerated proximal gradient methods for convex-concave optimization,” P. Tseng, 2008.
[16] “Adaptive restart for accelerated gradient schemes,” B. O’Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.
Accelerated GD 7-50