ELE 522: Large-Scale Optimization for Data Science
Accelerated gradient methods
Yuxin Chen
Princeton University, Fall 2019
Outline
• Heavy-ball methods
• Nesterov’s accelerated gradient methods
• Accelerated proximal gradient methods (FISTA)
• Convergence analysis
• Lower bounds
(Proximal) gradient methods
Iteration complexities of (proximal) gradient methods:

• strongly convex and smooth problems: O(κ log(1/ε))
• convex and smooth problems: O(1/ε)
Can one still hope to further accelerate convergence?
Accelerated GD 7-3
Issues and possible solutions
Issues:
• GD focuses on improving the cost per iteration, which might sometimes be too “short-sighted”
• GD might sometimes zigzag or experience abrupt changes

Solutions:
• exploit information from the history (i.e. past iterates)
• add buffers (like momentum) to yield a smoother trajectory
Accelerated GD 7-4
Heavy-ball methods— Polyak ’64
Heavy-ball method
B. Polyak
minimizex∈Rn f(x)
$$x_{t+1} = x_t - \eta_t \nabla f(x_t) + \underbrace{\theta_t (x_t - x_{t-1})}_{\text{momentum term}}$$

• add inertia to the “ball” (i.e. include a momentum term) to mitigate zigzagging
Accelerated GD 7-6
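To make the update rule concrete, here is a minimal Python sketch of the heavy-ball iteration (the function name and the constant stepsize/momentum parameters are illustrative choices, not part of the original method statement):

```python
import numpy as np

def heavy_ball(grad_f, x0, eta, theta, num_iters=100):
    """Heavy-ball iteration: x_{t+1} = x_t - eta*grad_f(x_t) + theta*(x_t - x_{t-1}).

    Constant stepsize/momentum are assumed for simplicity;
    grad_f is a user-supplied gradient oracle.
    """
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        # momentum term: theta * (x - x_prev)
        x_next = x - eta * grad_f(x) + theta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```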
Heavy-ball method
(figure: iterate trajectories of gradient descent (left) vs. the heavy-ball method (right), illustrating how momentum damps the zigzagging)
Accelerated GD 7-7
State-space models
$$\text{minimize}_{x} \quad \frac{1}{2} (x - x^*)^\top Q (x - x^*)$$

where Q ≻ 0 has condition number κ
One can understand heavy-ball methods through dynamical systems
Accelerated GD 7-8
State-space models
Consider the following dynamical system

$$\begin{bmatrix} x_{t+1} \\ x_t \end{bmatrix}
= \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix}
\begin{bmatrix} x_t \\ x_{t-1} \end{bmatrix}
- \begin{bmatrix} \eta_t \nabla f(x_t) \\ 0 \end{bmatrix}$$

or equivalently,

$$\underbrace{\begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix}}_{\text{state}}
= \begin{bmatrix} (1+\theta_t) I & -\theta_t I \\ I & 0 \end{bmatrix}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix}
- \begin{bmatrix} \eta_t \nabla f(x_t) \\ 0 \end{bmatrix}
= \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t Q & -\theta_t I \\ I & 0 \end{bmatrix}}_{\text{system matrix}}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix}$$
Accelerated GD 7-9
System matrix
$$\begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix}
= \underbrace{\begin{bmatrix} (1+\theta_t) I - \eta_t Q & -\theta_t I \\ I & 0 \end{bmatrix}}_{=:\,H_t \ \text{(system matrix)}}
\begin{bmatrix} x_t - x^* \\ x_{t-1} - x^* \end{bmatrix} \tag{7.1}$$

implication: convergence of heavy-ball methods depends on the spectrum of the system matrix H_t

key idea: find appropriate stepsizes η_t and momentum coefficients θ_t to control the spectrum of H_t
Accelerated GD 7-10
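Since convergence hinges on the spectrum of H_t, it is easy to examine numerically. Below is a small sketch that builds the system matrix of (7.1) for a given Hessian Q and computes its spectral radius (the function name and inputs are illustrative):

```python
import numpy as np

def heavy_ball_spectral_radius(Q, eta, theta):
    """Spectral radius of the heavy-ball system matrix H_t in (7.1)
    for a quadratic with Hessian Q and constant eta, theta."""
    n = Q.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    H = np.block([[(1 + theta) * I - eta * Q, -theta * I],
                  [I,                         Z]])
    return np.abs(np.linalg.eigvals(H)).max()
```

With the stepsize and momentum choices of the next slide, this quantity matches (√κ − 1)/(√κ + 1).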
Convergence for quadratic problems
Theorem 7.1 (Convergence of heavy-ball methods for quadratic functions)

Suppose f is an L-smooth and µ-strongly convex quadratic function. Set η_t ≡ 4/(√L + √µ)², θ_t ≡ max{|1 − √(η_t L)|, |1 − √(η_t µ)|}², and κ = L/µ. Then

$$\left\| \begin{bmatrix} x_{t+1} - x^* \\ x_t - x^* \end{bmatrix} \right\|_2
\lesssim \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^{t}
\left\| \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix} \right\|_2$$

• iteration complexity: O(√κ log(1/ε))
• significant improvement over GD: O(√κ log(1/ε)) vs. O(κ log(1/ε))
• relies on knowledge of both L and µ
Accelerated GD 7-11
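A quick numerical check of Theorem 7.1; this is a sketch on an arbitrary diagonal quadratic (the test matrix, dimension, and iteration count are illustrative choices):

```python
import numpy as np

L, mu = 10.0, 1.0
Q = np.diag(np.linspace(mu, L, 5))                 # eigenvalues spread over [mu, L]
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2        # stepsize from Theorem 7.1
theta = max(abs(1 - np.sqrt(eta * L)),
            abs(1 - np.sqrt(eta * mu))) ** 2       # momentum from Theorem 7.1
kappa = L / mu

x_star = np.ones(5)
grad = lambda x: Q @ (x - x_star)                  # gradient of the quadratic
x_gd = np.zeros(5)                                 # gradient descent iterate
x_hb = x_hb_prev = np.zeros(5)                     # heavy-ball iterates

for _ in range(200):
    x_gd = x_gd - grad(x_gd) / L
    x_hb, x_hb_prev = x_hb - eta * grad(x_hb) + theta * (x_hb - x_hb_prev), x_hb

print(np.linalg.norm(x_gd - x_star),               # ~ (1 - 1/kappa)^t decay
      np.linalg.norm(x_hb - x_star),               # ~ ((sqrt(k)-1)/(sqrt(k)+1))^t decay
      (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))
```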
Proof of Theorem 7.1

In view of (7.1), it suffices to control the spectrum of H_t (which is time-invariant). Let λ_i be the ith eigenvalue of Q and set Λ := diag(λ_1, …, λ_n); then the spectral radius ρ(·) of H_t obeys

$$\rho(H_t) = \rho\!\left( \begin{bmatrix} (1+\theta_t) I - \eta_t \Lambda & -\theta_t I \\ I & 0 \end{bmatrix} \right)
\le \max_{1 \le i \le n} \rho\!\left( \begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right)$$

To finish the proof, it suffices to show

$$\max_i \ \rho\!\left( \begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix} \right) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \tag{7.2}$$
Accelerated GD 7-12
Proof of Theorem 7.1
To show (7.2), note that the two eigenvalues of
$$\begin{bmatrix} 1+\theta_t-\eta_t\lambda_i & -\theta_t \\ 1 & 0 \end{bmatrix}$$
are the roots of

$$z^2 - (1+\theta_t-\eta_t\lambda_i) z + \theta_t = 0 \tag{7.3}$$

If (1 + θ_t − η_tλ_i)² ≤ 4θ_t, then the roots of this equation share the same magnitude √θ_t (they are either a complex-conjugate pair or a repeated real root). In addition, one can easily check that (1 + θ_t − η_tλ_i)² ≤ 4θ_t is satisfied if

$$\theta_t \in \Big[ \big(1-\sqrt{\eta_t\lambda_i}\big)^2,\ \big(1+\sqrt{\eta_t\lambda_i}\big)^2 \Big], \tag{7.4}$$

which would hold if one picks θ_t = max{(1 − √(η_t L))², (1 − √(η_t µ))²}
Accelerated GD 7-13
Proof of Theorem 7.1
With this choice of θ_t, we have (from (7.3) and the fact that the two eigenvalues have identical magnitudes)

$$\rho(H_t) \le \sqrt{\theta_t}$$

Finally, setting η_t = 4/(√L + √µ)² ensures 1 − √(η_t L) = −(1 − √(η_t µ)), which yields

$$\theta_t = \max\left\{ \Big(1-\frac{2\sqrt{L}}{\sqrt{L}+\sqrt{\mu}}\Big)^2,\ \Big(1-\frac{2\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\Big)^2 \right\}
= \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^2$$

This in turn establishes

$$\rho(H_t) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$$
Accelerated GD 7-14
Nesterov’s accelerated gradient methods
Convex case
minimizex∈Rn f(x)
For a positive definite quadratic function f, including momentum terms improves the iteration complexity from O(κ log(1/ε)) to O(√κ log(1/ε))

Can we obtain improvement for more general convex cases as well?
Accelerated GD 7-16
Nesterov’s idea
Y. Nesterov
— Nesterov ’83
$$x_{t+1} = y_t - \eta_t \nabla f(y_t)$$
$$y_{t+1} = x_{t+1} + \frac{t}{t+3} \big( x_{t+1} - x_t \big)$$

• alternates between gradient updates and proper extrapolation
• each iteration has nearly the same cost as GD
• not a descent method (i.e. we may not have f(x_{t+1}) ≤ f(x_t))
• one of the most beautiful and mysterious results in optimization ...

Accelerated GD 7-17
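A minimal Python sketch of the scheme (grad_f and the stepsize eta = 1/L are supplied by the caller; indexing starts at t = 0, so the first extrapolation coefficient is 0):

```python
import numpy as np

def nesterov_agd(grad_f, x0, eta, num_iters=100):
    """Nesterov '83: gradient step at the extrapolated point y_t,
    then extrapolate with coefficient t/(t+3)."""
    x = y = np.array(x0, dtype=float)
    for t in range(num_iters):
        x_next = y - eta * grad_f(y)                 # gradient update at y_t
        y = x_next + (t / (t + 3.0)) * (x_next - x)  # extrapolation
        x = x_next
    return x
```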
Convergence of Nesterov’s accelerated gradient method

Suppose f is convex and L-smooth. If η_t ≡ η = 1/L, then

$$f(x_t) - f^{\mathrm{opt}} \le \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$

• iteration complexity: O(1/√ε)
• much faster than gradient methods
• we’ll provide a proof for the (more general) proximal version later
Accelerated GD 7-18
Interpretation using differential equations
Nesterov’s momentum coefficient t/(t+3) ≈ 1 − 3/t is particularly mysterious
Accelerated GD 7-19
Interpretation using differential equations

(figure: a mass-spring-damper interpretation of the ODE below, with time-varying damping force −(3/τ)·Ẋ and spring force −∇f(X))
To develop insight into why Nesterov’s method works so well, it’s helpful to look at its continuous limit (η_t → 0), which is given by a second-order ordinary differential equation (ODE)

$$\ddot{X}(\tau) + \underbrace{\frac{3}{\tau}}_{\text{damping coefficient}} \dot{X}(\tau) + \underbrace{\nabla f(X(\tau))}_{\text{potential}} = 0$$

— Su, Boyd, Candes ’14

Accelerated GD 7-20
Heuristic derivation of ODE

To begin with, Nesterov’s update rule is equivalent to

$$\frac{x_{t+1} - x_t}{\sqrt{\eta}} = \frac{t-1}{t+2} \cdot \frac{x_t - x_{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\, \nabla f(y_t) \tag{7.5}$$

Let t = τ/√η. Set X(τ) ≈ x_{τ/√η} = x_t and X(τ + √η) ≈ x_{t+1}. Then the Taylor expansion gives

$$\frac{x_{t+1} - x_t}{\sqrt{\eta}} \approx \dot{X}(\tau) + \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta}, \qquad
\frac{x_t - x_{t-1}}{\sqrt{\eta}} \approx \dot{X}(\tau) - \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta}$$

which combined with (7.5) yields

$$\dot{X}(\tau) + \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \approx \Big(1 - \frac{3\sqrt{\eta}}{\tau}\Big) \Big( \dot{X}(\tau) - \tfrac{1}{2} \ddot{X}(\tau) \sqrt{\eta} \Big) - \sqrt{\eta}\, \nabla f\big(X(\tau)\big)$$

$$\Longrightarrow \quad \ddot{X}(\tau) + \frac{3}{\tau} \dot{X}(\tau) + \nabla f\big(X(\tau)\big) \approx 0$$
Accelerated GD 7-21
Convergence rate of ODE
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0 \tag{7.6}$$

Standard ODE theory reveals that

$$f(X(\tau)) - f^{\mathrm{opt}} \le O\Big( \frac{1}{\tau^2} \Big) \tag{7.7}$$

which somehow explains Nesterov’s O(1/t²) convergence
Accelerated GD 7-22
Proof of (7.7)
Define the Lyapunov function (or energy function)

$$\mathcal{E}(\tau) := \tau^2 \big( f(X) - f^{\mathrm{opt}} \big) + 2 \big\| X + \tfrac{\tau}{2} \dot{X} - X^* \big\|_2^2$$

This obeys

$$\dot{\mathcal{E}} = 2\tau \big( f(X) - f^{\mathrm{opt}} \big) + \tau^2 \big\langle \nabla f(X), \dot{X} \big\rangle
+ 4 \Big\langle X + \tfrac{\tau}{2} \dot{X} - X^*,\ \tfrac{3}{2} \dot{X} + \tfrac{\tau}{2} \ddot{X} \Big\rangle$$
$$\overset{(i)}{=} 2\tau \big( f(X) - f^{\mathrm{opt}} \big) - 2\tau \big\langle X - X^*, \nabla f(X) \big\rangle \le 0 \quad \text{(by convexity)}$$

where (i) follows by replacing τẌ + 3Ẋ with −τ∇f(X)

This means 𝓔 is non-increasing in τ, and hence

$$f(X(\tau)) - f^{\mathrm{opt}} \overset{\text{(defn of } \mathcal{E})}{\le} \frac{\mathcal{E}(\tau)}{\tau^2} \le \frac{\mathcal{E}(0)}{\tau^2} = O\Big( \frac{1}{\tau^2} \Big)$$
Accelerated GD 7-23
Magic number 3
$$\ddot{X} + \frac{3}{\tau} \dot{X} + \nabla f(X) = 0$$

• 3 is the smallest constant that guarantees O(1/τ²) convergence, and can be replaced by any other α ≥ 3
• in some sense, 3 minimizes the pre-constant in the convergence bound O(1/τ²) (see Su, Boyd, Candes ’14)
Accelerated GD 7-24
Numerical example (taken from UCLA EE236C)

$$\text{minimize}_{x} \quad \log\Big( \sum_{i=1}^m \exp(a_i^\top x + b_i) \Big)$$

• two randomly generated problems with m = 2000, n = 1000
• same fixed stepsize used for the gradient method and FISTA
• figures show (f(x^{(k)}) − f^*)/f^*

(figure: two semilog plots of relative suboptimality vs. iteration count k for “gradient” and “FISTA”; FISTA reaches 10⁻⁶ accuracy in far fewer iterations)
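A sketch reproducing this experiment (random data and a hand-picked stepsize, so the curves will only qualitatively match the figure; 1/L is not known in closed form here, and eta below is a guess):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 1000
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def f(x):                       # log-sum-exp, computed stably
    z = A @ x + b
    zmax = z.max()
    return zmax + np.log(np.exp(z - zmax).sum())

def grad_f(x):                  # gradient = A' * softmax(Ax + b)
    z = A @ x + b
    w = np.exp(z - z.max())
    return A.T @ (w / w.sum())

eta = 1e-4                      # assumed stepsize, used for both methods
x_gd = np.zeros(n)
x = y = np.zeros(n)
for t in range(200):
    x_gd = x_gd - eta * grad_f(x_gd)                  # gradient method
    x_next = y - eta * grad_f(y)                      # accelerated (h = 0, prox = identity)
    y = x_next + (t / (t + 3.0)) * (x_next - x)
    x = x_next
print(f(x_gd), f(x))            # the accelerated iterate is typically much lower
```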
Accelerated GD 7-25
Extension to composite models
$$\text{minimize}_{x} \quad F(x) := f(x) + h(x) \qquad \text{subject to } x \in \mathbb{R}^n$$

• f: convex and smooth
• h: convex (may not be differentiable)

let F^opt := min_x F(x) be the optimal cost
Accelerated GD 7-26
FISTA (Beck & Teboulle ’09)
Fast iterative shrinkage-thresholding algorithm
$$x_{t+1} = \mathrm{prox}_{\eta_t h}\big( y_t - \eta_t \nabla f(y_t) \big)$$
$$y_{t+1} = x_{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} \big( x_{t+1} - x_t \big)$$

where y_0 = x_0, θ_0 = 1, and θ_{t+1} = (1 + √(1 + 4θ_t²))/2

• adopts the momentum coefficients originally proposed by Nesterov ’83
Accelerated GD 7-27
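A compact FISTA sketch (prox_h(v, eta) is assumed to return prox_{eta·h}(v); eta = 1/L is the standard stepsize):

```python
import numpy as np

def fista(grad_f, prox_h, x0, eta, num_iters=100):
    """FISTA (Beck & Teboulle '09): prox step at y_t, then extrapolate
    with coefficient (theta_t - 1)/theta_{t+1}."""
    x = y = np.array(x0, dtype=float)
    theta = 1.0                                              # theta_0 = 1
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
        x, theta = x_next, theta_next
    return x

# e.g. for h(x) = lam * ||x||_1 (lam is an illustrative parameter),
# the prox is soft-thresholding:
def soft_threshold(v, eta, lam=0.1):
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
```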
Momentum coefficient
(figure: θ_t grows roughly linearly in t; the coefficient (θ_t − 1)/θ_{t+1} increases toward 1)

$$\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2} \quad \text{with } \theta_0 = 1$$

coefficient: (θ_t − 1)/θ_{t+1} = 1 − 3/t + o(1/t) (homework)

• asymptotically equivalent to t/(t+3)

Fact 7.2
For all t ≥ 1, one has θ_t ≥ (t+2)/2 (homework)
Accelerated GD 7-28
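Both claims are easy to check numerically; a small sketch:

```python
thetas = [1.0]                                    # theta_0 = 1
for _ in range(41):
    thetas.append((1.0 + (1.0 + 4.0 * thetas[-1] ** 2) ** 0.5) / 2.0)

for t in (10, 20, 40):                            # (theta_t - 1)/theta_{t+1} vs. 1 - 3/t
    print(t, (thetas[t] - 1.0) / thetas[t + 1], 1.0 - 3.0 / t)

# Fact 7.2: theta_t >= (t + 2)/2
assert all(th >= (t + 2) / 2 for t, th in enumerate(thetas))
```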
Convergence analysis
Convergence for convex problems
Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems)

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

$$F(x_t) - F^{\mathrm{opt}} \le \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$

• improved iteration complexity (i.e. O(1/√ε)) compared with the proximal gradient method (i.e. O(1/ε))
• fast if the prox can be efficiently implemented
Accelerated GD 7-30
Recap: the fundamental inequality for the proximal method

Recall the following fundamental inequality shown in the last lecture:

Lemma 7.4
Let y⁺ = prox_{(1/L)h}( y − (1/L)∇f(y) ). Then

$$F(y^+) - F(x) \le \frac{L}{2} \|x - y\|_2^2 - \frac{L}{2} \|x - y^+\|_2^2$$
Accelerated GD 7-31
Proof of Theorem 7.3

1. build a discrete-time version of the “Lyapunov function”
2. magic happens!
   ◦ the “Lyapunov function” is non-increasing when Nesterov’s momentum coefficients are adopted
Accelerated GD 7-32
Proof of Theorem 7.3

Key lemma: monotonicity of a certain “Lyapunov function”

Lemma 7.5
Let u_t = θ_{t−1} x_t − (x^* + (θ_{t−1} − 1) x_{t−1}), or equivalently u_t = θ_{t−1}(x_t − x^*) − (θ_{t−1} − 1)(x_{t−1} − x^*). Then

$$\|u_{t+1}\|_2^2 + \frac{2}{L} \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big)
\le \|u_t\|_2^2 + \frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)$$

• quite similar to the Lyapunov function 2‖X + (τ/2)Ẋ − X^*‖₂² + τ²(f(X) − f^opt) discussed before (think about θ_t ≈ t/2)
Accelerated GD 7-33
Proof of Theorem 7.3

With Lemma 7.5 in place, one has

$$\frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)
\le \|u_1\|_2^2 + \frac{2}{L} \theta_0^2 \big( F(x_1) - F^{\mathrm{opt}} \big)
= \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big)$$

To bound the RHS of this inequality, we use Lemma 7.4 and y_0 = x_0 to get

$$\frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big) \le \|y_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2
= \|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2$$
$$\Longleftrightarrow \quad \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big) \le \|x_0 - x^*\|_2^2$$

As a result,

$$\frac{2}{L} \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big)
\le \|x_1 - x^*\|_2^2 + \frac{2}{L} \big( F(x_1) - F^{\mathrm{opt}} \big)
\le \|x_0 - x^*\|_2^2$$
$$\Longrightarrow \quad F(x_t) - F^{\mathrm{opt}} \le \frac{L \|x_0 - x^*\|_2^2}{2 \theta_{t-1}^2}
\overset{\text{(Fact 7.2)}}{\le} \frac{2L \|x_0 - x^*\|_2^2}{(t+1)^2}$$
Accelerated GD 7-34
Proof of Lemma 7.5
Take x = (1/θ_t) x^* + (1 − 1/θ_t) x_t and y = y_t in Lemma 7.4 to get

$$F(x_{t+1}) - F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big) \tag{7.8}$$
$$\le \frac{L}{2} \big\| \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t - y_t \big\|_2^2
- \frac{L}{2} \big\| \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t - x_{t+1} \big\|_2^2$$
$$= \frac{L}{2\theta_t^2} \big\| x^* + (\theta_t - 1) x_t - \theta_t y_t \big\|_2^2
- \frac{L}{2\theta_t^2} \big\| \underbrace{x^* + (\theta_t - 1) x_t - \theta_t x_{t+1}}_{=-u_{t+1}} \big\|_2^2$$
$$\overset{(i)}{=} \frac{L}{2\theta_t^2} \big( \|u_t\|_2^2 - \|u_{t+1}\|_2^2 \big), \tag{7.9}$$

where (i) follows from the definition of u_t and y_t = x_t + ((θ_{t−1} − 1)/θ_t)(x_t − x_{t−1})
Accelerated GD 7-35
Proof of Lemma 7.5 (cont.)
We will also lower bound (7.8). By convexity of F,

$$F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big)
\le \theta_t^{-1} F(x^*) + (1-\theta_t^{-1}) F(x_t)
= \theta_t^{-1} F^{\mathrm{opt}} + (1-\theta_t^{-1}) F(x_t)$$
$$\Longleftrightarrow \quad F\big( \theta_t^{-1} x^* + (1-\theta_t^{-1}) x_t \big) - F(x_{t+1})
\le (1-\theta_t^{-1}) \big( F(x_t) - F^{\mathrm{opt}} \big) - \big( F(x_{t+1}) - F^{\mathrm{opt}} \big)$$

Combining this with (7.9) and θ_t² − θ_t = θ_{t−1}² yields

$$\frac{L}{2} \big( \|u_t\|_2^2 - \|u_{t+1}\|_2^2 \big)
\ge \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big) - \big( \theta_t^2 - \theta_t \big) \big( F(x_t) - F^{\mathrm{opt}} \big)
= \theta_t^2 \big( F(x_{t+1}) - F^{\mathrm{opt}} \big) - \theta_{t-1}^2 \big( F(x_t) - F^{\mathrm{opt}} \big),$$

thus finishing the proof
Accelerated GD 7-36
Convergence for strongly convex problems
$$x_{t+1} = \mathrm{prox}_{\eta_t h}\big( y_t - \eta_t \nabla f(y_t) \big)$$
$$y_{t+1} = x_{t+1} + \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \big( x_{t+1} - x_t \big)$$

Theorem 7.6 (Convergence of accelerated proximal gradient methods for the strongly convex case)

Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

$$F(x_t) - F^{\mathrm{opt}} \le \Big( 1 - \frac{1}{\sqrt{\kappa}} \Big)^{t} \Big( F(x_0) - F^{\mathrm{opt}} + \frac{\mu \|x_0 - x^*\|_2^2}{2} \Big)$$
Accelerated GD 7-37
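A sketch of this variant (it differs from FISTA only in the constant momentum coefficient; both L and µ, hence κ, must be known):

```python
import numpy as np

def accel_prox_strongly_convex(grad_f, prox_h, x0, L, mu, num_iters=100):
    """Accelerated proximal gradient with constant momentum
    (sqrt(kappa)-1)/(sqrt(kappa)+1), where kappa = L/mu."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    eta = 1.0 / L
    x = y = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x
```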
A practical issue
Fast convergence requires knowledge of κ = L/µ
• in practice, estimating µ is typically very challenging
A common observation: ripples / bumps in the traces of cost values
Accelerated GD 7-38
Rippling behavior

Numerical example: take y_{t+1} = x_{t+1} + ((1 − √q)/(1 + √q)) (x_{t+1} − x_t), where q is an estimate of q^* := 1/κ

(Figure 1 of O’Donoghue & Candes ’12: convergence of f(x_k) − f^* for q ∈ {0, q^*/10, q^*/3, q^*, 3q^*, 10q^*, 1}; mis-estimated q slows convergence or produces ripples)

Interpretation: the optimal momentum depends on the condition number; a function with a higher condition number requires higher momentum. Underestimating the required momentum leads to slower convergence. More often, however, momentum is overestimated (e.g. q = 0, for which the momentum coefficient tends to 1): the trajectory then overshoots the minimum and oscillates around it, producing ripples in the function values whose period is proportional to the square root of the (local) condition number. Moreover, the condition number is a global parameter; even under the optimal choice q = q^*, the iterates may enter regions that are locally better conditioned (say, near the optimum), where this q again amounts to overestimated momentum and rippling re-emerges.
period of ripples is often proportional to √(L/µ)

— O’Donoghue, Candes ’12

• when q > q^*: we underestimate momentum → slower convergence
• when q < q^*: we overestimate momentum ((1 − √q)/(1 + √q) is large) → overshooting / rippling behavior

Accelerated GD 7-39
Adaptive restart (O’Donoghue, Candes ’12)
When a certain criterion is met, restart running FISTA with

x_0 ← x_t,  y_0 ← x_t,  θ_0 = 1

• take the current iterate as a new starting point
• erase all memory of previous iterates and reset the momentum back to zero
Accelerated GD 7-40
Numerical comparisons of adaptive restart schemes
(Figure 3 of O’Donoghue & Candes ’12: f(x_k) − f^* vs. k for gradient descent, FISTA with no restart, fixed restarts every 100/400/1000 iterations, the optimal fixed restart interval (700), the scheme with q = µ/L, and the adaptive function/gradient restart schemes; the adaptive schemes perform on par with the best fixed interval)
• function scheme: restart when f(x_t) > f(x_{t−1})
• gradient scheme: restart when ⟨∇f(y_{t−1}), x_t − x_{t−1}⟩ > 0, i.e. when the momentum leads us in a bad direction (see the sketch below)

Accelerated GD 7-41
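A sketch combining FISTA with both restart tests (the gradient test below uses ∇f(y_{t−1}) as on this slide; prox_h(v, eta) is assumed to return prox_{eta·h}(v)):

```python
import numpy as np

def fista_adaptive_restart(f, grad_f, prox_h, x0, eta,
                           num_iters=500, scheme="gradient"):
    """FISTA with adaptive restart (O'Donoghue & Candes '12).
    scheme='function': restart when f(x_t) > f(x_{t-1});
    scheme='gradient': restart when <grad_f(y_{t-1}), x_t - x_{t-1}> > 0."""
    x = y = np.array(x0, dtype=float)
    theta = 1.0
    for _ in range(num_iters):
        g = grad_f(y)                                  # gradient at y_{t-1}
        x_next = prox_h(y - eta * g, eta)
        restart = (f(x_next) > f(x)) if scheme == "function" \
                  else (g @ (x_next - x) > 0)
        if restart:
            y, theta = x_next, 1.0                     # reset momentum to zero
        else:
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
            theta = theta_next
        x = x_next
    return x
```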
Illustration
(Figure 4 of O’Donoghue & Candes ’12: sequence trajectories in the (x₁, x₂) plane under q = q^*, q = 0, and adaptive restart)
• with overestimated momentum (e.g. q = 0), one sees a spiralling trajectory
• adaptive restart helps mitigate this issue
Accelerated GD 7-42
Lower bounds
Optimality of Nesterov’s method
Interestingly, no first-order methods can improve upon Nesterov’sresults in general
More precisely, there exists a convex and L-smooth function f s.t.

$$f(x_t) - f^{\mathrm{opt}} \ge \frac{3L \|x_0 - x^*\|_2^2}{32 (t+1)^2}$$

as long as

$$x_k \in x_0 + \underbrace{\mathrm{span}\{\nabla f(x_0), \cdots, \nabla f(x_{k-1})\}}_{\text{definition of first-order methods}} \quad \text{for all } 1 \le k \le t$$
— Nemirovski, Yudin ’83
Accelerated GD 7-44
Example
$$\text{minimize}_{x \in \mathbb{R}^{2n+1}} \quad f(x) = \frac{L}{4} \Big( \frac{1}{2} x^\top A x - e_1^\top x \Big)$$

where

$$A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{(2n+1)\times(2n+1)}$$

• f is convex and L-smooth
• the optimizer x^* is given by x_i^* = 1 − i/(2n+2) (1 ≤ i ≤ 2n+1), obeying

$$f^{\mathrm{opt}} = \frac{L}{8} \Big( \frac{1}{2n+2} - 1 \Big) \quad \text{and} \quad \|x^*\|_2^2 \le \frac{2n+2}{3}$$

• ∇f(x) = (L/4) A x − (L/4) e_1
• span{∇f(x_0), · · · , ∇f(x_{k−1})} =: K_k satisfies K_k = span{e_1, · · · , e_k} if x_0 = 0
  ◦ every iteration of a first-order method expands the search space by at most one dimension
Accelerated GD 7-45
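For concreteness, a sketch constructing this worst-case instance and illustrating the span property (the dimension and L below are arbitrary):

```python
import numpy as np

def worst_case_instance(n, L=1.0):
    """Tridiagonal worst-case quadratic on R^(2n+1):
    f(x) = (L/4) * (0.5 * x'Ax - e_1'x)."""
    d = 2 * n + 1
    A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
    e1 = np.eye(d)[:, 0]
    grad_f = lambda x: (L / 4.0) * (A @ x - e1)
    return A, grad_f

A, grad_f = worst_case_instance(5)
g0 = grad_f(np.zeros(11))
print(np.nonzero(g0)[0])   # -> [0]: the first gradient is supported on e_1 only,
                           # and each multiplication by A widens the support by one
                           # coordinate -- the mechanism behind the lower bound
```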
Example (cont.)
If we start with x_0 = 0, then

$$f(x_n) \ge \inf_{x \in K_n} f(x) = \frac{L}{8} \Big( \frac{1}{n+1} - 1 \Big)$$

$$\Longrightarrow \quad \frac{f(x_n) - f^{\mathrm{opt}}}{\|x_0 - x^*\|_2^2}
\ge \frac{ \frac{L}{8} \big( \frac{1}{n+1} - \frac{1}{2n+2} \big) }{ \frac{1}{3} (2n+2) }
= \frac{3L}{32 (n+1)^2}$$
Accelerated GD 7-46
Summary: accelerated proximal gradient

                                    stepsize rule   convergence rate      iteration complexity
  convex & smooth problems          η_t = 1/L       O(1/t²)               O(1/√ε)
  strongly convex &
    smooth problems                 η_t = 1/L       O((1 − 1/√κ)^t)       O(√κ log(1/ε))
Accelerated GD 7-47
Reference
[1] “Some methods of speeding up the convergence of iteration methods,” B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.
[2] “A method of solving a convex programming problem with convergence rate O(1/k²),” Y. Nesterov, Soviet Mathematics Doklady, 1983.
[3] “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.
[4] “First-order methods in optimization,” A. Beck, Vol. 25, SIAM, 2017.
[5] “Mathematical optimization, MATH301 lecture notes,” E. Candes, Stanford.
[6] “Gradient methods for minimizing composite functions,” Y. Nesterov, Technical Report, 2007.
Accelerated GD 7-48
Reference
[7] “Large-scale numerical optimization, MS&E318 lecture notes,” M. Saunders, Stanford.
[8] “Analysis and design of optimization algorithms via integral quadratic constraints,” L. Lessard, B. Recht, A. Packard, SIAM Journal on Optimization, 2016.
[9] “Proximal algorithms,” N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[10] “Convex optimization: algorithms and complexity,” S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[11] “Optimization methods for large-scale systems, EE236C lecture notes,” L. Vandenberghe, UCLA.
[12] “Problem complexity and method efficiency in optimization,” A. Nemirovski, D. Yudin, Wiley, 1983.
Accelerated GD 7-49
Reference
[13] “Introductory lectures on convex optimization: a basic course,” Y. Nesterov, 2004.
[14] “A differential equation for modeling Nesterov’s accelerated gradient method,” W. Su, S. Boyd, E. Candes, NIPS, 2014.
[15] “On accelerated proximal gradient methods for convex-concave optimization,” P. Tseng, 2008.
[16] “Adaptive restart for accelerated gradient schemes,” B. O’Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.
Accelerated GD 7-50