A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights
Weijie Su, Stephen Boyd, Emmanuel J. Candès, NIPS 2014
Speaker: kv
MCLab, CITI, Academia Sinica
May 14, 2015
Overview
1 Introduction: Smooth Unconstrained Optimization; Accelerated Scheme; Ordinary Differential Equation
2 Derivation of the ODE: Simple Properties
3 Equivalence between the ODE and Nesterov's Scheme: Analogous Convergence Rate; Quadratic f and Bessel Functions
4 A Family of Generalized Nesterov's Schemes: Continuous Optimization; Composite Optimization
5 New Restart Scheme
6 Accelerating to Linear Convergence by Restarting: Numerical Examples
7 Discussion
Introduction
Smooth Unconstrained Optimization
We wish to minimize a smooth convex function

minimize f(x)

where f : Rⁿ → R has an L-Lipschitz continuous gradient:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

µ-strong convexity: f(x) − µ‖x‖²/2 is convex.

In this paper, F_L denotes the class of convex functions f on Rⁿ with L-Lipschitz continuous gradients; S_µ denotes the class of µ-strongly convex functions f on Rⁿ. We set S_{µ,L} = F_L ∩ S_µ.
Introduction: Accelerated Scheme
Nesterov’s Accelerated Gradient Scheme
x_k = y_{k−1} − s∇f(y_{k−1})   (1)
y_k = x_k + ((k − 1)/(k + 2))(x_k − x_{k−1})   (2)

with y₀ = x₀. For the fixed step size s = 1/L, where L is the Lipschitz constant of ∇f, this scheme exhibits the convergence rate

f(x_k) − f* ≤ O(L‖x₀ − x*‖²/k²)

This improvement relies on the introduction of the momentum term x_k − x_{k−1} and the particularly tuned coefficient (k − 1)/(k + 2) = 1 − 3/(k + 2).
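As a concrete reference, here is a minimal sketch of iterations (1)-(2) in Python (the quadratic objective and the step size are illustrative assumptions, not from the paper):

```python
import numpy as np

def nesterov(grad_f, x0, s, iters):
    """Nesterov's accelerated gradient, iterations (1)-(2)."""
    x_prev = x0.copy()
    y = x0.copy()  # y0 = x0
    for k in range(1, iters + 1):
        x = y - s * grad_f(y)                      # (1): gradient step from y_{k-1}
        y = x + (k - 1) / (k + 2) * (x - x_prev)   # (2): momentum coefficient (k-1)/(k+2)
        x_prev = x
    return x

# Illustrative usage on the quadratic example used later in the talk
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])  # gradient of f = 0.02 x1^2 + 0.005 x2^2
x_hat = nesterov(grad, np.array([1.0, 1.0]), s=1.0, iters=300)
```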
Accelerated Scheme: Oscillation Problem
In general, Nesterov's scheme is not monotone in the objective function, due to the introduction of the momentum term.
Introduction: Second order ODE
The authors derive a second-order ordinary differential equation (ODE) which is the exact limit of Nesterov's scheme as the step size tends to zero:

Ẍ + (3/t)Ẋ + ∇f(X) = 0   (3)

for t > 0, with initial conditions X(0) = x₀, Ẋ(0) = 0; here x₀ is the starting point in Nesterov's scheme, Ẋ denotes velocity, and Ẍ denotes acceleration.

Small t: the large damping coefficient 3/t makes the ODE an over-damped system.
Large t: as t increases, the system behaves like an under-damped system, oscillating with an amplitude that decreases to zero.
The time parameter of the ODE is related to the step size by t ≈ k√s.
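To see the correspondence t ≈ k√s concretely, this sketch integrates (3) with scipy and evaluates the solution at the times t = k√s where the Nesterov iterates should sit (the objective, step size, and horizon are illustrative assumptions):

```python
import numpy as np
from scipy.integrate import solve_ivp

grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])  # gradient of f = 0.02 x1^2 + 0.005 x2^2
x0 = np.array([1.0, 1.0])

def rhs(t, z):
    # First-order form of (3): z = (X, V) with X' = V, V' = -(3/t) V - grad f(X)
    x, v = z[:2], z[2:]
    return np.concatenate([v, -(3.0 / t) * v - grad(x)])

t0 = 1e-3  # start slightly after 0 to avoid the 3/t singularity
sol = solve_ivp(rhs, (t0, 100.0), np.concatenate([x0, np.zeros(2)]),
                rtol=1e-8, atol=1e-10, dense_output=True)

# The Nesterov iterate x_k should track X(k * sqrt(s)) for small step size s
s = 0.04
X_at_iterates = sol.sol(np.sqrt(s) * np.arange(1, 50))[:2]  # X(t) at t = k sqrt(s)
```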
An Example: Trajectories
Minimize f = 0.02x₁² + 0.005x₂²

An Example: Zoomed Trajectories
Minimize f = 0.02x₁² + 0.005x₂²

An Example: Errors f − f*
Minimize f = 0.02x₁² + 0.005x₂²
Derivation of the ODE
Derivation of the ODE
Assume f ∈ F_L for some L > 0. Combining the two equations of (1)-(2) and rescaling gives

(x_{k+1} − x_k)/√s = ((k − 1)/(k + 2)) (x_k − x_{k−1})/√s − √s ∇f(y_k)   (4)

Introduce the ansatz x_k ≈ X(k√s) for a smooth curve X(t) defined for t ≥ 0. With these approximations, we get the Taylor expansions:

(x_{k+1} − x_k)/√s = Ẋ(t) + (1/2)Ẍ(t)√s + o(√s)
(x_k − x_{k−1})/√s = Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)
√s ∇f(y_k) = √s ∇f(X(t)) + o(√s)

where in the third equality we use y_k − X(t) = o(1).
Derivation of the ODE
Formula (4) can then be rewritten as

Ẋ(t) + (1/2)Ẍ(t)√s + o(√s) = (1 − 3√s/t){Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)} − √s ∇f(X(t)) + o(√s)

Comparing the coefficients of √s, we obtain

Ẍ + (3/t)Ẋ + ∇f(X) = 0

Theorem (Well-posedness: existence and uniqueness)
For any f ∈ F_∞ := ∪_{L>0} F_L and any x₀ ∈ Rⁿ, the ODE (3) with initial conditions X(0) = x₀, Ẋ(0) = 0 has a unique global solution X.
Simple Properties
Invariance
The ODE is invariant under time rescaling and under spatial translation.

Initial asymptotics
Assume X is smooth enough that lim_{t→0} Ẍ(t) exists. The mean value theorem guarantees the existence of ζ ∈ (0, t) such that

Ẋ(t)/t = (Ẋ(t) − Ẋ(0))/t = Ẍ(ζ)

Hence the ODE reduces to Ẍ(t) + 3Ẍ(ζ) + ∇f(X(t)) = 0. Taking t → 0 gives Ẍ(0) = −∇f(x₀)/4, so for small t

X(t) = x₀ − ∇f(x₀)t²/8 + o(t²)

This is consistent with the empirical observation that Nesterov's scheme moves slowly in the beginning.
Connections and Interpretations
Analogous Convergence Rate
We now exhibit the approximate equivalence between the ODE and Nesterov's scheme in terms of convergence rate.

Theorem (3.1, discrete Nesterov scheme)
For any f ∈ F_L, the sequence {x_k} from (1)-(2) with step size s ≤ 1/L obeys

f(x_k) − f* ≤ 2‖x₀ − x*‖² / (s(k + 1)²)

The next result indicates that the trajectory of the ODE (3) closely resembles the sequence {x_k} in terms of the convergence rate.

Theorem (3.2, continuous ODE)
For any f ∈ F_∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x₀, Ẋ(0) = 0. Then for any t > 0,

f(X(t)) − f* ≤ 2‖x₀ − x*‖² / t²
Proof of Theorem 3.2
Consider the energy functional

E(t) := t²(f(X(t)) − f*) + 2‖X + (t/2)Ẋ − x*‖²

whose time derivative is

Ė(t) = 2t(f(X) − f*) + t²⟨∇f, Ẋ⟩ + 4⟨X + (t/2)Ẋ − x*, (3/2)Ẋ + (t/2)Ẍ⟩

Substituting (3/2)Ẋ + (t/2)Ẍ = −(t/2)∇f(X), from the ODE (3), the ⟨∇f, Ẋ⟩ terms cancel and

Ė(t) = 2t(f(X) − f*) + 4⟨X − x*, −(t/2)∇f(X)⟩ = 2t(f(X) − f*) − 2t⟨X − x*, ∇f(X)⟩ ≤ 0

where the inequality follows from the convexity of f. Hence, by the monotonicity of E and the non-negativity of 2‖X + (t/2)Ẋ − x*‖², the gap obeys

f(X(t)) − f* ≤ E(t)/t² ≤ E(0)/t² = 2‖x₀ − x*‖²/t²
Quadratic f and Bessel function
For quadratic f we have

f(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩

where A ∈ Rⁿˣⁿ. A simple translation can absorb the linear term ⟨b, x⟩. We assume A is positive semidefinite, admitting the spectral decomposition A = QᵀΛQ; replacing x by Qx, we may assume f = (1/2)⟨x, Λx⟩. The ODE then decouples into

Ẍᵢ + (3/t)Ẋᵢ + λᵢXᵢ = 0,   i = 1, ..., n
Quadratic f and Bessel function
Introduce Yᵢ(u) = u Xᵢ(u/√λᵢ), which satisfies

u²Ÿᵢ + uẎᵢ + (u² − 1)Yᵢ = 0

This is Bessel's differential equation of order 1. Solving it and applying the initial conditions, we obtain

Xᵢ(t) = (2x₀,ᵢ / (t√λᵢ)) J₁(t√λᵢ)

For large t, the Bessel function has the asymptotic form

J₁(t) = √(2/(πt)) (cos(t − 3π/4) + O(1/t))
Quadratic f and Bessel function: Example
Minimize f = 0.02x₁² + 0.005x₂², whose eigenvalues are λ₁ = 0.02 and λ₂ = 0.005. By the closed form above,

f(X) − f* = f(X) = Σᵢ₌₁ⁿ (2x₀,ᵢ²/t²) J₁(t√λᵢ)²

Denote the two major periods by T₁, T₂. We get T₁ = π/√λ₁ ≈ 22.21 and T₂ = π/√λ₂ ≈ 44.43.
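A quick numerical sanity check of the closed form for one coordinate (a sketch; the time grid, tolerances, and coordinate values are arbitrary choices):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import j1  # Bessel function of the first kind, order 1

lam, x0i = 0.02, 1.0  # one coordinate of the decoupled system

def rhs(t, z):
    # One coordinate of the decoupled ODE: X'' + (3/t) X' + lam * X = 0
    x, v = z
    return [v, -(3.0 / t) * v - lam * x]

sol = solve_ivp(rhs, (1e-6, 100.0), [x0i, 0.0],
                t_eval=np.linspace(1.0, 100.0, 200), rtol=1e-10, atol=1e-12)

closed_form = 2 * x0i / (sol.t * np.sqrt(lam)) * j1(sol.t * np.sqrt(lam))
print(np.max(np.abs(sol.y[0] - closed_form)))  # should be close to 0
```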
Equivalence between the ODE and Nesterov’s scheme
We study the stable step size for numerically solving the ODE. The finite-difference approximation of (3) by the forward Euler method is

(X(t + Δt) − 2X(t) + X(t − Δt))/Δt² + (3/t)(X(t) − X(t − Δt))/Δt + ∇f(X(t)) = 0

which is equivalent to

X(t + Δt) = (2 − 3Δt/t)X(t) − Δt²∇f(X(t)) − (1 − 3Δt/t)X(t − Δt)

Assuming that f is sufficiently smooth, for a small perturbation the characteristic equation of this finite-difference scheme is approximately (identifying k = t/Δt)

det{λ² − (2 − Δt²∇²f − 3Δt/t)λ + 1 − 3Δt/t} = 0   (5)

For numerical stability, all roots of (5) should lie in the closed unit disc (|λ| ≤ 1).
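A sketch that checks the root condition of (5) numerically for a single eigenvalue of ∇²f (the parameter values below are arbitrary):

```python
import numpy as np

def stable(dt, t, hess_eig):
    # Roots of lambda^2 - (2 - dt^2*hess_eig - 3*dt/t)*lambda + (1 - 3*dt/t) = 0, per (5)
    b = -(2 - dt**2 * hess_eig - 3 * dt / t)
    c = 1 - 3 * dt / t
    return np.all(np.abs(np.roots([1.0, b, c])) <= 1 + 1e-12)

# Scan for the largest stable step size at t = 10 with Hessian eigenvalue 0.04
for dt in [0.1, 1.0, 5.0, 10.0, 12.0]:
    print(dt, stable(dt, t=10.0, hess_eig=0.04))
```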
A family of generalized Nesterov’s schemes
To exploit the power of the ODE, we study (3) with the constant 3 in the coefficient of Ẋ/t replaced by a general constant r:

Ẍ + (r/t)Ẋ + ∇f(X) = 0,   X(0) = x₀, Ẋ(0) = 0   (6)

By an argument similar to Theorem 2.1, this ODE is guaranteed to admit a unique global solution for any f ∈ F_∞.
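The natural discrete analog replaces the momentum coefficient (k − 1)/(k + 2) in (2) by (k − 1)/(k + r − 1); here is a minimal sketch (the choice of r, objective, and step size are illustrative assumptions):

```python
import numpy as np

def generalized_nesterov(grad_f, x0, s, r, iters):
    """Discrete analog of ODE (6): momentum coefficient (k-1)/(k+r-1)."""
    x_prev, y = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        x = y - s * grad_f(y)
        y = x + (k - 1) / (k + r - 1) * (x - x_prev)
        x_prev = x
    return x

grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
x_hat = generalized_nesterov(grad, np.array([1.0, 1.0]), s=1.0, r=4.0, iters=500)
```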
Generalized Nesterov’s Scheme: Continuous Optimization
Theorem (4.1)
Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F_∞. Then X(t) obeys

f(X(t)) − f* ≤ (r − 1)²‖x₀ − x*‖² / (2t²)

and

∫₀^∞ t(f(X(t)) − f*) dt ≤ (r − 1)²‖x₀ − x*‖² / (2(r − 3))

Theorem (4.2)
For any f ∈ S_{µ,L}(Rⁿ), the unique solution X to (6) with r ≥ 9/2 obeys

f(X(t)) − f* ≤ C r^{5/2} ‖x₀ − x*‖² / (t³√µ)

for any t > 0 and a universal constant C > 1/2.
Generalized Nesterov’s Scheme: Continuous Optimization
For example, the solution to (6) with f(x) = ‖x‖²/2 is

X(t) = (2^{(r−1)/2} Γ((r + 1)/2) J_{(r−1)/2}(t) / t^{(r−1)/2}) x₀

where J_{(r−1)/2}(·) is the Bessel function of the first kind of order (r − 1)/2. For large t, this Bessel function obeys

J_{(r−1)/2}(t) = √(2/(πt)) (cos(t − rπ/4) + O(1/t))

Hence

f(X(t)) − f* = O(‖x₀ − x*‖²/t^r)
Generalized Nesterov’s Scheme: Composite Optimization
(This part is skipped in the talk.)

min_{x∈Rⁿ} f(x) = g(x) + h(x)

where g ∈ F_L for some L > 0 and h is convex on Rⁿ, possibly taking the extended value ∞. Define the proximal subgradient

G_s(x) := (x − argmin_z {‖z − (x − s∇g(x))‖²/(2s) + h(z)}) / s
New Restart Scheme
Restart Scheme: Previous Works
Restart: erase the memory of previous iterations and reset the momentum back to zero.

Function scheme: restart whenever

f(x_k) > f(x_{k−1})

Gradient scheme: restart whenever

∇f(y_{k−1})ᵀ(x_k − x_{k−1}) > 0

Reference: Adaptive Restart for Accelerated Gradient Schemes, Brendan O'Donoghue and Emmanuel Candès, 2012.
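A sketch of both heuristics grafted onto iterations (1)-(2); resetting the local iteration counter k is one implementation choice for "resetting the momentum":

```python
import numpy as np

def nag_adaptive_restart(grad_f, f, x0, s, iters, scheme="function"):
    """NAG with the function or gradient restart heuristic."""
    x_prev, y, k = x0.copy(), x0.copy(), 1
    for _ in range(iters):
        g = grad_f(y)
        x = y - s * g
        restart = (f(x) > f(x_prev)) if scheme == "function" else (g @ (x - x_prev) > 0)
        if restart:
            k = 1  # momentum coefficient (k-1)/(k+2) drops back to 0
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
        k += 1
    return x
```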
New Restart Scheme: Speed Restart
This work provides a new restarting strategy, called the speed restarting scheme. The underlying motivation is to maintain a relatively high velocity Ẋ along the trajectory.

Definition 5.1
For the ODE (3) with X(0) = x₀, Ẋ(0) = 0, let

T = T(f, x₀) = sup{t > 0 : ∀u ∈ (0, t), d‖Ẋ(u)‖²/du > 0}

be the speed restarting time.

In words, T is the first time the speed ‖Ẋ‖ decreases. Indeed, f(X(t)) is decreasing before time T: for t < T,

df(X(t))/dt = ⟨∇f(X), Ẋ⟩ = −(3/t)‖Ẋ‖² − (1/2) d‖Ẋ‖²/dt < 0
Accelerating to linear convergence by restarting
The speed restarted ODE is thus

Ẍ(t) + (3/t_sr)Ẋ(t) + ∇f(X(t)) = 0   (7)

where t_sr is reset to zero whenever ⟨Ẋ, Ẍ⟩ = 0. We have the following observations:

X^sr(t) is continuous for t ≥ 0, with X^sr(0) = x₀.
X^sr(t) satisfies (3) for 0 < t < T₁ := T(f, x₀).
Recursively define T_{i+1} = T(f, X^sr(T₁ + ... + Tᵢ)) for i ≥ 1; on each interval between restarts, t ↦ X^sr(T₁ + ... + Tᵢ + t) satisfies (3) with initial position X^sr(T₁ + ... + Tᵢ) and zero initial velocity.
Accelerating to linear convergence by restarting
Lemma 5.2
There is a universal constant C > 0 such that

f(X(T)) − f(x*) ≤ (1 − Cµ/L)(f(x₀) − f*)

This guarantees that each restart reduces the error by a constant factor.

Lemma 5.3
There is a universal constant C such that

T ≤ 4 exp(CL/µ) / (5√L)

This is an upper bound for T; it confirms that restarts occur sufficiently often.
Accelerating to linear convergence by restarting
Applying Lemmas 5.2 and 5.3, we have

Theorem (5.1)
There exist positive constants c₁ and c₂, depending only on the condition number L/µ, such that for any f ∈ S_{µ,L},

f(X^sr(t)) − f(x*) ≤ (c₁L‖x₀ − x*‖²/2) exp(−c₂ t√L)

where c₁ = exp(Cµ/L) and c₂ = (5Cµ/(4L)) exp(−CL/µ). The theorem guarantees linear convergence of the solution to (7); this is a new result in the literature.
Numerical examples: algorithm of speed restarting
Below we present a discrete analog of the restarted scheme.
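A sketch of the discrete analog: restart whenever the step length ‖x_k − x_{k−1}‖, a proxy for the speed ‖Ẋ‖, stops increasing (the exact stopping rule and parameters here are illustrative assumptions, not necessarily the paper's listing):

```python
import numpy as np

def speed_restart_nag(grad_f, x0, s, iters):
    """NAG with speed restarting: reset momentum when the step length shrinks."""
    x_prev, y, k = x0.copy(), x0.copy(), 1
    last_step = 0.0
    for _ in range(iters):
        x = y - s * grad_f(y)
        step = np.linalg.norm(x - x_prev)
        if step < last_step:        # discrete analog of d||X'||^2/dt < 0
            k, last_step = 1, 0.0   # restart: momentum coefficient back to 0
        else:
            last_step = step
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
        k += 1
    return x
```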
Quadratic

Log-sum-exp

Matrix completion

Lasso in ℓ₁-constrained form with large sparse design
References
W. Su, S. Boyd, E. J. Candès (2014). A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.
thanks!