ELE604/ELE704 Optimization - Hacettepe


ELE604/ELE704 Optimization

Unconstrained Optimization

http://www.ee.hacettepe.edu.tr/∼usezen/ele604/

Dr. Umut Sezen & Dr. Cenk Toker, Department of Electrical and Electronic Engineering

Hacettepe University

Umut Sezen & Cenk Toker (Hacettepe University) ELE604 Optimization 22-Nov-2016 1 / 120

Contents

Unconstrained Optimization

Unconstrained Minimization

Descent Methods

Motivation

General Descent Method

Line Search

Exact Line Search

Bisection Algorithm

Backtracking Line Search

Convergence

Gradient Descent (GD) Method

Gradient Descent Method

Convergence Analysis

Conv. of GD with Exact Line Search

Conv. of GD with Backtracking Line Search

Examples

Steepest Descent (SD) Method

Preliminary Definitions

Steepest Descent Method

Steepest Descent for different norms

Euclidean Norm

Quadratic Norm

L1-norm

Choice of norm

Convergence Analysis

Examples

Conjugate Gradient (CG) Method

Introduction

Conjugate Directions

Descent Properties of the Conjugate Gradient Method

The Conjugate Gradient Method

Extension to Nonquadratic Problems

Newton's Method (NA)

The Newton Step

Interpretation of the Newton Step

The Newton Decrement

Newton's Method

Convergence Analysis

Examples

Approximation of the Hessian


Unconstrained Optimization Unconstrained Minimization

Unconstrained Minimization

I The aim is

min f(x)

where f(x) : RN → R is twice differentiable.

I The problem is solvable, i.e., a finite optimal point x∗ exists.

I The optimal value (finite) is given by

p∗ = inf_x f(x) = f(x∗) (> −∞)


I Example 1: Quadratic program

min_{x∈RN} f(x) = (1/2) xTQx − bTx + c

where Q ∈ RN×N is symmetric, b ∈ RN and c ∈ R.

Necessary conditions:

∇f(x∗) = Qx∗ − b = 0

H(x∗) = Q ⪰ 0 (PSD)

- Q ≺ 0 ⇒ f(x) has no local minimum.

- Q ≻ 0 ⇒ x∗ = Q−1b is the unique global minimum.

- Q ⪰ 0 (singular) ⇒ either no solution or an infinite number of solutions.
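The Q ≻ 0 case above can be checked numerically. A minimal NumPy sketch (the particular Q, b and c are made-up illustration values, not from the slides): solve Qx∗ = b and verify that the gradient vanishes there.

```python
import numpy as np

# Made-up positive definite instance of Example 1.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric with positive eigenvalues, so Q > 0
b = np.array([1.0, 0.0])
c = 3.0

def f(x):
    return 0.5 * x @ Q @ x - b @ x + c

# Q > 0: the unique global minimum solves Q x* = b.
x_star = np.linalg.solve(Q, b)

grad_at_min = Q @ x_star - b   # should be the zero vector
```

Perturbing x_star in any direction should only increase f, consistent with H(x∗) = Q ⪰ 0.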


I Example 2: Consider

min_{x1,x2∈R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1

Here, let us first express the above equation in the quadratic program form with

Q = [α γ; −γ β],  b = [1; 0]

where γ ∈ R; for simplicity we can take γ = 0. So,

- If α > 0 and β > 0 (i.e., Q ≻ 0): x∗ = (1/α, 0) is the unique global minimum.

- If α > 0 and β = 0 (i.e., Q ⪰ 0): an infinite number of solutions, {(1/α, y), y ∈ R}.

- If α = 0 and β > 0 (i.e., Q ⪰ 0): no solution.

- If α < 0 and β > 0, or α > 0 and β < 0 (i.e., Q is indefinite): no solution.

[Figure: surface plots of f(x1, x2) over x1, x2 ∈ [−10, 10] for the four cases α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; and α > 0, β < 0.]

I Two possibilities:

- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.

- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided ‖x∗‖ ≠ ∞.

Then, unconstrained minimization methods

- produce a sequence of points x(k) ∈ dom f(x), k = 0, 1, . . ., with

f(x(k))→ p∗

- can be interpreted as iterative methods for solving the optimality condition

∇f(x∗) = 0


Descent Methods Motivation

Motivation

I If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

f(x− α∇f(x)) < f(x) ∀α ∈ (0, δ)

I If d makes an angle with ∇f(x) that is greater than 90◦, i.e.,

∇T f(x)d < 0

∃ an interval (0, δ) of stepsizes such that

f(x + αd) < f(x) ∀α ∈ (0, δ)


I Definition: The descent direction d is selected such that

∇T f(x)d < 0

I Proposition: For a descent method

f(x(k+1)) < f(x(k))

except when x(k) = x∗.

I Definition: The minimizing sequence is defined as

x(k+1) = x(k) + α(k)d(k)

where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d(k) ∈ RN is the step or search direction.

- How to find the optimum α(k)? A line search algorithm.

- How to find the optimum d(k)? Depends on the descent algorithm, e.g., d = −∇f(x(k)).


Descent Methods General Descent Method

General Descent Method

I Given a starting point x(0) ∈ dom f(x)

repeat

1. Determine a descent direction d(k),

2. Line search: Choose a stepsize α(k) > 0,

3. Update: x(k+1) = x(k) + α(k)d(k),

until stopping criterion is satisfied.

I Example 3: Simplest method: Gradient Descent

x(k+1) = x(k) − α(k)∇f(x(k)), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −∇f(x(k)).
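The general descent loop above can be sketched compactly, with the direction and stepsize rules passed in as functions. This is an illustrative sketch, not from the slides; the test function f(x) = ‖x‖² and the constant stepsize 0.25 are made-up choices.

```python
import numpy as np

def descent(x0, direction, stepsize, grad, tol=1e-8, max_iter=1000):
    """General descent loop: direction d(k), stepsize alpha(k), update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        d = direction(x, g)           # 1. determine a descent direction
        alpha = stepsize(x, d)        # 2. choose a stepsize
        x = x + alpha * d             # 3. update
    return x

# Example 3's gradient descent on f(x) = ||x||^2 (grad f(x) = 2x),
# with a constant stepsize.
x_min = descent(np.array([3.0, -4.0]),
                direction=lambda x, g: -g,
                stepsize=lambda x, d: 0.25,
                grad=lambda x: 2.0 * x)
```

Each iterate here is halved (x(k+1) = 0.5 x(k)), so the loop converges to the minimizer at the origin.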


I Example 4: Most sophisticated method: Newton's Method

x(k+1) = x(k) − α(k)H−1(x(k))∇f(x(k)), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −H−1(x(k))∇f(x(k)).


Descent Methods Line Search

Line Search

I Suppose f(x) is a continuously differentiable convex function and we want to find

α(k) = argmin_α f(x(k) + αd(k))

for a given descent direction d(k). Now, let

h(α) = f(x(k) + αd(k))

where h(α) : R → R is a convex function in the scalar variable α; then the problem becomes

α(k) = argmin_α h(α)

Then, as h(α) is convex, it has a minimum at

h′(α(k)) = ∂h(α(k))/∂α = 0


where h′(α) is given by

h′(α) = ∂h(α)/∂α = ∇T f(x(k) + αd(k)) d(k) (using the chain rule)

Therefore, since d(k) is a descent direction (i.e., ∇T f(x(k))d(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotone increasing function of α because h(α) is convex. Hence, search for h′(α(k)) = 0.


Choice of stepsize:

I Constant stepsize: α(k) = c (constant)

I Diminishing stepsize: α(k) → 0, while satisfying

∑_{k=0}^{∞} α(k) = ∞

I Exact line search (analytic):

α(k) = argmin_α f(x(k) + αd(k))


Exact line search: (for quadratic programs)

I If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.,

h(α) = f(x(k) + αd(k))
     = f(x(k)) + α∇T f(x(k))d(k) + (α²/2) d(k)T H(x(k)) d(k)

The exact line search solution α0 which minimizes the quadratic equation above, i.e., ∂h(α0)/∂α = 0, is given by

α0 = α(k) = − (∇T f(x(k))d(k)) / (d(k)T H(x(k)) d(k))

- If f(x) is a higher-order function, then a second-order Taylor series approximation can be used for the exact line search algorithm (which then gives an approximate solution).
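The closed-form step α0 above is one line of code. A small NumPy sketch on a made-up quadratic f(x) = (1/2)xTHx − bTx (the specific H, b and starting point are illustration values):

```python
import numpy as np

def exact_step_quadratic(grad_x, d, H):
    """alpha0 = -grad^T d / (d^T H d) for a quadratic f with constant Hessian H."""
    return -(grad_x @ d) / (d @ H @ d)

# f(x) = 0.5 x^T H x - b^T x, a made-up instance.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
g = H @ x - b                    # gradient of f at x
d = -g                           # gradient-descent direction
alpha0 = exact_step_quadratic(g, d, H)
x_new = x + alpha0 * d
```

At the returned point, h′(α0) = ∇T f(x_new) d should vanish, which is exactly the exact line search optimality condition.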


Bisection Algorithm:

I Assume h(α) is convex; then h′(α) is a monotonically increasing function. Suppose that we know a value ᾱ such that h′(ᾱ) > 0.

- Since h′(0) < 0, α = (0 + ᾱ)/2 is the next test point.

- If h′(α) = 0, then α(k) = α is found (very difficult to achieve exactly).

- If h′(α) > 0, narrow down the search interval to (0, α).

- If h′(α) < 0, narrow down the search interval to (α, ᾱ).


Algorithm:

1. Set k = 0, αℓ = 0 and αu = ᾱ.

2. Set α = (αℓ + αu)/2 and calculate h′(α).

3. If h′(α) > 0 ⇒ αu = α and k = k + 1, go to step 2.

4. If h′(α) < 0 ⇒ αℓ = α and k = k + 1, go to step 2.

5. If h′(α) = 0 ⇒ stop.


Proposition: After every iteration, the current interval [αℓ, αu] contains α∗, where h′(α∗) = 0.

Proposition: At the k-th iteration, the length of the current interval is

L = (1/2)^k ᾱ

Proposition: A value of α such that |α − α∗| < ε can be found in at most

⌈log2(ᾱ/ε)⌉

steps.

I How to find ᾱ such that h′(ᾱ) > 0?

1. Make an initial guess of ᾱ.

2. If h′(ᾱ) < 0 ⇒ ᾱ = 2ᾱ, go to step 2.

3. Stop.
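The doubling phase and the bisection loop above fit together as one short routine. A sketch under the stated assumptions (h convex, h′(0) < 0); the example h(α) = (α − 3)², with minimizer α∗ = 3, is a made-up test case:

```python
def bisection_line_search(hprime, eps=1e-10, max_iter=200):
    """Bisection on h'(alpha), assuming h is convex and h'(0) < 0."""
    # Doubling phase: find alpha_bar with h'(alpha_bar) > 0.
    alpha_bar = 1.0
    while hprime(alpha_bar) < 0:
        alpha_bar *= 2.0
    lo, hi = 0.0, alpha_bar
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        slope = hprime(mid)
        if abs(slope) <= eps:     # stopping criterion 3 (|h'(alpha)| small)
            break
        if slope > 0:             # narrow to the left half
            hi = mid
        else:                     # narrow to the right half
            lo = mid
    return mid

# h(alpha) = (alpha - 3)^2, so h'(alpha) = 2(alpha - 3); minimizer is 3.
alpha_star = bisection_line_search(lambda a: 2.0 * (a - 3.0))
```

Each pass halves the interval, matching the L = (1/2)^k ᾱ proposition.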


I Stopping criterion for the Bisection Algorithm: h′(α) → 0 as k → ∞, but it may not converge quickly.

Some relevant stopping criteria:

1. Stop after k = K iterations (K: user defined).

2. Stop when |αu − αℓ| ≤ ε (ε: user defined).

3. Stop when |h′(α)| ≤ ε (ε: user defined).

In general, the 3rd criterion is the best.


Backtracking line search

For small enough α:

f(x0 + αd) ≈ f(x0) + α∇T f(x0)d < f(x0) + γα∇T f(x0)d

where 0 < γ < 0.5, since ∇T f(x0)d < 0.


I Algorithm: Backtracking line search

Given a descent direction d for f(x) at x0 ∈ dom f(x)

α = 1

while f(x0 + αd) > f(x0) + γα∇T f(x0)d

α = βα

end

where 0 < γ < 0.5 and 0 < β < 1.

- At each iteration the step size α is reduced by the factor β (β ≈ 0.1: coarse search, β ≈ 0.8: fine search).

- γ can be interpreted as the fraction of the decrease in f(x) predicted by linear extrapolation (typically γ ∈ [0.01, 0.3], meaning that we accept a decrease in f(x) between 1% and 30% of the predicted decrease).
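The while-loop above translates directly into code. A minimal sketch; the quadratic test function, the point x0 and its gradient are made-up illustration values:

```python
import numpy as np

def backtracking(f, grad_x, x0, d, gamma=0.3, beta=0.8):
    """Backtracking line search with parameters 0 < gamma < 0.5, 0 < beta < 1."""
    alpha = 1.0
    slope = grad_x @ d                 # negative for a descent direction
    while f(x0 + alpha * d) > f(x0) + gamma * alpha * slope:
        alpha *= beta                  # shrink the step by the factor beta
    return alpha

# Illustrative problem: f(x) = x1^2 + 10 x2^2 at x0 = (1, 1).
f = lambda x: x[0]**2 + 10.0 * x[1]**2
x0 = np.array([1.0, 1.0])
g = np.array([2.0, 20.0])              # gradient of f at x0
d = -g                                 # gradient-descent direction
alpha = backtracking(f, g, x0, d)
```

The loop is guaranteed to terminate because the exit inequality holds for all sufficiently small α.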


- The backtracking exit inequality

f(x0 + αd) ≤ f(x0) + γα∇T f(x0)d

holds for α ∈ [0, α0]. Then, the line search stops with a step length α where

i. α = 1 if α0 ≥ 1,

ii. otherwise α ∈ (βα0, α0].

In other words, the step length obtained by backtracking line search satisfies

α ≥ min{1, βα0}.


Descent Methods Convergence

I Convergence

Definition: Let ‖·‖ be a norm on RN. Let {x(k)}∞k=0 be a sequence of vectors in RN. Then, the sequence {x(k)}∞k=0 is said to converge to a limit x∗ if

∀ε > 0, ∃Nε ∈ Z+ : (k ∈ Z+ and k ≥ Nε) ⇒ (‖x(k) − x∗‖ < ε)

If the sequence {x(k)}∞k=0 converges to x∗, then we write

lim_{k→∞} x(k) = x∗

and call x∗ the limit of the sequence {x(k)}∞k=0.

- Nε may depend on ε.

- For a distance ε, after Nε iterations, all the subsequent iterates are within this distance ε of x∗.

This definition does not characterize how fast the convergence is (i.e., the rate of convergence).


I Rate of Convergence

Definition: Let ‖·‖ be a norm on RN. A sequence {x(k)}∞k=0 that converges to x∗ ∈ RN is said to converge at rate R ∈ R++ with rate constant δ ∈ R++ if

lim_{k→∞} ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖^R = δ

- If R = 1 and 0 < δ < 1, then the rate is linear.

- If 1 < R < 2 and 0 < δ < ∞, then the rate is called super-linear.

- If R = 2 and 0 < δ < ∞, then the rate is called quadratic.

The rate of convergence R is sometimes called the asymptotic convergence rate. It may not apply to the early iterates, but applies asymptotically as k → ∞.


Example 5: The sequence {a^k}∞k=0, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(k+1) − 0‖ / ‖a^k − 0‖¹ = a ⇒ R = 1, δ = a

Example 6: The sequence {a^(2^k)}∞k=0, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(2^(k+1)) − 0‖ / ‖a^(2^k) − 0‖² = 1 ⇒ R = 2, δ = 1
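Both rates are easy to observe numerically. The sketch below (a = 0.5 is an arbitrary constant in (0, 1)) computes the ratio from the rate definition for each sequence:

```python
# Example 5: x(k) = a^k. Ratio ||x(k+1)|| / ||x(k)||^1 -> a (linear rate).
a = 0.5
lin = [a**k for k in range(1, 20)]
lin_ratios = [lin[k + 1] / lin[k] for k in range(len(lin) - 1)]

# Example 6: x(k) = a^(2^k). Ratio ||x(k+1)|| / ||x(k)||^2 -> 1 (quadratic rate).
quad = [a**(2**k) for k in range(1, 6)]
quad_ratios = [quad[k + 1] / quad[k]**2 for k in range(len(quad) - 1)]
```

The linear-rate ratios stay at a = 0.5, while the quadratic-rate ratios equal 1 at every step, since a^(2^(k+1)) = (a^(2^k))² exactly.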


Gradient Descent (GD) Method Gradient Descent Method

Gradient Descent Method

I First-order Taylor series expansion at x0 gives us

f(x0 + αd) ≈ f(x0) + α∇T f(x0)d.

This approximation is valid for α‖d‖ → 0.

I We want to choose d so that ∇T f(x0)d is as small (as negative) as possible for maximum descent.

I If we normalize d, i.e., ‖d‖ = 1, then the normalized direction

d = −∇f(x0)/‖∇f(x0)‖

makes the smallest inner product with ∇f(x0).

I Then, the unnormalized direction

d = −∇f(x0)

is called the direction of gradient descent (GD) at the point x0.

I d is a descent direction as long as ∇f(x0) ≠ 0.


I Algorithm: Gradient Descent Algorithm

Given a starting point x(0) ∈ dom f(x)

repeat

1. d(k) = −∇f(x(k))

2. Line search: Choose step size α(k) via a line search algorithm

3. Update: x(k+1) = x(k) + α(k)d(k)

until stopping criterion is satisfied

- A typical stopping criterion is ‖∇f(x)‖ < ε, ε→ 0 (small)
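Putting the GD algorithm together with backtracking line search and the ‖∇f(x)‖ < ε stopping criterion gives a complete method. A sketch; the ill-conditioned-looking test function f(x) = 0.5(x1² + 10 x2²) and the parameter values are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(f, grad, x0, eps=1e-6, gamma=0.3, beta=0.8, max_iter=5000):
    """GD with backtracking line search; stops when ||grad f(x)|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:    # typical stopping criterion
            break
        d = -g                         # 1. gradient descent direction
        alpha = 1.0                    # 2. backtracking line search
        while f(x + alpha * d) > f(x) + gamma * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d              # 3. update
    return x

# Illustrative problem: f(x) = 0.5 (x1^2 + 10 x2^2), minimum at the origin.
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x_min = gradient_descent(f, grad, [5.0, 5.0])
```

Here m = 1 and M = 10, so the method converges linearly, as analyzed next.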


Gradient Descent (GD) Method Convergence Analysis

I Convergence Analysis

Assume the Hessian matrix H(x) is bounded for all x ∈ dom f(x) as follows:

1. mI ⪯ H(x), i.e.,

(H(x) − mI) ⪰ 0

yT H(x) y ≥ m‖y‖², ∀y ∈ RN

2. H(x) ⪯ MI, i.e.,

(MI − H(x)) ⪰ 0

yT H(x) y ≤ M‖y‖², ∀y ∈ RN


Note that the condition number of a matrix is given by the ratio of the largest and the smallest eigenvalues, e.g.,

κ(H(x)) = |max λi / min λi| = M/m

If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy. If the condition number is large, then the matrix is said to be ill-conditioned.
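For a concrete number, the condition number of a diagonal Hessian is just M/m. A two-line NumPy check (the diagonal entries are made-up values):

```python
import numpy as np

# Condition number of a constant Hessian H = diag(m, M) with m = 1, M = 100.
H = np.diag([1.0, 100.0])
eigs = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
kappa = abs(eigs[-1] / eigs[0])   # |max eig / min eig| = M/m = 100
```

A κ of 100 already puts this Hessian in the "moderately ill-conditioned" range discussed in the examples later.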


I Lower Bound: mI ⪯ H(x)

For x, y ∈ dom f(x),

f(y) = f(x) + ∇T f(x)(y − x) + (1/2)(y − x)T H(z)(y − x)

for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,

f(y) ≥ f(x) + ∇T f(x)(y − x) + (m/2)‖y − x‖²

- If m = 0, then the inequality characterizes convexity.

- If m > 0, then we have a better lower bound for f(y).

The right-hand side is convex in y. Its minimum is achieved at

y0 = x − (1/m)∇f(x)

Then,

f(y) ≥ f(x) + ∇T f(x)(y0 − x) + (m/2)‖y0 − x‖²
     ≥ f(x) − (1/(2m))‖∇f(x)‖²

∀y ∈ dom f.


When y = x∗,

f(x∗) = p∗ ≥ f(x) − (1/(2m))‖∇f(x)‖²

- This gives a stopping criterion: the suboptimality is bounded as

f(x) − p∗ ≤ (1/(2m))‖∇f(x)‖²


I Upper Bound: H(x) ⪯ MI

For any x, y ∈ dom f(x), using a derivation similar to the lower bound, we arrive at

f(y) ≤ f(x) + ∇T f(x)(y − x) + (M/2)‖y − x‖²

Then, minimizing both sides over y,

f(x∗) = p∗ ≤ f(x) − (1/(2M))‖∇f(x)‖²


Gradient Descent (GD) Method Conv. of GD with Exact Line Search

I Convergence of GD using exact line search

For the exact line search, let us use the second-order approximation for f(x(k+1)):

f(x(k+1)) = f(x(k) − α∇f(x(k)))
          ≅ f(x(k)) − α‖∇f(x(k))‖² + (α²/2) ∇T f(x(k)) H(x(k)) ∇f(x(k))

This expression is quadratic in α.

Normally, the exact line search solution α0 which minimizes the quadratic equation above is given by

α0 = (∇T f(x(k)) ∇f(x(k))) / (∇T f(x(k)) H(x(k)) ∇f(x(k)))


- However, let us use the upper bound H(x(k)) ⪯ MI in the second-order approximation for the convergence analysis:

f(x(k+1)) ≤ f(x(k)) − α‖∇f(x(k))‖² + (Mα²/2)‖∇f(x(k))‖²

Find α′0 such that this upper bound of f(x(k) − α∇f(x(k))) is minimized over α.

The upper bound (i.e., the right-hand side) is quadratic in α, hence minimized for

α′0 = 1/M

with the minimum value

f(x(k)) − (1/(2M))‖∇f(x(k))‖²

Then, for α′0,

f(x(k+1)) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖²


Subtract p∗ from both sides:

f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − (1/(2M))‖∇f(x(k))‖²

We know that

f(x(k)) − p∗ ≤ (1/(2m))‖∇f(x(k))‖² ⇒ ‖∇f(x(k))‖² ≥ 2m(f(x(k)) − p∗)

Then, substituting this result into the above inequality,

f(x(k+1)) − p∗ ≤ (f(x(k)) − p∗) − (m/M)(f(x(k)) − p∗)
              = (1 − m/M)(f(x(k)) − p∗)

or

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ (1 − m/M) = c < 1 (since 0 < m ≤ M)


- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence.

- The upper limit of the rate constant is (1 − m/M).

I Number of steps? Apply the above inequality recursively:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

i.e., f(x(k)) → p∗ as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.

- If m = M ⇒ c = 0, then convergence occurs in one iteration.

- If m ≪ M ⇒ c → 1, then convergence is slow.


(f(x(k)) − p∗) ≤ ε is achieved after at most

K = log([f(x(0)) − p∗]/ε) / log(1/c)

iterations.

- The numerator is small when the initial point is close to x∗ (K gets smaller).

- The numerator increases as the accuracy increases (i.e., ε decreases) (K gets larger).

- The denominator scales approximately linearly with m/M (the reciprocal of the condition number), since c = (1 − m/M) gives log(1/c) = −log(1 − m/M) ≈ m/M (using log(x) ≈ log(x0) + (1/x0)(x − x0) − (1/(2x0²))(x − x0)² + · · · with x0 = 1).

- Well-conditioned Hessian, m/M → 1 ⇒ the denominator is large (K gets smaller).

- Ill-conditioned Hessian, m/M → 0 ⇒ the denominator is small (K gets larger).
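The bound K makes the effect of conditioning concrete. A short sketch comparing a well-conditioned and an ill-conditioned case (the initial gap 100 and target ε = 1e−6 are hypothetical numbers):

```python
import math

def gd_iteration_bound(gap0, eps, m, M):
    """K = log((f(x0) - p*)/eps) / log(1/c), with c = 1 - m/M (exact line search)."""
    c = 1.0 - m / M
    return math.ceil(math.log(gap0 / eps) / math.log(1.0 / c))

# Hypothetical numbers: initial suboptimality 100, target accuracy 1e-6.
k_well = gd_iteration_bound(100.0, 1e-6, m=1.0, M=2.0)     # kappa = 2
k_ill = gd_iteration_bound(100.0, 1e-6, m=1.0, M=1000.0)   # kappa = 1000
```

With κ = 2 the bound is a few dozen iterations; with κ = 1000 it grows into the tens of thousands, consistent with log(1/c) ≈ m/M.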


Gradient Descent (GD) Method Conv. of GD with Backtracking Line Search

I Convergence of GD using backtracking line search

The backtracking exit condition

f(x(k) − α∇f(x(k))) ≤ f(x(k)) − γα‖∇f(x(k))‖²

is satisfied when α ∈ [βα0, α0], where α0 ≥ 1/M.

Backtracking line search terminates either with α = 1 or with α ≥ β/M, which gives a lower bound on the decrease:

1. f(x(k+1)) ≤ f(x(k)) − γ‖∇f(x(k))‖² if α = 1

2. f(x(k+1)) ≤ f(x(k)) − (βγ/M)‖∇f(x(k))‖² if α ≥ β/M


If we put these inequalities (1 & 2) together,

f(x(k+1)) ≤ f(x(k)) − min{γ, βγ/M}‖∇f(x(k))‖²

Similar to the analysis of exact line search, subtract p∗ from both sides:

f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − γ min{1, β/M}‖∇f(x(k))‖²

But we know that ‖∇f(x(k))‖² ≥ 2m(f(x(k)) − p∗); then

f(x(k+1)) − p∗ ≤ (1 − 2mγ min{1, β/M})(f(x(k)) − p∗)

Finally,

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ (1 − 2mγ min{1, β/M}) = c < 1


- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence.

- The rate constant is c < 1:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

Thus, k → ∞ ⇒ c^k → 0, so convergence is guaranteed.


Gradient Descent (GD) Method Examples

Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).

Example 7: (quadratic problem in R2) Replace γ with σ.


Example 8: (nonquadratic problem in R2) Replace α and t with γ and α.


Example 9: (problem in R100) Replace α and t with γ and α.


Example 10: (Condition number) Replace γ, α and t with σ, γ and α.


Observations:

- The gradient descent algorithm is simple.

- The gradient descent method often exhibits approximately linear convergence.

- The choice of the backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).

- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more) the gradient method is so slow that it is useless in practice.

- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.


Steepest Descent (SD) Method Preliminary Definitions

I Dual Norm: Let ‖·‖ denote any norm on RN. Then the dual norm, denoted by ‖·‖∗, is the function from RN to R with values

‖x‖∗ = sup{yTx : ‖y‖ ≤ 1}

The above definition also yields a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y → xTy; and it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x in RN and α ≥ 0.

I By definition of the dual norm,

xTy ≤ ‖x‖ · ‖y‖∗

This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.

I The dual of the dual norm above is the original norm.


- The norm dual to the Euclidean norm is itself; this follows directly from the Cauchy-Schwarz inequality:

‖x‖2∗ = ‖x‖2

- The norm dual to the L∞-norm is the L1-norm, and vice versa:

‖x‖∞∗ = ‖x‖1 and ‖x‖1∗ = ‖x‖∞

- More generally, the dual of the Lp-norm is the Lq-norm,

‖x‖p∗ = ‖x‖q

where 1/p + 1/q = 1, i.e., q = p/(p − 1).
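The L1/L∞ duality above is easy to verify directly: over the L1 unit ball, yTx is maximized at a signed standard basis vector, so the supremum equals ‖x‖∞. A small sketch with a made-up test vector:

```python
import numpy as np

# For the L1 norm, sup{ y^T x : ||y||_1 <= 1 } is attained at a signed
# standard basis vector, so the dual norm equals ||x||_inf.
x = np.array([0.3, -2.0, 1.1])      # arbitrary test vector

candidates = []
for i in range(len(x)):
    y = np.zeros(len(x))
    y[i] = 1.0 if x[i] >= 0 else -1.0   # vertex of the L1 unit ball
    candidates.append(y @ x)            # y^T x at that vertex
dual_l1 = max(candidates)               # = max_i |x_i|
```

Here dual_l1 equals ‖x‖∞ = 2.0, confirming ‖x‖1∗ = ‖x‖∞ for this vector.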


I Quadratic norm: A generalized quadratic norm of x is defined by

‖x‖P = (xTPx)^(1/2) = ‖P^(1/2)x‖2 = ‖Mx‖2

where P = MTM is an N × N symmetric positive definite (SPD) matrix.

I When P = I, the quadratic norm is equal to the Euclidean norm.

I The dual of the quadratic norm is given by

‖x‖P∗ = ‖x‖Q = (xTP−1x)^(1/2)

where Q = P−1.
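The identity ‖x‖P = ‖P^(1/2)x‖2 and the dual norm with P−1 can be checked numerically. A sketch with a made-up SPD matrix (the symmetric square root is built from an eigendecomposition):

```python
import numpy as np

P = np.array([[2.0, 0.4],
              [0.4, 1.0]])               # made-up SPD matrix
w, V = np.linalg.eigh(P)
P_half = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root P^(1/2)

x = np.array([1.0, -2.0])
norm_P = np.sqrt(x @ P @ x)              # (x^T P x)^(1/2)
norm_alt = np.linalg.norm(P_half @ x)    # ||P^(1/2) x||_2, the same value
norm_dual = np.sqrt(x @ np.linalg.inv(P) @ x)   # dual norm uses P^(-1)
```

With y = x, the generalized Cauchy-Schwarz inequality xTy ≤ ‖x‖P ‖y‖P∗ should also hold.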


Steepest Descent (SD) Method Steepest Descent Method

Steepest Descent Method

I The first-order Taylor series approximation of f(x(k) + αd) around x(k) is

f(x(k) + αd) ≈ f(x(k)) + α∇T f(x(k))d.

This approximation is valid for α‖d‖2 → 0.

I We want to choose d so that ∇T f(x(k))d is as small (as negative) as possible for maximum descent.

I First normalize d to obtain the normalized steepest descent direction (nsd) dnsd:

dnsd = argmin{∇T f(x(k))d : ‖d‖ = 1}

where ‖·‖ is any norm on RN and ‖·‖∗ is its dual norm. The choice of norm is very important.

I It is also convenient to consider the unnormalized steepest descent direction (sd)

dsd = ‖∇f(x(k))‖∗ dnsd


I Then, for the steepest descent step, we have

∇T f(x)dsd = ‖∇f(x)‖∗ ∇T f(x)dnsd = −‖∇f(x)‖²∗

since ∇T f(x)dnsd = −‖∇f(x)‖∗.

I Algorithm: Steepest Descent Algorithm

Given a starting point x(0) ∈ dom f(x)

repeat

1. Compute the steepest descent direction d(k)sd

2. Line search: Choose step size α(k) via a line search algorithm

3. Update: x(k+1) = x(k) + α(k)d(k)sd

until stopping criterion is satisfied


Steepest Descent (SD) Method Steepest Descent for different norms

I Steepest Descent for different norms:

- Euclidean norm: As ‖·‖2∗ = ‖·‖2, and with x0 = x(k), the steepest descent direction is the negative gradient, i.e.,

dsd = −∇f(x0)

For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.


- Quadratic norm: For a quadratic norm ‖·‖P, and with x0 = x(k), the normalized descent direction is given by

dnsd = −P−1∇f(x0)/‖∇f(x0)‖P∗ = −P−1∇f(x0)/(∇T f(x0)P−1∇f(x0))^(1/2)

As ‖∇f(x)‖P∗ = ‖P−1/2∇f(x)‖2, we obtain

dsd = −P−1∇f(x0)
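The direction dsd = −P−1∇f(x0) is one linear solve. A sketch with hypothetical values (a diagonal P, e.g. chosen to approximate H(x∗), and a made-up gradient):

```python
import numpy as np

# Steepest descent direction in the quadratic norm: d_sd = -P^(-1) grad f(x0).
P = np.array([[10.0, 0.0],
              [0.0, 1.0]])        # hypothetical metric, e.g. P ~ H(x*)
grad = np.array([10.0, 1.0])      # gradient of f at the current point

d_sd = -np.linalg.solve(P, grad)  # avoids forming P^(-1) explicitly
```

Compared with −grad, the P-norm direction rescales each coordinate by the metric, here giving d_sd = (−1, −1), and it remains a descent direction since ∇T f(x0) d_sd < 0.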


Change of coordinates: Let y = P1/2x; then ‖x‖P = ‖y‖2. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̄(y) : RN → R, given by

f̄(y) = f(P−1/2y) = f(x)

Apply the gradient descent method to f̄(y). The descent direction at y0 (with x0 = P−1/2y0 for the original problem) is

dy = −∇f̄(y0) = −P−1/2∇f(P−1/2y0) = −P−1/2∇f(x0)

Then the descent direction for the original problem becomes

dx = P−1/2 dy = −P−1∇f(x0)

Thus, x∗ = P−1/2y∗.

The steepest descent method in the quadratic norm ‖·‖P is equivalent to the gradient descent method applied to the problem after the coordinate transformation y = P1/2x.


- L1-norm: For the L1-norm ‖·‖1, and with x0 = x(k), the normalized descent direction is given by

dnsd = argmin{∇T f(x0)d : ‖d‖1 = 1}.

Let i be any index for which ‖∇f(x0)‖∞ = max_i |(∇f(x0))i|. Then a normalized steepest descent direction dnsd for the L1-norm is given by

dnsd = −sign(∂f(x0)/∂xi) ei

where ei is the i-th standard basis vector (i.e., the coordinate axis direction) with the steepest gradient. For example, in the figure above we have dnsd = e1.


Then, the unnormalized steepest descent direction is given by

dsd = dnsd ‖∇f(x0)‖∞ = −(∂f(x0)/∂xi) ei

The steepest descent algorithm in the L1-norm has a very natural interpretation:

- At each iteration we select the component of ∇f(x0) with maximum absolute value, and then decrease or increase the corresponding component of x0, according to the sign of (∇f(x0))i.

- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x(k) is updated at each iteration.

- This can greatly simplify, or even trivialize, the line search.
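The coordinate-descent interpretation makes the L1 step a one-liner: pick the largest-magnitude gradient component and move against it. A sketch with a made-up gradient vector:

```python
import numpy as np

def l1_sd_step(grad):
    """Unnormalized L1 steepest descent: d_sd = -(df/dx_i) e_i, i = argmax |grad_i|."""
    i = int(np.argmax(np.abs(grad)))
    d = np.zeros_like(grad)
    d[i] = -grad[i]               # only one coordinate is updated
    return d

g = np.array([0.5, -3.0, 1.0])    # example gradient (made-up values)
d = l1_sd_step(g)                 # moves only along coordinate 1
```

Since only one coordinate changes, the line search reduces to a one-dimensional problem in that coordinate.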


Choice of norm:

- The choice of norm can dramatically affect the convergence.

- The condition number of the Hessian should be close to unity for fast convergence.

- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates y = P1/2x can change the condition number.

- If an approximation of the Hessian at the optimal point, H(x∗), is known, then setting P ≅ H(x∗) will yield

P−1/2 H(x∗) P−1/2 ≅ I

resulting in a very low condition number.

- If P is chosen correctly, the ellipsoid ε = {x : xTPx ≤ 1} approximates the cost surface at the point x.

- A correct P will greatly improve the convergence, whereas a wrong choice of P will result in very poor convergence.


Steepest Descent (SD) Method Convergence Analysis

Convergence Analysis

- (Using backtracking line search.) It can be shown that any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:

‖x‖∗ ≥ η‖x‖2

- Assuming strongly convex f(x) and using H(x) ⪯ MI,

f(x(k) + αdsd) ≤ f(x(k)) + α∇T f(x(k))dsd + (Mα²/2)‖dsd‖2²
             ≤ f(x(k)) + α∇T f(x(k))dsd + (Mα²/(2η²))‖dsd‖∗²
             ≤ f(x(k)) − α‖∇f(x(k))‖∗² + (Mα²/(2η²))‖∇f(x(k))‖∗²

The right-hand side of the inequality is a quadratic function of α and has a minimum at α = η²/M. Then,

f(x(k) + αdsd) ≤ f(x(k)) − (η²/(2M))‖∇f(x(k))‖∗² ≤ f(x(k)) + (γη²/M)∇T f(x(k))dsd


Since γ < 0.5 and −‖∇f(x)‖∗² = ∇T f(x)dsd, backtracking line search will return α ≥ min{1, βη²/M}; then

f(x(k) + αdsd) ≤ f(x(k)) − γ min{1, βη²/M}‖∇f(x(k))‖∗²
             ≤ f(x(k)) − γη² min{1, βη²/M}‖∇f(x(k))‖2²

Subtracting p∗ from both sides and using ‖∇f(x(k))‖2² ≥ 2m(f(x(k)) − p∗), we have

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1

- Linear convergence:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

As k → ∞, c^k → 0, so convergence is guaranteed.


Steepest Descent (SD) Method Examples

Example 11: A steepest descent example with L1-norm.


Example 12: Consider the nonquadratic problem in R2 given in Example 8 (replace α and t with γ and α).


When P = I, i.e., gradient descent


Conjugate Gradient (CG) Method Introduction

Conjugate Gradient Method

I Can overcome the slow convergence of Gradient Descent algorithm

I Computational complexity is lower than Newton's Method.

I Can be very effective in dealing with general objective functions.

I We will first investigate the quadratic problem

min (1/2)xTQx − bTx

where Q is SPD, and then extend the solution to the general case by approximation.



Conjugate Gradient (CG) Method Conjugate Directions

Conjugate Directions

I Definition: Given a symmetric matrix Q, two vectors d1 and d2 are said to be Q-orthogonal or conjugate with respect to Q if

dT1 Qd2 = 0

- Although it is not required, we will assume that Q is SPD.

- If Q = I, then the above definition becomes the definition of orthogonality.

- A finite set of non-zero vectors d0, d1, . . . , dk is said to be a Q-orthogonal set if

dTi Qdj = 0, ∀i, j : i ≠ j


I Proposition: If Q is SPD and the set of non-zero vectors d0, d1, . . . , dk are Q-orthogonal, then these vectors are linearly independent.

Proof: Assume linear dependency and suppose ∃αi, i = 0, 1, . . . , k :

α0d0 + α1d1 + · · ·+ αkdk = 0

Multiplying with dTi Q yields

α0 dTi Qd0 + α1 dTi Qd1 + · · · + αi dTi Qdi + · · · + αk dTi Qdk = 0

Every term dTi Qdj with j ≠ i is 0 by Q-orthogonality, so the remaining term αi dTi Qdi must be 0.

But dTi Qdi > 0 (Q: PD), then αi = 0. Repeat for all αi.


I Quadratic Problem:

min (1/2)xTQx − bTx

If Q is an N × N PD matrix, then we have the unique solution

Qx∗ = b

Let d0, d1, . . . , dN−1 be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q. They are linearly independent. Then the optimum solution is given by

x∗ = α0d0 + α1d1 + · · ·+ αN−1dN−1

We can find the value of the coefficients αi by multiplying the above equation with dTi Q:

dTi Qx∗ = αi dTi Qdi

αi = dTi b / (dTi Qdi) . . . (using Qx∗ = b)

Finally the optimum solution is given by,

x∗ = ∑_{i=0}^{N−1} (dTi b / (dTi Qdi)) di
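The expansion above can be checked numerically. The sketch below builds a Q-orthogonal set by Gram-Schmidt in the Q-inner product (an assumed construction; any Q-orthogonal set works) and sums the expansion of x∗:

```python
import numpy as np

# Assumed SPD example (any SPD Q and b work).
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
N = len(b)

# Build Q-orthogonal directions d_0, ..., d_{N-1} by Gram-Schmidt in the
# Q-inner product <u, v> = u^T Q v, starting from the standard basis.
D = []
for i in range(N):
    d = np.eye(N)[i].copy()
    for dj in D:
        d -= (dj @ Q @ d) / (dj @ Q @ dj) * dj
    D.append(d)

# Expansion of the optimum: x* = sum_i (d_i^T b / d_i^T Q d_i) d_i.
x_star = sum((d @ b) / (d @ Q @ d) * d for d in D)

print(np.allclose(Q @ x_star, b))     # True: x* solves Qx* = b
print(abs(D[0] @ Q @ D[2]) < 1e-12)   # True: the directions are Q-orthogonal
```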


- αi can be found from the known vector b and matrix Q once di are found.

- The expansion of x∗ is a result of an iterative process of N steps where at the i-th step αidi is added.

I Conjugate Direction Theorem: Let {di}, i = 0, . . . , N−1, be a set of non-zero Q-orthogonal vectors. For any x(0) ∈ dom f(x), the sequence {x(k)}, k = 0, . . . , N, generated according to

x(k+1) = x(k) + α(k)dk, k ≥ 0

with

α(k) = − dTk g(k) / (dTkQdk)

and g(k) is the gradient at x(k)

g(k) = ∇f(x(k)) = Qx(k) − b

converges to the unique solution x∗ of Qx∗ = b after N steps, i.e., x(N) = x∗.


Proof: Since dk are linearly independent, we can write

x∗ − x(0) = α(0)d0 + α(1)d1 + · · ·+ α(N−1)dN−1

for some α(k). We can find α(k) by

α(k) = dTkQ(x∗ − x(0)) / (dTkQdk)  (1)

Now, the iterative steps from x(0) to x(k)

x(k) − x(0) = α(0)d0 + α(1)d1 + · · ·+ α(k−1)dk−1

and due to Q-orthogonality

dTkQ(x(k) − x(0)) = 0  (2)

Using (1) and (2) we arrive at

α(k) = dTkQ(x∗ − x(k)) / (dTkQdk) = − dTk g(k) / (dTkQdk)


Descent Properties of the Conjugate Gradient Method

I We define B(k) as the subspace of RN spanned by {d0, d1, . . . , dk−1}, i.e.,

B(k) = span {d0, d1, . . . , dk−1} ⊆ RN

We will show that at each step x(k) minimizes the objective over the k-dimensional linear variety x(0) + B(k).

I Theorem: (Expanding Subspace Theorem) Let {di}, i = 0, . . . , N−1, be non-zero Q-orthogonal vectors in RN. For any x(0) ∈ RN, the sequence

x(k+1) = x(k) + α(k)dk

α(k) = − dTk g(k) / (dTkQdk)

minimizes f(x) = (1/2)xTQx − bTx on the line

x = x(k−1) − αdk−1, −∞ < α <∞

and on x(0) + B(k).


Proof: Since x(k) ∈ x(0) + B(k), i.e., x(0) + B(k) contains the line x = x(k−1) − αdk−1, it is enough to show that x(k) minimizes f(x) over x(0) + B(k).

Since we assume that f(x) is strictly convex, the above condition holds when g(k) is orthogonal to B(k), i.e., the gradient of f(x) at x(k) is orthogonal to B(k).


- Proof of g(k) ⊥ B(k) is by induction

For k = 0, B(0) = {} (the empty set), so g(0) ⊥ B(0) holds trivially.

Now assume that g(k) ⊥ B(k), and show that g(k+1) ⊥ B(k+1).

From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that

g(k+1) = g(k) + αkQdk

Hence, by the definition of αk,

dTk g(k+1) = dTk g(k) + αk dTkQdk = 0

Also, for i < k

dTi g(k+1) = dTi g(k) + αk dTi Qdk = 0

since dTi g(k) vanishes by the induction hypothesis and dTi Qdk = 0 by Q-orthogonality.


- Corollary: The gradients g(k), k = 0, 1, . . . , N satisfy

dTi g(k) = 0

for i < k.

As the subspace expands, every iteration dk increases the dimensionality of B. Since x(k) minimizes f(x) over x(0) + B(k), x(N) is the overall minimum of f(x).


Conjugate Gradient (CG) Method The Conjugate Gradient Method

The Conjugate Gradient Method

In the conjugate direction method, select the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.

I Conjugate Gradient Algorithm:

Start at any x(0) ∈ RN and define d(0) = −g(0) = b−Qx(0)

repeat

1. α(k) = − d(k)T g(k) / (d(k)TQd(k))

2. x(k+1) = x(k) + α(k)d(k)

3. g(k+1) = Qx(k+1) − b

4. β(k) = g(k+1)TQd(k) / (d(k)TQd(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .
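The five steps above can be sketched directly; the SPD test matrix below is an assumption:

```python
import numpy as np

def conjugate_gradient(Q, b, x0):
    """CG for min (1/2) x^T Q x - b^T x with SPD Q, following steps 1-5."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x - b                      # g^(0) = Qx^(0) - b
    d = -g                             # d^(0) = -g^(0)
    for _ in range(len(b)):            # at most N steps for the quadratic case
        if np.linalg.norm(g) < 1e-12:
            break                      # solution reached early: gradient is zero
        Qd = Q @ d
        alpha = -(d @ g) / (d @ Qd)    # step 1
        x = x + alpha * d              # step 2
        g = Q @ x - b                  # step 3
        beta = (g @ Qd) / (d @ Qd)     # step 4
        d = -g + beta * d              # step 5
    return x

# Assumed test problem: a seeded random SPD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)            # SPD by construction
b = rng.standard_normal(4)

x_cg = conjugate_gradient(Q, b, np.zeros(4))
print(np.allclose(x_cg, np.linalg.solve(Q, b)))   # True after at most N = 4 steps
```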


- Algorithm terminates in at most N steps with the exact solution (for the quadratic case)

- Gradient is always linearly independent of all previous direction vectors, i.e., g(k) ⊥ B(k), where B(k) = span {d0, d1, . . . , dk−1}

- If solution is reached before N steps, the gradient is zero

- Very simple formula; computational complexity is slightly higher than the gradient descent algorithm

- The process makes uniform progress toward the solution at every step. Important for the nonquadratic case.


Example 13: Consider the quadratic problem

min (1/2)xTQx − bTx

where Q = [3 2; 2 6] and b = [2; −8].

Solution is given by
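As a numeric sketch of this example (the starting point x(0) = 0 is an assumption), the CG recursion reaches the optimum x∗ = (2, −2) in N = 2 steps:

```python
import numpy as np

Q = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.zeros(2)                 # x^(0) = 0 (assumed starting point)
g = Q @ x - b                   # g^(0) = -b
d = -g                          # d^(0)
for k in range(2):              # N = 2, so at most two steps
    Qd = Q @ d
    alpha = -(d @ g) / (d @ Qd)
    x = x + alpha * d
    g = Q @ x - b
    beta = (g @ Qd) / (d @ Qd)
    d = -g + beta * d

print(x)    # [ 2. -2.]: indeed Qx* = b holds for x* = (2, -2)
```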


CG Summary

I In theory (with exact arithmetic) converges to solution in N steps

- The bad news: due to numerical round-off errors, can take more than N steps (or fail to converge)

- The good news: with luck (i.e., good spectrum of Q), can get a good approximate solution in ≪ N steps

I Compared to direct (factor-solve) methods, CG is less reliable and data dependent; often requires a good (problem-dependent) preconditioner

I But, when it works, can solve extremely large systems


Conjugate Gradient (CG) Method Extension to Nonquadratic Problems

Extension to Nonquadratic Problems

I Idea is simple. We have two loops:

- Outer loop approximates the problem with a quadratic one
- Inner loop runs the conjugate gradient method (CGM) for the approximation

i.e., for the neighbourhood of point x0

f(x) ∼= f(x0) + ∇T f(x0)(x − x0) + (1/2)(x − x0)TH(x0)(x − x0) + residual

where the first three terms form a quadratic function and the residual → 0.

- Expanding,

f(x) ∼= (1/2)xTH(x0)x + (∇T f(x0) − xT0 H(x0))x + f(x0) + (1/2)xT0 H(x0)x0 − ∇T f(x0)x0

where the last three terms are independent of x, i.e., constant.

Thus,

min f(x) ≡ min (1/2)xTH(x0)x + (∇T f(x0) − xT0 H(x0))x

≡ min (1/2)xTQx − bTx


I Here,

Q = H(x0)

bT = −∇T f(x0) + xT0 H(x0)

The gradient g(k) is

g(k) = Qx(k) − b

= H(x0)x0 + ∇f(x0) − H(x0)x0 . . . (at x(k) = x0)

= ∇f(x0)


I Nonquadratic Conjugate Gradient Algorithm:

Starting at any x(0) ∈ RN , compute g(0) = ∇f(x(0)) and set d(0) = −g(0)

repeat

repeat

1. α(k) = − d(k)T g(k) / (d(k)TH(x(k))d(k))

2. x(k+1) = x(k) + α(k)d(k)

3. g(k+1) = ∇f(x(k+1))

4. β(k) = g(k+1)TH(x(k))d(k) / (d(k)TH(x(k))d(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .

new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0).

until stopping criterion is satisfied


- No line search is required.

- H(x(k)) must be evaluated at each point, which can be impractical.

- Algorithm may not be globally convergent.

I Involvement of H(x(k)) can be avoided by employing a line search algorithm for α(k) and slightly modifying β(k)


I Nonquadratic Conjugate Gradient Algorithm with Line-search:

Starting at any x(0) ∈ RN , compute g(0) = ∇f(x(0)) and set d(0) = −g(0)

repeat

repeat

1. Line search: α(k) = argmin_α f(x(k) + αd(k))

2. Update: x(k+1) = x(k) + α(k)d(k)

3. Gradient: g(k+1) = ∇f(x(k+1))

4. Use

Fletcher-Reeves method: β(k) = g(k+1)T g(k+1) / (g(k)T g(k)), or

Polak-Ribiere method: β(k) = (g(k+1) − g(k))T g(k+1) / (g(k)T g(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .

new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0).

until stopping criterion is satisfied
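A minimal sketch of this algorithm, with the exact line search of step 1 replaced by backtracking and a descent safeguard added (both are assumed simplifications), using the Polak-Ribiere formula; the test function below is of the same form as the R2 example referred to above:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, N, outer_iters=200):
    """Nonquadratic CG, restarted every N inner steps. The exact line search
    of step 1 is replaced by backtracking (assumed parameters 1e-4, 0.5)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:          # stopping criterion
            break
        d = -g                                # d^(0) = -g^(0) at each restart
        for _ in range(N):                    # inner loop, k = 0, ..., N-1
            if g @ d >= 0:                    # safeguard: keep d a descent dir.
                d = -g
            alpha = 1.0
            while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
                alpha *= 0.5                  # backtracking line search
            x = x + alpha * d
            g_new = grad(x)
            beta = ((g_new - g) @ g_new) / (g @ g)   # Polak-Ribiere
            d = -g_new + beta * d
            g = g_new
    return x

# Assumed test function (same form as the classic R^2 example).
f = lambda x: (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
               + np.exp(-x[0] - 0.1))
grad = lambda x: np.array([
    np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
    - np.exp(-x[0] - 0.1),
    3*np.exp(x[0] + 3*x[1] - 0.1) - 3*np.exp(x[0] - 3*x[1] - 0.1),
])

x_min = nonlinear_cg(f, grad, np.array([0.5, 0.5]), N=2)
print(x_min)   # approaches the minimizer (-log(2)/2, 0)
```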


- Polak-Ribiere method can be superior to the Fletcher-Reeves method.

- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.


Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.


Newton's Method (NA) The Newton Step

The Newton Step

I In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x(k),

f(x(k+1)) = f(x(k)) + ∇T f(x(k))∆x + (1/2)∆xTH(x(k))∆x + residual

where ∆x = x(k+1) − x(k) and the first three terms form the quadratic approximation of f(x(k+1)). Find ∆x = ∆xnt such that this quadratic approximation is minimized.

I Quadratic approximation optimum step ∆xnt (found by solving ∂f(x(k+1))/∂∆x = 0)

∆xnt = −H−1(x(k))∇f(x(k))

is called the Newton step, which is a descent direction, i.e.,

∇T f(x(k))∆xnt = −∇T f(x(k))H−1(x(k))∇f(x(k)) < 0


I Then

x(k+1) = x(k) + ∆xnt
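In practice the Newton step is computed by solving the linear system H(x(k))∆x = −∇f(x(k)) rather than forming the inverse explicitly. A sketch on an assumed test function, also checking the descent property above:

```python
import numpy as np

# Assumed test function: f(x) = exp(x0 + x1) + x0^2 + 2*x1^2 (strictly convex).
def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2.0 * x[0], e + 4.0 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 4.0]])

x = np.array([1.0, -0.5])

# Newton step: solve H(x) dx = -grad f(x); numerically preferable to H^{-1}.
dx_nt = np.linalg.solve(hess(x), -grad(x))

# Descent check: grad^T dx_nt = -grad^T H^{-1} grad < 0 whenever grad != 0.
print(grad(x) @ dx_nt)   # negative
```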


Interpretation of the Newton Step

1. Minimizer of second-order approximation

As given on the previous slide, ∆xnt minimizes the quadratic approximation of f(x) in the neighbourhood of x(k).

- If f(x) is quadratic, then x(0) + ∆xnt is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.

- If f(x) is nearly quadratic, then x + ∆xnt is a very good estimate of the minimizer of f(x), x∗.

- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x∗, i.e., when x is very close to x∗, the point x + ∆xnt is a very good estimate of x∗.


2. Steepest Descent Direction in Hessian Norm

- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian, i.e.,

‖v‖H(x(k)) = (vTH(x(k))v)^(1/2)

- In the steepest descent method, the quadratic norm ‖ · ‖P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x∗, P = H(x∗) is a very good choice.

- In Newton's method, when x is near x∗, we have H(x) ≈ H(x∗).


3. Solution of Linearized Optimality Condition

- First-order optimality condition

∇f(x∗) = 0

near x∗ (using the first-order Taylor approximation for ∇f(x + ∆x))

∇f(x + ∆x) ≈ ∇f(x) + H(x)∆x = 0

with the solution

∆xnt = −H−1(x)∇f(x)


Newton's Method (NA) The Newton Decrement

The Newton Decrement

I The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement

λ(x) = ‖∆xnt‖H(x) = (∆xTntH(x)∆xnt)^(1/2)

I It can be used as a stopping criterion since it is an estimate of f(x) − p∗, i.e.,

f(x) − inf_y f̂(y) = f(x) − f̂(x + ∆xnt) = (1/2)λ²(x)

where

f̂(x + ∆xnt) = f(x) + ∇T f(x)∆xnt + (1/2)∆xTntH(x)∆xnt

i.e., f̂ is the second-order quadratic approximation of f(x) at x.


Substituting the quadratic approximation f̂(x + ∆xnt) into f(x) − inf_y f̂(y) with

∆xnt = −H−1(x)∇f(x)

gives

(1/2)∇T f(x)H−1(x)∇f(x) = (1/2)λ²(x)

I So, if λ²(x)/2 < ε for some small ε, the algorithm can be terminated.

I With the substitution of ∆xnt = −H−1(x)∇f(x), the Newton decrement can also be written as

λ(x(k)) = (∇T f(x(k))H−1(x(k))∇f(x(k)))^(1/2)


Newton's Method (NA) Newton's Method

Newton's Method

I Given a starting point x(0) ∈ dom f(x) and some small tolerance ε > 0

repeat

1. Compute the Newton step and Newton decrement

∆x(k) = −H−1(x(k))∇f(x(k))

λ(x(k)) = (∇T f(x(k))H−1(x(k))∇f(x(k)))^(1/2)

2. Stopping criterion, quit if λ2(x(k))/2 ≤ ε.

3. Line search: Choose a stepsize α(k) > 0, e.g., by backtracking line search.

4. Update: x(k+1) = x(k) + α(k)∆x(k).
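Steps 1-4 can be sketched as follows; the backtracking parameters (0.3 and 0.5) and the test function are assumptions:

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-10, max_iters=50):
    """Damped Newton's method following steps 1-4 above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)          # 1. Newton step (solve, not invert)
        lam2 = -(g @ dx)                     # 1. lambda^2 = grad^T H^{-1} grad
        if lam2 / 2 <= eps:                  # 2. stopping criterion
            break
        alpha = 1.0                          # 3. backtracking line search
        while f(x + alpha * dx) > f(x) + 0.3 * alpha * (g @ dx):
            alpha *= 0.5
        x = x + alpha * dx                   # 4. update
    return x

# Assumed smooth, strongly convex test function with minimizer (0, 0).
f = lambda x: x[0]**2 + np.exp(x[1]) - x[1]
grad = lambda x: np.array([2.0 * x[0], np.exp(x[1]) - 1.0])
hess = lambda x: np.diag([2.0, np.exp(x[1])])

x_min = newton(f, grad, hess, [1.0, 2.0])
print(x_min)   # approximately [0, 0] after a handful of iterations
```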


I The stepsize α(k) (i.e., line search) is required for the non-quadratic initial parts of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.

I As x(k) gets closer to x∗, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α(k) is no longer required; the line search algorithm will automatically set α(k) = 1.

I If we start with α(k) = 1 and keep it the same, then the algorithm is called the pure Newton's method.

I For an arbitrary f(x), there are two regions of convergence.

- damped Newton phase, when x is far from x∗

- quadratically convergent phase, when x gets closer to x∗

I If we let H(x) = I, the algorithm reduces to gradient descent (GD)

x(k+1) = x(k) − α(k)∇f(x(k))


I If H(x) is not positive definite, Newton's method will not converge.

So, use (aI + H(x))−1 instead of H−1(x); this is also known as (a.k.a.) the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.

a is a trade-off between GD and NA:

- a → ∞ ⇒ Gradient Descent (GD)

- a → 0 ⇒ Newton's Method (NA)
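A small numeric sketch of why the modification helps; the indefinite Hessian, the gradient, and the value a = 3 below are assumed:

```python
import numpy as np

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite Hessian (eigenvalues 1, -2)
g = np.array([0.1, 1.0])                  # gradient at the current point

# Plain Newton direction: not a descent direction for this indefinite H.
d_nt = -np.linalg.solve(H, g)
print(g @ d_nt)        # 0.49 > 0: an ascent direction, Newton's method fails here

# Marquardt modification: a I + H with a large enough to be PD. Here a = 3;
# any a > 2 works since the most negative eigenvalue of H is -2.
a = 3.0
d_m = -np.linalg.solve(a * np.eye(2) + H, g)
print(g @ d_m)         # negative: a descent direction
```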


I The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for non-singular T ∈ RN×N, let

x = Ty and f̃(y) = f(Ty)

then

∇f̃(y) = TT∇f(x)

H̃(y) = TTH(x)T

- So, the Newton step will be

∆ynt = −H̃−1(y)∇f̃(y) = −(TTH(x)T)−1(TT∇f(x)) = −T−1H−1(x)∇f(x) = T−1∆xnt

i.e.,

x + ∆xnt = T(y + ∆ynt), ∀x


- Similarly, the Newton decrement will be

λ(y) = (∇T f̃(y)H̃−1(y)∇f̃(y))^(1/2)

= ((∇T f(x)T)(TTH(x)T)−1(TT∇f(x)))^(1/2)

= (∇T f(x)H−1(x)∇f(x))^(1/2)

= λ(x)

I Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).
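This invariance can be verified numerically on an assumed quadratic f(x) = (1/2)xTQx − bTx with f̃(y) = f(Ty); all matrices below are example values:

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # SPD Hessian of f
b = np.array([1.0, -1.0])
T = np.array([[1.0, 2.0], [0.0, 1.0]])    # non-singular coordinate change x = Ty

x = np.array([1.0, 1.0])
y = np.linalg.solve(T, x)                 # the same point in y-coordinates

g_x = Q @ x - b                           # grad f(x); H(x) = Q
g_y = T.T @ g_x                           # grad ftilde(y) = T^T grad f(x)
H_y = T.T @ Q @ T                         # Htilde(y) = T^T H(x) T

dx_nt = -np.linalg.solve(Q, g_x)          # Newton step in x-coordinates
dy_nt = -np.linalg.solve(H_y, g_y)        # Newton step in y-coordinates

print(np.allclose(T @ dy_nt, dx_nt))      # True: dy_nt = T^{-1} dx_nt

lam_x = np.sqrt(g_x @ np.linalg.solve(Q, g_x))
lam_y = np.sqrt(g_y @ np.linalg.solve(H_y, g_y))
print(np.isclose(lam_x, lam_y))           # True: the decrement is invariant
```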


Newton's Method (NA) Convergence Analysis

Convergence Analysis

Read Boyd, Section 9.5.3.

I Assume a strongly convex f(x) with mI ⪯ H(x) for a constant m, ∀x ∈ dom f(x), and H(x) Lipschitz continuous on dom f(x), i.e.,

‖H(x)−H(y)‖2 ≤ L ‖x− y‖2

for a constant L > 0. This inequality imposes a bound on the third derivative of f(x).

If L is small, f(x) is closer to a quadratic function. If L is large, f(x) is far from a quadratic function. If L = 0, then f(x) is quadratic.

Thus, L measures how well f(x) can be approximated by a quadratic function.

- Newton's Method will perform well for small L.


Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that

I Damped Newton Phase: ‖∇f(x)‖2 ≥ η

- α < 1 gives better solutions, so most iterations will require line search, e.g., backtracking line search.

- As k increases, the function value decreases by at least σ, but the convergence is not necessarily quadratic.

- This phase ends after at most (f(x(0)) − p∗)/σ iterations

I Quadratically Convergent Phase

‖∇f(x)‖2 < η

- All iterations use α = 1 (i.e., the quadratic approximation suits very well).

- ‖∇f(x(k+1))‖2 / ‖∇f(x(k))‖2² ≤ L/(2m²), i.e., quadratic convergence.


- For small ε > 0, f(x) − p∗ < ε is achieved after at most

log2 log2 (ε0/ε)

iterations where ε0 = 2m³/L². This is typically 5-6 iterations.

- Number of iterations is bounded above by

(f(x(0)) − p∗)/σ + log2 log2 (ε0/ε)

where σ and ε0 depend on m, L and x(0).


NA Summary

I Convergence of Newton's method is rapid in general, and quadratic near x∗. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

I Newton's method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.

I Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R10000 is similar to its performance on problems in R10, with only a modest increase in the number of steps required.

I The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.


I The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.

I Other alternatives (called quasi-Newton methods) are also provided by a family of algorithms for unconstrained optimization. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton's method, such as rapid convergence near x∗.


Newton's Method (NA) Examples

Example 15: Consider the nonquadratic problem in R2 given in Example 8 and Example 12 (replace α and t with γ and α).


Example 16: Consider the nonquadratic problem in R100 given in Example 9 (replace α and t with γ and α).


Example 17: (problem in R10000) Replace α and t with γ and α.


Newton's Method (NA) Approximation of the Hessian

Approximation of the Hessian

For relatively large-scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly. So, we may use some approximation of the inverse Hessian,

S(x) ∼= H−1(x)

x(k+1) = x(k) − α(k)S(x(k))∇f(x(k))

1. Hybrid GD + NA

We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, first we can run GD, which has considerably low complexity, and after satisfying some conditions, we can switch to the NA.

Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x∗.

The hybrid method (given on the next slide) also guarantees global convergence.


I Hybrid Algorithm

- Start at x(0) ∈ dom f(x)

repeat

run GD (i.e., S(x(k)) = I)

until stopping criterion is satis�ed

- Start at the final point of GD

repeat

run NA with exact H(x) (i.e., S(x(k)) = H−1(x(k)))

until stopping criterion is satis�ed


2. The Chord Method

If f(x) is close to a quadratic function, we may use S(x(k)) = H−1(x(0)) throughout the iterations, i.e.,

∆x(k) = −H−1(x(0))∇f(x(k))

x(k+1) = x(k) + ∆x(k)

This is also the same as the SD algorithm with P = H(x(0)) and α(k) = 1.


3. The Shamanski Method

Updating the Hessian every N iterations may give better performance, i.e.,

S(x(k)) = H−1(x(⌊k/N⌋N))

∆x(k) = −H−1(x(⌊k/N⌋N))∇f(x(k))

x(k+1) = x(k) + ∆x(k)

This is a trade-off between the Chord method (N ← ∞) and the full NA (N ← 1).
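The Chord and Shamanski updates can be sketched with a single refresh parameter; the test function, iteration counts, and the α(k) = 1 choice are assumptions:

```python
import numpy as np

# Assumed test function: f(x) = exp(x0) - x0 + exp(x1) + x1^2 (strongly convex).
grad = lambda x: np.array([np.exp(x[0]) - 1.0, np.exp(x[1]) + 2.0 * x[1]])
hess = lambda x: np.diag([np.exp(x[0]), np.exp(x[1]) + 2.0])

def approx_newton(x0, refresh, iters=60):
    """Refresh the Hessian every `refresh` iterations: refresh = 1 gives the
    full NA, refresh larger than `iters` gives the Chord method."""
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        if k % refresh == 0:
            H = hess(x)                        # S(x) uses the last refresh point
        x = x - np.linalg.solve(H, grad(x))    # step with alpha = 1 (assumed)
    return x

x_newton = approx_newton([0.5, 0.5], refresh=1)    # full Newton iteration
x_chord = approx_newton([0.5, 0.5], refresh=100)   # Chord: Hessian fixed at x(0)

print(np.linalg.norm(grad(x_newton)))   # ~0
print(np.linalg.norm(grad(x_chord)))    # ~0: slower per step, still converges here
```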


4. Approximating Particular Terms

Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero

- If some entries of H(x) are small or below a small threshold, then set them to zero, obtaining an approximation H̃(x). Thus, H̃(x) becomes sparse.

- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining H̃(x). Thus, H̃(x) becomes diagonal, which is very easy to invert.

There are also other advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon-Fletcher-Powell (DFP) methods.
