ELE604/ELE704 Optimization - Hacettepe


ELE604/ELE704 Optimization

Unconstrained Optimization

http://www.ee.hacettepe.edu.tr/∼usezen/ele604/

Dr. Umut Sezen & Dr. Cenk Toker, Department of Electrical and Electronic Engineering

Hacettepe University

Umut Sezen & Cenk Toker (Hacettepe University) ELE604 Optimization 22-Nov-2016 1 / 120

Contents

Unconstrained Optimization

Unconstrained Minimization

Descent Methods

Motivation

General Descent Method

Line Search

Exact Line Search

Bisection Algorithm

Backtracking Line Search

Convergence

Gradient Descent (GD) Method

Gradient Descent Method

Convergence Analysis

Conv. of GD with Exact Line Search

Conv. of GD with Backtracking Line Search

Examples

Steepest Descent (SD) Method

Preliminary Definitions

Steepest Descent Method

Steepest Descent for different norms

Euclidean Norm

Quadratic Norm

L1-norm

Choice of norm

Convergence Analysis

Examples

Conjugate Gradient (CG) Method

Introduction

Conjugate Directions

Descent Properties of the Conjugate Gradient Method

The Conjugate Gradient Method

Extension to Nonquadratic Problems

Newton's Method (NA)

The Newton Step

Interpretation of the Newton Step

The Newton Decrement

Newton's Method

Convergence Analysis

Examples

Approximation of the Hessian


Unconstrained Optimization Unconstrained Minimization

Unconstrained Minimization

I The aim is

min f(x)

where f(x) : RN → R is twice differentiable.

I The problem is solvable, i.e., a finite optimal point x∗ exists.

I The optimal value (finite) is given by

p∗ = inf_x f(x) = f(x∗) (> −∞)


I Example 1: Quadratic program

min_{x∈RN} f(x) = (1/2) xTQx − bTx + c

where Q ∈ RN×N is symmetric, b ∈ RN and c ∈ R.

Necessary conditions:

∇f(x∗) = Qx∗ − b = 0

H(x∗) = Q ⪰ 0 (PSD)

- Q ≺ 0 ⇒ f(x) has no local minimum.

- Q ≻ 0 ⇒ x∗ = Q−1b is the unique global minimum.

- Q ⪰ 0 (singular) ⇒ either no solution or an infinite number of solutions.
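The Q ≻ 0 case above can be checked numerically. A minimal NumPy sketch (the particular Q, b and c are made-up illustration values, not from the slides): solve Qx∗ = b and verify that the gradient vanishes there.

```python
import numpy as np

# Made-up positive definite instance of Example 1.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric with positive eigenvalues, so Q > 0
b = np.array([1.0, 0.0])
c = 3.0

def f(x):
    return 0.5 * x @ Q @ x - b @ x + c

# Q > 0: the unique global minimum solves Q x* = b.
x_star = np.linalg.solve(Q, b)

grad_at_min = Q @ x_star - b   # should be the zero vector
```

Perturbing x_star in any direction should only increase f, consistent with H(x∗) = Q ⪰ 0.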


I Example 2: Consider

min_{x1,x2∈R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1

Here, let us first express the above equation in the quadratic program form with

Q = [α γ; −γ β],  b = [1; 0]

where γ ∈ R; for simplicity we can take γ = 0. So,

- If α > 0 and β > 0 (i.e., Q ≻ 0): x∗ = (1/α, 0) is the unique global minimum.

- If α > 0 and β = 0 (i.e., Q ⪰ 0): an infinite number of solutions, {(1/α, y), y ∈ R}.

- If α = 0 and β > 0 (i.e., Q ⪰ 0): no solution.

- If α < 0 and β > 0, or α > 0 and β < 0 (i.e., Q is indefinite): no solution.

[Figure: surface plots of f(x1, x2) over x1, x2 ∈ [−10, 10] for the four cases α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; and α > 0, β < 0.]

I Two possibilities:

- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.

- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided ‖x∗‖ ≠ ∞.

Then, unconstrained minimization methods

- produce a sequence of points x(k) ∈ dom f(x), k = 0, 1, . . ., with

f(x(k))→ p∗

- can be interpreted as iterative methods for solving the optimality condition

∇f(x∗) = 0


Descent Methods Motivation

Motivation

I If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

f(x− α∇f(x)) < f(x) ∀α ∈ (0, δ)

I If d makes an angle with ∇f(x) that is greater than 90◦, i.e.,

∇T f(x)d < 0

∃ an interval (0, δ) of stepsizes such that

f(x + αd) < f(x) ∀α ∈ (0, δ)


I Definition: The descent direction d is selected such that

∇T f(x)d < 0

I Proposition: For a descent method

f(x(k+1)) < f(x(k))

except when x(k) = x∗.

I Definition: The minimizing sequence is defined as

x(k+1) = x(k) + α(k)d(k)

where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d(k) ∈ RN is the step or search direction.

- How to find the optimum α(k)? A line search algorithm.

- How to find the optimum d(k)? Depends on the descent algorithm, e.g., d = −∇f(x(k)).


Descent Methods General Descent Method

General Descent Method

I Given a starting point x(0) ∈ dom f(x)

repeat

1. Determine a descent direction d(k),

2. Line search: Choose a stepsize α(k) > 0,

3. Update: x(k+1) = x(k) + α(k)d(k),

until stopping criterion is satisfied.

I Example 3: Simplest method: Gradient Descent

x(k+1) = x(k) − α(k)∇f(x(k)), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −∇f(x(k)).
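The general descent loop above can be sketched compactly, with the direction and stepsize rules passed in as functions. This is an illustrative sketch, not from the slides; the test function f(x) = ‖x‖² and the constant stepsize 0.25 are made-up choices.

```python
import numpy as np

def descent(x0, direction, stepsize, grad, tol=1e-8, max_iter=1000):
    """General descent loop: direction d(k), stepsize alpha(k), update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        d = direction(x, g)           # 1. determine a descent direction
        alpha = stepsize(x, d)        # 2. choose a stepsize
        x = x + alpha * d             # 3. update
    return x

# Example 3's gradient descent on f(x) = ||x||^2 (grad f(x) = 2x),
# with a constant stepsize.
x_min = descent(np.array([3.0, -4.0]),
                direction=lambda x, g: -g,
                stepsize=lambda x, d: 0.25,
                grad=lambda x: 2.0 * x)
```

Each iterate here is halved (x(k+1) = 0.5 x(k)), so the loop converges to the minimizer at the origin.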


I Example 4: Most sophisticated method: Newton's Method

x(k+1) = x(k) − α(k)H−1(x(k))∇f(x(k)), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −H−1(x(k))∇f(x(k)).


Descent Methods Line Search

Line Search

I Suppose f(x) is a continuously differentiable convex function and we want to find

α(k) = argmin_α f(x(k) + αd(k))

for a given descent direction d(k). Now, let

h(α) = f(x(k) + αd(k))

where h(α) : R → R is a convex function in the scalar variable α; then the problem becomes

α(k) = argmin_α h(α)

Then, as h(α) is convex, it has a minimum at

h′(α(k)) = ∂h(α(k))/∂α = 0


where h′(α) is given by

h′(α) = ∂h(α)/∂α = ∇T f(x(k) + αd(k)) d(k) (using the chain rule)

Therefore, since d(k) is a descent direction (i.e., ∇T f(x(k))d(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotone increasing function of α because h(α) is convex. Hence, search for h′(α(k)) = 0.


Choice of stepsize:

I Constant stepsize: α(k) = c (constant)

I Diminishing stepsize: α(k) → 0, while satisfying

∑_{k=0}^{∞} α(k) = ∞

I Exact line search (analytic):

α(k) = argmin_α f(x(k) + αd(k))


Exact line search: (for quadratic programs)

I If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.,

h(α) = f(x(k) + αd(k))
     = f(x(k)) + α∇T f(x(k))d(k) + (α²/2) d(k)T H(x(k)) d(k)

The exact line search solution α0 which minimizes the quadratic equation above, i.e., ∂h(α0)/∂α = 0, is given by

α0 = α(k) = − (∇T f(x(k))d(k)) / (d(k)T H(x(k)) d(k))

- If f(x) is a higher-order function, then a second-order Taylor series approximation can be used for the exact line search algorithm (which then gives an approximate solution).
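The closed-form step α0 above is one line of code. A small NumPy sketch on a made-up quadratic f(x) = (1/2)xTHx − bTx (the specific H, b and starting point are illustration values):

```python
import numpy as np

def exact_step_quadratic(grad_x, d, H):
    """alpha0 = -grad^T d / (d^T H d) for a quadratic f with constant Hessian H."""
    return -(grad_x @ d) / (d @ H @ d)

# f(x) = 0.5 x^T H x - b^T x, a made-up instance.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
g = H @ x - b                    # gradient of f at x
d = -g                           # gradient-descent direction
alpha0 = exact_step_quadratic(g, d, H)
x_new = x + alpha0 * d
```

At the returned point, h′(α0) = ∇T f(x_new) d should vanish, which is exactly the exact line search optimality condition.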


Bisection Algorithm:

I Assume h(α) is convex; then h′(α) is a monotonically increasing function. Suppose that we know a value ᾱ such that h′(ᾱ) > 0.

- Since h′(0) < 0, α = (0 + ᾱ)/2 is the next test point.

- If h′(α) = 0, then α(k) = α is found (very difficult to achieve exactly).

- If h′(α) > 0, narrow down the search interval to (0, α).

- If h′(α) < 0, narrow down the search interval to (α, ᾱ).


Algorithm:

1. Set k = 0, αℓ = 0 and αu = ᾱ.

2. Set α = (αℓ + αu)/2 and calculate h′(α).

3. If h′(α) > 0 ⇒ αu = α and k = k + 1, go to step 2.

4. If h′(α) < 0 ⇒ αℓ = α and k = k + 1, go to step 2.

5. If h′(α) = 0 ⇒ stop.


Proposition: After every iteration, the current interval [αℓ, αu] contains α∗, where h′(α∗) = 0.

Proposition: At the k-th iteration, the length of the current interval is

L = (1/2)^k ᾱ

Proposition: A value of α such that |α − α∗| < ε can be found in at most

⌈log2(ᾱ/ε)⌉

steps.

I How to find ᾱ such that h′(ᾱ) > 0?

1. Make an initial guess of ᾱ.

2. If h′(ᾱ) < 0 ⇒ ᾱ = 2ᾱ, go to step 2.

3. Stop.
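The doubling phase and the bisection loop above fit together as one short routine. A sketch under the stated assumptions (h convex, h′(0) < 0); the example h(α) = (α − 3)², with minimizer α∗ = 3, is a made-up test case:

```python
def bisection_line_search(hprime, eps=1e-10, max_iter=200):
    """Bisection on h'(alpha), assuming h is convex and h'(0) < 0."""
    # Doubling phase: find alpha_bar with h'(alpha_bar) > 0.
    alpha_bar = 1.0
    while hprime(alpha_bar) < 0:
        alpha_bar *= 2.0
    lo, hi = 0.0, alpha_bar
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        slope = hprime(mid)
        if abs(slope) <= eps:     # stopping criterion 3 (|h'(alpha)| small)
            break
        if slope > 0:             # narrow to the left half
            hi = mid
        else:                     # narrow to the right half
            lo = mid
    return mid

# h(alpha) = (alpha - 3)^2, so h'(alpha) = 2(alpha - 3); minimizer is 3.
alpha_star = bisection_line_search(lambda a: 2.0 * (a - 3.0))
```

Each pass halves the interval, matching the L = (1/2)^k ᾱ proposition.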


I Stopping criterion for the Bisection Algorithm: h′(α) → 0 as k → ∞, but it may not converge quickly.

Some relevant stopping criteria:

1. Stop after k = K iterations (K: user defined).

2. Stop when |αu − αℓ| ≤ ε (ε: user defined).

3. Stop when |h′(α)| ≤ ε (ε: user defined).

In general, the 3rd criterion is the best.


Backtracking line search

For small enough α:

f(x0 + αd) ≈ f(x0) + α∇T f(x0)d < f(x0) + γα∇T f(x0)d

where 0 < γ < 0.5, since ∇T f(x0)d < 0.


I Algorithm: Backtracking line search

Given a descent direction d for f(x) at x0 ∈ dom f(x)

α = 1

while f(x0 + αd) > f(x0) + γα∇T f(x0)d

α = βα

end

where 0 < γ < 0.5 and 0 < β < 1.

- At each iteration the step size α is reduced by the factor β (β ≈ 0.1: coarse search, β ≈ 0.8: fine search).

- γ can be interpreted as the fraction of the decrease in f(x) predicted by linear extrapolation (typically γ ∈ [0.01, 0.3], meaning that we accept a decrease in f(x) between 1% and 30% of the predicted decrease).
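The while-loop above translates directly into code. A minimal sketch; the quadratic test function, the point x0 and its gradient are made-up illustration values:

```python
import numpy as np

def backtracking(f, grad_x, x0, d, gamma=0.3, beta=0.8):
    """Backtracking line search with parameters 0 < gamma < 0.5, 0 < beta < 1."""
    alpha = 1.0
    slope = grad_x @ d                 # negative for a descent direction
    while f(x0 + alpha * d) > f(x0) + gamma * alpha * slope:
        alpha *= beta                  # shrink the step by the factor beta
    return alpha

# Illustrative problem: f(x) = x1^2 + 10 x2^2 at x0 = (1, 1).
f = lambda x: x[0]**2 + 10.0 * x[1]**2
x0 = np.array([1.0, 1.0])
g = np.array([2.0, 20.0])              # gradient of f at x0
d = -g                                 # gradient-descent direction
alpha = backtracking(f, g, x0, d)
```

The loop is guaranteed to terminate because the exit inequality holds for all sufficiently small α.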


- The backtracking exit inequality

f(x0 + αd) ≤ f(x0) + γα∇T f(x0)d

holds for α ∈ [0, α0]. Then, the line search stops with a step length α where

i. α = 1 if α0 ≥ 1,

ii. otherwise α ∈ (βα0, α0].

In other words, the step length obtained by backtracking line search satisfies

α ≥ min{1, βα0}.


Descent Methods Convergence

I Convergence

Definition: Let ‖·‖ be a norm on RN. Let {x(k)}∞k=0 be a sequence of vectors in RN. Then, the sequence {x(k)}∞k=0 is said to converge to a limit x∗ if

∀ε > 0, ∃Nε ∈ Z+ : (k ∈ Z+ and k ≥ Nε) ⇒ (‖x(k) − x∗‖ < ε)

If the sequence {x(k)}∞k=0 converges to x∗, then we write

lim_{k→∞} x(k) = x∗

and call x∗ the limit of the sequence {x(k)}∞k=0.

- Nε may depend on ε.

- For a distance ε, after Nε iterations, all the subsequent iterates are within this distance ε of x∗.

This definition does not characterize how fast the convergence is (i.e., the rate of convergence).


I Rate of Convergence

Definition: Let ‖·‖ be a norm on RN. A sequence {x(k)}∞k=0 that converges to x∗ ∈ RN is said to converge at rate R ∈ R++ with rate constant δ ∈ R++ if

lim_{k→∞} ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖^R = δ

- If R = 1 and 0 < δ < 1, then the rate is linear.

- If 1 < R < 2 and 0 < δ < ∞, then the rate is called super-linear.

- If R = 2 and 0 < δ < ∞, then the rate is called quadratic.

The rate of convergence R is sometimes called the asymptotic convergence rate. It may not apply to the early iterates, but applies asymptotically as k → ∞.


Example 5: The sequence {a^k}∞k=0, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(k+1) − 0‖ / ‖a^k − 0‖¹ = a ⇒ R = 1, δ = a

Example 6: The sequence {a^(2^k)}∞k=0, 0 < a < 1, converges to 0.

lim_{k→∞} ‖a^(2^(k+1)) − 0‖ / ‖a^(2^k) − 0‖² = 1 ⇒ R = 2, δ = 1
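Both rates are easy to observe numerically. The sketch below (a = 0.5 is an arbitrary constant in (0, 1)) computes the ratio from the rate definition for each sequence:

```python
# Example 5: x(k) = a^k. Ratio ||x(k+1)|| / ||x(k)||^1 -> a (linear rate).
a = 0.5
lin = [a**k for k in range(1, 20)]
lin_ratios = [lin[k + 1] / lin[k] for k in range(len(lin) - 1)]

# Example 6: x(k) = a^(2^k). Ratio ||x(k+1)|| / ||x(k)||^2 -> 1 (quadratic rate).
quad = [a**(2**k) for k in range(1, 6)]
quad_ratios = [quad[k + 1] / quad[k]**2 for k in range(len(quad) - 1)]
```

The linear-rate ratios stay at a = 0.5, while the quadratic-rate ratios equal 1 at every step, since a^(2^(k+1)) = (a^(2^k))² exactly.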


Gradient Descent (GD) Method Gradient Descent Method

Gradient Descent Method

I First-order Taylor series expansion at x0 gives us

f(x0 + αd) ≈ f(x0) + α∇T f(x0)d.

This approximation is valid for α‖d‖ → 0.

I We want to choose d so that ∇T f(x0)d is as small (as negative) as possible for maximum descent.

I If we normalize d, i.e., ‖d‖ = 1, then the normalized direction

d = −∇f(x0)/‖∇f(x0)‖

makes the smallest inner product with ∇f(x0).

I Then, the unnormalized direction

d = −∇f(x0)

is called the direction of gradient descent (GD) at the point x0.

I d is a descent direction as long as ∇f(x0) ≠ 0.


I Algorithm: Gradient Descent Algorithm

Given a starting point x(0) ∈ dom f(x)

repeat

1. d(k) = −∇f(x(k))

2. Line search: Choose step size α(k) via a line search algorithm

3. Update: x(k+1) = x(k) + α(k)d(k)

until stopping criterion is satisfied

- A typical stopping criterion is ‖∇f(x)‖ < ε, ε→ 0 (small)
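Putting the GD algorithm together with backtracking line search and the ‖∇f(x)‖ < ε stopping criterion gives a complete method. A sketch; the ill-conditioned-looking test function f(x) = 0.5(x1² + 10 x2²) and the parameter values are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(f, grad, x0, eps=1e-6, gamma=0.3, beta=0.8, max_iter=5000):
    """GD with backtracking line search; stops when ||grad f(x)|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:    # typical stopping criterion
            break
        d = -g                         # 1. gradient descent direction
        alpha = 1.0                    # 2. backtracking line search
        while f(x + alpha * d) > f(x) + gamma * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d              # 3. update
    return x

# Illustrative problem: f(x) = 0.5 (x1^2 + 10 x2^2), minimum at the origin.
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x_min = gradient_descent(f, grad, [5.0, 5.0])
```

Here m = 1 and M = 10, so the method converges linearly, as analyzed next.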


Gradient Descent (GD) Method Convergence Analysis

I Convergence Analysis

Assume the Hessian matrix H(x) is bounded for all x ∈ dom f(x) as follows:

1. mI ⪯ H(x), i.e.,

(H(x) − mI) ⪰ 0

yT H(x) y ≥ m‖y‖², ∀y ∈ RN

2. H(x) ⪯ MI, i.e.,

(MI − H(x)) ⪰ 0

yT H(x) y ≤ M‖y‖², ∀y ∈ RN


Note that the condition number of a matrix is given by the ratio of the largest and the smallest eigenvalues, e.g.,

κ(H(x)) = |max λi / min λi| = M/m

If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy. If the condition number is large, then the matrix is said to be ill-conditioned.
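For a concrete number, the condition number of a diagonal Hessian is just M/m. A two-line NumPy check (the diagonal entries are made-up values):

```python
import numpy as np

# Condition number of a constant Hessian H = diag(m, M) with m = 1, M = 100.
H = np.diag([1.0, 100.0])
eigs = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
kappa = abs(eigs[-1] / eigs[0])   # |max eig / min eig| = M/m = 100
```

A κ of 100 already puts this Hessian in the "moderately ill-conditioned" range discussed in the examples later.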


I Lower Bound: mI ⪯ H(x)

For x, y ∈ dom f(x),

f(y) = f(x) + ∇T f(x)(y − x) + (1/2)(y − x)T H(z)(y − x)

for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,

f(y) ≥ f(x) + ∇T f(x)(y − x) + (m/2)‖y − x‖²

- If m = 0, then the inequality characterizes convexity.

- If m > 0, then we have a better lower bound for f(y).

The right-hand side is convex in y. Its minimum is achieved at

y0 = x − (1/m)∇f(x)

Then,

f(y) ≥ f(x) + ∇T f(x)(y0 − x) + (m/2)‖y0 − x‖²
     ≥ f(x) − (1/(2m))‖∇f(x)‖²

∀y ∈ dom f.


When y = x∗,

f(x∗) = p∗ ≥ f(x) − (1/(2m))‖∇f(x)‖²

- This gives a stopping criterion: the suboptimality is bounded as

f(x) − p∗ ≤ (1/(2m))‖∇f(x)‖²


I Upper Bound: H(x) ⪯ MI

For any x, y ∈ dom f(x), using a derivation similar to the lower bound, we arrive at

f(y) ≤ f(x) + ∇T f(x)(y − x) + (M/2)‖y − x‖²

Then, minimizing both sides over y,

f(x∗) = p∗ ≤ f(x) − (1/(2M))‖∇f(x)‖²


Gradient Descent (GD) Method Conv. of GD with Exact Line Search

I Convergence of GD using exact line search

For the exact line search, let us use the second-order approximation for f(x(k+1)):

f(x(k+1)) = f(x(k) − α∇f(x(k)))
          ≅ f(x(k)) − α‖∇f(x(k))‖² + (α²/2) ∇T f(x(k)) H(x(k)) ∇f(x(k))

This expression is quadratic in α.

Normally, the exact line search solution α0 which minimizes the quadratic equation above is given by

α0 = (∇T f(x(k)) ∇f(x(k))) / (∇T f(x(k)) H(x(k)) ∇f(x(k)))


- However, let us use the upper bound H(x(k)) ⪯ MI in the second-order approximation for the convergence analysis:

f(x(k+1)) ≤ f(x(k)) − α‖∇f(x(k))‖² + (Mα²/2)‖∇f(x(k))‖²

Find α′0 such that this upper bound of f(x(k) − α∇f(x(k))) is minimized over α.

The upper bound (i.e., the right-hand side) is quadratic in α, hence minimized for

α′0 = 1/M

with the minimum value

f(x(k)) − (1/(2M))‖∇f(x(k))‖²

Then, for α′0,

f(x(k+1)) ≤ f(x(k)) − (1/(2M))‖∇f(x(k))‖²


Subtract p∗ from both sides:

f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − (1/(2M))‖∇f(x(k))‖²

We know that

f(x(k)) − p∗ ≤ (1/(2m))‖∇f(x(k))‖² ⇒ ‖∇f(x(k))‖² ≥ 2m(f(x(k)) − p∗)

Then, substituting this result into the above inequality,

f(x(k+1)) − p∗ ≤ (f(x(k)) − p∗) − (m/M)(f(x(k)) − p∗)
              = (1 − m/M)(f(x(k)) − p∗)

or

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ (1 − m/M) = c < 1 (since 0 < m ≤ M)


- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence.

- The upper limit of the rate constant is (1 − m/M).

I Number of steps? Apply the above inequality recursively:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

i.e., f(x(k)) → p∗ as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.

- If m = M ⇒ c = 0, then convergence occurs in one iteration.

- If m ≪ M ⇒ c → 1, then convergence is slow.


(f(x(k)) − p∗) ≤ ε is achieved after at most

K = log([f(x(0)) − p∗]/ε) / log(1/c)

iterations.

- The numerator is small when the initial point is close to x∗ (K gets smaller).

- The numerator increases as the accuracy increases (i.e., ε decreases) (K gets larger).

- The denominator scales approximately linearly with m/M (the reciprocal of the condition number), since c = (1 − m/M) gives log(1/c) = −log(1 − m/M) ≈ m/M (using log(x) ≈ log(x0) + (1/x0)(x − x0) − (1/(2x0²))(x − x0)² + · · · with x0 = 1).

- Well-conditioned Hessian, m/M → 1 ⇒ the denominator is large (K gets smaller).

- Ill-conditioned Hessian, m/M → 0 ⇒ the denominator is small (K gets larger).
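The bound K makes the effect of conditioning concrete. A short sketch comparing a well-conditioned and an ill-conditioned case (the initial gap 100 and target ε = 1e−6 are hypothetical numbers):

```python
import math

def gd_iteration_bound(gap0, eps, m, M):
    """K = log((f(x0) - p*)/eps) / log(1/c), with c = 1 - m/M (exact line search)."""
    c = 1.0 - m / M
    return math.ceil(math.log(gap0 / eps) / math.log(1.0 / c))

# Hypothetical numbers: initial suboptimality 100, target accuracy 1e-6.
k_well = gd_iteration_bound(100.0, 1e-6, m=1.0, M=2.0)     # kappa = 2
k_ill = gd_iteration_bound(100.0, 1e-6, m=1.0, M=1000.0)   # kappa = 1000
```

With κ = 2 the bound is a few dozen iterations; with κ = 1000 it grows into the tens of thousands, consistent with log(1/c) ≈ m/M.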


Gradient Descent (GD) Method Conv. of GD with Backtracking Line Search

I Convergence of GD using backtracking line search

The backtracking exit condition

f(x(k) − α∇f(x(k))) ≤ f(x(k)) − γα‖∇f(x(k))‖²

is satisfied when α ∈ [βα0, α0], where α0 ≥ 1/M.

Backtracking line search terminates either with α = 1 or with α ≥ β/M, which gives a lower bound on the decrease:

1. f(x(k+1)) ≤ f(x(k)) − γ‖∇f(x(k))‖² if α = 1

2. f(x(k+1)) ≤ f(x(k)) − (βγ/M)‖∇f(x(k))‖² if α ≥ β/M


If we put these inequalities (1 & 2) together,

f(x(k+1)) ≤ f(x(k)) − min{γ, βγ/M}‖∇f(x(k))‖²

Similar to the analysis of exact line search, subtract p∗ from both sides:

f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − γ min{1, β/M}‖∇f(x(k))‖²

But we know that ‖∇f(x(k))‖² ≥ 2m(f(x(k)) − p∗); then

f(x(k+1)) − p∗ ≤ (1 − 2mγ min{1, β/M})(f(x(k)) − p∗)

Finally,

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ (1 − 2mγ min{1, β/M}) = c < 1


- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence.

- The rate constant is c < 1:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

Thus, k → ∞ ⇒ c^k → 0, so convergence is guaranteed.


Gradient Descent (GD) Method Examples

Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).

Example 7: (quadratic problem in R2) Replace γ with σ.


Example 8: (nonquadratic problem in R2) Replace α and t with γ and α.


Example 9: (problem in R100) Replace α and t with γ and α.


Example 10: (Condition number) Replace γ, α and t with σ, γ and α.


Observations:

- The gradient descent algorithm is simple.

- The gradient descent method often exhibits approximately linear convergence.

- The choice of the backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).

- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more) the gradient method is so slow that it is useless in practice.

- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.


Steepest Descent (SD) Method Preliminary Definitions

I Dual Norm: Let ‖·‖ denote any norm on RN. Then the dual norm, denoted by ‖·‖∗, is the function from RN to R with values

‖x‖∗ = sup{yTx : ‖y‖ ≤ 1}

The above definition also yields a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y → xTy; and it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x in RN and α ≥ 0.

I By definition of the dual norm,

xTy ≤ ‖x‖ · ‖y‖∗

This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.

I The dual of the dual norm above is the original norm.


- The norm dual to the Euclidean norm is itself; this follows directly from the Cauchy-Schwarz inequality:

‖x‖2∗ = ‖x‖2

- The norm dual to the L∞-norm is the L1-norm, and vice versa:

‖x‖∞∗ = ‖x‖1 and ‖x‖1∗ = ‖x‖∞

- More generally, the dual of the Lp-norm is the Lq-norm,

‖x‖p∗ = ‖x‖q

where 1/p + 1/q = 1, i.e., q = p/(p − 1).
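The L1/L∞ duality above is easy to verify directly: over the L1 unit ball, yTx is maximized at a signed standard basis vector, so the supremum equals ‖x‖∞. A small sketch with a made-up test vector:

```python
import numpy as np

# For the L1 norm, sup{ y^T x : ||y||_1 <= 1 } is attained at a signed
# standard basis vector, so the dual norm equals ||x||_inf.
x = np.array([0.3, -2.0, 1.1])      # arbitrary test vector

candidates = []
for i in range(len(x)):
    y = np.zeros(len(x))
    y[i] = 1.0 if x[i] >= 0 else -1.0   # vertex of the L1 unit ball
    candidates.append(y @ x)            # y^T x at that vertex
dual_l1 = max(candidates)               # = max_i |x_i|
```

Here dual_l1 equals ‖x‖∞ = 2.0, confirming ‖x‖1∗ = ‖x‖∞ for this vector.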


I Quadratic norm: A generalized quadratic norm of x is defined by

‖x‖P = (xTPx)^(1/2) = ‖P^(1/2)x‖2 = ‖Mx‖2

where P = MTM is an N × N symmetric positive definite (SPD) matrix.

I When P = I, the quadratic norm is equal to the Euclidean norm.

I The dual of the quadratic norm is given by

‖x‖P∗ = ‖x‖Q = (xTP−1x)^(1/2)

where Q = P−1.
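The identity ‖x‖P = ‖P^(1/2)x‖2 and the dual norm with P−1 can be checked numerically. A sketch with a made-up SPD matrix (the symmetric square root is built from an eigendecomposition):

```python
import numpy as np

P = np.array([[2.0, 0.4],
              [0.4, 1.0]])               # made-up SPD matrix
w, V = np.linalg.eigh(P)
P_half = V @ np.diag(np.sqrt(w)) @ V.T   # symmetric square root P^(1/2)

x = np.array([1.0, -2.0])
norm_P = np.sqrt(x @ P @ x)              # (x^T P x)^(1/2)
norm_alt = np.linalg.norm(P_half @ x)    # ||P^(1/2) x||_2, the same value
norm_dual = np.sqrt(x @ np.linalg.inv(P) @ x)   # dual norm uses P^(-1)
```

With y = x, the generalized Cauchy-Schwarz inequality xTy ≤ ‖x‖P ‖y‖P∗ should also hold.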


Steepest Descent (SD) Method Steepest Descent Method

Steepest Descent Method

I The first-order Taylor series approximation of f(x(k) + αd) around x(k) is

f(x(k) + αd) ≈ f(x(k)) + α∇T f(x(k))d.

This approximation is valid for α‖d‖2 → 0.

I We want to choose d so that ∇T f(x(k))d is as small (as negative) as possible for maximum descent.

I First normalize d to obtain the normalized steepest descent direction (nsd) dnsd:

dnsd = argmin{∇T f(x(k))d : ‖d‖ = 1}

where ‖·‖ is any norm on RN and ‖·‖∗ is its dual norm. The choice of norm is very important.

I It is also convenient to consider the unnormalized steepest descent direction (sd)

dsd = ‖∇f(x(k))‖∗ dnsd


I Then, for the steepest descent step, we have

∇T f(x)dsd = ‖∇f(x)‖∗ ∇T f(x)dnsd = −‖∇f(x)‖²∗

since ∇T f(x)dnsd = −‖∇f(x)‖∗.

I Algorithm: Steepest Descent Algorithm

Given a starting point x(0) ∈ dom f(x)

repeat

1. Compute the steepest descent direction d(k)sd

2. Line search: Choose step size α(k) via a line search algorithm

3. Update: x(k+1) = x(k) + α(k)d(k)sd

until stopping criterion is satisfied


Steepest Descent (SD) Method Steepest Descent for different norms

I Steepest Descent for different norms:

- Euclidean norm: As ‖·‖2∗ = ‖·‖2, and with x0 = x(k), the steepest descent direction is the negative gradient, i.e.,

dsd = −∇f(x0)

For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.


- Quadratic norm: For a quadratic norm ‖·‖P, and with x0 = x(k), the normalized descent direction is given by

dnsd = −P−1∇f(x0)/‖∇f(x0)‖P∗ = −P−1∇f(x0)/(∇T f(x0)P−1∇f(x0))^(1/2)

As ‖∇f(x)‖P∗ = ‖P−1/2∇f(x)‖2, we obtain

dsd = −P−1∇f(x0)
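The direction dsd = −P−1∇f(x0) is one linear solve. A sketch with hypothetical values (a diagonal P, e.g. chosen to approximate H(x∗), and a made-up gradient):

```python
import numpy as np

# Steepest descent direction in the quadratic norm: d_sd = -P^(-1) grad f(x0).
P = np.array([[10.0, 0.0],
              [0.0, 1.0]])        # hypothetical metric, e.g. P ~ H(x*)
grad = np.array([10.0, 1.0])      # gradient of f at the current point

d_sd = -np.linalg.solve(P, grad)  # avoids forming P^(-1) explicitly
```

Compared with −grad, the P-norm direction rescales each coordinate by the metric, here giving d_sd = (−1, −1), and it remains a descent direction since ∇T f(x0) d_sd < 0.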


Change of coordinates: Let y = P1/2x; then ‖x‖P = ‖y‖2. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̄(y) : RN → R, given by

f̄(y) = f(P−1/2y) = f(x)

Apply the gradient descent method to f̄(y). The descent direction at y0 (with x0 = P−1/2y0 for the original problem) is

dy = −∇f̄(y0) = −P−1/2∇f(P−1/2y0) = −P−1/2∇f(x0)

Then the descent direction for the original problem becomes

dx = P−1/2 dy = −P−1∇f(x0)

Thus, x∗ = P−1/2y∗.

The steepest descent method in the quadratic norm ‖·‖P is equivalent to the gradient descent method applied to the problem after the coordinate transformation y = P1/2x.


- L1-norm: For the L1-norm ‖·‖1, and with x0 = x(k), the normalized descent direction is given by

dnsd = argmin{∇T f(x0)d : ‖d‖1 = 1}.

Let i be any index for which ‖∇f(x0)‖∞ = max_i |(∇f(x0))i|. Then a normalized steepest descent direction dnsd for the L1-norm is given by

dnsd = −sign(∂f(x0)/∂xi) ei

where ei is the i-th standard basis vector (i.e., the coordinate axis direction) with the steepest gradient. For example, in the figure above we have dnsd = e1.


Then, the unnormalized steepest descent direction is given by

dsd = dnsd ‖∇f(x0)‖∞ = −(∂f(x0)/∂xi) ei

The steepest descent algorithm in the L1-norm has a very natural interpretation:

- At each iteration we select the component of ∇f(x0) with maximum absolute value, and then decrease or increase the corresponding component of x0, according to the sign of (∇f(x0))i.

- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x(k) is updated at each iteration.

- This can greatly simplify, or even trivialize, the line search.
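The coordinate-descent interpretation makes the L1 step a one-liner: pick the largest-magnitude gradient component and move against it. A sketch with a made-up gradient vector:

```python
import numpy as np

def l1_sd_step(grad):
    """Unnormalized L1 steepest descent: d_sd = -(df/dx_i) e_i, i = argmax |grad_i|."""
    i = int(np.argmax(np.abs(grad)))
    d = np.zeros_like(grad)
    d[i] = -grad[i]               # only one coordinate is updated
    return d

g = np.array([0.5, -3.0, 1.0])    # example gradient (made-up values)
d = l1_sd_step(g)                 # moves only along coordinate 1
```

Since only one coordinate changes, the line search reduces to a one-dimensional problem in that coordinate.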


Choice of norm:

- The choice of norm can dramatically affect the convergence.

- The condition number of the Hessian should be close to unity for fast convergence.

- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates y = P1/2x can change the condition number.

- If an approximation of the Hessian at the optimal point, H(x∗), is known, then setting P ≅ H(x∗) will yield

P−1/2 H(x∗) P−1/2 ≅ I

resulting in a very low condition number.

- If P is chosen correctly, the ellipsoid ε = {x : xTPx ≤ 1} approximates the cost surface at the point x.

- A correct P will greatly improve the convergence, whereas a wrong choice of P will result in very poor convergence.


Steepest Descent (SD) Method Convergence Analysis

Convergence Analysis

- (Using backtracking line search.) It can be shown that any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:

‖x‖∗ ≥ η‖x‖2

- Assuming strongly convex f(x) and using H(x) ⪯ MI,

f(x(k) + αdsd) ≤ f(x(k)) + α∇T f(x(k))dsd + (Mα²/2)‖dsd‖2²
             ≤ f(x(k)) + α∇T f(x(k))dsd + (Mα²/(2η²))‖dsd‖∗²
             ≤ f(x(k)) − α‖∇f(x(k))‖∗² + (Mα²/(2η²))‖∇f(x(k))‖∗²

The right-hand side of the inequality is a quadratic function of α and has a minimum at α = η²/M. Then,

f(x(k) + αdsd) ≤ f(x(k)) − (η²/(2M))‖∇f(x(k))‖∗² ≤ f(x(k)) + (γη²/M)∇T f(x(k))dsd


Since γ < 0.5 and −‖∇f(x)‖∗² = ∇T f(x)dsd, backtracking line search will return α ≥ min{1, βη²/M}; then

f(x(k) + αdsd) ≤ f(x(k)) − γ min{1, βη²/M}‖∇f(x(k))‖∗²
             ≤ f(x(k)) − γη² min{1, βη²/M}‖∇f(x(k))‖2²

Subtracting p∗ from both sides and using ‖∇f(x(k))‖2² ≥ 2m(f(x(k)) − p∗), we have

(f(x(k+1)) − p∗) / (f(x(k)) − p∗) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1

- Linear convergence:

f(x(k)) − p∗ ≤ c^k (f(x(0)) − p∗)

As k → ∞, c^k → 0, so convergence is guaranteed.


Steepest Descent (SD) Method Examples

Example 11: A steepest descent example with L1-norm.


Example 12: Consider the nonquadratic problem in R2 given in Example 8 (replace α and t with γ and α).


When P = I, i.e., gradient descent


Conjugate Gradient (CG) Method Introduction

Conjugate Gradient Method

I Can overcome the slow convergence of Gradient Descent algorithm

I Computational complexity is lower than Newton's Method.

I Can be very effective in dealing with general objective functions.

I We will first investigate the quadratic problem

min (1/2)xTQx − bTx

where Q is SPD, and then extend the solution to the general case by approximation.



Conjugate Gradient (CG) Method Conjugate Directions

Conjugate Directions

I Definition: Given a symmetric matrix Q, two vectors d1 and d2 are said to be Q-orthogonal or conjugate with respect to Q if

dT1 Qd2 = 0

- Although it is not required, we will assume that Q is SPD.

- If Q = I, then the above definition becomes the definition of orthogonality.

- A finite set of non-zero vectors d0, d1, . . . , dk is said to be a Q-orthogonal set if

dTi Qdj = 0, ∀i, j : i ≠ j


I Proposition: If Q is SPD and the set of non-zero vectors d0, d1, . . . , dk are Q-orthogonal, then these vectors are linearly independent.

Proof: Assume linear dependency and suppose ∃αi, i = 0, 1, . . . , k :

α0d0 + α1d1 + · · ·+ αkdk = 0

Multiplying with dTi Q yields

α0 dTi Qd0 + α1 dTi Qd1 + · · · + αi dTi Qdi + · · · + αk dTi Qdk = 0

Every term dTi Qdj with j ≠ i is 0 by Q-orthogonality, so the remaining term αi dTi Qdi must be 0.

But dTi Qdi > 0 (Q: PD), then αi = 0. Repeat for all αi.


I Quadratic Problem:

min (1/2)xTQx − bTx

If Q is an N × N PD matrix, then we have the unique solution

Qx∗ = b

Let d0, d1, . . . , dN−1 be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q. They are linearly independent. Then the optimum solution is given by

x∗ = α0d0 + α1d1 + · · ·+ αN−1dN−1

We can find the value of the coefficients αi by multiplying the above equation with dTi Q:

dTi Qx∗ = αi dTi Qdi

αi = dTi b / (dTi Qdi) . . . (using Qx∗ = b)

Finally the optimum solution is given by,

x∗ = ∑_{i=0}^{N−1} (dTi b / (dTi Qdi)) di
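The expansion above can be checked numerically. The sketch below builds a Q-orthogonal set by Gram-Schmidt in the Q-inner product (an assumed construction; any Q-orthogonal set works) and sums the expansion of x∗:

```python
import numpy as np

# Assumed SPD example (any SPD Q and b work).
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
N = len(b)

# Build Q-orthogonal directions d_0, ..., d_{N-1} by Gram-Schmidt in the
# Q-inner product <u, v> = u^T Q v, starting from the standard basis.
D = []
for i in range(N):
    d = np.eye(N)[i].copy()
    for dj in D:
        d -= (dj @ Q @ d) / (dj @ Q @ dj) * dj
    D.append(d)

# Expansion of the optimum: x* = sum_i (d_i^T b / d_i^T Q d_i) d_i.
x_star = sum((d @ b) / (d @ Q @ d) * d for d in D)

print(np.allclose(Q @ x_star, b))     # True: x* solves Qx* = b
print(abs(D[0] @ Q @ D[2]) < 1e-12)   # True: the directions are Q-orthogonal
```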


- αi can be found from the known vector b and matrix Q once di are found.

- The expansion of x∗ is a result of an iterative process of N steps where at the i-th step αidi is added.

I Conjugate Direction Theorem: Let {di}, i = 0, . . . , N−1, be a set of non-zero Q-orthogonal vectors. For any x(0) ∈ dom f(x), the sequence {x(k)}, k = 0, . . . , N, generated according to

x(k+1) = x(k) + α(k)dk, k ≥ 0

with

α(k) = − dTk g(k) / (dTkQdk)

and g(k) is the gradient at x(k)

g(k) = ∇f(x(k)) = Qx(k) − b

converges to the unique solution x∗ of Qx∗ = b after N steps, i.e., x(N) = x∗.


Proof: Since dk are linearly independent, we can write

x∗ − x(0) = α(0)d0 + α(1)d1 + · · ·+ α(N−1)dN−1

for some α(k). We can find α(k) by

α(k) = dTkQ(x∗ − x(0)) / (dTkQdk)  (1)

Now, the iterative steps from x(0) to x(k)

x(k) − x(0) = α(0)d0 + α(1)d1 + · · ·+ α(k−1)dk−1

and due to Q-orthogonality

dTkQ(x(k) − x(0)) = 0  (2)

Using (1) and (2) we arrive at

α(k) = dTkQ(x∗ − x(k)) / (dTkQdk) = − dTk g(k) / (dTkQdk)


Descent Properties of the Conjugate Gradient Method

I We define B(k) as the subspace of RN spanned by {d0, d1, . . . , dk−1}, i.e.,

B(k) = span {d0, d1, . . . , dk−1} ⊆ RN

We will show that at each step x(k) minimizes the objective over the k-dimensional linear variety x(0) + B(k).

I Theorem: (Expanding Subspace Theorem) Let {di}, i = 0, . . . , N−1, be non-zero Q-orthogonal vectors in RN. For any x(0) ∈ RN, the sequence

x(k+1) = x(k) + α(k)dk

α(k) = − dTk g(k) / (dTkQdk)

minimizes f(x) = (1/2)xTQx − bTx on the line

x = x(k−1) − αdk−1, −∞ < α <∞

and on x(0) + B(k).


Proof: Since x(k) ∈ x(0) + B(k), i.e., x(0) + B(k) contains the line x = x(k−1) − αdk−1, it is enough to show that x(k) minimizes f(x) over x(0) + B(k).

Since we assume that f(x) is strictly convex, the above condition holds when g(k) is orthogonal to B(k), i.e., the gradient of f(x) at x(k) is orthogonal to B(k).


- Proof of g(k) ⊥ B(k) is by induction

For k = 0, B(0) = {} (the empty set), so g(0) ⊥ B(0) holds trivially.

Now assume that g(k) ⊥ B(k), and show that g(k+1) ⊥ B(k+1).

From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that

g(k+1) = g(k) + αkQdk

Hence, by the definition of αk,

dTk g(k+1) = dTk g(k) + αk dTkQdk = 0

Also, for i < k

dTi g(k+1) = dTi g(k) + αk dTi Qdk = 0

since dTi g(k) vanishes by the induction hypothesis and dTi Qdk = 0 by Q-orthogonality.


- Corollary: The gradients g(k), k = 0, 1, . . . , N satisfy

dTi g(k) = 0

for i < k.

As the subspace expands, every iteration dk increases the dimensionality of B. Since x(k) minimizes f(x) over x(0) + B(k), x(N) is the overall minimum of f(x).


Conjugate Gradient (CG) Method The Conjugate Gradient Method

The Conjugate Gradient Method

In the conjugate direction method, select the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.

I Conjugate Gradient Algorithm:

Start at any x(0) ∈ RN and define d(0) = −g(0) = b−Qx(0)

repeat

1. α(k) = − d(k)T g(k) / (d(k)TQd(k))

2. x(k+1) = x(k) + α(k)d(k)

3. g(k+1) = Qx(k+1) − b

4. β(k) = g(k+1)TQd(k) / (d(k)TQd(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .
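The five steps above can be sketched directly; the SPD test matrix below is an assumption:

```python
import numpy as np

def conjugate_gradient(Q, b, x0):
    """CG for min (1/2) x^T Q x - b^T x with SPD Q, following steps 1-5."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x - b                      # g^(0) = Qx^(0) - b
    d = -g                             # d^(0) = -g^(0)
    for _ in range(len(b)):            # at most N steps for the quadratic case
        if np.linalg.norm(g) < 1e-12:
            break                      # solution reached early: gradient is zero
        Qd = Q @ d
        alpha = -(d @ g) / (d @ Qd)    # step 1
        x = x + alpha * d              # step 2
        g = Q @ x - b                  # step 3
        beta = (g @ Qd) / (d @ Qd)     # step 4
        d = -g + beta * d              # step 5
    return x

# Assumed test problem: a seeded random SPD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)            # SPD by construction
b = rng.standard_normal(4)

x_cg = conjugate_gradient(Q, b, np.zeros(4))
print(np.allclose(x_cg, np.linalg.solve(Q, b)))   # True after at most N = 4 steps
```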


- Algorithm terminates in at most N steps with the exact solution (for the quadratic case)

- Gradient is always linearly independent of all previous direction vectors, i.e., g(k) ⊥ B(k), where B(k) = span {d0, d1, . . . , dk−1}

- If solution is reached before N steps, the gradient is zero

- Very simple formula; computational complexity is slightly higher than the gradient descent algorithm

- The process makes uniform progress toward the solution at every step. Important for the nonquadratic case.


Example 13: Consider the quadratic problem

min (1/2)xTQx − bTx

where Q = [3 2; 2 6] and b = [2; −8].

Solution is given by
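As a numeric sketch of this example (the starting point x(0) = 0 is an assumption), the CG recursion reaches the optimum x∗ = (2, −2) in N = 2 steps:

```python
import numpy as np

Q = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.zeros(2)                 # x^(0) = 0 (assumed starting point)
g = Q @ x - b                   # g^(0) = -b
d = -g                          # d^(0)
for k in range(2):              # N = 2, so at most two steps
    Qd = Q @ d
    alpha = -(d @ g) / (d @ Qd)
    x = x + alpha * d
    g = Q @ x - b
    beta = (g @ Qd) / (d @ Qd)
    d = -g + beta * d

print(x)    # [ 2. -2.]: indeed Qx* = b holds for x* = (2, -2)
```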


CG Summary

I In theory (with exact arithmetic) converges to solution in N steps

- The bad news: due to numerical round-off errors, can take more than N steps (or fail to converge)

- The good news: with luck (i.e., good spectrum of Q), can get a good approximate solution in ≪ N steps

I Compared to direct (factor-solve) methods, CG is less reliable and data dependent; often requires a good (problem-dependent) preconditioner

I But, when it works, can solve extremely large systems


Conjugate Gradient (CG) Method Extension to Nonquadratic Problems

Extension to Nonquadratic Problems

I Idea is simple. We have two loops:

- Outer loop approximates the problem with a quadratic one
- Inner loop runs the conjugate gradient method (CGM) for the approximation

i.e., for the neighbourhood of point x0

f(x) ∼= f(x0) + ∇T f(x0)(x − x0) + (1/2)(x − x0)TH(x0)(x − x0) + residual

where the first three terms form a quadratic function and the residual → 0.

- Expanding,

f(x) ∼= (1/2)xTH(x0)x + (∇T f(x0) − xT0 H(x0))x + f(x0) + (1/2)xT0 H(x0)x0 − ∇T f(x0)x0

where the last three terms are independent of x, i.e., constant.

Thus,

min f(x) ≡ min (1/2)xTH(x0)x + (∇T f(x0) − xT0 H(x0))x

≡ min (1/2)xTQx − bTx


I Here,

Q = H(x0)

bT = −∇T f(x0) + xT0 H(x0)

The gradient g(k) is

g(k) = Qx(k) − b

= H(x0)x0 + ∇f(x0) − H(x0)x0 . . . (at x(k) = x0)

= ∇f(x0)


I Nonquadratic Conjugate Gradient Algorithm:

Starting at any x(0) ∈ RN , compute g(0) = ∇f(x(0)) and set d(0) = −g(0)

repeat

repeat

1. α(k) = − d(k)T g(k) / (d(k)TH(x(k))d(k))

2. x(k+1) = x(k) + α(k)d(k)

3. g(k+1) = ∇f(x(k+1))

4. β(k) = g(k+1)TH(x(k))d(k) / (d(k)TH(x(k))d(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .

new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0).

until stopping criterion is satisfied


- No line search is required.

- H(x(k)) must be evaluated at each point, which can be impractical.

- Algorithm may not be globally convergent.

I Involvement of H(x(k)) can be avoided by employing a line search algorithm for α(k) and slightly modifying β(k)


I Nonquadratic Conjugate Gradient Algorithm with Line-search:

Starting at any x(0) ∈ RN , compute g(0) = ∇f(x(0)) and set d(0) = −g(0)

repeat

repeat

1. Line search: α(k) = argmin_α f(x(k) + αd(k))

2. Update: x(k+1) = x(k) + α(k)d(k)

3. Gradient: g(k+1) = ∇f(x(k+1))

4. Use

Fletcher-Reeves method: β(k) = g(k+1)T g(k+1) / (g(k)T g(k)), or

Polak-Ribiere method: β(k) = (g(k+1) − g(k))T g(k+1) / (g(k)T g(k))

5. d(k+1) = −g(k+1) + β(k)d(k)

until k = N .

new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0).

until stopping criterion is satisfied
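A minimal sketch of this algorithm, with the exact line search of step 1 replaced by backtracking and a descent safeguard added (both are assumed simplifications), using the Polak-Ribiere formula; the test function below is of the same form as the R2 example referred to above:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, N, outer_iters=200):
    """Nonquadratic CG, restarted every N inner steps. The exact line search
    of step 1 is replaced by backtracking (assumed parameters 1e-4, 0.5)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-8:          # stopping criterion
            break
        d = -g                                # d^(0) = -g^(0) at each restart
        for _ in range(N):                    # inner loop, k = 0, ..., N-1
            if g @ d >= 0:                    # safeguard: keep d a descent dir.
                d = -g
            alpha = 1.0
            while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
                alpha *= 0.5                  # backtracking line search
            x = x + alpha * d
            g_new = grad(x)
            beta = ((g_new - g) @ g_new) / (g @ g)   # Polak-Ribiere
            d = -g_new + beta * d
            g = g_new
    return x

# Assumed test function (same form as the classic R^2 example).
f = lambda x: (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
               + np.exp(-x[0] - 0.1))
grad = lambda x: np.array([
    np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
    - np.exp(-x[0] - 0.1),
    3*np.exp(x[0] + 3*x[1] - 0.1) - 3*np.exp(x[0] - 3*x[1] - 0.1),
])

x_min = nonlinear_cg(f, grad, np.array([0.5, 0.5]), N=2)
print(x_min)   # approaches the minimizer (-log(2)/2, 0)
```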


- Polak-Ribiere method can be superior to the Fletcher-Reeves method.

- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.


Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.


Newton's Method (NA) The Newton Step

The Newton Step

I In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x(k),

f(x(k+1)) = f(x(k)) + ∇T f(x(k))∆x + (1/2)∆xTH(x(k))∆x + residual

where ∆x = x(k+1) − x(k) and the first three terms form the quadratic approximation of f(x(k+1)). Find ∆x = ∆xnt such that this quadratic approximation is minimized.

I Quadratic approximation optimum step ∆xnt (found by solving ∂f(x(k+1))/∂∆x = 0)

∆xnt = −H−1(x(k))∇f(x(k))

is called the Newton step, which is a descent direction, i.e.,

∇T f(x(k))∆xnt = −∇T f(x(k))H−1(x(k))∇f(x(k)) < 0


I Then

x(k+1) = x(k) + ∆xnt
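In practice the Newton step is computed by solving the linear system H(x(k))∆x = −∇f(x(k)) rather than forming the inverse explicitly. A sketch on an assumed test function, also checking the descent property above:

```python
import numpy as np

# Assumed test function: f(x) = exp(x0 + x1) + x0^2 + 2*x1^2 (strictly convex).
def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2.0 * x[0], e + 4.0 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 4.0]])

x = np.array([1.0, -0.5])

# Newton step: solve H(x) dx = -grad f(x); numerically preferable to H^{-1}.
dx_nt = np.linalg.solve(hess(x), -grad(x))

# Descent check: grad^T dx_nt = -grad^T H^{-1} grad < 0 whenever grad != 0.
print(grad(x) @ dx_nt)   # negative
```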


Interpretation of the Newton Step

1. Minimizer of second-order approximation

As given on the previous slide, ∆xnt minimizes the quadratic approximation of f(x) in the neighbourhood of x(k).

- If f(x) is quadratic, then x(0) + ∆xnt is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.

- If f(x) is nearly quadratic, then x + ∆xnt is a very good estimate of the minimizer of f(x), x∗.

- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x∗, i.e., when x is very close to x∗, the point x + ∆xnt is a very good estimate of x∗.


2. Steepest Descent Direction in Hessian Norm

- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian, i.e.,

‖v‖H(x(k)) = (vTH(x(k))v)^(1/2)

- In the steepest descent method, the quadratic norm ‖ · ‖P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x∗, P = H(x∗) is a very good choice.

- In Newton's method, when x is near x∗, we have H(x) ≈ H(x∗).


3. Solution of Linearized Optimality Condition

- First-order optimality condition

∇f(x∗) = 0

near x∗ (using the first-order Taylor approximation for ∇f(x + ∆x))

∇f(x + ∆x) ≈ ∇f(x) + H(x)∆x = 0

with the solution

∆xnt = −H−1(x)∇f(x)


Newton's Method (NA) The Newton Decrement

The Newton Decrement

I The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement

λ(x) = ‖∆xnt‖H(x) = (∆xTntH(x)∆xnt)^(1/2)

I It can be used as a stopping criterion since it is an estimate of f(x) − p∗, i.e.,

f(x) − inf_y f̂(y) = f(x) − f̂(x + ∆xnt) = (1/2)λ²(x)

where

f̂(x + ∆xnt) = f(x) + ∇T f(x)∆xnt + (1/2)∆xTntH(x)∆xnt

i.e., f̂ is the second-order quadratic approximation of f(x) at x.


Substituting the quadratic approximation f̂(x + ∆xnt) into f(x) − inf_y f̂(y) with

∆xnt = −H−1(x)∇f(x)

gives

(1/2)∇T f(x)H−1(x)∇f(x) = (1/2)λ²(x)

I So, if λ²(x)/2 < ε for some small ε, the algorithm can be terminated.

I With the substitution of ∆xnt = −H−1(x)∇f(x), the Newton decrement can also be written as

λ(x(k)) = (∇T f(x(k))H−1(x(k))∇f(x(k)))^(1/2)


Newton's Method (NA) Newton's Method

Newton's Method

I Given a starting point x(0) ∈ dom f(x) and some small tolerance ε > 0

repeat

1. Compute the Newton step and Newton decrement

∆x(k) = −H−1(x(k))∇f(x(k))

λ(x(k)) = (∇T f(x(k))H−1(x(k))∇f(x(k)))^(1/2)

2. Stopping criterion, quit if λ2(x(k))/2 ≤ ε.

3. Line search: Choose a stepsize α(k) > 0, e.g., by backtracking line search.

4. Update: x(k+1) = x(k) + α(k)∆x(k).
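Steps 1-4 can be sketched as follows; the backtracking parameters (0.3 and 0.5) and the test function are assumptions:

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-10, max_iters=50):
    """Damped Newton's method following steps 1-4 above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)          # 1. Newton step (solve, not invert)
        lam2 = -(g @ dx)                     # 1. lambda^2 = grad^T H^{-1} grad
        if lam2 / 2 <= eps:                  # 2. stopping criterion
            break
        alpha = 1.0                          # 3. backtracking line search
        while f(x + alpha * dx) > f(x) + 0.3 * alpha * (g @ dx):
            alpha *= 0.5
        x = x + alpha * dx                   # 4. update
    return x

# Assumed smooth, strongly convex test function with minimizer (0, 0).
f = lambda x: x[0]**2 + np.exp(x[1]) - x[1]
grad = lambda x: np.array([2.0 * x[0], np.exp(x[1]) - 1.0])
hess = lambda x: np.diag([2.0, np.exp(x[1])])

x_min = newton(f, grad, hess, [1.0, 2.0])
print(x_min)   # approximately [0, 0] after a handful of iterations
```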


I The stepsize α(k) (i.e., line search) is required for the non-quadratic initial parts of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.

I As x(k) gets closer to x∗, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α(k) is no longer required; the line search algorithm will automatically set α(k) = 1.

I If we start with α(k) = 1 and keep it the same, then the algorithm is called the pure Newton's method.

I For an arbitrary f(x), there are two regions of convergence.

- damped Newton phase, when x is far from x∗

- quadratically convergent phase, when x gets closer to x∗

I If we let H(x) = I, the algorithm reduces to gradient descent (GD)

x(k+1) = x(k) − α(k)∇f(x(k))


I If H(x) is not positive definite, Newton's method will not converge.

So, use (aI + H(x))−1 instead of H−1(x); this is also known as (a.k.a.) the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.

a is a trade-off between GD and NA:

- a → ∞ ⇒ Gradient Descent (GD)

- a → 0 ⇒ Newton's Method (NA)
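A small numeric sketch of why the modification helps; the indefinite Hessian, the gradient, and the value a = 3 below are assumed:

```python
import numpy as np

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite Hessian (eigenvalues 1, -2)
g = np.array([0.1, 1.0])                  # gradient at the current point

# Plain Newton direction: not a descent direction for this indefinite H.
d_nt = -np.linalg.solve(H, g)
print(g @ d_nt)        # 0.49 > 0: an ascent direction, Newton's method fails here

# Marquardt modification: a I + H with a large enough to be PD. Here a = 3;
# any a > 2 works since the most negative eigenvalue of H is -2.
a = 3.0
d_m = -np.linalg.solve(a * np.eye(2) + H, g)
print(g @ d_m)         # negative: a descent direction
```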


I The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for non-singular T ∈ RN×N, let

x = Ty and f̃(y) = f(Ty)

then

∇f̃(y) = TT∇f(x)

H̃(y) = TTH(x)T

- So, the Newton step will be

∆ynt = −H̃−1(y)∇f̃(y) = −(TTH(x)T)−1(TT∇f(x)) = −T−1H−1(x)∇f(x) = T−1∆xnt

i.e.,

x + ∆xnt = T(y + ∆ynt), ∀x


- Similarly, the Newton decrement will be

λ(y) = (∇T f̃(y)H̃−1(y)∇f̃(y))^(1/2)

= ((∇T f(x)T)(TTH(x)T)−1(TT∇f(x)))^(1/2)

= (∇T f(x)H−1(x)∇f(x))^(1/2)

= λ(x)

I Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).
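This invariance can be verified numerically on an assumed quadratic f(x) = (1/2)xTQx − bTx with f̃(y) = f(Ty); all matrices below are example values:

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # SPD Hessian of f
b = np.array([1.0, -1.0])
T = np.array([[1.0, 2.0], [0.0, 1.0]])    # non-singular coordinate change x = Ty

x = np.array([1.0, 1.0])
y = np.linalg.solve(T, x)                 # the same point in y-coordinates

g_x = Q @ x - b                           # grad f(x); H(x) = Q
g_y = T.T @ g_x                           # grad ftilde(y) = T^T grad f(x)
H_y = T.T @ Q @ T                         # Htilde(y) = T^T H(x) T

dx_nt = -np.linalg.solve(Q, g_x)          # Newton step in x-coordinates
dy_nt = -np.linalg.solve(H_y, g_y)        # Newton step in y-coordinates

print(np.allclose(T @ dy_nt, dx_nt))      # True: dy_nt = T^{-1} dx_nt

lam_x = np.sqrt(g_x @ np.linalg.solve(Q, g_x))
lam_y = np.sqrt(g_y @ np.linalg.solve(H_y, g_y))
print(np.isclose(lam_x, lam_y))           # True: the decrement is invariant
```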


Newton's Method (NA) Convergence Analysis

Convergence Analysis

Read Boyd, Section 9.5.3.

I Assume a strongly convex f(x) with mI ⪯ H(x) for a constant m, ∀x ∈ dom f(x), and H(x) Lipschitz continuous on dom f(x), i.e.,

‖H(x)−H(y)‖2 ≤ L ‖x− y‖2

for a constant L > 0. This inequality imposes a bound on the third derivative of f(x).

If L is small, f(x) is closer to a quadratic function. If L is large, f(x) is far from a quadratic function. If L = 0, then f(x) is quadratic.

Thus, L measures how well f(x) can be approximated by a quadratic function.

- Newton's Method will perform well for small L.


Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that

I Damped Newton Phase: ‖∇f(x)‖2 ≥ η

- α < 1 gives better solutions, so most iterations will require line search, e.g., backtracking line search.

- As k increases, the function value decreases by at least σ, but the convergence is not necessarily quadratic.

- This phase ends after at most (f(x(0)) − p∗)/σ iterations

I Quadratically Convergent Phase

‖∇f(x)‖2 < η

- All iterations use α = 1 (i.e., the quadratic approximation suits very well).

- ‖∇f(x(k+1))‖2 / ‖∇f(x(k))‖2² ≤ L/(2m²), i.e., quadratic convergence.


- For small ε > 0, f(x) − p∗ < ε is achieved after at most

log2 log2 (ε0/ε)

iterations where ε0 = 2m³/L². This is typically 5-6 iterations.

- Number of iterations is bounded above by

(f(x(0)) − p∗)/σ + log2 log2 (ε0/ε)

where σ and ε0 depend on m, L and x(0).


NA Summary

I Convergence of Newton's method is rapid in general, and quadratic near x∗. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

I Newton's method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.

I Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R10000 is similar to its performance on problems in R10, with only a modest increase in the number of steps required.

I The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.


I The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.

I Other alternatives (called quasi-Newton methods) are also provided by a family of algorithms for unconstrained optimization. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton's method, such as rapid convergence near x∗.


Newton's Method (NA) Examples

Example 15: Consider the nonquadratic problem in R2 given in Example 8 and Example 12 (replace α and t with γ and α).


Example 16: Consider the nonquadratic problem in R100 given in Example 9 (replace α and t with γ and α).


Example 17: (problem in R10000) Replace α and t with γ and α.


Newton's Method (NA) Approximation of the Hessian

Approximation of the Hessian

For relatively large-scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly. So, we may use some approximation of the inverse Hessian,

S(x) ∼= H−1(x)

x(k+1) = x(k) − α(k)S(x(k))∇f(x(k))

1. Hybrid GD + NA

We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, first we can run GD, which has considerably low complexity, and after satisfying some conditions, we can switch to the NA.

Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x∗.

The hybrid method (given on the next slide) also guarantees global convergence.


I Hybrid Algorithm

- Start at x(0) ∈ dom f(x)

repeat

run GD (i.e., S(x(k)) = I)

until stopping criterion is satis�ed

- Start at the final point of GD

repeat

run NA with exact H(x) (i.e., S(x(k)) = H−1(x(k)))

until stopping criterion is satis�ed


2. The Chord Method

If f(x) is close to a quadratic function, we may use S(x(k)) = H−1(x(0)) throughout the iterations, i.e.,

∆x(k) = −H−1(x(0))∇f(x(k))

x(k+1) = x(k) + ∆x(k)

This is also the same as the SD algorithm with P = H(x(0)) and α(k) = 1.


3. The Shamanski Method

Updating the Hessian every N iterations may give better performance, i.e.,

S(x(k)) = H−1(x(⌊k/N⌋N))

∆x(k) = −H−1(x(⌊k/N⌋N))∇f(x(k))

x(k+1) = x(k) + ∆x(k)

This is a trade-off between the Chord method (N ← ∞) and the full NA (N ← 1).
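The Chord and Shamanski updates can be sketched with a single refresh parameter; the test function, iteration counts, and the α(k) = 1 choice are assumptions:

```python
import numpy as np

# Assumed test function: f(x) = exp(x0) - x0 + exp(x1) + x1^2 (strongly convex).
grad = lambda x: np.array([np.exp(x[0]) - 1.0, np.exp(x[1]) + 2.0 * x[1]])
hess = lambda x: np.diag([np.exp(x[0]), np.exp(x[1]) + 2.0])

def approx_newton(x0, refresh, iters=60):
    """Refresh the Hessian every `refresh` iterations: refresh = 1 gives the
    full NA, refresh larger than `iters` gives the Chord method."""
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        if k % refresh == 0:
            H = hess(x)                        # S(x) uses the last refresh point
        x = x - np.linalg.solve(H, grad(x))    # step with alpha = 1 (assumed)
    return x

x_newton = approx_newton([0.5, 0.5], refresh=1)    # full Newton iteration
x_chord = approx_newton([0.5, 0.5], refresh=100)   # Chord: Hessian fixed at x(0)

print(np.linalg.norm(grad(x_newton)))   # ~0
print(np.linalg.norm(grad(x_chord)))    # ~0: slower per step, still converges here
```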


4. Approximating Particular Terms

Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero

- If some entries of H(x) are small or below a small threshold, then set them to zero, obtaining an approximation H̃(x). Thus, H̃(x) becomes sparse.

- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining H̃(x). Thus, H̃(x) becomes diagonal, which is very easy to invert.

There are also other advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon-Fletcher-Powell (DFP) methods.
