Introduction to advanced parameter optimization


Introduction to advanced parameter optimization

So far:

• What is a neural network?

• Basic training algorithm:

• Gradient descent

• Backpropagation

Next: advanced training algorithms

Gradient descent algorithm

1. Choose an initial weight vector w₁ and let d₁ = −g₁.

2. Let w_{j+1} = w_j + η d_j.

3. Evaluate g_{j+1}.

4. Let d_{j+1} = −g_{j+1}.

5. Let j = j + 1 and go to step 2.

Here g_j = ∇E(w_j) denotes the gradient of the error at w_j.
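The five steps above can be sketched in Python. Assumptions (not from the slides): the case-study quadratic E = 20ω₁² + ω₂² as the error function, η = 0.04, a gradient-norm stopping test, and the starting point (1, 2).

```python
import numpy as np

def grad_E(w):
    # Gradient of E(w) = 20*w1^2 + w2^2 (the case-study surface)
    return np.array([40.0 * w[0], 2.0 * w[1]])

def gradient_descent(w, eta=0.04, tol=1e-6, max_steps=10000):
    """Steps 1-5: d_j = -g_j, then w_{j+1} = w_j + eta*d_j."""
    for step in range(1, max_steps + 1):
        g = grad_E(w)                 # step 3: evaluate g_j
        if np.linalg.norm(g) < tol:   # illustrative stopping test
            return w, step
        d = -g                        # step 4: d_j = -g_j
        w = w + eta * d               # step 2: take the step
    return w, max_steps

w_final, steps = gradient_descent(np.array([1.0, 2.0]))
```

With η = 0.04 (inside the stability bound derived later in the deck), the iterates shrink toward the minimum at the origin.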

Gradient descent review

Gradient descent:

w(t + 1) = w(t) + ∆w(t)

where,

∆w(t) = −η∇E[w(t)]

Two main problems:

• Slow convergence

• Trial-and-error selection of η

Goal: cut number of epochs (training cycles) by orders of magnitude ... how?

How to improve over gradient descent?

• Must understand convergence properties

• Use second-order information...


First case study

(same as last time)

Will also look at simple nonquadratic error surfaces...

E = 20ω₁² + ω₂²

[Figure: surface and contour plot of E over the (ω₁, ω₂) plane]

Why quadratic error surface?

Disadvantages:

• Too simple/too few parameters

• NN error surface not globally quadratic

Advantages:

• Easy to visualize

• NN error surfaces will be locally quadratic near a local minimum.

Taylor series expansion

Single dimension (from calculus):

f(x) ≈ f(x₀) + f′(x₀)(x − x₀) + (1/2) f″(x₀)(x − x₀)²

Multi-dimensional error surface:

E(w) ≈ E(w₀) + (w − w₀)ᵀ b + (1/2)(w − w₀)ᵀ H (w − w₀)

about some vector w₀, where,

b = ∇E(w₀)

H = ∇[∇E(w₀)]

(Hessian: not just a German mercenary)

Hessian matrix

Definition: The Hessian matrix of a W-dimensional function E(w) is the W × W matrix,

H = ∇[∇E(w)]

where,

w = [ω₁ ω₂ … ω_W]ᵀ

Alternatively:

H(i, j) = ∂²E / (∂ω_i ∂ω_j)
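The entry-wise definition H(i, j) = ∂²E/(∂ω_i ∂ω_j) can be sanity-checked numerically with central finite differences. A sketch; the test function (the case-study quadratic) and step size h are illustrative choices:

```python
import numpy as np

def E(w):
    # Case-study surface E = 20*w1^2 + w2^2
    return 20.0 * w[0]**2 + w[1]**2

def hessian(f, w, h=1e-4):
    """H(i, j) = d^2 f / (dw_i dw_j) via central differences."""
    W = len(w)
    H = np.zeros((W, W))
    for i in range(W):
        for j in range(W):
            e_i = np.zeros(W); e_i[i] = h
            e_j = np.zeros(W); e_j[j] = h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

H = hessian(E, np.array([1.0, 2.0]))
# For a quadratic surface the Hessian is constant: [[40, 0], [0, 2]]
```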


Some linear algebra

Definition: For a W × W square matrix H, the eigenvalues λ are the solutions of,

|λI_W − H| = 0

Definition: A square matrix H is positive-definite, if and only if all its eigenvalues λ_i are greater than zero. If a matrix is positive-definite, then,

vᵀHv > 0, ∀ v ≠ 0.

Shorthand: H > 0.

• Quadratic error surface: H > 0

• Arbitrary error surface: H > 0 near local minimum.

Gradient descent convergence rate

Near local minimum: λ_min > 0 (why?)

Convergence governed by:

λ_min / λ_max

Learning rate bound:

0 < η < 2/λ_max

Simple Hessian example

E = 20ω₁² + ω₂²

First partial derivatives:

∂E/∂ω₁ = 40ω₁

∂E/∂ω₂ = 2ω₂

Second partial derivatives:

∂²E/∂ω₁² = 40

∂²E/∂ω₂² = 2

∂²E/(∂ω₁∂ω₂) = ∂²E/(∂ω₂∂ω₁) = 0

Simple Hessian example (continued)

E = 20ω₁² + ω₂²

Second partial derivatives:

∂²E/∂ω₁² = 40, ∂²E/∂ω₂² = 2, ∂²E/(∂ω₁∂ω₂) = ∂²E/(∂ω₂∂ω₁) = 0

Hessian:

H = | 40  0 |
    |  0  2 |


Simple Hessian example (continued)

What are the eigenvalues?

H = | 40  0 |
    |  0  2 |

Computation of eigenvalues:

|λI₂ − H| = 0

| λ − 40     0    |
|    0     λ − 2  | = 0

(λ − 40)(λ − 2) = 0

λ_min = 2, λ_max = 40
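The solutions of |λI − H| = 0 above can be cross-checked numerically. A sketch; `eigvalsh` is used because H is symmetric, and it returns the eigenvalues in ascending order:

```python
import numpy as np

H = np.array([[40.0, 0.0],
              [0.0,  2.0]])

# For a diagonal H the eigenvalues are just the diagonal entries.
eigvals = np.linalg.eigvalsh(H)   # real eigenvalues, ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]
```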

Learning rate bounds

(same as fixed-point derivation)

λ_min = 2, λ_max = 40

0 < η < 2/λ_max

0 < η < 2/40 = 0.05
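The bound can be checked empirically by running plain gradient descent with η just inside and just outside 2/λ_max = 0.05. The starting point and step counts here are illustrative:

```python
import numpy as np

def grad_E(w):
    # Gradient of the case-study surface E = 20*w1^2 + w2^2
    return np.array([40.0 * w[0], 2.0 * w[1]])

def run(eta, steps=200):
    w = np.array([1.0, 2.0])
    for _ in range(steps):
        w = w - eta * grad_E(w)
    return np.linalg.norm(w)

# Just inside the bound: iterates shrink.
# Just outside: the w1 component (eigenvalue 40) oscillates and diverges.
norm_ok  = run(0.049)
norm_bad = run(0.051)
```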

Convergence examples

[Figure: gradient descent trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.01: 719 steps


Convergence examples

[Figure: gradient descent trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.04: 175 steps

Convergence examples

[Figure: gradient descent trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.05: no convergence

Basic problem: “long valley with steep sides”

What characterizes a “long valley with steep sides”?

Length of contour axes proportional to:

1/√λ₁ and 1/√λ₂

Small ratio:

λ_min / λ_max

So what can we do about this?

Solution

• A single fixed learning rate is the problem

• Answer: different learning rates for each weight.

Key question: how to achieve this automatically?


Heuristic extension: momentum

Gradient descent with momentum:

w(t + 1) = w(t) + ∆w(t)

∆w(0) = −η∇E[w(0)]

∆w(t) = −η∇E[w(t)] + µ∆w(t − 1), t > 0, 0 ≤ µ < 1

Notes:

• ∆w(t) dependent on w(t) and w(t − 1)

• Ideally, high effective learning rate in shallow dimensions

• Little effect along steep dimensions
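A minimal sketch of the momentum update on the case-study quadratic; η = 0.04, µ = 0.5, the starting point, and the gradient-norm stopping test are illustrative choices, not fixed by the slides:

```python
import numpy as np

def grad_E(w):
    # Gradient of E = 20*w1^2 + w2^2
    return np.array([40.0 * w[0], 2.0 * w[1]])

def momentum_descent(w, eta=0.04, mu=0.5, tol=1e-6, max_steps=10000):
    """delta_w(t) = -eta * grad E[w(t)] + mu * delta_w(t-1)."""
    delta_w = np.zeros_like(w)
    for step in range(1, max_steps + 1):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:
            return w, step
        delta_w = -eta * g + mu * delta_w   # momentum update
        w = w + delta_w
    return w, max_steps

w_final, steps = momentum_descent(np.array([1.0, 2.0]))
```

The momentum term accumulates the persistent gradient component along the shallow ω₂ direction while oscillating contributions along the steep ω₁ direction partially cancel.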

Gradient descent algorithm

1. Choose an initial weight vector w₁ and let d₁ = −g₁.

2. Let w_{j+1} = w_j + η d_j.

3. Evaluate g_{j+1}.

4. Let d_{j+1} = −g_{j+1}.

5. Let j = j + 1 and go to step 2.

Analyzing momentum term

Shallow regions: assume,

∇E[w(t)] ≈ ∇E[w(0)] = g, t ∈ {1, 2, …}

Then:

∆w(0) = −ηg

∆w(1) = −ηg + µ∆w(0) ≈ −ηg(1 + µ)

∆w(2) = −ηg + µ∆w(1) ≈ −ηg + µ[−ηg(1 + µ)] = −ηg(1 + µ + µ²)

Analyzing momentum term

Assumption (shallow region):

∇E[w(t)] ≈ ∇E[w(0)] = g, t ∈ {1, 2, …}

In general,

∆w(t) ≈ −ηg Σ_{s=0..t} µ^s = −η [(1 − µ^{t+1})/(1 − µ)] g

In the limit:

lim_{t→∞} ∆w(t) = −[η/(1 − µ)] g
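The geometric-series limit can be checked numerically under the constant-gradient assumption; the values of η, µ, and g below are illustrative:

```python
eta, mu, g = 0.01, 0.9, 1.0   # constant gradient g (shallow-region assumption)

delta_w = 0.0
deltas = []
for t in range(200):
    delta_w = -eta * g + mu * delta_w   # momentum recursion
    deltas.append(delta_w)

# Partial sums of the geometric series:
#   delta_w(t) = -eta * g * (1 - mu**(t+1)) / (1 - mu)
# which approaches -eta/(1-mu) * g = -0.1 * g here.
limit = -eta * g / (1 - mu)
```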


Analyzing momentum term

Effective learning rate (shallow regions): η/(1 − µ)

Analyzing momentum term

Steep regions: oscillations,

∇E[w(t + 1)] ≈ −∇E[w(t)]

Net effect (ideally): little

Momentum

Advantage:

• Increases effective learning rate in shallow regions

Disadvantages:

• Yet another parameter (µ) to hand-tune

• If not carefully chosen, can do more harm than good

Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.01, µ = 0.0: 719 steps


Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.01, µ = 0.5: 341 steps

Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.01, µ = 0.9: 266 steps

Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.04, µ = 0.0: 175 steps

Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.04, µ = 0.5: 60 steps


Convergence examples

[Figure: trajectory on the contours of E in the (ω₁, ω₂) plane]

η = 0.04, µ = 0.9: 272 steps

Convergence examples: summary

Steps to convergence:

           µ = 0.0   µ = 0.5   µ = 0.9
η = 0.01     719       341       266
η = 0.04     175        60       272

Heuristic extensions to gradient descent

Momentum popular in neural network community.

Many other heuristic attempts (some examples):

• Adaptive learning rate (what should ρ and σ be?):

η_new = ρη_old if ∆E < 0; ση_old if ∆E > 0

• η_max = 2/λ_max (what’s the problem here?)
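The adaptive rule can be sketched in a few lines. The values ρ = 1.1 and σ = 0.5 are illustrative guesses; the slide's open question is precisely what they should be:

```python
def adapt_eta(eta, delta_E, rho=1.1, sigma=0.5):
    """Grow eta when the last step reduced the error (delta_E < 0),
    shrink it when the error went up (delta_E > 0)."""
    return rho * eta if delta_E < 0 else sigma * eta

eta = adapt_eta(0.1, delta_E=-1.0)   # error fell: eta grows to 0.11
eta = adapt_eta(0.1, delta_E=+1.0)   # error rose: eta shrinks to 0.05
```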

Heuristic extensions to gradient descent

• Individual learning rates (problems?):

∆η_i = γ g_i(t) g_i(t − 1)

• Quickprop: local independent quadratic assumption (problems?):

∆ω_i(t + 1) = [g_i(t) / (g_i(t − 1) − g_i(t))] ∆ω_i(t)


Heuristic extensions to gradient descent

Problems:

• Additional hand-tuned parameters

• Independence-of-weights assumptions

A more principled approach is desirable.

Steepest descent

Gradient descent:

w(t + 1) = w(t) + ∆w(t)

where,

∆w(t) = −η∇E[w(t)]

Question: why take all those little tiny steps?

Steepest descent: gradient descent with line minimization

1. Define search direction d(t):

d(t) = −∇E[w(t)]

2. Minimize:

E(η) ≡ E[w(t) + ηd(t)]

such that:

E(η*) ≤ E(η), ∀η

3. New update:

w(t + 1) = w(t) + η*d(t)

(problems?)

Steepest descent

Question: Do we need to compute ∂E/∂η?

Answer: No. Use a one-dimensional line search, which requires only evaluation of E(η).

Line search: two steps

1. Bracket minimum

2. Line minimization


Line search: bracketing the minimum

Basic problem: need three values a, b, c such that:

E(a) > E(b) and E(c) > E(b)

[Figure: E(η) curve with bracketing points a, b, c]

Bracketing the minimum

1. Let a = 0. Let b = ε.

Note: will satisfy E(a) > E(b) (why?).

2. Let c = k(b − a) + a, where k > 1 (what should k be?).

3. If E(c) > E(b), then done; else, let a = b and b = c. Repeat step 2.

Note: one evaluation of E per step.
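The three steps above can be sketched as follows. The values ε = 10⁻³ and k = 2 are illustrative guesses (the slide leaves k open), and the sketch assumes E initially decreases from η = 0, as it does when stepping along the negative gradient with a small enough ε:

```python
def bracket_minimum(E, eps=1e-3, k=2.0):
    """Find a < b < c with E(a) > E(b) and E(c) > E(b)."""
    a, b = 0.0, eps            # step 1
    c = k * (b - a) + a        # step 2
    while E(c) <= E(b):        # step 3: not done yet, shift outward
        a, b = b, c
        c = k * (b - a) + a
    return a, b, c

# Example with an illustrative 1-D error curve E(eta) = (eta - 1)^2 + 0.5
a, b, c = bracket_minimum(lambda eta: (eta - 1.0)**2 + 0.5)
```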

Bracketing example

[Figure: four successive brackets (#1–#4) on the E(η) curve]

Bracketing example: error surface

E(ω₁, ω₂) = 1 − exp[−(5ω₁² + ω₂²)]

[Figure: surface plot of E over the (ω₁, ω₂) plane]


What is E(η)?

Weights (ω₁, ω₂) = (1, 2)

∇E(ω₁, ω₂) = [10ω₁ exp(−(5ω₁² + ω₂²)), 2ω₂ exp(−(5ω₁² + ω₂²))]

E(η) = 1 − exp(−[5(ω₁ − 10ω₁η exp(−(5ω₁² + ω₂²)))² + (ω₂ − 2ω₂η exp(−(5ω₁² + ω₂²)))²])

At (ω₁, ω₂) = (1, 2):

E(η) = 1 − exp(−[5(1 − 10η exp(−9))² + (2 − 4η exp(−9))²])
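The closed form for E(η) can be cross-checked by evaluating E along w + ηd directly, with d the negative gradient at (1, 2). A sketch; the sampled η values are arbitrary:

```python
import numpy as np

def E(w):
    # E(w1, w2) = 1 - exp(-(5*w1^2 + w2^2))
    return 1.0 - np.exp(-(5.0 * w[0]**2 + w[1]**2))

def grad_E(w):
    common = np.exp(-(5.0 * w[0]**2 + w[1]**2))
    return np.array([10.0 * w[0] * common, 2.0 * w[1] * common])

w0 = np.array([1.0, 2.0])
d = -grad_E(w0)               # search direction at (1, 2)

def E_line(eta):
    # Closed form from the slide, specialized to (w1, w2) = (1, 2)
    return 1.0 - np.exp(-(5.0 * (1.0 - 10.0 * eta * np.exp(-9.0))**2
                          + (2.0 - 4.0 * eta * np.exp(-9.0))**2))

etas = np.linspace(0.0, 1000.0, 5)
direct = [E(w0 + eta * d) for eta in etas]   # evaluate along the line
closed = [E_line(eta) for eta in etas]       # closed-form expression
```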

Bracketing example: error surface

[Figure: E(η) for 0 ≤ η ≤ 1400, with bracketing points a, b, c]

Line minimization

1. Pick a value η = x in the larger interval: (a, b) or (b, c).

2. If (a, b) is the larger interval, set the new bracketing values to:

{x, b, c} if E(x) > E(b) (set a = x), or

{a, x, b} if E(b) > E(x) (set c = b and b = x).

Else, set the new bracketing values to:

{a, b, x} if E(x) > E(b) (set c = x), or

{b, x, c} if E(b) > E(x) (set a = b and b = x).

3. Iterate steps 1 and 2 until (c − a) < θ.

Line minimization

What should the value of x be?

x = b + 0.381966(c − b) [(b, c) is larger interval]

x = b − 0.381966(b − a) [(a, b) is larger interval]

Rate of convergence proportional to:

1/k ≈ 0.61803, where k = (1 + √5)/2 ≈ 1.61803 (golden mean)
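Combining the golden-section choice of x above with the update rules from the previous slide gives a complete line minimization. A sketch; the test curve and starting bracket are illustrative:

```python
def golden_section(E, a, b, c, theta=1e-6):
    """Shrink the bracket (a, b, c) until c - a < theta; return b."""
    g = 0.381966  # golden-section fraction
    while (c - a) >= theta:
        if (c - b) > (b - a):          # (b, c) is the larger interval
            x = b + g * (c - b)
            if E(x) > E(b):
                c = x                  # new bracket {a, b, x}
            else:
                a, b = b, x            # new bracket {b, x, c}
        else:                          # (a, b) is the larger interval
            x = b - g * (b - a)
            if E(x) > E(b):
                a = x                  # new bracket {x, b, c}
            else:
                c, b = b, x            # new bracket {a, x, b}
    return b

# Minimize an illustrative E(eta) = (eta - 1)^2 + 0.5 inside bracket (0, 0.9, 3)
eta_star = golden_section(lambda eta: (eta - 1.0)**2 + 0.5, 0.0, 0.9, 3.0)
```

Each iteration costs one evaluation of E(η) and shrinks the bracket by roughly the golden fraction 0.618.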


Examples: quadratic surface

[Figure: steepest descent trajectory on the contours of E in the (ω₁, ω₂) plane]

Steepest descent: 15 steps to convergence

Examples: quadratic surface (comparison)

[Figure: gradient descent trajectory on the contours of E in the (ω₁, ω₂) plane]

Gradient descent (η = 0.04): 175 steps to convergence

Examples: nonquadratic surface

[Figure: steepest descent trajectory on the nonquadratic surface]

Steepest descent: 24 steps to convergence

Examples: nonquadratic surface

[Figure: gradient descent trajectory on the nonquadratic surface]

Gradient descent (η = 0.2): 456 steps to convergence


Computational complexity

Steepest descent: 5NW + 10(2NW) = 25NW computations/step (why?)

Gradient descent: 5NW computations/step

Discussion

What’s bad about steepest descent?

Answer: Orthogonality of consecutive steps.

• Why does this occur?

• Can we do better?