
Gradient Descent

Review: Gradient Descent

• In step 3, we have to solve the following optimization problem:

$$\theta^* = \arg\min_\theta L(\theta)$$

$L$: loss function, $\theta$: parameters

Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$.

Randomly start at $\theta^0 = \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix}$, with gradient $\nabla L(\theta) = \begin{bmatrix} \partial L(\theta)/\partial\theta_1 \\ \partial L(\theta)/\partial\theta_2 \end{bmatrix}$.

$$\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix} = \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \end{bmatrix} - \eta \begin{bmatrix} \partial L(\theta^0)/\partial\theta_1 \\ \partial L(\theta^0)/\partial\theta_2 \end{bmatrix} \quad\Longrightarrow\quad \theta^1 = \theta^0 - \eta\,\nabla L(\theta^0)$$

$$\begin{bmatrix} \theta_1^2 \\ \theta_2^2 \end{bmatrix} = \begin{bmatrix} \theta_1^1 \\ \theta_2^1 \end{bmatrix} - \eta \begin{bmatrix} \partial L(\theta^1)/\partial\theta_1 \\ \partial L(\theta^1)/\partial\theta_2 \end{bmatrix} \quad\Longrightarrow\quad \theta^2 = \theta^1 - \eta\,\nabla L(\theta^1)$$

Review: Gradient Descent

Start at position $\theta^0$.

Compute the gradient at $\theta^0$; move to $\theta^1 = \theta^0 - \eta\,\nabla L(\theta^0)$.

Compute the gradient at $\theta^1$; move to $\theta^2 = \theta^1 - \eta\,\nabla L(\theta^1)$.

……

[Figure: trajectory $\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$ in the $(\theta_1, \theta_2)$ plane; arrows mark the gradients $\nabla L(\theta^0), \nabla L(\theta^1), \nabla L(\theta^2), \nabla L(\theta^3)$ and the corresponding movements.]

Gradient: the direction normal to the contour lines of the loss.
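As a minimal sketch of the update rule above, the following Python snippet runs vanilla gradient descent on a two-parameter loss; the quadratic loss, the learning rate, and the step count are illustrative choices, not taken from the slides.

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, num_steps=100):
    """Vanilla gradient descent: theta^{t+1} = theta^t - eta * grad_L(theta^t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Illustrative loss L(theta) = theta_1^2 + 4 * theta_2^2 (not from the slides),
# whose gradient is [2 * theta_1, 8 * theta_2].
grad_L = lambda th: np.array([2.0 * th[0], 8.0 * th[1]])
print(gradient_descent(grad_L, theta0=[3.0, -2.0]))  # approaches [0, 0]
```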

Gradient Descent Tip 1: Tuning your learning rates

Learning Rate

[Figure: loss vs. number of parameter updates for different learning rates: very large, large, small, and just right.]

Set the learning rate $\eta$ carefully. If there are more than three parameters, you cannot visualize the loss surface itself, but you can always visualize the curve of the loss against the number of parameter updates.

Adaptive Learning Rates

• Popular & simple idea: reduce the learning rate by some factor every few epochs.
• At the beginning, we are far from the destination, so we use a larger learning rate.
• After several epochs, we are close to the destination, so we reduce the learning rate.
• E.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (a small sketch follows this list)
• The learning rate cannot be one-size-fits-all.
• Give different parameters different learning rates.
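A minimal sketch of the decay schedule above, assuming $\eta$ is the initial learning rate and $t$ counts parameter updates:

```python
def lr_decay(eta: float, t: int) -> float:
    """1/t decay: eta^t = eta / sqrt(t + 1)."""
    return eta / (t + 1) ** 0.5

print([round(lr_decay(0.1, t), 4) for t in range(5)])
# [0.1, 0.0707, 0.0577, 0.05, 0.0447]
```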

Adagrad

• Divide the learning rate of each parameter by the root mean square of its previous derivatives.

$\sigma^t$: root mean square of the previous derivatives of parameter $w$ ($w$ is one parameter)

$$g^t = \frac{\partial L(\theta^t)}{\partial w}$$

Vanilla gradient descent: $w^{t+1} \leftarrow w^t - \eta^t g^t$, where $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$

Adagrad: $w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t} g^t$, where $\sigma^t$ is parameter dependent.

Adagrad

$$w^1 \leftarrow w^0 - \frac{\eta^0}{\sigma^0} g^0 \qquad \sigma^0 = \sqrt{(g^0)^2}$$

$$w^2 \leftarrow w^1 - \frac{\eta^1}{\sigma^1} g^1 \qquad \sigma^1 = \sqrt{\tfrac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$$

$$w^3 \leftarrow w^2 - \frac{\eta^2}{\sigma^2} g^2 \qquad \sigma^2 = \sqrt{\tfrac{1}{3}\left[(g^0)^2 + (g^1)^2 + (g^2)^2\right]}$$

……

$$w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t \qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$

$\sigma^t$: root mean square of the previous derivatives of parameter $w$

Adagrad

• Divide the learning rate of each parameter by the root mean square of its previous derivatives.

$$\eta^t = \frac{\eta}{\sqrt{t+1}} \;\;(\text{1/t decay}) \qquad\qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$

$$w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t \quad\Longrightarrow\quad w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$

Contradiction?

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’๐œ‚

ฯƒ๐‘–=0๐‘ก ๐‘”๐‘– 2

๐‘”๐‘ก

Vanilla Gradient descent

Adagrad

Larger gradient, larger step

Larger gradient, smaller step

Larger gradient, larger step

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’ ๐œ‚๐‘ก๐‘”๐‘ก

๐‘”๐‘ก =๐œ•๐ฟ ๐œƒ๐‘ก

๐œ•๐‘ค๐œ‚๐‘ก =

๐œ‚

๐‘ก + 1

Intuitive Reason

• How surprising it is: dividing by the accumulated past gradients creates a contrast effect.

$$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t \qquad g^t = \frac{\partial L(\theta^t)}{\partial w}, \quad \eta^t = \frac{\eta}{\sqrt{t+1}}$$

          g^0     g^1     g^2     g^3     g^4     ……
Case 1:   0.001   0.001   0.003   0.002   0.1     ……   (g^4 is especially large here)
Case 2:   10.8    20.9    31.7    12.1    0.1     ……   (g^4 is especially small here)

Larger gradient, larger steps?

๐‘ฆ = ๐‘Ž๐‘ฅ2 + ๐‘๐‘ฅ + ๐‘

๐œ•๐‘ฆ

๐œ•๐‘ฅ= |2๐‘Ž๐‘ฅ + ๐‘|

๐‘ฅ0

|๐‘ฅ0 +๐‘

2๐‘Ž|

๐‘ฅ0

|2๐‘Ž๐‘ฅ0 + ๐‘|

Best step:

โˆ’๐‘

2๐‘Ž

|2๐‘Ž๐‘ฅ0 + ๐‘|

2๐‘ŽLarger 1st order derivative means far from the minima
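In fact, the numerator of the best step is exactly the magnitude of the first derivative at $x_0$, since for $a > 0$

$$\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a},$$

and $|2ax_0 + b|$ is $\left|\partial y / \partial x\right|$ evaluated at $x_0$.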

Comparison between different parameters

[Figure: loss contours over $(w_1, w_2)$ together with the cross-sections along $w_1$ (points a and b) and along $w_2$ (points c and d); c > d and a > b.]

A larger first-order derivative means being farther from the minimum only when comparing points of the same parameter; do not compare across parameters.

Second Derivative

๐‘ฆ = ๐‘Ž๐‘ฅ2 + ๐‘๐‘ฅ + ๐‘

๐œ•๐‘ฆ

๐œ•๐‘ฅ= |2๐‘Ž๐‘ฅ + ๐‘|

โˆ’๐‘

2๐‘Ž

๐‘ฅ0

|๐‘ฅ0 +๐‘

2๐‘Ž|

๐‘ฅ0

|2๐‘Ž๐‘ฅ0 + ๐‘|

Best step:

๐œ•2๐‘ฆ

๐œ•๐‘ฅ2= 2๐‘Ž The best step is

|First derivative|

Second derivative

|2๐‘Ž๐‘ฅ0 + ๐‘|

2๐‘Ž

Comparison between different parameters

[Figure: the same contours and cross-sections as before, now annotated with the second derivatives; one parameter direction has the smaller second derivative and the other the larger one; c > d and a > b.]

The best step is |first derivative| / second derivative. A larger first-order derivative alone indicates being far from the minimum only within a single parameter; dividing by the second derivative makes the comparison meaningful across parameters as well.

Rather than computing the second derivative directly, use the first derivative to estimate it: over many sampled points, (first derivative)$^2$ tends to be larger in the direction with the larger second derivative, so the term $\sqrt{\sum_{i=0}^{t}(g^i)^2}$ in

$$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$

plays the role of the second derivative.
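A small numerical sketch of this idea: for a one-dimensional quadratic $L(w) = aw^2$, the second derivative is $2a$, and the mean of the squared first derivatives over randomly sampled points grows with $a$. The quadratics, sampling range, and sample count below are made up for illustration.

```python
import random

def mean_sq_grad(a, num_samples=10000, radius=3.0):
    """Mean squared first derivative of L(w) = a * w^2 over points sampled near the minimum."""
    total = 0.0
    for _ in range(num_samples):
        w = random.uniform(-radius, radius)
        g = 2.0 * a * w  # dL/dw
        total += g * g
    return total / num_samples

random.seed(0)
print(mean_sq_grad(a=0.5))  # flatter direction -> smaller accumulated squared gradients
print(mean_sq_grad(a=5.0))  # sharper direction -> roughly 100x larger
```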

Gradient Descent Tip 2: Stochastic Gradient Descent

Make the training faster

Stochastic Gradient Descent

Gradient Descent:

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2 \qquad \theta^i = \theta^{i-1} - \eta\,\nabla L(\theta^{i-1})$$

The loss is the summation over all training examples.

Stochastic Gradient Descent: pick an example $x^n$.

$$L^n = \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2 \qquad \theta^i = \theta^{i-1} - \eta\,\nabla L^n(\theta^{i-1})$$

The loss is for only one example. Faster!

• Demo

Stochastic Gradient Descent

Gradient Descent: see all examples, then update once after seeing all of them.

Stochastic Gradient Descent: see only one example at a time and update for each example. After seeing all examples once, it has already made one update per example; if there are 20 examples, that is 20 times faster (20 updates in the time gradient descent takes for 1).
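The two schemes differ only in which loss the gradient is taken of. Below is a minimal single-feature sketch for the linear model used on the slides; the data points, learning rate, and initial parameters are made-up placeholders.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.0, 5.0, 7.0, 9.0])  # made-up targets

def grad_one(w, b, xn, yn):
    """Gradient of L^n = (yn - (b + w*xn))^2 with respect to (w, b)."""
    err = yn - (b + w * xn)
    return -2.0 * err * xn, -2.0 * err

eta = 0.01

# Gradient descent: one update using the gradient of the total loss L.
w_gd, b_gd = 0.0, 0.0
grads = [grad_one(w_gd, b_gd, xn, yn) for xn, yn in zip(x, y_hat)]
w_gd -= eta * sum(g[0] for g in grads)
b_gd -= eta * sum(g[1] for g in grads)

# Stochastic gradient descent: one update per example.
w_sgd, b_sgd = 0.0, 0.0
for xn, yn in zip(x, y_hat):
    gw, gb = grad_one(w_sgd, b_sgd, xn, yn)
    w_sgd, b_sgd = w_sgd - eta * gw, b_sgd - eta * gb

print(w_gd, b_gd, w_sgd, b_sgd)
```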

Gradient Descent Tip 3: Feature Scaling

Feature Scaling

Make different features have the same scaling

Source of figure: http://cs231n.github.io/neural-networks-2/

๐‘ฆ = ๐‘ + ๐‘ค1๐‘ฅ1 +๐‘ค2๐‘ฅ2

๐‘ฅ1 ๐‘ฅ1

๐‘ฅ2 ๐‘ฅ2

Feature Scaling

$$y = b + w_1 x_1 + w_2 x_2$$

If $x_1$ takes values like 1, 2, …… while $x_2$ takes values like 100, 200, ……, then a small change in $w_2$ changes the loss $L$ much more than the same change in $w_1$, so the contours of $L$ over $(w_1, w_2)$ are elongated ellipses.

After scaling $x_2$ to values like 1, 2, ……, $w_1$ and $w_2$ have a comparable influence on $L$ and the contours are close to circles, so gradient descent, which moves along the negative gradient, heads more directly toward the minimum.

[Figure: the two loss-contour plots over $(w_1, w_2)$, before and after feature scaling.]

Feature Scaling

Given $R$ examples $x^1, x^2, x^3, \ldots, x^r, \ldots, x^R$, for each dimension $i$ compute the mean $m_i$ and the standard deviation $\sigma_i$ of $x_i^1, x_i^2, \ldots, x_i^R$, then

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

After scaling, the means of all dimensions are 0, and the variances are all 1.
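A minimal sketch of this standardization, assuming the examples are stacked as the rows of a matrix X (shape R by number of dimensions):

```python
import numpy as np

def standardize(X):
    """For each dimension i, subtract the mean m_i and divide by the standard deviation sigma_i."""
    m = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - m) / sigma

# Made-up examples: x1 on the scale of 1-2, x2 on the scale of 100-200.
X = np.array([[1.0, 100.0], [1.5, 150.0], [2.0, 200.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```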

Gradient Descent Theory

Question

• When solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent:
• Each time we update the parameters, we obtain $\theta$ that makes $L(\theta)$ smaller.

$$L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$$

Is this statement correct?

Warning of Math


Formal Derivation

• Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$.

[Figure: contours of $L(\theta)$ over $(\theta_1, \theta_2)$ with successive points $\theta^0$, $\theta^1$, $\theta^2$.]

Given a point, we can easily find the point with the smallest value nearby. How?

Taylor Series

• Taylor series: Let $h(x)$ be any function infinitely differentiable around $x = x_0$.

$$h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \frac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$$

When $x$ is close to $x_0$:

$$h(x) \approx h(x_0) + h'(x_0)(x - x_0)$$

E.g. the Taylor series for $h(x) = \sin(x)$ around $x_0 = \pi/4$: the approximation is good around $\pi/4$.
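A quick numerical check of the first-order approximation around $x_0 = \pi/4$ (the sample points below are arbitrary):

```python
import math

x0 = math.pi / 4
h0, dh0 = math.sin(x0), math.cos(x0)  # h(x0) and h'(x0)

for x in (x0 + 0.01, x0 + 0.1, x0 + 1.0):  # arbitrary points, increasingly far from x0
    approx = h0 + dh0 * (x - x0)           # h(x0) + h'(x0) * (x - x0)
    print(f"x = {x:.3f}   sin(x) = {math.sin(x):.5f}   first-order approx = {approx:.5f}")
```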

Multivariable Taylor Series

$$h(x, y) = h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0) + \text{something related to } (x - x_0)^2 \text{ and } (y - y_0)^2 + \cdots$$

When $x$ and $y$ are close to $x_0$ and $y_0$:

$$h(x, y) \approx h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$$

Back to Formal Derivation

[Figure: contours of $L(\theta)$ over $(\theta_1, \theta_2)$ with a small red circle centered at $(a, b)$.]

Based on the Taylor series, if the red circle (centered at $(a, b)$) is small enough, then inside the red circle:

$$L(\theta) \approx L(a, b) + \frac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a) + \frac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$$

Let

$$s = L(a, b), \qquad u = \frac{\partial L(a, b)}{\partial \theta_1}, \qquad v = \frac{\partial L(a, b)}{\partial \theta_2}$$

so that

$$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$$

Back to Formal Derivation

Based on the Taylor series, if the red circle is small enough, inside the red circle:

$$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$$

where $s = L(a, b)$ is a constant, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, and $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$.

Find $\theta_1$ and $\theta_2$ in the red circle (centered at $(a, b)$ with radius $d$) minimizing $L(\theta)$:

$$(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$$

Simple, right?

Gradient descent with two variables. Red circle (if the radius is small):

$$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$$

Write $\Delta\theta_1 = \theta_1 - a$ and $\Delta\theta_2 = \theta_2 - b$. To minimize $L(\theta)$ subject to $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$, we minimize the inner product of $(\Delta\theta_1, \Delta\theta_2)$ with $(u, v)$, so we take $(\Delta\theta_1, \Delta\theta_2)$ in the direction opposite to $(u, v)$, scaled to reach the edge of the circle:

$$\begin{bmatrix} \Delta\theta_1 \\ \Delta\theta_2 \end{bmatrix} = -\eta \begin{bmatrix} u \\ v \end{bmatrix} \quad\Longrightarrow\quad \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} u \\ v \end{bmatrix}$$

Back to Formal Derivation

Find ๐œƒ1 and ๐œƒ2 yielding the smallest value of ๐ฟ ๐œƒ in the circle

v

u

b

a

2

1

2

1

,L

,L

ba

ba

b

a This is gradient descent.

Based on Taylor Series:If the red circle is small enough, in the red circle

bvaus 21L

bas ,L

21

,L,

,L

bav

bau

constant

Not satisfied if the red circle (learning rate) is not small enough

You can consider the second order term, e.g. Newtonโ€™s method.
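Keeping the second-order term of the Taylor expansion leads to Newton's method, whose update uses the Hessian $H$ of the loss:

$$\theta \leftarrow \theta - H(\theta)^{-1}\,\nabla L(\theta), \qquad H_{ij}(\theta) = \frac{\partial^2 L(\theta)}{\partial \theta_i\, \partial \theta_j}$$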

End of Warning

More Limitations of Gradient Descent

[Figure: the loss $L$ plotted against the value of the parameter $w$.]

Very slow at the plateau ($\partial L / \partial w \approx 0$)

Stuck at a saddle point ($\partial L / \partial w = 0$)

Stuck at local minima ($\partial L / \partial w = 0$)

Acknowledgement

โ€ขๆ„Ÿ่ฌ Victor Chen็™ผ็พๆŠ•ๅฝฑ็‰‡ไธŠ็š„ๆ‰“ๅญ—้Œฏ่ชค