Question about Gradient Descent
Hung-yi Lee
Larger gradient, larger steps?
y = a x^2 + b x + c
|dy/dx| = |2a x + b|
Starting at x0, the minimum is at -b/(2a), so the distance to it is
|x0 - (-b/(2a))| = |2a x0 + b| / (2a)
Best step: |2a x0 + b| / (2a)
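The best-step formula can be checked numerically; a minimal Python sketch, where the coefficients a, b, c and the starting point x0 are illustrative assumptions, not values from the slides:

```python
# Illustrative quadratic: y = x^2 - 4x + 3, minimum at x = -b/(2a) = 2.
a, b, c = 1.0, -4.0, 3.0
x0 = 5.0                           # assumed starting point
grad = 2 * a * x0 + b              # first derivative at x0: 6.0
best_step = abs(grad) / (2 * a)    # |2a x0 + b| / (2a): 3.0
x_min = -b / (2 * a)
# Taking exactly the best step lands on the minimum.
assert abs((x0 - best_step) - x_min) < 1e-12
```

One best step from any x0 reaches the minimum exactly, which is what makes the later contradiction with Adagrad-style updates worth examining.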
Contradiction?

Original gradient descent:
    w^{t+1} <- w^t - η g^t,  where g^t = ∂C(w^t)/∂w
    Larger gradient, larger step.

Adagrad:
    w^{t+1} <- w^t - η / sqrt(Σ_{i=0}^{t} (g^i)^2) · g^t
    Divided by the accumulated first derivatives.

RMSprop:
    w^{t+1} <- w^t - η / σ^t · g^t,  where σ^t = sqrt(α (σ^{t-1})^2 + (1-α) (g^t)^2)
    Divided by the accumulated first derivatives.
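The three update rules can be sketched side by side on a toy 1-D objective; C(w) = w^2 and the values of η and α are assumptions for illustration only:

```python
import math

eta, alpha, eps = 0.1, 0.9, 1e-8   # assumed hyperparameters

def grad(w):                       # g = dC/dw for the toy objective C(w) = w^2
    return 2.0 * w

# Original gradient descent: w <- w - eta * g
w = 1.0
for _ in range(50):
    w = w - eta * grad(w)

# Adagrad: divide by the sqrt of the sum of all past squared gradients
w_ada, sq_sum = 1.0, 0.0
for _ in range(50):
    g = grad(w_ada)
    sq_sum += g * g
    w_ada = w_ada - eta / (math.sqrt(sq_sum) + eps) * g

# RMSprop: sigma^2 is an exponential moving average of squared gradients
w_rms, sigma2 = 1.0, 0.0
for _ in range(50):
    g = grad(w_rms)
    sigma2 = alpha * sigma2 + (1 - alpha) * g * g
    w_rms = w_rms - eta / (math.sqrt(sigma2) + eps) * g

# All three move toward the minimum at w = 0.
print(w, w_ada, w_rms)
```

Note the tension the slide points out: the numerator says "larger gradient, larger step", while the denominator divides by the (accumulated) gradient magnitude.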
Second Derivative

y = a x^2 + b x + c
|dy/dx| = |2a x + b|
d^2y/dx^2 = 2a
Starting at x0, the distance to the minimum -b/(2a) is |x0 - (-b/(2a))| = |2a x0 + b| / (2a)
Best step: |2a x0 + b| / (2a)
The best step is |First derivative| / Second derivative.
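This "best step = |first derivative| / second derivative" is exactly one Newton step, which on a quadratic lands on the minimum in a single move. A small sketch with assumed, illustrative coefficients:

```python
# Illustrative quadratic: y = 2x^2 - 8x + 1, minimum at x = -b/(2a) = 2.
a, b, c = 2.0, -8.0, 1.0
x0 = -3.0                          # assumed starting point
first = 2 * a * x0 + b             # f'(x0) = -20.0
second = 2 * a                     # f''(x0) = 4.0 (constant for a quadratic)
x1 = x0 - first / second           # signed Newton step of size |f'| / f''
assert abs(x1 - (-b / (2 * a))) < 1e-12   # one step reaches the minimum
```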
More than one parameter

Compare cross-sections of the error surface along w1 and w2:
along w1 (larger second derivative), points a and b; along w2 (smaller second derivative), points c and d.
Within one parameter, comparing first derivatives works: a > b and c > d in |first derivative|, and a and c are indeed farther from their minima.
Across parameters it fails: c < a in |first derivative|, yet c is farther from its minimum than a.
The best step is |First derivative| / Second derivative.
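The cross-parameter failure can be reproduced with two independent quadratics standing in for the two cross-sections; the curvatures and positions below are assumed, illustrative numbers:

```python
# Along w1 the curvature is large, along w2 it is small (illustrative values).
a1, a2 = 4.0, 0.25                 # second derivatives: 2*a1 = 8.0, 2*a2 = 0.5
w1, w2 = 1.0, 3.0                  # current distance from each minimum (at 0)
g1 = 2 * a1 * w1                   # |first derivative| along w1: 8.0
g2 = 2 * a2 * w2                   # |first derivative| along w2: 1.5
assert g1 > g2                     # w1 has the larger gradient...
assert w2 > w1                     # ...but w2 is farther from its minimum
# Dividing by the second derivative recovers the true distance in each direction:
assert g1 / (2 * a1) == w1 and g2 / (2 * a2) == w2
```

So the raw gradient magnitude is only comparable across parameters after dividing by the second derivative.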
What to do with Adagrad and RMSprop?

The best step is |First derivative| / Second derivative, but computing second derivatives is expensive.
Idea: use the first derivatives to estimate the second derivative.
sqrt(Σ (first derivative)^2), accumulated over the points visited, reflects the size of the second derivative:
along w1 (larger second derivative), the sampled first derivatives are larger;
along w2 (smaller second derivative), they are smaller.
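The estimate can be illustrated numerically: the root-mean-square of first derivatives sampled over a region scales with the curvature. A sketch under assumed toy settings (y = a x^2 sampled uniformly on [-1, 1]):

```python
import random

random.seed(0)

def rms_grad(a, n=10000):
    # y = a*x^2 has f'(x) = 2ax and constant f'' = 2a;
    # sample |f'| over [-1, 1] and take the root mean square.
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    return (sum((2 * a * x) ** 2 for x in xs) / n) ** 0.5

small = rms_grad(0.5)              # flat direction,  f'' = 1.0
large = rms_grad(4.0)              # sharp direction, f'' = 8.0
assert large > small               # larger second derivative -> larger RMS gradient
# Up to sampling noise, large/small matches the ratio of second derivatives (8x).
```

This is why dividing by sqrt(Σ (g^i)^2), as Adagrad and RMSprop do, approximates dividing by the second derivative without ever computing it.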
Acknowledgement
• This question was raised by ζ廣ε.
Thanks for your attention!