Back-Propagation (source: cs.cmu.edu, 16385/s17/Slides/9.5_Backpropagation.pdf)
Back-Propagation16-385 Computer Vision (Kris Kitani)
Carnegie Mellon University
back to the…
World’s Smallest Perceptron!
y = wx
a function of ONE parameter! (a.k.a. line equation, linear regression)
[diagram: input x passes through weight w and function f to produce output y]
Training the world’s smallest perceptron
this should be the gradient of the loss function
Now where does this come from?
This is just gradient descent, that means…

dL/dw …is the rate at which this will change (the loss function, L = ½(y − ŷ)²)
… per unit change of this (the weight parameter, where ŷ = wx)
Let’s compute the derivative…
Compute the derivative:

dL/dw = d/dw { ½(y − ŷ)² }
      = −(y − ŷ) d(wx)/dw
      = −(y − ŷ) x = ∇w      (just shorthand)

That means the weight update for gradient descent is:

w = w − ∇w
  = w + (y − ŷ) x      (move in direction of negative gradient)
Gradient Descent (world’s smallest perceptron)
For each sample {x_i, y_i}:
1. Predict
   a. Forward pass:      ŷ = w x_i
   b. Compute Loss:      L_i = ½(y_i − ŷ)²
2. Update
   a. Back Propagation:  dL_i/dw = −(y_i − ŷ) x_i = ∇w
   b. Gradient update:   w = w − ∇w
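The loop above can be sketched directly in Python. The training data, learning rate, and epoch count below are illustrative assumptions, not from the slides:

```python
# Gradient descent for the "world's smallest perceptron" y_hat = w*x,
# with L2 loss L = 1/2 (y - y_hat)^2.

def train(samples, w=0.0, epochs=20, lr=0.1):
    """Run per-sample gradient descent over (x, y) pairs; lr is the step size."""
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w * x                 # forward pass
            grad_w = -(y - y_hat) * x     # dL/dw
            w = w - lr * grad_w           # move against the gradient
    return w

# Data generated from y = 3x, so w should approach 3.
samples = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = train(samples)
print(round(w, 2))  # → 3.0
```

Note that the per-sample update is exactly the slide's rule w = w + lr·(y − ŷ)·x, written with the gradient spelled out.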
Training the world’s smallest perceptron

world’s (second) smallest perceptron!
a function of two parameters!
ŷ = Σ_i w_i x_i = w1 x1 + w2 x2
[diagram: inputs x1 and x2, weighted by w1 and w2 and summed to produce ŷ]
Gradient Descent
For each sample {x_i, y_i}:
1. Predict
   a. Forward pass
   b. Compute Loss
2. Update
   a. Back Propagation
   b. Gradient update
we just need to compute partial derivatives for this network
Back-Propagation

∂L/∂w1 = ∂/∂w1 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w1
       = −(y − ŷ) ∂(Σ_i w_i x_i)/∂w1
       = −(y − ŷ) ∂(w1 x1)/∂w1
       = −(y − ŷ) x1 = ∇w1

∂L/∂w2 = ∂/∂w2 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w2
       = −(y − ŷ) ∂(Σ_i w_i x_i)/∂w2
       = −(y − ŷ) ∂(w2 x2)/∂w2
       = −(y − ŷ) x2 = ∇w2

Why do we have partial derivatives now? Because the loss now depends on two weights, so we differentiate with respect to each one while holding the other fixed.
Gradient Update

w1 = w1 − η∇w1 = w1 + η(y − ŷ) x1
w2 = w2 − η∇w2 = w2 + η(y − ŷ) x2
Gradient Descent
For each sample {x_i, y_i}:
1. Predict
   a. Forward pass:      ŷ = f_MLP(x_i; θ)
   b. Compute Loss:      L_i = ½(y_i − ŷ)²
2. Update
   a. Back Propagation:  ∇w1 = −(y_i − ŷ) x_1i
                         ∇w2 = −(y_i − ŷ) x_2i      (two BP lines now)
   b. Gradient update:   w1 = w1 + η(y_i − ŷ) x_1i
                         w2 = w2 + η(y_i − ŷ) x_2i
(since gradients are approximated from a stochastic sample)
(η is an adjustable step size)
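The two-parameter version, with random sample order, can be sketched as follows. The data (drawn from y = 2·x1 − 1·x2), the learning rate, and the epoch count are made-up illustrations:

```python
import random

# Stochastic gradient descent for the two-parameter perceptron
# y_hat = w1*x1 + w2*x2, with L2 loss. One partial derivative per weight.

def sgd(samples, lr=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        # Visit the samples in a random order each epoch.
        for x1, x2, y in rng.sample(samples, len(samples)):
            y_hat = w1 * x1 + w2 * x2      # forward pass
            grad_w1 = -(y - y_hat) * x1    # dL/dw1
            grad_w2 = -(y - y_hat) * x2    # dL/dw2
            w1 -= lr * grad_w1             # the "two BP lines"
            w2 -= lr * grad_w2
    return w1, w2

# Data generated from y = 2*x1 - 1*x2, so (w1, w2) should approach (2, -1).
data = [(x1, x2, 2.0 * x1 - 1.0 * x2)
        for x1, x2 in [(1, 0), (0, 1), (1, 1), (2, 1)]]
w1, w2 = sgd(data)
```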
We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…
multi-layer perceptron
a function of FOUR parameters and FOUR layers!
[diagram: input x, weights w1, w2, w3, bias b1, hidden activations h1, h2, output y]
[diagram: input layer 1 takes x and bias b1; hidden layer 2 applies weight w1, sum a1, activation f1; hidden layer 3 applies weight w2, sum a2, activation f2; output layer 4 applies weight w3, sum a3, activation f3 to produce ŷ]
a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))
Entire network can be written out as one long equation:

ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))

We need to train the network. What is known? What is unknown?
known: the input x (and, during training, the target output y)
unknown: the weights w1, w2, w3 and the bias b1 (and the activation function sometimes has parameters too)
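The forward-pass equations above can be sketched directly in code. Using a sigmoid for every activation f_k is an assumption here (the slides fix the activation only later), and the parameter values are arbitrary:

```python
import math

# Forward pass of the four-parameter MLP:
#   a1 = w1*x + b1,  a2 = w2*f1(a1),  a3 = w3*f2(a2),  y_hat = f3(a3)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, w3, b1):
    a1 = w1 * x + b1        # layer 1: weighted input plus bias
    h1 = sigmoid(a1)        # f1
    a2 = w2 * h1
    h2 = sigmoid(a2)        # f2
    a3 = w3 * h2
    return sigmoid(a3)      # f3: the network output y_hat

y_hat = forward(x=0.5, w1=1.0, w2=-2.0, w3=0.5, b1=0.1)
```

Because the output activation is a sigmoid, y_hat always lies strictly between 0 and 1.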
Learning an MLP
Given a set of samples {x_i, y_i} and an MLP ŷ = f_MLP(x; θ),
estimate the parameters of the MLP: θ = {f, w, b}
Stochastic Gradient Descent
For each random sample {x_i, y_i}:
1. Predict
   a. Forward pass:      ŷ = f_MLP(x_i; θ)
   b. Compute Loss
2. Update
   a. Back Propagation:  ∂L/∂θ   (vector of parameter partial derivatives)
   b. Gradient update:   θ ← θ − η∇θ   (vector of parameter update equations)

So we need to compute the partial derivatives

∂L/∂θ = [ ∂L/∂w3, ∂L/∂w2, ∂L/∂w1, ∂L/∂b ]
[diagram: x → (w1, a1, f1) → (w2, a2, f2) → (w3, a3, f3) → ŷ, with bias b1 and the loss layer L(y, ŷ)]

The partial derivative ∂L/∂w1 describes… how does this (the weight w1) affect this (the loss L)?
So, how do you compute it?
Remember,
The Chain Rule
∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)

[diagram: … f2 → (w3, a3, f3) → ŷ → rest of the network → L(y, ŷ)]

Intuitively, the effect of the weight on the loss function:
L depends on f3, f3 depends on a3, and a3 depends on w3.
According to the chain rule…

[diagram: … f2 → (w3, a3, f3) → ŷ → rest of the network → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)                 (Chain Rule!)
       = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)                 (just the partial derivative of the L2 loss)
       = −(y − ŷ) f3(1 − f3)(∂a3/∂w3)               (let’s use a sigmoid: ds(x)/dx = s(x)(1 − s(x)))
       = −(y − ŷ) f3(1 − f3) f2                     (since a3 = w3 · f2, so ∂a3/∂w3 = f2)
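The derived expression for ∂L/∂w3 can be sanity-checked against a finite-difference estimate. The values of w3, f2, and y below are arbitrary test inputs, not from the slides:

```python
import math

# Check dL/dw3 = -(y - y_hat) * f3*(1 - f3) * f2 numerically,
# where f3 = sigmoid(a3), a3 = w3 * f2, and L = 1/2 (y - y_hat)^2.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w3, f2, y):
    y_hat = sigmoid(w3 * f2)
    return 0.5 * (y - y_hat) ** 2

w3, f2, y = 0.7, 0.4, 1.0
y_hat = sigmoid(w3 * f2)

# Analytic gradient from the chain rule above.
analytic = -(y - y_hat) * y_hat * (1 - y_hat) * f2

# Central finite-difference approximation of dL/dw3.
eps = 1e-6
numeric = (loss(w3 + eps, f2, y) - loss(w3 - eps, f2, y)) / (2 * eps)
```

The two values agree to many decimal places, which is a standard way to debug a hand-derived gradient.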
[diagram: x → (w1, a1, f1) → (w2, a2, f2) → (w3, a3, f3) → ŷ, with bias b1]

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

The first two factors were already computed for ∂L/∂w3. Re-use them (propagate)!
The Chain rule says…

[diagram: x → (w1, a1, f1) → (w2, a2, f2) → (w3, a3, f3) → ŷ, with bias b1]
L depends on f3, which depends on a3, which depends on f2, which depends on a2, which depends on f1, which depends on a1, which depends on w1.

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

The leading factors were already computed for ∂L/∂w3 and ∂L/∂w2. Re-use them (propagate)!
[diagram: x → (w1, a1, f1) → (w2, a2, f2) → (w3, a3, f3) → ŷ, with bias b1]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)
∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)
∂L/∂b  = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂b)
Stochastic Gradient Descent
For each sample {x_i, y_i}:
1. Predict
   a. Forward pass:      ŷ = f_MLP(x_i; θ)
   b. Compute Loss:      L_i
2. Update
   a. Back Propagation:  compute ∂L/∂w3, ∂L/∂w2, ∂L/∂w1, ∂L/∂b with the chain-rule products above
   b. Gradient update:   w3 = w3 − η∇w3
                         w2 = w2 − η∇w2
                         w1 = w1 − η∇w1
                         b  = b − η∇b

∂L/∂θ is the vector of parameter partial derivatives, and θ ← θ − η ∂L/∂θ is the vector of parameter update equations.
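The whole procedure, forward pass, back-propagation with re-used intermediate terms, and gradient update, can be sketched end to end. Sigmoid activations for f1, f2, f3, the single training pair, the initialization, and the step size are all illustrative assumptions:

```python
import math

# Full training loop for the four-parameter MLP (w1, w2, w3, b)
# with sigmoid activations and L2 loss.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, w3, b):
    f1 = sigmoid(w1 * x + b)
    f2 = sigmoid(w2 * f1)
    return sigmoid(w3 * f2)              # y_hat

def train_step(x, y, w1, w2, w3, b, lr=1.0):
    # Forward pass: cache every intermediate activation.
    f1 = sigmoid(w1 * x + b)
    f2 = sigmoid(w2 * f1)
    f3 = sigmoid(w3 * f2)                # y_hat
    # Back-propagation: each delta re-uses the one already computed.
    d3 = -(y - f3) * f3 * (1 - f3)       # dL/da3
    d2 = d3 * w3 * f2 * (1 - f2)         # dL/da2 (re-uses d3)
    d1 = d2 * w2 * f1 * (1 - f1)         # dL/da1 (re-uses d2)
    # Gradient update: step against each partial derivative.
    return (w1 - lr * d1 * x,
            w2 - lr * d2 * f1,
            w3 - lr * d3 * f2,
            b  - lr * d1)

# Fit a single made-up training pair: input 1.0 should map to 0.8.
params = (0.5, 0.5, 0.5, 0.0)
for _ in range(5000):
    params = train_step(1.0, 0.8, *params)
y_hat = forward(1.0, *params)
```

The delta terms d3, d2, d1 are exactly the shared prefixes of the four chain-rule products in the slides: computing them once and passing them backward is what gives back-propagation its name.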