Part 7: Neural Networks & Learning 11/2/05
Neural Networks and Back-Propagation Learning
Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks
Feedforward Network
[Figure: a feedforward network, with an input layer on the left, several hidden layers, and an output layer on the right.]
Credit Assignment Problem

[Figure: the same feedforward network, with the desired output shown beside the output layer.]

How do we adjust the weights of the hidden layers?
Adaptive System

[Diagram: a system S with control parameters C = (P1, …, Pk, …, Pm) is evaluated by a function F (fitness, figure of merit); a control algorithm adjusts the parameters based on F.]
Gradient

$\frac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$

$\nabla F = \left( \frac{\partial F}{\partial P_1}, \ldots, \frac{\partial F}{\partial P_k}, \ldots, \frac{\partial F}{\partial P_m} \right)^{\mathrm T}$

$\nabla F$ points in the direction of maximum increase in $F$
Gradient Ascent on Fitness Surface

[Figure: a fitness surface with a maximum (+) and a minimum (–); the gradient-ascent path follows $\nabla F$ uphill.]
Gradient Ascent by Discrete Steps

[Figure: the same surface, with the ascent taken in discrete steps along the gradient.]
Gradient Ascent is Local But Not Shortest

[Figure: the gradient-ascent path between the minimum (–) and maximum (+) is locally steepest but is not the shortest path to the maximum.]
Gradient Ascent Process

$\dot{\mathbf P} = \eta \nabla F(\mathbf P)$

Change in fitness:

$\dot F = \frac{dF}{dt} = \sum_{k=1}^m \frac{\partial F}{\partial P_k} \frac{dP_k}{dt} = \sum_{k=1}^m (\nabla F)_k \dot P_k$

$\dot F = \nabla F \cdot \dot{\mathbf P}$

$\dot F = \eta \nabla F \cdot \nabla F = \eta \|\nabla F\|^2 \ge 0$

Therefore gradient ascent increases fitness (until it reaches a 0 gradient).
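The update rule above can be sketched numerically. A minimal example (the quadratic fitness function and the step size η = 0.1 are illustrative choices, not from the slides): discrete gradient-ascent steps increase the fitness monotonically and approach its maximum.

```python
import numpy as np

# Illustrative fitness: a smooth bump with its maximum at P = (1, 2).
def fitness(P):
    return -((P[0] - 1.0) ** 2 + (P[1] - 2.0) ** 2)

def grad_fitness(P):
    # Analytic gradient of the fitness above.
    return np.array([-2.0 * (P[0] - 1.0), -2.0 * (P[1] - 2.0)])

def gradient_ascent(P, eta=0.1, steps=100):
    # Discrete version of dP/dt = eta * grad F(P).
    history = [fitness(P)]
    for _ in range(steps):
        P = P + eta * grad_fitness(P)
        history.append(fitness(P))
    return P, history

P_final, history = gradient_ascent(np.array([5.0, -3.0]))
# history is non-decreasing, and P_final approaches the maximum (1, 2).
```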
General Ascent in Fitness

Note that any adaptive process $\mathbf P(t)$ will increase fitness provided:

$0 < \dot F = \nabla F \cdot \dot{\mathbf P} = \|\nabla F\| \, \|\dot{\mathbf P}\| \cos\varphi$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf P}$.

Hence we need $\cos\varphi > 0$, or $\varphi < 90^\circ$.
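A quick numeric check of this condition (a minimal sketch; the toy fitness function and the particular step direction are invented for illustration): a small step that is not along the gradient, but makes an angle of less than 90° with it, still increases F.

```python
import numpy as np

# Toy fitness with a single maximum at the origin.
def F(P):
    return -np.sum(P ** 2)

def gradF(P):
    return -2.0 * P

P = np.array([3.0, 1.0])
g = gradF(P)

# A step direction that is NOT the gradient direction...
d = np.array([-1.0, 0.0])
# ...but makes an angle of less than 90 degrees with it (cos > 0).
cos_angle = g @ d / (np.linalg.norm(g) * np.linalg.norm(d))

P_new = P + 0.01 * d   # small step in direction d
increased = F(P_new) > F(P)
```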
General Ascent on Fitness Surface

[Figure: an ascent path between the minimum (–) and maximum (+) that does not follow $\nabla F$ exactly but still climbs the fitness surface.]
Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf t^1, \ldots, \mathbf t^Q$.

Suppose for parameters $\mathbf P$ the corresponding actual outputs are $\mathbf y^1, \ldots, \mathbf y^Q$.

Suppose $D(\mathbf t, \mathbf y) \in [0, \infty)$ measures the difference between target and actual outputs.

Let $E^q = D(\mathbf t^q, \mathbf y^q)$ be the error on the $q$th sample.

Let $F(\mathbf P) = -\sum_{q=1}^{Q} E^q(\mathbf P) = -\sum_{q=1}^{Q} D[\mathbf t^q, \mathbf y^q(\mathbf P)]$.
Gradient of Fitness

$\nabla F = -\nabla \sum_q E^q = -\sum_q \nabla E^q$

$\frac{\partial E^q}{\partial P_k}
= \frac{\partial}{\partial P_k} D(\mathbf t^q, \mathbf y^q)
= \sum_j \frac{\partial D(\mathbf t^q, \mathbf y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k}
= \frac{dD(\mathbf t^q, \mathbf y^q)}{d\mathbf y^q} \cdot \frac{\partial \mathbf y^q}{\partial P_k}
= \nabla_{\mathbf y^q} D(\mathbf t^q, \mathbf y^q) \cdot \frac{\partial \mathbf y^q}{\partial P_k}$
Jacobian Matrix

Define the Jacobian matrix
$\mathbf J^q = \begin{pmatrix}
\frac{\partial y_1^q}{\partial P_1} & \cdots & \frac{\partial y_1^q}{\partial P_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n^q}{\partial P_1} & \cdots & \frac{\partial y_n^q}{\partial P_m}
\end{pmatrix}$

Note $\mathbf J^q \in \mathbb R^{n \times m}$ and $\nabla D(\mathbf t^q, \mathbf y^q) \in \mathbb R^{n \times 1}$.

Since $(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k} \frac{\partial D(\mathbf t^q, \mathbf y^q)}{\partial y_j^q}$,

$\nabla E^q = (\mathbf J^q)^{\mathrm T} \nabla D(\mathbf t^q, \mathbf y^q)$
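This identity can be spot-checked numerically. A minimal sketch (the toy system `y_of`, mapping 2 parameters to 3 outputs, and the particular vectors are invented for illustration): build the Jacobian by central finite differences, form the chain-rule gradient, and compare it against a direct finite-difference gradient of the error.

```python
import numpy as np

# Hypothetical smooth system: 2 parameters P -> 3 outputs y.
def y_of(P):
    return np.array([P[0] * P[1], np.sin(P[0]), P[1] ** 2])

def D(t, y):
    return np.sum((t - y) ** 2)     # squared Euclidean distance

t = np.array([1.0, 0.5, 2.0])
P = np.array([0.7, 1.3])
eps = 1e-6

# Finite-difference Jacobian J[j, k] = dy_j / dP_k  (n x m)
J = np.zeros((3, 2))
for k in range(2):
    dP = np.zeros(2); dP[k] = eps
    J[:, k] = (y_of(P + dP) - y_of(P - dP)) / (2 * eps)

grad_D = 2 * (y_of(P) - t)          # dD/dy for squared distance
grad_E = J.T @ grad_D               # chain rule: grad_P E = J^T grad_y D

# Direct finite-difference gradient of E(P) = D(t, y(P)) for comparison
grad_E_fd = np.zeros(2)
for k in range(2):
    dP = np.zeros(2); dP[k] = eps
    grad_E_fd[k] = (D(t, y_of(P + dP)) - D(t, y_of(P - dP))) / (2 * eps)
```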
Derivative of Squared Euclidean Distance

Suppose $D(\mathbf t, \mathbf y) = \|\mathbf t - \mathbf y\|^2 = \sum_i (t_i - y_i)^2$

$\frac{\partial D(\mathbf t, \mathbf y)}{\partial y_j}
= \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2
= \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j}
= \frac{d(t_j - y_j)^2}{dy_j}
= -2(t_j - y_j)$

$\frac{dD(\mathbf t, \mathbf y)}{d\mathbf y} = 2(\mathbf y - \mathbf t)$
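This result is easy to verify numerically. A small sketch (the particular t and y vectors are arbitrary): compare the analytic derivative 2(y − t) against central finite differences.

```python
import numpy as np

t = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.7, 1.1])

def D(t, y):
    # Squared Euclidean distance between target and actual outputs.
    return np.sum((t - y) ** 2)

analytic = 2 * (y - t)              # dD/dy as derived above

# Central finite differences, one component at a time
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    dy = np.zeros(3); dy[j] = eps
    numeric[j] = (D(t, y + dy) - D(t, y - dy)) / (2 * eps)
```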
Gradient of Error on qth Input

$\frac{\partial E^q}{\partial P_k}
= \frac{dD(\mathbf t^q, \mathbf y^q)}{d\mathbf y^q} \cdot \frac{\partial \mathbf y^q}{\partial P_k}
= 2(\mathbf y^q - \mathbf t^q) \cdot \frac{\partial \mathbf y^q}{\partial P_k}
= 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$
Recap

To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, $\frac{\partial y_j^q}{\partial P_k}$, which say how the $j$th output varies with the $k$th parameter (given the $q$th input).

The Jacobian depends on the specific form of the system; in this case, a feedforward neural network.

$\dot{\mathbf P} = \eta \sum_q (\mathbf J^q)^{\mathrm T} (\mathbf t^q - \mathbf y^q)$
Equations

Net input: $h_i = \sum_{j=1}^{n} w_{ij} s_j$, i.e. $\mathbf h = \mathbf W \mathbf s$

Neuron output: $s_i = \sigma(h_i)$, i.e. $\mathbf s = \sigma(\mathbf h)$
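These two equations translate directly into a layer's forward computation. A minimal sketch in NumPy (the logistic sigmoid is used as σ, anticipating the later slides; the layer sizes and random values are illustrative):

```python
import numpy as np

def sigma(h):
    # Logistic sigmoid; any smooth squashing function would do here.
    return 1.0 / (1.0 + np.exp(-h))

def layer_forward(W, s):
    # Net input h = W s, then neuron output sigma(h).
    h = W @ s
    return sigma(h)

rng = np.random.default_rng(0)
s = rng.random(4)                   # outputs of the previous layer
W = rng.standard_normal((3, 4))     # weights into a 3-neuron layer
out = layer_forward(W, s)           # outputs lie strictly in (0, 1)
```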
Multilayer Notation

[Figure: layers $\mathbf s^1, \mathbf s^2, \ldots, \mathbf s^{L-1}, \mathbf s^L$ connected by weight matrices $\mathbf W^1, \mathbf W^2, \ldots, \mathbf W^{L-2}, \mathbf W^{L-1}$; the input is $\mathbf x^q$ and the output is $\mathbf y^q$.]
Notation
• $L$ layers of neurons labeled 1, …, $L$
• $N_l$ neurons in layer $l$
• $\mathbf s^l$ = vector of outputs from neurons in layer $l$
• input layer $\mathbf s^1 = \mathbf x^q$ (the input pattern)
• output layer $\mathbf s^L = \mathbf y^q$ (the actual output)
• $\mathbf W^l$ = weights between layers $l$ and $l+1$
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{ij}^l$ ($l = 1, \ldots, L-1$)
Typical Neuron

[Figure: neuron $i$ in layer $l$ receives inputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, forms net input $h_i^l$, and outputs $s_i^l$.]
Error Back-Propagation

We will compute $\frac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$).
Delta Values

It is convenient to break the derivatives up by the chain rule:

$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Let $\delta_i^l = \frac{\partial E^q}{\partial h_i^l}$

So $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
Output-Layer Neuron

[Figure: output neuron $i$ receives $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$ through weights $W_{i1}^{L-1}, \ldots, W_{ij}^{L-1}, \ldots, W_{iN}^{L-1}$, forms net input $h_i^L$, and outputs $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to give the error $E^q$.]
Output-Layer Derivatives (1)

$\delta_i^L = \frac{\partial E^q}{\partial h_i^L}
= \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2
= \frac{d(s_i^L - t_i^q)^2}{dh_i^L}
= 2(s_i^L - t_i^q) \frac{ds_i^L}{dh_i^L}
= 2(s_i^L - t_i^q)\, \sigma'(h_i^L)$
Output-Layer Derivatives (2)

$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$

$\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$,
where $\delta_i^L = 2(s_i^L - t_i^q)\, \sigma'(h_i^L)$
Hidden-Layer Neuron

[Figure: hidden neuron $i$ in layer $l$ receives $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, forms net input $h_i^l$, and outputs $s_i^l$, which feeds neurons $s_1^{l+1}, \ldots, s_k^{l+1}, \ldots, s_N^{l+1}$ through weights $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$; all of these paths contribute to the error $E^q$.]
Hidden-Layer Derivatives (1)

Recall $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}
= \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l}
= \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$

$\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l}
= W_{ki}^l \frac{\partial s_i^l}{\partial h_i^l}
= W_{ki}^l \frac{d\sigma(h_i^l)}{dh_i^l}
= W_{ki}^l\, \sigma'(h_i^l)$

$\delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\, \sigma'(h_i^l)
= \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
Hidden-Layer Derivatives (2)

$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
= \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1}
= \frac{dW_{ij}^{l-1} s_j^{l-1}}{dW_{ij}^{l-1}}
= s_j^{l-1}$

$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$,
where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
Derivative of Sigmoid

Suppose $s = \sigma(h) = \frac{1}{1 + \exp(-h)}$ (logistic sigmoid)

$D_h s = D_h [1 + \exp(-h)]^{-1}
= -[1 + \exp(-h)]^{-2} D_h (1 + e^{-h})
= [1 + e^{-h}]^{-2} e^{-h}
= \frac{e^{-h}}{(1 + e^{-h})^2}
= \frac{1}{1 + e^{-h}} \cdot \frac{e^{-h}}{1 + e^{-h}}
= s \left( \frac{1 + e^{-h}}{1 + e^{-h}} - \frac{1}{1 + e^{-h}} \right)
= s(1 - s)$
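The identity σ′(h) = s(1 − s) can be confirmed numerically (a minimal check; the sample points are arbitrary):

```python
import numpy as np

def sigma(h):
    # Logistic sigmoid as defined above.
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-4.0, 4.0, 9)
s = sigma(h)

analytic = s * (1 - s)              # the result derived above

# Central finite differences as an independent check
eps = 1e-6
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)
```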
Summary of Back-Propagation Algorithm

Output layer: $\delta_i^L = 2 s_i^L (1 - s_i^L)(s_i^L - t_i^q)$

$\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$

Hidden layers: $\delta_i^l = s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
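The summary above translates almost line-for-line into code. A minimal sketch in NumPy (the network shape and the random data are illustrative; the gradient is spot-checked against central finite differences):

```python
import numpy as np

def sigma(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(Ws, x):
    # Returns the list of layer outputs s^1 ... s^L (s^1 is the input).
    s = [x]
    for W in Ws:
        s.append(sigma(W @ s[-1]))
    return s

def backprop_grads(Ws, x, t):
    # Gradients dE/dW^l for E = ||s^L - t||^2, following the summary:
    #   output layer: delta^L = 2 s^L (1 - s^L) (s^L - t)
    #   hidden:       delta^l = s^l (1 - s^l) * (W^l)^T delta^{l+1}
    #   dE/dW^{l-1}_{ij} = delta^l_i s^{l-1}_j
    s = forward(Ws, x)
    delta = 2 * s[-1] * (1 - s[-1]) * (s[-1] - t)
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(delta, s[l])
        if l > 0:
            delta = s[l] * (1 - s[l]) * (Ws[l].T @ delta)
    return grads

def error(Ws, x, t):
    return np.sum((forward(Ws, x)[-1] - t) ** 2)

rng = np.random.default_rng(1)
x = rng.random(3)
t = rng.random(2)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]

grads = backprop_grads(Ws, x, t)

# Spot-check one weight by central finite differences.
eps = 1e-6
W0 = Ws[0].copy()
Ws[0][1, 2] += eps; E_plus = error(Ws, x, t)
Ws[0][1, 2] -= 2 * eps; E_minus = error(Ws, x, t)
Ws[0] = W0
fd = (E_plus - E_minus) / (2 * eps)
```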
Output-Layer Computation

[Figure: the output-layer neuron as before, with the difference $t_i^q - s_i^L$ fed back through a gain of 2 and the factor $s_i^L(1 - s_i^L)$ to form $\delta_i^L$, which drives the change in $W_{ij}^{L-1}$.]

$\delta_i^L = 2 s_i^L (1 - s_i^L)(t_i^q - s_i^L)$

$\Delta W_{ij}^{L-1} = \eta\, \delta_i^L s_j^{L-1}$
Hidden-Layer Computation

[Figure: the hidden-layer neuron as before, with the deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$ of the next layer fed back through the weights $W_{ki}^l$ and the factor $s_i^l(1 - s_i^l)$ to form $\delta_i^l$, which drives the change in $W_{ij}^{l-1}$.]

$\delta_i^l = s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

$\Delta W_{ij}^{l-1} = \eta\, \delta_i^l s_j^{l-1}$
Training Procedures
• Batch Learning
– on each epoch (a pass through all the training pairs),
– weight changes for all patterns are accumulated
– weight matrices are updated at the end of the epoch
– accurate computation of the gradient
• Online Learning
– weights are updated after back-propagation of each training pair
– usually randomize the order for each epoch
– approximation of the gradient
• In practice it doesn't make much difference
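The two procedures can be contrasted on a toy problem. A minimal sketch (the one-parameter linear model, learning rate, and data are invented for illustration): batch learning updates once per epoch from the accumulated gradient, online learning updates after each pair in randomized order, and both converge here.

```python
import numpy as np

# Illustrative data: targets for the true parameter w = 2 in y = w * x.
xs = np.array([1.0, 2.0, 3.0])
ts = np.array([2.0, 4.0, 6.0])

def grad_one(w, x, t):
    # dE/dw for the per-pair error E = (w x - t)^2
    return 2 * (w * x - t) * x

def batch_epoch(w, eta=0.01):
    # Accumulate gradients over all pairs; update once at epoch's end.
    g = sum(grad_one(w, x, t) for x, t in zip(xs, ts))
    return w - eta * g

def online_epoch(w, eta=0.01, rng=None):
    # Update after each pair, in randomized order.
    order = (rng or np.random.default_rng()).permutation(len(xs))
    for i in order:
        w = w - eta * grad_one(w, xs[i], ts[i])
    return w

w_b = w_o = 0.0
rng = np.random.default_rng(2)
for _ in range(200):
    w_b = batch_epoch(w_b)
    w_o = online_epoch(w_o, rng=rng)
# Both w_b and w_o approach the true parameter w = 2.
```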
Summation of Error Surfaces

[Figure: the total error surface E is the sum of the per-pattern error surfaces E1 and E2.]
Gradient Computation in Batch Learning

[Figure: in batch learning, the gradient is computed on the summed error surface E = E1 + E2.]
Gradient Computation in Online Learning

[Figure: in online learning, gradients are computed alternately on the individual error surfaces E1 and E2.]
The Golden Rule of Neural Nets
Neural Networks are the
second-best way
to do everything!