15 Machine Learning Multilayer Perceptron
Neural Networks: Multilayer Perceptron

Andres Mendez-Vazquez

December 12, 2015
Outline

1 Introduction
    The XOR Problem
2 Multi-Layer Perceptron
    Architecture
    Back-propagation
    Gradient Descent
    Hidden-to-Output Weights
    Input-to-Hidden Weights
    Total Training Error
    About Stopping Criteria
    Final Basic Batch Algorithm
3 Using Matrix Operations to Simplify
    Using Matrix Operations to Simplify the Pseudo-Code
    Generating the Output zk
    Generating the Weights from Hidden to Output Layer
    Generating the Weights from Input to Hidden Layer
    Activation Functions
4 Heuristic for Multilayer Perceptron
    Maximizing information content
    Activation Function
    Target Values
    Normalizing the inputs
    Virtues and limitations of Back-Propagation
Do you remember?

The Perceptron has the following problem: given that the perceptron is a linear classifier, it is clear that it will never be able to classify data that is not linearly separable.
Example: The XOR Problem

[Figure: the four binary input points plotted in the plane; the two XOR classes (Class 1 and Class 2) sit at opposite corners, so no single line can separate them.]
The Perceptron cannot solve it

Because: the perceptron is a linear classifier!!!

Thus: something needs to be done!!!

Maybe: add an extra layer!!!
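To make the fix concrete, here is a minimal sketch (not from the slides) of a two-layer threshold network that solves XOR with hand-picked, illustrative weights: one hidden unit behaves like OR, the other like NAND, and the output unit ANDs them together.

```python
import numpy as np

# A hand-wired two-layer threshold network for XOR (illustrative
# weights, not learned): hidden unit 1 behaves like OR, hidden unit 2
# like NAND, and the output unit ANDs them together.
def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    W_h = np.array([[1.0, 1.0],      # OR unit:   fires when x1 + x2 > 0.5
                    [-1.0, -1.0]])   # NAND unit: fires when x1 + x2 < 1.5
    b_h = np.array([-0.5, 1.5])
    h = np.heaviside(W_h @ x + b_h, 0.0)          # hidden layer outputs
    return int(np.heaviside(h.sum() - 1.5, 0.0))  # AND of both units
```

No single-layer perceptron can realize this mapping, yet two layers of the very same threshold units can.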
A little bit of history

It was first cited by Vapnik: Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. "Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions." AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines."

It was first used by: Arthur E. Bryson and Yu-Chi Ho, who described it as a multi-stage dynamic system optimization method in 1969.

However: it was not until 1974 and later, when applied in the context of neural networks and through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained recognition.
Then

Something Notable: it led to a "renaissance" in the field of artificial neural network research.

Nevertheless: during the 2000s it fell out of favour, but it has returned in the 2010s, now able to train much larger networks using the huge computing power of modern hardware such as GPUs.
Multi-Layer Perceptron (MLP)

Multi-Layer Architecture

[Figure: input layer feeding a hidden layer with a sigmoid activation function, feeding an output layer with an identity activation function; the output is compared against the target.]
Information Flow

We have the following information flow:

[Figure: function signals propagate forward through the network; error signals propagate backward.]
Explanation

Problems with hidden layers:
1 They increase the complexity of training.
2 It is necessary to think about "long and narrow" vs. "short and fat" network shapes.

Intuition for one hidden layer:
1 For every input region, that region can be delimited by hyperplanes on all sides using hidden units in the first hidden layer.
2 A hidden unit in the second layer then ANDs them together to bound the region.

Advantages: it has been proven that an MLP with one hidden layer can approximate any continuous nonlinear function of the input.
The Process

[Figure: hidden units in Layer 1 cut the input space with hyperplanes, producing region codes such as (0,0,1) and (1,0,0); Layer 2 combines them to bound a region.]
Remember!!! The Quadratic Learning Error Function

Cost function, our well-known error at pattern m:

J(m) = (1/2) ek²(m) (1)

Delta Rule or Widrow-Hoff Rule:

∆wkj(m) = −η ek(m) xj(m) (2)

Actually, this is known as Gradient Descent:

wkj(m + 1) = wkj(m) + ∆wkj(m) (3)
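As a small illustration (not code from the slides), the delta rule of Eqs. (1)-(3) can be sketched for a single linear unit; the variable names and learning rate are illustrative choices, and the error here is defined as prediction minus target so that the update ∆w = −η e x moves the output toward the target:

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.1):
    """One Widrow-Hoff update for a single linear unit.

    The error is e = w.x - t (prediction minus target), so the
    update Delta w = -eta * e * x pushes the output toward t."""
    e = w @ x - t
    return w - eta * e * x

# Repeated presentations of one pattern drive the error to zero.
x = np.array([1.0, 2.0])
t = 1.0
w = np.zeros(2)
for _ in range(100):
    w = delta_rule_step(w, x, t)
```

After these updates, w @ x is essentially equal to the target t.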
Back-propagation

Setup: let tk be the k-th target (or desired) output and zk be the k-th computed output, with k = 1, . . . , c, and let w represent all the weights of the network.

Training error for a single pattern (sample):

J(w) = (1/2) Σ_{k=1}^{c} (tk − zk)² = (1/2) ‖t − z‖² (4)
Gradient Descent

The back-propagation learning rule is based on gradient descent.

Reducing the error: the weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

∆w = −η ∂J/∂w (5)

where η is the learning rate, which indicates the relative size of the change in weights:

w(m + 1) = w(m) + ∆w(m) (6)

where m is the m-th pattern presented.
Multilayer Architecture: hidden-to-output weights

[Figure: the input-hidden-output network with the hidden-to-output weights wkj highlighted.]
Observation about the activation function

The hidden output is equal to:

yj = f(Σ_{i=1}^{d} wji xi)

The output is equal to:

zk = f(Σ_{j=1}^{n_H} wkj yj)
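These two formulas are just two matrix-vector products, each followed by the activation function. A minimal NumPy sketch, assuming a sigmoid f at both layers and illustrative sizes (d = 2 inputs, n_H = 3 hidden units, c = 1 output):

```python
import numpy as np

def sigmoid(a):
    """Assumed activation f (the slides use a sigmoid in the hidden layer)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, W_out):
    """Forward pass: y_j = f(sum_i w_ji x_i), z_k = f(sum_j w_kj y_j).

    W_hidden has shape (n_H, d); W_out has shape (c, n_H)."""
    y = sigmoid(W_hidden @ x)   # hidden activations y, length n_H
    z = sigmoid(W_out @ y)      # network outputs z, length c
    return y, z

# Illustrative random weights and one input pattern.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
y, z = forward(x, rng.normal(size=(3, 2)), rng.normal(size=(1, 3)))
```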
Hidden-to-Output Weights

Error on the hidden-to-output weights:

∂J/∂wkj = (∂J/∂netk) · (∂netk/∂wkj) = −δk · (∂netk/∂wkj) (7)

where netk is the net activation of output unit k:

netk = Σ_{j=1}^{n_H} wkj yj = wk^T · y (8)

and the sensitivity δk describes how the overall error changes with the unit's net activation:

δk = −∂J/∂netk = −(∂J/∂zk) · (∂zk/∂netk) = (tk − zk) f′(netk) (9)
Hidden-to-Output Weights

Why?

zk = f(netk) (10)

Thus:

∂zk/∂netk = f′(netk) (11)

Since netk = wk^T · y, therefore:

∂netk/∂wkj = yj (12)
Finally

The weight update (or learning rule) for the hidden-to-output weights is:

∆wkj = η δk yj = η (tk − zk) f′(netk) yj (13)
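For a sigmoid activation, f′(netk) = zk (1 − zk), so Eq. (13) can be sketched as follows (illustrative values, not code from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def update_hidden_to_output(W_out, y, t, eta=0.5):
    """Apply Eq. (13) to all hidden-to-output weights at once.

    With a sigmoid f, f'(net_k) = z_k (1 - z_k), so
    Delta w_kj = eta * (t_k - z_k) * z_k * (1 - z_k) * y_j."""
    z = sigmoid(W_out @ y)
    delta = (t - z) * z * (1.0 - z)        # sensitivities delta_k, Eq. (9)
    return W_out + eta * np.outer(delta, y)

# Illustrative values: 3 hidden units, 1 output unit.
W_out = np.array([[0.5, -0.3, 0.8]])
y = np.array([0.2, 0.9, 0.4])
t = np.array([1.0])
W_new = update_hidden_to_output(W_out, y, t)
```

A single such step reduces the squared error ½‖t − z‖² for this pattern.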
Multi-Layer Architecture: input-to-hidden weights

[Figure: the input-hidden-output network with the input-to-hidden weights wji highlighted.]
Input-to-Hidden Weights

Error on the input-to-hidden weights:

∂J/∂wji = (∂J/∂yj) · (∂yj/∂netj) · (∂netj/∂wji) (14)

Thus:

∂J/∂yj = ∂/∂yj [ (1/2) Σ_{k=1}^{c} (tk − zk)² ]
       = −Σ_{k=1}^{c} (tk − zk) (∂zk/∂yj)
       = −Σ_{k=1}^{c} (tk − zk) (∂zk/∂netk) · (∂netk/∂yj)
       = −Σ_{k=1}^{c} (tk − zk) (∂f(netk)/∂netk) · wkj
Input-to-Hidden Weights

Finally:

∂J/∂yj = −Σ_{k=1}^{c} (tk − zk) f′(netk) · wkj (15)

Remember:

δk = −∂J/∂netk = (tk − zk) f′(netk) (16)
What is ∂yj/∂netj?

First:

netj = Σ_{i=1}^{d} wji xi = wj^T · x (17)

Then, since yj = f(netj):

∂yj/∂netj = ∂f(netj)/∂netj = f′(netj)
Then, we can define δj

By defining the sensitivity for a hidden unit:

δj = f′(netj) Σ_{k=1}^{c} wkj δk (18)

Which means that: "The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights wkj, all multiplied by f′(netj)."
What about ∂netj/∂wji?

We have:

∂netj/∂wji = ∂(wj^T · x)/∂wji = ∂(Σ_{i=1}^{d} wji xi)/∂wji = xi
Finally

The learning rule for the input-to-hidden weights is:

∆wji = η xi δj = η [ Σ_{k=1}^{c} wkj δk ] f′(netj) xi (19)
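Combining Eqs. (13), (18) and (19) gives one full back-propagation step on a single pattern. The following sketch assumes a sigmoid f, so f′(net) = f(net)(1 − f(net)); shapes, values, and the learning rate are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single(x, t, W_hidden, W_out, eta=0.5):
    """One back-propagation step on a single pattern.

    W_hidden: (n_H, d) input-to-hidden weights; W_out: (c, n_H)
    hidden-to-output weights, updated via Eqs. (13) and (19)."""
    # Forward pass: y_j = f(net_j), z_k = f(net_k)
    y = sigmoid(W_hidden @ x)
    z = sigmoid(W_out @ y)
    # Output sensitivities: delta_k = (t_k - z_k) f'(net_k)     (Eq. 9)
    delta_k = (t - z) * z * (1.0 - z)
    # Hidden sensitivities: delta_j = f'(net_j) sum_k w_kj delta_k  (Eq. 18)
    delta_j = y * (1.0 - y) * (W_out.T @ delta_k)
    # Weight updates: Eqs. (13) and (19)
    return (W_hidden + eta * np.outer(delta_j, x),
            W_out + eta * np.outer(delta_k, y))

# Drive the output toward t = 1 on one fixed pattern.
x = np.array([1.0, 0.0])
t = np.array([1.0])
W_h = np.array([[0.4, -0.2], [-0.3, 0.6]])
W_o = np.array([[0.2, -0.5]])
for _ in range(1000):
    W_h, W_o = backprop_single(x, t, W_h, W_o)
z_final = sigmoid(W_o @ sigmoid(W_h @ x))
```

Repeated steps on this single pattern push the network output close to the target.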
Basically, the entire training process has the following steps

Initialization: assuming that no prior information is available, pick the synaptic weights and thresholds.

Forward Computation: compute the induced function signals of the network by proceeding forward through the network, layer by layer.

Backward Computation: compute the local gradients of the network.

Finally: adjust the weights!!!
Now, Calculating the Total Change

The Total Training Error is the sum over the errors of the N individual patterns:

J = Σ_{p=1}^{N} Jp = (1/2) Σ_{p=1}^{N} Σ_{k=1}^{c} (tk^p − zk^p)² = (1/2) Σ_{p=1}^{N} ‖t^p − z^p‖² (20)
About the Total Training Error

Remarks:
A weight update may reduce the error on the single pattern being presented, but can increase the error on the full training set.
However, given a large number of such individual updates, the total error of equation (20) decreases.
Now, we want the training to stop

Therefore
It is necessary to have a way to stop when the change in the weights is small enough.

A simple way to stop the training
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value Θ:

\Delta J(w) = \left| J(w(t+1)) - J(w(t)) \right|   (21)

There are other stopping criteria that lead to better performance than this one.
Other Stopping Criteria

Norm of the Gradient
The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold:

\left\| \nabla_w J(m) \right\| < \Theta   (22)

Rate of change in the average error per epoch
The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small:

\left| \frac{1}{N} \sum_{p=1}^{N} J_p \right| < \Theta   (23)
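Both criteria reduce to one-line checks once the gradient and the criterion values are in hand. A small sketch (theta and the sample numbers are illustrative assumptions, not values from the slides):

```python
import numpy as np

def should_stop(grad, J_prev, J_curr, theta=1e-4):
    """Stopping tests in the spirit of equations (21) and (22):
    small change in the criterion function, or small gradient norm.
    theta is an illustrative threshold."""
    small_change = abs(J_curr - J_prev) < theta       # equation (21)
    small_gradient = np.linalg.norm(grad) < theta     # equation (22)
    return bool(small_change or small_gradient)

print(should_stop(np.array([1e-6, -2e-6]), J_prev=0.100000, J_curr=0.099999))  # True
print(should_stop(np.array([1.0, 1.0]), J_prev=0.5, J_curr=0.1))               # False
```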
About the Stopping Criteria

Observations
1. Before training starts, the error on the training set is high. Through the learning process, the error becomes smaller.
2. The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network.
3. The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase.
4. A validation set is used in order to decide when to stop training. We do not want to over-fit the network and reduce the generalization power of the classifier, so "we stop training at a minimum of the error on the validation set."
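Observation 4 can be sketched as a small early-stopping loop. The two callbacks (`train_epoch`, `validation_error`) and the `patience` parameter are hypothetical names introduced here for illustration:

```python
def train_with_early_stopping(train_epoch, validation_error, max_epochs=100, patience=5):
    """Stop at a minimum of the validation error (observation 4).

    train_epoch and validation_error are caller-supplied callbacks; we track
    the epoch with the lowest validation error and stop once `patience`
    epochs pass without improvement.
    """
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_err

# Toy run: validation error falls, then rises again.
errs = iter([0.9, 0.5, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8])
print(train_with_early_stopping(lambda: None, lambda: next(errs)))  # (3, 0.3)
```

In practice one also keeps a copy of the weights at the best epoch and restores them after stopping.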
Some More Terminology

Epoch
As with other types of backpropagation, "learning" is a supervised process that occurs with each cycle or "epoch": a forward activation flow producing the outputs, followed by the backward propagation of error that adjusts the weights.

In our case
I am using the batch sum of all weight corrections to define that epoch.
Final Basic Batch Algorithm

Perceptron(X)
 1  Initialize random w, number of hidden units n_H, number of outputs z, stopping criterion Θ, learning rate η, epoch m = 0
 2  do
 3      m = m + 1
 4      for s = 1 to N
 5          x(m) = X(:, s)
 6          for k = 1 to c
 7              δ_k = (t_k − z_k) f′(w_k^T · y)
 8              for j = 1 to n_H
 9                  net_j = w_j^T · x;  y_j = f(net_j)
10                  w_kj(m) = w_kj(m) + η δ_k y_j(m)
11          for j = 1 to n_H
12              δ_j = f′(net_j) Σ_{k=1}^{c} w_kj δ_k
13              for i = 1 to d
14                  w_ji(m) = w_ji(m) + η δ_j x_i(m)
15  until ‖∇_w J(m)‖ < Θ
16  return w(m)
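As a concrete, hedged illustration, here is a NumPy sketch of the algorithm above with a logistic activation. Two deliberate deviations from the pseudo-code, flagged here: the forward pass (line 9) is computed before the deltas (line 7), since δ_k needs y, and a bias unit is appended to the input and hidden layers (the slides omit biases, but XOR is not solvable without them). The gradient-norm stopping test uses the per-epoch deltas as an approximate proxy:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(a):                         # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def f_prime(a):                   # derivative of the logistic w.r.t. its input
    s = f(a)
    return s * (1.0 - s)

def train(X, T, n_H, eta=0.5, theta=1e-3, max_epochs=10000):
    """Sketch of Perceptron(X): per-sample updates inside each epoch m."""
    d, N = X.shape
    c = T.shape[0]
    W_ih = rng.normal(scale=0.5, size=(n_H, d + 1))   # w_j as rows, +1 bias weight
    W_ho = rng.normal(scale=0.5, size=(c, n_H + 1))   # w_k as rows, +1 bias weight
    for m in range(max_epochs):
        grad_sq = 0.0
        for s in range(N):                            # line 4: for each sample
            x = np.append(X[:, s], 1.0)               # input with bias unit
            t = T[:, s]
            net_j = W_ih @ x                          # line 9: net_j = w_j^T x
            y = np.append(f(net_j), 1.0)              # y_j = f(net_j), plus bias
            net_k = W_ho @ y
            z = f(net_k)                              # network output z_k
            delta_k = (t - z) * f_prime(net_k)        # line 7
            delta_j = f_prime(net_j) * (W_ho[:, :n_H].T @ delta_k)  # line 12
            W_ho += eta * np.outer(delta_k, y)        # line 10
            W_ih += eta * np.outer(delta_j, x)        # line 14
            grad_sq += np.sum(delta_k ** 2) + np.sum(delta_j ** 2)
        if np.sqrt(grad_sq) < theta:                  # line 15 (approximate test)
            break
    return W_ih, W_ho

# XOR, the motivating example of the lecture.
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
T = np.array([[0., 1., 1., 0.]])
W_ih, W_ho = train(X, T, n_H=2)
Z = f(W_ho @ np.vstack([f(W_ih @ np.vstack([X, np.ones(4)])), np.ones(4)]))
print(np.round(Z, 2))
```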
Example of Architecture to be used
Given the following architecture and assuming N samples: [figure: the example network architecture]
Generating the output z_k

Given the input

X = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \end{pmatrix}   (24)

Where
x_i is a vector of features:

x_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix}   (25)
Therefore
We must have the following matrix of weights from the input layer to the hidden layer:

W^{IH} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_H 1} & w_{n_H 2} & \cdots & w_{n_H d} \end{pmatrix} = \begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_{n_H}^T \end{pmatrix}   (26)

given that w_j = \begin{pmatrix} w_{j1} & w_{j2} & \cdots & w_{jd} \end{pmatrix}^T.

Thus
We can create the net_j for all the inputs by a single matrix product:

net = W^{IH} X = \begin{pmatrix} w_1^T x_1 & w_1^T x_2 & \cdots & w_1^T x_N \\ w_2^T x_1 & w_2^T x_2 & \cdots & w_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_H}^T x_1 & w_{n_H}^T x_2 & \cdots & w_{n_H}^T x_N \end{pmatrix}   (27)
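In NumPy, equation (27) is exactly one matrix product. A small sketch with illustrative sizes (d = 3 features, n_H = 4 hidden units, N = 5 samples are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_H, N = 3, 4, 5                 # features, hidden units, samples (illustrative)

X = rng.normal(size=(d, N))         # one sample per column, equation (24)
W_IH = rng.normal(size=(n_H, d))    # the w_j^T as rows, equation (26)

net = W_IH @ X                      # equation (27): net[j, i] = w_j^T x_i
print(net.shape)                    # (4, 5)
```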
Now, we need to generate the y_k

We apply the activation function element by element to the matrix net of equation (27):

Y = \begin{pmatrix} f(w_1^T x_1) & f(w_1^T x_2) & \cdots & f(w_1^T x_N) \\ f(w_2^T x_1) & f(w_2^T x_2) & \cdots & f(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(w_{n_H}^T x_1) & f(w_{n_H}^T x_2) & \cdots & f(w_{n_H}^T x_N) \end{pmatrix}   (28)

IMPORTANT about overflows!!!
Be careful about the numeric stability of the activation function. In the case of Python, we can use the ones provided by scipy.special.
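A sketch of equation (28): applying the activation element-wise to the whole net matrix. As the slide suggests, scipy.special provides a numerically stable logistic, `expit`, that saturates cleanly instead of overflowing (the entries of `net` below are illustrative extremes):

```python
import numpy as np
from scipy.special import expit        # stable logistic 1/(1 + e^{-x})

net = np.array([[-800.0,   0.0],
                [   2.0, 800.0]])      # extreme values would overflow a naive exp

Y = expit(net)                         # equation (28), element by element
print(Y)
```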
However, We can create a Sigmoid function

It is possible to use the following pseudo-code
Sigmoid(x)
1  try                                   // a try-and-catch guards the overflow
2      return 1.0 / (1.0 + exp(−αx))     // 1.0 refers to the floating point
3  catch OVERFLOW                        // (rationals trying to represent reals)
4      if x < 0
5          return 0
6      else
7          return 1
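The same guard, written out in Python (α = 1 is an illustrative choice for the slope parameter):

```python
import math

ALPHA = 1.0   # slope parameter α (illustrative value)

def sigmoid(x):
    """Logistic function guarded against overflow, as in the pseudo-code:
    if exp(-αx) overflows, saturate to 0 (x < 0) or 1 (x > 0)."""
    try:
        return 1.0 / (1.0 + math.exp(-ALPHA * x))
    except OverflowError:
        return 0.0 if x < 0 else 1.0

print(sigmoid(0.0))      # 0.5
print(sigmoid(-1000.0))  # 0.0: exp(1000) overflows, so we saturate
print(sigmoid(1000.0))   # 1.0: exp(-1000) underflows harmlessly to 0
```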
For this, we get net_k

For this, we obtain the W^{HO}

W^{HO} = \begin{pmatrix} w_{11}^o & w_{12}^o & \cdots & w_{1 n_H}^o \end{pmatrix} = w_o^T   (29)

Thus

net_k = \begin{pmatrix} w_{11}^o & w_{12}^o & \cdots & w_{1 n_H}^o \end{pmatrix} \begin{pmatrix} f(w_1^T x_1) & f(w_1^T x_2) & \cdots & f(w_1^T x_N) \\ f(w_2^T x_1) & f(w_2^T x_2) & \cdots & f(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(w_{n_H}^T x_1) & f(w_{n_H}^T x_2) & \cdots & f(w_{n_H}^T x_N) \end{pmatrix}   (30)

where the columns of the second matrix are denoted y_{k1}, y_{k2}, …, y_{kN}.

In matrix notation

net_k = \begin{pmatrix} w_o^T y_{k1} & w_o^T y_{k2} & \cdots & w_o^T y_{kN} \end{pmatrix}   (31)
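Equations (30)-(31) are again a single matrix product: the output-weight row times the matrix of hidden outputs. A sketch with illustrative sizes (n_H = 4, N = 5 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_H, N = 4, 5                       # illustrative sizes

Y = rng.uniform(size=(n_H, N))      # hidden outputs, one column y_ki per sample
w_o = rng.normal(size=(1, n_H))     # W_HO of equation (29): a single output row

net_k = w_o @ Y                     # equations (30)-(31): net_k[0, i] = w_o^T y_ki
print(net_k.shape)                  # (1, 5)
```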
52 / 94
Now, we have

Thus, we have $z_k$ (in our case $k = 1$, but it could be a range of values):

$$z_k = \begin{pmatrix} f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) & f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k2}\right) & \cdots & f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) \end{pmatrix} \tag{32}$$

Thus, we generate a vector of differences:

$$\mathbf{d} = \mathbf{t} - z_k = \begin{pmatrix} t_{1} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) & t_{2} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k2}\right) & \cdots & t_{N} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) \end{pmatrix} \tag{33}$$

where $\mathbf{t} = \begin{pmatrix} t_{1} & t_{2} & \cdots & t_{N} \end{pmatrix}$ is a row vector of desired outputs, one for each sample.
Now, we multiply element-wise

We have the following vector of derivatives of $net_k$:

$$D_f = \begin{pmatrix} \eta f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) & \eta f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k2}\right) & \cdots & \eta f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) \end{pmatrix} \tag{34}$$

where $\eta$ is the step rate.

Finally, by element-wise multiplication (Hadamard product):

$$\mathbf{d} = \begin{pmatrix} \eta\left[t_{1} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) & \eta\left[t_{2} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k2}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k2}\right) & \cdots & \eta\left[t_{N} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) \end{pmatrix}$$
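Equations (32)-(34) and the Hadamard product above can be sketched in NumPy for a single output unit, assuming the logistic activation so that $f'(v) = f(v)(1 - f(v))$; the toy values are illustrative:

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    # derivative of the logistic: f'(v) = f(v) * (1 - f(v))
    s = f(v)
    return s * (1.0 - s)

eta = 0.1
net_k = np.array([0.5, -1.0, 2.0])      # w_o^T y_km for each sample (toy values)
t    = np.array([1.0, 0.0, 1.0])        # desired outputs

z_k = f(net_k)                          # Eq. (32)
d   = eta * (t - z_k) * f_prime(net_k)  # Eqs. (33)-(34) combined element-wise
```

The `*` operator on NumPy arrays is exactly the Hadamard (element-wise) product used in the slides.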
Tile d

Tile downward:

$$d_{tile} = \left.\begin{pmatrix} \mathbf{d} \\ \mathbf{d} \\ \vdots \\ \mathbf{d} \end{pmatrix}\right\} n_H \text{ rows} \tag{35}$$

Finally, we multiply element-wise against $y_1$ (Hadamard product):

$$\Delta w_{1j}^{temp} = y_1 \circ d_{tile} \tag{36}$$
We obtain the total $\Delta w_{1j}$

We sum along the rows of $\Delta w_{1j}^{temp}$:

$$\Delta w_{1j} = \begin{pmatrix}
\eta\left[t_{1} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) y_{11} + \cdots + \eta\left[t_{N} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) y_{1N} \\
\vdots \\
\eta\left[t_{1} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{k1}\right) y_{n_H 1} + \cdots + \eta\left[t_{N} - f\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right)\right] f'\left(\mathbf{w}_{o}^{T}\mathbf{y}_{kN}\right) y_{n_H N}
\end{pmatrix} \tag{37}$$

where $y_{hm} = f\left(\mathbf{w}_{h}^{T}\mathbf{x}_{m}\right)$ with $h = 1, 2, \ldots, n_H$ and $m = 1, 2, \ldots, N$.
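The tile, multiply and sum pipeline of Eqs. (35)-(37) can be sketched in NumPy (here `Y` plays the role of the slides' $y_1$, and the values are toy data); note that the whole pipeline collapses to a single matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(1)
n_H, N = 3, 5
Y = rng.normal(size=(n_H, N))   # hidden outputs y_hm = f(w_h^T x_m) (toy values)
d = rng.normal(size=N)          # row vector of Eq. (34)'s element-wise products

d_tile  = np.tile(d, (n_H, 1))  # Eq. (35): stack d downward n_H times
dw_temp = Y * d_tile            # Eq. (36): Hadamard product
dw_1j   = dw_temp.sum(axis=1)   # Eq. (37): sum along the rows

# the tile/Hadamard/sum pipeline is just a matrix-vector product
assert np.allclose(dw_1j, Y @ d)
```

This equivalence is why the batch update is usually written directly as $\Delta w_{1j} = Y\,\mathbf{d}^{T}$ in matrix form.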
Finally, we update the first weights

We have then:

$$W_{HO}(t+1) = W_{HO}(t) + \Delta w_{1j}^{T}(t) \tag{38}$$
First

We multiply element-wise the $W_{HO}$ and $\Delta w_{1j}$:

$$T = \Delta w_{1j}^{T} \circ W_{HO}^{T} \tag{39}$$

Now, we obtain the element-wise derivative of $net_j$:

$$D_{net_j} = \begin{pmatrix}
f'\left(\mathbf{w}_{1}^{T}\mathbf{x}_{1}\right) & f'\left(\mathbf{w}_{1}^{T}\mathbf{x}_{2}\right) & \cdots & f'\left(\mathbf{w}_{1}^{T}\mathbf{x}_{N}\right) \\
f'\left(\mathbf{w}_{2}^{T}\mathbf{x}_{1}\right) & f'\left(\mathbf{w}_{2}^{T}\mathbf{x}_{2}\right) & \cdots & f'\left(\mathbf{w}_{2}^{T}\mathbf{x}_{N}\right) \\
\vdots & \vdots & \ddots & \vdots \\
f'\left(\mathbf{w}_{n_H}^{T}\mathbf{x}_{1}\right) & f'\left(\mathbf{w}_{n_H}^{T}\mathbf{x}_{2}\right) & \cdots & f'\left(\mathbf{w}_{n_H}^{T}\mathbf{x}_{N}\right)
\end{pmatrix} \tag{40}$$
Thus

We tile $T$ to the right:

$$T_{tile} = \underbrace{\begin{pmatrix} T & T & \cdots & T \end{pmatrix}}_{N \text{ columns}} \tag{41}$$

Now, we multiply element-wise together with $\eta$:

$$P_t = \eta \left( D_{net_j} \circ T_{tile} \right) \tag{42}$$

where $\eta$ is a constant multiplying the result of the Hadamard product (the result is an $n_H \times N$ matrix).
Finally

We use the transpose of $X$, which is an $N \times d$ matrix:

$$X^{T} = \begin{pmatrix} \mathbf{x}_{1}^{T} \\ \mathbf{x}_{2}^{T} \\ \vdots \\ \mathbf{x}_{N}^{T} \end{pmatrix} \tag{43}$$

Finally, we get an $n_H \times d$ matrix:

$$\Delta w_{ij} = P_t X^{T} \tag{44}$$

Thus, given $W_{IH}$:

$$W_{IH}(t+1) = W_{IH}(t) + \Delta w_{ij}(t) \tag{45}$$
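Putting Eqs. (39)-(45) together, a minimal NumPy sketch of the input-to-hidden update might look as follows. It follows the slides' pipeline literally, including the fact that $\eta$ appears in both Eq. (34) and Eq. (42); a standard implementation would apply the learning rate only once. All names and toy sizes are illustrative assumptions:

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(2)
n_H, d, N, eta = 3, 4, 5, 0.1
W_IH = rng.normal(size=(n_H, d))       # input-to-hidden weights
w_o  = rng.normal(size=n_H)            # hidden-to-output weights (W_HO as a row)
X    = rng.normal(size=(d, N))         # samples as columns
t    = rng.uniform(size=N)             # desired outputs

net_j = W_IH @ X                       # hidden pre-activations
Y     = f(net_j)                       # hidden outputs
z_k   = f(w_o @ Y)                     # network outputs

d_row = eta * (t - z_k) * z_k * (1 - z_k)   # Eq. (34) with the logistic f'
dw_1j = Y @ d_row                           # hidden-to-output update, Eq. (37)

T      = dw_1j * w_o                        # Eq. (39): element-wise with W_HO^T
D_netj = Y * (1 - Y)                        # Eq. (40): f'(w_h^T x_m) for the logistic
T_tile = np.tile(T[:, None], (1, N))        # Eq. (41): tile T to the right
Pt     = eta * (D_netj * T_tile)            # Eq. (42)
dw_ij  = Pt @ X.T                           # Eq. (44): an n_H x d matrix

W_IH = W_IH + dw_ij                         # Eq. (45)
```

Broadcasting makes the explicit tiling optional: `eta * D_netj * T[:, None]` produces the same `Pt`.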
We have different activation functions

The two most important:

1. Sigmoid function.
2. Hyperbolic tangent function.
Logistic Function

This non-linear function has the following definition for a neuron $j$:

$$f_j(v_j(n)) = \frac{1}{1 + \exp\{-a v_j(n)\}}, \qquad a > 0 \text{ and } -\infty < v_j(n) < \infty \tag{46}$$
The differential of the sigmoid function

Now if we differentiate, we have:

$$f_j'(v_j(n)) = a\left[\frac{1}{1 + \exp\{-a v_j(n)\}}\right]\left[1 - \frac{1}{1 + \exp\{-a v_j(n)\}}\right] = \frac{a \exp\{-a v_j(n)\}}{\left(1 + \exp\{-a v_j(n)\}\right)^2}$$
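The closed form $f'(v) = a\,f(v)(1 - f(v))$ can be checked numerically against a central finite difference (the value of `a` here is an arbitrary choice for the check):

```python
import math

a = 1.5  # slope parameter (arbitrary choice for the check)

def f(v):
    return 1.0 / (1.0 + math.exp(-a * v))

def f_prime(v):
    # closed form from the text: a * f(v) * (1 - f(v))
    return a * f(v) * (1.0 - f(v))

# compare against a central finite difference
h = 1e-6
for v in (-2.0, -0.5, 0.0, 1.0, 3.0):
    numeric = (f(v + h) - f(v - h)) / (2 * h)
    assert abs(numeric - f_prime(v)) < 1e-6
```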
The outputs finish as

For the output neurons:

$$\delta_k = (t_k - z_k)\, f'(net_k) = a\,(t_k - f_k(v_k(n)))\, f_k(v_k(n))\, (1 - f_k(v_k(n)))$$

For the hidden neurons:

$$\delta_j = a\, f_j(v_j(n))\, (1 - f_j(v_j(n))) \sum_{k=1}^{c} w_{kj} \delta_k$$
Hyperbolic tangent function

Another commonly used form of sigmoidal non-linearity is the hyperbolic tangent function:

$$f_j(v_j(n)) = a \tanh(b v_j(n)) \tag{47}$$
The differential of the hyperbolic tangent

We have:

$$f_j'(v_j(n)) = ab\, \mathrm{sech}^2(b v_j(n)) = ab\left(1 - \tanh^2(b v_j(n))\right)$$

BTW: I leave it to you to figure out the outputs.
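The same finite-difference check works for the hyperbolic tangent; the values $a = 1.7159$ and $b = 2/3$ are the ones commonly recommended in the literature (e.g., by LeCun), used here only as a concrete choice:

```python
import math

a, b = 1.7159, 2.0 / 3.0  # commonly cited slope parameters

def f(v):
    return a * math.tanh(b * v)

def f_prime(v):
    # closed form from the text: a * b * (1 - tanh^2(b v))
    return a * b * (1.0 - math.tanh(b * v) ** 2)

h = 1e-6
for v in (-2.0, 0.0, 0.7, 2.5):
    numeric = (f(v + h) - f(v - h)) / (2 * h)
    assert abs(numeric - f_prime(v)) < 1e-5
```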
Maximizing information content

Two ways of achieving this (LeCun, 1993):

The use of an example that results in the largest training error.
The use of an example that is radically different from all those previously used.

For this:
Randomize the samples presented to the multilayer perceptron when not doing batch training.

Or use an emphasizing scheme:
By using the error, identify the difficult vs. easy patterns, and use them to train the neural network.
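Randomizing the presentation order is easy to sketch; the generator below (names are illustrative) reshuffles the training samples at the start of every epoch:

```python
import random

def epochs(samples, n_epochs, seed=0):
    """Yield the training samples in a freshly randomized order each epoch,
    as suggested for pattern-by-pattern (non-batch) training."""
    rng = random.Random(seed)
    order = list(samples)
    for _ in range(n_epochs):
        rng.shuffle(order)          # new random presentation order per epoch
        for sample in order:
            yield sample
```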
However

Be careful about the emphasizing scheme:

The distribution of examples within an epoch presented to the network is distorted.
The presence of an outlier or a mislabeled example can have a catastrophic consequence on the performance of the algorithm.

Definition of Outlier:
An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe, 1999).
Activation Function

We say that:
An activation function $f(v)$ is antisymmetric if $f(-v) = -f(v)$.

It seems that:
The multilayer perceptron learns faster using an antisymmetric function.

Example: the hyperbolic tangent.
Target Values

Important:
It is important that the target values be chosen within the range of the sigmoid activation function.

Specifically:
The desired response for a neuron in the output layer of the multilayer perceptron should be offset by some amount $\epsilon$.
For example

Given a limiting value, we have then:

If we have a limiting value $+a$, we set $t = a - \epsilon$.
If we have a limiting value $-a$, we set $t = -a + \epsilon$.
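As a concrete sketch, with the commonly recommended hyperbolic tangent amplitude $a = 1.7159$ and an offset $\epsilon = 0.7159$, the targets land exactly at $\pm 1$ (the numbers are an illustrative choice, not from the slides):

```python
a = 1.7159          # tanh amplitude (a commonly cited choice)
epsilon = 0.7159    # offset away from the saturation limits +a and -a

t_positive = a - epsilon     # target for the "+" class -> 1.0
t_negative = -a + epsilon    # target for the "-" class -> -1.0
```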
Normalizing the inputs

Something important (LeCun, 1993):
Each input variable should be preprocessed so that:

The mean value, averaged over the entire training set, is close to zero.
Or it is small compared to its standard deviation.
The normalization must include two other measures

Uncorrelated:
We can use principal component analysis to decorrelate the input variables.
In addition

Quite interesting:
The decorrelated input variables should be scaled so that their covariances are approximately equal.

Why:
This ensures that the different synaptic weights in the network learn at approximately the same speed.
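The three steps, mean removal, decorrelation via PCA, and covariance equalization, can be sketched with NumPy (toy data; the whitening recipe is a standard one, not taken verbatim from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
# 200 samples of 4 correlated input variables (rows = samples)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# 1. remove the mean of each input variable
Xc = X - X.mean(axis=0)

# 2. decorrelate with PCA (eigenvectors of the covariance matrix)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
Xd = Xc @ eigvecs                   # decorrelated input variables

# 3. scale so the covariances are approximately equal (unit variance)
Xw = Xd / np.sqrt(eigvals)
```

After these steps the transformed inputs have zero mean and an (approximately) identity covariance matrix.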
There are other heuristics

Such as:

Initialization
Learning from hints
Learning rates
etc.
In addition

In Section 4.15, Simon Haykin describes the following techniques:

Network growing: you start with a small network and add neurons and layers to accomplish the learning task.
Network pruning: start with a large network, then prune weights that are not necessary, in an orderly fashion.
Virtues and limitations of Back-Propagation Layer

Something Notable:
The back-propagation algorithm has emerged as the most popular algorithm for the training of multilayer perceptrons.

It has two distinct properties:

It is simple to compute locally.
It performs stochastic gradient descent in weight space when doing pattern-by-pattern training.
Connectionism

Back-propagation
It is an example of a connectionist paradigm that relies on local computations to discover the processing capabilities of neural networks.

This form of restriction
It is known as the locality constraint.
86 / 94
![Page 192: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/192.jpg)
Why this is advocated in Artificial Neural Networks

First
Artificial neural networks that perform local computations are often held up as metaphors for biological neural networks.

Second
The use of local computations permits a graceful degradation in performance due to hardware errors, and therefore provides the basis for a fault-tolerant network design.

Third
Local computations favor the use of parallel architectures as an efficient method for the implementation of artificial neural networks.
87 / 94
![Page 195: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/195.jpg)
However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

First
The reciprocal synaptic connections between the neurons of a multilayer perceptron may assume weights that are excitatory or inhibitory. In the real nervous system, neurons usually appear to be one or the other.

Second
In a multilayer perceptron, hormonal and other types of global communication are ignored.
88 / 94
![Page 198: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/198.jpg)
However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Third
In back-propagation learning, a synaptic weight is modified by a presynaptic activity and an error (learning) signal independent of postsynaptic activity. There is evidence from neurobiology to suggest otherwise.

Fourth
In a neurobiological sense, the implementation of back-propagation learning requires the rapid transmission of information backward along an axon. It appears highly unlikely that such an operation actually takes place in the brain.
89 / 94
![Page 202: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/202.jpg)
However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Fifth
Back-propagation learning implies the existence of a "teacher," which in the context of the brain would presumably be another set of neurons with novel properties. The existence of such neurons is biologically implausible.
90 / 94
![Page 204: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/204.jpg)
Computational Efficiency

Something Notable
The computational complexity of an algorithm is usually measured in terms of the number of multiplications, additions, and storage involved in its implementation.

This is the electrical engineering approach!!!

Taking into account the total number of synapses W (including biases)
We have ∆wkj = η δk yj = η (tk − zk) f ′(netk) yj (Backward Pass)

We have that for this step
1. We need to calculate netk, which is linear in the number of weights.
2. We need to calculate yj = f (netj), which is linear in the number of weights.
91 / 94
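The hidden-to-output update above can be sketched directly in NumPy. The dimensions, names, and the choice of a sigmoid output activation below are my own illustration of the slide's formula, not code from the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy dimensions: nH hidden units, c output units
rng = np.random.default_rng(0)
nH, c = 3, 2
w_kj = rng.normal(size=(c, nH))   # hidden-to-output weights w_kj
y = rng.normal(size=nH)           # hidden activations y_j
t = np.array([1.0, 0.0])          # target vector t_k
eta = 0.1                         # learning rate

net_k = w_kj @ y                  # net_k: one multiply-add per weight
z = sigmoid(net_k)                # outputs z_k
delta_k = (t - z) * z * (1 - z)   # (t_k - z_k) f'(net_k), f = sigmoid
dw_kj = eta * np.outer(delta_k, y)  # Delta w_kj = eta * delta_k * y_j
w_kj += dw_kj
```

Note that every step (the matrix-vector product, the elementwise delta, the outer product) touches each weight a constant number of times, which is the linearity claim made above.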
![Page 207: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/207.jpg)
Computational Efficiency

Now the Forward Pass

∆wji = η xi δj = η f ′(netj) [ ∑_{k=1}^{c} wkj δk ] xi

We have that for this step
The term [ ∑_{k=1}^{c} wkj δk ] takes, thanks to the previously calculated δk's, time linear in the number of weights.

Clearly, all of this requires memory
In addition, we need the derivatives of the activation functions, which we assume take constant time to calculate.
92 / 94
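Continuing the same kind of sketch for the input-to-hidden update: δj reuses the output deltas δk already computed, so the extra cost is again one pass over the weights. Dimensions and names are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, nH, c = 4, 3, 2
w_ji = rng.normal(size=(nH, d))   # input-to-hidden weights w_ji
w_kj = rng.normal(size=(c, nH))   # hidden-to-output weights w_kj
x = rng.normal(size=d)            # input vector x_i
delta_k = rng.normal(size=c)      # assumed already computed output deltas
eta = 0.1

net_j = w_ji @ x
y = sigmoid(net_j)
# back-propagated sum: sum_k w_kj delta_k for each j,
# a single pass over the hidden-to-output weights
back = w_kj.T @ delta_k
delta_j = y * (1 - y) * back          # f'(net_j) * sum, f = sigmoid
dw_ji = eta * np.outer(delta_j, x)    # Delta w_ji = eta * delta_j * x_i
w_ji += dw_ji
```

Because `w_kj.T @ delta_k` visits each hidden-to-output weight once and `np.outer(delta_j, x)` visits each input-to-hidden weight once, this step is linear in the number of weights, as the slide states.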
![Page 210: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/210.jpg)
We have that

The complexity of back-propagation for the multilayer perceptron is
O(W)
93 / 94
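To make O(W) concrete: for the standard fully connected single-hidden-layer network with d inputs, nH hidden units, c outputs, and one bias per hidden and output unit, W = (d + 1)·nH + (nH + 1)·c. This formula is the usual parameter count, not stated explicitly in the slides; a tiny check:

```python
def total_weights(d, nH, c):
    """Total synapses W for a single-hidden-layer MLP,
    counting one bias per hidden and output unit."""
    return (d + 1) * nH + (nH + 1) * c

# e.g. a small network: 2 inputs, 3 hidden units, 1 output
print(total_weights(2, 3, 1))  # 13
```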
![Page 211: 15 Machine Learning Multilayer Perceptron](https://reader033.fdocuments.us/reader033/viewer/2022051404/586e85131a28aba0038b625f/html5/thumbnails/211.jpg)
Exercises

From Neural Networks by Haykin:
4.2, 4.3, 4.6, 4.8, 4.16, 4.17, 3.7
94 / 94