Transcript of hagan.okstate.edu/Generalization_pres.pdf
Generalization
Generalization
A cat that once sat on a hot stove will never again sit on a hot stove
or on a cold one either.
Mark Twain
Generalization
• The network input-output mapping is accurate for the training data and for test data never seen before.
• The network interpolates well.
Cause of Overfitting
Poor generalization is caused by using a network that is too complex (too many neurons/parameters). To have the best performance we need to find the least complex network that can represent the data (Ockham's Razor).
Ockham’s Razor
Find the simplest model that explains the data.
Problem Statement
Training Set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$

Underlying Function: $t_q = g(p_q) + \varepsilon_q$

Performance Function: $F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)$
Poor Generalization
(Figure: a fit that overfits within the interpolation region and fails in the extrapolation region.)
Good Generalization
(Figure: a fit that interpolates and extrapolates well.)
Extrapolation in 2-D
Measuring Generalization
Test Set
• Part of the available data is set aside during the training process.
• After training, the network error on the test set is used as a measure of generalization ability.
• The test set must never be used in any way to train the network, or even to select one network from a group of candidate networks.
• The test set must be representative of all situations for which the network will be used.
Methods for Improving Generalization
• Pruning (removing neurons) until the performance is degraded.
• Growing (adding neurons) until the performance is adequate.
• Validation methods.
• Regularization.
Early Stopping
• Break up data into training, validation, and test sets.
• Use only the training set to compute gradients and determine weight updates.
• Compute the performance on the validation set at each iteration of training.
• Stop training when the performance on the validation set goes up for a specified number of iterations.
• Use the weights which achieved the lowest error on the validation set.
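The procedure above can be sketched as a generic loop. This is a minimal illustration, not the implementation from the slides; `step` and `val_error` are hypothetical stand-ins for one training iteration and the validation-set error, and `patience` is the "specified number of iterations".

```python
import copy

def train_with_early_stopping(step, val_error, weights, patience=5, max_iters=1000):
    """Generic early-stopping loop.

    step(w)      -> new weights after one training iteration (training set only)
    val_error(w) -> performance on the validation set
    Stops when the validation error has not improved for `patience`
    iterations, and returns the weights with the lowest validation error.
    """
    best_w, best_err, bad = copy.deepcopy(weights), val_error(weights), 0
    for _ in range(max_iters):
        weights = step(weights)
        err = val_error(weights)
        if err < best_err:
            best_w, best_err, bad = copy.deepcopy(weights), err, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_w, best_err

# Toy check: "training" drives w toward 0, but the validation error is
# minimized at w = 0.3, so training past that point starts to overfit.
w0 = 1.0
step = lambda w: w - 0.05 * w    # gradient step on the training objective
val = lambda w: (w - 0.3) ** 2   # validation error, minimized at w = 0.3
w_best, e_best = train_with_early_stopping(step, val, w0, patience=3)
```

Early stopping returns the weights from just before the validation error began to rise, not the final weights.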
Early Stopping Example
(Figure: training and validation error F(x) versus iteration; the validation error reaches its minimum before training is complete.)
Regularization
Standard Performance Measure: $F = E_D$

Performance Measure with Regularization (complexity penalty added):
$$F = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2$$

(Smaller weights mean a smoother function.)
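As a numerical check of this performance index, the code below evaluates $F = \beta E_D + \alpha E_W$ directly; the targets, outputs, and weight vector are made-up numbers.

```python
import numpy as np

def regularized_performance(targets, outputs, weights, alpha, beta):
    """F = beta * E_D + alpha * E_W, where E_D is the sum of squared
    errors over the training set and E_W is the sum of squared weights."""
    errors = targets - outputs
    E_D = float(np.sum(errors ** 2))   # sum_q (t_q - a_q)^T (t_q - a_q)
    E_W = float(np.sum(weights ** 2))  # sum_i x_i^2
    return beta * E_D + alpha * E_W

t = np.array([1.0, -1.0, 0.5])   # illustrative targets
a = np.array([0.8, -0.9, 0.7])   # illustrative network outputs
x = np.array([0.5, -0.25])       # illustrative weight vector
F = regularized_performance(t, a, x, alpha=0.1, beta=1.0)
```

Increasing α relative to β penalizes large weights more heavily, pushing the minimum of F toward smoother functions.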
Effect of Weight Changes
(Figure: network response as the weight $w_{1,1}^2$ takes the values 0, 1, 2 and 3; larger weights produce larger changes in the function.)
Effect of Regularization

(Figure: network fits for $\alpha/\beta$ = 0, 0.01, 0.25 and 1; increasing the ratio produces a smoother fit.)
Bayes’ Rule
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

P(A) – Prior Probability. What we know about A before B is known.

P(A|B) – Posterior Probability. What we know about A after we know the outcome of B.

P(B|A) – Conditional Probability (Likelihood Function). Describes our knowledge of the system.

P(B) – Marginal Probability. A normalization factor.
Example Problem
• 1% of the population have a certain disease.
• A test for the disease is 80% accurate in detecting the disease in people who have it.
• 10% of the time the test yields a false positive.
• If you have a positive test, what is your probability of having the disease?
Bayesian Analysis
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

A – Event that you have the disease.
B – Event that you have a positive test.

P(A) = 0.01
P(B|A) = 0.8
P(B) = P(B|A)P(A) + P(B|¬A)P(¬A) = 0.8 · 0.01 + 0.1 · 0.99 = 0.107

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.8 \times 0.01}{0.107} = 0.0748$$
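These numbers can be verified in a few lines:

```python
# Numerical check of the disease-test example above.
p_A = 0.01            # prior: P(disease)
p_B_given_A = 0.8     # P(positive | disease)
p_B_given_notA = 0.1  # false-positive rate: P(positive | no disease)

# Marginal probability of a positive test
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: posterior probability of disease given a positive test
p_A_given_B = p_B_given_A * p_A / p_B
```

Even with a positive test, the posterior probability of disease is only about 7.5%, because the prior is so small.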
Signal Plus Noise Example
$$t = x + \varepsilon$$

$$f(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right) \qquad f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{x^2}{2\sigma_x^2}\right)$$

$$f(t|x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(t-x)^2}{2\sigma^2}\right)$$

$$f(x|t) = \frac{f(t|x)\,f(x)}{f(t)}$$

(Figure: the likelihood $f(t|x)$, the prior $f(x)$, and the posterior $f(x|t)$.)
NN Bayesian Framework
(MacKay 92)
$$P(x|D,\alpha,\beta,M) = \frac{P(D|x,\beta,M)\,P(x|\alpha,M)}{P(D|\alpha,\beta,M)}$$

Posterior = Likelihood × Prior / Normalization (Evidence)

D – data set
M – neural network model
x – vector of network weights

$x^{MP}$ – most probable weights (maximum of the posterior); $x^{ML}$ – maximum likelihood weights.
Gaussian Assumptions
Gaussian Noise:
$$P(D|x,\beta,M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \qquad Z_D(\beta) = (2\pi\sigma_\varepsilon^2)^{N/2} = (\pi/\beta)^{N/2}$$

Gaussian Prior:
$$P(x|\alpha,M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W), \qquad Z_W(\alpha) = (2\pi\sigma_w^2)^{n/2} = (\pi/\alpha)^{n/2}$$

Posterior:
$$P(x|D,\alpha,\beta,M) = \frac{\frac{1}{Z_W(\alpha)}\,\frac{1}{Z_D(\beta)}\,\exp(-(\beta E_D + \alpha E_W))}{\text{Normalization Factor}} = \frac{1}{Z_F(\alpha,\beta)} \exp(-F(x))$$

where $F = \beta E_D + \alpha E_W$. Minimize F to maximize P.
Optimizing Regularization Parameters
Second Level of Inference:
$$P(\alpha,\beta|D,M) = \frac{P(D|\alpha,\beta,M)\,P(\alpha,\beta|M)}{P(D|M)}$$

Evidence from the first level:
$$P(D|\alpha,\beta,M) = \frac{P(D|x,\beta,M)\,P(x|\alpha,M)}{P(x|D,\alpha,\beta,M)}$$

$$= \frac{\frac{1}{Z_D(\beta)}\exp(-\beta E_D)\,\frac{1}{Z_W(\alpha)}\exp(-\alpha E_W)}{\frac{1}{Z_F(\alpha,\beta)}\exp(-F(x))} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)} \cdot \frac{\exp(-\beta E_D - \alpha E_W)}{\exp(-F(x))} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)}$$

$Z_F(\alpha,\beta)$ is the only unknown in this expression.
Quadratic Approximation

Taylor series expansion:
$$F(x) \approx F(x^{MP}) + \frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP}), \qquad H = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$$

Substituting into the posterior density function:
$$P(x|D,\alpha,\beta,M) \approx \frac{1}{Z_F} \exp\!\left(-F(x^{MP}) - \frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})\right)$$

$$= \left\{\frac{1}{Z_F} \exp(-F(x^{MP}))\right\} \exp\!\left(-\frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})\right)$$

Equate with the standard Gaussian density:
$$P(x) = \frac{1}{\sqrt{(2\pi)^n \left|(H^{MP})^{-1}\right|}} \exp\!\left(-\frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})\right)$$

Comparing the two expressions, we have:
$$Z_F(\alpha,\beta) \approx (2\pi)^{n/2} \left(\det\!\left((H^{MP})^{-1}\right)\right)^{1/2} \exp(-F(x^{MP}))$$
Optimum Parameters
If we make this substitution for $Z_F$ in the expression for the evidence, take the derivative with respect to α and β, and set it to zero, we find:

$$\alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})} \qquad \beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})}$$

where the effective number of parameters is
$$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left(H^{MP}\right)^{-1}$$
Gauss-Newton Approximation
It can be expensive to compute the Hessian matrix, so we use the Gauss-Newton approximation:

$$H = \nabla^2 F(x) \approx 2\beta J^T J + 2\alpha I_n$$

The Jacobian J is readily available if the Levenberg-Marquardt algorithm is used for training.
Algorithm (GNBR)
0. Initialize α, β and the weights.
1. Take one step of Levenberg-Marquardt to minimize F(w).
2. Compute the effective number of parameters γ = n − 2α·tr(H⁻¹), using the Gauss-Newton approximation for H.
3. Compute new estimates of the regularization parameters α = γ/(2EW) and β = (N-γ)/(2ED).
4. Iterate steps 1-3 until convergence.
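To make the update loop concrete, here is a minimal sketch for the special case of a linear model $a = Zx$, where the Jacobian is simply $Z$. The data are synthetic, and step 1 solves the regularized linear problem exactly instead of taking a single Levenberg-Marquardt step; a real GNBR implementation trains a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: t = Z x_true + noise  (Jacobian of a = Zx is J = Z)
N, n = 50, 3
Z = rng.normal(size=(N, n))
x_true = np.array([1.0, -2.0, 0.5])
t = Z @ x_true + 0.1 * rng.normal(size=N)

alpha, beta = 1.0, 1.0          # step 0: initialize the regularization parameters
for _ in range(30):
    # Step 1 (here: exact minimizer of F = beta*E_D + alpha*E_W for the linear model)
    x = np.linalg.solve(beta * Z.T @ Z + alpha * np.eye(n), beta * Z.T @ t)
    E_D = float(np.sum((t - Z @ x) ** 2))
    E_W = float(np.sum(x ** 2))
    # Step 2: Gauss-Newton Hessian and effective number of parameters
    H = 2 * beta * Z.T @ Z + 2 * alpha * np.eye(n)
    gamma = n - 2 * alpha * np.trace(np.linalg.inv(H))
    # Step 3: new estimates of the regularization parameters
    alpha = gamma / (2 * E_W)
    beta = (N - gamma) / (2 * E_D)
```

Because the data are informative, γ converges close to n = 3: all three parameters are effectively used.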
Checks of Performance
• If γ is very close to n, then the network may be too small. Add more hidden layer neurons and retrain.
• If the larger network has the same final γ, then the smaller network was large enough.
• Otherwise, increase the number of hidden neurons.
• If a network is sufficiently large, then a larger network will achieve comparable values for γ, ED and EW.
GNBR Example
$\alpha/\beta = 0.0137$
Convergence of GNBR
(Figure: convergence of GNBR: training $E_D$, testing $E_D$, $\alpha/\beta$, and $\gamma$ versus iteration.)
Relationship between Early Stopping and Regularization
Linear Network
(Diagram: linear neuron with input $p$ (R×1), weights $W$ (S×R), bias $b$ (S×1), and output $a$ (S×1).)

$$a = \mathrm{purelin}(Wp + b) = Wp + b$$

$$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}(w_i^T p + b_i) = w_i^T p + b_i$$

$$w_i = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$$
Performance Index
Training Set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$ (input $p_q$, target $t_q$)

Notation:
$$x = \begin{bmatrix} w_1 \\ b \end{bmatrix} \qquad z = \begin{bmatrix} p \\ 1 \end{bmatrix} \qquad a = w_1^T p + b = x^T z$$

Mean Square Error:
$$F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2] = E_D$$
Error Analysis

$$F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]$$

$$F(x) = E[t^2 - 2t\,x^T z + x^T z z^T x]$$

$$F(x) = E[t^2] - 2x^T E[tz] + x^T E[zz^T]\,x$$

$$F(x) = c - 2x^T h + x^T R x, \qquad c = E[t^2],\quad h = E[tz],\quad R = E[zz^T]$$

The mean square error for the linear network is therefore a quadratic function:
$$F(x) = c + d^T x + \frac{1}{2} x^T A x, \qquad d = -2h,\quad A = 2R$$
Example
(Diagram: two-input linear neuron, $a = \mathrm{purelin}(w_{1,1} p_1 + w_{1,2} p_2 + b)$.)

$$\left\{ p_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; t_1 = 1 \right\} \;(\text{probability } 0.75) \qquad \left\{ p_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix},\; t_2 = -1 \right\} \;(\text{probability } 0.25)$$

$$E_D = F(x) = c - 2x^T h + x^T R x$$

$$c = E[t^2] = (1)^2(0.75) + (-1)^2(0.25) = 1$$

$$R = E[zz^T] = p_1 p_1^T (0.75) + p_2 p_2^T (0.25) = 0.75\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + 0.25\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$$

$$h = E[tz] = (0.75)(1)\begin{bmatrix} 1 \\ 1 \end{bmatrix} + (0.25)(-1)\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}$$
Performance Contour

Optimum Point (Maximum Likelihood)

Hessian Matrix:
$$\nabla^2 F(x) = A = 2R = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

Eigenvalues:
$$\left| A - \lambda I \right| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3) \qquad \lambda_1 = 1,\; \lambda_2 = 3$$

Eigenvectors, from $(A - \lambda I)v = 0$:
$$\lambda_1 = 1:\; \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} v_1 = 0 \;\Rightarrow\; v_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad \lambda_2 = 3:\; \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} v_2 = 0 \;\Rightarrow\; v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$x^{ML} = -A^{-1} d = R^{-1} h = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
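The hand calculations above are easy to verify numerically:

```python
import numpy as np

R = np.array([[1.0, 0.5],
              [0.5, 1.0]])   # R = E[zz^T] from the example
h = np.array([1.0, 0.5])     # h = E[tz]
A = 2 * R                    # Hessian of the mean square error

eigvals = np.linalg.eigvalsh(A)   # eigenvalues of A, in ascending order
x_ml = np.linalg.solve(R, h)      # x^ML = R^{-1} h
```

The eigenvalues come out as 1 and 3, and the maximum likelihood weights as [1, 0], matching the derivation.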
Contour Plot of ED
(Figure: contour plot of $E_D$, centered at $x^{ML} = [1,\; 0]^T$.)
Steepest Descent Trajectory
$$x_{k+1} = x_k - \alpha g_k = x_k - \alpha(Ax_k + d)$$

$$= x_k - \alpha A (x_k + A^{-1} d) = x_k - \alpha A (x_k - x^{ML})$$

$$= [I - \alpha A]\,x_k + \alpha A\, x^{ML} = M x_k + [I - M]\,x^{ML}, \qquad M = [I - \alpha A]$$

$$x_1 = M x_0 + [I - M]\,x^{ML}$$

$$x_2 = M x_1 + [I - M]\,x^{ML} = M^2 x_0 + M[I - M]\,x^{ML} + [I - M]\,x^{ML}$$

$$= M^2 x_0 + M x^{ML} - M^2 x^{ML} + x^{ML} - M x^{ML} = M^2 x_0 + [I - M^2]\,x^{ML}$$

$$x_k = M^k x_0 + [I - M^k]\,x^{ML}$$
Regularization
$$F(x) = E_D + \rho E_W, \qquad \rho = \alpha/\beta$$

$$E_W = \frac{1}{2}(x - x_0)^T (x - x_0)$$

$$\nabla E_W = (x - x_0) \qquad \nabla E_D = A(x - x^{ML})$$

To locate the minimum point, set the gradient to zero:
$$\nabla F(x) = \nabla E_D + \rho \nabla E_W = A(x - x^{ML}) + \rho(x - x_0) = 0$$
MAP – ML
$$A(x^{MAP} - x^{ML}) = -\rho(x^{MAP} - x_0) = -\rho(x^{MAP} - x^{ML} + x^{ML} - x_0)$$

$$= -\rho(x^{MAP} - x^{ML}) - \rho(x^{ML} - x_0)$$

$$(A + \rho I)(x^{MAP} - x^{ML}) = \rho(x_0 - x^{ML})$$

$$(x^{MAP} - x^{ML}) = \rho(A + \rho I)^{-1}(x_0 - x^{ML})$$

$$x^{MAP} = x^{ML} - \rho(A + \rho I)^{-1} x^{ML} + \rho(A + \rho I)^{-1} x_0 = x^{ML} - M_\rho x^{ML} + M_\rho x_0$$

$$M_\rho = \rho(A + \rho I)^{-1}$$

$$x^{MAP} = M_\rho x_0 + [I - M_\rho]\,x^{ML}$$
Early Stopping – Regularization
Early stopping: $x_k = M^k x_0 + [I - M^k]\,x^{ML}$, with $M = [I - \alpha A]$.

Regularization: $x^{MAP} = M_\rho x_0 + [I - M_\rho]\,x^{ML}$, with $M_\rho = \rho(A + \rho I)^{-1}$.

Eigenvalues of M ($z_i$ an eigenvector of A, $\lambda_i$ the corresponding eigenvalue):
$$[I - \alpha A]\,z_i = z_i - \alpha A z_i = z_i - \alpha\lambda_i z_i = (1 - \alpha\lambda_i)\,z_i$$

Eigenvalues of $M^k$: $\;\mathrm{eig}(M^k) = (1 - \alpha\lambda_i)^k$

Eigenvalues of $M_\rho$: $\;\mathrm{eig}(M_\rho) = \dfrac{\rho}{\lambda_i + \rho}$
![Page 42: Generalizationhagan.okstate.edu/Generalization_pres.pdf · Poor generalization is caused by using a network that is too complex (too many neurons/parameters). To have the best performance](https://reader033.fdocuments.us/reader033/viewer/2022042219/5ec4deb5bbef68757344a940/html5/thumbnails/42.jpg)
13
42
Reg. Parameter – Iteration Number
$M^k$ and $M_\rho$ have the same eigenvectors. They would be equal if their eigenvalues were equal:
$$\frac{\rho}{\lambda_i + \rho} = (1 - \alpha\lambda_i)^k$$

Taking the log:
$$-\log\!\left(1 + \frac{\lambda_i}{\rho}\right) = k\,\log(1 - \alpha\lambda_i)$$

Since these are equal at $\lambda_i = 0$, they are always equal if the slopes (the derivatives with respect to $\lambda_i$) are equal:
$$-\frac{1}{\left(1 + \frac{\lambda_i}{\rho}\right)}\cdot\frac{1}{\rho} = \frac{k}{1 - \alpha\lambda_i}(-\alpha) \quad\Rightarrow\quad \alpha k = \frac{1}{\rho}\cdot\frac{(1 - \alpha\lambda_i)}{(1 + \lambda_i/\rho)}$$

If $\alpha\lambda_i$ and $\lambda_i/\rho$ are small, then:
$$\alpha k \cong \frac{1}{\rho}$$

(Increasing the number of iterations is equivalent to decreasing the regularization parameter!)
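The relation $\alpha k \cong 1/\rho$ can be checked numerically; the learning rate, eigenvalue, and iteration count below are arbitrary illustrative values.

```python
alpha, lam, k = 0.001, 1.0, 100   # learning rate, eigenvalue of A, iteration count
rho = 1.0 / (alpha * k)           # equivalent regularization parameter

early_stop_factor = (1 - alpha * lam) ** k   # eigenvalue of M^k
regularize_factor = rho / (lam + rho)        # eigenvalue of M_rho
```

With these values the two factors agree to within about half a percent, confirming that k iterations of steepest descent behave like regularization with ρ ≈ 1/(αk).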
Example
(Diagram: two-input linear neuron, $a = \mathrm{purelin}(w_{1,1} p_1 + w_{1,2} p_2 + b)$.)

$$\left\{ p_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; t_1 = 1 \right\} \;(\text{probability } 0.75) \qquad \left\{ p_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix},\; t_2 = -1 \right\} \;(\text{probability } 0.25)$$

$$E_D = c + x^T d + \frac{1}{2} x^T A x, \qquad c = 1,\quad d = -2h = \begin{bmatrix} -2 \\ -1 \end{bmatrix},\quad A = 2R = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

$$E_W = \frac{1}{2} x^T x \qquad F(x) = E_D + \rho E_W$$

$$\nabla^2 F(x) = \nabla^2 E_D + \rho \nabla^2 E_W = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} + \rho \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2+\rho & 1 \\ 1 & 2+\rho \end{bmatrix}$$
ρ = 0, 2, ∞
(Figure: contour plot with the regularized minima for $\rho$ = 0, 2, and ∞.)
ρ = 0 → ∞
(Figure: trajectory of the regularized solution $x^{MAP}$ as $\rho$ varies from 0 to ∞.)
Steepest Descent Path
(Figure: steepest descent path from $x_0$ toward $x^{ML}$.)
Effective Number of Parameters
$$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left\{\left(H^{MP}\right)^{-1}\right\}$$

$$H(x) = \nabla^2 F(x) = \beta \nabla^2 E_D + \alpha \nabla^2 E_W = \beta \nabla^2 E_D + 2\alpha I$$

With $\lambda_i$ the eigenvalues of $\nabla^2 E_D$:
$$\mathrm{tr}\{H^{-1}\} = \sum_{i=1}^{n} \frac{1}{\beta\lambda_i + 2\alpha}$$

$$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left\{\left(H^{MP}\right)^{-1}\right\} = n - \sum_{i=1}^{n} \frac{2\alpha}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n} \frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha}$$

$$\gamma = \sum_{i=1}^{n} \gamma_i, \qquad \gamma_i = \frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha}, \qquad 0 \le \gamma_i \le 1$$

The effective number of parameters will equal the number of large eigenvalues of the Hessian.
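The decomposition $\gamma = \sum_i \gamma_i$ can be illustrated directly; the eigenvalues below are made up, chosen so that two are large relative to $2\alpha/\beta$ and two are small.

```python
import numpy as np

beta, alpha = 1.0, 0.1
lam = np.array([100.0, 10.0, 0.01, 0.001])   # hypothetical eigenvalues of grad^2 E_D

gamma_i = beta * lam / (beta * lam + 2 * alpha)   # each contribution lies in [0, 1]
gamma = gamma_i.sum()                             # effective number of parameters
```

The two large eigenvalues contribute nearly 1 each and the two small ones nearly 0, so γ comes out close to 2: only two of the four directions are effectively used to fit the data.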