Fundamentals of Artificial Neural Networks
Chapter 7 in amlbook.com
Review: Neuron analogy to linear models
The dot product w^T x combines the attributes and the bias (x0 = 1) into a scalar signal s, which is used to form a hypothesis about x.
Analogy is called a “perceptron”
sigmoid(s)
What can a perceptron do in 1D?
Fit a line to data: y = wx + w0
Use y = wx + w0 as a discriminant
[Figure: 1D perceptron with input x, bias input x0 = +1, weights w and w0, and output y]
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
S = sigmoid(y)
If y > 0 then S > 0.5: choose green; otherwise choose red.
$$ y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x} $$
$$ \mathbf{w} = [w_0, w_1, \ldots, w_d]^T, \qquad \mathbf{x} = [1, x_1, \ldots, x_d]^T $$
x is the input vector; w is the weight vector; y = w^T x
A perceptron can do the same thing in d dimensions: fit a plane to data and use a plane as a discriminant for binary classification.
For regression, the output is y.
For classification, the output is sigmoid(y).
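A minimal Python sketch of the perceptron output described above. The weights (w0 = -1, w1 = 2) and the function names are assumptions chosen only for illustration; they do not come from the lecture.

```python
import math

def sigmoid(s):
    """Logistic function: maps any real s into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def perceptron_output(w, x):
    """Return y = w^T x, where w = [w0, w1, ..., wd] and x = [x1, ..., xd].

    The bias input x0 = 1 is prepended here, so w0 acts as the intercept.
    """
    augmented = [1.0] + list(x)
    return sum(wj * xj for wj, xj in zip(w, augmented))

# Hypothetical 1D weights (illustration only): y = 2x - 1
w = [-1.0, 2.0]
y = perceptron_output(w, [0.75])   # regression output
S = sigmoid(y)                     # classification output
label = "green" if S > 0.5 else "red"
```

The same function works unchanged in d dimensions because the sum runs over however many components w and x have.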
Boolean AND: linearly separable 2D binary classification problem
x1+x2=1.5 is an acceptable linear discriminant
Linear discriminant: w^T x = 0
y = w^T x; w^T x < 0 → r = 0; w^T x > 0 → r = 1

x1  x2  r   required            choice
0   0   0   w0 < 0              w0 = -1.5
0   1   0   w2 + w0 < 0         w1 = 1
1   0   0   w1 + w0 < 0         w2 = 1
1   1   1   w1 + w2 + w0 > 0
Derive the linear discriminant x1+ x2 -1.5 = 0
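As a quick check, the discriminant x1 + x2 - 1.5 = 0 can be evaluated on all four rows of the AND truth table. This is only a sketch; the function name is an assumption.

```python
def and_discriminant(x1, x2, w=(-1.5, 1.0, 1.0)):
    """Return w^T x = w0 + w1*x1 + w2*x2 for the weights chosen above."""
    w0, w1, w2 = w
    return w0 + w1 * x1 + w2 * x2

# (x1, x2) -> r for Boolean AND
truth_table = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# Predict r = 1 when w^T x > 0, otherwise r = 0
predictions = [1 if and_discriminant(*x) > 0 else 0 for x, _ in truth_table]
targets = [r for _, r in truth_table]
```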
Other linear discriminants are possible.
We have not yet specified an optimization condition.
Boolean XOR: linearly inseparable 2D binary classification problem
A perceptron cannot be applied in attribute space.
Solution: transform to a feature space in which the classes are linearly separable.
XOR in Gaussian Feature space
f1 = exp(-|X - [1,1]|^2)
f2 = exp(-|X - [0,0]|^2)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3679  0.3679
(0,0)  0.1353  1
(1,0)  0.3679  0.3679

This transformation puts examples (0,1) and (1,0) at the same point in feature space.
A perceptron can then be applied to find a linear discriminant.
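The Gaussian features can be reproduced with a few lines of Python. The discriminant f1 + f2 = 0.9 used at the end is an assumption chosen for illustration (any line between the two groups would do); it is not taken from the slides.

```python
import math

def gaussian_features(x, c1=(1, 1), c2=(0, 0)):
    """Return (f1, f2) = (exp(-|x - c1|^2), exp(-|x - c2|^2))."""
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    d2 = sum((a - b) ** 2 for a, b in zip(x, c2))
    return math.exp(-d1), math.exp(-d2)

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
features = {x: gaussian_features(x) for x, _ in xor_data}

# Assumed discriminant (illustration only): f1 + f2 = 0.9.
# Class r = 1 lies below the line, class r = 0 above it.
predictions = [1 if sum(features[x]) < 0.9 else 0 for x, _ in xor_data]
```

In feature space the two r = 1 examples coincide, so a single line separates them from the two r = 0 examples.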
Add a “hidden” layer to the perceptron
Derive 2 weight vectors connecting input to hidden layer that define linearly separable features.
Derive 1 weight vector connecting hidden layer to output that defines a linear discriminant separating features
Consider hidden units zh as features. Choose weight vectors wh so that in feature space (0,0) and (1,1) are close to the same point.
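$$ z_h = \mathrm{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp[-\mathbf{w}_h^T \mathbf{x}]} $$
[Figure: XOR examples in attribute space and in the ideal feature space (axes z1, z2)]
Data table in feature space:

r   z1   z2
0   ~0   ~0
1   ~0   ~1
1   ~1   ~0
0   ~0   ~0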
If whTx << 0 → zh ~ 0
If whTx >> 0 → zh ~ 1
Find weight vectors for linearly separable features

Weights of w1 (hidden unit z1):
x1  x2  z1   w1^T x   required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~0   < 0      w2 + w0 < 0          w2 = -1
1   0   ~1   > 0      w1 + w0 > 0          w1 = 1
1   1   ~0   < 0      w1 + w2 + w0 < 0

Weights of w2 (hidden unit z2):
x1  x2  z2   w2^T x   required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~1   > 0      w2 + w0 > 0          w2 = 1
1   0   ~0   < 0      w1 + w0 < 0          w1 = -1
1   1   ~0   < 0      w1 + w2 + w0 < 0
Transformation of input by hidden layer
z1 = sigmoid(x1 - x2 - 0.5)
z2 = sigmoid(-x1 + x2 - 0.5)

x1  x2  arg1   z1     arg2   z2     r
0   0   -0.5   0.38   -0.5   0.38   0
0   1   -1.5   0.18    0.5   0.62   1
1   0    0.5   0.62   -1.5   0.18   1
1   1   -0.5   0.38   -0.5   0.38   0

[Figure: XOR transformed by hidden layer, plotted in the (z1, z2) plane]
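The hidden-layer values can be recomputed directly from the two sigmoid expressions. A short sketch (function names are assumptions):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def hidden_layer(x1, x2):
    """Hidden units z1, z2 with the weights chosen on the slides."""
    z1 = sigmoid(x1 - x2 - 0.5)
    z2 = sigmoid(-x1 + x2 - 0.5)
    return z1, z2

# One row per XOR example: (x1, x2, z1, z2), with z rounded to 2 decimals
rows = [(x1, x2) + tuple(round(z, 2) for z in hidden_layer(x1, x2))
        for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Examples (0,0) and (1,1) map to the same point (0.38, 0.38), which is what makes the transformed problem linearly separable.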
Find weights connecting the hidden layer to the output that define a linear discriminant in feature space.
Denote the weight vector by v.
Output is transformed by y = sigmoid(v^T z).

z1    z2    r   v^T z   required                       choice
0.38  0.38  0   < 0     0.38 v1 + 0.38 v2 + v0 < 0     v0 = -0.78
0.18  0.62  1   > 0     0.18 v1 + 0.62 v2 + v0 > 0     v2 = 1
0.62  0.18  1   > 0     0.62 v1 + 0.18 v2 + v0 > 0     v1 = 1
0.38  0.38  0   < 0     0.38 v1 + 0.38 v2 + v0 < 0
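Putting both layers together gives a complete forward pass for XOR. The sketch below uses the hand-derived weights from the slides (hidden weights ±1 and -0.5, output weights v0 = -0.78, v1 = v2 = 1); the function name is an assumption.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def xor_mlp(x1, x2, v=(-0.78, 1.0, 1.0)):
    """Forward pass: input -> two sigmoid hidden units -> sigmoid output."""
    z1 = sigmoid(x1 - x2 - 0.5)
    z2 = sigmoid(-x1 + x2 - 0.5)
    v0, v1, v2 = v
    return sigmoid(v0 + v1 * z1 + v2 * z2)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_mlp(x1, x2) for x1, x2 in inputs]

# Assign class 1 when the sigmoid output exceeds 0.5
predictions = [1 if y > 0.5 else 0 for y in outputs]
```

With these weights the outputs sit very close to 0.5 on either side, so the classification is correct but the margin is thin.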
Not the only solution to XOR classification problem
Define an optimization condition that enables learning optimum weights for both layers
Data will determine the best transform of the input
Training a neural network by back-propagation
Initialize weights randomly.
Need a rule that relates changes in weights to the difference between output and target.
If the expression for in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. fewer than 3 hidden layers), then an analytical expression for the rate of change of error with change in weights can be derived.
Simplest case: multivariate linear regression
In-sample error is squared residuals and there are no hidden layers.
This example is instructive but not of practical relevance: the normal equations are a better way to find optimum weights in this case.
Approaches to Training
• Online: weights updated based on training-set examples seen one by one in random order
• Batch: weights updated based on whole training set after summing deviations from individual examples
Weight-update formulas are simpler for the “online” approach.
Formulas for “batch” can be derived from the “online” formula.
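The difference between the two approaches can be sketched in Python. The per-example change used here is the one for a linear unit with squared error, taken as an assumed example; the function names and the tiny data set are hypothetical.

```python
def delta(w, x, r, eta):
    """Per-example weight change for a linear unit y = w^T x (x includes x0 = 1)."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    return [eta * (r - y) * xj for xj in x]

def online_epoch(w, data, eta, rng=None):
    """Online: update w after every example (shuffled if rng is given)."""
    order = list(data)
    if rng is not None:
        rng.shuffle(order)
    for x, r in order:
        w = [wj + dj for wj, dj in zip(w, delta(w, x, r, eta))]
    return w

def batch_epoch(w, data, eta):
    """Batch: sum the per-example changes at fixed w, then update once."""
    total = [0.0] * len(w)
    for x, r in data:
        total = [tj + dj for tj, dj in zip(total, delta(w, x, r, eta))]
    return [wj + tj for wj, tj in zip(w, total)]

# Hypothetical data: two examples with bias input x0 = 1
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
w_online = online_epoch([0.0, 0.0], data, eta=0.1)
w_batch = batch_epoch([0.0, 0.0], data, eta=0.1)
```

The batch formula is just the online formula summed over the training set with the weights held fixed, which is why the two results differ slightly after one pass.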
Weight-update rule: Perceptron regression
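Contribution to the sum of squared residuals from a single example:
$$ E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}\,(r^t - y^t)^2 = \tfrac{1}{2}\left[r^t - \mathbf{w}^T \mathbf{x}^t\right]^2 $$
$$ E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}\left[r^t - \left(\sum_{k=1}^{d} w_k x_k^t + w_0\right)\right]^2 $$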
wj is the jth component of weight vector w connecting attribute vector x to scalar output y
Et depends on wj through yt = wTxt; hence use chain rule
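$$ \frac{\partial E^t}{\partial w_j} = \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial w_j} = -(r^t - y^t)\,x_j^t $$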
Weight-update formula, called “stochastic gradient descent”:
$$ \Delta w_j^t = -\eta\,\frac{\partial E^t}{\partial w_j} = \eta\,(r^t - y^t)\,x_j^t $$
The proportionality constant η is called the “learning rate”.
Since Δw_j is proportional to x_j, all attributes should be roughly the same size.
Normalizing the attributes to achieve this may be helpful.
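A minimal sketch of online training with this update rule. The data (points on the line r = 2x + 1), the learning rate, the number of epochs, and the function name are all assumptions made for illustration.

```python
import random

def sgd_linear_regression(data, eta=0.1, epochs=2000, seed=0):
    """Online training of y = w^T x with the update dw_j = eta * (r - y) * x_j."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * (d + 1)                      # w[0] is the bias weight w0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)                   # examples seen one by one in random order
        for x, r in order:
            xb = [1.0] + list(x)             # prepend bias input x0 = 1
            y = sum(wj * xj for wj, xj in zip(w, xb))
            w = [wj + eta * (r - y) * xj for wj, xj in zip(w, xb)]
    return w

# Hypothetical noise-free data on the line r = 2x + 1
data = [([x], 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
w = sgd_linear_regression(data)
```

On noise-free data like this the weights approach the exact intercept and slope; with noisy data they would fluctuate around the least-squares solution instead.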
Momentum parameter α: keep part of the previous update
$$ \Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1} $$
How do learning rate and momentum affect training?
As learning rate → 1, back-propagation becomes deterministic: each example determines a set of weights optimal for itself only.
As learning rate → 0, the probability of being trapped in a local minimum → 1, because the step size of the weight change is so small.
A large momentum parameter reduces trapping at small learning rate but increases the likelihood that a single outlier will dramatically affect weight optimization.
Opinions differ on the best choice of learning rate and momentum.
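The momentum update can be sketched as a single step function. The toy error E = w²/2 (gradient equal to w), the parameter values, and the names are assumptions used only to show how part of the previous update is carried forward.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One update: delta = -eta * grad + alpha * (previous delta)."""
    delta = [-eta * g + alpha * pd for g, pd in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta

# Toy error E(w) = w^2 / 2, so dE/dw = w (illustration only)
w, prev = [1.0], [0.0]
w1, d1 = momentum_step(w, grad=w, prev_delta=prev)      # first step: no history yet
w2, d2 = momentum_step(w1, grad=w1, prev_delta=d1)      # second step reuses d1
```

The second step is twice as large as the plain gradient step would be, because 90% of the first step is added to it.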
Multivariate linear dichotomizer
$$ \mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N} $$
r^t ∈ {0,1}; y = sigmoid(w^T x)
This example is equivalent to logistic regression but with r^t ∈ {0,1} instead of {-1,1}.
In-sample error: cross entropy
The weight vector w connects the input to the output, which is transformed by the sigmoid function.
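$$ \mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \qquad r^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \text{ is positive} \\ 0 & \text{if } \mathbf{x}^t \text{ is negative} \end{cases} $$
Assume that r^t is drawn from a Bernoulli distribution with parameter p0, the probability that r^t = 1:
p(r) = p0^r (1 - p0)^(1 - r), so p(r = 1) = p0 and p(r = 0) = 1 - p0
Let y = sigmoid(w^T x) be the MLE of p0:
p(r) = y^r (1 - y)^(1 - r)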
Weight update for optimization by back propagation
Assume that r^t is drawn from a Bernoulli distribution with parameter p0, the probability that r^t = 1, and that y = sigmoid(w^T x) is the MLE of p0.
Let L(w|X) be the log-likelihood of weight vector w given training set X.
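$$ L(\mathbf{w} \mid \mathcal{X}) = \log \prod_t (y^t)^{r^t}\,(1 - y^t)^{(1 - r^t)} $$
$$ L(\mathbf{w} \mid \mathcal{X}) = \sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right] $$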
y^t = sigmoid(w^T x^t) is between 0 and 1; hence L(w|X) < 0.
Therefore, to maximize L(w|X), minimize
$$ E(\mathbf{w} \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right] $$
Like the sum of squared residuals, cross entropy depends on weights w through yt
Unlike fitting a plane to data, where y^t = w^T x^t, here y^t = sigmoid(w^T x^t), which puts an additional factor in the chain rule when deriving the back-propagation formula.
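$$ \frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial(\mathbf{w}^T\mathbf{x})}\,\frac{\partial(\mathbf{w}^T\mathbf{x})}{\partial w_j} $$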
By much tedious algebra, you can show that the result has the same form as for online training in regression:
$$ \Delta w_j^t = -\eta\,\frac{\partial E^t}{\partial w_j} = \eta\,(r^t - y^t)\,x_j^t $$
where y^t = sigmoid(w^T x^t).
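The algebra behind this result can be sketched for a single example. The symbol s^t = w^T x^t is introduced here only as shorthand; the three factors are those of the chain rule for cross entropy with a sigmoid output.

```latex
% Per-example cross entropy, with s^t = w^T x^t and y^t = sigmoid(s^t)
E^t = -\left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]

% Factor 1: derivative of the error with respect to the output
\frac{\partial E^t}{\partial y^t}
  = -\left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right)
  = -\frac{r^t - y^t}{y^t (1 - y^t)}

% Factor 2: derivative of the sigmoid (the additional factor)
\frac{\partial y^t}{\partial s^t} = y^t (1 - y^t)

% Factor 3: derivative of the linear signal
\frac{\partial s^t}{\partial w_j} = x_j^t

% Product: the sigmoid factor cancels the denominator of factor 1
\frac{\partial E^t}{\partial w_j} = -(r^t - y^t)\, x_j^t
\quad\Longrightarrow\quad
\Delta w_j^t = -\eta \frac{\partial E^t}{\partial w_j} = \eta\,(r^t - y^t)\, x_j^t
```

The cancellation of y^t(1 - y^t) is what makes the cross-entropy update look identical to the squared-error update for linear regression.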
Result can be generalized to multi-class classification
Perceptron classification with k >1 classes
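$$ y_i^t = \frac{\exp(\mathbf{w}_i^T \mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T \mathbf{x}^t)} $$
$$ E^t(\mathbf{W} \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t $$
$$ \Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t $$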
Weight vector w_i (column of weight matrix W) connects input vector x to output node y_i.
w_ij is the jth component of w_i.
Assign the example to the class with the largest y_i.
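A short Python sketch of the multi-class perceptron. For convenience the weight vectors w_i are stored as rows of a list of lists rather than columns of W; that layout, the function names, and the example values are assumptions.

```python
import math

def softmax_outputs(W, x):
    """y_i = exp(w_i^T x) / sum_k exp(w_k^T x); W[i] holds weight vector w_i."""
    scores = [sum(wij * xj for wij, xj in zip(wi, x)) for wi in W]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def online_update(W, x, r, eta):
    """Apply dw_ij = eta * (r_i - y_i) * x_j for one example."""
    y = softmax_outputs(W, x)
    return [[wij + eta * (ri - yi) * xj for wij, xj in zip(wi, x)]
            for wi, ri, yi in zip(W, r, y)]

# Hypothetical example: 3 classes, x = [x0 = 1, x1 = 2], true class is the first
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x, r = [1.0, 2.0], [1, 0, 0]
y_before = softmax_outputs(W, x)
W = online_update(W, x, r, eta=0.3)
y_after = softmax_outputs(W, x)
predicted_class = y_after.index(max(y_after))
```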
Review: Multilayer Perceptrons (MLP)
Layers between input and output called “hidden”
Both number of hidden layers and number of hidden units in a layer are variables in the structure of an MLP
Less complex structures improve generalization
More than one hidden layer is seldom necessary
Review: MLP solution of XOR classification problem
Nodes in hidden layer are linearly separable transforms of inputs
Data will determine the transform that minimizes in-sample error
Weights that connect hidden layer to output node define a linear discriminant in feature space, vTz=0.
Transform the output by the sigmoid function.
If S > 0.5, assign class = 1.
Solution of XOR problem by nonlinear regression
Do not transform the output node.
Change labels to ±1.
Fit v^T z to class labels by nonlinear regression.
If v^T z > 0, assign class 1.
Data will determine the best level of nonlinearity
Let in-sample error be sum of squared residuals.
Use back propagation to optimize weights
Derive formulas for updating weights in both layers
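Review: weight update rules for nonlinear regression
Given weights w_h and v
Forward:
$$ z_h^t = \mathrm{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t) $$
$$ y^t = \sum_{h=1}^{H} v_h z_h^t + v_0 $$
$$ E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}\,(r^t - y^t)^2 $$
Backward:
$$ \Delta v_h^t = -\eta\,\frac{\partial E^t}{\partial v_h} = -\eta\,\frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial v_h} = \eta\,(r^t - y^t)\,z_h^t $$
$$ \frac{\partial E^t}{\partial w_{hj}} = \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial(\mathbf{w}_h^T \mathbf{x}^t)}\,\frac{\partial(\mathbf{w}_h^T \mathbf{x}^t)}{\partial w_{hj}} $$
$$ \Delta w_{hj}^t = -\eta\,\frac{\partial E^t}{\partial w_{hj}} = \eta\,(r^t - y^t)\,v_h\,z_h^t\,(1 - z_h^t)\,x_j^t $$
The hidden layer is transformed by the sigmoid.
H weight vectors connect the input to the hidden layer; 1 weight vector connects the hidden layer to the output.
Each pass through all training data is called an “epoch”.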
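The forward and backward passes for nonlinear regression with one hidden layer can be sketched as one online update. The data layout (bias weights stored first in each vector), the function names, and the starting weights used in the example are assumptions; the starting weights happen to be the hand-derived XOR weights from earlier slides.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(W, v, x):
    """W[h] = [w_h0, w_h1, ..., w_hd]; v = [v_0, v_1, ..., v_H]; x excludes the bias."""
    xb = [1.0] + list(x)
    z = [sigmoid(sum(w * xi for w, xi in zip(wh, xb))) for wh in W]
    y = v[0] + sum(vh * zh for vh, zh in zip(v[1:], z))   # output is not transformed
    return xb, z, y

def online_update(W, v, x, r, eta=0.1):
    """One example: compute both sets of changes from the current (old) weights."""
    xb, z, y = forward(W, v, x)
    err = r - y
    # Hidden-layer changes use the old v: eta * (r - y) * v_h * z_h * (1 - z_h) * x_j
    new_W = [[whj + eta * err * vh * zh * (1.0 - zh) * xj for whj, xj in zip(wh, xb)]
             for wh, vh, zh in zip(W, v[1:], z)]
    # Output-layer changes: eta * (r - y) * z_h, with z_0 = 1 for the bias v_0
    new_v = [v[0] + eta * err] + [vh + eta * err * zh for vh, zh in zip(v[1:], z)]
    return new_W, new_v

def squared_error(W, v, x, r):
    return 0.5 * (r - forward(W, v, x)[2]) ** 2

W = [[-0.5, 1.0, -1.0], [-0.5, -1.0, 1.0]]
v = [-0.78, 1.0, 1.0]
x, r = (0, 1), 1.0
error_before = squared_error(W, v, x, r)
W, v = online_update(W, v, x, r)
error_after = squared_error(W, v, x, r)
```

Repeating this update over all examples in random order gives one epoch of online training.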
Consider a momentum parameter α:
$$ \Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1} $$
Calculate changes to the w_h vectors before changing v.
Learning rates can be different for Δw_hj and Δv_h.
It may be helpful to normalize attributes to [0,1]:
$$ nx_k^t = \frac{x_k^t - x_k^{\min}}{x_k^{\max} - x_k^{\min}} $$
Other normalization methods can be used.
Transforms in the hidden layer do not require normalization.
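Min-max normalization of one attribute can be sketched as follows. Returning zeros for a constant attribute is an assumption made here to avoid dividing by zero; the slides do not say how that case should be handled.

```python
def min_max_normalize(column):
    """Rescale one attribute to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumed convention: a constant attribute carries no information
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

normalized = min_max_normalize([2.0, 4.0, 6.0])
```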
Use Eval to determine the number of nodes in the hidden layer
Above h ~ 15, Eval increases while Ein is flat.
No significant decrease in Eval or Ein after h = 3.
Favor small h to avoid overfitting.
Stop early to avoid overfitting
Beyond the elbow, Eval ~ Ein for ~200 epochs (e).
Above e ~ 600 there is evidence of overfitting.
Expect overfitting at e ~ 10 × elbow.
Set e_stop when you see evidence of overfitting.
Use a validation set to detect overfitting
Validation error calculated after each epoch
Example data: x^t ~ U(-0.5, 0.5), y^t = sin(6x^t) + N(0, 0.1)
The fit looks better at 300 epochs. Why stop at 200 epochs?
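One common way to turn "stop when you see evidence of overfitting" into a rule is to stop once the validation error has failed to improve for a fixed number of epochs. The sketch below uses a hypothetical `patience` parameter for that count; the rule and the example error values are assumptions, not the procedure used in the lecture.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    Training stops once the validation error has not improved for
    `patience` consecutive epochs; weights from best_epoch would be kept.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

# Hypothetical validation errors: improving, then rising
best, stop = early_stopping_epoch([5.0, 4.0, 3.0, 3.5, 3.6, 3.7], patience=2)
```

A larger `patience` tolerates more noise in the validation curve at the cost of training longer past the minimum.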
Review: Solution of XOR problem by nonlinear regression
A possible structure of an MLP to solve the XOR problem.
Should we consider a structure with 2 nodes in the hidden layer?
$$ \Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1} $$
Edom’s code for solution of XOR by non-linear regression with 3 nodes in the hidden layer
Note: no momentum parameter and no binning of y(t) to predict class label
Edom’s solution of XOR problem by nonlinear regression
Fit is adequate to assign class labels
No summation; K = 1
In classification by regression, bin y before calculating the difference from the target.
This allows the number misclassified to be used as the error.
For regression, K = 1.
Bin y to get a class assignment.
Binning creates flat regions.
From Heng’s HW6
Assignment 6 due 11-13-15
Use dataset randomized shortened glassdata.csv to develop a classifier for beer-bottle glass by ANN non-linear regression. Keep the class labels as 1, 2, and 6.
With a validation set of 100 examples and a training set of 74 examples, select the best number of hidden nodes in a single hidden layer and the best number of epochs for weight refinement.
Use all the data to optimize weights at the selected structure and training time. Calculate confusion matrix and accuracy of prediction. Use 10-fold cross validation to estimate the accuracy of a test set.
When should you start with random weights in [-0.01, 0.01]?
Two-Class Discrimination with one hidden layer
$$ y^t = \mathrm{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right) $$
$$ E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right] $$
$$ \Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t $$
$$ \Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t\,(1 - z_h^t)\,x_j^t $$
Sigmoid output models P(C1|x).
Minimize cross entropy in batch update.
Weight-update formulas are the same as for regression.
Assign examples to C1 if output > 0.5.
K>2 Classes: one hidden layer
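$$ o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0} $$
$$ y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \equiv P(C_i \mid \mathbf{x}^t) $$
$$ E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t $$
$$ \Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t $$
$$ \Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t)\,v_{ih} \right] z_h^t\,(1 - z_h^t)\,x_j^t $$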
Minimize in-sample cross entropy by batch update.
Assign examples to the class with the largest output.
v_i is the weight vector connecting nodes of the hidden layer to the output of class i.
Note the sum over i in Δw_hj.
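A sketch of the K-class network with one hidden layer, showing the forward pass and one batch accumulation of weight changes. The storage layout (bias weights first in each row), the function names, and the zero starting weights in the example are assumptions for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def forward(W, V, x):
    """W[h] = [w_h0, ..., w_hd]; V[i] = [v_i0, v_i1, ..., v_iH]."""
    xb = [1.0] + list(x)                              # x_0 = 1 carries the bias w_h0
    z = [1.0] + [sigmoid(dot(wh, xb)) for wh in W]    # z_0 = 1 carries the bias v_i0
    o = [dot(vi, z) for vi in V]
    m = max(o)                                        # subtract max for numerical stability
    e = [math.exp(oi - m) for oi in o]
    total = sum(e)
    return xb, z, [ei / total for ei in e]

def batch_deltas(W, V, data, eta):
    """Sum the per-example changes over the whole training set at fixed weights."""
    dW = [[0.0] * len(wh) for wh in W]
    dV = [[0.0] * len(vi) for vi in V]
    for x, r in data:
        xb, z, y = forward(W, V, x)
        err = [ri - yi for ri, yi in zip(r, y)]
        for i, ei in enumerate(err):
            for h, zh in enumerate(z):
                dV[i][h] += eta * ei * zh
        for h in range(len(W)):
            zh = z[h + 1]
            back = sum(ei * V[i][h + 1] for i, ei in enumerate(err))  # sum over classes i
            for j, xj in enumerate(xb):
                dW[h][j] += eta * back * zh * (1.0 - zh) * xj
    return dW, dV

# Hypothetical setup: d = 2 inputs, H = 2 hidden units, K = 3 classes, zero weights
W = [[0.0] * 3 for _ in range(2)]
V = [[0.0] * 3 for _ in range(3)]
data = [((1.0, 0.0), [1, 0, 0])]
y = forward(W, V, data[0][0])[2]
dW, dV = batch_deltas(W, V, data, eta=0.3)
```

Starting from all-zero weights leaves the hidden-layer changes at zero because every v_ih is zero, which is one reason small random initial weights are used in practice.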