Fundamentals of Artificial Neural Networks
Chapter 7 in amlbook.com

Review: Neuron analogy to linear models

The dot product w^T x combines the attributes and the bias input (x0 = 1) into a scalar signal s, which is used to form a hypothesis about x.

Analogy is called a “perceptron”

sigmoid(s)

What can a perceptron do in 1D?
Fit a line to data: y = wx + w0
Use y = wx + w0 as a discriminant

[Figure: perceptron with input x and bias input x0 = +1, weights w and w0, signal s, and output y]

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

S = sigmoid(y). If y > 0, then S > 0.5: choose green; otherwise choose red.

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

$$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T, \qquad \mathbf{x} = [1, x_1, \ldots, x_d]^T$$


x is the input vector, w is the weight vector, and y = w^T x.

A perceptron can do the same thing in d dimensions: fit a plane to data and use a plane as a discriminant for binary classification.
For regression, the output is y.
For classification, the output is sigmoid(y).

Boolean AND: linearly separable 2D binary classification problem


Truth table and the constraints it places on the weights of the linear discriminant w^T x = 0:

x1  x2  r   required             choice
0   0   0   w0 < 0               w0 = -1.5
0   1   0   w2 + w0 < 0          w1 = 1
1   0   0   w1 + w0 < 0          w2 = 1
1   1   1   w1 + w2 + w0 > 0

With y = w^T x: w^T x < 0 → r = 0; w^T x > 0 → r = 1.

These choices give the linear discriminant x1 + x2 - 1.5 = 0; that is, x1 + x2 = 1.5 is an acceptable linear discriminant.


Other linear discriminants are possible.
We have not yet specified an optimization condition.
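The weights chosen for Boolean AND can be checked directly. The following is a minimal sketch, not part of the original slides; the function names `discriminant` and `predict` are illustrative.

```python
def discriminant(x1, x2, w0=-1.5, w1=1.0, w2=1.0):
    """Return w^T x for the augmented input x = (1, x1, x2)."""
    return w0 + w1 * x1 + w2 * x2


def predict(x1, x2):
    """Assign r = 1 if w^T x > 0, otherwise r = 0."""
    return 1 if discriminant(x1, x2) > 0 else 0


# Truth table for Boolean AND: (x1, x2) -> r
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

for (x1, x2), r in and_table.items():
    print((x1, x2), discriminant(x1, x2), predict(x1, x2), r)
```

Any other weights that satisfy the four inequalities in the table would pass the same check.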


Application of a perceptron is not possible in attribute space.
Solution: transform to a linearly separable feature space.

Boolean XOR: linearly inseparable 2D binary classification problem

XOR in Gaussian Feature space

This transformation puts examples (0,1) and (1,0) at the same point in feature space.
A perceptron could be applied to find a linear discriminant.

f1 = exp(-|X - [1,1]|^2)
f2 = exp(-|X - [0,0]|^2)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678
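The feature values in the table can be reproduced with a few lines of Python. This is a sketch for illustration only; the threshold 0.9 in `classify` is an assumed value chosen to lie between the two groups of points, not one given in the slides.

```python
import math


def gaussian_feature(x, center):
    """exp(-|x - center|^2) for 2-D points given as tuples."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist)


def transform(x):
    """Map an XOR input to the Gaussian feature space (f1, f2)."""
    return gaussian_feature(x, (1, 1)), gaussian_feature(x, (0, 0))


def classify(x, threshold=0.9):
    """Linear discriminant f1 + f2 = threshold (threshold is an assumption)."""
    f1, f2 = transform(x)
    return 1 if f1 + f2 < threshold else 0


for point in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    f1, f2 = transform(point)
    print(point, round(f1, 4), round(f2, 4), classify(point))
```

Because (0,1) and (1,0) map to the same point, a single straight line in (f1, f2) is enough to separate the two XOR classes.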


[Figure: XOR data; legend: r = 1, r = 0]


Add a “hidden” layer to the perceptron

Derive 2 weight vectors connecting the input to the hidden layer that define linearly separable features.

Derive 1 weight vector connecting the hidden layer to the output that defines a linear discriminant separating the features.

Consider hidden units zh as features. Choose weight vectors wh so that in feature space (0,0) and (1,1) are close to the same point.

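$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp[-\mathbf{w}_h^T \mathbf{x}]}$$

[Figure: XOR examples in attribute space and in the ideal feature space (z1, z2)]

Data table in feature space:

r   z1   z2
0   ~0   ~0
1   ~0   ~1
1   ~1   ~0
0   ~0   ~0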

If whTx << 0 → zh ~ 0

If whTx >> 0 → zh ~ 1


Find weight vectors for linearly separable features

Hidden unit z1:

x1  x2  z1   w1^T x   required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~0   < 0      w2 + w0 < 0          w2 = -1
1   0   ~1   > 0      w1 + w0 > 0          w1 = 1
1   1   ~0   < 0      w1 + w2 + w0 < 0

Hidden unit z2:

x1  x2  z2   w2^T x   required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~1   > 0      w2 + w0 > 0          w2 = 1
1   0   ~0   < 0      w1 + w0 < 0          w1 = -1
1   1   ~0   < 0      w1 + w2 + w0 < 0

Transformation of input by hidden layer

z1 = sigmoid(x1 - x2 - 0.5)
z2 = sigmoid(-x1 + x2 - 0.5)

x1  x2  arg1   z1     arg2   z2     r
0   0   -0.5   0.38   -0.5   0.38   0
0   1   -1.5   0.18    0.5   0.62   1
1   0    0.5   0.62   -1.5   0.18   1
1   1   -0.5   0.38   -0.5   0.38   0

[Figure: XOR transformed by hidden layer, plotted in (z1, z2)]

Find weights connecting the hidden layer to the output that define a linear discriminant in feature space.
Denote the weight vector by v.
The output is transformed by y = sigmoid(v^T z).

z1     z2     r   v^T z   required                     choice
0.38   0.38   0   < 0     .38 v1 + .38 v2 + v0 < 0     v0 = -.78
0.18   0.62   1   > 0     .18 v1 + .62 v2 + v0 > 0     v2 = 1
0.62   0.18   1   > 0     .62 v1 + .18 v2 + v0 > 0     v1 = 1
0.38   0.38   0   < 0     .38 v1 + .38 v2 + v0 < 0
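The hand-chosen weights for both layers can be verified with a short forward pass. This is a minimal sketch; the function names are illustrative, and the weights are the ones listed in the tables (w0 = -0.5, w1 = ±1, w2 = ∓1 for the hidden units; v0 = -0.78, v1 = v2 = 1 for the output).

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def hidden(x1, x2):
    """Hidden-layer features z1 and z2 for one input."""
    z1 = sigmoid(x1 - x2 - 0.5)
    z2 = sigmoid(-x1 + x2 - 0.5)
    return z1, z2


def output(x1, x2, v0=-0.78, v1=1.0, v2=1.0):
    """Network output y = sigmoid(v^T z)."""
    z1, z2 = hidden(x1, x2)
    return sigmoid(v0 + v1 * z1 + v2 * z2)


def classify(x1, x2):
    """Assign class 1 if the sigmoid output exceeds 0.5."""
    return 1 if output(x1, x2) > 0.5 else 0


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1, z2 = hidden(x1, x2)
    print((x1, x2), round(z1, 2), round(z2, 2), classify(x1, x2))
```

The margins are small (v^T z is close to zero for every example), which is one reason to learn the weights from data rather than fix them by hand.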



Not the only solution to XOR classification problem


Define an optimization condition that enables learning optimum weights for both layers


Data will determine the best transform of the input

Training a neural network by back-propagation

Initialize weights randomly.

Need a rule that relates changes in weights to the difference between output and target.

If the expression for in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. < 3 hidden layers), then an analytical expression for the rate of change of error with change in weights can be derived.

Simplest case: multivariate linear regression
In-sample error is the sum of squared residuals, and there are no hidden layers.

This example is instructive but of little practical relevance: the normal equations are a better way to find the optimum weights in this case.

Approaches to Training


• Online: weights updated based on training-set examples seen one by one in random order

• Batch: weights updated based on whole training set after summing deviations from individual examples


Weight-update formulas are simpler for the “online” approach.
Formulas for “batch” can be derived from the “online” formula.

Weight-update rule: Perceptron regression


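Contribution to the sum of squared residuals from a single example:

$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left(r^t - \mathbf{w}^T \mathbf{x}^t\right)^2$$

Written out by component:

$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - \sum_{k=1}^{d} w_k x_k^t - w_0\right)^2$$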

wj is the jth component of weight vector w connecting attribute vector x to scalar output y

Et depends on wj through yt = wTxt; hence use chain rule

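$$\frac{\partial E^t}{\partial w_j} = \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial w_j} = -\left(r^t - y^t\right) x_j^t$$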

The weight-update formula is called “stochastic gradient descent”:

$$\Delta w_j^t = -\eta\,\frac{\partial E^t}{\partial w_j} = \eta\left(r^t - y^t\right) x_j^t$$

The proportionality constant η is called the “learning rate”.

Since Δw_j is proportional to x_j, all attributes should be roughly the same size.

Normalizing the attributes to achieve this may be helpful.
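The online update rule for perceptron regression can be sketched as follows. The data set (points on the line y = 2x + 1), the learning rate, and the number of epochs are assumptions made for illustration; as noted earlier, the normal equations are the better tool for this particular problem.

```python
import random


def sgd_linear_regression(data, eta=0.5, epochs=200, seed=0):
    """Online training of y = w^T x with squared-residual error.

    data is a list of (x, r) pairs, where x is a tuple of attributes.
    Returns the weight list [w0, w1, ..., wd].
    """
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * (d + 1)  # w[0] is the bias weight w0
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)  # examples are seen one by one in random order
        for x, r in examples:
            xa = [1.0] + list(x)  # augmented input with x0 = +1
            y = sum(wj * xj for wj, xj in zip(w, xa))
            for j in range(d + 1):
                w[j] += eta * (r - y) * xa[j]  # delta w_j = eta (r - y) x_j
    return w


# Noise-free points on the line y = 2x + 1 (illustrative data)
line_data = [((x,), 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
w = sgd_linear_regression(line_data)
print(w)
```

With noise-free data the weights approach w0 = 1 and w1 = 2; with noisy data the online updates would fluctuate around the least-squares solution instead.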

Momentum parameter α:

$$\Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$$


How do learning rate and momentum affect training?

As learning rate → 1, back-propagation becomes deterministic.
Each example determines a set of weights optimal for itself only.

As learning rate → 0, the probability of trapping in a local minimum → 1, because the step size of each weight change is so small.

A large momentum parameter reduces trapping at small learning rates but increases the likelihood that a single outlier will dramatically affect weight optimization.

Opinions differ on best choice of learning rate and momentum

Keep part of previous update
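One way to read the momentum rule is as a small change to the plain gradient step: the previous update is stored and a fraction of it is added to the new one. The sketch below assumes the gradient is supplied by the caller; the name `momentum_step` and the default values of `eta` and `alpha` are illustrative.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Apply delta_w = -eta * grad + alpha * prev_delta to every weight.

    Returns the new weights and the update, which the caller passes
    back in as prev_delta on the next step.
    """
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


# Minimize E(w) = 0.5 * w^2, whose gradient is w (toy example)
w, delta = [1.0], [0.0]
for _ in range(5):
    w, delta = momentum_step(w, grad=list(w), prev_delta=delta)
    print(w, delta)
```

Setting `alpha=0` recovers the update without momentum.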

Multivariate linear dichotomizer

$$X = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

r^t ∈ {0, 1}
y = sigmoid(w^T x)

This example is equivalent to logistic regression but with r^t ∈ {0, 1} instead of {-1, 1}.

In-sample error: cross entropy

Weight vector w connects the input to the output, which is transformed by the sigmoid function.


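Assume that r^t is drawn from a Bernoulli distribution with parameter p0, the probability that r^t = 1:

p(r) = p0^r (1 - p0)^(1 - r), so that p(r = 1) = p0 and p(r = 0) = 1 - p0

Let y = sigmoid(w^T x) be the MLE of p0:

p(r) = y^r (1 - y)^(1 - r)

$$X = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \qquad r^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \text{ is positive} \\ 0 & \text{if } \mathbf{x}^t \text{ is negative} \end{cases}$$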

Weight update for optimization by back propagation


Assume that r^t is drawn from a Bernoulli distribution with parameter p0, the probability that r^t = 1, and that y = sigmoid(w^T x) is the MLE of p0.

Let L(w|X) be the log-likelihood of weight vector w given training set X:

$$L(\mathbf{w} \mid X) = \log \prod_t \left(y^t\right)^{r^t}\left(1 - y^t\right)^{1 - r^t}$$

$$L(\mathbf{w} \mid X) = \sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]$$

y^t = sigmoid(w^T x^t) is between 0 and 1; hence L(w|X) < 0.
Therefore, to maximize L(w|X), minimize the cross entropy

$$E(\mathbf{w} \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]$$

Like the sum of squared residuals, cross entropy depends on the weights w through y^t.

Unlike fitting a plane to data, where y^t = w^T x^t, here y^t = sigmoid(w^T x^t), which puts an additional factor in the chain rule when deriving the back-propagation formula.

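$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial(\mathbf{w}^T\mathbf{x})}\,\frac{\partial(\mathbf{w}^T\mathbf{x})}{\partial w_j}$$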

By much tedious algebra, you can show that result has the same form as for online training in regression

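$$\Delta w_j^t = -\eta\,\frac{\partial E^t}{\partial w_j} = \eta\left(r^t - y^t\right) x_j^t, \qquad y^t = \text{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$$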
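Because the update has the same form as in regression, online training of the dichotomizer differs from perceptron regression only in the sigmoid applied to the output. The sketch below trains on the Boolean AND data from earlier in these notes; the learning rate, epoch count, and initial weight range are assumed values.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_logistic(data, eta=0.5, epochs=2000, seed=0):
    """Online cross-entropy training with delta w_j = eta (r - y) x_j."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(d + 1)]  # small random start
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)
        for x, r in examples:
            xa = [1.0] + list(x)  # augmented input with x0 = +1
            y = sigmoid(sum(wj * xj for wj, xj in zip(w, xa)))
            for j in range(d + 1):
                w[j] += eta * (r - y) * xa[j]
    return w


def predict(w, x):
    """Assign class 1 if sigmoid(w^T x) > 0.5."""
    xa = [1.0] + list(x)
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xa))) > 0.5 else 0


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_logistic(and_data)
print(w, [predict(w, x) for x, _ in and_data])
```

The learned weights will generally differ from the hand-derived w0 = -1.5, w1 = w2 = 1, since many linear discriminants separate the AND data.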

Result can be generalized to multi-class classification

Perceptron classification with k >1 classes


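$$y_i^t = \frac{\exp(\mathbf{w}_i^T \mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T \mathbf{x}^t)}$$

$$E^t(\mathbf{W} \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$$

$$\Delta w_{ij}^t = \eta\left(r_i^t - y_i^t\right) x_j^t$$

Weight vector w_i (a column of weight matrix W) connects input vector x to output node y_i.

w_ij is the jth component of w_i.

Assign each example to the class with the largest y_i.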
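The multi-class update can be sketched in the same style. Here W is stored as a list of K weight lists (one per class, bias weight first) and r is a one-hot target; these storage choices and the function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Normalized exponentials; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_update(W, x, r, eta=0.1):
    """One online step: w_ij += eta * (r_i - y_i) * x_j, in place.

    Returns the outputs y computed before the weights were changed.
    """
    xa = [1.0] + list(x)  # augmented input with x0 = +1
    y = softmax([sum(wij * xj for wij, xj in zip(wi, xa)) for wi in W])
    for i, wi in enumerate(W):
        for j in range(len(wi)):
            wi[j] += eta * (r[i] - y[i]) * xa[j]
    return y


def assign_class(W, x):
    """Index of the class with the largest output."""
    xa = [1.0] + list(x)
    scores = [sum(wij * xj for wij, xj in zip(wi, xa)) for wi in W]
    return scores.index(max(scores))


W = [[0.0, 0.0] for _ in range(3)]  # K = 3 classes, d = 1 attribute
y = softmax_update(W, x=[1.0], r=[1, 0, 0], eta=0.3)
print(y, W, assign_class(W, [1.0]))
```

Starting from zero weights, every class has output 1/3, so the first step raises the weights of the target class and lowers the others.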

Review: Multilayer Perceptrons (MLP)


Layers between input and output called “hidden”

Both number of hidden layers and number of hidden units in a layer are variables in the structure of an MLP

Less complex structures improve generalization

More than one hidden layer is seldom necessary



Review: MLP solution of XOR classification problem


Nodes in hidden layer are linearly separable transforms of inputs


Data will determine the transform that minimizes in-sample error

Weights that connect hidden layer to output node define a linear discriminant in feature space, vTz=0.

Transform the output by the sigmoid function.
If S > 0.5, assign class = 1.


Solution of XOR problem by nonlinear regression


Do not transform the output node.

Change labels to ±1.
Fit v^T z to the class labels by nonlinear regression.
If v^T z > 0, assign class 1.


Data will determine the best level of nonlinearity

Let in-sample error be sum of squared residuals.

Use back propagation to optimize weights

Derive formulas for updating weights in both layers

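Forward:

$$z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t)$$

$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0$$

Backward:

$$E^t(\mathbf{W}, \mathbf{v} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2$$

$$\Delta v_h = -\eta\,\frac{\partial E^t}{\partial v_h} = -\eta\,\frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial v_h} = \eta\left(r^t - y^t\right) z_h^t$$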

Review: weight update rules for nonlinear regression

Each pass through all training data called “epoch”

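$$\Delta w_{hj} = -\eta\,\frac{\partial E^t}{\partial w_{hj}} = \eta\left(r^t - y^t\right) v_h\, z_h^t\left(1 - z_h^t\right) x_j^t$$

which follows from the chain rule

$$\frac{\partial E^t}{\partial w_{hj}} = \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial(\mathbf{w}_h^T\mathbf{x}^t)}\,\frac{\partial(\mathbf{w}_h^T\mathbf{x}^t)}{\partial w_{hj}}$$

Both updates are evaluated given the current weights w_h and v.

Transform the hidden layer by sigmoid.
H weight vectors connect the input to the hidden layer. 1 weight vector connects the hidden layer to the output.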
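The two update rules can be combined into a single online step for a network with one hidden layer and a linear output. The following is a sketch under stated assumptions, not the code used in the course: weights are stored as plain lists (bias weight first), and the example weights and input are arbitrary illustrative values.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(W, v, x):
    """W[h] = [w_h0, w_h1, ..., w_hd]; v = [v_0, v_1, ..., v_H]."""
    xa = [1.0] + list(x)  # augmented input with x0 = +1
    z = [sigmoid(sum(w * xj for w, xj in zip(wh, xa))) for wh in W]
    y = v[0] + sum(vh * zh for vh, zh in zip(v[1:], z))  # linear output
    return xa, z, y


def example_error(W, v, x, r):
    """E^t = 0.5 * (r - y)^2 for a single example."""
    return 0.5 * (r - forward(W, v, x)[2]) ** 2


def online_update(W, v, x, r, eta=0.1):
    """Return updated copies of (W, v) after one training example."""
    xa, z, y = forward(W, v, x)
    err = r - y
    # Hidden-layer changes use the current v, before v itself is changed.
    dW = [[eta * err * v[h + 1] * z[h] * (1.0 - z[h]) * xj for xj in xa]
          for h in range(len(W))]
    dv = [eta * err] + [eta * err * zh for zh in z]
    new_W = [[w + d for w, d in zip(wh, dwh)] for wh, dwh in zip(W, dW)]
    new_v = [vh + d for vh, d in zip(v, dv)]
    return new_W, new_v


W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]  # H = 2 hidden units, d = 2 inputs
v = [0.05, 0.7, -0.6]
x, r = (1.0, 0.0), 1.0
new_W, new_v = online_update(W, v, x, r, eta=0.1)
print(example_error(W, v, x, r), example_error(new_W, new_v, x, r))
```

Repeating `online_update` over all training examples, in random order, for many epochs gives the training loop; a finite-difference comparison against `example_error` is a simple way to check that the derivative formulas were coded correctly.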


Consider a momentum parameter


Calculate changes to the w_h vectors before changing v.
Learning rates can be different for Δw_hj and Δv_h.

$$\Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$$

It may be helpful to normalize attributes to [0,1]:

$$nx_k^t = \frac{x_k^t - x_k^{\min}}{x_k^{\max} - x_k^{\min}}$$

Other normalization methods can be used.
Transforms in the hidden layer do not require normalization.
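Min-max normalization of each attribute can be sketched as below. The handling of a constant attribute (mapped to 0.0) is an assumption added to avoid division by zero; the slides do not address that case.

```python
def min_max_normalize(rows):
    """Scale every attribute (column) of rows to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized


raw = [[1, 10], [2, 30], [3, 20]]
print(min_max_normalize(raw))
```

In practice the minimum and maximum would be taken from the training set and reused for validation and test examples.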


Above h ~ 15, Eval is increasing while Ein is flat.
No significant decrease in Eval or Ein after h = 3.

Use Eval to determine number of nodes in the hidden layer

Favor small h to avoid overfitting


Beyond the elbow, Eval ~ Ein for ~ 200 e.
Above e ~ 600, there is evidence for overfitting.
Expect overfitting at e ~ 10 x elbow.
Set e_stop when you see evidence of overfitting.

Stop early to avoid overfitting


Use a validation set to detect overfitting

Validation error calculated after each epoch
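One common way to turn "stop when you see evidence of overfitting" into a rule is to stop once the validation error has failed to improve for a fixed number of epochs. The `patience` parameter below is an assumption introduced for this sketch, not something specified in the slides, and the error values are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch (1-based) with the lowest validation error seen
    before the error fails to improve for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors, start=1):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation errors, one per epoch
val_errors = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.40, 0.30]
print(early_stopping_epoch(val_errors, patience=3))
```

A larger `patience` tolerates longer plateaus at the cost of more training epochs.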

x^t = U(-0.5, 0.5)
y^t = sin(6 x^t) + N(0, 0.1)

The fit looks better at e = 300. Why stop at e = 200?


Review: Solution of XOR problem by nonlinear regression

A possible structure of MLP to solve the XOR problem.
Should we consider a structure with 2 nodes in the hidden layer?

Edom’s code for solution of XOR by non-linear regression with 3 nodes in the hidden layer

Note: no momentum parameter and no binning of y(t) to predict class label

Edom’s solution of XOR problem by nonlinear regression

Fit is adequate to assign class labels


No summation; K=1

In classification by regression, bin y before calculating the difference from the target.
This allows the number misclassified to be used as the error.

For regression, K = 1.
Bin y to get a class assignment.

Binning creates flat regions
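Binning can be sketched as assigning each regression output to the nearest class label. Nearest-label binning is an assumption made here for illustration (other bin edges are possible); the labels 1, 2, and 6 are the ones kept in Assignment 6.

```python
def bin_to_label(y, labels=(1, 2, 6)):
    """Assign the regression output y to the nearest class label."""
    return min(labels, key=lambda c: abs(y - c))


def misclassified(outputs, targets, labels=(1, 2, 6)):
    """Number of examples whose binned output differs from the target."""
    return sum(1 for y, r in zip(outputs, targets)
               if bin_to_label(y, labels) != r)


outputs = [0.9, 2.2, 5.0, 1.6]
targets = [1, 2, 6, 1]
print([bin_to_label(y) for y in outputs], misclassified(outputs, targets))
```

Because small changes in y usually leave the binned label unchanged, the misclassification count is flat over wide ranges of the weights, which is the source of the flat regions noted above.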

From Heng’s HW6


Assignment 6 due 11-13-15

Use dataset randomized shortened glassdata.csv to develop a classifier for beer-bottle glass by ANN non-linear regression. Keep the class labels as 1, 2, and 6.

With a validation set of 100 examples and a training set of 74 examples, select the best number of hidden nodes in a single hidden layer and the best number of epochs for weight refinement.

Use all the data to optimize weights at the selected structure and training time. Calculate confusion matrix and accuracy of prediction. Use 10-fold cross validation to estimate the accuracy of a test set.
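The bookkeeping for the confusion matrix, accuracy, and 10-fold split can be sketched as follows. This is only a scaffold: it does not train a network, the function names are illustrative, and the fold assignment (every k-th example) assumes the data set has already been randomized.

```python
def confusion_matrix(targets, predictions, labels):
    """matrix[i][j] counts examples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for r, p in zip(targets, predictions):
        matrix[index[r]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k folds over n examples."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


cm = confusion_matrix([1, 2, 6, 6], [1, 6, 6, 6], labels=(1, 2, 6))
print(cm, accuracy(cm))
folds = list(k_fold_indices(174, k=10))  # 174 = 100 + 74 examples
print([len(test) for _, test in folds])
```

For each fold, the network would be trained on the `train` indices at the selected structure and stopping epoch, and the accuracies on the `test` indices averaged to estimate test-set accuracy.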

When should you start with random [-0.01,0.01] weight?

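Two-Class Discrimination with one hidden layer

$$y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v} \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]$$

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t\left(1 - z_h^t\right) x_j^t$$

Sigmoid output models P(C1|x).
Minimize cross entropy in batch update.
Weight-update formulas are the same as for regression.
Assign examples to C1 if output > 0.5.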

K>2 Classes: one hidden layer

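$$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$

$$y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} = P(C_i \mid \mathbf{x}^t)$$

$$E(\mathbf{W}, \mathbf{V} \mid X) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t\left(1 - z_h^t\right) x_j^t$$

Note the sum over i in the update for w_hj.

Minimize in-sample cross entropy by batch update.
Assign examples to the class with the largest output.

v_i is the weight vector connecting the nodes of the hidden layer to the output of class i.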