Fundamentals of Artificial Neural Networks, Chapter 7 in amlbook.com

Page 1: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Fundamentals of Artificial Neural Networks, Chapter 7 in amlbook.com

Page 2: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Review: Neuron analogy to linear models. The dot product wTx combines the attributes and a bias (x0 = 1) into a scalar signal s, which is used to form a hypothesis about x.

This analogy is called a “perceptron”.

[Diagram: perceptron combining inputs x with weights w into signal s, followed by sigmoid(s)]

Page 3: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

What can a perceptron do in 1D? Fit a line to data: y = wx + w0. Use y = wx + w0 as a discriminant.

[Diagrams: 1D perceptron with bias input x0 = +1 and weights w, w0, producing output y for regression and signal s for classification]

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

S = sigmoid(y). If y > 0 then S > 0.5: choose green; otherwise choose red.


Page 4: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

y = wTx = Σj=1..d wj xj + w0, where w = [w0, w1, ..., wd]T and x = [1, x1, ..., xd]T


x is the input vector, w is the weight vector, y = wTx.

A perceptron can do the same thing in d dimensions: fit a plane to data and use the plane as a discriminant for binary classification. For regression the output is y; for classification the output is sigmoid(y).
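A minimal sketch of this computation (NumPy assumed; the weight and attribute values below are placeholders, not from the slides):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def perceptron(w, x):
    """w and x include the bias term: x[0] = 1, w[0] = w0."""
    s = np.dot(w, x)           # scalar signal s = wTx
    return s, sigmoid(s)       # regression output y = s, classification output sigmoid(s)

w = np.array([0.5, -1.0, 2.0])    # [w0, w1, w2], placeholder values
x = np.array([1.0, 0.3, 0.7])     # [x0 = 1, x1, x2]
y, p = perceptron(w, x)
print(y, p)                       # p > 0.5 exactly when y > 0
```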

Page 5: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Boolean AND: linearly separable 2D binary classification problem


Truth table:

x1+x2=1.5 is an acceptable linear discriminant

Page 6: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


linear discriminant wTx = 0

x1  x2  r   required            choice
0   0   0   w0 < 0              w0 = -1.5
0   1   0   w2 + w0 < 0         w1 = 1
1   0   0   w1 + w0 < 0         w2 = 1
1   1   1   w1 + w2 + w0 > 0

y = wTx; wTx < 0 → r = 0, wTx > 0 → r = 1

Derive the linear discriminant x1 + x2 - 1.5 = 0
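A quick check of these weights against the AND truth table (a sketch; NumPy assumed):

```python
import numpy as np

w = np.array([-1.5, 1.0, 1.0])                      # [w0, w1, w2] from the table above
for x1, x2, r in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    y = w @ np.array([1.0, x1, x2])                 # wTx with bias input x0 = 1
    assert (y > 0) == (r == 1)                      # wTx > 0 exactly when r = 1
    print(x1, x2, round(y, 2), int(y > 0))
```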

Page 7: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Other linear discriminants are possible. We have not yet specified an optimization condition.

Truth table:

wTx=0

Boolean AND: linearly separable 2D binary classification problem

Page 8: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Data table and graphical representation:

Application of a perceptron is not possible in attribute space. Solution: transform to a linearly separable feature space.

Boolean XOR: linearly inseparable 2D binary classification problem

Page 9: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

XOR in Gaussian Feature space

This transformation puts examples (0,1) and (1,0) at the same point in feature space. A perceptron could then be applied to find a linear discriminant.

f1 = exp(-|X - [1,1]|^2)
f2 = exp(-|X - [0,0]|^2)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678
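A short sketch of this feature transform (NumPy assumed); it reproduces the table above up to rounding:

```python
import numpy as np

def gaussian_features(X):
    """f1 = exp(-|X - [1,1]|^2), f2 = exp(-|X - [0,0]|^2) for each row of X."""
    f1 = np.exp(-np.sum((X - np.array([1.0, 1.0])) ** 2, axis=1))
    f2 = np.exp(-np.sum((X - np.array([0.0, 0.0])) ** 2, axis=1))
    return np.column_stack([f1, f2])

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)
print(np.round(gaussian_features(X), 4))   # rows: (1,1), (0,1), (0,0), (1,0)
```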

Page 10: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Review: XOR in Gaussian feature space

This transformation puts examples (0,1) and (1,0) at the same point in feature space. A perceptron could then be applied to find a linear discriminant.

f1 = exp(-|X - [1,1]|^2)
f2 = exp(-|X - [0,0]|^2)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678

[Plot: XOR data in feature space, classes r = 1 and r = 0]

Page 11: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Add a “hidden” layer to the perceptron

[Network diagram: perceptron with a hidden layer of two sigmoid units and a sigmoid output; the value -0.78 shown is the output bias]

Derive 2 weight vectors connecting input to hidden layer that define linearly separable features.

Derive 1 weight vector connecting the hidden layer to the output that defines a linear discriminant separating the features.


Page 12: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Consider hidden units zh as features. Choose weight vectors wh so that in feature space (0,0) and (1,1) are close to the same point.

zh = 1 / (1 + exp(-whTx)) = sigmoid(whTx)

[Plots: attribute space and ideal feature space, axes z1 and z2]

Page 13: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

data table in feature space

r  z1  z2
0  ~0  ~0
1  ~0  ~1
1  ~1  ~0
0  ~0  ~0

If whTx << 0 → zh ~ 0

If whTx >> 0 → zh ~ 1

zh = 1 / (1 + exp(-whTx)) = sigmoid(whTx)

Page 14: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

x1  x2  z1  w1Tx  required            choice
0   0   ~0  < 0   w0 < 0              w0 = -0.5
0   1   ~0  < 0   w2 + w0 < 0         w2 = -1
1   0   ~1  > 0   w1 + w0 > 0         w1 = 1
1   1   ~0  < 0   w1 + w2 + w0 < 0

Find weight vectors for linearly separable features

x1  x2  z2  w2Tx  required            choice
0   0   ~0  < 0   w0 < 0              w0 = -0.5
0   1   ~1  > 0   w2 + w0 > 0         w2 = 1
1   0   ~0  < 0   w1 + w0 < 0         w1 = -1
1   1   ~0  < 0   w1 + w2 + w0 < 0

Page 15: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Transformation of input by hidden layer

x1  x2  arg1  z1    arg2  z2    r
0   0   -0.5  0.38  -0.5  0.38  0
0   1   -1.5  0.18   0.5  0.62  1
1   0    0.5  0.62  -1.5  0.18  1
1   1   -0.5  0.38  -0.5  0.38  0

z1 = sigmoid(x1 - x2 - 0.5); z2 = sigmoid(-x1 + x2 - 0.5)

[Plot: XOR transformed by the hidden layer, axes z1 and z2]
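A minimal sketch that reproduces this table (NumPy assumed; the hidden-layer weights are the ones derived on page 14):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# rows: weights for z1 = sigmoid(x1 - x2 - 0.5) and z2 = sigmoid(-x1 + x2 - 0.5)
W = np.array([[-0.5,  1.0, -1.0],
              [-0.5, -1.0,  1.0]])

for x1, x2, r in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]:
    z = sigmoid(W @ np.array([1.0, x1, x2]))        # bias input x0 = 1
    print(x1, x2, np.round(z, 2), r)                # matches the 0.38 / 0.18 / 0.62 entries
```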

Page 16: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Find weights connecting the hidden layer to the output that define a linear discriminant in feature space. Denote the weight vector by v. The output is transformed by y = sigmoid(vTz).

z1    z2    r  vTz   required                  choice
0.38  0.38  0  < 0   .38v1 + .38v2 + v0 < 0    v0 = -0.78
0.18  0.62  1  > 0   .18v1 + .62v2 + v0 > 0    v2 = 1
0.62  0.18  1  > 0   .62v1 + .18v2 + v0 > 0    v1 = 1
0.38  0.38  0  < 0   .38v1 + .38v2 + v0 < 0

[Plot: feature space z1 vs. z2 with the linear discriminant vTz = 0]
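A quick check of this discriminant in feature space (a sketch; NumPy assumed):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

v = np.array([-0.78, 1.0, 1.0])                     # [v0, v1, v2] from the table above
for z1, z2, r in [(0.38, 0.38, 0), (0.18, 0.62, 1), (0.62, 0.18, 1), (0.38, 0.38, 0)]:
    y = sigmoid(v @ np.array([1.0, z1, z2]))        # y = sigmoid(vTz)
    print(z1, z2, round(y, 3), int(y > 0.5), r)     # y > 0.5 exactly for r = 1
```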

Page 17: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


This is not the only solution to the XOR classification problem.

[Network diagram: MLP for XOR with two sigmoid hidden units and a sigmoid output; output bias -0.78]

Define an optimization condition that enables learning optimum weights for both layers


Data will determine the best transform of the input

Page 18: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Training a neural network by back-propagation: initialize weights randomly.

Need a rule that relates changes in weights to the difference between output and target.

Page 19: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

If the expression for in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. < 3 hidden layers), then an analytical expression for the rate of change of error with change in weights can be derived.

Page 20: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

This example is instructive but not relevant. Normal equations are a better way to find optimum weights in this case.

Simplest case: multivariate linear regression. In-sample error is the sum of squared residuals and there are no hidden layers.

Page 21: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Approaches to Training


• Online: weights updated based on training-set examples seen one by one in random order

• Batch: weights updated based on whole training set after summing deviations from individual examples


Weight-update formulas are simpler for the “online” approach. Formulas for “batch” can be derived from the “online” formula.

Page 22: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Weight-update rule: Perceptron regression


Contribution to the sum of squared residuals from a single example:

Et(w | xt, rt) = (1/2) (rt - yt)^2 = (1/2) (rt - Σk=1..d wk xkt - w0)^2

wj is the jth component of weight vector w connecting attribute vector x to scalar output y

Et depends on wj through yt = wTxt; hence use chain rule

∂Et/∂wj = (∂Et/∂yt) (∂yt/∂wj) = -(rt - yt) xjt
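A sketch of this update rule as an online training loop (NumPy assumed; the learning rate and epoch count are illustrative):

```python
import numpy as np

def sgd_linear_regression(X, r, eta=0.1, epochs=50, seed=0):
    """Online gradient descent for y = wTx with squared error.
    X is N x (d+1) with a leading column of 1s for the bias; r is the target vector."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.01, 0.01, X.shape[1])        # small random initial weights
    for _ in range(epochs):
        for t in rng.permutation(len(r)):           # examples one by one, random order
            y = w @ X[t]
            w += eta * (r[t] - y) * X[t]            # delta w_j = eta (r^t - y^t) x_j^t
    return w
```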

Page 23: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

The weight-update formula is called “stochastic gradient descent”.

The proportionality constant η is called the “learning rate”.

Since Δwj is proportional to xj, all attributes should be roughly the same size.

Normalization to achieve this may be helpful.

Δwjt = -η ∂Et/∂wj = η (rt - yt) xjt

Page 24: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Momentum parameter: Δwi^t = -η ∂Et/∂wi + α Δwi^(t-1)


How do learning rate and momentum affect training?

As learning rate → 1, back-propagation becomes deterministic. Each example determines a set of weights optimal for itself only.

As learning rate → 0, the probability of trapping in a local minimum → 1 because the step size of the weight change is so small.

A large momentum parameter reduces trapping at small learning rates but increases the likelihood that a single outlier will dramatically affect weight optimization.

Opinions differ on best choice of learning rate and momentum

The momentum term keeps part of the previous update.
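A minimal sketch of the momentum update (names and constants are illustrative):

```python
import numpy as np

def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """One update: keep a fraction alpha of the previous weight change.
    grad is dE/dw for the current example."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

# inside a training loop, carrying dw between steps:
# w, dw = momentum_step(w, grad, dw)
```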

Page 25: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Multivariate linear dichotomizer. Training set X = {xt, rt}, t = 1, ..., N.

This example is equivalent to logistic regression but with rt ∈ {0, 1} instead of {-1, +1}.

rt ∈ {0, 1}; y = sigmoid(wTx)

In-sample error: cross entropy

Weight vector w connects the input to the output, which is transformed by the sigmoid function.

Page 26: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Assume that rt is drawn from a Bernoulli distribution with parameter p0, the probability that rt = 1.

p(r) = p0^r (1 - p0)^(1-r), so p(r = 1) = p0 and p(r = 0) = 1 - p0

Let y = sigmoid(wTx) be the MLE of p0.

p(r) = y^r (1 - y)^(1-r)

X = {xt, rt}, t = 1, ..., N; rt = 1 if xt is positive, rt = 0 if xt is negative

Weight update for optimization by back propagation

Page 27: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Assume that rt is drawn from a Bernoulli distribution with parameter p0, the probability that rt = 1, and let y = sigmoid(wTx) be the MLE of p0.

Let L(w|X) be the log-likelihood of weight vector w given training set X.

X = {xt, rt}, t = 1, ..., N; rt = 1 if xt is positive, rt = 0 if xt is negative

L(w|X) = log Πt (yt)^rt (1 - yt)^(1-rt) = Σt [ rt log yt + (1 - rt) log(1 - yt) ]

Weight update for optimization by back propagation

Page 28: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

L(w|X) = Σt [ rt log yt + (1 - rt) log(1 - yt) ]

yt = sigmoid(wTxt) lies between 0 and 1; hence L(w|X) < 0. Therefore, to maximize L(w|X), minimize

E(w|X) = -Σt [ rt log yt + (1 - rt) log(1 - yt) ]

Like the sum of squared residuals, cross entropy depends on weights w through yt

Unlike fitting a plane to data, where yt = wTxt, here yt = sigmoid(wTxt), which puts an additional factor in the chain rule when deriving the back-propagation formula.
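A sketch of the resulting single-example update (NumPy assumed); the extra sigmoid factor cancels, leaving the same form as the regression update:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_sgd_step(w, x, r, eta=0.1):
    """Single-example update for cross entropy with y = sigmoid(wTx)."""
    y = sigmoid(w @ x)
    return w + eta * (r - y) * x        # delta w_j = eta (r^t - y^t) x_j^t
```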

Page 29: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

E(w|X) = -Σt [ rt log yt + (1 - rt) log(1 - yt) ]

∂E/∂wj = (∂E/∂y) (∂y/∂(wTx)) (∂(wTx)/∂wj)

By much tedious algebra, you can show that the result has the same form as for online training in regression:

Δwjt = -η ∂Et/∂wj = η (rt - yt) xjt

yt = sigmoid(wTxt)

Result can be generalized to multi-class classification

Page 30: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Perceptron classification with K > 1 classes


yit = exp(wiTxt) / Σk exp(wkTxt)

E({wi}i | X) = -Σt Σi rit log yit

Δwij = η Σt (rit - yit) xjt

Weight vector wi (a column of the weight matrix W) connects input vector x to output node yi.

wij is the jth component of wi

Assign the example to the class with the largest yi.
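A batch-form sketch of this K-class update (NumPy assumed; subtracting the row maximum before exponentiating is a standard numerical safeguard, not shown on the slide):

```python
import numpy as np

def softmax_perceptron_epoch(W, X, R, eta=0.1):
    """One batch update for K-class perceptron classification.
    W: K x (d+1) weights; X: N x (d+1) with bias column; R: N x K one-hot labels."""
    O = X @ W.T                                       # o_i = wiTx for each class
    Y = np.exp(O - O.max(axis=1, keepdims=True))
    Y /= Y.sum(axis=1, keepdims=True)                 # y_i = softmax over classes
    W = W + eta * (R - Y).T @ X                       # delta w_ij = eta sum_t (r_i - y_i) x_j
    return W, Y.argmax(axis=1)                        # updated weights, class assignments
```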

Page 31: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Review: Multilayer Perceptrons (MLP)


Layers between input and output are called “hidden”.

Both number of hidden layers and number of hidden units in a layer are variables in the structure of an MLP

Less complex structures improve generalization

More than one hidden layer is seldom necessary


Page 32: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Review: MLP solution of XOR classification problem

[Network diagram: MLP for XOR with two sigmoid hidden units and a sigmoid output; output bias -0.78]

Nodes in hidden layer are linearly separable transforms of inputs


Data will determine the transform that minimizes in-sample error

Weights that connect hidden layer to output node define a linear discriminant in feature space, vTz=0.

Transform the output by the sigmoid function. If S > 0.5, assign class = 1.

Page 33: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Solution of XOR problem by nonlinear regression

[Network diagram: MLP for XOR with sigmoid hidden units and a linear output node]

Do not transform the output node.

Change labels to ±1. Fit vTz to the class labels by nonlinear regression. If vTz > 0, assign class 1.


Data will determine the best level of nonlinearity

Let in-sample error be sum of squared residuals.

Use back propagation to optimize weights

Derive formulas for updating weights in both layers

Page 34: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Forward:
zh = sigmoid(whTx)
yt = Σh=1..H vh zht + v0

Backward:
E(W, v | X) = (1/2) Σt (rt - yt)^2
Δvh = -η ∂E/∂vh = η Σt (rt - yt) zht

Review: weight update rules for nonlinear regression

Each pass through all of the training data is called an “epoch”.

Δwhj = -η ∂E/∂whj = -η Σt (∂E/∂yt) (∂yt/∂zht) (∂zht/∂(whTxt)) (∂(whTxt)/∂whj) = η Σt (rt - yt) vh zht (1 - zht) xjt

Given weights wh and v:

Transform the hidden layer by the sigmoid. H weight vectors connect the input to the hidden layer; 1 weight vector connects the hidden layer to the output.
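A compact batch-form sketch of these forward and backward rules (NumPy assumed; variable names are illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mlp_regression_epoch(W, v, X, r, eta=0.1):
    """One batch epoch for an MLP with one hidden layer and a linear output.
    W: H x (d+1) hidden weights; v: length H+1 output weights;
    X: N x (d+1) with a bias column of 1s; r: target vector."""
    Z = sigmoid(X @ W.T)                              # forward: z_h = sigmoid(whTx)
    Z1 = np.column_stack([np.ones(len(X)), Z])        # prepend z_0 = 1
    y = Z1 @ v                                        # y = sum_h v_h z_h + v_0
    err = r - y
    dW = eta * ((np.outer(err, v[1:]) * Z * (1 - Z)).T @ X)   # delta w_hj
    dv = eta * Z1.T @ err                                      # delta v_h
    return W + dW, v + dv                             # dW computed before v is changed
```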

Page 35: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Forward:
zh = sigmoid(whTx)
yt = Σh=1..H vh zht + v0

Backward:
E(W, v | X) = (1/2) Σt (rt - yt)^2
Δvh = -η ∂E/∂vh = η Σt (rt - yt) zht

Review: weight update rules for nonlinear regression

Consider a momentum parameter

Δwhj = -η ∂E/∂whj = η Σt (rt - yt) vh zht (1 - zht) xjt

Given weights wh and v: transform the hidden layer by the sigmoid. H weight vectors connect the input to the hidden layer; 1 weight vector connects the hidden layer to the output.

Calculate the changes to the wh vectors before changing v. Learning rates can be different for Δwhj and Δvh.

Δwi^t = -η ∂Et/∂wi + α Δwi^(t-1)

Page 36: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Forward:
zh = sigmoid(whTx)
yt = Σh=1..H vh zht + v0

Backward:
E(W, v | X) = (1/2) Σt (rt - yt)^2
Δvh = -η ∂E/∂vh = η Σt (rt - yt) zht

Review: weight update rules for nonlinear regression

normalized to [0,1]

Δwhj = -η ∂E/∂whj = η Σt (rt - yt) vh zht (1 - zht) xjt

Given weights wh and v: transform the hidden layer by the sigmoid. H weight vectors connect the input to the hidden layer; 1 weight vector connects the hidden layer to the output.

It may be helpful to normalize attributes

nxkt = (xkt - mink) / (maxk - mink), where mink and maxk are the minimum and maximum of attribute k over the training set

Other normalization methods can be used. Transforms in the hidden layer do not require normalization.
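A sketch of this min-max normalization (NumPy assumed):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each attribute (column) of X to [0, 1]: nx = (x - min) / (max - min)."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)
```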

Page 37: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Above h ≈ 15, Eval increases while Ein is flat. There is no significant decrease in Eval or Ein after h = 3.

Use Eval to determine number of nodes in the hidden layer

Favor small h to avoid overfitting

Page 38: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Beyond the elbow, Eval ≈ Ein for about 200 epochs. Above e ≈ 600 there is evidence of overfitting. Expect overfitting around e ≈ 10 × the elbow. Set e_stop when you see evidence of overfitting.

[Plot: Ein and Eval vs. training epochs, with the elbow marked]

Stop early to avoid overfitting

Page 39: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Use a validation set to detect overfitting

Validation error calculated after each epoch

xt = U(-0.5, 0.5); yt = sin(6xt) + N(0, 0.1)

Stop early to avoid overfitting

The fit looks better at 300 epochs. Why stop at 200 epochs?
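A sketch of early stopping on this toy problem (NumPy assumed; the hidden-layer size, learning rate, and patience rule are illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def make_data(n):                                    # x ~ U(-0.5, 0.5), y = sin(6x) + noise
    x = rng.uniform(-0.5, 0.5, n)
    return x, np.sin(6 * x) + rng.normal(0, 0.1, n)

def predict(x, W, v):
    Z = sigmoid(np.column_stack([np.ones(len(x)), x]) @ W.T)
    return np.column_stack([np.ones(len(x)), Z]) @ v

x_tr, r_tr = make_data(100)
x_va, r_va = make_data(100)

H, eta = 10, 0.05                                    # illustrative settings
W = rng.uniform(-0.01, 0.01, (H, 2))                 # hidden weights [w0, w1] per unit
v = rng.uniform(-0.01, 0.01, H + 1)                  # output weights [v0, v1, ..., vH]

best_val, best_epoch = np.inf, 0
for epoch in range(1, 1001):
    for t in rng.permutation(len(x_tr)):             # online back-propagation
        xt = np.array([1.0, x_tr[t]])
        z = sigmoid(W @ xt)
        z1 = np.concatenate([[1.0], z])
        err = r_tr[t] - v @ z1
        dW = eta * err * np.outer(v[1:] * z * (1 - z), xt)
        v += eta * err * z1
        W += dW
    e_val = np.mean((r_va - predict(x_va, W, v)) ** 2)   # validation error each epoch
    if e_val < best_val:
        best_val, best_epoch = e_val, epoch
    elif epoch - best_epoch > 50:                    # stop when Eval has not improved
        print("stopping at epoch", epoch, "; best validation error at epoch", best_epoch)
        break
```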

Page 40: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Review: Solution of XOR problem by nonlinear regression

[Network diagram: a possible MLP structure for XOR with sigmoid hidden units]

A possible structure of an MLP to solve the XOR problem. Should we consider a structure with 2 nodes in the hidden layer?

Page 41: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Δwi^t = -η ∂Et/∂wi + α Δwi^(t-1)

Edom’s code for solution of XOR by non-linear regression with 3 nodes in the hidden layer

Note: no momentum parameter and no binning of y(t) to predict the class label.
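Edom's code itself is not reproduced in this transcript; below is a minimal sketch in the same spirit (3 hidden sigmoid nodes, squared error, online back-propagation, no momentum, no binning of y; the learning rate, epoch count, and ±1 labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)   # bias column first
r = np.array([-1.0, 1.0, 1.0, -1.0])          # XOR labels as +/-1 (page 33)

H, eta = 3, 0.3                               # 3 hidden nodes; illustrative learning rate
W = rng.uniform(-0.01, 0.01, (H, 3))          # random initial weights in [-0.01, 0.01]
v = rng.uniform(-0.01, 0.01, H + 1)

for epoch in range(10000):
    for t in rng.permutation(4):              # online updates, no momentum
        z = sigmoid(W @ X[t])
        z1 = np.concatenate([[1.0], z])
        err = r[t] - v @ z1                   # output y = vTz is not transformed
        dW = eta * err * np.outer(v[1:] * z * (1 - z), X[t])
        v += eta * err * z1
        W += dW

y = np.column_stack([np.ones(4), sigmoid(X @ W.T)]) @ v
print(np.round(y, 2))                         # assign class 1 where vTz > 0
```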

Page 42: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Edom’s solution of XOR problem by nonlinear regression

Fit is adequate to assign class labels

Page 43: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


No summation; K=1

In classification by regression, bin y before calculating the difference from the target. This allows the number misclassified to be used as the error.

For regression, K = 1. Bin y to get a class assignment.

Page 44: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Binning creates flat regions

From Heng’s HW6


Page 45: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

Assignment 6 due 11-13-15

Use the randomized, shortened dataset glassdata.csv to develop a classifier for beer-bottle glass by ANN nonlinear regression. Keep the class labels as 1, 2, and 6.

With a validation set of 100 examples and a training set of 74 examples, select the best number of hidden nodes in a single hidden layer and the best number of epochs for weight refinement.

Use all the data to optimize weights at the selected structure and training time. Calculate confusion matrix and accuracy of prediction. Use 10-fold cross validation to estimate the accuracy of a test set.

When should you start with random weights in [-0.01, 0.01]?

Page 46: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

zh = sigmoid(whTx)
yt = sigmoid(Σh=1..H vh zht + v0)

E(W, v | X) = -Σt [ rt log yt + (1 - rt) log(1 - yt) ]

Δvh = η Σt (rt - yt) zht
Δwhj = η Σt (rt - yt) vh zht (1 - zht) xjt


The sigmoid output models P(C1|x). Minimize cross entropy in a batch update. The weight-update formulas are the same as for regression. Assign examples to C1 if the output > 0.5.

Two-Class Discrimination with one hidden layer

Page 47: Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.

K > 2 classes: one hidden layer

oit = Σh=1..H vih zht + vi0
yit = exp(oit) / Σk exp(okt)   (models P(Ci | x))

E(W, V | X) = -Σt Σi rit log yit

Δvih = η Σt (rit - yit) zht
Δwhj = η Σt [ Σi (rit - yit) vih ] zht (1 - zht) xjt


Minimize in-sample cross entropy by batch update. Assign examples to the class with the largest output.

vi is the weight vector connecting the nodes of the hidden layer to the output for class i.

Note the sum over i in Δwhj.
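A batch-form sketch of these updates (NumPy assumed; variable names are illustrative, and subtracting the row maximum in the softmax is a standard numerical safeguard):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def kclass_mlp_epoch(W, V, X, R, eta=0.1):
    """One batch epoch for K-class classification with one hidden layer.
    W: H x (d+1) hidden weights; V: K x (H+1) output weights;
    X: N x (d+1) with bias column; R: N x K one-hot labels."""
    Z = sigmoid(X @ W.T)                              # z_h = sigmoid(whTx)
    Z1 = np.column_stack([np.ones(len(X)), Z])        # prepend z_0 = 1
    O = Z1 @ V.T                                      # o_i = sum_h v_ih z_h + v_i0
    Y = np.exp(O - O.max(axis=1, keepdims=True))
    Y /= Y.sum(axis=1, keepdims=True)                 # y_i models P(C_i | x)
    E = R - Y                                         # (r_i - y_i) for every example
    dV = eta * E.T @ Z1                               # delta v_ih = eta sum_t (r_i - y_i) z_h
    dW = eta * (((E @ V[:, 1:]) * Z * (1 - Z)).T @ X) # note the sum over i inside E @ V[:, 1:]
    return W + dW, V + dV
```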