Artificial Neural Networks: Supervised Models
Robert J. Marks II
University of Washington, Department of Electrical Engineering, CIA Laboratory, Box 352500, Seattle, Washington
[email protected]
Supervised Learning
Given: input (stimulus) / output (response) data
Objective: train a machine to simulate the input/output relationship
Types:
  Classification (discrete outputs)
  Regression (continuous outputs)
Training a Classifier
Figure: training pairs presented to the classifier; each face image is paired with the label "Marks" or "not Marks".
Recall from a Trained Classifier
Figure: a new face image is presented to the trained classifier, which responds "Marks".
Note: The test image does not appear in the training data.
Learning ≠ Memorization
Classifier In Feature Space, After Training
Figure: feature space after training, showing the learned representation boundary against the concept (truth). Legend: training data (Marks and not Marks) and test data (Marks).
Supervised Regression (Interpolation)
Output data is continuous rather than discrete
Example: Load Forecasting

Training (from historical data):
  Input: temperatures, current load, day of week, holiday(?), etc.
  Output: next day’s load

Test:
  Input: forecasted temperatures, current load, day of week, holiday(?), etc.
  Output: tomorrow’s load forecast
Properties of Good Classifiers and Regression Machines

Good accuracy outside of the training set
Explanation facility: generate rules after training
Fast training
Fast testing
Some Classifiers and Regression Machines
Classification and Regression Trees (CART)
Nearest Neighbor Look-Up
Neural Networks:
  Layered Perceptrons (MLPs)
  Recurrent Perceptrons
  Cascade Correlation Neural Networks
  Radial Basis Function Neural Networks
A Model of an Artificial Neuron
Figure: a neuron with input states s_1, ..., s_5 and interconnect weights w_1, ..., w_5 feeding a summing node and a squashing function.

sum = Σ_n w_n s_n
s = state = σ(sum)
σ(·) = squashing function
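Read as a computation, the figure above is just a weighted sum passed through the squashing function. A minimal sketch in Python; the state and weight values are illustrative, not from the slides:

```python
import math

def sigmoid(x):
    """Squashing function: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(states, weights):
    """One artificial neuron: sum = sum_n w_n s_n, then s = sigma(sum)."""
    total = sum(w * s for w, s in zip(weights, states))
    return sigmoid(total)

# Five input states s1..s5 and weights w1..w5 (illustrative values).
states  = [0.2, 0.9, 0.1, 0.5, 0.7]
weights = [0.4, -0.3, 0.8, 0.1, -0.6]
print(neuron(states, weights))
```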
Squashing Functions
Figure: plot of σ(sum) versus sum, rising smoothly and saturating at 1.

sigmoid: σ(x) = 1 / (1 + e^{-x})
A Layered Perceptron
Figure: a layered perceptron; input neurons at the bottom, a hidden layer above, and output neurons at the top, connected by weighted interconnects.
Training
Given training data:
  input vector set: { i_n | 1 ≤ n ≤ N }
  corresponding output (target) vector set: { t_n | 1 ≤ n ≤ N }
Find the weights of the interconnects using training data to minimize error in the test data
Error
Input, target & response:
  input vector set: { i_n | 1 ≤ n ≤ N }
  target vector set: { t_n | 1 ≤ n ≤ N }
  o_n = neural network output when the input is i_n (note: in general o_n ≠ t_n)

Error:
  E = ½ Σ_n || o_n - t_n ||²
Error Minimization Techniques

The error is a function of:
  the fixed training and test data
  the neural network weights

Find weights that minimize the error (standard optimization):
  conjugate gradient descent
  random search
  genetic algorithms
  steepest descent (error backpropagation)
Minimizing Error Using Steepest Descent
The main idea: find the way downhill and take a step.

Figure: a plot of E versus x with the minimum marked.

downhill direction = -dE/dx
μ = step size
update: x ← x - μ dE/dx
Example of Steepest Descent
E(x) = ½ x²; minimum at x = 0

-μ dE/dx = -μx, so the update is x ← x - μx = (1 - μ) x

The solution to the difference equation
  x_p = (1 - μ) x_{p-1}
is x_p = (1 - μ)^p x_0.
For |1 - μ| < 1, x_p → 0 as p → ∞.
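A quick numerical check of this example; the starting point x0, the step size μ, and the number of steps below are illustrative choices:

```python
# Steepest descent on E(x) = x^2 / 2, whose derivative is dE/dx = x.
mu, x0, steps = 0.1, 5.0, 20          # illustrative step size, start, and step count

x = x0
for p in range(1, steps + 1):
    x = x - mu * x                                # x <- x - mu * dE/dx
    assert abs(x - (1 - mu) ** p * x0) < 1e-12    # matches x_p = (1 - mu)^p x_0

print(x)   # approaches the minimum at x = 0, since |1 - mu| < 1
```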
Training the Perceptron
Figure: a single-layer perceptron with inputs i_1, i_2, i_3, i_4, outputs o_1, o_2, and weights w_11 through w_24, where o_n = Σ_{k=1}^{4} w_nk i_k.

E = ½ Σ_{n=1}^{2} (o_n - t_n)²
  = ½ Σ_{n=1}^{2} ( Σ_{k=1}^{4} w_nk i_k - t_n )²

dE/dw_mj = ( Σ_{k=1}^{4} w_mk i_k - t_m ) i_j = (o_m - t_m) i_j
Weight Update
dE/dw_mj = i_j (o_m - t_m)

For m = 2 and j = 4:
  w_24 ← w_24 - μ i_4 (o_2 - t_2)

Figure: the same single-layer perceptron with inputs i_1, ..., i_4, outputs o_1, o_2, and weights w_11 through w_24.
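A minimal sketch of this update for the pictured four-input, two-output layer; the weights, input, target, and step size below are illustrative values, not from the slides:

```python
mu = 0.1                                  # illustrative step size
w = [[0.10, -0.20, 0.05, 0.30],           # w[m][j]: weights feeding output o_1
     [0.40,  0.10, -0.30, 0.20]]          #          and output o_2
i = [0.5, 1.0, -0.5, 2.0]                 # illustrative input vector
t = [1.0, 0.0]                            # illustrative target vector

# Outputs of the single layer: o_m = sum_j w_mj i_j
o = [sum(w[m][j] * i[j] for j in range(4)) for m in range(2)]

# Gradient dE/dw_mj = i_j (o_m - t_m), so w_mj <- w_mj - mu * i_j * (o_m - t_m)
for m in range(2):
    for j in range(4):
        w[m][j] -= mu * i[j] * (o[m] - t[m])
```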
No Hidden Layers = Linear Separation

o = σ( Σ_n w_n i_n )

For a classifier, threshold the output:
  If o > ½, announce class #1
  If o < ½, announce class #2

Classification boundary: o = ½, or Σ_n w_n i_n = 0. This is the equation of a plane!

Figure: a single neuron with inputs i_1, i_2, i_3, weights w_1, w_2, w_3, and output o.
Classification Boundary

Σ_n w_n i_n = 0 defines a line through the origin.

Figure: the boundary line through the origin in the (i_1, i_2) plane.
Adding Bias Term
Figure: a single neuron with inputs i_1, i_2, i_3 and a constant input of 1 (the bias), with weights w_1, w_2, w_3, w_4 and output o.

The classification boundary is still a line, but it need not go through the origin.

Figure: a shifted boundary line in the (i_1, i_2) plane.
The Minsky-Papert Objection
Figure: the exclusive-or (XOR) truth table plotted in the (i_1, i_2) plane; no single line separates the two classes.

The simple operation of the exclusive or (XOR) cannot be resolved using a linear perceptron with bias. More important problems can thus probably not be resolved with a linear perceptron with bias.
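The objection can be illustrated by brute force: the sketch below searches a coarse grid of weights and a bias and finds no linear threshold that reproduces XOR. The grid is an illustrative choice, a demonstration rather than a proof:

```python
import itertools

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separates(w1, w2, b):
    """True if the threshold rule w1*i1 + w2*i2 + b > 0 reproduces XOR."""
    return all((w1 * i1 + w2 * i2 + b > 0) == bool(target)
               for (i1, i2), target in xor.items())

grid = [k / 4 for k in range(-8, 9)]              # candidate weights and bias
found = any(separates(w1, w2, b)
            for w1, w2, b in itertools.product(grid, repeat=3))
print(found)   # False: no line on this grid separates XOR
```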
The Layered Perceptron
Notation (see figure):
  interconnect weights: w_jk(l)
  neuron states: s_j(l)
  input layer: l = 0
  hidden layer: l
  output layer: l = L
Error Backpropagation
Problem: for an arbitrary weight w_jk(l), perform the update

  w_jk(l) ← w_jk(l) - μ dE/dw_jk(l)

A solution: error backpropagation. Apply the chain rule for partial derivatives:

  dE/dw_jk(l) = [dE/ds_j(l)] [ds_j(l)/dsum_j(l)] [dsum_j(l)/dw_jk(l)]
Each Partial is Evaluated (Beautiful Math!!!)

ds_j(l)/dsum_j(l) = d/dsum_j(l) { 1 / (1 + exp[-sum_j(l)]) } = s_j(l) [ 1 - s_j(l) ]

dsum_j(l)/dw_jk(l) = s_k(l-1)

dE/ds_j(l) = δ_j(l) = Σ_n δ_n(l+1) s_n(l+1) [ 1 - s_n(l+1) ] w_nj(l+1)
Weight Update
dE/dw_jk(l) = [dE/ds_j(l)] [ds_j(l)/dsum_j(l)] [dsum_j(l)/dw_jk(l)]
            = δ_j(l) s_j(l) [ 1 - s_j(l) ] s_k(l-1)

w_jk(l) ← w_jk(l) - μ dE/dw_jk(l)
Step #1: Input Data & Feedforward
Figure: a 2-3-2 network with inputs i_1 = s_1(0), i_2 = s_2(0); hidden states s_1(1), s_2(1), s_3(1); and outputs s_1(2) = o_1, s_2(2) = o_2.

The states of all of the neurons are determined by the states of the neurons below them and the interconnect weights.
Step #2: Evaluate the output error, backpropagate to find the δ’s for each neuron

Figure: the same 2-3-2 network; the output neurons hold (o_1, t_1), (o_2, t_2) with δ_1(2), δ_2(2); the hidden neurons hold s_1(1), s_2(1), s_3(1) with δ_1(1), δ_2(1), δ_3(1); the inputs i_1, i_2 carry δ_1(0), δ_2(0).

Each neuron now keeps track of two numbers. The δ’s for each neuron are determined by “backpropagating” the output error towards the input.
Step #3: Update Weights
Figure: the same 2-3-2 network with states and δ’s at every neuron.

Example update:
  w_32(1) ← w_32(1) - μ δ_3(1) s_3(1) [ 1 - s_3(1) ] s_2(0)

Weight updates are performed within the neural network architecture.
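The three steps can be collected into one training routine. Below is a minimal sketch in Python using sigmoid squashing and the w_jk(l), s_j(l), δ_j(l) notation of the preceding slides; the 2-3-2 layer sizes, random seed, and step size are illustrative, and bias inputs are omitted for brevity:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
mu = 0.5                                  # illustrative step size
sizes = [2, 3, 2]                         # a 2-3-2 network, as in the figures
# w[l][j][k] = w_jk(l): weight from neuron k in layer l-1 to neuron j in layer l
w = [None] + [[[random.uniform(-1, 1) for _ in range(sizes[l - 1])]
               for _ in range(sizes[l])] for l in (1, 2)]

def train_pair(i, t):
    # Step #1: feedforward -- each state is set by the states below it and the weights.
    s = [list(i)]
    for l in (1, 2):
        s.append([sigmoid(sum(w[l][j][k] * s[l - 1][k] for k in range(sizes[l - 1])))
                  for j in range(sizes[l])])

    # Step #2: evaluate the output error and backpropagate the deltas.
    delta = [None, None, [s[2][j] - t[j] for j in range(sizes[2])]]  # delta_j(L) = o_j - t_j
    delta[1] = [sum(delta[2][n] * s[2][n] * (1 - s[2][n]) * w[2][n][j]
                    for n in range(sizes[2]))
                for j in range(sizes[1])]

    # Step #3: w_jk(l) <- w_jk(l) - mu * delta_j(l) * s_j(l) * [1 - s_j(l)] * s_k(l-1)
    for l in (1, 2):
        for j in range(sizes[l]):
            grad = delta[l][j] * s[l][j] * (1 - s[l][j])
            for k in range(sizes[l - 1]):
                w[l][j][k] -= mu * grad * s[l - 1][k]
```

A call such as train_pair([0.2, 0.9], [1.0, 0.0]) performs one pattern-mode update; repeating it over a training set, in randomized order, is the one-pair-at-a-time training discussed below.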
Neural Smithing
Bias
Momentum
Batch Training
Learning Versus Memorization
Cross Validation
The Curse of Dimensionality
Variations
Bias
Bias is used with MLPs:
  at the input
  at hidden layers (sometimes)
Momentum

Steepest descent:  w_jk(l) ← w_jk(l) + Δw_jk(l)

With momentum:  w_jk^{m+1}(l) = w_jk^{m}(l) + Δw_jk^{m+1}(l) + η Δw_jk^{m}(l)

The new step is affected by the previous step; m is the iteration number. Convergence is improved.
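A minimal sketch of this update for a single weight; the momentum coefficient name η (eta) and the values below are illustrative:

```python
mu, eta = 0.1, 0.9    # illustrative step size and momentum coefficient

def momentum_step(w, grad, prev_step):
    """w^{m+1} = w^m + Dw^{m+1} + eta * Dw^m, with Dw^{m+1} = -mu * dE/dw."""
    step = -mu * grad
    return w + step + eta * prev_step, step

# Carry the previous step along between iterations (illustrative gradients).
w, prev = 0.5, 0.0
for grad in [0.4, 0.3, 0.1]:
    w, prev = momentum_step(w, grad, prev)
```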
Back Propagation: Batch Training
  Accumulate the error from all training data prior to the weight update
  True steepest descent
  Update weights once each epoch

Training the Layered Perceptron One Data Pair at a Time
  Randomize the data order to avoid structure
  The Widrow-Hoff algorithm
Learning versus Memorization: Both have zero training error
Figure: two curves fit to the same training data. One shows good generalization (learning) and follows the concept (truth); the other shows bad generalization (memorization). Legend: training data and test data.
Alternate View:
Figure: the concept (truth) with two fitted curves; the learning curve follows the concept, while the memorization curve overfits the training points.
Learning versus Memorization (cont.)

Successful learning: recognizing data outside the training set, e.g. data in the test set; i.e., the neural network must successfully classify (interpolate) inputs it has not seen before.

How can we assure learning?
  Cross validation
  Choosing the neural network structure
    Pruning
    Genetic algorithms
Cross Validation
Figure: training error and test error versus training iterations (m). The training error keeps decreasing, while the test error reaches a minimum and then rises; training should stop at that minimum.
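One way to act on this curve is to stop training when the test (validation) error stops improving. A minimal early-stopping sketch; the routine names passed in and the patience of 10 epochs are illustrative, not from the slides:

```python
def train_with_early_stopping(train_one_epoch, test_error, max_epochs=1000):
    """Train while the test error keeps falling; stop near its minimum.

    train_one_epoch(): performs one pass of weight updates over the training data.
    test_error(): returns the current error on the held-out test data.
    Both are caller-supplied (illustrative interface).
    """
    best_error, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = test_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= 10:     # test error no longer improving
            break
    return best_epoch, best_error
```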
The Curse of Dimensionality
For many problems, the required number of training data grows exponentially with the dimension of the input.

Example:
• For N = 2 inputs, suppose that 100 = 10² training data pairs suffice.
• For N = 3 inputs, 10³ = 1000 training data pairs are needed.
• In general, 10^N training data pairs are needed for many important problems.
Example: Classifying a circle in a square
Figure: a circle inscribed in a square in the (i_1, i_2) plane; a neural net maps inputs i_1, i_2 to an output o that labels points inside or outside the circle. 100 = 10² training points are shown.
Example: Classifying a sphere in a cube (N = 3)

Figure: a sphere inscribed in a cube in (i_1, i_2, i_3) space; a neural net maps inputs i_1, i_2, i_3 to an output o. Sampling 10 layers, each with 10² points, requires 10³ = 10^N points.
Variations

Architecture variations for MLPs:
  Recurrent Neural Networks
  Radial Basis Functions
  Cascade Correlation
  Fuzzy MLPs
Training algorithms
Applications
Power Engineering
Finance
Bioengineering
Control
Industrial Applications
Politics
Political Applications
Robert Novak syndicated column
Washington, February 18, 1996
UNDECIDED BOWLERS
“President Clinton’s pollsters have identified the voters who will determine whether he will be elected to a second term: two-parent families whose members bowl for recreation.”
“Using a technique they call the ‘neural network,’ Clinton advisors contend that these family bowlers are the quintessential undecided voters. Therefore, these are the people who must be targeted by the president.”
“A footnote: Two decades ago, Illinois Democratic Gov. Dan Walker campaigned heavily in bowling alleys in the belief he would find swing voters there. Walker had national political ambitions but ended up in federal prison.”

Robert Novak syndicated column, Washington, February 18, 1996 (continued)
Finis