
Artificial Neural Networks

Dan Simon
Cleveland State University

1

Neural Networks

• Artificial Neural Network (ANN): an information processing paradigm that is inspired by biological neurons

• Distinctive structure: Large number of simple, highly interconnected processing elements (neurons); parallel processing

• Inductive learning, that is, learning by example; an ANN is configured for a specific application through a learning process

• Learning involves adjustments to connections between the neurons

2

Inductive Learning

• Sometimes we can’t explain how we know something; we rely on our experience

• An ANN can generalize from expert knowledge and re-create expert behavior

• Example: An ER doctor considers a patient’s age, blood pressure, heart rate, ECG, etc., and makes an educated guess about whether or not the patient had a heart attack

3

The Birth of ANNs

• The first artificial neuron was proposed in 1943 by neurophysiologist Warren McCulloch and the psychologist/logician Walter Pitts

• No computing resources at that time

4

Biological Neurons

5

A Simple Artificial Neuron

6

A Simple ANN

Pattern recognition: T versus H

(Figure: a 3×3 pixel grid with inputs x11 … x33 feeding neurons f1(·), f2(·), f3(·), whose outputs feed g(·); the network output is 1 or 0.)

7

Truth table:

x1 x2 x3 | f1 f2 f3 g
 0  0  0 |  0  1  1  1
 0  0  1 |  0  ?  0  0
 0  1  0 |  1  1  1  1
 0  1  1 |  1  ?  1  1
 1  0  0 |  0  ?  0  0
 1  0  1 |  0  0  0  0
 1  1  0 |  1  ?  1  1
 1  1  1 |  1  0  0  0

Examples:
1, 1, 1 → 1
0, 0, 0 → 0
1, ?, 1 → 1
0, ?, 1 → ?

8

Feedforward ANN

How many hidden layers should we use? How many neurons should we use in each hidden layer?

9

Recurrent ANN

10

Perceptrons

• A simple ANN introduced by Frank Rosenblatt in 1958

• Discredited by Marvin Minsky and Seymour Papert in 1969:
  – "Perceptrons have been widely publicized as 'pattern recognition' or 'learning machines' and as such have been discussed in a large number of books, journal articles, and voluminous 'reports'. Most of this writing ... is without scientific value …"

11

Perceptrons

Three-dimensional single-layer perceptron

Problem: Given a set of training data (i.e., (x, y) pairs), find the weight vector {w} that correctly classifies the inputs.

12

(Figure: a single-layer perceptron with bias input $x_0 = 1$ and inputs $x_1, x_2, x_3$, weighted by $w_0, w_1, w_2, w_3$.)

$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ 0 & \text{otherwise} \end{cases}$

The Perceptron Training Rule

• t = target output, o = perceptron output
• Training rule: $\Delta w_i = \eta\, e\, x_i$, where $e = t - o$ and $\eta$ is the step size.
  Note that e = 0, +1, or -1.
  If e = 0, then don't update the weight.
  If e = 1, then t = 1 and o = 0, so we need to increase $w_i$ if $x_i > 0$ and decrease $w_i$ if $x_i < 0$.
  Similar logic applies when e = -1.
• $\eta$ is often initialized to 0.1 and decreases as training progresses.

13
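The training rule above can be sketched in a few lines of MATLAB/Octave. This is an illustrative sketch (not code from the course); the AND data set, step size, and epoch count are assumptions.

```matlab
% Perceptron training rule sketch: w <- w + eta*e*x, with e = t - o
X = [0 0; 0 1; 1 0; 1 1];          % training inputs (assumed AND problem)
t = [0; 0; 0; 1];                  % targets (linearly separable)
X = [ones(size(X,1),1) X];         % prepend bias input x0 = 1
w = zeros(size(X,2),1);            % weights [w0; w1; w2]
eta = 0.1;                         % step size
for epoch = 1:50
    for i = 1:size(X,1)
        o = double(X(i,:)*w > 0);  % threshold output
        e = t(i) - o;              % error: 0, +1, or -1
        w = w + eta*e*X(i,:)';     % perceptron training rule
    end
end
disp(w')                           % learned weights
```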

From Perceptrons to Backpropagation

• Perceptrons were dismissed because of:
  – Limitations of single-layer perceptrons
  – The threshold function is not differentiable

• Multi-layer ANNs with differentiable activation functions allow much richer behaviors.

14

Backpropagation

A multi-layer perceptron (MLP) is a feedforward ANN with at least one hidden layer.

Backpropagation is a derivative-based method for optimizing ANN weights.
• 1969: First described by Arthur Bryson and Yu-Chi Ho.
• 1970s–80s: Popularized by David Rumelhart, Geoffrey Hinton, Ronald Williams, and Paul Werbos; led to a renaissance in ANN research.

15

The Credit Assignment Problem

In a multi-layer ANN, how can we tell which weight should be varied to correct an output error? Answer: backpropagation.

16

(Figure: the network produces output 1 when the desired output is 0.)

Backpropagation

(Figure: a two-layer network. Input neurons $x_1, x_2$ feed hidden neurons through weights $v_{11}, v_{21}, v_{12}, v_{22}$; the hidden neurons have activations $a_1, a_2$ and outputs $y_1, y_2$, which feed output neurons through weights $w_{11}, w_{21}, w_{12}, w_{22}$, giving activations $z_1, z_2$ and outputs $o_1, o_2$.)

$z_1 = w_{11} y_1 + w_{21} y_2, \qquad o_1 = f(z_1)$  (similar for $z_2$ and $o_2$)

$a_1 = v_{11} x_1 + v_{21} x_2, \qquad y_1 = f(a_1)$  (similar for $a_2$ and $y_2$)

17

$t_k$ = desired (target) value of the k-th output neuron
$n_o$ = number of output neurons

$E = \frac{1}{2}\sum_{k=1}^{n_o}(t_k - o_k)^2 = \frac{1}{2}\sum_{k=1}^{n_o}\big(t_k - f(z_k)\big)^2$

Sigmoid transfer function:

$f(x) = \frac{1}{1+e^{-x}}, \qquad \frac{df}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = f(x)\big(1 - f(x)\big)$

(Plot: f(x) for x from -5 to 5, rising from 0 to 1.)

18

Output Neurons

$E = \frac{1}{2}\sum_{k=1}^{n_o}\big(t_k - f(z_k)\big)^2$

$\frac{dE}{dw_{ij}} = \frac{dE}{dz_j}\,\frac{dz_j}{dw_{ij}} = \frac{dE}{dz_j}\, y_i$

$\frac{dE}{dz_j} = \frac{d}{dz_j}\left[\tfrac{1}{2}(t_j - o_j)^2\right] = -(t_j - o_j)\,\frac{do_j}{dz_j} = -(t_j - o_j)\,\frac{df(z_j)}{dz_j} = -(t_j - o_j)\, f(z_j)\big(1 - f(z_j)\big) = -(t_j - o_j)\, o_j (1 - o_j)$

19

Hidden Neurons

D(j) = {output neurons whose inputs come from the j-th middle-layer neuron}

The chain of dependence is $v_{ij} \to a_j \to y_j \to \{z_k\ \text{for all } k \in D(j)\}$.

$\frac{dE}{dv_{ij}} = \sum_{k \in D(j)} \frac{dE}{dz_k}\,\frac{dz_k}{dy_j}\,\frac{dy_j}{da_j}\,\frac{da_j}{dv_{ij}} = \sum_{k \in D(j)} \frac{dE}{dz_k}\, w_{jk}\, y_j (1-y_j)\, x_i = x_i\, y_j (1-y_j) \sum_{k \in D(j)} w_{jk}\,\frac{dE}{dz_k}$

20

The Backpropagation Training Algorithm

1. Randomly initialize weights {w} and {v}.2. Input sample x to get output o. Compute error E.3. Compute derivatives of E with respect to output weights {w}

(two pages previous).4. Compute derivatives of E with respect to hidden weights {v}

(previous page). Note that the results of step 3 are used for this computation; hence the term “backpropagation”).

5. Repeat step 4 for additional hidden layers as needed.6. Use gradient descent to update weights {w} and {v}. Go to

step 2 for the next sample/iteration.

21
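The six steps above can be sketched in MATLAB/Octave for the XOR network discussed on the next slides. This is an illustrative sketch, not the Backprop.m file referenced later; the step size, hidden-layer size, and epoch count are assumptions, and convergence depends on the random initialization.

```matlab
% Steps 1-6 of the backpropagation algorithm on the XOR problem (a sketch).
X = [0 0; 0 1; 1 0; 1 1];                 % training inputs
T = [0; 1; 1; 0];                         % XOR targets
f = @(x) 1./(1 + exp(-x));                % sigmoid transfer function
nh = 2;  eta = 0.5;                       % hidden neurons, step size (assumed)
V = 0.5*randn(3, nh);                     % hidden weights; row 3 = bias weights
W = 0.5*randn(nh+1, 1);                   % output weights; last row = bias weight
for epoch = 1:20000
    for i = 1:4                                        % step 2: forward pass
        x = [X(i,:) 1]';                               % input plus bias node
        y = [f(V'*x); 1];                              % hidden outputs plus bias node
        o = f(W'*y);                                   % network output
        dEdz = -(T(i) - o)*o*(1 - o);                  % step 3: output delta dE/dz
        dEdW = y*dEdz;                                 %   dE/dw_ij = y_i * dE/dz_j
        dEda = (W(1:nh)*dEdz).*y(1:nh).*(1 - y(1:nh)); % step 4: hidden deltas dE/da
        dEdV = x*dEda';                                %   dE/dv_ij = x_i * dE/da_j
        W = W - eta*dEdW;                              % step 6: gradient descent
        V = V - eta*dEdV;
    end
end
o = f([f([X ones(4,1)]*V) ones(4,1)]*W)   % trained outputs for all four inputs
```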

XOR Example

Not linearly separable. This is a very simple problem, but early ANNs were unable to solve it.

22

(Figure: the four XOR training points on the (x1, x2) plane, with axis values 0 and 1; y = sign(x1 x2).)

XOR Example

Bias nodes at both the input and the hidden layer.

(Figure: a 2-2-1 network for XOR. Inputs $x_1, x_2$ and a bias input of 1 feed hidden neurons $y_1, y_2$ through weights $v_{11}, v_{21}, v_{31}, v_{12}, v_{22}, v_{32}$; the hidden neurons (activations $a_1, a_2$) and a hidden bias of 1 feed the output $o_1$ (activation $z_1$) through weights $w_{11}, w_{21}, w_{31}$.)

Backprop.m

23

XOR Example

Homework: Record the weights of the trained ANN, then input various (x1, x2) combinations to the ANN to see how well it can generalize.

24

(Figure: the four XOR input points on the (x1, x2) plane.)

Backpropagation Issues

• Momentum: $w_{ij} \leftarrow w_{ij} - \eta\,\delta_j y_i + \alpha\,\Delta w_{ij,\text{previous}}$, where $\delta_j = dE/dz_j$ from the backpropagation derivation.
  What value of $\alpha$ should we use? (A momentum sketch follows this slide.)
• Backpropagation is a local optimizer
  – Combine it with a global optimizer (e.g., BBO)
  – Run backprop with multiple initial conditions
• Add random noise to input data and/or weights to improve generalization

25
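A self-contained sketch of the momentum update on a toy quadratic error surface; the quadratic, step size, and momentum coefficient are assumptions for illustration.

```matlab
% Momentum illustrated on a toy quadratic error E(w) = 0.5*w'*A*w.
A = [10 0; 0 1];                      % ill-conditioned quadratic (assumed example)
w = [1; 1];                           % initial weights
eta = 0.05;  alpha = 0.9;             % step size and momentum coefficient (assumed)
dw_prev = zeros(size(w));
for iter = 1:200
    grad = A*w;                       % dE/dw for this quadratic
    dw = -eta*grad + alpha*dw_prev;   % gradient step plus a fraction of the last step
    w = w + dw;
    dw_prev = dw;                     % remember this change for the next update
end
disp(w')                              % w approaches the minimum at [0 0]
```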

Backpropagation Issues

Batch backpropagation

26

Randomly initialize weights {w} and {v}
While not (termination criteria)
    For i = 1 to (number of training samples)
        Input sample xi to get output oi. Compute error Ei
        Compute dEi/dw and dEi/dv
    Next sample
    dE/dw = Σ dEi/dw and dE/dv = Σ dEi/dv
    Use gradient descent to update weights {w} and {v}
End while

Don’t forget to adjust the learning rate!
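A sketch of one batch epoch in MATLAB/Octave, accumulating the per-sample gradients before a single weight update; the XOR network setup matches the earlier backpropagation sketch and is an assumption.

```matlab
% One batch epoch: accumulate dEi/dw and dEi/dv over all samples, then update once.
X = [0 0; 0 1; 1 0; 1 1];  T = [0; 1; 1; 0];          % XOR data (assumed)
f = @(x) 1./(1 + exp(-x));
nh = 2;  eta = 0.5;  V = 0.5*randn(3, nh);  W = 0.5*randn(nh+1, 1);
dEdW_sum = zeros(size(W));  dEdV_sum = zeros(size(V));
for i = 1:size(X,1)
    x = [X(i,:) 1]';  y = [f(V'*x); 1];  o = f(W'*y);  % forward pass
    dEdz = -(T(i) - o)*o*(1 - o);
    dEdW_sum = dEdW_sum + y*dEdz;                      % accumulate dEi/dw
    dEda = (W(1:nh)*dEdz).*y(1:nh).*(1 - y(1:nh));
    dEdV_sum = dEdV_sum + x*dEda';                     % accumulate dEi/dv
end
W = W - eta*dEdW_sum;  V = V - eta*dEdV_sum;           % one batch gradient step
```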

Backpropagation Issues

Weight decay
• $w_{ij} \leftarrow w_{ij} - \eta\,\delta_j y_i - d\, w_{ij}$
• This tends to decrease weight magnitudes unless they are reinforced by backpropagation; $d \approx 0.001$

• This corresponds to adding a term to the error function that penalizes the weight magnitudes

27

Backpropagation Issues

Quickprop (Scott Fahlman, 1988)
• Backpropagation is notoriously slow.
• Quickprop has the same philosophy as Newton-Raphson: assume the error surface is quadratic and jump in one step to the minimum of the quadratic.

28

Backpropagation Issues

• Other activation functions
  – Sigmoid: $f(x) = (1+e^{-x})^{-1}$
  – Hyperbolic tangent: $f(x) = \tanh(x)$
  – Step: $f(x) = U(x)$
  – Tan sigmoid: $f(x) = (e^{cx} - e^{-cx})/(e^{cx} + e^{-cx})$ for some positive constant c

• How many hidden layers should we use?

29
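The listed activation functions can be compared with a short MATLAB/Octave sketch; the value c = 2 and the convention U(x) = 1 for x ≥ 0 are assumptions.

```matlab
% The activation functions listed above, defined as anonymous functions.
sigmoid = @(x) 1./(1 + exp(-x));                                 % (1+e^-x)^-1
htan    = @(x) tanh(x);                                          % hyperbolic tangent
ustep   = @(x) double(x >= 0);                                   % step U(x), assumed 1 for x >= 0
tansig  = @(x,c) (exp(c*x) - exp(-c*x))./(exp(c*x) + exp(-c*x)); % = tanh(c*x)
x = -5:0.1:5;
plot(x, sigmoid(x), x, htan(x), x, ustep(x), x, tansig(x,2));
legend('sigmoid', 'tanh', 'step', 'tan sigmoid (c = 2)');
```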

Universal Approximation Theorem

• A feed-forward ANN with one hidden layer and a finite number of neurons can approximate any continuous function to any desired accuracy.

• The ANN activation functions can be any continuous, nonconstant, bounded, monotonically increasing functions.

• The desired weights may not be obtainable via backpropagation.

• George Cybenko, 1989; Kurt Hornik, 1991

30

Termination Criterion

If we train too long, we begin to "memorize" the training data and lose the ability to generalize. Train with a validation/test set.

31

(Plot: error versus training time for the training set and the validation/test set.)

Termination Criterion

Cross Validation
• N data partitions
• N training runs, each using (N-1) partitions for training and 1 partition for validation/test
• For each training run, store the number of epochs $c_i$ that gives the best test-set performance (i = 1, …, N)
• $c_{ave}$ = mean{$c_i$}
• Train on all of the data for $c_{ave}$ epochs

32
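A sketch of this cross-validation procedure in MATLAB/Octave; init_net, train_one_epoch, and test_error are assumed helper functions, and N = 5, max_epochs = 200, and the data matrices X, Y are assumptions.

```matlab
% Choose the number of training epochs by N-fold cross validation.
N = 5;  max_epochs = 200;
P = size(X,1);                              % X, Y = all training data (assumed)
part = mod(randperm(P), N) + 1;             % random partition label for each sample
c = zeros(N,1);                             % best epoch count for each run
for i = 1:N
    trn = (part ~= i);  tst = (part == i);  % N-1 partitions train, 1 tests
    net = init_net();                       % assumed helper: random weights
    best_err = inf;
    for epoch = 1:max_epochs
        net = train_one_epoch(net, X(trn,:), Y(trn,:));   % assumed helper
        err = test_error(net, X(tst,:), Y(tst,:));        % assumed helper
        if err < best_err
            best_err = err;  c(i) = epoch;  % record the best-performing epoch count
        end
    end
end
c_ave = round(mean(c));                     % then train on all data for c_ave epochs
```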

Adaptive Backpropagation

Recall the standard weight update: $w_{ij} \leftarrow w_{ij} - \eta\,\delta_j y_i$

• With adaptive learning rates, each weight $w_{ij}$ has its own rate $\eta_{ij}$
• If the sign of $\Delta w_{ij}$ is the same over several backprop updates, then increase $\eta_{ij}$
• If the sign of $\Delta w_{ij}$ is not the same over several backprop updates, then decrease $\eta_{ij}$

33
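A minimal sketch of per-weight adaptive learning rates on a toy quadratic error surface; the increase/decrease factors (1.1 and 0.5), the one-step sign test, and the quadratic itself are assumptions for illustration.

```matlab
% Each weight gets its own learning rate, adapted by the sign of its updates.
A = [10 0; 0 1];                      % toy quadratic error E(w) = 0.5*w'*A*w (assumed)
w = [1; 1];
eta = 0.1*ones(size(w));              % one learning rate per weight
dw_prev = zeros(size(w));
for iter = 1:100
    grad = A*w;                       % gradient (backprop would supply this for an ANN)
    dw = -eta.*grad;                  % per-weight update
    same = sign(dw) == sign(dw_prev); % did each weight keep moving the same way?
    eta(same)  = eta(same)*1.1;       % consistent sign: increase that weight's rate
    eta(~same) = eta(~same)*0.5;      % sign change: decrease that weight's rate
    w = w + dw;
    dw_prev = dw;
end
disp(w')                              % w approaches the minimum at [0 0]
```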

Double Backpropagation

P = number of input training patterns. We want an ANN that can generalize, so input changes should not result in large error changes.

34

In addition to minimizing the training error

$E_1 = \frac{1}{2}\sum_{k=1}^{n_o}(t_k - o_k)^2,$

also minimize the sensitivity of the training error to the input data:

$E_2 = \frac{1}{2}\sum_k \left(\frac{\partial E_1}{\partial x_k}\right)^2$

Other ANN Training Methods

Gradient-free approaches (GAs, BBO, etc.)
• Global optimization
• Combination with gradient descent
• We can train the structure as well as the weights
• We can use non-differentiable activation functions
• We can use non-differentiable cost functions

35

BBO.m

Classification Benchmarks

The Iris classification problem
• 150 data samples
• Four input feature values (sepal length and width, and petal length and width)
• Three types of irises: Setosa, Versicolour, and Virginica

36

Classification Benchmarks

• The two-spirals classification problem

• UC Irvine Machine Learning Repository – http://archive.ics.uci.edu/ml – 194 benchmarks!

37

Radial Basis Functions

J. Moody and C. Darken, 1989. Universal approximators.

38

N middle-layer neurons; inputs x; activation functions $f(x, c_i)$; output weights $w_{ik}$

$y_k = \sum_{i=1}^{N} w_{ik}\, f(x, c_i) = \sum_{i=1}^{N} w_{ik}\, \phi(\|x - c_i\|)$

$\phi(\cdot)$ is a basis function; $\lim_{\|x\|\to\infty} \phi(\|x - c_i\|) = 0$; $\{c_i\}$ are the N RBF centers

Radial Basis Functions

Common basis functions:
• Gaussian: $\phi(\|x - c_i\|) = \exp(-\|x - c_i\|^2 / \sigma^2)$, where $\sigma$ is the width of the basis function
• Many other proposed basis functions

39

Radial Basis Functions

Suppose we have the data set $(x_i, y_i)$, i = 1, …, N. Each $x_i$ is multidimensional; each $y_i$ is scalar.

Set $c_i = x_i$, i = 1, …, N, and define $g_{ik} = \phi(\|x_i - x_k\|)$.

Input each $x_i$ to the RBF to obtain:

40

$\begin{bmatrix} g_{11} & \cdots & g_{1N} \\ \vdots & & \vdots \\ g_{N1} & \cdots & g_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$

Gw = y. G is nonsingular if the $\{x_i\}$ are distinct, so we can solve for w. This gives the global minimum (assuming fixed c and $\sigma$).
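A sketch of this exact-interpolation construction in MATLAB/Octave with Gaussian basis functions and one center per sample; the data set and the width sigma are made-up assumptions.

```matlab
% Exact RBF interpolation: centers ci = xi, then solve Gw = y.
X = rand(20, 2);                       % made-up training inputs (N x 2)
y = sin(4*X(:,1)) + X(:,2).^2;         % made-up scalar targets
sigma = 0.5;  N = size(X,1);           % assumed basis-function width
G = zeros(N, N);
for i = 1:N
    for k = 1:N
        G(i,k) = exp(-norm(X(i,:) - X(k,:))^2/sigma^2);  % gik = phi(||xi - xk||)
    end
end
w = G \ y;                             % G is nonsingular for distinct xi
% Evaluate the trained RBF at a new input xq:
xq = [0.3 0.7];  yq = 0;
for i = 1:N
    yq = yq + w(i)*exp(-norm(xq - X(i,:))^2/sigma^2);
end
```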

Radial Basis Functions

We again have the data set $(x_i, y_i)$, i = 1, …, N. Each $x_i$ is multidimensional; each $y_i$ is scalar.

The centers $c_k$ are given for k = 1, …, m, with m < N. Define $g_{ik} = \phi(\|x_i - c_k\|)$.

Input each $x_i$ to the RBF to obtain:

41

$\begin{bmatrix} g_{11} & \cdots & g_{1m} \\ \vdots & & \vdots \\ g_{N1} & \cdots & g_{Nm} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$

$Gw = y \;\Rightarrow\; w = (G^T G)^{-1} G^T y = G^+ y$
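A sketch of the least-squares case in MATLAB/Octave, reusing X, y, N, and sigma from the previous sketch; choosing the m centers by random selection from the inputs is one of the options discussed on the next slide, and m = 5 is an assumption.

```matlab
% Least-squares RBF training with m < N centers.
m = 5;
idx = randperm(N);  C = X(idx(1:m), :);     % m centers chosen randomly from the inputs
G = zeros(N, m);
for i = 1:N
    for k = 1:m
        G(i,k) = exp(-norm(X(i,:) - C(k,:))^2/sigma^2);  % gik = phi(||xi - ck||)
    end
end
w = pinv(G)*y;                              % w = (G'G)^(-1) G' y = G+ y
yhat = G*w;                                 % least-squares fit at the training inputs
```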

Radial Basis Functions

How can we choose the RBF centers?
• Randomly select them from the inputs
• Use a clustering algorithm
• Other options (BBO?)

How can we choose the RBF widths?

42

Other Types of ANNs

Many other types of ANNs:
• Cerebellar Model Articulation Controller (CMAC)
• Spiking neural networks
• Self-organizing map (SOM)
• Recurrent neural network (RNN)
• Hopfield network
• Boltzmann machine
• Cascade-Correlation
• and many others …

43

Sources
• Neural Networks, by C. Stergiou and D. Siganos, www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
• The Backpropagation Algorithm, by A. Venkataraman, www.speech.sri.com/people/anand/771/html/node37.html
• CS 478 Course Notes, by Tony Martinez, http://axon.cs.byu.edu/~martinez

44