Cleveland State University ESC 720 Research Communications Proposals Dan Simon.
Artificial Neural Networks Dan Simon Cleveland State University 1.
Neural Networks
• Artificial Neural Network (ANN): an information-processing paradigm inspired by biological neurons
• Distinctive structure: a large number of simple, highly interconnected processing elements (neurons); parallel processing
• Inductive learning, that is, learning by example; an ANN is configured for a specific application through a learning process
• Learning involves adjustments to the connections between the neurons
2
Inductive Learning
• Sometimes we can’t explain how we know something; we rely on our experience
• An ANN can generalize from expert knowledge and re-create expert behavior
• Example: An ER doctor considers a patient’s age, blood pressure, heart rate, ECG, etc., and makes an educated guess about whether or not the patient had a heart attack
3
The Birth of ANNs
• The first artificial neuron was proposed in 1943 by neurophysiologist Warren McCulloch and the psychologist/logician Walter Pitts
• Computing resources to implement it were not available at that time
4
A Simple ANN
Pattern recognition: T versus H
[Figure: a 3×3 pixel grid (inputs x11 … x33); each row of pixels feeds a feature neuron f1(.), f2(.), f3(.), and the three feature outputs feed an output neuron g(.), whose output is 1 or 0]
7
Truth table (x1, x2, x3 = one row of pixels):

x1 x2 x3 | f1 f2 f3 g
 0  0  0 |  0  1  1  1
 0  0  1 |  0  ?  0  0
 0  1  0 |  1  1  1  1
 0  1  1 |  1  ?  1  1
 1  0  0 |  0  ?  0  0
 1  0  1 |  0  0  0  0
 1  1  0 |  1  ?  1  1
 1  1  1 |  1  0  0  0

Examples:
1, 1, 1 → 1
0, 0, 0 → 0
1, ?, 1 → 1
0, ?, 1 → ?
8
Feedforward ANN
How many hidden layers should we use? How many neurons should we use in each hidden layer?
9
Perceptrons
• A simple ANN introduced by Frank Rosenblatt in 1958
• Discredited by Marvin Minsky and Seymour Papert in 1969:
  "Perceptrons have been widely publicized as 'pattern recognition' or 'learning machines' and as such have been discussed in a large number of books, journal articles, and voluminous 'reports'. Most of this writing ... is without scientific value …"
11
Perceptrons
Three-dimensional single-layer perceptron.
Problem: Given a set of training data (i.e., (x, y) pairs), find the weight vector w that correctly classifies the inputs.
12
[Figure: inputs x0 = 1 (bias), x1, x2, x3 with weights w0, w1, w2, w3 feeding a threshold unit]

f(w·x) = 1 if w·x > 0, 0 otherwise
The Perceptron Training Rule
• t = target output, o = perceptron output
• Training rule: Δwi = η e xi, where e = t − o and η is the step size.
  Note that e = 0, +1, or −1.
  If e = 0, then don't update the weight.
  If e = +1, then t = 1 and o = 0, so we need to increase wi if xi > 0 and decrease wi if xi < 0.
  Similar logic applies when e = −1.
• η is often initialized to 0.1 and decreased as training progresses.
13
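As an illustration, the training rule above can be sketched in Python. The AND data set, zero initialization, epoch count, and the choice η = 1 (which keeps the arithmetic exact for this tiny example) are my own choices, not from the slides:

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, epochs=50):
    """Perceptron training rule: w_i += eta * (t - o) * x_i.
    Each row of X includes a leading bias input x0 = 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else 0   # threshold unit
            e = target - o                      # e is 0, +1, or -1
            w += eta * e * x                    # no change when e == 0
    return w

# Linearly separable example: logical AND of x1 and x2
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
preds = [1 if np.dot(w, x) > 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop settles on a correct weight vector; the XOR function later in these slides has no such w.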
From Perceptrons to Backpropagation
• Perceptrons were dismissed because of:
  – Limitations of single-layer perceptrons
  – The threshold function is not differentiable
• Multi-layer ANNs with differentiable activation functions allow much richer behaviors.
14
A multi-layer perceptron (MLP) is a feedforward ANN with at least one hidden layer.

Backpropagation
Derivative-based method for optimizing ANN weights.
1969: First described by Arthur Bryson and Yu-Chi Ho.
1970s–80s: Popularized by David Rumelhart, Geoffrey Hinton, Ronald Williams, and Paul Werbos; led to a renaissance in ANN research.
15
The Credit Assignment Problem
In a multi-layer ANN, how can we tell which weight should be varied to correct an output error? Answer: backpropagation.
16
[Figure: a network whose output is 1 when the wanted output is 0]
Backpropagation
[Figure: input neurons x1, x2 feed hidden neurons (activations a1, a2; outputs y1, y2) through weights v11, v21, v12, v22; hidden outputs feed output neurons (activations z1, z2; outputs o1, o2) through weights w11, w21, w12, w22]

a1 = v11 x1 + v21 x2,  y1 = f(a1)   (similar for a2 and y2)
z1 = w11 y1 + w21 y2,  o1 = f(z1)   (similar for z2 and o2)
17
tk = desired (target) value of k-th output neuron
no = number of output neurons

E = (1/2) Σk=1..no (tk − ok)² = (1/2) Σk=1..no (tk − f(zk))²

Sigmoid transfer function:
f(x) = 1 / (1 + e^−x)
df/dx = e^−x / (1 + e^−x)² = f(x) [1 − f(x)]

[Figure: plot of f(x) for −5 ≤ x ≤ 5, rising from 0 through 0.5 at x = 0 toward 1]
18
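The derivative identity df/dx = f(x)[1 − f(x)] is what makes the sigmoid so convenient for backpropagation, and it is easy to verify numerically (the sample points and step size below are my own choices):

```python
import math

def f(x):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-x))

# Check df/dx = f(x) * (1 - f(x)) against a central finite difference
h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    analytic = f(x) * (1 - f(x))
    assert abs(numeric - analytic) < 1e-8
print("identity verified")
```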
Output Neurons
E = (1/2) Σk=1..no (tk − f(zk))²
dE/dwij = (dE/dzj)(dzj/dwij) = (dE/dzj) yi

dE/dzj = d/dzj [ (1/2)(tj − oj)² ]
       = −(tj − oj) (doj/dzj)
       = −(tj − oj) (df(zj)/dzj)
       = −(tj − oj) f(zj)(1 − f(zj))
       = −(tj − oj) oj (1 − oj)
19
Hidden Neurons
D(j) = {output neurons whose inputs come from the j-th middle-layer neuron}
(vij affects aj, which affects yj, which affects zk for all k ∈ D(j))

dE/dvij = Σk∈D(j) (dE/dzk)(dzk/dyj)(dyj/daj)(daj/dvij)
        = Σk∈D(j) (dE/dzk) wjk yj(1 − yj) xi
        = yj(1 − yj) xi Σk∈D(j) (dE/dzk) wjk
20
The Backpropagation Training Algorithm
1. Randomly initialize weights {w} and {v}.
2. Input sample x to get output o. Compute error E.
3. Compute derivatives of E with respect to output weights {w} (two pages previous).
4. Compute derivatives of E with respect to hidden weights {v} (previous page). Note that the results of step 3 are used for this computation; hence the term "backpropagation."
5. Repeat step 4 for additional hidden layers as needed.
6. Use gradient descent to update weights {w} and {v}. Go to step 2 for the next sample/iteration.
21
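Steps 1–6 above can be sketched for a small 2-2-1 sigmoid network. The XOR training data, random initialization, learning rate, and epoch count are illustrative choices of mine, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training set; a constant bias input 1 is appended to each sample
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
T = np.array([0., 1., 1., 0.])

nh = 2                               # hidden neurons
V = rng.normal(0, 1, (3, nh))        # step 1: random input -> hidden weights
W = rng.normal(0, 1, nh + 1)         # random hidden -> output weights (+ bias)
eta = 0.5

def total_error():
    Y = np.hstack([sigmoid(X @ V), np.ones((len(X), 1))])
    return 0.5 * np.sum((T - sigmoid(Y @ W)) ** 2)

E0 = total_error()
for epoch in range(5000):
    for x, t in zip(X, T):
        a = x @ V                          # step 2: forward pass
        y = np.append(sigmoid(a), 1.0)     # hidden outputs plus bias
        o = sigmoid(y @ W)
        dE_dz = -(t - o) * o * (1 - o)     # step 3: output-weight derivative
        dE_dW = dE_dz * y
        dE_da = dE_dz * W[:nh] * y[:nh] * (1 - y[:nh])  # step 4: hidden weights
        dE_dV = np.outer(x, dE_da)
        W -= eta * dE_dW                   # step 6: gradient descent
        V -= eta * dE_dV
print(E0, total_error())  # the error typically drops by orders of magnitude
```

The hidden-layer derivative reuses dE/dz from the output layer, which is exactly the "backpropagation" of step 4.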
XOR Example
Not linearly separable. This is a very simple problem, but early ANNs were unable to solve it.
22
[Figure: the four XOR training points on the (x1, x2) plane, x1, x2 ∈ {0, 1}; y = sign(x1x2); no single line separates the two classes]
XOR Example
Bias nodes at both the input and hidden layer.
23
[Figure: 2-2-1 network with bias nodes: inputs x1, x2 and a constant 1 feed hidden neurons (a1, y1) and (a2, y2) through weights v11, v21, v31, v12, v22, v32; hidden outputs y1, y2 and a constant 1 feed the output neuron (z1, o1) through weights w11, w21, w31]
Backprop.m
XOR Example
Homework: Record the weights for the trained ANN, input various (x1, x2) combinations to the ANN to see how well it can generalize.
24
Backpropagation Issues
• Momentum: wij ← wij − η δj yi + α Δwij,previous
  What value of α should we use?
• Backpropagation is a local optimizer
  – Combine it with a global optimizer (e.g., BBO)
  – Run backprop with multiple initial conditions
• Add random noise to input data and/or weights to improve generalization
25
Backpropagation Issues
Batch backpropagation
26
Randomly initialize weights {w} and {v}
While not (termination criteria)
  For i = 1 to (number of training samples)
    Input sample xi to get output oi. Compute error Ei
    Compute dEi/dw and dEi/dv
  Next sample
  dE/dw = Σi dEi/dw and dE/dv = Σi dEi/dv
  Use gradient descent to update weights {w} and {v}
End while
Don't forget to adjust the learning rate!
Backpropagation Issues
Weight decay
• wij ← wij − η δj yi − d wij
  This tends to decrease weight magnitudes unless they are reinforced by backprop; typically d ≈ 0.001
• This corresponds to adding a term to the error function that penalizes the weight magnitudes
27
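The penalty-function view can be checked numerically: adding (d/2)·Σ wij² to the error function produces exactly the d·wij decay term in the gradient. The weight values below are illustrative:

```python
import numpy as np

d = 0.001                        # decay coefficient, as on the slide
w = np.array([0.8, -0.3, 1.2])   # some weights (illustrative values)

def penalty(w):
    """The extra error term that penalizes weight magnitudes."""
    return 0.5 * d * np.sum(w ** 2)

# The gradient of the penalty is d * w, i.e., the weight-decay term
h = 1e-6
for i in range(len(w)):
    e = np.zeros_like(w); e[i] = h
    numeric = (penalty(w + e) - penalty(w - e)) / (2 * h)
    assert abs(numeric - d * w[i]) < 1e-9
print("decay term equals d * w")
```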
Backpropagation Issues
Quickprop (Scott Fahlman, 1988)
• Backpropagation is notoriously slow.
• Quickprop has the same philosophy as Newton-Raphson: assume the error surface is quadratic and jump in one step to the minimum of the quadratic.
28
Backpropagation Issues
• Other activation functions
  – Sigmoid: f(x) = (1 + e^−x)^−1
  – Hyperbolic tangent: f(x) = tanh(x)
  – Step: f(x) = U(x)
  – Tan sigmoid: f(x) = (e^cx − e^−cx) / (e^cx + e^−cx) for some positive constant c
• How many hidden layers should we use?
29
Universal Approximation Theorem
• A feed-forward ANN with one hidden layer and a finite number of neurons can approximate any continuous function to any desired accuracy.
• The ANN activation functions can be any continuous, nonconstant, bounded, monotonically increasing functions.
• The desired weights may not be obtainable via backpropagation.
• George Cybenko, 1989; Kurt Hornik, 1991
30
Termination Criterion
If we train too long, we begin to "memorize" the training data and lose the ability to generalize. Train with a validation/test set.
31
[Figure: error vs. training time; training-set error keeps decreasing, while validation/test-set error eventually begins to rise]
Termination Criterion
Cross Validation
• N data partitions
• N training runs, each using (N−1) partitions for training and 1 partition for validation/test
• Each training run, store the number of epochs ci for the best test-set performance (i = 1, …, N)
• cave = mean{ci}
• Train on all data for cave epochs
32
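The bookkeeping in the procedure above can be sketched as follows. The inner training run is faked with a stand-in function (the partition counts and data indices are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def epochs_at_best_test(train_idx, test_idx):
    """Stand-in for one training run: should return c_i, the epoch count at
    which test-set performance was best. Faked here with a random number so
    that the cross-validation bookkeeping itself can be demonstrated."""
    return int(rng.integers(50, 150))

N = 5                                    # number of partitions
data_idx = rng.permutation(np.arange(100))
parts = np.array_split(data_idx, N)      # N partitions of the data

c = []
for i in range(N):                       # N training runs
    test_idx = parts[i]                  # 1 partition for validation/test
    train_idx = np.concatenate([parts[j] for j in range(N) if j != i])
    c.append(epochs_at_best_test(train_idx, test_idx))

c_ave = int(np.mean(c))                  # then train on ALL data for c_ave epochs
print(c_ave)
```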
Adaptive Backpropagation
Recall the standard weight update: wij ← wij − η δj yi
• With adaptive learning rates, each weight wij has its own rate ηij
• If the sign of Δwij is the same over several backprop updates, then increase ηij
• If the sign of Δwij is not the same over several backprop updates, then decrease ηij
33
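The sign-based rule above can be sketched as follows (a delta-bar-delta-style sketch; the increase/decrease factors 1.2 and 0.5 and the example gradient history are my own illustrative choices):

```python
import numpy as np

def adapt_rates(grad_history, eta, up=1.2, down=0.5):
    """Per-weight learning-rate adaptation: grow a weight's rate when its
    recent updates all share the same sign, shrink it otherwise.
    grad_history is a list of recent update arrays for the same weights."""
    signs = np.sign(np.array(grad_history))
    same = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    return np.where(same, eta * up, eta * down)

eta = np.full(3, 0.1)
# Weight 0: consistent sign -> rate grows; weights 1, 2: sign flips -> shrink
history = [np.array([ 0.3, -0.2,  0.1]),
           np.array([ 0.2,  0.1, -0.1]),
           np.array([ 0.4, -0.3,  0.2])]
eta = adapt_rates(history, eta)
print(eta)
```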
Double Backpropagation
P = number of input training patterns. We want an ANN that can generalize, so input changes should not result in large error changes.
34
In addition to minimizing the training error:
E1 = (1/2) Σk=1..no (tk − ok)²
Also minimize the sensitivity of the training error to the input data:
E2 = (1/2) Σk (∂E1/∂xk)²
Other ANN Training Methods
Gradient-free approaches (GAs, BBO, etc.)
• Global optimization
• Combination with gradient descent
• We can train the structure as well as the weights
• We can use non-differentiable activation functions
• We can use non-differentiable cost functions
35
BBO.m
Classification Benchmarks
The Iris classification problem
• 150 data samples
• Four input feature values (sepal length and width, and petal length and width)
• Three types of irises: Setosa, Versicolour, and Virginica
36
Classification Benchmarks
• The two-spirals classification problem
• UC Irvine Machine Learning Repository – http://archive.ics.uci.edu/ml (194 benchmarks!)
37
Radial Basis Functions
J. Moody and C. Darken, 1989
Universal approximators
38
N middle-layer neurons
Inputs x
Activation functions f(x, ci)
Output weights wik

yk = Σi wik f(x, ci) = Σi wik φ(||x − ci||)

φ(.) is a basis function
lim||x−ci||→∞ φ(||x − ci||) = 0
{ci} are the N RBF centers
Radial Basis Functions
Common basis functions:
• Gaussian: φ(||x − ci||) = exp(−||x − ci||² / σ²), where σ is the width of the basis function
• Many other proposed basis functions
39
Radial Basis Functions
Suppose we have the data set (xi, yi), i = 1, …, N
Each xi is multidimensional, each yi is scalar
Set ci = xi, i = 1, …, N
Define gik = φ(||xi − xk||)
Input each xi to the RBF to obtain:
40

[ g11 … g1N ] [ w1 ]   [ y1 ]
[  ⋮  ⋱  ⋮  ] [ ⋮  ] = [ ⋮  ]
[ gN1 … gNN ] [ wN ]   [ yN ]

Gw = y
G is nonsingular if the {xi} are distinct, so solve for w.
Global minimum (assuming fixed c and σ).
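The exact-interpolation case above can be sketched in a few lines. The data set, target function, and width σ = 1 are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data set (x_i, y_i); centers c_i = x_i, so G is N x N
X = rng.uniform(-1, 1, (8, 2))          # 8 distinct multidimensional inputs
y = np.sin(X[:, 0]) + X[:, 1] ** 2      # scalar targets (any function works)

sigma = 1.0
def phi(r):
    """Gaussian basis function of the distance r."""
    return np.exp(-r ** 2 / sigma ** 2)

# g_ik = phi(||x_i - x_k||)
G = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
w = np.linalg.solve(G, y)               # solve Gw = y

# The RBF now reproduces the training targets (up to roundoff)
residual = np.max(np.abs(G @ w - y))
print(residual)
```

With Gaussian bases and distinct inputs, G is nonsingular, so the solve succeeds; for the m < N case on the next slide, `np.linalg.lstsq` or `np.linalg.pinv` plays the role of G⁺.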
Radial Basis Functions
We again have the data set (xi, yi), i = 1, …, N
Each xi is multidimensional, each yi is scalar
ck are given for k = 1, …, m, with m < N
Define gik = φ(||xi − ck||)
Input each xi to the RBF to obtain:
41

[ g11 … g1m ] [ w1 ]   [ y1 ]
[  ⋮  ⋱  ⋮  ] [ ⋮  ] ≈ [ ⋮  ]
[ gN1 … gNm ] [ wm ]   [ yN ]

Gw = y
w = (GᵀG)⁻¹Gᵀy = G⁺y   (least-squares solution via the pseudoinverse)
Radial Basis Functions
How can we choose the RBF centers?
• Randomly select them from the inputs
• Use a clustering algorithm
• Other options (BBO?)
How can we choose the RBF widths?
42
Other Types of ANNs
Many other types of ANNs:
• Cerebellar Model Articulation Controller (CMAC)
• Spiking neural networks
• Self-organizing map (SOM)
• Recurrent neural network (RNN)
• Hopfield network
• Boltzmann machine
• Cascade-Correlation
• and many others …
43
Sources
• Neural Networks, by C. Stergiou and D. Siganos, www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
• The Backpropagation Algorithm, by A. Venkataraman, www.speech.sri.com/people/anand/771/html/node37.html
• CS 478 Course Notes, by Tony Martinez, http://axon.cs.byu.edu/~martinez
44