Data Mining, Part 5: Prediction
5.5. Prediction by Neural Networks
Spring 2010
Instructor: Dr. Masoud Yaghini
Outline
• How the Brain Works
• Artificial Neural Networks
• Simple Computing Elements
• Feed-Forward Networks
• Perceptrons (Single-layer, Feed-Forward Neural Network)
• Perceptron Learning Method
• Multilayer Feed-Forward Neural Network
• Defining a Network Topology
• Backpropagation Algorithm
• Backpropagation and Interpretability
• Discussion
• References
How the Brain Works
• Neuron (nerve cell)
– The fundamental functional unit of all nervous system tissue, including the brain.
– There are about 10^11 neurons in the human brain.
• Neuron components
– Soma (cell body):
• provides the support functions and structure of the cell, and contains the cell nucleus.
– Dendrites:
• branching fibers that receive signals from other nerve cells.
– Axon:
• a branching fiber that carries signals away from the neuron and connects to the dendrites and cell bodies of other neurons.
• In reality, the length of the axon is about 100 times the diameter of the cell body.
– Synapse:
• the connecting junction between axon and dendrites.
How the Brain Works
• The parts of a nerve cell, or neuron (figure).
Neuron Firing Process
1. Synapses receive incoming signals, which change the electrical potential of the cell body.
2. When the potential of the cell body reaches some limit, the neuron "fires" and an electrical signal (action potential) is sent down the axon.
3. The axon propagates the signal to other neurons downstream.
How the Brain Works
• How synapses work:
– Excitatory synapse: increases potential
– Synaptic connection: plasticity
– Inhibitory synapse: decreases potential
• Migration of neurons
– Neurons also form new connections with other neurons.
– Sometimes entire collections of neurons can migrate from one place to another.
– These mechanisms are thought to form the basis for learning in the brain.
• A collection of simple cells can lead to thought, action, and consciousness.
Comparing Brains with Digital Computers
• Advantages of a human brain vs. a computer
– Parallelism: all the neurons and synapses are active simultaneously, whereas most current computers have only one or at most a few CPUs.
– More fault-tolerant: a hardware error that flips a single bit can doom an entire computation, but brain cells die all the time with no ill effect on the overall functioning of the brain.
– Inductive algorithm: the brain can be trained using an inductive learning algorithm.
Artificial Neural Networks
• Artificial Neural Networks (ANN) were started by psychologists and neurobiologists to develop and test computational analogues of neurons.
• Other names:
– connectionist learning,
– parallel distributed processing,
– neural computation,
– adaptive networks, and
– collective computation
Artificial Neural Networks
• Artificial neural network components:
– Units
• A neural network is composed of a number of nodes, or units
• A metaphor for the nerve cell body
– Links
• Units are connected by links
• Links represent synaptic connections from one unit to another
– Weights
• Each link has a numeric weight
Artificial Neural Networks
• An example of an ANN (figure).
Artificial Neural Networks
• Long-term memory
– Weights are the primary means of long-term storage in neural networks
• Learning method
– Learning usually takes place by adjusting the weights
• Input and output units
– Some of the units are connected to the external environment, and can be designated as input units or output units
Artificial Neural Networks
• Components of a unit:
– a set of input links from other units,
– a set of output links to other units,
– a current activation level, and
– a means of computing the activation level at the next step in time, given its inputs and weights.
• The idea is that each unit does a local computation based on inputs from its neighbors, but without the need for any global control over the set of units as a whole.
Artificial Neural Networks
• Real (biological) neural network vs. artificial neural network:

Real Neural Network | Artificial Neural Network
Soma / cell body    | Neuron / node / unit
Dendrite            | Input links
Axon                | Output links
Synapse             | Weight
Artificial Neural Networks
• Neural networks can be used for both
– supervised learning, and
– unsupervised learning
• For supervised learning, neural networks can be used for both
– classification (to predict the class label of a given example), and
– prediction (to predict a continuous-valued output).
• In this chapter we discuss the application of neural networks to supervised learning.
Artificial Neural Networks
• To build a neural network, one must decide:
– how many units are to be used,
– what kind of units are appropriate, and
– how the units are to be connected to form a network.
• Then one
– initializes the weights of the network, and
– trains the weights using a learning algorithm applied to a set of training examples for the task.
• The use of examples also implies that one must decide how to encode the examples in terms of inputs and outputs of the network.
Simple Computing Elements
• Each unit performs a simple process:
– receives n inputs,
– multiplies each input by its weight,
– applies an activation function to the sum of the results, and
– outputs the result.
Simple computing elements
• Two computational components
– Linear component: the input function, in_i, which computes the weighted sum of the unit's input values.
– Nonlinear component: the activation function, g, which transforms the weighted sum into the final value that serves as the unit's activation value, a_i.
• Usually, all units in a network use the same activation function.

Simple computing elements
• A typical unit (figure).
Simple computing elements
• Total weighted input
– The weights on links from node j into node i are denoted by W_{j,i}
– The input values are called a_j
– The total weighted input is:

  in_i = Σ_j W_{j,i} a_j

Example: Total weighted input
• Input: (3, 1, 0, -2); weights: (0.3, -0.1, 2.1, -1.1)
• Processing:
  3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1)
  = 0.9 - 0.1 + 0 + 2.2
  = 3
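As a quick check, the weighted-sum computation above can be sketched in a few lines of Python (the variable names are ours, not the slides'):

```python
# Total weighted input of a unit: in_i = sum over j of W[j] * a[j]
weights = [0.3, -0.1, 2.1, -1.1]   # W_{j,i} from the example
inputs  = [3, 1, 0, -2]            # a_j from the example

in_i = sum(w * a for w, a in zip(weights, inputs))
print(in_i)  # 3.0
```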
Simple computing elements
• The activation function g
– The unit's activation value is:

  a_i = g(in_i) = g(Σ_j W_{j,i} a_j)

• Three common mathematical functions for g are:
– Step function
– Sign function
– Sigmoid function

Simple computing elements
• Three common mathematical functions for g (figure).
Step Function
• The step function has a threshold t such that it outputs a 1 when the input is greater than its threshold, and outputs a 0 otherwise.
• The biological motivation is that a 1 represents the firing of a pulse down the axon, and a 0 represents no firing.
• The threshold represents the minimum total weighted input necessary to cause the neuron to fire.
Step Function Example
• Let t = 4. Using the same unit as before (inputs (3, 1, 0, -2), weights (0.3, -0.1, 2.1, -1.1)), the total weighted input is 3, so:

  Step_4(3) = 0   (since 3 < 4, the unit does not fire)
Step Function
• It is mathematically convenient to replace the threshold with an extra input weight, because then we need only worry about adjusting weights, rather than adjusting both weights and thresholds.
• Thus, instead of having a threshold t for each unit, we add an extra input whose activation a_0 is fixed:

  a_i = Step_t(Σ_{j=1..n} W_{j,i} a_j) = Step_0(Σ_{j=0..n} W_{j,i} a_j)

  where W_{0,i} = t and a_0 = -1 (fixed).
Step Function
• The figure shows how the Boolean functions AND, OR, and NOT can be represented by units with a step function and suitable weights and thresholds.
• This is important because it means we can use these units to build a network to compute any Boolean function of the inputs.
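A small sketch of this idea in Python. The weights and thresholds below are the standard choices for these gates (assumed here, since the slides' figure did not survive transcription):

```python
def step(x, t):
    """Step activation: 1 if the total input exceeds threshold t, else 0."""
    return 1 if x > t else 0

def unit(inputs, weights, t):
    """A single step-function unit: weighted sum followed by a threshold."""
    total = sum(w * a for w, a in zip(weights, inputs))
    return step(total, t)

# AND: both inputs must be on (weights 1, 1; threshold 1.5)
AND = lambda i1, i2: unit([i1, i2], [1, 1], 1.5)
# OR: at least one input on (weights 1, 1; threshold 0.5)
OR = lambda i1, i2: unit([i1, i2], [1, 1], 0.5)
# NOT: a single negative weight (weight -1; threshold -0.5)
NOT = lambda i1: unit([i1], [-1], -0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))  # 1 0
```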
Sigmoid Function
• A sigmoid function is often used to approximate the step function:

  f(x) = 1 / (1 + e^(-σx))

  where σ is the steepness parameter.
Sigmoid Function
• Input: (3, 1, 0, -2); weights (0.3, -0.1, 2.1, -1.1); σ = 1
• The total weighted input is 3, so:

  f(3) = 1 / (1 + e^(-3)) ≈ 0.95
Sigmoid Function
[Figure: plots of 1/(1+exp(-x)) and 1/(1+exp(-10x)) over x in [-5, 5]; note that sigmoid(0) = 0.5]
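These values are easy to verify in Python (a minimal sketch; the function name is ours):

```python
import math

def sigmoid(x, steepness=1.0):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-sigma * x))."""
    return 1.0 / (1.0 + math.exp(-steepness * x))

print(sigmoid(3))                # ~0.9526, matching the example above
print(sigmoid(0))                # 0.5
print(sigmoid(3, steepness=10))  # ~1.0: a large sigma approaches the step function
```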
Another Example
• A two-weight-layer, feed-forward network
• Two inputs, one output, one 'hidden' unit
• Input: (3, 1); weights: 0.5 and -0.5 into the hidden unit, 0.75 from the hidden unit to the output
• Activation function: f(x) = 1 / (1 + e^(-x))
• What is the output?
Computing in Multilayer Networks
• Computing:
– Start at the leftmost layer
– Compute activations based on inputs
– Then work from left to right, using the computed activations as inputs to the next layer
• Example solution, with f(x) = 1 / (1 + e^(-x)):
– Activation of the hidden unit:
  f(0.5(3) + (-0.5)(1)) = f(1.5 - 0.5) = f(1) = 0.731
– Output activation:
  f(0.731(0.75)) = f(0.548) = 0.634
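The same left-to-right computation, sketched in Python (the layer structure is hard-coded to this tiny network; names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two inputs -> one hidden unit -> one output unit
inputs = [3, 1]
w_hidden = [0.5, -0.5]   # weights into the hidden unit
w_output = 0.75          # weight from the hidden unit to the output unit

hidden = sigmoid(sum(w * a for w, a in zip(w_hidden, inputs)))
output = sigmoid(hidden * w_output)
print(round(hidden, 3), round(output, 3))  # 0.731 0.634
```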
Feed-Forward Networks
• Feed-forward networks
– Unidirectional links
– Directed acyclic graph (DAG, i.e., no cycles)
– No links between units in the same layer
– No links backward to a previous layer
– No links that skip a layer
– Uniform processing from input units to output units
Feed-forward Networks
• An example: a two-layer, feed-forward network with two inputs, two hidden nodes, and one output node (figure).
Feed-forward Networks
• Units
– Input units: the activation value of each of these units is determined by the environment.
– Output units: the units at the right-hand end of the network.
– Hidden units: units that have no direct connection to the outside world.
• Because the input units (square nodes) simply serve to pass activation to the next layer, they are not counted as a layer.
Feed-forward Networks
• Types of feed-forward networks:
– Perceptrons
• No hidden units
• This makes the learning problem much simpler, but it means that perceptrons are very limited in what they can represent.
– Multilayer networks
• One or more hidden units
Feed-forward Networks
• Feed-forward networks have a fixed structure and fixed activation functions g.
• The functions have a specific parameterized structure.
• The weights chosen for the network determine which of these functions is actually represented.
• For example, the two-layer network shown earlier (inputs 1, 2; hidden nodes 3, 4; output 5) calculates the following function:

  a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
      = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))

  where g is the activation function and a_i is the output of node i.
What neural networks do
• Because the activation functions g are nonlinear, the whole network represents a complex nonlinear function.
• If you think of the weights as parameters or coefficients of this function, then learning is simply a process of tuning the parameters to fit the data in the training set, a process that statisticians call nonlinear regression.
Optimal Network Structure
• Too small a network
– incapable of representation
• Too big a network
– does not generalize well
– overfitting occurs when there are too many parameters
Perceptrons (Single-layer, Feed-forward Neural Networks)
• Perceptrons
– Single-layer feed-forward networks
– First studied in the late 1950s
• Types of perceptrons:
– Single-output perceptron: a perceptron with a single output unit
– Multi-output perceptron: a perceptron with several output units
Perceptrons
• Each output unit is independent of the others.
• Each weight affects only one of the outputs.
Perceptrons
• Activation of the output unit:

  O = Step_0(Σ_j W_j I_j) = Step_0(W · I)

– W_j: the weight from input unit j
– I_j: the activation of input unit j
– We have assumed an additional weight W_0 to provide a threshold for the step function, with I_0 = -1.
Perceptrons
• Perceptrons are severely limited in the Boolean functions they can represent.
• The problem is that any input I_j can only influence the final output in one direction, no matter what the other input values are.
• Consider some input vector a.
– Suppose that this vector has a_j = 0 and that the vector produces a 0 as output. Furthermore, suppose that when a_j is replaced with 1, the output changes to 1. This implies that W_j must be positive.
– It also implies that there can be no input vector b for which the output is 1 when b_j = 0, but the output is 0 when b_j is replaced with 1.
Perceptrons
• The figure shows three different Boolean functions of two inputs: the AND, OR, and XOR functions.
• Black dots indicate a point in the input space where the value of the function is 1, and white dots indicate a point where the value is 0.
[Figure: black/white dot plots of (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2]
Perceptrons
• As we will explain, a perceptron can represent a function only if there is some line that separates all the white dots from the black dots.
• Such functions are called linearly separable.
• Thus, a perceptron can represent AND and OR, but not XOR (which is 1 only when I1 ≠ I2).
Perceptrons
• The fact that a perceptron can only represent linearly separable functions follows directly from the equation:

  O = Step_0(Σ_j W_j I_j) = Step_0(W · I)

• A perceptron outputs a 1 only if W · I > 0.
– This means that the entire input space is divided in two along a boundary defined by W · I = 0,
– that is, a plane in the input space with coefficients given by the weights.
Perceptrons
• This is easiest to understand for the case where n = 2. In figure (a), one possible separating "plane" is the dotted line defined by the equation:

  I_1 = -I_2 + 1.5

• The region above the line, where the output is 1, is therefore given by I_1 + I_2 - 1.5 > 0.
Perceptron Learning Method
• The initial network has randomly assigned weights, usually in the range [-0.5, 0.5].
• The network is then updated to try to make it consistent with the training examples (instances).
• This is done by making small adjustments in the weights to reduce the difference between the observed and predicted values.
• The update phase typically needs to be repeated several times for each example in order to achieve convergence.
Perceptron Learning Method
• Epochs
– The updating process is divided into epochs.
– Each epoch involves updating all the weights for all the examples.

Perceptron Learning Method
• The generic neural network learning method
[Figure: the generic learning algorithm; one pass through all the examples is an epoch, and the error is Err = T - O]
Perceptron Learning Method
• The weight update rule
– If the predicted output for the single output unit is O, and the correct output should be T, then the error is given by Err = T - O.
– If Err is positive, we need to increase O.
– If Err is negative, we need to decrease O.
– Each input unit contributes W_j I_j to the total input, so:
– If I_j is positive, an increase in W_j will tend to increase O.
– If I_j is negative, an increase in W_j will tend to decrease O.
Perceptron Learning Method
• We can achieve the effect we want with the following rule (see the sketch below):

  W_j ← W_j + α × I_j × Err

  where α is the learning rate.
• This rule is a variant of the perceptron learning rule proposed by Frank Rosenblatt.
– Rosenblatt proved that a learning system using the perceptron learning rule will converge to a set of weights that correctly represents the examples, as long as the examples represent a linearly separable function.
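A minimal Python sketch of this update rule, assuming a step-function perceptron with the threshold folded in as weight W_0 on a fixed input I_0 = -1 (function names, the learning-rate value, and the OR demo are our own illustrative choices):

```python
def predict(weights, inputs):
    """Step-function perceptron output: 1 if W . I > 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=20):
    """examples: list of (inputs, target) pairs. Applies the rule
    W_j <- W_j + alpha * I_j * Err, where Err = T - O."""
    weights = [0.0] * (n_inputs + 1)      # weights[0] is the threshold weight W_0
    for _ in range(epochs):
        for inputs, target in examples:
            x = [-1] + list(inputs)       # fixed bias input I_0 = -1
            err = target - predict(weights, x)
            weights = [w + alpha * xi * err for w, xi in zip(weights, x)]
    return weights

# Learning the (linearly separable) OR function
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data, n_inputs=2)
print([predict(w, [-1] + list(i)) for i, _ in data])  # [0, 1, 1, 1]
```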
Delta Rule for a Single Output Unit
• The delta rule changes the jth weight of the weight vector:

  ΔW_j = α (T - O) I_j

  where:
– ΔW_j: change in the jth weight of the weight vector
– α: learning rate
– I_j: jth input value
– O: net (summed, weighted) input to the output unit
– T: target (correct) output
• The unit has n inputs and 1 output, fully connected.
Example
• W = (W1, W2, W3); initially W = (.5 .2 .4)
• Let α = 0.5
• Apply the delta rule to the following training data:

Sample | Input | Output
1      | 0 0 0 | 0
2      | 1 1 1 | 1
3      | 1 0 0 | 1
4      | 0 0 1 | 1
One Epoch of Training
• Delta rule: ΔW_j = α (T - O) I_j, with α = 0.5
• Here O is the net (summed, weighted) input, O = W · I
– e.g., step 1: ΔW_j = 0.5(0 - 0)I_j = 0 for all j

Step | Input   | Desired output (T) | Actual output (O) | Starting weights | Weight updates
1    | (0 0 0) | 0 | 0   | (.5 .2 .4)     | (0 0 0)
2    | (1 1 1) | 1 | 1.1 | (.5 .2 .4)     | (-.05 -.05 -.05)
3    | (1 0 0) | 1 | .45 | (.45 .15 .35)  | (.275 0 0)
4    | (0 0 1) | 1 | .35 | (.725 .15 .35) | (0 0 .325)
Completing the Example
• After 18 epochs, the weights are:
– W1 = 0.990735
– W2 = -0.970018005
– W3 = 0.98147
• Does this adequately approximate the training data?

Sample | Input | Output
1      | 0 0 0 | 0
2      | 1 1 1 | 1
3      | 1 0 0 | 1
4      | 0 0 1 | 1
Example
• Actual outputs:

Sample | Input | Desired Output | Actual Output
1      | 0 0 0 | 0 | 0
2      | 1 1 1 | 1 | 1.002187
3      | 1 0 0 | 1 | 0.990735
4      | 0 0 1 | 1 | 0.98147
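The run above can be reproduced with a short Python sketch, assuming (as the tables do) a linear output unit O = W · I and case updating after each example:

```python
# Delta rule: dW_j = alpha * (T - O) * I_j, with a linear output unit O = W . I
alpha = 0.5
weights = [0.5, 0.2, 0.4]
data = [([0, 0, 0], 0), ([1, 1, 1], 1), ([1, 0, 0], 1), ([0, 0, 1], 1)]

for epoch in range(18):
    for inputs, target in data:
        output = sum(w * x for w, x in zip(weights, inputs))
        weights = [w + alpha * (target - output) * x
                   for w, x in zip(weights, inputs)]

print(weights)  # approximately [0.9907, -0.9700, 0.9815]
for inputs, target in data:
    print(target, sum(w * x for w, x in zip(weights, inputs)))
```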
Examples in ANN
• There is a slight difference between the example descriptions used for neural networks and those used for other attribute-based methods such as decision trees.
• In a neural network, all inputs are real numbers in some fixed range, whereas decision trees allow for multivalued attributes with a discrete set of values.
• For example, an attribute may have the values None, Some, and Full.
Perceptron Learning Method
• There are two ways to handle this:
– Local encoding
• We use a single input unit and pick an appropriate number of distinct values to correspond to the discrete attribute values.
• For example, we can use None = 0.0, Some = 0.5, and Full = 1.0.
– Distributed encoding
• We use one input unit for each value of the attribute, turning on the unit that corresponds to the correct value.
Multilayer Feed-Forward Neural Network
Multilayer Feed-Forward Neural Network
• A multilayer feed-forward neural network consists of several layers, including:
– an input layer,
– one or more hidden layers, and
– an output layer.
Multilayer Feed-Forward Neural Network
• Each layer is made up of units.
• A two-layer neural network has a hidden layer and an output layer.
• The input layer is not counted because it serves only to pass the input values to the next layer.
• A network containing two hidden layers is called a three-layer neural network, and so on.
Multilayer Feed-Forward Neural Network
• Suppose we want to construct a network for a problem.
• We have ten attributes describing each example, so we will need ten input units.
• How many hidden units are needed?
– The problem of choosing the right number of hidden units in advance is still not well understood.
• We use a network with four hidden units.

Multilayer Feed-Forward Neural Network
• A two-layer feed-forward network (figure).
Multilayer Feed-Forward Neural Network
• The inputs to the network correspond to the attributes measured for each training example.
• Inputs are fed simultaneously into the units making up the input layer.
• They are then weighted and fed simultaneously to a hidden layer.
• The number of hidden layers is arbitrary, although in practice usually only one is used.
• The weighted outputs of the last hidden layer are input to the units making up the output layer, which sends out the network's prediction.
Multilayer Feed-Forward Neural Network
• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
• From a statistical point of view, such networks perform nonlinear regression.
• Given enough hidden units and enough training samples, they can closely approximate any function.
Multilayer Feed-Forward Neural Network
• Learning method
– Example inputs are presented to the network, and the network computes an output vector that should match the target.
– If there is an error (a difference between the output and target), then the weights are adjusted to reduce this error.
– The trick is to assess the blame for an error and divide it among the contributing weights.
– In perceptrons, this is easy, because there is only one weight between each input and the output.
– But in multilayer networks, there are many weights connecting each input to an output, and each of these weights contributes to more than one output.
Defining a Network Topology
• First decide the network topology:
– the number of units in the input layer,
– the number of hidden layers (if > 1),
– the number of units in each hidden layer, and
– the number of units in the output layer.
Defining a Network Topology
• Input units
– Normalizing the input values for each attribute measured in the training examples to [0.0, 1.0] will help speed up the learning phase.
– Discrete-valued attributes may be encoded such that there is one input unit per domain value.
• For example, if an attribute A has three possible or known values, namely {a0, a1, a2}, then we may assign three input units, say I0, I1, I2, to represent A.
• Each unit is initialized to 0.
• Then:
– I0 is set to 1 if A = a0,
– I1 is set to 1 if A = a1,
– I2 is set to 1 if A = a2.
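A minimal Python sketch of this distributed (one-hot) encoding; the helper name and value strings are illustrative:

```python
def one_hot(value, domain):
    """Distributed encoding: one input unit per domain value,
    set to 1 for the matching value and 0 otherwise."""
    return [1 if value == v else 0 for v in domain]

domain = ["a0", "a1", "a2"]
print(one_hot("a1", domain))  # [0, 1, 0], i.e. units (I0, I1, I2)
```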
Defining a Network Topology
• Output units
– For classification, one output unit may be used to represent two classes (where the value 1 represents one class and the value 0 represents the other).
– If there are more than two classes, then one output unit per class is used.
Defining a Network Topology
• Hidden layer units
– There are no clear rules as to the "best" number of hidden layer units.
– Network design is a trial-and-error process and may affect the accuracy of the resulting trained network.
• If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
Optimal Network Structure
• Using a genetic algorithm to find a good network structure
• Hill-climbing search (modifying an existing network structure)
– Start with a big network: the optimal brain damage algorithm
• Removes weights from a fully connected model
– Start with a small network: the tiling algorithm
• Starts with a single unit and adds subsequent units
• Cross-validation techniques are useful for deciding when we have found a network of the right size.
Backpropagation Algorithm
Backpropagation
• The backpropagation algorithm performs learning on a multilayer feed-forward neural network.
• It is the most popular method for learning in multilayer networks.
• Backpropagation iteratively processes a set of training examples and compares the network's prediction with the actual known target value.
• The target value may be the known class label of the training example (for classification problems) or a continuous value (for prediction problems).
Backpropagation
• For each training example, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value.
• Modifications are made in the "backwards" direction:
– from the output layer, through each hidden layer, down to the first hidden layer; hence "backpropagation".
– Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.
Backpropagation
• Backpropagation algorithm steps:
– Initialize the weights
• Initialize the weights and biases in the network to small random values
– Propagate the inputs forward
• by applying the activation function
– Backpropagate the error
• by updating the weights and biases
– Terminating condition
• stop when the error is very small, etc.
Backpropagation Algorithm
• Input:
– D, a data set consisting of the training examples and their associated target values
– l, the learning rate
– network, a multilayer feed-forward network
• Output:
– A trained neural network.
Initialize the Weights
• 1) Initialize the weights
– The weights in the network are initialized to small random numbers
– e.g., ranging from -1.0 to 1.0, or from -0.5 to 0.5
– Each unit has a bias associated with it
– The biases are similarly initialized to small random numbers.
• Each training example is processed by steps 2 to 8.
Propagate the Inputs Forward
• 2) Determine the output of the input layer units
– The training example is fed to the input layer of the network.
– The inputs pass through the input units, unchanged.
– For an input unit j, its output, O_j, is equal to its input value, I_j.
Propagate the Inputs Forward
• 3) Compute the net input of each unit in the hidden and output layers
– The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs.
– Given a unit j in a hidden or output layer, the net input, I_j, to unit j is:

  I_j = Σ_i w_ij O_i + θ_j

• where w_ij is the weight of the connection from unit i in the previous layer to unit j,
• O_i is the output of unit i from the previous layer, and
• θ_j is the bias of the unit.
Propagate the Inputs Forward
• A hidden or output layer unit j (figure).
Propagate the Inputs Forward
• 4) Compute the output of each unit j in the hidden and output layers
– The output of each unit is calculated by applying an activation function to its net input.
– The logistic, or sigmoid, function is used.
– Given the net input I_j to unit j, then O_j, the output of unit j, is computed as:

  O_j = 1 / (1 + e^(-I_j))
Backpropagate the Error
• 5) Compute the error for each unit j in the output layer
– For a unit j in the output layer, the error Err_j is computed by:

  Err_j = O_j (1 - O_j)(T_j - O_j)

– O_j is the actual output of unit j,
– T_j is the known target value of the given training example.
– Note that O_j (1 - O_j) is the derivative of the logistic function.
Backpropagate the Error
• 6) Compute the error for each unit j in the hidden layers, from the last to the first hidden layer
– The error of a hidden layer unit j is:

  Err_j = O_j (1 - O_j) Σ_k Err_k w_jk

– w_jk is the weight of the connection from unit j to a unit k in the next higher layer, and
– Err_k is the error of unit k.
Backpropagate the Error
• 7) Update the weights, for each weight w_ij in the network
– Weights are updated by the following equations:

  Δw_ij = (l) Err_j O_i
  w_ij = w_ij + Δw_ij

– Δw_ij is the change in weight w_ij
– The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0
Backpropagate the Error
• Learning rate
– Backpropagation learns using a method of gradient descent.
– The learning rate helps avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution) and encourages finding the global minimum.
– If the learning rate is too small, then learning will occur at a very slow pace.
– If the learning rate is too large, then oscillation between inadequate solutions may occur.
– A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.
Backpropagate the Error
• 8) Update the bias θ_j for each unit in the network
– Biases are updated by the following equations:

  Δθ_j = (l) Err_j
  θ_j = θ_j + Δθ_j

– Δθ_j is the change in bias θ_j
• There are two strategies for updating the weights and biases.
Backpropagate the Error
• Updating strategies:
– Case updating
• Updating the weights and biases after the presentation of each example.
• Case updating is more common because it tends to yield more accurate results.
– Epoch updating
• The weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the examples in the training set have been presented.
• One iteration through the training set is an epoch.
Terminating Condition
• 9) Check the stopping condition
– After all training examples have been processed, we must evaluate the stopping condition.
– Stopping condition: training stops when
• all Δw_ij in the previous epoch were so small as to be below some specified threshold, or
• the percentage of examples misclassified in the previous epoch is below some threshold, or
• a prespecified number of epochs has expired.
– In practice, several hundred thousand epochs may be required before the weights converge.
– If the stopping condition is not met, steps 2 to 8 are repeated for all training examples.
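Putting steps 1 to 9 together, here is a compact, self-contained Python sketch of backpropagation for a single-hidden-layer network with sigmoid units and case updating. The structure, names, learning rate, and the XOR demo are our own illustrative choices, not code from the slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(data, n_in, n_hidden, l=0.5, epochs=10000):
    """data: list of (inputs, target) pairs with a single output unit."""
    rnd = random.Random(0)
    # Step 1: initialize weights and biases to small random numbers
    w_h = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = rnd.uniform(-0.5, 0.5)

    for _ in range(epochs):
        for x, t in data:
            # Steps 2-4: propagate the inputs forward
            o_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                   for ws, b in zip(w_h, b_h)]
            o = sigmoid(sum(w * h for w, h in zip(w_o, o_h)) + b_o)
            # Step 5: output-layer error Err = O(1 - O)(T - O)
            err_o = o * (1 - o) * (t - o)
            # Step 6: hidden-layer errors Err_j = O_j(1 - O_j) * Err_k * w_jk
            err_h = [h * (1 - h) * err_o * w for h, w in zip(o_h, w_o)]
            # Steps 7-8: update weights and biases (case updating)
            w_o = [w + l * err_o * h for w, h in zip(w_o, o_h)]
            b_o += l * err_o
            for j in range(n_hidden):
                w_h[j] = [w + l * err_h[j] * xi for w, xi in zip(w_h[j], x)]
                b_h[j] += l * err_h[j]
    return w_h, b_h, w_o, b_o

def predict(net, x):
    w_h, b_h, w_o, b_o = net
    o_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
           for ws, b in zip(w_h, b_h)]
    return sigmoid(sum(w * h for w, h in zip(w_o, o_h)) + b_o)

# Demo: XOR, which a single-layer perceptron cannot represent
xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
net = train(xor, n_in=2, n_hidden=3)
# Typically approaches [0, 1, 1, 0]; exact values depend on the random init
print([round(predict(net, x), 2) for x, _ in xor])
```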
Efficiency of Backpropagation
• The computational efficiency depends on the time spent training the network.
• In the worst-case scenario, the number of epochs can be exponential in n, the number of inputs.
• In practice, the time required for the networks to converge is highly variable.
• A number of techniques exist that help speed up the training time.
– Metaheuristic algorithms such as simulated annealing can be used, which also ensures convergence to a global optimum.
Example
• The figure shows a multilayer feed-forward neural network.
Example
• This example shows the calculations for backpropagation, given the first training example, X.
• Let the learning rate be 0.9.
• The initial weight and bias values of the network are given in the table, along with the first training example, X = (1, 0, 1), whose class label is 1.
Example
• The net input and output calculations (table) use:

  I_j = Σ_i w_ij O_i + θ_j
  O_j = 1 / (1 + e^(-I_j))
Example
• Calculation of the error at each node (table):
– The output layer: Err_j = O_j (1 - O_j)(T_j - O_j)
– The hidden layer: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
Example
• Calculations for weight and bias updating (table) use:

  Δw_ij = (l) Err_j O_i,  w_ij = w_ij + Δw_ij
  Δθ_j = (l) Err_j,  θ_j = θ_j + Δθ_j
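The slides' table of initial weights and biases did not survive transcription. The sketch below runs one forward/backward pass of these formulas end to end, using the initial values from the corresponding worked example in Han & Kamber (Chapter 6), which we assume is the source of this slide:

```python
import math

# One pass for X = (1, 0, 1), target T = 1, learning rate l = 0.9.
# Initial weights/biases assumed from Han & Kamber's example.
# Units 1-3 are inputs, 4-5 are hidden, 6 is the output.
l = 0.9
x = {1: 1, 2: 0, 3: 1}
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}

def out(net):
    return 1 / (1 + math.exp(-net))

# Steps 3-4: net input and output of hidden units 4, 5 and output unit 6
o = dict(x)
for j in (4, 5):
    o[j] = out(sum(w[(i, j)] * o[i] for i in (1, 2, 3)) + theta[j])
o[6] = out(sum(w[(j, 6)] * o[j] for j in (4, 5)) + theta[6])

# Steps 5-6: errors, output layer first, then hidden layer
err = {6: o[6] * (1 - o[6]) * (1 - o[6])}  # T = 1
for j in (4, 5):
    err[j] = o[j] * (1 - o[j]) * err[6] * w[(j, 6)]

# Steps 7-8: weight and bias updates
for (i, j) in w:
    w[(i, j)] += l * err[j] * o[i]
for j in theta:
    theta[j] += l * err[j]

print({k: round(v, 3) for k, v in o.items()})    # e.g. O6 ~ 0.474
print({k: round(v, 4) for k, v in err.items()})  # e.g. Err6 ~ 0.1311
```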
Example
• Several variations of and alternatives to the backpropagation algorithm have been proposed for classification in neural networks.
• These may involve:
– the dynamic adjustment of the network topology and of the learning rate,
– new parameters, or
– the use of different error functions.
Backpropagation and Interpretability
• Neural networks are like a black box.
• A major disadvantage of neural networks lies in their knowledge representation.
• Acquired knowledge in the form of a network of units connected by weighted links is difficult for humans to interpret.
• This factor has motivated research in extracting the knowledge embedded in trained neural networks and in representing that knowledge symbolically.
• Methods include:
– extracting rules from networks, and
– sensitivity analysis.
Backpropagation and Interpretability
• Rule extraction from networks
– Often, the first step toward extracting rules from neural networks is network pruning.
• This consists of simplifying the network structure by removing weighted links that have the least effect on the trained network.
– Then perform link, unit, or activation value clustering.
• In one method, for example, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network.
– The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers.
Backpropagation and Interpretability
• Rules can be extracted from trained neural networks (figure).
Backpropagation and Interpretability
• Sensitivity analysis
– Assesses the impact that a given input variable has on a network output.
– The knowledge gained from this analysis can be represented in rules,
– such as "IF X decreases 5% THEN Y increases 8%."
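One simple way to carry out such an analysis is to perturb one input of a trained network and observe the relative change in the output. The numerical sketch below is our own illustration (the toy prediction function and the perturbation size are assumptions, not a method prescribed by the slides):

```python
def sensitivity(predict, x, j, delta=0.05):
    """Relative change in the network output when input j is
    increased by a fraction `delta` of its value."""
    base = predict(x)
    perturbed = list(x)
    perturbed[j] *= (1 + delta)
    return (predict(perturbed) - base) / base

# Toy stand-in for a trained network's prediction function
predict = lambda x: 0.3 * x[0] + 0.6 * x[1]
x = [2.0, 1.0]
print(f"{sensitivity(predict, x, 0):+.1%}")  # impact of raising input 0 by 5%
```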
Discussion
• Weaknesses of neural networks
– Long training time
– Require a number of parameters typically best determined empirically
• e.g., the network topology or structure
– Poor interpretability
• It is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Discussion
• Strengths of neural networks
– High tolerance of noisy data
– Can be used when you have little knowledge of the relationships between attributes and classes
– Well suited for continuous-valued inputs and outputs
– Successful on a wide array of real-world data
– Algorithms are inherently parallel
– Techniques have recently been developed for the extraction of rules from trained neural networks
Research Areas
• Finding optimal network structure
– e.g., by genetic algorithms
• Increasing learning speed (efficiency)
– e.g., by simulated annealing
• Increasing accuracy (effectiveness)
• Extracting rules from networks
References
• J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., 2006. (Chapter 6)
• S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. (Chapter 19)
The End