Neural Networks
EFREI 2010
Laurent Orseau ([email protected])
AgroParisTech
Based on slides by Antoine Cornuejols
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
Introduction: Why neural networks?
• Biological inspiration
Natural brain: a very seductive model
– Robust and fault tolerant
– Flexible. Easily adaptable
– Can work with incomplete, uncertain, noisy data ...
– Massively parallel
– Can learn
Neurons
– ≈ 10^11 neurons in the human brain
– ≈ 10^4 connections (synapses + axons) / neuron
– Action potential / refractory period / neurotransmitters
– Excitatory / inhibitory signals
Introduction: Why neural networks?
• Some properties
Parallel computation
Directly implementable on dedicated circuits
Robust and fault tolerant (distributed representation)
Simple algorithms
Very general
• Some defects
Opacity of acquired knowledge
Historical notes (quickly)
Premises
– McCulloch & Pitts (1943): first formal neuron model.
The neuron as a logical calculus: a foundation of artificial intelligence.
– Hebb rule (1949): learning by reinforcing synaptic coupling
First realizations
– ADALINE (Widrow-Hoff, 1960)
– PERCEPTRON (Rosenblatt, 1958-1962)
– Analysis of Minsky & Papert (1969)
New models
– Kohonen (competitive learning), ...
– Hopfield (1982) (recurrent net)
– Multi-layer perceptron (1985)
Analysis and developments
– Control theory, generalization (Vapnik), ...
The perceptron
Rosenblatt (1958-1962)
Linear discrimination: the perceptron
[Rosenblatt, 1957,1962]
Decision function: y(x) = sign(wᵀx + w₀)
[Figure: input nodes, bias node, output node]
Linear discrimination: the perceptron
• Geometry - 2 classes
Linear discrimination: the perceptron (discrimination against all others)
• Geometry - multiclass
Ambiguous region
Linear discrimination: the perceptron (discrimination between pairs of classes)
• Geometry – multiclass
• N(N-1)/2 discriminant functions
The perceptron: performance criterion
• Optimization criterion (error function): the total number of classification errors? No (not usable by gradient methods)
Perceptron criterion: E(w) = -Σ wᵀx_i u_i over the wrongly classified examples x_i
For every training example we want: wᵀx > 0 for class 1, wᵀx < 0 for class 2
Proportional to the distance to the decision surface (for all wrongly classified examples)
Piecewise linear and continuous function
Direct learning: pseudo-inverse method
• Direct solution (pseudo-inverse method) requires:
Knowledge of all pairs (xi,yi)
Matrix inversion (often ill-conditioned)
(only for linear network and quadratic error function)
• Requires an iterative method without matrix inversion
Gradient descent
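As a toy illustration of the direct (pseudo-inverse) solution for a linear unit, here is a minimal sketch for a single input and a bias, where the normal equations can be solved by hand. The data set and its generating rule are invented for the example:

```python
# Least-squares fit of a single linear unit y = w*x + b via the
# normal equations (the pseudo-inverse solution for this 1-D case).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by y = 2x + 1

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 normal equations [[sxx, sx], [sx, n]] @ [w, b] = [sxy, sy]
det = sxx * n - sx * sx
w = (sxy * n - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
print(w, b)   # 2.0 1.0 (the data is exactly linear)
```

With more inputs this becomes a full matrix inversion, which is exactly where the ill-conditioning mentioned above appears.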
The perceptron: algorithm
• Exploration of the hypothesis space H: gradient search
– Minimization of error function
– Principle: in the spirit of the Hebb rule:
modify connection proportionally to input and output
– Learn only if classification error
Algorithm: loop over all training examples until a stopping criterion is met:
if the example is correctly classified: do nothing
otherwise: w(t+1) = w(t) + x_i u_i
Convergence?
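A minimal sketch of this rule (w ← w + η u x on misclassified examples, with the bias as a constant extra input). The AND data set, the learning rate, and the epoch cap are illustrative choices, not from the slides:

```python
# Perceptron learning rule on the linearly separable AND function.
# Each input vector starts with a constant 1 (the bias input); targets are +1/-1.
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
eta = 0.5

for epoch in range(20):                     # loop until convergence (or a cap)
    errors = 0
    for x, u in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if u * s <= 0:                      # misclassified (or on the boundary)
            w = [wi + eta * u * xi for wi, xi in zip(w, x)]
            errors += 1
    if errors == 0:                         # the convergence theorem guarantees
        break                               # this happens on separable data

print(all(u * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, u in data))  # True
```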
The perceptron: convergence, memory capacity
• Questions:
What can be learned?
– Result from [Minsky & Papert,68]: linear separators
Convergence guarantees?
– Perceptron convergence theorem [Rosenblatt,62]
Reliability of learning and number of examples
– How many examples do we need to have some guarantee about what should be learned?
Expressive power: Linear separations
The multi-layer perceptron
• Usual topology
[Figure: input layer, hidden layer, output layer; the signal flows from input x_k to output y_k; desired output: u_k]
The multi-layer perceptron: propagation
• For each neuron:
w_jk: weight of the connection from node j to node k
a_k: activation of node k, with a_k = Σ_{j=0..d} w_jk z_j
g: activation function, e.g. the sigmoid g(a) = 1 / (1 + e^(-a)), with g'(a) = g(a)(1 - g(a))
Output of node k: z_k = g(a_k)
[Figure: common activation functions (threshold function, sigmoid function, ramp function, radial basis function); x-axis: activation a_i, y-axis: output z_i]
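The sigmoid and its derivative translate directly into code. This sketch checks the closed-form identity g'(a) = g(a)(1 - g(a)) against a finite-difference estimate:

```python
import math

def g(a):
    """Sigmoid activation: g(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def g_prime(a):
    """Derivative expressed through g itself: g'(a) = g(a)(1 - g(a))."""
    return g(a) * (1.0 - g(a))

print(g(0.0), g_prime(0.0))   # 0.5 0.25: mid-point value and maximal slope

# The closed form agrees with a numerical derivative:
h = 1e-6
approx = (g(1.0 + h) - g(1.0 - h)) / (2 * h)
print(abs(approx - g_prime(1.0)) < 1e-6)   # True
```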
The multi-layer perceptron: the XOR example
[Figure: XOR network. Hidden nodes A and B each receive x1 and x2 with weight 1; bias of A: -0.5, bias of B: -1.5. Output node C has bias -0.5 and receives A with weight 1 and B with weight -1.]
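Reading the figure's values as the classic decomposition (A computes OR, B computes AND, C fires for "A and not B"), the network can be sketched with threshold units:

```python
def step(a):
    # Threshold unit: fires iff its total input is positive.
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    # Weights as in the figure: A = OR (bias -0.5), B = AND (bias -1.5),
    # C = step(A - B - 0.5), i.e. "OR but not AND" = XOR.
    A = step(1 * x1 + 1 * x2 - 0.5)
    B = step(1 * x1 + 1 * x2 - 1.5)
    return step(1 * A - 1 * B - 0.5)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```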
Example of network (JavaNNS)
The MLP: learning
• Find weights such that the network produces an input-output mapping consistent with the given examples
(the same old generalization problem)
• Learning:
Minimize the loss function E(w, {x_l, u_l}) as a function of w
Use a gradient descent method
(gradient back-propagation algorithm)
Inductive principle: we assume that what works on training examples (empirical risk minimization) should also work on test (unseen) examples (real risk minimization)
Δw_ij = -η ∂E/∂w_ij
Learning: gradient descent
• learning = search in the multidimensional parameter space (synaptic weights) to minimize loss function
• Almost all learning rules
= gradient descent method
Optimal solution w* such that ∇E(w*) = 0
Update rule: w_ij(τ+1) = w_ij(τ) - η ∂E/∂w_ij evaluated at w(τ)
so that E(τ+1) ≈ E(τ) + Δw · ∇E decreases at each step
with gradient ∇E = (∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_N)ᵀ
The multi-layer perceptron: learning
Goal: w* = argmin_w (1/m) Σ_{l=1..m} [y(x_l; w) - u(x_l)]²
Algorithm (gradient back-propagation): gradient descent
Iterative algorithm: w(t) = w(t-1) - η(t) ∇E|_{w(t-1)}
Off-line case (total gradient):
w_ij(t) = w_ij(t-1) - η(t) (1/m) Σ_{k=1..m} ∂R_E(x_k, w)/∂w_ij
where: R_E(x_k, w) = [t_k - f(x_k, w)]²
On-line case (stochastic gradient):
w_ij(t) = w_ij(t-1) - η(t) ∂R_E(x_k, w)/∂w_ij
The multi-layer perceptron: learning
1. Take one example from training set
2. Compute output state of network
3. Compute error as a function of (output - desired output), e.g. (y_l - u_l)²
4. Compute gradients
With gradient back-propagation algorithm
5. Modify synaptic weights
6. Stopping criterion
Based on global error, number of examples, etc.
7. Go back to 1
MLP: gradient back-propagation
• The problem: determine responsibilities ("credit assignment problem"): which connection is responsible for the error E, and by how much?
• Principle: compute the error on a connection as a function of the error on the next layer
• Two steps:
1. Evaluation of the derivative of the error with respect to each weight
2. Use of these derivatives to compute the modification of each weight
MLP: gradient back-propagation
1. Evaluation of the error due to each connection:
Idea: compute the error on connection w_ij as a function of the error after node j
∂E_l/∂w_ij = (∂E_l/∂a_j)(∂a_j/∂w_ij) = δ_j z_i
For nodes in the output layer:
δ_k = ∂E_l/∂a_k = g'(a_k) ∂E_l/∂y_k = g'(a_k) [u_k(x_l) - y_k]
For nodes in the hidden layer:
δ_j = ∂E_l/∂a_j = Σ_k (∂E_l/∂a_k)(∂a_k/∂a_j) = g'(a_j) Σ_k w_jk δ_k
MLP: gradient back-propagation
a_i: activation of node i
z_i: output of node i
δ_i: error attached to node i
[Figure: node i feeds hidden node j (activation a_j, output z_j, error δ_j) via weight w_ij; node j feeds output node k (activation a_k, output y_k, error δ_k) via weight w_jk]
MLP: gradient back-propagation
• 2. Modification of the weights
We assume a gradient step η(t) (constant or not)
If stochastic learning (after the presentation of each example): Δw_ji(t) = η δ_j a_i
If batch learning (after the presentation of the whole set of examples): Δw_ji(t) = η Σ_n δ_j^(n) a_i^(n)
MLP: forward and backward passes (resume)
Forward pass, for a network with k neurons in the hidden layer:
a_i(x) = Σ_{j=1..d} w_j x_j + w_0
y_i(x) = g(a_i(x))
y_s(x) = Σ_{j=1..k} w_js y_j(x)
[Figure: inputs x_0 = 1 (bias), x_1, ..., x_d feed hidden unit y_i(x) through weights w_0, w_1, ..., w_d; the hidden units feed output y_s(x) through weights w_is]
MLP: forward and backward passes (resume)
Backward pass:
Output nodes: δ_s = g'(a_s) (u_s - y_s)
Hidden nodes: δ_i = g'(a_i) Σ_{s ∈ nodes of next layer} w_is δ_s
Weight updates: w_is(t+1) = w_is(t) + η(t) δ_s a_i and w_ei(t+1) = w_ei(t) + η(t) δ_i a_e
[Figure: same network as for the forward pass, with the errors propagated backwards]
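Putting the forward and backward passes together, here is a minimal 2-2-1 sketch: random weights, one hand-picked training example, and an illustrative learning rate (none of these values come from the slides). It performs a single stochastic update following the delta rules above and checks that the error decreases:

```python
import math
import random

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden weights (bias + 2 inputs)
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output weights (bias + 2 hidden)
eta = 0.5
x, u = [1.0, 0.0, 1.0], 1.0      # one training example; the leading 1.0 is the bias input

def forward(x):
    a1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # hidden activations
    z1 = [1.0] + [g(a) for a in a1]                             # hidden outputs + bias unit
    a2 = sum(w * zi for w, zi in zip(W2, z1))                   # output activation
    return a1, z1, a2, g(a2)

a1, z1, a2, y = forward(x)
err_before = (u - y) ** 2

d2 = g(a2) * (1.0 - g(a2)) * (u - y)                                   # output delta
d1 = [g(a) * (1.0 - g(a)) * W2[j + 1] * d2 for j, a in enumerate(a1)]  # hidden deltas
W2 = [w + eta * d2 * zi for w, zi in zip(W2, z1)]                      # update output layer
W1 = [[w + eta * dj * xi for w, xi in zip(row, x)] for row, dj in zip(W1, d1)]  # update hidden layer

err_after = (u - forward(x)[3]) ** 2
print(err_after < err_before)    # True: the step reduced the error on this example
```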
MLP: gradient back-propagation
• Learning efficiency
O(|w|) for each learning pass, |w| = # weights
Usually several hundred passes are needed (see below)
And learning must typically be done several dozens of times with different
initial random weights
• Recognition efficiency
Possibility of real time
Applications: multi-objective optimization
• cf [Tom Mitchell]
Predict both class and color
Instead of class only
Role of the hidden layer
MLP: Applications
• Control: identification and control of processes
(e.g. Robot control)
• Signal Processing (filtering, data compression, speech processing (recognition, prediction, production),…)
• Pattern recognition, image processing (hand-writing recognition, automated postal code recognition (Zip codes, USA), face recognition...)
• Prediction (water, electricity consumption, meteorology, stock market, ...)
• Diagnostic (industry, medical, science, ...)
Application to postal Zip codes
• [Le Cun et al., 1989, ...] (ATT Bell Labs: very smart team)
• ≈ 10000 examples of handwritten numbers
• Segmented and rescaled on a 16 × 16 matrix
• Weight sharing
• Optimal brain damage
• 99% correct recognition (on training set)
• 9% reject (delegated to human recognition)
The database
Application to postal Zip codes
[Figure: architecture: 16 × 16 input matrix → 12 segment detectors (8 × 8) → 12 segment detectors (4 × 4) → 30 nodes → 10 output nodes, one per digit 0-9]
Some mistakes made by the network
Regression
A failure: QSAR
• Quantitative Structure Activity Relations
Predict certain properties of molecules (e.g. biological activity) from chemical, geometrical and electrical descriptions.
MLP: Practical view (1)
• Technical problems: how to improve the algorithm's performance?
MLP as an optimization method: variants
• Momentum
• Second order methods
• Hessian
• Conjugate gradient
Heuristics
• Sequential learning vs batch learning
• Choice of activation function
• Normalization of inputs
• Weight initialization
• Learning gains
MLP: gradient back-propagation (variants)
• Momentum
Δw_ji(t+1) = -η ∂E/∂w_ji + α Δw_ji(t)
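The momentum update can be sketched on a one-dimensional quadratic error (all values invented for the example); the α Δw(t) term carries over a fraction of the previous step, smoothing the descent:

```python
# Momentum on the toy error E(w) = w^2, whose gradient is 2w.
def grad(w):
    return 2.0 * w

w, dw = 5.0, 0.0
eta, alpha = 0.1, 0.9
for _ in range(200):
    dw = -eta * grad(w) + alpha * dw   # new step = gradient part + inertia part
    w += dw

print(abs(w) < 1e-2)   # True: converged near the minimum w* = 0
```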
Convergence
• Learning step tweaking:
MLP: Convergence problems
• Local minima
Add momentum (inertia)
Conditioning of parameters
Adding noise to the training data
Online algorithm (stochastic vs. total gradient)
Variable gradient step (in time and for each node)
Use of second derivatives (Hessian); conjugate gradient
MLP: Convergence problems (variable gradient step)
• Adaptive gain
Increase the gain if the gradient does not change sign, decrease it otherwise
Use a much lower gain for stochastic than for total gradient
Specific gain for each layer (e.g. 1 / (# input node)1/2 )
• More complex algorithms
Conjugate gradients
– Idea: try to minimize independently along each dimension, keeping a momentum of the search direction
Second order methods (Hessian)
– Faster convergence but slower computations
Overfitting
[Figure: real risk and empirical risk as a function of data quantity; the gap between the two curves is overfitting]
Preventing overfitting: regularisation
• Principle: limit expressiveness of H
• New empirical risk:
• Some useful regularizers:
– Control of NN architecture
– Parameter control
• Soft-weight sharing
• Weight decay
• Convolution network
– Noisy examples
R_emp(α) = (1/m) Σ_{l=1..m} L(h(x_l, α), u_l) + λ Φ[h(·, α)]   (penalization term)
Control by limiting the exploration of H
• Early stopping
• Weight decay
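Weight decay can be sketched on a one-dimensional toy loss (all values invented for the example): the penalty (λ/2)w² adds λw to the gradient, pulling the solution toward zero:

```python
# Gradient descent with weight decay on the toy loss (w - 3)^2,
# whose unregularised optimum is w = 3.
def grad_loss(w):
    return 2.0 * (w - 3.0)

w, eta, lam = 0.0, 0.1, 0.5
for _ in range(1000):
    w -= eta * (grad_loss(w) + lam * w)   # decay term shrinks w toward 0

# Fixed point: 2(w - 3) + 0.5 w = 0  =>  w = 6 / 2.5 = 2.4 (pulled toward 0)
print(round(w, 3))   # 2.4
```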
Generalization: optimize the network structure
• Progressive growth
Cascade correlation [Fahlman,1990]
• Pruning
Optimal brain damage [Le Cun,1990]
Optimal brain surgeon [Hassibi,1993]
Introduction of prior knowledge
Invariances
• Symmetries in the example space
Translation / rotation / dilatation
• Cost function including derivative terms
ANN Application Areas
• Classification
• Clustering
• Associative memory
• Control
• Function approximation
Applications for ANN Classifiers
• Pattern recognition
Industrial inspection
Fault diagnosis
Image recognition
Target recognition
Speech recognition
Natural language processing
• Character recognition
Handwriting recognition
Automatic text-to-speech conversion
Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000
Neural Network Approaches
ALVINN: Autonomous Land Vehicle In a Neural Network
ALVINN
- Developed in 1993.
- Performs driving with Neural Networks.
- An intelligent VLSI image sensor for road following.
- Learns to filter out image details not relevant to driving.
[Figure: ALVINN network: input units (camera image), hidden layer, output units (steering direction)]
MLP with Radial Basis Functions (RBF)
• Definition
Hidden layer uses radial basis activation function (e.g. Gaussian)
– Idea: “pave” the input space with “receptive fields”
Output layer: linear combination upon the hidden layer
• Properties
Still universal approximator ([Hartman et al.,90], ...)
But not parsimonious (combinatorial explosion with the input dimension)
Only for small input dimension problems
Strong links with fuzzy inference systems and neuro-fuzzy systems
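A forward-pass sketch of such a network, with invented centres, width, and output weights (a 1-D input, Gaussian receptive fields in the hidden layer, a linear output):

```python
import math

# RBF network: Gaussian "receptive fields", then a linear combination.
centres = [0.0, 1.0, 2.0]   # illustrative centres of the receptive fields
sigma = 0.5                 # common diameter (width) of the fields
out_w = [1.0, -2.0, 0.5]    # illustrative output-layer weights

def rbf_net(x):
    phi = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centres]
    return sum(w * p for w, p in zip(out_w, phi))

# At x = 1 the middle unit dominates, so the output is close to its
# weight -2, plus small tails from the neighbouring units:
print(round(rbf_net(1.0), 3))   # -1.797
```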
• Parameters to tune:
# hidden nodes
Initial positions of the receptive fields
Diameters of the receptive fields
Output weights
• Methods
Adaptation of back-propagation
Determination of each type of parameter with a specific method (usually more effective)
– Centers determined by "clustering" methods (k-means, ...)
– Diameters determined by covering rate optimization (nearest neighbours, ...)
– Output weights by linear optimization (pseudo-inverse computation, ...)
Neural Networks for sequence processing
• Tasks: take the time dimension into account
Sequence recognition
E.g. recognize the word corresponding to a vocal signal
Reproduction of sequences
E.g. predict the next values of a sequence (e.g. electricity consumption prediction)
Temporal association
Production of a sequence in response to the recognition of another sequence
Time Delay Neural Networks (TDNNs)
Duplicate inputs for several past time steps
Recurrent Neural Networks
Recurrent ANN Architectures
• Feedback connections
• Dynamic memory: y(t+1) = f(x(τ), y(τ), s(τ)), τ ∈ {t, t-1, ...}
• Models: Jordan/Elman ANNs
Hopfield
Adaptive Resonance Theory (ART)
Recurrent Neural Networks
• Can learn regular grammars
Finite State Machines
Back Propagation Through Time
• Can even model full computers with 11 neurons (!)
Very special use of RNNs…
Uses the property that a weight can be any real number, i.e. it is an unlimited memory
+ Chaotic dynamics
No learning algorithm for this
Recurrent Neural Networks
• Problems
Complex trajectories
– Chaotic dynamics
Limited memory of past
Learning is very difficult!
– Exponential decay of error signal in time
Long Short-Term Memory (Hochreiter, 1997)
• Idea:
Only some nodes are recurrent
Only self-recurrence
Linear activation function
– Error decays linearly, not exponentially
• Can learn
Regular languages (FSM)
Some Context-free (stack machine) and Context-sensitive grammars
– aⁿbⁿ, aⁿbⁿcⁿ
Reservoir computing
• Idea:
Random recurrent neural network,
Learn only output layer weights
• Many internal dynamics
• Output layer selects interesting ones
• And combinations thereof
[Figure: input → random recurrent reservoir → output]
Conclusions
• Limits
Learning is slow and difficult
Result is opaque
– Difficult to extract knowledge
– Difficult to use prior knowledge (but KBANN)
Incremental learning of new concepts is difficult: catastrophic forgetting
• Advantages
Can learn a wide variety of problems
Bibliography
• Books / articles
Bishop C.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
Haykin: Neural Networks. Prentice Hall, 1998.
Hertz, Krogh & Palmer: Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
Thiria, Gascuel, Lechevallier & Canu: Statistiques et méthodes neuronales. Dunod, 1997.
Vapnik: The Nature of Statistical Learning Theory. Springer Verlag, 1995.
• Websites
http://www.lps.ens.fr/~nadal/