Neural Networks
EFREI 2010
Laurent Orseau ([email protected])
AgroParisTech
Based on slides by Antoine Cornuejols
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
Introduction: Why neural networks?
• Biological inspiration
Natural brain: a very seductive model
– Robust and fault tolerant
– Flexible. Easily adaptable
– Can work with incomplete, uncertain, noisy data ...
– Massively parallel
– Can learn
Neurons
– ≈ 10^11 neurons in the human brain
– ≈ 10^4 connections (synapses + axons) / neuron
– Action potential / refractory period / neurotransmitters
– Excitatory / inhibitory signals
Introduction: Why neural networks?
• Some properties
Parallel computation
Directly implementable on dedicated circuits
Robust and fault tolerant (distributed representation)
Simple algorithms
Very general
• Some defects
Opacity of acquired knowledge
Historical notes (quickly)
Premises
– McCulloch & Pitts (1943): first formal neuron model.
The neuron as a logical calculus: a foundation of artificial intelligence.
– Hebb rule (1949): learning by reinforcing synaptic coupling
First realizations
– ADALINE (Widrow-Hoff, 1960)
– PERCEPTRON (Rosenblatt, 1958-1962)
– Analysis of Minsky & Papert (1969)
New models
– Kohonen (competitive learning), ...
– Hopfield (1982) (recurrent net)
– Multi-layer perceptron (1985)
Analysis and developments
– Control theory, generalization (Vapnik), ...
The perceptron
Rosenblatt (1958-1962)
Linear discrimination: the perceptron
[Rosenblatt, 1957,1962]
Decision function: y(x) = sign(wᵀx + w₀)
[Figure: input nodes, bias node, output node]
Linear discrimination: the perceptron
• Geometry - 2 classes
Linear discrimination: the perceptron (discrimination against all others)
• Geometry - multiclass
Ambiguous region
Linear discrimination: the perceptron (discrimination between pairs of classes)
• Geometry – multiclass
• N(N-1)/2 discriminant functions
The perceptron: performance criterion
• Optimization criterion (error function): the total number of classification errors? No (not usable by gradient methods)
Perceptron criterion: E(w) = -Σ wᵀx_i u_i over the wrongly classified examples x_i
For every training example we want: wᵀx > 0 for class 1, wᵀx < 0 for class 2
Proportional to the distance to the decision surface (for all wrongly classified examples)
Piecewise linear and continuous function
Direct learning: pseudo-inverse method
• Direct solution (pseudo-inverse method) requires:
Knowledge of all pairs (xi,yi)
Matrix inversion (often ill-conditioned)
(only for linear network and quadratic error function)
• Requires an iterative method without matrix inversion
Gradient descent
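As a toy illustration of the direct (pseudo-inverse) solution for a linear unit, here is a minimal sketch for a single input and a bias, where the normal equations can be solved by hand. The data set and its generating rule are invented for the example:

```python
# Least-squares fit of a single linear unit y = w*x + b via the
# normal equations (the pseudo-inverse solution for this 1-D case).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by y = 2x + 1

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 normal equations [[sxx, sx], [sx, n]] @ [w, b] = [sxy, sy]
det = sxx * n - sx * sx
w = (sxy * n - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
print(w, b)   # 2.0 1.0 (the data is exactly linear)
```

With more inputs this becomes a full matrix inversion, which is exactly where the ill-conditioning mentioned above appears.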
The perceptron: algorithm
• Exploration of the hypothesis space H: gradient search
– Minimization of error function
– Principle: in the spirit of the Hebb rule:
modify connection proportionally to input and output
– Learn only if classification error
Algorithm: loop over all training examples until a stopping criterion is met:
if the example is correctly classified: do nothing
otherwise: w(t+1) = w(t) + x_i u_i
Convergence?
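A minimal sketch of this rule (w ← w + η u x on misclassified examples, with the bias as a constant extra input). The AND data set, the learning rate, and the epoch cap are illustrative choices, not from the slides:

```python
# Perceptron learning rule on the linearly separable AND function.
# Each input vector starts with a constant 1 (the bias input); targets are +1/-1.
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
eta = 0.5

for epoch in range(20):                     # loop until convergence (or a cap)
    errors = 0
    for x, u in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if u * s <= 0:                      # misclassified (or on the boundary)
            w = [wi + eta * u * xi for wi, xi in zip(w, x)]
            errors += 1
    if errors == 0:                         # the convergence theorem guarantees
        break                               # this happens on separable data

print(all(u * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, u in data))  # True
```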
The perceptron: convergence, memory capacity
• Questions:
What can be learned?
– Result from [Minsky & Papert,68]: linear separators
Convergence guarantees?
– Perceptron convergence theorem [Rosenblatt,62]
Reliability of learning and number of examples
– How many examples do we need to have some guarantee about what should be learned?
Expressive power: Linear separations
The multi-layer perceptron
• Usual topology
[Figure: input layer, hidden layer, output layer; the signal flows from input x_k to output y_k; desired output: u_k]
The multi-layer perceptron: propagation
• For each neuron:
w_jk: weight of the connection from node j to node k
a_k: activation of node k, with a_k = Σ_{j=0..d} w_jk z_j
g: activation function, e.g. the sigmoid g(a) = 1 / (1 + e^(-a)), with g'(a) = g(a)(1 - g(a))
Output of node k: z_k = g(a_k)
[Figure: common activation functions (threshold function, sigmoid function, ramp function, radial basis function); x-axis: activation a_i, y-axis: output z_i]
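The sigmoid and its derivative translate directly into code. This sketch checks the closed-form identity g'(a) = g(a)(1 - g(a)) against a finite-difference estimate:

```python
import math

def g(a):
    """Sigmoid activation: g(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def g_prime(a):
    """Derivative expressed through g itself: g'(a) = g(a)(1 - g(a))."""
    return g(a) * (1.0 - g(a))

print(g(0.0), g_prime(0.0))   # 0.5 0.25: mid-point value and maximal slope

# The closed form agrees with a numerical derivative:
h = 1e-6
approx = (g(1.0 + h) - g(1.0 - h)) / (2 * h)
print(abs(approx - g_prime(1.0)) < 1e-6)   # True
```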
The multi-layer perceptron: the XOR example
[Figure: XOR network. Hidden nodes A and B each receive x1 and x2 with weight 1; bias of A: -0.5, bias of B: -1.5. Output node C has bias -0.5 and receives A with weight 1 and B with weight -1.]
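Reading the figure's values as the classic decomposition (A computes OR, B computes AND, C fires for "A and not B"), the network can be sketched with threshold units:

```python
def step(a):
    # Threshold unit: fires iff its total input is positive.
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    # Weights as in the figure: A = OR (bias -0.5), B = AND (bias -1.5),
    # C = step(A - B - 0.5), i.e. "OR but not AND" = XOR.
    A = step(1 * x1 + 1 * x2 - 0.5)
    B = step(1 * x1 + 1 * x2 - 1.5)
    return step(1 * A - 1 * B - 0.5)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```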
Example of network (JavaNNS)
The MLP: learning
• Find weights such that the network produces an input-output mapping consistent with the given examples
(the same old generalization problem)
• Learning:
Minimize the loss function E(w, {x_l, u_l}) as a function of w
Use a gradient descent method
(gradient back-propagation algorithm)
Inductive principle: we assume that what works on training examples (empirical risk minimization) should also work on test (unseen) examples (real risk minimization)
Δw_ij = -η ∂E/∂w_ij
Learning: gradient descent
• learning = search in the multidimensional parameter space (synaptic weights) to minimize loss function
• Almost all learning rules
= gradient descent method
Optimal solution w* such that ∇E(w*) = 0
Update rule: w_ij(τ+1) = w_ij(τ) - η ∂E/∂w_ij evaluated at w(τ)
so that E(τ+1) ≈ E(τ) + Δw · ∇E decreases at each step
with gradient ∇E = (∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_N)ᵀ
The multi-layer perceptron: learning
Goal: w* = argmin_w (1/m) Σ_{l=1..m} [y(x_l; w) - u(x_l)]²
Algorithm (gradient back-propagation): gradient descent
Iterative algorithm: w(t) = w(t-1) - η(t) ∇E|_{w(t-1)}
Off-line case (total gradient):
w_ij(t) = w_ij(t-1) - η(t) (1/m) Σ_{k=1..m} ∂R_E(x_k, w)/∂w_ij
where: R_E(x_k, w) = [t_k - f(x_k, w)]²
On-line case (stochastic gradient):
w_ij(t) = w_ij(t-1) - η(t) ∂R_E(x_k, w)/∂w_ij
The multi-layer perceptron: learning
1. Take one example from training set
2. Compute output state of network
3. Compute error as a function of (output - desired output), e.g. (y_l - u_l)²
4. Compute gradients
With gradient back-propagation algorithm
5. Modify synaptic weights
6. Stopping criterion
Based on global error, number of examples, etc.
7. Go back to 1
MLP: gradient back-propagation
• The problem: determine responsibilities ("credit assignment problem"): which connection is responsible for the error E, and by how much?
• Principle: compute the error on a connection as a function of the error on the next layer
• Two steps:
1. Evaluation of the derivative of the error with respect to each weight
2. Use of these derivatives to compute the modification of each weight
MLP: gradient back-propagation
1. Evaluation of the error due to each connection:
Idea: compute the error on connection w_ij as a function of the error after node j
∂E_l/∂w_ij = (∂E_l/∂a_j)(∂a_j/∂w_ij) = δ_j z_i
For nodes in the output layer:
δ_k = ∂E_l/∂a_k = g'(a_k) ∂E_l/∂y_k = g'(a_k) [u_k(x_l) - y_k]
For nodes in the hidden layer:
δ_j = ∂E_l/∂a_j = Σ_k (∂E_l/∂a_k)(∂a_k/∂a_j) = g'(a_j) Σ_k w_jk δ_k
MLP: gradient back-propagation
a_i: activation of node i
z_i: output of node i
δ_i: error attached to node i
[Figure: node i feeds hidden node j (activation a_j, output z_j, error δ_j) via weight w_ij; node j feeds output node k (activation a_k, output y_k, error δ_k) via weight w_jk]
MLP: gradient back-propagation
• 2. Modification of the weights
We assume a gradient step η(t) (constant or not)
If stochastic learning (after the presentation of each example): Δw_ji(t) = η δ_j a_i
If batch learning (after the presentation of the whole set of examples): Δw_ji(t) = η Σ_n δ_j^(n) a_i^(n)
MLP: forward and backward passes (resume)
Forward pass, for a network with k neurons in the hidden layer:
a_i(x) = Σ_{j=1..d} w_j x_j + w_0
y_i(x) = g(a_i(x))
y_s(x) = Σ_{j=1..k} w_js y_j(x)
[Figure: inputs x_0 = 1 (bias), x_1, ..., x_d feed hidden unit y_i(x) through weights w_0, w_1, ..., w_d; the hidden units feed output y_s(x) through weights w_is]
MLP: forward and backward passes (resume)
Backward pass:
Output nodes: δ_s = g'(a_s) (u_s - y_s)
Hidden nodes: δ_i = g'(a_i) Σ_{s ∈ nodes of next layer} w_is δ_s
Weight updates: w_is(t+1) = w_is(t) + η(t) δ_s a_i and w_ei(t+1) = w_ei(t) + η(t) δ_i a_e
[Figure: same network as for the forward pass, with the errors propagated backwards]
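Putting the forward and backward passes together, here is a minimal 2-2-1 sketch: random weights, one hand-picked training example, and an illustrative learning rate (none of these values come from the slides). It performs a single stochastic update following the delta rules above and checks that the error decreases:

```python
import math
import random

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden weights (bias + 2 inputs)
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output weights (bias + 2 hidden)
eta = 0.5
x, u = [1.0, 0.0, 1.0], 1.0      # one training example; the leading 1.0 is the bias input

def forward(x):
    a1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # hidden activations
    z1 = [1.0] + [g(a) for a in a1]                             # hidden outputs + bias unit
    a2 = sum(w * zi for w, zi in zip(W2, z1))                   # output activation
    return a1, z1, a2, g(a2)

a1, z1, a2, y = forward(x)
err_before = (u - y) ** 2

d2 = g(a2) * (1.0 - g(a2)) * (u - y)                                   # output delta
d1 = [g(a) * (1.0 - g(a)) * W2[j + 1] * d2 for j, a in enumerate(a1)]  # hidden deltas
W2 = [w + eta * d2 * zi for w, zi in zip(W2, z1)]                      # update output layer
W1 = [[w + eta * dj * xi for w, xi in zip(row, x)] for row, dj in zip(W1, d1)]  # update hidden layer

err_after = (u - forward(x)[3]) ** 2
print(err_after < err_before)    # True: the step reduced the error on this example
```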
MLP: gradient back-propagation
• Learning efficiency
O(|w|) for each learning pass, |w| = # weights
Usually several hundred passes are needed (see below)
And learning must typically be done several dozens of times with different
initial random weights
• Recognition efficiency
Possibility of real time
Applications: multi-objective optimization
• cf [Tom Mitchell]
Predict both class and color
Instead of class only
Role of the hidden layer
MLP: Applications
• Control: identification and control of processes
(e.g. Robot control)
• Signal Processing (filtering, data compression, speech processing (recognition, prediction, production),…)
• Pattern recognition, image processing (hand-writing recognition, automated postal code recognition (Zip codes, USA), face recognition...)
• Prediction (water, electricity consumption, meteorology, stock market, ...)
• Diagnostic (industry, medical, science, ...)
Application to postal Zip codes
• [Le Cun et al., 1989, ...] (ATT Bell Labs: very smart team)
• ≈ 10000 examples of handwritten numbers
• Segmented and rescaled on a 16 × 16 matrix
• Weight sharing
• Optimal brain damage
• 99% correct recognition (on training set)
• 9% reject (delegated to human recognition)
The database
Application to postal Zip codes
[Figure: architecture: 16 × 16 input matrix → 12 segment detectors (8 × 8) → 12 segment detectors (4 × 4) → 30 nodes → 10 output nodes, one per digit 0-9]
Some mistakes made by the network
Regression
A failure: QSAR
• Quantitative Structure Activity Relations
Predict certain properties of molecules (e.g. biological activity) from chemical, geometrical and electrical descriptions.
MLP: Practical view (1)
• Technical problems: how to improve the algorithm's performance?
MLP as an optimization method: variants
• Momentum
• Second order methods
• Hessian
• Conjugate gradient
Heuristics
• Sequential learning vs batch learning
• Choice of activation function
• Normalization of inputs
• Weight initialization
• Learning gains
MLP: gradient back-propagation (variants)
• Momentum
Δw_ji(t+1) = -η ∂E/∂w_ji + α Δw_ji(t)
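The momentum update can be sketched on a one-dimensional quadratic error (all values invented for the example); the α Δw(t) term carries over a fraction of the previous step, smoothing the descent:

```python
# Momentum on the toy error E(w) = w^2, whose gradient is 2w.
def grad(w):
    return 2.0 * w

w, dw = 5.0, 0.0
eta, alpha = 0.1, 0.9
for _ in range(200):
    dw = -eta * grad(w) + alpha * dw   # new step = gradient part + inertia part
    w += dw

print(abs(w) < 1e-2)   # True: converged near the minimum w* = 0
```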
Convergence
• Learning step tweaking:
MLP: Convergence problems
• Local minima
Add momentum (inertia)
Conditioning of parameters
Adding noise to the training data
Online algorithm (stochastic vs. total gradient)
Variable gradient step (in time and for each node)
Use of second derivatives (Hessian); conjugate gradient
MLP: Convergence problems (variable gradient step)
• Adaptive gain
Increase the gain if the gradient does not change sign, decrease it otherwise
Use a much lower gain for stochastic than for total gradient
Specific gain for each layer (e.g. 1 / (# input node)1/2 )
• More complex algorithms
Conjugate gradients
– Idea: try to minimize independently along each dimension, keeping a momentum of the search direction
Second order methods (Hessian)
– Faster convergence but slower computations
Overfitting
[Figure: real risk and empirical risk as a function of data quantity; the gap between the two curves is overfitting]
Preventing overfitting: regularisation
• Principle: limit expressiveness of H
• New empirical risk:
• Some useful regularizers:
– Control of NN architecture
– Parameter control
• Soft-weight sharing
• Weight decay
• Convolution network
– Noisy examples
R_emp(α) = (1/m) Σ_{l=1..m} L(h(x_l, α), u_l) + λ Φ[h(·, α)]   (penalization term)
Control by limiting the exploration of H
• Early stopping
• Weight decay
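Weight decay can be sketched on a one-dimensional toy loss (all values invented for the example): the penalty (λ/2)w² adds λw to the gradient, pulling the solution toward zero:

```python
# Gradient descent with weight decay on the toy loss (w - 3)^2,
# whose unregularised optimum is w = 3.
def grad_loss(w):
    return 2.0 * (w - 3.0)

w, eta, lam = 0.0, 0.1, 0.5
for _ in range(1000):
    w -= eta * (grad_loss(w) + lam * w)   # decay term shrinks w toward 0

# Fixed point: 2(w - 3) + 0.5 w = 0  =>  w = 6 / 2.5 = 2.4 (pulled toward 0)
print(round(w, 3))   # 2.4
```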
Generalization: optimize the network structure
• Progressive growth
Cascade correlation [Fahlman,1990]
• Pruning
Optimal brain damage [Le Cun,1990]
Optimal brain surgeon [Hassibi,1993]
Introduction of prior knowledge
Invariances
• Symmetries in the example space
Translation / rotation / dilatation
• Cost function including derivative terms
ANN Application Areas
• Classification
• Clustering
• Associative memory
• Control
• Function approximation
Applications for ANN Classifiers
• Pattern recognition
Industrial inspection
Fault diagnosis
Image recognition
Target recognition
Speech recognition
Natural language processing
• Character recognition
Handwriting recognition
Automatic text-to-speech conversion
Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000
Neural Network Approaches
ALVINN: Autonomous Land Vehicle In a Neural Network
ALVINN
- Developed in 1993.
- Performs driving with Neural Networks.
- An intelligent VLSI image sensor for road following.
- Learns to filter out image details not relevant to driving.
[Figure: ALVINN network: input units (camera image), hidden layer, output units (steering direction)]
MLP with Radial Basis Functions (RBF)
• Definition
Hidden layer uses radial basis activation function (e.g. Gaussian)
– Idea: “pave” the input space with “receptive fields”
Output layer: linear combination upon the hidden layer
• Properties
Still universal approximator ([Hartman et al.,90], ...)
But not parsimonious (combinatorial explosion with the input dimension)
Only for small input dimension problems
Strong links with fuzzy inference systems and neuro-fuzzy systems
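A forward-pass sketch of such a network, with invented centres, width, and output weights (a 1-D input, Gaussian receptive fields in the hidden layer, a linear output):

```python
import math

# RBF network: Gaussian "receptive fields", then a linear combination.
centres = [0.0, 1.0, 2.0]   # illustrative centres of the receptive fields
sigma = 0.5                 # common diameter (width) of the fields
out_w = [1.0, -2.0, 0.5]    # illustrative output-layer weights

def rbf_net(x):
    phi = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centres]
    return sum(w * p for w, p in zip(out_w, phi))

# At x = 1 the middle unit dominates, so the output is close to its
# weight -2, plus small tails from the neighbouring units:
print(round(rbf_net(1.0), 3))   # -1.797
```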
• Parameters to tune:
# hidden nodes
Initial positions of the receptive fields
Diameters of the receptive fields
Output weights
• Methods
Adaptation of back-propagation
Determination of each type of parameter with a specific method (usually more effective)
– Centers determined by "clustering" methods (k-means, ...)
– Diameters determined by covering rate optimization (nearest neighbours, ...)
– Output weights by linear optimization (pseudo-inverse computation, ...)
Neural Networks for sequence processing
• Tasks: take the time dimension into account
Sequence recognition
E.g. recognize the word corresponding to a vocal signal
Reproduction of sequences
E.g. predict the next values of a sequence (e.g. electricity consumption prediction)
Temporal association
Production of a sequence in response to the recognition of another sequence
Time Delay Neural Networks (TDNNs)
Duplicate inputs for several past time steps
Recurrent Neural Networks
Recurrent ANN Architectures
• Feedback connections
• Dynamic memory: y(t+1) = f(x(τ), y(τ), s(τ)), τ ∈ {t, t-1, ...}
• Models: Jordan/Elman ANNs
Hopfield
Adaptive Resonance Theory (ART)
Recurrent Neural Networks
• Can learn regular grammars
Finite State Machines
Back Propagation Through Time
• Can even model full computers with 11 neurons (!)
Very special use of RNNs…
Uses the property that a weight can be any real number, i.e. it is an unlimited memory
+ Chaotic dynamics
No learning algorithm for this
Recurrent Neural Networks
• Problems
Complex trajectories
– Chaotic dynamics
Limited memory of past
Learning is very difficult!
– Exponential decay of error signal in time
Long Short-Term Memory (Hochreiter, 1997)
• Idea:
Only some nodes are recurrent
Only self-recurrence
Linear activation function
– Error decays linearly, not exponentially
• Can learn
Regular languages (FSM)
Some Context-free (stack machine) and Context-sensitive grammars
– aⁿbⁿ, aⁿbⁿcⁿ
Reservoir computing
• Idea:
Random recurrent neural network,
Learn only output layer weights
• Many internal dynamics
• Output layer selects interesting ones
• And combinations thereof
[Figure: input → random recurrent reservoir → output]
Conclusions
• Limits
Learning is slow and difficult
Result is opaque
– Difficult to extract knowledge
– Difficult to use prior knowledge (but KBANN)
Incremental learning of new concepts is difficult: catastrophic forgetting
• Advantages
Can learn a wide variety of problems
Bibliography
• Books / articles
Bishop C.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
Haykin: Neural Networks. Prentice Hall, 1998.
Hertz, Krogh & Palmer: Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
Thiria, Gascuel, Lechevallier & Canu: Statistiques et méthodes neuronales. Dunod, 1997.
Vapnik: The Nature of Statistical Learning Theory. Springer Verlag, 1995.
• Websites
http://www.lps.ens.fr/~nadal/