Overview of different methods – Supervised Learning. You are here! And many more.


[Overview diagram spanning Machine Learning, Classical Conditioning and Synaptic Plasticity. Evaluative feedback (rewards), example based: Dynamic Programming (Bellman equation) and Reinforcement Learning, with Monte Carlo control, Q-Learning, SARSA, TD(λ) (often λ = 0), TD(1), TD(0), eligibility traces, Actor/Critic (technical and basal ganglia), neuronal TD models ("Critic"), the neuronal TD formalism and Rescorla/Wagner, connected to neuronal reward systems (basal ganglia, dopamine, glutamate). Non-evaluative feedback (correlations), correlation based: un-supervised learning with the Hebb rule, δ-rule, differential Hebb rules ("slow" and "fast"), STDP models (biophysical and network), LTP (LTD = anti), ISO-Learning, the ISO model of STDP, ISO-Control and correlation-based control, connected to the biophysics of synaptic plasticity. Together these serve anticipatory control of actions and prediction of values via correlation of signals. Supervised learning: "You are here!" And many more.]

Some more basics: Threshold Logic Unit (TLU)

inputs u_1, u_2, …, u_n; weights w_1, w_2, …, w_n

activation: a = Σ_{i=1}^{n} w_i u_i

output: v = 1 if a ≥ θ, 0 if a < θ
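As a concrete illustration (the slides themselves contain no code), here is a minimal Python/NumPy sketch of a TLU; the function name tlu and the variable names are chosen only for this example:

```python
import numpy as np

def tlu(u, w, theta):
    """Threshold Logic Unit: returns 1 if the weighted sum of the
    inputs reaches the threshold theta, otherwise returns 0."""
    a = np.dot(w, u)            # activation a = sum_i w_i * u_i
    return 1 if a >= theta else 0

# Example: logical AND with w_1 = w_2 = 1 and theta = 1.5 (see the AND table below)
w = np.array([1.0, 1.0])
for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, tlu(np.array(u), w, theta=1.5))
```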

Activation Functions

[Plots of output v versus activation a for four activation functions: threshold, linear, piece-wise linear, and sigmoid.]

Decision Surface of a TLU

[Plot in the (u_1, u_2) plane: the decision line w_1 u_1 + w_2 u_2 = θ separates the patterns classified as "1" (where w·u > θ) from those classified as "0" (where w·u < θ).]

Scalar Products & Projections

[Three sketches of a weight vector w and an input vector u with w·u > 0, w·u = 0, and w·u < 0, illustrating the scalar product w·u = |w||u| cos φ, where φ is the angle between w and u.]

Geometric Interpretation

[Sketch in the (u_1, u_2) plane: the weight vector w is perpendicular to the decision line w_1 u_1 + w_2 u_2 = θ; points with w·u ≥ θ give v = 1, points with w·u < θ give v = 0. The projection u_w of u onto w has length |u_w| = θ/|w| on the decision line.]

The relation w·u = θ implicitly defines the decision line.

Geometric Interpretation

In n dimensions the relation w·u = θ defines an (n−1)-dimensional hyper-plane, which is perpendicular to the weight vector w.

On one side of the hyper-plane (w·u > θ) all patterns are classified by the TLU as "1", while those classified as "0" lie on the other side.

If patterns cannot be separated by a hyper-plane, they cannot be correctly classified with a TLU.
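A small numeric check of these statements, using the AND unit from the next slide (w_1 = w_2 = 1, θ = 1.5); this sketch is illustrative only:

```python
import numpy as np

# The decision line w.u = theta lies at distance theta/|w| from the origin,
# and a pattern's class is given by the sign of w.u - theta.
w, theta = np.array([1.0, 1.0]), 1.5
print("distance of decision line from origin:", theta / np.linalg.norm(w))

for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    side = np.dot(w, u) - theta
    print(u, "class 1" if side >= 0 else "class 0")
```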

Linear Separability

Logical AND (linearly separable, e.g. with w_1 = 1, w_2 = 1, θ = 1.5):

u_1  u_2  a  v
 0    0   0  0
 0    1   1  0
 1    0   1  0
 1    1   2  1

Logical XOR (not linearly separable; w_1 = ?, w_2 = ?, θ = ?):

u_1  u_2  v
 0    0   0
 0    1   1
 1    0   1
 1    1   0

[Plots in the (u_1, u_2) plane: for AND a single line separates the "1" pattern from the "0" patterns; for XOR no single line can separate the two classes.]

Threshold as Weight

The threshold can be treated as just another weight by adding a constant input u_{n+1} = −1 with weight w_{n+1} = θ:

activation: a = Σ_{i=1}^{n+1} w_i u_i

output: v = 1 if a ≥ 0, 0 if a < 0
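A minimal sketch of the same AND unit with the threshold absorbed as the extra weight w_{n+1} = θ on the constant input u_{n+1} = −1; the function name tlu_augmented is an assumption for this example:

```python
import numpy as np

def tlu_augmented(u, w_aug):
    """TLU with the threshold absorbed as an extra weight: u is extended
    by a constant input -1, w_aug ends with w_{n+1} = theta, and the unit
    now compares the activation against 0."""
    u_aug = np.append(u, -1.0)                 # u_{n+1} = -1
    return 1 if np.dot(w_aug, u_aug) >= 0 else 0

# Same AND unit as before: w = (1, 1), theta = 1.5  ->  w_aug = (1, 1, 1.5)
w_aug = np.array([1.0, 1.0, 1.5])
for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, tlu_augmented(np.array(u), w_aug))
```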

Geometric Interpretation

[Sketch in the (u_1, u_2) plane: weight vector w, input u, and the decision line, with v = 1 on one side and v = 0 on the other.]

The relation w·u = θ defines the decision line.

Training ANNs

Training set S of examples {u, v_t}, where u is an input vector and v_t the desired target output.

Example (logical AND): S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}

Iterative process: present a training example u, compute the network output v, compare the output v with the target v_t, adjust the weights and thresholds.

Learning rule: specifies how to change the weights w and thresholds θ of the network as a function of the inputs u, output v and target v_t.

Adjusting the Weight Vector

Target v_t = 1, output v = 0 (the angle between w and u is > 90°):
  w' = w + u, i.e. move w in the direction of u.

Target v_t = 0, output v = 1 (the angle between w and u is < 90°):
  w' = w − u, i.e. move w away from the direction of u.

Perceptron Learning Rule

w' = w + μ (v_t − v) u

or in components:

w'_i = w_i + Δw_i = w_i + μ (v_t − v) u_i   (i = 1, …, n+1)

with w_{n+1} = θ and u_{n+1} = −1.

The parameter μ is called the learning rate. It determines the magnitude of the weight updates Δw_i.

If the output is correct (v_t = v), the weights are not changed (Δw_i = 0).

If the output is incorrect (v_t ≠ v), the weights w_i are changed such that the output of the TLU for the new weights w'_i moves toward the target: it increases when the target is 1 and decreases when the target is 0.

Perceptron Training Algorithm

Repeat
  for each training vector pair (u, v_t)
    evaluate the output v when u is the input
    if v ≠ v_t then
      form a new weight vector w' according to w' = w + μ (v_t − v) u
    else
      do nothing
    end if
  end for
Until v = v_t for all training vector pairs
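A runnable Python sketch of this algorithm (threshold treated as the extra weight w_{n+1}, mistake-driven updates, and a maximum epoch count added as a practical stopping condition that is not part of the pseudocode above); the name train_perceptron is invented for this example. Running it on AND and XOR previews the convergence theorem below: AND converges, XOR does not.

```python
import numpy as np

def train_perceptron(examples, mu=0.1, max_epochs=100):
    """Perceptron training as in the algorithm above, with the threshold
    treated as weight w_{n+1} on a constant input -1."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)                      # weights including the threshold
    for _ in range(max_epochs):
        errors = 0
        for u, vt in examples:
            u_aug = np.append(u, -1.0)       # augmented input (u, -1)
            v = 1 if np.dot(w, u_aug) >= 0 else 0
            if v != vt:                      # update only on mistakes
                w += mu * (vt - v) * u_aug
                errors += 1
        if errors == 0:                      # all patterns classified correctly
            return w, True
    return w, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print("AND converged:", train_perceptron(AND)[1])   # True: linearly separable
print("XOR converged:", train_perceptron(XOR)[1])   # False: not separable
```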

Perceptron Convergence Theorem

The algorithm converges to the correct classification if the training data is linearly separable and the learning rate μ is sufficiently small.

If two classes of vectors {u1} and {u2} are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0 such that w0 defines a TLU whose decision hyper-plane separates {u1} and {u2} (Rosenblatt 1962).

The solution w0 is not unique, since if w0·u = 0 defines a hyper-plane, so does w'0 = k w0.

Linear Separability

Logical AND (linearly separable, e.g. with w_1 = 1, w_2 = 1, θ = 1.5):

u_1  u_2  a  v
 0    0   0  0
 0    1   1  0
 1    0   1  0
 1    1   2  1

Logical XOR (not linearly separable; w_1 = ?, w_2 = ?, θ = ?):

u_1  u_2  v
 0    0   0
 0    1   1
 1    0   1
 1    1   0

Generalized Perceptron Learning Rule

If we do not include the threshold as an input, we use the following description of the perceptron with symmetrical outputs (this does not matter much, though):

v = +1 if w·u − θ ≥ 0
v = −1 if w·u − θ < 0

Then we get for the learning rule:

w → w + 2μ (v_t − v) u
θ → θ − 2μ (v_t − v)

This implies:

(w·u − θ) → (w·u − θ) + 2μ (v_t − v) (|u|² + 1)

Hence, if v_t = 1 and v = −1 the weight change increases the term w·u − θ, and vice versa. This is what we need to compensate the error!
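A quick numeric check of this implication; the values of μ, w, θ and u below are arbitrary and chosen only for this sketch:

```python
import numpy as np

# After the update  w -> w + 2*mu*(vt - v)*u  and  theta -> theta - 2*mu*(vt - v),
# the margin (w.u - theta) changes by exactly 2*mu*(vt - v)*(|u|^2 + 1).
mu, vt, v = 0.05, 1, -1
w, theta, u = np.array([0.2, -0.4]), 0.3, np.array([1.0, 2.0])

margin_before = np.dot(w, u) - theta
w_new = w + 2 * mu * (vt - v) * u
theta_new = theta - 2 * mu * (vt - v)
margin_after = np.dot(w_new, u) - theta_new

print(margin_after - margin_before)               # observed change in the margin
print(2 * mu * (vt - v) * (np.dot(u, u) + 1))     # predicted change: identical
```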

Linear Unit – no Threshold!

inputs u_1, u_2, …, u_n; weights w_1, w_2, …, w_n

activation: a = Σ_{i=1}^{n} w_i u_i

output: v = a = Σ_{i=1}^{n} w_i u_i

Let's abbreviate the target output (vectors) by t in the next slides.

Gradient Descent Learning Rule

Consider a linear unit without threshold and with continuous output v (not just −1, 1):

v = w_0 + w_1 u_1 + … + w_n u_n

Train the w_i such that they minimize the squared error

E[w_1, …, w_n] = ½ Σ_{d∈D} (t_d − v_d)²

where D is the set of training examples and t_d the target outputs.

Gradient Descent

D = {((1,1),1), ((−1,−1),1), ((1,−1),−1), ((−1,1),−1)}

Gradient: ∇E[w] = [∂E/∂w_0, …, ∂E/∂w_n]

Weight update: Δw = −μ ∇E[w], i.e. the step from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2)

Δw_i = −μ ∂E/∂w_i

∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d − v_d)²
        = ∂/∂w_i ½ Σ_d (t_d − Σ_i w_i u_{i,d})²
        = Σ_d (t_d − v_d)(−u_{i,d})

Gradient Descent

Gradient-Descent(training_examples, μ)

Each training example is a pair of the form ((u_1, …, u_n), t), where (u_1, …, u_n) is the vector of input values and t is the target output value; μ is the learning rate (e.g. 0.1).

Initialize each w_i to some small random value.
Until the termination condition is met, Do
  Initialize each Δw_i to zero.
  For each ((u_1, …, u_n), t) in training_examples Do
    Input the instance (u_1, …, u_n) to the linear unit and compute the output v.
    For each linear unit weight w_i Do
      Δw_i = Δw_i + μ (t − v) u_i
  For each linear unit weight w_i Do
    w_i = w_i + Δw_i
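A direct Python transcription of this batch procedure for a linear unit; a fixed epoch count stands in for the unspecified termination condition, and the example set D is the one from the Gradient Descent slide above:

```python
import numpy as np

def gradient_descent(training_examples, mu=0.1, epochs=100):
    """Batch gradient descent for a linear unit v = w.u: per epoch,
    accumulate delta_w over all examples, then apply the summed update once."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n)        # small random initial weights
    for _ in range(epochs):                      # termination: fixed epoch count
        delta_w = np.zeros(n)
        for u, t in training_examples:
            u = np.asarray(u, dtype=float)
            v = np.dot(w, u)                     # linear unit output
            delta_w += mu * (t - v) * u          # accumulate mu*(t - v)*u_i
        w += delta_w
    return w

# The set D above cannot be fit exactly by a linear unit, so the weights
# converge toward the minimum-error solution (here close to zero).
D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
print(gradient_descent(D))
```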

Incremental Stochastic Gradient Descent

Batch mode (gradient descent): w = w − μ ∇E_D[w] over the entire data set D, with

E_D[w] = ½ Σ_d (t_d − v_d)²

Incremental (stochastic) mode (gradient descent): w = w − μ ∇E_d[w] over individual training examples d, with

E_d[w] = ½ (t_d − v_d)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if μ is small enough.

This is the δ-rule.
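The incremental (stochastic) counterpart, i.e. the δ-rule, differs only in when the update is applied: after every single example instead of once per pass. A sketch under the same assumptions as the batch version above:

```python
import numpy as np

def stochastic_gradient_descent(training_examples, mu=0.1, epochs=100):
    """Incremental (stochastic) mode: the weights are updated after every
    single example d, following the gradient of E_d = 1/2 (t_d - v_d)^2.
    The per-example update delta_w_i = mu*(t - v)*u_i is the delta-rule."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n)
    for _ in range(epochs):
        for u, t in training_examples:
            u = np.asarray(u, dtype=float)
            v = np.dot(w, u)
            w += mu * (t - v) * u            # update immediately, example by example
    return w

D = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
print(stochastic_gradient_descent(D))
```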

Perceptron vs. Gradient Descent Rule

Perceptron rule:

w'_i = w_i + μ Σ_p (t_p − v_p) u_{i,p}

derived from manipulation of the decision surface.

Gradient descent rule:

w'_i = w_i + μ Σ_p (t_p − v_p) u_{i,p}

derived from minimization of the error function

E[w_1, …, w_n] = ½ Σ_p (t_p − v_p)²

by means of gradient descent.

Perceptron vs. Gradient Descent Rule

Perceptron learning rule guaranteed to succeed if Training examples are linearly separable Sufficiently small learning rate

Linear unit training rules using gradient descent Guaranteed to converge to hypothesis with

minimum squared error Given sufficiently small learning rate Even when training data contains noise Even when training data not separable by

hyperplane.

Presentation of Training Examples

Presenting all training examples once to the ANN is called an epoch.

In incremental stochastic gradient descent, the training examples can be presented in
  - fixed order (1, 2, 3, …, M)
  - randomly permuted order (5, 2, 7, …, 3)
  - completely random order (4, 1, 7, 1, 5, 4, …)
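A small sketch of the three presentation orders (indices only, with M = 8 examples; the variable names are illustrative):

```python
import numpy as np

# Three ways to order the M examples within one epoch, as listed above
# (these are index sequences; the actual examples would be looked up by index).
M, rng = 8, np.random.default_rng(0)
fixed    = np.arange(M)                    # fixed order 0, 1, 2, ..., M-1
permuted = rng.permutation(M)              # random permutation per epoch
random_  = rng.integers(0, M, size=M)      # completely random (repeats possible)
print(fixed, permuted, random_, sep="\n")
```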

Neuron with Sigmoid-Function

inputs x_1, x_2, …, x_n; weights w_1, w_2, …, w_n

activation: a = Σ_{i=1}^{n} w_i x_i

output: y = σ(a) = 1 / (1 + e^(−a))

Sigmoid Unit

inputs u_1, u_2, …, u_n and a constant input u_0 = −1; weights w_0, w_1, …, w_n

activation: a = Σ_{i=0}^{n} w_i u_i

output: v = σ(a) = 1 / (1 + e^(−a))

σ(x) is the sigmoid function 1 / (1 + e^(−x)), with derivative

dσ(x)/dx = σ(x) (1 − σ(x))

We can derive gradient descent rules to train:
  - one sigmoid unit: ∂E/∂w_i = −Σ_p (t_p − v_p) v_p (1 − v_p) u_{i,p}
  - multilayer networks of sigmoid units: backpropagation
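A short check of the derivative identity dσ/dx = σ(x)(1 − σ(x)) against a finite-difference approximation; this sketch is illustrative only:

```python
import numpy as np

def sigma(a):
    """Sigmoid activation sigma(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

# Compare the analytic derivative sigma(a)*(1 - sigma(a)) with a
# central finite-difference estimate at an arbitrary point a.
a, h = 0.7, 1e-6
analytic  = sigma(a) * (1 - sigma(a))
numerical = (sigma(a + h) - sigma(a - h)) / (2 * h)
print(analytic, numerical)   # the two values agree to about six decimal places
```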

Gradient Descent Rule for Sigmoid Output Function

E_p[w_1, …, w_n] = ½ (t_p − v_p)²

∂E_p/∂w_i = ∂/∂w_i ½ (t_p − v_p)²
          = ∂/∂w_i ½ (t_p − σ(Σ_i w_i u_{i,p}))²
          = (t_p − v_p) σ'(Σ_i w_i u_{i,p}) (−u_{i,p})

For v = σ(a) = 1 / (1 + e^(−a)):

σ'(a) = e^(−a) / (1 + e^(−a))² = σ(a) (1 − σ(a))

so the weight update becomes

w'_i = w_i + Δw_i = w_i + μ v (1 − v) (t_p − v_p) u_{i,p}

Gradient Descent Learning Rule

Δw_ji = μ v_{j,p} (1 − v_{j,p}) (t_{j,p} − v_{j,p}) u_{i,p}

for the weight w_ji from pre-synaptic neuron i to post-synaptic neuron j, where
  - μ is the learning rate
  - u_{i,p} is the activation of the pre-synaptic neuron
  - v_{j,p} (1 − v_{j,p}) is the derivative of the activation function
  - (t_{j,p} − v_{j,p}) is the error δ_j of the post-synaptic neuron
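A minimal sketch of one such weight update for a single sigmoid unit j; the function name sigmoid_delta_update and the numeric values are assumptions for this example:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_delta_update(w_j, u, t_j, mu=0.5):
    """One update of the rule above for a single sigmoid unit j:
    delta w_ji = mu * v_j * (1 - v_j) * (t_j - v_j) * u_i."""
    v_j = sigma(np.dot(w_j, u))              # post-synaptic output
    delta_j = v_j * (1 - v_j) * (t_j - v_j)  # derivative times error term
    return w_j + mu * delta_j * u            # update all incoming weights at once

w = np.array([0.1, -0.2, 0.05])
u = np.array([1.0, 0.0, 1.0])                # pre-synaptic activations
print(sigmoid_delta_update(w, u, t_j=1.0))
```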

ALVINN

Automated driving at 70 mph on a public highway.

  - Camera image: 30×32 pixels as inputs
  - 4 hidden units, with 30×32 weights into each hidden unit
  - 30 outputs for steering