6.034 Artificial Intelligence - MIT, web.mit.edu/6.034/


Page 1

6.034 Artificial Intelligence

• Big idea: Learning as acquiring a function on feature vectors

• Background
• Nearest Neighbors
• Identification Trees
• Neural Nets

Neural Nets: Supervised learning

Advantages:
• Fast prediction
• Does feature weighting
• Very generally applicable

Disadvantages:
• Very slow training
• Overfitting is easy

Training:
• Choose connection weights that minimize error; (Observed − Computed) = error-driven learning

Prediction:
• Propagate input feature values through the network of “artificial neurons”

[Figure: a single artificial neuron - inputs x1 … xn with weights w1 … wn, bias input −1 with weight w0, weighted sum z passed through sigmoid s(z) to produce output y]
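The prediction step can be sketched in a few lines of code. This is an illustrative sketch of mine (the weights and inputs are made up), following the diagram: inputs weighted by w1 … wn, a bias weight w0 on a constant −1 input, and a sigmoid s(z) applied to the sum:

```python
import math

def neuron_output(weights, w0, inputs):
    """One 'artificial neuron': weighted sum plus bias term, then sigmoid."""
    # z = w1*x1 + ... + wn*xn + w0*(-1), as in the diagram
    z = sum(w * x for w, x in zip(weights, inputs)) + w0 * (-1)
    return 1.0 / (1.0 + math.exp(-z))   # s(z) squashes z into (0, 1)

# made-up weights and feature values, for illustration only
y = neuron_output([2.0, -1.0], 0.5, [1.0, 0.0])   # z = 2.0 - 0.5 = 1.5
```

Prediction is just this propagation repeated unit by unit; training, which follows, is about choosing the weights.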

Page 2

[Figure: a multi-layer net - the input vector x = (x1, x2, x3, …, xu) feeds two hidden layers, which feed output units y1, y2 forming the output vector y; layers labeled input, hidden, hidden, output]

Multi-Layer Neural net

More powerful than a single layer. Lower layers transform the input problem into more tractable (linearly separable) problems for subsequent layers.

Error-driven learning: minimize error

• Error is the difference between (Observed/Data − Currently computed), or Error function E(w) = (Observed − Computed) value
• Goal is to make the Error 0 by adjusting the weights on the neural net units
• So first we have to figure out how far off we are: we must make a forward propagation through the net with our current weights to see how close we are to the ‘truth’
• Once we know how close we are, we must figure out how to change the w’s to minimize E(w)

• So - how do we minimize a function? Set the derivative to 0!
• So, we do this by gradient descent: look at the slope of the Error function E with respect to the weights, and figure out how to minimize E by changing the w’s: if the slope is negative, move down; otherwise up. (We want to get to the place where the slope is zero - a minimum of E.)

• The backpropagation algorithm tells us how to adjust w’s.

Page 3

Minimizing by Gradient Descent

[Figure: error E = G(w) plotted against the weight w; the slope dG/dw at w0 points down toward a local min, reached by steps w0 → w1 → …]

Basic Problem: Given initial parameters w0 with error E = G(w0), find w that is a local minimum of G, that is, all first derivatives of G are zero.

Approach:

1. Find “slope” of G at w0

2. Move to w1 which is “down slope”

Gradient is n-dimensional “slope”

Gradient descent: wᵢ₊₁ = wᵢ − r∇G, where r is the “rate” – a small step size

Stop when gradient is very small (or change in G is very small)
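As a sketch (mine, with a made-up one-dimensional G), the approach above is a short loop: find the slope at the current w, step down it by the small rate r, and stop when the gradient is very small:

```python
def gradient_descent(dG, w0, r=0.1, tol=1e-6, max_steps=10000):
    """Repeat: find the slope of G at w, then move 'down slope' by rate r."""
    w = w0
    for _ in range(max_steps):
        slope = dG(w)             # 1. find "slope" of G at the current w
        if abs(slope) < tol:      # stop when the gradient is very small
            break
        w = w - r * slope         # 2. move to the next w, down the slope
    return w

# made-up example: G(w) = (w - 3)^2, so dG/dw = 2(w - 3); local min at w = 3
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```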

The simplest possible net:

[Figure: inputs x1, x2 with weights w1, w2, bias input −1 with weight w0, output y]

Output y = w1x1 + w2x2

Error or Performance function = 1/2 (y* − y)², so what is the derivative of the error function wrt w?

It is just (y* − y) dy/dw (Note that the two negative signs cancel)

For this linear case it is easy to figure out how to change w in order to change y - i.e., dy/dw. It is just the vector of coefficients on the wi - i.e., [x1 x2] (e.g., dy/dw for y = wx + 2 is just ‘x’)

So “Credit/blame” assignment is simple
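Credit/blame assignment for the linear unit can be checked numerically. In this sketch (my own, with made-up numbers), the analytic derivative of E = 1/2 (y* − y)², namely −(y* − y)·[x1 x2], is compared against a finite-difference slope:

```python
def error(w, x, y_star):
    """E = 1/2 (y* - y)^2 for the linear unit y = w1*x1 + w2*x2."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (y_star - y) ** 2

# made-up weights, inputs, and target
w, x, y_star = [0.5, -0.3], [2.0, 1.0], 1.0
y = sum(wi * xi for wi, xi in zip(w, x))       # current output: 0.7
grad = [-(y_star - y) * xi for xi in x]        # dE/dwi = -(y* - y) * xi

# finite-difference check of dE/dw1
eps = 1e-6
numeric = (error([w[0] + eps, w[1]], x, y_star) - error(w, x, y_star)) / eps
```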

Page 4

Simplest neural net Learning Rule:

w′ = w + r(d − z)·x

where r is the learning rate (small), d is the desired (correct) output, and z is the actual output.

Change weights on non-zero inputs in the direction of the error (gradient descent) - note that for a single layer this is linear, so the derivative is simple: it is just [x1 x2].

We update the weight - we take a ‘step’ based on the size of the error times the derivative, a direction in which to take a step (size, plus direction). This is gradient descent.

[Diagram annotations: (d − z) is the error; the derivative of the Error fn wrt w tells how the error changes if you change w a little bit (this is the slope of the error function at point w); w′ is the new weight, w the original weight; the decision boundary is w·x = 0]

y = w1x1 + w2x2
dy/dw = x = [x1 x2]

Page 5

Δw = r(y*–y)x

[Figure: the correct hyperplane vs the current hyperplane; note that the weight vector is changed in the direction of x]

INPUT   TARGET
0 1       0
1 0       0
1 1       1
0 1       0
1 0       0
1 1       1
0 1       0
0 0       0

Learning AND

INPUT   WEIGHTS     ACTIVATION  THRESHOLD  OUTPUT  TARGET  NEW WEIGHTS  LEARN?
0 1     -0.1 +0.2      0.2          1        0       0     -0.1 +0.2     OK
1 0     -0.1 +0.2     -0.1          1        0       0     -0.1 +0.2     OK
1 1     -0.1 +0.2      0.1          1        0       1     +0.9 +1.2     CHANGE
0 1     +0.9 +1.2      1.2          1        1       0     +0.9 +0.2     CHANGE
1 0     +0.9 +0.2      0.9          1        0       0     +0.9 +0.2     OK
1 1     +0.9 +0.2      1.1          1        1       1     +0.9 +0.2     OK
0 1     +0.9 +0.2      0.2          1        0       0     +0.9 +0.2     OK
0 0     +0.9 +0.2      0.0          1        0       0     +0.9 +0.2     OK
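The table can be replayed in code. This is my reconstruction: the threshold of 1 and the learning rate r = 1 are assumptions inferred from the numbers in the table (they reproduce each CHANGE row), and the update is the rule w ← w + r(d − z)·x from the earlier slide:

```python
def perceptron_and():
    """Replay the Learning AND table: w <- w + r*(d - z)*x, threshold unit."""
    w = [-0.1, 0.2]        # initial weights, from the first row of the table
    r = 1.0                # learning rate (assumed; it reproduces the +1.0 jumps)
    samples = [((0, 1), 0), ((1, 0), 0), ((1, 1), 1), ((0, 1), 0),
               ((1, 0), 0), ((1, 1), 1), ((0, 1), 0), ((0, 0), 0)]
    for x, d in samples:
        activation = w[0] * x[0] + w[1] * x[1]
        z = 1 if activation >= 1 else 0              # threshold assumed to be 1
        if z != d:                                   # the table's CHANGE rows
            w = [wi + r * (d - z) * xi for wi, xi in zip(w, x)]
    return w

final_w = perceptron_and()   # ends at the table's final weights [0.9, 0.2]
```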

Page 6

[Figure: the four points (0,0), (1,0), (0,1), (1,1) in the plane]

Learning XOR: this is not separable by a single line, which is all a 1-layer neural net can do

[Figure: the same four points, with two parallel lines slicing the plane between them]

Learning XOR: note that we need two lines through the plane to separate red from green - there isn’t a single slice that will work to separate the two

So, we need a hidden layer with two units (one to make each slice), and then one output neuron to decide between these two slices:

But two slices will work

Page 7

XOR solvable by a 2-layer net: one layer separates off/off & on/on together; the next layer can separate on/off, off/on

A more complex pattern will require more hidden units, depending on how many ‘slices’ we need

[Figure: the points (0,0), (1,0), (0,1), (1,1) with a more complex labeling]

Think about what happens now.

How many ‘slices’ will we need?

Therefore, how many hidden units?

Will we need more than 2 layers?
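To make the ‘slices’ idea concrete, here is a hand-wired sketch; the weights are mine, chosen for illustration rather than learned. Each hidden threshold unit implements one line through the plane, and the output unit fires only when the point lies between the two slices - which is exactly XOR:

```python
def step(z):
    """Threshold unit: fire iff the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # slice 1: fires unless both inputs are off
    h2 = step(x1 + x2 - 1.5)     # slice 2: fires only when both inputs are on
    return step(h1 - h2 - 0.5)   # output: in slice 1 but not slice 2 -> XOR

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```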

Page 8

Now we add 2 things, one at a time

• A sigmoid on the output of the unit (instead of just adding)

• Intermediate (hidden) units between input and output

The first just makes the computation of the derivative relating the output back to the input of a unit a bit more complicated

The second also - how to allocate ‘credit/blame’

[Figure: the two-input unit (inputs x1, x2, weights w1, w2, bias weight w0 on input −1, output y) and its output surface over the (x1, x2) plane, with the y = 1/2 contour marked]

Sigmoid Unit

x_out = Σᵢ wᵢxᵢ        s(x_out) = 1 / (1 + e^(−x_out))

[Figure: the sigmoid unit - inputs x1 … xn with weights w1 … wn, bias input −1 with weight w0, sum z, sigmoid s(z), output y]

Page 9

Derivative of the sigmoid wrt its ‘input’:

s(x) = 1 / (1 + e^(−x))

ds(x)/dx = d/dx [(1 + e^(−x))^(−1)]
         = [−(1 + e^(−x))^(−2)] × [−e^(−x)]
         = [1 / (1 + e^(−x))] × [e^(−x) / (1 + e^(−x))]
         = s(x)(1 − s(x))

or just z(1 − z), where z is the output from the sigmoid function, i.e., the final output from this particular net unit

Note - if this is the 'final output' we sometimes use o instead of z
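The identity ds/dx = s(x)(1 − s(x)) is easy to check numerically. A small sketch (mine), comparing the z(1 − z) form against a centered finite difference at a few made-up points:

```python
import math

def s(x):
    """Sigmoid s(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def ds_analytic(x):
    z = s(x)                 # z is the unit's output
    return z * (1.0 - z)     # the slide's z(1 - z) form

def ds_numeric(x, eps=1e-5):
    """Centered finite-difference approximation of ds/dx."""
    return (s(x + eps) - s(x - eps)) / (2 * eps)

checks = [abs(ds_analytic(x) - ds_numeric(x)) for x in (-2.0, 0.0, 0.88, 3.0)]
```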

So now, the derivative (slope) relating changes in w to changes in the output is a bit more complicated:

How this changes the computation of the way to ‘twiddle’ the weights

Original: Δw = r(d − z)·x

After adding sigmoid: Δw = r(d − z)·x·[z(1 − z)]

[Figures: the two-input unit and the n-input sigmoid unit, as before]

So far, this is all we’ve added - to take care of computing the derivative correctly as we go through a Sigmoid unit

Note: we will use d for the true or desired value of y, and “ynet” or “o” (as in Winston) for the value output or computed by the net

Page 10

Now we add hidden units…

[Figure: a 2-layer net - inputs x1, x2 (each with a bias input −1) feed hidden units z1, z2 via weights w11, w12, w21, w22 and bias weights w01, w02, producing outputs y1, y2; these feed output unit z3 via weights w13, w23 with bias weight w03, producing y3]

Given (y_obs − y_net/calc), how do we twiddle the w_ij?

The KEY POINT: hidden units have no ‘correct target’ value, so how do we know what their ‘error’ is? If we don’t know the ‘error’, how can we know how to change the weights for them?

Ans: we use the error from the output and propagate it back - allocate the ‘blame’ proportionately over the next units in the layer down

Q: How much blame should each unit get?

A: it should be proportional to the weight on what it passes to the next unit

Minimizing by Gradient Descent

[Figure: G(w) plotted against the weight w; the slope dG/dw at w0 points down toward a local min, reached by steps w0 → w1 → …; here the vertical axis is the P(erformance fn)]

Basic Problem: Given initial parameters w0 with error P = G(w0), find w that is a local minimum of G, that is, all first derivatives of G are zero.

Approach:

1. Find “slope” of G at w0

2. Move to w1 which is “down slope”

wᵢ₊₁ = wᵢ − r∇G

∇G = [∂G/∂w1  ∂G/∂w2  …  ∂G/∂wn]

Gradient is n-dimensional “slope”

Gradient descent, r is “rate” – a small step size

Stop when gradient is very small (or change in G is very small)

(Recall that the minus sign in front will be canceled by the negative sign for ∇G of our Performance function in our actual formula.)

Page 11

The simplest net - just the output unit

[Figure: the output unit - input y, weight wz, product pz = y·wz fed into the Sigmoid, output z]

(Note: pz = y · wz is the input to the unit; z is the output from the unit.)

(Note that the output unit could also include a ‘bias’ w, set to 1, included as in the 1-layer net case to make the threshold come out to 0.)

The picture: propagating back one step - the ‘delta rule’

Intuitively, ∂P/∂wz tells us how much the P(erformance) function (the difference between the actual desired output and the output the net computes given its current weights) will change if we ‘twiddle’ wz a little bit. This calculation has three parts.

[Figure: OUTPUT UNIT - input y, weight wz, pz into the Sigmoid, output value z]

P = −1/2 (observed − computed)²

dP/dz = (d[esired] − z) = derivative of the error function

∂P/∂wz = ∂P/∂z × ∂z/∂pz × ∂pz/∂wz = ∂P/∂z × ∂z/∂pz × ∂(wz·y)/∂wz

∂P/∂wz = (d − z) × z(1 − z) × y

where (d − z)·z(1 − z) is the delta, δz.

(Note: pz = y · wz is the input to the unit; z = s(pz) is the output from the unit.)
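The three-factor chain rule can be verified numerically. This sketch is my own; it follows the slide's convention P = −1/2 (d − z)² with z = s(wz·y), so that ∂P/∂wz = (d − z)·z(1 − z)·y, and compares that product against a finite-difference slope (the numbers are made up):

```python
import math

def P(wz, y, d):
    """Performance P = -1/2 (d - z)^2, where z = s(wz * y)."""
    z = 1.0 / (1.0 + math.exp(-wz * y))
    return -0.5 * (d - z) ** 2

# made-up numbers for illustration
wz, y, d = 1.0, 0.88, 2.0
z = 1.0 / (1.0 + math.exp(-wz * y))

analytic = (d - z) * z * (1 - z) * y   # (d - z) * z(1 - z) * y, the delta-rule product

eps = 1e-6                             # centered finite-difference slope of P wrt wz
numeric = (P(wz + eps, y, d) - P(wz - eps, y, d)) / (2 * eps)
```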

Page 12

The three components

So, ∂P/∂wz = (d − z) × z(1 − z) × y

where:
• (d − z) is the derivative of the performance function wrt z
• z(1 − z) is the derivative of the output z wrt the output from the neuron before passing it through the sigmoid function
• y is the input to the unit

Intuition

∂P/∂wz = (d − z) × z(1 − z) × y

Intuitively, this makes sense, because the size of the effect on error P is proportional to the ‘amount’ of flow that comes into a node, y, which is the derivative wrt wz, times the derivative twiddle produced by the sigmoid, which is z(1 − z), times the derivative of P wrt z.

Page 13

Or, it divides into 3 parts, one for each part of the ‘chain’ in the net - read it ‘backward’ off the formula:

∂P/∂wz = (d − z) × z(1 − z) × y

Part 1: ∂(wz·y)/∂wz = y

Or, it divides into 3 parts, one for each part of the ‘chain’ in the net - read it ‘backward’ off the formula:

∂P/∂wz = (d − z) × z(1 − z) × y

Part 2: ∂z/∂pz = ∂sigmoid(pz)/∂pz = z(1 − z) (I have put the − sign out front….)

Page 14

Or, it divides into 3 parts, one for each part of the ‘chain’ in the net - read it ‘backward’ off the formula:

∂P/∂wz = (d − z) × z(1 − z) × y

Part 3: ∂P/∂z = (d − z)

Delta rule for output unit

∂P/∂wz = (d − z) × z(1 − z) × y,  where (d − z)·z(1 − z) is the delta, δz

For the output unit, where z = o, and re-arranging:

δ_output = δ_o = o(1 − o) × (d − o)

Actual weight change Δw = r × (input to unit) × δ_o, or

Δw = r × y × o(1 − o) × (d − o)

Page 15

So now we know how to change the output unit weight, given some (input, true output) training example pair:

Δw = r × y × o(1 − o) × (d − o)

w_new = w_old + Δw = w_old + r × y × o(1 − o) × (d − o)

where r is the learning rate; y is the input to the unit from the previous layer; d is the desired, true output; and o is the value actually computed by the net with its current weights

But, how do we know y, the input to this unit?

Ans: because we had to compute the forward propagation first, from the very first input to the output at the end of the net, and this tells us, given the current weights, what the net outputs at each of its units, including the 'hidden' ones (like y)

Backpropagation
An efficient method of implementing gradient descent for neural networks

1. Initialize weights to small random values
2. Choose a random sample input feature vector
3. Compute total input and output for each unit (forward prop)
4. Compute δ for output layer: δ_output = o(1 − o)(d − o)
5. Compute δ_j for preceding layer by the backprop rule (repeat for all layers)
6. Compute weight change Δw = r × input × δ by the gradient descent rule (repeat for all weights)
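The six steps can be sketched end to end for a tiny 2-2-1 net of sigmoid units. This is an illustrative implementation of mine, not the course's code (the learning rate 0.5, the net size, and the single training pair are arbitrary choices): step 4 uses δ_output = o(1 − o)(d − o), step 5 uses the backprop rule for the hidden deltas, and step 6 uses Δw = r × input × δ:

```python
import math
import random

def s(x):
    """Sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, d, w_hidden, w_out, r=0.5):
    """One pass of steps 3-6 for a 2-input, 2-hidden, 1-output sigmoid net."""
    # Step 3: forward prop - total input and output for each unit
    h = [s(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = s(sum(w * hi for w, hi in zip(w_out, h)))
    # Step 4: delta for the output layer
    delta_o = o * (1 - o) * (d - o)
    # Step 5: deltas for the preceding (hidden) layer - backprop rule
    delta_h = [h[j] * (1 - h[j]) * w_out[j] * delta_o for j in range(len(h))]
    # Step 6: weight changes by the descent rule, delta_w = r * input * delta
    w_out = [w_out[j] + r * h[j] * delta_o for j in range(len(h))]
    w_hidden = [[w + r * x[i] * delta_h[j] for i, w in enumerate(ws)]
                for j, ws in enumerate(w_hidden)]
    return w_hidden, w_out, o

# Steps 1-2: small random weights and a sample (fixed here for repeatability)
random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(2)]
x, d = (1.0, 0.0), 1.0
for _ in range(2000):   # repeat on the same sample; the output o creeps toward d
    w_hidden, w_out, o = backprop_step(x, d, w_hidden, w_out)
```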

Page 16

Now that we know this, we can find the weight changes required in the previous layer. Here we use Winston’s notation:

il = input to left neuron; wl = weight on left neuron;

pl = product of il and wl on left neuron (input to sigmoid);

ol = output from left neuron (and input to right neuron)

ir = input to right neuron (from left neuron); wr = weight on right neuron;

pr = product of ir and wr on right neuron (input to sigmoid);

or = output from right neuron

We already know how to find the wt delta for the right hand side neuron (this is like the output layer as before):

Page 17

Now we compute the weight delta for the layer to the left of this neuron:

So if we use the following notation we can see the key recursive pattern better:

Recall we called this δz before

So we can write the weight derivatives more simply as a function of the input to each neuron plus the delta for each neuron:

Clearly, if we add another layer to the left, we can just repeat this computation: we just take the delta for the new layer (which is given in terms of things we already know plus the delta to the right) and the input to that layer (which we can compute via forward propagation)

Page 18

From deltas to weight adjustments

First, the output layer, the neuron on the right:

δ_output = δ_right = o(1 − o) × (d − o)
δ_left = δ_l = o_l(1 − o_l) × w_l→r × δ_right
Δw_right = r × o_l × δ_right

(Here I have used the subscript ‘right’ instead of ‘r’ to avoid confusion with the learning rate r; Winston uses alpha instead of r in his netnotes.pdf for just this reason. Note that o_l, the output from the left neuron, is the input to the right neuron, so this equation has the form of r × input to final-layer neuron × delta for final neuron.)

Finding the next adjustment back, wl, involves using the delta from the right, the weight on the right unit, the delta for the left unit, and the input to the left unit:

δ_output = δ_right = o(1 − o) × (d − o)

δ_left = δ_l = o_l(1 − o_l) × w_right × δ_right

Δw_l = r × i_l × δ_l

Page 19

Let’s do an example! Suppose we have training example (1, 2), and wts as shown. So, the input to the net is 1 and d (the true value) is 2. Let the learning rate r be 0.2. Q: what are the new wts after one iteration of backprop?

[Figure: input 1 → weight wl = 2.0 → left (hidden) sigmoid neuron → weight wr = 1.0 → right (output) sigmoid neuron → output; desired output d = 2.0]

Step 1: Do forward propagation to find all the pl, ol (hence ir), pr, or (the last is the computed output o)
(a) Left neuron: this gives 1 × 2.0 = 2.0 for pl; 1/(1+e⁻²·⁰) = 0.88 for ol
(b) Right neuron: ir = ol, so 0.88 × 1.0 = 0.88 for pr; 1/(1+e⁻⁰·⁸⁸) = 0.71 for or

Step 2: Now start backpropagation. First, compute the weight change for the output weight wr: Δw = r × ir × o(1−o)(d−o) = 0.2 × 0.88 × 0.71 × (1−0.71) × (2−0.71) = 0.0467

Note: o(1−o)(d−o) = δr = 0.2656; new weight wr = 1.0 + 0.0467 = 1.0467

Step 3: Compute the weight change for the next level to the left, wl: Δw = r × il × ol(1−ol) × wr × δr = 0.2 × 1.0 × 0.88 × (1−0.88) × 1.0 × 0.2656 = 0.0056; new weight wl = 2.0 + 0.0056 = 2.0056

So we’ve adjusted both the weights upwards, as desired (compare to the 1-layer case), though very slowly… Now we could go back and iterate with this sample point until we get no further change in the weights… and then we go to the next training example.

For a much more detailed example in the same vein, please see my other handout on my web page, neuralnetex.pdf!
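The arithmetic of this example can be replayed in code (a sketch of mine). With full floating-point precision the numbers land within rounding distance of the slide's values - e.g. δr comes out ≈ 0.268 rather than 0.2656, because the slide rounds ol and or to 0.88 and 0.71 before multiplying:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

r, il, d = 0.2, 1.0, 2.0            # learning rate, net input, desired output
wl, wr = 2.0, 1.0                   # initial weights

# Step 1: forward propagation
pl = il * wl                        # left neuron:  pl = 2.0
ol = sigmoid(pl)                    #               ol ≈ 0.88
pr = ol * wr                        # right neuron: pr ≈ 0.88
o_r = sigmoid(pr)                   #               or ≈ 0.71 ('or' is a Python keyword)

# Step 2: output weight - delta_r = o(1-o)(d-o), delta_w = r * input * delta_r
delta_r = o_r * (1 - o_r) * (d - o_r)
wr_new = wr + r * ol * delta_r      # ≈ 1.047

# Step 3: next level to the left - delta_l = ol(1-ol) * wr * delta_r
delta_l = ol * (1 - ol) * wr * delta_r
wl_new = wl + r * il * delta_l      # ≈ 2.0056
```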

Let’s sum up…

Page 20

Backpropagation
An efficient method of implementing gradient descent for neural networks

1. Initialize weights to small random values
2. Choose a random sample input feature vector
3. Compute total input and output for each unit (forward prop)
4. Compute δ for output layer: δ_output = o(1 − o)(d − o)
5. Compute δ_j for preceding layer by the backprop rule (repeat for all layers)
6. Compute weight change Δw = r × input × δ by the gradient descent rule (repeat for all weights)

Training Neural Nets - without overfitting, hopefully…

Given: Data set, desired outputs, and a neural net with m weights. Find a setting for the weights that will give good predictive performance on new data. Estimate expected performance on new data.

1. Split data set (randomly) into three subsets:
   • Training set – used for picking weights
   • Validation set – used to stop training
   • Test set – used to evaluate performance

2. Pick random, small weights as initial values
3. Perform iterative minimization of error over training set.
4. Stop when error on validation set reaches a minimum (to avoid overfitting).
5. Use best weights to compute error on test set, which is the estimate of performance on new data. Do not repeat training to improve this.

Can use cross-validation if data set is too small to divide into three subsets.
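The recipe can be sketched as a generic training loop. Everything here is schematic and my own - a made-up one-weight ‘net’ fitted to y ≈ 3x stands in for real training - but the control flow is the recipe's: the training set drives the updates, the validation error decides when to stop, and the test set is touched exactly once at the end:

```python
import random

def train_with_early_stopping(data, fit_step, error, w0, patience=5, max_steps=10000):
    """Steps 1-5: split, train, stop at the validation minimum, test once."""
    data = data[:]
    random.shuffle(data)                          # 1. random split into three subsets
    n = len(data)
    train, val, test = data[:n // 2], data[n // 2:3 * n // 4], data[3 * n // 4:]
    w = w0                                        # 2. small initial weights
    best_w, best_val, bad = w, error(w, val), 0
    for _ in range(max_steps):                    # 3. iterative minimization on train
        if bad >= patience:                       # 4. stop: val error reached a minimum
            break
        w = fit_step(w, train)
        e = error(w, val)
        if e < best_val:
            best_w, best_val, bad = w, e, 0
        else:
            bad += 1
    return best_w, error(best_w, test)            # 5. estimate performance on new data

# toy stand-in for "a neural net with m weights": one weight, y ~ w * x
random.seed(1)
data = [(i / 10, 3 * (i / 10) + random.uniform(-0.1, 0.1)) for i in range(40)]

def mse(w, pts):
    return sum((y - w * x) ** 2 for x, y in pts) / len(pts)

def grad_step(w, pts, r=0.01):
    g = sum(-2 * (y - w * x) * x for x, y in pts) / len(pts)
    return w - r * g

w_best, test_err = train_with_early_stopping(data, grad_step, mse, w0=0.0)
```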

Page 21

On-line vs off-line

There are two approaches to performing the error minimization:

• On-line training – present Xi and Yi* (chosen randomly from the training set). Change the weights to reduce the error on this instance. Repeat.

• Off-line training – change weights to reduce the total error on training set (sum over all instances).

On-line training is a stochastic approximation to gradient descent since the gradient based on one instance is “noisy” relative to the full gradient (based on all instances). This can be beneficial in pushing the system out of shallow local minima.

Classification vs Regression

A neural net with a single sigmoid output unit is aimed at binary classification. Class is 0 if y < 0.5 and 1 otherwise.

For multiway classification, use one output unit per class.

Page 22

Classification vs Regression

A neural net with a single sigmoid output unit is aimed at binary classification. Class is 0 if y < 0.5 and 1 otherwise.

For multiway classification, use one output unit per class.

A sigmoid output unit is not suitable for regression, since sigmoids are designed to change quickly from 0 to 1. For regression, we want a linear output unit, that is, remove the output non-linearity. The rest of the net still retains the sigmoid units.