Neural Networks Tuomas Sandholm Carnegie Mellon University Computer Science Department.


Page 1:

Neural Networks

Tuomas Sandholm

Carnegie Mellon University

Computer Science Department

Page 2:

How the brain works

Synaptic connections exhibit long-term changes in the connection strengths based on patterns seen

Page 3:

Comparing brains with digital computers

Parallelism
Graceful degradation
Inductive learning

Page 4:

Notation

ANN (software/hardware, synchronous/asynchronous)

Page 5:

Single unit (neuron) of an artificial neural network

$$in_i = \sum_j W_{j,i}\, a_j$$

Page 6:

Activation Functions

$$a_i = \mathrm{step}_t\Big(\sum_{j=1}^{n} W_{j,i}\, a_j\Big) = \mathrm{step}_0\Big(\sum_{j=0}^{n} W_{j,i}\, a_j\Big)$$

where $W_{0,i} = t$ and $a_0 = -1$ is fixed.
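
The threshold trick on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names are hypothetical, and it assumes the step function fires when its argument is greater than or equal to the threshold (the slide does not say which way ties go).

```python
def step(x, threshold=0.0):
    """Step activation; assumes the unit fires when x >= threshold."""
    return 1 if x >= threshold else 0


def unit_with_threshold(weights, activations, t):
    """a_i = step_t(sum over j = 1..n of W_j * a_j)."""
    total = sum(w * a for w, a in zip(weights, activations))
    return step(total, t)


def unit_with_bias_weight(weights, activations, t):
    """Same unit with the threshold folded in as W_0 = t, a_0 = -1."""
    all_weights = [t] + list(weights)
    all_activations = [-1] + list(activations)
    total = sum(w * a for w, a in zip(all_weights, all_activations))
    return step(total, 0.0)


# Both formulations agree on every binary input pair for this unit.
same = all(
    unit_with_threshold([1, 1], [a, b], 1.5) == unit_with_bias_weight([1, 1], [a, b], 1.5)
    for a in (0, 1)
    for b in (0, 1)
)
```

Treating the threshold as one more weight means a learning rule only has to adjust weights, with no separate case for thresholds.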

Page 7:

Boolean gates can be simulated by units with a step function

AND: W=1, W=1, t=1.5

OR: W=1, W=1, t=0.5

NOT: W=-1, t=-0.5

g is a step function
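
The three gates can be checked directly with the weights and thresholds given on this slide. The sketch below is illustrative only; `step_unit` is a hypothetical helper, and it assumes the unit outputs 1 when the weighted sum is at least t.

```python
def step_unit(weights, inputs, t):
    """Threshold unit: 1 if the weighted input sum reaches t, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0


def AND(a, b):
    return step_unit([1, 1], [a, b], 1.5)


def OR(a, b):
    return step_unit([1, 1], [a, b], 0.5)


def NOT(a):
    return step_unit([-1], [a], -0.5)


# Print the full truth tables for inspection.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```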

Page 8:

Topologies

Feed-forward vs. recurrent

Recurrent networks have state (activations from previous time steps have to be remembered): Short-term memory.

Page 9:

Hopfield network

• Bidirectional symmetric ($W_{i,j} = W_{j,i}$) connections
• g is the sign function
• All units are both input and output units
• Activations are ±1

“Associative memory”

After training on a set of examples, a new stimulus will cause the network to settle into an activation pattern corresponding to the example in the training set that most closely resembles the new stimulus.

E.g. parts of a photograph

Thrm. Can reliably store up to 0.138 × #units training examples
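
The associative-memory behavior can be sketched as follows. The slides do not say how the weights are set, so this sketch assumes the common Hebbian storage rule (sum of outer products of the stored patterns, zero diagonal) and asynchronous sign updates; the function names are hypothetical. The two stored patterns are chosen to be orthogonal to keep the tiny example well behaved.

```python
def train_hopfield(patterns):
    """Assumed Hebbian rule: W[i][j] = sum over patterns of p[i] * p[j], W[i][i] = 0."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def recall(W, stimulus, max_sweeps=100):
    """Update units one at a time with the sign function until nothing changes."""
    a = list(stimulus)
    n = len(a)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            in_i = sum(W[i][j] * a[j] for j in range(n))
            # sign(0) is treated as "keep the current activation" (an assumption).
            new = a[i] if in_i == 0 else (1 if in_i > 0 else -1)
            if new != a[i]:
                a[i] = new
                changed = True
        if not changed:
            break
    return a


p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = train_hopfield([p1, p2])

corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]  # p1 with its first unit flipped
restored = recall(W, corrupted)
```

Because the weights are symmetric, `W[i][j] == W[j][i]` holds by construction, matching the first bullet on the slide.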

Page 10:

Boltzmann machine

• Symmetric weights
• Each output is 0 or 1
• Includes units that are neither input units nor output units
• Stochastic g, i.e. some probability (as a fn of $in_i$) that g=1

State transitions that resemble simulated annealing.

Approximates the configuration that best meets the training set.
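
A stochastic activation function of this kind might look like the sketch below. The slide only says the probability is some function of the unit's input; the logistic form with a temperature parameter used here is a common choice and is an assumption, as are the function names.

```python
import math
import random


def prob_on(in_i, temperature=1.0):
    """Assumed form: P(g = 1) = 1 / (1 + exp(-in_i / T))."""
    return 1.0 / (1.0 + math.exp(-in_i / temperature))


def stochastic_unit(in_i, temperature, rng):
    """Output 1 with probability prob_on(in_i, temperature), else 0."""
    return 1 if rng.random() < prob_on(in_i, temperature) else 0


rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [stochastic_unit(2.0, 1.0, rng) for _ in range(1000)]
fraction_on = sum(samples) / len(samples)
```

Under this assumed form, a high temperature pushes the probability toward 0.5 (nearly random flips) and a low temperature makes the unit behave almost like a step function, which is where the resemblance to simulated annealing comes from.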

Page 11:

Learning in ANNs is the process of tuning the weights

Form of nonlinear regression.

Page 12:

ANN topology

Representation capability vs. overfitting risk.

A feed-forward net with one hidden layer can approximate any continuous fn of the inputs.

With 2 hidden layers it can approximate any fn at all.

The #units needed in each layer may grow exponentially

Learning the topology

Hill-climbing vs. genetic algorithms vs. …
Removing vs. adding (nodes/connections).
Compare candidates via cross-validation.

Page 13:

Perceptrons

$$O = \mathrm{step}_0\Big(\sum_j W_j\, I_j\Big)$$

Majority fn Implementable with one output unit

Decision tree requires $O(2^n)$ nodes
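
As a sketch of why one output unit suffices for the majority function: give every input weight 1 and set the threshold to n/2. These particular values are not printed on the slide and are an assumption of this example, as is the restriction to odd n (which avoids ties).

```python
from itertools import product


def perceptron(weights, inputs, t):
    """Single threshold unit: 1 if the weighted sum reaches t, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0


def majority_perceptron(inputs):
    """Majority of n binary inputs with all weights 1 and threshold n/2 (odd n assumed)."""
    n = len(inputs)
    return perceptron([1] * n, inputs, n / 2)


# Compare against a direct count of ones vs. zeros on all 2^5 inputs.
n = 5
agree = all(
    majority_perceptron(x) == (1 if sum(x) > n - sum(x) else 0)
    for x in product([0, 1], repeat=n)
)
```

The unit needs only n weights, whereas the check above has to enumerate all 2^n input combinations.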

Page 14:

Representation capability of a perceptron

Every input can only affect the output in one direction, independent of other inputs.

E.g. unable to represent WillWait in the restaurant example.

Perceptrons can only represent linearly separable fns. For a given problem, does one know in advance whether it is linearly separable?
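
A small brute-force search can illustrate the limitation. XOR is used here as the standard textbook example of a function that is not linearly separable; it does not appear on the slides, and the grid of candidate weights and the helper names are assumptions of this sketch. Finding nothing on a grid is only an illustration, not a proof.

```python
from itertools import product


def perceptron(w1, w2, t, x1, x2):
    """Two-input threshold unit."""
    return 1 if w1 * x1 + w2 * x2 >= t else 0


def representable(truth_table, grid):
    """Return the first (w1, w2, t) on the grid that reproduces the table, or None."""
    for w1, w2, t in product(grid, repeat=3):
        if all(perceptron(w1, w2, t, x1, x2) == y for (x1, x2), y in truth_table.items()):
            return (w1, w2, t)
    return None


GRID = [k / 2 for k in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_solution = representable(AND, GRID)  # some valid setting exists on the grid
xor_solution = representable(XOR, GRID)  # no setting works for XOR
```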

Page 15:

Linear separability in 3D

Minority Function

Page 16:

Learning linearly separable functions

Training examples used over and over! (one pass through the examples = one epoch)

Err = T - O

$$W_j \leftarrow W_j + \alpha \cdot I_j \cdot Err$$

Variant of perceptron learning rule.

Thrm. Will learn the linearly separable target fn. (if $\alpha$ is not too high)

Intuition: gradient descent in a search space with no local optima
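
The learning rule on this slide can be sketched as a short training loop. This is an illustrative sketch: the AND training set, the learning rate of 0.5, the zero initial weights, and the function names are all assumptions, and the threshold is handled as a weight on a fixed input of -1 as introduced earlier in the deck.

```python
def step0(x):
    return 1 if x >= 0 else 0


def predict(W, I):
    """O = step_0(sum_j W_j * I_j), with I_0 = -1 carrying the threshold."""
    x = [-1] + list(I)
    return step0(sum(w * xi for w, xi in zip(W, x)))


def train_perceptron(examples, alpha=0.5, max_epochs=100):
    """Apply W_j <- W_j + alpha * I_j * Err, reusing the examples every epoch."""
    n = len(examples[0][0])
    W = [0.0] * (n + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for I, T in examples:
            err = T - predict(W, I)
            if err != 0:
                mistakes += 1
                x = [-1] + list(I)
                W = [w + alpha * xi * err for w, xi in zip(W, x)]
        if mistakes == 0:  # a full epoch with no errors: done
            break
    return W


AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
W = train_perceptron(AND_EXAMPLES)
learned = all(predict(W, I) == T for I, T in AND_EXAMPLES)
```

Several epochs are needed even for this four-example problem, which is why the same training examples are presented repeatedly.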

Page 17:

Encoding for ANNs

E.g. #patrons can be none, some or full

Local encoding: None=0.0, Some=0.5, Full=1.0

Distributed encoding:
None 1 0 0
Some 0 1 0
Full 0 0 1
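
The two encodings can be written as small helper functions. The names below are hypothetical; the values are the ones given on the slide.

```python
LEVELS = ["None", "Some", "Full"]


def local_encoding(value):
    """One input unit: None=0.0, Some=0.5, Full=1.0."""
    return LEVELS.index(value) / (len(LEVELS) - 1)


def distributed_encoding(value):
    """One input unit per possible value; exactly one of them is on."""
    return [1 if level == value else 0 for level in LEVELS]


for level in LEVELS:
    print(level, local_encoding(level), distributed_encoding(level))
```

The local encoding uses a single input but imposes an ordering and spacing on the values; the distributed encoding needs three inputs but makes no such assumption.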

Page 18:

Majority Function

Page 19:

WillWait

Page 20:

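Multilayer feedforward networks

Structural credit assignment problem

Back propagation algorithm (again, $Err_i = T_i - O_i$)

Updating between hidden & output units:

$$W_{j,i} \leftarrow W_{j,i} + \alpha \cdot a_j \cdot Err_i \cdot g'(in_i)$$

Updating between input & hidden units:

$$W_{k,j} \leftarrow W_{k,j} + \alpha \cdot I_k \cdot Err_j \cdot g'(in_j)$$

where the error is back propagated as

$$Err_j = \sum_i W_{j,i} \cdot Err_i \cdot g'(in_i)$$

Back propagation of the error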
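
The two update rules on this slide can be sketched for a network with one hidden layer. This is a minimal sketch under stated assumptions: g is taken to be the sigmoid, there are no threshold weights, and the initial weights, learning rate, and single training example are made up for illustration. The function and variable names are hypothetical.

```python
import math


def g(x):
    """Sigmoid activation (an assumption; the slide leaves g unspecified)."""
    return 1.0 / (1.0 + math.exp(-x))


def g_prime(x):
    s = g(x)
    return s * (1.0 - s)


def forward(W_kj, W_ji, I):
    """W_kj[k][j]: input k -> hidden j.  W_ji[j][i]: hidden j -> output i."""
    in_j = [sum(W_kj[k][j] * I[k] for k in range(len(I))) for j in range(len(W_ji))]
    a = [g(x) for x in in_j]
    in_i = [sum(W_ji[j][i] * a[j] for j in range(len(a))) for i in range(len(W_ji[0]))]
    O = [g(x) for x in in_i]
    return in_j, a, in_i, O


def backprop_step(W_kj, W_ji, I, T, alpha):
    """One update on one example, modifying the weight lists in place."""
    in_j, a, in_i, O = forward(W_kj, W_ji, I)
    Err_i = [T[i] - O[i] for i in range(len(O))]
    # Err_j = sum_i W_ji * Err_i * g'(in_i), computed before W_ji changes.
    Err_j = [
        sum(W_ji[j][i] * Err_i[i] * g_prime(in_i[i]) for i in range(len(O)))
        for j in range(len(a))
    ]
    for j in range(len(a)):
        for i in range(len(O)):
            W_ji[j][i] += alpha * a[j] * Err_i[i] * g_prime(in_i[i])
    for k in range(len(I)):
        for j in range(len(a)):
            W_kj[k][j] += alpha * I[k] * Err_j[j] * g_prime(in_j[j])


def squared_error(W_kj, W_ji, I, T):
    O = forward(W_kj, W_ji, I)[3]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(T, O))


# Made-up 2-2-1 network and a single training example.
W_kj = [[0.1, -0.2], [0.3, 0.4]]
W_ji = [[0.5], [-0.6]]
I, T = [1.0, 0.0], [1.0]

error_before = squared_error(W_kj, W_ji, I, T)
for _ in range(50):
    backprop_step(W_kj, W_ji, I, T, alpha=0.5)
error_after = squared_error(W_kj, W_ji, I, T)
```

Each weight update uses only quantities available at the two units it connects, plus the error term passed back from the layer above.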

Page 21:

Back propagation (BP) as gradient descent search

A way of localizing the computation of the gradient to units.

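$$E = \frac{1}{2} \sum_i (T_i - O_i)^2$$

$$E(W) = \frac{1}{2} \sum_i \Big(T_i - g\Big(\sum_j W_{j,i}\, a_j\Big)\Big)^2 = \frac{1}{2} \sum_i \Big(T_i - g\Big(\sum_j W_{j,i}\, g\Big(\sum_k W_{k,j}\, I_k\Big)\Big)\Big)^2$$

$$\frac{\partial E}{\partial W_{j,i}} = -a_j\,(T_i - O_i)\, g'\Big(\sum_{j'} W_{j',i}\, a_{j'}\Big) = -a_j\,(T_i - O_i)\, g'(in_i)$$

For hidden units we get

$$\frac{\partial E}{\partial W_{k,j}} = -I_k\, g'(in_j) \sum_i W_{j,i}\, Err_i\, g'(in_i) = -I_k\, g'(in_j)\, Err_j$$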

Page 22:

Observations on BP as gradient descent

1. Minimize error ⇒ move in opposite direction of gradient

2. g needs to be differentiable ⇒ cannot use sign fn or step fn. Use e.g. sigmoid, for which g' = g(1-g)

3. Gradient taken wrt. one training example at a time
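
The sigmoid identity in observation 2 can be checked numerically against a finite-difference estimate of the derivative. The step size and test points below are arbitrary choices for this sketch.

```python
import math


def g(x):
    """Sigmoid: g(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def g_prime(x):
    """Closed form from the slide: g' = g * (1 - g)."""
    return g(x) * (1.0 - g(x))


def numeric_derivative(f, x, h=1e-5):
    """Central finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


max_gap = max(
    abs(g_prime(x) - numeric_derivative(g, x)) for x in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
```

Because g' is expressed in terms of g itself, the forward-pass activation can be reused when computing the weight updates.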

Page 23:

ANN learning curve

WillWait problem

Page 24:

WillWait Problem

Page 25:

Expressiveness of BP

$2^n/n$ hidden units needed to represent arbitrary Boolean fns of n inputs.

(such a network has $O(2^n)$ weights, and we need at least $2^n$ bits to represent a Boolean fn)
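
The counting argument can be made concrete for small n: a Boolean function of n inputs is a truth table with 2^n rows, so there are 2^(2^n) distinct functions, and picking one out takes 2^n bits. The helper names below are hypothetical, and the enumeration is only feasible for very small n.

```python
from itertools import product


def truth_table_bits(n):
    """Rows in the truth table of an n-input Boolean function."""
    return 2 ** n


def count_boolean_functions(n):
    """Count functions by enumerating every possible output column."""
    return sum(1 for _ in product([0, 1], repeat=truth_table_bits(n)))


def hidden_units_bound(n):
    """The 2^n / n figure quoted on the slide."""
    return 2 ** n / n


for n in (2, 3, 4):
    print(n, truth_table_bits(n), count_boolean_functions(n), hidden_units_bound(n))
```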

Thrm. Any continuous fn $f: [0,1]^n \to \mathbb{R}^m$ can be implemented in a 3-layer network with $2n+1$ hidden units. (activation fns take special form) [Kolmogorov]

Page 26:

Efficiency of BP

Using is fast

Training is slow

Epoch takes $O(m|w|)$ time (m training examples, |w| weights)

May need exponentially many epochs in #inputs
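
To get a feel for the per-epoch cost, the sketch below counts the weights of a fully connected layered network and multiplies by the number of examples. It assumes m is the number of training examples and |w| the number of weights; the layer sizes and example count are hypothetical, and threshold weights are ignored.

```python
def count_weights(layer_sizes):
    """Weights in a fully connected feed-forward net, one layer to the next."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def weight_updates_per_epoch(m, layer_sizes):
    """Each of the m examples touches every weight once per epoch."""
    return m * count_weights(layer_sizes)


# Hypothetical sizes: 10 inputs, 4 hidden units, 1 output, 100 examples.
layers = [10, 4, 1]
total_weights = count_weights(layers)
updates = weight_updates_per_epoch(100, layers)
```

A single epoch is therefore cheap; the cost of training comes from the number of epochs that may be needed.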

Page 27:

More on BP…

Generalization:
Good on fns where output varies smoothly with input

Sensitivity to noise:
Very tolerant of noise
Does not give a degree of certainty in the output

Transparency:
Black box

Prior knowledge:
Hard to “prime”

No convergence guarantees

Page 28:

Summary of representation capabilities (model class) of different supervised learning methods

3-layer feedforward ANN
Decision Tree
Perceptron
K-Nearest neighbor
Version space