ICS 273A UC Irvine
Instructor: Max Welling
Neural Networks
Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built a first abstract model of a neuron.
y = g( Σ_i W_i x_i + b )

where the x_i are the inputs, y is the output, the W_i are the weights, b is the bias, and g(·) is the activation function.
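As a minimal sketch of this unit in code (assuming a sigmoid for the unspecified activation g; the names are illustrative):

import numpy as np

def sigmoid(a):
    """A common choice for the activation function g."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, W, b):
    """Weighted sum of the inputs plus a bias, passed through g."""
    return sigmoid(np.dot(W, x) + b)

# Example: a neuron with three inputs
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.8, 0.2, -0.4])
b = 0.1
print(neuron(x, W, b))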
Neurons
• We have about 10^11 neurons, each one connected to about 10^4 other neurons on average.
• Each neuron needs at least 10^-3 seconds to transmit the signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about 10^-1 sec.
• Computers have much faster switching times: about 10^-10 sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.
Classification & Regression
• Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y).
• If Y is continuous: regression; if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions.
• This is just a perceptron with a quadratic cost function.
error = Σ_{n=1..N} Σ_{i=1..d_out} ( y_in − Σ_{j=1..d_in} W_ij x_jn − b_i )²

where x_jn is input j of data-case n, y_in is target output i of data-case n, and f_i(x) = Σ_j W_ij x_j + b_i is the model's prediction.
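A small sketch of this error computed in code (assuming the linear single-layer model above; names are illustrative):

import numpy as np

def quadratic_error(X, Y, W, b):
    """Sum-of-squared-errors of a single-layer (linear) network.

    X: (N, d_in) inputs, Y: (N, d_out) targets,
    W: (d_out, d_in) weights, b: (d_out,) biases.
    """
    F = X @ W.T + b              # f_i(x_n) for every data-case n
    return np.sum((Y - F) ** 2)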
Optimization
• We use stochastic gradient descent: pick a single data-item, compute the contribution of that data-point to the overall gradient, and update the weights (a code sketch of this procedure is given below).
Stochastic Gradient Descent
[Figure: stochastic updates vs. full updates (averaged over all data-items).]
• Stochastic gradient descent does not converge to the minimum, but "dances" around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
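A minimal sketch of one stochastic pass over the data, assuming the single-layer quadratic-error model from above (the step size and names are illustrative):

import numpy as np

def sgd_epoch(X, Y, W, b, step_size=0.01):
    """One stochastic-gradient-descent pass: each data-item updates the weights immediately."""
    for n in np.random.permutation(len(X)):
        x, y = X[n], Y[n]
        delta = y - (W @ x + b)                 # prediction error on this single item
        W += step_size * np.outer(delta, x)     # gradient step for the weights
        b += step_size * delta                  # gradient step for the biases
    return W, b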
Multi-Layer Nets
• Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer.
• In fact, a NN with one hidden layer can approximate any Boolean function and any continuous function.
ŷ_i = g( Σ_j W3_ij h2_j + b3_i )
h2_i = g( Σ_j W2_ij h1_j + b2_i )
h1_i = g( Σ_j W1_ij x_j + b1_i )

The layers are stacked: input x, hidden layers h1 and h2, and output ŷ, connected by the weights and biases (W1, b1), (W2, b2), (W3, b3).
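A sketch of this forward computation (sigmoid activation assumed; the parameter layout is illustrative):

import numpy as np

def forward(x, W1, b1, W2, b2, W3, b3):
    """Forward pass through the two-hidden-layer net: x -> h1 -> h2 -> y_hat."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return h1, h2, y_hat      # the hidden activations are needed later for back-propagation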
Back-propagation
• How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder to compute!
[Figure: the layered network x → h1 → h2 → y with weights and biases (W1, b1), (W2, b2), (W3, b3).]
Back Propagation
[Figure: the same network traversed twice: an upward pass computing the activations and a downward pass propagating the errors.]
δ2_jn = h2_jn (1 − h2_jn) Σ_{i ∈ upstream} W3_ij δ3_in
δ1_kn = h1_kn (1 − h1_kn) Σ_{j ∈ upstream} W2_jk δ2_jn
Back Propagation
Using these deltas, the weights and biases are updated (shown here for layer 2, with step-size η):

W2_jk ← W2_jk + η δ2_jn h1_kn
b2_j ← b2_j + η δ2_jn
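Putting the two passes together, a sketch of one stochastic back-propagation step (sigmoid units and the quadratic error assumed; eta is the step size, all names illustrative):

import numpy as np

def backprop_step(x, y, W1, b1, W2, b2, W3, b3, eta=0.1):
    """One stochastic back-propagation update for the two-hidden-layer net."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # Upward pass: compute the activations.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)

    # Downward pass: propagate the deltas.
    d3 = y_hat * (1 - y_hat) * (y - y_hat)   # output delta for the quadratic error
    d2 = h2 * (1 - h2) * (W3.T @ d3)
    d1 = h1 * (1 - h1) * (W2.T @ d2)

    # Updates as on the slide: W <- W + eta * delta * h (and similarly for b).
    W3 += eta * np.outer(d3, h2); b3 += eta * d3
    W2 += eta * np.outer(d2, h1); b2 += eta * d2
    W1 += eta * np.outer(d1, x);  b1 += eta * d1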
ALVINN
This hidden unit detects a mildly left-sloping road and advises steering to the left.
What would another hidden unit look like?
Learning to drive a car
Weight Decay
• NNs can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:
W2_jk ← W2_jk + η δ2_jn h1_kn − λ W2_jk
b2_j ← b2_j + η δ2_jn − λ b2_j

• Now we need to set λ using cross-validation.
• This is called "weight decay" in NN jargon.
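As a small sketch of the modified update for layer 2 (lam plays the role of λ; all names illustrative):

import numpy as np

def weight_decay_update(W2, b2, d2, h1, eta=0.1, lam=1e-4):
    """Backprop update for layer 2 with an extra weight-decay term."""
    W2 += eta * np.outer(d2, h1) - lam * W2
    b2 += eta * d2 - lam * b2
    return W2, b2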
Momentum
• At the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It’s like an adaptive stepsize.
• This idea is easily implemented by changing the gradient as follows:
ΔW2_jk(new) = η δ2_jn h1_kn + γ ΔW2_jk(old)
W2_jk ← W2_jk + ΔW2_jk(new)

(and similarly for the biases; γ is the momentum parameter)
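A sketch of the momentum update for layer 2 (gamma plays the role of γ; names illustrative):

import numpy as np

def momentum_update(W2, dW2_old, d2, h1, eta=0.1, gamma=0.9):
    """Layer-2 weight update with momentum; part of the previous change is carried along."""
    dW2_new = eta * np.outer(d2, h1) + gamma * dW2_old
    W2 += dW2_new
    return W2, dW2_new    # dW2_new is kept for the next step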
Conclusion
• NNs are a flexible way to model input/output functions.
• They are robust against noisy data
• The results are hard to interpret (unlike decision trees).
• Learning is fast on large datasets when using stochastic gradient descent plus momentum.
• Local minima in the optimization are a problem.
• Overfitting can be avoided using weight decay or early stopping
• There are also NNs that feed information back (recurrent NNs).
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...