ICS 273A UC Irvine
Instructor: Max Welling
Neural Networks
Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built a first abstract model of a neuron.
y = g( Σ_i W_i x_i + b )

where the x_i are the inputs, y is the output, the W_i are the weights, b is the bias, and g(·) is the activation function.
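As a minimal sketch of this unit in code (assuming a sigmoid for the unspecified activation g; the names are illustrative):

import numpy as np

def sigmoid(a):
    """A common choice for the activation function g."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, W, b):
    """Weighted sum of the inputs plus a bias, passed through g."""
    return sigmoid(np.dot(W, x) + b)

# Example: a neuron with three inputs
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.8, 0.2, -0.4])
b = 0.1
print(neuron(x, W, b))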
Neurons
• We have about 10^11 neurons, each one connected to about 10^4 other neurons on average.
• Each neuron needs at least 10^-3 seconds to transmit the signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about 10^-1 sec.
• Computers have much faster switching times: about 10^-10 sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.
Classification & Regression
• Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y).
• If Y is continuous: regression; if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions.
• This is just a perceptron with a quadratic cost function.
error = Σ_{n=1..N} Σ_{i=1..d_out} ( y_in − Σ_{j=1..d_in} W_ij x_jn − b_i )²

where x_jn is input j of data-case n, y_in is target output i of data-case n, and f_i(x) = Σ_j W_ij x_j + b_i is the model's prediction.
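A small sketch of this error computed in code (assuming the linear single-layer model above; names are illustrative):

import numpy as np

def quadratic_error(X, Y, W, b):
    """Sum-of-squared-errors of a single-layer (linear) network.

    X: (N, d_in) inputs, Y: (N, d_out) targets,
    W: (d_out, d_in) weights, b: (d_out,) biases.
    """
    F = X @ W.T + b              # f_i(x_n) for every data-case n
    return np.sum((Y - F) ** 2)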
Optimization
• We use stochastic gradient descent: pick a single data-item, compute the contribution of that data-point to the overall gradient, and update the weights (a code sketch of this procedure is given below).
Stochastic Gradient Descent
[Figure: stochastic updates vs. full updates (averaged over all data-items).]
• Stochastic gradient descent does not converge to the minimum, but "dances" around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
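A minimal sketch of one stochastic pass over the data, assuming the single-layer quadratic-error model from above (the step size and names are illustrative):

import numpy as np

def sgd_epoch(X, Y, W, b, step_size=0.01):
    """One stochastic-gradient-descent pass: each data-item updates the weights immediately."""
    for n in np.random.permutation(len(X)):
        x, y = X[n], Y[n]
        delta = y - (W @ x + b)                 # prediction error on this single item
        W += step_size * np.outer(delta, x)     # gradient step for the weights
        b += step_size * delta                  # gradient step for the biases
    return W, b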
Multi-Layer Nets
• Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer.
• In fact, a NN with one hidden layer can approximate any Boolean function and any continuous function.
ŷ_i = g( Σ_j W3_ij h2_j + b3_i )
h2_i = g( Σ_j W2_ij h1_j + b2_i )
h1_i = g( Σ_j W1_ij x_j + b1_i )

The layers are stacked: input x, hidden layers h1 and h2, and output ŷ, connected by the weights and biases (W1, b1), (W2, b2), (W3, b3).
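A sketch of this forward computation (sigmoid activation assumed; the parameter layout is illustrative):

import numpy as np

def forward(x, W1, b1, W2, b2, W3, b3):
    """Forward pass through the two-hidden-layer net: x -> h1 -> h2 -> y_hat."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return h1, h2, y_hat      # the hidden activations are needed later for back-propagation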
Back-propagation
• How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder to compute!
[Figure: the layered network x → h1 → h2 → y with weights and biases (W1, b1), (W2, b2), (W3, b3).]
Back Propagation
[Figure: the same network traversed twice: an upward pass computing the activations and a downward pass propagating the errors.]
δ2_jn = h2_jn (1 − h2_jn) Σ_{i ∈ upstream} W3_ij δ3_in
δ1_kn = h1_kn (1 − h1_kn) Σ_{j ∈ upstream} W2_jk δ2_jn
Back Propagation
Using these deltas, the weights and biases are updated (shown here for layer 2, with step-size η):

W2_jk ← W2_jk + η δ2_jn h1_kn
b2_j ← b2_j + η δ2_jn
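Putting the two passes together, a sketch of one stochastic back-propagation step (sigmoid units and the quadratic error assumed; eta is the step size, all names illustrative):

import numpy as np

def backprop_step(x, y, W1, b1, W2, b2, W3, b3, eta=0.1):
    """One stochastic back-propagation update for the two-hidden-layer net."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # Upward pass: compute the activations.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)

    # Downward pass: propagate the deltas.
    d3 = y_hat * (1 - y_hat) * (y - y_hat)   # output delta for the quadratic error
    d2 = h2 * (1 - h2) * (W3.T @ d3)
    d1 = h1 * (1 - h1) * (W2.T @ d2)

    # Updates as on the slide: W <- W + eta * delta * h (and similarly for b).
    W3 += eta * np.outer(d3, h2); b3 += eta * d3
    W2 += eta * np.outer(d2, h1); b2 += eta * d2
    W1 += eta * np.outer(d1, x);  b1 += eta * d1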
ALVINN
This hidden unit detects a mildly left-sloping road and advises steering to the left.
What would another hidden unit look like?
Learning to drive a car
Weight Decay
• NNs can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:
W2_jk ← W2_jk + η δ2_jn h1_kn − λ W2_jk
b2_j ← b2_j + η δ2_jn − λ b2_j

• Now we need to set λ using cross-validation.
• This is called "weight decay" in NN jargon.
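As a small sketch of the modified update for layer 2 (lam plays the role of λ; all names illustrative):

import numpy as np

def weight_decay_update(W2, b2, d2, h1, eta=0.1, lam=1e-4):
    """Backprop update for layer 2 with an extra weight-decay term."""
    W2 += eta * np.outer(d2, h1) - lam * W2
    b2 += eta * d2 - lam * b2
    return W2, b2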
Momentum
• At the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It’s like an adaptive stepsize.
• This idea is easily implemented by changing the gradient as follows:
ΔW2_jk(new) = η δ2_jn h1_kn + γ ΔW2_jk(old)
W2_jk ← W2_jk + ΔW2_jk(new)

(and similarly for the biases; γ is the momentum parameter)
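A sketch of the momentum update for layer 2 (gamma plays the role of γ; names illustrative):

import numpy as np

def momentum_update(W2, dW2_old, d2, h1, eta=0.1, gamma=0.9):
    """Layer-2 weight update with momentum; part of the previous change is carried along."""
    dW2_new = eta * np.outer(d2, h1) + gamma * dW2_old
    W2 += dW2_new
    return W2, dW2_new    # dW2_new is kept for the next step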
Conclusion
• NNs are a flexible way to model input/output functions.
• They are robust against noisy data
• The results are hard to interpret (unlike decision trees).
• Learning is fast on large datasets when using stochastic gradient descent plus momentum.
• Local minima in the optimization are a problem.
• Overfitting can be avoided using weight decay or early stopping
• There are also NNs that feed information back (recurrent NNs).
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...