13 Neural Networks - Virginia Tech
Neural Networks
Intro to AI Bert Huang
Virginia Tech
Outline
• Biological inspiration for artificial neural networks
• Linear vs. nonlinear functions
• Learning with neural networks: back propagation
https://en.wikipedia.org/wiki/Neuron#/media/File:Chemical_synapse_schema_cropped.jpg
https://en.wikipedia.org/wiki/Neuron#/media/File:GFPneuron.png
Parameterizing p(y|x)

$$p(y \mid x) := f(x), \qquad f : \mathbb{R}^d \to [0, 1]$$

$$f(x) := \frac{1}{1 + \exp(-w^\top x)}$$
Logistic Function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

$$\lim_{x \to -\infty} \sigma(x) = \lim_{x \to -\infty} \frac{1}{1 + \exp(-x)} = 0.0$$

$$\sigma(0) = \frac{1}{1 + \exp(-0)} = \frac{1}{1 + 1} = 0.5$$

$$\lim_{x \to \infty} \sigma(x) = \lim_{x \to \infty} \frac{1}{1 + \exp(-x)} = \frac{1}{1} = 1.0$$
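The three values above (the two limits and the midpoint) can be checked numerically. This is a minimal sketch, not course-provided code:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The three cases from the slide:
print(sigmoid(-50))  # very close to 0.0: large negative inputs saturate near 0
print(sigmoid(0))    # exactly 0.5, since 1 / (1 + exp(0)) = 1/2
print(sigmoid(50))   # very close to 1.0: large positive inputs saturate near 1
```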
From Features to Probability

$$f(x) := \frac{1}{1 + \exp(-w^\top x)}$$

Parameterizing p(y|x): $p(y \mid x) := f(x)$, with $f : \mathbb{R}^d \to [0, 1]$.

[Diagram: inputs $x_1, \dots, x_5$ weighted by $w_1, \dots, w_5$ feeding a single output unit $y$ — logistic regression drawn as a one-layer network.]
Multi-Layered Perceptron

[Diagram: inputs $x_1, \dots, x_5$ feed hidden units $h_1, h_2$, which feed the output $y$. The layers are annotated raw data → representation → prediction; e.g., pixel values → shapes (round, shadows) → faces.]
Multi-Layered Perceptron

$$h = [h_1, h_2]^\top, \qquad h_1 = \sigma(w_{11}^\top x), \qquad h_2 = \sigma(w_{12}^\top x)$$

$$p(y \mid x) = \sigma(w_{21}^\top h) = \sigma\!\left(w_{21}^\top \left[\sigma(w_{11}^\top x),\; \sigma(w_{12}^\top x)\right]^\top\right)$$
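The two-hidden-unit network can be sketched as a forward pass in a few lines. The weight values here are arbitrary placeholders for illustration, not from the lecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def mlp_forward(x, w11, w12, w21):
    # Hidden layer: each unit is itself a logistic regression on x.
    h = [sigmoid(dot(w11, x)), sigmoid(dot(w12, x))]
    # Output layer: a logistic regression on the hidden representation h.
    return sigmoid(dot(w21, h))

# Arbitrary example weights and input (hypothetical values)
x = [1.0, -2.0]
p = mlp_forward(x, w11=[0.5, 0.3], w12=[-0.4, 0.8], w21=[1.2, -0.7])
print(p)  # a probability strictly between 0 and 1
```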
Decision Surface: Logistic Regression
Decision Surface: 2-Layer, 2 Hidden Units
Decision Surface: 2-Layer, More Hidden Units
[Two panels: 3 hidden units; 10 hidden units]
Decision Surface: More Layers, More Hidden Units
[Two panels: 4 layers with 10 hidden units each; 10 layers with 5 hidden units per layer]
Training Neural Networks

Minimize the average training error:

$$\min_W \; L(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i, W), y_i)$$

$$W_j \leftarrow W_j - \alpha \left( \frac{\partial L}{\partial W_j} \right)$$
Gradient Descent

$$W_j \leftarrow W_j - \alpha \left( \frac{\partial L}{\partial W_j} \right)$$

[Plot: loss $L(W)$ evaluated at three points $W_1, W_2, W_3$ — where the gradient is very positive, take a big step left; where it is almost zero, take a tiny step left; where it is slightly negative, take a medium step right.]
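The update rule above can be sketched on a one-dimensional loss. This is a minimal illustration with a made-up quadratic loss, not code from the course:

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply W_j <- W_j - alpha * (dL/dW_j).

    The step is proportional to the gradient: a large positive gradient
    produces a big step left, a near-zero gradient a tiny step.
    """
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # converges very close to 3.0
```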
Approximate Q-Learning

$$\hat{Q}(s, a) := g(s, a, \theta) := \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \dots + \theta_d f_d(s, a)$$

$$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a) \right) \frac{\partial g}{\partial \theta_i}$$

$$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a) \right) f_i(s, a)$$
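A sketch of this linear-approximation update, assuming hand-picked feature vectors (the features, rewards, and step sizes below are invented for illustration):

```python
def q_hat(theta, feats):
    """Q-hat(s, a) = theta_1 * f_1(s, a) + ... + theta_d * f_d(s, a)."""
    return sum(t * f for t, f in zip(theta, feats))

def q_update(theta, feats_sa, reward, next_feats_list, alpha=0.1, gamma=0.9):
    """One approximate Q-learning step:
    theta_i += alpha * (R(s) + gamma * max_a' Q-hat(s', a') - Q-hat(s, a)) * f_i(s, a).
    next_feats_list holds one feature vector per action available in s'.
    """
    max_next = max((q_hat(theta, f) for f in next_feats_list), default=0.0)
    td_error = reward + gamma * max_next - q_hat(theta, feats_sa)
    return [t + alpha * td_error * f for t, f in zip(theta, feats_sa)]

theta = [0.0, 0.0]
# Terminal transition (no actions in s'), so td_error = reward = 1.0
theta = q_update(theta, feats_sa=[1.0, 0.0], reward=1.0, next_feats_list=[])
print(theta)  # [0.1, 0.0]
```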
Back Propagation

• Back propagation:
• Compute hidden unit activations: forward propagation
• Compute gradient at output layer: error
• Propagate error back one layer at a time
• Chain rule via dynamic programming
Chain Rule Review

$$\frac{d\, f(g(x))}{d\, x} = \frac{d\, f(g(x))}{d\, g(x)} \cdot \frac{d\, g(x)}{d\, x}$$

[Diagram: computation graph $x \to g(x) \to f(g(x))$.]
Chain Rule on a More Complex Function: $h(f(a) + g(b))$

$$\frac{d\, h(f(a) + g(b))}{d\, a} = \frac{d\, h(f(a) + g(b))}{d\, (f(a) + g(b))} \cdot \frac{d\, f(a)}{d\, a}$$

$$\frac{d\, h(f(a) + g(b))}{d\, b} = \frac{d\, h(f(a) + g(b))}{d\, (f(a) + g(b))} \cdot \frac{d\, g(b)}{d\, b}$$
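A quick numerical sanity check of this chain rule, using arbitrarily chosen component functions ($h = \exp$, $f(a) = a^2$, $g(b) = \sin b$ are illustrative picks, not from the lecture):

```python
import math

# h(f(a) + g(b)) with h = exp, f(a) = a**2, g(b) = sin(b)
def composite(a, b):
    return math.exp(a**2 + math.sin(b))

# Chain rule: d/da = h'(f(a) + g(b)) * f'(a), and h' = h for exp
def d_da(a, b):
    return math.exp(a**2 + math.sin(b)) * (2 * a)

# Central finite difference should agree with the analytic derivative
a, b, eps = 0.7, 1.3, 1e-6
numeric = (composite(a + eps, b) - composite(a - eps, b)) / (2 * eps)
print(abs(numeric - d_da(a, b)))  # tiny discrepancy
```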
Back to Neural Networks

[Diagram: raw input $x$ → hidden layer $h_1$ → … → hidden layer $h_{n-1}$ → final output $y$, with weights $w_1, \dots, w_n$ between the layers.]

$$h_1 = f(x, w_1), \quad \dots, \quad h_{n-1} = f(h_{n-2}, w_{n-1}), \quad y = f(h_{n-1}, w_n)$$

With loss $L(y)$:

$$\frac{d\, L}{d\, w_n} = \frac{d\, L}{d\, y} \cdot \frac{d\, y}{d\, w_n}$$
Intuition of Back Propagation

• At each layer, calculate how much changing the layer's input changes the final output (the derivative of the final output w.r.t. the layer's input)
• Calculate this directly for the last layer
• For preceding layers, use the calculation from the next layer and work backwards through the network
• Use that derivative to find how changing the weights affects the error of the final output
FYI: Matrix Form

$$h_1 = s(W_1 x), \quad h_2 = s(W_2 h_1), \quad \dots, \quad h_{m-1} = s(W_{m-1} h_{m-2}), \quad f(x, W) = s(W_m h_{m-1})$$

$$J(W) = \ell(f(x, W))$$

(You will not be tested on this matrix form in this course.)
FYI: Matrix Gradient Recipe

Feed-forward propagation:

$$h_1 = s(W_1 x), \quad h_i = s(W_i h_{i-1}), \quad f(x, W) = s(W_m h_{m-1}), \quad J(W) = \ell(f(x, W))$$

Back propagation:

$$\delta_m = \ell'(f(x, W)), \qquad \delta_i = (W_{i+1}^\top \delta_{i+1}) \odot s'(W_i h_{i-1})$$

$$\nabla_{W_m} J = \delta_m h_{m-1}^\top, \qquad \nabla_{W_i} J = \delta_i h_{i-1}^\top, \qquad \nabla_{W_1} J = \delta_1 x^\top$$

(You will not be tested on this matrix form in this course.)
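The recipe can be sketched directly in matrix code and verified against a finite difference. This is an illustrative implementation under stated assumptions, not the course's code: the activation $s$ is logistic (so $s'(W_i h_{i-1}) = h_i \odot (1 - h_i)$), and the loss is cross-entropy with a logistic output, for which the slide's $\ell'(f(x, W))$ term simplifies to $f - y$. Sizes and random weights are arbitrary:

```python
import numpy as np

def s(z):
    """Logistic activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Feed-forward: h_i = s(W_i h_{i-1}), keeping every activation."""
    hs = [x]
    for W in Ws:
        hs.append(s(W @ hs[-1]))
    return hs  # hs[0] = x, hs[-1] = f(x, W)

def backward(hs, Ws, y):
    """Matrix gradient recipe from the slide."""
    deltas = [None] * len(Ws)
    # Cross-entropy + logistic output: delta_m = f - y (the l'(f(x, W)) term)
    deltas[-1] = hs[-1] - y
    # delta_i = (W_{i+1}^T delta_{i+1}) * s'(W_i h_{i-1}), with s' = h_i (1 - h_i)
    for i in range(len(Ws) - 2, -1, -1):
        deltas[i] = (Ws[i + 1].T @ deltas[i + 1]) * hs[i + 1] * (1 - hs[i + 1])
    # grad_{W_i} J = delta_i h_{i-1}^T
    return [np.outer(d, h) for d, h in zip(deltas, hs[:-1])]

# Check one weight's gradient against a central finite difference
rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([1.0])
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]

def loss(Ws):
    f = forward(x, Ws)[-1]
    return float(-(y * np.log(f) + (1 - y) * np.log(1 - f)).sum())

grads = backward(forward(x, Ws), Ws, y)
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][1, 2] += eps
Wm = [W.copy() for W in Ws]; Wm[0][1, 2] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(numeric - grads[0][1, 2]))  # should be tiny
```

The finite-difference comparison is the standard way to catch indexing or sign mistakes in a hand-written backward pass.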
Other New Aspects of Deep Learning
• GPU computation
• Differentiable programming
• Automatic differentiation
• Neural network structures
Types of Neural Network Structures
• Feed-forward
• Recurrent neural networks (RNNs)
• Good for analyzing sequences (text, time series)
• Convolutional neural networks (convnets, CNNs)
• Good for analyzing spatial data (images, videos)
Outline
• Biological inspiration for artificial neural networks
• Linear vs. nonlinear functions
• Learning with neural networks: back propagation