FUNCTION LEARNING AND NEURAL NETS
Deep Learning and Neural Nets, Spring 2015
SETTING
Learn a function with:
- Continuous-valued examples, e.g., the pixels of an image
- Continuous-valued output, e.g., the likelihood that the image is a '7'
This is known as regression. [Regression can be turned into classification via thresholds.]
FUNCTION-LEARNING (REGRESSION) FORMULATION
Goal function f
Training set: (x(i), y(i)), i = 1,…,n, with y(i) = f(x(i))
Inductive inference: find a function h that fits the points well
Same Keep-It-Simple bias
[Figure: training points plotted as (x, f(x))]
LEAST-SQUARES FITTING
Hypothesize a class of functions f(x,θ) parameterized by θ
Minimize the squared loss E(θ) = Σi (f(x(i),θ) - y(i))²
[Figure: data points (x, f(x)) and a candidate fit]
LINEAR LEAST-SQUARES
f(x,θ) = x ∙ θ
The value of θ that optimizes E(θ) is θ = [Σi x(i) y(i)] / [Σi x(i) x(i)]:
E(θ) = Σi (x(i) θ - y(i))² = Σi (x(i)² θ² - 2 x(i) y(i) θ + y(i)²)
E'(θ) = Σi (2 x(i)² θ - 2 x(i) y(i)) = 0  =>  θ = [Σi x(i) y(i)] / [Σi x(i) x(i)]
[Figure: data points (x, f(x)) and the fitted line f(x,θ)]
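As a quick illustration (not from the slides), here is a minimal NumPy sketch of this one-parameter fit; the toy data and the names xs, ys, theta are my own.

```python
import numpy as np

# Toy data: y roughly proportional to x (assumed for illustration)
xs = np.array([0.5, 1.0, 2.0, 3.0])
ys = np.array([1.1, 1.9, 4.2, 5.8])

# Closed-form minimizer of E(theta) = sum_i (x_i * theta - y_i)^2
theta = np.sum(xs * ys) / np.sum(xs * xs)
print(theta)  # slope of the best-fit line through the origin
```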
LINEAR LEAST-SQUARES WITH CONSTANT OFFSET
f(x,θ0,θ1) = θ0 + θ1 x
E(θ0,θ1) = Σi (θ0 + θ1 x(i) - y(i))²
         = Σi (θ0² + θ1² x(i)² + y(i)² + 2 θ0 θ1 x(i) - 2 θ0 y(i) - 2 θ1 x(i) y(i))
At the optimum, dE/dθ0(θ0*,θ1*) = 0 and dE/dθ1(θ0*,θ1*) = 0, so:
0 = 2 Σi (θ0* + θ1* x(i) - y(i))
0 = 2 Σi x(i) (θ0* + θ1* x(i) - y(i))
Verify the solution:
θ0* = (1/N) Σi (y(i) - θ1* x(i))
θ1* = [N (Σi x(i) y(i)) - (Σi x(i))(Σi y(i))] / [N (Σi x(i)²) - (Σi x(i))²]
[Figure: data points (x, f(x)) and the fitted line f(x,θ)]
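A small sketch of these two closed-form expressions (my own illustration; the data and the names xs, ys, theta0, theta1 are assumed):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.9, 3.1, 5.2, 6.8])
N = len(xs)

# theta1* and theta0* from the closed-form solution above
theta1 = (N * np.sum(xs * ys) - np.sum(xs) * np.sum(ys)) / (N * np.sum(xs**2) - np.sum(xs)**2)
theta0 = np.mean(ys - theta1 * xs)
print(theta0, theta1)  # should agree with np.polyfit(xs, ys, 1)[::-1]
```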
MULTI-DIMENSIONAL LEAST-SQUARES
Let x include attributes (x1,…,xN)
Let θ include coefficients (θ1,…,θN)
Model: f(x,θ) = x1 θ1 + … + xN θN
[Figure: data points (x, f(x)) and the fitted model f(x,θ)]
MULTI-DIMENSIONAL LEAST-SQUARES
f(x,θ) = x1 θ1 + … + xN θN
The best θ is given by θ = (A^T A)^-1 A^T b,
where A is the matrix with the x(i)'s as rows and b is the vector of y(i)'s.
[Figure: data points (x, f(x)) and the fitted model f(x,θ)]
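The normal-equations formula maps directly onto a couple of NumPy calls. A generic sketch (the toy data and names are mine); in practice np.linalg.lstsq is the more numerically stable route:

```python
import numpy as np

# A: one example x(i) per row; b: the corresponding y(i)'s (toy values)
A = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
b = np.array([3.1, 2.4, 4.0, 6.6])

theta = np.linalg.inv(A.T @ A) @ A.T @ b               # normal equations, as on the slide
theta_stable, *_ = np.linalg.lstsq(A, b, rcond=None)   # preferred in practice
print(theta, theta_stable)
```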
NONLINEAR LEAST-SQUARES
E.g., quadratic: f(x,θ) = θ0 + x θ1 + x² θ2
E.g., exponential: f(x,θ) = exp(θ0 + x θ1)
Any combination: f(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3
Fitting can be done using gradient descent
[Figure: linear, quadratic, and other fits to the same data points (x, f(x))]
ASIDE: FEATURE TRANSFORMS
Common model: a weighted sum of nonlinear functions f1(x),…,fN(x)
Linear in the feature space
Polynomial: g(x,θ) = θ0 + x θ1 + … + x^d θd
In general: g(x,θ) = f1(x) θ1 + … + fN(x) θN
The least-squares fit can be solved exactly by considering a transformed dataset (x',y) with x' = (f1(x),…,fN(x))
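For example, a polynomial fit reduces to linear least squares on the transformed inputs. A minimal sketch (my own; the data and the names degree, X_poly are assumed):

```python
import numpy as np

xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
ys = np.array([1.0, 1.3, 2.1, 3.4, 5.2])

degree = 2
# Transformed dataset: each x becomes x' = (1, x, x^2); the fit is then linear in theta
X_poly = np.vander(xs, degree + 1, increasing=True)
theta, *_ = np.linalg.lstsq(X_poly, ys, rcond=None)
print(theta)  # (theta0, theta1, theta2) of theta0 + theta1*x + theta2*x^2
```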
The gradient direction is orthogonal to the level sets (contours) of f and points in the direction of steepest increase.
[Figure: contour plot of f with the gradient direction drawn at a point]
Gradient descent: iteratively move in the direction of the negative gradient.
[Figure sequence: successive gradient descent steps traced on the contour plot]
GRADIENT DESCENT FOR LEAST SQUARES
Let θ = (θ1,…,θn)
Error: E(θ) = Σi (f(x(i),θ) - y(i))²
Take the gradient: ∇E(θ) = Σi 2 (f(x(i),θ) - y(i)) ∇θ f(x(i),θ)
Update rule: θ ← θ - α ∇E(θ)
Here (f(x(i),θ) - y(i)) is the error at example i, and ∇θ f(x(i),θ) is a vector in which element k indicates how quickly the prediction at example i changes with respect to a change in θk.
GRADIENT DESCENT EXAMPLE: LINEAR FITTING
f(x,θ) = x1 θ1 + … + xN θN, so ∇θ f(x(i),θ) = x(i)
Update rule: θ ← θ - α Σi 2 (x(i) ∙ θ - y(i)) x(i)
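A compact sketch of this batch update for the linear model (my own illustration; the data and the names alpha, n_steps are assumed):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])  # one example per row
y = np.array([3.1, 2.4, 4.0, 6.6])

theta = np.zeros(X.shape[1])
alpha, n_steps = 0.01, 500   # learning rate and iteration count (assumed values)

for _ in range(n_steps):
    residuals = X @ theta - y        # error at each example
    grad = 2 * X.T @ residuals       # gradient of the squared loss
    theta = theta - alpha * grad     # batch gradient descent step
print(theta)
```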
PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)
[Diagram: inputs x1,…,xn with weights wi feed a summation Σ followed by a threshold activation g, producing output y]
y = f(x,w) = g(Σi=1,…,n wi xi)
[Figure: positive and negative examples in the (x1, x2) plane separated by the line w1 x1 + w2 x2 = 0; plot of the threshold activation g(u) against u]
A SINGLE PERCEPTRON CAN LEARN
[Diagram: the same perceptron unit with inputs x1,…,xn, weights wi, summation Σ, activation g, output y]
- A disjunction of boolean literals, e.g., over x1, x2, x3
- The majority function
- XOR?
PERCEPTRON LEARNING RULE
θ ← θ + x(i) (y(i) - g(θᵀ x(i)))    (g outputs either 0 or 1; y is either 0 or 1)
- If the output is correct, the weights are unchanged
- If g is 0 but y is 1, the weight on attribute i is increased
- If g is 1 but y is 0, the weight on attribute i is decreased
Converges if the data is linearly separable, but oscillates otherwise
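A minimal sketch of this rule as a training loop (my own illustration, learning an OR-like function; step and n_epochs are assumed names, and a bias term is folded in as an always-1 input):

```python
import numpy as np

def step(u):
    """Threshold activation g: outputs 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0

# Training data for OR, with a constant 1 appended as a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])

theta = np.zeros(3)
n_epochs = 10
for _ in range(n_epochs):
    for x_i, y_i in zip(X, y):
        g_i = step(theta @ x_i)
        theta += x_i * (y_i - g_i)   # perceptron learning rule
print(theta)   # weights of a separating line for OR
```

Because OR is linearly separable, this loop settles on fixed weights; on XOR it would keep oscillating, as noted above.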
PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)
[Diagram: the same perceptron unit]
y = f(x,w) = g(Σi=1,…,n wi xi)
[Figure: positive and negative examples in the plane that no single line separates cleanly, with a '?' where the decision boundary should go; plot of the threshold activation g(u) against u]
UNIT (NEURON)
[Diagram: inputs x1,…,xn, weights wi, summation Σ, sigmoid activation g, output y]
y = g(Σi=1,…,n wi xi), where g(u) = 1/[1 + exp(-u)]
NEURAL NETWORK
A network of interconnected neurons
[Diagram: two connected units, each with inputs x1,…,xn, weights wi, summation Σ, activation g, and output y]
Acyclic (feed-forward) vs. recurrent networks
TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Diagram: inputs connect through weights w1j to a hidden layer, which connects through weights w2k to an output layer]
NETWORKS WITH HIDDEN LAYERS
Can represent XOR and other nonlinear functions
Common neuron types: linear, soft perceptron (sigmoid), radial basis functions
As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features
How to train hidden layers?
BACKPROPAGATION (PRINCIPLE)
New example: y(k) = f(x(k))
φ(k) = the NN prediction computed with weights w(k-1) for inputs x(k)
Error function: E(k)(w(k-1)) = (φ(k) - y(k))²
Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
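To make the layer-by-layer update concrete, here is a minimal backpropagation sketch for a two-layer sigmoid network on a single example (entirely my own illustration; W1, W2, alpha, and the layer sizes are assumed):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 2))   # input -> hidden weights (2 inputs, 3 hidden units)
W2 = rng.normal(scale=0.1, size=(1, 3))   # hidden -> output weights
alpha = 0.5                               # learning rate (assumed)

x = np.array([0.2, 0.8])                  # one training input x(k)
y = np.array([1.0])                       # its target y(k)

# Forward pass
h = sigmoid(W1 @ x)                       # hidden activations
phi = sigmoid(W2 @ h)                     # network prediction phi(k)

# Backward pass: last layer first, then the previous layer
delta_out = 2 * (phi - y) * phi * (1 - phi)     # dE/d(output pre-activation)
delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error propagated back to the hidden layer

W2 -= alpha * np.outer(delta_out, h)      # update the output-layer weights first
W1 -= alpha * np.outer(delta_hid, x)      # then the hidden-layer weights
```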
STOCHASTIC GRADIENT DESCENT
Gradient descent uses a batch update, because all examples are incorporated in each step
Stochastic gradient descent: use a single example on each step
Update rule: pick an example i (either at random or in order) and a step size α, then
θ ← θ + α x(i) (y(i) - g(x(i),θ))
This reduces the error on the i'th example… but does it converge?
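A rough sketch of the difference from the batch update, for the linear model g(x,θ) = x ∙ θ (my own illustration; alpha and the data are assumed):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])
y = np.array([3.1, 2.4, 4.0, 6.6])

theta = np.zeros(2)
alpha = 0.02
for epoch in range(50):
    for i in np.random.permutation(len(X)):   # pick examples in random order
        g_i = X[i] @ theta                     # prediction on example i only
        theta += alpha * X[i] * (y[i] - g_i)   # stochastic update
print(theta)
```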
UNDERSTANDING BACKPROPAGATION
Minimize E(θ) by gradient descent…
[Figure sequence: a curve E(θ) plotted against θ; the gradient of E is computed at the current point and a step proportional to the gradient is taken downhill]
LEARNING ALGORITHM
Given many examples (x(1),y(1)),…,(x(N),y(N)) and a learning rate
Init: set k = 1 (or k = rand(1,N))
Repeat:
- Tweak the weights with a backpropagation update on example (x(k), y(k))
- Set k = k+1 (or k = rand(1,N))
UNDERSTANDING BACKPROPAGATION
This is an example of stochastic gradient descent
Decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x(k),θ) - y(k))²
On each iteration, take a step to reduce ek
[Figure sequence: the curve E(θ) against θ; successive steps follow the gradients of e1, e2, e3, … rather than the gradient of E itself]
STOCHASTIC GRADIENT DESCENT
The objective function values (measured over all examples) settle into a local minimum over time
The step size must be reduced over time, e.g., O(1/t)
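One common O(1/t)-style schedule, applied to the stochastic update above (my own sketch; alpha0 and t0 are assumed constants):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])
y = np.array([3.1, 2.4, 4.0, 6.6])
theta = np.zeros(2)

alpha0, t0 = 0.02, 50.0   # initial step size and decay constant (assumed)
t = 0
for epoch in range(100):
    for i in np.random.permutation(len(X)):
        alpha = alpha0 / (1.0 + t / t0)                # roughly O(1/t) for large t
        theta += alpha * X[i] * (y[i] - X[i] @ theta)  # stochastic update with decaying step
        t += 1
print(theta)
```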
CAVEATS
Choosing a convergent "learning rate" can be hard in practice
[Figure: an E(θ) curve illustrating how a poor step size choice behaves]
EXAMPLE FROM B553: IMAGE ENCODING
A 2-layer network with one hidden radial-basis-function layer of 50 neurons (200 parameters), trained on 1000 examples
[Figure: the fitted NN output next to the original 12x18 image]
COMMENTS AND ISSUES
How to choose the size and structure of networks?
- If the network is too large, there is a risk of over-fitting (data caching)
- If the network is too small, the representation may not be rich enough
Role of representation: e.g., learning the concept of an odd number
BENEFITS / DRAWBACKS OF NNS
Benefits:
- Easy to generate complex nonlinear function classes
- Incremental learning via stochastic gradient descent
- Good performance on many problems
- Predictions are evaluated quickly
Drawbacks:
- Difficult to characterize the hypothesis space
- Low interpretability
- Relatively slow training; local minima
PERFORMANCE OF FUNCTION LEARNING
Overfitting: too many parameters
Regularization: penalize large parameter values
- Minimize E(θ) + λ C(θ), where C(θ) measures the cost of the parameters (independent of the data) and λ is a regularization parameter
Efficient optimization:
- If E(θ) is nonconvex, we can only guarantee finding a local minimum
- Batch updates are expensive; stochastic updates converge slowly
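As one concrete (assumed) instance, taking C(θ) = ||θ||² gives ridge regularization, which for the linear model still has a closed form; lam is my name for the regularization parameter:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])  # examples as rows
b = np.array([3.1, 2.4, 4.0, 6.6])
lam = 0.1   # regularization parameter (assumed value)

# Minimize  sum_i (A_i . theta - b_i)^2  +  lam * ||theta||^2
theta = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
print(theta)
```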
READINGS
R&N 18.8-18.9