

Transcript of: Neural Networks - CAP5610 Machine Learning, Lecture 09 (UCF Computer Science, gqi/CAP5610/CAP5610Lecture09.pdf)

Page 1:

Neural Networks

CAP5610 Machine Learning

Instructor: Guo-Jun Qi

Page 2:

Recap: linear classifier

• Logistic regression
  • Maximizes the posterior distribution of the class Y conditional on the input vector X

• Support vector machines
  • Maximize the margin
  • Hard margin: subject to the constraint that no training error is made
  • Soft margin: additionally minimizes slack variables that measure how much each training example violates the classification rule

• Extended to nonlinear classifiers with the kernel trick
  • Map input vectors to a high-dimensional space
  • A classifier that is linear in the high-dimensional space is nonlinear in the original space

Page 3:

Building a nonlinear classifier

• With a network of logistic units?

• A single logistic unit is a linear classifier f: X → Y:

$f(X) = \frac{1}{1 + \exp\left(-W_0 - \sum_{n=1}^{N} W_n X_n\right)}$
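To make this concrete, here is a minimal NumPy sketch of a logistic unit (the function names are our own, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """Logistic function h(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_unit(x, w, w0):
    """f(X) = sigmoid(W0 + sum_n Wn * Xn) for one input vector x."""
    return sigmoid(w0 + np.dot(w, x))

# Example: a 2-input logistic unit
x = np.array([1.0, 0.5])
w = np.array([2.0, -1.0])
print(logistic_unit(x, w, w0=-0.5))  # a value in (0, 1)
```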

Page 4:

Graph representation of a logistic unit

• Input layer: an input X = (X1, …, XN)

• Output: logistic function of the input features

Page 5:

A logistic unit as a neuron:

• Input layer: an input X = (X1, …, XN)

• Activation: the weighted sum of the input features, $a = W_0 + \sum_{n=1}^{N} W_n X_n$

• Activation function: logistic function h applied to the weighted sum

• Output: z = ℎ(𝑎)

Page 6:

Neural Network: Multiple layers of neurons

• The output of each layer is the input to the layer above

Page 7:

An example

• A three-layer neural network

[Figure: inputs x1, x2 feed hidden units z1, z2 through first-layer weights $w_{11}^{(1)}, w_{12}^{(1)}, w_{21}^{(1)}, w_{22}^{(1)}$ and biases $w_{10}^{(1)}, w_{20}^{(1)}$; the hidden units feed output y1 through second-layer weights $w_{11}^{(2)}, w_{12}^{(2)}$ and bias $w_{10}^{(2)}$.]

$a_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{10}^{(1)}, \quad z_1 = h(a_1^{(1)})$

$a_2^{(1)} = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{20}^{(1)}, \quad z_2 = h(a_2^{(1)})$

$a_1^{(2)} = w_{11}^{(2)} z_1 + w_{12}^{(2)} z_2 + w_{10}^{(2)}, \quad y_1 = f(a_1^{(2)})$
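A minimal sketch of this forward pass in NumPy, taking h and f both to be the logistic function as in the slides (the helper names are our own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h=sigmoid, f=sigmoid):
    """Forward pass of the three-layer network.
    W1: 2x2 first-layer weights (row j holds w_j1, w_j2), b1: biases w_j0.
    W2: 1x2 second-layer weights, b2: bias w_10^(2)."""
    a1 = W1 @ x + b1      # first-layer activations a_j^(1)
    z = h(a1)             # hidden outputs z_j
    a2 = W2 @ z + b2      # second-layer activation a_1^(2)
    y = f(a2)             # network output y_1
    return y, z, a1, a2
```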

Page 8:

XOR Problem

• It is impossible to linearly separate these two classes

[Figure: the four XOR points on the x1-x2 plane; (0,0) and (1,1) belong to one class, (1,0) and (0,1) to the other; no single line separates them.]

Page 9:

XOR Problem

• The two classes become separable by thresholding the output y1 at 0.5

[Figure: the three-layer network from the previous example with trained weights: first-layer weights −11.62, 12.88, 10.99, −13.13 and biases −6.06, −6.56; second-layer weights 13.34, 13.13 and bias −7.19.]

Network outputs:

Input   Output
(0,0)   0.057
(1,0)   0.949
(0,1)   0.946
(1,1)   0.052
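To see how such a network separates XOR, here is a quick check with a hand-constructed weight set (illustrative values of our own, not the trained weights in the figure): hidden unit 1 approximates OR, hidden unit 2 approximates NAND, and the output approximates AND.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hand-constructed weights (illustrative, not the slide's trained values):
# hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ NAND(x1, x2), output ~ AND(z1, z2)
W1 = np.array([[20.0, 20.0], [-20.0, -20.0]])
b1 = np.array([-10.0, 30.0])
W2 = np.array([[20.0, 20.0]])
b2 = np.array([-30.0])

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    z = sigmoid(W1 @ np.array(x, dtype=float) + b1)
    y = sigmoid(W2 @ z + b2)[0]
    print(x, round(y, 3))   # ~0, ~1, ~1, ~0: thresholding at 0.5 recovers XOR
```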

Page 10:

Application: Drive a car

• Input: real-time videos captured by a camera

• Output: signals that steer the car, ranging from sharp left through straight ahead to sharp right

Page 11:

Training Neural Network

• Given a training set of M examples {(x(i),t(i))|i=1,…,M}

• Training the neural network is equivalent to minimizing the squared error between the network outputs and the true values:

$\min_w L(w) = \frac{1}{2} \sum_{i=1}^{M} \left( y^{(i)} - t^{(i)} \right)^2$

where y(i) is the network output, which depends on the network parameters w.

Page 12:

Recap: Gradient descent method

• Gradient descent is an iterative algorithm
  • The mirror image of a hill-climbing method that seeks the peak of a "mountain": here we descend to the lowest point of the loss surface

• At each point, compute the gradient
  • The gradient is a vector that points in the steepest uphill direction

• At each step, w moves a step of size λ in the direction opposite to the gradient:

$\nabla L = \left( \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_N} \right)$

$w \leftarrow w - \lambda \nabla L(w)$
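A minimal sketch of batch gradient descent for a single logistic unit under the squared-error loss (a hypothetical setup of our own; for the logistic f, f'(a) = y(1 - y)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_descent(X, t, lam=0.5, iters=1000):
    """Batch gradient descent for one logistic unit with squared error.
    X: M x N inputs, t: M targets in {0, 1}; lam is the step size λ."""
    M, N = X.shape
    w, w0 = np.zeros(N), 0.0
    for _ in range(iters):
        y = sigmoid(X @ w + w0)            # outputs y^(i) for all examples
        # dL/da^(i) = (y - t) * f'(a), with f'(a) = y(1 - y) for the logistic
        delta = (y - t) * y * (1 - y)
        w  -= lam * (X.T @ delta)          # step in the negative gradient
        w0 -= lam * delta.sum()
    return w, w0
```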

Page 13:

Stochastic gradient descent method

• Makes the learning algorithm scalable to big data

• Compute the gradient of the squared error for only one example at a time:

$L(w) = \frac{1}{2} \sum_{i=1}^{M} \left( y^{(i)} - t^{(i)} \right)^2, \qquad L^{(i)}(w) = \frac{1}{2} \left( y^{(i)} - t^{(i)} \right)^2$

• Each update replaces the full gradient $\nabla L = \left( \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_N} \right)$ with the per-example gradient $\nabla L^{(i)} = \left( \frac{\partial L^{(i)}}{\partial w_0}, \frac{\partial L^{(i)}}{\partial w_1}, \ldots, \frac{\partial L^{(i)}}{\partial w_N} \right)$
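The same logistic unit trained stochastically, updating on one example at a time (a sketch under the same assumptions as above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd(X, t, lam=0.5, epochs=100, seed=0):
    """Stochastic gradient descent: one example per update."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    w, w0 = np.zeros(N), 0.0
    for _ in range(epochs):
        for i in rng.permutation(M):          # visit examples in random order
            y = sigmoid(X[i] @ w + w0)
            delta = (y - t[i]) * y * (1 - y)  # gradient of L^(i) w.r.t. the activation
            w  -= lam * delta * X[i]
            w0 -= lam * delta
    return w, w0
```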

Page 14:

Boiling down to computing the gradient

[Figure: the three-layer network, highlighting the output activation $a_1^{(2)}$ and the second-layer weights $w_{11}^{(2)}$, $w_{12}^{(2)}$, $w_{10}^{(2)}$.]

Square loss:

$L = \frac{1}{2} \sum_k (y_k - t_k)^2, \qquad y_k = f(a_k^{(2)}), \qquad a_k^{(2)} = \sum_j w_{kj}^{(2)} z_j$

Derivative with respect to the activation in the second layer:

$\frac{\partial L}{\partial a_k^{(2)}} = (y_k - t_k) \frac{\partial y_k}{\partial a_k^{(2)}} = (y_k - t_k) f'(a_k^{(2)})$

Derivative with respect to a parameter in the second layer:

$\frac{\partial L}{\partial w_{ki}^{(2)}} = \frac{\partial L}{\partial a_k^{(2)}} \frac{\partial a_k^{(2)}}{\partial w_{ki}^{(2)}} = \frac{\partial L}{\partial a_k^{(2)}} z_i$

Page 15:

Boiling down to computing the gradient

• Computing the derivatives with respect to the parameters in the first layer

[Figure: the three-layer network, highlighting the first-layer weights $w_{11}^{(1)}, w_{21}^{(1)}, w_{12}^{(1)}, w_{22}^{(1)}$, the biases $w_{10}^{(1)}$ and $w_{10}^{(2)}$, and the path from the hidden units to $a_1^{(2)}$.]

First-layer activations: $a_j^{(1)} = \sum_n w_{jn}^{(1)} x_n$

Relation between the activations of the first and second layers: $a_k^{(2)} = \sum_j w_{kj}^{(2)} h(a_j^{(1)})$

By the chain rule:

$\frac{\partial L}{\partial a_j^{(1)}} = \sum_k \frac{\partial L}{\partial a_k^{(2)}} \frac{\partial a_k^{(2)}}{\partial a_j^{(1)}} = \sum_k \frac{\partial L}{\partial a_k^{(2)}} w_{kj}^{(2)} h'(a_j^{(1)})$

The derivative with respect to a parameter in the first layer:

$\frac{\partial L}{\partial w_{jn}^{(1)}} = \frac{\partial L}{\partial a_j^{(1)}} \frac{\partial a_j^{(1)}}{\partial w_{jn}^{(1)}} = \frac{\partial L}{\partial a_j^{(1)}} x_n$

Page 16:

Summary: Back propagation

• For each training example (x, t):
  • For each output unit k, compute the error term $\delta_k^{(2)} = (y_k - t_k) f'(a_k^{(2)})$
  • For each hidden unit j, compute $\delta_j^{(1)} = h'(a_j^{(1)}) \sum_k \delta_k^{(2)} w_{kj}^{(2)}$

[Figure: the network annotated with the error terms $\delta_1^{(2)}$ at the output unit and $\delta_1^{(1)}$, $\delta_2^{(1)}$ at the hidden units.]

Page 17:

Summary: Back propagation (2)

• For each training example (x, t):
  • For each weight $w_{ki}^{(2)}$: the gradient is $\frac{\partial L}{\partial w_{ki}^{(2)}} = \delta_k^{(2)} z_i$
    • Update: $w_{ki}^{(2)} \leftarrow w_{ki}^{(2)} - \lambda \, \delta_k^{(2)} z_i$
  • For each weight $w_{jn}^{(1)}$: the gradient is $\frac{\partial L}{\partial w_{jn}^{(1)}} = \delta_j^{(1)} x_n$
    • Update: $w_{jn}^{(1)} \leftarrow w_{jn}^{(1)} - \lambda \, \delta_j^{(1)} x_n$

[Figure: the network annotated with the error terms $\delta_1^{(2)}$, $\delta_1^{(1)}$, $\delta_2^{(1)}$.]
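Putting the last few slides together, a sketch of one backpropagation update for the two-layer network (the function and variable names are our own; h = f = logistic, step size lam):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, lam=0.1):
    """One SGD/backprop update on a single example (x, t).
    W1: HxN first-layer weights, W2: KxH second-layer weights."""
    # Forward pass
    a1 = W1 @ x + b1
    z = sigmoid(a1)
    a2 = W2 @ z + b2
    y = sigmoid(a2)
    # Backward pass: delta_k^(2) = (y_k - t_k) f'(a_k^(2))
    d2 = (y - t) * y * (1 - y)
    # delta_j^(1) = h'(a_j^(1)) * sum_k delta_k^(2) w_kj^(2)
    d1 = z * (1 - z) * (W2.T @ d2)
    # Gradient steps: dL/dw_ki^(2) = d2_k z_i, dL/dw_jn^(1) = d1_j x_n
    W2 -= lam * np.outer(d2, z)
    b2 -= lam * d2
    W1 -= lam * np.outer(d1, x)
    b1 -= lam * d1
    return W1, b1, W2, b2
```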

Page 18:

Regularized Square Error

• Add a zero-mean Gaussian prior on the weights: $w_{ij}^{(l)} \sim N(0, \sigma^2)$

• The MAP estimate of w then minimizes the regularized per-example error

$L^{(i)}(w) = \frac{1}{2} \left( y^{(i)} - t^{(i)} \right)^2 + \frac{\gamma}{2} \sum_{l,i,j} \left( w_{ij}^{(l)} \right)^2$

where the regularization weight γ is determined by the prior variance (γ ∝ 1/σ²).

Page 19:

Summary: Back propagation with regularization

• For each training example (x, t):
  • For each weight $w_{ki}^{(2)}$: the gradient becomes $\frac{\partial L}{\partial w_{ki}^{(2)}} = \delta_k^{(2)} z_i + \gamma w_{ki}^{(2)}$
    • Update: $w_{ki}^{(2)} \leftarrow w_{ki}^{(2)} - \lambda \left( \delta_k^{(2)} z_i + \gamma w_{ki}^{(2)} \right)$
  • For each weight $w_{jn}^{(1)}$: the gradient becomes $\frac{\partial L}{\partial w_{jn}^{(1)}} = \delta_j^{(1)} x_n + \gamma w_{jn}^{(1)}$
    • Update: $w_{jn}^{(1)} \leftarrow w_{jn}^{(1)} - \lambda \left( \delta_j^{(1)} x_n + \gamma w_{jn}^{(1)} \right)$

[Figure: the same network annotated with the error terms $\delta_1^{(2)}$, $\delta_1^{(1)}$, $\delta_2^{(1)}$.]
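In code, regularization only adds the γw term to each gradient step. A self-contained variant of the earlier backprop sketch (again with names of our own; gamma is the weight-decay coefficient):

```python
import numpy as np

def regularized_step(x, t, W1, b1, W2, b2, lam=0.1, gamma=1e-3):
    """backprop_step from above with weight decay (gamma = γ)."""
    z = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    y = 1.0 / (1.0 + np.exp(-(W2 @ z + b2)))
    d2 = (y - t) * y * (1 - y)
    d1 = z * (1 - z) * (W2.T @ d2)
    W2 -= lam * (np.outer(d2, z) + gamma * W2)   # + γw term from the prior
    b2 -= lam * d2
    W1 -= lam * (np.outer(d1, x) + gamma * W1)
    b1 -= lam * d1
    # Note: the biases w_k0, w_j0 are left unregularized here, a common practice;
    # include gamma * b terms if the prior is meant to cover them too.
    return W1, b1, W2, b2
```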

Page 20:

Multiple outputs encoding multiple classes

• MNIST: ten classes of digits

• Encoding multiple classes as multiple outputs:
  • An output variable is set to 1 if the corresponding class is the positive one for the example
  • Otherwise, the output is set to 0

• The posterior probability of an example belonging to class k:

$P(\text{Class}_k \mid \mathbf{x}) = \frac{y_k}{\sum_{k'=1}^{K} y_{k'}}$
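A small sketch of this normalization (the function name is ours):

```python
import numpy as np

def class_posteriors(y):
    """Normalize the K network outputs into posterior class probabilities:
    P(Class_k | x) = y_k / sum_k' y_k'."""
    y = np.asarray(y, dtype=float)
    return y / y.sum()

print(class_posteriors([0.05, 0.90, 0.10]))  # -> approx. [0.048 0.857 0.095]
```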

Page 21:

Overfitting

• Tune the number of update iterations on a validation set (early stopping), as sketched below
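A sketch of this validation-based stopping rule; `step` and `val_error` are assumed callbacks (one training update and the current validation error), not functions from the slides, and the patience rule is our own choice:

```python
def early_stopping(step, val_error, max_iters=10000, patience=20):
    """Run training steps while tracking validation error; stop once it has
    not improved for `patience` consecutive iterations."""
    best, since_best, best_iter = float("inf"), 0, 0
    for it in range(max_iters):
        step()                      # one weight update on the training set
        err = val_error()           # squared error on the validation set
        if err < best:
            best, best_iter, since_best = err, it, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_iter                # the iteration budget to use
```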

Page 22:

How expressive are neural networks?

• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • But this might require an exponential number of hidden units

• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers

Page 23:

Learning feature representation by neural networks

• A compact representation for high-dimensional input vectors
  • e.g., a large image with thousands of pixels

• High-dimensional input vectors
  • may cause the curse of dimensionality
  • need more examples for training (see Lecture 1)
  • may not capture the intrinsic variations well: an arbitrary point in a high-dimensional space probably does not correspond to a valid real object

• A meaningful low-dimensional space is preferred!

Page 24:

Autoencoder

• Set the output to be the input itself

• The hidden layer serves as a feature representation, since it contains sufficient information to reconstruct the input at the output layer
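A minimal one-hidden-layer autoencoder trained with the same squared-error backprop updates (a sketch with our own names; the target is the input itself):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, hidden=8, lam=0.5, epochs=200, seed=0):
    """Train a network to reproduce its input: target t = x.
    The hidden activations z serve as a compact feature representation."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W1 = rng.normal(0, 0.1, (hidden, N)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (N, hidden)); b2 = np.zeros(N)
    for _ in range(epochs):
        for i in rng.permutation(M):
            x = X[i]
            z = sigmoid(W1 @ x + b1)
            y = sigmoid(W2 @ z + b2)
            d2 = (y - x) * y * (1 - y)       # output delta; the target is x
            d1 = z * (1 - z) * (W2.T @ d2)   # hidden delta by backprop
            W2 -= lam * np.outer(d2, z); b2 -= lam * d2
            W1 -= lam * np.outer(d1, x); b1 -= lam * d1
    encode = lambda x: sigmoid(W1 @ x + b1)  # the learned feature map
    return encode
```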

Page 25:

An example

Page 26:

Deep Learning: A Deep Feature Representation

• Building multiple hidden layers that reconstruct the input at the output layer yields a deep feature representation

Page 27:

Summary

• Neural networks: multiple layers of neurons
  • Each upper-layer neuron applies an activation function to a weighted sum of the outputs of the neurons in the layer below

• BP training: a stochastic gradient descent method
  • Errors propagate from the output layer back down to the hidden and input layers

• Autoencoder: learning a feature representation