Neural Network (Basic Ideas)
Hung-yi Lee
Learning ≈ Looking for a Function
• Speech Recognition: f(audio) = "你好"
• Handwriting Recognition: f(image) = "2"
• Weather Forecast: f(weather today) = "sunny tomorrow"
• Playing Video Games: f(positions and number of enemies) = "fire"
Framework
• Model: a hypothesis function set {f1, f2, …}
• Training: pick the "best" function f* from the model using the training data {(x^1, ŷ^1), (x^2, ŷ^2), …}
  • x: function input; ŷ: label (the desired function output)
• Testing: given a new input x, the output is y = f*(x)
  • e.g. x is an image of a handwritten digit, and y = "2"
Outline
1. What is the model (function hypothesis set)?
2. What is the "best" function?
3. How to pick the "best" function?
Task Considered Today
• Classification: assign an input object to Class A or Class B
• Binary Classification: only two classes (yes / no)
  • Spam filtering: is an e-mail spam or not?
  • Recommendation systems: recommend the product to the customer or not?
  • Malware detection: is the software malicious or not?
  • Stock prediction: will the future value of a stock increase or not?
Task Considered Today
• Binary Classification: only two classes (yes / no)
• Multi-class Classification: more than two classes (Class A, Class B, Class C, …)
Multi-class Classification
• Handwriting Digit Classification: input is an image; classes are "1", "2", …, "9", "0" (10 classes)
• Image Recognition: input is an image; classes are "dog", "cat", "book", … (thousands of classes)
Multi-class Classification
• Real speech recognition is not multi-class classification
  • The classes would be "hi", "how are you", "I am sorry", … — they cannot be enumerated
• HW1 is multi-class classification
  • Input: a frame of the speech signal
  • Classes: the phonemes (e.g. /b/, /ε/) — which phoneme does the frame belong to?
1. What is the model?
What is the function we are looking for?
• Classification: y = f(x)
  • x: input object to be classified
  • y: class
• Assume both x and y can be represented as fixed-size vectors
  • x is a vector with N dimensions, and y is a vector with M dimensions
f: R^N → R^M
What is the function we are looking for?
• Handwriting Digit Classification: x is an image, y is a class
• Input x: each pixel corresponds to an element in the vector (1 for ink, 0 otherwise)
  • A 16 × 16 image gives 16 × 16 = 256 dimensions
• Output y: 10 dimensions for digit recognition, one per class ("1" or not, "2" or not, "3" or not, …)
  • e.g. "1" is represented as [1, 0, 0, …]; "2" is represented as [0, 1, 0, …]
f: R^N → R^M
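A sketch of this vector representation in numpy (the helper names below are mine, not from the slides): a 16 × 16 binary image is flattened into a 256-dimensional x, and a digit label becomes a 10-dimensional one-hot y.

```python
import numpy as np

CLASSES = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]  # ordering as on the slide

def image_to_vector(image):
    # 16 x 16 image, 1 for ink and 0 otherwise -> x in R^256
    return np.asarray(image, dtype=float).reshape(256)

def label_to_vector(label):
    # y has 10 dimensions, one per class ("1" or not, "2" or not, ...)
    y = np.zeros(len(CLASSES))
    y[CLASSES.index(label)] = 1.0
    return y

x = image_to_vector(np.zeros((16, 16)))  # a blank image
y = label_to_vector("2")                 # -> [0, 1, 0, 0, ...]
```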
1. What is the model?
A Layer of Neurons
Single Neuron
• Inputs x1, x2, …, xN, weights w1, w2, …, wN, and bias b:
  z = w1 x1 + w2 x2 + … + wN xN + b
• Activation function σ(z), e.g. the sigmoid function:
  σ(z) = 1 / (1 + e^(−z))
• Output: y = σ(z)
f: R^N → R
Single Neuron
• The bias b can be viewed as the weight on an extra input that is fixed to 1.
Single Neuron
• A single neuron can only do binary classification; it cannot handle multi-class classification
• e.g. recognizing "2" or not:
  • y ≥ 0.5 → is "2"
  • y < 0.5 → not "2"
f: R^N → R
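A minimal sketch of a single neuron in numpy; the weights and bias below are placeholders for illustration, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b      # z = w1*x1 + ... + wN*xN + b
    return sigmoid(z)         # y = sigma(z)

x = np.array([0.0, 1.0, 1.0])     # an input to classify
w = np.array([0.5, -0.3, 0.8])    # placeholder weights
b = -0.2                          # placeholder bias
y = neuron(x, w, b)
print('is "2"' if y >= 0.5 else 'not "2"')  # binary decision at 0.5
```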
A Layer of Neurons
• Handwriting digit classification: classes "1", "2", …, "9", "0" (10 classes)
• Use a layer of 10 neurons, one per class: y1 for "1" or not, y2 for "2" or not, y3 for "3" or not, …
• If y2 is the max, then the image is "2".
f: R^N → R^M
1. What is the model?
Limitation of a Single Layer
• Can a single neuron separate the following data?

  x1  x2 | Output
  0   0  |  No
  0   1  |  Yes
  1   0  |  Yes
  1   1  |  No

• A single neuron: z = w1 x1 + w2 x2 + b, a = σ(z)
  • yes if a ≥ threshold; no if a < threshold
• Can we find w1, w2, b such that a ≥ threshold for the "Yes" points and a < threshold for the "No" points?
Limitation of a Single Layer
• No, we can't …
• a ≥ threshold corresponds to a half-plane in the (x1, x2) space: no single line can put (0, 1) and (1, 0) on one side and (0, 0) and (1, 1) on the other.
Limitation of a Single Layer
• The data can be handled by combining logic operations: output = (x1 OR x2) AND NOT (x1 AND x2)

  x1  x2 | x1 OR x2 | NOT (x1 AND x2) | Output
  0   0  |    0     |        1        |  0 (No)
  0   1  |    1     |        1        |  1 (Yes)
  1   0  |    1     |        1        |  1 (Yes)
  1   1  |    1     |        0        |  0 (No)
Neural Network
• Implement the same idea with neurons: two hidden neurons compute a1 = σ(z1) and a2 = σ(z2) from x1 and x2, and an output neuron computes a = σ(z) from a1 and a2.
• a1 and a2 are called Hidden Neurons.
Neural Network
• With sigmoid activations, the hidden layer maps the four input points into a new (a1, a2) space:
  • (0, 0) → (a1, a2) = (0.27, 0.27)
  • (1, 1) → (a1, a2) = (0.27, 0.27)
  • (1, 0) → (a1, a2) = (0.73, 0.05)
  • (0, 1) → (a1, a2) = (0.05, 0.73)
• The two "No" points now coincide, so in the (a1, a2) space the data becomes linearly separable, and the output neuron a = σ(w1 a1 + w2 a2 + b) can classify it.
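A sketch of such a 2-2-1 network in numpy. The weights are hand-picked by me for illustration (the slides do not list them), but they reproduce hidden activations close to the 0.27 / 0.73 / 0.05 values above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: a1 responds to (x1=1, x2=0), a2 responds to (x1=0, x2=1)
W1 = np.array([[ 2.0, -2.0],
               [-2.0,  2.0]])
b1 = np.array([-1.0, -1.0])

# Output neuron: fires when exactly one of the hidden neurons is active
W2 = np.array([10.0, 10.0])
b2 = -6.6

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a = sigmoid(W1 @ np.array(x) + b1)   # hidden activations (a1, a2)
    y = sigmoid(W2 @ a + b2)             # output neuron
    print(x, np.round(a, 2), "Yes" if y >= 0.5 else "No")
```

Running this prints (0.27, 0.27) for the two "No" inputs and (0.73, 0.05) / (0.05, 0.73) for the "Yes" inputs, matching the picture above.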
1. What is the model?
Neural Network as Model
• Input layer: vector x = (x1, x2, …, xN)
• Hidden layers: Layer 1, Layer 2, …, Layer L
• Output layer: vector y = (y1, y2, …, yM)
• Fully connected feedforward network
• Deep Neural Network: many hidden layers
f: R^N → R^M
Notation
• Layer l has N_l nodes; layer l−1 has N_{l−1} nodes
• a_i^l: the output of neuron i at layer l
• a^l: the output of one layer, a vector (a_1^l, a_2^l, …)
• a_j^{l−1}: the output of neuron j at layer l−1
Notation
• w_ij^l: a weight from layer l−1 to layer l — from neuron j (at layer l−1) to neuron i (at layer l)
• W^l: the weight matrix collecting all weights between layer l−1 and layer l:

  W^l = [ w_11^l  w_12^l  … ]
        [ w_21^l  w_22^l  … ]
        [   ⋮        ⋮       ]

  an N_l × N_{l−1} matrix
Notation
• b_i^l: the bias for neuron i at layer l
• b^l = (b_1^l, b_2^l, …, b_i^l, …): the bias vector for all neurons in layer l
Notation
• z_i^l: the input of the activation function for neuron i at layer l:

  z_i^l = Σ_j w_ij^l a_j^{l−1} + b_i^l
        = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + … + b_i^l

• z^l = (z_1^l, z_2^l, …): the input of the activation function for all the neurons in layer l
Notation - Summary
• a_i^l: output of a neuron
• a^l: output of a layer
• z_i^l: input of the activation function
• z^l: input of the activation function for a layer
• w_ij^l: a weight
• W^l: a weight matrix
• b_i^l: a bias
• b^l: a bias vector
Relations between Layer Outputs
• Goal: express the output a^l of layer l in terms of the output a^{l−1} of layer l−1, via z^l.
Relations between Layer Outputs
• Component form:
  z_1^l = w_11^l a_1^{l−1} + w_12^l a_2^{l−1} + … + b_1^l
  z_2^l = w_21^l a_1^{l−1} + w_22^l a_2^{l−1} + … + b_2^l
  …
  z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + … + b_i^l
• Matrix form:

  [ z_1^l ]   [ w_11^l  w_12^l  … ] [ a_1^{l−1} ]   [ b_1^l ]
  [ z_2^l ] = [ w_21^l  w_22^l  … ] [ a_2^{l−1} ] + [ b_2^l ]
  [   ⋮   ]   [   ⋮        ⋮      ] [     ⋮     ]   [   ⋮   ]

  z^l = W^l a^{l−1} + b^l
Relations between Layer Outputs
• Applying the activation function elementwise:
  a_i^l = σ(z_i^l)
  a^l = σ(z^l), i.e. (a_1^l, a_2^l, …) = (σ(z_1^l), σ(z_2^l), …)
Relations between Layer Outputs
• Combining the two relations:
  z^l = W^l a^{l−1} + b^l
  a^l = σ(z^l)
  ⇒ a^l = σ(W^l a^{l−1} + b^l)
Function of Neural Network
• Network with parameters W^1, b^1; W^2, b^2; …; W^L, b^L, input vector x = (x1, x2, …, xN), output vector y = (y1, y2, …, yM):
  a^1 = σ(W^1 x + b^1)
  a^2 = σ(W^2 a^1 + b^2)
  …
  y = a^L = σ(W^L a^{L−1} + b^L)
Function of Neural Network
• Composing all the layers, the whole network is a single function:
  y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
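A minimal sketch of this forward computation in numpy, assuming sigmoid activations at every layer; the layer sizes below are arbitrary examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    # y = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^{l-1} + b^l)
    return a

rng = np.random.default_rng(0)
sizes = [256, 30, 30, 10]        # N = 256 inputs, two hidden layers, M = 10 outputs
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = forward(rng.random(256), Ws, bs)   # a 10-dimensional output
```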
2. What is the "best" function?
Best Function = Best Parameters
  y = f(x; θ) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
• θ = {W^1, b^1, W^2, b^2, …, W^L, b^L} is the parameter set
• Picking the "best" function f* means picking the "best" parameter set θ*, because different parameters W and b lead to different functions
• This is a formal way to define a function set: a function f(x; θ) together with a parameter set θ
Cost Function
• Define a function C(θ) of the parameter set θ
• C(θ) evaluates how bad a parameter set is
• The best parameter set θ* is the one that minimizes C(θ):
  θ* = arg min_θ C(θ)
• C(θ) is called the cost/loss/error function
• If you instead define the goodness of the parameter set by another function O(θ), then O(θ) is called the objective function
Cost Function
• Handwriting Digit Classification
• Given training data {(x^1, ŷ^1), …, (x^r, ŷ^r), …, (x^R, ŷ^R)}:

  C(θ) = (1/R) Σ_r ‖f(x^r; θ) − ŷ^r‖

  (the sum is over all training examples)
• e.g. for an image of "2", minimize the distance between the network output [0.2, 0.4, 0.1, …] and the target ŷ = [0, 1, 0, …]
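A sketch of this cost in numpy, assuming the Euclidean norm for ‖·‖ (the slides do not pin the norm down):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return a

def cost(Ws, bs, xs, ys_hat):
    # C(theta) = (1/R) * sum over all training examples of ||f(x^r; theta) - y_hat^r||
    return np.mean([np.linalg.norm(forward(x, Ws, bs) - y_hat)
                    for x, y_hat in zip(xs, ys_hat)])
```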
3. How to pick the "best" function?
Gradient Descent
Statement of the Problem
• Statement of the problem:
  • There is a function C(θ)
  • θ represents the parameter set: θ = {θ1, θ2, θ3, …}
  • Find θ* that minimizes C(θ)
• Brute force? Enumerate all possible θ — infeasible
• Calculus? Find θ* such that ∂C(θ*)/∂θ1 = 0, ∂C(θ*)/∂θ2 = 0, … — usually there is no closed-form solution
Gradient Descent – Idea
• For simplicity, first consider a θ with only one variable
• Drop a ball somewhere on the curve of C(θ); it rolls downhill (θ^0 → θ^1 → θ^2 → θ^3 → …), and when the ball stops, we have found a local minimum
Gradient Descent – Idea
• For simplicity, first consider a θ with only one variable
  • Randomly start at θ^0
  • Compute dC(θ^0)/dθ, then update θ^1 ← θ^0 − η dC(θ^0)/dθ
  • Compute dC(θ^1)/dθ, then update θ^2 ← θ^1 − η dC(θ^1)/dθ
  • …
• η is called the "learning rate"
Gradient Descent
• Suppose that θ has two variables {θ1, θ2}
• Randomly start at θ^0 = [θ1^0, θ2^0]ᵀ
• Compute the gradient of C(θ) at θ^0:
  ∇C(θ^0) = [∂C/∂θ1, ∂C/∂θ2]ᵀ evaluated at θ^0
• Update the parameters:
  [θ1^1, θ2^1]ᵀ = [θ1^0, θ2^0]ᵀ − η [∂C/∂θ1, ∂C/∂θ2]ᵀ, i.e. θ^1 = θ^0 − η∇C(θ^0)
• Compute the gradient of C(θ) at θ^1: ∇C(θ^1), update again, …
Gradient Descent
• Start at position θ^0
• Compute the gradient at θ^0; move to θ^1 = θ^0 − η∇C(θ^0)
• Compute the gradient at θ^1; move to θ^2 = θ^1 − η∇C(θ^1)
• …
• In the (θ1, θ2) plane, each movement is in the direction opposite to the gradient at the current point (θ^0 → θ^1 → θ^2 → θ^3 → …)
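A minimal sketch of this update loop on a made-up two-variable cost C(θ) = θ1² + 2 θ2² (my example, not from the slides), with the gradient written out by hand:

```python
import numpy as np

def C(theta):
    return theta[0] ** 2 + 2.0 * theta[1] ** 2         # example cost surface

def grad_C(theta):
    return np.array([2.0 * theta[0], 4.0 * theta[1]])  # [dC/dtheta1, dC/dtheta2]

eta = 0.1                                  # learning rate
theta = np.array([3.0, -2.0])              # random starting point theta^0
for i in range(100):
    theta = theta - eta * grad_C(theta)    # theta^{i+1} = theta^i - eta * grad C(theta^i)
print(theta, C(theta))                     # close to the minimum at (0, 0)
```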
Formal Derivation of Gradient Descent
• Suppose that θ has two variables {θ1, θ2}
• Given a point, we can easily find the nearby point with the smallest value of C(θ); moving there repeatedly gives θ^0 → θ^1 → θ^2 → …. But how do we find that point?
Formal Derivation of Gradient Descent
• Taylor series: let h(x) be infinitely differentiable around x = x0.

  h(x) = Σ_{k=0}^∞ h^(k)(x0)/k! (x − x0)^k
       = h(x0) + h′(x0)(x − x0) + h″(x0)/2! (x − x0)² + …

• When x is close to x0: h(x) ≈ h(x0) + h′(x0)(x − x0)
• e.g. the Taylor series for h(x) = sin(x) around x0 = π/4: the approximation is good around π/4.
Multivariable Taylor series

  h(x, y) = h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)
            + something related to (x − x0)² and (y − y0)² + …

• When x and y are close to x0 and y0:

  h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)
Formal Derivation of Gradient Descent
• Based on the Taylor series: if the red circle around (a, b) is small enough, then inside the circle

  C(θ) ≈ C(a, b) + ∂C(a, b)/∂θ1 (θ1 − a) + ∂C(a, b)/∂θ2 (θ2 − b)

• Write s = C(a, b), u = ∂C(a, b)/∂θ1, v = ∂C(a, b)/∂θ2:

  C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)
Formal Derivation of Gradient Descent
• Find the θ1 and θ2 yielding the smallest value of C(θ) in the circle:

  C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

• To minimize this, choose (θ1 − a, θ2 − b) in the direction opposite to (u, v):

  [θ1, θ2]ᵀ = [a, b]ᵀ − η [u, v]ᵀ = [a, b]ᵀ − η [∂C(a, b)/∂θ1, ∂C(a, b)/∂θ2]ᵀ

• η depends on the radius of the circle. This is how gradient descent updates the parameters.
Gradient Descent for Neural Network
• Starting parameters θ^0, then θ^1, θ^2, …
  • Compute ∇C(θ^0): θ^1 = θ^0 − η∇C(θ^0)
  • Compute ∇C(θ^1): θ^2 = θ^1 − η∇C(θ^1)
  • …
• θ = {W^1, b^1, …, W^l, b^l, …, W^L, b^L}
• ∇C(θ) collects the partial derivatives ∂C/∂w_ij^l for all the weights and ∂C/∂b_i^l for all the biases
• Millions of parameters … To compute the gradients efficiently, we use backpropagation.
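For intuition on why efficiency matters, here is a naive finite-difference approximation of the gradient (my illustration, not the slides' method): it needs two cost evaluations per parameter, which is hopeless for millions of parameters — hence backpropagation.

```python
import numpy as np

def numerical_gradient(C, theta, eps=1e-5):
    # Approximate dC/dtheta_k = (C(theta + eps*e_k) - C(theta - eps*e_k)) / (2*eps)
    grad = np.zeros_like(theta)
    for k in range(theta.size):            # one loop iteration per parameter:
        d = np.zeros_like(theta)           # two full cost evaluations each,
        d[k] = eps                         # infeasible for millions of parameters
        grad[k] = (C(theta + d) - C(theta - d)) / (2 * eps)
    return grad
```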
Stuck at local minima?
• Who is Afraid of Non-Convex Loss Functions?
  • http://videolectures.net/eml07_lecun_wia/
• Deep Learning: Theoretical Motivations
  • http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
• Saddle point
3. How to pick the "best" function?
Practical Issues for Neural Networks
• Parameter Initialization
• Learning Rate
• Stochastic Gradient Descent and Mini-batch
• Recipe for Learning
Parameter Initialization
• For gradient descent, we need to pick the initialization parameters θ^0.
• The initialization parameters have some influence on the training.
• We will come back to this issue in the future.
• Suggestion for today:
  • Do not set all the parameters in θ^0 equal
  • Set the parameters in θ^0 randomly
Learning Rate
• Set the learning rate η carefully in the update rule
  θ^{i+1} ← θ^i − η∇C(θ^i)
• On the error surface, plotting cost against the number of parameter updates:
  • very large η: the cost blows up
  • large η: the cost oscillates and cannot decrease
  • small η: the cost decreases very slowly
  • just right: the cost decreases steadily to a minimum
Learning Rate
• Set the learning rate η carefully: θ^{i+1} ← θ^i − η∇C(θ^i)
• Toy example: a single neuron with a linear activation, y = z = wx + b
• Training data (20 examples):
  x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
  y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]
• The best parameters are around w* = 1, b* = 0
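A sketch of gradient descent on this toy example, assuming the squared-error cost C(w, b) = (1/R) Σ_r (w x^r + b − ŷ^r)² (the slides do not spell the cost out):

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)   # the 20 training inputs above
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b, eta = 0.0, 0.0, 0.01      # try other eta values to see divergence/slowness
for i in range(1000):
    err = w * x + b - y          # residuals of y = wx + b
    dw = 2 * np.mean(err * x)    # dC/dw
    db = 2 * np.mean(err)        # dC/db
    w, b = w - eta * dw, b - eta * db
print(w, b)                      # close to w* = 1, b* = 0
```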
Learning Rate
• Toy example with different learning rates η
• On the error surface C(w, b), from the starting point to the target (w*, b*): with η = 0.1 the target is reached in roughly an order of magnitude fewer updates (~0.3k) than with η = 0.01 or η = 0.001 (~3k or more)
Stochastic Gradient Descent and Mini-batch
• Gradient Descent:
  θ^{i+1} ← θ^i − η∇C(θ^i), where C(θ) = (1/R) Σ_r C^r(θ), so ∇C(θ^i) = (1/R) Σ_r ∇C^r(θ^i)
  (C^r(θ) = ‖f(x^r; θ) − ŷ^r‖ is the cost on the r-th example)
• Stochastic Gradient Descent: pick one example x^r
  θ^{i+1} ← θ^i − η∇C^r(θ^i)
• If all examples x^r have equal probabilities of being picked:
  E[∇C^r(θ^i)] = (1/R) Σ_r ∇C^r(θ^i) = ∇C(θ^i)
• Faster! Better!
Stochastic Gradient Descent and Mini-batch
• What is an epoch? When using stochastic gradient descent with training data {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^R, ŷ^R)}:
  • Starting at θ^0, pick x^1: θ^1 ← θ^0 − η∇C^1(θ^0)
  • Pick x^2: θ^2 ← θ^1 − η∇C^2(θ^1)
  • …
  • Pick x^r: θ^r ← θ^{r−1} − η∇C^r(θ^{r−1})
  • …
  • Pick x^R: θ^R ← θ^{R−1} − η∇C^R(θ^{R−1})
• Seen all the examples once → one epoch; then pick x^1 again and start the next epoch
Stochastic Gradient Descent and Mini-batch
• Toy example:
  • Gradient Descent: sees all examples, then updates once — one update per epoch
  • Stochastic Gradient Descent: sees only one example per update — if there are 20 examples, it updates 20 times in one epoch
Stochastic Gradient Descent and Mini-batch
• Shuffle your data first
• Gradient Descent: θ^{i+1} ← θ^i − η (1/R) Σ_r ∇C^r(θ^i)
• Stochastic Gradient Descent: pick an example x^r, θ^{i+1} ← θ^i − η∇C^r(θ^i)
• Mini-batch Gradient Descent: pick B examples as a batch b
  θ^{i+1} ← θ^i − η (1/B) Σ_{x^r ∈ b} ∇C^r(θ^i)
  (average the gradients of the examples in the batch b; B is the batch size)
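A sketch of the three update rules side by side; grad_Cr(theta, r), returning ∇C^r(θ) for example r, is a hypothetical helper (computing it is the job of backpropagation):

```python
import numpy as np

def gd_update(theta, grad_Cr, R, eta):
    # full gradient descent: average the gradient over all R examples
    return theta - eta * np.mean([grad_Cr(theta, r) for r in range(R)], axis=0)

def sgd_update(theta, grad_Cr, R, eta, rng):
    # stochastic gradient descent: one randomly picked example
    return theta - eta * grad_Cr(theta, rng.integers(R))

def minibatch_update(theta, grad_Cr, R, eta, B, rng):
    # mini-batch: average the gradient over B randomly picked examples
    batch = rng.choice(R, size=B, replace=False)
    return theta - eta * np.mean([grad_Cr(theta, r) for r in batch], axis=0)
```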
Stochastic Gradient Descent and Mini-batch
• Handwriting Digit Classification: batch size = 1 is stochastic gradient descent; batch size = R (all examples) is ordinary gradient descent
Stochastic Gradient Descent and Mini-batch
• Why is mini-batch faster than stochastic gradient descent in practice?
• Stochastic gradient descent computes z^1 = W^1 x one example at a time
• Mini-batch stacks the B examples into a matrix and computes z^1 = W^1 [x^1 x^2 …] in a single matrix-matrix product, which parallel hardware handles much more efficiently
• Practically, mini-batch is faster
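A sketch of the matrix trick, with a batch of B = 8 examples stored as the columns of a matrix X (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(30, 256))          # layer-1 weights
X = rng.random((256, 8))                 # batch of B = 8 examples as columns

# stochastic style: one matrix-vector product per example
zs = [W1 @ X[:, r] for r in range(8)]

# mini-batch style: one matrix-matrix product for the whole batch
Z1 = W1 @ X                              # column r equals zs[r]
print(np.allclose(Z1[:, 0], zs[0]))      # True: same result, fewer but larger ops
```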
Recipe for Learning
• Data provided in the homework:
  • Training Data: (x, y) pairs — used to pick the "best" function f*
  • Testing Data, split into:
    • Validation: x with labels y — you immediately know the accuracy, so you can use it to modify your training process
    • Real Testing: x only — you do not know the accuracy until the deadline (and that is what really counts)
Recipe for Learning
• Do I get good results on the training set?
• If no, the possible causes are:
  • Your code has a bug.
  • Bad model: there is no good function in the hypothesis function set — probably you need a bigger network.
  • Cannot find a good function: stuck at local minima, saddle points, … — change the training strategy.
Recipe for Learning
• Do I get good results on the training set?
  • no → modify your training process (at this stage your code usually does not have a bug)
  • yes → Do I get good results on the validation set?
    • no → work on preventing overfitting
    • yes → done
Recipe for Learning - Overfitting
• You pick a "best" parameter set θ*
• On the training data: f(x^r; θ*) ≈ ŷ^r for the training pairs (x^r, ŷ^r)
• However, on the testing data: f(x^u; θ*) can still be far from ŷ^u for a testing input x^u
• This happens when the training data and the testing data have different distributions.
Recipe for Learning - Overfitting
• Panacea: have more training data
• You can do that in a real application, but you can't do that in the homework.
• We will come back to this issue in the future.
Concluding Remarks
1. What is the model (function hypothesis set)? → Neural Network
2. What is the "best" function? → Cost Function
3. How to pick the "best" function? → Gradient Descent (Parameter Initialization, Learning Rate, Stochastic Gradient Descent and Mini-batch, Recipe for Learning)
Acknowledgement
• Thanks to 余朗祺 for correcting spelling errors on the slides during the lecture.
• Thanks to 吳柏瑜 for correcting notation errors on the slides.
• Thanks to Yes Huang for correcting typos on the slides.