Neural Network (Basic Ideas)
Hung-yi Lee
Learning ≈ Looking for a Function
• Speech Recognition: f(audio) = "你好"
• Handwriting Recognition: f(image) = "2"
• Weather Forecast: f(weather today) = "sunny tomorrow"
• Playing Video Games: f(positions and number of enemies) = "fire"
Framework
• Model: a hypothesis function set {f1, f2, …}
• Training: pick the "best" function f* from the model using the training data {(x^1, ŷ^1), (x^2, ŷ^2), …}
  • x: function input; ŷ: label (the desired function output)
• Testing: given a new input x, the output is y = f*(x)
  • e.g. x is an image of a handwritten digit, and y = "2"
Outline
1. What is the model (function hypothesis set)?
2. What is the "best" function?
3. How to pick the "best" function?
Task Considered Today
• Classification: assign an input object to Class A or Class B
• Binary Classification: only two classes (yes / no)
  • Spam filtering: is an e-mail spam or not?
  • Recommendation systems: recommend the product to the customer or not?
  • Malware detection: is the software malicious or not?
  • Stock prediction: will the future value of a stock increase or not?
Task Considered Today
• Binary Classification: only two classes (yes / no)
• Multi-class Classification: more than two classes (Class A, Class B, Class C, …)
Multi-class Classification
• Handwriting Digit Classification: input is an image; classes are "1", "2", …, "9", "0" (10 classes)
• Image Recognition: input is an image; classes are "dog", "cat", "book", … (thousands of classes)
Multi-class Classification
• Real speech recognition is not multi-class classification
  • The classes would be "hi", "how are you", "I am sorry", … — they cannot be enumerated
• HW1 is multi-class classification
  • Input: a frame of the speech signal
  • Classes: the phonemes (e.g. /b/, /ε/) — which phoneme does the frame belong to?
1. What is the model?
What is the function we are looking for?
• Classification: y = f(x)
  • x: input object to be classified
  • y: class
• Assume both x and y can be represented as fixed-size vectors
  • x is a vector with N dimensions, and y is a vector with M dimensions
f: R^N → R^M
What is the function we are looking for?
• Handwriting Digit Classification: x is an image, y is a class
• Input x: each pixel corresponds to an element in the vector (1 for ink, 0 otherwise)
  • A 16 × 16 image gives 16 × 16 = 256 dimensions
• Output y: 10 dimensions for digit recognition, one per class ("1" or not, "2" or not, "3" or not, …)
  • e.g. "1" is represented as [1, 0, 0, …]; "2" is represented as [0, 1, 0, …]
f: R^N → R^M
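A sketch of this vector representation in numpy (the helper names below are mine, not from the slides): a 16 × 16 binary image is flattened into a 256-dimensional x, and a digit label becomes a 10-dimensional one-hot y.

```python
import numpy as np

CLASSES = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]  # ordering as on the slide

def image_to_vector(image):
    # 16 x 16 image, 1 for ink and 0 otherwise -> x in R^256
    return np.asarray(image, dtype=float).reshape(256)

def label_to_vector(label):
    # y has 10 dimensions, one per class ("1" or not, "2" or not, ...)
    y = np.zeros(len(CLASSES))
    y[CLASSES.index(label)] = 1.0
    return y

x = image_to_vector(np.zeros((16, 16)))  # a blank image
y = label_to_vector("2")                 # -> [0, 1, 0, 0, ...]
```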
1. What is the model?
A Layer of Neurons
Single Neuron
• Inputs x1, x2, …, xN, weights w1, w2, …, wN, and bias b:
  z = w1 x1 + w2 x2 + … + wN xN + b
• Activation function σ(z), e.g. the sigmoid function:
  σ(z) = 1 / (1 + e^(−z))
• Output: y = σ(z)
f: R^N → R
Single Neuron
• The bias b can be viewed as the weight on an extra input that is fixed to 1.
Single Neuron
• A single neuron can only do binary classification; it cannot handle multi-class classification
• e.g. recognizing "2" or not:
  • y ≥ 0.5 → is "2"
  • y < 0.5 → not "2"
f: R^N → R
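A minimal sketch of a single neuron in numpy; the weights and bias below are placeholders for illustration, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b      # z = w1*x1 + ... + wN*xN + b
    return sigmoid(z)         # y = sigma(z)

x = np.array([0.0, 1.0, 1.0])     # an input to classify
w = np.array([0.5, -0.3, 0.8])    # placeholder weights
b = -0.2                          # placeholder bias
y = neuron(x, w, b)
print('is "2"' if y >= 0.5 else 'not "2"')  # binary decision at 0.5
```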
A Layer of Neurons
• Handwriting digit classification: classes "1", "2", …, "9", "0" (10 classes)
• Use a layer of 10 neurons, one per class: y1 for "1" or not, y2 for "2" or not, y3 for "3" or not, …
• If y2 is the max, then the image is "2".
f: R^N → R^M
1. What is the model?
Limitation of a Single Layer
• Can a single neuron separate the following data?

  x1  x2 | Output
  0   0  |  No
  0   1  |  Yes
  1   0  |  Yes
  1   1  |  No

• A single neuron: z = w1 x1 + w2 x2 + b, a = σ(z)
  • yes if a ≥ threshold; no if a < threshold
• Can we find w1, w2, b such that a ≥ threshold for the "Yes" points and a < threshold for the "No" points?
Limitation of a Single Layer
• No, we can't …
• a ≥ threshold corresponds to a half-plane in the (x1, x2) space: no single line can put (0, 1) and (1, 0) on one side and (0, 0) and (1, 1) on the other.
Limitation of a Single Layer
• The data can be handled by combining logic operations: output = (x1 OR x2) AND NOT (x1 AND x2)

  x1  x2 | x1 OR x2 | NOT (x1 AND x2) | Output
  0   0  |    0     |        1        |  0 (No)
  0   1  |    1     |        1        |  1 (Yes)
  1   0  |    1     |        1        |  1 (Yes)
  1   1  |    1     |        0        |  0 (No)
Neural Network
• Implement the same idea with neurons: two hidden neurons compute a1 = σ(z1) and a2 = σ(z2) from x1 and x2, and an output neuron computes a = σ(z) from a1 and a2.
• a1 and a2 are called Hidden Neurons.
Neural Network
• With sigmoid activations, the hidden layer maps the four input points into a new (a1, a2) space:
  • (0, 0) → (a1, a2) = (0.27, 0.27)
  • (1, 1) → (a1, a2) = (0.27, 0.27)
  • (1, 0) → (a1, a2) = (0.73, 0.05)
  • (0, 1) → (a1, a2) = (0.05, 0.73)
• The two "No" points now coincide, so in the (a1, a2) space the data becomes linearly separable, and the output neuron a = σ(w1 a1 + w2 a2 + b) can classify it.
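A sketch of such a 2-2-1 network in numpy. The weights are hand-picked by me for illustration (the slides do not list them), but they reproduce hidden activations close to the 0.27 / 0.73 / 0.05 values above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: a1 responds to (x1=1, x2=0), a2 responds to (x1=0, x2=1)
W1 = np.array([[ 2.0, -2.0],
               [-2.0,  2.0]])
b1 = np.array([-1.0, -1.0])

# Output neuron: fires when exactly one of the hidden neurons is active
W2 = np.array([10.0, 10.0])
b2 = -6.6

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a = sigmoid(W1 @ np.array(x) + b1)   # hidden activations (a1, a2)
    y = sigmoid(W2 @ a + b2)             # output neuron
    print(x, np.round(a, 2), "Yes" if y >= 0.5 else "No")
```

Running this prints (0.27, 0.27) for the two "No" inputs and (0.73, 0.05) / (0.05, 0.73) for the "Yes" inputs, matching the picture above.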
1. What is the model?
Neural Network as Model
• Input layer: vector x = (x1, x2, …, xN)
• Hidden layers: Layer 1, Layer 2, …, Layer L
• Output layer: vector y = (y1, y2, …, yM)
• Fully connected feedforward network
• Deep Neural Network: many hidden layers
f: R^N → R^M
Notation
• Layer l has N_l nodes; layer l−1 has N_{l−1} nodes
• a_i^l: the output of neuron i at layer l
• a^l: the output of one layer, a vector (a_1^l, a_2^l, …)
• a_j^{l−1}: the output of neuron j at layer l−1
Notation
• w_ij^l: a weight from layer l−1 to layer l — from neuron j (at layer l−1) to neuron i (at layer l)
• W^l: the weight matrix collecting all weights between layer l−1 and layer l:

  W^l = [ w_11^l  w_12^l  … ]
        [ w_21^l  w_22^l  … ]
        [   ⋮        ⋮       ]

  an N_l × N_{l−1} matrix
Notation
• b_i^l: the bias for neuron i at layer l
• b^l = (b_1^l, b_2^l, …, b_i^l, …): the bias vector for all neurons in layer l
Notation
• z_i^l: the input of the activation function for neuron i at layer l:

  z_i^l = Σ_j w_ij^l a_j^{l−1} + b_i^l
        = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + … + b_i^l

• z^l = (z_1^l, z_2^l, …): the input of the activation function for all the neurons in layer l
Notation - Summary
• a_i^l: output of a neuron
• a^l: output of a layer
• z_i^l: input of the activation function
• z^l: input of the activation function for a layer
• w_ij^l: a weight
• W^l: a weight matrix
• b_i^l: a bias
• b^l: a bias vector
Relations between Layer Outputs
• Goal: express the output a^l of layer l in terms of the output a^{l−1} of layer l−1, via z^l.
Relations between Layer Outputs
• Component form:
  z_1^l = w_11^l a_1^{l−1} + w_12^l a_2^{l−1} + … + b_1^l
  z_2^l = w_21^l a_1^{l−1} + w_22^l a_2^{l−1} + … + b_2^l
  …
  z_i^l = w_i1^l a_1^{l−1} + w_i2^l a_2^{l−1} + … + b_i^l
• Matrix form:

  [ z_1^l ]   [ w_11^l  w_12^l  … ] [ a_1^{l−1} ]   [ b_1^l ]
  [ z_2^l ] = [ w_21^l  w_22^l  … ] [ a_2^{l−1} ] + [ b_2^l ]
  [   ⋮   ]   [   ⋮        ⋮      ] [     ⋮     ]   [   ⋮   ]

  z^l = W^l a^{l−1} + b^l
Relations between Layer Outputs
• Applying the activation function elementwise:
  a_i^l = σ(z_i^l)
  a^l = σ(z^l), i.e. (a_1^l, a_2^l, …) = (σ(z_1^l), σ(z_2^l), …)
Relations between Layer Outputs
• Combining the two relations:
  z^l = W^l a^{l−1} + b^l
  a^l = σ(z^l)
  ⇒ a^l = σ(W^l a^{l−1} + b^l)
Function of Neural Network
• Network with parameters W^1, b^1; W^2, b^2; …; W^L, b^L, input vector x = (x1, x2, …, xN), output vector y = (y1, y2, …, yM):
  a^1 = σ(W^1 x + b^1)
  a^2 = σ(W^2 a^1 + b^2)
  …
  y = a^L = σ(W^L a^{L−1} + b^L)
Function of Neural Network
• Composing all the layers, the whole network is a single function:
  y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
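A minimal sketch of this forward computation in numpy, assuming sigmoid activations at every layer; the layer sizes below are arbitrary examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    # y = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^{l-1} + b^l)
    return a

rng = np.random.default_rng(0)
sizes = [256, 30, 30, 10]        # N = 256 inputs, two hidden layers, M = 10 outputs
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = forward(rng.random(256), Ws, bs)   # a 10-dimensional output
```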
2. What is the "best" function?
Best Function = Best Parameters
  y = f(x; θ) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
• θ = {W^1, b^1, W^2, b^2, …, W^L, b^L} is the parameter set
• Picking the "best" function f* means picking the "best" parameter set θ*, because different parameters W and b lead to different functions
• This is a formal way to define a function set: a function f(x; θ) together with a parameter set θ
Cost Function
• Define a function C(θ) of the parameter set θ
• C(θ) evaluates how bad a parameter set is
• The best parameter set θ* is the one that minimizes C(θ):
  θ* = arg min_θ C(θ)
• C(θ) is called the cost/loss/error function
• If you instead define the goodness of the parameter set by another function O(θ), then O(θ) is called the objective function
Cost Function
• Handwriting Digit Classification
• Given training data {(x^1, ŷ^1), …, (x^r, ŷ^r), …, (x^R, ŷ^R)}:

  C(θ) = (1/R) Σ_r ‖f(x^r; θ) − ŷ^r‖

  (the sum is over all training examples)
• e.g. for an image of "2", minimize the distance between the network output [0.2, 0.4, 0.1, …] and the target ŷ = [0, 1, 0, …]
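A sketch of this cost in numpy, assuming the Euclidean norm for ‖·‖ (the slides do not pin the norm down):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return a

def cost(Ws, bs, xs, ys_hat):
    # C(theta) = (1/R) * sum over all training examples of ||f(x^r; theta) - y_hat^r||
    return np.mean([np.linalg.norm(forward(x, Ws, bs) - y_hat)
                    for x, y_hat in zip(xs, ys_hat)])
```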
3. How to pick the "best" function?
Gradient Descent
Statement of the Problem
• Statement of the problem:
  • There is a function C(θ)
  • θ represents the parameter set: θ = {θ1, θ2, θ3, …}
  • Find θ* that minimizes C(θ)
• Brute force? Enumerate all possible θ — infeasible
• Calculus? Find θ* such that ∂C(θ*)/∂θ1 = 0, ∂C(θ*)/∂θ2 = 0, … — usually there is no closed-form solution
Gradient Descent – Idea
• For simplicity, first consider a θ with only one variable
• Drop a ball somewhere on the curve of C(θ); it rolls downhill (θ^0 → θ^1 → θ^2 → θ^3 → …), and when the ball stops, we have found a local minimum
Gradient Descent – Idea
• For simplicity, first consider a θ with only one variable
  • Randomly start at θ^0
  • Compute dC(θ^0)/dθ, then update θ^1 ← θ^0 − η dC(θ^0)/dθ
  • Compute dC(θ^1)/dθ, then update θ^2 ← θ^1 − η dC(θ^1)/dθ
  • …
• η is called the "learning rate"
Gradient Descent
• Suppose that θ has two variables {θ1, θ2}
• Randomly start at θ^0 = [θ1^0, θ2^0]ᵀ
• Compute the gradient of C(θ) at θ^0:
  ∇C(θ^0) = [∂C/∂θ1, ∂C/∂θ2]ᵀ evaluated at θ^0
• Update the parameters:
  [θ1^1, θ2^1]ᵀ = [θ1^0, θ2^0]ᵀ − η [∂C/∂θ1, ∂C/∂θ2]ᵀ, i.e. θ^1 = θ^0 − η∇C(θ^0)
• Compute the gradient of C(θ) at θ^1: ∇C(θ^1), update again, …
Gradient Descent
• Start at position θ^0
• Compute the gradient at θ^0; move to θ^1 = θ^0 − η∇C(θ^0)
• Compute the gradient at θ^1; move to θ^2 = θ^1 − η∇C(θ^1)
• …
• In the (θ1, θ2) plane, each movement is in the direction opposite to the gradient at the current point (θ^0 → θ^1 → θ^2 → θ^3 → …)
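A minimal sketch of this update loop on a made-up two-variable cost C(θ) = θ1² + 2 θ2² (my example, not from the slides), with the gradient written out by hand:

```python
import numpy as np

def C(theta):
    return theta[0] ** 2 + 2.0 * theta[1] ** 2         # example cost surface

def grad_C(theta):
    return np.array([2.0 * theta[0], 4.0 * theta[1]])  # [dC/dtheta1, dC/dtheta2]

eta = 0.1                                  # learning rate
theta = np.array([3.0, -2.0])              # random starting point theta^0
for i in range(100):
    theta = theta - eta * grad_C(theta)    # theta^{i+1} = theta^i - eta * grad C(theta^i)
print(theta, C(theta))                     # close to the minimum at (0, 0)
```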
Formal Derivation of Gradient Descent
• Suppose that θ has two variables {θ1, θ2}
• Given a point, we can easily find the nearby point with the smallest value of C(θ); moving there repeatedly gives θ^0 → θ^1 → θ^2 → …. But how do we find that point?
Formal Derivation of Gradient Descent
• Taylor series: let h(x) be infinitely differentiable around x = x0.

  h(x) = Σ_{k=0}^∞ h^(k)(x0)/k! (x − x0)^k
       = h(x0) + h′(x0)(x − x0) + h″(x0)/2! (x − x0)² + …

• When x is close to x0: h(x) ≈ h(x0) + h′(x0)(x − x0)
• e.g. the Taylor series for h(x) = sin(x) around x0 = π/4: the approximation is good around π/4.
Multivariable Taylor series

  h(x, y) = h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)
            + something related to (x − x0)² and (y − y0)² + …

• When x and y are close to x0 and y0:

  h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)
Formal Derivation of Gradient Descent
• Based on the Taylor series: if the red circle around (a, b) is small enough, then inside the circle

  C(θ) ≈ C(a, b) + ∂C(a, b)/∂θ1 (θ1 − a) + ∂C(a, b)/∂θ2 (θ2 − b)

• Write s = C(a, b), u = ∂C(a, b)/∂θ1, v = ∂C(a, b)/∂θ2:

  C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)
Formal Derivation of Gradient Descent
• Find the θ1 and θ2 yielding the smallest value of C(θ) in the circle:

  C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

• To minimize this, choose (θ1 − a, θ2 − b) in the direction opposite to (u, v):

  [θ1, θ2]ᵀ = [a, b]ᵀ − η [u, v]ᵀ = [a, b]ᵀ − η [∂C(a, b)/∂θ1, ∂C(a, b)/∂θ2]ᵀ

• η depends on the radius of the circle. This is how gradient descent updates the parameters.
Gradient Descent for Neural Network
• Starting parameters θ^0, then θ^1, θ^2, …
  • Compute ∇C(θ^0): θ^1 = θ^0 − η∇C(θ^0)
  • Compute ∇C(θ^1): θ^2 = θ^1 − η∇C(θ^1)
  • …
• θ = {W^1, b^1, …, W^l, b^l, …, W^L, b^L}
• ∇C(θ) collects the partial derivatives ∂C/∂w_ij^l for all the weights and ∂C/∂b_i^l for all the biases
• Millions of parameters … To compute the gradients efficiently, we use backpropagation.
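For intuition on why efficiency matters, here is a naive finite-difference approximation of the gradient (my illustration, not the slides' method): it needs two cost evaluations per parameter, which is hopeless for millions of parameters — hence backpropagation.

```python
import numpy as np

def numerical_gradient(C, theta, eps=1e-5):
    # Approximate dC/dtheta_k = (C(theta + eps*e_k) - C(theta - eps*e_k)) / (2*eps)
    grad = np.zeros_like(theta)
    for k in range(theta.size):            # one loop iteration per parameter:
        d = np.zeros_like(theta)           # two full cost evaluations each,
        d[k] = eps                         # infeasible for millions of parameters
        grad[k] = (C(theta + d) - C(theta - d)) / (2 * eps)
    return grad
```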
Stuck at local minima?
• Who is Afraid of Non-Convex Loss Functions?
  • http://videolectures.net/eml07_lecun_wia/
• Deep Learning: Theoretical Motivations
  • http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
• Saddle point
3. How to pick the "best" function?
Practical Issues for Neural Networks
• Parameter Initialization
• Learning Rate
• Stochastic Gradient Descent and Mini-batch
• Recipe for Learning
Parameter Initialization
• For gradient descent, we need to pick the initialization parameters θ^0.
• The initialization parameters have some influence on the training.
• We will come back to this issue in the future.
• Suggestion for today:
  • Do not set all the parameters in θ^0 equal
  • Set the parameters in θ^0 randomly
Learning Rate
• Set the learning rate η carefully in the update rule
  θ^{i+1} ← θ^i − η∇C(θ^i)
• On the error surface, plotting cost against the number of parameter updates:
  • very large η: the cost blows up
  • large η: the cost oscillates and cannot decrease
  • small η: the cost decreases very slowly
  • just right: the cost decreases steadily to a minimum
Learning Rate
• Set the learning rate η carefully: θ^{i+1} ← θ^i − η∇C(θ^i)
• Toy example: a single neuron with a linear activation, y = z = wx + b
• Training data (20 examples):
  x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
  y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]
• The best parameters are around w* = 1, b* = 0
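A sketch of gradient descent on this toy example, assuming the squared-error cost C(w, b) = (1/R) Σ_r (w x^r + b − ŷ^r)² (the slides do not spell the cost out):

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)   # the 20 training inputs above
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b, eta = 0.0, 0.0, 0.01      # try other eta values to see divergence/slowness
for i in range(1000):
    err = w * x + b - y          # residuals of y = wx + b
    dw = 2 * np.mean(err * x)    # dC/dw
    db = 2 * np.mean(err)        # dC/db
    w, b = w - eta * dw, b - eta * db
print(w, b)                      # close to w* = 1, b* = 0
```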
Learning Rate
• Toy example with different learning rates η
• On the error surface C(w, b), from the starting point to the target (w*, b*): with η = 0.1 the target is reached in roughly an order of magnitude fewer updates (~0.3k) than with η = 0.01 or η = 0.001 (~3k or more)
Stochastic Gradient Descent and Mini-batch
• Gradient Descent:
  θ^{i+1} ← θ^i − η∇C(θ^i), where C(θ) = (1/R) Σ_r C^r(θ), so ∇C(θ^i) = (1/R) Σ_r ∇C^r(θ^i)
  (C^r(θ) = ‖f(x^r; θ) − ŷ^r‖ is the cost on the r-th example)
• Stochastic Gradient Descent: pick one example x^r
  θ^{i+1} ← θ^i − η∇C^r(θ^i)
• If all examples x^r have equal probabilities of being picked:
  E[∇C^r(θ^i)] = (1/R) Σ_r ∇C^r(θ^i) = ∇C(θ^i)
• Faster! Better!
Stochastic Gradient Descent and Mini-batch
• What is an epoch? When using stochastic gradient descent with training data {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^R, ŷ^R)}:
  • Starting at θ^0, pick x^1: θ^1 ← θ^0 − η∇C^1(θ^0)
  • Pick x^2: θ^2 ← θ^1 − η∇C^2(θ^1)
  • …
  • Pick x^r: θ^r ← θ^{r−1} − η∇C^r(θ^{r−1})
  • …
  • Pick x^R: θ^R ← θ^{R−1} − η∇C^R(θ^{R−1})
• Seen all the examples once → one epoch; then pick x^1 again and start the next epoch
Stochastic Gradient Descent and Mini-batch
• Toy example:
  • Gradient Descent: sees all examples, then updates once — one update per epoch
  • Stochastic Gradient Descent: sees only one example per update — if there are 20 examples, it updates 20 times in one epoch
Stochastic Gradient Descent and Mini-batch
• Shuffle your data first
• Gradient Descent: θ^{i+1} ← θ^i − η (1/R) Σ_r ∇C^r(θ^i)
• Stochastic Gradient Descent: pick an example x^r, θ^{i+1} ← θ^i − η∇C^r(θ^i)
• Mini-batch Gradient Descent: pick B examples as a batch b
  θ^{i+1} ← θ^i − η (1/B) Σ_{x^r ∈ b} ∇C^r(θ^i)
  (average the gradients of the examples in the batch b; B is the batch size)
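A sketch of the three update rules side by side; grad_Cr(theta, r), returning ∇C^r(θ) for example r, is a hypothetical helper (computing it is the job of backpropagation):

```python
import numpy as np

def gd_update(theta, grad_Cr, R, eta):
    # full gradient descent: average the gradient over all R examples
    return theta - eta * np.mean([grad_Cr(theta, r) for r in range(R)], axis=0)

def sgd_update(theta, grad_Cr, R, eta, rng):
    # stochastic gradient descent: one randomly picked example
    return theta - eta * grad_Cr(theta, rng.integers(R))

def minibatch_update(theta, grad_Cr, R, eta, B, rng):
    # mini-batch: average the gradient over B randomly picked examples
    batch = rng.choice(R, size=B, replace=False)
    return theta - eta * np.mean([grad_Cr(theta, r) for r in batch], axis=0)
```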
Stochastic Gradient Descent and Mini-batch
• Handwriting Digit Classification: batch size = 1 is stochastic gradient descent; batch size = R (all examples) is ordinary gradient descent
Stochastic Gradient Descent and Mini-batch
• Why is mini-batch faster than stochastic gradient descent in practice?
• Stochastic gradient descent computes z^1 = W^1 x one example at a time
• Mini-batch stacks the B examples into a matrix and computes z^1 = W^1 [x^1 x^2 …] in a single matrix-matrix product, which parallel hardware handles much more efficiently
• Practically, mini-batch is faster
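A sketch of the matrix trick, with a batch of B = 8 examples stored as the columns of a matrix X (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(30, 256))          # layer-1 weights
X = rng.random((256, 8))                 # batch of B = 8 examples as columns

# stochastic style: one matrix-vector product per example
zs = [W1 @ X[:, r] for r in range(8)]

# mini-batch style: one matrix-matrix product for the whole batch
Z1 = W1 @ X                              # column r equals zs[r]
print(np.allclose(Z1[:, 0], zs[0]))      # True: same result, fewer but larger ops
```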
Recipe for Learning
• Data provided in the homework:
  • Training Data: (x, y) pairs — used to pick the "best" function f*
  • Testing Data, split into:
    • Validation: x with labels y — you immediately know the accuracy, so you can use it to modify your training process
    • Real Testing: x only — you do not know the accuracy until the deadline (and that is what really counts)
Recipe for Learning
• Do I get good results on the training set?
• If no, the possible causes are:
  • Your code has a bug.
  • Bad model: there is no good function in the hypothesis function set — probably you need a bigger network.
  • Cannot find a good function: stuck at local minima, saddle points, … — change the training strategy.
Recipe for Learning
• Do I get good results on the training set?
  • no → modify your training process (at this stage your code usually does not have a bug)
  • yes → Do I get good results on the validation set?
    • no → work on preventing overfitting
    • yes → done
Recipe for Learning - Overfitting
• You pick a "best" parameter set θ*
• On the training data: f(x^r; θ*) ≈ ŷ^r for the training pairs (x^r, ŷ^r)
• However, on the testing data: f(x^u; θ*) can still be far from ŷ^u for a testing input x^u
• This happens when the training data and the testing data have different distributions.
Recipe for Learning - Overfitting
• Panacea: have more training data
• You can do that in a real application, but you can't do that in the homework.
• We will come back to this issue in the future.
Concluding Remarks
1. What is the model (function hypothesis set)? → Neural Network
2. What is the "best" function? → Cost Function
3. How to pick the "best" function? → Gradient Descent (Parameter Initialization, Learning Rate, Stochastic Gradient Descent and Mini-batch, Recipe for Learning)
Acknowledgement
• Thanks to 余朗祺 for correcting spelling errors on the slides during the lecture.
• Thanks to 吳柏瑜 for correcting notation errors on the slides.
• Thanks to Yes Huang for correcting typos on the slides.