Neural Network (Basic Ideas)
Hung-yi Lee
speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2...

Transcript of Neural Network (Basic Ideas) (75 pages)

Page 1:

Neural Network (Basic Ideas)

Hung-yi Lee

Page 2:

Learning ≈ Looking for a Function

• Speech Recognition: f(speech signal) = "你好" ("Hello")

• Handwritten Recognition: f(image of a digit) = "2"

• Weather forecast: f(weather today) = "sunny tomorrow"

• Play video games: f(positions and number of enemies) = "fire"

Page 3:

Framework

Training: pick the best function f* from the model.

• Model: a hypothesis function set {f1, f2, …}
• Training Data: {(x^1, ŷ^1), (x^2, ŷ^2), …}, pairs of a function input x with its labeled output ŷ
• "Best" Function: f*

Testing: y = f*(x)

x: function input; y: function output (label). E.g. x is an image of a handwritten "2", and y is the label "2".

Page 4:

Outline

1. What is the model (function hypothesis set)?

2. What is the "best" function?

3. How to pick the "best" function?

Page 5:

Task Considered Today

• Classification: map an input object to a class

Binary Classification (only two classes): Class A (yes) / Class B (no)

• Spam filtering: Is an e-mail spam or not?
• Recommendation systems: Recommend the product to the customer or not?
• Malware detection: Is the software malicious or not?
• Stock prediction: Will the future value of a stock increase or not?

Page 6:

Task Considered Today

• Classification: map an input object to a class

Binary Classification: only two classes, Class A (yes) / Class B (no)

Multi-class Classification: more than two classes, Class A / Class B / Class C / ……

Page 7:

Multi-class Classification

• Handwriting Digit Classification
  Input: an image of a digit. Classes: "1", "2", …, "9", "0" (10 classes)

• Image Recognition
  Input: an image. Classes: "dog", "cat", "book", … (thousands of classes)

Page 8:

Multi-class Classification

• Real speech recognition is not multi-class classification
  Input: an utterance. Classes: "hi", "how are you", "I am sorry", …… The classes cannot be enumerated.

• HW1 is multi-class classification
  Input: a frame of the acoustic signal. Which phoneme does the frame belong to? The classes are the phonemes, e.g. /b/ /b/ /ε/.

Page 9:

1. What is the model?

Page 10:

What is the function we are looking for?

• Classification
  • x: input object to be classified; y: class
  • Assume both x and y can be represented as fixed-size vectors
  • x is a vector with N dimensions, and y is a vector with M dimensions

y = f(x), f: R^N → R^M

Page 11:

What is the function we are looking for?

• Handwriting Digit Classification: x is the image, y is the class; f: R^N → R^M

• Input x: each pixel corresponds to an element in the vector (1 for ink, 0 otherwise). A 16 x 16 image gives 16 x 16 = 256 dimensions.

• Output y: 10 dimensions for digit recognition; dimension i answers ""i" or not" ("1" or not, "2" or not, "3" or not, …). For the digit "1", y = (1, 0, 0, …); for the digit "2", y = (0, 1, 0, …).
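As a quick sketch of this vector representation (the image contents and the helper below are illustrative, not from the slides):

```python
import numpy as np

# A hypothetical 16 x 16 binary image (1 for ink, 0 otherwise),
# flattened into a 256-dimensional input vector x.
image = np.zeros((16, 16), dtype=float)
image[2:14, 7] = 1.0          # a crude vertical stroke, like a "1"
x = image.flatten()           # x in R^256

def one_hot(digit, num_classes=10):
    """Target vector y: dimension i answers '"i" or not'.

    Classes are ordered "1", "2", ..., "9", "0" as on the slide,
    so digit 0 goes to the last dimension.
    """
    y = np.zeros(num_classes)
    y[digit - 1 if digit != 0 else 9] = 1.0
    return y

y = one_hot(1)   # the label "1" -> (1, 0, 0, ..., 0)
```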

Page 12:

1. What is the model?
A Layer of Neurons

Page 13:

Single Neuron

f: R^N → R

z = w_1 x_1 + w_2 x_2 + …… + w_N x_N + b

w_1, …, w_N are the weights and b is the bias. The activation function σ maps z to the output:

y = σ(z) = 1 / (1 + e^(−z))

This sigmoid activation always lies between 0 and 1.

Page 14:

Single Neuron

f: R^N → R

The bias can be viewed as a weight on a constant input 1:

z = w_1 x_1 + w_2 x_2 + …… + w_N x_N + b · 1

y = σ(z) = 1 / (1 + e^(−z))
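The single-neuron computation above can be sketched in a few lines; the weights and inputs here are arbitrary illustrative values:

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: weighted sum plus bias, then sigmoid activation."""
    z = np.dot(w, x) + b             # z = w_1 x_1 + ... + w_N x_N + b
    return 1.0 / (1.0 + np.exp(-z))  # y = sigma(z), always in (0, 1)

# Example with arbitrary weights (f: R^3 -> R)
x = np.array([1.0, 0.0, 2.0])
w = np.array([0.5, -1.0, 0.25])
b = -1.0
y = neuron(x, w, b)   # z = 0.5 + 0.5 - 1.0 = 0, so y = sigma(0) = 0.5
```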

Page 15:

Single Neuron

• A single neuron can only do binary classification; it cannot handle multi-class classification.

f: R^N → R. E.g. for detecting the digit "2": if y ≥ 0.5, the input is "2"; if y < 0.5, the input is not "2".

Page 16:

A Layer of Neurons

• Handwriting digit classification
  • Classes: "1", "2", …, "9", "0"
  • 10 classes, so use 10 neurons: neuron i outputs y_i, answering ""i" or not"

f: R^N → R^M. If y_2 is the max, then the image is "2".

Page 17:

1. What is the model?
Limitation of Single Layer

Page 18:

Limitation of Single Layer

A single neuron: z = w_1 x_1 + w_2 x_2 + b, a = σ(z); output "yes" if a ≥ threshold, "no" if a < threshold.

Target input/output:

x1 x2 | Output
 0  0 | No  (a < threshold)
 0  1 | Yes (a ≥ threshold)
 1  0 | Yes (a ≥ threshold)
 1  1 | No  (a < threshold)

Can we realize this with a single neuron?

Page 19:

Limitation of Single Layer

• No, we can't ……

The neuron z = w_1 x_1 + w_2 x_2 + b, a = σ(z) draws a straight decision boundary in the (x1, x2) plane, and no straight line separates {(0, 1), (1, 0)} from {(0, 0), (1, 1)}.

Page 20:

Limitation of Single Layer

Decompose the target function: Output = (x1 OR x2) AND NOT(x1 AND x2). Each part is linearly separable, so each is realizable by a single neuron z = w_1 x_1 + w_2 x_2 + b.

x1 x2 | x1 OR x2 | NOT(x1 AND x2) | Output
 0  0 |    0     |       1        |  No
 0  1 |    1     |       1        |  Yes
 1  0 |    1     |       1        |  Yes
 1  1 |    1     |       0        |  No

Page 21:

Neural Network

Connect neurons into layers: from inputs x1, x2 (each neuron also taking a constant-1 bias input), two hidden neurons compute a_1 = σ(z_1) for the OR part and a_2 = σ(z_2) for the NOT-AND part, and an output neuron computes a = σ(z) from a_1 and a_2 for the final AND. a_1 and a_2 are Hidden Neurons, and the whole is a Neural Network.
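The construction above can be checked numerically. The weights below are hand-picked illustrative values (not taken from the slides): a_1 approximates x1 OR x2, a_2 approximates NOT(x1 AND x2), and the output neuron takes their AND:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    # Hidden layer: a1 ~ (x1 OR x2), a2 ~ NOT(x1 AND x2)
    a1 = sigmoid(20 * x1 + 20 * x2 - 10)
    a2 = sigmoid(-20 * x1 - 20 * x2 + 30)
    # Output neuron: a ~ (a1 AND a2), i.e. the target XOR-like mapping
    return sigmoid(20 * a1 + 20 * a2 - 30)

# Threshold at 0.5, as in the yes/no rule on the earlier slides
outputs = [int(xor_net(x1, x2) > 0.5)
           for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs == [0, 1, 1, 0]: No, Yes, Yes, No
```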

Page 22:

The hidden neurons transform the input space:

(x1, x2) = (0, 0) → (a_1, a_2) = (0.27, 0.27)
(x1, x2) = (0, 1) → (a_1, a_2) = (0.05, 0.73)
(x1, x2) = (1, 0) → (a_1, a_2) = (0.73, 0.05)
(x1, x2) = (1, 1) → (a_1, a_2) = (0.27, 0.27)

The two "No" inputs, (0, 0) and (1, 1), collapse to the same point.

Page 23:

In the (a_1, a_2) space, the output neuron z = w_1 a_1 + w_2 a_2 + b, a = σ(z) only has to separate (0.27, 0.27) from (0.73, 0.05) and (0.05, 0.73), which a straight line can do.

Page 24:

1. What is the model?
Neural Network

Page 25:

Neural Network as Model

Input: vector x = (x_1, x_2, …, x_N), fed to the Input Layer.
Layer 1, Layer 2, …, Layer L−1: Hidden Layers. Layer L: Output Layer.
Output: vector y = (y_1, y_2, …, y_M).

Fully connected feedforward network, f: R^N → R^M.

Deep Neural Network: many hidden layers.

Page 26:

Notation

Layer l−1 has N_(l−1) nodes (neurons 1, 2, …, j, …); Layer l has N_l nodes (neurons 1, 2, …, i, …).

Output of a neuron: a_i^l is the output of neuron i at Layer l (e.g. a_j^(l−1) at Layer l−1).

Output of one layer: a^l is the vector (a_1^l, a_2^l, …, a_i^l, …).

Page 27:

Notation

w_ij^l: the weight from neuron j (at Layer l−1) to neuron i (at Layer l).

W^l: the N_l × N_(l−1) weight matrix from Layer l−1 to Layer l:

W^l = | w_11^l  w_12^l  … |
      | w_21^l  w_22^l  … |
      |   …       …       |

Page 28:

Notation

b_i^l: the bias for neuron i at Layer l (the weight on a constant input 1).

b^l = (b_1^l, b_2^l, …): the biases for all neurons in Layer l.

Page 29:

Notation

z_i^l: the input of the activation function for neuron i at Layer l:

z_i^l = w_i1^l a_1^(l−1) + w_i2^l a_2^(l−1) + …… + b_i^l = Σ_j w_ij^l a_j^(l−1) + b_i^l

z^l: the inputs of the activation function for all the neurons in Layer l, as a vector.

Page 30:

Notation - Summary

a_i^l  : output of a neuron
a^l    : output of a layer
z_i^l  : input of activation function
z^l    : input of activation function for a layer
w_ij^l : a weight
W^l    : a weight matrix
b_i^l  : a bias
b^l    : a bias vector

Page 31:

Relations between Layer Outputs

Layer l−1 (N_(l−1) nodes) outputs a^(l−1) = (a_1^(l−1), a_2^(l−1), …, a_j^(l−1), …). Layer l (N_l nodes) computes z^l = (z_1^l, z_2^l, …, z_i^l, …) and outputs a^l. What is the relation a^(l−1) → z^l → a^l?

Page 32:

Relations between Layer Outputs

z_1^l = w_11^l a_1^(l−1) + w_12^l a_2^(l−1) + …… + b_1^l
z_2^l = w_21^l a_1^(l−1) + w_22^l a_2^(l−1) + …… + b_2^l
……
z_i^l = w_i1^l a_1^(l−1) + w_i2^l a_2^(l−1) + …… + b_i^l
……

Stacking these into vectors:

| z_1^l |   | w_11^l  w_12^l  … |   | a_1^(l−1) |   | b_1^l |
| z_2^l | = | w_21^l  w_22^l  … | · | a_2^(l−1) | + | b_2^l |
|  ……   |   |   …       …       |   |    ……     |   |  ……   |

z^l = W^l a^(l−1) + b^l

Page 33:

Relations between Layer Outputs

Each neuron applies the activation function to its z:

a_i^l = σ(z_i^l)

and element-wise for the whole layer:

a^l = σ(z^l)

Page 34:

Relations between Layer Outputs

z^l = W^l a^(l−1) + b^l
a^l = σ(z^l)

Therefore:

a^l = σ(W^l a^(l−1) + b^l)

Page 35:

Function of Neural Network

Input (vector x) → Layer 1 (W^1, b^1) → Layer 2 (W^2, b^2) → … → Layer L (W^L, b^L) → Output (vector y)

a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
……
y = a^L = σ(W^L a^(L−1) + b^L)

Page 36:

Function of Neural Network

Input (vector x) → Layer 1 (W^1, b^1) → Layer 2 (W^2, b^2) → … → Layer L (W^L, b^L) → Output (vector y)

y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
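A minimal sketch of this composed function in code; the layer sizes and random parameters below are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """y = f(x) = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)."""
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^(l-1) + b^l)
    return a

# A 3-layer network f: R^4 -> R^2 with random parameters
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 2]   # N, two hidden widths, M
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

y = forward(rng.normal(size=4), Ws, bs)   # output vector in (0, 1)^2
```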

Page 37:

2. What is the “best” function?

Page 38:

Best Function = Best Parameters

y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)

Write f(x; θ), where θ = {W^1, b^1, W^2, b^2, …, W^L, b^L} is the parameter set. Because different parameters W and b lead to different functions, fixing the network structure and varying θ is the formal way to define a function set. Picking the "best" function f* therefore means picking the "best" parameter set θ*.

Page 39:

Cost Function

• Define a function C(θ) of the parameter set θ
• C(θ) evaluates how bad a parameter set is
• The best parameter set θ* is the one that minimizes C(θ):
  θ* = arg min_θ C(θ)
• C(θ) is called the cost/loss/error function
• If you instead define the goodness of the parameter set by another function O(θ), then O(θ) is called the objective function

Page 40:

Cost Function

• Handwriting Digit Classification

Given training data: {(x^1, ŷ^1), …, (x^r, ŷ^r), …, (x^R, ŷ^R)}

C(θ) = (1/R) Σ_r ‖f(x^r; θ) − ŷ^r‖   (sum over all training examples)

E.g. for an image of "2", the target is ŷ = (0, 1, 0, …); if the network outputs y = (0.2, 0.4, 0.1, …), minimize the distance between y and ŷ.
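A sketch of this cost; the network outputs below are made-up illustrative numbers standing in for f(x^r; θ):

```python
import numpy as np

def cost(predictions, targets):
    """C(theta) = (1/R) * sum_r ||f(x^r; theta) - y_hat^r||."""
    return float(np.mean([np.linalg.norm(y - y_hat)
                          for y, y_hat in zip(predictions, targets)]))

# Illustrative outputs for two training images, labeled "2" and "1"
predictions = [np.array([0.2, 0.4, 0.1]), np.array([0.7, 0.2, 0.1])]
targets     = [np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])]

C = cost(predictions, targets)   # smaller C means a better parameter set
```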

Page 41:

3. How to pick the “best” function?

Gradient Descent

Page 42:

Statement of Problems

• Statement of problems:
  • There is a function C(θ)
  • θ represents the parameter set: θ = {θ1, θ2, θ3, ……}
  • Find θ* that minimizes C(θ)
• Brute force?
  • Enumerate all possible θ
• Calculus?
  • Find θ* such that ∂C(θ*)/∂θ1 = 0, ∂C(θ*)/∂θ2 = 0, ……

Page 43:

Gradient Descent – Idea

• For simplification, first consider that θ has only one variable

Drop a ball somewhere on the curve C(θ); it rolls downhill through positions θ0, θ1, θ2, θ3, …. When the ball stops, we find a local minimum.

Page 44:

Gradient Descent – Idea

• For simplification, first consider that θ has only one variable

Randomly start at θ0
Compute dC(θ0)/dθ:  θ1 ← θ0 − η dC(θ0)/dθ
Compute dC(θ1)/dθ:  θ2 ← θ1 − η dC(θ1)/dθ
……

η is called the "learning rate"
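The one-variable update above can be run on a toy cost; C(θ) = (θ − 3)^2 below is an assumed example, not from the slides:

```python
def dC(theta):
    """Derivative of the toy cost C(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

eta = 0.1      # learning rate
theta = 0.0    # starting point theta^0
for _ in range(100):
    theta = theta - eta * dC(theta)   # theta^(i+1) <- theta^i - eta * dC/dtheta
# theta converges toward the minimum at theta = 3
```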

Page 45:

Gradient Descent

• Suppose that θ has two variables {θ1, θ2}

Randomly start at θ^0 = (θ1^0, θ2^0)

Compute the gradient of C(θ) at θ^0:  ∇C(θ^0) = (∂C(θ^0)/∂θ1, ∂C(θ^0)/∂θ2)

Update parameters:

(θ1^1, θ2^1) = (θ1^0, θ2^0) − η (∂C(θ^0)/∂θ1, ∂C(θ^0)/∂θ2), i.e. θ^1 = θ^0 − η∇C(θ^0)

Compute the gradient of C(θ) at θ^1: ∇C(θ^1) = (∂C(θ^1)/∂θ1, ∂C(θ^1)/∂θ2) ……

Page 46:

Gradient Descent

Start at position θ^0
Compute gradient at θ^0; move to θ^1 = θ^0 − η∇C(θ^0)
Compute gradient at θ^1; move to θ^2 = θ^1 − η∇C(θ^1)
……

In the (θ1, θ2) plane, each movement θ^0 → θ^1 → θ^2 → θ^3 → … is in the direction opposite to the gradient ∇C(θ^0), ∇C(θ^1), ∇C(θ^2), ∇C(θ^3), ….
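The same updates in vector form, θ^(i+1) = θ^i − η∇C(θ^i), on an assumed two-variable quadratic cost (not from the slides):

```python
import numpy as np

def grad_C(theta):
    """Gradient of the toy cost C(theta) = (theta1 - 1)^2 + 2 (theta2 + 2)^2."""
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 2.0)])

eta = 0.1
theta = np.array([5.0, 5.0])   # starting point theta^0
for _ in range(200):
    theta = theta - eta * grad_C(theta)   # move opposite to the gradient
# theta approaches the minimum at (1, -2)
```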

Page 47:

Formal Derivation of Gradient Descent

• Suppose that θ has two variables {θ1, θ2}

Starting at θ^0 in the (θ1, θ2) plane of C(θ), repeat: given a point, we can easily find the point with the smallest value nearby (θ^0 → θ^1 → θ^2 → …). How?

Page 48:

Formal Derivation of Gradient Descent

• Taylor series: Let h(x) be infinitely differentiable around x = x0.

h(x) = Σ_(k=0)^∞ h^(k)(x0)/k! (x − x0)^k
     = h(x0) + h′(x0)(x − x0) + h″(x0)/2! (x − x0)^2 + ……

When x is close to x0:  h(x) ≈ h(x0) + h′(x0)(x − x0)

Page 49:

E.g. Taylor series for h(x) = sin(x) around x0 = π/4:

sin(x) = sin(π/4) + cos(π/4)(x − π/4) − (sin(π/4)/2!)(x − π/4)^2 − ……

The approximation is good around π/4.

Page 50:

Multivariable Taylor series

h(x, y) = h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)
        + something related to (x − x0)^2 and (y − y0)^2 + ……

When x and y are close to x0 and y0:

h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x (x − x0) + ∂h(x0, y0)/∂y (y − y0)

Page 51:

Formal Derivation of Gradient Descent

Based on the Taylor series: if the red circle centered at (a, b) is small enough, then inside the red circle

C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

where s = C(a, b), u = ∂C(a, b)/∂θ1, v = ∂C(a, b)/∂θ2.

Page 52:

Formal Derivation of Gradient Descent

Find θ1 and θ2 yielding the smallest value of C(θ) in the circle:

C(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

To minimize C(θ), choose (θ1 − a, θ2 − b) in the direction opposite to (u, v), with length limited by the radius of the circle:

(θ1, θ2) = (a, b) − η(u, v) = (a, b) − η(∂C(a, b)/∂θ1, ∂C(a, b)/∂θ2)

The value of η depends on the radius of the circle. This is how gradient descent updates parameters.

Page 53:

Gradient Descent for Neural Network

Starting parameters θ0, then θ1, θ2, ……, where θ = {W^1, b^1, …, W^l, b^l, …, W^L, b^L}

Compute ∇C(θ0):  θ1 ← θ0 − η∇C(θ0)
Compute ∇C(θ1):  θ2 ← θ1 − η∇C(θ1)

∇C(θ) collects ∂C/∂w_ij^l for every weight matrix W^l and ∂C/∂b_i^l for every bias vector b^l.

Millions of parameters …… To compute the gradients efficiently, we use backpropagation.

Page 54:

Stuck at local minima?

• Who is Afraid of Non-Convex Loss Functions?

• http://videolectures.net/eml07_lecun_wia/

• Deep Learning: Theoretical Motivations

• http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/

Saddle point

Page 55:

3. How to pick the “best” function?

Practical Issues for neural networks

Page 56:

Practical Issues for neural networks

• Parameter Initialization

• Learning Rate

• Stochastic gradient descent and Mini-batch

• Recipe for Learning

Page 57:

Parameter Initialization

• For gradient descent, we need to pick initialization parameters θ0.
• The initialization parameters have some influence on the training.
• We will come back to this issue in the future.
• Suggestion for today:
  • Do not set all the parameters in θ0 equal
  • Set the parameters in θ0 randomly

Page 58:

Learning Rate

• Set the learning rate η in θ^i ← θ^(i−1) − η∇C(θ^(i−1)) carefully

(Figure: cost vs. number of parameter updates on the error surface, comparing a very large, a large, a small, and a just-right η.)

Page 59:

Learning Rate

• Toy Example: a single linear neuron, y = z = w x + b

Training Data (20 examples):
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]

The best parameters are around w* = 1, b* = 0.

• Set the learning rate η in θ^i ← θ^(i−1) − η∇C(θ^(i−1)) carefully
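The toy example above can be reproduced with plain gradient descent on a mean-squared-error cost; η = 0.01 and the iteration count are assumed settings (the slides compare several values of η):

```python
import numpy as np

# Training data from the toy example (20 examples)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5,
              5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5])
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b, eta = 0.0, 0.0, 0.01
for _ in range(5000):
    err = w * x + b - y                # residuals of the model y = w x + b
    grad_w = 2.0 * np.mean(err * x)    # dC/dw for C = mean squared error
    grad_b = 2.0 * np.mean(err)        # dC/db
    w, b = w - eta * grad_w, b - eta * grad_b
# (w, b) converges near the target w* ~ 1, b* ~ 0
```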

Page 60:

• Toy Example

(Figure: the error surface C(w, b); gradient descent θ^i ← θ^(i−1) − η∇C(θ^(i−1)) moves from the start point toward the target.)

Page 61:

• Toy Example: different learning rates η (0.001, 0.01, 0.1)

(Figure: the number of updates needed to reach the target differs by an order of magnitude across these η, roughly 30k updates vs. roughly 3k updates.)

Page 62:

Stochastic Gradient Descent and Mini-batch

Gradient Descent:

C(θ) = (1/R) Σ_r C^r(θ), where C^r(θ) = ‖f(x^r; θ) − ŷ^r‖

θ^i ← θ^(i−1) − η∇C(θ^(i−1)) = θ^(i−1) − η (1/R) Σ_r ∇C^r(θ^(i−1))

Stochastic Gradient Descent:

Pick an example x^r:  θ^i ← θ^(i−1) − η∇C^r(θ^(i−1))

If all examples x^r have equal probabilities to be picked:

E[∇C^r(θ^(i−1))] = (1/R) Σ_r ∇C^r(θ^(i−1))

Faster! Better!

Page 63:

Stochastic Gradient Descent and Mini-batch

What is an epoch? When using stochastic gradient descent, starting at θ0 with training data {(x^1, ŷ^1), (x^2, ŷ^2), …, (x^R, ŷ^R)}:

pick x^1:  θ1 ← θ0 − η∇C^1(θ0)
pick x^2:  θ2 ← θ1 − η∇C^2(θ1)
……
pick x^r:  θ^r ← θ^(r−1) − η∇C^r(θ^(r−1))
……
pick x^R:  θ^R ← θ^(R−1) − η∇C^R(θ^(R−1))   ← seen all the examples once: one epoch
pick x^1:  θ^(R+1) ← θ^R − η∇C^1(θ^R)
……

Page 64:

Stochastic Gradient Descent and Mini-batch

• Toy Example: 1 epoch

Gradient Descent: sees all examples, and updates once after seeing all of them.
Stochastic Gradient Descent: sees only one example per update. If there are 20 examples, it updates 20 times in one epoch.

Page 65:

Stochastic Gradient Descent and Mini-batch

Gradient Descent:  θ^i ← θ^(i−1) − η (1/R) Σ_r ∇C^r(θ^(i−1))

Stochastic Gradient Descent: pick an example x^r:  θ^i ← θ^(i−1) − η∇C^r(θ^(i−1))

Mini-batch Gradient Descent: pick B examples as a batch b (B is the batch size); average the gradient of the examples in the batch:

θ^i ← θ^(i−1) − η (1/B) Σ_(x^r ∈ b) ∇C^r(θ^(i−1))

Shuffle your data.
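A sketch of the mini-batch loop above (shuffle each epoch, then update once per batch of B examples), using an assumed simple linear model with a squared-error per-example cost:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(0.0, 10.0, 0.5)   # 20 examples
y = 2.0 * x + 1.0               # assumed target function to recover

w, b, eta, B = 0.0, 0.0, 0.01, 4   # B is the batch size
for epoch in range(200):
    order = rng.permutation(len(x))       # shuffle your data
    for start in range(0, len(x), B):
        idx = order[start:start + B]      # pick B examples as a batch
        err = w * x[idx] + b - y[idx]
        # average the gradient of the examples in the batch
        w -= eta * 2.0 * np.mean(err * x[idx])
        b -= eta * 2.0 * np.mean(err)
# (w, b) approaches (2, 1)
```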

Page 66:

Stochastic Gradient Descent and Mini-batch

• Handwriting Digit Classification

(Figure: training curves comparing batch size = 1 and Gradient Descent.)

Page 67:

Stochastic Gradient Descent and Mini-batch

• Why is mini-batch faster than stochastic gradient descent in practice?

Stochastic gradient descent computes z^1 = W^1 x once per example. With a mini-batch, the B input vectors can be stacked into a matrix, and all the z^1 are obtained with a single matrix product:

[z^1 z^1 ……] = W^1 [x x ……]

Matrix operations are highly optimized, so practically the single large product of the mini-batch is faster than B separate matrix-vector products.
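The matrix trick above can be verified directly: stacking B input vectors as columns of a matrix X and computing one product W^1 X gives exactly the same z^1 vectors as B separate matrix-vector products (sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))                 # W^1 for a layer R^3 -> R^5
xs = [rng.normal(size=3) for _ in range(4)]  # a mini-batch of B = 4 inputs

# Stochastic style: one matrix-vector product per example
zs_loop = [W1 @ x for x in xs]

# Mini-batch style: stack inputs as columns, one matrix-matrix product
X = np.stack(xs, axis=1)   # shape (3, B)
Z = W1 @ X                 # shape (5, B): all z^1 at once

same = all(np.allclose(Z[:, i], zs_loop[i]) for i in range(4))
```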

Page 68:

Recipe for Learning

• Data provided in homework

Training Data: pairs (x, y), used to pick the "Best" Function f*.
Testing Data: inputs x only, split into Validation and Real Testing; f* predicts their y.

Page 69:

Recipe for Learning

• Data provided in homework

Training Data: pairs (x, y).
Testing Data: Validation (we immediately know the accuracy) and Real Testing (we do not know the accuracy until the deadline, and it is what really counts).

Page 70:

Recipe for Learning

Do I get good results on the training set? If no, modify your training process. Possible reasons:

• Your code has a bug.
• Bad model: there is no good function in the hypothesis function set. Probably you need a bigger network.
• Cannot find a good function: stuck at local minima, saddle points, …. Change the training strategy.

Page 71:

Recipe for Learning

Do I get good results on the training set? If no, modify your training process. If yes: do I get good results on the validation set? If yes, done. If no, work on preventing overfitting; your code usually does not have a bug in this situation.

Page 72:

Recipe for Learning - Overfitting

• You pick a "best" parameter set θ*

On the training data {(x^r, ŷ^r)}: f(x^r; θ*) → ŷ^r for each r.
However, on testing data x^u: f(x^u; θ*) ≠ ŷ^u.

Training data and testing data have different distributions.

Page 73:

Recipe for Learning - Overfitting

• Panacea: have more training data
• You can do that in real applications, but you can't do that in the homework.
• We will come back to this issue in the future.

Page 74:

Concluding Remarks

1. What is the model (function hypothesis set)? Neural Network

2. What is the "best" function? Cost Function

3. How to pick the "best" function? Gradient Descent: Parameter Initialization, Learning Rate, Stochastic Gradient Descent and Mini-batch, Recipe for Learning

Page 75:

Acknowledgement

• Thanks to 余朗祺 for correcting spelling errors on the slides during class

• Thanks to 吳柏瑜 for correcting notation errors on the slides

• Thanks to Yes Huang for correcting typos on the slides