
Deep Learning

Deep learning attracts lots of attention. I believe you have seen many exciting results before.

Deep learning trends at Google. Source: SIGMOD/Jeff Dean

Ups and downs of Deep Learning

• 1958: Perceptron (linear model)

• 1969: Perceptron has limitations

• 1980s: Multi-layer perceptron
  • Not significantly different from today's DNNs

• 1986: Backpropagation
  • Usually more than 3 hidden layers did not help

• 1989: 1 hidden layer is "good enough", why deep?

• 2006: RBM initialization (breakthrough)

• 2009: GPU

• 2011: Started to become popular in speech recognition

• 2012: Won the ILSVRC image competition

Three Steps for Deep Learning

• Step 1: define a set of functions
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Neural Network

[Figure: a single "neuron" computes a weighted sum z of its inputs plus a bias, then passes z through an activation function to produce its output.]

Different connections lead to different network structures.

Network parameters 𝜃: all the weights and biases in the "neurons"
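As a small sketch (assuming the sigmoid activation that the following slides use), a single neuron is just a weighted sum of its inputs plus a bias, passed through the activation; the numbers below are the first-layer neuron from the example that follows:

```python
import math

def neuron(inputs, weights, bias):
    """One "neuron": weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

# First-layer neuron of the example on the following slides:
print(neuron(inputs=[1.0, -1.0], weights=[1.0, -2.0], bias=1.0))   # z = 4 -> about 0.98
```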

Fully Connected Feedforward Network

Sigmoid Function: 𝜎(z) = 1 / (1 + e^(−z))

[Figure: example network with input (1, −1). The first-layer neurons have weights (1, −2) and (−1, 1) and biases 1 and 0, so z = 4 and z = −2, giving outputs 𝜎(4) ≈ 0.98 and 𝜎(−2) ≈ 0.12.]

Fully Connected Feedforward Network

[Figure: the same example continued through two more layers. For input (1, −1), the layer outputs are (0.98, 0.12), then (0.86, 0.11), and finally (0.62, 0.83).]

Fully Connected Feedforward Network

[Figure: the same network applied to input (0, 0): the layer outputs are (0.73, 0.5), then (0.72, 0.12), and finally (0.51, 0.85).]

𝑓( [1, −1] ) = [0.62, 0.83]        𝑓( [0, 0] ) = [0.51, 0.85]

This is a function: input vector in, output vector out.

Given a network structure, we define a function set.

Fully Connected Feedforward Network

[Figure: input layer x1 … xN, hidden layers (Layer 1, Layer 2, …, Layer L) of neurons, and output layer y1 … yM. Input Layer → Hidden Layers → Output Layer.]

Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error

http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Residual Net (2015): 152 layers, 3.57% error (special structure; for scale, Taipei 101 has 101 floors)

Matrix Operation

The first layer of the earlier example, written as one matrix operation, with W1 = [[1, −2], [−1, 1]], x = [1, −1]ᵀ and b1 = [1, 0]ᵀ:

𝜎( W1 x + b1 ) = 𝜎( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ

Neural Network

[Figure: network with input x = (x1, …, xN), hidden-layer outputs a1, a2, …, and output y = (y1, …, yM); layer l has weight matrix Wl and bias vector bl.]

a1 = 𝜎( W1 x + b1 )
a2 = 𝜎( W2 a1 + b2 )
  ⋮
y  = 𝜎( WL aL−1 + bL )

Neural Network

Composing all the layers, the whole network is one function:

y = 𝑓(x) = 𝜎( WL ⋯ 𝜎( W2 𝜎( W1 x + b1 ) + b2 ) ⋯ + bL )

Using parallel computing techniques to speed up the matrix operations.
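A minimal NumPy sketch (not from the slides) of this forward pass: the first layer's W1, b1 and x match the example above, and the generic forward() loop is the chain of matrix operations that parallel hardware such as GPUs speeds up.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))        # element-wise sigmoid

# The layer-1 computation from the slide, as a matrix operation:
W1 = np.array([[1.0, -2.0],
               [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
x  = np.array([1.0, -1.0])
print(sigma(W1 @ x + b1))                  # -> approximately [0.98, 0.12]

def forward(x, Ws, bs):
    """y = f(x) = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)."""
    a = x
    for W, b in zip(Ws, bs):
        a = sigma(W @ a + b)               # one layer = matrix multiply + bias + activation
    return a
```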

Output Layer

[Figure: the hidden layers act as a feature extractor replacing feature engineering; the output layer then acts as a multi-class classifier, with softmax as its activation.]
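As a small sketch (the score values are my own, not from the slides), the softmax used by the output layer turns the last layer's scores into confidences that lie between 0 and 1 and sum to 1:

```python
import numpy as np

def softmax(z):
    """Softmax for the multi-class output layer: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))      # subtracting the max improves numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # -> roughly [0.88, 0.12, 0.00]
```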

Example Application

Input: a 16 × 16 image of a handwritten digit, flattened into a 256-dim vector x (ink → 1, no ink → 0).

Output: a 10-dim vector y; each dimension represents the confidence that the image is a particular digit (y1: "is 1", y2: "is 2", …, y10: "is 0").

For example, if y1 = 0.1, y2 = 0.7, …, y10 = 0.2, the image is recognized as "2".
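A tiny sketch of the input encoding described above, assuming a hypothetical 16 × 16 binary image; only the flattening and the ink → 1 / no ink → 0 convention come from the slides:

```python
import numpy as np

# A hypothetical 16 x 16 binary image: 1 where there is ink, 0 where there is not.
image = np.zeros((16, 16))
image[2:14, 7:9] = 1.0        # a crude vertical stroke, purely illustrative

x = image.reshape(256)        # flatten into the 256-dim input vector of the network
print(x.shape)                # (256,)
```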

Example Application

• Handwriting Digit Recognition: the machine takes the image as input and outputs "2".

What is needed is a function ……
  Input: 256-dim vector
  Output: 10-dim vector

[Figure: a neural network (input layer, hidden layers, output layer) plays the role of this function.]

Example Application

[Figure: a network with inputs x1, x2, …, xN, hidden Layers 1 through L, and outputs y1 … y10 ("is 1", "is 2", …, "is 0"), recognizing the image as "2".]

A function set containing the candidates for Handwriting Digit Recognition.

You need to decide the network structure so that your function set contains a good function.

FAQ

• Q: How many layers? How many neurons for each layer?
  • Trial and error + intuition
• Q: Can the structure be automatically determined?
  • E.g., evolutionary artificial neural networks
• Q: Can we design the network structure?
  • E.g., Convolutional Neural Network (CNN)

Three Steps for Deep Learning

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Loss for an Example

[Figure: given a set of parameters, the network (inputs x1 … x256, softmax output) produces y = (y1, …, y10) for a training image of "1"; the target ŷ is the one-hot vector (1, 0, 0, …, 0).]

The loss is the cross entropy between the network output y and the target ŷ:

C(y, ŷ) = −Σᵢ₌₁¹⁰ ŷᵢ ln yᵢ
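A minimal sketch of this cross-entropy loss for one example; the target is the one-hot vector for "1" as in the slide, while the network output y here is a made-up softmax output:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i), with y the network output and y_hat the target."""
    return -np.sum(y_hat * np.log(y))

y_hat = np.zeros(10); y_hat[0] = 1.0    # one-hot target for the digit "1", as in the slide
y = np.full(10, 0.02); y[0] = 0.82      # a made-up softmax output (sums to 1)
print(cross_entropy(y, y_hat))          # -ln(0.82), about 0.20
```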

Total Loss

[Figure: each training example x1, x2, x3, …, xN is fed through the NN to get y1, y2, y3, …, yN, which are compared with the targets ŷ1, ŷ2, ŷ3, …, ŷN to give the per-example losses C1, C2, C3, …, CN.]

For all training data, the total loss is:

L = Σₙ₌₁ᴺ Cⁿ

Find a function in the function set that minimizes the total loss L; that is, find the network parameters 𝜽* that minimize L.
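As a small sketch, the total loss is simply the sum of the per-example cross entropies; gradient descent (the next step) then searches for the parameters that minimize it:

```python
import numpy as np

def total_loss(outputs, targets):
    """L = sum over n of C^n, the cross entropy of each training example."""
    return sum(-np.sum(y_hat * np.log(y)) for y, y_hat in zip(outputs, targets))
```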

Three Steps for Deep Learning

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Gradient Descent

𝜃: pick an initial value for every weight and bias, e.g., w1 = 0.2, w2 = −0.1, b1 = 0.3, ……

Compute the partial derivative of L with respect to each parameter and update it with learning rate 𝜇:

w1 ← w1 − 𝜇 𝜕L/𝜕w1    (0.2 → 0.15)
w2 ← w2 − 𝜇 𝜕L/𝜕w2    (−0.1 → 0.05)
b1 ← b1 − 𝜇 𝜕L/𝜕b1    (0.3 → 0.2)
……

Collected together, these partial derivatives form the gradient:

𝛻L = [𝜕L/𝜕w1, 𝜕L/𝜕w2, …, 𝜕L/𝜕b1, …]ᵀ

Gradient Descent

Repeat: compute the gradient at the new 𝜃 and update again.

w1: 0.2 → 0.15 → 0.09 → ……
w2: −0.1 → 0.05 → 0.15 → ……
b1: 0.3 → 0.2 → 0.10 → ……
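A minimal sketch of the update rule above; the initial parameter values (0.2, −0.1, 0.3) match the slide, but the quadratic toy loss (and therefore its gradient 2𝜃) is made up purely so the example runs end to end; in a real network the gradient comes from backpropagation:

```python
import numpy as np

def gradient_descent_step(theta, grad_L, mu=0.1):
    """One update: theta <- theta - mu * dL/dtheta for every weight and bias (mu is the learning rate)."""
    return theta - mu * grad_L

# Toy illustration: a made-up loss L(theta) = sum(theta**2), whose gradient is 2 * theta.
theta = np.array([0.2, -0.1, 0.3])       # initial parameters, as in the slide
for _ in range(5):
    grad = 2 * theta                     # gradient of the toy loss at the current theta
    theta = gradient_descent_step(theta, grad)
print(theta)                             # the parameters move toward the minimum at 0
```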

Gradient Descent

This is the "learning" of machines in deep learning ……

Even AlphaGo uses this approach.

I hope you are not too disappointed :p

[Figure: what people imagine machine learning looks like vs. what it actually is (gradient descent).]

Backpropagation

• Backpropagation: an efficient way to compute 𝜕L/𝜕w in a neural network

• libdnn (developed by NTU student 周伯威)

Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
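Backpropagation itself is left to the reference above; as a minimal, made-up illustration of the quantity it computes, here is the chain-rule derivative 𝜕L/𝜕w for a single sigmoid neuron with a squared-error loss, checked against a finite-difference estimate (this is the chain rule for one neuron, not the full backpropagation algorithm):

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid neuron, y = sigma(w*x + b), with squared-error loss L = (y - t)^2 for one example.
x, t = 1.5, 0.0          # made-up input and target
w, b = 0.8, -0.3         # made-up parameters

# Chain rule, which backpropagation applies layer by layer through a whole network:
# dL/dw = dL/dy * dy/dz * dz/dw = 2*(y - t) * y*(1 - y) * x
z = w * x + b
y = sigma(z)
grad_w = 2 * (y - t) * y * (1 - y) * x

# Finite-difference check of the same derivative:
eps = 1e-6
L_plus  = (sigma((w + eps) * x + b) - t) ** 2
L_minus = (sigma((w - eps) * x + b) - t) ** 2
grad_w_numeric = (L_plus - L_minus) / (2 * eps)

print(grad_w, grad_w_numeric)   # the two estimates agree closely
```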

Concluding Remarks

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deeper is Better?

What are the benefits of deep architecture?

Layer X Size    Word Error Rate (%)        Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                       1 X 3772        22.5
7 X 2k          17.1                       1 X 4634        22.6
                                           1 X 16k         22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

Not surprising: deeper networks have more parameters, and more parameters give better performance. (The single-hidden-layer networks on the right are sized to have a comparable number of parameters to the deep networks in the same row.)

Universality Theorem

Any continuous function f : Rᴺ → Rᴹ can be realized by a network with one hidden layer (given enough hidden neurons).

Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why a "deep" neural network and not a "fat" neural network? (next lecture)

Learning Deep Learning in Depth ("深度學習深度學習")

• My Course: Machine learning and having it deep and structured

• http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

• 6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351

• “Neural Networks and Deep Learning”

• written by Michael Nielsen

• http://neuralnetworksanddeeplearning.com/

• “Deep Learning”

• written by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

• http://www.deeplearningbook.org

Acknowledgment

• Thanks to Victor Chen for spotting typos in the slides