Deep Learning - National Taiwan University
Transcript of lecture slides: speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/DL... (2017-03-13)

[Page 1] Deep Learning
[Page 2] Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
[Figure: deep learning trends at Google. Source: SIGMOD / Jeff Dean]
[Page 3] Ups and downs of Deep Learning
• 1958: Perceptron (linear model)
• 1969: Perceptron has limitations
• 1980s: Multi-layer perceptron
  • Not significantly different from today's DNNs
• 1986: Backpropagation
  • Usually more than 3 hidden layers was not helpful
• 1989: 1 hidden layer is "good enough", so why go deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPU
• 2011: Started to become popular in speech recognition
• 2012: Won the ILSVRC image competition
[Page 4] Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
[Page 5] Neural Network
[Figure: a "neuron" computes a weighted sum z of its inputs plus a bias, then applies an activation function to z.]
Different connections lead to different network structures.
Network parameters 𝜃: all the weights and biases in the "neurons".
[Page 6] Fully Connected Feedforward Network
Sigmoid Function: σ(z) = 1 / (1 + e⁻ᶻ)
[Figure: input (1, −1) feeds two sigmoid neurons. With weights (1, −2) and (−1, 1) and biases 1 and 0, the weighted sums are z = 4 and z = −2, so the outputs are σ(4) = 0.98 and σ(−2) = 0.12.]
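As a quick check of the slide's numbers, here is the sigmoid evaluated at the two weighted sums (a NumPy sketch, not from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigma(z) = 1 / (1 + e^(-z))

print(np.round(sigmoid(4), 2))   # 0.98, the first neuron's output
print(np.round(sigmoid(-2), 2))  # 0.12, the second neuron's output
```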
[Page 7] Fully Connected Feedforward Network
[Figure: the same network extended to three layers of two neurons each. Input (1, −1) gives layer-1 outputs (0.98, 0.12), layer-2 outputs (0.86, 0.11), and final outputs (0.62, 0.83).]
[Page 8] Fully Connected Feedforward Network
[Figure: feeding input (0, 0) through the same network gives layer-1 outputs (0.73, 0.50), layer-2 outputs (0.72, 0.12), and final outputs (0.51, 0.85).]

f([1, −1]ᵀ) = [0.62, 0.83]ᵀ        f([0, 0]ᵀ) = [0.51, 0.85]ᵀ

This is a function: input vector in, output vector out.
A given network structure defines a function set.
[Page 9] Fully Connected Feedforward Network
[Figure: Input Layer (x1, x2, …, xN) → Hidden Layers (Layer 1, Layer 2, …, Layer L) → Output Layer (y1, y2, …, yM); each node in the hidden layers is a neuron.]
[Page 10] Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
[Page 11] Deep = Many hidden layers
Residual Net (2015): 152 layers, 3.57% error; this uses a special structure.
(For scale, the slide shows Taipei 101: 101 floors.)
AlexNet (2012): 16.4%   VGG (2014): 7.3%   GoogleNet (2014): 6.7%
[Page 12] Matrix Operation
The first layer of the earlier example, written as a matrix operation:

$$\sigma\!\left(\begin{bmatrix}1 & -2\\ -1 & 1\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}+\begin{bmatrix}1\\ 0\end{bmatrix}\right)=\sigma\!\left(\begin{bmatrix}4\\ -2\end{bmatrix}\right)=\begin{bmatrix}0.98\\ 0.12\end{bmatrix}$$
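A minimal NumPy sketch of this one-layer computation, using the slide's numbers (the variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Elementwise sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # layer weights from the slide
b = np.array([1.0, 0.0])      # layer biases
x = np.array([1.0, -1.0])     # input vector

a = sigmoid(W @ x + b)        # sigma([4, -2]) ~= [0.98, 0.12]
print(np.round(a, 2))
```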
[Page 13] Neural Network
[Figure: x passes through layers with weights W1, W2, …, WL and biases b1, b2, …, bL, producing activations a1, a2, …, and finally y.]

a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
……
y = σ(WL aL−1 + bL)
[Page 14] Neural Network
Composing the layers gives the whole network as one nested function:

y = f(x) = σ(WL … σ(W2 σ(W1 x + b1) + b2) … + bL)

Parallel computing techniques can be used to speed up the matrix operations.
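The forward pass is just this layer operation repeated. A sketch, assuming sigmoid activations in every layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x) = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: a <- sigma(W a + b)
    return a
```

With the W, b, and x from the earlier sketch, forward(x, [W], [b]) reproduces [0.98, 0.12].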
[Page 15] Output Layer
[Figure: Input Layer (x1, x2, …, xK) → Hidden Layers → Output Layer with Softmax (y1, y2, …, yM).]
The hidden layers act as a feature extractor, replacing feature engineering; the output layer with softmax acts as a multi-class classifier.
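A small sketch of softmax, which turns the output layer's values into a probability-like vector; the max-subtraction is a standard numerical-stability trick, not from the slides:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()          # entries are positive and sum to 1

print(np.round(softmax(np.array([3.0, 1.0, -3.0])), 2))  # [0.88 0.12 0.  ]
```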
[Page 16] Example Application
Input: a 16 × 16 image of a handwritten digit, flattened to 16 × 16 = 256 dimensions (x1, x2, …, x256); ink → 1, no ink → 0.
Output: y1, y2, …, y10, where each dimension represents the confidence of a digit ("is 1", "is 2", …, "is 0").
Example: y1 = 0.1, y2 = 0.7, …, y10 = 0.2, so the image is "2".
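Turning such an image into the 256-dim input vector might look like this (the image here is a made-up placeholder):

```python
import numpy as np

image = np.zeros((16, 16))   # hypothetical 16x16 image: ink -> 1, no ink -> 0
image[2:14, 7:9] = 1.0       # a crude vertical stroke, for illustration only
x = image.flatten()          # 256-dim input vector x1 ... x256
assert x.shape == (256,)
```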
[Page 17] Example Application
• Handwriting Digit Recognition
[Figure: a Neural Network maps x1, x2, …, x256 to y1, y2, …, y10 ("is 1", "is 2", …, "is 0"); the machine outputs "2".]
What is needed is a function with a 256-dim input vector and a 10-dim output vector.
[Page 18] Example Application
[Figure: Input Layer (x1, x2, …, xN) → Hidden Layers (Layer 1, Layer 2, …, Layer L) → Output Layer (y1, y2, …, y10: "is 1", "is 2", …, "is 0") → "2".]
The network structure gives a function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.
[Page 19] FAQ
• Q: How many layers? How many neurons for each layer? A: Trial and error + intuition.
• Q: Can the structure be automatically determined? A: E.g., Evolutionary Artificial Neural Networks.
• Q: Can we design the network structure? A: E.g., Convolutional Neural Networks (CNN).
[Page 20] Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
[Page 21] Loss for an Example
[Figure: input x1, x2, …, x256 → network → Softmax → output y = (y1, y2, …, y10); for target "1", ŷ = (1, 0, …, 0).]
Given a set of parameters, the loss for one example is the cross entropy between the output y and the target ŷ:

$$C(y, \hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$$
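A sketch of this loss, with a made-up softmax output and a one-hot target for the digit "1":

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i)."""
    return -np.sum(y_hat * np.log(y))

y_hat = np.zeros(10); y_hat[0] = 1.0   # target "1": (1, 0, ..., 0)
y = np.full(10, 0.02); y[0] = 0.82     # hypothetical softmax output
print(cross_entropy(y, y_hat))         # ~0.198; shrinks as y[0] -> 1
```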
[Page 22] Total Loss
[Figure: each training example xⁿ (n = 1, …, N) passes through the NN to give yⁿ, which is compared with its target ŷⁿ to give the loss Cⁿ.]
For all training data, the total loss is

$$L = \sum_{n=1}^{N} C^n$$

Find the network parameters 𝜃* that minimize the total loss L; that is, find a function in the function set that minimizes the total loss L.
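Summing the per-example losses gives the total loss; a sketch, assuming a forward function like the one above:

```python
import numpy as np

def total_loss(forward, xs, y_hats):
    """L = sum over n of C^n."""
    L = 0.0
    for x, y_hat in zip(xs, y_hats):      # one term per training example
        y = forward(x)                    # network output y^n
        L += -np.sum(y_hat * np.log(y))   # cross entropy C^n
    return L
```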
[Page 23] Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
[Page 24] Gradient Descent
𝜃 = {w1, w2, …, b1, …}: start from initial values, e.g. w1 = 0.2, w2 = −0.1, b1 = 0.3.
For each parameter, compute the partial derivative and update, e.g.
w1 ← w1 − μ ∂L/∂w1 (0.2 → 0.15), w2 ← w2 − μ ∂L/∂w2 (−0.1 → 0.05), b1 ← b1 − μ ∂L/∂b1 (0.3 → 0.2), ……
The vector of all these partial derivatives is the gradient:

$$\nabla L = \begin{bmatrix} \partial L / \partial w_1 \\ \partial L / \partial w_2 \\ \vdots \\ \partial L / \partial b_1 \\ \vdots \end{bmatrix}$$
[Page 25] Gradient Descent
Repeat the update step: w1: 0.2 → 0.15 → 0.09, w2: −0.1 → 0.05 → 0.15, b1: 0.3 → 0.2 → 0.10, ……
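A generic sketch of this update loop, with a toy quadratic loss standing in for the network loss and a numerical gradient standing in for backpropagation (all names and values are illustrative):

```python
import numpy as np

def L(theta):
    # Toy stand-in for the network's total loss; any differentiable L works.
    return np.sum((theta - np.array([1.0, 2.0, 0.5])) ** 2)

def grad_L(theta, eps=1e-6):
    # Central-difference gradient; backpropagation computes this efficiently.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (L(theta + d) - L(theta - d)) / (2 * eps)
    return g

theta = np.array([0.2, -0.1, 0.3])       # initial parameters, as on the slide
mu = 0.1                                 # learning rate
for _ in range(100):
    theta = theta - mu * grad_L(theta)   # theta <- theta - mu * dL/dtheta
print(theta)                             # approaches the minimizer [1.0, 2.0, 0.5]
```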
[Page 26] Gradient Descent
This is the "learning" of machines in deep learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
[Figure: what people imagine …… vs. what it actually is …..]
[Page 27] Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w in a neural network
• libdnn: a toolkit developed by NTU student 周伯威
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
[Page 28] Concluding Remarks
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
What are the benefits of deep architecture?
[Page 29] Deeper is Better?

| Layer × Size | Word Error Rate (%) | Layer × Size | Word Error Rate (%) |
|---|---|---|---|
| 1 × 2k | 24.2 | | |
| 2 × 2k | 20.4 | | |
| 3 × 2k | 18.4 | | |
| 4 × 2k | 17.8 | | |
| 5 × 2k | 17.2 | 1 × 3772 | 22.5 |
| 7 × 2k | 17.1 | 1 × 4634 | 22.6 |
| | | 1 × 16k | 22.1 |

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Not surprising: more parameters, better performance.
[Page 30] Universality Theorem
Any continuous function f: ℝᴺ → ℝᴹ can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Why a "deep" neural network rather than a "fat" one? (next lecture)
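The intuition behind the theorem (see the reference above): two shifted, steep sigmoids subtract to form a "bump", and a single hidden layer with enough such bumps can approximate any continuous function. A rough sketch with made-up parameters:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def bump(x, lo, hi, height, k=200.0):
    # Difference of two steep sigmoids: ~height on [lo, hi], ~0 elsewhere.
    return height * (sigmoid(k * (x - lo)) - sigmoid(k * (x - hi)))

x = np.linspace(0.0, 1.0, 500)
target = np.sin(2 * np.pi * x)

# Piecewise-constant approximation built from 20 bumps (40 hidden neurons).
edges = np.linspace(0.0, 1.0, 21)
approx = sum(bump(x, lo, hi, np.sin(2 * np.pi * (lo + hi) / 2))
             for lo, hi in zip(edges[:-1], edges[1:]))

print(np.max(np.abs(target - approx)))  # error shrinks as bumps get narrower
```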
![Page 31: Deep Learning - 國立臺灣大學speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/DL... · 2017-03-13 · Ups and downs of Deep Learning •1958: Perceptron (linear model) •1969:](https://reader030.fdocuments.us/reader030/viewer/2022040608/5ec714aaf17d3575a27ac00e/html5/thumbnails/31.jpg)
“深度學習深度學習”
• My Course: Machine learning and having it deep and structured
• http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
• 6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning”
• written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
• http://www.deeplearningbook.org
[Page 32] Acknowledgment
• Thanks to Victor Chen for spotting typos on the slides.