
Deep Learning

Deep learning attracts lots of attention. I believe you have seen many exciting results before.

Deep learning trends at Google. Source: SIGMOD/Jeff Dean

Ups and downs of Deep Learning

• 1958: Perceptron (linear model)

• 1969: Perceptron has limitations

• 1980s: Multi-layer perceptron
  • Not significantly different from today's DNNs

• 1986: Backpropagation
  • Usually more than 3 hidden layers did not help

• 1989: 1 hidden layer is "good enough", why deep?

• 2006: RBM initialization (breakthrough)

• 2009: GPU

• 2011: Started to become popular in speech recognition

• 2012: Won the ILSVRC image competition

Three Steps for Deep Learning

• Step 1: define a set of functions
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Neural Network

[Figure: a single "neuron" computes a weighted sum z of its inputs plus a bias, then passes z through an activation function to produce its output.]

Different connections lead to different network structures.

Network parameters 𝜃: all the weights and biases in the "neurons"
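As a small sketch (assuming the sigmoid activation that the following slides use), a single neuron is just a weighted sum of its inputs plus a bias, passed through the activation; the numbers below are the first-layer neuron from the example that follows:

```python
import math

def neuron(inputs, weights, bias):
    """One "neuron": weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

# First-layer neuron of the example on the following slides:
print(neuron(inputs=[1.0, -1.0], weights=[1.0, -2.0], bias=1.0))   # z = 4 -> about 0.98
```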

Fully Connected Feedforward Network

Sigmoid Function: 𝜎(z) = 1 / (1 + e^(−z))

[Figure: example network with input (1, −1). The first-layer neurons have weights (1, −2) and (−1, 1) and biases 1 and 0, so z = 4 and z = −2, giving outputs 𝜎(4) ≈ 0.98 and 𝜎(−2) ≈ 0.12.]

Fully Connected Feedforward Network

[Figure: the same example continued through two more layers. For input (1, −1), the layer outputs are (0.98, 0.12), then (0.86, 0.11), and finally (0.62, 0.83).]

Fully Connected Feedforward Network

[Figure: the same network applied to input (0, 0): the layer outputs are (0.73, 0.5), then (0.72, 0.12), and finally (0.51, 0.85).]

𝑓( [1, −1] ) = [0.62, 0.83]        𝑓( [0, 0] ) = [0.51, 0.85]

This is a function: input vector in, output vector out.

Given a network structure, we define a function set.

Fully Connected Feedforward Network

[Figure: input layer x1 … xN, hidden layers (Layer 1, Layer 2, …, Layer L) of neurons, and output layer y1 … yM. Input Layer → Hidden Layers → Output Layer.]

Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error

http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Residual Net (2015): 152 layers, 3.57% error (special structure; for scale, Taipei 101 has 101 floors)

Matrix Operation

The first layer of the earlier example, written as one matrix operation, with W1 = [[1, −2], [−1, 1]], x = [1, −1]ᵀ and b1 = [1, 0]ᵀ:

𝜎( W1 x + b1 ) = 𝜎( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ

Neural Network

[Figure: network with input x = (x1, …, xN), hidden-layer outputs a1, a2, …, and output y = (y1, …, yM); layer l has weight matrix Wl and bias vector bl.]

a1 = 𝜎( W1 x + b1 )
a2 = 𝜎( W2 a1 + b2 )
  ⋮
y  = 𝜎( WL aL−1 + bL )

Neural Network

Composing all the layers, the whole network is one function:

y = 𝑓(x) = 𝜎( WL ⋯ 𝜎( W2 𝜎( W1 x + b1 ) + b2 ) ⋯ + bL )

Using parallel computing techniques to speed up the matrix operations.
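A minimal NumPy sketch (not from the slides) of this forward pass: the first layer's W1, b1 and x match the example above, and the generic forward() loop is the chain of matrix operations that parallel hardware such as GPUs speeds up.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))        # element-wise sigmoid

# The layer-1 computation from the slide, as a matrix operation:
W1 = np.array([[1.0, -2.0],
               [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
x  = np.array([1.0, -1.0])
print(sigma(W1 @ x + b1))                  # -> approximately [0.98, 0.12]

def forward(x, Ws, bs):
    """y = f(x) = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)."""
    a = x
    for W, b in zip(Ws, bs):
        a = sigma(W @ a + b)               # one layer = matrix multiply + bias + activation
    return a
```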

Output Layer

[Figure: the hidden layers act as a feature extractor replacing feature engineering; the output layer then acts as a multi-class classifier, with softmax as its activation.]
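As a small sketch (the score values are my own, not from the slides), the softmax used by the output layer turns the last layer's scores into confidences that lie between 0 and 1 and sum to 1:

```python
import numpy as np

def softmax(z):
    """Softmax for the multi-class output layer: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))      # subtracting the max improves numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # -> roughly [0.88, 0.12, 0.00]
```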

Example Application

Input: a 16 × 16 image of a handwritten digit, flattened into a 256-dim vector x (ink → 1, no ink → 0).

Output: a 10-dim vector y; each dimension represents the confidence that the image is a particular digit (y1: "is 1", y2: "is 2", …, y10: "is 0").

For example, if y1 = 0.1, y2 = 0.7, …, y10 = 0.2, the image is recognized as "2".
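A tiny sketch of the input encoding described above, assuming a hypothetical 16 × 16 binary image; only the flattening and the ink → 1 / no ink → 0 convention come from the slides:

```python
import numpy as np

# A hypothetical 16 x 16 binary image: 1 where there is ink, 0 where there is not.
image = np.zeros((16, 16))
image[2:14, 7:9] = 1.0        # a crude vertical stroke, purely illustrative

x = image.reshape(256)        # flatten into the 256-dim input vector of the network
print(x.shape)                # (256,)
```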

Example Application

• Handwriting Digit Recognition: the machine takes the image as input and outputs "2".

What is needed is a function ……
  Input: 256-dim vector
  Output: 10-dim vector

[Figure: a neural network (input layer, hidden layers, output layer) plays the role of this function.]

Example Application

[Figure: a network with inputs x1, x2, …, xN, hidden Layers 1 through L, and outputs y1 … y10 ("is 1", "is 2", …, "is 0"), recognizing the image as "2".]

A function set containing the candidates for Handwriting Digit Recognition.

You need to decide the network structure so that your function set contains a good function.

FAQ

• Q: How many layers? How many neurons for each layer?
  • Trial and error + intuition
• Q: Can the structure be automatically determined?
  • E.g., evolutionary artificial neural networks
• Q: Can we design the network structure?
  • E.g., Convolutional Neural Network (CNN)

Three Steps for Deep Learning

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Loss for an Example

[Figure: given a set of parameters, the network (inputs x1 … x256, softmax output) produces y = (y1, …, y10) for a training image of "1"; the target ŷ is the one-hot vector (1, 0, 0, …, 0).]

The loss is the cross entropy between the network output y and the target ŷ:

C(y, ŷ) = −Σᵢ₌₁¹⁰ ŷᵢ ln yᵢ
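A minimal sketch of this cross-entropy loss for one example; the target is the one-hot vector for "1" as in the slide, while the network output y here is a made-up softmax output:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i), with y the network output and y_hat the target."""
    return -np.sum(y_hat * np.log(y))

y_hat = np.zeros(10); y_hat[0] = 1.0    # one-hot target for the digit "1", as in the slide
y = np.full(10, 0.02); y[0] = 0.82      # a made-up softmax output (sums to 1)
print(cross_entropy(y, y_hat))          # -ln(0.82), about 0.20
```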

Total Loss

[Figure: each training example x1, x2, x3, …, xN is fed through the NN to get y1, y2, y3, …, yN, which are compared with the targets ŷ1, ŷ2, ŷ3, …, ŷN to give the per-example losses C1, C2, C3, …, CN.]

For all training data, the total loss is:

L = Σₙ₌₁ᴺ Cⁿ

Find a function in the function set that minimizes the total loss L; that is, find the network parameters 𝜽* that minimize L.
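As a small sketch, the total loss is simply the sum of the per-example cross entropies; gradient descent (the next step) then searches for the parameters that minimize it:

```python
import numpy as np

def total_loss(outputs, targets):
    """L = sum over n of C^n, the cross entropy of each training example."""
    return sum(-np.sum(y_hat * np.log(y)) for y, y_hat in zip(outputs, targets))
```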

Three Steps for Deep Learning

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deep Learning is so simple ……

Gradient Descent

𝜃: pick an initial value for every weight and bias, e.g., w1 = 0.2, w2 = −0.1, b1 = 0.3, ……

Compute the partial derivative of L with respect to each parameter and update it with learning rate 𝜇:

w1 ← w1 − 𝜇 𝜕L/𝜕w1    (0.2 → 0.15)
w2 ← w2 − 𝜇 𝜕L/𝜕w2    (−0.1 → 0.05)
b1 ← b1 − 𝜇 𝜕L/𝜕b1    (0.3 → 0.2)
……

Collected together, these partial derivatives form the gradient:

𝛻L = [𝜕L/𝜕w1, 𝜕L/𝜕w2, …, 𝜕L/𝜕b1, …]ᵀ

Gradient Descent

Repeat: compute the gradient at the new 𝜃 and update again.

w1: 0.2 → 0.15 → 0.09 → ……
w2: −0.1 → 0.05 → 0.15 → ……
b1: 0.3 → 0.2 → 0.10 → ……
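A minimal sketch of the update rule above; the initial parameter values (0.2, −0.1, 0.3) match the slide, but the quadratic toy loss (and therefore its gradient 2𝜃) is made up purely so the example runs end to end; in a real network the gradient comes from backpropagation:

```python
import numpy as np

def gradient_descent_step(theta, grad_L, mu=0.1):
    """One update: theta <- theta - mu * dL/dtheta for every weight and bias (mu is the learning rate)."""
    return theta - mu * grad_L

# Toy illustration: a made-up loss L(theta) = sum(theta**2), whose gradient is 2 * theta.
theta = np.array([0.2, -0.1, 0.3])       # initial parameters, as in the slide
for _ in range(5):
    grad = 2 * theta                     # gradient of the toy loss at the current theta
    theta = gradient_descent_step(theta, grad)
print(theta)                             # the parameters move toward the minimum at 0
```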

Gradient Descent

This is the "learning" of machines in deep learning ……

Even AlphaGo uses this approach.

I hope you are not too disappointed :p

[Figure: what people imagine machine learning looks like vs. what it actually is (gradient descent).]

Backpropagation

• Backpropagation: an efficient way to compute 𝜕L/𝜕w in a neural network

• libdnn (developed by NTU student 周伯威)

Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
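Backpropagation itself is left to the reference above; as a minimal, made-up illustration of the quantity it computes, here is the chain-rule derivative 𝜕L/𝜕w for a single sigmoid neuron with a squared-error loss, checked against a finite-difference estimate (this is the chain rule for one neuron, not the full backpropagation algorithm):

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid neuron, y = sigma(w*x + b), with squared-error loss L = (y - t)^2 for one example.
x, t = 1.5, 0.0          # made-up input and target
w, b = 0.8, -0.3         # made-up parameters

# Chain rule, which backpropagation applies layer by layer through a whole network:
# dL/dw = dL/dy * dy/dz * dz/dw = 2*(y - t) * y*(1 - y) * x
z = w * x + b
y = sigma(z)
grad_w = 2 * (y - t) * y * (1 - y) * x

# Finite-difference check of the same derivative:
eps = 1e-6
L_plus  = (sigma((w + eps) * x + b) - t) ** 2
L_minus = (sigma((w - eps) * x + b) - t) ** 2
grad_w_numeric = (L_plus - L_minus) / (2 * eps)

print(grad_w, grad_w_numeric)   # the two estimates agree closely
```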

Concluding Remarks

• Step 1: define a set of functions (Neural Network)
• Step 2: goodness of function
• Step 3: pick the best function

Deeper is Better?

What are the benefits of deep architecture?

Layer X Size    Word Error Rate (%)        Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                       1 X 3772        22.5
7 X 2k          17.1                       1 X 4634        22.6
                                           1 X 16k         22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

Not surprising: deeper networks have more parameters, and more parameters give better performance. (The single-hidden-layer networks on the right are sized to have a comparable number of parameters to the deep networks in the same row.)

Universality Theorem

Any continuous function f : Rᴺ → Rᴹ can be realized by a network with one hidden layer (given enough hidden neurons).

Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why a "deep" neural network and not a "fat" neural network? (next lecture)

Learning Deep Learning in Depth ("深度學習深度學習")

• My Course: Machine learning and having it deep and structured

• http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

• 6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351

• “Neural Networks and Deep Learning”

• written by Michael Nielsen

• http://neuralnetworksanddeeplearning.com/

• “Deep Learning”

• written by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

• http://www.deeplearningbook.org

Acknowledgment

• Thanks to Victor Chen for spotting typos in the slides