
Machine Learning and having it deep and structured

Hung-yi Lee

What is Machine Learning?

You know how to program …

• You can ask computers to do lots of things for you.

• However, a computer can only do what you ask it to do.

• A computer can never solve a problem that you yourself cannot solve.

Some tasks are very complex

• One day, you are asked to write a program for speech recognition.

Find the common patterns in the waveforms on the left.

It seems impossible to write a program for speech recognition.

[Waveforms of different speakers saying 你好 (“Hello”)]

You quickly get lost in the exceptions and special cases.

Let the machine learn by itself

A large amount of audio data with labels, e.g. 你好 (“Hello”), 大家好 (“Hello, everyone”), 人帥真好 (“It's great to be handsome”), ……

You write the program for learning.

Learning ......

After learning, the machine responds: You said “你好”

Learning ≈ Looking for a Function

• Speech Recognition: f( audio ) = “你好”

• Handwriting Recognition: f( image of a digit ) = “2”

• Weather forecast: f( weather today ) = “sunny tomorrow”

• Play video games: f( positions and number of enemies ) = “fire”

Framework

• Model: a hypothesis function set f₁, f₂, ……

• Training data: {(x¹, ŷ¹), (x², ŷ²), ……}, where x is the function input (e.g. an audio clip of 大家好) and the label ŷ is the desired function output (e.g. “大家好”, “2”)

• Training: pick the “best” function f* from the hypothesis set using the training data

• Testing: feed a new input x to f* and get the output y = f*(x), e.g. “你好”
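A minimal sketch of this framework in Python; the hypothesis set, error measure, and data below are toy placeholders chosen for illustration, not the course's actual models:

```python
# Toy hypothesis set: functions y = w * x with a few candidate slopes.
hypothesis_set = [lambda x, w=w: w * x for w in (0.5, 1.0, 2.0)]

# Toy training data: (function input x, label y_hat) pairs.
training_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def total_error(f):
    """How badly f fits the training data (sum of squared errors)."""
    return sum((f(x) - y_hat) ** 2 for x, y_hat in training_data)

# Training: pick the best function f* from the hypothesis set.
f_star = min(hypothesis_set, key=total_error)

# Testing: apply f* to a new input.
print(f_star(4.0))  # about 8.0 for this toy data
```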

This Course

Machine Learning

Deep Learning / Structured Learning

Deep Learning

What is Deep Learning?

• Production line (生產線): the model (hypothesis function set) is a very complex function built by cascading many simple functions

  input → Simple Function 1 → Simple Function 2 → …… → Simple Function N → “Hello”

• End-to-end training: what each simple function should do is learned automatically
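A small sketch of the production-line idea, with placeholder simple functions (not real speech-recognition steps) composed into one complex function:

```python
# The model is a composition of simple functions applied one after another.
def compose(*functions):
    def composed(x):
        for f in functions:
            x = f(x)
        return x
    return composed

# Placeholder "simple functions" standing in for learned stages.
normalize  = lambda xs: [x / max(xs) for x in xs]
threshold  = lambda xs: [1 if x > 0.5 else 0 for x in xs]
count_ones = lambda xs: sum(xs)

model = compose(normalize, threshold, count_ones)
print(model([3.0, 9.0, 6.0, 1.0]))  # -> 2
```

In deep learning, each stage's behavior is not hand-written like this; it is learned from data end-to-end.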

Deep v.s. Shallow - Speech Recognition

• Shallow Approach

  Each box is a simple function in the production line:

  Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → GMM → “Hello”

  DFT, filter bank, log, DCT: hand-crafted; GMM: learned from data
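A rough sketch of this hand-crafted pipeline, assuming NumPy/SciPy are available; the filter bank below is a crude placeholder (no framing, windowing, or mel spacing), just to show the production-line shape:

```python
import numpy as np
from scipy.fftpack import dct

def toy_acoustic_features(waveform, n_filters=8, n_coeffs=5):
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2          # DFT -> power spectrum
    bands = np.array_split(spectrum, n_filters)            # toy "filter bank"
    energies = np.array([band.sum() for band in bands])
    log_energies = np.log(energies + 1e-10)                # log
    return dct(log_energies, norm='ortho')[:n_coeffs]      # DCT -> cepstral coefficients

waveform = np.sin(2 * np.pi * 440 * np.arange(0, 0.05, 1 / 16000))
print(toy_acoustic_features(waveform))
```

Only the final classifier (the GMM in the diagram) would be learned from data; every step above it is designed by hand.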

Deep v.s. Shallow - Speech Recognition

• Deep Learning

  Waveform → f1 → f2 → f3 → f4 → f5 → “Hello”

  All functions are learned from data.

  Less engineering labor, but the machine learns more.

  “Bye bye, MFCC” - Li Deng at Interspeech 2014

Deep v.s. Shallow - Image Recognition

• Shallow Approach

http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/

Feature extraction is hand-crafted; only the classifier at the end is learned from data.

Deep v.s. Shallow - Image Recognition

• Deep Learning

  image → f1 → f2 → f3 → f4 → “monkey”

  All functions are learned from data.

Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833).

What is Deep Learning?

• Production line: a very complex function built by cascading simple functions

  input → Simple Function 1 → Simple Function 2 → …… → Simple Function N → “Hello”

• The whole cascade is the model (the hypothesis function set)

• End-to-end training: what each simple function should do is learned automatically

• “Deep learning” usually refers to the neural-network-based approach

Inspired by Human Brains

A Neuron for Machine

• Inputs x₁, x₂, ……, x_N, weights w₁, w₂, ……, w_N, bias b

• z = w₁x₁ + w₂x₂ + …… + w_N x_N + b

• Output a = σ(z), where σ is the activation function; here it is the sigmoid function σ(z) = 1 / (1 + e^(−z))

• Each neuron is a very simple function
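A minimal sketch of a single sigmoid neuron; the weights and bias below are arbitrary illustrative values, not learned ones:

```python
import math

def neuron(inputs, weights, bias):
    # z = w1*x1 + ... + wN*xN + b, then squash with the sigmoid activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # a = sigma(z)

print(neuron([1.0, 0.5], weights=[0.2, -0.4], bias=0.1))  # about 0.52
```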

Deep Learning

• Cascading the neurons to form a neural network

  Input x₁, x₂, ……, x_N → Layer 1 → Layer 2 → …… → Layer L (hidden layers) → Output y₁, y₂, ……, y_M

• A neural network is a complex function f: R^N → R^M

• Each layer is a simple function in the production line.
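A small sketch of such a network as a composition of layers; the layer sizes and random weights below are arbitrary, just to show the f: R^N → R^M shape:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One "simple function" in the production line: affine map followed by activation.
    return sigmoid(W @ x + b)

def network(x, params):
    # The whole network: apply each layer's simple function in sequence.
    for W, b in params:
        x = layer(x, W, b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]   # N = 4 inputs, two hidden layers, M = 3 outputs
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

print(network(rng.standard_normal(4), params))   # a vector in R^3
```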

Ups and downs of Deep Learning

• 1960s: Perceptron (single layer neural network)

• 1969: Perceptron has limitation

• 1980s: Multi-layer perceptron

• Not significantly different from the DNNs of today

• 1986: Backpropagation

• Usually more than 3 hidden layers was not helpful

• 1989: 1 hidden layer is “good enough”, why deep?

• 2006: RBM initialization (breakthrough)

• 2009: GPU

• 2011: Start to be popular in speech recognition

• 2012: won the ILSVRC competition (image)

Became very popular

Why Deep Learning?

• Speech recognition

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Deeper is Better.

Why Deeper is Better?

[Diagram: a shallow network (one wide hidden layer) and a deep network (many hidden layers), both taking inputs x₁, x₂, ……, x_N]

Does deep work better simply because it uses more parameters?

Universality Theorem

Reference: http://neuralnetworksanddeeplearning.com/chap4.html

Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).

What is the reason to be deep?

Why a “deep” neural network and not a “fat” one?

Fat + Short v.s. Thin + Tall

[Diagram: a fat + short (shallow) network and a thin + tall (deep) network, both taking inputs x₁, x₂, ……, x_N]

If they have the same number of parameters, which one is better?

Fat + Short v.s. Thin + Tall - Toy Example

• Target function f: R² → {0, 1}

  [Figure: points (x, y) in the 2-D input space, each labeled 0 or 1]

• Sample 100,000 points as training data

Fat + Short v.s. Thin + Tall - Toy Example

• 1 hidden layer: (A) 125 neurons, (B) 500 neurons, (C) 2500 neurons

• 3 hidden layers: [configuration shown on the slide]

• Q: Is the number of parameters of the 3-hidden-layer network close to (A), (B), or (C)? (A way to count is sketched below.)
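Counting parameters of a fully connected network is straightforward: a layer with n inputs and m neurons has m×n weights plus m biases. A small sketch; the 3-hidden-layer sizes below are illustrative guesses, not the slide's actual configuration:

```python
def num_parameters(layer_sizes):
    """Weights + biases of a fully connected network,
    given sizes [input_dim, hidden_1, ..., output_dim]."""
    return sum(m * n + m for n, m in zip(layer_sizes[:-1], layer_sizes[1:]))

print(num_parameters([2, 500, 1]))         # 2001 parameters: case (B)
print(num_parameters([2, 40, 40, 40, 1]))  # 3441 parameters (illustrative deep net)
```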

Fat + Short v.s. Thin + Tall - Handwriting Digit Classification

• Same number of parameters

Deeper: uses fewer parameters to achieve the same performance

Fat + Short v.s. Thin + Tall - Speech Recognition

• Word error rate (WER): 1 hidden layer v.s. multiple layers

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Think about Logic Circuits ……

• A two-layer circuit of logic gates can represent any Boolean function.

• Using multiple layers of logic gates, some functions are much simpler to build (fewer gates needed).

• E.g. parity check

A parity-check circuit outputs 1 when the number of 1s in the input is even and 0 when it is odd (e.g. 1 0 1 0 → 1, 0 0 0 1 → 0).

For an input sequence with d bits, a two-layer circuit needs O(2^d) gates.

With multiple layers of XOR gates (cascading the XORs pairwise), we need only O(d) gates.
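A tiny sketch of the multi-layer view, using Python's ^ operator to stand in for XOR gates; a chain of d−1 XORs (O(d) gates) computes the parity:

```python
from functools import reduce

def parity(bits):
    """Returns 1 if the number of 1s is even, 0 if odd,
    by chaining d-1 XOR gates and inverting the result."""
    return 1 - reduce(lambda a, b: a ^ b, bits)

print(parity([1, 0, 1, 0]))  # 1 (even)
print(parity([0, 0, 0, 1]))  # 0 (odd)
```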

Back to Deep Learning

• Some functions can be easily represented by a deep structure

• Perhaps the functions that can be naturally decomposed into several steps

• E.g. image

http://neuralnetworksanddeeplearning.com/chap1.html

Back to Deep Learning


• Representing these functions with a shallow structure needs many more parameters

• More parameters imply that more training data is needed

• To achieve the same performance, deep learning needs less training data

• With the same amount of data, deep learning can achieve better performance.

Size of Training Data

• Different numbers of training examples: 100,000, 50,000, 20,000

• 1 hidden layer (more parameters) v.s. 3 hidden layers (fewer parameters)

Size of Training Data

• Handwriting digit classification

Deeper: uses less training data to achieve the same performance

Why was deep not popular before?

• In the past, deep usually did not work ……

We will come back to this issue in the following lectures.

What is Structured Learning?

In the real world ……

f: X → Y

X (Input domain): Sequence, graph structure, tree structure ……

Y (Output domain): Sequence, graph structure, tree structure ……

Retrieval

f: X → Y

X: “Machine learning” (keyword)

Y: a list of web pages (search result)

Translation

f: X → Y

X: “Machine learning and having it deep and structured” (one kind of sequence)

Y: “機器學習及其深層與結構化” (another kind of sequence)

Speech Recognition

f: X → Y

X: an audio signal (one kind of sequence)

Y: “大家好,歡迎大家來修機器學習及其深層與結構化” (another kind of sequence; “Hello everyone, welcome to Machine Learning and Having It Deep and Structured”)

Speech Summarization

f: X → Y

X: recorded lectures

Y: a summary - select the most informative segments to form a compact version

Object Detection

f: X → Y

X: image

Y: object positions (e.g. Haruhi, Mikuru)

Pose Estimation

f: X → Y

X: image

Y: pose

Source of images: http://groups.inf.ed.ac.uk/calvin/Publications/eichner-techreport10.pdf

Concluding Remarks

Machine Learning

Deep Learning / Structured Learning

Reference

• No Textbook

• Deep Learning

• “Neural Networks and Deep Learning”

• Written by Michael Nielsen

• http://neuralnetworksanddeeplearning.com/

• “Deep Learning” (not finished yet)

• Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

• http://www.iro.umontreal.ca/~bengioy/dlbook/

• Structured Learning

• No suggested reference

Thank you!

Human Brains are Deep