Introduction of Machine / Deep Learning
Hung-yi Lee (李宏毅)
Machine Learning ≈ Looking for a Function
• Speech Recognition: f(audio clip) = "How are you"
• Image Recognition: f(image) = "Cat"
• Playing Go: f(board position) = "5-5" (next move)
Different Types of Functions
• Regression: the function outputs a scalar.
  Example: predicting PM2.5. f(PM2.5 today, temperature, concentration of O3) = PM2.5 of tomorrow.
• Classification: given options (classes), the function outputs the correct one.
  Example: spam filtering. f(email) = Yes/No.
  Example: playing Go. f(board position) = next move, where each position on the board is a class (19 × 19 classes in total).
• Structured Learning: beyond regression and classification, the function creates something with structure (an image, a document, …).
How to Find a Function? A Case Study
Predicting the views of a YouTube channel: https://www.youtube.com/c/HungyiLeeNTU
The function we want to find: y = f(information about the channel), where y is the no. of views on 2/26.
1. Function with Unknown Parameters
$y = b + wx_1$ (the Model)
• $y$: no. of views on 2/26; $x_1$: no. of views on 2/25 ($x_1$ is the feature).
• $w$ and $b$ are unknown parameters, learned from data; $w$ is called the weight and $b$ the bias.
• Writing the function in this form is a guess based on domain knowledge.
2. Define Loss from Training Data
• Loss is a function of the parameters: $L(b, w)$.
• Loss: how good a set of values is.
Training data: daily views from 2017/01/01 – 2020/12/31, e.g. 4.8k on 2017/01/01, 4.9k on 01/02, 7.5k on 01/03, …, 3.4k on 2020/12/30, 9.8k on 12/31.
How good is the candidate $b = 0.5k$, $w = 1$, i.e. $y = 0.5k + 1 \cdot x_1$? Evaluate $L(0.5k, 1)$:
• Feed in $x_1 = 4.8k$ (views on 01/01): the prediction is $y = 5.3k$, but the label (the true views on 01/02) is $\hat{y} = 4.9k$, so the error is $e_1 = |y - \hat{y}| = 0.4k$.
• Feed in $x_1 = 4.9k$: the prediction is $y = 5.4k$, the label is $\hat{y} = 7.5k$, so $e_2 = 2.1k$.
Averaging over all $N$ training examples gives the loss: $L = \frac{1}{N}\sum_n e_n$.
• If $e = |y - \hat{y}|$, $L$ is the mean absolute error (MAE).
• If $e = (y - \hat{y})^2$, $L$ is the mean squared error (MSE).
• If $y$ and $\hat{y}$ are both probability distributions, use cross-entropy.
Error surface: evaluating $L$ over a grid of $(w, b)$ values and drawing contours shows which parameter regions give small loss and which give large loss.
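To make the loss concrete, here is a minimal Python sketch (not from the slides; the three data points are placeholder values) that evaluates MAE and MSE for a candidate $(b, w)$:

```python
import numpy as np

# Toy slice of the training data (placeholder values, in thousands of views)
x1    = np.array([4.8, 4.9, 7.5])   # views on day n     (features)
y_hat = np.array([4.9, 7.5, 9.8])   # views on day n+1   (labels)

def loss(b, w, kind="mae"):
    """L(b, w) for the model y = b + w * x1, averaged over all examples."""
    y = b + w * x1                                          # predictions
    e = np.abs(y - y_hat) if kind == "mae" else (y - y_hat) ** 2
    return e.mean()

print(loss(0.5, 1.0))          # MAE of the candidate b = 0.5k, w = 1
print(loss(0.5, 1.0, "mse"))   # MSE of the same candidate
```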
3. Optimization
Goal: find $w^*, b^* = \arg\min_{w,b} L$.
Method: gradient descent. Consider a single parameter $w$ first:
• (Randomly) pick an initial value $w^0$.
• Compute the slope $\frac{\partial L}{\partial w}\big|_{w=w^0}$: if it is negative, increase $w$ to decrease the loss; if it is positive, decrease $w$.
• Update: $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0}$, where $\eta$ is the learning rate. Values we set ourselves, like $\eta$, are called hyperparameters.
(Source of image: http://chico386.pixnet.net/album/photo/171572850)
• Update $w$ iteratively: $w^0 \to w^1 \to w^2 \to \dots$, each step moving against the slope.
A possible concern: the iteration can stop at a local minimum $w^T$ of the loss rather than the global minimum. Does local minima truly cause the problem?
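A minimal sketch of this one-parameter gradient descent in Python (illustrative; the data, learning rate, and iteration count are all assumptions, and the derivative is estimated numerically rather than derived):

```python
import numpy as np

x1    = np.array([4.8, 4.9, 7.5])   # placeholder features (k views)
y_hat = np.array([4.9, 7.5, 9.8])   # placeholder labels

def L(w, b=0.5):
    """MAE loss with b held fixed, so we can watch a single parameter w."""
    return np.abs((b + w * x1) - y_hat).mean()

eta = 0.01               # learning rate (a hyperparameter we set ourselves)
w = 0.0                  # (randomly) picked initial value w^0
for _ in range(1000):
    grad = (L(w + 1e-6) - L(w - 1e-6)) / 2e-6   # numerical dL/dw at w^t
    w = w - eta * grad                          # w^{t+1} <- w^t - eta * grad
print(w, L(w))
```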
For the two parameters $w$ and $b$:
• (Randomly) pick initial values $w^0, b^0$.
• Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$.
• Update: $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$.
• Update $w$ and $b$ iteratively.
Computing the derivatives can be done in one line in most deep learning frameworks.
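For instance, in PyTorch (a sketch assuming the same toy data as above; the slides do not name a specific framework), the gradient computation really is one line, `loss.backward()`:

```python
import torch

x1      = torch.tensor([4.8, 4.9, 7.5])   # placeholder features
y_label = torch.tensor([4.9, 7.5, 9.8])   # placeholder labels
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
eta = 0.01

for _ in range(1000):
    loss = ((b + w * x1) - y_label).abs().mean()  # L(b, w), here the MAE
    loss.backward()            # the "one line": fills in w.grad and b.grad
    with torch.no_grad():
        w -= eta * w.grad      # w^{t+1} <- w^t - eta * dL/dw
        b -= eta * b.grad      # b^{t+1} <- b^t - eta * dL/db
        w.grad.zero_()
        b.grad.zero_()
```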
On the error surface, each update computes $\left(-\eta\frac{\partial L}{\partial w}, -\eta\frac{\partial L}{\partial b}\right)$ and moves $(w, b)$ by that vector, walking downhill step by step.
Result: $w^* = 0.97$, $b^* = 0.1k$, with loss $L(w^*, b^*) = 0.48k$.
The three steps, all together for the model $y = b + wx_1$:
Step 1: write a function with unknown parameters.
Step 2: define a loss from training data.
Step 3: optimization, giving $w^* = 0.97$, $b^* = 0.1k$ and $L(w^*, b^*) = 0.48k$.
Machine Learning is so simple ……
$y = 0.1k + 0.97x_1$ achieves the smallest loss $L = 0.48k$ on data of 2017 – 2020 (training data).
How about data of 2021 (unseen during training)? $L' = 0.58k$.
[Figure: 2021/01/01 – 2021/02/14, views (k); red: real no. of views, blue: estimated no. of views. The blue curve largely tracks the red one shifted by one day, which is expected: with $w^* \approx 1$, the model essentially predicts that tomorrow's views equal today's.]
Using more past days as features:
$y = b + wx_1$ (1 day): $L = 0.48k$ (2017 – 2020), $L' = 0.58k$ (2021)
$y = b + \sum_{j=1}^{7} w_j x_j$ (7 days): $L = 0.38k$, $L' = 0.49k$
$y = b + \sum_{j=1}^{28} w_j x_j$ (28 days): $L = 0.33k$, $L' = 0.46k$
$y = b + \sum_{j=1}^{56} w_j x_j$ (56 days): $L = 0.32k$, $L' = 0.46k$
Learned parameters for the 7-day model: $b^* = 0.05k$ and weights $w_1^*, \dots, w_7^*$ = 0.79, −0.31, 0.12, −0.01, −0.10, 0.30, 0.18.
These are all linear models.
But linear models have a severe limitation: whatever values $w$ and $b$ take, $y = b + wx_1$ is a straight line in the $x_1$-$y$ plane; different $w$ only changes the slope and different $b$ only shifts it. A relation where $y$ first rises and then falls with $x_1$ can never be represented. This limitation, coming from the form of the model itself, is called model bias. Linear models are too simple; we need a more flexible, sophisticated model!
Observe that any piecewise linear red curve can be written as: constant + a sum of a set of "hard sigmoid" pieces (label the constant 0 and the pieces 1, 2, 3; each piece contributes one bend of the curve).
All piecewise linear curves = constant + sum of a set of such pieces; more bends require more pieces.
Beyond piecewise linear? Any continuous curve can be approximated by a piecewise linear curve; to have a good approximation, we need sufficiently many pieces.
How do we represent a hard sigmoid piece? Approximate it with a smooth sigmoid function:
$y = c \cdot \frac{1}{1 + e^{-(b + wx_1)}} = c \, \mathrm{sigmoid}(b + wx_1)$
Different $w$: change the slope. Different $b$: shift the curve left or right. Different $c$: change the height.
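A quick numeric illustration (not from the slides) of how $c$, $b$, $w$ reshape one sigmoid piece:

```python
import numpy as np

def soft_step(x1, c, b, w):
    """c * sigmoid(b + w*x1): w sets the slope, b shifts it, c sets height."""
    return c / (1.0 + np.exp(-(b + w * x1)))

x = np.linspace(-5.0, 5.0, 5)
print(soft_step(x, c=2.0, b=0.0, w=1.0))  # a smooth step rising from 0 to 2
```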
So: red curve = constant + sum of a set of sigmoids, for example
$c_1 \, \mathrm{sigmoid}(b_1 + w_1 x_1)$, $c_2 \, \mathrm{sigmoid}(b_2 + w_2 x_1)$, $c_3 \, \mathrm{sigmoid}(b_3 + w_3 x_1)$,
giving the model $y = b + \sum_i c_i \, \mathrm{sigmoid}(b_i + w_i x_1)$.
New Model: More Features
Just as the single-feature linear model $y = b + wx_1$ generalizes to $y = b + \sum_j w_j x_j$, the sigmoid model $y = b + \sum_i c_i \, \mathrm{sigmoid}(b_i + w_i x_1)$ generalizes to
$y = b + \sum_i c_i \, \mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$
Concretely, take $i = 1, 2, 3$ sigmoids and $j = 1, 2, 3$ features ($i$ indexes sigmoids, $j$ indexes features; $w_{ij}$ is the weight on feature $x_j$ inside the $i$-th sigmoid). The sigmoid inputs are
$r_1 = b_1 + w_{11}x_1 + w_{12}x_2 + w_{13}x_3$
$r_2 = b_2 + w_{21}x_1 + w_{22}x_2 + w_{23}x_3$
$r_3 = b_3 + w_{31}x_1 + w_{32}x_2 + w_{33}x_3$
or in matrix form
$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$
i.e. $\mathbf{r} = \mathbf{b} + W\mathbf{x}$.
Pass each $r_i$ through the sigmoid to get an activation $a_i$:
$a_1 = \mathrm{sigmoid}(r_1) = \frac{1}{1 + e^{-r_1}}$, or in vector form $\mathbf{a} = \sigma(\mathbf{r})$.
Finally, combine the activations with weights $c_i$ and another bias $b$: $y = b + \mathbf{c}^\top \mathbf{a}$.
Putting the pieces together:
$\mathbf{r} = \mathbf{b} + W\mathbf{x}$, $\mathbf{a} = \sigma(\mathbf{r})$, $y = b + \mathbf{c}^\top \mathbf{a}$, i.e. $y = b + \mathbf{c}^\top \sigma(\mathbf{b} + W\mathbf{x})$.
Here $\mathbf{x}$ is the feature vector, and the unknown parameters are the entries of $W$, the vector $\mathbf{b}$, the vector $\mathbf{c}$, and the scalar $b$. Collect them all (e.g., the rows of $W$ stacked, then $\mathbf{b}$, $\mathbf{c}$, $b$) into one long vector
$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$
This is the new Step 1: a function with unknown parameters $\boldsymbol{\theta}$.
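As an illustration, here is a minimal NumPy sketch of this function (not from the slides; the 3-feature, 3-sigmoid shapes and random parameter values are assumptions):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(x, W, b_vec, c, b):
    """y = b + c^T sigmoid(b_vec + W x): the new, more flexible model."""
    r = b_vec + W @ x        # r = b + W x
    a = sigmoid(r)           # a = sigma(r)
    return b + c @ a         # y = b + c^T a

rng = np.random.default_rng(0)
x = np.array([4.8, 4.9, 7.5])   # 3 features (placeholder values)
W = rng.normal(size=(3, 3))     # w_ij: weight of x_j for the i-th sigmoid
b_vec = rng.normal(size=3)      # b_i
c = rng.normal(size=3)          # c_i
print(model(x, W, b_vec, c, b=0.0))
```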
Back to the ML framework. Step 2: define loss from training data. The loss is now a function of all the parameters, $L(\boldsymbol{\theta})$, and it means the same thing as before: how good a set of values is. Given a set of values for $\boldsymbol{\theta}$, feed in a feature $\mathbf{x}$, compute the prediction $y = b + \mathbf{c}^\top \sigma(\mathbf{b} + W\mathbf{x})$, compare it with the label $\hat{y}$ to get an error $e$, and average over the data: $L = \frac{1}{N}\sum_n e_n$.
Optimization of New Model
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$
• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute the gradient, the vector of all partial derivatives of the loss:
$\mathbf{g} = \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$, written compactly as $\mathbf{g} = \nabla L(\boldsymbol{\theta}^0)$.
• Update: $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\mathbf{g}$, i.e.
$\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ \vdots \end{bmatrix} \leftarrow \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \\ \vdots \end{bmatrix} - \begin{bmatrix} \eta\frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \eta\frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$
Iterating the same step:
• Compute gradient $\mathbf{g} = \nabla L(\boldsymbol{\theta}^0)$, update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\mathbf{g}$.
• Compute gradient $\mathbf{g} = \nabla L(\boldsymbol{\theta}^1)$, update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\mathbf{g}$.
• Compute gradient $\mathbf{g} = \nabla L(\boldsymbol{\theta}^2)$, update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\mathbf{g}$.
In practice, the $N$ training examples are divided into batches of size $B$ (the batch size is another hyperparameter), and each update uses the loss on one batch instead of the full-data loss $L$:
• Compute gradient $\mathbf{g} = \nabla L^1(\boldsymbol{\theta}^0)$ on batch 1, update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\mathbf{g}$.
• Compute gradient $\mathbf{g} = \nabla L^2(\boldsymbol{\theta}^1)$ on batch 2, update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\mathbf{g}$.
• Compute gradient $\mathbf{g} = \nabla L^3(\boldsymbol{\theta}^2)$ on batch 3, update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\mathbf{g}$.
Here $L^1, L^2, L^3, \dots$ are the losses computed on the individual batches. 1 epoch = seeing all the batches once.
Counting updates per epoch:
• Example 1: 10,000 examples ($N$ = 10,000), batch size $B$ = 10. How many updates in 1 epoch? 10,000 / 10 = 1,000 updates.
• Example 2: 1,000 examples ($N$ = 1,000), batch size $B$ = 100. How many updates in 1 epoch? 1,000 / 100 = 10 updates.
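A minimal sketch of the batching logic (illustrative; `compute_gradient` is a hypothetical stand-in for the gradient of the per-batch loss $L^k$):

```python
import numpy as np

N, B, eta = 10_000, 10, 0.01
theta = np.zeros(100)                    # placeholder parameter vector

def compute_gradient(theta, batch_idx):  # hypothetical per-batch gradient
    return np.zeros_like(theta)          # stand-in for grad of L^k(theta)

rng = np.random.default_rng(0)
for epoch in range(3):
    order = rng.permutation(N)           # (re)shuffle the examples each epoch
    for start in range(0, N, B):         # N/B = 1,000 updates per epoch
        batch_idx = order[start:start + B]
        g = compute_gradient(theta, batch_idx)
        theta -= eta * g                 # one update per batch
```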
More variety of models … Sigmoid → ReLU.
How else can we represent the hard sigmoid? Exactly, as the sum of two Rectified Linear Units (ReLU), each of the form $c \max(0, b + wx_1)$:
hard sigmoid $= c \max(0, b + wx_1) + c' \max(0, b' + w'x_1)$
Swapping the activation in the model, $y = b + \sum_i c_i \, \mathrm{sigmoid}\big(b_i + \sum_j w_{ij}x_j\big)$ becomes
$y = b + \sum_{2i} c_i \max\big(0, b_i + \sum_j w_{ij}x_j\big)$
(the sum runs over $2i$ terms because two ReLUs are needed per hard sigmoid). Sigmoid and ReLU are called activation functions. Which one is better?
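In the earlier NumPy sketch, this swap is a one-line change (again illustrative):

```python
import numpy as np

def relu(r):
    return np.maximum(0.0, r)

def model_relu(x, W, b_vec, c, b):
    """Same structure as before, with ReLU as the activation function."""
    return b + c @ relu(b_vec + W @ x)
```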
Experimental Results (loss with the ReLU model $y = b + \sum_{2i} c_i \max(0, b_i + \sum_j w_{ij}x_j)$):
              linear   10 ReLU   100 ReLU   1000 ReLU
2017 – 2020   0.32k    0.32k     0.28k      0.27k
2021          0.46k    0.45k     0.43k      0.43k
Even more variety of models … we can apply the same kind of transformation repeatedly: feed the activations of one layer into another layer,
$\mathbf{a} = \sigma(\mathbf{b} + W\mathbf{x})$, then $\mathbf{a}' = \sigma(\mathbf{b}' + W'\mathbf{a})$, and so on
(with sigmoid or ReLU as $\sigma$). How many layers to use is yet another hyperparameter.
Experimental Results
• Loss for multiple hidden layers, 100 ReLU for each layer; input features are the no. of views in the past 56 days.
              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k
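Stacking layers in the same NumPy style (an illustrative sketch; the random parameters and layer sizes merely mirror the 56-feature, 100-ReLU setup described above):

```python
import numpy as np

def relu(r):
    return np.maximum(0.0, r)

def deep_model(x, layers, c, b):
    """Apply a' = relu(b_l + W_l @ a) for each layer, then y = b + c^T a."""
    a = x
    for W_l, b_l in layers:
        a = relu(b_l + W_l @ a)
    return b + c @ a

rng = np.random.default_rng(0)
x = rng.normal(size=56)                  # 56 past days as features
sizes = [56, 100, 100, 100]              # 3 hidden layers of 100 ReLU each
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
c, b = rng.normal(size=100), 0.0
print(deep_model(x, layers, c, b))
```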
[Figure: predictions of the 3-layer model, 2021/01/01 – 2021/02/14; y-axis: views (k); red: real no. of views, blue: estimated no. of views.]
It is not fancy enough. Let's give it a fancy name!
Each sigmoid or ReLU unit in the model is called a Neuron, and the whole thing is a Neural Network. (This mimics human brains … (???)) The layers of neurons between the input and the output are called hidden layers, and having many hidden layers means the network is "deep": hence, Deep Learning.
Deep = Many hidden layers. Image-classification networks kept getting deeper, and their error rates kept dropping:
AlexNet (2012): 8 layers, 16.4%
VGG (2014): 19 layers, 7.3%
GoogleNet (2014): 22 layers, 6.7%
Residual Net (2015): 152 layers with a special structure, 3.57% (for comparison, Taipei 101 is 101 stories tall)
(Source of image: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
Why do we want a "deep" network, not a "fat" one? And why don't we simply go deeper? Look again at the results (100 ReLU per layer; input features are the no. of views in the past 56 days):
              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k
The 4-layer network is better on the training data but worse on unseen data. This is overfitting.
Let's predict the no. of views today! If we want to select a model for predicting today's views, which one of the four networks above would you use? We will talk about model selection next time. :)
To learn more ……
• Basic Introduction: https://youtu.be/Dr-WRlEFefw
• Backpropagation (computing gradients in an efficient way): https://youtu.be/ibJpTrp5mcE