Machine Learning
Lecture 9
Deep Learning
Introduction & Stacked AE
Dr. Patrick Chan ([email protected])
South China University of Technology, China
Agenda
Introduction of Deep Learning
Autoencoder
Autoencoder
Sparse Autoencoder
Contractive Autoencoder
Denoising Autoencoder
Stacked Autoencoder
What is Deep Learning?
Branch of Machine Learning
Commonly refers to a neural network with multiple layers (a deep architecture)
Why Deep Architecture?
Our brain is a very deep architecture
A deep architecture can represent more complicated functions than a shallow one
Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009
[Figure: deeper architecture → more complicated function; shallower architecture → less complicated function.]
Deep Learning
What’s New?
Brief History of Machine Learning
[Timeline figure: http://www.erogol.com/brief-history-machine-learning/ — deep learning emerges around 2006, after a digression of deep architectures due to the success of SVM.]
Deep Learning
What’s New?
Multilayer neural networks have been studied for more than 30 years
What’s actually new?
Old wine in a new bottle??
Deep Learning
What’s New?
With traditional backpropagation, a DNN is often less accurate than a shallow network
Backpropagation loses its power in deep architectures:
Vanishing gradient problem
Deep Learning
What’s New?
With traditional backpropagation, a DNN is often less accurate than a shallow network
Optimization is very complex: too many parameters in a deep architecture
[Figure: a deep network whose successive weight matrices have sizes 8×9, 9×9, 9×9, 9×9, 9×9, 9×9, 9×9, and 9×4.]
Deep Learning
What’s New?
In the past, there was no suitable training method for deep architectures
Breakthrough in 2006: three papers proposed new learning algorithms for deep structures
1. Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets, Neural Computation 18:1527-1554, 2006.
2. Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H., Greedy Layer-Wise Training of Deep Networks, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153-160, MIT Press, 2007.
3. Ranzato, M., Poultney, C., Chopra, S. and LeCun, Y., Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems 19 (NIPS 2006), MIT Press, 2007.
Deep Learning
What’s New?
Vanishing gradient problem → divide and conquer: stacked training, e.g. Stacked Autoencoder (SA) & RBM
Too many parameters → reduce parameters, e.g. Convolutional Neural Network (CNN)
Deep Learning
What’s New?
Deep learning focuses on feature learning rather than on the classifier
[Figure: Traditional learning: person-designed features → complicated classifier → decision (e.g. "Who?" for a face image). Deep learning: feature learning (deep learning) → simple classifier → decision.]
Deep Learning
Unsupervised Learning
Stacked Denoising Autoencoder (SAE)
Supervised Learning
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Generative Adversarial Network (GAN)
Autoencoder (AE)
Autoencoder aims to learn an identity function
Predict the input itself at the output: $\hat{x} \approx x$
Containing two parts:
Encoder: $h = f(x)$
Decoder: $\hat{x} = g(h)$
[Figure: original input $x$ → Encoder → latent representation $h$ → Decoder → estimated input $\hat{x}$.]
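The slides give no code; as a minimal sketch (assuming PyTorch, which the lecture does not prescribe), the encoder/decoder pair can be written as:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal AE: encoder h = f(x), decoder x_hat = g(h)."""
    def __init__(self, n_in=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)      # latent representation
        return self.decoder(h)   # estimated input x_hat
```

Training then minimizes a reconstruction loss between $x$ and the output, as described on the following slides.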
Autoencoder (AE)
An autoencoder may contain more than one hidden layer
[Figure: an autoencoder with two encoder layers followed by two decoder layers (layers 1-4), with intermediate codes between them.]
Autoencoder (AE)
Gradient descent can be used to train an autoencoder
Objective function: cross-entropy or squared error (see the formulas below)
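The slide names the two objectives without formulas; in standard notation (a reconstruction, not taken from the slide), with inputs $x^{(i)}$ and reconstructions $\hat{x}^{(i)}$ over $n$ samples of dimension $d$:

```latex
% Squared error over n samples
J_{\mathrm{SE}} = \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x^{(i)} - \hat{x}^{(i)} \bigr\rVert^{2}

% Cross-entropy (for inputs scaled to [0, 1])
J_{\mathrm{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d}
  \Bigl[ x_j^{(i)} \log \hat{x}_j^{(i)} + \bigl(1 - x_j^{(i)}\bigr) \log \bigl(1 - \hat{x}_j^{(i)}\bigr) \Bigr]
```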
Autoencoder (AE)
Learning the identity function seems trivial
But by adding constraints on the network, useful information can be learned:
Number of hidden neurons
Regularization
Autoencoder (AE): Type
Undercomplete AE: the hidden layer is smaller than the input layer
Overcomplete AE: the hidden layer is greater than the input layer
Autoencoder (AE)
Undercomplete
Hidden layer is smaller than the input layer
Compresses the input
The hidden nodes will generally be good features
Autoencoder (AE)
Overcomplete
Hidden layer is greater than the input layer
No compression in the hidden layer
A higher-dimensional code may help when the distribution is more complex, e.g. the XOR case
No guarantee that the hidden units will extract meaningful features; a hidden unit may simply copy an input component
An additional requirement is therefore imposed, e.g. sparsity
Autoencoder (AE)
Serves as feature extraction
Extracts a robust feature representation for a model
No labels are needed
Independent of the model
[Figure: noisy input → robust encoder (maybe multi-layer) → latent space representation → model.]
Sparse Autoencoder
Sparsity is a common constraint
Meaningful features can be learnt
The model may be more explainable with sparse features
Sparse Autoencoder
Let $a_j^{(k)}(x)$ be the activation of the $j$-th hidden unit in layer $k$ of the encoder
Sparsity is defined as the average activation over $n$ samples:
$$\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} a_j^{(k)}(x^{(i)})$$
Force the constraint $\hat{\rho}_j = \rho$
$\rho$ is a "sparsity parameter"
Typically small but not zero, e.g. 0.05
Sparse Autoencoder
Penalize $\hat{\rho}_j$ for deviating from $\rho$
Many measures exist for this penalty term
Example: Kullback-Leibler (KL) divergence, a standard function for measuring how different two distributions are:
$$\sum_{j=1}^{s_k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s_k} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_j} \right]$$
$s_k$: the number of neurons in layer $k$
Sparse Autoencoder
If $\hat{\rho}_j \neq \rho$, the KL divergence increases monotonically with the difference between $\hat{\rho}_j$ and $\rho$ (and is zero when they are equal)
Sparse Autoencoder
Cost function:
$$J_{\text{sparse}} = J + \beta \sum_{j=1}^{s_k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$$
$\beta$: tradeoff parameter
$\hat{\rho}_j$ depends on all samples and affects $J_{\text{sparse}}$, so it should be calculated over all $x$ in advance
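A minimal sketch of this cost (assuming PyTorch; the function name and default values are illustrative, not from the lecture). Note that rho_hat is averaged over the whole batch, mirroring the "calculate over all x in advance" remark:

```python
import torch

def sparse_ae_loss(x, x_hat, h, rho=0.05, beta=1.0, eps=1e-8):
    """J_sparse = J + beta * sum_j KL(rho || rho_hat_j).

    x, x_hat: (n, d) inputs and reconstructions; h: (n, s_k) hidden
    activations in [0, 1] (e.g. sigmoid outputs).
    """
    j = ((x - x_hat) ** 2).sum(dim=1).mean()         # squared-error term J
    rho_hat = h.mean(dim=0).clamp(eps, 1 - eps)      # average activation per unit
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return j + beta * kl
```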
Contractive Autoencoder
The learned encoding is expected to be similar for similar inputs
i.e. the derivatives of the hidden-layer activations with respect to the input should be small
[Figure: similar inputs $x_1$ and $x_2$ map to similar encodings under the encoder $f_{en}$.]
Contractive Autoencoder
Regularization: penalize instances where a small change in the input leads to a large change in the encoding space
Cost function:
$$J_{\text{CAE}} = J + \lambda \,\lVert J_f(x) \rVert_F^2$$
$\lambda$: tradeoff parameter
$J_f(x)$ is the Jacobian matrix of the encoder $h = f(x)$:
$$J_f(x) = \begin{bmatrix} \frac{\partial h_1(x)}{\partial x_1} & \cdots & \frac{\partial h_1(x)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_m(x)}{\partial x_1} & \cdots & \frac{\partial h_m(x)}{\partial x_d} \end{bmatrix}$$
Frobenius norm (an L2 matrix norm): $\lVert A \rVert_F = \sqrt{\sum_i \sum_j a_{ij}^2}$, where $a_{ij}$ is an element of $A$
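As a sketch of the penalty (assuming PyTorch and a single-layer sigmoid encoder $h = \sigma(Wx + b)$, neither of which the slide fixes), the squared Frobenius norm has a closed form because $\partial h_j / \partial x_i = h_j (1 - h_j) W_{ji}$:

```python
import torch

def contractive_penalty(h, W):
    """||J_f(x)||_F^2 for a sigmoid encoder, summed over the batch.

    h: (n, m) sigmoid activations; W: (m, d) encoder weight matrix.
    dh_j/dx_i = h_j * (1 - h_j) * W[j, i], so the squared Frobenius
    norm factorizes as sum_j (h_j * (1 - h_j))^2 * sum_i W[j, i]^2.
    """
    dh = (h * (1 - h)) ** 2          # (n, m) per-unit slope terms
    w_sq = (W ** 2).sum(dim=1)       # (m,) row-wise squared weight norms
    return (dh * w_sq).sum()

# Usage: loss = reconstruction_loss + lam * contractive_penalty(h, W)
```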
Denoising Autoencoder
Recover the noise-contaminated sample, i.e. undo the corruption effect
A Denoising Autoencoder (DAE) is robust to noise
[Figure: noisy input $\tilde{x}$ → Encoder → latent representation → Decoder → estimated input $\hat{x}$.]
Denoising Autoencoder
Recall the original AE: $\hat{x} = g(f(x))$, trained to reconstruct $x$
DAE: $\hat{x} = g(f(\tilde{x}))$, trained to reconstruct the clean $x$ from a corrupted $\tilde{x}$
Each $x$ is contaminated by either:
Assigning 0 to a subset of the inputs with some probability (blinding)
Adding Gaussian additive noise
Example (each row is a sample with three features):
Original $x$:
0.6 0.8 0.2
0.2 0.5 0.5
0.4 0.1 0.7
0.3 0.1 2
With Gaussian noise:
0.5 0.9 0.3
0.5 0.5 0.3
0.4 0.2 0.8
0.1 0.3 1.7
With blinding (some entries set to 0):
0.6 0 0.2
0 0.5 0
0 0.1 0.7
0.3 0.1 2
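Both corruptions are easy to sketch (assuming PyTorch; the masking probability and noise level here are illustrative):

```python
import torch

def blind(x, p=0.3):
    """Set each entry of x to 0 independently with probability p."""
    mask = (torch.rand_like(x) >= p).float()
    return x * mask

def add_gaussian_noise(x, sigma=0.1):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + sigma * torch.randn_like(x)

# Training pair for a DAE: input blind(x) (or add_gaussian_noise(x)), target x.
```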
Denoising Autoencoder
A DAE cannot fully trust each feature of $x$ independently, so it must learn the correlations among $x$'s features
[Figure: a 4-2-4 DAE: inputs x1-x4 are corrupted, encoded to z1, z2, and decoded to reconstruct the clean x1-x4.]
CAE vs DAE
Denoising AE:
The whole AE (i.e. encoder + decoder) resists small but finite-sized perturbations of the input
Defined by the noise injected in training
Changes the training samples (adding noise or blinding)

Contractive AE:
The feature extraction function (i.e. the encoder) resists infinitesimal perturbations of the input (slope)
Changes the objective function
Might be more stable than DAE, which uses a sampled gradient
Complexity is high due to the calculation of the Jacobian matrix
Stacked Autoencoder
An AE with a deep structure faces the vanishing gradient problem
The stacking concept is used:
Train one layer at a time rather than the whole model at once
Stacked Autoencoder
[Figure: training proceeds stage by stage: 1st Layer Training (AE 1), 2nd Layer Training (AE 2), 3rd Layer Training (AE 3); the trained encoders are stacked and connected to a classifier.]
Stacked Autoencoder
Algorithm (a code sketch follows below):
Train an AE to reconstruct the input x
Keep the encoder only (remove the decoder)
Repeat
Treat the output of the previous AE as input, and train a new AE
Keep the encoder only (remove the decoder)
Until enough layers are trained
Connect to a model (e.g. a classifier)
Use backpropagation to fine-tune the entire network on supervised data if necessary
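A minimal sketch of this greedy layer-wise procedure (assuming PyTorch; the layer widths, optimizer, and epoch count are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

def train_ae(data, n_in, n_hidden, epochs=20, lr=1e-3):
    """Train one AE on `data`; return its encoder and the encoded data."""
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        x_hat = dec(enc(data))                 # reconstruct the input
        loss = ((data - x_hat) ** 2).mean()    # squared-error objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        coded = enc(data)                      # output fed to the next AE
    return enc, coded

# Greedy layer-wise pre-training, then connect a classifier.
x = torch.rand(256, 784)           # toy unlabeled data in [0, 1]
sizes = [784, 256, 64, 32]         # illustrative layer widths
encoders, h = [], x
for n_in, n_hid in zip(sizes[:-1], sizes[1:]):
    enc, h = train_ae(h, n_in, n_hid)   # keep the encoder only
    encoders.append(enc)

model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))  # e.g. 10 classes
# Fine-tune `model` end-to-end with backpropagation on labeled data if needed.
```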
References
https://web.stanford.edu/class/cs294a/
https://www.jeremyjordan.me/autoencoders/