Learning Deep Energy Models
![Page 1: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/1.jpg)
1
Learning Deep Energy Models
Authors: Jiquan Ngiam et al., 2011
Presenter: Dibyendu Sengupta
![Page 2: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/2.jpg)
2
The Deep Learning Problem
Vectorized pixel intensities
Slightly higher level representation
. . .
Output or the highest level representation: That’s the CRAZY FROG!!!!!
Learning and modeling the different layers is a very challenging problem
![Page 3: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/3.jpg)
3
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
![Page 4: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/4.jpg)
4
State of the art methods
• Deep Belief Network (DBN): RBMs are stacked and trained in a “greedy” manner to form a DBN [1]; each RBM models the posterior distribution of the previous layer.
• Deep Boltzmann Machine (DBM): A DBM [2] has undirected connections between the layers of a network initialized by RBMs. All layers are trained jointly.
• Deep Energy Model (DEM): A DEM consists of a feedforward NN that deterministically transforms the input; the output of the feedforward network is modeled with a layer of stochastic hidden units.
DBN Layers
DBM Layers
DEM Layers
[1] Hinton et al., A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 2006
[2] Salakhutdinov et al., Deep Boltzmann Machines, AISTATS, 2009
![Page 5: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/5.jpg)
5
Deep Belief Network (DBN)
DBNs are graphical models that learn to extract a deep hierarchical representation of the training data by modeling the joint distribution of the observed data x and the l hidden layers h^1, …, h^l.
Algorithm:
1. Train the first layer as an RBM that models the input, x as its visible layer.
2. The output of the first layer is used as the input data for the second layer, taken either as the mean activations P(h^1 = 1 | h^0) or as samples of P(h^1 | h^0), where h^0 = x.
3. Iterate for the desired number of layers, each time propagating upward either samples or mean values.
4. Fine-tune all the parameters of this deep architecture with respect to the log-likelihood or with respect to a supervised training criterion.
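The joint distribution itself did not survive in this transcript. As a hedged reconstruction, the standard DBN factorization (not copied from the slide) treats the input as layer zero, h^0 = x, and reads:

```latex
P(x, h^1, \ldots, h^l) =
  \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^l),
  \qquad h^0 = x
```

Here P(h^{l-1}, h^l) is the joint distribution of the top-level RBM, and each P(h^k | h^{k+1}) is the conditional of layer k given the layer above it.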
![Page 6: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/6.jpg)
6
Deep Boltzmann Machine (DBM)
• DBM is similar to DBN: Primary contrasting feature is undirected connections in all the layers
• Layerwise training algorithm is used to initialize the layers using RBMs
• Joint training is performed on all the layers
![Page 7: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/7.jpg)
7
Motivation of Deep Energy Models (DEM)
• Both DBNs and DBMs have multiple stochastic hidden layers
• Computing the conditional posterior over the stochastic hidden layers is intractable
• Learning and inference are more tractable in a single-layer RBM, but it lacks representational power
• To overcome both defects, DEMs combine several deterministic hidden layers with a single layer of stochastic hidden units
![Page 8: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/8.jpg)
8
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
![Page 9: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/9.jpg)
9
Energy Based Models (EBMs)
x – Visible units, h – Hidden units, Z – Partition Function, F(x) – Free Energy
We would like the configurations to be at low energy
Independent of h
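The equations for this slide are missing from the transcript. The standard energy-based formulation, written with the symbols defined above (a reconstruction, not the slide's own typesetting), is:

```latex
p(x) = \sum_h p(x, h) = \sum_h \frac{e^{-E(x, h)}}{Z}, \qquad
F(x) = -\log \sum_h e^{-E(x, h)}, \qquad
p(x) = \frac{e^{-F(x)}}{Z}, \quad Z = \sum_x e^{-F(x)}
```

The free energy F(x) is what remains after the hidden units are summed out, which is why it is independent of h; low free energy corresponds to high probability.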
![Page 10: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/10.jpg)
10
General Learning Strategy for EBMs
Gradient-based methods are applied to this functional formulation to learn the parameters θ
In general, the “positive” term is easy to compute, but the “negative” term is often intractable and has to be approximated by sampling.
Expectations are computed for both these terms to estimate their values
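The gradient expression is not preserved in the transcript. Assuming the free-energy form p(x) = e^{-F(x)} / Z, the usual negative log-likelihood gradient is:

```latex
-\frac{\partial \log p(x)}{\partial \theta}
  = \underbrace{\frac{\partial F(x)}{\partial \theta}}_{\text{positive term}}
  - \underbrace{\sum_{\tilde{x}} p(\tilde{x})\,
      \frac{\partial F(\tilde{x})}{\partial \theta}}_{\text{negative term}}
```

The positive term is evaluated at a training example; the negative term is an expectation under the model distribution, which is the part that usually requires sampling.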
![Page 11: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/11.jpg)
11
Energy based models of RBM
W: weights connecting visible (v) and hidden (h) units
b and c: offsets of visible and hidden units
RBM representation
Exploiting the structure of RBM we can obtain
In particular for RBMs with binary units:
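The RBM equations are missing here, so the following is a minimal sketch of the standard binary RBM quantities rather than a copy of the slide. It assumes the usual energy E(v, h) = -b'v - c'h - h'Wv, the matching free energy, and the sigmoid conditional P(h_i = 1 | v). Function names are illustrative, and `brute_force_free_energy` exists only to check the closed form on tiny models.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def energy(v, h, W, b, c):
    """E(v, h) = -b'v - c'h - h'Wv, with W[i][j] linking hidden i to visible j."""
    visible = sum(bj * vj for bj, vj in zip(b, v))
    hidden = sum(ci * hi for ci, hi in zip(c, h))
    pairwise = sum(h[i] * W[i][j] * v[j]
                   for i in range(len(h)) for j in range(len(v)))
    return -visible - hidden - pairwise


def hidden_preactivations(v, W, c):
    """c_i + W_i v for every hidden unit i."""
    return [ci + sum(wij * vj for wij, vj in zip(row, v))
            for row, ci in zip(W, c)]


def free_energy(v, W, b, c):
    """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)) for binary hidden units."""
    visible = sum(bj * vj for bj, vj in zip(b, v))
    softplus = sum(math.log1p(math.exp(a))
                   for a in hidden_preactivations(v, W, c))
    return -visible - softplus


def p_hidden_given_visible(v, W, c):
    """P(h_i = 1 | v) = sigmoid(c_i + W_i v); hidden units are independent given v."""
    return [sigmoid(a) for a in hidden_preactivations(v, W, c)]


def brute_force_free_energy(v, W, b, c):
    """-log sum_h exp(-E(v, h)), enumerating all hidden states (tiny models only)."""
    total = sum(math.exp(-energy(v, h, W, b, c))
                for h in itertools.product([0, 1], repeat=len(c)))
    return -math.log(total)
```

The closed-form free energy works because the hidden units do not interact with each other, so the sum over all hidden configurations factorizes into one term per hidden unit.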
![Page 12: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/12.jpg)
12
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
![Page 13: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/13.jpg)
13
Sigmoid Deep Energy Model
gθ(v) represents the feedforward output of the neural network gθ
Similar to RBMs, an energy function defines the connections between gθ(v) and the hidden units h (assumed binary)
The conditional posteriors of the hidden variables are easy to compute:
Representational power of the model can be increased by adding more layers of the feedforward NN
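A minimal sketch of the structure described above: a deterministic sigmoid network g_θ(v) feeding one layer of binary stochastic hidden units. The slide's energy function is not preserved, so the exact parameterization here is an assumption; in particular the Gaussian term v'v / (2σ²) in the free energy and the names `W_top`, `c` and `sigma` are illustrative, not taken from the source.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def feedforward(v, layers):
    """g_theta(v): each (weights, biases) pair applies sigmoid(W a + b)."""
    a = list(v)
    for W, b in layers:
        a = [sigmoid(b_i + sum(w * x for w, x in zip(row, a)))
             for row, b_i in zip(W, b)]
    return a


def top_preactivations(v, layers, W_top, c):
    """c_j + w_j . g_theta(v) for every stochastic hidden unit j."""
    g = feedforward(v, layers)
    return [c_j + sum(w * x for w, x in zip(row, g))
            for row, c_j in zip(W_top, c)]


def hidden_posterior(v, layers, W_top, c):
    """P(h_j = 1 | v): one sigmoid per hidden unit, no sampling needed."""
    return [sigmoid(a) for a in top_preactivations(v, layers, W_top, c)]


def free_energy(v, layers, W_top, c, sigma=1.0):
    """Assumed form: v'v / (2 sigma^2) - sum_j softplus(c_j + w_j . g_theta(v))."""
    quadratic = sum(x * x for x in v) / (2.0 * sigma ** 2)
    return quadratic - sum(softplus(a)
                           for a in top_preactivations(v, layers, W_top, c))
```

Adding entries to `layers` deepens the deterministic part without changing how the posterior over h is computed, which is the point made on the slide.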
![Page 14: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/14.jpg)
14
Generalized Deep Energy Models
Generalized Free Energy Function
Sigmoid DEM with gθ as the feedforward NN
Different models for H(v) enable DEMs with multiple layers of nonlinearities
PoT Distribution
Covariance RBM Distribution
Example of a two-layer network: the first layer computes squared responses, followed by a soft rectification.
Models can also be combined linearly, as in the mean-covariance RBM, which is a linear combination of an RBM and a cRBM
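To make the "squared responses followed by a soft rectification" structure concrete, here is a small sketch of the commonly used Product of Student-t (PoT) free energy, F(v) = Σ_j γ_j log(1 + ½ (w_j · v)²). This is the textbook PoT form and is assumed here; the slide's own expression is not preserved, and the names `W` and `gamma` are illustrative.

```python
import math


def pot_free_energy(v, W, gamma):
    """Sum over filters j of gamma_j * log(1 + 0.5 * (w_j . v)^2)."""
    total = 0.0
    for w_j, gamma_j in zip(W, gamma):
        response = sum(w * x for w, x in zip(w_j, v))         # linear filter
        squared = response ** 2                               # layer 1: squaring
        total += gamma_j * math.log(1.0 + 0.5 * squared)      # layer 2: soft rectification
    return total
```

Because the filter response is squared, the free energy is symmetric in v: an input and its negation receive the same energy.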
![Page 15: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/15.jpg)
15
An alternative deep version of DEM
• PoT and cRBM use shallow feedback in the energy landscape
• “Stacked PoT” (SPoT) is chosen as a deeper version of PoT, built by stacking several PoT layers
• This creates a more expressive deeper model
![Page 16: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/16.jpg)
16
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
![Page 17: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/17.jpg)
17
Learning Parameters in DEMs
• Models were trained by maximizing the log-likelihood
• Stochastic gradient ascent was used to learn the parameters, θ
• The resulting update rules are similar to the generalized Energy Based Model (EBM) updates
2nd Term: Expectation over the data – can be easily computed
1st Term: Expectation over the model distribution – harder to compute, often intractable, and approximated by sampling
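The update rule itself is missing from the transcript. The sketch below illustrates the two-term structure on a hypothetical toy model with seven integer states and free energy F(x) = θ x², chosen only because the model expectation can then be computed exactly. It evaluates the gradient of the mean log-likelihood as E_model[∂F/∂θ] - E_data[∂F/∂θ], in the same order as the two terms above.

```python
import math

STATES = range(-3, 4)  # toy state space, small enough to enumerate


def free_energy(x, theta):
    """Hypothetical toy free energy F(x) = theta * x^2."""
    return theta * x * x


def dF_dtheta(x, theta):
    return x * x


def model_probs(theta):
    """p(x) = exp(-F(x)) / Z over the enumerated states."""
    weights = [math.exp(-free_energy(x, theta)) for x in STATES]
    Z = sum(weights)
    return [w / Z for w in weights]


def mean_log_likelihood(data, theta):
    log_Z = math.log(sum(math.exp(-free_energy(x, theta)) for x in STATES))
    return sum(-free_energy(x, theta) - log_Z for x in data) / len(data)


def log_likelihood_gradient(data, theta):
    # 1st term: expectation over the model distribution (exact here, sampled in a DEM).
    model_term = sum(p * dF_dtheta(x, theta)
                     for p, x in zip(model_probs(theta), STATES))
    # 2nd term: expectation over the training data.
    data_term = sum(dF_dtheta(x, theta) for x in data) / len(data)
    return model_term - data_term
```

In a real DEM the state space cannot be enumerated, so the first term is replaced by an average over samples drawn from the model.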
![Page 18: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/18.jpg)
18
Hybrid Monte Carlo (HMC) Sampler
• Model samples are obtained by simulating a physical system
• Particles are subjected to potential and kinetic energies
• Velocities are sampled from a unit-variance Gaussian to obtain state1
• The state of the particles evolves so as to conserve the Hamiltonian H(s,φ)
• n steps of the Leap-Frog Algorithm are applied to state1 (s,φ) to obtain state2
• state2 is accepted with probability Pacc(state1, state2)
Neal RM, Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993
Hamiltonian Dynamics
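The steps above can be sketched for a one-dimensional target. This is an illustrative toy, not the authors' sampler: the potential is the free energy of a standard Gaussian, the kinetic energy is ½ φ², and the step size, number of leapfrog steps and seed are arbitrary choices.

```python
import math
import random


def potential(s):
    """Potential energy = free energy of the target (standard Gaussian here)."""
    return 0.5 * s * s


def grad_potential(s):
    return s


def hamiltonian(s, phi):
    return potential(s) + 0.5 * phi * phi


def leapfrog(s, phi, step, n_steps):
    """n_steps of leapfrog integration starting from position s, velocity phi."""
    phi = phi - 0.5 * step * grad_potential(s)        # initial half step (velocity)
    for i in range(n_steps):
        s = s + step * phi                            # full step (position)
        if i < n_steps - 1:
            phi = phi - step * grad_potential(s)      # full step (velocity)
    phi = phi - 0.5 * step * grad_potential(s)        # final half step (velocity)
    return s, phi


def hmc_sample(n_samples, step=0.2, n_steps=8, seed=0):
    rng = random.Random(seed)
    s, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        phi = rng.gauss(0.0, 1.0)                     # state1: fresh unit-variance velocity
        s_new, phi_new = leapfrog(s, phi, step, n_steps)  # state2
        # Metropolis test: accept with probability min(1, exp(H(state1) - H(state2))).
        log_ratio = hamiltonian(s, phi) - hamiltonian(s_new, phi_new)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            s = s_new
            accepted += 1
        samples.append(s)
    return samples, accepted / n_samples
```

Exact Hamiltonian dynamics would conserve H and every proposal would be accepted; the acceptance test only corrects for the discretization error of the leapfrog steps.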
![Page 19: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/19.jpg)
19
Greedy Layerwise and Joint Training
• First, greedy layerwise training is performed by
– Training successive layers to optimize data likelihood
– Freezing the parameters of the earlier layers
– Keeping the learning objective (i.e. data likelihood) the same throughout training of the deep model
• Joint training is subsequently performed on all layers by
– Unfreezing the weights
– Optimizing the same objective function
– Computational cost is comparable to layerwise training
• Training in DEM is computationally much cheaper than DBN and DBM since only the top layer needs sampling and all intermediate layers are deterministic hidden units
![Page 20: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/20.jpg)
20
Discriminative Deep Energy Models
a_l: Activations of the l-th layer in gθ, used to learn a linear classifier of image labels y via weights U
Training: Done with a hybrid generative-discriminative objective
Gradient of the generative cost: Can be computed by the previously discussed method
Gradient of the discriminative cost: Can be computed by treating the model as a short-circuited feedforward NN with softmax classification
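A small sketch of the discriminative half: a softmax classifier on the layer activations with weights U, plus one possible way to combine it with a generative cost. The objective is not preserved in the transcript, so the weighting `generative + alpha * discriminative` is an assumed form, and all function names are illustrative.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminative_cost(activations, U, label):
    """Negative log-probability of `label` under softmax(U a_l)."""
    logits = [sum(u * a for u, a in zip(row, activations)) for row in U]
    return -math.log(softmax(logits)[label])


def hybrid_cost(generative_cost, disc_cost, alpha):
    """Assumed weighting: generative cost plus alpha times discriminative cost."""
    return generative_cost + alpha * disc_cost
```

With alpha = 0 this reduces to purely generative training; larger values of alpha put more weight on classification accuracy.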
![Page 21: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/21.jpg)
21
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
![Page 22: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/22.jpg)
22
Experiments: Natural Images
Remarks:
• Experiments were done with sigmoid and SPoT models, with log-likelihoods estimated using Annealed Importance Sampling (AIS)
• M1 and M2: Training using greedy layerwise stacking of 1 and 2 layers respectively
• M1-M2-M12: Greedy layerwise training for 2 layers followed by joint training of the two layers
• Joint training improves performance over pure greedy layerwise training, but convergence of the log-likelihood is not evident from the plots
• M1-M2-M12 seems to require a significantly larger number of iterations
• Adding multiple layers in SPoT significantly improves model performance
Convergence: Questionable!
![Page 23: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/23.jpg)
23
Experiments: Object Recognition
Samples from SPoT M12 model trained on NORB dataset
Remarks:
• Models were trained on NORB dataset
• Hybrid discriminative-generative Deep models (SPoT) performed better than the fully discriminative model
• Fully discriminative model suffers from overfitting
• Optimal α parameter that weighs discriminative-generative cost is obtained by cross validation on a subset of training data
• Iteration counts for convergence in the models were not reported
![Page 24: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/24.jpg)
24
Conclusions
• It is often difficult to scale the SPoT model to realistic datasets because of the slowness of HMC
• Jointly training all layers yields significantly better models than Greedy layerwise training
Single layer Sigmoid DEM: Trained by Greedy layerwise
Two layer Sigmoid DEM: Trained by Greedy and Joint Training
Filters appear Blob-like
Filters appear Gabor-like
![Page 25: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/25.jpg)
25
What is the best Multi-Stage Architecture for Object Recognition?
Authors: Kevin Jarrett et al., 2009
Presenter: Sreeparna Mukherjee
![Page 26: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/26.jpg)
26
Coming back to the starting problem
Vectorized pixel intensities
Feature Extraction: Stage 1
. . .
Output or the highest level feature extraction: That’s the CRAZY FROG!!!!!
Can it be done more efficiently with multiple feature extraction stages instead of just one?
![Page 27: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/27.jpg)
27
Outline
Different Feature Extraction Models
Multi-Stage Feature Extraction Architecture
Learning Protocols
Experiments
![Page 28: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/28.jpg)
28
Existing Feature Extraction Models
• There are several single-stage feature extraction systems inspired by the mammalian visual cortex
– Scale Invariant Feature Transform (SIFT)
– Histogram of Oriented Gradients (HOG)
– Geometric Blur
• There are also models with two or more successive stages of feature extraction
– Convolutional networks trained in supervised or unsupervised mode
– Multistage systems using a non-linear MAX operation, such as HMAX models
![Page 29: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/29.jpg)
29
Contrasts among different models
The feature extraction models differ primarily in the following aspects:
– Number of stages of feature extraction
– Type of non-linearity used after filter-bank
– Type of filter used
– Type of classifier used
![Page 30: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/30.jpg)
30
Questions to Address
• How do the non-linearities following filter banks influence recognition accuracy?
• Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters?
• Is there any benefit of using a 2-stage feature extractor as opposed to single stage feature extractor?
![Page 31: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/31.jpg)
31
Intuitions in Feature Extraction Architecture
• Supervised training on datasets with a small number of labeled samples (e.g. Caltech-101) will fail
• Filters need to be carefully handpicked for good performance
• Non-linearities should not be a significant factor
What do you think?
These intuitions are wrong – We’ll see how !!!!
![Page 32: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/32.jpg)
32
Outline
Different Feature Extraction Models
Multi-Stage Feature Extraction Architecture
Learning Protocols
Experiments
![Page 33: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/33.jpg)
33
General Model Architecture
Vectorized pixel intensities
Filter Bank Layer
Output or the highest level feature extraction: That’s the CRAZY FROG!!!!!
Non Linear Transformation Layers
Pooling Layer: Local Averaging to remove small perturbations
![Page 34: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/34.jpg)
34
Filter Bank Layer (FCSG)
Input (x): n_1 2D feature maps of size n_2 × n_3
x_ijk is each component in each feature map x_i
Output (y): m_1 feature maps of size m_2 × m_3
Filter (k): k_ij is a filter in the filter bank of size l_1 × l_2 mapping x_i to y_j
![Page 35: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/35.jpg)
35
Non-Linear Transformations in FCSG
FCSG comprises convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G) g_j
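The layer equation is not preserved in the transcript. A minimal sketch, assuming the usual form y_j = g_j · tanh(Σ_i k_ij * x_i) with "valid" convolution, so that an n_2 × n_3 input and an l_1 × l_2 filter give an output of size (n_2 - l_1 + 1) × (n_3 - l_2 + 1). The nested-list layout `kernels[i][j]` is an illustrative choice.

```python
import math


def conv2d_valid(x, k):
    """'Valid' 2-D convolution of feature map x with kernel k (no padding)."""
    n2, n3 = len(x), len(x[0])
    l1, l2 = len(k), len(k[0])
    out = []
    for r in range(n2 - l1 + 1):
        row = []
        for c in range(n3 - l2 + 1):
            acc = 0.0
            for p in range(l1):
                for q in range(l2):
                    # The kernel is flipped: true convolution, not correlation.
                    acc += k[l1 - 1 - p][l2 - 1 - q] * x[r + p][c + q]
            row.append(acc)
        out.append(row)
    return out


def fcsg(x_maps, kernels, gains):
    """y_j = g_j * tanh(sum_i k_ij * x_i); kernels[i][j] maps input i to output j."""
    outputs = []
    for j, g_j in enumerate(gains):
        total = None
        for i, x in enumerate(x_maps):
            response = conv2d_valid(x, kernels[i][j])
            if total is None:
                total = response
            else:
                total = [[a + b for a, b in zip(ra, rb)]
                         for ra, rb in zip(total, response)]
        outputs.append([[g_j * math.tanh(val) for val in row] for row in total])
    return outputs
```

For example, one 3 × 3 input map and one 2 × 2 filter produce a single 2 × 2 output map.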
![Page 36: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/36.jpg)
36
Rectification Layer (Rabs)
• This layer returns the absolute value of its input
• Other rectifying non-linearities produced results similar to Rabs
![Page 37: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/37.jpg)
37
Local Contrast Normalization Layer (N)
This layer performs local
– Subtractive Normalization
– Divisive Normalization
Subtractive Normalization
Divisive Normalization
w_pq is a normalized Gaussian weighting window
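The two normalization formulas are missing from the transcript. The sketch below assumes the common definition: subtract the Gaussian-weighted local mean taken over all feature maps, then divide by max(c, σ_jk), where σ_jk is the weighted local standard deviation and c is the mean of σ_jk over the map. Zero padding at the borders and the zero-denominator guard are simplifications chosen for this example.

```python
import math


def gaussian_window(size, width, n_maps):
    """size x size Gaussian weights, scaled to sum to 1 over all maps and offsets."""
    half = size // 2
    w = [[math.exp(-((p - half) ** 2 + (q - half) ** 2) / (2.0 * width ** 2))
          for q in range(size)] for p in range(size)]
    total = sum(map(sum, w)) * n_maps
    return [[val / total for val in row] for row in w]


def weighted_sum(maps, w, j, k):
    """Sum over maps i and offsets (p, q) of w_pq * maps[i][j+p][k+q], zero-padded."""
    half = len(w) // 2
    rows, cols = len(maps[0]), len(maps[0][0])
    acc = 0.0
    for m in maps:
        for p in range(len(w)):
            for q in range(len(w)):
                r, c = j + p - half, k + q - half
                if 0 <= r < rows and 0 <= c < cols:
                    acc += w[p][q] * m[r][c]
    return acc


def local_contrast_normalize(maps, size=3, width=1.0):
    w = gaussian_window(size, width, len(maps))
    rows, cols = len(maps[0]), len(maps[0][0])
    # Subtractive normalization: remove the weighted local mean.
    mean = [[weighted_sum(maps, w, j, k) for k in range(cols)] for j in range(rows)]
    v = [[[m[j][k] - mean[j][k] for k in range(cols)] for j in range(rows)]
         for m in maps]
    # Divisive normalization: divide by the weighted local standard deviation.
    squared = [[[val * val for val in row] for row in m] for m in v]
    sigma = [[math.sqrt(weighted_sum(squared, w, j, k)) for k in range(cols)]
             for j in range(rows)]
    c = sum(map(sum, sigma)) / (rows * cols)
    out = []
    for m in v:
        out.append([[m[j][k] / max(c, sigma[j][k]) if max(c, sigma[j][k]) > 0 else 0.0
                     for k in range(cols)] for j in range(rows)])
    return out
```

On a constant input, interior positions come out as zero because the local mean equals the pixel value; only positions near the zero-padded border are non-zero.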
![Page 38: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/38.jpg)
38
Average Pooling and Subsampling Layer (PA)
Averaging: This creates robustness to small distortions
w_pq is a uniform weighting window
Subsampling: Spatial resolution is reduced by down-sampling with a ratio S in both spatial directions
Max Pooling and Subsampling Layer (PM)
The average operation is replaced by a max operation
The subsampling procedure stays the same
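Both pooling layers can be sketched with one helper that differs only in the reduction applied to each window. This is an illustrative version with a square uniform window and a stride equal to the down-sampling ratio S; the parameter names are not from the source.

```python
def pool(x, window, stride, mode="average"):
    """Pool a 2-D feature map with a square window, then subsample by `stride`."""
    rows, cols = len(x), len(x[0])
    out = []
    for r in range(0, rows - window + 1, stride):
        row = []
        for c in range(0, cols - window + 1, stride):
            patch = [x[r + p][c + q] for p in range(window) for q in range(window)]
            if mode == "max":
                row.append(max(patch))                # PM: max pooling
            else:
                row.append(sum(patch) / len(patch))   # PA: uniform average
        out.append(row)
    return out
```

A 4 × 4 map pooled with a 2 × 2 window and stride 2 becomes a 2 × 2 map in either mode.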
![Page 39: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/39.jpg)
39
Hierarchy among the Layers
Layers can be combined in various hierarchical ways to obtain different architectures
FCSG – PA
FCSG – Rabs – PA
FCSG – Rabs– N – PA
FCSG – PM
A typical multistage architecture: FCSG – Rabs– N – PA
![Page 40: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/40.jpg)
40
Outline
Different Feature Extraction Models
Multi-Stage Feature Extraction Architecture
Learning Protocols
Experiments
![Page 41: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/41.jpg)
41
Unsupervised Training Protocols
Input: X (vectorized patch or stack of patches)
Dictionary: W – to be learnt
Feature Vector: Z* - obtained by minimizing the energy function
![Page 42: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/42.jpg)
42
Learning procedure: Olshausen-Field
Obtaining Z* from E_OF via “basis pursuit” is an expensive optimization problem
The energy function to be minimized
λ – Sparsity Hyper-parameter
Learning W: Done by minimizing the Loss Function L_OF(W) using stochastic gradient descent
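The energy function is missing from the transcript; the sketch below assumes the usual sparse-coding form E_OF(X, Z, W) = ||X - WZ||² + λ ||Z||_1. To show why finding Z* is an iterative optimization, it uses ISTA (iterative soft-thresholding), which is a swapped-in solver for illustration and not necessarily the one used by the authors. The caller must supply a step size no larger than 1 / (2 · largest eigenvalue of W'W).

```python
def soft_threshold(a, t):
    """Shrink a towards zero by t; values within [-t, t] become exactly 0."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def reconstruct(W, z):
    return [sum(w * zc for w, zc in zip(row, z)) for row in W]


def energy_of(x, z, W, lam):
    """Assumed form: ||x - W z||^2 + lam * ||z||_1."""
    recon = reconstruct(W, z)
    return (sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
            + lam * sum(abs(zc) for zc in z))


def sparse_code_ista(x, W, lam, step, n_iters=200):
    """Approximate Z* = argmin_z E_OF(x, z, W) by iterative soft-thresholding."""
    z = [0.0] * len(W[0])
    for _ in range(n_iters):
        residual = [ri - xi for ri, xi in zip(reconstruct(W, z), x)]
        # Gradient of the quadratic term: 2 * W' (W z - x).
        grad = [2.0 * sum(W[r][c] * residual[r] for r in range(len(x)))
                for c in range(len(z))]
        z = [soft_threshold(zc - step * gc, step * lam) for zc, gc in zip(z, grad)]
    return z
```

Every input needs its own run of this loop, which is the cost that a feedforward predictor is meant to avoid.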
![Page 43: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/43.jpg)
43
Learning procedure: PSD
• E_PSD optimization is faster because it has the predictor term
• The goal of the algorithm is to make the regressor C(X,K,G) as close to Z as possible
• After training is complete, Z* = C(X,K) for input X, i.e. the method is a fast feedforward computation
Regressor function mapping X → Y
Loss Function
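The PSD loss is not preserved in the transcript. A hedged reconstruction, assuming that PSD adds a prediction penalty to the sparse-coding energy and writing the regressor as C(X, K, G) as on this slide:

```latex
E_{PSD}(X, Z, W, K, G) =
  \lVert X - WZ \rVert_2^2 + \lambda \lVert Z \rVert_1
  + \lVert Z - C(X, K, G) \rVert_2^2
```

The last term pulls the regressor output towards the optimal sparse code during training, so that after training the code can be approximated by a single feedforward pass.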
![Page 44: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/44.jpg)
44
Outline
Different Feature Extraction Models
Multi-Stage Feature Extraction Architecture
Learning Protocols
Experiments
![Page 45: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/45.jpg)
45
Experiment: Caltech 101 Dataset
R and RR – Random Features and Supervised Classifier
U and UU – Unsupervised Features, Supervised Classifier
U+ and U+U+ – Unsupervised Features, Global Supervised Refinement
G – Gabor Functions
Remarks:
• Random filters with no filter learning achieve decent performance
• Both rectification and supervised fine-tuning improved performance
• Two-stage systems are better than single-stage models
• Unsupervised training does not significantly improve performance if both rectification and normalization are used
• Gabor filters performed worse than random filters
![Page 46: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/46.jpg)
46
Experiment: NORB Dataset
Remarks:
• Rectification and Normalization make a significant improvement when the number of labeled samples is small
• As the number of samples increases, the improvement from Rectification and Normalization becomes insignificant
• Random filters perform much worse with a large number of labeled samples
![Page 47: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/47.jpg)
47
Experiment: MNIST Dataset
• Two-stage feature extraction architecture was used
• The parameters are first trained using PSD
• Classifier is initialized randomly and the whole system is fine tuned in supervised mode
• A test error rate of 0.53% was observed – best known error rate on MNIST without distortions or preprocessing
![Page 48: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/48.jpg)
48
Coming back to the same Questions!
• How do the non-linearities following filter banks influence recognition accuracy? - Significantly
– Rectification improves performance, possibly because i) non-polar features improve recognition or ii) it prevents cancellation between neighboring responses in the pooling layer
– Normalization also improves performance and makes supervised learning faster through contrast enhancement
• Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters? - Yes
– Random filters show good performance in the limit of small training set sizes, where the optimal stimuli for random filters are similar to those of trained filters
– Global supervised learning of filters yields good results if proper non-linearities are used
• Is there any benefit of using a 2-stage feature extractor as opposed to a single-stage feature extractor? - Yes
– The experiments show that a 2-stage feature extractor performs much better than single-stage feature extractor models.
![Page 49: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/49.jpg)
49
Questions
![Page 50: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/50.jpg)
50
Extra Slides
![Page 51: Learning Deep Energy Models](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816690550346895dda67c5/html5/thumbnails/51.jpg)
51
Hybrid Monte Carlo (HMC) Sampler
• Model samples are obtained by simulating a physical system
• Particles are subjected to potential and kinetic energies
• Velocities are sampled from a unit-variance Gaussian to obtain state1
• The state of the particles evolves so as to conserve the Hamiltonian H(s,φ)
• n steps of the Leap-Frog Algorithm are applied to state1 (s,φ) to obtain state2
• state2 is accepted with probability Pacc(state1, state2)
Neal RM, Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993
Hamiltonian Dynamics
Leap-Frog Discretization:
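The discretization equations are missing from the transcript. The standard leapfrog scheme, written for position s, velocity φ, step size ε and free energy F (the symbols ε and F are assumed here), is:

```latex
\begin{aligned}
H(s, \phi) &= F(s) + \tfrac{1}{2}\,\phi^{\top}\phi \\
\phi\!\left(t + \tfrac{\epsilon}{2}\right) &= \phi(t) - \tfrac{\epsilon}{2}\,\nabla_s F\big(s(t)\big) \\
s(t + \epsilon) &= s(t) + \epsilon\,\phi\!\left(t + \tfrac{\epsilon}{2}\right) \\
\phi(t + \epsilon) &= \phi\!\left(t + \tfrac{\epsilon}{2}\right) - \tfrac{\epsilon}{2}\,\nabla_s F\big(s(t + \epsilon)\big) \\
P_{acc}(\text{state}_1, \text{state}_2) &= \min\!\big(1,\ \exp\big(H(\text{state}_1) - H(\text{state}_2)\big)\big)
\end{aligned}
```

Each leapfrog step takes a half step in velocity, a full step in position, and another half step in velocity; the acceptance test corrects for the resulting discretization error.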