Learning Deep Architectures
Transcript of Learning Deep Architectures
![Page 1: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/1.jpg)
Learning Deep Architectures
Yoshua Bengio, U. Montreal
CIFAR NCAP Summer School 2009
August 6th, 2009, Montreal
Main reference: “Learning Deep Architectures for AI”, Y. Bengio, to appear in Foundations and Trends in Machine Learning, available on my web page.
Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux, Yann LeCun, Guillaume Desjardins, Pascal Lamblin, James Bergstra, Nicolas Le Roux, Max Welling, Myriam Côté, Jérôme Louradour, Pierre-Antoine Manzagol, Ronan Collobert, Jason Weston
![Page 2: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/2.jpg)
Deep Architectures Work Well
� Beating shallow neural networks on vision and NLP tasks
� Beating SVMs on visions tasks from pixels (and handling dataset � Beating SVMs on visions tasks from pixels (and handling dataset sizes that SVMs cannot handle in NLP)
� Reaching state-of-the-art performance in NLP
� Beating deep neural nets without unsupervised component
� Learn visual features similar to V1 and V2 neurons
![Page 3: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/3.jpg)
Deep Motivations
� Brains have a deep architecture
� Humans organize their ideas hierarchically, through composition of simpler ideas
� Insufficiently deep architectures can be exponentially inefficient
� Distributed (possibly sparse) representations are necessary to achieve non-local generalization, exponentially more efficient than 1-of-N enumeration latent variable values
� Multiple levels of latent variables allow combinatorial sharing of statistical strength
![Page 4: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/4.jpg)
Locally Capture the Variations
![Page 5: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/5.jpg)
Easy with Few Variations
![Page 6: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/6.jpg)
The Curse ofDimensionality
To generalise locally, need representative exemples for all possible variations!
![Page 7: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/7.jpg)
Limits of Local Generalization:Theoretical Results
Theorem: Gaussian kernel machines need at least k examples
(Bengio & Delalleau 2007)
� Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line
� Theorem: For a Gaussian kernel machine to learn some maximally varying functions over d inputs require O(2d) examples
![Page 8: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/8.jpg)
Curse of Dimensionality When Generalizing Locally on a Manifold
![Page 9: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/9.jpg)
How to Beat the Curse of Many Factors of Variation?
Compositionality: exponential gain in representational power
• Distributed representations
• Deep architecture
![Page 10: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/10.jpg)
Distributed Representations
� Many neurons active simultaneously
� Input represented by the activation of a set of features that are not mutually exclusive
� Can be exponentially more efficient than local representations
![Page 11: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/11.jpg)
Local vs Distributed
![Page 12: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/12.jpg)
Neuro-cognitive inspiration
� Brains use a distributed representation
� Brains use a deep architecture� Brains use a deep architecture
� Brains heavily use unsupervised learning
� Brains learn simpler tasks first
� Human brains developed with society / culture / education
![Page 13: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/13.jpg)
Deep Architecture in the Brain
Area V2
Area V4
Primitive shape detectors
Higher level visual
abstractions
Retina
Area V1
pixels
Edge detectors
![Page 14: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/14.jpg)
Deep Architecture in our Mind
� Humans organize their ideas and concepts hierarchically
� Humans first learn simpler concepts and � Humans first learn simpler concepts and then compose them to represent more abstract ones
� Engineers break-up solutions into multiple levels of abstraction and processing
� Want to learn / discover these concepts
![Page 15: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/15.jpg)
Deep Architectures and Sharing Statistical Strength, Multi-Task Learning
� Generalizing better to new tasks is crucial to approach AI
task 1 output y1
task 3 output y3
task 2output y2
� Deep architectures learn good intermediate representations that can be shared across tasks
� A good representation is one that makes sense for many tasks
raw input x
shared intermediate representation h
![Page 16: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/16.jpg)
Feature andSub-Feature Sharing
� Different tasks can share the same high-level feature
���� …
���� …
��
��
task 1 output y1
task N output yN
High-level features
high-level feature
� Different high-level features can be built from the same set of lower-level features
� More levels = up to exponential gain in representational efficiency ������ …
������ ��…
������ ��…
��
Low-level features
![Page 17: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/17.jpg)
Architecture Depth
Depth = 3Depth = 4
![Page 18: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/18.jpg)
Deep Architectures are More Expressive
= universal approximator2 layers of
Logic gatesFormal neuronsRBF units
…
1 2 3 2n
1 2 3
…
n
Theorems for all 3:(Hastad et al 86 & 91, Bengio et al 2007)
Functions compactly represented with k layers may require exponential size with k-1 layers
![Page 19: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/19.jpg)
Sharing Components in a Deep Architecture
Polynomial expressed with shared components:
advantage of depth may grow exponentially
![Page 20: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/20.jpg)
How to train Deep Architecture?
� Great expressive power of deep architectures� Great expressive power of deep architectures
� How to train them?
![Page 21: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/21.jpg)
The Deep Breakthrough
� Before 2006, training deep architectures was unsuccessful, except for convolutional neural nets
� Hinton, Osindero & Teh « A Fast Learning Algorithm for Deep Belief Nets », Neural Computation, 2006Belief Nets », Neural Computation, 2006
� Bengio, Lamblin, Popovici, Larochelle « Greedy Layer-Wise Training of Deep Networks », NIPS’2006
� Ranzato, Poultney, Chopra, LeCun « Efficient Learning of Sparse Representations with an Energy-Based Model », NIPS’2006
![Page 22: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/22.jpg)
Greedy Layer-Wise Pre-Training
Stacking Restricted Boltzmann Machines (RBM) � Deep Belief Network (DBN)� Supervised deep neural network
![Page 23: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/23.jpg)
Good Old Multi-Layer Neural Net
� Each layer outputs vector from
���� ��…
���� ��…
from of previous layer with params (vector) and (matrix).
� Output layer predicts parametrized distribution of target variable Y given input
������ ��…
������ ��…
������ ��…
���� ��…
![Page 24: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/24.jpg)
Training Multi-Layer Neural Nets
� Outputs: e.g. multinomial for multiclass classification with softmax output units
���� ��…
���� ��…
� Parameters are trained by gradient-based optimization of training criterion involving conditional log-likelihood, e.g.
������ ��…
������ ��…
������ ��…
���� ��…
![Page 25: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/25.jpg)
Effect of Unsupervised Pre-trainingAISTATS’2009
![Page 26: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/26.jpg)
Effect of Depthw/o pre-training with pre-training
![Page 27: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/27.jpg)
Boltzman Machines and MRFs
� Boltzmann machines:
(Hinton 84)
� Markov Random Fields:� Markov Random Fields:
� More interesting with latent variables!
![Page 28: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/28.jpg)
Restricted Boltzman Machine
� The most popular building block for deep architecturesdeep architectures
� Bipartite undirected graphical model
observed
hidden
![Page 29: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/29.jpg)
RBM with (image, label) visible units
� Can predict a subset yof the visible units given the others x
hidden
given the others x
� Exactly if y takes only few values
� Gibbs sampling o/w label
image
![Page 30: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/30.jpg)
RBMs are Universal Approximators
� Adding one hidden unit (with proper choice of parameters) guarantees increasing likelihood
(LeRoux & Bengio 2008, Neural Comp.)
� With enough hidden units, can perfectly model any discrete distribution
� RBMs with variable nb of hidden units = non-parametric
� Optimal training criterion for RBMs which will be stacked into a DBN is not the RBM likelihood
![Page 31: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/31.jpg)
RBM Conditionals Factorize
![Page 32: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/32.jpg)
RBM Energy Gives Binomial Neurons
![Page 33: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/33.jpg)
RBM Hidden Units Carve Input Space
������h1 h2 h3
����x1 x2
![Page 34: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/34.jpg)
Gibbs Sampling in RBMs
h1 ~ P(h|x1) h2 ~ P(h|x2) h3 ~ P(h|x3)
P(h|x) and P(x|h) factorize
x2 ~ P(x|h1) x3 ~ P(x|h2) x1
� Easy inference
� Convenient Gibbs sampling x�h�x�h…
![Page 35: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/35.jpg)
Problems with Gibbs Sampling
In practice, Gibbs sampling does not always mix well…
RBM trained by CD on MNIST
Chains from random state
Chains from real digits
RBM trained by CD on MNIST
![Page 36: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/36.jpg)
� Free Energy = equivalent energy when marginalizing
RBM Free Energy
� Can be computed exactly and efficiently in RBMs
� Marginal likelihood P(x) tractable up to partition function Z
![Page 37: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/37.jpg)
Factorization of the Free Energy
Let the energy have the following general form:
Then
![Page 38: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/38.jpg)
Energy-Based Models Gradient
![Page 39: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/39.jpg)
Boltzmann Machine Gradient
� Gradient has two components:
“negative phase”“positive phase”
� In RBMs, easy to sample or sum over h|x� Difficult part: sampling from P(x), typically with a Markov chain
“negative phase”“positive phase”
![Page 40: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/40.jpg)
Training RBMsContrastive Divergence:
(CD-k)start negative Gibbs chain at observed x, run k Gibbs steps
Persistent CD:(PCD)
run negative Gibbs chain in background while weights slowly changechange
Fast PCD: two sets of weights, one with a largelearning rate only used for negative phase, quickly exploring modes
Herding: Deterministic near-chaos dynamical system defines both learning and sampling
Tempered MCMC: use higher temperature to escape modes
![Page 41: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/41.jpg)
Contrastive Divergence
Contrastive Divergence (CD-k): start negative phase block Gibbs chain at observed x, run k Gibbs steps (Hinton 2002)
h ~ P(h|x) h’ ~ P(h|x’)
Sampled x’negative phase
Observed xpositive phase
k = 2 steps
x x’
Free Energy
push down
push up
![Page 42: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/42.jpg)
Persistent CD (PCD)
Run negative Gibbs chain in background while weights slowly change (Younes 2000, Tieleman 2008):
• Guarantees (Younes 89, 2000; Yuille 2004)
• If learning rate decreases in 1/t,
chain mixes before parameters change too much,
Observed x(positive phase)
new x’
h ~ P(h|x)
previous x’
chain mixes before parameters change too much,
chain stays converged when parameters change
![Page 43: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/43.jpg)
Negative phase samples quickly push up the energy of wherever they are and quickly move to another mode
FreeEnergypush down
Persistent CD with large learning rate
x
x’
push up
![Page 44: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/44.jpg)
Persistent CD with large step size
Negative phase samples quickly push up the energy of wherever they are and quickly move to another mode
push
x
x’
FreeEnergypush down
![Page 45: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/45.jpg)
Negative phase samples quickly push up the energy of wherever they are and quickly move to another mode
FreeEnergypush down
Persistent CD with large learning rate
x
x’
push up
![Page 46: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/46.jpg)
Fast Persistent CD and Herding
� Exploit impressively faster mixing achieved when parameters change quickly (large learning rate) while sampling
� Fast PCD: two sets of weights, one with a large learning rate only used for negative phase, quickly exploring modes
� Herding (see Max Welling’s ICML, UAI and workshop talks): 0-temperature MRFs and RBMs, only use fast weights
![Page 47: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/47.jpg)
Herding MRFs
� Consider 0-temperature MRF with state s and weights w
� Fully observed case, observe � Fully observed case, observe values s+, dynamical system where s- and W evolve
� Then statistics of samples s-
match the data’s statistics, even if approximate max, as long as w remains bounded
![Page 48: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/48.jpg)
Herding RBMs
� Hidden part h of the state s = (x,h)
� Binomial state variables si � {-1,1}
� Statistics f si, si sj� Statistics f si, si sj
� Optimize h given x in the positive phase
� In practice, greedy maximization works, exploiting RBM structure
![Page 49: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/49.jpg)
Fast Mixing with Herding
FPCD Herding
![Page 50: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/50.jpg)
The Sampler as a Generative Model
� Instead of the traditional clean separation between model and sampling procedure
� Consider the overall effect of combining some adaptive procedure with a sampling procedure as the generative modelprocedure with a sampling procedure as the generative model
� Can be evaluated as such (without reference to some underlying probability model)
Training data (x,y) Sampled data y
Query inputs x
![Page 51: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/51.jpg)
� Annealing from high-temperature worked well for estimating log-likelihood (AIS)
� Consider multiple chains at different temperatures and reversible swaps between adjacent chains
Higher temperature chains can escape modes
Tempered MCMC
� Higher temperature chains can escape modes
� Model samples are from T=1
Sample Generation Procedure
Training Procedure TMCMC Gibbs (ramdom start) Gibbs (test start)
TMCMC 215.45 ± 2.24 88.43 ± 2.75 60.04 ± 2.88
PCD 44.70 ± 2.51 -28.66 ± 3.28 -175.08 ± 2.99
CD -2165 ± 0.53 -2154 ± 0.63 -842.76 ± 6.17
![Page 52: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/52.jpg)
Deep Belief Networks
h2
h3
Top-level RBM
� DBN = sigmoidal belief net with RBM joint for top two layers
sampled x
h1� Sampling:
• Sample from top RMB
• Sample from level k given k+1
� Level k given level k+1 = same parametrization as RBM conditional: stacking RBMs � DBN
![Page 53: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/53.jpg)
From RBM to DBN
� RBM specifies P(v,h) from P(v|h) and P(h|v) h2
P(h1,h2) = RBM2
� Implicitly defines P(v)and P(h)
� Keep P(v|h) from 1st RBM and replace P(h) by the distribution generated by 2nd level RBM
sampled x
h1
P(x|h1) from RBM1
P(h1,h2) = RBM2
![Page 54: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/54.jpg)
Deep Belief Networks
� Easy approximate inference
• P(hk+1|hk) approximated from the associated RBM
• Approximation because P(hk+1) differs between RBM and DBN h1
h2
h3
Top-level RBM
differs between RBM and DBN
� Training:
• Variational bound justifies greedy layerwise training of RBMs
• How to train all levels together?
sampled x
h1
![Page 55: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/55.jpg)
Deep Boltzman Machines(Salakhutdinov et al, AISTATS 2009, Lee et al, ICML 2009)
� Positive phase: variational approximation (mean-field) h3
� Negative phase: persistent chain
� Can (must) initialize from stacked RBMs
� Improved performance on MNIST from 1.2% to .95% error
� Can apply AIS with 2 hidden layers observed x
h1
h2
![Page 56: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/56.jpg)
Estimating Log-Likelihood
� RBMs: requires estimating partition function
• Reconstruction error provides a cheap proxy
• Log Z tractable analytically for < 25 binary inputs or hidden• Log Z tractable analytically for < 25 binary inputs or hidden
• Lower-bounded (how well?) with Annealed Importance Sampling (AIS)
� Deep Belief Networks:
Extensions of AIS (Salakhutdinov & Murray, ICML 2008, NIPS 2008)
� Open question: efficient ways to monitor progress
![Page 57: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/57.jpg)
Deep Convolutional Architectures
Mostly from Le Cun’s group (NYU), also Ng (Stanford): state-of-the-art on MNIST digits, Caltech-101 objects, faces
![Page 58: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/58.jpg)
Convolutional DBNs(Lee et al, ICML’2009)
![Page 59: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/59.jpg)
Back to Greedy Layer-Wise Pre-Training
Stacking Restricted Boltzmann Machines (RBM) � Deep Belief Network (DBN)� Supervised deep neural network
![Page 60: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/60.jpg)
Why are Classifiers Obtained from DBNs Working so Well?
� General principles?
� Would these principles work for other single-level algorithms?
� Why does it work?
![Page 61: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/61.jpg)
Stacking Auto-Encoders
Greedy layer-wise unsupervised pre-training also works with auto-encoders
![Page 62: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/62.jpg)
Auto-encoders and CD
RBM log-likelihood gradient can be written as converging expansion: CD-k = 2 k terms, reconstruction error ~ 1 term.
(Bengio & Delalleau 2009)
![Page 63: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/63.jpg)
Greedy Layerwise Supervised Training
Generally worse than unsupervised pre-training but better than ordinary training of a deep neural network (Bengio et al. 2007).
![Page 64: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/64.jpg)
Supervised Fine-Tuning is Important
� Greedy layer-wise unsupervised pre-training phase with RBMs or auto-encoders on MNISTencoders on MNIST
� Supervised phase with or without unsupervised updates, with or without fine-tuning of hidden layers
� Can train all RBMs at the same time, same results
![Page 65: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/65.jpg)
Sparse Auto-Encoders
� Sparsity penalty on the intermediate codes
� Like sparse coding but with efficient run-time encoder
(Ranzato et al, 2007; Ranzato et al 2008)
� Sparsity penalty pushes up the free energy of all configurations (proxy for minimizing the partition function)
� Impressive results in object classification (convolutional nets):
• MNIST .5% error = record-breaking
• Caltech-101 65% correct = state-of-the-art (Jarrett et al, ICCV 2009)
� Similar results obtained with a convolutional DBN (Lee et al, ICML’2009)
![Page 66: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/66.jpg)
Denoising Auto-Encoder
KL(reconstruction|raw input)Hidden code(representation)
(Vincent et al, 2008)
� Corrupt the input (e.g. set 25% of inputs to 0)
� Reconstruct the uncorrupted input
� Use uncorrupted encoding as input to next level
Corrupted input Raw input reconstruction
![Page 67: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/67.jpg)
Denoising Auto-Encoder
� Learns a vector field towards higher probability regions
� Minimizes variational lower bound on a generative model
Corrupted input
on a generative model
� Similar to pseudo-likelihood
Corrupted input
![Page 68: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/68.jpg)
Stacked Denoising Auto-Encoders
� No partition function, can measure training criterion
� Encoder & decoder: any parametrizationany parametrization
� Performs as well or better than stacking RBMs for usupervised pre-training
� Generative model is semi-parametric
Infinite MNIST
![Page 69: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/69.jpg)
Denoising Auto-Encoders: Benchmarks
![Page 70: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/70.jpg)
Denoising Auto-Encoders: Results
![Page 71: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/71.jpg)
Why is Unsupervised Pre-Training Working So Well?
� Regularization hypothesis:
• Unsupervised component forces model close to P(x)• Unsupervised component forces model close to P(x)
• Representations good for P(x) are good for P(y|x)
� Optimization hypothesis:
• Unsupervised initialization near better local minimum of P(y|x)
• Can reach lower local minimum otherwise not achievable by random initialization
![Page 72: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/72.jpg)
Learning Trajectories in Function Space
� Each point a model in function space
� Color = epoch
� Top: trajectories � Top: trajectories w/o pre-training
� Each trajectory converges in different local min.
� No overlap of regions with and w/o pre-training
![Page 73: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/73.jpg)
Unsupervised learning as regularizer
� Adding extra regularization (reducing # hidden units) hurts more the pre-trained models
� Pre-trained models have � Pre-trained models have less variance wrt training sample
� Regularizer = infinite penalty outside of region compatible with unsupervised pre-training
![Page 74: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/74.jpg)
Better optimization of online error
� Both training and online error are smaller with unsupervised pre-training
� As # samples �� As # samples �training err. = online err. = generalization err.
� Without unsup. pre-training: can’t exploit capacity to capture complexity in target function from training data
![Page 75: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/75.jpg)
Pre-training lower layers more critical
Verifies that what matters is not just the marginal distribution marginal distribution over initial weight values
(Histogram init.)
![Page 76: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/76.jpg)
The Credit Assignment Problem
� Even with the correct gradient, lower layers (far from the prediction, close to input) are the most difficult to train
� Lower layers benefit most from unsupervised pre-training� Lower layers benefit most from unsupervised pre-training
• Local unsupervised signal = extract / disentangle factors
• Temporal constancy
• Mutual information between multiple modalities
� Credit assignment / error information not flowing easily?
� Related to difficulty of credit assignment through time?
![Page 77: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/77.jpg)
Level-Local Learning is Important
� Initializing each layer of an unsupervised deep Boltzmann machine helps a lot
� Initializing each layer of a supervised neural network as an RBM helps a lotRBM helps a lot
� Helps most the layers further away from the target
� Not just an effect of unsupervised prior
� Jointly training all the levels of a deep architecture is difficult
� Initializing using a level-local learning algorithm (RBM, auto-encoders, etc.) is a useful trick
![Page 78: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/78.jpg)
Semi-Supervised Embedding
� Use pairs (or triplets) of examples which are known to represent nearby concepts (or not)
Bring closer the intermediate representations of supposedly � Bring closer the intermediate representations of supposedly similar pairs, push away the representations of randomly chosen pairs
� (Weston, Ratle & Collobert, ICML’2008): improved semi-supervised learning by combining unsupervised embedding criterion with supervised gradient
![Page 79: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/79.jpg)
Slow Features
� Successive images in a video = similar
� Randomly chosen pair of images = dissimilar
� Slowly varying features are likely to represent interesting � Slowly varying features are likely to represent interesting abstractions
Slow features 1st layer
![Page 80: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/80.jpg)
Learning Dynamics of Deep Nets
Before fine-tuning After fine-tuning
![Page 81: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/81.jpg)
Learning Dynamics of Deep Nets
� As weights become larger, get trapped in basin of attraction (“quadrant” does not change)
� Initial updates have a crucial influence (“critical period”), explain more of the variance
� Unsupervised pre-training initializes in basin of attraction with good generalization properties
0
![Page 82: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/82.jpg)
Order & Selection of Examples Matters
� Curriculum learning (Bengio et al, ICML’2009; Krueger & Dayan 2009)
� Start with easier examples
� Faster convergence to a better local minimum in deep architectures
� Also acts like a regularizer with optimization effect?
� Influencing learning dynamics can make a big difference
![Page 83: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/83.jpg)
Continuation Methods
Track local minima
Final solution
Easy to find minimum
![Page 84: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/84.jpg)
Curriculum Learning as Continuation
� Sequence of training distributions
3 • Most difficult examples
• Higher level abstractions
2
� Initially peaking on easier / simpler ones
� Gradually give more weight to more difficult ones until reach target distribution
1• Easiest• Lower levelabstractions
![Page 85: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/85.jpg)
Take-Home Messages
� Break-through in learning complicated functions: deep architectures with distributed representations
� Multiple levels of latent variables: potentially exponential gain in statistical sharing
Main challenge: training deep architectures� Main challenge: training deep architectures
� RBMs allow fast inference, stacked RBMs / auto-encoders have fast approximate inference
� Unsupervised pre-training of classifiers acts like a strange regularizer with improved optimization of online error
� At least as important as the model: the inference approximations and the learning dynamics
![Page 86: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/86.jpg)
Some Open Problems
� Why is it difficult to train deep architectures?
� What is important in the learning dynamics?
How to improve joint training of all layers?� How to improve joint training of all layers?
� How to sample better from RBMs and deep generative models?
� Monitoring unsupervised learning quality in deep nets?
� Other ways to guide training of intermediate representations?
� Capturing scene structure and sequential structure?
![Page 87: Learning Deep Architectures](https://reader030.fdocuments.us/reader030/viewer/2022012723/61b538dae51ca44832154d56/html5/thumbnails/87.jpg)
Thank you for your attention!
� Questions?
� Comments?