Using Fast Weights to Improve Persistent Contrastive Divergence
Tijmen Tieleman, Geoffrey Hinton, Department of Computer Science, University of Toronto
ICML 2009
presented by Jorge Silva
Department of Electrical and Computer Engineering, Duke University
2/17
Problems of interest: Density Estimation and Classification using RBMs
• RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., recurrent neural network); often used as an associative memory
• Can also be seen as a particular case of a Deep Belief Network (DBN)
• Why “restricted”? Because we restrict connectivity: there are no intra-layer connections
(Hinton, 2002; Smolensky 1986)
[Figure: an RBM drawn as a bipartite graph; hidden units (the internal, or hidden, representations) in one layer, visible units (the data pattern, a binary vector) in the other; adapted from www.iro.montreal.ca]
3/17
Notation
• Define the following energy function over a visible state v and a hidden state h:
  E(v, h) = -\sum_{i,j} v_i h_j w_{ij} - \sum_i v_i a_i - \sum_j h_j b_j
  where v_i is the state of the i-th visible unit, h_j is the state of the j-th hidden unit, w_{ij} is the weight of the i-j connection, and a_i, b_j are the biases
• The joint probability P(v,h) and the marginal P(v) are
  P(v, h) = e^{-E(v,h)} / Z,  P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)},  with partition function Z = \sum_{v,h} e^{-E(v,h)}
  (a small code sketch follows below)
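As a concrete companion to the formulas above, here is a minimal NumPy sketch of the energy and the unnormalized joint probability; the array names, sizes, and initialization are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Minimal sketch (not the paper's code): energy and unnormalized joint
# probability of a small binary RBM with the parameters defined above.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                                 # visible biases a_i
b = np.zeros(n_hidden)                                  # hidden biases  b_j

def energy(v, h):
    """E(v, h) = -v' W h - a' v - b' h for binary vectors v and h."""
    return -(v @ W @ h) - a @ v - b @ h

def unnormalized_joint(v, h):
    """exp(-E(v, h)); dividing by the partition function Z gives P(v, h)."""
    return np.exp(-energy(v, h))

v = rng.integers(0, 2, n_visible).astype(float)
h = rng.integers(0, 2, n_hidden).astype(float)
print(energy(v, h), unnormalized_joint(v, h))
```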
4/17
Training with gradient descent
• Training data log-likelihood (using just one datum v for simplicity): \log P(v) = \log \sum_h e^{-E(v,h)} - \log Z, with gradient \partial \log P(v) / \partial w_{ij} = \langle v_i h_j \rangle_{P(h|v)} - \langle v_i h_j \rangle_{P(v,h)}
• The positive gradient (first term) is easy: P(h|v) factorizes over the hidden units, so \langle v_i h_j \rangle_{P(h|v)} is available in closed form
• But the negative gradient (second term) is intractable: it is an expectation under the model distribution P(v,h), whose partition function Z we cannot compute
• We can’t even sample exactly from the model, so a direct Monte Carlo approximation is not available either (the sketch below illustrates the split)
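To make the positive/negative split concrete, a hedged sketch in the same NumPy style: the positive statistics follow in closed form from P(h|v), while the negative statistics would need the intractable model expectation. Function and variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

def positive_statistics(v, W, b):
    """Data-dependent term <v_i h_j>_{P(h|v)} for one training vector v.
    P(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + b_j), so the expectation is
    available in closed form."""
    p_h = expit(v @ W + b)
    return np.outer(v, p_h)

# The negative term <v_i h_j>_{P(v,h)} would require averaging over all
# 2^(n_visible + n_hidden) configurations (or exact samples from the model),
# which is what makes maximum-likelihood training intractable.
```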
5/17
Contrastive Divergence (CD)
• However, we can approximately sample from the model. The existing Contrastive Divergence (CD) algorithm is one way to do it
• CD gets the direction of the gradient approximately right, though not the magnitude
• The rough idea behind CD is to:
– start a Markov chain at one of the training points (the same points used to estimate the positive, data-dependent term)
– perform one full Gibbs update, i.e., sample h ~ P(h|v), then v' ~ P(v|h), then h' ~ P(h|v')
– treat the resulting configuration (v', h') as an approximate sample from the model (see the sketch at the end of this slide)
• What about “Persistent” CD? (Hinton, 2002)
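A minimal sketch of the CD-1 step described above, assuming the W, a, b arrays from the earlier sketch; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(1)

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_gradient(v0, W, a, b):
    """CD-1 sketch: start the chain at a training vector v0, do one full
    Gibbs update, and treat the result as an approximate model sample."""
    p_h0 = expit(v0 @ W + b)          # P(h | v0)
    h0 = sample_bernoulli(p_h0)
    p_v1 = expit(h0 @ W.T + a)        # P(v | h0): the "reconstruction"
    v1 = sample_bernoulli(p_v1)
    p_h1 = expit(v1 @ W + b)          # P(h | v1)
    positive = np.outer(v0, p_h0)     # data-dependent statistics
    negative = np.outer(v1, p_h1)     # approximate model statistics
    return positive - negative        # approximate gradient direction for W
```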
6/17
Persistent Contrastive Divergence (PCD)
• Use a persistent Markov chain that is not reinitialized each time the parameters are updated (a minimal sketch follows below)
• The learning rate should be small compared to the mixing rate of the Markov chain
• Many persistent chains can be run in parallel; the corresponding (h,v) pairs are called “fantasy particles”
• For a fixed amount of computation, RBMs trained with PCD learn better models than with standard CD
• Again, PCD is a previously existing algorithm
(Neal, 1992; Tieleman, 2008)
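The persistent variant can be sketched in the same style; the only change is that the negative-phase chains keep their state across parameter updates instead of restarting at the data. The mini-batch layout and function name are assumptions for illustration.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(2)

def pcd_negative_statistics(fantasy_v, W, a, b, n_gibbs=1):
    """PCD sketch: fantasy_v holds the visible states of M persistent chains
    (one row per chain).  The chains are advanced by a few Gibbs updates and
    kept for the next parameter update instead of being reset to the data."""
    v = fantasy_v
    for _ in range(n_gibbs):
        p_h = expit(v @ W + b)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = expit(h @ W.T + a)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    p_h = expit(v @ W + b)
    negative = v.T @ p_h / v.shape[0]   # <v_i h_j> averaged over the chains
    return negative, v                  # updated fantasy particles are kept
```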
7/17
Contributions and outline
• Theoretical: show the interaction between the mixing rates and the weight updates in PCD
• Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff
• Outline for the rest of the talk:
  – Mixing rates vs. weight updates
  – Fast weights
  – PCD algorithm with fast weights (FPCD)
  – Experiments
8/17
Mixing rates vs weight updates
• Consider M persistent chains
• The states (v,h) of the chains define a distribution R consisting of M point masses
• Assume M is large enough that we can ignore sampling noise
• The weights are updated in the direction of the negative gradient of KL(P‖Q) − KL(R‖Q)
• P is the data distribution and Q is the intractable model distribution (being approximated by R)
• θ is the vector of parameters (weights)
9/17
Mixing rates vs weight updates
• Terms in the objective function KL(P‖Q) − KL(R‖Q):
  – KL(P‖Q): this term is the negative log-likelihood (minus the fixed entropy of P)
  – KL(R‖Q): this term is being maximized w.r.t. θ
• The weight updates therefore increase KL(R‖Q) (which is bad, since R is supposed to track Q), but
• this is compensated by an increase in the mixing rates, making KL(R‖Q) decrease rapidly again (which is good)
• Essentially, the fantasy particles quickly “rule out” large portions of the search space where Q is negligible (the decomposition is written out below)
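To make the two annotated terms explicit, the decomposition can be written out; this is a sketch of the argument using the P, Q, R notation above, with Q_θ denoting the model distribution at parameters θ.

```latex
% Sketch of the objective, using the notation from the slides.
\[
  \mathrm{KL}(P \,\|\, Q_\theta) \;-\; \mathrm{KL}(R \,\|\, Q_\theta)
  \;=\;
  \underbrace{-H(P) \;-\; \mathbb{E}_{v \sim P}\bigl[\log Q_\theta(v)\bigr]}_{\text{neg.\ log-likelihood, up to the fixed entropy } H(P)}
  \;-\; \mathrm{KL}(R \,\|\, Q_\theta).
\]
% Gradient steps on theta thus (i) raise the data log-likelihood and
% (ii) raise KL(R || Q_theta), i.e. they lower Q_theta (raise the energy)
% at the fantasy particles; the Gibbs transitions of the persistent chains
% then pull R back toward Q_theta, which is the mixing effect described above.
```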
10/17
Fast weights
• In addition to the regular weights θ, the paper introduces fast weights
• The fast weights are only used for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay = an L2 penalty, as in ridge regression)
• The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster
• This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction in learning rates, which is otherwise desirable as learning progresses
• The learning rate of the fast weights stays constant, but the weights themselves decay fast, so their effect is temporary
(Bharath & Borkar, 1999)
11/17
PCD algorithm with fast weights (FPCD)
[Algorithm listing: the FPCD update rules, with an annotation marking the fast weights' strong weight decay; a hedged code sketch follows]
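A hedged sketch of one FPCD update in the same NumPy style: the fantasy particles are driven by the sum of the regular and fast weights, and the fast weights receive a larger learning rate plus a strong multiplicative decay. The specific learning rates and the 0.95 decay factor are illustrative placeholders, not the constants used in the paper.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(3)

def fpcd_step(v_data, fantasy_v, W, W_fast, a, b,
              lr=0.01, lr_fast=0.05, fast_decay=0.95):
    """One FPCD weight update (sketch).  Only W defines the model that is
    kept; W + W_fast drives the fantasy particles so that they mix faster."""
    # Positive phase on a mini-batch of training vectors (one per row).
    p_h_data = expit(v_data @ W + b)
    positive = v_data.T @ p_h_data / v_data.shape[0]

    # Negative phase: one Gibbs update of the persistent chains, using the
    # combined (regular + fast) weights.
    W_comb = W + W_fast
    p_h = expit(fantasy_v @ W_comb + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = expit(h @ W_comb.T + a)
    fantasy_v = (rng.random(p_v.shape) < p_v).astype(float)
    negative = fantasy_v.T @ expit(fantasy_v @ W_comb + b) / fantasy_v.shape[0]

    grad = positive - negative
    W = W + lr * grad                               # slow, regular update
    W_fast = fast_decay * W_fast + lr_fast * grad   # large rate, strong decay
    return W, W_fast, fantasy_v
```

In an actual run this step would be repeated over mini-batches, with the biases a and b updated analogously to W.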
12/17
Experiments: MNIST dataset
• Small-scale task: density estimation using an RBM with 25 hidden units
• Larger task: classification using an RBM with 500 hidden units
• In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types.
• In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time
• Performance is measured on a held-out test set
• The learning rate for the regular weights decays linearly to zero over the allotted computation time; the fast-weight learning rate is held constant at 1/e (see the schedule sketch below)
(Hinton et al., 2006; Larochelle & Bengio, 2008)
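For concreteness, a small sketch of the learning-rate schedules described here, plus the 1/t schedule used later for Micro-NORB; the function names and the base rate lr0 are illustrative assumptions.

```python
import math

def lr_regular_linear(t, t_total, lr0):
    """Regular-weight learning rate for MNIST: decays linearly to zero."""
    return lr0 * (1.0 - t / t_total)

def lr_regular_inverse(t, lr0):
    """Regular-weight learning rate for Micro-NORB: decays as 1/t."""
    return lr0 / (t + 1.0)

LR_FAST = 1.0 / math.e  # fast-weight learning rate, held constant (slide: 1/e)
```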
13/17
Experiments: MNIST dataset (fixed RBM size)
14/17
Experiments: MNIST dataset (optimized RBM size)
• FPCD: 1200 hidden units
• PCD: 700 hidden units
15/17
Experiments: Micro-NORB dataset
• Classification task on 96x96 images, downsampled to 32x32
• MNORB dimensionality (before downsampling) is 18432, while MNIST is 784
• Learning rate decays as 1/t for regular weights
(LeCun et al., 2004)
16/17
Experiments: Micro-NORB dataset
[Plot annotation: non-monotonicity indicates overfitting problems]
17/17
Conclusion
• FPCD outperforms PCD, especially when the number of weight updates is small
• FPCD allows more flexible learning rate schedules than PCD
• Results on the MNORB data indicate that FPCD also outperforms PCD on datasets where overfitting is a concern
• Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced dataset
• Future work: run FPCD for a longer time on an established dataset