Post on 18-Mar-2016
1. Motivation
• Undirected models are useful in many settings. Consider models in exponential family form:
• Task: Given i.i.d. data, estimate parameters accurately and quickly
• Maximum likelihood estimation (MLE):
• Need to resort to approximate techniques:
• Pseudolikelihood / composite likelihoods
• Sampling-based techniques (e.g. MCMC-MLE)
• Contrastive divergence (CD) learning
• We propose particle filtered MCMC-MLE
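The poster's equations were images and did not survive extraction. A standard way to write the exponential-family model and the MLE gradient referred to above (a reconstruction, not the poster's exact typesetting) is:

```latex
% Exponential-family model with natural parameters \theta and
% sufficient statistics \phi(x); Z(\theta) is the partition function
p(x \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big(\theta^\top \phi(x)\big),
\qquad
Z(\theta) = \sum_x \exp\!\big(\theta^\top \phi(x)\big)

% MLE gradient: empirical statistics minus model expectations
\nabla_\theta \log L(\theta)
= \mathbb{E}_{\tilde p(x)}\!\left[\phi(x)\right]
- \mathbb{E}_{p(x \mid \theta)}\!\left[\phi(x)\right]
```

The model expectation requires the intractable partition function Z(θ), which is why the approximate techniques above are needed.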
Particle Filtered MCMC-MLE with Connections to Contrastive Divergence
Arthur Asuncion, Qiang Liu, Alexander Ihler, Padhraic Smyth
Department of Computer Science, University of California, Irvine
3. Contrastive Divergence (CD)
• Widely-used machine learning algorithm for learning undirected models [Hinton, 2002]
• CD can be motivated by taking gradient of log-likelihood directly:
• CD-n samples from current model (approx.):
• Initialize chains at empirical data distribution
• Only run n MCMC steps
• Persistent CD: initialize chains at samples at previous iteration [Tieleman, 2008]
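The CD-n recipe above can be sketched in code. This is a minimal illustration, not the poster's implementation: a fully visible Boltzmann machine (one of the model classes in the experiments) with Gibbs sampling; the function names and the toy data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(X, W, b, rng):
    """One Gibbs sweep over all units of a fully visible Boltzmann machine.
    p(x_i = 1 | x_-i) = sigmoid(b_i + sum_j W_ij x_j); W symmetric, zero diagonal."""
    X = X.copy()
    for i in range(X.shape[1]):
        act = b[i] + X @ W[:, i]
        p = 1.0 / (1.0 + np.exp(-act))
        X[:, i] = (rng.random(X.shape[0]) < p).astype(float)
    return X

def cd_n_gradient(data, W, b, n, rng):
    """CD-n gradient estimate: chains are initialized at the empirical data
    distribution and run for only n Gibbs sweeps (not to equilibrium)."""
    X = data.copy()
    for _ in range(n):
        X = gibbs_sweep(X, W, b, rng)
    # positive phase (data statistics) minus negative phase (short-run samples)
    grad_W = (data.T @ data - X.T @ X) / len(data)
    np.fill_diagonal(grad_W, 0.0)          # no self-connections
    grad_b = data.mean(0) - X.mean(0)
    return grad_W, grad_b

# Toy usage: 3 binary units, random data, a few CD-1 updates
data = rng.integers(0, 2, size=(100, 3)).astype(float)
W, b = np.zeros((3, 3)), np.zeros(3)
for _ in range(50):
    gW, gb = cd_n_gradient(data, W, b, n=1, rng=rng)
    W += 0.1 * gW                          # gradient is symmetric, so W stays symmetric
    b += 0.1 * gb
```

Persistent CD differs only in initialization: the chains `X` would be carried over from the previous parameter update instead of being reset to `data`.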
5. Experimental Analysis
• Visible Boltzmann machines:
• Exponential random graph models (ERGMs):
• Conditional random fields (CRFs):
• Restricted Boltzmann machines (RBMs):
2. MCMC-MLE
• Widely used in statistics [Geyer, 1991]
• Idea: draw samples from an alternative distribution p(x|θ0) using MCMC, to approximate the likelihood:
• To optimize approximate likelihood, use gradient:
• Degeneracy problems if θ moves far from initial θ0
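The approximate likelihood and its gradient referenced above (reconstructed here, since the poster's equations did not survive extraction) take the standard MCMC-MLE form:

```latex
% Samples x^{(1)}, \dots, x^{(N)} \sim p(x \mid \theta_0) give an
% importance-sampling estimate of the log partition-function ratio:
\log \frac{Z(\theta)}{Z(\theta_0)}
\approx \log \frac{1}{N} \sum_{s=1}^{N} w_s,
\qquad
w_s = \exp\!\big((\theta - \theta_0)^\top \phi(x^{(s)})\big)

% Gradient of the resulting approximate log-likelihood:
\nabla_\theta \hat\ell(\theta)
= \mathbb{E}_{\tilde p(x)}\!\left[\phi(x)\right]
- \frac{\sum_s w_s\, \phi(x^{(s)})}{\sum_s w_s}
```

The degeneracy problem is visible here: when θ moves far from θ0, a few weights w_s dominate and the weighted average becomes a very poor estimate of the model expectation.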
4. Particle Filtered MCMC-MLE (PF)
• Use sampling-importance-resampling (SIR) with MCMC rejuvenation to estimate gradient
• Monitor effective sample size (ESS):
• If ESS (“health” of particles) is low:
• Resample particles in proportion to w
• Rejuvenate with n MCMC steps based on θ
• PF can avoid MCMC-MLE’s degeneracy issues
• PF can potentially be faster than CD, since it only “rejuvenates” (runs MCMC steps) when ESS is low
• As the number of particles approaches infinity, PF recovers the exact log-likelihood gradient
6. Conclusions
• Particle filtered MCMC-MLE can avoid the degeneracy issues of MCMC-MLE by performing resampling and rejuvenation
• Particle filtered MCMC-MLE is sometimes faster than CD since it only rejuvenates when needed
• There is a unified view of these algorithms: MCMC-MLE, CD, persistent CD, and PF differ mainly in how the model samples used in the gradient estimate are initialized and refreshed
Figure and diagram callouts:
• Partition function usually intractable
• MCMC-MLE uses importance sampling (a Monte Carlo approximation) to estimate the gradient
• MCMC-MLE diagram: run MCMC under p(x|θ0) until equilibrium → calculate new weight → update θ using approximate gradient → …
• PF diagram: run MCMC under θ for n steps → … → calculate weight and check ESS → if ESS is low, resample and rejuvenate → run MCMC under θ for n steps
• ERGM network statistics: # edges, # 2-stars, # triangles
• RBM experiments on MNIST data; 500 hidden units used
• PF can be viewed as a “hybrid” between MCMC-MLE and CD