Slides
Variational Generative Stochastic Networks with
Collaborative Shaping
Philip Bachman and Doina Precup
McGill University, School of Computer Science
Philip Bachman and Doina Precup Variational GSNs with Collaborative Shaping
What do we want from our model?
We want to develop a generative model that can:
Shape its distribution G to match a target distribution D.
Generate local random walks through G.
Generate independent samples from G.
Provide an efficient estimate of log G(x).
What tools will we use?
We’ll approach this problem by combining recent work on:
Denoising auto-encoders as generative models – [BYAV13, BTLAY14]
 - This will give us local random walks.
Variational inference for deep generative models – [KW14, RMW14]
 - This will give us independent samples and log G(x).
Approximate Bayesian Computation – [GDKC14a, GDKC14b, GPAM+14]
 - This will keep the random walks near D.
Useful Terminology
Some definitions we’ll use to present our work:
x ∈ X will indicate "observable" variables.
z ∈ Z will indicate "latent" variables.
q_φ(z|x) – this is the corruption process.
p_θ(x|z) – this is the reconstruction distribution.
p*(z) – this is the prior distribution.
G/D – these are the model/target distributions.
Denoising auto-encoders and Markov chains
A denoising auto-encoder trains p_θ(x|z) to match the conditionals observed in pairs (x_i, z_i) generated by sampling x_i ∼ D and then z_i ∼ q_φ(z|x_i).
Noise typically interpreted as imposed on x can be more generally thought of as noise in the encoder q_φ(z|x_i).
Given p_θ(x|z) and q_φ(z|x), we can start with x_0 ∼ D and then iterate between sampling z_i ∼ q_φ(z|x_i) and x_{i+1} ∼ p_θ(x|z_i).
This defines a Markov chain over x ∈ X with transition operator T_θ(x_{t+1}|x_t) ∝ Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t).
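As a concrete illustration (not from the slides), the q/p chain can be unrolled with two sampling routines. Here sample_q and sample_p are hypothetical 1-D Gaussian stand-ins for the learned conditionals, chosen so the reconstructor pulls samples toward a toy target mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned conditionals: q_phi adds local
# Gaussian noise to x, and p_theta pulls z back toward the data mean, as a
# trained reconstructor would for a target D concentrated near that mean.
DATA_MEAN = 2.0

def sample_q(x):
    """z ~ q_phi(z | x): local Gaussian corruption of x."""
    return x + 0.5 * rng.standard_normal()

def sample_p(z):
    """x ~ p_theta(x | z): noisy reconstruction biased toward D."""
    return 0.8 * z + 0.2 * DATA_MEAN + 0.1 * rng.standard_normal()

def run_chain(x0, n_steps):
    """Unroll the chain with T(x_{t+1} | x_t) ∝ sum_z p(x|z) q(z|x_t)."""
    xs = [x0]
    for _ in range(n_steps):
        z = sample_q(xs[-1])
        xs.append(sample_p(z))
    return xs

samples = run_chain(x0=0.0, n_steps=2000)
```

With these toy conditionals the chain's stationary mean solves m = 0.8m + 0.4, i.e. m = 2, matching DATA_MEAN, so a long run drifts to and then hovers around the target.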
GSNs and Simple GSNs
GSNs extend the standard denoising auto-encoder by changing q_φ(z|x_i) to q_φ(z|x_i, z_{i-1}).
This permits additional "hidden state" (in z_{i-1}), which may improve Markov chain sampling behavior.
Simple GSNs are GSNs where q_φ(z|x_i, z_{i-1}) ⊥ z_{i-1}.
Any GSN trained using walkback samples is equivalent to a Simple GSN with a corruption process W(z|x; p_θ, q_φ) defined by a procedural wrapper around p_θ and q_φ.
 - But... what's walkback?
Controlling dynamics of the q/p chain
Problem 1: if q_φ(z|x) is too local, the q/p chain won't mix well between separated modes of D.
Walkback reduces locality by iteratively applying q/p when generating each corrupt/reconstruct pair (x, z).¹
 - We obtain a similar effect by minimizing a modified VFE.
Problem 2: When unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.
Walkback removes spurious attractors from the q/p chain by repeatedly unrolling it and pulling it back to D.
 - We obtain a similar effect using techniques from ABC.

¹ E.g., if q adds (small σ, µ = 0) Gaussian noise, p will be Gaussian, and walkback will (roughly) perform iterated Gaussian convolution.
Balancing local vs. non-local corruption in GSNs
Figure: (a) the corruption process is more local, making p(x|z) simple but causing slow mixing between modes of D in the q/p chain. (b) the corruption process is less local, making p(x|z) tricky but causing fast mixing between modes of D in the q/p chain.
Variational Simple GSNs
We train a VAE comprising q_φ and p_θ by minimizing²:

  E_{x∼D} E_{z∼q_φ(z|x)} [ −log p_θ(x|z) + λ KL(q_φ(z|x) || p*(z)) ]

We run a Markov chain by feeding the VAE back into itself.
λ lets us control E_{(x_1,x_2)∼D} KL(q_φ(z|x_2) || q_φ(z|x_1)), which, roughly speaking, corresponds to the locality of q_φ.
But, we still need better control over the dynamics of the unrolled chain (see example videos).
We'll use Approximate Bayesian Computation to shape the distribution G emitted by our unrolled, self-looped VAE.

² We also penalize "squared KL above the mean" – see our paper for details.
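A minimal sketch of this objective (our notation, not the authors' code): for a diagonal-Gaussian encoder with an N(0, I) prior and a Bernoulli decoder, the per-example loss with KL weight lam might look like:

```python
import numpy as np

def kl_to_std_normal(mu, log_var):
    """KL(q_phi(z|x) || p*(z)) for a diagonal Gaussian q and N(0, I) prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def free_energy(x, recon_prob, mu, log_var, lam=1.0):
    """-log p_theta(x|z) + lam * KL(q_phi(z|x) || p*(z)).

    Uses the usual one-sample Monte Carlo estimate: recon_prob holds the
    Bernoulli decoder means for a single z drawn from q_phi(z|x).
    """
    eps = 1e-7  # guard against log(0)
    log_px = np.sum(x * np.log(recon_prob + eps)
                    + (1.0 - x) * np.log(1.0 - recon_prob + eps), axis=-1)
    return -log_px + lam * kl_to_std_normal(mu, log_var)
```

Raising lam drives the per-example posteriors toward the shared prior, which (per the slide) makes q_φ less local and speeds mixing at the cost of a harder reconstruction problem.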
Shaping Markov chain dynamics with ABC
Approximate Bayesian Computation is (roughly) based on training a generative model by minimizing some measure of dissimilarity between G and D.
Examples include moment matching (e.g. MMD, etc.) and classification-based methods (e.g. GANs by Goodfellow et al. and related work by Gutmann et al.).
For our models, we:
Train a guide function f to estimate log(D/G).
Move mass in G towards increasing f when f < 0.
Don't move G-mass emitted in regions where f > 0.
Show a global minimum occurs iff ∀x, D(x) = G(x).³

³ The first three points, though vague, are sufficient for this to hold.
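As an illustrative sketch (a hypothetical toy, not the authors' setup): a logistic-regression "guide" trained to separate target samples from model samples recovers log(D/G) in its logit. With D = N(1, 1) and G = N(0, 1), the exact log-ratio is x − 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target D = N(1, 1) and model G = N(0, 1); the exact log-density
# ratio is log(D(x)/G(x)) = x - 0.5.
xd = rng.normal(1.0, 1.0, 20000)   # samples from D (label 1)
xg = rng.normal(0.0, 1.0, 20000)   # samples from G (label 0)
x = np.concatenate([xd, xg])
y = np.concatenate([np.ones_like(xd), np.zeros_like(xg)])

# Fit logistic regression f(x) = w*x + b by gradient descent; at the
# optimum, the classifier's logit estimates log(D(x)/G(x)).
w, b = 0.0, 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((p - y) * x)
    b -= 0.5 * np.mean(p - y)

def guide(t):
    """f(t) ~= log(D(t)/G(t)); negative where G over-covers relative to D."""
    return w * t + b
```

Regions where guide(x) < 0 are where G puts too much mass relative to D, i.e. exactly where the shaping rule above moves G-mass toward increasing f.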
Summarizing our model
[Diagram (b) annotations: x_0 ∼ D(x); z_i ∼ W(z|x_i; p_θ, q_φ); x_{i+1} ∼ p_θ(x|z_i); guide f(x_i) ≈ log(D(x_i)/G(x_i)).]
Figure: (a) the top figure gives a schematic for optimizing variational free-energy; the bottom figure shows valid ABC loss functions for q/p chain samples based on f ≈ log(D/G). (b) the full graph for our model.
Results
We now briefly look at:
How re-weighting the free-energy KLd term affects G.
How collaborative unrolling affects chain dynamics.
How our models perform on different quantitative tests.
MNIST – KLd comparison, Side-by-side
(a) λ = 1, γ = 0 (b) λ = 4, γ = 0.1 (c) λ = 24, γ = 0.1
Figure: Side-by-side comparison of independent samples from VAEs trained with varying strengths of KLd penalty. Scores on GPDE test were (L-to-R): 220, 265, and 330.
Chain behavior with and without guided unrolling
Figure: Comparing chains generated by models learned with and without collaborative shaping. The samples in (a) were generated by a corrupt-reconstruct pair q_φ/p_θ trained for 100k updates as a variational auto-encoder (VAE), and then for 100k updates as a 6-step unrolled, collaboratively-guided chain. Samples in (b) are from the same model, but with 200k updates of standard VAE training.
Visualizing Markov chain dynamics – MNIST
[Figure: Markov chain samples with collaborative shaping: none vs. 6 steps.]
Visualizing Markov chain dynamics – TFD
[Figure: Markov chain samples: standard free-energy vs. strong KL regularization.]
Conclusion
Problem 1: if q_φ(z|x) is too local, the chain generated by T_θ(x_{t+1}|x_t) = (1/C) Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) won't mix well between separated modes of D.
 - We address this using modified KL terms in the VFE.
Problem 2: When unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.
 - We address this using collaborative unrolling.
Problem 3: When q_φ(z|x) is non-local, p_θ(x|z) may need to capture sophisticated structure, e.g. multi-modality.
 - Future work – construct q/p peu à peu – see our poster/talk at EWRL or our poster at the DL workshop.
Questions?
References I
Yoshua Bengio, Eric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski, Deep generative stochastic networks trainable by backprop, arXiv:1306.1091v5 [cs.LG], 2014.
Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent, Generalized denoising auto-encoders as generative models, Advances in Neural Information Processing Systems (NIPS), 2013.
Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander, Classifier ABC, MCMSki IV (posters), 2014.
Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander, Likelihood-free inference via classification, arXiv:1407.4981v1 [stat.CO], 2014.
References II
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems (NIPS), 2014.
Diederik P Kingma and Max Welling, Auto-encoding variational bayes, International Conference on Learning Representations (ICLR), 2014.
Danilo Rezende, Shakir Mohamed, and Daan Wierstra, Stochastic backpropagation and approximate inference in deep generative models, International Conference on Machine Learning (ICML), 2014.
Graphical models for DAEs, GSNs, and Simple GSNs
(a) Generalized DAE chain: x̃_i ∼ q_φ(x̃|x_i), x_{i+1} ∼ p_θ(x|x̃_i).
(b) GSN chain: z_0 ∼ q_φ(z), x_i ∼ p_θ(x|z_i), z_{i+1} ∼ q_φ(z|x_i, z_i).
(c) Simple GSN chain: x_0 ∼ D(x), z_i ∼ W(z|x_i; p_θ, q_φ), x_{i+1} ∼ p_θ(x|z_i).
Figure: (a) The Markov chain for a Generalized DAE. (b) The Markov chain for a GSN. (c) The Markov chain for a Simple GSN using a corruption process W(z|x; q_φ, p_θ) formed via the walkback procedure.
Wrapping p✓ and q� via Walkback
Input: data sample x, corruptor q_φ, reconstructor p_θ
Initialize an empty training pair list P_xz = {}
Set z to some initial vector ∈ Z
for i = 1 to k_burn-in do
    Sample z' from q_φ(z|x, z), then set z to z'
end for
Set x' to x
for i = 1 to k_roll-out do
    Sample z' from q_φ(z|x', z), then set z to z'
    Sample x'' from p_θ(x|z), then set x' to x''
    Add pair (x', z) to P_xz
end for
Return: P_xz
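The procedure above can be sketched in runnable form. Here sample_q and sample_p are hypothetical Gaussian stand-ins for q_φ and p_θ (the slides assume learned conditionals); the structure mirrors the burn-in / roll-out loops of the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-ins for the learned conditionals.
def sample_q(x, z):
    """z' ~ q_phi(z | x, z): corruption mixing x with the prior state z."""
    return 0.5 * (x + z) + 0.3 * rng.standard_normal()

def sample_p(z):
    """x' ~ p_theta(x | z): noisy reconstruction from the latent state."""
    return z + 0.3 * rng.standard_normal()

def walkback(x, k_burn_in=3, k_roll_out=5):
    """Collect (x, z) training pairs by unrolling the q/p chain from x."""
    pairs = []
    z = 0.0                          # some initial vector in Z
    for _ in range(k_burn_in):       # burn in z while holding x fixed
        z = sample_q(x, z)
    x_cur = x
    for _ in range(k_roll_out):      # roll the chain out, recording pairs
        z = sample_q(x_cur, z)
        x_cur = sample_p(z)
        pairs.append((x_cur, z))
    return pairs

pairs = walkback(1.0)
```

The roll-out loop is what makes the effective corruption process W(z|x; p_θ, q_φ) less local than a single application of q_φ.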
GSN theory – 1
Theorem
If p_θ(x|z) is a consistent estimator of the true conditional distribution P(x|z), and the transition operator T_θ(x_{t+1}|x_t) that samples z_t from q_φ(z_t|x_t) and samples x_{t+1} from p_θ(x_{t+1}|z_t) defines an ergodic Markov chain, then as the number of examples used to train p_θ(x|z) goes to infinity (i.e. as p_θ(x|z) converges to P(x|z)), the asymptotic distribution of the Markov chain with transition operator T_θ(x_{t+1}|x_t) converges to the target distribution D.

– Theorem modified slightly from "Generalized Denoising Auto-encoders as Generative Models" by Bengio et al., NIPS 2013.
GSN theory – 2
Corollary
Let X be a set in which every pair of points is connected by a finite-length path contained in X. Suppose that for each x ∈ X there exists a "shell" set S_x ⊆ X such that all paths between x and any point in X \ S_x pass through some point in S_x whose shortest path to x has length > 0. Suppose that ∀x, ∀x' ∈ S_x ∪ {x}, there exists z_{xx'} such that q_φ(z_{xx'}|x) > 0 and p_θ(x'|z_{xx'}) > 0. Then, the Markov chain with transition operator T_θ(x_{t+1}|x_t) = Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) is ergodic.

– Corollary modified slightly from "Deep Generative Stochastic Networks Trainable by Backprop" by Bengio et al., ICML 2014.
Variational Free-energy – 1
Assume we have the distributions p_θ(x|z), p*(z), and q_φ(z|x). Then, we define the following derived distributions:

  p_θ(x; p*)    = Σ_z p_θ(x|z) p*(z)                                  (1)
  p_θ(z|x; p*)  = p_θ(x|z) p*(z) / p_θ(x; p*)                         (2)
  p_θ(x, z; p*) = p_θ(x|z) p*(z) = p_θ(z|x; p*) p_θ(x; p*)            (3)
Variational Free-energy – 2.0
log p_θ(x; p*) = Σ_z q_φ(z|x) log p_θ(x; p*)                                      (4)
               = Σ_z q_φ(z|x) log [ p_θ(z|x; p*) p_θ(x; p*) / p_θ(z|x; p*) ]      (5)
               = Σ_z q_φ(z|x) log [ p_θ(x, z; p*) / p_θ(z|x; p*) ]                (6)
               = Σ_z q_φ(z|x) ( log p_θ(x, z; p*) − log q_φ(z|x)
                                + log q_φ(z|x) − log p_θ(z|x; p*) )               (7)

The first two terms in (7) give the negative variational free-energy, and the last two give KL(q_φ(z|x) || p_θ(z|x; p*)) ≥ 0, so minimizing the free-energy maximizes a lower bound on log p_θ(x; p*).
MNIST – Weak KLd penalty
Figure: Models trained with λ = 1 and γ = 0.0. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.
MNIST – Medium KLd penalty
Figure: Models trained with λ = 4 and γ = 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.
MNIST – Strong KLd penalty
Figure: Models trained with λ = 24 and γ = 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.
Training progress comparison – MNIST
[Figure: samples after 30k, 60k, and 120k training updates.]
Training progress comparison – TFD
[Figure: samples after 50k, 100k, and 150k training updates.]
Multi-test – TFD
Multi-test – MNIST