Bayesian Generative Adversarial Networks
Andrew Gordon Wilson
Assistant Professor
https://people.orie.cornell.edu/andrew
Cornell University
Center for Informatics and Computational Science (CICS)
Notre Dame University
February 26, 2018
Joint work with Yunus Saatchi
1 / 43
Bayesian Generative Adversarial Networks
I Generative adversarial networks (GANs) (Goodfellow et al., NIPS 2014) learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood.
I We introduce a Bayesian GAN, which requires minimal human intervention, and provides powerful semi-supervised results.
I State-of-the-art predictive accuracy using less than 1% of labels.
I Scalable inference with stochastic gradient HMC.
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
2 / 43
Unsupervised Generative Models
Why do we care?
I Foundational to intelligent systems
I Simulating possible futures in reinforcement learning
I Semi-supervised learning
I Image super-resolution, inpainting, extrapolation
GANs and VAEs have emerged as exceptionally powerful frameworks for generative unsupervised modelling.
“GANs are the most significant new development in machine learning in the last 10 years!” — Yann LeCun, Cornell CS Colloquium, 2016
3 / 43
What are GANs?
I Generative Adversarial Networks (GANs) implicitly perform density estimation.
I A generator G proposes samples from the data distribution, attempting to fool a discriminator D. Learning takes place through an adversarial game between G and D.
I GANs are very good at learning to sample from a density over images, which has been practically an intractable problem!
4 / 43
Classical Density Estimation
I Observations y1, . . . , yN drawn from an unknown density p(y).
I Specify an observation model. For example, we can let the points be drawn from a mixture of Gaussians:
  p(y|θ) = w1 N(y|µ1, σ1²) + w2 N(y|µ2, σ2²),   θ = {w1, w2, µ1, µ2, σ1, σ2}.
I Likelihood: p(y|θ) = ∏_{i=1}^{N} p(yi|θ).
Can learn all free parameters θ using maximum likelihood... (a minimal sketch follows).
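As a concrete illustration (not from the slides), a minimal sketch of fitting this two-component Gaussian mixture by maximum likelihood with a generic optimizer; the toy data and initialization below are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, y):
    # params = [logit_w1, mu1, mu2, log_sigma1, log_sigma2]
    w1 = 1.0 / (1.0 + np.exp(-params[0]))          # keep weights in (0, 1); w2 = 1 - w1
    mu1, mu2 = params[1], params[2]
    s1, s2 = np.exp(params[3]), np.exp(params[4])  # keep scales positive
    mix = w1 * norm.pdf(y, mu1, s1) + (1 - w1) * norm.pdf(y, mu2, s2)
    return -np.sum(np.log(mix + 1e-12))

y = np.concatenate([np.random.randn(100) - 2, 0.5 * np.random.randn(100) + 3])  # toy data
theta0 = np.array([0.0, -1.0, 1.0, 0.0, 0.0])
theta_ml = minimize(neg_log_likelihood, theta0, args=(y,)).x
```

Note that without a prior, the likelihood is unbounded as either σ collapses onto a data point, which is exactly the degeneracy the next slide addresses.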
5 / 43
Regularisation = MAP ≠ Bayesian Inference
Regularisation or MAP
I Find argmax_θ log p(θ|y), which up to a constant equals
  log p(y|θ)   [model fit]   +   log p(θ)   [complexity penalty].
I Choose p(θ) such that p(θ) → 0 faster than p(y|θ) → ∞ as σ1 or σ2 → 0.
Bayesian Inference
I Predictive Distribution: p(y∗|y) = ∫ p(y∗|θ) p(θ|y) dθ.
I Parameter Posterior: p(θ|y) ∝ p(y|θ)p(θ).
6 / 43
Generative Adversarial Networks
Generative Procedure
I Sample z(1), . . . , z(n) ∼ p(z) (p(z) is typically uniform noise).
I Transform the noise through a generator to produce samples x′(i) = G(z(i); θg).
I G can be arbitrary but is typically a de-convolutional neural network parametrized by θg.
I If G has sufficient capacity, there is a setting of θg such that G(·; θg) can approximate the CDF inverse-CDF composition required to sample from a data distribution of interest.
Notation summary: G: generator; θg: generator parameters; z: noise; x: data sample. (A sketch of this sampling procedure follows.)
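A minimal PyTorch sketch of this generative procedure: sample noise z and push it through a generator G(·; θg). The fully connected architecture, dimensions, and uniform noise range are stand-ins, not the deconvolutional generator used in practice.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    # A small fully connected stand-in for the deconvolutional G used in practice.
    def __init__(self, z_dim=100, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, x_dim), nn.Tanh(),  # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = torch.rand(64, 100) * 2 - 1          # noise z ~ Uniform(-1, 1)
x_fake = G(z)                            # candidate samples x'(i) = G(z(i); theta_g)
```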
7 / 43
Generating Samples
8 / 43
DC-GAN Architecture
9 / 43
Training Procedure
I The generator G(·; θg) proposes candidate data samples.
I The discriminator has access to a dataset X = {x(i)} from the actual data distribution (e.g., a collection of photographs).
I A discriminator D(·; θd) trains itself to classify samples from the generator vs samples from the actual data distribution by updating its parameters θd.
I The generator updates its parameters θg to fool the discriminator; the discriminator updates its parameters θd to get better at calling out the generator (see the sketch after this list).
I If G and D have enough capacity, samples from G converge to samples from the actual data distribution.
I This procedure works in practice because of the powerful inductive biases of G and D.
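A hedged sketch of this alternating procedure, using the common non-saturating heuristic for the generator rather than the exact minimax loss; `G`, `D`, and `data_loader` are assumed to be defined elsewhere, with `D` outputting probabilities of shape (batch, 1).

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

for x_real in data_loader:                      # minibatches from the actual data distribution
    z = torch.randn(x_real.size(0), 100)
    x_fake = G(z)

    # Discriminator step: classify real vs generated samples.
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(x_real.size(0), 1)) + \
             F.binary_cross_entropy(D(x_fake.detach()), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the (just updated) discriminator.
    g_loss = F.binary_cross_entropy(D(x_fake), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```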
10 / 43
GAN Training Illustration
[Figure: GAN training illustration. x = data samples, z = noise samples. Black = data distribution, green = generative distribution, blue = discriminative distribution.]
11 / 43
GAN Objective
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
12 / 43
The Optimal Discriminator
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

Proposition: For G fixed, the optimal discriminator D is

D∗_G(x) = pdata(x) / (pdata(x) + pg(x))   (1)

Proof: The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G,D):

V(G,D) = ∫_x pdata(x) log(D(x)) dx + ∫_z pz(z) log(1 − D(G(z))) dz
       = ∫_x [pdata(x) log(D(x)) + pg(x) log(1 − D(x))] dx   (2)

For any (a, b) ∈ R² \ {(0, 0)}, the function y ↦ a log(y) + b log(1 − y) achieves its maximum on [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.
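For completeness, the calculus step behind that claim, written out (it is not spelled out on the slide):

```latex
% Maximize f(y) = a \log y + b \log(1-y) on (0,1), with a = p_{\mathrm{data}}(x) and b = p_g(x):
f'(y) = \frac{a}{y} - \frac{b}{1-y} = 0
\quad\Longrightarrow\quad a(1-y) = b\,y
\quad\Longrightarrow\quad y^{*} = \frac{a}{a+b}
      = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)} = D^{*}_{G}(x).
% Since f''(y) = -a/y^2 - b/(1-y)^2 < 0, this stationary point is indeed the maximum.
```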
13 / 43
The Optimal Generator
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

Let D∗_G(x) = pdata(x) / (pdata(x) + pg(x)). Then

C(G) = V(G, D∗) = E_{x∼pdata}[log (pdata(x) / (pdata(x) + pg(x)))] + E_{x∼pg}[log (pg(x) / (pdata(x) + pg(x)))]
     = − log(4) + KL(pdata ‖ (pdata + pg)/2) + KL(pg ‖ (pdata + pg)/2)   (3)
     = − log(4) + JSD(pdata ‖ pg)   (4)

which attains its minimum when pg = pdata.
14 / 43
SGD Training Algorithm
15 / 43
Original Paper Illustrations (2014)
16 / 43
Improvements using DCGAN (2015, 2016)
17 / 43
Progressive GANs (2017)
18 / 43
Vector space arithmetic
19 / 43
Vector space arithmetic
20 / 43
GANs with covariates
21 / 43
Mode Collapse
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
Imagine switching the objective from min_G max_D to max_D min_G. Under max_D min_G, the inner minimization drives the generator to map all noise to whatever single point the current discriminator finds most realistic — which is exactly mode collapse. The practical SGD training algorithm is agnostic to this ordering.
22 / 43
GAN stability
I Feature matching
I Minibatch discrimination
I Label smoothing
23 / 43
Bayesian Generative Adversarial Networks
Prior Model
I We induce a distribution over generators G and discriminators D through distributions on their parameters:
θg ∼ p(θg|αg) (5)
θd ∼ p(θd|αd) (6)
I We then have a distribution over distributions of data (a small prior sketch follows this list).
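A minimal sketch, assuming isotropic Gaussian priors p(θ|α) = N(0, α⁻¹ I) over the network weights (the slide does not fix a particular choice of prior):

```python
import torch

def log_prior(params, alpha=1.0):
    # Isotropic Gaussian prior p(theta | alpha) = N(0, alpha^{-1} I), up to an additive constant.
    return -0.5 * alpha * sum((p ** 2).sum() for p in params)

def sample_from_prior(module, alpha=1.0):
    # Draw theta ~ p(theta | alpha) in place, giving one random generator or discriminator.
    with torch.no_grad():
        for p in module.parameters():
            p.normal_(mean=0.0, std=alpha ** -0.5)
```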
24 / 43
Generative Model for Data
1. Sample θ′g ∼ p(θg|αg)
2. Sample z(1), . . . , z(n) ∼ p(z).
3. x′(j) = G(z(j); θ′g) ∼ pgenerator(x; θ′g)
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
25 / 43
Posterior Inference with Adversarial Feedback
How do we update our posterior beliefs?
26 / 43
Propose Conditional Posteriors
p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)   (7)

p(θd | z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)   (8)
Sample iteratively from these conditional posteriors (a sketch of the unnormalized log posteriors follows).
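A hedged sketch of the two unnormalized log conditional posteriors (7)–(8) as they might be evaluated for a minibatch; `D` is assumed to output probabilities in (0, 1), and the 1e-8 offset is only for numerical stability.

```python
import torch

def _log_gauss_prior(params, alpha):
    # Isotropic Gaussian prior N(0, alpha^{-1} I) over the weights, up to a constant.
    return -0.5 * alpha * sum((p ** 2).sum() for p in params)

def log_post_generator(G, D, z, alpha_g=1.0):
    # Unnormalized log p(theta_g | z, theta_d): sum_i log D(G(z(i)); theta_d) + log p(theta_g | alpha_g)
    return torch.log(D(G(z)) + 1e-8).sum() + _log_gauss_prior(G.parameters(), alpha_g)

def log_post_discriminator(G, D, z, x_real, alpha_d=1.0):
    # Unnormalized log p(theta_d | z, X, theta_g):
    #   sum_i log D(x(i)) + sum_j log(1 - D(G(z(j)))) + log p(theta_d | alpha_d)
    return (torch.log(D(x_real) + 1e-8).sum()
            + torch.log(1.0 - D(G(z)) + 1e-8).sum()
            + _log_gauss_prior(D.parameters(), alpha_d))
```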
27 / 43
Classical GANs as Maximum Likelihood
p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)

p(θd | z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)
If we assign a vague uniform prior over θg and θd and perform iterative MAP optimization instead of sampling, then the local optima will be in the same place as in the classical GAN of Goodfellow et al. (2014).
28 / 43
Marginalizing the Noise
p(θg | θd) = ∫ p(θg, z | θd) dz = ∫ p(θg | z, θd) p(z | θd) dz,   with p(z | θd) = p(z)   (9)
           ≈ (1/J) ∑_{j=1}^{J} p(θg | z(j), θd),   z(j) ∼ p(z)

By a similar derivation, p(θd | θg) ≈ (1/J) ∑_{j=1}^{J} p(θd | z(j), X, θg),   z(j) ∼ p(z).
I p(z) is a white noise distribution from which we can take efficient and exact samples.
I p(θg|z, θd) and p(θd|z, X, θg), viewed as functions of z, are broad over z by construction, since z is used to produce candidate data samples in the generative procedure. Therefore each term in the sum contributes to the estimate (see the Monte Carlo sketch after this list).
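A hedged sketch of the Monte Carlo approximation in (9), reusing the `log_post_generator` helper sketched earlier: the average of conditional posterior densities over J fresh noise draws is computed in log space with a logsumexp.

```python
import math
import torch

z_draws = [torch.rand(64, 100) * 2 - 1 for _ in range(10)]      # J = 10 fresh noise sets z(j) ~ p(z)

def log_marginal_post_generator(G, D, z_draws, alpha_g=1.0):
    # Approximate p(theta_g | theta_d) = (1/J) sum_j p(theta_g | z(j), theta_d)  (eq. 9),
    # evaluated in log space over the J noise draws.
    log_terms = torch.stack([log_post_generator(G, D, z_j, alpha_g) for z_j in z_draws])
    return torch.logsumexp(log_terms, dim=0) - math.log(len(z_draws))
```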
29 / 43
Semi-supervised Learning
I Make label predictions using structure from both unlabelled and labelled training data.
I Can quantify recent advances in unsupervised learning.
I Crucial for reducing the dependency of deep learning on large labelled datasets.
30 / 43
Semi-supervised Learning
I Task: predict the class label of test images, based on a training set of labelled and unlabelled images.
I n unlabelled observations {x(i)}, and ns labelled observations {(x_s(i), y_s(i))}_{i=1}^{ns}, with class labels y_s(i) ∈ {1, . . . , K}.
I Redefine the discriminator such that D(x(i) = y(i); θd) gives the probability that sample x(i) belongs to class y(i).

p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} ∑_{y=1}^{K} D(G(z(i); θg) = y; θd) ) p(θg|αg)   (10)

p(θd | z, x, ys, θg) ∝ ∏_{i=1}^{nd} ( ∑_{y=1}^{K} D(x(i) = y; θd) ) × ∏_{i=1}^{ng} D(G(z(i); θg) = 0; θd) × ∏_{i=1}^{ns} D(x_s(i) = y_s(i); θd) × p(θd|αd)
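A hedged sketch of how this redefined discriminator can be read off a (K+1)-way softmax, with class 0 reserved for generated samples; the architecture and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSupDiscriminator(nn.Module):
    # K + 1 output classes: index 0 = "generated / fake", indices 1..K = the real classes.
    def __init__(self, x_dim=784, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, n_classes + 1))

    def forward(self, x):
        return F.softmax(self.net(x), dim=-1)    # rows are p(y = 0..K | x, theta_d)

D = SemiSupDiscriminator()
probs = D(torch.randn(5, 784))
p_real_any_class = probs[:, 1:].sum(dim=-1)      # sum_{y=1}^K D(x = y; theta_d), used for unlabelled data
p_fake = probs[:, 0]                             # D(x = 0; theta_d), used for generated samples
```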
31 / 43
Making Predictions of Class Labels
To compute the predictive distribution for a class label y∗ at a test input x∗ we use a model average over all collected samples with respect to the posterior over θd:

p(y∗ | x∗, D) = ∫ p(y∗ | x∗, θd) p(θd | D) dθd   (11)
             ≈ (1/T) ∑_{k=1}^{T} p(y∗ | x∗, θd(k)),   θd(k) ∼ p(θd | D)   (12)
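A minimal sketch of the model average in (12), assuming `discriminator_samples` is a list of discriminator networks whose weights are approximate posterior samples (e.g., collected SGHMC iterates), each returning class probabilities.

```python
import torch

def predict(x_test, discriminator_samples):
    # Bayesian model average (eq. 12): average class probabilities over T posterior
    # samples theta_d(k) ~ p(theta_d | D), represented here as a list of discriminators.
    probs = torch.stack([D_k(x_test) for D_k in discriminator_samples])   # shape (T, N, K+1)
    return probs.mean(dim=0)                                              # approximate p(y* | x*, D)
```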
32 / 43
Stochastic Gradient Hamiltonian Monte Carlo
I Hamiltonian Monte Carlo (HMC) is an auxiliary-variable MCMC approach, inspired by physics.
I HMC uses gradient information to make better proposals and avoid random-walk behaviour.
I Stochastic gradient HMC (Chen et al., 2014) is an SGD-like algorithm that enables posterior sampling with no more computational complexity than SGD (a single-update sketch follows this list)!
I SG-HMC makes it possible to do Bayesian deep learning with insignificant computational overhead!
I Likelihood surfaces in deep architectures are very well suited to sampling over optimization!
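A hedged sketch of a single SGHMC update in the spirit of Chen et al. (2014), with a simple constant friction term; `grad_log_post` stands for a stochastic estimate of ∇_θ log p(θ | ·), e.g. gradients of the conditional log posteriors sketched above.

```python
import torch

def sghmc_step(params, momenta, grad_log_post, lr=1e-3, friction=0.1):
    # One SGHMC update: v <- (1 - friction) * v + lr * grad_log_post + noise, theta <- theta + v,
    # where the injected noise has variance 2 * friction * lr.
    with torch.no_grad():
        for p, v, g in zip(params, momenta, grad_log_post):
            noise = torch.randn_like(p) * (2.0 * friction * lr) ** 0.5
            v.mul_(1.0 - friction).add_(lr * g).add_(noise)
            p.add_(v)
```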
33 / 43
Bayesian GAN Learning Algorithm
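The algorithm figure did not survive extraction; below is a much-simplified, hedged sketch of the overall loop, reusing the `sghmc_step`, `log_post_generator`, and `log_post_discriminator` helpers sketched earlier. The algorithm in the paper additionally averages over several noise sets and over posterior samples of the opposing network; `num_steps`, `thin`, and `data_iter` are placeholders.

```python
import copy
import torch

generator_samples, discriminator_samples = [], []
mom_g = [torch.zeros_like(p) for p in G.parameters()]
mom_d = [torch.zeros_like(p) for p in D.parameters()]

def grads_of(log_post, module):
    # Gradient of the (unnormalized) log posterior with respect to the module's parameters.
    module.zero_grad()
    log_post.backward()
    return [p.grad.detach().clone() for p in module.parameters()]

for step in range(num_steps):
    z = torch.rand(64, 100) * 2 - 1                 # fresh noise z ~ p(z)
    x_real = next(data_iter)

    # SGHMC move on the generator parameters, holding theta_d fixed.
    sghmc_step(list(G.parameters()), mom_g, grads_of(log_post_generator(G, D, z), G))

    # SGHMC move on the discriminator parameters, holding theta_g fixed.
    sghmc_step(list(D.parameters()), mom_d, grads_of(log_post_discriminator(G, D, z, x_real), D))

    if step % thin == 0:                            # keep iterates as approximate posterior samples
        generator_samples.append(copy.deepcopy(G))
        discriminator_samples.append(copy.deepcopy(D))
```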
34 / 43
Exploring a whole distribution over G and D
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
35 / 43
Avoiding Mode Collapse
36 / 43
Semi-Supervised Results: MNIST
37 / 43
Semi-Supervised Results: CIFAR-10
38 / 43
Semi-Supervised Results
39 / 43
Semi-Supervised Results
[Figure: semi-supervised results, BayesGAN.]
40 / 43
More Sample Generation
41 / 43
Discussion
I A natural Bayesian generalization of the classical GAN.
I Avoids mode collapse and reduces the need for manual intervention.
I Has particularly promising results on semi-supervised prediction tasks.
I Future directions: deterministic approximate inference, different architectures, different priors, new applications...
I Code available: https://github.com/andrewgordonwilson/bayesgan
42 / 43
Scalable Gaussian Processes
I Highly accurate kernel approximations that admit fast matrix-vector multiplications (MVMs).
I Linear conjugate gradients (LCG) for inference, stochastic Lanczos for log determinants and derivatives (kernel learning).
I O(n) training and O(1) testing (instead of O(n³) training and O(n²) testing).
I Harmonizes with GPU acceleration
I Very powerful for large-scale spatiotemporal regression.
I Implemented in our new library GPyTorch: https://github.com/jrg365/gpytorch (a small conjugate-gradients sketch follows).
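A minimal sketch (not the GPyTorch implementation) of the idea behind MVM-based inference: solve a linear system such as (K + σ²I)v = y with conjugate gradients, touching the kernel matrix only through matrix-vector products. The `K_matmul` and `noise_var` names in the usage comment are hypothetical.

```python
import torch

def conjugate_gradients(mvm, b, max_iter=100, tol=1e-6):
    # Solve A v = b using only matrix-vector products v -> A v (A symmetric positive definite).
    x = torch.zeros_like(b)
    r = b - mvm(x)
    p = r.clone()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = mvm(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Example use: GP predictive-mean weights alpha = (K + sigma^2 I)^{-1} y, given a fast K @ v routine:
#   mvm = lambda v: K_matmul(v) + noise_var * v
#   alpha = conjugate_gradients(mvm, y)
```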
43 / 43