Bayesian Generative Adversarial Networks

Andrew Gordon Wilson

Assistant Professor

Cornell University

Center for Informatics and Computational Science (CICS), Notre Dame University

February 26, 2018

Joint work with Yunus Saatchi

Bayesian Generative Adversarial Networks

- Generative adversarial networks (GANs) (Goodfellow et al., NIPS 2014) learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood.

- We introduce a Bayesian GAN, which requires minimal human intervention and provides powerful semi-supervised results.

- State-of-the-art predictive accuracy using fewer than 1% of the labels.

- Scalable inference with stochastic gradient HMC.




Unsupervised Generative Models

Why do we care?

- Foundational to intelligent systems

- Simulating possible futures in reinforcement learning

- Semi-supervised learning

- Image super-resolution, inpainting, extrapolation

GANs and VAEs have emerged as exceptionally powerful frameworks for generative unsupervised modelling.

“GANs are the most significant new development in machine learning in the last 10 years!”
Yann LeCun, Cornell CS Colloquium, 2016

What are GANs?

- Generative Adversarial Networks (GANs) implicitly perform density estimation.

- A generator $G$ proposes samples from the data distribution, attempting to fool a discriminator $D$. Learning takes place through an adversarial game between $G$ and $D$.

- GANs are very good at learning to sample from a density over images, which has been a practically intractable problem!

Classical Density Estimation

- Observations $y_1, \dots, y_N$ drawn from an unknown density $p(y)$.

- Specify an observation model. For example, we can let the points be drawn from a mixture of Gaussians:

$$p(y \mid \theta) = w_1 \mathcal{N}(y \mid \mu_1, \sigma_1^2) + w_2 \mathcal{N}(y \mid \mu_2, \sigma_2^2), \qquad \theta = \{w_1, w_2, \mu_1, \mu_2, \sigma_1, \sigma_2\}.$$

- Likelihood: $p(y \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \theta)$.

We can learn all free parameters $\theta$ using maximum likelihood...
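As a minimal sketch of the likelihood above, the following evaluates the log-likelihood of a two-component Gaussian mixture. The data values and the two parameter settings are hypothetical, chosen only to show that parameters matching the clusters score higher.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(y | mu, sigma^2)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(ys, w1, w2, mu1, mu2, sigma1, sigma2):
    """log p(y | theta) = sum_i log p(y_i | theta) for the two-component mixture."""
    return sum(
        math.log(w1 * normal_pdf(y, mu1, sigma1) + w2 * normal_pdf(y, mu2, sigma2))
        for y in ys
    )

# Hypothetical toy data: two loose clusters around -2 and 3.
ys = [-2.3, -1.9, -2.1, 2.8, 3.1, 3.4]

# Parameters that match the clusters versus parameters that do not.
good = mixture_log_likelihood(ys, 0.5, 0.5, -2.0, 3.0, 0.5, 0.5)
bad = mixture_log_likelihood(ys, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5)
print(good, bad)
```

Maximum likelihood would search over all six parameters for the setting with the highest value of this function.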

Regularisation = MAP ≠ Bayesian Inference

Regularisation or MAP

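- Find $\arg\max_\theta \log p(\theta \mid y)$, where, up to an additive constant,

$$\log p(\theta \mid y) \overset{c}{=} \underbrace{\log p(y \mid \theta)}_{\text{model fit}} + \underbrace{\log p(\theta)}_{\text{complexity penalty}}$$

- Choose $p(\theta)$ such that $p(\theta) \to 0$ faster than $p(y \mid \theta) \to \infty$ as $\sigma_1$ or $\sigma_2 \to 0$.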
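The reason the prior must decay as a component width shrinks can be seen numerically. This is a sketch under assumed values: one mixture component is centred on a single data point and its standard deviation is reduced, while the other component stays fixed at $\mathcal{N}(0, 1)$. The data and settings are hypothetical.

```python
import math

def log_normal_pdf(y, mu, sigma):
    """log N(y | mu, sigma^2)."""
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_mixture(ys, mu1, sigma1, mu2=0.0, sigma2=1.0, w1=0.5, w2=0.5):
    """Log-likelihood of a two-component mixture, computed with log-sum-exp."""
    total = 0.0
    for y in ys:
        a = math.log(w1) + log_normal_pdf(y, mu1, sigma1)
        b = math.log(w2) + log_normal_pdf(y, mu2, sigma2)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

ys = [-1.2, 0.3, 0.9, 2.0]

# Centre component 1 exactly on the first data point and shrink its width.
widths = (1.0, 1e-2, 1e-4, 1e-6)
log_liks = [log_mixture(ys, mu1=ys[0], sigma1=s) for s in widths]
for s, ll in zip(widths, log_liks):
    print(s, ll)
```

The log-likelihood keeps growing as the width shrinks, so maximum likelihood has no finite optimum here without a penalty on small $\sigma$.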

Bayesian Inference

- Predictive distribution: $p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta$.

- Parameter posterior: $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$.

Generative Adversarial Networks

Generative Procedure

- Sample $z^{(1)}, \dots, z^{(n)} \sim p(z)$ ($p(z)$ is typically uniform noise).

- Transform the noise through a generator to produce samples $x'^{(i)} = G(z^{(i)}; \theta_g)$.

- $G$ can be arbitrary but is typically a de-convolutional neural network parametrized by $\theta_g$.

- If $G$ has sufficient capacity, there is a setting of $\theta_g$ such that $G(\cdot; \theta_g)$ can approximate the CDF inverse-CDF composition required to sample from a data distribution of interest.

Notation summary: $G$: generator; $\theta_g$: generator parameters; $z$: noise; $x$: data sample
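The generative procedure can be sketched in a few lines. The generator here is a hypothetical single affine layer followed by tanh, standing in for the de-convolutional network; the dimensions and the random weights are assumptions for illustration only.

```python
import math
import random

def sample_noise(n, dim, rng):
    """Draw z^(1..n) with i.i.d. Uniform(-1, 1) entries, one common choice for p(z)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]

def generator(z, theta_g):
    """Toy G(z; theta_g): one affine layer followed by tanh (not a real DCGAN)."""
    weights, biases = theta_g
    return [
        math.tanh(sum(w * zk for w, zk in zip(row, z)) + b)
        for row, b in zip(weights, biases)
    ]

rng = random.Random(0)
noise_dim, data_dim = 2, 3

# theta_g = (weight matrix, bias vector), initialised at random for the sketch.
theta_g = (
    [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)] for _ in range(data_dim)],
    [0.0] * data_dim,
)

zs = sample_noise(5, noise_dim, rng)
samples = [generator(z, theta_g) for z in zs]  # x'^(i) = G(z^(i); theta_g)
print(samples[0])
```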

Generating Samples

DC-GAN Architecture

Training Procedure

- The generator $G(\cdot; \theta_g)$ proposes candidate data samples.

- The discriminator has access to a dataset $X = \{x^{(i)}\}$ from the actual data distribution (e.g., a collection of photographs).

- A discriminator $D(\cdot; \theta_d)$ trains itself to classify samples from the generator vs. samples from the actual data distribution by updating its parameters $\theta_d$.

- The generator updates its parameters $\theta_g$ to fool the discriminator, while the discriminator updates its parameters $\theta_d$ to get better at calling out the generator.

- If $G$ and $D$ have enough capacity, samples from $G$ converge to samples from the actual data distribution.

- This procedure works in practice because of the powerful inductive biases of $G$ and $D$.

GAN Training Illustration

x = data samples
z = noise samples

Black = data distribution
Green = generative distribution
Blue = discriminative distribution

GAN Objective



$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
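A Monte Carlo estimate of $V(D, G)$ makes the roles of the two expectations concrete. This is a sketch on a hypothetical one-dimensional problem: the data distribution, the toy generator, and both discriminators are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def value_estimate(xs_real, zs, D, G):
    """Monte Carlo estimate of V(D, G) from real samples and noise samples."""
    real_term = sum(math.log(D(x)) for x in xs_real) / len(xs_real)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in zs) / len(zs)
    return real_term + fake_term

rng = random.Random(0)
xs_real = [rng.gauss(2.0, 1.0) for _ in range(2000)]  # stand-in for p_data
zs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]     # noise z ~ p(z)

def G(z):
    """Toy generator whose samples lie in [-3, -1], far from the data."""
    return -2.0 + z

def blind_D(x):
    """Discriminator that cannot tell real from generated."""
    return 0.5

def sharp_D(x):
    """Discriminator that thresholds near x = 0."""
    return sigmoid(2.0 * x)

v_blind = value_estimate(xs_real, zs, blind_D, G)
v_sharp = value_estimate(xs_real, zs, sharp_D, G)
print(v_blind, v_sharp)
```

The uninformative discriminator gives exactly $-\log 4$, while a discriminator that separates the two sample sets achieves a higher value, which is what the inner maximisation rewards.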

The Optimal Discriminator



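$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

Proposition: For $G$ fixed, the optimal discriminator $D$ is

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

Proof: The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G, D)$:

$$\begin{aligned}
V(G, D) &= \int_x p_{\text{data}}(x) \log(D(x))\, dx + \int_z p_z(z) \log(1 - D(G(z)))\, dz \\
&= \int_x \left[ p_{\text{data}}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx \qquad (2)
\end{aligned}$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $\mathrm{Supp}(p_{\text{data}}) \cup \mathrm{Supp}(p_g)$, concluding the proof.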
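The key step of the proof, that $a \log(y) + b \log(1 - y)$ peaks at $a/(a+b)$, can be checked by brute force. The values of $a$ and $b$ below are hypothetical stand-ins for $p_{\text{data}}(x)$ and $p_g(x)$ at a single point $x$.

```python
import math

def objective(y, a, b):
    """f(y) = a log(y) + b log(1 - y), the pointwise integrand in equation (2)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def grid_argmax(a, b, steps=100000):
    """Brute-force maximiser of f over an open grid in (0, 1)."""
    grid = (k / steps for k in range(1, steps))
    return max(grid, key=lambda y: objective(y, a, b))

a, b = 0.3, 0.9  # stand-ins for p_data(x) and p_g(x) at some point x
y_star = grid_argmax(a, b)
print(y_star, a / (a + b))
```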

The Optimal Generator



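$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

Let $D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. Then

$$\begin{aligned}
C(G) = V(G, D^*) &= \mathbb{E}_{x \sim p_{\text{data}}}\left[ \log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g}\left[ \log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)} \right] \\
&= -\log(4) + \mathrm{KL}\left( p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2} \right) + \mathrm{KL}\left( p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2} \right) \\
&= -\log(4) + 2\, \mathrm{JSD}(p_{\text{data}} \,\|\, p_g) \qquad (4)
\end{aligned}$$

which attains its minimum when $p_g = p_{\text{data}}$.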
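The identity can be verified numerically on discrete distributions. This sketch assumes the usual definition of the Jensen-Shannon divergence as the average of the two KL terms to the midpoint distribution, under which the two KL terms sum to twice the JSD. The probability vectors are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) = V(G, D*) evaluated directly with D*(x) = p_data / (p_data + p_g)."""
    first = sum(pd * math.log(pd / (pd + pg)) for pd, pg in zip(p_data, p_g) if pd > 0.0)
    second = sum(pg * math.log(pg / (pd + pg)) for pd, pg in zip(p_data, p_g) if pg > 0.0)
    return first + second

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

direct = c_of_g(p_data, p_g)
via_jsd = -math.log(4.0) + 2.0 * jsd(p_data, p_g)
at_optimum = c_of_g(p_data, p_data)  # generator matches the data exactly
print(direct, via_jsd, at_optimum)
```

When the generator matches the data, the divergence vanishes and $C(G)$ reaches its floor of $-\log 4$.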

SGD Training Algorithm


Original Paper Illustrations (2014)

Improvements using DCGAN (2015, 2016)

Progressive GANs (2017)

Vector space arithmetic

Vector space arithmetic

GANs with covariates

Mode Collapse



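$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

Imagine switching the objective from $\min_G \max_D$ to $\max_D \min_G$. The practical SGD training algorithm is agnostic to the ordering. Under $\max_D \min_G$, the generator's best response to a fixed discriminator is to place all of its mass on the samples the discriminator rates as most realistic, which is mode collapse.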

GAN stability

- Feature matching

- Minibatch discrimination

- Label smoothing

Bayesian Generative Adversarial Networks

Prior Model

- We induce a distribution over generators $G$ and discriminators $D$ through distributions on their parameters:

$$\theta_g \sim p(\theta_g \mid \alpha_g) \qquad (5)$$

$$\theta_d \sim p(\theta_d \mid \alpha_d) \qquad (6)$$

- We then have a distribution over distributions of data.

Generative Model for Data

1. Sample $\theta'_g \sim p(\theta_g \mid \alpha_g)$.

2. Sample $z^{(1)}, \dots, z^{(n)} \sim p(z)$.

3. $x'^{(j)} = G(z^{(j)}; \theta'_g) \sim p_{\text{generator}}(x; \theta'_g)$.




Posterior Inference with Adversarial Feedback

How do we update our posterior beliefs?

Propose Conditional Posteriors

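$$p(\theta_g \mid z, \theta_d) \propto \left( \prod_{i=1}^{n_g} D(G(z^{(i)}; \theta_g); \theta_d) \right) p(\theta_g \mid \alpha_g) \qquad (7)$$

$$p(\theta_d \mid z, X, \theta_g) \propto \prod_{i=1}^{n_d} D(x^{(i)}; \theta_d) \times \prod_{i=1}^{n_g} \left( 1 - D(G(z^{(i)}; \theta_g); \theta_d) \right) \times p(\theta_d \mid \alpha_d) \qquad (8)$$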


Sample iteratively from these conditional posteriors
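The two conditional posteriors can be written as unnormalised log densities. This is a sketch on a hypothetical one-dimensional model: the scalar-shift generator, the logistic discriminator, and the zero-mean Gaussian priors (with scales `alpha_g` and `alpha_d`) are all assumptions, since the slides do not fix a model or a prior family.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def G(z, theta_g):
    """Toy generator: shifts the noise by a scalar theta_g."""
    return theta_g + z

def D(x, theta_d):
    """Toy discriminator: logistic regression with theta_d = (slope, intercept)."""
    slope, intercept = theta_d
    return sigmoid(slope * x + intercept)

def log_gaussian_prior(params, scale):
    """Assumed prior: independent zero-mean Gaussians, up to an additive constant."""
    return -0.5 * sum((p / scale) ** 2 for p in params)

def log_post_generator(theta_g, zs, theta_d, alpha_g=10.0):
    """Log of the generator conditional posterior, equation (7), up to a constant."""
    fool = sum(math.log(D(G(z, theta_g), theta_d)) for z in zs)
    return fool + log_gaussian_prior([theta_g], alpha_g)

def log_post_discriminator(theta_d, zs, xs, theta_g, alpha_d=10.0):
    """Log of the discriminator conditional posterior, up to a constant."""
    real = sum(math.log(D(x, theta_d)) for x in xs)
    fake = sum(math.log(1.0 - D(G(z, theta_g), theta_d)) for z in zs)
    return real + fake + log_gaussian_prior(theta_d, alpha_d)

xs = [1.8, 2.1, 2.4, 1.9]    # hypothetical real data near 2
zs = [-0.5, 0.0, 0.5, 0.2]   # noise samples
theta_d = (2.0, -2.0)        # discriminator that favours x > 1

near = log_post_generator(2.0, zs, theta_d)    # generator centred on the data
far = log_post_generator(-2.0, zs, theta_d)    # generator far from the data
good_d = log_post_discriminator((2.0, -2.0), zs, xs, theta_g=-2.0)
bad_d = log_post_discriminator((-2.0, 2.0), zs, xs, theta_g=-2.0)
print(near, far, good_d, bad_d)
```

A sampler would alternate between these two log densities, drawing $\theta_g$ given $\theta_d$ and then $\theta_d$ given $\theta_g$.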

Classical GANs as Maximum Likelihood

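$$p(\theta_g \mid z, \theta_d) \propto \left( \prod_{i=1}^{n_g} D(G(z^{(i)}; \theta_g); \theta_d) \right) p(\theta_g \mid \alpha_g)$$

$$p(\theta_d \mid z, X, \theta_g) \propto \prod_{i=1}^{n_d} D(x^{(i)}; \theta_d) \times \prod_{i=1}^{n_g} \left( 1 - D(G(z^{(i)}; \theta_g); \theta_d) \right) \times p(\theta_d \mid \alpha_d)$$

If we assign a vague uniform prior over $\theta_g$ and $\theta_d$ and perform iterative MAP optimization instead of sampling, then the local optima will be in the same place as in the classical GAN of Goodfellow et al. (2014).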

Marginalizing the Noise

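$$\begin{aligned}
p(\theta_g \mid \theta_d) &= \int p(\theta_g, z \mid \theta_d)\, dz = \int p(\theta_g \mid z, \theta_d)\, \underbrace{p(z \mid \theta_d)}_{= p(z)}\, dz \qquad (9) \\
&\approx \frac{1}{J} \sum_{j=1}^{J} p(\theta_g \mid z^{(j)}, \theta_d), \qquad z^{(j)} \sim p(z)
\end{aligned}$$

By following a similar derivation, $p(\theta_d \mid \theta_g) \approx \frac{1}{J} \sum_{j=1}^{J} p(\theta_d \mid z^{(j)}, X, \theta_g)$, $z^{(j)} \sim p(z)$.

- $p(z)$ is a white noise distribution from which we can take efficient and exact samples.

- $p(\theta_g \mid z, \theta_d)$ and $p(\theta_d \mid z, X, \theta_g)$, when viewed as functions of $z$, are broad over $z$ by construction, since $z$ is used to produce candidate data samples in the generative procedure. Therefore each term in the sum contributes to the estimate.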
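The approximation used here is simple Monte Carlo: an integral against $p(z)$ is replaced by an average over $J$ draws from $p(z)$. The sketch below shows only that generic step, on a hypothetical integrand $f(z) = z^2$ with $z \sim \text{Uniform}(-1, 1)$, chosen because the exact answer $1/3$ is known. It is not the GAN posterior itself.

```python
import random

def monte_carlo_average(f, sampler, J):
    """Approximate the integral of f(z) p(z) dz by (1/J) * sum_j f(z_j), z_j ~ p(z)."""
    return sum(f(sampler()) for _ in range(J)) / J

rng = random.Random(0)
estimate = monte_carlo_average(lambda z: z * z, lambda: rng.uniform(-1.0, 1.0), J=20000)
exact = 1.0 / 3.0
print(estimate, exact)
```

Because the integrand is smooth and broad over the support of $p(z)$, every draw contributes and the estimate settles quickly, which mirrors the argument made for the conditional posteriors.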

Semi-supervised Learning

- Make label predictions using structure from both unlabelled and labelled training data.

- Can quantify recent advances in unsupervised learning.

- Crucial for reducing the dependency of deep learning on large labelled datasets.

Semi-supervised Learning

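- Task: predict the class label of test images, based on a training set of labelled and unlabelled images.

- $n$ unlabelled observations, $\{x^{(i)}\}$, and $n_s$ labelled observations, $\{(x_s^{(i)}, y_s^{(i)})\}_{i=1}^{n_s}$, with class labels $y_s^{(i)} \in \{1, \dots, K\}$.

- Redefine the discriminator such that $D(x^{(i)} = y^{(i)}; \theta_d)$ gives the probability that sample $x^{(i)}$ belongs to class $y^{(i)}$. Class label 0 is reserved for samples produced by the generator.

$$p(\theta_g \mid z, \theta_d) \propto \left( \prod_{i=1}^{n_g} \sum_{y=1}^{K} D(G(z^{(i)}; \theta_g) = y; \theta_d) \right) p(\theta_g \mid \alpha_g) \qquad (10)$$

$$p(\theta_d \mid z, X, y_s, \theta_g) \propto \prod_{i=1}^{n_d} \sum_{y=1}^{K} D(x^{(i)} = y; \theta_d) \, \prod_{i=1}^{n_g} D(G(z^{(i)}; \theta_g) = 0; \theta_d) \, \prod_{i=1}^{n_s} \left( D(x_s^{(i)} = y_s^{(i)}; \theta_d) \right) p(\theta_d \mid \alpha_d)$$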
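The data-dependent factors of the semi-supervised discriminator posterior can be sketched with a $(K+1)$-way softmax, where index 0 is the reserved "generated" class. Here the discriminator is represented only by its output logits, and all logit values are hypothetical; a real model would compute them from the inputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_lik_discriminator(unlabelled_logits, fake_logits, labelled_logits, labels):
    """Log of the data-dependent factors in the discriminator posterior.

    Each logits vector has K + 1 entries; index 0 is the "generated" class.
    """
    total = 0.0
    for logits in unlabelled_logits:   # real, label unknown: sum_{y=1..K} D(x = y)
        total += math.log(sum(softmax(logits)[1:]))
    for logits in fake_logits:         # generated sample: D(G(z) = 0)
        total += math.log(softmax(logits)[0])
    for logits, y in zip(labelled_logits, labels):  # real, labelled: D(x_s = y_s)
        total += math.log(softmax(logits)[y])
    return total

# Hypothetical logits for K = 2 real classes plus the generated class.
unlabelled = [[-2.0, 1.0, 0.5]]
fake = [[3.0, 0.0, 0.0]]
labelled = [[-1.0, 0.2, 2.5]]

good = log_lik_discriminator(unlabelled, fake, labelled, labels=[2])
# Swap the roles of real and generated inputs and use the wrong label.
bad = log_lik_discriminator(fake, unlabelled, labelled, labels=[1])
print(good, bad)
```

Adding the log prior $\log p(\theta_d \mid \alpha_d)$ to this quantity gives the unnormalised log posterior over $\theta_d$.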

Making Predictions of Class Labels

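To compute the predictive distribution for a class label $y_*$ at a test input $x_*$, we use a model average over all collected samples with respect to the posterior over $\theta_d$:

$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, \theta_d)\, p(\theta_d \mid \mathcal{D})\, d\theta_d \qquad (11)$$

$$\approx \frac{1}{T} \sum_{k=1}^{T} p(y_* \mid x_*, \theta_d^{(k)}), \qquad \theta_d^{(k)} \sim p(\theta_d \mid \mathcal{D}) \qquad (12)$$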
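The model average in equation (12) amounts to averaging class probabilities across posterior samples of the discriminator. In this sketch the per-sample probabilities are hypothetical numbers; in practice each row would come from evaluating one sampled $\theta_d^{(k)}$ on the test input.

```python
def predictive(class_probs_per_sample):
    """Average p(y* | x*, theta_d^(k)) over T posterior samples, as in equation (12)."""
    T = len(class_probs_per_sample)
    K = len(class_probs_per_sample[0])
    return [sum(probs[c] for probs in class_probs_per_sample) / T for c in range(K)]

# Hypothetical class probabilities from T = 3 sampled discriminators, K = 3 classes.
samples = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]

avg = predictive(samples)
prediction = max(range(len(avg)), key=avg.__getitem__)  # index of the most probable class
print(avg, prediction)
```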

Stochastic Gradient Hamiltonian Monte Carlo

- Hamiltonian Monte Carlo (HMC) is an auxiliary variable MCMC approach, inspired by physics.

- HMC uses gradient information to make better proposals and avoid random walk behaviour.

- Stochastic gradient HMC (Chen et al., 2014) is a new SGD-like algorithm that enables posterior sampling with no more computational complexity than SGD!

- SG-HMC makes it possible to do Bayesian deep learning with insignificant computational overhead!

- Likelihood surfaces in deep architectures are very well suited to sampling over optimization!
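A minimal sketch of an SGHMC-style update on a one-dimensional toy target, to show how close the sampler is to SGD with momentum. Several things here are assumptions for illustration: the target is a standard-width Gaussian centred at 3, the gradient is exact rather than a minibatch estimate, and the step size `eta` and friction `alpha` are arbitrary choices. The update form (friction on the velocity, a gradient step, and injected noise of variance $2 \alpha \eta$) follows the momentum-style parametrisation of Chen et al. (2014).

```python
import random

def sghmc(grad_U, theta0, n_steps, eta=0.01, alpha=0.1, seed=0):
    """SGHMC-style sampler; grad_U is the gradient of the negative log posterior."""
    rng = random.Random(seed)
    theta, v = theta0, 0.0
    noise_sd = (2.0 * alpha * eta) ** 0.5
    trace = []
    for _ in range(n_steps):
        # Momentum update: friction, gradient step, injected Gaussian noise.
        v = (1.0 - alpha) * v - eta * grad_U(theta) + rng.gauss(0.0, noise_sd)
        theta = theta + v
        trace.append(theta)
    return trace

# Toy target N(3, 1): U(theta) = (theta - 3)^2 / 2, so grad U = theta - 3.
trace = sghmc(lambda t: t - 3.0, theta0=0.0, n_steps=50000)
kept = trace[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((t - mean) ** 2 for t in kept) / len(kept)
print(mean, var)
```

Removing the noise term turns the loop into gradient descent with momentum, which is why the per-step cost matches SGD.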

Bayesian GAN Learning Algorithm

Exploring a whole distribution over G and D




Avoiding Mode Collapse

Semi-Supervised Results: MNIST

Semi-Supervised Results: CIFAR-10

Semi-Supervised Results

Semi-Supervised Results


More Sample Generation

- A natural Bayesian generalization of the classical GAN.

- Avoids mode collapse and reduces the need for manual intervention.

- Has particularly promising results on semi-supervised prediction tasks.

- Future directions: deterministic approximate inference, different architectures, different priors, new applications...

- Code available:

