Bayesian Generative Adversarial Networks
Andrew Gordon Wilson
Assistant Professor
https://people.orie.cornell.edu/andrew
Cornell University
Center for Informatics and Computational Science (CICS)
Notre Dame University
February 26, 2018
Joint work with Yunus Saatchi
1 / 43
Bayesian Generative Adversarial Networks
I Generative adversarial networks (GANs) (Goodfellow et al., NIPS 2014) learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood.
I We introduce a Bayesian GAN, which requires minimal human intervention, and provides powerful semi-supervised results.
I State-of-the-art predictive accuracy using less than 1% of labels.
I Scalable inference with stochastic gradient HMC.
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
2 / 43
Unsupervised Generative Models
Why do we care?
I Foundational to intelligent systems
I Simulating possible futures in reinforcement learning
I Semi-supervised learning
I Image super-resolution, inpainting, extrapolation
GANs and VAEs have emerged as exceptionally powerful frameworks for generative unsupervised modelling.
“GANs are the most significant new development in machine learning in the last 10 years!” — Yann LeCun, Cornell CS Colloquium, 2016
3 / 43
What are GANs?
I Generative Adversarial Networks (GANs) implicitly perform density estimation.
I A generator G proposes samples from the data distribution, attempting to fool a discriminator D. Learning takes place through an adversarial game between G and D.
I GANs are very good at learning to sample from a density over images, which has been practically an intractable problem!
4 / 43
Classical Density Estimation
I Observations y1, . . . , yN drawn from an unknown density p(y).
I Specify an observation model. For example, we can let the points be drawn from a mixture of Gaussians:
  p(y|θ) = w1 N(y|µ1, σ1²) + w2 N(y|µ2, σ2²),   θ = {w1, w2, µ1, µ2, σ1, σ2}.
I Likelihood: p(y|θ) = ∏_{i=1}^{N} p(yi|θ).
Can learn all free parameters θ using maximum likelihood... (a minimal sketch follows).
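As a concrete illustration (not from the slides), a minimal sketch of fitting this two-component Gaussian mixture by maximum likelihood with a generic optimizer; the toy data and initialization below are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, y):
    # params = [logit_w1, mu1, mu2, log_sigma1, log_sigma2]
    w1 = 1.0 / (1.0 + np.exp(-params[0]))          # keep weights in (0, 1); w2 = 1 - w1
    mu1, mu2 = params[1], params[2]
    s1, s2 = np.exp(params[3]), np.exp(params[4])  # keep scales positive
    mix = w1 * norm.pdf(y, mu1, s1) + (1 - w1) * norm.pdf(y, mu2, s2)
    return -np.sum(np.log(mix + 1e-12))

y = np.concatenate([np.random.randn(100) - 2, 0.5 * np.random.randn(100) + 3])  # toy data
theta0 = np.array([0.0, -1.0, 1.0, 0.0, 0.0])
theta_ml = minimize(neg_log_likelihood, theta0, args=(y,)).x
```

Note that without a prior, the likelihood is unbounded as either σ collapses onto a data point, which is exactly the degeneracy the next slide addresses.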
5 / 43
Regularisation = MAP ≠ Bayesian Inference
Regularisation or MAP
I Find argmax_θ log p(θ|y), which up to a constant equals
  log p(y|θ)   [model fit]   +   log p(θ)   [complexity penalty].
I Choose p(θ) such that p(θ) → 0 faster than p(y|θ) → ∞ as σ1 or σ2 → 0.
Bayesian Inference
I Predictive Distribution: p(y∗|y) = ∫ p(y∗|θ) p(θ|y) dθ.
I Parameter Posterior: p(θ|y) ∝ p(y|θ)p(θ).
6 / 43
Generative Adversarial Networks
Generative Procedure
I Sample z(1), . . . , z(n) ∼ p(z) (p(z) is typically uniform noise).
I Transform the noise through a generator to produce samples x′(i) = G(z(i); θg).
I G can be arbitrary but is typically a de-convolutional neural network parametrized by θg.
I If G has sufficient capacity, there is a setting of θg such that G(·; θg) can approximate the CDF inverse-CDF composition required to sample from a data distribution of interest.
Notation summary: G: generator; θg: generator parameters; z: noise; x: data sample. (A sketch of this sampling procedure follows.)
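A minimal PyTorch sketch of this generative procedure: sample noise z and push it through a generator G(·; θg). The fully connected architecture, dimensions, and uniform noise range are stand-ins, not the deconvolutional generator used in practice.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    # A small fully connected stand-in for the deconvolutional G used in practice.
    def __init__(self, z_dim=100, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, x_dim), nn.Tanh(),  # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = torch.rand(64, 100) * 2 - 1          # noise z ~ Uniform(-1, 1)
x_fake = G(z)                            # candidate samples x'(i) = G(z(i); theta_g)
```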
7 / 43
Generating Samples
8 / 43
DC-GAN Architecture
9 / 43
Training Procedure
I The generator G(·; θg) proposes candidate data samples.
I The discriminator has access to a dataset X = {x(i)} from the actual data distribution (e.g., a collection of photographs).
I A discriminator D(·; θd) trains itself to classify samples from the generator vs samples from the actual data distribution by updating its parameters θd.
I The generator updates its parameters θg to fool the discriminator; the discriminator updates its parameters θd to get better at calling out the generator (see the sketch after this list).
I If G and D have enough capacity, samples from G converge to samples from the actual data distribution.
I This procedure works in practice because of the powerful inductive biases of G and D.
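A hedged sketch of this alternating procedure, using the common non-saturating heuristic for the generator rather than the exact minimax loss; `G`, `D`, and `data_loader` are assumed to be defined elsewhere, with `D` outputting probabilities of shape (batch, 1).

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

for x_real in data_loader:                      # minibatches from the actual data distribution
    z = torch.randn(x_real.size(0), 100)
    x_fake = G(z)

    # Discriminator step: classify real vs generated samples.
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(x_real.size(0), 1)) + \
             F.binary_cross_entropy(D(x_fake.detach()), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the (just updated) discriminator.
    g_loss = F.binary_cross_entropy(D(x_fake), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```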
10 / 43
GAN Training Illustration
[Figure: GAN training illustration. x = data samples, z = noise samples. Black = data distribution, green = generative distribution, blue = discriminative distribution.]
11 / 43
GAN Objective
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
12 / 43
The Optimal Discriminator
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

Proposition: For G fixed, the optimal discriminator D is

D∗_G(x) = pdata(x) / (pdata(x) + pg(x))   (1)

Proof: The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G,D):

V(G,D) = ∫_x pdata(x) log(D(x)) dx + ∫_z pz(z) log(1 − D(G(z))) dz
       = ∫_x [pdata(x) log(D(x)) + pg(x) log(1 − D(x))] dx   (2)

For any (a, b) ∈ R² \ {(0, 0)}, the function y ↦ a log(y) + b log(1 − y) achieves its maximum on [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.
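For completeness, the calculus step behind that claim, written out (it is not spelled out on the slide):

```latex
% Maximize f(y) = a \log y + b \log(1-y) on (0,1), with a = p_{\mathrm{data}}(x) and b = p_g(x):
f'(y) = \frac{a}{y} - \frac{b}{1-y} = 0
\quad\Longrightarrow\quad a(1-y) = b\,y
\quad\Longrightarrow\quad y^{*} = \frac{a}{a+b}
      = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)} = D^{*}_{G}(x).
% Since f''(y) = -a/y^2 - b/(1-y)^2 < 0, this stationary point is indeed the maximum.
```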
13 / 43
The Optimal Generator
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

Let D∗_G(x) = pdata(x) / (pdata(x) + pg(x)). Then

C(G) = V(G, D∗) = E_{x∼pdata}[log (pdata(x) / (pdata(x) + pg(x)))] + E_{x∼pg}[log (pg(x) / (pdata(x) + pg(x)))]
     = − log(4) + KL(pdata ‖ (pdata + pg)/2) + KL(pg ‖ (pdata + pg)/2)   (3)
     = − log(4) + JSD(pdata ‖ pg)   (4)

which attains its minimum when pg = pdata.
14 / 43
SGD Training Algorithm
15 / 43
Original Paper Illustrations (2014)
16 / 43
Improvements using DCGAN (2015, 2016)
17 / 43
Progressive GANs (2017)
18 / 43
Vector space arithmetic
19 / 43
Vector space arithmetic
20 / 43
GANs with covariates
21 / 43
Mode Collapse
min_G max_D V(D,G) = E_{x∼pdata(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
Imagine switching the objective from min_G max_D to max_D min_G. Under max_D min_G, the inner minimization drives the generator to map all noise to whatever single point the current discriminator finds most realistic — which is exactly mode collapse. The practical SGD training algorithm is agnostic to this ordering.
22 / 43
GAN stability
I Feature matching
I Minibatch discrimination
I Label smoothing
23 / 43
Bayesian Generative Adversarial Networks
Prior Model
I We induce a distribution over generators G and discriminators D through distributions on their parameters:
θg ∼ p(θg|αg) (5)
θd ∼ p(θd|αd) (6)
I We then have a distribution over distributions of data (a small prior sketch follows this list).
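A minimal sketch, assuming isotropic Gaussian priors p(θ|α) = N(0, α⁻¹ I) over the network weights (the slide does not fix a particular choice of prior):

```python
import torch

def log_prior(params, alpha=1.0):
    # Isotropic Gaussian prior p(theta | alpha) = N(0, alpha^{-1} I), up to an additive constant.
    return -0.5 * alpha * sum((p ** 2).sum() for p in params)

def sample_from_prior(module, alpha=1.0):
    # Draw theta ~ p(theta | alpha) in place, giving one random generator or discriminator.
    with torch.no_grad():
        for p in module.parameters():
            p.normal_(mean=0.0, std=alpha ** -0.5)
```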
24 / 43
Generative Model for Data
1. Sample θ′g ∼ p(θg|αg)
2. Sample z(1), . . . , z(n) ∼ p(z).
3. x′(j) = G(z(j); θ′g) ∼ pgenerator(x; θ′g)
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
25 / 43
Posterior Inference with Adversarial Feedback
How do we update our posterior beliefs?
26 / 43
Propose Conditional Posteriors
p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)   (7)

p(θd | z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)   (8)
Sample iteratively from these conditional posteriors (a sketch of the unnormalized log posteriors follows).
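A hedged sketch of the two unnormalized log conditional posteriors (7)–(8) as they might be evaluated for a minibatch; `D` is assumed to output probabilities in (0, 1), and the 1e-8 offset is only for numerical stability.

```python
import torch

def _log_gauss_prior(params, alpha):
    # Isotropic Gaussian prior N(0, alpha^{-1} I) over the weights, up to a constant.
    return -0.5 * alpha * sum((p ** 2).sum() for p in params)

def log_post_generator(G, D, z, alpha_g=1.0):
    # Unnormalized log p(theta_g | z, theta_d): sum_i log D(G(z(i)); theta_d) + log p(theta_g | alpha_g)
    return torch.log(D(G(z)) + 1e-8).sum() + _log_gauss_prior(G.parameters(), alpha_g)

def log_post_discriminator(G, D, z, x_real, alpha_d=1.0):
    # Unnormalized log p(theta_d | z, X, theta_g):
    #   sum_i log D(x(i)) + sum_j log(1 - D(G(z(j)))) + log p(theta_d | alpha_d)
    return (torch.log(D(x_real) + 1e-8).sum()
            + torch.log(1.0 - D(G(z)) + 1e-8).sum()
            + _log_gauss_prior(D.parameters(), alpha_d))
```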
27 / 43
Classical GANs as Maximum Likelihood
p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} D(G(z(i); θg); θd) ) p(θg|αg)

p(θd | z, X, θg) ∝ ∏_{i=1}^{nd} D(x(i); θd) × ∏_{i=1}^{ng} (1 − D(G(z(i); θg); θd)) × p(θd|αd)
If we assign a vague uniform prior over θg and θd and perform iterative MAP optimization instead of sampling, then the local optima will be in the same place as in the classical GAN of Goodfellow et al. (2014).
28 / 43
Marginalizing the Noise
p(θg | θd) = ∫ p(θg, z | θd) dz = ∫ p(θg | z, θd) p(z | θd) dz,   with p(z | θd) = p(z)   (9)
           ≈ (1/J) ∑_{j=1}^{J} p(θg | z(j), θd),   z(j) ∼ p(z)

By a similar derivation, p(θd | θg) ≈ (1/J) ∑_{j=1}^{J} p(θd | z(j), X, θg),   z(j) ∼ p(z).
I p(z) is a white noise distribution from which we can take efficient and exact samples.
I p(θg|z, θd) and p(θd|z, X, θg), viewed as functions of z, are broad over z by construction, since z is used to produce candidate data samples in the generative procedure. Therefore each term in the sum contributes to the estimate (see the Monte Carlo sketch after this list).
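A hedged sketch of the Monte Carlo approximation in (9), reusing the `log_post_generator` helper sketched earlier: the average of conditional posterior densities over J fresh noise draws is computed in log space with a logsumexp.

```python
import math
import torch

z_draws = [torch.rand(64, 100) * 2 - 1 for _ in range(10)]      # J = 10 fresh noise sets z(j) ~ p(z)

def log_marginal_post_generator(G, D, z_draws, alpha_g=1.0):
    # Approximate p(theta_g | theta_d) = (1/J) sum_j p(theta_g | z(j), theta_d)  (eq. 9),
    # evaluated in log space over the J noise draws.
    log_terms = torch.stack([log_post_generator(G, D, z_j, alpha_g) for z_j in z_draws])
    return torch.logsumexp(log_terms, dim=0) - math.log(len(z_draws))
```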
29 / 43
Semi-supervised Learning
I Make label predictions using structure from both unlabelled and labelled training data.
I Can quantify recent advances in unsupervised learning.
I Crucial for reducing the dependency of deep learning on large labelled datasets.
30 / 43
Semi-supervised Learning
I Task: predict the class label of test images, based on a training set of labelled and unlabelled images.
I n unlabelled observations {x(i)}, and ns labelled observations {(x_s(i), y_s(i))}_{i=1}^{ns}, with class labels y_s(i) ∈ {1, . . . , K}.
I Redefine the discriminator such that D(x(i) = y(i); θd) gives the probability that sample x(i) belongs to class y(i).

p(θg | z, θd) ∝ ( ∏_{i=1}^{ng} ∑_{y=1}^{K} D(G(z(i); θg) = y; θd) ) p(θg|αg)   (10)

p(θd | z, x, ys, θg) ∝ ∏_{i=1}^{nd} ( ∑_{y=1}^{K} D(x(i) = y; θd) ) × ∏_{i=1}^{ng} D(G(z(i); θg) = 0; θd) × ∏_{i=1}^{ns} D(x_s(i) = y_s(i); θd) × p(θd|αd)
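A hedged sketch of how this redefined discriminator can be read off a (K+1)-way softmax, with class 0 reserved for generated samples; the architecture and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSupDiscriminator(nn.Module):
    # K + 1 output classes: index 0 = "generated / fake", indices 1..K = the real classes.
    def __init__(self, x_dim=784, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, n_classes + 1))

    def forward(self, x):
        return F.softmax(self.net(x), dim=-1)    # rows are p(y = 0..K | x, theta_d)

D = SemiSupDiscriminator()
probs = D(torch.randn(5, 784))
p_real_any_class = probs[:, 1:].sum(dim=-1)      # sum_{y=1}^K D(x = y; theta_d), used for unlabelled data
p_fake = probs[:, 0]                             # D(x = 0; theta_d), used for generated samples
```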
31 / 43
Making Predictions of Class Labels
To compute the predictive distribution for a class label y∗ at a test input x∗ we use a model average over all collected samples with respect to the posterior over θd:

p(y∗ | x∗, D) = ∫ p(y∗ | x∗, θd) p(θd | D) dθd   (11)
             ≈ (1/T) ∑_{k=1}^{T} p(y∗ | x∗, θd(k)),   θd(k) ∼ p(θd | D)   (12)
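A minimal sketch of the model average in (12), assuming `discriminator_samples` is a list of discriminator networks whose weights are approximate posterior samples (e.g., collected SGHMC iterates), each returning class probabilities.

```python
import torch

def predict(x_test, discriminator_samples):
    # Bayesian model average (eq. 12): average class probabilities over T posterior
    # samples theta_d(k) ~ p(theta_d | D), represented here as a list of discriminators.
    probs = torch.stack([D_k(x_test) for D_k in discriminator_samples])   # shape (T, N, K+1)
    return probs.mean(dim=0)                                              # approximate p(y* | x*, D)
```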
32 / 43
Stochastic Gradient Hamiltonian Monte Carlo
I Hamiltonian Monte Carlo (HMC) is an auxiliary-variable MCMC approach, inspired by physics.
I HMC uses gradient information to make better proposals and avoid random-walk behaviour.
I Stochastic gradient HMC (Chen et al., 2014) is an SGD-like algorithm that enables posterior sampling with no more computational complexity than SGD (a single-update sketch follows this list)!
I SG-HMC makes it possible to do Bayesian deep learning with insignificant computational overhead!
I Likelihood surfaces in deep architectures are very well suited to sampling over optimization!
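A hedged sketch of a single SGHMC update in the spirit of Chen et al. (2014), with a simple constant friction term; `grad_log_post` stands for a stochastic estimate of ∇_θ log p(θ | ·), e.g. gradients of the conditional log posteriors sketched above.

```python
import torch

def sghmc_step(params, momenta, grad_log_post, lr=1e-3, friction=0.1):
    # One SGHMC update: v <- (1 - friction) * v + lr * grad_log_post + noise, theta <- theta + v,
    # where the injected noise has variance 2 * friction * lr.
    with torch.no_grad():
        for p, v, g in zip(params, momenta, grad_log_post):
            noise = torch.randn_like(p) * (2.0 * friction * lr) ** 0.5
            v.mul_(1.0 - friction).add_(lr * g).add_(noise)
            p.add_(v)
```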
33 / 43
Bayesian GAN Learning Algorithm
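The algorithm figure did not survive extraction; below is a much-simplified, hedged sketch of the overall loop, reusing the `sghmc_step`, `log_post_generator`, and `log_post_discriminator` helpers sketched earlier. The algorithm in the paper additionally averages over several noise sets and over posterior samples of the opposing network; `num_steps`, `thin`, and `data_iter` are placeholders.

```python
import copy
import torch

generator_samples, discriminator_samples = [], []
mom_g = [torch.zeros_like(p) for p in G.parameters()]
mom_d = [torch.zeros_like(p) for p in D.parameters()]

def grads_of(log_post, module):
    # Gradient of the (unnormalized) log posterior with respect to the module's parameters.
    module.zero_grad()
    log_post.backward()
    return [p.grad.detach().clone() for p in module.parameters()]

for step in range(num_steps):
    z = torch.rand(64, 100) * 2 - 1                 # fresh noise z ~ p(z)
    x_real = next(data_iter)

    # SGHMC move on the generator parameters, holding theta_d fixed.
    sghmc_step(list(G.parameters()), mom_g, grads_of(log_post_generator(G, D, z), G))

    # SGHMC move on the discriminator parameters, holding theta_g fixed.
    sghmc_step(list(D.parameters()), mom_d, grads_of(log_post_discriminator(G, D, z, x_real), D))

    if step % thin == 0:                            # keep iterates as approximate posterior samples
        generator_samples.append(copy.deepcopy(G))
        discriminator_samples.append(copy.deepcopy(D))
```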
34 / 43
Exploring a whole distribution over G and D
[Figure: the posterior p(θg|D) over generator parameters θg, contrasted with the single maximum likelihood estimate (θg)ML.]
35 / 43
Avoiding Mode Collapse
36 / 43
Semi-Supervised Results: MNIST
37 / 43
Semi-Supervised Results: CIFAR-10
38 / 43
Semi-Supervised Results
39 / 43
Semi-Supervised Results
[Figure: semi-supervised results, BayesGAN.]
40 / 43
More Sample Generation
41 / 43
Discussion
I A natural Bayesian generalization of the classical GAN.
I Avoids mode collapse and reduces the need for manual intervention.
I Has particularly promising results on semi-supervised prediction tasks.
I Future directions: deterministic approximate inference, different architectures, different priors, new applications...
I Code available: https://github.com/andrewgordonwilson/bayesgan
42 / 43
Scalable Gaussian Processes
I Highly accurate kernel approximations that admit fast matrix-vector multiplications (MVMs).
I Linear conjugate gradients (LCG) for inference, stochastic Lanczos for log determinants and derivatives (kernel learning).
I O(n) training and O(1) testing (instead of O(n³) training and O(n²) testing).
I Harmonizes with GPU acceleration
I Very powerful for large-scale spatiotemporal regression.
I Implemented in our new library GPyTorch: https://github.com/jrg365/gpytorch (a small conjugate-gradients sketch follows).
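A minimal sketch (not the GPyTorch implementation) of the idea behind MVM-based inference: solve a linear system such as (K + σ²I)v = y with conjugate gradients, touching the kernel matrix only through matrix-vector products. The `K_matmul` and `noise_var` names in the usage comment are hypothetical.

```python
import torch

def conjugate_gradients(mvm, b, max_iter=100, tol=1e-6):
    # Solve A v = b using only matrix-vector products v -> A v (A symmetric positive definite).
    x = torch.zeros_like(b)
    r = b - mvm(x)
    p = r.clone()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = mvm(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Example use: GP predictive-mean weights alpha = (K + sigma^2 I)^{-1} y, given a fast K @ v routine:
#   mvm = lambda v: K_matmul(v) + noise_var * v
#   alpha = conjugate_gradients(mvm, y)
```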
43 / 43