
Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling

Tong Che * 1 2 Ruixiang Zhang * 1 Jascha Sohl-Dickstein 2 Hugo Larochelle 2 Liam Paull 1 Yuan Cao 2

Yoshua Bengio 1

Abstract

The sum of the implicit generator log-density log p_g of a GAN with the logit score of the discriminator defines an energy function which yields the true data density when the generator is imperfect but the discriminator is optimal. This makes it possible to improve on the typical generator (with implicit density p_g). We show that samples can be generated from this modified density by sampling in latent space according to an energy-based model induced by the sum of the latent prior log-density and the discriminator output score. We call this process of running Markov Chain Monte Carlo in the latent space, and then applying the generator function, Discriminator Driven Latent Sampling (DDLS). We show that DDLS is highly efficient compared to previous methods which work in the high-dimensional pixel space, and that it can be applied to improve on previously trained GANs of many types. We evaluate DDLS on both synthetic and real-world datasets, qualitatively and quantitatively. On CIFAR-10, DDLS substantially improves the Inception Score of an off-the-shelf pre-trained SN-GAN (Miyato et al., 2018) from 8.22 to 9.09, which is comparable to the class-conditional BigGAN (Brock et al., 2019) model. This achieves a new state-of-the-art in the unconditional image synthesis setting without introducing extra parameters or additional training.

1. Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are state-of-the-art models for a large variety of tasks such as image generation, semi-supervised learning (Dai et al., 2017), image editing (Yi et al., 2017),

*Equal contribution. ¹Mila, Université de Montréal. ²Google Brain. Correspondence to: Tong Che <[email protected]>, Ruixiang Zhang <[email protected]>.

Work in progress.

image translation (Zhu et al., 2017), and imitation learning (Ho & Ermon, 2016). In a nutshell, the GAN framework consists of two neural networks, the generator G and the discriminator D. The optimization process is formulated as an adversarial game, with the generator trying to fool the discriminator and the discriminator trying to better classify real from fake samples.

Despite the ability of GANs to generate high-resolution, sharp samples, these models are notoriously hard to train. Previous work on GANs focused on improving stability and mode dropping issues (Metz et al., 2016; Che et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Roth et al., 2017), which are believed to be the main difficulties in training GANs.

Besides instability and mode dropping, the samples of state-of-the-art GAN models sometimes contain bad artifacts or are not even recognizable (Karras et al., 2019). It is conjectured that this is due to the inherent difficulty of generating high-dimensional complex data, such as natural images, and to the optimization challenge of the adversarial formulation. In order to improve sample quality, conventional sampling techniques, such as increasing the temperature, are commonly adopted for GAN models (Brock et al., 2019). Recently, new sampling methods such as Discriminator Rejection Sampling (DRS) (Azadi et al., 2018), the Metropolis-Hastings Generative Adversarial Network (MH-GAN) (Turner et al., 2019), and Discriminator Optimal Transport (DOT) (Tanaka, 2019) have shown promising results by utilizing the information provided by both the generator and the discriminator. However, these sampling techniques are either inefficient or lack theoretical guarantees, possibly reducing sample diversity and making the mode dropping problem more severe.

In this paper, we show that GANs can be better understood through the lens of Energy-Based Models (EBMs). In our formulation, GAN generators and discriminators collaboratively learn an "implicit" energy-based model. However, efficient sampling from this energy-based model directly in pixel space is extremely challenging for several reasons. One is that there is no tractable closed form for the implicit energy function in pixel space.

This motivates an intriguing possibility: that Markov Chain Monte Carlo (MCMC) sampling may prove more tractable in the GAN's latent space.

Surprisingly, we find that the implicit energy-based model defined jointly by a GAN generator and discriminator takes on a simpler, tractable form when it is written as an energy-based model over the generator's latent space. In this way, we propose a theoretically grounded way of generating high-quality samples from GANs through what we call Discriminator Driven Latent Sampling (DDLS).

DDLS leverages the information contained in the discriminator to re-weight and correct the biases and errors in the generator. Through experiments, we show that our proposed method is highly efficient in terms of mixing time, is generally applicable to a variety of GAN models (e.g. minimax, non-saturating, and Wasserstein GANs), and is robust across a wide range of hyper-parameters.

We highlight our main contributions as follows:

• We provide more evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.

• We derive an equivalent formulation of the pixel-space energy-based model in the latent space, and show that sampling is much more efficient in the latent space.

• We show experimentally that samples from this energy-based model are of higher quality than samples from the generator alone.

• We show that our method can approximately extend to other GAN formulations, such as Wasserstein GANs.

2. Background

In this section we present the background methodology of GANs and EBMs on which our method is based.

2.1. Generative Adversarial Networks

GANs (Goodfellow et al., 2014) are a powerful class of generative models defined through an adversarial minimax game between a generator network G and a discriminator network D. The generator G takes a latent code z from a prior distribution p(z) and produces a sample G(z) ∈ X. The discriminator takes a sample x ∈ X as input and aims to classify real data from fake samples produced by the generator, while the generator is asked to fool the discriminator as much as possible. We use p_d to denote the true data-generating distribution and p_g to denote the implicit distribution induced by the prior and the generator network.

The standard non-saturating training objective for the generator and discriminator is defined as:

L_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

L_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]    (1)

Wasserstein GANs (WGANs) (Arjovsky et al., 2017) are a special family of GAN models. Instead of targeting the Jensen-Shannon divergence, they target the 1-Wasserstein distance W(p_g, p_d). The WGAN discriminator objective function is constructed using the Kantorovich duality:

\max_{D \in \mathcal{L}} \; \mathbb{E}_{p_d}[D(x)] - \mathbb{E}_{p_g}[D(x)]    (2)

where \mathcal{L} is the set of 1-Lipschitz functions.

The WGAN discriminator is motivated in terms of defining a critic function whose gradient with respect to its input is better behaved (smoother) than in the original GAN formulation, making the optimization of the generator easier (Lucic et al., 2018).
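The following is a minimal PyTorch sketch of the objectives above. It is illustrative only: it assumes a discriminator `D` that returns raw logits (standard GAN) or unconstrained critic values (WGAN), and the Lipschitz constraint for WGAN must be enforced separately (e.g. by gradient penalty or spectral normalization).

```python
# Minimal sketch of the objectives in Eqs. (1)-(2); `D`, `x_real`, `x_fake`
# are illustrative names, not part of the paper's code.
import torch
import torch.nn.functional as F

def gan_d_loss(D, x_real, x_fake):
    # L_D = -E_{p_d}[log D(x)] - E_{p_z}[log(1 - D(G(z)))], with D = sigmoid(logit)
    d_real, d_fake = D(x_real), D(x_fake)
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def gan_g_loss(D, x_fake):
    # Non-saturating generator loss: L_G = -E_{p_z}[log D(G(z))]
    d_fake = D(x_fake)
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

def wgan_d_loss(D, x_real, x_fake):
    # Negative of Eq. (2); the 1-Lipschitz constraint is handled elsewhere.
    return D(x_fake).mean() - D(x_real).mean()
```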

2.2. Energy-Based Models and Langevin Dynamics

An energy-based model (EBM) is defined by a Boltzmann distribution

p(x) = e^{-E(x)} / Z    (3)

where x ∈ X, X is the state space, and E(x): X → R is the energy function. Samples are typically generated from p(x) by an MCMC algorithm. One common MCMC algorithm in continuous state spaces is Langevin dynamics, with the update equation

x_{i+1} = x_i - \frac{\epsilon}{2} \nabla_x E(x_i) + \sqrt{\epsilon}\, n, \quad n \sim \mathcal{N}(0, I).    (4)

Langevin dynamics are guaranteed to exactly sample from the target distribution p(x) as ε → 0.¹
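A minimal sketch of the update in Eq. (4), written with PyTorch autograd; `energy` is any differentiable callable returning a per-sample scalar energy, and the step size and step count are assumptions, not values from the paper.

```python
import torch

def langevin_sample(energy, x0, n_steps=100, eps=1e-2):
    """Run the Langevin update of Eq. (4) starting from x0."""
    x = x0.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(x).sum(), x)[0]   # gradient of E at x_i
        noise = torch.randn_like(x)                         # n ~ N(0, I)
        x = (x - 0.5 * eps * grad + eps ** 0.5 * noise).detach().requires_grad_(True)
    return x.detach()
```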

One solution to the problem of slow-mixing Markov chains is to perform sampling in a carefully crafted latent space (Bengio et al., 2013; Hoffman et al., 2019). Our method shows how one can execute such latent-space MCMC in GAN models.

3. Methodology

3.1. GANs as an Energy-Based Model

Suppose we have a GAN model trained on a data distribution p_d, with a generator G(z) whose induced distribution is p_g, and a discriminator D(x). We assume that p_g and p_d have the same support. This can be guaranteed by adding small Gaussian noise to these two distributions.

¹ In practice we use a small, finite value for ε in our experiments.

The training of GANs is an adversarial game which generally does not converge to the optimal generator, so usually p_d and p_g do not match perfectly at the end of training. However, the discriminator provides a quantitative estimate of how much these two distributions (mis)match. Let us assume the discriminator is near optimality (Goodfellow et al., 2014), namely:

D(x) \approx \frac{p_d(x)}{p_d(x) + p_g(x)}    (5)

From the above equation, let d(x) be the logit of D(x), in which case

\frac{p_d(x)}{p_d(x) + p_g(x)} = \frac{1}{1 + \frac{p_g(x)}{p_d(x)}} \approx \frac{1}{1 + \exp(-d(x))},    (6)

and we have e^{d(x)} ≈ p_d(x)/p_g(x), so p_d(x) ≈ p_g(x) e^{d(x)}. Normalization of p_g(x) e^{d(x)} is not guaranteed, and it will not typically be a valid probabilistic model. We therefore consider the energy-based model:

p^*_d(x) = p_g(x) e^{d(x)} / Z_0    (7)

where Z_0 is a normalization constant. Intuitively, this formulation has two desirable properties. First, as we elaborate later, if D = D*, where D* is the optimal discriminator, then p^*_d = p_d. Secondly, it corrects the bias in the generator via weighting and normalization. If we can sample from this distribution, it should improve our samples.

There are two difficulties in sampling efficiently from p^*_d:

1. Doing MCMC in pixel space to sample from the model is impractical due to the high dimensionality and long mixing time.

2. p_g(x) is implicitly defined and its density cannot be computed directly.

In the next section we show how to overcome these two difficulties.

3.2. Rejection Sampling and MCMC in Latent Space

Our approach to the above two problems is to formulate an equivalent energy-based model in the latent space. To derive this formulation, we first review rejection sampling (Casella et al., 2004). With p_g as the proposal distribution, we have e^{d(x)}/Z_0 = p^*_d(x)/p_g(x). Denote M = \max_x p^*_d(x)/p_g(x) (this is well-defined if we add Gaussian noise to the output of the generator and x lies in a compact space). If we accept samples from the proposal distribution p_g with probability p^*_d(x)/(M p_g(x)), then the accepted samples have the distribution p^*_d.

We can alternatively interpret the rejection sampling procedure above as occurring in the latent space z.

Algorithm 1 Discriminator Langevin Sampling

Input: N ∈ ℕ⁺, ε > 0
Output: latent code z_N ∼ p_t(z)
Sample z_0 ∼ p_0(z)
for i = 0, ..., N-1 do
    n_i ∼ \mathcal{N}(0, I)
    z_{i+1} = z_i - \frac{\epsilon}{2} \nabla_z E(z_i) + \sqrt{\epsilon}\, n_i
end for

In this interpretation, we first sample z from the prior p_0(z), and then perform rejection sampling on z with acceptance probability e^{d(G(z))}/(M Z_0). Only once a latent sample z has been accepted do we generate the pixel-level sample x = G(z).
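A minimal sketch of this latent-space rejection step, under the assumption of a trained `generator`, a `logit` function d(·) (the discriminator before its sigmoid), a prior sampler, and a known bound `M_Z0` on e^{d(G(z))} (in practice such a bound would have to be estimated, e.g. from a pilot batch); all names are illustrative.

```python
import torch

def latent_rejection_sample(generator, logit, sample_prior, M_Z0, n=64):
    """Accept latent codes z ~ p0(z) with probability exp(d(G(z))) / (M * Z0)."""
    accepted = []
    while len(accepted) < n:
        z = sample_prior(1)                                          # one latent proposal
        accept_prob = (torch.exp(logit(generator(z))) / M_Z0).clamp(max=1.0)
        if torch.rand(()) < accept_prob:
            accepted.append(z)
    return torch.cat(accepted, dim=0)
```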

This rejection sampling procedure on z induces a new probability distribution p_t(z). To explicitly compute this distribution we need to conceptually reverse the definition of rejection sampling. We formally write down the "reverse" lemma of rejection sampling as Lemma 1, to be used in our main theorem.

Lemma 1. Let p(x) be a probability distribution on a space X, and let r(x): X → [0, 1] be a measurable function on X. Consider sampling from p, accepting with probability r(x), and repeating this procedure until a sample is accepted. Denote the resulting probability measure of the accepted samples by q(x). Then we have:

q(x) = p(x) r(x) / Z, \quad Z = \mathbb{E}_p[r(x)].    (8)

Proof. From the definition of rejection sampling, we can see that in order to obtain the distribution q(x), we can sample x from p(x) and do rejection sampling with probability r'(x) = q(x)/(M p(x)), where M ≥ q(x)/p(x) for all x. So we have r'(x) = r(x)/(Z M). If we choose M = 1/Z, then from r(x) ≤ 1 for all x we see that M satisfies M ≥ q(x)/p(x) = r(x)/Z for all x. So we can choose M = 1/Z, resulting in r(x) = r'(x).

Namely, we have the prior proposal distribution p_0(z) and an acceptance probability r(z) = e^{d(G(z))}/(M Z_0). We want to compute the distribution after the rejection sampling procedure with r(z). By Lemma 1, p_t(z) = p_0(z) r(z)/Z'. We expand on the details in our main theorem.

Interestingly, p_t(z) has the form of an energy-based model, p_t(z) = e^{-E(z)}/Z', with tractable energy function E(z) = -\log p_0(z) - d(G(z)). In order to sample from this Boltzmann distribution, one can use an MCMC sampler, such as Langevin dynamics or Hamiltonian Monte Carlo. The algorithm using Langevin dynamics is given in Alg. 1.
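A minimal sketch of Algorithm 1 for a standard-normal prior, assuming a trained `generator` and a discriminator `logit` function d(·) as above; hyper-parameter values here are placeholders, not the paper's settings.

```python
import torch

def ddls_energy(z, generator, logit):
    # E(z) = -log p0(z) - d(G(z)), up to an additive constant
    log_p0 = -0.5 * z.pow(2).sum(dim=1)        # standard-normal prior term
    return -log_p0 - logit(generator(z)).squeeze(-1)

def ddls_sample(generator, logit, z0, n_steps=1000, eps=1e-2):
    z = z0.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(ddls_energy(z, generator, logit).sum(), z)[0]
        z = (z - 0.5 * eps * grad + eps ** 0.5 * torch.randn_like(z)
             ).detach().requires_grad_(True)
    return generator(z.detach())               # decode the final latent codes
```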

3.3. Main Theorem

Summarizing the arguments and results above, we have the following theorem:

Theorem 1. Assume p_d is the data-generating distribution and p_g is the generator distribution induced by the generator G: Z → X, where Z is the latent space with prior distribution p_0(z). Define p^*_d = e^{\log p_g(x) + d(x)}/Z_0, where Z_0 is the normalization constant.

Assume p_g and p_d have the same support. This assumption is typically satisfied when dim(z) ≥ dim(x); we address the case dim(z) < dim(x) in Corollary 1. Further, let D(x) be the discriminator and d(x) the logit of D, namely D(x) = σ(d(x)). We define the energy function E(z) = -\log p_0(z) - d(G(z)) and its Boltzmann distribution p_t(z) = e^{-E(z)}/Z. Then we have:

1. p^*_d = p_d when D is the optimal discriminator.

2. If we sample z ∼ p_t and set x = G(z), then x ∼ p^*_d. Namely, the induced probability measure G ∘ p_t = p^*_d.

Proof. (1) follows from the fact that when D is optimal, D(x) = p_d(x)/(p_d(x) + p_g(x)), so D(x) = σ(\log p_d - \log p_g), which implies that d(x) = \log p_d - \log p_g (which is finite on the support of p_g, since the two distributions have the same support). Thus p^*_d(x) = p_d(x)/Z_0; by normalization we must have Z_0 = 1, so p^*_d = p_d.

For (2), for samples x ∼ p_g, if we do rejection sampling with probability p^*_d(x)/(M p_g(x)) = e^{d(x)}/(M Z_0) (where M is a constant with M ≥ p^*_d(x)/p_g(x)), we get samples from the distribution p^*_d. We can view this as rejection sampling in the latent space Z, where we perform rejection sampling on p_0(z) with acceptance probability r(z) = p^*_d(G(z))/(M p_g(G(z))) = e^{d(G(z))}/(M Z_0). Applying Lemma 1, we see that this rejection sampling procedure induces a probability distribution p_t(z) = p_0(z) r(z)/C on the latent space Z, where C is the normalization constant. Thus sampling from p^*_d is equivalent to sampling z from p_t(z) and generating x = G(z).

3.4. Sampling Wasserstein GANs with Langevin Dynamics

Wasserstein GANs differ from original GANs in that they target the Wasserstein loss. Although, when the discriminator is trained to optimality, it can recover the Kantorovich dual (Arjovsky et al., 2017) of the optimal transport between p_g and p_d, the target distribution p_d cannot be exactly recovered using the information in p_g and D.²

² In Tanaka (2019), the authors claim that it is possible to recover p_d from D and p_g in WGAN under certain metrics, but we show in the Appendix that their assumptions do not hold and that, in the L1 metric used by WGAN, it is not possible to recover p_d.

However, in the following we show that in practice the optimization of WGAN can be viewed as an approximation of an energy-based model, which can also be sampled with our method.

The objectives of Wasserstein GANs can be summarized as:

L_D = \mathbb{E}_{p_g}[D(x)] - \mathbb{E}_{p_d}[D(x)]    (9)
L_G = -\mathbb{E}_{p_0}[D(G(z))]    (10)

where D is restricted to be a K-Lipschitz function.

On the other hand, consider a new energy-based generative model (which also has a generator and a discriminator) trained with the following objectives:

1. Discriminator training phase (D-phase). Unlike GANs, our energy-based model tries to match the distribution p_t(x) = p_g(x) e^{D_φ(x)}/Z with the data distribution p_d, where p_t(x) can be interpreted as an EBM with energy -D_φ(x) - \log p_g(x). In this phase, the generator is kept fixed, and the discriminator is trained.

2. Generator training phase (G-phase). The generator is trained such that p_g(x) matches p_t(x); in this phase we treat D as fixed and train G.

In the D-phase, we are training an EBM with data from p_d. The gradient of the KL divergence can be written as (MacKay, 2003):

\nabla_\phi \mathrm{KL}(p_d \| p_t) = \mathbb{E}_{p_t}[\nabla_\phi D(x)] - \mathbb{E}_{p_d}[\nabla_\phi D(x)]    (11)

Namely, we are trying to maximize D on real data and to minimize it on fake data. Note that the fake data distribution p_t is a function of both the generator and the discriminator, and cannot be sampled directly. As with other energy-based models, we can use an MCMC procedure such as Langevin dynamics to generate samples from p_t. Although in practice it would be difficult to generate equilibrium samples by MCMC at every training step, historically the training of energy-based models has been successful even when negative samples are generated using only a small number of MCMC steps (Tieleman, 2008).

In the G-phase, we can train the model with the KL divergence. Let p'_t be a fixed copy of p_t; then we have (see the Appendix for more details):

\nabla_\theta \mathrm{KL}(p_g \| p'_t) = -\mathbb{E}_{z \sim p_0(z)}[\nabla_\theta D(G(z))].    (12)

Note that the losses above coincide with what we are optimizing in WGANs, with two differences:

1. In WGAN, we optimize D on p_g instead of p_t. This may not be a big difference in practice, since as training progresses p_t is expected to approach p_g: the optimizing loss for the generator explicitly acts to bring p_g closer to p_t (Equation 12). Moreover, it has recently been found in LOGAN (Wu et al., 2019) that optimizing D on p_t rather than p_g can lead to better performance.




2. In WGAN, we impose a Lipschitz constraint on D. This constraint can be viewed as a smoothness regularizer. Intuitively it will make the distribution p_t(x) = p_g(x) e^{D_φ(x)}/Z more "flat" than p_d, but p_t(x) (which lies in a distribution family parameterized by D) remains an approximator to p_d subject to this constraint.

Thus, we can conclude that for a Wasserstein GAN with discriminator D, WGAN approximately optimizes the KL divergence between p_t = p_g(x) e^{D(x)}/Z and p_d, with the constraint that D is K-Lipschitz. This suggests that one can also perform DDLS in the WGAN latent space to generate improved samples, using the energy function E(z) = -\log p_0(z) - D(G(z)).
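Because this energy has the same functional form as in the standard GAN case, the `ddls_sample` sketch given in Section 3.2 can be reused directly, with the unbounded critic value passed where the logit d(·) was expected. A hedged usage sketch; `wgan_generator` and `wgan_critic` are hypothetical stand-ins for a trained WGAN, and the hyper-parameters are placeholders.

```python
import torch

latent_dim = 128                         # assumed latent dimensionality
z0 = torch.randn(64, latent_dim)         # a batch of latent codes from the prior
# For a WGAN, the raw critic value D(G(z)) plays the role of the logit d(G(z))
# in the energy, so the critic is passed in place of the logit.
samples = ddls_sample(wgan_generator, wgan_critic, z0, n_steps=1000, eps=1e-2)
```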

3.5. Practical Issues and the Mode Dropping Problem

Mode dropping is a major problem in training GANs. In our main theorem it is assumed that p_g and p_d have the same support. We also assumed that G: Z → X is a deterministic function. Thus, if G cannot recover some of the modes in p_d, p^*_d also cannot recover these modes.

However, we can partially address the mode dropping problem by introducing an additional Gaussian noise variable z' ∼ N(0, I) at the output of the generator, namely we define the new deterministic generator G*(z, z') = G(z) + εz'. We treat z' as part of the generator and do DDLS on the joint latent variables (z, z'). The Langevin dynamics will help the model move data points that lie slightly off-mode back toward the data manifold, yielding the following corollary of the main theorem (a minimal code sketch of DDLS on these joint latent variables is given at the end of this subsection).

Corollary 1. Assume p_d is the data-generating distribution with small Gaussian noise added. The generator G: Z → X is deterministic, where Z is the latent space endowed with prior distribution p_0(z). Assume z' ∼ p_1(z') = N(0, I) is an additional Gaussian noise variable with dim z' = dim X. Let ε > 0 and denote the distribution of the extended generator G*(z, z') = G(z) + εz' by p_g.

Define p^*_d = e^{\log p_g(x) + d(x)}/Z_0, where Z_0 is the normalization constant.

D(x) is the discriminator trained between p_g and p_d. Let d(x) be the logit of D, namely D(x) = σ(d(x)). We define the energy function E(z, z') = -\log p_0(z) - \log p_1(z') - d(G*(z, z')) and its Boltzmann distribution p_t(z, z') = e^{-E(z, z')}/Z. Then we have:

1. p^*_d = p_d when D is the optimal discriminator.

2. If we sample (z, z') ∼ p_t and set x = G*(z, z'), then x ∼ p^*_d. Namely, the induced probability measure G* ∘ p_t = p^*_d.

Proof. Let G*(z, z') play the role of the generator G in Theorem 1; since p_d has small Gaussian noise added, p_d and p_g have the same support. Applying Theorem 1 yields the corollary.

Additionally, one can add this Gaussian noise to all the intermediate layers in the generator G and rely on Langevin dynamics to correct the resulting data distribution.
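A minimal sketch of one DDLS Langevin step on the joint latent variables (z, z') of the extended generator G*(z, z') = G(z) + εz' from Corollary 1. The `generator`, `logit`, noise scale, and step size are assumptions carried over from the earlier sketches, not the paper's settings.

```python
import torch

def joint_energy(z, z_prime, generator, logit, eps_noise):
    log_p0 = -0.5 * z.flatten(1).pow(2).sum(dim=1)        # standard-normal prior on z
    log_p1 = -0.5 * z_prime.flatten(1).pow(2).sum(dim=1)  # standard-normal prior on z'
    x = generator(z) + eps_noise * z_prime                # extended generator G*(z, z')
    return -log_p0 - log_p1 - logit(x).squeeze(-1)

def joint_ddls_step(z, z_prime, generator, logit, eps_noise=0.05, eps=1e-2):
    z = z.detach().requires_grad_(True)
    z_prime = z_prime.detach().requires_grad_(True)
    energy = joint_energy(z, z_prime, generator, logit, eps_noise).sum()
    grad_z, grad_zp = torch.autograd.grad(energy, (z, z_prime))
    z_new = z - 0.5 * eps * grad_z + eps ** 0.5 * torch.randn_like(z)
    zp_new = z_prime - 0.5 * eps * grad_zp + eps ** 0.5 * torch.randn_like(z_prime)
    return z_new.detach(), zp_new.detach()
```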

4. Related Work

Previous work has considered utilizing the discriminator to achieve better sampling for GANs. Discriminator rejection sampling (Azadi et al., 2018) and Metropolis-Hastings GANs (Turner et al., 2019) use p_g as the proposal distribution and D as the criterion for acceptance or rejection. However, these methods are inefficient, as they may need to reject many samples. Intuitively, one major drawback of these methods is that, since they operate in pixel space, they can use the discriminator to reject bad samples but cannot easily guide latent-space updates that would improve those samples.

Discriminator optimal transport (DOT) (Tanaka, 2019) is another way of sampling from GANs. It uses deterministic gradient descent in the latent space to obtain samples with higher D-values. However, since p_g and D cannot recover the data distribution exactly, DOT has to keep the optimization local, in a small neighborhood of the generated samples (a small δ is used to prevent over-optimization), which hurts sample performance. Also, DOT is not guaranteed to converge to the data distribution even under ideal assumptions (D is optimal).

Several other works have considered probability models defined jointly by a generator and a discriminator. Deng et al. (2020) train an EBM defined jointly by a generator and an additional critic function in the text-generation setting. Grover et al. (2019) use an additional discriminator as a bias corrector for generative models via importance weighting. Grover et al. (2018) considered rejection sampling in the latent space of encoder-decoder models.

Energy-based models have gained significant attention in recent years. Most work focuses on the maximum likelihood learning of energy-based models (LeCun et al., 2006; Du & Mordatch, 2019; Salakhutdinov & Hinton, 2009). The primary difficulty in training energy-based models comes from effectively estimating and sampling the partition function.


The contribution to training from the partition function can be estimated via MCMC (Du & Mordatch, 2019; Hinton, 2002; Nijkamp et al., 2019), via training another generator network (Kim & Bengio, 2016; Kumar et al., 2019), or via surrogate objectives to maximum likelihood (Hyvärinen, 2005; Gutmann & Hyvärinen, 2010; Sohl-Dickstein et al., 2011).

The connection between GANs and EBMs has been studied by many authors (Zhao et al., 2016; Finn et al., 2016). Our paper can be viewed as establishing a new connection between GANs and EBMs which allows efficient latent MCMC sampling.

5. Experimental results

In this section we present a set of experiments demonstrating the effectiveness of our method on both synthetic and real-world datasets. In Section 5.1 we illustrate how the proposed method, DDLS, can improve the distribution modeling of a trained GAN and compare it with other baseline methods. In Section 5.2 we show that DDLS can improve sample quality on real-world datasets, both qualitatively and quantitatively.³

5.1. Synthetic dataset

Following the same setting used in Azadi et al. (2018), Turner et al. (2019), and Tanaka (2019), we apply DDLS to a WGAN model trained on two synthetic datasets, 25-Gaussians and Swiss Roll, and investigate the effect and performance of the proposed sampling method.

Implementation details The 25-Gaussians dataset is generated by a mixture of twenty-five two-dimensional isotropic Gaussian distributions with variance 0.01, with means separated by 1 and arranged in a grid. The Swiss Roll dataset is a standard dataset for testing dimensionality reduction algorithms; we use the implementation from scikit-learn and rescale the coordinates as suggested by Tanaka (2019). We train a Wasserstein GAN model with the standard WGAN-GP objective. Both the generator and discriminator are fully connected neural networks with ReLU nonlinearities, and we follow the same architecture design as in DOT (Tanaka, 2019), while parameterizing the prior with a standard normal distribution instead of a uniform distribution. We optimize the model using the Adam optimizer, with α = 0.0001, β1 = 0.5, β2 = 0.9.
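A sketch of the 25-Gaussians data described above: a 5×5 grid of 2-D isotropic Gaussians with variance 0.01 (standard deviation 0.1) and means separated by 1. The sample count and random seed are assumptions; the Swiss Roll data would come from scikit-learn's make_swiss_roll with the coordinates rescaled as described.

```python
import numpy as np

def sample_25_gaussians(n_samples=5000, std=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    # 5x5 grid of component means with spacing 1
    centers = np.array([(x, y) for x in range(-2, 3) for y in range(-2, 3)], dtype=float)
    idx = rng.integers(len(centers), size=n_samples)          # pick a mixture component
    return centers[idx] + std * rng.standard_normal((n_samples, 2))
```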

Qualitative results With the trained generator and discriminator, we generate 5000 samples from the generator, then apply DDLS in latent space to obtain enhanced samples. We also apply the DOT method as a baseline. All results are depicted in Figure 1 and Figure 2, together with the target dataset samples.

³ Code available at https://github.com/sodabeta7/gan_as_ebm.

For the 25-Gaussians dataset we can see that DDLS recovered and preserved all modes while significantly eliminating spurious modes compared to the vanilla generator and DOT. For the Swiss Roll dataset we can also observe that DDLS successfully improved the distribution and recovered the underlying low-dimensional manifold of the data distribution. This qualitative evidence supports the hypothesis that our GAN-as-EBM formulation outperforms the noisy implicit distribution induced by the generator alone.

Table 1. DDLS suffers less from mode dropping when modeling the 2d synthetic distribution in Figure 1. The table shows the number of recovered modes, the fraction of "high quality" samples (within four standard deviations of a mode center), and the standard deviation of the high-quality samples.

Model            # recovered modes   % "high quality"   std "high quality"
Generator only   24.8 ± 0.2          70 ± 9             0.11 ± 0.01
DRS              24.8 ± 0.2          90 ± 2             0.10 ± 0.01
GAN w/ DDLS      24.8 ± 0.2          98 ± 2             0.10 ± 0.01

Table 2. DDLS has lower Earth Mover's Distance (EMD) to the true distribution for the 2d synthetic distribution in Figure 1, and matches the performance of DOT on the Swiss Roll distribution.

Model                            EMD 25-Gaussians   EMD Swiss Roll
Generator only (Tanaka, 2019)    0.052(08)          0.021(05)
DOT (Tanaka, 2019)               0.052(10)          0.020(06)
Generator only (our impl.)       0.043(04)          0.026(03)
GAN as EBM with DDLS             0.036(04)          0.020(05)

Figure 1. DDLS outperforms the generator alone, and the generator + DOT, on a synthetic dataset consisting of a 2d mixture-of-Gaussians distribution with 25 components. 10000 samples were generated for each condition.

Figure 2. DDLS outperforms the generator alone, and the generator + DOT, on the Swiss Roll dataset.

Quantitative results We first examine the performance of DDLS quantitatively, using the metrics proposed by DRS (Azadi et al., 2018). We generate 10,000 samples with the DDLS algorithm.


Table 3. Inception and FID scores on CIFAR-10, showing a substantial quantitative advantage from DDLS.

Model                                          Inception     FID
PixelCNN (van den Oord et al., 2016)           4.60          65.93
PixelIQN (Ostrovski et al., 2018)              5.29          49.46
EBM (Du & Mordatch, 2019)                      6.02          40.58
WGAN-GP (Gulrajani et al., 2017)               7.86 ± .07    36.4
MoLM (Ravuri et al., 2018)                     7.90 ± .10    18.9
SNGAN (Miyato et al., 2018)                    8.22 ± .05    21.7
ProgressiveGAN (Karras et al., 2018)           8.80 ± .05    -
NCSN (Song & Ermon, 2019)                      8.87 ± .12    25.32

DCGAN w/o DRS or MH-GAN                        2.8789        -
DCGAN w/ DRS (cal.) (Azadi et al., 2018)       3.073         -
DCGAN w/ MH-GAN (cal.) (Turner et al., 2019)   3.379         -
ResNet-SAGAN w/o DOT                           7.85 ± .11    21.53
ResNet-SAGAN w/ DOT                            8.50 ± .12    19.71

SNGAN w/o DDLS                                 8.22 ± .05    21.7
Ours: SNGAN w/ DDLS                            9.05 ± .11    15.76
Ours: SNGAN w/ DDLS (cal.)                     9.09 ± .10    15.42

Each sample is assigned to its closest mixture component. A sample is of "high quality" if it is within four standard deviations of its assigned mixture component, and a mode is successfully "recovered" if at least one high-quality sample is assigned to it.
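A small sketch of these metrics, under the assumption that `samples` are 2-D points, `centers` are the 25 mixture means, and the component standard deviation is 0.1 (so "within four standard deviations" means a distance of at most 0.4).

```python
import numpy as np

def mode_metrics(samples, centers, std=0.1):
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)
    assigned = dists.argmin(axis=1)                  # closest mixture component
    high_quality = dists[np.arange(len(samples)), assigned] < 4 * std
    recovered = np.unique(assigned[high_quality])    # modes with >= 1 high-quality sample
    return len(recovered), high_quality.mean()       # (# recovered modes, high-quality ratio)
```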

As shown in Table 1, our proposed model achieves a higher "high-quality" ratio. We also investigate the distance between the distribution induced by our GAN-as-EBM formulation and the true data distribution. We use the Earth Mover's Distance (EMD) between the two corresponding empirical distributions as a surrogate, as proposed in DOT (Tanaka, 2019). As shown in Table 2, the EMD between our sampling distribution and the ground-truth distribution is significantly below the baselines. Note that we use our own re-implementation, and numbers differ slightly from those previously published.

5.2. CIFAR-10

In this section we evaluate the performance of the proposed DDLS method on the CIFAR-10 dataset.

Implementation details We adopt the Spectral Normalization GAN (SN-GAN) (Miyato et al., 2018) as our baseline GAN model. We take the publicly available pre-trained models of the unconditional SN-GAN and apply DDLS. We first sample latent codes from the prior distribution, then run the Langevin dynamics procedure with an initial step size of 0.01 for up to 1000 iterations to generate enhanced samples. Following the practice in Welling & Teh (2011), we separately set the standard deviation of the Gaussian noise to 0.1. We optionally fine-tune the pre-trained discriminator with an additional fully-connected layer and a logistic output layer, trained with the binary cross-entropy loss, to calibrate the discriminator as suggested by Azadi et al. (2018) and Turner et al. (2019).
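A minimal sketch of this optional calibration step: a small fully-connected head with a logistic output is fine-tuned with binary cross-entropy on top of the frozen pre-trained discriminator. Where the head attaches (final scalar vs. penultimate features), its width, and the optimizer settings are assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibratedDiscriminator(nn.Module):
    def __init__(self, pretrained_d, feat_dim=1):
        super().__init__()
        self.d = pretrained_d                  # frozen pre-trained discriminator
        for p in self.d.parameters():
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, 1)     # additional fully-connected layer

    def logit(self, x):
        # Logistic output is folded into the BCE-with-logits loss below.
        return self.head(self.d(x).view(x.size(0), -1)).squeeze(-1)

def calibration_loss(model, x_real, x_fake):
    d_real, d_fake = model.logit(x_real), model.logit(x_fake)
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
```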

Figure 3. CIFAR-10 Langevin dynamics visualization, initialized at a sample from the generator (left column). The latent-space Markov chain appears to mix quickly, as evidenced by the diverse samples generated by a short chain. Additionally, the visual quality of the samples improves over the course of sampling, providing evidence that DDLS improves sample quality.

Quantitative results We evaluate the quality and diversity of generated samples via the Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). We applied DDLS to the unconditional generator of SN-GAN to generate 50,000 samples and report all results in Table 3. We found that the proposed method significantly improves the Inception Score of the baseline SN-GAN model from 8.22 to 9.09 and reduces the FID from 21.7 to 15.42. Our unconditional model outperforms previous state-of-the-art GANs and other sampling-enhanced GANs (Azadi et al., 2018; Turner et al., 2019; Tanaka, 2019), and even approaches the performance of the class-conditional BigGAN (Brock et al., 2019), which achieves an Inception Score of 9.22 and an FID of 14.73, without the need for additional class information, training, or parameters.

Qualitative results We illustrate the process of Langevin dynamics sampling in latent space in Figure 3 by showing generated samples every 10 iterations. We find that our method helps correct errors in the original generated image, and makes changes towards more semantically meaningful and sharper output by leveraging the pre-trained discriminator. We include more generated samples visualizing the Langevin dynamics in the appendix. To demonstrate that our model is not simply memorizing the CIFAR-10 dataset, we find the nearest neighbors of generated samples in the training dataset and show the results in Figure 4.


Figure 4. Top-5 nearest-neighbor images (right columns) of generated samples (left column).

Mixing time evaluation MCMC sampling methods often suffer from extremely long mixing times, especially for high-dimensional multi-modal data. For example, more than 600 MCMC iterations are needed to obtain most of the performance gain in MH-GAN (Turner et al., 2019) on real data. We demonstrate the sampling efficiency of our method by showing that we can expect a much shorter mixing time by migrating the Langevin sampling process to the latent space, compared to sampling in the high-dimensional multi-modal pixel space. We evaluate the Inception Score and the energy function every 10 iterations of Langevin dynamics and depict the results in Figure 5.

5.3. ImageNet Dataset

In this section we evaluate the performance of the proposed DDLS method on the ImageNet dataset.

Figure 5. Progression of the Inception Score with more Langevin dynamics sampling steps.

Table 4. Inception Score for ImageNet, showing the substantial quantitative advantage of DDLS.

Model                          Inception
SNGAN (Miyato et al., 2018)    36.8
cGAN w/o DOT                   36.23
cGAN w/ DOT                    37.29

SNGAN w/o DDLS                 36.8
Ours: SNGAN w/ DDLS            40.2

Implementation details As with CIFAR-10, we adopt the Spectral Normalization GAN (SN-GAN) (Miyato et al., 2018) as our baseline GAN model. We take the publicly available pre-trained models of SN-GAN and apply DDLS. Fine-tuning is performed on the discriminator, as described in Section 5.2. Implementation choices are otherwise the same as for CIFAR-10, with additional details in the appendix. We show the quantitative results in Table 4, where we substantially outperform the baseline.

6. Conclusion and Future Work

In this paper, we have shown that a GAN's discriminator can enable better modeling of the data distribution with Discriminator Driven Latent Sampling (DDLS). The intuition behind our model is that learning a generative model to do structured generative prediction is usually more difficult than learning a classifier, so the errors made by the generator can be significantly corrected by the discriminator. The major advantage of DDLS is that it allows MCMC sampling in the latent space, which enables efficient sampling and better mixing.

For future work, we are exploring the inclusion of additional Gaussian noise variables in each layer of the generator, treated as latent variables, such that DDLS can be used to provide a correcting signal for each layer of the generator.


We believe that this will lead to further sampling improvements, by correcting small artifacts in the generated samples. Also, the underlying idea behind DDLS is widely applicable to other generative models, provided an additional discriminator is trained together with the generator. It would be interesting to explore whether VAE-based models can be improved by training an additional discriminator.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.

Bengio, Y., Mesnil, G., Dauphin, Y. N., and Rifai, S. Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp. 552–560. JMLR.org, 2013. URL http://proceedings.mlr.press/v28/bengio13.html.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Casella, G., Robert, C. P., and Wells, M. T. Generalized accept-reject sampling schemes. Lecture Notes-Monograph Series, 45:342–347, 2004. ISSN 07492170. URL http://www.jstor.org/stable/4356322.

Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pp. 6510–6520, 2017.

Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M. Residual energy-based models for text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l4SgHKDH.

Du, Y. and Mordatch, I. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, pp. 3603–3613, 2019.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Grover, A., Gummadi, R., Lazaro-Gredilla, M., Schuurmans, D., and Ermon, S. Variational rejection sampling. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 823–832, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/grover18a.html.

Grover, A., Song, J., Kapoor, A., Tran, K., Agarwal, A., Horvitz, E. J., and Ermon, S. Bias correction of learned generative models using likelihood-free importance weighting. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 11058–11070. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9286-bias-correction-of-learned-generative-models-using-likelihood-free-importance-weighting.pdf.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6626–6637, 2017. URL http://papers.nips.cc/paper/7240-gans-trained-by-a-two-time-scale-update-rule-converge-to-a-local-nash-equilibrium.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.


Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019.

Kim, T. and Bengio, Y. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Kumar, R., Goyal, A., Courville, A., and Bengio, Y. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700–709, 2018.

MacKay, D. J. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. ArXiv, abs/1611.02163, 2016.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Nijkamp, E., Hill, M., Zhu, S.-C., and Wu, Y. N. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems, pp. 5233–5243, 2019.

Ostrovski, G., Dabney, W., and Munos, R. Autoregressive quantile networks for generative modeling. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3933–3942. PMLR, 2018. URL http://proceedings.mlr.press/v80/ostrovski18a.html.

Ravuri, S. V., Mohamed, S., Rosca, M., and Vinyals, O. Learning implicit generative models with the method of learned moments. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 4311–4320. PMLR, 2018. URL http://proceedings.mlr.press/v80/ravuri18a.html.

Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pp. 2018–2028, 2017.

Salakhutdinov, R. and Hinton, G. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455, 2009.

Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 2226–2234, 2016. URL http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.

Sohl-Dickstein, J., Battaglino, P. B., and DeWeese, M. R. New method for parameter estimation in probabilistic models: minimum probability flow. Physical Review Letters, 107(22):220601, 2011.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 11895–11907. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9361-generative-modeling-by-estimating-gradients-of-the-data-distribution.pdf.


Tanaka, A. Discriminator optimal transport. arXiv preprint arXiv:1910.06832, 2019.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071, 2008.

Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. Metropolis-Hastings generative adversarial networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6345–6353, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/turner19a.html.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., and Graves, A. Conditional image generation with PixelCNN decoders. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4790–4798, 2016. URL http://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Getoor, L. and Scheffer, T. (eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 681–688. Omnipress, 2011. URL https://icml.cc/2011/papers/398_icmlpaper.pdf.

Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., and Lillicrap, T. LOGAN: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019.

Yi, Z., Zhang, H., Tan, P., and Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

Zhao, J., Mathieu, M., and LeCun, Y. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.


A. An analysis of WGAN

A.1. An analysis of the DOT algorithm

In this section, we first give an example showing that in WGAN, given the optimal discriminator D and p_g, it is not possible to recover p_d.

Consider the following case: the underlying space is the one-dimensional space of real numbers R. p_g is the Dirac δ-distribution δ_{-1} and the data distribution p_d is the Dirac δ-distribution δ_a, where a > 0 is a constant.

We can easily see that the identity function f(x) = x is the optimal 1-Lipschitz function separating p_g and p_d. Namely, we let D(x) = x be the optimal discriminator.

However, D is not a function of a. Namely, we cannot recover p_d = δ_a from the information provided by D and p_g. This is the main reason that collaborative sampling algorithms based on the WGAN formulation, such as DOT, cannot provide an exact theoretical guarantee, even if the discriminator is optimal.

A.2. Mathematical Details of approximate WGAN with EBMs

In the paper, we outlined an approximation result for WGAN. In the following, we prove it. For eq. 11:

\nabla_\phi \mathrm{KL}(p_d \| p_t) = \nabla_\phi \mathbb{E}_{p_d}[-\log p_t(x)]
= \nabla_\phi \mathbb{E}_{p_d}[-\log p_g(x) - D(x) + \log Z]
= -\mathbb{E}_{p_d}[\nabla_\phi D(x)] + \mathbb{E}_{p_d}[\nabla_\phi \log Z]
= -\mathbb{E}_{p_d}[\nabla_\phi D(x)] + \nabla_\phi Z / Z
= -\mathbb{E}_{p_d}[\nabla_\phi D(x)] + \sum_x [p_g(x) e^{D(x)} \nabla_\phi D(x)] / Z
= -\mathbb{E}_{p_d}[\nabla_\phi D(x)] + \sum_x [p_t(x) \nabla_\phi D(x)]
= \mathbb{E}_{p_t}[\nabla_\phi D(x)] - \mathbb{E}_{p_d}[\nabla_\phi D(x)]    (13)

For eq. 12:

\nabla_\theta \mathrm{KL}(p_g \| p'_t) = \nabla_\theta \mathbb{E}_{p_g}[\log p_g(x) - \log p'_t(x)]
= \mathbb{E}_{p_g}[\nabla_\theta \log p_g(x)] + \sum_x [\log p_g(x) - \log p'_t(x)] \nabla_\theta p_g(x)
= 0 + \sum_x [-D(x)] \nabla_\theta p_g(x)
= -\sum_x D(x) \nabla_\theta p_g(x)
= -\nabla_\theta \mathbb{E}_{p_g}[D(x)] = -\mathbb{E}_{z \sim p_0(z)}[\nabla_\theta D(G(z))]    (14)

Here \mathbb{E}_{p_g}[\nabla_\theta \log p_g(x)] = \sum_x \nabla_\theta p_g(x) = \nabla_\theta \sum_x p_g(x) = 0, and the \log Z term in \log p'_t contributes \sum_x (\log Z) \nabla_\theta p_g(x) = 0 for the same reason.


Figure 6. CIFAR-10 Langevin dynamics visualization


B. Experimental details

Source code for all experiments in this work is included in the supplemental material and is available at https://github.com/sodabeta7/gan_as_ebm, where all detailed hyper-parameters can be found.

B.1. CIFAR-10

We show more samples generated by DDLS during Langevin dynamics in Figure 6. We run 1000 steps of Langevin dynamics and plot a generated sample every 10 iterations. We include 10,000 more randomly generated samples in the supplemental material.

B.2. ImageNet

We give more details of the preliminary experimental results on the ImageNet dataset here. We run the Langevin dynamics sampling algorithm with an initial step size of 0.01 for up to 1000 iterations. We decay the step size by a factor of 0.1 every 200 iterations. The standard deviation of the Gaussian noise is annealed simultaneously with the step size. The discriminator is not calibrated in this preliminary experiment.
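A small sketch of this schedule: the step size starts at 0.01 and is multiplied by 0.1 every 200 iterations, and the noise standard deviation is annealed in lockstep. The base noise scale of 0.1 is carried over from the CIFAR-10 setting and is therefore an assumption here.

```python
def langevin_schedule(step, init_eps=0.01, init_noise=0.1, decay=0.1, every=200):
    """Return (step size, noise std) for a given Langevin iteration."""
    factor = decay ** (step // every)
    return init_eps * factor, init_noise * factor
```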