Posterior-Aided Regularization for Likelihood-Free Inference

Posterior-Aided Regularization for Likelihood-Free Inference

Dongjun Kim 1 Kyungwoo Song 1 Seungjae Shin 1 Wanmo Kang 1 Il-Chul Moon 1

AbstractThe recent development of likelihood-free infer-ence aims training a flexible density estimator forthe target posterior with a set of input-output pairsfrom simulation. Given the diversity of simula-tion structures, it is difficult to find a single unifiedinference method for each simulation model. Thispaper proposes a universally applicable regular-ization technique, called Posterior-Aided Regular-ization (PAR), which is applicable to learning thedensity estimator, regardless of the model struc-ture. Particularly, PAR solves the mode collapseproblem that arises as the output dimension ofthe simulation increases. PAR resolves this pos-terior mode degeneracy through a mixture of 1)the reverse KL divergence with the mode seek-ing property; and 2) the mutual information forthe high quality representation on likelihood. Be-cause of the estimation intractability of PAR, weprovide a unified estimation method of PAR toestimate both reverse KL term and mutual infor-mation term with a single neural network. Af-terwards, we theoretically prove the asymptoticconvergence of the regularized optimal solutionto the unregularized optimal solution as the reg-ularization magnitude converges to zero. Addi-tionally, we empirically show that past sequentialneural likelihood inferences in conjunction withPAR present the statistically significant gains ondiverse simulation tasks.

1. IntroductionThe development of computing power has motivated build-ing simulation models, which are complex in order to be-come high-fidelity. A complex simulation model is con-structed as a generative process, so the likelihood functionis modeled in an implicit way. This implicit nature of thelikelihood yields the inference task challenging, and we call

1KAIST, Republic of Korea. Correspondence to: Il-Chul Moon<[email protected]>.

Preliminary work. Under review by the International Conferenceon Machine Learning (ICML).

(a) True (b) SNL (c) SNL + PAR

Figure 1. Comparison of (a) true posterior (see Appendix A) toapproximate posteriors with (b) SNL and (c) the regularized SNL.SNL with PAR performs better than the vanilla SNL within limitedsimulation budget in 1) capturing multiple modes and 2) acquiringroundy shape of each mode.

the inference on the implicit likelihood as likelihood-freeinference. Recent algorithms on likelihood-free inferenceintroduced the amortized learning of a posterior distributionwith a set of input-output pairs from the simulation model.

The recent algorithms iteratively improve the approximationon the posterior to the exact posterior by rounds. Thisapproximation hinges upon the budget of executing thesimulation models, and the approximated posterior would beimmature if the budget is limited. In particular, if the modelattains a multi-modal posterior distribution1, Figure 1 showsthe failure of posterior inference with a previous algorithm,Sequential Neural Likelihood (SNL) (Papamakarios et al.,2019). Also, we observe that the inference gets worse bycapturing fewer modes in the posterior as the dimensionalityof the model outcome increases.

While this mode collapse is largely resolved as the numberof training data increases, it is difficult to gather the suffi-cient size of a dataset for training on simulations becauseof long execution time. This specific phenomenon in thelikelihood-free inference motivates a widely applicable re-search question in the machine learning community: recov-ering a complex distribution with limited training instances.If we focus on the data limitation, we may choose methodsto avoid overfitting. However, such avoidance will also limit

1When the generative model architecture is fixed throughoutthe inference, the posterior distribution needs to be complex to re-flect the real world, compared to either simple uniform or Normaldistributions as a source of stochasticity. Such model architec-ture includes the internal coefficients, the agent mechanisms, theequation dynamics, etc.

arX

iv:2

102.

0777

0v1

[cs

.LG

] 1

5 Fe

b 20

21

Submission and Formatting Instructions for ICML 2021

the complexity of the model distribution; so, for instance,the multi-modality would not be captured fully. Moreover,the model distribution complexity should increase as thedistribution lies on a high dimensional space2, and as thedistribution has fine details on its domain space.

Given this trade-off between the overfitting avoidance andthe modeled posterior complexity, this paper proposes aregularization method that avoids overfitting by minimallyaffecting the posterior distribution complexity. Specifically,the regularization is designed to capture the multi-modal andthe high dimensional aspects of the posterior distribution.We name this regularization as Posterior-Aided Regulariza-tion (PAR), which significantly outperforms the baselinealgorithms in every experimental case.

PAR introduces the reverse KL regularization that attainsthe mode seeking property to enhance the dataset efficiency,when the training dataset is constructed in a cumulativeway for every round by executing additional simulations.Also, PAR proposes the mutual information regularizationbetween the simulation input and the output in order tocapture rich representations that ultimately find the complexposterior with a limited budget. In addition, we estimate themixed regularization in a unified way with a single networkstructure, and we analyze PAR theoretically by providingthe unique closed-form solution of the regularized loss.

2. Previous Research2.1. Problem Definition

The evaluation on conditional likelihood of p(x|θ) in simu-lation is not allowed because simulation is fundamentally adata-generation process from its input θ to an output x. Thepurpose of likelihood-free inference is the estimation of theposterior distribution of p(θ|x = xo) where xo is a singleshot of a real-world observation3.

2.2. Sequential Likelihood-Free Inference

Recent approaches of likelihood-free inference estimate theposterior with the multiple rounds of inference steps. Thisiterative inference gradually fastens the approximate poste-rior to the exact posterior by optimizing the neural networkparameters used for inference. This sequential neural ap-proach is becoming the mainstream in the community oflikelihood-free inference because the iterative amortized op-timization saves simulation budget by orders of magnitude.

At the initial round, the sequential likelihood-free inferencegathers simulation inputs from the prior distribution, p(θ).

2This may happen if a raw simulation outcome is not filteredby a set of summary statistics.

3Assuming that the real-world with the same context happensonly once.

0 5 10 15 20 25 30 35

Rounds

150

200

250

300

350

L SNL(ψ

)

Generalization GapGeneralization Gap

training LSNL with λ = 0

validation LSNL with λ = 0

training LSNL with λ > 0

validation LSNL with λ > 0

(a) Overfitting

0 5 10 15 20 25 30

Round

−7

−6

−5

−4

−3

−2

−1

0

log

MM

D

SNL with 1-modeSNL with 4-modesSNL with 16-modesSNL with 64-modesSNL with 256-modesSNL + PAR with 256-modes

(b) Mode Collapse

0 5 10 15 20 25 30

Round

−8

−7

−6

−5

−4

−3

−2

−1

log

MM

D

SNL with 10-dimSNL with 30-dimSNL with 50-dimSNL with 70-dimSNL with 100-dimSNL + PAR with 100-dim

(c) Non-Scalability

Figure 2. SNL with no regularization (λ = 0) on SLCP-16 simula-tion (see Appendix A) suffers from (a) large generalization gap (b)inaccurate posterior estimation on multi-modal posterior (c) slowlearning on high dimensional output.

A collection of simulation input-output pairs constructsa dataset for the initial round, D1 = (θ1,j , x1,j)Nj=1,where each x1,j is the simulation output corresponds toθ1,j , i.e. x1,j ∼ p(x|θ1,j). If we were to infer the jointdistribution at the first round, the inference objective be-comes the maximization of the likelihood, or equivalentlythe minimization of its corresponding KL loss, q1(θ, x) =arg minq L(q;D1). For the next rounds, the new batchof simulation inputs are drawn from a proposal distribu-tion, i.e. θr,j ∼ pr(θ) ∝ p(θ)qr−1(xo|θ) for j = 1, ..., N ,and the next batch is cumulated to the existing dataset asDr ← Dr−1 ∪ (θr,j , xr,j)Nj=1, and the inference repeatsthe minimization of the loss function, again. This iterativeinference on the joint distribution eventually estimates thedesired posterior distribution.

Sequential Neural Likelihood proposes a neural likelihoodto model the joint distribution, qψ(θ, x) = p(θ)qψ(x|θ),parametrized by ψ. Here, the cumulative dataset is collectedfrom different input distributions at each round, so we de-note p(θ) to be the expected proposal distributions, p(θ) =1r

∑rs=1 ps(θ), over rounds. SNL loss is LSNL(ψ) =

−Ep(θ)p(x|θ)[

log qψ(x|θ)]

= DKL

(p(θ, x), qψ(θ, x)

)+

const, where p(θ, x) is the true distribution decomposed byp(θ)p(x|θ), and qψ(θ, x) is the model distribution decom-posed by p(θ)qψ(x|θ). Therefore, the optimal neural likeli-hood, qψ∗(x|θ), matches to the exact likelihood, p(x|θ), inthe support of p(θ) when the KL divergence vanishes.

2.3. Multimodal and High Dimensional Outputs

In spite of the flexibility on the posterior inference by SNL,it suffers from the overfitting as in Figure 2 (a) with a largegeneralization gap. In consequence, Figure 2 presents thatSNL hardly infers the posterior when (b) the posterior distri-bution attains multiple modes, and (c) the simulation outputis high dimensional. Assuming that the posterior inferenceon SNL is sub-optimal, the additional data from this inaccu-rate posterior would lower the data quality, and the poorlyconstructed data would reinforce the sub-optimality of SNL.This natural feedback loop of inference inaccuracy and inef-ficient training data collection limits the further utilizationof SNL in the likelihood-free inference community.


3. Design of Posterior-Aided RegularizationAs the previous section asserts the scalability with respectto the output dimension and the modes, the current ver-sion of SNL needs to be constrained by a regularizationto prevent the negative feedback of fitting the inaccurateposterior distribution. We introduce a constrained problemas an alternative of SNL as the below.

minimize LSNL(ψ)

subject to F (ψ) ≤ C. (1)

The constraint restricts the class of distributions to be amongF = qψ : F (ψ) ≤ C, so our objective becomes the de-sign of F . The constrained problem of Eq. 1 is equivalent tothe minimization of an unconstrained Lagrangian function:

Lλ(ψ) = LSNL(ψ) + λF (ψ),

where λ > 0 is the regularization magnitude.

To determine a specific form of the constraint, we analyzethe slow learning of SNL in terms of its loss design. SNLoptimizes the forward KL divergence between the true jointdistribution, p(θ, x) = p(θ)p(x|θ), and the model jointdistribution, qψ(θ, x) = p(θ)qψ(x|θ). The forward KLdivergence is well-known for its mode-covering property(Ke et al., 2019; Zhang et al., 2019), so SNL prefers awidespread distribution over a localized distribution to lowerthe forward KL divergence.

SNL forgets searching divisive modes at the intermediaterounds because the mode-covering property enforces gather-ing data instances from a widespread posterior, rather thana clustered posterior distribution. Eventually, the collecteddata becomes spread across the whole domain of θ, andthose θ samples, which are out of modes, do not contributeto the inference performance significantly. In addition, suchinstances hinder the learning due to the negative reinforcemechanism in Section 2.3.

As we observed the source of inefficiency within SNL, wediscovered that SNL would be better off by adding a con-straint for the exploitation rather than the exploration be-cause of the mode covering by the forward KL. Therefore,we introduce a regularization as below to provide the forceof exploitation, as well as the force of enhancing representa-tion power to reduce the required simulation budget.

F (ψ) = F (1)(ψ) + F (2)(ψ),

with F (1)(ψ) = DKL

(qψ(θ, x), p(θ, x)

), and F (2)(ψ) =

−I(Θ, X), where we denote random variables using upper-case letters (e.g. Θ, X), and their realizations by the corre-sponding lower-case letters (e.g. θ, x).

trueKLrKLKL+rKL

(a) Comparison of KL/rKL/sKL

KLrKL

(b) KL

KLrKL

(c) rKL

KLrKL

(d) sKL

Figure 3. (a) Comparison of distributions estimated by KL, reverseKL, and symmetric KL (KL + rKL) to the true distribution. (b,c,d)The estimated KL and reverse KL values when the loss function tooptimize is (b) KL, (c) reverse KL, and (d) symmetric KL.

3.1. F (1): Regularization for Mode Seeking

The first term of F (1) is the reverse KL divergence betweenp(θ, x) and qψ(θ, x). As a counterpart of the forward KL,the reverse KL has the mode-seeking property (Nguyen et al.,2017; Poole et al., 2016), so the reverse KL prefers mode-concentrated distributions. This property stems from penal-izing distributions with non-zero values wherever qψ > p

because the reverse KL integrand, logqψp , becomes large on

a region of qψ > p. In other words, the reverse KL stronglypenalizes a mode-covering distribution with non-zero valueson the intermediate region between modes, whereas suchdistribution is preferred by SNL.

Moreover, the forward KL divergence is neither weakernor stronger than the reverse KL divergence, which meansthat there is no inequality relation between the two of them.Because of this incomparability, the forward and the reverseKL divergences induce distinctive topologies on the spaceof probability distributions. Therefore, the SNL constrainedwith the reverse KL is not reducible to the unregularizedSNL, and the suggested constrained problem generates adifferent optimization path.

The weighted divergence, LSNL(ψ) + λF (1)(ψ), mixestwo extremes in a unified optimization loss by taking bothcontrastive properties inherited by forward and reverse KLs.The minimization of the weighted divergence searches dis-tributions with accurate modes by the mode-seeking prop-erty while retaining the mode variability by the mode-covering property. The weight of λ controls the exploration-exploitation effect by influencing the sample diversity andthe sample quality for the next round. Given this posteriorupdate, the cumulative dataset includes the samples of θ,which are clustered around modes (Zhang et al., 2019) ifλ becomes high. On the contrary, the cumulative datasetwould spread out if λ becomes low.

Figure 3 empirically demonstrates the arguments discussedabove: (a) the weighted divergence captures two-modals,whereas neither KL nor reverse KL infers the groundtruthdistribution; (b,c,d) the weighted divergence minimizes both


KL/reverse KL divergences, whereas neither of two ex-tremes optimizes the corresponding opposite extreme di-vergence.

3.2. F (2): Regularization for Representation Quality

Another regularization component, F (2), utilizes the mu-tual information to mitigate the slow learning of SNL. Al-though the mutual information solves the mode collapsein many applications (Lee et al., 2020; Peng et al., 2020),we emphasize the mutual information for its scalability ofrepresentation learning on high dimensions (Hjelm et al.,2018; Bachman et al., 2019; Zhao et al., 2017). The mutualinformation measures the amount of information obtainedfrom X by observing Θ, which is defined by I(Θ, X) =H(X)−H(X|Θ), where H(X|Θ) is the conditional entropy,H(X|Θ) = −

∫p(θ)

∫qψ(x|θ) log qψ(x|θ)dxdθ.

SNL regularized with the mutual information provides ahighly predictable model to discriminate θ given x. First,we discuss the localization effect of H(X|Θ = θ). Tomaximize I, H(X|Θ = θ) should be minimized, whichmeans the lower conditional entropy is preferred. Therefore,qψ(x|θ) will be localized in the x-space. Second, we focuson the discriminative effect of H(X). The maximizationof I increases H(X), so the higher entropy is preferred. Toincrease the entropy, qψ(x) =

∫p(θ)qψ(x|θ)dθ should be

flattened, so this entropy will enforce qψ(x|θ) 6= qψ(x|θ′).

0 5 10 15 20 25 30

Round

−1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

3.0

r ψ

SNLSNL + PARqψ(x|θ) = p(x|θ)

Figure 4. Comparison of ratios rψwith/without mutual informationregularization on SLCP-16 simula-tion (see Appendix A).

Figure 4 illustrates theeffect of mutual infor-mation on inference. Ifwe define the mutual in-formation ratio as rψ =DKL(qψ(θ,x),p(θ)qψ(x))DKL(p(θ,x),p(θ)p(x)) ,

then rψ = 1 if and onlyif qψ(x|θ) = p(x|θ).The ratio of the unregu-larized SNL saturates atthe sub-optimal in Fig-ure 4, so the additionalterm of maximizing themutual information enforces to reach the learning curveto rψ = 1. Figure 4 empirically demonstrates that SNLregularized with the mutual information estimates the exactlikelihood by the neural network after rounds of inference.

4. Estimation of Posterior-AidedRegularization

4.1. Unified Estimation of Reverse KL and MutualInformation

While the constraint of F provides a useful implication byavoiding the inefficient feedback loop between intermediate

inference inaccuracy and poorly cumulated dataset, the pro-posed constraint cannot be tractably calculated. Therefore,additional optimization is indispensable for the estimationof F (ψ). Although there are estimation methods for thereverse KL and the mutual information individually, a si-multaneous estimation of both terms has not been introducedyet. This paper proposes a method to estimate both F (1)

and F (2) with a single neural network.

In contrast to the previous estimation methods, we use nei-ther variational inference (Poole et al., 2019) nor minimaxframework (Belghazi et al., 2018; Nowozin et al., 2016).Instead, we estimate the constraint by approximating thejoint distribution with posterior modeling, so we name thisestimator as Posterior-Aided Regularization (PAR). PARupdates the approximation of the joint distribution at eachround by training the posterior model in the alternative orderwith SNL, as suggested in Algorithm 1.

4.2. Proxy of Regularization

If we define a surrogate constraint to be

F (ψ) = −Ep(θ)qψ(x|θ)[

log p(θ|x)], (2)

where p(θ|x) = p(θ)p(θ|x)p(θ)

p(x)p(x) with p(x) =

∫p(θ, x)dθ,

then Theorem 3 inspects the relation of F and F by provid-ing the decomposition of F . See Appendix B for proof.

Theorem 1. Suppose F (ψ) is defined to be−Ep(θ)qψ(x|θ)

[log p(θ|x)

], then F (ψ) is decomposed

into F (ψ) = F (1)(ψ) + F (2)(ψ) + const. with

F (1)(ψ) = DKL

(qψ(x)qψ(θ|x), qψ(x)p(θ|x)

)F (2)(ψ) = −I(Θ, X).

The second decomposition of F (ψ) is the mutual in-formation, so the surrogate constraint yields the exactestimation of the mutual information. On the otherhand, the first decomposition of F (1)(ψ) is a perturba-tion of the exact reverse KL divergence, F (1)(ψ) =DKL(qψ(x)qψ(θ|x), p(x)p(θ|x)), and we name F (1)(ψ) asa modified reverse KL divergence. This modified reverseKL divergence becomes the exact reverse KL divergence ifthe neural likelihood is the exact likelihood:

F (1)(ψ∗) = F (1)(ψ∗) with ψ∗ = arg minψ

LSNL(ψ).

Therefore, the surrogate constraint, F (ψ), is a proxy of theoriginal constraint, F (ψ).

4.3. Estimation Method

Posterior-Aided Regularization (PAR) estimates the proxyof the regularization in Eq. 2 from the estimation of the


Algorithm 1 Posterior-Aided Likelihood

for r = 1 to R doif r = 1 then

Draw simulation inputs θ1,jNj=1 from prior p(θ)end ifif r > 1 then

Draw simulation inputs θr,jNj=1 from proposalpr(θ) ∝ qr−1(xo|θ)p(θ)

end iffor j = 1 to N do

Simulate xr,j ∼ p(x|θr,j)end forAdd (θr,j , xr,j) into the training data Dr ← Dr−1 ∪(θr,j , xr,j)Nj=1 (assuming D0 = ∅)repeat

Update φ← φ−∇φLAPT (φ;Dr)Update ψ ← ψ −∇ψLλφ(ψ;Dr)

until convergedend for

posterior distribution, p(θ|x), through Automatic PosteriorTransformation (APT) (Greenberg et al., 2019). APT esti-mates the posterior distribution with a loss

LAPT (φ) = −Ep(θ)p(x|θ)[qφ(θ|x)

],

where qφ(θ|x) = p(θ)qφ(θ|x)p(θ)

1Z(x;φ) with Z(x;φ) to be a

normalization function. Here, APT models the posteriordistribution, qφ(θ|x), with a normalizing flow, parametrizedby φ. APT loss is equivalent to the KL divergence,LAPT (p(θ)p(x|θ), p(x)qφ(θ|x)), and the optimal solutionof this KL divergence satisfies qφ∗(θ|x) = p(θ|x).

We estimate Eq. 2 by

Fφ(ψ) = −Ep(θ)qψ(x|θ)[

log qφ(θ|x)]. (3)

The estimation on PAR becomes exact if the posterior esti-mation of APT becomes exact:

Fφ∗(ψ) = F (ψ) with φ∗ = arg minφ

LAPT (φ).

Analogous to Theorem 3, Eq. 3 is decomposed by

Fφ(ψ) = F(1)φ (ψ) + F (2)(ψ) + const.,

with F (1)φ (ψ) = DKL

(qψ(x)qψ(θ|x), qψ(x)qφ(θ|x)

).

The time complexity of our model is bounded by the lin-ear summation of the complexity on SNL and APT, seeAppendix C for the complexity analysis in big-O notations.

4.4. Theoretic Analysis

4.4.1. MODIFIED REVERSE KL DIVERGENCE

As the reverse KL divergence is perturbed as a modifiedreverse KL divergence, this section investigates the effect

of the perturbation on the optimality. We conclude that theperturbation does not arise the ill-posedness by showing theuniqueness of arg minqψ LSNL(ψ) + λF (1)(ψ).

The optimal solution of the modified reverse KL divergence,F (1)(ψ) = DKL(qψ(x)qψ(θ|x), qψ(x)p(θ|x)), satisfiesqψ∗(θ|x) = p(θ|x). Having that p(θ|x) = p(θ)p(θ|x)

p(θ)p(x)p(x) ,

q is the solution of arg minqψ F(1)(ψ) if and only if

p(x|θ)p(x)

=q(x|θ)q(x)

. (4)

While the solution of q could be more than one instance,this potential ill-posedness does not prevent the uniquenesswhen we include LSNL(ψ), as well. The exact likelihood,p(θ|x), satisfies Eq. 4, so the exact likelihood belongs to thesolution of arg minqψ F

(1)(ψ). Thus, the exact likelihoodminimizes both SNL loss and modified reverse KL loss.Particularly, the exact likelihood uniquely minimizes theSNL loss, and this uniqueness on SNL loss guarantees theuniqueness of the minimization problem for the weightedloss, arg minqψ LSNL(ψ) + λF (1)(ψ). Therefore, we con-clude that the perturbation does not arise ill-posedness if themodified reverse KL is combined with the SNL loss.

4.4.2. ASYMPTOTIC CONVERGENCE

In addition to the optimality analysis of the perturbation withrespect to the first term, maximizing the mutual information,F (2)(ψ), inhibits the neural likelihood from estimating theexact likelihood when λ > 0. To investigate the effectfrom the mutual information on the optimality, Theorem 4analyzes the closed-form solution of Lλφ(ψ) = LSNL(ψ) +

λF (ψ). See Appendix B for the proof.

Theorem 2. Suppose p(θ, x) and p(θ) are bounded on apositive interval, p(θ) is bounded below, and qψ(θ, x) is uni-formly upper bounded on ψ. Then the optimization problemarg minqψ Lλφ(ψ) attains the unique solution of

qλψ∗(x|θ) =p(x|θ)

c(θ)− λ log qφ(θ|x),

where c(θ) is a constant with respect to x that makesqλψ∗(x|θ) a distribution in x-space.

Theorem 4 guarantees that the optimal neural likelihood ofthe regularized loss, Lλφ(ψ), converges to the exact like-lihood as the regularizing magnitude converges to zero:qλψ∗(x|θ) → p(x|θ) as λ → 0. As an additional note, thisconvergence of qλψ∗(x|θ) is irrelevant to the convergence ofthe neural posterior, qφ(θ|x).

Figure 5 shows the trajectories of the regularized solutionswith various λ. Figure 5 visualizes two extremes: 1) λ = 0


(blue) represents the vanilla SNL without the regulariza-tion, which shows the slow convergence to the global op-timum; and 2) λ = ∞ (green) counts only the regular-ization loss without the SNL loss, which deviates largelyto the global optimum. The trajectory of non-vanishingλ = 1 (orange) shows the empirical evidence of The-orem 4 by slightly deviating from the global optimum.

Constant λ = 0

Constant λ = 1

Constant λ =∞Annealing λ

Figure 5. Comparison of trainingtrajectories in a toy example (seeAppendix A) with various λ.

On the other hand, thetrajectory with vanish-ing magnitude (purple)shows the convergenceto the global optimumas λ → 0. Moreover,the convergence withvanishing λ is fasterthan the learning curveof SNL.

5. ExperimentsWe compare our regularization with the baselines as SMC-ABC (Sisson et al., 2007), SNPE-A (Papamakarios & Mur-ray, 2016), SNPE-B (Lueckmann et al., 2018), APT, AALR(Hermans et al., 2020), and SNL on various simulations.First, we test the regularized SNL on tractable simulationswith multi-modal posterior distributions. SLCP-16 andSLCP-256 (Kim et al., 2020b) attain tractable likelihoodsthat allow analyzing the effect of PAR with respect to theMaximum Mean Discrepancy (MMD) (Sriperumbudur et al.,2010) and the Inception Score (IS) (Salimans et al., 2016).MMD measures the inference quality of the posterior, andIS measures the sample diversity for each inference round.

Next, we experiment on realistic simulations without thetractable likelihood formula. The simulations are as follows:1) the M/G/1 model (Papamakarios et al., 2019) from thequeuing theory is a discrete-event simulation that the simu-lation states update according to discrete events randomlydrawn from an exponential distribution with unknown pa-rameters; 2) the Ricker model (Gutmann & Corander, 2016)from the field of ecology is a system dynamics simulationthat predicts the latent animal population with observedpopulations randomly selected from a Poisson distributionwith unknown parameters; and 3) the Poisson model (Kimet al., 2020a) in the field of physics is a second-order ellipticpartial differential equation that calculates the heat transferswith unknown local diffusivities. See Appendix A for thedetails on simulations.

5.1. Tractable Simulations

5.1.1. SLCP-16

The Simple-Likelihood-and-Complex-Posterior-16 (SLCP-16) (Kim et al., 2020b) is a simulation with 16 modes in

its posterior distribution. Figure 6 presents the marginalposterior distribution inferred by various algorithms, includ-ing the proposed method. Compared to the groundtruthposterior in Figure 6 (a), APT in Figure 6 (b) infers an inac-curate posterior with many wrong modes. AALR in Figure6 (c) infers a posterior with unclear separations betweenmodes. The target distribution of AALR is the likelihood-to-evidence ratio, p(x|θ)p(x) , which shifts at every round. Wehypothesize the unclear separation of AALR due to the over-fitting that frequently arises if the target distribution shifts(Shao et al., 2019). The proposed SNL with PAR in Figure6 (e) captures every mode with clear separations betweenmodes, in contrast to the heavy-tailed inaccurate posteriorof SNL in Figure 6 (f) that misses many modes.

Figure 10 presents the experimental result of SLCP-16. Fig-ure 10 (a) demonstrates that SNL with PAR outperforms thebaseline algorithms with significant margins. This perfor-mance gain is particularly remarkable because the vanillaSNL performs the worst out of the baselines, which indi-cates that PAR significantly improved SNL. Figure 10 (b)compares the unregularized SNL with the regularized SNLwhen the output dimension increases. SNL shows the slowperformance on every cases of dimension settings while theregularized SNL captures the most of modes on every cases,including on the case of 150-dimensional output.

Table 1 presents the various performances of SLCP-16. Theregularized SNL beats the baselines on SLCP-16 for everymetric. The best performance of the proposed method in-duces three implications. First, the best score on the negativelog probability at the groundtruth inputs4, − log p(θ∗|xo),indicates that the inferred posterior is concentrated to themodes, as in Figure 6 (e). Second, the best MMD perfor-mance guarantees the best quality of the posterior inference.Third, the best IS indicates the most divisive modalities ofthe inferred posterior.

5.1.2. SLCP-256

For the study on extremely many modalities, the Simple-Likelihood-and-Complex-Posterior-256 (SLCP-256) (Kimet al., 2020b) is a simulation with 256 modes in its posteriordistribution. Figure 11 (a) presents that AALR shows thecomparable performance to SNL with PAR after rounds.However, the convergence is slower than the SNL with PARbecause SNL with PAR reaches 200 inception score at the27-th round whereas AALR comes to the similar score atthe 40-th round. The vanilla SNL captures various modesin SLCP-256 within an acceptable rounds, but SNL covers

4Though we test the simulations with a single true input(θ∗1 ), there are alternative simulation inputs (θ∗2 ) that create thesame likelihood function, p(x|θ∗1) = p(x|θ∗2). We denote suchset of inputs, θ∗1:K , as the groundtruth inputs, and we calculate− log p(θ∗|xo) := −

∑Kk=1 log p(θ

∗k|xo).


(a) Groundtruth (b) APT (c) AALR (d) SNL (e) SNL + PAR

Figure 6. Comparison of marginal posterior distributions on SLCP-16 with 100-dimensional output between (a) groundtruth and (b) APT,(c) AALR, (d) SNL, and (e) SNL regularized by PAR after 40 rounds of inference with 100 simulation budget for each round. Black dotsare the samples from approximate posteriors, the red dots are groundtruth simulation inputs, and we plot marginal distribution of eachdimension through Kernel Density Estimation.

Table 1. Comparison of performance on (Left) SLCP-16 with 100-dimensional output and (Right) SLCP-256 with 80-dimensional output.On SLCP-16, we run 30 rounds of inference with 100 simulation budget for each round. On SLCP-256, we run 50 rounds of inferencewith 100 simulation budget for each round. The negative log probability at the groundtruth inputs is scaled by 10−2.

ALGORITHMSLCP-16 SLCP-256

− log p(θ∗|xo) (↓) log MMD (↓) IS (↑) − log p(θ∗|xo) (↓) log MMD (↓) IS (↑)SMC-ABC 8.21±2.36 -1.80±1.24 13.56±0.55 75.44±29.71 -4.48±0.55 81.69±7.73

SNPE-A 1.46±1.72 -4.47±0.05 2.66±0.23 39.08±1.72 -4.55±0.21 34.74±3.68SNPE-B 27.66±23.83 -2.12±0.31 1.50±0.40 140.4±22.78 -1.83±0.13 19.43±0.93

APT (SNPE-C) 3.14±8.80 -3.27±0.71 7.39±4.48 69.04±93.53 -3.51±1.24 118.88±51.30AALR 0.89±0.12 -3.57±0.43 11.22±3.37 16.19±1.04 -6.81±0.68 208.91±4.33SNL 4.77±2.68 -2.53±0.54 5.34±3.43 40.40±11.84 -5.32±0.65 153.43±18.40

SNL+PAR 0.55±0.79 -5.39±0.94 14.95±1.08 15.13±2.65 -6.51±0.86 211.85±4.07

0 5 10 15 20 25 30

Round

0

2

4

6

8

10

12

14

16

Ince

ptio

nS

core

APTAALRSNLSNL + PAR

(a) SLCP-16 by Algorithm

0 5 10 15 20 25 30

Round

0

2

4

6

8

10

12

14

16

Ince

ptio

nS

core

SNL with 50-dimSNL with 100-dimSNL with 150-dimSNL + PAR with 50-dimSNL + PAR with 100-dimSNL + PAR with 150-dim

(b) SLCP-16 by Dimension

Figure 7. Comparison of Inception Score by round on SLCP-16with 100 simulation budgets for each round on (a) baselines of50-dimensional output and (b) varying dimensions.

less number of modes, eventually. APT struggles in this testbecause the estimation on Z(x;φ) via an atomic proposal inAPT becomes inaccurate on simulations with many modes.Analogous to SLCP-16, Figure 11 (b) illustrates the out-performance of the regularized SNL over the vanilla SNL.Figure 11 (b) indicates that the inference of SNL succeedson low dimensional outputs, whereas the inference of SNLon high dimensional outputs fails given simulation budgets.

Table 1 shows the performances on SLCP-256. The regular-ized SNL performs better than the vanilla SNL and APT, butAALR beats the proposed method in the MMD score. How-ever, AALR does not show any consistent performance ifwe compare SLCP-16 and SLCP-256 as well as the realistic

0 10 20 30 40 50

Round

25

50

75

100

125

150

175

200

225

Ince

ptio

nS

core

APTAALRSNLSNL + PAR

(a) SLCP-256 by Algorithm

0 10 20 30 40 50

Round

0

50

100

150

200

Ince

ptio

nS

core

SNL with 40-dimSNL with 80-dimSNL with 120-dimSNL + PAR with 40-dimSNL + PAR with 80-dimSNL + PAR with 120-dim

(b) SLCP-256 by Dimension

Figure 8. Comparison of Inception Score by round on SLCP-256with 100 simulation budgets for each round on (a) baselines of40-dimensional output and (b) varying dimensions.

simulations in Table 2.

5.1.3. HYPERPARAMETER SELECTION

This section examines the effect of hyperparameter on infer-ence. There are three strategies of hyperparameter selection.First, we experiment with a constant hyperparameter of λ tocheck if the inference performance indeed degrades due tothe approximated nature in Theorem 4. Second, we experi-ment with exponentially decaying hyperparameters by theinference round in Anneal by Round. The hyperparameter atthe initial round is set to five, and the exponent coefficientis chosen to make the hyperparameter at the last round tobe 0.1. Third, we periodically decay the hyperperameter at


Table 2. Comparison of negative log probability at groundtruth simulation inputs for M/G/1 model, Ricker model, and Poisson model. Werun 40 rounds of inference for M/G/1, and 10 rounds of inference for Ricker and Poisson models with simulation budget per a round to be20 for M/G/1 and 100 for Ricker and Poisson models.

ALGORITHM

M/G/1 RICKER POISSON

OUTPUT DIMENSION OUTPUT DIMENSION OUTPUT DIMENSION

5 20 100 13 20 100 25 49 361

SMC-ABC 22.54±17.83 25.48±22.12 14.85±18.41 6.77±7.53 6.02±7.61 36.12±18.64 0.92±1.42 0.58±1.75 0.00±0.00

SNPE-A 4.36±0.43 4.07±0.49 3.25±0.61 4.43±0.20 4.54±0.49 4.43±0.18 0.34±0.48 0.17±0.42 -3.22±6.32

SNPE-B 9.30±0.96 9.85±0.03 9.89±0.01 9.79±0.06 8.67±1.16 9.96±0.25 5.44±5.59 5.59±4.92 3.47±4.67

APT (SNPE-C) -3.49±0.97 -2.27±7.14 -3.01±1.82 -0.62±0.98 -0.24±3.86 4.13±6.16 -11.61±1.93 -11.61±2.50 -11.26±0.96

AALR -1.18±0.59 -1.69±0.51 1.14±2.08 4.10±0.51 1.82±0.77 3.15±2.55 -6.36±1.66 -6.25±0.92 0.28±0.66

SNL -3.23±1.25 -5.58±1.37 -4.65±1.30 -1.46±0.66 -0.41±1.07 1.44±6.42 -12.76±0.99 -12.48±0.95 -2.36±7.44

SNL+PAR -3.86±0.68 -6.36±0.78 -5.52±0.83 -1.78±0.89 -1.14±1.25 -0.48±0.70 -13.31±1.11 -13.61±0.83 -5.33±4.33

0 5 10 15 20 25 30 35 40

Round

0.0

2.5

5.0

7.5

10.0

12.5

15.0

17.5

Ince

ptio

nS

core

Constant λ = 0 (SNL)Constant λ =0.1Constant λ =1Constant λ =10Constant λ =∞Anneal by RoundPeriodic by Round

(a) SLCP-16 by Strategy

0 5 10 15 20 25 30

Round

0

50

100

150

200

Ince

ptio

nS

core

Constant λ = 0 (SNL)Constant λ =0.1Constant λ =1Constant λ =10Constant λ =∞Anneal by RoundPeriodic by Round

(b) SLCP-256 by Strategy

Figure 9. Comparison of Inception Score by round upon varioushyperparameter strategies on (a) SLCP-16 of 50-dimensional out-put (b) SLCP-256 of 40-dimensional output.

every round in Periodic by Round. We put the initial hyper-parameter as five at each round, and the hyperparameter isdecayed as to 0.1 after 100 epochs.

Figure 9 (a) presents the result on SLCP-16. The learningcurve with the first strategy shows improving performanceas the hyperparameter increases, but the performance dra-matically degrades down when λ becomes the infinity. Thiscorrelation of the hyperparameter magnitude and the per-formance also occurs on SLCP-256 in Figure 9 (b), andwe argue that the proposed regularization is robust on thehyperparameter selection as long as the hyperperameter lieson a moderate range, [0.1, 10], though it is not guaranteed tobe exact by Theorem 4. Also, Figure 9 shows that the strate-gies with annealing and periodic hyperparameters performscomparably to the best performance of the hyperparameterselection on both SLCP-16 and SLCP-256. This empiricalresult reduces the endeavor of hyperparameter selections, ifwe choose either annealing or periodic strategy.

5.2. Real-world Simulations

M/G/1 Model We test the M/G/1 model with various outputdimensionalities. We define the summary statistics of theM/G/1 model as the equally dividend percentiles of the1,000-dimensional inter-departure simulation raw output.With the varying dimension of the summary statistics, Table

2 presents the state-of-the-art inference on the posterior bySNL with PAR.

Ricker Model While we accept the need of summarystatistics to mitigate unncessary noise in simulation re-sult for more accurate inference, we test the Ricker modelin three cases. First, we consider 13 summary statistics(Wood, 2010) of the mean, the number of zeros, the auto-covariances with lag up to five, the coefficients of a cubicregression of the ordered differences, and the coefficients ofa least squares estimates. Second, we extract 20 equally dis-tributed percentiles of the predicted simulation populations.Third, we infer the posterior without the summary statisticswith 100-dimensional raw output. Table 2 presents that SNLwith PAR gains a significant performance increment overthe previous algorithms.

Poisson Model While the previous M/G/1 model and theRicker model are temporal simulation models, the Poissonmodel is a time-independent simulation based on partial dif-ferential equations. We observe a vector of solution valuesat equally spaced pre-designated positions. Table 2 showsthat SNL with PAR shows the second best in the high di-mensional case, and SNL with PAR performs the best in thelow and the middle cases.

6. ConclusionThis paper proposes a new regularization, PAR, which isapplicable to the inference of the sequential neural likeli-hood. PAR is designed to pursue both learning the posteriordetail as well as avoiding the overfitting, which could beseem contradictory with a glimpse. We show that these twoobjectives can coexist in a single neural network for theinference, as a unified structure. This coexistence is theo-retically verified, and the empirical quality of the inferredposterior shows better fitness with increased mode capturesby PAR. We expect this new regularization can simply, yeteffectively change the practice of likelihood-free inference.


ReferencesArjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan.

arXiv preprint arXiv:1701.07875, 2017.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learningrepresentations by maximizing mutual information acrossviews. In Advances in Neural Information ProcessingSystems, pp. 15535–15545, 2019.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Ben-gio, Y., Courville, A., and Hjelm, D. Mutual informationneural estimation. In International Conference on Ma-chine Learning, pp. 531–540. PMLR, 2018.

Caruana, R., Lawrence, S., and Giles, L. Overfitting in neu-ral nets: Backpropagation, conjugate gradient, and earlystopping. Advances in neural information processingsystems, pp. 402–408, 2001.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever,I., and Abbeel, P. Infogan: Interpretable representationlearning by information maximizing generative adversar-ial nets. arXiv preprint arXiv:1606.03657, 2016.

Danihelka, I., Lakshminarayanan, B., Uria, B., Wierstra,D., and Dayan, P. Comparison of maximum likelihoodand gan-based training of real nvps. arXiv preprintarXiv:1705.05263, 2017.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios,G. nflows: normalizing flows in PyTorch, Novem-ber 2020. URL https://doi.org/10.5281/zenodo.4296287.

Greenberg, D., Nonnenmacher, M., and Macke, J. Auto-matic posterior transformation for likelihood-free infer-ence. In International Conference on Machine Learning,pp. 2404–2414, 2019.

Grover, A., Dhar, M., and Ermon, S. Flow-gan: Combiningmaximum likelihood and adversarial learning in genera-tive models. In Proceedings of the AAAI Conference onArtificial Intelligence, volume 32, 2018.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., andCourville, A. C. Improved training of wasserstein gans.In Advances in neural information processing systems,pp. 5767–5777, 2017.

Gutmann, M. U. and Corander, J. Bayesian optimizationfor likelihood-free inference of simulator-based statisticalmodels. The Journal of Machine Learning Research, 17(1):4256–4302, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In Proceedings of the IEEEconference on computer vision and pattern recognition,pp. 770–778, 2016.

Hermans, J., Begy, V., and Louppe, G. Likelihood-freemcmc with amortized approximate ratio estimators. InInternational Conference on Machine Learning, 2020.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal,K., Bachman, P., Trischler, A., and Bengio, Y. Learningdeep representations by mutual information estimationand maximization. arXiv preprint arXiv:1808.06670,2018.

Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., andSrinivasa, S. Imitation learning as f -divergence mini-mization. arXiv preprint arXiv:1905.12888, 2019.

Kim, D., Joo, W., Shin, S., and Moon, I.-C. Adversariallikelihood-free inference on black-box generator. arXivpreprint arXiv:2004.05803, 2020a.

Kim, D., Song, K., Kim, Y., Shin, Y., and Moon, I.-C. Se-quential likelihood-free inference with implicit surrogateproposal. arXiv preprint arXiv:2010.07604, 2020b.

Lee, K. S., Tran, N.-T., and Cheung, N.-M. Infomax-gan:Improved adversarial image generation via informationmaximization and contrastive learning. In Proceedingsof the IEEE/CVF Winter Conference on Applications ofComputer Vision, pp. 3942–3952, 2020.

Lueckmann, J.-M., Goncalves, P. J., Bassetto, G., Oecal, K.,Nonnenmacher, M., and Macke, J. H. Flexible statisticalinference for mechanistic models of neural dynamics.In Neural Information Processing Systems (NIPS 2017),2018.

Nguyen, T., Le, T., Vu, H., and Phung, D. Dual discrimi-nator generative adversarial nets. In Advances in neuralinformation processing systems, pp. 2670–2680, 2017.

Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Traininggenerative neural samplers using variational divergenceminimization. In Advances in neural information process-ing systems, pp. 271–279, 2016.

Papamakarios, G. and Murray, I. Fast ε-free inference ofsimulation models with bayesian conditional density esti-mation. In Advances in Neural Information ProcessingSystems, pp. 1028–1036, 2016.

Papamakarios, G., Sterratt, D., and Murray, I. Sequen-tial neural likelihood: Fast likelihood-free inference withautoregressive flows. In The 22nd International Confer-ence on Artificial Intelligence and Statistics, pp. 837–848,2019.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficultyof training recurrent neural networks. In Internationalconference on machine learning, pp. 1310–1318. PMLR,2013.

https://doi.org/10.5281/zenodo.4296287



Peng, J., Pedersoli, M., and Desrosiers, C. Mutual informa-tion deep regularization for semi-supervised segmenta-tion. In Medical Imaging with Deep Learning, pp. 601–613. PMLR, 2020.

Poole, B., Alemi, A. A., Sohl-Dickstein, J., and Angelova,A. Improved generator objectives for gans. arXiv preprintarXiv:1612.02780, 2016.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., andTucker, G. On variational bounds of mutual information.In International Conference on Machine Learning, pp.5171–5180. PMLR, 2019.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,Radford, A., and Chen, X. Improved techniques for train-ing gans. In Advances in neural information processingsystems, pp. 2234–2242, 2016.

Shao, J., Wang, Q., and Liu, F. Learning to sample: anactive learning framework. In 2019 IEEE InternationalConference on Data Mining (ICDM), pp. 538–547. IEEE,2019.

Sisson, S. A., Fan, Y., and Tanaka, M. M. Sequential montecarlo without likelihoods. Proceedings of the NationalAcademy of Sciences, 104(6):1760–1765, 2007.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K.,Scholkopf, B., and Lanckriet, G. R. Hilbert space embed-dings and metrics on probability measures. The Journalof Machine Learning Research, 11:1517–1561, 2010.

Tejero-Cantero, A., Boelts, J., Deistler, M., Lueckmann,J.-M., Durkan, C., Goncalves, P. J., Greenberg, D. S.,and Macke, J. H. sbi: A toolkit for simulation-basedinference. Journal of Open Source Software, 5(52):2505,2020. doi: 10.21105/joss.02505. URL https://doi.org/10.21105/joss.02505.

Wood, S. N. Statistical inference for noisy nonlinear eco-logical dynamic systems. Nature, 466(7310):1102–1104,2010.

Zhang, M., Bird, T., Habib, R., Xu, T., and Barber, D.Variational f-divergence minimization. arXiv preprintarXiv:1907.11891, 2019.

Zhao, S., Song, J., and Ermon, S. Infovae: Informationmaximizing variational autoencoders. arXiv preprintarXiv:1706.02262, 2017.

A. SimulationsA.1. Tractable Simulations

A.1.1. SLCP-16

The Simple-Likelihood-and-Complex-Posterior-16 (SLCP-16) (Kim et al., 2020b) is a simulation with 16 modes in itsposterior distribution. The model input parameter is five-dimensional and the model output dimension is a variableto choose. The model is defined by

θi ∼ U(−3, 3) for i = 1, ..., 5

µθ = (θ21, θ

22)

σ1 = θ23, σ2 = θ2

4, ρ = tanh (θ5)

Σθ =

(σ2

1 ρσ1σ2

ρσ1σ2 σ22

)x = x1 ⊕ · · · ⊕ xn with xi ∼ N (µθ,Σθ),

where ⊕ denotes the vector concatenation. As the namesuggests, the likelihood function is a simple Gaussian dis-tirbution, whereas the posterior is complex with 16 modes.Also, we can vary the output dimension of SLCP-16 bycontrolling n. The prior distribution of the model is the uni-form distribution on [−3, 3]5. The true parameter becomes(θ∗1 , θ

∗2 , θ∗3 , θ∗4 , θ∗5) = (1.5,−2.0,−1.0,−0.9, 0.6).

A.1.2. SLCP-256

The Simple-Likelihood-and-Complex Posterior-256 (SLCP-256) (Kim et al., 2020b) is a simulation with 256 modesin its posterior distribution. The model input parameter iseight-dimensional, and the model output is a variable tochoose. The model is defined by

θi ∼ U(−3, 3) for i = 1, ..., 8

x = x1 ⊕ · · · ⊕ xn with xj ∼ N (θ2, I),

where n is a control parameter for output dimen-sion. The prior distribution of the model is theuniform distribution on [−3, 3]8. The true parame-ter configuration is as (θ∗1 , θ

∗2 , θ∗3 , θ∗4 , θ∗5 , θ∗6 , θ∗7 , θ∗8) =

(1.5, 2.0, 1.3, 1.2, 1.8, 2.5, 1.6, 1.1).

A.2. Real-world Simulations

We introduce different types of simulations in their modelingframework.

A.2.1. M/G/1 MODEL

The M/G/1 model is a discrete-event simulation that modelsthe state transition of a waiting line with a single serveraccording to the sampling-based discrete events (customerarrive). This model is highly stochastic, and the likelihood,p(x|θ), is supported on a expansive area. Therefore, the

https://doi.org/10.21105/joss.02505



likelihood-free inference fails if the observation is a sin-gleton unless we filter the raw output by a set of summarystatistics. Although there could be alternative choices for thesummary statistics, we choose eqaully dividend percentiles.We choose this elementary statistics in order to minimize thedomain knowledge in the selection of the summary statis-tics.

Suppose the arrival of customers (v) is modeled by a Poissonprocess, and the service time (s) is modeled by a uniformdistribution. Then the finish time is calculated by

s(i) ∼ U(θ1, θ1 + θ2),

v(i)− v(i− 1) ∼ Exp(θ3),

d(i)− d(i− 1) = s(i) + max (0, v(i)− d(i− 1)).

We follow Papamakarios et al. (2019) to set the observationas a trajectory of d(i)−d(i−1)Ii=1. We determine I to be1,000. The model has three input parameters, (θ1, θ2, θ3),that describe the waiting line. The service time is deter-mined by a uniform distribution on [θ1, θ1 + θ2], and θ3 de-termines the customer arrival time. We consider a uniformprior distribution on [0, 10]× [0, 10]× [0, 1/3]. The true pa-rameter configuration is given by (θ∗1 , θ

∗2 , θ∗3) = (1, 4, 0.2).

A.2.2. RICKER MODEL

The Ricker model is a system dynamics simulation thatmodels the animal population that is latent. For example,the number of fish in the ocean cannot be observed, but itis extremely important to predict it in order to keep the sus-tainability of the environment. While the underlying animalpopulation follows a simple equation, the observed popu-lation at each time is based on a sampling from a Poissondistribution, which creates the stochasticity in this model.

The underlying population is described as a nonlinear auto-gressive equation by

logNt = log r + logNt−1 −Nt−1 + σet,

where Nt is the size of the animal population at time t, andet is a standard Gaussian random variable. The populationstarts from N0 = 0. Here, log r determines the popula-tion growth size, and σ represents the randomness of thepopulation growth.

Additionally, the observation model follows a Poisson dis-tribution with

yt ∼ Poisson(φNt),

where φ is a scaling parameters. The model output isytTt=1 with T = 100 timesteps.

We extract two set of summary statistics. The first set con-sists of 13 summary statistics (Wood, 2010): 1) the mean

population, y = 1T

∑Tt=1 yt; 2) the number of zeros ob-

served; 3) the coefficients, β1 and β2, of the autoregressiony0.3t+1 = β1y

0.3t + β2y

0.6t + et; 4) the coefficients of the

cubic regression of the ordered differences, yt − yt−1, ontheir observed values; and 5) the autocovariances up to lagfive. The choice of these summary statistics depends ondomain knowledge. The other set of summary statistics isthe equally distributed percentiles.

We consider a uniform prior on [0, 5]× [0, 1]× [0, 15], andthe true parameter to be (log r∗, σ∗, φ∗) = (3.8, 0.3, 10).

A.2.3. POISSON MODEL

The Poisson model is a second-order partial differentialequation. The generalized Laplace operator in the general-ized Poisson equation models the random walk effect withthe nonlinear diffusion coefficient (q) that could explain aporous material heat diffusivity. We assume a mixed bound-ary condition, which combines the Dirichlet boundary con-dition on the top and the bottom boundaries, and a Neumannboundary condition on the left and the right boundaries:

−∇ · (q∇u) = 1 in Ω

u = 5 on ∂ΩT

u = 0 on ∂ΩB

∇u · n = −10e−(y−0.5)2 on ∂ΩL

∇u · n = 1 on ∂ΩR,

where ∂ΩT , ∂ΩB , ∂ΩL and ∂ΩR are the top, bottom, left,and right sides of the unit square domain, Ω = [0, 1]2, rep-sectively. The Dirichlet condition on the top and the bottomsides indicate that the temperature is fixed throughout theboundaries. The Neumann condition on the right side meansthere exist a heat influx. The Neumann condition on theleft side means the heat is dissipating outwardly. The Pois-son model describes the stationary heat diffusion of thetime-dependent heat equation. The nonlinear diffusivity isdefined by

q(x, y) =

θ1 if x ≤ 0.5 ∧ y ≤ 0.5

θ2 if x ≤ 0.5 ∧ y > 0.5

θ3 if x > 0.5 ∧ y ≤ 0.5

θ4 if x > 0.5 ∧ y > 0.5

We fine the weak solution u ∈ V of the variational formgiven by

a(u, v) = L(v) ∀v ∈ V ,

where V is the space of test functions, i.e. C∞c (Ω); V isthe space of C∞(Ω) that satisfies the top and the bottom


Dirichlet boundary conditions; and

a(u, v) =

∫Ω

q(x)∇u(x) · ∇v(x)dx

L(v) =

∫∂ΩL

−10e−(y−0.5)2v(x)ds+

∫∂ΩR

v(x)ds

+

∫Ω

v(x)dx.

The observation is a vector of solution values u∗1, ..., u∗Dat equally spaced pre-designated positions, x1, ..., xD,repectively. We use the uniform prior on [0, 1]4, and the trueparameter as (θ∗1 , θ

∗2 , θ∗3 , θ∗4) = (0.1, 0.9, 0.6, 0.2).

A.3. Toy Simulations

A.3.1. SIMULATION OF FIGURE 1

We use a simulation model with

θi ∼ U(0, 1) for i = 1, 2

x ∼ N ([cos 5πθ1, cos 5πθ2]T , 0.1I).

With the observation of xo = [0, 0]T , the true posteriorattains equally spaced 25 periodic modalities. Surprisingly,the vanilla SNL fails at capturing multi-modalities of thislow-dimensional simulation model.

A.3.2. SIMULATION OF FIGURE 5

The true likelihood is p(x1) = N (x1; 0, 1) and p(x2|x1) =N (x2;x1, 1), and the modeled likelihood is defined byq(x1) = N (x1; 0, θ2

1) and q(x2|x1) = N (x2;x1, θ22),

where θ1 and θ2 are the modeling parameters. Here, themarginal distribution becomes q(x2) = N (x2; 0, θ2

1 + θ22).

This simple simulation model attains the tractable diver-gence formulas. We compute the forward KL divergence asfollows.

DKL(p, q) =

∫ ∫p(x1, x2) log

p(x1, x2)

q(x1, x2)dx1dx2

=−∫ ∫

p(x1, x2) log q(x1, x2)dx1dx2 −H(p)

=−∫ ∫

p(x1, x2)

log

[1

θ1θ2e− x21

2θ21− (x2−x1)2

2θ22

]dx1dx2 −H(p).

Here, the entropy of p, H(p), is constant along the optimiza-tion of q. From

− x21

2θ21

− (x2 − x1)2

2θ22

= − (θ21 + θ2

2)x21 − 2θ2

1x1x2 + θ21x

22

2θ21θ

22

,

the forward KL calculation is decomposed by five terms,

DKL(p, q) = (1) + (2) + (3) + (4) + (5), with

(1) = log θ1 + log θ2,

(2) =θ2

1 + θ22

2θ21θ

22

∫ ∫x2

1p(x1, x2)dx1dx2

=θ2

1 + θ22

2θ21θ

22

∫x2

1p(x1)dx1

=θ2

1 + θ22

2θ21θ

22

,

(3) =− 1

θ22

∫ ∫x1x2p(x1, x2)dx1dx2

=− 1

θ22

,

(4) =1

2θ22

∫ ∫x2

2p(x1, x2)dx1dx2

=1

2θ22

∫x2

2p(x2)dx2

=1

θ22

,

(5) =−H(p).

Therefore, the closed-form of the forward KL divergence is

DKL(p, q) = log θ1 + log θ2 +θ2

1 + θ22

2θ21θ

22

−H(p).

Also, the reverse KL divergence becomes

DKL(q‖p) =

∫ ∫q(x1, x2) log

q(x1, x2)

p(x1, x2)dx1dx2,

where

q(x1, x2)

p(x1, x2)=

1

θ1θ2e− x21

2θ21− (x2−x1)2

2θ22+x212 +

(x2−x1)2

2

Analogously, the reverse KL divergence is decomposed byfive terms

DKL(q, p) = (1) + (2) + (3) + (4) + (5),


where

(1) =− (log θ1 + log θ2)

∫ ∫q(x1, x2)dx1dx2

=− (log θ1 + log θ2),

(2) =− 1

2θ21

∫ ∫x2

1q(x1, x2)dx1dx2

=− 1

2θ21

∫x2

1q(x1)dx1

=− 1

2,

(3) =− 1

2θ22

∫ ∫(x2 − x1)2q(x1, x2)dx1dx2

=− 1

2θ22

∫q(x1)

∫(x2 − x1)2q(x2|x1)dx2dx1

=− 1

2θ22

∫q(x1)θ2

2dx1

=− 1

2

(4) =1

2

∫ ∫x2

1q(x1, x2)dx1dx2

=θ2

1

2

(5) =1

2

∫ ∫(x2 − x1)2q(x1, x2)dx1dx2

=θ2

2

2

Therefore, the closed-form of the reverse KL divergencebecomes

DKL(q, p) = − log θ1 − log θ2 +θ2

1 + θ22

2− 1.

The mutual information becomes

I(X1, X2) =

∫ ∫q(x1, x2) log

q(x1, x2)

q(x1)q(x2)dx1dx2

=

∫ ∫q(x1)q(x2|x1) log

q(x2|x1)

q(x2)dx1dx2

=

∫ ∫N (x1; 0, θ2

1)N (x2;x1, θ22)

logN (x2;x1, θ

22)

N (x2; 0, θ21 + θ2

2)dx1dx2.

Here, we have

N (x2;x1, θ22)

N (x2; 0, θ21 + θ2

2)=

√θ2

1 + θ22

θ2e

−(θ21+θ22)x21+2(θ21+θ22)x1x2−θ21x22

2θ22(θ21+θ22)

Therefore, I(X1, X2) is decomposed by

I(X1, X2) = (1) + (2) + (3) + (4) + (5),

with

(1) =1

2log (θ2

1 + θ22)

∫ ∫q(x1, x2)dx1dx2

=1

2log (θ2

1 + θ22),

(2) =− log θ2,

(3) =− 1

2θ22

∫ ∫x2

1q(x1, x2)dx1dx2

=− 1

2θ22

∫x2

1q(x1)dx1

=− 1

2θ22

∫x2

1N (x1; 0, θ21)dx1

=− θ21

2θ22

,

(4) =1

θ22

∫ ∫x1x2q(x1, x2)dx1dx2

=1

θ22

∫x1q(x1)

∫x2q(x2|x1)dx2dx1

=1

θ22

∫x1q(x1)

[ ∫(x2 − x1)N (x2;x1, θ

22)dx2

+

∫x1N (x2;x1, θ

22)dx2

]dx1

=1

θ22

∫x2

1q(x1)dx1

=θ2

1

θ22

,

(5) =− θ21

2θ22(θ2

1 + θ22)

∫ ∫x2

2q(x1, x2)dx1dx2

=− θ21

2θ22(θ2

1 + θ22)

∫x2

2q(x2)dx2

=− θ21

2θ22

,

Therefore, the closed-form of the mutual information be-comes

I(X1, X2) =1

2log (θ2

1 + θ22)− log θ2.

B. Proof of TheoremsTheorem 3. Suppose F (ψ) is defined to be−Ep(θ)qψ(x|θ)

[log p(θ|x)

], then F (ψ) is decomposed

into F (ψ) = F (1)(ψ) + F (2)(ψ) + const. with

F (1)(ψ) = DKL

(qψ(x)qψ(θ|x), qψ(x)p(θ|x)

)F (2)(ψ) = −I(Θ, X).


Proof. The regularization becomes

F (ψ) =− Ep(θ)qψ(x|θ)[

log p(θ|x)]

=−∫ ∫

p(θ)qψ(x|θ) log p(θ|x)dθdx

=

∫ ∫p(θ)qψ(x|θ) log

p(θ)qψ(x|θ)p(θ|x)qψ(x)

dθdx

+

∫ ∫p(θ)qψ(x|θ) log

qψ(x)

p(θ)qψ(x|θ)dθdx

=DKL

(p(θ)qψ(x|θ), qψ(x)p(θ|x)

)−DKL(p(θ)qψ(x|θ), p(θ)qψ(x))

−∫p(θ) log p(θ)dθ

=F (1)(ψ) + F (2)(ψ) + H(p),

since I(Θ, X) = DKL(p(θ)qψ(x|θ), p(θ)qψ(x)).

Theorem 4. Suppose p(θ, x) and p(θ) are bounded on apositive interval, p(θ) is bounded below, and qψ(θ, x) is uni-formly upper bounded on ψ. Then the optimization problemarg minqψ Lλφ(ψ) attains the unique solution of

qλψ∗(x|θ) =p(x|θ)

c(θ)− λ log qφ(θ|x),

where c(θ) is a constant with respect to x that makesqλψ∗(x|θ) a distribution in x-space.

Before we prove Theorem 4, we need a lemma.

Lemma 1. Suppose a measurable function f(x, y) isbounded below. Then,

∫ ∫u(x, y)f(x, y)dxdy = 0 holds

for all measurable u with∫u(x, y)dx = 0 for all y ∈ Y if

and only if f(x, y) = c(y) for some measurable function c.

Proof. Without loss of generality, we assume f(x, y) to benonnegative.

(⇒) Let η be positive and satisfy∫η(x) = 1. If we define

u0(x, y) to be

u0(x, y) := η(x)f(x, y)− η(x)

∫η(x′)f(x′, y)dx′,

the integration of u0(x, y) over x becomes∫u0(x, y)dx =

∫η(x)f(x, y)dx

−∫η(x)

∫η(x′)f(x′, y)dx′dx

=

∫η(x)f(x, y)dx−

∫η(x′)f(x′, y)dx′

=0.

Therefore, we have

0 =

∫ ∫u0(x, y)f(x, y)dxdy

=

∫ ∫η(x)f2(x, y)dxdy

−∫ (∫

η(x)f(x, y)dx

)(∫η(x′)f(x′, y)dx′

)dy

=

∫ ∫η(x)f2(x, y)dxdy −

∫ (∫η(x)f(x, y)dx

)2

dy.

On the other hand,(∫ (∫η(x)f(x, y)dx

)2

dy

)1/2

≤∫ (∫

η2(x)f2(x, y)dy

)1/2

dx (5)

=

∫η(x)

(∫f2(x, y)dy

)1/2

dx

≤(∫

η(x)dx

)1/2(∫η(x)

∫f2(x, y)dydx

)1/2

(6)

=

(∫ ∫η(x)f2(x, y)dydx

)1/2

,

where inequality 5 holds from Minkowski inequality, andinequality 6 holds from Holder’s inequality. Therefore, wehave

0 =

∫ ∫u0(x, y)f(x, y)dxdy ≥ 0,

and the equality holds if and only if the equlity conditionsof Minkowski inequality and Holder’s inequality hold.

The equality condition of Minkowski inequality is

η(x)f(x, y) = f1(x)f2(y), (7)

for some nonnegaitve measurable functions f1 and f2. Also,the equality conditions of Holder’s inequality is

η1/2(x)

(∫f2(x, y)dy

)1/2

= cη1/2(x), (8)

for some constant c ≥ 0. Therefore, from Eq. 8, we have∫f2(x, y)dy = c, (9)

for some c ≥ 0 (we use abuse of notation for c), and byplugging Eq. 7 into Eq. 9, we have

f21 (x) = cη2(x),


for some c ≥ 0. Therefore, f(x, y) = cf2(y) for somenonnegative measurable function f2 and some constant c ≥0.

(⇐) Assume f(x, y) = c(y), then for any u with∫u(x, y)dx = 0 for all y ∈ Y , we have∫ ∫

u(x, y)f(x, y)dxdy

=

∫ ∫u(x, y)c(y)dxdy

=

∫c(y)

(∫u(x, y)dx

)dy

= 0.

Proof of Theorem 4. For the proof of theorem, let us defineLλφ(qψ) := Lλφ(ψ). Also, suppose u(θ, x) is a function of(θ, x) that satisfies

∫u(θ, x)dx = 0 for any θ. Then, the

function of (q + εr)(x|θ) := q(x|θ) + εu(θ, x) becomes adistribution on x for any θ.

The difference Lλφ(q + εu)− Lλφ(q) becomes

Lλφ(q + εu)− Lλφ(q)

= LSNL(q + εu)− LSNL(q) (10)+Fφ(q + εu)− Fφ(q). (11)

Then, the first difference of Eq. 10 becomes

Eq. 10

= −∫ ∫

p(θ)p(x|θ) log(q(x|θ) + εu(θ, x)

)dθdx

+

∫ ∫p(θ)p(x|θ) log q(x|θ)dθdx

= −∫ ∫

p(θ)p(x|θ) logq(x|θ) + εu(θ, x)

q(x|θ) dθdx

= −∫ ∫

p(θ)p(x|θ) log

(1 + ε

u(θ, x)

q(x|θ)

)dθdx

= −ε∫ ∫

p(θ)u(θ, x)p(x|θ)q(x|θ)dθdx+ o(ε), (12)

and the second difference of Eq. 11 becomes

Eq. 11

= −λ∫ ∫

p(θ)(q(x|θ) + εu(θ, x)) log qφ(θ|x)dθdx

+λ

∫ ∫p(θ)q(x|θ) log qφ(θ|x)dθdx

= −ελ∫ ∫

p(θ)u(θ, x) log qφ(θ|x)dθdx. (13)

Therefore, from Eq. 12 and Eq. 13, we have

Lλφ(q + εu)− Lλφ(q)

= −ε∫ ∫

p(θ)u(θ, x)

(p(x|θ)q(x|θ) + λ log qφ(θ|x)

)dθdx

+o(ε).

The distribution of q becomes the optimal solution ofarg minq Lλφ(q) if and only if the Frechet derivative ofLλφ(q) becomes zero, which is equivalent with∫

p(θ)

∫u(θ, x)

(p(x|θ)q(x|θ) + λ log qφ(θ|x)

)dxdθ = 0,

(14)

for all u(θ, x) with∫u(θ, x)dx = 0, ∀θ ∈ Θ. If we define

u(θ, x) := p(θ)u(θ, x), then Eq. 14 is equivalent to∫ ∫u(θ, x)

(p(x|θ)q(x|θ) + λ log p(θ|x)

)dxdθ = 0,

for all u(θ, x) with∫u(θ, x)dx = 0, ∀θ ∈ Θ.

From assumptions, p(x|θ)q(x|θ) = p(θ,x)

q(θ,x) is bounded belowbecause the denominator, q(θ, x) is bounded above andp(θ, x) is bounded below. Also, p(x) =

∫p(θ)p(x|θ)dθ =∫

p(θ)p(θ,x)p(θ) dθ ≤ sup(θ,x)

p(θ,x)p(θ) , and p(θ) is bounded be-

low by a positive constant. Hence, p(θ|x) = p(θ)p(θ,x)p(θ)p(x) is

bounded below. Thus, we conclude that p(x|θ)q(x|θ) +λ log p(θ|x)

is bounded below, and we could apply Lemma 1. Therefore,the optimal solution of arg minq Lλφ(q) exists to be

q∗(x|θ) =p(x|θ)

c(θ)− λ log p(θ|x).

Note that c(θ) = p(x|θ)q(x|θ) + λ log p(θ|x) > λ log p(θ|x), so

c(θ)− λ log p(θ|x) > 0, and the optimal solution q∗ is welldefined.

Let us assume there exists a function c(θ) with c(θ) 6= c(θ)for some θ such that

q∗(x|θ) =p(x|θ)

c(θ)− λ log p(θ|x).

If c(θ) > c(θ) at θ, then

1 =

∫q∗(x|θ)dx

=

∫p(x|θ)

c(θ)− λ log p(θ|x)dx

<

∫p(x|θ)

c(θ)− λ log p(θ|x)dx

=

∫q∗(x|θ)dx

=1,


Table 3. Comparison of Complex Analysis

AlgorithmLearning with

(Regularized) LossAdditional

Posterior LearningAPT O(rNEB1) —

AALR O(rNEB2) —SNL O(rNEB3) —

SNL+PAR O(rNE(B1 +B3)

)O(rNEB1)

which is a contradiction. Therefore, the optimal solution isunique.

C. Complexity AnalysisTable 3 analyzes the time complexity at the r-th round ofthe inference along the baseline algorithms. We assume thefollowings for the analysis: 1) The simulation budget pera round is given to N ; 2) The number of training epoch isshared by E; 3) It takes B1, B2, and B3 number of oper-ations to compute the neural posterior in APT, the neuraldiscriminator in AALR, and the neural likelihood in SNL,respectively.

The total complexity of our model is increased byO(rNE(2B1 + B3)

), but the increment is minor if the

representation powers for APT and SNL, B1 and B3, areat the identical complexity level. Note that this linear-scalediscrepancy of complexity would negligeable compared tothe expensive simulation execution time.

D. Experimental DetailsWe release the code at https://github.com/Kim-Dongjun/Posterior_Aided_Regularization_for_Likelihood_Free_Inference for the reproducibility and the detailedexperimental setup. We make use of the Neural Spline Flow(NSF) (Durkan et al., 2020) for the conditional densities inSNL and APT. We use four spline transforms, where eachtransformation includes two-blocks of residual networks(He et al., 2016) with the ReLU activation function, andeach transformation is partitioned by 64 bins. We usethe python package, Simulation-Based Inference (SBI)(Tejero-Cantero et al., 2020), for the calculation of qφ(θ|x)in APT. We use three-layered fully connected networkof 256 neurons for each layers with the ReLU activationfunction for AALR. We use the Adam optimizer withlearning rate of 1E-3, (β1, β2) = (0.5, 0.999), and weightdecay of 1E-5. On SPNE-A and SNPE-B, we follow theexperimental details of the original papers (Papamakarios& Murray, 2016) and (Lueckmann et al., 2018) for theexperiments, respectively.

Throughout the inference, we use the early-stopping (Caru-

ana et al., 2001) if no validation loss is improved more than20 epochs, where 10% of the simulation budget is the val-idation data. In order to prevent the learning instability,we clip the gradient norm (Pascanu et al., 2013) to five forNSF of SNL and one for NSF of APT. The new simulationinputs are sampled from the Metropolis-Hastings samplerwith N -chains, i.e. we draw batch of next simulation inputsfrom the independent Markov chains (Kim et al., 2020b).Note that all experiments are replicated 30 times to obtainstable statistics for the experimental results on Intel Corei7/NVIDIA RTX 3090.

As the performance metrics, we use the negative log proba-bility at the groundtruth parameters (Papamakarios & Mur-ray, 2016). The groundtruth parameters include the trueparameter (θ∗1) and the alternative parameters (θ∗2). Thealternative parameter denotes the parameters that generateidentical likelihood fucntion, i.e p(x|θ∗1) = p(x|θ∗2). Thenegative log probability is defined by − log p(θ∗|xo) :=

−∑Kk=1 log p(θ∗k|xo), where θ∗1:K are the groundtruth pa-

rameters.

Alternatively, we use the Inception Score (Salimans et al.,2016) to measure the sample quality. The InceptionScore calculates the KL divergence between the con-ditional and the marginal class distributions: IS =exp(Eθ[DKL(p(c|θ)‖p(c))]), where the conditional distri-bution, p(c|θ), is the probability of θ being classified by thec-th cluster. As the conditional distribution, we train theGaussian Mixture Model with 5,000 samples from the trueposterior distribution.

In Figure 4 of the main paper, we follow Mutual Informa-tion Neural Estimation (MINE) (Belghazi et al., 2018) toestimate the mutual information of two random variables.We use a three-layered fully connected neural network of100 neurons, and the ReLU activation function.

E. Additional Experimental ResultsE.1. Additional Results on Tractable Simulations

Figure 10 and Figure 11 shows the full comparison withrespect to every performance metric on both SLCP-16 andSLCP-256 simulations. Figure 10 (a) shows that SNL regu-larized by PAR obtains a smoother learning curve comparedto the unregularized SNL. Figure 10 (b) proves that the trueposterior distribution of SLCP-16 is only obtainable withSNL with PAR. If we move on to the SLCP-256 simulation,AALR eventually yields a comparable performance for allmetrics. However, the saturation of SNL with PAR is fasterthan AALR and SNL for every metric.

https://github.com/Kim-Dongjun/Posterior_Aided_Regularization_for_Likelihood_Free_Inference





0 5 10 15 20 25 30

Round

100

200

300

400

500N

egat

ive

Log

Pro

babi

lity

APTAALRSNLSNL + PAR

(a) Negative Log Probability

0 5 10 15 20 25 30

Round

−8

−7

−6

−5

−4

−3

−2

−1

log

MM

D

APTAALRSNLSNL + PAR

(b) log Maximum Mean Discrepancy

0 5 10 15 20 25 30

Round

0

2

4

6

8

10

12

14

16

Ince

ptio

nS

core

APTAALRSNLSNL + PAR

(c) Inception Score

Figure 10. Comparison of all performance metrics by round on SLCP-16 of 50-dimensional output with 100 simulation budget for eachround.

0 10 20 30 40 50

Round

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

Neg

ativ

eLo

gP

roba

bilit

y

APTAALRSNLSNL + PAR

(a) Negative Log Probability

0 10 20 30 40 50

Round

−9

−8

−7

−6

−5

−4

−3

−2

log

MM

D

APTAALRSNLSNL + PAR

(b) log Maximum Mean Discrepancy

0 10 20 30 40 50

Round

25

50

75

100

125

150

175

200

225

Ince

ptio

nS

core

APTAALRSNLSNL + PAR

(c) Inception Score

Figure 11. Comparison of all performance metrics by round on SLCP-256 of 40-dimensional output with 100 simulation budget for eachround.

E.2. Comparison of Forward KL/Reverse KL/JSDivergences

The experimental result in this section indeed motivates usto introduce a regularization technique. Figure 12 presentsthe marginal posterior after the inference with (a) the fowardKL minimization, (b) the reverse KL minimization, and (c)the Jensen-Shannon (JS) divergence minimization on theSLCP-256 simulation of 40-dimensional output. In order tominimize the reverse KL divergence and the JS divergence,we train NSF parameters under the adversarial frameworkin this experiment. We use the variational divergence mini-mization (Nowozin et al., 2016), or f-GAN that makes useof the normalizing flow model as a generator. As reportedin Grover et al. (2018); Danihelka et al. (2017), the trainingof a normalizing flow model with the adversarial conceptmight yield extremely poor log-likelihood scores. Havingsaid that, we suggest Figure 12 to qualitatively investigatethe effect of divergences on the inference, rather than thequantitative comparison.

Figure 12 (a) presents the mode covering property of theforward KL divergence, and Figure 12 (b) shows the modeseeking property of the reverse KL divergence. On the otherhand, Figure 12 (c) illustrates the poor performance by cap-

turing a few modes. Figure 12 provides three implications.First, we need to infer the posterior with MLE rather thanthe adversarial training as long as we insist modeling theneural likelihood by a normalizing flow. Second, the reverseKL divergence would help learning the posterior. Third, weneed another regularizing term to avoid the mode collapse ifwe combine the forward KL and the reverse KL divergences.These three implications motivate us to introduce PAR that1) still trains the normalizing flow model by MLE, 2) regu-larizes the origianl loss (forward KL divergence) with thereverse KL divergence, and 3) regularizes with the mutualinformation which mitigates the mode collapse by providingthe discriminative conditional likelihood given distinctivesimulation inputs.

E.3. Likelihood-Free Inference on Image GenerationModels

While VAE targets to estimate the posterior distributionby an auto-encoder structure with the marginal distributionmodeling, likelihood-free inference infers the posterior dis-tribution with the joint distribution modeling. The posteriorestimation of two unsupervised approaches differs by theiruse-cases; VAE infers the posterior distribution under an am-ple size of dataset, and likelihood-free inference estimates


(a) forward KL divergence (b) reverse KL divergence (c) JS divergence

Figure 12. Comparisons of the marginal posterior inferred by foward KL/reverse KL/JS divergences learning with the variationaldivergence minimization.

the approximate posterior of a fixed simulation model byconstructing a cumulative generated dataset.

Although the field of likelihood-free inference largely fo-cuses on simulation models, we expand the usability oflikelihood-free inference on various image generation mod-els. Image data is particularly challenging on likelihood-freeinference because the multi-class of images induces a highlymulti-modal joint distribution. Hence, we qualitatively in-spect the regenerated images if the inference algorithm cap-tures the semantic information of the observation, whileretaining the generated image quality.

We use a pre-trained Generative Adversarial Network(GAN) generator as a simulation model with the input as anoise and the output as an image, and we extract the sum-mary statistics of a raw image data as the encoded represen-tation of the pre-trained Variational Auto-Encoder (VAE).We use the generative model’s network architecture of Chenet al. (2016), and we train this generative model by theWGAN loss (Arjovsky et al., 2017) with gradient panelty(Gulrajani et al., 2017). The generative model is pre-trainedfor 100 epochs with the bach size of 64, and with the Adamoptimizer of learning rate to be 0.0002. Also, we train theencoder network of three-layered convolutional networksfor 50 epochs with 128 batch size by the Adam optimizer oflearning rate as 0.001.

In this section, we empirically show that SNL with PARoutperforms the baselines in terms of the mode captivitywhen the underlying joint distribution is highly multi-modal.Also, SNL with PAR captures the latent semantic informa-tion the best by estimating the nearest latent embeddings ofxo with no label information. This best semantic informa-tion captured by SNL with PAR provides the high qualityimage of the regeneration of xo.

E.4. MNIST

In MNIST dataset, with five-dimensional input, and 10-dimensional encoded output, Figure 13 (a) presents theconsistently outperforming performance of SNL with PARin terms of the negative log probability with 100 simulationbudget for each round. Figure 14 compares the baselinemodels, in which each image is generated by a simulationinput drawn from the inferred posterior. The learning speedof AALR to capture the latent embedding of xo is too slowbecause AALR sometimes generate images of label fiveafter the inference. On the other hand, other algorithms re-generate the observation well. We note that this regeneratedimages could be used to augment a specific data instance.

Figure 15 shows the inferred posteriors by rounds. APTinfers the heavy-tailed posterior distribution, and AALRrequires more dataset to infer the accurate posterior despiteof the use of shallow network for AALR. On the otherhand, SNL and SNL regularized by PAR captures the trueparameter. In particular, SNL with PAR outperforms SNL inits inference speed that the regularzation reduces the numberof required simulation budget.

E.5. FashionMNIST

In FashionMNIST experiment, we use the identical genera-tive model’s network architecture and the summary statis-tics’ network architecture of the MNIST experiment. Withthe seven-dimensional input, 20-dimensional encoded out-put, and 100 simulation budget for each round, Figure 13(b) presents that PAR performs the best at the end of theinference. Moreover, Figure 16 shows that PAR regneratesthe original observation the best by retaining the essentialinformation of the observation.


E.6. SVHN

In SVHN dataset, with nine-dimensional input, 20-dimensional encoded output, and 100 simulation budgetfor each round, Figure 13 (c) shows the consistent outperfor-mance of SNL with PAR out of baselines. Also, Figure 17shows that PAR regenerates the original observation, whilekeeping the semantics of the given observation. Figure 18illustrates that APT and AALR fail at capturing the latentembedding of xo after rounds of inference. Moreover, SNLwith PAR captures the latent embedding of xo better thanthe vanilla SNL.

ReferencesArjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan.

arXiv preprint arXiv:1701.07875, 2017.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learningrepresentations by maximizing mutual information acrossviews. In Advances in Neural Information ProcessingSystems, pp. 15535–15545, 2019.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Ben-gio, Y., Courville, A., and Hjelm, D. Mutual informationneural estimation. In International Conference on Ma-chine Learning, pp. 531–540. PMLR, 2018.

Caruana, R., Lawrence, S., and Giles, L. Overfitting in neu-ral nets: Backpropagation, conjugate gradient, and earlystopping. Advances in neural information processingsystems, pp. 402–408, 2001.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever,I., and Abbeel, P. Infogan: Interpretable representationlearning by information maximizing generative adversar-ial nets. arXiv preprint arXiv:1606.03657, 2016.

Danihelka, I., Lakshminarayanan, B., Uria, B., Wierstra,D., and Dayan, P. Comparison of maximum likelihoodand gan-based training of real nvps. arXiv preprintarXiv:1705.05263, 2017.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios,G. nflows: normalizing flows in PyTorch, Novem-ber 2020. URL https://doi.org/10.5281/zenodo.4296287.

Greenberg, D., Nonnenmacher, M., and Macke, J. Auto-matic posterior transformation for likelihood-free infer-ence. In International Conference on Machine Learning,pp. 2404–2414, 2019.

Grover, A., Dhar, M., and Ermon, S. Flow-gan: Combiningmaximum likelihood and adversarial learning in genera-tive models. In Proceedings of the AAAI Conference onArtificial Intelligence, volume 32, 2018.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., andCourville, A. C. Improved training of wasserstein gans.In Advances in neural information processing systems,pp. 5767–5777, 2017.

Gutmann, M. U. and Corander, J. Bayesian optimizationfor likelihood-free inference of simulator-based statisticalmodels. The Journal of Machine Learning Research, 17(1):4256–4302, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In Proceedings of the IEEEconference on computer vision and pattern recognition,pp. 770–778, 2016.

Hermans, J., Begy, V., and Louppe, G. Likelihood-freemcmc with amortized approximate ratio estimators. InInternational Conference on Machine Learning, 2020.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal,K., Bachman, P., Trischler, A., and Bengio, Y. Learningdeep representations by mutual information estimationand maximization. arXiv preprint arXiv:1808.06670,2018.

Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., andSrinivasa, S. Imitation learning as f -divergence mini-mization. arXiv preprint arXiv:1905.12888, 2019.

Kim, D., Joo, W., Shin, S., and Moon, I.-C. Adversariallikelihood-free inference on black-box generator. arXivpreprint arXiv:2004.05803, 2020a.

Kim, D., Song, K., Kim, Y., Shin, Y., and Moon, I.-C. Se-quential likelihood-free inference with implicit surrogateproposal. arXiv preprint arXiv:2010.07604, 2020b.

Lee, K. S., Tran, N.-T., and Cheung, N.-M. Infomax-gan:Improved adversarial image generation via informationmaximization and contrastive learning. In Proceedingsof the IEEE/CVF Winter Conference on Applications ofComputer Vision, pp. 3942–3952, 2020.

Lueckmann, J.-M., Goncalves, P. J., Bassetto, G., Oecal, K.,Nonnenmacher, M., and Macke, J. H. Flexible statisticalinference for mechanistic models of neural dynamics.In Neural Information Processing Systems (NIPS 2017),2018.

Nguyen, T., Le, T., Vu, H., and Phung, D. Dual discrimi-nator generative adversarial nets. In Advances in neuralinformation processing systems, pp. 2670–2680, 2017.

Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Traininggenerative neural samplers using variational divergenceminimization. In Advances in neural information process-ing systems, pp. 271–279, 2016.




0 2 4 6 8 10 12 14

Round

−12

−10

−8

−6

−4

−2

0

2N

egat

ive

Log

Pro

babi

lity

APTAALRSNLSNL + PAR

(a) MNIST

0 2 4 6 8

Round

−15

−10

−5

0

5

10

Neg

ativ

eLo

gP

roba

bilit

y

APTAALRSNLSNL + PAR

(b) FashionMNIST

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5

Round

−10

−8

−6

−4

−2

0

2

4

6

Neg

ativ

eLo

gP

roba

bilit

y

APTAALRSNLSNL + PAR

(c) SVHN

Figure 13. Comparisons of baselines on (a) MNIST, (b) FashionMNIST, and (c) SVHN image generation models.

(a) Groundtruth Image (b) APT (c) AALR (d) SNL (e) SNL + PAR

Figure 14. Comparison of the regenerated MNIST images after the inference.

Papamakarios, G. and Murray, I. Fast ε-free inference ofsimulation models with bayesian conditional density esti-mation. In Advances in Neural Information ProcessingSystems, pp. 1028–1036, 2016.

Papamakarios, G., Sterratt, D., and Murray, I. Sequen-tial neural likelihood: Fast likelihood-free inference withautoregressive flows. In The 22nd International Confer-ence on Artificial Intelligence and Statistics, pp. 837–848,2019.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficultyof training recurrent neural networks. In Internationalconference on machine learning, pp. 1310–1318. PMLR,2013.

Peng, J., Pedersoli, M., and Desrosiers, C. Mutual informa-tion deep regularization for semi-supervised segmenta-tion. In Medical Imaging with Deep Learning, pp. 601–613. PMLR, 2020.

Poole, B., Alemi, A. A., Sohl-Dickstein, J., and Angelova,A. Improved generator objectives for gans. arXiv preprintarXiv:1612.02780, 2016.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., andTucker, G. On variational bounds of mutual information.In International Conference on Machine Learning, pp.5171–5180. PMLR, 2019.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,Radford, A., and Chen, X. Improved techniques for train-ing gans. In Advances in neural information processingsystems, pp. 2234–2242, 2016.

Shao, J., Wang, Q., and Liu, F. Learning to sample: anactive learning framework. In 2019 IEEE InternationalConference on Data Mining (ICDM), pp. 538–547. IEEE,2019.

Sisson, S. A., Fan, Y., and Tanaka, M. M. Sequential montecarlo without likelihoods. Proceedings of the NationalAcademy of Sciences, 104(6):1760–1765, 2007.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K.,Scholkopf, B., and Lanckriet, G. R. Hilbert space embed-dings and metrics on probability measures. The Journalof Machine Learning Research, 11:1517–1561, 2010.

Tejero-Cantero, A., Boelts, J., Deistler, M., Lueckmann,J.-M., Durkan, C., Goncalves, P. J., Greenberg, D. S.,and Macke, J. H. sbi: A toolkit for simulation-basedinference. Journal of Open Source Software, 5(52):2505,2020. doi: 10.21105/joss.02505. URL https://doi.org/10.21105/joss.02505.

Wood, S. N. Statistical inference for noisy nonlinear eco-logical dynamic systems. Nature, 466(7310):1102–1104,2010.




(a) APT after 1 round (b) AALR after 1 round (c) SNL after 1 round (d) SNL + PAR after 1 round

(e) APT after 5 rounds (f) AALR after 5 rounds (g) SNL after 5 rounds (h) SNL + PAR after 5 rounds

(i) APT after inference (j) AALR after inference (k) SNL after inference (l) SNL + PAR after inference

Figure 15. Comparison of the marginal posterior distributions on the MNIST dataset by rounds.

Zhang, M., Bird, T., Habib, R., Xu, T., and Barber, D.Variational f-divergence minimization. arXiv preprintarXiv:1907.11891, 2019.

Zhao, S., Song, J., and Ermon, S. Infovae: Informationmaximizing variational autoencoders. arXiv preprintarXiv:1706.02262, 2017.



Figure 16. Comparison of the regenerated Fashion MNIST images after the inference.


Figure 17. Comparison of the regenerated SVHN images after the inference.


(a) APT after 1 round (b) AALR after 1 round (c) SNL after 1 round (d) SNL + PAR after 1 round

(e) APT after 10 rounds (f) AALR after 10 rounds (g) SNL after 10 rounds (h) SNL + PAR after 10 rounds

(i) APT after inference (j) AALR after inference (k) SNL after inference (l) SNL + PAR after inference

Figure 18. Comparison of the marginal posterior distributions on the SVHN dataset by rounds.

Posterior-Aided Regularization for Likelihood-Free Inference

Documents

Transcript of Posterior-Aided Regularization for Likelihood-Free Inference