
Properties of the Affine Invariant Ensemble Sampler’s ‘stretch move’ in high dimensions

David Huijser∗† Jesse Goodman∗ Brendon J. Brewer∗

August 22, 2017

Abstract

We present theoretical and practical properties of the affine-invariant ensemble sampler Markov Chain Monte Carlo method. In high dimensions, the sampler’s ‘stretch move’ has unusual and undesirable properties. We demonstrate this with an n-dimensional correlated Gaussian toy problem with a known mean and covariance structure, and a multivariate version of the Rosenbrock problem. Visual inspection of trace plots suggests the burn-in period is short. Upon closer inspection, we discover that the mean and the variance of the sampled output do not match the known values of the target distribution, and the chain takes a very long time to converge. This problem becomes severe as n increases beyond 50. We therefore conclude that the stretch move should not be relied upon (in isolation) in moderate to high dimensions. We also present some theoretical results explaining this behaviour.

Key words: Affine Invariant Ensemble Sampler, Stretch Move, Markov Chain Monte Carlo

1 Introduction

Since the introduction of the Markov Chain Monte Carlo methods (MCMC) (Metropolis, Rosenbluth, Rosenbluth et al., 1953), a large number of different algorithms have been developed. Popular examples include Metropolis-Hastings (Hastings, 1970), slice sampling (Neal, 2003) and Hamiltonian MCMC (Neal, 2011). Each has its own strengths and weaknesses. A recent innovative MCMC method is the affine-invariant ensemble sampler (AIES) introduced by Goodman & Weare (2009). Methods that are invariant under affine transformation of the parameter space offer much promise for highly dependent target distributions, because there is usually some affine transformation that would make the target density much easier to sample from, and the sampler performs identically on the untransformed problem as it would on the transformed one.

The intuition behind the ‘stretch move’ of the AIES (described in Section 2) is compelling. It is also straightforward to implement because the user doesn’t need to define a proposal distribution, or add any additional information except the ability to evaluate a function proportional to the density of the target distribution. This allowed Foreman-Mackey, Hogg, Lang et al. (2013) to develop a high quality Python software implementation, emcee, where the user only needs to implement a function that evaluates the unnormalized target density. Performance comparisons (e.g. Lampart, 2012) show that the AIES is competitive with other common techniques, and outperforms them on certain kinds of target distribution. As a result, the algorithm has become popular, especially in astronomy (Vanderburg, Montet, Johnson et al., 2015; Crossfield, Petigura, Schlieder et al., 2015).

However, questions remain about the behaviour of the AIES on high dimensional problems. In Section 3 we test the method on a correlated Gaussian target distribution and discuss the observed behaviour that arises in high dimensions. In Section 5 we see the same problem on a more elaborate example. Section 6 consists of a mathematical exploration of this behaviour.

∗ Department of Statistics, The University of Auckland, Private Bag 92019, Auckland 1142, New Zealand; [email protected], [email protected], [email protected]
† To whom correspondence should be addressed


2 The AIES algorithm

The AIES algorithm works by evolving a set of L samples, called walkers, of n parameters. A walker can be considered as a vector in the n-dimensional parameter space. One iteration of AIES involves a sweep over all L walkers. For each walker, a new position is proposed, and accepted with a Metropolis-Hastings type acceptance probability. The aim is to simulate the distribution specified by a target density function π(x) on the n-dimensional parameter space, but, as is common for ensemble methods, the target distribution is actually

\[
\prod_{i=1}^{L} \pi(\mathbf{x}_i), \tag{1}
\]

that is, the target distribution π independently replicated L times, once for each walker.

Several kinds of proposals are possible, and we describe the stretch move used in emcee. Superscripts will denote walkers and subscripts will denote coordinates. Thus $\mathbf{X}^{(j)}(t)$ means the position (in n-dimensional space) of the jth walker, $j = 1, \ldots, L$, at discrete time t during the algorithm, and $X^{(j)}_i(t)$ means the ith coordinate of that walker, $i = 1, \ldots, n$.

At each iteration t, each walker is updated in sequence. To update the kth walker, we select a complementary walker $\mathbf{Y} = \mathbf{Y}(t) = \mathbf{X}^{(j)}(t)$ with $j \neq k$ chosen uniformly, and define the proposal point

\[
\widetilde{\mathbf{X}} = Z\mathbf{X}^{(k)} + (1 - Z)\mathbf{Y} \tag{2}
\]

where Z is a real-valued stretching variable drawn according to the density

\[
g(z) \propto
\begin{cases}
\frac{1}{\sqrt{z}} & \text{if } z \in \left[\frac{1}{a}, a\right] \\
0 & \text{otherwise}
\end{cases} \tag{3}
\]

where a is an adjustable parameter, usually set to 2, which is considered a good value in essentially all situations (Foreman-Mackey, Hogg, Lang et al., 2013). Finally, the proposal $\widetilde{\mathbf{X}}$ is accepted to replace $\mathbf{X}^{(k)}$ with probability

\[
p(\mathbf{X}, \mathbf{Y}, Z) = \min\left(1,\; Z^{\,n-1}\,\frac{\pi\bigl(\widetilde{\mathbf{X}}\bigr)}{\pi\bigl(\mathbf{X}^{(k)}\bigr)}\right). \tag{4}
\]

Otherwise, $\mathbf{X}^{(k)}$ remains unchanged.

In words, one can imagine the two selected walkers, X (the one that might be moved) and Y (the one that helps construct the proposal), defining a line in parameter space. The proposal is to move the main walker X to a new position along the line connecting X to Y. The stretching variable Z defines how far the main walker moves along this line (either towards or away from Y) to obtain a proposed new position, with Z = 1 corresponding to no change. Similar to the single particle Metropolis-Hastings sampler, the acceptance probability depends on the ratio of the target densities at the current and proposal points, with an additional factor $Z^{n-1}$ arising because the proposed position is chosen from a one dimensional subset of the n-dimensional space.
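To make the update concrete, the following is a minimal Python sketch of a single stretch move for one walker. It assumes only that `log_pi` returns the log of an unnormalised target density; the function names and structure are ours for illustration and are not taken from emcee’s internals.

```python
import numpy as np

def sample_z(a=2.0, rng=np.random.default_rng()):
    # Draw Z with density g(z) proportional to 1/sqrt(z) on [1/a, a].
    # Inverse-transform sampling: if U ~ Uniform(0, 1), then
    # Z = ((a - 1) * U + 1)^2 / a has exactly this density.
    u = rng.uniform()
    return ((a - 1.0) * u + 1.0) ** 2 / a

def stretch_move(walkers, k, log_pi, a=2.0, rng=np.random.default_rng()):
    """One stretch move for walker k, given the (L, n) array of all walkers."""
    L, n = walkers.shape
    x_k = walkers[k]
    # Choose a complementary walker Y uniformly among the others (j != k).
    j = rng.integers(L - 1)
    if j >= k:
        j += 1
    y = walkers[j]
    z = sample_z(a, rng)
    proposal = z * x_k + (1.0 - z) * y                      # equation (2)
    # Acceptance probability (4): min(1, Z^(n-1) * pi(proposal) / pi(current)).
    log_accept = (n - 1) * np.log(z) + log_pi(proposal) - log_pi(x_k)
    if np.log(rng.uniform()) < log_accept:
        return proposal, True
    return x_k, False
```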

To enable parallel processing, the implementation in emcee performs several stretch moves simultaneously. The vector of walkers is split into two subsets $S^{(0)} = \{\mathbf{X}^{(k)} : k = 1, \ldots, L/2\}$ and $S^{(1)} = \{\mathbf{X}^{(k)} : k = L/2 + 1, \ldots, L\}$. In the first “half-iteration”, all walkers in $S^{(0)}$ are simultaneously updated according to the stretch move described above, with all the complementary walkers chosen from $S^{(1)}$. In the second half-iteration, the sets are switched and all walkers in $S^{(1)}$ are simultaneously updated, with complementary walkers chosen from $S^{(0)}$. The time variable t denotes the number of iterations, each consisting of a pair of half-iterations. Because of subtleties related to detailed balance, the main and complementary walkers must not be updated simultaneously, hence the splitting into two subsets $S^{(0)}$, $S^{(1)}$.

Later, we will also consider a simpler continuous-time variant without the subsets $S^{(0)}$, $S^{(1)}$. In this setup, at times t chosen according to an independent exponential clock, a main walker $\mathbf{X}^{(k)}(t)$ and a complementary walker $\mathbf{X}^{(j)}(t)$ are chosen uniformly among all walkers, and a stretch move is performed. To ensure consistency of the time variable, we set the overall rate of moves to equal L, the number of walkers, so that each walker is chosen as main walker once per unit of time, on average.

The walkers collectively – i.e., the vector $\mathbf{X}(t) = \bigl(\mathbf{X}^{(1)}(t), \ldots, \mathbf{X}^{(L)}(t)\bigr)$ in $n \times L$-dimensional space – form a Markov chain under either of these dynamics (either in discrete time or in continuous time, respectively). Properties of the scaling variable Z and the acceptance probability in (4) ensure that this Markov chain has an equilibrium distribution corresponding to the target density π. Specifically, the equilibrium distribution is that the walkers $\mathbf{X}^{(j)}$, $j = 1, \ldots, L$, are independent random samples from the density π. Under mild conditions, this is the unique equilibrium distribution and any initial distribution will approach it as t → ∞, provided that the initial points $\mathbf{X}^{(j)}(0)$ do not lie in an (n − 1)-dimensional affine subspace of the parameter space. In particular, this requires that L ≥ n + 1 in the non-parallel version. Hence, for each sufficiently large time t, empirical means such as

\[
\frac{1}{L}\sum_{j=1}^{L} f\bigl(\mathbf{X}^{(j)}(t)\bigr) \tag{5}
\]

can be used as approximations of the integral $\int_{\mathbb{R}^n} f(\mathbf{x})\pi(\mathbf{x})\,d\mathbf{x}$, corresponding to the mean $\mathbb{E}(f(\mathbf{X}))$ when X is a random variable with density function π. Similarly, empirical variances can be used to approximate $\mathrm{Var}(f(\mathbf{X}))$. These empirical means and variances can also be averaged over different values of t. If this averaging starts after the Markov chain has burned in and spans a sufficient time compared to the mixing time, the overall estimate will improve.

As its name suggests, the AIES is invariant under affine transformations of parameter space. To explain this property, suppose X has density π. Given an invertible n × n matrix A and an n-dimensional vector b, define the affine transformation x ↦ Ax + b and the random variable Q = AX + b. Then Q has density

\[
\pi'(\mathbf{q}) = \frac{\pi\bigl(A^{-1}(\mathbf{q} - \mathbf{b})\bigr)}{|\det A|}. \tag{6}
\]

The fact that the proposal point in (2) is a linear combination of existing walkers causes the AIES algorithm to be invariant under affine transformations:

Proposition 1 (Affine invariance property). Running the AIES with initial conditions $\mathbf{X}^{(j)}(0)$ and density π is equivalent to running the AIES with initial conditions $\mathbf{Q}^{(j)}(0) = A\mathbf{X}^{(j)}(0) + \mathbf{b}$ and density π′.

In particular, the AIES algorithm does not give special treatment to moves along the coordinate axes.

3 The AIES for sampling a high dimensional Gaussian

The AIES has been used with great success in various research projects (Vanderburg, Montet, Johnson et al., 2015; Crossfield, Petigura, Schlieder et al., 2015), and is especially popular in the astronomy community. However there is reason for caution if one tries to apply this method in higher dimensional problems (n > 50), as we will show. Unfortunately, the output from AIES may resemble the output of an MCMC algorithm “in equilibrium”, yet the points obtained from the AIES might not accurately represent the target distribution, with the true equilibrium taking much longer to achieve.

To investigate the properties of the AIES in n dimensions we chose a correlated n-dimensional Gaussian as the target distribution (see also the correlated Gaussian studied by Lampart (2012)). More specifically, the target distribution is a discrete-time Ornstein-Uhlenbeck process, also known as a discrete-time autoregressive process of order 1, hereafter referred to as an AR(1) process. This model is well suited to be used for benchmarking, because posterior distributions in Bayesian statistics are often approximately multivariate normal. Besides this, the AR(1) is also useful as a prior in time series modelling.


Table 1: The chosen values of the problem dimensionality n, along with the obtained mean µn and standard deviation σn of the effective sample (last half of the run), where the initial conditions were generated by drawing each coordinate of each walker independently from an over-dispersed normal N(0, 10²) distribution. The mean µn and the standard deviation σn were accurate for n = 10 and n = 50 but not for n = 100.

n      µn                   σn
10     -0.0163909857436     1.05804127909
50      0.0104536594961     0.971905913468
100    -0.498700028573      0.690744763472

The AR(1) distribution is the distribution of the random vector X whose coordinates are defined recursively by

\[
\begin{aligned}
X_1 &\sim N(0, 1) \\
X_2 \mid X_1 &\sim N(\alpha X_1, \beta^2) \\
X_3 \mid X_2 &\sim N(\alpha X_2, \beta^2) \\
&\;\;\vdots \\
X_n \mid X_{n-1} &\sim N(\alpha X_{n-1}, \beta^2)
\end{aligned} \tag{7}
\]

where N(µ, σ²) denotes a normal distribution, and α controls the degree of correlation from one coordinate to the next. We set $\beta = \sqrt{1 - \alpha^2}$ so the marginal distribution of all of the coordinates is N(0, 1). If we run MCMC to sample this target distribution, it should be straightforward to verify whether the output is correct, since the expected values and standard deviations of all coordinates are 0 and 1 respectively. To test the AIES, we arbitrarily chose the coordinate x1 as a probe of the convergence properties of the AIES.
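For reference, a possible Python implementation of the (unnormalised) AR(1) log density and a direct sampler following equation (7). The function and variable names are ours; the direct sampler is only useful for checking MCMC output against independent draws.

```python
import numpy as np

ALPHA = 0.9
BETA = np.sqrt(1.0 - ALPHA ** 2)   # so that each marginal is N(0, 1)

def ar1_log_density(x, alpha=ALPHA, beta=BETA):
    """Unnormalised log density of the AR(1) target, equation (7)."""
    x = np.asarray(x)
    logp = -0.5 * x[0] ** 2                    # X1 ~ N(0, 1)
    resid = x[1:] - alpha * x[:-1]             # Xi | X(i-1) ~ N(alpha * X(i-1), beta^2)
    logp += -0.5 * np.sum(resid ** 2) / beta ** 2
    return logp

def ar1_sample(n, alpha=ALPHA, beta=BETA, rng=np.random.default_rng()):
    """Direct draw from the AR(1) distribution, for checking MCMC output."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for i in range(1, n):
        x[i] = alpha * x[i - 1] + beta * rng.standard_normal()
    return x
```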

We sampled the AR(1) target distribution using emcee (Foreman-Mackey, Hogg, Lang et al., 2013) with α set to 0.9. We tested three values of the dimensionality: n = 10, n = 50, and n = 100, and set the number of walkers L to 2n in each case. Each run consisted of 200,000 iterations (each of which is a loop over all walkers). Each run was thinned to reduce the size of the output.

For reasons explained in the next section, for each value of the dimensionality n, we performed four separate runs, each of which had the starting positions sampled from four different widely dispersed distributions. These distributions are N(0, 5²), N(1, 5²), N(−1, 5²), and N(1, 10²). The initial conditions were generated by drawing each coordinate of each walker independently from these distributions.
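A sketch of how such a run might be set up with emcee is shown below. The constructor and run_mcmc call follow emcee’s documented interface, but attribute names and array layouts differ slightly between emcee versions (the shape comment assumes the emcee 2.x convention), and thinning is omitted here.

```python
import numpy as np
import emcee

n = 100                  # dimensionality
L = 2 * n                # number of walkers
n_iterations = 200_000

rng = np.random.default_rng()
# Over-dispersed initial conditions: each coordinate of each walker drawn i.i.d. from N(0, 5^2).
p0 = rng.normal(loc=0.0, scale=5.0, size=(L, n))

# ar1_log_density is the log target density defined above.
sampler = emcee.EnsembleSampler(L, n, ar1_log_density)
sampler.run_mcmc(p0, n_iterations)
chain = sampler.chain    # shape (L, n_iterations, n) in the emcee 2.x convention
```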

For a properly working MCMC method applied to this problem, the output should have an observed mean µ ≈ 0 and an observed standard deviation σ ≈ 1. However, the observed values of σ for the n = 100 run (displayed in Table 1) are smaller than the true value σ = 1. The results for n = 50 also appear to be suspect, but to a lesser degree.

The output from each run consists of a three-dimensional array with dimensions (number of walkers, number of iterations/thinning factor, number of dimensions). To visualise the convergence properties, this array was “flattened” to an array which only contains the values of variable x1. The process of “flattening” reduces a two dimensional array, which contains the values of coordinate x1 of all walkers at every iteration, to a one dimensional array. Therefore the final array consists of a concatenation of $X^{(j=1..L)}_1(t=0),\; X^{(j=1..L)}_1(t=1),\; X^{(j=1..L)}_1(t=2),\; \ldots,\; X^{(j=1..L)}_1(t=200{,}000)$ in this specific order.
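Assuming the (L, T, n)-shaped chain array from the earlier sketch, the flattening just described could be performed along these lines (variable names are ours):

```python
# chain has shape (L walkers, T stored iterations, n dimensions)
x1 = chain[:, :, 0]          # first coordinate of every walker at every stored iteration
# Concatenate all walkers at t = 0, then all walkers at t = 1, and so on.
x1_flat = x1.T.reshape(-1)   # length L * T, ordered by iteration, then walker
```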

Figure 1a displays the trace plot of the flattened array of coordinate x1 for n = 10, where the dashed line displays the running average over the last 50% of the elapsed time, and the dash-dotted line displays the running standard deviation over the last 50% of the elapsed time. The running average and running standard deviation were chosen because they discard the first 50% of the elapsed time, and therefore exclude more of the output (as potential “burn-in”) as time proceeds. Figure 1a displays a short burn-in, and the chain seems to come to an equilibrium quite quickly. Figure 1d displays empirical values of the mean of x1 and the variance of x1 over all walkers as a function of time.


Figure 1: Results of an emcee run on an AR(1) target distribution with α = 0.9. Each run consisted of 200,000 iterations (each iteration being a sweep over all walkers). The results are shown for n = 10, n = 50, and n = 100 dimensional versions of the target distribution.
(a–c) Flattened trace plots of the first coordinate x1 for n = 10, n = 50 and n = 100. The x-axis is proportional to CPU time. The running means and standard deviations are averaged over the second half of the run, so progressively exclude more of the initial part of the run as time increases. The n = 10 and n = 50 runs give more or less accurate results by the end of the run, but σ is too small in the n = 100 run even though the trace plot might look satisfactory to the eye.
(d–f) Mean values (grey) and variance (black) of x1, averaged over all walkers at a fixed time, as a function of thinned iteration t.
(g–i) Scatter plots of binned y1 vs. y2, which are defined as the averages taken over all walkers of parameters x1 and x2 for the second half of the run. The marginal distributions of y1 and y2 should resemble the n = 10 plot; for the n = 100 run, the correlation is incorrect.
(j–l) Trace plots of the coordinate x1 for individual walkers j = 1, 3, 5.


Figure 1g displays the joint distribution of coordinates x1 and x2 of the entire ensemble for the second half of the run and it clearly shows the correlation between the two coordinates as expected. Figure 1j displays the trace plot of coordinate x1 for several different walkers and it displays reasonable mixing and sufficient convergence. The final mean µn=10 and σn=10 measured over the last 50% of the chain, displayed in Table 1, are close to the desired values.

If we perform a similar analysis in higher dimensions, for example n = 50, the results are close to the desired values. Visually, the trace plot in Figure 1b seems to suggest that the fast burn-in period has passed at around the halfway point, and the density plot displayed in Figure 1h also shows the correlation between the two coordinates as expected. Figure 1e shows that both the mean and the variance of x1 exhibit a sudden decrease from the initial standard deviation of 2. The variance of the walkers eventually recovers, but this takes a long time. The trace plots of the first coordinate for different walkers, displayed in Figures 1j, 1k and 1l, show reasonable mixing for n = 10 and slower mixing for n = 50 and n = 100. The hope is that the large number of independent walkers compensates for the slow movement of each walker.

For n = 100, the results do not accurately represent the target distribution. To the eye, the trace plot displayed in Figure 1c seems to show a successful MCMC run. However, the density plot of x1 and x2 (Figure 1i) has too small a variance and correlation. The graph of the variance of x1 for the walkers, displayed in Figure 1f, shows an initial sudden drop in variance which again takes a very long time to recover. Apart from the initial fast transient, at no time in the run of 200,000 iterations (40 million likelihood evaluations) was the standard deviation of the walkers’ first coordinates greater than 1. The trace plot for n = 100 (Figure 1l) shows poor mixing, and the estimate σ is too small.

This is the main reason why the AIES should be used with caution in high dimensions. The output can resemble a successful run while in reality, the algorithm is still going through an initial transient phase that takes a long time. Therefore the final sample does not represent the target distribution properly.

Roughly speaking, the burn-in process of the stretch move appears to have two distinct stages: a fast initial transient, followed by a much slower phase. As we shall explain in Section 6, the fast stage reflects convergence among the “bulk” of the coordinates to be consistent with the correlation structure of the AR(1) distribution. However, the stretch moves performed during this fast stage have serious and undesirable side-effects for the ensemble of first coordinates.

4 Convergence Diagnostics for Ensemble methods

In theory, if a Markov chain Monte Carlo method is run for a large number of iterations, the effect of initial values will decrease to zero. Ideally the initial distribution would approach the target distribution at a certain point during the run after a relatively small number of iterations. A Markov Chain is considered converged if the probability distribution of its state is approximately the target distribution. In principle, the crux is to estimate the number of iterations T sufficient for convergence a priori. In practice, however, it is more convenient to try to estimate whether convergence has been achieved by examining the output itself. Based on the assumption that it takes T iterations for the chain to converge, a chain is usually run for some number of iterations much greater than T (such as 2T) to obtain usable output.

In this section, we analyse the results from the previous section using formal convergence diagnostics, to see whether the failure of the AIES is detectable using these methods. Caution is required when using single-particle MCMC convergence diagnostics, since the individual walker sequences $\mathbf{X}^{(l)}(t)$ might not be independent, or even Markovian. In general, a walker sequence and the entire ensemble do not converge at the same rate. A straightforward way to make convergence diagnostics applicable to an ensemble method is to use a function which combines the information from all walkers into a single number, and apply the diagnostics to the obtained results. While the sequence of values of this summary function does not have the Markov property, much of the reasoning behind convergence tests still applies, at least approximately. The obvious choice for this function would be the average or the variance of a coordinate, taken over all the walkers. At each iteration t, for each run m, and each parameter i, the average taken over all walkers is defined by

\[
\mu(t)^{(i)}_m = \frac{1}{L}\sum_{l=1}^{L} X(t)^{(i)}_{l,m} \tag{8}
\]

and the variance is defined by

\[
\sigma(t)^{(i)}_m = \frac{1}{L}\sum_{l=1}^{L}\Bigl(X(t)^{(i)}_{l,m} - \mu(t)^{(i)}_m\Bigr)^2 \tag{9}
\]

where i indicates the parameter, t the iteration, and m the run. This should enable users to apply any single-particle MCMC diagnostic to the obtained results.
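A sketch of how equations (8) and (9) might be computed from a stored chain (shape conventions and names as in the earlier snippets, i.e. ours):

```python
import numpy as np

def ensemble_summaries(chain):
    """Per-iteration ensemble mean and variance, equations (8) and (9).

    chain: array of shape (L, T, n) -- walkers, stored iterations, parameters.
    Returns two arrays of shape (T, n): mu[t, i] and sigma2[t, i].
    """
    mu = chain.mean(axis=0)       # average over the L walkers, equation (8)
    sigma2 = chain.var(axis=0)    # variance over the L walkers (divides by L), equation (9)
    return mu, sigma2
```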

The Gelman-Rubin method (Gelman & Rubin, 1992) is a widely accepted diagnostic tool for assessment of MCMC convergence. However, it is designed to be applied to a single-particle method. Therefore the two functions mentioned before are used for the analysis. Since there might be a correlation between different parameters, the method presented in this paper is based on the multivariate approach of Brooks & Gelman (1998). The Gelman-Rubin method is based on M ≥ 2 independent chains, whose initial conditions were drawn from M different overly-dispersed distributions. The process starts with independently simulating these M chains, and discarding the first T iterations. After that, matrices B and W are constructed from $y(t)^{(i)}_m$, which contains the results of any appropriate function applied to the walkers. The two functions chosen for $y(t)^{(i)}_m$ in this paper are the average over the walkers as defined in equation (8) and the variance over the walkers as defined in equation (9). Here M indicates the number of chains, and T the number of iterations. Matrix B/T is the n-dimensional between-sequence covariance matrix estimate of the n-dimensional function values y taken over all walkers:

\[
B/T = \frac{1}{M-1}\sum_{j=1}^{M} (\bar{\mathbf{y}}_{j\cdot} - \bar{\mathbf{y}}_{\cdot\cdot})(\bar{\mathbf{y}}_{j\cdot} - \bar{\mathbf{y}}_{\cdot\cdot})' \tag{10}
\]

Matrix W is the within-sequence covariance matrix estimate of the n-dimensional average of the walkers y:

\[
W = \frac{1}{M(T-1)}\sum_{j=1}^{M}\sum_{t=1}^{T} (\mathbf{y}_{jt} - \bar{\mathbf{y}}_{j\cdot})(\mathbf{y}_{jt} - \bar{\mathbf{y}}_{j\cdot})' \tag{11}
\]

Using the previously defined matrices one can calculate V, which is the estimate of the posterior variance-covariance matrix:

\[
V = \frac{T-1}{T}\,W + \left(\frac{M+1}{M}\right)\frac{B}{T} \tag{12}
\]

The quantity of interest to establish convergence is the rotationally invariant distance measure between V and W (Brooks & Gelman, 1998). This distance measure is the maximum scale reduction factor (SRF) of any linear projection of y, and is given by

\[
R^n = \frac{T-1}{T} + \left(\frac{M+1}{M}\right)\lambda_1 \tag{13}
\]

where λ1 is the largest eigenvalue of the positive matrix

\[
W^{-1}B/T. \tag{14}
\]

The multivariate potential scale reduction factor $R^n$ should approach 1 from above as λ1 → 0 for convergence. These computations are impossible if W is a singular matrix, and the results will suffer severe inaccuracies if W is close to being singular. The standard method to obtain the eigenvalues of $W^{-1}B$ involves solving B = WX; however, standard routines like eigen in R or numpy.linalg.eigvals in Python are likely to suffer from numerical instability. For efficiency and numerical stability our analysis uses a Cholesky decomposition, similar to the Gelman-Rubin diagnostics in CODA (Plummer et al., 2006).
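The following sketch condenses this computation (equations (10)–(13)) into a single function, using a Cholesky factor of W for the eigenvalue step as described above. It is our own illustration, not the CODA implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigvalsh

def multivariate_psrf(y):
    """Multivariate PSRF of Brooks & Gelman (1998).

    y: array of shape (M, T, n) -- M chains, T iterations, n-dimensional summary
       (e.g. the per-iteration ensemble means or variances from equations (8)-(9)).
    """
    M, T, n = y.shape
    chain_means = y.mean(axis=1)                       # shape (M, n)
    grand_mean = chain_means.mean(axis=0)              # shape (n,)

    # Between-sequence covariance B/T, equation (10).
    d = chain_means - grand_mean
    B_over_T = d.T @ d / (M - 1)

    # Within-sequence covariance W, equation (11).
    centred = y - chain_means[:, None, :]
    W = np.einsum('mti,mtj->ij', centred, centred) / (M * (T - 1))

    # Largest eigenvalue of W^{-1} B/T via a Cholesky factor of W, equation (14).
    Lchol = cholesky(W, lower=True)
    tmp = solve_triangular(Lchol, B_over_T, lower=True)
    sym = solve_triangular(Lchol, tmp.T, lower=True)   # L^{-1} (B/T) L^{-T}, same eigenvalues
    lam1 = eigvalsh(sym).max()

    # Equation (13).
    return (T - 1) / T + (M + 1) / M * lam1
```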

The starting distribution can still influence the final distribution after many iterations (Gelman & Rubin, 1992). Therefore the Gelman-Rubin method demands that the starting distribution be over-dispersed. In an ensemble method this condition is more subtle. One can sample each walker in such a way that the ensemble represents an overly-dispersed distribution; however, if you do this for each of the M runs used in the Gelman-Rubin method, each run will still represent the same distribution, and it will be challenging for the Gelman-Rubin method to determine any lack of convergence. Therefore we propose that the starting positions of the walkers of each run are sampled from M different distributions. This should enable the method to detect whether each of the ensembles migrates toward the target distribution.

Table 2: Results of diagnostics for the correlated Gaussian problem, where the initial conditions were generated by drawing each coordinate of each walker independently from four different normal distributions — N(0, 5²), N(1, 5²), N(−1, 5²), N(0, 10²). The diagnostics were applied to the mean over the walkers µ and the variance over the walkers σ; shown are the multivariate ensemble PSRF obtained from the Gelman-Rubin diagnostics and the Heidelberger-Welch result (CODA).

n      R^n_µ    R^n_σ    H-W µ     H-W σ
10     1.005    1.009    PASSED    PASSED
50     1.233    1.121    PASSED    FAILED
100    2.238    1.688    PASSED    FAILED

For each set of parameters n = 10, n = 50, and n = 100 we performed 4 independent runs with 4 different initial conditions drawn from the Gaussian distributions N(0, 5²), N(1, 5²), N(−1, 5²) and N(0, 10²). The values of the multivariate potential scale reduction factors $R^n_\mu$ and $R^n_\sigma$ are displayed in Table 2. For n = 10 both $R^n_\mu$ and $R^n_\sigma$ are close to one, which suggests convergence was achieved. For n = 50, $R^n_\mu$ and $R^n_\sigma$ are somewhat close to 1. However, for n = 100 $R^n_\mu$ and $R^n_\sigma$ are much greater than one, which indicates a strong lack of convergence, in agreement with the conclusions reached in the previous section.

As an additional convergence diagnostic, the Heidelberger-Welch test, implemented in the CODA package (Plummer et al., 2006) in R, was chosen. The CODA implementation performs two tests: the Heidelberger-Welch test and the half-width test. The Heidelberger-Welch test is a convergence test which uses the Cramér-von Mises statistic to test the null hypothesis that the sampled values come from a stationary distribution (Plummer et al., 2006). The test is initially applied to the whole chain; however, if the chain fails the test, a percentage at the beginning of the chain is discarded. This process is repeated until either the test is passed or 50% is discarded. The half-width test calculates a 95% confidence interval for the mean, using the portion of the chain which passed the stationarity test, and then calculates half the width of this interval, which is compared with the estimate of the mean. If the ratio between the half-width and the mean is lower than ε, the half-width test is passed (Plummer et al., 2006). In this research the results of the half-width test are considered of little interest, because they are very subjective due to the dependence on the choice of the ε-parameter value.

The Heidelberger-Welch test is applied to chains of the mean µ and the variance σ as defined in equations (8) and (9), where the first 50% is already discarded. Therefore we consider this test passed only if it passes without discarding any more of the chain. Only for n = 10 did both the mean µ and σ chains pass the Heidelberger-Welch test without discarding any part of the chain beyond the 100,000 iteration burn-in. These results are summarized in Table 2.

Even though it is good practice for every user of MCMC methods to use some convergence diagnostics on the obtained samples, we strongly suggest that in the case of the AIES it is not only good practice, but a necessity, since visual inspection of the results might be deceiving in high dimensions.


5 Rosenbrock example

As a second test on the convergence properties of the stretch move, we investigated an n-dimensional generalisation of the Rosenbrock density (Rosenbrock, 1960), in the form proposed by Dixon & Mills (1994). The target density is proportional to exp(−f(x)), where

\[
f(\mathbf{x}) = \sum_{i=1}^{n/2}\Bigl[100\bigl(x_{2i-1}^2 - x_{2i}\bigr)^2 + (x_{2i-1} - 1)^2\Bigr]. \tag{15}
\]

This is simply n/2 independent replications of a two-dimensional Rosenbrock density, and should be fairly straightforward to sample. Again, we used 200,000 iterations, but increased the number of walkers to L = 10n. We tested dimensionalities of n = 10, n = 50, and n = 100, so the corresponding overall numbers of likelihood evaluations were 2 × 10⁷, 1 × 10⁸, and 2 × 10⁸ respectively.
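For reference, a possible implementation of the corresponding log density (our naming; n is assumed even):

```python
import numpy as np

def rosenbrock_log_density(x):
    """Unnormalised log density: minus the sum in equation (15), for even n."""
    x = np.asarray(x)
    odd = x[0::2]    # x_1, x_3, ... (the x_{2i-1} coordinates)
    even = x[1::2]   # x_2, x_4, ... (the x_{2i} coordinates)
    return -np.sum(100.0 * (odd ** 2 - even) ** 2 + (odd - 1.0) ** 2)
```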

Inspection of the graphs displayed in Figure 2 shows proper mixing for n = 10; however, the mixing for n = 50 and n = 100 is far from desirable. Upon visual inspection, the flattened trace plots for all dimensionalities (Figures 2a, 2b, 2c) show no indication of convergence problems. The plots of the running variance and running mean (Figures 2d, 2e) seem to indicate slow but steady convergence; however, Figure 2f doesn’t look promising. A good indicator for convergence is provided by the binned scatter plots of the means of two parameters taken over the walkers (Figures 2g, 2h, 2i). These suggest that for n = 10 the target distribution is properly sampled; however, the plots for n = 50 and n = 100 display some features which might indicate convergence issues. The individual trace plots for different walkers (Figures 2j, 2k and 2l) fail to reveal any underlying problems. The multivariate scale reduction factors $R^n_\mu$ and $R^n_\sigma$ displayed in Table 3 suggest a lack of convergence for all models, which is unexpected for n = 10, which seems to converge properly according to the graphs displayed in Figures 2a and 2g.

While the AIES fails (at least for n = 50 and n = 100), other methods succeed on this problem. Simple single-particle Metropolis, with a scale mixture of Gaussians (around the current position) as the proposal, succeeds with the equivalent computational cost (2 × 10⁸ likelihood evaluations), and produces about 100 effectively independent samples, by inspection of the empirical autocorrelation function. Of course, if the AIES had converged, the final state of the walkers would have yielded 1000 independent samples.

Table 3: The multivariate PSRF obtained from the Gelman-Rubin diagnostics for the Rosenbrock problem, where the initial conditions were generated by drawing each coordinate of each walker independently from four different normal distributions N(0, 5), N(1, 5), N(−1, 5), N(0, 10). The diagnostics were applied to the mean over the walkers µ and the variance over the walkers σ; shown are the multivariate PSRF obtained from the Gelman-Rubin diagnostics and the Heidelberger-Welch result (CODA).

n      R^n_µ    R^n_σ    H-W µ     H-W σ
10     1.74     2.11     FAILED    PASSED
50     216      137      FAILED    FAILED
100    930      330      FAILED    FAILED


Table 4: Estimates of the expected value and standard deviation of x1 in the Rosenbrock problem. The true values are approximately 1.0 and 0.7.

n      µn                  σn
10     0.912880664359      0.775798053317
50     0.219674347615      0.690989906816
100    0.397340115775      0.741348730082

6 Theoretical causes of the behaviour of the AIES for sampling a high dimensional Gaussian

The results in Section 3 suggest studying the limiting behaviour of the AIES in an appropriate limit as n → ∞. This limit is somewhat complicated, not least because it requires the number of walkers to be large since L ≥ n + 1. We begin with a description of the AIES in the limit L → ∞ for fixed n and π. Then we examine a single AIES move in the limit n → ∞ under a simplifying assumption. We then give a non-rigorous heuristic for the limit n → ∞ that explains the behaviour described in Section 3.

6.1 The AIES with many walkers

To study the limit L → ∞, it is convenient to use the continuous-time variant where the main and complementary walkers are selected uniformly among all walkers. Then the L walkers play symmetric roles and it is natural to collect them into the empirical measure

\[
\mu^{(L)}(t) = \frac{1}{L}\sum_{j=1}^{L} \delta_{\mathbf{X}^{(j)}(t)}, \tag{16}
\]

where $\delta_{\mathbf{x}}$ denotes the measure placing unit mass at $\mathbf{x} \in \mathbb{R}^n$. In words, the measure $\mu^{(L)}(t)$ encodes the distribution of a uniformly chosen walker at time t. Because of the assumption that both walkers X and Y are selected uniformly, it follows that $\mu^{(L)}(t)$ is itself a Markov chain.

Proposition 2. Choose the initial walkers independently according to the distribution µ0. Then, in the limit L → ∞, the empirical measure process $\mu^{(L)}(t)$ converges in distribution to a deterministic path $\mu_t$ with initial value µ0 and

\[
\frac{d}{dt}\int f(\mathbf{x})\,d\mu_t(\mathbf{x}) = \iint_{\mathbb{R}^n\times\mathbb{R}^n} \mathbb{E}\Bigl[\bigl(f\bigl(Z\mathbf{x} + (1-Z)\mathbf{y}\bigr) - f(\mathbf{x})\bigr)\,p(\mathbf{x},\mathbf{y},Z)\Bigr]\,d\mu_t(\mathbf{x})\,d\mu_t(\mathbf{y}), \tag{17}
\]

where the expectation is over the stretching variable Z.

To interpret this result, suppose X and Y are independent samples distributed according to the current empirical measure $\mu_t$. Choose Z according to the density in (3) and define $\widetilde{\mathbf{X}} = Z\mathbf{X} + (1 - Z)\mathbf{Y}$, as in (2). Set X′ to be $\widetilde{\mathbf{X}}$ with probability p(X, Y, Z) and X otherwise, and let $\mu'_t$ denote the measure encoding the distribution of X′, averaged over all the possibilities for X, Y and Z. Then Proposition 2 says that the measure $\mu_t$ evolves by travelling in the direction of the line (in the space of measures) joining $\mu_t$ to $\mu'_t$.

The intuition behind Proposition 2 is that the average effect of each move is to take $\mu_t$ in the direction toward $\mu'_t$. Each move changes a fraction 1/L of the measure $\mu^{(L)}(t)$, but this is offset by the fact that moves occur at rate L. The fact that the limiting dynamics are deterministic is established by examining products of empirical means.

The authors did not find any explicit solutions to the system of equations (17). (In the simplest case n = 1, $\pi(x) \propto \exp(-x^2/2)$, i.e., a one-dimensional standard normal density, any normal initial data that is not standard becomes non-normally distributed at positive times and does not appear to follow any simple trajectory.) However, it is possible to consider (17) for other choices for the function p. The simplest possible choice is to take p = 1, i.e., to accept proposals unconditionally. Even in this case, we still cannot solve (17), but we can make the following observation.


Figure 2: Results of an MCMC run of a Rosenbrock distribution using 200,000 iterations, for dimensionalities of n = 10, 50, 100. The number of walkers was 10n in each case.
(a–c) Flattened trace plots of the first coordinate x1 for n = 10, n = 50 and n = 100, and t = 200,000.
(d–f) Mean values (grey) and variance (black) of x1 over all walkers as a function of iteration t.
(g–i) Density plots of coordinates x1 and x2 from the second half of the run.
(j–l) Trace plots of the coordinate x1 for individual walkers j = 1, 3, 5.


Proposition 3. Consider the system of equations (17) where the function p(x, y, z) is replaced by the constant 1. If the ith coordinate has finite second moment under the initial measure, $\int x_i^2\,d\mu_0(\mathbf{x}) < \infty$, then its mean $\int x_i\,d\mu_t(\mathbf{x})$ is constant and its variance $\mathrm{Var}_{\mu_t}(x_i) = \int x_i^2\,d\mu_t(\mathbf{x}) - \bigl(\int x_i\,d\mu_t(\mathbf{x})\bigr)^2$ evolves according to

\[
\frac{d}{dt}\mathrm{Var}_{\mu_t}(x_i) = \mathbb{E}\bigl(Z(Z-1)\bigr)\,\mathrm{Var}_{\mu_t}(x_i). \tag{18}
\]

Thus the variance will either grow or decay exponentially, depending on whether $\mathbb{E}\bigl(Z(Z-1)\bigr)$ is positive or negative.
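To connect Proposition 3 with the stretch move’s proposal density (3), $\mathbb{E}\bigl(Z(Z-1)\bigr)$ can be evaluated for $g(z) \propto 1/\sqrt{z}$ on $[1/a, a]$. The short check below (our code, not from the paper) confirms that for the default a = 2 the expectation is positive (it equals 23/60 ≈ 0.383), so with p ≡ 1 the variance in Proposition 3 would grow exponentially.

```python
from scipy.integrate import quad

def moments_of_z(a=2.0):
    """First two moments of Z with density g(z) proportional to 1/sqrt(z) on [1/a, a]."""
    norm, _ = quad(lambda z: z ** -0.5, 1.0 / a, a)
    ez, _ = quad(lambda z: z ** 0.5, 1.0 / a, a)     # integrand z * z^{-1/2}
    ez2, _ = quad(lambda z: z ** 1.5, 1.0 / a, a)    # integrand z^2 * z^{-1/2}
    return ez / norm, ez2 / norm

ez, ez2 = moments_of_z(2.0)
print(ez, ez2, ez2 - ez)   # approximately 7/6, 31/20, and 23/60 = 0.3833...
```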

6.2 The AIES for a high-dimensional standard Gaussian

We next consider a single stretch move for the target density

\[
\pi'(\mathbf{x}) = c \cdot \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n} x_i^2\Bigr) \tag{19}
\]

corresponding to an n-dimensional standard normal distribution. The acceptance probability from (4) becomes $p(\mathbf{x},\mathbf{y},z) = \min\{1, \exp(h(\mathbf{x},\mathbf{y},z))\}$ where

\[
h(\mathbf{x},\mathbf{y},z) = (n-1)\log z - \frac{1}{2}\sum_{i=1}^{n}\bigl(z x_i + (1-z)y_i\bigr)^2 + \frac{1}{2}\sum_{i=1}^{n} x_i^2. \tag{20}
\]

To analyze $h(\mathbf{x},\mathbf{y},z)$, we make the following assumption about the randomly chosen walkers X and Y for the move under consideration.

Assumption 4. The coordinates $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are mutually independent and identically distributed (i.i.d.) with common mean µ and common variance σ².

Proposition 5. Subject to Assumption 4, $\frac{1}{n}h(\mathbf{X},\mathbf{Y},z) \to f_\sigma(z)$ where

\[
f_\sigma(z) = \log z - \sigma^2 z(z-1). \tag{21}
\]

The acceptance probability $p(\mathbf{X},\mathbf{Y},z)$ converges to 1 if $f_\sigma(z) > 0$ and to 0 if $f_\sigma(z) < 0$.

Proof. This is an application of the Law of Large Numbers:

\[
\begin{aligned}
\frac{h(\mathbf{x},\mathbf{y},z)}{n} &= -\frac{\log z}{n} + \frac{1}{n}\sum_{i=1}^{n}\Bigl(\log z - \frac{1}{2}(z^2-1)x_i^2 - z(1-z)x_i y_i - \frac{1}{2}(1-z)^2 y_i^2\Bigr) \\
&\to 0 + \mathbb{E}\Bigl(\log z - \frac{1}{2}(z^2-1)x_i^2 - z(1-z)x_i y_i - \frac{1}{2}(1-z)^2 y_i^2\Bigr) \\
&= \log z - \frac{1}{2}(z^2-1)(\sigma^2+\mu^2) - z(1-z)\,\mu\cdot\mu - \frac{1}{2}(1-z)^2(\sigma^2+\mu^2) \\
&= \log z - \sigma^2\,\frac{z^2 - 1 + (1-z)^2}{2} - \mu^2\,\frac{z^2 - 1 + 2z(1-z) + (1-z)^2}{2} \\
&= f_\sigma(z). \tag{22}
\end{aligned}
\]

The convergence of $p(\mathbf{X},\mathbf{Y},z) = \min(1, \exp(h(\mathbf{X},\mathbf{Y},z)))$ follows because either $h(\mathbf{X},\mathbf{Y},z) \to \infty$ or $h(\mathbf{X},\mathbf{Y},z) \to -\infty$ depending on the sign of $f_\sigma(z)$.

The behaviour of $f_\sigma(z)$ for three values of σ is shown in Figure 3. When σ > 1, $f_\sigma(z)$ is positive for z slightly smaller than 1. When σ < 1, $f_\sigma(z)$ is positive for z slightly larger than 1. In the critical case σ = 1, $f_\sigma(z)$ is always negative and a Taylor expansion gives

\[
f_1(z) \approx -\frac{3}{2}(z-1)^2 \quad \text{for } z \text{ close to } 1. \tag{23}
\]
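The sign pattern just described can be checked numerically; the small sketch below (our code) evaluates $f_\sigma(z)$ on a grid of z values for a few values of σ and reports where it is positive, i.e. where a proposal would be accepted with probability approaching 1 as n → ∞.

```python
import numpy as np

def f_sigma(z, sigma):
    """Limit of h/n from Proposition 5, equation (21)."""
    return np.log(z) - sigma ** 2 * z * (z - 1.0)

z = np.linspace(0.5, 2.0, 2001)        # the support of g(z) for a = 2
for sigma in (0.5, 1.0, 2.0):
    accepted = z[f_sigma(z, sigma) > 0]
    if accepted.size:
        print(f"sigma={sigma}: f_sigma > 0 for z roughly in [{accepted.min():.3f}, {accepted.max():.3f}]")
    else:
        print(f"sigma={sigma}: f_sigma <= 0 everywhere, so only z near 1 survives for large n")
```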

The interpretation of Assumption 4 and Proposition 5 is as follows. Freeze the AIES algorithm at time t. The walkers X and Y to be used in the next move will be drawn independently from the current population of walkers. Assume without proof that the coordinates $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are i.i.d. (or at least are sufficiently close to i.i.d. for the conclusion of Proposition 5 to hold).


Then the acceptance probability at the next move is effectively independent of the actual walkers X and Y. Furthermore the dependence on z is determined only by the variance σ². To illustrate this effect, we ran three different MCMC runs on an uncorrelated Gaussian, where each run consists of 1000 repetitions of two consecutive steps. The first step is a (re-)initialization step where the code draws initial conditions, and the second step consists of one regular AIES iteration. Each of these 1000 AIES iterations was started with a re-initialized initial condition obtained in the first step, where each of the n components of every one of the L walkers is drawn from a Gaussian distribution N(0, σ²) with σ² ∈ {0.1, 1, 2}. We obtained the accepted z-values, and created the histograms displayed in Figure 4. For σ ≠ 1, a sharp cutoff is observed across the value Z = 1, becoming more pronounced as n increases. Values of z on the predicted side of 1 are accepted almost unconditionally. For σ = 1, the approximation $p(\mathbf{X},\mathbf{Y},z) \approx \exp\bigl(-cn(z-1)^2\bigr)$ suggests that the accepted values of Z will be roughly normally distributed with a spread that decreases with n, and this prediction is reflected in the histograms.

Figure 3: The graph of the function $F(z) = n\bigl(\log z - \sigma^2(t)\,z(z-1)\bigr)$ as a function of z for σ(t) = 0.1 (dash-dotted line), σ(t) = 1 (dotted line), and σ(t) = 2 (dashed line).

Figure 4: Histograms of the accepted z for n = 10, 50, 100 and σ(0) = 0.1, 1.0, 2.0, for an MCMC run of an uncorrelated Gaussian with t = 1000. Panels (a–c): n = 10; (d–f): n = 50; (g–i): n = 100; within each row, σ(0) = 0.1, 1.0, 2.0 from left to right.


6.3 The AIES for a high-dimensional correlated Gaussian

We now turn to the correlated AR(1) model from Section 3. Based on the analysis of Sections 6.1 and 6.2, we present a heuristic that explains the behaviour observed in Section 3.

Because of Proposition 1, we can apply an affine transformation that turns this correlated Gaussian distribution into an uncorrelated one. Recalling (7), the relevant transformation ψ(x) will be

\[
\begin{aligned}
q_1 &= \psi_1(\mathbf{x}) = x_1, & x_1 &= q_1, \\
q_i &= \psi_i(\mathbf{x}) = \frac{x_i - \alpha x_{i-1}}{\beta}, & x_i &= \alpha x_{i-1} + \beta q_i, \quad i \geq 2.
\end{aligned} \tag{24}
\]

A random variable X has the AR(1) distribution if and only if the corresponding Q has the n-dimensional standard normal distribution. This problem therefore falls in the setup of Section 6.2 with density π′ when expressed in terms of the q-coordinate system.

Note that the coordinate q1 plays a distinguished role, directly measuring the quantity of interest from the original system. (Different transformations can be used to emphasise other quantities from the original system.) The coordinates q2, . . . , qn can be thought of as encoding how closely the coordinates xi conform to the correlation structure of the AR(1) distribution.
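A direct translation of (24) into code (ours) that maps between the two coordinate systems; it can be used, for example, to monitor the q-coordinates of the walkers during a run:

```python
import numpy as np

def x_to_q(x, alpha=0.9):
    """psi from equation (24): q1 = x1, qi = (xi - alpha*x(i-1)) / beta for i >= 2."""
    beta = np.sqrt(1.0 - alpha ** 2)
    q = np.empty_like(x, dtype=float)
    q[0] = x[0]
    q[1:] = (x[1:] - alpha * x[:-1]) / beta
    return q

def q_to_x(q, alpha=0.9):
    """Inverse of psi: x1 = q1, xi = alpha*x(i-1) + beta*qi for i >= 2."""
    beta = np.sqrt(1.0 - alpha ** 2)
    x = np.empty_like(q, dtype=float)
    x[0] = q[0]
    for i in range(1, len(q)):
        x[i] = alpha * x[i - 1] + beta * q[i]
    return x
```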

We will analyse the AIES algorithm from the perspective of the q-coordinate system. To begin, we must specify the initial walker coordinates. In practice, we are unlikely to have any knowledge of the q-coordinate system. Therefore the most obvious choice is to generate i.i.d. initial coordinates $X^{(j)}_i(0)$ in the x-coordinate system. By (24), the initial q-coordinates have variances

\[
\begin{aligned}
\mathrm{Var}\bigl(Q^{(j)}_1(0)\bigr) &= \mathrm{Var}\bigl(\psi_1(\mathbf{X}^{(j)}(0))\bigr) = \mathrm{Var}\bigl(X^{(j)}_1(0)\bigr), \\
\mathrm{Var}\bigl(Q^{(j)}_i(0)\bigr) &= \mathrm{Var}\bigl(\psi_i(\mathbf{X}^{(j)}(0))\bigr) = \frac{1+\alpha^2}{\beta^2}\,\mathrm{Var}\bigl(X^{(j)}_i(0)\bigr), \quad i \geq 2.
\end{aligned} \tag{25}
\]

Note that the variance for i ≥ 2 is larger by a factor $\frac{1+\alpha^2}{1-\alpha^2}$ compared to i = 1. This factor becomes large in the highly correlated case where α is close to 1.

To proceed with our heuristic analysis, we introduce without proof the following assumption.

Assumption 6. At every time t in the AIES algorithm, the q-coordinates $\bigl(Q^{(J)}_i(t),\, i = 1, \ldots, n\bigr)$ (possibly excluding the first coordinate) of a randomly chosen walker are i.i.d. with common mean µ(t) and common variance σ(t)² – or at least are sufficiently close to i.i.d. that the conclusions of Proposition 5 apply.

In reality, the assumption of independence will not hold even at time t = 0; the initial q-coordinates are only weakly correlated in the sense that $\mathrm{Cov}\bigl(Q^{(j)}_i(0), Q^{(j)}_{i'}(0)\bigr) = 0$ if |i − i′| ≥ 2. Even if the initial q-coordinates were chosen in an i.i.d. way, there would be no reason for the AIES dynamics to preserve this property at later times. However, Proposition 5 depends only on a Law of Large Numbers effect, and at the level of a heuristic it is reasonable to expect this effect to be robust.

Subject to Assumption 6, Proposition 5 asserts that the acceptance probability p(X, Y, Z) is essentially independent of the actual walker positions X and Y. In particular, it is essentially independent of the actual first q-coordinates ψ1(X) and ψ1(Y). From the perspective of the first coordinates only, the AIES dynamics are approximated [1] by the following:

1. Select walkers X and Y, stretching variable Z, and proposal $\widetilde{\mathbf{X}} = Z\mathbf{X} + (1-Z)\mathbf{Y}$ as usual. Write $\mathbf{Q} = \psi(\mathbf{X})$, $\mathbf{U} = \psi(\mathbf{Y})$, and $\widetilde{\mathbf{Q}} = \psi(\widetilde{\mathbf{X}})$.

2. Define the modified acceptance probability

\[
p' = \min\Bigl(1,\; Z^{\,n-1}\exp\Bigl(-\frac{1}{2}\sum_{i=2}^{n}\bigl((Z Q_i + (1-Z)U_i)^2 - Q_i^2\bigr)\Bigr)\Bigr) \tag{26}
\]

solely in terms of q-coordinates 2 to n. In coordinates 2 to n, update from $\mathbf{Q}$ to $\widetilde{\mathbf{Q}}$ with probability p′.

3. For each move accepted in step 2, apply the same move to the first coordinate with probability 1.

[1] A careful justification of this approximation involves more than the fact that p(X, Y, z) is largely independent of X and Y. Specifically, the justification is that the probabilities p(X, Y, Z) and p′ are unlikely to have a large difference. This holds because, according to Proposition 5, the quantity h(X, Y, Z) is likely to be either large and positive (in which case it is unlikely to become negative after removing the i = 1 term, so that both p and p′ will be 1) or large and negative (in which case removing the i = 1 term may make a difference, but both p and p′ will still be small). This reasoning will break down when h(X, Y, Z) has the possibility to be small, which will happen when Z and σ(t) are close to 1.

Figure 5: The variance of q over all walkers and coordinates for α = 0.9, t = 1000. Panels: (a) n = 10, (b) n = 50, (c) n = 100.

From the perspective of the first coordinates, the accepted stretching variables Z follow a modified distribution Z(t) that may vary over time as the other q-coordinates equilibrate. Stretching variables Z(t) also arrive at a reduced average frequency r(t) = E(p′). However, when they arrive, they are always accepted, as in Proposition 3. We therefore make the following predictions:

Prediction 7. Write

\[
\mathrm{Var}_t(X_1) = \frac{1}{L}\sum_{j=1}^{L} X^{(j)}_1(t)^2 - \Bigl(\frac{1}{L}\sum_{j=1}^{L} X^{(j)}_1(t)\Bigr)^2 \tag{27}
\]

for the empirical variance of X1 as determined by the walkers at time t. Then:

• When σ(t) ≫ 1, $\mathrm{Var}_t(X_1)$ will decrease rapidly, regardless of whether it is too large or too small compared to the true value 1.

• When σ(t) ≪ 1, $\mathrm{Var}_t(X_1)$ will increase rapidly, regardless of whether it is too large or too small compared to the true value 1.

• When σ(t) is close to 1, $\mathrm{Var}_t(X_1)$ will not converge quickly to the true value 1.

• Quantitatively,

\[
\frac{d}{dt}\mathrm{Var}_t(X_1) \approx r(t)\,\mathbb{E}\bigl(Z(t)(Z(t)-1)\bigr)\,\mathrm{Var}_t(X_1). \tag{28}
\]

To test Prediction 7, we graphed the average empirical standard deviation of all q-coordinates at times close to the fast burn-in phase: see Figure 5. The initial value is greater than 1, as predicted by (25), and has equilibrated around 1 after about 10 iterations. In Figure 6, the empirical variance $\mathrm{Var}_t(X_1)$ of all first coordinates is graphed over the same time interval. Especially for n = 50 and n = 100, the variance decreases rapidly at first, then levels off around the same time, 10 iterations. This confirms the qualitative parts of Prediction 7.

We also tested the quantitative prediction in equation (28). At five selected times t = 2, 8, 14, 30, 50, we overlaid the predicted tangent line from Prediction 7 – i.e., the line with slope given by the right-hand side of equation (28) – onto the graph of $\mathrm{Var}_t(X_1)$. The quantities r(t) and $\mathbb{E}\bigl(Z(t)(Z(t)-1)\bigr)$ in equation (28) depend on the hypothetical distribution of all Z values that would be accepted given the actual walkers at time t, and were therefore approximated by observing the accepted Z values from 100 auxiliary emcee iterations, each initialised with the actual walker positions at time t.

The results are shown in Figure 6. For n = 10, the predicted lines do not very closely track the underlying curve, but for n = 50 and n = 100 there is good agreement with Prediction 7.

Figure 6: In grey: the empirical variance $\mathrm{Var}_t(X_1)$, as estimated by the first coordinates $X^{(j)}_1(t)$ of all walkers at time t, for 0 ≤ t ≤ 60. Overlaid in black: five predicted “tangent” lines for the curve, at times t = 2, 8, 14, 30, 50, based on the slopes predicted in Equation (28). Panels: (a) n = 10, (b) n = 50, (c) n = 100.

7 Discussion

Even though the AIES has been used with success in the past on numerous occasions, we advise caution in high dimensional problems. The benchmark model we used to probe the problems of the AIES was a relatively simple model which already fails at n = 100, and we expect problems to be even worse for more strongly correlated, or more complex, models.

Unsurprisingly, other MCMC methods work more efficiently for the benchmark model we chose. For instance, if the target distribution is interpreted as a posterior distribution with an uncorrelated Gaussian as the prior, then the elliptical slice sampler of Murray, Adams & MacKay (2010) gives good results. If the target distribution is interpreted as a time-discretisation of a continuous process (in this case the Ornstein-Uhlenbeck process $U_t$, the centred Gaussian process with $\mathrm{Cov}(U_t, U_s) = e^{-|t-s|}$, over the time interval [0, αn]) then the ideas of Cotter, Roberts, Stuart et al. (2013) can be used to obtain an MCMC method that handles dimensionality well. However, these alternative methods require additional structure and analysis of the model. Especially if the model is more complicated, the possibility of slow convergence may be a reasonable price to pay for the simplicity and generality of the AIES; the difficulty here is that the slow convergence can be hard to detect.

Indeed, an important issue in practice is how to know whether the AIES is experiencing the kind of problems we describe. Evidently, it is not possible to rely on knowing the true target distribution, as we did.

Our analysis shows that the profile of accepted Z values of the correlated Gaussian – particularly whether they clustered on either side of z = 1 – gave relevant information about the system. In both of the test cases the adapted Gelman-Rubin diagnostic gave a good indication of lack of convergence. This method might fail if W, or both W and B, are singular, which might indicate an ill posed problem, or very high correlation between parameters.

Finally, trace plots are common tools to visually assess the performance of MCMC methods. Because the AIES is an ensemble method, some adaptations are necessary. For instance, the straightforward trace plots in Figures 1a, 1b and 1c, showing the first coordinates over all walkers and steps, give an impression of the range of likely values but give little insight into the different mixing properties. Instead, selecting a small number of walkers, as in Figures 1j, 1k and 1l, shows that individual walkers are mixing well (relative to the range of likely values) when n = 10, but not when n = 50 or n = 100. In Figure 1j for n = 10, it seems like the first coordinate of each walker is free to explore parameter space (between about −2 and 2, a region where most walkers appear to spend most of their time, according to Figure 1a). By contrast, in Figures 1k and 1l for n = 50 and n = 100, the first coordinates of each walker seem to be confined to much narrower regions (even accounting for the unduly restricted range of walker positions in Figures 1b and 1c) and the relative order among the sampled walkers changes much less frequently. In our examination of the AIES in this high-dimensional model, this was the only sign of slow convergence that we could identify without knowing the characteristics of the true target distribution. The performance of the AIES can be improved by implementing the optimal choice for parameters a and L. The stretch parameter a can be adjusted depending on the acceptance rate, which should typically be between 0.2 and 0.5. The acceptance fraction can be increased if it is too low by decreasing a, and it can be decreased if it is too high by increasing a. A large L would also improve the performance (Foreman-Mackey, Hogg, Lang et al., 2013).

In summary, high dimensions can bring problems that make the AIES converge slowly and, more disturbingly, appear to have converged even when it has not. Knowing the structure of the true distribution allowed us to make accurate predictions about the evolution of the AIES for our chosen model. Looking at several measures arising from the algorithm – the profile of accepted Z values, a subsample trace plot of a few walkers, the adapted Gelman-Rubin diagnostics and the adapted Heidelberger-Welch test – gave a possible signal of slow convergence. Such diagnostics, and a measure of caution, should be used when applying the AIES to high-dimensional problems.

8 Acknowledgments

We would like to thank Michael Betancourt, Bob Carpenter, Andrew Gelman, and Joerg Dietrich for valuable discussion and comments. We also thank the reviewers of an earlier version of this paper for their constructive criticisms.

References

BROOKS, S.P. & GELMAN, A. (1998). General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Stat. 7, 434–455.

COTTER, S.L., ROBERTS, G.O., STUART, A.M. & WHITE, D. (2013). MCMC methods for functions: Modifying old algorithms to make them faster. Stat. Sci. 28, 424–446. doi:10.1214/13-STS421.

CROSSFIELD, I.J.M., PETIGURA, E., SCHLIEDER, J.E., HOWARD, A.W., FULTON, B.J., ALLER, K.M., CIARDI, D.R., LEPINE, S., BARCLAY, T., DE PATER, I., DE KLEER, K., QUINTANA, E.V., CHRISTIANSEN, J.L., SCHLAFLY, E., KALTENEGGER, L., CREPP, J.R., HENNING, T., OBERMEIER, C., DEACON, N., WEISS, L.M., ISAACSON, H.T., HANSEN, B.M.S., LIU, M.C., GREENE, T., HOWELL, S.B., BARMAN, T. & MORDASINI, C. (2015). A nearby M star with three transiting super-Earths discovered by K2. Astrophys. J. 804, 10. doi:10.1088/0004-637X/804/1/10.

DIXON, L.C.W. & MILLS, D.J. (1994). Effect of rounding errors on the variable metric method. Journal of Optimization Theory and Applications 80, 175–179.

FOREMAN-MACKEY, D., HOGG, D.W., LANG, D. & GOODMAN, J. (2013). emcee: The MCMC Hammer. Publ. Astron. Soc. Pac. 125, 306–312.

GELMAN, A. & RUBIN, D. (1992). Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–511.

GOODMAN, J. & WEARE, J. (2009). Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci. 5, 65–80.

HASTINGS, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.


LAMPART, T. (2012). Implementation and performance comparison of an ensemble sampler with affine invariance. Technical report, MOSAIC group, Institute of Theoretical Computer Science, Department of Computer Science, ETH Zurich.

METROPOLIS, N., ROSENBLUTH, A.W., ROSENBLUTH, M.N., TELLER, A.H. & TELLER, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

MURRAY, I., ADAMS, R.P. & MACKAY, D.J. (2010). Elliptical slice sampling. J. Mach. Learn. Res. Workshop Conf. Proc. 9, 541–548.

NEAL, R.M. (2003). Slice sampling. Ann. Stat. 31, 705–741.

NEAL, R.M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, chap. 5. Chapman & Hall/CRC, pp. 113–162. See also arXiv:1206.1901 [stat.CO].

PLUMMER, M., BEST, N., COWLES, K. & VINES, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News 6, 7–11.

ROSENBROCK, H.H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal 3, 175–184.

VANDERBURG, A., MONTET, B.T., JOHNSON, J.A., BUCHHAVE, L.A., ZENG, L., PEPE, F., CAMERON, A.C., LATHAM, D.W., MOLINARI, E., UDRY, S., LOVIS, C., MATTHEWS, J.M., CAMERON, C., LAW, N., BOWLER, B.P., ANGUS, R., BARANEC, C., BIERYLA, A., BOSCHIN, W., CHARBONNEAU, D., COSENTINO, R., DUMUSQUE, X., FIGUEIRA, P., GUENTHER, D.B., HARUTYUNYAN, A., HELLIER, C., KUSCHNIG, R., LOPEZ-MORALES, M., MAYOR, M., MICELA, G., MOFFAT, A.F.J., PEDANI, M., PHILLIPS, D.F., PIOTTO, G., POLLACCO, D., QUELOZ, D., RICE, K., RIDDLE, R., ROWE, J.F., RUCINSKI, S.M., SASSELOV, D., SEGRANSAN, D., SOZZETTI, A., SZENTGYORGYI, A., WATSON, C. & WEISS, W.W. (2015). Characterizing K2 planet discoveries: A super-Earth transiting the bright K dwarf HIP 116454. Astrophys. J. 800, 59.
