
A vanilla Rao–Blackwellisation of Metropolis–Hastings algorithms

Randal DOUC and Christian ROBERT
Telecom SudParis, France


April 2009


Main themes

1 Rao–Blackwellisation on MCMC.
2 Can be performed in any Metropolis–Hastings algorithm.
3 Asymptotically more efficient than usual MCMC with a controlled amount of calculations.


Outline

1 Introduction

2 Some properties of the HM algorithm

3 Rao–Blackwellisation
    Variance reduction
    Asymptotic results

4 Illustrations

5 Conclusion



Metropolis–Hastings algorithm

1 We wish to approximate
$$I = \frac{\int h(x)\,\pi(x)\,dx}{\int \pi(x)\,dx} = \int h(x)\,\bar\pi(x)\,dx .$$

2 $x \mapsto \pi(x)$ is known but not $\int \pi(x)\,dx$.

3 Approximate $I$ with $\delta = \frac{1}{n}\sum_{t=1}^{n} h(x^{(t)})$, where $(x^{(t)})$ is a Markov chain with limiting distribution $\bar\pi$.

4 Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.

Metropolis–Hastings algorithm

Suppose that $x^{(t)}$ is drawn.

1 Simulate $y_t \sim q(\cdot|x^{(t)})$.

2 Set $x^{(t+1)} = y_t$ with probability
$$\alpha(x^{(t)}, y_t) = \min\left\{1, \frac{\pi(y_t)}{\pi(x^{(t)})}\,\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\} .$$
Otherwise, set $x^{(t+1)} = x^{(t)}$.

3 $\alpha$ is such that the detailed balance equation is satisfied:
$$\pi(x)\,q(y|x)\,\alpha(x,y) = \pi(y)\,q(x|y)\,\alpha(y,x) .$$
⊲ $\bar\pi$ is the stationary distribution of $(x^{(t)})$.

◮ The accepted candidates are simulated with the rejection algorithm.
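The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the target is a standard normal known up to a constant, the proposal is a Gaussian random walk with scale `tau`, and the helper names (`metropolis_hastings`, `log_pi`, `propose`, `log_q`) are hypothetical.

```python
import math
import random

def metropolis_hastings(log_pi, propose, log_q, x0, n, rng):
    """Run n Metropolis-Hastings steps and return the chain x^(1), ..., x^(n).

    log_pi(x)       : log of the unnormalised target pi(x)
    propose(x, rng) : draws y ~ q(.|x)
    log_q(y, x)     : log q(y|x), up to an additive constant
    """
    chain = []
    x = x0
    for _ in range(n):
        y = propose(x, rng)
        # log of pi(y) q(x|y) / (pi(x) q(y|x))
        log_ratio = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        alpha = math.exp(min(0.0, log_ratio))
        if rng.random() < alpha:
            x = y            # accept the candidate
        chain.append(x)      # on rejection, x^(t+1) = x^(t)
    return chain

# Assumed toy setting: pi(x) proportional to exp(-x^2/2), random walk proposal.
tau = 2.0
log_pi = lambda x: -0.5 * x * x
propose = lambda x, rng: x + tau * rng.gauss(0.0, 1.0)
log_q = lambda y, x: -0.5 * ((y - x) / tau) ** 2   # symmetric, constants cancel

rng = random.Random(0)
chain = metropolis_hastings(log_pi, propose, log_q, 0.0, 20000, rng)
delta = sum(chain) / len(chain)   # estimate of E[X] = 0 for h(x) = x
```

Working with logarithms avoids overflow in the ratio; only the unnormalised $\pi$ is needed, as stated on the previous slide.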



1 Alternative representation of the estimator $\delta$ (with $n = N$):
$$\delta = \frac{1}{n}\sum_{t=1}^{n} h(x^{(t)}) = \frac{1}{N}\sum_{i=1}^{M_N} n_i\,h(z_i) ,$$
where
    the $z_i$'s are the accepted $y_j$'s,
    $M_N$ is the number of accepted $y_j$'s till time $N$,
    $n_i$ is the number of times $z_i$ appears in the sequence $(x^{(t)})_t$.
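The identity can be checked on a small hand-written chain. The sketch below is an illustration only; the helper name `compress` and the toy values are assumptions. It also assumes that consecutive equal values always come from rejections, which holds almost surely for continuous proposals but may merge distinct accepted moves in a discrete setting.

```python
from itertools import groupby

def compress(chain):
    """Collapse a chain into pairs (z_i, n_i): each run of identical
    consecutive values becomes one accepted value and its multiplicity."""
    return [(z, len(list(run))) for z, run in groupby(chain)]

# Toy chain: repeated values correspond to rejected candidates.
chain = [0.3, 0.3, 1.2, 1.2, 1.2, -0.5, 0.8, 0.8]
h = lambda x: x * x

pairs = compress(chain)
N = len(chain)
delta_time = sum(h(x) for x in chain) / N          # (1/N) sum_t h(x^(t))
delta_runs = sum(n * h(z) for z, n in pairs) / N   # (1/N) sum_i n_i h(z_i)
```

Both averages coincide because each $h(z_i)$ is simply counted $n_i$ times in the first sum.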

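$$\tilde q(\cdot|z_i) = \frac{\alpha(z_i,\cdot)\,q(\cdot|z_i)}{p(z_i)} \le \frac{q(\cdot|z_i)}{p(z_i)} ,$$
where $p(z_i) = \int \alpha(z_i,y)\,q(y|z_i)\,dy$. To simulate according to $\tilde q(\cdot|z_i)$:

1 Propose a candidate $y \sim q(\cdot|z_i)$.

2 Accept with probability
$$\tilde q(y|z_i) \Big/ \left(\frac{q(y|z_i)}{p(z_i)}\right) = \alpha(z_i,y) .$$
Otherwise, reject it and start again.

3 ◮ This is the transition of the HM algorithm.

The transition kernel $\tilde q$ admits $\tilde\pi$ as a stationary distribution: with $\tilde\pi(\cdot) \propto \pi(\cdot)\,p(\cdot)$,
$$\tilde\pi(x)\,\tilde q(y|x) \propto \pi(x)\,\alpha(x,y)\,q(y|x) ,$$
which is symmetric in $(x,y)$ by the detailed balance equation.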


Lemma

The sequence $(z_i, n_i)$ satisfies

1 $(z_i, n_i)_i$ is a Markov chain;

2 $z_{i+1}$ and $n_i$ are independent given $z_i$;

3 $n_i$ is distributed as a geometric random variable with probability parameter
$$p(z_i) := \int \alpha(z_i,y)\,q(y|z_i)\,dy ; \qquad (1)$$

4 $(z_i)_i$ is a Markov chain with transition kernel $\tilde Q(z,dy) = \tilde q(y|z)\,dy$ and stationary distribution $\tilde\pi$ such that
$$\tilde q(\cdot|z) \propto \alpha(z,\cdot)\,q(\cdot|z) \quad\text{and}\quad \tilde\pi(\cdot) \propto \pi(\cdot)\,p(\cdot) .$$
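Item 3 of the lemma can be checked by simulation in a case where $p(z)$ is known. The sketch below borrows the setting of the illustration at the end of the deck (geometric target $\pi(x) \propto (1-\beta)^x$ with a $\pm 1$ random walk, for which $p(x) = 1 - \beta/2$ when $x > 0$); the function name `holding_time` and the choice $\beta = 0.5$ are assumptions made for this example.

```python
import random

def holding_time(z, beta, rng):
    """Number of iterations the chain spends at z > 0, i.e. one plus the
    number of rejected candidates before the first acceptance."""
    n = 1
    while True:
        y = z + rng.choice((-1, 1))              # y ~ q(.|z)
        alpha = 1.0 if y < z else 1.0 - beta     # min{1, pi(y)/pi(z)}
        if rng.random() < alpha:
            return n
        n += 1

beta = 0.5
p = 1.0 - beta / 2.0                             # acceptance probability p(z)
rng = random.Random(0)
draws = [holding_time(3, beta, rng) for _ in range(50000)]
mean_n = sum(draws) / len(draws)                 # should be close to 1/p
```

The empirical mean of the $n_i$'s should be near $1/p(z_i)$, the mean of a geometric variable with parameter $p(z_i)$.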

[Diagram: the accepted values $z_{i-1}$, $z_i$, $z_{i+1}$ with their multiplicities $n_{i-1}$, $n_i$; links labelled "indep".]

$$\delta = \frac{1}{n}\sum_{t=1}^{n} h(x^{(t)}) = \frac{1}{N}\sum_{i=1}^{M_N} n_i\,h(z_i) .$$




1 A natural idea:
$$\delta^* = \frac{1}{N}\sum_{i=1}^{M_N} \frac{h(z_i)}{p(z_i)} ,$$
that is,
$$\delta^* \simeq \frac{\sum_{i=1}^{M_N} h(z_i)/p(z_i)}{\sum_{i=1}^{M_N} 1/p(z_i)} = \frac{\sum_{i=1}^{M_N} \frac{\pi(z_i)}{\tilde\pi(z_i)}\,h(z_i)}{\sum_{i=1}^{M_N} \frac{\pi(z_i)}{\tilde\pi(z_i)}} .$$

2 But $p$ is not available in closed form.

3 The geometric $n_i$ is the obvious solution that is used in the original Metropolis–Hastings estimate:
$$n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i,y_\ell)\} .$$

Lemma

If $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$, the quantity
$$\hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \{1 - \alpha(z_i,y_\ell)\}$$
is an unbiased estimator of $1/p(z_i)$ whose variance, conditional on $z_i$, is lower than the conditional variance of $n_i$, $\{1 - p(z_i)\}/p^2(z_i)$.


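$$\hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \{1 - \alpha(z_i,y_\ell)\}$$

1 Infinite sum but sometimes finite: since
$$\alpha(x^{(t)}, y_t) = \min\left\{1, \frac{\pi(y_t)}{\pi(x^{(t)})}\,\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\} ,$$
the product vanishes as soon as one $\alpha(z_i,y_\ell)$ equals 1. For example: take a symmetric random walk as a proposal, for which $\alpha(z_i,y_\ell) = 1$ whenever $\pi(y_\ell) \ge \pi(z_i)$.

2 What if we wish to be sure that the sum is finite?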


Variance reduction

Proposition

If $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$ and $(u_j)_j$ is an iid uniform sequence, for any $k \ge 0$, the quantity
$$\hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1\le\ell\le k\wedge j} \{1 - \alpha(z_i,y_\ell)\} \prod_{k+1\le\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i,y_\ell)\} \qquad (2)$$
is an unbiased estimator of $1/p(z_i)$ with an almost surely finite number of terms. Moreover, for $k \ge 1$,
$$\mathbb{V}\left[\hat\xi_i^k \,\middle|\, z_i\right] = \frac{1 - p(z_i)}{p^2(z_i)} - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)} \left(\frac{2 - p(z_i)}{p^2(z_i)}\right) (p(z_i) - r(z_i)) ,$$
where $p(z_i) := \int \alpha(z_i,y)\,q(y|z_i)\,dy$ and $r(z_i) := \int \alpha^2(z_i,y)\,q(y|z_i)\,dy$. Therefore, we have
$$\mathbb{V}\left[\hat\xi_i \,\middle|\, z_i\right] \le \mathbb{V}\left[\hat\xi_i^k \,\middle|\, z_i\right] \le \mathbb{V}\left[\hat\xi_i^0 \,\middle|\, z_i\right] = \mathbb{V}[n_i \,|\, z_i] .$$
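A minimal sketch of the estimator (2) and of the variance formula follows. It is not the authors' code: the names `xi_hat` and `var_xi` are hypothetical, and the test case reuses the geometric-target random walk of the later illustration with an assumed $\beta = 0.5$, where $p$ and $r$ are available in closed form. The loop stops as soon as the running product hits zero, which is what makes the number of terms almost surely finite.

```python
import random

def xi_hat(z, k, alpha, propose, rng):
    """One draw of xi_hat^k at z: the first k factors are 1 - alpha,
    later factors are indicators I{u >= alpha}."""
    total, weight, ell = 1.0, 1.0, 0
    while weight > 0.0:
        ell += 1
        a = alpha(z, propose(z, rng))
        if ell <= k:
            weight *= 1.0 - a          # Rao-Blackwellised factor
        elif rng.random() < a:
            weight = 0.0               # indicator is zero: the sum stops here
        total += weight
    return total

def var_xi(p, r, k):
    """Conditional variance of xi_hat^k as given in the proposition."""
    return ((1 - p) / p**2
            - (1 - (1 - 2*p + r)**k) / (2*p - r) * (2 - p) / p**2 * (p - r))

# Assumed test case: pi(x) proportional to (1-beta)^x, y = z +/- 1, z > 0.
beta = 0.5
alpha = lambda z, y: 1.0 if y < z else 1.0 - beta
propose = lambda z, rng: z + rng.choice((-1, 1))
p, r = 1 - beta / 2, 1 - beta + beta**2 / 2

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
stats = {k: mean_var([xi_hat(3, k, alpha, propose, rng) for _ in range(40000)])
         for k in (0, 2)}
```

For $k = 0$ the function reduces to the geometric count $n_i$; increasing $k$ keeps the mean at $1/p(z_i)$ while lowering the variance.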

Variance reduction

[Diagram: the accepted values $z_{i-1}$, $z_i$, $z_{i+1}$ with the estimates $\hat\xi_{i-1}^k$, $\hat\xi_i^k$; links labelled "not indep".]

$$\hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1\le\ell\le k\wedge j} \{1 - \alpha(z_i,y_\ell)\} \prod_{k+1\le\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i,y_\ell)\} \qquad (2)$$

$$\delta_M^k = \frac{\sum_{i=1}^{M} \hat\xi_i^k\,h(z_i)}{\sum_{i=1}^{M} \hat\xi_i^k} .$$
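The estimator $\delta_M^k$ can be sketched end to end as below. This is a simplified variant and an assumption of this example, not the scheme depicted on the slide: each $\hat\xi_i^k$ is computed from fresh proposals drawn independently of the chain, which costs extra simulations and removes the "not indep" links. The target (standard normal), the random walk scale and all function names are also assumptions.

```python
import math
import random

def alpha(x, y):
    # Symmetric random walk: min{1, pi(y)/pi(x)} with pi(x) prop. to exp(-x^2/2)
    return math.exp(min(0.0, 0.5 * (x * x - y * y)))

def propose(x, rng, tau=2.0):
    return x + tau * rng.gauss(0.0, 1.0)

def xi_hat(z, k, rng):
    """One draw of xi_hat^k at z, using fresh proposals (a simplification)."""
    total, weight, ell = 1.0, 1.0, 0
    while weight > 0.0:
        ell += 1
        a = alpha(z, propose(z, rng))
        if ell <= k:
            weight *= 1.0 - a
        elif rng.random() < a:
            weight = 0.0
        total += weight
    return total

def accepted_chain(z0, M, rng):
    """Draw z_1, ..., z_M with z_{i+1} ~ q~(.|z_i): propose until acceptance."""
    zs, z = [], z0
    for _ in range(M):
        while True:
            y = propose(z, rng)
            if rng.random() < alpha(z, y):
                break
        z = y
        zs.append(z)
    return zs

def delta_k(zs, k, h, rng):
    """Self-normalised estimator: sum xi_i h(z_i) / sum xi_i."""
    xis = [xi_hat(z, k, rng) for z in zs]
    return sum(x * h(z) for x, z in zip(xis, zs)) / sum(xis)

rng = random.Random(1)
zs = accepted_chain(0.0, 5000, rng)
est = delta_k(zs, 5, lambda x: x * x, rng)   # targets E[X^2] = 1
```

The weights $\hat\xi_i^k$ play the role of importance weights $\pi/\tilde\pi$ (up to a constant), which is why the ratio form does not require the normalising constant.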

Asymptotic results

Let
$$\delta_M^k = \frac{\sum_{i=1}^{M} \hat\xi_i^k\,h(z_i)}{\sum_{i=1}^{M} \hat\xi_i^k} .$$

For any positive function $\phi$, we denote $C_\phi = \{h;\ |h/\phi|_\infty < \infty\}$.

Assume that there exists a positive function $\phi \ge 1$ such that
$$\forall h \in C_\phi, \quad \frac{\sum_{i=1}^{M} h(z_i)/p(z_i)}{\sum_{i=1}^{M} 1/p(z_i)} \xrightarrow{P} \pi(h) \qquad (3)$$
and a positive function $\psi$ such that
$$\forall h \in C_\psi, \quad \sqrt{M}\left(\frac{\sum_{i=1}^{M} h(z_i)/p(z_i)}{\sum_{i=1}^{M} 1/p(z_i)} - \pi(h)\right) \xrightarrow{L} \mathcal{N}(0, \Gamma(h)) .$$

Theorem

Under the assumption that $\pi(p) > 0$, the following convergence properties hold:

i) If $h$ is in $C_\phi$, then
$$\delta_M^k \xrightarrow[M\to\infty]{P} \pi(h) \qquad \text{(◮ CONSISTENCY)}$$

ii) If, in addition, $h^2/p \in C_\phi$ and $h \in C_\psi$, then
$$\sqrt{M}\,(\delta_M^k - \pi(h)) \xrightarrow[M\to\infty]{L} \mathcal{N}(0, V_k[h - \pi(h)]) , \qquad \text{(◮ CLT)}$$
where
$$V_k(h) := \pi(p) \int \pi(dz)\, \mathbb{V}\left[\hat\xi_i^k \,\middle|\, z\right] h^2(z)\,p(z) + \Gamma(h) .$$


Asymptotic results

We will need some additional assumptions. Assume a maximal inequality for the Markov chain $(z_i)_i$: there exists a measurable function $\zeta$ such that for any starting point $x$,
$$\forall h \in C_\zeta, \quad P_x\left(\left|\sup_{0\le i\le N} \sum_{j=0}^{i} [h(z_j) - \tilde\pi(h)]\right| > \epsilon\right) \le \frac{N\,C_h(x)}{\epsilon^2} .$$

Moreover, assume that there exists $\varphi \ge 1$ such that for any starting point $x$,
$$\forall h \in C_\varphi, \quad \tilde Q^n(x,h) \xrightarrow{P} \tilde\pi(h) = \pi(ph)/\pi(p) .$$

Theorem

Assume that $h$ is such that $h/p \in C_\zeta$ and $\{C_{h/p}, h^2/p^2\} \subset C_\varphi$. Assume moreover that
$$\sqrt{M}\,(\delta_M^0 - \pi(h)) \xrightarrow{L} \mathcal{N}(0, V_0[h - \pi(h)]) .$$
Then, for any starting point $x$,
$$\sqrt{M_N}\left(\frac{\sum_{t=1}^{N} h(x^{(t)})}{N} - \pi(h)\right) \xrightarrow[N\to\infty]{L} \mathcal{N}(0, V_0[h - \pi(h)]) ,$$
where $M_N$ is defined by
$$\sum_{i=1}^{M_N} \hat\xi_i^0 \le N < \sum_{i=1}^{M_N+1} \hat\xi_i^0 .$$




Figure: Overlay of the variations of 250 iid realisations of the estimates $\delta$ (gold) and $\delta^\infty$ (grey) of $E[X] = 0$ for 1000 iterations, along with the 90% interquantile range for the estimates $\delta$ (brown) and $\delta^\infty$ (pink), in the setting of a random walk Gaussian proposal with scale $\tau = 10$.


Figure: Overlay of the variations of 500 iid realisations of the estimates $\delta$ (deep grey), $\delta^\infty$ (medium grey) and of the importance sampling version (light grey) of $E[X] = 10$ when $X \sim \mathrm{Exp}(.1)$ for 100 iterations, along with the 90% interquantile ranges (same colour code), in the setting of an independent exponential proposal with scale $\mu = 0.02$.


$$\pi(x) = \beta(1-\beta)^x \quad\text{and}\quad 2q(y|x) = \begin{cases} \mathbb{I}_{|x-y|=1} & \text{if } x > 0 , \\ \mathbb{I}_{|y|\le 1} & \text{if } x = 0 . \end{cases}$$

For this problem,
$$p(x) = 1 - \beta/2 \quad\text{and}\quad r(x) = 1 - \beta + \beta^2/2 .$$

We can therefore compute the gain in variance
$$\frac{p(x) - r(x)}{2p(x) - r(x)}\,\frac{2 - p(x)}{p^2(x)} = 2\,\frac{\beta(1-\beta)(2+\beta)}{(2-\beta^2)(2-\beta)^2} ,$$
which is optimal for β = 0.174, leading to a gain of 0.578, while the relative gain in variance is
$$\frac{p(x) - r(x)}{2p(x) - r(x)}\,\frac{2 - p(x)}{1 - p(x)} = \frac{(1-\beta)(2+\beta)}{2-\beta^2} ,$$
which is decreasing in β.
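The two algebraic simplifications above can be checked numerically. The sketch below only verifies that the expressions in $p$ and $r$ agree with their closed forms in $\beta$ and that the relative gain decreases on a grid; the function names and the grid are assumptions made for this check.

```python
def p(beta):
    return 1.0 - beta / 2.0

def r(beta):
    return 1.0 - beta + beta**2 / 2.0

def gain_pr(beta):
    """Gain in variance written in terms of p and r."""
    pb, rb = p(beta), r(beta)
    return (pb - rb) / (2 * pb - rb) * (2 - pb) / pb**2

def gain_closed(beta):
    """Same gain, simplified as a rational function of beta."""
    return 2 * beta * (1 - beta) * (2 + beta) / ((2 - beta**2) * (2 - beta) ** 2)

def rel_gain_pr(beta):
    """Relative gain: the gain divided by the geometric variance (1-p)/p^2."""
    pb, rb = p(beta), r(beta)
    return (pb - rb) / (2 * pb - rb) * (2 - pb) / (1 - pb)

def rel_gain_closed(beta):
    return (1 - beta) * (2 + beta) / (2 - beta**2)

betas = [i / 1000 for i in range(1, 1000)]   # grid on (0, 1)
max_gap_gain = max(abs(gain_pr(b) - gain_closed(b)) for b in betas)
max_gap_rel = max(abs(rel_gain_pr(b) - rel_gain_closed(b)) for b in betas)
rel_values = [rel_gain_closed(b) for b in betas]
```

Both gaps should be at rounding-error level, and `rel_values` should be a decreasing sequence.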



a) Rao–Blackwellisation of any HM algorithm with a controlled amount of additional calculation.

b) Link with the importance sampling of Markov chains.

c) Analysis with asymptotic results on triangular arrays.
