Highly efficient Bayesian inference with a novel estimator for Metropolis-Hastings
Ingmar Schuster (FU Berlin)
(with Ilja Klebanov, Zuse Institute Berlin)
March 23, 2018
Outline
Introduction
Markov chain importance sampling
Theory
Empirical results
Conclusion and Outlook
Introduction
Markov chain algorithms: use
§ Markov chain (MC) algorithms are used in science, engineering, finance to
  § approximate integrals
  § represent nonstandard distributions
§ common constructions for a given target distribution:
  § Metropolis-Hastings (MH), popular in Bayesian stats and ML
  § discretized Langevin diffusion (LMC), popular in molecular dynamics and ML
Approximating integrals: Monte Carlo
§ let $\rho$ be a density, $h$ a function
§ we are interested in $\int h(x)\rho(x)\,dx$
§ using the standard Monte Carlo approximation (see the sketch below)
$$\int h(x)\rho(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K} h(X_i)$$
where $X_i \sim \rho$, possibly generated with a Markov chain
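A minimal Python sketch of this plain Monte Carlo estimator (an illustration, not the authors' code; `mc_estimate` and the Gaussian example are assumptions):

```python
import numpy as np

def mc_estimate(h, samples):
    """Approximate the integral of h(x) rho(x) dx by averaging h over X_i ~ rho."""
    return np.mean([h(x) for x in samples])

# example: E[exp(X)] for X ~ N(0, 1); true value is exp(1/2) ~ 1.6487
rng = np.random.default_rng(0)
print(mc_estimate(np.exp, rng.normal(size=10_000)))
```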
Approximating integrals: importance sampling
§ using importance sampling
$$\int h(x)\rho(x)\,dx = \int \frac{h(x)\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta(X_i)}$$
§ in Bayesian inference $\rho$ is unnormalized,
$$1 \neq \int \rho(x)\,dx = E < \infty,$$
and we want the normalized integral
$$E^{-1}\int h(x)\rho(x)\,dx$$
($E$ is called the normalizing constant, evidence, or marginal likelihood)
Approximating integrals: importance sampling
§ observe
$$E = \int \frac{\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(X_i)}{\rho_\theta(X_i)} = \hat{E}$$
and
$$E^{-1}\int \frac{h(x)\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta(X_i)}$$
(both estimators are sketched in code below)
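A minimal sketch of these two self-normalized importance sampling estimators, computed in log space for numerical stability (all names are illustrative assumptions):

```python
import numpy as np

def snis_estimate(h, log_rho, log_rho_theta, samples):
    """E_hat estimates the evidence E; H_hat the normalized integral,
    given samples X_i ~ rho_theta and a possibly unnormalized log_rho."""
    log_w = np.array([log_rho(x) - log_rho_theta(x) for x in samples])
    shift = log_w.max()
    w = np.exp(log_w - shift)            # stabilized importance weights
    E_hat = np.exp(shift) * w.mean()     # (1/K) sum rho(X_i)/rho_theta(X_i)
    H_hat = np.sum(w * np.array([h(x) for x in samples])) / w.sum()
    return E_hat, H_hat
```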
Markov chain importance sampling
Idea
§ MC algorithm with parameter $\theta$ designed to sample approximately from density $\rho$
  § LMC samples from some other $\rho_\theta$ that is hopefully close to $\rho$
  § MH uses proposals from some stationary distribution $\rho_\theta$ and accepts or rejects them
§ if we know $\rho_\theta$ we might assign weight $\rho(Y_i)/\rho_\theta(Y_i)$ to $Y_i$ and use the IS estimator
§ this results in
  § a correction scheme for LMC not using Metropolis-Hastings
  § an estimator for MH that can strongly improve error
  § an estimator for the evidence
Metropolis-Hastings
§ current state of the MC is $X_i$
§ as next state, propose $Y_i \sim q_\theta(\cdot\,|\,X_i)$ where $q_\theta$ is a density (sketched in code below)
§ set
$$X_{i+1} = \begin{cases} Y_i & \text{with prob. } \alpha = \min\left(1,\ \frac{q_\theta(X_i|Y_i)\,\rho(Y_i)}{q_\theta(Y_i|X_i)\,\rho(X_i)}\right) \\[4pt] X_i & \text{with prob. } 1-\alpha \end{cases}$$
§ $X_i$ has asymptotic density $\rho$
What is the asymptotic density of the $Y_i$?
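A minimal random-walk MH sketch that also records every proposal $Y_i$, which the MCIS estimator below will reuse (illustrative names, not the authors' implementation):

```python
import numpy as np

def mh_with_proposals(log_rho, x0, n_steps, step_sd, rng):
    """Random-walk Metropolis-Hastings; returns chain states and all proposals."""
    x = np.asarray(x0, dtype=float)
    xs, ys = [x], []
    for _ in range(n_steps):
        y = x + step_sd * rng.normal(size=x.shape)   # symmetric Gaussian proposal
        ys.append(y)
        # q_theta is symmetric here, so it cancels in the acceptance ratio
        if np.log(rng.uniform()) < log_rho(y) - log_rho(x):
            x = y
        xs.append(x)
    return np.array(xs), np.array(ys)
```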
Asymptotic proposal density for MH
§ hierarchical generation of proposals:
$$x \sim \rho, \qquad y \sim q_\theta(\cdot\,|\,x)$$
§ thus the joint density is $\rho(x)\,q_\theta(y|x)$ and the marginal of $y$ is
$$\rho_\theta(y) = \int \rho(x)\,q_\theta(y|x)\,dx$$
§ consistent estimator given MH samples $X_i \sim \rho$ (sketched below):
$$\rho_\theta \approx \frac{1}{K}\sum_{i=1}^{K} q_\theta(\cdot\,|\,X_i) = \rho_\theta^K$$
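A minimal sketch of the mixture estimator $\rho_\theta^K$ for a Gaussian random-walk proposal, evaluated in log space (scipy and scalar chain states are assumptions):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_rho_theta_K(y, chain_states, step_sd):
    """log rho_theta^K(y) = log (1/K) sum_i N(y; X_i, step_sd^2)."""
    log_q = norm.logpdf(y, loc=chain_states, scale=step_sd)
    return logsumexp(log_q) - np.log(len(chain_states))
```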
Hastings MCIS
§ estimator for the normalizing constant available for MH (contrary to MH/MCMC folklore)
$$E \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta^K(Y_i)}$$
§ MCIS estimator available for a large class of MH algorithms with no change to the algorithm
$$H \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(Y_i)\,\rho(Y_i)}{\rho_\theta^K(Y_i)}$$
§ uses all proposals, not just accepted ones (see the end-to-end sketch below)
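Putting the pieces together, a hedged end-to-end sketch of both estimators, reusing the illustrative `mh_with_proposals` and `log_rho_theta_K` helpers from the earlier sketches; note the weights attach to the proposals $Y_i$, not the chain states:

```python
import numpy as np

def mcis_estimate(h, log_rho, xs, ys, step_sd):
    """Hastings MCIS: evidence estimate E_hat and weighted estimate H_hat."""
    # xs[:-1] are the states each proposal Y_i was drawn from
    log_w = np.array([log_rho(y) - log_rho_theta_K(y, xs[:-1], step_sd)
                      for y in ys])
    shift = log_w.max()
    w = np.exp(log_w - shift)
    E_hat = np.exp(shift) * w.mean()
    H_hat = np.sum(w * np.array([h(y) for y in ys])) / w.sum()
    return E_hat, H_hat
```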
Cartography: Looks vs steps
The Unadjusted Langevin Algorithm (ULA) admits a similar approximation; talk to me offline
Proposal densities
Figure: target density $\rho$ vs. the proposal densities $\rho_\theta$ of Gauss MH and ULA
Weight functions
Figure: weight functions $\rho/\rho_\theta$ for Gauss MH and ULA
Related work
§ Layered adaptive importance sampling (Martino, Elvira, Luengo, & Corander, 2015)
  § use samples from MH as IS proposal locations
  § unbiased estimator
  § twice the computational cost of MCIS (no. of posterior evaluations)
§ Simple MCIS (Rudolf & Sprungk, 2018)
  § poor but cheap approximation for $\frac{\rho(X_{i+1})}{\rho_\theta(X_{i+1})}$: use $\frac{\rho(X_{i+1})}{q_\theta(X_{i+1}|X_i)}$
  § results in an unbiased estimator as well
  § will evaluate this variant in experiments (S-MCIS); see the sketch below
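For comparison, a minimal sketch of the cheap S-MCIS weight $\rho(X_{i+1})/q_\theta(X_{i+1}|X_i)$ for a Gaussian random-walk chain (illustrative, assuming scalar states):

```python
import numpy as np
from scipy.stats import norm

def s_mcis_log_weights(log_rho, xs, step_sd):
    """log rho(X_{i+1}) - log q_theta(X_{i+1} | X_i) along the chain."""
    prev, curr = xs[:-1], xs[1:]
    log_q = norm.logpdf(curr, loc=prev, scale=step_sd)
    return np.array([log_rho(x) for x in curr]) - log_q
```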
Theory
Runtime Hastings MCIS
§ computational prolongation factor MCIS/MH is
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
where $c_d$ is the target density evaluation cost and $K$ the number of samples (see the worked example below)
§ if $c_d$ is high (typical in Bayesian inference), the computational overhead becomes negligible
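A worked example with assumed numbers (not from the talk): with $\alpha = 0.25$, $K = 10^4$, $c_t = c_p = 1$, and $c_d = 10^4$, the factor is
$$1 + \frac{0.75\cdot 10^4 + 0.25\cdot 10^8}{0.25\cdot 10^4 + 10^8 + 2\cdot 10^4} \approx 1.25,$$
and it shrinks toward 1 as $c_d$ grows further.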
Law of large numbers
§ analysis of the estimator is intricate when looking at $\rho_\theta^K$
§ will use $\rho_\theta$ directly, since $\rho_\theta^K \to \rho_\theta$ pointwise a.s.
§ then (with $\tilde\rho = \rho/E$ the normalized target)
  § $\frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta(Y_i)} \to_{\text{a.s.}} E$
  § $\frac{1}{K}\sum_{i=1}^{K} h(Y_i)\,\frac{\rho(Y_i)}{\rho_\theta(Y_i)} \to_{\text{a.s.}} E\int h(x)\,\tilde\rho(x)\,dx$
§ by Theorem 6.63 in Robert and Casella (2007) this means that
$$\frac{\frac{1}{K}\sum_{i=1}^{K} h(Y_i)\,\frac{\rho(Y_i)}{\rho_\theta(Y_i)}}{\frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta(Y_i)}} \to_{\text{a.s.}} \frac{E\int h(x)\,\tilde\rho(x)\,dx}{E}$$
Central limit theorem
§ from the CLT for MCMC we have
$$\sqrt{K}\left(\frac{1}{K}\sum_{i=1}^{K}\frac{h(Y_i)\,\rho(Y_i)}{\rho_\theta(Y_i)} - E\int h(x)\,\tilde\rho(x)\,dx\right) \to_{\text{distr.}} \mathcal{N}(0,\gamma)$$
for some $\gamma$
§ the estimator for $E$ converges a.s. to the correct value
§ apply Slutsky's lemma to get a CLT for the MCIS estimator with limit
$$\mathcal{N}\!\left(0,\ \frac{\gamma}{\sqrt{E}}\right)$$
Empirical results
Hastings MCIS
Figure: target density $p$ vs. global proposal density $g$ for a Gaussian random walk with sd 3
Metropolis-Hastings MCIS 1D
Figure: moving average of log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(e^x)$ (3000 samples, 10 repetitions, 10 iid MCs); estimators MC, MCIS, S-MCIS
Metropolis-Hastings MCIS 5D
Figure: log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(e^x)$ (12000 samples, 10 repetitions); estimators MC, MCIS, S-MCIS
Metropolis-Hastings MCIS Gaussian Process
Applied problem: Gaussian process regression for noise level prediction
§ tests of airfoil blade sections conducted in an anechoic wind tunnel by NASA
§ predictors
  1. Frequency, in Hertz
  2. Angle of attack, in degrees
  3. Chord length, in meters
  4. Free-stream velocity, in meters per second
  5. Suction side displacement thickness, in meters
§ output: scaled sound pressure level, in decibels
Metropolis-Hastings MCIS Gaussian Process
§ use a Gaussian ARD kernel for the GP
  § parameter $\sigma_i^2$ for the bandwidth in dimension $i$
  § parameter $\sigma_n^2$ for the noise variance
  § prior $\log(\exp(\sigma_*^2) - 1) \sim \mathcal{N}(0, 1)$ (see the sketch below)
  § results in a 6D posterior
§ ground truth obtained using 50 000 iterations of MH with the standard estimator
§ experiment using multiple MH runs of 12 000 iterations
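A minimal sketch of sampling from this prior: $\log(\exp(\sigma_*^2) - 1) \sim \mathcal{N}(0,1)$ is equivalent to $\sigma_*^2 = \log(1 + e^z)$ with $z \sim \mathcal{N}(0,1)$, an inverse-softplus reparameterization (the code is illustrative; only the dimension 6 comes from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=6)           # one standard normal per hyperparameter
sigma2 = np.log1p(np.exp(z))     # softplus keeps the variances positive
```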
Metropolis-Hastings MCIS Gaussian Process
Figure: time-adjusted absolute error (tAE) for $E(x)$, averaged across dimensions; estimators MC, S-MCIS, S-MCIS-CV, MCIS, MCIS-CV
Metropolis-Hastings MCIS Gaussian Process
Figure: time-adjusted absolute error (tAE) for $E(e^x)$, averaged across dimensions; estimators MC, S-MCIS, S-MCIS-CV, MCIS, MCIS-CV
Conclusion and Outlook
Open questions
§ How to tell when $\rho_\theta^K$ has converged?
§ For Gaussian random-walk MH,
$$\rho_\theta = \int \rho(x)\,\mathcal{N}(\cdot\,;\,x,\theta)\,dx$$
is an injective embedding of $\rho$ into a reproducing kernel Hilbert space (RKHS)
  § Can the RKHS norm help determine convergence?
  § At what cost?
Open questions
§ Is there a way to correct the bias introduced by only estimating $\rho_\theta$, and by using self-normalized IS?
  § for random-walk MH an unbiased estimator for $E\rho_\theta$ exists
§ behavior in increasing dimensions
§ which variants of LMC have stationary distributions?
Conclusion and outlook
§ the MCIS estimator
  § has a CLT
  § can be very efficient wrt error vs. CPU time
  § is available for a large class of MH
  § is available for LMC
  § can estimate the evidence
§ fundamental insights into MH
Thank you
Unadjusted Langevin Algorithm
§ current state of the MC is $X_i$
§ generate the next state $X_{i+1} \sim q_\theta(\cdot\,|\,X_i)$ where (sketched in code below)
$$q_\theta(\cdot\,|\,X_i) = \mathcal{N}\!\left(\cdot\;\Big|\;X_i + \theta\,\nabla\log\rho(X_i),\ \tfrac{1}{2}\sqrt{\theta}\,I\right)$$
§ under certain assumptions $q_\theta$ is a Markov kernel for $\rho_\theta$ (Durmus & Moulines, 2017), which means
$$\int q_\theta(\cdot\,|\,x)\,\rho_\theta(x)\,dx = \rho_\theta(\cdot)$$
§ the Markov chain $X_1, X_2, \dots$ resulting from repeated application of $q_\theta$ is ergodic wrt $\rho_\theta$
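A minimal ULA sketch following the update as transcribed above; the $\tfrac{1}{2}\sqrt{\theta}\,I$ covariance is taken verbatim from the slide and may be parameterized differently elsewhere, and `grad_log_rho` is an assumed helper:

```python
import numpy as np

def ula_chain(grad_log_rho, x0, n_steps, theta, rng):
    """X_{i+1} ~ N(X_i + theta * grad log rho(X_i), 0.5 * sqrt(theta) * I)."""
    x = np.asarray(x0, dtype=float)
    noise_sd = np.sqrt(0.5 * np.sqrt(theta))   # sd matching the slide's covariance
    xs = [x]
    for _ in range(n_steps):
        x = x + theta * grad_log_rho(x) + noise_sd * rng.normal(size=x.shape)
        xs.append(x)
    return np.array(xs)
```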
Langevin MCIS
§ but if we have samples from $\rho_\theta$ and a Markov kernel, then
$$\rho_\theta(\cdot) = \int q_\theta(\cdot\,|\,x)\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K} q_\theta(\cdot\,|\,X_i) = \rho_\theta^K(\cdot)$$
§ estimate of the normalizing constant of $\rho$:
$$E \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(X_i)}{\rho_\theta^K(X_i)}$$
§ MCIS estimator (see the sketch below):
$$H \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta^K(X_i)}$$
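For Langevin MCIS the mixture components of $\rho_\theta^K$ are the ULA transition densities rather than random-walk kernels; a hedged sketch for scalar states, under the same transcribed parameterization as above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def ula_log_rho_theta_K(y, xs, grad_log_rho, theta):
    """log rho_theta^K(y) with q_theta(.|X_i) the ULA kernel above."""
    means = xs + theta * np.array([grad_log_rho(x) for x in xs])
    sd = np.sqrt(0.5 * np.sqrt(theta))   # matches the transcribed ULA noise
    log_q = norm.logpdf(y, loc=means, scale=sd)
    return logsumexp(log_q) - np.log(len(xs))
```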
Runtime Hastings MCIS (1)
§ let
  § $\alpha$ be the acceptance rate
  § $c_t$ the test function cost
  § $c_p$ the proposal density cost
  § $c_d$ the target density cost
§ cost of standard MH: $O(\alpha K c_t + K c_d + 2K c_p)$
§ cost of Hastings MCIS: $O(K c_t + K c_d + 2K c_p + \alpha K^2 c_p)$
§ prolongation factor MCIS/MH is
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
Runtime Hastings MCIS (2)
§ let
  § $\alpha$ be the acceptance rate
  § $c_t$ the test function cost
  § $c_p$ the proposal density cost
  § $c_d$ the target density cost
§ prolongation factor MCIS/MH:
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
§ in Bayesian inference $c_d$ is large (it involves looking at many data points), so $c_d \gg c_p$
Langevin MCIS: Experiments
Figure: time-adjusted absolute error (sec · AE) for the possible estimators MC, S-MCIS, MCIS
Langevin MCIS: Experiments
Figure: moving average of log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(x^3)$ (3000 samples, 10 repetitions, 10 iid MCs); estimators MC, MCIS, S-MCIS
Bibliography
Durmus, Alain and Eric Moulines (2017). "Nonasymptotic convergence analysis for the unadjusted Langevin algorithm". In: The Annals of Applied Probability 27.3, pp. 1551–1587.
Martino, L. et al. (2015). "Layered Adaptive Importance Sampling". url: http://arxiv.org/abs/1505.04732.
Robert, Christian and George Casella (2007). Monte Carlo statistical methods.
Rudolf, Daniel and Bjorn Sprungk (2018). "An importance sampling estimator for Metropolis-Hastings". Personal communication.