Highly efficient Bayesian inference with a novel estimator for Metropolis-Hastings
Ingmar Schuster (FU Berlin)
(with Ilja Klebanov, Zuse Institute Berlin)
March 23, 2018
Outline
Introduction
Markov chain importance sampling
Theory
Empirical results
Conclusion and Outlook
Introduction
Markov chain algorithms: use
§ Markov chain (MC) algorithms are used in science, engineering, finance to
  § approximate integrals
  § represent nonstandard distributions
§ common constructions for a given target distribution:
  § Metropolis-Hastings (MH), popular in Bayesian stats and ML
  § discretized Langevin diffusion (LMC), popular in molecular dynamics and ML
Approximating integrals: Monte Carlo
§ let $\rho$ be a density, $h$ a function
§ we are interested in $\int h(x)\rho(x)\,dx$
§ using the standard Monte Carlo approximation (see the sketch below)
$$\int h(x)\rho(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K} h(X_i)$$
where $X_i \sim \rho$, possibly generated with a Markov chain
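A minimal Python sketch of this plain Monte Carlo estimator (an illustration, not the authors' code; `mc_estimate` and the Gaussian example are assumptions):

```python
import numpy as np

def mc_estimate(h, samples):
    """Approximate the integral of h(x) rho(x) dx by averaging h over X_i ~ rho."""
    return np.mean([h(x) for x in samples])

# example: E[exp(X)] for X ~ N(0, 1); true value is exp(1/2) ~ 1.6487
rng = np.random.default_rng(0)
print(mc_estimate(np.exp, rng.normal(size=10_000)))
```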
Approximating integrals: importance sampling
§ using importance sampling
$$\int h(x)\rho(x)\,dx = \int \frac{h(x)\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta(X_i)}$$
§ in Bayesian inference $\rho$ is unnormalized,
$$1 \neq \int \rho(x)\,dx = E < \infty,$$
and we want the normalized integral
$$E^{-1}\int h(x)\rho(x)\,dx$$
($E$ is called the normalizing constant, evidence, or marginal likelihood)
Approximating integrals: importance sampling
§ observe
$$E = \int \frac{\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(X_i)}{\rho_\theta(X_i)} = \hat{E}$$
and
$$E^{-1}\int \frac{h(x)\rho(x)}{\rho_\theta(x)}\,\rho_\theta(x)\,dx \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta(X_i)}$$
(both estimators are sketched in code below)
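A minimal sketch of these two self-normalized importance sampling estimators, computed in log space for numerical stability (all names are illustrative assumptions):

```python
import numpy as np

def snis_estimate(h, log_rho, log_rho_theta, samples):
    """E_hat estimates the evidence E; H_hat the normalized integral,
    given samples X_i ~ rho_theta and a possibly unnormalized log_rho."""
    log_w = np.array([log_rho(x) - log_rho_theta(x) for x in samples])
    shift = log_w.max()
    w = np.exp(log_w - shift)            # stabilized importance weights
    E_hat = np.exp(shift) * w.mean()     # (1/K) sum rho(X_i)/rho_theta(X_i)
    H_hat = np.sum(w * np.array([h(x) for x in samples])) / w.sum()
    return E_hat, H_hat
```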
Markov chain importance sampling
Idea
§ MC algorithm with parameter $\theta$ designed to sample approximately from density $\rho$
  § LMC samples from some other $\rho_\theta$ that is hopefully close to $\rho$
  § MH uses proposals from some stationary distribution $\rho_\theta$ and accepts or rejects them
§ if we know $\rho_\theta$ we might assign weight $\rho(Y_i)/\rho_\theta(Y_i)$ to $Y_i$ and use the IS estimator
§ this results in
  § a correction scheme for LMC not using Metropolis-Hastings
  § an estimator for MH that can strongly improve error
  § an estimator for the evidence
Metropolis-Hastings
§ current state of the MC is $X_i$
§ as next state, propose $Y_i \sim q_\theta(\cdot\,|\,X_i)$ where $q_\theta$ is a density (sketched in code below)
§ set
$$X_{i+1} = \begin{cases} Y_i & \text{with prob. } \alpha = \min\left(1,\ \frac{q_\theta(X_i|Y_i)\,\rho(Y_i)}{q_\theta(Y_i|X_i)\,\rho(X_i)}\right) \\[4pt] X_i & \text{with prob. } 1-\alpha \end{cases}$$
§ $X_i$ has asymptotic density $\rho$
What is the asymptotic density of the $Y_i$?
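A minimal random-walk MH sketch that also records every proposal $Y_i$, which the MCIS estimator below will reuse (illustrative names, not the authors' implementation):

```python
import numpy as np

def mh_with_proposals(log_rho, x0, n_steps, step_sd, rng):
    """Random-walk Metropolis-Hastings; returns chain states and all proposals."""
    x = np.asarray(x0, dtype=float)
    xs, ys = [x], []
    for _ in range(n_steps):
        y = x + step_sd * rng.normal(size=x.shape)   # symmetric Gaussian proposal
        ys.append(y)
        # q_theta is symmetric here, so it cancels in the acceptance ratio
        if np.log(rng.uniform()) < log_rho(y) - log_rho(x):
            x = y
        xs.append(x)
    return np.array(xs), np.array(ys)
```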
Asymptotic proposal density for MH
§ hierarchical generation of proposals:
$$x \sim \rho, \qquad y \sim q_\theta(\cdot\,|\,x)$$
§ thus the joint density is $\rho(x)\,q_\theta(y|x)$ and the marginal of $y$ is
$$\rho_\theta(y) = \int \rho(x)\,q_\theta(y|x)\,dx$$
§ consistent estimator given MH samples $X_i \sim \rho$ (sketched below):
$$\rho_\theta \approx \frac{1}{K}\sum_{i=1}^{K} q_\theta(\cdot\,|\,X_i) = \rho_\theta^K$$
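A minimal sketch of the mixture estimator $\rho_\theta^K$ for a Gaussian random-walk proposal, evaluated in log space (scipy and scalar chain states are assumptions):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_rho_theta_K(y, chain_states, step_sd):
    """log rho_theta^K(y) = log (1/K) sum_i N(y; X_i, step_sd^2)."""
    log_q = norm.logpdf(y, loc=chain_states, scale=step_sd)
    return logsumexp(log_q) - np.log(len(chain_states))
```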
Hastings MCIS
§ estimator for the normalizing constant available for MH (contrary to MH/MCMC folklore)
$$E \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta^K(Y_i)}$$
§ MCIS estimator available for a large class of MH algorithms with no change to the algorithm
$$H \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(Y_i)\,\rho(Y_i)}{\rho_\theta^K(Y_i)}$$
§ uses all proposals, not just accepted ones (see the end-to-end sketch below)
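Putting the pieces together, a hedged end-to-end sketch of both estimators, reusing the illustrative `mh_with_proposals` and `log_rho_theta_K` helpers from the earlier sketches; note the weights attach to the proposals $Y_i$, not the chain states:

```python
import numpy as np

def mcis_estimate(h, log_rho, xs, ys, step_sd):
    """Hastings MCIS: evidence estimate E_hat and weighted estimate H_hat."""
    # xs[:-1] are the states each proposal Y_i was drawn from
    log_w = np.array([log_rho(y) - log_rho_theta_K(y, xs[:-1], step_sd)
                      for y in ys])
    shift = log_w.max()
    w = np.exp(log_w - shift)
    E_hat = np.exp(shift) * w.mean()
    H_hat = np.sum(w * np.array([h(y) for y in ys])) / w.sum()
    return E_hat, H_hat
```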
Cartography: Looks vs steps
The Unadjusted Langevin Algorithm (ULA) admits a similar approximation; talk to me offline
Proposal densities
Figure: target density $\rho$ vs. the proposal densities $\rho_\theta$ of Gauss MH and ULA
Weight functions
Figure: weight functions $\rho/\rho_\theta$ for Gauss MH and ULA
Related work
§ Layered adaptive importance sampling (Martino, Elvira, Luengo, & Corander, 2015)
  § use samples from MH as IS proposal locations
  § unbiased estimator
  § twice the computational cost of MCIS (no. of posterior evaluations)
§ Simple MCIS (Rudolf & Sprungk, 2018)
  § poor but cheap approximation for $\frac{\rho(X_{i+1})}{\rho_\theta(X_{i+1})}$: use $\frac{\rho(X_{i+1})}{q_\theta(X_{i+1}|X_i)}$
  § results in an unbiased estimator as well
  § will evaluate this variant in experiments (S-MCIS); see the sketch below
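For comparison, a minimal sketch of the cheap S-MCIS weight $\rho(X_{i+1})/q_\theta(X_{i+1}|X_i)$ for a Gaussian random-walk chain (illustrative, assuming scalar states):

```python
import numpy as np
from scipy.stats import norm

def s_mcis_log_weights(log_rho, xs, step_sd):
    """log rho(X_{i+1}) - log q_theta(X_{i+1} | X_i) along the chain."""
    prev, curr = xs[:-1], xs[1:]
    log_q = norm.logpdf(curr, loc=prev, scale=step_sd)
    return np.array([log_rho(x) for x in curr]) - log_q
```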
Theory
Runtime Hastings MCIS
§ computational prolongation factor MCIS/MH is
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
where $c_d$ is the target density evaluation cost and $K$ the number of samples (see the worked example below)
§ if $c_d$ is high (typical in Bayesian inference), the computational overhead becomes negligible
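A worked example with assumed numbers (not from the talk): with $\alpha = 0.25$, $K = 10^4$, $c_t = c_p = 1$, and $c_d = 10^4$, the factor is
$$1 + \frac{0.75\cdot 10^4 + 0.25\cdot 10^8}{0.25\cdot 10^4 + 10^8 + 2\cdot 10^4} \approx 1.25,$$
and it shrinks toward 1 as $c_d$ grows further.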
Law of large numbers
§ analysis of the estimator is intricate when looking at $\rho_\theta^K$
§ will use $\rho_\theta$ directly, since $\rho_\theta^K \to \rho_\theta$ pointwise a.s.
§ then (with $\tilde\rho = \rho/E$ the normalized target)
  § $\frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta(Y_i)} \to_{\text{a.s.}} E$
  § $\frac{1}{K}\sum_{i=1}^{K} h(Y_i)\,\frac{\rho(Y_i)}{\rho_\theta(Y_i)} \to_{\text{a.s.}} E\int h(x)\,\tilde\rho(x)\,dx$
§ by Theorem 6.63 in Robert and Casella (2007) this means that
$$\frac{\frac{1}{K}\sum_{i=1}^{K} h(Y_i)\,\frac{\rho(Y_i)}{\rho_\theta(Y_i)}}{\frac{1}{K}\sum_{i=1}^{K}\frac{\rho(Y_i)}{\rho_\theta(Y_i)}} \to_{\text{a.s.}} \frac{E\int h(x)\,\tilde\rho(x)\,dx}{E}$$
Central limit theorem
§ from the CLT for MCMC we have
$$\sqrt{K}\left(\frac{1}{K}\sum_{i=1}^{K}\frac{h(Y_i)\,\rho(Y_i)}{\rho_\theta(Y_i)} - E\int h(x)\,\tilde\rho(x)\,dx\right) \to_{\text{distr.}} \mathcal{N}(0,\gamma)$$
for some $\gamma$
§ the estimator for $E$ converges a.s. to the correct value
§ apply Slutsky's lemma to get a CLT for the MCIS estimator with limit
$$\mathcal{N}\!\left(0,\ \frac{\gamma}{\sqrt{E}}\right)$$
Empirical results
Hastings MCIS
Figure: target density $p$ vs. global proposal density $g$ for a Gaussian random walk with sd 3
Metropolis-Hastings MCIS 1D
Figure: moving average of log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(e^x)$ (3000 samples, 10 repetitions, 10 iid MCs); estimators MC, MCIS, S-MCIS
Metropolis-Hastings MCIS 5D
Figure: log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(e^x)$ (12000 samples, 10 repetitions); estimators MC, MCIS, S-MCIS
Metropolis-Hastings MCIS Gaussian Process
Applied problem: Gaussian process regression for noise level prediction
§ tests of airfoil blade sections conducted in an anechoic wind tunnel by NASA
§ predictors
  1. Frequency, in Hertz
  2. Angle of attack, in degrees
  3. Chord length, in meters
  4. Free-stream velocity, in meters per second
  5. Suction side displacement thickness, in meters
§ output: scaled sound pressure level, in decibels
Metropolis-Hastings MCIS Gaussian Process
§ use a Gaussian ARD kernel for the GP
  § parameter $\sigma_i^2$ for the bandwidth in dimension $i$
  § parameter $\sigma_n^2$ for the noise variance
  § prior $\log(\exp(\sigma_*^2) - 1) \sim \mathcal{N}(0, 1)$ (see the sketch below)
  § results in a 6D posterior
§ ground truth obtained using 50 000 iterations of MH with the standard estimator
§ experiment using multiple MH runs of 12 000 iterations
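A minimal sketch of sampling from this prior: $\log(\exp(\sigma_*^2) - 1) \sim \mathcal{N}(0,1)$ is equivalent to $\sigma_*^2 = \log(1 + e^z)$ with $z \sim \mathcal{N}(0,1)$, an inverse-softplus reparameterization (the code is illustrative; only the dimension 6 comes from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=6)           # one standard normal per hyperparameter
sigma2 = np.log1p(np.exp(z))     # softplus keeps the variances positive
```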
Metropolis-Hastings MCIS Gaussian Process
Figure: time-adjusted absolute error (tAE) for $E(x)$, averaged across dimensions; estimators MC, S-MCIS, S-MCIS-CV, MCIS, MCIS-CV
Metropolis-Hastings MCIS Gaussian Process
Figure: time-adjusted absolute error (tAE) for $E(e^x)$, averaged across dimensions; estimators MC, S-MCIS, S-MCIS-CV, MCIS, MCIS-CV
Conclusion and Outlook
Open questions
§ How to tell when $\rho_\theta^K$ has converged?
§ For Gaussian random-walk MH,
$$\rho_\theta = \int \rho(x)\,\mathcal{N}(\cdot\,;\,x,\theta)\,dx$$
is an injective embedding of $\rho$ into a reproducing kernel Hilbert space (RKHS)
  § Can the RKHS norm help determine convergence?
  § At what cost?
Open questions
§ Is there a way to correct the bias introduced by only estimating $\rho_\theta$, and by using self-normalized IS?
  § for random-walk MH an unbiased estimator for $E\rho_\theta$ exists
§ behavior in increasing dimensions
§ which variants of LMC have stationary distributions?
Conclusion and outlook
§ the MCIS estimator
  § has a CLT
  § can be very efficient wrt error vs. CPU time
  § is available for a large class of MH
  § is available for LMC
  § can estimate the evidence
§ fundamental insights into MH
Thank you
Unadjusted Langevin Algorithm
§ current state of the MC is $X_i$
§ generate the next state $X_{i+1} \sim q_\theta(\cdot\,|\,X_i)$ where (sketched in code below)
$$q_\theta(\cdot\,|\,X_i) = \mathcal{N}\!\left(\cdot\;\Big|\;X_i + \theta\,\nabla\log\rho(X_i),\ \tfrac{1}{2}\sqrt{\theta}\,I\right)$$
§ under certain assumptions $q_\theta$ is a Markov kernel for $\rho_\theta$ (Durmus & Moulines, 2017), which means
$$\int q_\theta(\cdot\,|\,x)\,\rho_\theta(x)\,dx = \rho_\theta(\cdot)$$
§ the Markov chain $X_1, X_2, \dots$ resulting from repeated application of $q_\theta$ is ergodic wrt $\rho_\theta$
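A minimal ULA sketch following the update as transcribed above; the $\tfrac{1}{2}\sqrt{\theta}\,I$ covariance is taken verbatim from the slide and may be parameterized differently elsewhere, and `grad_log_rho` is an assumed helper:

```python
import numpy as np

def ula_chain(grad_log_rho, x0, n_steps, theta, rng):
    """X_{i+1} ~ N(X_i + theta * grad log rho(X_i), 0.5 * sqrt(theta) * I)."""
    x = np.asarray(x0, dtype=float)
    noise_sd = np.sqrt(0.5 * np.sqrt(theta))   # sd matching the slide's covariance
    xs = [x]
    for _ in range(n_steps):
        x = x + theta * grad_log_rho(x) + noise_sd * rng.normal(size=x.shape)
        xs.append(x)
    return np.array(xs)
```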
Langevin MCIS
§ but if we have samples from $\rho_\theta$ and a Markov kernel, then
$$\rho_\theta(\cdot) = \int q_\theta(\cdot\,|\,x)\,\rho_\theta(x)\,dx \approx \frac{1}{K}\sum_{i=1}^{K} q_\theta(\cdot\,|\,X_i) = \rho_\theta^K(\cdot)$$
§ estimate of the normalizing constant of $\rho$:
$$E \approx \frac{1}{K}\sum_{i=1}^{K}\frac{\rho(X_i)}{\rho_\theta^K(X_i)}$$
§ MCIS estimator (see the sketch below):
$$H \approx \frac{1}{K\hat{E}}\sum_{i=1}^{K}\frac{h(X_i)\,\rho(X_i)}{\rho_\theta^K(X_i)}$$
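For Langevin MCIS the mixture components of $\rho_\theta^K$ are the ULA transition densities rather than random-walk kernels; a hedged sketch for scalar states, under the same transcribed parameterization as above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def ula_log_rho_theta_K(y, xs, grad_log_rho, theta):
    """log rho_theta^K(y) with q_theta(.|X_i) the ULA kernel above."""
    means = xs + theta * np.array([grad_log_rho(x) for x in xs])
    sd = np.sqrt(0.5 * np.sqrt(theta))   # matches the transcribed ULA noise
    log_q = norm.logpdf(y, loc=means, scale=sd)
    return logsumexp(log_q) - np.log(len(xs))
```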
Runtime Hastings MCIS (1)
§ let
  § $\alpha$ be the acceptance rate
  § $c_t$ the test function cost
  § $c_p$ the proposal density cost
  § $c_d$ the target density cost
§ cost of standard MH: $O(\alpha K c_t + K c_d + 2K c_p)$
§ cost of Hastings MCIS: $O(K c_t + K c_d + 2K c_p + \alpha K^2 c_p)$
§ prolongation factor MCIS/MH is
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
Runtime Hastings MCIS (2)
§ let
  § $\alpha$ be the acceptance rate
  § $c_t$ the test function cost
  § $c_p$ the proposal density cost
  § $c_d$ the target density cost
§ prolongation factor MCIS/MH:
$$1 + \frac{(1-\alpha)K c_t + \alpha K^2 c_p}{\alpha K c_t + K c_d + 2K c_p}$$
§ in Bayesian inference $c_d$ is large (it involves looking at many data points), so $c_d \gg c_p$
Langevin MCIS: Experiments
Figure: time-adjusted absolute error (sec · AE) for the possible estimators MC, S-MCIS, MCIS
Langevin MCIS: Experiments
Figure: moving average of log AE (with 20% ci) vs. log CPU time for the estimate of $\mu(x^3)$ (3000 samples, 10 repetitions, 10 iid MCs); estimators MC, MCIS, S-MCIS
Bibliography
Durmus, Alain and Eric Moulines (2017). "Nonasymptotic convergence analysis for the unadjusted Langevin algorithm". In: The Annals of Applied Probability 27.3, pp. 1551–1587.
Martino, L. et al. (2015). "Layered Adaptive Importance Sampling". url: http://arxiv.org/abs/1505.04732.
Robert, Christian and George Casella (2007). Monte Carlo statistical methods.
Rudolf, Daniel and Bjorn Sprungk (2018). "An importance sampling estimator for Metropolis-Hastings". Personal communication.