Bayesian inversion of deterministic dynamic causal models


Kay H. Brodersen¹,² & Ajita Gupta²

Mike West

¹ Department of Economics, University of Zurich, Switzerland; ² Department of Computer Science, ETH Zurich, Switzerland


Overview

1. Problem setting: model; likelihood; prior; posterior.

2. Variational Laplace: factorization of the posterior; why the free energy; variational energies; gradient ascent; adaptive step size; example.

3. Sampling: transformation method; rejection method; Gibbs sampling; MCMC; Metropolis-Hastings; example.

4. Model comparison: model evidence; Bayes factors; free energy; prior arithmetic mean; posterior harmonic mean; Savage-Dickey; example.

With material from Will Penny, Klaas Enno Stephan, Chris Bishop, and Justin Chumbley.

1. Problem setting


Model = likelihood + prior


Question 1: what do the data tell us about the model parameters?

compute the posterior

๐‘ ๐œƒ ๐‘ฆ, ๐‘š =๐‘ ๐‘ฆ ๐œƒ, ๐‘š ๐‘ ๐œƒ ๐‘š

๐‘ ๐‘ฆ ๐‘š

Question 2: which model is best?

compute the model evidence

๐‘ ๐‘š ๐‘ฆ โˆ ๐‘ ๐‘ฆ ๐‘š ๐‘(๐‘š) = โˆซ ๐‘ ๐‘ฆ ๐œƒ, ๐‘š ๐‘ ๐œƒ ๐‘š ๐‘‘๐œƒ

Bayesian inference is conceptually straightforward

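To make the two questions concrete, here is a minimal sketch (not from the slides; the coin-flip model, the Beta prior, and all variable names are illustrative) that computes the posterior and the model evidence on a parameter grid:

```python
import numpy as np
from scipy import stats

# Toy example: y heads in n coin flips; theta = P(heads); prior theta ~ Beta(2, 2).
y, n = 7, 10
theta = np.linspace(1e-6, 1 - 1e-6, 1001)        # grid over the parameter
dtheta = theta[1] - theta[0]

prior = stats.beta.pdf(theta, 2, 2)              # p(theta | m)
likelihood = stats.binom.pmf(y, n, theta)        # p(y | theta, m)

# Question 2: model evidence p(y|m) = integral of p(y|theta,m) p(theta|m) dtheta
evidence = np.sum(likelihood * prior) * dtheta

# Question 1: posterior p(theta|y,m) = p(y|theta,m) p(theta|m) / p(y|m)
posterior = likelihood * prior / evidence

print(f"p(y|m) = {evidence:.4f}")
print(f"posterior mean = {np.sum(theta * posterior) * dtheta:.3f}")   # approx 9/14
```

For this conjugate Beta-binomial toy case, the grid answers can be checked against the closed-form posterior Beta(2 + y, 2 + n - y).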

2. Variational Laplace


Variational Laplace in a nutshell

Mean-field approximation:

$$p(\theta \mid y) \approx q(\theta) = q(\theta_1)\, q(\theta_2)$$

Negative free-energy approximation to the model evidence:

$$\ln p(y \mid m) = F + KL\big[\, q(\theta) \,\|\, p(\theta \mid y, m) \,\big]$$

$$F = \big\langle \ln p(y \mid \theta, m) \big\rangle_q - KL\big[\, q(\theta) \,\|\, p(\theta \mid m) \,\big]$$

Maximising the negative free energy w.r.t. $q$ minimises the divergence; this is done by maximising the variational energies $I(\theta_i)$:

$$q(\theta_1) \propto \exp\big( I(\theta_1) \big) = \exp\Big( \big\langle \ln p(y, \theta_1, \theta_2) \big\rangle_{q(\theta_2)} \Big)$$

$$q(\theta_2) \propto \exp\big( I(\theta_2) \big) = \exp\Big( \big\langle \ln p(y, \theta_1, \theta_2) \big\rangle_{q(\theta_1)} \Big)$$

The sufficient statistics of the approximate posteriors are updated iteratively by gradient ascent.

K.E. Stephan


Variational Laplace

Assumptions

mean-field approximation

Laplace approximation

W. Penny


Variational Laplace

Inversion strategy

Recall the relationship between the log model evidence and the negative free-energy $F$:

$$\ln p(y \mid m) = \underbrace{E_q\big[\ln p(y \mid \theta)\big] - KL\big[q(\theta)\,\|\,p(\theta \mid m)\big]}_{=:F} + \underbrace{KL\big[q(\theta)\,\|\,p(\theta \mid y, m)\big]}_{\geq 0}$$

Maximizing $F$ implies two things:

(i) we obtain a good approximation to $\ln p(y \mid m)$;

(ii) the KL divergence between $q(\theta)$ and $p(\theta \mid y, m)$ becomes minimal.

Practically, we can maximize $F$ by iteratively maximizing the variational energies, as in EM.

W. Penny


Variational Laplace

Implementation: gradient-ascent scheme (Newton's method)

Compute the gradient vector and the curvature (Hessian) matrix of the variational energy $I(\theta)$:

$$j_\theta(i) = \frac{\partial I(\theta)}{\partial \theta(i)}, \qquad H_\theta(i,j) = \frac{\partial^2 I(\theta)}{\partial \theta(i)\, \partial \theta(j)}$$

Newton's method for finding a root (1D):

$$x_{\text{new}} = x_{\text{old}} - \frac{f(x_{\text{old}})}{f'(x_{\text{old}})}$$


Variational Laplace

Implementation: gradient-ascent scheme (Newton's method)

Compute the Newton update (change) and the new estimate:

$$\Delta\theta = -H_\theta^{-1}\, j_\theta, \qquad \theta_{\text{new}} = \theta_{\text{old}} + \Delta\theta$$

Big curvature -> small step; small curvature -> big step.

By analogy, Newton's method for finding a root (1D):

$$x_{\text{new}} = x_{\text{old}} - \frac{f(x_{\text{old}})}{f'(x_{\text{old}})}$$

W. Penny
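As a minimal illustration of the update rule (not from the slides; the toy objective and all names are illustrative), a Newton-style ascent on a function with known gradient and Hessian:

```python
import numpy as np

def newton_ascent(grad, hess, theta0, n_iter=10):
    """Newton-style ascent on an objective with gradient and Hessian callables."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        j = grad(theta)                      # gradient vector j_theta
        H = hess(theta)                      # curvature (Hessian) matrix H_theta
        step = -np.linalg.solve(H, j)        # big curvature -> small step
        theta = theta + step
    return theta

# Toy objective I(theta) = -(theta1 - 1)^2 - 2*(theta2 + 0.5)^2, maximum at (1, -0.5)
grad = lambda t: np.array([-2 * (t[0] - 1), -4 * (t[1] + 0.5)])
hess = lambda t: np.diag([-2.0, -4.0])

print(newton_ascent(grad, hess, theta0=[3.0, 3.0]))   # -> approx [1.0, -0.5]
```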


Newton's method – demonstration

Newton's method is very efficient. However, its solution is sensitive to the starting point, as the linked demonstration shows.

http://demonstrations.wolfram.com/LearningNewtonsMethod/


Variational Laplace

Nonlinear regression (example)

Data: observations $y(t)$ plotted against time $t$ (figure).

Model (likelihood): a nonlinear regression model of the data (equation shown on the slide, not reproduced here).

Ground truth: the known parameter values that were used to generate the data.

W. Penny


Variational Laplace

Nonlinear regression (example)

We begin by defining our prior (figure: prior density over the parameters, with the ground-truth values marked).

W. Penny


Variational Laplace

Nonlinear regression (example)

Figure: posterior density and the VL optimization path (4 iterations).

Starting point: (2.9, 1.65)

VL estimate: (3.4, 2.1)

True value: (3.4012, 2.0794)

W. Penny

3. Sampling


Sampling

Deterministic approximations

+ computationally efficient

+ efficient representation

+ learning rules may give additional insight

– application initially involves hard work

– systematic error

Stochastic approximations

+ asymptotically exact

+ easily applicable general-purpose algorithms

– computationally expensive

– storage intensive


Strategy 1 – Transformation method

We can obtain samples from some distribution $p(z)$ by first sampling from the uniform distribution and then transforming these samples.

Figure: a uniform sample $u$ (left, density $p(u)$) is mapped through the inverse cdf to a sample $z$ from the target density $p(z)$ (right); for example, $u = 0.9$ maps to $z = 1.15$.

Transformation: $z^{(\tau)} = F^{-1}\big(u^{(\tau)}\big)$, where $F$ is the cdf of the target distribution.

+ yields high-quality samples

+ easy to implement

+ computationally efficient

– obtaining the inverse cdf can be difficult
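A minimal sketch of the transformation method (illustrative, not from the slides), using the exponential distribution, whose inverse cdf has the closed form $F^{-1}(u) = -\ln(1-u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # rate of the target Exp(lam) distribution

u = rng.uniform(size=100_000)              # samples from U(0, 1)
z = -np.log(1.0 - u) / lam                 # inverse cdf: F^{-1}(u) = -ln(1-u)/lam

print(z.mean(), 1.0 / lam)                 # sample mean is approx 1/lam
```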


Strategy 2 – Rejection method

When the transformation method cannot be applied, we can resort to a more general method called rejection sampling. Here, we draw random numbers from a simpler proposal distribution $q(z)$ and keep only some of these samples.

The proposal distribution $q(z)$, scaled by a factor $k$, represents an envelope of $p(z)$.

+ yields high-quality samples

+ easy to implement

+ can be computationally efficient

– computationally inefficient if the proposal is a poor approximation

Bishop (2007) PRML
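A minimal sketch of rejection sampling (illustrative, not from the slides): drawing from a Beta(2, 5) target using a uniform proposal $q(z)$ and an envelope constant $k$ chosen such that $k\,q(z) \ge p(z)$ everywhere:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target = stats.beta(2, 5)                      # target density p(z) on [0, 1]
k = 2.5                                        # envelope constant: k * q(z) >= p(z)

samples = []
while len(samples) < 10_000:
    z = rng.uniform()                          # draw from the proposal q(z) = U(0, 1)
    u = rng.uniform(0, k)                      # uniform height under the envelope k*q(z)
    if u <= target.pdf(z):                     # keep the sample only if it falls under p(z)
        samples.append(z)

samples = np.array(samples)
print(samples.mean(), target.mean())           # both approx 2/7
```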


Strategy 3 – Gibbs sampling

Often the joint distribution of several random variables is unavailable, whereas the full-conditional distributions are available. In this case, we can cycle over the full-conditionals to obtain samples from the joint distribution.

+ easy to implement

– samples are serially correlated

– the full-conditionals may not be available

Bishop (2007) PRML
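A minimal Gibbs-sampling sketch (illustrative, not from the slides) for a standard bivariate Gaussian with correlation $\rho$, whose full-conditionals are univariate Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_samples = 0.8, 20_000
z1, z2 = 0.0, 0.0                                   # arbitrary starting point
samples = np.empty((n_samples, 2))

for t in range(n_samples):
    # Full-conditionals of a standard bivariate Gaussian with correlation rho:
    # z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1.
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    samples[t] = z1, z2

print(np.corrcoef(samples.T)[0, 1])                 # approx rho (= 0.8)
```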


Strategy 4 – Markov Chain Monte Carlo (MCMC)

Idea: we can sample from a large class of distributions, and overcome the problems that the previous methods face in high dimensions, by using a framework called Markov Chain Monte Carlo.

Given the current sample $z^{\tau}$, a candidate $z^{*}$ is drawn from the proposal distribution and accepted with probability

$$A(z^{*}, z^{\tau}) = \min\!\left( 1, \; \frac{\tilde{p}(z^{*})}{\tilde{p}(z^{\tau})} \right)$$

Increase in density, $\tilde{p}(z^{*}) \geq \tilde{p}(z^{\tau})$: the candidate is always accepted.

Decrease in density, $\tilde{p}(z^{*}) < \tilde{p}(z^{\tau})$: the candidate is accepted with probability $\tilde{p}(z^{*}) / \tilde{p}(z^{\tau})$.

M. Dümcke
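A minimal Metropolis sketch (illustrative, not from the slides): random-walk sampling from an unnormalized target $\tilde{p}(z)$, here a two-component Gaussian mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Unnormalized target: mixture of two Gaussian bumps."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

n_samples, step = 50_000, 2.0
z = 0.0
samples = np.empty(n_samples)

for t in range(n_samples):
    z_star = z + rng.normal(0, step)                 # symmetric random-walk proposal
    accept_prob = min(1.0, p_tilde(z_star) / p_tilde(z))
    if rng.uniform() < accept_prob:                  # accept with probability min(1, ratio)
        z = z_star
    samples[t] = z                                   # on rejection, the old sample is repeated

print(samples.mean())                                # approx the mixture mean
```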


MCMC demonstration: finding a good proposal density

When the proposal distribution is too narrow, we might miss a mode.

When it is too wide, we obtain long constant stretches without an acceptance.

http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/


MH creates a series of random points $\theta^{(1)}, \theta^{(2)}, \ldots$, whose distribution converges to the target distribution of interest. For us, this is the posterior density $p(\theta \mid y)$.

We could use the following proposal distribution:

๐‘ž ๐œƒ ๐œ ๐œƒ ๐œโˆ’1 = ๐’ฉ 0, ๐œŽ๐ถ๐œƒ

MCMC for DCM

prior covariance

scaling factor adapted such that acceptance rate is between 20% and 40%

W. Penny
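A sketch of such an adaptive random-walk scheme (illustrative; this is not Penny's implementation, and the toy log-posterior stands in for a real DCM): the scaling factor $\sigma$ is nudged during burn-in so that the acceptance rate stays roughly within the 20–40% band.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    """Illustrative unnormalized log-posterior (stand-in for a DCM posterior)."""
    return -0.5 * np.sum(theta**2 / np.array([1.0, 4.0]))

C_theta = np.diag([1.0, 4.0])          # prior covariance (assumed known)
L = np.linalg.cholesky(C_theta)
sigma, theta = 0.1, np.zeros(2)
accepted = 0

for t in range(1, 20_001):
    eps = np.sqrt(sigma) * L @ rng.normal(size=2)     # eps ~ N(0, sigma * C_theta)
    theta_star = theta + eps
    if np.log(rng.uniform()) < log_post(theta_star) - log_post(theta):
        theta, accepted = theta_star, accepted + 1
    if t % 500 == 0:                                  # adapt sigma (during burn-in only)
        rate = accepted / t
        if rate < 0.2:
            sigma *= 0.8                              # too few acceptances -> smaller steps
        elif rate > 0.4:
            sigma *= 1.25                             # too many acceptances -> larger steps

print(f"acceptance rate: {accepted / 20_000:.2f}, sigma: {sigma:.3f}")
```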

MCMC – example

Figure: comparison of the Variational-Laplace and Metropolis-Hastings estimates.

W. Penny

4. Model comparison


Model evidence

W. Penny


Prior arithmetic mean

W. Penny


Posterior harmonic mean

W. Penny
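The equations on these slides are not reproduced in the transcript, but the two estimators named above are standard Monte Carlo approximations of the model evidence: the prior arithmetic mean averages the likelihood over prior samples, and the posterior harmonic mean averages the reciprocal likelihood over posterior samples. A minimal sketch (illustrative; the conjugate Gaussian model and all names are ours), where the exact evidence is available for comparison:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Conjugate Gaussian model: theta ~ N(0, s0^2), y_i | theta ~ N(theta, 1)
s0, n = 2.0, 10
y = rng.normal(1.0, 1.0, size=n)

def log_lik(theta):
    return norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)

# Exact log evidence for reference
cov = np.eye(n) + s0**2 * np.ones((n, n))
print("exact        :", multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov))

N = 100_000

# Prior arithmetic mean: p(y|m) ~= (1/N) * sum_i p(y|theta_i), theta_i ~ p(theta|m)
theta_prior = rng.normal(0.0, s0, size=N)
ll = log_lik(theta_prior)
print("prior mean   :", np.log(np.exp(ll - ll.max()).mean()) + ll.max())

# Posterior harmonic mean: p(y|m) ~= [(1/N) * sum_i 1/p(y|theta_i)]^{-1},
# theta_i ~ p(theta|y,m)  (known to be a high-variance, unstable estimator)
sN2 = 1.0 / (1.0 / s0**2 + n)
mN = sN2 * y.sum()
theta_post = rng.normal(mN, np.sqrt(sN2), size=N)
ll = log_lik(theta_post)
print("harmonic mean:", -(np.log(np.exp(-ll + ll.min()).mean()) - ll.min()))
```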


In many situations we wish to compare models that are nested. For example:

๐‘š๐น: full model with parameters ๐œƒ = ๐œƒ1, ๐œƒ2

๐‘š๐‘…: reduced model with ๐œƒ = (๐œƒ1, 0)

Savage-Dickey ratio

In this case, we can use the Savage-Dickey ratio to obtain a Bayes factor without having to compute the two model evidences:

๐ต๐‘…๐น =๐‘ ๐œƒ2 = 0 ๐‘ฆ, ๐‘š๐น

๐‘ ๐œƒ2 = 0 ๐‘š๐น

๐ต๐‘…๐น = 0.9

W. Penny
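A minimal Savage-Dickey sketch (illustrative, not from the slides), assuming a Gaussian prior and an approximately Gaussian posterior marginal over $\theta_2$, e.g. as returned by VL; the moments below are made-up placeholders:

```python
from scipy.stats import norm

# Prior and (approximate) posterior marginals over theta_2 under the full model m_F.
prior_mean, prior_sd = 0.0, 1.0
post_mean, post_sd = 0.6, 0.4

# Savage-Dickey: B_RF = p(theta_2 = 0 | y, m_F) / p(theta_2 = 0 | m_F)
B_RF = norm.pdf(0.0, post_mean, post_sd) / norm.pdf(0.0, prior_mean, prior_sd)
print(f"B_RF = {B_RF:.2f}")   # > 1 favours the reduced model, < 1 the full model
```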


Comparison of methods

Figure: estimated minus true log Bayes factor (log BF), shown for MCMC and for VL.

Chumbley et al. (2007) NeuroImage