Bayesian Statistics and Data Assimilation
Jonathan Stroud
Department of Statistics
The George Washington University
1
Outline
• Motivation
• Bayesian Statistics
• Parameter Estimation in Data Assimilation
• Combined State and Parameter Estimation within EnKF
– Physical Parameters
– Observation Error Variance
– Observation Error Covariance
2
Motivation
• Physical models and data assimilation systems often involve
unknown parameters:
– physical model parameters
– error covariance parameters
– covariance inflation factors
– localization radius
• Use data to estimate parameters, either off-line or sequentially.
• Two common approaches to parameter estimation
– Maximum likelihood estimation
– Bayesian estimation
3
Statistical Methods for Parameter Estimation
• Maximum Likelihood approach
– Specify a likelihood function for the data.
– Choose parameters to maximize this function.
• Bayesian approach
– Parameters are assigned a prior probability distribution.
– Update the prior distribution using Bayes Theorem.
– Summarize the parameters using the posterior distribution.
4
Bayesian Parameter Estimation
• A Bayesian model includes the following components
(where d = data; α = unknown parameters):
• Likelihood function: p(d |α)
• Prior distribution: p(α)
• Posterior distribution (via Bayes’ Theorem):
p(α|d) = p(α) p(d|α) / p(d)
• The parameters can be summarized using the posterior mean,
standard deviation, or 95% posterior intervals.
5
Example 1: Normal Data with Unknown Mean
• Let d1, ... , dn be iid samples from a normal distribution with
unknown mean θ and variance v . The likelihood function is
p(d|θ) = (2πv)^{−n/2} exp{ −(d̄ − θ)² / (2v/n) }.
• The standard prior distribution is a normal: θ ∼ N(θb, vb)
p(θ) = (2πv_b)^{−1/2} exp{ −(θ − θ_b)² / (2v_b) }
• Posterior distribution (likelihood × prior):
p(θ|d) ∝ exp{ −(d̄ − θ)² / (2v/n) − (θ − θ_b)² / (2v_b) }
6
Example 1: Normal Data with Unknown Mean
• The posterior distribution is normal
θ|d ∼ N(θa, va)
with mean
θ_a = [ (v/n)θ_b + v_b d̄ ] / [ v_b + (v/n) ]
and information
v_a^{−1} = v_b^{−1} + (v/n)^{−1}
• The posterior mean is a weighted average of the prior mean and
the sample mean. The posterior information is the sum of the
prior and data information.
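As a concrete sketch, the conjugate update above can be written in a few lines of NumPy (the function name is ours):

```python
import numpy as np

def normal_mean_update(theta_b, v_b, d, v):
    """Conjugate update for a normal mean with known variance v.

    Prior: theta ~ N(theta_b, v_b).  Data: d_1, ..., d_n iid N(theta, v).
    Returns the posterior mean and variance. The posterior precision is
    the sum of the prior precision and the data precision n/v.
    """
    n = len(d)
    dbar = np.mean(d)
    prec_a = 1.0 / v_b + n / v          # posterior information
    v_a = 1.0 / prec_a
    theta_a = v_a * (theta_b / v_b + n * dbar / v)
    return theta_a, v_a
```

With the prior N(0, 1) and a likelihood summarized by d̄ = 2 and v/n = 4, this reproduces the N(0.4, 0.8) posterior shown in the figure on the next slide.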
7
Example 1: Normal Data with Unknown Mean
[Figure: three density panels. Prior: N(0,1); Likelihood: N(2,4); Posterior: N(0.4,0.8).]
8
Example 2: Normal with Unknown Variance
• Let d1, ... , dn be iid samples from a normal distribution with
mean zero and unknown variance v . The likelihood function is
p(d|v) = (2πv)^{−n/2} exp( −Σᵢ dᵢ² / (2v) ).
• The prior distribution is inverse gamma, v ∼ IG (m/2, s/2)
p(v) = [(s/2)^{m/2} / Γ(m/2)] v^{−m/2−1} exp( −s/(2v) ).
• The posterior distribution is
p(v|d) ∝ v^{−(m+n)/2−1} exp( −(s + Σᵢ dᵢ²) / (2v) ).
9
Example 2: Normal with Unknown Variance
• The posterior is also inverse gamma
v|d ∼ IG( (m + n)/2, (s + Σᵢ dᵢ²)/2 ).
• The parameters are updated by adding the sample size and the
data sum of squares, respectively.
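The update is just two additions; a minimal sketch (function name ours):

```python
import numpy as np

def inv_gamma_update(m, s, d):
    """Conjugate update for a normal variance with known mean 0.

    Prior: v ~ IG(m/2, s/2).  Data: d_1, ..., d_n iid N(0, v).
    Returns the updated hyperparameters (m + n, s + sum of d_i^2).
    """
    return m + len(d), s + float(np.sum(d ** 2))
```

In shape/scale form (a = m/2, b = s/2), a prior IG(5, 10) combined with n = 20 observations with Σ dᵢ² = 100 gives the IG(15, 60) posterior of the figure on the next slide, with posterior mean 60/14 ≈ 4.3.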
10
Example 2: Normal with Unknown Variance
[Figure: three density panels. Prior: IG(5,10), mean = 2.5; Likelihood: IG(10,50), mean = 5.6; Posterior: IG(15,60), mean = 4.3.]
11
Conjugate Priors
• The normal and inverse gamma priors are called “conjugate
priors”. These occur when the prior and posterior distribution
belong to the same family.
• These are convenient because the distribution can be updated
by updating a set of “sufficient statistics.”
• Other examples of conjugate priors:
– Inverse-Wishart, for the covariance matrix of a normal.
– Normal-Inverse Gamma, for the mean and variance of a normal.
– Dirichlet, for the probabilities of a discrete distribution.
12
Sequential Bayesian Estimation
• If the data are assimilated sequentially, we want to update the
parameters α after each new observation d1, d2, ... , dn.
• Under the Bayesian approach, this requires calculating the
sequence of posterior distributions
p(α|d1), p(α|d1, d2), ... , p(α|d1, d2, ... , dn)
• This is done by applying Bayes theorem recursively after each
new observation:
p(α|d1, ... , dk) ∝ p(dk |α) p(α|d1, ... , dk−1), for each k .
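The recursion is easy to implement when the parameter is kept on a discrete grid; a generic sketch (ours, with a user-supplied log-likelihood):

```python
import numpy as np

def recursive_bayes(alpha_grid, prior, data, loglik):
    """Recursively update a discrete posterior p(alpha | d_1, ..., d_k).

    loglik(d, alpha_grid) returns the log-likelihood of observation d
    evaluated at each grid point.  Yields the normalized posterior
    after each observation, i.e. the sequence of posteriors above.
    """
    post = np.asarray(prior, dtype=float)
    for d in data:
        post = post * np.exp(loglik(d, alpha_grid))   # Bayes' theorem
        post /= post.sum()                            # renormalize
        yield post
```

Each step multiplies the current posterior by the likelihood of the newest observation and renormalizes, exactly as in the recursion above.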
13
Sequence of Posterior Distributions
[Figure: posterior densities at t = 1 (MLE = 2.9, Mode = 1, Mean = 3), t = 2 (MLE = 0.2, Mode = 0.9, Mean = 2), t = 4 (MLE = 11, Mode = 2.2, Mean = 3.9), t = 25 (MLE = 1, Mode = 1.6, Mean = 1.8), t = 100 (MLE = 9, Mode = 2, Mean = 2.1), t = 200 (MLE = 8.6, Mode = 2, Mean = 2.1).]
14
Sequential Bayesian Estimates
[Figure: single-observation MLEs of alpha over 200 time steps (left); sequential Bayesian estimates with posterior mean, posterior mode, and 90% posterior CI (right).]
15
Convergence of the Posterior Distribution
1. Consistency: as observations accumulate, the posterior distribution concentrates at the true parameter value.
2. Asymptotic normality: if the model is correct, and certain regularity conditions hold, the posterior distribution converges to a normal distribution with mean equal to the true value and covariance equal to the asymptotic covariance matrix.
16
Parameter Estimation in Data Assimilation
• Many parameter estimation methods have been proposed for
data assimilation systems.
• Maximum likelihood estimation
– Dee (1995) and Dee & Da Silva (1999): Error covariances
– Mitchell & Houtekamer (2000): Error covariances (EnKF)
– Li, Kalnay & Miyoshi (2007): Variance/Covariance inflation
• Bayesian estimation
– Anderson & Anderson (1999): State augmentation
– Stroud & Bengtsson (2007): Observation error variance
– Anderson (2007, 2009): Covariance inflation factors
– Miyoshi (2011): Covariance inflation factors
17
Estimation of Physical Parameters
• State augmentation is used to estimate unknown parameters θ
in the physical model M(xt , θ).
• Define the augmented state vector z_t = (x_t, θ_t), and the augmented model as
x_t = M(x_{t−1}, θ_{t−1}) + w_t
θ_t = θ_{t−1}.
• Specify an initial prior distribution, θ0 ∼ p(θ0).
• Then, standard data assimilation methods are applied to zt to
estimate the posterior distribution, p(θ|d1, ... , dt), at each time t.
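A minimal sketch of the augmented forecast step (ours; `model` and `q_std` stand in for the physical model M and process-noise level):

```python
import numpy as np

def augmented_forecast(model, x, theta, q_std):
    """One forecast step for the augmented state z = (x, theta).

    The physical state evolves through the model with additive noise;
    the parameter is carried forward unchanged, so the assimilation
    update is the only thing that moves it.
    """
    x_new = model(x, theta) + q_std * np.random.randn(*x.shape)
    return x_new, theta   # theta_t = theta_{t-1}
```

The key design point is that theta has identity dynamics and zero model noise, matching the augmented model above; the ensemble update then shifts theta through its sample correlation with the observed state.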
18
Example: Lorenz 63 Model
• Model equations:
dx/dt = σ(y − x)
dy/dt = ρx − y − xz
dz/dt = xy − βz
[Figure: Lorenz 63 butterfly attractor in (x, y, z) space, with observation points overlaid.]
• The state vector is x = (x, y, z), and the parameter is θ = (σ, ρ, β).
• The parameters σ = 10, ρ = 28, β = 8/3 give the famous butterfly.
• Generate data with time step dt = .01, and observation noise = 1.
• ETKF/state augmentation on zt = (xt , θt) with ensemble size
100 and variance inflation factor 1.04.
19
Sequential Parameter Estimates: Lorenz 63
[Figure: sequential posterior means and 90% posterior CIs for σ, ρ, and β over 100 assimilation cycles.]
20
Estimation of Covariance Parameters
• State augmentation does not work well for parameters in the
background or error covariance matrices, P, Q, and R.
• Dee (1995), D&D (1999) and Mitchell & Houtekamer (2000)
proposed Maximum Likelihood estimation for these parameters.
• Assuming the innovations d are normal with mean zero and
covariance D(α), the likelihood function is
p(d|α) ∝ |D(α)|^{−1/2} exp{ −(1/2) d′ D(α)^{−1} d }
and the maximum likelihood estimator is
α̂_ML = argmax_α p(d|α).
21
Estimation of Covariance Parameters
• Maximum Likelihood (ML) works well for large samples, but has
problems for recursive estimation.
• D95 and MH00 proposed the recursive ML estimator:
α̂_t = (1 − γ_t) α̂_{t−1} + γ_t ( argmax_α p(d_t|α) ).
• Setting γ_t = 1/t makes α̂_t the running mean of the single-observation ML estimates.
• They also considered a recursive median of the ML estimates.
22
Simple Scalar Example
• Mitchell & Houtekamer (2000) proposed the following example:
• Generate data d ∼ N(0, 2 + α), with true value α∗ = .3.
• Since α ≥ 0, the single sample ML estimator is
α̂_ML = max(0, d² − 2).
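The non-convergence of the mean-type recursive ML estimator in this example is easy to reproduce; a small simulation (ours, using a flat grid prior for the Bayes estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 0.3
n = 10000
d = rng.normal(0.0, np.sqrt(2.0 + alpha_true), size=n)

# Mean-type recursive ML: running mean of the single-sample
# estimators max(0, d^2 - 2).  Truncation at zero biases it upward.
ml_mean = np.mean(np.maximum(0.0, d ** 2 - 2.0))

# Grid Bayes: posterior mean over a grid of alpha values (flat prior),
# using the full Gaussian log-likelihood of d | alpha ~ N(0, 2 + alpha).
grid = np.linspace(0.0, 2.0, 201)
loglik = -0.5 * np.sum(d ** 2) / (2.0 + grid) - 0.5 * n * np.log(2.0 + grid)
post = np.exp(loglik - loglik.max())
post /= post.sum()
bayes_mean = np.sum(grid * post)
```

The running mean of the truncated single-sample MLEs settles well above the true value 0.3, while the grid posterior mean concentrates near it, consistent with the figure on the next slide.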
23
Mitchell & Houtekamer Example
• Recursive estimates for α.
[Figure: recursive estimates of α over 10,000 observations, showing the True value and the recursive ML Mean, ML Median, and Bayes estimates.]
• The recursive ML estimators do not converge to the true value.
• The Bayes estimator does converge.
24
Bayesian Parameter Estimation in the EnKF
• We propose the following generic EnKF algorithm for combined
estimation of states z and covariance parameters α.
1. Assume a prior distribution for the parameters α ∼ p(α).
2. Generate a forecast ensemble of parameters and states:
α_i^f ∼ p(α)
z_i^f ∼ p(z|α_i^f)
3. Update the prior distribution via Bayes’ Theorem:
p(α|d) ∝ p(α)p(d|α)
4. Generate an analysis ensemble of parameters and states:
α_i ∼ p(α|d)
z_i ∼ p(z|α_i, d)
25
Model 1: Unknown Observation Variance
• Stroud & Bengtsson (2007) considered the case where R = αR∗,
Q = αQ∗ and D = αD∗.
1. Assume an inverse gamma prior distribution: α ∼ IG (n/2, s/2).
2. Generate the forecast ensemble:
α_i^f ∼ IG(n/2, s/2),
z_{t,i}^f ∼ M(z_{t−1,i}) + N(0, α_i^f Q*)
3. Update the parameters of the inverse gamma distribution:
n* = n + p,   s* = s + d′(D*)^{−1}d
4. Generate the analysis ensemble:
α_i ∼ IG(n*/2, s*/2)
z_{t,i} ∼ z_{t,i}^f + K(d + N(0, α_i R*))
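Step 3, the only new computation relative to a standard EnKF, is a single quadratic form; a sketch (function name ours):

```python
import numpy as np

def variance_scale_update(n, s, innov, D_star):
    """Inverse-gamma update for the common scale factor alpha in the
    Stroud & Bengtsson (2007) setting, where D = alpha D*.

    Prior: alpha ~ IG(n/2, s/2).  innov is the p-vector of innovations d,
    D_star the unit-scale innovation covariance D*.
    Returns the updated hyperparameters (n + p, s + d'(D*)^{-1} d).
    """
    p = len(innov)
    quad = float(innov @ np.linalg.solve(D_star, innov))  # d'(D*)^{-1} d
    return n + p, s + quad
```

For large p one would factor D* once (e.g., Cholesky) rather than solve from scratch each cycle; the sketch keeps the algebra explicit.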
26
Example: Lorenz 96 Model
• The Lorenz 96 model mimics advection on a latitude circle. The
model is highly nonlinear (chaotic), containing quadratic terms.
ẋ_{t,j} = (x_{t,j+1} − x_{t,j−2}) x_{t,j−1} − x_{t,j} + F.
• The state vector has 40 variables, x = (x_1, ..., x_40).
• The model parameter is F, the forcing variable.
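The model equation translates directly to NumPy using cyclic shifts; the fourth-order Runge-Kutta integrator is a common choice for this model, not one specified on the slide:

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Right-hand side of the Lorenz 96 model with cyclic boundaries:
    dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def step_rk4(x, dt=0.05, F=8.0):
    """Advance the state one time step with classical 4th-order Runge-Kutta."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Note that x_j = F for all j is a fixed point of the dynamics; trajectories started near it diverge chaotically.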
27
Model & Assimilation Settings
• Time step dt = .05 or .25.
• Forcing F = 8 (known or unknown).
• Observations at every location.
• Error covariances: Q = 0, R = αI; true α = 4.
• EnKF with ensemble size m = 100.
• Covariance localization with cutoff radius c = 10.
• Covariance inflation factor 1.01.
28
Sequential Bayesian Estimates of α (dt = .05)
[Figure: sequential Bayesian estimates of α over 500 cycles under three priors: α|Y_0 ∼ IG(15,240), IG(1.5,6), and IG(15,15).]
29
Sequential Bayesian Estimates of α (dt = .25)
[Figure: sequential Bayesian estimates of α over 500 cycles with dt = .25, under priors α|Y_0 ∼ IG(1.5,6), IG(15,240), and IG(15,15).]
30
Sequential Bayesian Estimates of (α,F )
[Figure: joint sequential estimates of α and F over 500 cycles, with priors α|Y_0 ∼ IG(15,15) and F|Y_0 ∼ N(8,1).]
31
Sequential Estimates of (α,F ): Sparse Network
[Figure: sequential estimates of α and F over 1000 cycles for a sparse observation network, with priors α|Y_0 ∼ IG(15,15) and F|Y_0 ∼ N(8,1).]
32
Spatially- and Temporally-Varying Scale Factors
[Figure: sequential estimates of scale factors α_1 and α_2 over 4000 cycles, with priors α_1|Y_0 ∼ IG(1.5, 6) and α_2|Y_0 ∼ IG(1.5, 3) (top) or IG(1.5, 13.5) (bottom).]
33
Estimation of Spatial Correlation Parameters: Discrete Representation
• Assume R is defined by a covariance model K (r ;α).
1. Assume a discrete prior on a grid of parameter values α∗:
α ∼ Mult(α∗,π)
2. Generate the forecast ensemble.
3. Estimate the innovation mean d and covariance, D(α).
4. Update the posterior distribution:
p(α|d) ∝ Mult(α|α∗,π)p(d|α) = Mult(α|α∗,π∗)
5. Generate the analysis ensemble:
α_i ∼ Mult(α*, π*)
z_{t,i} ∼ z_{t,i}^f + K(α_i)(d + N(0, R(α_i)))
34
Estimation of Spatial Correlation Parameters: Gaussian Approximation
• Assume R is defined by a covariance model K (r ;α).
1. Assume a normal prior on the parameters:
α ∼ N(m,C)
2. Generate the forecast ensemble.
3. Estimate the innovation mean d and covariance, D(α).
4. Update the posterior distribution:
p(α|d) ∝ N(α|m,C)p(d|α) ≈ N(α|m∗,C∗)
5. Generate the analysis ensemble:
α_i ∼ N(m*, C*)
z_{t,i} ∼ z_{t,i}^f + K(α_i)(d + N(0, R(α_i)))
35
Grid vs Normal Posteriors: Linear Model
[Figure: true, grid-based (discrete), and normal-approximation posterior densities at t = 10, 40, and 100.]
36
Sequential Posterior Estimates: Linear Model
[Figure: sequential posterior estimates over 100 time steps for the linear-model parameters γ_1, γ_2, γ_3, σ², θ_2, and α.]
37
Lorenz 96 Model & Assimilation Settings
• Time step dt = .01.
• Perfect model, F = 8 known.
• Observations at 40 locations.
• R defined by the Matérn correlation model:
K(r) = [α / (Γ(ν) 2^{ν−1})] (r/λ)^ν K_ν(r/λ);   α, λ, ν > 0.
• EnKF with ensemble size m = 100
• Covariance localization with radius r = 12.
• No covariance inflation.
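A direct transcription of the Matérn model (ours), using SciPy's modified Bessel function of the second kind and taking K(0) = α by continuity:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, alpha, lam, nu):
    """Matern covariance K(r) = alpha / (Gamma(nu) 2^{nu-1}) (r/lam)^nu
    K_nu(r/lam), with the r = 0 limit K(0) = alpha filled in directly
    (the Bessel form is 0 * inf at the origin)."""
    r = np.asarray(r, dtype=float)
    u = r / lam
    out = np.full(r.shape, float(alpha))
    nz = r > 0
    out[nz] = alpha / (gamma(nu) * 2 ** (nu - 1)) * u[nz] ** nu * kv(nu, u[nz])
    return out
```

As a sanity check, ν = 1/2 recovers the exponential covariance α e^{−r/λ}, a standard identity for the Matérn family.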
38
Sequential Posterior Distributions: Discrete
[Figure: discrete posterior distributions at t = 0, 1, 5, 10, 50, and 200.]
39
Sequential Bayesian Estimates: Discrete
[Figure: sequential posterior means and 95% intervals for the Matérn parameters α, λ, and ν over 250 cycles.]
40
Conclusions
• Bayesian methods are useful for parameter estimation in DA.
• Two new algorithms for combined state and parameter
estimation within EnKF.
• Easily combined with state augmentation.
• Good convergence properties (unlike recursive ML).
• Conjugate priors allow for easy updating.
• Would love to collaborate with you on this topic!
41
Computational Methods
• Bayesian and ML methods rely heavily on calculation of the
likelihood.
• Several approximate methods have been proposed for computing
the likelihood for large spatial data sets
– Spectral approximations (Whittle, 1953)
– Approximate likelihood (Vecchia, 1988)
– Covariance localization (Kaufman et al., 2008)
• These methods can be applied in data assimilation systems.
42