Bayesian Statistics and Data Assimilation
Jonathan Stroud
Department of Statistics
The George Washington University
1
Outline
• Motivation
• Bayesian Statistics
• Parameter Estimation in Data Assimilation
• Combined State and Parameter Estimation within EnKF
– Physical Parameters
– Observation Error Variance
– Observation Error Covariance
2
Motivation
• Physical models and data assimilation systems often involve
unknown parameters:
– physical model parameters
– error covariance parameters
– covariance inflation factors
– localization radius
• Use data to estimate parameters, either off-line or sequentially.
• Two common approaches to parameter estimation
– Maximum likelihood estimation
– Bayesian estimation
3
Statistical Methods for Parameter Estimation
• Maximum Likelihood approach
– Specify a likelihood function for the data.
– Choose parameters to maximize this function.
• Bayesian approach
– Parameters are assigned a prior probability distribution.
– Update the prior distribution using Bayes Theorem.
– Summarize the parameters using the posterior distribution.
4
Bayesian Parameter Estimation
• A Bayesian model includes the following components
(where d = data; α = unknown parameters):
• Likelihood function: p(d |α)
• Prior distribution: p(α)
• Posterior distribution (via Bayes’ Theorem):
p(α|d) = p(α) p(d|α) / p(d)
• The parameters can be summarized using the posterior mean,
standard deviation, or 95% posterior intervals.
5
Example 1: Normal Data with Unknown Mean
• Let d1, ... , dn be iid samples from a normal distribution with
unknown mean θ and variance v . The likelihood function is
p(d|θ) = (2πv)^{−n/2} exp{ −(d̄ − θ)² / (2v/n) }.
• The standard prior distribution is a normal: θ ∼ N(θb, vb)
p(θ) = (2πv_b)^{−1/2} exp{ −(θ − θ_b)² / (2v_b) }
• Posterior distribution (likelihood × prior):
p(θ|d) ∝ exp{ −(d̄ − θ)² / (2v/n) − (θ − θ_b)² / (2v_b) }
6
Example 1: Normal Data with Unknown Mean
• The posterior distribution is normal
θ|d ∼ N(θa, va)
with mean
θ_a = [ (v/n)θ_b + v_b d̄ ] / [ v_b + (v/n) ]
and information
v_a^{−1} = v_b^{−1} + (v/n)^{−1}
• The posterior mean is a weighted average of the prior mean and
the sample mean. The posterior information is the sum of the
prior and data information.
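As a concrete sketch, the conjugate update above can be written in a few lines of NumPy (the function name is ours):

```python
import numpy as np

def normal_mean_update(theta_b, v_b, d, v):
    """Conjugate update for a normal mean with known variance v.

    Prior: theta ~ N(theta_b, v_b).  Data: d_1, ..., d_n iid N(theta, v).
    Returns the posterior mean and variance. The posterior precision is
    the sum of the prior precision and the data precision n/v.
    """
    n = len(d)
    dbar = np.mean(d)
    prec_a = 1.0 / v_b + n / v          # posterior information
    v_a = 1.0 / prec_a
    theta_a = v_a * (theta_b / v_b + n * dbar / v)
    return theta_a, v_a
```

With the prior N(0, 1) and a likelihood summarized by d̄ = 2 and v/n = 4, this reproduces the N(0.4, 0.8) posterior shown in the figure on the next slide.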
7
Example 1: Normal Data with Unknown Mean
[Figure: three density panels. Prior: N(0,1); Likelihood: N(2,4); Posterior: N(0.4,0.8).]
8
Example 2: Normal with Unknown Variance
• Let d1, ... , dn be iid samples from a normal distribution with
mean zero and unknown variance v . The likelihood function is
p(d|v) = (2πv)^{−n/2} exp( −Σᵢ dᵢ² / (2v) ).
• The prior distribution is inverse gamma, v ∼ IG (m/2, s/2)
p(v) = [(s/2)^{m/2} / Γ(m/2)] v^{−m/2−1} exp( −s/(2v) ).
• The posterior distribution is
p(v|d) ∝ v^{−(m+n)/2−1} exp( −(s + Σᵢ dᵢ²) / (2v) ).
9
Example 2: Normal with Unknown Variance
• The posterior is also inverse gamma
v|d ∼ IG( (m + n)/2, (s + Σᵢ dᵢ²)/2 ).
• The parameters are updated by adding the sample size and the
data sum of squares, respectively.
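The update is just two additions; a minimal sketch (function name ours):

```python
import numpy as np

def inv_gamma_update(m, s, d):
    """Conjugate update for a normal variance with known mean 0.

    Prior: v ~ IG(m/2, s/2).  Data: d_1, ..., d_n iid N(0, v).
    Returns the updated hyperparameters (m + n, s + sum of d_i^2).
    """
    return m + len(d), s + float(np.sum(d ** 2))
```

In shape/scale form (a = m/2, b = s/2), a prior IG(5, 10) combined with n = 20 observations with Σ dᵢ² = 100 gives the IG(15, 60) posterior of the figure on the next slide, with posterior mean 60/14 ≈ 4.3.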
10
Example 2: Normal with Unknown Variance
[Figure: three density panels. Prior: IG(5,10), mean = 2.5; Likelihood: IG(10,50), mean = 5.6; Posterior: IG(15,60), mean = 4.3.]
11
Conjugate Priors
• The normal and inverse gamma priors are called “conjugate
priors”. These occur when the prior and posterior distribution
belong to the same family.
• These are convenient because the distribution can be updated
by updating a set of “sufficient statistics.”
• Other examples of conjugate priors:
– Inverse-Wishart, for the covariance matrix of a normal.
– Normal-Inverse Gamma, for the mean and variance of a normal.
– Dirichlet, for the probabilities of a discrete distribution.
12
Sequential Bayesian Estimation
• If the data are assimilated sequentially, we want to update the
parameters α after each new observation d1, d2, ... , dn.
• Under the Bayesian approach, this requires calculating the
sequence of posterior distributions
p(α|d1), p(α|d1, d2), ... , p(α|d1, d2, ... , dn)
• This is done by applying Bayes theorem recursively after each
new observation:
p(α|d1, ... , dk) ∝ p(dk |α) p(α|d1, ... , dk−1), for each k .
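The recursion is easy to implement when the parameter is kept on a discrete grid; a generic sketch (ours, with a user-supplied log-likelihood):

```python
import numpy as np

def recursive_bayes(alpha_grid, prior, data, loglik):
    """Recursively update a discrete posterior p(alpha | d_1, ..., d_k).

    loglik(d, alpha_grid) returns the log-likelihood of observation d
    evaluated at each grid point.  Yields the normalized posterior
    after each observation, i.e. the sequence of posteriors above.
    """
    post = np.asarray(prior, dtype=float)
    for d in data:
        post = post * np.exp(loglik(d, alpha_grid))   # Bayes' theorem
        post /= post.sum()                            # renormalize
        yield post
```

Each step multiplies the current posterior by the likelihood of the newest observation and renormalizes, exactly as in the recursion above.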
13
Sequence of Posterior Distributions
[Figure: posterior densities at t = 1 (MLE = 2.9, Mode = 1, Mean = 3), t = 2 (MLE = 0.2, Mode = 0.9, Mean = 2), t = 4 (MLE = 11, Mode = 2.2, Mean = 3.9), t = 25 (MLE = 1, Mode = 1.6, Mean = 1.8), t = 100 (MLE = 9, Mode = 2, Mean = 2.1), t = 200 (MLE = 8.6, Mode = 2, Mean = 2.1).]
14
Sequential Bayesian Estimates
[Figure: single-observation MLEs of alpha over 200 time steps (left); sequential Bayesian estimates with posterior mean, posterior mode, and 90% posterior CI (right).]
15
Convergence of the Posterior Distribution
1. Consistency: as observations accumulate, the posterior distribution concentrates at the true parameter value.
2. Asymptotic normality: if the model is correct, and certain regularity conditions hold, the posterior distribution converges to a normal distribution with mean equal to the true value and covariance equal to the asymptotic covariance matrix.
16
Parameter Estimation in Data Assimilation
• Many parameter estimation methods have been proposed for
data assimilation systems.
• Maximum likelihood estimation
– Dee (1995) and Dee & Da Silva (1999): Error covariances
– Mitchell & Houtekamer (2000): Error covariances (EnKF)
– Li, Kalnay & Miyoshi (2007): Variance/Covariance inflation
• Bayesian estimation
– Anderson & Anderson (1999): State augmentation
– Stroud & Bengtsson (2007): Observation error variance
– Anderson (2007, 2009): Covariance inflation factors
– Miyoshi (2011): Covariance inflation factors
17
Estimation of Physical Parameters
• State augmentation is used to estimate unknown parameters θ
in the physical model M(xt , θ).
• Define the augmented state vector z_t = (x_t, θ_t), and the augmented model as
x_t = M(x_{t−1}, θ_{t−1}) + w_t
θ_t = θ_{t−1}.
• Specify an initial prior distribution, θ0 ∼ p(θ0).
• Then, standard data assimilation methods are applied to zt to
estimate the posterior distribution, p(θ|d1, ... , dt), at each time t.
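A minimal sketch of the augmented forecast step (ours; `model` and `q_std` stand in for the physical model M and process-noise level):

```python
import numpy as np

def augmented_forecast(model, x, theta, q_std):
    """One forecast step for the augmented state z = (x, theta).

    The physical state evolves through the model with additive noise;
    the parameter is carried forward unchanged, so the assimilation
    update is the only thing that moves it.
    """
    x_new = model(x, theta) + q_std * np.random.randn(*x.shape)
    return x_new, theta   # theta_t = theta_{t-1}
```

The key design point is that theta has identity dynamics and zero model noise, matching the augmented model above; the ensemble update then shifts theta through its sample correlation with the observed state.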
18
Example: Lorenz 63 Model
• Model equations:
dx/dt = σ(y − x)
dy/dt = ρx − y − xz
dz/dt = xy − βz
[Figure: Lorenz 63 butterfly attractor in (x, y, z) space, with observation points overlaid.]
• The state vector is x = (x, y, z), and the parameter is θ = (σ, ρ, β).
• The parameters σ = 10, ρ = 28, β = 8/3 give the famous butterfly.
• Generate data with time step dt = .01, and observation noise = 1.
• ETKF/state augmentation on zt = (xt , θt) with ensemble size
100 and variance inflation factor 1.04.
19
Sequential Parameter Estimates: Lorenz 63
[Figure: sequential posterior means and 90% posterior CIs for σ, ρ, and β over 100 assimilation cycles.]
20
Estimation of Covariance Parameters
• State augmentation does not work well for parameters in the
background or error covariance matrices, P, Q, and R.
• Dee (1995), D&D (1999) and Mitchell & Houtekamer (2000)
proposed Maximum Likelihood estimation for these parameters.
• Assuming the innovations d are normal with mean zero and
covariance D(α), the likelihood function is
p(d|α) ∝ |D(α)|^{−1/2} exp{ −(1/2) d′ D(α)^{−1} d }
and the maximum likelihood estimator is
α̂_ML = argmax_α p(d|α).
21
Estimation of Covariance Parameters
• Maximum Likelihood (ML) works well for large samples, but has
problems for recursive estimation.
• D95 and MH00 proposed the recursive ML estimator:
α̂_t = (1 − γ_t) α̂_{t−1} + γ_t ( argmax_α p(d_t|α) ).
• Setting γ_t = 1/t makes α̂_t the running mean of the single-observation ML estimates.
• They also considered a recursive median of the ML estimates.
22
Simple Scalar Example
• Mitchell & Houtekamer (2000) proposed the following example:
• Generate data d ∼ N(0, 2 + α), with true value α∗ = .3.
• Since α ≥ 0, the single sample ML estimator is
α̂_ML = max(0, d² − 2).
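The non-convergence of the mean-type recursive ML estimator in this example is easy to reproduce; a small simulation (ours, using a flat grid prior for the Bayes estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 0.3
n = 10000
d = rng.normal(0.0, np.sqrt(2.0 + alpha_true), size=n)

# Mean-type recursive ML: running mean of the single-sample
# estimators max(0, d^2 - 2).  Truncation at zero biases it upward.
ml_mean = np.mean(np.maximum(0.0, d ** 2 - 2.0))

# Grid Bayes: posterior mean over a grid of alpha values (flat prior),
# using the full Gaussian log-likelihood of d | alpha ~ N(0, 2 + alpha).
grid = np.linspace(0.0, 2.0, 201)
loglik = -0.5 * np.sum(d ** 2) / (2.0 + grid) - 0.5 * n * np.log(2.0 + grid)
post = np.exp(loglik - loglik.max())
post /= post.sum()
bayes_mean = np.sum(grid * post)
```

The running mean of the truncated single-sample MLEs settles well above the true value 0.3, while the grid posterior mean concentrates near it, consistent with the figure on the next slide.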
23
Mitchell & Houtekamer Example
• Recursive estimates for α.
[Figure: recursive estimates of α over 10,000 observations, showing the True value and the recursive ML Mean, ML Median, and Bayes estimates.]
• The recursive ML estimators do not converge to the true value.
• The Bayes estimator does converge.
24
Bayesian Parameter Estimation in the EnKF
• We propose the following generic EnKF algorithm for combined
estimation of states z and covariance parameters α.
1. Assume a prior distribution for the parameters α ∼ p(α).
2. Generate a forecast ensemble of parameters and states:
α_i^f ∼ p(α)
z_i^f ∼ p(z|α_i^f)
3. Update the prior distribution via Bayes’ Theorem:
p(α|d) ∝ p(α)p(d|α)
4. Generate an analysis ensemble of parameters and states:
α_i ∼ p(α|d)
z_i ∼ p(z|α_i, d)
25
Model 1: Unknown Observation Variance
• Stroud & Bengtsson (2007) considered the case where R = αR∗,
Q = αQ∗ and D = αD∗.
1. Assume an inverse gamma prior distribution: α ∼ IG (n/2, s/2).
2. Generate the forecast ensemble:
α_i^f ∼ IG(n/2, s/2),
z_{t,i}^f ∼ M(z_{t−1,i}) + N(0, α_i^f Q*)
3. Update the parameters of the inverse gamma distribution:
n* = n + p,   s* = s + d′(D*)^{−1}d
4. Generate the analysis ensemble:
α_i ∼ IG(n*/2, s*/2)
z_{t,i} ∼ z_{t,i}^f + K(d + N(0, α_i R*))
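Step 3, the only new computation relative to a standard EnKF, is a single quadratic form; a sketch (function name ours):

```python
import numpy as np

def variance_scale_update(n, s, innov, D_star):
    """Inverse-gamma update for the common scale factor alpha in the
    Stroud & Bengtsson (2007) setting, where D = alpha D*.

    Prior: alpha ~ IG(n/2, s/2).  innov is the p-vector of innovations d,
    D_star the unit-scale innovation covariance D*.
    Returns the updated hyperparameters (n + p, s + d'(D*)^{-1} d).
    """
    p = len(innov)
    quad = float(innov @ np.linalg.solve(D_star, innov))  # d'(D*)^{-1} d
    return n + p, s + quad
```

For large p one would factor D* once (e.g., Cholesky) rather than solve from scratch each cycle; the sketch keeps the algebra explicit.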
26
Example: Lorenz 96 Model
• The Lorenz 96 model mimics advection on a latitude circle. The
model is highly nonlinear (chaotic), containing quadratic terms.
ẋ_{t,j} = (x_{t,j+1} − x_{t,j−2}) x_{t,j−1} − x_{t,j} + F.
• The state vector has 40 variables, x = (x_1, ..., x_40).
• The model parameter is F, the forcing variable.
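The model equation translates directly to NumPy using cyclic shifts; the fourth-order Runge-Kutta integrator is a common choice for this model, not one specified on the slide:

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Right-hand side of the Lorenz 96 model with cyclic boundaries:
    dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def step_rk4(x, dt=0.05, F=8.0):
    """Advance the state one time step with classical 4th-order Runge-Kutta."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Note that x_j = F for all j is a fixed point of the dynamics; trajectories started near it diverge chaotically.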
27
Model & Assimilation Settings
• Time step dt = .05 or .25.
• Forcing F = 8 (known or unknown).
• Observations at every location.
• Error covariances: Q = 0, R = αI; true α = 4.
• EnKF with ensemble size m = 100.
• Covariance localization with cutoff radius c = 10.
• Covariance inflation factor 1.01.
28
Sequential Bayesian Estimates of α (dt = .05)
[Figure: sequential Bayesian estimates of α over 500 cycles under three priors: α|Y_0 ∼ IG(15,240), IG(1.5,6), and IG(15,15).]
29
Sequential Bayesian Estimates of α (dt = .25)
[Figure: sequential Bayesian estimates of α over 500 cycles with dt = .25, under priors α|Y_0 ∼ IG(1.5,6), IG(15,240), and IG(15,15).]
30
Sequential Bayesian Estimates of (α,F )
[Figure: joint sequential estimates of α and F over 500 cycles, with priors α|Y_0 ∼ IG(15,15) and F|Y_0 ∼ N(8,1).]
31
Sequential Estimates of (α,F ): Sparse Network
[Figure: sequential estimates of α and F over 1000 cycles for a sparse observation network, with priors α|Y_0 ∼ IG(15,15) and F|Y_0 ∼ N(8,1).]
32
Spatially- and Temporally-Varying Scale Factors
[Figure: sequential estimates of scale factors α_1 and α_2 over 4000 cycles, with priors α_1|Y_0 ∼ IG(1.5, 6) and α_2|Y_0 ∼ IG(1.5, 3) (top) or IG(1.5, 13.5) (bottom).]
33
Estimation of Spatial Correlation Parameters: Discrete Representation
• Assume R is defined by a covariance model K (r ;α).
1. Assume a discrete prior on a grid of parameter values α∗:
α ∼ Mult(α∗,π)
2. Generate the forecast ensemble.
3. Estimate the innovation mean d and covariance, D(α).
4. Update the posterior distribution:
p(α|d) ∝ Mult(α|α∗,π)p(d|α) = Mult(α|α∗,π∗)
5. Generate the analysis ensemble:
α_i ∼ Mult(α*, π*)
z_{t,i} ∼ z_{t,i}^f + K(α_i)(d + N(0, R(α_i)))
34
Estimation of Spatial Correlation Parameters: Gaussian Approximation
• Assume R is defined by a covariance model K (r ;α).
1. Assume a normal prior on the parameters:
α ∼ N(m,C)
2. Generate the forecast ensemble.
3. Estimate the innovation mean d and covariance, D(α).
4. Update the posterior distribution:
p(α|d) ∝ N(α|m,C)p(d|α) ≈ N(α|m∗,C∗)
5. Generate the analysis ensemble:
α_i ∼ N(m*, C*)
z_{t,i} ∼ z_{t,i}^f + K(α_i)(d + N(0, R(α_i)))
35
Grid vs Normal Posteriors: Linear Model
[Figure: true, grid-based (discrete), and normal-approximation posterior densities at t = 10, 40, and 100.]
36
Sequential Posterior Estimates: Linear Model
[Figure: sequential posterior estimates over 100 time steps for the linear-model parameters γ_1, γ_2, γ_3, σ², θ_2, and α.]
37
Lorenz 96 Model & Assimilation Settings
• Time step dt = .01.
• Perfect model, F = 8 known.
• Observations at 40 locations.
• R defined by the Matérn correlation model:
K(r) = [α / (Γ(ν) 2^{ν−1})] (r/λ)^ν K_ν(r/λ);   α, λ, ν > 0.
• EnKF with ensemble size m = 100
• Covariance localization with radius r = 12.
• No covariance inflation.
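A direct transcription of the Matérn model (ours), using SciPy's modified Bessel function of the second kind and taking K(0) = α by continuity:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, alpha, lam, nu):
    """Matern covariance K(r) = alpha / (Gamma(nu) 2^{nu-1}) (r/lam)^nu
    K_nu(r/lam), with the r = 0 limit K(0) = alpha filled in directly
    (the Bessel form is 0 * inf at the origin)."""
    r = np.asarray(r, dtype=float)
    u = r / lam
    out = np.full(r.shape, float(alpha))
    nz = r > 0
    out[nz] = alpha / (gamma(nu) * 2 ** (nu - 1)) * u[nz] ** nu * kv(nu, u[nz])
    return out
```

As a sanity check, ν = 1/2 recovers the exponential covariance α e^{−r/λ}, a standard identity for the Matérn family.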
38
Sequential Posterior Distributions: Discrete
[Figure: discrete posterior distributions at t = 0, 1, 5, 10, 50, and 200.]
39
Sequential Bayesian Estimates: Discrete
[Figure: sequential posterior means and 95% intervals for the Matérn parameters α, λ, and ν over 250 cycles.]
40
Conclusions
• Bayesian methods are useful for parameter estimation in DA.
• Two new algorithms for combined state and parameter
estimation within EnKF.
• Easily combined with state augmentation.
• Good convergence properties (unlike recursive ML).
• Conjugate priors allow for easy updating.
• Would love to collaborate with you on this topic!
41
Computational Methods
• Bayesian and ML methods rely heavily on calculation of the
likelihood.
• Several approximate methods have been proposed for computing
the likelihood for large spatial data sets
– Spectral approximations (Whittle, 1953)
– Approximate likelihood (Vecchia, 1988)
– Covariance localization (Kaufman et al., 2008)
• These methods can be applied in data assimilation systems.
42