Maximum Likelihood Estimation of ARMA Models
For i.i.d. data with marginal pdf \(f(y_t;\theta)\), the joint pdf for a sample \(y=(y_1,\dots,y_T)\) is
\[
f(y;\theta) = f(y_1,\dots,y_T;\theta) = \prod_{t=1}^{T} f(y_t;\theta)
\]
The likelihood function is this joint density treated as a function of the parameters \(\theta\) given the data \(y\):
\[
L(\theta\mid y) = L(\theta\mid y_1,\dots,y_T) = \prod_{t=1}^{T} f(y_t;\theta)
\]
The log-likelihood is
\[
\ln L(\theta\mid y) = \sum_{t=1}^{T} \ln f(y_t;\theta)
\]
Conditional MLE of ARMA Models
Problem: For a sample from a covariance stationary time series \(\{y_t\}\), the construction of the log-likelihood given above does not work because the random variables in the sample \(y=(y_1,\dots,y_T)\) are not i.i.d.

One solution: conditional factorization of the log-likelihood.

Intuition: Consider the joint density of two adjacent observations, \(f(y_2,y_1;\theta)\). The joint density can always be factored as the product of the conditional density of \(y_2\) given \(y_1\) and the marginal density of \(y_1\):
\[
f(y_2,y_1;\theta) = f(y_2\mid y_1;\theta)\, f(y_1;\theta)
\]
For three observations, the factorization becomes
\[
f(y_3,y_2,y_1;\theta) = f(y_3\mid y_2,y_1;\theta)\, f(y_2\mid y_1;\theta)\, f(y_1;\theta)
\]
In general, the conditional-marginal factorization has the form
\[
f(y_T,\dots,y_1;\theta) = \left(\prod_{t=p+1}^{T} f(y_t\mid I_{t-1};\theta)\right) f(y_p,\dots,y_1;\theta)
\]
where \(I_t = \{y_t,\dots,y_1\}\) is the information available at time \(t\) and \(y_p,\dots,y_1\) are the initial values.

The exact log-likelihood function may then be expressed as
\[
\ln L(\theta\mid y) = \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta) + \ln f(y_p,\dots,y_1;\theta)
\]
The conditional log-likelihood is
\[
\ln L_c(\theta\mid y) = \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta)
\]
Two types of maximum likelihood estimates (MLEs) may be computed. The first type is based on maximizing the conditional log-likelihood function. These estimates are called conditional MLEs and are defined by
\[
\hat{\theta}_{cmle} = \arg\max_{\theta} \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta)
\]
The second type is based on maximizing the exact log-likelihood function. These estimates are called exact MLEs and are defined by
\[
\hat{\theta}_{mle} = \arg\max_{\theta} \left[\sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta) + \ln f(y_p,\dots,y_1;\theta)\right]
\]
Result:
For stationary models, \(\hat{\theta}_{cmle}\) and \(\hat{\theta}_{mle}\) are consistent and have the same limiting normal distribution. In finite samples, however, \(\hat{\theta}_{cmle}\) and \(\hat{\theta}_{mle}\) are generally not equal and may differ by a substantial amount if the data are close to being non-stationary or non-invertible.
AR(p): OLS Equivalent to Conditional MLE

Model:
\[
y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0,\sigma^2)
\]
Writing \(y_t = x_t'\beta + \varepsilon_t\) with \(x_t = (1, y_{t-1}, y_{t-2},\dots,y_{t-p})'\) and \(\beta = (c,\phi_1,\dots,\phi_p)'\), OLS gives
\[
\hat{\beta} = \left(\sum_{t=p+1}^{T} x_t x_t'\right)^{-1} \sum_{t=p+1}^{T} x_t y_t, \qquad
\hat{\sigma}^2 = \frac{1}{T-(p+1)} \sum_{t=p+1}^{T} (y_t - x_t'\hat{\beta})^2
\]
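As an illustration, here is a minimal NumPy sketch of this regression; the function name and interface are my own, not from the notes:

```python
import numpy as np

def ar_ols(y, p):
    """Conditional MLE of an AR(p) via OLS: regress y_t on (1, y_{t-1},...,y_{t-p})."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Design matrix: intercept plus p lags, using observations t = p+1,...,T
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - j:T - j] for j in range(1, p + 1)])
    Y = y[p:]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (c, phi_1, ..., phi_p)
    resid = Y - X @ beta
    sigma2 = resid @ resid / (T - (p + 1))         # divisor as in the formula above
    return beta, sigma2
```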
Properties of the estimator
\(\hat{\phi}\) is downward biased in a finite sample, i.e. \(E[\hat{\phi}] < \phi\). The estimator may be biased but it is consistent: it converges in probability to \(\phi\) as \(T \to \infty\).
Example: MLE for stationary AR(1)
\[
Y_t = c + \phi Y_{t-1} + \varepsilon_t, \quad t=1,\dots,T, \qquad \varepsilon_t \sim WN(0,\sigma^2), \quad |\phi|<1, \qquad \theta=(c,\phi,\sigma^2)
\]
Conditional on \(I_{t-1}\),
\[
y_t \mid I_{t-1} \sim N(c + \phi y_{t-1},\, \sigma^2), \quad t=2,\dots,T
\]
which only depends on \(y_{t-1}\). The conditional density \(f(y_t\mid I_{t-1};\theta)\) is then
\[
f(y_t\mid y_{t-1};\theta) = (2\pi\sigma^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma^2}(y_t - c - \phi y_{t-1})^2\right), \quad t=2,\dots,T
\]
To determine the marginal density for the initial value \(y_1\), recall that for a stationary AR(1) process
\[
E[y_1] = \mu = \frac{c}{1-\phi}, \qquad \operatorname{var}[y_1] = \frac{\sigma^2}{1-\phi^2}
\]
It follows that
\[
y_1 \sim N\!\left(\frac{c}{1-\phi},\, \frac{\sigma^2}{1-\phi^2}\right)
\]
\[
f(y_1;\theta) = \left(\frac{2\pi\sigma^2}{1-\phi^2}\right)^{-1/2} \exp\!\left(-\frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2\right)
\]
The conditional log-likelihood function is
\[
\sum_{t=2}^{T} \ln f(y_t\mid y_{t-1};\theta) = -\frac{T-1}{2}\ln(2\pi) - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2
\]
Notice that the conditional log-likelihood function has the form of the log-likelihood function for a linear regression model with normal errors,
\[
y_t = c + \phi y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0,\sigma^2), \quad t=2,\dots,T
\]
It follows that
\[
\hat{c}_{cmle} = \hat{c}_{ols}, \qquad \hat{\phi}_{cmle} = \hat{\phi}_{ols}, \qquad
\hat{\sigma}^2_{cmle} = \frac{1}{T-1}\sum_{t=2}^{T}(y_t - \hat{c}_{cmle} - \hat{\phi}_{cmle}\, y_{t-1})^2
\]
The marginal log-likelihood for the initial value \(y_1\) is
\[
\ln f(y_1;\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\!\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2
\]
The exact log-likelihood function is then
\[
\ln L(\theta; y) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\ln\!\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2 - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2
\]
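This exact log-likelihood can be coded directly and handed to a numerical optimizer; a minimal sketch (the parameter ordering and function name are illustrative assumptions):

```python
import numpy as np

def ar1_exact_loglik(params, y):
    """Exact AR(1) log-likelihood: marginal term for y_1 plus conditional terms."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    mu = c / (1.0 - phi)                 # stationary mean
    var1 = sigma2 / (1.0 - phi**2)       # stationary variance of y_1
    ll = -0.5 * np.log(2 * np.pi * var1) - (y[0] - mu) ** 2 / (2 * var1)
    resid = y[1:] - c - phi * y[:-1]     # one-step errors for t = 2,...,T
    ll += -0.5 * (T - 1) * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
    return ll
```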
Prediction Error Decomposition
To illustrate this algorithm, consider the simple AR(1) model. Recall that
\[
y_t \mid I_{t-1} \sim N(c + \phi y_{t-1},\, \sigma^2), \quad t=2,\dots,T
\]
from which it follows that
\[
E[y_t\mid I_{t-1}] = c + \phi y_{t-1}, \qquad \operatorname{var}[y_t\mid I_{t-1}] = \sigma^2
\]
The 1-step-ahead prediction errors may then be defined as
\[
v_t = y_t - E[y_t\mid I_{t-1}] = y_t - c - \phi y_{t-1}, \quad t=2,\dots,T
\]
The variance of the prediction error at time \(t\) is
\[
f_t = \operatorname{var}(v_t) = \operatorname{var}(\varepsilon_t) = \sigma^2, \quad t=2,\dots,T
\]
For the initial value, the first prediction error and its variance are
\[
v_1 = y_1 - E[y_1] = y_1 - \frac{c}{1-\phi}, \qquad f_1 = \operatorname{var}(v_1) = \frac{\sigma^2}{1-\phi^2}
\]
Using the prediction errors and the prediction error variances, the exact log-likelihood function may be re-expressed as
\[
\ln L(\theta\mid y) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln f_t - \frac{1}{2}\sum_{t=1}^{T}\frac{v_t^2}{f_t}
\]
which is the prediction error decomposition. A further simplification may be achieved by writing
\[
\operatorname{var}(v_t) = \sigma^2 f_t =
\begin{cases}
\sigma^2\cdot\dfrac{1}{1-\phi^2} & \text{for } t=1 \\[1ex]
\sigma^2\cdot 1 & \text{for } t>1
\end{cases}
\]
That is, \(f_t = \frac{1}{1-\phi^2}\) for \(t=1\) and \(f_t = 1\) for \(t>1\). Then the log-likelihood becomes
\[
\ln L(\theta\mid y) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2}\sum_{t=1}^{T}\ln f_t - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\frac{v_t^2}{f_t}
\]
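For comparison, a sketch of the same likelihood computed through the prediction error decomposition above; it should agree with the closed-form expression of the previous slide (names are illustrative):

```python
import numpy as np

def ar1_loglik_ped(params, y):
    """AR(1) log-likelihood via the prediction error decomposition
    with scaled variances f_1 = 1/(1-phi^2) and f_t = 1 for t > 1."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    v = np.empty(T)
    f = np.ones(T)
    v[0] = y[0] - c / (1.0 - phi)        # v_1 = y_1 - E[y_1]
    f[0] = 1.0 / (1.0 - phi**2)
    v[1:] = y[1:] - c - phi * y[:-1]     # v_t = y_t - c - phi*y_{t-1}
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2)
            - 0.5 * np.sum(np.log(f)) - np.sum(v**2 / f) / (2 * sigma2))
```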
MLE Estimation of MA(1)
Recall
\[
Y_t = \mu + \theta\varepsilon_{t-1} + \varepsilon_t, \qquad |\theta|<1, \qquad \varepsilon_t \sim WN(0,\sigma^2)
\]
\(|\theta|<1\) is assumed only so that the representation is invertible; it says nothing about stationarity (an MA(1) is stationary for any \(\theta\)).
Estimation MA(1)
\[
Y_t \mid \varepsilon_{t-1} \sim N(\mu + \theta\varepsilon_{t-1},\, \sigma^2), \qquad
f(y_t\mid \varepsilon_{t-1};\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(y_t - \mu - \theta\varepsilon_{t-1})^2\right)
\]
with parameters \((\mu,\theta,\sigma^2)\).

Problem: without knowing \(\varepsilon_{t-2}\) we do not observe \(\varepsilon_{t-1}\); we need \(\varepsilon_{t-2}\) to compute \(\varepsilon_{t-1} = y_{t-1} - \mu - \theta\varepsilon_{t-2}\). But \(\varepsilon_{t-1}\) is unobservable. Solution: assume \(\varepsilon_0 = 0\), i.e. make it non-random and simply fix it at zero (any fixed number would work).
Estimation MA(1)
\[
Y_1 \mid \varepsilon_0 = 0 \sim N(\mu,\sigma^2)
\]
\[
y_1 = \mu + \varepsilon_1 \;\Rightarrow\; \varepsilon_1 = y_1 - \mu, \qquad
y_2 = \mu + \varepsilon_2 + \theta\varepsilon_1 \;\Rightarrow\; \varepsilon_2 = y_2 - \mu - \theta(y_1 - \mu)
\]
\[
\varepsilon_t = (y_t-\mu) - \theta(y_{t-1}-\mu) + \theta^2(y_{t-2}-\mu) - \dots + (-1)^{t-1}\theta^{t-1}(y_1-\mu)
\]
Conditional likelihood:
\[
L(\theta\mid y_1,\dots,y_T;\, \varepsilon_0=0) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right)
\]
If \(|\theta|<1\) (strictly), the choice of \(\varepsilon_0\) does not matter asymptotically and the CMLE is consistent. Exact MLE requires the Kalman filter.
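A minimal sketch of this conditional log-likelihood, building the \(\varepsilon_t\) recursion with \(\varepsilon_0 = 0\) (function and variable names are my own):

```python
import numpy as np

def ma1_conditional_loglik(params, y):
    """Conditional MA(1) log-likelihood given eps_0 = 0:
    eps_t = y_t - mu - theta*eps_{t-1}, then sum the normal log-densities."""
    mu, theta, sigma2 = params
    y = np.asarray(y, dtype=float)
    eps = np.empty(len(y))
    prev = 0.0                           # eps_0 fixed at zero
    for t, yt in enumerate(y):
        eps[t] = yt - mu - theta * prev
        prev = eps[t]
    T = len(y)
    return -0.5 * T * np.log(2 * np.pi * sigma2) - eps @ eps / (2 * sigma2)
```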
Why do we need Exact MLEs
In the estimation of MA(1) models above, we assumed that \(\varepsilon_0 = 0\) while calculating the sequence of \(\varepsilon_t\)'s. A more traditional approach is to estimate the unconditional, or exact, log-likelihood function by assuming that \(\varepsilon_0\) is random, letting it follow some distribution, and using that distribution when calculating the \(\varepsilon_t\)'s from the data. Such an allowance affects not only the first prediction error and the variance of \(y_1\), but also successive prediction errors and their variances. The practice of obtaining estimates of the parameters by not conditioning on pre-sample values, but instead using such exact prediction errors and variances, is called the exact or unconditional ML estimation method. Note that the problem of obtaining the sequence of \(\varepsilon_t\)'s arises only in pure MA or mixed (ARMA) models.
Exact MLEs
What are the advantages of using such a procedure? To understand this, an examination of the first prediction error and its variance is instructive. For example, in an MA(1) process the first prediction error is the same, \(y_1 - \mu\), in both the conditional and the exact ML estimation procedures. But the assumption that \(\varepsilon_0 = 0\) means that \(\operatorname{var}(y_1) = \operatorname{var}(\varepsilon_1) = \sigma^2\), whereas if we allow \(\varepsilon_0\) to be random, \(\operatorname{var}(y_1) = \sigma^2(1+\theta^2)\). This difference is reflected in \(f_1\) in the exact log-likelihood function, and in typically small sample sizes such an assumption may matter a lot for the estimates. Moreover, if \(\theta\) happens to be close to unity, the differences will be even more significant.
Exact MLEs
Estimating the exact log-likelihood function is difficult and, until fairly recently, was a costly exercise in terms of computing; but in these days of advanced and cheap computing facilities, cost should not be a deterrent to using the exact ML estimation method. Our job becomes easier by noting that, for any ARMA model, the two key quantities, namely the prediction error and the prediction error variance, can be computed recursively, and the literature shows that these recursions can be obtained using two popular methods: (1) the triangular factorization (TF) method and (2) the Kalman filter recursions.
Estimation
For the Gaussian AR(1) process,
\[
Y_t = c + \phi Y_{t-1} + \varepsilon_t, \qquad |\phi|<1, \qquad \varepsilon_t \sim WN(0,\sigma^2),
\]
the joint distribution of \(Y_T = (Y_1, Y_2,\dots,Y_T)'\) is
\[
Y_T \sim N(\mu, \Sigma)
\]
and the observations \(y = (y_1, y_2,\dots,y_T)'\) are a single realization of \(Y_T\).
MLE AR(1)
\[
\begin{pmatrix} Y_1 \\ \vdots \\ Y_T \end{pmatrix} \sim N(\mu, \Sigma), \qquad
\mu = \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix}, \quad \mu = \frac{c}{1-\phi}, \qquad
\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{T-2} \\
\vdots & & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0
\end{pmatrix}
\]
MLE AR(1)
The p.d.f. of the sample \(y = (y_1, y_2,\dots,y_T)'\) is given by the multivariate normal density
\[
f_Y(y;\mu,\Sigma) = (2\pi)^{-T/2}\, |\Sigma|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right\}
\]
Denote \(\Sigma = \sigma_y^2 V\) with \(V_{ij} = \phi^{|i-j|}\), so that
\[
\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{T-2} \\
\vdots & & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0
\end{pmatrix}
= \gamma_0 \begin{pmatrix}
1 & \rho_1 & \cdots & \rho_{T-1} \\
\rho_1 & 1 & \cdots & \rho_{T-2} \\
\vdots & & \ddots & \vdots \\
\rho_{T-1} & \rho_{T-2} & \cdots & 1
\end{pmatrix}
\]
\[
\Sigma = \sigma_y^2 V = \sigma_y^2 \begin{pmatrix}
1 & \rho(1) & \cdots & \rho(T-1) \\
\rho(1) & 1 & \cdots & \rho(T-2) \\
\vdots & & \ddots & \vdots \\
\rho(T-1) & \rho(T-2) & \cdots & 1
\end{pmatrix}, \qquad \rho(j) = \phi^{j}
\]
Collecting the parameters of the model in \(\theta = (c,\phi,\sigma^2)\), the joint p.d.f. becomes
\[
f_Y(y;\theta) = (2\pi\sigma_y^2)^{-T/2}\, |V|^{-1/2} \exp\!\left\{-\frac{1}{2\sigma_y^2}(y-\mu)'V^{-1}(y-\mu)\right\}
\]
and the sample log-likelihood function is given by
\[
\ln L(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma_y^2) - \frac{1}{2}\ln|V| - \frac{1}{2\sigma_y^2}(y-\mu)'V^{-1}(y-\mu)
\]
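For modest \(T\) this log-likelihood can be evaluated by building \(V\) explicitly; a sketch (scipy.linalg.toeplitz is used only as a convenience for the \(V\) matrix):

```python
import numpy as np
from scipy.linalg import toeplitz

def ar1_loglik_mvn(params, y):
    """AR(1) exact log-likelihood written as one multivariate normal density,
    with Sigma = sigma_y^2 * V and V[i, j] = phi**|i - j|."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    mu = c / (1.0 - phi)
    sigma2_y = sigma2 / (1.0 - phi**2)            # var(y_t)
    V = toeplitz(phi ** np.arange(T))             # autocorrelation matrix
    dev = y - mu
    _, logdetV = np.linalg.slogdet(V)
    quad = dev @ np.linalg.solve(V, dev)
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2_y)
            - 0.5 * logdetV - quad / (2 * sigma2_y))
```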
MLE
The exact log-likelihood function is a non-linear function of the parameters \(\theta\); there is no closed-form solution for the exact MLEs.

The exact MLEs must therefore be determined by numerically maximizing the exact log-likelihood function. Usually a Newton-Raphson type algorithm is used for the maximization, which leads to the iterative scheme
\[
\hat{\theta}_{mle,n+1} = \hat{\theta}_{mle,n} - H(\hat{\theta}_{mle,n})^{-1}\, s(\hat{\theta}_{mle,n})
\]
where \(H(\theta)\) is an estimate of the Hessian matrix (the second derivative of the log-likelihood function), and
\(s(\hat{\theta}_{mle,n})\) is an estimate of the score vector (the first derivative of the log-likelihood function).

The estimates of the Hessian and score may be computed numerically (using numerical derivative routines) or analytically (if analytic derivatives are known).
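A sketch of such a Newton-Raphson iteration with finite-difference score and Hessian, usable with any of the log-likelihood functions above (in practice a library optimizer would typically be preferred):

```python
import numpy as np

def newton_raphson(loglik, theta0, tol=1e-8, max_iter=100, h=1e-5):
    """Maximize loglik(theta) by Newton-Raphson with numerical score and Hessian."""
    theta = np.asarray(theta0, dtype=float)
    k = len(theta)

    def score(th):                                 # central-difference gradient
        g = np.empty(k)
        for i in range(k):
            e = np.zeros(k); e[i] = h
            g[i] = (loglik(th + e) - loglik(th - e)) / (2 * h)
        return g

    def hessian(th):                               # finite differences of the score
        H = np.empty((k, k))
        for i in range(k):
            e = np.zeros(k); e[i] = h
            H[:, i] = (score(th + e) - score(th - e)) / (2 * h)
        return 0.5 * (H + H.T)                     # symmetrize

    for _ in range(max_iter):
        step = np.linalg.solve(hessian(theta), score(theta))
        theta = theta - step                       # theta_{n+1} = theta_n - H^{-1} s
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

For example, `newton_raphson(lambda th: ar1_exact_loglik(th, y), [0.0, 0.5, 1.0])` would iterate on the AR(1) exact log-likelihood sketched earlier.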
Factorization
Note that for large \(T\), \(\Sigma\) may be large and difficult to invert.

Since \(\Sigma\) is a positive definite symmetric matrix, there exists a unique triangular factorization of \(\Sigma\):
\[
\Sigma = A f A',
\]
where
\[
f_{T\times T} = \begin{pmatrix}
f_1 & 0 & \cdots & 0 \\
0 & f_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & f_T
\end{pmatrix}, \qquad f_t > 0 \text{ for all } t \ \text{(a diagonal matrix)},
\]
\[
A_{T\times T} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
a_{21} & 1 & \ddots & \vdots \\
\vdots & & \ddots & 0 \\
a_{T1} & a_{T2} & \cdots & 1
\end{pmatrix} \quad \text{(unit lower triangular)}.
\]
Likelihood
The likelihood function can be rewritten as
\[
L(\theta\mid y_T) = (2\pi)^{-T/2} \det(A f A')^{-1/2} \exp\!\left\{-\tfrac{1}{2}(y_T-\mu)'(A f A')^{-1}(y_T-\mu)\right\}
\]
This is done by converting the correlated variables \(y_1, y_2,\dots,y_T\) into a collection \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_T\) of uncorrelated variables. In the following, let \(P_j\) denote the linear projection onto \(y_1,\dots,y_j\). Define
\[
\varepsilon = A^{-1}(y_T - \mu) \quad \text{(prediction errors)},
\]
or equivalently \(A\varepsilon = y_T - \mu\).
Since \(A\) is a lower-triangular matrix with 1s along the principal diagonal,
\[
\varepsilon_1 = y_1 - \mu
\]
\[
\varepsilon_2 = y_2 - P_1 y_2 = y_2 - \mu - a_{21}\varepsilon_1
\]
\[
\varepsilon_3 = y_3 - \mu - a_{31}\varepsilon_1 - a_{32}\varepsilon_2
\]
\[
\vdots
\]
\[
\varepsilon_T = y_T - \mu - \sum_{i=1}^{T-1} a_{Ti}\,\varepsilon_i
\]
Also, since \(A\) is lower triangular with 1s along the principal diagonal, \(\det(A) = 1\), so
\[
\det(A f A') = \det(A)\det(f)\det(A') = \det(f).
\]
Then
\[
L(\theta\mid y_T) = (2\pi)^{-T/2}\det(f)^{-1/2}\exp\!\left\{-\tfrac{1}{2}\varepsilon' f^{-1}\varepsilon\right\}
= \prod_{t=1}^{T}\frac{1}{\sqrt{2\pi f_t}}\exp\!\left(-\frac{\varepsilon_t^2}{2 f_t}\right),
\]
where \(\varepsilon_t\) is the \(t\)-th element of the \(T\times 1\) vector \(\varepsilon\), i.e. the prediction error \(y_t - \hat{y}_{t|t-1}\), with
\[
\hat{y}_{t|t-1} = \mu - \sum_{i=1}^{t-1}\tilde{a}_{t,i}\,(y_i - \mu), \quad t = 2,3,\dots,T,
\]
where \(\tilde{a}_{t,i}\) is the \((t,i)\)-th element of \(A^{-1}\).
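One way to obtain this factorization numerically is from the Cholesky decomposition: if \(C\) is the lower Cholesky factor of \(\Sigma\), then \(A = C D^{-1/2}\) and \(f = D\) with \(D = \operatorname{diag}(\operatorname{diag}(C))^2\). A sketch under that assumption:

```python
import numpy as np

def prediction_errors_tf(Sigma, y, mu):
    """Triangular factorization Sigma = A f A' (A unit lower triangular, f diagonal);
    returns the prediction errors eps = A^{-1}(y - mu), the f_t, and the log-likelihood."""
    y = np.asarray(y, dtype=float)
    C = np.linalg.cholesky(Sigma)        # lower triangular with Sigma = C C'
    d = np.diag(C)
    A = C / d                            # divide column j by C[j, j] -> unit diagonal
    f = d ** 2                           # prediction error variances f_t
    eps = np.linalg.solve(A, y - mu)     # solves A eps = y - mu
    loglik = -0.5 * (len(y) * np.log(2 * np.pi)
                     + np.sum(np.log(f)) + np.sum(eps**2 / f))
    return eps, f, loglik
```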
Kalman Filters
The Kalman filter comprises a set of mathematical equations that provide an optimal recursive solution by the least squares method.

The purpose of this solution is to compute a linear, unbiased and optimal estimator of a system's state at time \(t\), based on the information available at \(t-1\), and to update these estimates with the additional information available at \(t\).

The filter's optimality assumes that the system can be described through a stochastic linear model with an associated error following a normal distribution with mean zero and known variance.
Kalman Filters
The Kalman filter is the main algorithm for estimating dynamic systems in state-space form. This representation describes the system by a set of state variables. The state contains all the information about the system at a given point in time. This information should allow us to infer the system's past behaviour, with the aim of predicting its future behaviour.
DEVELOPING THE KALMAN FILTER
ALGORITHM
The basic building blocks of a Kalman filter are two equations: the measurement equation and the transition equation. The measurement equation relates the unobserved data (\(x_t\), where \(t\) indicates a point in time) to the observable data (\(y_t\)):
\[
y_t = m\,x_t + v_t \qquad (1)
\]
where \(E(v_t) = 0\) and \(\operatorname{var}(v_t) = r_t\). The transition equation is based on a model that allows the unobserved
data to change through time.
\[
x_{t+1} = a\,x_t + w_t \qquad (2)
\]
where \(E(w_t) = 0\) and \(\operatorname{var}(w_t) = q_t\). The process starts with an initial estimate of the state, \(x_0\), which has mean \(\mu_0\) and standard deviation \(s_0\). Using the expectation of equation (2), a prediction for \(x_1\) emerges, call it \(\hat{x}_1\):
\[
\hat{x}_1 = E(a\,x_0 + w_0) = a\,\mu_0 \qquad (3)
\]
The predicted value from equation (3) is then inserted into equation (1) and again taken as an expectation to produce a prediction for \(y_1\), call it \(\hat{y}_1\):
\[
\hat{y}_1 = E(m\,\hat{x}_1 + v_0) = m\,E(\hat{x}_1) = m\,a\,\mu_0 \qquad (4)
\]
Thus far, predictions of the future are based on expectations and not on the variance or standard deviation associated with the predicted variables. The variance will eventually be incorporated to produce better estimates. The next step, however, is to compare the predicted value \(\hat{y}_1\) with the actual \(y_1\) when it occurs. In equation (5), the predicted and actual values of \(y_1\) are compared to produce the predicted (or expected) error, \(ye_1\):
\[
ye_1 = E(y_1 - \hat{y}_1) = y_1 - m\,E(\hat{x}_1) = y_1 - m\,a\,\mu_0 \qquad (5)
\]
Given the error in predicting \(y_1\), which is based on the expectation of \(x_1\) from equation (3), a new estimate of \(x_1\) is considered, \(x_{1E}\). Notice this is different from \(\hat{x}_1\) because \(x_{1E}\) incorporates the prediction error of \(y_1\).
Equation (6) identifies \(x_{1E}\) as the expectation of an adjusted \(x_1\):
\[
x_{1E} = E[\hat{x}_1 + k_1\,ye_1] = E[\hat{x}_1] + k_1\big(y_1 - E(m\,\hat{x}_1)\big) = a\,\mu_0 + k_1(y_1 - m\,a\,\mu_0) \qquad (6)
\]
\(k_1\) (or more generically \(k_t\)) in equation (6) is referred to as the Kalman gain and incorporates the variance of \(\hat{x}_1\) (denoted \(p_1\), or generically \(p_t\)) and the variance of \(\hat{y}_1\) (see the denominator in equation (8) below):
\[
\operatorname{var}(\hat{x}_1) = \operatorname{var}(a\,x_0) + \operatorname{var}(w_0) = s_0^2 a^2 + q_0 = p_1 \qquad (7)
\]
\[
k_1 = \frac{m\,p_1}{p_1 m^2 + r_0} = \frac{m\,(s_0^2 a^2 + q_0)}{(s_0^2 a^2 + q_0)\,m^2 + r_0} \qquad (8)
\]
The cycle then starts over, with \(x_{1E}\) taking the place of \(x_0\) in equation (3) and used to forecast \(y_2\). The mean of
\(x_{1E}\) is the value of the expectation calculated in equation (6). Notice that the mean of \(x_{1E}\) incorporates the mean of \(x_0\), the variance of \(x_0\) (via the Kalman gain), the variance of the error in the measurement equation (via the Kalman gain), the variance of the error in the transition equation (via the Kalman gain), and the observed \(y_1\). The variance of \(x_{1E}\) is
\[
\operatorname{var}(x_{1E}) = p_1\,[1 - k_1 m] = p_1\left[1 - \frac{1}{1 + \dfrac{r_0}{(s_0^2 a^2 + q_0)\,m^2}}\right] \qquad (9)
\]
Notice that the variance of \(x_{1E}\) is reduced relative to the variance of \(\hat{x}_1\). Further, because the distributional aspects of each estimated value of \(x_t\) are known (assuming
a model), the model parameters within the measurement and transition equations can be optimized using maximum likelihood estimation. An iterative sequence of filtering followed by parameter estimation eventually optimizes the entire system. In general, the recursion is:

Predict the future unobserved variable (\(x\)) based on the current estimate of the unobserved variable:
\[
\hat{x}_{t+1} = E(a\,x_{tE} + w_t) \qquad (10)
\]
Use the predicted unobserved variable to predict the future observable variable (\(y\)):
\[
\hat{y}_{t+1} = E(m\,\hat{x}_{t+1} + v_t) \qquad (11)
\]
When the future observable variable actually occurs, calculate the error in the prediction:
\[
ye_{t+1} = E(y_{t+1} - \hat{y}_{t+1}) \qquad (12)
\]
Generate a better estimate of the unobserved variable at time \(t+1\) and start the process over for time \(t+2\):
\[
x_{(t+1)E} = E[\hat{x}_{t+1} + k_{t+1}\,ye_{t+1}] \qquad (13)
\]
Note: \(k_{t+1}\) is the Kalman gain and is based on the variances of the predicted variables in the first and second steps of the process:
\[
k_{t+1} = \frac{m\,\operatorname{var}(\hat{x}_{t+1})}{\operatorname{var}(\hat{y}_{t+1})} \qquad (14)
\]
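A sketch of these scalar recursions in the section's notation (m, a, q, r constant; mu0 and s0 are the assumed mean and standard deviation of x_0):

```python
import numpy as np

def kalman_filter_1d(y, m, a, q, r, mu0, s0):
    """Scalar Kalman filter for y_t = m*x_t + v_t, x_{t+1} = a*x_t + w_t.
    Returns filtered states x_E, prediction errors ye_t and their variances F_t."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    xE, pE = mu0, s0**2                  # moments of the current filtered state
    x_filt = np.empty(T)
    ye = np.empty(T)
    F = np.empty(T)
    for t in range(T):
        x_pred = a * xE                  # eq (3)/(10): one-step state prediction
        p_pred = a**2 * pE + q           # eq (7): its variance p_t
        ye[t] = y[t] - m * x_pred        # eq (5)/(12): prediction error
        F[t] = m**2 * p_pred + r         # variance of the prediction error
        k = m * p_pred / F[t]            # eq (8)/(14): Kalman gain
        xE = x_pred + k * ye[t]          # eq (6)/(13): updated (filtered) state
        pE = p_pred * (1 - k * m)        # eq (9): updated state variance
        x_filt[t] = xE
    return x_filt, ye, F
```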
APPLYING MAXIMUM LIKELIHOOD ESTIMATION
TO THE KALMAN FILTER
The Kalman filter provides output throughout the time series in the form of estimated values (e.g. \(x_{1E}\) from equation (6)), which are the means/expectations of the unobserved variables \(x_1,\dots,x_T\) for every time period \(t\), with associated variances provided by equation (9). (Note: \(x_0\) is also drawn from a distribution, with mean \(\mu_0\) and variance \(s_0^2\), and is part of the time series.) Keeping with the univariate structure from the previous section and assuming \(T\) time periods of data, a maximum likelihood estimation (MLE) is imposed by further assuming that \(x_0\),
\(w_1,\dots,w_T\), and \(v_1,\dots,v_T\) are jointly normal and uncorrelated. Consequently, a joint likelihood function exists:
\[
\left\{\frac{1}{\sqrt{2\pi s_0^2}}\, e^{-\frac{(x_0-\mu_0)^2}{2 s_0^2}}\right\}
\left\{\left[\frac{1}{\sqrt{2\pi q_t}}\right]^{T} e^{-\sum_{t=1}^{T}\frac{(x_t - E(x_t))^2}{2 q_t}}\right\}
\left\{\left[\frac{1}{\sqrt{2\pi r_t}}\right]^{T} e^{-\sum_{t=1}^{T}\frac{(y_t - E(y_t))^2}{2 r_t}}\right\} \qquad (15)
\]
Recall: \(x_0\) is distributed \(N(\mu_0, s_0^2)\), \(w_t\) is distributed \(N(0, q_t)\), and \(v_t\) is distributed \(N(0, r_t)\).

The problem with the likelihood function in equation (15) is that \(x_t\) is not observable. However, as noted above, the Kalman filter provides means and variances
for \(x_t\) throughout the time series. Consequently, remove the first two terms and define the mean of \(y_t\) as \(\hat{y}_t = m\,\hat{x}_t = a\,m\,x_{(t-1)E}\), define the variance of \(\hat{y}_t\) as \(p_t m^2 + r_{t-1}\), and keep the assumption that \(y_t\) is normally distributed. Notice that the variance definition (via \(p_t\); see the denominator in equation (8)) and the mean definition incorporate the distributional properties of \(x_t\) while dealing only with the observable \(y_t\). Further, the initial conditions are introduced when \(t\) equals one, through \(x_{0E} = \mu_0\) and through \(p_1\) (see equation (7)). The new likelihood function becomes
\[
\prod_{t=1}^{T}\left\{\frac{1}{\sqrt{2\pi\,(p_t m^2 + r_{t-1})}}\; e^{-\frac{(y_t - a\,m\,x_{(t-1)E})^2}{2\,(p_t m^2 + r_{t-1})}}\right\} \qquad (16)
\]
To simplify the math, take the natural logarithm of the likelihood function, creating the log-likelihood function:
\[
-\frac{T\ln(2\pi)}{2} - \frac{1}{2}\sum_{t=1}^{T}\ln\!\left[p_t m^2 + r_{t-1}\right] - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - a\,m\,x_{(t-1)E})^2}{p_t m^2 + r_{t-1}} \qquad (17)
\]
Further assume that \(r_t\) and \(q_t\) (contained in \(p_t\)) are constant through time. Consequently, equation (17) has a value, which is called the score. The next step is to maximize the log-likelihood function using the partial derivatives with respect to \(r\) (\(= r_t\) for all \(t\)), \(q\) (\(= q_t\) for all \(t\)), and \(a\). After solving for these three parameters by setting the partial derivatives to zero, the Kalman filter is re-estimated and the maximum likelihood procedure is applied again, until the score improves by
less than a particular value (say 0.0001), indicating convergence. The iterative use of the MLE procedure in this way is called the Expectation Maximization (EM) algorithm.
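As a sketch of this estimation step, the prediction errors and variances from the kalman_filter_1d sketch above give the log-likelihood of equation (17), which can then be maximized numerically; here a quasi-Newton search replaces the EM-style iteration described in the text, and mu0, s0 are treated as known:

```python
import numpy as np
from scipy.optimize import minimize

def kf_loglik(params, y, m, mu0, s0):
    """Equation (17): log-likelihood built from the filter's prediction errors ye_t
    and their variances F_t = p_t*m^2 + r (uses kalman_filter_1d defined above)."""
    a, q, r = params
    _, ye, F = kalman_filter_1d(y, m, a, q, r, mu0, s0)
    return -0.5 * np.sum(np.log(2 * np.pi * F) + ye**2 / F)

def fit_state_space(y, m=1.0, mu0=0.0, s0=1.0):
    """Maximize the log-likelihood over (a, q, r) with a quasi-Newton search,
    rather than the EM-style iteration described in the text."""
    obj = lambda p: -kf_loglik(p, y, m, mu0, s0)
    res = minimize(obj, x0=np.array([0.5, 1.0, 1.0]), method="L-BFGS-B",
                   bounds=[(-0.99, 0.99), (1e-6, None), (1e-6, None)])
    return res.x                          # estimated (a, q, r)
```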