Maximum Likelihood Estimation of ARMA Models
For i.i.d. data with marginal pdf \(f(y_t;\theta)\), the joint pdf for a sample \(y=(y_1,\dots,y_T)\) is
\[
f(y;\theta) = f(y_1,\dots,y_T;\theta) = \prod_{t=1}^{T} f(y_t;\theta)
\]
The likelihood function is this joint density treated as a function of the parameters \(\theta\) given the data \(y\):
\[
L(\theta\mid y) = L(\theta\mid y_1,\dots,y_T) = \prod_{t=1}^{T} f(y_t;\theta)
\]
The log-likelihood is
\[
\ln L(\theta\mid y) = \sum_{t=1}^{T} \ln f(y_t;\theta)
\]
Conditional MLE of ARMA Models
Problem: For a sample from a covariance stationary time series \(\{y_t\}\), the construction of the log-likelihood given above does not work because the random variables in the sample \(y=(y_1,\dots,y_T)\) are not i.i.d.

One solution: conditional factorization of the log-likelihood.

Intuition: Consider the joint density of two adjacent observations, \(f(y_2,y_1;\theta)\). The joint density can always be factored as the product of the conditional density of \(y_2\) given \(y_1\) and the marginal density of \(y_1\):
\[
f(y_2,y_1;\theta) = f(y_2\mid y_1;\theta)\, f(y_1;\theta)
\]
For three observations, the factorization becomes
\[
f(y_3,y_2,y_1;\theta) = f(y_3\mid y_2,y_1;\theta)\, f(y_2\mid y_1;\theta)\, f(y_1;\theta)
\]
In general, the conditional-marginal factorization has the form
\[
f(y_T,\dots,y_1;\theta) = \left(\prod_{t=p+1}^{T} f(y_t\mid I_{t-1};\theta)\right) f(y_p,\dots,y_1;\theta)
\]
where \(I_t = \{y_t,\dots,y_1\}\) is the information available at time \(t\) and \(y_p,\dots,y_1\) are the initial values.

The exact log-likelihood function may then be expressed as
\[
\ln L(\theta\mid y) = \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta) + \ln f(y_p,\dots,y_1;\theta)
\]
The conditional log-likelihood is
\[
\ln L_c(\theta\mid y) = \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta)
\]
Two types of maximum likelihood estimates (MLEs) may be computed. The first type is based on maximizing the conditional log-likelihood function. These estimates are called conditional MLEs and are defined by
\[
\hat{\theta}_{cmle} = \arg\max_{\theta} \sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta)
\]
The second type is based on maximizing the exact log-likelihood function. These estimates are called exact MLEs and are defined by
\[
\hat{\theta}_{mle} = \arg\max_{\theta} \left[\sum_{t=p+1}^{T} \ln f(y_t\mid I_{t-1};\theta) + \ln f(y_p,\dots,y_1;\theta)\right]
\]
Result:
For stationary models, \(\hat{\theta}_{cmle}\) and \(\hat{\theta}_{mle}\) are consistent and have the same limiting normal distribution. In finite samples, however, \(\hat{\theta}_{cmle}\) and \(\hat{\theta}_{mle}\) are generally not equal and may differ by a substantial amount if the data are close to being non-stationary or non-invertible.
AR(p): OLS Equivalent to Conditional MLE

Model:
\[
y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0,\sigma^2)
\]
Writing \(y_t = x_t'\beta + \varepsilon_t\) with \(x_t = (1, y_{t-1}, y_{t-2},\dots,y_{t-p})'\) and \(\beta = (c,\phi_1,\dots,\phi_p)'\), OLS gives
\[
\hat{\beta} = \left(\sum_{t=p+1}^{T} x_t x_t'\right)^{-1} \sum_{t=p+1}^{T} x_t y_t, \qquad
\hat{\sigma}^2 = \frac{1}{T-(p+1)} \sum_{t=p+1}^{T} (y_t - x_t'\hat{\beta})^2
\]
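As an illustration, here is a minimal NumPy sketch of this regression; the function name and interface are my own, not from the notes:

```python
import numpy as np

def ar_ols(y, p):
    """Conditional MLE of an AR(p) via OLS: regress y_t on (1, y_{t-1},...,y_{t-p})."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Design matrix: intercept plus p lags, using observations t = p+1,...,T
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - j:T - j] for j in range(1, p + 1)])
    Y = y[p:]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (c, phi_1, ..., phi_p)
    resid = Y - X @ beta
    sigma2 = resid @ resid / (T - (p + 1))         # divisor as in the formula above
    return beta, sigma2
```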
Properties of the estimator
\(\hat{\phi}\) is downward biased in a finite sample, i.e. \(E[\hat{\phi}] < \phi\). The estimator may be biased but it is consistent: it converges in probability to \(\phi\) as \(T \to \infty\).
Example: MLE for stationary AR(1)
\[
Y_t = c + \phi Y_{t-1} + \varepsilon_t, \quad t=1,\dots,T, \qquad \varepsilon_t \sim WN(0,\sigma^2), \quad |\phi|<1, \qquad \theta=(c,\phi,\sigma^2)
\]
Conditional on \(I_{t-1}\),
\[
y_t \mid I_{t-1} \sim N(c + \phi y_{t-1},\, \sigma^2), \quad t=2,\dots,T
\]
which only depends on \(y_{t-1}\). The conditional density \(f(y_t\mid I_{t-1};\theta)\) is then
\[
f(y_t\mid y_{t-1};\theta) = (2\pi\sigma^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma^2}(y_t - c - \phi y_{t-1})^2\right), \quad t=2,\dots,T
\]
To determine the marginal density for the initial value \(y_1\), recall that for a stationary AR(1) process
\[
E[y_1] = \mu = \frac{c}{1-\phi}, \qquad \operatorname{var}[y_1] = \frac{\sigma^2}{1-\phi^2}
\]
It follows that
\[
y_1 \sim N\!\left(\frac{c}{1-\phi},\, \frac{\sigma^2}{1-\phi^2}\right)
\]
\[
f(y_1;\theta) = \left(\frac{2\pi\sigma^2}{1-\phi^2}\right)^{-1/2} \exp\!\left(-\frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2\right)
\]
The conditional log-likelihood function is
\[
\sum_{t=2}^{T} \ln f(y_t\mid y_{t-1};\theta) = -\frac{T-1}{2}\ln(2\pi) - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2
\]
Notice that the conditional log-likelihood function has the form of the log-likelihood function for a linear regression model with normal errors,
\[
y_t = c + \phi y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0,\sigma^2), \quad t=2,\dots,T
\]
It follows that
\[
\hat{c}_{cmle} = \hat{c}_{ols}, \qquad \hat{\phi}_{cmle} = \hat{\phi}_{ols}, \qquad
\hat{\sigma}^2_{cmle} = \frac{1}{T-1}\sum_{t=2}^{T}(y_t - \hat{c}_{cmle} - \hat{\phi}_{cmle}\, y_{t-1})^2
\]
The marginal log-likelihood for the initial value \(y_1\) is
\[
\ln f(y_1;\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\!\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2
\]
The exact log-likelihood function is then
\[
\ln L(\theta; y) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\ln\!\left(\frac{\sigma^2}{1-\phi^2}\right) - \frac{1-\phi^2}{2\sigma^2}\left(y_1 - \frac{c}{1-\phi}\right)^2 - \frac{T-1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(y_t - c - \phi y_{t-1})^2
\]
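This exact log-likelihood can be coded directly and handed to a numerical optimizer; a minimal sketch (the parameter ordering and function name are illustrative assumptions):

```python
import numpy as np

def ar1_exact_loglik(params, y):
    """Exact AR(1) log-likelihood: marginal term for y_1 plus conditional terms."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    mu = c / (1.0 - phi)                 # stationary mean
    var1 = sigma2 / (1.0 - phi**2)       # stationary variance of y_1
    ll = -0.5 * np.log(2 * np.pi * var1) - (y[0] - mu) ** 2 / (2 * var1)
    resid = y[1:] - c - phi * y[:-1]     # one-step errors for t = 2,...,T
    ll += -0.5 * (T - 1) * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
    return ll
```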
Prediction Error Decomposition
To illustrate this algorithm, consider the simple AR(1) model. Recall that
\[
y_t \mid I_{t-1} \sim N(c + \phi y_{t-1},\, \sigma^2), \quad t=2,\dots,T
\]
from which it follows that
\[
E[y_t\mid I_{t-1}] = c + \phi y_{t-1}, \qquad \operatorname{var}[y_t\mid I_{t-1}] = \sigma^2
\]
The 1-step-ahead prediction errors may then be defined as
\[
v_t = y_t - E[y_t\mid I_{t-1}] = y_t - c - \phi y_{t-1}, \quad t=2,\dots,T
\]
The variance of the prediction error at time \(t\) is
\[
f_t = \operatorname{var}(v_t) = \operatorname{var}(\varepsilon_t) = \sigma^2, \quad t=2,\dots,T
\]
For the initial value, the first prediction error and its variance are
\[
v_1 = y_1 - E[y_1] = y_1 - \frac{c}{1-\phi}, \qquad f_1 = \operatorname{var}(v_1) = \frac{\sigma^2}{1-\phi^2}
\]
Using the prediction errors and the prediction error variances, the exact log-likelihood function may be re-expressed as
\[
\ln L(\theta\mid y) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln f_t - \frac{1}{2}\sum_{t=1}^{T}\frac{v_t^2}{f_t}
\]
which is the prediction error decomposition. A further simplification may be achieved by writing
\[
\operatorname{var}(v_t) = \sigma^2 f_t =
\begin{cases}
\sigma^2\cdot\dfrac{1}{1-\phi^2} & \text{for } t=1 \\[1ex]
\sigma^2\cdot 1 & \text{for } t>1
\end{cases}
\]
That is, \(f_t = \frac{1}{1-\phi^2}\) for \(t=1\) and \(f_t = 1\) for \(t>1\). Then the log-likelihood becomes
\[
\ln L(\theta\mid y) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\sigma^2 - \frac{1}{2}\sum_{t=1}^{T}\ln f_t - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\frac{v_t^2}{f_t}
\]
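For comparison, a sketch of the same likelihood computed through the prediction error decomposition above; it should agree with the closed-form expression of the previous slide (names are illustrative):

```python
import numpy as np

def ar1_loglik_ped(params, y):
    """AR(1) log-likelihood via the prediction error decomposition
    with scaled variances f_1 = 1/(1-phi^2) and f_t = 1 for t > 1."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    v = np.empty(T)
    f = np.ones(T)
    v[0] = y[0] - c / (1.0 - phi)        # v_1 = y_1 - E[y_1]
    f[0] = 1.0 / (1.0 - phi**2)
    v[1:] = y[1:] - c - phi * y[:-1]     # v_t = y_t - c - phi*y_{t-1}
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2)
            - 0.5 * np.sum(np.log(f)) - np.sum(v**2 / f) / (2 * sigma2))
```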
MLE Estimation of MA(1)
Recall
\[
Y_t = \mu + \theta\varepsilon_{t-1} + \varepsilon_t, \qquad |\theta|<1, \qquad \varepsilon_t \sim WN(0,\sigma^2)
\]
\(|\theta|<1\) is assumed only so that the representation is invertible; it says nothing about stationarity (an MA(1) is stationary for any \(\theta\)).
Estimation MA(1)
\[
Y_t \mid \varepsilon_{t-1} \sim N(\mu + \theta\varepsilon_{t-1},\, \sigma^2), \qquad
f(y_t\mid \varepsilon_{t-1};\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}(y_t - \mu - \theta\varepsilon_{t-1})^2\right)
\]
with parameters \((\mu,\theta,\sigma^2)\).

Problem: without knowing \(\varepsilon_{t-2}\) we do not observe \(\varepsilon_{t-1}\); we need \(\varepsilon_{t-2}\) to compute \(\varepsilon_{t-1} = y_{t-1} - \mu - \theta\varepsilon_{t-2}\). But \(\varepsilon_{t-1}\) is unobservable. Solution: assume \(\varepsilon_0 = 0\), i.e. make it non-random and simply fix it at zero (any fixed number would work).
Estimation MA(1)
\[
Y_1 \mid \varepsilon_0 = 0 \sim N(\mu,\sigma^2)
\]
\[
y_1 = \mu + \varepsilon_1 \;\Rightarrow\; \varepsilon_1 = y_1 - \mu, \qquad
y_2 = \mu + \varepsilon_2 + \theta\varepsilon_1 \;\Rightarrow\; \varepsilon_2 = y_2 - \mu - \theta(y_1 - \mu)
\]
\[
\varepsilon_t = (y_t-\mu) - \theta(y_{t-1}-\mu) + \theta^2(y_{t-2}-\mu) - \dots + (-1)^{t-1}\theta^{t-1}(y_1-\mu)
\]
Conditional likelihood:
\[
L(\theta\mid y_1,\dots,y_T;\, \varepsilon_0=0) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right)
\]
If \(|\theta|<1\) (strictly), the choice of \(\varepsilon_0\) does not matter asymptotically and the CMLE is consistent. Exact MLE requires the Kalman filter.
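A minimal sketch of this conditional log-likelihood, building the \(\varepsilon_t\) recursion with \(\varepsilon_0 = 0\) (function and variable names are my own):

```python
import numpy as np

def ma1_conditional_loglik(params, y):
    """Conditional MA(1) log-likelihood given eps_0 = 0:
    eps_t = y_t - mu - theta*eps_{t-1}, then sum the normal log-densities."""
    mu, theta, sigma2 = params
    y = np.asarray(y, dtype=float)
    eps = np.empty(len(y))
    prev = 0.0                           # eps_0 fixed at zero
    for t, yt in enumerate(y):
        eps[t] = yt - mu - theta * prev
        prev = eps[t]
    T = len(y)
    return -0.5 * T * np.log(2 * np.pi * sigma2) - eps @ eps / (2 * sigma2)
```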
Why do we need Exact MLEs
In the estimation of MA(1) models above, we assumed that \(\varepsilon_0 = 0\) while calculating the sequence of \(\varepsilon_t\)'s. A more traditional approach is to estimate the unconditional, or exact, log-likelihood function by assuming that \(\varepsilon_0\) is random, letting it follow some distribution, and using that distribution when calculating the \(\varepsilon_t\)'s from the data. Such an allowance affects not only the first prediction error and the variance of \(y_1\), but also successive prediction errors and their variances. The practice of obtaining estimates of the parameters by not conditioning on pre-sample values, but instead using such exact prediction errors and variances, is called the exact or unconditional ML estimation method. Note that the problem of obtaining the sequence of \(\varepsilon_t\)'s arises only in pure MA or mixed (ARMA) models.
Exact MLEs
What are the advantages of using such a procedure? To understand this, an examination of the first prediction error and its variance is instructive. For example, in an MA(1) process the first prediction error is the same, \(y_1 - \mu\), in both the conditional and the exact ML estimation procedures. But the assumption that \(\varepsilon_0 = 0\) means that \(\operatorname{var}(y_1) = \operatorname{var}(\varepsilon_1) = \sigma^2\), whereas if we allow \(\varepsilon_0\) to be random, \(\operatorname{var}(y_1) = \sigma^2(1+\theta^2)\). This difference is reflected in \(f_1\) in the exact log-likelihood function, and in typically small sample sizes such an assumption may matter a lot for the estimates. Moreover, if \(\theta\) happens to be close to unity, the differences will be even more significant.
Exact MLEs
Estimating the exact log-likelihood function is difficult and, until fairly recently, was a costly exercise in terms of computing; but in these days of advanced and cheap computing facilities, cost should not be a deterrent to using the exact ML estimation method. Our job becomes easier by noting that, for any ARMA model, the two key quantities, namely the prediction error and the prediction error variance, can be computed recursively, and the literature shows that these recursions can be obtained using two popular methods: (1) the triangular factorization (TF) method and (2) the Kalman filter recursions.
Estimation
For the Gaussian AR(1) process,
\[
Y_t = c + \phi Y_{t-1} + \varepsilon_t, \qquad |\phi|<1, \qquad \varepsilon_t \sim WN(0,\sigma^2),
\]
the joint distribution of \(Y_T = (Y_1, Y_2,\dots,Y_T)'\) is
\[
Y_T \sim N(\mu, \Sigma)
\]
and the observations \(y = (y_1, y_2,\dots,y_T)'\) are a single realization of \(Y_T\).
MLE AR(1)
\[
\begin{pmatrix} Y_1 \\ \vdots \\ Y_T \end{pmatrix} \sim N(\mu, \Sigma), \qquad
\mu = \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix}, \quad \mu = \frac{c}{1-\phi}, \qquad
\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{T-2} \\
\vdots & & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0
\end{pmatrix}
\]
MLE AR(1)
The p.d.f. of the sample \(y = (y_1, y_2,\dots,y_T)'\) is given by the multivariate normal density
\[
f_Y(y;\mu,\Sigma) = (2\pi)^{-T/2}\, |\Sigma|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right\}
\]
Denote \(\Sigma = \sigma_y^2 V\) with \(V_{ij} = \phi^{|i-j|}\), so that
\[
\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{T-2} \\
\vdots & & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0
\end{pmatrix}
= \gamma_0 \begin{pmatrix}
1 & \rho_1 & \cdots & \rho_{T-1} \\
\rho_1 & 1 & \cdots & \rho_{T-2} \\
\vdots & & \ddots & \vdots \\
\rho_{T-1} & \rho_{T-2} & \cdots & 1
\end{pmatrix}
\]
\[
\Sigma = \sigma_y^2 V = \sigma_y^2 \begin{pmatrix}
1 & \rho(1) & \cdots & \rho(T-1) \\
\rho(1) & 1 & \cdots & \rho(T-2) \\
\vdots & & \ddots & \vdots \\
\rho(T-1) & \rho(T-2) & \cdots & 1
\end{pmatrix}, \qquad \rho(j) = \phi^{j}
\]
Collecting the parameters of the model in \(\theta = (c,\phi,\sigma^2)\), the joint p.d.f. becomes
\[
f_Y(y;\theta) = (2\pi\sigma_y^2)^{-T/2}\, |V|^{-1/2} \exp\!\left\{-\frac{1}{2\sigma_y^2}(y-\mu)'V^{-1}(y-\mu)\right\}
\]
and the sample log-likelihood function is given by
\[
\ln L(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma_y^2) - \frac{1}{2}\ln|V| - \frac{1}{2\sigma_y^2}(y-\mu)'V^{-1}(y-\mu)
\]
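For modest \(T\) this log-likelihood can be evaluated by building \(V\) explicitly; a sketch (scipy.linalg.toeplitz is used only as a convenience for the \(V\) matrix):

```python
import numpy as np
from scipy.linalg import toeplitz

def ar1_loglik_mvn(params, y):
    """AR(1) exact log-likelihood written as one multivariate normal density,
    with Sigma = sigma_y^2 * V and V[i, j] = phi**|i - j|."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = len(y)
    mu = c / (1.0 - phi)
    sigma2_y = sigma2 / (1.0 - phi**2)            # var(y_t)
    V = toeplitz(phi ** np.arange(T))             # autocorrelation matrix
    dev = y - mu
    _, logdetV = np.linalg.slogdet(V)
    quad = dev @ np.linalg.solve(V, dev)
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2_y)
            - 0.5 * logdetV - quad / (2 * sigma2_y))
```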
MLE
The exact log-likelihood function is a non-linear function of the parameters \(\theta\); there is no closed-form solution for the exact MLEs.

The exact MLEs must therefore be determined by numerically maximizing the exact log-likelihood function. Usually a Newton-Raphson type algorithm is used for the maximization, which leads to the iterative scheme
\[
\hat{\theta}_{mle,n+1} = \hat{\theta}_{mle,n} - H(\hat{\theta}_{mle,n})^{-1}\, s(\hat{\theta}_{mle,n})
\]
where \(H(\theta)\) is an estimate of the Hessian matrix (the second derivative of the log-likelihood function), and
\(s(\hat{\theta}_{mle,n})\) is an estimate of the score vector (the first derivative of the log-likelihood function).

The estimates of the Hessian and score may be computed numerically (using numerical derivative routines) or analytically (if analytic derivatives are known).
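A sketch of such a Newton-Raphson iteration with finite-difference score and Hessian, usable with any of the log-likelihood functions above (in practice a library optimizer would typically be preferred):

```python
import numpy as np

def newton_raphson(loglik, theta0, tol=1e-8, max_iter=100, h=1e-5):
    """Maximize loglik(theta) by Newton-Raphson with numerical score and Hessian."""
    theta = np.asarray(theta0, dtype=float)
    k = len(theta)

    def score(th):                                 # central-difference gradient
        g = np.empty(k)
        for i in range(k):
            e = np.zeros(k); e[i] = h
            g[i] = (loglik(th + e) - loglik(th - e)) / (2 * h)
        return g

    def hessian(th):                               # finite differences of the score
        H = np.empty((k, k))
        for i in range(k):
            e = np.zeros(k); e[i] = h
            H[:, i] = (score(th + e) - score(th - e)) / (2 * h)
        return 0.5 * (H + H.T)                     # symmetrize

    for _ in range(max_iter):
        step = np.linalg.solve(hessian(theta), score(theta))
        theta = theta - step                       # theta_{n+1} = theta_n - H^{-1} s
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

For example, `newton_raphson(lambda th: ar1_exact_loglik(th, y), [0.0, 0.5, 1.0])` would iterate on the AR(1) exact log-likelihood sketched earlier.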
Factorization
Note that for large \(T\), \(\Sigma\) may be large and difficult to invert.

Since \(\Sigma\) is a positive definite symmetric matrix, there exists a unique triangular factorization of \(\Sigma\):
\[
\Sigma = A f A',
\]
where
\[
f_{T\times T} = \begin{pmatrix}
f_1 & 0 & \cdots & 0 \\
0 & f_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & f_T
\end{pmatrix}, \qquad f_t > 0 \text{ for all } t \ \text{(a diagonal matrix)},
\]
\[
A_{T\times T} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
a_{21} & 1 & \ddots & \vdots \\
\vdots & & \ddots & 0 \\
a_{T1} & a_{T2} & \cdots & 1
\end{pmatrix} \quad \text{(unit lower triangular)}.
\]
Likelihood
The likelihood function can be rewritten as
\[
L(\theta\mid y_T) = (2\pi)^{-T/2} \det(A f A')^{-1/2} \exp\!\left\{-\tfrac{1}{2}(y_T-\mu)'(A f A')^{-1}(y_T-\mu)\right\}
\]
This is done by converting the correlated variables \(y_1, y_2,\dots,y_T\) into a collection \(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_T\) of uncorrelated variables. In the following, let \(P_j\) denote the linear projection onto \(y_1,\dots,y_j\). Define
\[
\varepsilon = A^{-1}(y_T - \mu) \quad \text{(prediction errors)},
\]
or equivalently \(A\varepsilon = y_T - \mu\).
Since \(A\) is a lower-triangular matrix with 1s along the principal diagonal,
\[
\varepsilon_1 = y_1 - \mu
\]
\[
\varepsilon_2 = y_2 - P_1 y_2 = y_2 - \mu - a_{21}\varepsilon_1
\]
\[
\varepsilon_3 = y_3 - \mu - a_{31}\varepsilon_1 - a_{32}\varepsilon_2
\]
\[
\vdots
\]
\[
\varepsilon_T = y_T - \mu - \sum_{i=1}^{T-1} a_{Ti}\,\varepsilon_i
\]
Also, since \(A\) is lower triangular with 1s along the principal diagonal, \(\det(A) = 1\), so
\[
\det(A f A') = \det(A)\det(f)\det(A') = \det(f).
\]
Then
\[
L(\theta\mid y_T) = (2\pi)^{-T/2}\det(f)^{-1/2}\exp\!\left\{-\tfrac{1}{2}\varepsilon' f^{-1}\varepsilon\right\}
= \prod_{t=1}^{T}\frac{1}{\sqrt{2\pi f_t}}\exp\!\left(-\frac{\varepsilon_t^2}{2 f_t}\right),
\]
where \(\varepsilon_t\) is the \(t\)-th element of the \(T\times 1\) vector \(\varepsilon\), i.e. the prediction error \(y_t - \hat{y}_{t|t-1}\), with
\[
\hat{y}_{t|t-1} = \mu - \sum_{i=1}^{t-1}\tilde{a}_{t,i}\,(y_i - \mu), \quad t = 2,3,\dots,T,
\]
where \(\tilde{a}_{t,i}\) is the \((t,i)\)-th element of \(A^{-1}\).
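One way to obtain this factorization numerically is from the Cholesky decomposition: if \(C\) is the lower Cholesky factor of \(\Sigma\), then \(A = C D^{-1/2}\) and \(f = D\) with \(D = \operatorname{diag}(\operatorname{diag}(C))^2\). A sketch under that assumption:

```python
import numpy as np

def prediction_errors_tf(Sigma, y, mu):
    """Triangular factorization Sigma = A f A' (A unit lower triangular, f diagonal);
    returns the prediction errors eps = A^{-1}(y - mu), the f_t, and the log-likelihood."""
    y = np.asarray(y, dtype=float)
    C = np.linalg.cholesky(Sigma)        # lower triangular with Sigma = C C'
    d = np.diag(C)
    A = C / d                            # divide column j by C[j, j] -> unit diagonal
    f = d ** 2                           # prediction error variances f_t
    eps = np.linalg.solve(A, y - mu)     # solves A eps = y - mu
    loglik = -0.5 * (len(y) * np.log(2 * np.pi)
                     + np.sum(np.log(f)) + np.sum(eps**2 / f))
    return eps, f, loglik
```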
Kalman Filters
The Kalman filter comprises a set of mathematical equations that provide an optimal recursive solution by the least squares method.

The purpose of this solution is to compute a linear, unbiased and optimal estimator of a system's state at time \(t\), based on the information available at \(t-1\), and to update these estimates with the additional information available at \(t\).

The filter's optimality assumes that the system can be described through a stochastic linear model with an associated error following a normal distribution with mean zero and known variance.
Kalman Filters
The Kalman filter is the main algorithm for estimating dynamic systems in state-space form. This representation describes the system by a set of state variables. The state contains all the information about the system at a given point in time. This information should allow us to infer the system's past behaviour, with the aim of predicting its future behaviour.
DEVELOPING THE KALMAN FILTER
ALGORITHM
The basic building blocks of a Kalman filter are two equations: the measurement equation and the transition equation. The measurement equation relates the unobserved data (\(x_t\), where \(t\) indicates a point in time) to the observable data (\(y_t\)):
\[
y_t = m\,x_t + v_t \qquad (1)
\]
where \(E(v_t) = 0\) and \(\operatorname{var}(v_t) = r_t\). The transition equation is based on a model that allows the unobserved
data to change through time.
\[
x_{t+1} = a\,x_t + w_t \qquad (2)
\]
where \(E(w_t) = 0\) and \(\operatorname{var}(w_t) = q_t\). The process starts with an initial estimate of the state, \(x_0\), which has mean \(\mu_0\) and standard deviation \(s_0\). Using the expectation of equation (2), a prediction for \(x_1\) emerges, call it \(\hat{x}_1\):
\[
\hat{x}_1 = E(a\,x_0 + w_0) = a\,\mu_0 \qquad (3)
\]
The predicted value from equation (3) is then inserted into equation (1) and again taken as an expectation to produce a prediction for \(y_1\), call it \(\hat{y}_1\):
\[
\hat{y}_1 = E(m\,\hat{x}_1 + v_0) = m\,E(\hat{x}_1) = m\,a\,\mu_0 \qquad (4)
\]
Thus far, predictions of the future are based on expectations and not on the variance or standard deviation associated with the predicted variables. The variance will eventually be incorporated to produce better estimates. The next step, however, is to compare the predicted value \(\hat{y}_1\) with the actual \(y_1\) when it occurs. In equation (5), the predicted and actual values of \(y_1\) are compared to produce the predicted (or expected) error, \(ye_1\):
\[
ye_1 = E(y_1 - \hat{y}_1) = y_1 - m\,E(\hat{x}_1) = y_1 - m\,a\,\mu_0 \qquad (5)
\]
Given the error in predicting \(y_1\), which is based on the expectation of \(x_1\) from equation (3), a new estimate of \(x_1\) is considered, \(x_{1E}\). Notice this is different from \(\hat{x}_1\) because \(x_{1E}\) incorporates the prediction error of \(y_1\).
Equation (6) identifies \(x_{1E}\) as the expectation of an adjusted \(x_1\):
\[
x_{1E} = E[\hat{x}_1 + k_1\,ye_1] = E[\hat{x}_1] + k_1\big(y_1 - E(m\,\hat{x}_1)\big) = a\,\mu_0 + k_1(y_1 - m\,a\,\mu_0) \qquad (6)
\]
\(k_1\) (or more generically \(k_t\)) in equation (6) is referred to as the Kalman gain and incorporates the variance of \(\hat{x}_1\) (denoted \(p_1\), or generically \(p_t\)) and the variance of \(\hat{y}_1\) (see the denominator in equation (8) below):
\[
\operatorname{var}(\hat{x}_1) = \operatorname{var}(a\,x_0) + \operatorname{var}(w_0) = s_0^2 a^2 + q_0 = p_1 \qquad (7)
\]
\[
k_1 = \frac{m\,p_1}{p_1 m^2 + r_0} = \frac{m\,(s_0^2 a^2 + q_0)}{(s_0^2 a^2 + q_0)\,m^2 + r_0} \qquad (8)
\]
The cycle then starts over, with \(x_{1E}\) taking the place of \(x_0\) in equation (3) and used to forecast \(y_2\). The mean of
\(x_{1E}\) is the value of the expectation calculated in equation (6). Notice that the mean of \(x_{1E}\) incorporates the mean of \(x_0\), the variance of \(x_0\) (via the Kalman gain), the variance of the error in the measurement equation (via the Kalman gain), the variance of the error in the transition equation (via the Kalman gain), and the observed \(y_1\). The variance of \(x_{1E}\) is
\[
\operatorname{var}(x_{1E}) = p_1\,[1 - k_1 m] = p_1\left[1 - \frac{1}{1 + \dfrac{r_0}{(s_0^2 a^2 + q_0)\,m^2}}\right] \qquad (9)
\]
Notice that the variance of \(x_{1E}\) is reduced relative to the variance of \(\hat{x}_1\). Further, because the distributional aspects of each estimated value of \(x_t\) are known (assuming
a model), the model parameters within the measurement and transition equations can be optimized using maximum likelihood estimation. An iterative sequence of filtering followed by parameter estimation eventually optimizes the entire system. In general, the recursion is:

Predict the future unobserved variable (\(x\)) based on the current estimate of the unobserved variable:
\[
\hat{x}_{t+1} = E(a\,x_{tE} + w_t) \qquad (10)
\]
Use the predicted unobserved variable to predict the future observable variable (\(y\)):
\[
\hat{y}_{t+1} = E(m\,\hat{x}_{t+1} + v_t) \qquad (11)
\]
When the future observable variable actually occurs, calculate the error in the prediction:
\[
ye_{t+1} = E(y_{t+1} - \hat{y}_{t+1}) \qquad (12)
\]
Generate a better estimate of the unobserved variable at time \(t+1\) and start the process over for time \(t+2\):
\[
x_{(t+1)E} = E[\hat{x}_{t+1} + k_{t+1}\,ye_{t+1}] \qquad (13)
\]
Note: \(k_{t+1}\) is the Kalman gain and is based on the variances of the predicted variables in the first and second steps of the process:
\[
k_{t+1} = \frac{m\,\operatorname{var}(\hat{x}_{t+1})}{\operatorname{var}(\hat{y}_{t+1})} \qquad (14)
\]
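A sketch of these scalar recursions in the section's notation (m, a, q, r constant; mu0 and s0 are the assumed mean and standard deviation of x_0):

```python
import numpy as np

def kalman_filter_1d(y, m, a, q, r, mu0, s0):
    """Scalar Kalman filter for y_t = m*x_t + v_t, x_{t+1} = a*x_t + w_t.
    Returns filtered states x_E, prediction errors ye_t and their variances F_t."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    xE, pE = mu0, s0**2                  # moments of the current filtered state
    x_filt = np.empty(T)
    ye = np.empty(T)
    F = np.empty(T)
    for t in range(T):
        x_pred = a * xE                  # eq (3)/(10): one-step state prediction
        p_pred = a**2 * pE + q           # eq (7): its variance p_t
        ye[t] = y[t] - m * x_pred        # eq (5)/(12): prediction error
        F[t] = m**2 * p_pred + r         # variance of the prediction error
        k = m * p_pred / F[t]            # eq (8)/(14): Kalman gain
        xE = x_pred + k * ye[t]          # eq (6)/(13): updated (filtered) state
        pE = p_pred * (1 - k * m)        # eq (9): updated state variance
        x_filt[t] = xE
    return x_filt, ye, F
```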
APPLYING MAXIMUM LIKELIHOOD ESTIMATION
TO THE KALMAN FILTER
The Kalman filter provides output throughout the time series in the form of estimated values (e.g. \(x_{1E}\) from equation (6)), which are the means/expectations of the unobserved variables \(x_1,\dots,x_T\) for every time period \(t\), with associated variances provided by equation (9). (Note: \(x_0\) is also drawn from a distribution, with mean \(\mu_0\) and variance \(s_0^2\), and is part of the time series.) Keeping with the univariate structure from the previous section and assuming \(T\) time periods of data, a maximum likelihood estimation (MLE) is imposed by further assuming that \(x_0\),
\(w_1,\dots,w_T\), and \(v_1,\dots,v_T\) are jointly normal and uncorrelated. Consequently, a joint likelihood function exists:
\[
\left\{\frac{1}{\sqrt{2\pi s_0^2}}\, e^{-\frac{(x_0-\mu_0)^2}{2 s_0^2}}\right\}
\left\{\left[\frac{1}{\sqrt{2\pi q_t}}\right]^{T} e^{-\sum_{t=1}^{T}\frac{(x_t - E(x_t))^2}{2 q_t}}\right\}
\left\{\left[\frac{1}{\sqrt{2\pi r_t}}\right]^{T} e^{-\sum_{t=1}^{T}\frac{(y_t - E(y_t))^2}{2 r_t}}\right\} \qquad (15)
\]
Recall: \(x_0\) is distributed \(N(\mu_0, s_0^2)\), \(w_t\) is distributed \(N(0, q_t)\), and \(v_t\) is distributed \(N(0, r_t)\).

The problem with the likelihood function in equation (15) is that \(x_t\) is not observable. However, as noted above, the Kalman filter provides means and variances
for \(x_t\) throughout the time series. Consequently, remove the first two terms and define the mean of \(y_t\) as \(\hat{y}_t = m\,\hat{x}_t = a\,m\,x_{(t-1)E}\), define the variance of \(\hat{y}_t\) as \(p_t m^2 + r_{t-1}\), and keep the assumption that \(y_t\) is normally distributed. Notice that the variance definition (via \(p_t\); see the denominator in equation (8)) and the mean definition incorporate the distributional properties of \(x_t\) while dealing only with the observable \(y_t\). Further, the initial conditions are introduced when \(t\) equals one, through \(x_{0E} = \mu_0\) and through \(p_1\) (see equation (7)). The new likelihood function becomes
\[
\prod_{t=1}^{T}\left\{\frac{1}{\sqrt{2\pi\,(p_t m^2 + r_{t-1})}}\; e^{-\frac{(y_t - a\,m\,x_{(t-1)E})^2}{2\,(p_t m^2 + r_{t-1})}}\right\} \qquad (16)
\]
To simplify the math, take the natural logarithm of the likelihood function, creating the log-likelihood function:
\[
-\frac{T\ln(2\pi)}{2} - \frac{1}{2}\sum_{t=1}^{T}\ln\!\left[p_t m^2 + r_{t-1}\right] - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - a\,m\,x_{(t-1)E})^2}{p_t m^2 + r_{t-1}} \qquad (17)
\]
Further assume that \(r_t\) and \(q_t\) (contained in \(p_t\)) are constant through time. Consequently, equation (17) has a value, which is called the score. The next step is to maximize the log-likelihood function using the partial derivatives with respect to \(r\) (\(= r_t\) for all \(t\)), \(q\) (\(= q_t\) for all \(t\)), and \(a\). After solving for these three parameters by setting the partial derivatives to zero, the Kalman filter is re-estimated and the maximum likelihood procedure is applied again, until the score improves by
less than a particular value (say 0.0001), indicating convergence. The iterative use of the MLE procedure in this way is called the Expectation Maximization (EM) algorithm.
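As a sketch of this estimation step, the prediction errors and variances from the kalman_filter_1d sketch above give the log-likelihood of equation (17), which can then be maximized numerically; here a quasi-Newton search replaces the EM-style iteration described in the text, and mu0, s0 are treated as known:

```python
import numpy as np
from scipy.optimize import minimize

def kf_loglik(params, y, m, mu0, s0):
    """Equation (17): log-likelihood built from the filter's prediction errors ye_t
    and their variances F_t = p_t*m^2 + r (uses kalman_filter_1d defined above)."""
    a, q, r = params
    _, ye, F = kalman_filter_1d(y, m, a, q, r, mu0, s0)
    return -0.5 * np.sum(np.log(2 * np.pi * F) + ye**2 / F)

def fit_state_space(y, m=1.0, mu0=0.0, s0=1.0):
    """Maximize the log-likelihood over (a, q, r) with a quasi-Newton search,
    rather than the EM-style iteration described in the text."""
    obj = lambda p: -kf_loglik(p, y, m, mu0, s0)
    res = minimize(obj, x0=np.array([0.5, 1.0, 1.0]), method="L-BFGS-B",
                   bounds=[(-0.99, 0.99), (1e-6, None), (1e-6, None)])
    return res.x                          # estimated (a, q, r)
```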