    Chapter 1

    What Is Econometrics

    1.1 Data

    1.1.1 Accounting

Definition 1.1 Accounting data is routinely recorded data (that is, records of transactions) collected as part of market activities. Most of the data used in econometrics is accounting data. □

Accounting data has many shortcomings, since it is not collected specifically for the purposes of the econometrician. An econometrician would prefer data collected in experiments or gathered for her.

    1.1.2 Nonexperimental

The econometrician does not have any control over nonexperimental data. This causes various problems. Econometrics is used to deal with or solve these problems.

    1.1.3 Data Types

Definition 1.2 Time-series data are data that represent repeated observations of some variable in subsequent time periods. A time-series variable is often subscripted with the letter $t$. □

Definition 1.3 Cross-sectional data are data that represent a set of observations of some variable at one specific instant over several agents. A cross-sectional variable is often subscripted with the letter $i$. □

Definition 1.4 Time-series cross-sectional data are data that are both time-series and cross-sectional. □


A special case of time-series cross-sectional data is panel data. Panel data are observations of the same set of agents over time.

    1.1.4 Empirical Regularities

We need to look at the data to detect regularities. Often, we use stylized facts, but this can lead to over-simplifications.

    1.2 Models

Models are simplifications of the real world. The data we use in our model are what motivate the theory.

    1.2.1 Economic Theory

    By choosing assumptions, the

    1.2.2 Postulated Relationships

    Models can also be developed by postulating relationships among the variables.

    1.2.3 Equations

    Economic models are usually stated in terms of one or more equations.

1.2.4 Error Component

Because none of our models are exactly correct, we include an error component in our equations, usually denoted $u_i$. In econometrics, we usually assume that the error component is stochastic (that is, random).

It is important to note that the error component cannot be modeled by economic theory. We impose assumptions on $u_i$, and as econometricians, focus on $u_i$.

    1.3 Statistics

    1.3.1 Stochastic Component

    Some stuff here.

    1.3.2 Statistical Analysis

    Some stuff here.


    1.3.3 Tasks

    There are three main tasks of econometrics:

    1. estimating parameters;

    2. hypothesis testing;

    3. forecasting.

    Forecasting is perhaps the main reason for econometrics.

    1.3.4 Econometrics

Since our data is, in general, nonexperimental, econometrics makes use of economic theory to adjust for the lack of proper data.

    1.4 Interaction

    1.4.1 Scientific Method

Econometrics uses the scientific method, but the data are nonexperimental. In some sense, this is similar to astronomers, who gather data, but cannot conduct experiments (for example, astronomers predict the existence of black holes, but have never made one in a lab).

    1.4.2 Role Of Econometrics

    The three components of econometrics are:

    1. theory;

    2. statistics;

    3. data.

    These components are interdependent, and each helps shape the others.

1.4.3 Occam's Razor

Often in econometrics, we are faced with the problem of choosing one model over an alternative. The simplest model that fits the data is often the best choice.


    Chapter 2

    Some Useful Distributions

    2.1 Introduction

    2.1.1 Inferences

    Statistical Statements

As statisticians, we are often called upon to answer questions or make statements concerning certain random variables. For example: is a coin fair (i.e., is the probability of heads equal to 0.5), or what is the expected value of GNP for the quarter?

    Population Distribution

Typically, answering such questions requires knowledge of the distribution of the random variable. Unfortunately, we usually do not know this distribution (although we may have a strong idea, as in the case of the coin).

    Experimentation

In order to gain knowledge of the distribution, we draw several realizations of the random variable. The notion is that the observations in this sample contain information concerning the population distribution.

    Inference

Definition 2.1 The process by which we make statements concerning the population distribution based on the sample observations is called inference. □

Example 2.1 We decide whether a coin is fair by tossing it several times and observing whether it seems to be heads about half the time. □


    2.1.2 Random Samples

Suppose we draw $n$ observations of a random variable, denoted $\{x_1, x_2, \ldots, x_n\}$, and each $x_i$ is independent and has the same (marginal) distribution; then $\{x_1, x_2, \ldots, x_n\}$ constitute a simple random sample.

Example 2.2 We toss a coin three times. Supposedly, the outcomes are independent. If $x_i$ counts the number of heads for toss $i$, then we have a simple random sample. □

Note that not all samples are simple random samples.

Example 2.3 We are interested in the income level for the population in general. The $n$ observations available in this class are not identically distributed since the higher income individuals will tend to be more variable. □

Example 2.4 Consider the aggregate consumption level. The $n$ observations available in this set are not independent since a high consumption level in one period is usually followed by a high level in the next. □

    2.1.3 Sample Statistics

Definition 2.2 Any function of the observations in the sample which is the basis for inference is called a sample statistic. □

Example 2.5 In the coin tossing experiment, let $S$ count the total number of heads and $P = S/3$ count the sample proportion of heads. Both $S$ and $P$ are sample statistics. □

    2.1.4 Sample Distributions

A sample statistic is a random variable; its value will vary from one experiment to another. As a random variable, it is subject to a distribution.

Definition 2.3 The distribution of the sample statistic is the sample distribution of the statistic. □

Example 2.6 The statistic $S$ introduced above has a binomial sample distribution. Specifically, $\Pr(S = 0) = 1/8$, $\Pr(S = 1) = 3/8$, $\Pr(S = 2) = 3/8$, and $\Pr(S = 3) = 1/8$. □
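These probabilities are easy to check by brute force. The sketch below is an illustration only (it is not part of the original notes and assumes NumPy and SciPy are available): it simulates many three-toss experiments and compares the observed frequencies of $S$ with the binomial probabilities above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 100_000

# each row is one experiment: three fair coin tosses (1 = heads)
tosses = rng.integers(0, 2, size=(reps, 3))
S = tosses.sum(axis=1)                      # total number of heads per experiment

for k in range(4):
    empirical = np.mean(S == k)             # observed frequency of S = k
    exact = stats.binom.pmf(k, n=3, p=0.5)  # 1/8, 3/8, 3/8, 1/8
    print(f"Pr(S={k}): simulated {empirical:.4f}, exact {exact:.4f}")
```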


    2.2 Normality And The Sample Mean

    2.2.1 Sample Sum

Consider the simple random sample $\{x_1, x_2, \ldots, x_n\}$, where $x$ measures the height of an adult female. We will assume that $\mathrm{E}\,x_i = \mu$ and $\mathrm{Var}(x_i) = \sigma^2$, for all $i = 1, 2, \ldots, n$.

Let $S = x_1 + x_2 + \cdots + x_n$ denote the sample sum. Now,

$$\mathrm{E}\,S = \mathrm{E}(x_1 + x_2 + \cdots + x_n) = \mathrm{E}\,x_1 + \mathrm{E}\,x_2 + \cdots + \mathrm{E}\,x_n = n\mu. \qquad (2.1)$$

Also,

$$\begin{aligned}
\mathrm{Var}(S) &= \mathrm{E}(S - \mathrm{E}\,S)^2 = \mathrm{E}(S - n\mu)^2 = \mathrm{E}(x_1 + x_2 + \cdots + x_n - n\mu)^2 \\
&= \mathrm{E}\left[\sum_{i=1}^{n}(x_i - \mu)\right]^2 \\
&= \mathrm{E}[(x_1-\mu)^2 + (x_1-\mu)(x_2-\mu) + \cdots + (x_2-\mu)(x_1-\mu) + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2] \\
&= n\sigma^2. \qquad (2.2)
\end{aligned}$$

Note that $\mathrm{E}[(x_i-\mu)(x_j-\mu)] = 0$ for $i \neq j$ by independence.

    2.2.2 Sample Mean

Let $\bar{x} = S/n$ denote the sample mean or average.

    2.2.3 Moments Of The Sample Mean

    The mean of the sample mean is

$$\mathrm{E}\,\bar{x} = \mathrm{E}\,\frac{S}{n} = \frac{1}{n}\,\mathrm{E}\,S = \frac{1}{n}\,n\mu = \mu. \qquad (2.3)$$

    The variance of the sample mean is

$$\begin{aligned}
\mathrm{Var}(\bar{x}) &= \mathrm{E}(\bar{x} - \mu)^2 = \mathrm{E}\left(\frac{S}{n} - \mu\right)^2 = \mathrm{E}\left[\frac{1}{n}(S - n\mu)\right]^2 \\
&= \frac{1}{n^2}\,\mathrm{E}(S - n\mu)^2 = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}. \qquad (2.4)
\end{aligned}$$
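As an informal check of (2.3) and (2.4), the sketch below (illustrative only; the height distribution and parameter values are invented for the example, and NumPy is assumed) draws many samples and compares the simulated mean and variance of $\bar{x}$ with $\mu$ and $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 165.0, 7.0, 25, 50_000   # hypothetical heights in cm

# reps samples of size n; each row is one sample
x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)                          # sample mean of each sample

print("E[xbar]  :", xbar.mean(), "vs mu =", mu)
print("Var[xbar]:", xbar.var(), "vs sigma^2/n =", sigma**2 / n)
```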

    2.2.4 Sampling Distribution

We have been able to establish the mean and variance of the sample mean. However, in order to know its complete distribution precisely, we must know the probability density function (pdf) of the random variable $\bar{x}$.

    2.3 The Normal Distribution

2.3.1 Density Function

Definition 2.4 A continuous random variable $x_i$ with the density function

$$f(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2} \qquad (2.5)$$

follows the normal distribution, where $\mu$ and $\sigma^2$ are the mean and variance of $x_i$, respectively. □

Since the distribution is characterized by the two parameters $\mu$ and $\sigma^2$, we denote a normal random variable by $x_i \sim N(\mu, \sigma^2)$.

The normal density function is the familiar bell-shaped curve, as is shown in Figure 2.1 for $\mu = 0$ and $\sigma^2 = 1$. It is symmetric about the mean $\mu$. Approximately $2/3$ of the probability mass lies within $\sigma$ of $\mu$ and about $.95$ lies within $2\sigma$. There are numerous examples of random variables that have this shape. Many economic variables are assumed to be normally distributed.

    2.3.2 Linear Transformation

Consider the transformed random variable

$$Y_i = a + b x_i.$$

We know that

$$\mu_Y = \mathrm{E}\,Y_i = a + b\mu_x$$

and

$$\sigma^2_Y = \mathrm{E}(Y_i - \mu_Y)^2 = b^2\sigma^2_x.$$

If $x_i$ is normally distributed, then $Y_i$ is normally distributed as well. That is,

$$Y_i \sim N(\mu_Y, \sigma^2_Y).$$


    Figure 2.1: The Standard Normal Distribution

Moreover, if $x_i \sim N(\mu_x, \sigma^2_x)$ and $z_i \sim N(\mu_z, \sigma^2_z)$ are independent, then

$$Y_i = a + b x_i + c z_i \sim N(a + b\mu_x + c\mu_z,\; b^2\sigma^2_x + c^2\sigma^2_z).$$

These results will be formally demonstrated in a more general setting in the next chapter.

    2.3.3 Distribution Of The Sample Mean

If, for each $i = 1, 2, \ldots, n$, the $x_i$'s are independent, identically distributed (i.i.d.) normal random variables, then

$$\bar{x} \sim N\!\left(\mu_x, \frac{\sigma^2_x}{n}\right). \qquad (2.6)$$

    2.3.4 The Standard Normal

The distribution of $\bar{x}$ will vary with different values of $\mu_x$ and $\sigma^2_x$, which is inconvenient. Rather than dealing with a unique distribution for each case, we


    perform the following transformation:

$$Z = \frac{\bar{x} - \mu_x}{\sqrt{\sigma^2_x/n}} = \frac{\bar{x}}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}}. \qquad (2.7)$$

Now,

$$\mathrm{E}\,Z = \frac{\mathrm{E}\,\bar{x}}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}} = \frac{\mu_x}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}} = 0.$$

Also,

$$\mathrm{Var}(Z) = \mathrm{E}\left(\frac{\bar{x}}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}}\right)^2 = \mathrm{E}\left[\frac{n}{\sigma^2_x}(\bar{x} - \mu_x)^2\right] = \frac{n}{\sigma^2_x}\cdot\frac{\sigma^2_x}{n} = 1.$$

Thus $Z \sim N(0,1)$. The $N(0,1)$ distribution is the standard normal and is well tabulated. The probability density function for the standard normal distribution is

$$f(z_i) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z_i^2} = \phi(z_i). \qquad (2.8)$$

    2.4 The Central Limit Theorem

    2.4.1 Normal Theory

The normal density has a prominent position in statistics. This is not only because many random variables appear to be normal, but also because most any sample mean appears normal as the sample size increases.

Specifically, suppose $x_1, x_2, \ldots, x_n$ is a simple random sample with $\mathrm{E}\,x_i = \mu_x$ and $\mathrm{Var}(x_i) = \sigma^2_x$; then as $n \to \infty$, the distribution of $\bar{x}$ becomes normal. That is,

$$\lim_{n\to\infty} f\!\left(\frac{\bar{x} - \mu_x}{\sqrt{\sigma^2_x/n}}\right) = \phi\!\left(\frac{\bar{x} - \mu_x}{\sqrt{\sigma^2_x/n}}\right). \qquad (2.9)$$
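A quick way to see the theorem at work is to simulate. The sketch below is illustrative only (the exponential parent distribution and sample sizes are arbitrary choices, and NumPy/SciPy are assumed): it standardizes sample means from a decidedly non-normal distribution and compares a tail probability with the standard normal value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps = 100_000

for n in (2, 10, 100):
    # exponential(1) parent: mean 1, variance 1, heavily skewed
    x = rng.exponential(1.0, size=(reps, n))
    z = (x.mean(axis=1) - 1.0) / np.sqrt(1.0 / n)   # standardized sample mean
    print(f"n={n:4d}  Pr(Z <= 1.645) simulated {np.mean(z <= 1.645):.3f}"
          f"  vs normal {stats.norm.cdf(1.645):.3f}")
```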

2.5 Distributions Associated With The Normal Distribution

    2.5.1 The Chi-Squared Distribution

Definition 2.5 Suppose that $Z_1, Z_2, \ldots, Z_n$ is a simple random sample, and $Z_i \sim N(0,1)$. Then

$$\sum_{i=1}^{n} Z_i^2 \sim \chi^2_n, \qquad (2.10)$$

where $n$ are the degrees of freedom of the chi-squared distribution. □

The probability density function for the $\chi^2_n$ is

$$f_{\chi^2}(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\,x^{n/2-1}e^{-x/2}, \qquad x > 0, \qquad (2.11)$$

where $\Gamma(\cdot)$ is the gamma function. See Figure 2.2. If $x_1, x_2, \ldots, x_n$ is a simple random sample, and $x_i \sim N(\mu_x, \sigma^2_x)$, then

$$\sum_{i=1}^{n}\left(\frac{x_i - \mu_x}{\sigma_x}\right)^2 \sim \chi^2_n. \qquad (2.12)$$

The chi-squared distribution will prove useful in testing hypotheses on both the variance of a single variable and the (conditional) means of several. This multivariate usage will be explored in the next chapter.

Example 2.7 Consider the estimate of $\sigma^2$

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}.$$

Then

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}. \qquad (2.13)$$

□
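The claim in Example 2.7 is easy to check numerically. The following sketch is illustrative only (parameter values are arbitrary and NumPy/SciPy are assumed): it simulates the statistic $(n-1)s^2/\sigma^2$ and compares a few quantiles with those of $\chi^2_{n-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)               # unbiased sample variance
stat = (n - 1) * s2 / sigma**2           # should be chi-squared with n-1 df

for q in (0.5, 0.9, 0.99):
    print(f"q={q}: simulated {np.quantile(stat, q):.2f}, "
          f"chi2({n-1}) {stats.chi2.ppf(q, df=n-1):.2f}")
```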


    Figure 2.2: Some Chi-Squared Distributions

    2.5.2 The t Distribution

Definition 2.6 Suppose that $Z \sim N(0,1)$, $Y \sim \chi^2_k$, and that $Z$ and $Y$ are independent. Then

$$\frac{Z}{\sqrt{Y/k}} \sim t_k, \qquad (2.14)$$

where $k$ are the degrees of freedom of the $t$ distribution. □

The probability density function for a $t$ random variable with $n$ degrees of freedom is

$$f_t(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}, \qquad (2.15)$$

for $-\infty < x < \infty$. See Figure 2.3.

The $t$ (also known as Student's $t$) distribution is named after W. S. Gosset, who published under the pseudonym Student. It is useful in testing hypotheses concerning the (conditional) mean when the variance is estimated.

Figure 2.3: Some t Distributions

Example 2.8 Consider the sample mean from a simple random sample of normals. We know that $\bar{x} \sim N(\mu, \sigma^2/n)$ and

$$Z = \frac{\bar{x} - \mu}{\sqrt{\sigma^2/n}} \sim N(0,1).$$

Also, we know that

$$Y = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1},$$

where $s^2$ is the unbiased estimator of $\sigma^2$. Thus, if $Z$ and $Y$ are independent (which, in fact, is the case), then

$$\frac{Z}{\sqrt{Y/(n-1)}} = \frac{\dfrac{\bar{x}-\mu}{\sqrt{\sigma^2/n}}}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}} = (\bar{x}-\mu)\sqrt{\frac{n\sigma^2}{s^2\sigma^2}} = \frac{\bar{x}-\mu}{\sqrt{s^2/n}} \sim t_{n-1}. \qquad (2.16)$$

□
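To see (2.16) in action, the sketch below (illustrative only; the sample size and parameters are arbitrary, and NumPy/SciPy are assumed) simulates the studentized mean and compares its tail behavior with $t_{n-1}$ and with the standard normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n, reps = 5.0, 3.0, 6, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
t = (x.mean(axis=1) - mu) / np.sqrt(x.var(axis=1, ddof=1) / n)

crit = 2.0
print("Pr(|t| > 2) simulated:", np.mean(np.abs(t) > crit))
print("t_{n-1} tail          :", 2 * stats.t.sf(crit, df=n - 1))
print("normal tail           :", 2 * stats.norm.sf(crit))   # too small for n = 6
```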


    2.5.3 The F Distribution

Definition 2.7 Suppose that $Y \sim \chi^2_m$, $W \sim \chi^2_n$, and that $Y$ and $W$ are independent. Then

$$\frac{Y/m}{W/n} \sim F_{m,n}, \qquad (2.17)$$

where $m, n$ are the degrees of freedom of the $F$ distribution. □

The probability density function for an $F$ random variable with $m$ and $n$ degrees of freedom is

$$f_F(x) = \frac{\Gamma\!\left(\frac{m+n}{2}\right)(m/n)^{m/2}}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)}\,\frac{x^{(m/2)-1}}{(1 + mx/n)^{(m+n)/2}}. \qquad (2.18)$$

The $F$ distribution is named after the great statistician Sir Ronald A. Fisher, and is used in many applications, most notably in the analysis of variance. This situation will arise when we seek to test multiple (conditional) mean parameters with estimated variance. Note that when $x \sim t_n$ then $x^2 \sim F_{1,n}$. Some examples of the $F$ distribution can be seen in Figure 2.4.

    Figure 2.4: Some F Distributions
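The relation $x \sim t_n \Rightarrow x^2 \sim F_{1,n}$ noted above can be verified directly from tabulated quantiles. A minimal sketch (SciPy assumed; the degrees of freedom and probability level are arbitrary):

```python
from scipy import stats

n = 12
# the square of the two-sided t critical value equals the one-sided F critical value
t_crit = stats.t.ppf(0.975, df=n)          # 97.5% point of t_n
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n)   # 95% point of F_{1,n}
print(t_crit**2, f_crit)                   # the two numbers agree
```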


    Chapter 3

    Multivariate Distributions

    3.1 Matrix Algebra Of Expectations

    3.1.1 Moments of Random Vectors

Let

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$

be an $m \times 1$ vector-valued random variable. Each element of the vector is a scalar random variable of the type discussed in the previous chapter.

The expectation of a random vector is

$$\mathrm{E}[x] = \begin{pmatrix} \mathrm{E}[x_1] \\ \mathrm{E}[x_2] \\ \vdots \\ \mathrm{E}[x_m] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} = \mu. \qquad (3.1)$$

Note that $\mu$ is also an $m \times 1$ column vector. We see that the mean of the vector is the vector of the means.

Next, we evaluate the following:

$$\mathrm{E}[(x - \mu)(x - \mu)'] = \mathrm{E}\begin{pmatrix}
(x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) & \cdots & (x_1-\mu_1)(x_m-\mu_m) \\
(x_2-\mu_2)(x_1-\mu_1) & (x_2-\mu_2)^2 & \cdots & (x_2-\mu_2)(x_m-\mu_m) \\
\vdots & \vdots & \ddots & \vdots \\
(x_m-\mu_m)(x_1-\mu_1) & (x_m-\mu_m)(x_2-\mu_2) & \cdots & (x_m-\mu_m)^2
\end{pmatrix}$$

$$= \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm}
\end{pmatrix} = \Sigma. \qquad (3.2)$$

$\Sigma$, the covariance matrix, is an $m \times m$ matrix of variance and covariance terms. The variance $\sigma^2_i = \sigma_{ii}$ of $x_i$ is along the diagonal, while the cross-product terms represent the covariance between $x_i$ and $x_j$.
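In practice $\Sigma$ is estimated by replacing the expectations in (3.1) and (3.2) with sample averages. A minimal sketch (illustrative only, NumPy assumed; the data are simulated just to have something to average):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1_000, 3
X = rng.normal(size=(n, m))            # n observations of an m-vector

mu_hat = X.mean(axis=0)                # vector of means, as in (3.1)
D = X - mu_hat                         # deviations from the mean
Sigma_hat = D.T @ D / n                # average of outer products, as in (3.2)

# np.cov with bias=True uses the same 1/n normalization
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))
```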

    3.1.2 Properties Of The Covariance Matrix

    Symmetric

The variance-covariance matrix $\Sigma$ is a symmetric matrix. This can be shown by noting that

$$\sigma_{ij} = \mathrm{E}(x_i - \mu_i)(x_j - \mu_j) = \mathrm{E}(x_j - \mu_j)(x_i - \mu_i) = \sigma_{ji}.$$

Due to this symmetry, $\Sigma$ will only have $m(m+1)/2$ unique elements.

    Positive Semidefinite

$\Sigma$ is a positive semidefinite matrix. Recall that any $m \times m$ matrix $\Sigma$ is positive semidefinite if and only if it meets any of the following three equivalent conditions:

1. All the principal minors are nonnegative;

2. $\lambda'\Sigma\lambda \geq 0$, for all $m \times 1$ vectors $\lambda \neq 0$;

3. $\Sigma = \mathrm{P}\mathrm{P}'$, for some $m \times m$ matrix $\mathrm{P}$.

The first condition (actually we use negative definiteness) is useful in the study of utility maximization, while the latter two are useful in econometric analysis.

The second condition is the easiest to demonstrate in the current context. Let $\lambda \neq 0$. Then, we have

$$\lambda'\Sigma\lambda = \lambda'\,\mathrm{E}[(x-\mu)(x-\mu)']\,\lambda = \mathrm{E}[\lambda'(x-\mu)(x-\mu)'\lambda] = \mathrm{E}\{[\lambda'(x-\mu)]^2\} \geq 0,$$


since the term inside the expectation is a quadratic. Hence, $\Sigma$ is a positive semidefinite matrix.

Note that $\mathrm{P}$ satisfying the third relationship is not unique. Let $\mathrm{D}$ be any $m \times m$ orthonormal matrix; then $\mathrm{D}\mathrm{D}' = I_m$ and $\mathrm{P}^* = \mathrm{P}\mathrm{D}$ yields $\mathrm{P}^*\mathrm{P}^{*\prime} = \mathrm{P}\mathrm{D}\mathrm{D}'\mathrm{P}' = \mathrm{P}I_m\mathrm{P}' = \Sigma$. Usually, we will choose $\mathrm{P}$ to be an upper or lower triangular matrix with $m(m+1)/2$ nonzero elements.

    Positive Definite

Since $\Sigma$ is a positive semidefinite matrix, it will be a positive definite matrix if and only if $\det(\Sigma) \neq 0$. Now, we know that $\Sigma = \mathrm{P}\mathrm{P}'$ for some $m \times m$ matrix $\mathrm{P}$. This implies that $\det(\mathrm{P}) \neq 0$.
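The triangular choice of $\mathrm{P}$ mentioned above is exactly what a Cholesky factorization produces. A minimal sketch (NumPy assumed; the matrix is an arbitrary positive definite example, not taken from the notes):

```python
import numpy as np

Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])      # a positive definite covariance matrix

P = np.linalg.cholesky(Sigma)            # lower triangular, m(m+1)/2 nonzero elements
print(np.allclose(P @ P.T, Sigma))       # Sigma = P P'
print(np.linalg.det(P) != 0)             # det(P) nonzero, so Sigma is positive definite
```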

    3.1.3 Linear Transformations

Let $y = b + \mathrm{B}x$, where $y$ and $b$ are $m \times 1$ and $\mathrm{B}$ is $m \times m$. Then

$$\mathrm{E}[y] = b + \mathrm{B}\,\mathrm{E}[x] = b + \mathrm{B}\mu = \mu_y. \qquad (3.3)$$

Thus, the mean of a linear transformation is the linear transformation of the mean.

Next, we have

$$\mathrm{E}[(y - \mu_y)(y - \mu_y)'] = \mathrm{E}\{[\mathrm{B}(x-\mu)][\mathrm{B}(x-\mu)]'\} = \mathrm{B}\,\mathrm{E}[(x-\mu)(x-\mu)']\,\mathrm{B}' = \mathrm{B}\Sigma\mathrm{B}' \qquad (3.4)$$
$$= \Sigma_y, \qquad (3.5)$$

where we use the result $(\mathrm{A}\mathrm{B}\mathrm{C})' = \mathrm{C}'\mathrm{B}'\mathrm{A}'$, if conformability holds.
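A quick numerical check of (3.3)-(3.5) (illustrative only; $b$, $\mathrm{B}$, $\mu$, and $\Sigma$ are arbitrary values chosen for the example, and NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
b = np.array([0.5, 0.5])
B = np.array([[1.0, 2.0], [0.0, 3.0]])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = b + x @ B.T                                    # y_i = b + B x_i for each draw

print(y.mean(axis=0), "vs", b + B @ mu)                            # (3.3)
print(np.cov(y, rowvar=False, bias=True), "vs", B @ Sigma @ B.T)   # (3.4)-(3.5)
```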

    3.2 Change Of Variables

    3.2.1 Univariate

Let $x$ be a random variable and $f_x(\cdot)$ be the probability density function of $x$. Now, define $y = h(x)$, where

$$h'(x) = \frac{d\,h(x)}{d\,x} > 0.$$


That is, $h(x)$ is a strictly monotonically increasing function and so $y$ is a one-to-one transformation of $x$. Now, we would like to know the probability density function of $y$, $f_y(y)$. To find it, we note that

$$\Pr(y \leq h(a)) = \Pr(x \leq a), \qquad (3.6)$$

$$\Pr(x \leq a) = \int_{-\infty}^{a} f_x(x)\,dx = F_x(a), \qquad (3.7)$$

and,

$$\Pr(y \leq h(a)) = \int_{-\infty}^{h(a)} f_y(y)\,dy = F_y(h(a)), \qquad (3.8)$$

for all $a$. Assuming that the cumulative distribution function is differentiable, we use (3.6) to combine (3.7) and (3.8), and take the total differential, which gives us

$$dF_x(a) = dF_y(h(a))$$
$$f_x(a)\,da = f_y(h(a))\,h'(a)\,da$$

for all $a$. Thus, for a small perturbation,

$$f_x(a) = f_y(h(a))\,h'(a) \qquad (3.9)$$

for all $a$. Also, since $y$ is a one-to-one transformation of $x$, we know that $h(\cdot)$ can be inverted. That is, $x = h^{-1}(y)$. Thus, $a = h^{-1}(y)$, and we can rewrite (3.9) as

$$f_x(h^{-1}(y)) = f_y(y)\,h'(h^{-1}(y)).$$

Therefore, the probability density function of $y$ is

$$f_y(y) = f_x(h^{-1}(y))\,\frac{1}{h'(h^{-1}(y))}. \qquad (3.10)$$

Note that $f_y(y)$ has the property of being nonnegative, since $h'(\cdot) > 0$. If $h'(\cdot) < 0$, (3.10) can be corrected by taking the absolute value of $h'(\cdot)$, which will assure that we have only positive values for our probability density function.
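As a concrete instance of (3.10), take $x$ standard normal and $h(x) = e^x$, so that $y$ is lognormal. The sketch below (illustrative only; NumPy/SciPy assumed, and the grid of $y$ values is arbitrary) evaluates the right-hand side of (3.10) and compares it with SciPy's lognormal density.

```python
import numpy as np
from scipy import stats

# x ~ N(0,1), y = h(x) = exp(x), so h^{-1}(y) = log(y) and h'(x) = exp(x) > 0
y = np.linspace(0.1, 5.0, 6)
f_y = stats.norm.pdf(np.log(y)) / np.exp(np.log(y))   # f_x(h^{-1}(y)) / h'(h^{-1}(y))

# the standard lognormal density, for comparison
print(np.allclose(f_y, stats.lognorm.pdf(y, s=1.0)))
```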

    3.2.2 Geometric Interpretation

Consider the graph of the relationship shown in Figure 3.1. We know that

$$\Pr[h(b) > y > h(a)] = \Pr(b > x > a).$$

Also, we know that

$$\Pr[h(b) > y > h(a)] \simeq f_y[h(b)]\,[h(b) - h(a)],$$


Figure 3.1: Change of Variables (axes $x$ and $y$; curve $y = h(x)$; points $a$, $b$ on the horizontal axis with images $h(a)$, $h(b)$)

and

$$\Pr(b > x > a) \simeq f_x(b)\,(b - a).$$

So,

$$f_y[h(b)]\,[h(b) - h(a)] \simeq f_x(b)\,(b - a)$$
$$f_y[h(b)] \simeq f_x(b)\,\frac{1}{[h(b) - h(a)]/(b - a)}. \qquad (3.11)$$

Now, as we let $a \to b$, the denominator of (3.11) approaches $h'(b)$. This is then the same formula as (3.10).

    3.2.3 Multivariate

Let the $m \times 1$ vector $x \sim f_x(x)$. Define a one-to-one transformation

$$y = h(x),$$

where $y$ and $h(\cdot)$ are also $m \times 1$. Since $h(\cdot)$ is a one-to-one transformation, it has an inverse:

$$x = h^{-1}(y).$$


We also assume that $\partial h(x)/\partial x'$ exists. This is the $m \times m$ Jacobian matrix, where

$$\frac{\partial h(x)}{\partial x'} = \frac{\partial \begin{pmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_m(x) \end{pmatrix}}{\partial\,(x_1\ x_2\ \cdots\ x_m)} = \begin{pmatrix}
\frac{\partial h_1(x)}{\partial x_1} & \frac{\partial h_2(x)}{\partial x_1} & \cdots & \frac{\partial h_m(x)}{\partial x_1} \\
\frac{\partial h_1(x)}{\partial x_2} & \frac{\partial h_2(x)}{\partial x_2} & \cdots & \frac{\partial h_m(x)}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial h_1(x)}{\partial x_m} & \frac{\partial h_2(x)}{\partial x_m} & \cdots & \frac{\partial h_m(x)}{\partial x_m}
\end{pmatrix} = J_x(x). \qquad (3.12)$$

Given this notation, the multivariate analog to (3.11) can be shown to be

$$f_y(y) = f_x[h^{-1}(y)]\,\frac{1}{|\det(J_x[h^{-1}(y)])|}. \qquad (3.13)$$

Since $h(\cdot)$ is differentiable and one-to-one, then $\det(J_x[h^{-1}(y)]) \neq 0$.

Example 3.1 Let $y = b_0 + b_1 x$, where $x$, $b_0$, and $b_1$ are scalars. Then

$$x = \frac{y - b_0}{b_1}$$

and

$$\frac{dy}{dx} = b_1.$$

Therefore,

$$f_y(y) = f_x\!\left(\frac{y - b_0}{b_1}\right)\frac{1}{|b_1|}. \qquad \square$$

Example 3.2 Let $y = b + \mathrm{B}x$, where $y$ is an $m \times 1$ vector and $\det(\mathrm{B}) \neq 0$. Then

$$x = \mathrm{B}^{-1}(y - b)$$

and

$$\frac{\partial y}{\partial x'} = \mathrm{B} = J_x(x).$$

Thus,

$$f_y(y) = f_x\!\left(\mathrm{B}^{-1}(y - b)\right)\frac{1}{|\det(\mathrm{B})|}. \qquad \square$$


    3.3 Multivariate Normal Distribution

    3.3.1 Spherical Normal Distribution

Definition 3.1 An $m \times 1$ random vector $z$ is said to be spherically normally distributed if

$$f(z) = \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}z'z}. \qquad \square$$

Such a random vector can be seen to be a vector of independent standard normals. Let $z_1, z_2, \ldots, z_m$ be i.i.d. random variables such that $z_i \sim N(0,1)$. That is, $z_i$ has the pdf given in (2.8), for $i = 1, \ldots, m$. Then, by independence, the joint distribution of the $z_i$'s is given by

$$\begin{aligned}
f(z_1, z_2, \ldots, z_m) &= f(z_1)f(z_2)\cdots f(z_m) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z_i^2} \\
&= \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}\sum_{i=1}^{m}z_i^2} = \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}z'z}, \qquad (3.14)
\end{aligned}$$

where $z' = (z_1\ z_2\ \ldots\ z_m)$.

    3.3.2 Multivariate Normal

Definition 3.2 The $m \times 1$ random vector $x$ with density

$$f_x(x) = \frac{1}{(2\pi)^{m/2}[\det(\Sigma)]^{1/2}}\,e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.15)$$

is said to be distributed multivariate normal with mean vector $\mu$ and positive definite covariance matrix $\Sigma$. □

Such a distribution for $x$ is denoted by $x \sim N(\mu, \Sigma)$. The spherical normal distribution is seen to be a special case where $\mu = 0$ and $\Sigma = I_m$.

There is a one-to-one relationship between the multivariate normal random vector and a spherical normal random vector. Let $z$ be an $m \times 1$ spherical normal random vector and

$$x = \mu + \mathrm{A}z,$$

where $x$ is $m \times 1$, $z$ is defined above, and $\det(\mathrm{A}) \neq 0$. Then,

$$\mathrm{E}\,x = \mu + \mathrm{A}\,\mathrm{E}\,z = \mu, \qquad (3.16)$$


since $\mathrm{E}[z] = 0$.

Also, we know that

$$\mathrm{E}(zz') = \mathrm{E}\begin{pmatrix}
z_1^2 & z_1z_2 & \cdots & z_1z_m \\
z_2z_1 & z_2^2 & \cdots & z_2z_m \\
\vdots & \vdots & \ddots & \vdots \\
z_mz_1 & z_mz_2 & \cdots & z_m^2
\end{pmatrix} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix} = I_m, \qquad (3.17)$$

since $\mathrm{E}[z_iz_j] = 0$, for all $i \neq j$, and $\mathrm{E}[z_i^2] = 1$, for all $i$. Therefore,

$$\mathrm{E}[(x-\mu)(x-\mu)'] = \mathrm{E}(\mathrm{A}zz'\mathrm{A}') = \mathrm{A}\,\mathrm{E}(zz')\,\mathrm{A}' = \mathrm{A}I_m\mathrm{A}' = \mathrm{A}\mathrm{A}' = \Sigma, \qquad (3.18)$$

where $\Sigma$ is a positive definite matrix (since $\det(\mathrm{A}) \neq 0$).

Next, we need to find the probability density function $f_x(x)$ of $x$. We know that

$$z = \mathrm{A}^{-1}(x - \mu), \qquad z' = (x - \mu)'\mathrm{A}^{-1\prime},$$

and

$$J_z(z) = \mathrm{A},$$

so we use (3.13) to get

$$\begin{aligned}
f_x(x) &= f_z[\mathrm{A}^{-1}(x-\mu)]\,\frac{1}{|\det(\mathrm{A})|} \\
&= \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}(x-\mu)'\mathrm{A}^{-1\prime}\mathrm{A}^{-1}(x-\mu)}\,\frac{1}{|\det(\mathrm{A})|} \\
&= \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}(x-\mu)'(\mathrm{A}\mathrm{A}')^{-1}(x-\mu)}\,\frac{1}{|\det(\mathrm{A})|}, \qquad (3.19)
\end{aligned}$$

where we use the results $(\mathrm{A}\mathrm{B}\mathrm{C})^{-1} = \mathrm{C}^{-1}\mathrm{B}^{-1}\mathrm{A}^{-1}$ and $\mathrm{A}'^{-1} = \mathrm{A}^{-1\prime}$. However, $\Sigma = \mathrm{A}\mathrm{A}'$, so $\det(\Sigma) = \det(\mathrm{A})\det(\mathrm{A}')$, and $|\det(\mathrm{A})| = [\det(\Sigma)]^{1/2}$. Thus we can rewrite (3.19) as

$$f_x(x) = \frac{1}{(2\pi)^{m/2}[\det(\Sigma)]^{1/2}}\,e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.20)$$

and we see that $x \sim N(\mu, \Sigma)$ with mean vector $\mu$ and covariance matrix $\Sigma$. Since this process is completely reversible, the relationship is one-to-one.
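The relationship $x = \mu + \mathrm{A}z$ is also how multivariate normal draws are produced in practice, with $\mathrm{A}$ taken to be the Cholesky factor of $\Sigma$. A minimal sketch (illustrative values, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

A = np.linalg.cholesky(Sigma)            # Sigma = A A'
z = rng.standard_normal((100_000, 2))    # spherical normal draws
x = mu + z @ A.T                         # x = mu + A z, so x ~ N(mu, Sigma)

print(x.mean(axis=0))                            # close to mu, as in (3.16)
print(np.cov(x, rowvar=False, bias=True))        # close to Sigma, as in (3.18)
```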


    3.4 Normality and the Sample Mean

    3.4.1 Moments of the Sample Mean

Consider the $m \times 1$ vectors $x_i$ i.i.d. jointly, with $m \times 1$ mean vector $\mathrm{E}[x_i] = \mu$ and $m \times m$ covariance matrix $\mathrm{E}[(x_i - \mu)(x_i - \mu)'] = \Sigma$. Define $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n}x_i$ as the vector sample mean, which is also the vector of scalar sample means. The mean of the vector sample mean follows directly:

$$\mathrm{E}[\bar{x}_n] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}[x_i] = \mu.$$

Alternatively, this result can be obtained by applying the scalar results element by element. The second moment matrix of the vector sample mean is given by

$$\begin{aligned}
\mathrm{E}[(\bar{x}_n - \mu)(\bar{x}_n - \mu)'] &= \frac{1}{n^2}\,\mathrm{E}\!\left[\left(\sum_{i=1}^{n}x_i - n\mu\right)\left(\sum_{i=1}^{n}x_i - n\mu\right)'\right] \\
&= \frac{1}{n^2}\,\mathrm{E}[\{(x_1-\mu) + (x_2-\mu) + \cdots + (x_n-\mu)\}\{(x_1-\mu) + (x_2-\mu) + \cdots + (x_n-\mu)\}'] \\
&= \frac{1}{n^2}\,n\Sigma = \frac{1}{n}\Sigma,
\end{aligned}$$

since the covariances between different observations are zero.

    3.4.2 Distribution of the Sample Mean

Suppose $x_i$ i.i.d. $N(\mu, \Sigma)$ jointly. Then it follows from joint multivariate normality that $\bar{x}_n$ must also be multivariate normal since it is a linear transformation. Specifically, we have

$$\bar{x}_n \sim N\!\left(\mu, \tfrac{1}{n}\Sigma\right)$$

or equivalently

$$\bar{x}_n - \mu \sim N\!\left(0, \tfrac{1}{n}\Sigma\right)$$
$$\sqrt{n}(\bar{x}_n - \mu) \sim N(0, \Sigma)$$
$$\sqrt{n}\,\Sigma^{-1/2}(\bar{x}_n - \mu) \sim N(0, I_m),$$

where $\Sigma = \Sigma^{1/2}\Sigma^{1/2\prime}$ and $\Sigma^{-1/2} = (\Sigma^{1/2})^{-1}$.

3.4.3 Multivariate Central Limit Theorem

Theorem 3.3 Suppose that (i) $x_i$ i.i.d. jointly, (ii) $\mathrm{E}[x_i] = \mu$, and (iii) $\mathrm{E}[(x_i - \mu)(x_i - \mu)'] = \Sigma$; then

$$\sqrt{n}(\bar{x}_n - \mu) \overset{d}{\to} N(0, \Sigma)$$


or equivalently

$$z = \sqrt{n}\,\Sigma^{-1/2}(\bar{x}_n - \mu) \overset{d}{\to} N(0, I_m). \qquad \square$$

These results apply even if the original underlying distribution is not normal and follow directly from the scalar results applied to any linear combination of $\bar{x}_n$.

    3.4.4 Limiting Behavior of Quadratic Forms

Consider the following quadratic form:

$$\begin{aligned}
n\,(\bar{x}_n - \mu)'\Sigma^{-1}(\bar{x}_n - \mu) &= n\,(\bar{x}_n - \mu)'(\Sigma^{1/2}\Sigma^{1/2\prime})^{-1}(\bar{x}_n - \mu) \\
&= [\sqrt{n}\,\Sigma^{-1/2}(\bar{x}_n - \mu)]'[\sqrt{n}\,\Sigma^{-1/2}(\bar{x}_n - \mu)] \\
&= z'z \overset{d}{\to} \chi^2_m.
\end{aligned}$$

This form is convenient for asymptotic joint tests concerning more than one mean at a time.
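The sketch below (illustrative only; the dimension, mean vector, and covariance matrix are arbitrary, and NumPy/SciPy are assumed) simulates this quadratic form for the sample mean of multivariate normal data and compares its quantiles with those of $\chi^2_m$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.1],
                  [0.2, 0.1, 1.5]])
m, n, reps = 3, 50, 20_000
Sinv = np.linalg.inv(Sigma)

Q = np.empty(reps)
for r in range(reps):
    xbar = rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
    d = xbar - mu
    Q[r] = n * d @ Sinv @ d              # n (xbar - mu)' Sigma^{-1} (xbar - mu)

for q in (0.5, 0.95):
    print(f"q={q}: simulated {np.quantile(Q, q):.2f}, chi2({m}) {stats.chi2.ppf(q, m):.2f}")
```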

    3.5 Noncentral Distributions

    3.5.1 Noncentral Scalar Normal

Definition 3.3 Let $x \sim N(\mu, \sigma^2)$. Then,

$$z = \frac{x}{\sigma} \sim N(\mu/\sigma, 1) \qquad (3.23)$$

has a noncentral normal distribution. □

Example 3.3 When we do a hypothesis test of the mean, with known variance, we have, under the null hypothesis $H_0: \mu = \mu_0$,

$$\frac{x - \mu_0}{\sigma} \sim N(0, 1) \qquad (3.24)$$

and, under the alternative $H_1: \mu = \mu_1 \neq \mu_0$,

$$\frac{x - \mu_0}{\sigma} = \frac{x - \mu_1}{\sigma} + \frac{\mu_1 - \mu_0}{\sigma} = N(0,1) + \frac{\mu_1 - \mu_0}{\sigma} \sim N\!\left(\frac{\mu_1 - \mu_0}{\sigma}, 1\right). \qquad (3.25)$$

Thus, the behavior of $(x - \mu_0)/\sigma$ under the alternative hypothesis follows a noncentral normal distribution. □

As this example makes clear, the noncentral normal distribution is especially useful when carefully exploring the behavior of a statistic under the alternative hypothesis.


    3.5.2 Noncentral t

Definition 3.4 Let $z \sim N(\mu/\sigma, 1)$, $w \sim \chi^2_k$, and let $z$ and $w$ be independent. Then

$$\frac{z}{\sqrt{w/k}} \sim t_k(\mu/\sigma) \qquad (3.26)$$

has a noncentral $t$ distribution. □

The noncentral $t$ distribution is used in tests of the mean, when the variance is unknown.

    3.5.3 Noncentral Chi-Squared

Definition 3.5 Let $z \sim N(\mu, I_m)$. Then

$$z'z \sim \chi^2_m(\lambda) \qquad (3.27)$$

has a noncentral chi-squared distribution, where $\lambda = \mu'\mu$ is the noncentrality parameter. □

In the noncentral chi-squared distribution, the probability mass is shifted to the right as compared to a regular chi-squared distribution.

Example 3.4 When we do a test of $\mu$, with known $\Sigma$, we have, under $H_0: \mu = \mu_0$,

$$(x - \mu_0)'\Sigma^{-1}(x - \mu_0) \sim \chi^2_m, \qquad (3.28)$$

while under $H_1: \mu = \mu_1 \neq \mu_0$, let $z = \mathrm{P}^{-1}(x - \mu_0)$, where $\Sigma = \mathrm{P}\mathrm{P}'$. Then, we have

$$(x - \mu_0)'\Sigma^{-1}(x - \mu_0) = (x - \mu_0)'\mathrm{P}^{-1\prime}\mathrm{P}^{-1}(x - \mu_0) = z'z \sim \chi^2_m[(\mu_1 - \mu_0)'\Sigma^{-1}(\mu_1 - \mu_0)]. \qquad (3.29)$$

□

    3.5.4 Noncentral F

Definition 3.6 Let $Y \sim \chi^2_m(\lambda)$, $W \sim \chi^2_n$, and let $Y$ and $W$ be independent random variables. Then

$$\frac{Y/m}{W/n} \sim F_{m,n}(\lambda) \qquad (3.30)$$

has a noncentral $F$ distribution, where $\lambda$ is the noncentrality parameter. □

The noncentral $F$ distribution is used in tests of mean vectors, where the variance-covariance matrix is unknown and must be estimated.


    Chapter 4

    Asymptotic Theory

    4.1 Convergence Of Random Variables

    4.1.1 Limits And Orders Of Sequences

Definition 4.1 A sequence of real numbers $a_1, a_2, \ldots, a_n, \ldots$ is said to have a limit of $\alpha$ if for every $\epsilon > 0$, there exists a positive real number $N$ such that for all $n > N$, $|a_n - \alpha| < \epsilon$. This is denoted as

$$\lim_{n\to\infty} a_n = \alpha. \qquad \square$$

Definition 4.2 A sequence of real numbers $\{a_n\}$ is said to be of at most order $n^k$, and we write $\{a_n\}$ is $O(n^k)$, if

$$\lim_{n\to\infty} \frac{1}{n^k}\,a_n = c,$$

where $c$ is any real constant. □

Example 4.1 Let $\{a_n\} = 3 + 1/n$, and $\{b_n\} = 4 - n^2$. Then $\{a_n\}$ is $O(1) = O(n^0)$, since

$$\lim_{n\to\infty} \frac{1}{n^0}\,a_n = 3,$$

and $\{b_n\}$ is $O(n^2)$, since

$$\lim_{n\to\infty} \frac{1}{n^2}\,b_n = -1. \qquad \square$$

Definition 4.3 A sequence of real numbers $\{a_n\}$ is said to be of order smaller than $n^k$, and we write $\{a_n\}$ is $o(n^k)$, if

$$\lim_{n\to\infty} \frac{1}{n^k}\,a_n = 0. \qquad \square$$


Example 4.2 Let $\{a_n\} = 1/n$. Then $\{a_n\}$ is $o(1)$, since

$$\lim_{n\to\infty} \frac{1}{n^0}\,a_n = 0. \qquad \square$$

    4.1.2 Convergence In Probability

Definition 4.4 A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$, with distribution functions $F_1(\cdot), F_2(\cdot), \ldots, F_n(\cdot), \ldots$, is said to converge weakly in probability to some constant $c$ if

$$\lim_{n\to\infty} \Pr[\,|y_n - c| > \epsilon\,] = 0 \qquad (4.1)$$

for every real number $\epsilon > 0$. □

Weak convergence in probability is denoted by

$$\operatorname{plim}_{n\to\infty} y_n = c, \qquad (4.2)$$

or sometimes,

$$y_n \overset{p}{\to} c. \qquad (4.3)$$

This definition is equivalent to saying that we have a sequence of tail probabilities (of being greater than $c + \epsilon$ or less than $c - \epsilon$), and that the tail probabilities approach 0 as $n \to \infty$, regardless of how small $\epsilon$ is chosen. Equivalently, the probability mass of the distribution of $y_n$ is collapsing about the point $c$.

Definition 4.5 A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$ is said to converge strongly in probability to some constant $c$ if

$$\lim_{N\to\infty} \Pr[\,\sup_{n>N}|y_n - c| > \epsilon\,] = 0, \qquad (4.4)$$

for any real $\epsilon > 0$. □

Strong convergence is also called almost sure convergence and is denoted

$$y_n \overset{a.s.}{\to} c. \qquad (4.5)$$

Notice that if a sequence of random variables converges strongly in probability, it converges weakly in probability. The difference between the two is that almost sure convergence involves an element of uniformity that weak convergence does not. A sequence that is weakly convergent can have $\Pr[\,|y_n - c| > \epsilon\,]$ wiggle-waggle above and below the constant used in the limit and then settle down to subsequently be smaller and meet the condition. For strong convergence, once the probability falls below that constant for a particular $N$ in the sequence, it will subsequently be smaller for all larger $N$.

Definition 4.6 A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$ is said to converge in quadratic mean if

$$\lim_{n\to\infty} \mathrm{E}[y_n] = c$$

and

$$\lim_{n\to\infty} \mathrm{Var}[y_n] = 0. \qquad \square$$

By Chebyshev's inequality, convergence in quadratic mean implies weak convergence in probability. For a random variable $x$ with mean $\mu$ and variance $\sigma^2$, Chebyshev's inequality states $\Pr(|x - \mu| \geq k\sigma) \leq \frac{1}{k^2}$. Let $\sigma^2_n$ denote the variance of $y_n$; then we can write the condition for the present case as $\Pr(|y_n - \mathrm{E}[y_n]| \geq k\sigma_n) \leq \frac{1}{k^2}$. Since $\sigma^2_n \to 0$ and $\mathrm{E}[y_n] \to c$, the probability will be less than $\frac{1}{k^2}$ for sufficiently large $n$ for any choice of $k$. But this is just weak convergence in probability to $c$.

    4.1.3 Orders In Probability

Definition 4.7 Let $y_1, y_2, \ldots, y_n, \ldots$ be a sequence of random variables. This sequence is said to be bounded in probability if for any $1 > \delta > 0$, there exist a $\Delta < \infty$ and some $N$ sufficiently large such that

$$\Pr(|y_n| > \Delta) < \delta,$$

for all $n > N$. □

These conditions require that the tail behavior of the distributions of the sequence not be pathological. Specifically, the tail mass of the distributions cannot be drifting away from zero as we move out in the sequence.

Definition 4.8 The sequence of random variables $\{y_n\}$ is said to be at most of order in probability $n^{\delta}$, and is denoted $O_p(n^{\delta})$, if $n^{-\delta}y_n$ is bounded in probability. □

Example 4.3 Suppose $z \sim N(0,1)$ and $y_n = 3 + n\,z$; then $n^{-1}y_n = 3/n + z$ is bounded in probability since the first term is asymptotically negligible, and we see that $y_n = O_p(n)$.


Definition 4.9 The sequence of random variables $\{y_n\}$ is said to be of order in probability smaller than $n^{\delta}$, and is denoted $o_p(n^{\delta})$, if $n^{-\delta}y_n \overset{p}{\to} 0$. □

Example 4.4 Convergence in probability can be represented in terms of order in probability. Suppose that $y_n \overset{p}{\to} c$, or equivalently $y_n - c \overset{p}{\to} 0$; then $n^0(y_n - c) \overset{p}{\to} 0$ and $y_n - c = o_p(1)$.

    4.1.4 Convergence In Distribution

Definition 4.10 A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$, with cumulative distribution functions $F_1(\cdot), F_2(\cdot), \ldots, F_n(\cdot), \ldots$, is said to converge in distribution to a random variable $y$ with a cumulative distribution function $F(y)$, if

$$\lim_{n\to\infty} F_n(\cdot) = F(\cdot), \qquad (4.6)$$

for every point of continuity of $F(\cdot)$. The distribution $F(\cdot)$ is said to be the limiting distribution of this sequence of random variables. □

For notational convenience, we often write $y_n \overset{d}{\to} F(\cdot)$ if a sequence of random variables converges in distribution to $F(\cdot)$. Note that the moments of elements of the sequence do not necessarily converge to the moments of the limiting distribution.

    4.1.5 Some Useful Propositions

In the following propositions, let $x_n$ and $y_n$ be sequences of random vectors.

Proposition 4.1 If $x_n - y_n$ converges in probability to zero, and $y_n$ has a limiting distribution, then $x_n$ has a limiting distribution, which is the same. □

Proposition 4.2 If $y_n$ has a limiting distribution and $\operatorname{plim}_{n\to\infty} x_n = 0$, then for $z_n = y_n'x_n$,

$$\operatorname{plim}_{n\to\infty} z_n = 0. \qquad \square$$

Proposition 4.3 Suppose that $y_n$ converges in distribution to a random variable $y$, and $\operatorname{plim}_{n\to\infty} x_n = c$. Then $x_n'y_n$ converges in distribution to $c'y$. □

Proposition 4.4 If $g(\cdot)$ is a continuous function, and if $x_n - y_n$ converges in probability to zero, then $g(x_n) - g(y_n)$ converges in probability to zero. □

Proposition 4.5 If $g(\cdot)$ is a continuous function, and if $x_n$ converges in probability to a constant $c$, then $z_n = g(x_n)$ converges in distribution to the constant $g(c)$. □


Proposition 4.6 If $g(\cdot)$ is a continuous function, and if $x_n$ converges in distribution to a random variable $x$, then $z_n = g(x_n)$ converges in distribution to a random variable $g(x)$. □

    4.2 Estimation Theory

    4.2.1 Properties Of Estimators

Definition 4.11 An estimator $\hat{\theta}_n$ of the $p \times 1$ parameter vector $\theta$ is a function of the sample observations $x_1, x_2, \ldots, x_n$. □

It follows that $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$ form a sequence of random variables.

Definition 4.12 The estimator $\hat{\theta}_n$ is said to be unbiased if $\mathrm{E}\,\hat{\theta}_n = \theta$, for all $n$. □

Definition 4.13 The estimator $\hat{\theta}_n$ is said to be asymptotically unbiased if $\lim_{n\to\infty}\mathrm{E}\,\hat{\theta}_n = \theta$. □

Note that an estimator can be biased in finite samples, but asymptotically unbiased.

Definition 4.14 The estimator $\hat{\theta}_n$ is said to be consistent if $\operatorname{plim}_{n\to\infty}\hat{\theta}_n = \theta$. □

Consistency neither implies nor is implied by asymptotic unbiasedness, as demonstrated by the following examples.

Example 4.5 Let

$$\tilde{\theta}_n = \begin{cases} \theta, & \text{with probability } 1 - 1/n \\ \theta + nc, & \text{with probability } 1/n. \end{cases}$$

We have $\mathrm{E}\,\tilde{\theta}_n = \theta + c$, so $\tilde{\theta}_n$ is a biased estimator, and $\lim_{n\to\infty}\mathrm{E}\,\tilde{\theta}_n = \theta + c$, so $\tilde{\theta}_n$ is asymptotically biased as well. However, $\lim_{n\to\infty}\Pr(|\tilde{\theta}_n - \theta| > \epsilon) = 0$, so $\tilde{\theta}_n$ is a consistent estimator. □

Example 4.6 Suppose $x_i$ i.i.d. $N(\mu, \sigma^2)$ for $i = 1, 2, \ldots, n, \ldots$ and let

$$\tilde{\mu}_n = x_n$$

be an estimator of $\mu$. Now $\mathrm{E}[\tilde{\mu}_n] = \mu$, so the estimator is unbiased, but

$$\Pr(|\tilde{\mu}_n - \mu| > 1.96\sigma) = .05,$$

so the probability mass is not collapsing about the target point and the estimator is not consistent.
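Example 4.5 is easy to simulate: the estimator stays biased by $c$ for every $n$, yet the probability of being far from $\theta$ dies out. A minimal sketch (illustrative parameter values, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(9)
theta, c, reps = 2.0, 1.0, 200_000

for n in (10, 100, 10_000):
    contaminated = rng.random(reps) < 1.0 / n          # event with probability 1/n
    est = np.where(contaminated, theta + n * c, theta) # Example 4.5 estimator
    print(f"n={n:6d}  mean {est.mean():.3f} (theta + c = {theta + c}),"
          f"  Pr(|est - theta| > 0.1) = {np.mean(np.abs(est - theta) > 0.1):.4f}")
```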


4.2.2 Laws Of Large Numbers And Central Limit Theorems

Most of the large sample properties of the estimators considered in the sequel derive from the fact that the estimators involve sample averages, and the asymptotic behavior of averages is well studied. In addition to the central limit theorems presented in the previous two chapters we have the following two laws of large numbers:

Theorem 4.1 If $x_1, x_2, \ldots, x_n$ is a simple random sample, that is, the $x_i$'s are i.i.d., and $\mathrm{E}\,x_i = \mu$ and $\mathrm{Var}(x_i) = \sigma^2$, then by Chebyshev's Inequality,

$$\operatorname{plim}_{n\to\infty} \bar{x}_n = \mu. \qquad \square \qquad (4.7)$$

Theorem 4.2 (Khintchine) Suppose that $x_1, x_2, \ldots, x_n$ are i.i.d. random variables, such that for all $i = 1, \ldots, n$, $\mathrm{E}\,x_i = \mu$; then

$$\operatorname{plim}_{n\to\infty} \bar{x}_n = \mu. \qquad \square \qquad (4.8)$$

Both of these results apply element-by-element to vectors of estimators. For the sake of completeness we repeat the following scalar central limit theorem.

Theorem 4.3 (Lindeberg-Levy) Suppose that $x_1, x_2, \ldots, x_n$ are i.i.d. random variables, such that for all $i = 1, \ldots, n$, $\mathrm{E}\,x_i = \mu$ and $\mathrm{Var}(x_i) = \sigma^2$; then $\sqrt{n}(\bar{x}_n - \mu) \overset{d}{\to} N(0, \sigma^2)$, or

$$\lim_{n\to\infty} f(\sqrt{n}(\bar{x}_n - \mu)) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}[\sqrt{n}(\bar{x}_n - \mu)]^2} = N(0, \sigma^2). \qquad \square \qquad (4.9)$$

This result is easily generalized to obtain the multivariate version given in Theorem 3.3.

Theorem 4.4 (Multivariate CLT) Suppose that the $m \times 1$ random vectors $x_1, x_2, \ldots, x_n$ are (i) jointly i.i.d., (ii) $\mathrm{E}\,x_i = \mu$, and (iii) $\mathrm{Cov}(x_i) = \Sigma$; then

$$\sqrt{n}(\bar{x}_n - \mu) \overset{d}{\to} N(0, \Sigma). \qquad \square \qquad (4.10)$$

    4.2.3 CUAN And Efficiency

Definition 4.15 An estimator is said to be consistently uniformly asymptotically normal (CUAN) if it is consistent, and if $\sqrt{n}(\hat{\theta}_n - \theta)$ converges in distribution to $N(0, \Psi)$, and if the convergence is uniform over some compact subset of the parameter space. □


Suppose that $\sqrt{n}(\hat{\theta}_n - \theta)$ converges in distribution to $N(0, \Psi)$. Let $\tilde{\theta}_n$ be an alternative estimator such that $\sqrt{n}(\tilde{\theta}_n - \theta)$ converges in distribution to $N(0, \Omega)$.

Definition 4.16 If $\hat{\theta}_n$ is CUAN with asymptotic covariance $\Psi$ and $\tilde{\theta}_n$ is CUAN with asymptotic covariance $\Omega$, then $\hat{\theta}_n$ is asymptotically efficient relative to $\tilde{\theta}_n$ if $\Omega - \Psi$ is a positive semidefinite matrix. □

Among other properties, asymptotic relative efficiency implies that the diagonal elements of $\Psi$ are no larger than those of $\Omega$, so the asymptotic variances of $\hat{\theta}_{n,i}$ are no larger than those of $\tilde{\theta}_{n,i}$. And a similar result applies for the asymptotic variance of any linear combination.

Definition 4.17 A CUAN estimator $\hat{\theta}_n$ is said to be asymptotically efficient if it is asymptotically efficient relative to any other CUAN estimator. □

    4.3 Asymptotic Inference

    4.3.1 Normal Ratios

Now, under the conditions of the central limit theorem, $\sqrt{n}(\bar{x}_n - \mu) \overset{d}{\to} N(0, \sigma^2)$, so

$$\frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} \overset{d}{\to} N(0, 1). \qquad (4.11)$$

Suppose that $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$. Then we also have

$$\frac{\bar{x}_n - \mu}{\sqrt{\hat{\sigma}^2/n}} = \frac{\sqrt{\sigma^2/n}}{\sqrt{\hat{\sigma}^2/n}}\cdot\frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} = \sqrt{\frac{\sigma^2}{\hat{\sigma}^2}}\cdot\frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} \overset{d}{\to} N(0, 1), \qquad (4.12)$$

since the term under the square root converges in probability to one and the remainder converges in distribution to $N(0,1)$.

Most typically, such ratios will be used for inference in testing a hypothesis. Now, for $H_0: \mu = \mu_0$, we have

$$\frac{\bar{x}_n - \mu_0}{\sqrt{\hat{\sigma}^2/n}} \overset{d}{\to} N(0, 1), \qquad (4.13)$$

while under $H_1: \mu = \mu_1 \neq \mu_0$, we find that

$$\frac{\bar{x}_n - \mu_0}{\sqrt{\hat{\sigma}^2/n}} = \frac{\sqrt{n}(\bar{x}_n - \mu_1)}{\sqrt{\hat{\sigma}^2}} + \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sqrt{\hat{\sigma}^2}} = N(0, 1) + O_p(\sqrt{n}). \qquad (4.14)$$


Thus, if we obtain a large value of the statistic, we may take it as evidence that the null hypothesis is incorrect.

This result can also be applied to any subvector of $\theta$. Let

$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix},$$

where $\theta_1$ is a $p_1 \times 1$ vector. Then

$$\sqrt{n}(\hat{\theta}_1 - \theta_1) \overset{d}{\to} N(0, \Psi_{11}), \qquad (4.21)$$

where $\Psi_{11}$ is the upper left-hand $p_1 \times p_1$ submatrix of $\Psi$, and

$$n\,(\hat{\theta}_1 - \theta_1)'\hat{\Psi}_{11}^{-1}(\hat{\theta}_1 - \theta_1) \overset{d}{\to} \chi^2_{p_1}. \qquad (4.22)$$

    4.3.3 Tests Of General Restrictions

We can use similar results to test general nonlinear restrictions. Suppose that $r(\theta)$ is a $q \times 1$ continuously differentiable function, and $\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\to} N(0, \Psi)$. By the intermediate value theorem we can obtain the exact Taylor's series representation

$$r(\hat{\theta}) = r(\theta) + \frac{\partial r(\theta^*)}{\partial \theta'}(\hat{\theta} - \theta),$$

or equivalently

$$\sqrt{n}\,(r(\hat{\theta}) - r(\theta)) = \frac{\partial r(\theta^*)}{\partial \theta'}\,\sqrt{n}(\hat{\theta} - \theta) = R(\theta^*)\,\sqrt{n}(\hat{\theta} - \theta),$$

where $R(\theta) = \partial r(\theta)/\partial\theta'$ and $\theta^*$ lies between $\hat{\theta}$ and $\theta$. Now $\hat{\theta} \overset{p}{\to} \theta$, so $\theta^* \overset{p}{\to} \theta$ and $R(\theta^*) \to R(\theta)$. Thus, we have

$$\sqrt{n}\,[r(\hat{\theta}) - r(\theta)] \overset{d}{\to} N(0, R(\theta)\Psi R'(\theta)). \qquad (4.23)$$

Thus, under $H_0: r(\theta) = 0$, assuming $R(\theta)\Psi R'(\theta)$ is nonsingular, we have

$$n\,r(\hat{\theta})'[R(\theta)\Psi R'(\theta)]^{-1}r(\hat{\theta}) \overset{d}{\to} \chi^2_q, \qquad (4.24)$$

where $q$ is the length of $r(\theta)$. In practice, we substitute the consistent estimates $R(\hat{\theta})$ for $R(\theta)$ and $\hat{\Psi}$ for $\Psi$ to obtain, following the arguments given above,

$$n\,r(\hat{\theta})'[R(\hat{\theta})\hat{\Psi}R'(\hat{\theta})]^{-1}r(\hat{\theta}) \overset{d}{\to} \chi^2_q. \qquad (4.25)$$

The behavior under the alternative hypothesis will be $O_p(n)$ as above.


    Chapter 5

    Maximum Likelihood

    Methods

    5.1 Maximum Likelihood Estimation (MLE)

    5.1.1 Motivation

Suppose we have a model for the random variable $y_i$, for $i = 1, 2, \ldots, n$, with unknown $(p \times 1)$ parameter vector $\theta$. In many cases, the model will imply a distribution $f(y_i, \theta)$ for each realization of the variable $y_i$.

A basic premise of statistical inference is to avoid unlikely or rare models, for example, in hypothesis testing. If we have a realization of a statistic that exceeds the critical value, then it is a rare event under the null hypothesis. Under the alternative hypothesis, however, such a realization is much more likely to occur, and we reject the null in favor of the alternative. Thus in choosing between the null and alternative, we select the model that makes the realization of the statistic more likely to have occurred.

Carrying this idea over to estimation, we select values of $\theta$ such that the corresponding values of $f(y_i, \theta)$ are not unlikely. After all, we do not want a model that disagrees strongly with the data. Maximum likelihood estimation is merely a formalization of this notion that the model chosen should not be unlikely. Specifically, we choose the values of the parameters that make the realized data most likely to have occurred. This approach does, however, require that the model be specified in enough detail to imply a distribution for the variable of interest.


    5.1.2 The Likelihood Function

Suppose that the random variables $y_1, y_2, \ldots, y_n$ are i.i.d. Then, the joint density function for $n$ realizations is

$$f(y_1, y_2, \ldots, y_n|\theta) = f(y_1|\theta)\,f(y_2|\theta)\cdots f(y_n|\theta) = \prod_{i=1}^{n} f(y_i|\theta). \qquad (5.1)$$

Given values of the parameter vector $\theta$, this function allows us to assign local probability measures for various choices of the random variables $y_1, y_2, \ldots, y_n$. This is the function which must be integrated to make probability statements concerning the joint outcomes of $y_1, y_2, \ldots, y_n$.

Given a set of realized values of the random variables, we use this same function to establish the probability measure associated with various choices of the parameter vector $\theta$.

Definition 5.1 The likelihood function of the parameters, for a particular sample of $y_1, y_2, \ldots, y_n$, is the joint density function considered as a function of $\theta$ given the $y_i$'s. That is,

$$L(\theta|y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} f(y_i|\theta). \qquad \square \qquad (5.2)$$

    5.1.3 Maximum Likelihood Estimation

    For a particular choice of the parameter vector , the likelihood function gives

    a probability measure for the realizations that occured. Consistent with theapproach used in hypothesis testing, and using this function as the metric, wechoose that make the realizations most likely to have occured.

Definition 5.2 The maximum likelihood estimator of $\theta$ is the estimator obtained by maximizing $L(\theta|y_1, y_2, \ldots, y_n)$ with respect to $\theta$. That is,

$$\hat{\theta} = \arg\max_{\theta}\,L(\theta|y_1, y_2, \ldots, y_n), \qquad (5.3)$$

where $\hat{\theta}$ is called the MLE of $\theta$. □

Equivalently, since $\log(\cdot)$ is a strictly monotonic transformation, we may find the MLE of $\theta$ by solving

$$\max_{\theta}\,\mathcal{L}(\theta|y_1, y_2, \ldots, y_n), \qquad (5.4)$$

where

$$\mathcal{L}(\theta|y_1, y_2, \ldots, y_n) = \log L(\theta|y_1, y_2, \ldots, y_n)$$


is denoted the log-likelihood function. In practice, we obtain $\hat{\theta}$ by solving the first-order conditions (FOC)

$$\frac{\partial\,\mathcal{L}(\hat{\theta}; y_1, y_2, \ldots, y_n)}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial \log f(y_i; \hat{\theta})}{\partial\theta} = 0.$$

The motivation for using the log-likelihood function is apparent since the summation form will result, after division by $n$, in estimators that are approximately averages, about which we know a lot. This advantage is particularly clear in the following example.

Example 5.1 Suppose that $y_i$ i.i.d. $N(\mu, \sigma^2)$, for $i = 1, 2, \ldots, n$. Then,

$$f(y_i|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2},$$

for $i = 1, 2, \ldots, n$. Using the likelihood function (5.2), we have

$$L(\mu, \sigma^2|y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}. \qquad (5.5)$$

Next, we take the logarithm of (5.5), which gives us

$$\log L(\mu, \sigma^2|y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n}\left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \mu)^2\right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2. \qquad (5.6)$$

We then maximize (5.6) with respect to both $\mu$ and $\sigma^2$. That is, we solve the following first-order conditions:

(A) $\dfrac{\partial \log L(\cdot)}{\partial\mu} = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0$;

(B) $\dfrac{\partial \log L(\cdot)}{\partial\sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2 = 0$.

By solving (A), we find that $\mu = \frac{1}{n}\sum_{i=1}^{n}y_i = \bar{y}_n$. Solving (B) gives us $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2$. Therefore, $\hat{\mu} = \bar{y}_n$, and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu})^2$. □

Note that $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu})^2 \neq s^2$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \hat{\mu})^2$. $s^2$ is the familiar unbiased estimator for $\sigma^2$, and $\hat{\sigma}^2$ is a biased estimator.
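The closed-form MLEs in Example 5.1 translate directly into code. A minimal sketch (illustrative parameter values, NumPy assumed) computes $\hat{\mu}$ and $\hat{\sigma}^2$ and contrasts the latter with the unbiased $s^2$:

```python
import numpy as np

rng = np.random.default_rng(10)
mu, sigma2, n = 1.5, 4.0, 30
y = rng.normal(mu, np.sqrt(sigma2), size=n)

mu_hat = y.mean()                         # MLE of mu: the sample mean
sigma2_hat = np.mean((y - mu_hat)**2)     # MLE of sigma^2: divides by n (biased)
s2 = np.sum((y - mu_hat)**2) / (n - 1)    # unbiased estimator: divides by n-1

print("mu_hat     :", mu_hat)
print("sigma2_hat :", sigma2_hat)
print("s2         :", s2)                 # slightly larger than sigma2_hat
```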


    5.2 Asymptotic Behavior of MLEs

    5.2.1 Assumptions

For the results we will derive in the following sections, we need to make five assumptions:

1. The $y_i$'s are i.i.d. random variables with density function $f(y_i, \theta)$ for $i = 1, 2, \ldots, n$;

2. $\log f(y_i, \theta)$ and hence $f(y_i, \theta)$ possess derivatives with respect to $\theta$ up to the third order for all $\theta$;

3. The range of $y_i$ is independent of $\theta$, hence differentiation under the integral is possible;

4. The parameter vector $\theta$ is globally identified by the density function;

5. $\partial^3 \log f(y_i, \theta)/\partial\theta_i\partial\theta_j\partial\theta_k$ is bounded in absolute value by some function $H_{ijk}(y)$ for all $y$ and $\theta$, which, in turn, has a finite expectation for all $\theta$.

The first assumption is fundamental and the basis of the estimator. If it is not satisfied then we are misspecifying the model and there is little hope for obtaining correct inferences, at least in finite samples. The second assumption is a regularity condition that is usually satisfied and easily verified. The third assumption is also easily verified and guaranteed to be satisfied in models where the dependent variable has smooth and infinite support. The fourth assumption must be verified, which is easier in some cases than others. The last assumption is crucial and bears a cost and really should be verified before MLE is undertaken, but is usually ignored.

    5.2.2 Some Preliminaries

Now, we know that

$$\int L(\theta_0|y)\,dy = \int f(y|\theta_0)\,dy = 1 \qquad (5.7)$$


for any value of the true parameter vector $\theta_0$. Therefore,

$$0 = \frac{\partial \int L(\theta_0|y)\,dy}{\partial\theta} = \int \frac{\partial L(\theta_0|y)}{\partial\theta}\,dy = \int \frac{\partial f(y|\theta_0)}{\partial\theta}\,dy = \int \frac{\partial \log f(y|\theta_0)}{\partial\theta}\,f(y|\theta_0)\,dy, \qquad (5.8)$$

and

$$0 = \mathrm{E}\,\frac{\partial \log f(y|\theta_0)}{\partial\theta} = \mathrm{E}\,\frac{\partial \log L(\theta_0|y)}{\partial\theta} \qquad (5.9)$$

for any value of the true parameter vector $\theta_0$.

Differentiating (5.8) again yields

$$0 = \int\left[\frac{\partial^2 \log f(y|\theta_0)}{\partial\theta\,\partial\theta'}\,f(y|\theta_0) + \frac{\partial \log f(y|\theta_0)}{\partial\theta}\,\frac{\partial f(y|\theta_0)}{\partial\theta'}\right]dy. \qquad (5.10)$$

Since

$$\frac{\partial f(y|\theta_0)}{\partial\theta'} = \frac{\partial \log f(y|\theta_0)}{\partial\theta'}\,f(y|\theta_0), \qquad (5.11)$$

then we can rewrite (5.10) as

$$0 = \mathrm{E}\left[\frac{\partial^2 \log f(y|\theta_0)}{\partial\theta\,\partial\theta'}\right] + \mathrm{E}\left[\frac{\partial \log f(y|\theta_0)}{\partial\theta}\,\frac{\partial \log f(y|\theta_0)}{\partial\theta'}\right], \qquad (5.12)$$

or, in terms of the likelihood function,

$$\mathcal{I}(\theta_0) = \mathrm{E}\left[\frac{\partial \log L(\theta_0|y)}{\partial\theta}\,\frac{\partial \log L(\theta_0|y)}{\partial\theta'}\right] = -\mathrm{E}\left[\frac{\partial^2 \log L(\theta_0|y)}{\partial\theta\,\partial\theta'}\right]. \qquad (5.13)$$

The matrix $\mathcal{I}(\theta_0)$ is called the information matrix and the relationship given in (5.13) the information matrix equality.

Finally, we note that

$$\mathrm{E}\left[\frac{\partial \log L(\theta_0|y)}{\partial\theta}\,\frac{\partial \log L(\theta_0|y)}{\partial\theta'}\right] = \mathrm{E}\left[\sum_{i=1}^{n}\frac{\partial \log f(y_i|\theta_0)}{\partial\theta}\,\sum_{i=1}^{n}\frac{\partial \log f(y_i|\theta_0)}{\partial\theta'}\right] = \sum_{i=1}^{n}\mathrm{E}\left[\frac{\partial \log f(y_i|\theta_0)}{\partial\theta}\,\frac{\partial \log f(y_i|\theta_0)}{\partial\theta'}\right], \qquad (5.14)$$


nonlinear in $a$, then

$$\hat{\theta} - \theta_0 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \qquad (5.18)$$

Since $ac = o_p(1)$, then $\hat{\theta} - \theta_0 \overset{p}{\to} 0$ for the plus root, while $\hat{\theta} - \theta_0 \overset{p}{\to} \mathcal{I}(\theta_0)/\alpha$ for the negative root if $\operatorname{plim}_{n\to\infty} a = \alpha \neq 0$ exists. If $a = 0$, then the FOC are linear in $\theta$, whereupon $\hat{\theta} - \theta_0 = -c/b$ and again $\hat{\theta} - \theta_0 \overset{p}{\to} 0$. If the FOC are nonlinear but asymptotically linear, then $a \overset{p}{\to} 0$ and $ac$ in the numerator of (5.18) will still go to zero faster than $a$ in the denominator, and $\hat{\theta} - \theta_0 \overset{p}{\to} 0$. Thus there exists at least one consistent solution $\hat{\theta}$ which satisfies

$$\operatorname{plim}_{n\to\infty}(\hat{\theta} - \theta_0) = 0, \qquad (5.19)$$

and in the asymptotically nonlinear case there is also a possibly inconsistent solution.

For the case of a vector $\theta$, we can apply a similar style of proof to show there exists a solution $\hat{\theta}$ to the FOC that satisfies $\operatorname{plim}_{n\to\infty}(\hat{\theta} - \theta_0) = 0$. And in the event of asymptotically nonlinear FOC there is at least one other possibly inconsistent root.

    Global Maximum Is Consistent

In the event of multiple roots, we are left with the problem of selecting between them. By assumption 4, the parameter $\theta$ is globally identified by the density function. Formally, this means that

$$f(y, \theta) = f(y, \theta_0) \qquad (5.20)$$

for all $y$ implies that $\theta = \theta_0$. Now,

$$\mathrm{E}\left[\frac{f(y, \theta)}{f(y, \theta_0)}\right] = \int \frac{f(y, \theta)}{f(y, \theta_0)}\,f(y, \theta_0)\,dy = 1. \qquad (5.21)$$

Thus, by Jensen's Inequality, we have

$$\mathrm{E}\left[\log\frac{f(y, \theta)}{f(y, \theta_0)}\right] < \log \mathrm{E}\left[\frac{f(y, \theta)}{f(y, \theta_0)}\right] = 0, \qquad (5.22)$$

unless $f(y, \theta) = f(y, \theta_0)$ for all $y$, or $\theta = \theta_0$. Therefore, $\mathrm{E}[\log f(y, \theta)]$ achieves a maximum if and only if $\theta = \theta_0$.

However, we are solving

$$\max_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\log f(y_i, \theta) \overset{p}{\to} \mathrm{E}[\log f(y_i, \theta)], \qquad (5.23)$$


    and if we choose an inconsistent root, we will not obtain a global maximum.

Thus, asymptotically, the global maximum is a consistent root. This choice of the global root has added appeal since it is in fact the MLE among the possible alternatives and hence the choice that makes the realized data most likely to have occurred.

There are complications in finite samples since the value of the likelihood function for alternative roots may cross over as the sample size increases. That is, the global maximum in small samples may not be the global maximum in larger samples. An added problem is to identify all the alternative roots so we can choose the global maximum. Sometimes a solution is available in a simple consistent estimator which may be used to start the nonlinear MLE optimization.

    Asymptotic Normality

For $p = 1$, we have $a\delta^2 + b\delta + c = 0$ with $\delta = \hat{\theta} - \theta_0$, so

$$\delta = \hat{\theta} - \theta_0 = \frac{-c}{a\delta + b} \qquad (5.24)$$

and

$$\sqrt{n}(\hat{\theta} - \theta_0) = \frac{-\sqrt{n}\,c}{a(\hat{\theta} - \theta_0) + b} = \frac{-1}{a(\hat{\theta} - \theta_0) + b}\,\sqrt{n}\,c.$$

Now since $a = O_p(1)$ and $\hat{\theta} - \theta_0 = o_p(1)$, then $a(\hat{\theta} - \theta_0) = o_p(1)$ and

$$a(\hat{\theta} - \theta_0) + b \overset{p}{\to} -\mathcal{I}(\theta_0). \qquad (5.25)$$

And by the CLT we have

$$\sqrt{n}\,c = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(y_i|\theta_0)}{\partial\theta} \overset{d}{\to} N(0, \mathcal{I}(\theta_0)). \qquad (5.26)$$

Substituting these two results in (5.25), we find

$$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}(\theta_0)^{-1}).$$

In general, for $p > 1$, we can apply the same scalar proof to show $\sqrt{n}(\lambda'\hat{\theta} - \lambda'\theta_0) \overset{d}{\to} N(0, \lambda'\mathcal{I}(\theta_0)^{-1}\lambda)$ for any vector $\lambda$, which means

$$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}(\theta_0)), \qquad (5.27)$$

if $\hat{\theta}$ is the global maximum.


Cramér-Rao Lower Bound

In addition to being the covariance matrix of the MLE, $\mathcal{I}^{-1}(\theta_0)$ defines a lower bound for covariance matrices with certain desirable properties. Let $\tilde{\theta}(y)$ be any unbiased estimator; then

$$\mathrm{E}\,\tilde{\theta}(y) = \int \tilde{\theta}(y)\,f(y; \theta)\,dy = \theta, \qquad (5.28)$$

for any underlying $\theta = \theta_0$. Differentiating both sides of this relationship with respect to $\theta$ yields

$$\begin{aligned}
I_p = \frac{\partial\,\mathrm{E}\,\tilde{\theta}(y)}{\partial\theta'} &= \int \tilde{\theta}(y)\,\frac{\partial f(y;\theta_0)}{\partial\theta'}\,dy \\
&= \int \tilde{\theta}(y)\,\frac{\partial \log f(y;\theta_0)}{\partial\theta'}\,f(y;\theta_0)\,dy \\
&= \mathrm{E}\left[\tilde{\theta}(y)\,\frac{\partial \log f(y;\theta_0)}{\partial\theta'}\right] \\
&= \mathrm{E}\left[(\tilde{\theta}(y) - \theta_0)\,\frac{\partial \log f(y;\theta_0)}{\partial\theta'}\right]. \qquad (5.29)
\end{aligned}$$

Next, we let

$$C(\theta_0) = \mathrm{E}\left[(\tilde{\theta}(y) - \theta_0)(\tilde{\theta}(y) - \theta_0)'\right] \qquad (5.30)$$

be the covariance matrix of $\tilde{\theta}(y)$; then,

$$\operatorname{Cov}\begin{pmatrix} \tilde{\theta}(y) \\ \partial \log L/\partial\theta \end{pmatrix} = \begin{pmatrix} C(\theta_0) & I_p \\ I_p & \mathcal{I}(\theta_0) \end{pmatrix}, \qquad (5.31)$$

where $I_p$ is a $p \times p$ identity matrix, and (5.31) as a covariance matrix is positive semidefinite.

Now, for any $(p \times 1)$ vector $a$, we have

$$\begin{pmatrix} a' & -a'\mathcal{I}(\theta_0)^{-1} \end{pmatrix}\begin{pmatrix} C(\theta_0) & I_p \\ I_p & \mathcal{I}(\theta_0) \end{pmatrix}\begin{pmatrix} a \\ -\mathcal{I}(\theta_0)^{-1}a \end{pmatrix} = a'[\,C(\theta_0) - \mathcal{I}(\theta_0)^{-1}\,]a \geq 0. \qquad (5.32)$$

Thus, any unbiased estimator $\tilde{\theta}(y)$ has a covariance matrix that exceeds $\mathcal{I}(\theta_0)^{-1}$ by a positive semidefinite matrix. And if the MLE estimator is unbiased, it is efficient within the class of unbiased estimators. Likewise, any CUAN estimator will have a covariance exceeding $\mathcal{I}(\theta_0)^{-1}$. Since the asymptotic covariance of the MLE is, in fact, $\mathcal{I}(\theta_0)^{-1}$, it is efficient (asymptotically). $\mathcal{I}(\theta_0)^{-1}$ is called the Cramér-Rao lower bound.


    5.3 Maximum Likelihood Inference

    5.3.1 Likelihood Ratio Test

Suppose we wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. Then, we define

$$L_u = \max_{\theta} L(\theta|y) = L(\hat{\theta}|y) \qquad (5.33)$$

and

$$L_r = L(\theta_0|y), \qquad (5.34)$$

where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood. We then form the likelihood ratio

$$\lambda = \frac{L_r}{L_u}. \qquad (5.35)$$

Note that the restricted likelihood can be no larger than the unrestricted, which maximizes the function.

As with estimation, it is more convenient to work with the logs of the likelihood functions. It will be shown below that, under $H_0$,

$$LR = -2\log\lambda = -2\log\frac{L_r}{L_u} = 2\,[\mathcal{L}(\hat{\theta}|y) - \mathcal{L}(\theta_0|y)] \overset{d}{\to} \chi^2_p, \qquad (5.36)$$

where $\hat{\theta}$ is the unrestricted MLE, and $\theta_0$ is the restricted MLE. If $H_1$ applies, then $LR = O_p(n)$. Large values of this statistic indicate that the restrictions make the observed values much less likely than the unrestricted, and we prefer the unrestricted and reject the restrictions.

In general, for $H_0: r(\theta) = 0$, and $H_1: r(\theta) \neq 0$, we have

$$L_u = \max_{\theta} L(\theta|y) = L(\hat{\theta}|y), \qquad (5.37)$$

and

$$L_r = \max_{\theta} L(\theta|y)\ \text{s.t.}\ r(\theta) = 0 = L(\tilde{\theta}|y). \qquad (5.38)$$

Under $H_0$,

$$LR = 2\,[\mathcal{L}(\hat{\theta}|y) - \mathcal{L}(\tilde{\theta}|y)] \overset{d}{\to} \chi^2_q, \qquad (5.39)$$

where $q$ is the length of $r(\theta)$.

Note that in the general case, the likelihood ratio test requires calculation of both the restricted and the unrestricted MLE.
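As a small worked illustration (not taken from the notes; the data, hypothesis, and parameter values are invented, and NumPy/SciPy are assumed), consider testing $H_0: \mu = \mu_0$ for an i.i.d. normal sample with $\sigma^2$ unknown. Both the restricted and unrestricted MLEs are available in closed form, and $LR = 2[\mathcal{L}(\hat{\theta}|y) - \mathcal{L}(\tilde{\theta}|y)]$ reduces to $n\log(\tilde{\sigma}^2/\hat{\sigma}^2)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, mu0 = 40, 0.0
y = rng.normal(0.3, 1.0, size=n)          # true mean 0.3, so H0: mu = 0 is false

# unrestricted MLE: mu_hat = ybar, sigma2_hat = mean squared deviation
mu_hat = y.mean()
sig2_u = np.mean((y - mu_hat) ** 2)
# restricted MLE under H0 (mu fixed at mu0): only sigma^2 is estimated
sig2_r = np.mean((y - mu0) ** 2)

LR = n * np.log(sig2_r / sig2_u)          # 2*(logL_u - logL_r) for the normal model
print("LR statistic:", LR)
print("chi2(1) 5% critical value:", stats.chi2.ppf(0.95, df=1))
print("p-value:", stats.chi2.sf(LR, df=1))
```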


    5.3.2 Wald Test

The asymptotic normality of the MLE may be used to obtain a test based only on the unrestricted estimates.

Now, under $H_0: \theta = \theta_0$, we have

$$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}(\theta_0)). \qquad (5.40)$$

Thus, using the results on the asymptotic behavior of quadratic forms from the previous chapter, we have

$$W = n\,(\hat{\theta} - \theta_0)'\mathcal{I}(\theta_0)(\hat{\theta} - \theta_0) \overset{d}{\to} \chi^2_p, \qquad (5.41)$$

which is the Wald test. As we discussed for quadratic tests, in general, under $H_1: \theta \neq \theta_0$, we would have $W = O_p(n)$.

In practice, since

$$-\frac{1}{n}\,\frac{\partial^2 \mathcal{L}(\hat{\theta}|y)}{\partial\theta\,\partial\theta'} = -\sum_{i=1}^{n}\frac{1}{n}\,\frac{\partial^2 \log f(y_i|\hat{\theta})}{\partial\theta\,\partial\theta'} \overset{p}{\to} \mathcal{I}(\theta_0), \qquad (5.42)$$

we use

$$W = -(\hat{\theta} - \theta_0)'\,\frac{\partial^2 \mathcal{L}(\hat{\theta}|y)}{\partial\theta\,\partial\theta'}\,(\hat{\theta} - \theta_0) \overset{d}{\to} \chi^2_p. \qquad (5.43)$$

Aside from having the same asymptotic distribution, the Likelihood Ratio and Wald tests are asymptotically equivalent in the sense that

$$\operatorname{plim}_{n\to\infty}(LR - W) = 0. \qquad (5.44)$$

This is shown by expanding $\mathcal{L}(\theta_0|y)$ in a Taylor's series about $\hat{\theta}$. That is,

$$\begin{aligned}
\mathcal{L}(\theta_0) = \mathcal{L}(\hat{\theta}) &+ \frac{\partial \mathcal{L}(\hat{\theta})}{\partial\theta'}(\theta_0 - \hat{\theta}) + \frac{1}{2}(\theta_0 - \hat{\theta})'\,\frac{\partial^2 \mathcal{L}(\hat{\theta})}{\partial\theta\,\partial\theta'}\,(\theta_0 - \hat{\theta}) \\
&+ \frac{1}{6}\sum_i\sum_j\sum_k \frac{\partial^3 \mathcal{L}(\theta^*)}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k}(\theta_{0i} - \hat{\theta}_i)(\theta_{0j} - \hat{\theta}_j)(\theta_{0k} - \hat{\theta}_k), \qquad (5.45)
\end{aligned}$$

where the third line applies the intermediate value theorem for $\theta^*$ between $\hat{\theta}$ and $\theta_0$. Now $\partial \mathcal{L}(\hat{\theta})/\partial\theta = 0$, and the third line can be shown to be $O_p(1/\sqrt{n})$ under assumption 5, whereupon we have

$$\mathcal{L}(\hat{\theta}) - \mathcal{L}(\theta_0) = -\frac{1}{2}(\theta_0 - \hat{\theta})'\,\frac{\partial^2 \mathcal{L}(\hat{\theta})}{\partial\theta\,\partial\theta'}\,(\theta_0 - \hat{\theta}) + O_p(1/\sqrt{n}). \qquad (5.46)$$


and

$$LM \overset{d}{\to} \chi^2_p. \qquad (5.55)$$

Note that the Lagrange Multiplier test only requires the restricted values of the parameters.

In general, we may test $H_0: r(\theta) = 0$ with

$$LM = \left(\frac{\partial \mathcal{L}(\tilde{\theta})}{\partial\theta}\right)'\left(-\frac{\partial^2 \mathcal{L}(\tilde{\theta})}{\partial\theta\,\partial\theta'}\right)^{-1}\frac{\partial \mathcal{L}(\tilde{\theta})}{\partial\theta} \overset{d}{\to} \chi^2_q, \qquad (5.56)$$

where $\mathcal{L}(\cdot)$ is the unrestricted log-likelihood function, and $\tilde{\theta}$ is the restricted MLE.

    5.5 Choosing Between Tests

The above analysis demonstrates that the three tests (likelihood ratio, Wald, and Lagrange multiplier) are asymptotically equivalent. In large samples, not only do they have the same limiting distribution, but they will accept and reject together. This is not the case in finite samples, where one can reject when the other does not. This might lead a cynical analyst to use one rather than the other by choosing the one that yields the results (s)he wants to obtain. Making an informed choice based on their finite sample behavior is beyond the scope of this course.

In many cases, however, one of the tests is a much more natural choice than the others. Recall that the Wald test only requires the unrestricted estimates while the Lagrange multiplier test only requires the restricted estimates. In some cases the unrestricted estimates are much easier to obtain than the restricted, and in other cases the reverse is true. In the first case we might be inclined to use the Wald test, while in the latter we would prefer to use the Lagrange multiplier.

Another issue is the possible sensitivity of the test results to how the restrictions are written. For example, $\theta_1 + \theta_2 = 0$ can also be written $\theta_2/\theta_1 = -1$. The Wald test, in particular, is sensitive to how the restriction is written. This is yet another situation where a cynical analyst might be tempted to choose the normalization of the restriction to force the desired result. The Lagrange multiplier test, as presented above, is also sensitive to the normalization of the restriction but can be modified to avoid this difficulty. The likelihood ratio test, however, will be unimpacted by the choice of how to write the restriction.