
Chapter 4

Maximum Likelihood Estimation

4.1 Maximum likelihood method of estimation

We have already seen at the beginning of chapter 1 that, for a given observed value x of the sample X, the joint p.m/d.f. fX(x|θ), as a function of θ, gives us an indication as to how the chances of getting the observed results x vary with θ, and that it is therefore referred to as the likelihood function

of the observed results x and denoted by L(θ; x). If the true value of θ is unknown to us then, naturally, the obvious estimate for the value of θ is that value θ̇ which makes the observed results most probable, i.e. θ̇ is that value of θ which maximises the chances of getting the results we did get. Thus θ̇ maximises the likelihood L(θ; x), i.e.

$$L(\dot\theta; \mathbf{x}) = \sup_{\theta \in \Theta} L(\theta; \mathbf{x})$$

θ̇ = θ̇(x) is called the maximum likelihood estimate of θ and θ̇(X) is called the maximum likelihood estimator of θ (or m.l.e. of θ). Usually, but not always, the maximum likelihood estimate θ̇ is found by differentiating L(θ; x) with respect to θ, equating to zero and then solving for θ̇ or, equivalently, since log is a monotonic function, by differentiating

ℓ (θ;x) = log L (θ;x) ,



equating to zero and then solving for θ̇. The function ℓ(θ; x) is called the log-likelihood function. If θ is a vector with elements (θ1, θ2, . . . , θk) then

the m.l.e. θ̇ of θ consists of elements (θ̇1, θ̇2, . . . , θ̇k) which maximise L(θ; x)

and are usually, but not always, obtained by differentiating ℓ(θ; x) w.r.t. θ1, θ2, . . . , θk, equating the resulting expressions to zero and solving simultaneously for θ̇1, θ̇2, . . . , θ̇k. These equations

$$\frac{\partial}{\partial\theta_1}\ell(\theta; \mathbf{x}) = 0, \qquad \frac{\partial}{\partial\theta_2}\ell(\theta; \mathbf{x}) = 0, \qquad \ldots, \qquad \frac{\partial}{\partial\theta_k}\ell(\theta; \mathbf{x}) = 0$$

are called the maximum likelihood equations and their solutions are the m.l.e. θ̇ = (θ̇1, θ̇2, . . . , θ̇k). You may need to solve these equations numerically; a sketch of such a numerical solution is given below.
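The following is a minimal Python sketch, not from the notes, of what such a numerical solution might look like; the two-parameter Gamma model is used because its likelihood equations have no closed-form solution, and the simulated sample and starting values are illustrative assumptions.

```python
# Sketch: maximising a log-likelihood numerically when the likelihood
# equations have no closed-form solution (two-parameter Gamma model).
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=200)   # illustrative sample

def neg_log_lik(params):
    alpha, beta = params                        # shape alpha, rate beta
    if alpha <= 0 or beta <= 0:
        return np.inf                           # stay inside Theta
    # log L = n(alpha*log(beta) - log Gamma(alpha))
    #         + (alpha - 1) * sum(log x) - beta * sum(x)
    return -(len(x) * (alpha * np.log(beta) - gammaln(alpha))
             + (alpha - 1) * np.log(x).sum() - beta * x.sum())

res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                                    # numerical (alpha_dot, beta_dot)
```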

4.1.1 Things to watch out for

1. The m.l.e. may not be a turning point, i.e. may not be a point at which the first derivative of the likelihood (and log-likelihood) function vanishes (see figure 4.1).

2. The m.l.e. may not be unique (see figure 4.2).

3. If the m.l.e. is found numerically by an iterative procedure, it may take many iterations to converge because the likelihood may be very flat. Conversely, if you allow your iterative procedure to stop too early by not demanding high accuracy, the m.l.e. obtained may be quite distant from the true point of maximum when the likelihood function is very flat.

4. There may be local maxima, so numerical solution of the likelihood equations need not provide the global maximum (see figure 4.3); a common safeguard is sketched below.
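One such safeguard, shown here as a minimal sketch rather than a prescribed method, is to restart a local optimiser from several starting values and keep the best solution; the toy objective standing in for −log L(θ) is an illustrative assumption.

```python
# Sketch: guarding against local maxima by restarting a local optimiser
# from several starting values and keeping the best solution found.
import numpy as np
from scipy.optimize import minimize

def best_mle(neg_log_lik, starts):
    """Run the optimiser from each start; return the result with the
    smallest negative log-likelihood (i.e. the largest likelihood)."""
    results = [minimize(neg_log_lik, x0=s, method="Nelder-Mead") for s in starts]
    return min(results, key=lambda r: r.fun)

# Toy objective standing in for -log L(theta): two local minima,
# the global one near theta = -2.
f = lambda t: (t[0] - 1) ** 2 * (t[0] + 2) ** 2 + 0.5 * t[0]
starts = [np.array([s]) for s in (-4.0, 0.0, 4.0)]
print(best_mle(f, starts).x)   # converges to the global minimum
```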


[Figure 4.1: The m.l.e. is a boundary point. Likelihood L(θ) plotted against the parameter θ, with the m.l.e. at the boundary of the parameter range.]

[Figure 4.2: Any point between a and b is a m.l.e. Likelihood L(θ) plotted against the parameter θ, flat between the points a and b.]


[Figure 4.3: Likelihood function exhibits more than one maximum. Likelihood L(θ) plotted against the parameter θ, with a local maximum and a global maximum marked.]

Example: Let X1, X2, . . . , Xn be the lifetimes of n randomly selected components produced by a certain manufacturer which are observed to take the values x1, x2, . . . , xn. Assuming lifetimes are exponentially distributed with p.d.f.

$$f(x|\theta) = \theta e^{-\theta x}, \qquad x > 0,$$

find the m.l.e. of θ on the basis of these n observations.

Solution: The likelihood is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^n \exp\left(-\theta\sum_{i=1}^{n} x_i\right)$$

and the log-likelihood is

$$\ell(\theta; \mathbf{x}) = n\log\theta - \theta\sum_{i=1}^{n} x_i$$

Either of these functions of θ exhibits a maximum as a turning point; hence the m.l.e. is obtained by differentiation. In particular

$$\frac{\partial\ell(\theta; \mathbf{x})}{\partial\theta} = 0 \iff \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0$$


i.e.

$$\dot\theta(\mathbf{x}) = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$

Hence the maximum likelihood estimator is

$$\dot\theta(\mathbf{X}) = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{X}}$$
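As a quick numerical check (a sketch with simulated data; the true rate is an illustrative choice), a generic optimiser recovers exactly the closed form 1/x̄:

```python
# Sketch: numerical check that the exponential m.l.e. equals 1/xbar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)        # true theta = 1/2

neg_ll = lambda t: -(len(x) * np.log(t) - t * x.sum()) if t > 0 else np.inf
res = minimize_scalar(neg_ll, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / x.mean())                      # numerically identical
```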

Example: In a survival time study n cancer patients were observed for a fixed time T after operation and, if the symptoms reappear, the time X since the operation at which this happens is recorded. For r of these patients symptoms reappeared at times x1, x2, . . . , xr after their operation and the remaining n − r patients were still free of symptoms at the end of the time period T. If the time X to the return of symptoms has exponential distribution with p.d.f.

$$f(x|\theta) = \theta e^{-\theta x}, \qquad x > 0,$$

find the m.l.e. of θ on the basis of the study results.

Solution: The information from the study is as follows:

$$X_1 = x_1,\ X_2 = x_2,\ \ldots,\ X_r = x_r,\ X_{r+1} > T,\ X_{r+2} > T,\ \ldots,\ X_n > T$$

The likelihood of these results is

$$\begin{aligned} L(\theta) &= \theta e^{-\theta x_1}\,\theta e^{-\theta x_2}\cdots\theta e^{-\theta x_r}\,e^{-\theta T}e^{-\theta T}\cdots e^{-\theta T} \\ &= \theta^r \exp\left(-\theta\sum_{i=1}^{r} x_i\right)\exp\left(-(n-r)\theta T\right) \\ &= \theta^r \exp\left(-\theta\left[\sum_{i=1}^{r} x_i + (n-r)T\right]\right) \end{aligned}$$

and the log-likelihood is

$$\ell(\theta) = r\log\theta - \theta\left[\sum_{i=1}^{r} x_i + (n-r)T\right]$$

This clearly has a maximum as a turning point. Thus differentiating and equating to zero we get

$$\frac{\partial\ell}{\partial\theta} = 0 \iff \frac{r}{\theta} - \left[\sum_{i=1}^{r} x_i + (n-r)T\right] = 0$$


i.e.

$$\dot\theta = \frac{r}{\sum_{i=1}^{r} x_i + (n-r)T}$$

Hence the maximum likelihood estimator of θ is

$$\dot\theta(\mathbf{X}) = \frac{R}{\sum_{i=1}^{R} X_i + (n-R)T}$$

where R = the number of patients whose symptoms return within time T after their operation.
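A short numerical sketch of this censored-data estimate (the sample size, censoring time T and true rate are illustrative assumptions, not study values):

```python
# Sketch: censored-data m.l.e.  theta_dot = r / (sum of observed times
# plus (n - r) * T), with simulated exponential lifetimes censored at T.
import numpy as np

rng = np.random.default_rng(2)
n, T, theta_true = 100, 3.0, 0.4                # illustrative values
times = rng.exponential(scale=1 / theta_true, size=n)

observed = times[times <= T]                    # symptoms returned within T
r = len(observed)
theta_dot = r / (observed.sum() + (n - r) * T)
print(r, theta_dot)                             # theta_dot should be near 0.4
```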

Example: A random sample X1, X2, . . . , Xn is drawn from the N(µ, σ²) distribution providing the results x1, x2, . . . , xn. Find the m.l.e. of θ = (µ, σ²).

Solution: The likelihood of the results is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$

and the log-likelihood is

$$\ell(\theta; \mathbf{x}) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 + \text{constant}$$

Differentiating with respect to µ and σ² we obtain the maximum likelihood equations

$$\frac{\partial\ell}{\partial\mu} = 0, \quad \frac{\partial\ell}{\partial\sigma^2} = 0 \iff \frac{1}{\dot\sigma^2}\sum_{i=1}^{n}(x_i - \dot\mu) = 0, \quad -\frac{n}{2\dot\sigma^2} + \frac{1}{2\dot\sigma^4}\sum_{i=1}^{n}(x_i - \dot\mu)^2 = 0$$

From the first likelihood equation we get

$$\dot\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

Putting this solution in the second equation we get

$$\dot\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$


Hence the maximum likelihood estimators are

$$\dot\mu(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$

and

$$\dot\sigma^2(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$
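A sketch verifying these formulas on simulated data (the true µ and σ are illustrative choices); note that the m.l.e. of σ² divides by n, not n − 1:

```python
# Sketch: the Normal m.l.e.s are the sample mean and the divisor-n variance.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_dot = x.mean()                               # m.l.e. of mu
sigma2_dot = ((x - mu_dot) ** 2).mean()         # m.l.e. of sigma^2 (divides by n)
print(mu_dot, sigma2_dot, x.var(ddof=0))        # last two agree exactly
```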

Example: Let X1, X2, . . . , Xn be a random sample from the Uniform[θ − 1/2, θ + 1/2] distribution. Find the m.l.e. of θ.

Solution: Given the results (x1, x2, . . . , xn) = x their likelihood is

$$\begin{aligned} L(\theta; \mathbf{x}) &= \prod_{i=1}^{n} I_{[\theta-\frac{1}{2},\,\theta+\frac{1}{2}]}(x_i) \\ &= \prod_{i=1}^{n} I_{(-\infty,\,\theta+\frac{1}{2}]}(x_i)\, I_{[\theta-\frac{1}{2},\,\infty)}(x_i) \\ &= I_{(-\infty,\,\theta+\frac{1}{2}]}\Bigl(\max_{1\le i\le n} x_i\Bigr)\, I_{[\theta-\frac{1}{2},\,\infty)}\Bigl(\min_{1\le i\le n} x_i\Bigr) \\ &= I_{[\max x_i-\frac{1}{2},\,\infty)}(\theta)\, I_{(-\infty,\,\min x_i+\frac{1}{2}]}(\theta) \\ &= I_{[\max x_i-\frac{1}{2},\,\min x_i+\frac{1}{2}]}(\theta) \end{aligned}$$

where the set function I_A(u) is defined as

$$I_A(u) = \begin{cases} 1 & \text{if } u \in A \\ 0 & \text{if } u \notin A \end{cases}$$

From the plot of L(θ; x) in figure 4.4 we see that the likelihood is maximised by any θ̇ in the interval [max_{1≤i≤n} x_i − 1/2, min_{1≤i≤n} x_i + 1/2], i.e. the maximum likelihood estimator is not unique. Possible candidates are

$$\dot\theta_1(\mathbf{X}) = \min_{1\le i\le n} X_i + \frac{1}{2}, \qquad \dot\theta_2(\mathbf{X}) = \max_{1\le i\le n} X_i - \frac{1}{2}, \qquad \dot\theta_3(\mathbf{X}) = \frac{1}{2}\left(\max_{1\le i\le n} X_i + \min_{1\le i\le n} X_i\right)$$

This example also demonstrates that the m.l.e. is not always obtained by differentiating and equating to zero. A numerical sketch of the three candidates follows figure 4.4.


[Figure 4.4: Likelihood function for observations x1, x2, . . . , xn sampled from the Uniform[θ − 1/2, θ + 1/2] distribution. L(θ) plotted against θ, positive precisely between max x_i − 1/2 and min x_i + 1/2.]
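The sketch below computes the three candidates on simulated data (the true θ is an illustrative choice):

```python
# Sketch: three of the non-unique m.l.e. candidates for the
# Uniform[theta - 1/2, theta + 1/2] location parameter.
import numpy as np

rng = np.random.default_rng(4)
theta_true = 7.0                                # illustrative value
x = rng.uniform(theta_true - 0.5, theta_true + 0.5, size=50)

theta1 = x.min() + 0.5                          # right end of the flat region
theta2 = x.max() - 0.5                          # left end of the flat region
theta3 = 0.5 * (x.max() + x.min())              # midrange, lies between them
print(theta2, theta3, theta1)   # any value in [theta2, theta1] is an m.l.e.
```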

4.2 Properties of Maximum likelihood estimators

4.2.1 Invariance principle.

Suppose that φ(θ) is a function of θ and that θ̇ is the m.l.e. of θ. Then the m.l.e. of φ is given by

$$\dot\varphi = \varphi(\dot\theta)$$

i.e. it is obtained by evaluating the function φ at θ = θ̇. If φ is a one-to-one function then this is obvious. If φ however is not one-to-one then the justification is not straightforward and is omitted.
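A sketch of the principle in the exponential model (simulated data; taking φ(θ) = 1/θ, the mean lifetime):

```python
# Sketch: invariance of the m.l.e. under phi(theta) = 1/theta in the
# exponential model, where theta_dot = 1/xbar and hence phi_dot = xbar.
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=400)        # phi = mean lifetime = 2.0

theta_dot = 1 / x.mean()                        # m.l.e. of the rate theta
phi_dot = 1 / theta_dot                         # m.l.e. of phi(theta) = 1/theta
print(phi_dot, x.mean())                        # identical, by invariance
```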

4.2.2 M.l.e. and most efficient estimators

If a most efficient unbiased estimator θ̂ of θ exists then by 2.14

$$\frac{\partial}{\partial\theta}\log f_{\mathbf{X}}(\mathbf{x}|\theta) = I(\theta)\left[\hat\theta(\mathbf{x}) - \theta\right]$$


But under the same regularity conditions under which this result is true, the m.l.e. emerges as the solution of the likelihood equation

$$\frac{\partial}{\partial\theta}\log f_{\mathbf{X}}(\mathbf{x}|\theta) = 0$$

i.e. the m.l.e. θ̇ satisfies

$$I(\dot\theta)\left[\hat\theta(\mathbf{x}) - \dot\theta\right] = 0$$

and since I (θ) > 0 for all θ we must have that

θ̇ = θ̂ (x)

i.e. we have the following result.

Result: If a most efficient unbiased estimator θ̂ of θ exists (i.e. θ̂ is unbiased and its variance is equal to the CRLB) then the maximum likelihood method of estimation will produce it.

4.2.3 M.l.e. and sufficiency

Recall that if T is a sufficient statistic for θ then by the factorization theorem

fX(x|θ) = g (t, θ) h (x) with t = T (x)

Since the m.l.e. θ̇(x) maximises fX(x|θ) with respect to θ it follows that when a sufficient statistic exists the m.l.e. θ̇(x) maximises g(t, θ) with respect to θ. θ̇ must therefore depend on the sample observations only through the value t of the sufficient statistic. Since the sufficient statistic is arbitrary we have the following result.

Result: A maximum likelihood estimator is a function of all sufficient statistics of θ including the minimal sufficient statistic.

4.2.4 Asymptotic properties of m.l.e

By far the best justification for the use of the maximum likelihood method of estimation is the asymptotic behaviour of the maximum likelihood estimator. In particular, under some regularity conditions and provided the sample size is large enough, the maximum likelihood estimator produces, with very high


probability, estimates very close to the true value of the parameter it is estimating (i.e. the m.l.e. is consistent). Further, under the same conditions the m.l.e. is approximately unbiased, has variance approximately equal to the CRLB and its distribution is approximately Normal. Thus for large sample sizes maximum likelihood estimators are approximately most efficient unbiased for the parameter they estimate. In particular

Theorem 4.2.1 Subject to mild regularity conditions the maximum likelihood estimator θ̇(X) of θ, where X is a random sample of size n from the distribution with p.m/d.f. f(x|θ), is

1. weakly consistent i.e.

$$\Pr\left(\left|\dot\theta(\mathbf{X}) - \theta_0\right| \le \varepsilon\right) \to 1 \ \text{ as } n \to \infty$$

however small the value ε > 0, where θ0 is the true value of the θ parameter,

2. asymptotically most efficient, unbiased and Normally distributed i.e.

$$\dot\theta(\mathbf{X}) \sim N\left(\theta_0,\ \frac{1}{I(\theta_0)}\right) \ \text{ as } n \to \infty,$$

where I(θ0) is the sample Fisher information evaluated at θ = θ0.

If θ is a vector parameter of dimension k and θ̇(X) is the vector m.l.e. of θ then

$$\dot\theta(\mathbf{X}) \sim MN_k\left(\theta_0,\ I^{-1}(\theta_0)\right)$$

for large n, where I (θ0) is the sample Fisher information matrix.
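The theorem can be illustrated by simulation; the following is a sketch for the exponential model, where i(θ0) = 1/θ0² so that I(θ0) = n/θ0² (the sample size, replication count and true rate are illustrative choices):

```python
# Sketch: sampling distribution of the exponential m.l.e. against its
# Normal approximation N(theta0, 1/I(theta0)), where I(theta0) = n/theta0^2.
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 0.5, 200, 5000                # illustrative values
samples = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_dot = 1 / samples.mean(axis=1)            # m.l.e. in each replication

z = (theta_dot - theta0) * np.sqrt(n) / theta0  # standardise by sqrt(I(theta0))
print(z.mean(), z.std())                        # approximately 0 and 1
```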

Sketch of Proof: The proof requires more mathematics than we are prepared to use in this course but, for the case when θ is scalar, is based on the following lines (the case when θ is a vector is very similar).

1. Consider the random variable

Z (θ) = log f(X|θ)


which is dependent on θ. Its mean with respect to the true distribution of X is (assuming X to be continuous)

$$\mu(\theta) = \int_{\mathcal{X}} \log f(x|\theta)\, f(x|\theta_0)\, dx.$$

As a function of θ, µ(θ) attains a maximum at θ = θ0. This follows from the fact that for all θ ≠ θ0

$$\mu(\theta) - \mu(\theta_0) = \int_{\mathcal{X}} \log\left(\frac{f(x|\theta)}{f(x|\theta_0)}\right) f(x|\theta_0)\,dx < \int_{\mathcal{X}} \left\{\frac{f(x|\theta)}{f(x|\theta_0)} - 1\right\} f(x|\theta_0)\,dx \qquad (4.1)$$

since for any u ≠ 1, log u < u − 1. But the integral in 4.1 is zero (both densities integrate to one over X), so we get that µ(θ) < µ(θ0) for all θ ≠ θ0, i.e. µ(θ) attains a maximum at θ = θ0.

Now

$$\log f_{\mathbf{X}}(\mathbf{X}|\theta) = \log\left(\prod_{i=1}^{n} f(X_i|\theta)\right) = \sum_{i=1}^{n}\log f(X_i|\theta) = \sum_{i=1}^{n} Z_i(\theta)$$

where the Zi(θ) are independent random variables having the same distribution as Z(θ) defined above. But by the Law of large numbers, as n → ∞,

$$\frac{1}{n}\sum_{i=1}^{n} Z_i(\theta) \to \mu(\theta)$$

i.e. for large n, (1/n) log fX(X|θ) is close to µ(θ) for all θ. Consequently the point of maximum of (1/n) log fX(X|θ), namely θ̇(X), must be close to the point of maximum of µ(θ), which is θ0, provided the convergence of (1/n) log fX(X|θ) to µ(θ) is uniform in θ. Indeed as n → ∞, θ̇(X) → θ0 in probability, which implies weak consistency.


Assuming that θ̇(X) is a turning point of the log-likelihood, i.e. it is a solution of the likelihood equation

$$0 = \frac{\partial}{\partial\theta}\ell(\dot\theta; \mathbf{X})$$

or, dropping X for convenience

$$0 = \frac{\partial}{\partial\theta}\ell(\dot\theta) \qquad (4.2)$$

where $\frac{\partial}{\partial\theta}\ell(\dot\theta)$ means $\left.\frac{\partial}{\partial\theta}\ell(\theta)\right|_{\theta=\dot\theta}$, the first derivative of ℓ(θ) evaluated at θ = θ̇. Expanding the r.h.s. of 4.2 about θ0 we get

$$0 = \frac{\partial}{\partial\theta}\ell(\theta_0) + \left(\dot\theta - \theta_0\right)\frac{\partial^2}{\partial\theta^2}\ell(\theta_0) + \text{Remainder term} \qquad (4.3)$$

Since θ̇ is weakly consistent, for large n the value of θ̇ will, with high probability, be close to θ0, so that with high probability the Remainder term in 4.3 will be negligible and can be ignored. Hence

$$\dot\theta - \theta_0 = -\left[\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1}\frac{\partial}{\partial\theta}\ell(\theta_0) = -\left[\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1} S(\mathbf{X}, \theta_0)$$

or

$$\sqrt{n}\left(\dot\theta - \theta_0\right) = -\left[\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1}\frac{1}{\sqrt{n}}\, S(\mathbf{X}, \theta_0) \qquad (4.4)$$

where

$$S(\mathbf{X}, \theta_0) = \left.\frac{\partial}{\partial\theta}\log f_{\mathbf{X}}(\mathbf{X}|\theta)\right|_{\theta=\theta_0} = \text{Score function evaluated at } \theta = \theta_0$$

However, since the observations in a random sample are independent from the same distribution with p.m/d.f. f(x|θ) we have that

$$f_{\mathbf{X}}(\mathbf{X}|\theta) = \prod_{i=1}^{n} f(X_i|\theta)$$


and

$$\log f_{\mathbf{X}}(\mathbf{X}|\theta) = \sum_{i=1}^{n}\log f(X_i|\theta) \qquad (4.5)$$

or

$$S(\mathbf{X}, \theta) = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log f(X_i|\theta)$$

Hence

$$\frac{1}{\sqrt{n}}\, S(\mathbf{X}, \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} S_i \qquad (4.6)$$

where $S_i = \left.\frac{\partial}{\partial\theta}\log f(X_i|\theta)\right|_{\theta=\theta_0}$. But by the Central Limit Theorem, and recalling that E(Si) = 0 (see 2.11),

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} S_i \to N(0, \mathrm{Var}(S_i)) = N(0, i(\theta_0)) \qquad (4.7)$$

since Var(Si) = i(θ0) (see 2.12), where

$$i(\theta_0) = E\left(S_i^2\right) = \text{Fisher information in one observation} = E\left(-\left.\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta)\right|_{\theta=\theta_0}\right)$$

Further, because of 4.5, in 4.4

$$\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\ell(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0) \to E\left(\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0)\right) = -i(\theta_0) \qquad (4.8)$$

by the Law of Large Numbers. Hence from 4.4, 4.7 and 4.8 we have

$$\sqrt{n}\left(\dot\theta(\mathbf{X}) - \theta_0\right) \to \left[i(\theta_0)\right]^{-1} N(0, i(\theta_0)) = N\left(0, \frac{1}{i(\theta_0)}\right)$$


or

$$\dot\theta(\mathbf{X}) \to \frac{1}{\sqrt{n}}\, N\left(0, \frac{1}{i(\theta_0)}\right) + \theta_0 = N\left(\theta_0, \frac{1}{n\, i(\theta_0)}\right) = N\left(\theta_0, \frac{1}{I(\theta_0)}\right)$$

Corollary to the last theorem: Let X be a random sample of size n from a distribution with p.m/d.f. f(x|θ) where θ is a scalar parameter. Under mild regularity conditions

$$\frac{1}{\sqrt{n}}\, S(\mathbf{X}, \theta) \to N(0, i(\theta)) \ \text{ as } n \to \infty$$

where S(X, θ) is the score statistic, i.e. for large sample size n, S(X, θ) is approximately N(0, ni(θ)) distributed, or

S(X, θ) ∼ N(0, I(θ)) approximately.

The asymptotic properties of the maximum likelihood estimator of a scalar parameter θ can be extended to the case of the maximum likelihood estimator of a real valued function φ(θ) of the scalar parameter θ.

Theorem: Subject to mild regularity conditions the maximum likelihood estimator φ̇(X) of a real valued function φ(θ) of θ, where X is a random sample of size n from the distribution with p.m/d.f. f(x|θ), is

1. weakly consistent i.e.

$$\Pr\left(\left|\dot\varphi(\mathbf{X}) - \varphi(\theta_0)\right| \le \varepsilon\right) \to 1 \ \text{ as } n \to \infty$$

however small the value ε > 0, where θ0 is the true value of the θ parameter,

2. asymptotically most efficient, unbiased and Normally distributed i.e.

$$\dot\varphi(\mathbf{X}) \sim N\left(\varphi(\theta_0),\ \frac{[\varphi'(\theta_0)]^2}{I(\theta_0)}\right) \ \text{ as } n \to \infty,$$

where I(θ0) is the sample Fisher information evaluated at θ = θ0 and φ′(θ0) is the derivative dφ(θ)/dθ evaluated at θ = θ0.


The proof of this theorem can be obtained by adjusting the proof of the previous theorem or by using (a) the invariance principle of maximum likelihood estimation, (b) the last theorem and (c) the fact that, given a real valued function φ(θ) of θ, the Fisher Information

$$I(\varphi) = I(\theta)\Big/\left[\frac{d\varphi(\theta)}{d\theta}\right]^2.$$

4.3 Asymptotic confidence intervals for θ

4.3.1 Asymptotic confidence intervals using m.l.e.

Let X1, X2, . . . , Xn be a random sample from the N(0, θ) distribution with n large. The sample joint p.d.f. is

$$f_{\mathbf{X}}(\mathbf{x}|\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\theta}}\exp\left(-\frac{x_i^2}{2\theta}\right) = (2\pi\theta)^{-n/2}\exp\left(-\frac{1}{2\theta}\sum_{i=1}^{n} x_i^2\right)$$

and

$$\log f_{\mathbf{X}}(\mathbf{x}|\theta) = -\frac{n}{2}\log\theta - \frac{1}{2\theta}\sum_{i=1}^{n} x_i^2 + \text{constant}$$

Hence

$$S(\mathbf{x}) = \frac{\partial}{\partial\theta}\log f_{\mathbf{X}}(\mathbf{x}|\theta) = -\frac{n}{2\theta} + \frac{\sum_{i=1}^{n} x_i^2}{2\theta^2}$$

and equating to zero we get $\dot\theta(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i^2$ as the m.l.e. of θ. Further

$$\frac{\partial^2}{\partial\theta^2}\log f_{\mathbf{X}}(\mathbf{x}|\theta) = \frac{n}{2\theta^2} - \frac{\sum_{i=1}^{n} x_i^2}{\theta^3}$$


so that

$$\begin{aligned} I(\theta) &= -E\left(\frac{\partial^2}{\partial\theta^2}\log f_{\mathbf{X}}(\mathbf{X}|\theta)\right) = E\left(-\frac{n}{2\theta^2} + \frac{\sum_{i=1}^{n} X_i^2}{\theta^3}\right) \\ &= -\frac{n}{2\theta^2} + \frac{\sum_{i=1}^{n} E(X_i^2)}{\theta^3} = -\frac{n}{2\theta^2} + \frac{n\, E(X_1^2)}{\theta^3} \\ &= -\frac{n}{2\theta^2} + \frac{n\,\mathrm{Var}(X_1)}{\theta^3} \qquad \text{since } E(X_1) = 0 \\ &= -\frac{n}{2\theta^2} + \frac{n\theta}{\theta^3} = \frac{n}{2\theta^2} \end{aligned}$$

Since n is assumed to be large we therefore have from theorem 4.2.1 that the m.l. estimator

$$\frac{\sum_{i=1}^{n} X_i^2}{n} \sim N\left(\theta, \frac{1}{I(\theta)}\right) = N\left(\theta, \frac{2\theta^2}{n}\right)$$

asymptotically. Hence, approximately,

$$\Pr\left(-1.96 \le \frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \theta}{\theta\sqrt{\frac{2}{n}}} \le 1.96\right) = 0.95 \qquad (4.9)$$

i.e.

$$\Pr\left(\left(1 - 1.96\sqrt{\tfrac{2}{n}}\right)\theta \le \frac{1}{n}\sum_{i=1}^{n} X_i^2 \le \left(1 + 1.96\sqrt{\tfrac{2}{n}}\right)\theta\right) = 0.95$$

i.e.

$$\Pr\left(\frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2}{1 + 1.96\sqrt{\frac{2}{n}}} \le \theta \le \frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2}{1 - 1.96\sqrt{\frac{2}{n}}}\right) = 0.95 \qquad (4.10)$$

Equation 4.10 states that the random interval

$$\left(\frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2}{1 + 1.96\sqrt{\frac{2}{n}}},\ \frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2}{1 - 1.96\sqrt{\frac{2}{n}}}\right) \qquad (4.11)$$

has probability 95% of positioning itself so that it includes the fixed but unknown value of the parameter θ. It therefore constitutes a 95% confidence interval estimator of θ.
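A sketch computing interval 4.11 from simulated N(0, θ) data (the true θ and n are illustrative choices):

```python
# Sketch: the 95% interval (4.11) for theta in the N(0, theta) model.
import numpy as np

rng = np.random.default_rng(7)
n, theta_true = 400, 4.0                        # illustrative values
x = rng.normal(0.0, np.sqrt(theta_true), size=n)

theta_dot = (x ** 2).mean()                     # m.l.e. of theta
half = 1.96 * np.sqrt(2 / n)
print(theta_dot / (1 + half), theta_dot / (1 - half))   # lower, upper limits
```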


[Figure 4.5: Percentile of the standard Normal distribution. The standard Normal p.d.f. plotted against z; the point z_{α/2} cuts off probability α/2 in the upper tail, leaving 1 − α/2 below.]

It is instructive to note how this confidence interval was obtained starting from equation 4.9. A probabilistic statement on the standardised m.l.e.

$$\frac{\dot\theta(\mathbf{X}) - \theta}{1/\sqrt{I(\theta)}} = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \theta}{\theta\sqrt{\frac{2}{n}}}$$

was seen, after some re-arrangements, to define a random interval (i.e. a region whose position was random) which had a certain probability of positioning itself so that it contains within it the unknown value of the parameter θ. Generalising this approach provides us with a means of obtaining confidence intervals as follows. Since from theorem 4.2.1 we have, approximately for large sample sizes, that $\dot\theta(\mathbf{X}) \sim N\left(\theta, \frac{1}{I(\theta)}\right)$ i.e.

$$\frac{\dot\theta(\mathbf{X}) - \theta}{1/\sqrt{I(\theta)}} = \left(\dot\theta(\mathbf{X}) - \theta\right)\sqrt{I(\theta)} \sim N(0, 1)$$

it follows that

$$\Pr\left(-z_{\alpha/2} \le \left(\dot\theta(\mathbf{X}) - \theta\right)\sqrt{I(\theta)} \le z_{\alpha/2}\right) = 1 - \alpha \qquad (4.12)$$

where z_{α/2} is such that Φ(z_{α/2}) = 1 − α/2, Φ being the distribution function of the N(0, 1) distribution (see figure 4.5).


The inequality

$$-z_{\alpha/2} \le \left(\dot\theta(\mathbf{X}) - \theta\right)\sqrt{I(\theta)} \le z_{\alpha/2} \qquad (4.13)$$

in 4.12 defines a random region CR(X) in the parameter space Θ and equation 4.12 can now be interpreted as

Pr (CR (X) contains θ) = 1 − α

i.e. the probability that CR(X) positions itself so that it contains within it the value of the parameter θ is 1 − α. Hence CR(X) constitutes a 100(1 − α)% confidence region for the true value of θ. In most cases, as in the example at the start of this section, it will be easy to identify the region CR(X). There will be occasions, however, when it will be difficult to identify the region CR(X) from equation 4.13. If it is not possible to do so, then an approximation to CR(X) can be obtained by evaluating the Fisher information in 4.13 at the m.l.e. θ̇ instead of at θ. In that event 4.13 becomes

$$-z_{\alpha/2} \le \left(\dot\theta(\mathbf{X}) - \theta\right)\sqrt{I(\dot\theta)} \le z_{\alpha/2}$$

i.e.

$$\dot\theta(\mathbf{X}) - \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}} \le \theta \le \dot\theta(\mathbf{X}) + \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}}$$

i.e. the interval

$$\left(\dot\theta(\mathbf{X}) - \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}},\ \dot\theta(\mathbf{X}) + \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}}\right) \qquad (4.14)$$

is an approximate 100 (1 − α) % confidence interval estimator of θ.
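Interval 4.14 is easy to wrap as a reusable helper; the following sketch (the helper name and the worked numbers are illustrative, not from the notes) computes θ̇ ∓ z_{α/2}/√I(θ̇):

```python
# Sketch: the generic interval (4.14) as a reusable helper,
# theta_dot -/+ z_{alpha/2} / sqrt(I(theta_dot)).
from math import sqrt
from scipy.stats import norm

def wald_interval(theta_dot, fisher_info, alpha=0.05):
    """Approximate 100(1 - alpha)% CI from the m.l.e. and I(theta_dot)."""
    z = norm.ppf(1 - alpha / 2)                 # z_{alpha/2}
    half = z / sqrt(fisher_info)
    return theta_dot - half, theta_dot + half

# e.g. an exponential sample with xbar = 2.0 and n = 100 gives
# theta_dot = 0.5 and I(theta_dot) = n / theta_dot**2 = 400:
print(wald_interval(0.5, 400.0))
```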

Example: Continuing with the last example and introducing the approximation I(θ̇) in 4.9, we see that the 95% confidence interval of θ obtained in 4.11 can be approximated by the interval

$$\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - 1.96\sqrt{\frac{2}{n}}\cdot\frac{1}{n}\sum_{i=1}^{n} X_i^2,\ \frac{1}{n}\sum_{i=1}^{n} X_i^2 + 1.96\sqrt{\frac{2}{n}}\cdot\frac{1}{n}\sum_{i=1}^{n} X_i^2\right)$$


i.e.

$$\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\left(1 - 1.96\sqrt{\frac{2}{n}}\right),\ \frac{1}{n}\sum_{i=1}^{n} X_i^2\left(1 + 1.96\sqrt{\frac{2}{n}}\right)\right).$$

4.3.2 Asymptotic confidence intervals using the score statistic

We have seen in the corollary to theorem 4.2.1 that for large sample sizes

$$S(\mathbf{X}) \sim N(0, I(\theta)) \quad\text{i.e.}\quad \frac{S(\mathbf{X})}{\sqrt{I(\theta)}} \sim N(0, 1).$$

Hence we have, approximately, that

Pr

(

−zα/2 ≤S (X)√

I (θ)≤ zα/2

)

= 1 − α (4.15)

Once again the inequality

$$-z_{\alpha/2} \le \frac{S(\mathbf{X})}{\sqrt{I(\theta)}} \le z_{\alpha/2} \qquad (4.16)$$

defines a region CR(X) in the parameter space Θ whose position is determined by the random vector X. Its position is therefore deemed to be random. Equation 4.15 is therefore interpreted to say that there is approximately probability (1 − α) that the random region CR(X) will position itself so that it contains within it the unknown value of the parameter θ. Hence CR(X) constitutes a 100(1 − α)% confidence interval estimator of the parameter θ. If the region CR(X) is difficult to identify from the inequality 4.16 then it can be found approximately by replacing I(θ) with I(θ̇) in 4.16.

Example: Let X1, X2, . . . , Xn be a random sample from the Poisson distribution with parameter θ. Find an approximate 95% confidence interval for θ.


Solution: The joint p.m.f. of the sample is

$$f_{\mathbf{X}}(\mathbf{x}|\theta) = \prod_{i=1}^{n}\frac{e^{-\theta}\theta^{x_i}}{x_i!} = \frac{e^{-n\theta}\,\theta^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$$

Hence

$$\log f_{\mathbf{X}}(\mathbf{x}|\theta) = -n\theta + \sum_{i=1}^{n} x_i\log\theta + \text{constant}$$

and

$$S(\mathbf{X}) = \frac{\partial}{\partial\theta}\log f_{\mathbf{X}}(\mathbf{X}|\theta) = -n + \frac{\sum_{i=1}^{n} X_i}{\theta}$$

Further

$$I(\theta) = E\left(-\frac{\partial^2}{\partial\theta^2}\log f_{\mathbf{X}}(\mathbf{x}|\theta)\right) = E\left(\frac{\sum_{i=1}^{n} X_i}{\theta^2}\right) = \frac{n\, E(X_1)}{\theta^2} = \frac{n\theta}{\theta^2} = \frac{n}{\theta}$$

Thus for large n we have, approximately,

$$\Pr\left(-1.96 \le \frac{\frac{\sum_{i=1}^{n} X_i}{\theta} - n}{\sqrt{\frac{n}{\theta}}} \le 1.96\right) = 0.95 \qquad (4.17)$$

i.e.

$$\Pr\left(-1.96 \le \frac{\sqrt{n}\left(\bar{X} - \theta\right)}{\sqrt{\theta}} \le 1.96\right) = 0.95$$

with $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, or equivalently

$$\Pr\left(\frac{n\left(\bar{X} - \theta\right)^2}{\theta} \le 1.96^2\right) = 0.95$$

i.e.

$$\Pr\left(n\left(\bar{X} - \theta\right)^2 \le 1.96^2\,\theta\right) = 0.95$$

or

$$\Pr\left(n\bar{X}^2 - \left(2n\bar{X} + 1.96^2\right)\theta + n\theta^2 \le 0\right) = 0.95$$


[Figure 4.6: The interval between the roots of the quadratic nθ² − (2nX̄ + 1.96²)θ + nX̄² is an approximate 95% confidence interval for θ. The value of the quadratic in θ is plotted; it is negative between the roots θ1 and θ2.]

Thus there is probability approximately 95% that the quadratic

$$n\theta^2 - \left(2n\bar{X} + 1.96^2\right)\theta + n\bar{X}^2$$

with random coefficients (i.e. with random position) will position itself so that the true value of the parameter θ falls in the region for which the quadratic is negative, i.e. there is probability approximately 95% that the quadratic will position itself so that the true value of the parameter θ is between the roots θ1 and θ2 of the quadratic, i.e.

Pr (θ1 (X) ≤ θ ≤ θ2 (X)) = 0.95

Hence (θ1 (X) , θ2 (X)) is an approximate 95% Confidence interval for θ. But

$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \frac{\left(2n\bar{X} + 1.96^2\right) \pm \sqrt{\left(2n\bar{X} + 1.96^2\right)^2 - 4n^2\bar{X}^2}}{2n} = \left(\bar{X} + \frac{1.96^2}{2n}\right) \pm \frac{1.96}{\sqrt{2n}}\sqrt{2\bar{X} + \frac{1.96^2}{2n}}$$

Thus

$$\left(\left(\bar{X} + \frac{1.96^2}{2n}\right) - \frac{1.96}{\sqrt{2n}}\sqrt{2\bar{X} + \frac{1.96^2}{2n}},\ \left(\bar{X} + \frac{1.96^2}{2n}\right) + \frac{1.96}{\sqrt{2n}}\sqrt{2\bar{X} + \frac{1.96^2}{2n}}\right)$$


is a 95% Confidence Interval for θ.

An approximation to this confidence interval could be obtained by replacing in 4.17 the expression for I(θ) by I(θ̇). Since $\dot\theta = \frac{1}{n}\sum_{i=1}^{n} X_i$, an approximation to 4.17 is

$$\Pr\left(-1.96 \le \frac{\frac{\sum_{i=1}^{n} X_i}{\theta} - n}{\sqrt{\dfrac{n}{\frac{1}{n}\sum_{i=1}^{n} X_i}}} \le 1.96\right) = 0.95$$

i.e.

$$\Pr\left(n - \frac{1.96\,n}{\sqrt{\sum_{i=1}^{n} X_i}} \le \frac{\sum_{i=1}^{n} X_i}{\theta} \le n + \frac{1.96\,n}{\sqrt{\sum_{i=1}^{n} X_i}}\right) = 0.95$$

or

$$\Pr\left(\frac{\sum_{i=1}^{n} X_i}{n\left(1 + \frac{1.96}{\sqrt{\sum_{i=1}^{n} X_i}}\right)} \le \theta \le \frac{\sum_{i=1}^{n} X_i}{n\left(1 - \frac{1.96}{\sqrt{\sum_{i=1}^{n} X_i}}\right)}\right) = 0.95$$

i.e.

$$\Pr\left(\frac{\bar{X}}{1 + \frac{1.96}{\sqrt{n\bar{X}}}} \le \theta \le \frac{\bar{X}}{1 - \frac{1.96}{\sqrt{n\bar{X}}}}\right) = 0.95$$

i.e.

$$\Pr\left(\frac{\sqrt{n}\,\bar{X}^{3/2}}{\sqrt{n\bar{X}} + 1.96} \le \theta \le \frac{\sqrt{n}\,\bar{X}^{3/2}}{\sqrt{n\bar{X}} - 1.96}\right) = 0.95$$

Hence

$$\left(\frac{\sqrt{n}\,\bar{X}^{3/2}}{\sqrt{n\bar{X}} + 1.96},\ \frac{\sqrt{n}\,\bar{X}^{3/2}}{\sqrt{n\bar{X}} - 1.96}\right)$$

is an approximate 95% Confidence interval for θ.
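As a closing sketch (simulated Poisson counts; the true mean is an illustrative choice), the score interval obtained from the quadratic's roots and its plug-in approximation can be computed side by side:

```python
# Sketch: the two approximate 95% intervals for a Poisson mean -- the
# score interval (roots of the quadratic) and its plug-in approximation.
import numpy as np

rng = np.random.default_rng(8)
x = rng.poisson(lam=3.0, size=150)              # illustrative sample
n, xbar, z = len(x), x.mean(), 1.96

# Score interval: roots of n*theta^2 - (2n*xbar + z^2)*theta + n*xbar^2
centre = xbar + z ** 2 / (2 * n)
half = (z / np.sqrt(2 * n)) * np.sqrt(2 * xbar + z ** 2 / (2 * n))
print("score:  ", centre - half, centre + half)

# Plug-in approximation using I(theta_dot) = n / xbar
s = np.sqrt(n * xbar)
print("plug-in:", np.sqrt(n) * xbar ** 1.5 / (s + z),
      np.sqrt(n) * xbar ** 1.5 / (s - z))
```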