BTRY 4090 / STSCI 4090, Spring 2010
Theory of Statistics
Instructor: Ping Li
Department of Statistical Science
Cornell University
General Information
• Lectures: Tue, Thu 10:10-11:25 am, Stimson Hall G01
• Section: Mon 2:55 - 4:10 pm, Warren Hall 131
• Instructor: Ping Li, [email protected]
Office hours: Tue, Thu 11:25 am - 12 pm, Comstock Hall 1192
• TA: Xiao Luo, [email protected]. Office hours:
(1) Mon, 4:10 - 5:10 pm, Warren Hall 131;
(2) Wed, 2:30 - 3:30 pm, Comstock Hall 1181.
• Prerequisites: BTRY 4080 or equivalent
• Textbook: Rice, Mathematical Statistics and Data Analysis, 3rd edition
• Exams:
– Prelim 1: In Class, Feb. 25, 2010
– Prelim 2: In Class, April 8, 2010
– Final Exam: Warren Hall 145, 2pm - 4:30pm, May 13, 2010
– Policy: Closed book, closed notes
• Programming: Some programming assignments. You can either use Matlab
or R. For practice, please download the Matlab examples in 4080 lecture
notes.
• Homework: Weekly
– Please turn in your homework either in class or to BSCB front desk
(Comstock Hall, 1198).
– No late homework will be accepted.
– Before computing your overall homework grade, the assignment with the
lowest grade (if ≥ 25%) will be dropped, the one with the second lowest
grade (if ≥ 50%) will also be dropped.
– It is the students’ responsibility to keep copies of the submitted homework.
• Grading: Two formulas
1. Homework: 30% + Two Prelims: 35% + Final: 35%
2. Homework: 30% + Two Prelims: 25% + Final: 45%
Your grade is whichever is higher.
• Course Letter Grade Assignment
A ≈ 90% (in the absolute scale)
C ≈ 60% (in the absolute scale)
In borderline cases, participation in section and class interactions will be used as
a determining factor.
Syllabus
Topic (Textbook)
Random number generation
Probability, Random Variables, Joint Distributions, Expected Values (Chapters 1-4)
Limit Theorems (Chapter 5)
Distributions Derived from the Normal Distribution (Chapter 6)
Estimation of Parameters and Fitting of Probability Distributions (Chapter 8)
Testing Hypothesis and Assessing Goodness of Fit (Chapter 9)
Comparing Two Samples (Chapter 11)
The Analysis of Categorical Data (Chapter 13)
Linear Least Squares (Chapter 14)
Chapters 1 to 4: Mostly Reviews
• Random number generation
• The method of random projections: a real example of using probability to solve computationally intensive (or infeasible) problems.
• Capture/recapture method: an example of discrete probability and an introduction to parameter estimation using maximum likelihood.
• Conditional expectations, bivariate normal, and random projections
• Moment generating functions and random projections
Nonuniform Sampling by Inversion
The goal: sample X from a distribution F(x).
Inverse transform sampling:
• Sample U ∼ Uniform(0, 1).
• Output X = F^{-1}(U).
Proof:
Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x) = Pr(U ≤ F(x)) = F(x)
Limitation: we need a closed-form F^{-1}, but many common distributions (e.g., the normal) do not have a closed-form F^{-1}.
Examples of Inversion Transform Sampling
• X ∼ Exponential(λ), i.e., F(x) = 1 − e^{−λx}, x ≥ 0.
Let U ∼ Uniform(0, 1); then log(1 − U)/(−λ) ∼ Exponential(λ).
• X ∼ Pareto(α), i.e., F(x) = 1 − 1/x^α, x ≥ 1.
Let U ∼ Uniform(0, 1); then 1/(1 − U)^{1/α} ∼ Pareto(α).
A small trick:
If U ∼ Uniform(0, 1), then 1 − U ∼ Uniform(0, 1).
Thus, we can replace 1 − U by U.
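As a quick check (a minimal Matlab sketch, not from the original slides; the sample size and the parameter values lam and al are illustrative), the two examples can be implemented in a few lines:

n = 10^6; lam = 2; al = 3;          % illustrative parameters
U = rand(n,1);                      % U ~ Uniform(0,1)
Xexp = -log(U)/lam;                 % Exponential(lam), using the 1-U -> U trick
Xpar = U.^(-1/al);                  % Pareto(al), using the same trick
mean(Xexp)                          % should be close to 1/lam = 0.5
mean(Xpar)                          % should be close to al/(al-1) = 1.5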
The Box-Muller Transform
Let U1 and U2 be i.i.d. samples from Uniform(0, 1). Then
N1 = √(−2 log U1) cos(2πU2)
N2 = √(−2 log U1) sin(2πU2)
are two i.i.d. samples from the standard normal N(0, 1).
Q: How to generate non-standard normals?
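A minimal Matlab sketch of the transform (illustrative, not from the original slides); the last line previews one answer to the question: shift and scale, since µ + σN(0, 1) ∼ N(µ, σ²).

n = 10^6;
U1 = rand(n,1); U2 = rand(n,1);
N1 = sqrt(-2*log(U1)).*cos(2*pi*U2);   % N1, N2 are i.i.d. N(0,1)
N2 = sqrt(-2*log(U1)).*sin(2*pi*U2);
[mean(N1) var(N1)]                     % approximately 0 and 1
X = 3 + 2*N1;                          % X ~ N(3, 4), a non-standard normal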
An Introduction to Random Projections
Many applications require a data matrix A ∈ R^{n×D}.
For example, a term-by-document matrix may contain n = 10^10 documents (web pages) and D = 10^6 single words, or D = 10^12 double words (bi-gram model), or D = 10^18 triple words (tri-gram model).
Many matrix operations boil down to computing how close (or how far) two rows (or columns) of the matrix are. For example, linear least squares: (A^T A)^{-1} A^T y.
Challenges: the matrix may be too large to store, or computing A^T A is too expensive.
Random Projections: Replace A by B = A × R
R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.
k is very small (e.g., k = 50 ∼ 100), but n and D are very large.
B approximately preserves the Euclidean distances and dot products between any two rows of A, up to the scaling factor k. In particular, E(BB^T) = k AA^T.
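A minimal Matlab sketch of the idea (the sizes n, D, k below are illustrative assumptions, not from the slides); it compares the squared distance between two rows of A with its projected counterpart, including the 1/k scaling:

n = 5; D = 10000; k = 100;
A = randn(n, D);
R = randn(D, k);                   % i.i.d. N(0,1) entries
B = A*R;
norm(A(1,:)-A(2,:))^2              % true squared distance between rows 1 and 2
norm(B(1,:)-B(2,:))^2/k            % projected estimate (note the 1/k scaling)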
Consider the first two rows of A: u1, u2 ∈ R^D,
u1 = (u1,1, u1,2, u1,3, ..., u1,i, ..., u1,D), u2 = (u2,1, u2,2, u2,3, ..., u2,i, ..., u2,D),
and the first two rows of B: v1, v2 ∈ R^k,
v1 = (v1,1, v1,2, v1,3, ..., v1,j, ..., v1,k), v2 = (v2,1, v2,2, v2,3, ..., v2,j, ..., v2,k),
v1 = R^T u1, v2 = R^T u2.
R = {r_ij}, i = 1 to D and j = 1 to k, with r_ij ∼ N(0, 1).
v1 = R^T u1, v2 = R^T u2. R = {r_ij}, i = 1 to D and j = 1 to k.
v1,j = Σ_{i=1}^D r_ij u1,i, v2,j = Σ_{i=1}^D r_ij u2,i,
v1,j − v2,j = Σ_{i=1}^D r_ij [u1,i − u2,i]
The squared Euclidean norm of u1: Σ_{i=1}^D |u1,i|².
The squared Euclidean norm of v1: Σ_{j=1}^k |v1,j|².
The squared Euclidean distance between u1 and u2: Σ_{i=1}^D |u1,i − u2,i|².
The squared Euclidean distance between v1 and v2: Σ_{j=1}^k |v1,j − v2,j|².
What are we hoping for?
• Σ_{j=1}^k |v1,j|² ≈ Σ_{i=1}^D |u1,i|², as close as possible.
• Σ_{j=1}^k |v1,j − v2,j|² ≈ Σ_{i=1}^D |u1,i − u2,i|², as close as possible.
• k should be as small as possible, for a specified level of accuracy.
Unbiased Estimator of d and m1, m2
We need a good estimator: unbiased and with small variance.
Note that the estimation problem is essentially the same for d and for m1 (or m2). Thus, we can focus on estimating m1.
By random projections, we have k i.i.d. samples (why?)
v1,j = Σ_{i=1}^D r_ij u1,i, j = 1, 2, ..., k
Because r_ij ∼ N(0, 1), we can develop estimators and analyze their properties using the normal and χ² distributions. But we can also solve the problem without using normals.
Unbiased Estimator of m1
v1,j = Σ_{i=1}^D r_ij u1,i, j = 1, 2, ..., k, (r_ij ∼ N(0, 1))
To get started, let's first look at the moments:
E(v1,j) = E(Σ_{i=1}^D r_ij u1,i) = Σ_{i=1}^D E(r_ij) u1,i = 0
E(v1,j²) = E[Σ_{i=1}^D r_ij u1,i]²
= E[Σ_{i=1}^D r_ij² u1,i² + Σ_{i≠i′} r_ij u1,i r_i′j u1,i′]
= Σ_{i=1}^D E(r_ij²) u1,i² + Σ_{i≠i′} E(r_ij r_i′j) u1,i u1,i′
= Σ_{i=1}^D u1,i² + 0 = m1
Great! m1 is exactly what we are after.
Since we have k i.i.d. samples v1,j, we can simply average them to estimate m1.
An unbiased estimator of the squared Euclidean norm m1 = Σ_{i=1}^D |u1,i|²:
m̂1 = (1/k) Σ_{j=1}^k |v1,j|²,
E(m̂1) = (1/k) Σ_{j=1}^k E(|v1,j|²) = (1/k) Σ_{j=1}^k m1 = m1
We need to analyze its variance to assess its accuracy.
Recall, our goal is to keep k (the number of projections) as small as possible.
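A small Monte Carlo sketch (illustrative sizes, not from the slides) that checks the unbiasedness numerically and previews the variance formula derived on the next slides:

D = 1000; k = 50; T = 10^4;        % T = number of repetitions
u1 = randn(D,1); m1 = sum(u1.^2);
m1_hat = zeros(T,1);
for t = 1:T
v1 = randn(k,D)*u1;                % v1(j) = sum_i r_ij u1(i)
m1_hat(t) = mean(v1.^2);           % the estimator (1/k) sum_j |v1,j|^2
end
mean(m1_hat)/m1                    % close to 1: unbiased
var(m1_hat)/(2*m1^2/k)             % close to 1: Var = 2 m1^2/k (derived next)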
Var(m̂1) = (1/k²) Σ_{j=1}^k Var(|v1,j|²) = (1/k) Var(|v1,j|²)
= (1/k) [E(|v1,j|⁴) − E²(|v1,j|²)]
= (1/k) [E(Σ_{i=1}^D r_ij u1,i)⁴ − m1²]
We can compute E(Σ_{i=1}^D r_ij u1,i)⁴ directly, but it would be much easier if we take advantage of the χ² distribution.
χ² Distribution
If X ∼ N(0, 1), then Y = X² has a chi-square distribution with one degree of freedom, denoted by χ²_1.
If X_j, j = 1 to k, are i.i.d. normal, X_j ∼ N(0, 1), then Y = Σ_{j=1}^k X_j² follows a chi-square distribution with k degrees of freedom, denoted by χ²_k.
If Y ∼ χ²_k, then
E(Y) = k, Var(Y) = 2k
Recall, after random projections,
v1,j = Σ_{i=1}^D r_ij u1,i, j = 1, 2, ..., k, r_ij ∼ N(0, 1)
Therefore, v1,j also has a normal distribution:
v1,j ∼ N(0, Σ_{i=1}^D |u1,i|²) = N(0, m1)
Equivalently, v1,j/√m1 ∼ N(0, 1).
Therefore,
[v1,j/√m1]² = v1,j²/m1 ∼ χ²_1, Var(v1,j²/m1) = 2, Var(v1,j²) = 2m1²
Now we can figure out the variance formula for random projections.
Var(m̂1) = (1/k) Var(|v1,j|²) = 2m1²/k
Implication:
Var(m̂1)/m1² = 2/k, independent of m1
√(Var(m̂1)/m1²) is known as the coefficient of variation.
———
We have solved the variance using χ²_1.
We can actually figure out the distribution of m̂1 using χ²_k.
m̂1 = (1/k) Σ_{j=1}^k |v1,j|², v1,j ∼ N(0, m1)
Because the v1,j's are i.i.d., we know
k m̂1/m1 = Σ_{j=1}^k (v1,j/√m1)² ∼ χ²_k (why?)
This will be useful for analyzing the error bound using probability inequalities.
We can also write down the moments of m̂1 directly using χ²_k.
Recall, if Y ∼ χ²_k, then E(Y) = k and Var(Y) = 2k
=⇒ E(k m̂1/m1) = k, Var(k m̂1/m1) = 2k
=⇒ Var(m̂1) = 2k m1²/k² = 2m1²/k
An unbiased estimator of the squared Euclidean distance d = Σ_{i=1}^D |u1,i − u2,i|²:
d̂ = (1/k) Σ_{j=1}^k |v1,j − v2,j|², k d̂/d ∼ χ²_k, Var(d̂) = 2d²/k.
These can be derived exactly the same way as for the estimator of m1.
Note the coefficient of variation for d̂:
Var(d̂)/d² = 2/k, independent of d,
meaning that the relative errors are pre-determined by k, a huge advantage.
More probability problems
• What is the error probability P(|d̂ − d| ≥ εd)?
• How large should k be?
• What about the inner (dot) product a = Σ_{i=1}^D u1,i u2,i?
An unbiased estimator of the inner product a = Σ_{i=1}^D u1,i u2,i:
â = (1/k) Σ_{j=1}^k v1,j v2,j,
E(â) = a
Var(â) = (m1 m2 + a²)/k
Proof:
v1,j v2,j = [Σ_{i=1}^D u1,i r_ij][Σ_{i=1}^D u2,i r_ij]
v1,j v2,j = [Σ_{i=1}^D u1,i r_ij][Σ_{i=1}^D u2,i r_ij]
= Σ_{i=1}^D u1,i u2,i r_ij² + Σ_{i≠i′} u1,i u2,i′ r_ij r_i′j
=⇒ E(v1,j v2,j) = Σ_{i=1}^D u1,i u2,i E[r_ij²] + Σ_{i≠i′} u1,i u2,i′ E[r_ij r_i′j]
= Σ_{i=1}^D u1,i u2,i · 1 + Σ_{i≠i′} u1,i u2,i′ · 0
= Σ_{i=1}^D u1,i u2,i = a
This proves the unbiasedness.
We first derive the variance of â using a complicated brute-force method; then we show a much simpler method using conditional expectation.
[v1,j v2,j]² = [Σ_{i=1}^D u1,i u2,i r_ij² + Σ_{i≠i′} u1,i u2,i′ r_ij r_i′j]²
= [Σ_{i=1}^D u1,i u2,i r_ij²]² + [Σ_{i≠i′} u1,i u2,i′ r_ij r_i′j]² + ...
= Σ_{i=1}^D [u1,i u2,i]² r_ij⁴ + 2 Σ_{i≠i′} u1,i u2,i u1,i′ u2,i′ [r_ij r_i′j]² + Σ_{i≠i′} [u1,i u2,i′]² [r_ij r_i′j]² + ...
Why can we ignore the rest of the terms (after taking expectations)?
Why can we ignore the rest of the terms (after taking expectations)?
Recall r_ij ∼ N(0, 1), i.i.d.:
E(r_ij) = 0, E(r_ij²) = 1, E(r_ij r_i′j) = E(r_ij)E(r_i′j) = 0
E(r_ij³) = 0, E(r_ij⁴) = 3, E(r_ij² r_i′j) = E(r_ij²)E(r_i′j) = 0
Therefore,
E[v1,j v2,j]² = Σ_{i=1}^D 3[u1,i u2,i]² + 2 Σ_{i≠i′} u1,i u2,i u1,i′ u2,i′ + Σ_{i≠i′} [u1,i u2,i′]²
But
a² = [Σ_{i=1}^D u1,i u2,i]² = Σ_{i=1}^D [u1,i u2,i]² + Σ_{i≠i′} u1,i u2,i u1,i′ u2,i′
m1 m2 = [Σ_{i=1}^D |u1,i|²][Σ_{i=1}^D |u2,i|²] = Σ_{i=1}^D [u1,i u2,i]² + Σ_{i≠i′} [u1,i u2,i′]²
Therefore,
E[v1,j v2,j]² = m1 m2 + 2a², Var[v1,j v2,j] = m1 m2 + a²
An unbiased estimator of the inner product a = Σ_{i=1}^D u1,i u2,i:
â = (1/k) Σ_{j=1}^k v1,j v2,j, E(â) = a
Var(â) = (m1 m2 + a²)/k
The coefficient of variation
√(Var(â)/a²) = √((m1 m2 + a²)/a² · (1/k)), not independent of a.
When the two vectors u1 and u2 are almost orthogonal, a ≈ 0,
=⇒ coefficient of variation ≈ ∞.
=⇒ random projections may not be good for estimating inner products.
The joint distribution of v1,j = Σ_{i=1}^D u1,i r_ij and v2,j = Σ_{i=1}^D u2,i r_ij:
E(v1,j) = 0, Var(v1,j) = Σ_{i=1}^D |u1,i|² = m1
E(v2,j) = 0, Var(v2,j) = Σ_{i=1}^D |u2,i|² = m2
Cov(v1,j, v2,j) = E(v1,j v2,j) − E(v1,j)E(v2,j) = a
v1,j and v2,j are jointly normal (bivariate normal):
(v1,j, v2,j)^T ∼ N(µ = (0, 0)^T, Σ = [m1 a; a m2])
(What if we know m1 and m2 exactly? For example, by one scan of the data matrix.)
Summary of Random Projections
Random Projections: Replace A by B = A × R
• An elegant method, interesting probability exercise.
• Suitable for approximating Euclidean distances in massive, dense, and
heavy-tailed (some entries are excessively large) data matrices.
• It does not take advantage of data sparsity.
• We will come back to study its error probability bounds (and other things).
Capture/Recapture Method: Section 1.4, Example I
The method may be used to estimate the size of a wildlife population. Suppose that t animals are captured, tagged, and released. On a later occasion, m animals are captured, and it is found that r of them are tagged.
Assume the total population is N.
Q 1: What is the probability mass function P(N = n)?
Q 2: How large is the population N, estimated from m, r, and t?
Solution:
P(N = n) = C(t, r) C(n − t, m − r) / C(n, m)
To estimate N, we can choose the N = n such that L_n = P(N = n) is maximized.
L_n = [t!/(r!(t − r)!)] · [(n − t)!/((m − r)!(n − t − m + r)!)] / [n!/(m!(n − m)!)]
∝ [(n − t)!/(n − t − m + r)!] / [n!/(n − m)!]
= (n − t)!(n − m)! / ((n − t − m + r)! n!)
The method of maximum likelihood: find the n such that L_n is maximized,
L_n = (n − t)!(n − m)! / ((n − t − m + r)! n!)
If L_n has a global maximum, then it is equivalent to finding the n such that
g_n = L_n/L_{n−1} = (n − t)(n − m) / (n(n − t − m + r)) = 1
=⇒ n = mt/r
Indeed,
(n − t)(n − m) − n(n − t − m + r) = mt − nr,
which is > 0 when n < mt/r and < 0 when n > mt/r.
Thus, if n < mt/r, then g_n > 1 and L_n is increasing; if n > mt/r, then L_n is decreasing.
How to plot L_n?
L_n = (n − t)!(n − m)! / ((n − t − m + r)! n!) = (n − m)(n − m − 1)···(n − m − t + r + 1) / (n(n − 1)···(n − t + 1))
log L_n = Σ_{j=1}^{t−r} log(n − m − j + 1) − Σ_{i=1}^{t} log(n − i + 1)
[Figures: the likelihood L_n and the likelihood ratio g_n plotted against n, for t = 10, m = 20, r = 4; L_n peaks near n = mt/r = 50, where g_n crosses 1.]
Matlab code
function cap_recap(t, m, r);
n0 = max(t+m-r, m)+5;
j = 1:(t-r); i = 1:t;
for n = n0:5*n0
L(n-n0+1) = exp( sum(log(n-m+1-j)) - sum(log(n+1-i)) );
g(n-n0+1) = (n-t)*(n-m)/n/(n-t-m+r);
end
figure;
plot(n0:5*n0, L, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood');
title(['Likelihood (L_n): t = ' num2str(t) ' m = ' num2str(m) ' r = ' num2str(r)]);
figure;
plot(n0:5*n0, g, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood Ratio');
title(['Likelihood ratio (g_n): t = ' num2str(t) ' m = ' num2str(m) ' r = ' num2str(r)]);
The Bivariate Normal Distribution
The random variables X and Y have a bivariate normal distribution if, for constants µx, µy, σx > 0, σy > 0, −1 < ρ < 1, their joint density function is given, for all −∞ < x, y < ∞, by
f(x, y) = 1/(2πσxσy√(1−ρ²)) exp{ −1/(2(1−ρ²)) [ (x−µx)²/σx² + (y−µy)²/σy² − 2ρ(x−µx)(y−µy)/(σxσy) ] }
If X and Y are independent, then ρ = 0, and
f(x, y) = 1/(2πσxσy) exp{ −(1/2) [ (x−µx)²/σx² + (y−µy)²/σy² ] }
Denote that X and Y are jointly normal:
(X, Y)^T ∼ N(µ = (µx, µy)^T, Σ = [σx² ρσxσy; ρσxσy σy²])
X and Y are marginally normal:
X ∼ N(µx, σx²), Y ∼ N(µy, σy²)
X and Y are also conditionally normal:
X|Y = y ∼ N( µx + ρ(y − µy)σx/σy, (1 − ρ²)σx² )
Y|X = x ∼ N( µy + ρ(x − µx)σy/σx, (1 − ρ²)σy² )
Bivariate Normal and Random Projections
Recall B = A × R. v1 and v2, the first two rows of B, have k entries:
v1,j = Σ_{i=1}^D u1,i r_ij and v2,j = Σ_{i=1}^D u2,i r_ij.
v1,j and v2,j are bivariate normal:
(v1,j, v2,j)^T ∼ N(µ = (0, 0)^T, Σ = [m1 a; a m2])
m1 = Σ_{i=1}^D |u1,i|², m2 = Σ_{i=1}^D |u2,i|², a = Σ_{i=1}^D u1,i u2,i
Simplify calculations using conditional normality:
v1,j | v2,j ∼ N( (a/m2) v2,j, (m1 m2 − a²)/m2 )
E(v1,j v2,j)² = E( E( v1,j² v2,j² | v2,j ) ) = E( v2,j² E( v1,j² | v2,j ) )
= E( v2,j² [ (m1 m2 − a²)/m2 + ((a/m2) v2,j)² ] )
= m2 (m1 m2 − a²)/m2 + 3 m2² a²/m2²
= m1 m2 + 2a².
The unbiased estimator â = (1/k) Σ_{j=1}^k v1,j v2,j has variance
Var(â) = (1/k)(m1 m2 + a²)
Moment Generating Function (MGF)
Definition: for a random variable X, its moment generating function (MGF) is defined as
M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x) if X is discrete, or ∫_{−∞}^{∞} e^{tx} f(x) dx if X is continuous.
The MGF M_X(t) uniquely determines the distribution of X (when it exists in an open interval around zero).
MGF of Normal
Suppose X ∼ N(0, 1), i.e., f_X(x) = (1/√(2π)) e^{−x²/2}.
M_X(t) = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx
= ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2 + tx} dx
= ∫_{−∞}^{∞} (1/√(2π)) e^{−(x² − 2tx + t² − t²)/2} dx
= e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x−t)²/2} dx
= e^{t²/2}
Suppose Y ∼ N(µ, σ²).
Write Y = σX + µ, where X ∼ N(0, 1).
M_Y(t) = E[e^{tY}] = E[e^{µt + σtX}] = e^{µt} E[e^{σtX}]
We can view σt as another t′.
M_Y(t) = e^{µt} M_X(σt) = e^{µt} × e^{σ²t²/2} = e^{µt + (σ²/2)t²}
MGF of Chi-Square
If X_j, j = 1 to k, are i.i.d. N(0, 1), then
Y = Σ_{j=1}^k X_j² ∼ χ²_k, a chi-square distribution with k degrees of freedom.
What is the density function? Well, since the MGF uniquely determines the distribution, we can analyze the MGF first.
By the independence of the X_j,
M_Y(t) = E[e^{Yt}] = E[e^{t Σ_{j=1}^k X_j²}] = Π_{j=1}^k E[e^{t X_j²}] = (E[e^{t X_j²}])^k
E[e^{t X_j²}] = ∫_{−∞}^{∞} e^{tx²} (1/√(2π)) e^{−x²/2} dx
= ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2 + tx²} dx
= ∫_{−∞}^{∞} (1/√(2π)) e^{−x²(1−2t)/2} dx
= ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/(2σ²)} dx, (σ² = 1/(1 − 2t))
= σ ∫_{−∞}^{∞} (1/(√(2π)σ)) e^{−x²/(2σ²)} dx = σ
= 1/(1 − 2t)^{1/2}
M_Y(t) = (E[e^{t X_j²}])^k = 1/(1 − 2t)^{k/2}, (t < 1/2)
MGF for Random Projections
In random projections, the unbiased estimator d̂ = (1/k) Σ_{j=1}^k |v1,j − v2,j|² satisfies
k d̂/d = Σ_{j=1}^k |v1,j − v2,j|²/d ∼ χ²_k
Q: What is the MGF of d̂?
Solution:
M_d̂(t) = E(e^{d̂t}) = E(e^{[k d̂/d][dt/k]}) = (1 − 2dt/k)^{−k/2}
where 2dt/k < 1, i.e., t < k/(2d).
Moments and MGF
M_X(t) = E[e^{tX}]
=⇒ M′_X(t) = E[X e^{tX}]
=⇒ M^{(n)}_X(t) = E[X^n e^{tX}]
Setting t = 0,
E[X^n] = M^{(n)}_X(0)
Example: X ∼ χ²_k, M_X(t) = 1/(1 − 2t)^{k/2}.
M′(t) = (−k/2)(1 − 2t)^{−k/2−1}(−2) = k(1 − 2t)^{−k/2−1}
M″(t) = k(−k/2 − 1)(1 − 2t)^{−k/2−2}(−2) = k(k + 2)(1 − 2t)^{−k/2−2}
Therefore,
E(X) = M′(0) = k, E(X²) = M″(0) = k² + 2k
Var(X) = (k² + 2k) − k² = 2k.
Example: MGF and Moments of â in Random Projections
The unbiased estimator of the inner product: â = (1/k) Σ_{j=1}^k v1,j v2,j.
Using conditional expectation:
v1,j | v2,j ∼ N( (a/m2) v2,j, (m1 m2 − a²)/m2 ), v2,j ∼ N(0, m2)
For simplicity, let
x = v1,j, y = v2,j, µ = (a/m2) v2,j = (a/m2) y, σ² = (m1 m2 − a²)/m2
E(exp(v1,j v2,j t)) = E(exp(xyt)) = E( E( exp(xyt) | y ) )
Using the MGF of x|y ∼ N(µ, σ²):
E(exp(xyt) | y) = e^{µyt + (σ²/2)(yt)²}
E( E( exp(xyt) | y ) ) = E( e^{µyt + (σ²/2)(yt)²} )
µyt + (σ²/2)(yt)² = y² ( (a/m2)t + (σ²/2)t² )
Since y ∼ N(0, m2), we know y²/m2 ∼ χ²_1.
Using the MGF of χ²_1, we obtain
E( e^{µyt + (σ²/2)(yt)²} ) = E( e^{(y²/m2) m2((a/m2)t + (σ²/2)t²)} )
= ( 1 − 2m2( (a/m2)t + (σ²/2)t² ) )^{−1/2}
= ( 1 − 2at − (m1 m2 − a²)t² )^{−1/2}.
By independence,
M_â(t) = ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2}.
Now, we can use this MGF to calculate moments of â.
M_â(t) = ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2},
M_â^{(1)}(t) = (−k/2)[ ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2−1} ] × ( −2a/k − (m1 m2 − a²) 2t/k² )
The term in [...] will not matter after letting t = 0 (it equals 1).
Therefore,
E(â) = M_â^{(1)}(0) = (−k/2)(−2a/k) = a
Following a similar procedure, we can obtain
Var(â) = (m1 m2 + a²)/k
E(â − a)³ = (2a/k²)(3 m1 m2 + a²)
The centered third moment measures the skewness of the distribution and can be quite useful, for example, in testing normality.
Tail Probabilities
The tail probability P(X > t) is extremely important.
For example, in random projections,
P(|d̂ − d| ≥ εd)
is the probability that the difference (error) between the estimated Euclidean distance d̂ and the true distance d exceeds an ε fraction of the true distance d.
Q: Is it just the cumulative distribution function (CDF)?
Tail Probability Inequalities (Bounds)
P (X > t) ≤ ???
Reasons to study tail probability bounds:
• Even if the distribution of X is known, evaluating P (X > t) often requires
numerical methods.
• Often the exact distribution of X is unknown. Instead, we may know the
moments (mean, variance, MGF, etc).
• Theoretical reasons. For example, studying how fast the error decreases.
Several Tail Probability Inequalities (Bounds)
• Markov's Inequality. Only uses the first moment. The most basic.
• Chebyshev's Inequality. Only uses the second moment.
• Chernoff's Inequality. Uses the MGF. The most accurate, and popular among theorists.
Markov's Inequality: Theorem A in Section 4.1
If X is a random variable with P(X ≥ 0) = 1, and for which E(X) exists, then
P(X ≥ t) ≤ E(X)/t
Proof: assume X is continuous with probability density f(x).
E(X) = ∫_0^∞ x f(x) dx ≥ ∫_t^∞ x f(x) dx ≥ ∫_t^∞ t f(x) dx = t P(X ≥ t)
See the textbook for the proof assuming X is discrete.
Many extremely useful bounds can be obtained from Markov's inequality.
Markov's inequality: P(X ≥ t) ≤ E(X)/t. If t = kE(X), then
P(X ≥ t) = P(X ≥ kE(X)) ≤ 1/k
The error decreases at the rate of 1/k, which is too slow.
The original Markov's inequality only utilizes the first moment (hence its inaccuracy).
Chebyshev's Inequality: Theorem C in Section 4.1
Let X be a random variable with mean µ and variance σ². Then for any t > 0,
P(|X − µ| ≥ t) ≤ σ²/t²
Proof: let Y = (X − µ)² = |X − µ|² and w = t². Then
P(Y ≥ w) ≤ E(Y)/w = E(X − µ)²/w = σ²/w
Note that |X − µ|² ≥ t² ⟺ |X − µ| ≥ t. Therefore,
P(|X − µ| ≥ t) = P(|X − µ|² ≥ t²) ≤ σ²/t²
Chebyshev's inequality: P(|X − µ| ≥ t) ≤ σ²/t². If t = kσ, then
P(|X − µ| ≥ kσ) ≤ 1/k²
The error decreases at the rate of 1/k², which is faster than 1/k.
Chernoff Inequality
Ross, Proposition 8.5.2: if X is a random variable with finite MGF M_X(t), then for any ε > 0,
P(X ≥ ε) ≤ e^{−tε} M_X(t), for all t > 0
P(X ≤ ε) ≤ e^{−tε} M_X(t), for all t < 0
Application: one can choose the t that minimizes the upper bound. This usually leads to accurate probability bounds, which decrease exponentially fast.
Proof: use Markov's inequality.
For t > 0, because X > ε ⟺ e^{tX} > e^{tε} (a monotone transformation),
P(X > ε) = P(e^{tX} ≥ e^{tε}) ≤ E[e^{tX}]/e^{tε} = e^{−tε} M_X(t)
Tail Bounds of Normal Random Variables
X ∼ N(µ, σ²). Assume µ > 0. We need to know P(|X − µ| ≥ εµ) ≤ ??
Chebyshev's inequality:
P(|X − µ| ≥ εµ) ≤ σ²/(ε²µ²) = (1/ε²)[σ²/µ²]
The bound is not good enough, only decreasing at the rate of 1/ε².
Tail Bounds of Normal Using Chernoff's Inequality
Right tail bound P(X − µ ≥ εµ):
For any t > 0,
P(X − µ ≥ εµ) = P(X ≥ (1 + ε)µ)
≤ e^{−t(1+ε)µ} M_X(t)
= e^{−t(1+ε)µ} e^{µt + σ²t²/2}
= e^{−tεµ + σ²t²/2}
What's next? Since the inequality holds for any t > 0, we can choose the t that minimizes the upper bound.
Right tail bound P(X − µ ≥ εµ):
Choose t = t* to minimize g(t) = −tεµ + σ²t²/2.
g′(t) = −εµ + σ²t = 0 =⇒ t* = µε/σ² =⇒ g(t*) = −(ε²/2)(µ²/σ²)
Therefore,
P(X − µ ≥ εµ) ≤ e^{−(ε²/2)(µ²/σ²)}
decreasing at the rate of e^{−ε²}.
Left tail bound P(X − µ ≤ −εµ):
For any t < 0,
P(X − µ ≤ −εµ) = P(X ≤ (1 − ε)µ)
≤ e^{−t(1−ε)µ} M_X(t)
= e^{−t(1−ε)µ} e^{µt + σ²t²/2}
= e^{tεµ + σ²t²/2}
Choose t = t* = −µε/σ² to minimize tεµ + σ²t²/2. Therefore,
P(X − µ ≤ −εµ) ≤ e^{−(ε²/2)(µ²/σ²)}
Combine the left and right tail bounds:
P(|X − µ| ≥ εµ) = P(X − µ ≥ εµ) + P(X − µ ≤ −εµ) ≤ 2e^{−(ε²/2)(µ²/σ²)}
Sample Size Selection Using Tail Bounds
X_i ∼ N(µ, σ²), i.i.d., i = 1 to k.
An unbiased estimator of µ is µ̂:
µ̂ = (1/k) Σ_{i=1}^k X_i, µ̂ ∼ N(µ, σ²/k)
Choose k such that
P(|µ̂ − µ| ≥ εµ) ≤ δ
———
We already know P(|µ̂ − µ| ≥ εµ) ≤ 2e^{−(ε²/2) µ²/(σ²/k)}.
It suffices to select the k such that
2e^{−(ε²/2) k µ²/σ²} ≤ δ
=⇒ e^{−(ε²/2) k µ²/σ²} ≤ δ/2
=⇒ −(ε²/2) k µ²/σ² ≤ log(δ/2)
=⇒ (ε²/2) k µ²/σ² ≥ −log(δ/2)
=⇒ k ≥ [−log(δ/2)] (2/ε²)(σ²/µ²)
Suppose X_i ∼ N(µ, σ²), i = 1 to k, i.i.d. Then µ̂ = (1/k) Σ_{i=1}^k X_i is an unbiased estimator of µ. If the sample size k satisfies
k ≥ [log(2/δ)] (2/ε²)(σ²/µ²),
then with probability at least 1 − δ, the estimate µ̂ is within a 1 ± ε factor of the true µ, i.e., |µ̂ − µ| ≤ εµ.
What affects the sample size k?
k ≥ [log(2/δ)] (2/ε²)(σ²/µ²)
• δ: level of significance. Lower δ → more significant → larger k.
• σ²/µ²: noise-to-signal ratio. Higher σ²/µ² → larger k.
• ε: accuracy. Lower ε → more accurate → larger k.
• The evaluation criterion. For example, |µ̂ − µ| ≤ εµ, or |µ̂ − µ| ≤ ε? (A numerical sketch follows below.)
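A minimal Matlab sketch of the sample size rule (the values of µ, σ, ε, δ below are illustrative assumptions, not from the slides):

mu = 10; sigma = 5; ep = 0.1; delta = 0.05;
k = ceil(log(2/delta)*(2/ep^2)*sigma^2/mu^2)     % required sample size
T = 10^5; X = normrnd(mu, sigma, T, k);          % T repeated experiments
mu_hat = mean(X, 2);
mean(abs(mu_hat - mu) >= ep*mu)                  % empirical failure rate, at most delta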
Exercise: in random projections, d̂ is the unbiased estimator of the Euclidean distance d.
• Prove the exponential tail bound:
P(|d̂ − d| ≥ εd) ≤ e^{???}
• Determine the sample size such that
P(|d̂ − d| ≥ εd) ≤ δ
Section 4.6: Approximate Methods
Suppose we know E(X) = µ_X and Var(X) = σ²_X. Suppose Y = g(X).
What about E(Y) and Var(Y)?
In many cases, analytical solutions are not available (or too complicated).
How about Y = aX? Easy!
We know E(Y) = aE(X) = aµ_X, Var(Y) = a²σ²_X.
The Delta Method
General idea: a Taylor expansion of Y = g(X) about X = µ_X:
Y = g(X) = g(µ_X) + (X − µ_X)g′(µ_X) + (1/2)(X − µ_X)²g″(µ_X) + ...
Taking expectations on both sides:
E(Y) = g(µ_X) + E(X − µ_X)g′(µ_X) + (1/2)E(X − µ_X)²g″(µ_X) + ...
=⇒ E(Y) ≈ g(µ_X) + (σ²_X/2)g″(µ_X)
What about the variance?
Use the linear approximation only:
Y = g(X) = g(µ_X) + (X − µ_X)g′(µ_X) + ...
Var(Y) ≈ [g′(µ_X)]² σ²_X
How good are these approximations? It depends on the nonlinearity of g(X) about µ_X.
Example B in Section 4.6
X ∼ U(0, 1), Y = √X. Compute E(Y) and Var(Y).
Exact method:
E(Y) = ∫_0^1 √x dx = [1/(1/2 + 1)] x^{1/2+1} |_0^1 = 2/3.
E(Y²) = ∫_0^1 x dx = 1/2, Var(Y) = 1/2 − (2/3)² = 1/18 = 0.0556
Delta method: X ∼ U(0, 1), E(X) = 1/2, Var(X) = 1/12.
Y = g(X) = √X. g′(X) = (1/2)X^{−1/2},
g″(X) = −(1/2)(1/2)X^{−1/2−1} = −(1/4)X^{−3/2}.
E(Y) ≈ √(E(X)) + (Var(X)/2)[−(1/4)E^{−3/2}(X)]
= √(1/2) + ((1/12)/2)[−(1/4)(1/2)^{−3/2}] = 0.6776
Var(Y) ≈ Var(X)[(1/2)E^{−1/2}(X)]²
= (1/12)[(1/2)(1/2)^{−1/2}]² = 0.0417
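A quick simulation check of Example B (a minimal sketch; the sample size is illustrative):

X = rand(10^6, 1); Y = sqrt(X);
mean(Y)    % exact value 2/3 = 0.6667; the delta method gave 0.6776
var(Y)     % exact value 1/18 = 0.0556; the delta method gave 0.0417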
Delta Method for Sign Random Projections
The projected data v1,j and v2,j are bivariate normal:
(v1,j, v2,j)^T ∼ N(µ = (0, 0)^T, Σ = [m1 a; a m2]), j = 1, 2, ..., k
One can use â = (1/k) Σ_{j=1}^k v1,j v2,j to estimate a without bias. One can also first estimate the angle θ = cos^{−1}( a/√(m1 m2) ) using
Pr(sign(v1,j) = sign(v2,j)) = 1 − θ/π
and then estimate a using cos(θ̂)√(m1 m2). The delta method can help the analysis. A sketch follows below.
(Why sign random projections?)
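A minimal Matlab sketch of the sign-based estimate (the sizes D and k are illustrative assumptions, not from the slides):

D = 10^4; k = 10^3;
u1 = randn(D,1); u2 = randn(D,1);
m1 = sum(u1.^2); m2 = sum(u2.^2); a = u1'*u2;
R = randn(D,k);
v1 = R'*u1; v2 = R'*u2;
theta_hat = pi*(1 - mean(sign(v1) == sign(v2)));  % invert Pr(signs agree) = 1 - theta/pi
a_hat = cos(theta_hat)*sqrt(m1*m2)                % compare with the true a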
The Delta Method for Two Variables
Z = g(X, Y). E(X) = µ_X, E(Y) = µ_Y, Var(X) = σ²_X, Var(Y) = σ²_Y, Cov(X, Y) = σ_XY.
Taylor expansion of Z = g(X, Y) about (X = µ_X, Y = µ_Y), with all partial derivatives evaluated at (µ_X, µ_Y):
Z = g(µ_X, µ_Y) + (X − µ_X) ∂g/∂X + (Y − µ_Y) ∂g/∂Y
+ (1/2)(X − µ_X)² ∂²g/∂X² + (1/2)(Y − µ_Y)² ∂²g/∂Y²
+ (X − µ_X)(Y − µ_Y) ∂²g/∂X∂Y + ...
Taking expectations of both sides of the expansion:
E(Z) ≈ g(µ_X, µ_Y) + (σ²_X/2) ∂²g/∂X² + (σ²_Y/2) ∂²g/∂Y² + σ_XY ∂²g/∂X∂Y
Using only the linear expansion yields
Var(Z) ≈ σ²_X (∂g/∂X)² + σ²_Y (∂g/∂Y)² + 2σ_XY (∂g/∂X)(∂g/∂Y)
(all partial derivatives evaluated at (µ_X, µ_Y))
Chapter 5: Limit Theorems
X1, X2, ..., Xn are i.i.d. samples. What happens as n → ∞?
• The Law of Large Numbers
• The Central Limit Theorem
• The Normal Approximation
The Law of Large Numbers
Theorem 5.2.A: let X1, X2, ... be a sequence of independent random variables with E(X_i) = µ and Var(X_i) = σ². Then, for any ε > 0, as n → ∞,
P( |(1/n) Σ_{i=1}^n X_i − µ| > ε ) → 0
The sequence X̄_n is said to converge in probability to µ.
Proof: using Chebyshev's inequality.
Because the X_i's are i.i.d., let
X̄ = (1/n) Σ_{i=1}^n X_i.
E(X̄) = (1/n) Σ_{i=1}^n E(X_i) = (1/n)(nµ) = µ
Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i) = (1/n²)(nσ²) = σ²/n
Thus, by Chebyshev's inequality,
P(|X̄ − µ| ≥ ε) ≤ Var(X̄)/ε² = σ²/(nε²) → 0
[Figures: running sample mean versus n (log scale, n up to 10^6), three repetitions each, for samples from a normal, a gamma, and a uniform distribution, each with mean 10; in all three cases the sample mean converges to the true mean.]
Matlab Code
function TestLawLargeNumbers(MEAN)
N = 10^6;
figure; c = ['r','k','b'];
for repeat = 1:3
X = normrnd(MEAN, 1, 1, N); % var = 1
semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Normal Distribution');
figure;
for repeat = 1:3
X = gamrnd(MEAN^2, 1/MEAN, 1, N); % mean = MEAN, var = 1
semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Gamma Distribution');
figure;
for repeat = 1:3
X = rand(1, N)*MEAN*2; % mean = MEAN
semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Uniform Distribution');
Monte Carlo Integration
To calculate
I(f) = ∫_0^1 f(x) dx, for example f(x) = e^{−x²/2}
Numerical integration can be difficult, especially in high dimensions.
Monte Carlo integration:
Generate n i.i.d. samples X_i ∼ U(0, 1). Then by the LLN,
(1/n) Σ f(X_i) → E(f(X_i)) = ∫_0^1 f(x) · 1 dx
as n → ∞.
Advantages
• Very flexible. The interval does not have to be [0, 1]. The function f(x) can be complicated. The function can be decomposed in various ways, e.g., f(x) = g(x)h(x), and one can sample from other distributions.
• Straightforward in high dimensions: double integrals, triple integrals, etc.
Major disadvantage of Monte Carlo integration:
The LLN converges at the rate of 1/√n, from the central limit theorem.
Numerical integration converges at the rate of 1/n.
However, in high dimensions, the difference becomes smaller.
Also, there are more advanced Monte Carlo techniques that achieve better rates.
Examples of Monte Carlo Numerical Integration
Treat ∫_0^1 cos x dx as an expectation:
∫_0^1 cos x dx = ∫_0^1 1 × cos x dx = E(cos(X)), X ∼ Uniform(0, 1)
Monte Carlo integration procedure:
• Generate N i.i.d. samples x_i ∼ Uniform(0, 1), i = 1 to N.
• Use the empirical expectation (1/N) Σ_{i=1}^N cos(x_i) to approximate E(cos(X)). (A sketch follows below.)
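The procedure in Matlab (a three-line sketch; the value of N is illustrative):

N = 10^6;
x = rand(N,1);
mean(cos(x))    % approximately sin(1) = 0.8415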
True value: ∫_0^1 cos x dx = sin(1) = 0.8415
[Figure: Monte Carlo estimate versus N (log scale, N up to 10^7), converging to 0.8415.]
∫_0^1 [log²(x + 0.1)/√(sin(x + 0.1))] e^{−x^{0.15}} dx
[Figure: Monte Carlo estimate versus N (log scale, N up to 10^7).]
Section 5.3: Central Limit Theorem and Normal Approximation
Central Limit Theorem: let X1, X2, ... be a sequence of independent and identically distributed random variables, each having finite mean E(X_i) = µ and variance σ². Then as n → ∞,
P( (X1 + X2 + ... + Xn − nµ)/(√n σ) ≤ y ) → ∫_{−∞}^y (1/√(2π)) e^{−t²/2} dt
Normal Approximation
(X1 + X2 + ... + Xn − nµ)/(√n σ) = (X̄ − µ)/√(σ²/n) is approximately N(0, 1)
Non-rigorously, we may say X̄ is approximately N(µ, σ²/n).
But we know E(X̄) = µ and Var(X̄) = σ²/n exactly.
Normal Distribution Approximates Binomial
Suppose X ∼ Binomial(n, p). For fixed p, as n → ∞,
Binomial(n, p) ≈ N(µ, σ²), µ = np, σ² = np(1 − p).
[Figures: Binomial(n, p = 0.2) mass function with the approximating normal density overlaid, for n = 10, 20, 50, 100, and 1000; the approximation improves as n grows.]
Matlab code
function NormalApprBinomial(n,p);
mu = n*p; sigma2 = n*p*(1-p);
figure;
bar((0:n), binopdf(0:n,n,p), 'g'); hold on; grid on;
x = mu-3*sigma2:0.001:mu+3*sigma2;
plot(x, normpdf(x,mu,sqrt(sigma2)), 'r-', 'linewidth', 2);
xlabel('x'); ylabel('Density (mass) function');
title(['n = ' num2str(n) ' p = ' num2str(p)]);
Convergence in Distribution
Definition: let X1, X2, ... be a sequence of random variables with cumulative distributions F1, F2, ..., and let X be a random variable with distribution function F. We say that Xn converges in distribution to X if
lim_{n→∞} Fn(x) = F(x)
at every point x at which F is continuous.
Theorem 5.3A: Continuity Theorem
Let Fn be a sequence of cumulative distribution functions with the corresponding
MGF Mn. Let F be a cumulative distribution function with MGF M .
If Mn(t) →M(t) for all t in an open interval containing zero,
then Fn(x) → F (x) at all continuity points of F .
Approximate Poisson by Normal
X ∼ Poi(λ) is approximately normal when λ is large.
Recall Poi(λ) approximates Bin(n, p) with λ ≈ np and large n.
———
Let Xn ∼ Poi(λn), where λ1, λ2, ... is an increasing sequence with λn → ∞.
Let Zn = (Xn − λn)/√λn, with CDF Fn.
Let Z ∼ N(0, 1), with CDF F.
To show Fn → F, it suffices to show M_{Zn}(t) → M_Z(t) = e^{t²/2}.
Proof:
If Y ∼ Poi(λ), then M_Y(t) = e^{λ(e^t − 1)}. Then, for Zn = (Xn − λn)/√λn,
M_{Zn}(t) = e^{−(λn/√λn) t} e^{λn(e^{t/√λn} − 1)} = exp[ −t√λn + λn(e^{t/√λn} − 1) ] = exp[g(t, n)]
Recall e^t = 1 + t + t²/2 + t³/6 + ...
g(t, n) = −t√λn + λn(e^{t/√λn} − 1)
= −t√λn + λn( t/√λn + (1/2)t²/λn + (1/6)t³/λn^{3/2} + ... )
= t²/2 + (1/6)t³/λn^{1/2} + ... → t²/2
Therefore, M_{Zn}(t) → e^{t²/2} = M_Z(t).
The Proof of the Central Limit Theorem
Theorem 5.3.B: let X1, X2, ... be a sequence of independent random variables having mean µ and variance σ², common probability distribution function F, and MGF M defined in a neighborhood of zero. Then
lim_{n→∞} P( (Σ_{i=1}^n X_i − nµ)/(σ√n) ≤ x ) = ∫_{−∞}^x (1/√(2π)) e^{−z²/2} dz, −∞ < x < ∞
Proof: let Sn = Σ_{i=1}^n X_i and Zn = (Sn − nµ)/(σ√n). It suffices to show
M_{Zn}(t) → e^{t²/2}, as n → ∞.
Note that M_{Sn}(t) = M^n(t). Hence
M_{Zn}(t) = e^{−(nµ/(σ√n)) t} M_{Sn}( t/(σ√n) ) = e^{−(√n µ/σ) t} M^n( t/(σ√n) )
Taylor expand M(t) about zero:
M(t) = 1 + tM′(0) + (t²/2)M″(0) + ...
= 1 + tµ + (t²/2)(σ² + µ²) + ...
Therefore,
M_{Zn}(t) = e^{−(√n µ/σ) t} M^n( t/(σ√n) )
= e^{−(√n µ/σ) t} ( 1 + µt/(σ√n) + (t²/(2σ²n))(σ² + µ²) + ... )^n
= exp( −(√n µ/σ) t + n log( 1 + µt/(σ√n) + (t²/(2σ²n))(σ² + µ²) + ... ) )
By Taylor expansion, log(1 + x) = x − x²/2 + ... Therefore,
n log( 1 + µt/(σ√n) + (t²/(2σ²n))(σ² + µ²) )
= n [ µt/(σ√n) + (t²/(2σ²n))(σ² + µ²) − (1/2)(µt/(σ√n))² + ... ]
= n [ µt/(σ√n) + t²/(2n) + ... ]
Hence
M_{Zn}(t) = exp( −(√n µ/σ) t + n log( 1 + µt/(σ√n) + (t²/(2σ²n))(σ² + µ²) + ... ) ) → e^{t²/2}
The textbook assumed µ = 0 to start with, which simplifies the algebra.
Chapter 6: Distributions Derived From the Normal
• χ² distribution: if X1, X2, ..., Xn are i.i.d. N(0, 1), then Σ_{i=1}^n X_i² ∼ χ²_n, the χ² distribution with n degrees of freedom.
• t distribution: if U ∼ χ²_n, Z ∼ N(0, 1), and Z and U are independent, then Z/√(U/n) ∼ t_n, the t distribution with n degrees of freedom.
• F distribution: if U ∼ χ²_m, V ∼ χ²_n, and U and V are independent, then (U/m)/(V/n) ∼ F_{m,n}, the F distribution with m and n degrees of freedom.
χ² Distribution
If X1, X2, ..., Xn are i.i.d. N(0, 1), then Σ_{i=1}^n X_i² ∼ χ²_n, the χ² distribution with n degrees of freedom.
• If Z ∼ χ²_n, then the MGF is M_Z(t) = (1 − 2t)^{−n/2}.
• If Z ∼ χ²_n, then E(Z) = n, Var(Z) = 2n.
• If Z1 ∼ χ²_n and Z2 ∼ χ²_m are independent, then Z = Z1 + Z2 ∼ χ²_{n+m}.
• χ²_n = Gamma(α = n/2, λ = 1/2).
If X ∼ Gamma(α, λ), then M_X(t) = (λ/(λ − t))^α = (1/(1 − t/λ))^α.
If Z ∼ χ²_n, then M_Z(t) = (1 − 2t)^{−n/2} = (1/(1 − 2t))^{n/2}.
Therefore, Z ∼ χ²_n = Gamma(n/2, 1/2).
Therefore, the density function of Z ∼ χ²_n is
f_Z(z) = 1/(2^{n/2} Γ(n/2)) z^{n/2−1} e^{−z/2}, z ≥ 0
t Distribution
If U ∼ χ²_n, Z ∼ N(0, 1), and Z and U are independent, then Z/√(U/n) ∼ t_n, the t distribution with n degrees of freedom.
Theorem 6.2.A: the density function of Z ∼ t_n is
f_Z(z) = Γ[(n + 1)/2]/(√(nπ) Γ(n/2)) (1 + z²/n)^{−(n+1)/2}
[Figure: t densities for 1 through 10 degrees of freedom, together with the standard normal density; as the degrees of freedom grow, the t density approaches the normal.]
Matlab Code
function plot_tdensity
figure;
x = -5:0.01:5;
plot(x, tpdf(x,1), 'g-', 'linewidth', 2); hold on; grid on;
plot(x, tpdf(x,10), 'k-', 'linewidth', 2); hold on; grid on;
plot(x, normpdf(x), 'r', 'linewidth', 2);
for n = 2:9;
plot(x, tpdf(x,n)); hold on; grid on;
end;
xlabel('x'); ylabel('density');
legend('1 degree', '10 degrees', 'normal');
Things to know about t_n distributions:
• It is widely used in statistical testing, the t-test.
• It is practically indistinguishable from the normal when n ≥ 45.
• It is a heavy-tailed distribution; it only has moments of order < n.
• It is the Cauchy distribution when n = 1.
The F Distribution
If U ∼ χ²_m, V ∼ χ²_n, and U and V are independent, then Z = (U/m)/(V/n) ∼ F_{m,n}, the F distribution with m and n degrees of freedom.
Proposition 6.2.B: if Z ∼ F_{m,n}, then the density is
f_Z(z) = Γ[(m + n)/2]/(Γ(m/2)Γ(n/2)) (m/n)^{m/2} z^{m/2−1} (1 + (m/n)z)^{−(m+n)/2}
The F distribution is also widely used in statistical testing, the F-test.
The Cauchy Distribution
If X ∼ N(0, 1) and Y ∼ N(0, 1), and X and Y are independent, then Z = X/Y has the standard Cauchy distribution, with density
f_Z(z) = 1/(π(z² + 1)), −∞ < z < ∞
The Cauchy distribution does not have a finite mean (E(Z) does not exist).
It is also the t distribution with 1 degree of freedom.
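A small simulation sketch (illustrative sizes, not from the slides): the empirical density of the ratio of two independent standard normals matches 1/(π(z² + 1)).

n = 10^6;
Z = randn(n,1)./randn(n,1);
z = -5:0.1:5;                          % restrict the range; the Cauchy is heavy-tailed
f = hist(Z(abs(Z) <= 5), z)/(n*0.1);   % empirical density on [-5, 5]
plot(z, f, z, 1./(pi*(z.^2+1)));       % the two curves should overlap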
Proof:
F_Z(z) = P(Z ≤ z) = P(X/Y ≤ z)
= P(X ≤ Yz, Y > 0) + P(X ≥ Yz, Y < 0)
= 2P(X ≤ Yz, Y > 0)
= 2 ∫_0^∞ ∫_0^{yz} f_{X,Y}(x, y) dx dy
= 2 ∫_0^∞ ∫_0^{yz} (1/√(2π)) e^{−x²/2} (1/√(2π)) e^{−y²/2} dx dy
= (1/π) ∫_0^∞ e^{−y²/2} ∫_0^{yz} e^{−x²/2} dx dy
Now what? It actually appears easier to work with the PDF f_Z(z).
Use the fact
∂/∂x ∫_c^{g(x)} h(y) dy = h(g(x)) g′(x), for any constant c.
f_Z(z) = (1/π) ∫_0^∞ e^{−y²/2} [ y e^{−y²z²/2} ] dy
= (1/π) ∫_0^∞ e^{−y²(z²+1)/2} d[y²/2]
= (1/π) · 1/(z² + 1).
What's the problem when working directly with the CDF?
F_Z(z) = (1/π) ∫_0^∞ e^{−y²/2} ∫_0^{yz} e^{−x²/2} dx dy
= (1/π) ∫_0^∞ ∫_0^{yz} e^{−(x²+y²)/2} dx dy
= (1/π) ∫_0^∞ ∫_{tan^{−1}(1/z)}^{π/2} e^{−r²/2} r dθ dr
= (1/π) ∫_0^∞ e^{−r²/2} r [π/2 − tan^{−1}(1/z)] dr
= (π/2 − tan^{−1}(1/z))/π ∫_0^∞ e^{−r²/2} d[r²/2]
= (π/2 − tan^{−1}(1/z))/π
Therefore,
f_Z(z) = (1/π) · 1/(z² + 1).
Section 6.3: Sample Mean and Sample Variance
Let X1, X2, ..., Xn be independent samples from N(µ, σ²).
The sample mean: X̄ = (1/n) Σ_{i=1}^n X_i
The sample variance: S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²
Theorem 6.3.A: the random variable X̄ and the vector (X1 − X̄, X2 − X̄, ..., Xn − X̄) are independent.
Proof: read the book for a more rigorous proof.
Let's only prove that X̄ and X_i − X̄ are uncorrelated (homework problem).
Corollary 6.3.A: X̄ and S² are independently distributed.
Proof: it follows immediately because S² is a function of (X1 − X̄, X2 − X̄, ..., Xn − X̄).
Joint Distribution of the Sample Mean and Sample Variance
Theorem 6.3.B: (n − 1)S²/σ² ∼ χ²_{n−1}.
Proof:
X1, X2, ..., Xn are independent normal variables, X_i ∼ N(µ, σ²).
Intuitively, S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² should be closely related to a chi-squared distribution.
(n − 1)S² = Σ_{i=1}^n (X_i − X̄)²
= Σ_{i=1}^n (X_i − µ + µ − X̄)²
= Σ_{i=1}^n (X_i − µ)² − n(µ − X̄)²
Σ_{i=1}^n ((X_i − µ)/σ)² ∼ χ²_n, ((µ − X̄)/(σ/√n))² ∼ χ²_1
Y = (n − 1)S²/σ² = Σ_{i=1}^n ((X_i − µ)/σ)² − ((µ − X̄)/(σ/√n))²
Y + ((µ − X̄)/(σ/√n))² = Σ_{i=1}^n ((X_i − µ)/σ)²
The MGFs of both sides should be equal.
Also, note that Y and X̄ are independent.
Y = (n − 1)S²/σ², Y + ((µ − X̄)/(σ/√n))² = Σ_{i=1}^n ((X_i − µ)/σ)²
Equating the MGFs of both sides (also using independence):
E[e^{tY}] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}
=⇒ E[e^{tY}] = (1 − 2t)^{−(n−1)/2}
Therefore,
(n − 1)S²/σ² ∼ χ²_{n−1}
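A simulation sketch of the result (illustrative sizes, not from the slides): the scaled sample variance should have mean n − 1 and variance 2(n − 1).

n = 10; mu = 3; sigma = 2; T = 10^5;
X = normrnd(mu, sigma, T, n);
Y = (n-1)*var(X, 0, 2)/sigma^2;   % var(X,0,2): sample variance along each row
[mean(Y) var(Y)]                  % approximately n-1 = 9 and 2(n-1) = 18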
Corollary 6.3.B: (X̄ − µ)/(S/√n) ∼ t_{n−1}.
Proof:
(X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √( [(n − 1)S²/σ²]/(n − 1) ) = U/√(V/(n − 1))
U ∼ N(0, 1), V = (n − 1)S²/σ² ∼ χ²_{n−1}, and U and V are independent. Therefore U/√(V/(n − 1)) ∼ t_{n−1} by definition.
Chapter 8: Parameter Estimation
One of the most important chapters for 4090!
Assume n i.i.d. observations X_i, i = 1 to n. The X_i's have a density function with k parameters θ1, θ2, ..., θk, written as f_X(x; θ1, θ2, ..., θk).
The task is to estimate θ1, θ2, ..., θk from the n samples X1, X2, ..., Xn.
———
Where did the density function f_X come from in the first place?
This is often a chicken-and-egg problem, but it is not a major concern for this class.
Two Basic Estimation Methods
Suppose X1, X2, ..., Xn are i.i.d. samples with density f_X(x; θ1, θ2).
• The method of moments:
Force (1/n) Σ_{i=1}^n X_i = E(X) and (1/n) Σ_{i=1}^n X_i² = E(X²).
Two equations, two unknowns (θ1, θ2).
• The method of maximum likelihood:
Find the θ1 and θ2 that maximize the joint probability (likelihood) Π_{i=1}^n f_X(x_i; θ1, θ2). An optimization problem, maybe convex.
The Method of Moments
Assume n i.i.d. observations X_i, i = 1 to n, with density f_X(x; θ1, θ2, ..., θk).
Define the m-th theoretical moment of X: µ_m = E(X^m)
Define the m-th empirical moment of X: µ̂_m = (1/n) Σ_{i=1}^n X_i^m
Solve a system of k equations: µ_m = µ̂_m, m = 1 to k.
What could be the difficulties?
Example 8.4.A: X_i ∼ Poisson(λ), i.i.d., i = 1 to n.
Because E(X_i) = λ, the moment estimator would be
λ̂ = (1/n) Σ_{i=1}^n X_i = X̄
———
Properties of λ̂:
E(λ̂) = (1/n) Σ_{i=1}^n E(X_i) = λ
Var(λ̂) = (1/n) Var(X_i) = λ/n
X_i ∼ Poisson(λ), i.i.d., i = 1 to n.
Because Var(X_i) = λ, we can also estimate λ by
λ̂2 = (1/n) Σ_{i=1}^n X_i² − ( (1/n) Σ_{i=1}^n X_i )²
This estimator λ̂2 is no longer unbiased, because
E(λ̂2) = [λ + λ²] − [λ/n + λ²] = λ − λ/n
Moment estimators are in general biased.
Q: How to modify λ̂2 to obtain an unbiased estimator?
Example 8.4.B: X_i ∼ N(µ, σ²), i.i.d., i = 1 to n.
Solve for µ and σ² from the equations
µ = (1/n) Σ_{i=1}^n X_i, σ² = (1/n) Σ_{i=1}^n X_i² − ( (1/n) Σ_{i=1}^n X_i )²
The moment estimators are
µ̂ = X̄, σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)²
We have seen that µ̂ and σ̂² are independent, and
µ̂ ∼ N(µ, σ²/n), n σ̂²/σ² ∼ χ²_{n−1}
Example 8.4.C: X_i ∼ Gamma(α, λ), i.i.d., i = 1 to n.
The first two moments are
µ1 = α/λ, µ2 = α(α + 1)/λ²
Equivalently,
α = µ1²/(µ2 − µ1²), λ = µ1/(µ2 − µ1²)
The moment estimators are
α̂ = µ̂1²/(µ̂2 − µ̂1²) = X̄²/σ̂², λ̂ = µ̂1/(µ̂2 − µ̂1²) = X̄/σ̂²
Example 8.4.D: assume that the random variable X has density
f_X(x) = (1 + αx)/2, |x| ≤ 1, |α| ≤ 1
Then α can be estimated from the first moment:
µ1 = ∫_{−1}^1 x (1 + αx)/2 dx = α/3.
Therefore, the moment estimator would be
α̂ = 3X̄.
Consistency of Moment Estimators
Definition: let θ̂n be an estimator of a parameter θ based on a sample of size n. Then θ̂n is consistent in probability if, for any ε > 0,
P(|θ̂n − θ| ≥ ε) → 0, as n → ∞
Moment estimators are consistent if the conditions for the weak law of large numbers are satisfied.
A Simulation Study for Estimating Gamma Parameters
Consider a gamma distribution Gamma(α, λ) with α = 4 and λ = 0.5.
Generate n samples from Gamma(α = 4, λ = 0.5), for n = 5 up to n = 10^5.
Estimate α and λ by the moment estimators for every n.
Repeat the experiment 4 times.
[Figures: moment estimates of α = 4 and λ = 0.5 versus sample size n (log scale, up to 10^5), four repetitions each; the estimates converge to the true values as n grows.]
Matlab Code
function est_gamma
n = 10^5; al = 4; lam = 0.5; c = ['b','k','r','g'];
for t = 1:4;
X = gamrnd(al, 1/lam, n, 1);
mu1 = cumsum(X)./(1:n)';
mu2 = cumsum(X.^2)./(1:n)';
est_al = mu1.^2./(mu2-mu1.^2);
est_lam = mu1./(mu2-mu1.^2);
st = 5;
figure(1);
semilogx((st:n)', est_al(st:end), c(t), 'linewidth', 2); hold on; grid on;
title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
figure(2);
semilogx((st:n)', est_lam(st:end), c(t), 'linewidth', 2); hold on; grid on;
title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
end;
The Method of Maximum Likelihood
Suppose that random variables X1, X2, ..., Xn have a joint density f(x1, x2, ..., xn|θ). Given observed values X_i = x_i, i = 1 to n, the likelihood of θ as a function of (x1, x2, ..., xn) is defined as
lik(θ) = f(x1, x2, ..., xn|θ)
The method of maximum likelihood seeks the θ that maximizes lik(θ).
The Log-Likelihood in the I.I.D. Case
If the X_i's are i.i.d., then
lik(θ) = Π_{i=1}^n f(X_i|θ)
It is often more convenient to work with its logarithm, called the log-likelihood:
l(θ) = Σ_{i=1}^n log f(X_i|θ)
Example 8.5.A: suppose X1, X2, ..., Xn are i.i.d. samples of Poisson(λ). Then the likelihood of λ is
lik(λ) = Π_{i=1}^n λ^{X_i} e^{−λ}/X_i!
The log-likelihood is
l(λ) = Σ_{i=1}^n [X_i log λ − λ − log X_i!]
= log λ Σ_{i=1}^n X_i − nλ + [ −Σ_{i=1}^n log X_i! ]
The part in [...] is useless for finding the MLE.
The log-likelihood is
l(λ) = log λ Σ_{i=1}^n X_i − nλ − Σ_{i=1}^n log X_i!
The MLE is the solution to l′(λ) = 0, where
l′(λ) = (1/λ) Σ_{i=1}^n X_i − n
Therefore, the MLE is λ̂ = X̄, the same as the moment estimator.
For verification, check l″(λ) = −(1/λ²) Σ_{i=1}^n X_i ≤ 0, meaning that l(λ) is a concave function and the solution to l′(λ) = 0 is indeed the maximum.
Example 8.5.B: given n i.i.d. samples, X_i ∼ N(µ, σ²), i = 1 to n, the log-likelihood is
l(µ, σ²) = Σ_{i=1}^n log f_X(X_i; µ, σ²) = −(1/(2σ²)) Σ_{i=1}^n (X_i − µ)² − (n/2) log(2πσ²)
∂l/∂µ = (1/(2σ²)) 2 Σ_{i=1}^n (X_i − µ) = 0 =⇒ µ̂ = (1/n) Σ_{i=1}^n X_i
∂l/∂σ² = (1/(2σ⁴)) Σ_{i=1}^n (X_i − µ)² − n/(2σ²) = 0 =⇒ σ̂² = (1/n) Σ_{i=1}^n (X_i − µ̂)².
Example 8.5.C: X_i ∼ Gamma(α, λ), i.i.d., i = 1 to n.
The likelihood function is
lik(α, λ) = Π_{i=1}^n (1/Γ(α)) λ^α X_i^{α−1} e^{−λX_i}
The log-likelihood function is
l(α, λ) = Σ_{i=1}^n [ −log Γ(α) + α log λ + (α − 1) log X_i − λX_i ]
Taking derivatives:
∂l(α, λ)/∂α = −n Γ′(α)/Γ(α) + n log λ + Σ_{i=1}^n log X_i
∂l(α, λ)/∂λ = n α/λ − Σ_{i=1}^n X_i
The MLE solutions satisfy
λ̂ = α̂/X̄,
−Γ′(α̂)/Γ(α̂) + log α̂ − log X̄ + (1/n) Σ_{i=1}^n log X_i = 0
We need an iterative scheme to solve for α̂ and λ̂. This is actually a difficult numerical problem, because a naive method will not converge, or possibly because the Matlab implementation of the psi function Γ′(α)/Γ(α) is not that accurate.
As a last resort, one can always do an exhaustive search or a binary search.
Our simulations show that the MLE is indeed better than the moment estimator.
[Figures: moment and maximum likelihood estimates of α = 4 and λ = 0.5 versus sample size (10 to 100), three repetitions; the MLE trajectories track the true values more closely than the moment estimates.]
Matlab Code
function est_gamma_mle
close all; clear all;
n = 10^2; al = 4; lam = 0.5; c = ['b','k','r','g'];
for t = 1:3;
X = gamrnd(al, 1/lam, n, 1);
% Find the moment estimators as starting points.
mu1 = cumsum(X)./(1:n)';
mu2 = cumsum(X.^2)./(1:n)';
est_al = mu1.^2./(mu2-mu1.^2);
est_lam = mu1./(mu2-mu1.^2);
% Exhaustive search in the neighborhood of the moment estimator.
mu_log = cumsum(log(X))./(1:n)';
m = 400;
for i = 1:m;
al_m(:,i) = est_al-2+0.01*(i-1);
ind_neg = find(al_m(:,i)<0);
al_m(ind_neg,i) = eps;
lam_m(:,i) = al_m(:,i)./mu1;
end;
L = log(lam_m).*al_m + (al_m-1).*(mu_log*ones(1,m)) - lam_m.*(mu1*ones(1,m)) - log(gamma(al_m));
[dummy, ind] = max(L,[],2);
for i = 1:n
est_al_mle(i) = al_m(i,ind(i));
est_lam_mle(i) = lam_m(i,ind(i));
end;
st = 10;
figure(1);
plot((st:n)', est_al(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
plot((st:n)', est_al_mle(st:end), c(t), 'linewidth', 2); hold on; grid on;
title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
legend('Moment','MLE');
figure(2);
plot((st:n)', est_lam(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
plot((st:n)', est_lam_mle(st:end), c(t), 'linewidth', 2); hold on; grid on;
title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
legend('Moment','MLE');
end;
Newton's Method
To find the maximum or minimum of a function f(x) is equivalent to finding the x* such that f′(x*) = 0.
Suppose x is close to x*. By Taylor expansion,
f′(x*) = f′(x) + (x* − x)f″(x) + ... = 0,
so we obtain
x* ≈ x − f′(x)/f″(x)
This gives an iterative formula; a small sketch follows below.
In multiple dimensions, one needs to invert a Hessian matrix (not just take a reciprocal of f″(x)).
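A one-dimensional Matlab sketch of the iteration (the function f(x) = log(x) − x is an illustrative choice, not from the slides; its maximizer is x* = 1):

x = 0.5;                  % starting point
for iter = 1:5
g = 1/x - 1;              % f'(x)
h = -1/x^2;               % f''(x)
x = x - g/h;              % Newton update
end
x                         % converges to x* = 1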
MLE Using Newton's Method for Estimating Gamma Parameters
X_i ∼ Gamma(α, λ), i.i.d., i = 1 to n.
The log-likelihood function:
l(α, λ) = Σ_{i=1}^n [ −log Γ(α) + α log λ + (α − 1) log X_i − λX_i ]
First derivatives:
∂l(α, λ)/∂α = −n Γ′(α)/Γ(α) + n log λ + Σ_{i=1}^n log X_i
∂l(α, λ)/∂λ = n α/λ − Σ_{i=1}^n X_i
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 164
Second derivatives:

∂²l(α, λ)/∂α² = −n ψ′(α), where ψ(α) = Γ′(α)/Γ(α)

∂²l(α, λ)/∂λ² = −n α/λ²

∂²l(α, λ)/∂λ∂α = n/λ

We can use Newton's method (in two dimensions), starting with the moment estimators.

The problem is actually more complicated because we have a constrained optimization problem: the constraints α ≥ 0 and λ ≥ 0 may not be satisfied during the iterations, especially when the sample size n is not large.

On the other hand, one-step Newton's method usually works well when started from an (already pretty good) estimator. Often more iterations do not help much.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 165
[Figure: MSE of the moment estimator vs. the one-step MLE of α (true α = 4) as a function of sample size; legend: Moment, One-step MLE]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 166
[Figure: MSE of the moment estimator vs. the one-step MLE of λ (true λ = 0.5) as a function of sample size; legend: Moment, One-step MLE]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 167
Matlab Code for MLE Using One-Step Newton Updates

function est_gamma_mle_onestep
al = 4; lam = 0.5;
N = [20:10:50, 80, 100, 150, 200]; T = 10^4;
X = gamrnd(al,1/lam,T,max(N));
for i = 1:length(N)
    n = N(i);
    mu1 = sum(X(:,1:n),2)./n;
    mu2 = sum(X(:,1:n).^2,2)./n;
    est_al0 = mu1.^2./(mu2-mu1.^2);
    est_lam0 = mu1./(mu2-mu1.^2);
    est_al0_mu(i) = mean(est_al0);
    est_al0_var(i) = var(est_al0);
    est_lam0_mu(i) = mean(est_lam0);
    est_lam0_var(i) = var(est_lam0);
    est_al_mle_s1 = est_al0;
    est_lam_mle_s1 = est_lam0;
    d1_al = log(est_lam_mle_s1)+mean(log(X(:,1:n)),2) - psi(est_al_mle_s1);
    d1_lam = est_al_mle_s1./est_lam_mle_s1 - mean(X(:,1:n),2);
    d2_al = - psi(1,est_al_mle_s1);
    d12 = 1./est_lam_mle_s1;
    d2_lam = -est_al_mle_s1./est_lam_mle_s1.^2;
    for j = 1:T;
        update(j,:) = (inv([d2_al(j) d12(j); d12(j) d2_lam(j)])*[d1_al(j);d1_lam(j)])';
    end;
    est_al_mle_s1 = est_al_mle_s1 - update(:,1);
    est_lam_mle_s1 = est_lam_mle_s1 - update(:,2);
    est_lam_mle_s1 = est_al_mle_s1./mean(X(:,1:n),2);
    est_al_mle_s1_mu(i) = mean(est_al_mle_s1);
    est_al_mle_s1_var(i) = var(est_al_mle_s1);
    est_lam_mle_s1_mu(i) = mean(est_lam_mle_s1);
    est_lam_mle_s1_var(i) = var(est_lam_mle_s1);
end;
figure;
plot(N, (est_al0_mu-al).^2+est_al0_var,'k--','linewidth',2); hold on; grid on;
plot(N, (est_al_mle_s1_mu-al).^2+est_al_mle_s1_var,'r-','linewidth',2);
xlabel('Sample size'); ylabel('MSE');
title(['Gamma: One-step MLE of \alpha = ' num2str(al)]);
legend('Moment','One-step MLE');
figure;
plot(N, (est_lam0_mu-lam).^2+est_lam0_var,'k--','linewidth',2); hold on; grid on;
plot(N, (est_lam_mle_s1_mu-lam).^2+est_lam_mle_s1_var,'r-','linewidth',2);
title(['Gamma: One-step MLE of \lambda = ' num2str(lam)]);
xlabel('Sample size'); ylabel('MSE');
legend('Moment','One-step MLE');
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 169
MLE of Multinomial Probabilities
Suppose X1, X2, ..., Xm, which are the counts of cells 1, 2, ..., m, follow a multinomial distribution with total count n and cell probabilities p1, p2, ..., pm.

To estimate p1, p2, ..., pm from the observations X1 = x1, X2 = x2, ..., Xm = xm, write down the joint likelihood

f(x1, x2, ..., xm | p1, p2, ..., pm) ∝ ∏_{i=1}^m pi^{xi}

and the log likelihood

L(p1, p2, ..., pm) = ∑_{i=1}^m xi log pi, subject to ∑_{i=1}^m pi = 1

This is a constrained optimization problem.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 170
Solution 1: Reduce to m − 1 variables.

L(p2, ..., pm) = x1 log(1 − p2 − p3 − ... − pm) + ∑_{i=2}^m xi log pi,

where

∑_{i=2}^m pi ≤ 1, 0 ≤ pi ≤ 1

We do not have to worry about the inequality constraints unless they are violated.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 171
∂L/∂pi = −x1/(1 − p2 − p3 − ... − pm) + xi/pi = 0, i = 2, 3, ..., m

=⇒ x1/p1 = xi/pi

=⇒ x1/p1 = x2/p2 = x3/p3 = ... = xm/pm = λ

Therefore,

p1 = x1/λ, p2 = x2/λ, ..., pm = xm/λ,

=⇒ 1 = ∑_{i=1}^m pi = (∑_{i=1}^m xi)/λ = n/λ

=⇒ λ = n =⇒ p̂i = xi/n, i = 1, 2, ..., m
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 172
Solution 2: Lagrange multiplier (essentially the same as Solution 1).

Convert the original problem into an "unconstrained" problem:

L(p1, p2, ..., pm) = ∑_{i=1}^m xi log pi − λ (∑_{i=1}^m pi − 1)

Setting ∂L/∂pi = xi/pi − λ = 0 gives pi = xi/λ, and the constraint ∑_{i=1}^m pi = 1 again yields λ = n, hence p̂i = xi/n.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 173
Example A: Hardy-Weinberg Equilibrium
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur in a population with frequencies

(1 − θ)², 2θ(1 − θ), θ²,

respectively. Suppose we observe sample counts x1, x2, and x3, with total count n.

Q: Estimate θ using the MLE.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 174
Solution: The log likelihood can be written as

l(θ) = ∑_{i=1}^3 xi log pi
     = x1 log(1 − θ)² + x2 log 2θ(1 − θ) + x3 log θ²
     ∝ 2x1 log(1 − θ) + x2 log θ + x2 log(1 − θ) + 2x3 log θ
     = (2x1 + x2) log(1 − θ) + (x2 + 2x3) log θ

Taking the first derivative,

∂l(θ)/∂θ = −(2x1 + x2)/(1 − θ) + (x2 + 2x3)/θ = 0

=⇒ θ̂ = (x2 + 2x3)/(2n)

What is Var(θ̂)?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 175
Var(θ̂) = (1/(4n²)) (Var(x2) + 4Var(x3) + 4Cov(x2, x3))
        = (1/(4n²)) (np2(1 − p2) + 4np3(1 − p3) − 4np2p3)
        = (1/(4n)) (p2 + 4p3 − (p2 + 2p3)²)
        = θ(1 − θ)/(2n)

We will soon show that the variance of the MLE is asymptotically 1/I(θ), where

I(θ) = −E(∂²l(θ)/∂θ²)

is the Fisher information.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 176
∂²l(θ)/∂θ² = −(2x1 + x2)/(1 − θ)² − (x2 + 2x3)/θ²

I(θ) = −E(∂²l(θ)/∂θ²)
     = n [2(1 − θ)² + 2θ(1 − θ)]/(1 − θ)² + n [2θ(1 − θ) + 2θ²]/θ²
     = 2n/(1 − θ) + 2n/θ = 2n/(θ(1 − θ))

Therefore, the "asymptotic variance" is Var(θ̂) = θ(1 − θ)/(2n), which in this case is the exact variance.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 177
Review Properties of Multinomial Distribution
Suppose X1, X2, ..., Xm, which are the counts of cells 1, 2, ..., m, follow a multinomial distribution with total count n and cell probabilities p1, p2, ..., pm.

Marginal and conditional distributions:

Xj ∼ Binomial(n, pj)

Xj | Xi ∼ Binomial( n − Xi, pj/(1 − pi) ), i ≠ j
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 178
Moments:

E(Xj) = npj, Var(Xj) = npj(1 − pj)

E(Xj | Xi) = (n − Xi) pj/(1 − pi)

E(XiXj) = E(Xi E(Xj | Xi)) = E[ (nXi − Xi²) pj/(1 − pi) ]
        = [pj/(1 − pi)] (n²pi − npi(1 − pi) − n²pi²)
        = n(n − 1) pi pj

Cov(Xi, Xj) = E(XiXj) − E(Xi)E(Xj) = n(n − 1) pi pj − n² pi pj = −n pi pj
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 179
Large Sample Theory for MLE
Assume i.i.d. samples of size n, Xi, i = 1 to n, with density f(x|θ).

The MLE of θ, denoted by θ̂, is given by

θ̂ = argmax_θ ∑_{i=1}^n log f(xi|θ)

Large sample theory says that, as n → ∞, θ̂ is asymptotically unbiased and normal:

θ̂ ∼ N( θ, 1/(nI(θ)) ), approximately,

where I(θ) is the Fisher information of θ: I(θ) = −E[ ∂²/∂θ² log f(X|θ) ].
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 180
Fisher Information
I(θ) = E[ ∂/∂θ log f(X|θ) ]² = −E[ ∂²/∂θ² log f(X|θ) ]

How do we prove the equivalence of the two definitions?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 181
Proof:

E[ ∂/∂θ log f(X|θ) ]² = ∫ [∂f/∂θ]² (1/f²) f dx = ∫ [∂f/∂θ]² (1/f) dx

−E[ ∂²/∂θ² log f(X|θ) ] = −∫ { f ∂²f/∂θ² − [∂f/∂θ]² } (1/f²) f dx
                        = −∫ ∂²f/∂θ² dx + ∫ [∂f/∂θ]² (1/f) dx

Therefore, it suffices to show (in fact, assume):

∫ ∂²f/∂θ² dx = ∂²/∂θ² [ ∫ f dx ] = 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 182
Example: Normal Distribution
Given n i.i.d. samples, xi ∼ N(µ, σ²), i = 1 to n,

log fX(x; µ, σ²) = −(1/(2σ²))(x − µ)² − (1/2) log(2πσ²)

∂² log fX(x; µ, σ²)/∂µ² = −1/σ² =⇒ I(µ) = 1/σ²

∂² log fX(x; µ, σ²)/∂(σ²)² = −(x − µ)²/σ⁶ + 1/(2σ⁴)

=⇒ I(σ²) = σ²/σ⁶ − 1/(2σ⁴) = 1/(2σ⁴)

The "asymptotic" variances of the MLE are in fact exact in this case.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 183
Example: Binomial Distribution
x ∼ Binomial(n, p): Pr(x = k) = (n choose k) p^k (1 − p)^{n−k}

Log likelihood and Fisher information:

l(p) = k log p + (n − k) log(1 − p)

l′(p) = k/p − (n − k)/(1 − p) =⇒ MLE p̂ = ??

l″(p) = −k/p² − (n − k)/(1 − p)²

I(p) = −E(l″(p)) = np/p² + (n − np)/(1 − p)² = n/(p(1 − p))

The "asymptotic" variance of the MLE is also exact in this case.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 184
Intuition About the Asymptotic Distributions & Variances of MLE

The MLE θ̂ is the solution to the MLE equation l′(θ̂) = 0.

The Taylor expansion around the true θ gives

l′(θ̂) ≈ l′(θ) + (θ̂ − θ) l″(θ)

Setting l′(θ̂) = 0 (because θ̂ is the MLE solution),

(θ̂ − θ) ≈ −l′(θ)/l″(θ)

What is the mean of l′(θ)? What is the mean of l″(θ)?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 185
l′(θ) = ∑_{i=1}^n ∂ log f(xi)/∂θ = ∑_{i=1}^n [∂f(xi)/∂θ]/f(xi)

E(l′(θ)) = ∑_{i=1}^n E( ∂ log f(xi)/∂θ ) = n E( [∂f(x)/∂θ]/f(x) ) = 0

because

E( [∂f(x)/∂θ]/f(x) ) = ∫ { [∂f(x)/∂θ]/f(x) } f(x) dx = ∫ ∂f(x)/∂θ dx = (∂/∂θ) ∫ f(x) dx = 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 186
E(l′(θ)) = 0, and we know −E(l″(θ)) = nI(θ), the Fisher information. Thus

(θ̂ − θ) ≈ −l′(θ)/l″(θ) ≈ l′(θ)/(nI(θ))

and

E(θ̂ − θ) ≈ E(l′(θ))/(nI(θ)) = 0

Then, the variance:

Var(θ̂) ≈ E[(l′(θ))²]/(n²I²(θ)) = nI(θ)/(n²I²(θ)) = 1/(nI(θ))
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 187
Sec. 8.7: Efficiency and Cramér-Rao Lower Bound

Definition: Given two unbiased estimates, θ̂1 and θ̂2, the efficiency of θ̂1 relative to θ̂2 is

eff(θ̂1, θ̂2) = Var(θ̂2)/Var(θ̂1)

For example, if the variance of θ̂2 is 0.8 times the variance of θ̂1, then θ̂1 is 80% efficient relative to θ̂2.

Asymptotic relative efficiency: Given two asymptotically unbiased estimates, θ̂1 and θ̂2, the asymptotic relative efficiency of θ̂1 relative to θ̂2 is computed using their asymptotic variances (as the sample size goes to infinity).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 188
Example 8.7.A:

Assume that the random variable X has density

fX(x) = (1 + αx)/2, |x| ≤ 1, |α| ≤ 1

Method of moments: α can be estimated from the first moment,

µ1 = ∫_{−1}^{1} x (1 + αx)/2 dx = α/3.

Therefore, the moment estimator is

α̂m = 3X̄,

whose variance is

Var(α̂m) = (9/n) Var(X) = (9/n) [E(X²) − E²(X)] = (3 − α²)/n
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 189
Maximum likelihood estimate: The first two derivatives are

∂/∂α log fX(x; α) = x/(1 + αx)

∂²/∂α² log fX(x; α) = −x²/(1 + αx)²

Therefore, the MLE is the solution to

∑_{i=1}^n Xi/(1 + α̂mle Xi) = 0.

We cannot compute the exact variance, so we resort to the approximate (asymptotic) variance

Var(α̂mle) ≈ 1/(nI(α))

A numerical sketch for solving the score equation is given below.
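As a sketch of how one could solve this score equation numerically in Matlab (the inverse-CDF sampler and the use of fzero are illustrative choices, not prescribed by the lecture):

% Solve sum(X./(1+a*X)) = 0 for the MLE, starting from the moment estimate.
alpha = 0.4; n = 1000;
u = rand(n,1);                                   % uniform draws
X = (-1 + sqrt((1-alpha)^2 + 4*alpha*u))/alpha;  % inverse CDF of f(x) = (1+alpha*x)/2
score = @(a) sum(X./(1 + a*X));                  % the score function of alpha
alpha_mle = fzero(score, 3*mean(X));             % moment estimate as starting point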
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 190
Use the second derivative to compute I(α):

I(α) = −E[ ∂²/∂α² log fX(x|α) ]
     = ∫_{−1}^{1} [x²/(1 + αx)²] [(1 + αx)/2] dx
     = ∫_{−1}^{1} x²/(2(1 + αx)) dx
     = [ log((1 + α)/(1 − α)) − 2α ]/(2α³), α ≠ 0

When α = 0, I(α) = ∫_{−1}^{1} x²/2 dx = 1/3,

which can also be obtained by taking the limit of I(α) as α → 0. (A numerical check is sketched below.)
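A quick numerical sanity check of this closed form (a sketch using Matlab's integral):

% Compare the closed-form I(alpha) with numerical integration.
a = 0.7;
I_closed  = (log((1+a)/(1-a)) - 2*a)/(2*a^3);
I_numeric = integral(@(x) x.^2./(2*(1 + a*x)), -1, 1);
% The two values agree; as a -> 0 both tend to 1/3.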
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 191
The asymptotic relative efficiency of α̂m to α̂mle is

Var(α̂mle)/Var(α̂m) = 2α³ / [ (3 − α²) ( log((1 + α)/(1 − α)) − 2α ) ]

[Figure: asymptotic relative efficiency as a function of α ∈ [−1, 1]; the curve equals 1 at α = 0 and decreases toward 0 as |α| → 1]

Why is the efficiency no larger than 1?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 192
Cramér-Rao Inequality

Theorem 8.7.A: Let X1, X2, ..., Xn be i.i.d. with density function f(x; θ). Let T be an unbiased estimate of θ. Then, under smoothness assumptions on f(x; θ),

Var(T) ≥ 1/(nI(θ))

Thus, under reasonable assumptions, the MLE is optimal or (asymptotically) optimal.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 193
Sec. 8.8: Sufficiency
Definition: Let X1, X2, ..., Xn be i.i.d. samples with density f(x; θ). A statistic T = T(X1, X2, ..., Xn) is said to be sufficient for θ if the conditional distribution of X1, X2, ..., Xn, given T = t, does not depend on θ for any t.

In other words, given T, we can gain no more knowledge about θ.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 194
Example 8.8.A: Let X1, X2, ..., Xn be a sequence of independent Bernoulli random variables with P(Xi = 1) = θ. Let T = ∑_{i=1}^n Xi.

P(X1 = x1, ..., Xn = xn | T = t) = P(X1 = x1, ..., Xn = xn, T = t)/P(T = t)
                                 = θ^t (1 − θ)^{n−t} / [ (n choose t) θ^t (1 − θ)^{n−t} ]
                                 = 1/(n choose t),

which is independent of θ. Therefore, T = ∑_{i=1}^n Xi is a sufficient statistic.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 195
Theorem 8.8.1.A: Factorization Theorem
A necessary and sufficient condition for T(X1, ..., Xn) to be sufficient for a parameter θ is that the joint probability density (mass) function factors as

f(x1, x2, ..., xn; θ) = g[T(x1, x2, ..., xn), θ] h(x1, x2, ..., xn)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 196
Example 8.8.1.A: X1, X2, ..., Xn are i.i.d. Bernoulli random variables with success probability θ.

f(x1, x2, ..., xn; θ) = ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi}
                      = θ^{∑xi} (1 − θ)^{n − ∑xi}
                      = [θ/(1 − θ)]^{∑xi} (1 − θ)^n
                      = g(T, θ) × h

h(x1, x2, ..., xn) = 1.

T(x1, x2, ..., xn) = ∑_{i=1}^n xi is the sufficient statistic.

g(T, θ) = [θ/(1 − θ)]^T (1 − θ)^n
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 197
Example 8.8.1.B: X1, X2, ..., Xn are i.i.d. normal N(µ, σ²), with both µ and σ² unknown.

f(x1, x2, ..., xn; µ, σ²) = ∏_{i=1}^n (1/(√(2π)σ)) e^{−(xi−µ)²/(2σ²)}
                          = (2π)^{−n/2} σ^{−n} e^{−∑_{i=1}^n (xi−µ)²/(2σ²)}
                          = (2π)^{−n/2} σ^{−n} e^{−[ ∑xi² − 2µ∑xi + nµ² ]/(2σ²)}

Therefore, ∑_{i=1}^n xi² and ∑_{i=1}^n xi are sufficient statistics.

Equivalently, we say T = (X̄, S²) is the sufficient statistic for the normal with unknown mean and variance.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 198
Proof of the Factorization Theorem (Discrete Case)
Theorem: A necessary and sufficient condition for T(X1, ..., Xn) to be sufficient for a parameter θ is that the joint probability mass function factors as

P(X1 = x1, ..., Xn = xn; θ) = g[T(x1, ..., xn), θ] h(x1, ..., xn)

Proof of the sufficient condition: Assume

P(X1 = x1, ..., Xn = xn; θ) = g[T(x1, ..., xn), θ] h(x1, ..., xn)

Then the conditional distribution is

P(X1 = x1, ..., Xn = xn | T = t) = P(X1 = x1, ..., Xn = xn, T = t)/P(T = t)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 199
Because P(X1, ..., Xn) factors,

P(T = t) = ∑_{T(x1,...,xn)=t} P(X1 = x1, ..., Xn = xn) = g(t, θ) ∑_{T(x1,...,xn)=t} h(x1, ..., xn)

Note that t is a constant. Thus, the conditional distribution is

P(X1 = x1, ..., Xn = xn | T = t) = P(X1 = x1, ..., Xn = xn, T = t)/P(T = t)
    = g(t, θ) h(x1, ..., xn) / [ g(t, θ) ∑_{T(x1,...,xn)=t} h(x1, ..., xn) ]
    = h(x1, ..., xn) / ∑_{T(x1,...,xn)=t} h(x1, ..., xn),

which does not depend on θ.

Therefore, T(X1, ..., Xn) is a sufficient statistic.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 200
Proof of the necessary condition: Assume T(X1, ..., Xn) is sufficient. That is, the conditional distribution (X1, ..., Xn) | T does not depend on θ. Then

P(X1 = x1, ..., Xn = xn) = P(X1 = x1, ..., Xn = xn | T = t) P(T = t) = g(t, θ) h(x1, ..., xn),

where

h(x1, ..., xn) = P(X1 = x1, ..., Xn = xn | T = t)
g(t, θ) = P(T = t)

Therefore, the probability mass function factors.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 201
Exponential Family
Definition: Members of the one-parameter (θ) exponential family have density functions (or frequency functions) of the form

f(x; θ) = exp[ c(θ)T(x) + d(θ) + S(x) ] if x ∈ A, and 0 otherwise,

where the set A does not depend on θ.

Many common distributions (normal, binomial, Poisson, gamma) are members of this family.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 202
Example 8.8.C: The frequency function of the Bernoulli distribution is

P(X = x) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}
         = exp[ x log(θ/(1 − θ)) + log(1 − θ) ]

Therefore, this is a member of the exponential family, with

c(θ) = log(θ/(1 − θ))
T(x) = x
d(θ) = log(1 − θ)
S(x) = 0.

(Recall f(x; θ) = exp[ c(θ)T(x) + d(θ) + S(x) ].)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 203
Sufficient statistics of the exponential family

Suppose that X1, X2, ..., Xn is an i.i.d. sample from a member of the exponential family. Then the joint probability is

∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n exp[ c(θ)T(xi) + d(θ) + S(xi) ]
                  = exp[ c(θ) ∑_{i=1}^n T(xi) + n d(θ) ] exp[ ∑_{i=1}^n S(xi) ]

By the factorization theorem, we know ∑_{i=1}^n T(xi) is a sufficient statistic.

In the Bernoulli example, ∑_{i=1}^n T(xi) = ∑_{i=1}^n xi.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 204
The MLE of the exponential family

If T(x) is a sufficient statistic for θ, then the MLE is a function of T.

Recall: if X ∼ N(µ, σ²), then the MLEs are

µ̂ = (1/n) ∑_{i=1}^n xi

σ̂² = (1/n) ∑_{i=1}^n (xi − µ̂)²

We know that (∑_{i=1}^n xi, ∑_{i=1}^n xi²) is a sufficient statistic.

Note that the normal is a member of the two-parameter exponential family.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 205
k-parameter Exponential Family
Definition: Members of the k-parameter (θ) exponential family have density functions (or frequency functions) of the form

f(x; θ) = exp[ ∑_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x) ]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 206
Normal Distribution and Exponential Family

Suppose X ∼ N(µ, σ²). Then

f(x; µ, σ²) = (1/√(2π)) exp[ −(1/2) log σ² − x²/(2σ²) + (µ/σ²) x − µ²/(2σ²) ]

Does it really belong to a (2-dim) exponential family?

Well, suppose σ² is known; then it clearly belongs to a one-dim exponential family:

f(x; θ) = exp[ c(θ)T(x) + d(θ) + S(x) ]

θ = µ, T(x) = x, c(θ) = µ/σ²

d(θ) = −µ²/(2σ²), S(x) = −x²/(2σ²) − (1/2) log σ² − (1/2) log 2π
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 207
When σ² is unknown, we need to re-parameterize the distribution by letting

θ = (µ/σ², σ²) = (θ1, θ2)

Then it belongs to a 2-dim exponential family

f(x; θ) = exp[ c1(θ)T1(x) + c2(θ)T2(x) + d(θ) + S(x) ]

c1(θ) = µ/σ² = θ1, T1(x) = x

c2(θ) = −1/(2σ²) = −1/(2θ2), T2(x) = x²

d(θ) = −(1/2) log σ² − µ²/(2σ²) = −(1/2) log θ2 − θ1²θ2/2

S(x) = −(1/2) log 2π
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 208
Another Nice Property of Exponential Family
Suppose

f(x; θ) = exp[ ∑_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x) ]

Then

E(Ti(X)) = −∂d(θ)/∂ci(θ)

Exercise: What about variances and covariances?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 209
Proof: Take derivatives on both sides of ∫ f dx = 1, i.e., ∂[∫ f dx]/∂ci(θ) = 0.

∂[∫ f dx]/∂ci(θ) = ∫ ∂f/∂ci(θ) dx
  = ∫ (∂/∂ci(θ)) exp[ ∑_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x) ] dx
  = ∫ exp[ ∑_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x) ] [ Ti(x) + ∂d(θ)/∂ci(θ) ] dx
  = ∫ f [ Ti(x) + ∂d(θ)/∂ci(θ) ] dx

Therefore,

E(Ti(X)) = ∫ f Ti(x) dx = −∫ f [∂d(θ)/∂ci(θ)] dx = −∂d(θ)/∂ci(θ)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 210
For example, X ∼ N(µ, σ²) belongs to a 2-dim exponential family with

θ = (θ1, θ2) = (µ/σ², σ²), T1(x) = x, T2(x) = x²

Applying the previous result,

E(T1(x)) = E(x) = −∂d(θ)/∂c1(θ) = −(−θ1θ2) = (µ/σ²) σ² = µ,

as expected.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 211
Sec. 8.6: The Bayesian Approach to Parameter Estimation
θ is the parameter to be estimated.

The prior distribution: fΘ(θ).

The joint distribution: fX,Θ(x, θ) = fX|Θ(x|θ) fΘ(θ)

The marginal distribution:

fX(x) = ∫ fX,Θ(x, θ) dθ = ∫ fX|Θ(x|θ) fΘ(θ) dθ

The posterior distribution:

fΘ|X(θ|x) = fX,Θ(x, θ)/fX(x) = fX|Θ(x|θ) fΘ(θ) / ∫ fX|Θ(x|u) fΘ(u) du
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 212
Three main issues in Bayesian estimation:

• Specify a prior (without looking at the data first).
• Calculate the posterior distribution, which may be computationally intensive.
• Choose appropriate estimators from the posterior distribution: mean, median, mode, ...
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 213
The Add-one Smoothing
Consider n+m trials having a common probability of success. Suppose,
however, that this success probability is not fixed in advance but is chosen from
U(0, 1).
Q: What is the conditional distribution of this success probability given that the
n+m trials result in n successes?
Solution:
Let X = trial success probability. X ∼ U(0, 1).
Let N = total number of successes. N |X = x ∼ Binomial(n+m,x).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 214
fX|N(x|n) = P(N = n|X = x) fX(x)/P(N = n)
          = (n+m choose n) x^n (1 − x)^m / P(N = n)
          ∝ x^n (1 − x)^m

Therefore, X|N ∼ Beta(n + 1, m + 1).

Here X ∼ U(0, 1) is the prior distribution.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 215
If X|N ∼ Beta(n + 1, m + 1), then

E(X|N) = (n + 1)/[(n + 1) + (m + 1)] = (n + 1)/(n + m + 2)

Suppose we do not have any prior knowledge of the success probability X. We observe n successes out of n + m trials. The most intuitive estimate (in fact, the MLE) of X would be

X̂ = n/(n + m)

Assuming a uniform prior on X leads to add-one smoothing.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 216
[Figure: posterior densities Beta(n+1, m+1), assuming p ∼ U(0,1), for (m = 8, n = 0), (m = 8, n = 2), and (m = 80, n = 20)]

Posterior distribution: X|N ∼ Beta(n + 1, m + 1).

Posterior mean: E(X|N) = (n + 1)/(n + m + 2).

Posterior mode (peak of the density): n/(n + m).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 217
Estimating a Binomial Parameter Under a Beta Prior

X ∼ Bin(n, p), p ∼ Beta(a, b).

Joint probability:

fX,P(x, p) = [ (n choose x) p^x (1 − p)^{n−x} ] [ (Γ(a + b)/(Γ(a)Γ(b))) p^{a−1} (1 − p)^{b−1} ]
           = (Γ(a + b)/(Γ(a)Γ(b))) (n choose x) p^{x+a−1} (1 − p)^{n−x+b−1},

which, as a function of p, is proportional to a beta distribution Beta(x + a, n − x + b).

Marginal distribution:

fX(x) = ∫₀¹ fX,P(x, p) dp = g(n, x) (very nice — why?)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 218
Therefore, the posterior distribution is also Beta, with parameters (x + a, n − x + b). This is extremely convenient.

Bayes estimator (using the posterior mean):

p̂ = E(p|x) = (x + a)/[(x + a) + (n − x + b)] = (x + a)/(n + a + b)
   = (x/n) · n/(a + b + n) + (a/(a + b)) · (a + b)/(a + b + n)

x/n: the usual estimate, without considering priors.
a/(a + b): the estimate when there are no data.

Add-one smoothing is the special case a = b = 1, as the sketch below illustrates.

What about the bias-variance trade-off?
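A tiny sketch of this decomposition (the numbers are hypothetical):

% Posterior mean = weighted average of the MLE x/n and the prior mean a/(a+b).
n = 20; x = 15; a = 1; b = 1;            % a = b = 1 gives add-one smoothing
p_mle  = x/n;
p_post = (x + a)/(n + a + b);
w      = n/(n + a + b);                  % weight on the data
p_mix  = w*p_mle + (1 - w)*a/(a + b);    % identical to p_post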
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 219
The Bias-Variance Trade-off
Bayesian estimator (using the posterior mean):

p̂ = (x + a)/(n + a + b)

MLE:

p̂MLE = x/n

Assume p is fixed (i.e., condition on p) and study the MSE ratio

MSE ratio = MSE(p̂)/MSE(p̂MLE)

We hope the MSE ratio is ≤ 1, especially when the sample size n is reasonable.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 220
Asymptotic MSE ratio (when n is not too small):

Asymptotic MSE ratio = 1 + A/n + O(1/n²).

We hope A ≤ 0.

Exercise: Find A, which is a function of p, a, b. (A numerical sketch of the exact MSE ratio is given below.)
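A sketch for computing the exact MSE ratio numerically, using only the standard binomial facts E(p̂) = (np + a)/(n + a + b) and Var(p̂) = np(1 − p)/(n + a + b)²; this reproduces curves like the two figures that follow:

% Exact MSE ratio of (X+a)/(n+a+b) to X/n, for X ~ Binomial(n,p).
p = 0.5; a = 1; b = 1; N = 5:100;
bias2   = ((N*p + a)./(N + a + b) - p).^2;
varpost = N*p*(1-p)./(N + a + b).^2;
ratio   = (bias2 + varpost)./(p*(1-p)./N);
plot(N, ratio); grid on; xlabel('n'); ylabel('MSE ratio');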
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 221
[Figure: exact vs. asymptotic MSE ratios as functions of n, for p = 0.5, a = 1, b = 1; the ratios stay below 1]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 222
[Figure: exact vs. asymptotic MSE ratios as functions of n, for p = 0.9, a = 1, b = 1; the ratios exceed 1 for small n]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 223
Conjugate Priors
The prior distribution fΘ(θ) belongs to a family G.

The conditional distribution fX|Θ(x|θ) belongs to a family H.

The posterior distribution:

fΘ|X(θ|x) = fX,Θ(x, θ)/fX(x) = fX|Θ(x|θ) fΘ(θ) / ∫ fX|Θ(x|u) fΘ(u) du

If the posterior distribution also belongs to G, then G is conjugate to H.

Conjugate priors were introduced mainly for computational convenience.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 224
Examples of Conjugate priors:
Beta is conjugate to Binomial.
Gamma is conjugate to Poisson.
Dirichlet is conjugate to multinomial.
Gamma is conjugate to exponential.
Normal is conjugate to normal (with known variance).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 225
Chapter 9: Testing Hypothesis
Suppose you have a coin which is possibly biased. You want to test whether the coin is indeed biased (i.e., p ≠ 0.5) by tossing the coin n = 10 times.

Suppose you observe k = 8 heads (out of n = 10 tosses). It is reasonable to guess that this coin is indeed biased. But how do we make a precise statement?

Are n = 10 tosses enough? How about n = 100? n = 1000? What is the principled approach?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 226
Terminology

Null hypothesis: H0 : p = 0.5
Alternative hypothesis: HA : p ≠ 0.5
Type I error: rejecting H0 when it is true
Significance level: P(Type I error) = P(Reject H0 | H0) = α
Type II error: accepting H0 when it is false
P(Type II error) = P(Accept H0 | HA) = β
Power: 1 − β

Goal: low α and high 1 − β.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 227
Example 9.2.A: Let X1, X2, ..., Xn be an i.i.d. sample from a normal with known variance σ² and unknown mean µ. Consider two simple hypotheses:

H0 : µ = µ0
HA : µ = µ1 (µ1 > µ0)

Under H0, the null likelihood is

f0 ∝ ∏_{i=1}^n exp[ −(Xi − µ0)²/(2σ²) ] = exp[ −(1/(2σ²)) ∑_{i=1}^n (Xi − µ0)² ]

Under HA, the likelihood is

f1 ∝ exp[ −(1/(2σ²)) ∑_{i=1}^n (Xi − µ1)² ]

Which hypothesis is more likely?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 228
Likelihood Ratio: Small ratios =⇒ rejection. Sounds reasonable, but why?

f0/f1 = exp[ −(1/(2σ²)) ∑_{i=1}^n (Xi − µ0)² ] / exp[ −(1/(2σ²)) ∑_{i=1}^n (Xi − µ1)² ]
      = exp[ −(1/(2σ²)) ∑_{i=1}^n ( (Xi − µ0)² − (Xi − µ1)² ) ]
      = exp[ (n/(2σ²)) ( 2X̄(µ0 − µ1) + µ1² − µ0² ) ]

Because µ0 − µ1 < 0 (by assumption), the likelihood ratio is small if X̄ is large.

Suppose the significance level is α = 0.05. How large must X̄ be before we can reject H0?

The Neyman-Pearson Lemma provides the answer.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 229
Neyman-Pearson Lemma
Suppose thatH0 and HA are simple hypotheses and that the test that rejects
H0 whenever the likelihood ratio is less than c has significance level α. Then any
other test for which the significance level is ≤ α has power less than or equal to
that of the likelihood ratio test.
In other words, among all possible tests achieving significance level ≤ α, the test
based on likelihood ratio maximizes the power.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 230
Proof: Let H0 : f(x) = f0(x), HA : f(x) = fA(x).

Denote two tests d(x) and d*(x), each defined to be 0 if H0 is accepted and 1 if H0 is rejected.

The test d(x), based on the likelihood ratio, has significance level α; i.e.,

d(x) = 1 whenever f0(x) < c fA(x), (c > 0)

α = P(d(x) = 1|H0) = E(d(x)|H0) = ∫ d(x) f0(x) dx

Assume the test d*(x) has a significance level no larger than α, i.e.,

P(d*(x) = 1|H0) ≤ P(d(x) = 1|H0) = α

=⇒ ∫ [d(x) − d*(x)] f0(x) dx ≥ 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 231
To show: P(d*(x) = 1|HA) ≤ P(d(x) = 1|HA).

Equivalently, we need to show ∫ [d(x) − d*(x)] fA(x) dx ≥ 0.

We make use of a key inequality:

d*(x)[c fA(x) − f0(x)] ≤ d(x)[c fA(x) − f0(x)],

which is true because d(x) = 1 whenever c fA(x) − f0(x) > 0, and d(x), d*(x) only take values in {0, 1}.

More specifically, let M(x) = c fA(x) − f0(x).

If M(x) > 0, then the right-hand side of the inequality is M(x), while the left-hand side is M(x) (if d*(x) = 1) or 0 (if d*(x) = 0). Thus the inequality holds, because M(x) > 0.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 232
If M(x) < 0, then the right-hand side of the inequality is 0, while the left-hand side is M(x) (if d*(x) = 1) or 0 (if d*(x) = 0). Thus the inequality also holds, because M(x) < 0.

Integrating both sides of the inequality yields

∫ d*(x)[c fA(x) − f0(x)] dx ≤ ∫ d(x)[c fA(x) − f0(x)] dx

=⇒ c ∫ [d(x) − d*(x)] fA dx ≥ ∫ [d(x) − d*(x)] f0 dx ≥ 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 233
Continuing Example 9.2.A: f0/f1 ≤ c =⇒ reject H0.

f0/f1 = exp[ (n/(2σ²)) ( 2X̄(µ0 − µ1) + µ1² − µ0² ) ] ≤ c

α = P(reject H0|H0) = P(f0 ≤ c f1|H0)

Equivalently: reject H0 if X̄ ≥ x0, where

P(X̄ ≥ x0|H0) = α.

Under H0: X̄ ∼ N(µ0, σ²/n)
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 234
α = P(X̄ ≥ x0|H0)
  = P( (X̄ − µ0)/(σ/√n) > (x0 − µ0)/(σ/√n) )
  = 1 − Φ( (x0 − µ0)/(σ/√n) )

=⇒ x0 = µ0 + zα σ/√n

zα is the upper α point of the standard normal:

P(Z ≥ zα) = α, where Z ∼ N(0, 1). z0.05 = 1.645, z0.025 = 1.960.

Therefore, the test rejects H0 if X̄ ≥ µ0 + zα σ/√n. (A numerical sketch is given below.)

—————
Q: What is β? What is the power? Can we reduce both α and β when n is fixed?
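A small numerical sketch (the numbers are hypothetical; norminv and normcdf are from the Statistics Toolbox):

% Threshold and power for the one-sided normal test.
mu0 = 0; mu1 = 0.5; sigma = 1; n = 25; alpha = 0.05;
x0    = mu0 + norminv(1 - alpha)*sigma/sqrt(n);    % reject H0 if Xbar >= x0
power = 1 - normcdf((x0 - mu1)/(sigma/sqrt(n)));   % P(Xbar >= x0 | mu = mu1)
beta  = 1 - power;                                 % Type II error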
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 235
Uniformly Most Powerful Test
The Neyman-Pearson Lemma requires that both hypotheses be simple. However, most real situations involve composite hypotheses.

If the alternative H1 is composite, a test that is most powerful for every simple alternative in H1 is uniformly most powerful (UMP).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 236
Continuing Example 9.2.A: Consider testing

H0 : µ = µ0
H1 : µ > µ0

For every µ1 > µ0, the likelihood ratio test rejects H0 if X̄ ≥ x0, where

x0 = µ0 + zα σ/√n

does not depend on µ1.

Therefore, this test is most powerful for every µ1 > µ0, and hence it is UMP.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 237
Similarly, the test is UMP for testing the one-sided alternative

H0 : µ < µ0
H1 : µ > µ0

However, the test is not UMP for testing the two-sided alternative

H0 : µ = µ0
H1 : µ ≠ µ0

Unfortunately, in typical composite situations, there is no UMP test.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 238
P-Value
Definition: The p-value is the smallest significance level at which the null
hypothesis would be rejected.
The smaller the p-value, the stronger the evidence against the null hypothesis.
In a sense, calculating the p-value is more sensible than specifying (often
arbitrarily) the level of significance α.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 239
Confidence Intervals
Example 9.3.A: Let X1, ..., Xn be an i.i.d. sample from a normal distribution having unknown mean µ and known variance σ². Consider testing

H0 : µ = µ0
HA : µ ≠ µ0

Consider a test that rejects H0 when |X̄ − µ0| ≥ x0, where x0 is such that

P(|X̄ − µ0| > x0 | H0) = α

Solving for x0: x0 = (σ/√n) zα/2.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 240
The test accepts H0 if

X̄ − (σ/√n) zα/2 ≤ µ0 ≤ X̄ + (σ/√n) zα/2

We say a 100(1 − α)% confidence interval for µ0 is

µ0 ∈ [ X̄ − (σ/√n) zα/2, X̄ + (σ/√n) zα/2 ]

Duality: µ0 lies in the confidence interval for µ if and only if the hypothesis test accepts. This result holds more generally.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 241
Duality of Confidence Intervals and Hypothesis Tests
Let θ be a parameter of a family of probability distributions, θ ∈ Θ. Denote the random variables constituting the data by X.

Theorem 9.3.A: Suppose that for every value θ0 ∈ Θ there is a test at level α of the hypothesis H0 : θ = θ0. Denote the acceptance region of the test by A(θ0). Then the set

C(X) = {θ : X ∈ A(θ)}

is a 100(1 − α)% confidence region for θ.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 242
Proof: Need to show
P [θ0 ∈ C(X)|θ = θ0] = 1 − α
By the definition of C(X), we know
P [θ0 ∈ C(X)|θ = θ0] = P [X ∈ A(θ0)|θ = θ0]
By the definition of level of significance, we know
P [X ∈ A(θ0)|θ = θ0] = 1 − α.
This completes the proof.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 243
Theorem 9.3.B: Suppose that C(X) is a 100(1 − α)% confidence region for θ; that is, for every θ0,

P[θ0 ∈ C(X)|θ = θ0] = 1 − α

Then an acceptance region for a test at level α of H0 : θ = θ0 is

A(θ0) = {X | θ0 ∈ C(X)}

Proof:

P[X ∈ A(θ0)|θ = θ0] = P[θ0 ∈ C(X)|θ = θ0] = 1 − α
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 244
Generalized Likelihood Ratio Test
Likelihood ratio test: a simple hypothesis versus a simple hypothesis. Optimal, but of very limited use.

Generalized likelihood ratio test: composite hypotheses. Sub-optimal and widely used.

It plays the same role in testing that the MLE plays in parameter estimation.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 245
Assume a sample X1, ..., Xn from a distribution with unknown parameter θ.

H0 : θ ∈ ω0
HA : θ ∈ ω1

Let Ω = ω0 ∪ ω1. The test statistic is

Λ = max_{θ∈ω0} lik(θ) / max_{θ∈Ω} lik(θ)

Reject H0 if Λ ≤ λ0, where λ0 is such that

P(Λ ≤ λ0|H0) = α
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 246
Example 9.4.A: Testing a Normal Mean. Let X1, ..., Xn be i.i.d. and normally distributed with mean µ and known variance σ². Test

H0 : µ = µ0
HA : µ ≠ µ0

In other words, ω0 = {µ0}, Ω = {−∞ < µ < ∞}.

max_{µ∈ω0} lik(µ) = (1/(√(2π)σ)^n) e^{−(1/(2σ²)) ∑_{i=1}^n (Xi − µ0)²}

max_{µ∈Ω} lik(µ) = (1/(√(2π)σ)^n) e^{−(1/(2σ²)) ∑_{i=1}^n (Xi − X̄)²}
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 247
Λ = max_{θ∈ω0} lik(θ) / max_{θ∈Ω} lik(θ)
  = exp{ −(1/(2σ²)) [ ∑_{i=1}^n (Xi − µ0)² − ∑_{i=1}^n (Xi − X̄)² ] }
  = exp{ −(1/(2σ²)) ∑_{i=1}^n (X̄ − µ0)(2Xi − µ0 − X̄) }
  = exp{ −(1/(2σ²)) n(X̄ − µ0)² }

−2 log Λ = (X̄ − µ0)²/(σ²/n)

Because, under H0, X̄ ∼ N(µ0, σ²/n), we know that, under H0,

−2 log Λ | H0 ∼ χ²1
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 248
The test rejects H0 if

(X̄ − µ0)²/(σ²/n) > χ²_{1,α}

χ²_{1,0.05} = 3.841.

Equivalently, the test rejects H0 if

|X̄ − µ0| ≥ zα/2 σ/√n

—————–
In this case, we know the null distribution of the statistic exactly. When the sampling distribution is unknown (or not in a convenient form), we resort to the approximation provided by the central limit theorem.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 249
Theorem 9.4.A: Under some smoothness conditions on the probability density or mass functions, the null distribution of −2 log Λ tends to a chi-square distribution with degrees of freedom equal to dim Ω − dim ω0, as the sample size tends to infinity.

dim Ω = number of free parameters under Ω
dim ω0 = number of free parameters under ω0

In Example 9.4.A, the null hypothesis specifies µ (and σ² is known), so there are no free parameters under H0, i.e., dim ω0 = 0.

Under Ω, σ² is known (fixed) but µ is free, so dim Ω = 1.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 250
Generalized Likelihood Ratio Tests for the Multinomial Distribution

Goodness of fit: Assume the multinomial probabilities pi are specified by

H0 : p = p(θ), θ ∈ ω0,

where θ is a (vector of) parameter(s) to be estimated.

We need to know whether the model p(θ) is good or not, according to the observed data (cell counts).

We also need an alternative hypothesis. A common choice of Ω is

Ω = { pi, i = 1, 2, ..., m | pi ≥ 0, ∑_{i=1}^m pi = 1 }
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 251
Λ = max_{p∈ω0} lik(p) / max_{p∈Ω} lik(p)
  = [ (n choose x1,...,xm) p1(θ̂)^{x1} ··· pm(θ̂)^{xm} ] / [ (n choose x1,...,xm) p̂1^{x1} ··· p̂m^{xm} ]
  = ∏_{i=1}^m ( pi(θ̂)/p̂i )^{xi}

θ̂: the MLE under ω0; p̂i = xi/n: the MLE under Ω.

Λ = ∏_{i=1}^m ( pi(θ̂)/p̂i )^{np̂i}, so −2 log Λ = −2n ∑_{i=1}^m p̂i log( pi(θ̂)/p̂i )
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 252
−2 log Λ = −2n ∑_{i=1}^m p̂i log( pi(θ̂)/p̂i )
         = 2 ∑_{i=1}^m np̂i log( np̂i/(npi(θ̂)) )
         = 2 ∑_{i=1}^m Oi log(Oi/Ei)

Oi = np̂i = xi: the observed counts,
Ei = npi(θ̂): the expected counts.

−2 log Λ is asymptotically χ²s.

The degrees of freedom: s = dim Ω − dim ω0 = (m − 1) − k.

k = length of the vector θ = number of parameters in the model.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 253
G² Test Versus X² Test

Generalized likelihood ratio test:

G² = −2 log Λ = 2 ∑_{i=1}^m np̂i log( np̂i/(npi(θ̂)) ) = 2 ∑_{i=1}^m Oi log(Oi/Ei)

Pearson's chi-square test:

X² = ∑_{i=1}^m [xi − npi(θ̂)]²/(npi(θ̂)) = ∑_{i=1}^m (Oi − Ei)²/Ei

G² and X² are asymptotically equivalent.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 254
By Taylor expansion about x ≈ x0,

x log(x/x0) = x log( (x − x0 + x0)/x0 ) = x log( 1 + (x − x0)/x0 )
           = x ( (x − x0)/x0 − (x − x0)²/(2x0²) + ... )
           = (x − x0 + x0) ( (x − x0)/x0 − (x − x0)²/(2x0²) + ... )
           = (x − x0) + (x − x0)²/(2x0) + ...
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 255
Under H0, we expect npi = xi ≈ npi(θ). Thus
G2 =2
m∑
i=1
npi log
(
npi
npi(θ)
)
=2m∑
i=1
[
(npi − npi(θ)) +(npi − npi(θ))
2
2npi(θ)+ ...
]
≈m∑
i=1
(npi − npi(θ))2
npi(θ)= X2
It appearsG2 test should be “more accurate,” butX2 is actually more frequently
used.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 256
Example 9.5.A: The Hardy-Weinberg equilibrium model assumes the cell probabilities are

(1 − θ)², 2θ(1 − θ), θ²

The observed counts are 342, 500, and 187, respectively (total n = 1029).

Using the MLE, we estimate θ̂ = (2x3 + x2)/(2n) = 0.4246842.

The expected (estimated) counts are 340.6, 502.8, and 185.6, respectively.

G² = 0.032499, X² = 0.0325041 (slightly different numbers appear in the book).

Both G² and X² are asymptotically χ²s, where s = (m − 1) − k = (3 − 1) − 1 = 1.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 257
G² = 0.032499, X² = 0.0325041, both asymptotically χ²1.

p-values: for G², the p-value is 0.85694; for X², the p-value is 0.85682.

Such large p-values indicate that we should not reject H0. In other words, the model fits very well.

If we did want to reject H0, we would have to use a significance level α ≥ 0.86. (A sketch of these computations follows.)
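A sketch reproducing these computations (chi2cdf is from the Statistics Toolbox):

% G^2 and X^2 for the Hardy-Weinberg data of Example 9.5.A.
O = [342 500 187]; n = sum(O);                   % observed counts
theta = (O(2) + 2*O(3))/(2*n);                   % MLE, 0.4246842
p = [(1-theta)^2, 2*theta*(1-theta), theta^2];
E = n*p;                                         % expected counts 340.6, 502.8, 185.6
G2 = 2*sum(O.*log(O./E));                        % 0.032499
X2 = sum((O - E).^2./E);                         % 0.0325041
pvals = 1 - chi2cdf([G2 X2], 1);                 % both about 0.857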
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 258
The Poisson Dispersion Test
Assume X ∼ Poi(λ); then E(X) = λ and Var(X) = λ.

However, for many real data sets, the variance may considerably exceed the mean. Such over-dispersion is often caused by subject heterogeneity, which may require a more flexible model to explain the data.

Given counts x1, ..., xn, consider

ω0 : xi ∼ Poi(λ), i = 1, 2, ..., n
Ω : xi ∼ Poi(λi), i = 1, 2, ..., n
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 259
Under ω0, the MLE is λ̂ = x̄. Under Ω, the MLE is λ̂i = xi.

Λ = max_{λ∈ω0} lik(λ) / max_{λi∈Ω} lik(λi)
  = ∏_{i=1}^n [ λ̂^{xi} e^{−λ̂}/xi! ] / ∏_{i=1}^n [ λ̂i^{xi} e^{−λ̂i}/xi! ]
  = ∏_{i=1}^n [ x̄^{xi} e^{−x̄}/xi! ] / ∏_{i=1}^n [ xi^{xi} e^{−xi}/xi! ]
  = ∏_{i=1}^n (x̄/xi)^{xi} e^{xi − x̄}

−2 log Λ = 2 ∑_{i=1}^n xi log(xi/x̄) ∼ χ²_{n−1} (asymptotically)

(A numerical sketch is given below.)
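A sketch of the dispersion test on synthetic counts (poissrnd and chi2cdf are from the Statistics Toolbox; zero counts are dropped since 0·log 0 = 0):

% Poisson dispersion test.
x = poissrnd(3, 50, 1);                  % hypothetical sample, n = 50
n = length(x); xbar = mean(x);
nz = x(x > 0);
stat = 2*sum(nz.*log(nz/xbar));          % -2 log Lambda
pval = 1 - chi2cdf(stat, n - 1);         % small p-value => over-dispersion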
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 260
Tests for Normality
If X ∼ N(µ, σ²), then:

• The density function is symmetric about µ, with coefficient of skewness b1 = 0, where

b1 = E(X − µ)³/σ³

• The coefficient of kurtosis is b2 = 3, where

b2 = E(X − µ)⁴/σ⁴

These provide two simple tests for normality (among many tests).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 261
Two simple tests for normality:

• Reject if the empirical coefficient of skewness |b1| is large, where

b1 = [ (1/n) ∑_{i=1}^n (Xi − X̄)³ ] / [ (1/n) ∑_{i=1}^n (Xi − X̄)² ]^{3/2}

• Reject if the empirical coefficient of kurtosis |b2 − 3| is large, where

b2 = [ (1/n) ∑_{i=1}^n (Xi − X̄)⁴ ] / [ (1/n) ∑_{i=1}^n (Xi − X̄)² ]²

Difficulty: the null distributions of b1 and b2 have no closed forms, so one must resort to a numerical procedure, as in the sketch below.
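One such numerical procedure, sketched here: simulate the null distribution of b1 under normality and read off a critical value (the sample size and number of replications are arbitrary choices):

% Simulated null distribution of the empirical skewness coefficient b1.
n = 50; T = 10^4; b1 = zeros(T,1);
for t = 1:T
    z = randn(n,1); zc = z - mean(z);
    b1(t) = mean(zc.^3)/mean(zc.^2)^(3/2);
end
crit = quantile(abs(b1), 0.95);          % reject normality if |b1| > crit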
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 262
Chapter 11: Comparing Two Samples

• Comparing two independent samples

For example, a sample X1, ..., Xn is drawn from N(µX, σ²), and an independent sample Y1, ..., Ym is drawn from N(µY, σ²).

H0 : µX = µY
HA : µX ≠ µY

• Comparing paired samples

For example, we observe pairs (Xi, Yi), i = 1 to n, and would like to test the difference between X and Y.

Pairing causes the samples to be dependent, i.e., Cov(Xi, Yi) = σXY.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 263
Section 11.2: Comparing Two Independent Samples
Example: In a medical study, a sample of subjects may be assigned to a
particular treatment, and another independent sample may be assigned to a
control treatment.
• Methods based on the normal distribution
• The analysis of power
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 264
Methods Based on the Normal Distribution

A sample X1, ..., Xn is drawn from N(µX, σ²);
an independent sample Y1, ..., Ym is drawn from N(µY, σ²).

The goal is to study the difference µX − µY from the observations.

By the independence assumption,

X̄ − Ȳ ∼ N[ µX − µY, σ²(1/n + 1/m) ].

Two scenarios:

• σ² is known.
• σ² is unknown.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 265
Two Independent Normal Samples with Known Variance

X̄ − Ȳ ∼ N[ µX − µY, σ²(1/n + 1/m) ].

Assume σ² is known. Then

Z = [ (X̄ − Ȳ) − (µX − µY) ] / [ σ √(1/n + 1/m) ] ∼ N(0, 1)

The 100(1 − α)% confidence interval of µX − µY is

(X̄ − Ȳ) ± zα/2 σ √(1/n + 1/m)

However, σ² in general must be estimated from the data.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 266
Two Independent Normal Samples with Unknown Variance

The pooled sample variance

sp² = [ (n − 1)sX² + (m − 1)sY² ] / (m + n − 2)

is an estimate of the common variance σ², where

sX² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²

sY² = (1/(m − 1)) ∑_{i=1}^m (Yi − Ȳ)²

are the sample variances of the X's and Y's.

sp² is a weighted average of sX² and sY².
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 267
Theorem 11.2.A: The test statistic

t = [ (X̄ − Ȳ) − (µX − µY) ] / [ sp √(1/n + 1/m) ] ∼ t_{m+n−2},

a t distribution with m + n − 2 degrees of freedom.

Proof: Recall from Chapter 6 that if V ∼ χ²n, U ∼ N(0, 1), and U and V are independent, then U/√(V/n) ∼ tn.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 268
sp²(m + n − 2)/σ² = [ (n − 1)sX² + (m − 1)sY² ]/σ² ∼ χ²_{m+n−2}

Let

U = [ (X̄ − Ȳ) − (µX − µY) ] / [ σ √(1/n + 1/m) ] ∼ N(0, 1)

Then

U/√(sp²/σ²) ∼ t_{m+n−2}

That is,

U/√(sp²/σ²) = { [ (X̄ − Ȳ) − (µX − µY) ] / [ σ √(1/n + 1/m) ] } / (sp/σ)
            = [ (X̄ − Ȳ) − (µX − µY) ] / [ sp √(1/n + 1/m) ] ∼ t_{m+n−2}
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 269
Three Types of Hypothesis Testing

The null hypothesis:

H0 : µX = µY

Three common alternative hypotheses:

H1 : µX ≠ µY
H2 : µX > µY
H3 : µX < µY

H1 is a two-sided alternative; H2 and H3 are one-sided alternatives.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 270
Using the test statistic t = (X̄ − Ȳ) / [ sp √(1/n + 1/m) ], the rejection regions are:

For H1: |t| > t_{n+m−2,α/2}
For H2: t > t_{n+m−2,α}
For H3: t < −t_{n+m−2,α}

Pay attention to the p-value calculation for H1. (A sketch of the computations is given below.)
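A sketch of the pooled two-sample t computations on synthetic data (the built-in ttest2 performs the same test):

% Pooled two-sample t test.
X = randn(20,1) + 0.5; Y = randn(25,1);            % hypothetical samples
n = length(X); m = length(Y);
sp2 = ((n-1)*var(X) + (m-1)*var(Y))/(m + n - 2);   % pooled variance
t   = (mean(X) - mean(Y))/sqrt(sp2*(1/n + 1/m));
df  = m + n - 2;
p_H1 = 2*(1 - tcdf(abs(t), df));                   % two-sided alternative
p_H2 = 1 - tcdf(t, df);                            % mu_X > mu_Y
p_H3 = tcdf(t, df);                                % mu_X < mu_Y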
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 271
The Equivalence Between the t-Test and the Likelihood Ratio Test

H0 : µX = µY, H1 : µX ≠ µY.

Three parameters: θ = (µX, µY, σ²).

Λ = max_{θ∈ω0} lik(µX, µY, σ²) / max_{θ∈Ω} lik(µX, µY, σ²)

We can show that rejecting for small Λ (i.e., rejecting for large −2 log Λ) is equivalent to rejecting for large |t| = |X̄ − Ȳ| / [ sp √(1/n + 1/m) ].
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 272
Three parameters: θ = (µX, µY, σ²).

ω0 = {µX = µY = µ0, 0 < σ < ∞}, Ω = {−∞ < µX, µY < ∞, 0 < σ < ∞}

lik(µX, µY, σ²) = ∏_{i=1}^n (1/(√(2π)σ)) exp[ −(Xi − µX)²/(2σ²) ] · ∏_{i=1}^m (1/(√(2π)σ)) exp[ −(Yi − µY)²/(2σ²) ]

l(µX, µY, σ²) = −((m + n)/2) log 2π − ((m + n)/2) log σ²
              − (1/(2σ²)) [ ∑_{i=1}^n (Xi − µX)² + ∑_{i=1}^m (Yi − µY)² ]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 273
Under ω0 = {µX = µY = µ0, 0 < σ < ∞},

l(µ0, σ0²) = −((m + n)/2) log 2π − ((m + n)/2) log σ0²
           − (1/(2σ0²)) [ ∑_{i=1}^n (Xi − µ0)² + ∑_{i=1}^m (Yi − µ0)² ]

In fact, since Xi ∼ N(µ0, σ0²) and Yi ∼ N(µ0, σ0²), with the Xi and Yi independent, we have m + n samples from N(µ0, σ0²).

Therefore, the MLEs are

µ̂0 = (1/(m + n)) [ ∑_{i=1}^n Xi + ∑_{i=1}^m Yi ] = (n/(m + n)) X̄ + (m/(m + n)) Ȳ

σ̂0² = (1/(m + n)) [ ∑_{i=1}^n (Xi − µ̂0)² + ∑_{i=1}^m (Yi − µ̂0)² ]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 274
Thus, under the null ω0 = {µX = µY = µ0, 0 < σ < ∞},

l(µ̂0, σ̂0²) = −((m + n)/2) log 2π − ((m + n)/2) log σ̂0² − (m + n)/2

Under Ω = {−∞ < µX, µY < ∞, 0 < σ < ∞}, we can show

µ̂X = X̄, µ̂Y = Ȳ

σ̂² = (1/(m + n)) [ ∑_{i=1}^n (Xi − µ̂X)² + ∑_{i=1}^m (Yi − µ̂Y)² ]

l(µ̂X, µ̂Y, σ̂²) = −((m + n)/2) log 2π − ((m + n)/2) log σ̂² − (m + n)/2
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 275
The negative log likelihood ratio is

−[ l(µ̂0, σ̂0²) − l(µ̂X, µ̂Y, σ̂²) ] = ((m + n)/2) log(σ̂0²/σ̂²)

Therefore, the test rejects for large values of σ̂0²/σ̂².

σ̂0²/σ̂² = [ ∑_{i=1}^n (Xi − µ̂0)² + ∑_{i=1}^m (Yi − µ̂0)² ] / [ ∑_{i=1}^n (Xi − X̄)² + ∑_{i=1}^m (Yi − Ȳ)² ]
        = 1 + (mn/(m + n)) (X̄ − Ȳ)² / [ ∑_{i=1}^n (Xi − X̄)² + ∑_{i=1}^m (Yi − Ȳ)² ]

Equivalently, the test rejects for large values of

|X̄ − Ȳ| / √( ∑_{i=1}^n (Xi − X̄)² + ∑_{i=1}^m (Yi − Ȳ)² ),

which is, up to a constant, the t statistic.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 276
Power Analysis of the Two-Sample t Test

Recall power = 1 − Type II error = P(reject H0 | HA).

To compute the power, we must specify a simple alternative hypothesis. We consider

H0 : µX − µY = 0
H1 : µX − µY = ∆.

For simplicity, we assume σ² is known and n = m.

The test rejects if |X̄ − Ȳ| > zα/2 σ √(2/n).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 277
power = P( |X̄ − Ȳ| > zα/2 σ √(2/n) | H1 )
      = P( X̄ − Ȳ > zα/2 σ √(2/n) | H1 ) + P( X̄ − Ȳ < −zα/2 σ √(2/n) | H1 )

Note that X̄ − Ȳ | H1 ∼ N( ∆, 2σ²/n ). Therefore,

P( X̄ − Ȳ > zα/2 σ √(2/n) | H1 ) = P( (X̄ − Ȳ − ∆)/(σ√(2/n)) > ( zα/2 σ √(2/n) − ∆ )/(σ√(2/n)) | H1 )
                                 = 1 − Φ[ zα/2 − (∆/σ)√(n/2) ]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 278
Therefore, the power can be computed from

power = 1 − Φ[ zα/2 − (∆/σ)√(n/2) ] + Φ[ −zα/2 − (∆/σ)√(n/2) ]
      = 1 − Φ[ zα/2 − ∆′ ] + Φ[ −zα/2 − ∆′ ],

where ∆′ = (∆/σ)√(n/2).

The parameters α, ∆, σ, and n affect the power:

• Larger α =⇒ smaller zα/2 =⇒ larger power.
• Larger |∆′| =⇒ larger power.
• Larger |∆| =⇒ larger power.
• Larger n =⇒ larger power.
• Smaller σ =⇒ larger power.

What is the relation between α and power if ∆ = 0? (A sketch of the power curve is given below.)
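A sketch plotting the power as a function of n directly from this formula (∆, σ, and α are hypothetical choices):

% Power curve of the two-sample test (known sigma, n = m).
Delta = 0.5; sigma = 1; alpha = 0.05; n = 2:2:100;
z  = norminv(1 - alpha/2);
Dp = (Delta/sigma)*sqrt(n/2);
power = 1 - normcdf(z - Dp) + normcdf(-z - Dp);
plot(n, power); grid on; xlabel('n'); ylabel('power');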
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 279
Section 11.3: Comparing Paired Samples
In many cases the samples are paired (and dependent): for example, measurements before and after a medical treatment.

Consider

(Xi, Yi), i = 1, 2, ..., n
(Xi, Yi) independent of (Xj, Yj) for i ≠ j
E(Xi) = µX, E(Yi) = µY
Var(Xi) = σX², Var(Yi) = σY²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 280
Let Di = Xi − Yi and D̄ = (1/n) ∑_{i=1}^n Di. Then

E(D̄) = µX − µY,

Var(D̄) = (1/n) ( σX² + σY² − 2ρσXσY )

Therefore, D̄ is still an unbiased estimator of µX − µY, but it has smaller variance if there is positive correlation (ρ > 0).
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 281
Paired Test Based on the Normal Distribution
This method assumes that the Di = Xi − Yi are i.i.d. normal with

E(Di) = µD, Var(Di) = σD²

In general, σD² needs to be estimated from the data.

Consider a two-sided test:

H0 : µD = 0, HA : µD ≠ 0

A t-test rejects for large values of |t|, where t = (D̄ − µD)/sD̄.

The rejection region is |D̄| > t_{n−1,α/2} sD̄.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 282
Example 11.3.1.A: Effect of cigarette smoking on platelet aggregation.
Before (X)   After (Y)   Difference (D)
25           27            2
25           29            4
27           37           10
44           56           12
30           46           16
67           82           15
53           57            4
53           80           27
52           61            9
60           59           −1
28           43           15
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 283
D̄ = 10.272

sD̄ = √(63.6182/11) = 2.405

ρ̂ = 0.8938

Under H0: t = D̄/sD̄ = 10.272/2.405 = 4.271.

Suppose α = 0.01. Then

t_{α/2,n−1} = t_{0.005,10} = 3.169 < t.

Therefore, the test rejects H0 at significance level α = 0.01.

Alternatively, we say the p-value is smaller than 0.01. (A sketch of the computation is given below.)
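A sketch of the computation on the data above (the built-in ttest would give the same answer):

% Paired t test for the platelet-aggregation data.
X = [25 25 27 44 30 67 53 53 52 60 28]';   % before
Y = [27 29 37 56 46 82 57 80 61 59 43]';   % after
D = Y - X; n = length(D);                  % differences; mean(D) = 10.27
t = mean(D)/(std(D)/sqrt(n));              % 4.27, matching the slide
p = 2*(1 - tcdf(abs(t), n - 1));           % two-sided p-value, below 0.01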
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 284
A Heuristic Explanation of the GLRT

Why, under H0, does the test statistic

Λ = max_{θ∈ω0} lik(θ) / max_{θ∈Ω} lik(θ)

satisfy

−2 log Λ → χ²s, as n → ∞?

The heuristic argument:

• only considers s = 1,
• utilizes a Taylor expansion,
• uses the fact that the MLE is asymptotically normal.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 285
Since s = 1, we consider H0 : θ = θ0.

Let l(θ) = log lik(θ), and let θ̂ be the MLE of θ ∈ Ω.

−2 log Λ = −2[ l(θ0) − l(θ̂) ]

Applying a Taylor expansion,

l(θ0) = l(θ̂) + (θ0 − θ̂) l′(θ̂) + ((θ0 − θ̂)²/2) l″(θ̂) + ...

Because θ̂ is the MLE, we know l′(θ̂) = 0. Therefore,

−2 log Λ = −2[ l(θ0) − l(θ̂) ] = −l″(θ̂)(θ0 − θ̂)² + ...
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 286
The MLE is asymptotically normal; i.e., as n → ∞,

(θ̂ − θ0)/√(1/(nI(θ))) = (θ̂ − θ0)√(nI(θ)) → N(0, 1)

Because nI(θ) = −E(l″(θ)), we can (heuristically) write, as n → ∞,

−2 log Λ = −l″(θ̂)(θ0 − θ̂)² ≈ [ (θ̂ − θ0)√(nI(θ)) ]² → χ²1
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 287
Chapter 14: Linear Least Squares

Materials:

• The basic procedure: Observe (xi, yi). Assume y = β0 + β1x. Estimate β0, β1 by minimizing ∑(yi − β0 − β1xi)².

• Statistical analysis of least squares estimates: Assume y = β0 + β1x + e, e ∼ N(0, σ²), with x constant. What are the statistical properties of β̂0 and β̂1, as produced by the least squares procedure?

• Matrix approach to multiple least squares.

• Conditional expectation and the best linear estimator, for a better understanding of the basic procedure. If X and Y are jointly normal, then linear regression is best under MSE.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 288
Linear Least Squares: The Basic Procedure

The basic procedure fits a straight line to a plot of points (xi, yi),

y = β0 + β1x,

by minimizing

L(β0, β1) = ∑_{i=1}^n (yi − β0 − β1xi)²,

i.e., solving for β0 and β1 from

∂L(β0, β1)/∂β0 = 0
∂L(β0, β1)/∂β1 = 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 289
Taking the first derivatives,

∂L(β0, β1)/∂β0 = ∑_{i=1}^n 2(yi − β0 − β1xi)(−1)

∂L(β0, β1)/∂β1 = ∑_{i=1}^n 2(yi − β0 − β1xi)(−xi)

Setting them to zero =⇒

β̂0 = ȳ − x̄β̂1

β̂1 = [ ∑_{i=1}^n xiyi − ȳ ∑_{i=1}^n xi ] / [ ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ]

(A sketch of the computation is given below.)
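A sketch checking these closed-form estimates against Matlab's polyfit (the data are hypothetical):

% Simple least squares via the closed-form solutions.
x = (1:20)'; y = 2 + 0.7*x + randn(20,1);
b1 = (sum(x.*y) - mean(y)*sum(x))/(sum(x.^2) - mean(x)*sum(x));
b0 = mean(y) - mean(x)*b1;
c  = polyfit(x, y, 1);                    % c = [b1 b0], same values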
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 290
Statistical Properties of β0 and β1
Model:

yi = β0 + β1xi + ei, i = 1, 2, ..., n
ei ∼ N(0, σ²), i.i.d.

The xi's are constants; the randomness of the yi's is due to the ei.

The coefficients β0 and β1 are estimated by least squares.

Q: Under this model, what are E(β̂0), Var(β̂0), E(β̂1), Var(β̂1), etc.?
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 291
According to the model yi = β0 + β1xi + ei, ei ∼ N(0, σ²):

E(yi) = β0 + β1xi
E(ȳ) = β0 + β1x̄
Var(yi) = σ²
Cov(yi, yj) = 0 if i ≠ j

Therefore,

E(β̂0) = E(ȳ − x̄β̂1) = β0 + β1x̄ − x̄E(β̂1),

i.e., E(β̂0) = β0 iff E(β̂1) = β1.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 292
E(β̂1) = E[ ( ∑_{i=1}^n xiyi − ȳ ∑_{i=1}^n xi ) / ( ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ) ]
       = [ ∑_{i=1}^n xiE(yi) − E(ȳ) ∑_{i=1}^n xi ] / [ ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ]
       = [ ∑_{i=1}^n xi(β0 + β1xi) − (β0 + β1x̄) ∑_{i=1}^n xi ] / [ ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ]
       = [ β1 ∑_{i=1}^n xi² − β1x̄ ∑_{i=1}^n xi ] / [ ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ]
       = β1

Theorem 14.2.A: Unbiasedness

E(β̂0) = β0, E(β̂1) = β1
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 293
Another way to express β̂1:

β̂1 = [ ∑_{i=1}^n xiyi − ȳ ∑_{i=1}^n xi ] / [ ∑_{i=1}^n xi² − x̄ ∑_{i=1}^n xi ]
   = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)²
   = ∑_{i=1}^n (xi − x̄)yi / ∑_{i=1}^n (xi − x̄)²

Note that

∑_{i=1}^n (xi − x̄) = 0, ∑_{i=1}^n (yi − ȳ) = 0.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 294
Theorem 14.2.B:

Var(β̂1) = ∑_{i=1}^n (xi − x̄)² Var(yi) / [ ∑_{i=1}^n (xi − x̄)² ]² = σ² / ∑_{i=1}^n (xi − x̄)²

Exercises:

Var(β̂0) = (σ²/n) ∑_{i=1}^n xi² / ∑_{i=1}^n (xi − x̄)²,

Cov(β̂0, β̂1) = −σ²x̄ / ∑_{i=1}^n (xi − x̄)².
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 295
Residual Sum of Squares (RSS)
Definition:

RSS = ∑_{i=1}^n (yi − β̂0 − β̂1xi)²

We can show that

E(RSS) = (n − 2)σ²

In other words,

s² = RSS/(n − 2)

is an unbiased estimator of σ².
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 296
E(RSS) = E[ ∑_{i=1}^n (yi − β̂0 − β̂1xi)² ]
       = E[ ∑_{i=1}^n (β0 + β1xi + ei − β̂0 − β̂1xi)² ]
       = E[ ∑_{i=1}^n ( (β0 − β̂0) + (β1 − β̂1)xi + ei )² ]
       = nVar(β̂0) + Var(β̂1) ∑_{i=1}^n xi² + nσ² + 2Cov(β̂0, β̂1) ∑_{i=1}^n xi
         + 2E[ ∑_{i=1}^n ei ( (β0 − β̂0) + (β1 − β̂1)xi ) ]
       = (n + 2)σ² + 2E[ ∑_{i=1}^n ei ( (β0 − β̂0) + (β1 − β̂1)xi ) ]
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 297
E[ ∑_{i=1}^n ei ( (β0 − β̂0) + (β1 − β̂1)xi ) ]
 = E[ ∑_{i=1}^n ei ( β0 − ȳ + x̄β̂1 + (β1 − β̂1)xi ) ]
 = E[ ∑_{i=1}^n ei ( β0 − β0 − x̄β1 − ē + x̄β̂1 + (β1 − β̂1)xi ) ]
 = E[ ∑_{i=1}^n ei ( −x̄β1 + x̄β̂1 + (β1 − β̂1)xi ) ] − σ²
 = E[ β̂1 ∑_{i=1}^n ei (x̄ − xi) ] − σ²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 298
E[ β̂1 ∑_{i=1}^n ei (x̄ − xi) ]
 = E[ ( ∑_{i=1}^n (xi − x̄)yi / ∑_{i=1}^n (xi − x̄)² ) ∑_{i=1}^n ei (x̄ − xi) ]
 = E[ ( ∑_{i=1}^n (xi − x̄)(β0 + β1xi + ei) / ∑_{i=1}^n (xi − x̄)² ) ∑_{i=1}^n ei (x̄ − xi) ]
 = E[ ∑_{i=1}^n (xi − x̄)(x̄ − xi)ei² / ∑_{i=1}^n (xi − x̄)² ]
 = −σ²

Therefore,

E(RSS) = (n + 2)σ² + 2(−2σ²) = (n − 2)σ²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 299
The Distributions of β̂0 and β̂1

Model: yi = β0 + β1xi + ei, ei ∼ N(0, σ²),

yi ∼ N(β0 + β1xi, σ²)

β̂1 = ∑_{i=1}^n ciyi ∼ N(β1, Var(β̂1))

β̂0 = ȳ − x̄β̂1 ∼ N(β0, Var(β̂0))

(s²/σ²)(n − 2) = RSS/σ² ∼ χ²_{n−2}

(β̂0 − β0)/sβ̂0 ∼ t_{n−2}, (β̂1 − β1)/sβ̂1 ∼ t_{n−2}

What if ei is not normal? Use the central limit theorem and a normal approximation.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 300
Hypothesis Testing
Once we know the distributions

(β̂0 − β0)/sβ̂0 ∼ t_{n−2}, (β̂1 − β1)/sβ̂1 ∼ t_{n−2},

we can conduct hypothesis tests, for example,

H0 : β1 = 0
HA : β1 ≠ 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 301
Multiple Least Squares
Model: yi = β0 + β1xi,1 + ... + βp−1xi,p−1 + ei, ei ∼ N(0, σ²) i.i.d.

Observations: (xi, yi), i = 1 to n.

Multiple least squares: estimate the βj by minimizing

L(βj, j = 0, 1, ..., p − 1) = ∑_{i=1}^n ( yi − β0 − ∑_{j=1}^{p−1} xi,jβj )²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 302
Matrix Approach to Linear Least Squares

X = [ 1  x1,1  x1,2  ...  x1,p−1
      1  x2,1  x2,2  ...  x2,p−1
      1  x3,1  x3,2  ...  x3,p−1
      ...
      1  xn,1  xn,2  ...  xn,p−1 ],   β = (β0, β1, β2, ..., βp−1)ᵀ

L(β) = ∑_{i=1}^n ( yi − β0 − ∑_{j=1}^{p−1} xi,jβj )² = ‖Y − Xβ‖²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 303
L(β) = ∑_{i=1}^n ( yi − β0 − ∑_{j=1}^{p−1} xi,jβj )² = ‖Y − Xβ‖²

Matrix/vector derivative:

∂L(β)/∂β = 2(−Xᵀ)(Y − Xβ) = −2( XᵀY − XᵀXβ ) = 0

=⇒ XᵀXβ = XᵀY

=⇒ β̂ = [XᵀX]⁻¹ XᵀY

(A sketch of the computation is given below.)
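A sketch of the normal equations on synthetic data (in practice the backslash operator is preferred numerically over forming the inverse):

% Multiple least squares via the normal equations.
n = 100; p = 3;
X = [ones(n,1) randn(n,p-1)];             % design matrix with intercept
beta = [1; 2; -0.5];
Y = X*beta + 0.3*randn(n,1);
beta_hat = (X'*X)\(X'*Y);                 % solves X'X beta = X'Y
beta_bs  = X\Y;                           % QR-based, same answer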
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 304
Statistical Properties of β̂

Model:

Y = Xβ + e, ei ∼ N(0, σ²) i.i.d.

Unbiasedness (Theorem 14.4.2.A):

E(β̂) = E( [XᵀX]⁻¹ XᵀY )
     = E( [XᵀX]⁻¹ Xᵀ[Xβ + e] )
     = E( [XᵀX]⁻¹ [XᵀX] β ) + E( [XᵀX]⁻¹ Xᵀe )
     = β
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 305
Covariance matrix of β̂ (Theorem 14.4.2.B):

Var(β̂) = Var( [XᵀX]⁻¹[XᵀX]β + [XᵀX]⁻¹Xᵀe )
       = Var( [XᵀX]⁻¹Xᵀe )
       = [XᵀX]⁻¹Xᵀ Var(e) ( [XᵀX]⁻¹Xᵀ )ᵀ
       = [XᵀX]⁻¹Xᵀ Var(e) X[XᵀX]⁻¹
       = σ²[XᵀX]⁻¹XᵀX[XᵀX]⁻¹
       = σ²[XᵀX]⁻¹

Note that Var(e) is a diagonal matrix, = σ²I_{n×n}.
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 306
Theorem 14.4.3.A: An unbiased estimator of σ² is s², where

s² = ‖Y − Ŷ‖²/(n − p)

Proof:

Ŷ = Xβ̂ = [ X[XᵀX]⁻¹Xᵀ ] Y = PY

Lemma 14.4.3.A:

P² = P = Pᵀ
(I − P)² = I − P = (I − P)ᵀ

Proof of Lemma 14.4.3.A:

P² = X[XᵀX]⁻¹Xᵀ X[XᵀX]⁻¹Xᵀ = X[XᵀX]⁻¹Xᵀ = P
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 307
Therefore,

‖Y − Ŷ‖² = ‖(I − P)Y‖² = Yᵀ(I − P)ᵀ(I − P)Y = Yᵀ(I − P)Y

and

E[ Yᵀ(I − P)Y ] = E[ (βᵀXᵀ + eᵀ)(I − P)(Xβ + e) ]
               = βᵀXᵀ(I − P)Xβ + E[ eᵀ(I − P)e ]
               = E[ eᵀ(I − P)e ]
               = nσ² − E[ eᵀPe ],

because

Xᵀ(I − P)X = XᵀX − Xᵀ[ X[XᵀX]⁻¹Xᵀ ]X = 0
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 308
E[ eᵀPe ] = E[ ∑_{j=1}^n ( ∑_{i=1}^n eiPij ) ej ]
          = E[ ∑_{j=1}^n Pjj ej² ]
          = σ² ∑_{j=1}^n Pjj = pσ²

The last step uses ∑_{j=1}^n Pjj = trace(P) = trace(X[XᵀX]⁻¹Xᵀ) = trace([XᵀX]⁻¹XᵀX) = trace(Ip) = p.

Combining the results, we obtain

E( ‖Y − Ŷ‖² ) = (n − p)σ²
Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li 309
Properties of Residuals

Residuals: ê = Y − Ŷ = (I − P)Y.

Covariance matrix of the residuals:

Var(ê) = (I − P) Var(Y) (I − P)ᵀ = (I − P) σ²I (I − P) = σ²(I − P)

=⇒ The residuals are correlated.
Theorem 14.4.A: The residuals are uncorrelated with the fitted values.
Proof:
E(eTY) =E(
YT(I− P)PY)
=E(
YT(P − P2)Y)
=E(
YT(P − P)Y)
=0