
Statistics 553:

Asymptotic Tools

David R. Hunter

Fall 2004

Penn State University


Contents

Preface

1 Mathematical and Statistical Preliminaries
1.1 Limits and Continuity
1.2 Differentiability and Taylor's Theorem
1.3 Order Notation
1.4 Multivariate Extensions
1.5 Expectation and Inequalities

2 Convergence of Random Variables
2.1 Modes of Convergence
2.2 Laws of Large Numbers
2.3 Convergence of Transformed Sequences
2.4 The Dominated Convergence Theorem
2.5 The Skorohod Representation Theorem

3 Central Limit Theorems
3.1 Characteristic Functions and the Normal Distribution
3.2 The Lindeberg-Feller Central Limit Theorem
3.3 Central Limit Theorems for Dependent Sequences

4 The Delta Method and Applications
4.1 Asymptotic Distributions and Variance-Stabilizing Transformations
4.2 Sample Moments
4.3 Sample Correlation

5 Order Statistics and Quantiles
5.1 Extreme Order Statistics
5.2 Sample Quantiles

6 Maximum Likelihood Estimation
6.1 Consistency
6.2 Asymptotic Normality of the MLE
6.3 Efficient Estimation
6.4 The Multiparameter Case
6.5 Nuisance Parameters
6.6 Wald, Rao, and Likelihood Ratio Tests

7 Hypothesis Testing
7.1 Contiguity and Local Alternatives
7.2 The Wilcoxon Rank-Sum Test

8 Categorical Data
8.1 Pearson's Chi-Square

9 U-statistics
9.1 Statistical Functionals and V-Statistics
9.2 Hoeffding's Decomposition and Asymptotic Normality
9.3 Generalizing U-statistics

Preface

This book is primarily intended to be used as a graduate-level textbook for a course in large-sample theory. As such, it is structured specifically to facilitate teaching. We also hope that by choosing topics carefully and organizing the book well, we have written a useful reference book.

The main reason we wrote this book is the dearth of large-sample theory textbooks for students without training in measure-theoretic probability (Erich Lehmann's wonderful 1999 book is the only such book we know). This shortage is emblematic of a traditional view in graduate-level statistics education that students should take measure-theoretic probability before large-sample theory; indeed, even in some top graduate programs, measure theory is required for PhD students while large-sample theory is not. We believe these priorities are backwards: Most statisticians have more to gain from an understanding of large-sample theory than of measure theory. For this reason, the level of mathematical and statistical prerequisites we assume is relatively low. Much of the book should be accessible to a student with a grounding in calculus, linear algebra, and mathematical statistics.

However, we stress that the level of content must not be confused with the level of mathematical rigor. We have no interest in sacrificing mathematical rigor for the sake of writing a non-measure-theory-based book. In fact, we view this book as the basis for a course that provides both an introduction to large-sample theory and a chance to enhance students' mathematical maturity. In this sense, this book can even be seen as preparing students to take mathematically demanding courses such as measure theory. For instance, experience in teaching courses based on this material has taught us that the material in Chapter 1, while often familiar to students before they take this course, is very important to cover. Not only does a review of limits, continuity, and differentiability remind students of some material they may not have seen in a while, but insisting upon a rigorous treatment of these topics sets the stage for the rigorous mathematical development of the topics to follow.

The book is intended to be self-contained, with all important results either proved in the text or, more frequently for deep results, proved in the exercises according to step-by-step hints. We have relegated complicated proofs to the exercise sections for two reasons: First, we believe that students are much more likely to appreciate the salient details of a proof if they are forced to struggle with them on their own; and second, the text is more readable when the narrative is not constantly interrupted by proofs of theorems. Because we have tried to make the book self-contained, we have included some mathematical topics that may be abridged or omitted at the discretion of the instructor, depending on the scope of the course being taught. Topics such as the dominated convergence theorem, the Skorohod representation theorem, and characteristic functions might fall into this category.


The exercises in this book are centrally important; we consider them not merely as activities useful for evaluating students' performance but more like appendices, where results can be proved or expanded upon and new material can be introduced. The exercises have been designed with the following principles in mind:

1. The exercises should be challenging but not impossible. When difficult theorems are proved in the exercises, as is frequently the case, students are guided through the proofs step-by-step, with hints provided when appropriate.

2. Many exercises require students to do some computing. Our philosophy is that computing skills should be emphasized in all statistics courses whenever possible, provided the computing enhances the understanding of the subject matter. In various exercises throughout the book, students are required to simulate random variables; produce graphs of various types; numerically determine maxima and minima of functions; implement algorithms such as Newton-Raphson and scoring; grapple with issues of underflow and overflow; and more. In short, this book can be used to help students to learn a great deal of statistical computing at the same time they learn large-sample theory.

3. New material is often introduced in the exercises. This is particularly true when a topic does not fit well into the narrative structure of a chapter and when its inclusion in the main text would distract from the main point or make the prose less readable.

We explained above that computing is a very important aspect of many of the exercises in this book. The study of large-sample theory lends itself very well to computing, since frequently the theoretical large-sample results we prove do not give any indication of how well asymptotic approximations work for finite samples. Thus, simulation for the purpose of checking the quality of asymptotic approximations for small samples is very important in understanding the limitations of the results being learned. Many of the computing-based exercises involve just this sort of simulation activity. (There are also other types of computing exercises, involving a broad range of computing topics.) Of course, all computing activities will force students to choose a particular computing platform and software package. We try to remain neutral on the question of which package or language to use, though we are sometimes forced in the hints to provide specific details of code. In these cases, we offer hints for using the S language, as implemented for instance in the Splus package (http://www.insightful.com) and the R package (http://www.r-project.org). However, nearly all of these exercises can be completed just as well using other packages or languages.
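As an example of the kind of simulation-based check just described, here is a minimal R sketch; the specific setting (a normal approximation to the sample mean of exponential data) is our own illustration and is not taken from any particular exercise.

    # Compare the asymptotic normal approximation for the sample mean of
    # exponential(1) data with its simulated distribution at several sample sizes.
    set.seed(1)
    for (n in c(5, 20, 100)) {
      xbar <- replicate(10000, mean(rexp(n)))   # 10000 simulated sample means
      z <- sqrt(n) * (xbar - 1)                 # centered and scaled; limiting variance is 1
      cat("n =", n, ": simulated P(Z <= 1) =", mean(z <= 1),
          "  asymptotic value =", round(pnorm(1), 4), "\n")
    }

The agreement improves as n grows, which is exactly the sort of finite-sample behavior these exercises ask students to investigate.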


Chapter 1

Mathematical and Statistical Preliminaries

We assume that most readers are familiar with most of the material presented in this chapter. However, we do not view this material as superfluous, and we include it as the first chapter of the book for several reasons. First, some of these topics may have been learned long ago by readers, and a review of this chapter may help remind them of knowledge they have forgotten. Second, including these preliminary topics as a separate chapter makes the book more self-contained than if the topics were omitted: We do not have to refer readers to "a standard calculus textbook" or "a standard mathematical statistics textbook" whenever an advanced result relies on this preliminary material.

Finally, and perhaps most importantly, we wish to set the stage in this chapter for a mathematically rigorous treatment of large-sample theory. By "mathematically rigorous," we mean logically sound, relying on arguments in which assumptions and definitions are unambiguously stated and assertions must be provable from these assumptions and definitions. Thus, even well-prepared readers who know the material in this chapter may benefit from reading it and attempting the exercises, particularly if they are new to rigorous mathematics and proof-writing.

1.1 Limits and Continuity

Fundamental to the study of large-sample theory is the idea of the limit of a sequence. Much of this book will be devoted to sequences of random variables; however, we begin here by focusing on sequences of real numbers. Technically, a sequence of real numbers is a function from the natural numbers 1, 2, 3, . . . into the real numbers R; yet we always write a1, a2, . . . instead of the more traditional function notation a(1), a(2), . . ..

We begin by defining the limit of a sequence of real numbers.

Definition 1.1 A sequence of real numbers a1, a2, . . . has limit equal to the real number a if for every ǫ > 0, there exists N such that

|an − a| < ǫ for all n > N.


In this case, we write an → a as n → ∞ or limn→∞ an = a.

Definition 1.2 A sequence of real numbers a1, a2, . . . has limit ∞ if for every real number M, there exists N such that

an > M for all n > N.

In this case, we write an → ∞ as n → ∞ or limn→∞ an = ∞. Similarly, an → −∞ as n → ∞ if for all M, there exists N such that an < M for all n > N.

Implicit in the language of Definition 1.1 is that N may depend on ǫ. Similarly, N may depend on M — in fact, it must depend on M — in Definition 1.2.

The "values" +∞ and −∞ are not considered real numbers; otherwise, Definition 1.1 would be invalid for a = ∞ and Definition 1.2 would be invalid since M could be taken to be ∞. Even more important, however, is the fact that not every sequence has a limit, even though "has a limit" includes the possibilities ±∞. A simple example of a sequence without a limit is given in Example 1.1.

Example 1.1 Define

an = log n;  bn = 1 + (−1)^n/n;  cn = 1 + (−1)^n/n^2;  dn = (−1)^n.

Then an → ∞, bn → 1, and cn → 1; but the sequence d1, d2, . . . does not have a limit. (Note that we do not always write "as n → ∞" when this is clear from the context.)

Although limn bn = limn cn in Example 1.1, there is evidently something different about the manner in which these sequences approach this limit. This difference will prove important when we study rates of convergence beginning in Section 1.3.
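A quick numerical comparison in R (our own illustration, not part of the original example) makes the difference in rates visible:

    n  <- c(10, 100, 1000)
    bn <- 1 + (-1)^n / n        # approaches 1 at rate 1/n
    cn <- 1 + (-1)^n / n^2      # approaches 1 at rate 1/n^2
    cbind(n, error.b = abs(bn - 1), error.c = abs(cn - 1))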

Example 1.2 A very important example of a limit of a sequence is

limn→∞ (1 + c/n)^n = exp(c)    (1.1)

for any real number c. Equation (1.1) may be proved using l'Hopital's rule, which is stated in Section 1.2. See Exercise 1.2 for a slightly more general form of equation (1.1).
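Equation (1.1) is also easy to check numerically; a brief R illustration of our own, with c = 2:

    c0 <- 2                     # the constant c in equation (1.1)
    n  <- 10^(1:6)
    cbind(n, sequence = (1 + c0/n)^n, limit = exp(c0))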

Two or more sequences may be added, multiplied, or divided, and the results follow certain intuitively pleasing rules. The sum (or product) of limits equals the limit of the sums (or products); and as long as division by zero does not occur, the ratio of limits equals the limit of the ratios. These rules are stated formally as Theorem 1.1, whose proof is the subject of Exercise 1.1:

Theorem 1.1 Suppose an → a and bn → b as n → ∞. Then an + bn → a + b and anbn → ab; furthermore, if b ≠ 0 then an/bn → a/b.

Note that Theorem 1.1 may be extended by allowing a and b to represent ±∞, and the results remain true as long as they do not involve the indeterminate forms ∞ − ∞, ±∞ × 0, or ±∞/∞.

Although Definitions 1.1 and 1.2 concern limits, they refer only to sequences of real numbers. Recall that a sequence is a real-valued function of the natural numbers. We shall also require the concept of a limit of a real-valued function of a real variable. To this end, we make the following definition.


Definition 1.3 For a real-valued function f(x) defined on a neighborhood of x0, we call a the limit of f(x) as x goes to x0, written

limx→x0 f(x) = a,

if for each ǫ > 0 there is a δ > 0 such that |f(x) − a| < ǫ whenever 0 < |x − x0| < δ.

Implicit in definition 1.3 is the fact that a is the limiting value of f(x) no matter whether x approaches x0 from above or below; thus, f(x) has a two-sided limit at x0. We may also consider one-sided limits, a concept that arises in certain statistical contexts such as distribution functions.

Definition 1.4 The value a is called the right-handed limit of f(x) as x goes to x0, written

limx→x0+ f(x) = a or f(x0+) = a,

if for each ǫ > 0 there is a δ > 0 such that |f(x) − a| < ǫ whenever x ∈ (x0, x0 + δ).

The left-handed limit, limx→x0− f(x) or f(x0−), is defined analogously: f(x0−) = a if for each ǫ > 0 there is a δ > 0 such that |f(x) − a| < ǫ whenever x ∈ (x0 − δ, x0).

It is not hard to verify from the preceding definitions that

limx→x0 f(x) = a if and only if limx→x0− f(x) = limx→x0+ f(x) = a;

in other words, the (two-sided) limit exists if and only if both one-sided limits exist and they coincide.

An important aspect of Definitions 1.3 and 1.4 is that the value of f(x0) is completely irrelevant (in fact, f(x0) might not even be defined). In the special case that f(x0) is defined and equal to a, then we say that f(x) is continuous (or right- or left-continuous) at x0, as summarized by Definition 1.6 below. Before formally defining continuity, however, we address the possibilities that f(x) has a limit as x → ±∞ or that f(x) tends to ±∞:

Definition 1.5 (a) We write limx→∞ f(x) = a if for every ǫ > 0, there exists N such that |f(x) − a| < ǫ for all x > N.

(b) We write limx→x0 f(x) = ∞ if for every M, there exists δ > 0 such that f(x) > M whenever 0 < |x − x0| < δ.

(c) We write limx→∞ f(x) = ∞ if for every M, there exists N such that f(x) > M for all x > N.

Definitions involving −∞ are analogous, as are definitions of f(x0+) = ±∞ and f(x0−) = ±∞.

We now define a continuous real-valued function of a real variable. Intuitively, f(x) is continuous at x0 if it is possible to draw the graph of f(x) in some open neighborhood of x0 without lifting the pencil from the page.

Definition 1.6 If f(x) is a real-valued function defined in a neighborhood of x0,


• f(x) is continuous at x0 if limx→x0 f(x) = f(x0);

• f(x) is right-continuous at x0 if f(x0+) = f(x0);

• f(x) is left-continuous at x0 if f(x0−) = f(x0).

Exercises for Section 1.1

Exercise 1.1 Assume that an → a and bn → b. Prove

(a) an + bn → a + b

(b) anbn → ab

Hint: |anbn − ab| ≤ |(an − a)(bn − b)| + |a| |bn − b| + |b| |an − a|.

(c) If the real-valued function f(x) is continuous at the point a, then f(an) → f(a).

(d) If b ≠ 0, an/bn → a/b.

Exercise 1.2 (a) Prove equation (1.1). You may assume that exp(c) is the unique solution of the equation log x = c.

(b) Suppose that a is a real number. If limn→∞ an = a and limn→∞ bn = ∞, prove that

limn→∞ (1 + an/bn)^bn = exp(a).

1.2 Differentiability and Taylor’s Theorem

Differential calculus plays a fundamental role in much asymptotic theory. In this section we review simple derivatives and one form of Taylor's well-known theorem. Approximations to functions based on Taylor's theorem, often called Taylor expansions, are ubiquitous in large-sample theory.

We assume that readers are familiar with the definition of a derivative of a real-valued function f(x):

Definition 1.7 If f(x) is continuous in a neighborhood of x0 and

limx→x0 [f(x) − f(x0)] / (x − x0)    (1.2)

exists, then f(x) is said to be differentiable at x0 and the limit (1.2) is called the derivative of f(x) at x0 and is denoted by f ′(x0).

We use the standard notation for second- and higher-order derivatives. Thus, if f ′(x) is itself differentiable at x0, we express this derivative as f ′′(x0) or f^(2)(x0). In general, if the kth derivative f^(k)(x) is differentiable at x0, then we denote this derivative by f^(k+1)(x0).

In large-sample theory, differential calculus is most commonly applied in the construction of Taylor expansions. There are several different versions of Taylor's theorem. The form we present here, which is proved in Exercise 1.4, has the advantage that it does not require that the function have an extra derivative. For instance, a second-order Taylor expansion requires only two derivatives using this version of Taylor's theorem (and the second derivative need only exist at a single point), whereas other forms of Taylor's theorem require the existence of a third derivative over an entire interval. The disadvantage of this form of Taylor's theorem is that we do not get any sense of what the remainder term is, only that it goes to zero; however, for the applications in this book, this form of Taylor's theorem will suffice.

Theorem 1.2 If f(x) has d derivatives at a, then

f(x) = f(a) + (x − a)f ′(a) + · · · + [(x − a)^d / d!] [f^(d)(a) + rd(x, a)],    (1.3)

where rd(x, a) → 0 as x → a.

Because it provides an elegant way to prove Theorem 1.2, among other things, we state here the well-known result known as l'Hopital's rule:

Theorem 1.3 l'Hopital's rule: For a real number c, suppose that f(x) and g(x) are differentiable for all points in a neighborhood containing c except possibly c itself. If limx→c f(x) = 0 and limx→c g(x) = 0, then

limx→c f(x)/g(x) = limx→c f ′(x)/g′(x),    (1.4)

provided the right-hand limit exists. Similarly, if limx→c f(x) = ∞ and limx→c g(x) = ∞, then equation (1.4) also holds. Finally, the theorem also applies if c = ±∞, in which case a "neighborhood containing c" refers to an interval (a, ∞) or (−∞, a).

Exercises for Section 1.2

Exercise 1.3 For f(x) continuous in a neighborhood of x0, consider

limx→x0 [f(x) − f(2x0 − x)] / [2(x − x0)].    (1.5)

(a) Prove or give a counterexample: When f ′(x0) exists, limit (1.5) also exists and it is equal to f ′(x0).

(b) Prove or give a counterexample: When limit (1.5) exists, it equals f ′(x0), which also exists.

Exercise 1.4 Prove Theorem 1.2.

Hint: Let Pd(x) denote the Taylor polynomial such that rd(x, a) = d! [f(x) − Pd(x)]/(x − a)^d. Then use l'Hopital's rule, Theorem 1.3.

1.3 Order Notation

As we saw in Example 1.1, the limiting behavior of a sequence is not fully characterized by the value of its limit alone, if the limit exists. In that example, both 1 + (−1)^n/n and 1 + (−1)^n/n^2 converge to the same limit, but they approach this limit at different rates. In this section we consider not only the value of the limit, but the rate at which that limit is approached. In so doing, we present some convenient notation for comparing the limiting behavior of different sequences.

Definition 1.8 The sequence of real numbers a1, a2, . . . is asymptotically equivalent to the sequence b1, b2, . . ., written an ∼ bn, if (an/bn) → 1.

Equivalently, an ∼ bn if and only if

|(an − bn)/an| → 0.

The expression |(an − bn)/an| above is called the relative error in approximating an by bn.

Note that the definition of asymptotic equivalence does not say that

(lim an)/(lim bn) = 1;

the above fraction might equal 0/0 or ∞/∞, or the limits might not even exist! (See Exercise 1.6.)

Example 1.3 A well-known asymptotic equivalence is Stirling's formula, which states

n! ∼ √(2π) n^(n+1/2) e^(−n).    (1.6)

There are multiple ways to prove Stirling's formula; we outline a proof based on the Poisson distribution in Exercise 3.8.
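A short R check of Stirling's formula (our own illustration; factorials overflow quickly, so the comparison is done on the log scale):

    n <- c(5, 10, 50, 100)
    log.stirling <- 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n   # log of the right side of (1.6)
    cbind(n, ratio = exp(lfactorial(n) - log.stirling))          # n! over the approximation; tends to 1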

Example 1.4 For any k > −1,

Σ_{i=1}^n i^k ∼ n^(k+1)/(k + 1).

This is proved in Exercise 1.8.

The next notation we introduce expresses the idea that one sequence is asymptotically negligible compared to another sequence.

Definition 1.9 We write an = o(bn) (“an is little-o of bn”) if an/bn → 0 as n → ∞.

Among other advantages, the o-notation makes it possible to focus on the most important terms of a sequence while ignoring the terms that are comparatively negligible.

Example 1.5 According to definition 1.9, we may write

1/n − 2/n^2 + 4/n^3 = 1/n + o(1/n).

This makes it clear at a glance how fast the sequence on the left tends to zero, since all terms other than the dominant term are lumped together as o(1/n).

Some of the exercises in this section require proving that one sequence is little-o of another sequence. Sometimes, l'Hopital's rule may be helpful, though a bit of care must be exercised because l'Hopital's rule applies to functions on the real numbers whereas sequences are functions on the natural numbers. For example, to prove log n = o(n), we could note that the function (log x)/x, defined for x > 0, agrees with (log n)/n on the positive integers; thus, since l'Hopital's rule implies (log x)/x → 0 as x tends to ∞ as a real number, we conclude that (log n)/n must also tend to 0 as n tends to ∞ as an integer. Often, however, one may simply prove an = o(bn) using definition 1.9, as in the next example.

Example 1.6 Prove that

n = o(Σ_{i=1}^n √i).

Proof: Letting ⌊n/2⌋ denote the largest integer less than or equal to n/2,

Σ_{i=1}^n √i ≥ Σ_{i=⌊n/2⌋}^n √i ≥ (n/2) √(n/2).

Since n = o(n√n), the desired result follows.

We define one additional order notation, the capital O.

Definition 1.10 We write an = O(bn) ("an is big-o of bn") if there exist M > 0 and N > 0 such that |an/bn| < M for all n > N.

In particular, an = o(bn) implies an = O(bn). In a vague sense, o and O are to sequences as < and ≤ are to real numbers. However, this analogy is not perfect. For example, note that it is not always true that either an = O(bn) or bn = O(an).

Although the notations above are very precisely defined, unfortunately this is not the case with the language used to describe the notations. In particular, "an is of order bn" does not always mean an = O(bn); it sometimes means that an = O(bn) but an ≠ o(bn). Although the language can be imprecise, it is usually clear from context what the intent is.

The use of o, O, or ∼ always implies that there is some sort of limit being taken. Often, an expression involves n and it is usually clear in this context that we are letting n tend to ∞; however, sometimes things are not so clear, so it helps to be explicit:

Example 1.7 According to definition 1.9, a sequence that is o(1) tends to zero. Therefore, equation (1.3) of Taylor's theorem may be rewritten

f(x) = f(a) + (x − a)f ′(a) + · · · + [(x − a)^d / d!] [f^(d)(a) + o(1)] as x → a.

Note that it is important to write "as x → a" in this case.

There are certain rates of growth toward ∞ that are so common that they have names, such as logarithmic, polynomial, and exponential growth. If α, β, and γ are positive constants, then the sequences (log n)^α, n^β, and (1 + γ)^n exhibit logarithmic, polynomial, and exponential growth, respectively. Furthermore, no matter what the values of α, β, and γ are, we always have

(log n)^α = o(n^β) and n^β = o([1 + γ]^n).


Thus, in the sense of definition 1.9, logarithmic growth is always slower than polynomial growth and polynomial growth is always slower than exponential growth (see Exercise 1.13).

Exercises for Section 1.3

Exercise 1.5 Prove that an ∼ bn if and only if |(an − bn)/an| → 0.

Exercise 1.6 For each of the following statements, prove the statement or provide a counterexample that disproves it.

(a) If an ∼ bn, then (limn an)/(limn bn) = 1.

(b) If (limn an)/(limn bn) is well-defined and equal to 1, then an ∼ bn.

(c) If neither limn an nor limn bn exists, then an ∼ bn is impossible.

Exercise 1.7 Suppose that an ∼ bn and cn ∼ dn.

(a) Prove that ancn ∼ bndn.

(b) Show by counterexample that it is not generally true that an + cn ∼ bn + dn.

(c) Prove that |an| + |cn| ∼ |bn| + |dn|.

(d) Show by counterexample that it is not generally true that f(an) ∼ f(bn) for acontinuous function f(x).

Exercise 1.8 Prove the asymptotic relationship in Example 1.4.

Hint: One way to proceed is to prove that the sum lies between two simple-to-evaluate integrals that are themselves asymptotically equivalent.

Exercise 1.9 Suppose that δ(X) is an estimator of g(θ). Prove that the variance of δ(X) plus the square of the bias of δ(X) equals the mean square error of δ(X).

Exercise 1.10 Let X1, . . . , Xn be independent with identical density functions f(x) = θ x^(θ−1) I{0 < x < 1}.

(a) Let δn be the posterior mean of θ, assuming a standard exponential prior for θ (i.e., p(θ) = e^(−θ) I{θ > 0}). Compute δn.

Hints: The posterior distribution of θ is gamma. If Y is a gamma random variable, then f(y) ∝ y^(α−1) e^(−yβ) and the mean of Y is α/β.

(b) For each n ∈ {10, 50, 100, 500}, simulate 1000 different samples of size n from the given distribution with θ = 2. Use these to calculate the value of δn 1000 times for each n. Make a table in which you report, for each n, your estimate of the bias (the sample mean of δn − 2) and the variance (the sample variance of δn). Try to estimate the asymptotic order of the bias and the variance of this estimator by finding "nice" exponents a and b such that n^a |bias_n| and n^b variance_n are roughly constant.

Hints: To generate a sample from the given distribution, use the fact that if U1, U2, . . . is a sample from a uniform (0, 1) density and the continuous distribution function F(x) may be inverted easily, then letting Xi = F^(−1)(Ui) results in X1, X2, . . . being a simple random sample from F(x). In this exercise, the distribution function is easy to find and invert. If you're using R, a sample from uniform (0, 1) of size, say, 50 may be obtained by typing runif(50).

Calculating δn involves taking the sum of logarithms. As you know, this is the same as the log of the product. However, when you're computing, be very careful! If you have a large sample, multiplying all the values is likely to overflow or underflow the machine, so the log of the product won't always work. Adding the logs is much safer even though it requires more computation due to the fact that you have to take a bunch of logs instead of just one.
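To make the hints concrete, here is a hedged R sketch of the simulation in part (b). It assumes that part (a) works out to δn = (n + 1)/(1 − Σ log Xi); this is our own calculation, and you should verify it yourself before relying on it.

    set.seed(1)
    theta <- 2
    for (n in c(10, 50, 100, 500)) {
      delta <- replicate(1000, {
        x <- runif(n)^(1/theta)            # inverse-CDF sampling, since F(x) = x^theta on (0, 1)
        (n + 1) / (1 - sum(log(x)))        # assumed form of the posterior mean (verify in part (a))
      })
      cat("n =", n, " bias =", mean(delta) - theta, " variance =", var(delta), "\n")
    }

Note that the sum of logs is used directly, exactly as the hint above recommends.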

Exercise 1.11 Let X1, X2, . . . be defined as in Exercise 1.10.

(a) Find the maximum likelihood estimator of θ for a sample of size n. Call it δn.

(b) Follow the directions for Exercise 1.10 part (b) using the new definition of δn.

Exercise 1.12 We call a function f(x) a convex function if for all x, y and any α ∈ [0, 1], we have

f [αx + (1 − α)y] ≤ αf(x) + (1 − α)f(y).

Suppose that a1, a2, . . . and b1, b2, . . . are sequences of real numbers and f is a convex function such that an → ∞, bn → ∞, an = o(bn), and f(x) → ∞ as x → ∞.

(a) Prove that f(an) = o[f(bn)].

(b) Create counterexamples to the result in part (a) if the hypotheses are weakened as follows:

(i) Find an, bn, and convex f(x) with limx→∞ f(x) = ∞ such that an = o(bn) but f(an) ≠ o[f(bn)].

(ii) Find an, bn, and convex f(x) such that an → ∞, bn → ∞, and an = o(bn) but f(an) ≠ o[f(bn)].

(iii) Find an, bn, and f(x) with limx→∞ f(x) = ∞ such that an → ∞, bn → ∞, and an = o(bn) but f(an) ≠ o[f(bn)].

Exercise 1.13 (a) Prove that polynomial growth is faster than logarithmic growth and that exponential growth is faster than polynomial growth. In other words, prove

(i) For any α > 0 and ǫ > 0, (log n)^α = o(n^ǫ).

(ii) For any α > 0 and ǫ > 0, n^α = o[(1 + ǫ)^n].

(b) Each of the following 5 sequences tends to ∞ as n → ∞. However, no two of them are of the same order. List them in order of rate of increase from slowest to fastest. In other words, give an ordering such that first sequence = o(second sequence), second sequence = o(third sequence), etc.

√n    log n!    Σ_{i=1}^n i^(1/3)    2^(log n)    (log n)^(log log n)

Prove the 4 order relationships that result from your list.

Exercise 1.14 Follow the directions of Exercise 1.13 part (b) for the following sequences. You may assume the results of Exercise 1.12 part (a) and Exercise 1.13 part (a).

log(n!)    n^2    n^n    3^n
log(log n)    n log n    2^(3 log n)    n^(n/2)
n!    2^(2n)    n^(log n)    (log n)^n

Proving the 12 order relationships is challenging but not quite as tedious as it sounds; some of the proofs will be very short.

1.4 Multivariate Extensions

We now consider vectors in R^k, k > 1. We denote vectors by bold face and their components by regular type with subscripts; thus, a is equivalent to (a1, . . . , ak). For sequences of vectors, we use superscripts in parentheses, as in a(1), a(2), . . .. Using notation to distinguish between scalars and vectors in this way is usually unnecessary, since the meaning is often clear from context. Nevertheless, special notation can be helpful, especially when one is learning new material.

The extension to the multivariate case from the univariate case is often so trivial that it is reasonable to ask why we consider the cases separately at all. There are two main reasons. The first is pedagogical: We feel that any disadvantage due to repeated or overlapping material is outweighed by the fact that concepts are often intuitively easier to grasp in R than in R^k. Furthermore, generalizing from R to R^k is often instructive in and of itself, as in the case of the multivariate concept of differentiability. The second reason is mathematical: Some one-dimensional results, like Taylor's theorem 1.2 for d > 2, need not (or cannot, in some cases) be extended to multiple dimensions in this book. In later chapters in this book, we will treat univariate and multivariate topics together sometimes and separately sometimes, and we will maintain the bold-face notation for vectors throughout.

To define a limit of a sequence of vectors, we must first define a norm on R^k. We are interested primarily in whether the norm of a vector goes to zero, a concept for which any norm will suffice, so we may as well take the Euclidean norm:

‖a‖ def= (Σ_{i=1}^k ai^2)^(1/2).

We may now write down the analogue of definition 1.1.

Definition 1.11 The sequence a(1), a(2), . . . has limit c ∈ R^k, written a(n) → c as n → ∞ or limn→∞ a(n) = c, if ‖a(n) − c‖ → 0 as n → ∞. That is, a(n) → c means that for any ǫ > 0 there exists N such that ‖a(n) − c‖ < ǫ for all n > N.


It is sometimes possible to define multivariate concepts by using the univariate definition on each of the components of the vector. For instance, the following lemma gives an alternative way to define a(n) → c:

Lemma 1.1 a(n) → c if and only if a(n)_i → c_i for all 1 ≤ i ≤ k.

Proof: The "if" part follows from Exercise 1.1 because ‖a(n) − c‖ is a continuous function of |a(n)_1 − c_1|, . . . , |a(n)_k − c_k|; the "only if" part follows because |a(n)_i − c_i| ≤ ‖a(n) − c‖.

There is no multivariate analogue of definition 1.2; it is nonsensical to write a(n) → ∞. However, since ‖a(n)‖ is a real number, writing ‖a(n)‖ → ∞ is permissible. If we write lim‖x‖→∞ f(x) = c for a real-valued function f(x), then it must be true that f(x) tends to the same limit c no matter what path x takes as ‖x‖ → ∞.

Suppose that the function f(x) maps vectors in some open subset U of R^k to vectors in R^ℓ, a property denoted by f : U → R^ℓ. In order to define continuity, we first extend definition 1.3 to the multivariate case:

Definition 1.12 For a function f : U → R^ℓ, where U is open in R^k, we write limx→a f(x) = c for some a ∈ U and c ∈ R^ℓ if for every ǫ > 0 there exists a δ > 0 such that ‖f(x) − c‖ < ǫ whenever x ∈ U and 0 < ‖x − a‖ < δ.

Note in definition 1.12 that ‖f(x) − c‖ refers to the norm on R^ℓ, while ‖x − a‖ refers to the norm on R^k.

Definition 1.13 A function f : U → R^ℓ is continuous at a ∈ U ⊂ R^k if limx→a f(x) = f(a).

Since there is no harm in letting k = 1 or ℓ = 1 or both, definitions 1.12 and 1.13 include definitions 1.3 and 1.6, respectively, as special cases.

The extension of differentiation from the univariate to the multivariate setting is not quite as straightforward as the extension of continuity. Fortunately, however, most of the difficulty lies merely in notation. Recall that in the univariate case, the derivative f ′(x) of a function f(x) is defined by definition 1.7, which according to theorem 1.2 implies

[f(x + h) − f(x) − hf ′(x)] / h → 0 as h → 0.    (1.7)

However, the converse is also true: That is, equation (1.7) could be taken as the definition of the derivative f ′(x). To do so would require just a bit of extra work to prove that equation (1.7) uniquely defines f ′(x). This alternative definition is the one that we shall extend to the multivariate case.

Definition 1.14 Suppose that f : U → R^ℓ, where U ⊂ R^k is open. For a point a ∈ U, suppose there exists an ℓ × k matrix Jf(a), depending on a but not on h, such that

limh→0 [f(a + h) − f(a) − Jf(a)h] / ‖h‖ = 0.    (1.8)

Then Jf(a) is unique and we call Jf(a) the Jacobian matrix of f(x). We say that f(x) is differentiable at the point a.


The assertion in definition 1.14 that Jf(a) is unique may be proved as follows: Suppose that J(1)f(a) and J(2)f(a) are two versions of the Jacobian matrix. Then equation (1.8) implies that

limh→0 [J(1)f(a) − J(2)f(a)] h / ‖h‖ = 0;

but h/‖h‖ is an arbitrary unit vector, which means that J(1)f(a) − J(2)f(a) must be the zero matrix, proving the assertion.

Although definition 1.14 is straightforward and quite common throughout the calculus literature, there is unfortunately not a universally accepted notation for multivariate derivatives. Various sources use notation such as f ′(a), ḟ(a), Df(a), or ∇f(a) to denote the Jacobian matrix or its transpose, depending on the situation. In this book, we adopt perhaps the most widespread of these notations, letting ∇f(a) denote the transpose of the Jacobian matrix Jf(a). Furthermore, it is possible to show (see Exercise 1.15) that the Jacobian matrix is the matrix of partial derivatives. Putting these facts together, we define the notation ∇f(a) as follows:

Definition 1.15 Suppose f(x) is differentiable at a in the sense of definition 1.14. Then the gradient matrix of f, evaluated at a, is

∇f(a) def= [ ∂f1(x)/∂x1  · · ·  ∂fℓ(x)/∂x1 ]
           [     ...      ...       ...    ]
           [ ∂f1(x)/∂xk  · · ·  ∂fℓ(x)/∂xk ]

with all partial derivatives evaluated at x = a.    (1.9)

Note that when f maps k-vectors to ℓ-vectors, ∇f(x) is a k × ℓ matrix, a fact that is important to memorize; it is often very helpful to remember the dimensions of the gradient matrix when trying to recall the form of various multivariate results. By definition (1.9), the gradient matrix satisfies the first-order Taylor formula

f(x) = f(a) + ∇f(a)^t (x − a) + r(x, a),    (1.10)

where r(x, a)/‖x − a‖ → 0 as x → a.

Now that we have generalized Taylor's theorem 1.2 for the linear case d = 1, it is worthwhile to ask whether a similar generalization is necessary for larger d. The answer is no, except for one particular case: We will require a quadratic Taylor expansion (that is, d = 2) when f(x) is real-valued but takes a vector argument. To this end, suppose that U ⊂ R^k is open and that f(x) maps U into R. Then according to equation (1.9), ∇f(x) is a k × 1 vector of partial derivatives, which means that ∇f(x) maps k-vectors to k-vectors. If we differentiate once more and evaluate the result at a, denoting the result by ∇^2 f(a), then equation (1.9) with ∂f(x)/∂xi substituted for fi(x) gives

∇^2 f(a) = [ ∂^2 f(x)/∂x1^2    · · ·  ∂^2 f(x)/∂x1∂xk ]
           [        ...         ...          ...      ]
           [ ∂^2 f(x)/∂xk∂x1   · · ·  ∂^2 f(x)/∂xk^2  ]

with all partial derivatives evaluated at x = a.    (1.11)

The k × k matrix on the right hand side of equation (1.11) is called the Hessian matrix of the function f(x). One technical detail that will usually not concern us in this book is the fact that the Hessian matrix of partial derivatives may be defined even if f(x) is not twice differentiable at a in the sense of definition 1.14. In other words, the existence of second-order partial derivatives does not imply twice-differentiability. However, if f(x) is twice differentiable, then not only does the Hessian matrix exist, it is symmetric (see Exercise 1.16).

We are finally ready to extend Taylor's theorem to second-order expansions of real-valued functions that take vector arguments:

Theorem 1.4 Suppose that f : U → R, where U is an open subset of R^k. Then if f(x) is twice differentiable at the point a,

f(x) = f(a) + ∇f(a)^t (x − a) + (1/2)(x − a)^t ∇^2 f(a) (x − a) + r2(x, a),

where r2(x, a)/‖x − a‖^2 → 0 as x → a.
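To see Theorem 1.4 numerically, here is a small R illustration of our own for f(x) = exp(x1) sin(x2), whose gradient and Hessian are easy to compute by hand:

    f <- function(x) exp(x[1]) * sin(x[2])
    a <- c(0, 1)
    grad <- c(exp(a[1]) * sin(a[2]), exp(a[1]) * cos(a[2]))       # gradient of f at a
    hess <- exp(a[1]) * matrix(c(sin(a[2]),  cos(a[2]),
                                 cos(a[2]), -sin(a[2])), 2, 2)    # Hessian of f at a
    x <- a + c(0.1, -0.05)                                        # a point near a
    quad <- f(a) + sum(grad * (x - a)) +
            0.5 * t(x - a) %*% hess %*% (x - a)                   # expansion without r2(x, a)
    c(exact = f(x), quadratic.approximation = as.numeric(quad))

The two values agree closely, and the discrepancy shrinks faster than ‖x − a‖^2 as x is moved toward a, as the theorem asserts.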

Exercises for Section 1.4

Exercise 1.15 For a real-valued function g(x) defined on a subset of R^k, we define the ith partial derivative of g(x) at a, where 1 ≤ i ≤ k, as follows:

∂g(x)/∂xi def= limh→0 [g(a + he(i)) − g(a)] / h

if this limit exists, where e(i) is the ith standard basis vector, defined by e(i)_j = I{i = j} for all 1 ≤ j ≤ k. Prove that if f(x) is differentiable at a, then the transpose of the Jacobian matrix Jf(a) is the right-hand side of equation (1.9).

Exercise 1.16 (a) Prove that if a real-valued function f(x) is twice differentiable at a, then the Hessian matrix is symmetric at that point.

(b) Give an example in which f(x) has a well-defined but asymmetric Hessian matrix at some point.

Exercise 1.17 (a) Suppose that f(x) is continuous at 0. Prove that f(te(i)) is continuous as a function of t at t = 0 for each i, where e(i) is the ith standard basis vector.

(b) Prove that the converse of (a) is not true by inventing a function f(x) that is not continuous at 0 but such that f(te(i)) is continuous as a function of t at t = 0 for each i.

Exercise 1.18 Suppose that a(n)_i → ci as n → ∞ for i = 1, . . . , k. Prove that if f : R^k → R is continuous at the point c, then f(a(n)) → f(c). Note that this proves every part of Exercise 1.1 (the hard work of an exercise like 1.1(b) is in showing that multiplication is continuous).

1.5 Expectation and Inequalities

In this section we consider random variables. Although we do not wish to make the definition of a random variable rigorous here — to do so requires measure theory — we assume that the reader is familiar with the basic idea: A random variable is a function from a sample space Ω into R. (We often refer to "random vectors" rather than "random variables" if the range space is R^k rather than R.)

For any random variable X, we denote the expected value of X by E X. We expect that the reader is already very familiar with expected values for commonly-encountered random variables X, so we do not attempt here to define the expectation operator E rigorously. In particular, we avoid the question of whether E f(X) should involve a sum, if X is discrete; or an integral, if X is continuous; or neither, if X is neither discrete nor continuous. However, the expectation operator has several often-used properties. Since we will not delve into the measure-theoretic concepts necessary to derive them rigorously, we present these properties here as axioms:

1. Linearity: For any random variables X and Y defined on the same sample space and any real numbers a and b, E (aX + bY) = a E(X) + b E(Y).

2. Monotonicity: For any random variables X and Y defined on the same sample space such that X(ω) ≤ Y(ω) for all ω ∈ Ω, E X ≤ E Y.

3. Monotone Convergence: For any random variable X and sequence X1, X2, . . . defined on the same probability space such that |Xn(ω)| is a nondecreasing sequence for all ω and |Xn(ω)| → |X(ω)| as n → ∞ for all ω, E |Xn| → E |X|.

4. Conditioning: For any independent random variables X and Y defined on the same sample space and any function f(x, y), E f(X, Y) = E E[f(X, Y) | Y].

The reader should already be familiar with each of these four properties, with the possible exception of monotone convergence (fortunately, monotone convergence is the least-widely-used of the four properties in this book). Other familiar properties may be derived from these four: For instance, as a special case of the conditioning property, if X and Y are independent then taking f(x, y) = xy gives E XY = (E X)(E Y).

The variance and covariance operators are defined as usual, namely,

Cov (X, Y) def= E XY − (E X)(E Y)

and Var (X) def= Cov (X, X). The linearity property above extends to random vectors: For scalars a and b we have E (aX + bY) = a E(X) + b E(Y), and for matrices P and Q with dimensions such that PX + QY is well-defined, E (PX + QY) = P E(X) + Q E(Y). The covariance between two random vectors is

Cov (X, Y) def= E XY^t − (E X)(E Y)^t,

and the variance matrix of a random vector (sometimes referred to as the covariance matrix) is Var (X) def= Cov (X, X). Among other things, these properties imply that

Var (PX) = P Var (X) P^t    (1.12)

for any constant matrix P with as many columns as X has rows.
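Equation (1.12) is easy to verify numerically; the following R snippet (our own illustration, with a particular 3 × 3 variance matrix and a 2 × 3 matrix P) compares the two sides using simulated data:

    set.seed(1)
    Sigma <- matrix(c(2, 1, 0,
                      1, 3, 1,
                      0, 1, 1), 3, 3)                 # a positive definite variance matrix
    X <- matrix(rnorm(3 * 1000), ncol = 3) %*% chol(Sigma)   # rows are draws with Var = Sigma
    P <- matrix(c(1, 0, -1,
                  2, 1,  0), nrow = 2, byrow = TRUE)
    lhs <- cov(X %*% t(P))                            # sample version of Var(PX)
    rhs <- P %*% cov(X) %*% t(P)                      # P Var(X) P^t using the sample Var(X)
    round(lhs - rhs, 10)                              # zero up to rounding error

In fact the two sample quantities agree exactly, because the sample covariance is itself bilinear; the identity (1.12) is an algebraic fact, not merely an approximation.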


As a first application of the expectation operator, we derive a useful inequality called Chebyshev's inequality. For any positive constants a and r and any random variable X, observe that

|X|^r ≥ |X|^r I{|X| ≥ a} ≥ a^r I{|X| ≥ a},

where throughout the book, I{·} denotes the indicator function

I{expression} def= 1 if expression is true, 0 if expression is not true.    (1.13)

Since E I{|X| ≥ a} = P(|X| ≥ a), the monotonicity of the expectation operator implies

P(|X| ≥ a) ≤ E|X|^r / a^r.    (1.14)

Inequality (1.14) is sometimes called Markov's inequality. In the special case that X = Y − E Y and r = 2, we obtain Chebyshev's inequality: For any a > 0 and any random variable Y,

P(|Y − E Y| ≥ a) ≤ Var Y / a^2.    (1.15)
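As a quick illustration of inequality (1.15), the following R lines (our own example, using an exponential(1) random variable, for which E Y = Var Y = 1) compare the exact tail probability with the Chebyshev bound:

    a <- c(1, 2, 3)
    exact <- pexp(1 - a) + (1 - pexp(1 + a))     # P(|Y - E Y| >= a) for Y ~ exponential(1)
    cbind(a, exact, chebyshev.bound = 1 / a^2)   # the bound holds but can be far from sharp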

While the proof of Chebyshev's inequality utilized the monotonicity of the expectation operator, we now derive another inequality that takes advantage of its linearity. Jensen's inequality states that

f(E X) ≤ E f(X)    (1.16)

for any convex function f(x) and any random variable X. The definition of a convex function is given in Exercise 1.12. To prove inequality (1.16), we require another property of any convex function: There exists a line g(x) = ax + b through the point [E X, f(E X)] such that g(x) ≤ f(x) for all x. (This property, true for convex functions in any dimension, is called the supporting hyperplane property; see Exercise 1.20.) We may invoke linearity to show that E g(X) = g(E X) = f(E X). Since E g(X) ≤ E f(X) by monotonicity, this proves that f(E X) ≤ E f(X) as desired.

Exercises for Section 1.5

Exercise 1.19 Show by example that equality can hold in inequality 1.15.

Exercise 1.20 Let f(x) be a convex function on some interval, and let x0 be any point on the interior of that interval.

(a) Prove that

limx→x0+ [f(x) − f(x0)] / (x − x0)    (1.17)

exists and is finite; that is, a one-sided derivative exists at x0.

Hint: Using the definition of convexity in Exercise 1.12, show that the fraction in expression (1.17) is nonincreasing and bounded below as x decreases to x0.


(b) Prove that there exists a linear function g(x) = ax + b such that g(x0) = f(x0) and g(x) ≤ f(x) for all x in the interval. This fact is the supporting hyperplane property in the case of a convex function taking a real argument.

Hint: Let f ′(x0+) denote the one-sided derivative of part (a). Consider the line f(x0) + f ′(x0+)(x − x0).

Exercise 1.21 Prove Hölder's inequality: For random variables X and Y and positive p and q such that p + q = 1,

E|XY| ≤ (E|X|^(1/p))^p (E|Y|^(1/q))^q.    (1.18)

(Note: If p = q = 1/2, inequality (1.18) is also called the Cauchy-Schwarz inequality.)

Hint: Use the convexity of exp(x) to prove that |abXY| ≤ p|aX|^(1/p) + q|bY|^(1/q) whenever aX ≠ 0 and bY ≠ 0 (the same inequality is also true if aX = 0 or bY = 0). Take expectations, then find values for the scalars a and b that give the desired result when the right side of inequality (1.18) is nonzero.

Exercise 1.22 Kolmogorov's inequality is a strengthening of Chebyshev's inequality for a sum of independent random variables: If X1, . . . , Xn are independent random variables, define

Sk = Σ_{i=1}^k (Xi − E Xi)

to be the centered kth partial sum for 1 ≤ k ≤ n. Then for a > 0, Kolmogorov's inequality states that

P(max1≤k≤n |Sk| ≥ a) ≤ Var Sn / a^2.    (1.19)

Prove inequality (1.19).

Hint: Show that the function max1≤k≤n xk is a convex function on R^n, then use Markov's inequality and Jensen's inequality.


Chapter 2

Convergence of Random Variables

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued. In this chapter we extend this notion by considering what it means for a sequence of random variables to have a limit, then develop some of the theory of limits of random vector sequences.

2.1 Modes of Convergence

Whereas the limit of a constant sequence is very simply expressed by definition 1.11, in the case of random variables there are several ways to define the convergence of a sequence.

2.1.1 Almost Sure Convergence and Convergence in Probability

Suppose we wish to express the idea that Xn converges to X. (For now, we shall assume that all of the Xn and X are defined on the same sample space.) For given n and ǫ > 0, define the events

An = {ω ∈ Ω : |Xk(ω) − X(ω)| < ǫ for all k ≥ n}    (2.1)

and

Bn = {ω ∈ Ω : |Xn(ω) − X(ω)| < ǫ}.    (2.2)

Both An and Bn occur when Xn is in some sense close to X. Therefore, if X is to be the limit of Xn, then both An and Bn seem like events whose probabilities should tend toward 1 as n increases. Since An ⊂ Bn, we must have P(An) ≤ P(Bn); therefore, P(An) → 1 implies P(Bn) → 1 but not necessarily conversely. We shall define P(An) → 1 and P(Bn) → 1, when true for all ǫ > 0, as almost sure convergence and convergence in probability, respectively:

Definition 2.1 We say that Xn converges almost surely (or with probability one) to X if for any ǫ > 0,

P(|Xk − X| < ǫ for all k ≥ n) → 1 as n → ∞.    (2.3)

We denote almost sure convergence by Xn →a.s. X, Xn → X a.s., or Xn → X w.p. 1.


Definition 2.2 We say that Xn converges in probability to X if for any ǫ > 0,

P(|Xn − X| < ǫ) → 1 as n → ∞.    (2.4)

We denote convergence in probability by Xn →P X.

It is very common that the X in definitions 2.1 and 2.2 is a constant, say X ≡ c. In such cases, we simply write Xn →a.s. c or Xn →P c; in fact, there is no harm in replacing X by c in the definitions, as any constant may be viewed as a random variable defined on any sample space. In the most common statistical usage of convergence to a constant, the random variable Xn is some estimator of a particular parameter, say g(θ):

Definition 2.3 If Xn →P g(θ), Xn is said to be consistent (or weakly consistent) for g(θ). If Xn →a.s. g(θ), Xn is said to be strongly consistent for g(θ).

The definition of convergence in probability is not hard to understand: The probability that the absolute difference between Xn and X is large should become vanishingly small as n → ∞. Admittedly, however, it is not immediately clear how definition 2.1 may be simply interpreted. A simple interpretation comes as a result of the following theorem, whose proof is the subject of Exercise 2.1.

Theorem 2.1 Define An as in equation (2.1) and define A = {ω ∈ Ω : Xn(ω) → X(ω)}. Then P(An) → 1 for any ǫ > 0 if and only if P(A) = 1.

According to theorem 2.1, convergence with probability one is exactly what it sounds like: Xn →a.s. X means that the probability of the event {ω : Xn → X} equals one. Indeed, theorem 2.1 suggests an alternative definition of almost sure convergence, and many textbooks give P(Xn → X) = 1 as the definition of Xn →a.s. X. We chose our alternative definition because it makes clear the important fact that Xn →a.s. X implies Xn →P X; nonetheless, we hope that students will remember theorem 2.1.

We illustrate the fact that almost sure convergence is stronger than convergence in probability by inventing a simple example in which a sequence converges in probability but not with probability one.

Example 2.1 Take Ω to be the half-open interval (0, 1], and for any interval J ⊂ Ω, say J = (a, b], take P(J) = b − a to be the length of that interval. Define a sequence of intervals J1, J2, . . . as follows:

J1 = (0, 1];
J2 through J4: (0, 1/3], (1/3, 2/3], (2/3, 1];
J5 through J9: (0, 1/5], (1/5, 2/5], (2/5, 3/5], (3/5, 4/5], (4/5, 1];
. . .
J_{m^2+1} through J_{(m+1)^2}: (0, 1/(2m+1)], . . . , (2m/(2m+1), 1];
. . .

Note in particular that P(Jn) = 1/(2m + 1), where m = ⌊√(n − 1)⌋ is the largest integer not greater than √(n − 1). Now, define Xn = I{Jn} and take 0 < ǫ < 1. Let X ≡ 0. Then P(|Xn − 0| < ǫ) is the same as 1 − P(Jn). Since P(Jn) → 0, we conclude Xn →P 0 by definition.


However, it is not true that Xn →a.s. 0. Since every ω ∈ Ω is contained in infinitely many Jn, the set An defined in equation (2.1) is empty for all n. In other words, for every n and ω ∈ Ω there exists k ≥ n such that |Xk(ω)| = 1, so An must be empty if 0 < ǫ < 1. Alternatively, consider the set S = {ω : Xn(ω) → 0}. For any ω, Xn(ω) has no limit because Xn(ω) = 1 infinitely often. Thus S is empty. This is not convergence with probability one; it's convergence with probability zero!

There are analogues of the o and O notations that may be applied to random variables.

Definition 2.4 We write Xn = oP(Yn) if Xn/Yn →P 0.

Definition 2.5 We write Xn = OP(Yn) if for every ǫ > 0, there exist M and N such that

P(|Xn/Yn| < M) > 1 − ǫ for all n > N.

In concluding this subsection, we extend definitions 2.1 and 2.2 to the multivariate case in a completely straightforward way:

Definition 2.6 X(n) converges almost surely (or with probability one) to X if for any ǫ > 0,

P(‖X(k) − X‖ < ǫ for all k ≥ n) → 1 as n → ∞,

or, alternatively, if

P(X(n) → X as n → ∞) = 1.

Definition 2.7 X(n) converges in probability to X if for any ǫ > 0,

P(‖X(n) − X‖ < ǫ) → 1 as n → ∞.

2.1.2 Convergence in Distribution

We now define a third mode of convergence, known as convergence in distribution or convergence in law. (A fourth mode of convergence, called convergence in rth mean, is the subject of Exercise 2.2.) As the name suggests, convergence in distribution has to do with convergence of the distribution functions of random variables. Given a random variable X, the distribution function of X is the function

F(x) = P(X ≤ x).    (2.5)

Any distribution function F(x) is nondecreasing and right-continuous and satisfies limx→−∞ F(x) = 0 and limx→∞ F(x) = 1. Conversely, any function F(x) with these properties is a distribution function for some random variable.

It is not enough to define convergence in distribution as simple pointwise convergence of a sequence of distribution functions; there are technical reasons that such a simplistic definition fails to capture any useful concept of convergence of random variables. These reasons are illustrated by the following two examples.


Example 2.2 Let Xn be normally distributed with mean 0 and variance n. Then the distribution function of Xn is Φ(x/√n), where Φ(z) denotes the standard normal distribution function. Because Φ(0) = 1/2, we see that for any fixed x, the distribution function of Xn converges to 1/2 at that point as n → ∞. But the function that is constant at 1/2 is not a distribution function; therefore, not all convergent sequences of distribution functions have limits that are distribution functions.

Example 2.3 By any sensible definition of convergence, 1/n should converge to 0. Butconsider the distribution functions Fn(x) = Ix ≥ 1/n and F (x) = Ix ≥ 0 cor-responding to the constant random variables 1/n and 0. We do not have pointwiseconvergence of Fn(x) to F (x), since Fn(0) = 0 for all n but F (0) = 1. Note, however,that Fn(x) → F (x) is true for all x 6= 0. The point x = 0 is the only point at whichthe function F (x) is not continuous.
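A tiny numerical check of these two examples may help fix the ideas; the Python sketch below (the evaluation points are arbitrary) shows the N(0, n) distribution functions flattening toward the constant 1/2 and the point-mass distribution functions converging everywhere except at the discontinuity point x = 0.

```python
from statistics import NormalDist

def F_ex22(x, n):
    # Distribution function of N(0, n): Phi(x / sqrt(n)).
    return NormalDist(mu=0.0, sigma=n ** 0.5).cdf(x)

def F_ex23(x, n):
    # Distribution function of the constant random variable 1/n.
    return 1.0 if x >= 1.0 / n else 0.0

for n in (1, 10, 100, 10000):
    print(f"n={n:6d}  Phi(1/sqrt(n)) = {F_ex22(1.0, n):.4f}   "
          f"F_n(0) = {F_ex23(0.0, n):.0f}   F_n(0.5) = {F_ex23(0.5, n):.0f}")
# Phi(1/sqrt(n)) -> 1/2, which is not a distribution function in the limit;
# F_n(0) stays 0 although F(0) = 1 (x = 0 is the discontinuity point of F),
# while F_n(0.5) -> 1 = F(0.5) at the continuity point x = 0.5.
```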

To write a sensible definition of convergence in distribution, example 2.2 demonstrates that we should require that the limit of distribution functions be a distribution function, while example 2.3 indicates that we should exclude points where F(x) is not continuous. We therefore arrive at the following definition:

Definition 2.8 Suppose that X has distribution function F(x) and that Xn has distribution function Fn(x) for each n. Then we say Xn converges in distribution (or law) to X, written Xn d→ X, if Fn(x) → F(x) as n → ∞ for all x at which F(x) is continuous.

Could Xn d→ X imply either Xn as→ X or Xn P→ X? Not in general: Since convergence in distribution only involves distribution functions, Xn d→ X is possible even if Xn and X are not defined on the same sample space. However, we now prove that the converse is true.

Theorem 2.2 If Xn P→ X, then Xn d→ X.

Proof: Let Fn(x) and F(x) denote the distribution functions of Xn and X, respectively. Suppose c is a point of continuity of F(x). We need to show that Fn(c) → F(c).

Choose any ǫ > 0. Note that whenever Xn ≤ c, it must be true that either X ≤ c + ǫ or |Xn − X| > ǫ. This implies that

Fn(c) ≤ F (c + ǫ) + P (|Xn − X| > ǫ).

Similarly, whenever X ≤ c − ǫ, either Xn ≤ c or |Xn − X| > ǫ, implying

F (c − ǫ) ≤ Fn(c) + P (|Xn − X| > ǫ).

We conclude that for arbitrary n and ǫ > 0,

F (c − ǫ) − P (|Xn − X| > ǫ) ≤ Fn(c) ≤ F (c + ǫ) + P (|Xn − X| > ǫ). (2.6)

Now take an arbitrary δ > 0. Since F(x) is continuous at c and Xn P→ X, it is always possible to find ǫ small enough and N large enough so that the expressions on both the left and right of inequality (2.6) are within δ of F(c) for all n > N. Thus |Fn(c) − F(c)| < δ for all n > N, which means that Fn(c) → F(c) as desired.

We remarked earlier that Xn d→ X could not possibly imply Xn P→ X because the latter expression requires that Xn and X be defined on the same sample space. However, a constant c may be


considered to be a random variable defined on any sample space; thus, it is reasonable to ask whether Xn d→ c implies Xn P→ c. The answer is yes:

Theorem 2.3 Xn d→ c if and only if Xn P→ c.

Proof: We only need to prove that Xn d→ c implies Xn P→ c, since the other direction is a special case of Theorem 2.2. For any ǫ > 0,

P(−ǫ < Xn − c ≤ ǫ) = Fn(c + ǫ) − Fn(c − ǫ),

which tends to F(c + ǫ) − F(c − ǫ) = 1 as n → ∞, since c + ǫ and c − ǫ are continuity points of F.

As with almost sure convergence and convergence in probability, it is easy to extend convergence in distribution to random vectors. The multivariate analogue of equation (2.5) is

F(x) def= P(X ≤ x),

where X is a random vector in R^k and X ≤ x means that Xi ≤ xi for all 1 ≤ i ≤ k.

Definition 2.9 X(n) converges in distribution to X if for any point c at which F(x) is continuous,

Fn(c) → F(c) as n → ∞.

The relationships among the four modes of convergence discussed above and in exercises 2.2 and 2.3 are summarized by the diagram below. Note that when convergence is to a constant c rather than a random vector X, the picture changes slightly.

X(n) qm→ X
⇓
X(n) as→ X ⇒ X(n) P→ X ⇒ X(n) d→ X

X(n) qm→ c
⇓
X(n) as→ c ⇒ X(n) P→ c ⇔ X(n) d→ c

Exercises for Section 2.1

Exercise 2.1 Prove theorem 2.1.

Hint: For ǫ > 0, define C(ǫ) = ∪_{n=1}^∞ An, where An is defined as in equation (2.1). Note that ω ∈ C(ǫ) if and only if there exists N such that |Xn(ω) − X(ω)| < ǫ for all n > N.

Exercise 2.2 We first define convergence in the rth mean.

Definition 2.10 For r > 0, Xn converges in the rth mean to X if E|Xn − X|^r → 0 as n → ∞. We write Xn r→ X. As a special case, we say that Xn converges in quadratic mean to X, written Xn qm→ X, if E(Xn − X)^2 → 0.

Prove that for a constant c, Xn qm→ c if and only if E Xn → c and Var Xn → 0.


Exercise 2.3 (a) Prove that if Xn qm→ X, then Xn P→ X.

(b) The converse is not true: Xn P→ c does not imply E Xn → c or Xn qm→ c or Var Xn → 0. Construct an example in which Xn P→ 0 but E Xn = 1 for all n.

Hint: Convergence in probability means that convergence happens everywhere except on a subset of the sample space with small probability. However, the mean of a random variable may be strongly influenced by an event with small probability.

Exercise 2.4 Prove or disprove this statement: If there exists M such that P(|Xn| < M) = 1 for all n, then Xn P→ c implies Xn qm→ c.

Exercise 2.5 If X is a univariate random variable with distribution function F(x), then F(x) is continuous at c if and only if P(X = c) = 0. Prove by counterexample that this is not true if the variables X and c are replaced by vectors X and c.

2.2 Laws of Large Numbers

Given a sequence of independent and identically distributed random variables or vectors, the strong law of large numbers states that the sample mean is strongly consistent for the population mean, while the weak law states that the sample mean is (weakly) consistent for the population mean. For a sequence of random vectors X1, X2, . . ., we denote the nth sample mean by

X̄(n) def= (1/n) ∑_{i=1}^n Xi.

If the Xi are not vectors, we often use X̄n rather than X̄(n).

Theorem 2.4 Strong Law of Large Numbers: Suppose that X(1), X(2), . . . are independent and identically distributed and have finite mean µ. Then X̄(n) as→ µ.

Since as→ implies P→, theorem 2.4 implies:

Theorem 2.5 Weak Law of Large Numbers: Under the assumptions of theorem 2.4, X̄(n) P→ µ.

We will discuss several different proofs of the strong and weak laws, some of them using more restrictive assumptions than those given in theorems 2.4 and 2.5. In particular, note that neither theorem requires the X(i) to have any finite moments other than the mean; though if we assume in addition that finite variances or fourth moments exist, then the univariate versions of these theorems may be proved with considerably less effort. An outline of a proof of the strong law in the general case is left to exercise 2.8, and a general proof of theorem 2.5 based on characteristic functions is given in exercise 3.4.

If the Xn are univariate with finite variance, the weak law may be proved in a single line: Chebyshev's inequality (1.15) implies that

P(|X̄n − µ| ≥ ǫ) ≤ Var X̄n / ǫ^2 = Var X1 / (nǫ^2) → 0


as n → ∞, so X̄n P→ µ follows by definition. (This is a special case of the result in exercise 2.3.)
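As a quick illustration, the following Python sketch (the Exponential(1) data and the tolerance ǫ = 0.1 are arbitrary choices) compares the empirical value of P(|X̄n − µ| ≥ ǫ) with the Chebyshev bound Var X1/(nǫ^2); both shrink as n grows, the empirical probability much faster.

```python
import random

random.seed(553)

def sample_mean(n):
    # Exponential(1) data: mu = 1 and Var X_1 = 1.
    return sum(random.expovariate(1.0) for _ in range(n)) / n

eps, reps = 0.1, 2000
for n in (10, 100, 1000):
    freq = sum(abs(sample_mean(n) - 1.0) >= eps for _ in range(reps)) / reps
    bound = 1.0 / (n * eps ** 2)     # Chebyshev: Var X_1 / (n * eps^2)
    print(f"n={n:5d}  P(|mean - mu| >= {eps}) ~ {freq:.3f}   Chebyshev bound = {bound:.3f}")
```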

For a relatively quick proof of the strong law under more restrictive assumptions, we first establish a useful lemma.

Lemma 2.1 If ∑_{k=1}^∞ P(‖X(k) − X‖ > ǫ) < ∞ for any ǫ > 0, then X(n) as→ X.

Proof: The proof relies on the countable subadditivity of any probability measure, which states that for any sequence A1, A2, . . . of events,

P(∪_{k=1}^∞ Ak) ≤ ∑_{k=1}^∞ P(Ak). (2.7)

To prove the lemma, we must demonstrate that P(‖X(k) − X‖ ≤ ǫ for all k ≥ n) → 1 as n → ∞, which (taking complements) is equivalent to P(‖X(k) − X‖ > ǫ for some k ≥ n) → 0. By countable subadditivity,

P(‖X(k) − X‖ > ǫ for some k ≥ n) = P(∪_{k=n}^∞ {‖X(k) − X‖ > ǫ}) ≤ ∑_{k=n}^∞ P(‖X(k) − X‖ > ǫ),

and the right hand side tends to 0 as n → ∞ because it is the tail of a convergent series.

We now prove the strong law in the univariate case under the additional assumption that E X_n^4 < ∞.

Proof: We may assume without loss of generality that E Xn = 0, since replacing Xn by Xn − µ does not change the theorem in any important way. In light of lemma 2.1, we need only prove that

∑_{n=1}^∞ P(|X̄n| > ǫ) < ∞. (2.8)

According to Markov’s inequality (1.14) with r = 4,

P(|X̄n| > ǫ) ≤ E S_n^4 / (n^4 ǫ^4), (2.9)

where Sn denotes the nth partial sum ∑_{i=1}^n Xi. Therefore, it might help if we could evaluate

E S_n^4 = E ∑_{a=1}^n ∑_{b=1}^n ∑_{c=1}^n ∑_{d=1}^n Xa Xb Xc Xd.

Whenever one of the four indices a, b, c, or d is distinct from the other three, then by independence and the fact that E Xi = 0 we have E Xa Xb Xc Xd = 0. Therefore, the only nonzero terms in E S_n^4 are n copies of E X_1^4 (when a = b = c = d) along with n(n − 1) copies of (E X_1^2)^2 for each of the cases a = b and c = d, a = c and b = d, and a = d and b = c. We conclude that

E S_n^4 / (n^4 ǫ^4) = [n E X_1^4 + 3n(n − 1)(E X_1^2)^2] / (n^4 ǫ^4) = k/n^2 + o(1/n^2),

where k does not depend on n. Since ∑_{n=1}^∞ n^{−2} < ∞, inequality (2.9) shows that inequality (2.8) is satisfied.


Suppose that in the univariate case we consider a generalization of the independent and identically distributed assumption: Suppose that X1, X2, . . . are independent but not necessarily identically distributed, such that E Xi = µ and Var Xi = σ_i^2. When is X̄n consistent for µ?

Since E X̄n = µ for all n, the results of exercises 2.2 and 2.3(a) imply that X̄n is consistent whenever Var X̄n → 0 (note, however, that the converse is not true; see exercise 2.9). The variance of X̄n is easy to write down, and we conclude that X̄n P→ µ if ∑_{i=1}^n σ_i^2 = o(n^2).

On the other hand, if instead of X̄n we consider the BLUE (best linear unbiased estimator)

δn = (∑_{i=1}^n Xi/σ_i^2) / (∑_{j=1}^n 1/σ_j^2), (2.10)

then as before E δn = µ, but

Var δn = 1 / (∑_{j=1}^n 1/σ_j^2).
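To make the contrast concrete, the short Python sketch below evaluates Var X̄n and Var δn when σ_i^2 = i (an arbitrary illustrative choice): the first stays near 1/2 while the second behaves like 1/log n and tends to zero.

```python
def variances(n):
    sigma2 = [float(i) for i in range(1, n + 1)]          # Var X_i = i
    var_mean = sum(sigma2) / n ** 2                       # Var of the sample mean
    var_blue = 1.0 / sum(1.0 / s for s in sigma2)         # Var delta_n from (2.10)
    return var_mean, var_blue

for n in (10, 100, 1000, 10000):
    vm, vb = variances(n)
    print(f"n={n:6d}  Var(sample mean) = {vm:.4f}   Var(delta_n) = {vb:.4f}")
```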

Example 2.4 Consider the case of simple linear regression,

Yi = β0 + β1zi + ǫi,

where the zi are known covariates and the ǫi are independent and identically distributed with mean 0 and variance σ^2. If we define

wi = (zi − z̄) / ∑_{j=1}^n (zj − z̄)^2  and  vi = 1/n − z̄ wi,

then the least squares estimators of β0 and β1 are

β̂0n = ∑_{i=1}^n vi Yi  and  β̂1n = ∑_{i=1}^n wi Yi,

respectively. Since E Yi = β0 + β1 zi, we have

E β̂0n = β0 + β1 z̄ − β0 z̄ ∑_{i=1}^n wi − β1 z̄ ∑_{i=1}^n wi zi

and

E β̂1n = β0 ∑_{i=1}^n wi + β1 ∑_{i=1}^n wi zi.

Note that ∑_{i=1}^n wi = 0 and ∑_{i=1}^n wi zi = 1. Therefore, E β̂0n = β0 and E β̂1n = β1, which is to say that β̂0n and β̂1n are unbiased. Therefore, by exercises 2.2 and 2.3(a), a sufficient condition for the consistency of β̂0n and β̂1n is that their variances tend to zero as n → ∞. Since Var Yi = σ^2, we obtain

Var β̂0n = σ^2 ∑_{i=1}^n vi^2  and  Var β̂1n = σ^2 ∑_{i=1}^n wi^2.


With ∑_{i=1}^n wi^2 = (∑_{j=1}^n (zj − z̄)^2)^{−1}, these expressions simplify to

Var β̂0n = σ^2/n + σ^2 z̄^2 / ∑_{j=1}^n (zj − z̄)^2  and  Var β̂1n = σ^2 / ∑_{j=1}^n (zj − z̄)^2.

Therefore, β̂0n and β̂1n are consistent if

z̄^2 / ∑_{j=1}^n (zj − z̄)^2  and  1 / ∑_{j=1}^n (zj − z̄)^2,

respectively, tend to zero.
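The Python sketch below plugs the covariates zi = i/n (an arbitrary choice) and σ^2 = 1 into the two variance formulas from example 2.4; both tend to zero, so the least squares estimators are consistent for this design.

```python
def ls_variances(n, sigma2=1.0):
    z = [i / n for i in range(1, n + 1)]
    zbar = sum(z) / n
    ssz = sum((zi - zbar) ** 2 for zi in z)        # sum_j (z_j - zbar)^2
    var_b0 = sigma2 / n + sigma2 * zbar ** 2 / ssz
    var_b1 = sigma2 / ssz
    return var_b0, var_b1

for n in (10, 100, 1000, 10000):
    v0, v1 = ls_variances(n)
    print(f"n={n:6d}  Var(beta0_hat) = {v0:.5f}   Var(beta1_hat) = {v1:.5f}")
```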

Suppose that X1, X2, . . . have E Xi = µ but they are not independent. Then clearly E X̄n = µ and so X̄n is consistent for µ if Var X̄n → 0. In this case,

Var X̄n = (1/n^2) ∑_{i=1}^n ∑_{j=1}^n Cov(Xi, Xj). (2.11)

Definition 2.11 The sequence X1, X2, . . . is stationary if the joint distribution of (Xi, . . . , Xi+k) does not depend on i for any i > 0 and any k ≥ 0.

For a stationary sequence, Cov(Xi, Xj) depends only on the “gap” j − i; for example, Cov(X1, X4) = Cov(X2, X5) = Cov(X5, X8) = · · ·. Therefore, the variance expression of equation (2.11) becomes

Var X̄n = σ^2/n + (2/n^2) ∑_{k=1}^{n−1} (n − k) Cov(X1, X1+k). (2.12)

Lemma 2.2 Expression (2.12) tends to 0 as n → ∞ if σ^2 < ∞ and Cov(X1, X1+k) → 0 as k → ∞.

Proof: It is clear that σ^2/n → 0 if σ^2 < ∞. Assuming that Cov(X1, X1+k) → 0, select ǫ > 0 and note that for N chosen so that |Cov(X1, X1+k)| < ǫ/4 for all k > N, we have

|(2/n^2) ∑_{k=1}^{n−1} (n − k) Cov(X1, X1+k)| ≤ (2/n) ∑_{k=1}^{N} |Cov(X1, X1+k)| + (2/n) ∑_{k=N+1}^{n−1} |Cov(X1, X1+k)|.

Note that the second term on the right is strictly less than ǫ/2, and the first term is a constant divided by n, which may be made smaller than ǫ/2 by choosing n large enough. This proves that σ^2 < ∞ and Cov(X1, X1+k) → 0 together imply that expression (2.12) goes to zero.

Definition 2.12 The sequence X1, X2, . . . is m-dependent for some fixed nonnegative integer m if the random vectors (X1, . . . , Xi) and (Xj, Xj+1, . . .) are independent whenever j − i > m.

Any independent and identically distributed sequence is 0-dependent and stationary. Also, any stationary m-dependent sequence trivially satisfies Cov(X1, X1+k) → 0 as k → ∞, so X̄n is consistent for any stationary m-dependent sequence with finite variance.
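For instance, Xi = Zi + Zi+1 built from iid standard normal Zi is stationary and 1-dependent with σ^2 = 2 and Cov(X1, X2) = 1, so equation (2.12) gives Var X̄n = (4n − 2)/n^2. The Python sketch below (an illustrative simulation with these particular choices) matches that formula empirically and shows it tending to zero.

```python
import random

random.seed(553)

def sample_mean_1dep(n):
    # X_i = Z_i + Z_{i+1}: a stationary, 1-dependent sequence with mean 0.
    z = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
    return sum(z[i] + z[i + 1] for i in range(n)) / n

reps = 2000
for n in (10, 100, 1000):
    means = [sample_mean_1dep(n) for _ in range(reps)]
    emp_var = sum(m * m for m in means) / reps
    exact = (4 * n - 2) / n ** 2       # sigma^2/n + 2(n-1)Cov(X1, X2)/n^2
    print(f"n={n:5d}  empirical Var(mean) = {emp_var:.4f}   formula (2.12) = {exact:.4f}")
```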


Exercises for Section 2.2

Exercise 2.6 Suppose that ak → c as k → ∞ for a sequence of real numbers a1, a2, . . .. Prove that this implies convergence in the sense of Cesaro, which means that

(1/n) ∑_{k=1}^n ak → c. (2.13)

In this case, c may be real or it may be ±∞.

Hint: If c is real, consider the definition of ak → c: There exists N such that |ak − c| < ǫ for all k > N. Consider what happens when the sum in expression (2.13) is broken into two sums, one for k ≤ N and one for k > N. The case c = ±∞ follows a similar line of reasoning.

Exercise 2.7 Let B1, B2, . . . denote a sequence of events. Let {Bn i.o.}, which stands for Bn infinitely often, denote the set

{Bn i.o.} def= {ω ∈ Ω : for every n, there exists k ≥ n such that ω ∈ Bk}.

Prove the first Borel-Cantelli lemma, which states that if ∑_{n=1}^∞ P(Bn) < ∞, then P(Bn i.o.) = 0.

Hint: Argue that

{Bn i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ Bk,

then adapt the proof of lemma 2.1.

Exercise 2.8 Prove the strong law of large numbers, theorem 2.4. Use the following steps to accomplish this job:

(a) First, argue that we may assume without loss of generality that Xn is 1-dimensional rather than k-dimensional and that Xn is nonnegative.

Hint: For the second part, write Xn = X_n^+ − X_n^−, where X_n^+ = Xn I{Xn ≥ 0} and X_n^− = |Xn| I{Xn ≤ 0}. Show that the strong law is true for the Xn if it is true for the X_n^+ and X_n^−.

(b) Prove that for any nonnegative random variable Y,

∑_{k=1}^∞ P(Y ≥ k) ≤ E Y.

(c) Define X*_n = Xn I{Xn ≤ n} to be a truncated version of Xn. Define

Sn = ∑_{i=1}^n Xi  and  Tn = ∑_{i=1}^n X*_i.

Prove that Xn − X*_n as→ 0 and thus (Sn − Tn)/n as→ 0.


Hint: Use lemma 2.1 and the result of exercise 2.6.

(d) Fix δ > 1, and define αn = ⌊δ^n⌋ to be the largest integer less than or equal to δ^n. Prove that

(Tαn − E Tαn)/αn as→ 0.

Hint: Use Chebyshev's inequality to derive an upper bound on P(|Tαn − E Tαn| > ǫαn), then use lemma 2.1.

(e) Show that Sαn/αn as→ µ, where µ = E Xn.

Hint: Use the result of exercise 2.6 to show that E Tαn/αn → µ. Then use parts (c) and (d).

(f) Finally, prove that part (e) implies that Sn/n as→ µ, completing the proof.

Hint: Argue that whenever αn ≤ k ≤ αn+1,

(αn/αn+1)(Sαn/αn) ≤ Sk/k ≤ (αn+1/αn)(Sαn+1/αn+1).

Find the limits of the left hand side and the right hand side, then recall that δ > 1 is arbitrary.

Exercise 2.9 Suppose X1, X2, . . . are independent with E Xi = µ and Var Xi = σ_i^2.

(a) Give an example in which X̄n P→ µ but Var X̄n does not converge to 0.

(b) Prove that, for δn defined as in equation (2.10), Var δn ≤ Var X̄n and give an example (i.e., specify the values of σ_i^2) in which Var δn → 0 but Var X̄n → ∞.

Exercise 2.10 Suppose X1, X2, . . . are independent and identically distributed with E(Xi) = µ and Var(Xi) = σ^2 < ∞. Let Yi = X̄i = (∑_{j=1}^i Xj)/i.

(a) Prove that Ȳn = (∑_{i=1}^n Yi)/n is a consistent estimator of µ.

(b) Compute the relative efficiency e_{Ȳn,X̄n} of Ȳn to X̄n, defined as Var(X̄n)/Var(Ȳn), for n ∈ {5, 10, 20, 50, 100, ∞} and report the results in a table. Note that n = ∞ in the table does not actually mean n = ∞, since there is no such real number; instead, n = ∞ is shorthand for the limit (of the efficiency) as n → ∞.

Exercise 2.11 Let Y1, Y2, . . . be independent and identically distributed with mean µ and variance σ^2 < ∞. Let

X1 = Y1, X2 = (Y2 + Y3)/2, X3 = (Y4 + Y5 + Y6)/3, etc.

Define δn as in equation (2.10).

(a) Show that δn and X̄n are both consistent estimators of µ.


(b) Calculate the relative efficiency e_{X̄n,δn} of X̄n to δn, defined as Var(δn)/Var(X̄n), for n = 5, 10, 20, 50, 100, and ∞ and report the results in a table.

(c) Using example 1.4, give a simple expression asymptotically equivalent to e_{X̄n,δn}. Report its values in your table for comparison. How good is the approximation for small n?

2.3 Convergence of Transformed Sequences

2.3.1 Continuous transformations

Continuous functions preserve all three modes of convergence of section 2.1.

Theorem 2.6 Suppose that f : R^k → R^ℓ is continuous and X(n) is a k-component random vector.

• If X(n) as→ X, then f(X(n)) as→ f(X).

• If X(n) P→ X, then f(X(n)) P→ f(X).

• If X(n) d→ X, then f(X(n)) d→ f(X).

Interestingly, although the three statements in theorem 2.6 are very similar, they are not all provable by the same means. The statement on almost sure convergence follows immediately from earlier facts regarding continuous functions of sequences. Similarly, the statement on convergence in probability follows immediately from the definition of continuity (see exercise 2.12). However, the third statement regarding convergence in distribution is not trivial to prove, and we use the following theorem:

Theorem 2.7 X(n) d→ X if and only if E g(X(n)) → E g(X) for all bounded and continuous real-valued functions g : R^k → R.

Theorem 2.7 is proved as exercises 2.16 and 2.17. Once this theorem is known, the third part of theorem 2.6 follows easily:

Proof of theorem 2.6, part 3: If f : R^k → R^ℓ is continuous and g : R^ℓ → R is any bounded and continuous function, then g ◦ f : R^k → R is bounded and continuous. By theorem 2.7, this implies that E[g ◦ f(X(n))] → E[g ◦ f(X)]. But this proves f(X(n)) d→ f(X), again by theorem 2.7.

2.3.2 Slutsky’s theorem

Related to theorem 2.6 is the question of when we may “stack” random variables to make random vectors while preserving convergence. It is here that we encounter perhaps the biggest surprise of this section: Convergence in distribution is not preserved by “stacking”.

To understand what we mean by “stacking,” consider that by the definition of convergence in probability,


Xn P→ X and Yn P→ Y implies that (Xn, Yn) P→ (X, Y). (2.14)

Note that the converse of (2.14) is true by theorem 2.6 because the function f(x, y) = x is a continuous function from R^2 to R. By induction, we can therefore stack or unstack arbitrarily many random variables or vectors without disturbing convergence in probability. Combining this fact with theorem 2.6 yields a powerful result; see exercise 2.18.

However, statement (2.14) does not remain true in general if P→ is replaced by d→ throughout the statement. As a simple example of this fact, take Xn and Yn to be independent standard normal random variables for all n. It is trivially true that Xn d→ Z and Yn d→ Z, where Z ∼ N(0, 1). But it is just as clearly not true that

(Xn, Yn) d→ (Z, Z), (2.15)

since the distribution on the left is bivariate normal with correlation 0, while the distribution on the right is bivariate normal with correlation 1. Expression (2.15) is untrue precisely because the marginal distributions do not uniquely determine the joint distribution.

However, there are certain special cases in which the marginals do determine the joint distribution. For instance, if random variables are independent, then their marginal distributions uniquely determine their joint distribution. In similar fashion, we can say that if Xn d→ X and Yn d→ Y, where Xn is independent of Yn and X is independent of Y, then statement (2.14) remains true when P→ is replaced by d→ (see exercise 2.19). As a special case, the constant c, when viewed as a random variable, is automatically independent of any other random variable. Since Yn d→ c is equivalent to Yn P→ c by theorem 2.3, it must be true that

Xn d→ X and Yn P→ c implies that (Xn, Yn) d→ (X, c) (2.16)

if Xn is independent of Yn for every n. The content of a powerful theorem called Slutsky's theorem is that statement (2.16) remains true even if the Xn and Yn are not independent:

Theorem 2.8 Slutsky's theorem: For any random vectors X(n) and Y(n) and any constant c, if X(n) d→ X and Y(n) P→ c, then

(X(n), Y(n)) d→ (X, c).

A proof of theorem 2.8 is outlined in exercise 2.20.

Example 2.5 Two-sample z statistic: Here we prove the asymptotic normality of the statistic

Z = [(Ȳn − X̄m) − (η − µ)] / √(σ^2/m + τ^2/n),


where X1, . . . , Xm are iid with mean µ and variance σ^2 < ∞ and Y1, . . . , Yn are iid with mean η and variance τ^2 < ∞. The two samples are independent. In order to apply asymptotic results, we insist that both m and n are considered to go to infinity. One way to do this is to let m ≡ mn be a sequence that tends to infinity as n → ∞. We make one additional assumption, that m/(m + n) → λ as n → ∞ for some 0 < λ < 1.

The (univariate) CLT gives

√n(Ȳn − η) d→ N(0, τ^2) and √m(X̄m − µ) d→ N(0, σ^2)

as n → ∞. By the result of exercise 2.19,

(√n(Ȳn − η), √m(X̄m − µ)) d→ (τZ1, σZ2),

where Z1 and Z2 are independent standard normal random variables. This gives

(√(m + n)(Ȳn − η), √(m + n)(X̄m − µ)) d→ (τZ1/√(1 − λ), σZ2/√λ),

which in turn implies

√(m + n) [(Ȳn − X̄m) − (η − µ)] d→ τZ1/√(1 − λ) − σZ2/√λ.

The random variable on the right side above has a N(0, τ^2/(1 − λ) + σ^2/λ) distribution. In other words,

√[(m + n) / (σ^2/λ + τ^2/(1 − λ))] [(Ȳn − X̄m) − (η − µ)] d→ N(0, 1).

Finally, one may easily verify that

√[(m + n) / (σ^2/λ + τ^2/(1 − λ))] ∼ 1 / √(σ^2/m + τ^2/n),

so applying Slutsky's theorem (the univariate version) gives

Z = [(Ȳn − X̄m) − (η − µ)] / √(σ^2/m + τ^2/n) d→ N(0, 1).
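A simulation sketch of this result (Python; the Exponential and Uniform populations, λ = 1/3, and n = 400 are arbitrary choices) confirms that the two-sample z statistic behaves like a standard normal for moderately large samples.

```python
import random

random.seed(553)

def two_sample_z(n):
    # X ~ Exponential(1): mu = 1, sigma^2 = 1;  Y ~ Uniform(0, 2): eta = 1, tau^2 = 1/3.
    m = n // 2                      # so that m/(m + n) -> lambda = 1/3
    xbar = sum(random.expovariate(1.0) for _ in range(m)) / m
    ybar = sum(random.uniform(0.0, 2.0) for _ in range(n)) / n
    return ((ybar - xbar) - (1.0 - 1.0)) / (1.0 / m + (1.0 / 3.0) / n) ** 0.5

reps, n = 5000, 400
zs = [two_sample_z(n) for _ in range(reps)]
mean = sum(zs) / reps
var = sum(z * z for z in zs) / reps - mean ** 2
frac = sum(abs(z) <= 1.96 for z in zs) / reps
print(f"mean = {mean:.3f}  variance = {var:.3f}  P(|Z| <= 1.96) ~ {frac:.3f}")
# Expected values for a standard normal: 0, 1, and 0.95.
```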

Notice in the previous example that multivariate techniques were employed to determine the univariate distribution of the random variable Z. This is a common trick that will be used extensively in later chapters: We often obtain univariate results using multivariate techniques.

Putting several of the preceding results together yields the following corollary.

Corollary 2.1 If X is a k-vector such that X(n) d→ X, and Y(n)_i P→ ci for 1 ≤ i ≤ m, then

f(X(n), Y(n)) d→ f(X, c)

for any continuous function f : R^{k+m} → R^ℓ.


2.3.3 The Cramer-Wold theorem

Suppose that X(1), X(2), . . . is a sequence of random k-vectors. By theorem 2.6, we see immediately that

X(n) d→ X implies aᵗX(n) d→ aᵗX for any a ∈ R^k. (2.17)

It is not clear, however, whether the converse of statement (2.17) is true. Such a converse would be useful because it would give a means for proving multivariate convergence in distribution using only univariate methods. We have already seen that multivariate convergence in distribution does not follow from the mere fact that each of the components converges in distribution. However, the Cramer-Wold theorem tells us that the converse of statement (2.17) is true:

Theorem 2.9 X(n) d→ X if and only if aᵗX(n) d→ aᵗX for all a ∈ R^k.

Using the machinery of characteristic functions, to be presented in section 3.1, the proof of the Cramer-Wold theorem is immediate; see exercise 3.3.

Exercises for Section 2.3

Exercise 2.12 Suppose that f : R^k → R^ℓ is continuous and X(n) is a k-component random vector.

(a) If X(n) as→ X, prove that f(X(n)) as→ f(X).

Hint: Show that {ω : X(n)(ω) → X(ω)} ⊂ {ω : f(X(n)(ω)) → f(X(ω))}.

(b) If X(n) P→ X, prove that f(X(n)) P→ f(X).

Hint: For any ǫ > 0, prove that there exists δ > 0 such that P(‖X(n) − X‖ < δ) ≤ P(‖f(X(n)) − f(X)‖ < ǫ).

Exercise 2.13 Construct a counterexample to show that Theorem 2.8 may not be strengthened by changing Yn P→ c to Yn P→ Y.

Exercise 2.14 This exercise deals with bounded in probability sequences.

Definition 2.13 We say that Xn is bounded in probability if Xn = OP(1), i.e., if for every ǫ > 0, there exist M and N such that P(|Xn| < M) > 1 − ǫ for n > N.

(a) Prove that if Xn d→ X for some random variable X, then Xn is bounded in probability.

Hint: You may use the fact that any interval of real numbers must contain a point of continuity of F(x). Also, recall that F(x) → 1 as x → ∞.

(b) Prove that if Xn is bounded in probability and Yn P→ 0, then XnYn P→ 0.

Exercise 2.15 Here we prove part of theorem 2.7: If Xn d→ X and g(x) is a bounded, continuous function on R, then E g(Xn) → E g(X). We treat the univariate case


separately here because it is much easier to understand the essential details of the proof without dealing with multivariate issues (but see exercise 2.16).

Let Fn(x) and F(x) denote the distribution functions of Xn and X, as usual. For ǫ > 0, take b < c to be constant real numbers such that F(b) < ǫ and F(c) > 1 − ǫ. First, we note that since g(x) is bounded and continuous, it must be uniformly continuous on [b, c]: That is, for any ǫ > 0 there exists δ > 0 such that |g(x) − g(y)| < ǫ whenever |x − y| < δ. This fact, along with the boundedness of g(x), ensures that there exists a finite set of real numbers b = a0 < a1 < · · · < am = c such that:

• Each ai is a continuity point of F(x).

• F(a0) < ǫ and F(am) > 1 − ǫ.

• For 1 ≤ i ≤ m, |g(x) − g(ai)| < ǫ for all x ∈ [ai−1, ai].

(a) Define

h(x) = g(ai) if ai−1 < x ≤ ai for some 1 ≤ i ≤ m, and h(x) = 0 otherwise.

Prove that there exists N such that |E h(Xn) − E h(X)| < ǫ for all n > N.

Hint: Let M be such that |g(x)| < M for all x ∈ R. Use the fact that for any random variable Y,

E h(Y) = ∑_{i=1}^m g(ai) P(ai−1 < Y ≤ ai).

Also, please note that we may not write E h(Xn) − E h(X) as E[h(Xn) − h(X)] because it is not necessarily the case that Xn and X are defined on the same sample space.

(b) Prove that E g(Xn) → E g(X).

Hint: Use the fact that

|E g(Xn) − E g(X)| ≤ |E g(Xn) − E h(Xn)| + |E h(Xn) − E h(X)| + |E h(X) − E g(X)|.

Exercise 2.16 Adapt the method of proof in exercise 2.15 to the multivariate case, proving half of theorem 2.7: If X(n) d→ X, then E g(X(n)) → E g(X) for any bounded, continuous g : R^k → R. (This half of theorem 2.7 is sometimes called the Helly-Bray theorem.)

Hint: Instead of intervals (ai−1, ai] as in exercise 2.15, use small regions {x : a_{i,j−1} < xi ≤ a_{i,j} for all i} of R^k. Make sure these regions are chosen so that their boundaries contain only continuity points of F(x).

Exercise 2.17 Prove the other half of theorem 2.7:

(a) Let a be any continuity point of F(x). Let ǫ > 0 be arbitrary and denote by 1 the k-vector of all ones. Show that there exists δ > 0 such that F(a − δ1) > F(a) − ǫ and F(a + δ1) < F(a) + ǫ.


(b) Show how to define continuous functions g1 : R^k → [0, 1] and g2 : R^k → [0, 1] such that for all x ≤ a, g1(x) = g2(x − δ1) = 0 and for all x ≰ a, g1(x + δ1) = g2(x) = 1. Use these functions to bound the difference between Fn(a) and F(a) in such a way that this difference must tend to 0.

Exercise 2.18 (a) Prove that if f : R^k → R^ℓ is continuous and X(n)_i P→ Xi for all 1 ≤ i ≤ k, then f(X(n)) P→ f(X).

(b) Taking f(a, b) = a + b for simplicity, construct an example demonstrating that part (a) is not true if P→ is replaced by d→.

Exercise 2.19 Suppose that Xn is independent of Yn for every n and that X is independent of Y. Prove that

Xn d→ X and Yn d→ Y implies that (Xn, Yn) d→ (X, Y).

Hint: Be careful to deal with points of discontinuity: If Xn and Yn are independent, what characterizes a point of discontinuity of the joint distribution?

Exercise 2.20 Prove Slutsky’s theorem, theorem 2.8, using the following approach:

(a) Prove the following lemma:

Lemma 2.3 Let V(n) and W(n) be k-dimensional random vectors.

If V(n) d→ V and W(n) P→ 0, then V(n) + W(n) d→ V.

Hint: For arbitrary ǫ > 0, let ǫ denote the k-vector all of whose entries are ǫ. Show that for any a ∈ R^k,

P(V(n) ≤ a − ǫ) + P(‖W(n)‖ < ‖ǫ‖) − 1 ≤ P(V(n) + W(n) ≤ a) ≤ P(V(n) ≤ a + ǫ) + P(‖W(n)‖ ≥ ‖ǫ‖).

If N is such that P(‖W(n)‖ ≥ ‖ǫ‖) < ǫ for all n > N, rewrite the above inequalities for n > N. If a is a continuity point of F(v), what happens as ǫ → 0?

(b) Show how to prove theorem 2.8 using the lemma in part (a).

2.4 The Dominated Convergence Theorem

This section introduces several mathematical topics that arise frequently in the study of asymptotic methods in statistics. It culminates in a useful theorem, called the dominated convergence theorem, that gives a sufficient condition under which Xn d→ X implies E Xn → E X. Recall that this is not true in general, as shown for instance in exercise 2.3(b) and in expression (2.19) below.


2.4.1 Limit Superior and Limit Inferior

We first define supremum and infimum, which are generalizations of the ideas of maximum and minimum.

Definition 2.14 For any set of real numbers, say S, the supremum sup S is defined to be the least upper bound of S (or +∞ if no upper bound exists). The infimum inf S is defined to be the greatest lower bound of S (or −∞ if no lower bound exists).

When the set S in definition 2.14 has infinitely many elements, there is not necessarily a maximum or minimum value in S; however, sup S and inf S are always defined, no matter how many elements S has. For example, let S = {a1, a2, a3, . . .}, where an = 1/n. Then inf S, which may also be denoted infn an, equals 0 even though 0 ∉ S. But supn an = 1, which is contained in S.

If we denote by sup_{k≥n} ak the supremum of an, an+1, . . ., then we see that this supremum is taken over a smaller and smaller set as n increases. Therefore, sup_{k≥n} ak is a nonincreasing sequence, which implies that it has a limit as n → ∞. Similarly, inf_{k≥n} ak is a nondecreasing sequence, which implies that it has a limit.

Definition 2.15 The limit superior of a sequence a1, a2, . . ., denoted lim supn an, is the limit of the nonincreasing sequence

sup_{k≥1} ak, sup_{k≥2} ak, . . . .

The limit inferior, denoted lim infn an, is the limit of the nondecreasing sequence

inf_{k≥1} ak, inf_{k≥2} ak, . . . .

Intuitively, the limit superior and limit inferior are the largest and smallest limit points (actually, the supremum and infimum of the set of limit points) of the sequence, where a limit point is any limit of a subsequence of that sequence.

The most useful facts regarding limits superior and inferior are these:

Lemma 2.4 Let a1, a2, . . . and b1, b2, . . . be arbitrary sequences of real numbers.

• lim supn an and lim infn an always exist, unlike limn an.

• lim infn an ≤ lim supn an.

• Both lim sup and lim inf preserve nonstrict inequalities; that is, if an ≤ bn for all n, then lim supn an ≤ lim supn bn and lim infn an ≤ lim infn bn.

• limn an exists if and only if lim infn an = lim supn an, in which case limn an = lim infn an = lim supn an.

Example 2.6 Theorem 2.2 states that Xn P→ X implies Xn d→ X. The proof of this theorem given in section 2.1 ends with an argument using ǫ, δ, and N that can be made more elegantly using lim inf and lim sup: Inequality (2.6) states that

F (c − ǫ) − P (|Xn − X| > ǫ) ≤ Fn(c) ≤ F (c + ǫ) + P (|Xn − X| > ǫ).


Taking both the lim infn and the lim supn of the above inequality, we conclude (since Xn P→ X) that

F(c − ǫ) ≤ lim infn Fn(c) ≤ lim supn Fn(c) ≤ F(c + ǫ)

for all ǫ. If c is a continuity point of F(x), then by letting ǫ → 0 we obtain

F(c) = lim infn Fn(c) = lim supn Fn(c),

so we know Fn(c) → F(c) and the theorem is proved.

2.4.2 Uniform Convergence

In the definition of convergence in distribution, we encountered pointwise convergence of distribution functions: If F(x) is a continuous function, then Fn d→ F means that for each x, Fn(x) → F(x). In other words, for every x and ǫ > 0, there exists N such that

|Fn(x) − F(x)| < ǫ for all n > N. (2.18)

A very important fact about the above wording is the stipulation that N depends on x. In other words, for a given ǫ, a value of N that makes statement (2.18) true for some x might not work for some other x.

The idea of uniform convergence, on the other hand, is that we can choose N without regard to the value of x. Thus, for a given ǫ, we can select N so that (2.18) is true for all x.

Definition 2.16 gn(x) converges uniformly to g(x) if for every ǫ > 0, there exists N such that

|gn(x) − g(x)| < ǫ for all n > N and for all x.

Equivalently, gn(x) converges uniformly to g(x) if and only if supx |gn(x) − g(x)| → 0.

Intuitively, gn(x) → g(x) uniformly if it is possible to draw a band of width ǫ around the graph of g(x) such that the band contains all of the graphs of gn(x) for n > N.

Unlike uniform convergence, pointwise convergence merely asserts that |gn(x) − g(x)| → 0 for all x. It is tempting to believe that supx |gn(x) − g(x)| → 0 and |gn(x) − g(x)| → 0 for all x are equivalent statements, but that would be a mistake.

Example 2.7 It is easy to demonstrate that uniform convergence is not the same thing as pointwise convergence by exhibiting examples in which pointwise convergence holds but uniform convergence does not.

• If gn(x) = x(1 + 1/n) and g(x) = x, then gn(x) → g(x) for all x (i.e., pointwise convergence holds). However, since supx |gn(x) − g(x)| = ∞ for all n, uniform convergence does not hold.

• If gn(x) = x^n for all x ∈ (0, 1), then gn(x) → 0 for all fixed x as n → ∞, but sup_{x∈(0,1)} |gn(x)| = 1.


Note in both of these examples that for small ǫ > 0, an ǫ-band around g(x) = x in the first example and g(x) = 0 in the second example fails to capture the graphs of any gn(x).
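A short numerical look at the second sequence makes the point (a Python sketch; the evaluation points are arbitrary): gn(0.5) = 0.5^n dies off quickly, yet for every n there is a point in (0, 1) at which gn still equals 1/2, so supx |gn(x)| does not tend to 0.

```python
# g_n(x) = x**n on (0, 1): the pointwise limit is 0, but sup g_n = 1 for every n.
for n in (1, 5, 25, 125, 625):
    g_at_half = 0.5 ** n             # behavior at the fixed point x = 0.5
    x_n = 0.5 ** (1.0 / n)           # a point in (0, 1) where g_n equals exactly 1/2
    print(f"n={n:4d}  g_n(0.5) = {g_at_half:.2e}   g_n({x_n:.6f}) = {x_n ** n:.3f}")
```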

Although example 2.7 demonstrates that pointwise convergence does not in general imply uniform convergence, the following theorem gives a special case in which it does.

Theorem 2.10 If Fn(x) and F(x) are distribution functions and F(x) is continuous, then pointwise convergence of Fn to F implies uniform convergence of Fn to F.

2.4.3 Application: The Dominated Convergence Theorem

We now consider the question of when Yn d→ Y implies E Yn → E Y. This is not in general the case: Consider the contaminated normal distributions with distribution functions

Fn(x) = (1 − 1/n) Φ(x) + (1/n) Φ(x − 7n). (2.19)

These distributions converge in distribution to a standard normal, yet each has mean 7. One of the most powerful criteria that imply E Xn → E X is summarized in the dominated convergence theorem:

Theorem 2.11 Dominated Convergence Theorem: If for some random variable Z, |Xn| ≤ |Z| for all n and E|Z| < ∞, then Xn d→ X implies that E Xn → E X.

Partial Proof: Fatou's lemma (see exercise 2.22) shows that

E lim infn |Xn| ≤ lim infn E|Xn|. (2.20)

Since |Z| − |Xn| is nonnegative by assumption, Fatou's lemma applied to |Z| − |Xn|, together with the fact that lim infn(−an) = − lim supn an, implies

E|Z| − E lim supn |Xn| ≤ E|Z| − lim supn E|Xn|. (2.21)

The assumption that E|Z| < ∞ means that inequality (2.21) implies

lim supn E|Xn| ≤ E lim supn |Xn|. (2.22)

Together, inequalities (2.20) and (2.22) imply

E lim infn |Xn| ≤ lim infn E|Xn| ≤ lim supn E|Xn| ≤ E lim supn |Xn|.

Therefore, the proof would be complete if |Xn| as→ |X|. We cannot make this claim directly for the random variables Xn and X, which might not even be defined on the same probability space. However, we may complete the proof anyway using a trick called the Skorohod representation theorem. We will discuss this theorem in section 2.5, at which point we will finish the proof.

Exercises for Section 2.4


Exercise 2.21 Prove lemma 2.4.

Exercise 2.22 Prove Fatou's lemma:

E lim infn |Xn| ≤ lim infn E|Xn|. (2.23)

Hint: Argue that E|Xn| ≥ E inf_{k≥n} |Xk|, then take the limit inferior of each side. Use the monotone convergence property.

Exercise 2.23 Prove theorem 2.10.

Hint: For arbitrary ǫ > 0, first show that there exist real numbers a1 < · · · < am such that F(ai+1) − F(ai) < ǫ/2 for 0 ≤ i ≤ m, where a0 = −∞ and am+1 = +∞. Then find N such that |Fn(ai) − F(ai)| < ǫ/2 for all i and for all n > N. Finally, prove and use the fact that

supx |Fn(x) − F(x)| ≤ max{ max_{0≤i≤m} |Fn(ai+1) − F(ai)|, max_{0≤i≤m} |Fn(ai) − F(ai+1)| }.

Exercise 2.24 If Yn d→ Y, a sufficient condition for E Yn → E Y is the uniform integrability of the Yn.

Definition 2.17 The random variables Y1, Y2, . . . are said to be uniformly integrable if

supn E(|Yn| I{|Yn| ≥ α}) → 0 as α → ∞.

Prove that if Yn d→ Y and the Yn are uniformly integrable, then E Yn → E Y.

Exercise 2.25 Prove that if there exists ǫ > 0 such that supn E|Yn|^{1+ǫ} < ∞, then the Yn are uniformly integrable.

Exercise 2.26 Prove that if there exists a random variable Z such that E|Z| = µ < ∞ and P(|Yn| ≥ t) ≤ P(|Z| ≥ t) for all n and for all t > 0, then the Yn are uniformly integrable. You may use the fact (without proof) that for a nonnegative X,

E(X) = ∫_0^∞ P(X ≥ t) dt.

Hints: Consider the random variables |Yn| I{|Yn| ≥ t} and |Z| I{|Z| ≥ t}. In addition, use the fact that

E|Z| = ∑_{i=1}^∞ E(|Z| I{i − 1 ≤ |Z| < i})

to argue that E(|Z| I{|Z| < α}) → E|Z| as α → ∞.


2.5 The Skorohod Representation Theorem

This section describes a trick that is often useful in proving results about convergence in distribution. For instance, it will allow us to easily complete the proof of the dominated convergence theorem, theorem 2.11.

Recall that Xn d→ X does not imply anything about the joint distributions of Xn and X, nor even that the random variables Xn and X are defined on the same sample space. In this section, we show how to construct an alternative sequence Y1, Y2, . . . and a random variable Y, all on the same sample space Ω, such that not only do Yn and Y have the same distributions as Xn and X, respectively, for all n, but in addition, Yn as→ Y. The new representation of the distributions of Xn and X is called the Skorohod representation, and the theorem that asserts its existence is called the Skorohod representation theorem.

If the distribution functions of Xn and X were invertible, the construction of Yn and Y would be a simple matter. However, since not all distribution functions are invertible, we first generalize the notion of the inverse of a distribution function by defining the quantile function.

Definition 2.18 If F(x) is a distribution function, then we define the quantile function F− : (0, 1) → R by

F−(u) def= inf{x ∈ R : u ≤ F(x)}.

Let Su denote the set {x ∈ R : u ≤ F(x)} for a distribution function F(x) and u ∈ (0, 1). The fact that F(x) is nondecreasing means that s ∈ Su implies t ∈ Su for any t > s. Thus, there must exist some real number a such that Su = (a, ∞) or Su = [a, ∞). However, the right-continuity of F(x) guarantees that a must be contained in Su: Since

lim_{ǫ→0+} F(a + ǫ) = F(a),

taking limits in the inequality u ≤ F(a + ǫ) implies that u ≤ F(a). We conclude that Su = [a, ∞) for some real number a. The important fact here is that a = F−(u) is contained in Su, which allows us to prove the following useful lemma:

Lemma 2.5 u ≤ F (x) if and only if F−(u) ≤ x.

Proof: Let Su = {t ∈ R : u ≤ F(t)} as before. We showed above that Su equals the interval [F−(u), ∞). Therefore, x ∈ Su if and only if F−(u) ≤ x.

Now the construction of Yn and Y proceeds as follows. Let Fn and F denote the distribution functions of Xn and X, respectively, for all n. Take the sample space Ω to be the interval (0, 1) and adopt the probability measure that assigns to each interval subset (a, b) ⊂ (0, 1) its length (b − a). Then for every ω ∈ Ω, define

Yn def= Fn−(ω) and Y def= F−(ω). (2.24)

The random variables Yn and Y are exactly the random variables we need, as asserted in the following theorem.


Theorem 2.12 Skorohod representation theorem: Assume Fn d→ F. The random variables defined on Ω as in expression (2.24) satisfy

1. P(Yn ≤ t) = Fn(t) for all n and P(Y ≤ t) = F(t);

2. Yn as→ Y.

Before proving theorem 2.12, we first state a technical lemma, a proof of which is the subject of exercise 2.27(a).

Lemma 2.6 Assume Fn d→ F and let the random variables Yn and Y be defined as in expression (2.24). Then for any ω ∈ (0, 1) and any ǫ > 0 such that ω + ǫ < 1,

Y(ω) ≤ lim infn Yn(ω) ≤ lim supn Yn(ω) ≤ Y(ω + ǫ). (2.25)

Proof of Theorem 2.12: By lemma 2.5, Y ≤ t if and only if ω ≤ F(t). But P(ω ≤ F(t)) = F(t) by construction. A similar argument for Yn proves the first part of the theorem.

For the second part of the theorem, letting ǫ → 0 in inequality (2.25) shows that Yn(ω) → Y(ω) whenever ω is a point of continuity of Y(ω). Since Y(ω) is a nondecreasing function of ω, there are at most countably many points of discontinuity of Y(ω); see exercise 2.27(b). Let D denote the set of all points of discontinuity of Y(ω). Since each individual point in Ω has probability zero, the countable subadditivity property (2.7) implies that P(D) = 0. Since we have shown that Yn(ω) → Y(ω) for all ω ∉ D, we conclude that Yn as→ Y.

Note that the points of discontinuity of Y(ω) mentioned in the proof of theorem 2.12 are not in any way related to the points of discontinuity of F(x). In fact, “flat spots” of F(x) lead to discontinuities of Y(ω) and vice versa.
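The construction (2.24) is easy to carry out numerically when the quantile functions are available in closed form. In the Python sketch below (the choice of Fn as the N(1/n, 1) distribution function is an arbitrary illustration), a single ω drawn uniformly from (0, 1) yields Yn(ω) = Fn−(ω) converging to Y(ω) = F−(ω), exactly as theorem 2.12 promises.

```python
import random
from statistics import NormalDist

random.seed(553)

def Y_n(omega, n):
    # F_n is the N(1/n, 1) distribution function, so F_n^-(omega) = 1/n + Phi^{-1}(omega).
    return NormalDist(mu=1.0 / n, sigma=1.0).inv_cdf(omega)

def Y(omega):
    # F is the N(0, 1) distribution function.
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(omega)

omega = random.random()              # one point of the common sample space (0, 1)
for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  Y_n(omega) = {Y_n(omega, n):.5f}   Y(omega) = {Y(omega):.5f}")
# Y_n(omega) -> Y(omega): the quantile coupling turns convergence in distribution
# into almost sure convergence on the common sample space (0, 1).
```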

Exercises for Section 2.5

Exercise 2.27 This exercise proves two results used to establish theorem 2.12.

(a) Prove lemma 2.6.

Hint: For any δ > 0, let x be a continuity point of F(t) in the interval (Y(ω) − δ, Y(ω)). Use the fact that Fn d→ F to argue that for large n, Y(ω) − δ < Yn(ω). Take the limit inferior of each side and note that δ is arbitrary. Similarly, argue that for large n, Yn(ω) < Y(ω + ǫ) + δ.

(b) Prove that any nondecreasing function has at most countably many points of discontinuity.

Hint: If x is a point of discontinuity, consider the open interval whose endpoints are the left- and right-sided limits at x. Note that each such interval contains a rational number, of which there are only countably many.


Chapter 3

Central Limit Theorems

The main result of this chapter is contained in section 3.2 about the Lindeberg-Feller central limit theorem, from which we derive the theorem most commonly known as the central limit theorem as a corollary. However, before we discuss central limit theorems, we include one section of background material for the sake of completeness. Section 3.1 introduces the powerful continuity theorem, theorem 3.2, which is the basis for proofs of various important results including the Lindeberg-Feller theorem, theorem 3.3. This section also defines multivariate normal distributions. An important difference between the treatment of normal distributions in section 3.1 and the treatment seen in some textbooks is that we do not restrict the definition to the case in which the covariance matrix is nonsingular.

3.1 Characteristic Functions and the Normal Distribution

Definition 3.1 For a random vector X, we define the characteristic function φX : R^k → C by

φX(t) = E exp(i tᵗX) = E cos(tᵗX) + i E sin(tᵗX),

where i^2 = −1 and C denotes the complex numbers.

The characteristic function, which is defined on all of R^k for any X (unlike the moment generating function), has some basic properties. For instance, φX(t) is always a continuous function with φX(0) = 1 and |φX(t)| ≤ 1. Also, inspection of Definition 3.1 reveals that for any constant vector a and scalar b,

φX+a(t) = exp(i tᵗa) φX(t) and φbX(t) = φX(bt).

Also, if X and Y are independent,

φX+Y(t) = φX(t)φY(t).

One of the main reasons that characteristic functions are so useful is the fact that they uniquely determine the distributions from which they are derived. This fact is so important that we state it as a theorem.


Theorem 3.1 The characteristic function φX(t) uniquely determines the distribution of X.

Now suppose that X(n) d→ X. Then tᵗX(n) d→ tᵗX, so Theorem 2.7 implies that φX(n)(t) → φX(t) since both sin t and cos t are bounded continuous functions. However, the converse is also true:

Theorem 3.2 Continuity Theorem: X(n) d→ X if and only if φX(n)(t) → φX(t) for all t.

Although we do not prove Theorem 3.2 rigorously, it is not hard to sketch a proof that φX(n)(t) → φX(t) implies X(n) d→ X. First, we note that the distribution functions Fn must contain a convergent subsequence, say Fnk → G as k → ∞, where G : R → [0, 1] must be a nondecreasing function but G is not necessarily a true distribution function (and of course convergence is guaranteed only at continuity points of G). It is possible to define the characteristic function of G (this is the assertion we do not prove), and it must follow that φFnk(t) → φG(t). But this implies that φG(t) = φX(t) because it was assumed that φX(n)(t) → φX(t). By Theorem 3.1, G must be the distribution function of X. Therefore, every convergent subsequence of X(n) converges to X, and the theorem is proved.

The main reason for the introduction of Theorem 3.2 is to prove the Lindeberg-Feller theorem 3.3. However, it also enables us to prove several other theorems that were earlier left unproved. Some of these are found in the problems section. Before we prove the Lindeberg-Feller theorem, though, we first establish a connection between the moments of a distribution and its characteristic function; namely, we derive the form of the derivatives of φX(t).

We attempt to derive ∂φX(t)/∂tj directly, by considering the limit of

[φX(t + h u(j)) − φX(t)] / h = E[ e^{i tᵗX} (e^{i h Xj} − 1)/h ]

as h → 0, where u(j) denotes the jth unit vector with 1 in the jth component and 0 elsewhere. Note that

| e^{i tᵗX} (e^{i h Xj} − 1)/h | = | ∫_0^{Xj} e^{i h t} dt | ≤ |Xj|,

so if E|Xj| < ∞ then the dominated convergence theorem, theorem 2.11, implies that

(∂/∂tj) φX(t) = E lim_{h→0} [ e^{i tᵗX} (e^{i h Xj} − 1)/h ] = i E[ Xj e^{i tᵗX} ].

We conclude that

Lemma 3.1 If E ‖X‖ < ∞, then ∇φX(0) = iE X.

A similar argument gives

Lemma 3.2 If E XᵗX < ∞, then ∇²φX(0) = −E XXᵗ.

One particular characteristic function will prove very important in what follows: If X ∼ Nk(0, Σ), then φX(t) = exp(−(1/2) tᵗΣt).
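A quick Monte Carlo check of the univariate case of this formula (a Python sketch; the sample size and evaluation points are arbitrary): the empirical characteristic function of standard normal draws is close to exp(−t^2/2).

```python
import cmath
import random

random.seed(553)

z = [random.gauss(0.0, 1.0) for _ in range(200000)]         # standard normal sample

for t in (0.5, 1.0, 2.0):
    emp = sum(cmath.exp(1j * t * x) for x in z) / len(z)    # empirical E exp(itX)
    exact = cmath.exp(-0.5 * t * t)                         # exp(-t^2 sigma^2 / 2) with sigma^2 = 1
    print(f"t = {t:.1f}  empirical = {emp.real:+.4f}{emp.imag:+.4f}i   exact = {exact.real:.4f}")
```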

We begin with a density for a multivariate normal distribution on Rk. However, note that not all

multivariate normal distributions have densities—consider the univariate example of N(0, 0).


Given a mean vector µ ∈ R^k and a positive definite k × k covariance matrix Σ, X has a multivariate normal distribution with mean µ and covariance Σ, written X ∼ Nk(µ, Σ), if its density on R^k is

f(x) = C exp{ −(1/2)(x − µ)ᵗΣ⁻¹(x − µ) }. (3.1)

In expression (3.1), the constant is C = (2^k π^k |Σ|)^{−1/2}, where |Σ| denotes the determinant of Σ. Because of the assumption that Σ is positive definite, an assumption that we will later relax, |Σ| is guaranteed to be positive.

As a special case, consider the bivariate normal distribution, where for some σ^2 > 0, τ^2 > 0, and −1 < ρ < 1 we have

Σ = ( σ^2   ρστ
      ρστ   τ^2 )

and thus

|Σ| = σ^2 τ^2 (1 − ρ^2)  and  Σ⁻¹ = 1/(σ^2 τ^2 (1 − ρ^2)) ( τ^2    −ρστ
                                                            −ρστ   σ^2 ).

In this case, of course, X1 and X2 have correlation ρ and marginal variances σ^2 and τ^2, respectively.

To define the multivariate normal distribution in full generality, we first consider the case in which Σ is diagonal, say Σ = D = diag(d1, . . . , dk). Of course, if D is a legitimate covariance matrix it must be nonnegative definite (or positive semidefinite), which means that all of its eigenvalues are nonnegative. Since the eigenvalues of a diagonal matrix are simply its diagonal elements, we see immediately that di ≥ 0 for all i. We may now define a multivariate normal distribution with covariance matrix D.

Definition 3.2 Suppose that D = diag(d1, . . . , dk) for nonnegative real numbers d1, . . . , dk. Then for µ ∈ R^k, the multivariate normal distribution with mean µ and covariance D, denoted Nk(µ, D), is the joint distribution of the independent random variables X1, . . . , Xk, where Xi ∼ N(µi, di).

To define a multivariate normal distribution for a general covariance matrix Σ, we make use of the fact that any symmetric matrix may be diagonalized by an orthogonal matrix. We first define orthogonal, then state the diagonalizability result as a lemma.

Definition 3.3 A square matrix Q is orthogonal if Q⁻¹ exists and is equal to Qᵗ.

Lemma 3.3 If A is a symmetric k × k matrix, then there exists an orthogonal matrix Q such that QAQᵗ is diagonal.

Note that the diagonal elements of the matrix QAQᵗ must be the eigenvalues of A. This is easy to prove, since if λ is a diagonal element of QAQᵗ then it is an eigenvalue of QAQᵗ and hence there exists a vector x such that QAQᵗx = λx, which implies that A(Qᵗx) = λ(Qᵗx) and so λ is an eigenvalue of A.

Definition 3.4 Suppose Σ is an arbitrary symmetric k × k matrix with nonnegative eigenvalues. Let Q be an orthogonal matrix such that QΣQᵗ is diagonal. Then for µ ∈ R^k, the multivariate normal distribution with mean µ and covariance Σ, denoted Nk(µ, Σ), is the distribution of µ + QᵗY, where Y ∼ Nk(0, QΣQᵗ).


It is not obvious that different diagonalizations Q1ΣQ1ᵗ and Q2ΣQ2ᵗ lead to the same distribution for µ + QiᵗY(i), so there is some work to do in order to prove that Definition 3.4 is a valid definition. This issue is addressed by Exercise 3.6.

Exercises for Section 3.1

Exercise 3.1 (a) Prove that if Y ∼ N(0, σ^2) then φY(t) = exp(−(1/2) t^2 σ^2).

Hint: If σ^2 > 0, verify and solve the differential equation φ′Y(t) = −tσ^2 φY(t). Use integration by parts.

(b) Using part (a), prove that if X ∼ Nk(0, Σ), then φX(t) = exp(−(1/2) tᵗΣt). (Note: Do not assume that Σ is invertible.)

Exercise 3.2 We will prove Theorem 3.1.

(a) First, prove the Parseval relation for random X and Y:

E[exp(−i aᵗY) φX(Y)] = E φY(X − a).

(b) Suppose that Y ∼ Nk(0, σ^2 I) with density f(y) = (√(2πσ^2))^{−k} exp(−yᵗy/2σ^2). Show that X + Y has a density and use part (a) to show that this density may be expressed as a function of φX(t).

Since X + Y d→ X as σ^2 → 0, the distributions of X + Y uniquely determine the distribution of X; but since part (b) shows that φX(t) uniquely determines the distribution of X + Y for a given σ^2, Theorem 3.1 is proved.

Exercise 3.3 Prove Theorem 2.9.

Exercise 3.4 State and prove a multivariate generalization of Theorem 2.5.

Hint: Use definition 1.14 and Lemma 3.1 along with the Continuity Theorem 3.2.

Exercise 3.5 Suppose X ∼ Nk(µ, Σ), where Σ is invertible. Prove that

(X − µ)ᵗΣ⁻¹(X − µ) ∼ χ²_k.

Hint: If Q diagonalizes Σ, say QΣQᵗ = Λ, let Λ^{1/2} have the obvious definition and consider YᵗY, where Y = (Λ^{1/2})⁻¹ Q(X − µ).

Exercise 3.6 For an arbitrary covariance matrix Σ, prove that Nk(µ, Σ) is well-defined by Definition 3.4 by proving that for orthogonal matrices Q1 and Q2 diagonalizing Σ, and Y(i) ∼ N(0, QiΣQiᵗ) for i = 1, 2, the distributions of Q1ᵗY(1) and Q2ᵗY(2) are the same.

Hint: To check whether multivariate distributions are equal using univariate ideas, consider Theorem 2.9.


Exercise 3.7 Let X1, X2, . . . be independent from N(µ, σ^2) where µ ≠ 0. Let

S²n = (1/n) ∑_{i=1}^n (Xi − X̄n)².

Find the asymptotic distribution of the coefficient of variation Sn/X̄n.

Exercise 3.8 Let X1, X2, . . . be independent Poisson random variables with mean λ = 1. Define Yn = √n(X̄n − 1).

(a) Find E(Yn⁺), where Yn⁺ = Yn I{Yn > 0}.

(b) Find, with proof, the limit of E(Yn⁺) and prove Stirling's formula

n! ∼ √(2π) n^{n+1/2} e^{−n}.

Hint: Use the result of exercise 2.24.

3.2 The Lindeberg-Feller Central Limit Theorem

You may already be familiar with the central limit theorem as it applies to independent and identically distributed sequences of random variables. This theorem, theorem 3.5, is shown here to be a corollary of a much more general theorem known as the Lindeberg-Feller central limit theorem. A step-by-step proof of the Lindeberg-Feller central limit theorem is sketched in exercises 3.10 and 3.11. The results in this section are for the univariate case only.

We would like to consider sequences of random variables that are independent but not necessarily identically distributed. However, even the concept of “sequence of random variables” must be generalized here. Instead of an independent sequence, we define a triangular array of random variables:

X11                      ← independent
X21  X22                 ← independent
X31  X32  X33            ← independent
...

Thus, we assume that for each n, Xn1, . . . , Xnn are independent. We shall assume that E Xnk = µnk and Var Xnk = σ²nk < ∞. Let Ynk = Xnk − µnk. (Alternatively, we could simply assume that µnk = 0, as is the practice in many textbooks.) Beware: Forgetting to center Xnk by subtracting µnk is a common mistake in applying the theorems below!

µnk = 0, as is the practice in many textbooks.) Beware: Forgetting to center Xnk by subtractingµnk is a common mistake in applying the theorems below!

Let Tn = ∑_{k=1}^n Ynk and s²n = Var Tn = ∑_{k=1}^n σ²nk. For all n, Tn/sn has mean 0 and variance 1; our goal is to give conditions under which

Tn/sn d→ N(0, 1). (3.2)

Such conditions are given in the Lindeberg-Feller Central Limit Theorem. The key to this theorem is the Lindeberg condition:

For every ǫ > 0,  (1/s²n) ∑_{k=1}^n E( Y²nk I{|Ynk| ≥ ǫ sn} ) → 0 as n → ∞. (3.3)


Before stating the Lindeberg-Feller theorem, we need a technical condition that says essentially that no single random variable is dominant in its contribution to s²n:

(1/s²n) max_{k≤n} σ²nk → 0 as n → ∞.   (3.4)

Now that conditions (3.2), (3.3), and (3.4) have been written, the main result may be stated in a single line:

Theorem 3.3 Lindeberg-Feller Central Limit Theorem: Condition (3.3) holds if and only if conditions (3.2) and (3.4) hold.

Another condition that implies (3.2) is the Lyapunov condition:

There exists δ > 0 such that   (1/sn^{2+δ}) ∑_{k=1}^n E( |Ynk|^{2+δ} ) → 0 as n → ∞.   (3.5)

In some cases in which we wish to establish asymptotic normality, it is easier to check the Lyapunov condition than to check the Lindeberg condition.

Theorem 3.4 The Lyapunov condition (3.5) implies the Lindeberg condition (3.3).

As an example of the power of the Lindeberg condition, we prove a version of the Central Limit Theorem that applies to independent and identically distributed sequences of random variables. The proof utilizes the dominated convergence theorem, theorem 2.11.

Theorem 3.5 Central limit theorem: Suppose that X1, X2, . . . are independent and identically distributed with E(Xi) = µ and Var(Xi) = σ² < ∞. Then

√n(X̄n − µ) d→ N(0, σ²).   (3.6)

Proof: If σ² = 0, then both sides of expression (3.6) are the constant 0 so the result follows immediately. If σ² > 0, let Ynk = Xk − µ. Since Tn = ∑_{k=1}^n Ynk and s²n = nσ², it is not hard to verify that

Tn/sn = √n(X̄n − µ)/σ.

Therefore, the result follows if we verify the Lindeberg condition (3.3). That is, we must verify that for any ε > 0,

(1/(nσ²)) ∑_{k=1}^n E( Y²nk I{|Ynk| ≥ εσ√n} ) → 0 as n → ∞.   (3.7)

But Yn1, . . . , Ynn are identically distributed, so the left hand side of expression (3.7) simplifies to

(1/σ²) E( Y²11 I{|Y11| ≥ εσ√n} ).   (3.8)

Note that P(|Y11| ≥ εσ√n) → 0 as n → ∞, which implies that Y²11 I{|Y11| ≥ εσ√n} P→ 0. Since Y²11 I{|Y11| ≥ εσ√n} ≤ Y²11 and E Y²11 < ∞, the dominated convergence theorem implies that expression (3.8) tends to 0.


The argument following expression (3.8) in the previous proof is used so frequently that we state the essential idea as a lemma:

Lemma 3.4 Suppose αn → ∞ as n → ∞. If E Y² < ∞, then for any ε > 0,

lim_{n→∞} E( Y² I{|Y| ≥ εαn} ) = 0.

Proof: Let Zn = Y² I{|Y| ≥ εαn}. Then

P(Zn ≠ 0) ≤ P(|Y| ≥ εαn) → 0,

which implies Zn P→ 0. Furthermore, Zn ≤ Y² and E Y² < ∞. Therefore, the result follows from the dominated convergence theorem.

Note that Theorem 3.5 does not follow from the Lyapunov theorem 3.4, since the Lyapunov condition requires a finite 2 + δ moment and Theorem 3.5 only requires a finite second moment (i.e., a finite variance). There are some distributions with finite variance that do not have finite 2 + δ moment for any δ > 0; you are asked to construct such an example in Exercise 3.12.

Example 3.1 Suppose Xn ∼ binomial(n, pn). We claim that

(Xn − npn)/√(npn(1 − pn)) d→ N(0, 1)   (3.9)

whenever npn(1 − pn) → ∞ as n → ∞. In order to use Theorem 3.3 to prove this result, let Yn1, . . . , Ynn be independent and identically distributed with

P(Ynk = 1 − pn) = 1 − P(Ynk = −pn) = pn.

Then with Xn = npn + ∑_{k=1}^n Ynk, we obtain Xn ∼ binomial(n, pn) as specified. Furthermore, E Ynk = 0 and Var Ynk = pn(1 − pn), so the Lindeberg condition says that for any ε > 0,

(1/(npn(1 − pn))) ∑_{k=1}^n E( Y²nk I{|Ynk| ≥ ε√(npn(1 − pn))} ) → 0.   (3.10)

Since |Ynk| ≤ 1, the left hand side of expression (3.10) will be identically zero whenever ε√(npn(1 − pn)) > 1. Thus, a sufficient condition for (3.9) to hold is that npn(1 − pn) → ∞. One may show that this is also a necessary condition (this is Exercise 3.14).
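For readers who want a quick numerical check of (3.9), the following R sketch (not part of the argument) uses the arbitrary choice pn = 1/√n, for which pn → 0 but npn(1 − pn) → ∞, and compares empirical quantiles of the standardized statistic with standard normal quantiles.

# Illustration of (3.9): standardized binomial(n, p_n) versus N(0, 1),
# with the hypothetical choice p_n = 1/sqrt(n), so n*p_n*(1 - p_n) -> infinity.
set.seed(1)
for (n in c(100, 10000)) {
  pn <- 1 / sqrt(n)
  x  <- rbinom(5000, size = n, prob = pn)          # 5000 draws of X_n
  z  <- (x - n * pn) / sqrt(n * pn * (1 - pn))     # standardized values
  print(round(rbind(empirical = quantile(z, c(.1, .5, .9)),
                    normal    = qnorm(c(.1, .5, .9))), 2))
}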

Example 3.2 Distribution of sample median: Suppose X1, . . . , Xn are independent and identically distributed such that P(Xi ≤ x) = F(x − θ) for some distribution function F(x) with F(0) = 1/2 for which F′(0) exists and is positive. Let f(0) = F′(0). Note that θ is the population median. Suppose n = 2m − 1 is odd and let X̃n denote the sample median. Then X̃n = X(m), the mth order statistic.

We may compute the distribution function of √n(X(m) − θ) directly:

P{√n(X(m) − θ) ≤ x} = P{X(m) ≤ x/√n + θ} = P(Yn ≤ m − 1),


where Yn ∼ binomial(n, pn) and pn = 1 − F(x/√n). Then

P(Yn ≤ m − 1) = Gn(cn),

where

cn = [√n(1 − 2pn) − (1/√n)] / [2√(pn(1 − pn))]

and Gn(t) denotes the distribution function of (Yn − npn)/√(npn(1 − pn)). Because pn → 1/2 in this case, Example 3.1 implies that Gn(t) → Φ(t), where Φ(t) is the standard normal distribution function.

Note that

cn = [ 2x {F(x/√n) − F(0)}/(x/√n) − 1/√n ] / [2√(pn(1 − pn))] → 2xf(0).

Furthermore, the convergence of Gn(t) to Φ(t) is uniform in t by Theorem 2.10. This means that

P(Yn ≤ m − 1) = Gn(cn) → Φ[2xf(0)],

which is equivalent to

√n(X̃n − θ) d→ N( 0, 1/(4f²(0)) ).

Thus, we have derived the asymptotic distribution of the sample median, at least as n → ∞ through odd integers. A minor modification of the above argument for n even shows that we need not restrict n to odd integer values.
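The limit is easy to check empirically. The R sketch below (an illustration only; the standard normal population, the odd sample size n = 999, and the 5000 replications are arbitrary choices) compares the simulated variance of √n(X̃n − θ) with 1/(4f²(0)), which equals π/2 when f is the standard normal density and θ = 0.

# Empirical check of sqrt(n)*(median - theta) ~ N(0, 1/(4 f(0)^2)) for N(0,1) data.
set.seed(2)
n <- 999                                   # odd, so the median is X_(m) with m = 500
med <- replicate(5000, median(rnorm(n)))   # 5000 sample medians
c(simulated = var(sqrt(n) * med), theoretical = 1 / (4 * dnorm(0)^2))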

We conclude this section by generalizing Theorem 3.5 to the multivariate case. The proof is straightforward using Theorem 3.5 along with the Cramer-Wold theorem, Theorem 2.9. Recall that the Cramer-Wold theorem allows us to establish multivariate convergence in distribution by proving univariate convergence in distribution for arbitrary linear combinations of the vector components.

Theorem 3.6 Multivariate central limit theorem: If X(1), X(2), . . . are independent and identically distributed with mean vector µ ∈ R^k and covariance matrix Σ (where Σ has finite entries), then

√n(X̄n − µ) d→ Nk(0, Σ).

Proof: For any vector a ∈ R^k, define Yn = aᵗ(X(n) − µ). Then E Yn = 0 and Var Yn = aᵗΣa < ∞. Thus, the (univariate) central limit theorem gives

√n Ȳn d→ N(0, aᵗΣa).   (3.11)

If X ∼ Nk(0, Σ), then expression (3.11) is equivalent to aᵗ[√n(X̄n − µ)] d→ aᵗX. Since a is arbitrary, the result follows from the Cramer-Wold theorem, Theorem 2.9.

We will see many, many applications of the multivariate central limit theorem in the chapters that follow.


Exercises for Section 3.2

Exercise 3.9 Prove Theorem 3.4.

Exercise 3.10 In this problem, we prove half of the Lindeberg-Feller theorem 3.3. Recall that we assume Yn1, . . . , Ynn are independent for all n with E Ynk = 0, Var Ynk = σ²nk < ∞, s²n = ∑_{k=1}^n σ²nk, and Tn = ∑_{k=1}^n Ynk. The Lindeberg-Feller theorem asserts that the Lindeberg condition

For every ε > 0,   (1/s²n) ∑_{k=1}^n E( Y²nk I{|Ynk| ≥ εsn} ) → 0 as n → ∞   (3.12)

is equivalent to the simultaneous conditions

Tn/sn d→ N(0, 1)   (3.13)

and

(1/s²n) max_{k≤n} σ²nk → 0 as n → ∞.   (3.14)

Here we prove that (3.12) implies (3.13) and (3.14).

(a) Prove that condition (3.12) implies condition (3.14).

(b) Prove that for any complex numbers a1, . . . , an and b1, . . . , bn with |ak| ≤ 1 and |bk| ≤ 1,

|a1 · · · an − b1 · · · bn| ≤ ∑_{k=1}^n |ak − bk|.   (3.15)

(c) Prove that for any real number x,

|e^{ix} − (1 + ix − x²/2)| ≤ min{x², |x|³}.   (3.16)

Also prove that if |x| ≤ 1/2, then |e^x − 1 − x| ≤ x².

(d) Prove that

|φYnk(t/sn) − (1 − t²σ²nk/(2s²n))| ≤ ε|t|³σ²nk/s²n + (t²/s²n) E( Y²nk I{|Ynk| ≥ εsn} ).   (3.17)

(e) Using the previous results, prove that (3.12) implies (3.13).


Hint: Use the fact that

| ∏_{k=1}^n φYnk(t/sn) − ∏_{k=1}^n exp(−t²σ²nk/(2s²n)) |
   ≤ | ∏_{k=1}^n φYnk(t/sn) − ∏_{k=1}^n (1 − t²σ²nk/(2s²n)) |
     + | ∏_{k=1}^n (1 − t²σ²nk/(2s²n)) − ∏_{k=1}^n exp(−t²σ²nk/(2s²n)) |.

Exercise 3.11 In this problem, we prove the converse of the Lindeberg-Feller theorem, which is the part due to Feller: Under the assumptions of Exercise 3.10, the variance condition (3.14) and the asymptotic normality (3.13) together imply the Lindeberg condition (3.12).

(a) Prove that for any real number x, |e^{ix} − 1| ≤ 2 min{1, |x|} and |e^{ix} − 1 − ix| ≤ x²/2.

(b) Use a result of part (a) to prove that

|φYnk(t/sn) − 1| ≤ 2P(|Ynk| ≥ εsn) + 2ε|t|.

Then prove that ∑_{k=1}^n |φYnk(t/sn) − 1|² → 0 as n → ∞.

Hint: Use Chebyshev’s inequality along with condition (3.14) to show that

max_k |φYnk(t/sn) − 1| → 0.

Also use the results of part (a).

(c) Use inequality (3.15) along with the fact that |exp(z − 1)| = exp(Re z − 1) ≤ 1 for |z| ≤ 1 to show that as n → ∞,

exp{ ∑_{k=1}^n [φYnk(t/sn) − 1] } − ∏_{k=1}^n φYnk(t/sn) → 0.

(d) From part (c) and condition (3.13), conclude that

∑_{k=1}^n Re[ φYnk(t/sn) − 1 ] → −t²/2.

(e) For arbitrary ε > 0, choose t large enough so that t²/2 > 2/ε². Use part (d) to prove that

( t²/2 − 2/ε² ) (1/s²n) ∑_{k=1}^n E( Y²nk I{|Ynk| ≥ εsn} ) → 0,

which completes the proof.


Hint: Show that the result of part (d) may be written

∑_{k=1}^n E( cos(tYnk/sn) − 1 + t²Y²nk/(2s²n) ) → 0

and use the fact that the cosine is always ≥ −1.

Exercise 3.12 Give an example of an independent and identically distributed sequence to which the Central Limit Theorem 3.5 applies but for which the Lyapunov condition is not satisfied.

Exercise 3.13 Suppose X1, X2, . . . are independent with Xi ∼ Bernoulli(pi). Let s²n = ∑_{i=1}^n pi(1 − pi). Prove that if s²n → ∞, then

[ ∑_{i=1}^n (Xi − pi) ] / sn d→ N(0, 1).

Exercise 3.14 In Example 3.1, we show that npn(1 − pn) → ∞ is a sufficient condition for (3.9) to hold. Prove that it is also a necessary condition. You may assume that pn(1 − pn) is always nonzero.

Exercise 3.15 (a) Suppose that X1, X2, . . . are independent and identically distributed with E Xi = µ and Var Xi = σ² < ∞. Let zn1, . . . , znn be constants satisfying

max_{k≤n} z²nk / ∑_{j=1}^n z²nj → 0 as n → ∞.

Let Tn = ∑_{k=1}^n znkXk, and prove that (Tn − E Tn)/√(Var Tn) d→ N(0, 1).

(b) Reconsider Example 2.4, the simple linear regression case in which

β̂0n = ∑_{i=1}^n v_i^{(n)} Yi   and   β̂1n = ∑_{i=1}^n w_i^{(n)} Yi,

where

w_i^{(n)} = (zi − z̄) / ∑_{j=1}^n (zj − z̄)²   and   v_i^{(n)} = 1/n − z̄ w_i^{(n)}

for constants z1, z2, . . .. Using part (a), state and prove sufficient conditions on the constants zi that ensure the asymptotic normality of tn(β̂0n − β0) and un(β̂1n − β1) for some real sequences tn and un. (Find these sequences.) You may assume the results of Example 2.4, where it was shown that E β̂0n = β0 and E β̂1n = β1.

Exercise 3.16 Give an example (with proof) of a sequence of independent variables Z1, Z2, . . . with E(Zi) = 0, Var(Zi) = 1 such that √n Z̄n does not converge in distribution to N(0, 1).

Exercise 3.17 Let (a1, . . . , an) be a random permutation of the integers 1, . . . , n. If aj < ai for some i < j, then the pair (i, j) is said to form an inversion. Let Xn be the total number of inversions:

Xn = ∑_{j=2}^n ∑_{i=1}^{j−1} I{aj < ai}.

For example, if n = 3 and we consider the permutation (3, 1, 2), there are 2 inversions since 1 = a2 < a1 = 3 and 2 = a3 < a1 = 3. This problem asks you to find the asymptotic distribution of Xn.

(a) Define Y1 = 0 and for j > 1, let

Yj = ∑_{i=1}^{j−1} I{aj < ai}

be the number of ai greater than aj to the left of aj. Then the Yj are independent (you don’t have to show this; you may wish to think about why, though). Find E(Yj) and Var Yj.

(b) Clearly Xn = Y1 + Y2 + · · · + Yn. Prove that

(3/2) √n ( 4Xn/n² − 1 ) d→ N(0, 1).

(c) For n = 10, evaluate the distribution of inversions as follows. First, simulate 1000 permutations of {1, 2, . . . , 10} and for each permutation, count the number of inversions. Plot a histogram of these 1000 numbers. Use the results of the simulation to estimate P(X10 ≤ 24). Second, estimate P(X10 ≤ 24) using a normal approximation. Can you find the exact integer c such that 10! P(X10 ≤ 24) = c?
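One possible way to carry out the counting and simulation step of part (c) is sketched below in R (only the mechanics are shown; the normal approximation and the exact constant c are left to the reader, and the seed and helper name count_inversions are arbitrary).

# Count inversions of a permutation a: pairs i < j with a[j] < a[i].
count_inversions <- function(a) {
  n <- length(a)
  sum(sapply(2:n, function(j) sum(a[1:(j - 1)] > a[j])))
}
set.seed(3)
x10 <- replicate(1000, count_inversions(sample(10)))  # 1000 random permutations of 1..10
hist(x10, breaks = 20, main = "Inversions, n = 10")
mean(x10 <= 24)                                       # simulation estimate of P(X10 <= 24)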

Exercise 3.18 Suppose that X1, X2, X3 is a sample of size 3 from a beta (2, 1) distribution.

(a) Find P (X1 + X2 + X3 ≤ 1) exactly.

(b) Find P(X1 + X2 + X3 ≤ 1) using a normal approximation derived from the central limit theorem.

(c) Let Z = I{X1 + X2 + X3 ≤ 1}. Approximate E Z = P(X1 + X2 + X3 ≤ 1) by Z̄ = ∑_{i=1}^{1000} Zi/1000, where Zi = I{Xi1 + Xi2 + Xi3 ≤ 1} and the Xij are independent beta (2, 1) random variables. In addition to Z̄, report Var Z̄ for your sample. (To think about: What is the theoretical value of Var Z̄?)

(d) Approximate P(X1 + X2 + X3 ≤ 3/2) using the normal approximation and the simulation approach. (Don’t compute the exact value, which is more difficult than in part (a); do you see why?)

Exercise 3.19 Lindeberg and Lyapunov impose conditions on moments so that asymptotic normality occurs. However, it is possible to have asymptotic normality even if there are no moments at all. Let Xn assume the values +1 and −1 with probability (1 − 2^{−n})/2 each and the value 2^k with probability 2^{−k} for k > n.

(a) Show that E(X^j_n) = ∞ for all positive integers j and n.


(b) Show that √n X̄n d→ N(0, 1).

Exercise 3.20 Assume that elements (“coupons”) are drawn from a population of size n, randomly and with replacement, until the number of distinct elements that have been sampled is rn, where 1 ≤ rn ≤ n. Let Sn be the drawing on which this first happens. Suppose that rn/n → ρ, where 0 < ρ < 1.

(a) Suppose k − 1 distinct coupons have thus far entered the sample. Let Xnk be the waiting time until the next distinct one appears, so that

Sn = ∑_{k=1}^{rn} Xnk.

Find the expectation and variance of Xnk.

(b) Let mn = E(Sn) and τ²n = Var(Sn). Show that

(Sn − mn)/τn d→ N(0, 1).

Tip: One approach is to apply Lyapunov’s condition with δ = 2. This involves demonstrating an asymptotic expression for τ²n and a bound on E[Xnk − E(Xnk)]⁴. There are several ways to go about this.

Exercise 3.21 Suppose that X1, X2, . . . are independent and identically distributed binomial(2, p) random variables. Define Yi = I{Xi = 0}.

(a) Find a such that the joint asymptotic distribution of

√n[ (X̄n, Ȳn)ᵗ − a ]

is nontrivial, and find this joint asymptotic distribution.

(b) Using the Cramer-Wold theorem, Theorem 2.9, find the asymptotic distribution of √n(X̄n + Ȳn − 1 − p²).

3.3 Central Limit Theorems for Dependent Sequences

Here we consider sequences that are identically distributed but not independent. In fact, we make a stronger assumption than identically distributed; namely, we assume that X1, X2, . . . is a stationary sequence. (Stationarity is defined in Definition 2.11.) Denote E Xi by µ and let σ² = Var Xi.

We seek sufficient conditions for the asymptotic normality of √n(X̄n − µ). The variance of X̄n for a stationary sequence is given by equation (2.12). Letting γk = Cov(X1, X1+k), we conclude that

Var{√n(X̄n − µ)} = σ² + (2/n) ∑_{k=1}^{n−1} (n − k)γk.   (3.18)


Suppose that

(2/n) ∑_{k=1}^{n−1} (n − k)γk → γ   (3.19)

as n → ∞. Then based on equation (3.18), it seems reasonable to ask whether

√n(X̄n − µ) d→ N(0, σ² + γ).

The answer, in many cases, is yes. This section explores some of these cases.

3.3.1 m-dependent sequences

Recall from Definition 2.12 that X1, X2, . . . is m-dependent for some m ≥ 0 if the vector (X1, . . . , Xi) is independent of (Xi+j, Xi+j+1, . . .) whenever j > m. Therefore, for an m-dependent sequence we have γk = 0 for all k > m, so limit (3.19) becomes

(2/n) ∑_{k=1}^{n−1} (n − k)γk → 2 ∑_{k=1}^m γk.

For a stationary m-dependent sequence, the following theorem asserts the asymptotic normality of X̄n as long as the Xi are bounded:

Theorem 3.7 If for some m ≥ 0, X1, X2, . . . is a stationary m-dependent sequence of bounded random variables with E Xi = µ and Var Xi = σ², then

√n(X̄n − µ) d→ N( 0, σ² + 2 ∑_{k=1}^m Cov[X1, X1+k] ).

The assumption in Theorem 3.7 that the Xi are bounded is not necessary, as long as σ² < ∞. However, the proof of the theorem is quite tricky without the boundedness assumption, and the theorem is strong enough for our purposes as it stands. See, for instance, Ferguson (1996) for a complete proof. The theorem may be proved using the following strategy: For some integer kn, define random variables V1, V2, . . . and W1, W2, . . . as follows:

V1 = X1 + · · · + X_{kn},               W1 = X_{kn+1} + · · · + X_{kn+m},
V2 = X_{kn+m+1} + · · · + X_{2kn+m},    W2 = X_{2kn+m+1} + · · · + X_{2kn+2m},
...                                                                  (3.20)

In other words, each Vi is the sum of kn of the Xi and each Wi is the sum of m of the Xi. Because the sequence of Xi is m-dependent, we conclude that the Vi are independent. For this reason, we may apply the Lindeberg-Feller theorem to the Vi. If kn is defined carefully, then the contribution of the Wi may be shown to be negligible. This strategy is implemented in Exercise 3.22, where a proof of Theorem 3.7 is outlined.

Example 3.3 Runs of successes: Suppose X1, X2, . . . are independent Bernoulli(p) variables. Let Tn denote the number of runs of successes in X1, . . . , Xn, where a run of successes is defined as a sequence of consecutive Xi, all of which equal 1, that is both preceded and followed by zeros (unless the run begins with X1 or ends with Xn). What is the asymptotic distribution of Tn?

We note that

Tn = ∑_{i=1}^n I{run starts at ith position} = X1 + ∑_{i=2}^n Xi(1 − Xi−1),

since a run starts at the ith position for i > 1 if and only if Xi = 1 and Xi−1 = 0.

Letting Yi = Xi+1(1 − Xi), we see immediately that Y1, Y2, . . . is a stationary 1-dependent sequence with E Yi = p(1 − p), so that by Theorem 3.7, √n{Ȳn − p(1 − p)} d→ N(0, τ²), where

τ² = Var Y1 + 2 Cov(Y1, Y2)
   = E Y²1 − (E Y1)² + 2 E Y1Y2 − 2(E Y1)²
   = E Y1 − 3(E Y1)² = p(1 − p) − 3p²(1 − p)².

Since

(Tn − np(1 − p))/√n = √n{Ȳn − p(1 − p)} + (X1 − Yn)/√n,

we conclude that

(Tn − np(1 − p))/√n d→ N(0, τ²).
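A brief simulation agrees with this limit; the R sketch below is an illustration only, with arbitrary choices p = 0.3, n = 10000, and 2000 replications.

# Check (T_n - n p(1-p))/sqrt(n) against N(0, tau^2) for runs of successes.
set.seed(4)
p <- 0.3; n <- 10000
tau2 <- p * (1 - p) - 3 * p^2 * (1 - p)^2
sim <- replicate(2000, {
  x  <- rbinom(n, 1, p)                       # Bernoulli(p) sequence X_1, ..., X_n
  tn <- x[1] + sum(x[-1] * (1 - x[-n]))       # T_n = X_1 + sum_{i>=2} X_i (1 - X_{i-1})
  (tn - n * p * (1 - p)) / sqrt(n)
})
c(simulated.var = var(sim), tau2 = tau2)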

We conclude this topic by considering a case in which asymptotic normality of X̄n holds even though γk ≠ 0 for all k.

Example 3.4 First-order stationary autoregressive process: Suppose U2, U3, . . . are independent N(0, σ²₀). We’ll define, for i ≥ 1,

Xi+1 = µ + β(Xi − µ) + Ui+1   (3.21)

for some β with |β| < 1. Let X1 ∼ N(µ, σ²) be independent of all of the Ui. Equation (3.21) gives an autoregressive sequence X1, X2, . . .. Suppose we would also like to ensure that this sequence is stationary. Since this means Var Xi = σ² for all i, and equation (3.21) gives Var Xi+1 = β²σ² + σ²₀, we conclude that stationarity implies

σ² = σ²₀/(1 − β²).

Note that √n(X̄n − µ) is exactly normally distributed with mean zero, so it suffices to show that Var{√n(X̄n − µ)} has a finite, nonzero limit. We may rewrite equation (3.21) to read

X_{k+1} − µ = β^k(X1 − µ) + β^{k−1}U2 + · · · + βUk + U_{k+1}.


In this form, it is easy to see that Cov(X1, X1+k) = β^kσ². Therefore,

Var{√n(X̄n − µ)} = σ² + (2/n) ∑_{k=1}^{n−1} (n − k)β^kσ² = σ²{ 1 + 2 ∑_{k=1}^{n−1} β^k − (2/n) ∑_{k=1}^{n−1} kβ^k }.

Since

∑_{k=1}^{n−1} kβ^k = β[ 1 + (n − 1)β^n − nβ^{n−1} ]/(1 − β)²,

we see that (2/n) ∑_{k=1}^{n−1} kβ^k → 0. Thus,

Var{√n(X̄n − µ)} → σ²( 1 + 2β/(1 − β) ) = σ²(1 + β)/(1 − β).

We conclude that

√n(X̄n − µ) d→ N( 0, σ²(1 + β)/(1 − β) ).
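The limiting variance is easy to check numerically. The R sketch below simulates the stationary AR(1) process of equation (3.21) for one arbitrary choice of β, σ₀, and µ, and compares the simulated variance of √n(X̄n − µ) with σ²(1 + β)/(1 − β); it is an illustration, not a derivation.

# Variance of sqrt(n)*(Xbar_n - mu) for the stationary AR(1) process (3.21).
set.seed(5)
beta <- 0.6; sigma0 <- 1; mu <- 2; n <- 5000
sigma2 <- sigma0^2 / (1 - beta^2)             # stationary variance of X_i
sim <- replicate(1000, {
  x <- numeric(n)
  x[1] <- rnorm(1, mu, sqrt(sigma2))          # X_1 from the stationary distribution
  for (i in 1:(n - 1)) x[i + 1] <- mu + beta * (x[i] - mu) + rnorm(1, 0, sigma0)
  sqrt(n) * (mean(x) - mu)
})
c(simulated = var(sim), limit = sigma2 * (1 + beta) / (1 - beta))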


Exercises for Section 3.3

Exercise 3.22 We wish to prove Theorem 3.7. However, we will only prove a limited version of the theorem here; namely, we assume that the random variables are bounded. Suppose X1, X2, . . . is a stationary m-dependent sequence of bounded random variables such that Var Xi = σ². Without loss of generality, assume E Xi = 0. We wish to prove that √n X̄n d→ N(0, τ²), where

τ² = σ² + 2 ∑_{k=1}^m Cov(X1, X1+k).

For all n, define kn = ⌊n^{1/4}⌋, ℓn = ⌊n/(kn + m)⌋, and tn = ℓn(kn + m). Define V1, . . . , V_{ℓn} and W1, . . . , W_{ℓn} as in equation (3.20). Then

√n X̄n = (1/√n) ∑_{i=1}^{ℓn} Vi + (1/√n) ∑_{i=1}^{ℓn} Wi + (1/√n) ∑_{i=tn+1}^{n} Xi.

(a) Prove that

(1/√n) ∑_{i=tn+1}^{n} Xi P→ 0.   (3.22)

Hint: Bound the left hand side of expression (3.22) using Markov’s inequality (1.14) with r = 1. What is the greatest possible number of summands?

(b) Prove that

(1/√n) ∑_{i=1}^{ℓn} Wi P→ 0.

Hint: For kn > m, the Wi are independent and identically distributed with distributions that do not depend on n. Use the central limit theorem on (1/√ℓn) ∑_{i=1}^{ℓn} Wi.

(c) Prove that

(1/√n) ∑_{i=1}^{ℓn} Vi d→ N(0, τ²),

then use Slutsky’s theorem to prove Theorem 3.7.

Hint: Use the Lindeberg-Feller theorem.

Exercise 3.23 Suppose X0, X1, . . . is an independent sequence of Bernoulli trials with success probability p. Suppose Xi is the indicator of your team’s success on rally i in a volleyball game. (Note: This is a completely unrealistic model, since the serving team is always at a disadvantage when evenly matched teams play.) Your team scores a point each time it has a success that follows another success. Let Sn = ∑_{i=1}^n Xi−1Xi denote the number of points your team scores by time n.

(a) Find the asymptotic distribution of Sn.

(b) Simulate a sequence X0, X1, . . . , X1000 as above and calculate S1000 for p = .4. Repeat this process 100 times, then graph the empirical distribution of S1000 obtained from simulation on the same axes as the theoretical asymptotic distribution from (a). Comment on your results.

Exercise 3.24 Suppose Z0, Z1, . . . are independent and identically distributed with mean 0 and variance σ², and let

Xi+1 = µ + β(Xi − µ) + (Zi−1 − Zi)

for i ≥ 1, where β is a constant with |β| < 1. Assume that Cov(Xi, Zj) = 0 for all j ≥ i. This sequence is called an autoregressive moving average process, or ARMA process.

(a) Find Var X1 in terms of σ² and β if X1, X2, . . . is a stationary sequence.

Hint: Cov(Xi+1, Zi) ≠ 0.

(b) If the Xi are normally distributed in addition to being a stationary sequence, find τ² such that

√n(X̄n − µ) d→ N(0, τ²).

Exercise 3.25 Let X0, X1, . . . be independent and identically distributed random variables from a continuous distribution F(x). Define Yi = I{Xi < Xi−1 and Xi < Xi+1}. Thus, Yi is the indicator that Xi is a relative minimum. Let Sn = ∑_{i=1}^n Yi.

(a) Find the asymptotic distribution of Sn.

(b) For a sample of size 5000 from the uniform (0, 1) random number generator in R, compute an approximate p-value based on the observed value of Sn and the answer to part (a). The null hypothesis is that the sequence of “random” numbers generated is independent and identically distributed. (Naturally, the “random” numbers are not random at all, but are generated by a deterministic formula that is supposed to mimic randomness.)


Chapter 4

The Delta Method and Applications

We open this chapter with the theorem known generally as the delta method.

Theorem 4.1 Delta method: If g′(a) exists and n^b(Xn − a) d→ X for b > 0, then

n^b{g(Xn) − g(a)} d→ g′(a)X.

The proof of the delta method uses Taylor’s theorem, Theorem 1.2: Since Xn − a P→ 0,

n^b{g(Xn) − g(a)} = n^b(Xn − a){g′(a) + oP(1)},

and thus Slutsky’s theorem together with the fact that n^b(Xn − a) d→ X proves the result.

Corollary 4.1 The often-used special case of Theorem 4.1 in which X is normally distributed states that if g′(µ) exists and √n(Xn − µ) d→ N(0, σ²), then

√n{g(Xn) − g(µ)} d→ N{0, σ²g′(µ)²}.

The intuition behind the delta method is simple: Since any differentiable function “looks” linear in a very small neighborhood, applying the function to a sequence that converges to a point is essentially like applying a linear function, which means multiplying by the slope. In the next section, we extend Theorem 4.1 in two directions: Theorem 4.2 deals with the special case in which g′(a) = 0, and Theorem 4.3 is the multivariate version of the delta method.

4.1 Asymptotic distributions and Variance-Stabilizing Transformations

We begin by immediately applying the delta method to a couple of simple examples that illustrate a frequently understood but seldom stated principle: When we speak of the “asymptotic distribution” of a sequence of random variables, we generally refer to a nontrivial (i.e., nonconstant) distribution. Thus, in the case of an independent and identically distributed sequence X1, X2, . . . of random variables with finite variance, the phrase “asymptotic distribution of X̄n” generally refers to the fact that

√n( X̄n − E X1 ) d→ N(0, Var X1),

not the fact that X̄n P→ E X1.

Example 4.1 Asymptotic distribution of X̄²n: Suppose X1, X2, . . . are iid with mean µ and finite variance σ². Then by the central limit theorem,

√n(X̄n − µ) d→ N(0, σ²).

Therefore, the delta method gives

√n(X̄²n − µ²) d→ N(0, 4µ²σ²).   (4.1)

However, this is not necessarily the end of the story. If µ = 0, then the normal limit in (4.1) is degenerate; that is, expression (4.1) merely states that √n(X̄²n) converges in probability to the constant 0. This is not what we mean by the asymptotic distribution! Thus, we must treat the case µ = 0 separately, noting in that case that √n X̄n d→ N(0, σ²) by the central limit theorem, which implies that

n X̄²n d→ σ²χ²₁.

Example 4.2 Estimating binomial variance: Suppose Xn ∼ binomial(n, p). Because Xn/n is the maximum likelihood estimator for p, the maximum likelihood estimator for p(1 − p) is δn = Xn(n − Xn)/n². The central limit theorem tells us that √n(Xn/n − p) d→ N{0, p(1 − p)}, so the delta method gives

√n{δn − p(1 − p)} d→ N{0, p(1 − p)(1 − 2p)²}.

Note that in the case p = 1/2, this does not give the asymptotic distribution of δn. Exercise 4.1 gives a hint about how to find the asymptotic distribution of δn in this case.

We have seen in the preceding examples that if g′(a) = 0, then the delta method gives something other than the asymptotic distribution we seek. However, by using more terms in the Taylor expansion, we obtain the following generalization of Theorem 4.1:

Theorem 4.2 If g(t) has r derivatives at the point a and g′(a) = g′′(a) = · · · = g^{(r−1)}(a) = 0, then n^b(Xn − a) d→ X for b > 0 implies that

n^{rb}{g(Xn) − g(a)} d→ (1/r!) g^{(r)}(a) X^r.

It is straightforward using the multivariate notion of differentiability discussed in Definition 1.14 to prove the following theorem:


Theorem 4.3 Multivariate delta method: If g : R^k → R^ℓ has a derivative ∇g(a) at a ∈ R^k and

n^b( X(n) − a ) d→ Y

for some k-vector Y and some sequence X(1), X(2), . . . of k-vectors, where b > 0, then

n^b{ g(X(n)) − g(a) } d→ [∇g(a)]ᵗ Y.

The proof of Theorem 4.3 involves a simple application of the multivariate Taylor expansion of equation (1.10).

4.1.1 Variance stabilizing transformations

We conclude this section with a discussion of variance stabilizing transformations. Often, if E(Xi) = µ is the parameter of interest, the central limit theorem gives

√n(X̄n − µ) d→ N{0, σ²(µ)}.

In other words, the variance of the limiting distribution is a function of µ. This is a problem if we wish to do inference for µ, because ideally the limiting distribution should not depend on the unknown µ. The delta method gives a possible solution: Since

√n{g(X̄n) − g(µ)} d→ N{0, σ²(µ)g′(µ)²},

we may search for a transformation g(x) such that g′(µ)σ(µ) is a constant. Such a transformation is called a variance stabilizing transformation.
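As one illustration (chosen for this sketch and not tied to any particular exercise), if the Xi are Poisson(λ) then σ²(µ) = µ, and g(x) = 2√x satisfies g′(µ)σ(µ) = 1. The R sketch below, with arbitrary λ values and sample size, shows the variance of √n{g(X̄n) − g(λ)} staying near 1 regardless of λ.

# Variance-stabilizing transformation for Poisson means: g(x) = 2*sqrt(x),
# so sqrt(n)*(g(Xbar) - g(lambda)) is approximately N(0, 1) for any lambda.
set.seed(6)
n <- 200
for (lambda in c(0.5, 2, 10)) {
  z <- replicate(5000, {
    xbar <- mean(rpois(n, lambda))
    sqrt(n) * (2 * sqrt(xbar) - 2 * sqrt(lambda))
  })
  cat("lambda =", lambda, " variance of transformed statistic:", round(var(z), 3), "\n")
}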

Exercises for Section 4.1

Exercise 4.1 Let δn be defined as in Example 4.2. Find the asymptotic distribution of δn in the case p = 1/2. That is, find constant sequences an and bn and a nontrivial random variable X such that an(δn − bn) d→ X.

Hint: Let Yn = Xn − (n/2). Apply the central limit theorem to Yn, then transform both sides of the resulting limit statement so that a statement involving δn results.

Exercise 4.2 Prove theorem 4.2.

Exercise 4.3 Suppose Xn ∼ binomial(n, p), where 0 < p < 1.

(a) Find the asymptotic distribution of g(Xn/n) − g(p), where g(x) = min{x, 1 − x}.

(b) Show that h(x) = sin⁻¹(√x) is a variance-stabilizing transformation.

Hint: (d/du) sin⁻¹(u) = 1/√(1 − u²).


Exercise 4.4 Let Xn ∼ binomial(n, p), where p ∈ (0, 1) is unknown. Obtain confidence intervals for p in two different ways:

(a) Since √n(Xn/n − p) d→ N[0, p(1 − p)], the variance of the limiting distribution depends only on p. Use the fact that Xn/n P→ p to find a consistent estimator of the variance and use it to derive a 95% confidence interval for p.

(b) Use the result of problem 4.3(b) to derive a 95% confidence interval for p.

(c) Evaluate the two confidence intervals in parts (a) and (b) numerically for all combinations of n ∈ {10, 100, 1000} and p ∈ {.1, .3, .5} as follows: For 1000 realizations of X ∼ bin(n, p), construct both 95% confidence intervals and keep track of how many times (out of 1000) the confidence intervals contain p. Report the observed proportion of successes for each (n, p) combination. Does your study reveal any differences in the performance of these two competing methods?

4.2 Sample Moments

The weak law of large numbers tells us that if X1, X2, . . . are independent and identically distributed with E|X1|^k < ∞, then

(1/n) ∑_{i=1}^n X^k_i P→ E X^k_1.

That is, sample moments are (weakly) consistent. For example, the sample variance, which we define as

s²n = (1/n) ∑_{i=1}^n (Xi − X̄n)² = (1/n) ∑_{i=1}^n X²i − (X̄n)²,   (4.2)

is consistent for Var Xi = E X²i − (E Xi)².

However, consistency is not the end of the story. The central limit theorem and the delta method will prove very useful in deriving asymptotic distribution results about sample moments. We consider two very important examples involving the sample variance of equation (4.2).

Example 4.3 Distribution of sample T statistic: Suppose X1, X2, . . . are iid with E(Xi) = µ and Var(Xi) = σ² < ∞. Define s²n as in equation (4.2), and let

Tn = √n(X̄n − µ)/sn.

Letting

An = √n(X̄n − µ)/σ

and Bn = σ/sn, we obtain Tn = AnBn. Therefore, since An d→ N(0, 1) by the central limit theorem and Bn P→ 1 by the weak law of large numbers, Slutsky’s theorem implies that Tn d→ N(0, 1). In other words, T statistics are asymptotically normal under the null hypothesis.


Example 4.4 Asymptotic distribution of sample variance: Suppose that X1, X2, . . . are iid with E(Xi) = µ, Var(Xi) = σ², and Var(Xi − µ)² = τ² < ∞. Define s²n as in equation (4.2). This definition of s²n is different from the unbiased sample variance

(1/(n − 1)) ∑_{i=1}^n ( Xi − X̄n )²;

however, as we see later, this will not matter in an asymptotic sense. Since the distribution of Xi − X̄n does not change if we replace each Xi by Xi − µ, we may assume without loss of generality that µ = 0.

By the central limit theorem, we know that

√n( (1/n) ∑_{i=1}^n X²i − σ² ) d→ N(0, τ²).

Furthermore, the fact that √n X̄n d→ N(0, σ²) means that √n X̄²n P→ 0. Therefore, since

√n(s²n − σ²) = √n( (1/n) ∑_{i=1}^n X²i − σ² ) − √n X̄²n,

Slutsky’s theorem implies that √n(s²n − σ²) d→ N(0, τ²). A different approach to proving this same fact, using the multivariate instead of the univariate central limit theorem, is a part of the correlation coefficient derivation of Section 4.3.

Since

√n( (n/(n − 1)) s²n − σ² ) = √n( s²n − σ² ) + √n s²n/(n − 1)

and √n s²n/(n − 1) P→ 0, Slutsky’s theorem implies that the asymptotic distribution of the unbiased version of the sample variance is the same as that of s²n.
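A small numerical check (an illustration with arbitrary choices, not part of the example): for exponential(1) data, σ² = 1 and τ² = Var(X − µ)² = E(X − 1)⁴ − 1 = 8, and the simulated variance of √n(s²n − σ²) should be close to 8.

# Check sqrt(n)*(s_n^2 - sigma^2) ~ N(0, tau^2) for exponential(1) data,
# where sigma^2 = 1 and tau^2 = 8.
set.seed(7)
n <- 2000
sim <- replicate(5000, {
  x  <- rexp(n)
  s2 <- mean(x^2) - mean(x)^2     # the (biased) sample variance of equation (4.2)
  sqrt(n) * (s2 - 1)
})
c(simulated.var = var(sim), tau2 = 8)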

Exercises for Section 4.2

Exercise 4.5 (a) Suppose that X1, X2, . . . are iid Normal(0, σ²) random variables. Using the result of Example 4.4, find a variance-stabilizing transformation for

s²n = (1/n) ∑_{i=1}^n (Xi − X̄n)².

(b) Give an approximate test at α = .05 for H0 : σ² = σ²₀ vs. Ha : σ² ≠ σ²₀ based on part (a).

(c) For n = 25, estimate the true level of the test in part (b) for σ²₀ = 1 by simulating 5000 samples of size n = 25 from the null distribution. Report the proportion of cases in which you reject the null hypothesis according to your test (ideally, this proportion will be about .05).


4.3 Sample Correlation

Suppose that (X1, Y1), (X2, Y2), . . . are iid vectors with E X⁴i < ∞ and E Y⁴i < ∞. For the sake of simplicity, we will assume without loss of generality that E Xi = E Yi = 0 (alternatively, we could base all of the following derivations on the centered versions of the random variables).

We wish to find the asymptotic distribution of the sample correlation coefficient, r. If we let

(mx, my, mxx, myy, mxy) = (1/n) ( ∑_{i=1}^n Xi, ∑_{i=1}^n Yi, ∑_{i=1}^n X²i, ∑_{i=1}^n Y²i, ∑_{i=1}^n XiYi )   (4.3)

and

s²x = mxx − m²x,   s²y = myy − m²y,   and   sxy = mxy − mxmy,   (4.4)

then r = sxy/(sxsy). According to the central limit theorem,

√n{ (mx, my, mxx, myy, mxy) − (0, 0, σ²x, σ²y, σxy) } d→ N₅( 0, C ),   (4.5)

where C is the matrix

    [ Cov(X1, X1)     · · ·   Cov(X1, X1Y1)
      Cov(Y1, X1)     · · ·   Cov(Y1, X1Y1)
          ...          ...        ...
      Cov(X1Y1, X1)   · · ·   Cov(X1Y1, X1Y1) ].

Let Σ denote the covariance matrix in expression (4.5). Define a function g : R⁵ → R³ such that g applied to the vector of moments in equation (4.3) yields the vector (s²x, s²y, sxy) as defined in expression (4.4). Then

∇g(a, b, c, d, e) =  [ −2a     0    −b
                         0    −2b   −a
                         1     0     0
                         0     1     0
                         0     0     1 ].

Therefore, if we let

Σ* = [ ∇g(0, 0, σ²x, σ²y, σxy) ]ᵗ Σ [ ∇g(0, 0, σ²x, σ²y, σxy) ]
   = [ Cov(X²1, X²1)      Cov(X²1, Y²1)      Cov(X²1, X1Y1)
       Cov(Y²1, X²1)      Cov(Y²1, Y²1)      Cov(Y²1, X1Y1)
       Cov(X1Y1, X²1)     Cov(X1Y1, Y²1)     Cov(X1Y1, X1Y1) ],

then by the delta method,

√n{ (s²x, s²y, sxy) − (σ²x, σ²y, σxy) } d→ N₃(0, Σ*).   (4.6)

As an aside, note that expression (4.6) gives the same marginal asymptotic distribution for √n(s²x − σ²x) as was derived using a different approach in Example 4.4, since Cov(X²1, X²1) is the same as τ² in that example.


Next, define the function h(a, b, c) = c/√(ab), so that we have h(s²x, s²y, sxy) = r. Then

[∇h(a, b, c)]ᵗ = (1/2)( −c/√(a³b), −c/√(ab³), 2/√(ab) ),

so that

[∇h(σ²x, σ²y, σxy)]ᵗ = ( −σxy/(2σ³xσy), −σxy/(2σxσ³y), 1/(σxσy) ) = ( −ρ/(2σ²x), −ρ/(2σ²y), 1/(σxσy) ).   (4.7)

Therefore, if A denotes the 1 × 3 matrix in equation (4.7), using the delta method once again yields

√n(r − ρ) d→ N(0, AΣ*Aᵗ).

To recap, we have used the basic tools of the multivariate central limit theorem and the multivariate delta method to obtain a univariate result. This derivation of univariate facts via multivariate techniques is common practice in statistical large-sample theory.

Example 4.5 Consider the special case of bivariate normal (Xi, Yi). In this case, we may derive

Σ* = [ 2σ⁴x          2ρ²σ²xσ²y       2ρσ³xσy
       2ρ²σ²xσ²y     2σ⁴y            2ρσxσ³y
       2ρσ³xσy       2ρσxσ³y         (1 + ρ²)σ²xσ²y ].   (4.8)

In this case, AΣ*Aᵗ = (1 − ρ²)², which implies that

√n(r − ρ) d→ N{0, (1 − ρ²)²}.   (4.9)

In the normal case, we may derive a variance-stabilizing transformation. According to equation (4.9), we should find a function f(x) satisfying f′(x) = (1 − x²)⁻¹. Since

1/(1 − x²) = 1/(2(1 − x)) + 1/(2(1 + x)),

which is easy to integrate, we obtain

f(x) = (1/2) log[(1 + x)/(1 − x)].

This is called Fisher’s transformation; we conclude that

√n( (1/2) log[(1 + r)/(1 − r)] − (1/2) log[(1 + ρ)/(1 − ρ)] ) d→ N(0, 1).
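A brief R sketch (an illustration with arbitrary choices n = 100, ρ = 0.7, and 5000 replications) shows the stabilized statistic behaving like a standard normal variable.

# Fisher's z-transformation: sqrt(n)*(f(r) - f(rho)) is approximately N(0, 1).
set.seed(8)
fisher <- function(x) 0.5 * log((1 + x) / (1 - x))
n <- 100; rho <- 0.7
z <- replicate(5000, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # (x, y) bivariate normal with correlation rho
  sqrt(n) * (fisher(cor(x, y)) - fisher(rho))
})
c(mean = round(mean(z), 3), var = round(var(z), 3))   # should be near 0 and 1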

Exercises for Section 4.3

Exercise 4.6 Verify expressions (4.8) and (4.9).


Exercise 4.7 Assume (X1, Y1), . . . , (Xn, Yn) are iid from some bivariate normal distribution. Let ρ denote the population correlation coefficient and r the sample correlation coefficient.

(a) Describe a test of H0 : ρ = 0 against H1 : ρ ≠ 0 based on the fact that

√n[f(r) − f(ρ)] d→ N(0, 1),

where f(x) is Fisher’s transformation f(x) = (1/2) log[(1 + x)/(1 − x)]. Use α = .05.

(b) Based on 5000 repetitions each, estimate the actual level for this test in the case when E(Xi) = E(Yi) = 0, Var(Xi) = Var(Yi) = 1, and n ∈ {3, 5, 10, 20}.

Exercise 4.8 Suppose that X and Y are jointly distributed such that X and Y are Bernoulli (1/2) random variables with P(XY = 1) = θ for θ ∈ (0, 1/2). Let (X1, Y1), (X2, Y2), . . . be iid with (Xi, Yi) distributed as (X, Y).

(a) Find the asymptotic distribution of √n[ (X̄n, Ȳn) − (1/2, 1/2) ].

(b) If rn is the sample correlation coefficient for a sample of size n, find the asymptotic distribution of √n(rn − ρ).

(c) Find a variance stabilizing transformation for rn.

(d) Based on your answer to part (c), construct a 95% confidence interval for θ.

(e) For each combination of n ∈ {5, 20} and θ ∈ {.05, .25, .45}, estimate the true coverage probability of the confidence interval in part (d) by simulating 5000 samples and the corresponding confidence intervals. One problem you will face is that in some samples, the sample correlation coefficient is undefined because with positive probability each of the Xi or Yi will be the same. In such cases, consider the confidence interval to be undefined and the true parameter therefore not contained therein.

Hint: To generate a sample of (X, Y), first simulate the X’s from their marginal distribution, then simulate the Y’s according to the conditional distribution of Y given X. To obtain this conditional distribution, find P(Y = 1 | X = 1) and P(Y = 1 | X = 0).


Chapter 5

Order Statistics and Quantiles

5.1 Extreme Order Statistics

Suppose we have a finite sample X1, . . . , Xn. Conditional on this sample, we define the values X(1), . . . , X(n) to be a permutation of X1, . . . , Xn such that X(1) ≤ X(2) ≤ · · · ≤ X(n). We call X(i) the ith order statistic of the sample. In much of this chapter, we will be concerned with the marginal distributions of these order statistics.

Even though the notation X(i) does not explicitly give the sample size n, the distribution of X(i) depends essentially on n. For this reason, some textbooks use slightly more complicated notation such as

X(1:n), X(2:n), . . . , X(n:n)

for the order statistics of a sample. We choose to use the simpler notation here, though we will always understand the sample size to be n.

The asymptotic distributions of order statistics at the extremes of a sample may be derived without any specialized knowledge other than the limit formula

(1 + c/n)^n → e^c as n → ∞   (5.1)

and its generalization

(1 + cn/bn)^{bn} → e^c if cn → c and bn → ∞   (5.2)

[see Exercise 1.2(b)]. Recall that by “asymptotic distribution of X(1),” we mean sequences kn and an, along with a nondegenerate random variable X, such that kn(X(1) − an) d→ X. This section consists mostly of a series of illustrative examples.

Example 5.1 Suppose X1, . . . , Xn are independent and identically distributed uniform(0, 1) random variables. What is the asymptotic distribution of X(n)?


Since X(n) ≤ t if and only if X1 ≤ t, X2 ≤ t, . . . , and Xn ≤ t, by independence we have

P(X(n) ≤ t) = 0 if t ≤ 0;   t^n if 0 < t < 1;   1 if t ≥ 1.   (5.3)

From equation (5.3), it is apparent that X(n) P→ 1, though this limit statement does not fully reveal the asymptotic distribution of X(n). We desire sequences kn and an such that kn(X(n) − an) has a nondegenerate limiting distribution. Evidently, we should expect an = 1, a fact we shall rederive below.

Computing the distribution function of kn(X(n) − an) directly, we find

F(u) = P{kn(X(n) − an) ≤ u} = P{X(n) ≤ u/kn + an}

as long as kn > 0. Therefore, we see that

F(u) = ( u/kn + an )^n   for 0 < u/kn + an < 1.   (5.4)

We would like this expression to tend to a limit involving only u as n → ∞. Keeping expression (5.2) in mind, we take an = 1 and kn = n so that F(u) = (1 + u/n)^n, which tends to e^u.

However, we are not quite finished, since we have not determined which values of u make the above limit valid. Equation (5.4) required that 0 < an + (u/kn) < 1, which in this case becomes −1 < u/n < 0. This means u may be any negative real number, since for any u < 0, −1 < u/n < 0 for all n > |u|. We conclude that if the random variable U has distribution function

F(u) = exp(u) if u ≤ 0;   1 if u > 0,

then n(X(n) − 1) d→ U. Since −U is simply a standard exponential random variable, we may also write

n(1 − X(n)) d→ Exponential(1).
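This convergence is easy to visualize. The R sketch below (with arbitrary sample size and replication count) compares empirical quantiles of n(1 − X(n)) with quantiles of the standard exponential distribution; it is an illustration only.

# n*(1 - max(U_1, ..., U_n)) compared with the Exponential(1) distribution.
set.seed(9)
n <- 1000
w <- replicate(5000, n * (1 - max(runif(n))))
round(rbind(empirical   = quantile(w, c(.25, .5, .75, .95)),
            exponential = qexp(c(.25, .5, .75, .95))), 2)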

Example 5.2 Suppose X1, X2, . . . are independent and identically distributed exponential random variables with mean 1. What is the asymptotic distribution of X(n)?

As in equation (5.3), if u/kn + an > 0 then

P{ kn(X(n) − an) ≤ u } = P( X(n) ≤ u/kn + an ) = { 1 − exp(−an − u/kn) }^n.

Taking an = log n and kn = 1, the rightmost expression above simplifies to

( 1 − e^{−u}/n )^n,


which has limit exp(−e^{−u}). The condition u/kn + an > 0 becomes u + log n > 0, which is true for all u ∈ R as long as n > exp(−u). Therefore, we conclude that X(n) − log n d→ U, where

P(U ≤ u) def= exp{−exp(−u)} for all u.   (5.5)

The distribution of U in equation (5.5) is known as the extreme value distribution or the Gumbel distribution.

In Examples 5.1 and 5.2, we derived the asymptotic distribution of a maximum from a simple random sample. We did this using only the definition of convergence in distribution without relying on any results other than expression (5.2). In a similar way, we may derive the joint asymptotic distribution of multiple order statistics, as in the following example.

Example 5.3 Range of uniform sample: Let X1, . . . , Xn be a simple random sample from Uniform(0, 1). Let Rn = X(n) − X(1) denote the range of the sample. What is the asymptotic distribution of Rn?

We begin to answer this question by finding the joint asymptotic distribution of (X(n), X(1)), as follows. For sequences kn and ℓn, as yet unspecified, consider

P(knX(1) > x and ℓn(1 − X(n)) > y) = P(X(1) > x/kn and X(n) < 1 − y/ℓn)
                                   = P(x/kn < X(1) < · · · < X(n) < 1 − y/ℓn),

where we have assumed that kn and ℓn are positive. Since the probability above is simply the probability that the entire sample is to be found in the interval (x/kn, 1 − y/ℓn), we conclude that as long as

0 < x/kn < 1 − y/ℓn < 1,   (5.6)

we have

P(knX(1) > x and ℓn(1 − X(n)) > y) = ( 1 − y/ℓn − x/kn )^n.

Expression (5.1) suggests that we set kn = ℓn = n, resulting in

P(nX(1) > x and n(1 − X(n)) > y) = ( 1 − y/n − x/n )^n.

Expression (5.6) becomes

0 < x/n < 1 − y/n < 1,

which is satisfied for large enough n if and only if x and y are both positive. We conclude that for x > 0, y > 0,

P(nX(1) > x and n(1 − X(n)) > y) → e^{−x}e^{−y}.

Since this is the joint distribution of independent standard exponential random variables, say, Y1 and Y2, we conclude that

( nX(1), n(1 − X(n)) ) d→ ( Y1, Y2 ).


Therefore, applying the continuous function f(a, b) = a + b to both sides gives

n(1 − X(n) + X(1)) = n(1 − Rn) d→ Y1 + Y2 ∼ Gamma(2, 1).

Let us consider a different example in which the asymptotic joint distribution does not involve independent random variables.

Example 5.4 As in Example 5.3, let X1, . . . , Xn be independent and identically distributed from uniform(0, 1). What is the joint asymptotic distribution of X(n−1) and X(n)?

Proceeding as in Example 5.3, we obtain

P[ ( n(1 − X(n−1)), n(1 − X(n)) ) > ( x, y ) ] = P( X(n−1) < 1 − x/n and X(n) < 1 − y/n ).   (5.7)

We consider two separate cases: If 0 < x < y, then the right hand side of (5.7) is simply P(X(n) < 1 − y/n), which converges to e^{−y} as in Example 5.1. On the other hand, if 0 < y < x, then

P( X(n−1) < 1 − x/n and X(n) < 1 − y/n )
   = P( X(n) < 1 − x/n ) + P( X(n−1) < 1 − x/n < X(n) < 1 − y/n )
   = ( 1 − x/n )^n + n( 1 − x/n )^{n−1}( x/n − y/n )
   → e^{−x}(1 + x − y).

The second equality above arises because X(n−1) < a < X(n) < b if and only if exactly n − 1 of the Xi are less than a and exactly one is between a and b. We now know the joint asymptotic distribution of n(1 − X(n−1)) and n(1 − X(n)); but can we describe this joint distribution in a simple way? Suppose that Y1 and Y2 are independent standard exponential variables. Consider the joint distribution of Y1 and Y1 + Y2: If 0 < x < y, then

P(Y1 + Y2 > x and Y1 > y) = P(Y1 > y) = e^{−y}.

On the other hand, if 0 < y < x, then

P(Y1 + Y2 > x and Y1 > y) = P(Y1 > max{y, x − Y2}) = E e^{−max{y, x−Y2}}
   = e^{−y} P(y > x − Y2) + ∫₀^{x−y} e^{t−x} e^{−t} dt = e^{−x}(1 + x − y).

Therefore, we conclude that

( n(1 − X(n−1)), n(1 − X(n)) ) d→ ( Y1 + Y2, Y1 ).

Notice that marginally, we have shown that n(1 − X(n−1)) d→ Gamma(2, 1).

Recall that if F is a continuous, invertible distribution function and U is a standard uniform random variable, then F⁻¹(U) ∼ F. This is easy to prove, since P{F⁻¹(U) ≤ t} = P{U ≤ F(t)} = F(t). We may use this fact in conjunction with the result of Example 5.4 as in the following example.


Example 5.5 Suppose X1, . . . , Xn are independent standard exponential random variables. What is the joint asymptotic distribution of (X(n−1), X(n))?

The distribution function of a standard exponential distribution is F(t) = 1 − e^{−t}, whose inverse is F⁻¹(u) = −log(1 − u). Therefore,

( −log(1 − U(n−1)), −log(1 − U(n)) ) d= ( X(n−1), X(n) ),

where d= means “has the same distribution”. Thus,

( −log[n(1 − U(n−1))], −log[n(1 − U(n))] ) d= ( X(n−1) − log n, X(n) − log n ).

We conclude by the result of Example 5.4 that

( X(n−1) − log n, X(n) − log n ) d→ ( −log(Y1 + Y2), −log Y1 ),

where Y1 and Y2 are independent standard exponential variables.

Exercises for Section 5.1

Exercise 5.1 For a given n, let X1, . . . , Xn be independent and identically distributed with distribution function

P(Xi ≤ t) = (t³ + θ³)/(2θ³) for t ∈ [−θ, θ].

Let X(1) denote the first order statistic from the sample of size n; that is, X(1) is the smallest of the Xi.

(a) Prove that −X(1) is consistent for θ.

(b) Prove that

n(θ + X(1)) d→ Y,

where Y is a random variable with an exponential distribution. Find E(Y) in terms of θ.

(c) For a fixed α, define

δα,n = −( 1 + α/n ) X(1).

Find, with proof, α* such that

n(θ + δα*,n) d→ Y − E(Y),

where Y is the same random variable as in part (b).


(d) Compare the two consistent θ-estimators δα*,n and −X(1) empirically as follows. For n ∈ {10², 10³, 10⁴}, take θ = 1 and simulate 1000 samples of size n from the distribution of Xi. From these 1000 samples, estimate the bias and mean squared error of each estimator. Which of the two appears better? Do your empirical results agree with the theoretical results in parts (b) and (c)?

Exercise 5.2 Let X1, X2, . . . be independent uniform(0, θ) random variables. Let X(n) = max{X1, . . . , Xn} and consider the three estimators

δ⁰n = X(n),   δ¹n = (n/(n − 1)) X(n),   δ²n = ( n/(n − 1) )² X(n).

(a) Prove that each estimator is consistent for θ.

(b) Perform an empirical comparison of these three estimators for n = 10², 10³, 10⁴. Use θ = 1 and simulate 1000 samples of size n from uniform(0, 1). From these 1000 samples, estimate the bias and mean squared error of each estimator. Which one of the three appears to be best?

(c) Find the asymptotic distribution of n(θ − δⁱn) for i = 0, 1, 2. Based on your results, which of the three appears to be the best estimator and why? (For the latter question, don’t attempt to make a rigorous mathematical argument; simply give an educated guess.)

Exercise 5.3 Find, with proof, the asymptotic distribution of X(n) if X1, . . . , Xn are independent and identically distributed with each of the following distributions. (That is, find kn, an, and a nondegenerate random variable X such that kn(X(n) − an) d→ X.)

(a) Beta(3, 1) with distribution function F(x) = x³ for x ∈ (0, 1).

(b) Standard logistic with distribution function F(x) = e^x/(1 + e^x).

Exercise 5.4 Let X1, . . . , Xn be independent uniform(0, 1) random variables. Find the joint asymptotic distribution of [ nX(2), n(1 − X(n−1)) ].

Hint: To find a probability such as P(a < X(2) < X(n) < b), consider the trinomial distribution with parameters [n; (a, b − a, 1 − b)] and note that the probability in question is the same as the probability that the numbers in the first and third categories are each ≤ 1.

Exercise 5.5 Let X1, . . . , Xn be a simple random sample from the distribution function F(x) = [1 − (1/x)] I{x > 1}.

(a) Find the joint asymptotic distribution of (X(n−1)/n, X(n)/n).

Hint: Proceed as in Example 5.5.

(b) Find the asymptotic distribution of X(n−1)/X(n).

Exercise 5.6 If X1, . . . , Xn are independent and identically distributed uniform(0, 1) variables, prove that X(1)/X(2) d→ uniform(0, 1).


Exercise 5.7 Let X1, . . . , Xn be independent uniform(0, 2θ) random variables.

(a) Let M = (X(1) + X(n))/2. Find the asymptotic distribution of n(M − θ).

(b) Compare the asymptotic performance of the three estimators M, X̄n, and the sample median X̃n by considering their relative efficiencies.

(c) For n ∈ {101, 1001, 10001}, generate 500 samples of size n, taking θ = 1. Keep track of M, X̄n, and X̃n for each sample. Construct a 3 × 3 table in which you report the sample variance of each estimator for each value of n. Do your simulation results agree with your theoretical results in part (b)?

Exercise 5.8 Let X1, . . . , Xn be a simple random sample from a logistic distribution with distribution function F(t) = e^{t/θ}/(1 + e^{t/θ}) for all t.

(a) Find the asymptotic distribution of X(n) − X(n−1).

Hint: Use the fact that log U(n) and log U(n−1) both converge in probability to zero.

(b) Based on part (a), construct an approximate 95% confidence interval for θ. Use the fact that the .025 and .975 quantiles of the standard exponential distribution are 0.0253 and 3.6889, respectively.

(c) Simulate 1000 samples of size n = 40 with θ = 2. How many confidence intervals contain θ?

5.2 Sample Quantiles

To derive the distribution of sample quantiles, we begin by obtaining the exact distribution of the order statistics of a random sample from a uniform distribution. To facilitate this derivation, we begin with a quick review of changing variables. Suppose X has density fX(x) and Y = g(X), where g : R^k → R^k is differentiable and has a well-defined inverse, which we denote by h : R^k → R^k. (In particular, we have X = h[Y].) The density for Y is

fY(y) = |Det[∇h(y)]| fX[h(y)],   (5.8)

where |Det[∇h(y)]| is the absolute value of the determinant of the k × k matrix ∇h(y).

5.2.1 Uniform Order Statistics

We now show that the order statistics of a uniform distribution may be obtained using ratios of gamma random variables. Suppose X1, . . . , Xn+1 are independent standard exponential, or Gamma(1, 1), random variables. For j = 1, . . . , n, define

Yj = ( ∑_{i=1}^j Xi ) / ( ∑_{i=1}^{n+1} Xi ).   (5.9)

We will show that the joint distribution of (Y1, . . . , Yn) is the same as the joint distribution of the order statistics (U(1), . . . , U(n)) of a simple random sample from uniform(0, 1) by demonstrating that their joint density function is the same as that of the uniform order statistics, namely n! I{0 < u(1) < · · · < u(n) < 1}.

We derive the joint density of (Y1, . . . , Yn) as follows. As an intermediate step, define Zj = ∑_{i=1}^j Xi for j = 1, . . . , n + 1. Then

Xi = Zi if i = 1;   Zi − Zi−1 if i > 1,

which means that the gradient of the transformation from Z to X is upper triangular with ones on the diagonal, a matrix whose determinant is one. This implies that the density for Z is

fZ(z) = exp{−zn+1} I{0 < z1 < z2 < · · · < zn+1}.

Next, if we define Yn+1 = Zn+1, then we may express Z in terms of Y as

Zi = Yn+1Yi if i < n + 1;   Yn+1 if i = n + 1.   (5.10)

The gradient of the transformation in equation (5.10) is lower triangular, with yn+1 along the diagonal except for a 1 in the lower right corner. The determinant of this matrix is y^n_{n+1}, so the density of Y is

fY(y) = y^n_{n+1} exp{−yn+1} I{yn+1 > 0} I{0 < y1 < · · · < yn < 1}.   (5.11)

Equation (5.11) reveals several things: First, (Y1, . . . , Yn) is independent of Yn+1 and the marginal distribution of Yn+1 is Gamma(n + 1, 1). More important for our purposes, the marginal joint density of (Y1, . . . , Yn) is proportional to I{0 < y1 < · · · < yn < 1}, which is exactly what we needed to prove. We conclude that the vector Y defined in equation (5.9) has the same distribution as the vector of order statistics of a simple random sample from uniform(0, 1).
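The representation (5.9) is easy to verify by simulation. The R sketch below (an illustration with the arbitrary choice n = 5) compares the distribution of the first coordinate Y1, built from cumulative sums of exponentials, with the smallest of n sorted uniforms; the two sets of quantiles should agree.

# Cumulative sums of exponentials divided by the total behave like
# uniform(0,1) order statistics (equation (5.9)); check the first coordinate.
set.seed(10)
n <- 5
from_exp  <- replicate(10000, { x <- rexp(n + 1); (cumsum(x)[1:n] / sum(x))[1] })
from_unif <- replicate(10000, sort(runif(n))[1])
round(rbind(exp.construction = quantile(from_exp,  c(.1, .5, .9)),
            sorted.uniforms  = quantile(from_unif, c(.1, .5, .9))), 3)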

5.2.2 Uniform Sample Quantiles

Using the result of Section 5.2.1, we may derive the joint asymptotic distribution of a set of sample quantiles for a uniform simple random sample.

Suppose we are interested in the p1 and p2 quantiles, where 0 < p1 < p2 < 1. The following argument may easily be generalized to obtain the joint asymptotic distribution of any finite number of quantiles. If U1, . . . , Un are independent uniform(0, 1) random variables, then the p1 and p2 sample quantiles may be taken to be the an-th and bn-th order statistics, respectively, where an def= ⌊.5 + np1⌋ and bn def= ⌊.5 + np2⌋ (⌊.5 + x⌋ is simply x rounded to the nearest integer).

Next, let

An def= (1/n) ∑_{i=1}^{an} Xi,   Bn def= (1/n) ∑_{i=an+1}^{bn} Xi,   and   Cn def= (1/n) ∑_{i=bn+1}^{n+1} Xi.

We proved in Section 5.2.1 that (U(an), U(bn)) has the same distribution as

g(An, Bn, Cn) def= ( An/(An + Bn + Cn), (An + Bn)/(An + Bn + Cn) ).   (5.12)


The asymptotic distribution of g(An, Bn, Cn) may be easily determined using the delta method if we can determine the joint asymptotic distribution of (An, Bn, Cn).

A bit of algebra reveals that

\[ \sqrt n\,(A_n - p_1) = \sqrt{\frac{a_n}{n}}\,\sqrt{a_n}\left(\frac{nA_n}{a_n} - \frac{np_1}{a_n}\right) = \sqrt{\frac{a_n}{n}}\,\sqrt{a_n}\left(\frac{nA_n}{a_n} - 1\right) + \frac{a_n - np_1}{\sqrt n}. \]
By the central limit theorem, √an(nAn/an − 1) →d N(0, 1) because the Xi have mean 1 and variance 1. Furthermore, an/n → p1 and the rightmost term above goes to 0, so Slutsky's theorem gives
\[ \sqrt n\,(A_n - p_1) \xrightarrow{d} N(0, p_1). \]

Similar arguments apply to Bn and to Cn. Because An, Bn, and Cn are independent of one another, we may stack them as in Exercise 2.19 to obtain
\[ \sqrt n\left\{ \begin{pmatrix} A_n \\ B_n \\ C_n \end{pmatrix} - \begin{pmatrix} p_1 \\ p_2 - p_1 \\ 1 - p_2 \end{pmatrix} \right\} \xrightarrow{d} N_3\left\{ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} p_1 & 0 & 0 \\ 0 & p_2 - p_1 & 0 \\ 0 & 0 & 1 - p_2 \end{pmatrix} \right\}
\]

by Slutsky's theorem. For g: R^3 → R^2 defined in equation (5.12), we obtain
\[ \nabla g(a, b, c) = \frac{1}{(a + b + c)^2}\begin{pmatrix} b + c & c \\ -a & c \\ -a & -a - b \end{pmatrix}. \]
Therefore,
\[ [\nabla g(p_1,\, p_2 - p_1,\, 1 - p_2)]^t = \begin{pmatrix} 1 - p_1 & -p_1 & -p_1 \\ 1 - p_2 & 1 - p_2 & -p_2 \end{pmatrix}, \]

so the delta method gives
\[ \sqrt n\left\{ \begin{pmatrix} U_{(a_n)} \\ U_{(b_n)} \end{pmatrix} - \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \right\} \xrightarrow{d} N_2\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} p_1(1 - p_1) & p_1(1 - p_2) \\ p_1(1 - p_2) & p_2(1 - p_2) \end{pmatrix} \right\}. \tag{5.13} \]

The method used above to derive the joint distribution (5.13) of two sample quantiles may be easily extended to any number of quantiles; doing so yields the following theorem:

Theorem 5.1 Suppose that for given constants p1, . . . , pk with 0 < p1 < · · · < pk < 1, there exist sequences a1n, . . . , akn such that for all 1 ≤ i ≤ k,
\[ \frac{a_{in} - np_i}{\sqrt n} \to 0. \]
Then if U1, . . . , Un is a sample from Uniform(0, 1),
\[ \sqrt n\left\{ \begin{pmatrix} U_{(a_{1n})} \\ \vdots \\ U_{(a_{kn})} \end{pmatrix} - \begin{pmatrix} p_1 \\ \vdots \\ p_k \end{pmatrix} \right\} \xrightarrow{d} N_k\left\{ \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} p_1(1 - p_1) & \cdots & p_1(1 - p_k) \\ \vdots & & \vdots \\ p_1(1 - p_k) & \cdots & p_k(1 - p_k) \end{pmatrix} \right\}. \]

Note that for i < j, both the (i, j) and (j, i) entries in the covariance matrix above equal pi(1 − pj); the value pj(1 − pi) never occurs in the matrix.


5.2.3 General sample quantiles

Recall that we have already derived the asymptotic distribution of the sample median in Example 3.2. Here, we generalize that result using an entirely different method.

Let F(x) be the distribution function for a random variable X. The quantile function F−(u) of Definition 2.18 is nondecreasing on (0, 1), and it has the property that F−(U) has the same distribution as X for U ∼ uniform(0, 1). This property follows from Lemma 2.5, since
\[ P[F^-(U) \le x] = P[U \le F(x)] = F(x) \]
for all x. Since F−(u) is nondecreasing, it preserves ordering; thus, if X1, . . . , Xn is a random sample from F(x), then
\[ (X_{(1)}, \ldots, X_{(n)}) \stackrel{d}{=} \bigl[F^-(U_{(1)}), \ldots, F^-(U_{(n)})\bigr]. \]
(The symbol \(\stackrel{d}{=}\) means "has the same distribution as".)

Now, suppose that at some point ξ, the derivative F′(ξ) exists and is positive. Then F(x) must be continuous and strictly increasing in a neighborhood of ξ. This implies that in this neighborhood, F(x) has a well-defined inverse, which must be differentiable at the point p \(\stackrel{\text{def}}{=}\) F(ξ). If F^{-1}(u) denotes the inverse that exists in a neighborhood of p, then
\[ \frac{d F^{-1}(p)}{dp} = \frac{1}{F'(\xi)}. \tag{5.14} \]
Equation (5.14) may be derived by differentiating the equation F[F^{-1}(p)] = p. Note also that whenever the inverse F^{-1}(u) exists, it must coincide with the quantile function F−(u). Thus, the condition that F′(ξ) exists and is positive is a sufficient condition to imply that F−(u) is differentiable at p. This differentiability is important: If we wish to transform the uniform order statistics U(ain) of Theorem 5.1 into order statistics X(ain) using the quantile function F−(u), the delta method requires the differentiability of F−(u) at each of the points p1, . . . , pk.

The delta method, along with equation (5.14), yields the following corollary of theorem 5.1:

Theorem 5.2 Let X1, . . . , Xn be a simple random sample from a distribution function F(x) such that F(x) is differentiable at each of the points ξ1 < · · · < ξk and F′(ξi) > 0 for all i. Denote F(ξi) by pi. Then under the assumptions of Theorem 5.1,
\[ \sqrt n\left\{ \begin{pmatrix} X_{(a_{1n})} \\ \vdots \\ X_{(a_{kn})} \end{pmatrix} - \begin{pmatrix} \xi_1 \\ \vdots \\ \xi_k \end{pmatrix} \right\} \xrightarrow{d} N_k\left\{ \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} \dfrac{p_1(1 - p_1)}{F'(\xi_1)^2} & \cdots & \dfrac{p_1(1 - p_k)}{F'(\xi_1)F'(\xi_k)} \\ \vdots & & \vdots \\ \dfrac{p_1(1 - p_k)}{F'(\xi_1)F'(\xi_k)} & \cdots & \dfrac{p_k(1 - p_k)}{F'(\xi_k)^2} \end{pmatrix} \right\}. \]
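Theorem 5.2 is easy to check numerically. The sketch below is an added illustration, not part of the notes; it assumes numpy and scipy, simulates the sample 0.75-quantile of a standard normal sample, and compares n times its variance with the limiting value p(1 − p)/F′(ξ)².

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps, p = 400, 20_000, 0.75

samples = rng.normal(size=(reps, n))
k = int(np.floor(0.5 + n * p))                 # a_n as defined in Section 5.2.2
q_hat = np.sort(samples, axis=1)[:, k - 1]     # the a_n-th order statistic

xi = norm.ppf(p)                               # true p-quantile of N(0,1)
asympt_var = p * (1 - p) / norm.pdf(xi) ** 2   # diagonal entry in Theorem 5.2
print("n * empirical variance:", round(n * q_hat.var(), 3))
print("asymptotic variance:   ", round(asympt_var, 3))
```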

Exercises for Section 5.2

Exercise 5.9 Let X1 be Uniform(0, 2π) and let X2 be standard exponential, independent of X1. Find the joint distribution of \((Y_1, Y_2) = (\sqrt{2X_2}\cos X_1,\, \sqrt{2X_2}\sin X_1)\).


Note: Since − log U has a standard exponential distribution if U ∼ uniform(0, 1), this problem may be used to simulate normal random variables using simulated uniform random variables.
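The transformation described in this note is the Box-Muller method. A minimal sketch (added here for illustration; plain Python only) that turns two independent uniforms into two independent standard normals:

```python
import math
import random

def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Map two independent uniform(0,1) draws to two independent N(0,1) draws."""
    x2 = -math.log(u1)              # standard exponential, as in the note
    r = math.sqrt(2.0 * x2)
    angle = 2.0 * math.pi * u2      # X_1 ~ uniform(0, 2*pi)
    return r * math.cos(angle), r * math.sin(angle)

z1, z2 = box_muller(random.random(), random.random())
print(z1, z2)
```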

Exercise 5.10 Suppose X1, . . . , Xn is a simple random sample with P(Xi ≤ x) = F(x − θ), where F(x) is symmetric about zero. We wish to estimate θ by (Qp + Q1−p)/2, where Qp and Q1−p are the p and 1 − p sample quantiles, respectively. Find the smallest possible asymptotic variance for the estimator and the p for which it is achieved for each of the following forms of F(x):

(a) Standard Cauchy

(b) Standard normal

(c) Standard double exponential

Hint: For at least one of the three parts of this question, you will have to solve for a minimizer numerically.

Exercise 5.11 When we use a boxplot to assess the symmetry of a distribution, one of the main things we do is visually compare the lengths of Q3 − Q2 and Q2 − Q1, where Qi denotes the ith sample quartile.

(a) Given a random sample of size n from N(0, 1), find the asymptotic distribution of (Q3 − Q2) − (Q2 − Q1).

(b) Repeat part (a) if the sample comes from a standard logistic distribution.

(c) Using 1000 simulations from each distribution, use graphs to assess the accuracy of each of the asymptotic approximations above for n = 5 and n = 13. (For a sample of size 4n + 1, define Qi to be the (in + 1)th order statistic.) For each value of n and each distribution, plot the empirical distribution function against the theoretical limiting distribution function.

Exercise 5.12 Let X1, . . . , Xn be a random sample from Uniform(0, 2θ). Find the asymptotic distributions of the median, the midquartile range, and \(\frac{2}{3}X_{(\lfloor 3n/4 \rfloor)}\). (The midquartile range is the mean of the 1st and 3rd quartiles.) Compare these three estimates of θ based on their variances.


Chapter 6

Maximum Likelihood Estimation

6.1 Consistency

As I assume you already know, if X is a random variable (or vector) with density or mass function fθ(x) that depends on a parameter θ, then the function fθ(X), viewed as a function of θ, is called the likelihood function of θ. We often denote this function by L(θ). Note that L(θ) = fθ(X) is implicitly a function of X, but we suppress this fact in the notation. Since "density or mass function" is awkward, we will use the term "density" to refer to fθ(x) throughout this chapter, even if the distribution function of X is not continuous. (It is possible to define integration so that fθ(x) is truly a density function even for non-continuous X; however, this is a measure-theoretic topic beyond the scope of this book.) Let the set of possible values of θ be the set Ω.

If L(θ) has a maximizer in Ω, say θ̂, then θ̂ is called a maximum likelihood estimator or MLE. Since the logarithm is a strictly increasing function, any maximizer of L(θ) also maximizes ℓ(θ) \(\stackrel{\text{def}}{=}\) log L(θ). It is often much easier to maximize ℓ(θ), called the loglikelihood function, than L(θ).

Example 6.1 Suppose Ω = (0, ∞) and X ∼ binomial(n, e^{−θ}). Then
\[ \ell(\theta) = \log\binom{n}{X} - X\theta + (n - X)\log(1 - e^{-\theta}), \]
so
\[ \ell'(\theta) = -X + \frac{n - X}{e^{\theta} - 1}. \]
Thus, setting ℓ′(θ) = 0 yields θ̂ = − log(X/n). It isn't hard to verify that ℓ′′(θ) < 0, so that − log(X/n) is in fact a maximizer of ℓ(θ).

As the preceding example demonstrates, it is not always the case that an MLE exists: for if X = 0 or X = n, then − log(X/n) is not contained in Ω. This is just one of the technical details that we will consider. Ultimately, we will show that the maximum likelihood estimator is, in many cases, asymptotically normal. However, this is not always the case; in fact, it is not even necessarily true that the MLE is consistent, as shown in Problem 6.1.


We begin the discussion of the consistency of the MLE by defining the so-called Kullback-Leibler information.

Definition 6.1 If fθ0(x) and fθ1(x) are two densities, the Kullback-Leibler information number equals
\[ K(f_{\theta_0}, f_{\theta_1}) = E_{\theta_0}\log\frac{f_{\theta_0}(X)}{f_{\theta_1}(X)}. \]
If Pθ0{fθ0(X) > 0 and fθ1(X) = 0} > 0, then K(fθ0, fθ1) is defined to be ∞.

We may show that the Kullback-Leibler information must be nonnegative by noting that
\[ E_{\theta_0}\frac{f_{\theta_1}(X)}{f_{\theta_0}(X)} = E_{\theta_1} I\{f_{\theta_0}(X) > 0\} \le 1. \]
Therefore, by Jensen's inequality (1.16) and the strict convexity of the function − log x,
\[ K(f_{\theta_0}, f_{\theta_1}) = E_{\theta_0}\left\{-\log\frac{f_{\theta_1}(X)}{f_{\theta_0}(X)}\right\} \ge -\log E_{\theta_0}\frac{f_{\theta_1}(X)}{f_{\theta_0}(X)} \ge 0, \tag{6.1} \]
with equality if and only if Pθ0{fθ0(X) = fθ1(X)} = 1. Inequality (6.1) is sometimes called the Shannon-Kolmogorov information inequality.

If X1, . . . , Xn are independent and identically distributed with density fθ0(x), then ℓ(θ) = Σ_{i=1}^n log fθ(xi). Thus, the Shannon-Kolmogorov information inequality may be used to prove the consistency of the maximum likelihood estimator in the case of a finite parameter space:

Theorem 6.1 Suppose Ω contains finitely many elements and that X1, . . . , Xn are independent and identically distributed with density fθ0(x). Furthermore, suppose that the model is identifiable, which is to say that different values of θ lead to different distributions. Then if θ̂n denotes the maximum likelihood estimator, θ̂n →P θ0.

Proof: Notice that
\[ \frac{1}{n}\sum_{i=1}^n \log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} \xrightarrow{P} E_{\theta_0}\log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} = -K(f_{\theta_0}, f_\theta). \tag{6.2} \]
The value of −K(fθ0, fθ) is strictly negative for θ ≠ θ0 by the identifiability of the model. Therefore, since θ = θ̂n is the maximizer of the left hand side of Equation (6.2),
\[ P(\hat\theta_n \ne \theta_0) = P\left( \max_{\theta \ne \theta_0}\frac{1}{n}\sum_{i=1}^n \log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} > 0 \right) \le \sum_{\theta \ne \theta_0} P\left( \frac{1}{n}\sum_{i=1}^n \log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} > 0 \right) \to 0. \]
This implies that θ̂n →P θ0.

The result of Theorem 6.1 may be extended in several ways; however, it is unfortunately not true in general that a maximum likelihood estimator is consistent, as seen in Problem 6.1. We will present the extension given in Lehmann, but we do so without proof.

If we return to the simple Example 6.1, we see that the MLE was found by solving the equation
\[ \ell'(\theta) = 0. \tag{6.3} \]


Equation (6.3) is called the likelihood equation, and naturally a root of the likelihood equation is a good candidate for a maximum likelihood estimator. However, there may be no root, and there may be more than one. It turns out that the probability that at least one root exists goes to 1 as n → ∞. Consider Example 6.1, in which no MLE exists whenever X = 0 or X = n. In that case, both P(X = 0) = (1 − e^{−θ})^n and P(X = n) = e^{−nθ} go to zero as n → ∞. In the case of multiple roots, one of these roots is typically consistent for θ0, as stated in the following theorem.

Theorem 6.2 Suppose that X1, . . . , Xn are independent and identically distributed with density fθ0(x) for θ0 in an open interval Ω ⊂ R, where the model is identifiable (i.e., different values of θ ∈ Ω give different distributions). Furthermore, suppose that the loglikelihood function ℓ(θ) is differentiable and that the support {x : fθ(x) > 0} does not depend on θ. Then with probability approaching 1 as n → ∞, there exists θ̂n = θ̂n(X1, . . . , Xn) such that ℓ′(θ̂n) = 0 and θ̂n →P θ0.

Stated succinctly, Theorem 6.2 says that under certain regularity conditions, there is a consistent root of the likelihood equation. It is important to note that there is no guarantee that this consistent root is the MLE. However, if the likelihood equation has only a single root, we can be more precise:

Corollary 6.1 Under the conditions of Theorem 6.2, if for every n there is a unique root of the likelihood equation, and this root is a local maximum, then this root is the MLE and the MLE is consistent.

Proof: The only thing that needs to be proved is the assertion that the unique root is the MLE. Denote the unique root by θ̂n and suppose there is some other point θ such that ℓ(θ) ≥ ℓ(θ̂n). Then there must be a local minimum between θ̂n and θ, which contradicts the assertion that θ̂n is the unique root of the likelihood equation.

Exercises for Section 6.1

Exercise 6.1 In this problem, we explore an example in which the MLE is not consistent. Suppose that for θ ∈ (0, 1), X is a continuous random variable with density
\[ f_\theta(x) = \frac{3(1 - \theta)}{4\delta^3(\theta)}\bigl[\delta^2(\theta) - (x - \theta)^2\bigr]\,I\{|x - \theta| \le \delta(\theta)\} + \frac{\theta}{2}\,I\{|x| \le 1\}, \tag{6.4} \]

where δ(θ) > 0 for all θ.

(a) Prove that fθ(x) is a legitimate density.

(b) What condition on δ(θ) ensures that {x : fθ(x) > 0} does not depend on θ?

(c) With δ(θ) = exp{−(1 − θ)^{−4}}, let θ = .125. Take samples of sizes n ∈ {50, 250, 500} from fθ(x). In each case, graph the loglikelihood function and find the MLE. Also, try to identify the consistent root of the likelihood equation in each case.

Hints: To generate a sample from fθ(x), note that fθ(x) is a mixture density, which means you can start by generating a standard uniform random variable. If it's less than θ, generate a uniform variable on (−1, 1). Otherwise, generate a variable with density 3(δ² − x²)/4δ³ on (−δ, δ) and then add θ. You should be able to do this by inverting the distribution function or by using appropriately scaled and translated beta variables. Be very careful when graphing the loglikelihood and finding the MLE. In particular, make sure you evaluate the loglikelihood analytically at each of the sample points in (0, 1); if you fail to do this, you'll miss the point of the problem and you'll get the MLE incorrect. This is because the correct loglikelihood graph will have tall, extremely narrow spikes.

Exercise 6.2 In the situation of Problem 6.1, prove that the MLE is inconsistent.

Exercise 6.3 Suppose that X1, . . . , Xn are independent and identically distributed with density fθ(x), where θ ∈ (0, ∞). For each of the following forms of fθ(x), prove that the likelihood equation has a unique solution and that this solution maximizes the likelihood function.

(a) Weibull: For some constant a > 0,
\[ f_\theta(x) = a\theta^a x^{a-1}\exp\{-(\theta x)^a\}\,I\{x > 0\} \]
(b) Cauchy:
\[ f_\theta(x) = \frac{\theta}{\pi}\cdot\frac{1}{x^2 + \theta^2} \]
(c)
\[ f_\theta(x) = \frac{3\theta^2\sqrt 3}{2\pi(x^3 + \theta^3)}\,I\{x > 0\} \]

Exercise 6.4 Find the MLE and its asymptotic distribution given a random sample of size n from fθ(x) = (1 − θ)θ^x, x = 0, 1, 2, . . ., θ ∈ (0, 1).

6.2 Asymptotic normality of the MLE

As seen in the preceding section, the MLE is not necessarily even consistent, so the title of this section is slightly misleading; however, "Asymptotic normality of the consistent root of the likelihood equation" is a bit too long! It will be necessary to review a few facts regarding Fisher information before we proceed.

For a density (or mass) function fθ(x), we define the Fisher information function to be
\[ I(\theta) = E_\theta\left\{\frac{d}{d\theta}\log f_\theta(X)\right\}^2. \tag{6.5} \]

If η = g(θ) for some invertible and differentiable function g(·), then since
\[ \frac{d}{d\eta} = \frac{d\theta}{d\eta}\,\frac{d}{d\theta} = \frac{1}{g'(\theta)}\,\frac{d}{d\theta} \]
by the chain rule, we conclude that
\[ I(\eta) = \frac{I(\theta)}{g'(\theta)^2}. \tag{6.6} \]

Loosely speaking, I(θ) is the amount of information about θ contained in a single observation from the density fθ(x). However, this interpretation doesn't always make sense; for example, it is possible to have I(θ) = 0 for a very informative observation (see Example 7.2.1 on page 462 of Lehmann).

Although we do not dwell on this fact in this course, expectation may be viewed as integration. Suppose that fθ(x) is twice differentiable with respect to θ and that the operations of differentiation and integration may be interchanged in the following sense:
\[ \frac{d}{d\theta}\,E_\theta\left\{\frac{f_\theta(X)}{f_\theta(X)}\right\} = E_\theta\left\{\frac{\frac{d}{d\theta}f_\theta(X)}{f_\theta(X)}\right\} \tag{6.7} \]
and
\[ \frac{d^2}{d\theta^2}\,E_\theta\left\{\frac{f_\theta(X)}{f_\theta(X)}\right\} = E_\theta\left\{\frac{\frac{d^2}{d\theta^2}f_\theta(X)}{f_\theta(X)}\right\}. \tag{6.8} \]

Equations (6.7) and (6.8) give two additional expressions for I(θ). From Equation (6.7) follows
\[ I(\theta) = \mathrm{Var}_\theta\left\{\frac{d}{d\theta}\log f_\theta(X)\right\}, \tag{6.9} \]
and Equation (6.8) implies
\[ I(\theta) = -E_\theta\left\{\frac{d^2}{d\theta^2}\log f_\theta(X)\right\}. \tag{6.10} \]
In many cases, Equation (6.10) is the easiest form of the information to work with.

Equations (6.9) and (6.10) make clear a helpful property of the information, namely that for independent random variables, the information about θ contained in the joint sample is simply the sum of the individual information components. In particular, if we have a simple random sample of size n from fθ(x), then the information about θ equals nI(θ).

The reason that we need the Fisher information is that we will show that under certain regularity conditions,
\[ \sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} N\left\{0, \frac{1}{I(\theta_0)}\right\}, \tag{6.11} \]
where θ̂n is the consistent root of the likelihood equation.

Example 6.2 Suppose that X1, . . . , Xn are independent Poisson(θ0) random variables. Then the likelihood equation has a unique root, namely θ̂n = X̄n, and we know by the central limit theorem that √n(θ̂n − θ0) →d N(0, θ0). On the other hand, the Fisher information for a single observation in this case is
\[ -E_\theta\left\{\frac{d^2}{d\theta^2}\log f_\theta(X)\right\} = E_\theta\frac{X}{\theta^2} = \frac{1}{\theta}. \]
Thus, in this example, equation (6.11) holds.


Rather than stating all of the regularity conditions necessary to prove Equation (6.11), we work backwards, figuring out the conditions as we go through the proof. The first step is to expand ℓ′(θ̂n) in a power series around θ0:
\[ \ell'(\hat\theta_n) = \ell'(\theta_0) + (\hat\theta_n - \theta_0)\bigl\{\ell''(\theta_0) + o_P(1)\bigr\}. \tag{6.12} \]
In order for the second derivative ℓ′′(θ0) to exist, it is enough to assume that equation (6.8) holds (we will need this equation later anyway). Rewriting equation (6.12) gives
\[ \sqrt n(\hat\theta_n - \theta_0) = \sqrt n\,\frac{\ell'(\hat\theta_n) - \ell'(\theta_0)}{\ell''(\theta_0) + o_P(1)} = \frac{\frac{1}{\sqrt n}\bigl\{\ell'(\theta_0) - \ell'(\hat\theta_n)\bigr\}}{-\frac{1}{n}\ell''(\theta_0) - o_P(1/n)}. \tag{6.13} \]

Let's consider the pieces of Equation (6.13) individually. If the conditions of Theorem 6.2 are met, then ℓ′(θ̂n) →P 0. If Equation (6.7) holds and I(θ0) < ∞, then
\[ \frac{1}{\sqrt n}\,\ell'(\theta_0) = \sqrt n\left(\frac{1}{n}\sum_{i=1}^n \frac{d}{d\theta}\log f_{\theta_0}(X_i)\right) \xrightarrow{d} N\{0, I(\theta_0)\} \]
by the central limit theorem and Equation (6.9). If Equation (6.8) holds, then
\[ \frac{1}{n}\,\ell''(\theta_0) = \frac{1}{n}\sum_{i=1}^n \frac{d^2}{d\theta^2}\log f_{\theta_0}(X_i) \xrightarrow{P} -I(\theta_0) \]
by the weak law of large numbers and Equation (6.10).

In conclusion, we see that if all of our assumptions hold, then the numerator of (6.13) converges in distribution to N{0, I(θ0)} by Slutsky's theorem. Furthermore, the denominator in (6.13) converges in probability to I(θ0), so a second use of Slutsky's theorem gives the following theorem.

Theorem 6.3 Suppose that the conditions of Theorem 6.2 are satisfied, and let θ̂n denote a consistent root of the likelihood equation. Assume also that Equations (6.7) and (6.8) hold and that 0 < I(θ0) < ∞. Then
\[ \sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} N\left\{0, \frac{1}{I(\theta_0)}\right\}. \]

Sometimes, it is not possible to find an exact zero of ℓ′(θ). One way to get a numerical approximation to a zero of ℓ′(θ) is to use Newton's method, in which we start at a point θ0 and then set
\[ \theta_1 = \theta_0 - \frac{\ell'(\theta_0)}{\ell''(\theta_0)}. \tag{6.14} \]

However, we may show that by using a single step of Newton’s method, starting from a√

n-consistent estimator of θ0, we may obtain an estimator with the same asymptotic distribution asθn. The proof of the following theorem is left as an exercise:

Theorem 6.4 Suppose that θn is any√

n-consistent estimator of θ0 (i.e.,√

n(θn − θ0) isbounded in probability). Then under the conditions of Theorem 6.3, if we set

δn = θn − ℓ′(θn)

ℓ′′(θn), (6.15)


then
\[ \sqrt n(\delta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right). \]
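A hedged sketch of the one-step Newton estimator (6.15) follows; it is not part of the notes. It uses the logistic location model and data of Exercise 6.8 below as a concrete loglikelihood, takes the sample median as the √n-consistent starting value θ̃n, and the helper function names are mine.

```python
import numpy as np

x = np.array([1.0944, 6.4723, 3.1180, 3.8318, 4.1262,
              1.2853, 1.0439, 1.7472, 4.9483, 1.7001,
              1.0422, 0.1690, 3.6111, 0.9970, 2.9438])

def score(theta):
    """l'(theta) for the logistic model F_theta(x) = 1/(1 + exp(theta - x))."""
    e = np.exp(theta - x)
    return np.sum((1.0 - e) / (1.0 + e))

def second_deriv(theta):
    """l''(theta) for the same model."""
    e = np.exp(theta - x)
    return -2.0 * np.sum(e / (1.0 + e) ** 2)

theta_tilde = np.median(x)                                   # sqrt(n)-consistent start
delta_n = theta_tilde - score(theta_tilde) / second_deriv(theta_tilde)   # one Newton step
print(theta_tilde, delta_n)
```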

Exercises for Section 6.2

Exercise 6.5 Do Problems 2.1 on p. 553 and 2.12 on p. 555.

Exercise 6.6 (a) Show that under the conditions of Theorem 6.3, if θ̂n is a consistent root of the likelihood equation, then Pθ0(θ̂n is a local maximum) → 1.

(b) Using the result of part (a), show that for any two sequences θ̂1n and θ̂2n of consistent roots of the likelihood equation, Pθ0(θ̂1n = θ̂2n) → 1.

Exercise 6.7 Prove Theorem 6.4.

Hint: Start with √n(δn − θ0) = √n(δn − θ̃n) + √n(θ̃n − θ0), then expand ℓ′(θ̃n) in a Taylor series about θ0 and rewrite √n(θ̃n − θ0) using this expansion.

Exercise 6.8 Suppose that the following is a random sample from a logistic density with distribution function Fθ(x) = (1 + exp{θ − x})^{−1} (I'll cheat and tell you that I used θ = 2.)

1.0944 6.4723 3.1180 3.8318 4.1262

1.2853 1.0439 1.7472 4.9483 1.7001

1.0422 0.1690 3.6111 0.9970 2.9438

(a) Evaluate the unique root of the likelihood equation numerically. Then, taking the sample median as our known √n-consistent estimator θ̃n of θ, evaluate the estimator δn in equation (6.15) numerically.

(b) Find the asymptotic distributions of √n(θ̃n − 2) and √n(δn − 2). Then, simulate 200 samples of size n = 15 from the logistic distribution with θ = 2. Find the sample variances of the resulting sample medians and δn-estimators. How well does the asymptotic theory match reality?

6.3 Efficient estimation

In Theorem 6.3, we showed that a consistent root θ̂n of the likelihood equation satisfies
\[ \sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right). \]

In Theorem 6.4, we stated that if θ̃n is a √n-consistent estimator of θ0 and δn = θ̃n − ℓ′(θ̃n)/ℓ′′(θ̃n), then
\[ \sqrt n(\delta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right). \tag{6.16} \]


Note the similarity of the last two asymptotic limit statements. There seems to be something special about the limiting variance 1/I(θ0), and in fact this is true.

Much like the Cramér-Rao lower bound, which states that (under some regularity conditions) an unbiased estimator of θ0 based on a sample of size n cannot have a variance smaller than 1/{nI(θ0)}, the following result is true:

Theorem 6.5 Suppose that the conditions of Theorem 6.3 are satisfied and that δn is an estimator satisfying
\[ \sqrt n(\delta_n - \theta_0) \xrightarrow{d} N\{0, v(\theta_0)\} \]
for all θ0, where v(θ) is continuous. Then v(θ) ≥ 1/I(θ) for all θ.

In other words, 1/I(θ) is in a sense the smallest possible asymptotic variance for a √n-consistent estimator. For this reason, we refer to any estimator δn satisfying (6.16) for all θ0 as an efficient estimator.

One condition in Theorem 6.5 that may be a bit puzzling is the condition that v(θ) be continuous. If this condition is dropped, then a well-known counterexample, due to Hodges, exists:

Example 6.3 Suppose that δn is an efficient estimator of θ0. Then if we define
\[ \delta_n^* = \begin{cases} 0 & \text{if } n(\delta_n)^4 < 1; \\ \delta_n & \text{otherwise}, \end{cases} \]
it is possible to show (see Exercise 6.9) that δ*n is superefficient in the sense that
\[ \sqrt n(\delta_n^* - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right) \]
for all θ0 ≠ 0, but √n(δ*n − θ0) →d 0 if θ0 = 0.

Just as the invariance property of maximum likelihood estimation states that the MLE of a function of θ equals the same function applied to the MLE of θ, we may see quite easily that a function of an efficient estimator is itself efficient:

Theorem 6.6 If δn is efficient for θ0, and if g(θ) is a differentiable and invertible function with g′(θ0) ≠ 0, then g(δn) is efficient for g(θ0).

The proof of the above theorem follows immediately from the delta method, since the Fisher information for g(θ) is I(θ)/g′(θ)² by equation (6.6).

We have already noted that (under suitable regularity conditions) if θ̃n is a √n-consistent estimator of θ0 and
\[ \delta_n = \tilde\theta_n - \frac{\ell'(\tilde\theta_n)}{\ell''(\tilde\theta_n)}, \tag{6.17} \]
then δn is an efficient estimator of θ0. Alternatively, we may set
\[ \delta_n^* = \tilde\theta_n + \frac{\ell'(\tilde\theta_n)}{nI(\tilde\theta_n)} \tag{6.18} \]


and δ*n is also an efficient estimator of θ0. Problem 6.7 asked you to prove the former fact regarding δn; the latter fact regarding δ*n is proved in nearly the same way because −(1/n)ℓ′′(θ̃n) →P I(θ0). In Equation (6.17), as already remarked earlier, δn results from a single step of Newton's method; in Equation (6.18), δ*n results from a similar method called Fisher scoring. As is clear from comparing Equations (6.17) and (6.18), scoring differs from Newton's method in that the Fisher information is used in place of the negative second derivative of the loglikelihood function. In some examples, scoring and Newton's method are equivalent.

A note on terminology: The derivative of ℓ(θ) is sometimes called the score function. Furthermore, nI(θ) and −ℓ′′(θ) are sometimes referred to as the expected information and the observed information, respectively.

Example 6.4 Suppose X1, . . . , Xn are independent from a Cauchy location family with density
\[ f_\theta(x) = \frac{1}{\pi\{1 + (x - \theta)^2\}}. \]
Then
\[ \ell'(\theta) = \sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2}, \]
so the likelihood equation is very difficult to solve. However, an efficient estimator may still be created by starting with some √n-consistent estimator θ̃n, say the sample median, and using either Equation (6.17) or Equation (6.18). In the latter case, we obtain the simple estimator
\[ \delta_n^* = \tilde\theta_n + \frac{2}{n}\,\ell'(\tilde\theta_n), \tag{6.19} \]
verification of which is the subject of Problem 6.10.
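Equation (6.19) is simple enough to try directly. The simulation sketch below is an added illustration (numpy assumed, function names mine): starting from the sample median of Cauchy samples centered at θ0 = 0, one scoring step should bring the variance down from the median's asymptotic value π²/(4n) toward the efficient value 1/{nI(θ0)} = 2/n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta0 = 51, 2000, 0.0

def score(theta, x):
    """l'(theta) for the Cauchy location family, summed over the last axis."""
    d = x - theta
    return np.sum(2.0 * d / (1.0 + d ** 2), axis=-1)

x = theta0 + rng.standard_cauchy(size=(reps, n))
med = np.median(x, axis=1)                               # sqrt(n)-consistent start
delta_star = med + (2.0 / n) * score(med[:, None], x)    # Equation (6.19)

print("var(median)  :", med.var(), "vs asymptotic", (np.pi ** 2 / 4) / n)
print("var(one-step):", delta_star.var(), "vs asymptotic", 2.0 / n)
```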

For the remainder of this section, we turn our attention to Bayes estimators, which give yet another source of efficient estimators. A Bayes estimator is the expected value of the posterior distribution, which of course depends on the prior chosen. Although we do not prove this fact here (see Ferguson §21 for details), any Bayes estimator is efficient under some very general conditions. The conditions are essentially those of Theorem 6.3 along with the stipulation that the prior density is positive everywhere on Ω. (Note that if the prior density is not positive on Ω, the Bayes estimator may not even be consistent; see Example 7.4.5 on p. 489 of Lehmann.)

Example 6.5 Consider the binomial distribution with beta prior, say X ∼ binomial(n, p) and p ∼ beta(a, b). Then the posterior density of p is proportional to the product of the likelihood and the prior, which (ignoring multiplicative constants not involving p) equals
\[ p^{a-1}(1 - p)^{b-1} \times p^X(1 - p)^{n-X}. \]
Therefore, the posterior distribution of p is beta(a + X, b + n − X). The Bayes estimator is the expectation of this distribution, which equals (a + X)/(a + b + n). If we let γn denote the Bayes estimator here, then
\[ \sqrt n(\gamma_n - p) = \sqrt n\left(\frac{X}{n} - p\right) + \sqrt n\left(\gamma_n - \frac{X}{n}\right). \]


We may see that the rightmost term converges to zero in probability by writing
\[ \sqrt n\left(\gamma_n - \frac{X}{n}\right) = \frac{\sqrt n}{a + b + n}\left(a - (a + b)\frac{X}{n}\right), \]
since √n/(a + b + n) → 0 and a − (a + b)X/n →P a − (a + b)p by the weak law of large numbers. In other words, the Bayes estimator in this example has the same limiting distribution as the MLE, X/n. It is easy to verify that the MLE is efficient in this case.
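A quick numerical illustration of this example (my sketch, not the author's; numpy assumed): for a beta(a, b) prior the scaled difference √n(γn − X/n) shrinks to zero, so the Bayes estimator and the MLE have matching limiting behavior, with n times the variance of each tending to p(1 − p).

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, p = 2.0, 5.0, 0.3

for n in (20, 200, 2000, 20000):
    x = rng.binomial(n, p, size=50_000)
    mle = x / n
    bayes = (a + x) / (a + b + n)                      # posterior mean
    print(n,
          round(np.sqrt(n) * np.abs(bayes - mle).mean(), 4),   # -> 0
          round(n * mle.var(), 3), round(n * bayes.var(), 3),  # both -> p(1-p)
          p * (1 - p))
```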

Naturally, the central question when constructing a Bayes estimator is how to choose the prior distribution. We consider one class of prior distributions, called Jeffreys priors. Note that these priors are named for Harold Jeffreys, so it is incorrect to write "Jeffrey's priors". For a Bayes estimator θ̂n of θ0, we have

\[ \sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right). \]
Since the limiting variance of θ̂n depends on I(θ0), if I(θ) is not a constant, then some values of θ may be estimated more precisely than others. In analogy with the idea of variance-stabilizing transformations seen earlier in this course, we might consider a reparameterization η = g(θ) such that g′(θ)²/I(θ) is a constant. More precisely, if g′(θ) = c√I(θ), then
\[ \sqrt n\bigl\{g(\hat\theta_n) - \eta_0\bigr\} \xrightarrow{d} N(0, c^2). \]
So as not to influence the estimation of η, we choose as the Jeffreys prior a uniform prior on η. Therefore, the Jeffreys prior density on θ is proportional to g′(θ), which is proportional to √I(θ). Note that this may lead to an improper prior.

Example 6.6 In the case of Example 6.5, we may verify that for a Bernoulli(p) observation,
\[ I(p) = -E\,\frac{d^2}{dp^2}\bigl\{X\log p + (1 - X)\log(1 - p)\bigr\} = E\left(\frac{X}{p^2} + \frac{1 - X}{(1 - p)^2}\right) = \frac{1}{p(1 - p)}. \]
Thus, the Jeffreys prior on p in this case has a density proportional to p^{−1/2}(1 − p)^{−1/2}. In other words, the prior is beta(1/2, 1/2). Therefore, the Bayes estimator corresponding to the Jeffreys prior is
\[ \gamma_n = \frac{X + \frac12}{n + 1}. \]

Exercises for Section 6.3

Exercise 6.9 Verify the claim made in Example 6.3.


Exercise 6.10 If fθ(x) forms a location family, so that fθ(x) = f(x − θ) for some density f(x), then the Fisher information I(θ) is a constant (you may assume this fact without proof).

(a) Verify that for the Cauchy location family,
\[ f_\theta(x) = \frac{1}{\pi\{1 + (x - \theta)^2\}}, \]
we have I(θ) = 1/2.

(b) For 500 samples of size n = 51 from a standard Cauchy distribution, calculate the sample median θ̃n and the efficient estimator δ*n of Equation (6.19). Compare the variances of θ̃n and δ*n with their theoretical asymptotic limits.

Exercise 6.11 (a) Derive the Jeffreys prior on θ for a random sample from Poisson(θ). Is this prior proper or improper?

(b) What is the Bayes estimator of θ for the Jeffreys prior? Verify directly that this estimator is efficient.

Exercise 6.12 (a) Derive the Jeffreys prior on σ² for a random sample from N(0, σ²). Is this prior proper or improper?

(b) What is the Bayes estimator of σ² for the Jeffreys prior? Verify directly that this estimator is efficient.

6.4 The multiparameter case

Suppose now that the parameter is the vector θ = (θ1, . . . , θk). If X ∼ fθ(x), then I(θ), the information matrix, is defined to be the k × k matrix
\[ I(\theta) = E_\theta\bigl\{h_\theta(X)\,h_\theta^t(X)\bigr\}, \]
where hθ(x) = (∂/∂θ) log fθ(x). This is a rare instance in which it's probably clearer to use componentwise notation: The (i, j) element of I(θ) is defined to be
\[ I_{ij}(\theta) = E_\theta\left\{\frac{\partial}{\partial\theta_i}\log f_\theta(X)\,\frac{\partial}{\partial\theta_j}\log f_\theta(X)\right\}. \]

Note that the one-parameter case is a special case of the above.

Example 6.7 Let θ = (ξ, σ²) and suppose X ∼ N(ξ, σ²). Then
\[ \log f_\theta(x) = -\frac12\log\sigma^2 - \frac{(x - \xi)^2}{2\sigma^2} - \log\sqrt{2\pi}, \]
so
\[ \frac{\partial}{\partial\xi}\log f_\theta(x) = \frac{x - \xi}{\sigma^2} \quad\text{and}\quad \frac{\partial}{\partial\sigma^2}\log f_\theta(x) = \frac{1}{2\sigma^2}\left(\frac{(x - \xi)^2}{\sigma^2} - 1\right). \]


Thus, the entries in the information matrix are as follows:
\[ I_{11}(\theta) = E_\theta\left(\frac{(X - \xi)^2}{\sigma^4}\right) = \frac{1}{\sigma^2}, \]
\[ I_{21}(\theta) = I_{12}(\theta) = E_\theta\left(\frac{(X - \xi)^3}{2\sigma^6} - \frac{X - \xi}{2\sigma^4}\right) = 0, \]
\[ I_{22}(\theta) = E_\theta\left(\frac{1}{4\sigma^4} - \frac{(X - \xi)^2}{2\sigma^6} + \frac{(X - \xi)^4}{4\sigma^8}\right) = \frac{1}{4\sigma^4} - \frac{\sigma^2}{2\sigma^6} + \frac{3\sigma^4}{4\sigma^8} = \frac{1}{2\sigma^4}. \]

As in the one-parameter case, the definition of I(θ) is not often the easiest to work with. Witness Example 6.7, in which the calculation of the information matrix required the evaluation of a fourth moment as well as a lot of algebra. In analogy with equations (6.7) and (6.8), if

\[ 0 = \frac{\partial}{\partial\theta_i}\,E_\theta\left\{\frac{f_\theta(X)}{f_\theta(X)}\right\} = E_\theta\left\{\frac{\frac{\partial}{\partial\theta_i}f_\theta(X)}{f_\theta(X)}\right\} \quad\text{and}\quad 0 = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\,E_\theta\left\{\frac{f_\theta(X)}{f_\theta(X)}\right\} = E_\theta\left\{\frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_\theta(X)}{f_\theta(X)}\right\} \tag{6.20} \]

for all i and j, then
\[ I_{ij}(\theta) = \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_i}\log f_\theta(X),\, \frac{\partial}{\partial\theta_j}\log f_\theta(X)\right) \tag{6.21} \]
\[ \phantom{I_{ij}(\theta)} = -E_\theta\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_\theta(X)\right). \tag{6.22} \]

Example 6.8 In the normal case of Example 6.7, the information matrix is perhaps a bit easier to compute using Equation (6.22), since we obtain
\[ \frac{\partial^2}{\partial\xi^2}\log f_\theta(x) = -\frac{1}{\sigma^2}, \qquad \frac{\partial^2}{\partial\xi\,\partial\sigma^2}\log f_\theta(x) = -\frac{x - \xi}{\sigma^4}, \]
and
\[ \frac{\partial^2}{\partial(\sigma^2)^2}\log f_\theta(x) = \frac{1}{2\sigma^4} - \frac{(x - \xi)^2}{\sigma^6}; \]
taking expectations gives
\[ I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
as before, but without requiring any fourth moments.
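The componentwise definition of I(θ) can also be checked by Monte Carlo: average the outer product of the score vector hθ(X) over simulated observations and compare with the closed form of Example 6.8. The sketch below is an added illustration only (numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(4)
xi, sigma2 = 1.0, 4.0
x = rng.normal(xi, np.sqrt(sigma2), size=1_000_000)

# Score vector h_theta(x) = (d/d xi, d/d sigma^2) of log f_theta(x)
s1 = (x - xi) / sigma2
s2 = ((x - xi) ** 2 / sigma2 - 1.0) / (2.0 * sigma2)
scores = np.column_stack([s1, s2])

info_mc = scores.T @ scores / len(x)                     # Monte Carlo E[h h^t]
info_exact = np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2 ** 2)])
print(np.round(info_mc, 4))
print(info_exact)
```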

By Equation (6.21), the Fisher information matrix is nonnegative definite, just as the Fisher information is nonnegative in the one-parameter case. A further fact that generalizes to the multiparameter case is the additivity of information: If X and Y are independent, then IX(θ) + IY(θ) = I(X,Y)(θ). Finally, suppose that η = g(θ) is a reparameterization, where g(θ) is invertible and differentiable. Then if J is the Jacobian matrix of the inverse transformation (i.e., Jij = ∂θi/∂ηj), the information about η is
\[ I(\eta) = J^t I(\theta) J. \]


As we'll see later, almost all of the same efficiency results that applied to the one-parameter case apply to the multiparameter case as well. In particular, we will see that an efficient estimator θ̂ is one that satisfies
\[ \sqrt n(\hat\theta - \theta_0) \xrightarrow{d} N_k\{0, I(\theta_0)^{-1}\}. \]
Note that the formula for the information under a reparameterization implies that if η̂ is an efficient estimator for η0 and the matrix J is invertible, then g^{-1}(η̂) is an efficient estimator for θ0 = g^{-1}(η0), since
\[ \sqrt n\bigl\{g^{-1}(\hat\eta) - g^{-1}(\eta_0)\bigr\} \xrightarrow{d} N_k\bigl\{0,\, J\,(J^t I(\theta_0) J)^{-1} J^t\bigr\}, \]
and the covariance matrix above is easily seen to equal I(θ0)^{-1}.

As in the one-parameter case, the likelihood equation is obtained by setting the derivative of ℓ(θ) equal to zero. In the multiparameter case, though, the gradient ∇ℓ(θ) is a 1 × k vector, so the likelihood equation becomes ∇ℓ(θ) = 0. Since there are really k univariate equations implied by this likelihood equation, it is common to refer to the likelihood equations (plural), which are
\[ \frac{\partial}{\partial\theta_i}\,\ell(\theta) = 0 \quad\text{for } i = 1, \ldots, k. \]

In the multiparameter case, we have essentially the same theorems as in the one-parameter case.

Theorem 6.7 Suppose that X1, . . . , Xn are independent and identically distributed random variables (or vectors) with density fθ0(x) for θ0 in an open subset Ω of R^k, where distinct values of θ0 yield distinct distributions for X1 (i.e., the model is identifiable). Furthermore, suppose that the support set {x : fθ(x) > 0} does not depend on θ. Then with probability approaching 1 as n → ∞, there exists θ̂ such that ∇ℓ(θ̂) = 0 and θ̂ →P θ0.

As in the one-parameter case, we shall refer to the θ̂ guaranteed by Theorem 6.7 as a consistent root of the likelihood equations. Unlike Theorem 6.2, however, Corollary 6.1 does not generalize to the multiparameter case because it is possible that θ̂ is the unique solution of the likelihood equations and a local maximum but not the MLE. The best we can say is the following:

Corollary 6.2 Under the conditions of Theorem 6.7, if there is a unique root of the likelihood equations, then this root is consistent for θ0.

The asymptotic normality of a consistent root of the likelihood equations holds in the multiparameter case just as in the single-parameter case:

Theorem 6.8 Suppose that the conditions of Theorem 6.7 are satisfied and that θ̂ denotes a consistent root of the likelihood equations. Assume also that Equation (6.20) is satisfied for all i and j and that I(θ0) is positive definite with finite entries. Then
\[ \sqrt n(\hat\theta - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}. \]

As in the one-parameter case, there are certain terms associated with the derivatives of the loglikelihood function. The gradient vector ∇ℓ(θ) is called the score vector. The negative second derivative −∇²ℓ(θ) is called the observed information, and nI(θ) is sometimes called the expected information. Any second derivative of a real-valued function of a k-vector, which is a k × k matrix, is called a Hessian matrix.

Newton's method (often called Newton-Raphson in the multivariate case) and scoring work just as they do in the one-parameter case. Starting from the point θ̃, one step of Newton-Raphson gives
\[ \delta = \tilde\theta - \bigl\{\nabla^2\ell(\tilde\theta)\bigr\}^{-1}\nabla\ell(\tilde\theta) \tag{6.23} \]
and one step of scoring gives
\[ \delta^* = \tilde\theta + \frac{1}{n}\,I^{-1}(\tilde\theta)\,\nabla\ell(\tilde\theta). \tag{6.24} \]

Theorem 6.9 Under the assumptions of Theorem 6.8, if θ̃ is a √n-consistent estimator of θ0, then the one-step Newton-Raphson estimator δ defined in Equation (6.23) satisfies
\[ \sqrt n(\delta - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\} \]
and the one-step scoring estimator δ* defined in Equation (6.24) satisfies
\[ \sqrt n(\delta^* - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}. \]

As in the one-parameter case, we define an efficient estimator θ̂ as one that satisfies
\[ \sqrt n(\hat\theta - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}. \]
This definition is justified by the fact that if δ is any estimator satisfying
\[ \sqrt n(\delta - \theta) \xrightarrow{d} N_k\{0, \Sigma(\theta)\}, \]
where I^{-1}(θ) and Σ(θ) are continuous, then Σ(θ) − I^{-1}(θ) is nonnegative definite for all θ.

Finally, we note that Bayes estimators are efficient, just as in the one-parameter case. This means that the same three types of estimators that are efficient in the one-parameter case (the consistent root of the likelihood equations, the one-step scoring and Newton-Raphson estimators, and Bayes estimators) are also efficient in the multiparameter case.

Exercises for Section 6.4

Exercise 6.13 Let X ∼ multinomial(1, p), where p is a k-vector for k > 2. Let p* = (p1, . . . , pk−1). Find I(p*).

Exercise 6.14 Suppose that θ ∈ R × R+ (that is, θ1 ∈ R and θ2 ∈ (0, ∞)) and
\[ f_\theta(x) = \frac{1}{\theta_2}\,f\!\left(\frac{x - \theta_1}{\theta_2}\right) \]
for some continuous, differentiable density f(x) that is symmetric about the origin. Find I(θ).


Exercise 6.15 Prove theorem 6.8.

Hint: Use theorem 1.4.

Exercise 6.16 Prove theorem 6.9.

Hint: Use theorem 1.4.

Exercise 6.17 The multivariate generalization of a beta distribution is a Dirichlet distribution, which is the natural prior distribution for a multinomial likelihood. If p is a random (k + 1)-vector constrained so that pi > 0 for all i and Σ_{i=1}^{k+1} pi = 1, then (p1, . . . , pk) has a Dirichlet distribution with parameters a1 > 0, . . . , ak+1 > 0 if its density is proportional to
\[ p_1^{a_1 - 1}\cdots p_k^{a_k - 1}(1 - p_1 - \cdots - p_k)^{a_{k+1} - 1}\, I\Bigl\{\min_i p_i > 0\Bigr\}\, I\Bigl\{\sum_{i=1}^k p_i < 1\Bigr\}. \]
Prove that if G1, . . . , Gk+1 are independent random variables with Gi ∼ gamma(ai, 1), then
\[ \frac{1}{G_1 + \cdots + G_{k+1}}\,(G_1, \ldots, G_k) \]
has a Dirichlet distribution with parameters a1, . . . , ak+1.

6.5 Nuisance parameters

This section is the one intrinsically multivariate topic in this chapter; it does not arise in the one-parameter setting. Here we consider how the efficiency of an estimator is affected by the presence of nuisance parameters.

Suppose θ is the parameter vector but θ1 is the only parameter of interest, so that θ2, . . . , θk are nuisance parameters. We are interested in how the asymptotic precision with which we may estimate θ1 is influenced by the presence of the nuisance parameters. In other words, if θ̂ is efficient for θ, then how does θ̂1 as an estimator of θ1 compare to an efficient estimator of θ1, say θ*1, under the assumption that all of the nuisance parameters are known?

Assume I(θ) is positive definite. Let σij denote the (i, j) entry of I(θ) and let γij denote the (i, j) entry of I^{-1}(θ). If all of the nuisance parameters are known, then I(θ1) = σ11, which means that the asymptotic variance of √n(θ*1 − θ1) is 1/σ11. On the other hand, if the nuisance parameters are not known, then the asymptotic variance of √n(θ̂ − θ) is I^{-1}(θ), which means that the asymptotic variance of √n(θ̂1 − θ1) is γ11. Of interest here is the comparison between γ11 and 1/σ11.

Theorem 6.10 γ11 ≥ 1/σ11, with equality if and only if γ12 = · · · = γ1k = 0.

Proof: Partition I(θ) as follows:
\[ I(\theta) = \begin{pmatrix} \sigma_{11} & \rho^t \\ \rho & \Sigma \end{pmatrix}, \]


where ρ and Σ are (k − 1) × 1 and (k − 1) × (k − 1), respectively. Let τ = σ11 − ρᵗΣ^{-1}ρ. We may verify that if τ > 0, then
\[ I^{-1}(\theta) = \frac{1}{\tau}\begin{pmatrix} 1 & -\rho^t\Sigma^{-1} \\ -\Sigma^{-1}\rho & \Sigma^{-1}\rho\rho^t\Sigma^{-1} + \tau\Sigma^{-1} \end{pmatrix}. \]
This proves the result, because the positive definiteness of I(θ) implies that Σ^{-1} is positive definite, which means that
\[ \gamma_{11} = \frac{1}{\sigma_{11} - \rho^t\Sigma^{-1}\rho} \ge \frac{1}{\sigma_{11}}, \]
with equality if and only if ρ = 0. Thus, it remains only to show that τ > 0. But this is immediate from the positive definiteness of I(θ), since if we set
\[ v = \begin{pmatrix} 1 \\ -\Sigma^{-1}\rho \end{pmatrix}, \]
then τ = vᵗI(θ)v.
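The block-inverse formula in this proof is easy to sanity-check numerically. In the sketch below (added for illustration; numpy assumed), a random positive definite matrix stands in for I(θ), and the (1, 1) entry of its inverse is compared with 1/(σ11 − ρᵗΣ^{-1}ρ).

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4
A = rng.normal(size=(k, k))
info = A @ A.T + k * np.eye(k)          # positive definite stand-in for I(theta)

sigma11 = info[0, 0]
rho = info[1:, 0]
Sigma = info[1:, 1:]

gamma11 = np.linalg.inv(info)[0, 0]
tau = sigma11 - rho @ np.linalg.solve(Sigma, rho)

print(gamma11, 1.0 / tau)               # these agree
print(gamma11 >= 1.0 / sigma11)         # Theorem 6.10: True
```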

From the above result, it is clearly important to take nuisance parameters into account in estimation. However, it is not necessary to estimate the entire parameter vector all at once, since (θ̂1, . . . , θ̂k) is efficient for θ if and only if each of the θ̂i is efficient for θi in the presence of the other nuisance parameters (see Problem 6.18).

Exercises for Section 6.5

Exercise 6.18 Letting γij denote the (i, j) entry of I^{-1}(θ), we say that θ̂i is efficient for θi in the presence of the nuisance parameters θ1, . . . , θi−1, θi+1, . . . , θk if the asymptotic variance of √n(θ̂i − θi) is γii.

Prove that θ̂i is efficient for θi in the presence of nuisance parameters θ1, . . . , θi−1, θi+1, . . . , θk for all i if and only if (θ̂1, . . . , θ̂k) is efficient for θ.

6.6 Wald, Rao, and Likelihood Ratio Tests

Suppose we wish to test H0: θ = θ0 against H1: θ ≠ θ0. We might look to the loglikelihood function to help. Let ℓ(θ) denote the loglikelihood and θ̂n the consistent root of the likelihood equation. Intuitively, the farther θ0 is from θ̂n, the stronger the evidence against the null hypothesis. But how far is "far enough"? Note that if θ0 is close to θ̂n, then ℓ(θ0) should also be close to ℓ(θ̂n) and ℓ′(θ0) should be close to ℓ′(θ̂n) = 0.

• If we base a test upon the value of (θ̂n − θ0), we obtain a Wald test.

• If we base a test upon the value of ℓ(θ̂n) − ℓ(θ0), we obtain a likelihood ratio test.

• If we base a test upon the value of ℓ′(θ0), we obtain a (Rao) score test.


Recall that in order to prove Theorem 6.3, we argued that under certain regularity conditions, the following facts are true under H0:
\[ \sqrt n(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right); \tag{6.25} \]
\[ \frac{1}{\sqrt n}\,\ell'(\theta_0) \xrightarrow{d} N\{0, I(\theta_0)\}; \tag{6.26} \]
\[ -\frac{1}{n}\,\ell''(\theta_0) \xrightarrow{P} I(\theta_0). \tag{6.27} \]

Equation (6.26) is proved using the central limit theorem; equation (6.27) is proved using the weak law of large numbers; and equation (6.25) is the result of a Taylor expansion together with Equations (6.26) and (6.27). The three equations above will help us to justify the definitions of the Wald, score, and likelihood ratio tests to follow.

Equation (6.25) suggests that if we define
\[ W_n = \sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0), \tag{6.28} \]
then Wn, called a Wald statistic, should converge in distribution to standard normal under H0. Note that this fact remains true if we define
\[ W_n = \sqrt{n\hat I}\,(\hat\theta_n - \theta_0), \tag{6.29} \]
where Î is consistent for I(θ0); for example, Î could be I(θ̂n).

Definition 6.2 A Wald test is any test that rejects H0: θ = θ0 in favor of H1: θ ≠ θ0 when |Wn| ≥ uα/2 for Wn defined as in Equation (6.28) or Equation (6.29). As usual, uα/2 denotes the 1 − α/2 quantile of the standard normal distribution.

Equation (6.26) suggests that if we define
\[ R_n = \frac{\ell'(\theta_0)}{\sqrt{nI(\theta_0)}}, \tag{6.30} \]
then Rn, called a Rao score statistic or simply a score statistic, converges in distribution to standard normal under H0. We could also replace I(θ0) by a consistent estimator Î as in Equation (6.29), but usually this is not done: One of the main benefits of the score statistic is that it is not necessary to compute θ̂n, and using I(θ̂n) instead of I(θ0) would defeat this purpose.

Definition 6.3 A score test, sometimes called a Rao score test, is any test that rejects H0: θ = θ0 in favor of H1: θ ≠ θ0 when |Rn| ≥ uα/2 for Rn defined as in Equation (6.30).

The third type of test, the likelihood ratio test, requires a bit of development. It is a test based on the statistic
\[ \Delta_n = \ell(\hat\theta_n) - \ell(\theta_0). \tag{6.31} \]
If we Taylor expand ℓ′(θ̂n) = 0 around the point θ0, we obtain
\[ \ell'(\theta_0) = (\hat\theta_n - \theta_0)\bigl\{-\ell''(\theta_0) + o_P(1)\bigr\}. \tag{6.32} \]


We now use a second Taylor expansion, this time of ℓ(θ̂n), to obtain
\[ \Delta_n = (\hat\theta_n - \theta_0)\,\ell'(\theta_0) + \frac12(\hat\theta_n - \theta_0)^2\bigl[\ell''(\theta_0) + o_P(1)\bigr]. \tag{6.33} \]
If we now substitute Equation (6.32) into Equation (6.33), we obtain
\[ \Delta_n = n(\hat\theta_n - \theta_0)^2\left\{-\frac{1}{n}\ell''(\theta_0) + \frac{1}{2n}\ell''(\theta_0) + o_P\!\left(\frac{1}{n}\right)\right\}. \tag{6.34} \]

By Equations (6.25) and (6.27) and Slutsky's theorem, this implies that 2∆n →d χ²₁ under the null hypothesis. Noting that the 1 − α quantile of a χ²₁ distribution is u²_{α/2}, we make the following definition.

Definition 6.4 A likelihood ratio test is any test that rejects H0: θ = θ0 in favor of H1: θ ≠ θ0 when 2∆n ≥ u²_{α/2} or, equivalently, when √(2∆n) ≥ uα/2.

Since it may be shown that √(2∆n) − |Wn| →P 0 and Wn − Rn →P 0, the three tests defined above (Wald tests, score tests, and likelihood ratio tests) are asymptotically equivalent in the sense that under H0, they reach the same decision with probability approaching 1 as n → ∞. However, they may be quite different for a fixed sample size n, and furthermore they have some relative advantages and disadvantages with respect to one another. For example,

• It is easy to create one-sided Wald and score tests (i.e., tests of H0: θ = θ0 against H1: θ > θ0 or H1: θ < θ0), but this is more difficult with a likelihood ratio test.

• The score test does not require θ̂n whereas the other two tests do.

• The Wald test is most easily interpretable and yields immediate confidence intervals.

• The score test and likelihood ratio test are invariant under reparameterization, whereas the Wald test is not.

Example 6.9 Suppose that X1, . . . , Xn are independent with density fθ(x) = θe^{−xθ}I{x > 0}. Then ℓ(θ) = n(log θ − θX̄n), which yields
\[ \ell'(\theta) = n\left(\frac{1}{\theta} - \bar X_n\right) \quad\text{and}\quad \ell''(\theta) = -\frac{n}{\theta^2}. \]
From the form of ℓ′′(θ), we see that I(θ) = θ^{−2}, and setting ℓ′(θ) = 0 yields θ̂n = 1/X̄n. From these facts, we obtain as the Wald, score, and likelihood ratio statistics
\[ W_n = \frac{\sqrt n}{\theta_0}\left(\frac{1}{\bar X_n} - \theta_0\right), \qquad R_n = \theta_0\sqrt n\left(\frac{1}{\theta_0} - \bar X_n\right) = \theta_0\bar X_n\,W_n, \]
and
\[ \Delta_n = n\bigl\{\theta_0\bar X_n - 1 - \log(\theta_0\bar X_n)\bigr\}. \]
Thus, we reject H0: θ = θ0 in favor of H1: θ ≠ θ0 whenever |Wn| ≥ uα/2, |Rn| ≥ uα/2, or √(2∆n) ≥ uα/2, depending on which test we're using.
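The three statistics in Example 6.9 are easy to compute in practice. The following sketch is an added illustration (numpy and scipy assumed); it simulates exponential data under the null and reports the three test decisions, which agree with high probability when n is large.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta0, n, alpha = 2.0, 200, 0.05
u = norm.ppf(1 - alpha / 2)

x = rng.exponential(scale=1.0 / theta0, size=n)   # density theta*exp(-x*theta)
xbar = x.mean()

W = np.sqrt(n) / theta0 * (1.0 / xbar - theta0)
R = theta0 * np.sqrt(n) * (1.0 / theta0 - xbar)
Delta = n * (theta0 * xbar - 1.0 - np.log(theta0 * xbar))

print("Wald  rejects:", abs(W) >= u)
print("score rejects:", abs(R) >= u)
print("LRT   rejects:", np.sqrt(2 * Delta) >= u)
```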


Generalizing the three tests to the multiparameter setting is straightforward. Suppose we wish to test H0: θ = θ0 against H1: θ ≠ θ0, where θ ∈ R^k. Then
\[ W_n \stackrel{\text{def}}{=} n(\hat\theta - \theta_0)^t I(\theta_0)(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_k, \tag{6.35} \]
\[ R_n \stackrel{\text{def}}{=} \frac{1}{n}\,\nabla\ell(\theta_0)^t\, I^{-1}(\theta_0)\,\nabla\ell(\theta_0) \xrightarrow{d} \chi^2_k, \quad\text{and} \tag{6.36} \]
\[ \Delta_n \stackrel{\text{def}}{=} \ell(\hat\theta) - \ell(\theta_0) \xrightarrow{d} \tfrac12\chi^2_k. \tag{6.37} \]
Therefore, if c^k_α denotes the 1 − α quantile of the χ²_k distribution, then the multivariate Wald test, score test, and likelihood ratio test reject H0 when Wn ≥ c^k_α, Rn ≥ c^k_α, and 2∆n ≥ c^k_α, respectively. As in the one-parameter case, the Wald test may also be defined with I(θ0) replaced by a consistent estimator Î.

Exercises for Section 6.6

Exercise 6.19 Let X1, . . . , Xn be a simple random sample from a Pareto distribution with density
\[ f(x) = \theta c^\theta x^{-(\theta+1)}\,I\{x > c\} \]
for a known constant c > 0 and parameter θ > 0. Derive the Wald, Rao, and likelihood ratio tests of θ = θ0 against a two-sided alternative.

Exercise 6.20 Suppose that X is multinomial(n, p), where p ∈ R^k. In order to satisfy the regularity condition that the parameter space be an open set, define θ = (p1, . . . , pk−1). Suppose that we wish to test H0: θ = θ0 against H1: θ ≠ θ0.

(a) Prove that the Wald and score tests are the same as the usual Pearson chi-square test.

(b) Derive the likelihood ratio statistic 2∆n.


Chapter 7

Hypothesis Testing

7.1 Contiguity and Local Alternatives

Suppose that we wish to test the hypotheses
\[ H_0: \theta = \theta_0 \quad\text{and}\quad H_1: \theta > \theta_0. \tag{7.1} \]
The test is to be based on a statistic Tn, where as always n denotes the sample size, and we shall decide to
\[ \text{reject } H_0 \text{ in (7.1) if } T_n \ge C_n \tag{7.2} \]
for some constant Cn. We may define some basic asymptotic concepts regarding tests of this type.

Definition 7.1 If Pθ0(Tn ≥ Cn) → α for test (7.2), then test (7.2) is said to have asymptotic level α.

Definition 7.2 If two different tests of the same hypotheses reach the same conclusion with probability approaching 1 under the null hypothesis as n → ∞, the tests are said to be asymptotically equivalent.

As usual, the power of test (7.2) under the alternative θ is defined to be
\[ \beta_n(\theta) = P_\theta(T_n \ge C_n). \]

Naturally, we expect that the power should approach 1.

Definition 7.3 A test (or, more precisely, a sequence of tests) is consistent against the alternative θ if βn(θ) → 1.

Unfortunately, the concepts we have defined so far are of limited usefulness. If we wish to compare two different tests of the same hypotheses, then if the tests are both sensible they should be asymptotically equivalent and consistent. Thus, consistency is nice but it doesn't tell us much; asymptotic equivalence is nice but it doesn't allow us to compare tests.

We make things more interesting by considering, instead of a fixed alternative θ, a sequence of alternatives θ1, θ2, . . .. Suppose for test (7.2) that for each n, the distribution of Tn is determined by θn (i.e., θn is the "truth") and that
\[ \frac{\sqrt n\,\{T_n - \mu(\theta_n)\}}{\tau(\theta_n)} \xrightarrow{d} N(0, 1). \tag{7.3} \]
Furthermore, suppose that under H0: θ = θ0,
\[ \frac{\sqrt n\,\{T_n - \mu(\theta_0)\}}{\tau(\theta_0)} \xrightarrow{d} N(0, 1). \tag{7.4} \]
Note that µ(θ) is the mean of Tn assuming that θ is the truth. We may calculate the power of the test against the sequence of alternatives θ1, θ2, . . . in a straightforward way. In order for this situation to be interesting mathematically, the sequence must converge to θ0. Such a sequence is often referred to as a contiguous sequence of alternatives, since "contiguous" means "nearby"; the idea of contiguity is that we choose not a single alternative hypothesis but a sequence of alternatives near (and converging to) the null hypothesis.

First, we determine a value for Cn so that test (7.2) has asymptotic level α. Define uα to be the constant such that 1 − Φ(uα) = α, where Φ(t) denotes the distribution function of a standard normal random variable. By limit (7.4),
\[ P_{\theta_0}\left\{T_n - \mu(\theta_0) \ge \frac{\tau(\theta_0)u_\alpha}{\sqrt n}\right\} \to \alpha; \]
therefore, we define a new test, namely
\[ \text{reject } H_0 \text{ in (7.1) if } T_n \ge \mu(\theta_0) + \frac{\tau(\theta_0)u_\alpha}{\sqrt n}, \tag{7.5} \]
and conclude that test (7.5) has asymptotic level α as desired.

We now calculate the power of test (7.5) against the alternative θn:
\[ \beta_n(\theta_n) = P_{\theta_n}\left\{T_n \ge \mu(\theta_0) + \frac{\tau(\theta_0)u_\alpha}{\sqrt n}\right\} = P_{\theta_n}\left\{\frac{\sqrt n\,\{T_n - \mu(\theta_n)\}}{\tau(\theta_n)}\cdot\frac{\tau(\theta_n)}{\tau(\theta_0)} + \frac{\sqrt n(\theta_n - \theta_0)}{\tau(\theta_0)}\cdot\frac{\mu(\theta_n) - \mu(\theta_0)}{\theta_n - \theta_0} \ge u_\alpha\right\}. \tag{7.6} \]

Thus, βn(θn) tends to an interesting limit (i.e., a limit between α and 1) if τ(θn) → τ(θ0); √n(θn − θ0) tends to a nonzero, finite limit; and µ(θ) is differentiable at θ0. This fact is summarized in the following theorem:

Theorem 7.1 Let θn > θ0 for all n. Suppose that limits (7.3) and (7.4) hold, τ(θ) is continuous at θ0, µ(θ) is differentiable at θ0, and √n(θn − θ0) → ∆ for some finite ∆ > 0. If µ′(θ0) or τ(θ0) depend on n, then suppose that µ′(θ0)/τ(θ0) tends to a nonzero, finite limit. Then if βn(θn) denotes the power of test (7.5) against the alternative θn,
\[ \beta_n(\theta_n) \to \lim_{n\to\infty}\Phi\left(\frac{\Delta\,\mu'(\theta_0)}{\tau(\theta_0)} - u_\alpha\right). \]

The proof of Theorem 7.1 merely uses Equation (7.6) and Slutsky's theorem, since the hypotheses of the theorem imply that τ(θn)/τ(θ0) → 1 and {µ(θn) − µ(θ0)}/(θn − θ0) → µ′(θ0).


Example 7.1 Let X ∼ Binomial(n, pn), where pn = p0 + ∆/√n, and let Tn = X/n. To test H0: p = p0 against H1: p > p0, note that
\[ \frac{\sqrt n(T_n - p_0)}{\sqrt{p_0(1 - p_0)}} \xrightarrow{d} N(0, 1) \]
under H0. Thus, test (7.5) says to reject H0 whenever Tn ≥ p0 + uα√{p0(1 − p0)/n}. This test has asymptotic level α. Since clearly τ(p) = √{p(1 − p)} is continuous and µ(p) = p is differentiable, Theorem 7.1 applies in this case as long as we can verify the limit (7.3).

Let Xn1, . . . , Xnn be independent Bernoulli(pn) random variables. Then if Xn1 − pn, . . . , Xnn − pn can be shown to satisfy the Lyapunov condition, we have
\[ \frac{\sqrt n(T_n - p_n)}{\tau(p_n)} \xrightarrow{d} N(0, 1) \]
and so Theorem 7.1 applies. The Lyapunov condition is easy to verify, since |Xni − pn| ≤ 1 implies
\[ \frac{1}{\{\mathrm{Var}(nT_n)\}^2}\sum_{i=1}^n E|X_{ni} - p_n|^4 \le \frac{n}{\{np_n(1 - p_n)\}^2} \to 0. \]

Thus, we conclude by Theorem 7.1 that
\[ \beta_n(p_n) \to \Phi\left(\frac{\Delta}{\sqrt{p_0(1 - p_0)}} - u_\alpha\right). \]
To apply this result, suppose that we wish to test whether a coin is fair by flipping it 100 times. We reject H0: p = 1/2 in favor of H1: p > 1/2 if the number of successes divided by 100 is at least as large as 1/2 + u.05/20, or 0.582. The power of this test against the alternative p = 0.6 is approximately
\[ \Phi\left(\frac{\sqrt{100}\,(0.6 - 0.5)}{\sqrt{0.5^2}} - 1.645\right) = \Phi(2 - 1.645) = 0.639. \]
Compare this asymptotic approximation with the exact power, which in this case is easy to compute: The probability of at least 59 successes out of 100 for a binomial(100, 0.6) random variable is 0.623.
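The comparison at the end of Example 7.1 can be reproduced directly. The short sketch below is an added illustration (scipy assumed) that computes the cutoff of test (7.5), the approximate power from Theorem 7.1, and the exact binomial power.

```python
import math
from scipy.stats import norm, binom

n, p0, p1, alpha = 100, 0.5, 0.6, 0.05
u = norm.ppf(1 - alpha)                                  # u_alpha = 1.645

cutoff = p0 + u * (p0 * (1 - p0)) ** 0.5 / n ** 0.5      # test (7.5): 0.582
approx = norm.cdf(n ** 0.5 * (p1 - p0) / (p0 * (1 - p0)) ** 0.5 - u)
k = math.ceil(n * cutoff)                                # reject if X >= 59
exact = 1 - binom.cdf(k - 1, n, p1)

print(round(cutoff, 3), round(approx, 3), round(exact, 3))   # ~0.582, ~0.639, ~0.623
```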

As seen in Example 7.1, the asymptotic result of Theorem 7.1 may be applied to a fixed sample size n and a fixed alternative θ as follows:
\[ \beta_n(\theta) \approx \Phi\left(\frac{\sqrt n\,\{\mu(\theta) - \mu(\theta_0)\}}{\tau(\theta_0)} - u_\alpha\right) \tag{7.7} \]

There is an alternative formulation that yields a slightly different approximation. Starting from
\[ \beta_n(\theta) = P_\theta\left\{\frac{\sqrt n\,\{T_n - \mu(\theta)\}}{\tau(\theta)} \ge u_\alpha\,\frac{\tau(\theta_0)}{\tau(\theta)} - \frac{\sqrt n\,\{\mu(\theta) - \mu(\theta_0)\}}{\tau(\theta)}\right\}, \]


we obtain
\[ \beta_n(\theta) \approx \Phi\left(\frac{\sqrt n\,\{\mu(\theta) - \mu(\theta_0)\}}{\tau(\theta)} - u_\alpha\,\frac{\tau(\theta_0)}{\tau(\theta)}\right). \tag{7.8} \]

Applying approximation (7.8) to the binomial case of Example 7.1, we obtain 0.641 for the approximate power, as compared with the value 0.639 from (7.7) and the exact power 0.623.

We may invert approximations (7.7) and (7.8) to obtain approximate sample sizes required to achieve desired power β against alternative θ. From (7.7) we obtain
\[ \sqrt n \approx \frac{(u_\alpha - u_\beta)\,\tau(\theta_0)}{\mu(\theta) - \mu(\theta_0)} \tag{7.9} \]
and from (7.8) we obtain
\[ \sqrt n \approx \frac{u_\alpha\tau(\theta_0) - u_\beta\tau(\theta)}{\mu(\theta) - \mu(\theta_0)}. \tag{7.10} \]
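As a convenience, the two sample-size approximations can be wrapped in a small helper. The function below is an added sketch (scipy assumed; the function name and interface are mine), applied to the binomial setting of Example 7.1 with target power 0.8.

```python
from scipy.stats import norm

def sample_size(mu0, mu1, tau0, tau1, alpha=0.05, beta=0.8):
    """Approximate n from (7.9) and (7.10); beta is the desired power."""
    ua = norm.ppf(1 - alpha)
    ub = norm.ppf(1 - beta)                 # note u_beta < 0 when beta > 1/2
    root79 = (ua - ub) * tau0 / (mu1 - mu0)
    root710 = (ua * tau0 - ub * tau1) / (mu1 - mu0)
    return root79 ** 2, root710 ** 2

# Binomial example: mu(p) = p, tau(p) = sqrt(p(1-p)), p0 = 0.5, alternative p = 0.6
print(sample_size(0.5, 0.6, 0.25 ** 0.5, 0.24 ** 0.5))   # roughly (155, 153)
```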

We may compare tests by considering the relative sample sizes necessary to achieve the same power at the same level against the same alternative.

Definition 7.4 Given tests 1 and 2 of the same hypotheses with asymptotic level α and a sequence of alternatives θk, suppose that
\[ \beta^{(1)}_{m_k}(\theta_k) \to \beta \quad\text{and}\quad \beta^{(2)}_{n_k}(\theta_k) \to \beta \]
for sequences mk and nk of sample sizes. Then the asymptotic relative efficiency (ARE) of test 1 with respect to test 2 is
\[ e_{1,2} = \lim_{k\to\infty}\frac{n_k}{m_k}, \]
assuming this limit exists.

In Examples 7.2 and 7.3, we consider two different tests for the same hypotheses. Then, in Example 7.4, we compute their asymptotic relative efficiency.

Example 7.2 Suppose we have paired data (X1, Y1), . . . , (Xn, Yn). Let Zi = Yi − Xi for all i. Assume that the Zi are independent and identically distributed with distribution function P(Zi ≤ z) = F(z − θ) for some θ, where f(z) = F′(z) exists and is symmetric about 0. Let W1, . . . , Wn be a permutation of Z1, . . . , Zn such that |W1| ≤ |W2| ≤ · · · ≤ |Wn|.

We wish to test H0: θ = 0 against H1: θ > 0. First, consider the Wilcoxon signed rank test. Define
\[ R_n = \sum_{i=1}^n i\,I\{W_i > 0\}. \]

Then under H0, the I{Wi > 0} are independent Bernoulli(1/2) random variables. Thus,
\[ E\,R_n = \sum_{i=1}^n \frac{i}{2} = \frac{n(n+1)}{4} \quad\text{and}\quad \mathrm{Var}\,R_n = \sum_{i=1}^n \frac{i^2}{4} = \frac{n(n+1)(2n+1)}{24}. \]


Furthermore, one may prove (see Exercise 7.1) that
\[
\frac{R_n - E\,R_n}{\sqrt{\mathrm{Var}\,R_n}} \stackrel{d}{\to} N(0,1).
\]
Thus, a test with asymptotic level $\alpha$ rejects $H_0$ when
\[
R_n \ge \frac{n(n+1)}{4} + \frac{u_\alpha\tau(0)}{\sqrt{n}},
\]
where $\tau(0) = n\sqrt{(n+1)(2n+1)/24}$. Furthermore, we may apply Theorem 7.1. Now we must find $E\,R_n$ under the alternative $\theta_n = \Delta/\sqrt{n}$. As earlier, we let $\mu(\theta)$ denote $E\,R_n$ under the assumption that $\theta$ is the true value of the parameter. Note that in this case, $\mu(\theta)$ depends on $n$ even though this dependence is not explicit in the notation.

First, we note that since $|W_i| \le |W_j|$ for $i \le j$, $W_i + W_j > 0$ if and only if $W_j > 0$. Therefore, $\sum_{i=1}^j I\{W_i + W_j > 0\} = jI\{W_j > 0\}$ and so we may rewrite $R_n$ in the form
\[
R_n = \sum_{j=1}^n \sum_{i=1}^j I\{W_i + W_j > 0\} = \sum_{j=1}^n \sum_{i=1}^j I\{Z_i + Z_j > 0\}.
\]
Therefore, we obtain
\[
\mu(\theta_n) = \sum_{j=1}^n \sum_{i=1}^j P_{\theta_n}(Z_i + Z_j > 0) = nP_{\theta_n}(Z_1 > 0) + \binom{n}{2}P_{\theta_n}(Z_1 + Z_2 > 0).
\]
Since $P_{\theta_n}(Z_1 > 0) = P_{\theta_n}(Z_1 - \theta_n > -\theta_n) = 1 - F(-\theta_n)$ and
\begin{align*}
P_{\theta_n}(Z_1 + Z_2 > 0) &= P_{\theta_n}\{(Z_1 - \theta_n) + (Z_2 - \theta_n) > -2\theta_n\} \\
&= E\,P_{\theta_n}\{Z_1 - \theta_n > -2\theta_n - (Z_2 - \theta_n) \mid Z_2\} \\
&= \int_{-\infty}^{\infty} \{1 - F(-2\theta_n - z)\}f(z)\,dz,
\end{align*}
we conclude that
\[
\mu'(\theta) = nf(\theta) + \binom{n}{2}\int_{-\infty}^{\infty} 2f(-2\theta - z)f(z)\,dz.
\]
Thus, because $f(-z) = f(z)$ by assumption,
\[
\mu'(0) = nf(0) + n(n-1)\int_{-\infty}^{\infty} f^2(z)\,dz.
\]
Letting
\[
K = \int_{-\infty}^{\infty} f^2(z)\,dz,
\]
we obtain
\[
\lim_{n\to\infty} \frac{\mu'(0)}{\tau(0)} = \lim_{n\to\infty} \frac{\sqrt{24}\{f(0) + (n-1)K\}}{\sqrt{(n+1)(2n+1)}} = K\sqrt{12}.
\]
Therefore, Theorem 7.1 shows that
\[
\beta_n(\theta_n) \to \Phi(\Delta K\sqrt{12} - u_\alpha). \tag{7.11}
\]


Example 7.3 As in Example 7.2, suppose we have paired data $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $Z_i = Y_i - X_i$ for all $i$. The $Z_i$ are independent and identically distributed with distribution function $P(Z_i \le z) = F(z - \theta)$ for some $\theta$, where $f(z) = F'(z)$ exists and is symmetric about 0. Suppose that the variance of $Z_i$ is $\sigma^2$.

Since the t-test (unknown variance) and z-test (known variance) have the same asymptotic properties, let's consider the z-test for simplicity. Then $\tau(\theta) = \sigma$ for all $\theta$. The relevant statistic is merely $\overline{Z}_n$, and the central limit theorem implies that $\sqrt{n}\,\overline{Z}_n/\sigma \stackrel{d}{\to} N(0,1)$ under the null hypothesis. Therefore, the z-test in this case rejects $H_0\colon \theta = 0$ in favor of $H_1\colon \theta > 0$ whenever $\overline{Z}_n > u_\alpha\sigma/\sqrt{n}$. The central limit theorem shows that
\[
\frac{\sqrt{n}(\overline{Z}_n - \theta_n)}{\sigma} \stackrel{d}{\to} N(0,1)
\]
under the alternatives $\theta_n = \Delta/\sqrt{n}$. Therefore, by Theorem 7.1, we obtain
\[
\beta_n(\theta_n) \to \Phi(\Delta/\sigma - u_\alpha) \tag{7.12}
\]
since $\mu'(\theta) = 1$ for all $\theta$.

Before finding the asymptotic relative efficiency (ARE) of the Wilcoxon signed rank test and the t-test, we prove a lemma that makes this calculation very easy.

Suppose that for two tests, called test 1 and test 2, we use sample sizes $m$ and $n$, respectively. We want $m$ and $n$ to tend to infinity together, an idea we make explicit by setting $m = m_k$ and $n = n_k$ for $k = 1, 2, \ldots$. Suppose that we wish to apply both tests to the same sequence of alternative hypotheses $\theta_1, \theta_2, \ldots$. As usual, we make the assumption that $(\theta_k - \theta_0)$ times the square root of the sample size tends to a finite, nonzero limit as $k \to \infty$. Thus, we assume
\[
\sqrt{m_k}(\theta_k - \theta_0) \to \Delta_1 \quad\text{and}\quad \sqrt{n_k}(\theta_k - \theta_0) \to \Delta_2.
\]
Then if Theorem 7.1 may be applied to both tests, define $c_1 = \lim \mu_1'(\theta_0)/\tau_1(\theta_0)$ and $c_2 = \lim \mu_2'(\theta_0)/\tau_2(\theta_0)$. The theorem says that
\[
\beta_{m_k}(\theta_k) \to \Phi(\Delta_1 c_1 - u_\alpha) \quad\text{and}\quad \beta_{n_k}(\theta_k) \to \Phi(\Delta_2 c_2 - u_\alpha). \tag{7.13}
\]
To find the ARE, then, Definition 7.4 specifies that we assume that the two limits in (7.13) are the same, which implies $\Delta_1 c_1 = \Delta_2 c_2$, or
\[
\frac{n_k}{m_k} \to \frac{c_1^2}{c_2^2}.
\]
Thus, the ARE of test 1 with respect to test 2 equals $(c_1/c_2)^2$. This result is summed up in the following lemma, which defines a new term, efficacy.

Lemma 7.1 For a test to which Theorem 7.1 applies, define the efficacy of the test to be
\[
c = \lim_{n\to\infty} \mu'(\theta_0)/\tau(\theta_0). \tag{7.14}
\]
Suppose that Theorem 7.1 applies to each of two tests, called test 1 and test 2. Then the ARE of test 1 with respect to test 2 equals $(c_1/c_2)^2$.


Example 7.4 Using the results of Examples 7.2 and 7.3, we conclude that the efficacies of the Wilcoxon signed rank test and the t-test are
\[
\sqrt{12}\int_{-\infty}^{\infty} f^2(z)\,dz \quad\text{and}\quad \frac{1}{\sigma},
\]
respectively. Thus, Lemma 7.1 implies that the ARE of the signed rank test to the t-test equals
\[
12\sigma^2\left(\int_{-\infty}^{\infty} f^2(z)\,dz\right)^2.
\]
In the case of normally distributed data, we may verify without too much difficulty that the integral above equals $(2\sigma\sqrt{\pi})^{-1}$, so the ARE is $3/\pi \approx 0.9549$. Notice how close this is to one, suggesting that for normal data, we lose very little efficiency by using a signed rank test instead of a t-test. In fact, it may be shown that this asymptotic relative efficiency has a lower bound of 0.864. However, there is no upper bound on the ARE in this case, which means that examples exist for which the t-test is arbitrarily inefficient compared to the signed rank test.
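As a quick numeric sanity check (an illustrative R sketch, not from the original notes; the value of $\sigma$ is arbitrary since the ARE does not depend on it), we can evaluate $\int f^2$ for the $N(0,\sigma^2)$ density and recover the $3/\pi$ figure:

# Check the ARE of the signed rank test vs. the t-test for N(0, sigma^2) data
sigma <- 2                                    # any sigma > 0 gives the same ARE
K <- integrate(function(z) dnorm(z, sd = sigma)^2, -Inf, Inf)$value
K                        # equals 1 / (2 * sigma * sqrt(pi))
12 * sigma^2 * K^2       # equals 3 / pi, approximately 0.9549
3 / pi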

Exercises for Section 7.1

Exercise 7.1 Assume that $\theta_n \to 0$ for some sequence $\theta_1, \theta_2, \ldots$. In Example 7.2, prove that
\[
\frac{R_n - E\,R_n}{\sqrt{\mathrm{Var}\,R_n}} \stackrel{d}{\to} N(0,1)
\]
under the assumption that for each $n$, the distribution of $R_n$ is determined by the parameter $\theta_n$. Note that this proof allows us to apply Theorem 7.1, since it covers both the contiguous alternatives and, by taking $\theta_n = \theta_0$ for all $n$, the null hypothesis.

Exercise 7.2 For the hypotheses considered in Examples 7.2 and 7.3, the sign test is based on the statistic $N^+ = \#\{i : Z_i > 0\}$. Since $2\sqrt{n}(N^+/n - \tfrac12) \stackrel{d}{\to} N(0,1)$ under the null hypothesis, the sign test (with continuity correction) rejects $H_0$ when
\[
N^+ - \frac{1}{2} \ge \frac{u_\alpha\sqrt{n}}{2} + \frac{n}{2}.
\]
(a) Find the efficacy of the sign test. Make sure to indicate how you go about verifying equation (7.3).

(b) Find the ARE of the sign test with respect to the signed rank test and the t-test. Evaluate each of these for the case of normal data.

Exercise 7.3 Suppose $X_1, \ldots, X_n$ are a simple random sample from a uniform$(0, 2\theta)$ distribution. We wish to test $H_0\colon \theta = \theta_0$ against $H_1\colon \theta > \theta_0$ at $\alpha = .05$.

Define $Q_1$ and $Q_3$ to be the first and third quartiles of the sample. Consider test A, which rejects $H_0$ when
\[
Q_3 - Q_1 - \theta_0 \ge A_n,
\]
and test B, which rejects $H_0$ when
\[
\overline{X} - \theta_0 \ge B_n.
\]
Based on the asymptotic distribution of $\overline{X}$ and the joint asymptotic distribution of $(Q_1, Q_3)$, find the values of $A_n$ and $B_n$ that correspond with the test in (7.5). Then find the asymptotic relative efficiency of test A relative to test B.

Exercise 7.4 Let $\{P_\theta\}$ be a family of probability distributions indexed by a real parameter $\theta$. If $X \sim P_\theta$, define $\mu(\theta) = E(X)$ and $\sigma^2(\theta) = \mathrm{Var}(X)$. Now let $\theta_1, \theta_2, \ldots$ be a sequence of parameter values such that $\theta_n \to \theta_0$ as $n \to \infty$. Suppose that $E_{\theta_n}|X|^{2+\delta} < M$ for all $n$ for some positive $\delta$ and $M$. Also suppose that for each $n$, $X_{n1}, \ldots, X_{nn}$ are independent with distribution $P_{\theta_n}$ and define $\overline{X}_n = \sum_{i=1}^n X_{ni}/n$. Prove that if $\sigma^2(\theta_0) < \infty$ and $\sigma^2(\theta)$ is continuous at the point $\theta_0$, then
\[
\sqrt{n}\,[\overline{X}_n - \mu(\theta_n)] \stackrel{d}{\to} N(0, \sigma^2(\theta_0))
\]
as $n \to \infty$.

Hint:
\[
|a + b|^{2+\delta} \le 2^{2+\delta}\left(|a|^{2+\delta} + |b|^{2+\delta}\right).
\]

Exercise 7.5 Suppose $X_1, X_2, \ldots$ are independent exponential random variables with mean $\theta$. Consider the test of $H_0\colon \theta = 1$ vs $H_1\colon \theta > 1$ in which we reject $H_0$ when
\[
\overline{X}_n \ge 1 + \frac{u_\alpha}{\sqrt{n}},
\]
where $\alpha = .05$.

(a) Derive an asymptotic approximation to the power of the test for a fixed sample size $n$ and alternative $\theta$. Tell where you use the result of Problem 7.4.

(b) Because the sum of independent exponential random variables is a gamma random variable, it is possible to compute the power exactly in this case. Create a table in which you compare the exact power of the test against the alternative $\theta = 1.2$ to the asymptotic approximation in part (a) for $n \in \{5, 10, 15, 20\}$.

Exercise 7.6 Let $X_1, \ldots, X_n$ be independent from Poisson($\lambda$). Create a table in which you list the exact power along with approximations (7.7) and (7.8) for the test that rejects $H_0\colon \lambda = 1$ in favor of $H_1\colon \lambda > 1$ when
\[
\frac{\sqrt{n}(\overline{X}_n - \lambda_0)}{\sqrt{\lambda_0}} \ge u_\alpha,
\]
where $n = 20$ and $\alpha = .05$, against each of the alternatives $\lambda = 1.1$, $1.5$, and $2$.

Exercise 7.7 Let $X_1, \ldots, X_n$ be an independent sample from an exponential distribution with mean $\lambda$, and $Y_1, \ldots, Y_n$ be an independent sample from an exponential distribution with mean $\mu$. Assume that $X_i$ and $Y_i$ are independent. We are interested in testing the hypothesis $H_0\colon \lambda = \mu$ versus $H_1\colon \lambda > \mu$. Consider the statistic
\[
T_n = 2\sum_{i=1}^n (I_i - 1/2)/\sqrt{n},
\]
where $I_i$ is the indicator variable $I_i = I(X_i > Y_i)$.

(a) Derive the asymptotic distribution of $T_n$ under the null hypothesis.

(b) Use the Lindeberg Theorem to show that, under the local alternative hypothesis $(\lambda_n, \mu_n) = (\lambda + n^{-1/2}\delta, \lambda)$, where $\delta > 0$,
\[
\frac{\sum_{i=1}^n (I_i - \rho_n)}{\sqrt{n\rho_n(1-\rho_n)}} \stackrel{\mathcal{L}}{\longrightarrow} N(0,1), \quad\text{where } \rho_n = \frac{\lambda_n}{\lambda_n + \mu_n} = \frac{\lambda + n^{-1/2}\delta}{2\lambda + n^{-1/2}\delta}.
\]
(c) Using the conclusion of part (b), derive the asymptotic distribution of $T_n$ under the local alternative specified in (b).

Exercise 7.8 Suppose $X_1, \ldots, X_m$ is a simple random sample and $Y_1, \ldots, Y_n$ is another simple random sample independent of the $X_i$, with $P(X_i \le t) = t^2$ for $t \in [0,1]$ and $P(Y_i \le t) = (t - \theta)^2$ for $t \in [\theta, \theta+1]$. Assume $m/(m+n) \to \rho$ as $m, n \to \infty$ and $0 < \theta < 1$.

Find the asymptotic distribution of $\sqrt{m+n}\,[g(\overline{Y} - \overline{X}) - g(\theta)]$.

7.2 The Wilcoxon Rank-Sum Test

Suppose that $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are two independent simple random samples, with
\[
P(X_i \le t) = P(Y_j \le t + \theta) = F(t) \tag{7.15}
\]
for some continuous distribution function $F(t)$ with $f(t) = F'(t)$. Thus, the distribution of the $Y_j$ is shifted by $\theta$ from the distribution of the $X_i$. We wish to test $H_0\colon \theta = 0$ against $H_1\colon \theta > 0$.

To do the asymptotics here, we will assume that $n$ and $m$ are actually both elements of separate sequences of sample sizes, indexed by a third variable, say $k$. Thus, $m = m_k$ and $n = n_k$ both go to $\infty$ as $k \to \infty$, and we suppress the subscript $k$ on $m$ and $n$ for convenience of notation. Suppose that we combine the $X_i$ and $Y_j$ into a single sample of size $m + n$. Define the Wilcoxon rank-sum statistic to be
\[
S_k = \sum_{j=1}^n (\text{rank of } Y_j \text{ among the combined sample}).
\]
Letting $Y_{(1)}, \ldots, Y_{(n)}$ denote the order statistics for the sample of $Y_j$ as usual, we may rewrite $S_k$ in the following way:
\begin{align*}
S_k &= \sum_{j=1}^n (\text{rank of } Y_{(j)} \text{ among the combined sample}) \\
&= \sum_{j=1}^n \left(j + \#\{i : X_i < Y_{(j)}\}\right) \\
&= \frac{n(n+1)}{2} + \sum_{j=1}^n\sum_{i=1}^m I\{X_i < Y_{(j)}\} \\
&= \frac{n(n+1)}{2} + \sum_{j=1}^n\sum_{i=1}^m I\{X_i < Y_j\}. \tag{7.16}
\end{align*}

Let $N = m + n$, and suppose that $m/N \to \rho$ as $k \to \infty$ for some constant $\rho \in (0,1)$. First, we will establish the asymptotic behavior of $S_k$ under the null hypothesis. Define
\[
\mu(\theta) = E_\theta\,S_k \quad\text{and}\quad \tau(\theta) = \sqrt{N\,\mathrm{Var}\,S_k}.
\]
To evaluate $\mu(\theta_0)$ and $\tau(\theta_0)$, let $Z_i = \sum_{j=1}^n I\{X_i < Y_j\}$. Then the $Z_i$ are identically distributed but not independent, and we have $E_{\theta_0} Z_i = n/2$ and
\begin{align*}
\mathrm{Var}_{\theta_0} Z_i &= \frac{n}{4} + n(n-1)\,\mathrm{Cov}_{\theta_0}\left(I\{X_i < Y_1\}, I\{X_i < Y_2\}\right) \\
&= \frac{n}{4} + \frac{n(n-1)}{3} - \frac{n(n-1)}{4} = \frac{n(n+2)}{12}.
\end{align*}
Furthermore,
\[
E_{\theta_0} Z_iZ_j = \sum_{r=1}^n\sum_{s=1}^n P_{\theta_0}(X_i < Y_r \text{ and } X_j < Y_s) = \frac{n(n-1)}{4} + nP_{\theta_0}(X_i < Y_1 \text{ and } X_j < Y_1),
\]
which implies
\[
\mathrm{Cov}_{\theta_0}(Z_i, Z_j) = \frac{n(n-1)}{4} + \frac{n}{3} - \frac{n^2}{4} = \frac{n}{12}.
\]
Therefore,
\[
\mu(\theta_0) = \frac{n(n+1)}{2} + \frac{mn}{2} = \frac{n(N+1)}{2}
\]
and
\[
\tau^2(\theta_0) = Nm\,\mathrm{Var}\,Z_1 + Nm(m-1)\,\mathrm{Cov}(Z_1, Z_2) = \frac{Nmn(n+2)}{12} + \frac{Nm(m-1)n}{12} = \frac{Nmn(N+1)}{12}.
\]

Let $\theta_1, \theta_2, \ldots$ be a sequence of alternatives such that $\sqrt{N}(\theta_k - \theta_0) \to \Delta$ for a positive, finite constant $\Delta$. It is possible to show that
\[
\frac{\sqrt{N}\{S_k - \mu(\theta_0)\}}{\tau(\theta_0)} \stackrel{d}{\to} N(0,1) \tag{7.17}
\]
under $H_0$ and
\[
\frac{\sqrt{N}\{S_k - \mu(\theta_k)\}}{\tau(\theta_0)} \stackrel{d}{\to} N(0,1) \tag{7.18}
\]
under the alternatives $\theta_k$ (see Exercise 7.9). By expression (7.17), the test based on $S_k$ with asymptotic level $\alpha$ rejects $H_0\colon \theta = 0$ in favor of $H_1\colon \theta > 0$ whenever $S_k \ge \mu(\theta_0) + u_\alpha\tau(\theta_0)/\sqrt{N}$, or
\[
S_k \ge \frac{n(N+1)}{2} + u_\alpha\sqrt{\frac{mn(N+1)}{12}}.
\]
The Wilcoxon rank-sum test is sometimes referred to as the Mann-Whitney test. This alternative name helps to distinguish this test from the similarly named Wilcoxon signed rank test.
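As a small illustrative check (an R sketch under arbitrarily chosen sample sizes and a standard normal null distribution, not part of the original notes), we can simulate the rejection rule above under $H_0$ and confirm that its level is near $\alpha$:

# Monte Carlo check of the asymptotic level of the rank-sum test
set.seed(1)
m <- 30; n <- 40; N <- m + n; alpha <- 0.05
crit <- n * (N + 1) / 2 + qnorm(1 - alpha) * sqrt(m * n * (N + 1) / 12)
reject <- replicate(5000, {
  x <- rnorm(m); y <- rnorm(n)              # theta = 0, so H0 holds
  Sk <- sum(rank(c(x, y))[(m + 1):N])       # sum of the ranks of the Y's
  Sk >= crit
})
mean(reject)   # should be close to 0.05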

To find the limiting power of the rank-sum test, we may use Theorem 7.1 to conclude that
\[
\beta_k(\theta_k) \to \lim_{k\to\infty} \Phi\left(\frac{\Delta\mu'(\theta_0)}{\tau(\theta_0)} - u_\alpha\right). \tag{7.19}
\]
According to expression (7.19), we must evaluate $\mu'(\theta_0)$. To this end, note that
\begin{align*}
P_\theta(X_1 < Y_1) &= E_\theta\,P_\theta(X_1 < Y_1 \mid Y_1) = E_\theta\,F(Y_1) \\
&= \int_{-\infty}^{\infty} F(y)f(y - \theta)\,dy = \int_{-\infty}^{\infty} F(y + \theta)f(y)\,dy.
\end{align*}
Therefore,
\[
\frac{d}{d\theta}P_\theta(X_1 < Y_1) = \int_{-\infty}^{\infty} f(y + \theta)f(y)\,dy.
\]
This gives
\[
\mu'(0) = mn\int_{-\infty}^{\infty} f^2(y)\,dy.
\]
Thus, the efficacy of the Wilcoxon rank-sum test is
\[
\lim_{k\to\infty} \frac{\mu'(\theta_0)}{\tau(\theta_0)} = \lim_{k\to\infty} \frac{mn\sqrt{12}\int_{-\infty}^{\infty} f^2(y)\,dy}{\sqrt{mnN(N+1)}} = \sqrt{12\rho(1-\rho)}\int_{-\infty}^{\infty} f^2(y)\,dy.
\]

The asymptotic power of the test follows immediately from (7.19).

Exercises for Section 7.2

Exercise 7.9 (a) Prove expression (7.17) under the null hypothesis $H_0\colon \theta = \theta_0$ (where $\theta_0 = 0$).

(b) Prove expression (7.18) under the sequence of alternatives $\theta_1, \theta_2, \ldots$ satisfying $\sqrt{N}(\theta_k - \theta_0) \to \Delta$.

Hint: In both (a) and (b), start by verifying either the Lindeberg condition or the Lyapunov condition.


Exercise 7.10 Suppose $\mathrm{Var}\,X_i = \mathrm{Var}\,Y_i = \sigma^2 < \infty$ and we wish to test the hypotheses $H_0\colon \theta = 0$ vs. $H_1\colon \theta > 0$ using the two-sample Z-statistic
\[
\frac{\overline{Y} - \overline{X}}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}}.
\]
Note that this Z-statistic is $s/\sigma$ times the usual T-statistic, where $s$ is the pooled sample standard deviation, so the asymptotic properties of the T-statistic are the same as those of the Z-statistic.

(a) Find the efficacy of the Z test. Justify your use of Theorem 7.1.

(b) Find the ARE of the Z test with respect to the rank-sum test for normally distributed data.

(c) Find the ARE of the Z test with respect to the rank-sum test if the data come from a double exponential distribution with $f(t) = \frac{1}{2\lambda}e^{-|t/\lambda|}$.

(d) Prove by example that the ARE of the Z-test with respect to the rank-sum test can be arbitrarily close to zero.


Chapter 8

Categorical Data

8.1 Pearson’s Chi-Square

Let $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots$ be independent from a multinomial$(1, \mathbf{p})$ distribution, where $\mathbf{p}$ is a $k$-vector with nonnegative entries that sum to one. That is,
\[
P(X^{(i)}_j = 1) = 1 - P(X^{(i)}_j = 0) = p_j \quad\text{for all } 1 \le j \le k \tag{8.1}
\]
and each $\mathbf{X}^{(i)}$ consists of exactly $k - 1$ zeros and a single one, where the one is in the component of the "success" category at trial $i$. Note that the multinomial distribution is a generalization of the binomial distribution to the case in which there are $k$ categories of outcome instead of only 2. Also note that we ordinarily do not consider a binomial random variable (or a Bernoulli variable as in equation (8.1)) to be a 2-vector, but we could easily do so if the vector contained both the number of successes and the number of failures.

Since equation (8.1) implies that $\mathrm{Var}\,X^{(i)}_j = p_j(1-p_j)$, and since $\mathrm{Cov}(X^{(i)}_j, X^{(i)}_\ell) = -p_jp_\ell$ for $j \ne \ell$, the random vector $\mathbf{X}^{(i)}$ has covariance matrix
\[
\Sigma = \begin{pmatrix}
p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_k \\
-p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_k \\
\vdots & & \ddots & \vdots \\
-p_1p_k & -p_2p_k & \cdots & p_k(1-p_k)
\end{pmatrix}. \tag{8.2}
\]
Since $E\,\mathbf{X}^{(i)} = \mathbf{p}$, the central limit theorem implies
\[
\sqrt{n}\left(\overline{\mathbf{X}}_n - \mathbf{p}\right) \stackrel{d}{\to} N_k(0, \Sigma). \tag{8.3}
\]
Note that the sum of the $j$th column of $\Sigma$ is $p_j - p_j(p_1 + \cdots + p_k) = 0$, which is to say that the sum of the rows of $\Sigma$ is the zero vector, so $\Sigma$ is not invertible.

We wish to derive the asymptotic distribution of Pearson's chi-square statistic
\[
\chi^2 = \sum_{j=1}^k \frac{(n_j - np_j)^2}{np_j}, \tag{8.4}
\]
where $n_j$ is the random variable $n\overline{X}_j$, the number of successes in the $j$th category for trials $1, \ldots, n$. We will discuss two different ways to do this. One way avoids dealing with the singular matrix $\Sigma$, whereas the other does not.

In the first approach, define for each $i$ the vector $\mathbf{Y}^{(i)} = (X^{(i)}_1, \ldots, X^{(i)}_{k-1})$. That is, let $\mathbf{Y}^{(i)}$ be the $(k-1)$-vector consisting of the first $k - 1$ components of $\mathbf{X}^{(i)}$. Then the covariance matrix of $\mathbf{Y}^{(i)}$ is the upper-left $(k-1)\times(k-1)$ submatrix of $\Sigma$, which we denote by $\Sigma^*$. Similarly, let $\mathbf{p}^*$ denote the vector $(p_1, \ldots, p_{k-1})$.

First, verify that $\Sigma^*$ is invertible and that
\[
(\Sigma^*)^{-1} = \begin{pmatrix}
\frac{1}{p_1} + \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\frac{1}{p_k} & \frac{1}{p_2} + \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\vdots & & \ddots & \vdots \\
\frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_{k-1}} + \frac{1}{p_k}
\end{pmatrix}. \tag{8.5}
\]
Second, verify that
\[
\chi^2 = n(\overline{\mathbf{Y}} - \mathbf{p}^*)^t(\Sigma^*)^{-1}(\overline{\mathbf{Y}} - \mathbf{p}^*). \tag{8.6}
\]
The facts in equations (8.5) and (8.6) are checked in Problem 8.3. If we now define
\[
\mathbf{Z}^{(n)} = \sqrt{n}\,(\Sigma^*)^{-1/2}(\overline{\mathbf{Y}} - \mathbf{p}^*),
\]
then clearly the central limit theorem implies $\mathbf{Z}^{(n)} \stackrel{d}{\to} N_{k-1}(0, I)$. By definition, the $\chi^2_{k-1}$ distribution is the distribution of the sum of the squares of $k - 1$ independent standard normal random variables. Therefore,
\[
\chi^2 = (\mathbf{Z}^{(n)})^t\mathbf{Z}^{(n)} \stackrel{d}{\to} \chi^2_{k-1}, \tag{8.7}
\]
which is the result that leads to the familiar chi-square test.
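As a quick numeric illustration (an R sketch with made-up counts and probabilities, not part of the original notes), the quadratic form (8.6) agrees with the familiar form (8.4):

# Check that the quadratic form (8.6) equals the usual Pearson statistic (8.4)
p <- c(0.2, 0.3, 0.5); k <- length(p); n <- 100
counts <- c(25, 27, 48)                       # hypothetical observed counts, summing to n
chisq_familiar <- sum((counts - n * p)^2 / (n * p))
Sigma <- diag(p) - p %*% t(p)                 # multinomial covariance matrix (8.2)
Sstar <- Sigma[1:(k - 1), 1:(k - 1)]          # upper-left (k-1) x (k-1) block
ybar  <- counts[1:(k - 1)] / n                # first k-1 sample proportions
pstar <- p[1:(k - 1)]
chisq_quadform <- n * t(ybar - pstar) %*% solve(Sstar) %*% (ybar - pstar)
c(chisq_familiar, chisq_quadform)             # the two values agree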

In a second approach to deriving the limiting distribution (8.7), we use some properties of projection matrices.

Definition 8.1 A symmetric matrix $P$ is called a projection matrix if it is idempotent; that is, if $P^2 = P$.

The following lemmas, to be proven in Problem 8.4, give some basic facts about projection matrices.

Lemma 8.1 Suppose $P$ is a projection matrix. Then every eigenvalue of $P$ equals 0 or 1. Suppose that $r$ denotes the number of eigenvalues of $P$ equal to 1. Then if $\mathbf{Z} \sim N_k(0, P)$, $\mathbf{Z}^t\mathbf{Z} \sim \chi^2_r$.

Lemma 8.2 The trace of a square matrix $M$, $\mathrm{Tr}(M)$, is equal to the sum of its diagonal entries. For matrices $A$ and $B$ whose sizes allow them to be multiplied in either order, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$.

Recall that if a square matrix $M$ is symmetric, then there exists an orthogonal matrix $Q$ such that $QMQ^t$ is a diagonal matrix whose entries consist of the eigenvalues of $M$. By Lemma 8.2, $\mathrm{Tr}(QMQ^t) = \mathrm{Tr}(Q^tQM) = \mathrm{Tr}(M)$, which proves yet another lemma:


Lemma 8.3 If M is symmetric, then Tr (M) equals the sum of the eigenvalues of M .

Define $\Gamma = \mathrm{diag}(\mathbf{p})$, and let $\Sigma$ be defined as in equation (8.2). Clearly, equation (8.3) implies
\[
\sqrt{n}\,\Gamma^{-1/2}(\overline{\mathbf{X}} - \mathbf{p}) \stackrel{d}{\to} N_k(0, \Gamma^{-1/2}\Sigma\Gamma^{-1/2}).
\]
Since $\Sigma$ may be written in the form $\Gamma - \mathbf{p}\mathbf{p}^t$,
\[
\Gamma^{-1/2}\Sigma\Gamma^{-1/2} = I - \Gamma^{-1/2}\mathbf{p}\mathbf{p}^t\Gamma^{-1/2} = I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t \tag{8.8}
\]
clearly has trace $k - 1$; furthermore,
\[
(I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t)(I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t) = I - 2\sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t + \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t\sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t = I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t
\]
because $\sqrt{\mathbf{p}}^t\sqrt{\mathbf{p}} = 1$, so the covariance matrix (8.8) is a projection matrix.

Define $\mathbf{A}^{(n)} = \sqrt{n}\,\Gamma^{-1/2}(\overline{\mathbf{X}} - \mathbf{p})$. Then we may check (in Problem 8.4) that
\[
\chi^2 = (\mathbf{A}^{(n)})^t\mathbf{A}^{(n)}. \tag{8.9}
\]
Therefore, since the covariance matrix (8.8) is a projection with trace $k - 1$, Lemmas 8.3 and 8.1 prove that $\chi^2 \stackrel{d}{\to} \chi^2_{k-1}$ as desired.
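A short numerical check (an R sketch for an arbitrarily chosen $\mathbf{p}$, not part of the original notes) confirms that $I - \sqrt{\mathbf{p}}\sqrt{\mathbf{p}}^t$ is idempotent with trace $k-1$:

# Verify that I - sqrt(p) sqrt(p)^t is a projection matrix with trace k - 1
p <- c(0.2, 0.3, 0.5); k <- length(p)
P <- diag(k) - sqrt(p) %*% t(sqrt(p))
all.equal(P %*% P, P)       # TRUE: idempotent
sum(diag(P))                # k - 1 = 2
round(eigen(P)$values, 10)  # eigenvalues are 1, 1, 0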

Suppose the $k$-vector $\mathbf{X}$ is distributed as multinomial$(n, \mathbf{p})$, and we wish to test the null hypothesis $H_0\colon \mathbf{p} = \mathbf{p}^0$ against the alternative $H_1\colon \mathbf{p} \ne \mathbf{p}^0$ using the Pearson chi-square test. We are given a sequence of specific alternatives $\mathbf{p}^{(n)}$ satisfying $\sqrt{n}(\mathbf{p}^{(n)} - \mathbf{p}^0) \to \delta$ for some constant vector $\delta$. Note that this means $\sum_{i=1}^k \delta_i = 0$, a fact that will be used later. Our task is to derive the limit of the power of the test under the sequence of alternatives $\mathbf{p}^{(n)}$.

First, we define a noncentral chi-square distribution.

Definition 8.2 If $A_1, \ldots, A_n$ are independent random variables with $A_i \sim N(\mu_i, 1)$, then the distribution of $A_1^2 + A_2^2 + \cdots + A_n^2$ is noncentral chi-square with $n$ degrees of freedom and noncentrality parameter $\phi = \mu_1^2 + \cdots + \mu_n^2$. We denote this distribution $\chi^2_n(\phi)$. Equivalently, we can say that if $\mathbf{A} \sim N_n(\mu, I)$, then $\mathbf{A}^t\mathbf{A} \sim \chi^2_n(\phi)$ where $\phi = \mu^t\mu$.

Note that this is not a valid definition unless we may prove that the distribution of $\mathbf{A}^t\mathbf{A}$ depends on $\mu$ only through $\phi = \mu^t\mu$. We prove this as follows. First, note that if $\phi = 0$ then there is nothing to prove. Otherwise, define $\mu^* = \mu/\sqrt{\phi}$. Next, find an orthogonal matrix $Q$ whose first row is $(\mu^*)^t$. This may be accomplished by, for example, Gram-Schmidt orthogonalization. Then $Q\mathbf{A} \sim N_n(Q\mu, I)$. Since $Q\mu$ is a vector with first element $\sqrt{\phi}$ and remaining elements 0, clearly $Q\mathbf{A}$ has a distribution that depends on $\mu$ only through $\phi$. But $\mathbf{A}^t\mathbf{A} = (Q\mathbf{A})^t(Q\mathbf{A})$, so this proves that Definition 8.2 is valid.

We will derive the power of the chi-square test by adapting the projection matrix technique of Section 8.1. First, we prove a lemma.

Lemma 8.4 Suppose $\mathbf{Z} \sim N_k(\mu, P)$, where $P$ is a projection matrix of rank $r \le k$ and $P\mu = \mu$. Then $\mathbf{Z}^t\mathbf{Z} \sim \chi^2_r(\mu^t\mu)$.

Proof: Since $P$ is a covariance matrix, it is symmetric, which means that there exists an orthogonal matrix $Q$ with $QPQ^{-1} = \mathrm{diag}(\lambda)$, where $\lambda$ is the vector of eigenvalues of $P$. Since $P$ is a projection matrix, all of its eigenvalues are 0 or 1. Since $P$ has rank $r$, exactly $r$ of the eigenvalues are 1. Without loss of generality, assume that the first $r$ entries of $\lambda$ are 1 and the last $k - r$ are 0. The random vector $Q\mathbf{Z}$ is $N_k(Q\mu, \mathrm{diag}(\lambda))$, which implies that $\mathbf{Z}^t\mathbf{Z} = (Q\mathbf{Z})^t(Q\mathbf{Z})$ is by definition distributed as $\chi^2_r(\phi) + \varphi$, where
\[
\phi = \sum_{i=1}^r (Q\mu)_i^2 \quad\text{and}\quad \varphi = \sum_{i=r+1}^k (Q\mu)_i^2.
\]
Note, however, that
\[
Q\mu = QP\mu = QPQ^tQ\mu = \mathrm{diag}(\lambda)\,Q\mu. \tag{8.10}
\]
Since entries $r + 1$ through $k$ of $\lambda$ are zero, the corresponding entries of $Q\mu$ must be zero because of equation (8.10). This implies two things: First, $\varphi = 0$; and second,
\[
\phi = \sum_{i=1}^r (Q\mu)_i^2 = \sum_{i=1}^k (Q\mu)_i^2 = (Q\mu)^t(Q\mu) = \mu^t\mu.
\]
Thus, $\mathbf{Z}^t\mathbf{Z} \sim \chi^2_r(\mu^t\mu)$, which proves the result.

Define $\Gamma = \mathrm{diag}(\mathbf{p}^0)$. Let $\Sigma = \Gamma - \mathbf{p}^0(\mathbf{p}^0)^t$ be the usual multinomial covariance matrix under the null hypothesis; i.e., $\sqrt{n}(\mathbf{X}^{(n)}/n - \mathbf{p}^0) \stackrel{d}{\to} N_k(0, \Sigma)$ if $\mathbf{X}^{(n)} \sim$ multinomial$(n, \mathbf{p}^0)$. Consider $\mathbf{X}^{(n)}$ to have instead a multinomial$(n, \mathbf{p}^{(n)})$ distribution. Under the assumption made earlier that $\sqrt{n}(\mathbf{p}^{(n)} - \mathbf{p}^0) \to \delta$, it may be shown that
\[
\sqrt{n}(\mathbf{X}^{(n)}/n - \mathbf{p}^{(n)}) \stackrel{d}{\to} N_k(0, \Sigma). \tag{8.11}
\]
We claim that the limit (8.11) implies that the chi-square statistic $n(\mathbf{X}^{(n)}/n - \mathbf{p}^0)^t\Gamma^{-1}(\mathbf{X}^{(n)}/n - \mathbf{p}^0)$ converges in distribution to $\chi^2_{k-1}(\delta^t\Gamma^{-1}\delta)$, a fact that we now prove.

First, recall that we have already shown that $\Gamma^{-1/2}\Sigma\Gamma^{-1/2}$ is a projection matrix of rank $k - 1$. Define $\mathbf{V}^{(n)} = \sqrt{n}(\mathbf{X}^{(n)}/n - \mathbf{p}^0)$. Then
\[
\mathbf{V}^{(n)} = \sqrt{n}(\mathbf{X}^{(n)}/n - \mathbf{p}^{(n)}) + \sqrt{n}(\mathbf{p}^{(n)} - \mathbf{p}^0).
\]
The first term on the right hand side converges in distribution to $N_k(0, \Sigma)$ and the second term converges to $\delta$. Therefore, Slutsky's theorem implies that $\mathbf{V}^{(n)} \stackrel{d}{\to} N_k(\delta, \Sigma)$, which gives
\[
\Gamma^{-1/2}\mathbf{V}^{(n)} \stackrel{d}{\to} N_k(\Gamma^{-1/2}\delta, \Gamma^{-1/2}\Sigma\Gamma^{-1/2}).
\]
Thus, if we can show that $(\Gamma^{-1/2}\Sigma\Gamma^{-1/2})(\Gamma^{-1/2}\delta) = \Gamma^{-1/2}\delta$, then the result we wish to prove follows from Lemma 8.4. But
\[
\Gamma^{-1/2}\Sigma\Gamma^{-1}\delta = \Gamma^{-1/2}[\Gamma - \mathbf{p}^0(\mathbf{p}^0)^t]\Gamma^{-1}\delta = \Gamma^{-1/2}[\delta - \mathbf{p}^0\mathbf{1}^t\delta] = \Gamma^{-1/2}\delta
\]
since $\mathbf{1}^t\delta = \sum_{i=1}^k \delta_i = 0$. Thus, we conclude that the chi-square statistic converges in distribution to $\chi^2_{k-1}(\delta^t\Gamma^{-1}\delta)$ under the sequence of alternatives $\mathbf{p}^{(1)}, \mathbf{p}^{(2)}, \ldots$.


Example 8.1 For a particular trinomial experiment with $n = 200$, suppose the null hypothesis is $H_0\colon \mathbf{p} = \mathbf{p}^0 = (\tfrac14, \tfrac12, \tfrac14)$. (This hypothesis might arise in the context of a genetics experiment.) We may calculate the approximate power of the Pearson chi-square test at level $\alpha = 0.01$ against the alternative $\mathbf{p} = (\tfrac13, \tfrac13, \tfrac13)$.

First, set $\delta = \sqrt{n}(\mathbf{p} - \mathbf{p}^0) = \sqrt{200}\,(\tfrac{1}{12}, -\tfrac16, \tfrac{1}{12})$. Under the alternative $\mathbf{p}$, the chi-square statistic is approximately noncentral $\chi^2_2$ with noncentrality parameter
\[
\delta^t\,\mathrm{diag}(\mathbf{p}^0)^{-1}\delta = 200\left(\frac{4}{144} + \frac{2}{36} + \frac{4}{144}\right) = \frac{200}{9}.
\]

Since the test rejects $H_0$ whenever the statistic is larger than the .99 quantile of $\chi^2_2$, namely 9.210, the power is approximated by $P\{\chi^2_2(\tfrac{200}{9}) > 9.210\} = 0.965$. These values were found using R as follows:

> qchisq(.99,2)
[1] 9.21034
> 1-pchisq(.Last.value, 2, ncp=200/9)
[1] 0.965006
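A short simulation (an R sketch with an arbitrary seed and 10000 replications, not in the original notes) can be used to check this noncentral-chi-square approximation directly; the simulated power should come out near 0.965:

# Monte Carlo check of the approximate power in Example 8.1
set.seed(1)
n <- 200; p0 <- c(1/4, 1/2, 1/4); p1 <- c(1/3, 1/3, 1/3)
crit <- qchisq(.99, 2)
stat <- replicate(10000, {
  counts <- as.vector(rmultinom(1, n, p1))
  sum((counts - n * p0)^2 / (n * p0))
})
mean(stat > crit)   # should be close to 0.965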

Exercises for Section 8.1

Exercise 8.1 Hotelling's $T^2$. Suppose $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots$ are independent and identically distributed from some $k$-dimensional distribution with mean $\mu$ and finite nonsingular covariance matrix $\Sigma$. Let $S_n$ denote the sample covariance matrix
\[
S_n = \frac{1}{n-1}\sum_{j=1}^n (\mathbf{X}^{(j)} - \overline{\mathbf{X}})(\mathbf{X}^{(j)} - \overline{\mathbf{X}})^t.
\]
To test $H_0\colon \mu = \mu_0$ against $H_1\colon \mu \ne \mu_0$, define the statistic
\[
T^2 = (\mathbf{V}^{(n)})^tS_n^{-1}(\mathbf{V}^{(n)}),
\]
where $\mathbf{V}^{(n)} = \sqrt{n}(\overline{\mathbf{X}} - \mu_0)$. This is called Hotelling's $T^2$ statistic.

[Notes: This is a generalization of the square of a unidimensional t statistic. If the sample is multivariate normal, then $[(n-k)/(nk-k)]T^2$ is distributed as $F_{k,n-k}$. A Pearson chi-square statistic may be shown to be a special case of Hotelling's $T^2$.]

(a) You may assume that $S_n^{-1} \stackrel{P}{\to} \Sigma^{-1}$ (this follows from the weak law of large numbers since $P(S_n \text{ is nonsingular}) \to 1$). Prove that under the null hypothesis, $T^2 \stackrel{d}{\to} \chi^2_k$.

(b) Let $\mu^{(n)}$ be alternatives such that $\sqrt{n}(\mu^{(n)} - \mu_0) \to \delta$. You may assume that under $\mu^{(n)}$,
\[
\sqrt{n}(\overline{\mathbf{X}} - \mu^{(n)}) \stackrel{d}{\to} N_k(0, \Sigma).
\]
Find (with proof) the limit of the power against the alternatives $\mu^{(n)}$ of the test that rejects $H_0$ when $T^2 \ge c_\alpha$, where $P(\chi^2_k > c_\alpha) = \alpha$.


(c) An approximate $1-\alpha$ confidence set for $\mu$ based on the result in part (a) may be formed by plotting the elliptical set
\[
\{\mu : n(\overline{\mathbf{X}} - \mu)^tS_n^{-1}(\overline{\mathbf{X}} - \mu) = c_\alpha\}.
\]
For a random sample of size 100 from $N_2(0, \Sigma)$, where
\[
\Sigma = \begin{pmatrix} 1 & 3/5 \\ 3/5 & 1 \end{pmatrix},
\]
produce a scatterplot of the sample and plot 90% and 99% confidence sets on this scatterplot.

Hints: In part (c), to produce a random vector with the $N_2(0, \Sigma)$ distribution, take a $N_2(0, I)$ random vector and left-multiply by a matrix $A$ such that $AA^t = \Sigma$. It is not hard to find such an $A$ (it may be taken to be lower triangular). One way to graph the ellipse is to find a matrix $B$ such that $B^tS_n^{-1}B = I$. Then note that
\[
\{\mu : n(\overline{\mathbf{X}} - \mu)^tS_n^{-1}(\overline{\mathbf{X}} - \mu) = c_\alpha\} = \{\overline{\mathbf{X}} - B\nu : \nu^t\nu = c_\alpha/n\},
\]
and of course it's easy to find points $\nu$ such that $\nu^t\nu$ equals a constant. To find a matrix $B$ such as the one specified, note that the matrix of eigenvectors of $S_n$, properly normalized, gives an orthogonal matrix that diagonalizes $S_n$.

Exercise 8.2 Suppose we have a tetranomial experiment and wish to test the hypothesis $H_0\colon \mathbf{p} = (1/4, 1/4, 1/4, 1/4)$ against the alternative $H_1\colon \mathbf{p} \ne (1/4, 1/4, 1/4, 1/4)$ at the .05 level.

(a) Approximate the power of the test against the alternative $(1/10, 2/10, 3/10, 4/10)$ for a sample of size $n = 200$.

(b) Give the approximate sample size necessary to give power of 80% against the alternative in part (a).

Hints: You can use the Splus function pchisq directly in part (a), but in part (b) you may have to use a trial-and-error approach with pchisq.

Exercise 8.3 Verify equations (8.5) and (8.6).

Exercise 8.4 Prove Lemma 8.1 and Lemma 8.2, then verify equation (8.9).

Exercise 8.5 Pearson's chi-square for a 2-way table: Product multinomial model. If $A$ and $B$ are categorical variables with 2 and $k$ levels, respectively, and we collect random samples of size $m$ and $n$ from levels 1 and 2 of $A$, then classify each individual according to its level of the variable $B$, the results of this study may be summarized in a $2 \times k$ table. The standard test of the independence of variables $A$ and $B$ is the Pearson chi-square test, which may be written as
\[
\sum_{\text{all cells in table}} \frac{(O_j - E_j)^2}{E_j},
\]
where $O_j$ is the observed count in cell $j$ and $E_j$ is the estimate of the expected count under the null hypothesis. Equivalently, we may set up the problem as follows: If $\mathbf{X}$ and $\mathbf{Y}$ are independent multinomial$(m, \mathbf{p})$ and multinomial$(n, \mathbf{p})$ random vectors, respectively, then the Pearson chi-square statistic is
\[
W^2 = \sum_{j=1}^k \left\{\frac{(X_j - mZ_j/N)^2}{mZ_j/N} + \frac{(Y_j - nZ_j/N)^2}{nZ_j/N}\right\},
\]
where $\mathbf{Z} = \mathbf{X} + \mathbf{Y}$ and $N = n + m$. (Note: I used $W^2$ to denote the chi-square statistic to avoid using yet another variable that looks like an X.)

Using the result of Equation (5.5.51) on p. 329, prove that if $N \to \infty$ in such a way that $n/N \to \alpha \in (0,1)$, then
\[
W^2 \stackrel{d}{\to} \chi^2_{k-1}.
\]

Exercise 8.6 Pearson's chi-square for a 2-way table: Multinomial model. Now consider the case in which $(\mathbf{X}, \mathbf{Y})$ is a single multinomial$(N, \mathbf{q})$ random $2k$-vector. $X_i$ will still denote the $(1, i)$ entry in a $2 \times k$ table, and $Y_i$ will still denote the $(2, i)$ entry.

(a) In this case, $\mathbf{q}$ is a $2k$-vector. Let $\alpha = q_1/(q_1 + q_{k+1})$ and define $\mathbf{p}$ to be the $k$-vector such that $(q_1, \ldots, q_k) = \alpha\mathbf{p}$. Prove that under the usual null hypothesis that variable $A$ is independent of variable $B$ (i.e., the row variable and the column variable are independent), $\mathbf{q} = (\alpha\mathbf{p}, (1-\alpha)\mathbf{p})$ and $p_1 + \cdots + p_k = 1$.

(b) As in Problem 8.5, let $\mathbf{Z} = \mathbf{X} + \mathbf{Y}$. Assume the null hypothesis is true and suppose that for some reason $\alpha$ is known. The Pearson chi-square statistic may be written as
\[
W^2 = \sum_{j=1}^k \left\{\frac{(X_j - \alpha Z_j)^2}{\alpha Z_j} + \frac{(Y_j - (1-\alpha)Z_j)^2}{(1-\alpha)Z_j}\right\}. \tag{8.12}
\]
Find the joint asymptotic distribution of
\[
\sqrt{N\alpha(1-\alpha)}\left(\frac{X_1}{N\alpha} - \frac{Y_1}{N(1-\alpha)}, \ \ldots, \ \frac{X_k}{N\alpha} - \frac{Y_k}{N(1-\alpha)}\right)
\]
and use this result to prove that $W^2 \stackrel{d}{\to} \chi^2_k$.

Exercise 8.7 In Problem 8.6(b), it was assumed that $\alpha$ was known. However, in most problems this assumption is unrealistic. Therefore, we replace all occurrences of $\alpha$ in equation (8.12) by $\hat\alpha = \sum_{i=1}^k X_i/N$. This results in a different asymptotic distribution for the $W^2$ statistic. Suppose we are given the following multinomial probabilities for a $2 \times 2$ table with independent row and column variables:

    P(X_1 = 1) = .1     P(X_2 = 1) = .15     .25
    P(Y_1 = 1) = .3     P(Y_2 = 1) = .45     .75
             .4                   .6           1

Note that $\alpha = .25$ in the above table. Let $N = 50$ and simulate 1000 multinomial random vectors with the above probabilities. For each, calculate the value of $W^2$ using both the known value $\alpha = .25$ and the value $\hat\alpha$ estimated from the data. Plot the empirical distribution function of each of these two sets of 1000 values. Compare with the theoretical distribution functions for the $\chi^2_1$ and $\chi^2_2$ distributions.


Hint: To generate a multinomial random variable with expectation vector matching the table above, because of the independence inherent in the table you can generate two independent Bernoulli random variables with respective success probabilities equal to the margins: That is, let P(A = 2) = 1 − P(A = 1) = .6 and P(B = 2) = 1 − P(B = 1) = .75, then classify the multinomial observation into the correct cell based on the random values of A and B.

Exercise 8.8 The following example comes from genetics. There is a particular characteristic of human blood (the so-called MN blood group) that has three types: M, MN, and N. Under idealized circumstances, which we assume to be true for the purposes of this problem, these three types occur in the population with probabilities $p_1 = \pi_M^2$, $p_2 = 2\pi_M\pi_N$, and $p_3 = \pi_N^2$, respectively, where $\pi_M$ is the frequency of the M allele in the population and $\pi_N = 1 - \pi_M$ is the frequency of the N allele.

If the value of $\pi_M$ were known, then the asymptotic distribution of the Pearson $\chi^2$ statistic would be given in the development earlier in this topic. However, of course we usually don't know $\pi_M$. Instead, we estimate it using the maximum likelihood estimator $\hat\pi_M = (2n_1 + n_2)/2n$.

(a) Define $\mathbf{B}^{(n)} = \sqrt{n}\,\Gamma^{-1/2}(\overline{\mathbf{X}} - \hat{\mathbf{p}})$, where $\hat{\mathbf{p}}$ is the MLE for $\mathbf{p}$. Use the delta method to derive the asymptotic distribution of $\Gamma^{-1/2}\mathbf{B}^{(n)}$.

(b) Define $\hat\Gamma$ to be the diagonal matrix with entries $\hat p_1, \hat p_2, \hat p_3$ along its diagonal. Derive the asymptotic distribution of $\hat\Gamma^{-1/2}\mathbf{B}^{(n)}$.

(c) Derive the asymptotic distribution of the Pearson chi-square statistic
\[
\chi^2 = \sum_{j=1}^k \frac{(n_j - n\hat p_j)^2}{n\hat p_j}. \tag{8.13}
\]

Exercise 8.9 Take $\pi_M = .75$ and $n = 100$ in the situation described in Problem 8.8. Simulate 500 realizations of the data.

(a) Compute
\[
\sum_{j=1}^k \frac{(n_j - np_j)^2}{np_j}
\]
for each of your 500 datasets. Compare the empirical distribution function of these statistics with both the $\chi^2_1$ and $\chi^2_2$ distribution functions. Comment on what you observe.

(b) Compute the $\chi^2$ statistic of equation (8.13) for each of your 500 datasets. Compare the empirical distribution function of these statistics with both the $\chi^2_1$ and $\chi^2_2$ distribution functions. Comment on what you observe.


Chapter 9

U-statistics

9.1 Statistical Functionals and V-Statistics

Let $S$ be a set of cumulative distribution functions and let $T\colon S \to \mathbb{R}$ be a functional defined on that set. Then $T$ is called a statistical functional. In general, we may have a simple random sample from an unknown distribution $F$, and we want to learn the value of $\theta = T(F)$ for a (known) functional $T$. Some particular instances of statistical functionals are as follows:

• If $T(F) = F(c)$ for some constant $c$, then $T$ is a statistical functional mapping each $F$ to $P_F(X \le c)$.

• If $T(F) = F^{-1}(p)$ for some constant $p$, where $F^{-1}(p)$ is defined in equation (2.18), then $T$ maps $F$ to its $p$th quantile.

• If $T(F) = E_F(X)$, then $T$ maps $F$ to its mean.

Suppose $X_1, \ldots, X_n$ is an independent and identically distributed sequence with distribution function $F(x)$. We define the empirical distribution function $F_n$ to be the distribution function for a discrete uniform distribution on $\{X_1, \ldots, X_n\}$. In other words,
\[
F_n(x) = \frac{1}{n}\#\{i : X_i \le x\} = \frac{1}{n}\sum_{i=1}^n I\{X_i \le x\}.
\]
Since $F_n(x)$ is a distribution function, a reasonable estimator of $T(F)$ is the so-called plug-in estimator $T(F_n)$. For example, if $T(F) = E_F(X)$, then the plug-in estimator given a simple random sample $X_1, X_2, \ldots$ from $F$ is
\[
T(F_n) = E_{F_n}(X) = \frac{1}{n}\sum_{i=1}^n X_i = \overline{X}_n.
\]
As we will see later, a plug-in estimator is also known as a V-statistic or a V-estimator.

As we will see later, a plug-in estimator is also known as a V-statistic or a V-estimator.

Suppose that for some real-valued function $\phi(x)$, we define $T(F) = E_F\,\phi(X)$. Note in this case that
\[
T\{\alpha F_1 + (1-\alpha)F_2\} = \alpha E_{F_1}\phi(X) + (1-\alpha)E_{F_2}\phi(X) = \alpha T(F_1) + (1-\alpha)T(F_2).
\]


For this reason, such a functional is sometimes called a linear functional.

To generalize this idea, we consider a real-valued function taking more than one real argument, say $\phi(x_1, \ldots, x_a)$ for some $a > 1$, and define
\[
T(F) = E_F\,\phi(X_1, \ldots, X_a), \tag{9.1}
\]
which we take to mean the expectation of $\phi(X_1, \ldots, X_a)$ where $X_1, \ldots, X_a$ is a simple random sample from the distribution function $F$. We see that
\[
E_F\,\phi(X_1, \ldots, X_a) = E_F\,\phi(X_{\pi(1)}, \ldots, X_{\pi(a)})
\]
for any permutation $\pi$ mapping $\{1, \ldots, a\}$ onto itself. Since there are $a!$ such permutations, consider the function
\[
\phi^*(x_1, \ldots, x_a) \stackrel{\mathrm{def}}{=} \frac{1}{a!}\sum_{\text{all }\pi} \phi(x_{\pi(1)}, \ldots, x_{\pi(a)}).
\]
Since $E_F\,\phi(X_1, \ldots, X_a) = E_F\,\phi^*(X_1, \ldots, X_a)$ and $\phi^*$ is symmetric in its arguments (i.e., permuting its $a$ arguments does not change its value), we see that in Equation (9.1) we may assume without loss of generality that $\phi$ is symmetric in its arguments. A function defined as in equation (9.1) is called an expectation functional, as summarized in the following definition:

Definition 9.1 For some integer $a \ge 1$, let $\phi\colon \mathbb{R}^a \to \mathbb{R}$ be a function symmetric in its $a$ arguments, called the kernel function. The expectation of $\phi(X_1, \ldots, X_a)$ under the assumption that $X_1, \ldots, X_a$ are independent and identically distributed from some distribution $F$ will be denoted by $E_F\,\phi(X_1, \ldots, X_a)$. Then the map $T(F) = E_F\,\phi(X_1, \ldots, X_a)$ is called an expectation functional. If $a = 1$, then $T$ is also called a linear functional.

Suppose $T(F)$ is an expectation functional defined according to Equation (9.1) for some $a \ge 1$. If we have a simple random sample of size $n$ from $F$, then as noted earlier, a natural way to estimate $T(F)$ is by the use of the plug-in estimator $T(F_n)$. This estimator is called a V-estimator or a V-statistic. It is possible to write down a V-statistic explicitly: Since $F_n$ assigns probability $\frac{1}{n}$ to each $X_i$, we have
\[
V_n = T(F_n) = E_{F_n}\phi(X_1, \ldots, X_a) = \frac{1}{n^a}\sum_{i_1=1}^n\cdots\sum_{i_a=1}^n \phi(X_{i_1}, \ldots, X_{i_a}). \tag{9.2}
\]

In the case $a = 1$, Equation (9.2) becomes
\[
V_n = \frac{1}{n}\sum_{i=1}^n \phi(X_i).
\]
It is clear in this case that $E\,V_n = T(F)$, which we denote by $\theta$. Furthermore, if $\sigma^2 = \mathrm{Var}_F\,\phi(X) < \infty$, then the central limit theorem implies that
\[
\sqrt{n}(V_n - \theta) \stackrel{d}{\to} N(0, \sigma^2).
\]
For $a > 1$, however, the sum in Equation (9.2) contains some terms in which $i_1, \ldots, i_a$ are not all distinct. The expectation of such terms is not necessarily equal to $\theta = T(F)$. Thus, $V_n$ is no longer unbiased for $a > 1$.


Example 9.1 Let $a = 2$ and $\phi(x_1, x_2) = |x_1 - x_2|$. It may be shown (Problem 9.2) that the functional $T(F) = E_F|X_1 - X_2|$ is not linear in $F$. Furthermore, since $|X_{i_1} - X_{i_2}|$ is identically zero whenever $i_1 = i_2$, it may also be shown that the V-estimator of $T(F)$ is biased.

Since the bias in $V_n$ is due to the duplication among the subscripts $i_1, \ldots, i_a$, the obvious way to correct this bias is to restrict the summation in Equation (9.2) to sets of subscripts $i_1, \ldots, i_a$ that contain no duplication. For example, we might sum instead over all $i_1 < \cdots < i_a$, an idea that leads in the next topic to U-statistics.

Exercises for Section 9.1

Exercise 9.1 Let $X_1, \ldots, X_n$ be a simple random sample from $F$. For a fixed $x$ for which $0 < F(x) < 1$, find the asymptotic distribution of $F_n(x)$.

Exercise 9.2 Let $T(F) = E_F|X_1 - X_2|$.

(a) Show that $T(F)$ is not a linear functional by exhibiting distributions $F_1$ and $F_2$ and a constant $\alpha \in (0,1)$ such that
\[
T\{\alpha F_1 + (1-\alpha)F_2\} \ne \alpha T(F_1) + (1-\alpha)T(F_2).
\]
(b) For $n > 1$, demonstrate that the V-statistic $V_n$ is biased in this case by finding $c_n \ne 1$ such that $E_F\,V_n = c_nT(F)$.

Exercise 9.3 Let $X_1, \ldots, X_n$ be a random sample from a distribution $F$ with finite third absolute moment.

(a) For $a = 2$, find $\phi(x_1, x_2)$ such that $E_F\,\phi(X_1, X_2) = \mathrm{Var}_F\,X$. Your $\phi$ function should be symmetric in its arguments.

Hint: The fact that $\theta = E\,X_1^2 - E\,X_1X_2$ leads immediately to a non-symmetric $\phi$ function. Symmetrize it.

(b) For $a = 3$, find $\phi(x_1, x_2, x_3)$ such that $E_F\,\phi(X_1, X_2, X_3) = E_F(X - E_F\,X)^3$. As in part (a), $\phi$ should be symmetric in its arguments.

9.2 Hoeffding’s Decomposition and Asymptotic Normality

Because the V-statistic
\[
V_n = \frac{1}{n^a}\sum_{i_1=1}^n\cdots\sum_{i_a=1}^n \phi(X_{i_1}, \ldots, X_{i_a})
\]
is in general a biased estimator of the expectation functional $T(F) = E_F\,\phi(X_1, \ldots, X_a)$, due to the presence of summands in which there are duplicated indices on the $X_{i_k}$, the obvious way to produce an unbiased estimator is to sum only over those $(i_1, \ldots, i_a)$ in which no duplicates occur. Because $\phi$ is assumed to be symmetric in its arguments, we may without loss of generality restrict attention to the cases in which $1 \le i_1 < \cdots < i_a \le n$. Doing this, we obtain the U-statistic
\[
U_n = \frac{1}{\binom{n}{a}}\sum_{1\le i_1 < \cdots < i_a \le n} \phi(X_{i_1}, \ldots, X_{i_a}). \tag{9.3}
\]

Note that the sample size $n$ must be at least as large as $a$ for a U-statistic to be defined. Naturally, the "U" in "U-statistic" stands for unbiased (the "V" in "V-statistic" stands for von Mises, who was one of the originators of this theory in the late 1940's). The unbiasedness of $U_n$ follows since it is the average of $\binom{n}{a}$ terms, each with expectation $T(F) = E_F\,\phi(X_1, \ldots, X_a)$.

Example 9.2 Consider a random sample $X_1, \ldots, X_n$ from $F$, and let
\[
R_n = \sum_{j=1}^n jI\{W_j > 0\}
\]
be the Wilcoxon signed rank statistic, where $W_1, \ldots, W_n$ are simply $X_1, \ldots, X_n$ reordered in increasing absolute value. We showed in Example 7.2 that
\[
R_n = \sum_{i=1}^n \sum_{j=1}^i I\{X_i + X_j > 0\}.
\]
Letting $\phi(a, b) = I\{a + b > 0\}$, we see that $\phi$ is symmetric in its arguments and thus
\[
\frac{1}{\binom{n}{2}}R_n = U_n + \frac{1}{\binom{n}{2}}\sum_{i=1}^n I\{X_i > 0\} = U_n + O_P\left(\frac{1}{n}\right),
\]
where $U_n$ is the U-statistic corresponding to the expectation functional $T(F) = P_F(X_1 + X_2 > 0)$. Therefore, many of the properties of the signed rank test that we have already derived elsewhere can also be obtained using the theory of U-statistics developed later in this topic.
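The identity relating $R_n$ to the count of positive pairwise sums $X_i + X_j$ over $i \le j$ is easy to confirm numerically (a small R sketch with an arbitrary simulated sample, not part of the original notes):

# Check that the signed rank statistic equals the count of positive sums
# X_i + X_j over pairs with i <= j
set.seed(1)
x <- rnorm(25, mean = 0.3)
Rn_ranks <- sum(rank(abs(x))[x > 0])        # sum_j j I{W_j > 0}
sums     <- outer(x, x, "+")
Rn_pairs <- sum(sums[upper.tri(sums, diag = TRUE)] > 0)  # pairs with i <= j
c(Rn_ranks, Rn_pairs)                       # the two values agree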

In the special case $a = 1$, the V-statistic and the U-statistic coincide. In this case, we have already seen that both $U_n$ and $V_n$ are asymptotically normal by the central limit theorem. However, for $a > 1$ clearly the two statistics do not coincide in general. Furthermore, we may no longer use the central limit theorem to obtain asymptotic normality because the summands are not independent (each $X_i$ appears in more than one summand).

To prove the asymptotic normality of U-statistics, we shall use a method sometimes known as the H-projection method after its inventor, Wassily Hoeffding. If $\phi(x_1, \ldots, x_a)$ is the kernel function of an expectation functional $T(F) = E_F\,\phi(X_1, \ldots, X_a)$, suppose $X_1, \ldots, X_n$ is a random sample from the distribution $F$. Let $\theta = T(F)$ and let $U_n$ be the U-statistic as defined in Equation (9.3). For $1 \le i \le a$, suppose that the values of $X_1, \ldots, X_i$ are held constant, say $X_1 = x_1, \ldots, X_i = x_i$, and the expectation of $\phi(X_1, \ldots, X_a)$ is taken. The result will of course be a function of $x_1, \ldots, x_i$, and we may view this procedure as projecting the random vector $(X_1, \ldots, X_a)$ onto the $(a-i)$-dimensional subspace in $\mathbb{R}^a$ given by $\{(x_1, \ldots, x_i, c_{i+1}, \ldots, c_a) : (c_{i+1}, \ldots, c_a) \in \mathbb{R}^{a-i}\}$. To this end, we define for $i = 1, \ldots, a$
\[
\phi_i(x_1, \ldots, x_i) = E_F\,\phi(x_1, \ldots, x_i, X_{i+1}, \ldots, X_a). \tag{9.4}
\]
Equivalently, we may define
\[
\phi_i(X_1, \ldots, X_i) = E_F\{\phi(X_1, \ldots, X_a) \mid X_1, \ldots, X_i\}. \tag{9.5}
\]
From Equation (9.5), we obtain $E_F\,\phi_i(X_1, \ldots, X_i) = \theta$ for all $i$. Furthermore, we define
\[
\sigma^2_i = \mathrm{Var}_F\,\phi_i(X_1, \ldots, X_i). \tag{9.6}
\]

The variance of $U_n$ can be expressed in terms of the $\sigma^2_i$ as follows:

Theorem 9.1 The variance of a U-statistic is
\[
\mathrm{Var}_F\,U_n = \frac{1}{\binom{n}{a}}\sum_{k=1}^a \binom{a}{k}\binom{n-a}{a-k}\sigma^2_k;
\]
if $\sigma^2_1, \ldots, \sigma^2_a$ are all finite, then
\[
\mathrm{Var}_F\,U_n = \frac{a^2\sigma^2_1}{n} + o\left(\frac{1}{n}\right).
\]
Theorem 9.1 is proved in Exercise 9.4.

Now we derive the asymptotic normality of $U_n$ in a sequence of steps. The basic idea will be to show that $U_n - \theta$ has the same limiting distribution as the sum
\[
\tilde U_n = \sum_{j=1}^n E_F(U_n - \theta \mid X_j) \tag{9.7}
\]
of projections. The asymptotic distribution of $\tilde U_n$ is easy to derive because it may be shown to be the sum of independent and identically distributed random variables.

Lemma 9.1 For all $1 \le j \le n$,
\[
E_F(U_n - \theta \mid X_j) = \frac{a}{n}\{\phi_1(X_j) - \theta\}.
\]

Proof: Note that
\[
E_F(U_n - \theta \mid X_j) = \frac{1}{\binom{n}{a}}\sum_{1\le i_1<\cdots<i_a\le n} E_F\{\phi(X_{i_1}, \ldots, X_{i_a}) - \theta \mid X_j\},
\]
where $E_F\{\phi(X_{i_1}, \ldots, X_{i_a}) - \theta \mid X_j\}$ equals $\phi_1(X_j) - \theta$ whenever $j$ is among $i_1, \ldots, i_a$ and 0 otherwise. The number of ways to choose $i_1, \ldots, i_a$ so that $j$ is among them is $\binom{n-1}{a-1}$, so we obtain
\[
E_F(U_n - \theta \mid X_j) = \frac{\binom{n-1}{a-1}}{\binom{n}{a}}\{\phi_1(X_j) - \theta\} = \frac{a}{n}\{\phi_1(X_j) - \theta\}.
\]


Lemma 9.2 If $\sigma^2_1 < \infty$ and $\tilde U_n$ is defined as in Equation (9.7), then
\[
\sqrt{n}\,\tilde U_n \stackrel{d}{\to} N(0, a^2\sigma^2_1).
\]
Proof: Lemma 9.2 follows immediately from Lemma 9.1 and the central limit theorem since $a\phi_1(X_j)$ has mean $a\theta$ and variance $a^2\sigma^2_1$.

Now that we know the asymptotic distribution of $\tilde U_n$, it remains to show that $U_n - \theta$ and $\tilde U_n$ have the same limiting behavior.

Lemma 9.3
\[
E_F\bigl\{\tilde U_n(U_n - \theta)\bigr\} = E_F\,\tilde U_n^2.
\]

Proof: By Equation (9.7) and Lemma 9.1, $E_F\,\tilde U_n^2 = a^2\sigma^2_1/n$. Furthermore,
\begin{align*}
E_F\bigl\{\tilde U_n(U_n - \theta)\bigr\} &= \frac{a}{n}\sum_{j=1}^n E_F\bigl\{(\phi_1(X_j) - \theta)(U_n - \theta)\bigr\} \\
&= \frac{a}{n}\sum_{j=1}^n E_F\bigl[E_F\{(\phi_1(X_j) - \theta)(U_n - \theta) \mid X_j\}\bigr] \\
&= \frac{a^2}{n^2}\sum_{j=1}^n E_F\{\phi_1(X_j) - \theta\}^2 = \frac{a^2\sigma^2_1}{n}.
\end{align*}

Lemma 9.4 If $\sigma^2_i < \infty$ for $i = 1, \ldots, a$, then
\[
\sqrt{n}\left(U_n - \theta - \tilde U_n\right) \stackrel{P}{\to} 0.
\]
Proof: It suffices to show that
\[
E_F\left\{\sqrt{n}\,(U_n - \theta - \tilde U_n)\right\}^2 \to 0.
\]
By Lemma 9.3, $n\,E_F\left(U_n - \theta - \tilde U_n\right)^2 = n\left(\mathrm{Var}_F\,U_n - E_F\,\tilde U_n^2\right)$. By Theorem 9.1,
\[
n(\mathrm{Var}_F\,U_n - E_F\,\tilde U_n^2) = o(1).
\]
Finally, since $\sqrt{n}(U_n - \theta) = \sqrt{n}\,\tilde U_n + \sqrt{n}(U_n - \theta - \tilde U_n)$, Lemmas 9.2 and 9.4 along with Slutsky's theorem result in the theorem we had set out to prove:

Theorem 9.2 If $\sigma^2_i < \infty$ for $i = 1, \ldots, a$, then
\[
\sqrt{n}(U_n - \theta) \stackrel{d}{\to} N(0, a^2\sigma^2_1). \tag{9.8}
\]
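As an illustration of Theorem 9.2 (a hedged R sketch, not from the original notes; the normal population, sample size, and replication count are arbitrary choices), take the kernel $\phi(x_1,x_2)=(x_1-x_2)^2/2$, whose U-statistic is the usual sample variance $S_n^2$. Here $a = 2$ and one can check that $a^2\sigma_1^2 = \mu_4 - \sigma^4$, which equals 2 for standard normal data:

# Simulation check of Theorem 9.2 for phi(x1, x2) = (x1 - x2)^2 / 2,
# whose U-statistic is the sample variance (theta = sigma^2 = 1 for N(0,1) data)
set.seed(1)
n <- 200
u <- replicate(10000, var(rnorm(n)))   # var() uses the n - 1 denominator, i.e., U_n
n * var(u)                             # should be close to a^2 sigma_1^2 = mu_4 - sigma^4 = 2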


Exercises for Section 9.2

Exercise 9.4 Prove Theorem 9.1, as follows:

(a) Prove that
\[
E\,\phi(X_1, \ldots, X_a)\,\phi(X_1, \ldots, X_k, X_{a+1}, \ldots, X_{a+(a-k)}) = \sigma^2_k + \theta^2
\]
and thus $\mathrm{Cov}\{\phi(X_1, \ldots, X_a), \phi(X_1, \ldots, X_k, X_{a+1}, \ldots, X_{a+(a-k)})\} = \sigma^2_k$.

(b) Show that
\[
\mathrm{Var}\left\{\binom{n}{a}U_n\right\} = \binom{n}{a}\sum_{k=1}^a \binom{a}{k}\binom{n-a}{a-k}\mathrm{Cov}\{\phi(X_1, \ldots, X_a), \phi(X_1, \ldots, X_k, X_{a+1}, \ldots, X_{a+(a-k)})\}
\]
and then use part (a) to prove the first equation of Theorem 9.1.

(c) Verify the second equation of Theorem 9.1.

Exercise 9.5 Suppose a kernel function $\phi(x_1, \ldots, x_a)$ satisfies $E|\phi(X_{i_1}, \ldots, X_{i_a})| < \infty$ for any (not necessarily distinct) $i_1, \ldots, i_a$. Prove that if $U_n$ and $V_n$ are the corresponding U- and V-statistics, then $\sqrt{n}(V_n - U_n) \stackrel{P}{\to} 0$, so that $V_n$ has the same asymptotic distribution as $U_n$.

Hint: Verify and use the equation
\[
V_n - U_n = \left\{V_n - \frac{1}{n^a}\sum_{\text{all } i_j \text{ distinct}} \phi(X_{i_1}, \ldots, X_{i_a})\right\} + \left[\frac{1}{n^a} - \frac{1}{a!\binom{n}{a}}\right]\sum_{\text{all } i_j \text{ distinct}} \phi(X_{i_1}, \ldots, X_{i_a}).
\]

Exercise 9.6 For the kernel function of Example 9.1, $\phi(a, b) = |a - b|$, the corresponding U-statistic is called Gini's mean difference and it is denoted $G_n$. For a random sample from uniform$(0, \tau)$, find the asymptotic distribution of $G_n$.

Exercise 9.7 Let $\phi(x_1, x_2, x_3)$ have the property
\[
\phi(a + bx_1, a + bx_2, a + bx_3) = \phi(x_1, x_2, x_3)\,\mathrm{sgn}(b) \quad\text{for all } a, b. \tag{9.9}
\]
Let $\theta = E\,\phi(X_1, X_2, X_3)$. The function $\mathrm{sgn}(x)$ is defined as $I\{x > 0\} - I\{x < 0\}$.

(a) We define the distribution $F$ to be symmetric if for $X \sim F$, there exists some $\mu$ (the center of symmetry) such that $X - \mu$ and $\mu - X$ have the same distribution. Prove that if $F$ is symmetric then $\theta = 0$.

(b) Let $\bar x$ and $\tilde x$ denote the mean and median of $x_1, x_2, x_3$. Let $\phi(x_1, x_2, x_3) = \mathrm{sgn}(\bar x - \tilde x)$. Show that this function satisfies criterion (9.9), then find the asymptotic distribution for the corresponding U-statistic if $F$ is the standard uniform distribution.


Exercise 9.8 If the arguments of the kernel function $\phi(x_1, \ldots, x_a)$ of a U-statistic are vectors instead of scalars, note that Theorem 9.2 still applies with no modification. With this in mind, consider for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ the kernel $\phi(\mathbf{x}, \mathbf{y}) = I\{(y_1 - x_1)(y_2 - x_2) > 0\}$.

(a) Given a simple random sample $\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(n)}$, if $U_n$ denotes the U-statistic corresponding to the kernel above, the statistic $2U_n - 1$ is called Kendall's tau statistic. Suppose the marginal distributions of $X^{(i)}_1$ and $X^{(i)}_2$ are both continuous, with $X^{(i)}_1$ and $X^{(i)}_2$ independent. Find the asymptotic distribution of $\sqrt{n}(U_n - \theta)$ for an appropriate value of $\theta$.

(b) To test the null hypothesis that a sample $Z_1, \ldots, Z_n$ is independent and identically distributed against the alternative hypothesis that the $Z_i$ are stochastically increasing in $i$, suppose we reject the null hypothesis if the number of pairs $(Z_i, Z_j)$ with $Z_i < Z_j$ and $i < j$ is greater than $c_n$. This test is called Mann's test against trend. Based on your answer to part (a), find $c_n$ so that the test has asymptotic level .05.

(c) Estimate the true level of the test in part (b) for a simple random sample of size $n$ from a standard normal distribution for each $n \in \{5, 15, 75\}$. Use 5000 samples in each case.

9.3 Generalizing U-statistics

In this section, we generalize the idea of U-statistics in two different directions. First, we consider single U-statistics for situations in which there is more than one sample. Next, we consider the joint asymptotic distribution of two (single-sample) U-statistics.

We begin by generalizing the idea of U-statistics to the case in which we have more than one random sample. Suppose that $X_{i1}, \ldots, X_{in_i}$ is a simple random sample from $F_i$ for all $1 \le i \le s$. In other words, we have $s$ random samples, each potentially from a different distribution, and $n_i$ is the size of the $i$th sample. We may define a statistical functional
\[
\theta = E\,\phi(X_{11}, \ldots, X_{1a_1};\ X_{21}, \ldots, X_{2a_2};\ \cdots;\ X_{s1}, \ldots, X_{sa_s}). \tag{9.10}
\]

Notice that the kernel $\phi$ in Equation (9.10) has $a_1 + a_2 + \cdots + a_s$ arguments; furthermore, we assume that the first $a_1$ of them may be permuted without changing the value of $\phi$, the next $a_2$ of them may be permuted without changing the value of $\phi$, etc. In other words, there are $s$ distinct blocks of arguments of $\phi$, and $\phi$ is symmetric in its arguments within each of these blocks.

The U-statistic corresponding to the expectation functional (9.10) is
\[
U_N = \frac{1}{\binom{n_1}{a_1}}\cdots\frac{1}{\binom{n_s}{a_s}}\sum_{1\le i_1<\cdots<i_{a_1}\le n_1}\cdots\sum_{1\le r_1<\cdots<r_{a_s}\le n_s} \phi\left(X_{1i_1}, \ldots, X_{1i_{a_1}};\ \cdots;\ X_{sr_1}, \ldots, X_{sr_{a_s}}\right). \tag{9.11}
\]
Note that $N$ denotes the total of all the sample sizes: $N = n_1 + \cdots + n_s$.

As we did in the case of single-sample U-statistics, define for $0 \le j_1 \le a_1, \ldots, 0 \le j_s \le a_s$
\[
\phi_{j_1\cdots j_s}(X_{11}, \ldots, X_{1j_1}; \cdots; X_{s1}, \ldots, X_{sj_s}) = E\{\phi(X_{11}, \ldots, X_{1a_1}; \cdots; X_{s1}, \ldots, X_{sa_s}) \mid X_{11}, \ldots, X_{1j_1}, \cdots, X_{s1}, \ldots, X_{sj_s}\} \tag{9.12}
\]
and
\[
\sigma^2_{j_1\cdots j_s} = \mathrm{Var}\,\phi_{j_1\cdots j_s}(X_{11}, \ldots, X_{1j_1}; \cdots; X_{s1}, \ldots, X_{sj_s}). \tag{9.13}
\]

By an argument similar to the one used in the proof of Theorem 9.1, but much more tedious notationally, we can show that
\[
\sigma^2_{j_1\cdots j_s} = \mathrm{Cov}\bigl\{\phi(X_{11}, \ldots, X_{1a_1}; \cdots; X_{s1}, \ldots, X_{sa_s}),\ \phi(X_{11}, \ldots, X_{1j_1}, X_{1,a_1+1}, \ldots, X_{1,a_1+(a_1-j_1)}; \cdots; X_{s1}, \ldots, X_{sj_s}, X_{s,a_s+1}, \ldots, X_{s,a_s+(a_s-j_s)})\bigr\}. \tag{9.14}
\]
Notice that some of the $j_i$ may equal 0. This was not true in the single-sample case, since $\phi_0$ would have merely been the constant $\theta$, so $\sigma^2_0 = 0$.

In the special case when $s = 2$, Equations (9.12), (9.13) and (9.14) become
\[
\phi_{ij}(X_1, \ldots, X_i; Y_1, \ldots, Y_j) = E\{\phi(X_1, \ldots, X_{a_1}; Y_1, \ldots, Y_{a_2}) \mid X_1, \ldots, X_i, Y_1, \ldots, Y_j\},
\]
\[
\sigma^2_{ij} = \mathrm{Var}\,\phi_{ij}(X_1, \ldots, X_i; Y_1, \ldots, Y_j),
\]
and
\[
\sigma^2_{ij} = \mathrm{Cov}\bigl\{\phi(X_1, \ldots, X_{a_1}; Y_1, \ldots, Y_{a_2}),\ \phi(X_1, \ldots, X_i, X_{a_1+1}, \ldots, X_{a_1+(a_1-i)}; Y_1, \ldots, Y_j, Y_{a_2+1}, \ldots, Y_{a_2+(a_2-j)})\bigr\},
\]
respectively, for $0 \le i \le a_1$ and $0 \le j \le a_2$.

Although we will not derive it here as we did for the single-sample case, there is an analogous asymptotic normality result for multisample U-statistics, as follows.

Theorem 9.3 Suppose that for $i = 1, \ldots, s$, $X_{i1}, \ldots, X_{in_i}$ is a random sample from the distribution $F_i$ and that these $s$ samples are independent of each other. Suppose further that there exist constants $\rho_1, \ldots, \rho_s$ in the interval $(0,1)$ such that $n_i/N \to \rho_i$ for all $i$ and that $\sigma^2_{a_1\cdots a_s} < \infty$. Then
\[
\sqrt{N}(U_N - \theta) \stackrel{d}{\to} N(0, \sigma^2),
\]
where
\[
\sigma^2 = \frac{a_1^2}{\rho_1}\sigma^2_{10\cdots0} + \cdots + \frac{a_s^2}{\rho_s}\sigma^2_{0\cdots01}.
\]

Although the notation required for the multisample U-statistic theory is nightmarish, life becomes considerably simpler in the case $s = 2$ and $a_1 = a_2 = 1$, in which case we obtain
\[
U_N = \frac{1}{n_1n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \phi(X_{1i}; X_{2j}).
\]
Equivalently, we may assume that $X_1, \ldots, X_m$ are a simple random sample from $F$ and $Y_1, \ldots, Y_n$ are a simple random sample from $G$, which gives
\[
U_N = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi(X_i; Y_j). \tag{9.15}
\]


In the case of the U-statistic of Equation (9.15), Theorem 9.3 states that
\[
\sqrt{N}(U_N - \theta) \stackrel{d}{\to} N\left(0,\ \frac{\sigma^2_{10}}{\rho} + \frac{\sigma^2_{01}}{1-\rho}\right),
\]
where $\rho = \lim m/N$, $\sigma^2_{10} = \mathrm{Cov}\{\phi(X_1; Y_1), \phi(X_1; Y_2)\}$, and $\sigma^2_{01} = \mathrm{Cov}\{\phi(X_1; Y_1), \phi(X_2; Y_1)\}$.

Example 9.3 For independent random samples $X_1, \ldots, X_m$ from $F$ and $Y_1, \ldots, Y_n$ from $G$, consider the Wilcoxon rank-sum statistic $W$, defined to be the sum of the ranks of the $Y_i$ among the combined sample. We may show that
\[
W = \frac{1}{2}n(n+1) + \sum_{i=1}^m\sum_{j=1}^n I\{X_i < Y_j\}.
\]
Therefore, if we let $\phi(x; y) = I\{x < y\}$, then the corresponding two-sample U-statistic $U_N$ is related to $W$ by $W = \frac{1}{2}n(n+1) + mnU_N$. Therefore, we may use Theorem 9.3 to obtain the asymptotic normality of $U_N$, and therefore of $W$. However, we make no assumption here that $F$ and $G$ are merely shifted versions of one another. Thus, we may now obtain in principle the asymptotic distribution of the rank-sum statistic for any two distributions $F$ and $G$ that we wish, so long as they have finite second moments.
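This relationship is easy to verify numerically (an R sketch with arbitrary simulated samples, not part of the original notes):

# Check the identity W = n(n+1)/2 + m*n*U_N for the rank-sum statistic
set.seed(1)
m <- 8; n <- 6
x <- rexp(m); y <- rexp(n, rate = 0.5)
W  <- sum(rank(c(x, y))[(m + 1):(m + n)])   # sum of the ranks of the Y's
UN <- mean(outer(x, y, "<"))                # two-sample U-statistic with kernel I{x < y}
c(W, n * (n + 1) / 2 + m * n * UN)          # the two values agree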

The other direction in which we will generalize the development of U-statistics is consideration of the joint distribution of two single-sample U-statistics. Suppose that there are two kernel functions, $\phi(x_1, \ldots, x_a)$ and $\varphi(x_1, \ldots, x_b)$, and we define the two corresponding U-statistics
\[
U^{(1)}_n = \frac{1}{\binom{n}{a}}\sum_{1\le i_1<\cdots<i_a\le n} \phi(X_{i_1}, \ldots, X_{i_a})
\]
and
\[
U^{(2)}_n = \frac{1}{\binom{n}{b}}\sum_{1\le j_1<\cdots<j_b\le n} \varphi(X_{j_1}, \ldots, X_{j_b})
\]
for a single random sample $X_1, \ldots, X_n$ from $F$. Define $\theta_1 = E\,U^{(1)}_n$ and $\theta_2 = E\,U^{(2)}_n$. Furthermore, define $\gamma_{ij}$ to be the covariance between $\phi_i(X_1, \ldots, X_i)$ and $\varphi_j(X_1, \ldots, X_j)$, where $\phi_i$ and $\varphi_j$ are defined as in Equation (9.5). Letting $k = \min\{i, j\}$, it may be proved that
\[
\gamma_{ij} = \mathrm{Cov}\left\{\phi(X_1, \ldots, X_a),\ \varphi(X_1, \ldots, X_k, X_{a+1}, \ldots, X_{a+(b-k)})\right\}. \tag{9.16}
\]
Note in particular that $\gamma_{ij}$ depends only on the value of $\min\{i, j\}$.

The following theorem, stated without proof, gives the joint asymptotic distribution of $U^{(1)}_n$ and $U^{(2)}_n$.

Theorem 9.4 Suppose $X_1, \ldots, X_n$ is a random sample from $F$ and that $\phi : \mathbb{R}^a \to \mathbb{R}$ and $\varphi : \mathbb{R}^b \to \mathbb{R}$ are two kernel functions satisfying $\mathrm{Var}\,\phi(X_1, \ldots, X_a) < \infty$ and $\mathrm{Var}\,\varphi(X_1, \ldots, X_b) < \infty$. Define $\tau^2_1 = \mathrm{Var}\,\phi_1(X_1)$ and $\tau^2_2 = \mathrm{Var}\,\varphi_1(X_1)$, and let $\gamma_{ij}$ be defined as in Equation (9.16). Then
\[
\sqrt{n}\left\{\begin{pmatrix} U^{(1)}_n \\ U^{(2)}_n \end{pmatrix} - \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}\right\} \stackrel{d}{\to} N\left\{\begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} a^2\tau^2_1 & ab\gamma_{11} \\ ab\gamma_{11} & b^2\tau^2_2 \end{pmatrix}\right\}.
\]


Exercises for Section 9.3

Exercise 9.9 Suppose $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independent random samples from distributions Unif$(0, \theta)$ and Unif$(\mu, \mu + \theta)$, respectively. Assume $m/N \to \rho$ as $m, n \to \infty$ and $0 < \mu < \theta$.

(a) Find the asymptotic distribution of the U-statistic of Equation (9.15), where $\phi(x; y) = I\{x < y\}$. In so doing, find a function $g(x)$ such that $E(U_N) = g(\mu)$.

(b) Find the asymptotic distribution of $g(\overline{Y} - \overline{X})$.

(c) Find the range of values of $\mu$ for which the Wilcoxon estimate of $g(\mu)$ is asymptotically more efficient than $g(\overline{Y} - \overline{X})$. (The asymptotic relative efficiency in this case is the ratio of asymptotic variances.)

Exercise 9.10 Solve each part of Problem 9.9, but this time under the assumptions that the independent random samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ satisfy $P(X_1 \le t) = P(Y_1 - \theta \le t) = t^2$ for $t \in [0,1]$ and $0 < \theta < 1$. As in Problem 9.9, assume $m/N \to \rho \in (0,1)$.

Exercise 9.11 Suppose $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independent random samples from distributions $N(0,1)$ and $N(\mu, 1)$, respectively. Assume $m/(m+n) \to 1/2$ as $m, n \to \infty$. Let $U_N$ be the U-statistic of Equation (9.15), where $\phi(x; y) = I\{x < y\}$. Suppose that $\theta(\mu)$ and $\sigma^2(\mu)$ are such that
\[
\sqrt{N}\,[U_N - \theta(\mu)] \stackrel{d}{\to} N[0, \sigma^2(\mu)].
\]
Calculate $\theta(\mu)$ and $\sigma^2(\mu)$ for $\mu \in \{.2, .5, 1, 1.5, 2\}$.

Hint: This problem requires a bit of numerical integration. There are a couple of ways you might do this. Mathematica will do it quite easily. There is a function called integrate in Splus and one called quad in MATLAB for integrating a function. If you cannot get any of these to work for you, let me know.

Exercise 9.12 Suppose $X_1, X_2, \ldots$ are independent and identically distributed with finite variance. Define
\[
S^2_n = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2
\]
and let $G_n$ be Gini's mean difference, the U-statistic defined in Problem 9.6. Note that $S^2_n$ is also a U-statistic, corresponding to the kernel function $\phi(x_1, x_2) = (x_1 - x_2)^2/2$.

(a) If the $X_i$ are distributed as Unif$(0, \theta)$, give the joint asymptotic distribution of $G_n$ and $S_n$ by first finding the joint asymptotic distribution of the U-statistics $G_n$ and $S^2_n$. Note that the covariance matrix need not be positive definite; in this problem, the covariance matrix is singular.

(b) The singular asymptotic covariance matrix in this problem implies that as $n \to \infty$, the joint distribution of $G_n$ and $S_n$ becomes concentrated on a line. Does this appear to be the case? For 1000 samples of size $n$ from Uniform$(0,1)$, plot scatterplots of $G_n$ against $S_n$. Take $n \in \{5, 25, 100\}$.
