Source: link.springer.com, 978-1-4612-4250-5, Appendix A.

Appendix A

Measure and Integration Theory

This appendix contains an introduction to the theory of measure and integration. The first section is an overview. It could serve either as a refresher for those who have previously studied the material or as an informal introduction for those who have never studied it.

A.1 Overview

A.1.1 Definitions In many introductory statistics and probability courses, one encounters discrete and continuous random variables and vectors. These are all special cases of a more general type of random quantity that we will study in this text. Before we can introduce the more general type of random quantity, we need to generalize the sums and integrals that figure so prominently in the distributions of discrete and continuous random variables and vectors. The generalization is through the concept of a measure (to be defined shortly), which is a way of assigning numerical values to the "sizes" of sets.

Example A.1. Let S be a nonempty set, and let A ⊆ S. Define μ(A) to be the number of elements of A. Then μ(S) > 0, μ(∅) = 0, and if A_1 ∩ A_2 = ∅, then μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2). Note that μ(A) = ∞ is possible if S has infinitely many elements. The measure μ described here is called counting measure on S.
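The behavior of counting measure is easy to check concretely. The following sketch (not from the text; the function name is ours) represents subsets of a finite S as Python frozensets and verifies the properties listed in Example A.1.

```python
# Illustrative sketch: counting measure on a finite set S.

def counting_measure(A):
    """mu(A) = the number of elements of A."""
    return len(A)

S = frozenset(range(10))
A1 = frozenset({1, 2, 3})
A2 = frozenset({7, 8})  # disjoint from A1

assert counting_measure(S) > 0
assert counting_measure(frozenset()) == 0
# Additivity on disjoint sets: mu(A1 u A2) = mu(A1) + mu(A2).
assert counting_measure(A1 | A2) == counting_measure(A1) + counting_measure(A2)
```

On an infinite S the same definition applies, with μ(A) = ∞ for infinite A.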

Example A.2. Let A be an interval of real numbers. If A is bounded, let μ(A) be the length of A. If A is unbounded, let μ(A) = ∞. It is easy to see that μ(ℝ) = ∞,¹ μ(∅) = 0, and if A_1 ∩ A_2 = ∅ and A_1 ∪ A_2 is an interval, then

¹By ℝ, we mean the set of real numbers.


μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2). The measure μ described here is called Lebesgue measure.

Example A.3. Let f : ℝ → ℝ^+ be a continuous function.² Define, for each interval A, μ(A) = ∫_A f(x)dx. Then μ(ℝ) > 0, μ(∅) = 0, and if A_1 ∩ A_2 = ∅ and A_1 ∪ A_2 is an interval, then μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2).

Since measure will be used to give sizes to sets, the domain of a measure will be a collection of sets. In general, we cannot assign sizes to all sets, but we need enough sets so that we can take unions and complements. A collection of sets that is closed under taking complements and finite unions is called a field. A field that is closed under taking countable unions is called a σ-field.

Example A.4. Let S be any set. Let A = {S, ∅}. This σ-field is called the trivial σ-field. As a second example, let A ⊂ S, and let A = {S, A, A^c, ∅}. Let B be another subset of S, and let A = {S, ∅, A, B, A^c, B^c, A ∩ B, A ∩ B^c, ...}. Such examples grow rapidly. The largest σ-field is the collection of all subsets of S, called the power set of S and denoted 2^S.
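For a finite S, the σ-field generated by a collection can be computed by brute force, which makes the "examples grow rapidly" remark concrete. The sketch below is ours (the function name is an invention, not the text's notation): it closes a collection under complements and unions until nothing new appears; closure under intersections then follows by De Morgan's laws.

```python
# Illustrative sketch: the sigma-field generated by a collection C of subsets
# of a *finite* set S, computed by closing under complements and unions.

def generated_sigma_field(S, C):
    S = frozenset(S)
    field = {frozenset(), S} | {frozenset(A) for A in C}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for A in current:
            if S - A not in field:  # closure under complements
                field.add(S - A)
                changed = True
            for B in current:
                if A | B not in field:  # closure under (finite) unions
                    field.add(A | B)
                    changed = True
    return field

S = {1, 2, 3, 4}
sigma_A = generated_sigma_field(S, [{1, 2}])
assert sigma_A == {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(S)}

# Generating by all singletons yields the power set 2^S, with 2^|S| elements.
power_set = generated_sigma_field(S, [{x} for x in S])
assert len(power_set) == 2 ** len(S)
```

On a finite set, countable and finite unions coincide, so this closure really is the generated σ-field.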

Example A.5. One field of subsets of ℝ is the collection of all unions of finitely many disjoint intervals (unbounded intervals are allowed). This collection is not a σ-field, however.

It is easy to prove that the intersection of an arbitrary collection of σ-fields is itself a σ-field. Since 2^S is a σ-field, it is easy to see that, for every collection of subsets C of S, there is a smallest σ-field A that contains C, namely the intersection of all σ-fields that contain C. This smallest σ-field is called the σ-field generated by C.

The most commonly used σ-field in this book will be the one generated by the collection C of open subsets of a topological space.³ This σ-field is called the Borel σ-field. It is easy to see that the Borel σ-field B^1 for ℝ is the σ-field generated by the intervals of the form [b, ∞). It is also the σ-field generated by the intervals of the form (−∞, a] and the σ-field generated by the intervals of the form (a, b). Since multidimensional Euclidean spaces are topological spaces, they also have Borel σ-fields.

An alternative way to generate the Borel σ-fields of the spaces ℝ^k is by means of product spaces. The σ-field generated by all product sets (one factor from each σ-field) in a product space is called the product σ-field. In ℝ^k, the product σ-field of one-dimensional Borel sets B^1 is the same as the Borel σ-field B^k in the k-dimensional space (Proposition A.35).

Sometimes, we need to extend ℝ to include points at infinity. The extended real numbers are the points in ℝ ∪ {∞, −∞}. The Borel σ-field B^+ of the extended real numbers consists of B^1 together with all sets of the form B ∪ {∞}, B ∪ {−∞}, and B ∪ {∞, −∞} for B ∈ B^1. It is easy to check that B^+ is a σ-field. (See Problem 4 on page 603.)

²By ℝ^+, we mean the open interval (0, ∞).
³A space X is a topological space if it has a collection V of subsets, called a topology, which satisfies the following conditions: ∅ ∈ V, X ∈ V, the intersection of finitely many elements of V is in V, and the union of arbitrarily many elements of V is in V. The sets in V are called open sets.


If A is a σ-field of subsets of a set S, then a measure μ on S is a function from A to the nonnegative extended real numbers that satisfies

• μ(∅) = 0,

• {A_n}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).

If μ is a measure, the triple (S, A, μ) is called a measure space. If (S, A, μ) is a measure space and μ(S) = 1, then μ is called a probability and (S, A, μ) is called a probability space.

Some examples of measures were given earlier. The Carathéodory extension theorem A.22 shows how to construct measures by first defining countably additive set functions on fields and then extending them to the generated σ-field. Lebesgue measure is defined in this manner by starting with length for unions of disjoint intervals.

Sets with measure zero are ubiquitous in measure theory, so there is a special term that allows us to refer to them more easily. If E is some statement concerning the points in S, and μ is a measure on S, we say that E is true almost everywhere with respect to μ, written a.e. [μ], if the set of s such that E is not true is contained in a set A with μ(A) = 0. If μ is a probability, then almost everywhere is often expressed as almost surely and denoted a.s. [μ].

Example A.6. It is well known that a nondecreasing function can have at most a countable number of discontinuities. Since countable sets have Lebesgue measure (length) 0, it follows that nondecreasing functions are continuous almost everywhere with respect to Lebesgue measure.

Infinite measures are difficult to deal with unless they behave like finite measures in certain important ways. If there exists a countable partition of the set S such that each element of the partition has finite μ measure, then we say that μ is σ-finite. When an abstract measure is mentioned in this text, it will generally be safe to assume that it is σ-finite unless the contrary is clear from context.

A.1.2 Measurable Functions There are certain types of functions with which we will be primarily concerned. Suppose that S is a set with a σ-field A of subsets, and let T be another set with a σ-field C of subsets. Suppose that f : S → T is a function. We say f is measurable if for every B ∈ C, f^{-1}(B) ∈ A. When there are several possible σ-fields of subsets of either S or T, we will need to say explicitly with respect to which σ-field f is measurable. If f is measurable, one-to-one, and onto and f^{-1} is measurable, we say that f is bimeasurable. If the two sets S and T are topological spaces with Borel σ-fields, a measurable function is called Borel measurable.

As examples, all continuous functions are Borel measurable. But many discontinuous functions are also measurable. For example, step functions are measurable. All monotone functions are measurable. In fact, it is very difficult to describe a nonmeasurable function without using some heavy mathematics.

If S and T are sets, C is a σ-field of subsets of T, and f : S → T is a function, then it is easy to show that f^{-1}(C) is a σ-field of subsets of S. In fact, it is the smallest σ-field of subsets of S such that f is measurable, and it is called the σ-field generated by f.
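On a finite example, the σ-field generated by a function can be computed directly as the collection of preimages. The sketch below is our own construction (the helper names are inventions), taking C to be the full power set 2^T.

```python
from itertools import chain, combinations

# Illustrative sketch: the sigma-field generated by f : S -> T is
# f^{-1}(C) = {f^{-1}(B) : B in C}; here C = 2^T and everything is finite.

def preimage(f, S, B):
    return frozenset(s for s in S if f(s) in B)

def sigma_field_generated_by(f, S, T):
    subsets_T = chain.from_iterable(
        combinations(sorted(T), r) for r in range(len(T) + 1)
    )
    return {preimage(f, S, frozenset(B)) for B in subsets_T}

S = {1, 2, 3, 4}
T = {"even", "odd"}
f = lambda s: "even" if s % 2 == 0 else "odd"

sigma_f = sigma_field_generated_by(f, S, T)
# The coarsest sigma-field making f measurable: {0, evens, odds, S}.
assert sigma_f == {frozenset(), frozenset({2, 4}), frozenset({1, 3}), frozenset(S)}
```

Any function measurable with respect to sigma_f can only distinguish even from odd inputs, which previews Theorem A.42 below.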


Some useful properties of measurable functions are in Theorem A.38. To summarize, multivariate functions with measurable coordinates are measurable; compositions of measurable functions are measurable; sums, products, and ratios of measurable functions are measurable; limits, suprema, and infima of sequences of measurable functions are measurable.

As an application of the preceding results, we have Theorem A.42, which says that one function g is a function of another f if and only if g is measurable with respect to the σ-field generated by f.

Many theorems about measurable functions are proven first for a special class of measurable functions called simple functions and then extended to all measurable functions using some limit theorems. A measurable function f is called simple if it assumes only finitely many distinct values. The most fundamental limit theorem is Theorem A.41, which says that every nonnegative measurable function can be approached from below (pointwise) by a sequence of nonnegative simple functions.

A.1.3 Integration The integral of a function with respect to a measure is a way to generalize the Riemann integral. The interested readers should be able to convince themselves that the integral as defined here is an extension of the Riemann integral. That is, if the Riemann integral of a function over a closed and bounded interval exists, then so does the integral as defined here, and the two are equal. We define the integral in stages. We start with nonnegative simple functions. If f is a nonnegative simple function represented as f(s) = Σ_{i=1}^n a_i I_{A_i}(s), with the a_i distinct and the A_i mutually disjoint, then the integral of f with respect to μ is ∫ f(s)dμ(s) = Σ_{i=1}^n a_i μ(A_i). If 0 times ∞ occurs in such a sum, the result is 0 by convention. The integral of a nonnegative simple function is allowed to be ∞.
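The definition for simple functions is a finite computation, which the following sketch (our own, with counting measure standing in for μ) makes explicit, including the 0 · ∞ = 0 convention.

```python
# Illustrative sketch: the integral of a nonnegative simple function
# f = sum_i a_i * I_{A_i} with respect to a measure mu is sum_i a_i * mu(A_i).

def integral_simple(terms, mu):
    """terms: list of (a_i, A_i) pairs, a_i >= 0 distinct, A_i disjoint."""
    total = 0.0
    for a, A in terms:
        if a == 0:
            continue  # convention: 0 times infinity (or anything) is 0
        total += a * mu(A)
    return total

counting = len  # counting measure on finite sets

# f = 2 on {1,2,3}, 5 on {4}, 0 elsewhere:
f = [(2.0, frozenset({1, 2, 3})), (5.0, frozenset({4}))]
assert integral_simple(f, counting) == 2.0 * 3 + 5.0 * 1  # = 11.0
```

With counting measure the integral is just a weighted count, which is why sums are the special case of integrals used for discrete distributions.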

For general nonnegative measurable functions, we define the integral of f with respect to μ as ∫ f(s)dμ(s) = sup_{g ≤ f, g simple} ∫ g(s)dμ(s). For general functions f, let f^+(s) = max{f(s), 0} and f^-(s) = −min{f(s), 0} (the positive and negative parts of f, respectively). Then f(s) = f^+(s) − f^-(s). The integral of f with respect to μ is

∫ f(s)dμ(s) = ∫ f^+(s)dμ(s) − ∫ f^-(s)dμ(s),

if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that f is integrable if the integral of f is defined and is finite. The integral is defined above in terms of its values at all points in S. Sometimes we wish to consider only a subset A ⊆ S. The integral of f over A with respect to μ is

∫_A f(s)dμ(s) = ∫ I_A(s)f(s)dμ(s).

Several important properties of integrals will be needed in this text. Proposition A.49 and Theorem A.53 state a few of the simpler ones, namely that functions that are almost everywhere equal have the same integral, that the integral of a linear combination of functions is the linear combination of the integrals, that


smaller functions have smaller integrals, and that two integrable functions that have the same integral over every set are equal almost everywhere. Another useful property, given in Theorem A.54, is that a nonnegative integrable function f leads to a new measure ν by means of the equation ν(A) = ∫_A f(s)dμ(s).

The most important theorems concern the interchange of limits with integration. Let {f_n}_{n=1}^∞ be a sequence of measurable functions such that f_n(x) → f(x) a.e. [μ]. The monotone convergence theorem A.52 says that if the f_n are nonnegative and f_n(x) ↑ f(x) a.e. [μ], then

lim_{n→∞} ∫ f_n(x)dμ(x) = ∫ f(x)dμ(x).    (A.7)

The dominated convergence theorem A.57 says that if there exists an integrable function g such that |f_n(x)| ≤ g(x) a.e. [μ], then (A.7) holds.
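A numerical sketch of monotone convergence may help; the setup is our own, not from the text. Take counting measure on the positive integers, f(s) = 1/s², and f_n = f · I_{{1,...,n}}; then f_n increases pointwise to f, and (A.7) says the integrals Σ_{s≤n} 1/s² must increase to ∫ f dμ = Σ_{s=1}^∞ 1/s² = π²/6.

```python
import math

# Sketch: monotone convergence on the positive integers with counting
# measure, f(s) = 1/s^2, f_n = f on {1,...,n} and 0 elsewhere.

def integral_fn(n):
    # integral of f_n with respect to counting measure = sum over {1,...,n}
    return sum(1.0 / s**2 for s in range(1, n + 1))

limit = math.pi**2 / 6  # the integral of f itself
vals = [integral_fn(n) for n in (10, 100, 10_000)]
assert vals == sorted(vals)            # the integrals are nondecreasing
assert abs(vals[-1] - limit) < 1e-3    # and they converge to the limit
```

The constant function g ≡ 1/s² also dominates every f_n here, so the same conclusion follows from dominated convergence.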

Part 1 of Theorem A.38 says that measurable functions into each of two measurable spaces combine into a jointly measurable function. Measures and integration can also be extended from several spaces into the product space. For example, suppose that μ_i is a measure on the space (S_i, A_i) for i = 1, 2. To define a measure on (S_1 × S_2, A_1 ⊗ A_2), we can proceed as follows. For each product set A = A_1 × A_2, define μ_1 × μ_2(A) = μ_1(A_1)μ_2(A_2). The Carathéodory extension theorem A.22 allows us to extend this definition to all of the product space. Lebesgue measure on ℝ², denoted dxdy, is such a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables.

Extending integration to product spaces proceeds through two famous theorems. Tonelli's theorem A.69 says that a nonnegative function f satisfies

∫ f(x, y)dμ_1 × μ_2(x, y) = ∫ [∫ f(x, y)dμ_1(x)] dμ_2(y) = ∫ [∫ f(x, y)dμ_2(y)] dμ_1(x).

Fubini's theorem A.70 says that the same equations hold if f is integrable with respect to μ_1 × μ_2. These results also extend to finite product spaces S_1 × ··· × S_n.
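With counting measure on a finite product space (a construction of ours, not the text's), the product-measure integral is a double sum, and Tonelli/Fubini reduce to the familiar fact that the order of summation can be interchanged.

```python
# Sketch: Tonelli/Fubini for counting measure on a finite grid S1 x S2.
# The joint integral is a double sum; the iterated integrals are nested sums.

S1, S2 = range(1, 5), range(1, 7)
f = lambda x, y: x * y + 1.0  # nonnegative, so Tonelli applies

joint = sum(f(x, y) for x in S1 for y in S2)
iterated_xy = sum(sum(f(x, y) for x in S1) for y in S2)
iterated_yx = sum(sum(f(x, y) for y in S2) for x in S1)
assert joint == iterated_xy == iterated_yx
```

For functions that change sign, Fubini requires integrability of |f|; for infinite sums that hypothesis is exactly absolute convergence, without which rearrangement can change the answer.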

A.1.4 Absolute Continuity A special type of relationship between two measures on the same space is called absolute continuity. If μ_1 and μ_2 are two measures on the same space, we say that μ_2 is absolutely continuous with respect to μ_1, denoted μ_2 ≪ μ_1, if μ_1(A) = 0 implies μ_2(A) = 0. When μ_2 ≪ μ_1, we say that μ_1 is a dominating measure for μ_2. Here are some examples:

Example A.8.
• Let f be any nonnegative measurable function and let μ_1 be a measure. Define μ_2(A) = ∫_A f(s)dμ_1(s). (See Theorem A.54.) Then μ_2 ≪ μ_1.

• Let S be the natural numbers and let a_1, a_2, ... be any sequence of nonnegative numbers. Define μ_1 to be counting measure on S, and let μ_2(A) = Σ_{i∈A} a_i. Then μ_2 ≪ μ_1.


• Let μ_1, μ_2, ... be a collection of measures on the same space (S, A). Let a_1, a_2, ... be a collection of positive numbers. Then μ = Σ_{i=1}^∞ a_i μ_i is a measure and μ_i ≪ μ for all i.

The last example above is important because it tells us that for every countable collection of measures, there is a single measure such that all measures in the collection are absolutely continuous with respect to it.

The Radon-Nikodym theorem A.74 says that the first part of Example A.8 is the most general form of absolute continuity with respect to σ-finite measures. That is, if μ_1 is σ-finite and μ_2 ≪ μ_1, then there exists an extended real-valued measurable function f such that μ_2(A) = ∫_A f(x)dμ_1(x). In addition, if g is μ_2 integrable, then ∫ g(x)dμ_2(x) = ∫ g(x)f(x)dμ_1(x). The function f is called the Radon-Nikodym derivative of μ_2 with respect to μ_1 and is usually denoted (dμ_2/dμ_1)(s).
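In the discrete setting of the second part of Example A.8 the Radon-Nikodym derivative can be exhibited directly: with μ_1 counting measure, f(i) = μ_2({i})/μ_1({i}) = a_i. The sketch below (data and names ours) checks both conclusions of the theorem on a small example.

```python
# Discrete sketch: mu1 = counting measure on S, mu2(A) = sum of a_i over A.
# Then mu2 << mu1 and d(mu2)/d(mu1)(i) = a_i.

a = {1: 0.5, 2: 2.0, 3: 0.0, 4: 1.5}
S = set(a)

mu1 = len                                # counting measure
mu2 = lambda A: sum(a[i] for i in A)

# The Radon-Nikodym derivative, computed pointwise:
f = {i: mu2({i}) / mu1({i}) for i in S}

# mu2(A) = integral over A of f d(mu1), for every A:
for A in ({1, 2}, {3}, S, set()):
    assert mu2(A) == sum(f[i] for i in A)

# Change of measure: integral of g d(mu2) = integral of g*f d(mu1).
g = lambda i: i**2
assert sum(g(i) * a[i] for i in S) == sum(g(i) * f[i] for i in S)
```

Note that f can be 0 where a_i = 0; the derivative is only determined up to sets of μ_1-measure 0, which on this finite S means it is unique.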

A similar theorem, A.81, relates integrals with respect to measures on two different spaces. It says that a function f : S_1 → S_2 induces a measure on the range S_2. If μ_1 is a measure on S_1, then define μ_2(A) = μ_1(f^{-1}(A)). Integrals with respect to μ_2 can be written as integrals with respect to μ_1 in the following way: ∫ g(y)dμ_2(y) = ∫ g(f(x))dμ_1(x). The measure μ_2 is called the measure induced on S_2 by f from μ_1.
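The induced measure and its change-of-variables formula can be verified on a finite example (entirely our own construction):

```python
# Sketch: the measure induced on S2 by f from mu1 is mu2(A) = mu1(f^{-1}(A)),
# and integral of g d(mu2) = integral of g(f(x)) d(mu1).

S1 = {1, 2, 3, 4, 5}
w = {1: 0.1, 2: 0.3, 3: 0.2, 4: 0.25, 5: 0.15}    # mu1({x}) = w[x]
mu1 = lambda A: sum(w[x] for x in A)

f = lambda x: x % 2                                 # f : S1 -> S2 = {0, 1}
mu2 = lambda A: mu1({x for x in S1 if f(x) in A})   # the induced measure

assert abs(mu2({0}) - (0.3 + 0.25)) < 1e-12         # f^{-1}({0}) = {2, 4}

g = lambda y: 10.0 if y == 0 else -1.0
lhs = sum(g(y) * mu2({y}) for y in {0, 1})          # integral of g d(mu2)
rhs = sum(g(f(x)) * w[x] for x in S1)               # integral of g(f(x)) d(mu1)
assert abs(lhs - rhs) < 1e-12
```

When μ_1 is a probability, μ_2 is exactly the distribution of the random quantity f, which is how this theorem is used throughout the text.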

A.2 Measures

A measure is a way of assigning numerical values to the "sizes" of sets. The collection of sets whose sizes are given by a measure is a σ-field. (See Examples A.4 and A.5 on page 571.)

Definition A.9. A nonempty collection of subsets A of a set S is called a field if

• A ∈ A implies⁴ A^c ∈ A,
• A_1, A_2 ∈ A implies A_1 ∪ A_2 ∈ A.

A field A is called a σ-field if {A_n}_{n=1}^∞ ∈ A implies ∪_{i=1}^∞ A_i ∈ A.

Proposition A.10. Let ℵ be an arbitrary set of indices, and let Y = {A_α : α ∈ ℵ} be an arbitrary collection of σ-fields of subsets of a set S. Then ∩_{α∈ℵ} A_α is also a σ-field of subsets of S.

Because of Proposition A.10 and the fact that 2^S is a σ-field, it is easy to see that, for every collection of subsets C of S, there is a smallest σ-field A that contains C, namely the intersection of all σ-fields that contain C.

Definition A.11. Let C be the collection of intervals in ℝ. The smallest σ-field containing C is called the Borel σ-field. In general, if S is a topological space, and B is the smallest σ-field that contains all of the open sets, then B is called the Borel σ-field.

⁴The symbol A^c stands for the complement of the set A.


In addition to the Borel σ-field, the product σ-field is also generated by a simple collection of sets.

Definition A.12.

• Let ℵ be an index set, and let {S_α}_{α∈ℵ} be a collection of sets. Define S = Π_{α∈ℵ} S_α. We call S a product space.

• For each α ∈ ℵ, let A_α be a σ-field of subsets of S_α. Define the product σ-field as follows. ⊗_{α∈ℵ} A_α is the smallest σ-field that contains all sets of the form Π_{α∈ℵ} A_α, where A_α ∈ A_α for all α and all but finitely many A_α are equal to S_α.

In the special case in which ℵ = {1, 2}, we use the notation S = S_1 × S_2 and the product σ-field is denoted A_1 ⊗ A_2.

Proposition A.13.⁵ The Borel σ-field B^k of ℝ^k is the same as the product σ-field of k copies of (ℝ, B^1).

There are other types of collections of sets that are related to σ-fields. Sometimes it is easier to prove results about these other collections and then use the theorems that follow to infer similar results about σ-fields.

Definition A.14. Let S be a set. A collection Π of subsets of S is called a π-system if A, B ∈ Π implies A ∩ B ∈ Π. A collection A is called a λ-system if S ∈ A, A ∈ A implies A^c ∈ A, and {A_n}_{n=1}^∞ ∈ A with A_i ∩ A_j = ∅ for i ≠ j implies ∪_{i=1}^∞ A_i ∈ A.

As in Proposition A.10, the intersection of arbitrarily many π-systems is a π-system, and so too with λ-systems. The following propositions are also easy to prove.

Proposition A.15. If S is a set and C is a collection of subsets of S such that C is a π-system and a λ-system, then C is a σ-field.

Proposition A.16. If S is a set and A is a λ-system of subsets, then A, A ∩ B ∈ A implies A ∩ B^c ∈ A.

The following lemma is the key to a useful uniqueness theorem.

Lemma A.17 (π-λ theorem).⁶ Suppose that Π is a π-system, that A is a λ-system, and that Π ⊆ A. Then the smallest σ-field containing Π is contained in A.
PROOF. Define λ(Π) to be the smallest λ-system containing Π, and define σ(Π) to be the smallest σ-field containing Π. For each A ⊆ S, define G_A to be the collection of all sets B ⊆ S such that A ∩ B ∈ λ(Π).

First, we show that G_A is a λ-system for each A ∈ λ(Π). To see this, note that A ∩ S ∈ λ(Π), so S ∈ G_A. If B ∈ G_A, then A ∩ B ∈ λ(Π), and Proposition A.16 says that A ∩ B^c ∈ λ(Π), so B^c ∈ G_A. Finally, {B_n}_{n=1}^∞ ∈ G_A with the B_n

⁵This proposition is used in the proof of Theorem A.38.
⁶This lemma is used in the proofs of Theorems A.26 and B.46 and Lemma A.61.


disjoint implies that A ∩ B_n ∈ λ(Π) with the A ∩ B_n disjoint, so their union is in λ(Π). But their union is A ∩ (∪_{n=1}^∞ B_n). So ∪_{n=1}^∞ B_n ∈ G_A.

Next, we show that λ(Π) ⊆ G_C for every C ∈ λ(Π). Let A, B ∈ Π, and notice that A ∩ B ∈ Π, so B ∈ G_A. Since G_A is a λ-system containing Π, it must contain λ(Π). It follows that A ∩ C ∈ λ(Π) for all C ∈ λ(Π). If C ∈ λ(Π), it then follows that A ∈ G_C. So, Π ⊆ G_C for all C ∈ λ(Π). Since G_C is a λ-system containing Π, it must contain λ(Π).

Finally, if A, B ∈ λ(Π), we just proved that B ∈ G_A, so A ∩ B ∈ λ(Π), and hence λ(Π) is also a π-system. By Proposition A.15, λ(Π) is a σ-field containing Π and hence must contain σ(Π). Since λ(Π) ⊆ A, the proof is complete. □

We are now in a position to give a precise definition of measure.

Definition A.18.

• A pair (S, A), where S is a set and A is a σ-field, is called a measurable space.

• A function μ : A → [0, ∞] is called a measure if

  μ(∅) = 0,

  {A_n}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).

• A function μ : A → [−∞, ∞] that satisfies the above two conditions and does not assume both of the values ∞ and −∞ is called a signed measure.⁷

• If μ is a measure, the triple (S, A, μ) is called a measure space.

• If (S, A, μ) is a measure space and μ(S) = 1, then μ is called a probability and (S, A, μ) is called a probability space.

Some examples of measures were given in Section A.1.

Theorem A.19.⁸ If (S, A, μ) is a measure space and {A_n}_{n=1}^∞ is a monotone sequence,⁹ then μ(lim_{i→∞} A_i) = lim_{i→∞} μ(A_i) if either of the following holds:

• the sequence is increasing,

• the sequence is decreasing and μ(A_1) < ∞.

PROOF. If the sequence is increasing, then let B_1 = A_1 and B_k = A_k \ A_{k-1} for k > 1.¹⁰ Then {B_n}_{n=1}^∞ are disjoint and the following are true:

∪_{i=1}^k B_i = A_k,   ∪_{i=1}^∞ B_i = lim_{k→∞} A_k,   μ(A_k) = Σ_{i=1}^k μ(B_i),

⁷Signed measures will only be used in Section A.6.
⁸This theorem is used in the proofs of Theorems A.50 and B.90 and Lemma A.72.
⁹A sequence of sets {A_n}_{n=1}^∞ is monotone if either A_1 ⊆ A_2 ⊆ ... or A_1 ⊇ A_2 ⊇ .... In the first case, we say that the sequence is increasing and lim_{n→∞} A_n = ∪_{i=1}^∞ A_i. In the second case, we say that the sequence is decreasing and lim_{n→∞} A_n = ∩_{i=1}^∞ A_i.
¹⁰The symbol A \ B is another way of saying A ∩ B^c.


lim_{k→∞} μ(A_k) = Σ_{i=1}^∞ μ(B_i) = μ(∪_{i=1}^∞ B_i) = μ(lim_{k→∞} A_k).

If the sequence is decreasing, then let B_i = A_i \ A_{i+1}, for i = 1, 2, .... It follows that

A_1 = lim_{k→∞} A_k ∪ (∪_{i=1}^∞ B_i),

and all of the sets on the right-hand side are disjoint. It follows that

A_k = A_1 \ ∪_{i=1}^{k-1} B_i,
μ(A_1) = μ(lim_{k→∞} A_k) + Σ_{i=1}^∞ μ(B_i),
μ(A_k) = μ(A_1) − Σ_{i=1}^{k-1} μ(B_i),
lim_{k→∞} μ(A_k) = μ(A_1) − Σ_{i=1}^∞ μ(B_i) = μ(lim_{k→∞} A_k). □

Another useful theorem concerning sequences of sets is the following.

Theorem A.20 (First Borel-Cantelli lemma).¹¹ If Σ_{n=1}^∞ μ(A_n) < ∞, then μ(∩_{i=1}^∞ ∪_{n=i}^∞ A_n) = 0.

PROOF. Let B_i = ∪_{n=i}^∞ A_n and B = ∩_{i=1}^∞ B_i. Since B ⊆ B_i for each i, it follows that μ(B) ≤ μ(B_i) for all i. Since μ(B_i) ≤ Σ_{n=i}^∞ μ(A_n), it follows that lim_{i→∞} μ(B_i) = 0. Hence μ(B) = 0. □
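The quantities in this proof can be tracked numerically in a concrete case of our own devising: take Lebesgue measure and A_n = (0, 2^{-n}], so that Σ μ(A_n) = 1 < ∞, B_i = ∪_{n≥i} A_n = (0, 2^{-i}], and μ(B_i) = 2^{-i} → 0.

```python
# Numerical sketch: first Borel-Cantelli with Lebesgue measure and
# A_n = (0, 2^{-n}].  Here B_i = (0, 2^{-i}], so mu(B_i) = 2^{-i} -> 0,
# forcing mu(limsup A_n) = 0 exactly as in the proof.

mu_An = [2.0 ** -n for n in range(1, 60)]
assert sum(mu_An) < float("inf")  # hypothesis: the measures are summable

mu_Bi = [2.0 ** -i for i in range(1, 60)]  # mu(B_i), the tail unions
assert all(b2 <= b1 for b1, b2 in zip(mu_Bi, mu_Bi[1:]))  # B_i decrease
assert mu_Bi[-1] < 1e-15  # mu(B_i) -> 0, bounding mu(B) from above
```

In this example the limsup set ∩_i B_i is empty, consistent with the conclusion μ(B) = 0.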

Theorem A.22 below is used in several places for extending measures defined on a field to the smallest σ-field containing the field. A definition is required first.

Definition A.21. Let S be a set, A a collection of subsets of S, and μ : A → ℝ ∪ {±∞} a set function. Suppose that S = ∪_{i=1}^∞ A_i with A_i ∈ A and μ(A_i) < ∞ for each i. Then we say μ is σ-finite. If μ is a σ-finite measure on (S, A), then (S, A, μ) is called a σ-finite measure space.

The proof of Theorem A.22 is adapted from Royden (1968).

Theorem A.22 (Carathéodory extension theorem).¹² Let μ be a set function defined on a field C of subsets of a set S that is σ-finite, nonnegative, extended

¹¹This theorem is used in the proofs of Lemma A.72 and Theorems B.90 and 1.61. There is a second Borel-Cantelli lemma, which involves probability measures, but we will not use it in this text. See Problem 20 on page 663. The set whose measure is the subject of this theorem is sometimes called A_n infinitely often because it is the set of points that are in infinitely many of the A_n.
¹²This theorem is used to prove the existence of many common measures (including product measure) and in the proofs of Lemma A.24 and of Theorems B.118, B.131, and B.133.


real-valued, and countably additive and satisfies μ(∅) = 0. Then there is a unique extension of μ to a measure on a measure space¹³ (S, A, μ*). (That is, C ⊆ A and μ(A) = μ*(A) for all A ∈ C.)

PROOF. The proof will proceed as follows. First, we will define μ* and A. Then we will show that μ* is monotone and subadditive, that C ⊆ A, that A is a σ-field, that μ* is countably additive on A, that μ* extends μ, and finally that μ* is the unique extension.

For each B ∈ 2^S, define

μ*(B) = inf Σ_{i=1}^∞ μ(A_i),    (A.23)

where the inf is taken over all {A_i}_{i=1}^∞ such that B ⊆ ∪_{i=1}^∞ A_i and A_i ∈ C for all i. Let A be the collection of all sets A ∈ 2^S such that

μ*(C) = μ*(C ∩ A) + μ*(C ∩ A^c) for all C ∈ 2^S.

First, we show that μ* is monotone and subadditive. Clearly, μ*(A) ≤ μ(A) for all A ∈ C, and B_1 ⊆ B_2 implies μ*(B_1) ≤ μ*(B_2). It is also easy to see that μ*(B_1 ∪ B_2) ≤ μ*(B_1) + μ*(B_2) for all B_1, B_2 ∈ 2^S. In fact, if {B_n}_{n=1}^∞ ∈ 2^S, then μ*(∪_{i=1}^∞ B_i) ≤ Σ_{i=1}^∞ μ*(B_i). The proof is to notice that the collection of numbers whose inf is μ* of the union includes all of the sums of the numbers whose infima are the μ* values being added together.

Next, we show that C ⊆ A. Let A ∈ C and C ∈ 2^S. Since μ* is subadditive, we only need to show that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ A^c). If μ*(C) = ∞, this is clearly true. So let μ*(C) < ∞. From the definition of μ*, for every ε > 0, there exists a collection {A_i}_{i=1}^∞ of elements of C such that Σ_{i=1}^∞ μ(A_i) < μ*(C) + ε. Since μ(A_i) = μ(A_i ∩ A) + μ(A_i ∩ A^c) for every i, we have

μ*(C) + ε > Σ_{i=1}^∞ μ(A_i ∩ A) + Σ_{i=1}^∞ μ(A_i ∩ A^c) ≥ μ*(C ∩ A) + μ*(C ∩ A^c).

Since this is true for every ε > 0, it must be that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ A^c), hence A ∈ A.

Next, we show that A is a σ-field. It is clear that ∅ ∈ A and that A ∈ A implies A^c ∈ A, by the symmetry in the definition of A. Let A_1, A_2 ∈ A and C ∈ 2^S. We can write

μ*(C) = μ*(C ∩ A_1) + μ*(C ∩ A_1^c)
      = μ*(C ∩ A_1) + μ*(C ∩ A_1^c ∩ A_2) + μ*(C ∩ A_1^c ∩ A_2^c)
      ≥ μ*(C ∩ [A_1 ∪ A_2]) + μ*(C ∩ [A_1 ∪ A_2]^c),

where the first two equalities follow from A_1, A_2 ∈ A, and the last inequality follows from the subadditivity of μ*. So, A_1 ∪ A_2 ∈ A. Let {A_n}_{n=1}^∞ ∈ A; then we can write

¹³The usual statement of this theorem includes the additional claim that the measure space (S, A, μ*) is complete. A measure space is complete if every subset of every set with measure 0 is in the σ-field.


A = ∪_{i=1}^∞ A_i = ∪_{i=1}^∞ B_i, where each B_i ∈ A and the B_i are disjoint. (This just makes use of complements and finite unions of elements of A being in A.) Let D_n = ∪_{i=1}^n B_i and C ∈ 2^S. Since A^c ⊆ D_n^c and D_n ∈ A for each n, we have

μ*(C) = μ*(C ∩ D_n) + μ*(C ∩ D_n^c)
      ≥ μ*(C ∩ D_n) + μ*(C ∩ A^c)
      = Σ_{i=1}^n μ*(C ∩ B_i) + μ*(C ∩ A^c).

Since this is true for every n,

μ*(C) ≥ Σ_{i=1}^∞ μ*(C ∩ B_i) + μ*(C ∩ A^c)
      ≥ μ*(C ∩ A) + μ*(C ∩ A^c),

where the last inequality follows from subadditivity. So, A is a σ-field.

Next, we show that μ* is countably additive when restricted to A. If A_1, A_2 are disjoint elements of A, then A_1 = (A_1 ∪ A_2) ∩ A_1 and A_2 = (A_1 ∪ A_2) ∩ A_1^c. It follows that

μ*(A_1 ∪ A_2) = μ*(A_1) + μ*(A_2).

By induction, μ* is finitely additive on A. Let A = ∪_{i=1}^∞ A_i, where each A_i ∈ A and the A_i are disjoint. Since ∪_{i=1}^n A_i ⊆ A, we have, for every n, μ*(A) ≥ Σ_{i=1}^n μ*(A_i), which implies μ*(A) ≥ Σ_{i=1}^∞ μ*(A_i). By subadditivity, we get the reverse inequality, hence μ* is countably additive on A.

Next, we prove that μ* extends μ. Let B ∈ C and let {A_n}_{n=1}^∞ ∈ C be disjoint and such that B ⊆ ∪_{n=1}^∞ A_n. Then B = ∪_{n=1}^∞ (A_n ∩ B), and since μ is countably additive on C, Σ_{n=1}^∞ μ(A_n) ≥ Σ_{n=1}^∞ μ(A_n ∩ B) = μ(B). Taking the infimum over all such covers gives μ*(B) ≥ μ(B); since we already noted that μ*(B) ≤ μ(B), it follows that μ*(B) = μ(B).

To prove uniqueness, suppose that μ′ also extends μ to A. Then μ′(B) ≤ Σ_{n=1}^∞ μ(A_n) if B ⊆ ∪_{n=1}^∞ A_n with A_n ∈ C. Hence, μ′(B) ≤ μ*(B) for all B ∈ A. If there exists B such that μ′(B) < μ*(B), let {A_n}_{n=1}^∞ ∈ C be disjoint and such that μ(A_n) < ∞ and ∪_{n=1}^∞ A_n = S. Then, there exists n such that μ′(B ∩ A_n) < μ*(B ∩ A_n). Since μ′(A_n) = μ*(A_n), it must be that μ′(B^c ∩ A_n) > μ*(B^c ∩ A_n), but this is a contradiction. □

Here are some examples:

• Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = b − a for intervals, and extend μ to finite unions of disjoint intervals by addition. Theorem A.22 will extend μ to the σ-field B. This measure is called Lebesgue measure on the real line.

• Let F be any monotone increasing function on ℝ which is continuous from the right. Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = F(b) − F(a). This can be extended to all of B. In particular, if F is a CDF, then μ is a probability.

In the examples above, the claim was made that μ could be extended to the Borel σ-field. To do this by way of the Carathéodory extension theorem A.22, we need μ to be defined on a field, countably additive, and σ-finite. For the cases described above, this can be arranged as follows. Suppose that μ is defined on


intervals of the form $(a, b]$, with $a = -\infty$ and/or $b = \infty$ possible.14 The collection $\mathcal{C}$ of all unions of finitely many disjoint intervals of this form is easily seen to be a field. If $(a_1, b_1], \ldots, (a_n, b_n]$ are mutually disjoint, set
$$\mu\Big(\bigcup_{i=1}^{n} (a_i, b_i]\Big) = \sum_{i=1}^{n} \mu((a_i, b_i]).$$

It is not hard to see that this extension of $\mu$ to $\mathcal{C}$ is well defined. This means that if $\bigcup_{i=1}^{n} (a_i, b_i] = \bigcup_{j=1}^{m} (c_j, d_j]$, where $(c_1, d_1], \ldots, (c_m, d_m]$ are also mutually disjoint, then $\sum_{i=1}^{n} \mu((a_i, b_i]) = \sum_{j=1}^{m} \mu((c_j, d_j])$. If $\mu$ is finite for every bounded interval, then it is $\sigma$-finite. To see that $\mu$ is countably additive on $\mathcal{C}$, suppose that $\mu((a, b]) = F(b) - F(a)$, where $F$ is nondecreasing and continuous from the right. If $\{(a_n, b_n]\}_{n=1}^{\infty}$ is a sequence of disjoint intervals and $(a, b]$ is an interval such that $\bigcup_{n=1}^{\infty} (a_n, b_n] \subseteq (a, b]$, then it is not difficult to see that $\sum_{n=1}^{\infty} \mu((a_n, b_n]) \le \mu((a, b])$. If $(a, b] \subseteq \bigcup_{n=1}^{\infty} (a_n, b_n]$, we can also prove that $\sum_{n=1}^{\infty} \mu((a_n, b_n]) \ge \mu((a, b])$ (see Problem 7 on page 603). Together these facts imply that $\mu$ is countably additive on $\mathcal{C}$.
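The well-definedness claim is easy to check numerically: two different decompositions of the same set into disjoint intervals must give the same total measure. A minimal sketch (the helper name `mu_union` is hypothetical):

```python
def mu_union(intervals, F=lambda x: x):
    """Measure of a finite union of disjoint intervals (a, b]:
    the sum of F(b) - F(a) over the pieces."""
    return sum(F(b) - F(a) for a, b in intervals)

# Two decompositions of (0, 3] into disjoint half-open intervals:
v1 = mu_union([(0, 1), (1, 3)])
v2 = mu_union([(0, 2), (2, 2.5), (2.5, 3)])
assert v1 == v2 == 3
```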

The proof of Theorem A.22 leads us to the following useful result. Its proof is adapted from Halmos (1950).

Lemma A.24.15 Let $(S, \mathcal{A}, \mu)$ be a $\sigma$-finite measure space. Suppose that $\mathcal{C}$ is a field such that $\mathcal{A}$ is the smallest $\sigma$-field containing $\mathcal{C}$. Then, for every $A \in \mathcal{A}$ and $\epsilon > 0$, there is $C \in \mathcal{C}$ such that $\mu(C \,\triangle\, A) < \epsilon$.16

PROOF. Clearly, $\mu$ and $\mathcal{C}$ satisfy the conditions of Theorem A.22, so that $\mu$ is equal to the $\mu^*$ in the proof of that theorem. Let $A \in \mathcal{A}$ and $\epsilon > 0$ be given. It follows from (A.23) that there exists a sequence $\{A_i\}_{i=1}^{\infty}$ in $\mathcal{C}$ such that $A \subseteq \bigcup_{i=1}^{\infty} A_i$ and
$$\mu(A) > \sum_{i=1}^{\infty} \mu(A_i) - \frac{\epsilon}{2}.$$

Since $\mu$ is countably additive, the series $\sum_{i=1}^{\infty} \mu(A_i)$ converges, so that there exists $n$ such that
$$\sum_{i=n+1}^{\infty} \mu(A_i) < \frac{\epsilon}{2}.$$
Let $C = \bigcup_{i=1}^{n} A_i$, which is clearly in $\mathcal{C}$. Now
$$\mu(C \cap A^c) \le \mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) - \mu(A) \le \sum_{i=1}^{\infty} \mu(A_i) - \mu(A) < \frac{\epsilon}{2}.$$

14If $b = \infty$, we mean $(a, \infty)$ by $(a, b]$. That is, we do not intend $\infty$ to be a point in the space $S$.

15This lemma is used in the proof of the Kolmogorov zero-one law B.68.
16The symbol $\triangle$ here refers to the symmetric difference operator on pairs of sets. We define $C \,\triangle\, A$ to be $(C \cap A^c) \cup (C^c \cap A)$.



Similarly,
$$\mu(C^c \cap A) \le \mu\Big(\bigcup_{i=n+1}^{\infty} A_i\Big) \le \sum_{i=n+1}^{\infty} \mu(A_i) < \frac{\epsilon}{2}.$$
It now follows that $\mu(A \,\triangle\, C) < \epsilon$. $\Box$

Sets with measure zero are ubiquitous in measure theory, so there is a special

definition that allows us to refer to them more easily.

Definition A.25. Let $E$ be some statement concerning the points in $S$ such that, for each point $s \in S$, $E$ is either true or false but not both. Suppose that there exists a set $A \in \mathcal{A}$ such that $\mu(A) = 0$ and that for all $s \in A^c$, $E$ is true. Then we say that $E$ is true almost everywhere with respect to $\mu$, written a.e. $[\mu]$. If $\mu$ is a probability, then almost everywhere is often expressed as almost surely and denoted a.s. $[\mu]$.

The following theorem implies uniqueness of measures with certain properties.

Theorem A.26.17 Suppose that $\mu_1$ and $\mu_2$ are measures on $(S, \mathcal{A})$ and $\mathcal{A}$ is the smallest $\sigma$-field containing the $\pi$-system $\Pi$. If $\mu_1$ and $\mu_2$ are both $\sigma$-finite on $\Pi$ and they agree on $\Pi$, then they agree on $\mathcal{A}$.

PROOF. First, let $C \in \Pi$ be such that $\mu_1(C) = \mu_2(C) < \infty$, and define $\mathcal{G}_C$ to be the collection of all $B \in \mathcal{A}$ such that $\mu_1(B \cap C) = \mu_2(B \cap C)$. Using simple properties of measures, we see that $\mathcal{G}_C$ is a $\lambda$-system that contains $\Pi$, hence it equals $\mathcal{A}$ by Lemma A.17. (For example, if $B \in \mathcal{G}_C$,
$$\mu_1(B^c \cap C) = \mu_1(C) - \mu_1(B \cap C) = \mu_2(C) - \mu_2(B \cap C) = \mu_2(B^c \cap C),$$
so $B^c \in \mathcal{G}_C$.)

Next, if $\mu_1$ and $\mu_2$ are not finite, there exists a sequence $\{C_n\}_{n=1}^{\infty} \subseteq \Pi$ such that $\mu_1(C_n) = \mu_2(C_n) < \infty$ and $S = \bigcup_{n=1}^{\infty} C_n$. (Since $\Pi$ is only a $\pi$-system, we cannot assume that the $C_n$ are disjoint.) For each $A \in \mathcal{A}$ and $j = 1, 2$,
$$\mu_j(A) = \lim_{n\to\infty} \mu_j\Big(\bigcup_{i=1}^{n} [C_i \cap A]\Big).$$
Since $\mu_j(\bigcup_{i=1}^{n} [C_i \cap A])$ can be written (by inclusion and exclusion) as a linear combination of values of $\mu_j$ at sets of the form $A \cap C$, where $C \in \Pi$ is the intersection of finitely many of $C_1, \ldots, C_n$, it follows from $A \in \mathcal{G}_C$ that $\mu_1(\bigcup_{i=1}^{n} [C_i \cap A]) = \mu_2(\bigcup_{i=1}^{n} [C_i \cap A])$ for all $n$, hence $\mu_1(A) = \mu_2(A)$. $\Box$

A.3 Measurable Functions

There are certain types of functions with which we will be primarily concerned.

17This theorem is used in the proofs of Theorems B.32, B.46, B.118, B.131, and 1.115, Lemma A.64, and Corollary B.44.



Definition A.27. Suppose that $S$ is a set with a $\sigma$-field $\mathcal{A}$ of subsets, and let $T$ be another set with a $\sigma$-field $\mathcal{C}$ of subsets. Suppose that $f : S \to T$ is a function. We say $f$ is measurable if for every $B \in \mathcal{C}$, $f^{-1}(B) \in \mathcal{A}$. If $f$ is measurable, one-to-one, and onto and $f^{-1}$ is measurable, we say that $f$ is bimeasurable. If $T = \mathbb{R}$, the real numbers, and $\mathcal{C} = \mathcal{B}$, the Borel $\sigma$-field, then if $f$ is measurable, we say that $f$ is Borel measurable.

Proposition A.28. Suppose that $(S, \mathcal{A})$ and $(T, \mathcal{C})$ are measurable spaces. Suppose that $f : S \to T$ is a function.

• If $\mathcal{A} = 2^S$, then $f$ is measurable.

• If $\mathcal{C} = \{T, \emptyset\}$, then $f$ is measurable.

• If $\mathcal{A} = \{S, \emptyset\}$, $\{y\} \in \mathcal{C}$ for every $y \in T$, and $f$ is measurable, then $f$ is constant.

As examples, if $S = T = \mathbb{R}$ and $\mathcal{A} = \mathcal{B}$ is the Borel $\sigma$-field, then all continuous functions are measurable. But many discontinuous functions are also measurable. For example, step functions are measurable. All monotone functions are measurable. In fact, it is very difficult to describe a nonmeasurable function without using some heavy mathematics.

The following theorems make it easier to show that a function is measurable.

Theorem A.29.18 Let $N$, $S$, and $T$ be arbitrary sets. Let $\{A_\alpha : \alpha \in N\}$ be a collection of subsets of $T$, and let $A$ be an arbitrary subset of $T$. Let $f : S \to T$ be a function. Then
$$f^{-1}\Big(\bigcup_{\alpha\in N} A_\alpha\Big) = \bigcup_{\alpha\in N} f^{-1}(A_\alpha), \qquad f^{-1}\Big(\bigcap_{\alpha\in N} A_\alpha\Big) = \bigcap_{\alpha\in N} f^{-1}(A_\alpha), \qquad f^{-1}(A^c) = f^{-1}(A)^c.$$

PROOF. For the union, if $s \in f^{-1}(\bigcup_{\alpha\in N} A_\alpha)$, then $f(s) \in \bigcup_{\alpha\in N} A_\alpha$, hence there exists $\alpha$ such that $f(s) \in A_\alpha$, so $s \in f^{-1}(A_\alpha)$ and $s \in \bigcup_{\alpha\in N} f^{-1}(A_\alpha)$. If $s \in \bigcup_{\alpha\in N} f^{-1}(A_\alpha)$, then there exists $\alpha$ such that $s \in f^{-1}(A_\alpha)$, hence $f(s) \in A_\alpha$, hence $f(s) \in \bigcup_{\alpha\in N} A_\alpha$, hence $s \in f^{-1}(\bigcup_{\alpha\in N} A_\alpha)$. This proves the first equality. The second is almost identical, with "there exists $\alpha$" merely replaced by "for all $\alpha$" in the above proof. For the complement, if $s \in f^{-1}(A^c)$, then $f(s) \in A^c$ and $f(s) \notin A$. Hence, $s \notin f^{-1}(A)$ and $s \in f^{-1}(A)^c$. If $s \in f^{-1}(A)^c$, then $s \notin f^{-1}(A)$ and $f(s) \notin A$. So, $f(s) \in A^c$ and $s \in f^{-1}(A^c)$. $\Box$
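Because the identities of Theorem A.29 are purely set-theoretic, they can be verified exhaustively on small finite sets, which makes a handy sanity check. The helper `preimage` below is our own illustration, not part of the text:

```python
def preimage(f, domain, B):
    """f^{-1}(B) = {s in domain : f(s) in B}."""
    return {s for s in domain if f(s) in B}

S = {0, 1, 2, 3, 4}
T = {0, 1}
f = lambda s: s % 2
A1, A2 = {0}, {1}

# f^{-1}(A1 u A2) = f^{-1}(A1) u f^{-1}(A2)
assert preimage(f, S, A1 | A2) == preimage(f, S, A1) | preimage(f, S, A2)
# f^{-1}(A1 n A2) = f^{-1}(A1) n f^{-1}(A2)
assert preimage(f, S, A1 & A2) == preimage(f, S, A1) & preimage(f, S, A2)
# f^{-1}(A^c) = f^{-1}(A)^c
assert preimage(f, S, T - A1) == S - preimage(f, S, A1)
```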

Corollary A.30.19 If $S$ and $T$ are sets, $\mathcal{C}$ is a $\sigma$-field of subsets of $T$, and $f : S \to T$ is a function, then $f^{-1}(\mathcal{C})$ is a $\sigma$-field of subsets of $S$. In fact, it is the smallest $\sigma$-field of subsets of $S$ such that $f$ is measurable.

18This theorem is used in the proof of Theorem A.34.
19This corollary is used in the proof of Theorem A.42, and it is used to define the $\sigma$-field generated by a function.



Definition A.31. The $\sigma$-field $f^{-1}(\mathcal{C})$ in Corollary A.30 is called the $\sigma$-field generated by $f$.

A measurable function also generates a $\sigma$-field of subsets of its image.

Proposition A.32. Let $(T, \mathcal{C})$ be a measurable space. Let $U \subseteq T$ be arbitrary (possibly not even in $\mathcal{C}$). Define $\mathcal{C}_U = \{U \cap B : B \in \mathcal{C}\}$. Then $\mathcal{C}_U$ is a $\sigma$-field of subsets of $U$.

Definition A.33. The $\sigma$-field $\mathcal{C}_U$ in Proposition A.32 is called the restriction of the $\sigma$-field $\mathcal{C}$ to $U$. If $f : S \to T$ and $U = f(S)$, then $\mathcal{C}_U$ is called the image $\sigma$-field of $f$.

Theorem A.34.20 Let $(S, \mathcal{A})$ be a measurable space and let $f : S \to T$ be a function. Let $\mathcal{C}^*$ be a nonempty collection of subsets of $T$, and let $\mathcal{C}$ be the smallest $\sigma$-field that contains $\mathcal{C}^*$. If $f^{-1}(\mathcal{C}^*) \subseteq \mathcal{A}$, then $f^{-1}(\mathcal{C}) \subseteq \mathcal{A}$.

PROOF. Let $\mathcal{C}_2$ be the collection of all subsets $B$ of $T$ such that $f^{-1}(B) \in \mathcal{A}$. By assumption, $\mathcal{C}^* \subseteq \mathcal{C}_2$. We will now prove that $\mathcal{C}_2$ is a $\sigma$-field; hence it must contain $\mathcal{C}$, which implies the conclusion of the theorem. Clearly, $\mathcal{C}_2$ is nonempty, since $\mathcal{C}^*$ is nonempty. Let $A \in \mathcal{C}_2$. Theorem A.29 implies $f^{-1}(A^c) = f^{-1}(A)^c \in \mathcal{A}$, since $\mathcal{A}$ is a $\sigma$-field. This means that $A^c \in \mathcal{C}_2$. Let $A_1, A_2, \ldots \in \mathcal{C}_2$. Then Theorem A.29 implies
$$f^{-1}\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \bigcup_{i=1}^{\infty} f^{-1}(A_i) \in \mathcal{A},$$
since $\mathcal{A}$ is a $\sigma$-field. So $\mathcal{C}_2$ is a $\sigma$-field. $\Box$

To use this theorem to show that a function $f : S \to T$ is measurable when $T$ has a $\sigma$-field of subsets $\mathcal{C}$, we can find a smaller collection of subsets $\mathcal{C}^*$ such that $\mathcal{C}$ is the smallest $\sigma$-field containing $\mathcal{C}^*$ and prove that $f^{-1}(\mathcal{C}^*) \subseteq \mathcal{A}$. Theorem A.34 would then imply $f^{-1}(\mathcal{C}) \subseteq \mathcal{A}$, so that $f$ is measurable. As an example, consider the next lemma.

Lemma A.35.21 Let $(S, \mathcal{A})$ be a measurable space, and let $f : S \to \mathbb{R}$ be a function. Then $f$ is measurable if and only if $f^{-1}((b, \infty)) \in \mathcal{A}$ for all $b \in \mathbb{R}$.

PROOF. The "only if" part is trivial. For the "if" part, let $\mathcal{C}^*$ be the collection of all subsets of $\mathbb{R}$ of the form $(b, \infty)$. The smallest $\sigma$-field containing these is the Borel $\sigma$-field $\mathcal{B}$, so $f^{-1}(\mathcal{B}) \subseteq \mathcal{A}$ by Theorem A.34. $\Box$

There are versions of Lemma A.35 that apply to intervals of the form $(-\infty, a]$, those of the form $(a, b)$, and so on. Similarly, there is a version for general topological spaces.

Proposition A.36.22 Let $(S, \mathcal{A})$ be a measurable space, and let $(T, \mathcal{C})$ be a topological space with Borel $\sigma$-field. Then $f : S \to T$ is measurable if and only if $f^{-1}(C) \in \mathcal{A}$ for all open $C$ (or for all closed $C$).

20This theorem is used in the proofs of Lemma A.35, Proposition A.36, Corollary A.37, Theorems A.38, B.75, and B.133, and to prove that stochastic processes are measurable.
21This lemma is used in the proofs of Theorems A.38 and A.74.
22This proposition is used in the proof of Theorem A.38.



Another example of the use of Theorem A.34 is the proof that all continuous functions are measurable. The result follows because the Borel $\sigma$-field is the smallest $\sigma$-field containing the open sets.

Corollary A.37. Let $(S, \mathcal{A})$ and $(T, \mathcal{B})$ be topological spaces with their Borel $\sigma$-fields. If $f : S \to T$ is continuous, then $f$ is measurable.

Here are some properties of measurable functions that will prove useful.

Theorem A.38. Let $(S, \mathcal{A})$ be a measurable space.

1. Let $N$ be an index set, and let $\{(T_\alpha, \mathcal{C}_\alpha)\}_{\alpha \in N}$ be a collection of measurable spaces. For each $\alpha \in N$, let $f_\alpha : S \to T_\alpha$ be a function. Define $f : S \to \prod_{\alpha \in N} T_\alpha$ by $f(s) = \{f_\alpha(s)\}_{\alpha \in N}$. Then $f$ is measurable (with respect to the product $\sigma$-field) if and only if each $f_\alpha$ is measurable.

2. If $(V, \mathcal{C}_1)$ and $(U, \mathcal{C}_2)$ are measurable spaces and $f : S \to V$ and $g : V \to U$ are measurable, then $g(f) : S \to U$ is measurable.

3. Let $f$ and $g$ be measurable functions from $S$ to $\mathbb{R}^n$, let $a$ be a constant scalar, and let $b \in \mathbb{R}^n$ be constant. Then the following functions are also measurable: $f + g$ and $af + b$. If $n = 1$, then $f \cdot g$ and $f/g$ are also measurable, where $f/g$ can be set equal to an arbitrary constant when $g = 0$.

4. If, for each $n$, $f_n$ is a measurable, extended real-valued function, then $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$ are all measurable.

5. Let $(T, \mathcal{C})$ be a metric space with Borel $\sigma$-field. If $f_k : S \to T$ is a measurable function for each $k = 1, 2, \ldots$ and $\lim_{k\to\infty} f_k(s) = f(s)$ for all $s$, then $f$ is measurable.

6. Let $(T, \mathcal{C})$ be a metric space with Borel $\sigma$-field, and let $\mu$ be a measure on $(S, \mathcal{A})$. If $f_k : S \to T$ is a measurable function for each $k = 1, 2, \ldots$ and $\lim_{k\to\infty} f_k(s)$ exists a.e. $[\mu]$, then there is a measurable $f : S \to T$ such that $\lim_{k\to\infty} f_k(s) = f(s)$, a.e. $[\mu]$.

PROOF. (1) Suppose that $f$ is measurable. To show that $f_\alpha$ is measurable, let $B_\alpha \in \mathcal{C}_\alpha$ and let $B_\beta = T_\beta$ for $\beta \neq \alpha$. Set $C = \prod_{\beta \in N} B_\beta$, which is in the product $\sigma$-field, because all but finitely many $B_\beta$ equal the entire space $T_\beta$. Then $f_\alpha^{-1}(B_\alpha) = f^{-1}(C)$. Since $f$ is measurable, $f^{-1}(C) \in \mathcal{A}$. Now, suppose that each $f_\alpha$ is measurable, and let $B = \prod_{\alpha \in N} B_\alpha$, with $B_\alpha \in \mathcal{C}_\alpha$ for all $\alpha$ and all but finitely many $B_\alpha$ (say $B_{\alpha_1}, \ldots, B_{\alpha_n}$) equal to $T_\alpha$. Then $f^{-1}(B) = \bigcap_{i=1}^{n} f_{\alpha_i}^{-1}(B_{\alpha_i}) \in \mathcal{A}$. Since the sets of the form $B$ generate the product $\sigma$-field, $f^{-1}(B) \in \mathcal{A}$ for all $B$ in the product $\sigma$-field according to Theorem A.34.

(2) Let $A \in \mathcal{C}_2$. We need to prove that $g(f)^{-1}(A) \in \mathcal{A}$. First, note that $g(f)^{-1} = f^{-1}(g^{-1})$. Since $g$ is measurable, $g^{-1}(A) \in \mathcal{C}_1$. Since $f$ is measurable, $f^{-1}(g^{-1}(A)) \in \mathcal{A}$. So $g(f)^{-1}(A) \in \mathcal{A}$.

(3) The arithmetic parts of the theorem are all similar. They all follow from parts 2 and 1. For example, $h(x, y) = x + y$ is a measurable function from $\mathbb{R}^2$ to $\mathbb{R}$, so $h(f, g) = f + g$ is measurable. For the quotient, a little more care is needed. Let $h(x, y) = x/y$ when $y \neq 0$ and let it be an arbitrary constant when $y = 0$. Then $h$ is measurable since $\{(x, y) : y = 0\}$ is in $\mathcal{B}^2$. It follows that $h(f, g)$ is measurable.

(4) Let $f = \sup_n f_n$. Then, for each finite $b$, $\{s : f(s) \le b\} = \bigcap_{n=1}^{\infty} \{s : f_n(s) \le b\} \in \mathcal{A}$. Also $\{s : f(s) = -\infty\} = \bigcap_{n=1}^{\infty} \{s : f_n(s) = -\infty\} \in \mathcal{A}$, and $\{s : f(s) = \infty\} = \bigcap_{i=1}^{\infty} \bigcup_{n=1}^{\infty} \{s : f_n(s) > i\} \in \mathcal{A}$. Similar arguments work for $\inf$. Since $\limsup_n f_n = \inf_k \sup_{n \ge k} f_n$ and $\liminf_n f_n = \sup_k \inf_{n \ge k} f_n$, these are also measurable.

(5) Let $d$ be the metric in $T$. For each closed set $C \in \mathcal{C}$ and each $m$, let $C_m = \{t : d(t, C) < 1/m\}$. For each closed $C$, define
$$A_*(C) = \bigcap_{m=1}^{\infty} \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} f_k^{-1}(C_m). \tag{A.39}$$
It is easy to see that $A_*(C) \in \mathcal{A}$ is the set of all $s$ such that $\lim_{n\to\infty} f_n(s) \in C$. Obviously, $f^{-1}(C)$ consists of those $s$ such that $\lim_{n\to\infty} f_n(s) \in C$. Hence, $f^{-1}(C) = A_*(C) \in \mathcal{A}$, and Proposition A.36 says that $f$ is measurable.

(6) Let $G = \{s : \lim_{k\to\infty} f_k(s) \text{ does not exist}\}$, and let $C \in \mathcal{A}$ be such that $G \subseteq C$ and $\mu(C) = 0$. Let $t \in T$, and define $f(s) = t$ for $s \in C$ and $f(s) = \lim_{k\to\infty} f_k(s)$ for $s \in C^c$. Apply part 5 to the restrictions of the functions $\{f_k\}_{k=1}^{\infty}$ to $C^c$ to conclude that $f$ restricted to $C^c$ (call the restriction $g$) is measurable. If $A \in \mathcal{C}$, then $f^{-1}(A) = g^{-1}(A) \in \mathcal{A}$ if $t \notin A$, and $f^{-1}(A) = g^{-1}(A) \cup C \in \mathcal{A}$ if $t \in A$. So $f$ is measurable. $\Box$

Part 6 is particularly useful in that it allows us to treat the limit of a sequence of measurable functions as a measurable function even if the limit only exists almost everywhere. This is only useful, however, if we can show that functions that are equal almost everywhere have similar properties.

Many theorems about measurable functions are proven first for a special class of measurable functions called simple functions and then extended to all measurable functions using some limit theorems.

Definition A.40. A measurable function f is called simple if it assumes only finitely many distinct values.

A simple function is often expressed in terms of its values. Let $f$ be a simple function taking values in $\mathbb{R}^n$ for some $n$. Suppose that $\{a_1, \ldots, a_k\}$ are the distinct values assumed by $f$, and let $A_i = f^{-1}(\{a_i\})$. Then $f(s) = \sum_{i=1}^{k} a_i I_{A_i}(s)$. The most fundamental limit theorem is the following.

Theorem A.41. If $f$ is a nonnegative measurable function, then there exists a sequence of simple functions $\{f_i\}_{i=1}^{\infty}$ such that for all $s \in S$, $f_i(s) \uparrow f(s)$.

PROOF. For $k = 1, \ldots, i2^i$, let $A_{k,i} = \{s : (k-1)/2^i \le f(s) < k/2^i\}$. Define $A_{0,i} = \{s : f(s) \ge i\}$. Then $A_{0,i}, A_{1,i}, \ldots, A_{i2^i,i}$ are disjoint and their union is $S$. Define
$$f_i(s) = \begin{cases} (k-1)/2^i & \text{if } s \in A_{k,i} \text{ for } k > 0, \\ i & \text{if } s \in A_{0,i}. \end{cases}$$
It is clear that $f_i(s) \le f(s)$ for all $i$ and $s$, and each $f_i$ is a simple function. Since, for $k > 0$, $A_{k,i} = A_{2k-1,i+1} \cup A_{2k,i+1}$, and $A_{0,i} = A_{0,i+1} \cup A_{i2^{i+1}+1,i+1} \cup \cdots \cup A_{(i+1)2^{i+1},i+1}$, it is easy to see that $f_i(s) \le f_{i+1}(s)$ for all $i$ and all $s$. It is also easy to see that for each $s$ with $f(s) < \infty$, there exists $n$ such that for $i \ge n$, $|f(s) - f_i(s)| \le 2^{-i}$. Hence $f_i(s) \uparrow f(s)$. $\Box$
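The dyadic construction in this proof is concrete enough to code directly. In the sketch below (the function name `simple_approx` is ours), $f_i$ truncates $f$ at $i$ and rounds down to the grid of multiples of $2^{-i}$; the loop spot-checks monotonicity in $i$ and the $2^{-i}$ error bound at a few points.

```python
def simple_approx(f, i):
    """The i-th simple function from the proof of Theorem A.41:
    f_i(s) = (k - 1)/2^i on {(k - 1)/2^i <= f < k/2^i}, and i on {f >= i}."""
    def fi(s):
        y = f(s)
        if y >= i:
            return i
        k = int(y * 2 ** i) + 1       # s lies in A_{k,i}
        return (k - 1) / 2 ** i
    return fi

f = lambda s: s * s                   # a nonnegative function
for s in [0.0, 0.7, 1.3, 10.0]:
    vals = [simple_approx(f, i)(s) for i in range(1, 12)]
    assert all(v <= w for v, w in zip(vals, vals[1:]))       # f_i(s) nondecreasing
    assert (vals[-1] <= f(s) < vals[-1] + 2 ** -11) or f(s) >= 11
```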

The following theorem will be very useful throughout the study of statistics. It says that one function $g$ is a function of another $f$ if and only if $g$ is measurable with respect to the $\sigma$-field generated by $f$.



Theorem A.42. Let $(S_1, \mathcal{A}_1)$, $(S_2, \mathcal{A}_2)$, and $(S_3, \mathcal{A}_3)$ be measurable spaces such that $\mathcal{A}_3$ contains all singletons. Suppose that $f : S_1 \to S_2$ is measurable. Let $\mathcal{A}_{1f}$ be the $\sigma$-field generated by $f$. Let $T$ be the image of $f$ and let $(\mathcal{A}_2)_T$ be the image $\sigma$-field of $f$. Let $g : S_1 \to S_3$ be a measurable function. Then $g$ is $\mathcal{A}_{1f}$ measurable if and only if there is a measurable function $h : T \to S_3$ such that for each $s \in S_1$, $g(s) = h(f(s))$.

PROOF. For the "if" part, assume that there is a measurable $h : T \to S_3$ such that $g(s) = h(f(s))$ for all $s \in S_1$. Let $B \in \mathcal{A}_3$. We need to show that $g^{-1}(B) \in \mathcal{A}_{1f}$. Since $h$ is measurable, $h^{-1}(B) \in (\mathcal{A}_2)_T$, so $h^{-1}(B) = T \cap A$ for some $A \in \mathcal{A}_2$. Since $f^{-1}(A) = f^{-1}(T \cap A)$ and $g^{-1}(B) = f^{-1}(h^{-1}(B))$, it follows that $g^{-1}(B) = f^{-1}(A) \in \mathcal{A}_{1f}$.

For the "only if" part, assume that $g$ is $\mathcal{A}_{1f}$ measurable. For each $t \in S_3$, let $C_t = g^{-1}(\{t\})$. Since $g$ is measurable with respect to $\mathcal{A}_{1f}$, let $A_t \in \mathcal{A}_2$ be such that $C_t = f^{-1}(A_t)$. (Such an $A_t$ exists because of Corollary A.30.) Define $h(s) = t$ for all $s \in A_t \cap T$. (Note that if $t_1 \neq t_2$, then $A_{t_1} \cap A_{t_2} \cap T = \emptyset$, so $h$ is well defined.) To see that $g(s) = h(f(s))$, let $g(s) = t$, so that $s \in C_t = f^{-1}(A_t)$. This means that $f(s) \in A_t \cap T$, which in turn implies $h(f(s)) = t = g(s)$.

To see that $h$ is measurable, let $A \in \mathcal{A}_3$. We must show that $h^{-1}(A) \in (\mathcal{A}_2)_T$. Since $g$ is $\mathcal{A}_{1f}$ measurable, $g^{-1}(A) \in \mathcal{A}_{1f}$, so there is some $B \in \mathcal{A}_2$ such that $g^{-1}(A) = f^{-1}(B)$. We will show that $h^{-1}(A) = B \cap T \in (\mathcal{A}_2)_T$ to complete the proof. If $s \in h^{-1}(A)$, then $t = h(s) \in A$ and $s = f(x)$ for some $x \in C_t \subseteq g^{-1}(A) = f^{-1}(B)$, so $f(x) \in B$. Hence, $s \in B \cap T$. This implies that $h^{-1}(A) \subseteq B \cap T$. Lastly, if $s \in B \cap T$, then $s = f(x)$ for some $x \in f^{-1}(B) = g^{-1}(A)$, and $h(s) = h(f(x)) = g(x) \in A$. So, $h(s) \in A$ and $s \in h^{-1}(A)$. This implies $B \cap T \subseteq h^{-1}(A)$. $\Box$

The condition that A3 contain singletons is needed to avoid the situation in the following example.

Example A.43. Let $S_1 = S_2 = S_3 = \mathbb{R}$ and let $\mathcal{A}_1 = \mathcal{A}_2$ be the Borel $\sigma$-field, while $\mathcal{A}_3$ is the trivial $\sigma$-field. Then every function $g : S_1 \to S_3$ is $\mathcal{A}_{1f}$ measurable no matter what $f$ is, for example, $g(s) = s$. If $f(s) = s^2$, then $g$ is not a function of $f$.

A.4 Integration

The integral of a function with respect to a measure is a way to generalize the notion of weighted average. We define the integral in stages. We start with nonnegative simple functions.

Definition A.44. Let $f$ be a nonnegative simple function represented as $f(s) = \sum_{i=1}^{n} a_i I_{A_i}(s)$, with the $a_i$ distinct and the $A_i$ mutually disjoint. Then, the integral of $f$ with respect to $\mu$ is $\int f(s)\,d\mu(s) = \sum_{i=1}^{n} a_i \mu(A_i)$. If $0$ times $\infty$ occurs in such a sum, the result is $0$ by convention.

The integral of a nonnegative simple function is allowed to be $\infty$. It turns out that the formula for the integral of a nonnegative simple function is more general than in Definition A.44.



Proposition A.45.23 If $(S, \mathcal{A}, \mu)$ is a measure space, $A_i \in \mathcal{A}$ and $a_i \ge 0$ for $i = 1, \ldots, n$, and $f(s) = \sum_{i=1}^{n} a_i I_{A_i}(s)$, then $\int f(s)\,d\mu(s) = \sum_{i=1}^{n} a_i \mu(A_i)$.
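With counting measure on a finite set, the formula of Definition A.44 and Proposition A.45 can be evaluated directly, since $\mu(A) = |A|$. A sketch (the helper name `integral_simple` is ours):

```python
def integral_simple(f, S, mu):
    """Integral of a simple function f over S: the sum over the distinct
    values a of f of a * mu(f^{-1}({a}))."""
    values = {f(s) for s in S}
    return sum(a * mu({s for s in S if f(s) == a}) for a in values)

S = {0, 1, 2, 3, 4, 5}
counting = len                    # counting measure on a finite set
f = lambda s: s % 3               # simple: takes the values 0, 1, 2
# 0 * 2 + 1 * 2 + 2 * 2 = 6, which is also the plain sum of f over S.
assert integral_simple(f, S, counting) == sum(f(s) for s in S) == 6
```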

Next, we consider general nonnegative measurable functions. If $f$ is a nonnegative simple function, then for every nonnegative simple function $g \le f$, it follows easily from Definition A.44 that $\int g(s)\,d\mu(s) \le \int f(s)\,d\mu(s)$. Hence, the following definition contains no contradiction with Definition A.44.

Definition A.46. If $f$ is a nonnegative measurable function, then the integral of $f$ with respect to $\mu$ is $\int f(s)\,d\mu(s) = \sup_{g \le f,\ g \text{ simple}} \int g(s)\,d\mu(s)$.

For general functions $f$, define the positive part as $f^+(s) = \max\{f(s), 0\}$ and define the negative part as $f^-(s) = -\min\{f(s), 0\}$. Then $f(s) = f^+(s) - f^-(s)$. If $f \ge 0$, then $f^- \equiv 0$ and $\int f^-(s)\,d\mu(s) = 0$; hence the following definition contains no contradiction with the previous definitions.

Definition A.47. If $f$ is a measurable function, then the integral of $f$ with respect to $\mu$ is
$$\int f(s)\,d\mu(s) = \int f^+(s)\,d\mu(s) - \int f^-(s)\,d\mu(s),$$
if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that $f$ is integrable if the integral of $f$ is defined and is finite.

The integral is defined above in terms of its values at all points in S. Sometimes we wish to consider only a subset of S.

Definition A.48. If $A \in \mathcal{A}$ and $f$ is measurable, the integral of $f$ over $A$ with respect to $\mu$ is
$$\int_A f(s)\,d\mu(s) = \int I_A(s) f(s)\,d\mu(s).$$

Here are a few simple facts about integrals.

Proposition A.49. Let $(S, \mathcal{A}, \mu)$ be a measure space, and let $f, g : S \to \mathbb{R}$ be measurable.

1. If $f = g$ a.e. $[\mu]$, then $\int f(s)\,d\mu(s) = \int g(s)\,d\mu(s)$ if either integral is defined.

2. If $\int f(s)\,d\mu(s)$ is defined and $a$ is a constant, then
$$\int a f(s)\,d\mu(s) = a \int f(s)\,d\mu(s).$$

3. If $f$ and $g$ are integrable with respect to $\mu$, and $f \le g$, a.e. $[\mu]$, then
$$\int f(s)\,d\mu(s) \le \int g(s)\,d\mu(s).$$

4. If $f$ and $g$ are integrable and $\int_A f(s)\,d\mu(s) = \int_A g(s)\,d\mu(s)$ for all $A \in \mathcal{A}$, then $f = g$, a.e. $[\mu]$.

23This proposition is used in the proof of Theorem A.53.



The proofs of the next few theorems are essentially borrowed from Royden (1968).

Theorem A.50 (Fatou's lemma).24 Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of nonnegative measurable functions. Then
$$\int \liminf_{n\to\infty} f_n(s)\,d\mu(s) \le \liminf_{n\to\infty} \int f_n(s)\,d\mu(s).$$

PROOF. Let $f(s) = \liminf_{n\to\infty} f_n(s)$. Since
$$\int f(s)\,d\mu(s) = \sup_{\text{simple } \phi \le f} \int \phi(s)\,d\mu(s),$$
we need only prove that, for every simple $\phi \le f$,
$$\int \phi(s)\,d\mu(s) \le \liminf_{n\to\infty} \int f_n(s)\,d\mu(s).$$
Since this is clearly true if $\phi(s) = 0$, a.e. $[\mu]$, we will assume that $\mu(A) > 0$, where $A = \{s : \phi(s) > 0\}$. Let $\phi \le f$ be simple, let $\epsilon > 0$, and let $\delta$ and $M$ be the smallest and largest positive values that $\phi$ assumes. For each $n$, define
$$A_n = \{s \in A : f_k(s) > (1 - \epsilon)\phi(s) \text{ for all } k \ge n\}.$$
Since $(1 - \epsilon)\phi(s) < f(s)$ for all $s \in A$, we have $\bigcup_{n=1}^{\infty} A_n = A$ and $A_n \subseteq A_{n+1}$ for all $n$. Let $B_n = A \cap A_n^c$. Then
$$\int f_n(s)\,d\mu(s) \ge \int_{A_n} f_n(s)\,d\mu(s) \ge (1 - \epsilon) \int_{A_n} \phi(s)\,d\mu(s). \tag{A.51}$$
If $\mu(B_n) = \infty$ for some $n = n_0$, then $\mu(A) = \infty$ and $\int \phi(s)\,d\mu(s) = \infty$, since $\phi$ takes on only finitely many different values. The rightmost integral in (A.51) is at least $\delta\,\mu(A_n)$, which goes to $\infty$ as $n$ increases, hence $\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) = \infty$ and the result is true. So, assume $\mu(B_n) < \infty$ for all $n$. Since $\bigcap_{n=1}^{\infty} B_n = \emptyset$, it follows from Theorem A.19 that $\lim_{n\to\infty} \mu(B_n) = 0$. So, there exists $N$ such that $n \ge N$ implies $\mu(B_n) < \epsilon$. Since
$$\int \phi(s)\,d\mu(s) = \int_A \phi(s)\,d\mu(s) = \int_{A_n} \phi(s)\,d\mu(s) + \int_{B_n} \phi(s)\,d\mu(s) \le \int_{A_n} \phi(s)\,d\mu(s) + M\epsilon,$$
(A.51) implies that, for $n \ge N$,
$$\int f_n(s)\,d\mu(s) \ge (1 - \epsilon) \int \phi(s)\,d\mu(s) - \epsilon(1 - \epsilon)M.$$

24This theorem is used in the proofs of Theorems A.52, A.57, A.60, B.117, and 7.80.



If $\int \phi(s)\,d\mu(s) = \infty$, the result is true again. If $\int \phi(s)\,d\mu(s) = K < \infty$, then for every $n \ge N$,
$$\int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s) - \epsilon[(1 - \epsilon)M + K],$$
hence
$$\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s) - \epsilon[(1 - \epsilon)M + K].$$
Since this is true for every $\epsilon > 0$,
$$\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) \ge \int \phi(s)\,d\mu(s). \qquad \Box$$

Theorem A.52 (Monotone convergence theorem). Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of measurable nonnegative functions, and let $f$ be a measurable function such that $f_n(x) \le f(x)$ a.e. $[\mu]$ and $f_n(x) \to f(x)$ a.e. $[\mu]$. Then,
$$\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).$$

PROOF. Since $f_n \le f$ a.e. $[\mu]$ for all $n$, $\int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x)$ for all $n$. Hence
$$\liminf_{n\to\infty} \int f_n(x)\,d\mu(x) \le \limsup_{n\to\infty} \int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x).$$
By Fatou's lemma A.50, $\int f(x)\,d\mu(x) \le \liminf_{n\to\infty} \int f_n(x)\,d\mu(x)$. $\Box$
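The theorem can be illustrated with counting measure on $\{0, 1, 2, \ldots\}$, where integrals are sums. Take $f(k) = 2^{-k}$ and $f_n = f \cdot I_{\{0, \ldots, n\}}$, so that $0 \le f_n \uparrow f$ and $\int f\,d\mu = 2$; the sketch below checks that the integrals increase to 2.

```python
f = lambda k: 2.0 ** -k

def integral_fn(n):
    """Integral of f_n = f * I_{0,...,n} with respect to counting measure,
    i.e., the n-th partial sum of the geometric series."""
    return sum(f(k) for k in range(n + 1))

vals = [integral_fn(n) for n in range(60)]
assert all(v <= w for v, w in zip(vals, vals[1:]))   # monotone in n
assert abs(vals[-1] - 2.0) < 1e-12                   # approaches the integral of f
```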

Theorem A.53. If $\int f(s)\,d\mu(s)$ and $\int g(s)\,d\mu(s)$ are defined and they are not both infinite and of opposite signs, then $\int [f(s) + g(s)]\,d\mu(s) = \int f(s)\,d\mu(s) + \int g(s)\,d\mu(s)$.

PROOF. If $f, g \ge 0$, then by Theorem A.41, there exist sequences of nonnegative simple functions $\{f_n\}_{n=1}^{\infty}$ and $\{g_n\}_{n=1}^{\infty}$ such that $f_n \uparrow f$ and $g_n \uparrow g$. Then $(f_n + g_n) \uparrow (f + g)$ and $\int [f_n(s) + g_n(s)]\,d\mu(s) = \int f_n(s)\,d\mu(s) + \int g_n(s)\,d\mu(s)$ by Proposition A.45. The result now follows from the monotone convergence theorem A.52. For integrable $f$ and $g$, note that $(f+g)^+ + f^- + g^- = (f+g)^- + f^+ + g^+$. What we just proved for nonnegative functions implies that
$$\begin{aligned}
\int (f+g)^+(s)\,d\mu(s) + \int f^-(s)\,d\mu(s) + \int g^-(s)\,d\mu(s)
&= \int [(f+g)^+(s) + f^-(s) + g^-(s)]\,d\mu(s) \\
&= \int [(f+g)^-(s) + f^+(s) + g^+(s)]\,d\mu(s) \\
&= \int (f+g)^-(s)\,d\mu(s) + \int f^+(s)\,d\mu(s) + \int g^+(s)\,d\mu(s).
\end{aligned}$$
Rearranging the terms in the first and last expressions gives the desired result. If both $f$ and $g$ have infinite integral of the same sign, then it follows easily using



Proposition A.49 that $f + g$ has infinite integral of the same sign. Finally, if only one of $f$ and $g$ has infinite integral, it also follows easily from Proposition A.49 that $f + g$ has infinite integral of the same sign. $\Box$

A nonnegative function can be used to create a new measure.

Theorem A.54. Let $(S, \mathcal{A}, \mu)$ be a measure space, and let $f : S \to \mathbb{R}$ be nonnegative and measurable. Then $\nu(A) = \int_A f(s)\,d\mu(s)$ is a measure on $(S, \mathcal{A})$.

PROOF. Clearly, $\nu$ is nonnegative and $\nu(\emptyset) = 0$, since $f(s) I_\emptyset(s) = 0$, a.e. $[\mu]$. Let $\{A_n\}_{n=1}^{\infty}$ be disjoint. For each $n$, define $g_n(s) = f(s) I_{A_n}(s)$ and $f_n(s) = \sum_{i=1}^{n} g_i(s)$. Define $A = \bigcup_{n=1}^{\infty} A_n$. Then $0 \le f_n \le f I_A$, a.e. $[\mu]$, and $f_n$ converges to $f I_A$, a.e. $[\mu]$. So, the monotone convergence theorem A.52 says that
$$\lim_{n\to\infty} \int f_n(s)\,d\mu(s) = \nu(A). \tag{A.55}$$
Also, $\nu(A_i) = \int g_i(s)\,d\mu(s)$ for each $i$. It follows from Theorem A.53 that
$$\int f_n(s)\,d\mu(s) = \sum_{i=1}^{n} \int g_i(s)\,d\mu(s) = \sum_{i=1}^{n} \nu(A_i). \tag{A.56}$$
Take the limit as $n \to \infty$ of the first and last terms in (A.56) and compare to (A.55) to see that $\nu$ is countably additive. $\Box$

Theorem A.57 (Dominated convergence theorem). Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of measurable functions, and let $f$ and $g$ be measurable functions such that $f_n(x) \to f(x)$ a.e. $[\mu]$, $|f_n(x)| \le g(x)$ a.e. $[\mu]$, and $\int g(x)\,d\mu(x) < \infty$. Then,
$$\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).$$

PROOF. We have $-g(x) \le f_n(x) \le g(x)$ a.e. $[\mu]$, hence
$$g(x) + f_n(x) \ge 0 \text{ a.e. } [\mu], \qquad g(x) - f_n(x) \ge 0 \text{ a.e. } [\mu],$$
$$\lim_{n\to\infty} [g(x) + f_n(x)] = g(x) + f(x) \text{ a.e. } [\mu], \qquad \lim_{n\to\infty} [g(x) - f_n(x)] = g(x) - f(x) \text{ a.e. } [\mu].$$
It follows from Fatou's lemma A.50 and Theorem A.53 that
$$\int [g(x) + f(x)]\,d\mu(x) \le \liminf_{n\to\infty} \int [g(x) + f_n(x)]\,d\mu(x) = \int g(x)\,d\mu(x) + \liminf_{n\to\infty} \int f_n(x)\,d\mu(x),$$
so that
$$\int f(x)\,d\mu(x) \le \liminf_{n\to\infty} \int f_n(x)\,d\mu(x).$$



Similarly, it follows that
$$\int [g(x) - f(x)]\,d\mu(x) \le \liminf_{n\to\infty} \int [g(x) - f_n(x)]\,d\mu(x) = \int g(x)\,d\mu(x) - \limsup_{n\to\infty} \int f_n(x)\,d\mu(x),$$
so that
$$\int f(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x).$$

Together, these imply the conclusion of the theorem. $\Box$

An alternate version of the dominated convergence theorem is the following.

Proposition A.58.25 Let $\{f_n\}_{n=1}^{\infty}$ and $\{g_n\}_{n=1}^{\infty}$ be sequences of measurable functions such that $|f_n(x)| \le g_n(x)$, a.e. $[\mu]$. Let $f$ and $g$ be measurable functions such that $\lim_{n\to\infty} f_n(x) = f(x)$ and $\lim_{n\to\infty} g_n(x) = g(x)$, a.e. $[\mu]$. Suppose that $\lim_{n\to\infty} \int g_n(x)\,d\mu(x) = \int g(x)\,d\mu(x) < \infty$. Then, $\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x)$.

The proof is the same as the proof of Theorem A.57, except that gn replaces 9 in the first three lines and wherever 9 appears with fn and a limit is being taken.

For $\sigma$-finite measure spaces, the minimal condition that guarantees convergence of integrals is uniform integrability.

Definition A.59. A sequence of integrable functions $\{f_n\}_{n=1}^{\infty}$ is uniformly integrable (with respect to $\mu$) if $\lim_{c\to\infty} \sup_n \int_{\{x : |f_n(x)| > c\}} |f_n(x)|\,d\mu(x) = 0$.
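A standard example of what the definition rules out is $f_n = n \cdot I_{(0, 1/n)}$ on $(0, 1]$ with Lebesgue measure: each $f_n$ integrates to 1 and $f_n \to 0$ pointwise, yet the limit of the integrals (1) differs from the integral of the limit (0). The sketch below (our own closed-form encoding) checks that the tail integrals do not vanish uniformly in $n$:

```python
def tail_integral(n, c):
    """For f_n = n * I_{(0, 1/n)} and Lebesgue measure on (0, 1]:
    the integral of |f_n| over {|f_n| > c} is n * (1/n) = 1 when n > c,
    and 0 otherwise."""
    return 1.0 if n > c else 0.0

# For every cutoff c there are n with tail integral 1, so
# sup_n tail_integral(n, c) = 1 for all c: not uniformly integrable.
for c in [10, 100, 10 ** 6]:
    assert max(tail_integral(n, c) for n in [c + 1, 2 * c]) == 1.0
```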

Theorem A.60.26 Let $\mu$ be a finite measure. Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of integrable functions such that $\lim_{n\to\infty} f_n = f$ a.e. $[\mu]$. Then $\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x)$ if $\{f_n\}_{n=1}^{\infty}$ is uniformly integrable.27

PROOF. Let $f_n^+$, $f_n^-$, $f^+$, and $f^-$ be the positive and negative parts of $f_n$ and $f$. We will prove that the result holds for nonnegative functions and take the difference to get the general result. Let $\epsilon > 0$ and let $c$ be large enough so that $\sup_n \int_{\{x : f_n(x) > c\}} f_n(x)\,d\mu(x) < \epsilon$. The functions
$$g_n(x) = \begin{cases} f_n(x) & \text{if } f_n(x) \le c, \\ c & \text{if } f_n(x) > c \end{cases}$$
converge a.e. $[\mu]$ to
$$g(x) = \begin{cases} f(x) & \text{if } f(x) \le c, \\ c & \text{if } f(x) > c. \end{cases}$$
We now have
$$\int f(x)\,d\mu(x) \ge \int g(x)\,d\mu(x) = \lim_{n\to\infty} \int g_n(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x) - \epsilon,$$

25This proposition is used in the proof of Scheffé's theorem B.79.
26This theorem is used in the proofs of Theorems 1.121 and B.118.
27One could replace "if" by "if and only if," but we will never need the "only if" part of the theorem in this book.



where the equality follows from the dominated convergence theorem A.57 and the final inequality from our choice of $c$. Since this is true for every $\epsilon$, we have $\int f(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x)$. Combining this with Fatou's lemma A.50 gives

$$\int f(x)\,d\mu(x) = \lim_{n\to\infty} \int f_n(x)\,d\mu(x). \qquad \Box$$

A.5 Product Spaces

In Definition A.12, we introduced product spaces and product $\sigma$-fields. We would like to be able to define measures on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ in terms of measures on $(S_1, \mathcal{A}_1)$ and $(S_2, \mathcal{A}_2)$. The derivation of product measure given here resembles the derivation in Billingsley (1986, Section 18).

Lemma A.61.28 Let $(S_1, \mathcal{A}_1, \mu_1)$ and $(S_2, \mathcal{A}_2, \mu_2)$ be $\sigma$-finite measure spaces, and let $\mathcal{A}_1 \otimes \mathcal{A}_2$ be the product $\sigma$-field.

• For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $x \in S_1$, $B_x = \{y : (x, y) \in B\} \in \mathcal{A}_2$, and $\mu_2(B_x)$ is a measurable function from $(S_1, \mathcal{A}_1)$ to $\mathbb{R} \cup \{\infty\}$.

• For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $y \in S_2$, $B^y = \{x : (x, y) \in B\} \in \mathcal{A}_1$, and $\mu_1(B^y)$ is a measurable function from $(S_2, \mathcal{A}_2)$ to $\mathbb{R} \cup \{\infty\}$.

PROOF. Clearly, we need only prove one of the two sets of assertions. First, let $B = A_1 \times A_2$ with $A_i \in \mathcal{A}_i$ for $i = 1, 2$, and let $x \in S_1$. Then
$$B_x = \begin{cases} A_2 & \text{if } x \in A_1, \\ \emptyset & \text{otherwise.} \end{cases}$$
So, $B_x \in \mathcal{A}_2$. Let $\mathcal{C}$ be the collection of all sets $B \subseteq S_1 \times S_2$ such that $B_x \in \mathcal{A}_2$. If $B \in \mathcal{C}$, then $(B^c)_x = \{y : (x, y) \notin B\} = (B_x)^c$, so $B^c \in \mathcal{C}$. Let $\{B_n\}_{n=1}^{\infty} \subseteq \mathcal{C}$. Then it is easy to see that
$$\Big(\bigcup_{n=1}^{\infty} B_n\Big)_x = \Big\{y : (x, y) \in \bigcup_{n=1}^{\infty} B_n\Big\} = \bigcup_{n=1}^{\infty} \{y : (x, y) \in B_n\} = \bigcup_{n=1}^{\infty} (B_n)_x \in \mathcal{A}_2, \tag{A.62}$$
so $\bigcup_{n=1}^{\infty} B_n \in \mathcal{C}$. Clearly, $S_1 \times S_2 \in \mathcal{C}$, so $\mathcal{C}$ is a $\sigma$-field containing all product sets; hence it contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. Next, let $f_B(x) = \mu_2(B_x)$ for $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$. Write $S_1 \times S_2 = \bigcup_{n=1}^{\infty} E_n$ with $E_n = A_{1n} \times A_{2n}$ and $\mu_i(A_{in}) < \infty$ for all $n$ and $i = 1, 2$, and with the $E_n$ disjoint. Then let $f_{B,n}(x) = \mu_2((B \cap E_n)_x)$. It follows that $f_B = \sum_{n=1}^{\infty} f_{B,n}$. If we can show that $f_{B,n}$ is measurable for each $n$, then so is $f_B$, since the $f_{B,n}$ are nonnegative and the sum is well defined. If $B = B_1 \times B_2$, then $f_{B,n}(x) = I_{A_{1n} \cap B_1}(x)\,\mu_2(A_{2n} \cap B_2)$, which is a measurable function. Let $\mathcal{D}$ be the collection of all sets $D \subseteq S_1 \times S_2$

28This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems A.69 and B.46.



such that $f_{D,n}$ is measurable. If $D \in \mathcal{D}$, then $f_{D^c,n}(x) = I_{A_{1n}}(x)\,\mu_2(A_{2n}) - f_{D,n}(x)$, which is measurable, so $D^c \in \mathcal{D}$. If $\{D_m\}_{m=1}^\infty \subseteq \mathcal{D}$ with the $D_m$ disjoint, then
\[
f_{\cup_m D_m,\,n}(x) = \mu_2\Big(\Big(\bigcup_{m=1}^\infty (D_m \cap E_n)\Big)_x\Big) = \sum_{m=1}^\infty \mu_2\big((D_m \cap E_n)_x\big) = \sum_{m=1}^\infty f_{D_m,n}(x),
\]
which is a measurable function, so $\bigcup_{m=1}^\infty D_m \in \mathcal{D}$. Clearly, $S_1 \times S_2 \in \mathcal{D}$, so $\mathcal{D}$ is a $\lambda$-system (see Definition A.14) that contains the $\pi$-system of product sets. By the $\pi$-$\lambda$ theorem A.17, $\mathcal{D}$ contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. □
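The section operation in the lemma can be sketched concretely on a small finite product space, with counting measure playing the role of $\mu_2$; the sets $S_1$, $S_2$, and $B$ below are made-up choices for illustration.

```python
# A finite sketch of Lemma A.61 with counting measure in the role of mu_2.
# The spaces S1, S2 and the set B are hypothetical choices.

S1 = {1, 2, 3}
S2 = {'a', 'b'}

def section(B, x):
    """The section B_x = {y : (x, y) in B}."""
    return {y for (xx, y) in B if xx == x}

def mu2(A):
    """Counting measure on S2."""
    return len(A)

B = {(1, 'a'), (1, 'b'), (2, 'a')}   # B need not be a product set

# x -> mu2(B_x) is a function on S1 (trivially measurable on a finite space).
f_B = {x: mu2(section(B, x)) for x in S1}
assert f_B == {1: 2, 2: 1, 3: 0}

# Sections commute with unions, the key step (A.62) in the proof:
C = {(3, 'b'), (2, 'a')}
for x in S1:
    assert section(B | C, x) == section(B, x) | section(C, x)
```

Every assertion in the sketch is an instance of a step in the proof above, specialized to a finite space where measurability is automatic.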

The following corollary to Lemma A.61 is a sort of dual to part 1 of Theorem A.38.

Corollary A.63. Let $(S_1,\mathcal{A}_1)$, $(S_2,\mathcal{A}_2)$, and $(\mathcal{X},\mathcal{B})$ be measurable spaces. If $f : S_1 \times S_2 \to \mathcal{X}$ is measurable, then for every $s_1 \in S_1$, $f_{s_1}(s_2) = f(s_1,s_2)$ defines a measurable function from $S_2$ to $\mathcal{X}$.

Lemma A.64.[29] Suppose that $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ are $\sigma$-finite measure spaces. For each $x \in S_1$, $y \in S_2$, and $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, define $B_x$ and $B^y$ as in Lemma A.61. Then $\nu_1(B) = \int_{S_1} \mu_2(B_x)\,d\mu_1(x)$ and $\nu_2(B) = \int_{S_2} \mu_1(B^y)\,d\mu_2(y)$ both define the same measure on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. If $A_i \in \mathcal{A}_i$ for $i = 1,2$, then $\nu_1(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.

PROOF. First, we prove that $\nu_1$ is a measure; the proof that $\nu_2$ is a measure is identical. Clearly, $\nu_1(B) \geq 0$ for all $B$ and $\nu_1(\emptyset) = 0$. If $\{B_n\}_{n=1}^\infty$ are disjoint, then
\[
\nu_1\Big(\bigcup_{n=1}^\infty B_n\Big) = \int_{S_1} \sum_{n=1}^\infty \mu_2\big((B_n)_x\big)\,d\mu_1(x) = \sum_{n=1}^\infty \int_{S_1} \mu_2\big((B_n)_x\big)\,d\mu_1(x) = \sum_{n=1}^\infty \nu_1(B_n),
\]
where the first equality follows from the definition of $\nu_1$, the fact that $\mu_2$ is countably additive, and (A.62); the second equality follows from the monotone convergence theorem A.52 and the fact that $\sum_{n=1}^m \mu_2((B_n)_x) \leq \sum_{n=1}^\infty \mu_2((B_n)_x)$ for all $m$; and the last equality follows from the definition of $\nu_1$. This proves that $\nu_1$ (and so too $\nu_2$) is a measure. Note that if $B = A_1 \times A_2$, then
\[
\nu_1(B) = \int_{S_1} I_{A_1}(x)\mu_2(A_2)\,d\mu_1(x) = \mu_1(A_1)\mu_2(A_2) = \int_{S_2} I_{A_2}(y)\mu_1(A_1)\,d\mu_2(y) = \nu_2(B).
\]
So, $\nu_1 = \nu_2$ on the $\pi$-system consisting of product sets. Since each of $\mu_1$ and $\mu_2$ is $\sigma$-finite, there exists a countable collection of product sets whose union is $S_1 \times S_2$

[29] This lemma is used in the proof of Lemma A.67.


and such that each one has finite $\nu_1 = \nu_2$ measure. By Theorem A.26, $\nu_1$ agrees with $\nu_2$ on all of $\mathcal{A}_1 \otimes \mathcal{A}_2$. □

Definition A.65. Let $(S_i,\mathcal{A}_i,\mu_i)$ for $i = 1,2$ be $\sigma$-finite measure spaces. Define the product measure $\mu_1 \times \mu_2$ on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ as the common value of the two measures $\nu_1$ and $\nu_2$ in Lemma A.64.

Lebesgue measure on $\mathbb{R}^2$, denoted $dx\,dy$, is a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables. (See Theorem B.66.)

Proposition A.66. Let $\mu$ be a measure on a product space $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. Then $\mu$ is a product measure if and only if there exist set functions $\mu_i : \mathcal{A}_i \to \mathbb{R}$ for $i = 1,2$ such that, for every $A_1 \in \mathcal{A}_1$ and $A_2 \in \mathcal{A}_2$, $\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.
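The construction in Lemma A.64 can be sketched numerically on finite spaces, where the integrals reduce to sums; the weighted counting measures below are made-up examples, not part of the text.

```python
# A finite sketch of Lemma A.64 / Definition A.65: building mu1 x mu2 from
# two weighted counting measures. The weights are arbitrary choices.

mu1 = {1: 0.5, 2: 2.0}          # measure on S1 = {1, 2}
mu2 = {'a': 1.0, 'b': 3.0}      # measure on S2 = {'a', 'b'}

def nu1(B):
    """nu1(B) = integral over x of mu2(B_x) d mu1(x)."""
    return sum(mu1[x] * sum(mu2[y] for y in mu2 if (x, y) in B)
               for x in mu1)

def nu2(B):
    """nu2(B) = integral over y of mu1(B^y) d mu2(y)."""
    return sum(mu2[y] * sum(mu1[x] for x in mu1 if (x, y) in B)
               for y in mu2)

# On a product set A1 x A2 the measure factors, as the lemma asserts:
A1, A2 = {1}, {'a', 'b'}
rect = {(x, y) for x in A1 for y in A2}
assert nu1(rect) == sum(mu1[x] for x in A1) * sum(mu2[y] for y in A2)

# The two iterated constructions agree on every set, not just rectangles:
B = {(1, 'a'), (2, 'b')}
assert nu1(B) == nu2(B)
```

The agreement of `nu1` and `nu2` on arbitrary sets is exactly the content of the uniqueness step via Theorem A.26.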

Lemma A.67.[30] Let $f$ be a measurable function from $S_1 \times S_2$ to $\mathbb{R}$ such that either $\{x \in S_1 : \int |f(x,y)|\,d\mu_2(y) = \infty\} \subseteq A \in \mathcal{A}_1$, where $\mu_1(A) = 0$, or $f \geq 0$. Then there is a measurable (possibly extended real-valued) function $g : S_1 \to \mathbb{R} \cup \{\pm\infty\}$ such that $g(x) = \int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$. If $f$ is the indicator of a measurable set $B$, then
\[
\int g(x)\,d\mu_1(x) = \mu_1 \times \mu_2(B). \tag{A.68}
\]

PROOF. For each $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, note that $\int I_B(x,y)\,d\mu_2(y) = \mu_2(B_x)$, where $B_x$ is defined in Lemma A.61. It was shown there that $\mu_2(B_x)$ is a measurable function of $x$. It follows from Lemma A.64 that (A.68) holds. It now follows from the linearity of integrals that if $f$ is a nonnegative simple function, then $g(x) = \int f(x,y)\,d\mu_2(y)$ is a measurable function of $x$. If $f$ is a nonnegative measurable function, let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative simple functions such that $f_n \leq f$ for all $n$ and $\lim_{n\to\infty} f_n(x,y) = f(x,y)$ for all $(x,y)$. Then the monotone convergence theorem A.52 says that $\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y) = \int f(x,y)\,d\mu_2(y) = g(x)$ for all $x$. By part 5 of Theorem A.38, $g$ is measurable. If $\mu_1(\{x \in S_1 : \int |f(x,y)|\,d\mu_2(y) = \infty\}) = 0$, then the argument just given applies to both $f^+$ and $f^-$, and the difference $\int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ is defined a.e. $[\mu_1]$ and equals $\int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$. If we let $g(x) = \int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ for all $x \notin A$, and let $g(x)$ be constant on $A$, then $g(x) = \int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$, and $g$ is measurable. □

The following two theorems will be used many times in the study of product spaces.

Theorem A.69 (Tonelli's theorem). Let $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ be $\sigma$-finite measure spaces. Let $f : S_1 \times S_2 \to \mathbb{R}$ be a nonnegative measurable function. Then
\[
\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int \left[\int f(x,y)\,d\mu_1(x)\right] d\mu_2(y) = \int \left[\int f(x,y)\,d\mu_2(y)\right] d\mu_1(x).
\]

[30] This lemma is used in the proofs of Theorem A.70 and of Lemmas 6.48 and B.46.

PROOF. As in the proof of Lemma A.67, let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative simple functions such that $f_n \leq f$ for all $n$ and $\lim_{n\to\infty} f_n(x,y) = f(x,y)$ for all $(x,y)$. If $f_n(x,y) = \sum_{i=1}^{k_n} a_{i,n} I_{B_{i,n}}(x,y)$, then $\int f_n(x,y)\,d\mu_2(y) = \sum_{i=1}^{k_n} a_{i,n}\,\mu_2(B_{i,n,x})$ by Lemma A.61, and
\[
\int \left[\int f_n(x,y)\,d\mu_2(y)\right] d\mu_1(x) = \int f_n(x,y)\,d\mu_1 \times \mu_2(x,y)
\]
by (A.68). Since $0 \leq \int f_n(x,y)\,d\mu_2(y) \leq \int f(x,y)\,d\mu_2(y)$ for all $x$ and $n$, and $\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y) = \int f(x,y)\,d\mu_2(y)$ as in the proof of Lemma A.67, it follows from the monotone convergence theorem A.52 that
\[
\begin{aligned}
\int f(x,y)\,d\mu_1 \times \mu_2(x,y) &= \lim_{n\to\infty} \int f_n(x,y)\,d\mu_1 \times \mu_2(x,y) \\
&= \lim_{n\to\infty} \int \left[\int f_n(x,y)\,d\mu_2(y)\right] d\mu_1(x) \\
&= \int \left[\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y)\right] d\mu_1(x) \\
&= \int \left[\int f(x,y)\,d\mu_2(y)\right] d\mu_1(x).
\end{aligned}
\]
The proof that the iterated integrals can be calculated in the other order is similar. □

Theorem A.70 (Fubini's theorem). Let $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ be $\sigma$-finite measure spaces. If $f : S_1 \times S_2 \to \mathbb{R}$ is integrable with respect to $\mu_1 \times \mu_2$, then
\[
\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int \left[\int f(x,y)\,d\mu_1(x)\right] d\mu_2(y) = \int \left[\int f(x,y)\,d\mu_2(y)\right] d\mu_1(x).
\]

PROOF. By Lemma A.67, there is a measurable $g$ with $g(x) = \int |f(x,y)|\,d\mu_2(y)$, a.e. $[\mu_1]$. Then
\[
\int g(x)\,d\mu_1(x) = \int \left[\int |f(x,y)|\,d\mu_2(y)\right] d\mu_1(x) = \int |f(x,y)|\,d\mu_1 \times \mu_2(x,y) < \infty
\]
follows from Tonelli's theorem A.69 applied to $|f|$. It follows that
\[
\Big\{x : \int |f(x,y)|\,d\mu_2(y) = \infty\Big\} \subseteq A \in \mathcal{A}_1
\]
for some $A$ with $\mu_1(A) = 0$. Apply Tonelli's theorem A.69 to $f^+$ and $f^-$ and note that the set of all $x$ such that $\int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ is undefined is a subset of $\{x : \int |f(x,y)|\,d\mu_2(y) = \infty\}$. It follows that this difference of integrals is defined a.e. $[\mu_1]$ and that the integral (with respect to $\mu_1$) of the difference (which equals $\int[\int f(x,y)\,d\mu_2(y)]\,d\mu_1(x)$) is the difference of the integrals (which equals $\int f(x,y)\,d\mu_1 \times \mu_2(x,y)$). □

All of the results of this section can be extended to finite product spaces $S_1 \times \cdots \times S_n$ by simple inductive arguments.
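The integrability hypothesis in Fubini's theorem is not cosmetic. The following sketch uses counting measure on the nonnegative integers and a standard textbook-style counterexample (not taken from this appendix): a signed $f$ with $\int |f| = \infty$ whose two iterated sums disagree. Tonelli's theorem still applies to $|f|$, where both orders give $+\infty$.

```python
# Counting-measure sketch on S1 = S2 = {0, 1, 2, ...} of why Fubini (A.70)
# needs integrability: f below has iterated sums 0 and 1 depending on the
# order of integration, possible only because the sum of |f| diverges.

def f(i, j):
    if i == j:
        return 1.0
    if j == i + 1:
        return -1.0
    return 0.0

# Row i has full support {i, i+1}; column j has full support {j-1, j}, so
# each inner "integral" can be computed exactly over its whole support.
def row_sum(i):                     # integral of f(i, .) over all j
    return f(i, i) + f(i, i + 1)    # always 0

def col_sum(j):                     # integral of f(., j) over all i
    lower = f(j - 1, j) if j > 0 else 0.0
    return lower + f(j, j)          # 1 at j = 0, else 0

iterated_rows_then_sum = sum(row_sum(i) for i in range(1000))
iterated_cols_then_sum = sum(col_sum(j) for j in range(1000))
assert iterated_rows_then_sum == 0.0
assert iterated_cols_then_sum == 1.0
```

For nonnegative or integrable $f$ on these discrete spaces, the two iterated sums would agree, exactly as Theorems A.69 and A.70 assert.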


A.6 Absolute Continuity

It is also common to consider two different measures on the same space.

Definition A.71. Let $\mu_1$ and $\mu_2$ be two measures on the same space $(S,\mathcal{A})$. Suppose that, for all $A \in \mathcal{A}$, $\mu_1(A) = 0$ implies $\mu_2(A) = 0$. Then we say that $\mu_2$ is absolutely continuous with respect to $\mu_1$, denoted $\mu_2 \ll \mu_1$. When $\mu_2 \ll \mu_1$, we say that $\mu_1$ is a dominating measure for $\mu_2$.

Consider next a function $f$ and a measure $\mu$ such that $\int f(x)\,d\mu(x)$ is defined. Then $\nu(A) = \int_A f(x)\,d\mu(x)$ is defined for all measurable $A$. If $f$ takes on negative values on a set of positive measure, then $\nu$ is not a measure because it assigns negative values to some sets, such as $A = \{x : f(x) < 0\}$. However, $\nu$ is still a signed measure.
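On a countable space these notions reduce to statements about point masses. A minimal sketch, with made-up weights, of a measure built from a nonnegative density and of the absolute continuity it automatically satisfies:

```python
# Discrete sketch of Definition A.71: mu2(A) = integral over A of f d mu1
# for nonnegative f gives mu2 << mu1. The weights and f are arbitrary.

S = [0, 1, 2, 3]
mu1 = {0: 1.0, 1: 0.5, 2: 0.0, 3: 2.0}   # note mu1({2}) = 0
f = {0: 2.0, 1: 0.0, 2: 5.0, 3: 1.0}

def mu2(A):
    """mu2(A) = integral over A of f d mu1."""
    return sum(f[s] * mu1[s] for s in A)

# Every mu1-null set is mu2-null, so mu2 << mu1:
candidates = ([], [2], [1, 2], [0, 3])
null_sets = [A for A in candidates if sum(mu1[s] for s in A) == 0]
assert all(mu2(A) == 0 for A in null_sets)

# The converse can fail: mu2({1}) = 0 although mu1({1}) > 0,
# so mu1 is NOT absolutely continuous with respect to mu2.
assert mu2([1]) == 0 and mu1[1] > 0
```

Replacing `f` by a function with negative values would make `mu2` a signed measure, as the paragraph above notes.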

If the dominated measure is finite, there is a necessary and sufficient condition for absolute continuity that resembles the definition of continuity of functions.

Lemma A.72.[31] Let $\mu_1$ and $\mu_2$ be measures on a space $(S,\mathcal{A})$. Consider the following condition:
\[
\text{For every } \epsilon > 0, \text{ there is } \delta > 0 \text{ such that } \mu_1(A) < \delta \text{ implies } \mu_2(A) < \epsilon. \tag{A.73}
\]

• If condition (A.73) holds, then $\mu_2 \ll \mu_1$.

• If $\mu_2 \ll \mu_1$ and $\mu_2$ is finite, then condition (A.73) holds.

PROOF. For the first part, let $\epsilon > 0$ and suppose that $\mu_1(A) = 0$. Then $\mu_1(A) < \delta$, so $\mu_2(A) < \epsilon$. Since this is true for all $\epsilon > 0$, $\mu_2(A) = 0$. For the second part, suppose that $\mu_2 \ll \mu_1$, that $\mu_2$ is finite, and that (A.73) fails. Then there exists $\epsilon > 0$ such that, for every integer $n$, there is $A_n$ with $\mu_1(A_n) < 1/n^2$ but $\mu_2(A_n) \geq \epsilon$. Let $A = \bigcap_{k=1}^\infty \bigcup_{n=k}^\infty A_n$. By the first Borel–Cantelli lemma A.20, $\mu_1(A) = 0$, so $\mu_2(A) = 0$. Since $\mu_2$ is finite, Theorem A.19 implies that
\[
\mu_2(A) = \lim_{k\to\infty} \mu_2\Big(\bigcup_{n=k}^\infty A_n\Big) \geq \epsilon.
\]
This is a contradiction. □

The following theorem says that the first part of Example A.8 on page 574 is the most general form of absolute continuity with respect to $\sigma$-finite measures. The proof is mostly borrowed from Royden (1968).

Theorem A.74 (Radon–Nikodym theorem). Let $\mu_1$ and $\mu_2$ be measures on $(S,\mathcal{A})$ such that $\mu_2 \ll \mu_1$ and $\mu_1$ is $\sigma$-finite. Then there exists an extended real-valued measurable function $f : S \to [0,\infty]$ such that for every $A \in \mathcal{A}$,
\[
\mu_2(A) = \int_A f(x)\,d\mu_1(x). \tag{A.75}
\]

[31] This lemma is used in the proof of Lemma B.119.


Also, if $g : S \to \mathbb{R}$ is $\mu_2$-integrable, then
\[
\int g(x)\,d\mu_2(x) = \int g(x)f(x)\,d\mu_1(x). \tag{A.76}
\]
The function $f$ is called the Radon–Nikodym derivative of $\mu_2$ with respect to $\mu_1$, and it is unique a.e. $[\mu_1]$. The Radon–Nikodym derivative is sometimes denoted $(d\mu_2/d\mu_1)(s)$. If $\mu_2$ is $\sigma$-finite, then $f$ is finite a.e. $[\mu_1]$.

PROOF. First, we prove uniqueness a.e. $[\mu_1]$. Suppose that such an $f$ exists. Let $g$ be another function such that $f$ and $g$ are not a.e. $[\mu_1]$ equal. Let $A_n = \{x : f(x) > g(x) + 1/n\}$ and $B_n = \{x : f(x) < g(x) - 1/n\}$. Since $f$ and $g$ are not equal a.e. $[\mu_1]$, there exists $n$ such that either $\mu_1(A_n) > 0$ or $\mu_1(B_n) > 0$. Let $A$ be a subset of either $A_n$ or $B_n$ with finite positive measure. Then $\int_A f(x)\,d\mu_1(x) \neq \int_A g(x)\,d\mu_1(x)$. Hence $g \neq d\mu_2/d\mu_1$.

The proof of existence proceeds as follows. First, we show that we can reduce to the case in which $\mu_1$ is finite. Then, we create a collection of signed measures $\nu_\alpha$ indexed by a real number $\alpha$. For each $\alpha$ we find a set $A^\alpha$ such that every subset of $A^\alpha$ has nonnegative $\nu_\alpha$ measure and every subset of the complement $B^\alpha$ has nonpositive $\nu_\alpha$ measure. We then show that $B^\beta \subseteq B^\alpha$ for $\beta \geq \alpha$, which allows us to define $f(x) = \sup\{\alpha : x \in B^\alpha\}$. Finally, we show that $f$ satisfies (A.75) and (A.76).

Now, we prove that we need only consider finite $\mu_1$. Since $\mu_1$ is $\sigma$-finite, let $\{A_i\}_{i=1}^\infty$ be disjoint elements of $\mathcal{A}$ such that $\mu_1(A_i) < \infty$ and $S = \bigcup_{i=1}^\infty A_i$. Let $\mu_{j,i}$ be $\mu_j$ restricted to $A_i$ for $j = 1,2$ and each $i$. Then $\mu_{2,i} \ll \mu_{1,i}$ for each $i$, and each $\mu_{1,i}$ is finite. Suppose that for each $i$ we can find $f_i$ as in the theorem with $\mu_j$ replaced by $\mu_{j,i}$ for $j = 1,2$. Then $f(x) = \sum_{i=1}^\infty I_{A_i}(x)f_i(x)$ is the function required by the theorem as stated. Hence, we prove the theorem only for the case in which $\mu_1$ is finite.

Suppose that $\mu_1$ is finite, and define the signed measure $\nu_\alpha = \alpha\mu_1 - \mu_2$ for each nonnegative rational number $\alpha$. (Note that $\nu_\alpha(A)$ never equals $\infty$, although it may equal $-\infty$.) For each $\alpha$, define
\[
\begin{aligned}
\mathcal{P}_\alpha &= \{A \in \mathcal{A} : \nu_\alpha(B) \geq 0 \text{ for every } B \subseteq A\}, \\
\lambda_\alpha &= \sup_{A \in \mathcal{P}_\alpha} \nu_\alpha(A).
\end{aligned}
\]
That is, $\lambda_\alpha$ is the supremum of the signed measures of sets all of whose subsets have nonnegative signed measure.[32] Since $\emptyset \in \mathcal{P}_\alpha$, $\lambda_\alpha \geq 0$. Let $\{A_i\}_{i=1}^\infty \subseteq \mathcal{P}_\alpha$ be such that $\lambda_\alpha = \lim_{i\to\infty} \nu_\alpha(A_i)$, and let $A^\alpha = \bigcup_{i=1}^\infty A_i$. Since every subset of $A^\alpha$ can be written as a union of subsets of the $A_i$, it follows that $A^\alpha \in \mathcal{P}_\alpha$; hence $\lambda_\alpha \geq \nu_\alpha(A^\alpha)$. Since $A^\alpha \setminus A_i \subseteq A^\alpha$, it follows that $\nu_\alpha(A^\alpha \setminus A_i) \geq 0$ for all $i$ and $\nu_\alpha(A^\alpha) = \nu_\alpha(A^\alpha \setminus A_i) + \nu_\alpha(A_i) \geq \nu_\alpha(A_i)$ for all $i$. It follows that $\lambda_\alpha \leq \nu_\alpha(A^\alpha)$. Hence $\lambda_\alpha = \nu_\alpha(A^\alpha) < \infty$. Define $B^\alpha = (A^\alpha)^c$.

Next, we prove that every subset of $B^\alpha$ has nonpositive signed measure.[33] If not, let $B \subseteq B^\alpha$ be such that $\nu_\alpha(B) > 0$. If $B$ has no subsets with negative signed measure, then $B \cup A^\alpha \in \mathcal{P}_\alpha$ and $\nu_\alpha(A^\alpha \cup B) > \lambda_\alpha$, a contradiction. So, let $n_1$ be the smallest positive integer such that there is a subset $B_1 \subseteq B$ with $\nu_\alpha(B_1) < -1/n_1$. For each $k > 1$, let $n_k$ be the smallest positive integer such that there exists a subset $B_k \subseteq B \setminus \bigcup_{i=1}^{k-1} B_i$ with $\nu_\alpha(B_k) < -1/n_k$. Now, let $C = B \setminus \bigcup_{k=1}^\infty B_k$. Clearly $\nu_\alpha(C) > 0$. If we prove that $C$ has no subsets with negative signed measure, then $C \in \mathcal{P}_\alpha$ and we have another contradiction. So, suppose that $D \subseteq C$ has $\nu_\alpha(D) = -\epsilon < 0$. Since $\nu_\alpha(B) > 0$, it must be that $\sum_{k=1}^\infty \nu_\alpha(B_k) > -\infty$. Hence $\lim_{k\to\infty} n_k = \infty$. So, there is $k$ such that $1/(n_{k+1} - 1) < \epsilon$. Notice that $D \subseteq C \subseteq B \setminus \bigcup_{i=1}^k B_i$. Since $\nu_\alpha(D) < -1/(n_{k+1} - 1)$, this contradicts the definition of $n_{k+1}$.

[32] The sets in $\mathcal{P}_\alpha$ are often called the positive sets relative to the signed measure $\nu_\alpha$.
[33] Such sets are called negative sets relative to the signed measure $\nu_\alpha$.

If $\beta > \alpha$, we have
\[
\nu_\alpha(A^\alpha \cap B^\beta) \geq 0, \qquad \nu_\beta(A^\alpha \cap B^\beta) \leq 0.
\]
Subtract the first inequality from the second to get $(\beta - \alpha)\mu_1(A^\alpha \cap B^\beta) \leq 0$, from which it follows that $\mu_1(A^\alpha \cap B^\beta) = 0$. Since $\nu_\beta(A) \geq \nu_\alpha(A)$ for $\beta \geq \alpha$, we can assume that $A^\alpha \subseteq A^\beta$ if $\beta \geq \alpha$. It follows that $B^\beta \subseteq B^\alpha$ for $\beta \geq \alpha$, and we can define $f(x) = \sup\{\alpha : x \in B^\alpha\}$. Since $B^0 = S$, $f(x) \geq 0$ for all $x$. It is easy to see that $f(x) \geq \alpha$ if $x \in B^\alpha$ and $f(x) \leq \alpha$ if $x \in A^\alpha$. It is also easy to see that $\{x : f(x) > b\} = \bigcup_{\alpha > b} B^\alpha$, where the union is over rational $\alpha$. Since this is a countable union of measurable sets, it is measurable. By Lemma A.35, $f$ is measurable.

Next, we prove that (A.75) holds for every $A \in \mathcal{A}$. Let $A \in \mathcal{A}$ be arbitrary and let $\epsilon > 0$ be given. Let $N > \mu_1(A)/\epsilon$ be a positive integer. Define $E_k = A \cap B^{k/N} \cap A^{(k+1)/N}$ and $E_\infty = A \setminus \bigcup_{k=1}^\infty A^{k/N}$. Then $A = \bigcup_{k=0}^\infty E_k \cup E_\infty$ and the $E_k$ are all disjoint. So $\mu_2(A) = \mu_2(E_\infty) + \sum_{k=0}^\infty \mu_2(E_k)$. By construction, $f(x) \in [k/N, (k+1)/N]$ for all $x \in E_k$ and $f(x) = \infty$ for all $x \in E_\infty$. Since $\nu_{k/N}(E_k) \leq 0$ and $\nu_{(k+1)/N}(E_k) \geq 0$, we have, for finite $k$,
\[
\Big|\mu_2(E_k) - \int_{E_k} f(x)\,d\mu_1(x)\Big| \leq \frac{1}{N}\mu_1(E_k). \tag{A.77}
\]
If $\mu_1(E_\infty) > 0$, then $\mu_2(E_\infty) = \infty$, since $\nu_\alpha(E_\infty) < 0$ for all $\alpha$. If $\mu_1(E_\infty) = 0$, then $\mu_2(E_\infty) = 0$ by absolute continuity. Either way, $\mu_2(E_\infty) = \int_{E_\infty} f(x)\,d\mu_1(x)$. Adding this into the sum of (A.77) over all finite $k$ gives
\[
\Big|\mu_2(A) - \int_A f(x)\,d\mu_1(x)\Big| \leq \frac{1}{N}\mu_1(A) < \epsilon.
\]

Since this is true for every $\epsilon > 0$, (A.75) is established. To prove (A.76), we note that it is true if $g$ is an indicator function; hence it is true for all simple functions. By the monotone convergence theorem A.52, it is true for all nonnegative functions, and by subtraction it is true for all integrable functions.

Finally, if $f(x) = \infty$ for all $x \in A$ with $\mu_1(A) > 0$, then $\mu_2(B) = \infty$ for every $B \subseteq A$ such that $\mu_1(B) > 0$. It is then impossible for $\mu_2$ to be $\sigma$-finite. □
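On a countable space the Radon–Nikodym derivative is simply a ratio of point masses. A minimal sketch, with invented measures, that checks (A.75) and (A.76) directly:

```python
# Discrete sketch of Theorem A.74: when mu2 << mu1 on a countable space,
# dmu2/dmu1 is the pointwise ratio of point masses (set to 0 where mu1
# vanishes). Both measures below are made-up examples.

mu1 = {'a': 0.5, 'b': 0.25, 'c': 0.25, 'd': 0.0}
mu2 = {'a': 0.1, 'b': 0.6, 'c': 0.3, 'd': 0.0}   # mu2 << mu1 holds

rn = {s: (mu2[s] / mu1[s] if mu1[s] > 0 else 0.0) for s in mu1}

# (A.75): mu2(A) = integral over A of f d mu1, for every A.
for A in (['a'], ['a', 'b'], ['b', 'c', 'd']):
    assert abs(sum(mu2[s] for s in A)
               - sum(rn[s] * mu1[s] for s in A)) < 1e-12

# (A.76): integral of g d mu2 = integral of g * f d mu1.
g = {'a': 1.0, 'b': -2.0, 'c': 4.0, 'd': 7.0}
lhs = sum(g[s] * mu2[s] for s in mu2)
rhs = sum(g[s] * rn[s] * mu1[s] for s in mu1)
assert abs(lhs - rhs) < 1e-12
```

Uniqueness a.e. $[\mu_1]$ is visible here too: the value of `rn` at `'d'` is arbitrary, because that point carries no $\mu_1$-mass.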

In statistical applications, we will often have a class of measures, each of which is absolutely continuous with respect to a single $\sigma$-finite measure. It would be nice if the single dominating measure were in the original class or could be constructed from the class. The following theorem addresses this problem. The proof is borrowed from Lehmann (1986).


Theorem A.78.[34] Let $\mu$ be a $\sigma$-finite measure on $(S,\mathcal{A})$. Suppose that $N$ is a collection of measures on $(S,\mathcal{A})$ such that for every $\nu \in N$, $\nu \ll \mu$. Then there exist a sequence of nonnegative numbers $\{c_i\}_{i=1}^\infty$ and a sequence of elements of $N$, $\{\nu_i\}_{i=1}^\infty$, such that $\sum_{i=1}^\infty c_i = 1$ and $\nu \ll \sum_{i=1}^\infty c_i\nu_i$ for every $\nu \in N$.

PROOF. If $N$ is a countable collection, the result is trivially true. If $\mu$ is finite, let $\lambda = \mu$. If $\mu$ is not finite, then there exists a countable partition of $S$ into $\{S_i\}_{i=1}^\infty$ such that $0 < \mu(S_i) = d_i < \infty$. For each $B \in \mathcal{A}$, let $\lambda(B) = \sum_{i=1}^\infty \mu(B \cap S_i)/(2^i d_i)$. In either case, $\lambda$ is finite and $\nu \ll \lambda$ for every $\nu \in N$. Define $\mathcal{Q}$ to be the collection of all measures of the form $\sum_{i=1}^\infty a_i\nu_i$, where $\sum_{i=1}^\infty a_i = 1$ with each $a_i \geq 0$ and each $\nu_i \in N$. Clearly, $\beta \in \mathcal{Q}$ implies $\beta \ll \lambda$.

Next, let $\mathcal{V}$ be the collection of sets $C$ in $\mathcal{A}$ such that there exists $Q \in \mathcal{Q}$ satisfying $\lambda(\{x \in C : dQ/d\lambda(x) = 0\}) = 0$ and $Q(C) > 0$. To see that $\mathcal{V}$ is nonempty, let $\nu$ be a measure in $N$ that is not identically 0 and let $C = \{x : d\nu/d\lambda(x) > 0\}$. Then with $Q = \nu$, we have $\{x \in C : dQ/d\lambda(x) = 0\} = \emptyset$ and $Q(C) = \nu(C) = \nu(S) > 0$, so $C \in \mathcal{V}$. Since $\lambda$ is finite, $\sup_{C \in \mathcal{V}} \lambda(C) = c < \infty$, so there exist $\{C_n\}_{n=1}^\infty \subseteq \mathcal{V}$ such that $\lim_{n\to\infty} \lambda(C_n) = c$. Let $C_0 = \bigcup_{n=1}^\infty C_n$, and let $Q_n \in \mathcal{Q}$ be such that $Q_n(C_n) > 0$ and $\lambda(\{x \in C_n : dQ_n/d\lambda(x) = 0\}) = 0$. Let $Q_0 = \sum_{n=1}^\infty 2^{-n}Q_n \in \mathcal{Q}$, so that $dQ_0/d\lambda = \sum_{n=1}^\infty 2^{-n}\,dQ_n/d\lambda$ and
\[
\Big\{x \in C_0 : \frac{dQ_0}{d\lambda}(x) = 0\Big\} \subseteq \bigcup_{n=1}^\infty \Big\{x \in C_n : \frac{dQ_n}{d\lambda}(x) = 0\Big\},
\]
which implies that $C_0 \in \mathcal{V}$ and $\lambda(C_0) = c$.

Since $Q_0 \in \mathcal{Q}$, we now need only prove that $\nu \ll Q_0$ for all $\nu \in N$ to finish the proof. Suppose that $Q_0(A) = 0$ and $\nu \in N$. We must prove $\nu(A) = 0$. Since $Q_0(A \cap C_0) = 0$ and $dQ_0/d\lambda(x) > 0$ for almost all $x \in C_0$, it follows that $\lambda(A \cap C_0) = 0$ and hence $\nu(A \cap C_0) = 0$. Let $C = \{x : d\nu/d\lambda(x) > 0\}$. Then $\nu(A \cap C_0^c \cap C^c) = 0$, since $d\nu/d\lambda(x) = 0$ for $x \in C^c$. Let $D = A \cap C_0^c \cap C$, which is disjoint from $C_0$. If $\lambda(D) > 0$, then $D \in \mathcal{V}$; it follows easily that $C_0 \cup D \in \mathcal{V}$, and $\lambda(C_0 \cup D) > \lambda(C_0)$ contradicts $\lambda(C_0) = c$. Hence $\lambda(D) = 0$ and $\nu(D) = 0$, which implies $\nu(A) = \nu(A \cap C_0) + \nu(A \cap C_0^c \cap C^c) + \nu(D) = 0$. □

There is a chain rule for Radon–Nikodym derivatives.

Theorem A.79 (Chain rule).[35] Let $\nu$ and $\eta$ be $\sigma$-finite measures, and suppose that $\mu \ll \nu \ll \eta$. Then
\[
\frac{d\mu}{d\eta}(s) = \frac{d\mu}{d\nu}(s)\,\frac{d\nu}{d\eta}(s), \quad \text{a.e. } [\eta]. \tag{A.80}
\]

PROOF. It is easy to see that $\mu \ll \eta$, so that $d\mu/d\eta$ exists. For every set $A$, it follows from (A.76) that
\[
\mu(A) = \int_A \frac{d\mu}{d\nu}(s)\,d\nu(s) = \int_A \frac{d\mu}{d\nu}(s)\frac{d\nu}{d\eta}(s)\,d\eta(s).
\]

[34] This theorem is used in the proofs of Lemmas 2.15 and 2.24. It appears as Theorem 2 in Appendix 3 of Lehmann (1986) and is attributed to Halmos and Savage (1949).

[35] This theorem is used in the proof of Lemma 2.15.
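The chain rule (A.80) can be checked directly on a finite space, where every Radon–Nikodym derivative is a pointwise ratio; the three measures below are arbitrary choices satisfying the domination chain $\mu \ll \nu \ll \eta$.

```python
# Discrete sketch of the chain rule (A.80). The measures are made-up
# examples with mu << nu << eta on the three-point space {0, 1, 2}.

eta = {0: 1.0, 1: 2.0, 2: 4.0}
nu  = {0: 0.5, 1: 4.0, 2: 0.0}    # nu << eta
mu  = {0: 1.5, 1: 2.0, 2: 0.0}    # mu << nu

def density(num, den):
    """Pointwise Radon-Nikodym derivative for point-mass measures."""
    return {s: (num[s] / den[s] if den[s] > 0 else 0.0) for s in den}

dmu_dnu = density(mu, nu)
dnu_deta = density(nu, eta)
dmu_deta = density(mu, eta)

# (A.80): dmu/deta = (dmu/dnu)(dnu/deta) at every point with eta-mass.
for s in eta:
    assert abs(dmu_deta[s] - dmu_dnu[s] * dnu_deta[s]) < 1e-12
```

Points where the intermediate measure vanishes (here `s = 2`) carry no $\mu$-mass either, which is why the convention of setting the ratio to 0 there is harmless.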


By the uniqueness of Radon–Nikodym derivatives, (A.80) holds. □

The Radon–Nikodym theorem A.74 relates integrals with respect to two different measures on the same space. There are also theorems that relate integrals with respect to two different measures on two different spaces.

Theorem A.81. A measurable function $f$ from a measure space $(S_1,\mathcal{A}_1,\mu_1)$ to a measurable space $(S_2,\mathcal{A}_2)$, $f : S_1 \to S_2$, induces a measure on the range $S_2$. For each $A \in \mathcal{A}_2$, define $\mu_2(A) = \mu_1(f^{-1}(A))$. Integrals with respect to $\mu_2$ can be written as integrals with respect to $\mu_1$ in the following way: if $g : S_2 \to \mathbb{R}$ is integrable, then
\[
\int g(y)\,d\mu_2(y) = \int g(f(x))\,d\mu_1(x). \tag{A.82}
\]

PROOF. What needs to be proven is that $\mu_2$ is indeed a measure and that (A.82) holds. To see that $\mu_2$ is a measure, note that if $A, B \in \mathcal{A}_2$ are disjoint, then so too are $f^{-1}(A)$ and $f^{-1}(B)$. The fact that $\mu_2$ is nonnegative and countably additive now follows directly from the same facts about $\mu_1$.

If $g : S_2 \to \mathbb{R}$ is the indicator function of a set $A$, then
\[
\int g(y)\,d\mu_2(y) = \mu_2(A) = \mu_1(f^{-1}(A)) = \int I_{f^{-1}(A)}(x)\,d\mu_1(x) = \int g(f(x))\,d\mu_1(x).
\]
That (A.82) is true for all nonnegative simple functions follows by multiplying such equations by nonnegative constants and adding. The monotone convergence theorem A.52 allows us to extend the equality to all nonnegative integrable functions. By subtraction, we can extend it to all integrable functions. □

Definition A.83. The measure $\mu_2$ in Theorem A.81 is called the measure induced on $(S_2,\mathcal{A}_2)$ by $f$ from $\mu_1$.
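For a discrete $\mu_1$, the induced measure is computed by pushing each point mass through $f$; the map and weights below are illustrative choices, not from the text.

```python
# Sketch of Theorem A.81 / Definition A.83: the induced (pushforward)
# measure mu2(A) = mu1(f^{-1}(A)), and the change of variables (A.82).

mu1 = {-2: 0.2, -1: 0.3, 1: 0.4, 2: 0.1}     # point masses on S1
f = lambda x: x * x                           # f : S1 -> S2 = {1, 4}

# Induced measure on the range: each point of S1 sends its mass to f(x).
mu2 = {}
for x, m in mu1.items():
    mu2[f(x)] = mu2.get(f(x), 0.0) + m        # mu2({1}) = 0.7, mu2({4}) = 0.3

# (A.82): integral of g d mu2 = integral of (g o f) d mu1.
g = lambda y: 3 * y + 1
lhs = sum(g(y) * m for y, m in mu2.items())
rhs = sum(g(f(x)) * m for x, m in mu1.items())
assert abs(lhs - rhs) < 1e-12
```

Note that `f` is not one-to-one, yet the induced measure is perfectly well behaved here because $\mu_1$ is finite, in contrast with Example A.84 below.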

If the measure $\mu_1$ in Theorem A.81 is not finite and the function $f$ is not one-to-one, the measure $\mu_2$ may not be very interesting.

Example A.84. Let $S_1 = \mathbb{R}^2$, $S_2 = \mathbb{R}$, let $\mu_1$ equal Lebesgue measure on $\mathbb{R}^2$, and let $f(x,y) = x$. Let the two $\sigma$-fields be Borel $\sigma$-fields. The measure $\mu_2$ that $f$ induces on $(S_2,\mathcal{A}_2)$ from $\mu_1$ is the following: if $A \in \mathcal{A}_2$ and the Lebesgue measure of $A$ is 0, then $\mu_2(A) = 0$; otherwise, $\mu_2(A) = \infty$. Although $\mu_2$ is absolutely continuous with respect to Lebesgue measure, it is not $\sigma$-finite. The only functions $g$ that are integrable with respect to $\mu_2$ are those that are almost everywhere 0.

If $\mu_1$ is $\sigma$-finite, there is a way to avoid the problem in Example A.84 by making use of the following result.

Theorem A.85.[36] A measure $\mu$ on a space $(S,\mathcal{A})$ is $\sigma$-finite if and only if there exists an integrable function $f : S \to \mathbb{R}$ such that $f > 0$, a.e. $[\mu]$.

[36] This theorem is used in the proof of Theorem B.46.


PROOF. For the "if" part, let $f$ be as in the statement of the theorem. Let $0 < \int f(s)\,d\mu(s) = c < \infty$. Let $A_n = \{s : 1/n \leq f(s) < 1/(n-1)\}$ for $n = 1,2,\ldots$, with $1/0$ interpreted as $\infty$. We see that $A_1 = \{s : f(s) \geq 1\}$ and that, except for a $\mu$-null set, $S = \bigcup_{n=1}^\infty A_n$. We can write
\[
c = \int f(s)\,d\mu(s) \geq \int_{A_n} f(s)\,d\mu(s) \geq \frac{1}{n}\mu(A_n).
\]
It follows that $\mu(A_n) \leq nc$ for all $n$. Hence $\mu$ is $\sigma$-finite. For the "only if" part, assume that $\mu$ is $\sigma$-finite, and let $\{A_n\}_{n=1}^\infty$ be mutually disjoint sets such that $S = \bigcup_{n=1}^\infty A_n$ and $\mu(A_n) < \infty$ for all $n$. Define $f(s)$ to equal $2^{-n}/\mu(A_n)$ for all $s \in A_n$ and for all $n$ such that $\mu(A_n) > 0$. For $n$ such that $\mu(A_n) = 0$, set $f(s) = 0$ if $s \in A_n$. Then
\[
\int f(s)\,d\mu(s) = \sum_{n : \mu(A_n) > 0} 2^{-n} \leq 1 < \infty,
\]
and $f > 0$, a.e. $[\mu]$. □

Example A.86 (Continuation of Example A.84; see page 601). Let $h(x,y) = \exp(-[x^2+y^2]/2)$. It is known that $h$ is integrable with respect to $\mu_1$ and that $h$ is everywhere strictly positive. Let $\mu_1'(C) = \int_C h(x,y)\,d\mu_1(x,y)$. Then $\mu_1' \ll \mu_1$ and $\mu_1 \ll \mu_1'$. The measure $\mu_2'$ induced on $(S_2,\mathcal{A}_2)$ from $\mu_1'$ by $f(x,y) = x$ is $\mu_2'(B) = \sqrt{2\pi}\int_B \exp(-x^2/2)\,dx$. A function $g : S_2 \to \mathbb{R}$ is integrable with respect to $\mu_2'$ if and only if $\exp(-x^2/2)g(x)$ is integrable with respect to Lebesgue measure.

As a sort of reverse version of Theorem A.81, functions from a measurable space to a measure space induce measures on the domain space.

Proposition A.87. Let $f$ be a measurable function from a measurable space $(S_1,\mathcal{A}_1)$ to a measure space $(S_2,\mathcal{A}_2,\mu_2)$, $f : S_1 \to S_2$. Let $\mathcal{A}_{1f} \subseteq \mathcal{A}_1$ be the $\sigma$-field generated by $f$, and let $T$ be the image of $f$. Suppose that $T \in \mathcal{A}_2$. Then $f$ induces a measure $\mu_1$ on $(S_1,\mathcal{A}_{1f})$ defined by $\mu_1(A) = \mu_2(T \cap B)$ if $A = f^{-1}(B)$. Furthermore, if $g : (S_1,\mathcal{A}_{1f}) \to \mathbb{R}$ is integrable with respect to $\mu_1$, then
\[
\int g(x)\,d\mu_1(x) = \int_T h(y)\,d\mu_2(y), \tag{A.88}
\]
where $h$ satisfying $h(f(x)) = g(x)$ is guaranteed to exist by Theorem A.42.

A.7 Problems

Section A.2:

1. Let $S$ be a set, and let $\mathcal{A}$ be the collection of all subsets of $S$ that either are countable or have countable complement. Prove that $\mathcal{A}$ is a $\sigma$-field.

2. Prove Proposition A.10 on page 575.


3. Prove Proposition A.13 on page 576. (Hint: First, show that every open ball in $\mathbb{R}^k$ is the union of countably many open rectangles. Then prove that the smallest $\sigma$-field containing open balls must be the same as the smallest $\sigma$-field containing open rectangles.)

4. Prove that $\mathcal{B}^+$ defined on page 571 is a $\sigma$-field of subsets of the extended real numbers.

5. Prove Proposition A.15 on page 576.

6. Prove Proposition A.16 on page 576.

7. *Let $F : \mathbb{R} \to \mathbb{R}$ be a nondecreasing function that is continuous from the right. For each interval $(a,b]$, define $\mu((a,b]) = F(b) - F(a)$.

(a) Suppose that $\{(a_n,b_n]\}_{n=1}^\infty$ is a sequence of disjoint intervals such that $\bigcup_{n=1}^\infty (a_n,b_n] \subseteq (a,b]$. Prove that $\sum_{n=1}^\infty \mu((a_n,b_n]) \leq \mu((a,b])$. (Hint: Prove it for finite collections and take a limit.)

(b) Suppose that $\{(a_n,b_n]\}_{n=1}^\infty$ is a sequence of disjoint intervals such that $(a,b] \subseteq \bigcup_{n=1}^\infty (a_n,b_n]$. Prove that $\sum_{n=1}^\infty \mu((a_n,b_n]) \geq \mu((a,b])$. (Hint: First, prove it for finite collections by induction. For infinite collections, let $\mu((a,b]) > \epsilon > 0$. Cover a compact interval $[a+\delta,b]$ with finitely many open intervals $(a_n, b_n+\delta_n)$ such that $|\mu((a,b]) - \mu((a+\delta,b])| < \epsilon/2$ and $|\sum_{n=1}^k \mu((a_n,b_n]) - \sum_{n=1}^k \mu((a_n,b_n+\delta_n])| < \epsilon/2$. This can be done by using continuity from the right.)

(c) Prove that $\mu$ is countably additive on the smallest field containing intervals of the form $(a,b]$. (Hint: Deal separately with finite and semi-infinite intervals.)

8. A measure space $(S,\mathcal{A},\mu)$ is complete if $A \subseteq B \in \mathcal{A}$ and $\mu(B) = 0$ imply $A \in \mathcal{A}$. Let $(S,\mathcal{C},\mu)$ be a measure space, and let $\mathcal{D} = \{D : \exists A, C \in \mathcal{C} \text{ with } D \triangle A \subseteq C \text{ and } \mu(C) = 0\}$. For each $D \in \mathcal{D}$, define $\mu^*(D) = \mu(A)$, where $D \triangle A \subseteq C$ and $\mu(C) = 0$. Show that $\mu^*$ is well defined and that $(S,\mathcal{D},\mu^*)$ is a complete measure space.

Section A.3:

9. Prove Proposition A.28 on page 583.

10. Prove Proposition A.32 on page 584.

11. Prove Proposition A.36 on page 584.

12. Let $(S,\mathcal{A},\mu)$ be a measure space, and let $\{f_n\}_{n=1}^\infty$ be a sequence of measurable functions from $S$ to $\mathbb{R}$. Suppose that for every $\epsilon > 0$, $\sum_{n=1}^\infty \mu(\{s : |f_n(s)| > \epsilon\}) < \infty$. Prove that $\lim_{n\to\infty} f_n(s) = 0$, a.e. $[\mu]$. (Hint: Use the first Borel–Cantelli lemma A.20.)

13. Let $(S_j,\mathcal{A}_j)$ for $j = 0,1,2,3$ be measurable spaces. Let $f_j : S_0 \to S_j$ be measurable and onto for $j = 1,2,3$. Let $\mathcal{A}_{0,j}$ be the $\sigma$-field generated by $f_j$ for $j = 1,2$. Prove that $f_3$ is measurable with respect to $\mathcal{A}_{0,1} \cap \mathcal{A}_{0,2}$ if and only if there exist measurable $g_j : S_j \to S_3$ for $j = 1,2$ such that $f_3 = g_1(f_1) = g_2(f_2)$.


Section A.4:

14. If $f \geq 0$ is measurable and $\int f(s)\,d\mu(s) = 0$, then show that $f(s) = 0$, a.e. $[\mu]$.

15. If $f(s) > 0$ for all $s \in A$ and $\mu(A) > 0$, prove that $\int_A f(s)\,d\mu(s) > 0$.

16. Prove Proposition A.45 on page 588. (Hint: Use induction on $n$.)

17. Prove Proposition A.49 on page 588. (Hint: For part 4, use Problem 14 on page 604.)

18. Let $S = \mathbb{R}$ and let $\mathcal{A}$ be the $\sigma$-field of sets that are either countable or have countable complement. (See Problem 1 on page 602.) Let $\mu$ be Lebesgue measure. Suppose that $f : S \to \mathbb{R}$ is integrable. Prove that $f = 0$, a.e. $[\mu]$.

19. Let $(S,\mathcal{A})$ be a measurable space, and let $f$ be a bounded measurable function. (That is, there exist $a$ and $b$ such that $a \leq f(x) \leq b$ for all $x \in S$.)

(a) Let $\mu$ be a measure on $(S,\mathcal{A})$ such that $\mu(S) = 1$. Prove that $a \leq \int f(x)\,d\mu(x) \leq b$.

(b) Let $\epsilon > 0$. Prove that there exists a simple function $g$ such that for all measures $\mu$ satisfying $\mu(S) = 1$, $|\int f(x)\,d\mu(x) - \int g(x)\,d\mu(x)| < \epsilon$.

20. Prove the following alternative type of monotone convergence theorem: Let $\{f_n\}_{n=1}^\infty$ be a sequence of integrable functions such that $f_n(x)$ converges monotonically to $f(x)$ a.e. $[\mu]$. Then $\int f(x)\,d\mu(x)$ is defined and $\int f(x)\,d\mu(x) = \lim_{n\to\infty} \int f_n(x)\,d\mu(x)$. (Hint: Use the dominated convergence theorem A.57 on the positive parts of $f_n$ and the monotone convergence theorem A.52 on the negative parts, or vice versa, depending on whether the convergence is from above or below.)

21. Let $(S,\mathcal{A},\mu)$ be a measure space, let $\{g_n\}_{n=1}^\infty$ be a sequence of integrable functions that converges a.e. $[\mu]$, and let $g$ be another integrable function. Suppose that for all $C \in \mathcal{A}$,
\[
\lim_{n\to\infty} \int_C g_n(s)\,d\mu(s) = \int_C g(s)\,d\mu(s).
\]
Prove that $\lim_{n\to\infty} g_n = g$, a.e. $[\mu]$.

Section A.5:

22. Prove Proposition A.66 on page 595.

23. Let $(S_1,\mathcal{A}_1)$ and $(S_2,\mathcal{A}_2)$ be measurable spaces, and define the product space $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. Prove that $A \times B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ with $A \subseteq S_1$ and $B \subseteq S_2$ both nonempty implies $A \in \mathcal{A}_1$ and $B \in \mathcal{A}_2$. (Hint: For each $C \in \mathcal{A}_1 \otimes \mathcal{A}_2$, define $C_y = \{x : (x,y) \in C\}$. Then let $\mathcal{C} = \{C : C_y \in \mathcal{A}_1 \text{ for all } y \in S_2\}$. Prove that $\mathcal{C}$ is a $\sigma$-field containing all product sets.)


Section A.6:

24. Suppose that $\mu_1 \ll \mu_2$ and $\mu_2 \ll \mu_1$.

(a) Show that a.e. $[\mu_1]$ means the same thing as a.e. $[\mu_2]$.

(b) Show that
\[
\frac{d\mu_1}{d\mu_2}(s) = \left(\frac{d\mu_2}{d\mu_1}(s)\right)^{-1}, \quad \text{a.e. } [\mu_1] \text{ and a.e. } [\mu_2].
\]

25. If $\mu_1$ is a measure and $f$ is a nonnegative measurable function, then define the measure $\mu_2$ by $\mu_2(A) = \int_A f(s)\,d\mu_1(s)$. Prove that $\mu_2 \ll \mu_1$.

26. Let $\lambda$ be Lebesgue measure on $\mathbb{R}$ and define
\[
\mu(A) = \lambda(A) + c\,I_A(x_0)
\]
for some fixed $c > 0$ and $x_0 \in \mathbb{R}$.

(a) Prove that $\mu$ is a measure.

(b) Show that $\lambda \ll \mu$, but that $\mu \not\ll \lambda$.

(c) Show that $\int f(x)\,d\mu(x) = \int f(x)\,d\lambda(x) + cf(x_0)$.

27. *In the proof of Theorem A.74, we proved the Hahn decomposition theorem for signed measures, namely that if $\nu$ is a signed measure on $(S,\mathcal{A})$, then there exists $A \in \mathcal{A}$ such that $A$ is a positive set and $A^c$ is a negative set relative to $\nu$.

(a) Let $\nu$ be a signed measure on $(S,\mathcal{A})$. Suppose that there are two different Hahn decompositions; that is, $A_1$ and $A_2$ are both positive sets and $A_1^c$ and $A_2^c$ are both negative sets. Prove that every measurable subset $B$ of $A_1 \cap A_2^c$ has $\nu(B) = 0$.

(b) If $\nu$ is a signed measure on $(S,\mathcal{A})$, use the Hahn decomposition theorem to create definitions for the following:

i. The integral with respect to $\nu$ of a measurable function.
ii. When a function is integrable with respect to $\nu$.

(c) If there are two different Hahn decompositions for a signed measure $\nu$, prove that the definition of integral with respect to $\nu$ produces the same value for both decompositions.

28. In the statement of Proposition A.87 on page 602, prove that the measure $\mu_1$ is well defined. (That is, suppose that $A = f^{-1}(B_1) = f^{-1}(B_2)$, and prove that $\mu_2(B_1 \cap T) = \mu_2(B_2 \cap T)$.) Also prove that $\mu_1$ is a measure.

29. In the statement of Proposition A.87 on page 602, assuming that $\mu_1$ is a well-defined measure, prove that (A.88) holds.


APPENDIX B

Probability Theory

This appendix builds on Appendix A but is otherwise self-contained. It contains an introduction to the theory of probability. The first section is an overview. It could serve either as a refresher for those who have previously studied the material or as an informal introduction for those who have never studied it.

B.1 Overview

B.1.1 Mathematical Probability

The measure-theoretic definition of probability is that a measure space $(S,\mathcal{A},\mu)$ is called a probability space and $\mu$ is called a probability if $\mu(S) = 1$. Each element of $\mathcal{A}$ is called an event. A measurable function $X$ from $S$ to some other space $(\mathcal{X},\mathcal{B})$ is called a random quantity. The most popular type of random quantity is a random variable, which occurs when $\mathcal{X}$ is $\mathbb{R}$ with the Borel $\sigma$-field. The probability measure $\mu_X$ induced on $(\mathcal{X},\mathcal{B})$ by $X$ from $\mu$ is called the distribution of $X$.

Example B.1. Let $S = \mathcal{X} = \mathbb{R}$ with Borel $\sigma$-field. Let $f$ be a nonnegative function such that $\int f(x)\,dx = 1$. Define $\mu(A) = \int_A f(x)\,dx$ and $X(s) = s$. Then $X$ is a continuous random variable with density $f$, and $\mu_X = \mu$. If we let $\nu$ denote Lebesgue measure, then $\mu_X \ll \nu$ with $d\mu_X/d\nu = f$.

Example B.2. Let $S = \mathbb{R}$ with Borel $\sigma$-field. Let $\mathcal{X} = \{x_1, x_2, \ldots\}$ be a countable set. Let $f$ be a nonnegative function defined on $\mathcal{X}$ such that $\sum_{i=1}^\infty f(x_i) = 1$. Define $\mu(A) = \sum_{\{i : x_i \in A\}} f(x_i)$ and $X(s) = s$. Then $X$ is a discrete random variable with probability mass function $f$, and $\mu_X = \mu$. If we let $\nu$ denote counting measure on $\mathcal{X}$, then $\mu \ll \nu$ with $d\mu/d\nu = f$.


In both of these examples, we will say that f is the density of X with respect to ν.
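As a concrete sketch of Example B.2 (the support and mass function below are hypothetical choices, not from the text), the measure built from a probability mass function f is exactly integration of the density f against counting measure:

```python
# Hypothetical discrete distribution illustrating Example B.2: the pmf f is
# the density dmu/dnu with respect to counting measure nu on the support.
from fractions import Fraction

f = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
assert sum(f.values()) == 1  # f sums to 1, as Example B.2 requires

def mu(A):
    """mu(A) = sum of f(x) over support points x in A (Example B.2)."""
    return sum(p for x, p in f.items() if x in A)

# Integrating the density f against counting measure just adds f over the
# points of the set, so mu assigns probability by summation.
print(mu({1, 2}))   # Fraction 3/4
print(mu({5, 6}))   # 0: the set misses the support entirely
```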

When there is one probability space (S, A, μ) from which all other probabilities are induced by way of random quantities, then the probability in that one space will be denoted Pr. So, for example, if μ_X is the distribution of a random quantity X and if B ∈ B, then Pr(X ∈ B) = μ(X⁻¹(B)) = μ_X(B).

The expected value or mean or expectation of a random variable X is defined (and denoted) as E(X) = ∫ x dμ_X(x), if the integral exists, where μ_X is the distribution of X. If X is a vector of random variables (called a random vector), then E(X) will stand for the vector with coordinates equal to the means of the coordinates of X.

The (in)famous law of the unconscious statistician, B.12, is very useful for calculating means of functions of random quantities. It says that E[f(X)] = ∫ f(x) dμ_X(x). For example, the variance of a random variable X with mean c is Var(X) = E([X − c]²), which can be calculated as ∫ (x − c)² dμ_X(x). The covariance between two random variables X and Y with means c_X and c_Y, respectively, is Cov(X, Y) = E([X − c_X][Y − c_Y]).
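A small Monte Carlo sketch (not from the text; the distribution and sample size are arbitrary choices) of how the law of the unconscious statistician is used in practice: E[f(X)] is approximated by averaging f over draws from the distribution of X, without ever deriving the distribution of f(X):

```python
import random

# Draw from the distribution of X (here X ~ N(0,1), an arbitrary choice).
random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]

# Estimate E[f(X)] for f(x) = x^2 by averaging f over the draws.
# For a standard normal, E(X^2) = Var(X) + E(X)^2 = 1.
mean_of_square = sum(x * x for x in draws) / len(draws)
print(abs(mean_of_square - 1.0) < 0.05)  # the Monte Carlo estimate is near 1
```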

B.1.2 Conditioning

We begin with a heuristic derivation of the important concepts using the special case of discrete random quantities. Afterwards, we define the important terms in a more rigorous way.

Consider the case of two random quantities X and Y, each of which assumes at most countably many distinct values, X ∈ X = {x₁, ...} and Y ∈ Y = {y₁, ...}. Let p_ij = Pr(X = x_i, Y = y_j). Then

\[
\Pr(X = x_i) = \sum_{j=1}^{\infty} p_{ij} = p_{i\cdot}, \qquad
\Pr(Y = y_j) = \sum_{i=1}^{\infty} p_{ij} = p_{\cdot j}.
\]

These equations give the marginal distributions of X and Y, respectively. We can define the conditional probability that X = x_i given Y = y_j by

\[
\Pr(X = x_i \mid Y = y_j) = \frac{p_{ij}}{p_{\cdot j}} = p_{i|j}.
\]

Note that, for each j, \(\sum_{i=1}^{\infty} p_{i|j} = 1\), so that the numbers \(\{p_{i|j}\}_{i=1}^{\infty}\) define a probability distribution on X known as the conditional distribution of X given Y = y_j. We can calculate the conditional mean (expectation) of a function f of X given Y = y_j by

\[
E(f(X) \mid Y = y_j) = \sum_{i=1}^{\infty} f(x_i)\, p_{i|j}.
\]

From the conditional distribution, we could define a measure on (X, 2^X) by

\[
\mu_{X|Y}(A \mid y_j) = \sum_{x_i \in A} p_{i|j}.
\]
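The discrete formulas above can be checked numerically; the joint pmf below is a hypothetical 2 × 2 table, not from the text:

```python
# Joint pmf p_ij = Pr(X = x_i, Y = y_j) on a hypothetical 2 x 2 table.
p = {
    (1, 1): 0.10, (1, 2): 0.30,
    (2, 1): 0.20, (2, 2): 0.40,
}

def p_Y(j):
    """Marginal p_.j = sum over i of p_ij."""
    return sum(v for (i, jj), v in p.items() if jj == j)

def p_cond(i, j):
    """Conditional pmf p_{i|j} = p_ij / p_.j."""
    return p[(i, j)] / p_Y(j)

# Each conditional pmf sums to 1, as noted above.
for j in (1, 2):
    assert abs(p_cond(1, j) + p_cond(2, j) - 1.0) < 1e-12

# Conditional mean E(X | Y = y_1) = sum_i x_i p_{i|1} = 1/3 + 2*(2/3) = 5/3.
E_X_given_Y1 = 1 * p_cond(1, 1) + 2 * p_cond(2, 1)
print(abs(E_X_given_Y1 - 5 / 3) < 1e-9)
```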


It follows that, for each j, E(f(X)|Y = y_j) = ∫ f(x) dμ_{X|Y}(x|y_j). We can think of this conditional mean as a function of y:

\[
g(y) = E(f(X) \mid Y = y).
\]

The marginal distribution of Y is a measure on (Y, 2^Y) defined by

\[
\mu_Y(B) = \sum_{y_j \in B} p_{\cdot j}, \quad \text{for all } B \in 2^{\mathcal{Y}}.
\]

Similarly, the joint distribution of (X, Y) induces a measure on (X × Y, 2^X ⊗ 2^Y) by μ_{X,Y}(C) = Σ_{(x_i, y_j) ∈ C} p_ij, for all C ∈ 2^X ⊗ 2^Y. The point of all of these measures and distributions is the following. We can write the integral of g over any set B ∈ 2^Y as

\[
\int_B g(y)\,d\mu_Y(y)
= \sum_{y_j \in B} g(y_j)\, p_{\cdot j}
= \sum_{y_j \in B} \sum_{i=1}^{\infty} f(x_i)\, p_{i|j}\, p_{\cdot j}
= \int f(x) I_B(y)\,d\mu_{X,Y}(x, y)
= E\left(f(X) I_B(Y)\right).
\]

The overall equation

\[
\int_B g(y)\,d\mu_Y(y) = E\left(f(X) I_B(Y)\right)
\]

will be used as the property that defines conditional expectation in general. Through the definition of conditional expectation, we will define conditional probability and conditional distributions in general.

Theorem B.21 says that, in general, if a random variable X has finite mean and if C is a sub-σ-field of A, then a function g : S → ℝ exists which is measurable with respect to the σ-field C and such that

\[
E(X I_B) = \int_B g(s)\,d\mu(s), \quad \text{for all } B \in \mathcal{C}. \tag{B.3}
\]

This is the general version of what we worked out above for discrete random variables, in which C was the σ-field generated by Y. We will use the symbol E(X|C) to stand for the function g. The two important features that E(X|C) possesses are that it is measurable with respect to the σ-field C and that it satisfies (B.3). Any function that equals E(X|C) a.s. [μ] will also satisfy (B.3), so there may be many functions that satisfy the definition of conditional expectation. All such functions are called versions of the conditional expectation. When we say that a random variable equals E(X|C), we will mean that it is a version of E(X|C).

Notice that we can set B = S in (B.3) and the equation becomes E(X) = E[E(X|C)]. This result is called the law of total probability. A useful generalization is given in Theorem B.70.

If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, some special notation is introduced. We saw in Theorem A.42 that a function is measurable with respect to the σ-field generated by Y if and


only if it is a function of Y. Hence, there is a function h defined on the space Y where Y takes its values such that E(X|Y) = h(Y). We use the notation E(X|Y = t) to stand for h(t). (See Corollary B.22.) In this notation we have, for B = Y⁻¹(C) with C ∈ C, E(X I_B) = ∫_C E(X|Y = t) dμ_Y(t), where μ_Y is the distribution of Y.

Example B.4. Let S = ℝ² and let A be the two-dimensional Borel sets. Let

\[
\mu(A) = \int_A \frac{1}{\sqrt{3}\,\pi}
\exp\left\{-\frac{2}{3}\left(s_1^2 + s_2^2 - s_1 s_2\right)\right\} ds_1\,ds_2.
\]

Suppose that X(s) = s₁ and Y(s) = s₂ when s = (s₁, s₂). Now E(|X|) = √(2/π) < ∞. We claim that g(s) = s₂/2 and h(t) = t/2 satisfy the conditions required to be E(X|Y)(s) and E(X|Y = t), respectively. First, note that the σ-field generated by Y is A_Y = {ℝ × C : C is Borel measurable}, and μ_Y is the measure with density exp(−t²/2)/√(2π). It is clear that any measurable function of s₂ alone is A_Y measurable. Let B = ℝ × C, so that E(X I_B) equals

\[
\begin{aligned}
&\int_{-\infty}^{\infty} \int_C s_1 \frac{1}{\sqrt{3}\,\pi}
  \exp\left\{-\frac{2}{3}\left(s_1^2 + s_2^2 - s_1 s_2\right)\right\} ds_2\,ds_1 \\
&\quad= \int_C \int_{-\infty}^{\infty} s_1 \sqrt{\frac{2}{3\pi}}
  \exp\left\{-\frac{2}{3}\left(s_1 - \tfrac{1}{2}s_2\right)^2\right\}
  \frac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}s_2^2\right\} ds_1\,ds_2 \\
&\quad= \int_C \tfrac{1}{2}s_2 \frac{1}{\sqrt{2\pi}}
  \exp\left\{-\tfrac{1}{2}s_2^2\right\} ds_2 \\
&\quad= \int_C \int_{-\infty}^{\infty} \tfrac{1}{2}s_2 \sqrt{\frac{2}{3\pi}}
  \exp\left\{-\frac{2}{3}\left(s_1 - \tfrac{1}{2}s_2\right)^2\right\}
  \frac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}s_2^2\right\} ds_1\,ds_2 \\
&\quad= \int_B \tfrac{1}{2}s_2 \frac{1}{\sqrt{3}\,\pi}
  \exp\left\{-\frac{2}{3}\left(s_1^2 + s_2^2 - s_1 s_2\right)\right\} ds_1\,ds_2
  = \int_B g(s)\,d\mu(s).
\end{aligned}
\]

Note also that the third line in the above string equals ∫_C h(s₂) dμ_Y(s₂).

It is easy to see that if X is already measurable with respect to C, then E(X|C) = X. Conditional probability turns out to be the special case of conditional expectation in which X = I_A. That is, we define Pr(A|C) = E(I_A|C). A conditional probability is regular if Pr(·|C)(s) is a probability measure for all s. It turns out that, under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C)(·) in such a way that they are regular conditional probabilities. In particular, the space (X, B) needs to be sufficiently like the real numbers with the Borel σ-field. Such spaces are called Borel spaces, as defined in Definition B.31. All of the most common spaces are Borel spaces; in particular, ℝ^k for all finite k and ℝ^∞ are Borel spaces. For those readers with more mathematical background, complete separable metric spaces are also Borel spaces. Also, finite and countable products of Borel spaces are Borel spaces.

In the future, we will assume that all versions of conditional probabilities are regular when they are on Borel spaces. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y).
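The claim of Example B.4, that E(X|Y) = Y/2 under that bivariate normal distribution with correlation 1/2, can be checked by simulation (a sketch, not from the text; the sample size and seed are arbitrary choices):

```python
import math
import random

random.seed(1)
n = 100_000
xs, ys = [], []
for _ in range(n):
    y = random.gauss(0.0, 1.0)                        # Y = S2 ~ N(0, 1)
    x = 0.5 * y + random.gauss(0.0, math.sqrt(0.75))  # X | Y=y ~ N(y/2, 3/4)
    xs.append(x)
    ys.append(y)

# For jointly normal (X, Y), E(X|Y) = [Cov(X,Y)/Var(Y)] Y, so the
# least-squares slope of X on Y should be close to 1/2.
slope = sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)
print(abs(slope - 0.5) < 0.02)  # the fitted slope is near 1/2
```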

If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X. If the Pr(·|C) are regular conditional


probabilities, then we say that the version of the conditional distribution of X given C is a regular conditional distribution. When we refer to a conditional distribution without the word "version," we will mean a version of the conditional distribution. Occasionally, we will need to choose a version that satisfies some other condition. In those cases, we will try to be explicit about versions.

Because conditional distributions are probability measures, many of the theorems from Appendix A which apply to such measures apply to conditional distributions. For example, the monotone convergence theorem A.52 and the dominated convergence theorem A.57 apply to conditional means because limits of measurable functions are still measurable. Also, most of the properties of probability measures from this appendix apply as well.

We now turn our attention to the existence and calculation of densities for conditional distributions. If the joint distribution of two random quantities has a density with respect to a product measure, then the conditional distributions have densities that can be calculated in the usual way, as the joint density divided by the marginal density of the conditioning variable. Theorem B.46 allows us to extend this result to joint distributions that are not absolutely continuous with respect to product measures, such as when one of the quantities is a function of the other. Here, we merely give an example of how such conditional densities are calculated.

Example B.5. Let X = (X₁, X₂) have bivariate normal distribution with density

\[
f_X(x) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\left(-\frac{1}{2(1-\rho^2)}\left[
\frac{(x_1-\mu_1)^2}{\sigma_1^2}
- \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}
+ \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right)
\]

with respect to Lebesgue measure on ℝ². The marginal density of Y = X₁ + X₂ with respect to Lebesgue measure is

\[
f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\left(-\frac{1}{2\sigma^2}\left[y - (\mu_1 + \mu_2)\right]^2\right),
\]

where σ² = σ₁² + σ₂² + 2ρσ₁σ₂. The pair (X, Y) does not have a joint density with respect to Lebesgue measure on ℝ³, but it does have a joint density with respect to the measure ν on ℝ³ defined as follows. For each A ⊆ ℝ³, let A' = {(x₁, x₂) : (x₁, x₂, x₁ + x₂) ∈ A}. Let ν(A) = λ₂(A'), where λ_k is Lebesgue measure on ℝ^k for k = 1, 2. Then f_{X,Y}(x, y) = f_X(x) is the joint density of (X, Y) with respect to ν, and

\[
\frac{f_X(x)}{f_Y(y)} = \frac{1}{\sqrt{2\pi}\,\sigma_*}
\exp\left(-\frac{1}{2\sigma_*^2}\left(x_1 - \mu_1
- \frac{[\sigma_1^2 + \rho\sigma_1\sigma_2](y - \mu_1 - \mu_2)}{\sigma^2}\right)^2\right),
\]

if y = x₁ + x₂ (here σ*² = σ₁² − [σ₁² + ρσ₁σ₂]²/σ² is the conditional variance), is the conditional density of X given Y = y with respect to the measure ν_{X|Y}(A|y) = λ₁(A'_y), where A'_y = {x₁ : (x₁, y − x₁) ∈ A}.

The concept of conditional independence will turn out to be central to the development of statistical models. A collection {X_n}_{n=1}^∞ of random quantities is


conditionally independent given another quantity Y if the conditional distribution (given Y) of every finite subset is a product measure. If, in addition, Y is constant almost surely, we say that {X_n}_{n=1}^∞ are independent. We will call random quantities (conditionally) IID if they are (conditionally) independent and they all have the same conditional distribution.

B.1.3 Limit Theorems

There are three types of convergence which we consider for sequences of random quantities: almost sure convergence, convergence in probability, and convergence in distribution. The weakest of these is the last. (See Theorem B.90.) A sequence {X_n}_{n=1}^∞ converges in distribution to X if lim_{n→∞} E(f(X_n)) = E(f(X)) for every bounded continuous function f. We denote this type of convergence X_n →^D X. If X = ℝ, a more common way to express X_n →^D X is that lim_{n→∞} F_n(x) = F(x) for all x at which F is continuous, where F_n is the CDF of X_n and F is the CDF of X.¹

If X is a metric space with metric d, we say that a sequence {X_n}_{n=1}^∞ converges in probability to X if, for every ε > 0, lim_{n→∞} Pr(d(X_n, X) > ε) = 0. We write this as X_n →^P X. Almost sure convergence is the same as almost everywhere convergence of functions, and it is the strongest of the three. That is, X_n → X, a.s. means that {s : X_n(s) does not converge to X(s)} ⊆ E with Pr(E) = 0.

A popular method for proving convergence in distribution involves the use of characteristic functions. The characteristic function of a random vector X is the complex-valued function

\[
\phi_X(t) = E\left(\exp[i t^{\top} X]\right).
\]

It is easy to see that the characteristic function exists for every random vector and has complex absolute value at most 1 for all t. Other facts that follow directly from the definition are the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(i t^⊤ b). If X and Y are independent, φ_{X+Y} = φ_X φ_Y.

The importance of characteristic functions is that they characterize distributions (see the uniqueness theorem B.106) and they are "continuous" as a function of the distribution in the sense of convergence in distribution (see the continuity theorem B.93).

Two of the more useful limit theorems are the weak law of large numbers B.95 and the central limit theorem B.97. If {X_n}_{n=1}^∞ are IID random variables with finite mean μ, then the weak law of large numbers says that the sample average X̄_n = Σ_{i=1}^n X_i/n converges in probability to μ. If, in addition, they have finite variance σ², the central limit theorem B.97 says that √n(X̄_n − μ) →^D N(0, σ²), the normal distribution with mean 0 and variance σ².
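Both theorems are easy to watch in simulation (a sketch, not from the text; the Uniform(0,1) distribution, with mean 1/2 and variance 1/12, and all sizes are arbitrary choices):

```python
import math
import random
import statistics

random.seed(3)
n, reps = 400, 2000

z = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n  # sample average
    z.append(math.sqrt(n) * (xbar - 0.5))              # CLT scaling

# WLLN: xbar concentrates near mu = 1/2, so the z's are centered near 0.
print(abs(statistics.mean(z)) < 0.05)
# CLT: sqrt(n)(xbar - mu) is approximately N(0, 1/12), with sd near 0.289.
print(abs(statistics.stdev(z) - math.sqrt(1 / 12)) < 0.03)
```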

¹See Problem 25 on page 664. If X = ℝ^k, the same idea can be used. That is, X_n →^D X if and only if the joint CDFs F_n of X_n converge to the joint CDF F of X at all points at which F is continuous. Since we will not need to use this characterization, we will not prove it.


B.2 Mathematical Probability

In this chapter, we will present the basic framework of the measure theoretic probability calculus. Most of the concepts, like random quantities, distributions, and so forth, will be special cases of measure theoretic concepts introduced in Appendix A.

B.2.1 Random Quantities and Distributions

We begin by introducing the basic building blocks of probability theory.

Definition B.6. A probability space is a measure space (S, A, μ) with μ(S) = 1. Each element of A is called an event. If (S, A, μ) is a probability space, (X, B) is a measurable space, and X : S → X is measurable, then X is called a random quantity. If X = ℝ and B is the Borel or Lebesgue σ-field, then X is called a random variable. Let μ_X be the probability measure induced on (X, B) by X from μ (see Definition A.83). This probability measure is called the distribution of X. The distribution of X is said to be discrete if there exists a countable set A ⊆ X such that μ_X(A) = 1. The distribution of X is continuous if μ_X({x}) = 0 for all x ∈ X.

The distribution of X is easily seen to be equivalent to the restriction of μ to the σ-field generated by X, A_X.

When there is one probability space from which all other probabilities are induced by way of random quantities, then the probability in that one space will be denoted Pr. So, for example, in the above definition of the distribution of a random quantity X, if B ∈ B, then Pr(X ∈ B) = μ(X⁻¹(B)) = μ_X(B).

The distribution of a random variable can be described by its cumulative dis­tribution function.

Definition B.7. A function F is a (cumulative) distribution function (CDF) if it has the following properties:

• F is nondecreasing;

• lim_{x→−∞} F(x) = 0;

• lim_{x→∞} F(x) = 1;

• F is continuous from the right.

Proposition B.8. If X is a random variable, then the function F_X(x) = Pr(X ≤ x) is a CDF. In this case, F_X is called the CDF of X.

A distribution function F can be used to create a measure on (ℝ, B) as follows. Set μ((a, b]) = F(b) − F(a), and extend this to the whole σ-field using the Carathéodory extension theorem A.22.²

We can also construct a distribution function from a probability measure on the real numbers. If μ is a probability measure on (ℝ, B¹), the CDF associated with it is F(x) = μ((−∞, x]). If f is a Borel measurable function from ℝ to ℝ, we will write ∫ f(x) dF(x) and ∫ f(x) dμ(x) interchangeably.

²See the discussion on page 581 and Problem 7 on page 603.
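The correspondence between F and μ can be sketched with a discrete measure (the point masses below are hypothetical, not from the text):

```python
# A hypothetical discrete probability measure mu on R, given by point masses.
masses = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}

def F(x):
    """CDF associated with mu: F(x) = mu((-inf, x])."""
    return sum(p for a, p in masses.items() if a <= x)

# F is nondecreasing, right-continuous, with limits 0 and 1.
assert F(-5.0) == 0.0
assert abs(F(10.0) - 1.0) < 1e-12
assert abs(F(0.0) - 0.7) < 1e-12          # jump of size mu({0}) at x = 0

# mu((a, b]) = F(b) - F(a), the rule used to extend F back to a measure.
a, b = -1.0, 0.0
mu_ab = sum(p for x, p in masses.items() if a < x <= b)
assert abs(mu_ab - (F(b) - F(a))) < 1e-12
print("F and mu agree on half-open intervals")
```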


If μ is a probability measure on (ℝⁿ, Bⁿ), a joint CDF can be defined as

\[
F(x_1, \ldots, x_n) = \mu\left((-\infty, x_1] \times \cdots \times (-\infty, x_n]\right),
\]

the measure of an orthant. For every joint CDF, there is a random vector X with that CDF, and we call the CDF F_X.

Definition B.9. Let (S, A, μ) be a probability space, and let (X, B, ν) be a measure space. Suppose that X : S → X is measurable. Let μ_X be the measure induced on (X, B) by X from μ. Suppose that μ_X ≪ ν. Then we call the Radon–Nikodym derivative f_X = dμ_X/dν the density of X with respect to ν.

Proposition B.10. If h : X → ℝ is measurable and f_X is the density of X with respect to ν, then ∫ h(x) dμ_X(x) = ∫ h(x) f_X(x) dν(x).

Definition B.11. If X is a random variable with CDF F_X(·), then the expected value (or mean, or expectation) of X is E(X) = ∫ x dF_X(x). If X is a random vector, then E(X) will stand for the vector with coordinates equal to the means of the coordinates of X.

The following theorem is often called the law of the unconscious statistician, because some people forget that it is not really the definition of expected value.

Theorem B.12.³ If X : S → X is a random quantity and f : X → ℝ is a measurable function, then E[f(X)] = ∫ f(x) dμ_X(x), where μ_X is the distribution of X.

PROOF. If we let Y = f(X), then Y induces a measure (with CDF F_Y) on (ℝ, B¹) according to Theorem A.81. The definition of E(Y) is ∫ y dF_Y(y), and Theorem A.81 says that ∫ y dF_Y(y) = ∫ f(x) dμ_X(x). □

Definition B.13. If X is a random variable with finite mean c, then the variance of X is the mean of (X − c)² and is denoted Var(X). If X is a random vector with finite mean vector c, then the covariance matrix of X is the mean of (X − c)(X − c)^⊤ and is also denoted Var(X). The covariance of two random variables X and Y with finite means c_X and c_Y is E([X − c_X][Y − c_Y]) and is denoted Cov(X, Y).

It is possible for a random variable to have finite mean and infinite variance.

Proposition B.14. If X has finite mean μ, then Var(X) = E(X²) − μ².
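Proposition B.14 follows by expanding the square inside the expectation:

```latex
\operatorname{Var}(X) = E\left[(X - \mu)^2\right]
  = E\left[X^2 - 2\mu X + \mu^2\right]
  = E(X^2) - 2\mu E(X) + \mu^2
  = E(X^2) - \mu^2 .
```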

B.2.2 Some Useful Inequalities

Although there are theoretical formulas for calculating means of functions of random variables, often they are not analytically tractable. We may, on the other hand, only need to know that a mean is less than some value. For this reason, we present some well-known inequalities concerning means of random variables.

³This theorem is used in making sense of the notation E_θ when introducing parametric models.


Theorem B.15 (Markov inequality).⁴ Suppose that X is a nonnegative random variable with finite mean μ. Then, for all c > 0, Pr(X ≥ c) ≤ μ/c.

PROOF. Let F be the CDF of X. Then, we can write

\[
\mu = \int x\,dF(x) \ge \int_{[c,\infty)} x\,dF(x)
\ge c \int_{[c,\infty)} dF(x) = c\,\Pr(X \ge c).
\]

Divide the extreme parts by c to get the result. □

The following well-known inequality follows trivially from the Markov inequality B.15.

Corollary B.16 (Tchebychev's inequality).⁵ Suppose that X is a random variable with finite variance σ² and finite mean μ. Then, for all c > 0,

\[
\Pr(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}.
\]
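A numeric sanity check of both bounds (a sketch, not from the text; the Exponential(1) distribution, which has mean 1 and variance 1, and the constants are arbitrary choices):

```python
import random

random.seed(4)
xs = [random.expovariate(1.0) for _ in range(100_000)]

def prob(event):
    """Empirical probability of the event over the sample."""
    return sum(1 for x in xs if event(x)) / len(xs)

c = 3.0
# Markov: Pr(X >= c) <= mean/c; the true value exp(-3) ~ 0.05 is far below.
assert prob(lambda x: x >= c) <= 1.0 / c
# Tchebychev: Pr(|X - mu| >= c) <= sigma^2/c^2 = 1/9.
assert prob(lambda x: abs(x - 1.0) >= c) <= 1.0 / c ** 2
print("both bounds hold on this sample")
```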

Another well-known inequality involves convex functions.⁶ The proof of this theorem resembles the proofs in Ferguson (1967) and Berger (1985).

Theorem B.17 (Jensen's inequality).⁷ Let g be a convex function defined on a convex subset X of ℝ^k, and suppose that Pr(X ∈ X) = 1. If E(X) is finite, then E(X) ∈ X and g(E(X)) ≤ E(g(X)).

PROOF. First, we prove that E(X) ∈ X by induction on the dimension of X. Without loss of generality, we can assume that E(X) = 0, since we can subtract E(X) from X and from every element of X, and E(X) ∈ X if and only if 0 ∈ X − E(X). If k = 0, then X = {0} and E(X) = 0. Suppose that 0 ∈ X for all X with dimension strictly less than m ≤ k. Now suppose that X and X have dimension m and 0 ∉ X. Since X and {0} are disjoint convex sets, the separating hyperplane theorem C.5 says that there is a nonzero vector v and a constant c such that v^⊤x ≤ c for every x ∈ X and 0 ≥ c.⁸ If we let Y = v^⊤X, then we have Pr(Y ≤ c) = 1 and E(Y) = 0 ≥ c. It follows that Pr(Y = c) = 1 and c = 0. Hence, X lies in the (m − 1)-dimensional convex set Z = X ∩ {x : v^⊤x = 0}. It follows that 0 ∈ Z ⊆ X, which contradicts the assumption that 0 ∉ X.

Next, we prove the inequality by induction on k. For k = 0, E(g(X)) = g(E(X)), since X is degenerate. Suppose that the inequality holds for all dimensions up to m − 1 < k. Let X have dimension m. Define the subset of ℝ^{m+1},

\[
X' = \{(x, z) : x \in X,\ z \in \mathbb{R},\ \text{and } g(x) \le z\}.
\]

Let (x₁, z₁) and (x₂, z₂) be in X' and define

\[
(y, w) = \left(\alpha x_1 + (1 - \alpha)x_2,\ \alpha z_1 + (1 - \alpha)z_2\right).
\]

⁴This theorem is used in the proofs of Corollary B.16 and Lemma 1.61.
⁵This corollary is used in the proof of Theorem 1.59.
⁶Let X be a linear space. A function f : X → ℝ is convex if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ X and all λ ∈ [0, 1].
⁷This theorem is used in the proofs of Lemma B.114 and Theorems B.118 and 3.20.
⁸The symbol v^⊤ stands for the transpose of the vector v.


Since αg(x₁) + (1 − α)g(x₂) ≥ g(y) and w ≥ αg(x₁) + (1 − α)g(x₂), it follows that (y, w) ∈ X', so X' is convex. It is also clear that (E(X), g(E(X))) is a boundary point of X'. The supporting hyperplane theorem C.4 says that there is a vector v = (v_x, v_z) such that, for all (x, z) ∈ X', v_x^⊤x + v_z z ≥ v_x^⊤E(X) + v_z g(E(X)). Since (x, z₁) ∈ X' implies (x, z₂) ∈ X' for all z₂ > z₁, it cannot be that v_z < 0, since then lim_{z→∞} v_x^⊤x + v_z z = −∞, a contradiction. Since (x, g(x)) ∈ X' for all x ∈ X, it follows that v_x^⊤X + v_z g(X) ≥ v_x^⊤E(X) + v_z g(E(X)), from which we conclude

\[
v_z\, g(E(X)) \le v_x^{\top}[X - E(X)] + v_z\, g(X). \tag{B.18}
\]

Taking expectations of both sides of this gives v_z g(E(X)) ≤ v_z E(g(X)). If v_z > 0, the proof is complete. If v_z = 0, then (B.18) becomes 0 ≤ v_x^⊤[X − E(X)], which implies v_x^⊤[X − E(X)] = 0 with probability 1. Hence X lies in an (m − 1)-dimensional space, and the induction hypothesis finishes the proof. □

The famous Cauchy–Schwarz inequality for vectors⁹ has a probabilistic version.

Theorem B.19 (Cauchy–Schwarz inequality).¹⁰ Let X₁ and X₂ be two random vectors of the same dimension such that E(‖X_i‖²) is finite for i = 1, 2. Then

\[
\left[E\left(|X_1^{\top} X_2|\right)\right]^2 \le E\left(\|X_1\|^2\right) E\left(\|X_2\|^2\right). \tag{B.20}
\]

PROOF. Let Z = 1 if X₁^⊤X₂ ≥ 0 and Z = −1 if X₁^⊤X₂ < 0. Let Y = ‖X₁ + cZX₂‖², where c = −√(E‖X₁‖²/E‖X₂‖²). Then Y ≥ 0 and Z² = 1. So

\[
0 \le E(Y) = E\|X_1\|^2 + c^2 E\|X_2\|^2 + 2c\,E\left(|X_1^{\top} X_2|\right)
= 2E\|X_1\|^2 - 2E\left(|X_1^{\top} X_2|\right)\sqrt{\frac{E\|X_1\|^2}{E\|X_2\|^2}}.
\]

The desired result follows immediately from this inequality. □
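A Monte Carlo check of (B.20) for a dependent pair of random vectors (a sketch, not from the text; the dimension, dependence structure, and seed are arbitrary choices):

```python
import random

random.seed(5)

def rand_vec(dim=3):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

n = 50_000
e_abs_dot = e_sq1 = e_sq2 = 0.0
for _ in range(n):
    x1 = rand_vec()
    x2 = [z + 0.5 * w for z, w in zip(rand_vec(), x1)]  # dependent on x1
    e_abs_dot += abs(sum(u * v for u, v in zip(x1, x2))) / n
    e_sq1 += sum(u * u for u in x1) / n
    e_sq2 += sum(v * v for v in x2) / n

# (B.20): [E|X1^T X2|]^2 <= E||X1||^2 E||X2||^2, even under dependence.
assert e_abs_dot ** 2 <= e_sq1 * e_sq2
print("Cauchy-Schwarz holds on this sample")
```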

B.3 Conditioning

B.3.1 Conditional Expectations


Section B.1.2 contains a heuristic derivation of the important concepts in conditioning using the special case of discrete random quantities. We now turn to a more general presentation.

Theorem B.21.¹¹ Let (S, A, μ) be a probability space, and suppose that X : S → ℝ is a measurable function with E(|X|) < ∞. Let C be a sub-σ-field of A. Then there exists a C measurable function g : S → ℝ which satisfies

\[
E(X I_B) = \int_B g(s)\,d\mu(s), \quad \text{for all } B \in \mathcal{C}.
\]

⁹That is, if x₁ and x₂ are vectors, then |x₁^⊤x₂| ≤ ‖x₁‖‖x₂‖.
¹⁰This theorem is used in the proofs of Theorems 3.44, 5.13, and 5.18.
¹¹This theorem is used to help define the general concept of conditional expectation.


PROOF. Use Theorem A.54 to construct two measures μ₊ and μ₋ on (S, C):

\[
\mu_+(B) = \int_B X^+(s)\,d\mu(s), \qquad \mu_-(B) = \int_B X^-(s)\,d\mu(s).
\]

It is clear that μ₊ ≪ μ and μ₋ ≪ μ. The Radon–Nikodym theorem A.74 tells us that there are C measurable functions g₊ and g₋ such that

\[
\mu_+(B) = \int_B g_+(s)\,d\mu(s), \qquad \mu_-(B) = \int_B g_-(s)\,d\mu(s).
\]

Since E(X I_B) = μ₊(B) − μ₋(B), the result follows with g = g₊ − g₋. □

We will use the symbol E(X|C) to stand for the function g. If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, the next corollary follows from Theorem B.21 with the help of Theorem A.42.

Corollary B.22. Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space such that C contains all singletons. Suppose that X : S → ℝ and Y : S → Y are measurable functions and E(|X|) < ∞. Let μ_Y be the measure induced on (Y, C) by Y from μ (see Theorem A.81). Let A_Y be the sub-σ-field of A generated by Y. Then there exists a function h : Y → ℝ that satisfies the following: If B ∈ A_Y equals Y⁻¹(C) for C ∈ C, then E(X I_B) = ∫_C h(t) dμ_Y(t).

We will use the symbol E(X|Y = t) to stand for the h(t) in Corollary B.22. At this point the reader might wish to review Example B.4 on page 609.

To summarize the above results, we state the following.

Definition B.23. Let (S, A, μ) be a probability space, and suppose that X : S → ℝ is measurable and E(|X|) < ∞. Let C be a sub-σ-field of A. We define the conditional mean (conditional expectation) of X given C, denoted E(X|C), to be any C measurable function g : S → ℝ that satisfies

\[
E(X I_B) = \int_B g(s)\,d\mu(s), \quad \text{for all } B \in \mathcal{C}.
\]

Each such function is called a version of the conditional mean. If Y : S → Y and C is the sub-σ-field generated by Y, then E(X|C) is also called the conditional mean of X given Y, denoted E(X|Y). If, in addition, the σ-field of subsets of Y contains singletons, let h : Y → ℝ be the function such that g = h(Y). Then h(t) is denoted by E(X|Y = t).

When we say that a random variable equals E(X|Y), we will mean that it is a version of E(X|Y). The following propositions are immediate consequences of the above definitions.

Proposition B.24. Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space such that C contains singletons. Let X : S → ℝ and Y : S → Y be measurable. Let μ_Y be the measure on Y induced from μ by Y. A function g : Y → ℝ is a version of E(X|Y = t) if and only if, for all B ∈ C, ∫_B g(t) dμ_Y(t) = E(X I_B(Y)).


Proposition B.25.

• If Z and W are both versions of E(X|C), then Z = W, a.s.

• If X is C measurable, then E(X|C) = X, a.s.

Proposition B.26. If C = {S, ∅}, the trivial σ-field, then E(X|C) = E(X).

Proposition B.27.¹² Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space. Let X : S → ℝ and Y : S → Y be measurable, and let g : Y → ℝ be such that g(Y)X is integrable. Let μ_Y be the measure on Y induced from μ by Y. Then E(g(Y)X) = ∫ g(t) E(X|Y = t) dμ_Y(t).

Proposition B.28.¹³ Let (S, A, μ) be a probability space and let X : S → ℝ, Y : S → (Y, B₁), and Z : S → (Z, B₂) be measurable functions. Let μ_Y and μ_Z be the measures induced on Y and Z by Y and Z, respectively, from μ. Suppose that E(|X|) < ∞ and that Z is a one-to-one function of Y, that is, there exists a bimeasurable h : Y → Z such that Z = h(Y). Then E(X|Y = y) = E(X|Z = h(y)), a.s. [μ_Y].

Conditional probability is the special case of conditional expectation in which X = I_A.

Definition B.29. Let (S, A, μ) be a probability space. For each A ∈ A, the conditional probability of A given C (or given Y if C is the σ-field generated by Y) is Pr(A|C) = E(I_A|C). If Pr(·|C)(s) is a probability on (S, A) for all s ∈ S, then the conditional probabilities given C are called regular conditional probabilities.

It turns out that under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C) in such a way that they are regular conditional probabilities. In the future, we will assume that this is done in all such cases. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y) as in the discussion following Corollary B.22.

If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X.

Definition B.30. Let (S, A, μ) be a probability space and let (X, B) be a measurable space. Suppose that X : S → X is a measurable function. Let P be the probability on (X, B) induced by X from μ. Let C be a sub-σ-field of A. For each B ∈ B, let P(B|C) = Pr(A|C), where A = X⁻¹(B). We say that any set of functions from S to [0, 1] of the form

{P(B|C), for all B ∈ B}

is a version of the conditional distribution of X given C. If C is the σ-field generated by another random quantity Y : S → Y, a version of the conditional

¹²This proposition is used in the proof of Theorem B.64.
¹³This proposition is used to facilitate the transition from spaces of probability measures to subsets of Euclidean space when parametric models are introduced. It is also used in the proof of Theorem 2.114.


distribution of X given Y is specified by any collection of probability functions of the form

{Pr(·|Y = t), for all t ∈ Y}.

If the P(·|C) are regular conditional probabilities, then we say that the version of the conditional distribution of X given C is a regular conditional distribution.

When we refer to a conditional distribution without the word "version," we will mean a version of the conditional distribution. Occasionally, we will need to choose a version that satisfies some other condition. In those cases, we will try to be explicit about versions.

If X is sufficiently like the real numbers, there will be versions of conditional distributions that are regular. We make that precise with the following definition.

Definition B.31. Let (X, B) be a measurable space. If there exists a bimeasurable function φ : X → R, where R is a Borel subset of ℝ, then (X, B) is called a Borel space.

In particular, we can show that all Euclidean spaces with the Borel σ-fields are Borel spaces. (See Lemma B.36.) First, we prove that regular conditional distributions exist on Borel spaces. The proof is borrowed from Breiman (1968, Section 4.3).

Theorem B.32. Let (S, A, μ) be a probability space and let C be a sub-σ-field of A. Let (X, B) be a Borel space. Let X : S → X be a random quantity. Then there exists a regular conditional distribution of X given C.

PROOF. Let φ : X → R be the function guaranteed by Definition B.31. Define the random variable Z = φ(X) : S → R ⊆ ℝ. First we prove that the σ-field generated by X, A_X, is contained in the σ-field generated by Z, A_Z. Let B ∈ A_X; then there is C ∈ B such that B = X⁻¹(C). Since φ is one-to-one, φ⁻¹(φ(C)) = C. Since φ⁻¹ is measurable, φ(C) is a Borel subset of R. Now, Z⁻¹(φ(C)) = X⁻¹(C) = B, hence B ∈ A_Z. It is also easy to see that A_Z is contained in A_X, so they are equal. If Z has a regular conditional distribution, then so does X. The remainder of the proof is to show that Z has a regular conditional distribution.

For each rational number q, choose a version of Pr(Z ≤ q|C) and let

M_{q,r} = {s : Pr(Z ≤ q|C)(s) < Pr(Z ≤ r|C)(s)},   M = ∪_{q>r} M_{q,r}.

According to Problem 3 on page 662 and countable additivity, μ(M) = 0. Next, define

N_q = {s : lim_{r↓q, r rational} Pr(Z ≤ r|C)(s) ≠ Pr(Z ≤ q|C)(s)},   N = ∪_{all q} N_q.

We can use Problem 3 on page 662 once again to prove that μ(N_q) = 0 for all q, hence μ(N) = 0. Similarly, we can show that μ(L) = 0, where L is the set

{s : lim_{r→−∞, r rational} Pr(Z ≤ r|C)(s) ≠ 0} ∪ {s : lim_{r→∞, r rational} Pr(Z ≤ r|C)(s) ≠ 1}.


B.3. Conditioning 619

If G is an arbitrary CDF, we can define

F(z|C)(s) = G(z) if s ∈ M ∪ N ∪ L, and F(z|C)(s) = lim_{r↓z, r rational} Pr(Z ≤ r|C)(s) otherwise.

F(·|C)(s) is a CDF for every s (see Problem 2 on page 661), and it is easy to check that F(z|C) is a version of Pr(Z ≤ z|C) for every z. If we extend F(·|C)(s) to a probability measure η(·; s) on the Borel σ-field for every s, we only need to check that, for every Borel set B, η(B; ·) is a version of Pr(Z ∈ B|C). That is, for every C ∈ C, we need

∫_C η(B; s) dμ(s) = Pr({Z ∈ B} ∩ C).   (B.33)

By construction, (B.33) is true if B is an interval of the form (−∞, z]. Such intervals form a π-system Π such that B is the smallest σ-field containing Π. If we define

Q_1(B) = ∫_C η(B; s) dμ(s) / Pr(C),   Q_2(B) = Pr({Z ∈ B} ∩ C) / Pr(C)

(for C with Pr(C) > 0; (B.33) is trivial when Pr(C) = 0), we see that Q_1 and Q_2 agree on Π. Tonelli's theorem A.69 can be used to see that Q_1 is countably additive, while Q_2 is clearly a probability. It follows from Theorem A.26 that Q_1 and Q_2 agree on B. □

Note that the only condition required for regular conditional distributions to exist is a condition on the space of the random quantity for which we desire a regular conditional distribution. The σ-field C, or the random quantity on which we condition, can be quite general. In the future, if we assume that (X, B) is a Borel space, we can construct regular conditional distributions given anything we wish. Also, since the function in the definition of Borel space is one-to-one and the Borel σ-field of ℝ contains singletons, it follows that the σ-field of a Borel space contains singletons (cf. Theorem A.42).

B.3.2 Borel Spaces*

In this section we prove that there are lots of Borel spaces. First, we prove that every space satisfying some general conditions is a Borel space, and then we will show that Euclidean spaces satisfy those conditions. Then, we show that finite and countable products of Borel spaces are Borel spaces. The most general type of Borel space in which we shall be interested is a complete separable metric space (sometimes called a Polish space).

Definition B.34. Let X be a topological space. A subset D of X is dense if, for every x ∈ X and every open set U containing x, there is an element of D in U. If there exists a countable dense subset of X, then X is separable. Suppose that X is a metric space with metric d. A sequence {x_n}_{n=1}^∞ is Cauchy if, for every ε > 0, there exists N such that m, n ≥ N implies d(x_n, x_m) < ε. A metric space X is complete if every Cauchy sequence converges. A complete and separable metric space is called a Polish space.

*This section may be skipped without interrupting the flow of ideas.



We would like to prove that all Polish spaces are Borel spaces. First, we prove that ℝ^∞ is a Borel space (Lemma B.36). Then we prove that there exist bimeasurable maps between Polish spaces and measurable subsets of ℝ^∞ (Lemma B.40). The following simple proposition pieces these results together.

Proposition B.35. If X is a Borel space and there exists a bimeasurable function f : Y → X, then Y is a Borel space.

Lemma B.36. The infinite product space ℝ^∞ is a Borel space.

PROOF. The idea of the proof¹⁴ is the following. We start by transforming each coordinate to the interval (0,1) using a continuous function with continuous inverse. For each number in (0,1) we find a base 2 expansion, which is a sequence of 0s and 1s. We then take these sequences (one for each coordinate) and merge them into a single sequence, which we then interpret as the base 2 expansion of a number in (0,1). If this sequence of transformations is bimeasurable, we have our function φ.

Let ψ : ℝ^∞ → (0,1)^∞ be defined by

ψ(x_1, x_2, ...) = (1/2 + tan^{-1}(x_1)/π, 1/2 + tan^{-1}(x_2)/π, ...),

which is bimeasurable. For each x ∈ [0,1), set y_0(x) = x and for j = 1, 2, ..., define

Z_j(x) = 1 if 2y_{j−1}(x) ≥ 1 and Z_j(x) = 0 if not,   y_j(x) = 2y_{j−1}(x) − Z_j(x).

For each j, Z_j is a measurable function. It is easy to see that Z_j(x) is the jth digit in a base 2 expansion of x with infinitely many 0s. Note also that y_j(x) ∈ [0,1) for all j and x.

Create the following triangular array of integers:

1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
. . .

Let the jth integer from the top of the ith column be ℓ(i,j). Then

ℓ(i,j) = i(i+1)/2 + i(j−1) + (j−1)(j−2)/2.

¹⁴This proof is adapted from Breiman (1968, Theorem A.47).



Clearly, each integer t appears once and only once as ℓ(i,j) for some i and j.¹⁵ Define

h(x_1, x_2, ...) = Σ_{i=1}^∞ Σ_{j=1}^∞ Z_j(x_i) 2^{−ℓ(i,j)}.   (B.37)

Then h is clearly a measurable function from (0,1)^∞ to a subset R of (0,1). There is a countable subset of (0,1) which is not in the image of h. These are the numbers with only finitely many 0s in one or more of the subsequences {ℓ(i,j)}_{j=1}^∞ of their base 2 expansion for i = 1, 2, .... For example, the number c = Σ_{j=0}^∞ 2^{−j(j+1)/2−1} is not in R.¹⁶ Since the complement of a countable set is measurable, the set R is measurable.

We define φ = h(ψ). If we can show that h has a measurable inverse, the proof is complete. For each x ∈ R, define

φ_i(x) = Σ_{j=1}^∞ Z_{ℓ(i,j)}(x)/2^j.   (B.38)

Clearly, each φ_i is measurable. Note that, for each i and j,

Z_j(φ_i(x)) = Z_{ℓ(i,j)}(x).   (B.39)

Combining (B.37), (B.38), (B.39), and the fact that every integer appears once and only once as ℓ(i,j) for some i and j, we see that h(φ_1(x), φ_2(x), ...) = x, so that (φ_1, φ_2, ...) is the inverse of h and it is measurable. □

Lemma B.40. If (X, B) is a Polish space with the Borel σ-field and metric d, then it is a Borel space.¹⁷

PROOF. All we need to prove is that there exists a bimeasurable f : X → G, where G is a measurable subset of ℝ^∞. We then use Lemma B.36 and Proposition B.35.

Let {x_n}_{n=1}^∞ be a countable dense subset of X, and let d be the metric on X. Define the function f : X → ℝ^∞ by

f(x) = (d(x, x_1), d(x, x_2), ...).

We will first show that f is continuous, which will make it measurable. Suppose that {y_n}_{n=1}^∞ is a sequence in X that converges to y ∈ X. The kth coordinate of f(y_n) is d(y_n, x_k), which converges to d(y, x_k) because the metric is continuous. Hence, each coordinate of f is continuous, and f is continuous. Next, we prove that f is one-to-one. Suppose that f(x) = f(y). Then d(x, x_n) = d(y, x_n) for

¹⁵It is easy to check the following. For each integer t, let k = inf{n : t ≤ n(n+1)/2}. Then r(t) = t − k(k−1)/2 and s(t) = k + 1 − r(t) have the property that ℓ(r(t), s(t)) = t, r(ℓ(i,j)) = i, and s(ℓ(i,j)) = j.

¹⁶This number corresponds to having 1s in the first column of the triangular array but nowhere else. Clearly, 0 < c < 1, but it is impossible to have 1s in the entire first column, since this would require x_1 = 1. Even if x_1 = 1 had been allowed, its base 2 expansion would have ended in infinitely many 0s rather than infinitely many 1s.

¹⁷This proof is adapted from p. 219 of Billingsley (1968) and Theorem 15.8 of Royden (1968).



all n. Since {x_n}_{n=1}^∞ is dense, there exists a subsequence {x_{n_j}}_{j=1}^∞ such that lim_{j→∞} x_{n_j} = x. It follows that 0 = lim_{j→∞} d(x, x_{n_j}) = lim_{j→∞} d(y, x_{n_j}); hence lim_{j→∞} x_{n_j} = y, and y = x.

Next, we prove that f^{-1} : f(X) → X is continuous. Suppose that a sequence of points {f(y_n)}_{n=1}^∞ converges to f(y). Let lim_{j→∞} x_{n_j} = y. Then lim_{j→∞} d(y, x_{n_j}) = 0. But d(y, x_{n_j}) is the n_jth coordinate of f(y), which in turn is the limit (as n → ∞) of the n_jth coordinate of f(y_n). For each j, d(y_n, y) ≤ d(y_n, x_{n_j}) + d(y, x_{n_j}). Let ε > 0 and let j be large enough so that d(y, x_{n_j}) < ε/4. Now, let N be large enough so that n ≥ N implies d(y_n, x_{n_j}) < d(y, x_{n_j}) + ε/2. It follows that, if n ≥ N, d(y_n, y) < ε. Hence lim_{n→∞} y_n = y and f^{-1} is continuous, hence measurable.

Finally, we will prove that the image G of f is a measurable subset of ℝ^∞. We will do this by proving that G is the intersection of countably many open subsets of Ḡ.¹⁸ Let G_n be the following set:

{x ∈ ℝ^∞ : there exists O_x, a neighborhood of x, with d(a,b) ≤ 1/n for all a, b ∈ f^{-1}(O_x)}.

Since O_x ⊆ G_n for each x ∈ G_n, G_n is open. Also, since f and f^{-1} are continuous, it is easy to see that G ⊆ G_n for all n. Let G′ = Ḡ ∩ ∩_{n=1}^∞ G_n. For each x ∈ G′, let O_{x,n} ⊆ G_n be such that O_{x,1} ⊇ O_{x,2} ⊇ ⋯ and that d(a,b) ≤ 1/n for all a, b ∈ f^{-1}(O_{x,n}). Note that f^{-1}(O_{x,n}) ⊇ f^{-1}(O_{x,n+1}) for all n. If y_n ∈ f^{-1}(O_{x,n}) for every n, then {y_n}_{n=1}^∞ is a Cauchy sequence, since n, m ≥ N implies d(y_n, y_m) ≤ 1/N. Hence, there is a limit y to the sequence. It is easy to see that if there were two such sequences with limits y and y′, then d(y, y′) < ε for all ε > 0, hence y = y′. So we can define a function h : G′ → X by h(x) = y. If x ∈ G, then clearly h(x) = f^{-1}(x). If x′ ∈ O_{x,n}, then d(h(x), h(x′)) ≤ 1/n, so h is continuous. We now prove that G′ ⊆ G, which implies that G = G′ and the proof will be complete. Let x ∈ G′, and let x_n ∈ G be such that x_n → x. (This is possible since G′ ⊆ Ḡ.) Since h is continuous, f^{-1}(x_n) → h(x). If y_n = f^{-1}(x_n) and y = h(x), then y_n → y and f(y_n) → f(y) ∈ G, since f is continuous. But f(y_n) = x_n, so f(y) = x, and the proof is complete. □

Next, we show that products of Borel spaces are Borel spaces.

Lemma B.41. Let (X_n, B_n) be a Borel space for each n. The product spaces ∏_{i=1}^{n} X_i for all finite n and ∏_{n=1}^∞ X_n with product σ-fields are Borel spaces.

PROOF. We will prove the result for the infinite product. The proofs for finite products are similar. If X_n = ℝ for all n, the result is true by Lemma B.36. For general X_n, let φ_n : X_n → R_n and φ_* : ℝ^∞ → R_* be bimeasurable, where R_n and R_* are measurable subsets of ℝ. Then, it is easy to see that φ : ∏_{n=1}^∞ X_n → φ_*(∏_{n=1}^∞ R_n) is bimeasurable, where φ(x_1, x_2, ...) = φ_*(φ_1(x_1), φ_2(x_2), ...). □

Next, we show that the set of bounded continuous functions from [0,1] to the real numbers is also a Polish space.

¹⁸We use the symbol Ḡ to stand for the closure of the set G. The closure of a subset G of a topological space is the smallest closed set containing G. A set is closed if and only if its complement is open.



Lemma B.42.¹⁹ Let C[0,1] be the set of all bounded continuous functions from [0,1] to ℝ. Let ρ(f, g) = sup_{x∈[0,1]} |f(x) − g(x)|. Then ρ is a metric on C[0,1] and C[0,1] is a Polish space.

PROOF. That ρ is a metric is easy to see. To see that C[0,1] is separable, let D_k be the set of functions that take on rational values at the points 0, 1/k, ..., (k−1)/k, 1 and are linear between these values. Let D = ∪_{k=1}^∞ D_k. The set D is countable. Every continuous function on a compact set is uniformly continuous, so let f ∈ C[0,1] and ε > 0. Let δ be small enough so that |x − y| < δ implies |f(x) − f(y)| < ε/4. Let k be larger than 1/δ. There exists g ∈ D_k such that |g(i/k) − f(i/k)| < ε/4 for each i = 0, ..., k. For i/k < x < (i+1)/k, |f(x) − f(i/k)| < ε/4, and |g(x) − g(i/k)| < ε/2, so |f(x) − g(x)| < ε.

To see that C[0,1] is complete, let {f_n}_{n=1}^∞ be a Cauchy sequence. Then, for all x, {f_n(x)}_{n=1}^∞ is a Cauchy sequence of real numbers that converges to some number f(x). We need to show that the convergence of f_n to f is uniform. To the contrary, assume that there exists ε such that, for each n there is x_n such that |f_n(x_n) − f(x_n)| > ε. We know that there exists n such that m > n implies |f_n(x) − f_m(x)| < ε/2 for all x. In particular, |f_n(x_n) − f_m(x_n)| < ε/2 for all m > n. Since lim_{m→∞} f_m(x_n) = f(x_n), it follows that there exists m such that |f_m(x_n) − f(x_n)| < ε/2, a contradiction. □

Because Borel spaces have σ-fields that look just like the Borel σ-field of the real numbers, their σ-fields are generated by countably many sets. The countable field that generates the Borel σ-field of ℝ is the collection of all sets that are unions of finitely many disjoint intervals (including degenerate ones and infinite ones) with rational endpoints.
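The separability half of Lemma B.42 is easy to visualize numerically. The sketch below is our own illustration (the function f, the grid sizes, and the helper names are arbitrary choices, not from the text): it builds a piecewise-linear function with rounded, hence rational, values at the points i/k and checks that it is uniformly close to f.

```python
# Sketch of the separability argument in Lemma B.42: a continuous f on [0, 1]
# is approximated in the sup metric rho by a piecewise-linear function with
# rational values at 0, 1/k, ..., 1.  The sup is estimated on a fine grid.
import math

def piecewise_linear(values, x):
    """Evaluate the piecewise-linear interpolant through (i/k, values[i])."""
    k = len(values) - 1
    i = min(int(x * k), k - 1)          # index of the interval containing x
    t = x * k - i
    return (1 - t) * values[i] + t * values[i + 1]

def sup_distance(f, values, grid=2000):
    return max(abs(f(i / grid) - piecewise_linear(values, i / grid))
               for i in range(grid + 1))

f = lambda x: math.sin(5 * x) + x * x
k = 200
node_values = [round(f(i / k), 3) for i in range(k + 1)]  # rational node values
```

With k = 200 and three-decimal rounding, the sup distance between f and the interpolant is well below 0.01, matching the ε-argument in the proof.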

Proposition B.43.²⁰ Let (X, B) be a Borel space. Then there exists a countable field C such that B is the smallest σ-field containing C.

Because a field is a π-system, Theorem A.26 and Proposition B.43 imply the following.

Corollary B.44. Let (X, B) be a Borel space, and let C be a countable field that generates B. If μ_1 and μ_2 are σ-finite measures on B that agree on C, then they agree on B.

B.3.3 Conditional Densities

Because conditional distributions are probability measures, many of the theorems from Appendix A which apply to such measures apply to conditional distributions. For example, the monotone convergence theorem A.52 and the dominated convergence theorem A.57 apply to conditional means because limits of measurable functions are still measurable. Also, most of the properties of probability measures from this appendix apply as well. In this section, we focus on the existence and calculation of densities for conditional distributions.

If the joint distribution of two random quantities has a density with respect to a product measure, then the conditional distributions have densities that can

¹⁹This lemma is used in the proof of Lemma 2.121.
²⁰This proposition is used in the proofs of Lemmas 2.124 and 2.126 and Theorem 3.110.



be calculated in the usual way.

Proposition B.45. Let (S, A, μ) be a probability space and let (X, B_1, ν_X) and (Y, B_2, ν_Y) be σ-finite measure spaces. Let X : S → X and Y : S → Y be measurable functions. Let μ_{X,Y} be the probability induced on (X × Y, B_1 ⊗ B_2) by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν_X × ν_Y. Let the density be f_{X,Y}(x, y). Let the probability induced on (Y, B_2) by Y from μ be denoted μ_Y. Then μ_Y is absolutely continuous with respect to ν_Y with density

f_Y(y) = ∫_X f_{X,Y}(x, y) dν_X(x),

and the conditional distribution of X given Y has densities

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)

with respect to ν_X.

This proposition can be proven directly using Tonelli's theorem A.69 or as a special case of Theorem B.46 (see Problem 15 on page 663).
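On a discrete grid, the recipe in Proposition B.45 can be carried out directly. The joint density and all grid choices below are our own toy example, not anything from the text.

```python
# Toy numerical version of Proposition B.45: integrate a joint density over x
# to get the marginal f_Y, then form the conditional density f(x|y), which
# should integrate to 1 in x for each fixed y.
import math

dx = dy = 0.01
xs = [i * dx for i in range(300)]        # grid for x in [0, 3)
ys = [j * dy for j in range(200)]        # grid for y in [0, 2)

def f_joint(x, y):                       # unnormalized joint density
    return math.exp(-x * (1.0 + y))

norm = sum(f_joint(x, y) for x in xs for y in ys) * dx * dy

def f_Y(y):                              # f_Y(y) = integral of f(x, y) dx
    return sum(f_joint(x, y) for x in xs) * dx / norm

def f_X_given_Y(x, y):                   # f(x|y) = f(x, y) / f_Y(y)
    return f_joint(x, y) / norm / f_Y(y)

total = sum(f_X_given_Y(x, 1.0) for x in xs) * dx   # should be 1
```

By construction, the marginal f_Y integrates to 1 over the grid, and the conditional density at any fixed y integrates to 1 in x, mirroring the statement of the proposition.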

Theorem B.46. Let (X, B_1) be a Borel space, let (Y, B_2) be a measurable space, and let (X × Y, B_1 ⊗ B_2, ν) be a σ-finite measure space. Then, there exists a measure ν_Y on (Y, B_2) and for each y ∈ Y, there exists a measure ν_{X|Y}(·|y) on (X, B_1) such that for each integrable or nonnegative h : X × Y → ℝ, ∫ h(x, y) dν_{X|Y}(x|y) is B_2 measurable and

∫ h(x, y) dν(x, y) = ∫ [∫ h(x, y) dν_{X|Y}(x|y)] dν_Y(y).   (B.47)

PROOF. Let f be the strictly positive integrable function guaranteed by Theorem A.85. Without loss of generality, assume that ∫ f(x, y) dν(x, y) = 1. The measure μ(A) = ∫_A f(x, y) dν(x, y) is a probability, ν ≪ μ, and (dν/dμ)(x, y) = 1/f(x, y). Let μ_{X|Y} be a regular conditional distribution on (X, B_1) constructed from μ, and let ν_Y be the marginal distribution on (Y, B_2). Define

ν_{X|Y}(A|y) = ∫_A [1/f(x, y)] dμ_{X|Y}(x|y).

Note that

∫ I_{A×B}(x, y) dμ_{X|Y}(x|y) = I_B(y) μ_{X|Y}(A|y),   (B.48)

which is a measurable function of y because μ_{X|Y} is a regular conditional distribution. Just as in the proof of Lemma A.61, we can use the π-λ theorem A.17 to show that ∫ g dμ_{X|Y} is measurable if g is the indicator of an element of the product σ-field. It follows that ∫ g dμ_{X|Y} is measurable for every nonnegative simple function g. By the monotone convergence theorem A.52, letting {g_n}_{n=1}^∞ be nonnegative simple functions increasing to g everywhere, it follows that ∫ g(x, y) dμ_{X|Y}(x|y) is measurable for all nonnegative measurable functions, and hence ∫ h dν_{X|Y} = ∫ h/f dμ_{X|Y} is measurable if h is nonnegative.



Next, define a probability η on (X × Y, B_1 ⊗ B_2) by

η(C) = ∫ [∫ I_C(x, y) dμ_{X|Y}(x|y)] dν_Y(y).

It follows from (B.48) that η and μ agree on the collection of all product sets (a π-system that generates B_1 ⊗ B_2). Theorem A.26 implies that they agree on B_1 ⊗ B_2. By linearity of integrals and the monotone convergence theorem A.52, if g is nonnegative, then

∫ g(x, y) dη(x, y) = ∫ [∫ g(x, y) dμ_{X|Y}(x|y)] dν_Y(y) = ∫ [∫ g(x, y) f(x, y) dν_{X|Y}(x|y)] dν_Y(y).   (B.49)

For every nonnegative h,

∫ h(x, y) dν(x, y) = ∫ [h(x, y)/f(x, y)] f(x, y) dν(x, y) = ∫ [h(x, y)/f(x, y)] dμ(x, y)
= ∫ [h(x, y)/f(x, y)] dη(x, y) = ∫ [∫ h(x, y) dν_{X|Y}(x|y)] dν_Y(y),   (B.50)

where the second equality follows from the fact that dμ/dν = f, the third follows from the fact that μ and η are the same measure, and the fourth follows from (B.49). If h is integrable with respect to ν, then (B.50) applies to h⁺, h⁻, and |h|, and all three results are finite. Also, ∫ |h(x, y)| dν_{X|Y}(x|y) is measurable and ν_Y({y : ∫ |h(x, y)| dν_{X|Y}(x|y) = ∞}) = 0. So ∫ h⁺(x, y) dν_{X|Y}(x|y) and ∫ h⁻(x, y) dν_{X|Y}(x|y) are both finite almost surely, and their difference is ∫ h(x, y) dν_{X|Y}(x|y), a measurable function. It now follows that (B.47) holds. □

The measures ν_Y and ν_{X|Y} in Theorem B.46 are not unique. In the proof, we could easily have defined ν_Y several ways, such as ν_Y(A) = ∫_A g(y) dμ_Y(y) for any strictly positive function g with finite μ_Y integral. A corresponding adjustment would have to be made to the definition of ν_{X|Y}.

In the special case in which ν is a product measure ν_1 × ν_2, it is easy to show that ν_1 can play the role of ν_{X|Y}(·|y) for all y and that ν_2 can play the role of ν_Y in Theorem B.46. (See Problem 15 on page 663.)

There is a familiar application of Theorem B.46 to cases in which X and Y are Euclidean spaces but ν is concentrated on a lower-dimensional manifold defined by a function y = g(x).

Proposition B.51. Suppose that X = ℝⁿ and Y = ℝᵏ, with k < n. Let g : X → Y be such that there exists h : X → ℝ^{n−k} such that v(x) = (g(x), h(x)) is one-to-one, is differentiable, and has a differentiable inverse. For y ∈ ℝᵏ and w ∈ ℝ^{n−k}, define J(y, w) to be the Jacobian, that is, the determinant of the matrix of partial derivatives of the coordinates of v^{-1}(y, w) with respect to the coordinates of y and of w. Let λ_i be Lebesgue measure on ℝⁱ, for each i. Define a measure ν on X × Y by ν(C) = λ_n({x : (x, g(x)) ∈ C}). Then ν_Y equal to Lebesgue measure on ℝᵏ and

ν_{X|Y}(A|y) = ∫_{A′_y} J(y, w) dλ_{n−k}(w)

satisfy (B.47), where A′_y = {w : v^{-1}(y, w) ∈ A}.
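A toy instance of Proposition B.51 can be checked numerically (the choices of g, h, and integrand below are entirely our own): with g(x) = x_1 + x_2 and h(x) = x_2, the map v(x) = (g(x), h(x)) has inverse v^{-1}(y, w) = (y − w, w) and Jacobian J ≡ 1, so (B.47) reduces to the claim that integrating F(x, g(x)) over x agrees with the iterated integral over (y, w).

```python
# Lattice check of (B.47) for Proposition B.51 with g(x) = x1 + x2,
# h(x) = x2, v_inv(y, w) = (y - w, w), and J(y, w) = 1 (our own toy example).
import math

d = 0.05
grid = [i * d for i in range(-40, 41)]           # lattice for x1 and x2

def F(x, y):                                     # integrand, supported in a box
    if abs(x[0]) >= 0.93 or abs(x[1]) >= 0.93:
        return 0.0
    return (x[0] ** 2 + x[1] + 2.0) * math.cos(y)

# left side: integral of F(x, g(x)) with respect to Lebesgue measure on R^2
lhs = sum(F((x1, x2), x1 + x2) for x1 in grid for x2 in grid) * d * d

# right side: integrate over w for fixed y (nu_{X|Y}), then over y (nu_Y)
ygrid = [i * d for i in range(-80, 81)]
rhs = sum(F((y - w, w), y) for y in ygrid for w in grid) * d * d
```

The two lattice sums agree up to floating-point rounding, which is exactly the content of (B.47) for this choice of ν.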

We are now in position to derive a formula for conditional densities in general.²¹

Theorem B.52. Let (S, A, μ) be a probability space, let (X, B_1) be a Borel space, let (Y, B_2) be a measurable space, and let (X × Y, B_1 ⊗ B_2, ν) be a σ-finite measure space. Let ν_Y and ν_{X|Y} be as guaranteed by Theorem B.46. Let X : S → X and Y : S → Y be measurable functions. Let μ_{X,Y} be the probability induced on (X × Y, B_1 ⊗ B_2) by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν. Let the density be f_{X,Y}(x, y). Let the probability induced on (Y, B_2) by Y from μ be denoted μ_Y. Then μ_Y ≪ ν_Y; for each y ∈ Y,

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫_X f_{X,Y}(x, y) dν_{X|Y}(x|y);   (B.53)

and the conditional distribution of X given Y = y has density

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)   (B.54)

with respect to ν_{X|Y}(·|y).

PROOF. It follows from Theorem B.46 that for all B ∈ B_2,

μ_Y(B) = ∫ I_B(y) f_{X,Y}(x, y) dν(x, y) = ∫ I_B(y) [∫ f_{X,Y}(x, y) dν_{X|Y}(x|y)] dν_Y(y).

The fact that μ_Y ≪ ν_Y and (B.53) both follow from this equation. Let μ_{X|Y}(·|y) denote a regular conditional distribution of X given Y = y. For each A ∈ B_1 and B ∈ B_2, apply Theorem B.46 with h(x, y) = I_A(x) I_B(y) f_{X|Y}(x|y) f_Y(y) to conclude

∫_B μ_{X|Y}(A|y) dμ_Y(y) = μ_{X,Y}(A × B) = ∫_B [∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y)] dμ_Y(y).

Since this is true for all B ∈ B_2, we conclude that

μ_{X|Y}(A|y) = ∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y).

Hence (B.54) gives the density of μ_{X|Y}(·|y) with respect to ν_{X|Y}(·|y). □

The point of Theorem B.52 is that we can calculate conditional densities for

random quantities even if the measure that dominates the joint distribution is not a product measure. When the joint distribution is dominated by a product

²¹The condition that the joint distribution have a density with respect to a measure ν in Theorem B.52 is always met since ν can be taken equal to the joint distribution. The theorem applies even if ν is not the joint distribution, however.



measure, the conditional distributions are all dominated by the same measure. (See Problem 15 on page 663.) In general, however, the conditional distribution of X given Y = y is dominated by a measure that depends on y. For example, if Y = g(X), the joint distribution of (X, Y) is not dominated by a product measure even if the distribution of X is dominated. (See also Problem 7 on page 662.) Nevertheless, we have the following result.

Corollary B.55.²² Let (S, A, μ) be a probability space, let (Y, B_2) be a measurable space such that B_2 contains all singletons, and let (X, B_1) be a Borel space with ν_X a σ-finite measure on (X, B_1). Let X : S → X and g : X → Y be measurable functions. Let Y = g(X). Suppose that the distribution of X has density f_X with respect to ν_X. Define ν on (X × Y, B_1 ⊗ B_2) by ν(C) = ν_X({x : (x, g(x)) ∈ C}). Let μ_{X,Y} be the probability induced on (X × Y, B_1 ⊗ B_2) by (X, Y) from μ. Let the probability induced on (Y, B_2) by Y from μ be denoted μ_Y. Then μ_{X,Y} ≪ ν with Radon–Nikodym derivative f_{X,Y}(x, y) = f_X(x) I_{{g(x)}}(y). Also, the conditions of Theorem B.46 hold, and we can write

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫_X I_{{g(x)}}(y) f_X(x) dν_{X|Y}(x|y),

f_{X|Y}(x|y) = f_X(x)/f_Y(y) if y = g(x), and f_{X|Y}(x|y) = 0 otherwise.

Also, the conditional distribution of Y given X is given by μ_{Y|X}(C|x) = I_C(g(x)).

PROOF. Since ν_X is σ-finite, ν is also. Since Y is a function of X, Theorem A.81 implies that for all integrable h, ∫ h(x, y) dν(x, y) = ∫ h(x, g(x)) dν_X(x). The facts that f_{X,Y} has the specified form and that μ_{Y|X} is the conditional distribution of Y given X follow easily from this equation. □

The point of Corollary B.55 is that if Y = g(X), then we can assume that the conditional distribution of X given Y = y is concentrated on g^{-1}({y}).
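In the discrete case the content of Corollary B.55 is transparent. The mass function and the function g below are our own toy choices, not from the text: the conditional distribution of X given Y = y simply renormalizes f_X over g^{-1}({y}).

```python
# Discrete analogue of Corollary B.55: Y = g(X), and the conditional
# distribution of X given Y = y is f_X restricted to g^{-1}({y}) and
# renormalized by f_Y(y).  (Toy numbers, not from the text.)
f_X = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}          # mass function of X

def g(x):                                        # Y = g(X): parity of X
    return x % 2

f_Y = {}
for x, p in f_X.items():                         # f_Y(y) = sum over g^{-1}({y})
    f_Y[g(x)] = f_Y.get(g(x), 0.0) + p

def f_X_given_Y(x, y):
    return f_X[x] / f_Y[y] if g(x) == y else 0.0
```

For example, given Y = 1 the conditional mass sits only on {1, 3}, with f(3|1) = 0.3/0.4 = 0.75, and it sums to 1 over g^{-1}({y}) for each y.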

Example B.56.²³ Let f be a spherically symmetric density with respect to λ_n, Lebesgue measure on ℝⁿ. That is, f(x) = h(xᵀx) for some function h : ℝ → ℝ⁺⁰ (the interval [0, ∞)) and ∫ h(xᵀx) dλ_n(x) = 1. Let X have density f and let V = XᵀX. Let R = V^{1/2}, and transform to spherical coordinates:

x_1 = r cos(θ_1), x_2 = r sin(θ_1) cos(θ_2), ..., x_{n−1} = r sin(θ_1) ⋯ cos(θ_{n−1}), x_n = r sin(θ_1) ⋯ sin(θ_{n−1}).

The Jacobian is r^{n−1} j(θ), where j is some function of θ alone. The Jacobian for the transformation to v and θ is v^{(n/2)−1} j(θ)/2. The integral of j(θ) over all θ

²²This corollary is used in the proof of Theorem 2.86 and in Example 3.106.
²³The calculation in this example is used again in Example 4.121.



values is π^{n/2}/Γ(n/2). So, the marginal density of V is

f_V(v) = π^{n/2} v^{(n/2)−1} h(v) / (2Γ(n/2)).

The conditional density of X given V = v is then

f_{X|V}(x|v) = [2Γ(n/2) v^{1−(n/2)} / π^{n/2}] I_{{v}}(xᵀx)

with respect to the measure ν_{X|V}(C|v) = ∫_{C′} v^{(n/2)−1} j(θ) dλ_{n−1}(θ)/2, where

C′ = {θ : v^{1/2}(cos(θ_1), ..., sin(θ_1) ⋯ sin(θ_{n−1})) ∈ C}.

It follows that the conditional distribution of X given V = v is given by

μ_{X|V}(C|v) = [Γ(n/2)/π^{n/2}] ∫_{C′} j(θ) dλ_{n−1}(θ).

It is easy to see that μ_{X|V}(·|v) is the uniform distribution over the sphere of radius v^{1/2} in n dimensions.

Another example was given in Example B.5 on page 610.

B.3.4 Conditional Independence

The concept of conditional independence will turn out to be central to the development of statistical models.

Definition B.57. Let N be an index set, let Y and {X_i}_{i∈N} be random quantities, and let A_i be the σ-field generated by X_i. We say that {X_i}_{i∈N} are conditionally independent given Y if, for every n and every set of distinct indices i_1, ..., i_n and every collection of sets A_1 ∈ A_{i_1}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{j=1}^{n} A_j | Y) = ∏_{j=1}^{n} Pr(A_j | Y), a.s.   (B.58)

If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are independent. Under the same conditions as above, if all of the conditional distributions of the X_i given Y are the same, then we say {X_i}_{i∈N} are conditionally IID given Y. If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are IID.

Example B.59. Let F be a joint CDF of n random variables X_1, ..., X_n, and let μ be the corresponding measure on ℝⁿ. Then μ is a product measure if and only if X_1, ..., X_n are independent (see Proposition B.66).

Example B.60 (Continuation of Example B.56; see page 627).²⁴ Transform to (Y, V), where Y = X/V^{1/2}. Then, the conditional distribution of Y given V is given by

μ_{Y|V}(D|v) = [Γ(n/2)/π^{n/2}] ∫_{D′} j(θ) dλ_{n−1}(θ),

²⁴This calculation is used again in Example 4.121.



where D′ = {θ : (cos(θ_1), ..., sin(θ_1) ⋯ sin(θ_{n−1})) ∈ D}. We note that this formula does not depend on v; hence Y is independent of V. In addition, it is easy to see that μ_{Y|V}(·|v) is just the uniform distribution over the sphere of radius 1 in n dimensions.
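The Gaussian case gives a quick Monte Carlo sanity check of Examples B.56 and B.60. This simulation is our own (sample sizes, seed, and tolerances are arbitrary): for spherically symmetric X, the direction Y = X/V^{1/2} should be uniform on the unit sphere and independent of V.

```python
# Monte Carlo sketch of Examples B.56 and B.60 in the Gaussian case:
# the first coordinate of Y = X / sqrt(V) should have mean near 0 and be
# nearly uncorrelated with V = X^T X.  (Our own toy simulation.)
import math, random

random.seed(0)
n, trials = 3, 20000
y1, v = [], []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    vv = sum(c * c for c in x)                   # V = X^T X
    y1.append(x[0] / math.sqrt(vv))              # first coordinate of Y
    v.append(vv)

mean_y1 = sum(y1) / trials
mean_v = sum(v) / trials
cov = sum((a - mean_y1) * (b - mean_v) for a, b in zip(y1, v)) / trials
```

With 20,000 draws, the sample mean of the first coordinate of Y and its sample covariance with V are both close to zero, consistent with independence of direction and squared radius.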

The use of conditional independence in predictive inference is based on the following theorem.

Theorem B.61.²⁵ Let N be an index set, let Y and {X_i}_{i∈N} be a collection of random quantities, and let A_i be the σ-field generated by X_i. Then {X_i}_{i∈N} are conditionally independent given Y if and only if for every n and m and every set of distinct indices i_1, ..., i_n, j_1, ..., j_m and every collection of sets A_1 ∈ A_{i_1}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{l=1}^{n} A_l | Y, X_{j_1}, ..., X_{j_m}) = Pr(∩_{l=1}^{n} A_l | Y), a.s.   (B.62)

PROOF. For the "if" part, we will assume (B.62) and prove (B.58) by induction on n. For n = 1, there is nothing to prove. Assuming (B.58) is true for all n ≤ k, we now prove it for n = k + 1. Let A_j ∈ A_{i_j} for j = 1, ..., k + 1. According to (B.62) and (B.58) for n = k, we have

Pr(∩_{j=1}^{k} A_j | Y, X_{i_{k+1}}) = Pr(∩_{j=1}^{k} A_j | Y) = ∏_{j=1}^{k} Pr(A_j | Y), a.s.

It follows that for all B ∈ A_Y, the σ-field generated by Y,

Pr(B ∩ ∩_{j=1}^{k+1} A_j) = Pr((B ∩ A_{k+1}) ∩ ∩_{j=1}^{k} A_j)
= ∫_{B∩A_{k+1}} Pr(∩_{j=1}^{k} A_j | Y, X_{i_{k+1}})(s) dμ(s)
= ∫_{B∩A_{k+1}} ∏_{j=1}^{k} Pr(A_j | Y)(s) dμ(s) = ∫_B I_{A_{k+1}}(s) ∏_{j=1}^{k} Pr(A_j | Y)(s) dμ(s)
= ∫_B Pr(A_{k+1} | Y)(s) ∏_{j=1}^{k} Pr(A_j | Y)(s) dμ(s) = ∫_B ∏_{j=1}^{k+1} Pr(A_j | Y)(s) dμ(s).

The equality of the first and last terms above for all B ∈ A_Y means that ∏_{j=1}^{k+1} Pr(A_j | Y) = Pr(∩_{j=1}^{k+1} A_j | Y), a.s., which is what we need to complete the induction.

For the "only if" part, we will assume (B.58) and prove (B.62). For a function g to be the left-hand side of (B.62), it must be measurable with respect to the σ-field A_{Y,m} generated by Y, X_{j_1}, ..., X_{j_m}, and satisfy

∫_C g(s) dμ(s) = Pr(C ∩ ∩_{i=1}^{n} A_i)   (B.63)

²⁵This theorem is used in the proofs of Theorems 2.14 and 2.20.



for all C ∈ A_{Y,m}. Clearly, the right-hand side of (B.62) is measurable with respect to A_{Y,m}. If C = C_Y ∩ C_X, where C_Y ∈ A_Y and C_X is in the σ-field generated by X_{j_1}, ..., X_{j_m}, then

Pr(C ∩ ∩_{i=1}^{n} A_i) = ∫_{C_Y} Pr(C_X ∩ ∩_{i=1}^{n} A_i | Y)(s) dμ(s)
= ∫_{C_Y} I_{C_X}(s) Pr(∩_{i=1}^{n} A_i | Y)(s) dμ(s)
= ∫_C Pr(∩_{i=1}^{n} A_i | Y)(s) dμ(s).

This means that (B.63) holds with g = Pr(∩_{i=1}^{n} A_i | Y) so long as C is of the specified form. To show that it holds for all C ∈ A_{Y,m}, we first note that A_{Y,m} is the smallest σ-field containing all sets of the specified form. Clearly, (B.63) holds for all sets that are unions of finitely many disjoint sets of the specified form by linearity of integrals. These sets form a field C. According to Lemma A.24, for each ε > 0, there is C_ε ∈ C such that Pr(C_ε △ C) < ε/2. The following facts follow trivially:

∫_{C_ε} g(s) dμ(s) = Pr(C_ε ∩ ∩_{i=1}^{n} A_i),

|Pr(C ∩ ∩_{i=1}^{n} A_i) − Pr(C_ε ∩ ∩_{i=1}^{n} A_i)| < ε/2,

|∫_C g(s) dμ(s) − ∫_{C_ε} g(s) dμ(s)| < ε/2.

Combining these gives |∫_C g(s) dμ(s) − Pr(C ∩ ∩_{i=1}^{n} A_i)| < ε. Since ε is arbitrary, (B.63) holds for all C ∈ A_{Y,m}. □

A particular case of interest involves three random quantities. Theorem B.64 says that when there are only two Xs in Theorem B.61, we can check conditional independence by checking only one of the equations of the form (B.62).

Theorem B.64.²⁶ Let X, Y, and Z be three random quantities, and let A_X, A_Y, and A_Z be the σ-fields generated by each of them. Suppose that for all A ∈ A_X, Pr(A|Y, Z) = Pr(A|Y). Then X and Z are conditionally independent given Y.

PROOF. We need to check that for every A ∈ A_X and B ∈ A_Z, Pr(A ∩ B|Y) = Pr(A|Y) Pr(B|Y). Equivalently, for all such A and B, and all C ∈ A_Y, we must show

Pr(A ∩ B ∩ C) = ∫ I_C(s) Pr(A|Y)(s) Pr(B|Y)(s) dμ(s).   (B.65)

²⁶This theorem is used in the proofs of Theorems 2.14 and 2.20.



Since we have assumed that Pr(A|Y, Z) = Pr(A|Y), we have that, for all B ∈ A_Z and C ∈ A_Y,

Pr(A ∩ B ∩ C) = ∫ I_C(s) I_B(s) Pr(A|Y)(s) dμ(s).

We can use Proposition B.27 with g(Y) = I_C Pr(A|Y) and X = I_B to see that

∫ I_C(s) I_B(s) Pr(A|Y)(s) dμ(s) = ∫ I_C(s) Pr(A|Y)(s) Pr(B|Y)(s) dμ(s).

Together, these last two equations prove (B.65). □

The following result relates product measure on a product space to independent random variables.

Proposition B.66. Let (S, A, μ) be a probability space and let (T_i, B_i) (i = 1, ..., n) be measurable spaces. Let X_i : S → T_i be measurable for i = 1, ..., n. Let μ_i be the measure that X_i induces on T_i for each i, and let Tⁿ = T_1 × ⋯ × T_n, Bⁿ = B_1 ⊗ ⋯ ⊗ B_n. Let μ_* be the measure that (X_1, ..., X_n) induces on (Tⁿ, Bⁿ) from μ. Then μ_* is the product measure μⁿ = μ_1 × ⋯ × μ_n if and only if the X_i are independent.

The same result holds for conditional independence.

Corollary B.67. Random quantities X_1, ..., X_n are conditionally independent given Y if and only if the product measure of the conditional distributions of X_1, ..., X_n given Y is a version of the conditional distribution of (X_1, ..., X_n) given Y.
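For finite spaces, Proposition B.66 can be checked by hand. The sketch below (with our own toy marginals, not from the text) builds the product measure and verifies that it factors into its own marginals, while a dependent joint does not.

```python
# Discrete illustration of Proposition B.66: the joint law of (X1, X2) is the
# product measure mu_1 x mu_2 exactly when X1 and X2 are independent.
mu1 = {0: 0.3, 1: 0.7}                           # marginal of X1
mu2 = {'a': 0.5, 'b': 0.25, 'c': 0.25}           # marginal of X2

def factorizes(joint):
    """Check whether `joint` equals the product of its own marginals."""
    m1, m2 = {}, {}
    for (i, j), p in joint.items():
        m1[i] = m1.get(i, 0.0) + p
        m2[j] = m2.get(j, 0.0) + p
    return all(abs(p - m1[i] * m2[j]) < 1e-12 for (i, j), p in joint.items())

product = {(i, j): mu1[i] * mu2[j] for i in mu1 for j in mu2}   # independent
dependent = {(0, 'a'): 0.5, (1, 'b'): 0.5}                      # not a product
```

The product joint passes the factorization test and the perfectly correlated joint fails it, matching the "if and only if" in the proposition.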

There is an interesting theorem that applies to sequences of independent random variables, even if they are not identically distributed.

Theorem B.68 (Kolmogorov zero-one law).²⁷ Suppose that (S, A, μ) is a probability space. Let {X_n}_{n=1}^∞ be a sequence of independent random quantities. For each n, let C_n be the σ-field generated by (X_n, X_{n+1}, ...) and let C = ∩_{n=1}^∞ C_n. Then every set in C has probability 0 or probability 1.

PROOF. Let A_n be the σ-field generated by (X_1, ..., X_n). Then C_* = ∪_{n=1}^∞ A_n is a field. It is easy to see that C is contained in the smallest σ-field containing C_*. Let A ∈ C. By Lemma A.24, for every k > 0, there exist n and C_k ∈ A_n such that μ(A Δ C_k) < 1/k. It follows that

lim_{k→∞} μ(C_k) = μ(A).   (B.69)

Since A ∈ C, it follows that A ∈ C_{n+1}; hence A and C_k are independent for every k. It follows that μ(C_k ∩ A) = μ(A)μ(C_k). It follows from (B.69) that μ(A) = μ(A)², and hence either μ(A) = 0 or μ(A) = 1. □

27This theorem is used in the proofs of Corollary 1.63 and Lemma 7.83, and in the discussion of "sampling to a foregone conclusion" in Section 9.4.



The σ-field C in Theorem B.68 is often called the tail σ-field of the sequence {X_n}_{n=1}^∞. An interesting feature of the tail σ-field is that limits are measurable with respect to it.28 (See Problem 21 on page 663.)

B.3.5 The Law of Total Probability

Next, we introduce some theorems that are very simple to state for discrete random variables but appear to be rather unwieldy in the general case. We will, however, need them often.

Theorem B.70 (Law of total probability). Let (S, A, μ) be a probability space, and let Z be a random variable with E(|Z|) < ∞. Let C ⊆ B be sub-σ-fields of A. Then E(Z|C) = E(E(Z|B)|C), a.s. [μ].

PROOF. Define T = E(Z|B) : S → ℝ, which is any B-measurable function satisfying E(Z I_B) = ∫_B T(s) dμ(s) for all B ∈ B. We need to show that E(Z|C) = E(T|C), a.s. [μ]. The function E(T|C) is any C-measurable function satisfying ∫_C E(T|C)(s) dμ(s) = E(T I_C) for all C ∈ C. But, since C ⊆ B, C ∈ C implies C ∈ B. So, for C ∈ C,

∫_C E(T|C)(s) dμ(s) = E(T I_C) = ∫ I_C(s) T(s) dμ(s) = ∫_C T(s) dμ(s) = E(Z I_C),

where the last equality follows since T = E(Z|B) and C ∈ B. Since E(T|C) is C-measurable, equating the first and last entries of the above string of equations means that E(T|C) satisfies the condition required for it to equal E(Z|C). □
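On a finite probability space, conditional expectation given a σ-field generated by a partition is just a block average, so Theorem B.70 can be verified numerically. In the sketch below, the space, the function Z, and the two partitions are all illustrative; the partition generating C is coarser than the one generating B.

```python
# Six equally likely points; partition_B refines partition_C.
S = range(6)
mu = {s: 1.0 / 6 for s in S}
Z = {0: 1.0, 1: 3.0, 2: 2.0, 3: 6.0, 4: 4.0, 5: 8.0}
partition_B = [{0, 1}, {2, 3}, {4, 5}]   # generates the finer sigma-field B
partition_C = [{0, 1, 2, 3}, {4, 5}]     # generates the coarser sigma-field C

def cond_exp(f, partition):
    """E(f | sigma-field generated by the partition), as a function on S."""
    out = {}
    for block in partition:
        pb = sum(mu[s] for s in block)
        avg = sum(f[s] * mu[s] for s in block) / pb
        for s in block:
            out[s] = avg
    return out

lhs = cond_exp(Z, partition_C)                         # E(Z | C)
rhs = cond_exp(cond_exp(Z, partition_B), partition_C)  # E(E(Z | B) | C)
print(all(abs(lhs[s] - rhs[s]) < 1e-12 for s in S))    # True
```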

When B and C are the σ-fields generated by two random quantities X and Y, respectively, C ⊆ B means Y is a function of X. So, Theorem B.70 can be rewritten in this case.

Corollary B.71. Let X : S → U_1, Y : S → U_2, and Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Suppose that Y is a function of X. Then,

E(Z|Y) = E(E(Z|X) | Y), a.s. [μ].

The most popular special case of this corollary occurs when Y is constant.

Corollary B.72.29 Let (S, A, μ) be a probability space. Let X : S → U_1 and Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Then E(Z) = E(E(Z|X)).

This is the special case of Theorem B.70 when C is the trivial σ-field.

The following theorem implies that if a conditional mean given X depends on X only through h(X), then it is also the conditional mean given h(X).

Theorem B.73.30 Let (S, A, μ) be a probability space and let B and C be sub-σ-fields of A with C ⊆ B. Let Z : S → ℝ be measurable such that E(|Z|) < ∞. Then there exists a version of E(Z|B) that is C-measurable if and only if E(Z|B) = E(Z|C), a.s. [μ].

PROOF. For the "if" direction, if E(Z|B) = E(Z|C), a.s. [μ], then E(Z|C) is measurable with respect to both C and B, and hence it is a C-measurable version of E(Z|B). For the "only if" direction, if W is a C-measurable version of E(Z|B), then W = E(W|C), a.s. [μ], by the second part of Proposition B.25. By the law of total probability B.70, E(W|C) = E(Z|C), a.s. [μ]. □

28The tail σ-field will play a role in the proofs of Corollary 1.63 and Theorem 1.49.

29This corollary is used in the proof of Theorem B.75. 30This theorem is used in the proofs of Theorems 1.49 and 2.6.

A useful corollary is the following.

Corollary B.74.31 Let (S, A, μ) be a probability space. Let (S_1, A_1) and (S_2, A_2) be measurable spaces, and let X : S → S_1 and h : S_1 → S_2 be measurable functions. Let Z : S → ℝ be measurable such that E(|Z|) < ∞. Define Y = h(X). Then E(Z | X = x, Y = y) = E(Z | X = x) a.s. with respect to the measure on (S_1 × S_2, A_1 ⊗ A_2) induced by (X, Y) : S → S_1 × S_2 from μ.

The following theorem deals with conditioning on two random quantities at the same time. In words, it says that the conditional mean of a random variable Z given two random quantities X_1 and X_2 can be calculated two ways. One is to condition on both X_1 and X_2 at once, and the other is to condition on one of them, say X_2, and then find the conditional mean of Z given X_1, but starting from the conditional distribution of (Z, X_1) given X_2.

Theorem B.75.32 Let (S, A, μ) be a probability space and let (X_i, B_i) for i = 1, 2 be measurable spaces. Let X_i : S → X_i for i = 1, 2 and Z : S → ℝ be random quantities such that E(|Z|) < ∞. Let μ_{1,2,Z} denote the measure on (X_1 × X_2 × ℝ, B_1 ⊗ B_2 ⊗ B) induced by (X_1, X_2, Z) from μ. (Here, B denotes the Borel σ-field.) For each (x, y) ∈ X_1 × X_2, let g(x, y) denote E(Z | (X_1, X_2) = (x, y)). For each A ∈ A and y ∈ X_2, let μ_(2)(A|y) denote Pr(A | X_2 = y). For each y ∈ X_2, let h(x, y) denote the conditional mean of Z given X_1 = x calculated in the probability space (S, A, μ_(2)(·|y)). Then h = g a.s. [μ_{1,2,Z}].

PROOF. Saying that h = g a.s. [μ_{1,2,Z}] is equivalent to saying that

h(X_1, X_2) = E(Z | X_1, X_2), a.s. [μ].

To prove this we first note that f(s) = h(X_1(s), X_2(s)) is measurable with respect to the σ-field generated by (X_1, X_2), A_{X_1,X_2}. All that remains is to show that it satisfies the integral condition required to be E(Z | X_1, X_2). That is, for all C ∈ A_{X_1,X_2},

E(Z I_C) = ∫_C f(s) dμ(s).   (B.76)

Let μ_2 be the measure on (X_2, B_2) induced by X_2 from μ. First, suppose that C = A ∩ B, where A ∈ A_{X_1} and B ∈ A_{X_2}. The last hypothesis of the theorem says that for all A ∈ A_{X_1}, E(Z I_A | X_2 = y) = ∫_A h(X_1(s), y) dμ_(2)(s|y). If μ_{1|2}(·|y) is the probability on (X_1, B_1) induced by X_1 from μ_(2)(·|y), then μ_{1|2}(·|y) is also the conditional distribution of X_1 given X_2 = y as in Theorem B.46. Suppose

31This corollary is used in the proof of Theorem 2.14. 32This theorem is used in the proof of Lemma 2.120, and it is used in making sense of the notation E_θ when introducing parametric models.



that A = X_1^{-1}(D) and B = X_2^{-1}(F). Then A ∩ B = (X_1, X_2)^{-1}(D × F) and E(Z I_A | X_2 = y) = ∫_D h(x, y) dμ_{1|2}(x|y). By Corollary B.72 and Theorem B.46, we can write

E(Z I_A I_B) = ∫_F ∫_D h(x, y) dμ_{1|2}(x|y) dμ_2(y) = ∫_{D×F×ℝ} h(x, y) dμ_{1,2,Z}(x, y, z) = ∫_{A∩B} f(s) dμ(s).

This proves (B.76) for C = A ∩ B. Let C be the collection of all sets C in A such that (B.76) holds. Clearly S ∈ C. If C ∈ C, then C^c ∈ C since ∫ f(s) dμ(s) = E(Z). By countable additivity of integrals, if {C_i}_{i=1}^∞ are disjoint elements of C, then ∪_{i=1}^∞ C_i ∈ C; hence C contains the smallest σ-field containing all sets of the form A ∩ B for A ∈ A_{X_1} and B ∈ A_{X_2}. Theorem A.34 can be used to show that this σ-field is A_{X_1,X_2}. □

If a random variable has finite second moment, then there is a concept of conditional variance.

Definition B.77. Let X : S → ℝ^k have finite second moment, and let C be a sub-σ-field of A. Then the conditional covariance matrix of X given C is defined as Var(X|C) = E[(X − E(X|C))(X − E(X|C))^T | C].

The following result is easy to prove.

Proposition B.78.33 Let X : S → ℝ^k have finite second moment, and let C be a sub-σ-field of A. Then Var(X) = E[Var(X|C)] + Var[E(X|C)].
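Proposition B.78 can be verified exactly on a finite probability space, where Var(X|C) is the within-block variance and E(X|C) is the block mean. The four-point space and partition below are illustrative.

```python
mu = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # probabilities of the four points
X = {0: -1.0, 1: 2.0, 2: 0.0, 3: 5.0}   # a real random variable
partition_C = [{0, 1}, {2, 3}]           # generates the sub-sigma-field C

def mean(f, weights):
    """Weighted mean of f over the keys of weights (weights need not sum to 1)."""
    total = sum(weights.values())
    return sum(f[s] * weights[s] for s in weights) / total

EX = mean(X, mu)
var_X = mean({s: (X[s] - EX) ** 2 for s in mu}, mu)

e_var = 0.0   # accumulates E[Var(X | C)]
var_e = 0.0   # accumulates Var[E(X | C)]
for block in partition_C:
    w = {s: mu[s] for s in block}
    pb = sum(w.values())
    m = mean(X, w)                                    # E(X | C) on this block
    v = mean({s: (X[s] - m) ** 2 for s in block}, w)  # Var(X | C) on this block
    e_var += pb * v
    var_e += pb * (m - EX) ** 2

print(abs(var_X - (e_var + var_e)) < 1e-12)  # True
```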

B.4 Limit Theorems

There are several types of convergence that will be of interest to us. They involve sequences of random quantities or sequences of distributions.

B.4.1 Convergence in Distribution and in Probability

The simplest type of convergence occurs when the distributions have densities with respect to a common measure. The following theorem is due to Scheffé (1947).

Theorem B.79 (Scheffé's theorem).34 Let {p_n}_{n=1}^∞ and p be nonnegative functions from a measure space (X, B, ν) to ℝ such that the integral of each function is 1 and lim_{n→∞} p_n(x) = p(x), a.e. [ν]. Then

lim_{n→∞} ∫_B p_n(x) dν(x) = ∫_B p(x) dν(x), for all B ∈ B.

PROOF. Let δ_n(x) = p_n(x) − p(x), and let δ_n^+ and δ_n^− be its positive and negative parts. Clearly, both lim_{n→∞} δ_n^+ = 0 and lim_{n→∞} δ_n^− = 0, a.e. [ν]. Since

33This proposition is used in the proofs of Theorems 2.36 and 2.86. 34This theorem is used in the proofs of Lemma 1.113 and Theorem 1.121.



0 ≤ δ_n^− ≤ p is true, it follows from the dominated convergence theorem A.57 that lim_{n→∞} ∫_B δ_n^−(x) dν(x) = 0 for all B. Since both p_n and p are densities, ∫_X δ_n(x) dν(x) = 0 for all n. It follows that lim_{n→∞} ∫_X δ_n^+(x) dν(x) = 0. Since I_B(x) δ_n^+(x) ≤ δ_n^+(x) for all x, it follows from Proposition A.58 that

lim_{n→∞} ∫_B δ_n^+(x) dν(x) = 0.

So, lim_{n→∞} ∫_B [p_n(x) − p(x)] dν(x) = 0 for all B. □

Since defining convergence requires a topology, the following definitions require that the random quantities lie in various types of topological spaces.

Definition B.80. Let {X_n}_{n=1}^∞ be a sequence of random quantities and let X be another random quantity, all taking values in the same topological space X. If lim_{n→∞} E(f(X_n)) = E(f(X)) for every bounded continuous function f : X → ℝ, then we say that X_n converges in distribution to X, which is written X_n →ᴰ X.

Convergence in distribution is sometimes defined in terms of probability measures. The reason is that if X_n →ᴰ X, the actual values of X_n and of X do not play any role in the convergence. All that matters is the distributions of X_n and of X.

Definition B.81. Let {P_n}_{n=1}^∞ be a sequence of probability measures on a topological space (X, B), where B contains all open sets. Let P be another probability on (X, B). We say that P_n converges weakly35 to P (denoted P_n →ʷ P) if, for each bounded continuous function g : X → ℝ, lim_{n→∞} ∫ g(x) dP_n(x) = ∫ g(x) dP(x).

35This is not exactly the same as the concept of weak convergence in normed linear spaces [see, for example, Dunford and Schwartz (1957), p. 419]. The collection of all probability measures on a space (X, B) can be considered a subset of a normed linear space C consisting of all finite signed measures ν (see Definition A.18) with the norm being sup_{B∈B} |ν(B)|. Weak convergence of a sequence {ν_n}_{n=1}^∞ in this space would require the convergence of L(ν_n) for every bounded linear functional L on C. Every bounded measurable function g on (X, B) determines a bounded linear functional L_g on C by L_g(ν) = ∫ g(x) dν(x), where the integral with respect to a signed measure can be defined as in Problem 27 on page 605. Hence, weak convergence of a sequence of probability measures would require convergence of the means of all bounded measurable functions. In particular, lim_{n→∞} P_n(B) = P(B) for all measurable sets B, not just those for which P assigns 0 probability to the boundary (see the portmanteau theorem B.83 on page 636). Alternatively, we can consider the set of bounded continuous functions f : X → ℝ as a normed linear space N with ‖f‖ = sup_x |f(x)|. Then the set of finite signed measures C is a set of bounded linear functionals on N using the definition ν(f) = ∫ f(x) dν(x). Weak* convergence of a sequence {ν_n}_{n=1}^∞ in C to ν is defined as the convergence of ν_n(f) to ν(f) for all f ∈ N. This is precisely convergence in distribution. Hence, it would make more sense to call convergence in distribution weak* convergence rather than weak convergence. Since the tradition in probability theory is to call it weak convergence, we will continue to do so.



It is easy to see that these two types of convergence are the same.

Proposition B.82. Let P_n be the distribution of X_n, and let P be the distribution of X. Then X_n →ᴰ X if and only if P_n →ʷ P.

Since we will usually be dealing with X spaces that are metric spaces, there are some equivalent ways to define convergence in distribution or weak convergence. The proofs of Theorems B.83 and B.88 are adapted from Billingsley (1968).

Theorem B.83 (Portmanteau theorem).36 The following are all equivalent in a metric space:

1. P_n →ʷ P;

2. limsup_{n→∞} P_n(B) ≤ P(B) for each closed B;

3. liminf_{n→∞} P_n(A) ≥ P(A) for each open A;

4. lim_{n→∞} P_n(C) = P(C) for each C with P(∂C) = 0.37

PROOF. Let d be the metric in the metric space. First, assume (1) and let B be a closed set. Let δ > 0 be given. For each ε > 0, define C_ε = {x : d(x, B) ≤ ε}, where d(x, B) = inf_{y∈B} d(x, y). Since |d(x, B) − d(y, B)| ≤ d(x, y), we see that d(x, B) is continuous in x. Each C_ε is closed and ∩_{ε>0} C_ε = B. Let ε be small enough so that P(C_ε) ≤ P(B) + δ. Let f : ℝ → ℝ be

f(t) = 1 if t ≤ 0;  1 − t if 0 < t < 1;  0 if t ≥ 1,

and define g_ε(x) = f(d(x, B)/ε). Then g_ε is bounded and continuous. So,

lim_{n→∞} ∫ g_ε(x) dP_n(x) = ∫ g_ε(x) dP(x).

It is easy to see that 0 ≤ g_ε(x) ≤ 1, g_ε(x) = 1 for all x ∈ B, and g_ε(x) = 0 for all x ∉ C_ε. Hence, for every δ > 0,

P_n(B) = ∫ I_B(x) dP_n(x) ≤ ∫ g_ε(x) dP_n(x) → ∫ g_ε(x) dP(x) ≤ ∫ I_{C_ε}(x) dP(x) = P(C_ε) ≤ P(B) + δ.

It follows that limsup_{n→∞} P_n(B) ≤ P(B), which is (2).

That (2) and (3) are equivalent follows easily from the facts that if A is open, then B = A^c is closed and P_n(A) = 1 − P_n(B). It is also easy to see that (2) and (3) together imply (4). Next assume (4), let B be a closed set, and define C_ε as above. The boundary of C_ε is a subset of {x : d(x, B) = ε}. There can be at most countably many ε for which these sets have positive probability. Hence, there

36This theorem is used in the proofs of Theorem B.88 and Lemma 7.19.

37We use the symbol ∂ in front of the name of a subset of a topological space to refer to the boundary of the set. The boundary of a set C in a topological space is the intersection of the closure of the set with the closure of the complement.



exists a sequence {ε_k}_{k=1}^∞ converging to 0 such that P(d(X, B) = ε_k) = 0 for all k. It follows that lim_{n→∞} P_n(C_{ε_k}) = P(C_{ε_k}) for all k. Since P_n(B) ≤ P_n(C_{ε_k}) for every n and k, we have, for every k,

limsup_{n→∞} P_n(B) ≤ lim_{n→∞} P_n(C_{ε_k}) = P(C_{ε_k}).

Since P(B) = lim_{k→∞} P(C_{ε_k}), we have (2). So, (2), (3), and (4) are equivalent and (1) implies (2).

All that remains is to prove that (2) implies (1). Assume (2), and let f be a bounded continuous function. Let m < f(x) < M for all x. For each k, let F_{i,k} = {x : f(x) ≤ m + (M − m)i/k} for i = 1, ..., k. Let F_{0,k} = ∅. Each F_{i,k} is closed, since f is continuous. Let G_{i,k} = F_{i,k} \ F_{i−1,k} for i = 1, ..., k. It is easy to see that for every probability Q,

m + (M − m) Σ_{i=1}^k ((i − 1)/k) Q(G_{i,k}) < ∫ f(x) dQ(x) ≤ m + (M − m) Σ_{i=1}^k (i/k) Q(G_{i,k}).

Since Q(G_{i,k}) = Q(F_{i,k}) − Q(F_{i−1,k}) for every i and k, we get

M − ((M − m)/k) Σ_{i=1}^k Q(F_{i,k}) < ∫ f(x) dQ(x) ≤ M + (M − m)/k − ((M − m)/k) Σ_{i=1}^k Q(F_{i,k}).   (B.84)

For each i,

limsup_{n→∞} P_n(F_{i,k}) ≤ P(F_{i,k}).   (B.85)

It follows that, for every k,

∫ f(x) dP(x) ≤ M + (M − m)/k − ((M − m)/k) Σ_{i=1}^k P(F_{i,k})
 ≤ M + (M − m)/k − ((M − m)/k) Σ_{i=1}^k limsup_{n→∞} P_n(F_{i,k})
 ≤ (M − m)/k + liminf_{n→∞} ∫ f(x) dP_n(x),

where the first inequality follows from the second inequality in (B.84) with Q = P, the second inequality follows from (B.85), and the third inequality follows from the first inequality in (B.84) with Q = P_n. Letting k be arbitrarily large, we get

∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ f(x) dP_n(x).   (B.86)

Now, apply the same reasoning to −f to get

−∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ −f(x) dP_n(x) = −limsup_{n→∞} ∫ f(x) dP_n(x),

so that

∫ f(x) dP(x) ≥ limsup_{n→∞} ∫ f(x) dP_n(x).   (B.87)

Together, (B.86) and (B.87) imply (1). □



Theorem B.88 (Continuous mapping theorem).38 Let {X_n}_{n=1}^∞ be a sequence of random quantities, and let X be another random quantity, all taking values in the same metric space X. Suppose that X_n →ᴰ X. Let Y be a metric space and let g : X → Y. Define

C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(X_n) →ᴰ g(X).

PROOF. Let P_n be the distribution of g(X_n) and let P be the distribution of g(X). Let B be a closed subset of Y, and let cl(g^{-1}(B)) denote the closure of g^{-1}(B). If x ∈ cl(g^{-1}(B)) but x ∉ g^{-1}(B), then g is not continuous at x. It follows that cl(g^{-1}(B)) ⊆ g^{-1}(B) ∪ C_g^c. Now write

limsup_{n→∞} P_n(B) = limsup_{n→∞} Pr(X_n ∈ g^{-1}(B)) ≤ limsup_{n→∞} Pr(X_n ∈ cl(g^{-1}(B)))
 ≤ Pr(X ∈ cl(g^{-1}(B))) ≤ Pr(X ∈ g^{-1}(B)) + Pr(X ∈ C_g^c)
 = Pr(X ∈ g^{-1}(B)) = P(B),

and the result now follows from the portmanteau theorem B.83. □

Another type of convergence is convergence in probability.

Definition B.89. If {X_n}_{n=1}^∞ and X are random quantities in a metric space with metric d, and if, for every ε > 0, lim_{n→∞} Pr(d(X_n, X) > ε) = 0, then we say that X_n converges in probability to X, which is written X_n →ᴾ X.

The following theorem is useful in that it relates convergence in distribution, convergence in probability, and the simpler concept of convergence almost surely.

Theorem B.90.39 Let {X_n}_{n=1}^∞ be a sequence of random vectors and let X be a random vector.

1. If lim_{n→∞} X_n = X a.s., then X_n →ᴾ X.

2. If X_n →ᴾ X, then X_n →ᴰ X.

3. If X is degenerate and X_n →ᴰ X, then X_n →ᴾ X.

4. If X_n →ᴾ X, then there is a subsequence {n_k}_{k=1}^∞ such that lim_{k→∞} X_{n_k} = X, a.s.

PROOF. First, assume that X_n converges a.s. to X. For each n and ε, let A_{n,ε} = {s : d(X_n(s), X(s)) ≤ ε}. Then X_n(s) converges to X(s) if and only if

s ∈ ∩_{ε>0} ∪_{N=1}^∞ (∩_{n=N}^∞ A_{n,ε}).

Since this set must have probability 1, then so too must ∪_{N=1}^∞ (∩_{n=N}^∞ A_{n,ε}) for all ε. By Theorem A.19, it follows that for every ε, lim_{N→∞} Pr(∩_{n=N}^∞ A_{n,ε}) = 1.

38This theorem is used to provide a short proof of DeFinetti's representation theorem for Bernoulli random variables in Example 1.82 on page 46.

39This theorem is used in the proofs of Theorems B.95, 1.49, 7.26, and 7.78.



Hence, for each ε > 0, lim_{n→∞} Pr(A_{n,ε}^c) = 0, which is precisely what it means to say that X_n →ᴾ X.

Next assume that X_n →ᴾ X. Let g : X → ℝ be bounded and continuous with |g(x)| ≤ K for all x. Let ε > 0, and let A be a compact set with Pr(X ∈ A) > 1 − ε/[6K]. A continuous function (like g) on a compact set is uniformly continuous. So let δ > 0 be such that x ∈ A and d(x, y) < δ implies |g(x) − g(y)| < ε/3. Since X_n →ᴾ X, there exists N such that n ≥ N implies Pr(d(X_n, X) < δ) > 1 − ε/[6K]. Let B = {X ∈ A, d(X_n, X) < δ}. It follows that |g(X)I_B − g(X_n)I_B| < ε/3 and, for all n ≥ N, Pr(B) > 1 − ε/[3K]. Also, note that n ≥ N implies

|Eg(X) − E[g(X)I_B]| < ε/3,  |Eg(X_n) − E[g(X_n)I_B]| < ε/3.

So, n ≥ N implies

|Eg(X) − Eg(X_n)| ≤ |Eg(X) − E[g(X)I_B]| + |E[g(X)I_B] − E[g(X_n)I_B]| + |E[g(X_n)I_B] − Eg(X_n)| < ε/3 + ε/3 + ε/3 = ε.

Thus, lim_{n→∞} Eg(X_n) = Eg(X), and we have proven X_n →ᴰ X.

Next, suppose that X is degenerate at x_0 and X_n →ᴰ X. Let ε > 0, and define

g(x) = 1 if d(x, x_0) ≤ ε/2;  0 if d(x, x_0) ≥ ε;  2 − 2d(x, x_0)/ε otherwise.

Since g is bounded and continuous, Eg(X_n) converges to Eg(X). But Eg(X) = 1, since Pr(g(X) = 1) = 1, and Eg(X_n) ≤ Pr(d(X_n, x_0) < ε), since 0 ≤ g(x) ≤ 1 for all x. So lim_{n→∞} Pr(d(X_n, x_0) < ε) = 1, and X_n →ᴾ X.

Finally, assume that X_n →ᴾ X. Let n_k be such that n ≥ n_k implies

Pr(d(X_n, X) ≥ 1/k) < 2^{−k}.

Define A_k = {d(X_{n_k}, X) ≥ 1/k}. By the first Borel–Cantelli lemma A.20, we have Pr(B) = 0, where B = ∩_{i=1}^∞ ∪_{k=i}^∞ A_k. It is easy to check that B is the event that d(X_{n_k}, X) is at least 1/k for infinitely many different k. Hence B^c ⊆ {lim_{k→∞} X_{n_k} = X}, and lim_{k→∞} X_{n_k} = X, a.s. □

B.4.2 Characteristic Functions

There is a very important method for proving convergence in distribution which involves the use of characteristic functions.

Definition B.91. Let X be a random vector. The complex-valued function

φ_X(t) = E(exp[i t^T X])

is called the characteristic function of X. If F is a k-dimensional distribution function, the function φ_F(t) = ∫ exp[i t^T x] dF(x) is called the characteristic function of F.



Example B.92. Let X have a standard normal distribution. Then

φ_X(t) = ∫ exp(itx) (1/√(2π)) exp(−x²/2) dx = (1/√(2π)) ∫ exp(−([x − it]² + t²)/2) dx = exp(−t²/2).

Similarly, for other normal distributions N(μ, σ²), the characteristic functions are φ_X(t) = exp(−σ²t²/2 + itμ).

By Theorem B.12, if X has CDF F, then φ_X = φ_F. It is easy to see that the characteristic function exists for every random vector and has complex absolute value at most 1 for all t. Other facts that follow directly from the definition are the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(itb). If X and Y are independent, φ_{X+Y} = φ_X φ_Y.
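For discrete distributions the characteristic function is a finite sum, so the two identities just stated, φ_{aX+b}(t) = φ_X(at) exp(itb) and φ_{X+Y} = φ_X φ_Y for independent X and Y, can be checked exactly. The point masses and constants below are illustrative.

```python
import cmath

def cf(dist, t):
    """Characteristic function of a discrete distribution {x: p} at t."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

X = {-1.0: 0.3, 0.0: 0.2, 2.0: 0.5}
Y = {0.0: 0.6, 1.0: 0.4}
a, b, t = 2.0, 1.5, 0.7

# phi_{aX+b}(t) = phi_X(a t) exp(i t b)
lhs = cf({a * x + b: p for x, p in X.items()}, t)
rhs = cf(X, a * t) * cmath.exp(1j * t * b)
print(abs(lhs - rhs) < 1e-12)  # True

# For independent X and Y, the distribution of X + Y is the convolution
# of the two, and phi_{X+Y} = phi_X phi_Y.
conv = {}
for x, px in X.items():
    for y, py in Y.items():
        conv[x + y] = conv.get(x + y, 0.0) + px * py
print(abs(cf(conv, t) - cf(X, t) * cf(Y, t)) < 1e-12)  # True
```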

The reason that characteristic functions are so useful for proving convergence in distribution is twofold. First, for each characteristic function φ, there is only one CDF F such that φ_F = φ. (See the uniqueness theorem B.106.) Second, characteristic functions are "continuous" as a function of the distribution in the sense of convergence in distribution. That is, X_n →ᴰ X if and only if lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t.40 (See the continuity theorem B.93.)

Theorem B.93 (Continuity theorem).41 For finite-dimensional random vectors, convergence in distribution is equivalent to convergence of characteristic functions. That is, X_n →ᴰ X if and only if lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t.

PROOF. The "only if" part follows from Definition B.80 and the fact that, for every t, exp(i t^T x) can be written as two bounded, continuous, real-valued functions of x (its real and imaginary parts).

For the "if" part, suppose that X is k-dimensional and that lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t. To prove that for each bounded continuous g, lim_{n→∞} Eg(X_n) = Eg(X), we will truncate g to a bounded rectangle and then approximate the truncated function by a function g′ whose mean is a linear combination of values of the characteristic function. The mean of g′(X_n) will then converge to the mean of g′(X). We then need to show that the means of g′(X) and g′(X_n) approximate the means of g(X) and g(X_n), respectively.

First, we need to find a bounded rectangle on which to do the truncation. For each coordinate X^l of X, we will show that if a and b are continuity points of the CDF F_{X^l} of X^l, and F_{X^l}(b) − F_{X^l}(a) > q, then there are b′ > b and a′ < a such that F_{X_n^l}(b′) − F_{X_n^l}(a′) ≥ q for all sufficiently large n. For each a, b, δ, define

f_{a,b,δ}(x) = 1 if a < x < b;  1 − (a − x)/δ if a − δ < x ≤ a;  1 − (x − b)/δ if b ≤ x < b + δ;  0 otherwise.   (B.94)

40This presentation is a hybrid of the presentations given by Breiman (1968, Chapter 8) and Hoel, Port, and Stone (1971, Chapter 8).

41This theorem is used in the proofs of Theorems B.95, B.97, and 7.20.



Note that this function has equal values at a − δ and b + δ. Consider the interval [a − δ, b + δ] as a circle, identifying the two endpoints. Now, use the Stone–Weierstrass theorem C.3 to approximate f_{a,b,δ} uniformly to within ε on the circle by f*_{a,b,δ,ε}(x) = Σ_{j=−m}^{m} b_j exp(2πijx/c), where c = b − a + 2δ. If Y is a random variable, then E f*_{a,b,δ,ε}(Y) is a linear combination of values of the characteristic function of Y. So, we have lim_{n→∞} E f*_{a,b,δ,ε}(X_n^l) = E f*_{a,b,δ,ε}(X^l). Let q > 0, and let a and b be continuity points of F_{X^l} such that F_{X^l}(b) − F_{X^l}(a) = v > q. Let w = v − q. Let δ > 0 be arbitrary, and define a′ = a − δ and b′ = b + δ. Let N be large enough so that n ≥ N implies |E f*_{a,b,δ,w/3}(X_n^l) − E f*_{a,b,δ,w/3}(X^l)| < w/3. If n ≥ N, then

F_{X_n^l}(b′) − F_{X_n^l}(a′) ≥ E f_{a,b,δ}(X_n^l)
 ≥ E f*_{a,b,δ,w/3}(X_n^l) − w/3
 ≥ E f*_{a,b,δ,w/3}(X^l) − 2w/3
 ≥ E f_{a,b,δ}(X^l) − w
 ≥ F_{X^l}(b) − F_{X^l}(a) − w = q.

Now, let g be a bounded continuous function, and suppose that |g(x)| < K for all x. Let ε > 0. For each coordinate x^l of X, let a_l and b_l be continuity points of F_{X^l} such that F_{X^l}(b_l) − F_{X^l}(a_l) > 1 − ε/(7[K + ε/7]k). Let δ > 0 be arbitrary, and define a′_l = a_l − δ, b′_l = b_l + δ, and g*(x) = g(x) ∏_{l=1}^k f_{a_l,b_l,δ}(x^l). Use the Stone–Weierstrass theorem C.3 to uniformly approximate g* to within ε/7 on the rectangle {x : a_l − δ ≤ x^l ≤ b_l + δ} by

g′(x) = Σ_{j_1=−m_1}^{m_1} ... Σ_{j_k=−m_k}^{m_k} a_{j_1,...,j_k} exp(2πi j^T x),

where j is the vector with lth coordinate j_l/[b_l − a_l + 2δ]. Then,

lim_{n→∞} E g′(X_n) = E g′(X).

Let N_1 be large enough so that n ≥ N_1 implies F_{X_n^l}(b′_l) − F_{X_n^l}(a′_l) ≥ 1 − ε/(7[K + ε/7]k) for all l. Let N_2 be large enough so that n ≥ N_2 implies |E g′(X_n) − E g′(X)| < ε/7. Let R be the rectangle R = {x : a′_l < x^l ≤ b′_l}. Since g′ is periodic in every coordinate, it is bounded by K + ε/7 on all of ℝ^k. If n ≥ max{N_1, N_2}, then |Eg(X_n) − Eg(X)| is no greater than

E|g(X_n)I_{R^c}(X_n)| + E|g(X)I_{R^c}(X)| + E|g′(X_n)I_{R^c}(X_n)| + |E g(X_n)I_R(X_n) − E g′(X_n)I_R(X_n)| + |E g′(X_n) − E g′(X)| + E|g′(X)I_{R^c}(X)| + |E g′(X)I_R(X) − E g(X)I_R(X)| ≤ ε. □

42See Problem 26 on page 664.

We will prove two more limit theorems that make use of the continuity theorem B.93. Suppose that X has finite mean. Since |exp(itx) − 1| ≤ min{|tx|, 2} for all t and x,42 and

lim_{t→0} (exp(itx) − 1)/t = ix



for all x, it follows from the dominated convergence theorem that

(d/dt) φ_X(t)|_{t=0} = i E(X).

Similarly, if X has finite variance, it can be shown that

(d²/dt²) φ_X(t)|_{t=0} = −E(X²).

Using these two facts, we can prove the weak law of large numbers and the central limit theorem.

Theorem B.95 (Weak law of large numbers). Suppose that {X_n}_{n=1}^∞ are IID random variables with finite mean μ. Then X̄_n = Σ_{i=1}^n X_i/n converges in probability to μ.

PROOF. First, we will prove that the characteristic function of X̄_n − μ converges to 1 for all t. Let Y_i = X_i − μ. Since φ_{Y_i}(0) = 1, log φ_{Y_i}(t) exists and is differentiable near t = 0, and we know that

(d/dt) log φ_{Y_i}(t)|_{t=0} = 0 = lim_{t→0} log φ_{Y_i}(t)/t.   (B.96)

The characteristic function of X̄_n − μ is φ_*(t) = [φ_{Y_i}(t/n)]^n. For fixed t, let n be large enough so that t/n is close enough to 0 for log φ_{Y_i}(t/n) to be well defined. We know that

log φ_*(t) = n log φ_{Y_i}(t/n) = t · [log φ_{Y_i}(t/n)] / (t/n).

The limit of this quantity, as n → ∞, is 0 by (B.96). It follows that for all t, lim_{n→∞} φ_*(t) = 1. By the continuity theorem B.93, X̄_n − μ →ᴰ 0. By Theorem B.90, X̄_n − μ →ᴾ 0. □

In Chapter 1, we prove a strong law of large numbers 1.62, which has a stronger conclusion and a weaker hypothesis. There is also a weak law of large numbers for the case of infinite means. (See Problem 27 on page 664.)
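A quick simulation illustrates the weak law for IID Uniform(0,1) variables (mean 1/2): the frequency with which |X̄_n − 1/2| exceeds a fixed ε drops as n grows. The sample sizes, the value of ε, and the replication count below are all illustrative choices, not from the text.

```python
import random

random.seed(0)
EPS = 0.05

def exceed_freq(n, reps=2000):
    """Monte Carlo estimate of Pr(|Xbar_n - 1/2| > EPS) for Uniform(0,1) data."""
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - 0.5) > EPS:
            count += 1
    return count / reps

small, large = exceed_freq(10), exceed_freq(1000)
# Large deviations of the sample mean become rare as n grows.
print(small > large)
```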

The following theorem is very useful for approximating distributions.

Theorem B.97 (Central limit theorem). Suppose that {X_i}_{i=1}^∞ is an IID sequence with finite mean μ and finite variance σ². Let X̄_n be the average of the first n X_i's. Then √n(X̄_n − μ) →ᴰ N(0, σ²), the normal distribution with mean 0 and variance σ².

PROOF. Set Y_n = √n(X̄_n − μ). We might as well assume that μ = 0, since we have just subtracted it from each X_i. Since the second derivative at t = 0 of the characteristic function of each X_i is −σ², we can apply l'Hôpital's rule twice to conclude

lim_{t→0} log φ_{X_i}(t)/t² = −σ²/2.   (B.98)

The characteristic function of Y_n is φ_{Y_n}(t) = [φ_{X_i}(t/√n)]^n. We will prove that this converges to exp(−t²σ²/2) for each t. Since log φ_{Y_n}(t) = n log φ_{X_i}(t/√n), we use (B.98) to note that

lim_{n→∞} log φ_{Y_n}(t) = t² lim_{n→∞} [log φ_{X_i}(t/√n)] / (t/√n)² = −t²σ²/2.

It follows that lim_{n→∞} φ_{Y_n}(t) = exp(−t²σ²/2), and the continuity theorem B.93 finishes the proof. □

There is also a multivariate version of the central limit theorem.

Theorem B.99 (Multivariate central limit theorem).43 Let {X_n}_{n=1}^∞ be a sequence of IID random vectors in ℝ^p with mean μ and covariance matrix Σ. Then √n(X̄_n − μ) →ᴰ N_p(0, Σ), a multivariate normal distribution.

PROOF. Let Y_n = √n(X̄_n − μ) and let Y ~ N_p(0, Σ). Then Y_n →ᴰ Y if and only if the characteristic function of Y_n converges to that of Y; that is, if and only if, for each λ ∈ ℝ^p, E exp(iλ^T Y_n) → E exp(iλ^T Y). This occurs if and only if, for each λ, λ^T Y_n →ᴰ λ^T Y. The distribution of λ^T Y is N(0, λ^T Σ λ), and λ^T Y_n is √n times the average of the λ^T(X_n − μ). By the univariate central limit theorem B.97, λ^T Y_n →ᴰ λ^T Y. □

There are inversion formulas for characteristic functions which allow us to obtain or approximate the original distributions from the characteristic functions.

Example B.100 (Continuation of Example B.92; see page 640). Let X have distribution N(0, σ²). Then ∫ |φ_X(t)| dt < ∞. In fact,

(1/2π) ∫ exp(−ixt) φ_X(t) dt = (1/2π) ∫ exp(−(σ²/2)[t + ix/σ²]² − x²/(2σ²)) dt = (1/(√(2π)σ)) exp(−x²/(2σ²)) = f_X(x).
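The computation in Example B.100 can be checked numerically: truncating the inversion integral and applying trapezoidal quadrature recovers the N(0, σ²) density to high accuracy. The value of σ, the truncation point, and the step count below are illustrative.

```python
import math

SIGMA = 1.3
phi = lambda t: math.exp(-SIGMA ** 2 * t ** 2 / 2)  # characteristic function
f = lambda x: math.exp(-x ** 2 / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))

def invert(x, T=40.0, steps=4000):
    """(1/2pi) * integral over [-T, T] of exp(-ixt) phi(t) dt, trapezoid rule.
    The imaginary part cancels by symmetry, so only the cosine term is kept."""
    h = 2 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.cos(x * t) * phi(t)
    return h * total / (2 * math.pi)

for x in (0.0, 0.5, 2.0):
    print(abs(invert(x) - f(x)) < 1e-6)  # True at each point
```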

Example B.100 says that the following inversion formula applies to normal distributions with 0 mean. It is equally easy to see that it applies to N_k(0, I_k) distributions.44

Lemma B.101 (Continuous inversion formula).45 Let X ∈ ℝ^k have an integrable characteristic function. Then the distribution of X has a bounded density f_X with respect to Lebesgue measure given by

f_X(x) = (2π)^{−k} ∫ exp(−i t^T x) φ_X(t) dt.   (B.102)

43This theorem is used in the proofs of Theorems 7.35 and 7.57.

44We use the symbol I_k to stand for the k × k identity matrix.

45This lemma is used in the proofs of Lemma B.105 and Corollary B.106.

PROOF. Clearly, the function in (B.102) is bounded, since φ_X is integrable. Let Y_σ have a N_k(0, σ²I_k) distribution. The characteristic function of X + Y_σ is φ_X φ_{Y_σ}, and

(2π)^{−k} ∫ exp(−i t^T x) φ_X(t) φ_{Y_σ}(t) dt



= (2π)^{−k} ∫∫ exp(−i t^T x) exp(i t^T z) φ_{Y_σ}(t) dF_X(z) dt = ∫ f_{Y_σ}(x − z) dF_X(z) = f_{X+Y_σ}(x),   (B.103)

where the second equality follows from the fact that (B.102) applies to normal distributions. Now suppose that we let σ go to zero. Since φ_X is integrable and φ_{Y_σ}(t) goes to 1 for all t, it follows that the left-hand side of (B.103) converges to the right-hand side of (B.102). It also follows that f_{X+Y_σ} is bounded uniformly in σ and x. Let B be a hypercube such that the probability is 0 that X is in the boundary of B. Then

r lim fx+Yu (x)dx = lim r fx+Y" (x)dx = { fx(x)dx, } B ,,-0 ,,-oJ B } B

(B. 104)

where the first equality follows from the boundedness of $f_{X+Y_\sigma}$, and the second is proven as follows. The difference between $\int_B f_{X+Y_\sigma}(x)\,dx$ and $\int_B f_X(x)\,dx$ is the sum over the $2^k$ corners of the hypercube $B$ of terms like

$$\sum_{i=1}^{k} \Pr(b_i - Y_{\sigma,i} < X_i \le b_i,\ Y_{\sigma,i} > 0) + \Pr(b_i < X_i \le b_i - Y_{\sigma,i},\ Y_{\sigma,i} < 0),$$

where $b_i$ is the $i$th coordinate of the corner. We can write

$$\Pr(b_i - Y_{\sigma,i} < X_i \le b_i,\ Y_{\sigma,i} > 0) = \int_0^\infty \Pr(b_i - y < X_i \le b_i,\ y > 0)\,dF_{Y_{\sigma,i}}(y).$$

This last expression goes to 0 as $\sigma \to 0$ since $b_i$ is a continuity point for $F_{X_i}$. A similar argument applies to the other probability. The equality of the first and last expressions in (B.104) is what it means to say that $\lim_{\sigma\to 0} f_{X+Y_\sigma}(x)$ is the density of $X$ with respect to Lebesgue measure. This, in turn, equals the right-hand side of (B.102). □
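As a sanity check that is not part of the text, the one-dimensional case of the inversion formula (B.102) can be verified numerically for $X \sim N(0,1)$, whose characteristic function $\varphi_X(t) = \exp(-t^2/2)$ is integrable. The grid spacing and truncation range below are illustrative choices.

```python
import numpy as np

# Numerical sketch of (B.102) in one dimension for X ~ N(0, 1),
# where phi_X(t) = exp(-t^2 / 2). Grid and truncation are
# illustrative; the tails of phi_X beyond |t| = 30 are negligible.
dt = 0.01
t = np.arange(-30.0, 30.0, dt)
phi = np.exp(-t**2 / 2)

def inverted_density(x):
    # (1 / 2pi) * integral of exp(-itx) phi_X(t) dt, by a Riemann sum;
    # the imaginary part cancels by symmetry.
    return (np.exp(-1j * t * x) * phi).sum().real * dt / (2 * np.pi)

xs = np.linspace(-3.0, 3.0, 13)
true_pdf = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
approx = np.array([inverted_density(x) for x in xs])
max_err = np.abs(approx - true_pdf).max()
print(max_err)  # very small
```

The recovered values agree with the $N(0,1)$ density to within numerical integration error, matching the computation in Example B.100.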

Lemma B.105.46 Let $Y$ be a random variable such that $\varphi_Y$ is integrable. Let $X$ be an arbitrary random variable independent of $Y$. For all finite $a < b$ and $c$,

$$\Pr(a < X + cY \le b) = \frac{1}{2\pi}\int \left(\frac{\exp(-ibt) - \exp(-iat)}{-it}\right)\varphi_X(t)\varphi_Y(ct)\,dt.$$

PROOF. Since $\varphi_Y$ is integrable and $\varphi_{X+cY}(t) = \varphi_X(t)\varphi_Y(ct)$, it follows that $X + cY$ has an integrable characteristic function. Lemma B.101 says that (B.102) applies to $X + cY$, hence

$$\Pr(a < X + cY \le b) = \int_a^b f_{X+cY}(x)\,dx = \frac{1}{2\pi}\int_a^b \int \varphi_X(t)\varphi_Y(ct)\exp(-itx)\,dt\,dx$$
$$= \frac{1}{2\pi}\int \varphi_Y(ct)\varphi_X(t) \int_a^b \exp(-itx)\,dx\,dt = \frac{1}{2\pi}\int \varphi_Y(ct)\varphi_X(t)\left(\frac{\exp(-itb) - \exp(-ita)}{-it}\right)dt. \quad □$$

46This lemma is used in the proof of Corollary B.106.

Corollary B.106 (Uniqueness theorem).47 Let $F$ and $G$ be two univariate CDFs such that $\varphi_F = \varphi_G$. Then $F = G$.

PROOF. In the proof of Lemma B.101, we proved that if $Y \sim N(0, 1)$, if $a$ and $b$ are continuity points of $F$, and $X$ has CDF $F$, then $\lim_{c\to 0} \Pr(a < X + cY \le b) = \Pr(a < X \le b)$. The same is true of $G$. Hence, $F = G$ by Lemma B.105. □

An obvious consequence of the uniqueness theorem is the following.

Corollary B.107.48 Suppose that $F$ and $G$ are $k$-dimensional CDFs such that for every bounded continuous $f$, $\int f(x)\,dF(x) = \int f(x)\,dG(x)$. Then $F = G$.

B.5 Stochastic Processes

B.5.1 Introduction

Sometimes we wish to specify a joint distribution for an infinite sequence of random variables. Let $(S, \mathcal{A}, \mu)$ be a probability space. If $X_n : S \to \mathbb{R}$ for every $n$ and each $X_n$ is measurable with respect to the Borel σ-field $\mathcal{B}$, we can define a σ-field of subsets of $\mathbb{R}^\infty$ such that the infinite sequence $X = (X_1, X_2, \ldots)$ is measurable. Let $\mathcal{B}^\infty$ be the smallest σ-field that contains all finite-dimensional orthants, that is, every set $B$ of the form

$$\{x : x_{i_1} \le c_1, \ldots, x_{i_n} \le c_n\},$$

for some $n$, some integers $i_1, \ldots, i_n$, and some numbers $c_1, \ldots, c_n$. It is clear that $X^{-1}(B) \in \mathcal{A}$ since it is the intersection of finitely many sets in $\mathcal{A}$. By Theorem A.34, it follows that $X^{-1}(\mathcal{B}^\infty) \subseteq \mathcal{A}$, so $X$ is measurable with respect to this σ-field.

B.5.2 Martingales+

A particular type of stochastic process that is sometimes of interest is a martingale. [For more discussion of martingales, see Doob (1953), Chapter VII.]

47This corollary is used in the proof of Theorem 2.74.
48This corollary is used in the proof of DeFinetti's representation theorem 1.49.
+This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.


Definition B.108. Let $(S, \mathcal{A}, \mu)$ be a probability space. Let $N$ be a set of consecutive integers. For each $n \in N$, let $\mathcal{F}_n$ be a sub-σ-field of $\mathcal{A}$ such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for all $n$ such that $n$ and $n+1$ are in $N$. Let $\{X_n\}_{n\in N}$ be a sequence of random variables such that $X_n$ is measurable with respect to $\mathcal{F}_n$ for all $n$. The sequence of pairs $\{(X_n, \mathcal{F}_n)\}_{n\in N}$ is called a martingale if, for all $n$ such that $n$ and $n+1$ are in $N$, $E(X_{n+1}\mid\mathcal{F}_n) = X_n$. It is called a submartingale if, for every such $n$, $E(X_{n+1}\mid\mathcal{F}_n) \ge X_n$.

Note that a martingale is also a submartingale.

Example B.109. A simple example of a martingale is the following. Let $N = \{1, 2, \ldots\}$, and let $\{Y_n\}_{n=1}^\infty$ be independent random variables with mean 0. Let $X_n = \sum_{i=1}^n Y_i$. Let $\mathcal{F}_n$ be the σ-field generated by $Y_1, \ldots, Y_n$. Then

$$E(Y_1 + \cdots + Y_{n+1} \mid \mathcal{F}_n) = Y_1 + \cdots + Y_n = X_n,$$

since $E(Y_{n+1}\mid\mathcal{F}_n) = 0$ by independence. If instead each $Y_i$ has nonnegative finite mean, then $E(X_{n+1}\mid\mathcal{F}_n) \ge X_n$, and we have a submartingale.
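The construction in Example B.109 is easy to examine by simulation. The following sketch (our illustration, with arbitrary sizes and seed) uses fair $\pm 1$ steps; the martingale property says the next increment has conditional mean 0, which we check on average after conditioning on part of the past.

```python
import numpy as np

# Simulate many paths of X_n = Y_1 + ... + Y_n with independent
# mean-zero steps Y_i (fair +/-1 coin flips). The increment
# X_11 - X_10 = Y_11 should average to 0 even after conditioning
# on the past, e.g. on the event {X_10 > 0}.
rng = np.random.default_rng(0)
n_paths, n_steps = 100_000, 20
Y = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
X = np.cumsum(Y, axis=1)

overall_mean = X[:, -1].mean()              # E(X_20) = 0
increments = X[:, 10] - X[:, 9]             # the step Y_11
cond_mean = increments[X[:, 9] > 0].mean()  # still ~ 0
print(overall_mean, cond_mean)
```

Both sample averages are near 0, consistent with $E(X_{n+1}\mid\mathcal{F}_n) = X_n$.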

Example B.110. Another example of a martingale is the following. Let $N$ be a collection of consecutive integers, and let $\{\mathcal{F}_n\}_{n\in N}$ be an increasing sequence of σ-fields. Let $X$ be a random variable with $E(|X|) < \infty$. Set $X_n = E(X\mid\mathcal{F}_n)$. By the law of total probability B.70,

$$E(X_{n+1}\mid\mathcal{F}_n) = E[E(X\mid\mathcal{F}_{n+1})\mid\mathcal{F}_n] = E(X\mid\mathcal{F}_n) = X_n,$$

so $\{(X_n, \mathcal{F}_n)\}_{n\in N}$ is a martingale.

Example B.111. If $\{(X_n, \mathcal{F}_n)\}_{n\in N}$ is a martingale, then

$$|X_n| = |E(X_{n+1}\mid\mathcal{F}_n)| \le E(|X_{n+1}| \mid \mathcal{F}_n), \qquad (B.112)$$

hence $\{(|X_n|, \mathcal{F}_n)\}_{n\in N}$ is a submartingale.

The following result is proven using the same argument as in Example B.111.

Proposition B.113.49 If $\{(X_n, \mathcal{F}_n)\}_{n\in N}$ is a martingale, then $E|X_n|$ is nondecreasing in $n$.

The reader should note that if $\{(X_n, \mathcal{F}_n)\}_{n\in N}$ is a submartingale and if $M \subseteq N$ is a string of consecutive integers, then $\{(X_n, \mathcal{F}_n)\}_{n\in M}$ is also a submartingale. Similarly, if $k$ is an integer (positive or negative) and $M = \{n : n + k \in N\}$, then $\{(X'_n, \mathcal{F}'_n)\}_{n\in M}$ is a submartingale, where $X'_n = X_{n+k}$ and $\mathcal{F}'_n = \mathcal{F}_{n+k}$. This latter is just a shifting of the index set.

There are important convergence theorems that apply to many martingales and submartingales. They say that if the set $N$ is infinite, then limit random variables exist. A lemma is needed to prove these theorems.50 It puts a bound on how often a submartingale can cross an interval between two numbers. It is used to show that such crossings cannot occur infinitely often with high probability. (Infinitely many crossings of a nondegenerate interval would imply divergence of the submartingale.)

49This proposition is used in the proof of Theorem B.122.
50This lemma is proven by Doob (1953, Theorem VII, 3.3).


Lemma B.114 (Upcrossing lemma).51 Let $\mathcal{N} = \{1, \ldots, N\}$, and suppose that $\{(X_n, \mathcal{F}_n)\}_{n=1}^N$ is a submartingale. Let $r < q$, and define $V$ to be the number of times that the sequence $X_1, \ldots, X_N$ crosses from below $r$ to above $q$. Then

$$E(V) \le \frac{1}{q-r}\left(E|X_N| + |r|\right). \qquad (B.115)$$

PROOF. Let $Y_n = \max\{0, X_n - r\}$ for every $n$. Since $g(x) = \max\{0, x\}$ is a nondecreasing convex function of $x$, it is easy to see (using Jensen's inequality B.17) that $\{(Y_n, \mathcal{F}_n)\}_{n=1}^N$ is a submartingale. Note that a consecutive set of $X_i(s)$ cross from below $r$ to above $q$ if and only if the corresponding consecutive set of $Y_i(s)$ cross from 0 to above $q - r$. Let $T_0(s) = 0$ and define $T_m$ for $m = 1, 2, \ldots$ as

$$T_m(s) = \begin{cases} \inf\{k \le N : k > T_{m-1}(s),\ Y_k(s) = 0\}, & \text{if } m \text{ is odd}, \\ \inf\{k \le N : k > T_{m-1}(s),\ Y_k(s) \ge q - r\}, & \text{if } m \text{ is even}, \\ N + 1, & \text{if the corresponding set above is empty.} \end{cases}$$

Now $V(s)$ is one-half of the largest even $m$ such that $T_m(s) \le N$. Define, for $i = 1, \ldots, N$,

$$R_i(s) = \begin{cases} 1 & \text{if } T_m(s) < i \le T_{m+1}(s) \text{ for some odd } m, \\ 0 & \text{otherwise.} \end{cases}$$

Then $(q - r)V(s) \le \sum_{i=1}^N R_i(s)(Y_i(s) - Y_{i-1}(s)) = X$, where $Y_0 \equiv 0$ for convenience. First, note that for all $m$ and $i$, $\{T_m(s) \le i\} \in \mathcal{F}_i$. Next, note that for every $i$,

$$\{s : R_i(s) = 1\} = \bigcup_{m \text{ odd}} \left(\{T_m \le i-1\} \cap \{T_{m+1} \le i-1\}^c\right) \in \mathcal{F}_{i-1}. \qquad (B.116)$$

$$E(X) = \sum_{i=1}^N \int_{\{s : R_i(s)=1\}} (Y_i(s) - Y_{i-1}(s))\,d\mu(s) = \sum_{i=1}^N \int_{\{s : R_i(s)=1\}} \left(E(Y_i\mid\mathcal{F}_{i-1})(s) - Y_{i-1}(s)\right)d\mu(s)$$
$$\le \sum_{i=1}^N \int \left(E(Y_i\mid\mathcal{F}_{i-1})(s) - Y_{i-1}(s)\right)d\mu(s) = \sum_{i=1}^N \left(E(Y_i) - E(Y_{i-1})\right) = E(Y_N),$$

where the second equality follows from (B.116) and the inequality follows from the fact that $\{(Y_n, \mathcal{F}_n)\}_{n=1}^N$ is a submartingale. It follows that $(q-r)E(V) \le E(Y_N)$. Since $E(Y_N) \le |r| + E(|X_N|)$, it follows that (B.115) holds. □
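The bound (B.115) can be illustrated by Monte Carlo. The sketch below (our illustration; the walk, interval, and sample sizes are arbitrary choices) counts upcrossings of $[r, q]$ for a $\pm 1$ random walk, which is a martingale and hence a submartingale, and compares the average count with the upcrossing bound.

```python
import numpy as np

# Monte Carlo sketch of the upcrossing bound (B.115) for a +/-1
# random walk. V counts completed crossings from (at or) below r
# to (at or) above q. Parameters are illustrative.
rng = np.random.default_rng(1)
n_paths, N = 20_000, 50
r, q = -1.0, 2.0
X = np.cumsum(rng.choice([-1.0, 1.0], size=(n_paths, N)), axis=1)

def upcrossings(path):
    count, below = 0, False
    for x in path:
        if not below and x <= r:
            below = True
        elif below and x >= q:
            count += 1
            below = False
    return count

V_mean = np.mean([upcrossings(p) for p in X])
bound = (np.abs(X[:, -1]).mean() + abs(r)) / (q - r)
print(V_mean, bound)  # V_mean should not exceed bound
```

The sample mean of $V$ stays below the estimated bound $(E|X_N| + |r|)/(q-r)$, as the lemma requires.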

The proof of the following convergence theorem is adapted from Chow, Robbins, and Siegmund (1971).

51This lemma is used in the proofs of Theorems B.117 and B.122.


Theorem B.117 (Martingale convergence theorem: part I).52 Suppose that $\{(X_n, \mathcal{F}_n)\}_{n=1}^\infty$ is a submartingale such that $\sup_n E|X_n| < \infty$. Then $X = \lim_{n\to\infty} X_n$ exists a.s. and $E|X| < \infty$.

PROOF. Let $X^* = \limsup_{n\to\infty} X_n$ and $X_* = \liminf_{n\to\infty} X_n$. Let $B = \{s : X_*(s) < X^*(s)\}$. We will prove that $\mu(B) = 0$. We can write

$$B = \bigcup_{\substack{r < q \\ r,\, q \text{ rational}}} \{s : X^*(s) \ge q > r \ge X_*(s)\}.$$

Now, $X^*(s) \ge q > r \ge X_*(s)$ if and only if the values of $X_n(s)$ cross from being below $r$ to being above $q$ infinitely often. For fixed $r$ and $q$, we now prove that this has probability 0; hence $\mu(B) = 0$. Let $V_n$ equal the number of times that $X_1, \ldots, X_n$ cross from below $r$ to above $q$. According to Lemma B.114,

$$\sup_n E(V_n) \le \frac{1}{q-r}\left(\sup_n E(|X_n|) + |r|\right) < \infty.$$

The number of times the values of $\{X_n(s)\}_{n=1}^\infty$ cross from below $r$ to above $q$ equals $\lim_{n\to\infty} V_n(s)$. By the monotone convergence theorem A.52,

$$\infty > \sup_n E(V_n) = E\left(\lim_{n\to\infty} V_n\right).$$

It follows that $\mu(\{s : \lim_{n\to\infty} V_n(s) = \infty\}) = 0$. Since $\mu(B) = 0$, we have that $X = \lim_{n\to\infty} X_n$ exists a.s. Fatou's lemma A.50 says $E(|X|) \le \liminf_{n\to\infty} E(|X_n|) \le \sup_n E(|X_n|) < \infty$. □

For the particular martingale in which $X_n = E(X\mid\mathcal{F}_n)$ for a single $X$, we have an expression for the limit.

Theorem B.118 (Lévy's theorem: part I).53 Let $\{\mathcal{F}_n\}_{n=1}^\infty$ be an increasing sequence of σ-fields. Let $\mathcal{F}_\infty$ be the smallest σ-field containing all of the $\mathcal{F}_n$. Let $E(|X|) < \infty$. Define $X_n = E(X\mid\mathcal{F}_n)$ and $X_\infty = E(X\mid\mathcal{F}_\infty)$. Then $\lim_{n\to\infty} X_n = X_\infty$, a.s.

The proof of this theorem requires a lemma that will also be needed later.

Lemma B.119.54 Let $\{\mathcal{F}_n\}_{n=1}^\infty$ be a sequence of σ-fields. Let $E(|X|) < \infty$. Define $X_n = E(X\mid\mathcal{F}_n)$. Then $\{X_n\}_{n=1}^\infty$ is a uniformly integrable sequence.

PROOF. Since $E(X\mid\mathcal{F}_n) = E(X^+\mid\mathcal{F}_n) - E(X^-\mid\mathcal{F}_n)$, and the sum of uniformly integrable sequences is uniformly integrable, we will prove the result for nonnegative $X$. Let $A_{c,n} = \{X_n \ge c\} \in \mathcal{F}_n$, so $\int_{A_{c,n}} X_n(s)\,d\mu(s) = \int_{A_{c,n}} X(s)\,d\mu(s)$. If we can find, for every $\epsilon > 0$, a $C$ such that $\int_{A_{c,n}} X(s)\,d\mu(s) < \epsilon$ for all $n$ and all $c \ge C$, we are done. Define $\eta(A) = \int_A X(s)\,d\mu(s)$. We have $\eta \ll \mu$ and $\eta$ is finite.

52This theorem is used in the proofs of Theorems B.118 and 1.121.
53This theorem is used in the proofs of Theorem 7.78 and Lemma 7.124.
54This lemma is used in the proofs of Theorems B.118, B.122, and B.124. It is borrowed from Billingsley (1986, Lemma 35.2).

By Lemma A.72, we have that for every $\epsilon > 0$ there exists $\delta$ such that $\mu(A) < \delta$ implies $\eta(A) < \epsilon$. By the Markov inequality B.15,

$$\mu(A_{c,n}) \le \frac{1}{c}E(X_n) = \frac{1}{c}E(X),$$

for all $n$. Let $C = 2E(X)/\delta$. Then $c \ge C$ implies $\mu(A_{c,n}) < \delta$ for all $n$, so $\eta(A_{c,n}) < \epsilon$ for all $n$. □

PROOF OF THEOREM B.118. By Lemma B.119, $\{X_n\}_{n=1}^\infty$ is a uniformly integrable sequence. Let $Y$ be the limit of the martingale guaranteed by Theorem B.117. Since $Y$ is a limit of functions of the $X_n$, it is measurable with respect to $\mathcal{F}_\infty$. It follows from Theorem A.60 that for every event $A$, $\lim_{n\to\infty} E(X_n I_A) = E(Y I_A)$. Next, note that, for every $A \in \mathcal{F}_n$,

$$\int_A Y(s)\,d\mu(s) = \lim_{n\to\infty}\int_A E(X\mid\mathcal{F}_n)(s)\,d\mu(s) = \int_A X(s)\,d\mu(s),$$

where the last equality follows from the definition of conditional expectation. Since this is true for every $n$ and every $A \in \mathcal{F}_n$, it is true for all $A$ in the field $\mathcal{F} = \cup_{n=1}^\infty \mathcal{F}_n$. Since $|X|$ is integrable, we can apply Theorem A.26 to conclude that the equality holds for all $A \in \mathcal{F}_\infty$, the smallest σ-field containing $\mathcal{F}$. The equality $E(XI_A) = E(YI_A)$ for all $A \in \mathcal{F}_\infty$, together with the fact that $Y$ is $\mathcal{F}_\infty$ measurable, is precisely what it means to say that $Y = E(X\mid\mathcal{F}_\infty) = X_\infty$. □
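Lévy's theorem has a concrete feel in a simple case. In the sketch below (our illustrative setup, not from the text), $U$ is uniform on $[0,1]$, $X = U^2$, and $\mathcal{F}_n$ is generated by the first $n$ binary digits of $U$; then $X_n = E(X\mid\mathcal{F}_n)$ is the average of $t^2$ over the dyadic interval of length $2^{-n}$ containing $U$, and it converges to $X$.

```python
import numpy as np

# Simulation sketch of Levy's theorem B.118: X_n = E(U^2 | first n
# binary digits of U) is the average of t^2 over the dyadic interval
# [k/2^n, (k+1)/2^n) containing U. It should approach U^2 as n grows.
rng = np.random.default_rng(3)
u = rng.random()
x = u**2

def X_n(n):
    k = int(u * 2**n)                      # index of dyadic interval
    a, b = k / 2**n, (k + 1) / 2**n        # [a, b) contains u
    return (b**3 - a**3) / (3 * (b - a))   # average of t^2 on [a, b)

errors = [abs(X_n(n) - x) for n in (1, 5, 10, 20)]
print(errors)  # shrinking toward 0
```

The error shrinks like the width $2^{-n}$ of the conditioning interval, illustrating $X_n \to X_\infty = X$ here (since $X$ is measurable with respect to $\mathcal{F}_\infty$).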

For negatively indexed martingales, there is also a convergence theorem. Some authors refer to negatively indexed martingales in a different fashion, which is often more convenient.

Definition B.120. Let $(S, \mathcal{A}, \mu)$ be a probability space. For each $n = 1, 2, \ldots$, let $\mathcal{F}_n$ be a sub-σ-field of $\mathcal{A}$ such that $\mathcal{F}_{n+1} \subseteq \mathcal{F}_n$ for all $n$. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables such that $X_n$ is measurable with respect to $\mathcal{F}_n$ for all $n$. The sequence of pairs $\{(X_n, \mathcal{F}_n)\}_{n=1}^\infty$ is called a reversed martingale if, for all $n$, $E(X_n\mid\mathcal{F}_{n+1}) = X_{n+1}$.

Example B.121. As in Example B.110, we can let $\{\mathcal{F}_n\}_{n=1}^\infty$ be a decreasing sequence of σ-fields, and let $E(|X|) < \infty$. Define $X_n = E(X\mid\mathcal{F}_n)$. It follows from the law of total probability B.70 that $\{(X_n, \mathcal{F}_n)\}_{n=1}^\infty$ is a reversed martingale.

The following theorem is proven by Doob (1953, Theorem VII 4.2).

Theorem B.122 (Martingale convergence theorem: part II).55 Suppose that $\{(X_n, \mathcal{F}_n)\}_{n<0}$ is a martingale. Then $X = \lim_{n\to-\infty} X_n$ exists a.s. and has finite mean.

PROOF. Just as in the proof of Theorem B.117, we let $V_n$ be the number of times that the finite sequence $X_n, X_{n+1}, \ldots, X_{-1}$ crosses from below a rational $r$ to above another rational $q$ (for $n < 0$). The upcrossing lemma B.114 says that

$$E(V_n) \le \frac{1}{q-r}\left(E(|X_{-1}|) + |r|\right) < \infty.$$

55This theorem is used in the proof of Theorem B.124.


As in the proof of Theorem B.117, it follows that $X = \lim_{n\to-\infty} X_n$ exists with probability 1. From (B.112) and Lemma B.119, it follows that

$$\lim_{n\to-\infty} E(|X_n|) = E(|X|).$$

By Proposition B.113, it follows that $E(|X|) < \infty$, and so $X$ has finite mean. □

It is usually more convenient to express Theorem B.122 in terms of reversed martingales.

Corollary B.123.56 If $\{(X_n, \mathcal{F}_n)\}_{n=1}^\infty$ is a reversed martingale, then $\lim_{n\to\infty} X_n$ exists a.s. and has finite mean.

There is also a version of Lévy's theorem B.118 for reversed martingales.

Theorem B.124 (Lévy's theorem: part II).57 Let $\{\mathcal{F}_n\}_{n=1}^\infty$ be a decreasing sequence of σ-fields. Let $\mathcal{F}_\infty$ be the intersection $\cap_{n=1}^\infty \mathcal{F}_n$. Let $E(|X|) < \infty$. Define $X_n = E(X\mid\mathcal{F}_n)$ and $X_\infty = E(X\mid\mathcal{F}_\infty)$. Then $\lim_{n\to\infty} X_n = X_\infty$ a.s.

PROOF. It is easy to see that $\{(X_n, \mathcal{F}_n)\}_{n=1}^\infty$ is a reversed martingale and that $E(|X_1|) < \infty$. By Corollary B.123, it follows that $\lim_{n\to\infty} X_n = Y$ exists and is finite a.s. To prove that $Y = X_\infty$ a.s., note that $X_\infty = E(X_1\mid\mathcal{F}_\infty)$ since $\mathcal{F}_\infty \subseteq \mathcal{F}_1$. So, we must show that $Y = E(X_1\mid\mathcal{F}_\infty)$. Let $A \in \mathcal{F}_\infty$. Then

$$\int_A X_n(s)\,d\mu(s) = \int_A X_1(s)\,d\mu(s),$$

since $A \in \mathcal{F}_n$ and $X_n = E(X_1\mid\mathcal{F}_n)$. Once again, using (B.112) and Lemma B.119, it follows that $\int_A Y(s)\,d\mu(s) = \int_A X_1(s)\,d\mu(s)$; hence $Y = E(X_1\mid\mathcal{F}_\infty)$. □

B.5.3 Markov Chains*

Another type of stochastic process we will occasionally meet is a Markov chain.58

Definition B.125. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables taking values in a space $\mathcal{X}$ with σ-field $\mathcal{B}$. The sequence is called a Markov chain (with stationary transition distributions)59 if there exists a function $p : \mathcal{B} \times \mathcal{X} \to [0, 1]$ such that

• for all $x \in \mathcal{X}$, $p(\cdot, x)$ is a probability measure on $\mathcal{B}$;

• for all $B \in \mathcal{B}$, $p(B, \cdot)$ is $\mathcal{B}$ measurable;

56This corollary is used in the proof of Theorem B.124.
57This theorem is used in the proofs of Theorem 1.62, Corollary 1.63, Lemma 2.121, and Lemma 7.83.
*This section may be skipped without interrupting the flow of ideas.
58In this text, we only use Markov chains as occasional examples of sequences of random variables that are not exchangeable.
59There are more general definitions of Markov chains and Markov processes in which the transition distribution from $X_n$ to $X_{n+1}$ is allowed to depend on $n$. We will not need these more general processes in this book.

Page 82: link.springer.com978-1-4612-4250-5/1.pdfApPENDIX A Measure and Integration Theory This appendix contains an introduction to the theory of measure and integration. The first section

B.5. Stochastic Processes 651

• for each $n$ and each $B \in \mathcal{B}$, $p(B, x) = \Pr(X_{n+1} \in B \mid X_1 = x_1, X_2 = x_2, \ldots, X_{n-1} = x_{n-1}, X_n = x)$, almost surely with respect to the joint distribution of $(X_1, \ldots, X_n)$.

The last condition in the definition of a Markov chain says that the conditional distribution of $X_{n+1}$ given the past depends only on the most recent past $X_n$. In other words, $X_{n+1}$ is conditionally independent of $X_1, \ldots, X_{n-1}$ given $X_n$.

Example B.126. A sequence $\{X_n\}_{n=1}^\infty$ of IID random variables is a Markov chain with $p(B, x) = \Pr(X_i \in B)$ for all $x$.

Example B.127. Let $\{X_n\}_{n=1}^\infty$ be Bernoulli random variables such that

$$\Pr(X_{n+1} = 1 \mid X_1 = x_1, \ldots, X_n = x_n) = p_{x_n,1}, \quad \text{for } x_n \in \{0, 1\}.$$

The entire joint distribution of the sequence is determined by the numbers $p_{0,1}$, $p_{1,1}$, and $\Pr(X_1 = 1)$.
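The two-state chain of Example B.127 is easy to simulate. In the sketch below (our illustration; the transition probabilities, initial state, and chain length are arbitrary choices), the long-run fraction of time spent in state 1 approaches the stationary probability $p_{0,1}/(p_{0,1} + 1 - p_{1,1})$.

```python
import numpy as np

# Simulate the two-state Bernoulli chain of Example B.127, driven by
# p01 = Pr(X_{n+1} = 1 | X_n = 0) and p11 = Pr(X_{n+1} = 1 | X_n = 1).
# Numbers are illustrative only.
rng = np.random.default_rng(2)
p01, p11 = 0.3, 0.8
n_steps = 200_000
state = 1                          # taking Pr(X_1 = 1) = 1
visits = np.empty(n_steps, dtype=np.int64)
for i in range(n_steps):
    p = p11 if state == 1 else p01
    state = int(rng.random() < p)  # next state given current one
    visits[i] = state

stationary = p01 / (p01 + 1 - p11)  # 0.6 with these numbers
print(visits.mean(), stationary)
```

The empirical fraction of visits to state 1 is close to the stationary value, even though the chain is far from exchangeable.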

B.5.4 General Stochastic Processes

Occasionally, we will have to deal with more complicated stochastic processes. What makes them more complicated is that they consist of more than countably many random quantities.

Example B.128. Let $\mathcal{F}$ be a set of real-valued functions of a real vector. That is, there exists $k$ such that $F \in \mathcal{F}$ means $F : \mathbb{R}^k \to \mathbb{R}$. Suppose that $X : S \to \mathcal{F}$ is a random quantity whose values are functions themselves. We would like to be able to discuss the distribution of $X$. We will need a σ-field of subsets of $\mathcal{F}$ in order to discuss measurability. A natural σ-field is the smallest σ-field that contains all sets of the form $A_{t,x} = \{F \in \mathcal{F} : F(t) \le x\}$, for all $t \in \mathbb{R}^k$ and all $x \in \mathbb{R}$. It can be shown (see below) that $X$ is measurable with respect to this σ-field if, for every $t \in \mathbb{R}^k$, the real-valued function $G_t : S \to \mathbb{R}$ is Borel measurable, where $G_t(s) = F(t)$ when $X(s) = F$.

A general stochastic process can be defined, and it resembles the above example in all important aspects.

Definition B.129. Let $(S, \mathcal{A}, \mu)$ be a probability space, and let $R$ be some set. For each $r \in R$, let $(\mathcal{X}_r, \mathcal{B}_r)$ be a Borel space, and let $X_r : S \to \mathcal{X}_r$ be measurable. The collection of random variables $X = \{X_r : r \in R\}$ is called a stochastic process.

Example B.130. If every $(\mathcal{X}_r, \mathcal{B}_r)$ is the same space $(\mathcal{X}, \mathcal{B})$, then $X$ can be thought of as a "random function" from $R$ to $\mathcal{X}$ as follows. For each $s \in S$, define the function $F_s : R \to \mathcal{X}$ by $F_s(r) = X_r(s)$. In order to make this a true random function, we need a σ-field on the set of functions from $R$ to $\mathcal{X}$. Since this set of functions is the product set $\mathcal{X}^R$, a natural σ-field is the product σ-field $\mathcal{B}^R$. The product σ-field is easily seen to be the smallest σ-field containing all sets of the form $A_{r,B} = \{F : F(r) \in B\}$, for $r \in R$ and $B \in \mathcal{B}$. Now, let $F : S \to \mathcal{X}^R$ be defined by $F(s) = F_s$. Then $F$ is measurable because

$$F^{-1}(A_{r,B}) = \{s : F_s(r) \in B\} = \{s : X_r(s) \in B\} \in \mathcal{A},$$

because $X_r$ is measurable.


The important theorem about stochastic processes is that their distribution is determined by the joint distributions of all finite collections of the $X_r$.

Theorem B.131.60 Let $R$ be a set and, for each $r \in R$, let $(\mathcal{X}_r, \mathcal{B}_r)$ be a Borel space. Let $X = \{X_r : r \in R\}$ and $X' = \{X'_r : r \in R\}$ be two stochastic processes. Suppose that for every $k$ and every $k$-tuple $(r_1, \ldots, r_k) \in R^k$, the joint distribution of $(X_{r_1}, \ldots, X_{r_k})$ is the same as that of $(X'_{r_1}, \ldots, X'_{r_k})$. Then the distribution of $X$ is the same as that of $X'$.

PROOF. Define $\mathcal{X} = \prod_{r\in R} \mathcal{X}_r$ and let $\mathcal{B}$ be the product σ-field. Say that a set $C \in \mathcal{B}$ is a finite-dimensional cylinder set if there exist $k$ and $r_1, \ldots, r_k \in R$ and a measurable $D \subseteq \prod_{i=1}^k \mathcal{X}_{r_i}$ such that

$$C = \{x \in \mathcal{X} : (x_{r_1}, \ldots, x_{r_k}) \in D\}.$$

It is easy to see that if $\{r_1, \ldots, r_k\} \subseteq \{s_1, \ldots, s_m\}$ for $m \ge k$, then there exists a measurable subset $D'$ of $\prod_{j=1}^m \mathcal{X}_{s_j}$ such that

$$C = \{x \in \mathcal{X} : (x_{s_1}, \ldots, x_{s_m}) \in D'\},$$

by taking the Cartesian product of $D$ times the product of those $\mathcal{X}_r$ for $r \in \{s_1, \ldots, s_m\} \setminus \{r_1, \ldots, r_k\}$ and then possibly rearranging the coordinates of all points in this set to match the order of $r_1, \ldots, r_k$ among $s_1, \ldots, s_m$. So, if $C$ and $G$ are both finite-dimensional cylinder sets with

$$G = \{x \in \mathcal{X} : (x_{h_1}, \ldots, x_{h_\ell}) \in E\},$$

then we can let $\{t_1, \ldots, t_m\} = \{r_1, \ldots, r_k\} \cup \{h_1, \ldots, h_\ell\}$ and write

$$C = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in D'\}, \qquad G = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in G'\}.$$

It follows that

$$C \cap G = \{x \in \mathcal{X} : (x_{t_1}, \ldots, x_{t_m}) \in D' \cap G'\}.$$

So the finite-dimensional cylinder sets form a π-system. By assumption, the distributions of $X$ and $X'$ agree on this π-system. Since $\mathcal{X} = \{x \in \mathcal{X} : x_r \in \mathcal{X}_r\}$ for arbitrary $r \in R$ and since the distributions of $X$ and $X'$ are finite measures, we can apply Theorem A.26 to conclude that the distributions are the same. □

Another important fact about general stochastic processes is that it is possible to specify a joint distribution for the entire process by merely specifying all of the finite-dimensional joint distributions, so long as they obey a consistency condition.

Definition B.132. Let $\mathcal{X} = \prod_{r\in R} \mathcal{X}_r$ with the product σ-field, where $(\mathcal{X}_r, \mathcal{B}_r)$ is a Borel space for every $r$. For each finite $k$ and each $k$-tuple $(i_1, \ldots, i_k)$ of distinct elements of $R$, let $P_{i_1,\ldots,i_k}$ be a probability measure on $\prod_{j=1}^k \mathcal{X}_{i_j}$. We say that these probabilities are consistent if the following conditions hold for each $k$ and distinct $i_1, \ldots, i_k \in R$ and each $A$ in the product σ-field of $\prod_{j=1}^k \mathcal{X}_{i_j}$:

60This theorem is used in the proofs of Theorem B.133 and DeFinetti's representation theorem 1.49.


• For each permutation $\pi$ of $k$ items, $P_{i_1,\ldots,i_k}(A) = P_{i_{\pi(1)},\ldots,i_{\pi(k)}}(B)$, where

$$B = \{(x_{\pi(1)}, \ldots, x_{\pi(k)}) : (x_1, \ldots, x_k) \in A\}.$$

• For each $\ell \in R \setminus \{i_1, \ldots, i_k\}$, $P_{i_1,\ldots,i_k}(A) = P_{i_1,\ldots,i_k,\ell}(B)$, where

$$B = \{(x_1, \ldots, x_k, x_{k+1}) : (x_1, \ldots, x_k) \in A,\ x_{k+1} \in \mathcal{X}_\ell\}.$$

Since the set R may not be ordered, the first condition ensures that it does not matter in what order one writes a finite set of indices. The second condition is the substantive one, and it ensures that the marginal distributions of subsets of coordinates are the probability measures associated with those subsets.

To avoid excessive notation, it will be convenient to refer to $P_J$ as the probability measure associated with a finite subset $J \subseteq R$ without specifying the order of the elements of $J$. When the consistency conditions in Definition B.132 hold, this should not cause any confusion.

The proof of the following theorem is adapted from Loève (1977, pp. 94-95). The theorem says that consistent finite-dimensional distributions determine a unique joint distribution on the product space.

Theorem B.133.61 Let $\mathcal{X} = \prod_{r\in R} \mathcal{X}_r$ with the product σ-field, where $\mathcal{X}_r$ is a Borel space for every $r$. For each finite subset $J \subseteq R$, let $P_J$ be a probability measure on $\prod_{r\in J} \mathcal{X}_r$. Suppose that the $P_J$ are consistent as defined in Definition B.132. Then there exists a unique distribution on $\mathcal{X}$ with finite-dimensional marginals given by the $P_J$.

PROOF. The uniqueness follows from Theorem B.131, if we can prove existence. First, suppose that $\mathcal{X}_r = \mathbb{R}$ for all $r$. Let $\mathcal{C}$ be the class of all unions of finitely many finite-dimensional cylinder sets of the form $C = \prod_{r\in R} C_r$, where all but finitely many of the $C_r$ equal $\mathbb{R}$ and the others are unions of finitely many intervals. The class $\mathcal{C}$ is a field. For $C$ of the above form, define $P(C) = P_J(\prod_{r\in J} C_r)$, where $J$ is the finite set of $r$ such that $C_r \ne \mathbb{R}$. The consistency assumption implies that $P$ can be uniquely extended to a finitely additive probability on $\mathcal{C}$. To show that $P$ is countably additive, we will show that if $\{A_n\}_{n=1}^\infty$ is a decreasing sequence of elements of $\mathcal{C}$ such that $P(A_n) > \epsilon$ for all $n$, then $A = \cap_{n=1}^\infty A_n$ is nonempty. Suppose that $P(A_n) > \epsilon$ for all $n$. Let $J_n$ be the set of all subscripts involved in $A_1, \ldots, A_n$ and $J$ be the union of these sets. Let $A_n = B_n \times \prod_{r\notin J_n} \mathcal{X}_r$. Then $P(A_n) = P_{J_n}(B_n)$, and $B_n$ is the union of finitely many products of intervals. For each product of intervals $H$ that constitutes $B_n$, we can find a product of bounded closed intervals contained in $H$ such that the $P_{J_n}$ probability of the union of these is as close as we wish to $P_{J_n}(B_n)$. Let $C_n$ be a finite union of products of closed bounded intervals contained in $B_n$ such that $P_{J_n}(B_n \setminus C_n) < \epsilon/2^{n+1}$. If $D_n$ is the cylinder set corresponding to $C_n$, then

$$P_{J_n}(A_n \setminus D_n) = P_{J_n}(B_n \setminus C_n) < \frac{\epsilon}{2^{n+1}}.$$

Now, let $E_n = A_n \cap_{i=1}^n D_i$, so that $P(A_n \setminus E_n) < \epsilon/2$. It follows that $P(E_n) > \epsilon/2$, so each $E_n$ is nonempty. Let $x^n = (x_1^n, x_2^n, \ldots) \in E_n$. Since $E_1 \supseteq E_2 \supseteq \cdots$, it follows that for every $k \ge 0$, $x^{n+k} \in E_n \subseteq D_n$. Hence $(x_i^{n+k} ; i \in J_n) \in C_n$. Since

61This theorem is used in the proof of Lemma 2.123.


each $C_n$ is bounded, there is a subsequence of $\{(x_i^n ; i \in J_1)\}_{n=1}^\infty$ that converges to a point $(x_i ; i \in J_1) \in C_1$. Let the subsequence be $\{(x_i^{n_m} ; i \in J_1)\}_{m=1}^\infty$. Then there is a subsequence of $\{(x_i^{n_m} ; i \in J_2)\}_{m=1}^\infty$ that converges to a point $(x_i ; i \in J_2) \in C_2$. Continue extracting subsequences to get a limit point $x_J = (x_i ; i \in J)$ lying in $D_n$ for all $n$. Hence, every point that extends $x_J$ to an element of $\mathcal{X}$ is in $A_n$ for all $n$, and $A$ is nonempty. Now apply the Carathéodory extension theorem A.22 to extend $P$ to the entire product σ-field.

For general Borel spaces, let $\varphi_r : \mathcal{X}_r \to F_r$ be a bimeasurable mapping to a Borel subset of $\mathbb{R}$ for each $r$. It follows easily by using Theorem A.34 that the function $\varphi : \mathcal{X} \to \prod_{r\in R} F_r$ is bimeasurable, where $\varphi(x) = (\varphi_r(x_r) ; r \in R)$. For each finite subset $J$, $\varphi$ induces a probability on $\prod_{i\in J} \mathbb{R}$ from $P_J$, and these are clearly consistent. By what we have already proven, there is a probability $P$ on $\prod_{r\in R} \mathbb{R}$ with the desired marginals. Then $\varphi^{-1}$ induces a probability on $\mathcal{X}$ from $P$ with the desired marginals. □

B.6 Subjective Probability

It is not obvious for what purpose a mathematical probability, as described in this chapter and defined in Definition A.18, would ever be useful. In this section, we try to show how the mathematical definition of probability is just what one would want to use to describe one's uncertainty about unknown quantities if one were forced to gamble on the outcomes of those unknown quantities.62

DeFinetti (1974) suggests that probability be defined in terms of those gambles an agent is willing to accept. Others, like DeGroot (1970), would only require that probabilities be subjective degrees of belief. Either way, we might ask, "Why should degrees of belief or gambling behavior satisfy the measure theoretic definition of probability?" In this section, we will try to motivate the measure theoretic definition of probability by considering gambling behavior. We begin by adopting the viewpoint of DeFinetti (1974).63

For the purposes of this discussion, let a random variable be any number about which we are uncertain. For each bounded random variable $X$, assume that there is some fair price $p$ such that an agent is indifferent between all gambles that pay $c(X - p)$, where $c$ is in some sufficiently small symmetric interval around 0 such that the maximum loss is still within the means of the agent to pay. For example, suppose that $X = x$ is observed. If $c(x - p) > 0$, then the agent would receive this amount. If $c(x - p) < 0$, then the agent would lose $-c(x - p)$. It must be that $-c(x - p)$ is small enough for the agent to be able to pay. Surely, for $X$ in a

62In Section 3.3, we give a much more elaborate motivation for the entire apparatus of Bayesian decision theory, which includes mathematical probability as one of its components. An alternative derivation of mathematical probability from operational considerations is given in Chapter 6 of DeGroot (1970).
63There are a few major differences between the approach in this section and DeFinetti's approach, which DeFinetti, were he alive, would be quick to point out. Out of respect for his memory and his followers, we will also try to point out these differences as we encounter them.


bounded set, $c$ can be made small enough for this to hold, so long as the agent has some funds available.

Definition B.134. The fair price $p$ of a random quantity is called its prevision and is denoted $P(X)$. It is assumed, for a bounded random quantity $X$, that the agent is indifferent between all gambles whose net gain (loss if negative) to the agent is $c(X - P(X))$ for all $c$ in some symmetric interval around 0.

The symmetric interval around 0 mentioned in the definition of prevision may be different for different random variables. For example, it might stand to reason that the interval corresponding to the random variable 2X would be half as wide as the interval corresponding to X.

Another assumption we make is that if an agent is willing to accept each of a countable collection of gambles, then the agent is willing to accept all of them at once, so long as the maximum possible loss is small enough for the agent to pay.64 A famous example of countably many gambles, each of which is acceptable on its own but which cannot all be accepted together, is the St. Petersburg paradox.

Example B.135. Suppose that a fair coin is tossed until the first head appears. Let $N$ be the number of tosses until the first head appears. For $n = 1, 2, \ldots$, define

$$X_n = \begin{cases} 2^n & \text{if } N = n, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that our agent says that $P(X_n) = 1$ for all $n$. For each $n$, there is $c_n < 0$ such that the agent is willing to accept $c_n(X_n - 1)$. If $-\sum_{n=1}^\infty c_n 2^n$ is too big, however, the agent cannot accept all of the gambles at once. Similarly, there are $c_n > 0$ such that the agent is willing to accept $c_n(X_n - 1)$. If $\sum_{n=1}^\infty c_n$ is too big, the agent cannot accept all of these gambles. The St. Petersburg paradox corresponds to the case in which $c_n = 1$ for all $n$. In this case, the agent pays $\infty$ and only receives $2^N$ in return. We have ruled out this possibility by requiring that the agent be able to afford the worst possible loss.
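The arithmetic behind Example B.135 can be spelled out in a few lines (our illustration; the truncation points are arbitrary): each gamble alone has fair price $E(X_n) = 2^n \cdot 2^{-n} = 1$, yet the stakes accumulate without bound, and the combined payoff $2^N$ truncated at $n$ tosses has expected value exactly $n$.

```python
# Each X_n pays 2^n with probability 2^-n, so each gamble alone has
# fair price E(X_n) = 1, but accepting all of them with c_n = 1
# means staking sum_n 1, which diverges. Truncating the combined
# St. Petersburg payoff 2^N at n tosses gives expected payoff n.
fair_prices = [2**n * 0.5**n for n in range(1, 31)]
truncated_payoffs = [sum(2**k * 0.5**k for k in range(1, n + 1))
                     for n in (10, 20, 30)]
print(fair_prices[:3], truncated_payoffs)  # all 1.0; [10.0, 20.0, 30.0]
```

The divergence of the total stake is exactly why the example is excluded by requiring that the agent be able to afford the worst possible loss.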

The following example illustrates how it is possible to accept infinitely many gambles at once.

Example B.136. Suppose that a random quantity $X$ could possibly be any one of the positive integers. For each positive integer $x$, let

$$I_x = \begin{cases} 1 & \text{if } X = x, \\ 0 & \text{if not.} \end{cases}$$

Suppose that our agent is indifferent between all gambles of the form $c(I_x - 2^{-x})$ for all $-1 \le c \le 1$ and all integers $x$. Then, we assume that the agent is also indifferent between all gambles of the form $\sum_{x=1}^\infty c_x(I_x - 2^{-x})$, so long as $-1 \le c_x \le 1$ for all $x$. (Note that the largest possible loss is no more than 1.) Let $Y = \sum_{x=1}^\infty c_x I_x$ with $-1 \le c_x \le 1$ for all $x$. Note that $Y$ is a bounded random quantity, and that the agent has implicitly agreed to accept all gambles of the form $c(Y - \mu)$ for $-1 \le c \le 1$, where $\mu = \sum_{x=1}^\infty c_x 2^{-x}$. If the agent were foolish enough to be indifferent between all gambles of the form $d(Y - p)$ for $-a \le d \le a$ where $p \ne \mu$, then a clever opponent could make money with no risk. For example, if $p > \mu$, let $f = \min\{1, a\}$. The opponent would ask the agent to accept the gamble $f(Y - p)$ as well as the gambles $-f c_x(I_x - 2^{-x})$ for $x = 1, 2, \ldots$. The net effect to the agent of these gambles is $-f(p - \mu) < 0$, no matter what value $X$ takes! A similar situation arises if $p < \mu$. Only $p = \mu$ protects the agent from this sort of problem, which is known as Dutch book.

64DeFinetti would not require an agent to accept countably many gambles at once, but rather only finitely many. We introduce this stronger requirement to avoid mathematical problems that arise when the weaker assumption holds but the stronger one does not. Schervish, Seidenfeld, and Kadane (1984) describe one such problem in detail.
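The Dutch book in Example B.136 can be checked numerically. In the sketch below (our illustration), the $c_x$ are nonzero only for $x \le 10$ so that all sums are finite, and the quoted price $p$, the margin $0.05$, and $f = 1$ are arbitrary choices; the point is that the agent's net gain is the same negative number regardless of the value of $X$.

```python
# Numeric sketch of the Dutch book argument: the agent accepts
# f(Y - p) with p > mu, plus the gambles -f c_x (I_x - 2^{-x}).
# The combined payoff is f(mu - p), a constant loss.
K, f = 10, 1.0
c = {x: 1.0 for x in range(1, K + 1)}          # c_x in [-1, 1]
mu = sum(c[x] * 2**-x for x in c)              # coherent price of Y
p = mu + 0.05                                  # incoherent quote

def net_gain(x_value):
    Y = c.get(x_value, 0.0)                    # Y = sum_x c_x I_x
    opponent_side = -f * sum(c[x] * ((x == x_value) - 2**-x) for x in c)
    return f * (Y - p) + opponent_side

gains = [net_gain(x) for x in range(1, 3 * K)]
print(sorted(set(round(g, 12) for g in gains)))  # one value: -0.05
```

Every possible outcome yields the same loss $-f(p - \mu)$, which is exactly what coherence is designed to rule out.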

To avoid Dutch book, we introduce the following definition.

Definition B.137. Let {X_α : α ∈ A} be a collection of bounded random variables. Suppose that, for each α, an agent gives a prevision P(X_α) and is indifferent between all gambles of the form c(X_α − P(X_α)) for −d_α ≤ c ≤ d_α with d_α = min{ M/[max X_α − P(X_α)], M/[P(X_α) − min X_α] } for some M > 0. These previsions are coherent if there exist no countable subset B ⊆ A and {c_b : −d_b ≤ c_b ≤ d_b, for all b ∈ B} such that −M ≤ Σ_{b∈B} c_b(X_b − P(X_b)) < 0 under all circumstances.65 If a collection of previsions is not coherent, we say that it is incoherent.

The value M is the maximum amount the agent is willing to lose. Coherence of a sufficiently rich collection of previsions is equivalent to a probability assignment.

Theorem B.138.66 Let (S, A) be a measurable space. Suppose that, for each C ∈ A, the agent assigns a prevision P(I_C), where I_C is the indicator of C. Define μ : A → ℝ by μ(C) = P(I_C). Then the previsions are coherent if and only if μ is a probability on (S, A).

PROOF. Without loss of generality, suppose that the agent is indifferent between all gambles of the form c(I_C − P(I_C)), for all −1 ≤ c ≤ 1. For the "if" part, assume that μ is a probability. Let {C_n}_{n=1}^∞ ⊆ A and c_n ∈ [−1, 1] be such that, with

X = Σ_{n=1}^∞ c_n(I_{C_n} − μ(C_n)),

the maximum losses from X and from −X are small enough for the agent to afford. Since this makes X bounded, it follows from Fubini's theorem A.70 that E(X) = 0; hence it is impossible that X < 0 under all circumstances, and the previsions are coherent.

For the "only if" part, assume that the previsions are coherent. Clearly, μ(∅) = 0, since I_∅ = 0 and −cμ(∅) ≥ 0 for both positive and negative c. It is also easy to see that μ(A) ≥ 0 for all A: if μ(A) < 0, then for all negative c, c(I_A − μ(A)) < 0 and we have incoherence. Countable additivity follows in a similar fashion. Let {A_n}_{n=1}^∞ be mutually disjoint, and let A = ∪_{n=1}^∞ A_n. If μ(A) < Σ_{n=1}^∞ μ(A_n),

65When only finitely many gambles are required to be combined at once, as by DeFinetti (1974), incoherence requires that the sum be strictly less than some negative number under all circumstances. That is, DeFinetti would allow a strictly negative gamble to be called coherent, so long as the least upper bound was 0.

66This theorem is used in the proof of Theorem B.139.


then the following gamble is always negative:

Σ_{n=1}^∞ (I_{A_n} − μ(A_n)) − (I_A − μ(A)).

If μ(A) > Σ_{n=1}^∞ μ(A_n), then the negative of the above gamble is always negative. Either way there is incoherence. □

Theorem B.138 says that if an agent insists on dealing with a σ-field of subsets of some set S, then expressing coherent previsions for gambles on events is equivalent to choosing probabilities.67 Similar claims can be made about bounded random variables.

Theorem B.139. Let C be the collection of all bounded measurable functions from a measurable space (S, A) to ℝ. Suppose that, for each X ∈ C, an agent assigns a prevision P(X). The previsions are coherent if and only if there exists a probability μ on (S, A) such that P(X) = E(X) for all X ∈ C.

PROOF. Suppose that the agent is indifferent between all gambles of the form c(X − P(X)) for −d_X ≤ c ≤ d_X. For the "if" direction, the proof is virtually identical to the corresponding part of the proof of Theorem B.138. For the "only if" part, note that I_A ∈ C for every A ∈ A. It follows from Theorem B.138 that a probability μ exists such that μ(A) = P(I_A) for all A ∈ A. Hence P(X) = E(X) for all simple functions X. Let X ≥ 0 and let X_1 ≤ X_2 ≤ ... be simple functions less than or equal to X such that lim_{n→∞} X_n = X. Setting X_0 = 0, we have X = Σ_{n=0}^∞ (X_{n+1} − X_n), so

P(X) = Σ_{n=0}^∞ P(X_{n+1} − X_n) = lim_{n→∞} E(X_{n+1}) = E(X),

from coherence and the monotone convergence theorem A.52. For general X, let X⁺ and X⁻ be, respectively, the positive and negative parts of X. Since P(X) = P(X⁺) − P(X⁻) follows easily from coherence, the proof is complete. □

We conclude this "motivation" of probability theory from gambling considerations by trying to motivate conditional probability. Suppose that, in addition to assigning previsions to gambles involving arbitrary bounded random variables, the agent is also required to assign conditional previsions in the following way. Let C be a sub-σ-field of A, and suppose that gambles of the form cI_A(X − p), for all nonempty A ∈ C, are being considered.68 The fair price would be that value of p, denoted P(X|A), such that the agent was indifferent between all gambles of the form cI_A(X − P(X|A)) for all c in some symmetric interval around 0. Rather than choose a different P(X|A) for each A, the agent has the option of choosing a single function Q : S → ℝ such that Q is measurable with respect to the σ-field C. The conditional gambles would then be cI_A(X − Q).

Example B.140. For the simple case in which C = {∅, A, A^c, S}, Q is measurable if and only if it takes on only two values, one on A and the other on A^c. In

67In the theory of DeFinetti (1974), one obtains finitely additive probabilities without assuming that probabilities have been assigned to all elements of a σ-field.

68DeFinetti (1974) would only require that such conditional gambles be considered one at a time rather than a σ-field at a time.


this case, there are only two sets of conditional gambles (other than the "unconditional" gambles c[X − P(X)]), namely cI_A(X − P(X|A)) and cI_{A^c}(X − P(X|A^c)). Here, Q = P(X|A)I_A + P(X|A^c)I_{A^c}. Note that the previsions P(X|A) and P(I_A) = μ(A) are already expressed. It is easy to see that

cI_A(X − P(X|A)) = c(XI_A − E(XI_A)) − cP(X|A)(I_A − μ(A)) + c[E(XI_A) − P(X|A)μ(A)].

Clearly, the only coherent choices of P(X|A) satisfy P(X|A)μ(A) = E(XI_A). If μ(A) > 0, then P(X|A) = E(XI_A)/μ(A), the usual conditional mean of X given A. Similarly, P(X|A^c)μ(A^c) = E(XI_{A^c}) must hold.

The general situation is not much different from Example B.140.

Theorem B.141. Suppose that an agent must choose a function Q that is measurable with respect to a sub-σ-field C so that for each nonempty A ∈ C, he or she is indifferent between all gambles of the form cI_A(X − Q). The choice of Q is coherent if and only if E(QI_A) = E(XI_A) for all A ∈ C.

PROOF. As in Example B.140, note that

cI_A(X − Q) = c(XI_A − E(XI_A)) − c(QI_A − E(QI_A)) + c[E(XI_A) − E(QI_A)].

The choice of Q can be coherent if and only if E(QI_A) = E(XI_A). □

The reader should note the similarity between the conditions in Theorem B.141 and Definition B.23. The function Q must be a version of the conditional mean of X given C.

Example B.142. Let (X, Y) be random variables with a traditional joint density f_{X,Y} with respect to Lebesgue measure. That is, for all Borel sets C ⊆ ℝ²,

Pr((X, Y) ∈ C) = ∫_C f_{X,Y}(x, y) dx dy,

and for all bounded measurable functions g : ℝ² → ℝ,

E(g(X, Y)) = ∫ g(x, y) f_{X,Y}(x, y) dx dy.   (B.143)

Let C be the σ-field generated by Y. That is, C = {Y^{−1}(A) : A ∈ B}, where B is the Borel σ-field of subsets of ℝ. It is straightforward to check that for all A ∈ C, E(XI_A) = E(QI_A), where Q(s) = h(Y(s)),

h(y) = ∫ x f_{X,Y}(x, y) dx / f_Y(y),

and f_Y(y) = ∫ f_{X,Y}(x, y) dx is the usual marginal density of Y. (Just apply (B.143) with g(x, y) = xI_C(y) and with g(x, y) = h(y)I_C(y), where A = Y^{−1}(C).)

What we have done in this section is give a motivation for the use of the mathematical probability calculus to express uncertainty for the purposes of gambling. We assume that an agent chooses which gambles to accept in such a way that he or she is not subject to Dutch book, which is a combination of acceptable gambles that produces a loss no matter what happens. We were also able to use this approach to motivate the mathematical definition of conditional expectation by introducing conditional gambles and requiring that the same coherence condition apply to conditional and unconditional gambles alike.


B.7 Simulation*

Several times in this text, we will want to generate observations that have a desired distribution. Such observations will be called pseudorandom numbers because samples appear to have the properties of random variables, but they are actually generated by a complicated deterministic process. We will not go into detail on how pseudorandom numbers with uniform U(0,1) distribution are generated. In this section, we wish to prove a couple of useful theorems about how to generate pseudorandom numbers with other distributions under the assumption that pseudorandom numbers with U(0,1) distribution can be generated.

Theorem B.144. Let F be a CDF and define the inverse of F by

F^{−1}(q) = { inf{x : F(x) ≥ q} if q > 0, sup{x : F(x) = 0} if q = 0.

If U has U(0,1) distribution, then X = F^{−1}(U) has CDF F.

PROOF. We will calculate Pr(X ≤ t) for all t. First, let t be a continuity point of F. Then

Pr(X ≤ t) = Pr(F^{−1}(U) ≤ t) = Pr(U ≤ F(t)) = F(t),

where the second equality follows from the fact that, at a continuity point t, X ≤ t if and only if U ≤ F(t), and the third equality follows from the fact that U has U(0,1) distribution. Finally, let t be a jump point of F and let F(t) − lim_{x↑t} F(x) = c. Then X = t if and only if F(t) − c < U ≤ F(t), so

Pr(X = t) = Pr(F(t) − c < U ≤ F(t)) = c.

So, X has CDF F at continuity points of F and its distribution has the same sized jumps as F at the same points. So the CDF of X is F. □
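As a sketch of how Theorem B.144 is used in practice, take F to be the Exp(1) CDF, F(x) = 1 − e^{−x}, for which the inverse is available in closed form (the choice of distribution here is ours, for illustration):

```python
import math
import random

def exp_inverse_cdf(q):
    # F(x) = 1 - exp(-x) for x >= 0, so F^{-1}(q) = -log(1 - q) for q in [0, 1).
    return -math.log(1.0 - q)

random.seed(0)
# X = F^{-1}(U) with U ~ U(0,1) has CDF F, by Theorem B.144.
sample = [exp_inverse_cdf(random.random()) for _ in range(100_000)]

# The sample mean should be near 1, the mean of the Exp(1) distribution.
print(sum(sample) / len(sample))
```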

This theorem allows us to generate pseudorandom variables with arbitrary CDF F, if we can find F^{−1}. The method described in this theorem is called the probability integral transform. Note that the probability integral transform has a surprising theoretical implication.

Proposition B.145. Let U have U(0,1) distribution, and let X be a random quantity taking values in a Borel space X. Then there exists a measurable function f : [0,1] → X such that f(U) has the same distribution as X.

The next theorem allows us to find pseudorandom variables with arbitrary density f if we can generate pseudorandom variables with another density g such that f(x) ≤ kg(x) for some number k and all x.

Theorem B.146 (Acceptance-rejection). Let f be a nonnegative integrable function, and let g be a density function. Let k > 0 and suppose that f(x) ≤ kg(x) for all x. Suppose that {Y_i}_{i=1}^∞ and {U_i}_{i=1}^∞ are all independent and that the Y_i have density g and the U_i are U(0,1). Define Z = Y_N, where

N = min{ i : U_i ≤ f(Y_i)/[kg(Y_i)] }.

*This section may be skipped without interrupting the flow of ideas.


Then Z has density proportional to f.
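Before turning to the proof, here is a minimal sketch of the method, with f proportional to a Beta(2,2) density and g the U(0,1) density (both chosen by us for illustration; any f, g, and k with f(x) ≤ kg(x) would do):

```python
import random

random.seed(0)

def f(x):
    # Target, known only up to a normalizing constant: f(x) = x(1-x) on [0,1],
    # which is proportional to a Beta(2,2) density.
    return x * (1.0 - x)

k = 0.25   # f(x) <= k * g(x), where g is the U(0,1) density (g == 1)

def sample_f():
    # Draw Y_i ~ g and U_i ~ U(0,1) until U_i <= f(Y_i)/[k g(Y_i)]; return Z = Y_N.
    while True:
        y = random.random()
        u = random.random()
        if u <= f(y) / k:
            return y

draws = [sample_f() for _ in range(100_000)]
# Beta(2,2) has mean 1/2, so the sample mean should be near 0.5.
print(sum(draws) / len(draws))
```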

PROOF. We can write the CDF of Z as

Pr(Z ≤ t) = Pr( Y_i ≤ t | U_i ≤ f(Y_i)/[kg(Y_i)] )
          = Pr( Y_i ≤ t, U_i ≤ f(Y_i)/[kg(Y_i)] ) / Pr( U_i ≤ f(Y_i)/[kg(Y_i)] )
          = E[ Pr( Y_i ≤ t, U_i ≤ f(Y_i)/[kg(Y_i)] | Y_i ) ] / E[ Pr( U_i ≤ f(Y_i)/[kg(Y_i)] | Y_i ) ],

where we have used the law of total probability B.70 in the last equation. The conditional probability in the numerator is

Pr( Y_i ≤ t, U_i ≤ f(Y_i)/[kg(Y_i)] | Y_i ) = { 0 if Y_i > t, f(Y_i)/[kg(Y_i)] if Y_i ≤ t.

The mean of this is

∫_{−∞}^t [f(y)/(kg(y))] g(y) dy = (1/k) ∫_{−∞}^t f(y) dy,

since Y_i has PDF g(·). Similarly, the denominator conditional probability can be written as

Pr( U_i ≤ f(Y_i)/[kg(Y_i)] | Y_i ) = f(Y_i)/[kg(Y_i)].

The mean of this is likewise seen to be ∫ f(y) dy / k. The ratio of these is

Pr(Z ≤ t) = ∫_{−∞}^t f(y) dy / ∫ f(y) dy,

hence Z has density proportional to f. □

Next, we prove a theorem that allows us to simulate from distributions with bounded densities and sufficiently thin tails even when we only know the density up to a normalizing constant. The theorem is due to Kinderman and Monahan (1977).

Theorem B.147 (Ratio of uniforms method). Let f : ℝ → [0, ∞) be an integrable function. Define

A = { (u, v) : 0 < u ≤ √(f(v/u)) }.

If (U, V) has uniform distribution over the set A, then V/U has density proportional to f.

PROOF. Let (U, V) be uniformly distributed on the set A. Then f_{U,V}(u, v) = I_A(u, v)/c, where c is the area of A. Define X = U and Y = V/U. The Jacobian for the transformation is x, and the joint density of (X, Y) is

f_{X,Y}(x, y) = (x/c) I_A(x, xy) = (x/c) I_{[0, √(f(y))]}(x).


It follows that f_Y(y) = ∫_0^{√(f(y))} (x/c) dx = f(y)/(2c). □

If f(x) ≤ b² and a ≤ x√(f(x)) ≤ c for all x, then A is contained in the rectangle with opposite corners (0, a) and (b, c). We can then generate U ~ U(0, b) and V ~ U(a, c). We set X = V/U, and if U² ≤ f(X), take X as our desired random variable. If U² > f(X), try again.
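A minimal sketch of this recipe for the unnormalized standard normal density f(x) = exp(−x²/2) (our choice of target, for illustration): here b = 1, and the extremes of x√(f(x)) = x e^{−x²/4} occur at x = ±√2, giving a = −√(2/e) and c = √(2/e).

```python
import math
import random

random.seed(0)

def f(x):
    # Unnormalized standard normal density.
    return math.exp(-x * x / 2.0)

b = 1.0                        # sup_x sqrt(f(x)) = 1
c = math.sqrt(2.0 / math.e)    # sup_x x*sqrt(f(x)), attained at x = sqrt(2)
a = -c                         # inf_x x*sqrt(f(x)), attained at x = -sqrt(2)

def sample_normal():
    # Sample (U, V) uniformly on the bounding rectangle and accept when
    # the point lies in A, i.e., when U^2 <= f(V/U).
    while True:
        u = random.uniform(0.0, b)
        v = random.uniform(a, c)
        if u > 0.0 and u * u <= f(v / u):
            return v / u       # accepted X = V/U has density proportional to f

draws = [sample_normal() for _ in range(100_000)]
print(sum(draws) / len(draws))  # should be near 0, the N(0,1) mean
```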

An important application of simulation is to the numerical integration technique called importance sampling. Suppose that we wish to know the value of the ratio of two integrals

∫ v(θ)h(θ) dθ / ∫ h(θ) dθ,   (B.148)

where θ can be a vector. Suppose that f is a density function such that h/f is nearly constant and it is easy to generate pseudorandom numbers with density f. Let {X_i}_{i=1}^∞ be an IID sequence of pseudorandom numbers with density f. Then

∫ h(θ) dθ = E( h(X_i)/f(X_i) ),   ∫ v(θ)h(θ) dθ = E( v(X_i) h(X_i)/f(X_i) ),

where the expectations are with respect to the pseudo-distribution of X_i. If we let W_i = h(X_i)/f(X_i) and Z_i = v(X_i)W_i, then the weak law of large numbers B.95 says that Z̄_n/W̄_n converges in probability to (B.148).69 The reason that we want h/f to be nearly constant is so that the variance of W_i is small. In Section 7.1.3, we will show how to approximate the variance of Z̄_n/W̄_n as an estimate of (B.148).
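A small sketch of the estimator Z̄_n/W̄_n, with h an unnormalized N(0,1) density, v(θ) = θ², and f the density of N(0, 2²) (all three choices are ours, for illustration); the exact value of (B.148) is then the N(0,1) second moment, 1.

```python
import math
import random

random.seed(0)

def h(t):
    # Unnormalized N(0,1) density: the normalizing constant may be unknown.
    return math.exp(-t * t / 2.0)

def v(t):
    return t * t

def f_pdf(t):
    # Importance density: N(0, sd=2), easy to sample and heavier-tailed than h.
    return math.exp(-t * t / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

n = 200_000
zs = ws = 0.0
for _ in range(n):
    x = random.gauss(0.0, 2.0)
    w = h(x) / f_pdf(x)        # W_i = h(X_i)/f(X_i)
    zs += v(x) * w             # accumulate Z_i = v(X_i) W_i
    ws += w

print(zs / ws)  # Z-bar_n / W-bar_n, converging in probability to (B.148) = 1
```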

B.8 Problems

Section B.2:

1. Suppose that an urn contains m ≥ 3 white balls and n ≥ 3 black balls. Suppose that the urn is well mixed, so that at any time any one of the remaining balls in the urn is as likely to be drawn as any other. We will draw three balls without replacement and set X_i = 1 if the ith ball drawn is black, X_i = 0 if the ith ball is white. Show that

Pr(X_1 = 0, X_2 = 1, X_3 = 1) = Pr(X_1 = 1, X_2 = 1, X_3 = 0).

2. Suppose that H is a nondecreasing function and

F(x) = inf_{t > x, t rational} H(t).

69The strong law of large numbers 1.63 says that Z̄_n/W̄_n converges a.s. to (B.148).


(a) Prove that F is continuous from the right.

(b) Prove that inf_{all x} H(x) = inf_{all x} F(x).

(c) Prove that sup_{all x} H(x) = sup_{all x} F(x).

Section B.3:

3. Using the definition of conditional probability, show that A ∩ B = ∅ implies

Pr(A|C) + Pr(B|C) = Pr(A ∪ B|C), a.s.

Use this to help prove that {A_n}_{n=1}^∞ disjoint implies

Pr( ∪_{n=1}^∞ A_n | C ) = Σ_{n=1}^∞ Pr(A_n | C), a.s.

4. *Let X_1 and X_2 be IID random variables with U(0,1) distribution. Let T = max{X_1, X_2}. Using the definition of conditional distribution, show that the conditional distribution of X_1 given T = t is a mixture of a point mass at t and a U(0, t) distribution. Also, find the mixture.

5. Let (S, A, μ) be a probability space. Let C be a sub-σ-field such that μ(C) ∈ {0, 1} for all C ∈ C. Let E|X| < ∞. Prove that E(X|C) = E(X), a.s. [μ].

6. Let (S, A, μ) be a probability space. Let {A_n}_{n=1}^∞ be a partition of S, and let C be the smallest σ-field containing {A_n}_{n=1}^∞. Let X be a random variable. Show that E(X|C) = Σ_n I_{A_n} W_n, where

W_n = { E(XI_{A_n})/μ(A_n) if μ(A_n) > 0, 0 otherwise.

7. Let Φ denote the standard normal CDF, and let the joint CDF of random variables (X, Y) be

F_{X,Y}(x, y) = { Φ(y) if x ≥ y + 1, (x − y + 1)Φ(y)/2 if y − 1 ≤ x < y + 1, 0 otherwise.

(a) Find the conditional distribution of X given Y.

(b) Find the conditional distribution of Y given X.

8. Prove Proposition B.25 on page 617. (Hint: Use part 4 of Proposition A.49.)

9. Prove Proposition B.26 on page 617.

10. Prove Proposition B.27 on page 617. (Hint: Prove it for 9 an indicator func­tion, then for simple functions, then for nonnegative measurable functions, then for all integrable functions.)

11. Prove Proposition B.28 on page 617.


12. Suppose that X_1, ..., X_n are independent, each with distribution N(θ, 1). Find the conditional distribution of X_1, ..., X_n given X̄_n = x̄, where X̄_n = Σ_{i=1}^n X_i/n.

13. Let B_1 ⊆ B_2 ⊆ ... be a sequence of σ-fields, and let X ≥ 0. Suppose that E(X|B_n) = Y for all n. Let B be the smallest σ-field containing all of the B_n. Show that E(X|B) = Y, a.s. (Hint: Show that the union of the B_n is a π-system, and use Theorem A.26.)

14. Prove Proposition B.43 on page 623.

15. Assume the conditions of Theorem B.46. Also, suppose that (X, B_1, ν_1) and (Y, B_2, ν_2) are σ-finite measure spaces and ν = ν_1 × ν_2. Prove that ν_1 can play the role of ν_{X|Y}(·|y) for all y and that ν_2 can play the role of ν_Y in the statement of Theorem B.46.

16. Prove Proposition B.51 on page 625. (Hint: Notice that I_A(V^{−1}(y, w)) = I_{A_y}(w).)

17. Prove Proposition B.66 on page 631. (Hint: Prove the result for product sets first, and then use Theorem A.26.)

18. Prove Corollary B.67 on page 631.

19. Prove Corollary B.74 on page 633.

20. Prove the second Borel-Cantelli lemma: If {A_n}_{n=1}^∞ are mutually independent and Σ_{n=1}^∞ Pr(A_n) = ∞, then Pr( ∩_{i=1}^∞ ∪_{n=i}^∞ A_n ) = 1. (This set is sometimes called A_n infinitely often.) (Hint: Find the probability of the complement by using the fact that 1 − x ≤ exp(−x) for 0 ≤ x ≤ 1.)

21. *Suppose that (S, A, μ) is a measure space. Let {f_n}_{n=1}^∞ be a sequence of measurable functions f_n : S → T, where (T, B) is a metric space with Borel σ-field. Let C be the tail σ-field of {f_n}_{n=1}^∞. If lim_{n→∞} f_n(s) = f(s) for all s, then prove that f is measurable with respect to C. (Hint: Refer to the proof of part 5 of Theorem A.38. Show that the set A ∈ C by showing that the union in (A.39) does not need to start at 1.)

22. Let (S, A, μ) be a probability space, and let C be the tail σ-field of a sequence of random quantities {X_n}_{n=1}^∞, where X_n : S → X for all n. Let D be the σ-field generated by {X_n}_{n=1}^∞. Let X = (X_1, X_2, ...) ∈ X^∞. If π is a permutation of a finite set of integers {1, ..., n}, let πX = (X_{π(1)}, ..., X_{π(n)}, X_{n+1}, ...). We say that A ∈ D is symmetric if A = X^{−1}(B) and for every permutation π of finitely many coordinates, A = (πX)^{−1}(B) as well.

(a) Prove that every C ∈ C is symmetric.

(b) Show that there can be symmetric events that are not in C.

23. Prove Proposition B.78 on page 634.


Section B.4:

24. Find a sequence of random variables that converges in probability to 0 but does not converge a.s. to 0. (Hint: Consider the countable collection of all subsets of [0,1] of the form [k/2^n, (k+1)/2^n] with k and n integers. Arrange them in an appropriate sequence.)

25. Let {X_n}_{n=1}^∞ be a sequence of random variables, and let X be another random variable. Let F_n be the CDF of X_n and let F be the CDF of X. Prove that X_n converges in distribution to X if and only if lim_{n→∞} F_n(x) = F(x) for every x such that F is continuous at x.

26. Prove that |exp(iy) − 1| ≤ min{|y|, 2} for all y. (Hint: Show that exp(iy) = 1 + i ∫_0^y exp(is) ds for y ≥ 0 and a similar formula for y < 0.)

27. Prove the weak law of large numbers for infinite means: Suppose that {X_i}_{i=1}^∞ are IID with mean ∞. Then, for all real x, lim_{n→∞} Pr(X̄_n > x) = 1, where X̄_n = Σ_{i=1}^n X_i/n. (Hint: Define Y_{i,t} = min{X_i, t}. Prove that E(Y_{i,t}) < ∞ for all t, but lim_{t→∞} E(Y_{i,t}) = ∞.)

28. *Suppose that X is a random vector having bounded density with respect to Lebesgue measure. Prove that the characteristic function of X is integrable. (Hint: Run the proof of Lemma B.101 in reverse.)

Section B.5:

29. Let {i_n}_{n=1}^∞ be a sequence of numbers in {0, 1}. Suppose that {X_n}_{n=1}^∞ is a sequence of Bernoulli random variables such that

Pr(X1 =il, ... ,Xn =in)= x~2(n~4)' x+2

where x = Σ_{j=1}^n i_j. Show that this specifies a consistent set of joint distributions for n = 1, 2, ....

30. Let μ be a finite measure on (ℝ, B), where B is the Borel σ-field. Suppose that {X(t) : −∞ < t < ∞} is a stochastic process such that X(t) has Beta(μ((−∞, t]), μ((t, ∞))) distribution for each t, X(t) > X(s) if t > s, and X(·) is continuous from the right.

(a) Prove that Pr(lim_{t→∞} X(t) = 1) = 1.

(b) Let U = inf{t : X(t) ≥ 1/2}. Prove that the median of U is inf{t : μ((−∞, t]) ≥ μ((t, ∞))}. (Hint: Write {U ≤ s} in terms of X(·).)

31. Let R be a set, and let (X_r, B_r) be a Borel space for every r ∈ R. Let X = ∏_{r∈R} X_r and let B be the product σ-field. For each r ∈ R, let X_r : X → X_r be the projection function X_r(x) = x_r. Prove that B is the union of all of the σ-fields generated by all of the countable collections of X_r functions. That is, let Q be the set of all countable subsets of R, and for each q ∈ Q let x_q = {X_r}_{r∈q} and let B_q be the σ-field generated by x_q. Then show that B = ∪_{q∈Q} B_q.

Section B. 7:

32. Prove Proposition B.145 on page 659.


APPENDIX C

Mathematical Theorems Not Proven Here

There are several theorems of a purely mathematical nature which we use on occasion in this text, but which we do not wish to prove here because their proofs involve a great deal of mathematical background of which we will not make use anywhere else.

C.1 Real Analysis

Theorem C.1 (Taylor's theorem).1 Suppose that f : ℝ^m → ℝ has continuous partial derivatives of all orders up to and including k + 1 with respect to all coordinates in a convex neighborhood D of a point x_0. For x ∈ D and i = 1, ..., k + 1, define

D^{(i)}(f; x, y) = Σ_{j_1=1}^m ··· Σ_{j_i=1}^m ( [∂^i f(z)/∂z_{j_1}···∂z_{j_i}]|_{z=x} ∏_{s=1}^i y_{j_s} ),

where we allow notation like ∂³/∂z_1∂z_1∂z_4 to stand for ∂³/∂z_1²∂z_4. Then, for x ∈ D,

f(x) = f(x_0) + Σ_{i=1}^k (1/i!) D^{(i)}(f; x_0, x − x_0) + [1/(k+1)!] D^{(k+1)}(f; x*, x − x_0),

where x* is on the line segment joining x and x_0.

1This theorem is used in the proofs of Theorems 7.63, 7.89, 7.108, and 7.125. For a proof (with m = 2), see Buck (1965), Theorem 16 on page 260.


Theorem C.2 (Inverse function theorem).2 Let f be a continuously differentiable function from an open set in ℝ^n into ℝ^n such that ((∂f_i/∂x_j)) is a nonsingular matrix at a point x. If y = f(x), then there exist open sets U and V such that x ∈ U, y ∈ V, f is one-to-one on U, and f(U) = V. Also, if g : V → U is the inverse of f on U, then g is continuously differentiable on V.

Theorem C.3 (Stone-Weierstrass theorem).3 Let A be a collection of continuous complex functions defined on a compact set C and satisfying these conditions:

• If f ∈ A, then the complex conjugate of f is in A.

• If x_1 ≠ x_2 ∈ C, then there exists f ∈ A such that f(x_1) ≠ f(x_2).

• If f, g ∈ A, then f + g ∈ A and fg ∈ A.

• If f ∈ A and c is a constant, then cf ∈ A.

• For each x ∈ C, there exists f ∈ A such that f(x) ≠ 0.

Then, for every continuous complex function f on C, there exists a sequence {f_n}_{n=1}^∞ in A such that f_n converges uniformly to f on C.

Theorem C.4 (Supporting hyperplane theorem).4 If S is a convex subset of a finite-dimensional Euclidean space, and x_0 is a boundary point of S, then there is a nonzero vector v such that for every x ∈ S, v^T x ≤ v^T x_0.

Theorem C.5 (Separating hyperplane theorem).5 If S_1 and S_2 are disjoint convex subsets of a finite-dimensional Euclidean space, then there is a nonzero vector v and a constant c such that for every x ∈ S_1, v^T x ≤ c and for every y ∈ S_2, v^T y ≥ c.

Theorem C.6 (Bolzano-Weierstrass theorem).6 Suppose that B is a closed and bounded subset of a finite-dimensional Euclidean space. Then every infinite subset of B has a cluster point in B.

C.2 Complex Analysis

Theorem C.7.7 Let f be an analytic function in a neighborhood of a point z. Then the derivatives of f of every order exist and are analytic in a neighborhood

2This theorem is used in the proof of Theorem 7.57. For a proof, see Rudin (1964), Theorem 9.17.

3This theorem is used in the proofs of DeFinetti's representation theorem 1.49 and 1.47 and Theorem B.93. For a proof, see Rudin (1964), Theorem 7.31.

4This theorem is used in the proof of Theorem B.17. For a proof, see Berger (1985), Theorem 12 on page 341, or Ferguson (1967), Theorem 1 on page 73.

5This theorem is used in the proof of Theorems B.17, 3.77, and 3.95. For a proof, see Berger (1985), Theorem 13 on page 342, or Ferguson (1967), Theorem 2 on page 73.

6This theorem is used in the proof of Theorem 3.77. For a proof, see Dugundji (1966), Theorems 3.2 and 4.3 of Chapter XI.

7This theorem is used to show that certain estimators are UMVUE, and in the proof of Theorem 2.74. For a proof, see Churchill (1960, Sections 52 and 56).


of z. If f^{(k)} denotes the kth derivative of f, then

f(x) = Σ_{k=0}^∞ (x − z)^k f^{(k)}(z)/k!

for all x in some circle around z.

Theorem C.8 (Maximum modulus theorem).8 Let f be an analytic function in an open set D which is continuous on the closure of D. Let the maximum value of |f(z)| for z in the closure of D be c. Then |f(z)| < c for all z ∈ D unless f is constant on D.

Theorem C.9 (Cauchy's equation).9 Let G be a Borel subset of ℝ^k with positive Lebesgue measure. Let f : G → ℝ be measurable. Let H_1 = G and H_n = H_{n−1} + G for each n. For each n, let g_n : H_n → ℝ be measurable such that g_n(Σ_{i=1}^n x_i) = Σ_{i=1}^n f(x_i), for almost all (x_1, ..., x_n) ∈ G^n. Then there is a real number a and a vector b ∈ ℝ^k such that f(x) = a + b^T x a.e. in G.

C.3 Functional Analysis

Theorem C.10.10 If T is an operator with finite norm on the Hilbert space L²(μ) given by T(f)(x) = ∫ K(x′, x)f(x′) dμ(x′), then T is of Hilbert-Schmidt type if and only if ∫∫ |K(x′, x)|² dμ(x′) dμ(x) < ∞.

Theorem C.11.11 Every operator of Hilbert-Schmidt type is completely continuous.

Theorem C.12.12 If T is a completely continuous self-adjoint operator, then T has an eigenvalue λ with |λ| = ‖T‖.

Theorem C.13.13 If T is a linear operator with finite norm and T* is its adjoint operator, then ‖T*T‖ = ‖T‖².

8This theorem is used in the proof of Theorem 2.64. For a proof, see Churchill (1960), Section 54, or Ahlfors (1966), Theorem 12' on page 134.

9This theorem is used in the proof of Theorem 2.114. For a proof, see Diaconis and Freedman (1990), Theorem 2.1.

10This theorem is used in the proof of Theorem 8.40. For a proof, see Section XI.6 of Dunford and Schwartz (1963). By L²(μ) we mean {f : ∫ f²(x) dμ(x) < ∞}.

11This theorem is used in the proof of Theorem 8.40. For a proof, see Theorem 6 of Section XI.6 of Dunford and Schwartz (1963). The reader should note that Dunford and Schwartz (1963) use the term compact instead of completely continuous.

12This theorem is used in the proof of Theorem 8.40. For a proof, see Lemma 1 in Section VIII.3 of Berberian (1961).

13This theorem is used in the proof of Theorem 8.40. For a proof, see part (5) of Theorem 2 on p. 132 of Berberian (1961).


APPENDIX D

Summary of Distributions

The distributions used in this book are listed here. We give the name and symbol used to describe each distribution. Each distribution is absolutely continuous with respect to some measure or other. In most cases the mean and variance are given. In some cases, the symbol for the CDF is given.

D.1 Univariate Continuous Distributions

Alternate noncentral beta

Symbol: ANCB(q, a, γ)

Density: f_X(x) = Σ_{k=0}^∞ [Γ(k + a/2)/(k! Γ(a/2))] γ^k (1 − γ)^{a/2} [Γ((q + a)/2 + 2k)/(Γ(q/2 + k)Γ(a/2 + k))] x^{q/2+k−1}(1 − x)^{a/2+k−1}

Dominating measure: Lebesgue measure on [0, 1]

Alternate noncentral chi-squared

Symbol: ANCX²(q, a, γ)1

Density: f_X(x) = Σ_{k=0}^∞ [Γ(k + a/2)/(k! Γ(a/2))] γ^k (1 − γ)^{a/2} [x^{q/2+k−1}/(2^{q/2+k} Γ(q/2 + k))] exp(−x/2)

Dominating measure: Lebesgue measure on [0, ∞)

Mean: q + aγ/(1 − γ)

Variance: 2[q + aγ(2 − γ)/(1 − γ)²]

IThis distribution was derived without a name by Geisser (1967). It was named L2 by Lecoutre and Rouanet (1981).


Alternate noncentral F

Symbol: ANCF(q, a, γ)2

Dominating measure: Lebesgue measure on [0, ∞)

Mean: (1 − γ) a/(a − 2) + γ a/q, if a > 2

Variance: 2a²(a − 2 + q)(1 − γ)²/[(a − 2)²(a − 4)q] + 4a²γ(1 − γ)/[(a − 2)q²], if a > 4

Beta

Symbol: Beta(α, β)

Density: f_X(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}

Dominating measure: Lebesgue measure on [0, 1]

Mean: α/(α + β)

Variance: αβ/[(α + β)²(α + β + 1)]

Cauchy

Symbol: Cau(μ, σ²)

Density: f_X(x) = [πσ(1 + (x − μ)²/σ²)]^{−1}

Dominating measure: Lebesgue measure on (−∞, ∞)

Mean: Does not exist

Variance: Does not exist

Chi-squared

Symbol: χ²_a

Density: f_X(x) = [x^{a/2−1}/(2^{a/2}Γ(a/2))] exp(−x/2)

Dominating measure: Lebesgue measure on [0, ∞)

Mean: a

Variance: 2a

2The alternate noncentral F distribution, with a different scaling factor, was called the ψ² distribution by Rouanet and Lecoutre (1983). See also Lecoutre (1985). The distribution was derived without a name by Ferrándiz (1985). Schervish (1992) gives additional details concerning the ANCX², ANCB, and ANCF distributions.


Exponential

Symbol: Exp(θ)

Density: f_X(x) = θ exp(−xθ)

Dominating measure: Lebesgue measure on [0, ∞)

Mean: 1/θ

Variance: 1/θ²

F

Symbol: F_{q,a}

Density: f_X(x) = [Γ((q + a)/2) q^{q/2} a^{a/2}/(Γ(q/2)Γ(a/2))] x^{q/2−1}(a + qx)^{−(q+a)/2}

Dominating measure: Lebesgue measure on [0, ∞)

Mean: a/(a − 2), if a > 2

Variance: 2a²(q + a − 2)/[q(a − 4)(a − 2)²], if a > 4

Gamma

Symbol: Γ(α, β)

Density: f_X(x) = [β^α/Γ(α)] x^{α−1} exp(−βx)

Dominating measure: Lebesgue measure on [0, ∞)

Mean: α/β

Variance: α/β²

Inverse gamma

Symbol: Γ^{−1}(α, β)

Density: f_X(x) = [β^α/Γ(α)] x^{−α−1} exp(−β/x)

Dominating measure: Lebesgue measure on [0, ∞)

Mean: β/(α − 1), if α > 1

Variance: β²/[(α − 1)²(α − 2)], if α > 2

Laplace

Symbol: Lap(μ, σ)

Density: f_X(x) = [1/(2σ)] exp(−|x − μ|/σ)

Dominating measure: Lebesgue measure on ℝ

Mean: μ

Variance: 2σ²


Noncentral beta

Symbol: NCB(\alpha, \beta, \psi)

Density: f_X(x) = \sum_{k=0}^{\infty} \left(\frac{\psi}{2}\right)^k \exp\left(-\frac{\psi}{2}\right) \frac{\Gamma(\alpha+\beta+k)}{k!\,\Gamma(\alpha+k)\Gamma(\beta)} x^{\alpha+k-1}(1-x)^{\beta-1}

Dominating measure: Lebesgue measure on [0,1]

Noncentral chi-squared

Symbol: NC\chi^2_q(\psi)

Density: f_X(x) = \sum_{k=0}^{\infty} \left(\frac{\psi}{2}\right)^k \exp\left(-\frac{\psi}{2}\right) \frac{x^{q/2+k-1}}{k!\,2^{q/2+k}\,\Gamma(q/2+k)} \exp\left(-\frac{x}{2}\right)

Dominating measure: Lebesgue measure on [0,00)

Mean: q + \psi

Variance: 2q + 4\psi
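A seeded Monte Carlo sketch (not from the text; q = 3, \psi = 2 are illustrative) checks the mean q + \psi and variance 2q + 4\psi via the representation of NC\chi^2_q(\psi) as a sum of squared shifted normals:

```python
import random
import statistics

random.seed(2024)
q, psi = 3, 2.0
delta = [psi ** 0.5, 0.0, 0.0]   # any q shifts with sum of squares = psi work
n = 100_000
# sum_{i=1}^q (Z_i + delta_i)^2 with Z_i ~ N(0,1) is NC chi^2_q(psi)
draws = [sum((random.gauss(0.0, 1.0) + d) ** 2 for d in delta) for _ in range(n)]

sample_mean = statistics.fmean(draws)      # approaches q + psi = 5
sample_var = statistics.pvariance(draws)   # approaches 2q + 4psi = 14
```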

Noncentral F

Symbol: NCF(q, a, \psi)

Dominating measure: Lebesgue measure on [0, \infty)

Mean: \left(1 + \frac{\psi}{q}\right)\frac{a}{a-2}, if a > 2

Variance: 2\left(\frac{a}{q}\right)^2 \frac{(q+\psi)^2 + (q+2\psi)(a-2)}{(a-2)^2(a-4)}, if a > 4

Noncentral t

Symbol: NCt_a(\delta)

Density: f_X(x) = \sum_{k=0}^{\infty} \frac{(\delta x)^k}{k!} \exp\left(-\frac{\delta^2}{2}\right) \frac{\Gamma\left(\frac{a+k+1}{2}\right)}{\sqrt{a\pi}\,\Gamma\left(\frac{a}{2}\right)} \left(\frac{2}{a}\right)^{k/2} \left(1+\frac{x^2}{a}\right)^{-\frac{a+k+1}{2}}

Dominating measure: Lebesgue measure on \mathbb{R}

Mean: \delta\,\frac{\Gamma\left(\frac{a-1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)} \sqrt{\frac{a}{2}}, if a > 1

Variance: \frac{a(\delta^2+1)}{a-2} - \frac{a\delta^2}{2}\left[\frac{\Gamma\left(\frac{a-1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)}\right]^2, if a > 2

CDF: NCT_a(\cdot\,; \delta)

Normal

Symbol: N(\mu, \sigma^2)

Density: f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Dominating measure: Lebesgue measure on (-00,00)


Mean: \mu

Variance: \sigma^2

CDF: \Phi(\cdot) (for the N(0,1) distribution)
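The CDF \Phi has no closed form, but it can be evaluated through the error function, \Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(x/\sqrt{2}\right)\right). A small sketch (not from the text; the quantile value is the standard double-precision 95th percentile):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# For N(mu, sigma^2), P(X <= x) = Phi((x - mu)/sigma)
p_median = Phi(0.0)                   # 0.5 by symmetry
p_95 = Phi(1.6448536269514722)        # ~0.95 at the standard normal 95th percentile
```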

Pareto

Symbol: Par(\alpha, c)

Density: f_X(x) = \frac{\alpha c^\alpha}{x^{\alpha+1}}

Dominating measure: Lebesgue measure on [c, \infty)

Mean: \frac{\alpha c}{\alpha-1}, if \alpha > 1

Variance: \frac{\alpha c^2}{(\alpha-2)(\alpha-1)^2}, if \alpha > 2
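Since the Pareto CDF is F(x) = 1 - (c/x)^\alpha for x \ge c, inverse-CDF sampling gives X = cU^{-1/\alpha} for U uniform on (0,1). A seeded sketch (not from the text; \alpha = 3, c = 2 are illustrative) checks the stated mean \alpha c/(\alpha-1) = 3:

```python
import random
import statistics

random.seed(7)
alpha, c = 3.0, 2.0
n = 100_000
# Inverse-CDF sampling: F(x) = 1 - (c/x)^alpha  =>  X = c * U^(-1/alpha)
draws = [c * random.random() ** (-1.0 / alpha) for _ in range(n)]

sample_mean = statistics.fmean(draws)   # approaches alpha*c/(alpha-1) = 3
low = min(draws)                        # support is [c, infinity)
```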

t

Symbol: t_a(\mu, \sigma^2)

Density: f_X(x) = \frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}\,\sigma} \left(1+\frac{(x-\mu)^2}{a\sigma^2}\right)^{-\frac{a+1}{2}}

Dominating measure: Lebesgue measure on (-\infty, \infty)

Mean: \mu, if a > 1

Variance: \sigma^2\,\frac{a}{a-2}, if a > 2

CDF: T_a(\cdot) (for the t_a(0,1) distribution)

Uniform

Symbol: U(a, b)

Density: /x(x) = (b - a)-l

Dominating measure: Lebesgue measure on [a, b]

Mean: \frac{a+b}{2}

Variance: \frac{(b-a)^2}{12}

D.2 Univariate Discrete Distributions

Bernoulli

Symbol: Ber(p)

Density: f_X(x) = p^x(1-p)^{1-x}

Dominating measure: Counting measure on {0, 1}

Mean: p

Variance: p(l - p)


Binomial

Symbol: Bin(n, p)

Density: f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}

Dominating measure: Counting measure on {0, ..., n}

Mean: np

Variance: np(1-p)
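The binomial moments above can be verified exactly by summing over the finite support; a sketch not from the text, with n = 10, p = 0.3 as illustrative values:

```python
import math

def binom_pmf(x, n, p):
    # f_X(x) = C(n, x) * p^x * (1-p)^(n-x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
support = range(n + 1)
total = sum(binom_pmf(x, n, p) for x in support)                  # 1
mean = sum(x * binom_pmf(x, n, p) for x in support)               # n*p = 3
var = sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in support)  # n*p*(1-p) = 2.1
```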

Geometric

Symbol: Geo(p)

Density: f_X(x) = p(1-p)^x

Dominating measure: Counting measure on {0, 1, 2, ...}

Mean: \frac{1-p}{p}

Variance: \frac{1-p}{p^2}

Hypergeometric

Symbol: Hyp(N, n, k)

Density: f_X(x) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}

Dominating measure: Counting measure on {max(0, n-N+k), ..., min(n, k)}

Mean: \frac{nk}{N}

Variance: n\left(\frac{k}{N}\right)\left(\frac{N-k}{N}\right)\left(\frac{N-n}{N-1}\right)
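Because the hypergeometric support is finite, the mean nk/N and the variance formula can be checked exactly; a sketch not from the text, with N = 20, n = 5, k = 8 as illustrative values:

```python
import math

def hyper_pmf(x, N, n, k):
    # f_X(x) = C(k, x) * C(N-k, n-x) / C(N, n)
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)

N, n, k = 20, 5, 8
support = range(max(0, n - N + k), min(n, k) + 1)
total = sum(hyper_pmf(x, N, n, k) for x in support)                  # 1
mean = sum(x * hyper_pmf(x, N, n, k) for x in support)               # n*k/N = 2
var = sum((x - mean) ** 2 * hyper_pmf(x, N, n, k) for x in support)
# Closed form: n*(k/N)*((N-k)/N)*((N-n)/(N-1)) = 18/19
```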

Negative binomial

Symbol: Negbin(\alpha, p)

Density: f_X(x) = \binom{\alpha+x-1}{x} p^\alpha (1-p)^x

Dominating measure: Counting measure on {0, 1, 2, ...}

Mean: \frac{\alpha(1-p)}{p}

Variance: \frac{\alpha(1-p)}{p^2}
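For integer \alpha the negative binomial moments can be checked by a truncated sum over the support (the geometric tail makes the truncation error negligible). A sketch not from the text, with \alpha = 3, p = 0.4 as illustrative values:

```python
import math

def negbin_pmf(x, a, p):
    # f_X(x) = C(a+x-1, x) * p^a * (1-p)^x   (failures before the a-th success)
    return math.comb(a + x - 1, x) * p ** a * (1 - p) ** x

a, p = 3, 0.4
xs = range(400)                       # (1-p)^400 makes the discarded tail negligible
total = sum(negbin_pmf(x, a, p) for x in xs)
mean = sum(x * negbin_pmf(x, a, p) for x in xs)               # a*(1-p)/p = 4.5
var = sum((x - mean) ** 2 * negbin_pmf(x, a, p) for x in xs)  # a*(1-p)/p^2 = 11.25
```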

Poisson

Symbol: Poi(\lambda)

Density: f_X(x) = \exp(-\lambda)\frac{\lambda^x}{x!}

Dominating measure: Counting measure on {0, 1, 2, ...}

Mean: \lambda

Variance: \lambda


D.3 Multivariate Distributions

Dirichlet

Symbol: Dir_k(\alpha_1, \ldots, \alpha_k)

Density: f_{X_1,\ldots,X_{k-1}}(x_1,\ldots,x_{k-1}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} x_1^{\alpha_1-1}\cdots x_{k-1}^{\alpha_{k-1}-1}(1-x_1-\cdots-x_{k-1})^{\alpha_k-1}, where \alpha_0 = \sum_{i=1}^{k}\alpha_i

Dominating measure: Lebesgue measure on {(x_1, \ldots, x_{k-1}): all x_i \ge 0 and x_1 + \cdots + x_{k-1} \le 1}

Mean: E(X_i) = \frac{\alpha_i}{\alpha_0}

Variance: Var(X_i) = \frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)}

Covariance: Cov(X_i, X_j) = -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}
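A seeded Monte Carlo sketch (not from the text; \alpha = (2, 3, 5) is illustrative) checks E(X_i) = \alpha_i/\alpha_0 and the negative covariance, using the standard representation of a Dirichlet draw as normalized independent gammas:

```python
import random
import statistics

random.seed(99)
alphas = (2.0, 3.0, 5.0)
a0 = sum(alphas)                 # alpha_0 = 10
n = 50_000

def dirichlet_draw(alphas):
    # Normalized independent Gammas: X_i = Y_i / sum(Y), Y_i ~ Gamma(alpha_i, scale 1)
    ys = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(ys)
    return [y / s for y in ys]

draws = [dirichlet_draw(alphas) for _ in range(n)]
m1 = statistics.fmean(d[0] for d in draws)          # alpha_1/alpha_0 = 0.2
m2 = statistics.fmean(d[1] for d in draws)
m12 = statistics.fmean(d[0] * d[1] for d in draws)
cov12 = m12 - m1 * m2                               # approaches -alpha_1*alpha_2/(alpha_0^2 (alpha_0+1))
```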

Multinomial

Symbol: Multk(n,Pl, ... ,Pk)

Density: f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \binom{n}{x_1,\ldots,x_k} p_1^{x_1}\cdots p_k^{x_k}

Dominating measure: Counting measure on {(x_1, \ldots, x_k): all x_i \in {0, \ldots, n} and x_1 + \cdots + x_k = n}

Mean: E(Xi) = npi

Variance: Var(Xi) = npi(l - Pi)

Covariance: COV(Xi, Xj) = -npiPj
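For small n and k the multinomial support can be enumerated exactly, which verifies the moments above, including Cov(X_i, X_j) = -np_ip_j. A sketch not from the text, with n = 4 and p = (0.2, 0.3, 0.5) as illustrative values:

```python
import math
from itertools import product

n, probs = 4, (0.2, 0.3, 0.5)

def mult_pmf(xs, n, probs):
    # f(x_1,...,x_k) = n!/(x_1! ... x_k!) * p_1^x_1 ... p_k^x_k
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)   # each division is exact
    out = float(coef)
    for x, p in zip(xs, probs):
        out *= p ** x
    return out

support = [xs for xs in product(range(n + 1), repeat=3) if sum(xs) == n]
total = sum(mult_pmf(xs, n, probs) for xs in support)          # 1
e_x1 = sum(xs[0] * mult_pmf(xs, n, probs) for xs in support)   # n*p_1 = 0.8
cov12 = sum(xs[0] * xs[1] * mult_pmf(xs, n, probs) for xs in support) - e_x1 * (n * probs[1])
# Cov(X_1, X_2) = -n*p_1*p_2 = -0.24
```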

Multivariate Normal

Symbol: N_p(\mu, \Sigma)

Density: f_X(x) = (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)

Dominating measure: Lebesgue measure on \mathbb{R}^p

Mean: E(X_i) = \mu_i

Variance: Var(X_i) = \Sigma_{ii}

Covariance: Cov(X_i, X_j) = \Sigma_{ij}


References

AHLFORS, L. (1966). Complex Analysis (2nd ed.). New York: McGraw-Hill.

AITCHISON, J. and DUNSMORE, I. R. (1975). Statistical Prediction Analysis. Cambridge: Cambridge University Press.

ALBERT, J. H. and CHIB, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669-679.

ALDOUS, D. J. (1981). Representations for partially exchangeable random variables. Journal of Multivariate Analysis, 11, 581-598.

ALDOUS, D. J. (1985). Exchangeability and related topics. In P. L. HENNEQUIN (Ed.), École d'Été de Probabilités de Saint-Flour XIII-1983 (pp. 1-198). Berlin: Springer-Verlag.

ANDERSON, T. W. (1984). An Introduction to Multivariate Statistical Analysis (2nd ed.). New York: Wiley.

ANSCOMBE, F. J. and AUMANN, R. J. (1963). A definition of subjective probability. Annals of Mathematical Statistics, 34, 199-205.

ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2, 1152-1174.

BAHADUR, R. R. (1957). On unbiased estimates of uniformly minimum variance. Sankhya, 18, 211-224.

BARNARD, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott. Journal of the Royal Statistical Society (Series B), 32, 194-195.

BARNARD, G. A. (1976). Conditional inference is not inefficient. Scandinavian Journal of Statistics, 3, 132-134.

BARNDORFF-NIELSEN, O. E. (1988). Parametric Statistical Models and Likelihood. Berlin: Springer-Verlag.

BARNETT, V. (1982). Comparative Statistical Inference (2nd ed.). New York: Wiley.

BARRON, A. R. (1986). Discussion of "On the consistency of Bayes estimates" by Diaconis and Freedman. Annals of Statistics, 14, 26-30.

BARRON, A. R. (1988). The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions. Technical Report 7, Department of Statistics, University of Illinois, Champaign, IL.

BASU, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya, 15, 377-380.

BASU, D. (1958). On statistics independent of sufficient statistics. Sankhya, 20, 223-226.


BAYES, T. (1764). An essay toward solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

BECKER, R. A., CHAMBERS, J. M., and WILKS, A. R. (1988). The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth and Brooks/Cole.

BERBERIAN, S. K. (1961). Introduction to Hilbert Space. New York: Oxford University Press.

BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag.

BERGER, J. O. (1994). An overview of robust Bayesian analysis (with discussion). Test, 3, 5-124.

BERGER, J. O. and BERRY, D. A. (1988). The relevance of stopping rules in statistical inference (with discussion). In S. S. GUPTA and J. O. BERGER (Eds.), Statistical Decision Theory and Related Topics IV (pp. 29-72). New York: Springer-Verlag.

BERGER, J. O. and SELLKE, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association, 82, 112-122.

BERK, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect. Annals of Mathematical Statistics, 37, 51-58.

BERKSON, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325-335.

BERTI, P., REGAZZINI, E., and RIGO, P. (1991). Coherent statistical inference and Bayes theorem. Annals of Statistics, 19, 366-381.

BICKEL, P. J. and FREEDMAN, D. A. (1981). Some asymptotic theory for the bootstrap. Annals of Statistics, 9, 1196-1217.

BILLINGSLEY, P. (1968). Convergence of Probability Measures. New York: Wiley.

BILLINGSLEY, P. (1986). Probability and Measure (2nd ed.). New York: Wiley.

BISHOP, Y. M. M., FIENBERG, S. E., and HOLLAND, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.

BLACKWELL, D. (1947). Conditional expectation and unbiased sequential estimation. Annals of Mathematical Statistics, 18, 105-110.

BLACKWELL, D. (1973). Discreteness of Ferguson selections. Annals of Statistics, 1, 356-358.

BLACKWELL, D. and DUBINS, L. (1962). Merging of opinions with increasing information. Annals of Mathematical Statistics, 33, 882-886.

BLACKWELL, D. and RAMAMOORTHI, R. V. (1982). A Bayes but not classically sufficient statistic. Annals of Statistics, 10, 1025-1026.

BLYTH, C. R. (1951). On minimax statistical decision procedures and their admissibility. Annals of Mathematical Statistics, 22, 22-42.

BONDAR, J. V. (1988). Discussion of "Conditionally acceptable frequentist solutions" by George Casella. In S. S. GUPTA and J. O. BERGER (Eds.), Statistical Decision Theory and Related Topics IV (pp. 91-93). New York: Springer-Verlag.


BORTKIEWICZ, L. V. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.

BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society (Series B), 26, 211-246.

BOX, G. E. P. and TIAO, G. C. (1968). A Bayesian approach to some outlier problems. Biometrika, 55, 119-129.

BREIMAN, L. (1968). Probability. Reading, MA: Addison-Wesley.

BRENNER, D., FRASER, D. A. S., and MCDUNNOUGH, P. (1982). On asymptotic normality of likelihood and conditional analysis. Canadian Journal of Statistics, 10, 163-172.

BROWN, L. D. (1967). The conditional level of Student's t test. Annals of Mathematical Statistics, 38, 1068-1071.

BROWN, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics, 42, 855-903. (See also correction, Annals of Statistics, 1, 594-596.)

BROWN, L. D. and HWANG, J. T. (1982). A unified admissibility proof. In S. S. GUPTA and J. O. BERGER (Eds.), Statistical Decision Theory and Related Topics III (pp. 205-230). New York: Academic Press.

BUCK, C. (1965). Real Analysis (2nd ed.). New York: McGraw-Hill.

BUEHLER, R. J. (1959). Some validity criteria for statistical inferences. Annals of Mathematical Statistics, 30, 845-863.

BUEHLER, R. J. and FEDDERSON, A. P. (1963). Note on a conditional property of Student's t. Annals of Mathematical Statistics, 34, 1098-1100.

CASELLA, G. and BERGER, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). Journal of the American Statistical Association, 82, 106-111.

CHALONER, K., CHURCH, T., LOUIS, T. A., and MATTS, J. P. (1993). Graphical elicitation of a prior distribution for a clinical trial. The Statistician, 42, 341-353.

CHANG, T. and VILLEGAS, C. (1986). On a theorem of Stein relating Bayesian and classical inferences in group models. Canadian Journal of Statistics, 14, 289-296.

CHAPMAN, D. and ROBBINS, H. (1951). Minimum variance estimation without regularity assumptions. Annals of Mathematical Statistics, 22, 581-586.

CHEN, C.-F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. Journal of the Royal Statistical Society (Series B), 47, 540-546.

CHOW, Y. S., ROBBINS, H., and SIEGMUND, D. (1971). Great Expectations: The Theory of Optimal Stopping. New York: Houghton Mifflin.

CHURCHILL, R. V. (1960). Complex Variables and Applications (2nd ed.). New York: McGraw-Hill.

CLARKE, B. S. and BARRON, A. R. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41, 37-60.


CORNFIELD, J. (1966). A Bayesian test of some classical hypotheses-with applications to sequential clinical trials. Journal of the American Statistical Association, 61, 577-594.

COX, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics, 29, 357-372.

Cox, D. R. (1977). The role of significance tests. Scandinavian Journal of Statistics, 4, 49-70.

COX, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

CRAMER, H. (1945). Mathematical Methods of Statistics. Princeton: Princeton University Press.

CRAMER, H. (1946). Contributions to the theory of statistical estimation. Skandinavisk Aktuarietidskrift, 29, 85-94.

DAVID, H. A. (1970). Order Statistics. New York: Wiley.

DAWID, A. P. (1970). On the limiting normality of posterior distributions. Proceedings of the Cambridge Philosophical Society, 67, 625-633.

DAWID, A. P. (1982). Intersubjective statistical models. In G. KOCH and F. SPIZZICHINO (Eds.), Exchangeability in Probability and Statistics (pp. 217-232). Amsterdam: North-Holland.

DAWID, A. P. (1984). Statistical theory: The prequential approach. Journal of the Royal Statistical Society (Series A), 147, 278-292.

DAWID, A. P., STONE, M., and ZIDEK, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference. Journal of the Royal Statistical Society (Series B), 35, 189-233.

DEFINETTI, B. (1937). Foresight: Its logical laws, its subjective sources. In H. E. KYBURG and H. E. SMOKLER (Eds.), Studies in Subjective Probability (pp. 53-118). New York: Wiley.

DEFINETTI, B. (1974). Theory of Probability, Vols. I and II. New York: Wiley.

DEGROOT, M. H. (1970). Optimal Statistical Decisions. New York: Wiley.

DEMOIVRE, A. (1756). The Doctrine of Chance (3rd ed.). London: A. Millar.

DIACONIS, P. and FREEDMAN, D. A. (1980a). Finite exchangeable sequences. Annals of Probability, 8, 745-764.

DIACONIS, P. and FREEDMAN, D. A. (1980b). DeFinetti's generalizations of exchangeability. In R. C. JEFFREY (Ed.), Studies in Inductive Logic and Probability, II (pp. 233-249). Berkeley: University of California.

DIACONIS, P. and FREEDMAN, D. A. (1980c). DeFinetti's theorem for Markov chains. Annals of Probability, 8, 115-130.

DIACONIS, P. and FREEDMAN, D. A. (1984). Partial exchangeability and sufficiency. In J. K. GHOSH and J. ROY (Eds.), Statistics: Applications and New Directions (pp. 205-236). Calcutta: Indian Statistical Institute.

DIACONIS, P. and FREEDMAN, D. A. (1986a). On the consistency of Bayes estimates (with discussion). Annals of Statistics, 14, 1-26.


DIACONIS, P. and FREEDMAN, D. A. (1986b). On inconsistent Bayes estimates of location. Annals of Statistics, 14, 68-87.

DIACONIS, P. and FREEDMAN, D. A. (1990). Cauchy's equation and DeFinetti's theorem. Scandinavian Journal of Statistics, 17, 235-250.

DIACONIS, P. and YLVISAKER, D. (1979). Conjugate priors for exponential families. Annals of Statistics, 7, 269-281.

DICKEY, J. M. (1980). Beliefs about beliefs, a theory of stochastic assessments of subjective probabilities. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics (pp. 471-487). Valencia, Spain: University Press.

DOOB, J. L. (1949). Application of the theory of martingales. In Le Calcul des Probabilites et ses Applications (pp. 23-27). Paris: Colloques Internationaux du Centre National de la Recherche Scientifique.

DOOB, J. L. (1953). Stochastic Processes. New York: Wiley.

DUBINS, L. E. and FREEDMAN, D. A. (1963). Random distribution functions. Bulletin of the American Mathematical Society, 69, 548-551.

DUGUNDJI, J. (1966). Topology. Boston: Allyn and Bacon.

DUNFORD, N. and SCHWARTZ, J. T. (1957). Linear Operators, Part I: General Theory. New York: Interscience.

DUNFORD, N. and SCHWARTZ, J. T. (1963). Linear Operators, Part II: Spectral Theory. New York: Interscience.

EBERHARDT, K. R., MEE, R. W., and REEVE, C. P. (1989). Computing factors for exact two-sided tolerance limits for a normal distribution. Communications in Statistics-Simulation and Computation, 18, 397-413.

EDWARDS, W., LINDMAN, H., and SAVAGE, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.

EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

EFRON, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.

EFRON, B. and HINKLEY, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65, 457-487.

EFRON, B. and MORRIS, C. N. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70, 311-319.

EFRON, B. and TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.

ESCOBAR, M. D. (1988). Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means. Ph.D. thesis, Yale University.

FABIUS, J. (1964). Asymptotic behavior of Bayes' estimates. Annals of Mathe­matical Statistics, 35, 846-856.


FERGUSON, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.

FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209-230.

FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures. Annals of Statistics, 2, 615-629.

FERRANDIZ, J. R. (1985). Bayesian inference on Mahalanobis distance: An alternative to Bayesian model testing. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics 2: Proceedings of the Second Valencia International Meeting (pp. 645-653). Amsterdam: North Holland.

FIELLER, E. C. (1954). Some problems in interval estimation. Journal of the Royal Statistical Society (Series B), 16, 175-185.

FISHBURN, P. C. (1970). Utility Theory for Decision Making. New York: Wiley.

FISHER, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222A, 309-368.

FISHER, R. A. (1924). The conditions under which χ² measures the discrepancy between observation and hypothesis. Journal of the Royal Statistical Society, 87, 442-450.

FISHER, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700-725.

FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London, A, 144, 285-307.

FISHER, R. A. (1935). The fiducial argument in statistical inference. Annals of Eugenics, 6, 391-398.

FISHER, R. A. (1936). Has Mendel's work been rediscovered? Annals of Science, 1, 115-137.

FISHER, R. A. (1943). Note on Dr. Berkson's criticism of tests of significance. Journal of the American Statistical Association, 38, 103-104.

FISHER, R. A. (1966). The Design of Experiments (8th ed.). New York: Hafner.

FRASER, D. A. S. and MCDUNNOUGH, P. (1984). Further remarks on asymptotic normality of likelihood and conditional analyses. Canadian Journal of Statistics, 12, 183-190.

FREEDMAN, D. A. (1963). On the asymptotic behavior of Bayes' estimates in the discrete case. Annals of Mathematical Statistics, 34, 1386-1403.

FREEDMAN, D. A. (1977). A remark on the difference between sampling with and without replacement. Journal of the American Statistical Association, 72,681.

FREEDMAN, D. A. and DIACONIS, P. (1982). On inconsistent M-estimators. Annals of Statistics, 10, 454-461.

FREEDMAN, L. S. and SPIEGELHALTER, D. J. (1983). The assessment of subjective opinion and its use in relation to stopping rules of clinical trials. The Statistician, 32, 153-160.


FREEMAN, P. R. (1980). On the number of outliers in data from a linear model. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics (pp. 349-365). Valencia, Spain: University Press.

GABRIEL, K. R. (1969). Simultaneous test procedures-some theory of multiple comparisons. Annals of Mathematical Statistics, 40, 224-250.

GARTHWAITE, P. and DICKEY, J. (1988). Quantifying expert opinion in linear regression problems. Journal of the Royal Statistical Society (Series B), 50, 462-474.

GARTHWAITE, P. H. and DICKEY, J. M. (1992). Elicitation of prior distributions for variable-selection problems in regression. Annals of Statistics, 20, 1697-1719.

GAVASAKAR, U. K. (1984). A Study of Elicitation Procedures by Modelling the Errors in Responses. Ph.D. thesis, Carnegie Mellon University.

GEISSER, S. (1967). Estimation associated with linear discriminants. Annals of Mathematical Statistics, 38, 807-817.

GEISSER, S. and EDDY, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74, 153-160.

GELFAND, A. E. and SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

GEMAN, S. and GEMAN, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

GNANADESIKAN, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.

GOOD, I. J. (1956). Discussion of "Chance and control: Some implications of randomization" by G. Spencer Brown. In C. CHERRY (Ed.), Information Theory: Third London Symposium (pp. 13-14). London: Butterworths.

HALL, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer-Verlag.

HALL, W. J., WIJSMAN, R. A., and GHOSH, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

HALMOS, P. R. (1950). Measure Theory. New York: Van Nostrand.

HALMOS, P. R. and SAVAGE, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Annals of Mathematical Statistics, 20, 225-241.

HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J., and STAHEL, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

HARTIGAN, J. (1983). Bayes Theory. New York: Springer-Verlag.

HEATH, D. and SUDDERTH, W. D. (1976). DeFinetti's theorem on exchangeable variables. American Statistician, 30, 188-189.


HEATH, D. and SUDDERTH, W. D. (1989). Coherent inference from improper priors and from finitely additive priors. Annals of Statistics, 17, 907-919.

HEWITT, E. and SAVAGE, L. J. (1955). Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80, 470-501.

HEYDE, C. C. and JOHNSTONE, I. M. (1979). On asymptotic posterior normality for stochastic processes. Journal of the Royal Statistical Society (Series B), 41, 184-189.

HILL, B. M. (1965). Inference about variance components in the one-way model. Journal of the American Statistical Association, 60, 806-825.

HILL, B. M., LANE, D., and SUDDERTH, W. D. (1987). Exchangeable urn processes. Annals of Probability, 15, 1586-1592.

HOEL, P. G., PORT, S. C., and STONE, C. J. (1971). Introduction to Probability Theory. Boston: Houghton Mifflin.

HOGARTH, R. M. (1975). Cognitive processes and the assessment of subjective probability distributions (with discussion). Journal of the American Statistical Association, 70, 271-294.

HUBER, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73-101.

HUBER, P. J. (1967). The behaviour of maximum likelihood estimates under nonstandard conditions. In L. M. LECAM and J. NEYMAN (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1 (pp. 221-233). Berkeley: University of California.

HUBER, P. J. (1977). Robust Statistical Procedures. Philadelphia: Society for Industrial and Applied Mathematics.

HUBER, P. J. (1981). Robust Statistics. New York: Wiley.

JAMES, W. and STEIN, C. M. (1960). Estimation with quadratic loss. In J. NEYMAN (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1 (pp. 361-379). Berkeley: University of California.

JAYNES, E. T. (1976). Confidence intervals vs. Bayesian intervals (with discussion). In W. L. HARPER and C. A. HOOKER (Eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science (pp. 175-257). Dordrecht: D. Reidel.

JEFFREYS, H. (1961). Theory of Probability (3rd ed.). Oxford: Oxford University Press.

JOHNSTONE, I. M. (1978). Problems in limit theory for martingales and posterior distributions from stochastic processes. Master's thesis, Australian National University.

KADANE, J. B., DICKEY, J. M., WINKLER, R. L., SMITH, W., and PETERS, S. C. (1980). Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 75, 845-854.

KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1985). Statistical implications of finitely additive probability. In P. GOEL and A. ZELLNER (Eds.), Bayesian Inference and Decision Techniques with Applications: Essays in Honor of Bruno DeFinetti (pp. 59-76). Amsterdam: Elsevier Science Publishers.

KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1996). Reasoning to a foregone conclusion. Journal of the American Statistical Association, 91, to appear.

KAGAN, A. M., LINNIK, Y. V., and RAO, C. R. (1965). On a characterization of the normal law based on a property of the sample average. Sankhya, Series A, 32, 37-40.

KAHNEMAN, D., SLOVIC, P., and TVERSKY, A. (Eds.) (1982). Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press.

KASS, R. E. and RAFTERY, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

KASS, R. E. and STEFFEY, D. (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association, 84, 717-726.

KASS, R. E., TIERNEY, L., and KADANE, J. B. (1988). Asymptotics in Bayesian computation. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics 3 (pp. 261-278). Oxford: Clarendon Press.

KASS, R. E., TIERNEY, L., and KADANE, J. B. (1990). The validity of posterior expansions based on Laplace's method. In S. GEISSER, J. S. HODGES, S. J. PRESS, and A. ZELLNER (Eds.), Bayesian and Likelihood Methods in Statistics and Econometrics (pp. 473-488). Amsterdam: Elsevier (North Holland).

KIEFER, J. and WOLFOWITZ, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906.

KERRIDGE, D. (1963). Bounds for the frequency of misleading Bayes inferences. Annals of Mathematical Statistics, 34, 1109-1110.

KINDERMAN, A. J. and MONAHAN, J. F. (1977). Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software, 3, 257-260.

KINGMAN, J. F. C. (1978). Uses of exchangeability. Annals of Probability, 6, 183-197.

KNUTH, D. E. (1984). The TeXbook. Reading, MA: Addison-Wesley.

KRAFT, C. H. (1964). A class of distribution function processes which have derivatives. Journal of Applied Probability, 1, 385-388.

KRASKER, W. and PRATT, J. W. (1986). Discussion of "On the consistency of Bayes estimates" by Diaconis and Freedman. Annals of Statistics, 14, 55-58.

KREM, A. (1963). On the independence in the limit of extreme and central order statistics. Publications of the Mathematical Institute of the Hungarian Academy of Science, 8, 469-474.


KSHIRSAGAR, A. M. (1972). Multivariate Analysis. New York: Marcel Dekker.

KULLBACK, S. (1959). Information Theory and Statistics. New York: Wiley.

LAMPORT, L. (1986). LaTeX: A Document Preparation System. Reading, MA: Addison-Wesley.

LAURITZEN, S. L. (1984). Extreme point models in statistics (with discussion). Scandinavian Journal of Statistics, 11, 65-91.

LAURITZEN, S. L. (1988). Extremal Families and Systems of Sufficient Statistics. Berlin: Springer-Verlag.

LAVINE, M. (1992). Some aspects of Polya tree distributions for statistical modelling. Annals of Statistics, 20, 1222-1235.

LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1991). Bayesian inference with specified prior marginals. Journal of the American Statistical Association, 86, 964-971.

LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1993). Linearization of Bayesian robustness problems. Journal of Statistical Planning and Inference, 37,307-316.

LECAM, L. M. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics, 1, 277-330.

LECAM, L. M. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Annals of Mathematical Statistics, 41, 802-828.

LECOUTRE, B. (1985). Reconsideration of the F-test of the analysis of variance: The semi-Bayesian significance test. Communications in Statistics-Theory and Methods, 14, 2437-2446.

LECOUTRE, B. and ROUANET, H. (1981). Deux structures statistiques fondamentales en analyse de la variance univariée et multivariée. Mathématiques et Sciences Humaines, 75, 71-82.

LEHMANN, E. L. (1958). Significance level and power. Annals of Mathematical Statistics, 29, 1167-1176.

LEHMANN, E. L. (1983). Theory of Point Estimation. New York: Wiley.

LEHMANN, E. L. (1986). Testing Statistical Hypotheses (2nd ed.). New York: Wiley.

LEHMANN, E. L. and SCHEFFE, H. (1955). Completeness, similar regions and unbiased estimates. Sankhya, 10, 305-340. (Also 15, 219-236, and correction 17, 250.)

LINDLEY, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.

LINDLEY, D. V. and NOVICK, M. R. (1981). The role of exchangeability in inference. Annals of Statistics, 9, 45-58.

LINDLEY, D. V. and PHILLIPS, L. D. (1976). Inference for a Bernoulli process (a Bayesian view). American Statistician, 30, 112-119.

LINDLEY, D. V. and SMITH, A. F. M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society (Series B), 34, 1-41.

Page 116: link.springer.com978-1-4612-4250-5/1.pdfApPENDIX A Measure and Integration Theory This appendix contains an introduction to the theory of measure and integration. The first section

References 685

LOEVE, M. (1977). Probability Theory I (4th ed.). New York: Springer-Verlag.

MAULDIN, R. D., SUDDERTH, W. D., and WILLIAMS, S. C. (1992). Polya trees and random distributions. Annals of Statistics, 20, 1203-1221.

MAULDIN, R. D. and WILLIAMS, S. C. (1990). Reinforced random walks and random distributions. Proceedings of the American Mathematical Society, 110, 251-258.

MENDEL, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen Naturforschender Vereines in Brünn, 10, 1.

METIVIER, M. (1971). Sur la construction de mesures aleatoires presque surement absolument continues par rapport a une mesure donnee. Zeitschrift fur Wahrscheinlichkeitstheorie, 20, 332-344.

MORRIS, C. N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). Journal of the American Statistical Association, 78, 47-65.

NACHBIN, L. (1965). The Haar Integral. Princeton: Van Nostrand.

NEYMAN, J. (1935). Su un teorema concernente le cosiddette statistiche sufficienti. Giornale Dell'Istituto Italiano degli Attuari, 6, 320-334.

NEYMAN, J. and PEARSON, E. S. (1933). On the problem of the most efficient test of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337.

NEYMAN, J. and SCOTT, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.

PEARSON, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine (5th Series), 50, 339-357. (See also correction, Philosophical Magazine (6th Series), 1, 670-671.)

PERLMAN, M. (1972). On the strong consistency of approximate maximum likelihood estimators. In L. M. LECAM, J. NEYMAN, and E. L. SCOTT (Eds.), Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1 (pp. 263-281). Berkeley: University of California.

PIERCE, D. A. (1973). On some difficulties in a frequency theory of inference. Annals of Statistics, 1, 241-250.

PITMAN, E. (1939). The estimation of location and scale parameters of a continuous population of any given form. Biometrika, 30, 391-421.

PRATT, J. W. (1961). Review of "Testing Statistical Hypotheses" by E. L. Lehmann. Journal of the American Statistical Association, 56, 163-167.

PRATT, J. W. (1962). Discussion of "On the foundations of statistical inference" by Allan Birnbaum. Journal of the American Statistical Association, 57, 314-316.

RAO, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81-91.

RAO, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.). New York: Wiley.

ROBBINS, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In J. NEYMAN (Ed.), Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (pp. 131-148). Berkeley: University of California.

ROBBINS, H. (1955). An empirical Bayes approach to statistics. In J. NEYMAN (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1 (pp. 157-164). Berkeley: University of California.

ROBBINS, H. (1964). The empirical Bayes approach to statistical decision prob­lems. Annals of Mathematical Statistics, 35, 1-20.

ROBERT, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3, 601-608.

ROBERTS, H. V. (1967). Informative stopping rules and inferences about popu­lation size. Journal of the American Statistical Association, 62, 763-775.

ROUANET, H. and LECOUTRE, B. (1983). Specific inference in ANOVA: From significance tests to Bayesian procedures. British Journal of Mathematical and Statistical Psychology, 36, 252-268.

ROYDEN, H. L. (1968). Real Analysis. London: Macmillan.

RUBIN, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.

RUDIN, W. (1964). Principles of Mathematical Analysis (2nd ed.). New York: McGraw-Hill.

SAVAGE, L. J. (1954). The Foundations of Statistics. New York: Wiley.

SAVAGE, L. J. (1962). The Foundations of Statistical Inference. London: Methuen.

SCHEFFE, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics, 18, 434-438.

SCHERVISH, M. J. (1983). User-oriented inference. Journal of the American Statistical Association, 78, 611-615.

SCHERVISH, M. J. (1992). Bayesian analysis of linear models (with discussion). In J. M. BERNARDO, J. O. BERGER, A. P. DAWID, and A. F. M. SMITH (Eds.), Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting (pp. 419-434). Oxford: Clarendon Press.

SCHERVISH, M. J. (1994). Discussion of "Bootstrap: More than a stab in the dark?" by G. A. Young. Statistical Science, 9, 408-410.

SCHERVISH, M. J. (1996). P-values: What they are and what they are not. American Statistician, 50, to appear.

SCHERVISH, M. J. and CARLIN, B. P. (1992). On the convergence of successive substitution sampling. Journal of Computational and Graphical Statistics, 1, 111-127.

SCHERVISH, M. J. and SEIDENFELD, T. (1990). An approach to consensus and certainty with increasing evidence. Journal of Statistical Planning and Inference, 25, 401-414.

SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1984). The extent of non-conglomerability of finitely additive probabilities. Zeitschrift fur Wahrscheinlichkeitstheorie, 66, 205-226.

SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1990). State dependent utilities. Journal of the American Statistical Association, 85, 840-847.

SCHWARTZ, L. (1965). On Bayes procedures. Zeitschrift fur Wahrscheinlichkeitstheorie, 4, 10-26.

SEIDENFELD, T. and SCHERVISH, M. J. (1983). A conflict between finite addi­tivity and avoiding Dutch Book. Philosophy of Science, 50, 398-412.

SEIDENFELD, T., SCHERVISH, M. J., and KADANE, J. B. (1995). A representation of partially ordered preferences. Annals of Statistics, 23, 2168-2217.

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.

SETHURAMAN, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

SINGH, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics, 9, 1187-1195.

SMITH, A. F. M. (1973). A general Bayesian linear model. Journal of the Royal Statistical Society (Series B), 35, 67-75.

SPJØTVOLL, E. (1983). Preference functions. In P. J. BICKEL, K. DOKSUM, and J. L. HODGES, JR. (Eds.), A Festschrift for Erich L. Lehmann (pp. 409-432). Belmont, CA: Wadsworth.

STATSCI (1992). S-PLUS, Version 3.1 (software package). Seattle: StatSci Divi­sion, MathSoft, Inc.

STEIN, C. M. (1946). A note on cumulative sums. Annals of Mathematical Statistics, 17, 498-499.

STEIN, C. M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In J. NEYMAN (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1 (pp. 197-206). Berkeley: University of California.

STEIN, C. M. (1965). Approximation of improper prior measures by prior proba­bility measures. In J. NEYMAN and L. M. LECAM (Eds.), Bernoulli, Bayes, Laplace: Anniversary Volume (pp. 217-240). New York: Springer-Verlag.

STEIN, C. M. (1981). Estimation of the mean of a multivariate normal distribu­tion. Annals of Statistics, 9, 1135-1151.

STIGLER, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap.

STONE, M. (1976). Strong inconsistency from uniform priors. Journal of the American Statistical Association, 71, 114-125.

STONE, M. and DAWID, A. P. (1972). Un-Bayesian implications of improper Bayes inference in routine statistical problems. Biometrika, 59, 369-375.

STRASSER, H. (1981). Consistency of maximum likelihood and Bayes estimates. Annals of Statistics, 9, 1107-1113.

STRAWDERMAN, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42, 385-388.

TAYLOR, R. L., DAFFER, P. Z., and PATTERSON, R. F. (1985). Limit Theorems for Sums of Exchangeable Random Variables. Totowa, NJ: Rowman and Allanheld.

TIERNEY, L. (1994). Markov chains for exploring posterior distributions (with discussion). Annals of Statistics, 22, 1701-1762.

TIERNEY, L. and KADANE, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82-86.

VENN, J. (1876). The Logic of Chance (2nd ed.). London: Macmillan.

VERDINELLI, I. and WASSERMAN, L. (1991). Bayesian analysis of outlier problems using the Gibbs sampler. Statistics and Computing, 1, 105-117.

VON MISES, R. (1957). Probability, Statistics and Truth. London: Allen and Unwin.

VON NEUMANN, J. and MORGENSTERN, O. (1947). Theory of Games and Economic Behavior (2nd ed.). Princeton: Princeton University Press.

WALD, A. (1947). Sequential Analysis. New York: Wiley.

WALD, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20, 595-601.

WALD, A. and WOLFOWITZ, J. (1948). Optimum character of the sequential probability ratio test. Annals of Mathematical Statistics, 19, 326-339.

WALKER, A. M. (1969). On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society (Series B), 31, 80-88.

WALLACE, D. L. (1959). Conditional confidence level properties. Annals of Mathematical Statistics, 30, 864-876.

WELCH, B. L. (1939). On confidence limits and sufficiency, with particular reference to parameters of location. Annals of Mathematical Statistics, 10, 58-69.

WEST, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society (Series B), 46,431-439.

WILKS, S. S. (1941). Determination of sample sizes for setting tolerance limits. Annals of Mathematical Statistics, 12, 91-96.

YOUNG, G. A. (1994). Bootstrap: More than a stab in the dark? (with discussion). Statistical Science, 9, 382-415.

ZELLNER, A. (1971). An Introduction to Bayesian Inference in Econometrics. New York: Wiley.

Notation and Abbreviation Index

0 (vector of 0s), 385
1 (vector of 1s), 345
2^S (power set), 571
« (absolutely continuous), 574, 597
a.e. (almost everywhere), 582
ANCB(·,·,·) (distribution), 668
ANCχ²(·,·,·) (distribution), 668
ANCF(·,·,·) (distribution), 669
ANOVA (analysis of variance), 384
ARE (asymptotic relative efficiency), 413
a.s. (almost surely), 582
A_X (σ-field generated by X), 51, 82
N (action space), 144
(action space σ-field), 144
\ (remove one set from another), 577
B (closure of set), 622
Ber(·) (distribution), 672
Beta(·,·) (distribution), 669
Bin(·,·) (distribution), 673
B^k (Borel σ-field), 576
B (Borel σ-field), 575
Cau(·,·) (distribution), 669
CDF (cumulative distribution function), 612
c_g (constant related to RHM), 367
C_g (constant related to LHM), 367
χ² (distribution), 669
→_D (converges in distribution), 635
→_P (converges in probability), 638
→_w (converges weakly), 635
Cov_θ(·,·) (conditional covariance given Θ = θ), 19
Cov (covariance), 613
C_P (σ-field on set of probability measures), 27
A^C (complement of set), 575
Dir(·) (Dirichlet process), 54
Dir_k(·,...,·) (distribution), 674
dμ2/dμ1 (Radon-Nikodym derivative), 575, 598
Δ (symmetric difference), 581
E_θ(·) (conditional mean given Θ = θ), 19
Exp(·) (distribution), 670
E(·) (expected value), 607, 613
E(·|·) (conditional mean), 616
f+ (positive part), 588
f- (negative part), 588
f_{X|Θ} (conditional density of X given Θ), 13
f_{X|Y} (conditional density), 13
F_{q,a} (distribution), 670
Γ⁻¹(·,·) (distribution), 670
Γ(·,·) (distribution), 670
Geo(·) (distribution), 673
HPD (highest posterior density), 327
Hyp(·,·,·) (distribution), 673
IID (independent and identically distributed), 611
I_k (identity matrix), 643
I_{X|T}(·;·|·) (conditional Kullback-Leibler information), 115
I_{X|T}(·|·) (conditional Fisher information), 111
I_X(·) (Fisher information), 111
I_X(·;·) (Kullback-Leibler information), 115
I_A(·) (indicator function), 9
Lap(·,·) (distribution), 670
λ_g (measure constructed from LHM), 367
LHM (left Haar measure), 363
LMP (locally most powerful), 245
LMVUE (locally minimum variance unbiased estimator), 300
LR (likelihood ratio), 274
MC (most cautious), 230
MLE (maximum likelihood estimator), 307
MLR (monotone likelihood ratio), 239
MP (most powerful), 230
MRE (minimum risk equivariant), 347
Mult_k(·,...,·) (distribution), 674
μ_{Θ|X}(·|·) (posterior distribution), 16
NCB(·,·,·) (distribution), 671
NCχ² (distribution), 671
NCF(·,·,·) (distribution), 671
NCt_a(·) (distribution), 671
NCT_a(·;·) (CDF of NCt distribution), 671
Negbin(·,·) (distribution), 673
ν (dominating measure), 13
N(·,·) (distribution), 671
N_p(·,·) (distribution), 674
N (integers plus ∞), 537
Ω (parameter space), 13, 82
o_P (stochastic small order), 396
O_P (stochastic large order), 396
o (small order), 394
O (large order), 394
P_Ω (parametric family), 50, 82
Par(·,·) (distribution), 672
∂B (boundary of set), 636
Φ(·) (CDF of normal distribution), 672
P_n (empirical probability measure), 12
Poi(·) (distribution), 673
Pr(·) (probability), 612
Pr(·|·) (conditional probability), 617
P_{θ,T}(·) (conditional distribution of T given Θ = θ), 84
P_θ(·) (conditional probability given Θ = θ), 51, 83
P_θ(·) (conditional distribution given Θ = θ), 51, 83
P (random probability measure), 25
P (set of all probability measures), 27
Q_n(·|x) (conditional distribution given n observations), 539
ℝ (real numbers), 570
ℝ+ (positive reals), 571
ℝ+0 (nonnegative reals), 627
ρ_g (measure constructed from RHM), 367
RHM (right Haar measure), 363
r(η,δ) (Bayes risk), 149
R(θ,δ) (risk function), 149
(S, A, μ) (measure space), 577
SPRT (sequential probability ratio test), 549
SSS (successive substitution sampling), 507
τ (parameter space σ-field), 13
θ (parametric index), 50
Θ (parameter), 51
T (statistic), 84
T_a(·) (CDF of t distribution), 672
t_a(·,·) (distribution), 672
v^T (transpose of vector), 614
UMA (uniformly most accurate), 317
UMAU (uniformly most accurate unbiased), 321
UMC (uniformly most cautious), 230
UMCU (uniformly most cautious unbiased), 254
UMP (uniformly most powerful), 230
UMPU (uniformly most powerful unbiased), 254
UMPUAI (uniformly most powerful unbiased almost invariant), 384
UMVUE (uniformly minimum variance unbiased estimator), 297
USC (upper semicontinuous), 417
U(·,·) (distribution), 672
Var_θ(·) (conditional variance given Θ = θ), 19
Var (variance), 613
x (element of sample space), 82
X (sample space), 13, 82

Name Index

Ahlfors, L., 667, 675
Aitchison, J., 325, 675
Albert, J., 519, 675
Aldous, D., 46, 79, 482, 675
Anderson, T., 386, 675
Andrews, C., ix
Anscombe, F., 181, 675
Antoniak, C., 59, 675
Aumann, R., 181, 675

Bahadur, R., 94, 675
Barnard, G., 320, 420, 675
Barndorff-Nielsen, O., 307, 675
Barnett, V., vii, 675
Barron, A., 434-435, 446, 675, 677
Basu, D., 99-100, 675
Bayes, T., 16, 29, 676
Becker, R., x, 676
Berberian, S., 507, 667, 676
Berger, J., 22, 173, 284, 525, 565, 614, 666, 676
Berger, R., 283, 677
Berk, R., 417, 430, 432, 676
Berkson, J., 218, 281, 676
Berry, D., 565, 676
Berti, P., 21, 676
Bhattacharyya, A., 305
Bickel, P., 330-331, 676
Billingsley, P., 46, 621, 636, 648, 676
Bishop, Y., 462, 676
Blackwell, D., 56, 86, 152, 455, 676
Blyth, C., 158, 676
Bohrer, R., x
Bondar, J., 236, 676
Bortkiewicz, L., 462, 677
Box, G., 21, 521, 677
Breiman, L., 618, 640, 677
Brenner, D., 435, 677
Brown, L., 99, 160, 167, 677
Buck, C., 665, 677
Buehler, R., 99, 677

Carlin, B., 507, 686
Casella, G., 283, 677
Chaloner, K., 24, 677
Chambers, J., x, 676
Chang, T., 379, 677
Chapman, D., 303, 677
Chen, C., 435, 677
Chib, S., 519, 675
Chow, Y., 647, 677
Church, T., 24, 677
Churchill, R., 666-667, 677
Clarke, B., 446, 677
Cornfield, J., 563, 565, 678
Cox, D., 21, 218, 424, 521, 677-678
Cramer, H., 301, 678

Daffer, P., 33, 688
David, H., 404, 678
Dawid, A., 21, 125, 435, 521, 678, 687
DeFinetti, B., ix, 6, 21, 25, 28, 654, 656-657, 678
DeGroot, M., ix, 91, 98, 181, 362, 536, 654, 678
DeMoivre, A., 8, 678
Diaconis, P., ix, 15, 28, 41, 46, 108, 123, 126, 426, 434, 479-480, 667, 678-680
Dickey, J., 24, 679, 681-682
Doob, J., 36, 429, 507, 645-646, 679
Doytchinov, B., ix
Dubins, L., 70, 455, 676, 679
Dugundji, J., 666, 679
Dunford, N., 507, 635, 667, 679
Dunsmore, I., 325, 675

Eberhardt, K., 326, 679
Eddy, W., 521, 681
Edwards, W., 222, 284, 679
Efron, B., 166, 330-331, 335-336, 423, 679
Escobar, M., 60, 679

Fabius, J., 61, 679
Fedderson, A., 99, 677
Ferguson, T., 52, 56, 61, 173, 179, 181, 248, 258, 614, 666, 680
Ferrandiz, J., 669, 680
Fieller, E., 321, 680
Fienberg, S., 462, 676
Fishburn, P., 181, 680
Fisher, R., 89, 96, 217-218, 307, 370, 373, 522, 680
Fraser, D., 435, 677, 680
Freedman, D., 15, 28, 40-41, 46, 61, 70, 123, 126, 330-331, 426, 434, 479-480, 667, 676, 678-680
Freedman, L., 24, 680
Freeman, P., 524, 681

Gabriel, K., 252, 681
Garthwaite, P., 24, 681
Gavasakar, U., 24, 681
Geisser, S., 521, 668, 681
Gelfand, A., 507, 681
Geman, D., 507, 681
Geman, S., 507, 681
Ghosh, J., 381-382, 681
Gnanadesikan, R., 22, 681
Good, I., 565, 681

Hadjicostas, P., ix
Hall, P., 337-338, 681
Hall, W., 381-382, 681
Halmos, P., 364, 600, 681
Hampel, F., 315, 681
Hartigan, J., 20-21, 33, 681
Heath, D., 21, 46, 681-682
Hewitt, E., 46, 682
Heyde, C., 435, 682
Hill, B., 9, 484, 682
Hinkley, D., 218, 423, 678-679
Hodges, J., 414
Hoel, P., 640, 682
Hogarth, R., 24, 682
Holland, P., 462, 676
Huber, P., 310, 315, 428, 682
Hwang, J., 160, 677

James, W., 163, 682
Jaynes, E., 379, 682
Jeffreys, H., 122, 229, 284, 682
Jiang, T., ix
Johnstone, I., 435, 682

Kadane, J., 21, 24, 183-184, 446, 564, 655, 682-683, 687-688
Kagan, A., 349, 683
Kahneman, D., 23, 683
Kass, R., ix, 226, 446, 505, 683
Kerridge, D., 564, 683
Kiefer, J., 417, 420, 683
Kinderman, A., 660, 683
Kingman, J., 36, 683
Knuth, D., x, 683
Kraft, C., 66, 683
Krasker, W., 56, 683
Krem, A., 408, 683
Kshirsagar, A., 386, 684
Kullback, S., 116, 684

Lamport, L., x, 684
Lane, D., 9, 682
Lauritzen, S., 28, 123, 481, 684
Lavine, M., 69, 526, 684
LeCam, L., 414, 437, 684
Lecoutre, B., 668-669, 684, 686
Lehmann, E., 231, 280, 285, 298, 350, 684
Levy, P., 648, 650
Lindley, D., 6, 229, 284, 479, 684
Lindman, H., 222, 284, 679
Linnik, Y., 349, 683
Loeve, M., 34, 653, 685
Louis, T., 24, 677

Matts, J., 24, 677
Mauldin, R., 66, 69, 685
McDunnough, P., 435, 677, 680
Mee, R., 326, 679
Mendel, G., 217, 685
Metivier, M., 66, 685
Monahan, J., 660, 683
Morgenstern, O., 181-182, 688
Morris, C., 166, 500, 679, 685

Nachbin, L., 364, 685
Neyman, J., 89, 175, 231, 247, 420, 685
Nobile, A., ix
Novick, M., ix, 6, 684

Oue, S., ix

Patterson, R., 33, 688
Pearson, E., 175, 231, 247, 685
Pearson, K., 216, 685
Perlman, M., 430, 685
Peters, S., 24, 682
Phillips, L., 6, 684
Pierce, D., 99, 685
Pitman, E., 347, 685
Port, S., 640, 682
Portnoy, S., x
Pratt, J., 56, 98, 683, 685

Raftery, A., 226, 683
Ramamoorthi, R., 86, 676
Rao, C., 152, 301, 349, 683, 685-686
Reeve, C., 326, 679
Regazzini, E., 21, 676
Rigo, P., 21, 676
Robbins, H., 303, 647, 677, 686
Robert, C., 225, 686
Roberts, H., 565, 686
Ronchetti, E., 315, 681
Rouanet, H., 668-669, 684, 686
Rousseeuw, P., 315, 681
Royden, H., 578, 589, 597, 621, 686
Rubin, D., 332, 686
Rudin, W., 666, 686

Savage, L., 46, 181, 222, 284, 565, 600, 679, 681-682, 686
Scheffe, H., 298, 634, 684, 686
Schervish, M., v
Schwartz, J., 507, 635, 667, 679
Schwartz, L., 429, 687
Scott, E., 420, 685
Seidenfeld, T., ix, 21, 183-184, 187, 429, 564, 655, 682-683, 686-687
Sellke, T., 284, 676
Serfling, R., 413, 687
Sethuraman, J., 56, 687
Short, T., ix
Shurlow, N., v
Siegmund, D., 647, 677
Singh, K., 331, 687
Slovic, P., 23, 683
Smith, A., 479, 507, 681, 684, 687
Smith, W., 24, 682
Spiegelhalter, D., 24, 680
Spjøtvoll, E., 283, 687
Stahel, W., 315, 681
Steffey, D., 505, 683
Stein, C., 163, 379, 382, 568, 682, 687
Stigler, S., 8, 687
Stone, C., 640, 682
Stone, M., 21, 678, 687
Strasser, H., 430, 688
Strawderman, W., ix, 166, 688
Sudderth, W., 9, 21, 46, 66, 69, 681-682, 685

Taylor, R., 33, 688
Tiao, G., 521, 677
Tibshirani, R., 336, 679
Tierney, L., 225, 446, 507, 683, 688
Tversky, A., 23, 683

Venn, J., 8, 688
Verdinelli, I., 524, 688
Villegas, C., 379, 677
Von Mises, R., 10, 688
Von Neumann, J., 181-182, 688

Wald, A., 415, 549, 552, 557, 688
Walker, A., 435, 442, 688
Wallace, D., 99, 688
Wasserman, L., ix, 524, 526, 684, 688
Welch, B., 320, 688
West, M., 524, 688
Wijsman, R., x, 381-382, 681
Wilks, A., x, 676
Wilks, S., 325, 688
Williams, S., 66, 69, 685
Winkler, R., 24, 682
Wolfowitz, J., 417, 420, 557, 683, 688
Wolpert, R., 526, 684

Ylvisaker, D., 108, 679
Young, G., 329, 688

Zellner, A., 16, 688
Zidek, J., 21, 678

Subject Index*

Abelian group, 353 Absolutely continuous, 574, 597, 668 Absolutely continuous function, 211 Accept hypothesis, 214 Acceptance-rejection, 659 Action space, 144 Admissible, 154-157, 162, 167, 174

A, 154-156, 162 Almost everywhere, 572, 582 Almost invariant function, 383 Almost surely, 572, 582 Alternate noncentral beta

distribution, 668 Alternate non central X2 distribution,

668 Alternate noncentral F distribution,

669 Alternative, 2, 214

composite, 215 simple, 215, 233

Analysis of variance, 384, 491 Analytic function, 105 Ancillary statistic, 95, 99, 119

maximal, 97 ANOVA, 384, 491 Archemedian condition, 192 ARE,413 Asymptotic distribution, 399 Asymptotic efficiency, 413 Asymptotic relative efficiency, 413 Asymptotic variance, 402 Autoregression, 141 Autoregressive process, 441 Axioms of decision theory, 183-184,

296

Backward induction, 537 Bahadur's theorem, 94 Base measure, 54 Base of test, 215--216

'Italicized page numbers indicate where a term is defined.

Basu's theorem, 99 Bayes factor, 221, 238, 262-263, 274 Bayes risk, 149 Bayes rule, 150, 154-155, 167-168,

178 extended, 169 formal, 146, 150, 157, 348, 351,

369 generalized, 156-157 partial, 147, 150

Bayes' theorem, 4, 16 Bayesian bootstrap, 332 Bernoulli distribution, 672 Beta distribution, 54, 669 Bhattacharyya lower bounds, 305 Bias, 296 . Bimeasurable function, 572, 583, 618 Binomial distribution, 673 Bolzano-Weierstrass theorem, 666 Bootstrap, 329

Bayesian, 332 nonparametric, 329 parametric, 330

Borel a-field, 571, 575 Borel space, 609, 618 Borel-Cantelli lemma:

first, 578 second, 663

Boundary, 636 Boundedly complete statistic, 94, 99 Box-Cox transformations, 521

Called-off preference, 184 Caratheodory extension theorem, 578 Cauchy distribution, 669 Cauchy sequence, 619 Cauchy's equation, 667 Cauchy-Schwarz inequality, 615 CDF,612

empirical, 404-405, 408 Central limit theorem, 642

multivariate, 643 Chain rule, 600 Chapman-Robbins lower bound, 304

Page 126: link.springer.com978-1-4612-4250-5/1.pdfApPENDIX A Measure and Integration Theory This appendix contains an introduction to the theory of measure and integration. The first section

Characteristic function, 611, 639 Chi-squared distribution, 669 Chi-squared test of independence, 467 Closed set, 622 Closure, 622 Coherent tests, 252 Complete class, 174

essentially, 174, 244, 251, 256 minimal, 174

minimal, 174-175 Complete class theorem, 179 Complete measure space, 579, 603 Complete metric space, 619 Complete statistic, 94, 298

boundedly, 94, 99 Composite alternative, 215 Composite hypothesis, 215 Conditional distribution, 13, 16, 607,

609,617 regular, 610, 618 version, 617

Conditional expectation, 19,607, 616 version, 608, 616

Conditional Fisher information, 111, 119

Conditional independence, 9, 610, 628 Conditional Kullback-Leibler

information, 115, 119 Conditional mean, 607, 616

version, 616 Conditional preference, 185

consistent, 186 Conditional probability, 607, 609, 617

regular, 609, 617 Conditional score function, 111 Conditionally sufficient statistic, 95 Confidence coefficient, 315, 325 Confidence interval, 3

fixed-width, 559 sequential, 559

Confidence sequence, 569 Confidence set, 279, 315, 379

conservative, 315 exact, 315 randomized, 316 UMA,317 UMAU, 321

Conjugate prior, 92 Conservative confidence set, 315

Subject Index 695

Conservative prediction set, 324 Conservative tolerance set, 325 Consistent, 397, 412 Consistent conditional preference, 186 Consistent distributions, 652 Contingency table, 467 Continuity axiom, 184 Continuity theorem, 640 Continuous distribution, 612 Continuous mapping theorem, 638 Convergence:

pointwise, 184 weak, 399, 635

Convergence in distribution, 399, 611, 635

Convergence in probability, 396, 611, 638

Convex function, 614 Counting measure, 570 Covariance, 607, 613 Cramer-Rao lower bound, 301

multiparameter, 306 Credible set, 327 Cumulative distribution function (see

CDF),612 Cylinder set, 652

Data, 82 Decide optimally after stopping, 540 Decision rule, 145

maximum, 541 nonrandomized, 145, 151, 153 nonrandomized sequential, 537 randomized, 145, 151 randomized sequential, 537 regular, 54{}-541 sequential, 537

nonrandomized, 537 randomized, 537

terminal, 537 truncated, 542

Decision theory, 144, 181 axioms, 183-184

Decreasing sequence of sets, 577 DeFinetti's representation theorem,

28 Degenerate exponential family, 104 Degenerate weak order, 183 Delta method, 401, 464, 466

Page 127: link.springer.com978-1-4612-4250-5/1.pdfApPENDIX A Measure and Integration Theory This appendix contains an introduction to the theory of measure and integration. The first section

696 Subject Index

Dense, 619 Density, 607, 613 Dirichlet distribution, 52, 54, 674 Dirichlet process, 52, 54, 332, 434 Discrete distribution, 612 Distribution:

alternate noncentral beta, 668 alternate noncentral X?, 668 alternate noncentral F, 669 asymptotic, 399 Bernoulli, 672 beta, 54, 669 binomial, 673 Cauchy, 669 chi-squared, 669 conditional, 13, 16 consistent, 652 continuous, 612 Dirichlet, 52, 54, 674 discrete, 612 empirical, 12, 38 exponential, 670 F,670 fiducial, 370, 373 gamma, 670 geometric, 673 half-normal, 389 hyper geometric, 673 inverse gamma, 670 Laplace, 670 least favorable, 168 marginal, 14 multinomial, 674 multivariate normal, 643, 674 negative binomial, 673 noncentral beta, 289, 671 noncentral X2 , 671 noncentral F, 289, 671 noncentral t, 289, 325, 671 normal, 21, 349, 611, 640, 642,

671 multivariate, 643, 674

Pareto, 672 Poisson, 673 posterior, 16 predictive, 14

posterior, 18 prior, 14

prior, 13

improper, 20 t,672 uniform, 659, 672

Distribution function (see CDF), 612 Dominance axiom, 185 Dominated convergence theorem, 591 Dominates, 154 Dominating measure, 574, 597 Dutch book, 656

Efficiency: asymptotic, 413 asymptotic relative, 413 second-order, 414

Elicitation of probabilities, 22-23 Empirical Bayes, 166, 420, 500 Empirical CDF, 404-405, 408 Empirical distribution, 12, 38 Empirical probability measure, 12 (-contamination class, 524, 526, 528 Equal-tailed test, 263 Equivalence class, 140 Equivalence relation, 140 Equivariant rule, 357

location, 346-347, 351 minimum risk (see MRE), 347 scale, 350

Essentially complete class, 174, 244, 251, 256

minimal, 174 Estimator, 3, 296

maximum likelihood, 3, 307 MFUE, 347, 351, 363 Pitman, 347, 363 point, 3, 296 unbiased, 3, 296

Event, 606, 612 Exact confidence set, 315 Exact prediction set, 324 Exchangeable, 7, 27-28

partially, 125, 479 row and column, 482

Expectation, 607, 613 conditional, 19, 616

Expected Fisher information, 423 Expected loss principle, 146, 181 Expected value (see Expectation),

613 Exponential distribution, 670

Page 128: link.springer.com978-1-4612-4250-5/1.pdfApPENDIX A Measure and Integration Theory This appendix contains an introduction to the theory of measure and integration. The first section

Exponential family, 102-103, 105, 109, 155, 239, 249

degenerate, 104 nondegenerate, 104

Extended Bayes rule, 169 Extended real numbers, 571 Extremal family, 123, 125

F distribution, 670 Fatou's lemma, 589 FI regularity conditions, 111 Fiducial distribution, 370, 373 Field, 571, 575 Finite population sampling, 74 Finitely additive probability, 21, 281,

564, 657 Fisher information, 111, 113, 301,

412, 463 conditional, 111, 119 expected, 423 observed, 226, 424, 435

Fisher-Neyman factorization theorem, 89

Fixed point, 505 Fixed-point problem, 505 Fixed-width confidence interval, 559 Floor of test, 215-216 Formal Bayes rule, 146, 150, 157,348,

351, 369 Fubini's theorem, 596 Function:

absolutely continuous, 211 bimeasurable, 583 measurable, 572, 583 simple, 586

Gamma distribution, 670 General linear group, 354 Generalized Bayes rule, 156-157, 159 Generalized Neyman-Pearson lemma,

247 Generated u-field, 571-572, 584 Geometric distribution, 673 Gibbs sampling, 507 Goodness of fit test, 218, 461 Gross error sensitivity, 312 Group, 35~355-356

abelian, 353 general linear, 354

Subject Index 697

location, 354 location-scale, 354, 357, 368 permutation, 355 scale, 354

Haar measure: left, 363

related, 366 right, 363

related, 366 Hahn decomposition theorem, 605 Half-normal distribution, 389 Hierarchical model, 166, 476 Highest posterior density region (see

HPD),327 Hilbert space, 507 Hilbert-Schmidt-type operator, 507,

667 Horse lottery, 182 Hotelling's T2, 388 HPD region, 327, 329, 343 Hypergeometric distribution, 673 Hyperparameters, 477 Hypothesis, 2, 214

composite, 215 one-sided, 241 simple, 215, 233

Hypothesis test, 2 predictive, 219, 325 randomized, 3

Hypothesis-testing loss, 214

Identity element of group, 353 Ignorable statistic, 142 lID, 2, 8, 611, 628

conditionally, 9-10, 83,611, 628 Image sigma field, 584 Importance sampling, 403, 661 Improper prior, 20, 122, 223, 263 Inadmissible, 154 Increasing sequence of sets, 577 Independence, 610, 628

conditional, 9, 610, 628 Indifferent, 183 Induced measure, 575, 601 Infinitely often, 578, 663 Influence function, 311 Information:

Fisher, 111, 113,463


    Kullback-Leibler, 115-116
Integrable, 588
    uniformly, 592
Integral, 573, 587-588
    over a set, 588
Invariance of distributions, 355
Invariant function, 357
    almost, 383
    location, 346
    maximal, 358
    scale, 350
Invariant loss, 356
    location, 346
    scale, 350-351
Invariant measure, 363
Inverse function theorem, 666
Inverse gamma distribution, 670
Inverse of group element, 353
Jacobian, 625
James-Stein estimator, 163, 486
Jeffreys' prior, 122, 446
Jensen's inequality, 614
Kolmogorov zero-one law, 631
Kullback-Leibler divergence, 116
Kullback-Leibler information, 115-116
    conditional, 115, 119

Lévy's theorem, 648, 650
λ-admissible, 154-156, 162
Laplace approximation, 226, 446
Laplace distribution, 670
Large order, 394
    stochastic, 396
Law of large numbers:
    strong, 34-36
    weak, 642
Law of the unconscious statistician, 607, 613
Law of total probability, 632
Least favorable distribution, 168
Lebesgue measure, 571, 580
Left Haar measure, 363
    related, 366
Lehmann-Scheffé theorem, 298
L-estimator, 410
Level of test, 215-216
LHM, 363
Likelihood function, 2, 13, 307
Likelihood ratio test (see LR test), 274
Linear regression, 276, 321
LMP test, 245, 265, 289
LMPU test, 265, 292
LMVUE, 300
Locally minimum variance unbiased estimator, 300
Locally most powerful test (see LMP), 245
Location equivariant rule, 346
Location estimation, 346
Location group, 354
Location invariant function, 346
Location invariant loss, 346
Location parameter, 344
Location-scale group, 354
Location-scale parameter, 345
Look-ahead decision rule, 546
Loss function, 144, 162, 189, 296
    convex, 349
    hypothesis-testing, 214
    squared-error, 146, 297
    0-1, 215
    0-1-c, 215, 218
Lower boundary, 170, 179, 233-235, 287
LR test, 223, 273-274, 458-459

Marginal distribution, 14, 607
Marginalization paradox, 21
Markov chain, 15, 507, 650
Markov chain Monte Carlo, 507
Markov inequality, 614
Martingale, 645-646
    reversed, 33, 649
Martingale convergence theorem, 648-649
Maximal ancillary, 97
Maximal invariant, 358
Maximin strategy, 168
Maximin value, 168
Maximum likelihood estimator, 3, 307, 415, 418-421
Maximum modulus theorem, 667
Maximum of decision rules, 541
MC test, 230


Mean, 607, 613
    conditional, 616
    trimmed, 314
Measurable function, 572, 583
Measure, 570, 572, 575, 577
    induced, 601
    Lebesgue, 571, 580
    product, 595
    σ-finite, 572, 578, 601
    signed, 577, 597
Measure space, 572, 577
M-estimator, 313-315, 424-428, 434
Method of Laplace, 226, 446
Method of moments, 340
Mill's ratio, 470
Minimal complete class, 174-175
Minimal essentially complete class, 174
Minimal sufficient statistic, 92
Minimax principle, 167, 189
Minimax rule, 167-169
Minimax theorem, 172
Minimax value, 168
Minimum risk equivariant (see MRE), 347
MLE, 3, 307, 415, 418-421
MLR, 239-244
Monotone convergence theorem, 590
Monotone likelihood ratio, 239-244
Monotone sequence of sets, 577
Most cautious test, 230
Most powerful test, 230
MP test, 230
MRE, 347, 349, 351, 363
Multinomial distribution, 674
Multiparameter Cramér-Rao lower bound, 306
Multivariate central limit theorem, 643
Multivariate normal distribution, 643, 674

Natural parameter, 103, 105
Natural parameter space, 103, 105
Natural sufficient statistic, 103
Negative binomial distribution, 673
Negative part, 573, 588
Negative set, 598
Neyman structure, 266
Neyman-Pearson fundamental lemma, 175, 231
NM-lottery, 182
Noncentral beta distribution, 289, 671
Noncentral χ² distribution, 671
Noncentral F distribution, 289, 671
Noncentral t distribution, 289, 325, 671
Nondegenerate exponential family, 104
Nondegenerate weak order, 183
Nonnull states, 184
Nonparametric, 52
Nonparametric bootstrap, 329
Nonrandomized decision rule, 145, 151, 153
Nonrandomized sequential decision rule, 537
Normal distribution, 21, 349, 611, 640, 642, 671
    multivariate, 643, 674
Null states, 184

Observed Fisher information, 226, 424, 435
One-sided hypothesis, 241
One-sided test, 239, 243
Open set, 571
Operating characteristic, 215
Orbit, 358
Order statistics, 86
Outliers, 521
Parameter, 1, 6, 50-51, 82
    location, 344
    location-scale, 345
    natural, 103, 105
    scale, 345
Parameter space, 1, 50, 82
    natural, 103, 105
Parametric bootstrap, 330
Parametric family, 1, 50, 102
Parametric index, 33, 50
Parametric models, 12, 49
Pareto distribution, 672
Partial Bayes rule, 147, 150
Partially exchangeable, 125, 479


Percentile-t bootstrap confidence interval, 336
Permutations, 355
π-λ theorem, 576
Pitman's estimator, 347, 363
Pivotal, 316, 370, 373
Point estimation, 296
Point estimator, 296
Pointwise convergence, 184
Poisson distribution, 673
Polish space, 619
Polya tree distribution, 69
Polya urn scheme, 9
Portmanteau theorem, 636
Positive part, 573, 588
Positive set, 598
Posterior distribution, 4, 16
    asymptotic normality, 435, 437, 442-443
    consistency, 429-430
Posterior predictive distribution, 18
Posterior risk, 146, 150
Power function, 2, 215, 240
Power set, 571
Prediction set, 324-325
    conservative, 324
    exact, 324
Predictive distribution, 14, 455
    posterior, 18
    prior, 14
Predictive hypothesis test, 219, 325
Preference, 182
    conditional, 185
    consistent, 186
Prevision, 655
Prior distribution, 4, 13
    improper, 20, 223, 263
    natural conjugate family, 92
Prize, 181
Probability, 572, 577
    empirical, 12
    random, 27
Probability integral transform, 519, 659
Probability space, 572, 577, 606, 612
Product measure, 595
Product σ-field, 576
Product space, 576
Pseudorandom numbers, 659
Pure significance test, 217
P-value, 279, 375, 380

Quantile:
    sample, 404-405, 408

Radon-Nikodym derivative, 575, 598
Radon-Nikodym theorem, 597
Random probability measure, 27
Random quantity, 82, 606, 612
Random variables, 606, 612
    exchangeable, 27
    IID, 8
Randomized confidence set, 316
Randomized decision rule, 145, 151
Randomized sequential decision rule, 537
Randomized test, 3
Rao-Blackwell theorem, 152
Ratio of uniforms, 660
Regression, 276, 321, 519
Regular conditional distribution, 610, 618
Regular conditional probabilities, 609, 617
Regular decision rule, 540
Reject hypothesis, 214
Rejection region, 2
Related LHM, 366
Related RHM, 366
Relative rate of convergence, 413, 470
Restriction of σ-field, 584
Reversed martingale, 649
RHM, 363
Right Haar measure, 363
    related, 366
Risk function, 149-150, 153, 155, 167, 216, 233, 297-298
Risk set, 170-172, 179, 233, 235, 287
Robustness, 310
    Bayesian, 524
Row and column exchangeable, 482
Sample quantile, 404-405, 408
Sample space, 2, 82
Scale equivariant rule, 350
Scale estimation, 350
Scale group, 354
Scale invariant function, 350


Scale invariant loss, 350-351
Scale parameter, 345
Scheffé's theorem, 634
Score function, 111, 122, 302, 305
    conditional, 111
Second-order efficiency, 414
Sensitivity analysis, 524
Separable space, 619
Separating hyperplane theorem, 666
Sequential decision rule, 537
Sequential probability ratio test, 549
Sequential test, 548
Set estimation, 296
Shrinkage estimator, 163
σ-field, 575
    Borel, 571, 575
    generated, 571-572, 584
    image, 584
    restriction, 584
    tail, 632
σ-finite measure, 572, 578, 601
Signed measure, 577, 597, 605, 635
Significance probability, 217, 228, 280
Significance test, 217
Simple alternative, 215
Simple function, 586
Simple hypothesis, 215
Size of test, 2, 215-216
Small order, 394
    stochastic, 396
SPRT, 549
Squared-error loss, 146, 297
√n-consistent, 401
SSS, 507
St. Petersburg paradox, 655
State independence, 184, 205
State-dependent utility, 205-206
States of Nature, 181, 189, 205
Statistic, 83
    ancillary, 95, 99, 119
    boundedly complete, 94
    complete, 94, 298
    sufficient, 84-86, 99, 103, 150-151, 298
Stein estimator (see James-Stein estimator), 163
Stochastic large order, 396
Stochastic small order, 396
Stone-Weierstrass theorem, 666


Stopping time, 537, 548, 552, 554
Strict preference, 183
Strong law of large numbers, 34-36
Strongly unimodal, 329
Submartingale, 646
Successive substitution, 505-506, 545
Successive substitution sampling, 507
Sufficient statistic, 84-86, 99, 103, 109, 150-151, 298
    conditionally, 95
    minimal, 92
    natural, 103
Superefficiency, 414
Supporting hyperplane theorem, 666
Sure-thing principle, 184
t distribution, 672
Tail σ-field, 632
Tailfree process, 60
Taylor's theorem, 665
Tchebychev's inequality, 614
Terminal decision rule, 537
Test:
    goodness of fit, 218, 461
    one-sided, 239, 243
    two-sided, 256, 273
Test function, 175, 215
Theorem:
    Bahadur, 94
    Basu, 99
    Bayes, 4, 16
    Bhattacharyya lower bounds, 305
    Bolzano-Weierstrass, 666
    Carathéodory extension, 578
    Cauchy's equation, 667
    central limit, 642
        multivariate, 643
    chain rule, 600
    Chapman-Robbins bound, 304
    complete class, 179
    continuity, 640
    continuous mapping, 638
    Cramér-Rao lower bound, 301
    DeFinetti, 27-28
    dominated convergence, 591
    Fatou's lemma, 589
    Fisher-Neyman, 89
    Fubini, 596


    Hahn decomposition, 605
    inverse function, 666
    Kolmogorov zero-one law, 631
    Lévy, 648, 650
    law of total probability, 632
    Lehmann-Scheffé, 298
    martingale convergence, 648-649
    maximum modulus, 667
    minimax, 172
    monotone convergence, 590
    multivariate central limit, 643
    Neyman-Pearson, 175, 231
        generalized, 247
    π-λ, 576
    portmanteau, 636
    Radon-Nikodym, 597
    Rao-Blackwell, 152
    Scheffé, 634
    separating hyperplane, 666
    Stone-Weierstrass, 666
    strong law of large numbers, 36
    supporting hyperplane, 666
    Taylor, 665
    Tonelli, 595
    uniqueness, 645
    upcrossing, 647
    weak law of large numbers, 642
Tolerance coefficient, 325
Tolerance set, 219, 325
    conservative, 325
Tonelli's theorem, 595
Topological space, 571, 575
Topology, 571
Transformation, 354
Transition kernel, 124
Trimmed mean, 314
Trivial σ-field, 571
Truncated decision rule, 542
Two-sided alternative, 246
Two-sided hypothesis, 246
Two-sided test, 256, 273
Type I error, 214
Type II error, 214

UMA confidence set, 317
UMAU confidence set, 321
UMC test, 230-231, 239, 244, 255, 257
UMCU test, 254-256
UMP test, 230, 240, 243-244, 255, 257
UMPU test, 254-256
UMPUAI test, 384
UMVUE, 297-299
Unbiased estimator, 3, 296-302
Unbiased test, 254
Uniform distribution, 659, 672
Uniformly integrable, 592
Uniformly minimum variance unbiased estimator (see UMVUE), 297
Uniformly most accurate confidence set (see UMA), 317
Uniformly most accurate unbiased confidence set (see UMAU), 321
Uniformly most cautious test (see UMC), 230
Uniformly most cautious unbiased test (see UMCU), 254
Uniformly most powerful test (see UMP), 230
Uniformly most powerful unbiased test (see UMPU), 254
Uniqueness theorem, 645
Upcrossing lemma, 647
Upper semicontinuous, 417
USC, 417
Utility function, 181, 188
    state-dependent, 205-206

Variance, 607, 613
Variance components, 484
Variance stabilizing transformation, 402
Version of conditional distribution, 617
Version of conditional expectation, 608, 616
Version of conditional mean, 616
Wald's lemma, 552
Weak convergence, 399, 635
Weak* convergence, 635
Weak law of large numbers, 642, 664
Weak order, 183, 216-217, 280
    degenerate, 183
    nondegenerate, 183
Weak preference, 182


Springer Series in Statistics (continued from p. ii)

Pollard: Convergence of Stochastic Processes.
Pratt/Gibbons: Concepts of Nonparametric Theory.
Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data.
Reinsel: Elements of Multivariate Time Series Analysis.
Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications to Non-parametric Statistics.
Rieder: Robust Asymptotic Statistics.
Rosenbaum: Observational Studies.
Ross: Nonlinear Estimation.
Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics.
Seneta: Non-Negative Matrices and Markov Chains, 2nd edition.
Shao/Tu: The Jackknife and Bootstrap.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Simonoff: Smoothing Methods in Statistics.
Small: The Statistical Theory of Shape.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd edition.
Tong: The Multivariate Normal Distribution.
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With Applications to Statistics.
Vapnik: Estimation of Dependences Based on Empirical Data.
Weerahandi: Exact Statistical Methods for Data Analysis.
West/Harrison: Bayesian Forecasting and Dynamic Models.
Wolter: Introduction to Variance Estimation.
Yaglom: Correlation Theory of Stationary and Related Random Functions I: Basic Results.
Yaglom: Correlation Theory of Stationary and Related Random Functions II: Supplementary Notes and References.