
Lecture notes

Probability Theory &

Statistics

Jørgen Larsen

English version /

Roskilde University


This document was typeset with Kp-Fonts using KOMA-Script and LaTeX. The drawings are produced with METAPOST.

October .

Contents

Preface

I  Probability Theory

  Introduction

  Finite Probability Spaces
    Basic definitions
      Point probabilities · Conditional probabilities and distributions · Independence
    Random variables
      Independent random variables · Functions of random variables · The distribution of a sum
    Examples
    Expectation
      Variance and covariance · Examples · Law of large numbers
    Exercises

  Countable Sample Spaces
    Basic definitions
      Point probabilities · Conditioning; independence · Random variables
    Expectation
      Variance and covariance
    Examples
      Geometric distribution · Negative binomial distribution · Poisson distribution
    Exercises

  Continuous Distributions
    Basic definitions
      Transformation of distributions · Conditioning
    Expectation
    Examples
      Exponential distribution · Gamma distribution · Cauchy distribution · Normal distribution
    Exercises

  Generating Functions
    Basic properties
    The sum of a random number of random variables
    Branching processes
    Exercises

  General Theory
    Why a general axiomatic approach?

II  Statistics

  Introduction

  The Statistical Model
    Examples
      The one-sample problem with 01-variables · The simple binomial model · Comparison of binomial distributions · The multinomial distribution · The one-sample problem with Poisson variables · Uniform distribution on an interval · The one-sample problem with normal variables · The two-sample problem with normal variables · Simple linear regression
    Exercises

  Estimation
    The maximum likelihood estimator
    Examples
      The one-sample problem with 01-variables · The simple binomial model · Comparison of binomial distributions · The multinomial distribution · The one-sample problem with Poisson variables · Uniform distribution on an interval · The one-sample problem with normal variables · The two-sample problem with normal variables · Simple linear regression
    Exercises

  Hypothesis Testing
    The likelihood ratio test
    Examples
      The one-sample problem with 01-variables · The simple binomial model · Comparison of binomial distributions · The multinomial distribution · The one-sample problem with normal variables · The two-sample problem with normal variables
    Exercises

  Examples
    Flour beetles
      The basic model · A dose-response model · Estimation · Model validation · Hypotheses about the parameters
    Lung cancer in Fredericia
      The situation · Specification of the model · Estimation in the multiplicative model · How does the multiplicative model describe the data? · Identical cities? · Another approach · Comparing the two approaches · About test statistics
    Accidents in a weapon factory
      The situation · The first model · The second model

  The Multivariate Normal Distribution
    Multivariate random variables
    Definition and properties
    Exercises

  Linear Normal Models
    Estimation and test in the general linear model
      Estimation · Testing hypotheses about the mean
    The one-sample problem
    One-way analysis of variance
    Bartlett's test of homogeneity of variances
    Two-way analysis of variance
      Connected models · The projection onto L0 · Testing the hypotheses · An example (growing of potatoes)
    Regression analysis
      Specification of the model · Estimating the parameters · Testing hypotheses · Factors · An example
    Exercises

A  A Derivation of the Normal Distribution
B  Some Results from Linear Algebra
C  Statistical Tables
D  Dictionaries

Bibliography

Index

Preface

Probability theory and statistics are two subject fields that either can be studied on their own conditions, or can be dealt with as ancillary subjects, and the way these subjects are taught should reflect which of the two cases is actually the case. The present set of lecture notes is directed at students who have to do a certain amount of probability theory and statistics as part of their general mathematics education, and accordingly it is directed neither at students specialising in probability and/or statistics nor at students in need of statistics as an ancillary subject. Over a period of years several draft versions of these notes have been in use in the mathematics programme at Roskilde University.

When preparing a course in probability theory it seems to be an ever unresolved issue how much (or rather how little) emphasis to put on a general axiomatic presentation. If the course is meant to be a general introduction for students who cannot be supposed to have any great interest (or qualifications) in the subject, then there is no doubt: don't mention Kolmogorov's axioms at all (or possibly only in a discreet footnote). Things are different with a course that is part of a general mathematics education; here it would be perfectly meaningful to study the mathematical formalism at play when trying to establish a set of building blocks useful for certain modelling problems (the modelling of random phenomena), and hence it would indeed be appropriate to study the basis of the mathematics concerned.

Probability theory is definitely a proper mathematical discipline, and statistics certainly makes use of a number of different sub-areas of mathematics. Both fields are, however, as regards their mathematical content, organised in a rather different way from "normal" mathematics (or at least from the way it is presented in math education), because they are primarily designed to be able to do certain things, e.g. prove the Law of Large Numbers and the Central Limit Theorem, and only to a lesser degree to conform to the mathematical community's current opinions about how to present mathematics; if, for instance, you believe probability theory (measure-theoretic probability theory) simply to be a special case of measure theory in general, you will miss several essential points as to what probability theory is and is about.

Moreover, anyone who is attempting to acquaint himself with probability theory and statistics soon realises that it is a project that heavily involves concepts, methods and results from a number of rather different "mainstream" mathematical topics, which may be one reason why probability theory and statistics often are considered quite difficult or even incomprehensible.

Following the traditional pattern the presentation is divided into two parts, a probability part and a statistics part. The two parts are rather different as regards style and composition.

The probability part introduces the fundamental concepts and ideas, illustrated by the usual examples, the curriculum being kept on a tight rein throughout. Firstly, a thorough presentation of axiomatic probability theory on finite sample spaces, using Kolmogorov's axioms restricted to the finite case; by doing so the amount of mathematical complications can be kept to a minimum without sacrificing the possibility of proving the various results stated. Next, the theory is extended to countable sample spaces, specifically the sample space N0 equipped with the σ-algebra consisting of all subsets of the sample space; it is still possible to prove "all" theorems without too much complicated mathematics (you will need a working knowledge of infinite series), and yet you will get some idea of the problems related to infinite sample spaces. However, the formal axiomatic approach is not continued in the chapter about continuous distributions on R and Rn, i.e. distributions having a density function, and this chapter states many theorems and propositions without proofs (for one thing, the theorem about transformation of densities, the proof of which rightly belongs in a mathematical analysis course). Part I finishes with two chapters of a somewhat different kind, one chapter about generating functions (including a few pages about branching processes), and one chapter containing some hints about how and why probabilists want to deal with probability measures on general sample spaces.

Part II introduces the classical likelihood-based approach to mathematical statistics: statistical models are made from the standard distributions and have a modest number of parameters, estimators are usually maximum likelihood estimators, and hypotheses are tested using the likelihood ratio test statistic. The presentation is arranged with a chapter about the notion of a statistical model, a chapter about estimation and a chapter about hypothesis testing; these chapters are provided with a number of examples showing how the theory performs when applied to specific models or classes of models. Then follow three extensive examples, showing the general theory in use. Part II is completed with an introduction to the theory of linear normal models in a linear algebra formulation, in good keeping with the tradition in Danish mathematical statistics.

JL

Part I

Probability Theory


Introduction

Probability theory is a discipline that deals with mathematical formalisations of the everyday concepts of chance and randomness and related concepts. At first one might wonder how it could be possible at all to make mathematical formalisations of anything like chance and randomness; is it not a fundamental characteristic of these concepts that you cannot give any kind of exact description relating to them? Not quite so. It seems to be an empirical fact that at least some kinds of random experiments and random phenomena display a considerable degree of regularity when repeated a large number of times; this applies to experiments such as tossing coins or dice, roulette and other kinds of games of chance. In order to treat such matters in a rigorous way we have to introduce some mathematical ideas and notations, which at first may seem a little vague, but later on will be given a precise mathematical meaning (hopefully not too far from the common usage of the same terms).

When run, the random experiment will produce a result of some kind. The rolling of a die, for instance, will result in the die falling with a certain, yet unpredictable, number of pips uppermost. The result of the random experiment is called an outcome. The set of all possible outcomes is called the sample space.

Probabilities are real numbers expressing certain quantitative characteristics of the random experiment. A simple example of a probability statement could be "the probability that the die will fall with a given number uppermost is 1/6"; another example could be "the probability of a white Christmas is 1/20". What is the meaning of such statements? Some people will claim that probability statements are to be interpreted as statements about predictions about the outcome of some future phenomenon (such as a white Christmas). Others will claim that probability statements describe the relative frequency of the occurrence of the outcome in question, when the random experiment (e.g. the rolling of the die) is repeated over and over. The way probability theory is formalised and axiomatised is very much inspired by an interpretation of probability as relative frequency in the long run, but the formalisation is not tied down to this single interpretation.

Probability theory makes use of notation from simple set theory, although with slightly different names; see the overview below.

Some notions from set theory together with their counterparts in probability theory:

  Typical notation   Probability theory                 Set theory
  Ω                  sample space; the certain event    universal set
  ∅                  the impossible event               the empty set
  ω                  outcome                            member of Ω
  A                  event                              subset of Ω
  A ∩ B              both A and B                       intersection of A and B
  A ∪ B              either A or B                      union of A and B
  A \ B              A but not B                        set difference
  Ac                 the opposite event of A            the complement of A, i.e. Ω \ A

A probability, or, to be precise, a probability measure will be defined as a map from a certain domain into the set of reals. What this domain should be may not be quite obvious; one might suggest that it should simply be the entire sample space, but it turns out that this will cause severe problems in the case of an uncountable sample space (such as the set of reals). It has turned out that the proper way to build a useful theory is to assign probabilities to events, that is, to certain kinds of subsets of the sample space. Accordingly, a probability measure will be a special kind of map assigning real numbers to certain kinds of subsets of the sample space.

Finite Probability Spaces

This chapter treats probabilities on finite sample spaces. The definitions and the theory are simplified versions of the general setup, thus hopefully giving a good idea also of the general theory without too many mathematical complications.

. Basic definitions

Definition .: Probability space over a finite set
A probability space over a finite set is a triple (Ω, F, P) consisting of
1. a sample space Ω which is a non-empty finite set,
2. the set F of all subsets of Ω,
3. a probability measure on (Ω, F), that is, a map P : F → R which is
   • positive: P(A) ≥ 0 for all A ∈ F,
   • normed: P(Ω) = 1, and
   • additive: if A1, A2, ..., An ∈ F are mutually disjoint, then

       P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An).

The elements of Ω are called outcomes, the elements of F are called events.

Here are two simple examples. Incidentally, they also serve to demonstrate that there indeed exist mathematical objects satisfying Definition .:

Example .: Uniform distribution
Let Ω be a finite set with n elements, Ω = {ω1, ω2, ..., ωn}, let F be the set of subsets of Ω, and let P be given by P(A) = #A/n, where #A means "the number of elements of A". Then (Ω, F, P) satisfies the conditions for being a probability space (the additivity follows from the additivity of the number function). The probability measure P is called the uniform distribution on Ω, since it distributes the "probability mass" uniformly over the sample space.

Example .: One-point distribution
Let Ω be a finite set, and let ω0 ∈ Ω be a selected point. Let F be the set of subsets of Ω, and define P by P(A) = 1 if ω0 ∈ A and P(A) = 0 otherwise. Then (Ω, F, P) satisfies the conditions for being a probability space. The probability measure P is called the one-point distribution at ω0, since it assigns all of the probability mass to this single point.


Then we can prove a number of results:

Lemma .
For any events A and B of the probability space (Ω, F, P) we have
1. P(A) + P(Ac) = 1.
2. P(∅) = 0.
3. If A ⊆ B, then P(B \ A) = P(B) − P(A) and hence P(A) ≤ P(B) (and so P is increasing).
4. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof
Re. 1: The two events A and Ac are disjoint and their union is Ω; then by the axiom of additivity P(A) + P(Ac) = P(Ω), and P(Ω) equals 1.
Re. 2: Apply what has just been proved to the event A = Ω to get P(∅) = 1 − P(Ω) = 1 − 1 = 0.
Re. 3: The events A and B \ A are disjoint with union B, and therefore P(B) = P(A) + P(B \ A); since P(B \ A) ≥ 0, we see that P(A) ≤ P(B).
Re. 4: The three events A \ B, B \ A and A ∩ B are mutually disjoint with union A ∪ B. Therefore

  P(A ∪ B) = P(A \ B) + P(B \ A) + P(A ∩ B)
           = (P(A \ B) + P(A ∩ B)) + (P(B \ A) + P(A ∩ B)) − P(A ∩ B)
           = P(A) + P(B) − P(A ∩ B).

[Margin figure: Venn diagram of A and B showing the regions A \ B, A ∩ B and B \ A.]

All presentations of probability theory use coin-tossing and die-rolling as examples of chance phenomena:

Example .: Coin-tossing
Suppose we toss a coin once and observe whether a head or a tail turns up.

The sample space is the two-point set Ω = {head, tail}. The set of events is F = {Ω, {head}, {tail}, ∅}. The probability measure corresponding to a fair coin, i.e. a coin where heads and tails have an equal probability of occurring, is the uniform distribution on Ω:

  P(Ω) = 1,  P({head}) = 1/2,  P({tail}) = 1/2,  P(∅) = 0.

Example .: Rolling a die
Suppose we roll a die once and observe the number turning up.

The sample space is the set Ω = {1, 2, 3, 4, 5, 6}. The set F of events is the set of subsets of Ω (giving a total of 2^6 = 64 different events). The probability measure P corresponding to a fair die is the uniform distribution on Ω.

This implies for instance that the probability of the event {3, 6} (the number of pips is a multiple of 3) is P({3, 6}) = 2/6, since this event consists of two outcomes out of a total of six possible outcomes.
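For readers who want to experiment, the die model can be written out directly; the following Python sketch is our addition (not part of the notes) and simply encodes the uniform distribution on Ω = {1, ..., 6} and the rule P(A) = #A/n from the uniform-distribution example above.

from fractions import Fraction

# Sample space for one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}

# Point probabilities of the uniform distribution on omega.
p = {w: Fraction(1, len(omega)) for w in omega}

def prob(event):
    # P(A) is the sum of point probabilities over the outcomes in A.
    return sum(p[w] for w in event)

A = {3, 6}                 # "the number of pips is a multiple of 3"
print(prob(A))             # 1/3, i.e. 2/6
print(prob(omega))         # 1 (P is normed)
print(prob(set()))         # 0 (the impossible event)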

Example .: Simple random sampling without replacement
From a box (or urn) containing b black and w white balls we draw a k-sample, that is, a subset of k elements (k is assumed to be not greater than b + w). This is done as simple random sampling without replacement, which means that all the (b+w choose k) different subsets with exactly k elements (cf. the definition of binomial coefficients on page ) have the same probability of being selected. Thus we have a uniform distribution on the set Ω consisting of all these subsets.

So the probability of the event "exactly x black balls" equals the number of k-samples having x black and k − x white balls divided by the total number of k-samples, that is

  (b choose x)(w choose k−x) / (b+w choose k).

See also page  and Proposition ..
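The counting formula lends itself to a quick numerical check. The Python sketch below is our illustration, with made-up values of b, w and k; it relies only on math.comb for the binomial coefficients.

from math import comb

def prob_x_black(b, w, k, x):
    # P("exactly x black balls") = C(b, x) * C(w, k - x) / C(b + w, k).
    # math.comb(n, j) is 0 when j > n, so impossible values of x get probability 0
    # (as long as 0 <= x <= k).
    return comb(b, x) * comb(w, k - x) / comb(b + w, k)

# Illustrative numbers (not from the notes): 6 black and 4 white balls, a 3-sample.
b, w, k = 6, 4, 3
probs = [prob_x_black(b, w, k, x) for x in range(k + 1)]
print(probs)        # point probabilities for x = 0, 1, 2, 3
print(sum(probs))   # 1 (up to floating-point rounding), as it must be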

Point probabilities

The reader may wonder why the concept of probability is mathematified in this rather complicated way: why not simply have a function that maps every single outcome to the probability of this outcome? As long as we are working with finite (or countable) sample spaces, that would indeed be a feasible approach, but with an uncountable sample space it would never work (since an uncountable infinity of positive numbers cannot add up to a finite number).

However, as long as we are dealing with the finite case, the following definition and theorem are of interest.

Definition .: Point probabilities
Let (Ω, F, P) be a probability space over a finite set Ω. The function

  p : Ω → [0 ; 1]
      ω ↦ P({ω})

is the point probability function (or the probability mass function) of P.

Point probabilities are often visualised as “probability bars”.

Proposition .
If p is the point probability function of the probability measure P, then for any event A we have

  P(A) = ∑_{ω∈A} p(ω).

Proof
Write A as a disjoint union of its singletons (one-point sets) and use additivity:

  P(A) = P( ⋃_{ω∈A} {ω} ) = ∑_{ω∈A} P({ω}) = ∑_{ω∈A} p(ω).


Remarks: It follows from the theorem that different probability measures must have different point probability functions. It also follows that p adds up to unity, i.e. ∑_{ω∈Ω} p(ω) = 1; this is seen by letting A = Ω.

Proposition .
If p : Ω → [0 ; 1] adds up to unity, i.e. ∑_{ω∈Ω} p(ω) = 1, then there is exactly one probability measure P on Ω which has p as its point probability function.

Proof
We can define a function P : F → [0 ; +∞[ as P(A) = ∑_{ω∈A} p(ω). This function is positive since p ≥ 0, and it is normed since p adds to unity. Furthermore it is additive: for disjoint events A1, A2, ..., An we have

  P(A1 ∪ A2 ∪ ... ∪ An) = ∑_{ω ∈ A1∪A2∪...∪An} p(ω)
                        = ∑_{ω∈A1} p(ω) + ∑_{ω∈A2} p(ω) + ... + ∑_{ω∈An} p(ω)
                        = P(A1) + P(A2) + ... + P(An),

where the second equality follows from the associative law for simple addition. Thus P meets the conditions for being a probability measure. By construction p is the point probability function of P, and by the remark to Proposition . only one probability measure can have p as its point probability function.

Example .
The point probability function of the one-point distribution at ω0 (cf. Example .) is given by p(ω0) = 1, and p(ω) = 0 when ω ≠ ω0.
[Margin figure: probability bars of a one-point distribution.]

Example .
The point probability function of the uniform distribution on {ω1, ω2, ..., ωn} (cf. Example .) is given by p(ωi) = 1/n, i = 1, 2, ..., n.
[Margin figure: probability bars of a uniform distribution.]

Conditional probabilities and distributions

Often one wants to calculate the probability that some event A occurs given that some other event B is known to occur; this probability is known as the conditional probability of A given B.

Definition .: Conditional probability
Let (Ω, F, P) be a probability space, let A and B be events, and suppose that P(B) > 0. The number

  P(A | B) = P(A ∩ B) / P(B)

is called the conditional probability of A given B.

Example .
Two coins, a -krone piece and a -krone piece, are tossed simultaneously. What is the probability that the -krone piece falls heads, given that at least one of the two coins falls heads?

The standard model for this chance experiment tells us that there are four possible outcomes, and, recording an outcome as

  ("how the -krone falls", "how the -krone falls"),

the sample space is

  Ω = {(tails, tails), (tails, heads), (heads, tails), (heads, heads)}.

Each of these four outcomes is supposed to have probability 1/4. The conditioning event B (at least one heads) and the event in question A (the -krone piece falls heads) are

  B = {(tails, heads), (heads, tails), (heads, heads)}  and
  A = {(heads, tails), (heads, heads)},

so the conditional probability of A given B is

  P(A | B) = P(A ∩ B) / P(B) = (2/4) / (3/4) = 2/3.
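The conditional probability just computed can also be checked by enumerating the sample space. The following Python sketch is our addition; it encodes the four equally likely outcomes and applies the definition of conditional probability directly.

from itertools import product
from fractions import Fraction

# The four equally likely outcomes; the first coordinate is the coin in question.
omega = list(product(["heads", "tails"], repeat=2))
p = {w: Fraction(1, 4) for w in omega}

A = {w for w in omega if w[0] == "heads"}   # the coin in question falls heads
B = {w for w in omega if "heads" in w}      # at least one of the coins falls heads

def prob(event):
    return sum(p[w] for w in event)

def cond_prob(A, B):
    # P(A | B) = P(A and B) / P(B), defined when P(B) > 0.
    return prob(A & B) / prob(B)

print(cond_prob(A, B))     # 2/3, as in the example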

From Definition . we immediately get

Proposition .
If A and B are events, and P(B) > 0, then P(A ∩ B) = P(A | B) P(B).

Definition .: Conditional distribution
Let (Ω, F, P) be a probability space, and let B be an event such that P(B) > 0. The function

  P( · | B) : F → [0 ; 1]
          A ↦ P(A | B)

is called the conditional distribution given B.

Bayes’ formula

Let us assume that Ω can be written as a disjoint union of k events B1, B2, ..., Bk (in other words, B1, B2, ..., Bk is a partition of Ω). We also have an event A. Further it is assumed that we beforehand (or a priori) know the probabilities P(B1), P(B2), ..., P(Bk) for each of the events in the partition, as well as all conditional probabilities P(A | Bj) for A given a Bj. The problem is to find the so-called posterior probabilities, that is, the probabilities P(Bj | A) for the Bj, given that the event A is known to have occurred.

(An example of such a situation could be a doctor attempting to make a medical diagnosis: A would be the set of symptoms observed on a patient, and the Bs would be the various diseases that might cause the symptoms. The doctor knows the (approximate) frequencies of the diseases, and also the (approximate) probability of a patient with disease Bi exhibiting just symptoms A, i = 1, 2, ..., k. The doctor would like to know the conditional probability that a patient exhibiting symptoms A has disease Bi, i = 1, 2, ..., k.)

Thomas Bayes
English mathematician and theologian (-). "Bayes' formula" (from Bayes ()) nowadays has a vital role in so-called Bayesian statistics and in Bayesian networks.

Since A = ⋃_{i=1}^k (A ∩ Bi) where the sets A ∩ Bi are disjoint, we have

  P(A) = ∑_{i=1}^k P(A ∩ Bi) = ∑_{i=1}^k P(A | Bi) P(Bi),

and since P(Bj | A) = P(A ∩ Bj) / P(A) = P(A | Bj) P(Bj) / P(A), we obtain

  P(Bj | A) = P(A | Bj) P(Bj) / ∑_{i=1}^k P(A | Bi) P(Bi).   (.)

This formula is known as Bayes' formula; it shows how to calculate the posterior probabilities P(Bj | A) from the prior probabilities P(Bj) and the conditional probabilities P(A | Bj).

[Margin figure: the partition B1, B2, ..., Bk of Ω and the event A.]
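As an illustration of how the formula is used in practice, here is a small Python sketch (ours, with entirely hypothetical prior and conditional probabilities in the spirit of the doctor example); it returns the posterior probabilities P(Bj | A).

def bayes_posteriors(priors, likelihoods):
    # Bayes' formula: P(Bj | A) = P(A | Bj) P(Bj) / sum_i P(A | Bi) P(Bi),
    # where priors[j] = P(Bj) and likelihoods[j] = P(A | Bj).
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    total = sum(joint)                  # this is P(A)
    return [j / total for j in joint]

# Hypothetical numbers (not from the notes): three "diseases" B1, B2, B3 with
# prior probabilities 0.70, 0.25, 0.05 and conditional probabilities of showing
# the symptoms A equal to 0.01, 0.10, 0.80.
priors = [0.70, 0.25, 0.05]
likelihoods = [0.01, 0.10, 0.80]
print(bayes_posteriors(priors, likelihoods))
# roughly [0.10, 0.35, 0.56]: the rare disease becomes the most probable one.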

Independence

Events are independent if our probability statements pertaining to any sub-collection of them do not change when we learn about the occurrence or non-occurrence of events not in the sub-collection.

One might consider defining independence of events A and B to mean that P(A | B) = P(A), which by the definition of conditional probability implies that P(A ∩ B) = P(A) P(B). This last formula is meaningful also when P(B) = 0, and furthermore A and B enter in a symmetrical way. Hence we define independence of two events A and B to mean that P(A ∩ B) = P(A) P(B). To do things properly, however, we need a definition that covers several events. The general definition is

Definition .: Independent events
The events A1, A2, ..., Ak are said to be independent, if it is true that for any subset Ai1, Ai2, ..., Aim of these events,

  P(Ai1 ∩ Ai2 ∩ ... ∩ Aim) = P(Ai1) P(Ai2) ... P(Aim).


Note: When checking for independence of events, it does not suffice to check whether they are pairwise independent, cf. Example .. Nor does it suffice to check whether the probability of the intersection of all of the events equals the product of the probabilities of the individual events, cf. Example ..

Example .
Let Ω = {a, b, c, d} and let P be the uniform distribution on Ω. Each of the three events A = {a, b}, B = {a, c} and C = {a, d} has a probability of 1/2. The events A, B and C are pairwise independent. To verify, for instance, that P(B ∩ C) = P(B) P(C), note that P(B ∩ C) = P({a}) = 1/4 and P(B) P(C) = 1/2 · 1/2 = 1/4.

On the other hand, the three events are not independent: P(A ∩ B ∩ C) ≠ P(A) P(B) P(C), since P(A ∩ B ∩ C) = P({a}) = 1/4 and P(A) P(B) P(C) = 1/8.

Example .
Let Ω = {a, b, c, d, e, f, g, h} and let P be the uniform distribution on Ω. The three events A = {a, b, c, d}, B = {a, e, f, g} and C = {a, b, c, e} then all have probability 1/2. Since A ∩ B ∩ C = {a}, it is true that P(A ∩ B ∩ C) = P({a}) = 1/8 = P(A) P(B) P(C), but the events A, B and C are not independent; we have, for instance, that P(A ∩ B) ≠ P(A) P(B) (since P(A ∩ B) = P({a}) = 1/8 and P(A) P(B) = 1/2 · 1/2 = 1/4).
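Both warnings are easy to verify mechanically. The Python sketch below is our addition; it checks the product rule of the definition of independence for every sub-collection of events, assuming the uniform distributions of the two examples above.

from fractions import Fraction
from itertools import combinations

def independent(events, omega):
    # Every sub-collection of at least two events must satisfy
    # P(intersection) = product of the P's (uniform distribution on omega assumed).
    def prob(A):
        return Fraction(len(A), len(omega))
    for r in range(2, len(events) + 1):
        for sub in combinations(events, r):
            inter = set(omega)
            prod = Fraction(1)
            for A in sub:
                inter &= A
                prod *= prob(A)
            if prob(inter) != prod:
                return False
    return True

# First example: pairwise independent but not independent.
omega1 = set("abcd")
A, B, C = set("ab"), set("ac"), set("ad")
print(independent([A, B], omega1), independent([A, B, C], omega1))   # True False

# Second example: the product rule holds for all three events together, yet
# the events are not independent (a pair already fails).
omega2 = set("abcdefgh")
A, B, C = set("abcd"), set("aefg"), set("abce")
print(independent([A, B, C], omega2))                                # False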

Independent experiments; product space

Very often one wants to model chance experiments that can be thought of as compound experiments made up of a number of separate sub-experiments (the experiment "rolling five dice once", for example, may be thought of as made up of five copies of the experiment "rolling one die once"). If we are given models for the sub-experiments, and if the sub-experiments are assumed to be independent (in a sense to be made precise), it is fairly straightforward to write up a model for the compound experiment.

For convenience consider initially a compound experiment consisting of just two sub-experiments I and II, and let us assume that they can be modelled by the probability spaces (Ω1, F1, P1) and (Ω2, F2, P2), respectively.

We seek a probability space (Ω, F, P) that models the compound experiment of carrying out both I and II. It seems reasonable to write the outcomes of the compound experiment as (ω1, ω2) where ω1 ∈ Ω1 and ω2 ∈ Ω2, that is, Ω is the product set Ω1 × Ω2, and F could then be the set of all subsets of Ω. But which P should we use?

Consider an I-event A1 ∈ F1. Then the compound experiment event A1 × Ω2 ∈ F corresponds to the situation where the I-part of the compound experiment gives an outcome in A1 and the II-part gives just anything, that is, you don't care about the outcome of experiment II. The two events A1 × Ω2 and A1 therefore model the same real phenomenon, only in two different probability spaces, and the probability measure P that we are searching for should therefore have the property that P(A1 × Ω2) = P1(A1). Similarly, we require that P(Ω1 × A2) = P2(A2) for any event A2 ∈ F2.

If the compound experiment is going to have the property that its components are independent, then the two events A1 × Ω2 and Ω1 × A2 have to be independent (since they relate to separate sub-experiments), and since their intersection is A1 × A2, we must have

  P(A1 × A2) = P((A1 × Ω2) ∩ (Ω1 × A2))
             = P(A1 × Ω2) P(Ω1 × A2) = P1(A1) P2(A2).

This is a condition on P that relates to all product sets in F. Since all singletons are product sets ({(ω1, ω2)} = {ω1} × {ω2}), we get in particular a condition on the point probability function p for P; in an obvious notation this condition is that p(ω1, ω2) = p1(ω1) p2(ω2) for all (ω1, ω2) ∈ Ω.

[Margin figure: the product set Ω1 × Ω2 with the events A1 × Ω2, Ω1 × A2 and A1 × A2.]

Inspired by this analysis of the problem we will proceed as follows:
1. Let p1 and p2 be the point probability functions for P1 and P2.
2. Then define a function p : Ω → [0 ; 1] by p(ω1, ω2) = p1(ω1) p2(ω2) for (ω1, ω2) ∈ Ω.
3. This function p adds up to unity:

     ∑_{ω∈Ω} p(ω) = ∑_{(ω1,ω2)∈Ω1×Ω2} p1(ω1) p2(ω2)
                  = ∑_{ω1∈Ω1} p1(ω1) ∑_{ω2∈Ω2} p2(ω2)
                  = 1 · 1 = 1.

4. According to Proposition . there is a unique probability measure P on Ω with p as its point probability function.
5. The probability measure P satisfies the requirement that P(A1 × A2) = P1(A1) P2(A2) for all A1 ∈ F1 and A2 ∈ F2, since

     P(A1 × A2) = ∑_{(ω1,ω2)∈A1×A2} p1(ω1) p2(ω2)
                = ∑_{ω1∈A1} p1(ω1) ∑_{ω2∈A2} p2(ω2) = P1(A1) P2(A2).

This solves the specified problem. The probability measure P is called the product of P1 and P2.

One can extend the above considerations to cover situations with an arbitrary number n of sub-experiments. In a compound experiment consisting of n independent sub-experiments with point probability functions p1, p2, ..., pn, the joint point probability function is given by

  p(ω1, ω2, ..., ωn) = p1(ω1) p2(ω2) ... pn(ωn),

and for the corresponding probability measures we have

  P(A1 × A2 × ... × An) = P1(A1) P2(A2) ... Pn(An).

In this context p and P are called the joint point probability function and the joint distribution, and the individual pi's and Pi's are called the marginal point probability functions and the marginal distributions.

Example .
In the experiment where a coin is tossed once and a die is rolled once, the probability of the outcome (heads, 5 pips) is 1/2 · 1/6 = 1/12, and the probability of the event "heads and at least five pips" is 1/2 · 2/6 = 1/6. If a coin is tossed 100 times, the probability that the last 10 tosses all give the result heads is (1/2)^10 ≈ 0.001.
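The product construction is purely mechanical, as the following Python sketch (our addition) shows for the coin-and-die experiment above: the joint point probabilities are the products p1(ω1) p2(ω2), and event probabilities are obtained by summation as usual.

from fractions import Fraction

# Point probability functions of the two sub-experiments.
p_coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
p_die = {i: Fraction(1, 6) for i in range(1, 7)}

# Joint point probabilities on the product set: p(w1, w2) = p1(w1) * p2(w2).
p = {(c, d): p_coin[c] * p_die[d] for c in p_coin for d in p_die}

def prob(event):
    return sum(p[w] for w in event)

print(p[("heads", 5)])                               # 1/12
print(prob({("heads", d) for d in (5, 6)}))          # 1/6: "heads and at least five pips"
print(sum(p.values()))                               # 1: the joint probabilities add up to unity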

. Random variables

One reason for mathematics being so successful is undoubtedly that it to such a great extent makes use of symbols, and a successful translation of our everyday speech about chance phenomena into the language of mathematics will therefore, among other things, rely on using an appropriate notation. In probability theory and statistics it would be extremely useful to be able to work with symbols representing "the numeric outcome that the chance experiment will provide when carried out". Such symbols are random variables; random variables are most frequently denoted by capital letters (such as X, Y, Z). Random variables appear in mathematical expressions just as if they were ordinary mathematical entities, e.g. X + Y = 5 and Z ∈ B.

Example: To describe the outcome of rolling two dice once, let us introduce random variables X1 and X2 representing the number of pips shown by die no. 1 and die no. 2, respectively. Then X1 = X2 means that the two dice show the same number of pips, and X1 + X2 ≥ 10 means that the sum of the pips is at least 10.

Even if the above may suggest the intended meaning of the notion of a random variable, it is certainly not a proper definition of a mathematical object. And conversely, the following definition does not convey that much of a meaning.

Definition .: Random variable
Let (Ω, F, P) be a probability space over a finite set. A random variable on (Ω, F, P) is a map X from Ω to R, the set of reals.

More generally, an n-dimensional random variable on (Ω, F, P) is a map X from Ω to Rn.


Below we shall study random variables in the mathematical sense. First, let us clarify how random variables enter into plain mathematical parlance: Let X be a random variable on the sample space in question. If u is a proposition about real numbers so that for every x ∈ R, u(x) is either true or false, then u(X) is to be read as a meaningful proposition about X, and it is true if and only if the event {ω ∈ Ω : u(X(ω))} occurs. Thereby we can talk of the probability P(u(X)) of u(X) being true; this probability is per definition P(u(X)) = P({ω ∈ Ω : u(X(ω))}).

To write {ω ∈ Ω : u(X(ω))} is a precise, but rather lengthy, and often unduly detailed, specification of the event that u(X) is true, and that event is therefore almost always written as u(X).

Example: The proposition X ≥ 3 corresponds to the event {X ≥ 3}, or in more detail to {ω ∈ Ω : X(ω) ≥ 3}, and we write P(X ≥ 3), which per definition is P({X ≥ 3}) or in more detail P({ω ∈ Ω : X(ω) ≥ 3}). This is extended in an obvious way to cases with several random variables.

For a subset B of R, the proposition X ∈ B corresponds to the event {X ∈ B} = {ω ∈ Ω : X(ω) ∈ B}. The probability of the proposition (or the event) is P(X ∈ B). Considered as a function of B, P(X ∈ B) satisfies the conditions for being a probability measure on R; statisticians and probabilists call this measure the distribution of X, whereas mathematicians will speak of a transformed measure. Actually, the preceding is not entirely correct: at present we cannot talk about probability measures on R, since we have reached only probabilities on finite sets. Therefore we have to think of that probability measure with the two names as a probability measure on the finite set X(Ω) ⊂ R.

Denote for a while the transformed probability measure X(P); then X(P)(B) = P(X ∈ B) where B is a subset of R. This innocuous-looking formula shows that all probability statements regarding X can be rewritten as probability statements solely involving (subsets of) R and the probability measure X(P) on (the finite subset X(Ω) of) R. Thus we may forget all about the original probability space Ω.

Example .
Let Ω = {ω1, ω2, ω3, ω4, ω5}, and let P be the probability measure on Ω given by the point probabilities p(ω1) = 0.3, p(ω2) = 0.1, p(ω3) = 0.4, p(ω4) = 0, and p(ω5) = 0.2. We can define a random variable X : Ω → R by the specification X(ω1) = 3, X(ω2) = −0.8, X(ω3) = 4.1, X(ω4) = −0.8, and X(ω5) = −0.8.

The range of the map X is X(Ω) = {−0.8, 3, 4.1}, and the distribution of X is the probability measure on X(Ω) given by the point probabilities

  P(X = −0.8) = P({ω2, ω4, ω5}) = p(ω2) + p(ω4) + p(ω5) = 0.3,
  P(X = 3)    = P({ω1}) = p(ω1) = 0.3,
  P(X = 4.1)  = P({ω3}) = p(ω3) = 0.4.


Example .: Continuation of Example .
Now introduce two more random variables Y and Z, leading to the following situation:

  ω     p(ω)   X(ω)   Y(ω)   Z(ω)
  ω1    0.3     3     −0.8    3
  ω2    0.1    −0.8    3     −0.8
  ω3    0.4     4.1    4.1    4.1
  ω4    0      −0.8    4.1    4.1
  ω5    0.2    −0.8    3     −0.8

It appears that X, Y and Z are different, since for example X(ω1) ≠ Y(ω1), X(ω4) ≠ Z(ω4) and Y(ω1) ≠ Z(ω1), although X, Y and Z do have the same range {−0.8, 3, 4.1}. Straightforward calculations show that

  P(X = −0.8) = P(Y = −0.8) = P(Z = −0.8) = 0.3,
  P(X = 3)    = P(Y = 3)    = P(Z = 3)    = 0.3,
  P(X = 4.1)  = P(Y = 4.1)  = P(Z = 4.1)  = 0.4,

that is, X, Y and Z have the same distribution. It is seen that P(X ≠ Z) = P({ω4}) = 0, so X = Z with probability 1, and P(X = Y) = P({ω3}) = 0.4.
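The passage from the probability space and the map X to the distribution of X can be carried out mechanically; the Python sketch below is our addition and reproduces the numbers of the example with Ω = {ω1, ..., ω5} above.

from collections import defaultdict
from fractions import Fraction

# The probability space and two of the random variables of the example.
p = {"w1": Fraction(3, 10), "w2": Fraction(1, 10), "w3": Fraction(4, 10),
     "w4": Fraction(0), "w5": Fraction(2, 10)}
X = {"w1": 3.0, "w2": -0.8, "w3": 4.1, "w4": -0.8, "w5": -0.8}
Y = {"w1": -0.8, "w2": 3.0, "w3": 4.1, "w4": 4.1, "w5": 3.0}

def distribution(variable, p):
    # P(X = x) is the sum of p(w) over all outcomes w with X(w) = x.
    dist = defaultdict(Fraction)
    for w, pw in p.items():
        dist[variable[w]] += pw
    return dict(dist)

print(distribution(X, p))   # P(X = 3) = 0.3, P(X = -0.8) = 0.3, P(X = 4.1) = 0.4
print(distribution(Y, p))   # the same point probabilities: X and Y have the same distribution
print(sum(p[w] for w in p if X[w] == Y[w]))   # P(X = Y) = 2/5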

Being a distribution on a finite set, the distribution of a random variable X can be described (and illustrated) by its point probabilities. And since the distribution "lives" on a subset of the reals, we can also describe it by its distribution function.

Definition .: Distribution function
The distribution function F of a random variable X is the function

  F : R → [0 ; 1]
      x ↦ P(X ≤ x).

A distribution function has certain properties:

Lemma .
If the random variable X has distribution function F, then

  P(X ≤ x) = F(x),
  P(X > x) = 1 − F(x),
  P(a < X ≤ b) = F(b) − F(a),

for any real numbers x and a < b.

Proof
The first equation is simply the definition of F. The other two equations follow from items  and  in Lemma . (page ).


Proposition .
The distribution function F of a random variable X has the following properties:
1. It is non-decreasing, i.e. if x ≤ y, then F(x) ≤ F(y).
2. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
3. It is continuous from the right, i.e. F(x+) = F(x) for all x.
4. P(X = x) = F(x) − F(x−) at each point x.
5. A point x is a discontinuity point of F, if and only if P(X = x) > 0.

Increasing functions
Let F : R → R be an increasing function, that is, F(x) ≤ F(y) whenever x ≤ y. Then at each point x, F has a limit from the left

  F(x−) = lim_{h↘0} F(x − h)

and a limit from the right

  F(x+) = lim_{h↘0} F(x + h),

and F(x−) ≤ F(x) ≤ F(x+). If F(x−) ≠ F(x+), x is a point of discontinuity of F (otherwise, x is a point of continuity of F).

Proof
1: If x ≤ y, then F(y) − F(x) = P(x < X ≤ y) (Lemma .), and since probabilities are always non-negative, this implies that F(x) ≤ F(y).
2: Since X takes only a finite number of values, there exist two numbers xmin and xmax such that xmin < X(ω) < xmax for all ω. Then F(x) = 0 for all x < xmin, and F(x) = 1 for all x > xmax.
3: Since X takes only a finite number of values, then for each x it is true that for all sufficiently small numbers h > 0, X does not take values in the interval ]x ; x+h]; this means that the events {X ≤ x} and {X ≤ x+h} are identical, and so F(x) = F(x+h).
4: For a given number x consider the part of the range of X that is to the left of x, that is, the set X(Ω) ∩ ]−∞ ; x[. If this set is empty, take a = −∞, and otherwise take a to be the largest element in X(Ω) ∩ ]−∞ ; x[. Since X(ω) is outside of ]a ; x[ for all ω, it is true that for every x* ∈ ]a ; x[, the events {X = x} and {x* < X ≤ x} are identical, so that P(X = x) = P(x* < X ≤ x) = P(X ≤ x) − P(X ≤ x*) = F(x) − F(x*). For x* ↗ x we get the desired result.
5: Follows from items 3 and 4.

Random variables will appear again and again, and the reader will soon be handling them quite easily. Here are some simple (but not unimportant) examples of random variables.

[Margin figure: an "arbitrary" distribution function on a finite sample space.]

Example .: Constant random variable
The simplest type of random variable is the one that always takes the same value, that is, random variables of the form X(ω) = a for all ω ∈ Ω; here a is some

real number. The random variable X = a has distribution function

  F(x) =  1  if a ≤ x
          0  if x < a.

Actually, we could replace the condition "X(ω) = a for all ω ∈ Ω" with the condition "X = a with probability 1" or P(X = a) = 1; the distribution function would remain unchanged.

[Margin figure: the distribution function of the constant random variable X = a, a step function jumping from 0 to 1 at a.]


Example .: 01-variable
The second simplest type of random variable must be the random variable that takes only two different values. Such variables occur in models for binary experiments, i.e. experiments with two possible outcomes (such as Success/Failure or Heads/Tails). To simplify matters we shall assume that the two values are 0 and 1, whence the name 01-variables for such random variables (another name is Bernoulli variables).

The distribution of a 01-variable can be expressed using a parameter p which is the probability of the value 1:

  P(X = x) =  p      if x = 1
              1 − p  if x = 0

or

  P(X = x) = p^x (1 − p)^(1−x),  x = 0, 1.   (.)

The distribution function of a 01-variable with parameter p is

  F(x) =  1      if 1 ≤ x
          1 − p  if 0 ≤ x < 1
          0      if x < 0.

[Margin figure: the distribution function of a 01-variable.]

Example .: Indicator function
If A is an event (a subset of Ω), then its indicator function is the function

  1A(ω) =  1  if ω ∈ A
           0  if ω ∈ Ac.

This indicator function is a 01-variable with parameter p = P(A). Conversely, if X is a 01-variable, then X is the indicator function of the event A = X⁻¹({1}) = {ω : X(ω) = 1}.

Example .: Uniform random variable
If x1, x2, ..., xn are n different real numbers and X is a random variable that takes each of these numbers with the same probability, i.e.

  P(X = xi) = 1/n,  i = 1, 2, ..., n,

then X is said to be uniformly distributed on the set {x1, x2, ..., xn}.

The distribution function is not particularly well suited for giving an informative visualisation of the distribution of a random variable, as you may have realised from the above examples. A far better tool is the probability function. The probability function for the random variable X is the function f : x ↦ P(X = x), considered as a function defined on the range of X (or some reasonable superset of it). In situations that are modelled using a finite probability space, the random variables of interest are often variables taking values in the non-negative integers; in that case the probability function can be considered as defined either on the actual range of X, or on the set N0 of non-negative integers.


Definition .: Probability function
The probability function of a random variable X is the function

  f : x ↦ P(X = x).

Proposition .
For a given distribution the distribution function F and the probability function f are related as follows:

  f(x) = F(x) − F(x−),
  F(x) = ∑_{z : z≤x} f(z).

Proof
The expression for f is a reformulation of item 4 in Proposition .. The expression for F follows from Proposition . page .

[Margin figure: a distribution function F and the corresponding probability function f.]

Independent random variables

Often one needs to study more than one random variable at a time.

Let X1, X2, ..., Xn be random variables on the same finite probability space.

Their joint probability function is the function

f (x1,x2, . . . ,xn) = P(X1 = x1,X2 = x2, . . . ,Xn = xn).

For each j the function

  fj(xj) = P(Xj = xj)

is called the marginal probability function of Xj; more generally, if we have a non-trivial set of indices {i1, i2, ..., ik} ⊂ {1, 2, ..., n}, then

  fi1,i2,...,ik(xi1, xi2, ..., xik) = P(Xi1 = xi1, Xi2 = xi2, ..., Xik = xik)

is the marginal probability function of Xi1, Xi2, ..., Xik.

Previously we defined independence of events (page f); now we can define independence of random variables:

Definition .: Independent random variables
Let (Ω, F, P) be a probability space over a finite set. Then the random variables X1, X2, ..., Xn on (Ω, F, P) are said to be independent, if for any choice of subsets B1, B2, ..., Bn of R, the events {X1 ∈ B1}, {X2 ∈ B2}, ..., {Xn ∈ Bn} are independent.

A simple and clear criterion for independence is


Proposition .
The random variables X1, X2, ..., Xn are independent, if and only if

  P(X1 = x1, X2 = x2, ..., Xn = xn) = ∏_{i=1}^n P(Xi = xi)   (.)

for all n-tuples x1, x2, ..., xn of real numbers such that xi belongs to the range of Xi, i = 1, 2, ..., n.

Corollary .
The random variables X1, X2, ..., Xn are independent, if and only if their joint probability function equals the product of the marginal probability functions:

  f12...n(x1, x2, ..., xn) = f1(x1) f2(x2) ... fn(xn).

Proof of Proposition .
To show the "only if" part, let us assume that X1, X2, ..., Xn are independent according to Definition .. We then have to show that (.) holds. Let Bi = {xi}, then {Xi ∈ Bi} = {Xi = xi} (i = 1, 2, ..., n), and

  P(X1 = x1, X2 = x2, ..., Xn = xn) = P( ⋂_{i=1}^n {Xi = xi} ) = ∏_{i=1}^n P(Xi = xi);

since x1, x2, ..., xn are arbitrary, this proves (.).

Next we shall show the "if" part, that is, under the assumption that equation (.) holds, we have to show that for arbitrary subsets B1, B2, ..., Bn of R, the events {X1 ∈ B1}, {X2 ∈ B2}, ..., {Xn ∈ Bn} are independent. In general, for an arbitrary subset B of Rn,

  P((X1, X2, ..., Xn) ∈ B) = ∑_{(x1,x2,...,xn)∈B} P(X1 = x1, X2 = x2, ..., Xn = xn).

Then apply this to B = B1 × B2 × ... × Bn and make use of (.):

  P( ⋂_{i=1}^n {Xi ∈ Bi} ) = P((X1, X2, ..., Xn) ∈ B)
      = ∑_{x1∈B1} ∑_{x2∈B2} ... ∑_{xn∈Bn} P(X1 = x1, X2 = x2, ..., Xn = xn)
      = ∑_{x1∈B1} ∑_{x2∈B2} ... ∑_{xn∈Bn} ∏_{i=1}^n P(Xi = xi)
      = ∏_{i=1}^n ∑_{xi∈Bi} P(Xi = xi)
      = ∏_{i=1}^n P(Xi ∈ Bi),

showing that {X1 ∈ B1}, {X2 ∈ B2}, ..., {Xn ∈ Bn} are independent events.


If the individual Xs have identical probability functions and thus identical probability distributions, they are said to be identically distributed, and if they furthermore are independent, we will often describe the situation by saying that the Xs are "independent identically distributed random variables" (abbreviated to i.i.d. r.v.s).

Functions of random variables

[Margin figure: the maps X : Ω → R and t : R → R and their composite t ∘ X.]

Let X be a random variable on the finite probability space (Ω, F, P). If t is a function mapping the range of X into the reals, then the function t ∘ X is also a random variable on (Ω, F, P); usually we write t(X) instead of t ∘ X. The distribution of t(X) is easily found, in principle at least, since

  P(t(X) = y) = P(X ∈ {x : t(x) = y}) = ∑_{x : t(x)=y} f(x)   (.)

where f denotes the probability function of X.

Similarly, we write t(X1, X2, ..., Xn) where t is a function of n variables (so t maps Rn into R), and X1, X2, ..., Xn are n random variables. The function t can be as simple as +. Here is a result that should not be surprising:

Proposition .
Let X1, X2, ..., Xm, Xm+1, Xm+2, ..., Xm+n be independent random variables, and let t1 and t2 be functions of m and n variables, respectively. Then the random variables Y1 = t1(X1, X2, ..., Xm) and Y2 = t2(Xm+1, Xm+2, ..., Xm+n) are independent.

The proof is left as an exercise (Exercise .).

The distribution of a sum

Let X1 and X2 be random variables on (Ω, F, P), and let f12(x1, x2) denote their joint probability function. Then t(X1, X2) = X1 + X2 is also a random variable on (Ω, F, P), and the general formula (.) gives

  P(X1 + X2 = y) = ∑_{x2∈X2(Ω)} P(X1 + X2 = y and X2 = x2)
                 = ∑_{x2∈X2(Ω)} P(X1 = y − x2 and X2 = x2)
                 = ∑_{x2∈X2(Ω)} f12(y − x2, x2),

that is, the probability function of X1 + X2 is obtained by adding f12(x1, x2) over all pairs (x1, x2) with x1 + x2 = y. This has a straightforward generalisation to sums of more than two random variables.


If X1 and X2 are independent, then their joint probability function is the product of their marginal probability functions, so

Proposition .
If X1 and X2 are independent random variables with probability functions f1 and f2, then the probability function of Y = X1 + X2 is

  f(y) = ∑_{x2∈X2(Ω)} f1(y − x2) f2(x2).

Example .
If X1 and X2 are independent identically distributed 01-variables with parameter p, then what is the distribution of X1 + X2?

The joint probability function f12 of (X1, X2) is (cf. (.) page )

  f12(x1, x2) = f1(x1) f2(x2)
              = p^x1 (1 − p)^(1−x1) · p^x2 (1 − p)^(1−x2)
              = p^(x1+x2) (1 − p)^(2−(x1+x2))

for (x1, x2) ∈ {0,1}^2, and 0 otherwise. Thus the probability function of X1 + X2 is

  f(0) = f12(0, 0) = (1 − p)^2,
  f(1) = f12(1, 0) + f12(0, 1) = 2p(1 − p),
  f(2) = f12(1, 1) = p^2.

We check that the three probabilities add to 1:

  f(0) + f(1) + f(2) = (1 − p)^2 + 2p(1 − p) + p^2 = ((1 − p) + p)^2 = 1.

(This example can be generalised in several ways.)
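Proposition . is a discrete convolution and is easy to evaluate numerically. The Python sketch below is our addition; it convolves two point probability functions and reproduces the 01-variable example above for an arbitrarily chosen p.

from collections import defaultdict

def convolve(f1, f2):
    # Probability function of X1 + X2 for independent X1, X2:
    # f(y) = sum over x2 of f1(y - x2) * f2(x2).
    f = defaultdict(float)
    for x1, p1 in f1.items():
        for x2, p2 in f2.items():
            f[x1 + x2] += p1 * p2
    return dict(f)

p = 0.3                               # an arbitrary illustrative parameter
bernoulli = {0: 1 - p, 1: p}          # probability function of a 01-variable

f = convolve(bernoulli, bernoulli)
print(f)                # approximately {0: 0.49, 1: 0.42, 2: 0.09}, i.e. (1-p)^2, 2p(1-p), p^2
print(sum(f.values()))  # 1 (up to floating-point rounding)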

. Examples

The probability function of a 01-variable with parameter p is (cf. equation (.) on page )

  f(x) = p^x (1 − p)^(1−x),  x = 0, 1.

If X1, X2, ..., Xn are independent 01-variables with parameter p, then their joint probability function is (for (x1, x2, ..., xn) ∈ {0,1}^n)

  f12...n(x1, x2, ..., xn) = ∏_{i=1}^n p^xi (1 − p)^(1−xi) = p^s (1 − p)^(n−s)   (.)

where s = x1 + x2 + ... + xn (cf. Corollary .).


Definition .: Binomial distribution
The distribution of the sum of n independent and identically distributed 01-variables with parameter p is called the binomial distribution with parameters n and p (n ∈ N and p ∈ [0 ; 1]).

To find the probability function of a binomial distribution, consider n independent 01-variables X1, X2, ..., Xn, each with parameter p. Then Y = X1 + X2 + ... + Xn follows a binomial distribution with parameters n and p. For y an integer between 0 and n we then have, using that the joint probability function of X1, X2, ..., Xn is given by (.),

  P(Y = y) = ∑_{x1+x2+...+xn=y} f12...n(x1, x2, ..., xn)
           = ∑_{x1+x2+...+xn=y} p^y (1 − p)^(n−y)
           = ( ∑_{x1+x2+...+xn=y} 1 ) p^y (1 − p)^(n−y)
           = (n choose y) p^y (1 − p)^(n−y)

since the number of different n-tuples x1, x2, ..., xn with y 1s and n − y 0s is (n choose y). Hence, the probability function of the binomial distribution with parameters n and p is

  f(y) = (n choose y) p^y (1 − p)^(n−y),   y = 0, 1, 2, ..., n.   (.)

[Margin figure: probability function for the binomial distribution with n = 7 and p = 0.618.]
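Equation (.) can be checked numerically against Definition .: the binomial point probabilities computed from the formula must agree with the n-fold convolution of 01-variables. The Python sketch below is our addition, using the margin figure's values n = 7 and p = 0.618 purely as an illustration.

from math import comb

def binomial_pmf(n, p):
    # f(y) = C(n, y) * p^y * (1 - p)^(n - y),  y = 0, 1, ..., n.
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def convolve(f1, f2):
    # Probability function of the sum of two independent variables on {0, 1, 2, ...},
    # given as lists of point probabilities.
    f = [0.0] * (len(f1) + len(f2) - 1)
    for i, a in enumerate(f1):
        for j, b in enumerate(f2):
            f[i + j] += a * b
    return f

n, p = 7, 0.618
direct = binomial_pmf(n, p)

# Build the same distribution as the n-fold sum of 01-variables.
summed = [1.0]
for _ in range(n):
    summed = convolve(summed, [1 - p, p])

print(all(abs(a - b) < 1e-12 for a, b in zip(direct, summed)))   # True
print(abs(sum(direct) - 1.0) < 1e-12)                            # True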

Theorem .
If Y1 and Y2 are independent binomial variables such that Y1 has parameters n1 and p, and Y2 has parameters n2 and p, then the distribution of Y1 + Y2 is binomial with parameters n1 + n2 and p.

Proof
One can think of (at least) two ways of proving the theorem, a troublesome way based on Proposition ., and a cleverer way:

Let us consider n1 + n2 independent and identically distributed 01-variables X1, X2, ..., Xn1, Xn1+1, ..., Xn1+n2. Per definition, X1 + X2 + ... + Xn1 has the same distribution as Y1, namely a binomial distribution with parameters n1 and p, and similarly Xn1+1 + Xn1+2 + ... + Xn1+n2 follows the same distribution as Y2. From Proposition . we have that Y1 and Y2 are independent. Thus all the premises of Theorem . are met for these variables Y1 and Y2, and since Y1 + Y2 is a sum of n1 + n2 independent and identically distributed 01-variables with parameter p, Definition . tells us that Y1 + Y2 is binomial with parameters n1 + n2 and p. This proves the theorem!

Binomial coefficients
The number of different k-subsets of an n-set, i.e. the number of different subsets with exactly k elements that can be selected from a set with n elements, is denoted (n choose k), and is called a binomial coefficient. Here k and n are non-negative integers, and k ≤ n.
• We have (n choose 0) = 1, and (n choose k) = (n choose n−k).
• We can deduce a recursion formula for binomial coefficients: Let G ≠ ∅ be a set with n elements and let g0 ∈ G; now there are two kinds of k-subsets of G: 1) those k-subsets that do not contain g0 and therefore can be seen as k-subsets of G \ {g0}, and 2) those k-subsets that do contain g0 and therefore can be seen as the disjoint union of a (k − 1)-subset of G \ {g0} and {g0}. The total number of k-subsets is therefore

  (n choose k) = (n−1 choose k) + (n−1 choose k−1).

This is true for 0 < k < n.
• The formula

  (n choose k) = n! / (k! (n − k)!)

can be verified by checking that both sides of the equality sign satisfy the same recursion formula f(n, k) = f(n−1, k) + f(n−1, k−1), and that they agree for k = n and k = 0.

Perhaps the reader is left with a feeling that this proof was a little bit odd, and is it really a proof? Is it more than a very special situation with a Y1 and a Y2 constructed in a way that makes the conclusion almost self-evident? Should the proof not be based on "arbitrary" binomial variables Y1 and Y2?

No, not necessarily. The point is that the theorem is not a theorem about random variables as functions on a probability space, but a theorem about how the map (y1, y2) ↦ y1 + y2 transforms certain probability distributions, and in this connection it is of no importance how these distributions are provided. The random variables only act as symbols that are used for writing things in a lucid way. (See also page .)

One application of the binomial distribution is for modelling the number of favourable outcomes in a series of n independent repetitions of an experiment with only two possible outcomes, favourable and unfavourable.

Think of a series of experiments that extends over two days. Then we can model the number of favourable outcomes on each day and add the results, or we can model the total number. The final result should be the same in both cases, and Proposition . tells us that this is in fact so.

Then one might consider problems such as: Suppose we made n1 repetitions on day 1 and n2 repetitions on day 2. If the total number of favourable outcomes is known to be s, what can then be said about the number of favourable outcomes on day 1? That is what the next result is about.

Proposition .
If Y1 and Y2 are independent random variables such that Y1 is binomial with parameters n1 and p, and Y2 is binomial with parameters n2 and p, then the conditional distribution of Y1 given that Y1 + Y2 = s has probability function

  P(Y1 = y | Y1 + Y2 = s) = (n1 choose y)(n2 choose s−y) / (n1+n2 choose s),   (.)

which is non-zero when max{s − n2, 0} ≤ y ≤ min{n1, s}.

Note that the conditional distribution does not depend on p. The proof of Proposition . is left as an exercise (Exercise .).

The probability distribution with probability function (.) is a hypergeometric distribution. The hypergeometric distribution already appeared in Example . on page  in connection with simple random sampling without replacement. Simple random sampling with replacement, however, leads to the binomial distribution, as we shall see below.

When taking a small sample from a large population, it can hardly be of any importance whether the sampling is done with or without replacement. This is spelled out in Theorem .; to better grasp the meaning of the theorem, consider the following situation (cf. Example .): From a box containing N balls, b of which are black and N − b white, a sample of size n is taken without replacement; what is the probability that the sample contains exactly y black balls, and how does this probability behave for large values of N and b?

[Margin figures: probability functions for the hypergeometric distribution with n = 7 and (N, s) = (8, 5), (13, 8), (21, 13) and (34, 21).]

Theorem .
When N → ∞ and b → ∞ in such a way that b/N → p ∈ [0 ; 1],

  (b choose y)(N−b choose n−y) / (N choose n)  →  (n choose y) p^y (1 − p)^(n−y)

for all y ∈ {0, 1, 2, ..., n}.

Proof
Standard calculations give that

  (b choose y)(N−b choose n−y) / (N choose n)
    = (n choose y) · (N−n)! / ((b−y)! (N−n−(b−y))!) · b! (N−b)! / N!
    = (n choose y) · b!/(b−y)! · (N−b)!/(N−b−(n−y))! · (N−n)!/N!
    = (n choose y) · [ b(b−1)(b−2)...(b−y+1) · (N−b)(N−b−1)(N−b−2)...(N−b−(n−y)+1) ] / [ N(N−1)(N−2)...(N−n+1) ],

where the first group in the numerator has y factors, the second group has n − y factors, and the denominator has n factors.

The long fraction has n factors both in the numerator and in the denominator, so with a suitable matching we can write it as a product of n short fractions, each of which has a limiting value: there will be y fractions of the form (b − something)/(N − something), each converging to p, and there will be n − y fractions of the form (N − b − something)/(N − something), each converging to 1 − p. This proves the theorem.
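The convergence in Theorem . is easy to observe numerically: fix n, let N grow and keep b/N close to p. The Python sketch below is our addition, with p = 0.6 and n = 7 chosen only for illustration.

from math import comb

def hypergeom_pmf(N, b, n, y):
    # P(y black balls in an n-sample without replacement from N balls, b of them black).
    return comb(b, y) * comb(N - b, n - y) / comb(N, n)

def binom_pmf(n, p, y):
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 7, 0.6
for N in (20, 100, 1000, 100000):
    b = int(round(p * N))                 # so that b/N stays close to p
    max_diff = max(abs(hypergeom_pmf(N, b, n, y) - binom_pmf(n, p, y))
                   for y in range(n + 1))
    print(N, max_diff)                    # the maximal difference shrinks as N grows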


. Expectation

The expectation, or mean value, of a real-valued function is a weighted average of the values that the function takes.

Definition .: Expectation / expected value / mean value
The expectation, or expected value, or mean value, of a real random variable X on a finite probability space (Ω, F, P) is the real number

  E(X) = ∑_{ω∈Ω} X(ω) P({ω}).

We often simply write EX instead of E(X).

If t is a function defined on the range of X, then, using Definition ., the expected value of the random variable t(X) is

  E(t(X)) = ∑_{ω∈Ω} t(X(ω)) P({ω}).   (.)

Proposition .
The expectation of t(X) can be expressed as

  E(t(X)) = ∑_x t(x) P(X = x) = ∑_x t(x) f(x),

where the summation is over all x in the range X(Ω) of X, and where f denotes the probability function of X.

In particular, if t is the identity mapping, we obtain an alternative formula for the expected value of X:

  E(X) = ∑_x x P(X = x).

Proof
The proof is the following series of calculations:

  ∑_x t(x) P(X = x)
    (1)= ∑_x t(x) P({ω : X(ω) = x})
    (2)= ∑_x t(x) ∑_{ω∈{X=x}} P({ω})
    (3)= ∑_x ∑_{ω∈{X=x}} t(x) P({ω})
    (4)= ∑_x ∑_{ω∈{X=x}} t(X(ω)) P({ω})
    (5)= ∑_{ω ∈ ⋃_x {X=x}} t(X(ω)) P({ω})
    (6)= ∑_{ω∈Ω} t(X(ω)) P({ω})
    (7)= E(t(X)).

Comments:Equality : making precise the meaning of P(X = x).Equality : follows from Proposition . on page .Equality : multiply t(x) into the inner sum.Equality : if ω ∈ X = x, then t(X(ω)) = t(x).Equality : the sets X = x,ω ∈Ω, are disjoint.Equality :

⋃xX = x equals Ω.

Equality : equation (.).

Remarks:
. Proposition . is of interest because it provides us with three different expressions for the expected value of Y = t(X):
$$E\bigl(t(X)\bigr) = \sum_{\omega\in\Omega} t\bigl(X(\omega)\bigr)\,P(\{\omega\}), \qquad(.)$$
$$E\bigl(t(X)\bigr) = \sum_{x} t(x)\,P(X = x), \qquad(.)$$
$$E\bigl(t(X)\bigr) = \sum_{y} y\,P(Y = y). \qquad(.)$$
The point is that the calculation can be carried out either on Ω and using P (equation (.)), or on (the copy of R containing) the range of X and using the distribution of X (equation (.)), or on (the copy of R containing) the range of Y = t(X) and using the distribution of Y (equation (.)).
. In Proposition . it is implied that the symbol X represents an ordinary one-dimensional random variable, and that t is a function from R to R. But it is entirely possible to interpret X as an n-dimensional random variable, X = (X₁, X₂, . . . , Xₙ), and t as a map from Rⁿ to R. The proposition and its proof remain unchanged.
. A mathematician will term E(X) the integral of the function X with respect to P and write $E(X) = \int_\Omega X(\omega)\,P(d\omega)$, briefly $E(X) = \int_\Omega X\,dP$.

We can think of the expectation operator as a map X ↦ E(X) from the set of random variables (on the probability space in question) into the reals.
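As a concrete illustration of the three expressions for E(t(X)) above (a toy example, not from the notes; the sample space, the probabilities and the function t below are arbitrary), the following Python sketch computes the expectation in all three ways and confirms that they agree.

# Toy example: Omega = {a, b, c, d}, X defined pointwise, t(x) = x**2.
P = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}   # point probabilities P({omega})
X = {"a": -1, "b": 0, "c": 1, "d": 1}          # the random variable X(omega)
t = lambda x: x**2

# (1) sum over Omega, using P
e1 = sum(t(X[w]) * P[w] for w in P)

# (2) sum over the range of X, using the distribution of X
fX = {}
for w in P:
    fX[X[w]] = fX.get(X[w], 0) + P[w]
e2 = sum(t(x) * px for x, px in fX.items())

# (3) sum over the range of Y = t(X), using the distribution of Y
fY = {}
for w in P:
    fY[t(X[w])] = fY.get(t(X[w]), 0) + P[w]
e3 = sum(y * py for y, py in fY.items())

print(e1, e2, e3)   # all three equal 0.8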


Theorem .
The map X ↦ E(X) is a linear map from the vector space of random variables on (Ω, F, P) into the set of reals.

The proof is left as an exercise (Exercise . on page ).

In particular, therefore, the expected value of a sum always equals the sum of the expected values. On the other hand, the expected value of a product is not in general equal to the product of the expected values:

Proposition .
If X and Y are independent random variables on the probability space (Ω, F, P), then E(XY) = E(X) E(Y).

Proof
Since X and Y are independent, P(X = x, Y = y) = P(X = x) P(Y = y) for all x and y, and so
$$\begin{aligned}
E(XY) &= \sum_{x,y} xy\,P(X = x, Y = y) \\
&= \sum_{x,y} xy\,P(X = x)\,P(Y = y) \\
&= \sum_{x}\sum_{y} xy\,P(X = x)\,P(Y = y) \\
&= \sum_{x} x\,P(X = x)\sum_{y} y\,P(Y = y) = E(X)\,E(Y).
\end{aligned}$$

Proposition .
If X(ω) = a for all ω, then E(X) = a, in brief E(a) = a, and in particular, E(1) = 1.

Proof
Use Definition ..

Proposition .
If A is an event and 1_A its indicator function, then E(1_A) = P(A).

Proof
Use Definition .. (Indicator functions were introduced in Example . on page .)

In the following, we shall deduce various results about the expectation operator, results of the form: if we know this of X (and Y, Z, . . .), then we can say that about E(X) (and E(Y), E(Z), . . .). Often, what is said about X is of the form "u(X) with probability 1", that is, it is not required that u(X(ω)) be true for all ω, only that the event {ω : u(X(ω))} has probability 1.

Lemma .
If A is an event with P(A) = 0, and X a random variable, then E(X1_A) = 0.

Proof
Let ω be a point in A. Then {ω} ⊆ A, and since P is increasing (Lemma .), it follows that P({ω}) ≤ P(A) = 0. We also know that P({ω}) ≥ 0, so we conclude that P({ω}) = 0, and the first one of the sums below vanishes:
$$E(X1_A) = \sum_{\omega\in A} X(\omega)\,1_A(\omega)\,P(\{\omega\}) + \sum_{\omega\in\Omega\setminus A} X(\omega)\,1_A(\omega)\,P(\{\omega\}).$$
The second sum vanishes because 1_A(ω) = 0 for all ω ∈ Ω \ A.

Proposition .
The expectation operator is positive, i.e., if X ≥ 0 with probability 1, then E(X) ≥ 0.

Proof
Let B = {ω : X(ω) ≥ 0} and A = {ω : X(ω) < 0}. Then 1 = 1_B(ω) + 1_A(ω) for all ω, and so X = X1_B + X1_A and hence E(X) = E(X1_B) + E(X1_A). Being a sum of non-negative terms, E(X1_B) is non-negative, and Lemma . shows that E(X1_A) = 0.

Corollary .
The expectation operator is increasing in the sense that if X ≤ Y with probability 1, then E(X) ≤ E(Y). In particular, |E(X)| ≤ E(|X|).

Proof
If X ≤ Y with probability 1, then E(Y) − E(X) = E(Y − X) ≥ 0 according to Proposition ..

Using |X| as Y, we then get E(X) ≤ E(|X|) and −E(X) = E(−X) ≤ E(|X|), so |E(X)| ≤ E(|X|).

Corollary .
If X = a with probability 1, then E(X) = a.

Proof
With probability 1 we have a ≤ X ≤ a, so from Corollary . and Proposition . we get a ≤ E(X) ≤ a, i.e. E(X) = a.

Lemma .: Markov inequality
If X is a non-negative random variable and c a positive number, then
$$P(X \geq c) \leq \frac{1}{c}\,E(X).$$
Andrei Markov: Russian mathematician (–).

Proof
Since c · 1_{X≥c}(ω) ≤ X(ω) for all ω, we have c E(1_{X≥c}) ≤ E(X), that is, E(1_{X≥c}) ≤ (1/c) E(X). Since E(1_{X≥c}) = P(X ≥ c) (Proposition .), the lemma is proved.

Corollary .: Chebyshev inequality
If a is a positive number and X a random variable, then
$$P(|X| \geq a) \leq \frac{1}{a^2}\,E(X^2).$$
Pafnuty Chebyshev: Russian mathematician (–).

Proof
P(|X| ≥ a) = P(X² ≥ a²) ≤ (1/a²) E(X²), using the Markov inequality.
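A quick numerical sanity check of the two inequalities (illustration only; the small distribution below is an arbitrary choice):

# Arbitrary discrete distribution, used only to illustrate the inequalities.
dist = {0: 0.5, 1: 0.3, 2: 0.15, 5: 0.05}    # values x and probabilities P(X = x)

EX  = sum(x * p for x, p in dist.items())
EX2 = sum(x**2 * p for x, p in dist.items())

c = 2.0
print(sum(p for x, p in dist.items() if x >= c), "<=", EX / c)           # Markov: 0.2 <= 0.425

a = 2.0
print(sum(p for x, p in dist.items() if abs(x) >= a), "<=", EX2 / a**2)  # Chebyshev: 0.2 <= 0.5375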

Corollary .
If E|X| = 0, then X = 0 with probability 1.

Proof
We will show that P(|X| > 0) = 0. From the Markov inequality we know that P(|X| ≥ ε) ≤ (1/ε) E(|X|) = 0 for all ε > 0. Now choose a positive ε less than the smallest non-zero number in the range of |X| (the range is finite, so such an ε exists). Then the event {|X| ≥ ε} equals the event {|X| > 0}, and hence P(|X| > 0) = 0.

Proposition .: Cauchy-Schwarz inequality
For random variables X and Y on a finite probability space
$$\bigl(E(XY)\bigr)^2 \leq E(X^2)\,E(Y^2) \qquad(.)$$
with equality if and only if there exists (a, b) ≠ (0, 0) such that aX + bY = 0 with probability 1.
Augustin Louis Cauchy: French mathematician (–). Hermann Amandus Schwarz: German mathematician (–).

Proof
If X and Y are both 0 with probability 1, then the inequality is fulfilled (with equality).

Suppose that Y ≠ 0 with positive probability. For all t ∈ R
$$0 \leq E\bigl((X+tY)^2\bigr) = E(X^2) + 2t\,E(XY) + t^2\,E(Y^2). \qquad(.)$$
The right-hand side is a quadratic in t (note that E(Y²) > 0 since Y is non-zero with positive probability), and being always non-negative, it cannot have two different real roots, and therefore the discriminant is non-positive, that is, (2E(XY))² − 4E(X²)E(Y²) ≤ 0, which is equivalent to (.). The discriminant equals zero if and only if the quadratic has one real root, i.e. if and only if there exists a t such that E((X + tY)²) = 0, and Corollary . shows that this is equivalent to saying that X + tY is 0 with probability 1.

The case X ≠ 0 with positive probability is treated in the same way.


Variance and covariance

If you have to describe the distribution of X using one single number, then E(X) is a good candidate. If you are allowed to use two numbers, then it would be very sensible to take the second number to be some kind of measure of the random variation of X about its mean. Usually one uses the so-called variance.

Definition .: Variance and standard deviation
The variance of the random variable X is the real number
$$\mathrm{Var}(X) = E\bigl((X-EX)^2\bigr) = E(X^2) - (EX)^2.$$
The standard deviation of X is the real number $\sqrt{\mathrm{Var}(X)}$.

From the first of the two formulas for Var(X) we see that Var(X) is always non-negative. (Exercise . is about showing that E((X − EX)²) is the same as E(X²) − (EX)².)

Proposition .
Var(X) = 0 if and only if X is constant with probability 1, and if so, the constant equals EX.

Proof
If X = c with probability 1, then EX = c; therefore (X − EX)² = 0 with probability 1, and so Var(X) = E((X − EX)²) = 0.

Conversely, if Var(X) = 0, then Corollary . shows that (X − EX)² = 0 with probability 1, that is, X equals EX with probability 1.

The expectation operator is linear, which means that we always have E(aX) = a EX and E(X + Y) = EX + EY. For the variance operator the algebra is different.

Proposition .
For any random variable X and any real number a, Var(aX) = a² Var(X).

Proof
Use the definition of Var and the linearity of E.

Definition .: Covariance
The covariance between two random variables X and Y on the same probability space is the real number Cov(X, Y) = E((X − EX)(Y − EY)).

The following rules are easily shown:


Proposition .
For any random variables X, Y, U, V and any real numbers a, b, c, d
$$\begin{aligned}
\mathrm{Cov}(X,X) &= \mathrm{Var}(X), \\
\mathrm{Cov}(X,Y) &= \mathrm{Cov}(Y,X), \\
\mathrm{Cov}(X,a) &= 0, \\
\mathrm{Cov}(aX+bY,\,cU+dV) &= ac\,\mathrm{Cov}(X,U) + ad\,\mathrm{Cov}(X,V) + bc\,\mathrm{Cov}(Y,U) + bd\,\mathrm{Cov}(Y,V).
\end{aligned}$$

Moreover:

Proposition .
If X and Y are independent random variables, then Cov(X, Y) = 0.

Proof
If X and Y are independent, then the same is true of X − EX and Y − EY (Proposition . page ), and using Proposition . page  we see that E((X − EX)(Y − EY)) = E(X − EX) E(Y − EY) = 0 · 0 = 0.

Proposition .
For any random variables X and Y, |Cov(X, Y)| ≤ √Var(X) √Var(Y). The equality sign holds if and only if there exists (a, b) ≠ (0, 0) such that aX + bY is constant with probability 1.

Proof
Apply Proposition . to the random variables X − EX and Y − EY.

Definition .: Correlation
The correlation between two non-constant random variables X and Y is the number
$$\mathrm{corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}.$$

Corollary .
For any non-constant random variables X and Y
. −1 ≤ corr(X, Y) ≤ +1,
. corr(X, Y) = +1 if and only if there exists (a, b) with ab < 0 such that aX + bY is constant with probability 1,
. corr(X, Y) = −1 if and only if there exists (a, b) with ab > 0 such that aX + bY is constant with probability 1.

Proof
Everything follows from Proposition ., except the link between the sign of the correlation and the sign of ab, and that follows from the fact that Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y) cannot be zero unless the last term is negative.

Proposition .
For any random variables X and Y on the same probability space
$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + 2\,\mathrm{Cov}(X,Y) + \mathrm{Var}(Y).$$
If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).

Proof
The general formula is proved using simple calculations. First
$$\bigl(X + Y - E(X+Y)\bigr)^2 = (X-EX)^2 + (Y-EY)^2 + 2(X-EX)(Y-EY).$$
Taking expectation on both sides then gives
$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,E\bigl((X-EX)(Y-EY)\bigr) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y).$$
Finally, if X and Y are independent, then Cov(X, Y) = 0.

Corollary .
Let X₁, X₂, . . . , Xₙ be independent and identically distributed random variables with expectation µ and variance σ², and let X̄ₙ = (X₁ + X₂ + . . . + Xₙ)/n denote the average of X₁, X₂, . . . , Xₙ. Then EX̄ₙ = µ and Var X̄ₙ = σ²/n.

This corollary shows how the random variation of an average of n values decreases with n.

Examples

Example .: 01-variables
Let X be a 01-variable with P(X = 1) = p and P(X = 0) = 1 − p. Then the expected value of X is EX = 0 · P(X = 0) + 1 · P(X = 1) = p. The variance of X is, using Definition ., Var(X) = E(X²) − p² = 0² · P(X = 0) + 1² · P(X = 1) − p² = p(1 − p).

Example .: Binomial distribution
The binomial distribution with parameters n and p is the distribution of a sum of n independent 01-variables with parameter p (Definition . page ). Using Example ., Theorem . and Proposition ., we find that the expected value of this binomial distribution is np and the variance is np(1 − p).

It is perfectly possible, of course, to find the expected value as $\sum_x xf(x)$ where f is the probability function of the binomial distribution (equation (.) page ), and similarly one can find the variance as $\sum_x x^2 f(x) - \bigl(\sum_x xf(x)\bigr)^2$.

In a later chapter we will find the expected value and the variance of the binomial distribution in an entirely different way, see Example . on page .
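The direct computation via the probability function is easy to carry out numerically; here is a small sketch (illustration only; n = 10 and p = 0.3 are arbitrary) that evaluates the two sums above for a binomial distribution and compares them with np and np(1 − p).

from math import comb

n, p = 10, 0.3                                    # arbitrary example parameters
f = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * fx for x, fx in enumerate(f))
var  = sum(x**2 * fx for x, fx in enumerate(f)) - mean**2

print(mean, n * p)               # both 3.0
print(var, n * p * (1 - p))      # both 2.1 (up to rounding)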


Law of large numbers

A main field in the theory of probability is the study of the asymptotic behaviour of sequences of random variables. Here is a simple result:

Theorem .: Weak Law of Large Numbers
Let X₁, X₂, . . . , Xₙ, . . . be a sequence of independent and identically distributed random variables with expected value µ and variance σ².

Then the average X̄ₙ = (X₁ + X₂ + . . . + Xₙ)/n converges to µ in the sense that
$$\lim_{n\to\infty} P\bigl(\,\bigl|\overline{X}_n - \mu\bigr| < \varepsilon\,\bigr) = 1$$
for all ε > 0.

Strong Law of Large Numbers: there is also a Strong Law of Large Numbers, stating that under the same assumptions as in the Weak Law of Large Numbers,
$$P\bigl(\lim_{n\to\infty} \overline{X}_n = \mu\bigr) = 1,$$
i.e. X̄ₙ converges to µ with probability 1.

Proof
From Corollary . we have Var(X̄ₙ − µ) = σ²/n. Then the Chebyshev inequality (Corollary . page ) gives
$$P\bigl(\,\bigl|\overline{X}_n - \mu\bigr| < \varepsilon\,\bigr) = 1 - P\bigl(\,\bigl|\overline{X}_n - \mu\bigr| \geq \varepsilon\,\bigr) \geq 1 - \frac{1}{\varepsilon^2}\,\frac{\sigma^2}{n},$$
and the last expression converges to 1 as n tends to ∞.

Comment: The reader might think that the theorem and the proof involve infinitely many independent and identically distributed random variables; is that really allowed (at this point of the exposition), and can one have infinite sequences of random variables at all? But everything is in fact quite unproblematic: we consider limits of sequences of real numbers only, and we are working with a finite number n of random variables at a time; for instance, we can use a product probability space, cf. page f.

The Law of Large Numbers shows that when calculating the average of a large number of independent and identically distributed variables, the result will be close to the expected value with a probability close to 1.

We can apply the Law of Large Numbers to a sequence of independent and identically distributed 01-variables X₁, X₂, . . . that are 1 with probability p; their distribution has expected value p (cf. Example .). The average X̄ₙ is the relative frequency of the outcome 1 in the finite sequence X₁, X₂, . . . , Xₙ. The Law of Large Numbers shows that with a probability close to 1 the relative frequency of the outcome 1 is close to p, that is, the relative frequency of 1 is close to the probability of 1. In other words, within the mathematical universe we can deduce that (mathematical) probability is consistent with the notion of relative frequency. This can be taken as evidence of the appropriateness of the mathematical discipline Theory of Probability.


The Law of Large Numbers thus makes it possible to interpret probability as relative frequency in the long run, and similarly to interpret expected value as the average of a very large number of observations.
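The frequency interpretation can also be illustrated by simulation; the minimal Python sketch below (not part of the notes; p = 0.3 and the seed are arbitrary) tosses a p-coin n times for increasing n and prints the relative frequency of 1s, which settles near p.

import random

random.seed(1)                    # arbitrary seed, for reproducibility
p = 0.3                           # hypothetical success probability

for n in (10, 100, 10_000, 1_000_000):
    xbar = sum(1 for _ in range(n) if random.random() < p) / n
    print(n, xbar)                # the relative frequency approaches p as n grows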

. Exercises

Exercise .
Write down a probability space that models the experiment "toss a coin three times". Specify the three events "at least one Heads", "at most one Heads", and "not the same result in two successive tosses". What is the probability of each of these events?

Exercise .
Write down an example (a mathematics example, not necessarily a "real world" example) of a probability space where the sample space has three elements.

Exercise .
Lemma . on page  presents a formula for the probability of the occurrence of at least one of two events A and B. Deduce a formula for the probability that exactly one of the two events occurs.

Exercise .
Give reasons for defining conditional probability the way it is done in Definition . on page , assuming that probability has an interpretation as relative frequency in the long run.

Exercise .
Show that the function P(· | B) in Definition . on page  satisfies the conditions for being a probability measure (Definition . on page ). Write down the point probabilities of P(· | B) in terms of those of P.

Exercise .
Two fair dice are thrown and the number of pips of each is recorded. Find the probability that die number  turns up a prime number of pips, given that the total number of pips is 8.

Exercise .
A box contains five red balls and three blue balls. In the half-dark in the evening, when all colours look grey, Pedro takes four balls at random from the box, and afterwards Antonia takes one ball at random from the box.
. Find the probability that Antonia takes a blue ball.
. Given that Antonia takes a blue ball, what is the probability that Pedro took four red balls?

Exercise .
Assume that children in a certain family will have blue eyes with probability 1/4, and that the eye colour of each child is independent of the eye colour of the other children. This family has five children.
. If it is known that the youngest child has blue eyes, what is the probability that at least three of the children have blue eyes?
. If it is known that at least one of the five children has blue eyes, what is the probability that at least three of the children have blue eyes?

Exercise .
Often one is presented with problems of the form: "In a clinical investigation it was found that in a group of lung cancer patients, % were smokers, and in a comparable control group % were smokers. In the light of this, what can be said about the impact of smoking on lung cancer?"
[A control group is (in the present situation) a group of persons chosen such that it resembles the patient group in "every" respect except that persons in the control group are not diagnosed with lung cancer. The patient group and the control group should be comparable with respect to distribution of age, sex, social stratum, etc.]

Disregarding discussions of the exact meaning of a "comparable" control group, as well as problems relating to statistical and biological variability, then what quantitative statements of any interest can be made, relating to the impact of smoking on lung cancer?

Hint: The odds of an event A is the number P(A)/P(Aᶜ). Deduce formulas for the odds for a smoker having lung cancer, and the odds for a non-smoker having lung cancer.

Exercise .
From time to time proposals are put forward about general screenings of the population in order to detect specific diseases at an early stage.

Suppose one is going to test every single person in a given section of the population for a fairly infrequent disease. The test procedure is not % reliable (test procedures seldom are), so there is a small probability of a "false positive", i.e. of erroneously saying that the test person has the disease, and there is also a small probability of a "false negative", i.e. of erroneously saying that the person does not have the disease.

Write down a mathematical model for such a situation. Which parameters would it be sensible to have in such a model?

For the individual person two questions are of the utmost importance: If the test turns out positive, what is the probability of having the disease, and if the test turns out negative, what is the probability of having the disease?

Make use of the mathematical model to answer these questions. You might also enter numerical values for the parameters, and calculate the probabilities.

Exercise .
Show that if the events A and B are independent, then so are A and Bᶜ.

Exercise .
Write down a set of necessary and sufficient conditions that a function f is a probability function.

Exercise .
Sketch the probability function of the binomial distribution for various choices of the parameters n and p. What happens when p → 0 or p → 1? What happens if p is replaced by 1 − p?

Exercise .
Assume that X is uniformly distributed on {1, 2, 3, . . . , 10} (cf. Definition . on page ). Write down the probability function and the distribution function of X, and make sketches of both functions. Find EX and Var X.

Exercise .
Consider random variables X and Y on the same probability space (Ω, F, P). What connections are there between the statements "X = Y with probability 1" and "X and Y have the same distribution"? – Hint: Example . on page  might be of some use.

Exercise .
Let X and Y be random variables on the same probability space (Ω, F, P). Show that if X is constant (i.e. if there exists a number c such that X(ω) = c for all ω ∈ Ω), then X and Y are independent.

Next, show that if X is constant with probability 1 (i.e. if there is a number c such that P(X = c) = 1), then X and Y are independent.

Exercise .
Prove Proposition . on page . — Use the definition of conditional probability and the fact that {Y₁ = y and Y₁ + Y₂ = s} is equivalent to {Y₁ = y and Y₂ = s − y}; then make use of the independence of Y₁ and Y₂, and finally use that we know the probability function for each of the variables Y₁, Y₂ and Y₁ + Y₂.

Exercise .
Show that the set of random variables on a finite probability space is a vector space over the set of reals.

Show that the map X ↦ E(X) is a linear map from this vector space into the vector space R.

Exercise .
Let the random variable X have a distribution given by P(X = 1) = P(X = −1) = 1/2. Find EX and Var X.

Let the random variable Y have a distribution given by P(Y = 100) = P(Y = −100) = 1/2, and find EY and Var Y.

Exercise .
Definition . on page  presents two apparently different expressions for the variance of a random variable X. Show that it is in fact true that E((X − EX)²) = E(X²) − (EX)².

Exercise .
Let X be a random variable such that
$$P(X = 2) = 0.1, \quad P(X = -2) = 0.1, \quad P(X = 1) = 0.4, \quad P(X = -1) = 0.4,$$
and let Y = t(X) where t is the function from R to R given by
$$t(x) = \begin{cases} -x & \text{if } |x| \leq 1 \\ x & \text{otherwise.} \end{cases}$$
Show that X and Y have the same distribution. Find the expected value of X and the expected value of Y. Find the variance of X and the variance of Y. Find the covariance between X and Y. Are X and Y independent?

Exercise .
Give a proof of Proposition . (page ): In the notation of the proposition, we have to show that
$$P(Y_1 = y_1 \text{ and } Y_2 = y_2) = P(Y_1 = y_1)\,P(Y_2 = y_2)$$
for all y₁ and y₂. Write the first probability as a sum over a certain set of (m+n)-tuples (x₁, x₂, . . . , x_{m+n}); this set is a product of a set of m-tuples (x₁, x₂, . . . , x_m) and a set of n-tuples (x_{m+1}, x_{m+2}, . . . , x_{m+n}).

Exercise .: Jensen's inequality
Let X be a random variable and ψ a convex function. Show Jensen's inequality
$$E\psi(X) \geq \psi(EX).$$

Convex functions: A real-valued function ψ defined on an interval I ⊆ R is said to be convex if the line segment connecting any two points (x₁, ψ(x₁)) and (x₂, ψ(x₂)) of the graph of ψ lies entirely above or on the graph, that is, for any x₁, x₂ ∈ I,
$$\lambda\psi(x_1) + (1-\lambda)\psi(x_2) \geq \psi\bigl(\lambda x_1 + (1-\lambda)x_2\bigr)$$
for all λ ∈ [0 ; 1]. If ψ possesses a continuous second derivative, then ψ is convex if and only if ψ″ ≥ 0.

Johan Ludwig William Valdemar Jensen (–): Danish mathematician and engineer at Kjøbenhavns Telefon Aktieselskab (the Copenhagen Telephone Company). Among other things, known for Jensen's inequality.

Exercise .
Let x₁, x₂, . . . , xₙ be n positive numbers. The number G = (x₁x₂ · · · xₙ)^{1/n} is called the geometric mean of the xs, and A = (x₁ + x₂ + · · · + xₙ)/n is the arithmetic mean of the xs. Show that G ≤ A.

Hint: Apply Jensen's inequality to the convex function ψ = − ln and the random variable X given by P(X = xᵢ) = 1/n, i = 1, 2, . . . , n.


Countable Sample Spaces

This chapter presents elements of the theory of probability distributions on countable sample spaces. In many respects it is an entirely unproblematic generalisation of the theory covering the finite case; there will, however, be a couple of significant exceptions.

Recall that in the finite case we were dealing with the set F of all subsets of the finite sample space Ω, and that each element of F, that is, each subset of Ω (each event), was assigned a probability. In the countable case too, we will assign a probability to each subset of the sample space. This will meet the demands specified by almost any modelling problem on a countable sample space, although it is somewhat of a simplification from the formal point of view. — The interested reader is referred to Definition . on page  to see a general definition of probability space.

. Basic definitions

Definition .: Probability space over countable sample space
A probability space over a countable set is a triple (Ω, F, P) consisting of
. a sample space Ω, which is a non-empty countable set,
. the set F of all subsets of Ω,
. a probability measure on (Ω, F), that is, a map P : F → R that is
• positive: P(A) ≥ 0 for all A ∈ F,
• normed: P(Ω) = 1, and
• σ-additive: if A₁, A₂, . . . is a sequence in F of disjoint events, then
$$P\Bigl(\,\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i). \qquad(.)$$

A countably infinite set has a non-countable infinity of subsets, but, as you can see, the condition for being a probability measure involves only a countable infinity of events at a time.

Lemma .
Let (Ω, F, P) be a probability space over the countable sample space Ω. Then
i. P(∅) = 0.
ii. If A₁, A₂, . . . , Aₙ are disjoint events from F, then
$$P\Bigl(\,\bigcup_{i=1}^{n} A_i\Bigr) = \sum_{i=1}^{n} P(A_i),$$
that is, P is also finitely additive.
iii. P(A) + P(Aᶜ) = 1 for all events A.
iv. If A ⊆ B, then P(B \ A) = P(B) − P(A), and P(A) ≤ P(B), i.e. P is increasing.
v. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof
If we in equation (.) let A₁ = Ω and take all other As to be ∅, we see that P(∅) must equal 0. Then it is obvious that σ-additivity implies finite additivity.

The rest of the lemma is proved in the same way as in the finite case, see the proof of Lemma . on page .

The next lemma shows that even in a countably infinite sample space almost all of the probability mass will be located on a finite set.

Lemma .
Let (Ω, F, P) be a probability space over the countable sample space Ω. For each ε > 0 there is a finite subset Ω₀ of Ω such that P(Ω₀) ≥ 1 − ε.

Proof
If ω₁, ω₂, ω₃, . . . is an enumeration of the elements of Ω, then we can write $\bigcup_{i=1}^{\infty}\{\omega_i\} = \Omega$, and hence $\sum_{i=1}^{\infty} P(\{\omega_i\}) = P(\Omega) = 1$ by the σ-additivity of P. Since the series converges to 1, the partial sums eventually exceed 1 − ε, i.e. there is an n such that $\sum_{i=1}^{n} P(\{\omega_i\}) \geq 1 - \varepsilon$. Now take Ω₀ to be the set {ω₁, ω₂, . . . , ωₙ}.

Point probabilities

Point probabilities are defined the same way as in the finite case, cf. Definition . on page :

Definition .: Point probabilities
Let (Ω, F, P) be a probability space over the countable sample space Ω. The function
$$p : \Omega \longrightarrow [0\,;\,1], \qquad \omega \longmapsto P(\{\omega\})$$
is the point probability function of P.


The proposition below is proved in the same way as in the finite case (Proposition . on page ), but note that in the present case the sum can have infinitely many terms.

Proposition .
If p is the point probability function of the probability measure P, then $P(A) = \sum_{\omega\in A} p(\omega)$ for any event A ∈ F.

Remarks: It follows from the proposition that different probability measures cannot have the same point probability function. It also follows (by taking A = Ω) that p adds up to 1, i.e. $\sum_{\omega\in\Omega} p(\omega) = 1$.

The next result is proved the same way as in the finite case (Proposition . on page ); note that in sums with non-negative terms, you can change the order of the terms freely without changing the value of the sum.

Proposition .
If p : Ω → [0 ; 1] adds up to 1, i.e. $\sum_{\omega\in\Omega} p(\omega) = 1$, then there exists exactly one probability measure on Ω with p as its point probability function.

Conditioning; independence

Notions such as conditional probability, conditional distribution and independence are defined exactly as in the finite case, see page  and .

Note that Bayes' formula (see page ) has the following generalisation: If B₁, B₂, . . . is a partition of Ω, i.e. the Bs are disjoint and $\bigcup_{i=1}^{\infty} B_i = \Omega$, then
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\displaystyle\sum_{i=1}^{\infty} P(A \mid B_i)\,P(B_i)}$$
for any event A.

Random variables

Random variables are defined the same way as in the finite case, except that now they can take infinitely many values.

Here are some examples of distributions living on countably infinite subsets of the reals. First some distributions that can be of some use in actual model building, as we shall see later (in Section .):


• The Poisson distribution with parameter µ > 0 has point probability function
$$f(x) = \frac{\mu^x}{x!}\exp(-\mu), \qquad x = 0, 1, 2, 3, \ldots$$
• The geometric distribution with probability parameter p ∈ ]0 ; 1[ has point probability function
$$f(x) = p\,(1-p)^x, \qquad x = 0, 1, 2, 3, \ldots$$
• The negative binomial distribution with probability parameter p ∈ ]0 ; 1[ and shape parameter k > 0 has point probability function
$$f(x) = \binom{x+k-1}{x} p^k (1-p)^x, \qquad x = 0, 1, 2, 3, \ldots$$
• The logarithmic distribution with parameter p ∈ ]0 ; 1[ has point probability function
$$f(x) = \frac{1}{-\ln p}\,\frac{(1-p)^x}{x}, \qquad x = 1, 2, 3, \ldots$$

Here is a different kind of example, showing what the mathematical formalism also is about.

• The range of a random variable can of course be all kinds of subsets of the reals. A not too weird example is a random variable X taking the values ±1, ±1/2, ±1/3, ±1/4, . . . , 0 with these probabilities:
$$P\Bigl(X = \tfrac{1}{n}\Bigr) = \frac{1}{3\cdot 2^n}, \quad n = 1, 2, 3, \ldots, \qquad
P(X = 0) = \tfrac{1}{3}, \qquad
P\Bigl(X = -\tfrac{1}{n}\Bigr) = \frac{1}{3\cdot 2^n}, \quad n = 1, 2, 3, \ldots$$

Thus X takes infinitely many values, all lying in a finite interval.

Distribution function and probability function are defined in the same way in the countable case as in the finite case (Definition . on page  and Definition . on page ), and Lemma . and Proposition . on page  are unchanged. The proof of Proposition . calls for a modification: either utilise that a countable set is "almost finite" (Lemma .), or use the proof of the general case (Proposition .).

The criterion for independence of random variables (Proposition . or Corollary . on page ) remains unchanged. The formula for the point probability function for a sum of random variables is unchanged (Proposition . on page ).


. Expectation

Let (Ω, F, P) be a probability space over a countable set, and let X be a random variable on this space. One might consider defining the expectation of X in the same way as in the finite case (page ), i.e. as the real number $EX = \sum_{\omega\in\Omega} X(\omega)\,P(\{\omega\})$ or $EX = \sum_x x f(x)$, where f is the probability function of X. In a way this is an excellent idea, except that the sums now may involve infinitely many terms, which may give problems of convergence.

Example .
It is assumed to be known that for α > 1, $c(\alpha) = \sum_{x=1}^{\infty} x^{-\alpha}$ is a finite positive number; therefore we can define a random variable X having probability function $f_X(x) = \frac{1}{c(\alpha)}\,x^{-\alpha}$, x = 1, 2, 3, . . . However, the series $EX = \sum_{x=1}^{\infty} x f_X(x) = \frac{1}{c(\alpha)}\sum_{x=1}^{\infty} x^{-\alpha+1}$ converges to a well-defined limit only if α > 2. For α ∈ ]1 ; 2], the probability function converges too slowly towards 0 for X to have an expected value. So not all random variables can be assigned an expected value in a meaningful way.

One might suggest that we introduce +∞ and −∞ as permitted values for EX, but that does not solve all problems. Consider for example a random variable Y with range N ∪ (−N) and probability function $f_Y(y) = \frac{1}{2c(\alpha)}\,|y|^{-\alpha}$, y = ±1, ±2, ±3, . . .; in this case it is not so easy to say what EY should really mean.

Definition .: Expectation / expected value / mean value
Let X be a random variable on (Ω, F, P). If the sum $\sum_{\omega\in\Omega} |X(\omega)|\,P(\{\omega\})$ converges to a finite value, then X is said to have an expected value, and the expected value of X is then the real number $EX = \sum_{\omega\in\Omega} X(\omega)\,P(\{\omega\})$.

Proposition .
Let X be a real random variable on (Ω, F, P), and let f denote the probability function of X. Then X has an expected value if and only if the sum $\sum_x |x|\,f(x)$ is finite, and if so, $EX = \sum_x x f(x)$.

This can be proved in almost the same way as Proposition . on page , except that we now have to deal with infinite series.

Example . shows that not all random variables have an expected value. — We introduce a special notation for the set of random variables that do have an expected value.


Definition .
The set of random variables on (Ω, F, P) which have an expected value is denoted L¹(Ω, F, P) or L¹(P) or L¹.

We have (cf. Theorem . on page )

Theorem .
The set L¹(Ω, F, P) is a vector space (over R), and the map X ↦ EX is a linear map of this vector space into R.

This is proved using standard results from the calculus of infinite series.

The results from the finite case about expected values are easily generalised to the countable case, but note that Proposition . on page  now becomes

Proposition .
If X and Y are independent random variables and both have an expected value, then XY also has an expected value, and E(XY) = E(X) E(Y).

Lemma . on page  now becomes

Lemma .
If A is an event with P(A) = 0, and X is a random variable, then the random variable X1_A has an expected value, and E(X1_A) = 0.

The next result shows that you can change the values of a random variable on a set of probability 0 without changing its expected value.

Proposition .
Let X and Y be random variables. If X = Y with probability 1, and if X ∈ L¹, then also Y ∈ L¹, and EY = EX.

Proof
Since Y = X + (Y − X), it suffices (using Theorem .) to show that Y − X ∈ L¹ and E(Y − X) = 0. Let A = {ω : X(ω) ≠ Y(ω)}. Now we have Y(ω) − X(ω) = (Y(ω) − X(ω))1_A(ω) for all ω, and the result follows from Lemma ..

An application of the comparison test for infinite series gives

Lemma .
Let Y be a random variable. If there is a non-negative random variable X ∈ L¹ such that |Y| ≤ X, then Y ∈ L¹, and EY ≤ EX.


Variance and covariance

In brief, the variance of a random variable X is defined as Var(X) = E((X − EX)²) whenever this is a finite number.

Definition .
The set of random variables X on (Ω, F, P) such that E(X²) exists is denoted L²(Ω, F, P) or L²(P) or L².

Note that X ∈ L²(P) if and only if X² ∈ L¹(P).

Lemma .
If X ∈ L² and Y ∈ L², then XY ∈ L¹.

Proof
Use the fact that |xy| ≤ x² + y² for any real numbers x and y, together with Lemma ..

Proposition .
The set L²(Ω, F, P) is a linear subspace of L¹(Ω, F, P), and this subspace contains all constant random variables.

Proof
To see that L² ⊆ L¹, use Lemma . with Y = X.

It follows from the definition of L² that if X ∈ L² and a is a constant, then aX ∈ L², i.e. L² is closed under scalar multiplication.

If X, Y ∈ L², then X², Y² and XY all belong to L¹ according to Lemma .. Hence (X + Y)² = X² + 2XY + Y² ∈ L¹, i.e. X + Y ∈ L². Thus L² is also closed under addition.

Definition .: Variance
If E(X²) < +∞, then X is said to have a variance, and the variance of X is the real number
$$\mathrm{Var}(X) = E\bigl((X-EX)^2\bigr) = E(X^2) - (EX)^2.$$

Definition .: Covariance
If X and Y both have a variance, then their covariance is the real number
$$\mathrm{Cov}(X,Y) = E\bigl((X-EX)(Y-EY)\bigr).$$

The rules of calculus for variances and covariances are as in the finite case (page ff).


Proposition .: Cauchy-Schwarz inequality
If X and Y both belong to L²(P), then
$$\bigl(E|XY|\bigr)^2 \leq E(X^2)\,E(Y^2). \qquad(.)$$

Proof
From Lemma . it is known that XY ∈ L¹(P). Then proceed along the same lines as in the finite case (Proposition . on page ).

If we replace X with X − EX and Y with Y − EY in equation (.), we get
$$\mathrm{Cov}(X,Y)^2 \leq \mathrm{Var}(X)\,\mathrm{Var}(Y).$$

. Examples

Geometric distribution

Consider a random experiment with two possible outcomes 0 and 1 (or Failure and Success), and let the probability of outcome 1 be p. Repeat the experiment until outcome 1 occurs for the first time. How many times do we have to repeat the experiment before the first 1? Clearly, in principle there is no upper bound for the number of repetitions needed, so this seems to be an example where we cannot use a finite sample space. Moreover, we cannot be sure that the outcome 1 will ever occur.

Generalised binomial coefficients: For r ∈ R and k ∈ N we define $\binom{r}{k}$ as the number
$$\frac{r(r-1)\cdots(r-k+1)}{1\cdot 2\cdot 3\cdots k}.$$
(When r ∈ {0, 1, 2, . . . , n} this agrees with the combinatoric definition of binomial coefficient (page ).)

To make the problem more precise, let us assume that the elementary experiment is carried out at times t = 1, 2, 3, . . ., and what is wanted is the waiting time to the first occurrence of a 1. Define

T₁ = time of first 1,

V₁ = T₁ − 1 = number of 0s before the first 1.

Strictly speaking we cannot be sure that we shall ever see the outcome 1; we let T₁ = ∞ and V₁ = ∞ if the outcome 1 never occurs. Hence the possible values of T₁ are 1, 2, 3, . . . , ∞, and the possible values of V₁ are 0, 1, 2, 3, . . . , ∞.

Geometric series: for |t| < 1, $\frac{1}{1-t} = \sum_{n=0}^{\infty} t^n$.

Let t ∈ N₀. The event {V₁ = t} (or {T₁ = t + 1}) means that the experiment produces outcome 0 at times 1, 2, . . . , t and outcome 1 at time t + 1, and therefore P(V₁ = t) = p(1 − p)ᵗ, t = 0, 1, 2, . . . Using the formula for the sum of a geometric series we see that
$$P(V_1 \in \mathbb{N}_0) = \sum_{t=0}^{\infty} P(V_1 = t) = \sum_{t=0}^{\infty} p\,(1-p)^t = 1,$$
that is, V₁ (and T₁) is finite with probability 1. In other words, with probability 1 the outcome 1 will occur sooner or later.

The distribution of V1 is a geometric distribution:

Definition .: Geometric distribution
The geometric distribution with parameter p is the distribution on N₀ with probability function
$$f(t) = p\,(1-p)^t, \qquad t \in \mathbb{N}_0.$$

[Margin figure: probability function of the geometric distribution with p = 0.316.]

We can find the expected value of the geometric distribution:
$$EV_1 = \sum_{t=0}^{\infty} t\,p(1-p)^t = p(1-p)\sum_{t=1}^{\infty} t\,(1-p)^{t-1} = p(1-p)\sum_{t=0}^{\infty} (t+1)(1-p)^t = \frac{1-p}{p}$$

(cf. the formula for the sum of a binomial series). So on the average there will be (1 − p)/p 0s before the first 1, and the mean waiting time to the first 1 is ET₁ = EV₁ + 1 = 1/p. — The variance of the geometric distribution can be found in a similar way, and it turns out to be (1 − p)/p².

Binomial series:
. For n ∈ N and t ∈ R, $(1+t)^n = \sum_{k=0}^{n} \binom{n}{k} t^k$.
. For α ∈ R and |t| < 1, $(1+t)^\alpha = \sum_{k=0}^{\infty} \binom{\alpha}{k} t^k$.
. For α ∈ R and |t| < 1, $(1-t)^{-\alpha} = \sum_{k=0}^{\infty} \binom{-\alpha}{k} (-t)^k = \sum_{k=0}^{\infty} \binom{\alpha+k-1}{k} t^k$.

The probability that the waiting time for the first 1 exceeds t is
$$P(T_1 > t) = \sum_{k=t+1}^{\infty} P(T_1 = k) = (1-p)^t.$$
The probability that we will have to wait more than s + t, given that we have already waited the time s, is
$$P(T_1 > s+t \mid T_1 > s) = \frac{P(T_1 > s+t)}{P(T_1 > s)} = (1-p)^t = P(T_1 > t),$$
showing that the remaining waiting time does not depend on how long we have been waiting until now — this is the lack of memory property of the distribution of T₁ (or of the mechanism generating the 0s and 1s). (The continuous time analogue of this distribution is the exponential distribution, see page .)

The "shifted" geometric distribution is the only distribution on N₀ with this property: If T is a random variable with range N and satisfying P(T > s + t) = P(T > s) P(T > t) for all s, t ∈ N, then
$$P(T > t) = P(T > 1 + 1 + \ldots + 1) = \bigl(P(T > 1)\bigr)^t = (1-p)^t$$
where p = 1 − P(T > 1) = P(T = 1).
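As an illustration (not part of the notes; p = 0.25, the seed and the sample size are arbitrary), one can simulate the waiting time T₁ and check both the mean waiting time 1/p and the lack of memory property.

import random

random.seed(0)                          # arbitrary seed
p = 0.25                                # hypothetical success probability

def waiting_time():
    """Time of the first 1 in a sequence of independent p-coin tosses."""
    t = 1
    while random.random() >= p:         # failure: keep tossing
        t += 1
    return t

samples = [waiting_time() for _ in range(200_000)]
print(sum(samples) / len(samples), 1 / p)          # empirical mean vs. 1/p = 4

# Lack of memory: P(T > s + t | T > s) should be close to P(T > t).
s, t = 3, 2
beyond_s = [w for w in samples if w > s]
print(sum(w > s + t for w in beyond_s) / len(beyond_s),
      sum(w > t for w in samples) / len(samples))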


Negative binomial distribution

In continuation of the previous section consider the variables

Tk = time of kth 1

Vk = Tk − k = number of 0s before kth 1.

The event {Vₖ = t} (or {Tₖ = t + k}) corresponds to getting the result 1 at time t + k, and getting k − 1 1s and t 0s before time t + k. The probability of obtaining a 1 at time t + k equals p, and the probability of obtaining exactly t 0s among the t + k − 1 outcomes before time t + k is the binomial probability $\binom{t+k-1}{t} p^{k-1}(1-p)^t$, and hence
$$P(V_k = t) = \binom{t+k-1}{t} p^k (1-p)^t, \qquad t \in \mathbb{N}_0.$$

This distribution of Vk is a negative binomial distribution:

Definition .: Negative binomial distribution
The negative binomial distribution with probability parameter p ∈ ]0 ; 1[ and shape parameter k > 0 is the distribution on N₀ that has probability function
$$f(t) = \binom{t+k-1}{t} p^k (1-p)^t, \qquad t \in \mathbb{N}_0. \qquad(.)$$

Remarks: ) For k = 1 this is the geometric distribution. ) For obvious reasons, k is an integer in the above derivations, but actually the expression (.) is well-defined and a probability function for any positive real number k.
[Margin figure: probability function of the negative binomial distribution with p = 0.618 and k = 3.5.]

Proposition .
If X and Y are independent and have negative binomial distributions with the same probability parameter p and with shape parameters j and k, respectively, then X + Y has a negative binomial distribution with probability parameter p and shape parameter j + k.

Proof
We have
$$\begin{aligned}
P(X+Y = s) &= \sum_{x=0}^{s} P(X = x)\,P(Y = s-x) \\
&= \sum_{x=0}^{s} \binom{x+j-1}{x} p^j (1-p)^x \cdot \binom{s-x+k-1}{s-x} p^k (1-p)^{s-x} \\
&= p^{j+k} A(s)\,(1-p)^s
\end{aligned}$$

where $A(s) = \sum_{x=0}^{s} \binom{x+j-1}{x}\binom{s-x+k-1}{s-x}$. Now one approach would be to do some lengthy rewritings of the A(s) expression. Or we can reason as follows: The total probability has to be 1, i.e. $\sum_{s=0}^{\infty} p^{j+k} A(s)(1-p)^s = 1$, and hence $p^{-(j+k)} = \sum_{s=0}^{\infty} A(s)(1-p)^s$. We also have the identity $p^{-(j+k)} = \sum_{s=0}^{\infty} \binom{s+j+k-1}{s}(1-p)^s$, either from the formula for the sum of a binomial series, or from the fact that the sum of the point probabilities of the negative binomial distribution with parameters p and j + k is 1. By the uniqueness of power series expansions it now follows that $A(s) = \binom{s+j+k-1}{s}$, which means that X + Y has the distribution stated in the proposition.

It may be added that within the theory of stochastic processes the following informal reasoning can be expanded to a proper proof of Proposition .: The number of 0s occurring before the (j + k)th 1 equals the number of 0s before the jth 1 plus the number of 0s between the jth and the (j + k)th 1; since values at different times are independent, and the rule determining the end of the first period is independent of the number of 0s in that period, then the number of 0s in the second period is independent of the number of 0s in the first period, and both follow negative binomial distributions.

The expected value and the variance in the negative binomial distribution are k(1 − p)/p and k(1 − p)/p², respectively, cf. Example . on page . Therefore, the expected number of 0s before the kth 1 is EVₖ = k(1 − p)/p, and the expected waiting time to the kth 1 is ETₖ = k + EVₖ = k/p.
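The convolution argument in the proof can be checked numerically; the sketch below (illustration only; p, j and k are arbitrary, and k is kept integer so that math.comb applies) convolves two negative binomial probability functions and compares the result with the probability function with shape parameter j + k.

from math import comb

def negbin_pmf(t, k, p):
    """Probability function of the negative binomial distribution (integer k here)."""
    return comb(t + k - 1, t) * p**k * (1 - p)**t

p, j, k = 0.4, 2, 3                     # arbitrary example parameters
for s in range(6):
    conv = sum(negbin_pmf(x, j, p) * negbin_pmf(s - x, k, p) for x in range(s + 1))
    print(s, conv, negbin_pmf(s, j + k, p))   # the two columns agree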

Poisson distribution

Suppose that instances of a certain kind of phenomena occur "entirely at random" along some "time axis"; examples could be earthquakes, deaths from a non-epidemic disease, road accidents at a certain crossroads, particles of the cosmic radiation, α particles emitted from a radioactive source, etc. etc. In this connection you could ask various questions, such as: what can be said about the number of occurrences in a given time interval, and what can be said about the waiting time from one occurrence to the next. — Here we shall discuss the number of occurrences.

Consider an interval ]a ; b] on the time axis. We shall be assuming that the occurrences take place at points that are selected entirely at random on the time axis, and in such a way that the intensity (or rate of occurrence) remains constant throughout the period of interest (this assumption will be written out more clearly later on). We can make an approximation to the "entirely random" placement in the following way: divide the interval ]a ; b] into a large number of very short subintervals, say n subintervals of length ∆t = (b − a)/n, and then for each subinterval have a random number generator determine whether there be an occurrence or not; the probability that a given subinterval is assigned an occurrence is p(∆t), and different intervals are treated independently of each other. The probability p(∆t) is the same for all subintervals of the given length ∆t. In this way the total number of occurrences in the big interval ]a ; b] follows a binomial distribution with parameters n and p(∆t) — the total number is the quantity that we are trying to model.

It seems reasonable to believe that as n increases, the above approximation approaches the "correct" model. Therefore we will let n tend to infinity; at the same time p(∆t) must somehow tend to 0; it seems reasonable to let p(∆t) tend to 0 in such a way that the expected value of the binomial distribution is (or tends to) a fixed finite number λ(b − a). The expected value of our binomial distribution is $np(\Delta t) = \bigl((b-a)/\Delta t\bigr)\,p(\Delta t) = \frac{p(\Delta t)}{\Delta t}\,(b-a)$, so the proposal is that p(∆t) should tend to 0 in such a way that $\frac{p(\Delta t)}{\Delta t} \to \lambda$ where λ is a positive constant.

Siméon-Denis Poisson: French mathematician and physicist (–).

The parameter λ is the rate or intensity of the occurrences in question, and its dimension is number per time. In demography and actuarial mathematics the picturesque term force of mortality is sometimes used for such a parameter.

According to Theorem ., when going to the limit in the way just described, the binomial probabilities converge to Poisson probabilities.
[Margin figure: probability function of the Poisson distribution with µ = 2.163.]

Definition .: Poisson distribution
The Poisson distribution with parameter µ > 0 is the distribution on the non-negative integers N₀ given by the probability function
$$f(x) = \frac{\mu^x}{x!}\exp(-\mu), \qquad x \in \mathbb{N}_0.$$
(That f adds up to unity follows easily from the power series expansion of the exponential function.)

Power series expansion of the exponential function: for all (real and complex) t, $\exp(t) = \sum_{k=0}^{\infty} \frac{t^k}{k!}$.

The expected value of the Poisson distribution with parameter µ is µ, and the variance is µ as well, so Var X = EX for a Poisson random variable X. — Previously we saw that for a binomial random variable X with parameters n and p, Var X = (1 − p) EX, i.e. Var X < EX, and for a negative binomial random variable X with parameters k and p, p Var X = EX, i.e. Var X > EX.


Theorem .
When n → ∞ and p → 0 simultaneously in such a way that np converges to µ > 0, the binomial distribution with parameters n and p converges to the Poisson distribution with parameter µ in the sense that
$$\binom{n}{x} p^x (1-p)^{n-x} \longrightarrow \frac{\mu^x}{x!}\exp(-\mu)$$
for all x ∈ N₀.

Proof
Consider a fixed x ∈ N₀. Then
$$\binom{n}{x} p^x (1-p)^{n-x}
= \frac{n(n-1)\cdots(n-x+1)}{x!}\,p^x (1-p)^{-x} (1-p)^n
= \frac{1\cdot(1-\tfrac{1}{n})\cdots(1-\tfrac{x-1}{n})}{x!}\,(np)^x (1-p)^{-x} (1-p)^n$$
where the long fraction converges to 1/x!, (np)ˣ converges to µˣ, and (1 − p)⁻ˣ converges to 1. The limit of (1 − p)ⁿ is most easily found by passing to logarithms; we get
$$\ln\bigl((1-p)^n\bigr) = n\ln(1-p) = -np\cdot\frac{\ln(1-p)-\ln(1)}{-p} \longrightarrow -\mu$$
since the fraction converges to the derivative of ln(t) evaluated at t = 1, which is 1. Hence (1 − p)ⁿ converges to exp(−µ), and the theorem is proved.
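A numerical check of the limit (illustration only; µ = 2 is an arbitrary choice): for increasing n with p = µ/n, the binomial point probabilities approach the Poisson ones.

from math import comb, exp, factorial

mu = 2.0                                     # arbitrary example value

def poisson_pmf(x, mu):
    return mu**x / factorial(x) * exp(-mu)

for n in (10, 100, 10_000):
    p = mu / n
    diff = max(abs(comb(n, x) * p**x * (1 - p)**(n - x) - poisson_pmf(x, mu))
               for x in range(15))
    print(n, diff)                           # the maximal difference shrinks with n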

Proposition .
If X₁ and X₂ are independent Poisson random variables with parameters µ₁ and µ₂, then X₁ + X₂ follows a Poisson distribution with parameter µ₁ + µ₂.

Proof
Let Y = X₁ + X₂. Then, using Proposition . on page ,
$$\begin{aligned}
P(Y = y) &= \sum_{x=0}^{y} P(X_1 = y-x)\,P(X_2 = x) \\
&= \sum_{x=0}^{y} \frac{\mu_1^{\,y-x}}{(y-x)!}\exp(-\mu_1)\,\frac{\mu_2^{\,x}}{x!}\exp(-\mu_2) \\
&= \frac{1}{y!}\exp\bigl(-(\mu_1+\mu_2)\bigr)\sum_{x=0}^{y}\binom{y}{x}\mu_1^{\,y-x}\mu_2^{\,x} \\
&= \frac{1}{y!}\exp\bigl(-(\mu_1+\mu_2)\bigr)(\mu_1+\mu_2)^y.
\end{aligned}$$


. Exercises

Exercise .
Give a proof of Proposition . on page .

Exercise .
In the definition of variance of a distribution on a countable probability space (Definition . page ), it is tacitly assumed that E((X − EX)²) = E(X²) − (EX)². Prove that this is in fact true.

Exercise .
Prove that Var X = E(X(X − 1)) − ξ(ξ − 1), where ξ = EX.

Exercise .
Sketch the probability function of the Poisson distribution for different values of the parameter µ.

Exercise .
Find the expected value and the variance of the Poisson distribution with parameter µ. (In doing so, Exercise . may be of some use.)

Exercise .
Show that if X₁ and X₂ are independent Poisson random variables with parameters µ₁ and µ₂, then the conditional distribution of X₁ given X₁ + X₂ is a binomial distribution.

Exercise .
Sketch the probability function of the negative binomial distribution for different values of the parameters k and p.

Exercise .
Consider the negative binomial distribution with parameters k and p, where k > 0 and p ∈ ]0 ; 1[. As mentioned in the text, the expected value is k(1 − p)/p and the variance is k(1 − p)/p².

Explain how the parameters k and p can go to a limit in such a way that the expected value stays constant and the variance converges towards the expected value, and show that in this case the probability function of the negative binomial distribution converges to the probability function of a Poisson distribution.

Exercise .
Show that the definition of covariance (Definition . on page ) is meaningful, i.e. show that when X and Y both are known to have a variance, then the expectation E((X − EX)(Y − EY)) exists.

Exercise .
Assume that on a given day the number of persons going by the S-train without a valid ticket follows a Poisson distribution with parameter µ. Suppose there is a probability p that a fare dodger is caught by the ticket controller, and that different dodgers are caught independently of each other. In view of this, what can be said about the number of fare dodgers caught?

In order to make statements about the actual number of passengers without a ticket when knowing the number of passengers caught without a ticket, what more information will be needed?

Exercise .: St. Petersburg paradox
A gambler is going to join a game where in each play he has an even chance of doubling the bet and of losing the bet, that is, if he bets k€ he will win 2k€ with probability 1/2 and lose his bet with probability 1/2. Our gambler has a clever strategy: in the first play he bets 1€; if he loses in any given play, then he will bet the double amount in the next play; if he wins in any given play, he will collect his winnings and go home.

How big will his net winnings be? How many plays will he be playing before he can get home? The strategy is certain to yield a profit, but surely the player will need some working capital. How much money will he need to get through to the win?

Exercise .
Give a proof of Proposition . on page .

Exercise .
The expected value of the product of two independent random variables can be expressed in a simple way by means of the expected values of the individual variables (Proposition .).

Deduce a similar result about the variance of two independent random variables. — Is independence a crucial assumption?


Continuous Distributions

In the previous chapters we met random variables and/or probability distributions that are actually "living" on a finite or countable subset of the real numbers; such variables and/or distributions are often labelled discrete. There are other kinds of distributions, however, distributions with the property that all non-empty open subsets of a given fixed interval have strictly positive probability, whereas every single point in the interval has probability 0.

Example: When playing "spin the bottle" you rotate a bottle to pick a random direction, and the person pointed at by the bottle neck must do whatever is agreed on as part of the play. We shall spoil the fun straight away and model the spinning bottle as "a point on the unit circle, picked at random from a uniform distribution"; for the time being, the exact meaning of this statement is not entirely well-defined, but for one thing it must mean that all arcs of a given length b are assigned the same amount of probability, no matter where on the circle the arc is placed, and it seems to be a fair guess to claim that this probability must be b/2π. Every point on the circle is contained in arcs of arbitrarily small length, and hence the point itself must have probability 0.

One can prove that there is a unique correspondence between on the one hand probability measures on R and on the other hand distribution functions, i.e. functions that are increasing and right-continuous and with limits 0 and 1 at −∞ and +∞ respectively, cf. Proposition . and/or Proposition .. In brief, the correspondence is that P(]a ; b]) = F(b) − F(a) for every half-open interval ]a ; b]. — Now there exist very strange functions meeting the conditions for being a distribution function, but never fear, in this chapter we shall be dealing with a class of very nice distributions, the so-called continuous distributions, distributions with a density function.

. Basic definitions

Definition .: Probability density function
A probability density function on R is an integrable function f : R → [0 ; +∞[ with integral 1, i.e. $\int_{\mathbb{R}} f(x)\,dx = 1$.


A probability density function on Rᵈ is an integrable function f : Rᵈ → [0 ; +∞[ with integral 1, i.e. $\int_{\mathbb{R}^d} f(x)\,dx = 1$.

Definition .: Continuous distribution
A continuous distribution is a probability distribution with a probability density function: If f is a probability density function on R, then the continuous function $F(x) = \int_{-\infty}^{x} f(u)\,du$ is the distribution function for a continuous distribution on R.

We say that the random variable X has density function f if the distribution function F(x) = P(X ≤ x) of X has the form $F(x) = \int_{-\infty}^{x} f(u)\,du$. Recall that in this case F′(x) = f(x) at all points of continuity x of f.
[Margin figure: the area under f between a and b equals $P(a < X \leq b) = \int_a^b f(x)\,dx$.]

Proposition .
Consider a random variable X with density function f. Then for all a < b
1. $P(a < X \leq b) = \int_a^b f(x)\,dx$.
2. P(X = x) = 0 for all x ∈ R.
3. P(a < X < b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b).
4. If f is continuous at x, then f(x) dx can be understood as the probability that X belongs to an infinitesimal interval of length dx around x.

Proof
Item 1 follows from the definitions of distribution function and density function. Item 2 follows from the fact that
$$0 \leq P(X = x) \leq P\bigl(x - \tfrac{1}{n} < X \leq x\bigr) = \int_{x-1/n}^{x} f(u)\,du,$$
where the integral tends to 0 as n tends to infinity. Item 3 follows from items 1 and 2. A proper mathematical formulation of item 4 is that if x is a point of continuity of f, then as ε → 0,
$$\frac{1}{2\varepsilon}\,P(x - \varepsilon < X < x + \varepsilon) = \frac{1}{2\varepsilon}\int_{x-\varepsilon}^{x+\varepsilon} f(u)\,du \longrightarrow f(x).$$

Example .: Uniform distribution
The uniform distribution on the interval from α to β (where −∞ < α < β < +∞) is the distribution with density function
$$f(x) = \begin{cases} \dfrac{1}{\beta-\alpha} & \text{for } \alpha < x < \beta, \\ 0 & \text{otherwise.} \end{cases}$$
[Margin figure: density function f and distribution function F for the uniform distribution on the interval from α to β.]

Its distribution function is the function
$$F(x) = \begin{cases} 0 & \text{for } x \leq \alpha, \\ \dfrac{x-\alpha}{\beta-\alpha} & \text{for } \alpha < x \leq \beta, \\ 1 & \text{for } x \geq \beta. \end{cases}$$
A uniform distribution on an interval can, by the way, be used to construct the above-mentioned uniform distribution on the unit circle: simply move the uniform distribution on ]0 ; 2π[ onto the unit circle.

Definition .: Multivariate continuous distribution
The d-dimensional random variable X = (X₁, X₂, . . . , X_d) is said to have density function f, if f is a d-dimensional density function such that for all intervals Kᵢ = ]aᵢ ; bᵢ], i = 1, 2, . . . , d,
$$P\Bigl(\,\bigcap_{i=1}^{d} \{X_i \in K_i\}\Bigr) = \int_{K_1\times K_2\times\ldots\times K_d} f(x_1, x_2, \ldots, x_d)\,dx_1\,dx_2\ldots dx_d.$$

The function f is called the joint density function for X₁, X₂, . . . , X_d.

A number of definitions, propositions and theorems are carried over more or less immediately from the discrete to the continuous case, formally by replacing probability functions with density functions and sums with integrals. Examples:

The counterpart to Proposition ./Corollary . is

Proposition .
The random variables X₁, X₂, . . . , Xₙ are independent, if and only if their joint density function is a product of density functions for the individual Xᵢs.

The counterpart to Proposition . is

Proposition .
If the random variables X₁ and X₂ are independent and with density functions f₁ and f₂, then the density function of Y = X₁ + X₂ is
$$f(y) = \int_{\mathbb{R}} f_1(x)\,f_2(y-x)\,dx, \qquad y \in \mathbb{R}.$$

(See Exercise ..)
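For instance, if X₁ and X₂ are independent and both uniform on ]0 ; 1[, the formula gives the triangular density f(y) = y for 0 < y ≤ 1 and f(y) = 2 − y for 1 < y < 2. The little sketch below (illustration only; it simply evaluates the convolution integral by a Riemann sum) checks this at a few points.

# Numerical evaluation of the convolution formula for two independent
# uniform(0,1) variables.
def f_uniform(x):
    return 1.0 if 0.0 < x < 1.0 else 0.0

def f_sum(y, steps=100_000):
    dx = 1.0 / steps
    # integrate f1(x) * f2(y - x) over x in ]0;1[, with f1 = f2 = f_uniform
    return sum(f_uniform(i * dx) * f_uniform(y - i * dx)
               for i in range(1, steps)) * dx

for y in (0.5, 1.0, 1.5):
    print(y, f_sum(y))    # approximately 0.5, 1.0, 0.5, i.e. the triangular density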


Transformation of distributions

Transformation of distributions is still about deducing the distribution of Y = t(X) when the distribution of X is known and t is a function defined on the range of X. The basic formula P(t(X) ≤ y) = P(X ∈ t⁻¹(]−∞ ; y])) always applies.

Example .
We wish to find the distribution of Y = X², assuming that X is uniformly distributed on ]−1 ; 1[.

For y ∈ [0 ; 1] we have $P(X^2 \leq y) = P\bigl(X \in [-\sqrt{y}\,;\,\sqrt{y}]\bigr) = \int_{-\sqrt{y}}^{\sqrt{y}} \tfrac{1}{2}\,dx = \sqrt{y}$, so the distribution function of Y is
$$F_Y(y) = \begin{cases} 0 & \text{for } y \leq 0, \\ \sqrt{y} & \text{for } 0 < y \leq 1, \\ 1 & \text{for } y > 1. \end{cases}$$
A density function f for Y can be found by differentiating F (only at points where F is differentiable):
$$f(y) = \begin{cases} \tfrac{1}{2}\,y^{-1/2} & \text{for } 0 < y < 1, \\ 0 & \text{otherwise.} \end{cases}$$

In case of a differentiable function t, we can sometimes write down an explicit formula relating the density function of X and the density function of Y = t(X), simply by applying a standard result about substitution in integrals:

Theorem .
Suppose that X is a real random variable, I an open subset of R such that P(X ∈ I) = 1, and t : I → R a C¹-function defined on I and with t′ ≠ 0 everywhere on I. If X has density function f_X, then the density function of Y = t(X) is
$$f_Y(y) = f_X(x)\,\bigl|t'(x)\bigr|^{-1} = f_X(x)\,\bigl|(t^{-1})'(y)\bigr|$$
where x = t⁻¹(y) and y ∈ t(I). The formula is sometimes written as
$$f_Y(y) = f_X(x)\,\Bigl|\frac{dx}{dy}\Bigr|.$$

The result also comes in a multivariate version:

Theorem .
Suppose that X is a d-dimensional random variable, I an open subset of Rᵈ such that P(X ∈ I) = 1, and t : I → Rᵈ a C¹-function defined on I and with a Jacobian $Dt = \frac{dy}{dx}$ which is regular everywhere on I. If X has density function f_X, then the density function of Y = t(X) is
$$f_Y(y) = f_X(x)\,\bigl|\det Dt(x)\bigr|^{-1} = f_X(x)\,\bigl|\det Dt^{-1}(y)\bigr|$$
where x = t⁻¹(y) and y ∈ t(I). The formula is sometimes written as
$$f_Y(y) = f_X(x)\,\Bigl|\det\frac{dx}{dy}\Bigr|.$$

Example .
If X is uniformly distributed on ]0 ; 1[, then the density function of Y = − ln X is
$$f_Y(y) = 1\cdot\Bigl|\frac{dx}{dy}\Bigr| = \exp(-y)$$
when 0 < x < 1, i.e. when 0 < y < +∞, and f_Y(y) = 0 otherwise. — Thus, the distribution of Y is an exponential distribution, see Section ..
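A simulation check of this example (not from the notes; the seed and the sample size are arbitrary): transform uniform variates by −ln and compare the empirical mean and the empirical value of P(Y ≤ 1) with the corresponding quantities 1 and 1 − exp(−1) of the exponential distribution with scale parameter 1.

import random
from math import exp, log

random.seed(2)                              # arbitrary seed
# 1 - random.random() lies in (0;1], so the logarithm is always defined.
ys = [-log(1.0 - random.random()) for _ in range(200_000)]

print(sum(ys) / len(ys), 1.0)                          # empirical mean vs. 1
print(sum(y <= 1 for y in ys) / len(ys), 1 - exp(-1))  # both close to 0.632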

Proof of Theorem .
The theorem is a rephrasing of a result from Mathematical Analysis about substitution in integrals. The derivative of the function t is continuous and everywhere non-zero, and hence t is either strictly increasing or strictly decreasing; let us assume it to be strictly increasing. Then Y ≤ b ⇔ X ≤ t⁻¹(b), and we obtain the following expression for the distribution function F_Y of Y:
$$\begin{aligned}
F_Y(b) = P(Y \leq b) = P\bigl(X \leq t^{-1}(b)\bigr)
&= \int_{-\infty}^{t^{-1}(b)} f_X(x)\,dx \qquad [\text{ substitute } x = t^{-1}(y) ] \\
&= \int_{-\infty}^{b} f_X\bigl(t^{-1}(y)\bigr)\,(t^{-1})'(y)\,dy.
\end{aligned}$$
This shows that the function $f_Y(y) = f_X\bigl(t^{-1}(y)\bigr)\,(t^{-1})'(y)$ has the property that $F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du$, showing that Y has density function f_Y.

Conditioning

When searching for the conditional distribution of one continuous random variable X given another continuous random variable Y , one cannot take for granted that an expression such as P(X = x | Y = y) is well-defined and equal to P(X = x, Y = y)/P(Y = y); on the contrary, the denominator P(Y = y) always equals zero, and so does the numerator in most cases.

Instead, one could try to condition on a suitable event with non-zero probability, for instance {y − ε < Y < y + ε}, and then examine whether the (now well-defined) conditional probability P(X ≤ x | y − ε < Y < y + ε) has a limiting value as ε → 0, in which case the limiting value would be a candidate for P(X ≤ x | Y = y).


Another, more heuristic approach is as follows: let fXY denote the joint density function of X and Y and let fY denote the density function of Y . Then the probability that (X,Y) lies in an infinitesimal rectangle with sides dx and dy around (x,y) is fXY(x,y) dx dy, and the probability that Y lies in an infinitesimal interval of length dy around y is fY(y) dy; hence the conditional probability that X lies in an infinitesimal interval of length dx around x, given that Y lies in an infinitesimal interval of length dy around y, is

\[ \frac{f_{XY}(x,y)\, dx\, dy}{f_Y(y)\, dy} = \frac{f_{XY}(x,y)}{f_Y(y)}\, dx , \]

so a guess is that the conditional density is

\[ f(x \mid y) = \frac{f_{XY}(x,y)}{f_Y(y)} . \]

This is in fact true in many cases.

The above discussion of conditional distributions is of course very careless indeed. It is however possible, in a proper mathematical setting, to give an exact and thorough treatment of conditional distributions, but we shall not do so in this presentation.

. Expectation

The definition of the expected value of a random variable with density function resembles the definition pertaining to random variables on a countable sample space; the sum is simply replaced by an integral:

Definition .: Expected value
Consider a random variable X with density function f .
If ∫_{−∞}^{+∞} |x| f(x) dx < +∞, then X is said to have an expectation, and the expected value of X is defined to be the real number

\[ \mathrm{E}X = \int_{-\infty}^{+\infty} x f(x)\, dx . \]

Theorems, propositions and rules of arithmetic for expected value and variance/covariance are unchanged from the countable case.

. Examples

Exponential distribution

Definition .: Exponential distribution
The exponential distribution with scale parameter β > 0 is the distribution with density function

\[ f(x) = \begin{cases} \dfrac{1}{\beta} \exp(-x/\beta) & \text{for } x > 0 \\ 0 & \text{for } x \le 0. \end{cases} \]

Its distribution function is

\[ F(x) = \begin{cases} 1 - \exp(-x/\beta) & \text{for } x > 0 \\ 0 & \text{for } x \le 0. \end{cases} \]

The reason why β is a scale parameter is evident from

Proposition .
If X is exponentially distributed with scale parameter β, and a is a positive constant, then aX is exponentially distributed with scale parameter aβ.

[Marginal figure: density function f and distribution function F of the exponential distribution with β = 2.163.]

Proof
Applying Theorem ., we find that the density function of Y = aX is the function fY defined as

\[ f_Y(y) = \begin{cases} \dfrac{1}{\beta} \exp\bigl( -(y/a)/\beta \bigr) \cdot \dfrac{1}{a} = \dfrac{1}{a\beta} \exp\bigl( -y/(a\beta) \bigr) & \text{for } y > 0 \\ 0 & \text{for } y \le 0. \end{cases} \]

Proposition .
If X is exponentially distributed with scale parameter β, then EX = β and VarX = β².

Proof
As β is a scale parameter, it suffices to show the assertion in the case β = 1, and this is done using straightforward calculations (the last formula in the side note (on page ) about the gamma function may be useful).

The exponential distribution is often used when modelling waiting times between events occurring “entirely at random”. The exponential distribution has a “lack of memory” in the sense that the probability of having to wait more than s + t, given that you have already waited s, is the same as the probability of waiting more than t; in other words, for an exponential random variable X,

P(X > s+ t | X > s) = P(X > t), s, t > 0. (.)

Proof of equation (.)
Since {X > s + t} ⇒ {X > s}, we have {X > s + t} ∩ {X > s} = {X > s + t}, and hence

\[ P(X > s+t \mid X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{1 - F(s+t)}{1 - F(s)} = \frac{\exp(-(s+t)/\beta)}{\exp(-s/\beta)} = \exp(-t/\beta) = P(X > t) \]

for any s, t > 0.
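The lack-of-memory property can also be checked empirically. The following Python sketch (an added illustration, not part of the original notes; the scale value 2.0 and the choices of s and t are arbitrary) estimates both sides of equation (.) by simulation.

```python
import numpy as np

# Monte Carlo check of the lack-of-memory property
# P(X > s+t | X > s) = P(X > t) for an exponential distribution.
rng = np.random.default_rng(seed=2)
beta = 2.0                                   # arbitrary scale parameter
x = rng.exponential(scale=beta, size=500_000)

s, t = 1.0, 1.5
lhs = np.mean(x > s + t) / np.mean(x > s)    # estimated conditional probability
rhs = np.mean(x > t)                         # estimated unconditional probability
print(f"P(X > s+t | X > s) ≈ {lhs:.4f},  P(X > t) ≈ {rhs:.4f},  exact {np.exp(-t/beta):.4f}")
```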

In this sense the exponential distribution is the continuous counterpart to the geometric distribution, see page .

[Side note: The gamma function. The gamma function is the function

\[ \Gamma(t) = \int_0^{+\infty} x^{t-1} \exp(-x)\, dx \]

where t > 0. (Actually, it is possible to define Γ(t) for all t ∈ C \ {0, −1, −2, −3, . . .}.) The following holds:
• Γ(1) = 1,
• Γ(t + 1) = t Γ(t) for all t,
• Γ(n) = (n − 1)! for n = 1, 2, 3, . . .,
• Γ(½) = √π.
(See also page .) A standard integral substitution leads to the formula

\[ \Gamma(k)\, \beta^{k} = \int_0^{+\infty} x^{k-1} \exp(-x/\beta)\, dx \]

for k > 0 and β > 0.]

Furthermore, the exponential distribution is related to the Poisson distribution. Suppose that instances of a certain kind of phenomenon occur at random time points as described in the section about the Poisson distribution (page ). Saying that you have to wait more than t for the first instance to occur is the same as saying that there are no occurrences in the time interval from 0 to t; the latter event is an event in the Poisson model, and the probability of that event is

\[ \frac{(\lambda t)^0}{0!} \exp(-\lambda t) = \exp(-\lambda t) , \]

showing that the waiting time to the occurrence of the first instance is exponentially distributed with scale parameter 1/λ.

Gamma distribution

Definition .: Gamma distribution
The gamma distribution with shape parameter k > 0 and scale parameter β > 0 is the distribution with density function

\[ f(x) = \begin{cases} \dfrac{1}{\Gamma(k)\beta^{k}}\, x^{k-1} \exp(-x/\beta) & \text{for } x > 0 \\ 0 & \text{for } x \le 0. \end{cases} \]

For k = 1 this is the exponential distribution.

Proposition .
If X is gamma distributed with shape parameter k > 0 and scale parameter β > 0, and if a > 0, then Y = aX is gamma distributed with shape parameter k and scale parameter aβ.

Proof
Similar to the proof of Proposition ., i.e. using Theorem ..

[Marginal figure: gamma distribution with k = 1.7 and β = 1.27.]

Proposition .
If X1 and X2 are independent gamma variables with the same scale parameter β and with shape parameters k1 and k2 respectively, then X1 + X2 is gamma distributed with scale parameter β and shape parameter k1 + k2.


Proof
From Proposition . the density function of Y = X1 + X2 is (for y > 0)

\[ f(y) = \int_0^{y} \frac{1}{\Gamma(k_1)\beta^{k_1}}\, x^{k_1-1} \exp(-x/\beta) \cdot \frac{1}{\Gamma(k_2)\beta^{k_2}}\, (y-x)^{k_2-1} \exp(-(y-x)/\beta)\, dx , \]

and the substitution u = x/y transforms this into

\[ f(y) = \left( \frac{1}{\Gamma(k_1)\Gamma(k_2)\beta^{k_1+k_2}} \int_0^{1} u^{k_1-1}(1-u)^{k_2-1}\, du \right) y^{k_1+k_2-1} \exp(-y/\beta) , \]

so the density is of the form f(y) = const · y^{k1+k2−1} exp(−y/β) where the constant makes f integrate to 1; the last formula in the side note about the gamma function gives that the constant must be 1/(Γ(k1 + k2) β^{k1+k2}). This completes the proof.

A consequence of Proposition . is that the expected value and the variance of a gamma distribution must be linear functions of the shape parameter, and also that the expected value must be linear in the scale parameter and the variance must be quadratic in the scale parameter. Actually, one can prove that

Proposition .
If X is gamma distributed with shape parameter k and scale parameter β, then EX = kβ and VarX = kβ².
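A small numerical sketch (added for illustration; it assumes NumPy, whose gamma sampler is parametrised by shape and scale exactly as in Definition ., and the parameter values are arbitrary) checks this proposition and Proposition . by simulation.

```python
import numpy as np

# Monte Carlo check: for a gamma distribution with shape k and scale beta,
# EX = k*beta and VarX = k*beta**2; the sum of two independent gamma variables
# with the same scale should behave like a gamma variable with shape k1 + k2.
rng = np.random.default_rng(seed=3)
k1, k2, beta = 1.7, 2.5, 1.27           # arbitrary illustration values
x1 = rng.gamma(shape=k1, scale=beta, size=400_000)
x2 = rng.gamma(shape=k2, scale=beta, size=400_000)

print("EX1  :", x1.mean(), "  expected", k1 * beta)
print("VarX1:", x1.var(),  "  expected", k1 * beta**2)

y = x1 + x2                              # should be gamma with shape k1 + k2, scale beta
print("EY   :", y.mean(), "  expected", (k1 + k2) * beta)
print("VarY :", y.var(),  "  expected", (k1 + k2) * beta**2)
```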

In mathematical statistics as well as in applied statistics you will encounter a special kind of gamma distribution, the so-called χ² distribution.

Definition .: χ² distribution
A χ² distribution with n degrees of freedom is a gamma distribution with shape parameter n/2 and scale parameter 2, i.e. with density

\[ f(x) = \frac{1}{\Gamma(n/2)\, 2^{n/2}}\, x^{n/2-1} \exp\bigl( -\tfrac{1}{2}x \bigr) , \qquad x > 0. \]

Cauchy distribution

Definition .: Cauchy distribution
The Cauchy distribution with scale parameter β > 0 is the distribution with density function

\[ f(x) = \frac{1}{\pi\beta} \cdot \frac{1}{1 + (x/\beta)^2} , \qquad x \in \mathbb{R} . \]

[Marginal figure: Cauchy distribution with β = 1.]


This distribution is often used as a counterexample (!) — it is, for one thing, an example of a distribution that does not have an expected value (since the function x ↦ |x|/(1 + x²) is not integrable). Yet there are “real world” applications of the Cauchy distribution; one such is given in Exercise ..
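To see numerically what the missing expectation means, the following sketch (added here for illustration; sample sizes are arbitrary) looks at running averages of Cauchy samples: unlike the law-of-large-numbers behaviour of distributions with an expected value, the averages do not settle down.

```python
import numpy as np

# Running averages of Cauchy samples do not converge (the expectation does not
# exist), in contrast with, say, exponential samples whose expectation is 1.
rng = np.random.default_rng(seed=4)
n = 100_000
cauchy = rng.standard_cauchy(size=n)          # Cauchy distribution with beta = 1
expo = rng.exponential(scale=1.0, size=n)

idx = np.arange(1, n + 1)
cauchy_means = np.cumsum(cauchy) / idx
expo_means = np.cumsum(expo) / idx
for m in [1_000, 10_000, 100_000]:
    print(f"n={m:>6}: running mean, Cauchy {cauchy_means[m-1]:8.3f}   exponential {expo_means[m-1]:6.3f}")
```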

Normal distribution

Definition .: Standard normal distribution
The standard normal distribution is the distribution with density function

\[ \varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\bigl( -\tfrac{1}{2}x^2 \bigr) , \qquad x \in \mathbb{R} . \]

The distribution function of the standard normal distribution is

\[ \Phi(x) = \int_{-\infty}^{x} \varphi(u)\, du , \qquad x \in \mathbb{R} . \]

[Marginal figure: the standard normal density ϕ.]

Definition .: Normal distribution
The normal distribution (or Gaussian distribution) with location parameter µ and quadratic scale parameter σ² > 0 is the distribution with density function

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2}\, \frac{(x-\mu)^2}{\sigma^2} \right) = \frac{1}{\sigma}\, \varphi\!\left( \frac{x-\mu}{\sigma} \right) , \qquad x \in \mathbb{R} . \]

[Side note: Carl Friedrich Gauss, German mathematician (-). Known among other things for his work in geometry and mathematical statistics (including the method of least squares). He also dealt with practical applications of mathematics, e.g. geodesy. German DM banknotes from the decades preceding the shift to the euro (in ) had a portrait of Gauss, the graph of the normal density, and a sketch of his triangulation of a region in northern Germany.]

The terms location parameter and quadratic scale parameter are justified by Proposition ., which can be proved using Theorem .:

Proposition .
If X follows a normal distribution with location parameter µ and quadratic scale parameter σ², and if a and b are real numbers and a ≠ 0, then Y = aX + b follows a normal distribution with location parameter aµ + b and quadratic scale parameter a²σ².

Proposition .
If X1 and X2 are independent normal variables with parameters µ1, σ1² and µ2, σ2² respectively, then Y = X1 + X2 is normally distributed with location parameter µ1 + µ2 and quadratic scale parameter σ1² + σ2².

Proof
Due to Proposition . it suffices to show that if X1 has a standard normal distribution, and if X2 is normal with location parameter 0 and quadratic scale parameter σ², then X1 + X2 is normal with location parameter 0 and quadratic scale parameter 1 + σ². That is shown using Proposition ..


Proof that ϕ is a density function
We need to show that ϕ is in fact a probability density function, that is, that ∫_R ϕ(x) dx = 1; or in other words, if c is defined to be the constant that makes f(x) = c exp(−½x²) a probability density, then we have to show that c = 1/√2π.
The clever trick is to consider two independent random variables X1 and X2, each with density f . Their joint density function is

\[ f(x_1)\, f(x_2) = c^2 \exp\bigl( -\tfrac{1}{2}(x_1^2 + x_2^2) \bigr) . \]

Now consider a function (y1, y2) = t(x1, x2) where the relation between the xs and the ys is given as x1 = √y2 cos y1 and x2 = √y2 sin y1 for (x1, x2) ∈ R² and (y1, y2) ∈ ]0 ; 2π[ × ]0 ; +∞[. Theorem . shows how to write down an expression for the density function of the two-dimensional random variable (Y1, Y2) = t(X1, X2); from standard calculations this density is ½ c² exp(−½ y2) when 0 < y1 < 2π and y2 > 0, and 0 otherwise. Being a probability density function, it integrates to 1:

\[ 1 = \int_0^{+\infty} \int_0^{2\pi} \tfrac{1}{2} c^2 \exp(-\tfrac{1}{2} y_2)\, dy_1\, dy_2 = 2\pi c^2 \int_0^{+\infty} \tfrac{1}{2} \exp(-\tfrac{1}{2} y_2)\, dy_2 = 2\pi c^2 , \]

and hence c = 1/√2π.

Proposition .
If X1, X2, . . . , Xn are independent standard normal random variables, then Y = X1² + X2² + . . . + Xn² follows a χ² distribution with n degrees of freedom.

Proof
The χ² distribution is a gamma distribution, so it suffices to prove the claim in the case n = 1 and then refer to Proposition .. The case n = 1 is treated as follows: For x > 0,

\[ P(X_1^2 \le x) = P(-\sqrt{x} < X_1 \le \sqrt{x}) = \int_{-\sqrt{x}}^{\sqrt{x}} \varphi(u)\, du \quad [\ \varphi \text{ is an even function}\ ] \quad = 2 \int_0^{\sqrt{x}} \varphi(u)\, du \quad [\ \text{substitute } t = u^2\ ] \quad = \int_0^{x} \frac{1}{\sqrt{2\pi}}\, t^{-1/2} \exp(-\tfrac{1}{2}t)\, dt , \]

and differentiation with respect to x gives this expression for the density function of X1²:

\[ \frac{1}{\sqrt{2\pi}}\, x^{-1/2} \exp(-\tfrac{1}{2}x) , \qquad x > 0. \]

The density function of the χ² distribution with 1 degree of freedom is (Definition .)

\[ \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, x^{-1/2} \exp(-\tfrac{1}{2}x) , \qquad x > 0. \]

Since the two density functions are proportional, they must be equal, and that concludes the proof. — As a by-product we see that Γ(1/2) = √π.
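A small simulation (added here; the choice n = 3 is arbitrary) can be used to check Proposition .: the sum of squares of n standard normal variables should have mean n and variance 2n, the moments of the gamma distribution with shape n/2 and scale 2.

```python
import numpy as np

# Check: the sum of squares of n independent standard normal variables follows
# the chi-squared distribution with n degrees of freedom (gamma with shape n/2,
# scale 2), whose mean is n and whose variance is 2n.
rng = np.random.default_rng(seed=5)
n, reps = 3, 300_000                      # arbitrary degrees of freedom and sample size
z = rng.standard_normal(size=(reps, n))
y = (z ** 2).sum(axis=1)

print("mean    :", y.mean(), "  expected", n)
print("variance:", y.var(),  "  expected", 2 * n)
```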

Proposition .
If X is normal with location parameter µ and quadratic scale parameter σ², then EX = µ and VarX = σ².

Proof
Referring to Proposition . and the rules of calculus for expected values and variances, it suffices to consider the case µ = 0, σ² = 1, so assume that µ = 0 and σ² = 1. By symmetry we must have EX = 0, and then VarX = E((X − EX)²) = E(X²); X² is known to be χ² with 1 degree of freedom, i.e. it follows a gamma distribution with parameters 1/2 and 2, so E(X²) = 1 (Proposition .).

The normal distribution is widely used in statistical models. One reason for this is the Central Limit Theorem. (A quite different argument for and derivation of the normal distribution is given on pages ff.)

Theorem .: Central Limit Theorem
Consider a sequence X1, X2, X3, . . . of independent identically distributed random variables with expected value µ and variance σ² > 0, and let Sn denote the sum of the first n variables, Sn = X1 + X2 + . . . + Xn.
Then as n → ∞, (Sn − nµ)/√(nσ²) is asymptotically standard normal in the sense that

\[ \lim_{n\to\infty} P\left( \frac{S_n - n\mu}{\sqrt{n\sigma^2}} \le x \right) = \Phi(x) \]

for all x ∈ R.

Remarks:

• The quantity (Sn − nµ)/√(nσ²) = (Sn/n − µ)/√(σ²/n) is the sum, respectively the average, of the first n variables minus its expected value, divided by its standard deviation; so the quantity has expected value 0 and variance 1.

• The Law of Large Numbers (page ) shows that Sn/n − µ becomes very small (with very high probability) as n increases. The Central Limit Theorem shows how Sn/n − µ becomes very small, namely in such a way that Sn/n − µ divided by √(σ²/n) behaves asymptotically as a standard normal variable.

• The theorem is very general as it assumes nothing about the distribution of the Xs, except that it has an expected value and a variance.

We omit the proof of the Central Limit Theorem.
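To get a feeling for the theorem, the following sketch (an added illustration; the exponential distribution and the sample sizes are arbitrary choices) standardises sums of exponential variables and compares a few probabilities with Φ.

```python
import numpy as np
from math import erf, sqrt

# Illustration of the Central Limit Theorem: standardised sums of exponential
# variables (mu = beta, sigma^2 = beta^2) compared with the standard normal cdf.
def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(seed=6)
beta, n, reps = 1.0, 50, 100_000
s = rng.exponential(scale=beta, size=(reps, n)).sum(axis=1)
z = (s - n * beta) / np.sqrt(n * beta**2)      # (S_n - n*mu)/sqrt(n*sigma^2)

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(f"x={x:5.1f}:  P(Z <= x) ≈ {np.mean(z <= x):.4f}   Phi(x) = {Phi(x):.4f}")
```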

. Exercises

Exercise .
Sketch the density function of the gamma distribution for various values of the shape parameter and the scale parameter. (Investigate among other things how the behaviour near x = 0 depends on the parameters; you may wish to consider lim_{x→0} f(x) and lim_{x→0} f′(x).)

Exercise .
Sketch the normal density for various values of µ and σ².

Exercise .
Consider a continuous and strictly increasing function F defined on an open interval I ⊆ R, and assume that the limiting values of F at the two endpoints of the interval are 0 and 1, respectively. (This means that F, possibly after an extension to all of the real line, is a distribution function.)
. Let U be a random variable having a uniform distribution on the interval ]0 ; 1[. Show that the distribution function of X = F⁻¹(U) is F.
. Let Y be a random variable with distribution function F. Show that V = F(Y) is uniform on ]0 ; 1[.
Remark: Most computer programs come with a function that returns random numbers (or at least pseudo-random numbers) from the uniform distribution on ]0 ; 1[. This exercise indicates one way to transform uniform random numbers into random numbers from any given continuous distribution.
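As an illustration of the remark (added here, not part of the exercise), the inverse-CDF method for the exponential distribution uses F⁻¹(u) = −β ln(1 − u); the sketch below checks the first two sample moments against β and β².

```python
import numpy as np

# Inverse-CDF sampling: if U is uniform on ]0;1[ and F(x) = 1 - exp(-x/beta),
# then X = F^{-1}(U) = -beta*ln(1 - U) has the exponential distribution.
rng = np.random.default_rng(seed=7)
beta = 2.163                                   # scale parameter, arbitrary here
u = rng.uniform(0.0, 1.0, size=300_000)
x = -beta * np.log(1.0 - u)

print("sample mean    :", x.mean(), "  expected", beta)
print("sample variance:", x.var(),  "  expected", beta**2)
```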

Exercise .
A two-dimensional cowboy fires his gun in a random direction. There is a fence at some distance from the cowboy; the fence is very long (infinitely long?).
. Find the probability that the bullet hits the fence.
. Given that the bullet hits the fence, what can be said of the distribution of the point of hitting?

Exercise .
Consider a two-dimensional random variable (X1, X2) with density function f(x1, x2), cf. Definition .. Show that the function f1(x1) = ∫_R f(x1, x2) dx2 is a density function of X1.


Exercise .
Let X1 and X2 be independent random variables with density functions f1 and f2, and let Y1 = X1 + X2 and Y2 = X1.
. Find the density function of (Y1, Y2). (Utilise Theorem ..)
. Find the distribution of Y1. (You may use Exercise ..)
. Explain why the above is a proof of Proposition ..

Exercise .
There is a claim on page  that it is a consequence of Proposition . that the expected value and the variance of a gamma distribution must be linear functions of the shape parameter, and that the expected value must be linear in the scale parameter and the variance quadratic in the scale parameter. — Explain why this is so.

Exercise .
The mercury content in swordfish sold in some parts of the US is known to be normally distributed with mean 1.1 ppm and variance 0.25 ppm² (see e.g. Lee and Krutchkoff () and/or exercises . and . in Larsen ()), but according to the health authorities the average level of mercury in fish for consumption should not exceed 1 ppm. Fish sold through authorised retailers are always controlled by the FDA, and if the mercury content is too high, the lot is discarded. However, about 25% of the fish caught is sold on the black market, that is, 25% of the fish does not go through the control. Therefore the rule “discard if the mercury content exceeds 1 ppm” is not sufficient.
How should the “discard limit” be chosen to ensure that the average level of mercury in the fish actually bought by the consumers is as small as possible?

Generating Functions

In mathematics you will find several instances of constructions of one-to-one correspondences, representations, between one mathematical structure and another. Such constructions can be of great use in argumentations, proofs, calculations etc. A classic and familiar example of this is the logarithm function, which of course turns multiplication problems into addition problems, or stated in another way, it is an isomorphism between (R₊, · ) and (R, +).

In this chapter we shall deal with a representation of the set of N0-valued random variables (or more correctly: the set of probability distributions on N0) using the so-called generating functions.

Note: All random variables in this chapter are random variables with values in N0.

. Basic properties

Definition .: Generating function
Let X be an N0-valued random variable. The generating function of X (or rather: the generating function of the distribution of X) is the function

\[ G(s) = \mathrm{E}\, s^{X} = \sum_{x=0}^{\infty} s^{x}\, P(X = x) , \qquad s \in [0 \,;\, 1]. \]

The theory of generating functions draws to a large extent on the theory of power series. Among other things, the following holds for the generating function G of the random variable X:
. G is continuous and takes values in [0 ; 1]; moreover G(0) = P(X = 0) and G(1) = 1.
. G has derivatives of all orders, and the kth derivative is

\[ G^{(k)}(s) = \sum_{x=k}^{\infty} x(x-1)(x-2)\dots(x-k+1)\, s^{x-k}\, P(X = x) = \mathrm{E}\bigl( X(X-1)(X-2)\dots(X-k+1)\, s^{X-k} \bigr) . \]

In particular, G^{(k)}(s) ≥ 0 for all s ∈ ]0 ; 1[.


. If we let s = 0 in the expression for G^{(k)}(s), we obtain G^{(k)}(0) = k! P(X = k); this shows how to find the probability distribution from the generating function.
Hence, different probability distributions cannot have the same generating function, that is to say, the map that takes a probability distribution into its generating function is injective.
. Letting s → 1 in the expression for G^{(k)}(s) gives

\[ G^{(k)}(1) = \mathrm{E}\bigl( X(X-1)(X-2)\dots(X-k+1) \bigr) ; \qquad (.) \]

one can prove that lim_{s→1} G^{(k)}(s) is a finite number if and only if X^k has an expected value, and if so equation (.) holds true.
Note in particular that

\[ \mathrm{E}X = G'(1) \qquad (.) \]

and (using Exercise .)

\[ \mathrm{Var}X = G''(1) - G'(1)\bigl( G'(1) - 1 \bigr) = G''(1) - \bigl( G'(1) \bigr)^{2} + G'(1). \qquad (.) \]

An important and useful result is

Theorem .
If X and Y are independent random variables with generating functions GX and GY, then the generating function of the sum of X and Y is the product of the generating functions: GX+Y = GX GY.

Proof
For |s| < 1, GX+Y(s) = E s^{X+Y} = E(s^X s^Y) = E s^X · E s^Y = GX(s) GY(s), where we have used a well-known property of exponential functions and then Proposition ./. (on page /).

Example .: One-point distribution
If X = a with probability 1, then its generating function is G(s) = s^a.

Example .: 01-variable
If X is a 01-variable with P(X = 1) = p, then its generating function is G(s) = 1 − p + sp.

Example .: Binomial distribution
The binomial distribution with parameters n and p is the distribution of a sum of n independent and identically distributed 01-variables with parameter p (Definition . on page ). Using Theorem . and Example . we see that the generating function of a binomial random variable Y with parameters n and p therefore is G(s) = (1 − p + sp)ⁿ.


In Example . on page  we found the expected value and the variance of the binomial distribution to be np and np(1 − p), respectively. Now let us derive these quantities using generating functions: We have

\[ G'(s) = np(1-p+sp)^{n-1} \qquad \text{and} \qquad G''(s) = n(n-1)p^{2}(1-p+sp)^{n-2} , \]

so

\[ G'(1) = \mathrm{E}Y = np \qquad \text{and} \qquad G''(1) = \mathrm{E}\bigl( Y(Y-1) \bigr) = n(n-1)p^{2} . \]

Then, according to equations (.) and (.),

\[ \mathrm{E}Y = np \qquad \text{and} \qquad \mathrm{Var}Y = n(n-1)p^{2} - np(np-1) = np(1-p) . \]

We can also give a new proof of Theorem . (page ): Using Theorem ., the generating function of the sum of two independent binomial variables with parameters n1, p and n2, p, respectively, is

\[ (1-p+sp)^{n_1} \cdot (1-p+sp)^{n_2} = (1-p+sp)^{n_1+n_2} , \]

where the right-hand side is recognised as the generating function of the binomial distribution with parameters n1 + n2 and p.
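These manipulations can also be carried out symbolically. The sketch below (an added illustration using SymPy; the value n = 7 is arbitrary and p is kept symbolic) defines the binomial generating function, recovers a point probability from G^(k)(0)/k!, and the mean and variance from G′(1) and G″(1).

```python
import sympy as sp

# Symbolic check of the generating-function machinery for the binomial distribution.
s, p = sp.symbols('s p', positive=True)
n = 7                                    # arbitrary illustration value
G = (1 - p + s*p)**n                     # generating function of the binomial distribution

# Point probability: P(Y = k) = G^{(k)}(0) / k!
k = 2
prob_k = sp.simplify(sp.diff(G, s, k).subs(s, 0) / sp.factorial(k))
print(prob_k)                            # equals binomial(7,2) * p**2 * (1-p)**5

# Mean and variance via equations (.) and (.)
G1 = sp.diff(G, s).subs(s, 1)            # G'(1) = n*p
G2 = sp.diff(G, s, 2).subs(s, 1)         # G''(1) = n*(n-1)*p**2
EY = sp.simplify(G1)
VarY = sp.simplify(G2 - G1**2 + G1)      # mathematically equal to n*p*(1-p)
print(EY, VarY)
```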

Example .: Poisson distribution
The generating function of the Poisson distribution with parameter µ (Definition . page ) is

\[ G(s) = \sum_{x=0}^{\infty} s^{x}\, \frac{\mu^{x}}{x!} \exp(-\mu) = \exp(-\mu) \sum_{x=0}^{\infty} \frac{(\mu s)^{x}}{x!} = \exp\bigl( \mu(s-1) \bigr) . \]

Just like in the case of the binomial distribution, the use of generating functions greatly simplifies proofs of important results such as Proposition . (page ) about the distribution of the sum of independent Poisson variables.

Example .: Negative binomial distribution
The generating function of a negative binomial variable X (Definition . page ) with probability parameter p ∈ ]0 ; 1[ and shape parameter k > 0 is

\[ G(s) = \sum_{x=0}^{\infty} \binom{x+k-1}{x} p^{k} (1-p)^{x} s^{x} = p^{k} \sum_{x=0}^{\infty} \binom{x+k-1}{x} \bigl( (1-p)s \bigr)^{x} = \left( \frac{1}{p} - \frac{1-p}{p}\, s \right)^{-k} . \]

(We have used the third of the binomial series on page  to obtain the last equality.) Then

\[ G'(s) = k\, \frac{1-p}{p} \left( \frac{1}{p} - \frac{1-p}{p}\, s \right)^{-k-1} \qquad \text{and} \qquad G''(s) = k(k+1) \left( \frac{1-p}{p} \right)^{2} \left( \frac{1}{p} - \frac{1-p}{p}\, s \right)^{-k-2} . \]

This is entered into (.) and (.) and we get

\[ \mathrm{E}X = k\, \frac{1-p}{p} \qquad \text{and} \qquad \mathrm{Var}X = k\, \frac{1-p}{p^{2}} , \]

as claimed on page .
By using generating functions it is also quite simple to prove Proposition . (on page ) concerning the distribution of a sum of negative binomial variables.


As these examples might suggest, it is often the case that once you have written down the generating function of a distribution, then many problems are easily solved; you can, for instance, find the expected value and the variance simply by differentiating twice and doing a simple calculation. The drawback is of course that it can be quite cumbersome to find a usable expression for the generating function.

Earlier in this presentation we saw examples of convergence of probability distributions: the binomial distribution converges in certain cases to a Poisson distribution (Theorem . on page ), and so does the negative binomial distribution (Exercise . on page ). Convergence of probability distributions on N0 is closely related to convergence of generating functions:

Theorem .: Continuity Theorem
Suppose that for each n ∈ {1, 2, 3, . . . , ∞} we have a random variable Xn with generating function Gn and probability function fn. Then lim_{n→∞} fn(x) = f∞(x) for all x ∈ N0, if and only if lim_{n→∞} Gn(s) = G∞(s) for all s ∈ [0 ; 1[.

The proof is omitted (it is an exercise in mathematical analysis rather than in probability theory).

We shall now proceed with some new problems, that is, problems that have not been treated earlier in this presentation.

. The sum of a random number of random variables

Theorem .
Assume that N, X1, X2, . . . are independent random variables, and that all the Xs have the same distribution. Then the generating function GY of Y = X1 + X2 + . . . + XN is

\[ G_Y = G_N \circ G_X , \qquad (.) \]

where GN is the generating function of N, and GX is the generating function of the distribution of the Xs.

Proof
For an outcome ω ∈ Ω with N(ω) = n, we have Y(ω) = X1(ω) + X2(ω) + · · · + Xn(ω), i.e., {Y = y and N = n} = {X1 + X2 + · · · + Xn = y and N = n}, and so

\[ G_Y(s) = \sum_{y=0}^{\infty} s^{y}\, P(Y = y) = \sum_{y=0}^{\infty} s^{y} \sum_{n=0}^{\infty} P(Y = y \text{ and } N = n) = \sum_{y=0}^{\infty} s^{y} \sum_{n=0}^{\infty} P\bigl( X_1 + X_2 + \dots + X_n = y \bigr)\, P(N = n) = \sum_{n=0}^{\infty} \left( \sum_{y=0}^{\infty} s^{y}\, P\bigl( X_1 + X_2 + \dots + X_n = y \bigr) \right) P(N = n) , \]

where we have made use of the fact that N is independent of X1 + X2 + · · · + Xn. The expression in large brackets is nothing but the generating function of X1 + X2 + · · · + Xn, which by Theorem . is (GX(s))ⁿ, and hence

\[ G_Y(s) = \sum_{n=0}^{\infty} \bigl( G_X(s) \bigr)^{n}\, P(N = n) . \]

Here the right-hand side is in fact the generating function of N, evaluated at the point GX(s), so we have for all s ∈ ]0 ; 1[

\[ G_Y(s) = G_N\bigl( G_X(s) \bigr) , \]

and the theorem is proved.

[Table .: Mites on apple leaves — number of leaves with y mites, for y = 0, 1, 2, . . .]

Example .: Mites on apple leaves
On the th of July ,  leaves were selected at random from each of six McIntosh apple trees in Connecticut, and for each leaf the number of red mites (female adults) was recorded. The result is shown in Table . (Bliss and Fisher, ).
If mites were scattered over the trees in a haphazard manner, we might expect the number of mites per leaf to follow a Poisson distribution. Statisticians have ways to test whether a Poisson distribution gives a reasonable description of a given set of data, and in the present case the answer is that the Poisson distribution does not apply. So we must try something else.
One might argue that mites usually are found in clusters, and that this fact should somehow be reflected in the model. We could, for example, consider a model where the number of mites on a given leaf is determined using a two-stage random mechanism: first determine the number of clusters on the leaf, and then for each cluster determine the number of mites in that cluster. So let N denote a random variable that gives the number of clusters on the leaf, and let X1, X2, . . . be random variables giving the number of individuals in cluster no. 1, 2, . . . What is observed is 150 values of Y = X1 + X2 + . . . + XN (i.e., Y is a sum of a random number of random variables).
It is always advisable to try simple models before venturing into the more complex ones, so let us assume that the random variables N, X1, X2, . . . are independent, and that all the Xs have the same distribution. This leaves the distribution of N and the distribution of the Xs as the only unresolved elements of the model.
The Poisson distribution did not apply to the number of mites per leaf, but it might apply to the number of clusters per leaf. Let us assume that N is a Poisson variable with parameter µ > 0. For the number of individuals per cluster we will use the logarithmic distribution with parameter α ∈ ]0 ; 1[, that is, the distribution with point probabilities

\[ f(x) = \frac{1}{-\ln(1-\alpha)} \cdot \frac{\alpha^{x}}{x} , \qquad x = 1, 2, 3, \dots \]

(a cluster necessarily has a non-zero number of individuals). It is not entirely obvious why to use this particular distribution, except that the final result is quite nice, as we shall see right away.

Theorem . gives the generating function of Y in terms of the generating functions of N and X. The generating function of N is GN(s) = exp(µ(s − 1)), see Example .. The generating function of the distribution of X is

\[ G_X(s) = \sum_{x=1}^{\infty} s^{x} f(x) = \frac{1}{-\ln(1-\alpha)} \sum_{x=1}^{\infty} \frac{(\alpha s)^{x}}{x} = \frac{-\ln(1-\alpha s)}{-\ln(1-\alpha)} . \]

From Theorem ., the generating function of Y is GY(s) = GN(GX(s)), so

\[ G_Y(s) = \exp\left( \mu \left( \frac{-\ln(1-\alpha s)}{-\ln(1-\alpha)} - 1 \right) \right) = \exp\left( \frac{\mu}{\ln(1-\alpha)} \ln \frac{1-\alpha s}{1-\alpha} \right) = \left( \frac{1-\alpha s}{1-\alpha} \right)^{\mu/\ln(1-\alpha)} = \left( \frac{1}{1-\alpha} - \frac{\alpha}{1-\alpha}\, s \right)^{\mu/\ln(1-\alpha)} = \left( \frac{1}{p} - \frac{1-p}{p}\, s \right)^{-k} \]

where p = 1 − α and k = −µ/ln p. This function is nothing but the generating function of the negative binomial distribution with probability parameter p and shape parameter k (Example .). Hence the number Y of mites per leaf follows a negative binomial distribution.
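A simulation (added for illustration; the parameter values µ = 2 and α = 0.7 are arbitrary) can be used to check this conclusion: the mean and variance of the simulated Y should agree with the negative binomial formulas k(1 − p)/p and k(1 − p)/p² with p = 1 − α and k = −µ/ln p. NumPy's logseries sampler uses the same parametrisation of the logarithmic distribution as above.

```python
import numpy as np

# Sum of a Poisson number of logarithmically distributed cluster sizes:
# Y = X_1 + ... + X_N with N ~ Poisson(mu) and X_i ~ logarithmic(alpha)
# should be negative binomial with p = 1 - alpha and k = -mu/ln(p).
rng = np.random.default_rng(seed=8)
mu, alpha, reps = 2.0, 0.7, 100_000          # arbitrary illustration values

n_clusters = rng.poisson(lam=mu, size=reps)
y = np.array([rng.logseries(alpha, size=n).sum() if n > 0 else 0
              for n in n_clusters])

p = 1.0 - alpha
k = -mu / np.log(p)
print("mean    :", y.mean(), "  negative binomial mean    ", k * (1 - p) / p)
print("variance:", y.var(),  "  negative binomial variance", k * (1 - p) / p**2)
```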

Corollary .
In the situation described in Theorem .,

\[ \mathrm{E}Y = \mathrm{E}N \cdot \mathrm{E}X \qquad \text{and} \qquad \mathrm{Var}Y = \mathrm{E}N \cdot \mathrm{Var}X + \mathrm{Var}N \cdot (\mathrm{E}X)^{2} . \]


Proof
Equations (.) and (.) express the expected value and the variance of a distribution in terms of the generating function and its derivatives evaluated at 1. Differentiating (.) gives G′Y = (G′N ∘ GX) · G′X, which evaluated at 1 gives EY = EN · EX. Similarly, evaluate G″Y at 1 and do a little arithmetic to obtain the second of the results claimed.

. Branching processes

A branching process is a special type of stochastic process. In general, a stochastic process is a family (X(t) : t ∈ T) of random variables, indexed by a quantity t (“time”) varying in a set T that often is R or [0 ; +∞[ (“continuous time”), or Z or N0 (“discrete time”). The range of the random variables would often be a subset of either Z (“discrete state space”) or R (“continuous state space”).

[Side note: About the notation. In the first chapters we made a point of random variables being functions defined on a sample space Ω; but gradually both the ωs and the Ω seem to have vanished. So, does the t of “Y(t)” in the present chapter replace the ω? No, definitely not. An Ω is still implied, and it would certainly be more correct to write Y(ω, t) instead of just Y(t). The meaning of Y(ω, t) is that for each t there is a usual random variable ω ↦ Y(ω, t), and for each ω there is a function t ↦ Y(ω, t) of “time” t.]

We can introduce a branching process with discrete time N0 and discrete state space Z in the following way. Assume that we are dealing with a special kind of individuals that live exactly one time unit: at each time t = 0, 1, 2, . . ., each individual either dies or is split into a random number of new individuals (descendants); all individuals of all generations behave the same way and independently of each other, and the numbers of descendants are always selected from the same probability distribution. If Y(t) denotes the total number of individuals at time t, calculated when all deaths and splittings at time t have taken place, then (Y(t) : t ∈ N0) is an example of a branching process.

Some questions you would ask regarding branching processes are: Given that Y(0) = y0, what can be said of the distribution of Y(t) and of the expected value and the variance of Y(t)? What is the probability that Y(t) = 0, i.e., that the population is extinct at time t? And how long will it take the population to become extinct?

We shall now state a mathematical model for a branching process and then answer some of the questions.

The so-called offspring distribution plays a decisive role in the model. The offspring distribution is the probability distribution that models the number of descendants of each single individual in one time unit (generation). The offspring distribution is a distribution on N0 = {0, 1, 2, . . .}; the outcome 0 means that the individual (dies and) has no descendants, the value 1 means that the individual (dies and) has exactly one descendant, the value 2 means that the individual (dies and) has exactly two descendants, etc.

Let G be the generating function of the offspring distribution. The initial number of individuals is 1, so Y(0) = 1. At time t = 1 this individual becomes Y(1) new individuals, and the generating function of Y(1) is

\[ s \mapsto \mathrm{E}\bigl( s^{Y(1)} \bigr) = G(s) . \]

At time t = 2 each of the Y(1) individuals becomes a random number of new ones, so Y(2) is a sum of Y(1) independent random variables, each with generating function G. Using Theorem ., the generating function of Y(2) is

\[ s \mapsto \mathrm{E}\bigl( s^{Y(2)} \bigr) = (G \circ G)(s) . \]

At time t = 3 each of the Y(2) individuals becomes a random number of new ones, so Y(3) is a sum of Y(2) independent random variables, each with generating function G. Using Theorem ., the generating function of Y(3) is

\[ s \mapsto \mathrm{E}\bigl( s^{Y(3)} \bigr) = (G \circ G \circ G)(s) . \]

Continuing this reasoning, we see that the generating function of Y(t) is

\[ s \mapsto \mathrm{E}\bigl( s^{Y(t)} \bigr) = (\underbrace{G \circ G \circ \dots \circ G}_{t})(s) . \qquad (.) \]

(Note by the way that if the initial number of individuals is y0, then the generating function becomes ((G ∘ G ∘ . . . ∘ G)(s))^{y0}; this is a consequence of Theorem ..)
Let µ denote the expected value of the offspring distribution; recall that µ = G′(1) (equation (.)). Now, since equation (.) is a special case of Theorem ., we can use the corollary of that theorem and obtain the hardly surprising result EY(t) = µᵗ, that is, the expected population size either increases or decreases exponentially. Of more interest is

Theorem .
Assume that the offspring distribution is not degenerate at 1, and assume that Y(0) = 1.

• If µ ≤ 1, then the population will become extinct with probability 1; more specifically,

\[ \lim_{t\to\infty} P(Y(t) = y) = \begin{cases} 1 & \text{for } y = 0 \\ 0 & \text{for } y = 1, 2, 3, \dots \end{cases} \]

• If µ > 1, then the population will become extinct with probability s⋆, and will grow indefinitely with probability 1 − s⋆; more specifically,

\[ \lim_{t\to\infty} P(Y(t) = y) = \begin{cases} s^{\star} & \text{for } y = 0 \\ 0 & \text{for } y = 1, 2, 3, \dots \end{cases} \]

Here s⋆ is the solution of G(s) = s, s < 1.


Proof
We can prove the theorem by studying convergence of the generating function of Y(t): for each s0 we will find the limiting value of E s0^{Y(t)} as t → ∞. Equation (.) shows that the number sequence (E s0^{Y(t)})_{t∈N} is the same as the sequence (sn)_{n∈N} determined by

\[ s_{n+1} = G(s_n) , \qquad n = 0, 1, 2, 3, \dots \]

The behaviour of this sequence depends on G and on the initial value s0. Recall that G and all its derivatives are non-negative on the interval ]0 ; 1[, and that G(1) = 1 and G′(1) = µ. By assumption we have precluded the possibility that G(s) = s for all s (which corresponds to the offspring distribution being degenerate at 1), so we may conclude that

• If µ ≤ 1, then G(s) > s for all s ∈ ]0 ; 1[.
The sequence (sn) is then increasing and thus convergent. Let the limiting value be s; then s = lim_{n→∞} s_{n+1} = lim_{n→∞} G(s_n) = G(s), where the centre equality follows from the definition of (sn) and the last equality from G being continuous at s. The only point s ∈ [0 ; 1] such that G(s) = s is s = 1, so we have now proved that the sequence (sn) is increasing and converges to 1.

[Marginal figure: the iteration sn+1 = G(sn) in the case µ ≤ 1.]

• If µ > 1, then there is exactly one number s⋆ ∈ [0 ; 1[ such that G(s⋆) = s⋆.
– If s⋆ < s < 1, then s⋆ ≤ G(s) < s. Therefore, if s⋆ < s0 < 1, then the sequence (sn) decreases and hence has a limiting value s ≥ s⋆. As in the case µ ≤ 1 it follows that G(s) = s, so that s = s⋆.
– If 0 < s < s⋆, then s < G(s) ≤ s⋆. Therefore, if 0 < s0 < s⋆, then the sequence (sn) increases, and (as above) the limiting value must be s⋆.
– If s0 is either s⋆ or 1, then the sequence (sn) is identically equal to s0.

[Marginal figure: the iteration sn+1 = G(sn) in the case µ > 1.]

In the case µ ≤ 1 the above considerations show that the generating function of Y(t) converges to the generating function of the one-point distribution at 0 (viz. the constant function 1), so the probability of ultimate extinction is 1.
In the case µ > 1 the above considerations show that the generating function of Y(t) converges to a function that has the value s⋆ throughout the interval [0 ; 1[ and the value 1 at s = 1. This function is not the generating function of any ordinary probability distribution (the function is not continuous at 1), but one might argue that it could be the generating function of a probability distribution that assigns probability s⋆ to the point 0 and probability 1 − s⋆ to an infinitely distant point.

It is in fact possible to extend the theory of generating functions to cover distributions that place part of the probability mass at infinity, but we shall not do so in this exposition. Here, we restrict ourselves to showing that

\[ \lim_{t\to\infty} P(Y(t) = 0) = s^{\star} \qquad (.) \]
\[ \lim_{t\to\infty} P(Y(t) \le c) = s^{\star} \quad \text{for all } c > 0. \qquad (.) \]

Since P(Y(t) = 0) equals the value of the generating function of Y(t) evaluated at s = 0, equation (.) follows immediately from the preceding. To prove equation (.) we use the following inequality, which is obtained from the Markov inequality (Lemma . on page ):

\[ P(Y(t) \le c) = P\bigl( s^{Y(t)} \ge s^{c} \bigr) \le s^{-c}\, \mathrm{E}\, s^{Y(t)} \]

where 0 < s < 1. From what is proved earlier, s^{−c} E s^{Y(t)} converges to s^{−c} s⋆ as t → ∞. By choosing s close enough to 1, we can make s^{−c} s⋆ be as close to s⋆ as we wish, and therefore limsup_{t→∞} P(Y(t) ≤ c) ≤ s⋆. On the other hand, P(Y(t) = 0) ≤ P(Y(t) ≤ c) and lim_{t→∞} P(Y(t) = 0) = s⋆, and hence s⋆ ≤ liminf_{t→∞} P(Y(t) ≤ c). We may conclude that lim_{t→∞} P(Y(t) ≤ c) exists and equals s⋆.

Equations (.) and (.) show that no matter how large the interval [0 ; c] is, in the limit as t → ∞ it will contain no other probability mass than the mass at y = 0.
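The extinction probability can be computed numerically by iterating sn+1 = G(sn) from s0 = 0, exactly as in the proof above. The sketch below (added for illustration) does this for an offspring distribution on {0, 1, 2} with probabilities p0, p1, p2; the chosen values are arbitrary and give µ > 1.

```python
# Extinction probability of a branching process, found by iterating s_{n+1} = G(s_n)
# starting from s_0 = 0, for an offspring distribution on {0, 1, 2}.
p0, p1, p2 = 0.2, 0.3, 0.5        # arbitrary illustration values; mu = p1 + 2*p2 = 1.3 > 1

def G(s):
    """Generating function of the offspring distribution."""
    return p0 + p1 * s + p2 * s**2

s = 0.0
for _ in range(200):              # the iteration converges monotonically to s*
    s = G(s)
print("extinction probability s* ≈", s)
# For this offspring distribution the roots of G(s) = s are 1 and p0/p2,
# so the printed value should be close to 0.4.
```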

. Exercises

Exercise .
Cosmic radiation is, at least in this exercise, supposed to be particles arriving “totally at random” at some sort of instrument that counts them. We will interpret “totally at random” to mean that the number of particles (of a given energy) arriving in a time interval of length t is Poisson distributed with parameter λt, where λ is the intensity of the radiation.
Suppose that the counter is malfunctioning in such a way that each particle is registered only with a certain probability (which is assumed to be constant in time). What can now be said of the distribution of particles registered?

Exercise .
Use Theorem . to show that when n → ∞ and p → 0 in such a way that np → µ, then the binomial distribution with parameters n and p converges to the Poisson distribution with parameter µ. (In Theorem . on page  this result was proved in another way.)
Hint: If (zn) is a sequence of (real or complex) numbers converging to the finite number z, then

\[ \lim_{n\to\infty} \left( 1 + \frac{z_n}{n} \right)^{n} = \exp(z) . \]

Exercise .
Use Theorem . to show that when k → ∞ and p → 1 in such a way that k(1 − p)/p → µ, then the negative binomial distribution with parameters k and p converges to the Poisson distribution with parameter µ. (See also Exercise . on page .)

Exercise .
In Theorem . it is assumed that the process starts with one single individual (Y(0) = 1). Reformulate the theorem to cover the case with a general initial value (Y(0) = y0).

Exercise .
A branching process has an offspring distribution that assigns probabilities p0, p1 and p2 to the integers 0, 1 and 2 (p0 + p1 + p2 = 1), that is, after each time step each individual becomes 0, 1 or 2 individuals with the probabilities mentioned. How does the probability of ultimate extinction depend on p0, p1 and p2?


General Theory

This chapter gives a tiny glimpse of the theory of probability measures on general sample spaces.

[Side note: Andrei Kolmogorov, Russian mathematician (-). In  he published the book Grundbegriffe der Wahrscheinlichkeitsrechnung, giving for the first time a fully satisfactory axiomatic presentation of probability theory.]

Definition .: σ-algebra
Let Ω be a non-empty set. A set F of subsets of Ω is called a σ-algebra on Ω if
. Ω ∈ F ,
. if A ∈ F then Aᶜ ∈ F ,
. F is closed under countable unions, i.e. if A1, A2, . . . is a sequence in F then the union ⋃_{n=1}^{∞} An is in F .

Remarks: Let F be a σ-algebra. Since ∅ = Ωᶜ, it follows from  and  that ∅ ∈ F . Since ⋂_{n=1}^{∞} An = ( ⋃_{n=1}^{∞} Anᶜ )ᶜ, it follows from  and  that F is closed under countable intersections as well.

On R, the set of reals, there is a special σ-algebra B, the Borel σ-algebra, which is defined as the smallest σ-algebra containing all open subsets of R; B is also the smallest σ-algebra on R containing all intervals.

[Side note: Émile Borel, French mathematician (-).]

Definition .: Probability space
A probability space is a triple (Ω, F , P) consisting of
. a sample space Ω, which is a non-empty set,
. a σ-algebra F of subsets of Ω,
. a probability measure on (Ω, F ), i.e. a map P : F → R which is
• positive: P(A) ≥ 0 for all A ∈ F ,
• normed: P(Ω) = 1, and
• σ-additive: for any sequence A1, A2, . . . of disjoint events from F ,

\[ P\Bigl( \bigcup_{i=1}^{\infty} A_i \Bigr) = \sum_{i=1}^{\infty} P(A_i) . \]

Proposition .
Let (Ω, F , P) be a probability space. Then P is continuous from below in the sense that if A1 ⊆ A2 ⊆ A3 ⊆ . . . is an increasing sequence in F , then ⋃_{n=1}^{∞} An ∈ F and

\[ \lim_{n\to\infty} P(A_n) = P\Bigl( \bigcup_{n=1}^{\infty} A_n \Bigr) ; \]

furthermore, P is continuous from above in the sense that if B1 ⊇ B2 ⊇ B3 ⊇ . . . is a decreasing sequence in F , then ⋂_{n=1}^{∞} Bn ∈ F and

\[ \lim_{n\to\infty} P(B_n) = P\Bigl( \bigcap_{n=1}^{\infty} B_n \Bigr) . \]

Proof
Due to the finite additivity,

\[ P(A_n) = P\bigl( (A_n \setminus A_{n-1}) \cup (A_{n-1} \setminus A_{n-2}) \cup \dots \cup (A_2 \setminus A_1) \cup (A_1 \setminus \emptyset) \bigr) = P(A_n \setminus A_{n-1}) + P(A_{n-1} \setminus A_{n-2}) + \dots + P(A_2 \setminus A_1) + P(A_1 \setminus \emptyset) , \]

and so (with A0 = ∅)

\[ \lim_{n\to\infty} P(A_n) = \sum_{n=1}^{\infty} P(A_n \setminus A_{n-1}) = P\Bigl( \bigcup_{n=1}^{\infty} (A_n \setminus A_{n-1}) \Bigr) = P\Bigl( \bigcup_{n=1}^{\infty} A_n \Bigr) , \]

where we have used that F is a σ-algebra and P is σ-additive. This proves the first part (continuity from below).
To prove the second part, apply “continuity from below” to the increasing sequence An = Bnᶜ.

Random variables are defined in this way (cf. Definition . page ):

Definition .: Random variable
A random variable on (Ω, F , P) is a function X : Ω → R with the property that {X ∈ B} ∈ F for each B ∈ B.

Remarks:
. The condition {X ∈ B} ∈ F ensures that P(X ∈ B) is meaningful.
. Sometimes we write X⁻¹(B) instead of {X ∈ B} (which is short for {ω ∈ Ω : X(ω) ∈ B}), and we may think of X⁻¹ as a map that takes subsets of R into subsets of Ω.
Hence a random variable is a function X : Ω → R such that X⁻¹ : B → F . We say that X : (Ω, F) → (R, B) is a measurable map.

The definition of the distribution function is the same as Definition . (page ):

Definition .: Distribution function
The distribution function of a random variable X is the function

\[ F : \mathbb{R} \longrightarrow [0 \,;\, 1] , \qquad x \longmapsto P(X \le x) . \]

We obtain the same results as earlier:

Lemma .
If the random variable X has distribution function F, then

P(X ≤ x) = F(x),
P(X > x) = 1 − F(x),
P(a < X ≤ b) = F(b) − F(a),

for all real numbers x and a < b.

Proposition .
The distribution function F of a random variable X has the following properties:
. It is non-decreasing: if x ≤ y then F(x) ≤ F(y).
. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
. It is right-continuous, i.e. F(x+) = F(x) for all x.
. At each point x, P(X = x) = F(x) − F(x−).
. A point x is a point of discontinuity of F if and only if P(X = x) > 0.

Proof
These results are proved as follows:
. As in the proof of Proposition ..
. The sets An = {X ∈ ]−n ; n]} increase to Ω, so using Proposition . we get lim_{n→∞} (F(n) − F(−n)) = lim_{n→∞} P(An) = P(X ∈ Ω) = 1. Since F is non-decreasing and takes values between 0 and 1, this proves item .
. The sets {X ≤ x + 1/n} decrease to {X ≤ x}, so using Proposition . we get

\[ \lim_{n\to\infty} F(x + \tfrac{1}{n}) = \lim_{n\to\infty} P(X \le x + \tfrac{1}{n}) = P\Bigl( \bigcap_{n=1}^{\infty} \{ X \le x + \tfrac{1}{n} \} \Bigr) = P(X \le x) = F(x) . \]

. Again we use Proposition .:

\[ P(X = x) = P(X \le x) - P(X < x) = F(x) - P\Bigl( \bigcup_{n=1}^{\infty} \{ X \le x - \tfrac{1}{n} \} \Bigr) = F(x) - \lim_{n\to\infty} P(X \le x - \tfrac{1}{n}) = F(x) - \lim_{n\to\infty} F(x - \tfrac{1}{n}) = F(x) - F(x-) . \]

. Follows from the preceding.


Proposition . has a “converse”, which we state without proof:

Proposition .
If the function F : R → [0 ; 1] is non-decreasing and right-continuous, and if lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, then F is the distribution function of a random variable.

. Why a general axiomatic approach?

. One reason to try to develop a unified axiomatic theory of probability is that from a mathematical-aesthetic point of view it is quite unsatisfactory to have to treat discrete and continuous probability distributions separately (as is done in these notes) — and what to do with distributions that are “mixtures” of discrete and continuous distributions?

Example .
Suppose we want to model the lifetime T of some kind of device that either breaks down immediately, i.e. T = 0, or after a waiting time that is assumed to follow an exponential distribution. Hence the distribution of T can be specified by saying that P(T = 0) = p and P(T > t | T > 0) = exp(−t/β), t > 0; here the parameter p is the probability of an immediate breakdown, and the parameter β is the scale parameter of the exponential distribution.
The distribution of T has a discrete component (the probability mass p at 0) and a continuous component (the probability mass 1 − p smeared out on the positive reals according to an exponential distribution).

. It is desirable to have a mathematical setting that allows a proper definition of convergence of probability distributions. It is, for example, “obvious” that the discrete uniform distribution on the set {0, 1/n, 2/n, 3/n, . . . , (n−1)/n, 1} converges to the continuous uniform distribution on [0 ; 1] as n → ∞, and also that the normal distribution with parameters µ and σ² > 0 converges to the one-point distribution at µ as σ² → 0. Likewise, the Central Limit Theorem (page ) tells that the distribution of sums of properly scaled independent random variables converges to a normal distribution. The theory should be able to place these convergences within a wider mathematical context.

. We have seen simple examples of how to model compound experiments with independent components (see e.g. page f and page ). How can this be generalised, and how to handle infinitely many components?

Example .
The Strong Law of Large Numbers (cf. page ) states that if (Xn) is a sequence of independent identically distributed random variables with expected value µ, then P( lim_{n→∞} X̄n = µ ) = 1, where X̄n denotes the average of X1, X2, . . . , Xn. The event in question here, { lim_{n→∞} X̄n = µ }, involves infinitely many Xs, and one might wonder whether it is at all possible to have a probability space that allows infinitely many random variables.

Example .
In the presentations of the geometric and negative binomial distributions (pages  and ), we studied how many times to repeat a 01-experiment before the first (or kth) 1 occurs, and a random variable Tk was introduced as the number of repetitions needed in order that the outcome 1 occurs for the kth time.
It can be of some interest to know the probability that sooner or later a total of k 1s have appeared, that is, P(Tk < ∞). If X1, X2, X3, . . . are random variables representing outcome no. 1, 2, 3, . . . of the 01-experiment, then Tk is a straightforward function of the Xs: Tk = inf{t ∈ N : X1 + X2 + . . . + Xt = k}. — How can we build a probability space that allows us to discuss events relating to infinite sequences of random variables?

Example .
Let (U(t) : t = 1, 2, 3, . . .) be a sequence of independent identically distributed uniform random variables on the two-element set {−1, 1}, and define a new sequence of random variables (X(t) : t = 0, 1, 2, . . .) as

\[ X(0) = 0 , \qquad X(t) = X(t-1) + U(t) , \quad t = 1, 2, 3, \dots \]

that is, X(t) = U(1) + U(2) + · · · + U(t). This new stochastic process (X(t) : t ∈ N0) is known as a simple random walk in one dimension. Stochastic processes are often graphed in a coordinate system with the abscissa axis as “time” (t) and the ordinate axis as “state” (x). Accordingly, we say that the simple random walk (X(t) : t ∈ N0) is in the state x = 0 at time t = 0, and for each time step thereafter, it moves one unit up or down in the state space.

The steps need not be of length 1. We can easily define a random walk that moves in steps of length ∆x, and moves at times ∆t, 2∆t, 3∆t, . . . (here, ∆x ≠ 0 and ∆t > 0). To do this, begin with an i.i.d. sequence (U(t) : t = ∆t, 2∆t, 3∆t, . . .) (where the Us have the same distribution as the previous Us), and define

\[ X(0) = 0 , \qquad X(t) = X(t - \Delta t) + U(t)\, \Delta x , \quad t = \Delta t, 2\Delta t, 3\Delta t, \dots \]

that is, X(n∆t) = U(∆t)∆x + U(2∆t)∆x + · · · + U(n∆t)∆x. This defines X(t) for non-negative multiples of ∆t, i.e. for t ∈ N0∆t; then use linear interpolation in each subinterval ]k∆t ; (k + 1)∆t[ to define X(t) for all t ≥ 0. The resulting function t ↦ X(t), t ≥ 0, is continuous and piecewise linear.

It is easily seen that E(X(n∆t)) = 0 and Var(X(n∆t)) = n (∆x)², so if t = n∆t (and hence n = t/∆t), then VarX(t) = ((∆x)²/∆t) t. This suggests that if we are going to let ∆t and ∆x tend to 0, it might be advisable to do it in such a way that (∆x)²/∆t has a limiting value σ² > 0, so that VarX(t) → σ²t.


The problem is to develop a mathematical setting in which the “limiting process” just outlined is meaningful; this includes an elucidation of the mathematical objects involved, and of the way they converge.
The solution is to work with probability distributions on spaces of continuous functions (the interpolated random walks are continuous functions of t, and the limiting process should also be continuous). For convenience, assume that σ² = 1; then the limiting process is the so-called standard Wiener process (W(t) : t ≥ 0). The sample paths t ↦ W(t) are continuous and have a number of remarkable properties:

• If 0 ≤ s1 < t1 ≤ s2 < t2, the increments W(t1) − W(s1) and W(t2) − W(s2) are independent normal random variables with expected value 0 and variance (t1 − s1) and (t2 − s2), respectively.
• With probability 1 the function t ↦ W(t) is nowhere differentiable.
• With probability 1 the process will visit all states, i.e. with probability 1 the range of t ↦ W(t) is all of R.
• For each x ∈ R the process visits x infinitely often with probability 1, i.e. for each x ∈ R the set {t > 0 : W(t) = x} is infinite with probability 1.

Pioneering works in this field are Bachelier () and the papers by Norbert Wiener from the early s, see Wiener (, pages -).
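The scaling argument can be illustrated numerically. The sketch below (an added illustration; the step sizes are arbitrary but chosen so that (∆x)²/∆t = 1) simulates the scaled random walk at a fixed time t and checks that the variance is close to ((∆x)²/∆t)·t.

```python
import numpy as np

# Scaled random walk: steps of size dx = +/- sqrt(dt) taken every dt time units,
# so that dx**2/dt = 1; then Var X(t) should be approximately t.
rng = np.random.default_rng(seed=9)
dt = 0.01
dx = np.sqrt(dt)                     # chosen so that dx**2/dt = 1 (sigma**2 = 1)
t = 2.0
steps = int(t / dt)
reps = 100_000

k = rng.binomial(n=steps, p=0.5, size=reps)   # number of upward steps per path
x_t = dx * (2 * k - steps)                    # X(t) = dx * (#up - #down)

print("E X(t)  :", x_t.mean(), "  expected 0")
print("Var X(t):", x_t.var(),  "  expected (dx^2/dt)*t =", (dx**2 / dt) * t)
```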

Part II

Statistics


Introduction

While probability theory deals with the formulation and analysis of probability models for random phenomena (and with the creation of an adequate conceptual framework), mathematical statistics is a discipline that basically is about developing and investigating methods to extract information from data sets burdened with such kinds of uncertainty that we are prepared to describe with suitable probability distributions, and applied statistics is a discipline that deals with these methods in specific types of modelling situations.

Here is a brief outline of the kind of issues that we are going to discuss: We are given a data set, a set of numbers x1, x2, . . . , xn, referred to as the observations (a simple example: the values of 115 measurements of the concentration of mercury in swordfish), and we are going to set up a statistical model stating that the observations are observed values of random variables X1, X2, . . . , Xn whose joint distribution is known except for a few unknown parameters. (In the swordfish example, the random variables could be independent and identically distributed with the two unknown parameters µ and σ².) Then we are left with three main problems:
. Estimation. Given a data set and a statistical model relating to a given subject matter, then how do we calculate an estimate of the values of the unknown parameters? The estimate has to be optimal, of course, in a sense to be made precise.
. Testing hypotheses. Usually, the subject matter provides a number of interesting hypotheses, which the statistician then translates into statistical hypotheses that can be tested. A statistical hypothesis is a statement that the parameter structure can be simplified relative to the original model (for example that some of the parameters are equal, or have a known value).
. Validation of the model. Usually, statistical models are certainly not realistic in the sense that they claim to imitate the “real” mechanisms that generated the observations. Therefore, the job of adjusting and checking the model and assessing its usefulness gets a different content and a different importance from what is the case with many other types of mathematical models.


[Side note: Ronald Aylmer Fisher, English statistician and geneticist (-). The founder of mathematical statistics (or at least, founder of mathematical statistics as it is presented in these notes).]

This is how Fisher in  explained the object of statistics (Fisher (), here quoted from Kotz and Johnson ()):

In order to arrive at a distinct formulation of statistical problems, it is necessary to define the task which the statistician sets himself: briefly, and in its most concrete form, the object of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data.

The Statistical Model

We begin with a fairly abstract presentation of the notion of a statistical model and related concepts; afterwards there will be a number of illustrative examples.

There is an observation x = (x1, x2, . . . , xn) which is assumed to be an observed value of a random variable X = (X1, X2, . . . , Xn) taking values in the observation space X; in many cases, the set X is Rⁿ or N0ⁿ.
For each value of a parameter θ from the parameter space Θ there is a probability measure Pθ on X. All of these Pθ s are candidates for being the distribution of X, and for (at least) one value θ it is true that the distribution of X is Pθ.

The statement “x is an observed value of the random variable X, and for at least one value of θ ∈ Θ it is true that the distribution of X is Pθ” is one way of expressing the statistical model.

The model function is a function f : X × Θ → [0 ; +∞[ such that for each fixed θ ∈ Θ the function x ↦ f(x, θ) is the probability (density) function pertaining to X when θ is the true parameter value.

The likelihood function corresponding to the observation x is the function L : Θ → [0 ; +∞[ given as L(θ) = f(x, θ). — The likelihood function plays an essential role in the theory of estimation of parameters and tests of hypotheses.

If the likelihood function can be written as L(θ) = g(x) h(t(x), θ) for suitable functions g, h and t, then t is said to be a sufficient statistic (or to give a sufficient reduction of data). — Sufficient reductions are of interest mainly in situations where the reduction is to a very small dimension (such as one or two).

Remarks:
. Parameters are usually denoted by Greek letters.
. The dimension of the parameter space Θ is usually much smaller than that of the observation space X.
. In most cases, an injective parametrisation is wanted, that is, the map θ ↦ Pθ should be injective.
. The likelihood function does not have to sum or integrate to 1.
. Often, the log-likelihood function, i.e. the logarithm of the likelihood function, is easier to work with than the likelihood function itself; note that “logarithm” always means “natural logarithm”.

Some more notation:


. When we need to emphasise that L is the likelihood function corresponding to a certain observation x, we will write L(θ; x) instead of L(θ).
. The symbols Eθ and Varθ are sometimes used when we wish to emphasise that the expectation and the variance are evaluated with respect to the probability distribution corresponding to the parameter value θ, i.e. with respect to Pθ.

. Examples

The one-sample problem with 01-variables

[Side note: The dot notation. Given a set of indexed quantities, for example a1, a2, . . . , an, a shorthand notation for the sum of these quantities is the same symbol, now with a dot as index:

\[ a_{\scriptscriptstyle\bullet} = \sum_{i=1}^{n} a_i . \]

Similarly with two or more indices, e.g.

\[ b_{i\scriptscriptstyle\bullet} = \sum_{j} b_{ij} \qquad \text{and} \qquad b_{\scriptscriptstyle\bullet j} = \sum_{i} b_{ij} . \] ]

In the general formulation of the one-sample problem with 01-variables, x = (x1, x2, . . . , xn) is assumed to be an observation of an n-dimensional random variable X = (X1, X2, . . . , Xn) taking values in X = {0,1}ⁿ. The Xis are assumed to be independent identically distributed 01-variables, and P(Xi = 1) = θ where θ is the unknown parameter. The parameter space is Θ = [0 ; 1]. The model function is (cf. (.) on page )

\[ f(x, \theta) = \theta^{x_{\scriptscriptstyle\bullet}} (1-\theta)^{n - x_{\scriptscriptstyle\bullet}} , \qquad (x, \theta) \in \mathsf{X} \times \Theta , \]

and the likelihood function is

\[ L(\theta) = \theta^{x_{\scriptscriptstyle\bullet}} (1-\theta)^{n - x_{\scriptscriptstyle\bullet}} , \qquad \theta \in \Theta . \]

We see that the likelihood function depends on x only through x•, that is, x• is sufficient, or rather: the function that maps x into x• is sufficient.
[To be continued on page .]

Example .
Suppose we have made seven replicates of an experiment with two possible outcomes 0 and 1 and obtained the results 1, 1, 0, 1, 1, 0, 0. We are going to write down a statistical model for this situation.
The observation x = (1,1,0,1,1,0,0) is assumed to be an observation of the 7-dimensional random variable X = (X1, X2, . . . , X7) taking values in X = {0,1}⁷. The Xis are assumed to be independent identically distributed 01-variables, and P(Xi = 1) = θ where θ is the unknown parameter. The parameter space is Θ = [0 ; 1]. The model function is

\[ f(x, \theta) = \prod_{i=1}^{7} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{x_{\scriptscriptstyle\bullet}} (1-\theta)^{7 - x_{\scriptscriptstyle\bullet}} . \]

The likelihood function corresponding to the observation x = (1,1,0,1,1,0,0) is

\[ L(\theta) = \theta^{4} (1-\theta)^{3} , \qquad \theta \in [0 \,;\, 1]. \]
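For a numerical feel for the likelihood function (an added sketch, not part of the notes; maximum likelihood estimation itself is treated in the estimation chapter), one can evaluate L(θ) = θ⁴(1 − θ)³ on a grid and locate its maximum, which lies at θ = 4/7.

```python
import numpy as np

# Evaluate the likelihood L(theta) = theta**4 * (1-theta)**3 from the example
# on a grid and locate the maximising value (which equals 4/7).
theta = np.linspace(0.0, 1.0, 10_001)
L = theta**4 * (1.0 - theta)**3

theta_hat = theta[np.argmax(L)]
print("theta maximising L:", theta_hat, "   4/7 =", 4/7)
print("L at the maximum  :", L.max())
```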


The simple binomial model

The binomial distribution is the distribution of a sum of independent 01-variables, so it is hardly surprising that statistical analysis of binomial variables resembles that of 01-variables very much.
If Y is binomial with parameters n (known) and θ (unknown, θ ∈ [0 ; 1]), then the model function is

\[ f(y, \theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y} \]

where (y, θ) ∈ X × Θ = {0, 1, 2, . . . , n} × [0 ; 1], and the likelihood function is

\[ L(\theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y} , \qquad \theta \in [0 \,;\, 1]. \]

[To be continued on page .]

Example .
If we in Example . consider the total number of 1s only, not the outcomes of the individual experiments, then we have an observation y = 4 of a binomial random variable Y with parameters n = 7 and unknown θ. The observation space is X = {0,1,2, . . . ,7} and the parameter space is Θ = [0 ; 1]. The model function is

f(y,θ) = \binom{7}{y} θ^y (1−θ)^{7−y},   (y,θ) ∈ {0,1,2, . . . ,7} × [0 ; 1].

The likelihood function corresponding to the observation y = 4 is

[Margin figure: the model function f(y,θ) = \binom{7}{y} θ^y (1−θ)^{7−y} sketched as a function of y and θ.]

L(θ) = \binom{7}{4} θ⁴ (1−θ)³,   θ ∈ [0 ; 1].

Example .: Flour beetles I
As part of an example to be discussed in detail in Section ., flour beetles (Tribolium castaneum) are exposed to the insecticide Pyrethrum in a certain concentration, and during the period of observation 43 of the 144 beetles die. If we assume that all beetles are "identical", and that they die independently of each other, then we may assume that the value y = 43 is an observation of a binomial random variable Y with parameters n = 144 and θ (unknown). The statistical model is then given by means of the model function

f(y,θ) = \binom{144}{y} θ^y (1−θ)^{144−y},   (y,θ) ∈ {0,1,2, . . . ,144} × [0 ; 1].

The likelihood function corresponding to the observation y = 43 is

L(θ) = \binom{144}{43} θ^{43} (1−θ)^{144−43},   θ ∈ [0 ; 1].

B [To be continued in Example . page .]


Table . Comparison of binomial distributions.

  group no.              1        2        3      . . .      s
  number of successes   y1       y2       y3      . . .     ys
  number of failures  n1−y1    n2−y2    n3−y3     . . .   ns−ys
  total                 n1       n2       n3      . . .     ns

Comparison of binomial distributions

We have observations y1, y2, . . . , ys of independent binomial random variables Y1, Y2, . . . , Ys, where Yj has parameters nj (known) and θj ∈ [0 ; 1]. You might think of the observations as being arranged in a table like the one in Table .. The model function is

f(y,θ) = ∏_{j=1}^{s} \binom{nj}{yj} θj^{yj} (1−θj)^{nj−yj}

where the parameter variable θ = (θ1, θ2, . . . , θs) varies in Θ = [0 ; 1]^s, and the observation variable y = (y1, y2, . . . , ys) varies in X = ∏_{j=1}^{s} {0,1, . . . ,nj}. The likelihood function and the log-likelihood function corresponding to y are

L(θ) = const1 · ∏_{j=1}^{s} θj^{yj} (1−θj)^{nj−yj},

ln L(θ) = const2 + ∑_{j=1}^{s} ( yj ln θj + (nj − yj) ln(1−θj) );

here const1 is the product of the s binomial coefficients, and const2 = ln(const1).
B [To be continued on page .]

Example .: Flour beetles II
Flour beetles are exposed to an insecticide which is administered in four different concentrations, ., ., . and . mg/cm², and after days the numbers of dead beetles are registered, see Table .. (The poison is sprinkled on the floor, so the concentration is given as weight per area.) One wants to examine whether there is a difference between the effects of the various concentrations, so we have to set up a statistical model that will make that possible.

Table . Survival of flour beetles at four different doses of poison.
  concentration    .     .     .     .
  dead            43    50    47    48
  alive          101    19     7     2
  total          144    69    54    50

Just like in Example ., we will assume that for each concentration the number of dead beetles is an observation of a binomial random variable. For concentration no. j, j = 1,2,3,4, we let yj denote the number of dead beetles and nj the total number of beetles. Then the statistical model is that y = (y1, y2, y3, y4) = (43,50,47,48) is an observation of a four-dimensional random variable Y = (Y1, Y2, Y3, Y4) where Y1, Y2, Y3 and Y4 are independent binomial random variables with parameters n1 = 144, n2 = 69, n3 = 54, n4 = 50 and θ1, θ2, θ3, θ4. The model function is

f(y1,y2,y3,y4; θ1,θ2,θ3,θ4) = \binom{144}{y1} θ1^{y1} (1−θ1)^{144−y1} \binom{69}{y2} θ2^{y2} (1−θ2)^{69−y2}
            × \binom{54}{y3} θ3^{y3} (1−θ3)^{54−y3} \binom{50}{y4} θ4^{y4} (1−θ4)^{50−y4}.

The log-likelihood function corresponding to the observation y is

ln L(θ1,θ2,θ3,θ4) = const + 43 ln θ1 + 101 ln(1−θ1) + 50 ln θ2 + 19 ln(1−θ2)
            + 47 ln θ3 + 7 ln(1−θ3) + 48 ln θ4 + 2 ln(1−θ4).

B [To be continued in Example . on page .]

The multinomial distribution

The multinomial distribution is a generalisation of the binomial distribution: In a situation with n independent replicates of a trial with two possible outcomes, the number of "successes" follows a binomial distribution, cf. Definition . on page . In a situation with n independent replicates of a trial with r possible outcomes ω1, ω2, . . . , ωr, let the random variable Yi denote the number of times outcome ωi is observed, i = 1,2, . . . , r; then the r-dimensional random variable Y = (Y1, Y2, . . . , Yr) follows a multinomial distribution.

In the circumstances just described, the distribution of Y has the form

P(Y = y) = \binom{n}{y1 y2 . . . yr} ∏_{i=1}^{r} θi^{yi}    (.)

if y = (y1, y2, . . . , yr) is a set of non-negative integers with sum n, and P(Y = y) = 0 otherwise. The parameter θ = (θ1, θ2, . . . , θr) is a set of r non-negative real numbers adding up to 1; here, θi is the probability of the outcome ωi in a single trial. The quantity

\binom{n}{y1 y2 . . . yr} = n! / ( ∏_{i=1}^{r} yi! )


is a so-called multinomial coefficient, and it is by definition the number of ways that a set with n elements can be partitioned into r subsets such that subset no. i contains exactly yi elements, i = 1,2, . . . , r.

The probability distribution with probability function (.) is known as the multinomial distribution with r classes (or categories) and with parameters n (known) and θ.
B [To be continued on page .]

Table . Distribution of haemoglobin genotype in Baltic cod.
            Lolland   Bornholm   Åland
  AA           27        14        0
  Aa           30        20        5
  aa           12        52       75
  total        69        86       80
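For readers who want to check multinomial point probabilities numerically, the following is a hedged sketch (not part of the original notes) using scipy.stats.multinomial on a purely hypothetical trinomial example; the parameter values are invented for illustration.

```python
from scipy.stats import multinomial

# A trinomial distribution (r = 3 classes) with n = 10 trials and
# hypothetical class probabilities theta = (0.5, 0.3, 0.2).
n, theta = 10, [0.5, 0.3, 0.2]

# P(Y = (5, 3, 2)) according to equation (.): the multinomial coefficient
# 10!/(5! 3! 2!) times 0.5^5 * 0.3^3 * 0.2^2.
print(multinomial.pmf([5, 3, 2], n=n, p=theta))
```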

Example .: Baltic cod
On March a group of marine biologists caught cod in the waters at Lolland, and for each fish the haemoglobin in the blood was examined. Later that year, a similar experiment was carried out at Bornholm and at the Åland Islands (Sick, ).

It is believed that the type of haemoglobin is determined by a single gene, and in fact the biologists studied the genotype as to this gene. The gene is found in two versions A and a, and the possible genotypes are then AA, Aa and aa. The empirical distribution of genotypes for each location is displayed in Table ..

At each of the three places a number of fish are classified into three classes, so we have three multinomial situations. (With three classes, it would often be named a trinomial situation.) Our basic model will therefore be the model stating that each of the three observed triples

yL = (y1L, y2L, y3L) = (27, 30, 12),   yB = (y1B, y2B, y3B) = (14, 20, 52),   yÅ = (y1Å, y2Å, y3Å) = (0, 5, 75)

originates from its own multinomial distribution; these distributions have parameters nL = 69, nB = 86, nÅ = 80, and

θL = (θ1L, θ2L, θ3L),   θB = (θ1B, θ2B, θ3B),   θÅ = (θ1Å, θ2Å, θ3Å).

B [To be continued in Example . on page .]

The one-sample problem with Poisson variables

The simplest situation is as follows: We have observations y1, y2, . . . , yn of independent and identically distributed Poisson random variables Y1, Y2, . . . , Yn with parameter µ.

Table . Number of deaths by horse-kicks in the Prussian army.
  no. of deaths y                         0     1     2     3     4
  no. of regiment-years with y deaths   109    65    22     3     1

The model function is

f(y,µ) = ∏_{j=1}^{n} ( µ^{yj}/yj! ) exp(−µ) = ( µ^{y•} / ∏_{j=1}^{n} yj! ) exp(−nµ),

where µ ≥ 0 and y ∈ N0^n.

The likelihood function is L(µ) = const · µ^{y•} exp(−nµ), and the log-likelihood function is ln L(µ) = const + y• ln µ − nµ.
B [To be continued on page .]
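A quick illustration (not from the notes) of the fact that the Poisson log-likelihood depends on the observations only through their sum: the sketch below, assuming Python with numpy and using a small hypothetical data set, evaluates ln L(µ) up to the additive constant for a few values of µ.

```python
import numpy as np

def poisson_loglik(mu, y):
    """Log-likelihood ln L(mu) = const + y_dot * ln(mu) - n * mu (constant omitted)."""
    y = np.asarray(y)
    return y.sum() * np.log(mu) - len(y) * mu

y = [2, 0, 1, 3, 1]            # hypothetical counts; only their sum y_dot = 7 matters
for mu in (0.5, 1.0, 1.4, 2.0):
    print(mu, poisson_loglik(mu, y))
# The values peak near mu = y_dot / n = 7/5 = 1.4.
```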

Example .: Horse-kicks
For each of the years from to and each of cavalry regiments of the Prussian army, the number of soldiers kicked to death by a horse is recorded (Bortkiewicz, ). This means that for each of 200 "regiment-years" the number of deaths by horse-kick is known.

We can summarise the data in a table showing the number of regiment-years with 0, 1, 2, . . . deaths, i.e. we classify the regiment-years by number of deaths; it turns out that the maximum number of deaths per year in a regiment-year was 4, so there will be five classes, corresponding to 0, 1, 2, 3 and 4 deaths per regiment-year, see Table ..

It seems reasonable to believe that to a great extent these death accidents happened truly at random. Therefore there will be a random number of deaths in a given regiment in a given year. If we assume that the individual deaths are independent of one another and occur with constant intensity throughout the "regiment-years", then the conditions for a Poisson model are met. We will therefore try a statistical model stating that the observations y1, y2, . . . , y200 are observed values of independent and identically distributed Poisson variables Y1, Y2, . . . , Y200 with parameter µ.
B [To be continued in Example . on page .]

Uniform distribution on an interval

This example is not, as far as is known, of any great importance, but it can be useful to try out the theory.


Consider observations x1, x2, . . . , xn of independent and identically distributed random variables X1, X2, . . . , Xn with a uniform distribution on the interval ]0 ; θ[, where θ > 0 is the unknown parameter. The density function of Xi is

f(x,θ) = 1/θ when 0 < x < θ, and 0 otherwise,

so the model function is

f(x1, x2, . . . , xn, θ) = 1/θ^n when 0 < xmin and xmax < θ, and 0 otherwise.

Here xmin = min{x1, x2, . . . , xn} and xmax = max{x1, x2, . . . , xn}, i.e., the smallest and the largest observation.
B [To be continued on page .]

The one-sample problem with normal variables

We have observations y1, y2, . . . , yn of independent and identically distributed normal variables with expected value µ and variance σ², that is, from the distribution with density function

y ↦ (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) ).

The model function is

f(y,µ,σ²) = ∏_{j=1}^{n} (1/√(2πσ²)) exp( −(yj − µ)²/(2σ²) ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{j=1}^{n} (yj − µ)² )

where y = (y1, y2, . . . , yn) ∈ R^n, µ ∈ R, and σ² > 0. Standard calculations lead to

∑_{j=1}^{n} (yj − µ)² = ∑_{j=1}^{n} ( (yj − ȳ) + (ȳ − µ) )²
                     = ∑_{j=1}^{n} (yj − ȳ)² + 2(ȳ − µ) ∑_{j=1}^{n} (yj − ȳ) + n(ȳ − µ)²
                     = ∑_{j=1}^{n} (yj − ȳ)² + n(ȳ − µ)²,

so

∑_{j=1}^{n} (yj − µ)² = ∑_{j=1}^{n} (yj − ȳ)² + n(ȳ − µ)².    (.)


Table . Newcomb's measurements of the passage time of light over a distance of 7442 m. The entries of the table multiplied by 10^{−3} and added to 24.8 are the passage times in 10^{−6} sec. [The individual entries are not reproduced in this transcript.]

Using this we see that the log-likelihood function is

ln L(µ,σ²) = const − (n/2) ln(σ²) − (1/(2σ²)) ∑_{j=1}^{n} (yj − µ)²
           = const − (n/2) ln(σ²) − (1/(2σ²)) ∑_{j=1}^{n} (yj − ȳ)² − n(ȳ − µ)²/(2σ²).    (.)

We can make use of equation (.) once more: for µ = 0 we obtain

∑_{j=1}^{n} (yj − ȳ)² = ∑_{j=1}^{n} yj² − nȳ² = ∑_{j=1}^{n} yj² − (1/n) y•²,

showing that the sum of squared deviations of the ys from ȳ can be calculated from the sum of the ys and the sum of squares of the ys. Thus, to find the likelihood function we need not know the individual observations, just their sum and sum of squares (in other words, t(y) = (∑ y, ∑ y²) is a sufficient statistic, cf. page ).
B [To be continued on page .]
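The sufficiency of (∑y, ∑y²) can be checked numerically; the following sketch (not part of the notes, with hypothetical data) verifies the identity above by computing the sum of squared deviations both directly and from the two sums.

```python
import numpy as np

# Hypothetical observations; only their sum and sum of squares are used below.
y = np.array([2.1, 3.4, 2.9, 3.8, 2.6])
n = len(y)

S, USS = y.sum(), (y**2).sum()           # the sufficient statistic t(y) = (sum, sum of squares)
SS_from_sums = USS - S**2 / n            # sum of squared deviations via the identity above
SS_direct = ((y - y.mean())**2).sum()    # direct computation

print(SS_from_sums, SS_direct)           # the two numbers agree
```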

Example .: The speed of light
With a series of experiments that were quite precise for their time, the American physicist A.A. Michelson and the American mathematician and astronomer S. Newcomb were attempting to determine the velocity of light in air (Newcomb, ). They were employing the idea of Foucault of letting a light beam go from a fast rotating mirror to a distant fixed mirror and then back to the rotating mirror where the angular displacement is measured. Knowing the speed of revolution and the distance between the mirrors, you can then easily calculate the velocity of the light.

Table . (from Stigler ()) shows the results of measurements made by Newcomb in the period from th July to th September in Washington, D.C. In Newcomb's set-up there was a distance of 7442 m between the rotating mirror placed in Fort Myer


on the west bank of the Potomac river and the fixed mirror placed on the base of the George Washington monument. The values reported by Newcomb are the passage times of the light, that is, the time used to travel the distance in question.

Two of the values in the table stand out from the rest, – and –. They should probably be regarded as outliers, values that in some sense are "too far away" from the majority of observations. In the subsequent analysis of the data, we will ignore those two observations, which leaves us with a total of 64 observations.
B [To be continued in Example . on page .]

The two-sample problem with normal variables

In two groups of individuals the value of a certain variable Y is determined for each individual. Individuals in one group have no relation to individuals in the other group, they are "unpaired", and there need not be the same number of individuals in each group. We will depict the general situation like this:

            observations
  group 1:  y11  y12  . . .  y1j  . . .  y1n1
  group 2:  y21  y22  . . .  y2j  . . .  y2n2

Here yij is observation no. j in group no. i, i = 1,2, and n1 and n2 are the numbers of observations in the two groups. We will assume that the variation between observations within a single group is random, whereas there is a systematic variation between the two groups (that is indeed why the grouping is made!). Finally, we assume that the yijs are observed values of independent normal random variables Yij, all having the same variance σ², and with expected values EYij = µi, j = 1,2, . . . ,ni, i = 1,2. In this way the two mean value parameters µ1 and µ2 describe the systematic variation, i.e. the two group levels, and the variance parameter σ² (and the normal distribution) describes the random variation, which is the same in the two groups (an assumption that may be tested, see Exercise . on page ). The model function is

f(y,µ1,µ2,σ²) = ( 1/√(2πσ²) )^n exp( −(1/(2σ²)) ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − µi)² )

where y = (y11, y12, . . . , y1n1, y21, y22, . . . , y2n2) ∈ R^n, (µ1,µ2) ∈ R² and σ² > 0; here we have put n = n1 + n2. The sum of squares can be partitioned in the same way as equation (.):

∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − µi)² = ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − ȳi)² + ∑_{i=1}^{2} ni (ȳi − µi)²;

here ȳi = (1/ni) ∑_{j=1}^{ni} yij is the ith group average.

The log-likelihood function is

ln L(µ1,µ2,σ²) = const − (n/2) ln(σ²) − (1/(2σ²)) ( ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − ȳi)² + ∑_{i=1}^{2} ni (ȳi − µi)² ).    (.)

B [To be continued on page .]

Example .: Vitamin C
Vitamin C (ascorbic acid) is a well-defined chemical substance, and one would think that industrially produced Vitamin C is as effective as "natural" vitamin C. To check whether this is indeed the case, a small experiment with guinea pigs was conducted.

In this experiment almost "identical" guinea pigs were divided into two groups. Over a period of six weeks the animals in group 1 got a certain amount of orange juice, and the animals in group 2 got an equivalent amount of synthetic vitamin C, and at the end of the period, the length of the odontoblasts was measured for each animal. The resulting observations (put in order in each group) were:

  orange juice:          . . . . . . . . . .
  synthetic vitamin C:   . . . . . . . . . .

This must be a kind of two-sample problem. The nature of the observations seems to make it reasonable to try a model based on the normal distribution, so why not suggest that this is an instance of the "two-sample problem with normal variables". Later on, we shall analyse the data using this model, and we shall in fact test whether the mean increase of the odontoblasts is the same in the two groups.
B [To be continued in Example . on page .]

Simple linear regression

Regression analysis, a large branch of statistics, is about modelling the mean value structure of the random variables, using a small number of quantitative covariates (explanatory variables). Here we consider the simplest case.

There are a number of pairs (xi, yi), i = 1,2, . . . ,n, where the ys are regarded as observed values of random variables Y1, Y2, . . . , Yn, and the xs are non-random so-called covariates or explanatory variables. In regression models covariates are always non-random.

In the simple linear regression model the Ys are independent normal random variables with the same variance σ² and with a mean value structure of the form EYi = α + βxi, or more precisely: for certain constants α and β, EYi = α + βxi for all i.


Table . Forbes' data. Boiling point in degrees F, barometric pressure in inches of Mercury. [The individual measurements are not reproduced in this transcript.]

Hence the model has three unknown parameters, α, β and σ². The model function is

f(y,α,β,σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(yi − (α + βxi))²/(2σ²) ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^{n} (yi − (α + βxi))² )

where y = (y1, y2, . . . , yn) ∈ R^n, α, β ∈ R and σ² > 0. The log-likelihood function is

ln L(α,β,σ²) = const − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^{n} (yi − (α + βxi))².    (.)

B [To be continued on page .]

Example .: Forbes’ data on boiling pointsThe atmospheric pressure decreases with height above sea level, and hence you canpredict altitude from measurements of barometric pressure. Since the boiling point ofwater decreases with barometric pressure, you can in fact calculate the altitude frommeasurements of the boiling point of water.—In s and s the Scottish physicistJames D. Forbes performed a series of measurements at different locations in theSwiss alps and in Scotland. He determined the boiling point of water and the barometricpressure (converted to pressure at a standard temperature of the air) (Forbes, ).The results are shown in Table . (from Weisberg ()).

A plot of barometric pressure against boiling point shows an evident correlation(Figure ., bottom). Therefore a linear regression model with pressure as y and boilingpoint as x might be considered. If we ask a physicist, however, we learn that we shouldrather expect a linear relation between boiling point and the logarithm of the pressure,and that is indeed affirmed by the plot in Figure ., top. In the following we willtherefore try to fit a regression model where the logarithm of the barometric pressure isthe y variable and the boiling point the x variable.B [To be continued in Example . on page .]


[Figure .: Forbes' data. Top: logarithm of barometric pressure against boiling point. Bottom: barometric pressure against boiling point. Barometric pressure in inches of Mercury, boiling point in degrees F.]

. Exercises

Exercise .
Explain how the binomial distribution is in fact an instance of the multinomial distribution.

Exercise .
The binomial distribution was defined as the distribution of a sum of independent and identically distributed 01-variables (Definition . on page ).

Could you think of a way to generalise that definition to a definition of the multinomial distribution as the distribution of a sum of independent and identically distributed random variables?


Estimation

The statistical model asserts that we may view the given set of data as an observation from a certain probability distribution which is completely specified except for a few unknown parameters. In this chapter we shall be dealing with the problem of estimation, that is, how to calculate, on the basis of model plus observations, an estimate of the unknown parameters of the model.

You cannot from purely mathematical reasoning reach a solution to the problem of estimation; somewhere you will have to introduce one or more external principles. Depending on the principles chosen, you will get different methods of estimation. We shall now present a method which is the "usual" method in this country (and in many other countries). First some terminology:

• A statistic is a function defined on the observation space X (and taking values in R or R^n). If t is a statistic, then t(X) will be a random variable; often one does not worry too much about distinguishing between t and t(X).

• An estimator of the parameter θ is a statistic (or a random variable) taking values in the parameter space Θ. An estimator of g(θ) (where g is a function defined on Θ) is a statistic (or a random variable) taking values in g(Θ). It is more or less implied that the estimator should be an educated guess of the true value of the quantity it estimates.

• An estimate is the value of the estimator, so if t is an estimator (more correctly: if t(X) is an estimator), then t(x) is an estimate.

• An unbiased estimator of g(θ) is an estimator t with the property that for all θ, Eθ(t(X)) = g(θ), that is, an estimator that on the average is right.

. The maximum likelihood estimator

Consider a situation where we have an observation x which is assumed to be described by a statistical model given by the model function f(x,θ). To find a suitable estimator for θ we will use a criterion based on the likelihood function L(θ) = f(x,θ): it seems reasonable to say that if L(θ1) > L(θ2), then θ1 is a better estimate of θ than θ2 is, and hence the best estimate of θ must be the value θ̂ that maximises L.


Definition .
The maximum likelihood estimator is the function that for each observation x ∈ X returns the point of maximum θ̂ = θ̂(x) of the likelihood function corresponding to x.

The maximum likelihood estimate is the value taken by the maximum likelihood estimator.

This definition is rather sloppy and incomplete: it is not obvious that the likelihood function has a single point of maximum; there may be several, or none at all. Here is a better definition:

Definition .
A maximum likelihood estimate corresponding to the observation x is a point of maximum θ̂(x) of the likelihood function corresponding to x.

A maximum likelihood estimator is a (not necessarily everywhere defined) function from X to Θ that maps an observation x into a corresponding maximum likelihood estimate.

This method of maximum likelihood is a proposal of a general method for calculating estimators. To judge whether it is a sensible method, let us ask a few questions and see how they are answered.

1. How easy is it to use the method in a concrete model?
In practice what you have to do is to find point(s) of maximum of a real-valued function (the likelihood function), and that is a common and well-understood mathematical issue that can be addressed and solved using standard methods. Often it is technically simpler to find θ̂ as the point of maximum of the log-likelihood function ln L. If ln L has a continuous derivative D ln L, then the points of maximum in the interior of Θ are solutions of D ln L(θ) = 0, where the differential operator D denotes differentiation with respect to θ.

2. Are there any general results about properties of the maximum likelihood estimator, for example about existence and uniqueness, and about how close θ̂ is to θ?
Yes. There are theorems telling that, under certain conditions, as the number of observations increases, the probability that there exists a unique maximum likelihood estimator tends to 1, and Pθ( |θ̂(X) − θ| < ε ) tends to 1 (for all ε > 0).
Under stronger conditions (such as Θ being an open set, and the first three derivatives of ln L existing and satisfying some conditions of regularity), as n → ∞ the maximum likelihood estimator θ̂(X) is asymptotically normal with asymptotic mean θ and an asymptotic variance which is the inverse of Eθ(−D² ln L(θ;X)); moreover, Eθ(−D² ln L(θ;X)) = Varθ(D ln L(θ;X)).
(According to the so-called Cramér-Rao inequality this is the lower bound of the variance of a central estimator, so in this sense the maximum likelihood estimator is asymptotically optimal.)
In situations with a one-dimensional parameter θ, the asymptotic normality of θ̂ means that the random variable

U = (θ̂ − θ) / √( ( Eθ(−D² ln L(θ;X)) )^{−1} )

is asymptotically standard normal as n → ∞, which is to say that lim_{n→∞} P(U ≤ u) = Φ(u) for all u ∈ R. (Φ is the distribution function of the standard normal distribution, cf. Definition . on page .)

3. Does the method produce estimators that seem sensible?
In rare cases you can argue for an "obvious" common sense solution to the estimation problem, and our new method should not be too incompatible with common sense. This has to be settled through examples.

. Examples

The one-sample problem with 01-variables

C [Continued from page .]

In this model the log-likelihood function and its first two derivatives are

ln L(θ) = x• ln θ + (n − x•) ln(1−θ),
D ln L(θ) = ( x• − nθ ) / ( θ(1−θ) ),
D² ln L(θ) = − x•/θ² − (n − x•)/(1−θ)²

for 0 < θ < 1. When 0 < x• < n, the equation D ln L(θ) = 0 has the unique solution θ̂ = x•/n, and with a negative second derivative this must be the unique point of maximum. If x• = n, then L and ln L are strictly increasing, and if x• = 0, then L and ln L are strictly decreasing, so even in these cases θ̂ = x•/n is the unique point of maximum. Hence the maximum likelihood estimate of θ is the relative frequency of 1s, which seems quite sensible.

The expected value of the estimator θ̂ = θ̂(X) is Eθ θ̂ = θ, and its variance is Varθ θ̂ = θ(1−θ)/n, cf. Example . on page and the rules of calculus for expectations and variances.


The general theory (see above) tells that for large values of n, Eθ θ̂ ≈ θ and

Varθ θ̂ ≈ ( Eθ(−D² ln L(θ,X)) )^{−1} = ( Eθ( X•/θ² + (n − X•)/(1−θ)² ) )^{−1} = θ(1−θ)/n.

B [To be continued on page .]
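The maximisation can of course also be done numerically. The sketch below (an aside, not part of the notes, assuming Python with numpy and scipy) minimises −ln L for the flour beetle numbers n = 144, x• = 43 and compares the result with the closed-form estimate x•/n.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, x_dot = 144, 43   # numbers from the flour beetle example

def neg_loglik(theta):
    # minus ln L(theta) = -( x_dot ln(theta) + (n - x_dot) ln(1 - theta) )
    return -(x_dot * np.log(theta) + (n - x_dot) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x, x_dot / n)                     # both approximately 0.2986
print(np.sqrt(res.x * (1 - res.x) / n))     # estimated standard deviation, about 0.04
```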

The simple binomial model

C [Continued from page .]

In the simple binomial model the likelihood function is

L(θ) = \binom{n}{y} θ^y (1−θ)^{n−y},   θ ∈ [0 ; 1],

and the log-likelihood function

ln L(θ) = ln \binom{n}{y} + y ln θ + (n − y) ln(1−θ),   θ ∈ [0 ; 1].

Apart from a constant this function is identical to the log-likelihood function in the one-sample problem with 01-variables, so we get at once that the maximum likelihood estimator is θ̂ = Y/n. Since Y and X• have the same distribution, the maximum likelihood estimators in the two models also have the same distribution, in particular we have Eθ θ̂ = θ and Varθ θ̂ = θ(1−θ)/n.
B [To be continued on page .]

Example .: Flour beetles I
C [Continued from Example . page .]
In the practical example with flour beetles, θ̂ = 43/144 = 0.30. The estimated standard deviation is √( θ̂(1−θ̂)/144 ) = 0.04.

B [To be continued in Example . on page .]

Comparison of binomial distributions

C [Continued from page .]

The log-likelihood function corresponding to y is

ln L(θ) = const + ∑_{j=1}^{s} ( yj ln θj + (nj − yj) ln(1−θj) ).    (.)

We see that ln L is a sum of terms, each of which (apart from a constant) is a log-likelihood function in a simple binomial model, and the parameter θj appears in the jth term only. Therefore we can write down the maximum likelihood estimator right away:

θ̂ = (θ̂1, θ̂2, . . . , θ̂s) = ( Y1/n1, Y2/n2, . . . , Ys/ns ).


Since Y1, Y2, . . . , Ys are independent, the estimators θ̂1, θ̂2, . . . , θ̂s are also independent, and from the simple binomial model we know that E θ̂j = θj and Var θ̂j = θj(1−θj)/nj, j = 1,2, . . . , s.
B [To be continued on page .]

Example .: Flour beetles II
C [Continued from Example . on page .]
In the flour beetle example where each group (or concentration) has its own probability parameter θj, this parameter is estimated as the fraction of dead in the actual group, i.e.

(θ̂1, θ̂2, θ̂3, θ̂4) = (0.30, 0.72, 0.87, 0.96).

The estimated standard deviations are √( θ̂j(1−θ̂j)/nj ), i.e. 0.04, 0.05, 0.05 and 0.03.

B [To be continued in Example . on page .]

The multinomial distribution

C [Continued from page .]

If y = (y1, y2, . . . , yr) is an observation from a multinomial distribution with r classes and parameters n and θ, then the log-likelihood function is

ln L(θ) = const + ∑_{i=1}^{r} yi ln θi.

The parameter θ is estimated as the point of maximum θ̂ (in Θ) of ln L; the parameter space Θ is the set of r-tuples θ = (θ1, θ2, . . . , θr) satisfying θi ≥ 0, i = 1,2, . . . , r, and θ1 + θ2 + · · · + θr = 1. An obvious guess is that θi is estimated as the relative frequency yi/n, and that is indeed the correct answer; but how to prove it?

One approach would be to employ a general method for determining extrema subject to constraints. Another possibility is to prove that the relative frequencies indeed maximise the likelihood function, that is, to show that if we let θ̂i = yi/n, i = 1,2, . . . , r, and θ̂ = (θ̂1, θ̂2, . . . , θ̂r), then ln L(θ) ≤ ln L(θ̂) for all θ ∈ Θ:

The clever trick is to note that ln t ≤ t − 1 for all t > 0 (and with equality if and only if t = 1). Therefore

ln L(θ) − ln L(θ̂) = ∑_{i=1}^{r} yi ln( θi/θ̂i ) ≤ ∑_{i=1}^{r} yi ( θi/θ̂i − 1 ) = ∑_{i=1}^{r} ( yi θi/(yi/n) − yi ) = ∑_{i=1}^{r} ( nθi − yi ) = n − n = 0.

The inequality is strict unless θi = θ̂i for all i = 1,2, . . . , r.
B [To be continued on page .]


Example .: Baltic cod
C [Continued from Example . on page .]
If we restrict our study to cod in the waters at Lolland, the job is to determine the point θ = (θ1,θ2,θ3) in the three-dimensional probability simplex that maximises the log-likelihood function

ln L(θ) = const + 27 ln θ1 + 30 ln θ2 + 12 ln θ3.

From the previous we know that θ̂1 = 27/69 = 0.39, θ̂2 = 30/69 = 0.43 and θ̂3 = 12/69 = 0.17.

[Margin figure: the probability simplex in the three-dimensional space, i.e. the set of non-negative triples (θ1,θ2,θ3) with θ1 + θ2 + θ3 = 1.]

B [To be continued in Example . on page .]
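As a numerical footnote (not in the original notes, assuming Python with numpy), the Lolland estimates are just the relative frequencies:

```python
import numpy as np

y = np.array([27, 30, 12])   # genotype counts AA, Aa, aa at Lolland
n = y.sum()                  # 69 fish

theta_hat = y / n            # relative frequencies maximise the likelihood
print(theta_hat)             # approximately (0.39, 0.43, 0.17)
```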

The one-sample problem with Poisson variables

C [Continued from page .]

The log-likelihood function and its first two derivatives are, for µ > 0,

ln L(µ) = const + y• ln µ − nµ,
D ln L(µ) = y•/µ − n,
D² ln L(µ) = − y•/µ².

When y• > 0, the equation D ln L(µ) = 0 has the unique solution µ̂ = y•/n, and since D² ln L is negative, this must be the unique point of maximum. It is easily seen that the formula µ̂ = y•/n also gives the point of maximum in the case y• = 0.

In the Poisson distribution the variance equals the expected value, so by the usual rules of calculus, the expected value and the variance of the estimator µ̂ = Y•/n are

Eµ µ̂ = µ,   Varµ µ̂ = µ/n.    (.)

The general theory (cf. page ) can tell us that for large values of n, Eµ µ̂ ≈ µ and

Varµ µ̂ ≈ ( Eµ(−D² ln L(µ,Y)) )^{−1} = ( Eµ( Y•/µ² ) )^{−1} = µ/n.

Example .: Horse-kicks
C [Continued from Example . on page .]
In the horse-kicks example, y• = 0·109 + 1·65 + 2·22 + 3·3 + 4·1 = 122, so µ̂ = 122/200 = 0.61.

Hence, the number of soldiers in a given regiment and a given year that are kicked to death by a horse is (according to the model) Poisson distributed with a parameter that is estimated to 0.61. The estimated standard deviation of the estimate is √(µ̂/n) = 0.06, cf. equation (.).
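The horse-kick estimate and its estimated standard deviation can be reproduced with a few lines of Python (an aside, not part of the notes; numpy assumed):

```python
import numpy as np

deaths = np.array([0, 1, 2, 3, 4])        # number of deaths per regiment-year
counts = np.array([109, 65, 22, 3, 1])    # number of regiment-years with that many deaths

n = counts.sum()                          # 200 regiment-years
y_dot = (deaths * counts).sum()           # 122 deaths in total

mu_hat = y_dot / n
print(mu_hat, np.sqrt(mu_hat / n))        # 0.61 and about 0.06
```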


Uniform distribution on an interval

C [Continued from page .]

The likelihood function is

L(θ) = 1/θ^n if xmax < θ, and 0 otherwise.

This function does not attain its maximum. Nevertheless it seems tempting to name θ̂ = xmax as the maximum likelihood estimate. (If instead we were considering the uniform distribution on the closed interval from 0 to θ, the estimation problem would of course have a nicer solution.)

The likelihood function is not differentiable in its entire domain (which is ]0 ; +∞[), so the regularity conditions that ensure the asymptotic normality of the maximum likelihood estimator (page ) are not met, and the distribution of θ̂ = Xmax is indeed not asymptotically normal (see Exercise .).
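The non-normal limit can be seen in a small simulation (an aside, not part of the notes, assuming Python with numpy): for samples from the uniform distribution on ]0 ; θ[, the rescaled estimator Y = n(1 − Xmax/θ) behaves like an exponential variable with scale 1, cf. Exercise . later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.0, 200, 50_000

# theta_hat = X_max for each simulated sample of size n from the uniform distribution on ]0; theta[
theta_hat = rng.uniform(0, theta, size=(reps, n)).max(axis=1)

# Y = n(1 - X_max/theta) should be approximately exponential with scale parameter 1
Y = n * (1 - theta_hat / theta)
print(Y.mean(), Y.std())    # both close to 1, as expected for that exponential distribution
```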

The one-sample problem with normal variables

C [Continued from page .]

By solving the equation D ln L = 0, where ln L is the log-likelihood function (.) on page , we obtain the maximum likelihood estimates for µ and σ²:

µ̂ = ȳ = (1/n) ∑_{j=1}^{n} yj,   σ̂² = (1/n) ∑_{j=1}^{n} (yj − ȳ)².

Usually, however, we prefer another estimate for the variance parameter σ², namely

s² = ( 1/(n−1) ) ∑_{j=1}^{n} (yj − ȳ)².

One reason for this is that s² is an unbiased estimator; this is seen from Proposition . below, which is a special case of Theorem . on page , see also Section ..

Proposition .
Let X1, X2, . . . , Xn be independent and identically distributed normal random variables with mean µ and variance σ². Then it is true that

1. The random variable X̄ = (1/n) ∑_{j=1}^{n} Xj follows a normal distribution with mean µ and variance σ²/n.

2. The random variable s² = ( 1/(n−1) ) ∑_{j=1}^{n} (Xj − X̄)² follows a gamma distribution with shape parameter f/2 and scale parameter 2σ²/f where f = n − 1, or put in another way, (f/σ²)s² follows a χ² distribution with f degrees of freedom. From this follows among other things that E s² = σ².

3. The two random variables X̄ and s² are independent.

Remarks: As a rule of thumb, the number of degrees of freedom of the variance estimate in a normal model can be calculated as the number of observations minus the number of free mean value parameters estimated. The degrees of freedom contain information about the precision of the variance estimate, see Exercise ..
B [To be continued on page .]

Example .: The speed of light
C [Continued from Example . on page .]
Assuming that the positive values in Table . on page are observations from a single normal distribution, the estimate of the mean of this normal distribution is ȳ = 27.75 and the estimate of the variance is s² = 25.8 with 63 degrees of freedom. Consequently, the estimated mean of the passage time is (27.75 × 10^{−3} + 24.8) × 10^{−6} sec = 24.828 × 10^{−6} sec, and the estimated variance of the passage time is 25.8 × (10^{−3} × 10^{−6} sec)² = 25.8 × 10^{−6} (10^{−6} sec)² with 63 degrees of freedom, i.e., the estimated standard deviation is √(25.8 × 10^{−6}) 10^{−6} sec = 0.005 × 10^{−6} sec.
B [To be continued in Example . on page .]

The two-sample problem with normal variables

C [Continued from page .]

The log-likelihood function in this model is given in (.) on page . It attains its maximum at the point (ȳ1, ȳ2, σ̂²), where ȳ1 and ȳ2 are the group averages, and

σ̂² = (1/n) ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − ȳi)²   (here n = n1 + n2).

Often one does not use σ̂² as an estimate of σ², but rather the central estimator

s0² = ( 1/(n−2) ) ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − ȳi)².

The denominator n − 2, the number of degrees of freedom, causes the estimator to become unbiased; this is seen from the proposition below, which is a special case of Theorem . on page , see also Section ..


Table . Vitamin C example, calculations. n stands for number of observations y, S for sum of ys, ȳ for average of ys, f for degrees of freedom, SS for sum of squared deviations, and s² for estimated variance (s² = SS/f). [The numerical entries for the two groups, their sum and their average are not reproduced in this transcript.]

Proposition .
For independent normal random variables Xij with mean values E Xij = µi, j = 1,2, . . . ,ni, i = 1,2, and all with the same variance σ², the following holds:

1. The random variables X̄i = (1/ni) ∑_{j=1}^{ni} Xij, i = 1,2, follow normal distributions with mean µi and variance σ²/ni, i = 1,2.

2. The random variable s0² = ( 1/(n−2) ) ∑_{i=1}^{2} ∑_{j=1}^{ni} (Xij − X̄i)² follows a gamma distribution with shape parameter f/2 and scale parameter 2σ²/f where f = n − 2 and n = n1 + n2, or put in another way, (f/σ²)s0² follows a χ² distribution with f degrees of freedom. From this follows among other things that E s0² = σ².

3. The three random variables X̄1, X̄2 and s0² are independent.

Remarks:
• In general the number of degrees of freedom of a variance estimate is the number of observations minus the number of free mean value parameters estimated.

• A quantity such as yij − ȳi, which is the difference between an actual observed value and the best "fit" provided by the actual model, is sometimes called a residual. The quantity ∑_{i=1}^{2} ∑_{j=1}^{ni} (yij − ȳi)² could therefore be called a residual sum of squares.

B [To be continued on page .]

Example .: Vitamin C
C [Continued from Example . on page .]
Table . gives the parameter estimates and some intermediate results. The estimated mean values are 13.18 (orange juice group) and 8.00 (synthetic vitamin C).


The estimate of the common variance is 13.68 on 18 degrees of freedom, and since both groups have 10 observations, the estimated standard deviation of each mean is √(13.68/10) = 1.17.

B [To be continued in Example . on page .]
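A short sketch (not part of the notes, assuming Python with numpy) reproducing these standard deviations from the summary figures; the raw measurements are not reproduced in this transcript, so only the summary numbers are used.

```python
import numpy as np

# Summary figures from Table . (raw data not available here).
n1 = n2 = 10
ybar1, ybar2 = 13.18, 8.00                    # group means
s0_sq = 13.68                                 # pooled variance estimate on 18 d.f.

se_mean = np.sqrt(s0_sq / n1)                 # standard deviation of each group mean, about 1.17
se_diff = np.sqrt(s0_sq * (1/n1 + 1/n2))      # standard deviation of the difference of means
print(se_mean, se_diff)
```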

Simple linear regression

C [Continued from page .]

We shall estimate the parameters α, β and σ² in the linear regression model. The log-likelihood function is given in equation (.) on page . The sum of squares can be partitioned like this:

∑_{i=1}^{n} (yi − (α + βxi))² = ∑_{i=1}^{n} ( (yi − ȳ) + (ȳ − α − βx̄) − β(xi − x̄) )²
   = ∑_{i=1}^{n} (yi − ȳ)² + β² ∑_{i=1}^{n} (xi − x̄)² − 2β ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) + n(ȳ − α − βx̄)²,

since the other two double products from squaring the trinomial both sum to 0. In the final expression for the sum of squares, the unknown α enters in the last term only; the minimum value of that term is 0, which is attained if and only if α = ȳ − βx̄. The remaining terms constitute a quadratic in β, and this quadratic has the single point of minimum β̂ = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)². Hence the maximum likelihood estimates are

β̂ = ( ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) ) / ( ∑_{i=1}^{n} (xi − x̄)² )   and   α̂ = ȳ − β̂x̄.

(It is tacitly assumed that ∑(xi − x̄)² is non-zero, i.e., the xs are not all equal. It makes no sense to estimate a regression line in the case where all xs are equal.)

The estimated regression line is the line whose equation is y = α̂ + β̂x. The maximum likelihood estimate of σ² is

σ̂² = (1/n) ∑_{i=1}^{n} (yi − (α̂ + β̂xi))²,

but usually one uses the unbiased estimator

s02² = ( 1/(n−2) ) ∑_{i=1}^{n} (yi − (α̂ + β̂xi))²

with n − 2 degrees of freedom.

[Figure .: Forbes' data: data points and estimated regression line.]

With the notation SSx = ∑_{i=1}^{n} (xi − x̄)² we have (cf. Theorem . on page and Section .):

Proposition .
The estimators α̂, β̂ and s02² in the linear regression model have the following properties:

1. β̂ follows a normal distribution with mean β and variance σ²/SSx.

2. a) α̂ follows a normal distribution with mean α and with variance σ²( 1/n + x̄²/SSx ).
   b) α̂ and β̂ are correlated with correlation −1/√( 1 + SSx/(n x̄²) ).

3. a) α̂ + β̂x̄ follows a normal distribution with mean α + βx̄ and variance σ²/n.
   b) α̂ + β̂x̄ and β̂ are independent.

4. The variance estimator s02² is independent of the estimators of the mean value parameters, and it follows a gamma distribution with shape parameter f/2 and scale parameter 2σ²/f where f = n − 2, or put in another way, (f/σ²)s02² follows a χ² distribution with f degrees of freedom. From this follows among other things that E s02² = σ².

Example .: Forbes’ data on boiling pointsC [Continued from Example . on page .]

As is seen from Figure ., there is a single data point that deviates rather much fromthe ordinary pattern. We choose to discard this point, which leaves us with 16 datapoints.

The estimated regression line is

ln(pressure) = 0.0205 ·boiling point− 0.95

and the estimated variance s202 = 6.84×10−6 on 14 degrees of freedom. Figure . shows


Figure . shows the data points together with the estimated line, which seems to give a good description of the points.

To make any practical use of such boiling point measurements you need to know the relation between altitude and barometric pressure. With altitudes of the size of a few kilometres, the pressure decreases exponentially with altitude; if p0 denotes the barometric pressure at sea level (e.g. 1013.25 hPa) and ph the pressure at an altitude h, then h ≈ 8150 m · (ln p0 − ln ph).
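The estimation formulas for α̂, β̂ and s02² are easy to code directly. The sketch below (an aside, not part of the notes, assuming Python with numpy) uses a small hypothetical data set, since Forbes' individual measurements are not reproduced in this transcript.

```python
import numpy as np

# Hypothetical (x, y) pairs, purely to illustrate the formulas above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

SSx = ((x - x.mean())**2).sum()
beta_hat = ((x - x.mean()) * (y - y.mean())).sum() / SSx
alpha_hat = y.mean() - beta_hat * x.mean()

resid = y - (alpha_hat + beta_hat * x)
s02_sq = (resid**2).sum() / (n - 2)      # unbiased variance estimate on n-2 degrees of freedom

print(alpha_hat, beta_hat, s02_sq)
```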

. Exercises

Exercise .
In Example . on page it is argued that the number of mites on apple leaves follows a negative binomial distribution. Write down the model function and the likelihood function, and estimate the parameters.

Exercise .
Find the expected value of the maximum likelihood estimator σ̂² for the variance parameter σ² in the one-sample problem with normal variables.

Exercise .
Find the variance of the variance estimator s² in the one-sample problem with normal variables.

Exercise .
In the continuous uniform distribution on ]0 ; θ[, the maximum likelihood estimator was found (on page ) to be θ̂ = Xmax where Xmax = max{X1, X2, . . . , Xn}.

1. Show that the probability density function of Xmax is

   f(x) = n x^{n−1}/θ^n for 0 < x < θ, and 0 otherwise.

   Hint: first find P(Xmax ≤ x).

2. Find the expected value and the variance of θ̂ = Xmax, and show that for large n, E θ̂ ≈ θ and Var θ̂ ≈ θ²/n².

3. We are now in a position to find the asymptotic distribution of (θ̂ − θ)/√(Var θ̂). For large n we have, using item 2, that

   (θ̂ − θ)/√(Var θ̂) ≈ −Y   where   Y = (θ − Xmax)/(θ/n) = n( 1 − Xmax/θ ).

   Prove that P(Y > y) = ( 1 − y/n )^n, and conclude that the distribution of Y asymptotically is an exponential distribution with scale parameter 1.

Hypothesis Testing

Consider a general statistical model with model function f(x,θ), where x ∈ X and θ ∈ Θ. A statistical hypothesis H0 is an assertion that the true value of the parameter θ actually belongs to a certain subset Θ0 of Θ, formally

H0 : θ ∈ Θ0

where Θ0 ⊂ Θ. Thus, the statistical hypothesis claims that we may use a simpler model.

A test of the hypothesis is an assessment of the degree of concordance between the hypothesis and the observations actually available. The test is based on a suitably selected one-dimensional statistic t, the test statistic, which is designed to measure the deviation between observations and hypothesis. The test probability is the probability of getting a value of X which is less concordant (as measured by the test statistic t) with the hypothesis than the actual observation x is, calculated "under the hypothesis" [i.e. the test probability is calculated under the assumption that the hypothesis is true]. If there is very little concordance between the model and the observations, then the hypothesis is rejected.

. The likelihood ratio test

Just as we in Chapter constructed an estimation procedure based on the likelihood function, we can now construct a procedure of hypothesis testing based on the likelihood function. The general form of this procedure is as follows:

1. Determine the maximum likelihood estimator θ̂ in the full model and the maximum likelihood estimator θ̂̂ under the hypothesis, i.e. θ̂ is a point of maximum of ln L in Θ, and θ̂̂ is a point of maximum of ln L in Θ0.

2. To test the hypothesis we compare the maximal value of the likelihood function under the hypothesis to the maximal value in the full model, i.e. we compare the best description of x obtainable under the hypothesis to the best description obtainable with the full model. This is done using the likelihood ratio test statistic


Q = L(θ̂̂) / L(θ̂) = L(θ̂̂(x), x) / L(θ̂(x), x).

By definition Q is always between 0 and 1, and the closer Q is to 0, the less agreement between the observation x and the hypothesis. It is often more convenient to use −2 ln Q instead of Q; the test statistic −2 ln Q is always non-negative, and the larger its value, the less agreement between the observation x and the hypothesis.

3. The test probability ε is the probability (calculated under the hypothesis) of getting a value of X that agrees less with the hypothesis than the actual observation x does, in mathematics:

ε = P0( Q(X) ≤ Q(x) ) = P0( −2 ln Q(X) ≥ −2 ln Q(x) )

or simply

ε = P0( Q ≤ Qobs ) = P0( −2 ln Q ≥ −2 ln Qobs ).

(The subscript 0 on P indicates that the probability is to be calculated under the assumption that the hypothesis is true.)

4. If the test probability is small, then the hypothesis is rejected. Often one adopts a strategy of rejecting a hypothesis if and only if the test probability is below a certain level of significance α which has been fixed in advance. Typical values of the significance level are 5%, 2.5% and 1%. We can interpret α as the probability of rejecting the hypothesis when it is true.

5. In order to calculate the test probability we need to know the distribution of the test statistic under the hypothesis. In some situations, including normal models (see page ff and Chapter ), exact test probabilities can be obtained by means of known and tabulated standard distributions (t, χ² and F distributions). In other situations, general results can tell that in certain circumstances (under assumptions that assure the asymptotic normality of θ̂ and θ̂̂, cf. page ), the asymptotic distribution of −2 ln Q is a χ² distribution with dim Θ − dim Θ0 degrees of freedom (this requires that we extend the definition of dimension to cover cases where Θ is something like a differentiable surface). In most cases however, the number of degrees of freedom equals "the number of free parameters in the full model minus the number of free parameters under the hypothesis".


. Examples

The one-sample problem with 01-variables

C [Continued from page .]

Assume that in the actual problem, the parameter θ is supposed to be equal to some known value θ0, so that it is of interest to test the statistical hypothesis H0 : θ = θ0 (or written out in more detail: the statistical hypothesis H0 : θ ∈ Θ0 where Θ0 = {θ0}).

Since θ̂ = x•/n, the likelihood ratio test statistic is

Q = L(θ0)/L(θ̂) = ( θ0^{x•} (1−θ0)^{n−x•} ) / ( θ̂^{x•} (1−θ̂)^{n−x•} ) = ( nθ0/x• )^{x•} ( (n − nθ0)/(n − x•) )^{n−x•},

and

−2 ln Q = 2( x• ln( x•/(nθ0) ) + (n − x•) ln( (n − x•)/(n − nθ0) ) ).

The exact test probability is

ε = Pθ0( −2 ln Q ≥ −2 ln Qobs ),

and an approximation to this can be calculated as the probability of getting values exceeding −2 ln Qobs in the χ² distribution with 1 − 0 = 1 degree of freedom; this approximation is adequate when the "expected" numbers nθ0 and n − nθ0 both are at least 5.

The simple binomial model

C [Continued from page .]

Assume that we want to test a statistical hypothesis H0 : θ = θ0 in the simple binomial model. Apart from a constant factor the likelihood function in the simple binomial model is identical to the likelihood function in the one-sample problem with 01-variables, and therefore the two models have the same test statistics Q and −2 ln Q, so

−2 ln Q = 2( y ln( y/(nθ0) ) + (n − y) ln( (n − y)/(n − nθ0) ) ),

and the test probabilities are found in the same way in both cases.

Example .: Flour beetles I
C [Continued from Example . on page .]
Assume that there is a certain standard version of the poison which is known to kill 23% of the beetles. [Actually, that is not so. This part of the example is pure imagination.]


It would then be of interest to judge whether the poison actually used has the same power as the standard poison. In the language of the statistical model this means that we should test the statistical hypothesis H0 : θ = 0.23.

When n = 144 and θ0 = 0.23, then nθ0 = 33.12 and n − nθ0 = 110.88, so −2 ln Q considered as a function of y is

−2 ln Q(y) = 2( y ln( y/33.12 ) + (144 − y) ln( (144 − y)/110.88 ) )

and −2 ln Qobs = −2 ln Q(43) = 3.60. The exact test probability is

ε = P0( −2 ln Q(Y) ≥ 3.60 ) = ∑_{y : −2 ln Q(y) ≥ 3.60} \binom{144}{y} 0.23^y 0.77^{144−y}.

Standard calculations show that the inequality −2 ln Q(y) ≥ 3.60 holds for the y-values 0,1,2, . . . ,23 and 43,44,45, . . . ,144. Furthermore, P0(Y ≤ 23) = 0.0249 and P0(Y ≥ 43) = 0.0344, so that ε = 0.0249 + 0.0344 = 0.0593.

To avoid a good deal of calculations one can instead use the χ² approximation of −2 ln Q to obtain the test probability. Since the full model has one unknown parameter and the hypothesis leaves us with zero unknown parameters, the χ² distribution in question is the one with 1 − 0 = 1 degree of freedom. A table of quantiles of the χ² distribution with 1 degree of freedom (see for example page ) reveals that the 90% quantile is 2.71 and the 95% quantile 3.84, so that the test probability is somewhere between 5% and 10%; standard statistical computer software has functions that calculate distribution functions of most standard distributions, and using such software we will see that in the χ² distribution with 1 degree of freedom, the probability of getting values larger than 3.60 is 5.78%, which is quite close to the exact value of 5.93%.
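Both the exact test probability and the χ² approximation can be computed in a few lines (an aside, not part of the notes, assuming Python with numpy and scipy):

```python
import numpy as np
from scipy.stats import binom, chi2

n, theta0, y_obs = 144, 0.23, 43

def minus2lnQ(y):
    # -2 ln Q(y) = 2( y ln(y/(n theta0)) + (n-y) ln((n-y)/(n - n theta0)) ), with 0 ln 0 = 0
    y = np.asarray(y, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / (n * theta0)), 0.0)
        term = term + np.where(y < n, (n - y) * np.log((n - y) / (n - n * theta0)), 0.0)
    return 2 * term

q_obs = minus2lnQ(y_obs)                        # about 3.60
ys = np.arange(n + 1)
exact = binom.pmf(ys[minus2lnQ(ys) >= q_obs], n, theta0).sum()
approx = chi2.sf(q_obs, df=1)                   # chi-squared approximation, 1 degree of freedom
print(q_obs, exact, approx)                     # roughly 3.60, 0.059 and 0.058
```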

The test probability exceeds 5%, so with the common rule of thumb of a 5% level of significance, we cannot reject the hypothesis; hence the actual observations do not disagree significantly with the hypothesis that the actual poison has the same power as the standard poison.

Comparison of binomial distributions

C [Continued from page .]

This is the problem of investigating whether s different binomial distributions actually have the same probability parameter. In terms of the statistical model this becomes the statistical hypothesis H0 : θ1 = θ2 = · · · = θs, or more accurately, H0 : θ ∈ Θ0, where θ = (θ1, θ2, . . . , θs) and where

Θ0 = { θ ∈ [0 ; 1]^s : θ1 = θ2 = · · · = θs }.

In order to test H0 we need the maximum likelihood estimator under H0, i.e. we must find the point of maximum of ln L restricted to Θ0. The log-likelihood


function is given as equation (.) on page , and when all the θs are equal,

ln L(θ,θ, . . . ,θ) = const + ∑_{j=1}^{s} ( yj ln θ + (nj − yj) ln(1−θ) ) = const + y• ln θ + (n• − y•) ln(1−θ).

This restricted log-likelihood function attains its maximum at θ̂̂ = y•/n•; here y• = y1 + y2 + . . . + ys and n• = n1 + n2 + . . . + ns. Hence the test statistic is

−2 ln Q = −2 ln ( L(θ̂̂, θ̂̂, . . . , θ̂̂) / L(θ̂1, θ̂2, . . . , θ̂s) )
        = 2 ∑_{j=1}^{s} ( yj ln( θ̂j/θ̂̂ ) + (nj − yj) ln( (1−θ̂j)/(1−θ̂̂) ) )
        = 2 ∑_{j=1}^{s} ( yj ln( yj/ŷj ) + (nj − yj) ln( (nj − yj)/(nj − ŷj) ) )    (.)

where ŷj = nj θ̂̂ is the "expected number" of successes in group j under H0, that is, when the groups have the same probability parameter. The last expression for −2 ln Q shows how the test statistic compares the observed numbers of successes and failures (the yjs and (nj − yj)s) to the "expected" numbers of successes and failures (the ŷjs and (nj − ŷj)s). The test probability is

ε = P0( −2 ln Q ≥ −2 ln Qobs )

where the subscript 0 on P indicates that the probability is to be calculated under the assumption that the hypothesis is true.* If so, the asymptotic distribution of −2 ln Q is a χ² distribution with s − 1 degrees of freedom. As a rule of thumb, the χ² approximation is adequate when all of the 2s "expected" numbers exceed 5.

Example .: Flour beetles II
C [Continued from Example . on page .]
The purpose of the investigation is to see whether the effect of the poison varies with the dose given. The way the statistical analysis is carried out in such a situation is to see whether the observations are consistent with an assumption of no dose effect, or in model terms: to test the statistical hypothesis

H0 : θ1 = θ2 = θ3 = θ4.

Under H0 the θs have a common value which is estimated as θ̂̂ = y•/n• = 188/317 = 0.59. Then we can calculate the "expected" numbers ŷj = nj θ̂̂ and nj − ŷj, see Table ..

* This is less innocuous than it may appear. Even when the hypothesis is true, there is still an unknown parameter, the common value of the θs.


Table . Survival of flour beetles at four different doses of poison: expected numbers under the hypothesis of no dose effect.
  concentration    .      .      .      .
  dead            85.4   40.9   32.0   29.7
  alive           58.6   28.1   22.0   20.3
  total          144     69     54     50

The likelihood ratio test statistic −2 ln Q (eq. (.)) compares these numbers to the observed numbers from Table . on page :

−2 ln Qobs = 2( 43 ln(43/85.4) + 50 ln(50/40.9) + 47 ln(47/32.0) + 48 ln(48/29.7)
            + 101 ln(101/58.6) + 19 ln(19/28.1) + 7 ln(7/22.0) + 2 ln(2/20.3) ) = 113.1.

The full model has four unknown parameters, and under H0 there is just a single unknown parameter, so we shall compare the −2 ln Q value to the χ² distribution with 4 − 1 = 3 degrees of freedom. In this distribution the 99.9% quantile is 16.27 (see for example the table on page ), so the probability of getting a larger, i.e. more significant, value of −2 ln Q is far below 0.01%, if the hypothesis is true. This leads us to rejecting the hypothesis, and we may conclude that the four doses have significantly different effects.
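A sketch (not part of the notes, assuming Python with numpy and scipy) reproducing the −2 ln Q value and the corresponding χ² tail probability:

```python
import numpy as np
from scipy.stats import chi2

y = np.array([43, 50, 47, 48])       # dead
n = np.array([144, 69, 54, 50])      # group totals

theta_hathat = y.sum() / n.sum()     # common estimate under H0, 188/317 = 0.59
exp_dead = n * theta_hathat
exp_alive = n - exp_dead

minus2lnQ = 2 * (np.sum(y * np.log(y / exp_dead))
                 + np.sum((n - y) * np.log((n - y) / exp_alive)))
print(minus2lnQ, chi2.sf(minus2lnQ, df=3))   # about 113, p-value far below 0.001
```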

The multinomial distribution

C [Continued from page .]

In certain situations the probability parameter θ is known to vary in some subset Θ0 of the parameter space. If Θ0 is a differentiable curve or surface then the problem is a "nice" problem, from a mathematical point of view.

Example .: Baltic cod
C [Continued from Example . on page .]
Simple reasoning, which is left out here, will show that in a closed population with random mating, the three genotypes occur, in the equilibrium state, with probabilities θ1 = β², θ2 = 2β(1−β) and θ3 = (1−β)², where β is the fraction of A-genes in the population (β is the probability that a random gene is A). This situation is known as a Hardy-Weinberg equilibrium.

[Margin figure: the shaded area is the probability simplex, i.e. the set of triples θ = (θ1,θ2,θ3) of non-negative numbers adding to 1; the curve is the set Θ0, the values of θ corresponding to Hardy-Weinberg equilibrium.]

For each population one can test whether there actually is a Hardy-Weinberg equilibrium. Here we will consider the Lolland population. On the basis of the observation yL = (27,30,12) from a trinomial distribution with probability parameter θL we shall test the statistical hypothesis H0 : θL ∈ Θ0 where Θ0 is the range of the map β ↦ ( β², 2β(1−β), (1−β)² ) from [0 ; 1] into the probability simplex,

Θ0 = { ( β², 2β(1−β), (1−β)² ) : β ∈ [0 ; 1] }.


Before we can test the hypothesis, we need an estimate of β. Under H0 the log-likelihood function is

ln L0(β) = ln L( β², 2β(1−β), (1−β)² )
         = const + 2·27 ln β + 30 ln β + 30 ln(1−β) + 2·12 ln(1−β)
         = const + (2·27 + 30) ln β + (30 + 2·12) ln(1−β)

which attains its maximum at β̂ = (2·27 + 30)/(2·69) = 84/138 = 0.609 (that is, β is estimated as the observed number of A genes divided by the total number of genes). The likelihood ratio test statistic is

−2 ln Q = −2 ln ( L( β̂², 2β̂(1−β̂), (1−β̂)² ) / L(θ̂1, θ̂2, θ̂3) ) = 2 ∑_{i=1}^{3} yi ln( yi/ŷi )

where ŷ1 = 69·β̂² = 25.6, ŷ2 = 69·2β̂(1−β̂) = 32.9 and ŷ3 = 69·(1−β̂)² = 10.6 are the "expected" numbers under H0.

The full model has two free parameters (there are three θs, but they add to 1), and under H0 we have one single parameter (β), hence 2 − 1 = 1 degree of freedom for −2 ln Q.

We get the value −2 ln Q = 0.52, and with 1 degree of freedom this corresponds to a test probability of about 47%, so we can easily assume a Hardy-Weinberg equilibrium in the Lolland population.
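The Hardy-Weinberg test for the Lolland data can be reproduced as follows (an aside, not part of the notes, assuming Python with numpy and scipy):

```python
import numpy as np
from scipy.stats import chi2

y = np.array([27, 30, 12])                       # AA, Aa, aa at Lolland
n = y.sum()

beta_hat = (2 * y[0] + y[1]) / (2 * n)           # 84/138 = 0.609
expected = n * np.array([beta_hat**2,
                         2 * beta_hat * (1 - beta_hat),
                         (1 - beta_hat)**2])     # about (25.6, 32.9, 10.6)

minus2lnQ = 2 * np.sum(y * np.log(y / expected))
print(minus2lnQ, chi2.sf(minus2lnQ, df=1))       # about 0.52 and 0.47
```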

The one-sample problem with normal variables

C [Continued from page .]

Suppose that one wants to test a hypothesis that the mean value of a normal distribution has a given value; in the formal mathematical language we are going to test a hypothesis H0 : µ = µ0. Under H0 there is still an unknown parameter, namely σ². The maximum likelihood estimate of σ² under H0 is the point of maximum of the log-likelihood function (.) on page , now considered as a function of σ² alone (since µ has the fixed value µ0), that is, the function

σ² ↦ ln L(µ0,σ²) = const − (n/2) ln(σ²) − (1/(2σ²)) ∑_{j=1}^{n} (yj − µ0)²

which attains its maximum when σ² equals σ̂̂² = (1/n) ∑_{j=1}^{n} (yj − µ0)². The likelihood ratio test statistic is Q = L(µ0, σ̂̂²)/L(µ̂, σ̂²). Standard calculations lead to

−2 ln Q = 2( ln L(ȳ, σ̂²) − ln L(µ0, σ̂̂²) ) = n ln( 1 + t²/(n−1) )


where

t = ( ȳ − µ0 ) / √( s²/n ).    (.)

The hypothesis is rejected for large values of −2 ln Q, i.e. for large values of |t|. The standard deviation of Ȳ equals √(σ²/n) (Proposition . page ), so the t test statistic measures the difference between ȳ and µ0 in units of estimated standard deviations of Ȳ. The likelihood ratio test is therefore equivalent to a test based on the more directly understandable t test statistic.

The test probability is ε = P0( |t| > |tobs| ). The distribution of t under H0 is known, as the next proposition shows.

[Margin note: t distribution. The t distribution with f degrees of freedom is the continuous distribution with density function

Γ((f+1)/2) / ( √(πf) Γ(f/2) ) · ( 1 + x²/f )^{−(f+1)/2},   x ∈ R.]

Proposition .
Let X1, X2, . . . , Xn be independent and identically distributed normal random variables with mean value µ0 and variance σ², and let X̄ = (1/n) ∑_{i=1}^{n} Xi and s² = ( 1/(n−1) ) ∑_{i=1}^{n} (Xi − X̄)². Then the random variable t = ( X̄ − µ0 ) / √( s²/n ) follows a t distribution with n − 1 degrees of freedom.

Additional remarks:

• It is a major point that the distribution of t under H0 does not depend on the unknown parameter σ² (and neither does it depend on µ0). Furthermore, t is a dimensionless quantity (test statistics should always be dimensionless). See also Exercise ..

• The t distribution is symmetric about 0, and consequently the test probability can be calculated as P(|t| > |tobs|) = 2 P(t > |tobs|).

• In certain situations it can be argued that the hypothesis should be rejected for large positive values of t only (i.e. when ȳ is larger than µ0); this could be the case if it is known in advance that the alternative to µ = µ0 is µ > µ0. In such a situation the test probability is calculated as ε = P(t > tobs). Similarly, if the alternative to µ = µ0 is µ < µ0, the test probability is calculated as ε = P(t < tobs). A t test that rejects the hypothesis for large values of |t| is a two-sided t test, and a t test that rejects the hypothesis when t is large and positive [or large and negative] is a one-sided test.

• Quantiles of the t distribution are obtained from the computer or from a collection of statistical tables. (On page you will find a short table of the t distribution.) Since the distribution is symmetric about 0, quantiles usually are tabulated for probabilities larger than 0.5 only.

[Margin note: William Sealy Gosset (–). Gosset worked at Guinness Brewery in Dublin, with a keen interest in the then very young science of statistics. He developed the t test during his work on issues of quality control related to the brewing process, and found the distribution of the t statistic.]


• The t statistic is also known as Student's t in honour of W.S. Gosset, who published the first article about this test, writing under the pseudonym 'Student'.

• See also Section . on page .

Example .: The speed of light
[Continued from Example . on page .]

Nowadays, one metre is defined as the distance that light travels in vacuum in 1/299792458 of a second, so light has the known speed of 299792458 metres per second, and it will therefore use τ0 = 2.48238×10⁻⁵ seconds to travel a distance of 7442 metres. The value of τ0 must be converted to the same scale as the values in Table . (page ): ((τ0 × 10⁶) − 24.8) × 10³ = 23.8. It is now of some interest to see whether the data in Table . are consistent with the hypothesis that the mean is 23.8, so we will test the statistical hypothesis H0 : µ = 23.8.

From a previous example, ȳ = 27.75 and s² = 25.8, so the t statistic is

   t = (27.75 − 23.8) / √(25.8/64) = 6.2.

The test probability is the probability of getting a value of t larger than 6.2 or smaller than −6.2. From a table of quantiles of the t distribution (for example the table on page ) we see that with 63 degrees of freedom the 99.95% quantile is slightly larger than 3.4, so there is a probability of less than 0.05% of getting a t larger than 6.2, and the test probability is therefore less than 2 × 0.05% = 0.1%. This means that we should reject the hypothesis. Thus, Newcomb's measurements of the passage time of light are not in accordance with today's knowledge and standards.
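As a quick illustration (an addition to the notes, not part of the original text), here is a minimal Python sketch of this one-sample t test; the sample size n = 64 and the summary statistics are the ones quoted in the example, and scipy's t distribution supplies the test probability.

    import math
    from scipy import stats

    # summary statistics from the speed-of-light example
    n, ybar, s2, mu0 = 64, 27.75, 25.8, 23.8

    t = (ybar - mu0) / math.sqrt(s2 / n)          # t test statistic
    minus2lnQ = n * math.log(1 + t**2 / (n - 1))  # equivalent likelihood ratio statistic
    eps = 2 * stats.t.sf(abs(t), df=n - 1)        # two-sided test probability

    print(t, minus2lnQ, eps)   # t is about 6.2, eps far below 0.1%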

The two-sample problem with normal variables

[Continued from page .]

We are going to test the hypothesis H0 : µ1 = µ2 that the two samples come from the same normal distribution (recall that our basic model assumes that the two variances are equal). The estimates of the parameters of the basic model were found on page . Under H0 there are two unknown parameters, the common mean µ and the common variance σ², and since the situation under H0 is nothing but a one-sample problem, we can easily write down the maximum likelihood estimates: the estimate of µ is the grand mean

   ȳ = (1/n) ∑_{i=1}^2 ∑_{j=1}^{ni} yij,

and the estimate of σ² is σ̃² = (1/n) ∑_{i=1}^2 ∑_{j=1}^{ni} (yij − ȳ)². The likelihood ratio test statistic


for H0 is Q = L(ȳ, ȳ, σ̃²) / L(ȳ1, ȳ2, σ̂²). Standard calculations show that

   −2lnQ = 2(lnL(ȳ1, ȳ2, σ̂²) − lnL(ȳ, ȳ, σ̃²)) = n ln(1 + t²/(n − 2))

where

   t = (ȳ1 − ȳ2) / √(s0² (1/n1 + 1/n2)).   (.)

The hypothesis is rejected for large values of −2lnQ, that is, for large values of |t|.

From the rules of calculus for variances, Var(Ȳ1 − Ȳ2) = σ²(1/n1 + 1/n2), so the t statistic measures the difference between the two estimated means relative to the estimated standard deviation of the difference. Thus the likelihood ratio test (−2lnQ) is equivalent to a test based on an immediately appealing test statistic (t).

The test probability is ε = P0(|t| > |t_obs|). The distribution of t under H0 is known, as the next proposition shows.

Proposition .
Consider independent normal random variables Xij, j = 1, 2, ..., ni, i = 1, 2, all with a common mean µ and a common variance σ². Let

   X̄i = (1/ni) ∑_{j=1}^{ni} Xij,  i = 1, 2,   and   s0² = (1/(n − 2)) ∑_{i=1}^2 ∑_{j=1}^{ni} (Xij − X̄i)².

Then the random variable

   t = (X̄1 − X̄2) / √(s0² (1/n1 + 1/n2))

follows a t distribution with n − 2 degrees of freedom (n = n1 + n2).

Additional remarks:
• If the hypothesis H0 is accepted, then the two samples are not really different, and they can be pooled together to form a single sample of size n1 + n2. Consequently the proper estimates must be calculated using the one-sample formulas

   ȳ = (1/n) ∑_{i=1}^2 ∑_{j=1}^{ni} yij   and   s01² = (1/(n − 1)) ∑_{i=1}^2 ∑_{j=1}^{ni} (yij − ȳ)².


• Under H0 the distribution of t does not depend on the two unknown parameters (σ² and the common µ).
• The t distribution is symmetric about 0, and therefore the test probability can be calculated as P(|t| > |t_obs|) = 2P(t > |t_obs|).
  In certain situations it can be argued that the hypothesis should be rejected for large positive values of t only (i.e. when ȳ1 is larger than ȳ2); this could be the case if it is known in advance that the alternative to µ1 = µ2 is that µ1 > µ2. In such a situation the test probability is calculated as ε = P(t > t_obs). Similarly, if the alternative to µ1 = µ2 is that µ1 < µ2, the test probability is calculated as ε = P(t < t_obs).
• Quantiles of the t distribution are obtained from the computer or from a collection of statistical tables, possibly page . Since the distribution is symmetric about 0, quantiles are usually tabulated for probabilities larger than 0.5 only.

• See also Section . page .

Example .: Vitamin C
[Continued from Example . on page .]

The t test for equality of the two means requires variance homogeneity (i.e. that the groups have the same variance), so one might want to perform a test of variance homogeneity (see Exercise .). This test would be based on the variance ratio

   R = s²_orange juice / s²_synthetic = 19.69 / 7.66 = 2.57.

The R value is to be compared to the F distribution with (9,9) degrees of freedom in a two-sided test. Using the tables on page ff we see that in this distribution the 95% quantile is 3.18 and the 90% quantile is 2.44, so the test probability is somewhere between 10 and 20 percent, and we cannot reject the hypothesis of variance homogeneity.

Then we can safely carry on with the original problem, that of testing equality of the two means. The t test statistic is

   t = (13.18 − 8.00) / √(13.68 (1/10 + 1/10)) = 5.18 / 1.65 = 3.13.

The t value is to be compared to the t distribution with 18 degrees of freedom; in this distribution the 99.5% quantile is 2.878, so the chance of getting a value of |t| larger than 3.13 is less than 1%, and we may conclude that there is a highly significant difference between the means of the two groups. Looking at the actual observations, we see that the growth in the 'synthetic' group is less than in the 'orange juice' group.
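For illustration (again an addition, not part of the original notes), here is a small Python sketch of the variance-ratio test and the two-sample t test, using the summary numbers from the vitamin C example; scipy supplies the F and t distributions.

    import math
    from scipy import stats

    n1 = n2 = 10
    ybar1, ybar2 = 13.18, 8.00      # group means
    s2_1, s2_2 = 19.69, 7.66        # unbiased group variances
    s2_0 = 13.68                    # pooled variance estimate

    # two-sided F test for variance homogeneity
    R = s2_1 / s2_2
    eps_F = 2 * min(stats.f.sf(R, n1 - 1, n2 - 1), stats.f.cdf(R, n1 - 1, n2 - 1))

    # two-sample t test with pooled variance
    t = (ybar1 - ybar2) / math.sqrt(s2_0 * (1/n1 + 1/n2))
    eps_t = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)

    print(R, eps_F)   # R about 2.57, test probability between 10 and 20 percent
    print(t, eps_t)   # t about 3.13, test probability below 1 percent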


. Exercises

Exercise .
Let x1, x2, ..., xn be a sample from the normal distribution with mean ξ and variance σ². Then the hypothesis H0x : ξ = ξ0 can be tested using a t test as described on page . But we could also proceed as follows: let a and b ≠ 0 be two constants, and define

   µ = a + bξ,   µ0 = a + bξ0,   yi = a + bxi,  i = 1, 2, ..., n.

If the xs are a sample from the normal distribution with parameters ξ and σ², then the ys are a sample from the normal distribution with parameters µ and b²σ² (Proposition . page ), and the hypothesis H0x : ξ = ξ0 is equivalent to the hypothesis H0y : µ = µ0. Show that the t test statistic for testing H0x is the same as the t test statistic for testing H0y (that is to say, the t test is invariant under affine transformations).

Exercise .
In the discussion of the two-sample problem with normal variables (page ff) it is assumed that the two groups have the same variance (there is 'variance homogeneity'). It is perfectly possible to test that assumption. To do that, we extend the model slightly to the following: y11, y12, ..., y1n1 is a sample from the normal distribution with parameters µ1 and σ1², and y21, y22, ..., y2n2 is a sample from the normal distribution with parameters µ2 and σ2². In this model we can test the hypothesis H : σ1² = σ2².

Do that! Write down the likelihood function, find the maximum likelihood estimators, and find the likelihood ratio test statistic Q.

Show that Q depends on the observations only through the quantity R = s1²/s2², where si² = (1/(ni − 1)) ∑_{j=1}^{ni} (yij − ȳi)² is the unbiased variance estimate in group i, i = 1, 2.

It can be proved that when the hypothesis is true, then the distribution of R is an F distribution (see page ) with (n1 − 1, n2 − 1) degrees of freedom.

Examples

. Flour beetles

This example will show, for one thing, how it is possible to extend a binomial model in such a way that the probability parameter may depend on a number of covariates, using a so-called logistic regression model.

In a study of the reaction of insects to the insecticide Pyrethrum, a number of flour beetles, Tribolium castaneum, were exposed to this insecticide in various concentrations, and after a period of  days the number of dead beetles was recorded for each concentration (Pack and Morgan, ). The experiment was carried out using four different concentrations, and on male and female beetles separately. Table . shows (parts of) the results.

The basic model

The study comprises a total of 144 + 69 + 54 + ... + 47 = 641 beetles. Each individual is classified by sex (two groups) and by dose (four groups), and the response is either death or survival.

An important part of the modelling process is to understand the status of the various quantities entering into the model:

• Dose and sex are covariates used to classify the beetles into 2 × 4 = 8 groups, the idea being that dose and sex have some influence on the probability of survival.
• The group totals 144, 69, 54, 50, 152, 81, 44, 47 are known constants.
• The numbers of dead in each group (43, 50, 47, 48, 26, 34, 27, 43) are observed values of random variables.

Initially, we make a few simple calculations and plots, such as Table . and Figure ..

It seems reasonable to describe, for each group, the number of dead as an observation from a binomial distribution whose parameter n is the total number of beetles in the group, and whose unknown probability parameter has an interpretation as the probability that a beetle of the actual sex will die when given the insecticide in the actual dose. Although it could be of some interest to know whether there is a significant difference between the groups, it would be


[Table .  Flour beetles: for each combination of dose and sex the table gives 'the number of dead' / 'group total' = 'observed death frequency'. Dose is given as mg/cm².]

much more interesting if we could also give a more detailed description of the relation between the dose and the death probability, and if we could tell whether the insecticide has the same effect on males and females. Let us introduce some notation in order to specify the basic model:

1. The group corresponding to dose d and sex k consists of ndk beetles, of which ydk died; here k ∈ {M, F} and d ∈ {0.20, 0.32, 0.50, 0.80}.
2. It is assumed that ydk is an observation of a binomial random variable Ydk with parameters ndk (known) and pdk ∈ ]0 ; 1[ (unknown).
3. Furthermore, it is assumed that all of the random variables Ydk are independent.

The likelihood function corresponding to this model is

   L = ∏_k ∏_d (ndk choose ydk) pdk^ydk (1 − pdk)^(ndk − ydk)
     = ∏_k ∏_d (ndk choose ydk) · ∏_k ∏_d (pdk/(1 − pdk))^ydk · ∏_k ∏_d (1 − pdk)^ndk
     = const · ∏_k ∏_d (pdk/(1 − pdk))^ydk · ∏_k ∏_d (1 − pdk)^ndk,

and the log-likelihood function is

   lnL = const + ∑_k ∑_d ydk ln(pdk/(1 − pdk)) + ∑_k ∑_d ndk ln(1 − pdk)
       = const + ∑_k ∑_d ydk logit(pdk) + ∑_k ∑_d ndk ln(1 − pdk).

The eight parameters pdk vary independently of each other, and p̂dk = ydk/ndk. We are now going to model the dependence of p on d and k.


[Figure .  Flour beetles: observed death frequency plotted against dose, for each sex.]

[Figure .  Flour beetles: observed death frequency plotted against the logarithm of the dose, for each sex.]

A dose-response model

What is the relation between the dose d and the probability pd that a beetle will die when given the insecticide in dose d? If we knew almost everything about this very insecticide and its effects on the beetle organism, we might be able to deduce a mathematical formula for the relation between d and pd. However, the designer of statistical models will approach the problem in a much more pragmatic and down-to-earth manner, as we shall see.

In the actual study the set of dose values seems to be rather odd, 0.20, 0.32, 0.50 and 0.80, but on closer inspection you will see that it is (almost) a geometric progression: the ratio between one value and the next is approximately constant, namely 1.6. The statistician will take this as an indication that dose should be measured on a logarithmic scale, so we should try to model the relation between ln(d) and pd. Hence, we make another plot, Figure ..

The simplest relationship between two numerical quantities is probably a linear (or affine) relation, but it is usually a bad idea to suggest a linear relation between ln(d) and pd (that is, pd = α + β lnd for suitable values of the constants α and β), since this would be incompatible with the requirement that probabilities


[Figure .  Flour beetles: logit of observed death frequency plotted against logarithm of dose, for each sex.]

must always be numbers between 0 and 1. Often, one would now convert the probabilities to a new scale and claim a linear relation between 'probabilities on the new scale' and the logarithm of the dose. A frequently used conversion function is the logit function.

Side note (the logit function): The function

   logit(p) = ln(p/(1 − p))

is a bijection from ]0, 1[ to the real axis R. Its inverse function is p = exp(z)/(1 + exp(z)). If p is the probability of a given event, then p/(1 − p) is the ratio between the probability of the event and the probability of the opposite event, the so-called odds of the event. Hence, the logit function calculates the logarithm of the odds.

We will therefore suggest/claim the following often used model for the relation between dose and probability of dying: for each sex, logit(pd) is a linear (or more correctly: an affine) function of x = lnd, that is, for suitable values of the constants αM, βM and αF, βF,

   logit(pdM) = αM + βM lnd
   logit(pdF) = αF + βF lnd.

This is a logistic regression model.

In Figure . the logit of the death frequencies is plotted against the logarithm of the dose; if the model is applicable, then each set of points should lie approximately on a straight line, and that seems to be the case, although a closer examination is needed in order to determine whether the model actually gives an adequate description of the data.

In the subsequent sections we shall see how to estimate the unknown parameters, how to check the adequacy of the model, and how to compare the effects of the poison on male and female beetles.

Estimation

The likelihood function L0 in the logistic model is obtained from the basic likelihood function by considering the ps as functions of the αs and βs. This gives the log-likelihood function


[Figure .  Flour beetles: logit of observed death frequency plotted against logarithm of dose, for each sex, and the estimated regression lines.]

   lnL0(αM, αF, βM, βF)
     = const + ∑_k ∑_d ydk (αk + βk lnd) + ∑_k ∑_d ndk ln(1 − pdk)
     = const + ∑_k ( αk yk· + βk ∑_d ydk lnd + ∑_d ndk ln(1 − pdk) ),

where yk· = ∑_d ydk; the expression thus consists of a contribution from the male beetles plus a contribution from the female beetles. The parameter space is R⁴.

To find the stationary point (which hopefully is a point of maximum) we equate each of the four partial derivatives of lnL0 to zero; this leads (after some calculations) to the four equations

   ∑_d ydM = ∑_d ndM pdM,        ∑_d ydF = ∑_d ndF pdF,
   ∑_d ydM lnd = ∑_d ndM pdM lnd,   ∑_d ydF lnd = ∑_d ndF pdF lnd,

where pdk = exp(αk + βk lnd) / (1 + exp(αk + βk lnd)). It is not possible to write down an explicit solution to these equations, so we have to accept a numerical solution.

Standard statistical computer software will easily find the estimates in the present example: α̂M = 4.27 (with a standard deviation of 0.53) and β̂M = 3.14 (standard deviation 0.39), and α̂F = 2.56 (standard deviation 0.38) and β̂F = 2.58 (standard deviation 0.30).

We can add the estimated regression lines to the plot in Figure .; this results in Figure ..
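As an illustration of how such a numerical solution can be obtained (an addition to the notes, not part of the original), here is a Python sketch that maximises the logistic log-likelihood with scipy. The data arrays are the group totals and numbers of dead quoted above, under the assumption that they are listed male/female by increasing dose; the printed estimates should then be close to the ones quoted in the text.

    import numpy as np
    from scipy.optimize import minimize

    logdose = np.log(np.array([0.20, 0.32, 0.50, 0.80]))
    n = np.array([[144, 69, 54, 50],    # males, assumed ordering by increasing dose
                  [152, 81, 44, 47]])   # females
    y = np.array([[43, 50, 47, 48],
                  [26, 34, 27, 43]])    # numbers of dead

    def negloglik(theta):
        # theta = (alpha_M, beta_M, alpha_F, beta_F)
        alpha = np.array([theta[0], theta[2]])[:, None]
        beta  = np.array([theta[1], theta[3]])[:, None]
        eta = alpha + beta * logdose                 # logit(p) for each sex and dose
        # binomial log-likelihood up to a constant: y*eta - n*ln(1 + exp(eta))
        return -(y * eta - n * np.log1p(np.exp(eta))).sum()

    fit = minimize(negloglik, x0=np.zeros(4), method="BFGS")
    print(fit.x)   # should be close to (4.27, 3.14, 2.56, 2.58)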

Model validation

We have estimated the parameters in the model where

   logit(pdk) = αk + βk lnd


or

   pdk = exp(αk + βk lnd) / (1 + exp(αk + βk lnd)).

An obvious thing to do next is to plot the graphs of the two functions

   x ↦ exp(α̂M + β̂M x) / (1 + exp(α̂M + β̂M x))   and   x ↦ exp(α̂F + β̂F x) / (1 + exp(α̂F + β̂F x))

into Figure ., giving Figure .. This shows that our model is not entirely erroneous. We can also perform a numerical test based on the likelihood ratio test statistic

   Q = L(α̂M, α̂F, β̂M, β̂F) / Lmax   (.)

where Lmax denotes the maximum value of the basic likelihood function (given on page ). Using the notation p̂dk = logit⁻¹(α̂k + β̂k lnd) and ŷdk = ndk p̂dk, we get

   Q = [∏_k ∏_d (ndk choose ydk) p̂dk^ydk (1 − p̂dk)^(ndk − ydk)] / [∏_k ∏_d (ndk choose ydk) (ydk/ndk)^ydk (1 − ydk/ndk)^(ndk − ydk)]
     = ∏_k ∏_d (ŷdk/ydk)^ydk ((ndk − ŷdk)/(ndk − ydk))^(ndk − ydk)

and

   −2lnQ = 2 ∑_k ∑_d ( ydk ln(ydk/ŷdk) + (ndk − ydk) ln((ndk − ydk)/(ndk − ŷdk)) ).

Large values of −2lnQ (or small values of Q) mean a large discrepancy between the observed numbers (ydk and ndk − ydk) and the predicted numbers (ŷdk and ndk − ŷdk), thus indicating that the model is inadequate. In the present example we get −2lnQ_obs = 3.36, and the corresponding test probability, i.e. the probability (when the model is correct) of getting values of −2lnQ that are larger than −2lnQ_obs, can be found approximately using the fact that −2lnQ is asymptotically χ² distributed with 8 − 4 = 4 degrees of freedom; the (approximate) test probability is therefore about 0.50. (The number of degrees of freedom is the number of free parameters in the basic model minus the number of free parameters in the model tested.)

We can therefore conclude that the model seems to be adequate.


[Figure .  Flour beetles: the two estimated dose-response curves, and the observed death frequencies.]

Hypotheses about the parameters

The current model with four parameters seems to be quite good, but maybe it can be simplified. We could for example test whether the two curves are parallel, and if they are parallel, we could test whether they are identical. Consider therefore these two statistical hypotheses:

1. The hypothesis of parallel curves, H1 : βM = βF, or in more detail: for suitable values of the constants αM, αF and β,

      logit(pdM) = αM + β lnd   and   logit(pdF) = αF + β lnd

   for all d.
2. The hypothesis of identical curves, H2 : αM = αF and βM = βF, or in more detail: for suitable values of the constants α and β,

      logit(pdM) = α + β lnd   and   logit(pdF) = α + β lnd

   for all d.

The hypothesis H1 of parallel curves is tested as the first one. The maximum likelihood estimates are α̃M = 3.84 (with a standard deviation of 0.34), α̃F = 2.83 (standard deviation 0.31) and β̃ = 2.81 (standard deviation 0.24). We use the standard likelihood ratio test and compare the maximal likelihood under H1 to the maximal likelihood under the current model:

   −2lnQ = −2 ln( L(α̃M, α̃F, β̃, β̃) / L(α̂M, α̂F, β̂M, β̂F) )
         = 2 ∑_k ∑_d ( ydk ln(ŷdk/ỹdk) + (ndk − ydk) ln((ndk − ŷdk)/(ndk − ỹdk)) )


[Figure .  Flour beetles: the final model.]

where ỹdk = ndk exp(α̃k + β̃ lnd) / (1 + exp(α̃k + β̃ lnd)) are the fitted values under H1. We find that −2lnQ_obs = 1.31, to be compared to the χ² distribution with 4 − 3 = 1 degree of freedom (the change in the number of free parameters). The test probability (i.e. the probability of getting −2lnQ values larger than 1.31) is about 25%, so the value −2lnQ_obs = 1.31 is not significantly large. Thus the model with parallel curves is not significantly inferior to the current model.

Having accepted the hypothesis H1, we can then test the hypothesis H2 of identical curves. (If H1 had been rejected, we would not test H2.) Testing H2 against H1 gives −2lnQ_obs = 27.50, to be compared to the χ² distribution with 3 − 2 = 1 degree of freedom; the probability of values larger than 27.50 is extremely small, so the model with identical curves is very much inferior to the model with parallel curves. The hypothesis H2 is rejected.

The conclusion is therefore that the relation between the dose d and the probability p of dying can be described using a model in which logit(pd) is a linear function of lnd for each sex; the two curves are parallel but not identical. The estimated curves are

   logit(pdM) = 3.84 + 2.81 lnd   and   logit(pdF) = 2.83 + 2.81 lnd,

corresponding to

   pdM = exp(3.84 + 2.81 lnd) / (1 + exp(3.84 + 2.81 lnd))   and   pdF = exp(2.83 + 2.81 lnd) / (1 + exp(2.83 + 2.81 lnd)).

Figure . illustrates the situation.
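The model comparisons above can also be reproduced numerically (an addition to the notes). The sketch below continues the earlier code fragment: it assumes the arrays y, n, the import of minimize and the function negloglik from that sketch, fits the four-, three- and two-parameter models, and computes the three −2lnQ statistics together with their approximate χ² test probabilities.

    from scipy import stats

    def min_neg2loglik(param_map, k):
        """Minimise -2 lnL over a k-parameter model; param_map maps the k free
        parameters to (alpha_M, beta_M, alpha_F, beta_F)."""
        res = minimize(lambda th: negloglik(param_map(th)), x0=np.zeros(k), method="BFGS")
        return 2 * res.fun

    m4 = min_neg2loglik(lambda th: th, 4)                              # two separate lines
    m3 = min_neg2loglik(lambda th: [th[0], th[2], th[1], th[2]], 3)    # parallel lines
    m2 = min_neg2loglik(lambda th: [th[0], th[1], th[0], th[1]], 2)    # identical lines

    # saturated model: p = y/n in every cell (same constant omitted everywhere)
    p_obs = y / n
    sat = -2 * (y * np.log(p_obs) + (n - y) * np.log(1 - p_obs)).sum()

    for dev, df in [(m4 - sat, 4), (m3 - m4, 1), (m2 - m3, 1)]:
        print(dev, stats.chi2.sf(dev, df))   # should be roughly 3.36, 1.31 and 27.5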

. Lung cancer in Fredericia

This is an example of a so-called multiplicative Poisson model. It also turns out to be an example of a modelling situation where small changes in the way the


[Table .  Number of lung cancer cases in six age groups and four cities (Fredericia, Horsens, Kolding, Vejle), from Andersen ().]

model is analysed lead to quite contradictory conclusions.

The situation

In the mid s there was some debate about whether the citizens of the Danish city Fredericia had a higher risk of getting lung cancer than citizens in the three comparable neighbouring cities Horsens, Kolding and Vejle, since Fredericia, as opposed to the three other cities, had a considerable amount of polluting industries located in the mid-town. To shed light on the problem, data were collected about lung cancer incidence in the four cities in the years  to .

Lung cancer is often the long-term result of a tiny daily impact of harmfulsubstances. An increased lung cancer risk in Fredericia might therefore result inlung cancer patients being younger in Fredericia than in the three control cities;and in any case, lung cancer incidence is generally known to vary with age.Therefore we need to know, for each city, the number of cancer cases in differentage groups, see Table ., and we need to know the number of inhabitants inthe same groups, see Table ..

Now the statistician’s job is to find a way to describe the data in Table . bymeans of a statistical model that gives an appropriate description of the risk oflung cancer for a person in a given age group and in a given city. Furthermore, itwould certainly be expedient if we could separate out three or four parametersthat could be said to describe a “city effect” (i.e. differences between cities) afterhaving allowed for differences between age groups.

Specification of the model

We are not going to model the variation in the number of inhabitants in thevarious cities and age groups (Table .), so these numbers are assumed to beknown constants. The numbers in Table ., the number of lung cancer cases,are the numbers that will be assumed to be observed values of random variables,and the statistical model should specify the joint distribution of these random


[Table .  Number of inhabitants in six age groups and four cities (from Andersen, ).]

variables. Let us introduce some notation:

   yij = number of cases in age group i in city j,
   rij = number of persons in age group i in city j,

where i = 1, 2, ..., 6 numbers the age groups, and j = 1, 2, 3, 4 numbers the cities. The observations yij are considered to be observed values of random variables Yij.

Which distribution for the Y s should we use? At first one might suggest thatYij could be a binomial random variable with some small p and with n = rij ;since the probability of getting lung cancer fortunately is rather small, anothersuggestion would be, with reference to Theorem . on page , that Yij followsa Poisson distribution with mean µij . Let us write µij as µij = λijrij (recall that rijis a known number), then the intensity λij has an interpretation as “number oflung cancer cases per person in age group i in city j during the four year period”,so λ is the age- and city-specific cancer incidence. Furthermore we will assumeindependence between the Yijs (lung cancer is believed to be non-contagious).Hence our basic model is:

The Yijs are independent Poisson random variables with EYij = λij rij, where the λijs are unknown positive real numbers.

The parameters of the basic model are estimated in the straightforward manner as λ̂ij = yij/rij, so that for example the estimate of the intensity λ21 for the second age group in Fredericia is 11/800 = 0.014 (there are 0.014 cases per person per four years).

Now, the idea was that we should be able to compare the cities after having allowed for differences between age groups, but the basic model does not in any obvious way allow such comparisons. We will therefore specify and test a hypothesis stating that we can make do with a slightly simpler model (i.e. a model with fewer parameters) that does allow such comparisons, namely a model in which λij is


written as a product αiβj of an age effect αi and a city effect βj. If so, we are in a good position, because then we can compare the cities simply by comparing the city parameters βj. Therefore our first statistical hypothesis will be

H0 : λij = αiβj

where α1, α2, α3, α4, α5, α6, β1, β2, β3, β4 are unknown parameters. More specifically, the hypothesis claims that there exist values of the unknown parameters α1, ..., α6, β1, ..., β4 ∈ R+ such that the lung cancer incidence λij in city j and age group i is λij = αiβj. The model specified by H0 is a multiplicative model since city parameters and age parameters enter multiplicatively.

A detail pertaining to the parametrisation

It is a special feature of the parametrisation under H0 that it is not injective: the mapping

   (α1, α2, α3, α4, α5, α6, β1, β2, β3, β4) ↦ (αiβj)_{i=1,...,6; j=1,...,4}

from R+^10 to R^24 is not injective; in fact the range of the map is a 9-dimensional surface. (It is easily seen that αiβj = αi*βj* for all i and j, if and only if there exists a c > 0 such that αi = cαi* for all i and βj* = cβj for all j.)

Therefore we must impose one constraint on the parameters in order to obtain an injective parametrisation; examples of such a constraint are: α1 = 1, or α1 + α2 + ··· + α6 = 1, or α1α2···α6 = 1, or a similar constraint on the βs, etc. In the actual example we will choose the constraint β1 = 1, that is, we fix the Fredericia parameter to the value 1. Henceforth the model H0 has only nine independent parameters.

Estimation in the multiplicative model

In the multiplicative model you cannot write down explicit expressions for the estimates, but most statistical software will, after a few instructions, calculate the actual values for any given set of data. The likelihood function of the basic model is

   L = ∏_{i=1}^6 ∏_{j=1}^4 ((λij rij)^yij / yij!) exp(−λij rij) = const · ∏_{i=1}^6 ∏_{j=1}^4 λij^yij exp(−λij rij)

as a function of the λijs. As noted earlier, this function attains its maximum when λij = yij/rij for all i and j.


If we substitute αiβj for λij in the expression for L, we obtain the likelihood function L0 under H0:

   L0 = const · ∏_{i=1}^6 ∏_{j=1}^4 αi^yij βj^yij exp(−αiβj rij)
      = const · (∏_{i=1}^6 αi^yi·) (∏_{j=1}^4 βj^y·j) exp(− ∑_{i=1}^6 ∑_{j=1}^4 αiβj rij),

where yi· = ∑_j yij and y·j = ∑_i yij denote the row and column sums of the yij,

and the log-likelihood function is

   lnL0 = const + ∑_{i=1}^6 yi· lnαi + ∑_{j=1}^4 y·j lnβj − ∑_{i=1}^6 ∑_{j=1}^4 αiβj rij.

We are seeking the set of parameter values (α̂1, α̂2, α̂3, α̂4, α̂5, α̂6, 1, β̂2, β̂3, β̂4) that maximises lnL0. The partial derivatives are

   ∂lnL0/∂αi = yi·/αi − ∑_j βj rij,   i = 1, 2, ..., 6,
   ∂lnL0/∂βj = y·j/βj − ∑_i αi rij,   j = 1, 2, 3, 4,

and these partial derivatives are equal to zero if and only if

   yi· = ∑_j αiβj rij for all i   and   y·j = ∑_i αiβj rij for all j.

We note for later use that this implies that

   y·· = ∑_i ∑_j α̂iβ̂j rij.   (.)

The maximum likelihood estimates in the multiplicative model are found to be

   α̂1 = 0.0036    β̂1 = 1
   α̂2 = 0.0108    β̂2 = 0.719
   α̂3 = 0.0164    β̂3 = 0.690
   α̂4 = 0.0210    β̂4 = 0.762
   α̂5 = 0.0229
   α̂6 = 0.0148
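Numerically, the likelihood equations can be solved by a simple fixed-point iteration that alternates between the two sets of equations. The sketch below (an addition to the notes) shows one way to do this in Python; the arrays cases and persons are hypothetical placeholders, since the numerical tables are not reproduced above, and must be filled with the 6 × 4 tables of yij and rij.

    import numpy as np

    # hypothetical placeholders: 6 age groups x 4 cities (Fredericia first)
    cases   = np.ones((6, 4))     # y_ij, to be replaced by the values of Table .
    persons = np.ones((6, 4))     # r_ij, to be replaced by the values of Table .

    alpha = cases.sum(axis=1) / persons.sum(axis=1)   # starting values
    beta  = np.ones(4)

    for _ in range(200):
        # y_i. = sum_j alpha_i beta_j r_ij  and  y_.j = sum_i alpha_i beta_j r_ij
        alpha = cases.sum(axis=1) / (persons * beta).sum(axis=1)
        beta  = cases.sum(axis=0) / (persons * alpha[:, None]).sum(axis=0)
        beta  = beta / beta[0]                        # impose the constraint beta_1 = 1
        alpha = cases.sum(axis=1) / (persons * beta).sum(axis=1)

    expected = alpha[:, None] * beta[None, :] * persons     # fitted values y-hat_ij
    minus2lnQ = 2 * (cases * np.log(cases / expected)).sum()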


[Table .  Estimated age- and city-specific lung cancer intensities, assuming a multiplicative Poisson model; the values are numbers of cases per 1000 inhabitants per four years.]

How does the multiplicative model describe the data?

We are going to investigate how well the multiplicative model describes the actual data. This can be done by testing the multiplicative hypothesis H0 against the basic model, using the standard log likelihood ratio test statistic

   −2lnQ = −2 ln( L0(α̂1, α̂2, α̂3, α̂4, α̂5, α̂6, 1, β̂2, β̂3, β̂4) / L(λ̂11, λ̂12, ..., λ̂63, λ̂64) ).

Large values of −2lnQ are significant, i.e. indicate that H0 does not provide an adequate description of the data. To tell whether −2lnQ_obs is significant, we can use the test probability ε = P0(−2lnQ ≥ −2lnQ_obs), that is, the probability of getting a larger value of −2lnQ than the one actually observed, provided that the true model is H0. When H0 is true, −2lnQ is approximately a χ² variable with f = 24 − 9 = 15 degrees of freedom (provided that all the expected values are 5 or above), so that the test probability can be found approximately as ε = P(χ²15 ≥ −2lnQ_obs).

Standard calculations (including an application of (.)) lead to

   Q = ∏_{i=1}^6 ∏_{j=1}^4 (ŷij/yij)^yij,

and thus

   −2lnQ = 2 ∑_{i=1}^6 ∑_{j=1}^4 yij ln(yij/ŷij),

where ŷij = α̂iβ̂j rij is the expected number of cases in age group i in city j.

The calculated values 1000 α̂iβ̂j, the expected numbers of cases per 1000 inhabitants, can be found in Table ., and the actual value of −2lnQ is −2lnQ_obs = 22.6. In the χ² distribution with f = 24 − 9 = 15 degrees of freedom the 90% quantile is 22.3 and the 95% quantile is 25.0; the value −2lnQ_obs = 22.6 therefore corresponds to a test probability above 5%, so there seems to be no serious evidence against the multiplicative model. Hence we will assume multiplicativity, that is, the lung cancer risk depends multiplicatively on age group and city.


[Table .  Expected numbers of lung cancer cases ŷij under the multiplicative Poisson model.]

We have arrived at a statistical model that describes the data using a numberof city-parameters (the βs) and a number of age-parameters (the αs), but withno interaction between cities and age groups. This means that any differencebetween the cities is the same for all age groups, and any difference between theage groups is the same in all cities. In particular, a comparison of the cities canbe based entirely on the βs.

Identical cities?

The purpose of all this is to see whether there is a significant difference between the cities. If there is no difference, then the city parameters are equal, β1 = β2 = β3 = β4, and since β1 = 1, the common value is necessarily 1. So we shall test the statistical hypothesis

H1 : β1 = β2 = β3 = β4 = 1.

The hypothesis has to be tested against the current basic model H0, leading to the test statistic

   −2lnQ = −2 ln( L1(α̃1, α̃2, ..., α̃6) / L0(α̂1, α̂2, ..., α̂6, 1, β̂2, β̂3, β̂4) )

where L1(α1, α2, ..., α6) = L0(α1, α2, ..., α6, 1, 1, 1, 1) is the likelihood function under H1, and α̃1, α̃2, ..., α̃6 are the estimates under H1, that is, (α̃1, α̃2, ..., α̃6) must be the point of maximum of L1.

The function L1 is a product of six functions, each a function of a single α:

   L1(α1, α2, ..., α6) = const · ∏_{i=1}^6 ∏_{j=1}^4 αi^yij exp(−αi rij) = const · ∏_{i=1}^6 αi^yi· exp(−αi ri·).


[Table .  Expected numbers of lung cancer cases ỹij under the hypothesis H1 of no difference between cities.]

Table .Expected number oflung cancer cases yijunder the hypothesisH1 of no differencebetween cities.

Therefore, the maximum likelihood estimates are αi =yi ri

, i = 1,2, . . . ,6, and the

actual values are

α1 = 33/11600 = 0.002845α2 = 32/3811 = 0.00840α3 = 43/3367 = 0.0128

α4 = 45/2748 = 0.0164α5 = 40/2217 = 0.0180α6 = 31/2665 = 0.0116.

Standard calculations show that

   Q = L1(α̃1, α̃2, α̃3, α̃4, α̃5, α̃6) / L0(α̂1, α̂2, α̂3, α̂4, α̂5, α̂6, 1, β̂2, β̂3, β̂4) = ∏_{i=1}^6 ∏_{j=1}^4 (ỹij/ŷij)^yij

and hence

   −2lnQ = 2 ∑_{i=1}^6 ∑_{j=1}^4 yij ln(ŷij/ỹij);

here ỹij = α̃i rij, and ŷij = α̂iβ̂j rij as earlier. The expected numbers ỹij are shown in Table ..

As always, large values of −2lnQ are significant. Under H1 the distribution of −2lnQ is approximately a χ² distribution with f = 9 − 6 = 3 degrees of freedom.

We get −2lnQ_obs = 5.67. In the χ² distribution with f = 9 − 6 = 3 degrees of freedom, the 80% quantile is 4.64 and the 90% quantile is 6.25, so the test probability is almost 20%, and therefore there seems to be good agreement between the data and the hypothesis H1 of no difference between cities. Put another way, there is no significant difference between the cities.

Another approach

Only in rare instances will there be a definite course of action to be followed in a statistical analysis of a given problem. In the present case, we have investigated the question of an increased lung cancer risk in Fredericia by testing the


hypothesis H1 of identical city parameters. It turned out that we could accept H1, that is, there seems to be no significant difference between the cities.

But we could choose to investigate the problem in a different way. One might argue that since the question is whether Fredericia has a higher risk than the other cities, then it seems to be implied that the three other cities are more or less identical, an assumption which can be tested. We should therefore adopt this strategy for testing hypotheses:

1. The basic model is still the multiplicative Poisson model H0.
2. Initially we test whether the three cities Horsens, Kolding and Vejle are identical, i.e. we test the hypothesis

      H2 : β2 = β3 = β4.

3. If H2 is accepted, it means that there is a common level β for the three 'control cities'. We can then compare Fredericia to this common level by testing whether β1 = β. Since β1 per definition equals 1, the hypothesis to test is

      H3 : β = 1.

Comparing the three control cities

We are going to test the hypothesis H2 : β2 = β3 = β4 of identical control cities relative to the multiplicative model H0. This is done using the test statistic

   Q = L2(ᾱ1, ᾱ2, ..., ᾱ6, β̄) / L0(α̂1, α̂2, ..., α̂6, 1, β̂2, β̂3, β̂4)

where L2(α1, α2, ..., α6, β) = L0(α1, α2, ..., α6, 1, β, β, β) is the likelihood function under H2, and ᾱ1, ᾱ2, ..., ᾱ6, β̄ are the maximum likelihood estimates under H2.

The distribution of −2lnQ under H2 can be approximated by a χ² distribution with f = 9 − 7 = 2 degrees of freedom.

The model H2 is a multiplicative Poisson model similar to the previous one, now with two cities (Fredericia and the others) and six age groups, and therefore estimation in this model does not raise any new problems. We find that

   ᾱ1 = 0.00358      ᾱ4 = 0.0210      β̄1 = 1
   ᾱ2 = 0.0108       ᾱ5 = 0.0230      β̄  = 0.7220
   ᾱ3 = 0.0164       ᾱ6 = 0.0148

Furthermore,

   −2lnQ = 2 ∑_{i=1}^6 ∑_{j=1}^4 yij ln(ŷij/ȳij)


[Table .  Expected numbers of lung cancer cases ȳij under H2.]

Table .Expected number oflung cancer cases yijunder H2.

where yij = αi βjrij , see Table ., and

yi1 = αi ri1yij = αi β rij , j = 2,3,4.

The expected numbers yij are shown in Table .. We find that −2lnQobs = 0.40,to be compared to the χ2 distribution with f = 9− 7 = 2 degrees of freedom. Inthis distribution the 20% quantile is 0.446, which means that the test probabilityis well above 80%, so H2 is perfectly compatible with the available data, and wecan easily assume that there is no significant difference between the three cities.

Then we can testH3, the model with six age groups with parameters α1,α2, . . . ,α6and with identical cities. AssumingH2,H3 is identical to the previous hypothesisH1, and hence the estimates of the age parameters are α1,α2, . . . ,α6 on page .

This time we have to test H3 (= H1) relative to the current basic model H2. The test statistic is −2lnQ where

   Q = L1(α̃1, α̃2, ..., α̃6) / L2(ᾱ1, ᾱ2, ..., ᾱ6, β̄) = L0(α̃1, α̃2, ..., α̃6, 1, 1, 1, 1) / L0(ᾱ1, ᾱ2, ..., ᾱ6, 1, β̄, β̄, β̄),

which can be rewritten as

   Q = ∏_{i=1}^6 ∏_{j=1}^4 (ỹij/ȳij)^yij,

so that

   −2lnQ = 2 ∑_{i=1}^6 ∑_{j=1}^4 yij ln(ȳij/ỹij).

Large values of −2lnQ are significant. The distribution of −2lnQ under H3 can be approximated by a χ² distribution with f = 7 − 6 = 1 degree of freedom (provided that all the expected numbers are at least 5).


Overview of approach no. 1:

   Model / Hypothesis                                   −2lnQ    f             ε
   M: arbitrary parameters; H: multiplicativity         22.65    24 − 9 = 15   a good 5%
   M: multiplicativity; H: four identical cities         5.67    9 − 6 = 3     about 20%

Overview of approach no. 2:

   Model / Hypothesis                                   −2lnQ    f             ε
   M: arbitrary parameters; H: multiplicativity         22.65    24 − 9 = 15   a good 5%
   M: multiplicativity; H: identical control cities      0.40    9 − 7 = 2     a good 80%
   M: identical control cities; H: four identical cities 5.27    7 − 6 = 1     about 2%

Using values from Tables ., . and . we find that −2lnQ_obs = 5.27. In the χ² distribution with 1 degree of freedom the 97.5% quantile is 5.02 and the 99% quantile is 6.63, so the test probability is about 2%. On this basis the usual practice is to reject the hypothesis H3 (= H1). So the conclusion is that as regards lung cancer risk there is no significant difference between the three cities Horsens, Kolding and Vejle, whereas Fredericia has a lung cancer risk that is significantly different from those three cities.

The lung cancer incidence of the three control cities relative to that of Fredericia is estimated as β̄ = 0.7, so our conclusion can be sharpened: the lung cancer incidence of Fredericia is significantly larger than that of the three other cities.

Now we have reached a nice and clear conclusion, which unfortunately is quite contrary to the previous conclusion on page !

Comparing the two approaches

We have been using two just slightly different approaches, which nevertheless lead to quite opposite conclusions. Both approaches follow this scheme:

1. Decide on a suitable basic model.
2. State a simplifying hypothesis.
3. Test the hypothesis relative to the current basic model.
4. a) If the hypothesis can be accepted, then we have a new basic model, namely the previous basic model with the simplifications implied by the hypothesis. Continue from step 2.
   b) If the hypothesis is rejected, then keep the current basic model and exit.


The two approaches take their starting point in the same Poisson model; they differ only in the choice of hypotheses (in item 2 above). The first approach goes from the multiplicative model to 'four identical cities' in a single step, giving a test statistic of 5.67, and with 3 degrees of freedom this is not significant. The second approach uses two steps, 'multiplicativity → three identical' and 'three identical → four identical', and it turns out that the 5.67 with 3 degrees of freedom is divided into 0.40 with 2 degrees of freedom and 5.27 with 1 degree of freedom, and the latter is highly significant.

Sometimes it may be desirable to do such stepwise testing, but you should never just blindly state and test as many hypotheses as possible. Instead you should consider only hypotheses that are reasonable within the actual context.

About test statistics

You may have noted some common traits of the −2lnQ statistics occurring in Sections . and .: they all seem to be of the form

   −2lnQ = 2 ∑ obs. number · ln( expected number in current basic model / expected number under hypothesis ),

and they all have an approximate χ² distribution with a number of degrees of freedom which is found as 'number of free parameters in the current basic model' minus 'number of free parameters under the hypothesis'. This is in fact a general result about testing hypotheses in Poisson models (covering only such hypotheses where the sum of the expected numbers equals the sum of the observed numbers).
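As a small illustration of this general recipe (an addition to the notes), the helper below computes −2lnQ and the approximate χ² test probability from two tables of expected numbers; the function name and arguments are of course just one possible choice.

    import numpy as np
    from scipy import stats

    def poisson_lr_test(observed, expected_basic, expected_hypothesis, df):
        """-2lnQ = 2 * sum(obs * ln(expected_basic / expected_hypothesis)),
        compared to a chi-squared distribution with df degrees of freedom."""
        observed = np.asarray(observed, dtype=float)
        minus2lnQ = 2 * np.sum(observed * np.log(np.asarray(expected_basic) /
                                                 np.asarray(expected_hypothesis)))
        return minus2lnQ, stats.chi2.sf(minus2lnQ, df)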

. Accidents in a weapon factory

This example seems to be a typical case for a Poisson model, but it turns out that it does not give an adequate fit. So we have to find another model.

The situation

The data is the number of accidents in a period of five weeks for each woman working on -inch HE shells (high-explosive shells) at a weapon factory in England during World War I. Table . gives the distribution of n = 647 women by number of accidents over the period in question. We are looking for a statistical model that can describe this data set. (The example originates from Greenwood and Yule () and is known through its occurrence in Hald (, ) (English version: Hald ()), which for decades was a leading statistics textbook, at least in Denmark.)


[Table .  Distribution of the 647 women by number y of accidents over a period of five weeks (from Greenwood and Yule ()).]

[Table .  Model 1: table and graph showing the observed numbers fy (black bars) and the expected numbers f̂y (white bars).]

Let yi be the number of accidents that befell woman no. i, and let fy denote the number of women with exactly y accidents, so that in the present case f0 = 447, f1 = 132, etc. The total number of accidents then is

   0·f0 + 1·f1 + 2·f2 + ... = ∑_{y=0}^∞ y fy = 301.

We will assume that yi is an observed value of a random variable Yi, i = 1, 2, ..., n, and we will assume the random variables Y1, Y2, ..., Yn to be independent, even if this may be a questionable assumption.

The first model

The first model to try is the model stating that Y1, Y2, ..., Yn are independent and identically Poisson distributed with a common parameter µ. This model is plausible if we assume the accidents to happen 'entirely at random', and the parameter µ then describes the women's 'accident proneness'.

The Poisson parameter µ is estimated as µ̂ = ȳ = 301/647 = 0.465 (there have been 0.465 accidents per woman per five weeks). The expected number of women hit by y accidents is

   f̂y = n (µ̂^y / y!) exp(−µ̂);

the actual values are shown in Table .. The agreement between the observed and the expected numbers is not very impressive. Further evidence of the inadequacy of the model is obtained by calculating the central variance estimate s² = (1/(n−1)) ∑ (yi − ȳ)² = 0.692, which is almost 50% larger than ȳ; in the Poisson distribution, however, the mean and the variance are equal. Therefore we should probably consider another model.
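A quick way to reproduce this comparison (an addition to the notes) is to compute the Poisson expected counts and the mean/variance ratio directly; the figures below are the summary values quoted in the text.

    import math

    n, total = 647, 301
    mu_hat = total / n                      # 0.465 accidents per woman per five weeks
    s2 = 0.692                              # central variance estimate quoted in the text

    print(s2 / mu_hat)                      # about 1.49: variance roughly 50% above the mean

    # expected number of women with y accidents under the Poisson model
    def f_hat(y):
        return n * mu_hat**y / math.factorial(y) * math.exp(-mu_hat)

    print([round(f_hat(y), 1) for y in range(6)])   # compare with the observed f_y in Table .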

The second model

Model 1 can be extended in the following way:

• We still assume Y1, Y2, ..., Yn to be independent Poisson variables, but this time each Yi is allowed to have its own mean value, so that Yi now is a Poisson random variable with parameter µi, for all i = 1, 2, ..., n.
  If the modelling process came to a halt at this stage, there would be one parameter per person, allowing a perfect fit (with µ̂i = yi, i = 1, 2, ..., n), but that would certainly contradict Fisher's maxim, that 'the object of statistical methods is the reduction of data' (cf. page ). But there is a further step in the modelling process:
• We will further assume that µ1, µ2, ..., µn are independent observations from a single probability distribution, a continuous distribution on the positive real numbers. It turns out that in this context a gamma distribution is a convenient choice, so assume that the µs come from a gamma distribution with shape parameter κ and scale parameter β, that is, with density function

      g(µ) = (1/(Γ(κ) β^κ)) µ^(κ−1) exp(−µ/β),   µ > 0.

• The conditional probability, for a given value of µ, that a woman is hit by exactly y accidents is (µ^y/y!) exp(−µ), and the unconditional probability is then obtained by mixing the conditional probabilities with respect to the distribution of µ:

      P(Y = y) = ∫₀^∞ (µ^y/y!) exp(−µ) g(µ) dµ
               = ∫₀^∞ (µ^y/y!) exp(−µ) (1/(Γ(κ) β^κ)) µ^(κ−1) exp(−µ/β) dµ
               = (Γ(y+κ)/(y! Γ(κ))) (1/(β+1))^κ (β/(β+1))^y,


[Table .  Model 2: table and graph of the observed numbers fy (black bars) and expected numbers f̃y (white bars).]

where the last equality follows from the definition of the gamma function (use for example the formula at the end of the side note on page ). If κ is a natural number, Γ(κ) = (κ−1)!, and then

   (y+κ−1 choose y) = Γ(y+κ)/(y! Γ(κ)).

In this equation, the right-hand side is actually defined for all κ > 0, so we can see the equation as a way of defining the symbol on the left-hand side for all κ > 0 and all y ∈ {0, 1, 2, ...}. If we let p = 1/(β+1), the probability of y accidents can now be written as

   P(Y = y) = (y+κ−1 choose y) p^κ (1−p)^y,   y = 0, 1, 2, ...   (.)

It is seen that Y has a negative binomial distribution with shape parameter κ and probability parameter p = 1/(β+1) (cf. Definition . on page ).

The negative binomial distribution has two parameters, so one would expect Model 2 to give a better fit than the one-parameter Model 1.

In the new model we have (cf. Example . on page )

   E(Y) = κ(1−p)/p = κβ,
   Var(Y) = κ(1−p)/p² = κβ(β+1).

It is seen that the variance is (β+1) times larger than the mean. In the present data set the variance is indeed larger than the mean, so for the present, Model 2 cannot be rejected.

Estimation of the parameters in Model 2

As always we will be using the maximum likelihood estimator of the unknown parameters. The likelihood function is a product of probabilities of the form


(.):

   L(κ, p) = ∏_{i=1}^n (yi+κ−1 choose yi) p^κ (1−p)^yi
           = p^(nκ) (1−p)^(y1+y2+···+yn) ∏_{i=1}^n (yi+κ−1 choose yi)
           = const · p^(nκ) (1−p)^(y1+y2+···+yn) ∏_{k=1}^∞ (κ+k−1)^(f_k + f_{k+1} + f_{k+2} + ···),

where fk still denotes the number of observations equal to k. The logarithm of the likelihood function is then (except for a constant term)

   lnL(κ, p) = nκ lnp + (∑_i yi) ln(1−p) + ∑_{k=1}^∞ (∑_{j=k}^∞ fj) ln(κ+k−1),

and more innocuous-looking with the actual data entered:

   lnL(κ, p) = 647κ lnp + 301 ln(1−p)
             + 200 lnκ + 68 ln(κ+1) + 26 ln(κ+2) + 5 ln(κ+3) + 2 ln(κ+4).

We have to find the point(s) of maximum for this function. It is not clear whether an analytic solution can be found, but it is fairly simple to find a numerical solution, for example by using a simplex method (which does not use the derivatives of the function to be maximised). The simplex method is an iterative method and requires an initial value of (κ, p), say (κ̃, p̃). One way to determine such an initial value is to solve the equations

   theoretical mean = empirical mean,
   theoretical variance = empirical variance.

In the present case these equations become κβ = 0.465 and κβ(β+1) = 0.692, where β = (1−p)/p. The solution to these two equations is (κ̃, p̃) = (0.953, 0.672) (and hence β̃ = 0.488). These values can then be used as starting values in an iterative procedure that will lead to the point of maximum of the likelihood function, which is found to be

   κ̂ = 0.8651,   p̂ = 0.6503   (and hence β̂ = 0.5378).
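This maximisation is easy to reproduce (an addition to the notes): the sketch below maximises the log-likelihood written out above with scipy's Nelder-Mead simplex method, starting from the moment estimates.

    import numpy as np
    from scipy.optimize import minimize

    def negloglik(theta):
        kappa, p = theta
        if kappa <= 0 or not (0 < p < 1):
            return np.inf
        return -(647 * kappa * np.log(p) + 301 * np.log(1 - p)
                 + 200 * np.log(kappa) + 68 * np.log(kappa + 1) + 26 * np.log(kappa + 2)
                 + 5 * np.log(kappa + 3) + 2 * np.log(kappa + 4))

    fit = minimize(negloglik, x0=[0.953, 0.672], method="Nelder-Mead")
    kappa_hat, p_hat = fit.x
    print(kappa_hat, p_hat, (1 - p_hat) / p_hat)   # about 0.865, 0.650 and beta about 0.538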


Table . shows the corresponding expected numbers

   f̃y = n (y+κ̂−1 choose y) p̂^κ̂ (1−p̂)^y

calculated from the estimated negative binomial distribution. There seems to be very good agreement between the observed and the expected numbers, and we may therefore conclude that Model 2 gives an adequate description of the observations.

The Multivariate Normal Distribution

Linear normal models can, as we shall see in Chapter , be expressed in a very clear and elegant way using the language of linear algebra, and there seems to be a considerable advantage in discussing estimation and hypothesis testing in linear models in a linear algebra setting.

Initially, however, we have to clarify a few concepts relating to multivariate random variables (Section .), and to define the multivariate normal distribution, which turns out to be rather complicated (Section .). The reader may want to take a look at the short summary of notation and results from linear algebra (page ff).

. Multivariate random variables

An n-dimensional random variable X can be regarded as a set of n one-dimensional random variables, or as a single random vector (in the vector space V = Rn) whose coordinates relative to the standard coordinate system are n one-dimensional random variables. The mean value of the n-dimensional random variable X = (X1, X2, ..., Xn)′ is the vector EX = (EX1, EX2, ..., EXn)′, that is, the set of mean values of the coordinates of X; here we assume that all the one-dimensional random variables have a mean value. The variance of X is the symmetric and positive semidefinite n×n matrix VarX whose (i, j)th entry is Cov(Xi, Xj):

   VarX = [ VarX1         Cov(X1,X2)   ···   Cov(X1,Xn)
            Cov(X2,X1)    VarX2        ···   Cov(X2,Xn)
            ...           ...          ...   ...
            Cov(Xn,X1)    Cov(Xn,X2)   ···   VarXn      ]
        = E((X − EX)(X − EX)′),

if all the one-dimensional random variables have a variance.


Using these definitions the following proposition is easily shown:

Proposition .
Let X be an n-dimensional random variable that has a mean and a variance. If A is a linear map from Rn to Rp and b a constant vector in Rp [or A is a p×n matrix and b a constant p×1 matrix], then

   E(AX + b) = A(EX) + b   (.)
   Var(AX + b) = A(VarX)A′.   (.)

Example .: The multinomial distribution
The multinomial distribution (page ) is an example of a multivariate distribution. If X = (X1, X2, ..., Xr)′ is multinomial with parameters n and p = (p1, p2, ..., pr)′, then EX = (np1, np2, ..., npr)′ and

   VarX = [ np1(1−p1)   −np1p2      ···   −np1pr
            −np2p1      np2(1−p2)   ···   −np2pr
            ...         ...         ...   ...
            −nprp1      −nprp2      ···   npr(1−pr) ].

The Xs always add up to n, that is, AX = n where A is the linear map

   A : (x1, x2, ..., xr)′ ↦ [1 1 ... 1] (x1, x2, ..., xr)′ = x1 + x2 + ··· + xr.

Therefore Var(AX) = 0. Since Var(AX) = A(VarX)A′ (equation (.)), the variance matrix VarX is positive semidefinite (as always), but not positive definite.
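As a quick numerical check of these formulas (an addition to the notes), one can simulate multinomial vectors with numpy and compare the empirical mean and covariance matrix with np and the matrix above; the chosen n and p are of course arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, np.array([0.2, 0.3, 0.5])

    X = rng.multinomial(n, p, size=200_000)        # each row is one multinomial vector
    print(X.mean(axis=0))                          # approximately n*p
    print(np.cov(X, rowvar=False))                 # approximately n*(diag(p) - p p')
    print(n * (np.diag(p) - np.outer(p, p)))       # the theoretical variance matrix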

. Definition and properties

In this section we shall define the multivariate normal distribution and provethe multivariate version of Proposition . on page , namely that if X isn-dimensional normal with parameters µ and Σ, and if A is a p ×n matrix andb a p × 1 matrix, then AX +b is p-dimensional normal with parameters Aµ+band AΣA′. Unfortunately, it is not so easy to define the n-dimensional normaldistribution with parameters µ and Σ when Σ is singular. We proceed step bystep:

Definition .: n-dimensional standard normal distribution
The n-dimensional standard normal distribution is the n-dimensional continuous distribution with density function

   f(x) = (1/(2π)^(n/2)) exp(−½ ‖x‖²),   x ∈ Rn.   (.)

Remarks:
1. For n = 1 this is our earlier definition of the standard normal distribution (Definition . on page ).
2. The n-dimensional random variable X = (X1, X2, ..., Xn)′ is n-dimensional standard normal if and only if X1, X2, ..., Xn are independent and one-dimensional standard normal. This follows from the identity

      (1/(2π)^(n/2)) exp(−½ ‖x‖²) = ∏_{i=1}^n (1/√(2π)) exp(−½ xi²)

   (cf. Proposition . on page ).
3. If X is standard normal, then EX = 0 and VarX = I; this follows from item 2. (Here I is the identity map of Rn onto itself [or the n×n unit matrix].)

Proposition .
If A is an isometric linear mapping of Rn onto itself, and X is n-dimensional standard normal, then AX is also n-dimensional standard normal.

Proof
From Theorem . (on page ) about transformation of densities, the density function of Y = AX is f(A⁻¹y) |detA⁻¹| where f is given by (.). Since f depends on x only through ‖x‖, and since ‖A⁻¹y‖ = ‖y‖ because A is an isometry, we have that f(A⁻¹y) = f(y); and also because A is an isometry, |detA⁻¹| = 1. Hence f(A⁻¹y) |detA⁻¹| = f(y).

Corollary .
If X is n-dimensional standard normal and e1, e2, ..., en is an orthonormal basis for Rn, then the coordinates X1, X2, ..., Xn of X in this basis are independent and one-dimensional standard normal.

(Recall that the coordinates of X in the basis e1, e2, ..., en can be expressed as Xi = ⟨X, ei⟩, i = 1, 2, ..., n.)

Proof
Since the coordinate transformation matrix is an isometry, this follows from Proposition . and the remark to Definition ..


Definition .: Regular normal distribution
Suppose that µ ∈ Rn and Σ is a positive definite linear map of Rn onto itself [or that µ is an n-dimensional column vector and Σ a positive definite n×n matrix].

The n-dimensional regular normal distribution with mean µ and variance Σ is the n-dimensional continuous distribution with density function

   f(x) = (1/((2π)^(n/2) |detΣ|^(1/2))) exp(−½ (x−µ)′Σ⁻¹(x−µ)),   x ∈ Rn.

Remarks:
1. For n = 1 we get the usual (one-dimensional) normal distribution with mean µ and variance σ² (= the single element of Σ).
2. If Σ = σ²I, the density function becomes

      f(x) = (1/(2πσ²)^(n/2)) exp(−‖x−µ‖²/(2σ²)),   x ∈ Rn.

   In other words, X is n-dimensional regular normal with parameters µ and σ²I if and only if X1, X2, ..., Xn are independent and one-dimensional normal and the parameters of Xi are µi and σ².

3. The definition refers to the parameters µ and Σ simply as mean and variance, but actually we must prove that µ and Σ are in fact the mean and the variance of the distribution: in the case µ = 0 and Σ = I this follows from the remark to Definition .. In the general case consider X = µ + Σ^(1/2)U where U is n-dimensional standard normal and Σ^(1/2) is as in Proposition B. on page  (with A = Σ). Then, using Proposition . below, X is regular normal with parameters µ and Σ, and Proposition . now gives EX = µ and VarX = Σ.

Proposition .
If X has an n-dimensional regular normal distribution with parameters µ and Σ, if A is a bijective linear mapping of Rn onto itself, and if b ∈ Rn is a constant vector, then Y = AX + b has an n-dimensional regular normal distribution with parameters Aµ + b and AΣA′.

Proof
From Theorem . on page  the density function of Y = AX + b is fY(y) = f(A⁻¹(y−b)) |detA⁻¹| where f is as in Definition .. Standard calculations give

   fY(y) = (1/((2π)^(n/2) |detΣ|^(1/2) |detA|)) exp(−½ (A⁻¹(y−b) − µ)′ Σ⁻¹ (A⁻¹(y−b) − µ))
         = (1/((2π)^(n/2) |det(AΣA′)|^(1/2))) exp(−½ (y − (Aµ+b))′ (AΣA′)⁻¹ (y − (Aµ+b)))

as requested.

Example .
Suppose that X = (X1, X2)′ has a two-dimensional regular normal distribution with parameters µ = (µ1, µ2)′ and Σ = σ²I. Then let (Y1, Y2)′ = (X1+X2, X1−X2)′, that is, Y = AX where

   A = [ 1   1
         1  −1 ].

Proposition . shows that Y has a two-dimensional regular normal distribution with mean EY = (µ1+µ2, µ1−µ2)′ and variance σ²AA′ = 2σ²I, so X1+X2 and X1−X2 are independent one-dimensional normal variables with means µ1+µ2 and µ1−µ2, respectively, and with identical variances.

Now we can define the general n-dimensional normal distribution:

Definition .: The n-dimensional normal distribution
Let µ ∈ Rn and let Σ be a positive semidefinite linear map of Rn into itself [or µ an n-dimensional column vector and Σ a positive semidefinite n×n matrix]. Let p denote the rank of Σ.

The n-dimensional normal distribution with mean µ and variance Σ is the distribution of µ + BU where U is p-dimensional standard normal and B an injective linear map of Rp into Rn such that BB′ = Σ.

Remarks:
1. It is known from Proposition B. on page  that there exists a B with the required properties, and that B is unique up to isometries of Rp. Since the standard normal distribution is invariant under isometries (Proposition .), it follows that the distribution of µ + BU does not depend on the choice of B, as long as B is injective and BB′ = Σ.
2. It follows from Proposition . that Definition . generalises Definition ..

Proposition .
If X has an n-dimensional normal distribution with parameters µ and Σ, and if A is a linear map of Rn into Rm, then Y = AX has an m-dimensional normal distribution with parameters Aµ and AΣA′.

Proof
We may assume that X is of the form X = µ + BU where U is p-dimensional


standard normal and B is an injective linear map of Rp into Rn. Then Y = Aµ + ABU. We have to show that ABU has the same distribution as CV, where C is an appropriate injective linear map of Rq into Rm, and V is q-dimensional standard normal; here q is the rank of AB.

Let L be the range of (AB)′, L = R((AB)′); then q = dimL. Now the idea is that as C we can use the restriction of AB to L ≅ Rq. According to Proposition B. (page ), L is the orthogonal complement of N(AB), the null space of AB, and from this it follows that the restriction of AB to L is injective. We can therefore choose an orthonormal basis e1, e2, ..., ep for Rp such that e1, e2, ..., eq is a basis for L, and the remaining es are a basis for N(AB). Let U1, U2, ..., Up be the coordinates of U in this basis, that is, Ui = ⟨U, ei⟩. Since eq+1, eq+2, ..., ep ∈ N(AB),

Rn Rm

Rp RqL⊇ '

A

B CAB

(AB

)′

ABU = ABp∑

i=1

Uiei =q∑

i=1

UiABei .

According to Corollary . U1,U2, . . . ,Up are independent one-dimensionalstandard normal variables, and therefore the same is true of U1,U2, . . . ,Uq,so if we define V as the q-dimensional random variable whose entries areU1,U2, . . . ,Uq, then V is q-dimensional standard normal. Finally define a linear

map C of Rq into Rn as the map that maps v =

v1v2...vq

toq∑

i=1

viABei . Then V and C

have the desired properties.

Theorem .: Partitioning the sum of squares
Let X have an n-dimensional normal distribution with mean 0 and variance σ²I. If Rn = L1 ⊕ L2 ⊕ · · · ⊕ Lk, where the linear subspaces L1, L2, ..., Lk are orthogonal, and if p1, p2, ..., pk are the projections onto L1, L2, ..., Lk, then the random variables p1X, p2X, ..., pkX are independent; pjX has an n-dimensional normal distribution with mean 0 and variance σ²pj, and ‖pjX‖² has a χ² distribution with scale parameter σ² and fj = dim Lj degrees of freedom.

Proof
It suffices to consider the case σ² = 1.

We can choose an orthonormal basis e1, e2, ..., en such that each basis vector belongs to one of the subspaces Li. The random variables Xi = ⟨X, ei⟩, i = 1, 2, ..., n, are independent one-dimensional standard normal according to Corollary .. The projection pjX of X onto Lj is the sum of the fj terms Xiei for which ei ∈ Lj. Therefore the distribution of pjX is as stated in the theorem, cf. Definition ..

The individual pjXs are calculated from non-overlapping sets of Xis, and therefore they are independent. Finally, ‖pjX‖² is the sum of the fj terms Xi² for which ei ∈ Lj, and therefore it has a χ² distribution with fj degrees of freedom, cf. Proposition . (page ).
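The theorem is easy to illustrate numerically. The sketch below assumes numpy; the choice n = 4 and the decomposition into the span of the ones vector and its orthogonal complement are made up for illustration.

```python
import numpy as np

# Simulation sketch of the partitioning theorem: X ~ N(0, I) in R^4,
# R^4 = L1 (+) L1-perp, with L1 spanned by the vector of ones.
rng = np.random.default_rng(1)
n = 4
e = np.ones(n) / np.sqrt(n)
P1 = np.outer(e, e)          # orthogonal projection onto L1 (dimension 1)
P2 = np.eye(n) - P1          # orthogonal projection onto the complement (dimension 3)

X = rng.standard_normal((200_000, n))
q1 = ((X @ P1) ** 2).sum(axis=1)   # ||p1 X||^2
q2 = ((X @ P2) ** 2).sum(axis=1)   # ||p2 X||^2

print(q1.mean(), q2.mean())        # close to the degrees of freedom, 1 and 3
print(np.corrcoef(q1, q2)[0, 1])   # close to 0
```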

. Exercises

Exercise .
On page  it is claimed that Var X is positive semidefinite. Show that this is indeed the case. Hint: Use the rules of calculus for covariances (Proposition ., page ; that proposition is actually true for all kinds of real random variables with variance) to show that Var(a′X) = a′(Var X)a, where a is an n × 1 matrix (i.e. a column vector).

Exercise .
Fill in the details of the reasoning in the remark to the definition of the n-dimensional standard normal distribution.

Exercise .
Let X = (X1, X2)′ have a two-dimensional normal distribution. Show that X1 and X2 are independent if and only if their covariance is 0. (Compare this to Proposition . (page ), which is true for all kinds of random variables with variance.)

Exercise .
Let t be the function (from R to R) given by t(x) = −x for |x| < a and t(x) = x otherwise; here a is a positive constant. Let X1 be a one-dimensional standard normal variable, and define X2 = t(X1).

Are X1 and X2 independent? Show that a can be chosen in such a way that their covariance is zero. Explain why this result is consistent with Exercise .. (Exercise . dealt with a similar problem.)


Linear Normal Models

This chapter presents the classical linear normal models, such as analysis of variance and linear regression, in a linear algebra formulation.

. Estimation and test in the general linear model

In the general linear normal model we consider an observation y of an n-dimensional normal random variable Y with mean vector µ and variance matrix σ²I; the parameter µ is assumed to be a point in the linear subspace L of V = Rn, and σ² > 0; hence the parameter space is L × ]0 ; +∞[. The model is a linear normal model because the mean µ is an element of a linear subspace L.

The likelihood function corresponding to the observation y is

L(µ, σ²) = 1/((2π)^{n/2} (σ²)^{n/2}) · exp( −‖y − µ‖² / (2σ²) ),   µ ∈ L, σ² > 0.

Estimation

Let p be the orthogonal projection of V = Rn onto L. Then for any z ∈ V we have that z = (z − pz) + pz where z − pz ⊥ pz, and so ‖z‖² = ‖z − pz‖² + ‖pz‖² (Pythagoras); applied to z = y − µ this gives

‖y − µ‖² = ‖y − py‖² + ‖py − µ‖²,

from which it follows that L(µ, σ²) ≤ L(py, σ²) for each σ², so the maximum likelihood estimator of µ is py. We can then use standard methods to see that L(py, σ²) is maximised with respect to σ² when σ² equals (1/n)‖y − py‖², and standard calculations show that the maximum value L(py, (1/n)‖y − py‖²) can be expressed as a_n (‖y − py‖²)^{−n/2}, where a_n is a number that depends on n, but not on y.

We can now apply Theorem . to X = Y − µ and the orthogonal decomposition Rn = L ⊕ L⊥ to get the following

Theorem .
In the general linear model,
• the mean vector µ is estimated as µ̂ = py, the orthogonal projection of y onto L,
• the estimator µ̂ = pY is n-dimensional normal with mean µ and variance σ²p,
• an unbiased estimator of the variance σ² is s² = (1/(n − dim L)) ‖y − py‖², and the maximum likelihood estimator of σ² is σ̂² = (1/n) ‖y − py‖²,
• the distribution of the estimator s² is a χ² distribution with scale parameter σ²/(n − dim L) and n − dim L degrees of freedom,
• the two estimators µ̂ and s² are independent (and so are the two estimators µ̂ and σ̂²).

The vector y − py is the residual vector, and ‖y − py‖² is the residual sum of squares. The quantity n − dim L is the number of degrees of freedom of the variance estimator and/or the residual sum of squares.

You can find µ̂ from the relation y − µ̂ ⊥ L, which in fact is nothing but a set of dim L linear equations with as many unknowns; this set of equations is known as the normal equations (they express the fact that y − µ̂ is normal, i.e. orthogonal, to L). A small computational sketch is given below.
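The estimation part of Theorem . is easy to carry out numerically once a set of vectors spanning L is available. The following sketch assumes numpy; the matrix B and the data y are made up for illustration.

```python
import numpy as np

# Estimation in the general linear model: L is the column space of B,
# mu_hat = py is found by least squares, and s^2 uses n - dim L degrees of freedom.
rng = np.random.default_rng(2)
n = 10
B = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # columns spanning L
y = 3.0 + 0.5 * np.arange(n) + rng.normal(scale=1.0, size=n)  # made-up observation

coef, *_ = np.linalg.lstsq(B, y, rcond=None)
mu_hat = B @ coef                               # the projection py of y onto L
dim_L = np.linalg.matrix_rank(B)
s2 = ((y - mu_hat) ** 2).sum() / (n - dim_L)    # unbiased variance estimate

print(mu_hat.round(2), s2)
```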

Testing hypotheses about the mean

Suppose that we want to test a hypothesis of the form H0 : µ ∈ L0 where L0 is a linear subspace of L. According to Theorem . the maximum likelihood estimators of µ and σ² under H0 are p0y and (1/n)‖y − p0y‖². Therefore the likelihood ratio test statistic is

Q = L(p0y, (1/n)‖y − p0y‖²) / L(py, (1/n)‖y − py‖²) = ( ‖y − py‖² / ‖y − p0y‖² )^{n/2}.

Since L0 ⊆ L, we have py − p0y ∈ L and therefore y − py ⊥ py − p0y, so that ‖y − p0y‖² = ‖y − py‖² + ‖py − p0y‖² (Pythagoras). Using this we can rewrite Q further:

Q = ( ‖y − py‖² / (‖y − py‖² + ‖py − p0y‖²) )^{n/2}
  = ( 1 + ‖py − p0y‖² / ‖y − py‖² )^{−n/2}
  = ( 1 + ((dim L − dim L0)/(n − dim L)) F )^{−n/2}

where

F = ( (1/(dim L − dim L0)) ‖py − p0y‖² ) / ( (1/(n − dim L)) ‖y − py‖² )

is the test statistic commonly used. Since Q is a decreasing function of F, the hypothesis is rejected for large values of F. It follows from Theorem . that under H0 the numerator and the denominator of the F statistic are independent; the numerator has a χ² distribution with scale parameter σ²/(dim L − dim L0) and dim L − dim L0 degrees of freedom, and the denominator has a χ² distribution with scale parameter σ²/(n − dim L) and n − dim L degrees of freedom. The distribution of the test statistic is therefore an F distribution with dim L − dim L0 and n − dim L degrees of freedom.

The F distribution with ft = f_top and fb = f_bottom degrees of freedom is the continuous distribution on ]0 ; +∞[ given by the density function

C x^{(ft−2)/2} / (fb + ft x)^{(ft+fb)/2},

where C is the number

C = Γ((ft + fb)/2) / ( Γ(ft/2) Γ(fb/2) ) · ft^{ft/2} fb^{fb/2}.

All problems of estimation and hypothesis testing are now solved, at least in principle (and Theorems . and . on pages / are proved). In the next sections we shall apply the general results to a number of more specific models. A small computational sketch of the general F test follows.
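The general F test can be computed directly from the two projections. This is only a sketch, assuming numpy and scipy; the function name f_test and the convention of describing L and L0 by matrices X and X0 whose columns span them are mine, not from the notes. The column space of X0 must be contained in that of X.

```python
import numpy as np
from scipy.stats import f as f_dist

def f_test(y, X, X0):
    """F test of H0: EY in span(X0), inside the model EY in span(X), Var Y = sigma^2 I.
    The column space of X0 must be a subspace of the column space of X."""
    def project(A, y):
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return A @ coef

    n = len(y)
    dL, dL0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    py, p0y = project(X, y), project(X0, y)
    F = (((py - p0y) ** 2).sum() / (dL - dL0)) / (((y - py) ** 2).sum() / (n - dL))
    return F, f_dist.sf(F, dL - dL0, n - dL)   # statistic and test probability
```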

. The one-sample problem

We have observed values y1, y2, ..., yn of independent and identically distributed (one-dimensional) normal variables Y1, Y2, ..., Yn with mean µ and variance σ², where µ ∈ R and σ² > 0. The problem is to estimate the parameters µ and σ², and to test the hypothesis H0 : µ = 0.

We will think of the yis as being arranged into an n-dimensional vector y, which is regarded as an observation of an n-dimensional normal random variable Y with mean µ and variance σ²I, and where the model claims that µ is a point in the one-dimensional subspace L = {µ1 : µ ∈ R} consisting of vectors having the same value µ in all places. (Here 1 is the vector with the value 1 in all places.)

According to Theorem . the maximum likelihood estimator µ̂ is found by projecting y orthogonally onto L. The residual vector y − µ̂ will then be orthogonal to L and in particular orthogonal to the vector 1 ∈ L, so

0 = ⟨y − µ̂, 1⟩ = ∑_{i=1}^{n} (yi − µ̂) = y• − nµ̂,

where as usual y• denotes the sum of the yis. Hence µ̂ = µ̂1 where µ̂ = ȳ, that is, µ is estimated as the average of the observations. Next, we find that the maximum likelihood estimate of σ² is

σ̂² = (1/n) ‖y − µ̂‖² = (1/n) ∑_{i=1}^{n} (yi − ȳ)²,

and that an unbiased estimate of σ² is

s² = (1/(n − dim L)) ‖y − µ̂‖² = (1/(n − 1)) ∑_{i=1}^{n} (yi − ȳ)².

These estimates were found by other means on page .


Next, we will test the hypothesis H0 : µ = 0, or rather µ ∈ L0 where L0 = {0}. Under H0, µ is estimated as the projection of y onto L0, which is 0. Therefore the F test statistic for testing H0 is

F = ( (1/(1 − 0)) ‖µ̂ − 0‖² ) / ( (1/(n − 1)) ‖y − µ̂‖² ) = nµ̂² / s² = ( ȳ / √(s²/n) )²,

so F = t², where t is the usual t test statistic, cf. page . Hence the hypothesis H0 can be tested either using the F statistic (with 1 and n − 1 degrees of freedom) or using the t statistic (with n − 1 degrees of freedom); a small sketch follows below.

The hypothesis µ = 0 is the only hypothesis of the form µ ∈ L0 where L0 is a linear subspace of L; so what should we do if the hypothesis of interest is of the form µ = µ0 where µ0 is non-zero? Answer: subtract µ0 from all the ys and then apply the method just described. The resulting F or t statistic will have ȳ − µ0 instead of ȳ in the numerator; everything else remains unchanged.
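A quick numerical illustration of the identity F = t² in the one-sample problem, assuming numpy and scipy; the data vector is made up.

```python
import numpy as np
from scipy import stats

y = np.array([1.2, -0.4, 0.8, 2.1, 0.3, 1.7])   # made-up observations
n = len(y)
ybar, s2 = y.mean(), y.var(ddof=1)

t = ybar / np.sqrt(s2 / n)     # t statistic for H0: mu = 0, with n-1 df
F = n * ybar**2 / s2           # F statistic with 1 and n-1 df
print(t**2, F)                 # the two numbers agree

# The two tests give the same test probability:
print(2 * stats.t.sf(abs(t), n - 1), stats.f.sf(F, 1, n - 1))
print(stats.ttest_1samp(y, 0.0))   # scipy's built-in one-sample t test, for comparison
```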

. One-way analysis of variance

We have observations that are classified into k groups; the observations are named yij, where i is the group number (i = 1, 2, ..., k) and j is the number of the observation within the group (j = 1, 2, ..., ni). Schematically it looks like this:

group 1:  y11  y12  ...  y1j  ...  y1n1
group 2:  y21  y22  ...  y2j  ...  y2n2
   ⋮
group i:  yi1  yi2  ...  yij  ...  yini
   ⋮
group k:  yk1  yk2  ...  ykj  ...  yknk

The variation between observations within a single group is assumed to be random, whereas there is a systematic variation between the groups. We also assume that the yijs are observed values of independent random variables Yij, and that the random variation can be described using the normal distribution. More specifically, the random variables Yij are assumed to be independent and normal, with mean EYij = µi and variance Var Yij = σ². Spelled out in more detail: there exist real numbers µ1, µ2, ..., µk and a positive number σ² such that Yij is normal with mean µi and variance σ², j = 1, 2, ..., ni, i = 1, 2, ..., k, and all Yijs are independent. In this way the mean value parameters µ1, µ2, ..., µk describe the systematic variation, that is, the levels of the individual groups, while the variance parameter σ² (and the normal distribution) describes the random variation within the groups. The random variation is supposed to be the same in all groups; this assumption can sometimes be tested, see Section . (page ).

We are going to estimate the parameters µ1, µ2, ..., µk and σ², and to test the hypothesis H0 : µ1 = µ2 = · · · = µk of no difference between the groups.

We will be thinking of the yijs as arranged as a single vector y ∈ V = Rn, where n = n1 + n2 + · · · + nk is the total number of observations. The basic model then claims that y is an observed value of an n-dimensional normal random variable Y with mean µ and variance σ²I, where σ² > 0, and where µ belongs to the linear subspace that is written briefly as

L = {ξ ∈ V : ξij = µi};

in more detail, L is the set of vectors ξ ∈ V for which there exists a k-tuple (µ1, µ2, ..., µk) ∈ Rk such that ξij = µi for all i and j. The dimension of L is k (provided that ni > 0 for all i).

According to Theorem ., µ is estimated as µ̂ = py where p is the orthogonal projection of V onto L. To obtain a more usable expression for the estimator, we use the fact that µ̂ ∈ L and y − µ̂ ⊥ L; in particular y − µ̂ is orthogonal to each of the k vectors e1, e2, ..., ek ∈ L defined in such a way that the (i,j)th entry of es is (es)ij = δis (Kronecker's δ is the symbol δij = 1 if i = j and 0 otherwise). Hence

0 = ⟨y − µ̂, es⟩ = ∑_{j=1}^{ns} (ysj − µ̂s) = ys• − ns µ̂s,   s = 1, 2, ..., k,

from which we get µ̂i = yi•/ni = ȳi, so µi is estimated as the average of the observations in group i.

The variance parameter σ² is estimated as

s0² = (1/(n − dim L)) ‖y − py‖² = (1/(n − k)) ∑_{i=1}^{k} ∑_{j=1}^{ni} (yij − ȳi)²

with n − k degrees of freedom. The two estimators µ̂ and s0² are independent.

The hypothesis H0 of identical groups can be expressed as H0 : µ ∈ L0, where

L0 = {ξ ∈ V : ξij = µ}

is the one-dimensional subspace of vectors having the same value in all n entries. From Section . we know that the projection of y onto L0 is the vector p0y that has the grand average ȳ = y••/n in all entries. To test the hypothesis H0 we use the test statistic

F = ( (1/(dim L − dim L0)) ‖py − p0y‖² ) / ( (1/(n − dim L)) ‖y − py‖² ) = s1²/s0²,

where

s1² = (1/(dim L − dim L0)) ‖py − p0y‖² = (1/(k − 1)) ∑_{i=1}^{k} ∑_{j=1}^{ni} (ȳi − ȳ)² = (1/(k − 1)) ∑_{i=1}^{k} ni (ȳi − ȳ)².

The distribution of F under H0 is the F distribution with k − 1 and n − k degrees of freedom. We say that s0² describes the variation within groups and s1² describes the variation between groups. Accordingly, the F statistic measures the variation between groups relative to the variation within groups; hence the name of the method: one-way analysis of variance.

If the hypothesis is accepted, the mean vector µ is estimated as the vector p0y, which has ȳ in all entries, and the variance σ² is estimated as

s01² = (1/(n − dim L0)) ‖y − p0y‖² = ( ‖y − py‖² + ‖py − p0y‖² ) / ( (n − dim L) + (dim L − dim L0) ),

that is,

s01² = (1/(n − 1)) ∑_{i=1}^{k} ∑_{j=1}^{ni} (yij − ȳ)² = (1/(n − 1)) ( ∑_{i=1}^{k} ∑_{j=1}^{ni} (yij − ȳi)² + ∑_{i=1}^{k} ni (ȳi − ȳ)² ).

Example .: Strangling of dogs
In an investigation of the relation between the duration of hypoxia and the concentration of hypoxanthine in the cerebrospinal fluid (described in more detail in Example . on page ), an experiment was conducted in which a number of dogs were exposed to hypoxia and the concentration of hypoxanthine was measured at four different times, see Table . on page . In the present example we will examine whether there is a significant difference between the four groups corresponding to the four times.

First, we calculate estimates of the unknown parameters, see Table ., which also gives a few intermediate results. The estimated mean values are 1.46, 5.50, 7.48 and 12.81, and the estimated variance is s0² = 4.82 with 21 degrees of freedom. The estimated variance between groups is s1² = 465.5/3 = 155.2 with 3 degrees of freedom, so the F statistic for testing homogeneity between groups is F = 155.2/4.82 = 32.2, which is highly significant. Hence there is a highly significant difference between the means of the four groups.

Table . Strangling of dogs: intermediate results. For each group i the table gives ni, the group sum ∑ yij, the group average ȳi, the degrees of freedom fi, the sum of squares ∑ (yij − ȳi)², and the variance estimate si², together with their sums and averages. [Values omitted.]

Table . Strangling of dogs: analysis of variance table (variation within groups, variation between groups, total variation). f is the number of degrees of freedom, SS is the sum of squared deviations, and s² = SS/f. [Values omitted.]

Traditionally, calculations and results are summarised in an analysis of variance table (ANOVA table), see Table ..

The analysis of variance model assumes homogeneity of variances, and we can test for that, for example using Bartlett's test (Section .). Using the s² values from Table . we find that the value of the test statistic B (equation (.) on page ) is

B = −( 6 ln(1.27/4.82) + 5 ln(2.99/4.82) + 4 ln(7.63/4.82) + 6 ln(8.04/4.82) ) = 5.5.

Compared with the χ² distribution with k − 1 = 3 degrees of freedom this gives a test probability of about 14%, so the value is not significant. We may therefore assume homogeneity of variances. [See also Example . on page .]
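The Bartlett computation above is easy to reproduce from the group variances quoted in the text (1.27, 2.99, 7.63 and 8.04, with 6, 5, 4 and 6 degrees of freedom). A sketch assuming numpy and scipy:

```python
import numpy as np
from scipy.stats import chi2

s2 = np.array([1.27, 2.99, 7.63, 8.04])   # group variance estimates s_i^2 from the example
f = np.array([6, 5, 4, 6])                # their degrees of freedom f_i

s0_sq = np.sum(f * s2) / f.sum()          # pooled variance, about 4.82
B = -np.sum(f * np.log(s2 / s0_sq))       # Bartlett statistic, about 5.5
eps = chi2.sf(B, len(s2) - 1)             # approximate test probability, about 0.14
print(s0_sq, B, eps)
```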

The two-sample problem with unpaired observations

The case k = 2 is often called the two-sample problem (no surprise!). In fact, many textbooks treat the two-sample problem as a separate topic, and sometimes even before the k-sample problem. The only interesting thing to be added about the case k = 2 seems to be the fact that the F statistic equals t², where

t = ( ȳ1 − ȳ2 ) / √( (1/n1 + 1/n2) s0² ),

cf. page . Under H0 the distribution of t is the t distribution with n − 2 degrees of freedom.


. Bartlett’s test of homogeneity of variances

Normal models often require the observations to come from normal distributions all with the same variance (or a variance which is known except for an unknown factor). This section describes a test procedure that can be used for testing whether a number of groups of observations of normal variables can be assumed to have equal variances, that is, for testing homogeneity of variances (or homoscedasticity). All groups must contain more than one observation, and in order to apply the usual χ² approximation to the distribution of the −2 ln Q statistic, each group has to have at least six observations, or rather: in each group the variance estimate has to have at least five degrees of freedom.

The general situation is as in the k-sample problem (the one-way analysis of variance situation), see page , and we want to perform a test of the assumption that the groups have the same variance σ². This can be done as follows.

Assume that each of the k groups has its own mean value parameter and its own variance parameter: the parameters associated with group no. i are µi and σi². From earlier results we know that the usual unbiased estimate of σi² is

si² = (1/fi) ∑_{j=1}^{ni} (yij − ȳi)²,

where fi = ni − 1 is the number of degrees of freedom of the estimate and ni is the number of observations in group i. According to Proposition . (page ) the estimator si² has a gamma distribution with shape parameter fi/2 and scale parameter 2σi²/fi. Hence we will consider the following statistical problem: We are given independent observations s1², s2², ..., sk² such that si² is an observation from a gamma distribution with shape parameter fi/2 and scale parameter 2σi²/fi, where fi is a known number; in this model we want to test the hypothesis

H0 : σ1² = σ2² = · · · = σk².

The likelihood function (when having observed s1², s2², ..., sk²) is

L(σ1², σ2², ..., σk²) = ∏_{i=1}^{k} 1/( Γ(fi/2) (2σi²/fi)^{fi/2} ) · (si²)^{fi/2 − 1} exp( −si² / (2σi²/fi) )

 = const · ∏_{i=1}^{k} (σi²)^{−fi/2} exp( −(fi/2) · si²/σi² )

 = const · ( ∏_{i=1}^{k} (σi²)^{−fi/2} ) · exp( −½ ∑_{i=1}^{k} fi si²/σi² ).

. Bartlett’s test of homogeneity of variances

The maximum likelihood estimates of σ21 ,σ

22 , . . . ,σ

2k are s21, s

22, . . . , s

2k . The max-

imum likelihood estimate of the common value σ2 under H0 is the point of

maximum of the function σ2 7→ L(σ2,σ2, . . . ,σ2), and that is s20 =1f

k∑

i=1

fi s2i ,

where f =k∑

i=1

fi . We see that s20 is a weighted average of the s2i s, using the

numbers of degrees of freedom as weights. The likelihood ratio test statisticis Q = L(s20, s

20, . . . , s

20)/L(s21, s

22, . . . , s

2k ). Normally one would use the test statistic

−2lnQ, and −2lnQ would then be denoted B, the Bartlett test statistic for testinghomogeneity of variances; some straightforward algebra leads to

B = −k∑

i=1

fi lns2is20. (.)

The B statistic is always non-negative, and large values are significant, meaningthat the hypothesisH0 of homogeneity of variances is wrong. UnderH0, B has anapproximate χ2 distribution with k − 1 degrees of freedom, so an approximatetest probability is ε = P(χ2

k−1 ≥ Bobs). The χ2 approximation is valid when all fisare large; a rule of thumb is that they all have to be at least 5.

With only two groups (i.e. k = 2) another possibility is to test the hypothesis of variance homogeneity using the ratio between the two variance estimates as a test statistic. This was discussed in connection with the two-sample problem with normal variables, see Exercise . page . (That two-sample test does not rely on any χ² approximation and has no restrictions on the number of degrees of freedom.)


. Two-way analysis of variance

We have a number of observations y arranged as a two-way table with r rows and s columns; cell (i,j) contains the nij observations yij1, yij2, ..., yij nij:

            column 1                column 2                ...   column s
row 1       y11k, k = 1,...,n11     y12k, k = 1,...,n12     ...   y1sk, k = 1,...,n1s
row 2       y21k, k = 1,...,n21     y22k, k = 1,...,n22     ...   y2sk, k = 1,...,n2s
  ⋮
row i       yi1k, k = 1,...,ni1     yi2k, k = 1,...,ni2     ...   yisk, k = 1,...,nis
  ⋮
row r       yr1k, k = 1,...,nr1     yr2k, k = 1,...,nr2     ...   yrsk, k = 1,...,nrs

Here yijk is observation no. k in the (i,j)th cell, which contains a total of nij observations. The (i,j)th cell is located in the ith row and jth column of the table, and the table has a total of r rows and s columns.

We shall now formulate a statistical model in which the yijks are observations of independent normal variables Yijk, all with the same variance σ², and with a mean value structure that reflects the way the observations are classified. Here follow the basic model and a number of possible hypotheses about the mean value parameters (the variance is always the same unknown σ²):

• The basic model G states that Ys belonging to the same cell have the same mean value. In more detail, this model states that there exist numbers ηij, i = 1, 2, ..., r, j = 1, 2, ..., s, such that EYijk = ηij for all i, j and k. A short formulation of the model is

G : EYijk = ηij.

• The hypothesis of additivity, also known as the hypothesis of no interaction, states that there is no interaction between rows and columns, but that row effects and column effects are additive. In more detail, the hypothesis is that there exist numbers α1, α2, ..., αr and β1, β2, ..., βs such that EYijk = αi + βj for all i, j and k. A short formulation of this hypothesis is

H0 : EYijk = αi + βj.

• The hypothesis of no column effects states that there is no difference between the columns. In more detail, the hypothesis is that there exist numbers α1, α2, ..., αr such that EYijk = αi for all i, j and k. A short formulation of this hypothesis is

H1 : EYijk = αi.

• The hypothesis of no row effects states that there is no difference between the rows. In more detail, the hypothesis is that there exist numbers β1, β2, ..., βs such that EYijk = βj for all i, j and k. A short formulation of this hypothesis is

H2 : EYijk = βj.

• The hypothesis of total homogeneity states that there is no difference at all between the cells; in more detail, the hypothesis is that there exists a number γ such that EYijk = γ for all i, j and k. A short formulation of this hypothesis is

H3 : EYijk = γ.

We will express these hypotheses in linear algebra terms, so we will think of the observations as forming a single vector y ∈ V = Rn, where n = ∑ nij is the total number of (one-dimensional) observations; we shall be thinking of vectors in V as being arranged as two-way tables like the one on the preceding page. In this set-up, the basic model and the four hypotheses can be rewritten as statements that the mean vector µ = EY belongs to a certain linear subspace:

G : µ ∈ L where L = {ξ : ξijk = ηij},
H0 : µ ∈ L0 where L0 = {ξ : ξijk = αi + βj},
H1 : µ ∈ L1 where L1 = {ξ : ξijk = αi},
H2 : µ ∈ L2 where L2 = {ξ : ξijk = βj},
H3 : µ ∈ L3 where L3 = {ξ : ξijk = γ}.

There are certain relations between the hypotheses/subspaces:

L ⊇ L0 ⊇ L1 ⊇ L3   and   L ⊇ L0 ⊇ L2 ⊇ L3,

and correspondingly

G ⇐ H0 ⇐ H1 ⇐ H3   and   G ⇐ H0 ⇐ H2 ⇐ H3.

The basic model and the models corresponding to H1 and H2 are instances of k-sample problems (with k equal to rs, r and s, respectively), and the model corresponding to H3 is a one-sample problem, so it is straightforward to write down the estimates in these four models:

under G:  η̂ij = ȳij,  the average in cell (i,j),
under H1: α̂i = ȳi•,  the average in row i,
under H2: β̂j = ȳ•j,  the average in column j,
under H3: γ̂ = ȳ,  the overall average.

On the other hand, it can be quite intricate to estimate the parameters under the hypothesis H0 of no interaction.

First we shall introduce the notion of a connected model.

Connected models

What would be a suitable requirement for the numbers of observations in the various cells of y, that is, a requirement for the table

n = [ n11 n12 ··· n1s
      n21 n22 ··· n2s
       ⋮           ⋮
      nr1 nr2 ··· nrs ] ?

We should not allow rows (or columns) consisting entirely of 0s, but do all the nijs have to be non-zero?

To clarify the problem, consider a simple example with r = 2, s = 2, and

n = [ 0 9
      9 0 ].

In this case the subspace L corresponding to the additive model has dimension 2, since the only parameters are η12 and η21, and in fact L = L0 = L1 = L2. In particular, L1 ∩ L2 ≠ L3, and hence it is not true in general that L1 ∩ L2 = L3. The fact is that the present model consists of two separate sub-models (one for cell (1,2) and one for cell (2,1)), and therefore dim(L1 ∩ L2) > 1, or equivalently, dim L0 < r + s − 1 (to see this, apply the general formula dim(L1 + L2) = dim L1 + dim L2 − dim(L1 ∩ L2) and the fact that L0 = L1 + L2).

If the model cannot be split into separate sub-models, we say that the model is connected. The notion of a connected model can be specified in the following way: From the table n form a graph whose vertices are those integer pairs (i,j) for which nij > 0, and whose edges connect pairs of adjacent vertices, i.e., pairs of vertices with a common i or a common j. Example: the table

n = [ 5 8 0
      1 0 4 ]

gives a graph with vertices (1,1), (1,2), (2,1) and (2,3). Then the model is said to be connected if the graph is connected (in the usual graph-theoretic sense).

It is easy to verify that the model is connected if and only if the null space of the linear map that takes the (r+s)-dimensional vector (α1, α2, ..., αr, β1, β2, ..., βs) into the "corresponding" vector in L0 is one-dimensional, that is, if and only if L3 = L1 ∩ L2. In plain language, a connected model is a model for which it is true that "if all row parameters are equal and all column parameters are equal, then there is total homogeneity".

From now on, all two-way ANOVA models are assumed to be connected.

The projection onto L0

The general theory tells us that under H0 the mean vector µ is estimated as the projection p0y of y onto L0. In certain cases this projection can be calculated very easily. First recall that L0 = L1 + L2 and (since the model is connected) L3 = L1 ∩ L2.

Now suppose that

(L1 ∩ L3⊥) ⊥ (L2 ∩ L3⊥).   (.)

Then

L0 = (L1 ∩ L3⊥) ⊕ (L2 ∩ L3⊥) ⊕ L3,

and hence

p0 = (p1 − p3) + (p2 − p3) + p3,

that is,

α̂i + β̂j = (ȳi• − ȳ) + (ȳ•j − ȳ) + ȳ = ȳi• + ȳ•j − ȳ.

This is indeed a very simple and nice formula for (the coordinates of) the projection of y onto L0; we only need to know when the premise is true. A necessary and sufficient condition for (.) is that

⟨p1e − p3e, p2f − p3f⟩ = 0   (.)

for all e, f ∈ B, where B is a basis for L. As B we can use the vectors eρσ where (eρσ)ijk = 1 if (i,j) = (ρ,σ) and 0 otherwise. Inserting such vectors in (.) leads to the following necessary and sufficient condition for (.):

nij = ni• n•j / n,   i = 1, 2, ..., r;  j = 1, 2, ..., s.   (.)

When this condition holds, we say that the model is balanced. Note that if all the cells of y have the same number of observations, then the model is always balanced.


In conclusion, in a balanced model, that is, if (.) is true, the estimates under the hypothesis of additivity can be calculated as

α̂i + β̂j = ȳi• + ȳ•j − ȳ.   (.)

Testing the hypotheses

The various hypotheses about the mean vector can be tested using an F statistic. As always, the denominator of the test statistic is the estimate of the variance parameter calculated under the current basic model, and the numerator of the test statistic is an estimate of "the variation of the hypothesis about the current basic model". The F statistic inherits its two numbers of degrees of freedom from the numerator and the denominator variance estimates.

1. For H0, the hypothesis of additivity, the test statistic is F = s1²/s0², where

s0² = (1/(n − dim L)) ‖y − py‖² = (1/(n − g)) ∑_{i=1}^{r} ∑_{j=1}^{s} ∑_{k=1}^{nij} (yijk − ȳij)²   (.)

describes the within-groups variation (g = dim L is the number of non-empty cells), and

s1² = (1/(dim L − dim L0)) ‖py − p0y‖² = (1/(g − (r + s − 1))) ∑_{i=1}^{r} ∑_{j=1}^{s} ∑_{k=1}^{nij} (ȳij − (α̂i + β̂j))²

describes the interaction variation. In a balanced model

s1² = (1/(g − (r + s − 1))) ∑_{i=1}^{r} ∑_{j=1}^{s} ∑_{k=1}^{nij} (ȳij − ȳi• − ȳ•j + ȳ)².

This requires that n > g, that is, there must be some cells with more than one observation.

2. Once H0 is accepted, the additive model is named the current basic model, and an updated variance estimate is calculated:

s01² = (1/(n − dim L0)) ‖y − p0y‖² = (1/(n − dim L0)) ( ‖y − py‖² + ‖py − p0y‖² ).

Depending on the actual circumstances, the next hypothesis to be tested is either H1 or H2 or both.

• To test the hypothesis H1 of no column effects, use the test statistic F = s1²/s01², where

s1² = (1/(dim L0 − dim L1)) ‖p0y − p1y‖² = (1/(s − 1)) ∑_{i=1}^{r} ∑_{j=1}^{s} nij ( α̂i + β̂j − ȳi• )².

In a balanced model,

s1² = (1/(s − 1)) ∑_{j=1}^{s} n•j ( ȳ•j − ȳ )².

• To test the hypothesis H2 of no row effects, use the test statistic F = s2²/s01², where

s2² = (1/(dim L0 − dim L2)) ‖p0y − p2y‖² = (1/(r − 1)) ∑_{i=1}^{r} ∑_{j=1}^{s} nij ( α̂i + β̂j − ȳ•j )².

In a balanced model,

s2² = (1/(r − 1)) ∑_{i=1}^{r} ni• ( ȳi• − ȳ )².

Note that H1 and H2 are tested side by side: both are tested against H0 and the variance estimate s01²; in a balanced model, the numerators s1² and s2² of the F statistics are independent (since p0y − p1y and p0y − p2y are orthogonal).
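For a balanced layout, these estimates and F tests reduce to a few array operations. The sketch below assumes numpy and scipy; the 3 × 2 table of cell data (with the same number of observations in every cell, so the model is balanced) is made up for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

# Balanced two-way layout: Y[i, j, :] holds the m observations in cell (i, j).
# The numbers are made up purely for illustration.
Y = np.array([[[8., 9., 7.], [12., 11., 13.]],
              [[9., 10., 8.], [14., 13., 15.]],
              [[11., 10., 12.], [16., 17., 15.]]])
r, s, m = Y.shape
n, g = r * s * m, r * s

cell = Y.mean(axis=2)      # cell averages   ybar_ij
row = cell.mean(axis=1)    # row averages    ybar_i.
col = cell.mean(axis=0)    # column averages ybar_.j
grand = cell.mean()        # grand average   ybar

ss_within = ((Y - cell[:, :, None]) ** 2).sum()
ss_inter = m * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()

s0_sq = ss_within / (n - g)
s1_sq = ss_inter / (g - (r + s - 1))
F_add = s1_sq / s0_sq                  # test of additivity (H0)
print(F_add, f_dist.sf(F_add, g - (r + s - 1), n - g))

# If additivity is accepted, test column and row effects against s01^2:
s01_sq = (ss_within + ss_inter) / (n - (r + s - 1))
F_col = (m * r * ((col - grand) ** 2).sum() / (s - 1)) / s01_sq
F_row = (m * s * ((row - grand) ** 2).sum() / (r - 1)) / s01_sq
print(F_col, f_dist.sf(F_col, s - 1, n - (r + s - 1)))
print(F_row, f_dist.sf(F_row, r - 1, n - (r + s - 1)))
```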

An example (growing of potatoes)

To obtain optimal growth conditions, plants need a number of nutrients administered in correct proportions. This example deals with the problem of finding an appropriate proportion between the amount of nitrogen and the amount of phosphate in the fertilisers used when growing potatoes.

Potatoes were grown in six different ways, corresponding to six different combinations of amount of phosphate supplied (three different amounts) and amount of nitrogen supplied (two different amounts). This results in a set of observations, the yields, that are classified according to two criteria: amount of phosphate and amount of nitrogen.


Table . Growing of potatoes: yield of each of the parcels, classified by amount of nitrogen and amount of phosphate. The entries of the table are 1000 × (log(yield in lbs) − 3). [Values omitted.]

Table . Growing of potatoes: estimated mean ȳ (top) and variance s² (bottom) in each of the six groups, cf. Table .. [Values omitted.]

Table . records some of the results from one such experiment, conducted at Ely.* We want to investigate the effects of the two factors nitrogen and phosphate separately and jointly, and to be able to answer questions such as: does the effect of adding one unit of nitrogen depend on whether we add phosphate or not? We will try a two-way analysis of variance.

The two-way analysis of variance model assumes homogeneity of the variances, and we are in a position to perform a test of this assumption. We calculate the empirical mean and variance for each of the six groups, see Table .. The total sum of squares is found to be 111817, so that the estimate of the within-groups variance is s0² = 111817/(36 − 6) = 3727.2 (cf. equation (.)). The Bartlett test statistic is Bobs = 9.5; compared with the χ² distribution with 6 − 1 = 5 degrees of freedom this gives a test probability of about 9%, so there is no significant evidence against the assumption of homogeneity of variances.

Then we can test whether the data can be described adequately using an additive model in which the effects of the two factors "phosphate" and "nitrogen" enter in an additive way.

* Note that Table . does not give the actual yields, but the yields transformed as indicated. The reason for taking the logarithm of the yields is that experience shows that the distribution of the yield of potatoes is not particularly normal, whereas the distribution of the logarithm of the yield looks more like a normal distribution. After having passed to logarithms, it turned out that all the numbers were of the form 3-point-something, and that is the reason for subtracting 3 and multiplying by 1000 (to get rid of the decimal point).

Table . Growing of potatoes: estimated group means in the basic model (top) and in the additive model (bottom), for each combination of nitrogen and phosphate. [Values omitted.]

Let yijk denote the kth observation in row no. i and column no. j. According to the basic model, the yijks are understood as observations of independent normal random variables Yijk, where Yijk has mean µij and variance σ². The hypothesis of additivity can be formulated as H0 : EYijk = αi + βj. Since the model is balanced (equation (.) is satisfied), the estimates α̂i + β̂j can be calculated using equation (.). First, calculate some auxiliary quantities:

ȳ = 654.75,
ȳ1• − ȳ = −86.92,   ȳ2• − ȳ = 13.42,   ȳ3• − ȳ = 73.50,
ȳ•1 − ȳ = −43.47,   ȳ•2 − ȳ = 43.47.

Then we can calculate the estimated group means α̂i + β̂j under the hypothesis of additivity, see Table .. The variance estimate to be used under this hypothesis is s01² = 112693.5 / (36 − (3 + 2 − 1)) = 3521.7 with 36 − (3 + 2 − 1) = 32 degrees of freedom.

The interaction variance is

s1² = 877 / (6 − (3 + 2 − 1)) = 877/2 = 438.5,

and we have already found that s0² = 3727.2, so the F statistic for testing the hypothesis of additivity is

F = s1²/s0² = 438.5/3727.2 = 0.12.

This value is to be compared to the F distribution with 2 and 30 degrees of freedom, and we find the test probability to be almost 90%, so there is no doubt that we can accept the hypothesis of additivity.

Having accepted the additive model, we should from now on use the improved variance estimate

s01² = 112694 / (36 − (3 + 2 − 1)) = 3521.7

with 36 − (3 + 2 − 1) = 32 degrees of freedom.


Table . Growing of potatoes: analysis of variance table (variation within groups, interaction, the additive model, between N-levels, between P-levels, total). f is the number of degrees of freedom, SS is the sum of squared deviations, and s² = SS/f. [Values omitted.]

Knowing that the additive model gives an adequate description of the data, it makes sense to talk of a nitrogen effect and a phosphate effect, and it may be of some interest to test whether these effects are significant.

We can test the hypothesis H1 of no nitrogen effect (column effect). The variance between nitrogen levels is

s2² = 68034 / (2 − 1) = 68034,

and the variance estimate in the additive model is s01² = 3522 with 32 degrees of freedom, so the F statistic is

Fobs = s2² / s01² = 68034/3522 = 19,

to be compared to the F distribution with 1 and 32 degrees of freedom. The value Fobs = 19 is undoubtedly significant, so the hypothesis of no nitrogen effect must be rejected; that is, nitrogen definitely has an effect.

We can test for the effect of phosphate in a similar way, and it turns out that phosphate, too, has a highly significant effect.

The analysis of variance table (Table .) gives an overview of the analysis.

. Regression analysis

Regression analysis is a technique for modelling and studying the dependence of one quantity on one or more other quantities.

Suppose that for each of a number of items (test subjects, test animals, single laboratory measurements, etc.) a number of quantities (variables) have been measured. One of these quantities/variables has a special position, because we wish to "describe" or "explain" this variable in terms of the others. The generic symbol for the variable to be described is usually y, and the generic symbols for the quantities that are used to describe y are x1, x2, ..., xp. Often used names for the x and y quantities are

y:  dependent variable, explained variable, response variable;
x1, x2, ..., xp:  independent variables, explanatory variables, covariates.

We briefly outline a few examples:
1. The doctor observes the survival time y of patients treated for a certain disease, but the doctor also records some extra information, for example the sex, age and weight of the patient. Some of these "covariates" may contain information that could turn out to be useful in a model for the survival time.
2. In a number of fairly similar industrialised countries the prevalence of lung cancer, the cigarette consumption and the consumption of fossil fuels were recorded, and the results were given as per capita values. Here one might name the lung cancer prevalence the y variable and then try to "explain" it using the other two variables as explanatory variables.
3. In order to determine the toxicity of a certain drug, a number of test animals are exposed to various doses of the drug, and the number of animals dying is recorded. Here the dose x is an independent variable determined as part of the design of the experiment, and the number y of dead animals is the dependent variable.

Regression analysis is about developing statistical models that attempt to describe a variable y using a known and fairly simple function of a small number of covariates x1, x2, ..., xp and parameters β1, β2, ..., βp. The parameters are the same for all data points (y, x1, x2, ..., xp).

Obviously, a statistical model cannot be expected to give a perfect description of the data, partly because the model is hardly totally correct, and partly because the very idea of a statistical model is to model only the main features of the data while leaving out the finer details. So there will almost always be a certain difference between an observed value y and the corresponding "fitted" value ŷ, i.e. the value that the model predicts from the covariates and the estimated parameters. This difference is known as the residual, often denoted by e, and we can write

y = ŷ + e,
observed value = fitted value + residual.

The residuals are what the model does not describe, so it would seem natural to describe them as being random, that is, as random numbers drawn from a certain probability distribution.

-oOo-

Vital conditions for using regression analysis are that
1. the ys (and the residuals) are subject to random variation, whereas the xs are assumed to be known constants;
2. the individual ys are independent: the random causes that act on one measurement of y (after taking account of the covariates) are assumed to be independent of those acting on other measurements of y.

The simplest examples of regression analysis use only a single covariate x, and the task then is to describe y using a known, simple function of x. The simplest non-trivial function would be of the form y = α + xβ, where α and β are two parameters, that is, we assume y to be an affine function of x. Thus we have arrived at the simple linear regression model, cf. page .

A more advanced example is the multiple linear regression model, which has p covariates x1, x2, ..., xp and attempts to model y as

y = ∑_{j=1}^{p} xj βj.

Specification of the model

We shall now discuss the multiple linear regression model. In order to have a genuine statistical model, we need to specify the probability distribution describing the random variation of the ys. Throughout this chapter we assume this probability distribution to be a normal distribution with variance σ² (which is the same for all observations).

We can specify the model as follows: The available data consist of n measurements of the dependent variable y, and for each measurement the corresponding values of the p covariates x1, x2, ..., xp are known. The ith set of values is (yi, (xi1, xi2, ..., xip)). It is assumed that y1, y2, ..., yn are observed values of independent normal random variables Y1, Y2, ..., Yn, all having the same variance σ², and with

EYi = ∑_{j=1}^{p} xij βj,   i = 1, 2, ..., n,   (.)

where β1, β2, ..., βp are unknown parameters. Often one of the covariates will be the constant 1, that is, the covariate has the value 1 for all i.

Using matrix notation, equation (.) can be formulated as EY = Xβ, where Y is the column vector containing Y1, Y2, ..., Yn, X is a known n × p matrix (known as the design matrix) whose entries are the xijs, and β is the column vector consisting of the p unknown βs. This could also be written in terms of linear subspaces: EY ∈ L1 where L1 = {Xβ ∈ Rn : β ∈ Rp}.

The name linear regression is due to the fact that EY is a linear function of β.

The above model can be generalised in different ways. Instead of assuming the observations to have the same variance, we might assume them to have a variance which is known apart from a constant factor, that is, Var Y = σ²Σ, where Σ > 0 is a known matrix and σ² an unknown parameter; this is a so-called weighted linear regression model.

Or we could replace the normal distribution with another distribution, such as a binomial, Poisson or gamma distribution, and at the same time generalise (.) to

g(EYi) = ∑_{j=1}^{p} xij βj,   i = 1, 2, ..., n,

for a suitable function g; then we get a generalised linear regression. Logistic regression (see Section ., in particular page ff) is an example of generalised linear regression.

This chapter treats ordinary linear regression only.

Estimating the parameters

According to the general theory, the mean vector Xβ is estimated as the orthogonal projection of y onto L1 = {Xβ ∈ Rn : β ∈ Rp}. This means that the estimate of β is a vector β̂ such that y − Xβ̂ ⊥ L1. Now, y − Xβ̂ ⊥ L1 is equivalent to ⟨y − Xβ̂, Xβ⟩ = 0 for all β ∈ Rp, which is equivalent to ⟨X′y − X′Xβ̂, β⟩ = 0 for all β ∈ Rp, which finally is equivalent to X′Xβ̂ = X′y. The matrix equation X′Xβ̂ = X′y can be written out as p linear equations in p unknowns, the normal equations (see also page ). If the p × p matrix X′X is regular, there is a unique solution, which in matrix notation is

β̂ = (X′X)⁻¹X′y.

(The condition that X′X is regular can be formulated in several equivalent ways: the dimension of L1 is p; the rank of X is p; the rank of X′X is p; the columns of X are linearly independent; the parametrisation is injective.)

The variance parameter is estimated as

s² = ‖y − Xβ̂‖² / (n − dim L1).


Using the rules of calculus for variance matrices we get

Var β̂ = Var( (X′X)⁻¹X′Y ) = ( (X′X)⁻¹X′ ) Var Y ( (X′X)⁻¹X′ )′ = ( (X′X)⁻¹X′ ) σ²I ( (X′X)⁻¹X′ )′ = σ² (X′X)⁻¹,   (.)

which is estimated as s² (X′X)⁻¹. The square roots of the diagonal elements of this matrix are estimates of the standard errors (the standard deviations) of the corresponding β̂s.

Standard statistical software comes with functions that can solve the normal equations and return the estimated βs and their standard errors (and sometimes even all of the estimated Var β̂); a bare-hands sketch is given below.
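The estimation formulas can be written out in a few lines of code. The sketch below assumes numpy; the data are made up, and for serious numerical work one would normally solve the normal equations with a least-squares routine rather than form (X′X)⁻¹ explicitly.

```python
import numpy as np

# Multiple linear regression "by hand": beta_hat = (X'X)^{-1} X'y,
# s^2 = ||y - X beta_hat||^2 / (n - dim L1), standard errors from s^2 (X'X)^{-1}.
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0])      # made-up data
X = np.column_stack([np.ones_like(x), x])         # design matrix: constant + one covariate

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # solves the normal equations
resid = y - X @ beta_hat
s2 = resid @ resid / (len(y) - X.shape[1])        # n - dim L1 degrees of freedom
se = np.sqrt(s2 * np.diag(XtX_inv))               # standard errors of the beta_hat's

print(beta_hat, s2, se)
```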

Simple linear regression

In simple linear regression (see for example page ),

X = [ 1  x1
      1  x2
      ⋮   ⋮
      1  xn ],    X′X = [ n       ∑ xi
                          ∑ xi    ∑ xi² ],    X′y = [ ∑ yi
                                                      ∑ xi yi ],

so the normal equations are

α̂ n + β̂ ∑_{i=1}^{n} xi = ∑_{i=1}^{n} yi,
α̂ ∑_{i=1}^{n} xi + β̂ ∑_{i=1}^{n} xi² = ∑_{i=1}^{n} xi yi,

which can be solved using standard methods. However, it is probably easier simply to calculate the projection py of y onto L1. It is easily seen that the two vectors u = (1, 1, ..., 1)′ and v = (x1 − x̄, x2 − x̄, ..., xn − x̄)′ are orthogonal and span L1, and therefore

py = ( ⟨y, u⟩ / ‖u‖² ) u + ( ⟨y, v⟩ / ‖v‖² ) v = ( (∑_{i=1}^{n} yi) / n ) u + ( ∑_{i=1}^{n} (xi − x̄) yi / ∑_{i=1}^{n} (xi − x̄)² ) v,

which shows that the jth coordinate of py is

(py)j = ȳ + ( ∑_{i=1}^{n} (xi − x̄) yi / ∑_{i=1}^{n} (xi − x̄)² ) (xj − x̄),

cf. page . By calculating the variance matrix (.) we obtain the variances and correlations postulated in Proposition . on page .
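The projection formula translates directly into code. A short sketch, assuming numpy, with made-up x and y values; it checks that the projection computed from u and v agrees with the fitted values α̂ + β̂x.

```python
import numpy as np

# Simple linear regression via the orthogonal vectors u = (1,...,1) and v = x - xbar.
x = np.array([0., 1., 2., 3., 4.])            # made-up x values
y = np.array([1.0, 1.4, 2.1, 2.9, 3.2])       # made-up y values

u = np.ones_like(x)
v = x - x.mean()
py = (y @ u) / (u @ u) * u + (y @ v) / (v @ v) * v   # projection of y onto L1

beta_hat = (y @ v) / (v @ v)                  # slope
alpha_hat = y.mean() - beta_hat * x.mean()    # intercept in y = alpha + beta*x
print(py)
print(alpha_hat + beta_hat * x)               # the same vector as py
```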

Testing hypotheses

Hypotheses of the form H0 : EY ∈ L0, where L0 is a linear subspace of L1, are tested in the standard way using an F test.

Usually such a hypothesis will be of the form H : βj = 0, which means that the jth covariate xj is of no significance and hence can be omitted. Such a hypothesis can be tested either using an F test, or using the t test statistic

t = β̂j / (estimated standard error of β̂j).

Factors

There are two different kinds of covariates. The covariates we have met so far all indicate some numerical quantity, that is, they are quantitative covariates. But covariates may also be qualitative, thus indicating membership of a certain group as the result of a classification. Such covariates are known as factors.

Example: In one-way analysis of variance, observations y are classified into a number of groups, see for example page ; one way to think of the data is as a set of pairs (y, f) of an observation y and a factor f which simply tells which group the actual y belongs to. This can be formulated as a regression model: Assume there are k levels of f, denoted by 1, 2, ..., k (i.e. there are k groups), and introduce k artificial (and quantitative) covariates x1, x2, ..., xk such that xi = 1 if f = i, and xi = 0 otherwise. In this way, we can replace the set of data points (y, f) with "points" (y, (x1, x2, ..., xk)) where all xs except one equal zero, and the single non-zero x (which equals 1) identifies the group that y belongs to. In this notation the one-way analysis of variance model can be written as

EY = ∑_{j=1}^{k} xj βj,

where βj corresponds to µj in the original formulation of the model.
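A small numerical sketch of this dummy coding, assuming numpy; the factor levels and y values are made up. The regression estimates β̂ come out as exactly the group averages, as the correspondence with the one-way analysis of variance model says they should.

```python
import numpy as np

# Dummy coding of a factor with k = 3 levels (coded 0, 1, 2); y values are made up.
f = np.array([0, 0, 1, 1, 1, 2, 2])
y = np.array([4.0, 5.0, 7.5, 8.0, 7.0, 1.0, 2.0])

k = f.max() + 1
X = np.zeros((len(y), k))
X[np.arange(len(y)), f] = 1.0       # x_i = 1 exactly when the observation is in group i

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                     # [4.5, 7.5, 1.5] -- the group averages
```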


Table . Strangling of dogs: the concentration of hypoxanthine (µmol/l) at four different times (min). In each group the observations are arranged according to size. [Values omitted.]

By combining factors and quantitative covariates one can build rather complicated models, e.g. with different levels of groupings, or with different groups having different linear relationships.

An example

Example .: Strangling of dogs
Seven dogs (subjected to anaesthesia) were exposed to hypoxia by compression of the trachea, and the concentration of hypoxanthine was measured at four different times. For various reasons it was not possible to carry out measurements on all of the dogs at all four times, and furthermore the measurements can no longer be related to the individual dogs. Table . shows the available data.

We can think of the data as n = 25 pairs of a concentration and a time, the times being known quantities (part of the plan of the experiment) and the concentrations being observed values of random variables: the variation within the four groups can be ascribed to biological variation and measurement errors, and it seems reasonable to model this variation as random variation. We will therefore apply a simple linear regression model with concentration as the dependent variable and time as a covariate. Of course, we cannot know in advance whether this model will give an appropriate description of the data. Maybe it will turn out that we can achieve a much better fit using simple linear regression with the square root of the time as covariate, but in that case we would still have a linear regression model, now with the square root of the time as covariate.

We will use a linear regression model with time as the independent variable (x) and the concentration of hypoxanthine as the dependent variable (y).

So let x1, x2, x3 and x4 denote the four times, and let yij denote the jth concentration measured at time xi. With this notation the suggested statistical model is that the observations yij are observed values of independent random variables Yij, where Yij is normal with EYij = α + βxi and variance σ².

Estimated values of α and β can be found, for example using the formulas on page , and we get β̂ = 0.61 µmol l⁻¹ min⁻¹ and α̂ = 1.4 µmol l⁻¹. The estimated variance is s02² = 4.85 µmol² l⁻² with 25 − 2 = 23 degrees of freedom.

The standard error of β̂ is √(s02²/SSx) = 0.06 µmol l⁻¹ min⁻¹, and the standard error of α̂ is √( (1/n + x̄²/SSx) s02² ) = 0.7 µmol l⁻¹ (Proposition .). The orders of magnitude of these two standard errors suggest that β̂ should be written using two decimal places and α̂ using one decimal place, so that the estimated regression line is written as

y = 1.4 µmol l⁻¹ + 0.61 µmol l⁻¹ min⁻¹ · x.

[Figure .: Strangling of dogs: the concentration of hypoxanthine plotted against time, together with the estimated regression line.]

As a kind of model validation we will carry out a numerical test of the hypothesis that the concentration does in fact depend on time in a linear (or rather: an affine) way. We therefore embed the regression model into the larger k-sample model with k = 4 groups corresponding to the four different values of time, cf. Example . on page .

We will state the method in linear algebra terms. Let L1 = {ξ : ξij = α + βxi} be the subspace corresponding to the linear regression model, and let L = {ξ : ξij = µi} be the subspace corresponding to the k-sample model. We have L1 ⊂ L, and from Section . we know that the test statistic for testing L1 against L is

F = ( (1/(dim L − dim L1)) ‖py − p1y‖² ) / ( (1/(n − dim L)) ‖y − py‖² ),

where p and p1 are the projections onto L and L1. We know that dim L − dim L1 = 4 − 2 = 2 and n − dim L = 25 − 4 = 21. In an earlier treatment of the data (page ) it was found that ‖y − py‖² is 101.32; ‖py − p1y‖² can be calculated from quantities already known: ‖y − p1y‖² − ‖y − py‖² = 111.50 − 101.32 = 10.18. Hence the test statistic is F = 5.09/4.82 = 1.06. Using the F distribution with 2 and 21 degrees of freedom, we see that the test probability is above 30%, implying that we may accept the hypothesis, that is, we may assume that there is a linear relation between time and concentration of hypoxanthine. This conclusion is confirmed by Figure ..

. Exercises

Exercise .: Peruvian Indians
Changes in somebody's lifestyle may have long-term physiological effects, such as changes in blood pressure.

Anthropologists measured the blood pressure of a number of Peruvian Indians who had been moved from their original primitive environment high in the Andes mountains into the so-called civilisation, the modern city life, which also is at a much lower altitude (Davin (), here from Ryan et al. ()). The anthropologists examined a sample of males (all above a certain age) who had been subjected to such a migration into civilisation. On each person the systolic and diastolic blood pressure was measured, as well as a number of additional variables such as age, number of years since migration, height, weight and pulse. Besides, yet another variable was calculated, the "fraction of lifetime in urban area", i.e. the number of years since migration divided by current age.

Table . Peruvian Indians: values of y (systolic blood pressure, mm Hg), x1 (fraction of lifetime in urban area) and x2 (weight, kg). [Values omitted.]

This exercise treats only part of the data set, namely the systolic blood pressure, which will be used as the dependent variable (y), and the fraction of lifetime in urban area and the weight, which will be used as independent variables (x1 and x2). The values of y, x1 and x2 are given in Table . (from Ryan et al. ()).
1. The anthropologists believed x1, the fraction of lifetime in urban area, to be a good indicator of how long the person had been living in the urban area, so it would be interesting to see whether x1 could indeed explain the variation in blood pressure y. The first thing to do would therefore be to fit a simple linear regression model with x1 as the independent variable. Do that!
2. From a plot of y against x1, however, it is not particularly obvious that there should be a linear relationship between y and x1, so maybe we should include one or two of the remaining explanatory variables. Since a person's weight is known to influence the blood pressure, we might try a regression model with both x1 and x2 as explanatory variables.
   a) Estimate the parameters in this model. What happens to the variance estimate?
   b) Examine the residuals in order to assess the quality of the model.
3. How is the final model to be interpreted relative to the Peruvian Indians?

Exercise .
This is a renowned two-way analysis of variance problem from the University of Copenhagen.

Five days a week, a student rides his bike from his home to the H.C. Ørsted Institutet [the building where the Mathematics department at the University of Copenhagen is situated] in the morning, and back again in the afternoon. He can choose between two different routes: one which he usually uses, and another one which he thinks might be a shortcut. To find out whether the second route really is a shortcut, he sometimes measures the time used for his journey. His results are given in the table below, which shows the travel times minus a fixed number of minutes, in units of seconds.

             shortcut          the usual route
morning      4  −1   3  13      8  11   5   7
afternoon   11   6  16  18     17  21  19

As we all know [!] it can be difficult to get away from the H.C. Ørsted Institutet by bike, so the afternoon journey will on average take a little longer than the morning journey.

Is the shortcut in fact the shorter route?

Hint: This is obviously a two-way problem, but the model is not a balanced one, and we cannot apply the usual formula for the projection onto the subspace corresponding to additivity.

Exercise .
Consider the simple linear regression model EY = α + xβ for a set of data points (yi, xi), i = 1, 2, ..., n.

What does the design matrix look like? Write down the normal equations, and solve them.

Find formulas for the standard errors (i.e. the standard deviations) of α̂ and β̂, and a formula for the correlation between the two estimators. Hint: use formula (.) on page .

In some situations the designer of the experiment can, within limits, decide the x values. What would be an optimal choice?


A  A Derivation of the Normal Distribution

The normal distribution is used very frequently in statistical models. Sometimes you can argue for using it by referring to the Central Limit Theorem, and sometimes it seems to be used mostly for the sake of mathematical convenience (or by habit). In certain modelling situations, however, you may also make arguments of the form "if you are going to analyse data in such-and-such a way, then you are implicitly assuming such-and-such a distribution". We shall now present an argument of this kind leading to the normal distribution; the main idea stems from Gauss (, book II, section ).

Suppose that we want to find a type of probability distributions that may be used to describe the random scattering of observations around some unknown value µ. We make a number of assumptions:

Assumption 1: The distributions are continuous distributions on R, i.e. they possess a continuous density function.

Assumption 2: The parameter µ may be any real number, and µ is a location parameter; hence the model function is of the form

f(x − µ),   (x, µ) ∈ R²,

for a suitable function f defined on all of R. By assumption, f must be continuous.

Assumption 3: The function f has a continuous derivative.

Assumption 4: The maximum likelihood estimate of µ is the arithmetic mean of the observations, that is, for any sample x1, x2, ..., xn from the distribution with density function x ↦ f(x − µ), the maximum likelihood estimate must be the arithmetic mean x̄.

Now we can make a number of deductions:

The log-likelihood function corresponding to the observations x1, x2, ..., xn is

ln L(µ) = ∑_{i=1}^{n} ln f(xi − µ).


This function is differentiable with

(lnL)′(µ) =n∑

i=1

g(xi −µ)

where g = −(lnf )′.Since f is a probability density function, we have liminf

x→±∞ f (x) = 0, and so

liminfx→±∞ lnL(µ) = −∞; this means that lnL attains its maximum value at (at least)

one point µ, and that this is a stationary point, i.e. (lnL)′(µ) = 0. Since we requirethat µ = x, we may conclude that

∑_{i=1}^{n} g(xi − x̄) = 0   (A.)

for all samples x1, x2, . . . , xn.

Consider a sample with two elements x and −x; the mean of this sample is 0, so equation (A.) becomes g(x) + g(−x) = 0. Hence g(−x) = −g(x) for all x. In particular, g(0) = 0.

Next, consider a sample consisting of the three elements x, y and −(x + y); again the sample mean is 0, and equation (A.) now becomes g(x) + g(y) + g(−(x + y)) = 0; we just showed that g is an odd function, so it follows that g(x + y) = g(x) + g(y), and this is true for all x and y. From Proposition A. below, such a function must be of the form g(x) = cx for some constant c. Per definition, g = −(ln f)′, so ln f(x) = b − ½cx² for some constant of integration b, and f(x) = a exp(−½cx²) where a = e^b. Since f has to be non-negative and integrate to 1, the constant c is positive and we can rename it to 1/σ², and the constant a is uniquely determined (in fact, a = 1/√(2πσ²), cf. page ). The function f is therefore necessarily the density function of the normal distribution with parameters 0 and σ², and the type of distributions searched for is the normal distributions with parameters µ and σ² where µ ∈ R and σ² > 0.
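As a small numerical sanity check of the starting point of the derivation, namely that Assumption 4 does hold for the normal family, the following sketch (not from the notes; the data are made up) evaluates the normal log-likelihood over a grid of µ values and confirms that the maximum is attained at the arithmetic mean.

```python
import numpy as np

# Check: for the density f(x - mu) = a*exp(-(x - mu)^2 / (2*sigma2)), the
# log-likelihood of a sample is maximised at the arithmetic mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=25)   # made-up sample

sigma2 = 4.0   # treat the variance as known; only mu is being estimated

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Evaluate ln L on a fine grid of mu values and locate the maximum.
grid = np.linspace(x.min(), x.max(), 10001)
mu_hat = grid[np.argmax([log_likelihood(m) for m in grid])]

print(mu_hat, x.mean())   # the two numbers agree up to the grid resolution
```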

Exercise A.
The above investigation deals with a situation with observations from a continuous distribution on the real line, and such that the maximum likelihood estimator of the location parameter is equal to the arithmetic mean of the observations.

Now assume that we instead had observations from a continuous distribution on the set of positive real numbers, that this distribution is parametrised by a scale parameter, and that the maximum likelihood estimator of the scale parameter is equal to the arithmetic mean of the observations. What can then be said about the distribution of the observations?

Cauchy’s functional equation

Here we prove a general result which was used above, and which certainly should be part of general mathematical education.

Proposition A.
Let f be a real-valued function such that

f(x + y) = f(x) + f(y)   (A.)

for all x, y ∈ R. If f is known to be continuous at a point x0 ≠ 0, then for some real number c, f(x) = cx for all x ∈ R.

Proof
First let us deduce some consequences of the requirement f(x + y) = f(x) + f(y).
1. Letting x = y = 0 gives f(0) = f(0) + f(0), i.e. f(0) = 0.
2. Letting y = −x shows that f(x) + f(−x) = f(0) = 0, that is, f(−x) = −f(x) for all x ∈ R.
3. Repeated applications of (A.) give that for any real numbers x1, x2, . . . , xn,

f(x1 + x2 + . . . + xn) = f(x1) + f(x2) + . . . + f(xn).   (A.)

Consider now a real number x ≠ 0 and positive integers p and q, and let a = p/q. Since

f(ax + ax + . . . + ax) = f(x + x + . . . + x),

where there are q terms of ax on the left and p terms of x on the right (the two sums are equal because qax = px), equation (A.) gives that q f(ax) = p f(x), or f(ax) = a f(x). We may conclude that

f(ax) = a f(x)   (A.)

for all rational numbers a and all real numbers x; the case a = p/q > 0 is the calculation just made, and it extends to a = 0 and to negative rational a because f(0) = 0 and f(−x) = −f(x).

Until now we did not make use of the continuity of f at x0, but we shall do that now. Let x be a non-zero real number. There exists a sequence (an) of non-zero rational numbers converging to x0/x. Then the sequence (an x) converges to x0, and since f is continuous at this point,

f(an x)/an −→ f(x0)/(x0/x) = (f(x0)/x0) · x.

Now, according to equation (A.), f(an x)/an = f(x) for all n, and hence f(x) = (f(x0)/x0) · x, that is, f(x) = cx where c = f(x0)/x0. Since x was arbitrary (and c does not depend on x), the proof is finished.


B Some Results from Linear Algebra

These pages present a few results from linear algebra, more precisely from the theory of finite-dimensional real vector spaces with an inner product.

Notation and definitions

The generic vector space is denoted V. Subspaces, i.e. linear subspaces, are denoted L, L1, L2, . . . Vectors are usually denoted by boldface letters (x, y, u, v etc.). The zero vector is 0. The set of coordinates of a vector relative to a given basis is written as a column matrix.

Linear transformations [and their matrices] are often denoted by letters such as A and B; the transpose of A is denoted A′. The null space (or kernel) of A is denoted N(A), and the range is denoted R(A); the rank of A is the number dim R(A). The identity transformation [the unit matrix] is denoted I.

The inner product of u and v is written as 〈u,v〉 and in matrix notation u′v. The length of u is ‖u‖.

Two vectors u and v are orthogonal, u ⊥ v, if 〈u,v〉 = 0. Two subspaces L1 and L2 are orthogonal, L1 ⊥ L2, if every vector in L1 is orthogonal to every vector in L2, and if so, the orthogonal direct sum of L1 and L2 is defined as the subspace

L1 ⊕ L2 = {u1 + u2 : u1 ∈ L1, u2 ∈ L2}.

If v ∈ L1 ⊕ L2, then v has a unique decomposition as v = u1 + u2 where u1 ∈ L1 and u2 ∈ L2. The dimension of the sum is dim(L1 ⊕ L2) = dim L1 + dim L2.

The orthogonal complement of the subspace L is denoted L⊥. We have V = L ⊕ L⊥.

The orthogonal projection of V onto the subspace L is the linear transformation p : V → V such that px ∈ L and x − px ∈ L⊥ for all x ∈ V.

A symmetric linear transformation A of V [a symmetric matrix A] is positive semidefinite if 〈x,Ax〉 ≥ 0 [or x′Ax ≥ 0] for all x; it is positive definite if the inequality is strict for all x ≠ 0. Sometimes we write A ≥ 0 or A > 0 to indicate that A is positive semidefinite or positive definite.
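To make the orthogonal projection concrete, here is a minimal numpy sketch (an illustration added here, not part of the notes; the matrix and the vector are made up). It projects a vector onto the column space R(X) of a matrix X with linearly independent columns, using the projection matrix X(X′X)⁻¹X′, and checks that the residual is orthogonal to R(X).

```python
import numpy as np

# Orthogonal projection of x onto L = R(X), the column space of X.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # made-up matrix with linearly independent columns
x = np.array([1.0, 3.0, 2.0, 5.0])  # made-up vector to be projected

# Projection matrix P = X (X'X)^{-1} X'; then Px lies in L and x - Px lies in L-perp.
P = X @ np.linalg.inv(X.T @ X) @ X.T
px = P @ x

print(px)                 # the projection, an element of R(X)
print(X.T @ (x - px))     # approximately 0: the residual is orthogonal to R(X)
```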

Various results

This proposition is probably well-known from linear algebra:


Proposition B.
Let A be a linear transformation of Rp into Rn. Then R(A) is the orthogonal complement of N(A′) (in Rn), and N(A′) is the orthogonal complement of R(A) (in Rn); in brief, Rn = R(A) ⊕ N(A′).

Corollary B.
Let A be a symmetric linear transformation of Rn. Then R(A) and N(A) are orthogonal complements of one another: Rn = R(A) ⊕ N(A).

Corollary B.
R(A) = R(AA′).

Proof of Corollary B.
According to Proposition B. it suffices to show that N(A′) = N(AA′), that is, to show that A′u = 0 ⇔ AA′u = 0.

Clearly A′u = 0 ⇒ AA′u = 0, so it remains to show that AA′u = 0 ⇒ A′u = 0. Suppose that AA′u = 0. Since A(A′u) = 0, we have A′u ∈ N(A) = R(A′)⊥, and since we always have A′u ∈ R(A′), it follows that A′u ∈ R(A′)⊥ ∩ R(A′) = {0}.
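A small numerical illustration of Corollary B. may be useful (again not from the notes; the matrix is made up): the ranks of A and AA′ agree, and every column of AA′ lies in the column space of A.

```python
import numpy as np

# Illustration of R(A) = R(AA') for a made-up A : R^2 -> R^3 of rank 2.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Equal ranks ...
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A @ A.T))

# ... and every column of AA' lies in R(A): the least-squares residual is zero.
coeffs = np.linalg.lstsq(A, A @ A.T, rcond=None)[0]
print(np.allclose(A @ coeffs, A @ A.T))   # True
```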

Proposition B.
Suppose that A and B are injective linear transformations of Rp into Rn such that AA′ = BB′. Then there exists an isometry C of Rp onto itself such that A = BC.

Proof
Let L = R(A). From the hypothesis and Corollary B. we have L = R(A) = R(AA′) = R(BB′) = R(B).

Since A is injective, N(A) = {0}; Proposition B. then gives R(A′) = {0}⊥ = Rp, that is, for each u ∈ Rp the equation A′x = u has at least one solution x ∈ Rn, and since N(A′) = R(A)⊥ = L⊥, there will be exactly one solution x(u) in L to A′x = u.

We will show that this mapping u 7→ x(u) of Rp into L is linear. Let u1 and u2 be two vectors in Rp, and let α1 and α2 be two scalars. Then

A′(x(α1u1 + α2u2)) = α1u1 + α2u2 = α1A′x(u1) + α2A′x(u2) = A′(α1x(u1) + α2x(u2)),

and since the equation A′x = u, as mentioned, has a unique solution x ∈ L, this implies that x(α1u1 + α2u2) = α1x(u1) + α2x(u2). Hence u 7→ x(u) is linear.

Now define the linear transformation C : Rp → Rp as Cu = B′x(u), u ∈ Rp. Then BCu = BB′x(u) = AA′x(u) = Au for all u ∈ Rp, showing that A = BC.

Finally let us show that C preserves inner products (and hence is an isometry): for u1, u2 ∈ Rp we have 〈Cu1,Cu2〉 = 〈B′x(u1),B′x(u2)〉 = 〈x(u1),BB′x(u2)〉 = 〈x(u1),AA′x(u2)〉 = 〈A′x(u1),A′x(u2)〉 = 〈u1,u2〉.

Proposition B.
Let A be a symmetric and positive semidefinite linear transformation of Rn into itself. Then there exists a unique symmetric and positive semidefinite linear transformation A^{1/2} of Rn into itself such that A^{1/2}A^{1/2} = A.

Proof
Let e1, e2, . . . , en be an orthonormal basis of eigenvectors of A, and let the corresponding eigenvalues be λ1, λ2, . . . , λn ≥ 0. If we define A^{1/2} to be the linear transformation that takes basis vector ej into λj^{1/2} ej, j = 1, 2, . . . , n, then we have one mapping with the requested properties.

Now suppose that A* is any other solution, that is, A* is symmetric and positive semidefinite, and A*A* = A. There exists an orthonormal basis e*1, e*2, . . . , e*n of eigenvectors of A*; let the corresponding eigenvalues be µ1, µ2, . . . , µn ≥ 0. Since Ae*j = A*A*e*j = µj² e*j, we see that e*j is an A-eigenvector with eigenvalue µj². Hence A* and A have the same eigenspaces, and for each eigenspace the A*-eigenvalue is the non-negative square root of the A-eigenvalue, and so A* = A^{1/2}.
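The construction in the proof translates directly into a few lines of numpy; the following sketch (an added illustration with a made-up matrix, not part of the notes) computes A^{1/2} from an orthonormal eigenbasis and checks that it squares to A.

```python
import numpy as np

# The matrix A below is made up (constructed as M'M, hence symmetric and
# positive semidefinite).
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])
A = M.T @ M

# eigh returns the eigenvalues and an orthonormal eigenbasis of a symmetric matrix;
# A^{1/2} maps each eigenvector e_j to sqrt(lambda_j) * e_j.
lam, E = np.linalg.eigh(A)
A_half = E @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ E.T

print(np.allclose(A_half @ A_half, A))   # True: A^{1/2} A^{1/2} = A
```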

Proposition B.
Assume that A is a symmetric and positive semidefinite linear transformation of Rn into itself, and let p = dim R(A). Then there exists an injective linear transformation B of Rp into Rn such that BB′ = A; B is unique up to isometry.

Proof
Let e1, e2, . . . , en be an orthonormal basis of eigenvectors of A, with the corresponding eigenvalues λ1, λ2, . . . , λn, where λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and λp+1 = λp+2 = . . . = λn = 0.

As B we can use the linear transformation whose matrix representation relative to the standard basis of Rp and the basis e1, e2, . . . , en of Rn is the n×p matrix whose (i, j)th entry is λi^{1/2} if i = j ≤ p and 0 otherwise. The uniqueness modulo isometry follows from Proposition B..
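Similarly, a matrix B as in the proposition can be written down from the eigendecomposition; the sketch below (an added illustration with a made-up rank-2 matrix) keeps the eigenvectors with strictly positive eigenvalues and verifies that BB′ = A.

```python
import numpy as np

# Made-up symmetric positive semidefinite 3x3 matrix of rank 2.
v = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
A = v.T @ v

lam, E = np.linalg.eigh(A)            # eigenvalues (ascending) and orthonormal eigenvectors
keep = lam > 1e-12                    # the strictly positive eigenvalues
B = E[:, keep] * np.sqrt(lam[keep])   # n x p matrix with columns sqrt(lambda_j) * e_j

print(B.shape)                        # (3, 2): p = dim R(A) = 2
print(np.allclose(B @ B.T, A))        # True: BB' = A
```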


C Statistical Tables

In the pre-computer era a collection of statistical tables was an indispensable tool for the working statistician. The following pages contain some examples of such statistical tables, namely tables of quantiles of a few distributions occurring in hypothesis testing.

Recall that for a probability distribution with a strictly increasing and continuous distribution function F, the α-quantile xα is defined as the solution of the equation F(x) = α; here 0 < α < 1. (If F is not known to be strictly increasing and continuous, we may define the set of α-quantiles as the closed interval whose endpoints are sup{x : F(x) < α} and inf{x : F(x) > α}.)
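Today such quantiles are usually computed by software rather than read from printed tables. As an added illustration (assuming the scipy library is available), the ppf functions of scipy.stats, the inverses of the distribution functions, return quantiles of the kind the tables below provide; the chosen values of α and the degrees of freedom are only examples.

```python
from scipy import stats

print(stats.norm.ppf(0.975))             # quantile of the standard normal distribution
print(stats.chi2.ppf(0.95, df=7))        # chi^2 distribution with 7 degrees of freedom
print(stats.f.ppf(0.95, dfn=3, dfd=12))  # F distribution, numerator df 3, denominator df 12
print(stats.t.ppf(0.975, df=9))          # t distribution with 9 degrees of freedom
```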


The distribution function Φ(u) of the standard normal distribution

u u + 5 Φ(u) u u + 5 Φ(u)


Quantiles of the standard normal distribution

α uα = Φ−1(α) uα + 5 α uα = Φ−1(α) uα + 5



Quantiles of the χ2 distribution with f degrees of freedom

Rows: degrees of freedom f; columns: probability in percent.



% quantiles of the F distribution

f1 is the number of degrees of freedom of the numerator,
f2 is the number of degrees of freedom of the denominator.


% quantiles of the F distribution

f1 is the number of degrees of freedom of the numerator,
f2 is the number of degrees of freedom of the denominator.



.% quantiles of the F distribution

f1 is the number of degrees of freedom of the numerator,
f2 is the number of degrees of freedom of the denominator.


% quantiles of the F distribution

f1 is the number of degrees of freedom of the numerator,
f2 is the number of degrees of freedom of the denominator.



Quantiles of the t distribution with f degrees of freedom
Probability in percent


D Dictionaries

Danish-English

afbildning map
afkomstfordeling offspring distribution
betinget sandsynlighed conditional probability
binomialfordeling binomial distribution
Cauchyfordeling Cauchy distribution
central estimator unbiased estimator
Central Grænseværdisætning Central Limit Theorem
delmængde subset
Den Centrale Grænseværdisætning The Central Limit Theorem
disjunkt disjoint
diskret discrete
eksponentialfordeling exponential distribution
endelig finite
ensidet variansanalyse one-way analysis of variance
enstikprøveproblem one-sample problem
estimat estimate
estimation estimation
estimator estimator
etpunktsfordeling one-point distribution
etpunktsmængde singleton
flerdimensional fordeling multivariate distribution
fordeling distribution
fordelingsfunktion distribution function
foreningsmængde union
forgreningsproces branching process
forklarende variabel explanatory variable
formparameter shape parameter
forventet værdi expected value
fraktil quantile
frembringende funktion generating function
frihedsgrader degrees of freedom
fællesmængde intersection
gammafordeling gamma distribution
gennemsnit average, mean
gentagelse repetition, replicate
geometrisk fordeling geometric distribution
hypergeometrisk fordeling hypergeometric distribution
hypotese hypothesis
hypoteseprøvning hypothesis testing
hyppighed frequency
hændelse event
indikatorfunktion indicator function
khi i anden chi squared
klassedeling partition
kontinuert continuous
korrelation correlation
kovarians covariance
kvadratsum sum of squares
kvotientteststørrelse likelihood ratio test statistic
ligefordeling uniform distribution
likelihood likelihood
likelihoodfunktion likelihood function
maksimaliseringsestimat maximum likelihood estimate
maksimaliseringsestimator maximum likelihood estimator
marginal fordeling marginal distribution
middelværdi expected value; mean
multinomialfordeling multinomial distribution
mængde set
niveau level
normalfordeling normal distribution
nævner (i brøk) denominator
odds odds
parameter parameter
plat eller krone heads or tails
poissonfordeling Poisson distribution
produktrum product space
punktsandsynlighed point probability
regression regression
relativ hyppighed relative frequency
residual residual
residualkvadratsum residual sum of squares
række row
sandsynlighed probability
sandsynlighedsfunktion probability function
sandsynlighedsmål probability measure
sandsynlighedsregning probability theory
sandsynlighedsrum probability space
sandsynlighedstæthedsfunktion probability density function
signifikans significance
signifikansniveau level of significance
signifikant significant
simultan fordeling joint distribution
simultan tæthed joint density
skalaparameter scale parameter
skøn estimate
statistik statistics
statistisk model statistical model
stikprøve sample, random sample
stikprøvefunktion statistic
stikprøveudtagning sampling
stokastisk uafhængighed stochastic independence
stokastisk variabel random variable
Store Tals Lov Law of Large Numbers
Store Tals Stærke Lov Strong Law of Large Numbers
Store Tals Svage Lov Weak Law of Large Numbers
støtte support
sufficiens sufficiency
sufficient sufficient
systematisk variation systematic variation
søjle column
test test
teste test
testsandsynlighed test probability
teststørrelse test statistic
tilbagelægning (med/uden) replacement (with/without)
tilfældig variation random variation
tosidet variansanalyse two-way analysis of variance
tostikprøveproblem two-sample problem
totalgennemsnit grand mean
tæller (i brøk) numerator
tæthed density
tæthedsfunktion density function
uafhængig independent
uafhængige identisk fordelte independent identically distributed; i.i.d.
uafhængighed independence
udfald outcome
udfaldsrum sample space
udtage en stikprøve sample
uendelig infinite
variansanalyse analysis of variance
variation variation
vekselvirkning interaction

English-Danish

analysis of variance variansanalyse
average gennemsnit
binomial distribution binomialfordeling
branching process forgreningsproces
Cauchy distribution Cauchyfordeling
Central Limit Theorem Den Centrale Grænseværdisætning
chi squared khi i anden
column søjle
conditional probability betinget sandsynlighed
continuous kontinuert
correlation korrelation
covariance kovarians
degrees of freedom frihedsgrader
denominator nævner
density tæthed
density function tæthedsfunktion
discrete diskret
disjoint disjunkt
distribution fordeling
distribution function fordelingsfunktion
estimate estimat, skøn
estimation estimation
estimator estimator
event hændelse
expected value middelværdi; forventet værdi
explanatory variable forklarende variabel
exponential distribution eksponentialfordeling
finite endelig
frequency hyppighed
gamma distribution gammafordeling
generating function frembringende funktion
geometric distribution geometrisk fordeling
grand mean totalgennemsnit
heads or tails plat eller krone
hypergeometric distribution hypergeometrisk fordeling
hypothesis hypotese
hypothesis testing hypoteseprøvning
i.i.d. = independent identically distributed
independence uafhængighed
independent uafhængig
indicator function indikatorfunktion
infinite uendelig
interaction vekselvirkning
intersection fællesmængde
joint density simultan tæthed
joint distribution simultan fordeling
Law of Large Numbers Store Tals Lov
level niveau
level of significance signifikansniveau
likelihood likelihood
likelihood function likelihoodfunktion
likelihood ratio test statistic kvotientteststørrelse
map afbildning
marginal distribution marginal fordeling
maximum likelihood estimate maksimaliseringsestimat
maximum likelihood estimator maksimaliseringsestimator
mean gennemsnit, middelværdi
multinomial distribution multinomialfordeling
multivariate distribution flerdimensional fordeling
normal distribution normalfordeling
numerator tæller
odds odds
offspring distribution afkomstfordeling
one-point distribution etpunktsfordeling
one-sample problem enstikprøveproblem
one-way analysis of variance ensidet variansanalyse
outcome udfald
parameter parameter
partition klassedeling
point probability punktsandsynlighed
Poisson distribution poissonfordeling
probability sandsynlighed
probability density function sandsynlighedstæthedsfunktion
probability function sandsynlighedsfunktion
probability measure sandsynlighedsmål
probability space sandsynlighedsrum
probability theory sandsynlighedsregning
product space produktrum
quantile fraktil
r.v. = random variable
random sample stikprøve
random variable stokastisk variabel
random variation tilfældig variation
regression regression
relative frequency relativ hyppighed
replacement (with/without) tilbagelægning (med/uden)
residual residual
residual sum of squares residualkvadratsum
row række
sample stikprøve; at udtage en stikprøve
sample space udfaldsrum
sampling stikprøveudtagning
scale parameter skalaparameter
set mængde
shape parameter formparameter
significance signifikans
significant signifikant
singleton etpunktsmængde
statistic stikprøvefunktion
statistical model statistisk model
statistics statistik
Strong Law of Large Numbers Store Tals Stærke Lov
subset delmængde
sufficiency sufficiens
sufficient sufficient
sum of squares kvadratsum
support støtte
systematic variation systematisk variation
test at teste; et test
test probability testsandsynlighed
test statistic teststørrelse
two-sample problem tostikprøveproblem
two-way analysis of variance tosidet variansanalyse
unbiased estimator central estimator
uniform distribution ligefordeling
union foreningsmængde
variation variation
Weak Law of Large Numbers Store Tals Svage Lov

Bibliography

Andersen, E. B. (). Multiplicative Poisson models with unequal cell rates, Scandinavian Journal of Statistics : –.

Bachelier, L. (). Théorie de la spéculation, Annales scientifiques de l'École Normale Supérieure, e série : –.

Bayes, T. (). An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society of London : –.

Bliss, C. I. and Fisher, R. A. (). Fitting the negative binomial distribution to biological data and Note on the efficient fitting of the negative binomial, Biometrics : –.

Bortkiewicz, L. (). Das Gesetz der kleinen Zahlen, Teubner, Leipzig.

Davin, E. P. (). Blood pressure among residents of the Tambo Valley, Master's thesis, The Pennsylvania State University.

Fisher, R. A. (). On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London, Series A : –.

Forbes, J. D. (). Further experiments and remarks on the measurement of heights by the boiling point of water, Transactions of the Royal Society of Edinburgh : –.

Gauss, K. F. (). Theoria motus corporum coelestium in sectionibus conicis solem ambientum, F. Perthes und I.H. Besser, Hamburg.

Greenwood, M. and Yule, G. U. (). An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents, Journal of the Royal Statistical Society : –.

Hald, A. (, ). Statistiske Metoder, Akademisk Forlag, København.

Hald, A. (). Statistical Theory with Engineering Applications, John Wiley, London.

Kolmogorov, A. (). Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer, Berlin.

Kotz, S. and Johnson, N. L. (eds) (). Breakthroughs in Statistics, Vol. , Springer-Verlag, New York.

Larsen, J. (). Basisstatistik, . udgave, IMFUFA tekst nr , Roskilde Universitetscenter.

Lee, L. and Krutchkoff, R. G. (). Mean and variance of partially-truncated distributions, Biometrics : –.

Newcomb, S. (). Measures of the velocity of light made under the direction of the Secretary of the Navy during the years -, Astronomical Papers : –.

Pack, S. E. and Morgan, B. J. T. (). A mixture model for interval-censored time-to-response quantal assay data, Biometrics : –.

Ryan, T. A., Joiner, B. L. and Ryan, B. F. (). MINITAB Student Handbook, Duxbury Press, North Scituate, Massachusetts.

Sick, K. (). Haemoglobin polymorphism of cod in the Baltic and the Danish Belt Sea, Hereditas : –.

Stigler, S. M. (). Do robust estimators work with real data?, The Annals of Statistics : –.

Weisberg, S. (). Applied Linear Regression, Wiley series in Probability and Mathematical Statistics, John Wiley & Sons.

Wiener, N. (). Collected Works, Vol. , MIT Press, Cambridge, MA. Edited by P. Masani.

Index

01-variables , – generating function – mean and variance – one-sample problem ,

additivity – hypothesis of a.

analysis of variance– one-way , – two-way ,

analysis of variance table , ANOVA see analysis of varianceapplied statistics arithmetic mean asymptotic χ2 distribution asymptotic normality ,

balanced model Bartlett’s test , Bayes’ formula , , Bayes, T. (-) Bernoulli variable see 01-variablesbinomial coefficient

– generalised binomial distribution , , , ,

– comparison , , – convergence to Poisson distribution,

– generating function – mean and variance – simple hypothesis

binomial series Borel σ -algebra Borel, É. (-)

branching process

Cauchy distribution Cauchy’s functional equation Cauchy, A.L. (-) Cauchy-Schwarz inequality , cell Central Limit Theorem Chebyshev inequality Chebyshev, P. (-) χ2 distribution

– table column column effect , conditional distribution conditional probability connected model Continuity Theorem for generating func-

tions continuous distribution , correlation covariance , covariance matrix covariate , , , Cramér-Rao inequality

degrees of freedom– in Bartlett’s test – in F test , , – in t test , – of −2lnQ , – of χ2 distribution – of sum of squares , – of variance estimate , , ,,


density function dependent variable design matrix distribution function ,

– general dose-response model dot notation

equality of variances estimate , estimated regression line , estimation , estimator event examples

– accidents in a weapon factory – Baltic cod , , – flour beetles , , , ,, ,

– Forbes’ data on boiling points ,

– growing of potatoes – horse-kicks , – lung cancer in Fredericia – mercury in swordfish – mites on apple leaves – Peruvian Indians – speed of light , , – strangling of dogs , – vitamin C , ,

expectation– countable case – finite case

expected value– continuous distribution – countable case – finite case

explained variable explanatory variable , exponential distribution

Φ – table

ϕ F distribution

– table F test , , , , factor (in linear models) Fisher, R.A. (-) fitted value frequency interpretation ,

gamma distribution , gamma function Γ function Gauss, C.F. (-) , Gaussian distribution see normal dis-

tributiongeneralised linear regression generating function

– Continuity Theorem – of a sum

geometric distribution , , geometric mean geometric series Gosset, W.S. (-)

Hardy-Weinberg equilibrium homogeneity of variances , homoscedasticity see homogeneity of

varianceshypergeometric distribution hypothesis of additivity

– test hypothesis testing hypothesis, statistical ,

i.i.d. independent events independent experiments independent random variables , ,

independent variable indicator function , injective parametrisation , intensity , interaction , interaction variation

Jensen’s inequality Jensen, J.L.W.V. (-)


joint density function joint distribution joint probability function

Kolmogorov, A. (-) Kronecker’s δ

L 1 L 2 Law of Large Numbers , , least squares level of significance likelihood function , , likelihood ratio test statistic linear normal model linear regression

– simple linear regression, simple location parameter

– normal distribution log-likelihood function , logarithmic distribution , logistic regression , , logit

marginal distribution marginal probability function Markov inequality Markov, A. (-) mathematical statistics maximum likelihood estimate maximum likelihood estimator , ,

– asymptotic normality – existence and uniqueness

mean value– continuous distribution – countable case – finite case

measurable map model function model validation multinomial coefficient multinomial distribution , , ,

– mean and variance

multiplicative Poisson model multivariate continuous distribution multivariate normal distribution multivariate random variable ,

– mean and variance

negative binomial distribution , ,

– convergence to Poisson distribution,

– expected value , – generating function – variance ,

normal distribution , – derivation – one-sample problem , , – two-sample problem , ,

normal equations ,

observation , observation space odds , offspring distribution one-point distribution ,

– generating function one-sample problem

– 01-variables , – normal variables , , ,

– Poisson variables , one-sided test one-way analysis of variance , outcome , outlier

parameter , parameter space partition , partitioning the sum of squares point probability , Poisson distribution , , , , ,

– as limit of negative binomial distri-bution

– expected value – generating function


– limit of binomial distribution ,

– limit of negative binomial distribu-tion

– one-sample problem , – variance

Poisson, S.-D. (-) posterior probability prior probability probability bar probability density function probability function , probability mass function probability measure , , probability parameter

– geometric distribution – negative binomial distribution ,

– of binomial distribution probability space , , problem of estimation projection , ,

quadratic scale parameter quantiles

– of χ2 distribution – of F distribution – of standard normal disribution – of t distribution

random numbers random variable , , ,

– independence , , – multivariate ,

random variation , random vector random walk regression

– logistic – multiple linear – simple linear ,

regression analysis regression line, estimated , regular normal distribution residual ,

residual sum of squares , residual vector response variable row row effect ,

σ -additive σ -algebra sample space , , , , scale parameter

– Cauchy distribution – exponential distribution , – gamma distribution – normal distribution

Schwarz, H.A. (-) shape parameter

– gamma distribution – negative binomial distribution ,

simple random sampling – without replacement

St. Petersburg paradox standard deviation standard error , standard normal distribution ,

– quantiles – table

statistic statistical hypothesis , statistical model , stochastic process , Student see GossetStudent’s t sufficient statistic , , systematic variation ,

t distribution , – table

t test , , , test

– Bartlett’s test – F test , , , , – homogeneity of variances – likelihood ratio – one-sided


– t test , , , – two-sided

test probability , test statistic testing hypotheses , transformation of distributions trinomial distribution two-sample problem

– normal variables , , ,

two-sided test two-way analysis of variance ,

unbiased estimator , , unbiased variance estimate uniform distribution

– continuous , , – finite sample space – on finite set

validation– beetles example

variable– dependent – explained – explanatory – independent

variance , variance homogeneity , variance matrix variation

– between groups – random , – systematic , – within groups ,

waiting time , Wiener process within groups variation