Probability and Statistics
Vittoria Silvestri
Statslab, Centre for Mathematical Sciences, University of Cambridge, Wilber-
force Road, Cambridge CB3 0WB, UK
Contents

Preface
Chapter 1. Discrete probability
1. Introduction
2. Combinatorial analysis
3. Exercises
4. Stirling's formula
5. Properties of probability measures
6. Exercises
7. Independence
8. Conditional probability
9. Exercises
10. Some natural probability distributions
11. Exercises
12. Additional exercises
13. Random variables
14. Expectation
15. Exercises
16. Variance and covariance
17. Exercises
18. Inequalities
19. Exercises
20. Probability generating functions
21. Conditional distributions
22. Exercises
Chapter 2. Continuous probability
1. Some natural continuous probability distributions
2. Continuous random variables
3. Exercises
4. Transformations of one-dimensional random variables
5. Exercises
6. Moment generating functions
7. Exercises
8. Multivariate distributions
9. Exercises
10. Transformations of two-dimensional random variables
11. Limit theorems
12. Exercises
Chapter 3. Statistics
1. Parameter estimation
2. Exercises
3. Mean square error
4. Confidence intervals
5. Exercises
6. Bayesian statistics
7. Exercises
Appendix A. Background on set theory
1. Table of Φ(z)
Preface
These lecture notes are for the NYU Shanghai course Probability and Statistics, given in Fall
2018. The contents are closely based on the following lecture notes, all available online:
• James Norris: http://www.statslab.cam.ac.uk/~james/Lectures/p.pdf
• Douglas Kennedy: http://trin-hosts.trin.cam.ac.uk/fellows/dpk10/IA/IAprob.html
• Richard Weber: http://www.statslab.cam.ac.uk/~rrw1/stats/Sa4.pdf
Please send comments and corrections to [email protected].
CHAPTER 1
Discrete probability
1. Introduction
This course concerns the study of experiments with random outcomes, such as throwing a die,
tossing a coin or blindly drawing a card from a deck. Say that the set of possible outcomes is
Ω = {ω1, ω2, ω3, . . .}.
We call Ω the sample space, while its elements are called outcomes. A subset A ⊆ Ω is called an
event.
Example 1.1 (Throwing a die). Toss a normal six-faced die: the sample space is Ω =
{1, 2, 3, 4, 5, 6}. Examples of events are:
{5} (the outcome is 5)
{2, 4, 6} (the outcome is even)
{3, 6} (the outcome is divisible by 3)
Example 1.2 (Drawing a card). Draw a card from a standard deck: Ω is the set of all possible
cards, so that |Ω| = 52. Examples of events are:
A1 = {the card is a Jack}, |A1| = 4
A2 = {the card is Diamonds}, |A2| = 13
A3 = {the card is not the Queen of Spades}, |A3| = 51.
Example 1.3 (Picking a natural number). Pick any natural number: the sample space is
Ω = N. Examples of events are:
{the number is at most 5} = {0, 1, 2, 3, 4, 5}
{the number is even} = {2, 4, 6, 8, . . .}
{the number is not 7} = N \ {7}.
Example 1.4 (Picking a real number). Pick any number in the closed interval [0, 1]: the sample
space is Ω = [0, 1]. Examples of events are:
{x : x < 1/3} = [0, 1/3)
{x : x ≠ 0.7} = [0, 1] \ {0.7} = [0, 0.7) ∪ (0.7, 1]
{x : x = 2^{−n} for some n ∈ N} = {1, 1/2, 1/4, 1/8, . . .}.
Recall that a set Ω is said to be countable if there exists a bijection between Ω and a subset of
N. A set is said to be uncountable if it is not countable. Thus, for example, any finite set is
countable, the set of even (or odd) integers is countable, but R is uncountable. Note that the
sample space Ω is finite in the first two examples, infinite but countable in the third example,
and uncountable in the last example.
Remark 1.5. For the first part of the course we will restrict to countable sample spaces, thus
excluding Example 1.4 above.
We can now give a general definition.
Definition 1.6. Let Ω be any set, and F be a set of subsets of Ω. We say that F is a σ-algebra
if
- Ω ∈ F ,
- if A ∈ F , then Ac ∈ F ,
- for every sequence (An)n≥1 in F , it holds ⋃_{n=1}^∞ An ∈ F .
Assume that F is a σ-algebra. A function P : F → [0, 1] is called a probability measure if
- P(Ω) = 1,
- for any sequence of disjoint events (An)n≥1 it holds
P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
The triple (Ω,F ,P) is called a probability space.
Remark 1.7. In the case of a countable sample space we take F to be the set of all subsets of Ω,
unless otherwise stated.
We think of F as the collection of observable events. If A ∈ F , then P(A) is the probability of
the event A. In some probability models, such as the one in Example 1.4, the probability of
each individual outcome is 0. This is one reason why we need to specify probabilities of events
rather than outcomes.
2. COMBINATORIAL ANALYSIS 9
1.1. Equally likely outcomes. The simplest case is that of a finite sample space Ω and
equally likely outcomes:
P({ω}) = 1/|Ω| ∀ω ∈ Ω.
Note that by definition of probability measure this implies that
P(A) = |A|/|Ω| ∀A ∈ F .
Moreover
• P(∅) = 0,
• if A ⊆ B then P(A) ≤ P(B),
• if A ∩B = ∅ then P(A ∪B) = P(A) + P(B),
• P(A ∪B) = P(A) + P(B)− P(A ∩B),
• P(A ∪B) ≤ P(A) + P(B).
To check that P is a probability measure, note that P(Ω) = |Ω|/|Ω| = 1, and for disjoint events
(Ak)_{k=1}^n it holds
P(⋃_{k=1}^n Ak) = |A1 ∪ A2 ∪ . . . ∪ An|/|Ω| = Σ_{k=1}^n |Ak|/|Ω| = Σ_{k=1}^n P(Ak),
as wanted.
Example 1.8. When throwing a fair die there are 6 possible outcomes, all equally likely. Then
Ω = {1, 2, 3, 4, 5, 6}, P({i}) = 1/6 for i = 1 . . . 6.
So P(even outcome) = P({2, 4, 6}) = 1/2, while P(outcome ≤ 5) = P({1, 2, 3, 4, 5}) = 5/6.
2. Combinatorial analysis
We have seen that it is often necessary to be able to count the number of subsets of Ω with a
given property. We now take a systematic look at some counting methods.
2.1. Multiplication rule. Take N finite sets Ω1, Ω2 . . . ΩN (some of which might coincide),
with cardinalities |Ωk| = nk. Imagine picking one element from each set: how many ways are
there to do so? Clearly, we have n1 choices for the first element. Now, for each
choice of the first element, we have n2 choices for the second, so that
|Ω1 × Ω2| = n1n2.
Once the first two elements are chosen, we have n3 choices for the third, and so on, giving
|Ω1 × Ω2 × . . . × ΩN | = n1n2 · · · nN .
We refer to this as the multiplication rule.
10 1. DISCRETE PROBABILITY
Example 2.1. A restaurant offers 6 starters, 7 main courses and 5 desserts. The number of
possible three–course meals is then 6× 7× 5 = 210.
Example 2.2. Throw 3 dice. Then the number of outcomes given by an even number, a 6 and
then a number smaller than 4 is 3 × 1 × 3 = 9.
Example 2.3 (The number of subsets). Suppose a set Ω = {ω1, ω2 . . . ωn} has n elements. How
many subsets does Ω have? We proceed as follows. To each subset A of Ω we can associate a
sequence of 0's and 1's of length n so that the ith number is 1 if ωi is in A, and 0 otherwise.
Thus if, say, Ω = {ω1, ω2, ω3, ω4} then
A1 = {ω1} → (1, 0, 0, 0)
A2 = {ω1, ω3, ω4} → (1, 0, 1, 1)
A3 = ∅ → (0, 0, 0, 0).
This defines a bijection between the subsets of Ω and the strings of 0's and 1's of length n.
Thus we have to count the number of such strings. Since for each element we have 2 choices
(either 0 or 1), there are 2^n strings. This shows that a set of n elements has 2^n subsets. Note
that this also counts the number of functions from a set of n elements to {0, 1}.
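The bijection above is easy to implement directly. The following Python sketch enumerates all 0/1 strings of length n and maps each one to the corresponding subset, confirming the 2^n count for a small example.

```python
from itertools import product

def subsets_via_bitstrings(elements):
    """Map each 0/1 string of length n to the subset of `elements`
    whose ith member is included exactly when the ith bit is 1."""
    return [{e for e, bit in zip(elements, bits) if bit}
            for bits in product([0, 1], repeat=len(elements))]

subs = subsets_via_bitstrings(["w1", "w2", "w3", "w4"])
assert len(subs) == 2 ** 4   # a 4-element set has 2^4 = 16 subsets
assert set() in subs         # the empty set corresponds to (0, 0, 0, 0)
```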
2.2. Permutations. How many possible orderings of n elements are there? Label the
elements 1, 2 . . . n. A permutation is a bijection from 1, 2 . . . n to itself, i.e. an ordering of
the elements. We may obtain all permutations by subsequently choosing the image of element
1, then the image of element 2 and so on. We have n choices for the image of 1, then n − 1
choices for the image of 2, n− 2 choices for the image of 3 until we have only one choice for the
image of n. Thus the total number of choices is, by the multiplication rule,
n! = n(n− 1)(n− 2) · · · 1.
Thus there are n! different orderings, or permutations, of n elements. Equivalently, there are n!
different bijections between any two sets of n elements.
Example 2.4. There are 52! possible orderings of a standard deck of cards.
2.3. Subsets. How many ways are there to choose k elements from a set of n elements?
2.3.1. With ordering. We have n choices for the first element, n − 1 choices for the second
element and so on, ending with n − k + 1 choices for the kth element. Thus there are
(2.1) n(n − 1) · · · (n − k + 1) = n!/(n − k)!
ways to choose k ordered elements from n. An alternative way to obtain the above formula is
the following: to pick k ordered elements from n, first pick a permutation of the n elements (n!
choices), then forget all elements but the first k. Since for each choice of the first k elements
there are (n − k)! permutations starting with those k elements, we again obtain (2.1).
2.3.2. Without ordering. To choose k unordered elements from n, we could first choose k
ordered elements, and then forget about the order. Recall that there are n!/(n − k)! possible
ways to choose k ordered elements from n. Moreover, any given k elements can be ordered in
k! possible ways. Thus there are
(n choose k) = n!/(k!(n − k)!)
possible ways to choose k unordered elements from n.
More generally, suppose we have integers n1, n2 . . . nk with n1 + n2 + · · · + nk = n. Then we
have
(n choose n1, . . . , nk) = n!/(n1! . . . nk!)
possible ways to partition n elements into k subsets of cardinalities n1, . . . , nk.
Example 2.5. Imagine a box containing n balls. Then there are:
• n!/(n − k)! ordered ways to pick k balls,
• (n choose k) unordered ways to pick k balls,
• (n choose n1, n2, n3) ways to subdivide the balls into 3 groups of n1, n2 and n3 balls each.
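These counts can be verified by brute-force enumeration. The sketch below uses Python's itertools to generate the ordered and unordered picks for small n, k and checks the counts against the formulas (the group sizes 1, 2, 3 are an arbitrary illustration).

```python
from itertools import combinations, permutations
from math import comb, factorial

n, k = 6, 3
# ordered picks of k distinct elements from n: n!/(n-k)!  (here 6*5*4 = 120)
ordered = list(permutations(range(n), k))
assert len(ordered) == factorial(n) // factorial(n - k)
# unordered picks: each set of k elements was counted k! times above
unordered = list(combinations(range(n), k))
assert len(unordered) == len(ordered) // factorial(k) == comb(n, k)
# multinomial coefficient: split n elements into labeled groups of sizes 1, 2, 3
n1, n2, n3 = 1, 2, 3
multinomial = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
assert multinomial == comb(n, n1) * comb(n - n1, n2)  # choose the groups in turn
```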
2.4. Subsets with repetitions. How many ways are there to choose k elements from a
set of n elements, allowing repetitions?
2.4.1. With ordering. We have n choices for the first element, n choices for the second
element and so on. Thus there are
nk = n× n× · · · × n
possible ways to choose k ordered elements from n, allowing repetitions.
2.4.2. Without ordering. Suppose we want to choose k elements from n, allowing repetitions
but discarding the order. How many ways do we have to do so? Note that naïvely dividing
n^k by k! doesn't give the right answer, since there may be repetitions. Instead, we count as
follows. Label the n elements 1, 2 . . . n, and for each element draw a ∗ each time it is picked,
separating consecutive elements by vertical lines:
1 2 3 . . . n
∗∗ | ∗ | | . . . | ∗∗∗
Note that there are k ∗'s and n − 1 vertical lines. Now delete the numbers:
(2.2) ∗∗ | ∗ | | . . . | ∗∗∗
The above diagram uniquely identifies an unordered set of (possibly repeated) k elements. Thus
we simply have to count how many such diagrams there are. The only restriction is that there
must be n− 1 vertical lines and k ∗’s. Since there are n+ k − 1 locations, we can fix such a
diagram by assigning the positions of the ∗'s, which can be done in
(n + k − 1 choose k)
ways. This therefore counts the number of ways to choose k unordered elements from n,
allowing repetitions.
Example 2.6. Imagine again a box with n balls, but now each time a ball is picked, it
is put back in the box (so that it can be picked again). There are:
• n^k ordered ways to pick k balls, and
• (n + k − 1 choose k) unordered ways to pick k balls.
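The counting argument above (often called "stars and bars") can be checked by enumeration: Python's combinations_with_replacement generates exactly the unordered picks with repetition.

```python
from itertools import combinations_with_replacement, product
from math import comb

n, k = 5, 3
# ordered picks with repetition: n^k
assert len(list(product(range(n), repeat=k))) == n ** k
# unordered picks with repetition (stars and bars): (n+k-1 choose k)
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k)   # = (7 choose 3) = 35
```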
Example 2.7 (Increasing and non-decreasing functions). An increasing function from {1, 2 . . . k}
to {1, 2 . . . n} is uniquely determined by its range, which is a subset of {1, 2 . . . n} of size k.
Vice versa, each such subset determines a unique increasing function. This bijection tells us
that there are (n choose k) increasing functions from {1, 2 . . . k} to {1, 2 . . . n}, since this is the number
of subsets of size k of {1, 2, . . . n}.
How about non-decreasing functions? There is a bijection from the set of non-decreasing
functions f : {1, 2 . . . k} → {1, 2 . . . n} to the set of increasing functions g : {1, 2 . . . k} →
{1, 2 . . . n + k − 1}, given by
g(i) = i + f(i) − 1.
Hence the number of such non-decreasing functions is (n + k − 1 choose k).
3. Exercises
Exercise 1. Define a probability space. Take Ω = {a, b, c}. Which ones of the following
are σ-algebras? Explain.
• F = {∅, {a}},
• F = {∅, {a}, {b, c}, {a, b, c}},
• F = {∅, {a}, {b}, {c}, {b, c}, {a, b, c}},
• F = {∅, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}.
Exercise 2. Let Ω be a sample space, and A be an event. Show that P(Ac) = 1− P(A).
Exercise 3. How many ways are there to divide a deck of 52 cards into two decks of 26
each? How many ways are there to do so, ensuring that each of the two decks contains exactly
13 black and 13 red cards?
Exercise 4. Four mice are chosen (without replacement) from a litter, two of which are
white. The probability that both white mice are chosen is twice the probability that neither is
chosen. How many mice are there in the litter?
Exercise 5. Suppose that n (non-identical) balls are tossed at random into n boxes. What
is the probability that no box is empty? What is the probability that all balls end up in the
same box?
Exercise 6. [Ordered partitions] An ordered partition of k of size n is a sequence
(k1, k2 . . . kn) of non-negative integers such that k1 + · · ·+kn = k. How many ordered partitions
of k of size n are there?
Exercise 7. [The birthday problem] If there are n people in the room, what is the
probability that at least two people have the same birthday?
3.1. Recap of formulas.
• Permutations of n elements: n!.
• Choose k elements from n, no repetitions:
– with ordering: n!/(n − k)!,
– without ordering: (n choose k).
• Choose k elements from n, with repetitions:
– with ordering: n^k,
– without ordering: (n − 1 + k choose k).
4. Stirling’s formula
We have seen how factorials are ubiquitous in combinatorial analysis. It is therefore important
to be able to provide asymptotics for n! as n becomes large, which is often the case of interest.
Recall that we write an ∼ bn to mean that an/bn → 1 as n→∞.
Theorem 4.1 (Stirling's formula). As n → ∞ we have
n! ∼ √(2π) n^{n+1/2} e^{−n}.
Note that this implies that
log(n!) ∼ log(√(2π) n^{n+1/2} e^{−n})
as n → ∞. We now prove this weaker statement.
Set
l_n = log(n!) = log 1 + log 2 + · · · + log n.
Write ⌊x⌋ for the integer part of x. Then
log⌊x⌋ ≤ log x ≤ log⌊x + 1⌋.
Integrate over the interval [1, n] to get
l_{n−1} ≤ ∫_1^n log x dx ≤ l_n,
from which
∫_1^n log x dx ≤ l_n ≤ ∫_1^{n+1} log x dx.
Integrating by parts we find
∫_1^n log x dx = n log n − n + 1,
from which we deduce that
n log n − n + 1 ≤ l_n ≤ (n + 1) log(n + 1) − n.
Since
n log n − n + 1 ∼ n log n, (n + 1) log(n + 1) − n ∼ n log n,
dividing through by n log n and taking the limit as n → ∞ we find that l_n ∼ n log n, as wanted.
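Stirling's approximation is also easy to test numerically. The sketch below compares log(n!) — computed via Python's lgamma, using lgamma(n + 1) = log(n!) — with the logarithm of the approximation; the gap is known to behave like 1/(12n), which the assertions check.

```python
from math import lgamma, log, pi

def log_stirling(n):
    # log of the Stirling approximation sqrt(2*pi) * n^(n+1/2) * e^(-n)
    return 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n

# lgamma(n + 1) = log(n!); the gap closes like 1/(12n)
for n in (10, 100, 1000, 10_000):
    gap = lgamma(n + 1) - log_stirling(n)
    assert 0 < gap < 1 / (12 * n)
```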
Example 4.2. Suppose we have 4n balls, of which 2n are red and 2n are black. We put them
randomly into two urns, so that each urn contains 2n balls. What is the probability that each
urn contains exactly n red balls and n black balls? Call this probability p_n. Note that there
are (4n choose 2n) ways to distribute the balls into the two urns, (2n choose n)(2n choose n) of which will result in each urn
containing exactly n balls of each color. Thus
p_n = (2n choose n)(2n choose n) / (4n choose 2n) = ((2n)!/n!)^4 · 1/(4n)!
∼ (√(2π) e^{−2n} (2n)^{2n+1/2} / (√(2π) e^{−n} n^{n+1/2}))^4 · 1/(√(2π) e^{−4n} (4n)^{4n+1/2}) = √(2/(πn)).
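The asymptotic √(2/(πn)) can be checked against the exact value of p_n, computed with exact rational arithmetic. The relative error decays roughly like 1/n (the 1/(4n) tolerance below is an empirical bound, not part of the notes).

```python
from fractions import Fraction
from math import comb, pi, sqrt

def p_n(n):
    # exact value: (2n choose n)^2 / (4n choose 2n)
    return Fraction(comb(2 * n, n) ** 2, comb(4 * n, 2 * n))

assert p_n(1) == Fraction(2, 3)
# compare with the Stirling asymptotic sqrt(2/(pi*n))
for n in (10, 100, 1000):
    ratio = float(p_n(n)) / sqrt(2 / (pi * n))
    assert abs(ratio - 1) < 1 / (4 * n)
```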
5. Properties of probability measures
Let (Ω,F ,P) be a probability space, and recall that P : F → [0, 1] has the property that
P(Ω) = 1 and for any sequence (An)n≥1 of disjoint events in F it holds
P(⋃_{n≥1} An) = Σ_{n≥1} P(An).
Then we have the following:
• 0 ≤ P(A) ≤ 1 for all A ∈ F ,
• if A ∩B = ∅ then P(A ∪B) = P(A) + P(B),
• P(Ac) = 1− P(A), since P(A) + P(Ac) = P(Ω) = 1,
• P(∅) = 0,
• if A ⊆ B then P(A) ≤ P(B), since
P(B) = P(A ∪ (B \A)) = P(A) + P(B \A) ≥ P(A).
• P(A ∪B) = P(A) + P(B)− P(A ∩B),
• P(A ∪B) ≤ P(A) + P(B).
5.1. Continuity of probability measures. For a non-decreasing sequence of events
A1 ⊆ A2 ⊆ A3 ⊆ · · · we have
P(⋃_{n≥1} An) = lim_{n→∞} P(An).
Indeed, we can define a new sequence (Bn)n≥1 as
B1 = A1, B2 = A2 \ A1, B3 = A3 \ (A1 ∪ A2), . . . , Bn = An \ (A1 ∪ · · · ∪ An−1), . . .
Then the Bn's are disjoint, and so
P(An) = P(B1 ∪ · · · ∪ Bn) = Σ_{k=1}^n P(Bk).
Taking the limit as n → ∞ on both sides, we get
lim_{n→∞} P(An) = Σ_{n=1}^∞ P(Bn) = P(⋃_{n≥1} Bn) = P(⋃_{n≥1} An),
as wanted. Similarly, for a non-increasing sequence of events B1 ⊇ B2 ⊇ B3 ⊇ · · · we have
P(⋂_{n≥1} Bn) = lim_{n→∞} P(Bn).
Example 5.1. Take Ω = N and let An = {1, 2 . . . n} for n ≥ 1. Then we have A1 ⊆ A2 ⊆ A3 ⊆ · · ·
and
lim_{n→∞} P(An) = P(⋃_{n≥1} An) = P(Ω) = 1.
If, on the other hand, we set An = A5 for all n ≥ 5, then we find
lim_{n→∞} P(An) = P(⋃_{n≥1} An) = P(A5).
Example 5.2. Take Ω = N and let Bn = {n, n + 1, n + 2 . . .} for n ≥ 1. Then we have
B1 ⊇ B2 ⊇ B3 ⊇ · · · and
lim_{n→∞} P(Bn) = P(⋂_{n≥1} Bn) = P(∅) = 0.
If, on the other hand, we set Bn = B3 for all n ≥ 3, then we find
lim_{n→∞} P(Bn) = P(⋂_{n≥1} Bn) = P(B3).
5.2. Subadditivity of probability measures. For any events A1, A2 . . . An we have
(5.1) P(⋃_{k=1}^n Ak) ≤ Σ_{k=1}^n P(Ak).
To see this, define the events
B1 = A1, B2 = A2 \ A1, B3 = A3 \ (A1 ∪ A2), . . . , Bn = An \ (A1 ∪ · · · ∪ An−1).
Then the Bk's are disjoint, and P(Bk) ≤ P(Ak) for all k = 1 . . . n, from which
P(⋃_{k=1}^n Ak) = P(⋃_{k=1}^n Bk) = Σ_{k=1}^n P(Bk) ≤ Σ_{k=1}^n P(Ak).
The same proof shows that this also holds for infinite sequences of events (An)n≥1:
P(⋃_{n≥1} An) ≤ Σ_{n≥1} P(An).
The above inequalities show that probability measures are subadditive, and are often referred
to as Boole's inequality.
Example 5.3. We have already seen that P(A ∪B) ≤ P(A) + P(B), which is an example of
(5.1) with n = 2.
Example 5.4. Throw a fair die, so that Ω = {1, 2, 3, 4, 5, 6}. Then
P({1, 2, 3} ∪ {2, 4}) = P({1, 2, 3, 4}) = 4/6,
while
P({1, 2, 3}) + P({2, 4}) = 3/6 + 2/6 = 5/6,
so indeed P({1, 2, 3} ∪ {2, 4}) ≤ P({1, 2, 3}) + P({2, 4}).
Example 5.5. Toss a fair coin 3 times. Write H for head and T for tail. Then
P({exactly one head} ∪ {(T, H, H)} ∪ {(H, T, T)})
= P({(H, T, T), (T, H, T), (T, T, H), (T, H, H)}) = 4/8 = 1/2,
while
P(exactly one head) + P((T, H, H)) + P((H, T, T)) = 3/8 + 1/8 + 1/8 = 5/8.
5.3. Inclusion-exclusion formula. We have already seen that for any two events A,B
P(A ∪B) = P(A) + P(B)− P(A ∩B).
How does this generalise to an arbitrary number of events? For 3 events we have
P(A ∪B ∪ C) = P(A) + P(B) + P(C)− P(A ∩B)− P(A ∩ C)− P(B ∩ C) + P(A ∩B ∩ C).
In general, for n events A1, A2 . . . An we have
P(⋃_{i=1}^n Ai) = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik).
You are not required to remember the general formula for this course, but you should learn it
for n = 2, 3.
Example 5.6. Pick a card from a standard deck of 52. Define the events
A = {the card is diamonds},
B = {the card is even},
C = {the card is at least 10}.
Then
P(A) = 13/52, P(B) = 24/52, P(C) = 16/52,
P(A ∩ B) = 6/52, P(A ∩ C) = 4/52, P(B ∩ C) = 8/52,
P(A ∩ B ∩ C) = 2/52,
which gives
P(A ∪ B ∪ C) = (13 + 24 + 16 − 6 − 4 − 8 + 2)/52 = 37/52.
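With equally likely outcomes, inclusion-exclusion can be verified by enumerating the deck. The sketch below numbers the court cards J = 11, Q = 12, K = 13 (an assumption consistent with the counts in the example) and checks both sides of the formula exactly.

```python
from fractions import Fraction
from itertools import product

suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = range(1, 14)  # 1 = Ace, ..., 11 = Jack, 12 = Queen, 13 = King
deck = list(product(suits, ranks))

A = {c for c in deck if c[0] == "diamonds"}
B = {c for c in deck if c[1] % 2 == 0}   # "even", counting Q = 12 as even
C = {c for c in deck if c[1] >= 10}

def P(event):
    return Fraction(len(event), len(deck))

# inclusion-exclusion for three events
lhs = P(A | B | C)
rhs = P(A) + P(B) + P(C) - P(A & B) - P(A & C) - P(B & C) + P(A & B & C)
assert lhs == rhs == Fraction(37, 52)
```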
6. Exercises
Exercise 1. Let A,B,C be three events. Express in symbols the following events:
(i) only A occurs,
(ii) at least one event occurs,
(iii) exactly one event occurs,
(iv) no event occurs.
Exercise 2. Show that for any two sets A,B it holds (A ∪B)c = Ac ∩Bc.
Exercise 3. Show that, for any three events A,B,C
P(Ac ∩ (B ∪ C)) = P(B) + P(C)− P(B ∩ C)− P(C ∩A)− P(A ∩B) + P(A ∩B ∩ C).
How many numbers in {1, 2 . . . 500} are not divisible by 7 but are divisible by 3 or 5?
Exercise 4. Let (An)n≥1 be a sequence of events in some probability space (Ω,F ,P).
Define
A = {ω ∈ Ω : ω ∈ An infinitely often},
B = {ω ∈ Ω : ω ∈ An for all sufficiently large n}.
Write A and B in terms of unions/intersections of the Ak’s.
Exercise 5. A committee of size r is chosen at random from a set of n people. Calculate
the probability that m given people will be in the committee.
Exercise 6. What is the probability that a non-decreasing function f : {1 . . . k} → {1 . . . n} is increasing?
Exercise 7.
(i) How many injective functions f : {1, 2 . . . k} → {1, 2 . . . n} are there?
(ii) How many injective, non-decreasing functions f : {1, 2 . . . k} → {1, 2 . . . n} are there?
(iii) How many non-constant functions f : {1, 2 . . . k} → {0, 1} are there? How about
non-constant f : {1, 2 . . . k} → {1, 2 . . . n}?
7. Independence
We will now discuss what is arguably the most important concept in probability theory, namely
independence.
Definition 7.1. Two events A,B are said to be independent if
P(A ∩B) = P(A)P(B).
Example 7.2. Throw a fair die twice, and let
A = {the first number is even}, B = {the second number is ≤ 2}.
Then P(A) = 3/6, P(B) = 2/6 and P(A ∩ B) = 6/36, so P(A ∩ B) = P(A)P(B) and the events A and
B are independent.
Example 7.3. Pick a card from a standard deck. Let
A = {the card is hearts}, B = {the card is at most 5}.
Then P(A) = 13/52, P(B) = 20/52 and P(A ∩ B) = 5/52, so P(A ∩ B) = P(A)P(B) and therefore A and
B are independent.
Example 7.4. Toss two fair dice. Let
A = {the sum of the two numbers is 6}, B = {the first number is 4}.
Then P(A) = 5/36, P(B) = 1/6 and P(A ∩ B) = 1/36. Thus P(A ∩ B) ≠ P(A)P(B) and the events
are not independent. Intuitively, the probability of getting 6 for the sum depends on the first
outcome, since if we were to get 6 at the first toss then it would be impossible to obtain 6 for
the sum, while if the first toss gives a number ≤ 5 then we have a positive probability of getting
6 for the sum.
If, on the other hand, we replace A with
A′ = {the sum of the two numbers is 7},
then P(A′) = 6/36, P(B) = 1/6 and P(A′ ∩ B) = 1/36. Thus P(A′ ∩ B) = P(A′)P(B) and the events
An important property of independence is the following: if A is independent of B, then A is
independent of Bc. Indeed, we have
P(A ∩Bc) = P(A)− P(A ∩B)
= P(A)− P(A)P(B) (by independence of A,B)
= P(A)(1− P(B))
= P(A)P(Bc) (since P(Bc) = 1− P(B))
which shows that A and Bc are independent.
Example 7.5. Let A, B be defined as in Example 7.2, so that
Bc = {the second number is ≥ 3}.
Then P(Bc) = 1 − P(B) = 4/6, so P(A)P(Bc) = 12/36 = P(A ∩ Bc), so A and Bc are independent.
Example 7.6. Let A, B be defined as in Example 7.3, so that
Bc = {the card is at least 6}.
Then P(Bc) = 1 − P(B) = 32/52, so P(A)P(Bc) = 8/52 = P(A ∩ Bc), so A and Bc are independent.
We can also define independence for more than two events.
Definition 7.7. We say that the events A1, A2 . . . An are independent if for any k ≥ 2 and
any collection of distinct indices 1 ≤ i1 < · · · < ik ≤ n we have
P(Ai1 ∩Ai2 ∩ . . . ∩Aik) = P(Ai1)P(Ai2) · · ·P(Aik).
Note that we require all possible subsets of the n events to be independent.
Example 7.8. Toss 3 fair coins. Let
A = {first coin H}, B = {second coin H}, C = {third coin T}.
Then P(A) = P(B) = P(C) = 1/2 and
P(A ∩ B) = 1/4 = P(A)P(B)
P(B ∩ C) = 1/4 = P(B)P(C)
P(A ∩ C) = 1/4 = P(A)P(C)
P(A ∩ B ∩ C) = 1/8 = P(A)P(B)P(C),
so A,B,C are independent.
At this point the reader may wonder whether independence of n events is equivalent to pairwise
independence, that is the requirement that any two events are independent. The answer is no:
independence is stronger than pairwise independence. Here is an example.
Example 7.9. Toss 2 fair coins, so that
Ω = {(H, H), (H, T), (T, H), (T, T)}.
Let
A = {the first coin is T} = {(T, H), (T, T)},
B = {the second coin is T} = {(H, T), (T, T)},
C = {get exactly one T} = {(T, H), (H, T)}.
Then P(A) = P(B) = P(C) = 1/2, P(A ∩B) = P(A ∩ C) = P(B ∩ C) = 1/4 so the events are
pairwise independent. On the other hand,
P(A ∩B ∩ C) = 0,
so the events are not independent.
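This example is small enough to check exhaustively. The sketch below enumerates the four outcomes and confirms pairwise independence while the triple product condition fails.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))  # the 4 outcomes of two fair coin tosses

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "T"}
B = {w for w in omega if w[1] == "T"}
C = {w for w in omega if w.count("T") == 1}

# pairwise independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but not independent as a triple: A ∩ B ∩ C is empty
assert P(A & B & C) == 0 != P(A) * P(B) * P(C)
```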
8. Conditional probability
Closely related to the notion of independence is that of conditional probability.
Definition 8.1. Let A, B be two events with P(B) > 0. Then the conditional probability of A
given B is
P(A|B) = P(A ∩ B)/P(B).
We interpret P(A|B) as the probability that the event A occurs, when it is known that the
event B occurs.
Example 8.2. Toss a fair die. Let A1 = {3} and B = {even} = {2, 4, 6}. Then
P(A1|B) = P(3 and even)/P(even) = 0.
This is intuitive: if we know that the outcome is even, the probability that the outcome is 3 is
zero.
If instead we take A2 = {2, 3, 4, 5, 6} then
P(A2|B) = P({2, 4, 6})/P(even) = 1.
This is also intuitive: if we know that the outcome is even, then we are certain that it is at
least 2.
Finally, take A3 = {4, 5, 6}, for which
P(A3|B) = P({4, 6})/P(even) = 2/3.
An important property to note is that if A and B are independent, then P(A|B) = P(A). That
is, when two events are independent, knowing that one occurs does not affect the probability of
the other. This follows from
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
where we have used that A and B are independent in the second equality.
where we have used that A and B are independent in the second equality.
8.1. The law of total probability. From the definition of conditional probability we
see that for any two events A,B with P(B) > 0 we have
P(A ∩B) = P(A|B)P(B).
This shows that the probability of two events occurring simultaneously can be broken up into
calculating successive probabilities: first the probability that B occurs, and then the probability
that A occurs given that B has occurred.
Clearly by replacing B with Bc in the above formula we also have
P(A ∩Bc) = P(A|Bc)P(Bc).
But then, since B and Bc are disjoint,
P(A) = P(A ∩B) + P(A ∩Bc)
= P(A|B)P(B) + P(A|Bc)P(Bc).
This has an important generalisation, called law of total probability.
Theorem 8.3 (Law of total probability). Let (Bn)n≥1 be a sequence of disjoint events of
positive probability, whose union is the sample space Ω. Then for all events A
P(A) = Σ_{n=1}^∞ P(A|Bn)P(Bn).
Proof. We know that, by assumption,
⋃_{n≥1} Bn = Ω.
This gives
P(A) = P(A ∩ Ω) = P(A ∩ ⋃_{n≥1} Bn) = P(⋃_{n≥1} (A ∩ Bn))
= Σ_{n≥1} P(A ∩ Bn) = Σ_{n≥1} P(A|Bn)P(Bn).
Remark 8.4. In the statement of the law of total probability we can also drop the assumption
P(Bn) > 0, provided we interpret P(A|Bn)P(Bn) = 0 if P(Bn) = 0. It follows that we can
also take (Bn)n to be a finite collection of events (simply set Bn = ∅ from some finite index
onwards).
Example 8.5. An urn contains 10 black balls and 5 red balls. We draw two balls from the urn
without replacement. What is the probability that the second ball drawn is black? Let
A = {the second ball is black}, B = {the first ball is black}.
Then
P(A) = P(A|B)P(B) + P(A|Bc)P(Bc) = (10/15) · (9/14) + (5/15) · (10/14) = 2/3.
If, in general, the urn contains b black balls and r red balls we have
P(A) = (b/(b + r)) · ((b − 1)/(b + r − 1)) + (r/(b + r)) · (b/(b + r − 1)) = b/(b + r).
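The general computation can be carried out exactly with rational arithmetic; the sketch below conditions on the first draw exactly as in the example and confirms that the answer always equals b/(b + r).

```python
from fractions import Fraction

def p_second_black(b, r):
    """P(second ball black) when drawing twice without replacement
    from an urn with b black and r red balls."""
    n = b + r
    p_first_black = Fraction(b, n)
    # law of total probability, conditioning on the colour of the first ball
    return (p_first_black * Fraction(b - 1, n - 1)
            + (1 - p_first_black) * Fraction(b, n - 1))

assert p_second_black(10, 5) == Fraction(2, 3)
# in general the answer equals P(first ball black) = b/(b+r)
for b, r in [(3, 4), (7, 2), (1, 9)]:
    assert p_second_black(b, r) == Fraction(b, b + r)
```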
Example 8.6. Toss a fair coin. If it comes up heads, then toss one fair die, otherwise toss 2 fair
dice. Call M the maximum number obtained from the dice, and let
A = {M ≤ 4}.
What is the probability of A? We look at the two cases: either the coin came up heads and we
tossed one die, or the coin came up tails and we tossed two dice. Thus let
B = {the coin gives H}.
By the law of total probability we have
P(A) = P(A|B)P(B) + P(A|Bc)P(Bc) = P(M ≤ 4|H)P(H) + P(M ≤ 4|T)P(T)
= (1/2) · (4/6) + (1/2) · ((4 · 4)/(6 · 6)) = 5/9.
Example 8.7. Toss a fair coin twice. If we get (H, H) then toss 1 die, if (T, T) toss 2 dice and
otherwise toss 3 dice. With M defined as before, what is the probability of the event A? Again,
we split the probability space into disjoint events according to the number of dice we toss. Let
B1 = {(H, H)}, B2 = {(T, T)}, B3 = {(H, T), (T, H)}.
Then by the law of total probability
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3)
= (1/4) · (4/6) + (1/4) · ((4 · 4)/(6 · 6)) + (1/2) · ((4 · 4 · 4)/(6 · 6 · 6)) = 23/54.
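Since M ≤ 4 requires every die tossed to land in {1, 2, 3, 4}, we have P(M ≤ 4 | k dice) = (4/6)^k, and the total probability can be evaluated exactly:

```python
from fractions import Fraction

def p_max_at_most_4(num_dice):
    # P(M <= 4) when tossing `num_dice` fair dice: each die must be <= 4
    return Fraction(4, 6) ** num_dice

# two coin tosses decide the number of dice: HH -> 1, TT -> 2, otherwise -> 3
cases = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}
p = sum(prob * p_max_at_most_4(dice) for dice, prob in cases.items())
assert p == Fraction(23, 54)
```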
8.2. Bayes' theorem. Note that if A and B are two events of positive probability we
have, by definition of conditional probability,
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Often, one of the conditional probabilities is easier to compute than the other. Bayes' theorem
tells us how to switch between the two.
Theorem 8.8 (Bayes' theorem). For any two events A, B both of positive probability we have
P(A|B) = P(B|A)P(A)/P(B).
Proof. We have
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B).
Example 8.9. Going back to Example 8.5, suppose we are told that the second ball is black.
What is the probability that the first ball was black? Applying Bayes' theorem we find
P(B|A) = P(A|B)P(B)/P(A) = ((9/14) · (10/15))/(2/3) = 9/14.
Example 8.10. Going back to Example 8.6, suppose we are told that M ≤ 2. What is
the probability that the coin came up heads? Recall that B = {the coin gives H}, and let
C = {M ≤ 2}. By Bayes' theorem, we have
P(B|C) = P(C|B)P(B)/P(C) = P(C|B)P(B)/(P(C|B)P(B) + P(C|Bc)P(Bc))
= ((2/6) · (1/2))/((2/6) · (1/2) + ((2 · 2)/(6 · 6)) · (1/2)) = 3/4.
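The Bayes computation above is short enough to reproduce exactly: P(C | B) = 2/6 with one die, P(C | Bc) = (2/6)^2 with two dice, and the denominator is the total probability of C.

```python
from fractions import Fraction

half = Fraction(1, 2)
# C = {M <= 2}: conditional probabilities given the coin toss
p_C_given_H = Fraction(2, 6)        # one die must land in {1, 2}
p_C_given_T = Fraction(2, 6) ** 2   # both dice must land in {1, 2}
p_C = p_C_given_H * half + p_C_given_T * half   # law of total probability
# Bayes' theorem: P(H | C)
p_H_given_C = p_C_given_H * half / p_C
assert p_H_given_C == Fraction(3, 4)
```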
9. Exercises
Exercise 1. Toss two fair dice. Determine whether the following events are independent:
(i) A = {the first die gives 4}, B = {the maximum outcome is ≤ 3},
(ii) A = {the first die gives 4}, B = {the maximum outcome is ≤ 5},
(iii) A = {the first die gives 5}, B = {the sum of the outcomes is 6},
(iv) A = {the first die gives 6}, B = {the sum of the outcomes is 7}.
Exercise 2. Let A and B be independent. Show that Ac and Bc are independent.
Exercise 3. Let A, B be two events with P(B) > 0. Define P(A|B). Show that
(a) P(B|B) = 1,
(b) if A ⊆ B then P(A|B) = P(A)/P(B),
(c) if A ⊊ B then P(A|B) < 1.
Exercise 4. Give an example of two events for which P(A|B) ≠ P(B|A). Can you find two
events A, B of positive probability such that P(A|B)P(B) ≠ P(B|A)P(A)? Explain.
Exercise 5. Draw a card from a standard deck. What is the probability that the card
is heart? What is the probability that the card is even, given that it is heart? What is the
probability that the card is even, given that it is ≤ 5? What is the probability that the card is
≤ 5, given that it is even?
Exercise 6. Students of a given class get grades A, B, C, D with probability 1/8, 2/8, 3/8, 2/8
respectively. If they misread the question, they usually do worse, getting grades A, B, C, D
with probability 1/10, 2/10, 3/10, 4/10 respectively. Each student misreads the question with probability
2/3. What is the probability that:
(a) a student who read the question correctly gets A?
(b) a student who got B has read the question correctly?
(c) a student who got C has misread the question?
(d) a student who misread the question got C?
(e) a student gets D?
Exercise 7. What is the probability that a function f : {1, 2 . . . k} → {1, 2 . . . n} is constant,
given that it is non-decreasing? What is the probability that f : {1, 2 . . . k} → {1, 2 . . . n} is
non-decreasing, given that it is constant?
10. Some natural probability distributions
Up to this point in the course we have only worked with equally likely outcomes. On the other
hand, many other choices of probability measures are possible. We will next see some of
them, which arise naturally.
Definition 10.1. A probability distribution, or simply distribution, is a probability measure on
some (Ω,F). We say that a distribution is discrete if Ω is a discrete set.
For discrete Ω we always take the σ-algebra F to be the set of all subsets of Ω. In particular, it
contains all sets {ω}, ω ∈ Ω, made of a single outcome. Thus, if µ is a probability distribution,
we can write
p_ω = µ({ω}).
Note that p_ω ∈ [0, 1] for all ω ∈ Ω, and
Σ_{ω∈Ω} p_ω = µ(Ω) = 1.
Moreover, µ is uniquely determined by the collection (p_ω)_{ω∈Ω}, that we call weights. We think
of the weight p_ω as the probability of seeing the outcome ω, when sampling according to µ.
We now list several natural probability distributions.
10.1. Bernoulli distribution. Take Ω = {0, 1} and define µ to be the probability distri-
bution given by the weights
p1 = p, p0 = 1 − p.
Then µ models the number of heads obtained in one biased coin toss (the coin gives head
with probability p and tail with probability 1− p). Such µ is called Bernoulli distribution of
parameter p, and it is denoted by Bernoulli(p).
10.2. Binomial distribution. Fix an integer N ≥ 1 and let Ω = 0, 1, 2 . . . N. Define
µ to be the probability distribution given by the weights
pk =
(N
k
)pk(1− p)N−k, 0 ≤ k ≤ N.
Then µ models the number of heads in N biased coin tosses (again, the coin gives head with
probability p, and tail with probability 1 − p). Such µ is called Binomial distribution of
parameters N, p, and it is denoted by Binomial(N, p). Note that
µ(Ω) = ∑_{k=0}^{N} pk = ∑_{k=0}^{N} (N choose k) p^k (1 − p)^(N−k) = (p + (1 − p))^N = 1.
10.3. Geometric distribution. Let Ω = {1, 2, 3, . . .}, and define µ as the distribution
given by the weights
pk = (1 − p)^(k−1) p for all k ≥ 1.
Then µ models the number of biased coin tosses up to (and including) the first
head. Such µ is called Geometric distribution of parameter p, and it is denoted by Geometric(p).
Note that
µ(Ω) = ∑_{k=1}^{∞} pk = p ∑_{k=1}^{∞} (1 − p)^(k−1) = p / (1 − (1 − p)) = 1.
Note: A variant of the geometric distribution counts the number of tails until the first head.
In this case Ω = {0, 1, 2, 3, . . .} and
pk = (1 − p)^k p for all k ≥ 0.
This is also referred to as Geometric(p), and you should check that µ(Ω) = 1. It
should always be made clear which version of the geometric distribution is intended.
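As a quick numerical sanity check (a Python sketch; the parameter p = 0.3 and the truncation point K are chosen only for illustration), both conventions of the geometric weights do sum to 1:

```python
# Sketch: truncated sums of the Geometric(p) weights, for both conventions.
# p = 0.3 and K = 400 are illustrative choices; the discarded tail is negligible.
p, K = 0.3, 400
total_from_1 = sum((1 - p) ** (k - 1) * p for k in range(1, K + 1))  # Omega = {1, 2, ...}
total_from_0 = sum((1 - p) ** k * p for k in range(0, K + 1))        # Omega = {0, 1, 2, ...}
print(total_from_1, total_from_0)  # both essentially 1
```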
10.4. Poisson distribution. Let Ω = N, and for a fixed parameter λ ∈ (0,+∞) let µ be
the probability distribution given by the weights
pk = e^(−λ) λ^k / k!, k ≥ 0.
Such µ is called Poisson distribution of parameter λ, and it is denoted by Poisson(λ). Note
that
µ(Ω) = ∑_{k=0}^{∞} e^(−λ) λ^k / k! = e^(−λ) e^λ = 1.
This distribution arises as the limit of a Binomial distribution with parameters N, λ/N as N → ∞.
Indeed, if pk(N, λ/N) denote the weights of a Binomial(N, λ/N) distribution, we have
pk(N, λ/N) = (N choose k) (λ/N)^k (1 − λ/N)^(N−k) = [N(N − 1) · · · (N − k + 1)/N^k] (1 − λ/N)^(N−k) λ^k/k! −→ e^(−λ) λ^k/k!
as N → ∞, since
N(N − 1) · · · (N − k + 1)/N^k → 1, (1 − λ/N)^N → e^(−λ), (1 − λ/N)^(−k) → 1.
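The convergence above can be watched numerically (a sketch; λ = 2 and k = 3 are illustrative choices):

```python
import math

# Sketch: the Binomial(N, lam/N) weight p_k(N, lam/N) approaches the
# Poisson(lam) weight as N grows. lam = 2, k = 3 chosen for illustration.
lam, k = 2.0, 3

def binom_weight(N):
    return math.comb(N, k) * (lam / N) ** k * (1 - lam / N) ** (N - k)

poisson_weight = math.exp(-lam) * lam ** k / math.factorial(k)
for N in (10, 100, 10_000):
    print(N, binom_weight(N), poisson_weight)
```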
11. Exercises
Exercise 1. Let A,B be two events such that 0 < P(A) < 1 and 0 < P(B) < 1. We say
that B attracts A if P(A|B) > P(A). Show that if B attracts A then A attracts B.
Exercise 2. Tom is undecided as to whether to take a Mathematics course or a Literature
course. He estimates that the probability of getting an A would be 1/3 in a Maths course, and
1/2 in a Literature course. Suppose he chooses which course to take by tossing a fair coin: what
is the probability that he gets an A? Suppose that instead he tosses a biased coin: if it comes
up head, which happens with probability p ∈ (0, 1), he takes the Maths course, and otherwise
he takes the Literature course. What is the probability that Tom gets an A?
Exercise 3. Toss two dice. Let
Ai = the first die gives i, 1 ≤ i ≤ 6
Bj = the sum of the two dice is j, 2 ≤ j ≤ 12.
Determine for which (i, j) the events Ai, Bj are independent.
Exercise 4. At a certain stage of a criminal investigation the inspector is 60 per cent
convinced of the guilt of a certain suspect. Suppose, however, that a new piece of evidence
is uncovered: it is found that the criminal is left-handed. If the probability of being both
left-handed and non-guilty is 2/25, how certain of the guilt of the suspect should the inspector
be if it turns out that the suspect is left-handed?
Exercise 5. Show that for any A,B with P(B) > 0 it holds
0 ≤ P(A|B) ≤ 1.
Exercise 6. Toss a biased coin, which gives head with probability p ∈ (0, 1) and tail with
probability 1− p. If it comes up head, roll two dice and record the sum. If tail, pick a card
from a standard deck and record its number (between 1 and 13). What is the probability that
the number you have recorded is 4? What is the probability that it is ≤ 2?
12. Additional exercises
Exercise 1. A committee of 3 people is to be formed from a group of n people. How many
different committees are possible? Let c(n) denote the number of possible committees. Find α
such that
lim_{n→∞} c(n)/n^α ∈ (0, +∞).
Exercise 2. How many different words can be formed from (exactly) the letters QUIZZES?
Exercise 3. How many distinct non-negative integer-valued solutions (x1, x2) of the equa-
tion x1 + x2 = 3 are possible?
Exercise 4. Show that P(A ∩B) ≤ P(A). When does equality hold?
Exercise 5. A total of 36 members of a club play tennis, 28 play squash and 18 play
badminton. Furthermore, 22 of the members play both tennis and squash, 12 play both tennis
and badminton, 9 play both squash and badminton, and 4 play all three sports. How many
members of the club play at least one of the three sports?
Exercise 6. How many people have to be in a room in order that the probability that at
least two of them celebrate their birthday in the same month is at least 1/2? Assume that all
months are equally likely.
Exercise 7. An urn contains two coins of type A and one coin of type B. When a type
A coin is flipped, it comes up head with probability 1/4, while when a type B coin is flipped
it comes up head with probability 3/4. A coin is randomly chosen from the urn and flipped.
What is the probability to get head? Given that the coin gave head, what is the probability
that it was a coin of type A?
Exercise 8. An infinite sequence of independent trials is to be performed. Each trial
results in a success with probability p ∈ (0, 1), and a failure with probability 1− p. Determine
the probability that:
(a) at least one success occurs in the first n trials,
(b) exactly k successes occur in the first n trials,
(c) all trials result in successes,
(d) all trials result in failures.
Exercise 9. An infinite sequence of independent trials is to be performed. Each trial
results in a success with probability p ∈ (0, 1), and a failure with probability 1− p. What is the
probability distribution of the number of trials until (and including) the first success? Write
down the corresponding weights (pk)k≥1.
Exercise 10. Each day of the week James flips a biased coin: if it comes up head (which
happens with probability p), then he goes out for a walk. What is the distribution of the
number of times James goes for a walk in one week? Write down the corresponding weights
(pk)0≤k≤7.
13. Random variables
It is often the case that when a random experiment is conducted we don’t just want to record
the outcome, but perhaps we are interested in some functions of the outcome.
Definition 13.1. A random variable on (Ω,F) taking values in a discrete set S is a function
X : Ω→ S.
Typically S ⊆ R or S ⊆ Rk, in which case we say that X is a real-valued random variable.
Example 13.2. Toss a biased coin. Then Ω = {H, T}, where H, T stand for head and tail.
Define
X(H) = 1, X(T ) = 0.
Then X : Ω → {0, 1} counts the number of heads in the outcome.
Example 13.3. Toss two biased coins, so that Ω = {H, T}^2. Define
X(HH) = 2, X(TH) = X(HT ) = 1, X(TT ) = 0.
Then X : Ω → {0, 1, 2} counts the number of heads in 2 coin tosses.
Example 13.4. Roll two dice, so that Ω = {1, 2, 3, 4, 5, 6}^2. For (i, j) ∈ Ω, set
X(i, j) = i+ j.
Then X : Ω → {2, 3, . . . , 12} records the sum of the two dice. We could also define the random
variables
Y (i, j) = max{i, j}, Z(i, j) = i.
Then Y : Ω → {1, 2, 3, 4, 5, 6} records the maximum of the two dice, while Z : Ω → {1, 2, 3, 4, 5, 6} records the outcome of the first die.
Example 13.5. For a given probability space (Ω,F ,P) and A ∈ F , define
1A(ω) = 1 for ω ∈ A, and 1A(ω) = 0 for ω ∉ A.
Then 1A : Ω → {0, 1} tells us whether the outcome was in A or not. Note that
(i) 1Ac = 1− 1A,
(ii) 1A∩B = 1A1B,
(iii) 1A∪B = 1− (1− 1A)(1− 1B).
You should prove (iii) as an exercise.
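A brute-force numerical check of identities (i)-(iii) (not a proof, but a useful sanity test) can be run on a small toy sample space; the sets A and B below are arbitrary illustrative choices:

```python
# Sketch: verify the indicator identities (i)-(iii) pointwise on a small
# sample space. A and B are arbitrary illustrative subsets.
Omega = set(range(8))
A, B = {1, 2, 3}, {3, 4, 5}

def ind(S, w):
    return 1 if w in S else 0

for w in Omega:
    assert ind(Omega - A, w) == 1 - ind(A, w)                      # (i)
    assert ind(A & B, w) == ind(A, w) * ind(B, w)                  # (ii)
    assert ind(A | B, w) == 1 - (1 - ind(A, w)) * (1 - ind(B, w))  # (iii)
```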
For a subset T ⊆ S we denote the set {ω ∈ Ω : X(ω) ∈ T} simply by {X ∈ T}. Since X takes
values in a discrete S, we let
px = P(X = x)
for all x ∈ S. The collection (px)x∈S is referred to as the probability distribution of X. If the
probability distribution of X is, say, Geometric(p), then we say that X is a Geometric random
variable of parameter p, and write X ∼ Geometric(p). Similarly, for the other distributions we
have encountered write X ∼ Bernoulli(p), X ∼ Binomial(N, p), X ∼ Poisson(λ).
Given a random variable X taking values in S ⊂ R, the function FX : R→ [0, 1] given by
FX(x) = P(X ≤ x)
is called the distribution function of X. Note that FX is piecewise constant, non-decreasing,
right-continuous, and
lim_{x→−∞} FX(x) = 0, lim_{x→+∞} FX(x) = 1.
Example 13.6. Toss a biased coin, which gives head with probability p, and define the random
variable
X(H) = 1, X(T ) = 0.
Then
FX(x) =
0, if x < 0
1− p, if x ∈ [0, 1)
1, if x ≥ 1.
Note that FX is piecewise constant, non-decreasing, right-continuous, and its jumps are given
by the weights of a Bernoulli distribution of parameter p.
Knowing the probability distribution function is equivalent to knowing the collection of weights
(px)x∈S such that px = P(X = x), and hence it is equivalent to knowing the probability
distribution of X.
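The distribution function of Example 13.6 can be written out directly (a sketch; p = 0.4 is an illustrative choice):

```python
# Sketch: the distribution function F_X of a Bernoulli(p) random variable,
# as in Example 13.6. p = 0.4 is chosen only for illustration.
p = 0.4

def F(x):
    if x < 0:
        return 0.0
    if x < 1:
        return 1 - p  # jump of size 1 - p at x = 0
    return 1.0        # jump of size p at x = 1

# F is right-continuous, non-decreasing, 0 at -infinity and 1 at +infinity.
assert F(-5) == 0.0 and F(0) == 1 - p and F(0.5) == 1 - p and F(1) == 1.0
```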
Definition 13.7 (Independent random variables). Two random variables X,Y , taking values
in SX , SY respectively, are said to be independent if
P(X = x, Y = y) = P(X = x)P(Y = y)
for all (x, y) ∈ SX × SY . In general, the random variables X1, X2 . . . Xn taking values in
S1, S2 . . . Sn are said to be independent if
P(X1 = x1, X2 = x2 . . . Xn = xn) = P(X1 = x1)P(X2 = x2) · · ·P(Xn = xn)
for all (x1, x2 . . . xn) ∈ S1 × S2 × · · · × Sn.
Note that if X1, X2 . . . XN are independent, then for any k ≤ N and any distinct indices
1 ≤ i1 < i2 < · · · < ik ≤ N , the random variables Xi1 , Xi2 . . . Xik are independent. We show
this with N = 3 for simplicity. Let X : Ω→ SX , Y : Ω→ SY and Z : Ω→ SZ be independent
random variables, that is
P(X = x, Y = y, Z = z) = P(X = x)P(Y = y)P(Z = z)
for all (x, y, z) ∈ SX × SY × SZ . Then X,Y are independent. Indeed,
P(X = x, Y = y) = ∑_{z∈SZ} P(X = x, Y = y, Z = z) = ∑_{z∈SZ} P(X = x)P(Y = y)P(Z = z)
= P(X = x)P(Y = y) ∑_{z∈SZ} P(Z = z) = P(X = x)P(Y = y),
since ∑_{z∈SZ} P(Z = z) = 1.
Similarly, one shows that Y,Z and X,Z are independent.
Example 13.8. Consider the probability distribution on Ω = {0, 1}^N given by the weights
pω = ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk)
for any ω = (ω1, ω2 . . . ωN ) ∈ Ω. This models a sequence of N tosses of a biased coin. On Ω we
can define the random variables X1, X2 . . . XN by
Xk(ω) = ωk
for all ω = ω1ω2 . . . ωN ∈ Ω. Then each Xk is a Bernoulli random variable of parameter p, since
P(Xi = 1) = P(ωi = 1) = ∑_{ω : ωi=1} ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk) = p = 1 − P(Xi = 0).
Moreover, for ω = ω1ω2 . . . ωN ∈ {0, 1}^N,
P(X1 = ω1, X2 = ω2 . . . XN = ωN ) = pω = ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk) = ∏_{k=1}^{N} P(Xk = ωk),
so X1, X2 . . . XN are independent. Define a further random variable SN on Ω by setting
SN (ω) = ∑_{k=1}^{N} Xk(ω) = ∑_{k=1}^{N} ωk.
Then SN (ω) counts the number of ones in ω. Moreover, for k = 0, 1 . . . N,
|{SN = k}| = (N choose k),
and pω = p^k (1 − p)^(N−k) for all ω ∈ {SN = k}, so
P(SN = k) = (N choose k) p^k (1 − p)^(N−k).
Thus SN is a Binomial random variable of parameters N, p.
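The whole of Example 13.8 can be checked by brute-force enumeration for a small N (a sketch; N = 4 and p = 0.3 are illustrative choices):

```python
import itertools
import math

# Sketch: enumerate Omega = {0,1}^N with the weights of Example 13.8 and
# check that S_N has exactly the Binomial(N, p) weights.
# N = 4 and p = 0.3 are illustrative choices.
N, p = 4, 0.3
prob_SN = {k: 0.0 for k in range(N + 1)}
for omega in itertools.product((0, 1), repeat=N):
    weight = math.prod(p ** w * (1 - p) ** (1 - w) for w in omega)
    prob_SN[sum(omega)] += weight

for k in range(N + 1):
    binom = math.comb(N, k) * p ** k * (1 - p) ** (N - k)
    assert abs(prob_SN[k] - binom) < 1e-12
```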
We have seen in the above example that the sum of random variables is again a random variable.
In general, one could look at functions of random variables.
Definition 13.9. Let X : Ω→ S be a random variable taking values in S, and let g : S → S′
be a function. Then g(X) : Ω→ S′ defined by
g(X)(ω) = g(X(ω))
is a random variable taking values in S′.
14. Expectation
From now on all the random variables we consider are assumed to take real values, unless
otherwise specified. We say that a random variable X is non-negative if X takes values in
S ⊆ [0,∞).
Definition 14.1 (Expectation of a non-negative random variable). For a non-negative random
variable X : Ω→ S we define the expectation (or expected value, or mean value) of X to be
E(X) = ∑_{x∈S} x P(X = x).
Thus the expectation of X is the average of the values taken by X, averaged with weights
corresponding to the probabilities of the values.
Example 14.2. If X ∼ Bernoulli(p) then
E(X) = 1 · P(X = 1) + 0 · P(X = 0) = p.
Example 14.3. If X ∼ Geometric(p) then
E(X) = ∑_{k=1}^{∞} k P(X = k) = ∑_{k=1}^{∞} k (1 − p)^(k−1) p = p ∑_{k=1}^{∞} [−d/dx (1 − x)^k]_{x=p}
= −p [d/dx ( ∑_{k=0}^{∞} (1 − x)^k )]_{x=p} = −p [d/dx ( 1 / (1 − (1 − x)) )]_{x=p} = p · 1/p^2 = 1/p.
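A truncated numerical check of E(X) = 1/p for the geometric distribution (a sketch; p = 0.25 and the truncation point K are illustrative, and the discarded tail is negligible):

```python
# Sketch: truncated sum for E(X), X ~ Geometric(p). p = 0.25 illustrative;
# the tail beyond K = 2000 terms is astronomically small.
p, K = 0.25, 2000
EX = sum(k * (1 - p) ** (k - 1) * p for k in range(1, K + 1))
print(EX)  # essentially 1/p = 4
```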
For a random variable X write
X+ = max{X, 0}, X− = max{−X, 0},
so that X = X+ −X− and |X| = X+ +X−.
Definition 14.4. For X : Ω → S, define X+ and X− as above. Then, provided E(X+) < ∞ or E(X−) < ∞, we define the expectation of X to be
E(X) = E(X+) − E(X−) = ∑_{x∈S} x P(X = x).
Note that we are allowing the expectation to be infinite: if E(X+) = +∞ and E(X−) < ∞ then we set E(X) = +∞. Similarly, if E(X+) < ∞ and E(X−) = +∞ then we set E(X) = −∞.
However, if both E(X+) = E(X−) = +∞ then the expectation of X is not defined.
Definition 14.5. A random variable X : Ω→ S is said to be integrable if E(|X|) <∞.
Example 14.6. Let X : Ω→ Z have probability distribution (pk)k∈Z where
p0 = P(X = 0) = 0,
pk = P(X = k) = P(X = −k) = 2^(−|k|−1), k ≥ 1.
Then E(X+) = E(X−) = 1, so E(X) = E(X+) − E(X−) = 0.
Example 14.7. Let X : Ω→ Z have probability distribution (pk)k∈Z where
p0 = P(X = 0) = 0,
pk = P(X = k) = P(X = −k) = 3/(πk)^2, k ≥ 1.
Then E(X+) = E(X−) = +∞, so E(X) is not defined.
Properties of the expectation. The expectation of a random variable X : Ω → S
satisfies the following properties, some of which we state without proof:
(1) If X ≥ 0 then E(X) ≥ 0, and E(X) = 0 if and only if P(X = 0) = 1.
(2) If c ∈ R is a constant, then E(c) = c and E(cX) = cE(X).
(3) For random variables X,Y , it holds E(X + Y ) = E(X) + E(Y ).
Proof. If X,Y take values in SX , SY respectively, we have
E(X + Y ) = ∑_{z∈SX+SY} z P(X + Y = z) = ∑_{x∈SX} ∑_{y∈SY} (x + y) P(X = x, Y = y)
= ∑_{x∈SX} x ( ∑_{y∈SY} P(X = x, Y = y) ) + ∑_{y∈SY} y ( ∑_{x∈SX} P(X = x, Y = y) )
= ∑_{x∈SX} x P(X = x) + ∑_{y∈SY} y P(Y = y) = E(X) + E(Y ),
where we used that the inner sums equal P(X = x) and P(Y = y) respectively.
The above properties generalise by induction, to give that for any constants c1, c2 . . . cn and
random variables X1, X2 . . . Xn it holds
E( ∑_{k=1}^{n} ck Xk ) = ∑_{k=1}^{n} ck E(Xk).
Thus the expectation is linear.
(4) For any function g : S → S′, g(X) : Ω→ S′ is a random variable taking values in S′,
and
E(g(X)) = ∑_{x∈S} g(x) P(X = x).
An important example is given by g(x) = xk, and the corresponding expectation E(Xk)
is called the kth moment of the random variable X.
(5) If X ≥ 0 takes integer values, then
E(X) = ∑_{k=1}^{∞} P(X ≥ k).
Proof. We have
E(X) = ∑_{k=0}^{∞} k P(X = k) = ∑_{k=1}^{∞} ∑_{j=1}^{k} P(X = k) = ∑_{j=1}^{∞} ∑_{k=j}^{∞} P(X = k) = ∑_{j=1}^{∞} P(X ≥ j).
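The tail-sum formula of Property (5) can be checked numerically for a truncated Poisson (a sketch; λ = 3 and the truncation point K are illustrative, with a negligible discarded tail):

```python
import math

# Sketch: check E(X) = sum_{k>=1} P(X >= k) for X ~ Poisson(lam),
# truncating the series at K terms. lam = 3, K = 80 are illustrative.
lam, K = 3.0, 80
pk = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(K)]
mean_direct = sum(k * pk[k] for k in range(K))
mean_by_tails = sum(sum(pk[j] for j in range(k, K)) for k in range(1, K))
assert abs(mean_direct - mean_by_tails) < 1e-9
assert abs(mean_direct - lam) < 1e-9  # and E(X) = lam for a Poisson
```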
(6) If X,Y are independent random variables, then E(XY ) = E(X)E(Y ). In general, if
X1, X2 . . . XN are independent random variables,
E(X1 ·X2 · · ·XN ) = E(X1) · E(X2) · · ·E(XN ).
Example 14.8. Let X ∼ Geometric(p). Then P(X ≥ k) = (1 − p)^(k−1), and
E(X) = ∑_{k=1}^{∞} P(X ≥ k) = ∑_{k=1}^{∞} (1 − p)^(k−1) = ∑_{k=0}^{∞} (1 − p)^k = 1/p.
Example 14.9. Roll two dice, and let X,Y denote the outcomes of the first and second die
respectively. Then X and Y are independent, and
E(X) = E(Y ) = ∑_{k=1}^{6} k P(X = k) = (1/6) ∑_{k=1}^{6} k = 21/6,
so if S = X + Y and P = XY we have
E(S) = E(X + Y ) = E(X) + E(Y ) = 42/6, E(P ) = E(XY ) = E(X)E(Y ) = (21/6)^2.
In general, if X1, X2 . . . XN denote the outcome of N consecutive rolls of a die, and we define
S = ∑_{k=1}^{N} Xk, P = ∏_{k=1}^{N} Xk,
then
E(S) = N E(X1) = (21/6) · N, E(P ) = [E(X1)]^N = (21/6)^N.
14.1. Functions of random variables. Recall that if X : Ω→ S is a random variable,
and g : S → S′ is a (non-random) function, then g(X) : Ω→ S′ is itself a random variable, and
by Property (4)
E(g(X)) = ∑_{x∈S} g(x) P(X = x).
Now, if X : Ω → SX and Y : Ω → SY are two random variables, f : SX → S′X and g : SY → S′Y are two functions, and X, Y are independent, then f(X), g(Y ) are also
independent. This generalises to arbitrarily many random variables: if X1, X2 . . . XN
are independent, then f1(X1), f2(X2) . . . fN (XN ) are independent.
Example 14.10. Toss two dice. Let
A = the first die gives 4, B = the sum of the two dice is 7.
Then we have seen that A,B are independent events. Define the indicator random variables
X = 1A, Y = 1B.
Since indicators of independent events are independent, X,Y are independent. Moreover,
E(X) = P(A) = 1/6, E(Y ) = P(B) = 1/6,
E(X + Y ) = P(A) + P(B) = 1/3, E(XY ) = P(A)P(B) = 1/36.
If, on the other hand, we introduce
C = the sum of the two dice is 3, Z = 1C
then A,C are not independent, so X,Z are not independent, and
E(Z) = P(C) = 2/36, E(X + Z) = P(A) + P(C) = 8/36,
E(XZ) = E(1A∩C) = P(A ∩ C) = 0 ≠ E(X)E(Z).
15. Exercises
Exercise 1. Let X ∼ Bernoulli(p). Determine the distribution of Y = 1−X.
Exercise 2. Show that for any two sets A,B
1A∪B = 1− (1− 1A)(1− 1B).
Exercise 3. Show that if A,B are independent events then 1A and 1B are independent
random variables. Compute the expectation of 1A∩B and 1A∪B.
Exercise 4. Roll two dice, and let Ω = {1, 2 . . . 6}^2 be the associated sample space. Define
the random variable X : Ω → {1, 2 . . . 6} by
X(i, j) = max{i, j}
for all (i, j) ∈ Ω. Write down the weights of the random variable X. Draw the graph of the
associated distribution function FX , and check that
lim_{x→−∞} FX(x) = 0, lim_{x→+∞} FX(x) = 1.
Exercise 5. Roll two dice, and let Ω = {1, 2 . . . 6}^2 be the associated sample space.
Consider the event A = {i = 4}, and define the random variable
X(i, j) = 1A(i, j) = 1 if i = 4, and 0 if i ≠ 4.
Define further the random variable
Y (i, j) = i+ j.
Determine whether X and Y are independent.
Exercise 6. Redo all the computations of Example 13.8 for N = 2 and N = 3.
Exercise 7. Show that the Xk’s defined in Example 13.8 are Bernoulli(p).
Exercise 8. For an event A ⊆ Ω, let 1A denote the indicator function of A. Compute
E(1A). When is E(1A) = 0? When is E(1A) = 1?
Exercise 9. Let X ∼ Binomial(N, p). Show that E(X) = Np.
Exercise 10. Let X ∼ Poisson(λ). Show that E(X) = λ.
16. Variance and covariance
16.1. Variance. Once we know the mean of a random variable X, we may ask how much
typically X deviates from it. This is measured by the variance.
Definition 16.1. For any random variable X with finite mean E(X), the variance of X is
defined to be
Var(X) = E[(X − E(X))^2].
Thus Var(X) measures how much the distribution of X is concentrated around its mean: the
smaller the variance, the more the distribution is concentrated around E(X).
Properties of the variance. The variance of a random variable X : Ω→ S satisfies the
following properties:
(1) Var(X) = E(X^2) − [E(X)]^2.
Proof. We have
Var(X) = E[(X − E(X))^2] = E[X^2 + (E(X))^2 − 2XE(X)] = E(X^2) − [E(X)]^2.
(2) If E(X) = 0 then Var(X) = E(X^2).
(3) Var(X) ≥ 0, and Var(X) = 0 if and only if P(X = c) = 1 for some constant c ∈ R.
(4) If c ∈ R then Var(cX) = c^2 Var(X).
(5) If c ∈ R then Var(X + c) = Var(X).
(6) If X,Y are independent then Var(X + Y ) = Var(X) + Var(Y ). In general, if
X1, X2 . . . XN are independent then
Var(X1 +X2 + · · ·+XN ) = Var(X1) + Var(X2) + · · ·+ Var(XN ).
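Property (6) can be checked exactly for two independent dice by enumerating all 36 equally likely outcomes (a sketch):

```python
import itertools

# Sketch: exact check of Var(X + Y) = Var(X) + Var(Y) for two independent
# fair dice, by enumeration over the 36 equally likely outcomes.
outcomes = list(itertools.product(range(1, 7), repeat=2))

def expectation(f):
    return sum(f(i, j) for i, j in outcomes) / len(outcomes)

def variance(f):
    m = expectation(f)
    return expectation(lambda i, j: (f(i, j) - m) ** 2)

var_X = variance(lambda i, j: i)
var_Y = variance(lambda i, j: j)
var_S = variance(lambda i, j: i + j)
assert abs(var_S - (var_X + var_Y)) < 1e-12
```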
16.2. Covariance. Closely related to the concept of variance is that of covariance.
Definition 16.2. For any two random variables X,Y with finite mean, we define their covari-
ance to be
Cov(X,Y ) = E[(X − E(X))(Y − E(Y ))].
16.3. Properties of the covariance. The covariance of two random variables X,Y
satisfies the following properties:
(1) Cov(X,Y ) = E(XY )− E(X)E(Y ).
Proof. We have
Cov(X,Y ) = E[(X − E(X))(Y − E(Y ))]
= E[XY −XE(Y )− Y E(X) + E(X)E(Y )]
= E(XY )− E(X)E(Y ).
(2) Cov(X,X) = Var(X).
(3) Cov(X, c) = 0 for all c ∈ R.
(4) For c ∈ R, Cov(cX, Y ) = cCov(X,Y ) and Cov(X + c, Y ) = Cov(X,Y ).
(5) For X,Y, Z random variables, Cov(X + Z, Y ) = Cov(X,Y ) + Cov(Z, Y ).
The above properties generalise, to give that the covariance is bilinear. That is, for any
collections of random variables X1, X2 . . . Xn and Y1, Y2 . . . Yn, and constants a1, a2 . . . an and
b1, b2 . . . bn, it holds
Cov( ∑_{i=1}^{n} ai Xi, ∑_{j=1}^{n} bj Yj ) = ∑_{i=1}^{n} ∑_{j=1}^{n} ai bj Cov(Xi, Yj).
(6) For any two random variables with finite variance, it holds
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X,Y ).
Proof. We have
Var(X + Y ) = E[(X − E(X) + Y − E(Y ))^2]
= E[(X − E(X))^2] + E[(Y − E(Y ))^2] + 2E[(X − E(X))(Y − E(Y ))]
= Var(X) + Var(Y ) + 2Cov(X, Y ).
(7) If X,Y are independent then Cov(X,Y ) = 0.
Proof. We have
Cov(X,Y ) = E(XY )− E(X)E(Y ) = E(X)E(Y )− E(X)E(Y ) = 0,
where the second equality follows from independence.
We remark that the converse of property (7) is false, as shown by the following example.
Example 16.3 (Zero covariance does not imply independence). Let
X take each of the values 2, 1, −1, −2 with probability 1/4, and let Y = X^2.
Then E(X) = 0 and E(XY ) = E(X3) = 0 by symmetry, so
Cov(X,Y ) = E(XY )− E(X)E(Y ) = 0.
However, X,Y are not independent, since
P(X = 2, Y = 4) = P(X = 2) = 1/4,
while
P(X = 2)P(Y = 4) = P(X = 2)P({X = 2} ∪ {X = −2}) = (1/4)(1/2) = 1/8.
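Example 16.3 can be reproduced with exact rational arithmetic (a sketch, using Python's `fractions` module):

```python
from fractions import Fraction

# Sketch: exact arithmetic for Example 16.3. X is uniform on {-2, -1, 1, 2}
# and Y = X^2: the covariance is 0, yet X and Y are not independent.
w = Fraction(1, 4)
support = (2, 1, -1, -2)
EX = sum(x * w for x in support)        # = 0 by symmetry
EY = sum(x ** 2 * w for x in support)
EXY = sum(x ** 3 * w for x in support)  # E(X^3) = 0 by symmetry
assert EXY - EX * EY == 0               # Cov(X, Y) = 0

P_X2_and_Y4 = w            # {X = 2} implies {Y = 4}
P_X2_times_PY4 = w * (w + w)  # P(Y = 4) = P(X = 2) + P(X = -2) = 1/2
assert P_X2_and_Y4 != P_X2_times_PY4  # 1/4 != 1/8: not independent
```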
17. Exercises
Exercise 1. Toss two biased coins (each giving head with probability p). Let
A = the first coin gives head,
B = the second coin gives head,
C = at least one head.
Let X = 1A, Y = 1B, Z = 1C . Compute the expectation and variance of X, Y, Z. Compute
Cov(X, Y ), Cov(X, Z). Are X, Z independent? Compute E(X + Y ), E[(X + Y )^2], E[log(X + 2)],
E(4Y Z − X^2).
Exercise 2. Let X be a random variable taking values 0, 1, 2, 3 with probabilities p0, p1, p2, p3
respectively, where
p0 = p1 = p2 = 1/6, p3 = 1/2.
Compute E(X), E(X^2), Var(X). Describe the random variable Y = |X − 2| (that is, list
the possible values of Y and the associated weights). Compute E(Y ), E(Y^2), Var(Y ) and
Cov(X, Y ). Are X, Y independent? Compute E(X + Y + 3), Var(X + Y − 2), E[(X − Y )X],
E[(X + Y )(X − Y )].
Exercise 3. Draw two cards from a standard deck. Let
A = the first card is heart, B = the second card is heart.
Define X = 1A, Y = 1B. Are A,B independent? Are X,Y independent? Write down the
weights of X and Y , and compute E(X), E(Y ), Cov(X,Y ), Var(X + Y ).
Exercise 4. Let X ∼ Geometric(p). Compute E(X), E(X(X − 1)), and deduce E(X^2).
Exercise 5. Let X ∼ Poisson(λ). Compute E(X), E(X^2) and Var(X).
Exercise 6. Toss a biased coin N times. For all 1 ≤ k ≤ N let
Ak = the kth coin gives head,
and define Xk = 1Ak . Compute E(Xk) and Var(Xk) for all k. Are the Xk’s independent? Let
S = ∑_{k=1}^{N} Xk.
Then S counts the total number of heads in N coin tosses. Compute E(S) and Var(S). Assume
N is even. What is the expected total number of heads obtained in even coin tosses? And odd
coin tosses?
Exercise 7. Let X be a random variable with finite mean µ. For x ∈ R define the function
f(x) = E[(X − x)^2].
Computing f ′(x), show that f(x) is minimised at x = µ, and compute its minimum value.
Exercise 8. Let X be a random variable with mean µ and variance σ^2. Compute E(X^2)
and E[(X − 2)^2] in terms of µ and σ^2.
Exercise 9. Give an example of two random variables with equal mean but different
variance. Give an example of two random variables with the same mean and variance, but
different distribution.
18. Inequalities
It is often the case that, rather than computing the exact value of E(X), we are only interested
in having a bound for it.
Theorem 18.1 (Markov’s inequality). Let X be a non-negative random variable, and λ ∈ (0,∞).
Then
P(X ≥ λ) ≤ E(X)/λ.
Proof. Let Y = λ1{X≥λ}. Then X ≥ Y , so
E(X) ≥ E(Y ) = E(λ1{X≥λ}) = λE(1{X≥λ}) = λP(X ≥ λ).
Example 18.2. Roll a die N times, and let SN denote the total number of 3’s obtained. Then
SN ∼ Binomial(N, 1/6), and so E(SN ) = N/6. Thus, for α > 0, Markov’s inequality gives
P(SN ≥ αN) ≤ E(SN )/(αN) = 1/(6α).
Note that the bound is useless if α ≤ 1/6, while it contains interesting information if α > 1/6.
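Comparing the bound with the exact binomial tail shows how loose Markov's inequality can be (a sketch; N = 30 and α = 0.5 are illustrative choices):

```python
import math

# Sketch: Markov's bound 1/(6*alpha) versus the exact tail P(S_N >= alpha*N)
# for S_N ~ Binomial(N, 1/6). N = 30, alpha = 0.5 are illustrative.
N, alpha, p = 30, 0.5, 1 / 6
exact_tail = sum(math.comb(N, k) * p ** k * (1 - p) ** (N - k)
                 for k in range(math.ceil(alpha * N), N + 1))
markov_bound = 1 / (6 * alpha)
print(exact_tail, markov_bound)  # the exact tail is far below the bound
assert exact_tail <= markov_bound
```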
Theorem 18.3 (Chebyshev’s inequality). Let X be a random variable with finite mean E(X).
Then, for λ ∈ (0,∞),
P(|X − E(X)| ≥ λ) ≤ Var(X)/λ^2.
Proof. We have
P(|X − E(X)| ≥ λ) = P((X − E(X))^2 ≥ λ^2) ≤ E[(X − E(X))^2]/λ^2 = Var(X)/λ^2,
where the inequality follows from Markov’s inequality.
Theorem 18.4 (Cauchy-Schwarz inequality). For all random variables X,Y , it holds
E(|XY |) ≤ √(E(X^2)E(Y^2)).
We do not prove this inequality. Note that if Y is constant, say P(Y = 1) = 1, then the
inequality gives
E(|X|) ≤ √E(X^2).
Example 18.5. Toss a fair coin repeatedly. Let X denote the number of tosses until (and
including) the first head, while Y denotes the number of tosses until (and including) the first
tail. Then, with p = 1/2,
X ∼ Geometric(p), Y ∼ Geometric(1− p),
and X,Y are not independent since
P(X = 1, Y = 1) = 0 ≠ 1/4 = P(X = 1)P(Y = 1).
We want a bound for E(XY ). To this end, recall that if Z ∼ Geometric(p) then E(Z^2) =
(2 − p)/p^2. So
E(X^2) = (2 − 1/2)/(1/2)^2 = 6 = E(Y^2),
from which, using the Cauchy-Schwarz inequality, we find
E(XY ) ≤ √(6 · 6) = 6.
This, together with E(X) = E(Y ) = 2, allows us to bound the covariance:
Cov(X,Y ) = E(XY )− E(X)E(Y ) ≤ 6− 4 = 2.
Theorem 18.6 (Jensen’s inequality). Let X be an integrable random variable and let f : R → R be a convex function. Then
f(E(X)) ≤ E(f(X)).
We do not prove this. To remember the direction of the inequality, take f(x) = x^2 and use that
E(X^2) − [E(X)]^2 = Var(X) ≥ 0.
Example 18.7. Let X ∼ Poisson(λ), so that E(X) = λ. Define Y = e^X . Since f(x) = e^x is a
convex function, we have
E(Y ) = E(e^X) ≥ e^(E(X)) = e^λ.
19. Exercises
Exercise 1. In a sequence of n independent trials the probability of a success at the ith
trial is pi. Let N denote the total number of successes. Find E(N) and Var(N).
Exercise 2. Let X be an integrable random variable. Show that for every p > 0 and all
x > 0
P(|X| ≥ x) ≤ E(|X|^p) x^(−p).
Show further that for all β ≥ 0
P(X ≥ x) ≤ E(e^(βX)) e^(−βx).
Exercise 3. Let X1, X2 . . . Xn be independent and identically distributed random variables
with finite mean µ and variance σ^2. Define the sample mean
X̄ = (1/n) ∑_{k=1}^{n} Xk.
Compute E(X̄) and Var(X̄). Use Chebyshev’s inequality to bound
P(|X̄ − µ| > 2σ).
Use this bound to find n such that
P(|X̄ − µ| > 2σ) ≤ 0.01.
Exercise 4. Let X ∼ Poisson(λ). Compute E(e^X). Compute E(e^(αX)) for all α ∈ R. Let
X, Y be independent Poisson(λ) random variables. For any α, β ∈ R compute
E(e^(αX+βY )).
Exercise 5. Let X1, X2 . . . Xn be independent and identically distributed random variables
with finite mean µ and variance σ^2. Define the sample mean
X̄ = (1/n) ∑_{k=1}^{n} Xk.
Use Chebyshev’s inequality to show that for any ε > 0
P(|X̄ − µ| > ε) → 0
as n → ∞.
Exercise 6. Explain why
E(X) ≤ log E(e^X), |E(X)| ≤ E(|X|)
for any random variable X with finite mean.
Exercise 7. Roll a die N times. Denote by X the total number of 3’s and by Y the total
number of 6’s. Are X and Y independent? Denote by Z the total number of even outcomes.
Are X and Z independent? Compute expectation and variance of X,Y, Z. [Hint: write X,Y, Z
as sums of independent indicator random variables.]
20. Probability generating functions
Let X be a random variable taking values in N.
Definition 20.1. We define the generating function of X to be
GX(t) = E(t^X) = ∑_{k=0}^{∞} t^k P(X = k)
for t ∈ [0, 1].
Note that the assumption t ∈ [0, 1] guarantees that the infinite sum defining GX(t) is convergent.
Moreover,
GX(1) = ∑_{k=0}^{∞} P(X = k) = 1,
and GX(t) ≤ 1 for all t ∈ [0, 1].
Example 20.2. If X ∼ Bernoulli(p) then
GX(t) = E(t^X) = tp + 1 − p.
Example 20.3. If X ∼ Geometric(p), then
GX(t) = E(t^X) = ∑_{k=1}^{∞} t^k (1 − p)^(k−1) p = tp ∑_{k=0}^{∞} [t(1 − p)]^k = pt/(1 − t(1 − p)).
Example 20.4. If X ∼ Poisson(λ), then
GX(t) = E(t^X) = ∑_{k=0}^{∞} t^k e^(−λ) λ^k/k! = e^(−λ) ∑_{k=0}^{∞} (tλ)^k/k! = e^(−λ+λt).
The importance of probability generating functions lies in the following result.
Theorem 20.5. The probability generating function of a random variable X determines the
distribution of X uniquely.
We do not prove the above theorem. Note that it tells us that if X,Y have the same generating
function, that is if
E(t^X) = E(t^Y ) for all t ∈ [0, 1],
then X,Y have the same distribution.
Example 20.6. We show that if X ∼ Poisson(λ) and Y ∼ Poisson(µ) are independent, then
X + Y ∼ Poisson(λ+ µ). Indeed, we find
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = e^(−λ+λt) e^(−µ+µt) = e^(−(λ+µ)+(λ+µ)t).
We recognise in this expression the probability generating function of a Poisson(λ+µ) random
variable. Since probability generating functions uniquely determine distributions, we conclude
that X + Y ∼ Poisson(λ+ µ).
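The same conclusion can be checked directly by convolving the weights (a sketch; λ = 1.5 and µ = 2.5 are illustrative choices):

```python
import math

# Sketch: convolving Poisson(lam) and Poisson(mu) weights gives the
# Poisson(lam + mu) weights, as Example 20.6 predicts. lam, mu illustrative.
lam, mu = 1.5, 2.5

def pois(l, k):
    return math.exp(-l) * l ** k / math.factorial(k)

for k in range(12):
    # P(X + Y = k) = sum_j P(X = j) P(Y = k - j) by independence
    conv = sum(pois(lam, j) * pois(mu, k - j) for j in range(k + 1))
    assert abs(conv - pois(lam + mu, k)) < 1e-12
```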
Example 20.7. If X ∼ Binomial(N, p), then
GX(t) = E(t^X) = ∑_{k=0}^{N} t^k (N choose k) p^k (1 − p)^(N−k) = (tp + 1 − p)^N.
We use this to show that the sum of two independent Bernoulli(p) random variables is
Binomial(2, p). Indeed, if X,Y are independent Bernoulli(p), we find
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = (tp + 1 − p)^2,
which is the probability generating function of a Binomial(2, p). This generalises to show that
the sum of N independent Bernoulli(p) is Binomial(N, p).
As shown in the above examples, if X,Y are independent then
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = GX(t)GY (t).
If, moreover, X and Y also have the same distribution, then
GX+Y (t) = [GX(t)]^2.
This generalises to arbitrarily many random variables, to give that if X1, X2 . . . Xn are inde-
pendent, and
S = ∑_{k=1}^{n} Xk,
then
GS(t) = GX1(t)GX2(t) · · ·GXn(t).
If, moreover, the Xk’s all have the same distribution, then
GS(t) = [GX1(t)]^n.
Example 20.8. If X1, X2 . . . Xn are independent random variables, with Xk ∼ Poisson(λk),
then
X1 +X2 + · · ·+Xn ∼ Poisson(λ1 + λ2 + · · ·+ λn).
The expectation E(X^n) is called the nth moment of X. Provided GX(t) is finite for some t > 1, we
can differentiate inside the series, to obtain
G(1) = ∑_{k=0}^{∞} P(X = k) = 1,
G′(1) = ∑_{k=0}^{∞} k P(X = k) = E(X),
G′′(1) = ∑_{k=0}^{∞} k(k − 1) P(X = k) = E(X(X − 1)).
In general,
G^(n)(1) = E(X(X − 1) · · · (X − n + 1)).
Example 20.9. Recall that if X ∼ Poisson(λ) then GX(t) = e^(−λ+λt). So
G′(t) = λ e^(−λ+λt), G′(1) = λ = E(X),
G′′(t) = λ^2 e^(−λ+λt), G′′(1) = λ^2 = E(X(X − 1)),
from which we find
E(X) = G′(1) = λ, E(X^2) = G′′(1) + G′(1) = λ^2 + λ,
and so
Var(X) = E(X^2) − [E(X)]^2 = λ.
21. Conditional distributions
Definition 21.1 (Joint distribution). Given two random variables X : Ω → SX and Y : Ω → SY , their joint distribution is defined as the collection of probabilities
P(X = x, Y = y), x ∈ SX , y ∈ SY .
Note that the joint distribution of X and Y is a probability distribution on SX × SY .
Example 21.2. Let X, Y be independent Bernoulli random variables of parameter p, and
define Z = X + Y . Then, if we interpret X, Y as recording the outcomes of two independent
coin tosses, Z denotes the total number of heads. Here SX = SY = {0, 1} while SZ = {0, 1, 2}. Moreover, by independence
P(X = x, Y = y) = P(X = x)P(Y = y) for all (x, y) ∈ SX × SY .
On the other hand, X and Z are not independent, and they have joint distribution
P(X = 0, Z = 0) = (1 − p)^2, P(X = 0, Z = 1) = (1 − p)p,
P(X = 1, Z = 1) = p(1 − p), P(X = 1, Z = 2) = p^2.
Note that the weights of the joint distribution sum up to 1.
Remark 21.3. The joint distribution of two independent random variables is the product of
the marginal distributions.
From the joint distribution, one can recover the marginal distributions of X and Y by means of
the total probability law:
P(X = x) = ∑_{y∈SY} P(X = x, Y = y) for all x ∈ SX ,
P(Y = y) = ∑_{x∈SX} P(X = x, Y = y) for all y ∈ SY .
Example 21.4. Continuing Example 21.2, from the joint distribution of X,Z we can recover
the marginal distribution of Z via
P(Z = 0) = P(Z = 0, X = 0) + P(Z = 0, X = 1) = (1 − p)^2,
P(Z = 1) = P(Z = 1, X = 0) + P(Z = 1, X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2, X = 0) + P(Z = 2, X = 1) = p^2.
This generalises to arbitrarily many random variables: for X1 . . . Xn random variables taking
values in S1 . . . Sn, their joint distribution is given by
P(X1 = x1, X2 = x2, . . . Xn = xn), x1 ∈ S1, . . . , xn ∈ Sn,
and we can recover the marginal distributions via
P(Xi = x) = ∑_{x1,...,xi−1,xi+1,...,xn} P(X1 = x1, . . . , Xi = x, . . . , Xn = xn).
Definition 21.5 (Conditional distribution). For two random variables X,Y , the conditional
distribution of X given Y = y (which we assume to have positive probability) is the collection
of probabilities
P(X = x|Y = y), x ∈ SX ,
where, clearly,
P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y).
Then we can recover the marginal distribution of X via the total probability law:
P(X = x) = ∑_{y∈SY} P(X = x|Y = y)P(Y = y).
Example 21.6. Continuing Example 21.2, the conditional distribution of Z given X = 0 is
P(Z = 0|X = 0) = 1− p, P(Z = 1|X = 0) = p, P(Z = 2|X = 0) = 0.
On the other hand, the conditional distribution of Z given X = 1 is
P(Z = 0|X = 1) = 0, P(Z = 1|X = 1) = 1− p, P(Z = 2|X = 1) = p.
We can recover the marginal distribution of Z by
P(Z = 0) = P(Z = 0|X = 0)P(X = 0) + P(Z = 0|X = 1)P(X = 1) = (1 − p)^2,
P(Z = 1) = P(Z = 1|X = 0)P(X = 0) + P(Z = 1|X = 1)P(X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2|X = 0)P(X = 0) + P(Z = 2|X = 1)P(X = 1) = p^2.
Remark 21.7. If X, Y are independent, then the conditional distribution of X given Y = y coincides with the distribution of X, since P(X = x|Y = y) = P(X = x) for all (x, y) ∈ S_X × S_Y.
Definition 21.8 (Conditional expectation). Finally, we define the conditional expectation of
X given Y = y by
E(X|Y = y) = ∑_{x∈S_X} x P(X = x|Y = y).
Note that if X,Y are independent then E(X|Y = y) = E(X) for all y ∈ SY .
Example 21.9. Continuing Example 21.2, we have
E(Z) = E(X) + E(Y ) = 2p,
while
E(Z|X = 0) = 0 · P(Z = 0|X = 0) + 1 · P(Z = 1|X = 0) + 2 · P(Z = 2|X = 0) = p.
We could have arrived at the same conclusion by observing that, since Z = X + Y, on the
event {X = 0} we have Z = Y, and so
E(Z|X = 0) = E(Y |X = 0) = E(Y ) = p.
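The computations above are easy to check numerically. The following sketch (with the arbitrary choice p = 0.3) tabulates the joint weights of (X, Z) for X, Y independent Bernoulli(p) and Z = X + Y, as in Example 21.2, and then conditions on X = 0:

```python
from itertools import product

p = 0.3   # an arbitrary choice of the bias in Example 21.2

# Joint weights of (X, Z) for X, Y independent Bernoulli(p) and Z = X + Y.
joint = {}
for x, y in product([0, 1], repeat=2):
    w = (p if x else 1 - p) * (p if y else 1 - p)
    joint[(x, x + y)] = joint.get((x, x + y), 0.0) + w

# Conditional distribution of Z given X = 0, and E(Z | X = 0).
p_x0 = sum(w for (x, z), w in joint.items() if x == 0)
cond = {z: w / p_x0 for (x, z), w in joint.items() if x == 0}
e_z_given_x0 = sum(z * w for z, w in cond.items())   # equals p
```

Both the conditional weights (1 − p, p, 0) and E(Z | X = 0) = p come out as in Examples 21.6 and 21.9.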
21.1. Properties of the conditional expectation.
(1) For a constant c ∈ R, we have E(cX|Y = y) = cE(X|Y = y) and E(c|Y = y) = c.
(2) For random variables X,Z, it holds E(X + Z|Y = y) = E(X|Y = y) + E(Z|Y = y).
The above two properties generalise by induction, to give that for any constants c1, c2 . . . cn
and random variables X1, X2 . . . Xn it holds
E( ∑_{k=1}^n c_k X_k | Y = y ) = ∑_{k=1}^n c_k E(X_k|Y = y).
Thus the conditional expectation is linear.
(3) If X,Y are independent then E(X|Y = y) = E(X) for all y ∈ SY .
(4) For a function g : SX → S′,
E(g(X)|Y = y) = ∑_{x∈S_X} g(x) P(X = x|Y = y).
22. Exercises
Exercise 1. Let K be a random variable taking values 0, 1, 2 . . . 7 with probability 1/8
each. Define
X = cos(Kπ/4),  Y = sin(Kπ/4).
Show that Cov(X,Y ) = 0 but X,Y are not independent.
Exercise 2. Let X be a Poisson random variable of parameter λ. We have seen in a
previous exercise that for all β ≥ 0
P(X ≥ x) ≤ E(eβX)e−βx.
By finding the value of β that minimises the right hand side, show that for all x ≥ λ
P(X ≥ x) ≤ exp{−x log(x/λ) − λ + x}.
Show that, on the other hand, for integer x
P(X = x) ∼ (1/√(2πx)) exp{−x log(x/λ) − λ + x}
as x → ∞.
Exercise 3. Suppose that a random variable X has probability generating function
G_X(t) = t(1 − t^r) / (r(1 − t))
for t ∈ [0, 1), where r ≥ 1 is a fixed integer. Determine the mean of X.
Exercise 4. Toss a biased coin repeatedly, and let X denote the number of tosses until
(and including) the second head. Show that
P(X = k) = (k − 1)(1− p)k−2p2
for k ≥ 2. Show that the probability generating function of X is
G_X(t) = ( pt / (1 − (1 − p)t) )².
Deduce that E(X) = 2/p and Var(X) = 2(1− p)/p2. Explain how X can be represented as the
sum of two independent and identically distributed random variables. Use this representation
to derive again the mean and variance of X.
Exercise 5. Solve the above exercise with X denoting the number of tosses until (and
including) the ath head, for a ≥ 2. Show that in this case
G_X(t) = ( pt / (1 − (1 − p)t) )^a,  E(X) = a/p,  Var(X) = a(1 − p)/p².
Exercise 6. Let X,Y ∼ Poisson(λ) be independent random variables. Find the dis-
tribution of X + Y . Show that the conditional distribution of X given X + Y = n is
Binomial (n, 1/2).
Exercise 7. Let X ∼ Poisson(λ) and Y ∼ Poisson(µ) be independent random variables.
Find the distribution of X + Y. Show that the conditional distribution of X given X + Y = n is Binomial(n, λ/(λ + µ)).
Exercise 8. The number of misprints on a given page of a book has Poisson distribution
of parameter λ, and the number of misprints on different pages are independent. What is the
probability that the second misprint will occur on page k?
CHAPTER 2
Continuous probability
1. Some natural continuous probability distributions
Up to this point we have only considered probability distributions on discrete sets, such as
finite sets, N or Z. We now turn our attention to continuous distributions.
Definition 1.1 (Probability density function). A function f : R→ R is a probability density
function if
f(x) ≥ 0 for all x ∈ R, and
∫_{−∞}^{+∞} f(x) dx = 1.
We will often write p.d.f. in place of probability density function for brevity. Given a p.d.f., we
can define a probability measure µ on R by setting
µ([a, b]) = ∫_a^b f(x) dx
for all a, b ∈ R. Note that
µ((−∞, a]) = ∫_{−∞}^a f(x) dx,  µ(R) = ∫_{−∞}^{+∞} f(x) dx = 1,
and µ([a, b)) = µ((a, b]) = µ((a, b)) = µ([a, b]). We interpret f(x) as the density of probability
at x (the continuous analogue of the weight of x), and µ([a, b]) as the probability of the interval
[a, b]. For this course we restrict to piecewise continuous probability density functions.
Example 1.2. Take f(x) = e^{−|x|}/2. Then f(x) ≥ 0 for all x ∈ R, and
∫_{−∞}^{+∞} f(x) dx = 2 ∫_0^∞ (e^{−x}/2) dx = 1.
Thus f is a valid probability density function. If µ denotes the associated probability measure,
we have
µ([0, ∞)) = (1/2) ∫_0^∞ e^{−x} dx = 1/2,  µ([−1, 1]) = (1/2) ∫_{−1}^1 e^{−|x|} dx = ∫_0^1 e^{−x} dx = 1 − e^{−1}.
This tells us that if we choose a real number at random according to µ, the probability that this
number is non-negative is 1/2 (which was also clear as f is symmetric), while the probability
that it has modulus at most 1 is 1− e−1.
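Integrals like those in Example 1.2 are easy to sanity-check numerically. The sketch below approximates the three integrals with a simple midpoint rule; the truncation at ±40 is harmless since the tails of e^{−|x|} are exponentially small.

```python
import math

def f(x):
    """The density of Example 1.2: f(x) = e^{-|x|} / 2."""
    return math.exp(-abs(x)) / 2

def midpoint(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

total = midpoint(f, -40, 40)       # ≈ 1
mu_nonneg = midpoint(f, 0, 40)     # ≈ 1/2
mu_11 = midpoint(f, -1, 1)         # ≈ 1 - e^{-1}
```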
Remark 1.3. To be precise, we would have had to introduce a σ-algebra on R, called the Borel
σ-algebra, and require that f is Borel-measurable. This is a very weak assumption, and it holds
for all piecewise continuous f, to which we restrict our attention. These subtleties go beyond the
scope of this course, and we will ignore them.
We now list some natural continuous probability distributions. Throughout, we use the following
notation: for A ⊆ R
1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A,
in analogy with the notation introduced for indicator random variables.
1.1. Uniform distribution. For a, b ∈ R with a < b, the density function
f(x) = (1/(b − a)) 1_{[a,b]}(x)
defines the uniform distribution on [a, b]. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_a^b f(x) dx = 1.
The uniform distribution gives to each interval contained in [a, b] a probability proportional to
its length.
1.2. Exponential distribution. For λ ∈ (0,∞), the density function
f(x) = λe^{−λx} 1_{[0,+∞)}(x)
defines the exponential distribution of parameter λ. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_0^∞ λe^{−λx} dx = 1.
1.3. Gamma distribution. The Gamma distribution generalises the exponential distri-
bution. For α, λ ∈ (0,+∞), the density function
f(x) = (λ^α x^{α−1} e^{−λx} / Γ(α)) 1_{[0,+∞)}(x)
defines the Gamma distribution of parameters α, λ. Here
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
is a finite constant, which ensures that f integrates to 1, since with the substitution y = λx
∫_0^∞ λ^α x^{α−1} e^{−λx} dx = ∫_0^∞ y^{α−1} e^{−y} dy = Γ(α).
Note that by setting α = 1 we recover the exponential distribution, of which the Gamma
distribution is therefore a generalisation.
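The normalisation of the Gamma density can be verified numerically, using the standard-library function math.gamma for Γ(α); the parameter values below are arbitrary.

```python
import math

def gamma_pdf(x, alpha, lam):
    """Density of the Gamma(alpha, lam) distribution for x > 0."""
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

alpha, lam = 2.5, 1.7    # arbitrary test parameters
n, b = 400_000, 60.0     # midpoint rule on [0, b]; the tail beyond b is negligible
h = b / n
total = sum(gamma_pdf((k + 0.5) * h, alpha, lam) for k in range(n)) * h   # ≈ 1
```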
1.4. Cauchy distribution. The density function
f(x) = 1 / (π(1 + x²))
defines the Cauchy distribution. Note that
∫_{−∞}^{+∞} f(x) dx = (2/π) ∫_0^{+∞} 1/(1 + x²) dx = (2/π) arctan(x) |_0^{+∞} = 1.
1.5. Gaussian distribution. The density function
f(x) = (1/√(2π)) e^{−x²/2}
defines the standard Gaussian distribution, or standard normal distribution, written N(0, 1).
More generally, for µ ∈ R and σ2 ∈ (0,∞), the density function
f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
defines the Gaussian distribution of mean µ and variance σ2, written N(µ, σ2). To check that
f integrates to 1 we can use Fubini’s theorem and change variables into polar coordinates to
obtain
( ∫_{−∞}^{+∞} e^{−x²/2} dx )² = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} e^{−(x²+y²)/2} dx dy = ∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ = 2π.
2. Continuous random variables
A random variable X : Ω→ R has probability density function f if
P(X ∈ [a, b]) = ∫_a^b f(x) dx
for all a < b ∈ R. We again define the distribution function of X by
F_X(x) = P(X ≤ x) = ∫_{−∞}^x f(y) dy.
Definition 2.1. A random variable X is said to be continuous if the associated distribution
function FX is continuous.
Note that by the fundamental theorem of calculus
F ′X(x) = fX(x),
thus the probability density function is the derivative of the distribution function. Moreover,
lim_{x→−∞} F_X(x) = 0,  lim_{x→+∞} F_X(x) = 1
and FX is non-decreasing, since F ′X(x) = fX(x) ≥ 0.
Definition 2.2. Let X be a continuous random variable with probability density function f
and distribution function F . Then, provided
E(X⁺) = ∫_0^{+∞} x f(x) dx < ∞  or  E(X⁻) = ∫_{−∞}^0 (−x) f(x) dx < ∞,
we define
E(X) = ∫_{−∞}^{+∞} x f(x) dx.
Note that if E(X+) = +∞ and E(X−) < ∞ then E(X) = +∞, while if E(X−) = +∞ and
E(X+) <∞ then E(X) = −∞, in analogy with the discrete case.
Example 2.3. If X ∼ Uniform[a, b] then
E(X) = ∫_a^b x/(b − a) dx = (1/(b − a)) · (b² − a²)/2 = (a + b)/2.
Example 2.4. If X ∼ Exponential(λ) then we integrate by parts to get
E(X) = ∫_0^∞ x λe^{−λx} dx = −x e^{−λx} |_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ.
Example 2.5. If X ∼ Cauchy then
E(X⁺) = ∫_0^∞ x/(π(1 + x²)) dx = +∞ = E(X⁻),
so the expectation is not defined.
Example 2.6. If X ∼ N(0, 1) then
E(X) = ∫_{−∞}^{+∞} (x/√(2π)) e^{−x²/2} dx = 0
by symmetry.
The expectation satisfies the same properties as in the case of discrete random variables. In
particular it is linear, and for a function g : R→ R we have
E(g(X)) = ∫_{−∞}^{+∞} g(x) f(x) dx.
Moreover, if X is non-negative, we have the alternative formula
E(X) = ∫_0^{+∞} P(X ≥ x) dx = ∫_0^{+∞} (1 − F(x)) dx.
Example 2.7. If X ∼ Exponential(λ) then P(X ≥ x) = e−λx, so we can use the above formula
to compute
E(X) = ∫_0^∞ P(X ≥ x) dx = −(e^{−λx}/λ) |_0^∞ = 1/λ.
Definition 2.8. A random variable X is said to be integrable if
E(|X|) = ∫_{−∞}^{+∞} |x| f(x) dx < ∞.
For an integrable random variable X, we define the variance of X by
Var(X) = E[(X − E(X))²] = ∫ (x − E(X))² f(x) dx.
In analogy with the discrete case,
Var(X) = E(X²) − [E(X)]² = ∫ x² f(x) dx − ( ∫ x f(x) dx )².
Example 2.9. If X ∼ Uniform[a, b] then
E(X²) = ∫_a^b x²/(b − a) dx = (1/(b − a)) · (b³ − a³)/3 = (a² + ab + b²)/3,
from which
Var(X) = E(X²) − [E(X)]² = (a² + ab + b²)/3 − (a + b)²/4 = (b − a)²/12.
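A quick Monte Carlo check of the uniform variance formula (the values a = 2, b = 5 are an arbitrary choice):

```python
import random

# Monte Carlo check of Var(X) = (b - a)^2 / 12 for X ~ Uniform[a, b].
random.seed(1)
a, b = 2.0, 5.0
n = 400_000
xs = [random.uniform(a, b) for _ in range(n)]
m1 = sum(xs) / n                 # ≈ (a + b)/2 = 3.5
m2 = sum(x * x for x in xs) / n
var = m2 - m1 * m1               # ≈ (b - a)^2 / 12 = 0.75
```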
Example 2.10. If X ∼ Exponential(λ) then we integrate by parts to get
E(X²) = ∫_0^∞ x² λe^{−λx} dx = −x² e^{−λx} |_0^∞ + 2 ∫_0^∞ x e^{−λx} dx = 2/λ²,
from which
Var(X) = E(X²) − [E(X)]² = 2/λ² − 1/λ² = 1/λ².
Example 2.11. If X ∼ N(0, 1) then we integrate by parts to get
Var(X) = E(X²) = ∫_{−∞}^{+∞} (x²/√(2π)) e^{−x²/2} dx = −(x/√(2π)) e^{−x²/2} |_{−∞}^{+∞} + ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx = 1.
3. Exercises
Exercise 1. Let X be a Uniform random variable in [−1, 1]. Compute E(X3) and E(X4).
Can you compute E(X2n+1) for n ≥ 0?
Exercise 2. Recall that the Gamma function is defined as
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
Using integration by parts, show that for all positive integers k
Γ(k + 1) = kΓ(k).
By comparison with the exponential distribution, explain why Γ(1) = 1. Use this to conclude
that
Γ(k) = (k − 1)!
for all positive integers k.
4. Transformations of one-dimensional random variables
Let X be a random variable taking values in S ⊆ R (that is, with probability density function
fX supported on S). We consider a function g : S → S′ and define the random variable
Y = g(X). What is the probability density function fY of Y ?
To answer this question, let us assume that g is strictly increasing, so that g′(x) > 0 for all x.
Then we can look at the distribution function
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).
Differentiating both sides, we find
f_Y(y) = (d/dy) F_X(g^{−1}(y)) = f_X(g^{−1}(y)) · (d/dy) g^{−1}(y).
A similar argument applies for the case of g decreasing, to give the following.
Theorem 4.1. Let X be a random variable taking values in S ⊆ R with p.d.f. fX , and let
g : S → S′ be a strictly increasing or strictly decreasing function. Then the probability density
function of Y = g(X) is given by
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|
for y ∈ S′.
Example 4.2. Let X ∼ Uniform[0, 1]. We may assume that X only takes values in (0, 1] since
P(X = 0) = 0. Define Y = − logX, so that Y takes values in [0,∞). Then we have S = (0, 1],
S′ = [0,∞) and g(x) = − log x strictly decreasing. Using that g−1(y) = e−y, we find
f_Y(y) = f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)| = 1 · |−e^{−y}| = e^{−y},
which shows that Y ∼ Exponential(1). Note that we could have arrived at the same conclusion
by directly looking at the distribution functions. Indeed, we have
FY (y) = P(Y ≤ y) = P(− logX ≤ y) = P(X ≥ e−y) = 1− e−y.
By differentiating both sides with respect to y, we find
fY (y) = e−y,
so Y ∼ Exponential(1).
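Example 4.2 also lends itself to a simulation check: apply −log to U[0, 1] samples and compare the empirical distribution function with 1 − e^{−y} at a test point.

```python
import random, math

random.seed(2)
n = 100_000
# If U ~ U[0,1] then Y = -log(U) ~ Exponential(1), as shown in Example 4.2.
ys = [-math.log(random.random()) for _ in range(n)]
F_hat = sum(1 for y in ys if y <= 1.0) / n   # empirical d.f. at y = 1
# Should be close to F_Y(1) = 1 - e^{-1} ≈ 0.632.
```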
Example 4.3. Let X ∼ N(µ, σ2), and define Y = (X − µ)/σ. Then S = S′ = R and
g(x) = (x− µ)/σ. Using that g−1(y) = σy + µ, we find
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = σ · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} = (1/√(2π)) e^{−y²/2},
which shows that Y ∼ N(0, 1). Note that we could have arrived at the same conclusion using that
F_Y(y) = P(Y ≤ y) = P(X ≤ µ + σy) = ∫_{−∞}^{µ+σy} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
for all y ∈ R, from which, differentiating both sides, we get
f_Y(y) = σ · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} = (1/√(2π)) e^{−y²/2}.
The same reasoning shows that if X ∼ N(0, 1) and Y = σX + µ then Y ∼ N(µ, σ2).
Note that the above example gives us a way to deduce the mean and variance of a N(µ, σ²)
from those of a N(0, 1). Indeed, if X ∼ N(0, 1) then Y = σX + µ ∼ N(µ, σ²), from which,
using the linearity of expectation,
E(Y ) = σE(X) + µ = µ,
and
Var(Y ) = σ2Var(X) = σ2.
5. Exercises
Exercise 1. Let X be a random variable with p.d.f. f(x) = ce−3x1[0,∞)(x). Find c. Find
P(X ∈ [2, 3]), P(|X| ≤ 1), P(X > 4).
Exercise 2. Let X be a random variable with p.d.f. f(x) = cx 1_{[0,2]}(x). Find c. Find the
p.d.f. of Y = 2X.
Exercise 3. Let X ∼ Exponential(1). Find the p.d.f. of Y = X2.
Exercise 4. Let X ∼ Cauchy. Find the p.d.f. of Y = arctanX.
Exercise 5. Conversely, show that if X ∼ U [−π/2, π/2] and Y = tanX, then Y ∼ Cauchy.
Exercise 6. Let X ∼ U [−1, 1]. Find the p.d.f. of Y = 2X + 5 and Z = X3 + 1.
Exercise 7. Let X be an exponential random variable of parameter λ. Show that for any
0 < s < t it holds
P(X > t|X > s) = P(X > t− s).
This is the memoryless property of the exponential distribution. Indeed, thinking of X as a
random time (say the time of arrival of a bus), it tells us that given that X > s, the random
time X − s still has exponential distribution of parameter λ. In fact, this property characterises
the exponential distribution, but we won’t prove this.
Exercise 8. Let X ∼ U [0, 1], and let F be a strictly increasing distribution function. Show
that the random variable Y = F−1(X) has distribution function F . This is important for
simulating random variables on a computer.
6. Moment generating functions
Definition 6.1. The moment generating function of a random variable X with p.d.f. fX is
defined as
M_X(θ) = E(e^{θX}) = ∫_{−∞}^{+∞} e^{θx} f_X(x) dx,
for those values of θ for which the above integral is finite.
Note that MX(0) = 1 and if θ < 0 and X ≥ 0 then MX(θ) ≤ 1, so the moment generating
function of a non-negative random variable is always finite on (−∞, 0]. The range of values
of θ > 0 for which the moment generating function is defined will depend on the tails of the
probability density function fX .
Example 6.2. Let X be an exponential random variable of parameter λ. Then
M_X(θ) = E(e^{θX}) = ∫_0^∞ λe^{−(λ−θ)x} dx = λ/(λ − θ),
provided θ < λ, which is needed for the integral to be finite.
If X,Y are independent random variables, then
MX+Y (θ) = E(eθ(X+Y )) = E(eθX)E(eθY ) = MX(θ)MY (θ).
This generalises to arbitrarily many random variables, to give that if X1, X2 . . . Xn are inde-
pendent random variables, then
MX1+···+Xn(θ) = MX1(θ)MX2(θ) · · ·MXn(θ).
Moreover, if the Xk’s also have the same distribution, then
MX1+···+Xn(θ) = [MX1(θ)]n.
The moment generating function (that we abbreviate m.g.f.) plays the same role for continuous
random variables as the probability generating function for discrete random variables. In
particular, the following theorem holds.
Theorem 6.3. The moment generating function of a random variable X determines the distri-
bution of X uniquely, provided M_X(θ) < ∞ for some θ > 0.
Example 6.4. In this example we compute the moment generating function of a Gaussian
random variable, and show that the sum of independent Gaussian random variables is still
Gaussian. Let X ∼ N(µ, σ²). Then
M_X(θ) = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{θx − (x−µ)²/(2σ²)} dx.
The above integral is finite for all θ ∈ R. We complete the square in the exponent as follows:
θx − (x − µ)²/(2σ²) = −((x − µ)² − 2θσ²x)/(2σ²) = −(x² − 2(µ + θσ²)x + µ²)/(2σ²)
= −(x² − 2(µ + θσ²)x + (µ + θσ²)² + µ² − (µ + θσ²)²)/(2σ²)
= −(x − (µ + θσ²))²/(2σ²) + θµ + θ²σ²/2.
Plugging this into the integral, we find
M_X(θ) = e^{θµ + θ²σ²/2} · (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{−(x − (µ+θσ²))²/(2σ²)} dx = e^{θµ + θ²σ²/2},
where in the last equality we have used that the p.d.f. of a N(µ + θσ2, σ2) random variable
integrates to 1. Thus we showed that
X ∼ N(µ, σ²)  ⇒  M_X(θ) = e^{θµ + θ²σ²/2}.
In particular,
X ∼ N(0, 1)  ⇒  M_X(θ) = e^{θ²/2}.
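The closed form e^{θ²/2} can be compared against a direct numerical evaluation of E(e^{θX}) for X ∼ N(0, 1); the test value θ = 0.7 is arbitrary.

```python
import math

def mgf_std_normal(theta, n=400_000, b=12.0):
    """Midpoint-rule approximation of E(e^{theta*X}) for X ~ N(0, 1)."""
    h = 2 * b / n
    total = 0.0
    for k in range(n):
        x = -b + (k + 0.5) * h
        total += math.exp(theta * x - x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

theta = 0.7                        # arbitrary test value
approx = mgf_std_normal(theta)
exact = math.exp(theta ** 2 / 2)   # the closed form derived above
```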
We now use this explicit expression for the m.g.f. to compute the distribution of the sum of
two independent Gaussian random variables. Let X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) be independent. Then
M_{X+Y}(θ) = E(e^{θ(X+Y)}) = E(e^{θX}) E(e^{θY}) = e^{θ(µ_X + µ_Y) + (θ²/2)(σ_X² + σ_Y²)},
which shows that
X + Y ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).
Finally, exactly as in the discrete case, we can use the m.g.f. MX of a random variable X to
recover the moments of X. Indeed, from
e^{θX} = 1 + θX + θ²X²/2 + θ³X³/3! + ⋯
taking the expectation both sides we find
M_X(θ) = 1 + θ E(X) + (θ²/2) E(X²) + (θ³/3!) E(X³) + ⋯
Provided MX is finite for some θ 6= 0, we can differentiate to get
M_X(0) = 1,  M′_X(0) = E(X),  M″_X(0) = E(X²),
and in general M_X^{(k)}(0) = E(X^k), whenever the expectation is finite.
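These identities can be illustrated with finite differences: differentiating M_X(θ) = λ/(λ − θ) numerically at θ = 0 recovers the moments of the exponential distribution (λ = 3 is an arbitrary choice).

```python
lam = 3.0

def M(t):
    """m.g.f. of Exponential(lam), valid for t < lam."""
    return lam / (lam - t)

h = 1e-4   # step for central finite differences
m1 = (M(h) - M(-h)) / (2 * h)             # ≈ M'(0)  = E(X)   = 1/3
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # ≈ M''(0) = E(X^2) = 2/9
var = m2 - m1 ** 2                        # ≈ Var(X) = 1/9
```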
Example 6.5. We have seen that if X is exponential of parameter λ then M_X(θ) = λ/(λ − θ) for θ < λ. We differentiate to find
M′_X(θ) = λ/(λ − θ)²,  M′_X(0) = 1/λ = E(X),
and
M″_X(θ) = 2λ/(λ − θ)³,  M″_X(0) = 2/λ² = E(X²),
from which we recover Var(X) = 1/λ2. Note that all the above computations are for θ < λ.
7. Exercises
Exercise 1. Show that the m.g.f. of a Gamma random variable of parameters n, λ, where
n is a positive integer, is given by
M_X(θ) = ( λ/(λ − θ) )^n.
Deduce that the m.g.f. of an exponential random variable of parameter λ is given by λ/(λ − θ). Differentiate the m.g.f. to compute the mean and variance of the Gamma(n, λ) distribution. Show that if X, Y are independent exponential random variables of parameter λ, then X + Y ∼ Gamma(2, λ). In general, show that if X_1, X_2, ..., X_n are i.i.d. exponential random variables of parameter λ, then
X_1 + X_2 + ⋯ + X_n ∼ Gamma(n, λ).
Exercise 2. Compute the m.g.f. of a uniform random variable on [a, b]. Differentiate to
deduce mean and variance.
Exercise 3. Compute the m.g.f. of a standard Gaussian random variable by completing
the square.
Exercise 4. Show that the m.g.f. of a Cauchy random variable is infinite for all values of
θ 6= 0, while it is equal to 1 at θ = 0.
8. Multivariate distributions
8.1. Joint distribution. We start by defining the joint probability density function of
two real-valued random variables.
Definition 8.1. Two random variables X,Y are said to have joint probability density function
fX,Y if
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy
for all subsets A ⊆ R2. Their joint distribution function is given by
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) du dv.
Properties of the joint probability density function fX,Y
(i) fX,Y (x, y) ≥ 0,
(ii) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx dy = 1.
Properties of the joint distribution function FX,Y
(i) For each fixed x, F_{X,Y}(x, y) is non-decreasing in y, and for each fixed y, F_{X,Y}(x, y) is
non-decreasing in x.
(ii) We have
lim_{x→+∞} lim_{y→+∞} F_{X,Y}(x, y) = 1,  lim_{x→−∞} F_{X,Y}(x, y) = lim_{y→−∞} F_{X,Y}(x, y) = 0,
and
F_{X,Y}(x, +∞) = P(X ≤ x) = F_X(x),  F_{X,Y}(+∞, y) = P(Y ≤ y) = F_Y(y).
Note that fX,Y : R2 → R and FX,Y : R2 → [0, 1], and
P((X, Y) ∈ (a, b) × (c, d)) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
for all (a, b)× (c, d) ⊆ R2. Moreover,
f_{X,Y}(x, y) = (∂²/∂x∂y) F_{X,Y}(x, y),
so we can obtain the joint p.d.f. by differentiating the joint distribution function.
Example 8.2. Let (X, Y) be uniformly distributed on [0, 1]², which is equivalent to saying that
the joint p.d.f. of X, Y is given by
f_{X,Y}(x, y) = 1_{[0,1]²}(x, y).
To compute, say, P(X < Y), we integrate the joint p.d.f. over the set {x < y}, to get
P(X < Y) = ∫∫_{x<y} f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^y dx dy = ∫_0^1 y dy = 1/2.
If, on the other hand, we want to compute P(X ≤ 1/2, X + Y ≤ 1) we integrate the joint p.d.f.
over this set to get
P(X ≤ 1/2, X + Y ≤ 1) = ∫∫_{x≤1/2, x+y≤1} f_{X,Y}(x, y) dx dy = ∫_{x=0}^{1/2} ∫_{y=0}^{1−x} dy dx = 3/8.
Example 8.3. Let (X,Y ) have joint p.d.f.
fX,Y (x, y) = 6xy21[0,1]2(x, y).
We compute P((X, Y) ∈ A) where A = {(x, y) : x + y ≥ 1}. We have
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_{1−y}^1 6xy² dx dy = ∫_0^1 (6y³ − 3y⁴) dy = 9/10.
Note that from the joint p.d.f. we can recover the marginals by integrating over one of the
variables:
f_X(x) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dy,  f_Y(y) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx.
Example 8.4. Continuing with the above example, we find
f_X(x) = ∫_0^1 6xy² dy = 2x for x ∈ [0, 1],
and
f_Y(y) = ∫_0^1 6xy² dx = 3y² for y ∈ [0, 1].
Note that fX,Y (x, y) = fX(x)fY (y). In this case we say that X and Y are independent.
Definition 8.5. Two random variables X,Y are said to be independent if
P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)
for all (x, y) ∈ R2.
The following theorem provides a very useful characterization of independence.
Theorem 8.6. Two random variables X,Y are independent if and only if their joint p.d.f. is
given by the product of the marginals, that is
fX,Y (x, y) = fX(x)fY (y).
We won’t prove this theorem.
Example 8.7. Let X,Y have joint p.d.f.
fX,Y (x, y) = λ2e−λ(x+y)1[0,∞)2(x, y)
for λ > 0. Then
fX(x) = λe−λx1[0,∞)(x), fY (y) = λe−λy1[0,∞)(y),
so fX,Y = fXfY and X,Y are independent exponential random variables of parameter λ.
Example 8.8. Let X,Y have joint p.d.f.
fX,Y (x, y) = λe−λx1[0,∞)×[0,1](x, y)
for λ > 0. Then
fX(x) = λe−λx1[0,∞)(x), fY (y) = 1[0,1](y),
so f_{X,Y} = f_X f_Y and X, Y are independent, with X ∼ Exponential(λ)
and Y ∼ U[0, 1].
All the above definitions generalise in the obvious way.
Definition 8.9. The random variables X1, X2 . . . Xn are said to have joint probability density
function fX1...Xn if
P((X_1, X_2, ..., X_n) ∈ A) = ∫∫⋯∫_A f_{X_1...X_n}(x_1, ..., x_n) dx_1 dx_2 ⋯ dx_n
for all subsets A ⊆ R^n. Their joint distribution function is given by
F_{X_1...X_n}(x_1, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = ∫_{−∞}^{x_1} ∫_{−∞}^{x_2} ⋯ ∫_{−∞}^{x_n} f_{X_1...X_n}(u_1, ..., u_n) du_1 ⋯ du_n.
The random variables X_1, X_2, ..., X_n are said to be independent if
P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = P(X_1 ≤ x_1) P(X_2 ≤ x_2) ⋯ P(X_n ≤ x_n)
for all (x1 . . . xn) ∈ Rn.
Theorem 8.10. The random variables X1, X2 . . . Xn are independent if and only if
fX1,X2...Xn(x1, x2, . . . , xn) = fX1(x1)fX2(x2) · · · fXn(xn).
Exactly as in the discrete case, if X,Y are independent random variables, and g1, g2 are
real functions, then g1(X), g2(Y ) are again independent random variables. In general, if
X1, X2 . . . Xn are independent random variables, and g1, g2 . . . gn are real functions, then
g1(X1), g2(X2) . . . gn(Xn) are independent random variables, and
E(g1(X1)g2(X2) · · · gn(Xn)) = E(g1(X1))E(g2(X2)) · · ·E(gn(Xn)).
8.2. Covariance. Given two random variables X,Y and a function g : R2 → R, we have
the formula
E(g(X, Y)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) f_{X,Y}(x, y) dx dy.
In particular, taking g(x, y) = xy we find
E(XY) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy f_{X,Y}(x, y) dx dy.
Definition 8.11 (Covariance). Given two random variables X,Y with joint p.d.f. fX,Y we
define their covariance as
Cov(X,Y ) = E(XY )− E(X)E(Y ).
Note that if X,Y are independent then fX,Y (x, y) = fX(x)fY (y) and so
E(XY) = ∫∫ xy f_{X,Y}(x, y) dx dy = ∫∫ xy f_X(x) f_Y(y) dx dy = ( ∫ x f_X(x) dx )( ∫ y f_Y(y) dy ) = E(X) E(Y),
which shows that Cov(X,Y ) = 0. The covariance satisfies all the properties that we have listed
for the discrete case.
Example 8.12. Let X,Y be two random variables with joint p.d.f. fX,Y (x, y) = cxy for
0 ≤ x ≤ y2 ≤ 1 and y ∈ [0, 1]. To ensure that fX,Y integrates to 1 we take c = 12. We find
E(XY) = ∫_0^1 ∫_0^{y²} 12x²y² dx dy = 4 ∫_0^1 y⁸ dy = 4/9,
and marginals
f_X(x) = ∫_{√x}^1 12xy dy = 6x(1 − x),  f_Y(y) = ∫_0^{y²} 12xy dx = 6y⁵,
from which E(X) = 1/2 and E(Y) = 6/7. It follows that Cov(X, Y) = E(XY) − E(X)E(Y) = 4/9 − (1/2)(6/7) = 1/63 ≠ 0, and therefore X, Y are not independent.
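Example 8.12 can be spot-checked by Monte Carlo: sample Y from its marginal 6y⁵ and then X from the conditional density 2x/y⁴ on [0, y²], both by inverse transform (these sampling formulas are our own derivation, not part of the notes).

```python
import random

random.seed(4)
n = 400_000
sum_x = sum_y = sum_xy = 0.0
for _ in range(n):
    y = random.random() ** (1 / 6)       # inverse transform for f_Y(y) = 6y^5
    x = y * y * random.random() ** 0.5   # inverse transform for f(x|y) = 2x/y^4 on [0, y^2]
    sum_x += x
    sum_y += y
    sum_xy += x * y
ex, ey, exy = sum_x / n, sum_y / n, sum_xy / n
cov = exy - ex * ey                      # ≈ 4/9 - (1/2)(6/7) = 1/63
```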
9. Exercises
Exercise 1. Let X, Y be independent exponential random variables of parameter λ > 0.
Find the distribution of Z = min{X, Y}. Deduce E(Z) and Var(Z).
Exercise 2. Let (X, Y) be uniformly distributed on the unit disc, that is
f_{X,Y}(x, y) = c · 1_{{x²+y²≤1}}(x, y).
Determine c. Compute the marginal probability density functions fX and fY . Are X and Y
independent?
Exercise 3. Let X,Y have joint p.d.f. fX,Y (x, y) = c(x+ 2y)1[0,1]2(x, y). Find c. Compute
P(X < Y ) and P(X + Y < 1/2). Find the marginals fX and fY . Compute E(X), E(Y ) and
Cov(X,Y ). Compute E(XY 2).
Exercise 4. Let X,Y have joint distribution function
FX,Y (x, y) = 1− e−x − e−y + e−(x+y)
for (x, y) ∈ [0,∞)2. Find P(X < 4, Y < 2). Find P(X < 3). Differentiate to compute the joint
p.d.f. fX,Y of X,Y . Check that fX,Y integrates to 1.
Exercise 5. Let X,Y be independent uniform random variables in [0, 1]. Write the joint
p.d.f. fX,Y . Compute E(XY ), E(sin(X + Y )). Write down the joint distribution function FX,Y .
Exercise 6. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = cx²e^y 1_{{0<x<y<1}}(x, y). Find c. Find
the marginals fX and fY . Are X,Y independent? Compute E(XY ), E(X2e−Y ), E(X(1 + 4Y )),
E(eX+Y ), E(eX−Y ).
Exercise 7. The radius of a circle is exponentially distributed with parameter λ. Determine
the probability density function of the perimeter of the circle and of the area of the circle.
Exercise 8. Let Y ∼ N(µ, σ2). Determine the p.d.f. of X = eY .
Exercise 9. Let X ∼ U [0, π]. Determine the p.d.f. of Y = cosX.
Exercise 10. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = (1/y) 1_{{0<x<y<1}}(x, y). Check that f_{X,Y}
integrates to 1. Find the marginal p.d.f.’s fX and fY . Are X,Y independent? Compute E(XY ),
E(XY k) for k positive integer.
Exercise 11. Find the joint p.d.f. of X,Y having joint distribution function
FX,Y (x, y) = (1− e−x2)(1− e−y2)1[0,∞)2(x, y).
10. Transformations of two-dimensional random variables
Let X,Y be two random variables with joint probability density function fX,Y . We introduce
the new random variables U = u(X,Y ), V = v(X,Y ) for some functions u, v : R2 → R such
that the map (X, Y) → (U, V) is invertible. Then we can write
U = u(X, Y), V = v(X, Y)  ⟺  X = x(U, V), Y = y(U, V).
We would like to obtain a formula for the joint p.d.f. of U, V . To this end, introduce the
Jacobian J of this transformation, defined as
J = det( ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ) = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).
Then the joint p.d.f. of (U, V ) is given by
fU,V (u, v) = |J |fX,Y (x(u, v), y(u, v)).
Example 10.1. Let X, Y be independent exponential random variables of parameter λ > 0, so
that their joint p.d.f. is f_{X,Y}(x, y) = λ²e^{−λ(x+y)} for (x, y) ∈ [0, +∞)². We want to compute the
joint p.d.f. of X + Y, Y. Let U = X + Y, V = Y. Then
u(x, y) = x + y, v(x, y) = y  ⟺  x(u, v) = u − v, y(u, v) = v,
and so
J = det( 1  −1 ; 0  1 ) = 1.
It follows that the joint p.d.f. of U, V is given by
fU,V (u, v) = |J |fX,Y (u− v, v) = fX(u− v)fY (v) = λ2e−λu1[0,∞)(u)1[0,u](v).
We can use this to compute the marginal p.d.f. of X + Y , to get
f_{X+Y}(u) = ∫_{−∞}^{+∞} f_{U,V}(u, v) dv = ∫_0^u λ²e^{−λu} dv = λ²u e^{−λu}
for u ≥ 0. Thus X + Y ∼ Gamma(2, λ).
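The conclusion X + Y ∼ Gamma(2, λ) can be confirmed by simulation, comparing the empirical distribution function of sums of two exponentials with F(u) = 1 − e^{−λu}(1 + λu); the values λ = 1.5 and the test point u = 2 are arbitrary choices.

```python
import random, math

random.seed(5)
lam = 1.5
n = 200_000
# Sums of two independent Exponential(lam) samples (inverse transform).
sums = [-(math.log(random.random()) + math.log(random.random())) / lam
        for _ in range(n)]
u = 2.0
F_hat = sum(1 for s in sums if s <= u) / n
# Gamma(2, lam) distribution function at u:
F_exact = 1 - math.exp(-lam * u) * (1 + lam * u)
```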
Example 10.2. Let again X,Y be independent exponentials of parameter λ > 0. Define
U = X + Y,  V = X/(X + Y).
Then
u(x, y) = x + y, v(x, y) = x/(x + y)  ⟺  x(u, v) = uv, y(u, v) = u(1 − v),
from which
J = det( v  u ; 1 − v  −u ) = −u.
It follows that the joint p.d.f. of U, V is given by
fU,V (u, v) = |J |fX(uv)fY (u(1− v)) = λ2ue−λu for 0 < u <∞, 0 < v < 1.
Note that we can decompose fU,V as a product fUfV for fU (u) = λ2ue−λu1[0,∞)(u) and
fV (v) = 1[0,1](v). It follows that U, V are independent random variables, and
U ∼ Gamma(2, λ), V ∼ U [0, 1].
11. Limit theorems
In this section we look at the limiting behaviour of sums of i.i.d. (independent and identically
distributed) random variables.
Theorem 11.1 (Weak law of large numbers). Let (Xk)k≥1 be a sequence of i.i.d. random
variables with finite mean µ, and define
SN = X1 +X2 + · · ·+XN .
Then for all ε > 0,
P( |S_N/N − µ| > ε ) → 0
as N → ∞.
Proof. We prove the theorem assuming that the random variables also have finite variance
σ². Note that E(S_N/N) = µ and Var(S_N/N) = σ²/N. It then follows from Chebyshev's
inequality that
P( |S_N/N − µ| > ε ) = P( |S_N/N − E(S_N/N)| > ε ) ≤ Var(S_N/N)/ε² = σ²/(Nε²) → 0
as N → ∞.
The following theorem tells us that the convergence holds in a stronger sense.
Theorem 11.2 (Strong law of large numbers). Let (Xk)k≥1 be a sequence of i.i.d. random
variables with finite mean µ. Then
P( lim_{N→∞} S_N/N = µ ) = 1.
We won’t prove this result in this course.
Finally, once we know that S_N/N → µ as N → ∞, we may ask how large an error we make
when approximating S_N by Nµ for large N. This is the object of the next result.
Theorem 11.3 (Central Limit Theorem). Let (Xk)k≥1 be a sequence of i.i.d. random variables
with finite mean µ and variance σ2. Then for all x ∈ R
P( (S_N − Nµ)/(σ√N) ≤ x ) → Φ(x)
as N → ∞, where
Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy
is the distribution function of a standard Gaussian random variable.
Thus the central limit theorem tells us that, for large N , we can approximate the sum SN by a
Gaussian random variable with mean Nµ and variance Nσ2.
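A small simulation makes the theorem concrete: standardised sums of N = 50 uniforms should fall in (−∞, 1] with probability close to Φ(1) ≈ 0.8413 (the choices N = 50 and 20 000 trials are arbitrary).

```python
import random, math

random.seed(6)
N, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and s.d. of a single U[0,1] variable
hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(N))
    # Standardise the sum as in the central limit theorem.
    if (s - N * mu) / (sigma * math.sqrt(N)) <= 1.0:
        hits += 1
frac = hits / trials   # ≈ Φ(1) ≈ 0.8413
```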
12. Exercises
Exercise 1. Let X,Y be independent non-negative random variables, with p.d.f. fX and
fY respectively. By performing the change of variables U = X + Y , V = Y , show that
f_{X+Y}(u) = ∫_0^u f_X(u − v) f_Y(v) dv = (f_X ∗ f_Y)(u).
Exercise 2. Use the previous exercise to compute the p.d.f. of the sum of two independent
U [0, 1] random variables, and sketch its graph.
Exercise 3. Let X,Y have joint p.d.f. fX,Y (x, y) = 4xy1[0,1]2(x, y). Are X,Y independent?
Find the joint p.d.f. of U = X/Y and V = XY . Compute the marginals fU and fV . Are U, V
independent?
Exercise 4. (Polar coordinates) Let X,Y be independent N(0, 1) random variables. Write
their joint p.d.f. fX,Y . Define the new random variables R,Θ by
X = R cos Θ, Y = R sin Θ.
Compute the joint p.d.f. fR,Θ of R,Θ, and show that it factorizes into fR(r)fΘ(θ). Deduce that
R,Θ are independent, and determine their distributions.
Exercise 5. Let X, Y be independent N(0, 1) random variables. For a, b, c ∈ R₊, define
X = aU + bV,  Y = cV.
Assuming that ac 6= 0, find the joint p.d.f. of U, V and the marginals fU and fV . Where was
the assumption ac 6= 0 used?
Exercise 6. Let X,Y be independent N(0, 1) random variables. Show that, for any fixed
θ, the random variables
U = X cos θ + Y sin θ, V = −X sin θ + Y cos θ
are independent and find their distributions.
CHAPTER 3
Statistics
1. Parameter estimation
Statistics is concerned with data analysis. Let us start with a simple example. Imagine tossing
a biased coin n times. The coin gives heads with probability p, but p is unknown: we would like
to estimate p based on the observed outcomes. Denote by X_1, X_2, ..., X_n the random variables
recording the n coin tosses. Then they are independent Bernoulli(p) random variables with
mean p. Let x_1, x_2, ..., x_n denote the sequence of observed outcomes in {0, 1}. A reasonable
guess for p is given by the proportion of heads in the trials; that is, we estimate p via
p̂(x_1, ..., x_n) = (1/n) ∑_{k=1}^n x_k.
How good is this guess? How do we build a reasonable guess for the parameter in a more
general setting?
Let X_1, X_2, ..., X_n be independent and identically distributed (i.i.d.) random variables, repre-
senting n identical and independent experiments (or measurements). We assume that the X_k's
have common probability distribution P( · ; θ) depending on a parameter θ ∈ R. Our aim is to
estimate θ, that is to come up with a guess for the value of θ, based on having observed that
X1 = x1, X2 = x2 . . . Xn = xn.
Definition 1.1. An estimator T for the parameter θ is a function T = T (X1, X2 . . . Xn). We
say that the estimator T is unbiased if
E(T (X1 . . . Xn)) = θ,
and T is said to be biased if it is not unbiased.
Thus an unbiased estimator gives, on average, the right guess for the value of θ. We will often
use the vector notation
X = (X1, X2 . . . Xn)
for brevity, using which we say that T is unbiased if E(T (X)) = θ.
Example 1.2. Going back to biased coin tosses, we have proposed to estimate p via
T(X) = (1/n) ∑_{k=1}^n X_k.
To check whether this is an unbiased estimator for the parameter p, we compute the expectation:
E(T (X)) =1
n
n∑k=1
E(Xk) = p,
so T is unbiased for p. Note that we could have instead considered the estimator

T2(X) = X1,

which is also unbiased, since E(T2(X)) = E(X1) = p. This corresponds to guessing that the
probability p of heads equals 1 if the first toss comes up heads, and p = 0 otherwise, ignoring
all the following coin tosses. This doesn't seem to be a good estimator, and we will see that
indeed it isn't.
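The two estimators can also be compared empirically. The following Python sketch (not part of the original notes; the value p = 0.3 and the sample sizes are arbitrary choices) simulates repeated experiments: both T and T2 average out to p, but T2 fluctuates far more.

```python
import random

def T(xs):
    # sample proportion of heads: (1/n) * sum of the x_k
    return sum(xs) / len(xs)

def T2(xs):
    # uses only the first toss, ignoring the rest
    return xs[0]

random.seed(0)
p, n, reps = 0.3, 100, 2000
trials = [[1 if random.random() < p else 0 for _ in range(n)] for _ in range(reps)]

avg_T = sum(T(t) for t in trials) / reps    # close to p: T is unbiased
avg_T2 = sum(T2(t) for t in trials) / reps  # also close to p: T2 is unbiased too
var_T = sum((T(t) - avg_T) ** 2 for t in trials) / reps
var_T2 = sum((T2(t) - avg_T2) ** 2 for t in trials) / reps  # much larger spread
```

Both averages land near p = 0.3, but the spread of T2 is roughly n times that of T, which is why T2 is a poor estimator despite being unbiased.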
How do we build estimators in a general setting?
1.1. Maximum Likelihood Estimation. Let X1, X2 . . . Xn be i.i.d. random variables
with common distribution P( · ; θ) depending on a real parameter θ. Suppose we have observed
that
X1 = x1, X2 = x2 . . . Xn = xn,
and we are asked to come up with a guess for the value of θ based on these data. What would
be our best guess? Intuitively, it is reasonable to pick the value of θ which maximises the
probability of seeing the observed data. This is the idea behind maximum likelihood estimation.
Definition 1.3 (Likelihood). The likelihood of θ is the function
l(θ) = P(X1 = x1, X2 = x2 . . . Xn = xn; θ).
The log-likelihood of θ is defined as
L(θ) = log l(θ).
Note that by independence

l(θ) = ∏_{k=1}^n P(X_k = x_k; θ),

and

L(θ) = ∑_{k=1}^n log P(X_k = x_k; θ).
Definition 1.4. The maximum likelihood estimate for θ, given data x = (x1, x2 . . . xn),
is the value of θ that maximises the likelihood l(θ), that is

θ(x) = arg max l(θ).

The associated Maximum Likelihood Estimator (MLE) is given by T(X) = θ(X).
This gives us an algorithm to construct estimators: simply write down the likelihood, and
maximise it with respect to the parameter θ.
Remark 1.5. Since L(θ) = log l(θ) and log is increasing, the likelihood and log-likelihood are maximised at the same
point. Thus the MLE is also the value of θ that maximises the log-likelihood:

θ(x) = arg max L(θ).
This is often convenient for computations. Indeed, the likelihood comes as a product (due to
independence) of functions of θ, so it’s lengthy to differentiate it. On the other hand, the
log-likelihood is a sum of functions of θ, which is much easier to differentiate.
Example 1.6. Going back to coin tosses, we have X1, X2 . . . Xn i.i.d. Bernoulli(p), so

P(X_k = x_k; p) = p^{x_k} (1 − p)^{1 − x_k}

for all k ≤ n. The likelihood function is given by

l(p) = P(X1 = x1, X2 = x2 . . . Xn = xn; p) = ∏_{k=1}^n p^{x_k} (1 − p)^{1 − x_k} = p^{∑_k x_k} (1 − p)^{n − ∑_k x_k}.

Thus

L(p) = log l(p) = (∑_{k=1}^n x_k) log p + (∑_{k=1}^n (1 − x_k)) log(1 − p).
We differentiate to find

(d/dp) L(p) = (1/p) ∑_{k=1}^n x_k − (1/(1 − p)) ∑_{k=1}^n (1 − x_k) = (1/(p(1 − p))) (∑_{k=1}^n x_k − np),

which shows that L(p) is maximised at

p(x) = (1/n) ∑_{k=1}^n x_k.

Thus the MLE is given by

T(X) = (1/n) ∑_{k=1}^n X_k,

which, as we have already seen, is unbiased.
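One can check numerically that the log-likelihood indeed peaks at the sample proportion. A small Python sketch (the data below are invented for illustration):

```python
import math

def L(p, xs):
    # Bernoulli log-likelihood: (sum x_k) log p + (n - sum x_k) log(1 - p)
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

xs = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]        # 6 heads out of 10 tosses
grid = [k / 1000 for k in range(1, 1000)]  # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: L(p, xs))  # grid maximiser of the log-likelihood
```

The grid maximiser p_hat coincides with the sample mean 0.6, as the calculus above predicts.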
Example 1.7. Let X1, X2 . . . Xn be i.i.d. Poisson(λ). To build the MLE for λ, we write down
the likelihood

l(λ) = P(X1 = x1, X2 = x2 . . . Xn = xn; λ) = ∏_{k=1}^n e^{−λ} λ^{x_k} / x_k! = e^{−λn} λ^{∑_k x_k} ∏_{k=1}^n 1/x_k!.

Thus

L(λ) = log l(λ) = −λn + (log λ) ∑_{k=1}^n x_k − ∑_{k=1}^n log(x_k!).
We differentiate to get

(d/dλ) L(λ) = −n + (1/λ) ∑_{k=1}^n x_k,

which shows that L(λ) is maximised at

λ(x) = (1/n) ∑_{k=1}^n x_k.

Thus the MLE for λ is

T(X) = (1/n) ∑_{k=1}^n X_k.

This is unbiased, since

E(T(X)) = (1/n) ∑_{k=1}^n E(X_k) = λ.
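This can be verified by simulation. Since the Python standard library has no Poisson sampler, the sketch below (not part of the original notes; λ = 2, n = 50 are arbitrary) draws Poisson variates with Knuth's method and checks that the MLE averages out to λ:

```python
import math
import random

rng = random.Random(1)

def poisson(lam):
    # Knuth's method: count uniform draws until their product drops below e^{-lam}
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

lam, n, reps = 2.0, 50, 1000
# the MLE for lambda is the sample mean; average it over many repetitions
mles = [sum(poisson(lam) for _ in range(n)) / n for _ in range(reps)]
avg = sum(mles) / reps   # close to lam, consistent with unbiasedness
```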
For continuous random variables we can use exactly the same approach. If X1, X2 . . . Xn are
i.i.d. continuous random variables with common p.d.f. f( · ; θ) depending on a parameter θ, define
the likelihood as

l(θ) = f_{X1,X2...Xn}(x1, x2 . . . xn; θ) = ∏_{k=1}^n f(x_k; θ),

and the log-likelihood by

L(θ) = log l(θ)

whenever l(θ) ≠ 0, and L(θ) = 0 otherwise. Again, the MLE is the value of θ which maximises
the likelihood, or equivalently the log-likelihood.
Example 1.8. Let X1, X2 . . . Xn be i.i.d. exponentials of parameter λ. We aim to build the
MLE for λ. To this end, write the likelihood function

l(λ) = ∏_{k=1}^n λ e^{−λ x_k} = λ^n e^{−λ ∑_k x_k},

from which

L(λ) = n log λ − λ ∑_{k=1}^n x_k.
We differentiate, to get

(d/dλ) L(λ) = n/λ − ∑_{k=1}^n x_k,

which shows that the likelihood is maximised at

λ(x) = 1 / ((1/n) ∑_{k=1}^n x_k).

Thus our MLE is given by

T(X) = ((1/n) ∑_{k=1}^n X_k)^{−1}.

This is biased, since for n = 1 we find

E(T(X)) = E(1/X) = ∫_0^∞ (1/x) λ e^{−λx} dx = +∞.
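The bias is visible numerically also for n ≥ 2: since ∑_k X_k ∼ Gamma(n, λ), one can compute E(T(X)) = nλ/(n − 1) for n ≥ 2, which exceeds λ. The Python sketch below (not part of the original notes; λ = 2, n = 5 are arbitrary) estimates E(T(X)) by simulation:

```python
import math
import random

rng = random.Random(1)

def exp_sample(lam):
    # inverse-CDF sampling: -log(U)/lam with U uniform in (0, 1]
    return -math.log(1.0 - rng.random()) / lam

lam, n, reps = 2.0, 5, 20000
# MLE T(X) = n / sum(X_k): reciprocal of the sample mean
estimates = [n / sum(exp_sample(lam) for _ in range(n)) for _ in range(reps)]
avg = sum(estimates) / reps   # near n*lam/(n-1) = 2.5, above the true lam = 2
```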
Example 1.9. Let X1, X2 . . . Xn be i.i.d. Uniform random variables in [0, θ]. We want to
estimate θ. Using that each Xk has p.d.f.

f(x) = (1/θ) 1_{[0,θ]}(x),

we have likelihood function

l(θ) = (1/θ^n) 1_{[0,θ]}(x1) 1_{[0,θ]}(x2) · · · 1_{[0,θ]}(xn).

Using that the condition xk ∈ [0, θ] for all k is equivalent to

max_{1≤k≤n} x_k ≤ θ,    min_{1≤k≤n} x_k ≥ 0,

we rewrite l(θ) as

l(θ) = 1/θ^n  for θ ≥ max_{1≤k≤n} x_k and min_{1≤k≤n} x_k ≥ 0,
l(θ) = 0      otherwise.

Moreover, we can assume that min_k x_k ≥ 0, since negative data would contradict our assumption
that the Xk's are uniform in [0, θ]. Differentiating the likelihood gives

(d/dθ) l(θ) = −n θ^{−n−1} < 0  for θ ≥ max_{1≤k≤n} x_k,

so l(θ) is decreasing on this range, and hence it is maximised at the smallest admissible value

θ(x) = max_{1≤k≤n} x_k.

Our MLE is therefore given by

T(X) = max_{1≤k≤n} X_k.
Is this unbiased? To answer this question, we have to compute its expectation and check
whether it equals θ. To this end, we first compute the distribution function FT of T(X):

FT(x) = P(T ≤ x) = P(max_{1≤k≤n} X_k ≤ x) = P(X1 ≤ x, X2 ≤ x . . . Xn ≤ x) = (x/θ)^n

for x ∈ [0, θ], while FT(x) = 0 if x ≤ 0 and FT(x) = 1 if x ≥ θ. It follows that the p.d.f. of T is
given by

fT(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x).

We use this to compute

E(T) = ∫_0^θ x fT(x) dx = ∫_0^θ (n/θ^n) x^n dx = (n/(n+1)) θ.
Thus the MLE T(X) is biased. On the other hand, note that as n → ∞

E(T(X)) = (n/(n+1)) θ −→ θ,

so T(X) is asymptotically unbiased.
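A quick simulation confirms E(T(X)) ≈ nθ/(n + 1) (Python sketch, not part of the original notes; θ = 3 and n = 10 are arbitrary):

```python
import random

rng = random.Random(2)
theta, n, reps = 3.0, 10, 20000
# MLE T(X) = max of the sample; it always underestimates theta
estimates = [max(rng.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
avg = sum(estimates) / reps   # near n/(n+1) * theta = 30/11
```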
2. Exercises
Exercise 1. Let X1, X2 . . . Xn be i.i.d. geometric random variables of parameter p. Write
down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased
for n = 1? [Hint: recall the Taylor series for log(1− x).]
Exercise 2. Let X ∼ Binomial(N, p) where N is known and p is to be estimated. Write
down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased?
Exercise 3. Let X ∼ Binomial(N, p) where p is known and N is to be estimated. Write
down the likelihood function l(N). By maximising l(N), compute the MLE for N . Is it always
unique?
Exercise 4. Let X1, X2 . . . Xn be i.i.d. N(µ, 1) random variables. Write down the likelihood
function l(µ). By maximising l(µ), compute the MLE for µ. Is it unbiased?
Exercise 5. Let X1, X2 . . . Xn be i.i.d. N(µ, σ2) random variables, with µ known and σ2
to be estimated. Write down the likelihood function l(σ2). By maximising l(σ2), compute the
MLE for σ2. Is it unbiased?
Exercise 6. Let X1, X2 . . . Xn be i.i.d. U[0, θ] random variables. We have seen that the
MLE for θ is given by

T(X) = max_{1≤k≤n} X_k,

and that this is a biased estimator. Can you use this to build an unbiased estimator for θ?
3. Mean square error
Let X1, X2 . . . Xn be i.i.d. following a given probability distribution depending on a parameter
θ, that we aim to estimate. We have seen how to build the MLE for θ, but of course we could
consider different estimators. Which one is the best? We need a measure for how good a given
estimator is.
Example 3.1. If the Xk's are i.i.d. Bernoulli of parameter p, representing independent coin
tosses, we have seen that the MLE is given by

T1(X) = (1/n) ∑_{k=1}^n X_k,

and it is unbiased. On the other hand, we could introduce the alternative estimators

T2(X) = X1,    T3(X) = (1/3) X1 + (2/3) X2.
It is easy to see that they are all unbiased: which one should we use? We want our estimator
to be as close as possible to the true value p, so it is reasonable to choose the estimator that
minimises the expected square distance from p. We have

E[(T1(X) − p)^2] = p(1 − p)/n,    E[(T2(X) − p)^2] = p(1 − p),    E[(T3(X) − p)^2] = (5/9) p(1 − p).

This shows that the MLE T1(X) is the one that, on average, gives the least square error, so it
is the one that performs best. Note that, in fact,

E[(T1(X) − p)^2] = p(1 − p)/n −→ 0  as n → ∞,

so if we could observe infinitely many coin tosses we would be able to determine the value of p
with certainty. This is false for the estimators T2 and T3.
Definition 3.2 (Mean Square Error). Given an estimator T(X) for a parameter θ, the Mean
Square Error (MSE) of T is defined as

MSE(T) = E[(T(X) − θ)^2].

Note that if T is unbiased, that is E(T(X)) = θ, then MSE(T) = Var(T(X)). If T is biased,
let

bias(T) = E(T(X)) − θ

denote the bias of T. Then the MSE can be expressed as

MSE(T) = Var(T(X)) + (bias(T))^2.
Indeed,

MSE(T) = E[(T(X) − θ)^2] = E[(T(X) − E(T(X)) + E(T(X)) − θ)^2]
       = E[(T(X) − E(T(X)))^2] + (E(T(X)) − θ)^2 + 2 E[T(X) − E(T(X))] · (E(T(X)) − θ)
       = Var(T(X)) + (bias(T))^2,

where the cross term vanishes since E[T(X) − E(T(X))] = 0.
This identity is often useful for computations involving biased estimators.
Example 3.3. Let us go back to Example 1.9, in which we showed that the MLE for i.i.d.
U[0, θ] random variables is given by

T(X) = max_{1≤k≤n} X_k,

which has p.d.f. fT(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x) and mean E(T) = (n/(n+1)) θ. Note that T is biased: we
may wish to consider instead the unbiased estimator

T1(X) = 2X1.
Does T1 perform better than T? To answer this question, we compute the Mean Square Errors.
Since

bias(T) = E(T(X)) − θ = −θ/(n + 1),

Var(T) = E(T^2(X)) − E(T)^2 = ∫_0^θ (n/θ^n) x^{n+1} dx − ((n/(n+1)))^2 θ^2 = (n/(n+2)) θ^2 − (n/(n+1))^2 θ^2,

we find

MSE(T) = Var(T(X)) + (bias(T))^2 = 2θ^2 / ((n+1)(n+2)).

On the other hand, since T1 is unbiased we have

MSE(T1) = Var(T1(X)) = 4 Var(X1) = θ^2/3.
Since

2θ^2 / ((n+1)(n+2)) ≤ θ^2/3

for all θ > 0 and n ≥ 1, this shows that T, although biased, on average performs better than
the unbiased estimator T1. Note that again

MSE(T) = 2θ^2 / ((n+1)(n+2)) → 0  as n → ∞,

which means that if we had access to infinitely many observations we would be able to determine
the exact value of θ using the estimator T. This is false for the unbiased estimator T1, whose
MSE does not depend on n.
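Both MSEs, and the bias-variance decomposition itself, can be checked numerically. In the Python sketch below (not part of the original notes; θ = 2 and n = 8 are arbitrary), note that the decomposition holds exactly for the empirical moments as well:

```python
import random

rng = random.Random(4)
theta, n, reps = 2.0, 8, 40000

vals_T, sq_err_T, sq_err_T1 = [], 0.0, 0.0
for _ in range(reps):
    x = [rng.uniform(0, theta) for _ in range(n)]
    t = max(x)                              # biased MLE
    vals_T.append(t)
    sq_err_T += (t - theta) ** 2
    sq_err_T1 += (2 * x[0] - theta) ** 2    # unbiased estimator T1 = 2 X_1

mse_T, mse_T1 = sq_err_T / reps, sq_err_T1 / reps
mean_T = sum(vals_T) / reps
var_T = sum((v - mean_T) ** 2 for v in vals_T) / reps
bias_T = mean_T - theta
# empirical check of MSE = Var + bias^2, and of MSE(T) << MSE(T1)
```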
4. Confidence intervals
Given i.i.d. experiments X1, X2 . . . Xn following some distribution f( · ; θ) depending on a
parameter θ ∈ R, we have seen how to construct the MLE for θ. Moreover, given several
estimators for θ we have proposed to use the MSE to determine which one performs better.
Once an estimator T for θ is chosen, and data x = (x1, x2 . . . xn) have been observed, we
estimate θ by
θ(x) = T (x),
which is a real number, a precise guess for θ. It is often the case that instead of a single guess,
we are willing to accept a guess of the form
a(x) ≤ θ ≤ b(x)
for some a(x), b(x) depending on the data, provided this has high probability. In other words,
we are willing to give up precision in our guess, to gain confidence that our guess is correct.
Example 4.1. Let X1, X2 . . . Xn be i.i.d. N(µ, 1), where µ is to be estimated. We have seen
that the MLE for µ is

T(X) = X̄ = (1/n) ∑_{k=1}^n X_k,

and it is unbiased. Moreover, since a linear combination of independent Gaussian random variables is still
Gaussian, we have X̄ ∼ N(µ, 1/n), and thus

X̄ − µ ∼ N(0, 1/n).

Note that, crucially, the distribution of X̄ − µ does not depend on µ, and

Z = √n (X̄ − µ) ∼ N(0, 1).

Thus we can choose α, β such that

P(α ≤ Z ≤ β) = ∫_α^β (1/√(2π)) e^{−x²/2} dx = 0.95.

The choice of α, β is not unique: it is natural to take the interval symmetric about 0, that is
β = −α, which makes the interval [α, β] as small as possible. Writing α > 0 for the right
endpoint, we get

P(−α ≤ Z ≤ α) = P(−α ≤ √n (X̄ − µ) ≤ α) = 0.95.

Rearranging, we find

P(X̄ − α/√n ≤ µ ≤ X̄ + α/√n) = 0.95.

In all, we have found that if a(X) = X̄ − α/√n and b(X) = X̄ + α/√n, the interval

[a(X), b(X)]

is a 95% confidence interval for µ, that is

P(a(X) ≤ µ ≤ b(X)) = 0.95.
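The 95% coverage can be verified by simulation, using the standard normal quantile α ≈ 1.96 (Python sketch, not part of the original notes; µ = 1.5 and n = 20 are arbitrary):

```python
import random

rng = random.Random(5)
mu, n, reps = 1.5, 20, 20000
alpha = 1.96   # approximately the 97.5% quantile of N(0, 1)

covered = 0
for _ in range(reps):
    x = [rng.gauss(mu, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    half = alpha / n ** 0.5   # half-width of the interval
    if xbar - half <= mu <= xbar + half:
        covered += 1
coverage = covered / reps     # close to 0.95
```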
Definition 4.2 (Confidence interval). Given two statistics a(X) and b(X) with a(X) ≤ b(X),
and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be a 100γ% confidence interval for
the unknown parameter θ if
P(a(X) ≤ θ ≤ b(X)
)= γ.
It is important to note that here the parameter θ is not random, but the endpoints a(X) and
b(X) of the confidence interval are. Thus, if we were to collect data multiple times, and
compute the confidence interval for each data collection, we would expect θ to be in the confidence
interval about 100γ% of the time.
Example 4.3. Let X1, X2 . . . Xn be i.i.d. exponential random variables of parameter λ. Consider
the statistic

T(X) = ∑_{k=1}^n X_k

(which is not the MLE for λ). We have seen that the sum of n independent exponentials of
parameter λ has Gamma(n, λ) distribution, so

T(X) ∼ Gamma(n, λ),    fT(t) = (λ^n t^{n−1} e^{−λt} / (n−1)!) 1_{[0,∞)}(t).

We cannot use T(X) to construct a confidence interval for λ, as its p.d.f. depends on the
unknown parameter λ. We therefore look for some other function of the Xk's whose distribution
does not depend on λ. Take

S(X) = 2λ T(X).

We compute the distribution function of S, to find

FS(s) = P(S ≤ s) = P(2λT ≤ s) = P(T ≤ s/(2λ)) = FT(s/(2λ)),

where FT denotes the distribution function of T(X). Differentiating, we find

fS(s) = (d/ds) FT(s/(2λ)) = fT(s/(2λ)) · (1/(2λ)) = (1/2)^n (s^{n−1}/(n−1)!) e^{−s/2} 1_{[0,∞)}(s).

Note that S ∼ Gamma(n, 1/2). In particular, we see that the p.d.f. of S(X) does not depend
on λ. Given γ ∈ (0, 1), then, we can build a 100γ% confidence interval for λ by choosing
α, β ∈ R such that

P(α ≤ S(X) ≤ β) = γ.

Using that S(X) = 2λ ∑_{k=1}^n X_k, this gives

P(α ≤ 2λ ∑_{k=1}^n X_k ≤ β) = γ,

which, solving the inequalities with respect to λ, gives

P(α / (2 ∑_{k=1}^n X_k) ≤ λ ≤ β / (2 ∑_{k=1}^n X_k)) = γ.

Note that here α, β, γ are known numbers. Thus

[α / (2 ∑_{k=1}^n X_k),  β / (2 ∑_{k=1}^n X_k)]

is a 100γ% confidence interval for λ.
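This construction can be tested end to end by simulation. In the Python sketch below (not part of the original notes; n = 10, λ = 3 are arbitrary), α and β are obtained as empirical 2.5% and 97.5% quantiles of Gamma(n, 1/2), simulated as a sum of n exponentials of rate 1/2, rather than read from a table:

```python
import math
import random

rng = random.Random(6)
n = 10

def exp_sample(rate):
    # inverse-CDF sampling for an exponential of the given rate
    return -math.log(1.0 - rng.random()) / rate

# empirical 2.5% / 97.5% quantiles of S ~ Gamma(n, 1/2)
s_samples = sorted(sum(exp_sample(0.5) for _ in range(n)) for _ in range(100000))
alpha = s_samples[int(0.025 * len(s_samples))]
beta = s_samples[int(0.975 * len(s_samples))]

# coverage check: [alpha/(2 sum X), beta/(2 sum X)] should contain lam ~95% of the time
lam, reps, covered = 3.0, 5000, 0
for _ in range(reps):
    total = sum(exp_sample(lam) for _ in range(n))
    if alpha / (2 * total) <= lam <= beta / (2 * total):
        covered += 1
coverage = covered / reps
```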
4.1. Approximate confidence intervals. We have seen that in order to build a confidence interval for a parameter θ we need to find some function of the data X and the parameter
θ whose distribution does not depend on θ. We have seen in Example 4.1 that this is possible
when X1, X2 . . . Xn are i.i.d. N(µ, 1) random variables and T(X) = X̄. But how do we proceed in
general? The idea is to use the Central Limit Theorem (CLT) to reduce to the Gaussian case.
Example 4.4. Let X1, X2 . . . Xn be i.i.d. Bernoulli random variables of parameter p. Let

S_n = ∑_{k=1}^n X_k,

so that the MLE for p is given by T(X) = S_n/n, and it is unbiased. Write

E(X1) = p,    Var(X1) = p(1 − p).

Then by the CLT the random variable

(S_n − np) / √(np(1 − p))

is approximately N(0, 1) distributed as n → ∞. Thus, if Z ∼ N(0, 1), for very large n we can
approximate

P(−α ≤ (S_n − np)/√(np(1 − p)) ≤ α) ≈ P(−α ≤ Z ≤ α) = ∫_{−α}^{α} (1/√(2π)) e^{−x²/2} dx.

We choose α > 0 so that the above probability equals a given γ ∈ (0, 1), from which we get

P(−α ≤ (S_n − np)/√(np(1 − p)) ≤ α) ≈ γ.

Rearranging, and replacing the unknown variance p(1 − p) under the square root by its estimate
(S_n/n)(1 − S_n/n), this gives

P(S_n/n − (α/n)√(S_n(n − S_n)/n) ≤ p ≤ S_n/n + (α/n)√(S_n(n − S_n)/n)) ≈ γ

for n large enough. Thus

[S_n/n − (α/n)√(S_n(n − S_n)/n),  S_n/n + (α/n)√(S_n(n − S_n)/n)]

is an approximate 100γ% confidence interval for p. This is useful for opinion polls, where we
assume that each person votes, say, democratic with probability p, and given that we observed
n votes we want to estimate p.
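The interval is easy to code (Python sketch, not part of the original notes; the poll numbers below are invented, and α = 1.96 targets γ = 0.95):

```python
def approx_ci(s_n, n, alpha=1.96):
    # approximate 100*gamma% CI for p from S_n successes in n trials:
    # S_n/n +/- (alpha/n) * sqrt(S_n (n - S_n) / n)
    p_hat = s_n / n
    half = (alpha / n) * (s_n * (n - s_n) / n) ** 0.5
    return p_hat - half, p_hat + half

lo, hi = approx_ci(520, 1000)   # 520 favourable answers out of 1000 polled
```

For 520 out of 1000 the interval is roughly 0.52 plus or minus 0.031, so the poll pins p down to about three percentage points either way.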
Definition 4.5 (Approximate confidence interval). Let X ∈ R^n. Given two statistics a(X)
and b(X) with a(X) ≤ b(X), and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be an
approximate 100γ% confidence interval for the unknown parameter θ if

P(a(X) ≤ θ ≤ b(X)) → γ

as n → ∞.
5. Exercises
Exercise 1. Suppose we observe the outcomes of 10 tosses of the same biased coin, giving
head with probability p (unknown). These are given by 0100111101, where 1 stands for head
and 0 stands for tail. By using Maximum Likelihood Estimation, make a guess for p.
Exercise 2. Let X1, X2 . . . Xn be i.i.d. exp(λ), with λ to be estimated. Suppose that n = 5
and we observe outcomes x = (0.2, 3, 1, 0.3, 4.1). By using Maximum Likelihood Estimation,
make a guess for λ.
Exercise 3. Let X ∼ Binomial(N, p), N known and p to be estimated. Compute the
MSE of the MLE T(X) = X/N for p. Consider the alternative estimator

T1(X) = (X + 1)/(N + 1).

Is it unbiased? Compute the MSE of T1. Which one would you use between T and T1? Why?
Exercise 4. In the setting of the previous exercise, define the alternative estimator

Tα(X) = α X/N + (1 − α)/2,

for α ∈ [0, 1] given. Is Tα unbiased? Show that

MSE(Tα) = (α²/N) p(1 − p) + (1 − α)² (1/2 − p)².

Suppose α = 1/2. Which estimator would you choose between T(X) = X/N and Tα(X)?
Why?
Exercise 5. Let X1, X2 . . . Xn be i.i.d. N(µ, 1) random variables. Compute the MSE of
the MLE T(X) for µ. Show that

MSE(T) → 0 as n → ∞.

Let T1 = X1 be an alternative estimator. Is it unbiased? Show that MSE(T1) does not tend to 0 as n → ∞.
Which estimator would you use and why?
Exercise 6. Let X1, X2 . . . Xn be i.i.d. U[0, θ]. We have seen that the MLE is T(X) =
max_{1≤k≤n} X_k and it is biased. Introduce the unbiased estimator

T1(X) = ((n + 1)/n) T(X).

Compute the MSE of T1. Which estimator would you choose between T and T1 and why?
Exercise 7. Suppose that X1, X2 . . . Xn are i.i.d. N(µ, 4) random variables, where µ is to
be estimated. Show that

S(X) = (√n/2)(X̄ − µ) ∼ N(0, 1),  where X̄ = (1/n) ∑_{k=1}^n X_k.

Using S(X), construct a 95% confidence interval for µ. Construct a 90% confidence interval for
µ. Which one is wider? Construct a 98% confidence interval for µ.
Exercise 8. Let X1, X2 . . . Xn be i.i.d. Poisson random variables of parameter λ to be
estimated. Assuming that n is large, construct a 94% approximate confidence interval for λ.
Exercise 9. Let X1, X2 . . . Xn be i.i.d. Geometric random variables of parameter p to be
estimated. Assuming that n is large, construct a 90% approximate confidence interval for p.
6. Bayesian statistics
Up to this point we have discussed point estimation. That is, we have assumed that X1, X2 . . . Xn
are i.i.d. random variables following a common probability distribution f( · ; θ), depending on a
specific parameter θ, which is unknown. We could, on the other hand, think of the parameter θ
as being itself random. We make an initial guess for the distribution of θ, and then update the
guess based on the observed data. This is the point of view of Bayesian statistics.
Notation. When we want to treat the unknown parameter as a random variable we denote it
by Θ, while we use θ for its values. We say that Θ ∼ π if

P(Θ = θ) = π(θ)  (discrete setting),
P(Θ ∈ [θ1, θ2]) = ∫_{θ1}^{θ2} π(θ) dθ  (continuous setting).
Definition 6.1 (Discrete setting). The initial guess for the probability distribution of the
parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data x =
(x1, x2 . . . xn) are observed, the updated distribution for θ is called the posterior distribution, and it
is given by

π(θ|x) = P(Θ = θ|X = x) = P(X = x; θ) π(θ) / P(X = x).
Example 6.2. Imagine tossing a biased coin giving head with probability Θ. We represent the
outcome by a Bernoulli(Θ) random variable. Suppose we are told that Θ ∈ {θ1, θ2, θ3} for
some given numbers 0 ≤ θ1 < θ2 < θ3 ≤ 1. Prior to observing the outcome of the coin toss we
have no reason to believe that one of the values is more likely than the others, so our prior
distribution is uniform over the set {θ1, θ2, θ3}, that is

π(θ1) = π(θ2) = π(θ3) = 1/3.

We now observe the outcome x = 1 (head) of the coin toss. How does our belief on θ change
given the observed datum? We compute the posterior distribution π(θ|1):

π(θk|1) = P(X = 1; θk) π(θk) / P(X = 1)

for k = 1, 2, 3. We use the law of total probability to compute

P(X = 1) = P(X = 1; θ1) π(θ1) + P(X = 1; θ2) π(θ2) + P(X = 1; θ3) π(θ3) = (1/3)(θ1 + θ2 + θ3).

Thus, since P(X = 1; θk) = θk, our posterior distribution is given by

π(θk|1) = P(X = 1; θk) π(θk) / P(X = 1) = θk / (θ1 + θ2 + θ3),  k = 1, 2, 3.

Note that the posterior distribution is a probability distribution, as the weights sum up to 1.
In fact, we could have used this to compute P(X = 1).
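The update is a one-liner to implement (Python sketch, not part of the original notes; the values θk = 0.2, 0.5, 0.8 are arbitrary):

```python
def posterior(prior, likelihoods):
    # Bayes: pi(theta_k | x) is proportional to P(x; theta_k) * pi(theta_k),
    # normalised so the weights sum to 1
    unnorm = [l * p for l, p in zip(likelihoods, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
# observe one head: the likelihood of x = 1 under theta is theta itself
post = posterior(prior, thetas)   # weights theta_k / (theta_1 + theta_2 + theta_3)
```

After seeing a head, the posterior shifts its mass towards the larger values of θ.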
Definition 6.3 (Continuous setting). The initial guess for the probability distribution of
the parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data
x = (x1, x2 . . . xn) are observed, the updated distribution for θ is called the posterior distribution,
and it is given by

π(θ|x) = f_{X1...Xn}(x; θ) π(θ) / f_{X1...Xn}(x).
Example 6.4. Let X1, X2 . . . Xn be i.i.d. exponential random variables of parameter Θ. We
assume that Θ ∼ Exponential(µ) for some given µ. Thus our prior distribution is

π(θ) = µ e^{−µθ} 1_{[0,∞)}(θ).

After data x = (x1, x2 . . . xn) are observed, our posterior distribution for the parameter is given
by

π(θ|x) = f_{X1...Xn}(x; θ) π(θ) / f_{X1...Xn}(x) = (1/f_{X1...Xn}(x)) (∏_{k=1}^n θ e^{−θ x_k}) µ e^{−µθ}
       = (µ / f_{X1...Xn}(x)) θ^n e^{−θ(∑_{k=1}^n x_k + µ)}

for θ ≥ 0. Note that the factor µ/f_{X1...Xn}(x) does not depend on θ, so it must equal the
unique constant such that the posterior distribution integrates to 1. Since the remaining part
of the expression is proportional to the p.d.f. of a Gamma distribution with parameters n + 1 and
∑_{k=1}^n x_k + µ, we conclude that

Θ|x ∼ Gamma(n + 1, ∑_{k=1}^n x_k + µ).

That is, if prior to the data collection we believed that the parameter was Exponential(µ),
after observing the data our new belief is that the parameter is Gamma(n + 1, ∑_{k=1}^n x_k + µ).
6.1. Bayesian estimation. Once we have computed the posterior distribution π( · |x) for
Θ, we can use it to make a guess for the value of θ.

Definition 6.5 (Bayes estimator). The Bayes estimator θ(x) is defined to be the posterior
mean, that is

θ(x) = E(Θ|x) = ∫ θ π(θ|x) dθ.

Example 6.6. We have seen in Example 6.4 that the posterior distribution is a Gamma(n + 1, ∑_{k=1}^n x_k + µ). It follows that the Bayes estimator for the unknown parameter θ is given by

θ(x) = E(Θ|x) = (n + 1) / (∑_{k=1}^n x_k + µ).
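For instance (Python sketch, not part of the original notes; the data and the choice µ = 1 are invented):

```python
def bayes_estimator(xs, mu):
    # posterior is Gamma(n + 1, sum(x_k) + mu); its mean (n + 1)/(sum(x_k) + mu)
    # is the Bayes estimator under quadratic loss
    return (len(xs) + 1) / (sum(xs) + mu)

xs = [0.5, 1.2, 0.8, 2.0, 0.5]        # five observed waiting times
est = bayes_estimator(xs, mu=1.0)     # (5 + 1)/(5.0 + 1.0), approximately 1.0
```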
Theorem 6.7. The Bayes estimator is the value of θ which minimises the expected posterior
quadratic loss E[(Θ − θ)^2 | x].

Remark 6.8. It is possible to introduce different loss functions. For example, we might want to
estimate θ by the minimiser of the expected posterior absolute loss E[|Θ − θ| | x]. In this case
the Bayes estimator would be the median of the posterior distribution.
7. Exercises
Exercise 1. Let X1, X2 . . . Xn be i.i.d. N(Θ, 1), and consider the prior distribution Θ ∼ N(0, 1). Show that the posterior distribution is given by

Θ|x ∼ N((1/(n+1)) ∑_{k=1}^n x_k, 1/(n+1)).

Deduce that the Bayes estimator (for quadratic loss) is given by θ(x) = (1/(n+1)) ∑_{k=1}^n x_k.
Exercise 2. Let X1, X2 . . . Xn be i.i.d. Poisson(Θ) random variables, with prior distribution Θ ∼ Exponential(1). Show that

Θ|x ∼ Gamma(∑_{k=1}^n x_k + 1, n + 1).

Deduce that the Bayes estimator (for quadratic loss) for θ is given by

θ(x) = (∑_{k=1}^n x_k + 1) / (n + 1).
Exercise 3. Solve the above exercise with prior Θ ∼ Exponential(µ).
Exercise 4. Let X1, X2 . . . Xn be i.i.d. Bernoulli(Θ) random variables, with prior distribution Θ ∼ U[0, 1] (no prior information). Show that the posterior distribution is given
by

Θ|x ∼ Beta(∑_{k=1}^n x_k + 1, n + 1 − ∑_{k=1}^n x_k),

where we say that X ∼ Beta(α, β) if

fX(x) = (x^{α−1} (1 − x)^{β−1} / B(α, β)) 1_{[0,1]}(x),

with B(α, β) = Γ(α)Γ(β)/Γ(α + β). Using that a Beta(α, β) random variable has expectation α/(α + β), deduce
that the Bayes estimator (for quadratic loss) is given by

θ(x) = (∑_{k=1}^n x_k + 1) / (n + 2).
Exercise 5. We have an urn containing N balls, N known, some of which are blue and
some of which are red. Let Θ be the proportion of red balls. We take the uninformative prior
Θ ∼ U [0, 1]. Suppose that we pick one ball, and it turns out to be red. Compute the posterior
distribution Θ|Red. Compute the Bayes estimator (for quadratic loss) for the proportion of red
balls.
Exercise 6. Suppose we roll a biased die, where the probability of getting k is proportional
to Θk, for k = 1, 2, 3, 4, 5, 6. Let X1, X2 . . . Xn denote n independent rolls of the die. Starting
from the uninformative prior Θ ∼ U [0, 1], compute the posterior distribution π(θ|x) of Θ|x,
and the Bayes estimator (for quadratic loss).
Exercise 7. Let X1, X2 . . . Xn be i.i.d. Exponential(Θ) random variables, with prior
distribution Θ ∼ Gamma(α, β) for α, β known. Show that the posterior distribution is given
by

Θ|x ∼ Gamma(α + n, β + ∑_{k=1}^n x_k).

Deduce that the Bayes estimator (for quadratic loss) is given by θ(x) = (α + n)/(β + ∑_{k=1}^n x_k).
APPENDIX A
Background on set theory
A set is a collection of elements, which can be finite or infinite. If A is a finite set, |A| denotes
the number of elements of A. The empty set is the set consisting of no elements, and it is
denoted by ∅. If A and B are two sets, then
• A ∪ B = {x : x ∈ A or x ∈ B} (union).
• A ∩ B = {x : x ∈ A and x ∈ B} (intersection).
• A \ B = {x : x ∈ A and x /∈ B} (difference).
• A ⊂ B if all elements of A are also elements of B.
• If A ∩ B = ∅ then we say that A and B are disjoint.
Note that the empty set is disjoint from every set, including itself. In fact, the empty set is the
only set disjoint from itself. If Ω is a set, and A ⊆ Ω, we define the complement of A in Ω as
A^c = Ω \ A = {x ∈ Ω : x /∈ A}. Note that A ∩ A^c = ∅, so A and A^c are disjoint.

Example 0.1. Take A = {1, 2, 3}, B = {2, 4, 6, 8}. Then

A ∪ B = {1, 2, 3, 4, 6, 8},  A ∩ B = {2},  A \ B = {1, 3},  B \ A = {4, 6, 8}.

If C = {6, 7, 8} then

A ∪ C = {1, 2, 3, 6, 7, 8},  A ∩ C = ∅,  A \ C = A,  A ∪ B ∪ C = {1, 2, 3, 4, 6, 7, 8},  A ∩ B ∩ C = ∅.

Take Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Then A, B, C ⊆ Ω. Their complements in Ω are

A^c = {4, 5, 6, 7, 8, 9, 10},  B^c = {1, 3, 5, 7, 9, 10},  C^c = {1, 2, 3, 4, 5, 9, 10}.
We can also define infinite unions and intersections. For a sequence of sets (An)_{n∈N} set

⋃_{n∈N} An = A1 ∪ A2 ∪ A3 ∪ · · · = {x : x ∈ An for some n ∈ N},
⋂_{n∈N} An = A1 ∩ A2 ∩ A3 ∩ · · · = {x : x ∈ An for all n ∈ N}.

Example 0.2. Take An = {n, n + 1, n + 2 . . .} = {k ∈ N : k ≥ n}. Then

⋃_{n∈N} An = N,  ⋂_{n∈N} An = ∅.
If on the other hand we take An as before but set An = A10 for all n ≥ 10, then

⋃_{n∈N} An = N,  ⋂_{n∈N} An = A10 = {10, 11, 12 . . .}.
Given two sets A, B, we define their product as

A × B = {(x, y) : x ∈ A, y ∈ B}.

Example 0.3. If A = {1, 2, 3} and B = {α, β} then

A × B = {(1, α), (1, β), (2, α), (2, β), (3, α), (3, β)},
A × A = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)},
B × B = {(α, α), (α, β), (β, α), (β, β)}.

Note that the pairs are ordered, and that |A × B| = |A| · |B|. We can define the product of an
arbitrary number of sets by setting

A1 × A2 × A3 × · · · × An = {(x1, x2, x3 . . . xn) : xk ∈ Ak for all k}.

Example 0.4. If A = {0, 1}, B = {a} and C = {x, y} then

A × B × C = {(0, a, x), (0, a, y), (1, a, x), (1, a, y)}.

Note that |A × B × C| = 2 · 1 · 2 = 4.
1. Table of Φ(z)
[Table of values of the standard normal distribution function Φ(z). Source: https://math.ucalgary.ca/]