Probability and Statistics
Vittoria Silvestri
Statslab, Centre for Mathematical Sciences, University of Cambridge, Wilber-
force Road, Cambridge CB3 0WB, UK
Contents

Preface
Chapter 1. Discrete probability
1. Introduction
2. Combinatorial analysis
3. Exercises
4. Stirling's formula
5. Properties of probability measures
6. Exercises
7. Independence
8. Conditional probability
9. Exercises
10. Some natural probability distributions
11. Exercises
12. Additional exercises
13. Random variables
14. Expectation
15. Exercises
16. Variance and covariance
17. Exercises
18. Inequalities
19. Exercises
20. Probability generating functions
21. Conditional distributions
22. Exercises
Chapter 2. Continuous probability
1. Some natural continuous probability distributions
2. Continuous random variables
3. Exercises
4. Transformations of one-dimensional random variables
5. Exercises
6. Moment generating functions
7. Exercises
8. Multivariate distributions
9. Exercises
10. Transformations of two-dimensional random variables
11. Limit theorems
12. Exercises
Chapter 3. Statistics
1. Parameter estimation
2. Exercises
3. Mean square error
4. Confidence intervals
5. Exercises
6. Bayesian statistics
7. Exercises
Appendix A. Background on set theory
1. Table of Φ(z)
Preface
These lecture notes are for the NYU Shanghai course Probability and Statistics, given in Fall
2018. The contents are closely based on the following lecture notes, all available online:
• James Norris: http://www.statslab.cam.ac.uk/~james/Lectures/p.pdf
• Douglas Kennedy: http://trin-hosts.trin.cam.ac.uk/fellows/dpk10/IA/IAprob.html
• Richard Weber: http://www.statslab.cam.ac.uk/~rrw1/stats/Sa4.pdf
Please send comments and corrections to [email protected].
CHAPTER 1
Discrete probability
1. Introduction
This course concerns the study of experiments with random outcomes, such as throwing a die,
tossing a coin or blindly drawing a card from a deck. Say that the set of possible outcomes is
Ω = {ω1, ω2, ω3, . . .}.
We call Ω the sample space, while its elements are called outcomes. A subset A ⊆ Ω is called an
event.
Example 1.1 (Throwing a die). Toss a normal six-faced die: the sample space is Ω =
{1, 2, 3, 4, 5, 6}. Examples of events are:
{5} (the outcome is 5)
{2, 4, 6} (the outcome is even)
{3, 6} (the outcome is divisible by 3)
Example 1.2 (Drawing a card). Draw a card from a standard deck: Ω is the set of all possible
cards, so that |Ω| = 52. Examples of events are:
A1 = {the card is a Jack}, |A1| = 4
A2 = {the card is Diamonds}, |A2| = 13
A3 = {the card is not the Queen of Spades}, |A3| = 51.
Example 1.3 (Picking a natural number). Pick any natural number: the sample space is
Ω = N. Examples of events are:
{the number is at most 5} = {0, 1, 2, 3, 4, 5}
{the number is even} = {2, 4, 6, 8, . . .}
{the number is not 7} = N \ {7}.
Example 1.4 (Picking a real number). Pick any number in the closed interval [0, 1]: the sample
space is Ω = [0, 1]. Examples of events are:
{x : x < 1/3} = [0, 1/3)
{x : x ≠ 0.7} = [0, 1] \ {0.7} = [0, 0.7) ∪ (0.7, 1]
{x : x = 2^{−n} for some n ∈ N} = {1, 1/2, 1/4, 1/8, . . .}.
Recall that a set Ω is said to be countable if there exists a bijection between Ω and a subset of
N. A set is said to be uncountable if it is not countable. Thus, for example, any finite set is
countable, the set of even (or odd) integers is countable, but R is uncountable. Note that the
sample space Ω is finite in the first two examples, infinite but countable in the third example,
and uncountable in the last example.
Remark 1.5. For the first part of the course we will restrict to countable sample spaces, thus
excluding Example 1.4 above.
We can now give a general definition.
Definition 1.6. Let Ω be any set, and F be a set of subsets of Ω. We say that F is a σ-algebra
if
- Ω ∈ F ,
- if A ∈ F , then Ac ∈ F ,
- for every sequence (An)n≥1 in F , it holds ⋃_{n=1}^∞ An ∈ F .
Assume that F is a σ-algebra. A function P : F → [0, 1] is called a probability measure if
- P(Ω) = 1,
- for any sequence of disjoint events (An)n≥1 it holds
P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
The triple (Ω,F ,P) is called a probability space.
Remark 1.7. In the case of a countable sample space we take F to be the set of all subsets of Ω,
unless otherwise stated.
We think of F as the collection of observable events. If A ∈ F , then P(A) is the probability of
the event A. In some probability models, such as the one in Example 1.4, the probability of
each individual outcome is 0. This is one reason why we need to specify probabilities of events
rather than outcomes.
2. COMBINATORIAL ANALYSIS 9
1.1. Equally likely outcomes. The simplest case is that of a finite sample space Ω and
equally likely outcomes:
P({ω}) = 1/|Ω| ∀ω ∈ Ω.
Note that by definition of probability measure this implies that
P(A) = |A|/|Ω| ∀A ∈ F .
Moreover
• P(∅) = 0,
• if A ⊆ B then P(A) ≤ P(B),
• if A ∩B = ∅ then P(A ∪B) = P(A) + P(B),
• P(A ∪B) = P(A) + P(B)− P(A ∩B),
• P(A ∪B) ≤ P(A) + P(B).
To check that P is a probability measure, note that P(Ω) = |Ω|/|Ω| = 1, and for disjoint events
(Ak)_{k=1}^n it holds
P(⋃_{k=1}^n Ak) = |A1 ∪ A2 ∪ . . . ∪ An|/|Ω| = Σ_{k=1}^n |Ak|/|Ω| = Σ_{k=1}^n P(Ak),
as wanted.
Example 1.8. When throwing a fair die there are 6 possible outcomes, all equally likely. Then
Ω = {1, 2, 3, 4, 5, 6}, P({i}) = 1/6 for i = 1 . . . 6.
So P(even outcome) = P({2, 4, 6}) = 1/2, while P(outcome ≤ 5) = P({1, 2, 3, 4, 5}) = 5/6.
2. Combinatorial analysis
We have seen that it is often necessary to be able to count the number of subsets of Ω with a
given property. We now take a systematic look at some counting methods.
2.1. Multiplication rule. Take N finite sets Ω1, Ω2 . . . ΩN (some of which might coincide),
with cardinalities |Ωk| = nk. Imagine picking one element from each set: how many ways are
there to do so? Clearly, we have n1 choices for the first element. Now, for each
choice of the first element, we have n2 choices for the second, so that
|Ω1 × Ω2| = n1n2.
Once the first two elements are chosen, we have n3 choices for the third, and so on, giving
|Ω1 × Ω2 × . . . × ΩN | = n1n2 · · · nN .
We refer to this as the multiplication rule.
10 1. DISCRETE PROBABILITY
Example 2.1. A restaurant offers 6 starters, 7 main courses and 5 desserts. The number of
possible three–course meals is then 6× 7× 5 = 210.
Example 2.2. Throw 3 dice. Then the number of outcomes given by an even number, a 6 and
then a number smaller than 4 is 3 × 1 × 3 = 9.
Example 2.3 (The number of subsets). Suppose a set Ω = {ω1, ω2 . . . ωn} has n elements. How
many subsets does Ω have? We proceed as follows. To each subset A of Ω we can associate a
sequence of 0's and 1's of length n so that the ith number is 1 if ωi is in A, and 0 otherwise.
Thus if, say, Ω = {ω1, ω2, ω3, ω4} then
A1 = {ω1} → (1, 0, 0, 0)
A2 = {ω1, ω3, ω4} → (1, 0, 1, 1)
A3 = ∅ → (0, 0, 0, 0).
This defines a bijection between the subsets of Ω and the strings of 0's and 1's of length n.
Thus we have to count the number of such strings. Since for each element we have 2 choices
(either 0 or 1), there are 2^n strings. This shows that a set of n elements has 2^n subsets. Note
that this also counts the number of functions from a set of n elements to {0, 1}.
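The bijection above is easy to implement directly. The following Python sketch enumerates all 0/1 strings of length n and maps each one to the corresponding subset, confirming the 2^n count for a small example.

```python
from itertools import product

def subsets_via_bitstrings(elements):
    """Map each 0/1 string of length n to the subset of `elements`
    whose ith member is included exactly when the ith bit is 1."""
    return [{e for e, bit in zip(elements, bits) if bit}
            for bits in product([0, 1], repeat=len(elements))]

subs = subsets_via_bitstrings(["w1", "w2", "w3", "w4"])
assert len(subs) == 2 ** 4   # a 4-element set has 2^4 = 16 subsets
assert set() in subs         # the empty set corresponds to (0, 0, 0, 0)
```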
2.2. Permutations. How many possible orderings of n elements are there? Label the
elements 1, 2 . . . n. A permutation is a bijection from 1, 2 . . . n to itself, i.e. an ordering of
the elements. We may obtain all permutations by subsequently choosing the image of element
1, then the image of element 2 and so on. We have n choices for the image of 1, then n − 1
choices for the image of 2, n− 2 choices for the image of 3 until we have only one choice for the
image of n. Thus the total number of choices is, by the multiplication rule,
n! = n(n− 1)(n− 2) · · · 1.
Thus there are n! different orderings, or permutations, of n elements. Equivalently, there are n!
different bijections between any two sets of n elements.
Example 2.4. There are 52! possible orderings of a standard deck of cards.
2.3. Subsets. How many ways are there to choose k elements from a set of n elements?
2.3.1. With ordering. We have n choices for the first element, n − 1 choices for the second
element and so on, ending with n − k + 1 choices for the kth element. Thus there are
(2.1) n(n − 1) · · · (n − k + 1) = n!/(n − k)!
ways to choose k ordered elements from n. An alternative way to obtain the above formula is
the following: to pick k ordered elements from n, first pick a permutation of the n elements (n!
choices), then forget all elements but the first k. Since for each choice of the first k elements
there are (n − k)! permutations starting with those k elements, we again obtain (2.1).
2.3.2. Without ordering. To choose k unordered elements from n, we could first choose k
ordered elements, and then forget about the order. Recall that there are n!/(n − k)! possible
ways to choose k ordered elements from n. Moreover, any given k elements can be ordered in
k! possible ways. Thus there are
(n choose k) = n!/(k!(n − k)!)
possible ways to choose k unordered elements from n.
More generally, suppose we have integers n1, n2 . . . nk with n1 + n2 + · · · + nk = n. Then we
have
(n choose n1, . . . , nk) = n!/(n1! . . . nk!)
possible ways to partition n elements into k subsets of cardinalities n1, . . . , nk.
Example 2.5. Imagine a box containing n balls. Then there are:
• n!/(n − k)! ordered ways to pick k balls,
• (n choose k) unordered ways to pick k balls,
• (n choose n1, n2, n3) ways to subdivide the balls into 3 groups of n1, n2 and n3 balls each.
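These counts can be verified by brute-force enumeration. The sketch below uses Python's itertools to generate the ordered and unordered picks for small n, k and checks the counts against the formulas (the group sizes 1, 2, 3 are an arbitrary illustration).

```python
from itertools import combinations, permutations
from math import comb, factorial

n, k = 6, 3
# ordered picks of k distinct elements from n: n!/(n-k)!  (here 6*5*4 = 120)
ordered = list(permutations(range(n), k))
assert len(ordered) == factorial(n) // factorial(n - k)
# unordered picks: each set of k elements was counted k! times above
unordered = list(combinations(range(n), k))
assert len(unordered) == len(ordered) // factorial(k) == comb(n, k)
# multinomial coefficient: split n elements into labeled groups of sizes 1, 2, 3
n1, n2, n3 = 1, 2, 3
multinomial = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
assert multinomial == comb(n, n1) * comb(n - n1, n2)  # choose the groups in turn
```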
2.4. Subsets with repetitions. How many ways are there to choose k elements from a
set of n elements, allowing repetitions?
2.4.1. With ordering. We have n choices for the first element, n choices for the second
element and so on. Thus there are
nk = n× n× · · · × n
possible ways to choose k ordered elements from n, allowing repetitions.
2.4.2. Without ordering. Suppose we want to choose k elements from n, allowing repetitions
but discarding the order. How many ways do we have to do so? Note that naïvely dividing
n^k by k! doesn't give the right answer, since there may be repetitions. Instead, we count as
follows. Label the n elements 1, 2 . . . n, and for each element draw a ∗ each time it is picked,
separating consecutive elements by vertical lines:
1 2 3 . . . n
∗∗ | ∗ | | . . . | ∗∗∗
Note that there are k ∗'s and n − 1 vertical lines. Now delete the numbers:
(2.2) ∗∗ | ∗ | | . . . | ∗∗∗
The above diagram uniquely identifies an unordered set of (possibly repeated) k elements. Thus
we simply have to count how many such diagrams there are. The only restriction is that there
must be n− 1 vertical lines and k ∗’s. Since there are n+ k − 1 locations, we can fix such a
diagram by assigning the positions of the ∗'s, which can be done in
(n + k − 1 choose k)
ways. This therefore counts the number of ways to choose k unordered elements from n,
allowing repetitions.
Example 2.6. Imagine again a box with n balls, but now each time a ball is picked, it
is put back in the box (so that it can be picked again). There are:
• n^k ordered ways to pick k balls, and
• (n + k − 1 choose k) unordered ways to pick k balls.
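The counting argument above (often called "stars and bars") can be checked by enumeration: Python's combinations_with_replacement generates exactly the unordered picks with repetition.

```python
from itertools import combinations_with_replacement, product
from math import comb

n, k = 5, 3
# ordered picks with repetition: n^k
assert len(list(product(range(n), repeat=k))) == n ** k
# unordered picks with repetition (stars and bars): (n+k-1 choose k)
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k)   # = (7 choose 3) = 35
```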
Example 2.7 (Increasing and non-decreasing functions). An increasing function from {1, 2 . . . k}
to {1, 2 . . . n} is uniquely determined by its range, which is a subset of {1, 2 . . . n} of size k.
Vice versa, each such subset determines a unique increasing function. This bijection tells us
that there are (n choose k) increasing functions from {1, 2 . . . k} to {1, 2 . . . n}, since this is the number
of subsets of size k of {1, 2, . . . n}.
How about non-decreasing functions? There is a bijection from the set of non-decreasing
functions f : {1, 2 . . . k} → {1, 2 . . . n} to the set of increasing functions g : {1, 2 . . . k} →
{1, 2 . . . n + k − 1}, given by
g(i) = i + f(i) − 1.
Hence the number of such non-decreasing functions is (n + k − 1 choose k).
3. Exercises
Exercise 1. Define a probability space. Take Ω = {a, b, c}. Which ones of the following
are σ-algebras? Explain.
• F = {∅, {a}},
• F = {∅, {a}, {b, c}, {a, b, c}},
• F = {∅, {a}, {b}, {c}, {b, c}, {a, b, c}},
• F = {∅, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}.
Exercise 2. Let Ω be a sample space, and A be an event. Show that P(Ac) = 1− P(A).
Exercise 3. How many ways are there to divide a deck of 52 cards into two decks of 26
each? How many ways are there to do so, ensuring that each of the two decks contains exactly
13 black and 13 red cards?
Exercise 4. Four mice are chosen (without replacement) from a litter, two of which are
white. The probability that both white mice are chosen is twice the probability that neither is
chosen. How many mice are there in the litter?
Exercise 5. Suppose that n (non-identical) balls are tossed at random into n boxes. What
is the probability that no box is empty? What is the probability that all balls end up in the
same box?
Exercise 6. [Ordered partitions] An ordered partition of k of size n is a sequence
(k1, k2 . . . kn) of non-negative integers such that k1 + · · ·+kn = k. How many ordered partitions
of k of size n are there?
Exercise 7. [The birthday problem] If there are n people in the room, what is the
probability that at least two people have the same birthday?
3.1. Recap of formulas.
• Permutations of n elements: n!.
• Choose k elements from n, no repetitions:
– with ordering: n!/(n − k)!,
– without ordering: (n choose k).
• Choose k elements from n, with repetitions:
– with ordering: n^k,
– without ordering: (n − 1 + k choose k).
4. Stirling’s formula
We have seen how factorials are ubiquitous in combinatorial analysis. It is therefore important
to be able to provide asymptotics for n! as n becomes large, which is often the case of interest.
Recall that we write an ∼ bn to mean that an/bn → 1 as n→∞.
Theorem 4.1 (Stirling's formula). As n → ∞ we have
n! ∼ √(2π) n^{n+1/2} e^{−n}.
Note that this implies that
log(n!) ∼ log(√(2π) n^{n+1/2} e^{−n})
as n → ∞. We now prove this weaker statement.
Set
l_n = log(n!) = log 1 + log 2 + · · · + log n.
Write ⌊x⌋ for the integer part of x. Then
log⌊x⌋ ≤ log x ≤ log⌊x + 1⌋.
Integrate over the interval [1, n] to get
l_{n−1} ≤ ∫_1^n log x dx ≤ l_n,
from which
∫_1^n log x dx ≤ l_n ≤ ∫_1^{n+1} log x dx.
Integrating by parts we find
∫_1^n log x dx = n log n − n + 1,
from which we deduce that
n log n − n + 1 ≤ l_n ≤ (n + 1) log(n + 1) − n.
Since
n log n − n + 1 ∼ n log n, (n + 1) log(n + 1) − n ∼ n log n,
dividing through by n log n and taking the limit as n → ∞ we find that l_n ∼ n log n, as wanted.
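Stirling's approximation is also easy to test numerically. The sketch below compares log(n!) — computed via Python's lgamma, using lgamma(n + 1) = log(n!) — with the logarithm of the approximation; the gap is known to behave like 1/(12n), which the assertions check.

```python
from math import lgamma, log, pi

def log_stirling(n):
    # log of the Stirling approximation sqrt(2*pi) * n^(n+1/2) * e^(-n)
    return 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n

# lgamma(n + 1) = log(n!); the gap closes like 1/(12n)
for n in (10, 100, 1000, 10_000):
    gap = lgamma(n + 1) - log_stirling(n)
    assert 0 < gap < 1 / (12 * n)
```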
Example 4.2. Suppose we have 4n balls, of which 2n are red and 2n are black. We put them
randomly into two urns, so that each urn contains 2n balls. What is the probability that each
urn contains exactly n red balls and n black balls? Call this probability p_n. Note that there
are (4n choose 2n) ways to distribute the balls into the two urns, (2n choose n)(2n choose n) of which will result in each urn
containing exactly n balls of each color. Thus
p_n = (2n choose n)(2n choose n) / (4n choose 2n) = ((2n)!/n!)^4 · 1/(4n)!
∼ (√(2π) e^{−2n} (2n)^{2n+1/2} / (√(2π) e^{−n} n^{n+1/2}))^4 · 1/(√(2π) e^{−4n} (4n)^{4n+1/2}) = √(2/(πn)).
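The asymptotic √(2/(πn)) can be checked against the exact value of p_n, computed with exact rational arithmetic. The relative error decays roughly like 1/n (the 1/(4n) tolerance below is an empirical bound, not part of the notes).

```python
from fractions import Fraction
from math import comb, pi, sqrt

def p_n(n):
    # exact value: (2n choose n)^2 / (4n choose 2n)
    return Fraction(comb(2 * n, n) ** 2, comb(4 * n, 2 * n))

assert p_n(1) == Fraction(2, 3)
# compare with the Stirling asymptotic sqrt(2/(pi*n))
for n in (10, 100, 1000):
    ratio = float(p_n(n)) / sqrt(2 / (pi * n))
    assert abs(ratio - 1) < 1 / (4 * n)
```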
5. Properties of probability measures
Let (Ω,F ,P) be a probability space, and recall that P : F → [0, 1] has the property that
P(Ω) = 1 and for any sequence (An)n≥1 of disjoint events in F it holds
P(⋃_{n≥1} An) = Σ_{n≥1} P(An).
Then we have the following:
• 0 ≤ P(A) ≤ 1 for all A ∈ F ,
• if A ∩B = ∅ then P(A ∪B) = P(A) + P(B),
• P(Ac) = 1− P(A), since P(A) + P(Ac) = P(Ω) = 1,
• P(∅) = 0,
• if A ⊆ B then P(A) ≤ P(B), since
P(B) = P(A ∪ (B \A)) = P(A) + P(B \A) ≥ P(A).
• P(A ∪B) = P(A) + P(B)− P(A ∩B),
• P(A ∪B) ≤ P(A) + P(B).
5.1. Continuity of probability measures. For a non-decreasing sequence of events
A1 ⊆ A2 ⊆ A3 ⊆ · · · we have
P(⋃_{n≥1} An) = lim_{n→∞} P(An).
Indeed, we can define a new sequence (Bn)n≥1 as
B1 = A1, B2 = A2 \ A1, B3 = A3 \ (A1 ∪ A2), . . . , Bn = An \ (A1 ∪ · · · ∪ An−1), . . .
Then the Bn's are disjoint, and so
P(An) = P(B1 ∪ · · · ∪ Bn) = Σ_{k=1}^n P(Bk).
Taking the limit as n → ∞ on both sides, we get
lim_{n→∞} P(An) = Σ_{n=1}^∞ P(Bn) = P(⋃_{n≥1} Bn) = P(⋃_{n≥1} An),
as wanted. Similarly, for a non-increasing sequence of events B1 ⊇ B2 ⊇ B3 ⊇ · · · we have
P(⋂_{n≥1} Bn) = lim_{n→∞} P(Bn).
Example 5.1. Take Ω = N and let An = {1, 2 . . . n} for n ≥ 1. Then we have A1 ⊆ A2 ⊆ A3 ⊆ · · ·
and
lim_{n→∞} P(An) = P(⋃_{n≥1} An) = P(Ω) = 1.
If, on the other hand, we set An = A5 for all n ≥ 5, then we find
lim_{n→∞} P(An) = P(⋃_{n≥1} An) = P(A5).
Example 5.2. Take Ω = N and let Bn = {n, n + 1, n + 2 . . .} for n ≥ 1. Then we have
B1 ⊇ B2 ⊇ B3 ⊇ · · · and
lim_{n→∞} P(Bn) = P(⋂_{n≥1} Bn) = P(∅) = 0.
If, on the other hand, we set Bn = B3 for all n ≥ 3, then we find
lim_{n→∞} P(Bn) = P(⋂_{n≥1} Bn) = P(B3).
5.2. Subadditivity of probability measures. For any events A1, A2 . . . An we have
(5.1) P(⋃_{k=1}^n Ak) ≤ Σ_{k=1}^n P(Ak).
To see this, define the events
B1 = A1, B2 = A2 \ A1, B3 = A3 \ (A1 ∪ A2), . . . , Bn = An \ (A1 ∪ · · · ∪ An−1).
Then the Bk's are disjoint, and P(Bk) ≤ P(Ak) for all k = 1 . . . n, from which
P(⋃_{k=1}^n Ak) = P(⋃_{k=1}^n Bk) = Σ_{k=1}^n P(Bk) ≤ Σ_{k=1}^n P(Ak).
The same proof shows that this also holds for infinite sequences of events (An)n≥1:
P(⋃_{n≥1} An) ≤ Σ_{n≥1} P(An).
The above inequalities show that probability measures are subadditive, and are often referred
to as Boole's inequality.
Example 5.3. We have already seen that P(A ∪B) ≤ P(A) + P(B), which is an example of
(5.1) with n = 2.
Example 5.4. Throw a fair die, so that Ω = {1, 2, 3, 4, 5, 6}. Then
P({1, 2, 3} ∪ {2, 4}) = P({1, 2, 3, 4}) = 4/6,
while
P({1, 2, 3}) + P({2, 4}) = 3/6 + 2/6 = 5/6,
so indeed P({1, 2, 3} ∪ {2, 4}) ≤ P({1, 2, 3}) + P({2, 4}).
Example 5.5. Toss a fair coin 3 times. Write H for head and T for tail. Then
P({exactly one head} ∪ {(T, H, H)} ∪ {(H, T, T)})
= P({(H, T, T), (T, H, T), (T, T, H), (T, H, H)}) = 4/8 = 1/2,
while
P(exactly one head) + P((T, H, H)) + P((H, T, T)) = 3/8 + 1/8 + 1/8 = 5/8.
5.3. Inclusion-exclusion formula. We have already seen that for any two events A,B
P(A ∪B) = P(A) + P(B)− P(A ∩B).
How does this generalise to an arbitrary number of events? For 3 events we have
P(A ∪B ∪ C) = P(A) + P(B) + P(C)− P(A ∩B)− P(A ∩ C)− P(B ∩ C) + P(A ∩B ∩ C).
In general, for n events A1, A2 . . . An we have
P(⋃_{i=1}^n Ai) = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik).
You are not required to remember the general formula for this course, but you should learn it
for n = 2, 3.
Example 5.6. Pick a card from a standard deck of 52. Define the events
A = {the card is diamonds},
B = {the card is even},
C = {the card is at least 10}.
Then
P(A) = 13/52, P(B) = 24/52, P(C) = 16/52,
P(A ∩ B) = 6/52, P(A ∩ C) = 4/52, P(B ∩ C) = 8/52,
P(A ∩ B ∩ C) = 2/52,
which gives
P(A ∪ B ∪ C) = (13 + 24 + 16 − 6 − 4 − 8 + 2)/52 = 37/52.
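With equally likely outcomes, inclusion-exclusion can be verified by enumerating the deck. The sketch below numbers the court cards J = 11, Q = 12, K = 13 (an assumption consistent with the counts in the example) and checks both sides of the formula exactly.

```python
from fractions import Fraction
from itertools import product

suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = range(1, 14)  # 1 = Ace, ..., 11 = Jack, 12 = Queen, 13 = King
deck = list(product(suits, ranks))

A = {c for c in deck if c[0] == "diamonds"}
B = {c for c in deck if c[1] % 2 == 0}   # "even", counting Q = 12 as even
C = {c for c in deck if c[1] >= 10}

def P(event):
    return Fraction(len(event), len(deck))

# inclusion-exclusion for three events
lhs = P(A | B | C)
rhs = P(A) + P(B) + P(C) - P(A & B) - P(A & C) - P(B & C) + P(A & B & C)
assert lhs == rhs == Fraction(37, 52)
```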
6. Exercises
Exercise 1. Let A,B,C be three events. Express in symbols the following events:
(i) only A occurs,
(ii) at least one event occurs,
(iii) exactly one event occurs,
(iv) no event occurs.
Exercise 2. Show that for any two sets A,B it holds (A ∪B)c = Ac ∩Bc.
Exercise 3. Show that, for any three events A,B,C
P(Ac ∩ (B ∪ C)) = P(B) + P(C)− P(B ∩ C)− P(C ∩A)− P(A ∩B) + P(A ∩B ∩ C).
How many numbers in {1, 2 . . . 500} are not divisible by 7 but are divisible by 3 or 5?
Exercise 4. Let (An)n≥1 be a sequence of events in some probability space (Ω,F ,P).
Define
A = {ω ∈ Ω : ω ∈ An infinitely often},
B = {ω ∈ Ω : ω ∈ An for all sufficiently large n}.
Write A and B in terms of unions/intersections of the Ak’s.
Exercise 5. A committee of size r is chosen at random from a set of n people. Calculate
the probability that m given people will be in the committee.
Exercise 6. What is the probability that a non-decreasing function f : {1 . . . k} → {1 . . . n} is increasing?
Exercise 7.
(i) How many injective functions f : {1, 2 . . . k} → {1, 2 . . . n} are there?
(ii) How many injective, non-decreasing functions f : {1, 2 . . . k} → {1, 2 . . . n} are there?
(iii) How many non-constant functions f : {1, 2 . . . k} → {0, 1} are there? How about
non-constant f : {1, 2 . . . k} → {1, 2 . . . n}?
7. Independence
We will now discuss what is arguably the most important concept in probability theory, namely
independence.
Definition 7.1. Two events A,B are said to be independent if
P(A ∩B) = P(A)P(B).
Example 7.2. Throw a fair die twice, and let
A = {the first number is even}, B = {the second number is ≤ 2}.
Then P(A) = 3/6, P(B) = 2/6 and P(A ∩ B) = 6/36, so P(A ∩ B) = P(A)P(B) and the events A and
B are independent.
Example 7.3. Pick a card from a standard deck. Let
A = {the card is hearts}, B = {the card is at most 5}.
Then P(A) = 13/52, P(B) = 20/52 and P(A ∩ B) = 5/52, so P(A ∩ B) = P(A)P(B) and therefore A and
B are independent.
Example 7.4. Toss two fair dice. Let
A = {the sum of the two numbers is 6}, B = {the first number is 4}.
Then P(A) = 5/36, P(B) = 1/6 and P(A ∩ B) = 1/36. Thus P(A ∩ B) ≠ P(A)P(B) and the events
are not independent. Intuitively, the probability of getting 6 for the sum depends on the first
outcome, since if we were to get 6 at the first toss then it would be impossible to obtain 6 for
the sum, while if the first toss gives a number ≤ 5 then we have a positive probability of getting
6 for the sum.
If, on the other hand, we replace A with
A′ = {the sum of the two numbers is 7},
then P(A′) = 6/36, P(B) = 1/6 and P(A′ ∩ B) = 1/36. Thus P(A′ ∩ B) = P(A′)P(B) and the events
An important property of independence is the following: if A is independent of B, then A is
independent of Bc. Indeed, we have
P(A ∩Bc) = P(A)− P(A ∩B)
= P(A)− P(A)P(B) (by independence of A,B)
= P(A)(1− P(B))
= P(A)P(Bc) (since P(Bc) = 1− P(B))
which shows that A and Bc are independent.
Example 7.5. Let A, B be defined as in Example 7.2, so that
Bc = {the second number is ≥ 3}.
Then P(Bc) = 1 − P(B) = 4/6, so P(A)P(Bc) = 12/36 = P(A ∩ Bc), so A and Bc are independent.
Example 7.6. Let A, B be defined as in Example 7.3, so that
Bc = {the card is at least 6}.
Then P(Bc) = 1 − P(B) = 32/52, so P(A)P(Bc) = 8/52 = P(A ∩ Bc), so A and Bc are independent.
We can also define independence for more than two events.
Definition 7.7. We say that the events A1, A2 . . . An are independent if for any k ≥ 2 and
any collection of distinct indices 1 ≤ i1 < · · · < ik ≤ n we have
P(Ai1 ∩Ai2 ∩ . . . ∩Aik) = P(Ai1)P(Ai2) · · ·P(Aik).
Note that we require all possible subsets of the n events to be independent.
Example 7.8. Toss 3 fair coins. Let
A = {first coin H}, B = {second coin H}, C = {third coin T}.
Then P(A) = P(B) = P(C) = 1/2 and
P(A ∩ B) = 1/4 = P(A)P(B)
P(B ∩ C) = 1/4 = P(B)P(C)
P(A ∩ C) = 1/4 = P(A)P(C)
P(A ∩ B ∩ C) = 1/8 = P(A)P(B)P(C),
so A,B,C are independent.
At this point the reader may wonder whether independence of n events is equivalent to pairwise
independence, that is the requirement that any two events are independent. The answer is no:
independence is stronger than pairwise independence. Here is an example.
Example 7.9. Toss 2 fair coins, so that
Ω = {(H, H), (H, T), (T, H), (T, T)}.
Let
A = {the first coin is T} = {(T, H), (T, T)},
B = {the second coin is T} = {(H, T), (T, T)},
C = {get exactly one T} = {(T, H), (H, T)}.
Then P(A) = P(B) = P(C) = 1/2, P(A ∩B) = P(A ∩ C) = P(B ∩ C) = 1/4 so the events are
pairwise independent. On the other hand,
P(A ∩B ∩ C) = 0,
so the events are not independent.
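This example is small enough to check exhaustively. The sketch below enumerates the four outcomes and confirms pairwise independence while the triple product condition fails.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))  # the 4 outcomes of two fair coin tosses

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "T"}
B = {w for w in omega if w[1] == "T"}
C = {w for w in omega if w.count("T") == 1}

# pairwise independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but not independent as a triple: A ∩ B ∩ C is empty
assert P(A & B & C) == 0 != P(A) * P(B) * P(C)
```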
8. Conditional probability
Closely related to the notion of independence is that of conditional probability.
Definition 8.1. Let A, B be two events with P(B) > 0. Then the conditional probability of A
given B is
P(A|B) = P(A ∩ B)/P(B).
We interpret P(A|B) as the probability that the event A occurs, when it is known that the
event B occurs.
Example 8.2. Toss a fair die. Let A1 = {3} and B = {even} = {2, 4, 6}. Then
P(A1|B) = P(3 and even)/P(even) = 0.
This is intuitive: if we know that the outcome is even, the probability that the outcome is 3 is
zero.
If instead we take A2 = {2, 3, 4, 5, 6} then
P(A2|B) = P({2, 4, 6})/P(even) = 1.
This is also intuitive: if we know that the outcome is even, then we are certain that it is at
least 2.
Finally, take A3 = {4, 5, 6}, for which
P(A3|B) = P({4, 6})/P(even) = 2/3.
An important property to note is that if A and B are independent, then P(A|B) = P(A). That
is, when two events are independent, knowing that one occurs does not affect the probability of
the other. This follows from
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
where we have used that A and B are independent in the second equality.
where we have used that A and B are independent in the second equality.
8.1. The law of total probability. From the definition of conditional probability we
see that for any two events A,B with P(B) > 0 we have
P(A ∩B) = P(A|B)P(B).
This shows that the probability of two events occurring simultaneously can be broken up into
calculating successive probabilities: first the probability that B occurs, and then the probability
that A occurs given that B has occurred.
Clearly by replacing B with Bc in the above formula we also have
P(A ∩Bc) = P(A|Bc)P(Bc).
But then, since B and Bc are disjoint,
P(A) = P(A ∩B) + P(A ∩Bc)
= P(A|B)P(B) + P(A|Bc)P(Bc).
This has an important generalisation, called law of total probability.
Theorem 8.3 (Law of total probability). Let (Bn)n≥1 be a sequence of disjoint events of
positive probability, whose union is the sample space Ω. Then for all events A
P(A) = Σ_{n=1}^∞ P(A|Bn)P(Bn).
Proof. We know that, by assumption,
⋃_{n≥1} Bn = Ω.
This gives
P(A) = P(A ∩ Ω) = P(A ∩ ⋃_{n≥1} Bn) = P(⋃_{n≥1} (A ∩ Bn))
= Σ_{n≥1} P(A ∩ Bn) = Σ_{n≥1} P(A|Bn)P(Bn).
Remark 8.4. In the statement of the law of total probability we can also drop the assumption
P(Bn) > 0, provided we interpret P(A|Bn)P(Bn) = 0 if P(Bn) = 0. It follows that we can
also take (Bn)n to be a finite collection of events (simply set Bn = ∅ from some finite index
onwards).
Example 8.5. An urn contains 10 black balls and 5 red balls. We draw two balls from the urn
without replacement. What is the probability that the second ball drawn is black? Let
A = {the second ball is black}, B = {the first ball is black}.
Then
P(A) = P(A|B)P(B) + P(A|Bc)P(Bc) = (10/15) · (9/14) + (5/15) · (10/14) = 2/3.
If, in general, the urn contains b black balls and r red balls we have
P(A) = (b/(b + r)) · ((b − 1)/(b + r − 1)) + (r/(b + r)) · (b/(b + r − 1)) = b/(b + r).
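The general computation can be carried out exactly with rational arithmetic; the sketch below conditions on the first draw exactly as in the example and confirms that the answer always equals b/(b + r).

```python
from fractions import Fraction

def p_second_black(b, r):
    """P(second ball black) when drawing twice without replacement
    from an urn with b black and r red balls."""
    n = b + r
    p_first_black = Fraction(b, n)
    # law of total probability, conditioning on the colour of the first ball
    return (p_first_black * Fraction(b - 1, n - 1)
            + (1 - p_first_black) * Fraction(b, n - 1))

assert p_second_black(10, 5) == Fraction(2, 3)
# in general the answer equals P(first ball black) = b/(b+r)
for b, r in [(3, 4), (7, 2), (1, 9)]:
    assert p_second_black(b, r) == Fraction(b, b + r)
```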
Example 8.6. Toss a fair coin. If it comes up heads, then toss one fair die, otherwise toss 2 fair
dice. Call M the maximum number obtained from the dice, and let
A = {M ≤ 4}.
What is the probability of A? We look at the two cases: either the coin came up heads and we
tossed one die, or the coin came up tails and we tossed two dice. Thus let
B = {the coin gives H}.
By the law of total probability we have
P(A) = P(A|B)P(B) + P(A|Bc)P(Bc) = P(M ≤ 4|H)P(H) + P(M ≤ 4|T)P(T)
= (1/2) · (4/6) + (1/2) · ((4 · 4)/(6 · 6)) = 5/9.
Example 8.7. Toss a fair coin twice. If we get (H, H) then toss 1 die, if (T, T) toss 2 dice and
otherwise toss 3 dice. With M defined as before, what is the probability of the event A? Again,
we split the probability space into disjoint events according to the number of dice we toss. Let
B1 = {(H, H)}, B2 = {(T, T)}, B3 = {(H, T), (T, H)}.
Then by the law of total probability
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3)
= (1/4) · (4/6) + (1/4) · ((4 · 4)/(6 · 6)) + (1/2) · ((4 · 4 · 4)/(6 · 6 · 6)) = 23/54.
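Since M ≤ 4 requires every die tossed to land in {1, 2, 3, 4}, we have P(M ≤ 4 | k dice) = (4/6)^k, and the total probability can be evaluated exactly:

```python
from fractions import Fraction

def p_max_at_most_4(num_dice):
    # P(M <= 4) when tossing `num_dice` fair dice: each die must be <= 4
    return Fraction(4, 6) ** num_dice

# two coin tosses decide the number of dice: HH -> 1, TT -> 2, otherwise -> 3
cases = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}
p = sum(prob * p_max_at_most_4(dice) for dice, prob in cases.items())
assert p == Fraction(23, 54)
```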
8.2. Bayes' theorem. Note that if A and B are two events of positive probability we
have, by definition of conditional probability,
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Often, one of the conditional probabilities is easier to compute than the other. Bayes' theorem
tells us how to switch between the two.
Theorem 8.8 (Bayes' theorem). For any two events A, B both of positive probability we have
P(A|B) = P(B|A)P(A)/P(B).
Proof. We have
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B).
Example 8.9. Going back to Example 8.5, suppose we are told that the second ball is black.
What is the probability that the first ball was black? Applying Bayes' theorem we find
P(B|A) = P(A|B)P(B)/P(A) = ((9/14) · (10/15))/(2/3) = 9/14.
Example 8.10. Going back to Example 8.6, suppose we are told that M ≤ 2. What is
the probability that the coin came up heads? Recall that B = {the coin gives H}, and let
C = {M ≤ 2}. By Bayes' theorem, we have
P(B|C) = P(C|B)P(B)/P(C) = P(C|B)P(B)/(P(C|B)P(B) + P(C|Bc)P(Bc))
= ((2/6) · (1/2))/((2/6) · (1/2) + ((2 · 2)/(6 · 6)) · (1/2)) = 3/4.
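The Bayes computation above is short enough to reproduce exactly: P(C | B) = 2/6 with one die, P(C | Bc) = (2/6)^2 with two dice, and the denominator is the total probability of C.

```python
from fractions import Fraction

half = Fraction(1, 2)
# C = {M <= 2}: conditional probabilities given the coin toss
p_C_given_H = Fraction(2, 6)        # one die must land in {1, 2}
p_C_given_T = Fraction(2, 6) ** 2   # both dice must land in {1, 2}
p_C = p_C_given_H * half + p_C_given_T * half   # law of total probability
# Bayes' theorem: P(H | C)
p_H_given_C = p_C_given_H * half / p_C
assert p_H_given_C == Fraction(3, 4)
```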
9. Exercises
Exercise 1. Toss two fair dice. Determine whether the following events are independent:
(i) A = {the first die gives 4}, B = {the maximum outcome is ≤ 3},
(ii) A = {the first die gives 4}, B = {the maximum outcome is ≤ 5},
(iii) A = {the first die gives 5}, B = {the sum of the outcomes is 6},
(iv) A = {the first die gives 6}, B = {the sum of the outcomes is 7}.
Exercise 2. Let A and B be independent. Show that Ac and Bc are independent.
Exercise 3. Let A, B be two events with P(B) > 0. Define P(A|B). Show that
(a) P(B|B) = 1,
(b) if A ⊆ B then P(A|B) = P(A)/P(B),
(c) if A ⊊ B then P(A|B) < 1.
Exercise 4. Give an example of two events for which P(A|B) ≠ P(B|A). Can you find two
events A, B of positive probability such that P(A|B)P(B) ≠ P(B|A)P(A)? Explain.
Exercise 5. Draw a card from a standard deck. What is the probability that the card
is heart? What is the probability that the card is even, given that it is heart? What is the
probability that the card is even, given that it is ≤ 5? What is the probability that the card is
≤ 5, given that it is even?
Exercise 6. Students of a given class get grades A, B, C, D with probability 1/8, 2/8, 3/8, 2/8
respectively. If they misread the question, they usually do worse, getting grades A, B, C, D
with probability 1/10, 2/10, 3/10, 4/10 respectively. Each student misreads the question with probability
2/3. What is the probability that:
(a) a student who read the question correctly gets A?
(b) a student who got B has read the question correctly?
(c) a student who got C has misread the question?
(d) a student who misread the question got C?
(e) a student gets D?
Exercise 7. What is the probability that a function f : {1, 2 . . . k} → {1, 2 . . . n} is constant,
given that it is non-decreasing? What is the probability that f : {1, 2 . . . k} → {1, 2 . . . n} is
non-decreasing, given that it is constant?
10. Some natural probability distributions
Up to this point in the course we have only worked with equally likely outcomes. On the other
hand, many other choices of probability measures are possible. We will next see some of
them, which arise naturally.
Definition 10.1. A probability distribution, or simply distribution, is a probability measure on
some (Ω,F). We say that a distribution is discrete if Ω is a discrete set.
For discrete Ω we always take the σ-algebra F to be the set of all subsets of Ω. In particular, it
contains all sets {ω}, ω ∈ Ω, made of a single outcome. Thus, if µ is a probability distribution,
we can write
p_ω = µ({ω}).
Note that p_ω ∈ [0, 1] for all ω ∈ Ω, and
Σ_{ω∈Ω} p_ω = µ(Ω) = 1.
Moreover, µ is uniquely determined by the collection (p_ω)_{ω∈Ω}, that we call weights. We think
of the weight p_ω as the probability of seeing the outcome ω, when sampling according to µ.
We now list several natural probability distributions.
10.1. Bernoulli distribution. Take Ω = {0, 1} and define µ to be the probability distri-
bution given by the weights
p1 = p, p0 = 1 − p.
Then µ models the number of heads obtained in one biased coin toss (the coin gives head
with probability p and tail with probability 1− p). Such µ is called Bernoulli distribution of
parameter p, and it is denoted by Bernoulli(p).
10.2. Binomial distribution. Fix an integer N ≥ 1 and let Ω = 0, 1, 2 . . . N. Define
µ to be the probability distribution given by the weights
pk =
(N
k
)pk(1− p)N−k, 0 ≤ k ≤ N.
Then µ models the number of heads in N biased coin tosses (again, the coin gives head with
probability p, and tail with probability 1 − p). Such µ is called Binomial distribution of
parameters N, p, and it is denoted by Binomial(N, p). Note that
µ(Ω) = ∑_{k=0}^{N} pk = ∑_{k=0}^{N} (N choose k) p^k (1 − p)^(N−k) = (p + (1 − p))^N = 1.
10.3. Geometric distribution. Let Ω = {1, 2, 3, . . .}, and define µ as the distribution
given by the weights
pk = (1 − p)^(k−1) p for all k ≥ 1.
Then µ models the number of biased coin tosses up to (and including) the first
head. Such µ is called Geometric distribution of parameter p, and it is denoted by Geometric(p).
Note that
µ(Ω) = ∑_{k=1}^{∞} pk = p ∑_{k=1}^{∞} (1 − p)^(k−1) = p / (1 − (1 − p)) = 1.
Note: A variant of the geometric distribution counts the number of tails until the first head.
In this case Ω = {0, 1, 2, 3, . . .} and
pk = (1 − p)^k p for all k ≥ 0.
This is also referred to as Geometric(p), and you should check that µ(Ω) = 1. It
should always be made clear which version of the geometric distribution is intended.
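As a quick numerical sanity check (a Python sketch; the parameter p = 0.3 and the truncation point K are chosen only for illustration), both conventions of the geometric weights do sum to 1:

```python
# Sketch: truncated sums of the Geometric(p) weights, for both conventions.
# p = 0.3 and K = 400 are illustrative choices; the discarded tail is negligible.
p, K = 0.3, 400
total_from_1 = sum((1 - p) ** (k - 1) * p for k in range(1, K + 1))  # Omega = {1, 2, ...}
total_from_0 = sum((1 - p) ** k * p for k in range(0, K + 1))        # Omega = {0, 1, 2, ...}
print(total_from_1, total_from_0)  # both essentially 1
```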
10.4. Poisson distribution. Let Ω = N, and for a fixed parameter λ ∈ (0,+∞) let µ be
the probability distribution given by the weights
pk = e^(−λ) λ^k / k!, k ≥ 0.
Such µ is called Poisson distribution of parameter λ, and it is denoted by Poisson(λ). Note
that
µ(Ω) = ∑_{k=0}^{∞} e^(−λ) λ^k / k! = e^(−λ) e^λ = 1.
This distribution arises as the limit of a Binomial distribution with parameters N, λ/N as N → ∞.
Indeed, if pk(N, λ/N) denote the weights of a Binomial(N, λ/N) distribution, we have
pk(N, λ/N) = (N choose k) (λ/N)^k (1 − λ/N)^(N−k) = [N(N − 1) · · · (N − k + 1)/N^k] (1 − λ/N)^(N−k) λ^k/k! −→ e^(−λ) λ^k/k!
as N → ∞, since
N(N − 1) · · · (N − k + 1)/N^k → 1, (1 − λ/N)^N → e^(−λ), (1 − λ/N)^(−k) → 1.
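The convergence above can be watched numerically (a sketch; λ = 2 and k = 3 are illustrative choices):

```python
import math

# Sketch: the Binomial(N, lam/N) weight p_k(N, lam/N) approaches the
# Poisson(lam) weight as N grows. lam = 2, k = 3 chosen for illustration.
lam, k = 2.0, 3

def binom_weight(N):
    return math.comb(N, k) * (lam / N) ** k * (1 - lam / N) ** (N - k)

poisson_weight = math.exp(-lam) * lam ** k / math.factorial(k)
for N in (10, 100, 10_000):
    print(N, binom_weight(N), poisson_weight)
```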
11. Exercises
Exercise 1. Let A,B be two events such that 0 < P(A) < 1 and 0 < P(B) < 1. We say
that B attracts A if P(A|B) > P(A). Show that if B attracts A then A attracts B.
Exercise 2. Tom is undecided as to whether to take a Mathematics course or a Literature
course. He estimates that the probability of getting an A would be 1/3 in a Maths course, and
1/2 in a Literature course. Suppose he chooses which course to take by tossing a fair coin: what
is the probability that he gets an A? Suppose that instead he tosses a biased coin: if it comes
up head, which happens with probability p ∈ (0, 1), he takes the Maths course, and otherwise
he takes the Literature course. What is the probability that Tom gets an A?
Exercise 3. Toss two dice. Let
Ai = the first die gives i, 1 ≤ i ≤ 6
Bj = the sum of the two dice is j, 2 ≤ j ≤ 12.
Determine for which (i, j) the events Ai, Bj are independent.
Exercise 4. At a certain stage of a criminal investigation the inspector is 60 per cent
convinced of the guilt of a certain suspect. Suppose, however, that a new piece of evidence
is uncovered: it is found that the criminal is left-handed. If the probability of being both
left-handed and non-guilty is 2/25, how certain of the guilt of the suspect should the inspector
be if it turns out that the suspect is left-handed?
Exercise 5. Show that for any A,B with P(B) > 0 it holds
0 ≤ P(A|B) ≤ 1.
Exercise 6. Toss a biased coin, which gives head with probability p ∈ (0, 1) and tail with
probability 1− p. If it comes up head, roll two dice and record the sum. If tail, pick a card
from a standard deck and record its number (between 1 and 13). What is the probability that
the number you have recorded is 4? What is the probability that it is ≤ 2?
12. Additional exercises
Exercise 1. A committee of 3 people is to be formed from a group of n people. How many
different committees are possible? Let c(n) denote the number of possible committees. Find α
such that
lim_{n→∞} c(n)/n^α ∈ (0, +∞).
Exercise 2. How many different words can be formed from (exactly) the letters QUIZZES?
Exercise 3. How many distinct non-negative integer-valued solutions (x1, x2) of the equa-
tion x1 + x2 = 3 are possible?
Exercise 4. Show that P(A ∩B) ≤ P(A). When does equality hold?
Exercise 5. A total of 36 members of a club play tennis, 28 play squash and 18 play
badminton. Furthermore, 22 of the members play both tennis and squash, 12 play both tennis
and badminton, 9 play both squash and badminton, and 4 play all three sports. How many
members of the club play at least one of the three sports?
Exercise 6. How many people have to be in a room in order that the probability that at
least two of them celebrate their birthday in the same month is at least 1/2? Assume that all
months are equally likely.
Exercise 7. An urn contains two coins of type A and one coin of type B. When a type
A coin is flipped, it comes up head with probability 1/4, while when a type B coin is flipped
it comes up head with probability 3/4. A coin is randomly chosen from the urn and flipped.
What is the probability to get head? Given that the coin gave head, what is the probability
that it was a coin of type A?
Exercise 8. An infinite sequence of independent trials is to be performed. Each trial
results in a success with probability p ∈ (0, 1), and a failure with probability 1− p. Determine
the probability that:
(a) at least one success occurs in the first n trials,
(b) exactly k successes occur in the first n trials,
(c) all trials result in successes,
(d) all trials result in failures.
Exercise 9. An infinite sequence of independent trials is to be performed. Each trial
results in a success with probability p ∈ (0, 1), and a failure with probability 1− p. What is the
probability distribution of the number of trials until (and including) the first success? Write
down the corresponding weights (pk)k≥1.
Exercise 10. Each day of the week James flips a biased coin: if it comes up head (which
happens with probability p), then he goes out for a walk. What is the distribution of the
number of times James goes for a walk in one week? Write down the corresponding weights
(pk)0≤k≤7.
13. Random variables
It is often the case that when a random experiment is conducted we don’t just want to record
the outcome, but perhaps we are interested in some functions of the outcome.
Definition 13.1. A random variable on (Ω,F) taking values in a discrete set S is a function
X : Ω→ S.
Typically S ⊆ R or S ⊆ Rk, in which case we say that X is a real-valued random variable.
Example 13.2. Toss a biased coin. Then Ω = {H, T}, where H, T stand for head and tail.
Define
X(H) = 1, X(T ) = 0.
Then X : Ω → {0, 1} counts the number of heads in the outcome.
Example 13.3. Toss two biased coins, so that Ω = {H, T}^2. Define
X(HH) = 2, X(TH) = X(HT ) = 1, X(TT ) = 0.
Then X : Ω → {0, 1, 2} counts the number of heads in 2 coin tosses.
Example 13.4. Roll two dice, so that Ω = {1, 2, 3, 4, 5, 6}^2. For (i, j) ∈ Ω, set
X(i, j) = i+ j.
Then X : Ω → {2, 3, . . . , 12} records the sum of the two dice. We could also define the random
variables
Y (i, j) = max{i, j}, Z(i, j) = i.
Then Y : Ω → {1, 2, 3, 4, 5, 6} records the maximum of the two dice, while Z : Ω → {1, 2, 3, 4, 5, 6} records the outcome of the first die.
Example 13.5. For a given probability space (Ω,F ,P) and A ∈ F , define
1A(ω) = 1 for ω ∈ A, and 1A(ω) = 0 for ω ∉ A.
Then 1A : Ω → {0, 1} tells us whether the outcome was in A or not. Note that
(i) 1Ac = 1− 1A,
(ii) 1A∩B = 1A1B,
(iii) 1A∪B = 1− (1− 1A)(1− 1B).
You should prove (iii) as an exercise.
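A brute-force numerical check of identities (i)-(iii) (not a proof, but a useful sanity test) can be run on a small toy sample space; the sets A and B below are arbitrary illustrative choices:

```python
# Sketch: verify the indicator identities (i)-(iii) pointwise on a small
# sample space. A and B are arbitrary illustrative subsets.
Omega = set(range(8))
A, B = {1, 2, 3}, {3, 4, 5}

def ind(S, w):
    return 1 if w in S else 0

for w in Omega:
    assert ind(Omega - A, w) == 1 - ind(A, w)                      # (i)
    assert ind(A & B, w) == ind(A, w) * ind(B, w)                  # (ii)
    assert ind(A | B, w) == 1 - (1 - ind(A, w)) * (1 - ind(B, w))  # (iii)
```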
For a subset T ⊆ S we denote the set {ω ∈ Ω : X(ω) ∈ T} simply by {X ∈ T}. Since X takes
values in a discrete S, we let
px = P(X = x)
for all x ∈ S. The collection (px)x∈S is referred to as the probability distribution of X. If the
probability distribution of X is, say, Geometric(p), then we say that X is a Geometric random
variable of parameter p, and write X ∼ Geometric(p). Similarly, for the other distributions we
have encountered write X ∼ Bernoulli(p), X ∼ Binomial(N, p), X ∼ Poisson(λ).
Given a random variable X taking values in S ⊂ R, the function FX : R→ [0, 1] given by
FX(x) = P(X ≤ x)
is called the distribution function of X. Note that FX is piecewise constant, non-decreasing,
right-continuous, and
lim_{x→−∞} FX(x) = 0, lim_{x→+∞} FX(x) = 1.
Example 13.6. Toss a biased coin, which gives head with probability p, and define the random
variable
X(H) = 1, X(T ) = 0.
Then
FX(x) =
0, if x < 0
1− p, if x ∈ [0, 1)
1, if x ≥ 1.
Note that FX is piecewise constant, non-decreasing, right-continuous, and its jumps are given
by the weights of a Bernoulli distribution of parameter p.
Knowing the probability distribution function is equivalent to knowing the collection of weights
(px)x∈S such that px = P(X = x), and hence it is equivalent to knowing the probability
distribution of X.
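The distribution function of Example 13.6 can be written out directly (a sketch; p = 0.4 is an illustrative choice):

```python
# Sketch: the distribution function F_X of a Bernoulli(p) random variable,
# as in Example 13.6. p = 0.4 is chosen only for illustration.
p = 0.4

def F(x):
    if x < 0:
        return 0.0
    if x < 1:
        return 1 - p  # jump of size 1 - p at x = 0
    return 1.0        # jump of size p at x = 1

# F is right-continuous, non-decreasing, 0 at -infinity and 1 at +infinity.
assert F(-5) == 0.0 and F(0) == 1 - p and F(0.5) == 1 - p and F(1) == 1.0
```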
Definition 13.7 (Independent random variables). Two random variables X,Y , taking values
in SX , SY respectively, are said to be independent if
P(X = x, Y = y) = P(X = x)P(Y = y)
for all (x, y) ∈ SX × SY . In general, the random variables X1, X2 . . . Xn taking values in
S1, S2 . . . Sn are said to be independent if
P(X1 = x1, X2 = x2 . . . Xn = xn) = P(X1 = x1)P(X2 = x2) · · ·P(Xn = xn)
for all (x1, x2 . . . xn) ∈ S1 × S2 × · · · × Sn.
Note that if X1, X2 . . . XN are independent, then for any k ≤ N and any distinct indices
1 ≤ i1 < i2 < · · · < ik ≤ N , the random variables Xi1 , Xi2 . . . Xik are independent. We show
this with N = 3 for simplicity. Let X : Ω→ SX , Y : Ω→ SY and Z : Ω→ SZ be independent
random variables, that is
P(X = x, Y = y, Z = z) = P(X = x)P(Y = y)P(Z = z)
for all (x, y, z) ∈ SX × SY × SZ . Then X,Y are independent. Indeed,
P(X = x, Y = y) = ∑_{z∈SZ} P(X = x, Y = y, Z = z) = ∑_{z∈SZ} P(X = x)P(Y = y)P(Z = z)
= P(X = x)P(Y = y) ∑_{z∈SZ} P(Z = z) = P(X = x)P(Y = y),
since ∑_{z∈SZ} P(Z = z) = 1.
Similarly, one shows that Y,Z and X,Z are independent.
Example 13.8. Consider the probability distribution on Ω = {0, 1}^N given by the weights
pω = ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk)
for any ω = (ω1, ω2 . . . ωN ) ∈ Ω. This models a sequence of N tosses of a biased coin. On Ω we
can define the random variables X1, X2 . . . XN by
Xk(ω) = ωk
for all ω = ω1ω2 . . . ωN ∈ Ω. Then each Xk is a Bernoulli random variable of parameter p, since
P(Xi = 1) = P(ωi = 1) = ∑_{ω : ωi=1} ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk) = p = 1 − P(Xi = 0).
Moreover, for ω = ω1ω2 . . . ωN ∈ {0, 1}^N,
P(X1 = ω1, X2 = ω2 . . . XN = ωN ) = pω = ∏_{k=1}^{N} p^(ωk) (1 − p)^(1−ωk) = ∏_{k=1}^{N} P(Xk = ωk),
so X1, X2 . . . XN are independent. Define a further random variable SN on Ω by setting
SN (ω) = ∑_{k=1}^{N} Xk(ω) = ∑_{k=1}^{N} ωk.
Then SN (ω) counts the number of ones in ω. Moreover, for k = 0, 1 . . . N,
|{SN = k}| = (N choose k),
and pω = p^k (1 − p)^(N−k) for all ω ∈ {SN = k}, so
P(SN = k) = (N choose k) p^k (1 − p)^(N−k).
Thus SN is a Binomial random variable of parameters N, p.
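The whole of Example 13.8 can be checked by brute-force enumeration for a small N (a sketch; N = 4 and p = 0.3 are illustrative choices):

```python
import itertools
import math

# Sketch: enumerate Omega = {0,1}^N with the weights of Example 13.8 and
# check that S_N has exactly the Binomial(N, p) weights.
# N = 4 and p = 0.3 are illustrative choices.
N, p = 4, 0.3
prob_SN = {k: 0.0 for k in range(N + 1)}
for omega in itertools.product((0, 1), repeat=N):
    weight = math.prod(p ** w * (1 - p) ** (1 - w) for w in omega)
    prob_SN[sum(omega)] += weight

for k in range(N + 1):
    binom = math.comb(N, k) * p ** k * (1 - p) ** (N - k)
    assert abs(prob_SN[k] - binom) < 1e-12
```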
We have seen in the above example that the sum of random variables is again a random variable.
In general, one could look at functions of random variables.
Definition 13.9. Let X : Ω→ S be a random variable taking values in S, and let g : S → S′
be a function. Then g(X) : Ω→ S′ defined by
g(X)(ω) = g(X(ω))
is a random variable taking values in S′.
14. Expectation
From now on all the random variables we consider are assumed to take real values, unless
otherwise specified. We say that a random variable X is non-negative if X takes values in
S ⊆ [0,∞).
Definition 14.1 (Expectation of a non-negative random variable). For a non-negative random
variable X : Ω→ S we define the expectation (or expected value, or mean value) of X to be
E(X) = ∑_{x∈S} x P(X = x).
Thus the expectation of X is the average of the values taken by X, averaged with weights
corresponding to the probabilities of the values.
Example 14.2. If X ∼ Bernoulli(p) then
E(X) = 1 · P(X = 1) + 0 · P(X = 0) = p.
Example 14.3. If X ∼ Geometric(p) then
E(X) = ∑_{k=1}^{∞} k P(X = k) = ∑_{k=1}^{∞} k (1 − p)^(k−1) p = p ∑_{k=1}^{∞} [−d/dx (1 − x)^k]_{x=p}
= −p [d/dx ( ∑_{k=0}^{∞} (1 − x)^k )]_{x=p} = −p [d/dx ( 1 / (1 − (1 − x)) )]_{x=p} = p · 1/p^2 = 1/p.
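A truncated numerical check of E(X) = 1/p for the geometric distribution (a sketch; p = 0.25 and the truncation point K are illustrative, and the discarded tail is negligible):

```python
# Sketch: truncated sum for E(X), X ~ Geometric(p). p = 0.25 illustrative;
# the tail beyond K = 2000 terms is astronomically small.
p, K = 0.25, 2000
EX = sum(k * (1 - p) ** (k - 1) * p for k in range(1, K + 1))
print(EX)  # essentially 1/p = 4
```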
For a random variable X write
X+ = max{X, 0}, X− = max{−X, 0},
so that X = X+ −X− and |X| = X+ +X−.
Definition 14.4. For X : Ω → S, define X+ and X− as above. Then, provided E(X+) < ∞ or E(X−) < ∞, we define the expectation of X to be
E(X) = E(X+) − E(X−) = ∑_{x∈S} x P(X = x).
Note that we are allowing the expectation to be infinite: if E(X+) = +∞ and E(X−) < ∞ then we set E(X) = +∞. Similarly, if E(X+) < ∞ and E(X−) = +∞ then we set E(X) = −∞.
However, if both E(X+) = E(X−) = +∞ then the expectation of X is not defined.
Definition 14.5. A random variable X : Ω→ S is said to be integrable if E(|X|) <∞.
Example 14.6. Let X : Ω→ Z have probability distribution (pk)k∈Z where
p0 = P(X = 0) = 0,
pk = P(X = k) = P(X = −k) = 2^(−|k|−1), k ≥ 1.
Then E(X+) = E(X−) = 1, so E(X) = E(X+) − E(X−) = 0.
Example 14.7. Let X : Ω→ Z have probability distribution (pk)k∈Z where
p0 = P(X = 0) = 0,
pk = P(X = k) = P(X = −k) = 3/(πk)^2, k ≥ 1.
Then E(X+) = E(X−) = +∞, so E(X) is not defined.
Properties of the expectation. The expectation of a random variable X : Ω → S
satisfies the following properties, some of which we state without proof:
(1) If X ≥ 0 then E(X) ≥ 0, and E(X) = 0 if and only if P(X = 0) = 1.
(2) If c ∈ R is a constant, then E(c) = c and E(cX) = cE(X).
(3) For random variables X,Y , it holds E(X + Y ) = E(X) + E(Y ).
Proof. If X,Y take values in SX , SY respectively, we have
E(X + Y ) = ∑_{z∈SX+SY} z P(X + Y = z) = ∑_{x∈SX} ∑_{y∈SY} (x + y) P(X = x, Y = y)
= ∑_{x∈SX} x ( ∑_{y∈SY} P(X = x, Y = y) ) + ∑_{y∈SY} y ( ∑_{x∈SX} P(X = x, Y = y) )
= ∑_{x∈SX} x P(X = x) + ∑_{y∈SY} y P(Y = y) = E(X) + E(Y ),
where we used that the inner sums equal P(X = x) and P(Y = y) respectively.
The above properties generalise by induction, to give that for any constants c1, c2 . . . cn and
random variables X1, X2 . . . Xn it holds
E( ∑_{k=1}^{n} ck Xk ) = ∑_{k=1}^{n} ck E(Xk).
Thus the expectation is linear.
(4) For any function g : S → S′, g(X) : Ω→ S′ is a random variable taking values in S′,
and
E(g(X)) = ∑_{x∈S} g(x) P(X = x).
An important example is given by g(x) = xk, and the corresponding expectation E(Xk)
is called the kth moment of the random variable X.
(5) If X ≥ 0 takes integer values, then
E(X) = ∑_{k=1}^{∞} P(X ≥ k).
Proof. We have
E(X) = ∑_{k=0}^{∞} k P(X = k) = ∑_{k=1}^{∞} ∑_{j=1}^{k} P(X = k) = ∑_{j=1}^{∞} ∑_{k=j}^{∞} P(X = k) = ∑_{j=1}^{∞} P(X ≥ j).
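The tail-sum formula of Property (5) can be checked numerically for a truncated Poisson (a sketch; λ = 3 and the truncation point K are illustrative, with a negligible discarded tail):

```python
import math

# Sketch: check E(X) = sum_{k>=1} P(X >= k) for X ~ Poisson(lam),
# truncating the series at K terms. lam = 3, K = 80 are illustrative.
lam, K = 3.0, 80
pk = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(K)]
mean_direct = sum(k * pk[k] for k in range(K))
mean_by_tails = sum(sum(pk[j] for j in range(k, K)) for k in range(1, K))
assert abs(mean_direct - mean_by_tails) < 1e-9
assert abs(mean_direct - lam) < 1e-9  # and E(X) = lam for a Poisson
```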
(6) If X,Y are independent random variables, then E(XY ) = E(X)E(Y ). In general, if
X1, X2 . . . XN are independent random variables,
E(X1 ·X2 · · ·XN ) = E(X1) · E(X2) · · ·E(XN ).
Example 14.8. Let X ∼ Geometric(p). Then P(X ≥ k) = (1 − p)^(k−1), and
E(X) = ∑_{k=1}^{∞} P(X ≥ k) = ∑_{k=1}^{∞} (1 − p)^(k−1) = ∑_{k=0}^{∞} (1 − p)^k = 1/p.
Example 14.9. Roll two dice, and let X,Y denote the outcomes of the first and second die
respectively. Then X and Y are independent, and
E(X) = E(Y ) = ∑_{k=1}^{6} k P(X = k) = (1/6) ∑_{k=1}^{6} k = 21/6,
so if S = X + Y and P = XY we have
E(S) = E(X + Y ) = E(X) + E(Y ) = 42/6, E(P ) = E(XY ) = E(X)E(Y ) = (21/6)^2.
In general, if X1, X2 . . . XN denote the outcome of N consecutive rolls of a die, and we define
S = ∑_{k=1}^{N} Xk, P = ∏_{k=1}^{N} Xk,
then
E(S) = N E(X1) = (21/6) · N, E(P ) = [E(X1)]^N = (21/6)^N.
14.1. Functions of random variables. Recall that if X : Ω→ S is a random variable,
and g : S → S′ is a (non-random) function, then g(X) : Ω→ S′ is itself a random variable, and
by Property (4)
E(g(X)) = ∑_{x∈S} g(x) P(X = x).
Now, if X : Ω → SX and Y : Ω → SY are two random variables, f : SX → S′X and g : SY → S′Y are two functions, and X, Y are independent, then f(X), g(Y ) are also
independent. This generalises to arbitrarily many random variables: if X1, X2 . . . XN
are independent, then f1(X1), f2(X2) . . . fN (XN ) are independent.
Example 14.10. Toss two dice. Let
A = the first die gives 4, B = the sum of the two dice is 7.
Then we have seen that A,B are independent events. Define the indicator random variables
X = 1A, Y = 1B.
Since indicators of independent events are independent, X,Y are independent. Moreover,
E(X) = P(A) = 1/6, E(Y ) = P(B) = 1/6,
E(X + Y ) = P(A) + P(B) = 1/3, E(XY ) = P(A)P(B) = 1/36.
If, on the other hand, we introduce
C = the sum of the two dice is 3, Z = 1C
then A,C are not independent, so X,Z are not independent, and
E(Z) = P(C) = 2/36, E(X + Z) = P(A) + P(C) = 8/36,
E(XZ) = E(1A∩C) = P(A ∩ C) = 0 ≠ E(X)E(Z).
15. Exercises
Exercise 1. Let X ∼ Bernoulli(p). Determine the distribution of Y = 1−X.
Exercise 2. Show that for any two sets A,B
1A∪B = 1− (1− 1A)(1− 1B).
Exercise 3. Show that if A,B are independent events then 1A and 1B are independent
random variables. Compute the expectation of 1A∩B and 1A∪B.
Exercise 4. Roll two dice, and let Ω = {1, 2 . . . 6}^2 be the associated sample space. Define
the random variable X : Ω → {1, 2 . . . 6} by
X(i, j) = max{i, j}
for all (i, j) ∈ Ω. Write down the weights of the random variable X. Draw the graph of the
associated distribution function FX , and check that
lim_{x→−∞} FX(x) = 0, lim_{x→+∞} FX(x) = 1.
Exercise 5. Roll two dice, and let Ω = {1, 2 . . . 6}^2 be the associated sample space.
Consider the event A = {i = 4}, and define the random variable
X(i, j) = 1A(i, j) = 1 if i = 4, and 0 if i ≠ 4.
Define further the random variable
Y (i, j) = i+ j.
Determine whether X and Y are independent.
Exercise 6. Redo all the computations of Example 13.8 for N = 2 and N = 3.
Exercise 7. Show that the Xk’s defined in Example 13.8 are Bernoulli(p).
Exercise 8. For an event A ⊆ Ω, let 1A denote the indicator function of A. Compute
E(1A). When is E(1A) = 0? When is E(1A) = 1?
Exercise 9. Let X ∼ Binomial(N, p). Show that E(X) = Np.
Exercise 10. Let X ∼ Poisson(λ). Show that E(X) = λ.
16. Variance and covariance
16.1. Variance. Once we know the mean of a random variable X, we may ask how much
typically X deviates from it. This is measured by the variance.
Definition 16.1. For any random variable X with finite mean E(X), the variance of X is
defined to be
Var(X) = E[(X − E(X))^2].
Thus Var(X) measures how much the distribution of X is concentrated around its mean: the
smaller the variance, the more the distribution is concentrated around E(X).
Properties of the variance. The variance of a random variable X : Ω→ S satisfies the
following properties:
(1) Var(X) = E(X^2) − [E(X)]^2.
Proof. We have
Var(X) = E[(X − E(X))^2] = E[X^2 + (E(X))^2 − 2XE(X)] = E(X^2) − [E(X)]^2.
(2) If E(X) = 0 then Var(X) = E(X^2).
(3) Var(X) ≥ 0, and Var(X) = 0 if and only if P(X = c) = 1 for some constant c ∈ R.
(4) If c ∈ R then Var(cX) = c^2 Var(X).
(5) If c ∈ R then Var(X + c) = Var(X).
(6) If X,Y are independent then Var(X + Y ) = Var(X) + Var(Y ). In general, if
X1, X2 . . . XN are independent then
Var(X1 +X2 + · · ·+XN ) = Var(X1) + Var(X2) + · · ·+ Var(XN ).
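Property (6) can be checked exactly for two independent dice by enumerating all 36 equally likely outcomes (a sketch):

```python
import itertools

# Sketch: exact check of Var(X + Y) = Var(X) + Var(Y) for two independent
# fair dice, by enumeration over the 36 equally likely outcomes.
outcomes = list(itertools.product(range(1, 7), repeat=2))

def expectation(f):
    return sum(f(i, j) for i, j in outcomes) / len(outcomes)

def variance(f):
    m = expectation(f)
    return expectation(lambda i, j: (f(i, j) - m) ** 2)

var_X = variance(lambda i, j: i)
var_Y = variance(lambda i, j: j)
var_S = variance(lambda i, j: i + j)
assert abs(var_S - (var_X + var_Y)) < 1e-12
```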
16.2. Covariance. Closely related to the concept of variance is that of covariance.
Definition 16.2. For any two random variables X,Y with finite mean, we define their covari-
ance to be
Cov(X,Y ) = E[(X − E(X))(Y − E(Y ))].
16.3. Properties of the covariance. The covariance of two random variables X,Y
satisfies the following properties:
(1) Cov(X,Y ) = E(XY )− E(X)E(Y ).
Proof. We have
Cov(X,Y ) = E[(X − E(X))(Y − E(Y ))]
= E[XY −XE(Y )− Y E(X) + E(X)E(Y )]
= E(XY )− E(X)E(Y ).
(2) Cov(X,X) = Var(X).
(3) Cov(X, c) = 0 for all c ∈ R.
(4) For c ∈ R, Cov(cX, Y ) = cCov(X,Y ) and Cov(X + c, Y ) = Cov(X,Y ).
(5) For X,Y, Z random variables, Cov(X + Z, Y ) = Cov(X,Y ) + Cov(Z, Y ).
The above properties generalise, to give that the covariance is bilinear. That is, for any
collections of random variables X1, X2 . . . Xn and Y1, Y2 . . . Yn, and constants a1, a2 . . . an and
b1, b2 . . . bn, it holds
Cov( ∑_{i=1}^{n} ai Xi, ∑_{j=1}^{n} bj Yj ) = ∑_{i=1}^{n} ∑_{j=1}^{n} ai bj Cov(Xi, Yj).
(6) For any two random variables with finite variance, it holds
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X,Y ).
Proof. We have
Var(X + Y ) = E[(X − E(X) + Y − E(Y ))^2]
= E[(X − E(X))^2] + E[(Y − E(Y ))^2] + 2E[(X − E(X))(Y − E(Y ))]
= Var(X) + Var(Y ) + 2Cov(X, Y ).
(7) If X,Y are independent then Cov(X,Y ) = 0.
Proof. We have
Cov(X,Y ) = E(XY )− E(X)E(Y ) = E(X)E(Y )− E(X)E(Y ) = 0,
where the second equality follows from independence.
We remark that the converse of property (7) is false, as shown by the following example.
Example 16.3 (Zero covariance does not imply independence). Let
X take each of the values 2, 1, −1, −2 with probability 1/4, and let Y = X^2.
Then E(X) = 0 and E(XY ) = E(X3) = 0 by symmetry, so
Cov(X,Y ) = E(XY )− E(X)E(Y ) = 0.
However, X,Y are not independent, since
P(X = 2, Y = 4) = P(X = 2) = 1/4,
while
P(X = 2)P(Y = 4) = P(X = 2)P({X = 2} ∪ {X = −2}) = (1/4)(1/2) = 1/8.
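Example 16.3 can be reproduced with exact rational arithmetic (a sketch, using Python's `fractions` module):

```python
from fractions import Fraction

# Sketch: exact arithmetic for Example 16.3. X is uniform on {-2, -1, 1, 2}
# and Y = X^2: the covariance is 0, yet X and Y are not independent.
w = Fraction(1, 4)
support = (2, 1, -1, -2)
EX = sum(x * w for x in support)        # = 0 by symmetry
EY = sum(x ** 2 * w for x in support)
EXY = sum(x ** 3 * w for x in support)  # E(X^3) = 0 by symmetry
assert EXY - EX * EY == 0               # Cov(X, Y) = 0

P_X2_and_Y4 = w            # {X = 2} implies {Y = 4}
P_X2_times_PY4 = w * (w + w)  # P(Y = 4) = P(X = 2) + P(X = -2) = 1/2
assert P_X2_and_Y4 != P_X2_times_PY4  # 1/4 != 1/8: not independent
```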
17. Exercises
Exercise 1. Toss two biased coins (each giving head with probability p). Let
A = the first coin gives head,
B = the second coin gives head,
C = at least one head.
Let X = 1A, Y = 1B, Z = 1C . Compute the expectation and variance of X, Y, Z. Compute
Cov(X, Y ), Cov(X, Z). Are X, Z independent? Compute E(X + Y ), E[(X + Y )^2], E[log(X + 2)],
E(4Y Z − X^2).
Exercise 2. Let X be a random variable taking values 0, 1, 2, 3 with probabilities p0, p1, p2, p3
respectively, where
p0 = p1 = p2 = 1/6, p3 = 1/2.
Compute E(X), E(X^2), Var(X). Describe the random variable Y = |X − 2| (that is, list
the possible values of Y and the associated weights). Compute E(Y ), E(Y^2), Var(Y ) and
Cov(X, Y ). Are X, Y independent? Compute E(X + Y + 3), Var(X + Y − 2), E[(X − Y )X],
E[(X + Y )(X − Y )].
Exercise 3. Draw two cards from a standard deck. Let
A = the first card is heart, B = the second card is heart.
Define X = 1A, Y = 1B. Are A,B independent? Are X,Y independent? Write down the
weights of X and Y , and compute E(X), E(Y ), Cov(X,Y ), Var(X + Y ).
Exercise 4. Let X ∼ Geometric(p). Compute E(X), E(X(X − 1)), and deduce E(X^2).
Exercise 5. Let X ∼ Poisson(λ). Compute E(X), E(X^2) and Var(X).
Exercise 6. Toss a biased coin N times. For all 1 ≤ k ≤ N let
Ak = the kth coin gives head,
and define Xk = 1Ak . Compute E(Xk) and Var(Xk) for all k. Are the Xk’s independent? Let
S = ∑_{k=1}^{N} Xk.
Then S counts the total number of heads in N coin tosses. Compute E(S) and Var(S). Assume
N is even. What is the expected total number of heads obtained in even coin tosses? And odd
coin tosses?
Exercise 7. Let X be a random variable with finite mean µ. For x ∈ R define the function
f(x) = E[(X − x)^2].
Computing f ′(x), show that f(x) is minimised at x = µ, and compute its minimum value.
Exercise 8. Let X be a random variable with mean µ and variance σ^2. Compute E(X^2)
and E[(X − 2)^2] in terms of µ and σ^2.
Exercise 9. Give an example of two random variables with equal mean but different
variance. Give an example of two random variables with the same mean and variance, but
different distribution.
18. Inequalities
It is often the case that, rather than computing the exact value of E(X), we are only interested
in having a bound for it.
Theorem 18.1 (Markov’s inequality). Let X be a non-negative random variable, and λ ∈ (0,∞).
Then
P(X ≥ λ) ≤ E(X)/λ.
Proof. Let Y = λ1{X≥λ}. Then X ≥ Y , so
E(X) ≥ E(Y ) = E(λ1{X≥λ}) = λE(1{X≥λ}) = λP(X ≥ λ).
Example 18.2. Roll a die N times, and let SN denote the total number of 3’s obtained. Then
SN ∼ Binomial(N, 1/6), and so E(SN ) = N/6. Thus, for α > 0, Markov’s inequality gives
P(SN ≥ αN) ≤ E(SN )/(αN) = 1/(6α).
Note that the bound is useless if α ≤ 1/6, while it contains interesting information if α > 1/6.
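Comparing the bound with the exact binomial tail shows how loose Markov's inequality can be (a sketch; N = 30 and α = 0.5 are illustrative choices):

```python
import math

# Sketch: Markov's bound 1/(6*alpha) versus the exact tail P(S_N >= alpha*N)
# for S_N ~ Binomial(N, 1/6). N = 30, alpha = 0.5 are illustrative.
N, alpha, p = 30, 0.5, 1 / 6
exact_tail = sum(math.comb(N, k) * p ** k * (1 - p) ** (N - k)
                 for k in range(math.ceil(alpha * N), N + 1))
markov_bound = 1 / (6 * alpha)
print(exact_tail, markov_bound)  # the exact tail is far below the bound
assert exact_tail <= markov_bound
```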
Theorem 18.3 (Chebyshev’s inequality). Let X be a random variable with finite mean E(X).
Then, for λ ∈ (0,∞),
P(|X − E(X)| ≥ λ) ≤ Var(X)/λ^2.
Proof. We have
P(|X − E(X)| ≥ λ) = P((X − E(X))^2 ≥ λ^2) ≤ E[(X − E(X))^2]/λ^2 = Var(X)/λ^2,
where the inequality follows from Markov’s inequality.
Theorem 18.4 (Cauchy-Schwarz inequality). For all random variables X,Y , it holds
E(|XY |) ≤ √(E(X^2)E(Y^2)).
We do not prove this inequality. Note that if Y is constant, say P(Y = 1) = 1, then the
inequality gives
E(|X|) ≤ √E(X^2).
Example 18.5. Toss a fair coin repeatedly. Let X denote the number of tosses until (and
including) the first head, while Y denotes the number of tosses until (and including) the first
tail. Then, with p = 1/2,
X ∼ Geometric(p), Y ∼ Geometric(1− p),
and X,Y are not independent since
P(X = 1, Y = 1) = 0 ≠ 1/4 = P(X = 1)P(Y = 1).
We want a bound for E(XY ). To this end, recall that if Z ∼ Geometric(p) then E(Z^2) =
(2 − p)/p^2. So
E(X^2) = (2 − 1/2)/(1/2)^2 = 6 = E(Y^2),
from which, using the Cauchy-Schwarz inequality, we find
E(XY ) ≤ √(6 · 6) = 6.
This, together with E(X) = E(Y ) = 2, allows us to bound the covariance:
Cov(X,Y ) = E(XY )− E(X)E(Y ) ≤ 6− 4 = 2.
Theorem 18.6 (Jensen’s inequality). Let X be an integrable random variable and let f : R → R be a convex function. Then
f(E(X)) ≤ E(f(X)).
We do not prove this. To remember the direction of the inequality, take f(x) = x^2 and use that
E(X^2) − [E(X)]^2 = Var(X) ≥ 0.
Example 18.7. Let X ∼ Poisson(λ), so that E(X) = λ. Define Y = e^X . Since f(x) = e^x is a
convex function, we have
E(Y ) = E(e^X) ≥ e^(E(X)) = e^λ.
19. Exercises
Exercise 1. In a sequence of n independent trials the probability of a success at the ith
trial is pi. Let N denote the total number of successes. Find E(N) and Var(N).
Exercise 2. Let X be an integrable random variable. Show that for every p > 0 and all
x > 0
P(|X| ≥ x) ≤ E(|X|^p) x^(−p).
Show further that for all β ≥ 0
P(X ≥ x) ≤ E(e^(βX)) e^(−βx).
Exercise 3. Let X1, X2 . . . Xn be independent and identically distributed random variables
with finite mean µ and variance σ^2. Define the sample mean
X̄ = (1/n) ∑_{k=1}^{n} Xk.
Compute E(X̄) and Var(X̄). Use Chebyshev’s inequality to bound
P(|X̄ − µ| > 2σ).
Use this bound to find n such that
P(|X̄ − µ| > 2σ) ≤ 0.01.
Exercise 4. Let X ∼ Poisson(λ). Compute E(e^X). Compute E(e^(αX)) for all α ∈ R. Let
X, Y be independent Poisson(λ) random variables. For any α, β ∈ R compute
E(e^(αX+βY )).
Exercise 5. Let X1, X2 . . . Xn be independent and identically distributed random variables
with finite mean µ and variance σ^2. Define the sample mean
X̄ = (1/n) ∑_{k=1}^{n} Xk.
Use Chebyshev’s inequality to show that for any ε > 0
P(|X̄ − µ| > ε) → 0
as n → ∞.
Exercise 6. Explain why
E(X) ≤ log E(e^X), |E(X)| ≤ E(|X|)
for any random variable X with finite mean.
Exercise 7. Roll a die N times. Denote by X the total number of 3’s and by Y the total
number of 6’s. Are X and Y independent? Denote by Z the total number of even outcomes.
Are X and Z independent? Compute expectation and variance of X,Y, Z. [Hint: write X,Y, Z
as sums of independent indicator random variables.]
20. Probability generating functions
Let X be a random variable taking values in N.
Definition 20.1. We define the generating function of X to be
GX(t) = E(t^X) = ∑_{k=0}^{∞} t^k P(X = k)
for t ∈ [0, 1].
Note that the assumption t ∈ [0, 1] guarantees that the infinite sum defining GX(t) is convergent.
Moreover,
GX(1) = ∑_{k=0}^{∞} P(X = k) = 1,
and GX(t) ≤ 1 for all t ∈ [0, 1].
Example 20.2. If X ∼ Bernoulli(p) then
GX(t) = E(t^X) = tp + 1 − p.
Example 20.3. If X ∼ Geometric(p), then
GX(t) = E(t^X) = ∑_{k=1}^{∞} t^k (1 − p)^(k−1) p = tp ∑_{k=0}^{∞} [t(1 − p)]^k = pt/(1 − t(1 − p)).
Example 20.4. If X ∼ Poisson(λ), then
GX(t) = E(t^X) = ∑_{k=0}^{∞} t^k e^(−λ) λ^k/k! = e^(−λ) ∑_{k=0}^{∞} (tλ)^k/k! = e^(−λ+λt).
The importance of probability generating functions lies in the following result.
Theorem 20.5. The probability generating function of a random variable X determines the
distribution of X uniquely.
We do not prove the above theorem. Note that it tells us that if X,Y have the same generating
function, that is if
E(t^X) = E(t^Y ) for all t ∈ [0, 1],
then X,Y have the same distribution.
Example 20.6. We show that if X ∼ Poisson(λ) and Y ∼ Poisson(µ) are independent, then
X + Y ∼ Poisson(λ+ µ). Indeed, we find
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = e^(−λ+λt) e^(−µ+µt) = e^(−(λ+µ)+(λ+µ)t).
We recognise in this expression the probability generating function of a Poisson(λ+µ) random
variable. Since probability generating functions uniquely determine distributions, we conclude
that X + Y ∼ Poisson(λ+ µ).
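The same conclusion can be checked directly by convolving the weights (a sketch; λ = 1.5 and µ = 2.5 are illustrative choices):

```python
import math

# Sketch: convolving Poisson(lam) and Poisson(mu) weights gives the
# Poisson(lam + mu) weights, as Example 20.6 predicts. lam, mu illustrative.
lam, mu = 1.5, 2.5

def pois(l, k):
    return math.exp(-l) * l ** k / math.factorial(k)

for k in range(12):
    # P(X + Y = k) = sum_j P(X = j) P(Y = k - j) by independence
    conv = sum(pois(lam, j) * pois(mu, k - j) for j in range(k + 1))
    assert abs(conv - pois(lam + mu, k)) < 1e-12
```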
Example 20.7. If X ∼ Binomial(N, p), then
GX(t) = E(t^X) = ∑_{k=0}^{N} t^k (N choose k) p^k (1 − p)^(N−k) = (tp + 1 − p)^N.
We use this to show that the sum of two independent Bernoulli(p) random variables is
Binomial(2, p). Indeed, if X,Y are independent Bernoulli(p), we find
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = (tp + 1 − p)^2,
which is the probability generating function of a Binomial(2, p). This generalises to show that
the sum of N independent Bernoulli(p) is Binomial(N, p).
As shown in the above examples, if X,Y are independent then
GX+Y (t) = E(t^(X+Y )) = E(t^X)E(t^Y ) = GX(t)GY (t).
If, moreover, X and Y also have the same distribution, then
GX+Y (t) = [GX(t)]^2.
This generalises to arbitrarily many random variables, to give that if X1, X2 . . . Xn are inde-
pendent, and
S = ∑_{k=1}^{n} Xk,
then
GS(t) = GX1(t)GX2(t) · · ·GXn(t).
If, moreover, the Xk’s all have the same distribution, then
GS(t) = [GX1(t)]^n.
Example 20.8. If X1, X2 . . . Xn are independent random variables, with Xk ∼ Poisson(λk),
then
X1 +X2 + · · ·+Xn ∼ Poisson(λ1 + λ2 + · · ·+ λn).
The expectation E(X^n) is called the nth moment of X. Provided GX(t) is finite for some t > 1, we
can differentiate inside the series, to obtain
G(1) = ∑_{k=0}^{∞} P(X = k) = 1,
G′(1) = ∑_{k=0}^{∞} k P(X = k) = E(X),
G′′(1) = ∑_{k=0}^{∞} k(k − 1) P(X = k) = E(X(X − 1)).
In general,
G^(n)(1) = E(X(X − 1) · · · (X − n + 1)).
Example 20.9. Recall that if X ∼ Poisson(λ) then GX(t) = e^(−λ+λt). So
G′(t) = λ e^(−λ+λt), G′(1) = λ = E(X),
G′′(t) = λ^2 e^(−λ+λt), G′′(1) = λ^2 = E(X(X − 1)),
from which we find
E(X) = G′(1) = λ, E(X^2) = G′′(1) + G′(1) = λ^2 + λ,
and so
Var(X) = E(X^2) − [E(X)]^2 = λ.
21. Conditional distributions
Definition 21.1 (Joint distribution). Given two random variables X : Ω → SX and Y : Ω → SY , their joint distribution is defined as the collection of probabilities
P(X = x, Y = y), x ∈ SX , y ∈ SY .
Note that the joint distribution of X and Y is a probability distribution on SX × SY .
Example 21.2. Let X, Y be independent Bernoulli random variables of parameter p, and
define Z = X + Y . Then, if we interpret X, Y as recording the outcomes of two independent
coin tosses, Z denotes the total number of heads. Here SX = SY = {0, 1} while SZ = {0, 1, 2}. Moreover, by independence
P(X = x, Y = y) = P(X = x)P(Y = y) for all (x, y) ∈ SX × SY .
On the other hand, X and Z are not independent, and they have joint distribution
P(X = 0, Z = 0) = (1 − p)^2, P(X = 0, Z = 1) = (1 − p)p,
P(X = 1, Z = 1) = p(1 − p), P(X = 1, Z = 2) = p^2.
Note that the weights of the joint distribution sum up to 1.
Remark 21.3. The joint distribution of two independent random variables is the product of
the marginal distributions.
From the joint distribution, one can recover the marginal distributions of X and Y by means of
the total probability law:
P(X = x) = ∑_{y∈SY} P(X = x, Y = y) for all x ∈ SX ,
P(Y = y) = ∑_{x∈SX} P(X = x, Y = y) for all y ∈ SY .
Example 21.4. Continuing Example 21.2, from the joint distribution of X,Z we can recover
the marginal distribution of Z via
P(Z = 0) = P(Z = 0, X = 0) + P(Z = 0, X = 1) = (1 − p)^2,
P(Z = 1) = P(Z = 1, X = 0) + P(Z = 1, X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2, X = 0) + P(Z = 2, X = 1) = p^2.
This generalises to arbitrarily many random variables: for X1 . . . Xn random variables taking
values in S1 . . . Sn, their joint distribution is given by
P(X1 = x1, X2 = x2, . . . Xn = xn), x1 ∈ S1, . . . , xn ∈ Sn,
and we can recover the marginal distributions via
P(Xi = x) = ∑_{x1,...,xi−1,xi+1,...,xn} P(X1 = x1, . . . , Xi = x, . . . , Xn = xn).
Definition 21.5 (Conditional distribution). For two random variables X,Y , the conditional
distribution of X given Y = y (which we assume to have positive probability) is the collection
of probabilities
P(X = x|Y = y), x ∈ SX ,
where, clearly,
P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y).
Then we can recover the marginal distribution of X via the total probability law:
P(X = x) = ∑_{y∈SY} P(X = x|Y = y)P(Y = y).
Example 21.6. Continuing Example 21.2, the conditional distribution of Z given X = 0 is
P(Z = 0|X = 0) = 1− p, P(Z = 1|X = 0) = p, P(Z = 2|X = 0) = 0.
On the other hand, the conditional distribution of Z given X = 1 is
P(Z = 0|X = 1) = 0, P(Z = 1|X = 1) = 1− p, P(Z = 2|X = 1) = p.
We can recover the marginal distribution of Z by
P(Z = 0) = P(Z = 0|X = 0)P(X = 0) + P(Z = 0|X = 1)P(X = 1) = (1 − p)^2,
P(Z = 1) = P(Z = 1|X = 0)P(X = 0) + P(Z = 1|X = 1)P(X = 1) = 2p(1 − p),
P(Z = 2) = P(Z = 2|X = 0)P(X = 0) + P(Z = 2|X = 1)P(X = 1) = p^2.
Remark 21.7. If X, Y are independent, then the conditional distribution of X given Y = y coincides with the distribution of X, since P(X = x|Y = y) = P(X = x) for all (x, y) ∈ S_X × S_Y.
Definition 21.8 (Conditional expectation). Finally, we define the conditional expectation of
X given Y = y by
E(X|Y = y) = ∑_{x∈S_X} x P(X = x|Y = y).
Note that if X,Y are independent then E(X|Y = y) = E(X) for all y ∈ SY .
Example 21.9. Continuing Example 21.2, we have
E(Z) = E(X) + E(Y ) = 2p,
while
E(Z|X = 0) = 0 · P(Z = 0|X = 0) + 1 · P(Z = 1|X = 0) + 2 · P(Z = 2|X = 0) = p.
We could have arrived at the same conclusion by observing that, since Z = X + Y, on the
event {X = 0} we have Z = Y, and so
E(Z|X = 0) = E(Y |X = 0) = E(Y ) = p.
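The computations above are easy to check numerically. The following sketch (with the arbitrary choice p = 0.3) tabulates the joint weights of (X, Z) for X, Y independent Bernoulli(p) and Z = X + Y, as in Example 21.2, and then conditions on X = 0:

```python
from itertools import product

p = 0.3   # an arbitrary choice of the bias in Example 21.2

# Joint weights of (X, Z) for X, Y independent Bernoulli(p) and Z = X + Y.
joint = {}
for x, y in product([0, 1], repeat=2):
    w = (p if x else 1 - p) * (p if y else 1 - p)
    joint[(x, x + y)] = joint.get((x, x + y), 0.0) + w

# Conditional distribution of Z given X = 0, and E(Z | X = 0).
p_x0 = sum(w for (x, z), w in joint.items() if x == 0)
cond = {z: w / p_x0 for (x, z), w in joint.items() if x == 0}
e_z_given_x0 = sum(z * w for z, w in cond.items())   # equals p
```

Both the conditional weights (1 − p, p, 0) and E(Z | X = 0) = p come out as in Examples 21.6 and 21.9.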
21.1. Properties of the conditional expectation.
(1) For a constant c ∈ R, we have E(cX|Y = y) = cE(X|Y = y) and E(c|Y = y) = c.
(2) For random variables X,Z, it holds E(X + Z|Y = y) = E(X|Y = y) + E(Z|Y = y).
The above two properties generalise by induction, to give that for any constants c1, c2 . . . cn
and random variables X1, X2 . . . Xn it holds
E( ∑_{k=1}^n c_k X_k | Y = y ) = ∑_{k=1}^n c_k E(X_k|Y = y).
Thus the conditional expectation is linear.
(3) If X,Y are independent then E(X|Y = y) = E(X) for all y ∈ SY .
(4) For a function g : SX → S′,
E(g(X)|Y = y) = ∑_{x∈S_X} g(x) P(X = x|Y = y).
22. Exercises
Exercise 1. Let K be a random variable taking values 0, 1, 2 . . . 7 with probability 1/8
each. Define
X = cos(Kπ/4),  Y = sin(Kπ/4).
Show that Cov(X,Y ) = 0 but X,Y are not independent.
Exercise 2. Let X be a Poisson random variable of parameter λ. We have seen in a
previous exercise that for all β ≥ 0
P(X ≥ x) ≤ E(eβX)e−βx.
By finding the value of β that minimises the right hand side, show that for all x ≥ λ
P(X ≥ x) ≤ exp{−x log(x/λ) − λ + x}.
Show that, on the other hand, for integer x
P(X = x) ∼ (1/√(2πx)) exp{−x log(x/λ) − λ + x}
as x → ∞.
Exercise 3. Suppose that a random variable X has probability generating function
G_X(t) = t(1 − t^r) / (r(1 − t))
for t ∈ [0, 1), where r ≥ 1 is a fixed integer. Determine the mean of X.
Exercise 4. Toss a biased coin repeatedly, and let X denote the number of tosses until
(and including) the second head. Show that
P(X = k) = (k − 1)(1− p)k−2p2
for k ≥ 2. Show that the probability generating function of X is
G_X(t) = ( pt / (1 − (1 − p)t) )².
Deduce that E(X) = 2/p and Var(X) = 2(1− p)/p2. Explain how X can be represented as the
sum of two independent and identically distributed random variables. Use this representation
to derive again the mean and variance of X.
Exercise 5. Solve the above exercise with X denoting the number of tosses until (and
including) the ath head, for a ≥ 2. Show that in this case
G_X(t) = ( pt / (1 − (1 − p)t) )^a,  E(X) = a/p,  Var(X) = a(1 − p)/p².
Exercise 6. Let X,Y ∼ Poisson(λ) be independent random variables. Find the dis-
tribution of X + Y . Show that the conditional distribution of X given X + Y = n is
Binomial (n, 1/2).
Exercise 7. Let X ∼ Poisson(λ) and Y ∼ Poisson(µ) be independent random variables.
Find the distribution of X + Y. Show that the conditional distribution of X given X + Y = n is Binomial(n, λ/(λ + µ)).
Exercise 8. The number of misprints on a given page of a book has Poisson distribution
of parameter λ, and the number of misprints on different pages are independent. What is the
probability that the second misprint will occur on page k?
CHAPTER 2
Continuous probability
1. Some natural continuous probability distributions
Up to this point we have only considered probability distributions on discrete sets, such as
finite sets, N or Z. We now turn our attention to continuous distributions.
Definition 1.1 (Probability density function). A function f : R→ R is a probability density
function if
f(x) ≥ 0 for all x ∈ R, and
∫_{−∞}^{+∞} f(x) dx = 1.
We will often write p.d.f. in place of probability density function for brevity. Given a p.d.f., we
can define a probability measure µ on R by setting
µ([a, b]) = ∫_a^b f(x) dx
for all a, b ∈ R. Note that
µ((−∞, a]) = ∫_{−∞}^a f(x) dx,  µ(R) = ∫_{−∞}^{+∞} f(x) dx = 1,
and µ([a, b)) = µ((a, b]) = µ((a, b)) = µ([a, b]). We interpret f(x) as the density of probability
at x (the continuous analogue of the weight of x), and µ([a, b]) as the probability of the interval
[a, b]. For this course we restrict to piecewise continuous probability density functions.
Example 1.2. Take f(x) = e^{−|x|}/2. Then f(x) ≥ 0 for all x ∈ R, and
∫_{−∞}^{+∞} f(x) dx = 2 ∫_0^∞ (e^{−x}/2) dx = 1.
Thus f is a valid probability density function. If µ denotes the associated probability measure,
we have
µ([0, ∞)) = (1/2) ∫_0^∞ e^{−x} dx = 1/2,  µ([−1, 1]) = (1/2) ∫_{−1}^1 e^{−|x|} dx = ∫_0^1 e^{−x} dx = 1 − e^{−1}.
This tells us that if we choose a real number at random according to µ, the probability that this
number is non-negative is 1/2 (which was also clear as f is symmetric), while the probability
that it has modulus at most 1 is 1− e−1.
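Integrals like those in Example 1.2 are easy to sanity-check numerically. The sketch below approximates the three integrals with a simple midpoint rule; the truncation at ±40 is harmless since the tails of e^{−|x|} are exponentially small.

```python
import math

def f(x):
    """The density of Example 1.2: f(x) = e^{-|x|} / 2."""
    return math.exp(-abs(x)) / 2

def midpoint(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

total = midpoint(f, -40, 40)       # ≈ 1
mu_nonneg = midpoint(f, 0, 40)     # ≈ 1/2
mu_11 = midpoint(f, -1, 1)         # ≈ 1 - e^{-1}
```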
Remark 1.3. To be precise, we would have had to introduce a σ-algebra on R, called the Borel
σ-algebra, and require that f is Borel-measurable. This is a very weak assumption, and it holds
for all piecewise continuous f, to which we restrict our attention. These subtleties go beyond the
scope of this course, and we will ignore them.
We now list some natural continuous probability distributions. Throughout, we use the following
notation: for A ⊆ R
1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A,
in analogy with the notation introduced for indicator random variables.
1.1. Uniform distribution. For a, b ∈ R with a < b, the density function
f(x) = (1/(b − a)) 1_{[a,b]}(x)
defines the uniform distribution on [a, b]. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_a^b f(x) dx = 1.
The uniform distribution gives to each interval contained in [a, b] a probability proportional to
its length.
1.2. Exponential distribution. For λ ∈ (0,∞), the density function
f(x) = λe^{−λx} 1_{[0,+∞)}(x)
defines the exponential distribution of parameter λ. Note that
∫_{−∞}^{+∞} f(x) dx = ∫_0^∞ λe^{−λx} dx = 1.
1.3. Gamma distribution. The Gamma distribution generalises the exponential distri-
bution. For α, λ ∈ (0,+∞), the density function
f(x) = (λ^α x^{α−1} e^{−λx} / Γ(α)) 1_{[0,+∞)}(x)
defines the Gamma distribution of parameters α, λ. Here
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
is a finite constant, which ensures that f integrates to 1, since with the substitution y = λx
∫_0^∞ λ^α x^{α−1} e^{−λx} dx = ∫_0^∞ y^{α−1} e^{−y} dy = Γ(α).
Note that by setting α = 1 we recover the exponential distribution, of which the Gamma
distribution is therefore a generalisation.
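The normalisation of the Gamma density can be verified numerically, using the standard-library function math.gamma for Γ(α); the parameter values below are arbitrary.

```python
import math

def gamma_pdf(x, alpha, lam):
    """Density of the Gamma(alpha, lam) distribution for x > 0."""
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

alpha, lam = 2.5, 1.7    # arbitrary test parameters
n, b = 400_000, 60.0     # midpoint rule on [0, b]; the tail beyond b is negligible
h = b / n
total = sum(gamma_pdf((k + 0.5) * h, alpha, lam) for k in range(n)) * h   # ≈ 1
```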
1.4. Cauchy distribution. The density function
f(x) = 1 / (π(1 + x²))
defines the Cauchy distribution. Note that
∫_{−∞}^{+∞} f(x) dx = (2/π) ∫_0^{+∞} 1/(1 + x²) dx = (2/π) arctan(x) |_0^{+∞} = 1.
1.5. Gaussian distribution. The density function
f(x) = (1/√(2π)) e^{−x²/2}
defines the standard Gaussian distribution, or standard normal distribution, written N(0, 1).
More generally, for µ ∈ R and σ2 ∈ (0,∞), the density function
f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
defines the Gaussian distribution of mean µ and variance σ2, written N(µ, σ2). To check that
f integrates to 1 we can use Fubini’s theorem and change variables into polar coordinates to
obtain
( ∫_{−∞}^{+∞} e^{−x²/2} dx )² = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} e^{−(x²+y²)/2} dx dy = ∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ = 2π.
2. Continuous random variables
A random variable X : Ω→ R has probability density function f if
P(X ∈ [a, b]) = ∫_a^b f(x) dx
for all a < b ∈ R. We again define the distribution function of X by
F_X(x) = P(X ≤ x) = ∫_{−∞}^x f(y) dy.
Definition 2.1. A random variable X is said to be continuous if the associated distribution
function FX is continuous.
Note that by the fundamental theorem of calculus
F ′X(x) = fX(x),
thus the probability density function is the derivative of the distribution function. Moreover,
lim_{x→−∞} F_X(x) = 0,  lim_{x→+∞} F_X(x) = 1
and FX is non-decreasing, since F ′X(x) = fX(x) ≥ 0.
Definition 2.2. Let X be a continuous random variable with probability density function f
and distribution function F . Then, provided
E(X⁺) = ∫_0^{+∞} x f(x) dx < ∞  or  E(X⁻) = ∫_{−∞}^0 (−x) f(x) dx < ∞,
we define
E(X) = ∫_{−∞}^{+∞} x f(x) dx.
Note that if E(X+) = +∞ and E(X−) < ∞ then E(X) = +∞, while if E(X−) = +∞ and
E(X+) <∞ then E(X) = −∞, in analogy with the discrete case.
Example 2.3. If X ∼ Uniform[a, b] then
E(X) = ∫_a^b x/(b − a) dx = (1/(b − a)) · (b² − a²)/2 = (a + b)/2.
Example 2.4. If X ∼ Exponential(λ) then we integrate by parts to get
E(X) = ∫_0^∞ x λe^{−λx} dx = −x e^{−λx} |_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ.
Example 2.5. If X ∼ Cauchy then
E(X⁺) = ∫_0^∞ x/(π(1 + x²)) dx = +∞ = E(X⁻),
so the expectation is not defined.
Example 2.6. If X ∼ N(0, 1) then
E(X) = ∫_{−∞}^{+∞} (x/√(2π)) e^{−x²/2} dx = 0
by symmetry.
The expectation satisfies the same properties as in the case of discrete random variables. In
particular it is linear, and for a function g : R→ R we have
E(g(X)) = ∫_{−∞}^{+∞} g(x) f(x) dx.
Moreover, if X is non-negative, we have the alternative formula
E(X) = ∫_0^{+∞} P(X ≥ x) dx = ∫_0^{+∞} (1 − F(x)) dx.
Example 2.7. If X ∼ Exponential(λ) then P(X ≥ x) = e−λx, so we can use the above formula
to compute
E(X) = ∫_0^∞ P(X ≥ x) dx = −(e^{−λx}/λ) |_0^∞ = 1/λ.
Definition 2.8. A random variable X is said to be integrable if
E(|X|) = ∫_{−∞}^{+∞} |x| f(x) dx < ∞.
For an integrable random variable X, we define the variance of X by
Var(X) = E[(X − E(X))²] = ∫ (x − E(X))² f(x) dx.
In analogy with the discrete case,
Var(X) = E(X²) − [E(X)]² = ∫ x² f(x) dx − ( ∫ x f(x) dx )².
Example 2.9. If X ∼ Uniform[a, b] then
E(X²) = ∫_a^b x²/(b − a) dx = (1/(b − a)) · (b³ − a³)/3 = (a² + ab + b²)/3,
from which
Var(X) = E(X²) − [E(X)]² = (a² + ab + b²)/3 − (a + b)²/4 = (b − a)²/12.
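A quick Monte Carlo check of the uniform variance formula (the values a = 2, b = 5 are an arbitrary choice):

```python
import random

# Monte Carlo check of Var(X) = (b - a)^2 / 12 for X ~ Uniform[a, b].
random.seed(1)
a, b = 2.0, 5.0
n = 400_000
xs = [random.uniform(a, b) for _ in range(n)]
m1 = sum(xs) / n                 # ≈ (a + b)/2 = 3.5
m2 = sum(x * x for x in xs) / n
var = m2 - m1 * m1               # ≈ (b - a)^2 / 12 = 0.75
```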
Example 2.10. If X ∼ Exponential(λ) then we integrate by parts to get
E(X²) = ∫_0^∞ x² λe^{−λx} dx = −x² e^{−λx} |_0^∞ + 2 ∫_0^∞ x e^{−λx} dx = 2/λ²,
from which
Var(X) = E(X²) − [E(X)]² = 2/λ² − 1/λ² = 1/λ².
Example 2.11. If X ∼ N(0, 1) then we integrate by parts to get
Var(X) = E(X²) = ∫_{−∞}^{+∞} (x²/√(2π)) e^{−x²/2} dx = −(x/√(2π)) e^{−x²/2} |_{−∞}^{+∞} + ∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx = 1.
3. Exercises
Exercise 1. Let X be a Uniform random variable in [−1, 1]. Compute E(X3) and E(X4).
Can you compute E(X2n+1) for n ≥ 0?
Exercise 2. Recall that the Gamma function is defined as
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
Using integration by parts, show that for all positive integers k
Γ(k + 1) = kΓ(k).
By comparison with the exponential distribution, explain why Γ(1) = 1. Use this to conclude
that
Γ(k) = (k − 1)!
for all positive integers k.
4. Transformations of one-dimensional random variables
Let X be a random variable taking values in S ⊆ R (that is, with probability density function
fX supported on S). We consider a function g : S → S′ and define the random variable
Y = g(X). What is the probability density function fY of Y ?
To answer this question, let us assume that g is strictly increasing, so that g′(x) > 0 for all x.
Then we can look at the distribution function
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).
Differentiating both sides, we find
f_Y(y) = (d/dy) F_X(g^{−1}(y)) = f_X(g^{−1}(y)) · (d/dy) g^{−1}(y).
A similar argument applies for the case of g decreasing, to give the following.
Theorem 4.1. Let X be a random variable taking values in S ⊆ R with p.d.f. fX , and let
g : S → S′ be a strictly increasing or strictly decreasing function. Then the probability density
function of Y = g(X) is given by
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|
for y ∈ S′.
Example 4.2. Let X ∼ Uniform[0, 1]. We may assume that X only takes values in (0, 1] since
P(X = 0) = 0. Define Y = − logX, so that Y takes values in [0,∞). Then we have S = (0, 1],
S′ = [0,∞) and g(x) = − log x strictly decreasing. Using that g−1(y) = e−y, we find
f_Y(y) = f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)| = 1 · |−e^{−y}| = e^{−y},
which shows that Y ∼ Exponential(1). Note that we could have arrived at the same conclusion
by directly looking at the distribution functions. Indeed, we have
FY (y) = P(Y ≤ y) = P(− logX ≤ y) = P(X ≥ e−y) = 1− e−y.
By differentiating both sides with respect to y, we find
fY (y) = e−y,
so Y ∼ Exponential(1).
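Example 4.2 also lends itself to a simulation check: apply −log to U[0, 1] samples and compare the empirical distribution function with 1 − e^{−y} at a test point.

```python
import random, math

random.seed(2)
n = 100_000
# If U ~ U[0,1] then Y = -log(U) ~ Exponential(1), as shown in Example 4.2.
ys = [-math.log(random.random()) for _ in range(n)]
F_hat = sum(1 for y in ys if y <= 1.0) / n   # empirical d.f. at y = 1
# Should be close to F_Y(1) = 1 - e^{-1} ≈ 0.632.
```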
Example 4.3. Let X ∼ N(µ, σ2), and define Y = (X − µ)/σ. Then S = S′ = R and
g(x) = (x− µ)/σ. Using that g−1(y) = σy + µ, we find
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = σ · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} = (1/√(2π)) e^{−y²/2},
which shows that Y ∼ N(0, 1). Note that we could have arrived at the same conclusion using that
F_Y(y) = P(Y ≤ y) = P(X ≤ µ + σy) = ∫_{−∞}^{µ+σy} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
for all y ∈ R, from which, differentiating both sides, we get
f_Y(y) = σ · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} |_{x=σy+µ} = (1/√(2π)) e^{−y²/2}.
The same reasoning shows that if X ∼ N(0, 1) and Y = σX + µ then Y ∼ N(µ, σ2).
Note that the above example gives us a way to deduce the mean and variance of a N(µ, σ²)
from those of a N(0, 1). Indeed, if X ∼ N(0, 1) then Y = σX + µ ∼ N(µ, σ²), from which,
using the linearity of expectation,
E(Y ) = σE(X) + µ = µ,
and
Var(Y ) = σ2Var(X) = σ2.
5. Exercises
Exercise 1. Let X be a random variable with p.d.f. f(x) = ce−3x1[0,∞)(x). Find c. Find
P(X ∈ [2, 3]), P(|X| ≤ 1), P(X > 4).
Exercise 2. Let X be a random variable with p.d.f. f(x) = cx 1_{[0,2]}(x). Find c. Find the
p.d.f. of Y = 2X.
Exercise 3. Let X ∼ Exponential(1). Find the p.d.f. of Y = X2.
Exercise 4. Let X ∼ Cauchy. Find the p.d.f. of Y = arctanX.
Exercise 5. Conversely, show that if X ∼ U [−π/2, π/2] and Y = tanX, then Y ∼ Cauchy.
Exercise 6. Let X ∼ U [−1, 1]. Find the p.d.f. of Y = 2X + 5 and Z = X3 + 1.
Exercise 7. Let X be an exponential random variable of parameter λ. Show that for any
0 < s < t it holds
P(X > t|X > s) = P(X > t− s).
This is the memoryless property of the exponential distribution. Indeed, thinking of X as a
random time (say the time of arrival of a bus), it tells us that given that X > s, the random
time X − s still has exponential distribution of parameter λ. In fact, this property characterises
the exponential distribution, but we won’t prove this.
Exercise 8. Let X ∼ U [0, 1], and let F be a strictly increasing distribution function. Show
that the random variable Y = F−1(X) has distribution function F . This is important for
simulating random variables on a computer.
6. Moment generating functions
Definition 6.1. The moment generating function of a random variable X with p.d.f. fX is
defined as
M_X(θ) = E(e^{θX}) = ∫_{−∞}^{+∞} e^{θx} f_X(x) dx,
for those values of θ for which the above integral is finite.
Note that MX(0) = 1 and if θ < 0 and X ≥ 0 then MX(θ) ≤ 1, so the moment generating
function of a non-negative random variable is always finite on (−∞, 0]. The range of values
of θ > 0 for which the moment generating function is defined will depend on the tails of the
probability density function fX .
Example 6.2. Let X be an exponential random variable of parameter λ. Then
M_X(θ) = E(e^{θX}) = ∫_0^∞ λe^{−(λ−θ)x} dx = λ/(λ − θ),
provided θ < λ, which is needed for the integral to be finite.
If X,Y are independent random variables, then
MX+Y (θ) = E(eθ(X+Y )) = E(eθX)E(eθY ) = MX(θ)MY (θ).
This generalises to arbitrarily many random variables, to give that if X1, X2 . . . Xn are inde-
pendent random variables, then
MX1+···+Xn(θ) = MX1(θ)MX2(θ) · · ·MXn(θ).
Moreover, if the Xk’s also have the same distribution, then
MX1+···+Xn(θ) = [MX1(θ)]n.
The moment generating function (that we abbreviate m.g.f.) plays the same role for continuous
random variables as the probability generating function for discrete random variables. In
particular, the following theorem holds.
Theorem 6.3. The moment generating function of a random variable X determines the distri-
bution of X uniquely, provided M_X(θ) < ∞ for some θ > 0.
Example 6.4. In this example we compute the moment generating function of a Gaussian
random variable, and show that the sum of independent Gaussian random variables is still
Gaussian. Let X ∼ N(µ, σ²). Then
M_X(θ) = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{θx − (x−µ)²/(2σ²)} dx.
The above integral is finite for all θ ∈ R. We complete the square in the exponent as follows:
θx − (x − µ)²/(2σ²) = −((x − µ)² − 2θσ²x)/(2σ²) = −(x² − 2(µ + θσ²)x + µ²)/(2σ²)
= −(x² − 2(µ + θσ²)x + (µ + θσ²)² + µ² − (µ + θσ²)²)/(2σ²)
= −(x − (µ + θσ²))²/(2σ²) + θµ + θ²σ²/2.
Plugging this into the integral, we find
M_X(θ) = e^{θµ + θ²σ²/2} · (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{−(x − (µ+θσ²))²/(2σ²)} dx = e^{θµ + θ²σ²/2},
where in the last equality we have used that the p.d.f. of a N(µ + θσ2, σ2) random variable
integrates to 1. Thus we showed that
X ∼ N(µ, σ²)  ⇒  M_X(θ) = e^{θµ + θ²σ²/2}.
In particular,
X ∼ N(0, 1)  ⇒  M_X(θ) = e^{θ²/2}.
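The closed form e^{θ²/2} can be compared against a direct numerical evaluation of E(e^{θX}) for X ∼ N(0, 1); the test value θ = 0.7 is arbitrary.

```python
import math

def mgf_std_normal(theta, n=400_000, b=12.0):
    """Midpoint-rule approximation of E(e^{theta*X}) for X ~ N(0, 1)."""
    h = 2 * b / n
    total = 0.0
    for k in range(n):
        x = -b + (k + 0.5) * h
        total += math.exp(theta * x - x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

theta = 0.7                        # arbitrary test value
approx = mgf_std_normal(theta)
exact = math.exp(theta ** 2 / 2)   # the closed form derived above
```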
We now use this explicit expression for the m.g.f. to compute the distribution of the sum of
two independent Gaussian random variables. Let X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) be independent. Then
M_{X+Y}(θ) = E(e^{θ(X+Y)}) = E(e^{θX}) E(e^{θY}) = e^{θ(µ_X + µ_Y) + (θ²/2)(σ_X² + σ_Y²)},
which shows that
X + Y ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).
Finally, exactly as in the discrete case, we can use the m.g.f. MX of a random variable X to
recover the moments of X. Indeed, from
e^{θX} = 1 + θX + θ²X²/2 + θ³X³/3! + ⋯
taking the expectation both sides we find
M_X(θ) = 1 + θ E(X) + (θ²/2) E(X²) + (θ³/3!) E(X³) + ⋯
Provided MX is finite for some θ 6= 0, we can differentiate to get
M_X(0) = 1,  M′_X(0) = E(X),  M″_X(0) = E(X²),
and in general M_X^{(k)}(0) = E(X^k), whenever the expectation is finite.
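These identities can be illustrated with finite differences: differentiating M_X(θ) = λ/(λ − θ) numerically at θ = 0 recovers the moments of the exponential distribution (λ = 3 is an arbitrary choice).

```python
lam = 3.0

def M(t):
    """m.g.f. of Exponential(lam), valid for t < lam."""
    return lam / (lam - t)

h = 1e-4   # step for central finite differences
m1 = (M(h) - M(-h)) / (2 * h)             # ≈ M'(0)  = E(X)   = 1/3
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # ≈ M''(0) = E(X^2) = 2/9
var = m2 - m1 ** 2                        # ≈ Var(X) = 1/9
```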
Example 6.5. We have seen that if X is exponential of parameter λ then M_X(θ) = λ/(λ − θ) for θ < λ. We differentiate to find
M′_X(θ) = λ/(λ − θ)²,  M′_X(0) = 1/λ = E(X),
and
M″_X(θ) = 2λ/(λ − θ)³,  M″_X(0) = 2/λ² = E(X²),
from which we recover Var(X) = 1/λ2. Note that all the above computations are for θ < λ.
7. Exercises
Exercise 1. Show that the m.g.f. of a Gamma random variable of parameters n, λ, where
n is a positive integer, is given by
M_X(θ) = ( λ/(λ − θ) )^n.
Deduce that the m.g.f. of an exponential random variable of parameter λ is given by λ/(λ − θ). Differentiate the m.g.f. to compute the mean and variance of the Gamma(n, λ) distribution. Show that if X, Y are independent exponential random variables of parameter λ, then X + Y ∼ Gamma(2, λ). In general, show that if X_1, X_2, ..., X_n are i.i.d. exponential random variables of parameter λ, then
X_1 + X_2 + ⋯ + X_n ∼ Gamma(n, λ).
Exercise 2. Compute the m.g.f. of a uniform random variable on [a, b]. Differentiate to
deduce mean and variance.
Exercise 3. Compute the m.g.f. of a standard Gaussian random variable by completing
the square.
Exercise 4. Show that the m.g.f. of a Cauchy random variable is infinite for all values of
θ 6= 0, while it is equal to 1 at θ = 0.
8. Multivariate distributions
8.1. Joint distribution. We start by defining the joint probability density function of
two real-valued random variables.
Definition 8.1. Two random variables X,Y are said to have joint probability density function
fX,Y if
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy
for all subsets A ⊆ R2. Their joint distribution function is given by
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) du dv.
Properties of the joint probability density function fX,Y
(i) fX,Y (x, y) ≥ 0,
(ii) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx dy = 1.
Properties of the joint distribution function FX,Y
(i) For each fixed x, F_{X,Y}(x, y) is non-decreasing in y, and for each fixed y, F_{X,Y}(x, y) is
non-decreasing in x.
(ii) We have
lim_{x→+∞} lim_{y→+∞} F_{X,Y}(x, y) = 1,  lim_{x→−∞} F_{X,Y}(x, y) = lim_{y→−∞} F_{X,Y}(x, y) = 0,
and
F_{X,Y}(x, +∞) = P(X ≤ x) = F_X(x),  F_{X,Y}(+∞, y) = P(Y ≤ y) = F_Y(y).
Note that fX,Y : R2 → R and FX,Y : R2 → [0, 1], and
P((X, Y) ∈ (a, b) × (c, d)) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
for all (a, b)× (c, d) ⊆ R2. Moreover,
f_{X,Y}(x, y) = (∂²/∂x∂y) F_{X,Y}(x, y),
so we can obtain the joint p.d.f. by differentiating the joint distribution function.
Example 8.2. Let (X, Y) be uniformly distributed on [0, 1]², which is equivalent to saying that
the joint p.d.f. of X, Y is given by
f_{X,Y}(x, y) = 1_{[0,1]²}(x, y).
To compute, say, P(X < Y), we integrate the joint p.d.f. over the set {x < y}, to get
P(X < Y) = ∫∫_{x<y} f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^y dx dy = ∫_0^1 y dy = 1/2.
If, on the other hand, we want to compute P(X ≤ 1/2, X + Y ≤ 1) we integrate the joint p.d.f.
over this set to get
P(X ≤ 1/2, X + Y ≤ 1) = ∫∫_{x≤1/2, x+y≤1} f_{X,Y}(x, y) dx dy = ∫_{x=0}^{1/2} ∫_{y=0}^{1−x} dy dx = 3/8.
Example 8.3. Let (X,Y ) have joint p.d.f.
fX,Y (x, y) = 6xy21[0,1]2(x, y).
We compute P((X, Y) ∈ A) where A = {(x, y) : x + y ≥ 1}. We have
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_{1−y}^1 6xy² dx dy = ∫_0^1 (6y³ − 3y⁴) dy = 9/10.
Note that from the joint p.d.f. we can recover the marginals by integrating over one of the
variables:
f_X(x) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dy,  f_Y(y) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dx.
Example 8.4. Continuing with the above example, we find
f_X(x) = ∫_0^1 6xy² dy = 2x for x ∈ [0, 1],
and
f_Y(y) = ∫_0^1 6xy² dx = 3y² for y ∈ [0, 1].
Note that fX,Y (x, y) = fX(x)fY (y). In this case we say that X and Y are independent.
Definition 8.5. Two random variables X,Y are said to be independent if
P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)
for all (x, y) ∈ R2.
The following theorem provides a very useful characterization of independence.
Theorem 8.6. Two random variables X,Y are independent if and only if their joint p.d.f. is
given by the product of the marginals, that is
fX,Y (x, y) = fX(x)fY (y).
We won’t prove this theorem.
Example 8.7. Let X,Y have joint p.d.f.
fX,Y (x, y) = λ2e−λ(x+y)1[0,∞)2(x, y)
for λ > 0. Then
fX(x) = λe−λx1[0,∞)(x), fY (y) = λe−λy1[0,∞)(y),
so fX,Y = fXfY and X,Y are independent exponential random variables of parameter λ.
Example 8.8. Let X,Y have joint p.d.f.
fX,Y (x, y) = λe−λx1[0,∞)×[0,1](x, y)
for λ > 0. Then
fX(x) = λe−λx1[0,∞)(x), fY (y) = 1[0,1](y),
so f_{X,Y} = f_X f_Y and X, Y are independent, with X ∼ Exponential(λ)
and Y ∼ U[0, 1].
All the above definitions generalise in the obvious way.
Definition 8.9. The random variables X1, X2 . . . Xn are said to have joint probability density
function fX1...Xn if
P((X_1, X_2, ..., X_n) ∈ A) = ∫∫⋯∫_A f_{X_1...X_n}(x_1, ..., x_n) dx_1 dx_2 ⋯ dx_n
for all subsets A ⊆ R^n. Their joint distribution function is given by
F_{X_1...X_n}(x_1, ..., x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = ∫_{−∞}^{x_1} ∫_{−∞}^{x_2} ⋯ ∫_{−∞}^{x_n} f_{X_1...X_n}(u_1, ..., u_n) du_1 ⋯ du_n.
The random variables X_1, X_2, ..., X_n are said to be independent if
P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n) = P(X_1 ≤ x_1) P(X_2 ≤ x_2) ⋯ P(X_n ≤ x_n)
for all (x1 . . . xn) ∈ Rn.
Theorem 8.10. The random variables X1, X2 . . . Xn are independent if and only if
fX1,X2...Xn(x1, x2, . . . , xn) = fX1(x1)fX2(x2) · · · fXn(xn).
Exactly as in the discrete case, if X,Y are independent random variables, and g1, g2 are
real functions, then g1(X), g2(Y ) are again independent random variables. In general, if
X1, X2 . . . Xn are independent random variables, and g1, g2 . . . gn are real functions, then
g1(X1), g2(X2) . . . gn(Xn) are independent random variables, and
E(g1(X1)g2(X2) · · · gn(Xn)) = E(g1(X1))E(g2(X2)) · · ·E(gn(Xn)).
8.2. Covariance. Given two random variables X,Y and a function g : R2 → R, we have
the formula
E(g(X, Y)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) f_{X,Y}(x, y) dx dy.
In particular, taking g(x, y) = xy we find
E(XY) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy f_{X,Y}(x, y) dx dy.
Definition 8.11 (Covariance). Given two random variables X,Y with joint p.d.f. fX,Y we
define their covariance as
Cov(X,Y ) = E(XY )− E(X)E(Y ).
Note that if X,Y are independent then fX,Y (x, y) = fX(x)fY (y) and so
E(XY) = ∫∫ xy f_{X,Y}(x, y) dx dy = ∫∫ xy f_X(x) f_Y(y) dx dy = ( ∫ x f_X(x) dx )( ∫ y f_Y(y) dy ) = E(X) E(Y),
which shows that Cov(X,Y ) = 0. The covariance satisfies all the properties that we have listed
for the discrete case.
Example 8.12. Let X,Y be two random variables with joint p.d.f. fX,Y (x, y) = cxy for
0 ≤ x ≤ y2 ≤ 1 and y ∈ [0, 1]. To ensure that fX,Y integrates to 1 we take c = 12. We find
E(XY) = ∫_0^1 ∫_0^{y²} 12x²y² dx dy = 4 ∫_0^1 y⁸ dy = 4/9,
and marginals
f_X(x) = ∫_{√x}^1 12xy dy = 6x(1 − x),  f_Y(y) = ∫_0^{y²} 12xy dx = 6y⁵,
from which E(X) = 1/2 and E(Y) = 6/7. It follows that Cov(X, Y) = E(XY) − E(X)E(Y) = 4/9 − (1/2)(6/7) = 1/63 ≠ 0, and therefore X, Y are not independent.
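Example 8.12 can be spot-checked by Monte Carlo: sample Y from its marginal 6y⁵ and then X from the conditional density 2x/y⁴ on [0, y²], both by inverse transform (these sampling formulas are our own derivation, not part of the notes).

```python
import random

random.seed(4)
n = 400_000
sum_x = sum_y = sum_xy = 0.0
for _ in range(n):
    y = random.random() ** (1 / 6)       # inverse transform for f_Y(y) = 6y^5
    x = y * y * random.random() ** 0.5   # inverse transform for f(x|y) = 2x/y^4 on [0, y^2]
    sum_x += x
    sum_y += y
    sum_xy += x * y
ex, ey, exy = sum_x / n, sum_y / n, sum_xy / n
cov = exy - ex * ey                      # ≈ 4/9 - (1/2)(6/7) = 1/63
```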
9. Exercises
Exercise 1. Let X, Y be independent exponential random variables of parameter λ > 0.
Find the distribution of Z = min{X, Y}. Deduce E(Z) and Var(Z).
Exercise 2. Let (X, Y) be uniformly distributed on the unit disc, that is
f_{X,Y}(x, y) = c · 1_{{x²+y²≤1}}(x, y).
Determine c. Compute the marginal probability density functions fX and fY . Are X and Y
independent?
Exercise 3. Let X,Y have joint p.d.f. fX,Y (x, y) = c(x+ 2y)1[0,1]2(x, y). Find c. Compute
P(X < Y ) and P(X + Y < 1/2). Find the marginals fX and fY . Compute E(X), E(Y ) and
Cov(X,Y ). Compute E(XY 2).
Exercise 4. Let X,Y have joint distribution function
FX,Y (x, y) = 1− e−x − e−y + e−(x+y)
for (x, y) ∈ [0,∞)2. Find P(X < 4, Y < 2). Find P(X < 3). Differentiate to compute the joint
p.d.f. fX,Y of X,Y . Check that fX,Y integrates to 1.
Exercise 5. Let X,Y be independent uniform random variables in [0, 1]. Write the joint
p.d.f. fX,Y . Compute E(XY ), E(sin(X + Y )). Write down the joint distribution function FX,Y .
Exercise 6. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = cx²e^y 1_{{0<x<y<1}}(x, y). Find c. Find
the marginals fX and fY . Are X,Y independent? Compute E(XY ), E(X2e−Y ), E(X(1 + 4Y )),
E(eX+Y ), E(eX−Y ).
Exercise 7. The radius of a circle is exponentially distributed with parameter λ. Determine
the probability density function of the perimeter of the circle and of the area of the circle.
Exercise 8. Let Y ∼ N(µ, σ2). Determine the p.d.f. of X = eY .
Exercise 9. Let X ∼ U [0, π]. Determine the p.d.f. of Y = cosX.
Exercise 10. Let X, Y have joint p.d.f. f_{X,Y}(x, y) = (1/y) 1_{{0<x<y<1}}(x, y). Check that f_{X,Y}
integrates to 1. Find the marginal p.d.f.’s fX and fY . Are X,Y independent? Compute E(XY ),
E(XY k) for k positive integer.
Exercise 11. Find the joint p.d.f. of X,Y having joint distribution function
FX,Y (x, y) = (1− e−x2)(1− e−y2)1[0,∞)2(x, y).
10. Transformations of two-dimensional random variables
Let X,Y be two random variables with joint probability density function fX,Y . We introduce
the new random variables U = u(X,Y ), V = v(X,Y ) for some functions u, v : R2 → R such
that the map (X, Y) → (U, V) is invertible. Then we can write
U = u(X, Y), V = v(X, Y)  ⟺  X = x(U, V), Y = y(U, V).
We would like to obtain a formula for the joint p.d.f. of U, V . To this end, introduce the
Jacobian J of this transformation, defined as
J = det( ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ) = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).
Then the joint p.d.f. of (U, V ) is given by
fU,V (u, v) = |J |fX,Y (x(u, v), y(u, v)).
Example 10.1. Let X, Y be independent exponential random variables of parameter λ > 0, so
that their joint p.d.f. is f_{X,Y}(x, y) = λ²e^{−λ(x+y)} for (x, y) ∈ [0, +∞)². We want to compute the
joint p.d.f. of X + Y, Y. Let U = X + Y, V = Y. Then
u(x, y) = x + y, v(x, y) = y  ⟺  x(u, v) = u − v, y(u, v) = v,
and so
J = det( 1  −1 ; 0  1 ) = 1.
It follows that the joint p.d.f. of U, V is given by
fU,V (u, v) = |J |fX,Y (u− v, v) = fX(u− v)fY (v) = λ2e−λu1[0,∞)(u)1[0,u](v).
We can use this to compute the marginal p.d.f. of X + Y , to get
f_{X+Y}(u) = ∫_{−∞}^{+∞} f_{U,V}(u, v) dv = ∫_0^u λ²e^{−λu} dv = λ²u e^{−λu}
for u ≥ 0. Thus X + Y ∼ Gamma(2, λ).
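The conclusion X + Y ∼ Gamma(2, λ) can be confirmed by simulation, comparing the empirical distribution function of sums of two exponentials with F(u) = 1 − e^{−λu}(1 + λu); the values λ = 1.5 and the test point u = 2 are arbitrary choices.

```python
import random, math

random.seed(5)
lam = 1.5
n = 200_000
# Sums of two independent Exponential(lam) samples (inverse transform).
sums = [-(math.log(random.random()) + math.log(random.random())) / lam
        for _ in range(n)]
u = 2.0
F_hat = sum(1 for s in sums if s <= u) / n
# Gamma(2, lam) distribution function at u:
F_exact = 1 - math.exp(-lam * u) * (1 + lam * u)
```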
Example 10.2. Let again X,Y be independent exponentials of parameter λ > 0. Define
U = X + Y,  V = X/(X + Y).
Then
u(x, y) = x + y, v(x, y) = x/(x + y)  ⟺  x(u, v) = uv, y(u, v) = u(1 − v),
from which
J = det( v  u ; 1 − v  −u ) = −u.
It follows that the joint p.d.f. of U, V is given by
fU,V (u, v) = |J |fX(uv)fY (u(1− v)) = λ2ue−λu for 0 < u <∞, 0 < v < 1.
Note that we can decompose fU,V as a product fUfV for fU (u) = λ2ue−λu1[0,∞)(u) and
fV (v) = 1[0,1](v). It follows that U, V are independent random variables, and
U ∼ Gamma(2, λ), V ∼ U [0, 1].
11. Limit theorems
In this section we look at the limiting behaviour of sums of i.i.d. (independent and identically
distributed) random variables.
Theorem 11.1 (Weak law of large numbers). Let (Xk)k≥1 be a sequence of i.i.d. random
variables with finite mean µ, and define
SN = X1 +X2 + · · ·+XN .
Then for all ε > 0,
P( |S_N/N − µ| > ε ) → 0
as N → ∞.
Proof. We prove the theorem assuming that the random variables also have finite variance
σ². Note that E(S_N/N) = µ and Var(S_N/N) = σ²/N. It then follows from Chebyshev's
inequality that
P( |S_N/N − µ| > ε ) = P( |S_N/N − E(S_N/N)| > ε ) ≤ Var(S_N/N)/ε² = σ²/(Nε²) → 0
as N → ∞.
The following theorem tells us that the convergence holds in a stronger sense.
Theorem 11.2 (Strong law of large numbers). Let (Xk)k≥1 be a sequence of i.i.d. random
variables with finite mean µ. Then
P( lim_{N→∞} S_N/N = µ ) = 1.
We won’t prove this result in this course.
Finally, once we know that S_N/N → µ as N → ∞, we may ask how large an error we make
when approximating S_N by Nµ for large N. This is the object of the next result.
Theorem 11.3 (Central Limit Theorem). Let (Xk)k≥1 be a sequence of i.i.d. random variables
with finite mean µ and variance σ2. Then for all x ∈ R
P( (S_N − Nµ)/(σ√N) ≤ x ) → Φ(x)
as N → ∞, where
Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy
is the distribution function of a standard Gaussian random variable.
Thus the central limit theorem tells us that, for large N , we can approximate the sum SN by a
Gaussian random variable with mean Nµ and variance Nσ2.
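A small simulation makes the theorem concrete: standardised sums of N = 50 uniforms should fall in (−∞, 1] with probability close to Φ(1) ≈ 0.8413 (the choices N = 50 and 20 000 trials are arbitrary).

```python
import random, math

random.seed(6)
N, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and s.d. of a single U[0,1] variable
hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(N))
    # Standardise the sum as in the central limit theorem.
    if (s - N * mu) / (sigma * math.sqrt(N)) <= 1.0:
        hits += 1
frac = hits / trials   # ≈ Φ(1) ≈ 0.8413
```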
12. Exercises
Exercise 1. Let X,Y be independent non-negative random variables, with p.d.f. fX and
fY respectively. By performing the change of variables U = X + Y , V = Y , show that
f_{X+Y}(u) = ∫_0^u f_X(u − v) f_Y(v) dv = (f_X ∗ f_Y)(u).
Exercise 2. Use the previous exercise to compute the p.d.f. of the sum of two independent
U [0, 1] random variables, and sketch its graph.
Exercise 3. Let X,Y have joint p.d.f. fX,Y (x, y) = 4xy1[0,1]2(x, y). Are X,Y independent?
Find the joint p.d.f. of U = X/Y and V = XY . Compute the marginals fU and fV . Are U, V
independent?
Exercise 4. (Polar coordinates) Let X,Y be independent N(0, 1) random variables. Write
their joint p.d.f. fX,Y . Define the new random variables R,Θ by
X = R cos Θ, Y = R sin Θ.
Compute the joint p.d.f. fR,Θ of R,Θ, and show that it factorizes into fR(r)fΘ(θ). Deduce that
R,Θ are independent, and determine their distributions.
Exercise 5. Let X, Y be independent N(0, 1) random variables. For a, b, c ∈ R₊, define
X = aU + bV,  Y = cV.
Assuming that ac 6= 0, find the joint p.d.f. of U, V and the marginals fU and fV . Where was
the assumption ac 6= 0 used?
Exercise 6. Let X,Y be independent N(0, 1) random variables. Show that, for any fixed
θ, the random variables
U = X cos θ + Y sin θ, V = −X sin θ + Y cos θ
are independent and find their distributions.
CHAPTER 3
Statistics
1. Parameter estimation
Statistics is concerned with data analysis. Let us start with a simple example. Imagine tossing
a biased coin n times. The coin gives heads with probability p, but p is unknown: we would like
to estimate p based on the observed outcomes. Denote by X_1, X_2, ..., X_n the random variables
recording the n coin tosses. Then they are independent Bernoulli(p) random variables with
mean p. Let x_1, x_2, ..., x_n denote the sequence of observed outcomes in {0, 1}. A reasonable
guess for p is given by the proportion of heads in the trials; that is, we estimate p via
p̂(x_1, ..., x_n) = (1/n) ∑_{k=1}^n x_k.
How good is this guess? How do we build a reasonable guess for the parameter in a more
general setting?
Let X_1, X_2, ..., X_n be independent and identically distributed (i.i.d.) random variables, repre-
senting n identical and independent experiments (or measurements). We assume that the X_k's
have common probability distribution P( · ; θ) depending on a parameter θ ∈ R. Our aim is to
estimate θ, that is to come up with a guess for the value of θ, based on having observed that
X1 = x1, X2 = x2 . . . Xn = xn.
Definition 1.1. An estimator T for the parameter θ is a function T = T (X1, X2 . . . Xn). We
say that the estimator T is unbiased if
E(T (X1 . . . Xn)) = θ,
and T is said to be biased if it is not unbiased.
Thus an unbiased estimator gives, on average, the right guess for the value of θ. We will often
use the vector notation
X = (X1, X2 . . . Xn)
for brevity, using which we say that T is unbiased if E(T (X)) = θ.
Example 1.2. Going back to biased coin tosses, we have proposed to estimate p via
T(X) = (1/n) ∑_{k=1}^n X_k.
To check whether this is an unbiased estimator for the parameter p, we compute the expectation:
E(T (X)) =1
n
n∑k=1
E(Xk) = p,
so T is unbiased for p. Note that we could have instead considered the estimator

T2(X) = X1,

which is also unbiased, since E(T2(X)) = E(X1) = p. This corresponds to guessing that the
probability p of heads equals 1 if the first toss comes up heads, and p = 0 otherwise, ignoring
all the following coin tosses. This doesn't seem to be a good estimator, and we will see that
indeed it isn't.
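The two estimators can also be compared empirically. The following Python sketch (not part of the original notes; the value p = 0.3 and the sample sizes are arbitrary choices) simulates repeated experiments: both T and T2 average out to p, but T2 fluctuates far more.

```python
import random

def T(xs):
    # sample proportion of heads: (1/n) * sum of the x_k
    return sum(xs) / len(xs)

def T2(xs):
    # uses only the first toss, ignoring the rest
    return xs[0]

random.seed(0)
p, n, reps = 0.3, 100, 2000
trials = [[1 if random.random() < p else 0 for _ in range(n)] for _ in range(reps)]

avg_T = sum(T(t) for t in trials) / reps    # close to p: T is unbiased
avg_T2 = sum(T2(t) for t in trials) / reps  # also close to p: T2 is unbiased too
var_T = sum((T(t) - avg_T) ** 2 for t in trials) / reps
var_T2 = sum((T2(t) - avg_T2) ** 2 for t in trials) / reps  # much larger spread
```

Both averages land near p = 0.3, but the spread of T2 is roughly n times that of T, which is why T2 is a poor estimator despite being unbiased.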
How do we build estimators in a general setting?
1.1. Maximum Likelihood Estimation. Let X1, X2 . . . Xn be i.i.d. random variables
with common distribution P( · ; θ) depending on a real parameter θ. Suppose we have observed
that
X1 = x1, X2 = x2 . . . Xn = xn,
and we are asked to come up with a guess for the value of θ based on these data. What would
be our best guess? Intuitively, it is reasonable to pick the value of θ which maximises the
probability of seeing the observed data. This is the idea behind maximum likelihood estimation.
Definition 1.3 (Likelihood). The likelihood of θ is the function
l(θ) = P(X1 = x1, X2 = x2 . . . Xn = xn; θ).
The log-likelihood of θ is defined as
L(θ) = log l(θ).
Note that by independence

l(θ) = ∏_{k=1}^n P(X_k = x_k; θ),

and

L(θ) = ∑_{k=1}^n log P(X_k = x_k; θ).
Definition 1.4. The maximum likelihood estimate for θ, given data x = (x1, x2 . . . xn),
is the value of θ that maximises the likelihood l(θ), that is

θ(x) = arg max l(θ).

The associated Maximum Likelihood Estimator (MLE) is given by T(X) = θ(X).
This gives us an algorithm to construct estimators: simply write down the likelihood, and
maximise it with respect to the parameter θ.
Remark 1.5. Since L(θ) = log l(θ) and log is increasing, the likelihood and log-likelihood are maximised at the same
point. Thus the MLE is also the value of θ that maximises the log-likelihood:

θ(x) = arg max L(θ).
This is often convenient for computations. Indeed, the likelihood comes as a product (due to
independence) of functions of θ, so it’s lengthy to differentiate it. On the other hand, the
log-likelihood is a sum of functions of θ, which is much easier to differentiate.
Example 1.6. Going back to coin tosses, we have X1, X2 . . . Xn i.i.d. Bernoulli(p), so

P(X_k = x_k; p) = p^{x_k} (1 − p)^{1 − x_k}

for all k ≤ n. The likelihood function is given by

l(p) = P(X1 = x1, X2 = x2 . . . Xn = xn; p) = ∏_{k=1}^n p^{x_k} (1 − p)^{1 − x_k} = p^{∑_k x_k} (1 − p)^{n − ∑_k x_k}.

Thus

L(p) = log l(p) = (∑_{k=1}^n x_k) log p + (∑_{k=1}^n (1 − x_k)) log(1 − p).
We differentiate to find

(d/dp) L(p) = (1/p) ∑_{k=1}^n x_k − (1/(1 − p)) ∑_{k=1}^n (1 − x_k) = (1/(p(1 − p))) (∑_{k=1}^n x_k − np),

which shows that L(p) is maximised at

p(x) = (1/n) ∑_{k=1}^n x_k.

Thus the MLE is given by

T(X) = (1/n) ∑_{k=1}^n X_k,

which, as we have already seen, is unbiased.
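One can check numerically that the log-likelihood indeed peaks at the sample proportion. A small Python sketch (the data below are invented for illustration):

```python
import math

def L(p, xs):
    # Bernoulli log-likelihood: (sum x_k) log p + (n - sum x_k) log(1 - p)
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

xs = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]        # 6 heads out of 10 tosses
grid = [k / 1000 for k in range(1, 1000)]  # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: L(p, xs))  # grid maximiser of the log-likelihood
```

The grid maximiser p_hat coincides with the sample mean 0.6, as the calculus above predicts.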
Example 1.7. Let X1, X2 . . . Xn be i.i.d. Poisson(λ). To build the MLE for λ, we write down
the likelihood

l(λ) = P(X1 = x1, X2 = x2 . . . Xn = xn; λ) = ∏_{k=1}^n e^{−λ} λ^{x_k} / x_k! = e^{−λn} λ^{∑_k x_k} ∏_{k=1}^n 1/x_k!.

Thus

L(λ) = log l(λ) = −λn + (log λ) ∑_{k=1}^n x_k − ∑_{k=1}^n log(x_k!).
We differentiate to get

(d/dλ) L(λ) = −n + (1/λ) ∑_{k=1}^n x_k,

which shows that L(λ) is maximised at

λ(x) = (1/n) ∑_{k=1}^n x_k.

Thus the MLE for λ is

T(X) = (1/n) ∑_{k=1}^n X_k.

This is unbiased, since

E(T(X)) = (1/n) ∑_{k=1}^n E(X_k) = λ.
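This can be verified by simulation. Since the Python standard library has no Poisson sampler, the sketch below (not part of the original notes; λ = 2, n = 50 are arbitrary) draws Poisson variates with Knuth's method and checks that the MLE averages out to λ:

```python
import math
import random

rng = random.Random(1)

def poisson(lam):
    # Knuth's method: count uniform draws until their product drops below e^{-lam}
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

lam, n, reps = 2.0, 50, 1000
# the MLE for lambda is the sample mean; average it over many repetitions
mles = [sum(poisson(lam) for _ in range(n)) / n for _ in range(reps)]
avg = sum(mles) / reps   # close to lam, consistent with unbiasedness
```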
For continuous random variables we can use exactly the same approach. If X1, X2 . . . Xn are
i.i.d. continuous random variables with common p.d.f. f( · ; θ) depending on a parameter θ, define
the likelihood as

l(θ) = f_{X1,X2...Xn}(x1, x2 . . . xn; θ) = ∏_{k=1}^n f(x_k; θ),

and the log-likelihood by

L(θ) = log l(θ)

whenever l(θ) ≠ 0, and L(θ) = 0 otherwise. Again, the MLE is the value of θ which maximises
the likelihood, or equivalently the log-likelihood.
Example 1.8. Let X1, X2 . . . Xn be i.i.d. exponentials of parameter λ. We aim to build the
MLE for λ. To this end, write the likelihood function

l(λ) = ∏_{k=1}^n λ e^{−λ x_k} = λ^n e^{−λ ∑_k x_k},

from which

L(λ) = n log λ − λ ∑_{k=1}^n x_k.
We differentiate, to get

(d/dλ) L(λ) = n/λ − ∑_{k=1}^n x_k,

which shows that the likelihood is maximised at

λ(x) = 1 / ((1/n) ∑_{k=1}^n x_k).

Thus our MLE is given by

T(X) = ((1/n) ∑_{k=1}^n X_k)^{−1}.

This is biased, since for n = 1 we find

E(T(X)) = E(1/X) = ∫_0^∞ (1/x) λ e^{−λx} dx = +∞.
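The bias is visible numerically also for n ≥ 2: since ∑_k X_k ∼ Gamma(n, λ), one can compute E(T(X)) = nλ/(n − 1) for n ≥ 2, which exceeds λ. The Python sketch below (not part of the original notes; λ = 2, n = 5 are arbitrary) estimates E(T(X)) by simulation:

```python
import math
import random

rng = random.Random(1)

def exp_sample(lam):
    # inverse-CDF sampling: -log(U)/lam with U uniform in (0, 1]
    return -math.log(1.0 - rng.random()) / lam

lam, n, reps = 2.0, 5, 20000
# MLE T(X) = n / sum(X_k): reciprocal of the sample mean
estimates = [n / sum(exp_sample(lam) for _ in range(n)) for _ in range(reps)]
avg = sum(estimates) / reps   # near n*lam/(n-1) = 2.5, above the true lam = 2
```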
Example 1.9. Let X1, X2 . . . Xn be i.i.d. Uniform random variables in [0, θ]. We want to
estimate θ. Using that each Xk has p.d.f.

f(x) = (1/θ) 1_{[0,θ]}(x),

we have likelihood function

l(θ) = (1/θ^n) 1_{[0,θ]}(x1) 1_{[0,θ]}(x2) · · · 1_{[0,θ]}(xn).

Using that the condition xk ∈ [0, θ] for all k is equivalent to

max_{1≤k≤n} x_k ≤ θ,    min_{1≤k≤n} x_k ≥ 0,

we rewrite l(θ) as

l(θ) = 1/θ^n  for θ ≥ max_{1≤k≤n} x_k and min_{1≤k≤n} x_k ≥ 0,
l(θ) = 0      otherwise.

Moreover, we can assume that min_k x_k ≥ 0, since negative data would contradict our assumption
that the Xk's are uniform in [0, θ]. Differentiating the likelihood gives

(d/dθ) l(θ) = −n θ^{−n−1} < 0  for θ ≥ max_{1≤k≤n} x_k,

so l(θ) is decreasing on this range, and hence it is maximised at the smallest admissible value

θ(x) = max_{1≤k≤n} x_k.

Our MLE is therefore given by

T(X) = max_{1≤k≤n} X_k.
Is this unbiased? To answer this question, we have to compute its expectation and check
whether it equals θ. To this end, we first compute the distribution function FT of T(X):

FT(x) = P(T ≤ x) = P(max_{1≤k≤n} X_k ≤ x) = P(X1 ≤ x, X2 ≤ x . . . Xn ≤ x) = (x/θ)^n

for x ∈ [0, θ], while FT(x) = 0 if x ≤ 0 and FT(x) = 1 if x ≥ θ. It follows that the p.d.f. of T is
given by

fT(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x).

We use this to compute

E(T) = ∫_0^θ x fT(x) dx = ∫_0^θ (n/θ^n) x^n dx = (n/(n+1)) θ.
Thus the MLE T(X) is biased. On the other hand, note that as n → ∞

E(T(X)) = (n/(n+1)) θ −→ θ,

so T(X) is asymptotically unbiased.
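A quick simulation confirms E(T(X)) ≈ nθ/(n + 1) (Python sketch, not part of the original notes; θ = 3 and n = 10 are arbitrary):

```python
import random

rng = random.Random(2)
theta, n, reps = 3.0, 10, 20000
# MLE T(X) = max of the sample; it always underestimates theta
estimates = [max(rng.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
avg = sum(estimates) / reps   # near n/(n+1) * theta = 30/11
```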
2. Exercises
Exercise 1. Let X1, X2 . . . Xn be i.i.d. geometric random variables of parameter p. Write
down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased
for n = 1? [Hint: recall the Taylor series for log(1− x).]
Exercise 2. Let X ∼ Binomial(N, p) where N is known and p is to be estimated. Write
down the likelihood function l(p). By maximising l(p), compute the MLE for p. Is it unbiased?
Exercise 3. Let X ∼ Binomial(N, p) where p is known and N is to be estimated. Write
down the likelihood function l(N). By maximising l(N), compute the MLE for N . Is it always
unique?
Exercise 4. Let X1, X2 . . . Xn be i.i.d. N(µ, 1) random variables. Write down the likelihood
function l(µ). By maximising l(µ), compute the MLE for µ. Is it unbiased?
Exercise 5. Let X1, X2 . . . Xn be i.i.d. N(µ, σ2) random variables, with µ known and σ2
to be estimated. Write down the likelihood function l(σ2). By maximising l(σ2), compute the
MLE for σ2. Is it unbiased?
Exercise 6. Let X1, X2 . . . Xn be i.i.d. U[0, θ] random variables. We have seen that the
MLE for θ is given by

T(X) = max_{1≤k≤n} X_k,

and that this is a biased estimator. Can you use this to build an unbiased estimator for θ?
3. Mean square error
Let X1, X2 . . . Xn be i.i.d. following a given probability distribution depending on a parameter
θ, that we aim to estimate. We have seen how to build the MLE for θ, but of course we could
consider different estimators. Which one is the best? We need a measure for how good a given
estimator is.
Example 3.1. If the Xk's are i.i.d. Bernoulli of parameter p, representing independent coin
tosses, we have seen that the MLE is given by

T1(X) = (1/n) ∑_{k=1}^n X_k,

and it is unbiased. On the other hand, we could introduce the alternative estimators

T2(X) = X1,    T3(X) = (1/3) X1 + (2/3) X2.
It is easy to see that they are all unbiased: which one should we use? We want our estimator
to be as close as possible to the true value p, so it is reasonable to choose the estimator that
minimises the expected square distance from p. We have

E[(T1(X) − p)^2] = p(1 − p)/n,    E[(T2(X) − p)^2] = p(1 − p),    E[(T3(X) − p)^2] = (5/9) p(1 − p).

This shows that the MLE T1(X) is the one that, on average, gives the least square error, so it
is the one that performs best. Note that, in fact,

E[(T1(X) − p)^2] = p(1 − p)/n −→ 0  as n → ∞,

so if we could observe infinitely many coin tosses we would be able to determine the value of p
with certainty. This is false for the estimators T2 and T3.
Definition 3.2 (Mean Square Error). Given an estimator T(X) for a parameter θ, the Mean
Square Error (MSE) of T is defined as

MSE(T) = E[(T(X) − θ)^2].

Note that if T is unbiased, that is E(T(X)) = θ, then MSE(T) = Var(T(X)). If T is biased,
let

bias(T) = E(T(X)) − θ

denote the bias of T. Then the MSE can be expressed as

MSE(T) = Var(T(X)) + (bias(T))^2.
Indeed,

MSE(T) = E[(T(X) − θ)^2] = E[(T(X) − E(T(X)) + E(T(X)) − θ)^2]
       = E[(T(X) − E(T(X)))^2] + (E(T(X)) − θ)^2 + 2 E[T(X) − E(T(X))] · (E(T(X)) − θ)
       = Var(T(X)) + (bias(T))^2,

where the cross term vanishes since E[T(X) − E(T(X))] = 0.
This identity is often useful for computations involving biased estimators.
Example 3.3. Let us go back to Example 1.9, in which we showed that the MLE for i.i.d.
U[0, θ] random variables is given by

T(X) = max_{1≤k≤n} X_k,

which has p.d.f. fT(x) = (n/θ^n) x^{n−1} 1_{[0,θ]}(x) and mean E(T) = (n/(n+1)) θ. Note that T is biased: we
may wish to consider instead the unbiased estimator

T1(X) = 2X1.
Does T1 perform better than T? To answer this question, we compute the Mean Square Errors.
Since

bias(T) = E(T(X)) − θ = −θ/(n + 1),

Var(T) = E(T^2(X)) − E(T)^2 = ∫_0^θ (n/θ^n) x^{n+1} dx − ((n/(n+1)))^2 θ^2 = (n/(n+2)) θ^2 − (n/(n+1))^2 θ^2,

we find

MSE(T) = Var(T(X)) + (bias(T))^2 = 2θ^2 / ((n+1)(n+2)).

On the other hand, since T1 is unbiased we have

MSE(T1) = Var(T1(X)) = 4 Var(X1) = θ^2/3.
Since

2θ^2 / ((n+1)(n+2)) ≤ θ^2/3

for all θ > 0 and n ≥ 1, this shows that T, although biased, on average performs better than
the unbiased estimator T1. Note that again

MSE(T) = 2θ^2 / ((n+1)(n+2)) → 0  as n → ∞,

which means that if we had access to infinitely many observations we would be able to determine
the exact value of θ using the estimator T. This is false for the unbiased estimator T1, whose
MSE does not depend on n.
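Both MSEs, and the bias-variance decomposition itself, can be checked numerically. In the Python sketch below (not part of the original notes; θ = 2 and n = 8 are arbitrary), note that the decomposition holds exactly for the empirical moments as well:

```python
import random

rng = random.Random(4)
theta, n, reps = 2.0, 8, 40000

vals_T, sq_err_T, sq_err_T1 = [], 0.0, 0.0
for _ in range(reps):
    x = [rng.uniform(0, theta) for _ in range(n)]
    t = max(x)                              # biased MLE
    vals_T.append(t)
    sq_err_T += (t - theta) ** 2
    sq_err_T1 += (2 * x[0] - theta) ** 2    # unbiased estimator T1 = 2 X_1

mse_T, mse_T1 = sq_err_T / reps, sq_err_T1 / reps
mean_T = sum(vals_T) / reps
var_T = sum((v - mean_T) ** 2 for v in vals_T) / reps
bias_T = mean_T - theta
# empirical check of MSE = Var + bias^2, and of MSE(T) << MSE(T1)
```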
4. Confidence intervals
Given i.i.d. experiments X1, X2 . . . Xn following some distribution f( · ; θ) depending on a
parameter θ ∈ R, we have seen how to construct the MLE for θ. Moreover, given several
estimators for θ we have proposed to use the MSE to determine which one performs better.
Once an estimator T for θ is chosen, and data x = (x1, x2 . . . xn) have been observed, we
estimate θ by
θ(x) = T (x),
which is a real number, a precise guess for θ. It is often the case that instead of a single guess,
we are willing to accept a guess of the form
a(x) ≤ θ ≤ b(x)
for some a(x), b(x) depending on the data, provided this has high probability. In other words,
we are willing to give up precision in our guess, to gain confidence that our guess is correct.
Example 4.1. Let X1, X2 . . . Xn be i.i.d. N(µ, 1), where µ is to be estimated. We have seen
that the MLE for µ is

T(X) = X̄ = (1/n) ∑_{k=1}^n X_k,

and it is unbiased. Moreover, since a linear combination of independent Gaussian random variables is still
Gaussian, we have X̄ ∼ N(µ, 1/n), and thus

X̄ − µ ∼ N(0, 1/n).

Note that, crucially, the distribution of X̄ − µ does not depend on µ, and

Z = √n (X̄ − µ) ∼ N(0, 1).

Thus we can choose α, β such that

P(α ≤ Z ≤ β) = ∫_α^β (1/√(2π)) e^{−x²/2} dx = 0.95.

The choice of α, β is not unique: it is natural to take the interval symmetric about 0, that is
β = −α, which makes the interval [α, β] as small as possible. Writing α > 0 for the right
endpoint, we get

P(−α ≤ Z ≤ α) = P(−α ≤ √n (X̄ − µ) ≤ α) = 0.95.

Rearranging, we find

P(X̄ − α/√n ≤ µ ≤ X̄ + α/√n) = 0.95.

In all, we have found that if a(X) = X̄ − α/√n and b(X) = X̄ + α/√n, the interval

[a(X), b(X)]

is a 95% confidence interval for µ, that is

P(a(X) ≤ µ ≤ b(X)) = 0.95.
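The 95% coverage can be verified by simulation, using the standard normal quantile α ≈ 1.96 (Python sketch, not part of the original notes; µ = 1.5 and n = 20 are arbitrary):

```python
import random

rng = random.Random(5)
mu, n, reps = 1.5, 20, 20000
alpha = 1.96   # approximately the 97.5% quantile of N(0, 1)

covered = 0
for _ in range(reps):
    x = [rng.gauss(mu, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    half = alpha / n ** 0.5   # half-width of the interval
    if xbar - half <= mu <= xbar + half:
        covered += 1
coverage = covered / reps     # close to 0.95
```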
Definition 4.2 (Confidence interval). Given two statistics a(X) and b(X) with a(X) ≤ b(X),
and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be a 100γ% confidence interval for
the unknown parameter θ if
P(a(X) ≤ θ ≤ b(X)
)= γ.
It is important to note that here the parameter θ is not random, but the endpoints a(X) and
b(X) of the confidence interval are. Thus, if we were to collect data multiple times, and
compute the confidence interval for each data collection, we would expect θ to be in the confidence
interval about 100γ% of the time.
Example 4.3. Let X1, X2 . . . Xn be i.i.d. exponential random variables of parameter λ. Consider
the statistic

T(X) = ∑_{k=1}^n X_k

(which is not the MLE for λ). We have seen that the sum of n independent exponentials of
parameter λ has Gamma(n, λ) distribution, so

T(X) ∼ Gamma(n, λ),    fT(t) = (λ^n t^{n−1} e^{−λt} / (n−1)!) 1_{[0,∞)}(t).

We cannot use T(X) to construct a confidence interval for λ, as its p.d.f. depends on the
unknown parameter λ. We therefore look for some other function of the Xk's whose distribution
does not depend on λ. Take

S(X) = 2λ T(X).

We compute the distribution function of S, to find

FS(s) = P(S ≤ s) = P(2λT ≤ s) = P(T ≤ s/(2λ)) = FT(s/(2λ)),

where FT denotes the distribution function of T(X). Differentiating, we find

fS(s) = (d/ds) FT(s/(2λ)) = fT(s/(2λ)) · (1/(2λ)) = (1/2)^n (s^{n−1}/(n−1)!) e^{−s/2} 1_{[0,∞)}(s).

Note that S ∼ Gamma(n, 1/2). In particular, we see that the p.d.f. of S(X) does not depend
on λ. Given γ ∈ (0, 1), then, we can build a 100γ% confidence interval for λ by choosing
α, β ∈ R such that

P(α ≤ S(X) ≤ β) = γ.

Using that S(X) = 2λ ∑_{k=1}^n X_k, this gives

P(α ≤ 2λ ∑_{k=1}^n X_k ≤ β) = γ,

which, solving the inequalities with respect to λ, gives

P(α / (2 ∑_{k=1}^n X_k) ≤ λ ≤ β / (2 ∑_{k=1}^n X_k)) = γ.

Note that here α, β, γ are known numbers. Thus

[α / (2 ∑_{k=1}^n X_k),  β / (2 ∑_{k=1}^n X_k)]

is a 100γ% confidence interval for λ.
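This construction can be tested end to end by simulation. In the Python sketch below (not part of the original notes; n = 10, λ = 3 are arbitrary), α and β are obtained as empirical 2.5% and 97.5% quantiles of Gamma(n, 1/2), simulated as a sum of n exponentials of rate 1/2, rather than read from a table:

```python
import math
import random

rng = random.Random(6)
n = 10

def exp_sample(rate):
    # inverse-CDF sampling for an exponential of the given rate
    return -math.log(1.0 - rng.random()) / rate

# empirical 2.5% / 97.5% quantiles of S ~ Gamma(n, 1/2)
s_samples = sorted(sum(exp_sample(0.5) for _ in range(n)) for _ in range(100000))
alpha = s_samples[int(0.025 * len(s_samples))]
beta = s_samples[int(0.975 * len(s_samples))]

# coverage check: [alpha/(2 sum X), beta/(2 sum X)] should contain lam ~95% of the time
lam, reps, covered = 3.0, 5000, 0
for _ in range(reps):
    total = sum(exp_sample(lam) for _ in range(n))
    if alpha / (2 * total) <= lam <= beta / (2 * total):
        covered += 1
coverage = covered / reps
```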
4.1. Approximate confidence intervals. We have seen that in order to build a confidence interval for a parameter θ we need to find some function of the data X and the parameter
θ whose distribution does not depend on θ. We have seen in Example 4.1 that this is possible
when X1, X2 . . . Xn are i.i.d. N(µ, 1) random variables and T(X) = X̄. But how do we proceed in
general? The idea is to use the Central Limit Theorem (CLT) to reduce to the Gaussian case.
Example 4.4. Let X1, X2 . . . Xn be i.i.d. Bernoulli random variables of parameter p. Let

S_n = ∑_{k=1}^n X_k,

so that the MLE for p is given by T(X) = S_n/n, and it is unbiased. Write

E(X1) = p,    Var(X1) = p(1 − p).

Then by the CLT the random variable

(S_n − np) / √(np(1 − p))

is approximately N(0, 1) distributed as n → ∞. Thus, if Z ∼ N(0, 1), for very large n we can
approximate

P(−α ≤ (S_n − np)/√(np(1 − p)) ≤ α) ≈ P(−α ≤ Z ≤ α) = ∫_{−α}^{α} (1/√(2π)) e^{−x²/2} dx.

We choose α > 0 so that the above probability equals a given γ ∈ (0, 1), from which we get

P(−α ≤ (S_n − np)/√(np(1 − p)) ≤ α) ≈ γ.

Rearranging, and replacing the unknown variance p(1 − p) under the square root by its estimate
(S_n/n)(1 − S_n/n), this gives

P(S_n/n − (α/n)√(S_n(n − S_n)/n) ≤ p ≤ S_n/n + (α/n)√(S_n(n − S_n)/n)) ≈ γ

for n large enough. Thus

[S_n/n − (α/n)√(S_n(n − S_n)/n),  S_n/n + (α/n)√(S_n(n − S_n)/n)]

is an approximate 100γ% confidence interval for p. This is useful for opinion polls, where we
assume that each person votes, say, democratic with probability p, and given that we observed
n votes we want to estimate p.
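The interval is easy to code (Python sketch, not part of the original notes; the poll numbers below are invented, and α = 1.96 targets γ = 0.95):

```python
def approx_ci(s_n, n, alpha=1.96):
    # approximate 100*gamma% CI for p from S_n successes in n trials:
    # S_n/n +/- (alpha/n) * sqrt(S_n (n - S_n) / n)
    p_hat = s_n / n
    half = (alpha / n) * (s_n * (n - s_n) / n) ** 0.5
    return p_hat - half, p_hat + half

lo, hi = approx_ci(520, 1000)   # 520 favourable answers out of 1000 polled
```

For 520 out of 1000 the interval is roughly 0.52 plus or minus 0.031, so the poll pins p down to about three percentage points either way.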
Definition 4.5 (Approximate confidence interval). Let X ∈ R^n. Given two statistics a(X)
and b(X) with a(X) ≤ b(X), and a number γ ∈ (0, 1), the interval [a(X), b(X)] is said to be an
approximate 100γ% confidence interval for the unknown parameter θ if

P(a(X) ≤ θ ≤ b(X)) → γ

as n → ∞.
5. Exercises
Exercise 1. Suppose we observe the outcomes of 10 tosses of the same biased coin, giving
head with probability p (unknown). These are given by 0100111101, where 1 stands for head
and 0 stands for tail. By using Maximum Likelihood Estimation, make a guess for p.
Exercise 2. Let X1, X2 . . . Xn be i.i.d. exp(λ), with λ to be estimated. Suppose that n = 5
and we observe outcomes x = (0.2, 3, 1, 0.3, 4.1). By using Maximum Likelihood Estimation,
make a guess for λ.
Exercise 3. Let X ∼ Binomial(N, p), N known and p to be estimated. Compute the
MSE of the MLE T(X) = X/N for p. Consider the alternative estimator

T1(X) = (X + 1)/(N + 1).

Is it unbiased? Compute the MSE of T1. Which one would you use between T and T1? Why?
Exercise 4. In the setting of the previous exercise, define the alternative estimator

Tα(X) = α X/N + (1 − α)/2,

for α ∈ [0, 1] given. Is Tα unbiased? Show that

MSE(Tα) = (α²/N) p(1 − p) + (1 − α)² (1/2 − p)².

Suppose α = 1/2. Which estimator would you choose between T(X) = X/N and Tα(X)?
Why?
Exercise 5. Let X1, X2 . . . Xn be i.i.d. N(µ, 1) random variables. Compute the MSE of
the MLE T(X) for µ. Show that

MSE(T) → 0 as n → ∞.

Let T1 = X1 be an alternative estimator. Is it unbiased? Show that MSE(T1) does not tend to 0 as n → ∞.
Which estimator would you use and why?
Exercise 6. Let X1, X2 . . . Xn be i.i.d. U[0, θ]. We have seen that the MLE is T(X) =
max_{1≤k≤n} X_k and it is biased. Introduce the unbiased estimator

T1(X) = ((n + 1)/n) T(X).

Compute the MSE of T1. Which estimator would you choose between T and T1 and why?
Exercise 7. Suppose that X1, X2 . . . Xn are i.i.d. N(µ, 4) random variables, where µ is to
be estimated. Show that

S(X) = (√n/2)(X̄ − µ) ∼ N(0, 1),  where X̄ = (1/n) ∑_{k=1}^n X_k.

Using S(X), construct a 95% confidence interval for µ. Construct a 90% confidence interval for
µ. Which one is wider? Construct a 98% confidence interval for µ.
Exercise 8. Let X1, X2 . . . Xn be i.i.d. Poisson random variables of parameter λ to be
estimated. Assuming that n is large, construct a 94% approximate confidence interval for λ.
Exercise 9. Let X1, X2 . . . Xn be i.i.d. Geometric random variables of parameter p to be
estimated. Assuming that n is large, construct a 90% approximate confidence interval for p.
6. Bayesian statistics
Up to this point we have discussed point estimation. That is, we have assumed that X1, X2 . . . Xn
are i.i.d. random variables following a common probability distribution f( · ; θ), depending on a
specific parameter θ, which is unknown. We could, on the other hand, think of the parameter θ
as being itself random. We make an initial guess for the distribution of θ, and then update the
guess based on the observed data. This is the point of view of Bayesian statistics.
Notation. When we want to treat the unknown parameter as a random variable we denote it
by Θ, while we use θ for its values. We say that Θ ∼ π if

P(Θ = θ) = π(θ)  (discrete setting),
P(Θ ∈ [θ1, θ2]) = ∫_{θ1}^{θ2} π(θ) dθ  (continuous setting).
Definition 6.1 (Discrete setting). The initial guess for the probability distribution of the
parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data x =
(x1, x2 . . . xn) are observed, the updated distribution for θ is called the posterior distribution, and it
is given by

π(θ|x) = P(Θ = θ|X = x) = P(X = x; θ) π(θ) / P(X = x).
Example 6.2. Imagine tossing a biased coin giving head with probability Θ. We represent the
outcome by a Bernoulli(Θ) random variable. Suppose we are told that Θ ∈ {θ1, θ2, θ3} for
some given numbers 0 ≤ θ1 < θ2 < θ3 ≤ 1. Prior to observing the outcome of the coin toss we
have no reason to believe that one of the values is more likely than the others, so our prior
distribution is uniform over the set {θ1, θ2, θ3}, that is

π(θ1) = π(θ2) = π(θ3) = 1/3.

We now observe the outcome x = 1 (head) of the coin toss. How does our belief on θ change
given the observed datum? We compute the posterior distribution π(θ|1):

π(θk|1) = P(X = 1; θk) π(θk) / P(X = 1)

for k = 1, 2, 3. We use the law of total probability to compute

P(X = 1) = P(X = 1; θ1) π(θ1) + P(X = 1; θ2) π(θ2) + P(X = 1; θ3) π(θ3) = (1/3)(θ1 + θ2 + θ3).

Thus, since P(X = 1; θk) = θk, our posterior distribution is given by

π(θk|1) = P(X = 1; θk) π(θk) / P(X = 1) = θk / (θ1 + θ2 + θ3),  k = 1, 2, 3.

Note that the posterior distribution is a probability distribution, as the weights sum up to 1.
In fact, we could have used this to compute P(X = 1).
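The update is a one-liner to implement (Python sketch, not part of the original notes; the values θk = 0.2, 0.5, 0.8 are arbitrary):

```python
def posterior(prior, likelihoods):
    # Bayes: pi(theta_k | x) is proportional to P(x; theta_k) * pi(theta_k),
    # normalised so the weights sum to 1
    unnorm = [l * p for l, p in zip(likelihoods, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
# observe one head: the likelihood of x = 1 under theta is theta itself
post = posterior(prior, thetas)   # weights theta_k / (theta_1 + theta_2 + theta_3)
```

After seeing a head, the posterior shifts its mass towards the larger values of θ.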
Definition 6.3 (Continuous setting). The initial guess for the probability distribution of
the parameter θ is called the prior distribution, and it is denoted by π(·). Given that the data
x = (x1, x2 . . . xn) are observed, the updated distribution for θ is called the posterior distribution,
and it is given by

π(θ|x) = f_{X1...Xn}(x; θ) π(θ) / f_{X1...Xn}(x).
Example 6.4. Let X1, X2 . . . Xn be i.i.d. exponential random variables of parameter Θ. We
assume that Θ ∼ Exponential(µ) for some given µ. Thus our prior distribution is

π(θ) = µ e^{−µθ} 1_{[0,∞)}(θ).

After data x = (x1, x2 . . . xn) are observed, our posterior distribution for the parameter is given
by

π(θ|x) = f_{X1...Xn}(x; θ) π(θ) / f_{X1...Xn}(x) = (1/f_{X1...Xn}(x)) (∏_{k=1}^n θ e^{−θ x_k}) µ e^{−µθ}
       = (µ / f_{X1...Xn}(x)) θ^n e^{−θ(∑_{k=1}^n x_k + µ)}

for θ ≥ 0. Note that the factor µ/f_{X1...Xn}(x) does not depend on θ, so it must equal the
unique constant such that the posterior distribution integrates to 1. Since the remaining part
of the expression is proportional to the p.d.f. of a Gamma distribution with parameters n + 1 and
∑_{k=1}^n x_k + µ, we conclude that

Θ|x ∼ Gamma(n + 1, ∑_{k=1}^n x_k + µ).

That is, if prior to the data collection we believed that the parameter was Exponential(µ),
after observing the data our new belief is that the parameter is Gamma(n + 1, ∑_{k=1}^n x_k + µ).
6.1. Bayesian estimation. Once we have computed the posterior distribution π( · |x) for
Θ, we can use it to make a guess for the value of θ.

Definition 6.5 (Bayes estimator). The Bayes estimator θ(x) is defined to be the posterior
mean, that is

θ(x) = E(Θ|x) = ∫ θ π(θ|x) dθ.

Example 6.6. We have seen in Example 6.4 that the posterior distribution is a Gamma(n + 1, ∑_{k=1}^n x_k + µ). It follows that the Bayes estimator for the unknown parameter θ is given by

θ(x) = E(Θ|x) = (n + 1) / (∑_{k=1}^n x_k + µ).
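For instance (Python sketch, not part of the original notes; the data and the choice µ = 1 are invented):

```python
def bayes_estimator(xs, mu):
    # posterior is Gamma(n + 1, sum(x_k) + mu); its mean (n + 1)/(sum(x_k) + mu)
    # is the Bayes estimator under quadratic loss
    return (len(xs) + 1) / (sum(xs) + mu)

xs = [0.5, 1.2, 0.8, 2.0, 0.5]        # five observed waiting times
est = bayes_estimator(xs, mu=1.0)     # (5 + 1)/(5.0 + 1.0), approximately 1.0
```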
Theorem 6.7. The Bayes estimator is the value of θ which minimises the expected posterior
quadratic loss E[(Θ − θ)^2 | x].

Remark 6.8. It is possible to introduce different loss functions. For example, we might want to
estimate θ by the minimiser of the expected posterior absolute loss E[|Θ − θ| | x]. In this case
the Bayes estimator would be the median of the posterior distribution.
7. Exercises
Exercise 1. Let X1, X2 . . . Xn be i.i.d. N(Θ, 1), and consider the prior distribution Θ ∼ N(0, 1). Show that the posterior distribution is given by

Θ|x ∼ N((1/(n+1)) ∑_{k=1}^n x_k, 1/(n+1)).

Deduce that the Bayes estimator (for quadratic loss) is given by θ(x) = (1/(n+1)) ∑_{k=1}^n x_k.
Exercise 2. Let X1, X2 . . . Xn be i.i.d. Poisson(Θ) random variables, with prior distribution Θ ∼ Exponential(1). Show that

Θ|x ∼ Gamma(∑_{k=1}^n x_k + 1, n + 1).

Deduce that the Bayes estimator (for quadratic loss) for θ is given by

θ(x) = (∑_{k=1}^n x_k + 1) / (n + 1).
Exercise 3. Solve the above exercise with prior Θ ∼ Exponential(µ).
Exercise 4. Let X1, X2 . . . Xn be i.i.d. Bernoulli(Θ) random variables, with prior distribution Θ ∼ U[0, 1] (no prior information). Show that the posterior distribution is given
by

Θ|x ∼ Beta(∑_{k=1}^n x_k + 1, n + 1 − ∑_{k=1}^n x_k),

where we say that X ∼ Beta(α, β) if

fX(x) = (x^{α−1} (1 − x)^{β−1} / B(α, β)) 1_{[0,1]}(x),

with B(α, β) = Γ(α)Γ(β)/Γ(α + β). Using that a Beta(α, β) random variable has expectation α/(α + β), deduce
that the Bayes estimator (for quadratic loss) is given by

θ(x) = (∑_{k=1}^n x_k + 1) / (n + 2).
Exercise 5. We have an urn containing N balls, N known, some of which are blue and
some of which are red. Let Θ be the proportion of red balls. We take the uninformative prior
Θ ∼ U [0, 1]. Suppose that we pick one ball, and it turns out to be red. Compute the posterior
distribution Θ|Red. Compute the Bayes estimator (for quadratic loss) for the proportion of red
balls.
Exercise 6. Suppose we roll a biased die, where the probability of getting k is proportional
to Θk, for k = 1, 2, 3, 4, 5, 6. Let X1, X2 . . . Xn denote n independent rolls of the die. Starting
from the uninformative prior Θ ∼ U [0, 1], compute the posterior distribution π(θ|x) of Θ|x,
and the Bayes estimator (for quadratic loss).
Exercise 7. Let X1, X2 . . . Xn be i.i.d. Exponential(Θ) random variables, with prior
distribution Θ ∼ Gamma(α, β) for α, β known. Show that the posterior distribution is given
by

Θ|x ∼ Gamma(α + n, β + ∑_{k=1}^n x_k).

Deduce that the Bayes estimator (for quadratic loss) is given by θ(x) = (α + n)/(β + ∑_{k=1}^n x_k).
APPENDIX A
Background on set theory
A set is a collection of elements, which can be finite or infinite. If A is a finite set, |A| denotes
the number of elements of A. The empty set is the set consisting of no elements, and it is
denoted by ∅. If A and B are two sets, then
• A ∪ B = {x : x ∈ A or x ∈ B} (union).
• A ∩ B = {x : x ∈ A and x ∈ B} (intersection).
• A \ B = {x : x ∈ A and x /∈ B} (difference).
• A ⊂ B if all elements of A are also elements of B.
• If A ∩ B = ∅ then we say that A and B are disjoint.
Note that the empty set is disjoint from every set, including itself. In fact, the empty set is the
only set disjoint from itself. If Ω is a set, and A ⊆ Ω, we define the complement of A in Ω as
A^c = Ω \ A = {x ∈ Ω : x /∈ A}. Note that A ∩ A^c = ∅, so A and A^c are disjoint.

Example 0.1. Take A = {1, 2, 3}, B = {2, 4, 6, 8}. Then

A ∪ B = {1, 2, 3, 4, 6, 8},  A ∩ B = {2},  A \ B = {1, 3},  B \ A = {4, 6, 8}.

If C = {6, 7, 8} then

A ∪ C = {1, 2, 3, 6, 7, 8},  A ∩ C = ∅,  A \ C = A,  A ∪ B ∪ C = {1, 2, 3, 4, 6, 7, 8},  A ∩ B ∩ C = ∅.

Take Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Then A, B, C ⊆ Ω. Their complements in Ω are

A^c = {4, 5, 6, 7, 8, 9, 10},  B^c = {1, 3, 5, 7, 9, 10},  C^c = {1, 2, 3, 4, 5, 9, 10}.
We can also define infinite unions and intersections. For a sequence of sets (An)_{n∈N} set

⋃_{n∈N} An = A1 ∪ A2 ∪ A3 ∪ · · · = {x : x ∈ An for some n ∈ N},
⋂_{n∈N} An = A1 ∩ A2 ∩ A3 ∩ · · · = {x : x ∈ An for all n ∈ N}.

Example 0.2. Take An = {n, n + 1, n + 2 . . .} = {k ∈ N : k ≥ n}. Then

⋃_{n∈N} An = N,  ⋂_{n∈N} An = ∅.
If on the other hand we take An as before but set An = A10 for all n ≥ 10, then

⋃_{n∈N} An = N,  ⋂_{n∈N} An = A10 = {10, 11, 12 . . .}.
Given two sets A, B, we define their product as

A × B = {(x, y) : x ∈ A, y ∈ B}.

Example 0.3. If A = {1, 2, 3} and B = {α, β} then

A × B = {(1, α), (1, β), (2, α), (2, β), (3, α), (3, β)},
A × A = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)},
B × B = {(α, α), (α, β), (β, α), (β, β)}.

Note that the pairs are ordered, and that |A × B| = |A| · |B|. We can define the product of an
arbitrary number of sets by setting

A1 × A2 × A3 × · · · × An = {(x1, x2, x3 . . . xn) : xk ∈ Ak for all k}.

Example 0.4. If A = {0, 1}, B = {a} and C = {x, y} then

A × B × C = {(0, a, x), (0, a, y), (1, a, x), (1, a, y)}.

Note that |A × B × C| = 2 · 1 · 2 = 4.
1. Table of Φ(z)
[Table of values of the standard normal distribution function Φ(z). Source: https://math.ucalgary.ca/]