
LECTURE NOTES

Course 6.041-6.431

M.I.T.

FALL 2000

Introduction to Probability

Dimitri P. Bertsekas and John N. Tsitsiklis

Professors of Electrical Engineering and Computer Science

Massachusetts Institute of Technology

Cambridge, Massachusetts

These notes are copyright-protected but may be freely distributed for instructional nonprofit purposes.


3

General Random Variables

Contents

3.1. Continuous Random Variables and PDFs
3.2. Cumulative Distribution Functions
3.3. Normal Random Variables
3.4. Conditioning on an Event
3.5. Multiple Continuous Random Variables
3.6. Derived Distributions
3.7. Summary and Discussion



Random variables with a continuous range of possible experimental values are quite common – the velocity of a vehicle traveling along the highway could be one example. If such a velocity is measured by a digital speedometer, the speedometer's reading is a discrete random variable. But if we also wish to model the exact velocity, a continuous random variable is called for. Models involving continuous random variables can be useful for several reasons. Besides being finer-grained and possibly more accurate, they allow the use of powerful tools from calculus and often admit an insightful analysis that would not be possible under a discrete model.

All of the concepts and methods introduced in Chapter 2, such as expectation, PMFs, and conditioning, have continuous counterparts. Developing and interpreting these counterparts is the subject of this chapter.

3.1 CONTINUOUS RANDOM VARIABLES AND PDFS

A random variable X is called continuous if its probability law can be described in terms of a nonnegative function fX, called the probability density function of X, or PDF for short, which satisfies

P(X ∈ B) = ∫_B fX(x) dx,

for every subset B of the real line.† In particular, the probability that the value of X falls within an interval is

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx,

and can be interpreted as the area under the graph of the PDF (see Fig. 3.1). For any single value a, we have P(X = a) = ∫_a^a fX(x) dx = 0. For this reason, including or excluding the endpoints of an interval has no effect on its probability:

P(a ≤ X ≤ b) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b).

Note that to qualify as a PDF, a function fX must be nonnegative, i.e., fX(x) ≥ 0 for every x, and must also satisfy the normalization equation

∫_{−∞}^{∞} fX(x) dx = P(−∞ < X < ∞) = 1.

† The integral ∫_B fX(x) dx is to be interpreted in the usual calculus/Riemann sense and we implicitly assume that it is well-defined. For highly unusual functions and sets, this integral can be harder – or even impossible – to define, but such issues belong to a more advanced treatment of the subject. In any case, it is comforting to know that mathematical subtleties of this type do not arise if fX is a piecewise continuous function with a finite number of points of discontinuity, and B is the union of a finite or countable number of intervals.
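As a concrete illustration of the defining property P(X ∈ B) = ∫_B fX(x) dx, the following sketch (added here for illustration; it is not part of the original notes, and the helper name riemann is ours) approximates such integrals by a midpoint Riemann sum, in the spirit of the footnote above. The exponential PDF λe^{−λx} used as the test case is introduced later in this chapter.

import math

def riemann(f, a, b, n=100_000):
    # Midpoint Riemann sum approximation of the integral of f over [a, b].
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Test PDF: exponential with parameter lam (see the Exponential Random Variable below).
lam = 2.0
f = lambda x: lam * math.exp(-lam * x)

print(riemann(f, 0.0, 50.0))   # normalization: ~1.0 (the tail beyond 50 is negligible)
print(riemann(f, 0.5, 1.5))    # P(0.5 <= X <= 1.5)
print(math.exp(-lam * 0.5) - math.exp(-lam * 1.5))  # closed form, for comparison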


Figure 3.1: Illustration of a PDF. The probability that X takes value in an interval [a, b] is ∫_a^b fX(x) dx, which is the shaded area in the figure.

Graphically, this means that the entire area under the graph of the PDF must be equal to 1.

To interpret the PDF, note that for an interval [x, x + δ] with very small length δ, we have

P([x, x + δ]) = ∫_x^{x+δ} fX(t) dt ≈ fX(x) · δ,

so we can view fX(x) as the "probability mass per unit length" near x (cf. Fig. 3.2). It is important to realize that even though a PDF is used to calculate event probabilities, fX(x) is not the probability of any particular event. In particular, it is not restricted to be less than or equal to one.

Figure 3.2: Interpretation of the PDF fX(x) as "probability mass per unit length" around x. If δ is very small, the probability that X takes value in the interval [x, x + δ] is the shaded area in the figure, which is approximately equal to fX(x) · δ.

Example 3.1. Continuous Uniform Random Variable. A gambler spins a wheel of fortune, continuously calibrated between 0 and 1, and observes the resulting number. Assuming that all subintervals of [0, 1] of the same length are equally likely, this experiment can be modeled in terms of a random variable X with PDF

fX(x) = { c   if 0 ≤ x ≤ 1,
          0   otherwise,


for some constant c. This constant can be determined by using the normalization property

1 = ∫_{−∞}^{∞} fX(x) dx = ∫_0^1 c dx = c ∫_0^1 dx = c,

so that c = 1.

More generally, we can consider a random variable X that takes values in an interval [a, b], and again assume that all subintervals of the same length are equally likely. We refer to this type of random variable as uniform or uniformly distributed. Its PDF has the form

fX(x) = { c   if a ≤ x ≤ b,
          0   otherwise,

where c is a constant. This is the continuous analog of the discrete uniform random variable discussed in Chapter 2. For fX to satisfy the normalization property, we must have (cf. Fig. 3.3)

1 = ∫_a^b c dx = c ∫_a^b dx = c(b − a),

so that

c = 1/(b − a).

Figure 3.3: The PDF of a uniform random variable over [a, b]; its value on that interval is 1/(b − a).

Note that the probability P(X ∈ I) that X takes value in a set I is

P(X ∈ I) = ∫_{[a,b]∩I} 1/(b − a) dx = (1/(b − a)) ∫_{[a,b]∩I} dx = (length of [a, b] ∩ I)/(length of [a, b]).

The uniform random variable bears a relation to the discrete uniform law, which involves a sample space with a finite number of equally likely outcomes. The difference is that to obtain the probability of various events, we must now calculate the "length" of various subsets of the real line instead of counting the number of outcomes contained in various events.
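The "probability as length" formula is easy to test by simulation. The sketch below (an added illustration using only the Python standard library; the interval endpoints are arbitrary choices) estimates P(X ∈ I) for a uniform X on [a, b] and compares it with the length ratio.

import random

a, b = 2.0, 10.0   # X is uniform on [a, b]
c, d = 7.0, 14.0   # I = [c, d]

# Monte Carlo estimate of P(X in I).
n = 200_000
hits = sum(1 for _ in range(n) if c <= random.uniform(a, b) <= d)
print(hits / n)

# Length formula: length([a, b] ∩ [c, d]) / length([a, b]) = 3/8 here.
overlap = max(0.0, min(b, d) - max(a, c))
print(overlap / (b - a))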

Example 3.2. Piecewise Constant PDF. Alvin's driving time to work is between 15 and 20 minutes if the day is sunny, and between 20 and 25 minutes if


the day is rainy, with all times being equally likely in each case. Assume that a day is sunny with probability 2/3 and rainy with probability 1/3. What is the PDF of the driving time, viewed as a random variable X?

We interpret the statement that "all times are equally likely" in the sunny and the rainy cases to mean that the PDF of X is constant in each of the intervals [15, 20] and [20, 25]. Furthermore, since these two intervals contain all possible driving times, the PDF should be zero everywhere else:

fX(x) = { c1   if 15 ≤ x < 20,
          c2   if 20 ≤ x ≤ 25,
          0    otherwise,

where c1 and c2 are some constants. We can determine these constants by using the given probabilities of a sunny and of a rainy day:

2/3 = P(sunny day) = ∫_{15}^{20} fX(x) dx = ∫_{15}^{20} c1 dx = 5c1,

1/3 = P(rainy day) = ∫_{20}^{25} fX(x) dx = ∫_{20}^{25} c2 dx = 5c2,

so that

c1 = 2/15, c2 = 1/15.

Generalizing this example, consider a random variable X whose PDF has the piecewise constant form

fX(x) = { ci   if ai ≤ x < ai+1, i = 1, 2, . . . , n − 1,
          0    otherwise,

where a1, a2, . . . , an are some scalars with ai < ai+1 for all i, and c1, c2, . . . , cn are some nonnegative constants (cf. Fig. 3.4). The constants ci may be determined by additional problem data, as in the case of the preceding driving context. Generally, the ci must be such that the normalization property holds:

1 = ∫_{a1}^{an} fX(x) dx = ∑_{i=1}^{n−1} ∫_{ai}^{ai+1} ci dx = ∑_{i=1}^{n−1} ci(ai+1 − ai).

Figure 3.4: A piecewise constant PDF involving three intervals, with values c1, c2, c3 on the intervals determined by the breakpoints a1, a2, a3, a4.
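To make the normalization condition concrete, here is a small sketch (added for illustration, not part of the original text) that encodes the piecewise constant PDF of Example 3.2 and checks both the condition ∑ ci(ai+1 − ai) = 1 and the two interval probabilities.

# Breakpoints a_1, a_2, a_3 and constants c_1, c_2 from Example 3.2.
a = [15.0, 20.0, 25.0]
c = [2 / 15, 1 / 15]

# Normalization: the sum of c_i * (a_{i+1} - a_i) must equal 1.
print(sum(ci * (hi - lo) for ci, lo, hi in zip(c, a, a[1:])))  # 1.0

# P(sunny day) = c_1 * (20 - 15) and P(rainy day) = c_2 * (25 - 20).
print(c[0] * 5, c[1] * 5)  # 2/3 and 1/3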


Example 3.3. A PDF can be arbitrarily large. Consider a random variable X with PDF

fX(x) = { 1/(2√x)   if 0 < x ≤ 1,
          0          otherwise.

Even though fX(x) becomes infinitely large as x approaches zero, this is still a valid PDF, because

∫_{−∞}^{∞} fX(x) dx = ∫_0^1 1/(2√x) dx = √x |_0^1 = 1.

Summary of PDF Properties

Let X be a continuous random variable with PDF fX .

• fX(x) ≥ 0 for all x.

• ∫_{−∞}^{∞} fX(x) dx = 1.

• If δ is very small, then P([x, x + δ]) ≈ fX(x) · δ.

• For any subset B of the real line,

P(X ∈ B) = ∫_B fX(x) dx.

Expectation

The expected value or mean of a continuous random variable X is defined by†

E[X] = ∫_{−∞}^{∞} x fX(x) dx.

† One has to deal with the possibility that the integral ∫_{−∞}^{∞} x fX(x) dx is infinite or undefined. More concretely, we will say that the expectation is well-defined if ∫_{−∞}^{∞} |x| fX(x) dx < ∞. In that case, it is known that the integral ∫_{−∞}^{∞} x fX(x) dx takes a finite and unambiguous value.

For an example where the expectation is not well-defined, consider a random variable X with PDF fX(x) = c/(1 + x^2), where c is a constant chosen to enforce the normalization condition. The expression |x| fX(x) is approximately the same as 1/|x| when |x| is large. Using the fact ∫_1^{∞} (1/x) dx = ∞, one can show that ∫_{−∞}^{∞} |x| fX(x) dx = ∞. Thus, E[X] is left undefined, despite the symmetry of the PDF around zero.

In the absence of an indication to the contrary, we implicitly assume throughout this book that the expected value of the random variables of interest is well-defined.


This is similar to the discrete case except that the PMF is replaced by the PDF, and summation is replaced by integration. As in Chapter 2, E[X] can be interpreted as the "center of gravity" of the probability law and, also, as the anticipated average value of X in a large number of independent repetitions of the experiment. Its mathematical properties are similar to the discrete case – after all, an integral is just a limiting form of a sum.

If X is a continuous random variable with given PDF, any real-valued function Y = g(X) of X is also a random variable. Note that Y can be a continuous random variable: for example, consider the trivial case where Y = g(X) = X. But Y can also turn out to be discrete. For example, suppose that g(x) = 1 for x > 0, and g(x) = 0 otherwise. Then Y = g(X) is a discrete random variable. In either case, the mean of g(X) satisfies the expected value rule

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx,

in complete analogy with the discrete case.

The nth moment of a continuous random variable X is defined as E[X^n], the expected value of the random variable X^n. The variance, denoted by var(X), is defined as the expected value of the random variable (X − E[X])^2.

We now summarize this discussion and list a number of additional facts that are practically identical to their discrete counterparts.

Expectation of a Continuous Random Variable and its Properties

Let X be a continuous random variable with PDF fX .

• The expectation of X is defined by

E[X] = ∫_{−∞}^{∞} x fX(x) dx.

• The expected value rule for a function g(X) has the form

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.

• The variance of X is defined by

var(X) = E[(X − E[X])^2] = ∫_{−∞}^{∞} (x − E[X])^2 fX(x) dx.


• We have

0 ≤ var(X) = E[X^2] − (E[X])^2.

• If Y = aX + b, where a and b are given scalars, then

E[Y] = aE[X] + b, var(Y) = a^2 var(X).

Example 3.4. Mean and Variance of the Uniform Random Variable. Consider the case of a uniform PDF over an interval [a, b], as in Example 3.1. We have

E[X] = ∫_{−∞}^{∞} x fX(x) dx
     = ∫_a^b x · 1/(b − a) dx
     = (1/(b − a)) · (x^2/2) |_a^b
     = (1/(b − a)) · (b^2 − a^2)/2
     = (a + b)/2,

as one expects based on the symmetry of the PDF around (a + b)/2.

To obtain the variance, we first calculate the second moment. We have

E[X^2] = ∫_a^b x^2/(b − a) dx
       = (1/(b − a)) ∫_a^b x^2 dx
       = (1/(b − a)) · (x^3/3) |_a^b
       = (b^3 − a^3)/(3(b − a))
       = (a^2 + ab + b^2)/3.

Thus, the variance is obtained as

var(X) = E[X^2] − (E[X])^2 = (a^2 + ab + b^2)/3 − (a + b)^2/4 = (b − a)^2/12,

after some calculation.


Suppose now that [a, b] = [0, 1], and consider the function g(x) = 1 if x ≤ 1/3, and g(x) = 2 if x > 1/3. The random variable Y = g(X) is a discrete one with PMF pY(1) = P(X ≤ 1/3) = 1/3, pY(2) = 1 − pY(1) = 2/3. Thus,

E[Y] = (1/3) · 1 + (2/3) · 2 = 5/3.

The same result could be obtained using the expected value rule:

E[Y] = ∫_0^1 g(x) fX(x) dx = ∫_0^{1/3} dx + ∫_{1/3}^1 2 dx = 5/3.
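Both calculations in this example are easy to confirm by simulation; the sketch below (an added illustration, standard library only) estimates the mean and variance of a uniform X on [0, 1] and the expectation E[g(X)] for the two-valued g above.

import random

xs = [random.uniform(0, 1) for _ in range(200_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean)   # near 1/2 = (a + b)/2
print(var)    # near 1/12 = (b - a)^2/12

# Expected value rule for g(x) = 1 if x <= 1/3, else 2.
g = lambda x: 1 if x <= 1 / 3 else 2
print(sum(g(x) for x in xs) / len(xs))  # near 5/3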

Exponential Random Variable

An exponential random variable has a PDF of the form

fX(x) = { λe^{−λx}   if x ≥ 0,
          0          otherwise,

where λ is a positive parameter characterizing the PDF (see Fig. 3.5). This is a legitimate PDF because

∫_{−∞}^{∞} fX(x) dx = ∫_0^{∞} λe^{−λx} dx = −e^{−λx} |_0^{∞} = 1.

Note that the probability that X exceeds a certain value falls exponentially. Indeed, for any a ≥ 0, we have

P(X ≥ a) = ∫_a^{∞} λe^{−λx} dx = −e^{−λx} |_a^{∞} = e^{−λa}.

An exponential random variable can be a very good model for the amount of time until a piece of equipment breaks down, until a light bulb burns out, or until an accident occurs. It will play a major role in our study of random processes in Chapter 5, but for the time being we will simply view it as an example of a random variable that is fairly tractable analytically.

Figure 3.5: The PDF λe^{−λx} of an exponential random variable, shown for a small and for a large value of λ.


The mean and the variance can be calculated to be

E[X] = 1/λ, var(X) = 1/λ^2.

These formulas can be verified by straightforward calculation, as we now show. We have, using integration by parts,

E[X] = ∫_0^{∞} x λe^{−λx} dx
     = (−x e^{−λx}) |_0^{∞} + ∫_0^{∞} e^{−λx} dx
     = 0 − (e^{−λx}/λ) |_0^{∞}
     = 1/λ.

Using integration by parts once more, the second moment is

E[X^2] = ∫_0^{∞} x^2 λe^{−λx} dx
       = (−x^2 e^{−λx}) |_0^{∞} + ∫_0^{∞} 2x e^{−λx} dx
       = 0 + (2/λ) E[X]
       = 2/λ^2.

Finally, using the formula var(X) = E[X^2] − (E[X])^2, we obtain

var(X) = 2/λ^2 − 1/λ^2 = 1/λ^2.

Example 3.5. The time until a small meteorite first lands anywhere in the Sahara desert is modeled as an exponential random variable with a mean of 10 days. The time is currently midnight. What is the probability that a meteorite first lands some time between 6am and 6pm of the first day?

Let X be the time elapsed until the event of interest, measured in days. Then, X is exponential, with mean 1/λ = 10, which yields λ = 1/10. The desired probability is

P(1/4 ≤ X ≤ 3/4) = P(X ≥ 1/4) − P(X > 3/4) = e^{−1/40} − e^{−3/40} = 0.0476,

where we have used the formula P(X ≥ a) = P(X > a) = e^{−λa}.


Let us also derive an expression for the probability that the time when a meteorite first lands will be between 6am and 6pm of some day. For the kth day, this set of times corresponds to the event k − (3/4) ≤ X ≤ k − (1/4). Since these events are disjoint, the probability of interest is

∑_{k=1}^{∞} P(k − 3/4 ≤ X ≤ k − 1/4) = ∑_{k=1}^{∞} (P(X ≥ k − 3/4) − P(X > k − 1/4))
                                     = ∑_{k=1}^{∞} (e^{−(4k−3)/40} − e^{−(4k−1)/40}).

We omit the remainder of the calculation, which involves using the geometric series formula.
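Completing Example 3.5 numerically (an added sketch, not in the original notes): the code below evaluates the first-day probability e^{−1/40} − e^{−3/40} and approximates the "some day" probability by truncating the series once its terms become negligible, in place of the geometric series formula.

import math

lam = 1 / 10  # mean interarrival time of 10 days

# Probability of a first landing between 6am and 6pm of the first day.
print(math.exp(-lam / 4) - math.exp(-3 * lam / 4))  # ~0.0476

# Probability of a first landing between 6am and 6pm of some day:
# sum over k of e^{-(4k-3)/40} - e^{-(4k-1)/40}; the terms decay geometrically.
p = sum(math.exp(-(4 * k - 3) / 40) - math.exp(-(4 * k - 1) / 40)
        for k in range(1, 400))
print(p)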

3.2 CUMULATIVE DISTRIBUTION FUNCTIONS

We have been dealing with discrete and continuous random variables in a somewhat different manner, using PMFs and PDFs, respectively. It would be desirable to describe all kinds of random variables with a single mathematical concept. This is accomplished by the cumulative distribution function, or CDF for short. The CDF of a random variable X is denoted by FX and provides the probability P(X ≤ x). In particular, for every x we have

FX(x) = P(X ≤ x) = { ∑_{k≤x} pX(k)        if X is discrete,
                     ∫_{−∞}^{x} fX(t) dt   if X is continuous.

Loosely speaking, the CDF FX(x) "accumulates" probability "up to" the value x.

Any random variable associated with a given probability model has a CDF, regardless of whether it is discrete, continuous, or other. This is because {X ≤ x} is always an event and therefore has a well-defined probability. Figures 3.6 and 3.7 illustrate the CDFs of various discrete and continuous random variables. From these figures, as well as from the definition, some general properties of the CDF can be observed.


Figure 3.6: CDFs of some discrete random variables. The CDF is related to the PMF through the formula

FX(x) = P(X ≤ x) = ∑_{k≤x} pX(k),

and has a staircase form, with jumps occurring at the values of positive probability mass. Note that at the points where a jump occurs, the value of FX is the larger of the two corresponding values (i.e., FX is continuous from the right).

Properties of a CDF

The CDF FX of a random variable X is defined by

FX(x) = P(X ≤ x), for all x,

and has the following properties.

• FX is monotonically nondecreasing:

if x ≤ y, then FX(x) ≤ FX(y).

• FX(x) tends to 0 as x → −∞, and to 1 as x → ∞.

• If X is discrete, then FX has a piecewise constant and staircase-like form.

• If X is continuous, then FX has a continuously varying form.


• If X is discrete and takes integer values, the PMF and the CDF can be obtained from each other by summing or differencing:

FX(k) = ∑_{i=−∞}^{k} pX(i),

pX(k) = P(X ≤ k) − P(X ≤ k − 1) = FX(k) − FX(k − 1),

for all integers k.

• If X is continuous, the PDF and the CDF can be obtained from each other by integration or differentiation:

FX(x) = ∫_{−∞}^{x} fX(t) dt,

fX(x) = dFX(x)/dx.

(The latter relation is valid for those x for which the CDF has a derivative.)

Because the CDF is defined for any type of random variable, it provides a convenient means for exploring the relations between continuous and discrete random variables. This is illustrated in the following example, which shows that there is a close relation between the geometric and the exponential random variables.

Example 3.6. The Geometric and Exponential CDFs. Let X be a geometric random variable with parameter p; that is, X is the number of trials to obtain the first success in a sequence of independent Bernoulli trials, where the probability of success is p. Thus, for k = 1, 2, . . ., we have P(X = k) = p(1 − p)^{k−1} and the CDF is given by

F^{geo}(n) = ∑_{k=1}^{n} p(1 − p)^{k−1} = p · (1 − (1 − p)^n)/(1 − (1 − p)) = 1 − (1 − p)^n, for n = 1, 2, . . .

Suppose now that X is an exponential random variable with parameter λ > 0. Its CDF is given by

F^{exp}(x) = P(X ≤ x) = 0, for x ≤ 0,

and

F^{exp}(x) = ∫_0^{x} λe^{−λt} dt = −e^{−λt} |_0^{x} = 1 − e^{−λx}, for x > 0.


Figure 3.7: CDFs of some continuous random variables. (The figure shows a uniform PDF on [a, b], whose CDF rises linearly as (x − a)/(b − a), and a PDF increasing linearly on [a, b], whose CDF rises as (x − a)^2/(b − a)^2; the area under the PDF to the left of a point c equals FX(c).) The CDF is related to the PDF through the formula

FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt.

Thus, the PDF fX can be obtained from the CDF by differentiation:

fX(x) = dFX(x)/dx.

For a continuous random variable, the CDF has no jumps, i.e., it is continuous.

To compare the two CDFs above, let δ = − ln(1 − p)/λ, so that

e^{−λδ} = 1 − p.

Then we see that the values of the exponential and the geometric CDFs are equal for all x = nδ, where n = 1, 2, . . ., i.e.,

F^{exp}(nδ) = F^{geo}(n), n = 1, 2, . . . ,

as illustrated in Fig. 3.8.

If δ is very small, the exponential and the geometric CDFs are close to each other, provided that we scale the values taken by the geometric random variable by δ. This relation is best interpreted by viewing X as time, either continuous, in the case of the exponential, or δ-discretized, in the case of the geometric. In particular, suppose that δ is a small number, and that every δ seconds, we flip a coin with the probability of heads being a small number p. Then, the time of the first occurrence of heads is well approximated by an exponential random variable. The parameter


Figure 3.8: Relation of the geometric CDF 1 − (1 − p)^n (with p = 1 − e^{−λδ}) and the exponential CDF 1 − e^{−λx}. We have

F^{exp}(nδ) = F^{geo}(n), n = 1, 2, . . . ,

if the interval δ is such that e^{−λδ} = 1 − p. As δ approaches 0, the exponential random variable can be interpreted as the "limit" of the geometric.

λ of this exponential is such that e^{−λδ} = 1 − p, or λ = − ln(1 − p)/δ. This relation between the geometric and the exponential random variables will play an important role in the theory of the Bernoulli and Poisson stochastic processes in Chapter 5.
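The identity F^{exp}(nδ) = F^{geo}(n) under e^{−λδ} = 1 − p can be checked directly; the sketch below (an added illustration; λ and δ are arbitrary choices) sets p = 1 − e^{−λδ} and compares the two CDFs at the points nδ.

import math

lam, delta = 0.5, 0.01
p = 1 - math.exp(-lam * delta)

for n in (1, 10, 100, 1000):
    f_geo = 1 - (1 - p) ** n                # geometric CDF at n
    f_exp = 1 - math.exp(-lam * n * delta)  # exponential CDF at n * delta
    print(n, f_geo, f_exp)                  # equal up to floating-point error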

Sometimes, in order to calculate the PMF or PDF of a discrete or continuous random variable, respectively, it is more convenient to first calculate the CDF and then use the preceding relations. The systematic use of this approach for the case of a continuous random variable will be discussed in Section 3.6. The following is a discrete example.

Example 3.7. The Maximum of Several Random Variables. You are allowed to take a certain test three times, and your final score will be the maximum of the test scores. Thus,

X = max{X1, X2, X3},

where X1, X2, X3 are the three test scores and X is the final score. Assume that your score in each test takes one of the values from 1 to 10 with equal probability 1/10, independently of the scores in other tests. What is the PMF pX of the final score?

We calculate the PMF indirectly. We first compute the CDF FX(k) and then obtain the PMF as

pX(k) = FX(k) − FX(k − 1), k = 1, . . . , 10.


We have

FX(k) = P(X ≤ k)
      = P(X1 ≤ k, X2 ≤ k, X3 ≤ k)
      = P(X1 ≤ k) P(X2 ≤ k) P(X3 ≤ k)
      = (k/10)^3,

where the third equality follows from the independence of the events {X1 ≤ k}, {X2 ≤ k}, {X3 ≤ k}. Thus the PMF is given by

pX(k) = (k/10)^3 − ((k − 1)/10)^3, k = 1, . . . , 10.
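The CDF trick of Example 3.7 is easy to cross-check by brute force; the sketch below (an added illustration, standard library only) compares the formula (k/10)^3 − ((k − 1)/10)^3 with empirical frequencies of the maximum of three simulated scores.

import random

n = 200_000
counts = [0] * 11
for _ in range(n):
    score = max(random.randint(1, 10) for _ in range(3))
    counts[score] += 1

for k in range(1, 11):
    pmf = (k / 10) ** 3 - ((k - 1) / 10) ** 3
    print(k, pmf, counts[k] / n)  # formula vs. simulation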

3.3 NORMAL RANDOM VARIABLES

A continuous random variable X is said to be normal or Gaussian if it has a PDF of the form (see Fig. 3.9)

fX(x) = (1/(√(2π) σ)) e^{−(x−µ)^2/2σ^2},

where µ and σ are two scalar parameters characterizing the PDF, with σ assumed nonnegative. It can be verified that the normalization property

(1/(√(2π) σ)) ∫_{−∞}^{∞} e^{−(x−µ)^2/2σ^2} dx = 1

holds (see the theoretical problems).

Figure 3.9: A normal PDF and CDF, with µ = 1 and σ^2 = 1. We observe that the PDF is symmetric around its mean µ, and has a characteristic bell shape. As x gets further from µ, the term e^{−(x−µ)^2/2σ^2} decreases very rapidly. In this figure, the PDF is very close to zero outside the interval [−1, 3].


The mean and the variance can be calculated to be

E[X] = µ, var(X) = σ^2.

To see this, note that the PDF is symmetric around µ, so its mean must be µ. Furthermore, the variance is given by

var(X) = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ)^2 e^{−(x−µ)^2/2σ^2} dx.

Using the change of variables y = (x − µ)/σ and integration by parts, we have

var(X) = (σ^2/√(2π)) ∫_{−∞}^{∞} y^2 e^{−y^2/2} dy
       = (σ^2/√(2π)) (−y e^{−y^2/2}) |_{−∞}^{∞} + (σ^2/√(2π)) ∫_{−∞}^{∞} e^{−y^2/2} dy
       = (σ^2/√(2π)) ∫_{−∞}^{∞} e^{−y^2/2} dy
       = σ^2.

The last equality above is obtained by using the fact

(1/√(2π)) ∫_{−∞}^{∞} e^{−y^2/2} dy = 1,

which is just the normalization property of the normal PDF for the case where µ = 0 and σ = 1.

The normal random variable has several special properties. The following one is particularly important and will be justified in Section 3.6.

Normality is Preserved by Linear Transformations

If X is a normal random variable with mean µ and variance σ^2, and if a, b are scalars, then the random variable

Y = aX + b

is also normal, with mean and variance

E[Y] = aµ + b, var(Y) = a^2 σ^2.


The Standard Normal Random Variable

A normal random variable Y with zero mean and unit variance is said to be a standard normal. Its CDF is denoted by Φ,

Φ(y) = P(Y ≤ y) = P(Y < y) = (1/√(2π)) ∫_{−∞}^{y} e^{−t^2/2} dt.

It is recorded in a table (given on the next page), and is a very useful tool for calculating various probabilities involving normal random variables; see also Fig. 3.10.

Note that the table only provides the values of Φ(y) for y ≥ 0, because the omitted values can be found using the symmetry of the PDF. For example, if Y is a standard normal random variable, we have

Φ(−0.5) = P(Y ≤ −0.5) = P(Y ≥ 0.5) = 1 − P(Y < 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.

Let X be a normal random variable with mean µ and variance σ^2. We "standardize" X by defining a new random variable Y given by

Y = (X − µ)/σ.

Since Y is a linear transformation of X, it is normal. Furthermore,

E[Y] = (E[X] − µ)/σ = 0, var(Y) = var(X)/σ^2 = 1.

Thus, Y is a standard normal random variable. This fact allows us to calculate the probability of any event defined in terms of X: we redefine the event in terms of Y, and then use the standard normal table.

Figure 3.10: The PDF

fY(y) = (1/√(2π)) e^{−y^2/2}

of the standard normal random variable (mean 0, variance 1, peak value 1/√(2π) ≈ 0.399). Its corresponding CDF, denoted by Φ(y), is recorded in a table; for example, the area under the PDF to the left of 0.7 equals Φ(0.7).


Example 3.8. Using the Normal Table. The annual snowfall at a particular geographic location is modeled as a normal random variable with a mean of µ = 60 inches, and a standard deviation of σ = 20. What is the probability that this year's snowfall will be at least 80 inches?

Let X be the snow accumulation, viewed as a normal random variable, and let

Y = (X − µ)/σ = (X − 60)/20

be the corresponding standard normal random variable. We want to find

P(X ≥ 80) = P((X − 60)/20 ≥ (80 − 60)/20) = P(Y ≥ 1) = 1 − Φ(1),

where Φ is the CDF of the standard normal. We read the value Φ(1) from the table:

Φ(1) = 0.8413,

so that

P(X ≥ 80) = 1 − Φ(1) = 0.1587.
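In place of the printed normal table, Φ can be evaluated through the error function, using the standard identity Φ(y) = (1 + erf(y/√2))/2; the sketch below (an added illustration) redoes the snowfall calculation this way.

import math

def Phi(y):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

mu, sigma = 60.0, 20.0
# P(X >= 80) = 1 - Phi((80 - mu) / sigma) = 1 - Phi(1).
print(1 - Phi((80 - mu) / sigma))  # ~0.1587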

Generalizing the approach in the preceding example, we have the following procedure.

CDF Calculation of the Normal Random Variable

The CDF of a normal random variable X with mean µ and variance σ^2 is obtained using the standard normal table as

P(X ≤ x) = P((X − µ)/σ ≤ (x − µ)/σ) = P(Y ≤ (x − µ)/σ) = Φ((x − µ)/σ),

where Y is a standard normal random variable.

The normal random variable is often used in signal processing and communications engineering to model noise and unpredictable distortions of signals. The following is a typical example.

Example 3.9. Signal Detection. A binary message is transmitted as a signal that is either −1 or +1. The communication channel corrupts the transmission with additive normal noise with mean µ = 0 and variance σ^2. The receiver concludes that the signal −1 (or +1) was transmitted if the value received is < 0 (or ≥ 0, respectively); see Fig. 3.11. What is the probability of error?

An error occurs whenever −1 is transmitted and the noise N is at least 1, so that N + S = N − 1 ≥ 0, or whenever +1 is transmitted and the noise N is smaller


Figure 3.11: The signal detection scheme of Example 3.9. The transmitter sends a signal S = +1 or −1 over a noisy channel, which adds zero-mean normal noise N with variance σ^2; the receiver decides that +1 was sent if N + S ≥ 0 and that −1 was sent if N + S < 0. The area of the shaded region gives the probability of error in the two cases where −1 and +1 is transmitted.

than −1 so that N + S = N + 1 < 0. In the former case, the probability of error is

P(N ≥ 1) = 1 − P(N < 1) = 1 − P((N − µ)/σ < (1 − µ)/σ)
         = 1 − Φ((1 − µ)/σ)
         = 1 − Φ(1/σ).

In the latter case, the probability of error is the same, by symmetry. The value of Φ(1/σ) can be obtained from the normal table. For σ = 1, we have Φ(1/σ) = Φ(1) = 0.8413, and the probability of error is 0.1587.
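Continuing Example 3.9 numerically (an added sketch; the noise levels are arbitrary choices), the error probability 1 − Φ(1/σ) can be tabulated for several values of σ using the same Φ helper as above; as expected, a noisier channel gives a larger probability of error.

import math

def Phi(y):
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

for sigma in (0.5, 1.0, 2.0):
    print(sigma, 1 - Phi(1 / sigma))  # probability of error per transmitted bit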

The normal random variable plays an important role in a broad range of probabilistic models. The main reason is that, generally speaking, it models well the additive effect of many independent factors, in a variety of engineering, physical, and statistical contexts. Mathematically, the key fact is that the sum of a large number of independent and identically distributed (not necessarily normal) random variables has an approximately normal CDF, regardless of the CDF of the individual random variables. This property is captured in the celebrated central limit theorem, which will be discussed in Chapter 7.

3.4 CONDITIONING ON AN EVENT

The conditional PDF of a continuous random variable X, conditioned on a particular event A with P(A) > 0, is a function fX|A that satisfies

P(X ∈ B | A) = ∫_B fX|A(x) dx,


for any subset B of the real line. It is the same as an ordinary PDF, except that it now refers to a new universe in which the event A is known to have occurred.

An important special case arises when we condition on X belonging to a subset A of the real line, with P(X ∈ A) > 0. We then have

P(X ∈ B | X ∈ A) = P(X ∈ B and X ∈ A)/P(X ∈ A) = (∫_{A∩B} fX(x) dx)/P(X ∈ A).

This formula must agree with the earlier one, and therefore,†

fX|A(x | A) = { fX(x)/P(X ∈ A)   if x ∈ A,
                0                 otherwise.

As in the discrete case, the conditional PDF is zero outside the conditioning set. Within the conditioning set, the conditional PDF has exactly the same shape as the unconditional one, except that it is scaled by the constant factor 1/P(X ∈ A). This normalization ensures that fX|A integrates to 1, which makes it a legitimate PDF; see Fig. 3.13.

Figure 3.13: The unconditional PDF fX and the conditional PDF fX|A, where A is the interval [a, b]. Note that within the conditioning event A, fX|A retains the same shape as fX, except that it is scaled along the vertical axis.

Example 3.10. The exponential random variable is memoryless. Alvin goes to a bus stop where the time T between two successive buses has an exponential PDF with parameter λ. Suppose that Alvin arrives t secs after the preceding bus arrival and let us express this fact with the event A = {T > t}. Let X be the time that Alvin has to wait for the next bus to arrive. What is the conditional CDF FX|A(x | A)?

† We are using here the simpler notation fX|A(x) in place of fX|X∈A(x), which is more accurate.


We have

P(X > x | A) = P(T > t + x | T > t)
             = P(T > t + x and T > t)/P(T > t)
             = P(T > t + x)/P(T > t)
             = e^{−λ(t+x)}/e^{−λt}
             = e^{−λx},

where we have used the expression for the CDF of an exponential random variable derived in Example 3.6.

Thus, the conditional CDF of X is exponential with parameter λ, regardless of the time t that elapsed between the preceding bus arrival and Alvin's arrival. This is known as the memorylessness property of the exponential. Generally, if we model the time to complete a certain operation by an exponential random variable X, this property implies that as long as the operation has not been completed, the remaining time up to completion has the same exponential CDF, no matter when the operation started.
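The memorylessness property can also be observed empirically; the sketch below (an added illustration; the parameter values are arbitrary) estimates P(T > t + x | T > t) from exponential samples and compares it with e^{−λx}.

import math, random

lam, t, x = 0.2, 3.0, 5.0
samples = [random.expovariate(lam) for _ in range(500_000)]

# Condition on T > t, then check how often T also exceeds t + x.
survived = [s for s in samples if s > t]
cond = sum(1 for s in survived if s > t + x) / len(survived)

print(cond, math.exp(-lam * x))  # both near e^{-1} ≈ 0.368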

For a continuous random variable, the conditional expectation is defined similarly to the unconditional case, except that we now need to use the conditional PDF. We summarize the discussion so far, together with some additional properties, in the table that follows.

Conditional PDF and Expectation Given an Event

• The conditional PDF fX|A of a continuous random variable X, given an event A with P(A) > 0, satisfies

P(X ∈ B | A) = ∫_B fX|A(x) dx.

• If A is a subset of the real line with P(X ∈ A) > 0, then

fX|A(x) = { fX(x)/P(X ∈ A)   if x ∈ A,
            0                 otherwise,

and

P(X ∈ B | X ∈ A) = ∫_B fX|A(x) dx,

for any set B.


• The corresponding conditional expectation is defined by

E[X | A] = ∫_{−∞}^{∞} x fX|A(x) dx.

• The expected value rule remains valid:

E[g(X) | A] = ∫_{−∞}^{∞} g(x) fX|A(x) dx.

• If A1, A2, . . . , An are disjoint events with P(Ai) > 0 for each i, that form a partition of the sample space, then

fX(x) = ∑_{i=1}^{n} P(Ai) fX|Ai(x)

(a version of the total probability theorem), and

E[X] = ∑_{i=1}^{n} P(Ai) E[X | Ai]

(the total expectation theorem). Similarly,

E[g(X)] = ∑_{i=1}^{n} P(Ai) E[g(X) | Ai].

To justify the above version of the total probability theorem, we use the total probability theorem from Chapter 1, to obtain

P(X ≤ x) = ∑_{i=1}^{n} P(Ai) P(X ≤ x | Ai).

This formula can be rewritten as

∫_{−∞}^{x} fX(t) dt = ∑_{i=1}^{n} P(Ai) ∫_{−∞}^{x} fX|Ai(t) dt.

We take the derivative of both sides, with respect to x, and obtain the desired relation

fX(x) = ∑_{i=1}^{n} P(Ai) fX|Ai(x).


If we now multiply both sides by x and then integrate from −∞ to ∞, we obtain the total expectation theorem for continuous random variables.

The total expectation theorem can often facilitate the calculation of the mean, variance, and other moments of a random variable, using a divide-and-conquer approach.

Example 3.11. Mean and Variance of a Piecewise Constant PDF. Suppose that the random variable X has the piecewise constant PDF

fX(x) = { 1/3   if 0 ≤ x ≤ 1,
          2/3   if 1 < x ≤ 2,
          0     otherwise,

(see Fig. 3.14). Consider the events

A1 = {X lies in the first interval [0, 1]},
A2 = {X lies in the second interval (1, 2]}.

We have from the given PDF,

P(A1) = ∫_0^1 fX(x) dx = 1/3, P(A2) = ∫_1^2 fX(x) dx = 2/3.

Furthermore, the conditional mean and second moment of X, conditioned on A1 and A2, are easily calculated since the corresponding conditional PDFs fX|A1 and fX|A2 are uniform. We recall from Example 3.4 that the mean of a uniform random variable on an interval [a, b] is (a + b)/2 and its second moment is (a^2 + ab + b^2)/3. Thus,

E[X | A1] = 1/2, E[X | A2] = 3/2,

E[X^2 | A1] = 1/3, E[X^2 | A2] = 7/3.

Figure 3.14: The piecewise constant PDF of Example 3.11, equal to 1/3 on [0, 1] and 2/3 on (1, 2].


We now use the total expectation theorem to obtain

E[X] = P(A1) E[X | A1] + P(A2) E[X | A2] = (1/3)(1/2) + (2/3)(3/2) = 7/6,

E[X^2] = P(A1) E[X^2 | A1] + P(A2) E[X^2 | A2] = (1/3)(1/3) + (2/3)(7/3) = 15/9.

The variance is given by

var(X) = E[X^2] − (E[X])^2 = 15/9 − 49/36 = 11/36.

Note that this approach to the mean and variance calculation is easily generalized to piecewise constant PDFs with more than two pieces.
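The divide-and-conquer calculation of Example 3.11 is mirrored in the sketch below (an added illustration), which combines the conditional uniform moments with the weights P(A1) = 1/3 and P(A2) = 2/3.

# Moments of a uniform random variable on [a, b], from Example 3.4.
mean_u = lambda a, b: (a + b) / 2
m2_u = lambda a, b: (a ** 2 + a * b + b ** 2) / 3

pA1, pA2 = 1 / 3, 2 / 3
EX = pA1 * mean_u(0, 1) + pA2 * mean_u(1, 2)   # total expectation theorem
EX2 = pA1 * m2_u(0, 1) + pA2 * m2_u(1, 2)

print(EX, EX2, EX2 - EX ** 2)  # 7/6, 15/9, 11/36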

The next example illustrates a divide-and-conquer approach that uses the total probability theorem to calculate a PDF.

Example 3.12. The metro train arrives at the station near your home every quarter hour starting at 6:00 AM. You walk into the station every morning between 7:10 and 7:30 AM, with the time in this interval being a uniform random variable. What is the PDF of the time you have to wait for the first train to arrive?

Figure 3.15: The PDFs fX, fY|A, fY|B, and fY in Example 3.12: (a) fX is uniform on the interval from 7:10 to 7:30; (b) fY|A is uniform on [0, 5] with height 1/5; (c) fY|B is uniform on [0, 15] with height 1/15; (d) fY is piecewise constant, equal to 1/10 on [0, 5] and 1/20 on (5, 15].

The time of your arrival, denoted by X, is a uniform random variable on the interval from 7:10 to 7:30; see Fig. 3.15(a). Let Y be the waiting time. We calculate the PDF fY using a divide-and-conquer strategy. Let A and B be the events

A = {7:10 ≤ X ≤ 7:15} = {you board the 7:15 train},


where λ is a positive parameter. Let Y = aX + b. Then,

fY(y) = { (λ/|a|) e^{−λ(y−b)/a}   if (y − b)/a ≥ 0,
          0                        otherwise.

Note that if b = 0 and a > 0, then Y is an exponential random variable with parameter λ/a. In general, however, Y need not be exponential. For example, if a < 0 and b = 0, then the range of Y is the negative real axis.

Example 3.25. A linear function of a normal random variable is normal. Suppose that X is a normal random variable with mean µ and variance σ^2, and let Y = aX + b, where a and b are some scalars. We have

fX(x) = (1/(√(2π) σ)) e^{−(x−µ)^2/2σ^2}.

Therefore,

fY(y) = (1/|a|) fX((y − b)/a)
      = (1/|a|) (1/(√(2π) σ)) e^{−(((y−b)/a)−µ)^2/2σ^2}
      = (1/(√(2π) |a|σ)) e^{−(y−b−aµ)^2/2a^2σ^2}.

We recognize this as a normal PDF with mean aµ + b and variance a^2σ^2. In particular, Y is a normal random variable.
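The algebra of Example 3.25 can be spot-checked numerically: the sketch below (an added illustration; the parameter values are arbitrary) compares the derived density (1/|a|) fX((y − b)/a) with the normal PDF of mean aµ + b and standard deviation |a|σ at a few points.

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma, a, b = 1.0, 2.0, -3.0, 5.0
for y in (-4.0, 0.0, 2.0, 7.0):
    derived = normal_pdf((y - b) / a, mu, sigma) / abs(a)
    direct = normal_pdf(y, a * mu + b, abs(a) * sigma)
    print(y, derived, direct)  # identical up to rounding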

The Monotonic Case

The calculation and the formula for the linear case can be generalized to the case where g is a monotonic function. Let X be a continuous random variable and suppose that its range is contained in a certain interval I, in the sense that fX(x) = 0 for x ∉ I. We consider the random variable Y = g(X), and assume that g is strictly monotonic over the interval I. That is, either

(a) g(x) < g(x′) for all x, x′ ∈ I satisfying x < x′ (monotonically increasing case), or

(b) g(x) > g(x′) for all x, x′ ∈ I satisfying x < x′ (monotonically decreasing case).

Furthermore, we assume that the function g is differentiable. Its derivative will necessarily be nonnegative in the increasing case and nonpositive in the decreasing case.


An important fact is that a monotonic function can be "inverted" in the sense that there is some function h, called the inverse of g, such that for all x ∈ I, we have y = g(x) if and only if x = h(y). For example, the inverse of the function g(x) = 180/x considered in Example 3.22 is h(y) = 180/y, because we have y = 180/x if and only if x = 180/y. Other such examples of pairs of inverse functions include

g(x) = ax + b, h(y) = (y − b)/a,

where a and b are scalars with a ≠ 0 (see Fig. 3.22), and

g(x) = e^{ax}, h(y) = (ln y)/a,

where a is a nonzero scalar.

Figure 3.22: A monotonically increasing function g (on the left) and its inverse h (on the right); in the linear case, g(x) = ax + b has slope a and h(y) = (y − b)/a has slope 1/a. Note that the graph of h has the same shape as the graph of g, except that it is rotated by 90 degrees and then reflected (this is the same as interchanging the x and y axes).

For monotonic functions g, the following is a convenient analytical formula for the PDF of the function Y = g(X).


PDF Formula for a Monotonic Function of a Continuous Random Variable

Suppose that g is monotonic and that for some function h and all x in the range I of X we have

y = g(x) if and only if x = h(y).

Assume that h has a first derivative (dh/dy)(y). Then the PDF of Y in the region where fY(y) > 0 is given by

fY(y) = fX(h(y)) |(dh/dy)(y)| .

For a verification of the above formula, assume first that g is monotonically increasing. Then, we have

FY(y) = P(g(X) ≤ y) = P(X ≤ h(y)) = FX(h(y)),

where the second equality can be justified using the monotonically increasing property of g (see Fig. 3.23). By differentiating this relation, using also the chain rule, we obtain

fY(y) = (dFY/dy)(y) = fX(h(y)) · (dh/dy)(y).

Because g is monotonically increasing, h is also monotonically increasing, so its derivative is positive:

(dh/dy)(y) = |(dh/dy)(y)| .

This justifies the PDF formula for a monotonically increasing function g. The justification for the case of a monotonically decreasing function is similar: we differentiate instead the relation

FY(y) = P(g(X) ≤ y) = P(X ≥ h(y)) = 1 − FX(h(y)),

and use the chain rule.

There is a similar formula involving the derivative of g, rather than the derivative of h. To see this, differentiate the equality g(h(y)) = y, and use the chain rule to obtain

(dg/dh)(h(y)) · (dh/dy)(y) = 1.


Let us fix some x and y that are related by g(x) = y, which is the same as h(y) = x. Then,

(dg/dx)(x) · (dh/dy)(y) = 1,

which leads to

fY(y) = fX(x) / |(dg/dx)(x)| .

Figure 3.23: Calculating the probability P(g(X) ≤ y). When g is monotonically increasing (left figure), the event {g(X) ≤ y} is the same as the event {X ≤ h(y)}. When g is monotonically decreasing (right figure), the event {g(X) ≤ y} is the same as the event {X ≥ h(y)}.

Example 3.22. (Continued) To check the PDF formula, let us apply it to the problem of Example 3.22. In the region of interest, x ∈ [30, 60], we have h(y) = 180/y, and

fX(h(y)) = 1/30, |(dh/dy)(y)| = 180/y^2.

Thus, in the region of interest y ∈ [3, 6], the PDF formula yields

fY(y) = fX(h(y)) |(dh/dy)(y)| = (1/30) · (180/y^2) = 6/y^2,

consistently with the expression obtained earlier.
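The same check can be done by simulation (an added sketch, not in the original notes): draw X uniformly on [30, 60], form Y = 180/X, and compare a histogram-style density estimate of fY with the formula 6/y^2.

import random

n = 400_000
ys = [180 / random.uniform(30, 60) for _ in range(n)]  # Y = g(X) = 180/X, with Y in [3, 6]

# Density estimate in a small window around y, versus the formula 6/y^2.
h = 0.05
for y in (3.5, 4.5, 5.5):
    est = sum(1 for v in ys if y - h <= v <= y + h) / (n * 2 * h)
    print(y, est, 6 / y ** 2)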

Example 3.26. Let Y = g(X) = X^2, where X is a continuous uniform random variable in the interval (0, 1]. Within this interval, g is monotonic, and its inverse


3.7 SUMMARY AND DISCUSSION

Continuous random variables are characterized by PDFs and arise in many applications. PDFs are used to calculate event probabilities. This is similar to the use of PMFs for the discrete case, except that now we need to integrate instead of adding. Joint PDFs are similar to joint PMFs and are used to determine the probability of events that are defined in terms of multiple random variables. Finally, conditional PDFs are similar to conditional PMFs and are used to calculate conditional probabilities, given the value of the conditioning random variable.

We have also introduced a few important continuous probability laws and derived their mean and variance. A summary is provided in the table that follows.

Summary of Results for Special Random Variables

Continuous Uniform Over [a, b]:

fX(x) = { 1/(b − a)   if a ≤ x ≤ b,
          0           otherwise,

E[X] = (a + b)/2, var(X) = (b − a)^2/12.

Exponential with Parameter λ:

fX(x) = { λe^{−λx}   if x ≥ 0,
          0          otherwise,

FX(x) = { 1 − e^{−λx}   if x ≥ 0,
          0             otherwise,

E[X] = 1/λ, var(X) = 1/λ^2.

Normal with Parameters µ and σ^2:

fX(x) = (1/(√(2π) σ)) e^{−(x−µ)^2/2σ^2},

E[X] = µ, var(X) = σ^2.