EE385 Class Notes 11/13/2012 John Stensby
Updates at http://www.ece.uah.edu/courses/ee385/ 5-1
Chapter 5 Moments and Conditional Statistics
Let X denote a random variable, and z = h(x) a function of x. Consider the
transformation Z = h(X). We saw that we could express
E[Z] = E[h(X)] = \int_{-\infty}^{\infty} h(x)\, f_X(x)\,dx ,   (5-1)
a method of calculating E[Z] that does not require knowledge of fZ(z). It is possible to extend
this method to transformations of two random variables.
Given random variables X, Y and function z = g(x,y), form the new random variable
Z = g(X,Y). (5-2)
fZ(z) denotes the density of Z. The expected value of Z is E[Z] = \int_{-\infty}^{\infty} z\, f_Z(z)\,dz; however, this
formula requires knowledge of fZ, a density which may not be available. Instead, we can use
E[Z] = E[g(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f_{XY}(x,y)\,dx\,dy   (5-3)
to calculate E[Z] without having to obtain fZ. This is a very useful result.
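As a quick numerical sketch of (5-3) (the particular g and joint density below are hypothetical choices for illustration): with X and Y independent Uniform(0,1) and g(x,y) = xy, a sample average of g over draws of (X,Y) approximates E[Z] = 1/4 with no need to ever derive f_Z.

```python
import numpy as np

# Hypothetical illustration of (5-3): X, Y independent Uniform(0,1),
# Z = g(X,Y) = X*Y. The double integral of g against f_XY is
# approximated by a Monte Carlo sample average, so f_Z is never needed.
rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)
y = rng.uniform(size=200_000)
ez_mc = np.mean(x * y)      # sample analogue of the double integral
ez_exact = 0.25             # E[XY] = E[X]E[Y] = 1/4 by independence
print(abs(ez_mc - ez_exact) < 0.01)
```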
Covariance
The covariance CXY of random variables X and Y is defined as
C_{XY} = E[(X - \eta_x)(Y - \eta_y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \eta_x)(y - \eta_y)\, f_{XY}(x,y)\,dx\,dy ,   (5-4)
where ηx = E[X] and ηy = E[Y]. Note that CXY can be expressed as
C_{XY} = E[(X - \eta_x)(Y - \eta_y)] = E[XY - \eta_y X - \eta_x Y + \eta_x\eta_y] = E[XY] - \eta_x\eta_y .   (5-5)
Correlation Coefficient
The correlation coefficient for random variables X and Y is defined as
r_{xy} = \frac{C_{XY}}{\sigma_x \sigma_y} .   (5-6)
rxy is a measure of the “statistical similarity” between X and Y.
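A small numeric sketch (the data below are hypothetical): computing r_xy directly from its definition (5-6) for two strongly related variables.

```python
import numpy as np

# Hypothetical data: y is a noisy linear function of x, so r_xy should
# land close to +1; Theorem 5-1 guarantees it never leaves [-1, +1].
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(scale=0.5, size=100_000)
c_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance, Eq. (5-4)
r_xy = c_xy / (x.std() * y.std())                 # Eq. (5-6)
assert -1.0 <= r_xy <= 1.0                        # Theorem 5-1
print(r_xy > 0.9)
```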
Theorem 5-1: The correlation coefficient must lie in the range −1 ≤ rxy ≤ +1.
Proof: Let α denote any real number. Consider the parabolic equation
g(\alpha) \equiv E[\{\alpha(X - \eta_x) + (Y - \eta_y)\}^2] = \alpha^2\sigma_x^2 + 2\alpha C_{xy} + \sigma_y^2 \ge 0 .   (5-7)
Note that g(α) ≥ 0 for all α; g is a parabola that opens
upward.
As a first case, suppose that there exists a
value α0 for which g(α0) = 0 (see Fig. 5-1). Then α0
is a repeated root of g(α) = 0. In the quadratic
formula used to determine the roots of (5-7), the
discriminant must be zero. That is, (2C_{xy})^2 - 4\sigma_x^2\sigma_y^2 = 0, so that

|r_{xy}| = |C_{xy}| / \sigma_x\sigma_y = 1 .
Now, consider the case g(α) > 0 for all α; g
has no real roots (see Fig. 5-2). This means that the
discriminant must be negative (so the roots are
complex valued). Hence, (2C_{xy})^2 - 4\sigma_x^2\sigma_y^2 < 0, so that
[Figure 5-1: Plot of the parabola g(\alpha) = \alpha^2\sigma_x^2 + 2\alpha C_{xy} + \sigma_y^2 touching the \alpha-axis at \alpha_0. Case for which the discriminant is zero.]

[Figure 5-2: Plot of g(\alpha) lying strictly above the \alpha-axis. Case for which the discriminant is negative.]
|r_{xy}| = |C_{xy}| / \sigma_x\sigma_y < 1 .   (5-8)
Hence, in either case, −1 ≤ rxy ≤ +1 as claimed.
Suppose an experiment yields values for X and Y. Consider that we perform the
experiment many times, and plot the outcomes X and Y on a two dimensional plane. Some
hypothetical results follow.
[Figure 5-3: Samples of X and Y with varying degrees of correlation. Three scatter plots in the (x, y) plane: correlation coefficient r_{xy} near -1, r_{xy} very small, and r_{xy} near +1.]
Notes:
1. If |r_{xy}| = 1, then there exist constants a and b such that Y = aX + b in the mean-square sense
(i.e., E[{Y - (aX + b)}2] = 0).
2. The addition of a constant to a random variable does not change the variance of the random
variable. That is, σ2 = VAR[X] = VAR[X + α] for any α.
3. Multiplication by a constant scales the variance of a random variable. If VAR[X] = σ2,
then VAR[αX] = α2σ2.
4. Adding constants to random variables X and Y does not change the covariance or correlation
of these random variables. That is, X + α and Y + β have the same covariance and correlation
coefficient as X and Y.
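Notes 2 through 4 can be checked numerically; a sketch with hypothetical sample data:

```python
import numpy as np

# Numeric check of Notes 2-4 on hypothetical sample data: shifts leave
# variance and covariance alone; scaling multiplies variance by alpha^2.
rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = 0.3 * x + rng.normal(size=50_000)
cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))
assert np.isclose(x.var(), (x + 7.0).var())           # Note 2: VAR[X+a] = VAR[X]
assert np.isclose((3.0 * x).var(), 9.0 * x.var())     # Note 3: VAR[aX] = a^2 VAR[X]
assert np.isclose(cov(x, y), cov(x + 5.0, y - 2.0))   # Note 4: shifts preserve C_XY
print("ok")
```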
Correlation Coefficient for Gaussian Random Variables
Let zero-mean X and Y be jointly Gaussian with joint density

f_{XY}(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-r^2}} \exp\left\{ \frac{-1}{2(1-r^2)} \left[ \frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} \right] \right\} .   (5-9)
We are interested in the correlation coefficient rXY; we claim that rXY = r, where r is just a
parameter in the joint density (from statements given above, r is the correlation coefficient for
the nonzero mean case as well). First, note that CXY = E[XY], since the means are zero. Now,
show rXY = r by establishing E[XY] = rσXσY, so that rXY = CXY/σXσY = E[XY]/σXσY = r. In the
square brackets of fXY is an expression that is quadratic in x/σX. Complete the square for this
quadratic form to obtain
\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} = \frac{1}{\sigma_x^2}\left\{ x - r\frac{\sigma_x}{\sigma_y}\, y \right\}^2 + (1-r^2)\left\{ \frac{y^2}{\sigma_y^2} \right\} .   (5-10)
Use this new quadratic form to obtain
E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{XY}(x,y)\,dx\,dy

= \int_{-\infty}^{\infty} \frac{y}{\sigma_y\sqrt{2\pi}}\, e^{-y^2/2\sigma_y^2} \left[ \int_{-\infty}^{\infty} \frac{x}{\sigma_x\sqrt{2\pi(1-r^2)}} \exp\left( \frac{-(x - r\frac{\sigma_x}{\sigma_y}y)^2}{2\sigma_x^2(1-r^2)} \right) dx \right] dy ,   (5-11)

where the bracketed inner integrand contains a normal density with mean r\frac{\sigma_x}{\sigma_y}y.
Note that the inner integral is an expected value calculation; the inner integral evaluates to r\frac{\sigma_x}{\sigma_y}y. Hence,
E[XY] = \int_{-\infty}^{\infty} \frac{1}{\sigma_y\sqrt{2\pi}}\, y\, e^{-y^2/2\sigma_y^2} \left[ r\frac{\sigma_x}{\sigma_y}\, y \right] dy

= r\frac{\sigma_x}{\sigma_y} \left[ \int_{-\infty}^{\infty} \frac{1}{\sigma_y\sqrt{2\pi}}\, y^2\, e^{-y^2/2\sigma_y^2}\, dy \right] = r\frac{\sigma_x}{\sigma_y}\,\sigma_y^2

= r\,\sigma_x\sigma_y ,   (5-12)
as desired. From this, we conclude that rXY = r.
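The identity E[XY] = rσxσy can be sanity-checked by simulation; a sketch with hypothetical parameter values:

```python
import numpy as np

# Hypothetical parameters: jointly Gaussian zero-mean X, Y with r = 0.6,
# sigma_x = 2, sigma_y = 3; Eq. (5-12) predicts E[XY] = 0.6*2*3 = 3.6.
rng = np.random.default_rng(3)
r, sx, sy = 0.6, 2.0, 3.0
cov = [[sx**2, r*sx*sy], [r*sx*sy, sy**2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=400_000).T
print(abs(np.mean(x * y) - r * sx * sy) < 0.1)
```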
Uncorrelatedness and Orthogonality
Two random variables are uncorrelated if their covariance is zero. That is, they are
uncorrelated if
CXY = rXY = 0. (5-13)
Since CXY = E[XY] – E[X]E[Y], Equation (5-13) is equivalent to the requirement that E[XY] =
E[X]E[Y]. Two random variables are called orthogonal if
E[XY] = 0. (5-14)
Theorem 5-2: If random variables X and Y are independent, then they are uncorrelated
(independence ⇒ uncorrelated).
Proof: Let X and Y be independent. Then
E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{XY}(x,y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_X(x) f_Y(y)\,dx\,dy = E[X]\,E[Y] .   (5-15)
Therefore, X and Y are uncorrelated. Note: The converse is not true in general. If X and Y are
uncorrelated, then they are not necessarily independent. This general rule has an exception for
Gaussian random variables, an important special case.
Theorem 5-3: For Gaussian random variables, uncorrelatedness is equivalent to independence
(independence ⇔ uncorrelatedness for Gaussian random variables).
Proof: We have only to show that uncorrelatedness ⇒ independence. But this is easy. Let the
correlation coefficient r = 0 (so that the two random variables are uncorrelated) in the joint
Gaussian density (5-9). Note that the joint density factors into a product of marginal densities.
Joint Moments
Joint moments of X and Y can be computed. These are defined as
m_{kr} = E[X^k Y^r] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^r f_{XY}(x,y)\,dx\,dy .   (5-16)
Joint central moments are defined as
\mu_{kr} = E[(X - \eta_x)^k (Y - \eta_y)^r] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \eta_x)^k (y - \eta_y)^r f_{XY}(x,y)\,dx\,dy .   (5-17)
Conditional Distributions/Densities
Let M denote an event with P(M) ≠ 0, and let X and Y be random variables. Recall that
F(y \mid M) = P[Y \le y \mid M] = \frac{P[Y \le y,\, M]}{P[M]} .   (5-18)
Now, event M can be defined in terms of the random variable X.
Example (5-1): Define M = [X ≤ x] and write
F(y \mid X \le x) = \frac{P[X \le x,\, Y \le y]}{P[X \le x]} = \frac{F_{XY}(x,y)}{F_X(x)}   (5-19)
f(y \mid X \le x) = \frac{\partial F_{XY}(x,y)/\partial y}{F_X(x)} .   (5-20)
Example (5-2): Define M = [x1 < X ≤ x2] and write
F(y \mid x_1 < X \le x_2) = \frac{P[x_1 < X \le x_2,\, Y \le y]}{P[x_1 < X \le x_2]} = \frac{F_{XY}(x_2,y) - F_{XY}(x_1,y)}{F_X(x_2) - F_X(x_1)} .   (5-21)
Example (5-3): Define M = [X = x], where fX(x) ≠ 0. The quantity P[Y ≤ y, M]/P[M] can be
indeterminate (i.e., 0/0) in this case (certainly, this is true for continuous X) so that we must use

F(y \mid X = x) = \lim_{\Delta x \to 0^+} F(y \mid x - \Delta x < X \le x) .   (5-22)
From the previous example, this result can be written as
F(y \mid X = x) = \lim_{\Delta x \to 0^+} \frac{F_{XY}(x,y) - F_{XY}(x - \Delta x,\, y)}{F_X(x) - F_X(x - \Delta x)} = \lim_{\Delta x \to 0^+} \frac{[F_{XY}(x,y) - F_{XY}(x - \Delta x,\, y)]/\Delta x}{[F_X(x) - F_X(x - \Delta x)]/\Delta x}

= \frac{\partial F_{XY}(x,y)/\partial x}{\partial F_X(x)/\partial x} .   (5-23)
From this last result, we conclude that the conditional density can be expressed as
f(y \mid X = x) = \frac{\partial}{\partial y} F(y \mid X = x) = \frac{\partial^2 F_{XY}(x,y)/\partial x\,\partial y}{\partial F_X(x)/\partial x} ,   (5-24)
which yields
f(y \mid X = x) = \frac{f_{XY}(x,y)}{f_X(x)} .   (5-25)
Using the abbreviated notation f(y|x) ≡ f(y|X = x), apply Equation (5-25) and symmetry to write
fXY(x,y) = f (y⎮x) fX(x) = f (x⎮y) fY(y). (5-26)
Use this form of the joint density with the formula before last to write
f(x \mid y) = \frac{f(y \mid x)\, f_X(x)}{f_Y(y)} ,   (5-27)
a result that is called Bayes Theorem for densities.
Conditional Expectations
Let M denote an event, g(x) a function of x, and X a random variable. Then, the conditional
expectation of g(X) given M is defined as
E[g(X) \mid M] = \int_{-\infty}^{\infty} g(x)\, f(x \mid M)\,dx .   (5-28)
For example, let X and Y denote random variables, and write the conditional mean of X given Y
= y as
\eta_{x|y} \equiv E[X \mid Y = y] \equiv E[X \mid y] = \int_{-\infty}^{\infty} x\, f(x \mid y)\,dx .   (5-29)
Higher-order conditional moments can be defined in a similar manner. For example, the
conditional variance is written as
\sigma_{x|y}^2 \equiv E[(X - \eta_{x|y})^2 \mid Y = y] \equiv E[(X - \eta_{x|y})^2 \mid y] = \int_{-\infty}^{\infty} (x - \eta_{x|y})^2 f(x \mid y)\,dx .   (5-30)
Remember that \eta_{x|y} and \sigma_{x|y}^2 are functions of algebraic variable y, in general.
Example (5-4): Let X and Y be zero-mean, jointly Gaussian random variables with
f_{XY}(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-r^2}} \exp\left\{ \frac{-1}{2(1-r^2)} \left[ \frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} \right] \right\} .   (5-31)

Find f(x|y), \eta_{x|y} and \sigma_{x|y}^2. We will accomplish this by factoring f_{XY} into the product
f(x|y) f_Y(y). By completing the square on the quadratic, we can write
\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} = \frac{1}{\sigma_x^2}\left\{ x - r\frac{\sigma_x}{\sigma_y}\, y \right\}^2 + (1-r^2)\left\{ \frac{y^2}{\sigma_y^2} \right\} ,   (5-32)
so that
f_{XY}(x,y) = \underbrace{ \frac{1}{\sigma_x\sqrt{2\pi(1-r^2)}} \exp\left[ \frac{-(x - r\frac{\sigma_x}{\sigma_y}y)^2}{2\sigma_x^2(1-r^2)} \right] }_{f(x|y)}\; \underbrace{ \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left[ \frac{-y^2}{2\sigma_y^2} \right] }_{f_Y(y)} .   (5-33)
From this factorization, we observe that
f(x \mid y) = \frac{1}{\sigma_x\sqrt{2\pi(1-r^2)}} \exp\left[ \frac{-(x - r\frac{\sigma_x}{\sigma_y}y)^2}{2\sigma_x^2(1-r^2)} \right] .   (5-34)
Note that this conditional density is Gaussian! This unexpected conclusion leads to
\eta_{x|y} = r\frac{\sigma_x}{\sigma_y}\, y

\sigma_{x|y}^2 = \sigma_x^2 (1 - r^2)   (5-35)
as the conditional mean and variance, respectively.
The variance \sigma_x^2 of a random variable X is a measure of uncertainty in the value of X. If
\sigma_x^2 is small, it is highly likely to find X near its mean. The conditional variance \sigma_{x|y}^2 is a
measure of uncertainty in the value of X given that Y = y. From (5-35), note that \sigma_{x|y}^2 \to 0 as
|r| → 1. As perfect correlation is approached, it becomes more likely to find X near its
conditional mean \eta_{x|y}.
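Equation (5-35) can be illustrated by simulation (parameter values below are hypothetical): among samples whose Y lands near a chosen y0, X should average to r(σx/σy)y0 and scatter with the reduced variance σx²(1 − r²).

```python
import numpy as np

# Simulation sketch of Eq. (5-35) with hypothetical parameters. Samples
# with Y near y0 approximate the conditional density f(x|y0).
rng = np.random.default_rng(4)
r, sx, sy, y0 = 0.8, 1.0, 2.0, 1.5
cov = [[sx**2, r*sx*sy], [r*sx*sy, sy**2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=2_000_000).T
near = np.abs(y - y0) < 0.05                  # crude conditioning on Y ~ y0
cond_mean = x[near].mean()
cond_var = x[near].var()
assert abs(cond_mean - r * (sx / sy) * y0) < 0.05   # eta_{x|y0} = 0.6
assert abs(cond_var - sx**2 * (1 - r**2)) < 0.05    # sigma^2_{x|y} = 0.36
print("ok")
```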
Example (5-5): Generalize the previous example to the non-zero mean case. Consider X and Y
same as above except for E[X] = ηX and E[Y] = ηY. Now, define zero mean Gaussian variables
Xd and Yd so that X = Xd + ηX , Y = Yd + ηY and
f_{XY}(x,y) = f_{X_d Y_d}(x - \eta_x,\, y - \eta_y) \left| \frac{\partial(x_d, y_d)}{\partial(x, y)} \right| = f_{X_d Y_d}(x - \eta_x,\, y - \eta_y)

= \frac{1}{\sigma_x\sqrt{2\pi(1-r^2)}} \exp\left[ \frac{-(x - \eta_x - r\frac{\sigma_x}{\sigma_y}(y - \eta_y))^2}{2\sigma_x^2(1-r^2)} \right] \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left[ \frac{-(y - \eta_y)^2}{2\sigma_y^2} \right] .   (5-36)
By Bayes rule for density functions, it is easily seen that
f(x \mid y) = \frac{1}{\sigma_x\sqrt{2\pi(1-r^2)}} \exp\left[ \frac{-(x - \eta_x - r\frac{\sigma_x}{\sigma_y}(y - \eta_y))^2}{2\sigma_x^2(1-r^2)} \right] .   (5-37)
Hence, the conditional mean and variance are
\eta_{x|y} = \eta_x + r\frac{\sigma_x}{\sigma_y}(y - \eta_y)

\sigma_{x|y}^2 = \sigma_x^2 (1 - r^2) ,   (5-38)
respectively, for the case where X and Y are themselves nonzero mean. Note that (5-38) follows
directly from (5-35) since
\eta_{x|y} \equiv E[X \mid Y = y] = E[X_d + \eta_x \mid Y_d + \eta_y = y]

= E[X_d \mid Y_d = y - \eta_y] + \eta_x

= r\frac{\sigma_x}{\sigma_y}(y - \eta_y) + \eta_x .   (5-39)
Conditional Expected Value as a Transformation for a Random Variable
Let X and Y denote random variables. The conditional mean of random variable Y given
that X = x is an "ordinary" function ϕ(x) of x. That is,
\varphi(x) = E[Y \mid X = x] = E[Y \mid x] = \int_{-\infty}^{\infty} y\, f(y \mid x)\,dy .   (5-40)
In general, function ϕ(x) can be plotted, integrated, differentiated, etc.; it is an "ordinary"
function of x. For example, as we have just seen, if X and Y are jointly Gaussian, we know that
\varphi(x) = E[Y \mid X = x] = \eta_y + r\frac{\sigma_y}{\sigma_x}(x - \eta_x) ,   (5-41)
a simple linear function of x.
Use ϕ(x) to transform random variable X. Now, ϕ(X) = E[Y⎮X] is a random variable.
Be very careful with the notation: random variable E[Y⎮X] is different from function
E[Y⎮X = x] ≡ E[Y⎮x] (note that E[Y⎮X = x] and E[Y⎮x] are used interchangeably). Find the
expected value E[ϕ(X)] = E[E[Y⎮X]] of random variable ϕ(X). In the usual way, we start this
task by writing
E[E[Y \mid X]] = \int_{-\infty}^{\infty} E[Y \mid x]\, f_X(x)\,dx = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} y\, f(y \mid x)\,dy \right] f_X(x)\,dx .   (5-42)
Now, since fXY(x,y) = f (y⎮x) fX(x) we have
E[E[Y \mid X]] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, f(y \mid x)\, f_X(x)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, f_{XY}(x,y)\,dx\,dy = \int_{-\infty}^{\infty} y\, f_Y(y)\,dy .   (5-43)
From this, we conclude that
E[Y] = E[E[Y \mid X]] .   (5-44)
The inner conditional expectation is conditioned on X; the outer expectation is over X. To
emphasize this fact, the notation E_X[E[Y|X]] ≡ E[E[Y|X]] is sometimes used in the literature.
Example (5-6): Two fair dice are tossed until the combination “1 and 1” (“snake eyes”) appears.
Determine the average (i.e., expected) number of tosses required to hit “snake eyes”. To solve
this problem, define random variables
1) N = number of tosses to hit “snake eyes” for the first time
2) H = 1 if “snake eyes” is hit on the first roll
   H = 0 if “snake eyes” is not hit on the first roll
Note that H takes on only two values with P[H = 1] = 1/36 and P[H = 0] = 35/36. Now, we can
compute the average E[N] = E[E[N|H]], where the inner expectation is conditioned on H, and
the outer expectation is an average over H. We write

E[N] = E\big[E[N \mid H]\big] = E[N \mid H = 1]\,P[H = 1] + E[N \mid H = 0]\,P[H = 0] .
Now, if H = 0, then “snake eyes” was not hit on the first toss, and the game starts over (at the
second toss) with an average of E[N] additional tosses still required to hit “snake eyes”. Hence,
E[N⎮H = 0] = 1 + E[N]. On the other hand, if H = 1, “snake eyes” was hit on the first roll, so
E[N⎮H = 1] = 1. These two observations produce
E[N] = E[N \mid H = 1]\,P[H = 1] + E[N \mid H = 0]\,P[H = 0]

= (1)\left(\frac{1}{36}\right) + \big(1 + E[N]\big)\left(\frac{35}{36}\right)

= \left(\frac{35}{36}\right) E[N] + 1 ,
and the conclusion E[N] = 36.
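The conclusion E[N] = 36 is easy to confirm with a Monte Carlo sketch (the trial count and seed below are arbitrary choices):

```python
import random

# Monte Carlo check of Example (5-6): toss two fair dice until "snake
# eyes" and average the number of tosses required; theory says E[N] = 36.
random.seed(0)

def tosses_until_snake_eyes():
    n = 0
    while True:
        n += 1
        if random.randint(1, 6) == 1 and random.randint(1, 6) == 1:
            return n

trials = 100_000
avg = sum(tosses_until_snake_eyes() for _ in range(trials)) / trials
print(abs(avg - 36.0) < 1.5)
```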
Generalizations
This basic concept can be generalized. Again, X and Y denote random variables. And,
g(x,y) denotes a function of algebraic variables x and y. The conditional mean
\varphi(x) = E[g(X,Y) \mid X = x] = E[g(x,Y) \mid X = x] = \int_{-\infty}^{\infty} g(x,y)\, f(y \mid x)\,dy   (5-45)
is an "ordinary" function of real value x.
Now, ϕ(X) = E[g(X,Y)⎮X] is a transformation of random variable X (again, be careful:
E[g(X,Y)⎮X] is a random variable and E[g(X,Y)⎮X = x] = E[g(x,Y)⎮x] = ϕ(x) is a function of
x). We are interested in the expected value E[ϕ(X)] = E[E[g(X,Y)⎮X]] so we write
E[\varphi(X)] = E\big[ E[g(X,Y) \mid X] \big] = \int_{-\infty}^{\infty} f_X(x) \left[ \int_{-\infty}^{\infty} g(x,y)\, f(y \mid x)\,dy \right] dx

= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f(y \mid x)\, f_X(x)\,dy\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f_{XY}(x,y)\,dy\,dx = E[g(X,Y)] ,   (5-46)
where we have used fXY(x,y) = f(y⎮x)fX(x), Bayes law of densities. Hence, we conclude that
E[g(X,Y)] = E[E[g(X,Y)⎮X]] = EX[E[g(X,Y)⎮X]]. (5-47)
In this last equality, the inner conditional expectation is used to transform X; the outer
expectation is over X.
Example (5-7): Let X and Y be jointly Gaussian with E[X] = E[Y] = 0, Var[X] = σX2, Var[Y] =
σY2 and correlation coefficient r. Find the conditional second moment E[X2⎮Y = y] = E[X2⎮y].
First, note that
Var[X \mid y] = E[X^2 \mid y] - \big( E[X \mid y] \big)^2 .   (5-48)
Using the conditional mean and variance given by (5-35), we write
E[X^2 \mid y] = Var[X \mid y] + \big( E[X \mid y] \big)^2 = \sigma_x^2 (1 - r^2) + \left( r\frac{\sigma_x}{\sigma_y}\, y \right)^2 .   (5-49)
Example (5-8): Let X and Y be jointly Gaussian with E[X] = E[Y] = 0, Var[X] = σX2, Var[Y] =
σY2 and correlation coefficient r. Find
YE[XY] E [ (Y)]= ϕ , (5-50)
where
\varphi(y) = E[XY \mid Y = y] = y\, E[X \mid Y = y] = y \left( r\frac{\sigma_x}{\sigma_y}\, y \right) .   (5-51)
To accomplish this, substitute (5-51) into (5-50) to obtain
E[XY] = E_Y[\varphi(Y)] = r\frac{\sigma_x}{\sigma_y} E_Y[Y^2] = r\frac{\sigma_x}{\sigma_y}\,\sigma_y^2 = r\,\sigma_x\sigma_y .   (5-52)
Application of Conditional Expectation: Bayesian Estimation
Let θ denote an unknown DC voltage (for example, the output of a thermocouple, strain
gauge, etc.). We are trying to measure θ. Unfortunately, the measurement is obscured by
additive noise n(t). At time t = T, we take a single sample of θ and noise; this sample is called z
= θ + n(T). We model the noise sample n(T) as a random variable with known density fn(n) (we
have “abused” the symbol n by using it simultaneously to denote a random quantity and an
algebraic variable. Such “abuses” are common in the literature). We model unknown θ as a
random variable with density fθ(θ). Density fθ(θ) is called the a-priori density of θ, and it is
known. In most cases, random variables θ and n(T) are independent, but this is not an absolute
requirement (the independence assumption simplifies the analysis). Figure 5-4 depicts a block
diagram that illustrates the generation of voltage-sample z.
From context in the discussion given below (and in the literature), the reader should be
able to discern the current usage of the symbol z. He/she should be able to tell whether z denotes
a random variable or a realization of a random variable (a particular sample outcome). Here, (as
is often the case in the literature) there is no need to use Z to denote the random variable and z to
denote a particular value (sample outcome or realization) of the random variable.
We desire to use the measurement z to estimate voltage θ. We need to develop an
estimator that will take our measurement sample value z and give us an estimate θ̂(z) of the
actual value of θ. Of course, there is some difference between the estimate θ̂ and the true value
of θ; that is, there is an error voltage θ̃(z) ≡ θ̂(z) − θ. Finally, making errors costs us. C(θ̃(z))
denotes the cost incurred by using measurement z to estimate voltage θ; C is a known cost
function.
The values of z and C(θ̃(z)) change from one sample to the next; they can be interpreted
as random variables as described above. Hence, it makes no sense to develop estimator θ̂ that
minimizes C(θ̃(z)). But, it does make sense to choose/design/develop θ̂ with the goal of
[Figure 5-4: Noisy measurement of a DC voltage. Block diagram: θ and noise n(t) are summed, and the sum is sampled at time t = T to produce z = θ + n(T).]
minimizing E[C(θ̃(z))] = E[C(θ̂(z) − θ)], the expected or average cost associated with the
estimation process. It is important to note that we are performing an ensemble average over all
possible z and θ (random variables that we average over when computing E[C(θ̂(z) − θ)]).
The estimator, denoted here as θ̂_b, that minimizes this average cost is called the
Bayesian estimator. That is, Bayesian estimator θ̂_b satisfies

E[C(\hat\theta_b(z) - \theta)] \le E[C(\hat\theta(z) - \theta)] \quad \text{for any } \hat\theta \ne \hat\theta_b .   (5-53)

(θ̂_b is the "best" estimator. On the average, you "pay more" if you use any other estimator).
Important Special Case: Mean Square Cost Function C(θ̃) = θ̃²
Let's use the squared error cost function C(θ̃) = θ̃². Then, when estimator θ̂ is used,
the average cost per decision is

E[\tilde\theta^2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \big( \hat\theta(z) - \theta \big)^2 f(\theta, z)\,d\theta\,dz = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} \big( \hat\theta(z) - \theta \big)^2 f(\theta \mid z)\,d\theta \right] f_z(z)\,dz .   (5-54)
For the outer integral of the last double integral, the integrand is a non-negative function of z.
Hence, average cost E[θ̃²] will be minimized if, for every value of z, we pick θ̂(z) to minimize
the non-negative inner integral

\int_{-\infty}^{\infty} \big( \hat\theta(z) - \theta \big)^2 f(\theta \mid z)\,d\theta .   (5-55)
With respect to θ̂, differentiate this last integral, set the result to zero, and get

\int_{-\infty}^{\infty} 2\big( \hat\theta(z) - \theta \big) f(\theta \mid z)\,d\theta = 0 .   (5-56)
Finally, solve this last result for the Bayesian estimator
\hat\theta_b(z) = \int_{-\infty}^{\infty} \theta\, f(\theta \mid z)\,d\theta = E[\theta \mid z] .   (5-57)
That is, for the mean square cost function, the Bayesian estimator is the mean of θ conditioned
on the data z. Sometimes, we call (5-57) the conditional mean estimator.
As outlined above, we make a measurement and get a specific numerical value for z (i.e.,
we may interpret numerical z as a specific realization of a random variable). This measured
value can be used in (5-57) to obtain a numerical estimate of θ. On the other hand, suppose that
we are interested in the average performance of our estimator (averaged over all possible
measurements and all possible values of θ). Then, as discussed below, we treat z as a random
variable and average θ̃²(z) = {θ̂_b(z) − θ}² over all possible measurements (values of z) and all
possible values of θ; that is, we compute the variance of the estimation error. In doing this, we
treat z as a random variable. However, we use the same symbol z regardless of the interpretation
and use of (5-57). From context, we must determine if z is being used to denote a random
variable or a specific measurement (that is, a realization of a random variable).
Alternative Expression for θ̂_b
The conditional mean estimator can be expressed in a more convenient fashion. First,
use Bayes rule for densities (here, we interpret z as a random variable)
f(\theta \mid z) = \frac{f(z \mid \theta)\, f_\theta(\theta)}{f_z(z)}   (5-58)
in the estimator formula (5-57) to obtain
\hat\theta_b(z) = \int_{-\infty}^{\infty} \theta\, \frac{f(z \mid \theta)\, f_\theta(\theta)}{f_z(z)}\,d\theta = \frac{\int_{-\infty}^{\infty} \theta\, f(z \mid \theta)\, f_\theta(\theta)\,d\theta}{f_z(z)} = \frac{\int_{-\infty}^{\infty} \theta\, f(z \mid \theta)\, f_\theta(\theta)\,d\theta}{\int_{-\infty}^{\infty} f(z \mid \theta)\, f_\theta(\theta)\,d\theta} ,   (5-59)
a formulation that is used in application.
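Formula (5-59) can be evaluated by brute-force quadrature; a sketch under hypothetical Gaussian assumptions (θ ~ N with mean 1 and unit variance as the prior, z = θ + n with n ~ N(0,1), one measurement z = 2.4). For this setup the Gaussian theory of Eq. (5-65) gives the closed-form answer 1 + 0.5(2.4 − 1) = 1.7, which the quadrature should reproduce.

```python
import numpy as np

# Discrete quadrature sketch of the conditional-mean estimator
# (5-57)/(5-59) for one hypothetical Gaussian setup.
theta = np.linspace(-6.0, 8.0, 20_001)
prior = np.exp(-0.5 * (theta - 1.0) ** 2) / np.sqrt(2 * np.pi)  # f_theta
z = 2.4                                                         # measurement
like = np.exp(-0.5 * (z - theta) ** 2) / np.sqrt(2 * np.pi)     # f(z|theta)
w = like * prior
w /= w.sum()                     # discrete posterior weights, Eq. (5-58)
theta_hat = (theta * w).sum()    # posterior mean, Eq. (5-59)
assert abs(theta_hat - 1.7) < 1e-4
print(round(theta_hat, 3))
```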
Mean and Variance of the Estimation Error
For the conditional mean estimator, the estimation error is

\tilde\theta = \theta - \hat\theta_b = \theta - E[\theta \mid z] .   (5-60)

The mean value of θ̃ (averaged over all θ and all possible measurements z) is

E[\tilde\theta] = E[\theta - \hat\theta_b] = E[\theta] - E\big[ E[\theta \mid z] \big] = E[\theta] - E[\theta] = 0 .   (5-61)

Equivalently, E[θ̂_b] = E[θ]; because of this, we say that θ̂_b is an unbiased estimator.
Since E[θ̃] = 0, the variance of the estimation error is

VAR[\tilde\theta] = E[\tilde\theta^2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \big( \theta - E[\theta \mid z] \big)^2 f(\theta, z)\,d\theta\,dz ,   (5-62)
where f(θ,z) is the joint density that describes θ and z. We want VAR[θ̃] < VAR[θ]; otherwise,
our estimator is of little value since we could simply use E[θ] to estimate θ. In general, VAR[θ̃] is a
measure of estimator performance.
Example (5-9): Bayesian Estimator for Single-Sample Gaussian Case
Suppose that θ is N(θ0, σ0) and n(T) is N(0,σ). Also, assume that θ and n are
independent. Find the conditional mean (Bayesian) estimator θ̂_b. First, when interpreted as a
random variable, z = θ + n(T) is Gaussian with mean θ0 and variance σ02 + σ2. Hence, from the
conditional mean formula (5-38) for the Gaussian case, we have
\hat\theta_b(z) = E[\theta \mid z] = \theta_0 + r_{\theta z} \frac{\sigma_0}{\sqrt{\sigma_0^2 + \sigma^2}} (z - \theta_0) ,   (5-63)
where rθZ is the correlation coefficient between θ and z. Now, we must find rθZ. Observe that
r_{\theta z} = \frac{E[(\theta - \theta_0)(z - \theta_0)]}{\sigma_0 \sqrt{\sigma_0^2 + \sigma^2}} = \frac{E[(\theta - \theta_0)([\theta - \theta_0] + n(T))]}{\sigma_0 \sqrt{\sigma_0^2 + \sigma^2}} = \frac{E[(\theta - \theta_0)^2] + E[(\theta - \theta_0)\, n(T)]}{\sigma_0 \sqrt{\sigma_0^2 + \sigma^2}}

= \frac{\sigma_0^2}{\sigma_0 \sqrt{\sigma_0^2 + \sigma^2}} = \frac{\sigma_0}{\sqrt{\sigma_0^2 + \sigma^2}} ,   (5-64)
since θ and n(T) are independent. Hence, the Bayesian estimator is
\hat\theta_b(z) = \theta_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} (z - \theta_0) .   (5-65)
The error is θ̃ = θ − θ̂_b, and E[θ̃] = 0 as shown by (5-61). That is, θ̂_b is an unbiased
estimator since its expected value is the mean of the quantity being estimated. The variance of
θ̃ is
VAR[\tilde\theta] = E[(\theta - \hat\theta_b)^2] = E\left[ \left( (\theta - \theta_0) - \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} (z - \theta_0) \right)^2 \right]

= E[(\theta - \theta_0)^2] - 2\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} E[(\theta - \theta_0)(z - \theta_0)] + \left( \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} \right)^2 E[(z - \theta_0)^2] .   (5-66)
Due to independence, we have
E[(\theta - \theta_0)(z - \theta_0)] = E[(\theta - \theta_0)(\theta - \theta_0 + n(T))] = E[(\theta - \theta_0)^2] = \sigma_0^2   (5-67)
E[(z - \theta_0)^2] = E[(\theta - \theta_0 + n(T))^2] = \sigma_0^2 + \sigma^2 .   (5-68)
Now, use (5-67) and (5-68) in (5-66) to obtain
VAR[\tilde\theta] = \sigma_0^2 - 2\frac{\sigma_0^4}{\sigma_0^2 + \sigma^2} + \left( \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2} \right)^2 [\sigma_0^2 + \sigma^2]

= \sigma_0^2 \left[ \frac{\sigma^2}{\sigma_0^2 + \sigma^2} \right] .   (5-69)
As expected, the variance of error θ̃ approaches zero as the noise average power (i.e., the
variance) σ2 → 0. On the other hand, as σ2 → ∞, we have VAR[θ̃] → σ02 (this is the noise
dominated case). As can be seen from (5-69), for all values of σ2, we have VAR[θ̃] < VAR[θ]
= σ02, which means that θ̂_b will always outperform the simple approach of selecting mean E[θ]
= θ0 as the estimate of θ.
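Example (5-9) can be checked end to end by simulation (the numbers below are hypothetical): with θ0 = 2, σ0 = 1 and σ = 2, Eq. (5-69) predicts an error variance of 1·4/(1+4) = 0.8, smaller than the prior variance σ0² = 1.

```python
import numpy as np

# Simulation sketch of Example (5-9) with hypothetical parameters:
# theta ~ N(theta0, sigma0^2), z = theta + N(0, sigma^2) noise.
rng = np.random.default_rng(5)
theta0, s0, s = 2.0, 1.0, 2.0
theta = rng.normal(theta0, s0, size=500_000)
z = theta + rng.normal(0.0, s, size=500_000)
theta_b = theta0 + (s0**2 / (s0**2 + s**2)) * (z - theta0)   # Eq. (5-65)
err_var = np.var(theta_b - theta)
assert abs(err_var - 0.8) < 0.01    # Eq. (5-69): sigma0^2*sigma^2/(sigma0^2+sigma^2)
assert err_var < s0**2              # beats guessing E[theta] = theta0
print(round(err_var, 2))
```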
Example (5-10): Bayesian Estimator for Multiple Sample Gaussian Case
As given by (5-69), the error variance (i.e., the uncertainty) of θ̂_b may be too large for some
applications. We can use a sample mean (involving multiple samples) in the Bayesian estimator
to lower this variance.
Take multiple samples of z(tk) = θ + n(tk), 1 ≤ k ≤ N (tk, 1 ≤ k ≤ N, denote the times at
which samples are taken). Assume that the tk are far enough apart in time that n(tk) and n(tj) are
independent for tk ≠ tj (for example, this would be the case if the time intervals between samples
are large compared to the reciprocal of the bandwidth of noise n(t)). Define the sample mean of
the collected data as
\bar z \equiv \frac{1}{N} \sum_{k=1}^{N} z(t_k) = \theta + \bar n ,   (5-70)
where
\bar n \equiv \frac{1}{N} \sum_{k=1}^{N} n(t_k)   (5-71)
is the sample mean of the noise. The quantity n̄ is Gaussian with mean E[n̄] = 0; due to
independence, the variance is

VAR[\bar n] = \frac{1}{N^2} \sum_{k=1}^{N} VAR[n(t_k)] = \frac{\sigma^2}{N} .   (5-72)
Note that z̄ ≡ θ + n̄ has the same form regardless of the number of samples N. Hence,
based on the data z̄, the Bayesian estimator for θ has the same form regardless of the number of
samples. We can adapt (5-65) and write
\hat\theta_b(\bar z) = \theta_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2 / N} (\bar z - \theta_0) .   (5-73)
That is, in the Bayesian estimator formula, use sample mean z̄ instead of the single sample z.
Adapt (5-69) to the multiple sample case and write the variance of error θ̃ = θ − θ̂_b as
VAR[\tilde\theta] = \sigma_0^2 \left[ \frac{\sigma^2 / N}{\sigma_0^2 + \sigma^2 / N} \right] .   (5-74)
By making the number N of averaged samples large enough, we can “average out the noise” and
make (5-74) arbitrarily small.
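A simulation sketch of Example (5-10) with hypothetical numbers (σ0 = 1, σ = 3, N = 25): the averaged-data estimator (5-73) should achieve the reduced error variance (5-74), well below the single-sample value from (5-69).

```python
import numpy as np

# Simulation of the multiple-sample Bayesian estimator, hypothetical
# parameters: N noise samples are averaged before estimating theta.
rng = np.random.default_rng(6)
theta0, s0, s, N, M = 0.0, 1.0, 3.0, 25, 200_000
theta = rng.normal(theta0, s0, size=M)
zbar = theta + rng.normal(0.0, s, size=(N, M)).mean(axis=0)   # Eq. (5-70)
gain = s0**2 / (s0**2 + s**2 / N)
theta_b = theta0 + gain * (zbar - theta0)                     # Eq. (5-73)
err_var = np.var(theta_b - theta)
pred = s0**2 * (s**2 / N) / (s0**2 + s**2 / N)                # Eq. (5-74)
assert abs(err_var - pred) < 0.01
print(pred < s0**2 * s**2 / (s0**2 + s**2))   # beats the N = 1 case
```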
Conditional Multidimensional Gaussian Density
Let X be an n × 1 Gaussian random vector with E[X] = 0 and a positive definite n × n
covariance matrix Λ_X. Likewise, define Y as a zero-mean, m × 1 Gaussian random vector with
m × m positive definite covariance matrix Λ_Y. Also, define the n × m matrix Λ_XY = E[XY^T]; note
that Λ_XY^T = Λ_YX = E[YX^T], an m × n matrix. Find the conditional density f(X|Y).
First, define the (n+m) × 1 “super vector”
Z = \begin{bmatrix} X \\ Y \end{bmatrix} ,   (5-75)
which is obtained by “stacking” X on top of Y . The (n+m) × (n+m) covariance matrix for Z
is
\Lambda_Z = E[Z Z^T] = E\left[ \begin{bmatrix} X \\ Y \end{bmatrix} \begin{bmatrix} X^T & Y^T \end{bmatrix} \right] = \begin{bmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{bmatrix} .   (5-76)
The inverse of this matrix can be expressed as (observe that \Lambda_Z \Lambda_Z^{-1} = I)
\Lambda_Z^{-1} = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} ,   (5-77)
where A is n×n, B is n×m and C is m×m. These intermediate block matrices are given by
A = (\Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX})^{-1} = \Lambda_X^{-1}[\,I + \Lambda_{XY}\, C\, \Lambda_{YX}\Lambda_X^{-1}\,]

B = -A\,\Lambda_{XY}\Lambda_Y^{-1} = -\Lambda_X^{-1}\Lambda_{XY}\, C

C = (\Lambda_Y - \Lambda_{YX}\Lambda_X^{-1}\Lambda_{XY})^{-1} = \Lambda_Y^{-1}[\,I + \Lambda_{YX}\, A\, \Lambda_{XY}\Lambda_Y^{-1}\,] .   (5-78)
Now, the joint density is
f_{XY}(X, Y) = \frac{1}{(2\pi)^{(n+m)/2}\, |\Lambda_Z|^{1/2}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} \right\} .   (5-79)
The marginal density is
f_Y(Y) = \frac{1}{(2\pi)^{m/2}\, |\Lambda_Y|^{1/2}} \exp\left\{ -\frac{1}{2}\, Y^T \Lambda_Y^{-1} Y \right\} .   (5-80)
From Bayes Theorem for densities
f(X \mid Y) = \frac{f_{XY}(X, Y)}{f_Y(Y)} = \frac{|\Lambda_Y|^{1/2}}{(2\pi)^{n/2}\, |\Lambda_Z|^{1/2}} \exp\left\{ -\frac{1}{2} \left( \begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} - Y^T \Lambda_Y^{-1} Y \right) \right\} .   (5-81)
However, straightforward but tedious matrix algebra yields
\begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} - Y^T \Lambda_Y^{-1} Y = \begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} AX + BY \\ B^T X + CY \end{bmatrix} - Y^T \Lambda_Y^{-1} Y

= X^T A X + X^T B Y + Y^T B^T X + Y^T [\,C - \Lambda_Y^{-1}\,] Y

= X^T A X + 2 X^T B Y + Y^T [\,C - \Lambda_Y^{-1}\,] Y .   (5-82)
(Note that the scalar identity X^T B Y = Y^T B^T X was used in obtaining this result). From the
previous page, use the results B = -A\Lambda_{XY}\Lambda_Y^{-1} and C - \Lambda_Y^{-1} = \Lambda_Y^{-1}\Lambda_{YX}\, A\, \Lambda_{XY}\Lambda_Y^{-1} to write

\begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} - Y^T \Lambda_Y^{-1} Y = X^T A X - 2 X^T A \Lambda_{XY}\Lambda_Y^{-1} Y + Y^T \Lambda_Y^{-1}\Lambda_{YX}\, A\, \Lambda_{XY}\Lambda_Y^{-1} Y

= (X - \Lambda_{XY}\Lambda_Y^{-1} Y)^T\, A\, (X - \Lambda_{XY}\Lambda_Y^{-1} Y) .   (5-83)
To simplify the notation, define
M \equiv \Lambda_{XY}\Lambda_Y^{-1} Y \quad \text{(an } n \times 1 \text{ vector)}

Q \equiv A^{-1} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX} \quad \text{(an } n \times n \text{ matrix)}   (5-84)
so that the quadratic form becomes
\begin{bmatrix} X^T & Y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} - Y^T \Lambda_Y^{-1} Y = (X - M)^T Q^{-1} (X - M) .   (5-85)
Now, we must find the quotient |\Lambda_Z| / |\Lambda_Y|. Write

\Lambda_Z = \begin{bmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{bmatrix} = \begin{bmatrix} \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX} & \Lambda_{XY}\Lambda_Y^{-1} \\ 0 & I_m \end{bmatrix} \begin{bmatrix} I_n & 0 \\ \Lambda_{YX} & \Lambda_Y \end{bmatrix} .   (5-86)
Im is the m × m identity matrix and In is the n × n identity matrix. Hence,
|\Lambda_Z| = |\Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}|\; |\Lambda_Y|   (5-87)
\frac{|\Lambda_Z|}{|\Lambda_Y|} = |\Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}| = |Q| .   (5-88)
Use Equations (5-85) and (5-88) in f(X|Y) to obtain

f(X \mid Y) = \frac{1}{(2\pi)^{n/2}\, |Q|^{1/2}} \exp\left[ -\frac{1}{2} (X - M)^T Q^{-1} (X - M) \right] ,   (5-89)
where
M \equiv \Lambda_{XY}\Lambda_Y^{-1} Y \quad \text{(an } n \times 1 \text{ vector)}

Q \equiv A^{-1} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX} \quad \text{(an } n \times n \text{ matrix)} .   (5-90)
Vector M = E[X|Y] is the conditional expectation vector. Matrix Q = E[(X − M)(X − M)^T | Y]
is the conditional covariance matrix.
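The formulas (5-89)/(5-90) are easy to exercise numerically; a sketch for one small hypothetical case with n = 2, m = 1 and covariance blocks chosen by hand (all numbers below are illustrative assumptions, picked so that the full covariance matrix is positive definite):

```python
import numpy as np

# Hypothetical n = 2, m = 1 case: compute the conditional mean M and
# conditional covariance Q from the block formulas of Eq. (5-90).
Lx  = np.array([[2.0, 0.5], [0.5, 1.0]])   # Lambda_X  (2 x 2)
Lxy = np.array([[0.8], [0.3]])             # Lambda_XY (2 x 1)
Ly  = np.array([[1.5]])                    # Lambda_Y  (1 x 1)
Y   = np.array([[2.0]])                    # observed value of Y
M = Lxy @ np.linalg.inv(Ly) @ Y                # conditional mean, (5-90)
Q = Lx - Lxy @ np.linalg.inv(Ly) @ Lxy.T       # conditional covariance
assert np.allclose(Q, Q.T)                     # Q is symmetric ...
assert np.all(np.linalg.eigvalsh(Q) > 0)       # ... and positive definite
print(M.ravel())
```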
Generalizations to Nonzero Mean Case
Suppose E[X] = M_X and E[Y] = M_Y; then
f(X \mid Y) = \frac{1}{(2\pi)^{n/2}\, |Q|^{1/2}} \exp\left[ -\frac{1}{2} (X - M)^T Q^{-1} (X - M) \right] ,   (5-91)
where
M \equiv E[X \mid Y] = M_X + \Lambda_{XY}\Lambda_Y^{-1} (Y - M_Y) \quad \text{(an } n \times 1 \text{ vector)}

Q \equiv E[(X - M)(X - M)^T \mid Y] = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX} \quad \text{(an } n \times n \text{ matrix)} .   (5-92)