Gaussians


Transcript of Gaussians


Copyright © Andrew W. Moore Slide 1

Gaussians

Andrew W. Moore
Professor, School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © Andrew W. Moore Slide 2

Gaussians in Data Mining
• Why we should care
• The entropy of a PDF
• Univariate Gaussians
• Multivariate Gaussians
• Bayes Rule and Gaussians
• Maximum Likelihood and MAP using Gaussians


Copyright © Andrew W. Moore Slide 3

Why we should care
• Gaussians are as natural as Orange Juice and Sunshine
• We need them to understand Bayes Optimal Classifiers
• We need them to understand regression
• We need them to understand neural nets
• We need them to understand mixture models
• … (You get the idea)

Copyright © Andrew W. Moore Slide 4

The “box” distribution

$$p(x) = \begin{cases} 1/w & \text{if } |x| \le w/2 \\ 0 & \text{if } |x| > w/2 \end{cases}$$

[Figure: box PDF of height $1/w$ on the interval $(-w/2, w/2)$.]


Copyright © Andrew W. Moore Slide 5

The “box” distribution

$$p(x) = \begin{cases} 1/w & \text{if } |x| \le w/2 \\ 0 & \text{if } |x| > w/2 \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{12}$$

[Figure: the same box PDF of height $1/w$ on $(-w/2, w/2)$.]

Copyright © Andrew W. Moore Slide 6

Entropy of a PDF

$$H[X] = \text{Entropy of } X = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx$$

(using the natural log, $\ln$ or $\log_e$)

The larger the entropy of a distribution…

…the harder it is to predict

…the harder it is to compress it

…the less spiky the distribution
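As a quick numerical check of this definition — a minimal sketch, assuming NumPy and SciPy are available — the snippet below integrates $-p(x)\ln p(x)$ over the support of a box distribution and recovers $\log w$, the closed form derived on the next slide:

```python
import numpy as np
from scipy.integrate import quad

def entropy(p, lo, hi):
    """Differential entropy H[X] = -integral of p(x) ln p(x) dx over the support."""
    return quad(lambda x: -p(x) * np.log(p(x)), lo, hi)[0]

w = 2.0
box = lambda x: 1.0 / w          # p(x) = 1/w for |x| <= w/2, 0 elsewhere
print(entropy(box, -w / 2, w / 2), np.log(w))   # both ~ 0.6931
```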


Copyright © Andrew W. Moore Slide 7

The “box” distribution

$$p(x) = \begin{cases} 1/w & \text{if } |x| \le w/2 \\ 0 & \text{if } |x| > w/2 \end{cases}$$

$$H[X] = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx = -\int_{x=-w/2}^{w/2} \frac{1}{w}\log\frac{1}{w}\,dx = -\log\frac{1}{w} = \log w$$

−∞=

Copyright © Andrew W. Moore Slide 8

Unit variance box distribution

$$p(x) = \begin{cases} 1/w & \text{if } |x| \le w/2 \\ 0 & \text{if } |x| > w/2 \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{12}$$

if $w = 2\sqrt{3}$ then $\mathrm{Var}[X] = 1$ and $H[X] = 1.242$

[Figure: box PDF of height $1/(2\sqrt{3})$ on $(-\sqrt{3}, \sqrt{3})$.]


Copyright © Andrew W. Moore Slide 9

The Hat distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

[Figure: triangular PDF with peak $1/w$ at $0$ and support $(-w, w)$.]

Copyright © Andrew W. Moore Slide 10

Unit variance hat distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

if $w = \sqrt{6}$ then $\mathrm{Var}[X] = 1$ and $H[X] = 1.396$

[Figure: hat PDF with peak $1/\sqrt{6}$ at $0$ and support $(-\sqrt{6}, \sqrt{6})$.]


Copyright © Andrew W. Moore Slide 11

The “2 spikes” distribution

$$p(x) = \frac{\delta(x-1) + \delta(x+1)}{2} \qquad \text{($\delta$ = Dirac delta)}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = 1$$

$$H[X] = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx = -\infty$$

[Figure: spikes of mass $\tfrac{1}{2}\delta(x+1)$ and $\tfrac{1}{2}\delta(x-1)$ at $x = -1$ and $x = 1$.]

Copyright © Andrew W. Moore Slide 12

Entropies of unit-variance distributions

| Distribution | Entropy |
|--------------|---------|
| Box          | 1.242   |
| Hat          | 1.396   |
| 2 spikes     | −∞      |
| ???          | 1.4189  |

“???” marks the largest possible entropy of any unit-variance distribution.


Copyright © Andrew W. Moore Slide 13

Unit variance Gaussian

$$p(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = 1$$

$$H[X] = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx = 1.4189$$
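The table above can be reproduced numerically — a minimal sketch, assuming SciPy, using the unit-variance widths $w = 2\sqrt{3}$ and $w = \sqrt{6}$ from the earlier slides:

```python
import numpy as np
from scipy.integrate import quad

def entropy(p, lo, hi):
    """H[X] = -integral of p ln p over the support (natural log)."""
    return quad(lambda x: -p(x) * np.log(p(x)), lo, hi)[0]

w_box, w_hat = 2 * np.sqrt(3), np.sqrt(6)       # widths giving Var[X] = 1
box = lambda x: 1.0 / w_box
hat = lambda x: (w_hat - abs(x)) / w_hat**2
gauss = lambda x: np.exp(-x * x / 2) / np.sqrt(2 * np.pi)

print(entropy(box, -w_box / 2, w_box / 2))      # 1.2425 = ln(2*sqrt(3))
print(entropy(hat, -w_hat, w_hat))              # 1.3959 = 0.5 + ln(sqrt(6))
print(entropy(gauss, -12, 12))                  # 1.4189 = 0.5 * ln(2*pi*e)
```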

Copyright © Andrew W. Moore Slide 14

General Gaussian

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$

[Figure: Gaussian PDF with $\mu = 100$, $\sigma = 15$.]


Copyright © Andrew W. Moore Slide 15

General Gaussian

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$

Shorthand: We say X ~ N(μ, σ²) to mean “X is distributed as a Gaussian with parameters μ and σ²”.

In the above figure, X ~ N(100, 15²).

Also known as the normal distribution or Bell-shaped curve.

Copyright © Andrew W. Moore Slide 16

The Error Function

Assume X ~ N(0,1).

Define ERF(x) = P(X < x) = cumulative distribution of X:

$$\mathrm{ERF}(x) = \int_{z=-\infty}^{x} p(z)\,dz = \int_{z=-\infty}^{x} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)dz$$


Copyright © Andrew W. Moore Slide 17

Using The Error Function

Assume X ~ N(μ, σ²). Then

$$P(X < x \mid \mu, \sigma^2) = \mathrm{ERF}\!\left(\frac{x-\mu}{\sigma}\right)$$
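In code, the deck’s ERF is just the standard normal CDF (often written Φ). A minimal sketch, assuming SciPy — note that scipy.special.erf is a differently scaled function, with Φ(x) = ½(1 + erf(x/√2)):

```python
from scipy.stats import norm

mu, sigma = 100, 15
x = 130
print(norm.cdf((x - mu) / sigma))        # P(X < 130 | mu=100, sigma=15) ~ 0.977
print(norm.cdf(x, loc=mu, scale=sigma))  # same value, letting scipy standardize
```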

Copyright © Andrew W. Moore Slide 18

The Central Limit Theorem
• If (X1, X2, … Xn) are i.i.d. continuous random variables
• Then define

$$z_n = f(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• As n → ∞, p(z) → Gaussian with mean E[Xi] and variance Var[Xi]/n

Somewhat of a justification for assuming Gaussian noise is common
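A quick simulation sketch of the theorem, assuming NumPy (Uniform(0,1) draws chosen arbitrarily as the non-Gaussian Xi):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000

# Each z is the mean of n i.i.d. Uniform(0,1) draws: E[Xi] = 0.5, Var[Xi] = 1/12.
z = rng.random((trials, n)).mean(axis=1)

print(z.mean())   # ~ 0.5                  (E[Xi])
print(z.var())    # ~ 1/(12*n) = 0.000833  (Var[Xi]/n)
# A histogram of z is visually indistinguishable from the fitted Gaussian.
```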


Copyright © Andrew W. Moore Slide 19

Other amazing facts about Gaussians

• Wouldn’t you like to know?

• We will not examine them until we need to.

Copyright © Andrew W. Moore Slide 20

Bivariate Gaussians

Write r.v. $\mathbf{X} = \begin{pmatrix} X \\ Y \end{pmatrix}$

Then define $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ to mean

$$p(\mathbf{x}) = \frac{1}{2\pi\,\|\Sigma\|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Where the Gaussian’s parameters are…

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$

Where we insist that Σ is symmetric non-negative definite


Copyright © Andrew W. Moore Slide 21

Bivariate Gaussians

(Same formula and parameters as Slide 20, plus:)

It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)*

*This note rates 7.4 on the pedanticness scale

Copyright © Andrew W. Moore Slide 22

Evaluating p(x): Step 1

$$p(\mathbf{x}) = \frac{1}{2\pi\,\|\Sigma\|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

1. Begin with vector x

[Figure: the point x and the mean μ in the plane.]


Copyright © Andrew W. Moore Slide 23

Evaluating p(x): Step 2

1. Begin with vector x
2. Define δ = x − μ

[Figure: x, μ, and the difference vector δ.]

Copyright © Andrew W. Moore Slide 24

Evaluating p(x): Step 3

1. Begin with vector x
2. Define δ = x − μ
3. Count the number of contours crossed of the ellipsoids formed by Σ⁻¹. D = this count = $\sqrt{\delta^T\Sigma^{-1}\delta}$ = Mahalanobis distance between x and μ

[Figure: contours defined by $\sqrt{\delta^T\Sigma^{-1}\delta}$ = constant.]


Copyright © Andrew W. Moore Slide 25

Evaluating p(x): Step 4

1. Begin with vector x
2. Define δ = x − μ
3. D = $\sqrt{\delta^T\Sigma^{-1}\delta}$ = Mahalanobis distance between x and μ
4. Define $w = \exp(-D^2/2)$

x close to μ in squared Mahalanobis distance gets a large weight; far away gets a tiny weight.

[Figure: $w = \exp(-D^2/2)$ plotted against $D^2$.]

Copyright © Andrew W. Moore Slide 26

Evaluating p(x): Step 5

1. Begin with vector x
2. Define δ = x − μ
3. D = $\sqrt{\delta^T\Sigma^{-1}\delta}$ = Mahalanobis distance between x and μ
4. Define $w = \exp(-D^2/2)$
5. Multiply w by $\dfrac{1}{2\pi\,\|\Sigma\|^{1/2}}$ to ensure $\int p(\mathbf{x})\,d\mathbf{x} = 1$
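The five steps translate directly into code. A minimal sketch, assuming NumPy and SciPy (the example μ, Σ, and x are made-up numbers), checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate p(x) for x ~ N(mu, Sigma), following the five steps above."""
    delta = x - mu                                # step 2
    D2 = delta @ np.linalg.inv(Sigma) @ delta     # step 3 (squared Mahalanobis)
    w = np.exp(-D2 / 2)                           # step 4
    m = len(mu)
    norm_const = 1.0 / ((2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * w                         # step 5

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([1.0, -0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))      # same value
```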


Copyright © Andrew W. Moore Slide 27

Example

Common convention: show the contour corresponding to 2 standard deviations from the mean

Observe: Mean, Principal axes, implication of off-diagonal covariance term, max gradient zone of p(x)

Copyright © Andrew W. Moore Slide 28

Example


Copyright © Andrew W. Moore Slide 29

Example

In this example, x and y are almost independent

Copyright © Andrew W. Moore Slide 30

Example

In this example, x and “x+y” are clearly not independent


Copyright © Andrew W. Moore Slide 31

Example

In this example, x and “20x+y” are clearly not independent

Copyright © Andrew W. Moore Slide 32

Multivariate Gaussians

Write r.v. $\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$

Then define $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ to mean

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}\,\|\Sigma\|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Where the Gaussian’s parameters have…

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

Where we insist that Σ is symmetric non-negative definite. Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)


Copyright © Andrew W. Moore Slide 33

General Gaussians

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

[Figure: tilted elliptical contours over axes $x_1$, $x_2$.]

Copyright © Andrew W. Moore Slide 34

Axis-Aligned Gaussians

$X_i \perp X_j$ for $i \ne j$

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_m^2 \end{pmatrix}$$

[Figure: axis-aligned elliptical contours over $x_1$, $x_2$.]


Copyright © Andrew W. Moore Slide 35

Spherical Gaussians

$X_i \perp X_j$ for $i \ne j$

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

[Figure: circular contours over $x_1$, $x_2$.]

Copyright © Andrew W. Moore Slide 36

Degenerate Gaussians

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad \|\Sigma\| = 0$$

[Figure: a degenerate Gaussian over axes $x_1$, $x_2$.]

What’s so wrong with clipping one’s toenails in public?


Copyright © Andrew W. Moore Slide 37

Where are we now?
• We’ve seen the formulae for Gaussians
• We have an intuition of how they behave
• We have some experience of “reading” a Gaussian’s covariance matrix
• Coming next: Some useful tricks with Gaussians

Copyright © Andrew W. Moore Slide 38

Subsets of variables

Write $\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$ as $\mathbf{X} = \begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}$, where $\mathbf{U} = \begin{pmatrix} X_1 \\ \vdots \\ X_u \end{pmatrix}$ and $\mathbf{V} = \begin{pmatrix} X_{u+1} \\ \vdots \\ X_m \end{pmatrix}$

This will be our standard notation for breaking an m-dimensional distribution into subsets of variables


Copyright © Andrew W. Moore Slide 39

Gaussian Marginals are Gaussian

Marginalize: $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \longrightarrow \mathbf{U}$

IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix}\right)$

THEN U is also distributed as a Gaussian: $\mathbf{U} \sim N(\boldsymbol{\mu}_u, \Sigma_{uu})$

Copyright © Andrew W. Moore Slide 40

Gaussian Marginals are Gaussian

(Same set-up and result as Slide 39, with two remarks:)

This fact is not immediately obvious

Obvious, once we know it’s a Gaussian (why?)


Copyright © Andrew W. Moore Slide 41

Gaussian Marginals are Gaussian

(Same set-up and result as Slide 39.)

How would you prove this? (snore…)

$$p(\mathbf{u}) = \int_{\mathbf{v}} p(\mathbf{u}, \mathbf{v})\,d\mathbf{v}$$
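In code, marginalizing a Gaussian is nothing more than slicing μ and Σ. A sampling sketch, assuming NumPy (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])

# Marginalizing U = (X1, X2) out of (X1, X2, X3) is just slicing mu and Sigma.
mu_u, Sigma_uu = mu[:2], Sigma[:2, :2]

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples[:, :2].mean(axis=0))   # ~ mu_u
print(np.cov(samples[:, :2].T))      # ~ Sigma_uu
```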

Copyright © Andrew W. Moore Slide 42

Linear Transforms remain Gaussian

Matrix multiply: $\mathbf{X} \longrightarrow \mathbf{AX}$

Assume X is an m-dimensional Gaussian r.v.: $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$

Define Y to be a p-dimensional r.v. thusly (note $p \le m$):

$$\mathbf{Y} = \mathbf{AX}$$

…where A is a p × m matrix. Then…

$$\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\Sigma\mathbf{A}^T)$$

Note: the “subset” result is a special case of this result
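A sampling check of this result — a minimal sketch, assuming NumPy, with made-up μ, Σ, and A:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 2.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],       # p x m with p=2, m=3
              [0.5,  0.5, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=300_000) @ A.T
print(Y.mean(axis=0))    # ~ A @ mu
print(np.cov(Y.T))       # ~ A @ Sigma @ A.T
```

Taking A = (I 0) selects the subset U, which is why the marginalization result is a special case.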


Copyright © Andrew W. Moore Slide 43

Adding samples of 2 independent Gaussians is Gaussian

Add: $\mathbf{X}, \mathbf{Y} \longrightarrow \mathbf{X} + \mathbf{Y}$

if $\mathbf{X} \sim N(\boldsymbol{\mu}_x, \Sigma_x)$ and $\mathbf{Y} \sim N(\boldsymbol{\mu}_y, \Sigma_y)$ and $\mathbf{X} \perp \mathbf{Y}$

then $\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y, \Sigma_x + \Sigma_y)$

Why doesn’t this hold if X and Y are dependent?

Which of the below statements is true?
• If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance
• If X and Y are dependent, then X+Y might be non-Gaussian

Copyright © Andrew W. Moore Slide 44

Conditional of Gaussian is Gaussian

Conditionalize: $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \longrightarrow \mathbf{U} \mid \mathbf{V}$

IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix}\right)$

THEN $\mathbf{U} \mid \mathbf{V} \sim N(\boldsymbol{\mu}_{u|v}, \Sigma_{u|v})$, where

$$\boldsymbol{\mu}_{u|v} = \boldsymbol{\mu}_u + \Sigma_{uv}\Sigma_{vv}^{-1}(\mathbf{V} - \boldsymbol{\mu}_v)$$

$$\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{uv}^T$$
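A sketch of these two formulas, assuming NumPy (the condition() helper and its example numbers are illustrative, not from the slides):

```python
import numpy as np

def condition(mu, Sigma, u_idx, v_idx, v_value):
    """Return (mu_{u|v}, Sigma_{u|v}) for a joint Gaussian N(mu, Sigma)."""
    mu_u, mu_v = mu[u_idx], mu[v_idx]
    S_uu = Sigma[np.ix_(u_idx, u_idx)]
    S_uv = Sigma[np.ix_(u_idx, v_idx)]
    S_vv = Sigma[np.ix_(v_idx, v_idx)]
    gain = S_uv @ np.linalg.inv(S_vv)        # Sigma_uv Sigma_vv^{-1}
    mu_cond = mu_u + gain @ (v_value - mu_v)
    Sigma_cond = S_uu - gain @ S_uv.T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.7, 0.4],
                  [0.7, 1.0, 0.3],
                  [0.4, 0.3, 1.5]])
print(condition(mu, Sigma, [0], [1, 2], np.array([1.5, 1.0])))
```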


Copyright © Andrew W. Moore Slide 45

(The conditioning result of Slide 44, instantiated on a bivariate example over $(w, y)$ with $\mu_w = 77$, $\mu_y = 76$, and $\sigma_y = 3.68$:)

$$\mu_{w|y} = 77 + \frac{\sigma_{wy}}{3.68^2}\,(y - 76) \qquad \Sigma_{w|y} = \sigma_w^2 - \frac{\sigma_{wy}^2}{3.68^2}$$

Copyright © Andrew W. Moore Slide 46

(Same example as Slide 45.)

[Figure: the marginal P(w) and the conditionals P(w|m=76) and P(w|m=82).]


Copyright © Andrew W. Moore Slide 47

(Same example and figure as Slide 46.)

Note: conditional variance is independent of the given value of v
Note: conditional variance can only be equal to or smaller than marginal variance
Note: conditional mean is a linear function of v
Note: when the given value of v is μv, the conditional mean of u is μu

Copyright © Andrew W. Moore Slide 48

Gaussians and the chain rule

Chain rule: $\mathbf{U}\mid\mathbf{V},\; \mathbf{V} \longrightarrow \begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}$

Let A be a constant matrix.

IF $\mathbf{U}\mid\mathbf{V} \sim N(\mathbf{AV}, \Sigma_{u|v})$ and $\mathbf{V} \sim N(\boldsymbol{\mu}_v, \Sigma_{vv})$

THEN $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N(\boldsymbol{\mu}, \Sigma)$, with

$$\boldsymbol{\mu} = \begin{pmatrix} \mathbf{A}\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_v \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \mathbf{A}\Sigma_{vv}\mathbf{A}^T + \Sigma_{u|v} & \mathbf{A}\Sigma_{vv} \\ \Sigma_{vv}\mathbf{A}^T & \Sigma_{vv} \end{pmatrix}$$
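A sketch of assembling the joint from the pieces, assuming NumPy (chain_rule_joint and the 1-d numbers are illustrative):

```python
import numpy as np

def chain_rule_joint(A, Sigma_ugv, mu_v, Sigma_vv):
    """Build the joint N(mu, Sigma) over (U, V) from U|V ~ N(AV, Sigma_ugv)
    and V ~ N(mu_v, Sigma_vv)."""
    mu = np.concatenate([A @ mu_v, mu_v])
    top = np.hstack([A @ Sigma_vv @ A.T + Sigma_ugv, A @ Sigma_vv])
    bottom = np.hstack([Sigma_vv @ A.T, Sigma_vv])
    return mu, np.vstack([top, bottom])

A = np.array([[2.0]])                  # 1-d example: U|V ~ N(2V, 0.5)
mu, Sigma = chain_rule_joint(A, np.array([[0.5]]),
                             np.array([3.0]), np.array([[1.0]]))
print(mu)      # [6. 3.]
print(Sigma)   # [[4.5 2. ], [2.  1. ]]
```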


Copyright © Andrew W. Moore Slide 49

Available Gaussian tools

• Marginalize: IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix}\right)$ THEN $\mathbf{U} \sim N(\boldsymbol{\mu}_u, \Sigma_{uu})$

• Matrix multiply: IF $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ AND $\mathbf{Y} = \mathbf{AX}$ THEN $\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\Sigma\mathbf{A}^T)$

• Add: if $\mathbf{X} \sim N(\boldsymbol{\mu}_x, \Sigma_x)$ and $\mathbf{Y} \sim N(\boldsymbol{\mu}_y, \Sigma_y)$ and $\mathbf{X} \perp \mathbf{Y}$, then $\mathbf{X}+\mathbf{Y} \sim N(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y, \Sigma_x + \Sigma_y)$

• Conditionalize: IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}$ is as in Marginalize, THEN $\mathbf{U}\mid\mathbf{V} \sim N(\boldsymbol{\mu}_{u|v}, \Sigma_{u|v})$, where $\boldsymbol{\mu}_{u|v} = \boldsymbol{\mu}_u + \Sigma_{uv}\Sigma_{vv}^{-1}(\mathbf{V}-\boldsymbol{\mu}_v)$ and $\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{uv}^T$

• Chain rule: IF $\mathbf{U}\mid\mathbf{V} \sim N(\mathbf{AV}, \Sigma_{u|v})$ and $\mathbf{V} \sim N(\boldsymbol{\mu}_v, \Sigma_{vv})$ THEN $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = \begin{pmatrix} \mathbf{A}\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_v \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \mathbf{A}\Sigma_{vv}\mathbf{A}^T + \Sigma_{u|v} & \mathbf{A}\Sigma_{vv} \\ \Sigma_{vv}\mathbf{A}^T & \Sigma_{vv} \end{pmatrix}$

Copyright © Andrew W. Moore Slide 50

Assume…
• You are an intellectual snob
• You have a child


Copyright © Andrew W. Moore Slide 51

Intellectual snobs with children
• …are obsessed with IQ
• In the world as a whole, IQs are drawn from a Gaussian N(100, 15²)

Copyright © Andrew W. Moore Slide 52

IQ tests
• If you take an IQ test you’ll get a score that, on average (over many tests), will be your IQ
• But because of noise on any one test the score will often be a few points lower or higher than your true IQ.

SCORE | IQ ~ N(IQ, 10²)


Copyright © Andrew W. Moore Slide 53

Assume…
• You drag your kid off to get tested
• She gets a score of 130
• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

P(X<130 | μ=100, σ²=15²) = P(X<2 | μ=0, σ²=1) = ERF(2) = 0.977

Copyright © Andrew W. Moore Slide 54

Assume…
• You drag your kid off to get tested
• She gets a score of 130
• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

P(X<130 | μ=100, σ²=15²) = P(X<2 | μ=0, σ²=1) = ERF(2) = 0.977

You are thinking:

Well sure the test isn’t accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence “score=130” is, of course, 130.

Can we trust this reasoning?


Copyright © Andrew W. Moore Slide 55

Maximum Likelihood IQ
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• The MLE is the value of the hidden parameter that makes the observed data most likely
• In this case:

$$IQ^{mle} = \arg\max_{iq}\; p(s = 130 \mid iq) = 130$$

Copyright © Andrew W. Moore Slide 56

BUT….
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• The MLE is the value of the hidden parameter that makes the observed data most likely
• In this case:

$$IQ^{mle} = \arg\max_{iq}\; p(s = 130 \mid iq) = 130$$

This is not the same as “The most likely value of the parameter given the observed data”.


Copyright © Andrew W. Moore Slide 57

What we really want:
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?

Called the Posterior Distribution of IQ

Copyright © Andrew W. Moore Slide 58

Which tool or tools?
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?

(Candidate tools: Marginalize, Matrix multiply, Add, Conditionalize, Chain rule.)


Copyright © Andrew W. Moore Slide 59

Plan
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?

Chain rule: S|IQ, IQ → (S; IQ). Swap: (S; IQ) → (IQ; S). Conditionalize: (IQ; S) → IQ | S.

Copyright © Andrew W. Moore Slide 60

Working…
• IQ ~ N(100, 15²)
• S|IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?

Apply the chain rule: IF $\mathbf{U}\mid\mathbf{V} \sim N(\mathbf{AV}, \Sigma_{u|v})$ and $\mathbf{V} \sim N(\boldsymbol{\mu}_v, \Sigma_{vv})$ THEN $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mathbf{A}\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \mathbf{A}\Sigma_{vv}\mathbf{A}^T + \Sigma_{u|v} & \mathbf{A}\Sigma_{vv} \\ \Sigma_{vv}\mathbf{A}^T & \Sigma_{vv} \end{pmatrix}\right)$

Here U = S, V = IQ, and A = 1, giving $\begin{pmatrix} S \\ IQ \end{pmatrix} \sim N\!\left(\begin{pmatrix} 100 \\ 100 \end{pmatrix}, \begin{pmatrix} 15^2 + 10^2 & 15^2 \\ 15^2 & 15^2 \end{pmatrix}\right)$; swapping and conditionalizing on S = 130 finishes the job, as sketched below.
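Carrying the plan to the end numerically — a minimal sketch, assuming NumPy; plugging in gives a posterior of roughly N(120.8, 8.3²), not 130:

```python
import numpy as np

# Chain rule with U = S, V = IQ, A = 1:
# joint over (S, IQ) is N((100, 100), [[15^2 + 10^2, 15^2], [15^2, 15^2]])
mu = np.array([100.0, 100.0])
Sigma = np.array([[325.0, 225.0],
                  [225.0, 225.0]])

# Conditionalize IQ | S = 130 (after the swap, u = IQ and v = S):
s = 130.0
mu_post = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (s - mu[0])
var_post = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
print(mu_post, var_post)   # ~ 120.77 and ~ 69.23 (sd ~ 8.3)
```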


Copyright © Andrew W. Moore Slide 61

Your pride and joy’s posterior IQ
• If you did the working, you now have p(IQ | S=130)
• If you have to give the most likely IQ given the score you should give

$$IQ^{map} = \arg\max_{iq}\; p(iq \mid s = 130)$$

• where MAP means “Maximum A-posteriori”

Copyright © Andrew W. Moore Slide 62

What you should know
• The Gaussian PDF formula off by heart
• Understand the workings of the formula for a Gaussian
• Be able to understand the Gaussian tools described so far
• Have a rough idea of how you could prove them
• Be happy with how you could use them