Lecture 8: Graphical Models II
Machine Learning
Andrew Rosenberg
March 5, 2010
Today
Graphical Models
Naive Bayes classification
Conditional Probability Tables (CPTs)
Inference in Graphical Models and Belief Propagation
Recap of Graphical Models
Graphical Models
Graphical representation of the dependency relationships between random variables.
[Figure: directed graph over x0, ..., x5]
Topological Graph
Graphical models factorize probabilities
p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \mathrm{pa}_i) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)

Nodes are generally topologically ordered so that parents π_i come before children.
[Figure: directed graph over x0, ..., x5]
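As a sketch of this factorization, here is a minimal example on a hypothetical three-node chain x0 → x1 → x2 with binary variables; the CPT numbers are made up for illustration.

```python
# cpt[i] maps (value of x_i, tuple of parent values) -> p(x_i | parents).
# Toy CPTs for a chain x0 -> x1 -> x2 (hypothetical numbers).
cpt = {
    0: {(0, ()): 0.4, (1, ()): 0.6},
    1: {(0, (0,)): 0.7, (1, (0,)): 0.3, (0, (1,)): 0.2, (1, (1,)): 0.8},
    2: {(0, (0,)): 0.5, (1, (0,)): 0.5, (0, (1,)): 0.9, (1, (1,)): 0.1},
}
parents = {0: (), 1: (0,), 2: (1,)}

def joint(x, cpt, parents):
    """Evaluate the factored joint: multiply one CPT entry per node."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpt[i][(x[i], tuple(x[j] for j in pa))]
    return p

# Because each factor is a proper conditional, the joint sums to 1.
total = sum(joint((a, b, c), cpt, parents)
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
```
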
Plate Notation of a Graphical Model
Recall the Naive Bayes Graphical Model
[Figure: naive Bayes graph: y with children x0, ..., xn]
There can be many variables x_i. Plate notation gives a compact representation of models like this:
[Figure: plate notation: node y pointing to x_i inside a plate repeated n times]
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: y with children x0, ..., x4]
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(fev|flu):
       L    M    H
   Y  .33  .33  .33
   N   0    1    0
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(sinus|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(ache|flu):
       Y    N
   Y  .33  .67
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(swell|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(head|flu):
       Y    N
   Y  .67  .33
   N   0    1
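The counting behind these tables can be sketched in a few lines; the rows below are the four training examples from the slides, and `cpt` is a hypothetical helper name.

```python
from collections import Counter

# The four training rows: (flu, fever, sinus, ache, swell, head).
data = [
    ("Y", "L", "Y", "Y", "Y", "N"),
    ("N", "M", "N", "N", "N", "N"),
    ("Y", "H", "N", "N", "Y", "Y"),
    ("Y", "M", "Y", "N", "N", "Y"),
]

def cpt(col, data):
    """Estimate p(x_col | flu): count (flu, value) pairs, divide by flu counts."""
    pair = Counter((row[0], row[col]) for row in data)
    flu = Counter(row[0] for row in data)
    return {(f, v): c / flu[f] for (f, v), c in pair.items()}

p_flu = Counter(row[0] for row in data)  # 3 Y, 1 N -> p(flu=Y) = .75
p_sinus = cpt(2, data)  # p(sinus=Y | flu=Y) = 2/3, p(sinus=N | flu=N) = 1
```
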
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
p(flu=Y) * p(fev=M|flu=Y) * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(flu):
   Y    N
  .75  .25
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * p(fev=M|flu=Y) * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(fev|flu):
       L    M    H
   Y  .33  .33  .33
   N   0    1    0
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(sinus|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(ache|flu):
       Y    N
   Y  .33  .67
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(swell|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * .33 * p(head=N|flu=Y)
p(head|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * .33 * .33 = 0.0060
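The product above is the unnormalized score for flu = Y. A sketch that also scores flu = N and normalizes; the flu = N conditionals are read directly off the data (the single flu = N row matches the query on every symptom, so they are all 1):

```python
# Naive Bayes scoring for the query (fever=M, sinus=N, ache=N, swell=N, head=N).
prior = {"Y": 0.75, "N": 0.25}
factors = {
    # p(fev=M|flu), p(sin=N|flu), p(ach=N|flu), p(swe=N|flu), p(head=N|flu)
    "Y": [0.33, 0.33, 0.67, 0.33, 0.33],
    "N": [1.0, 1.0, 1.0, 1.0, 1.0],
}

score = {}
for flu in ("Y", "N"):
    s = prior[flu]
    for f in factors[flu]:
        s *= f
    score[flu] = s  # score["Y"] matches the slide's 0.0060 (unnormalized)

# Normalize to get the posterior probability of flu given the symptoms.
posterior_Y = score["Y"] / (score["Y"] + score["N"])
```

Even though 0.0060 looks tiny, only the ratio against the flu = N score matters after normalization.
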
Completely Observed graphical models
Suppose we have observations for every node.
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
In the simplest – least general – graph, assume each variable is independent. Train 6 separate models.
[Figure: six disconnected nodes Fl, Fe, Si, Ac, Sw, He]
In the second – most general – graph, assume no independence. Build a 6-dimensional table. (Divide by total count.)
[Figure: graph over Fl, Fe, Si, Ac, Sw, He with no independences]
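A quick parameter count makes the gap between the two extremes concrete (assuming fever takes the three values L/M/H and the other five variables are binary):

```python
# Number of free parameters for the two extreme graphs over the flu data.
card = {"Flu": 2, "Fever": 3, "Sinus": 2, "Ache": 2, "Swell": 2, "Head": 2}

# Fully independent: one small table per variable, |values| - 1 free parameters each.
independent = sum(k - 1 for k in card.values())

# No independence: one joint table over all variables, minus 1 for normalization.
full_joint = 1
for k in card.values():
    full_joint *= k
full_joint -= 1
```

With only 4 training rows, most of the 95 cells of the full joint table would get a count of zero.
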
Maximum Likelihood Conditional Probability Tables
Consider this Graphical Model
[Figure: directed graph over x0, ..., x5]
Each node has a conditional probability table θ_i.
Given the tables, we have a pdf:

p(x \mid \theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

We have M variables in x, and N data points, X.

Maximum (log) likelihood:

\theta^* = \arg\max_\theta \ln p(X \mid \theta)
         = \arg\max_\theta \sum_{n=0}^{N-1} \ln p(X_n \mid \theta)
         = \arg\max_\theta \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{in} \mid \theta_i)
         = \arg\max_\theta \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{in} \mid \theta_i)
Maximum Likelihood CPTs
First, Kronecker’s delta function.
\delta(x_n, x_m) = \begin{cases} 1 & \text{if } x_n = x_m \\ 0 & \text{otherwise} \end{cases}

Counts: the number of times something appears in the data:

m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{in}) \qquad m(x) = \sum_{n=0}^{N-1} \delta(x, X_n)

N = \sum_{x_1} m(x_1) = \sum_{x_1} \left( \sum_{x_2} m(x_1, x_2) \right) = \sum_{x_1} \left( \sum_{x_2} \left( \sum_{x_3} m(x_1, x_2, x_3) \right) \right) = \ldots
Maximum likelihood CPTs
l(\theta) = \sum_{n=0}^{N-1} \ln p(X_n \mid \theta)
          = \sum_{n=0}^{N-1} \ln \prod_{x} p(x \mid \theta)^{\delta(X_n, x)}
          = \sum_{n=0}^{N-1} \sum_{x} \delta(X_n, x) \ln p(x \mid \theta)
          = \sum_{x} m(x) \ln p(x \mid \theta)
          = \sum_{x} m(x) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)
          = \sum_{x} \sum_{i=0}^{M-1} m(x) \ln p(x_i \mid \pi_i, \theta_i)
          = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{x \setminus \{x_i, \pi_i\}} m(x) \ln p(x_i \mid \pi_i, \theta_i)
          = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)

Define a function: \theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)
Constraint: \sum_{x_i} \theta(x_i, \pi_i) = 1
ML With Lagrange Multipliers
Lagrange Multipliers
To maximize f(x, y) subject to g(x, y) = c, maximize f(x, y) - \lambda (g(x, y) - c).

l(\theta) = \sum_{i=0}^{M-1} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \left( \sum_{x_i} \theta(x_i, \pi_i) - 1 \right)

\frac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}

\sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \quad \text{(the constraint)}

\lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)} \quad \text{(counts!)}
Maximum A Posteriori CPT training
For the Bayesians, MAP leads to:
\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \epsilon}{m(\pi_i) + \epsilon \, |x_i|}

where |x_i| is the number of values x_i can take.
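A sketch of this MAP (add-ε) estimate, with illustrative names and a toy set of (x, parent) observations:

```python
from collections import Counter

def map_cpt(observations, values, eps=1.0):
    """MAP estimate theta(x, pa) = (m(x, pa) + eps) / (m(pa) + eps * K),
    where K = len(values) is the number of values x can take."""
    joint = Counter(observations)
    marg = Counter(pa for _, pa in observations)
    K = len(values)
    return {(x, pa): (joint[(x, pa)] + eps) / (marg[pa] + eps * K)
            for pa in marg for x in values}

# Three observations with parent value "Y": x = Y, N, Y.
theta = map_cpt([("Y", "Y"), ("N", "Y"), ("Y", "Y")], values=("Y", "N"))
# theta[("Y","Y")] = (2+1)/(3+2) = 0.6; an unseen (x, pa) pair would get
# eps/(m(pa)+eps*K) instead of the zero that maximum likelihood assigns.
```
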
Example of maximum likelihood.
Flu (x0)  Fever (x1)  Sinus (x2)  Ache (x3)  Swell (x4)  Head (x5)
   Y          L           Y          Y           Y          N
   N          M           N          N           N          N
   Y          H           N          N           Y          Y
   Y          M           Y          N           N          Y
[Figure: directed graph over x0, ..., x5]

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)}
Conditional Dependence Test.
We also want to be able to check conditional independencies in a graphical model.
E.g., "Is achiness (x3) independent of flu (x0) given fever (x1)?"
E.g., "Is achiness (x3) independent of sinus infection (x2) given fever (x1)?"

p(x) = p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_1, x_4)

p(x_3 \mid x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1)}{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0)} = p(x_3 \mid x_1)

x3 ⊥⊥ x0, x2 | x1

No problem, right?
What about x0 ⊥⊥ x5 | x1, x2?
D-separation and Bayes Ball
[Figure: directed graph over x0, ..., x5]
Intuition: nodes are separated, or blocked, by sets of nodes.
Example: nodes x1 and x2 "block" the path from x0 to x5, then x0 ⊥⊥ x5 | x1, x2.
D-separation and Bayes Ball
[Figure: directed graph over x0, ..., x5]
Intuition: nodes are separated, or blocked, by sets of nodes.
Example: nodes x1 and x2 "block" the path from x0 to x5, then x0 ⊥⊥ x5 | x1, x2.
While this is true in undirected graphs, it is not in directed graphs.
We need more than simple separation.
We need directed separation: D-Separation.
D-separation is computed using the Bayes Ball algorithm.
This allows us to prove general statements xa ⊥⊥ xb | xc.
Bayes Ball Algorithm
xa ⊥⊥ xb|xc
Shade nodes xc
Place a “ball” at each node in xa
Bounce balls around the graph according to some rules
If no balls reach xb, then xa ⊥⊥ xb | xc; otherwise, the independence does not hold.
Balls can travel along/against edges
Pick any path
Test to see if the ball goes through or bounces back.
Ten Rules of Bayes Ball
[Figure: the ten Bayes Ball rules]
Bayes Ball Example - I
x0 ⊥⊥ x4|x2?
[Figure: directed graph over x0, ..., x5]
Bayes Ball Example - II
x0 ⊥⊥ x5|x1, x2?
[Figure: directed graph over x0, ..., x5]
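The two example queries above can be checked mechanically. Bayes Ball itself bounces balls according to the ten rules; an equivalent d-separation test (it gives the same answers) builds the ancestral moral graph and checks reachability. A sketch, using the lecture's graph p(x0) p(x1|x0) p(x2|x0) p(x3|x1) p(x4|x2) p(x5|x1,x4):

```python
def d_separated(parents, a, b, given):
    """Test a ⊥⊥ b | given via the ancestral moral graph."""
    # 1. Keep only ancestors of the nodes mentioned in the query.
    keep, stack = set(), [a, b, *given]
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))
    # 2. Moralize: undirected parent-child edges plus marriages of co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. Delete the conditioning set and test reachability from a to b.
    blocked = set(given)
    seen, stack = set(), [a]
    while stack:
        v = stack.pop()
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(adj[v])
    return b not in seen

# Parent lists read off the factorization in the lecture.
parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}
```

On this graph both example queries come out independent, while conditioning on x2 and x3 leaves the path x0 → x1 → x5 open.
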
Undirected Graphs
What if we allow undirected graphs?
What do they correspond to?
It’s not cause/effect, or trigger/response, rather, generaldependence.
Example: Image pixels, where each pixel is a bernouli.
Can have a probability over all pixels p(x11, x1M , xM1, xMM)
Bright pixels have bright neighbors.
No parents, just probabilities.
Grid models are called Markov Random Fields.34 / 1
Undirected Graphs
x ⊥⊥ y | {w, z}
[Figure: undirected graph over w, x, y, z]
Undirected separation is easy.
To check xa ⊥⊥ xb | xc, check graph reachability of xa and xb without going through nodes in xc.
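The reachability check can be sketched directly; the diamond edges w-x, w-y, x-z, y-z below are one plausible reading of the four-node figure (an assumption, not given in the slides):

```python
# Hypothetical undirected diamond graph: w on top, z on the bottom.
edges = {"w": {"x", "y"}, "x": {"w", "z"}, "y": {"w", "z"}, "z": {"x", "y"}}

def separated(edges, a, b, given):
    """Is every path from a to b blocked by a node in `given`?"""
    seen, stack = set(given), [a]   # treat conditioned nodes as already visited
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(edges[v])
    return b not in seen

# Every path from x to y passes through w or z, so x ⊥⊥ y | {w, z};
# conditioning on w alone leaves the path x - z - y open.
```
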
Bye
Next
Representing probabilities in Undirected Graphs.