Lecture 8: Graphical Models II
Machine Learning
Andrew Rosenberg
March 5, 2010
Today
Graphical Models
Naive Bayes classification
Conditional Probability Tables (CPTs)
Inference in Graphical Models and Belief Propagation
Recap of Graphical Models
Graphical Models
Graphical representation of the dependency relationships between random variables.
[Figure: directed graph over x0, ..., x5]
Topological Graph
Graphical models factorize probabilities
p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \mathrm{pa}_i) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)

Nodes are generally topologically ordered so that parents π_i come before children.
[Figure: directed graph over x0, ..., x5]
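As a sketch of this factorization, here is a minimal example on a hypothetical three-node chain x0 → x1 → x2 with binary variables; the CPT numbers are made up for illustration.

```python
# cpt[i] maps (value of x_i, tuple of parent values) -> p(x_i | parents).
# Toy CPTs for a chain x0 -> x1 -> x2 (hypothetical numbers).
cpt = {
    0: {(0, ()): 0.4, (1, ()): 0.6},
    1: {(0, (0,)): 0.7, (1, (0,)): 0.3, (0, (1,)): 0.2, (1, (1,)): 0.8},
    2: {(0, (0,)): 0.5, (1, (0,)): 0.5, (0, (1,)): 0.9, (1, (1,)): 0.1},
}
parents = {0: (), 1: (0,), 2: (1,)}

def joint(x, cpt, parents):
    """Evaluate the factored joint: multiply one CPT entry per node."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpt[i][(x[i], tuple(x[j] for j in pa))]
    return p

# Because each factor is a proper conditional, the joint sums to 1.
total = sum(joint((a, b, c), cpt, parents)
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
```
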
Plate Notation of a Graphical Model
Recall the Naive Bayes Graphical Model
[Figure: naive Bayes graph: y with children x0, ..., xn]
There can be many variables x_i. Plate notation gives a compact representation of models like this:
[Figure: plate notation: node y pointing to x_i inside a plate repeated n times]
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: y with children x0, ..., x4]
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(fev|flu):
       L    M    H
   Y  .33  .33  .33
   N   0    1    0
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(sinus|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(ache|flu):
       Y    N
   Y  .33  .67
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(swell|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
p(flu):
   Y    N
  .75  .25
p(head|flu):
       Y    N
   Y  .67  .33
   N   0    1
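The counting behind these tables can be sketched in a few lines; the rows below are the four training examples from the slides, and `cpt` is a hypothetical helper name.

```python
from collections import Counter

# The four training rows: (flu, fever, sinus, ache, swell, head).
data = [
    ("Y", "L", "Y", "Y", "Y", "N"),
    ("N", "M", "N", "N", "N", "N"),
    ("Y", "H", "N", "N", "Y", "Y"),
    ("Y", "M", "Y", "N", "N", "Y"),
]

def cpt(col, data):
    """Estimate p(x_col | flu): count (flu, value) pairs, divide by flu counts."""
    pair = Counter((row[0], row[col]) for row in data)
    flu = Counter(row[0] for row in data)
    return {(f, v): c / flu[f] for (f, v), c in pair.items()}

p_flu = Counter(row[0] for row in data)  # 3 Y, 1 N -> p(flu=Y) = .75
p_sinus = cpt(2, data)  # p(sinus=Y | flu=Y) = 2/3, p(sinus=N | flu=N) = 1
```
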
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
p(flu=Y) * p(fev=M|flu=Y) * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(flu):
   Y    N
  .75  .25
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * p(fev=M|flu=Y) * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(fev|flu):
       L    M    H
   Y  .33  .33  .33
   N   0    1    0
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * p(sin=N|flu=Y) * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(sinus|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * p(ach=N|flu=Y) * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(ache|flu):
       Y    N
   Y  .33  .67
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * p(swe=N|flu=Y) * p(head=N|flu=Y)
p(swell|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * .33 * p(head=N|flu=Y)
p(head|flu):
       Y    N
   Y  .67  .33
   N   0    1
Naive Bayes Example
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
 ?     M      N     N      N     N
[Figure: naive Bayes graph: Flu with children Fev, Sin, Ach, Swe, Hea]
Find p(flu | fever, sinus, ache, swell, head):
.75 * .33 * .33 * .67 * .33 * .33 = 0.0060
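The product above is the unnormalized score for flu = Y. A sketch that also scores flu = N and normalizes; the flu = N conditionals are read directly off the data (the single flu = N row matches the query on every symptom, so they are all 1):

```python
# Naive Bayes scoring for the query (fever=M, sinus=N, ache=N, swell=N, head=N).
prior = {"Y": 0.75, "N": 0.25}
factors = {
    # p(fev=M|flu), p(sin=N|flu), p(ach=N|flu), p(swe=N|flu), p(head=N|flu)
    "Y": [0.33, 0.33, 0.67, 0.33, 0.33],
    "N": [1.0, 1.0, 1.0, 1.0, 1.0],
}

score = {}
for flu in ("Y", "N"):
    s = prior[flu]
    for f in factors[flu]:
        s *= f
    score[flu] = s  # score["Y"] matches the slide's 0.0060 (unnormalized)

# Normalize to get the posterior probability of flu given the symptoms.
posterior_Y = score["Y"] / (score["Y"] + score["N"])
```

Even though 0.0060 looks tiny, only the ratio against the flu = N score matters after normalization.
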
Completely Observed graphical models
Suppose we have observations for every node.
Flu  Fever  Sinus  Ache  Swell  Head
 Y     L      Y     Y      Y     N
 N     M      N     N      N     N
 Y     H      N     N      Y     Y
 Y     M      Y     N      N     Y
In the simplest – least general – graph, assume each variable is independent. Train 6 separate models.
[Figure: six disconnected nodes Fl, Fe, Si, Ac, Sw, He]
In the second – most general – graph, assume no independence. Build a 6-dimensional table. (Divide by total count.)
[Figure: graph over Fl, Fe, Si, Ac, Sw, He with no independences]
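A quick parameter count makes the gap between the two extremes concrete (assuming fever takes the three values L/M/H and the other five variables are binary):

```python
# Number of free parameters for the two extreme graphs over the flu data.
card = {"Flu": 2, "Fever": 3, "Sinus": 2, "Ache": 2, "Swell": 2, "Head": 2}

# Fully independent: one small table per variable, |values| - 1 free parameters each.
independent = sum(k - 1 for k in card.values())

# No independence: one joint table over all variables, minus 1 for normalization.
full_joint = 1
for k in card.values():
    full_joint *= k
full_joint -= 1
```

With only 4 training rows, most of the 95 cells of the full joint table would get a count of zero.
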
Maximum Likelihood Conditional Probability Tables
Consider this Graphical Model
[Figure: directed graph over x0, ..., x5]
Each node has a conditional probability table θ_i.
Given the tables, we have a pdf:

p(x \mid \theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

We have M variables in x, and N data points, X.

Maximum (log) likelihood:

\theta^* = \arg\max_\theta \ln p(X \mid \theta)
         = \arg\max_\theta \sum_{n=0}^{N-1} \ln p(X_n \mid \theta)
         = \arg\max_\theta \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{in} \mid \theta_i)
         = \arg\max_\theta \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{in} \mid \theta_i)
Maximum Likelihood CPTs
First, Kronecker’s delta function.
\delta(x_n, x_m) = \begin{cases} 1 & \text{if } x_n = x_m \\ 0 & \text{otherwise} \end{cases}

Counts: the number of times something appears in the data:

m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{in}) \qquad m(x) = \sum_{n=0}^{N-1} \delta(x, X_n)

N = \sum_{x_1} m(x_1) = \sum_{x_1} \left( \sum_{x_2} m(x_1, x_2) \right) = \sum_{x_1} \left( \sum_{x_2} \left( \sum_{x_3} m(x_1, x_2, x_3) \right) \right) = \ldots
Maximum likelihood CPTs
l(\theta) = \sum_{n=0}^{N-1} \ln p(X_n \mid \theta)
          = \sum_{n=0}^{N-1} \ln \prod_{x} p(x \mid \theta)^{\delta(X_n, x)}
          = \sum_{n=0}^{N-1} \sum_{x} \delta(X_n, x) \ln p(x \mid \theta)
          = \sum_{x} m(x) \ln p(x \mid \theta)
          = \sum_{x} m(x) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)
          = \sum_{x} \sum_{i=0}^{M-1} m(x) \ln p(x_i \mid \pi_i, \theta_i)
          = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{x \setminus \{x_i, \pi_i\}} m(x) \ln p(x_i \mid \pi_i, \theta_i)
          = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)

Define a function: \theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)
Constraint: \sum_{x_i} \theta(x_i, \pi_i) = 1
ML With Lagrange Multipliers
Lagrange Multipliers
To maximize f(x, y) subject to g(x, y) = c, maximize f(x, y) - \lambda (g(x, y) - c).

l(\theta) = \sum_{i=0}^{M-1} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \left( \sum_{x_i} \theta(x_i, \pi_i) - 1 \right)

\frac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}

\sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \quad \text{(the constraint)}

\lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)} \quad \text{(counts!)}
Maximum A Posteriori CPT training
For the Bayesians, MAP leads to:
\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \epsilon}{m(\pi_i) + \epsilon \, |x_i|}

where |x_i| is the number of values x_i can take.
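A sketch of this MAP (add-ε) estimate, with illustrative names and a toy set of (x, parent) observations:

```python
from collections import Counter

def map_cpt(observations, values, eps=1.0):
    """MAP estimate theta(x, pa) = (m(x, pa) + eps) / (m(pa) + eps * K),
    where K = len(values) is the number of values x can take."""
    joint = Counter(observations)
    marg = Counter(pa for _, pa in observations)
    K = len(values)
    return {(x, pa): (joint[(x, pa)] + eps) / (marg[pa] + eps * K)
            for pa in marg for x in values}

# Three observations with parent value "Y": x = Y, N, Y.
theta = map_cpt([("Y", "Y"), ("N", "Y"), ("Y", "Y")], values=("Y", "N"))
# theta[("Y","Y")] = (2+1)/(3+2) = 0.6; an unseen (x, pa) pair would get
# eps/(m(pa)+eps*K) instead of the zero that maximum likelihood assigns.
```
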
Example of maximum likelihood.
Flu (x0)  Fever (x1)  Sinus (x2)  Ache (x3)  Swell (x4)  Head (x5)
   Y          L           Y          Y           Y          N
   N          M           N          N           N          N
   Y          H           N          N           Y          Y
   Y          M           Y          N           N          Y
[Figure: directed graph over x0, ..., x5]

\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)}
Conditional Dependence Test.
We also want to be able to check conditional independencies in a graphical model.
E.g., "Is achiness (x3) independent of flu (x0) given fever (x1)?"
E.g., "Is achiness (x3) independent of sinus infection (x2) given fever (x1)?"

p(x) = p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_1, x_4)

p(x_3 \mid x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1)}{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0)} = p(x_3 \mid x_1)

x3 ⊥⊥ x0, x2 | x1

No problem, right?
What about x0 ⊥⊥ x5 | x1, x2?
D-separation and Bayes Ball
[Figure: directed graph over x0, ..., x5]
Intuition: nodes are separated, or blocked, by sets of nodes.
Example: nodes x1 and x2 "block" the path from x0 to x5, then x0 ⊥⊥ x5 | x1, x2.
D-separation and Bayes Ball
[Figure: directed graph over x0, ..., x5]
Intuition: nodes are separated, or blocked, by sets of nodes.
Example: nodes x1 and x2 "block" the path from x0 to x5, then x0 ⊥⊥ x5 | x1, x2.
While this is true in undirected graphs, it is not in directed graphs.
We need more than simple separation.
We need directed separation: D-Separation.
D-separation is computed using the Bayes Ball algorithm.
This allows us to prove general statements xa ⊥⊥ xb | xc.
Bayes Ball Algorithm
xa ⊥⊥ xb|xc
Shade nodes xc
Place a “ball” at each node in xa
Bounce balls around the graph according to some rules
If no balls reach xb, then xa ⊥⊥ xb | xc; otherwise, the independence does not hold.
Balls can travel along/against edges
Pick any path
Test to see if the ball goes through or bounces back.
Ten Rules of Bayes Ball
[Figure: the ten Bayes Ball rules]
Bayes Ball Example - I
x0 ⊥⊥ x4|x2?
[Figure: directed graph over x0, ..., x5]
Bayes Ball Example - II
x0 ⊥⊥ x5|x1, x2?
[Figure: directed graph over x0, ..., x5]
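The two example queries above can be checked mechanically. Bayes Ball itself bounces balls according to the ten rules; an equivalent d-separation test (it gives the same answers) builds the ancestral moral graph and checks reachability. A sketch, using the lecture's graph p(x0) p(x1|x0) p(x2|x0) p(x3|x1) p(x4|x2) p(x5|x1,x4):

```python
def d_separated(parents, a, b, given):
    """Test a ⊥⊥ b | given via the ancestral moral graph."""
    # 1. Keep only ancestors of the nodes mentioned in the query.
    keep, stack = set(), [a, b, *given]
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))
    # 2. Moralize: undirected parent-child edges plus marriages of co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. Delete the conditioning set and test reachability from a to b.
    blocked = set(given)
    seen, stack = set(), [a]
    while stack:
        v = stack.pop()
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(adj[v])
    return b not in seen

# Parent lists read off the factorization in the lecture.
parents = {0: [], 1: [0], 2: [0], 3: [1], 4: [2], 5: [1, 4]}
```

On this graph both example queries come out independent, while conditioning on x2 and x3 leaves the path x0 → x1 → x5 open.
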
Undirected Graphs
What if we allow undirected graphs?
What do they correspond to?
It’s not cause/effect, or trigger/response, rather, generaldependence.
Example: Image pixels, where each pixel is a bernouli.
Can have a probability over all pixels p(x11, x1M , xM1, xMM)
Bright pixels have bright neighbors.
No parents, just probabilities.
Grid models are called Markov Random Fields.34 / 1
Undirected Graphs
x ⊥⊥ y | {w, z}
[Figure: undirected graph over w, x, y, z]
Undirected separation is easy.
To check xa ⊥⊥ xb | xc, check graph reachability of xa and xb without going through nodes in xc.
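The reachability check can be sketched directly; the diamond edges w-x, w-y, x-z, y-z below are one plausible reading of the four-node figure (an assumption, not given in the slides):

```python
# Hypothetical undirected diamond graph: w on top, z on the bottom.
edges = {"w": {"x", "y"}, "x": {"w", "z"}, "y": {"w", "z"}, "z": {"x", "y"}}

def separated(edges, a, b, given):
    """Is every path from a to b blocked by a node in `given`?"""
    seen, stack = set(given), [a]   # treat conditioned nodes as already visited
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(edges[v])
    return b not in seen

# Every path from x to y passes through w or z, so x ⊥⊥ y | {w, z};
# conditioning on w alone leaves the path x - z - y open.
```
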
Bye
Next
Representing probabilities in Undirected Graphs.