Transcript of: Graphical Models, Xiaojin Zhu, University of Wisconsin–Madison (pages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf)
Graphical Models
Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin–Madison, USA
KDD 2012 Tutorial
Outline
1 Overview
2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph
3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems
4 Parameter Learning
5 Structure Learning
Overview
The envelope quiz
There are two envelopes: one has a red ball ($100) and a black ball, the other has two black balls
You randomly picked an envelope and randomly took out a ball; it was black
At this point, you are given the option to switch envelopes. Should you?
Random variables E ∈ {1, 0}, B ∈ {r, b}
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: P(E = 1 | B = b) ≥ 1/2?
P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3
Switch.
The graphical model: E → B
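To check the arithmetic, here is a minimal Python sketch (my illustration, not part of the tutorial) that enumerates the four equally likely (envelope, ball) draws and recovers the posterior:

```python
from fractions import Fraction

# Envelope E=1 holds {red, black}; envelope E=0 holds {black, black}.
outcomes = []
for e, balls in [(1, ["r", "b"]), (0, ["b", "b"])]:
    for ball in balls:
        # P(envelope) = 1/2, P(ball | envelope) = 1/2 for each ball in it
        outcomes.append((e, ball, Fraction(1, 2) * Fraction(1, 2)))

p_b = sum(p for e, ball, p in outcomes if ball == "b")                 # P(B=b) = 3/4
p_e1_b = sum(p for e, ball, p in outcomes if e == 1 and ball == "b")   # P(E=1, B=b) = 1/4
print(p_e1_b / p_b)  # 1/3 < 1/2, so switch
```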
Reasoning with uncertainty
The world is reduced to a set of random variables x1, . . . , xn
e.g. (x1, . . . , xn−1) a feature vector, xn ≡ y the class label
Inference: given the joint distribution p(x1, . . . , xn), compute p(XQ | XE) where XQ ∪ XE ⊆ {x1, . . . , xn}
e.g. Q = {n}, E = {1, . . . , n−1}; by the definition of conditional probability,
p(xn | x1, . . . , xn−1) = p(x1, . . . , xn−1, xn) / Σ_v p(x1, . . . , xn−1, xn = v)
Learning: estimate p(x1, . . . , xn) from training data X^(1), . . . , X^(N), where X^(i) = (x^(i)_1, . . . , x^(i)_n)
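As a concrete illustration of brute-force inference (a sketch with made-up numbers, not from the slides), store the full joint over three binary variables as a table and apply the definition above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Full joint table over n=3 binary variables: 2**n = 8 entries.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def conditional(joint, x1, x2):
    """p(x3 | x1, x2) = p(x1, x2, x3) / sum_v p(x1, x2, x3 = v)"""
    row = joint[x1, x2, :]
    return row / row.sum()

print(conditional(joint, x1=1, x2=0))  # a distribution over x3
```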
It is difficult to reason with uncertainty
joint distribution p(x1, . . . , xn)
   exponential naïve storage (2^n for binary r.v.s; e.g. n = 30 already needs 2^30 ≈ 10^9 entries)
   hard to interpret (conditional independence)
inference p(XQ | XE)
   often can't afford to do it by brute force
if p(x1, . . . , xn) is not given, estimate it from data
   often can't afford to do it by brute force
Graphical model: efficient representation, inference, and learning on p(x1, . . . , xn), exactly or approximately
Definitions
Graphical-Model-Nots
Graphical modeling is the study of probabilistic models
Just because there are nodes and edges doesn't mean it's a graphical model
These are not graphical models:
[Figure: a neural network, a decision tree, a network flow, an HMM template]
Directed Graphical Models
Bayesian Network
Directed graphical models are also called Bayesian networks
A directed graph has nodes X = (x1, . . . , xn), some of them connected by directed edges xi → xj
A cycle is a directed path x1 → . . . → xk where x1 = xk
A directed acyclic graph (DAG) contains no cycles
A Bayesian network on the DAG is a family of distributions satisfying
{p | p(X) = ∏_i p(xi | Pa(xi))}
where Pa(xi) is the set of parents of xi.
p(xi | Pa(xi)) is the conditional probability distribution (CPD) at xi
By specifying the CPDs for all i, we specify a particular distribution p(X)
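A minimal sketch of this factorization (my own toy example, a hypothetical chain DAG x1 → x2 → x3, not from the slides): represent each CPD as a function and multiply them to evaluate p(X):

```python
# Toy BN over binary nodes: x1 -> x2 -> x3 (a chain DAG).

def p_x1(v):
    return 0.6 if v == 1 else 0.4

def p_x2_given_x1(v, x1):
    p1 = 0.9 if x1 == 1 else 0.2   # P(x2 = 1 | x1)
    return p1 if v == 1 else 1 - p1

def p_x3_given_x2(v, x2):
    p1 = 0.7 if x2 == 1 else 0.1   # P(x3 = 1 | x2)
    return p1 if v == 1 else 1 - p1

def joint(x1, x2, x3):
    # p(X) = prod_i p(x_i | Pa(x_i))
    return p_x1(x1) * p_x2_given_x1(x2, x1) * p_x3_given_x2(x3, x2)

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0 (up to floating point)
```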
Example: Alarm
Binary variables
[Figure: the alarm network, a DAG with edges B → A, E → A, A → J, A → M]
P(B) = 0.001, P(E) = 0.002
P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
P(J | A) = 0.9, P(J | ~A) = 0.05
P(M | A) = 0.7, P(M | ~A) = 0.01
P(B, ~E, A, J, ~M)
= P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
= 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
≈ 0.000253
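A quick Python check of this product (a sketch; the CPT values are the ones on the slide):

```python
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A | B, E)
P_J = {1: 0.9, 0: 0.05}    # P(J | A)
P_M = {1: 0.7, 0: 0.01}    # P(M | A)

# P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
p = P_B * (1 - P_E) * P_A[(1, 0)] * P_J[1] * (1 - P_M[1])
print(p)  # ~0.000253
```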
Example: Naive Bayes
[Figure: the naive Bayes DAG y → x1, . . . , xd (left) and its plate representation (right)]
p(y, x1, . . . , xd) = p(y) ∏_{i=1}^d p(xi | y)
Used extensively in natural language processing
Plate representation on the right
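As an illustration (mine, with made-up CPD numbers), naive Bayes classification just applies this factorization together with Bayes rule:

```python
import numpy as np

# Toy naive Bayes: 2 classes, d = 3 binary features.
p_y = np.array([0.5, 0.5])                # p(y)
p_xi1_y = np.array([[0.8, 0.3],           # p(x_i = 1 | y): row i = feature, col = class
                    [0.6, 0.2],
                    [0.1, 0.7]])

def posterior(x):
    # p(y | x) is proportional to p(y) * prod_i p(x_i | y)
    x = np.asarray(x)[:, None]
    lik = np.where(x == 1, p_xi1_y, 1 - p_xi1_y).prod(axis=0)
    joint = p_y * lik
    return joint / joint.sum()

print(posterior([1, 1, 0]))  # posterior over the two classes
```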
No Causality Whatsoever
[Figure: two BNs. Left: A → B with P(A) = a, P(B|A) = b, P(B|~A) = c. Right: B → A with P(B) = ab + (1−a)c, P(A|B) = ab / (ab + (1−a)c), P(A|~B) = a(1−b) / (1 − ab − (1−a)c)]
The two BNs are equivalent in all respects
Bayesian networks imply no causality at all
They only encode the joint probability distribution (hence correlation)
However, people tend to design BNs based on causal relations
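A small numerical check (my sketch, arbitrary values for a, b, c) that the two parameterizations define the same joint distribution:

```python
a, b, c = 0.3, 0.9, 0.2   # left network: P(A) = a, P(B|A) = b, P(B|~A) = c

# Reversed network B -> A, with CPDs derived by Bayes rule as on the slide.
pB = a * b + (1 - a) * c
pA_B = a * b / pB
pA_notB = a * (1 - b) / (1 - pB)

for A in (1, 0):
    for B in (1, 0):
        pB_A = b if A else c                                        # P(B=1 | A)
        fwd = (a if A else 1 - a) * (pB_A if B else 1 - pB_A)       # P(A) P(B|A)
        pA_b = pA_B if B else pA_notB                               # P(A=1 | B)
        rev = (pB if B else 1 - pB) * (pA_b if A else 1 - pA_b)     # P(B) P(A|B)
        print(A, B, round(fwd, 10) == round(rev, 10))  # True for all four cells
```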
Example: Latent Dirichlet Allocation (LDA)
[Figure: LDA plate diagram with hyperparameters β and α; topics φ in a plate of size T, per-document θ in a plate of size D, and topic z and word w in an inner plate of size Nd]
A generative model for p(φ, θ, z, w | α, β):
For each topic t:
   φt ∼ Dirichlet(β)
For each document d:
   θ ∼ Dirichlet(α)
   For each word position in d:
      topic z ∼ Multinomial(θ)
      word w ∼ Multinomial(φz)
Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)
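A runnable sketch of this generative process (toy sizes and hyperparameters of my choosing) using numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, N_d = 3, 5, 10, 20       # topics, documents, vocabulary size, words per document
alpha, beta = 0.5, 0.1            # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)   # phi_t ~ Dirichlet(beta), one row per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))    # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)              # topic z ~ Multinomial(theta)
        words.append(rng.choice(V, p=phi[z]))   # word w ~ Multinomial(phi_z)
    docs.append(words)

print(docs[0])   # word ids of the first sampled document
```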
Some Topics by LDA on the Wish Corpus
[Figure: top words under p(word | topic) for three learned topics, labeled "troops", "election", and "love"]
Conditional Independence
Two r.v.s A, B are independent if P(A, B) = P(A)P(B) or P(A | B) = P(A) (the two are equivalent)
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C) or P(A | B, C) = P(A | C) (the two are equivalent)
This extends to groups of r.v.s
Conditional independence in a BN is precisely specified by d-separation ("directed separation")
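To see the definition in action, here is a sketch (hypothetical numbers) that builds a joint where A ⊥ B | C holds by construction and verifies P(A, B | C) = P(A | C)P(B | C):

```python
import numpy as np

# Build p(A, B, C) = p(C) p(A|C) p(B|C), so A and B are conditionally independent given C.
p_c = np.array([0.4, 0.6])
p_a_c = np.array([[0.9, 0.3],    # p(A=a | C=c): rows a, columns c
                  [0.1, 0.7]])
p_b_c = np.array([[0.2, 0.8],
                  [0.8, 0.2]])
joint = np.einsum('c,ac,bc->abc', p_c, p_a_c, p_b_c)

for c in (0, 1):
    cond = joint[:, :, c] / joint[:, :, c].sum()            # P(A, B | C=c)
    prod = np.outer(cond.sum(axis=1), cond.sum(axis=0))     # P(A | c) P(B | c)
    print(np.allclose(cond, prod))  # True for both values of c
```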
d-Separation Case 1: Tail-to-Tail
[Figure: tail-to-tail node, A ← C → B, with C unobserved (left) and observed/shaded (right)]
A, B in general dependent
A, B conditionally independent given C (observed nodes are shaded)
An observed C is a tail-to-tail node; it blocks the undirected path A-B
d-Separation Case 2: Head-to-Tail
[Figure: head-to-tail node, A → C → B, with C unobserved (left) and observed (right)]
A, B in general dependent
A, B conditionally independent given C
An observed C is a head-to-tail node; it blocks the path A-B
d-Separation Case 3: Head-to-Head
[Figure: head-to-head node, A → C ← B, with C unobserved (left) and observed (right)]
A, B in general independent
A, B conditionally dependent given C, or any of C's descendants
An observed C is a head-to-head node; it unblocks the path A-B
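Case 3 is the "explaining away" effect. A numeric sketch (my toy construction, with C the deterministic OR of two fair coins) showing A, B independent marginally but dependent once C is observed:

```python
import numpy as np

# A -> C <- B, with A, B fair independent coins and C = OR(A, B).
p = np.zeros((2, 2, 2))               # indices (a, b, c)
for a in (0, 1):
    for b in (0, 1):
        p[a, b, int(a or b)] = 0.25   # p(a) p(b) = 1/4; C is deterministic

marg = p.sum(axis=2)                  # p(a, b)
print(np.allclose(marg, np.outer(marg.sum(1), marg.sum(0))))   # True: marginally independent

cond = p[:, :, 1] / p[:, :, 1].sum()  # p(a, b | C = 1)
print(np.allclose(cond, np.outer(cond.sum(1), cond.sum(0))))   # False: dependent given C
```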
d-Separation
Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked
A path is blocked if it includes a node x such that either
   the path is head-to-tail or tail-to-tail at x and x ∈ C, or
   the path is head-to-head at x, and neither x nor any of its descendants is in C.
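A direct, unoptimized sketch of this definition (my own code; names are hypothetical, and practical implementations usually use the Bayes-ball algorithm instead): enumerate undirected simple paths and test each for blocking.

```python
from itertools import product

def d_separated(edges, A, B, C):
    """edges: set of directed pairs (u, v). Tests every undirected simple path A to B."""
    nodes = {u for e in edges for u in e}
    nbrs = {n: {v for (u, v) in edges if u == n} | {u for (u, v) in edges if v == n}
            for n in nodes}

    def descendants(x):
        out, stack = set(), [x]
        while stack:
            u = stack.pop()
            for (s, t) in edges:
                if s == u and t not in out:
                    out.add(t); stack.append(t)
        return out

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, x, nxt = path[i - 1], path[i], path[i + 1]
            if (prev, x) in edges and (nxt, x) in edges:     # head-to-head at x
                if x not in C and not (descendants(x) & C):
                    return True
            elif x in C:                                     # head-to-tail or tail-to-tail at x
                return True
        return False

    def paths(u, target, seen):
        if u == target:
            yield [u]; return
        for v in nbrs[u] - seen:
            for rest in paths(v, target, seen | {v}):
                yield [u] + rest

    return all(blocked(p) for a, b in product(A, B) for p in paths(a, b, {a}))

# Example: chain a -> c -> b; observing c blocks the only path.
edges = {("a", "c"), ("c", "b")}
print(d_separated(edges, {"a"}, {"b"}, set()))   # False
print(d_separated(edges, {"a"}, {"b"}, {"c"}))   # True
```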
d-Separation Example 1
The undirected path from A to B is unblocked by E (because of C), and is not blocked by F
A, B dependent given C
[Figure: DAG on nodes A, B, C, E, F in which the A-B path passes through E (head-to-head, with observed descendant C) and F (tail-to-tail, unobserved)]
d-Separation Example 2
The path from A to B is blocked both at E and F
A, B conditionally independent given F
[Figure: the same DAG, now with F observed instead of C]
![Page 75: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/75.jpg)
Definitions Directed Graphical Models
d-Separation Example 2
The path from A to B is blocked both at E and F
A, B conditionally independent given F
A
B
F
E
C
(KDD 2012 Tutorial) Graphical Models 23 / 131
![Page 76: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/76.jpg)
Undirected Graphical Models
Markov Random Fields
Undirected graphical models are also called Markov Random Fields
The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive
A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)
Define a nonnegative potential function ψC : XC ↦ R+
An undirected graphical model is a family of distributions satisfying
{p | p(X) = (1/Z) ∏_C ψC(XC)}
Z = ∫ ∏_C ψC(XC) dX is the partition function
(KDD 2012 Tutorial) Graphical Models 25 / 131
Example: A Tiny Markov Random Field

[Figure: two nodes $x_1$ and $x_2$ joined by a single edge, which forms the clique $C$]

$x_1, x_2 \in \{-1, 1\}$

A single clique with potential $\psi_C(x_1, x_2) = e^{a x_1 x_2}$

$p(x_1, x_2) = \frac{1}{Z} e^{a x_1 x_2}$

$Z = e^a + e^{-a} + e^{-a} + e^a = 2e^a + 2e^{-a}$

$p(1, 1) = p(-1, -1) = e^a / (2e^a + 2e^{-a})$

$p(-1, 1) = p(1, -1) = e^{-a} / (2e^a + 2e^{-a})$

When the parameter $a > 0$, the model favors homogeneous configurations ($x_1 = x_2$)

When the parameter $a < 0$, it favors inhomogeneous configurations ($x_1 \neq x_2$)
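A quick numerical check of this two-node model (a minimal sketch; the value $a = 1$ is an arbitrary choice for illustration):

```python
import itertools
import math

a = 1.0  # coupling parameter; a > 0 rewards x1 == x2

def psi(x1, x2):
    """Clique potential psi_C(x1, x2) = exp(a * x1 * x2)."""
    return math.exp(a * x1 * x2)

# Partition function: sum the potential over all joint configurations.
Z = sum(psi(x1, x2) for x1, x2 in itertools.product([-1, 1], repeat=2))

for x1, x2 in itertools.product([-1, 1], repeat=2):
    print(f"p({x1:+d},{x2:+d}) = {psi(x1, x2) / Z:.4f}")
# With a = 1: p(+1,+1) = p(-1,-1) ≈ 0.4404, p(-1,+1) = p(+1,-1) ≈ 0.0596
```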
Log Linear Models

Real-valued feature functions $f_1(X), \ldots, f_k(X)$

Real-valued weights $w_1, \ldots, w_k$

$p(X) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{k} w_i f_i(X) \right)$
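For intuition, a log linear model over a handful of binary variables can be normalized by brute force (a sketch; the features and weights below are made-up examples, not from the tutorial):

```python
import itertools
import math

# Hypothetical features over X = (x1, x2, x3), each xi in {0, 1}
features = [
    lambda X: X[0],            # f1(X) = x1
    lambda X: X[1] * X[2],     # f2(X) = x2 * x3
]
weights = [0.5, -1.0]          # w1, w2 (illustrative values)

def score(X):
    """Unnormalized probability: exp(-sum_i w_i f_i(X))."""
    return math.exp(-sum(w * f(X) for w, f in zip(weights, features)))

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(score(X) for X in configs)          # partition function
p = {X: score(X) / Z for X in configs}      # normalized distribution
assert abs(sum(p.values()) - 1.0) < 1e-12
```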
Example: The Ising Model

[Figure: a grid-structured graph; each node $x_s$ carries a parameter $\theta_s$ and each edge $(s, t)$ a parameter $\theta_{st}$]

This is an undirected model with each $x_s \in \{0, 1\}$:

$p_\theta(x) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)$

As a log linear model: $f_s(X) = x_s$, $f_{st}(X) = x_s x_t$, with $w_s = -\theta_s$, $w_{st} = -\theta_{st}$
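A minimal sketch of the Ising distribution on a tiny graph, normalized by enumeration (the 4-node cycle and the parameter values are illustrative assumptions):

```python
import itertools
import math

# Hypothetical 4-node cycle: V = {0, 1, 2, 3}, E as listed below
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_node = {s: 0.2 for s in range(4)}     # theta_s
theta_edge = {e: 1.0 for e in edges}        # theta_st

def log_potential(x):
    """sum_s theta_s x_s + sum_(s,t) theta_st x_s x_t for x in {0,1}^4."""
    return (sum(theta_node[s] * x[s] for s in range(4))
            + sum(theta_edge[(s, t)] * x[s] * x[t] for s, t in edges))

configs = list(itertools.product([0, 1], repeat=4))
Z = sum(math.exp(log_potential(x)) for x in configs)
p = {x: math.exp(log_potential(x)) / Z for x in configs}
# Positive theta_st makes neighboring 1s likely to co-occur:
print(max(p, key=p.get))  # -> (1, 1, 1, 1)
```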
Example: Image Denoising

[Figure, from Bishop PRML: a noisy binary image $Y$, and the restored image $\arg\max_X P(X \mid Y)$]

$p_\theta(X \mid Y) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)$

$\theta_s = \begin{cases} c & y_s = 1 \\ -c & y_s = 0 \end{cases}$
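To make the model concrete in code, here is a sketch of the log unnormalized score being maximized (the 4-connected grid, $c = 2.0$, and $\theta_{st} = 1.0$ are assumptions; actually finding the argmax is an inference problem, covered next):

```python
import itertools

def neighbors(i, j, h, w):
    """Yield each 4-connected grid edge once (right and down neighbors)."""
    if i + 1 < h:
        yield (i + 1, j)
    if j + 1 < w:
        yield (i, j + 1)

def log_score(X, Y, c=2.0, theta_st=1.0):
    """Log unnormalized posterior p(X | Y) for binary (0/1) images X, Y."""
    h, w = len(X), len(X[0])
    s = 0.0
    for i, j in itertools.product(range(h), range(w)):
        theta_s = c if Y[i][j] == 1 else -c     # data term ties x_s to y_s
        s += theta_s * X[i][j]
        for (k, l) in neighbors(i, j, h, w):    # smoothness: reward equal 1s
            s += theta_st * X[i][j] * X[k][l]
    return s

Y = [[1, 1, 0], [1, 0, 0]]   # a toy "noisy" observation
print(log_score(Y, Y))        # the score of simply copying Y
```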
Example: Gaussian Random Field

Multivariate Gaussian: $X \sim N(\mu, \Sigma)$, with density

$p(X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu)^\top \Sigma^{-1} (X - \mu) \right)$

The $n \times n$ covariance matrix $\Sigma$ is positive semi-definite

Let $\Omega = \Sigma^{-1}$ be the precision matrix

$x_i$ and $x_j$ are conditionally independent given all other variables if and only if $\Omega_{ij} = 0$

When $\Omega_{ij} \neq 0$, there is an edge between $x_i$ and $x_j$
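A small numerical illustration (the tridiagonal precision matrix below is an assumption, chosen to encode the chain $x_1 - x_2 - x_3$):

```python
import numpy as np

# Hypothetical precision matrix for a chain x1 - x2 - x3:
# Omega[0, 2] == 0 encodes x1 independent of x3 given x2.
Omega = np.array([[2.0, -1.0,  0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)   # the covariance matrix

# Sigma itself is dense: marginally, x1 and x3 are correlated...
print(np.round(Sigma, 3))
# ...but the zero pattern of the precision matrix is what encodes the
# conditional independence / missing edge in the graph:
print(np.round(np.linalg.inv(Sigma), 3))  # recovers Omega, with entry (0, 2) ≈ 0
```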
Conditional Independence in Markov Random Fields

Two groups of variables $A$, $B$ are conditionally independent given another group $C$ if:

$A$ and $B$ become disconnected by removing $C$ and all edges involving $C$

[Figure: a graph in which every path between the $A$ nodes and the $B$ nodes passes through the $C$ nodes]
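This is ordinary graph reachability after deleting $C$; a minimal sketch (the adjacency-set representation is an assumption):

```python
from collections import deque

def separated(adj, A, B, C):
    """Graph separation test: does removing C disconnect A from B?
    adj: dict node -> set of neighbors (undirected graph)."""
    blocked = set(C)
    frontier = deque(set(A) - blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False          # found a path from A to B avoiding C
        for v in adj[u] - blocked:
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Chain a - c - b: removing c separates a from b.
adj = {"a": {"c"}, "c": {"a", "b"}, "b": {"c"}}
print(separated(adj, {"a"}, {"b"}, {"c"}))  # True
print(separated(adj, {"a"}, {"b"}, set()))  # False
```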
Definitions / Factor Graph
Factor Graph

For both directed and undirected graphical models

Bipartite: edges between a variable node and a factor node

Factors represent computation

[Figure: a model on variables $A$, $B$, $C$ converted to a factor graph; in the undirected case a factor node $f$ connects to $A$, $B$, $C$ and represents $\psi(A, B, C)$, in the directed case it represents $P(A)P(B)P(C|A, B)$]
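In code, a factor is often just a scope plus a table, and that is the representation the inference sketches below operate on (a sketch; the CPT numbers are made-up placeholders):

```python
import itertools

# A factor = (tuple of variable names, dict mapping value tuples to numbers).
# The directed model P(A) P(B) P(C | A, B) over binary variables becomes
# one factor per CPD:
f_A = (("A",), {(0,): 0.6, (1,): 0.4})
f_B = (("B",), {(0,): 0.7, (1,): 0.3})
cpt = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.2}  # P(C = 0 | A, B)
f_C = (("A", "B", "C"), {
    (a, b, c): cpt[(a, b)] if c == 0 else 1.0 - cpt[(a, b)]
    for a, b, c in itertools.product([0, 1], repeat=3)
})
factors = [f_A, f_B, f_C]   # the joint P(A, B, C) is the product of the three
```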
Inference / Exact Inference
Inference by Enumeration

Let $X = (X_Q, X_E, X_O)$ for query, evidence, and other variables

Infer $P(X_Q \mid X_E)$

By definition,

$P(X_Q \mid X_E) = \frac{P(X_Q, X_E)}{P(X_E)} = \frac{\sum_{X_O} P(X_Q, X_E, X_O)}{\sum_{X_Q, X_O} P(X_Q, X_E, X_O)}$

This sums an exponential number of terms: with $k$ variables in $X_O$, each taking $r$ values, there are $r^k$ terms
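A direct transcription into code over the factor-list representation above (a brute-force reference sketch; it assumes binary variables):

```python
import itertools

def joint(factors, assignment):
    """Product of all factor tables at one full assignment (dict var -> value)."""
    p = 1.0
    for vars_, table in factors:
        p *= table[tuple(assignment[v] for v in vars_)]
    return p

def infer_by_enumeration(factors, all_vars, query, evidence):
    """P(X_Q | X_E) by summing the joint over the other variables: O(r^k)."""
    hidden = [v for v in all_vars if v not in query and v not in evidence]
    unnorm = {}
    for q in itertools.product([0, 1], repeat=len(query)):
        a = dict(evidence, **dict(zip(query, q)))
        unnorm[q] = sum(
            joint(factors, dict(a, **dict(zip(hidden, h))))
            for h in itertools.product([0, 1], repeat=len(hidden)))
    Z = sum(unnorm.values())           # this is P(X_E = e)
    return {q: v / Z for q, v in unnorm.items()}

# e.g. P(A | C = 1) with the factors defined earlier:
# infer_by_enumeration(factors, ["A", "B", "C"], ["A"], {"C": 1})
```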
Details of the summing problem

There are a bunch of "other" variables $x_1, \ldots, x_k$

We sum over the $r$ values each variable can take: $\sum_{x_i = v_1}^{v_r}$

This is exponential ($r^k$): $\sum_{x_1 \ldots x_k}$

We want $\sum_{x_1 \ldots x_k} p(X)$

For a graphical model, the joint probability factors: $p(X) = \prod_{j=1}^{m} f_j(X^{(j)})$

Each factor $f_j$ operates on a subset $X^{(j)} \subseteq X$

Idea to save computation: dynamic programming
Eliminating a Variable

Rearrange the factors $\sum_{x_1 \ldots x_k} f_1^- \cdots f_l^- f_{l+1}^+ \cdots f_m^+$ by whether $x_1 \in X^{(j)}$

Obviously equivalent: $\sum_{x_2 \ldots x_k} f_1^- \cdots f_l^- \left( \sum_{x_1} f_{l+1}^+ \cdots f_m^+ \right)$

Introduce a new factor $f_{m+1}^- = \sum_{x_1} f_{l+1}^+ \cdots f_m^+$

$f_{m+1}^-$ contains the union of the variables in $f_{l+1}^+ \cdots f_m^+$ except $x_1$

In fact, $x_1$ disappears altogether in $\sum_{x_2 \ldots x_k} f_1^- \cdots f_l^- f_{m+1}^-$

Dynamic programming: compute $f_{m+1}^-$ once, use it thereafter

Hope: $f_{m+1}^-$ contains very few variables

Recursively eliminate the other variables in turn
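A compact sketch of one elimination step and the full sweep, again over the (vars, table) factor representation (an illustrative implementation for binary variables, not the tutorial's own code):

```python
import itertools

def prod_at(factors, assignment):
    """Product of the given factor tables at one assignment."""
    p = 1.0
    for vars_, table in factors:
        p *= table[tuple(assignment[v] for v in vars_)]
    return p

def eliminate(factors, x):
    """One variable-elimination step: sum x out of the product of the
    factors that mention it; all other factors pass through untouched."""
    keep = [f for f in factors if x not in f[0]]
    involved = [f for f in factors if x in f[0]]
    new_vars = tuple(sorted({v for vars_, _ in involved for v in vars_} - {x}))
    table = {}
    for vals in itertools.product([0, 1], repeat=len(new_vars)):
        a = dict(zip(new_vars, vals))
        table[vals] = sum(prod_at(involved, {**a, x: xv}) for xv in [0, 1])
    return keep + [(new_vars, table)]

def variable_elimination(factors, order):
    for x in order:              # heuristic order; the optimal order is NP-hard
        factors = eliminate(factors, x)
    return factors               # remaining factors mention only query variables
```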
Example: Chain Graph

$A \rightarrow B \rightarrow C \rightarrow D$

Binary variables

Say we want $P(D) = \sum_{A,B,C} P(A) P(B|A) P(C|B) P(D|C)$

Let $f_1(A) = P(A)$. Note $f_1$ is an array (from the CPD) of size two: $P(A=0)$, $P(A=1)$

$f_2(A, B)$ is a table of size four: $P(B=0|A=0)$, $P(B=0|A=1)$, $P(B=1|A=0)$, $P(B=1|A=1)$

$\sum_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = \sum_{B,C} f_3(B,C) f_4(C,D) \left( \sum_A f_1(A) f_2(A,B) \right)$
$f_1(A) f_2(A,B)$ is an array of size four, matching $A$ values: $P(A=0)P(B=0|A=0)$, $P(A=1)P(B=0|A=1)$, $P(A=0)P(B=1|A=0)$, $P(A=1)P(B=1|A=1)$

$f_5(B) \equiv \sum_A f_1(A) f_2(A,B)$ reduces to an array of size two: $P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1)$ and $P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)$

For this example, $f_5(B)$ happens to be $P(B)$

$\sum_{B,C} f_3(B,C) f_4(C,D) f_5(B) = \sum_C f_4(C,D) \left( \sum_B f_3(B,C) f_5(B) \right)$, and so on

In the end, $f_7(D)$ is represented by $P(D=0)$, $P(D=1)$
Computation for $P(D)$: 12 multiplications, 6 additions

Enumeration: 48 multiplications, 14 additions

The saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

The saving depends more critically on the graph structure (tree width); inference may still be intractable
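Running the chain example through the elimination sketch above, and cross-checking against enumeration (the CPD numbers are made-up placeholders; `variable_elimination` and `prod_at` are reused from that sketch):

```python
import itertools

# Hypothetical CPDs for the chain A -> B -> C -> D (binary; rows sum to 1)
f1 = (("A",), {(0,): 0.6, (1,): 0.4})
f2 = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7})
f3 = (("B", "C"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6})
f4 = (("C", "D"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9})
factors = [f1, f2, f3, f4]

# Eliminating A, B, C in turn leaves a single factor over D alone: f7(D) = P(D).
(vars_, table), = variable_elimination(factors, ["A", "B", "C"])
assert vars_ == ("D",)

# Cross-check against brute-force enumeration over all 2^4 assignments.
for d in [0, 1]:
    brute = sum(
        prod_at(factors, {"A": a, "B": b, "C": c, "D": d})
        for a, b, c in itertools.product([0, 1], repeat=3))
    assert abs(table[(d,)] - brute) < 1e-12
print(table)  # P(D = 0), P(D = 1)
```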
Handling Evidence

For evidence variables $X_E$, simply plug in their value $e$

Eliminate the other variables $X_O = X - X_E - X_Q$

The final factor will be the joint $f(X_Q) = P(X_Q, X_E = e)$

Normalize to answer the query:

$P(X_Q \mid X_E = e) = \frac{f(X_Q)}{\sum_{X_Q} f(X_Q)}$
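Continuing the chain-graph code above, plugging in evidence just means restricting each factor's table before eliminating (a sketch; `condition` is a hypothetical helper name):

```python
def condition(factors, evidence):
    """Plug evidence into every factor: keep only table rows consistent
    with the evidence and drop evidence variables from factor scopes."""
    out = []
    for vars_, table in factors:
        keep = tuple(v for v in vars_ if v not in evidence)
        new = {}
        for vals, p in table.items():
            a = dict(zip(vars_, vals))
            if all(a[v] == val for v, val in evidence.items() if v in a):
                new[tuple(a[v] for v in keep)] = p
        out.append((keep, new))
    return out

# P(A | D = 1) on the chain: plug in D = 1, eliminate B and C, normalize.
remaining = variable_elimination(condition(factors, {"D": 1}), ["B", "C"])
f = {a: prod_at(remaining, {"A": a}) for a in [0, 1]}   # f(A) = P(A, D = 1)
Z = sum(f.values())
print({a: v / Z for a, v in f.items()})                  # P(A | D = 1)
```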
Summary: Exact Inference

Common methods:
Enumeration
Variable elimination
Not covered: junction tree (aka clique tree)

Exact, but often intractable for large graphs
Inference / Markov Chain Monte Carlo
Inference by Monte Carlo

Consider the inference problem $p(X_Q = c_Q \mid X_E)$ where $X_Q \cup X_E \subseteq \{x_1, \ldots, x_n\}$

$p(X_Q = c_Q \mid X_E) = \int \mathbf{1}_{(x_Q = c_Q)} \, p(x_Q \mid X_E) \, dx_Q$

If we can draw samples $x_Q^{(1)}, \ldots, x_Q^{(m)} \sim p(x_Q \mid X_E)$, an unbiased estimator is

$p(X_Q = c_Q \mid X_E) \approx \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{(x_Q^{(i)} = c_Q)}$

The variance of the estimator decreases as $O(1/m)$

Inference reduces to sampling from $p(x_Q \mid X_E)$

We discuss two methods: forward sampling and Gibbs sampling
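The estimator itself is one line once a sampler exists (a sketch; `draw_sample` is a stand-in name for whatever routine draws $x_Q \sim p(x_Q \mid X_E)$):

```python
def monte_carlo_estimate(draw_sample, c_Q, m=10_000):
    """Estimate p(X_Q = c_Q | X_E) as the fraction of samples equal to c_Q.
    draw_sample: zero-argument function returning one draw of x_Q."""
    hits = sum(draw_sample() == c_Q for _ in range(m))
    return hits / m   # unbiased; standard error shrinks like 1 / sqrt(m)
```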
Forward Sampling: Example

[Figure: a directed network with edges $B \rightarrow A$, $E \rightarrow A$, $A \rightarrow J$, $A \rightarrow M$]

$P(B) = 0.001$, $P(E) = 0.002$
$P(A \mid B, E) = 0.95$, $P(A \mid B, \neg E) = 0.94$, $P(A \mid \neg B, E) = 0.29$, $P(A \mid \neg B, \neg E) = 0.001$
$P(J \mid A) = 0.9$, $P(J \mid \neg A) = 0.05$
$P(M \mid A) = 0.7$, $P(M \mid \neg A) = 0.01$

To generate a sample $X = (B, E, A, J, M)$:
1. Sample $B \sim \mathrm{Ber}(0.001)$: draw $r \sim U(0, 1)$; if $r < 0.001$ then $B = 1$, else $B = 0$
2. Sample $E \sim \mathrm{Ber}(0.002)$
3. If $B = 1$ and $E = 1$, sample $A \sim \mathrm{Ber}(0.95)$, and so on for the other CPT rows
4. If $A = 1$, sample $J \sim \mathrm{Ber}(0.9)$, else $J \sim \mathrm{Ber}(0.05)$
5. If $A = 1$, sample $M \sim \mathrm{Ber}(0.7)$, else $M \sim \mathrm{Ber}(0.01)$
![Page 162: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/162.jpg)
Inference Markov Chain Monte Carlo

Inference with Forward Sampling

Say the inference task is P(B = 1 | E = 1, M = 1)

Throw away all samples except those with (E = 1, M = 1):

p(B = 1 | E = 1, M = 1) ≈ (1/m) ∑_{i=1}^m 1(B^{(i)} = 1)

where m is the number of surviving samples (see the sketch below)

Can be highly inefficient: P(E = 1) = 0.002, so only about one sample in 500 survives the evidence check on E alone

Does not work for Markov Random Fields, which provide no topological order in which to sample
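A sketch of the corresponding rejection estimator, reusing the forward_sample() function defined above:

```python
def estimate_by_rejection(n=1_000_000):
    """Estimate P(B = 1 | E = 1, M = 1) by forward sampling and
    discarding every sample inconsistent with the evidence."""
    kept = hits = 0
    for _ in range(n):
        B, E, A, J, M = forward_sample()
        if E == 1 and M == 1:      # keep only samples matching the evidence
            kept += 1
            hits += B
    # with P(E = 1) = 0.002, almost all samples are thrown away
    return hits / kept if kept else float("nan")
```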
Inference Markov Chain Monte Carlo

Gibbs Sampler: Example P(B = 1 | E = 1, M = 1)

The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method

It samples directly from p(x_Q | X_E)

It works for both directed and undirected graphical models

Initialization: fix the evidence; randomly set the other variables, e.g. X^{(0)} = (B = 0, E = 1, A = 0, J = 0, M = 1)

[Figure: the same network and CPTs as before, with the evidence nodes E = 1 and M = 1 fixed]
Inference Markov Chain Monte Carlo

Gibbs Update

For each non-evidence variable x_i, fixing all other nodes X_{-i}, resample its value x_i ∼ P(x_i | X_{-i})

This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i))

For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children:

P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y ∈ C(x_i)} P(y | Pa(y))

where Pa(x) are the parents of x, and C(x) the children of x

For many graphical models the Markov blanket is small

For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)

[Figure: the same network and CPTs as before, with the evidence E = 1 and M = 1 fixed]
Inference Markov Chain Monte Carlo

Gibbs Update

Say we sampled B = 1. Then X^{(1)} = (B = 1, E = 1, A = 0, J = 0, M = 1)

Starting from X^{(1)}, sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^{(2)}

Move on to J, then repeat the sweep B, A, J, B, A, J, ...

Keep all later samples. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1. A code sketch of the full procedure follows.

[Figure: the same network and CPTs as before, with the evidence E = 1 and M = 1 fixed]
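A minimal Python sketch of this Gibbs sweep for P(B = 1 | E = 1, M = 1). The conditionals come from the Markov-blanket formula above; the helper names, sweep count, and burn-in length are our own choices.

```python
import random

# CPTs of the network (from the slides above)
p_B, p_E = 0.001, 0.002
p_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
p_J = {1: 0.9, 0: 0.05}    # P(J = 1 | A)
p_M = {1: 0.7, 0: 0.01}    # P(M = 1 | A)

def sample_odds(w1, w0):
    """Return 1 with probability w1 / (w1 + w0)."""
    return int(random.random() < w1 / (w1 + w0))

def gibbs_estimate(sweeps=100_000, burn_in=1_000):
    E, M = 1, 1                 # evidence stays fixed
    B, A, J = 0, 0, 0           # arbitrary initialization
    hits = total = 0
    for t in range(sweeps):
        # B | MB(B): proportional to P(B) P(A | B, E)
        w1 = p_B * (p_A[(1, E)] if A else 1 - p_A[(1, E)])
        w0 = (1 - p_B) * (p_A[(0, E)] if A else 1 - p_A[(0, E)])
        B = sample_odds(w1, w0)
        # A | MB(A): proportional to P(A | B, E) P(J | A) P(M = 1 | A)
        w1 = p_A[(B, E)] * (p_J[1] if J else 1 - p_J[1]) * p_M[1]
        w0 = (1 - p_A[(B, E)]) * (p_J[0] if J else 1 - p_J[0]) * p_M[0]
        A = sample_odds(w1, w0)
        # J | MB(J): J has no children, so only P(J | A) matters
        J = int(random.random() < p_J[A])
        if t >= burn_in:        # discard burn-in, keep the rest (do not thin)
            hits += B
            total += 1
    return hits / total
```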
Inference Markov Chain Monte Carlo

Gibbs Example 2: The Ising Model

[Figure: a grid of binary variables; node x_s has the four neighbors A, B, C, D]

This is an undirected model with x_s ∈ {0, 1}:

p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Inference Markov Chain Monte Carlo

Gibbs Example 2: The Ising Model

The Markov blanket of x_s is {A, B, C, D}

In general, for undirected graphical models

p(x_s | x_{-s}) = p(x_s | x_{N(s)})

where N(s) is the set of neighbors of s

The Gibbs update is

p(x_s = 1 | x_{N(s)}) = 1 / ( exp(-(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
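A sketch of one Gibbs sweep over a grid Ising model using the logistic update above; the array layout and parameter names are ours, and we assume shared node and edge parameters for brevity.

```python
import math
import random

def gibbs_sweep(x, theta_s, theta_st):
    """One Gibbs sweep over an n-by-n grid of binary variables x,
    resampling each x[i][j] given its (up to four) grid neighbors."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            s = sum(x[a][b] for a, b in nbrs if 0 <= a < n and 0 <= b < n)
            p1 = 1.0 / (1.0 + math.exp(-(theta_s + theta_st * s)))
            x[i][j] = int(random.random() < p1)
    return x

# usage: ten sweeps on a random 20x20 grid with theta_s = 0, theta_st = 0.5
grid = [[random.randint(0, 1) for _ in range(20)] for _ in range(20)]
for _ in range(10):
    gibbs_sweep(grid, 0.0, 0.5)
```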
Inference Markov Chain Monte Carlo

Gibbs Sampling as a Markov Chain

A Markov chain is defined by a transition matrix T(X' | X)

Certain Markov chains have a stationary distribution π such that π = Tπ

The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x'_i) | (X_{-i}, x_i)) = p(x'_i | X_{-i}) and stationary distribution p(x_Q | X_E)

But it takes time for the chain to reach its stationary distribution, i.e. to mix

It can be difficult to assert mixing

In practice, "burn in": discard X^{(0)}, ..., X^{(T)}

Use all of X^{(T+1)}, ... for inference (they are correlated); do not thin
Inference Markov Chain Monte Carlo

Collapsed Gibbs Sampling

In general, E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X^{(i)}) if X^{(i)} ∼ p

Sometimes X = (Y, Z), where Z admits closed-form operations

If so,

E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) ∑_{i=1}^m E_{p(Z|Y^{(i)})}[f(Y^{(i)}, Z)]

if Y^{(i)} ∼ p(Y)

No need to sample Z: it is collapsed, i.e. marginalized out in closed form (see the toy sketch below)

The collapsed Gibbs sampler has transition T_i((Y_{-i}, y'_i) | (Y_{-i}, y_i)) = p(y'_i | Y_{-i})

Note p(y'_i | Y_{-i}) = ∫ p(y'_i, Z | Y_{-i}) dZ
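A toy illustration of the identity above (our example, not from the tutorial): take X = (Y, Z) with Y ∼ Ber(0.5), Z | Y ∼ Ber(0.2 + 0.6Y), and f(X) = Z, so E_p[f] = 0.5. The collapsed estimator replaces each sampled Z by E[Z | Y] in closed form and has lower variance for the same m.

```python
import random

def plain_mc(m):
    """Sample both Y and Z; average f(X) = Z."""
    total = 0
    for _ in range(m):
        y = int(random.random() < 0.5)
        z = int(random.random() < 0.2 + 0.6 * y)
        total += z
    return total / m

def collapsed_mc(m):
    """Sample only Y; Z is collapsed via E[Z | Y] = 0.2 + 0.6 Y."""
    total = 0.0
    for _ in range(m):
        y = int(random.random() < 0.5)
        total += 0.2 + 0.6 * y
    return total / m
```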
Inference Markov Chain Monte Carlo

Example: Collapsed Gibbs Sampling for LDA

[Figure: the LDA plate diagram with hyperparameters α and β, per-document topic mixtures θ, topics φ (T of them), and topic assignments z with words w (N_d words per document, D documents)]

Collapse θ and φ; the Gibbs update is

P(z_i = j | z_{-i}, w) ∝ (n^{(w_i)}_{-i,j} + β) / (n^{(·)}_{-i,j} + Wβ) × (n^{(d_i)}_{-i,j} + α) / (n^{(d_i)}_{-i,·} + Tα)

n^{(w_i)}_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position

n^{(d_i)}_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position

n^{(·)}_{-i,j}: number of times any word has been assigned to topic j, excluding the current position

n^{(d_i)}_{-i,·}: length of document d_i, excluding the current position

Here W is the vocabulary size and T the number of topics; a code sketch of this update follows
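A sketch of this update for a single position i, in Python; the array names are ours, with n_wt, n_dt, n_t holding the counts n^{(w)}_j, n^{(d)}_j, n^{(·)}_j over all positions.

```python
import random

def resample_topic(i, d, w, z, n_wt, n_dt, n_t, alpha, beta, W, T):
    """Collapsed Gibbs update for the topic assignment z[i], where
    position i holds word w in document d."""
    old = z[i]
    # exclude the current position from the counts (the "-i" in the formula)
    n_wt[w][old] -= 1; n_dt[d][old] -= 1; n_t[old] -= 1
    # unnormalized conditional for each topic j; the document-length
    # denominator n^{(d)}_{-i,.} + T*alpha is constant in j and is dropped
    weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta) * (n_dt[d][j] + alpha)
               for j in range(T)]
    new = random.choices(range(T), weights=weights)[0]
    z[i] = new
    n_wt[w][new] += 1; n_dt[d][new] += 1; n_t[new] += 1
```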
Inference Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling
Gibbs sampling
Collapsed Gibbs sampling
Not covered: block Gibbs, Metropolis-Hastings, etc.
Unbiased (after burn-in), but can have high variance
Inference Belief Propagation

Outline

1 Overview

2 Definitions
Directed Graphical Models
Undirected Graphical Models
Factor Graph

3 Inference
Exact Inference
Markov Chain Monte Carlo
Belief Propagation
Mean Field Algorithm
Exponential Family and Variational Inference
Maximizing Problems

4 Parameter Learning

5 Structure Learning
Inference Belief Propagation

The Sum-Product Algorithm

Also known as belief propagation (BP)

Exact if the graph is a tree; otherwise known as "loopy BP", which is approximate

The algorithm involves passing messages on the factor graph

Alternative view: variational approximation (more later)
Inference Belief Propagation

Example: A Simple HMM

The Hidden Markov Model template (not a graphical model):

[Figure: a two-state HMM. Initial probabilities π_1 = π_2 = 1/2; self-transition probabilities 1/4 for state 1 and 1/2 for state 2; emission distributions over the colors (R, G, B): P(x | z = 1) = (1/2, 1/4, 1/4) and P(x | z = 2) = (1/4, 1/2, 1/4)]
Inference Belief Propagation

Example: A Simple HMM

Observing x_1 = R, x_2 = G, the directed graphical model is

[Figure: chain z_1 → z_2, with emissions z_1 → x_1 = R and z_2 → x_2 = G]

The corresponding factor graph is the chain f_1 — z_1 — f_2 — z_2, with factors

f_1(z_1) = P(z_1) P(x_1 | z_1)
f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 | z_2)
Inference Belief Propagation

Messages

A message is a vector of length K, where K is the number of values x takes

There are two types of messages:
1 µ_{f→x}: message from a factor node f to a variable node x; µ_{f→x}(i) is its ith element, i = 1, ..., K
2 µ_{x→f}: message from a variable node x to a factor node f
Inference Belief Propagation

Leaf Messages

Assume a tree factor graph. Pick an arbitrary root, say z_2

Start messages at the leaves

If a leaf is a factor node f, µ_{f→x}(x) = f(x):

µ_{f1→z1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
µ_{f1→z1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8

If a leaf is a variable node x, µ_{x→f}(x) = 1
Inference Belief Propagation

Message from Variable to Factor

A node (factor or variable) can send out a message once all of its other incoming messages have arrived

Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s:

µ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} µ_{f→x}(x)

In this example:
µ_{z1→f2}(z_1 = 1) = 1/4
µ_{z1→f2}(z_1 = 2) = 1/8
Inference Belief Propagation

Message from Factor to Variable

Let x be in factor f_s, and let the other variables in f_s be x_1, ..., x_M:

µ_{f_s→x}(x) = ∑_{x_1} ... ∑_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M µ_{x_m→f_s}(x_m)

In this example

µ_{f2→z2}(s) = ∑_{s'=1}^2 µ_{z1→f2}(s') f_2(z_1 = s', z_2 = s)
             = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)

We get µ_{f2→z2}(z_2 = 1) = 1/32 and µ_{f2→z2}(z_2 = 2) = 1/8
Inference Belief Propagation

Up to Root, Back Down

The message has reached the root; now pass messages back down

µ_{z2→f2}(z_2 = 1) = 1
µ_{z2→f2}(z_2 = 2) = 1
Inference Belief Propagation

Keep Passing Down

µ_{f2→z1}(s) = ∑_{s'=1}^2 µ_{z2→f2}(s') f_2(z_1 = s, z_2 = s')
             = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)

We get:
µ_{f2→z1}(z_1 = 1) = 7/16
µ_{f2→z1}(z_1 = 2) = 3/8
![Page 230: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/230.jpg)
Inference Belief Propagation
Keep Passing Down
µf2→z1(s) =∑2
s′=1 µz2→f2(s′)f2(z1 = s, z2 = s′)
= 1P (z2 = 1|z1 = s)P (x2 = G|z2 = 1)+ 1P (z2 = 2|z1 = s)P (x2 = G|z2 = 2).
We getµf2→z1(z1 = 1) = 7/16µf2→z1(z1 = 2) = 3/8
z1f 1 z2f 2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
π = π = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
(KDD 2012 Tutorial) Graphical Models 66 / 131
![Page 231: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/231.jpg)
From Messages to Marginals

Once a variable has received all incoming messages, we compute its marginal as

    p(x) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)

In this example (the products are taken elementwise over the message vectors for z = 1, 2):

    P(z_1 | x_1, x_2) ∝ µ_{f_1→z_1} · µ_{f_2→z_1} = (1/4, 1/8) · (7/16, 3/8) = (7/64, 3/64) ⇒ (0.7, 0.3)
    P(z_2 | x_1, x_2) ∝ µ_{f_2→z_2} = (1/32, 1/8) ⇒ (0.2, 0.8)

One can also compute the marginal of the set of variables x_s involved in a factor f_s:

    p(x_s) ∝ f_s(x_s) ∏_{x ∈ ne(f_s)} µ_{x→f_s}(x)
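To make the message-passing arithmetic concrete, here is a minimal sum-product sketch of this two-state HMM example in NumPy. The prior and emission values are the ones shown on the slides; the transition matrix P(z_2 | z_1) is not fully legible in the figure, so the values below are inferred from the message values above (they reproduce 1/32, 1/8, 7/16, and 3/8).

```python
import numpy as np

# Sum-product on the two-node HMM factor graph f1 - z1 - f2 - z2.
pi = np.array([0.5, 0.5])                 # P(z1)
T = np.array([[0.25, 0.75],               # P(z2 | z1); rows: z1, cols: z2
              [0.50, 0.50]])              # (inferred from the slide's message values)
E = np.array([[0.50, 0.25, 0.25],         # P(x | z = 1) over (R, G, B)
              [0.25, 0.50, 0.25]])        # P(x | z = 2)
x1, x2 = 0, 1                             # observed x1 = R, x2 = G

# Factors with the evidence absorbed.
f1 = pi * E[:, x1]                        # f1(z1) = P(z1) P(x1 | z1)
f2 = T * E[:, x2]                         # f2(z1, z2) = P(z2 | z1) P(x2 | z2)

# Upward pass: f1 -> z1 -> f2 -> z2 (the root).
mu_f1_z1 = f1
mu_z1_f2 = mu_f1_z1                       # z1 has no other neighboring factors
mu_f2_z2 = f2.T @ mu_z1_f2                # sum over z1 -> (1/32, 1/8)

# Downward pass: z2 -> f2 -> z1.
mu_z2_f2 = np.ones(2)                     # empty product at the root
mu_f2_z1 = f2 @ mu_z2_f2                  # sum over z2 -> (7/16, 3/8)

# Marginals: normalized elementwise products of incoming messages.
p_z1 = mu_f1_z1 * mu_f2_z1
p_z1 /= p_z1.sum()                        # -> (0.7, 0.3)
p_z2 = mu_f2_z2 / mu_f2_z2.sum()          # -> (0.2, 0.8)
print(p_z1, p_z2)
```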
Handling Evidence

Observing x = v, we can either
  - absorb the evidence into the factor (as we did); or
  - set the messages µ_{x→f}(x) = 0 for all x ≠ v.

Observing X_E, multiplying the incoming messages at a variable x ∉ X_E gives the joint (not p(x | X_E)):

    p(x, X_E) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)

The conditional is then easily obtained by normalization:

    p(x | X_E) = p(x, X_E) / ∑_{x′} p(x′, X_E)
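As a small, hedged illustration of the second option, continuing the NumPy sketch above: zeroing the disallowed entries of a variable's outgoing message conditions on its value. Clamping the hidden z_1 is artificial here, but it shows the mechanics.

```python
# Continuing the example above: clamp z1 = 1 by zeroing its outgoing
# message entries for all other values, then recompute downstream messages.
clamp = np.array([1.0, 0.0])              # indicator of z1 = 1
mu_z1_f2 = mu_f1_z1 * clamp
mu_f2_z2 = f2.T @ mu_z1_f2                # unnormalized p(z2, evidence)
p_z2_given_all = mu_f2_z2 / mu_f2_z2.sum()  # -> (1/7, 6/7)
print(p_z2_given_all)
```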
Loopy Belief Propagation

So far we have assumed a tree-structured factor graph.
When the factor graph contains loops, we can keep passing messages until convergence.
But convergence may not happen.
Even so, in many cases loopy BP works well empirically.
Inference: Mean Field Algorithm

Outline

1 Overview
2 Definitions
    Directed Graphical Models
    Undirected Graphical Models
    Factor Graph
3 Inference
    Exact Inference
    Markov Chain Monte Carlo
    Belief Propagation
    Mean Field Algorithm
    Exponential Family and Variational Inference
    Maximizing Problems
4 Parameter Learning
5 Structure Learning
Example: The Ising Model

[Figure: pairwise MRF with node parameter θ_s at x_s and edge parameter θ_st on edge (x_s, x_t)]

The random variables x_s take values in {0, 1}.

    p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
The Conditional

Markov property: the conditional distribution of x_s is

    p(x_s | x_{−s}) = p(x_s | x_{N(s)})

where N(s) is the set of neighbors of s. This reduces to (recall Gibbs sampling)

    p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
The Mean Field Algorithm for the Ising Model

Gibbs sampling would draw x_s from

    p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )

Instead, let µ_s be the estimated marginal p(x_s = 1). The mean field update replaces the sampled neighbor values with their means:

    µ_s ← 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st µ_t)) + 1 )

The µ's are updated iteratively. The mean field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).
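A minimal sketch of these updates on a small grid Ising model. The grid size and parameter values are arbitrary choices for illustration, not from the tutorial:

```python
import numpy as np

# Mean field updates on a 5x5 grid Ising model (assumed example).
rng = np.random.default_rng(0)
n = 5
theta_s = rng.normal(0, 0.5, (n, n))    # node parameters, arbitrary
theta_st = 0.8                          # one tied edge parameter for simplicity

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu = np.full((n, n), 0.5)               # initial marginals p(x_s = 1)
for sweep in range(100):
    old = mu.copy()
    for i in range(n):
        for j in range(n):
            # sum of theta_st * mu_t over the grid neighbors of (i, j)
            nb = 0.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    nb += theta_st * mu[a, b]
            mu[i, j] = sigmoid(theta_s[i, j] + nb)   # coordinate update
    if np.abs(mu - old).max() < 1e-8:   # stop when a full sweep changes nothing
        break
print(mu.round(3))
```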
Inference: Exponential Family and Variational Inference

Outline

1 Overview
2 Definitions
    Directed Graphical Models
    Undirected Graphical Models
    Factor Graph
3 Inference
    Exact Inference
    Markov Chain Monte Carlo
    Belief Propagation
    Mean Field Algorithm
    Exponential Family and Variational Inference
    Maximizing Problems
4 Parameter Learning
5 Structure Learning
Exponential Family

Let φ(X) = (φ_1(X), …, φ_d(X))^T be d sufficient statistics, where φ_i : X → R.
Here X stands for all the nodes in the graphical model; φ_i(X) is sometimes called a feature function.
Let θ = (θ_1, …, θ_d)^T ∈ R^d be the canonical parameters.
The exponential family is the family of probability densities

    p_θ(x) = exp( θ^T φ(x) − A(θ) )
Exponential Family

    p_θ(x) = exp( θ^T φ(x) − A(θ) )

The exponent is an inner product between the parameters θ and the sufficient statistics φ(x).

[Figure: vectors φ(x), φ(y), φ(z) and a direction θ; ordering the projections onto θ gives p_θ(x) > p_θ(y) > p_θ(z)]

A is the log partition function,

    A(θ) = log ∫ exp( θ^T φ(x) ) ν(dx)

i.e., A = log Z.
Minimal vs. Overcomplete Models

The set of parameters for which the density is normalizable:

    Ω = {θ ∈ R^d | A(θ) < ∞}

A minimal exponential family is one in which the φ's are linearly independent.
An overcomplete exponential family is one in which the φ's are linearly dependent:

    ∃α ∈ R^d such that α^T φ(x) = constant for all x

Both minimal and overcomplete representations are useful.
Exponential Family Example 1: Bernoulli

    p(x) = β^x (1 − β)^{1−x}, for x ∈ {0, 1} and β ∈ (0, 1)

This does not look like an exponential family, but it can be rewritten as

    p(x) = exp( x log β + (1 − x) log(1 − β) )

which is in exponential family form with φ_1(x) = x, φ_2(x) = 1 − x, θ_1 = log β, θ_2 = log(1 − β), and A(θ) = 0.
This representation is overcomplete: α_1 = α_2 = 1 makes α^T φ(x) = 1 for all x.
Exponential Family Example 1: Bernoulli

    p(x) = exp( x log β + (1 − x) log(1 − β) )

can be further rewritten as

    p(x) = exp( xθ − log(1 + exp(θ)) )

a minimal exponential family with φ(x) = x, θ = log(β / (1 − β)), and A(θ) = log(1 + exp(θ)).
Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).
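A quick numerical check of the minimal form (a sketch; β = 0.3 is an arbitrary choice): the rewritten density recovers the Bernoulli pmf, and a finite-difference derivative of A recovers the mean E[φ(x)] = β, previewing the ∇A = µ property discussed below.

```python
import numpy as np

beta = 0.3
theta = np.log(beta / (1 - beta))        # canonical parameter
A = np.log(1 + np.exp(theta))            # log partition function

# p(x) = exp(x*theta - A) should equal the Bernoulli pmf (1 - beta, beta).
p = np.exp(np.array([0.0, 1.0]) * theta - A)
print(p)                                 # [0.7 0.3]

# dA/dtheta ~ E[phi(x)] = beta, by central finite differences.
eps = 1e-6
dA = (np.log(1 + np.exp(theta + eps)) - np.log(1 + np.exp(theta - eps))) / (2 * eps)
print(dA)                                # ~0.3
```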
Exponential Family Example 2: Ising Model

    p_θ(x) = exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t − A(θ) )

Binary random variables x_s ∈ {0, 1}.
There are d = |V| + |E| sufficient statistics: φ(x) = (… x_s … x_s x_t …)^T.
This is a regular (Ω = R^d), minimal exponential family.
Exponential Family Example 3: Potts Model

Similar to the Ising model, but generalizing to x_s ∈ {0, …, r − 1}.
Indicator functions: f_sj(x) = 1 if x_s = j and 0 otherwise; f_stjk(x) = 1 if x_s = j ∧ x_t = k and 0 otherwise.

    p_θ(x) = exp( ∑_{s,j} θ_sj f_sj(x) + ∑_{s,t,j,k} θ_stjk f_stjk(x) − A(θ) )
For this representation, d = r|V| + r²|E|.
It is regular but overcomplete, because ∑_{j=0}^{r−1} f_sj(x) = 1 for any s ∈ V and all x.
The Potts model proper is the special case in which the parameters are tied: θ_stkk = α, and θ_stjk = β for j ≠ k.
Important Relationship

For sufficient statistics defined by indicator functions, e.g., φ_sj(x) = f_sj(x) = 1 if x_s = j and 0 otherwise, the marginal can be obtained via the mean:

    E_θ[φ_sj(x)] = P(x_s = j)

Since inference is often about computing marginals, in this case it is equivalent to computing means.
We will come back to this point when we discuss variational inference.
Mean Parameters

Let p be any density (not necessarily in the exponential family). Given sufficient statistics φ, the mean parameters µ = (µ_1, …, µ_d)^T are

    µ_i = E_p[φ_i(x)] = ∫ φ_i(x) p(x) dx

The set of all realizable mean parameters is

    M = {µ ∈ R^d | ∃p s.t. E_p[φ(x)] = µ}

M is convex: if µ^(1), µ^(2) ∈ M, there exist corresponding p^(1), p^(2), and a convex combination of p^(1), p^(2) realizes the same convex combination of µ^(1), µ^(2), which therefore also lies in M.
Example: The First Two Moments

Let φ_1(x) = x and φ_2(x) = x².
For any p on x (not necessarily Gaussian), the mean parameters are µ = (µ_1, µ_2)^T = (E(x), E(x²))^T.
Note V(x) = E(x²) − E(x)² = µ_2 − µ_1² ≥ 0 for any p.
So M is not R × R_+ but rather the subset {(µ_1, µ_2) | µ_2 ≥ µ_1²}.
The Marginal Polytope

The marginal polytope is defined for discrete x_s.
Recall M = {µ ∈ R^d | µ = ∑_x φ(x) p(x) for some p}.
Here p can be a point mass on a particular x; in fact, any p is a convex combination of such point masses.
Thus M = conv{φ(x) : all x} is a convex hull, called the marginal polytope.
Marginal Polytope Example

Tiny Ising model: two nodes x_1, x_2 ∈ {0, 1} connected by an edge.
Minimal sufficient statistics: φ(x_1, x_2) = (x_1, x_2, x_1 x_2)^T.
There are only 4 configurations x = (x_1, x_2), so the marginal polytope is

    M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}

a polytope inside the unit cube. Its three coordinates are the node marginals µ_1 = P(x_1 = 1) and µ_2 = P(x_2 = 1) and the edge marginal µ_12 = P(x_1 = 1, x_2 = 1), hence the name.
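A short sketch making this concrete (the θ below is an arbitrary choice): the four configurations give the vertices of M, and the mean parameter of any p_θ is a convex combination of them.

```python
import numpy as np
from itertools import product

configs = list(product([0, 1], repeat=2))
phi = np.array([[x1, x2, x1 * x2] for x1, x2 in configs])  # the 4 vertices of M

theta = np.array([0.5, -0.3, 1.0])       # arbitrary canonical parameters
w = np.exp(phi @ theta)
p = w / w.sum()                          # p_theta over the 4 configurations

mu = p @ phi                             # mean parameters (mu_1, mu_2, mu_12)
print(mu)                                # lies in conv{rows of phi}
```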
The Log Partition Function A

For any regular exponential family, A(θ) is convex in θ (strictly convex for a minimal exponential family).
Nice property:

    ∂A(θ)/∂θ_i = E_θ[φ_i(x)]

Therefore ∇A(θ) = µ, the mean parameters of p_θ.
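Continuing the tiny two-node Ising sketch, we can verify ∇A = µ numerically by enumeration (again with an arbitrary θ):

```python
import numpy as np
from itertools import product

phi = np.array([[x1, x2, x1 * x2] for x1, x2 in product([0, 1], repeat=2)])

def A(theta):
    return np.log(np.exp(phi @ theta).sum())   # log partition by enumeration

theta = np.array([0.5, -0.3, 1.0])
p = np.exp(phi @ theta - A(theta))
mu = p @ phi                                    # mean parameters E_theta[phi(x)]

eps = 1e-6
grad = np.array([(A(theta + eps * e) - A(theta - eps * e)) / (2 * eps)
                 for e in np.eye(3)])           # finite-difference gradient of A
print(np.allclose(grad, mu))                    # True
```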
Inference Exponential Family and Variational Inference
Conjugate Duality
The conjugate dual function A* to A is defined as
A*(µ) = sup_{θ∈Ω} θ^T µ − A(θ)
Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
For any µ in the interior of M, let θ(µ) satisfy
E_{θ(µ)}[φ(x)] = ∇A(θ(µ)) = µ
Then A*(µ) = −H(p_{θ(µ)}), the negative entropy.
The dual of the dual gives back A: A(θ) = sup_{µ∈M} µ^T θ − A*(µ). For all θ ∈ Ω, the supremum is attained uniquely at the µ ∈ M° given by the moment matching condition µ = Eθ[φ(x)].
(KDD 2012 Tutorial) Graphical Models 89 / 131
Inference Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli, with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
By definition,
A*(µ) = sup_{θ∈R} θµ − log(1 + exp(θ))
Taking the derivative and solving for θ gives
A*(µ) = µ log µ + (1 − µ) log(1 − µ)
i.e., the negative entropy.
(KDD 2012 Tutorial) Graphical Models 90 / 131
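The same computation in code (an illustrative sketch): solve for the maximizing θ(µ) = log(µ/(1 − µ)) as on the slide, plug it back in, and compare with the negative entropy.

```python
import numpy as np

def A(theta):
    return np.log1p(np.exp(theta))       # log(1 + exp(theta))

def A_star(mu):
    # the sup is attained where d/dtheta (theta*mu - A(theta)) = mu - sigmoid(theta) = 0
    theta = np.log(mu / (1 - mu))
    return theta * mu - A(theta)

for mu in [0.1, 0.5, 0.9]:
    neg_entropy = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
    print(f"mu={mu}: {A_star(mu):.6f} vs {neg_entropy:.6f}")   # the two agree
```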
Inference Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup_{µ∈M} µ^T θ − A*(µ)
The supremum is attained by µ = Eθ[φ(x)].
Want to compute the marginals P(xs = j)? They are the mean parameters µsj = Eθ[φsj(x)] under the standard overcomplete representation.
Want to compute the mean parameters µsj? They are the arg sup of the optimization problem above.
This variational representation is exact, not approximate.
We will relax it next to derive loopy BP and mean field.
(KDD 2012 Tutorial) Graphical Models 91 / 131
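For the Bernoulli family this program can be solved by brute force, since M = [0, 1]. The sketch below (illustrative code) recovers both the mean parameter µ = 1/(1 + e^{−θ}) and the value A(θ) from the variational representation.

```python
import numpy as np

def A_star(mu):
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)    # Bernoulli dual

theta = 1.3
mus = np.linspace(1e-6, 1 - 1e-6, 100001)                 # a fine grid over M = [0, 1]
objective = mus * theta - A_star(mus)

print(mus[np.argmax(objective)], 1 / (1 + np.exp(-theta)))   # arg sup = mean parameter
print(objective.max(), np.log1p(np.exp(theta)))              # sup value = A(theta)
```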
Inference Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup_{µ∈M} µ^T θ − A*(µ)
Difficult to solve even though it is a convex problem.
Two issues:
Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices).
The dual function A*(µ) usually does not admit an explicit form.
Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.
Next, we cast the mean field and sum-product algorithms as variational approximations.
(KDD 2012 Tutorial) Graphical Models 92 / 131
Inference Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F) on which A*(µ) has a closed form.
Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E).
Set θi = 0 whenever φi involves an edge.
The densities in this sub-family are all fully factorized:
pθ(x) = ∏_{s∈V} p(xs; θs)
(KDD 2012 Tutorial) Graphical Models 93 / 131
Inference Exponential Family and Variational Inference
The Geometry of M(F)
Let M(F) be the mean parameters of the fully factorized sub-family. In general, M(F) ⊂ M.
Recall M is the convex hull of the extreme points φ(x).
It turns out the extreme points φ(x) lie in the closure of M(F).
Example:
The tiny Ising model x1, x2 ∈ {0, 1} with φ = (x1, x2, x1x2)^T.
The point mass distribution p(x = (0, 1)^T) = 1 is realized as the limit of the sequence p(x) = exp(θ1x1 + θ2x2 − A(θ)) as θ1 → −∞ and θ2 → ∞.
This sequence is in F because θ12 = 0.
Hence the extreme point φ(x) = (0, 1, 0) is in the closure of M(F).
(KDD 2012 Tutorial) Graphical Models 94 / 131
Inference Exponential Family and Variational Inference
The Geometry of M(F)
[Figure: M(F) as a nonconvex inner approximation of M; courtesy Nick Bridle]
Because the extreme points of M are in M(F), if M(F) were convex, we would have M = M(F).
But in general M(F) is a proper subset of M.
Therefore, M(F) is a nonconvex inner approximation of M.
(KDD 2012 Tutorial) Graphical Models 95 / 131
Inference Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
Recall the exact variational problem
A(θ) = sup_{µ∈M} µ^T θ − A*(µ)
attained by the solution to the inference problem, µ = Eθ[φ(x)].
The mean field method simply replaces M with M(F):
L(θ) = sup_{µ∈M(F)} µ^T θ − A*(µ)
Obviously L(θ) ≤ A(θ).
The original solution µ* may not be in M(F).
Even if µ* is in M(F), the iteration may hit a local maximum and never find it.
Why bother? Because A*(µ) = −H(p_{θ(µ)}) has a very simple form on M(F).
(KDD 2012 Tutorial) Graphical Models 96 / 131
Inference Exponential Family and Variational Inference
Example: Mean Field for Ising Model
The mean parameters for the Ising model are the node and edge marginals: µs = p(xs = 1), µst = p(xs = 1, xt = 1).
Fully factorized M(F) means no edges: µst = µsµt.
On M(F), the dual function A*(µ) has the simple form
A*(µ) = Σ_{s∈V} −H(µs) = Σ_{s∈V} [µs log µs + (1 − µs) log(1 − µs)]
Thus the mean field problem is
L(θ) = sup_{µ∈M(F)} µ^T θ − Σ_{s∈V} (µs log µs + (1 − µs) log(1 − µs))
     = max_{(µ1,...,µm)∈[0,1]^m} Σ_{s∈V} θsµs + Σ_{(s,t)∈E} θstµsµt + Σ_{s∈V} H(µs)
(KDD 2012 Tutorial) Graphical Models 97 / 131
Inference Exponential Family and Variational Inference
Example: Mean Field for Ising Model
L(θ) = max_{(µ1,...,µm)∈[0,1]^m} Σ_{s∈V} θsµs + Σ_{(s,t)∈E} θstµsµt + Σ_{s∈V} H(µs)
Bilinear in µ, hence not jointly concave.
But concave in a single coordinate µs, fixing the others.
Iterative coordinate-wise maximization: fix µt for t ≠ s and optimize µs.
Setting the partial derivative w.r.t. µs to 0 yields
µs = 1 / (1 + exp(−(θs + Σ_{(s,t)∈E} θstµt)))
as we've seen before.
Caution: mean field converges to a local maximum depending on the initialization of µ1, ..., µm.
(KDD 2012 Tutorial) Graphical Models 98 / 131
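A compact sketch of these coordinate-wise updates (illustrative code; the 4-cycle graph and the parameter values are made up), with a brute-force computation of the exact node marginals for comparison:

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a small 4-cycle
rng = np.random.default_rng(0)
theta_s = rng.normal(size=n)                     # node parameters (made up)
theta_st = {e: rng.normal() for e in edges}      # edge parameters (made up)

mu = np.full(n, 0.5)                             # initialization matters!
for _ in range(200):                             # coordinate-wise mean field updates
    for s in range(n):
        field = theta_s[s] + sum(theta_st[(a, b)] * mu[b if a == s else a]
                                 for a, b in edges if s in (a, b))
        mu[s] = sigmoid(field)

# exact node marginals by brute-force enumeration (feasible only for tiny n)
configs = list(itertools.product([0, 1], repeat=n))
scores = np.array([sum(theta_s[s] * x[s] for s in range(n)) +
                   sum(theta_st[(a, b)] * x[a] * x[b] for a, b in edges)
                   for x in configs])
p = np.exp(scores - np.logaddexp.reduce(scores))
exact = np.array([sum(p[i] for i, x in enumerate(configs) if x[s] == 1)
                  for s in range(n)])
print("mean field:", np.round(mu, 3))
print("exact     :", np.round(exact, 3))
```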
Inference Exponential Family and Variational Inference
The Sum-Product Algorithm as Variational Approximation
A(θ) = sup_{µ∈M} µ^T θ − A*(µ)
The sum-product algorithm makes two approximations:
it relaxes M to an outer set L;
it replaces the dual A* with an approximation Ã*.
Ã(θ) = sup_{µ∈L} µ^T θ − Ã*(µ)
(KDD 2012 Tutorial) Graphical Models 99 / 131
Inference Exponential Family and Variational Inference
The Outer Relaxation
For overcomplete exponential families on discrete nodes, the mean parameters are the node and edge marginals µsj = p(xs = j), µstjk = p(xs = j, xt = k).
The marginal polytope is M = {µ | ∃ p with marginals µ}.
Now consider τ ∈ R^d_+ satisfying the "node normalization" and "edge-node marginal consistency" conditions:
Σ_{j=0}^{r−1} τsj = 1        ∀ s ∈ V
Σ_{k=0}^{r−1} τstjk = τsj    ∀ (s, t) ∈ E, j = 0, ..., r−1
Σ_{j=0}^{r−1} τstjk = τtk    ∀ (s, t) ∈ E, k = 0, ..., r−1
Define L = {τ satisfying the above conditions}.
(KDD 2012 Tutorial) Graphical Models 100 / 131
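The conditions defining L transcribe directly into code. Below is an illustrative checker (all names are mine, not from the tutorial); tau_node[s] is a length-r vector and tau_edge[(s, t)] an r-by-r matrix with entries τ_{st,jk}.

```python
import numpy as np

def in_local_polytope(tau_node, tau_edge, edges, tol=1e-8):
    """Check the node normalization and edge-node consistency conditions."""
    for s, t in edges:
        T = tau_edge[(s, t)]
        if np.any(T < -tol):                                        # nonnegativity
            return False
        if not np.isclose(tau_node[s].sum(), 1.0, atol=tol):        # sum_j tau_sj = 1
            return False
        if not np.allclose(T.sum(axis=1), tau_node[s], atol=tol):   # sum_k tau_stjk = tau_sj
            return False
        if not np.allclose(T.sum(axis=0), tau_node[t], atol=tol):   # sum_j tau_stjk = tau_tk
            return False
    return True
```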
Inference Exponential Family and Variational Inference
The Outer Relaxation
If the graph is a tree then M = L.
If the graph has cycles then M ⊂ L: L does not enforce other constraints that true marginals must satisfy.
Nice property: L is still a polytope, but much simpler than M.
[Figure: the relaxed polytope L as an outer bound on the marginal polytope M, whose vertices are the φ(x); a point τ can lie in L but outside M.]
The first approximation in sum-product is to replace M with L.
(KDD 2012 Tutorial) Graphical Models 101 / 131
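A concrete instance of M ⊂ L is the classic three-cycle pseudomarginal (a standard example, e.g. in Wainwright and Jordan): every local constraint holds, yet no joint distribution has these marginals, because x1 ≠ x2, x2 ≠ x3, x3 ≠ x1 cannot all hold for binary variables. An illustrative check:

```python
import numpy as np

tau_s = np.array([0.5, 0.5])                  # the same node marginal at every node
tau_st = np.array([[0.0, 0.5], [0.5, 0.0]])   # perfectly anti-correlated on every edge

print(tau_s.sum())          # 1.0: node normalization holds
print(tau_st.sum(axis=1))   # [0.5 0.5] = tau_s: consistent with the first endpoint
print(tau_st.sum(axis=0))   # [0.5 0.5] = tau_s: consistent with the second endpoint
# so tau is in L for the triangle, yet tau is provably not in M
```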
Inference Exponential Family and Variational Inference
Approximating A∗
Recall µ are the node and edge marginals.
If the graph is a tree, one can exactly reconstruct the joint probability:
pµ(x) = ∏_{s∈V} µs,xs ∏_{(s,t)∈E} [µst,xsxt / (µs,xs µt,xt)]
If the graph is a tree, the entropy of the joint distribution is
H(pµ) = −Σ_{s∈V} Σ_{j=0}^{r−1} µsj log µsj − Σ_{(s,t)∈E} Σ_{j,k} µstjk log [µstjk / (µsj µtk)]
Neither holds for graphs with cycles.
(KDD 2012 Tutorial) Graphical Models 102 / 131
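An illustrative verification of the tree reconstruction on a 3-node chain 0-1-2 (the joint below is an arbitrary chain-structured distribution I construct for the check):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# a random chain-structured joint p(x) = p(x0) p(x1|x0) p(x2|x1), binary states
p0 = rng.dirichlet([1, 1])
p1g0 = rng.dirichlet([1, 1], size=2)          # p1g0[x0] = p(x1 | x0)
p2g1 = rng.dirichlet([1, 1], size=2)          # p2g1[x1] = p(x2 | x1)
joint = {x: p0[x[0]] * p1g0[x[0], x[1]] * p2g1[x[1], x[2]]
         for x in itertools.product([0, 1], repeat=3)}

# node and edge marginals of the joint
mu_s = {s: np.array([sum(p for x, p in joint.items() if x[s] == j) for j in (0, 1)])
        for s in range(3)}
mu_st = {(s, t): np.array([[sum(p for x, p in joint.items() if x[s] == j and x[t] == k)
                            for k in (0, 1)] for j in (0, 1)])
         for s, t in [(0, 1), (1, 2)]}

# reconstruct the joint from the marginals and compare
for x in itertools.product([0, 1], repeat=3):
    recon = np.prod([mu_s[s][x[s]] for s in range(3)])
    for s, t in [(0, 1), (1, 2)]:
        recon *= mu_st[(s, t)][x[s], x[t]] / (mu_s[s][x[s]] * mu_s[t][x[t]])
    assert np.isclose(recon, joint[x])        # exact on a tree
```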
Inference Exponential Family and Variational Inference
Approximating A∗
Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:
HBethe(pτ) = −Σ_{s∈V} Σ_{j=0}^{r−1} τsj log τsj − Σ_{(s,t)∈E} Σ_{j,k} τstjk log [τstjk / (τsj τtk)]
Note τ is not necessarily a true marginal, and HBethe is not a true entropy.
The second approximation in sum-product is to replace A*(τ) with −HBethe(pτ).
(KDD 2012 Tutorial) Graphical Models 103 / 131
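The Bethe entropy is simple to compute from pseudo-marginals; here is an illustrative helper (names and the demo values are mine):

```python
import numpy as np

def bethe_entropy(tau_node, tau_edge, edges):
    def xlogx(v):
        v = np.asarray(v)
        return np.where(v > 0, v * np.log(np.clip(v, 1e-300, None)), 0.0)
    H = -sum(xlogx(tau_node[s]).sum() for s in tau_node)        # node terms
    for s, t in edges:                                          # edge (mutual info) terms
        T = tau_edge[(s, t)]
        outer = np.outer(tau_node[s], tau_node[t])
        with np.errstate(divide="ignore", invalid="ignore"):
            H -= np.where(T > 0, T * np.log(T / outer), 0.0).sum()
    return H

# demo: independent uniform pseudo-marginals on a 3-node chain (a tree),
# where the Bethe entropy is exact: 3*log(2) ~ 2.079
uni = np.array([0.5, 0.5])
ind = np.outer(uni, uni)
edges = [(0, 1), (1, 2)]
print(bethe_entropy({s: uni for s in range(3)}, {e: ind for e in edges}, edges))
```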
Inference Exponential Family and Variational Inference
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
Ã(θ) = sup_{τ∈L} τ^T θ + HBethe(pτ)
Optimality conditions require that the gradients vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed-point procedure for satisfying these optimality conditions.
At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound on A(θ).
τ may not correspond to a true marginal distribution.
(KDD 2012 Tutorial) Graphical Models 104 / 131
![Page 378: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/378.jpg)
Inference Exponential Family and Variational Inference
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
A(θ) = supτ∈L
τ>θ +HBethe(pτ )
Optimality conditions require the gradients vanish w.r.t. both τ andthe Lagrangian multipliers on constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed pointprocedure to satisfy optimality conditions.
At the solution, A(θ) is not guaranteed to be either an upper or alower bound of A(θ)τ may not correspond to a true marginal distribution
(KDD 2012 Tutorial) Graphical Models 104 / 131
![Page 379: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/379.jpg)
Inference Exponential Family and Variational Inference
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
A(θ) = supτ∈L
τ>θ +HBethe(pτ )
Optimality conditions require the gradients vanish w.r.t. both τ andthe Lagrangian multipliers on constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed pointprocedure to satisfy optimality conditions.
At the solution, A(θ) is not guaranteed to be either an upper or alower bound of A(θ)
τ may not correspond to a true marginal distribution
(KDD 2012 Tutorial) Graphical Models 104 / 131
![Page 380: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/380.jpg)
Inference Exponential Family and Variational Inference
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
A(θ) = supτ∈L
τ>θ +HBethe(pτ )
Optimality conditions require the gradients vanish w.r.t. both τ andthe Lagrangian multipliers on constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed pointprocedure to satisfy optimality conditions.
At the solution, A(θ) is not guaranteed to be either an upper or alower bound of A(θ)τ may not correspond to a true marginal distribution
(KDD 2012 Tutorial) Graphical Models 104 / 131
Summary: Variational Inference

The sum-product algorithm (loopy belief propagation)

The mean field method

Not covered: Expectation Propagation

All offer efficient computation, but often with an unknown bias in the solution.
Outline

1 Overview

2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph

3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems

4 Parameter Learning

5 Structure Learning
Maximizing Problems

Recall the HMM example: two hidden states with initial probabilities π_1 = π_2 = 1/2 and emission distributions P(x | z = 1) = (1/2, 1/4, 1/4), P(x | z = 2) = (1/4, 1/2, 1/4) over the colors R, G, B. (The transition probabilities, read off from the worked messages below, are P(z' = 1 | z = 1) = 1/4, P(z' = 2 | z = 1) = 3/4, and P(z' = 1 | z = 2) = P(z' = 2 | z = 2) = 1/2.)

There are two senses of "best states" z_{1:N} given x_{1:N}:

1 So far we computed the marginal p(z_n | x_{1:N}). We can define "best" as \( z^*_n = \arg\max_k p(z_n = k \mid x_{1:N}) \). However z^*_{1:N} as a whole may not be the best; in fact z^*_{1:N} can even have zero probability! (A toy check follows this list.)

2 An alternative is to find

\[ z^*_{1:N} = \arg\max_{z_{1:N}} p(z_{1:N} \mid x_{1:N}) \]

which is the most likely state configuration as a whole. The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs.
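A toy check of the zero-probability claim above (my own construction, not from the tutorial): take a joint over three binary variables that is uniform on the three one-hot configurations; every node-wise argmax is 0, yet the all-zeros configuration has probability zero:

```python
import itertools
import numpy as np

# Joint over (z1, z2, z3) in {0,1}^3: uniform on the one-hot configs.
configs = list(itertools.product([0, 1], repeat=3))
p = {c: (1/3 if sum(c) == 1 else 0.0) for c in configs}

# Node-wise best: argmax of each single-variable marginal.
marginal_best = []
for i in range(3):
    marg = [sum(v for c, v in p.items() if c[i] == k) for k in (0, 1)]
    marginal_best.append(int(np.argmax(marg)))  # each marginal is (2/3, 1/3)

joint_best = max(p, key=p.get)
print(marginal_best, p[tuple(marginal_best)])  # [0, 0, 0] 0.0
print(joint_best, p[joint_best])               # a one-hot config, 1/3
```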
Intermediate: The Max-Product Algorithm

Simple modification to the sum-product algorithm: replace \( \sum \) with max in the factor-to-variable messages.

\[ \mu_{f_s \to x}(x) = \max_{x_1} \cdots \max_{x_M} \; f_s(x, x_1, \ldots, x_M) \prod_{m=1}^{M} \mu_{x_m \to f_s}(x_m) \]

\[ \mu_{x_m \to f_s}(x_m) = \prod_{f \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f \to x_m}(x_m) \]

\[ \mu_{x_{\mathrm{leaf}} \to f}(x) = 1, \qquad \mu_{f_{\mathrm{leaf}} \to x}(x) = f(x) \]
Intermediate: The Max-Product Algorithm

As in sum-product, pick an arbitrary variable node x as the root.

Pass messages up from the leaves until they reach the root.

Unlike sum-product, do not pass messages back from the root to the leaves.

At the root, multiply the incoming messages:

\[ p^{\max} = \max_x \prod_{f \in \mathrm{ne}(x)} \mu_{f \to x}(x) \]

This is the probability of the most likely state configuration.
Intermediate: The Max-Product Algorithm

To identify the configuration itself, keep back pointers. When creating the message

\[ \mu_{f_s \to x}(x) = \max_{x_1} \cdots \max_{x_M} \; f_s(x, x_1, \ldots, x_M) \prod_{m=1}^{M} \mu_{x_m \to f_s}(x_m) \]

for each value of x we separately create M pointers back to the values of x_1, ..., x_M that achieve the maximum.

At the root, backtrack the pointers.
Intermediate: The Max-Product Algorithm

Consider the chain factor graph f_1 - z_1 - f_2 - z_2, with factors f_1(z_1) = P(z_1) P(x_1 | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 | z_2), using the HMM above with observations x_1 = R and x_2 = G.

Message from the leaf f_1:

\[ \mu_{f_1 \to z_1}(z_1 = 1) = P(z_1 = 1)\, P(R \mid z_1 = 1) = 1/2 \cdot 1/2 = 1/4 \]
\[ \mu_{f_1 \to z_1}(z_1 = 2) = P(z_1 = 2)\, P(R \mid z_1 = 2) = 1/2 \cdot 1/4 = 1/8 \]

The second message:

\[ \mu_{z_1 \to f_2}(z_1 = 1) = 1/4, \qquad \mu_{z_1 \to f_2}(z_1 = 2) = 1/8 \]
Intermediate: The Max-Product Algorithm

\[ \mu_{f_2 \to z_2}(z_2 = 1) = \max_{z_1} f_2(z_1, z_2)\, \mu_{z_1 \to f_2}(z_1) \]
\[ = \max_{z_1} P(z_2 = 1 \mid z_1)\, P(x_2 = G \mid z_2 = 1)\, \mu_{z_1 \to f_2}(z_1) \]
\[ = \max(1/4 \cdot 1/4 \cdot 1/4,\ 1/2 \cdot 1/4 \cdot 1/8) = 1/64 \]

Back pointer for z_2 = 1: either z_1 = 1 or z_1 = 2 (a tie).
Intermediate: The Max-Product Algorithm

The other element of the same message:

\[ \mu_{f_2 \to z_2}(z_2 = 2) = \max_{z_1} f_2(z_1, z_2)\, \mu_{z_1 \to f_2}(z_1) \]
\[ = \max_{z_1} P(z_2 = 2 \mid z_1)\, P(x_2 = G \mid z_2 = 2)\, \mu_{z_1 \to f_2}(z_1) \]
\[ = \max(3/4 \cdot 1/2 \cdot 1/4,\ 1/2 \cdot 1/2 \cdot 1/8) = 3/32 \]

Back pointer for z_2 = 2: z_1 = 1
Intermediate: The Max-Product Algorithm

Collecting the message: \( \mu_{f_2 \to z_2}(1) = 1/64 \) (back pointer: z_1 = 1 or 2) and \( \mu_{f_2 \to z_2}(2) = 3/32 \) (back pointer: z_1 = 1).

At the root z_2: \( \max_{s = 1, 2} \mu_{f_2 \to z_2}(s) = 3/32 \), attained at z_2 = 2; following its back pointer gives z_1 = 1. Hence

\[ z^*_{1:2} = \arg\max_{z_{1:2}} p(z_{1:2} \mid x_{1:2}) = (1, 2) \]

In this example, sum-product and max-product produce the same best sequence; in general they differ.
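A minimal sketch of max-product with back pointers on this two-step chain (my own code and variable names, assuming NumPy); it reproduces the message values 1/64 and 3/32 and the decoded sequence (1, 2):

```python
import numpy as np

pi = np.array([0.5, 0.5])                # initial P(z1)
T = np.array([[0.25, 0.75],              # T[i, j] = P(z' = j+1 | z = i+1)
              [0.50, 0.50]])
E = np.array([[0.5, 0.25, 0.25],         # E[i, o] = P(x = o | z = i+1), o in {R, G, B}
              [0.25, 0.5, 0.25]])
R, G = 0, 1
obs = [R, G]

# Leaf message mu_{f1 -> z1}(z1) = P(z1) P(x1 | z1)
msg = pi * E[:, obs[0]]                  # [1/4, 1/8]

# Factor-to-variable max message with back pointers:
# mu_{f2 -> z2}(z2) = max_{z1} P(z2 | z1) P(x2 | z2) msg(z1)
scores = T * msg[:, None] * E[:, obs[1]][None, :]  # scores[z1, z2]
back = scores.argmax(axis=0)             # best z1 for each value of z2
msg2 = scores.max(axis=0)                # [1/64, 3/32]

z2 = int(msg2.argmax())
z1 = int(back[z2])
print(msg2)                              # [0.015625 0.09375 ]
print(z1 + 1, z2 + 1)                    # 1 2
```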
From Max-Product to Max-Sum

The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow:

\[ \mu_{f_s \to x}(x) = \max_{x_1 \ldots x_M} \left[ \log f_s(x, x_1, \ldots, x_M) + \sum_{m=1}^{M} \mu_{x_m \to f_s}(x_m) \right] \]

\[ \mu_{x_m \to f_s}(x_m) = \sum_{f \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f \to x_m}(x_m) \]

\[ \mu_{x_{\mathrm{leaf}} \to f}(x) = 0, \qquad \mu_{f_{\mathrm{leaf}} \to x}(x) = \log f(x) \]

When at the root,

\[ \log p^{\max} = \max_x \sum_{f \in \mathrm{ne}(x)} \mu_{f \to x}(x) \]

The back pointers are the same.
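The same two-step example in log space, sketched under the same assumptions; only the combination rule changes, with products becoming sums of logs:

```python
import numpy as np

pi = np.array([0.5, 0.5])
T = np.array([[0.25, 0.75], [0.50, 0.50]])
E = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])
obs = [0, 1]  # R, G

# Max-sum: identical recursions on log-probabilities, with sums.
log_msg = np.log(pi * E[:, obs[0]])
log_scores = np.log(T) + log_msg[:, None] + np.log(E[:, obs[1]])[None, :]
back = log_scores.argmax(axis=0)
log_msg2 = log_scores.max(axis=0)
print(np.exp(log_msg2.max()))  # 0.09375 = 3/32, matching max-product
```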
Outline

1 Overview

2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph

3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems

4 Parameter Learning

5 Structure Learning
Parameter Learning

Assume the graph structure is given.

Learning in the exponential family: estimate θ from iid data x_1, ..., x_n.

Principle: maximum likelihood.

Distinguish two cases:

fully observed data: all dimensions of x are observed

partially observed data: some dimensions of x are unobserved
Fully Observed Data

\[ p_\theta(x) = \exp\left( \theta^\top \phi(x) - A(\theta) \right) \]

Given iid data x_1, ..., x_n, the log likelihood is

\[ \ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) = \theta^\top \left( \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \right) - A(\theta) = \theta^\top \hat{\mu} - A(\theta) \]

\( \hat{\mu} \equiv \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \) is the mean parameter of the empirical distribution on x_1, ..., x_n. Clearly \( \hat{\mu} \in \mathcal{M} \).

Maximum likelihood: \( \theta_{ML} = \arg\sup_{\theta \in \Omega} \theta^\top \hat{\mu} - A(\theta) \)

The solution is \( \theta_{ML} = \theta(\hat{\mu}) \), the member of the exponential family whose mean parameter matches \( \hat{\mu} \).

When \( \hat{\mu} \in \mathcal{M}^\circ \) (the interior of \( \mathcal{M} \)) and φ is minimal, the maximum likelihood solution \( \theta_{ML} \) is unique.
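A tiny worked instance of this moment matching (my own illustration): for the Bernoulli family p_θ(x) = exp(θx − A(θ)) with A(θ) = log(1 + e^θ), the mean parameter is E_θ[x] = 1/(1 + e^{−θ}), so matching it to the empirical mean gives θ_ML = log(µ̂/(1 − µ̂)):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # iid Bernoulli data
mu = x.mean()                             # empirical mean parameter, 5/8
theta_ml = np.log(mu / (1 - mu))          # invert E_theta[x] = sigmoid(theta)

# Check: the fitted model's mean parameter matches the empirical one.
print(mu, 1 / (1 + np.exp(-theta_ml)))    # 0.625 0.625
```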
Partially Observed Data

Each item is a pair (x, z) where x is observed and z is unobserved.

Full data (x_1, z_1), ..., (x_n, z_n), but we only observe x_1, ..., x_n.

The incomplete likelihood is \( \ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \), where \( p_\theta(x_i) = \int p_\theta(x_i, z)\, dz \).

It can be written as \( \ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} A_{x_i}(\theta) - A(\theta) \).

Here \( A_{x_i} \) is a new log partition function, one per item, for \( p_\theta(z \mid x_i) \):

\[ A_{x_i}(\theta) = \log \int \exp\left( \theta^\top \phi(x_i, z') \right) dz' \]

Expectation-Maximization (EM) algorithm: lower-bound \( A_{x_i} \).
EM as Variational Lower Bound

The mean parameters realizable by any distribution on z while holding x_i fixed:

\[ \mathcal{M}_{x_i} = \left\{ \mu \in \mathbb{R}^d \;\middle|\; \mu = E_p[\phi(x_i, z)] \text{ for some } p \right\} \]

The variational definition: \( A_{x_i}(\theta) = \sup_{\mu \in \mathcal{M}_{x_i}} \theta^\top \mu - A^*_{x_i}(\mu) \)

Trivial variational lower bound: \( A_{x_i}(\theta) \ge \theta^\top \mu^i - A^*_{x_i}(\mu^i) \) for all \( \mu^i \in \mathcal{M}_{x_i} \)

Lower bound \( \mathcal{L} \) on the incomplete log likelihood:

\[ \ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} A_{x_i}(\theta) - A(\theta) \;\ge\; \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top \mu^i - A^*_{x_i}(\mu^i) \right) - A(\theta) \;\equiv\; \mathcal{L}(\mu^1, \ldots, \mu^n, \theta) \]
Exact EM: The E-Step

The EM algorithm is coordinate ascent on \( \mathcal{L}(\mu^1, \ldots, \mu^n, \theta) \).

The E-step maximizes over each \( \mu^i \):

\[ \mu^i \leftarrow \arg\max_{\mu^i \in \mathcal{M}_{x_i}} \mathcal{L}(\mu^1, \ldots, \mu^n, \theta) \]

Equivalently, \( \mu^i \leftarrow \arg\max_{\mu^i \in \mathcal{M}_{x_i}} \theta^\top \mu^i - A^*_{x_i}(\mu^i) \).

This is the variational representation of the mean parameter \( \mu^i(\theta) = E_\theta[\phi(x_i, z)] \).

The E-step is named after this expectation \( E_\theta[\cdot] \) under the current parameters θ.
Exact EM: The M-Step

In the M-step, maximize over θ holding the µ's fixed:

\[ \theta \leftarrow \arg\max_{\theta \in \Omega} \mathcal{L}(\mu^1, \ldots, \mu^n, \theta) = \arg\max_{\theta \in \Omega} \theta^\top \bar{\mu} - A(\theta), \qquad \bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mu^i \]

The solution \( \theta(\bar{\mu}) \) satisfies \( E_{\theta(\bar{\mu})}[\phi(x)] = \bar{\mu} \).

This is a standard fully observed maximum likelihood problem, hence the name M-step.
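To make the two steps concrete, here is a compact EM sketch (my own illustration, not the tutorial's code) for a two-component Gaussian mixture with known unit variances: the E-step computes the expected sufficient statistics (the responsibilities) under the current parameters, and the M-step moment-matches the means and the mixing weight:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a two-component mixture with unit variances.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

w, m = 0.5, np.array([-1.0, 1.0])        # initial mixing weight and means
for _ in range(50):
    # E-step: responsibilities r[i, k] = p(z = k | x_i) under current params.
    dens = np.exp(-0.5 * (x[:, None] - m) ** 2)   # unnormalized N(x; m_k, 1)
    r = dens * np.array([w, 1 - w])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: match the means and the weight to the expected statistics.
    m = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    w = r[:, 0].mean()

print(round(w, 2), m.round(2))   # roughly 0.3, with means near (-2, 3)
```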
Variational EM

For loopy graphs the E-step is often intractable: we cannot maximize

\[ \max_{\mu^i \in \mathcal{M}_{x_i}} \theta^\top \mu^i - A^*_{x_i}(\mu^i) \]

Improve, but not necessarily maximize: "generalized EM".

The mean field method maximizes

\[ \max_{\mu^i \in \mathcal{M}_{x_i}(F)} \theta^\top \mu^i - A^*_{x_i}(\mu^i) \]

up to a local maximum; recall that \( \mathcal{M}_{x_i}(F) \) is an inner approximation to \( \mathcal{M}_{x_i} \).

The mean field E-step therefore still improves the lower bound \( \mathcal{L} \), and leads to generalized EM.

The sum-product algorithm does not lead to generalized EM: the local polytope is an outer approximation, and the Bethe objective need not lower-bound \( A_{x_i} \).
Outline

1 Overview

2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph

3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems

4 Parameter Learning

5 Structure Learning
Score-Based Structure Learning

Let \( \mathcal{M} \) be the set of all allowed candidate features.

Let \( M \subseteq \mathcal{M} \) be a log-linear model structure:

\[ P(X \mid M, \theta) = \frac{1}{Z} \exp\left( \sum_{i \in M} \theta_i f_i(X) \right) \]

A score for the model M can be \( \max_\theta \ln P(\mathrm{Data} \mid M, \theta) \).

The score is always better for larger M, so it needs regularization or a Bayesian treatment (see the BIC example below).

M and θ are treated separately.
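One standard regularized score, given here as a common choice rather than anything specified in the tutorial, is the Bayesian information criterion (BIC), which penalizes the number of features in M:

\[ \mathrm{score}_{\mathrm{BIC}}(M) = \max_\theta \ln P(\mathrm{Data} \mid M, \theta) - \frac{|M|}{2} \ln n \]

where n is the number of data items.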
Structure Learning for Gaussian Random Fields

Consider a p-dimensional multivariate Gaussian N(µ, Σ).

The graphical model has p nodes x_1, ..., x_p.

The edge between x_i and x_j is absent if and only if Ω_ij = 0, where Ω = Σ^{-1} is the precision matrix.

Equivalently, x_i and x_j are conditionally independent given the other variables.
![Page 455: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,](https://reader036.fdocuments.us/reader036/viewer/2022062602/5eb999990a176c6d5262d2a2/html5/thumbnails/455.jpg)
Structure Learning
Example
If we know

    Σ = [  14  −16    4   −2 ]
        [ −16   32   −8    4 ]
        [   4   −8    8   −4 ]
        [  −2    4   −4    5 ]

then

    Ω = Σ⁻¹ = [ 0.1667  0.0833  0.0000  0.0000 ]
              [ 0.0833  0.0833  0.0417  0.0000 ]
              [ 0.0000  0.0417  0.2500  0.1667 ]
              [ 0.0000  0.0000  0.1667  0.3333 ]
The corresponding graphical model structure is the chain x1 – x2 – x3 – x4: the only nonzero off-diagonal entries of Ω are Ω12, Ω23, and Ω34
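The zero pattern, and hence the chain, can be checked numerically; a minimal numpy sketch using the Σ above:

```python
import numpy as np

# Invert the example's covariance and read the graph off the
# zero pattern of the precision matrix Omega = Sigma^{-1}.
Sigma = np.array([[ 14., -16.,   4.,  -2.],
                  [-16.,  32.,  -8.,   4.],
                  [  4.,  -8.,   8.,  -4.],
                  [ -2.,   4.,  -4.,   5.]])
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 4))     # matches the matrix on the slide

# An edge i-j is present iff Omega[i, j] != 0 (up to round-off).
edges = [(i + 1, j + 1) for i in range(4) for j in range(i + 1, 4)
         if abs(Omega[i, j]) > 1e-8]
print("edges:", edges)        # the chain: [(1, 2), (2, 3), (3, 4)]
```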
Structure Learning for Gaussian Random Fields
Let the data be X(1), …, X(n) ∼ N(µ, Σ)

The log likelihood is

    (n/2) log |Ω| − (1/2) ∑_{i=1}^n (X(i) − µ)ᵀ Ω (X(i) − µ)

The maximum likelihood estimate of Σ is the sample covariance

    S = (1/n) ∑_i (X(i) − X̄)(X(i) − X̄)ᵀ

where X̄ is the sample mean
S⁻¹ is not a good estimate of Ω when n is small – after mean subtraction S has rank at most n − 1, so it is not even invertible when n ≤ p
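A short experiment (illustrative, not from the tutorial) makes the small-n problem visible: with p = 10 variables and n = 8 samples the sample covariance is singular, and even when n slightly exceeds p its inverse is typically badly conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 8
X = rng.standard_normal((n, p))        # n i.i.d. samples from N(0, I)

Xbar = X.mean(axis=0)
S = (X - Xbar).T @ (X - Xbar) / n      # sample covariance, p x p
print("rank of S:", np.linalg.matrix_rank(S))   # at most n - 1 = 7 < p

# Even for n a bit above p, S^{-1} exists but is numerically fragile:
n2 = 12
X2 = rng.standard_normal((n2, p))
S2 = np.cov(X2, rowvar=False, bias=True)
print("condition number of S2: %.1e" % np.linalg.cond(S2))
```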
Structure Learning for Gaussian Random Fields
For centered data, minimize a regularized problem instead:

    −log |Ω| + (1/n) ∑_{i=1}^n X(i)ᵀ Ω X(i) + λ ∑_{i≠j} |Ωij|

Since (1/n) ∑_i X(i)ᵀ Ω X(i) = tr(SΩ) for centered data, this is the familiar −log |Ω| + tr(SΩ) + λ ∑_{i≠j} |Ωij| objective

Known as glasso (the graphical lasso)
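In practice one rarely writes the glasso solver by hand. A minimal sketch using scikit-learn's GraphicalLasso estimator, assuming scikit-learn is available; its alpha plays the role of λ above, and the value used here is an arbitrary choice one would normally tune (e.g., with GraphicalLassoCV):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Simulate n = 50 samples from the chain model of the earlier example.
Sigma = np.array([[ 14., -16.,   4.,  -2.],
                  [-16.,  32.,  -8.,   4.],
                  [  4.,  -8.,   8.,  -4.],
                  [ -2.,   4.,  -4.,   5.]])
X = rng.multivariate_normal(np.zeros(4), Sigma, size=50)

# Standardizing first keeps the sparsity pattern of Omega (rescaling by
# a diagonal matrix cannot create or destroy zeros) and makes a single
# alpha sensible across coordinates.
X = (X - X.mean(axis=0)) / X.std(axis=0)

model = GraphicalLasso(alpha=0.1).fit(X)
Omega_hat = model.precision_
print(np.round(Omega_hat, 3))

# Estimated structure: edges where off-diagonal entries survive the
# L1 penalty (exact recovery depends on n and alpha).
edges = [(i + 1, j + 1) for i in range(4) for j in range(i + 1, 4)
         if abs(Omega_hat[i, j]) > 1e-4]
print("estimated edges:", edges)
```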
Summary
Graphical model = joint distribution p(x1, …, xn)

Bayesian network or Markov random field

conditional independence

Inference = p(XQ | XE), in general XQ ∪ XE ⊂ {x1, …, xn}

exact, MCMC, variational

If p(x1, …, xn) is not given, estimate it from data

parameter and structure learning
References
Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.

Bishop, Pattern Recognition and Machine Learning. Springer, 2006.