
Graphical Models

Xiaojin Zhu

Department of Computer Sciences, University of Wisconsin–Madison, USA

KDD 2012 Tutorial


Outline

1 Overview

2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph

3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems

4 Parameter Learning

5 Structure Learning


Overview

The envelope quiz

There are two envelopes: one has a red ball (worth $100) and a black ball; the other has two black balls

You randomly picked an envelope and randomly took out a ball – it was black

At this point, you are given the option to switch envelopes. Should you?


The envelope quiz

Random variables E ∈ {1, 0}, B ∈ {r, b}

P(E = 1) = P(E = 0) = 1/2

P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0

We ask: is P(E = 1 | B = b) ≥ 1/2?

P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3

Switch.

The graphical model: E → B
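To make the arithmetic concrete, a minimal sketch (mine, not from the slides) that enumerates the joint P(E, B) defined by the numbers above and recovers P(E = 1 | B = b) = 1/3 by normalization:

```python
# Minimal sketch (not part of the tutorial): check the envelope answer by
# enumerating the joint distribution P(E, B) and normalizing.
from itertools import product

P_E = {1: 0.5, 0: 0.5}                      # prior over envelopes
P_B_given_E = {1: {"r": 0.5, "b": 0.5},     # envelope 1: red ball + black ball
               0: {"r": 0.0, "b": 1.0}}     # envelope 0: two black balls

joint = {(e, b): P_E[e] * P_B_given_E[e][b] for e, b in product(P_E, "rb")}

p_b = sum(p for (e, b), p in joint.items() if b == "b")   # P(B = b) = 3/4
p_e1_given_b = joint[(1, "b")] / p_b                      # = 1/3
print(p_e1_given_b)   # 0.333... < 1/2, so switch
```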


Reasoning with uncertainty

The world is reduced to a set of random variables x_1, . . . , x_n

e.g. (x_1, . . . , x_{n−1}) a feature vector, x_n ≡ y the class label

Inference: given the joint distribution p(x_1, . . . , x_n), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, . . . , x_n}

e.g. Q = {n}, E = {1, . . . , n−1}; by the definition of conditional probability,

p(x_n | x_1, . . . , x_{n−1}) = p(x_1, . . . , x_{n−1}, x_n) / ∑_v p(x_1, . . . , x_{n−1}, x_n = v)

Learning: estimate p(x_1, . . . , x_n) from training data X^(1), . . . , X^(N), where X^(i) = (x_1^(i), . . . , x_n^(i))
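When the joint is small enough to store as a table, the conditional above can be computed directly by normalization; a minimal sketch (mine, using a random joint over three binary variables):

```python
# Minimal sketch (not from the slides): inference by brute-force normalization.
# joint[(x1, x2, x3)] holds p(x1, x2, x3) for three binary variables.
import itertools, random

random.seed(0)
vals = [random.random() for _ in range(8)]
total = sum(vals)
joint = {xs: v / total for xs, v in zip(itertools.product([0, 1], repeat=3), vals)}

def conditional(joint, x1, x2):
    """p(x3 | x1, x2) = p(x1, x2, x3) / sum_v p(x1, x2, x3 = v)"""
    norm = sum(joint[(x1, x2, v)] for v in (0, 1))
    return {v: joint[(x1, x2, v)] / norm for v in (0, 1)}

print(conditional(joint, 1, 0))
```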


It is difficult to reason with uncertainty

joint distribution p(x_1, . . . , x_n)
   exponential naïve storage (2^n for binary r.v.)
   hard to interpret (conditional independence)

inference p(X_Q | X_E)
   Often can't afford to do it by brute force

If p(x_1, . . . , x_n) not given, estimate it from data
   Often can't afford to do it by brute force

Graphical model: efficient representation, inference, and learning on p(x_1, . . . , x_n), exactly or approximately


Definitions

Graphical-Model-Nots

Graphical modeling is the study of probabilistic models

Just because there are nodes and edges doesn't mean it's a graphical model

These are not graphical models:

   neural network, decision tree, network flow, HMM template


Definitions: Directed Graphical Models

Bayesian Network

Directed graphical models are also called Bayesian networks

A directed graph has nodes X = (x_1, . . . , x_n), some of them connected by directed edges x_i → x_j

A cycle is a directed path x_1 → . . . → x_k where x_1 = x_k

A directed acyclic graph (DAG) contains no cycles

A Bayesian network on the DAG is a family of distributions satisfying

   {p : p(X) = ∏_i p(x_i | Pa(x_i))}

where Pa(x_i) is the set of parents of x_i.

p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i

By specifying the CPDs for all i, we specify a particular distribution p(X)


Example: Alarm

Binary variables B, E, A, J, M

Graph: B → A, E → A, A → J, A → M

P(B) = 0.001, P(E) = 0.002

P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

P(J | A) = 0.9
P(J | ~A) = 0.05

P(M | A) = 0.7
P(M | ~A) = 0.01

P(B, ~E, A, J, ~M)
   = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
   = 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7)
   ≈ 0.000253
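A small sketch (mine, not from the slides) that encodes these CPDs and reproduces the factorized joint above:

```python
# Minimal sketch (not from the slides): encode the alarm network CPDs and
# evaluate the joint via the Bayesian-network factorization.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=1 | B, E)
P_J = {True: 0.9, False: 0.05}                        # P(J=1 | A)
P_M = {True: 0.7, False: 0.01}                        # P(M=1 | A)

def bern(p, value):
    return p if value else 1.0 - p

def joint(b, e, a, j, m):
    """p(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)"""
    return (P_B[b] * P_E[e] * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

print(joint(True, False, True, True, False))   # ≈ 0.000253
```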


Example: Naive Bayes

[Figure: the naive Bayes network y → x_1, . . . , x_d drawn explicitly (left) and with a plate over the d features (right)]

p(y, x_1, . . . , x_d) = p(y) ∏_{i=1}^d p(x_i | y)

Used extensively in natural language processing

Plate representation on the right
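A minimal sketch (mine, with made-up CPDs) of the factorization and the resulting classifier argmax_y p(y) ∏_i p(x_i | y):

```python
# Minimal sketch (not from the slides): the naive Bayes factorization
# p(y, x_1..x_d) = p(y) * prod_i p(x_i | y), and classification by argmax_y.
import math

p_y = {"spam": 0.4, "ham": 0.6}
# p(x_i = 1 | y) for d = 3 binary features (made-up numbers)
p_xi_given_y = {"spam": [0.8, 0.6, 0.1], "ham": [0.2, 0.3, 0.4]}

def log_joint(y, x):
    lp = math.log(p_y[y])
    for xi, pi in zip(x, p_xi_given_y[y]):
        lp += math.log(pi if xi == 1 else 1.0 - pi)
    return lp

x = [1, 0, 1]
print(max(p_y, key=lambda y: log_joint(y, x)))   # most probable class for x
```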


No Causality Whatsoever

Network 1: A → B with
   P(A) = a
   P(B | A) = b
   P(B | ~A) = c

Network 2: B → A with
   P(B) = a b + (1 − a) c
   P(A | B) = a b / (a b + (1 − a) c)
   P(A | ~B) = a (1 − b) / (1 − a b − (1 − a) c)

The two BNs are equivalent in all respects

Bayesian networks imply no causality at all

They only encode the joint probability distribution (hence correlation)

However, people tend to design BNs based on causal relations
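A quick check (my sketch, not from the slides) that the two parameterizations above define the same joint for arbitrary a, b, c:

```python
# Minimal sketch (not from the slides): check that the A -> B and B -> A
# parameterizations define the same joint distribution P(A, B).
a, b, c = 0.3, 0.9, 0.2   # arbitrary test values

def joint_forward(A, B):                      # A -> B
    pa = a if A else 1 - a
    pb_given = b if A else c
    return pa * (pb_given if B else 1 - pb_given)

pB = a * b + (1 - a) * c
pA_given_B = a * b / pB
pA_given_notB = a * (1 - b) / (1 - pB)

def joint_reverse(A, B):                      # B -> A
    pb = pB if B else 1 - pB
    pa_given = pA_given_B if B else pA_given_notB
    return pb * (pa_given if A else 1 - pa_given)

for A in (True, False):
    for B in (True, False):
        assert abs(joint_forward(A, B) - joint_reverse(A, B)) < 1e-12
print("same joint")
```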


Example: Latent Dirichlet Allocation (LDA)

[Plate diagram: α → θ → z → w ← φ ← β, with z and w inside a plate of N_d word positions, nested in a plate of D documents; φ inside a plate of T topics]

A generative model for p(φ, θ, z, w | α, β):
   For each topic t
      φ_t ∼ Dirichlet(β)
   For each document d
      θ ∼ Dirichlet(α)
      For each word position in d
         topic z ∼ Multinomial(θ)
         word w ∼ Multinomial(φ_z)

Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)
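A minimal sketch (mine, not the tutorial's) of this generative process, with made-up sizes and symmetric priors:

```python
# Minimal sketch (not from the slides) of the LDA generative process above,
# with made-up sizes: T topics, D documents, V vocabulary words, N_d words each.
import numpy as np

rng = np.random.default_rng(0)
T, D, V, N_d = 3, 5, 10, 8
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(np.full(V, beta), size=T)   # phi_t ~ Dirichlet(beta), per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))    # theta ~ Dirichlet(alpha), per document
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)              # topic z ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])             # word  w ~ Multinomial(phi_z)
        words.append(w)
    docs.append(words)
print(docs[0])
```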


Some Topics by LDA on the Wish Corpus

[Figure: top words under p(word | topic) for three example topics: "troops", "election", "love"]


Conditional Independence

Two r.v.s A, B are independent if P(A, B) = P(A)P(B) or P(A | B) = P(A) (the two are equivalent)

Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C) or P(A | B, C) = P(A | C) (the two are equivalent)

This extends to groups of r.v.s

Conditional independence in a BN is precisely specified by d-separation ("directed separation")


d-Separation Case 1: Tail-to-Tail

[Figures: A ← C → B, with C unobserved (left) and observed, i.e. shaded (right)]

A, B in general dependent

A, B conditionally independent given C (observed nodes are shaded)

An observed C is a tail-to-tail node, blocks the undirected path A-B


d-Separation Case 2: Head-to-Tail

[Figures: A → C → B, with C unobserved (left) and observed (right)]

A, B in general dependent

A, B conditionally independent given C

An observed C is a head-to-tail node, blocks the path A-B


d-Separation Case 3: Head-to-Head

[Figures: A → C ← B, with C unobserved (left) and observed (right)]

A, B in general independent

A, B conditionally dependent given C, or any of C’s descendants

An observed C is a head-to-head node, unblocks the path A-B
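A quick numeric check (mine, not from the slides) of this "explaining away" effect, using a made-up CPD for C:

```python
# Minimal sketch (not from the slides): "explaining away" in the A -> C <- B
# structure. A and B are independent a priori but become dependent given C.
from itertools import product

pA, pB = 0.5, 0.5
pC = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.01}  # P(C=1 | A, B), made up

joint = {}
for a, b, c in product((0, 1), repeat=3):
    pa = pA if a else 1 - pA
    pb = pB if b else 1 - pB
    pc = pC[(a, b)] if c else 1 - pC[(a, b)]
    joint[(a, b, c)] = pa * pb * pc

def p(pred):   # probability of an event given as a predicate over (a, b, c)
    return sum(v for k, v in joint.items() if pred(*k))

# marginally independent: the two numbers agree
print(p(lambda a, b, c: a == 1 and b == 1),
      p(lambda a, b, c: a == 1) * p(lambda a, b, c: b == 1))
# conditionally dependent given C = 1: the two numbers differ
pc1 = p(lambda a, b, c: c == 1)
lhs = p(lambda a, b, c: a == 1 and b == 1 and c == 1) / pc1
rhs = (p(lambda a, b, c: a == 1 and c == 1) / pc1) * (p(lambda a, b, c: b == 1 and c == 1) / pc1)
print(lhs, rhs)
```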


d-Separation

Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked

A path is blocked if it includes a node x such that either
   The path is head-to-tail or tail-to-tail at x and x ∈ C, or
   The path is head-to-head at x, and neither x nor any of its descendants is in C.


d-Separation Example 1

The undirected path from A to B is unblocked by E (because of C), and is not blocked by F

A, B dependent given C

[Figure: a DAG on nodes A, B, C, E, F; the undirected path A-B passes through E (head-to-head, with descendant C) and F]


d-Separation Example 2

The path from A to B is blocked both at E and F

A, B conditionally independent given F

[Figure: the same DAG, now with F observed]
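To make the blocking rules executable, here is a small reachability-based d-separation checker (my sketch of the standard procedure, not code from the tutorial). The example graph is an assumption: A → E ← F → B with E → C, which is consistent with the blocking claims in the two examples above (cf. Bishop, PRML Fig. 8.22):

```python
# Minimal sketch (not from the slides): a reachability-based d-separation test.
from collections import deque

def d_separated(parents, A, B, C):
    """True iff A and B are d-separated given C in the DAG described by `parents`."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    # anc = C together with all ancestors of C (nodes with an observed descendant)
    anc, stack = set(), list(C)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])
    # walk active trails; a state is (node, direction we arrived from)
    reachable, visited = set(), set()
    queue = deque((a, "child") for a in A)
    while queue:
        y, came = queue.popleft()
        if (y, came) in visited:
            continue
        visited.add((y, came))
        if y not in C:
            reachable.add(y)
        if came == "child":                 # arrived from a child of y
            if y not in C:                  # chain / fork: y must be unobserved
                queue.extend((p, "child") for p in parents[y])
                queue.extend((c, "parent") for c in children[y])
        else:                               # arrived from a parent of y
            if y not in C:                  # chain: keep going down
                queue.extend((c, "parent") for c in children[y])
            if y in anc:                    # collider: active iff y or a descendant observed
                queue.extend((p, "child") for p in parents[y])
    return not (reachable & set(B))

graph = {"A": set(), "F": set(), "B": {"F"}, "E": {"A", "F"}, "C": {"E"}}
print(d_separated(graph, {"A"}, {"B"}, {"C"}))   # False: A, B dependent given C (Example 1)
print(d_separated(graph, {"A"}, {"B"}, {"F"}))   # True: A, B independent given F (Example 2)
```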


Definitions: Undirected Graphical Models

Markov Random Fields

Undirected graphical models are also called Markov Random Fields

The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive

A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)

Define a nonnegative potential function ψ_C : X_C ↦ R+

An undirected graphical model is a family of distributions satisfying

   {p : p(X) = (1/Z) ∏_C ψ_C(X_C)}

Z = ∫ ∏_C ψ_C(X_C) dX is the partition function


Example: A Tiny Markov Random Field

[Figure: two nodes x_1, x_2 joined by an edge, forming the single clique C]

x_1, x_2 ∈ {−1, 1}

A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2}

p(x_1, x_2) = (1/Z) e^{a x_1 x_2}

Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}

p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})

p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})

When the parameter a > 0, favor homogeneous chains

When the parameter a < 0, favor inhomogeneous chains
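A quick numeric check (my sketch, not from the slides) of Z and the four probabilities by enumeration:

```python
# Minimal sketch (not from the slides): compute Z and p(x1, x2) for the
# two-node MRF above by enumerating all configurations.
import math
from itertools import product

a = 1.0
psi = lambda x1, x2: math.exp(a * x1 * x2)            # single clique potential

Z = sum(psi(x1, x2) for x1, x2 in product((-1, 1), repeat=2))
p = {(x1, x2): psi(x1, x2) / Z for x1, x2 in product((-1, 1), repeat=2)}

print(Z, 2 * math.exp(a) + 2 * math.exp(-a))          # same number
print(p[(1, 1)], p[(-1, 1)])                          # agreeing configs favored since a > 0
```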


Log Linear Models

Real-valued feature functions f_1(X), . . . , f_k(X)

Real-valued weights w_1, . . . , w_k

p(X) = (1/Z) exp(− ∑_{i=1}^k w_i f_i(X))


Example: The Ising Model

[Figure: two nodes x_s, x_t joined by an edge, with node parameter θ_s and edge parameter θ_st]

This is an undirected model with x_s ∈ {0, 1}.

p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )

As a log linear model: f_s(X) = x_s, f_st(X) = x_s x_t

w_s = −θ_s, w_st = −θ_st

(KDD 2012 Tutorial) Graphical Models 28 / 131
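A small sketch of the Ising model written out as a log linear model, on a hypothetical two-node graph with a single edge (the θ values are made up for illustration): the unnormalized score is ∑_s θ_s x_s + ∑_(s,t) θ_st x_s x_t, and Z is computed by brute-force enumeration since the graph is tiny.

import math
from itertools import product

# Hypothetical tiny Ising model: nodes V = {0, 1}, one edge (0, 1), x_s in {0, 1}
theta_node = {0: 0.5, 1: -0.2}       # θ_s
theta_edge = {(0, 1): 1.0}           # θ_st

def score(x):
    # the log linear features: θ_s x_s plus θ_st x_s x_t
    s = sum(theta_node[v] * x[v] for v in theta_node)
    s += sum(t * x[a] * x[b] for (a, b), t in theta_edge.items())
    return s

configs = list(product((0, 1), repeat=2))
Z = sum(math.exp(score(x)) for x in configs)       # brute-force partition function
for x in configs:
    print(x, math.exp(score(x)) / Z)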



Definitions Undirected Graphical Models

Example: Image Denoising

[Figure from Bishop PRML: the noisy binary image y (left) and the restored image argmax_X P(X | Y) (right)]

p_θ(X | Y) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )

θ_s = c if y_s = 1, and θ_s = −c if y_s = 0

(KDD 2012 Tutorial) Graphical Models 29 / 131


Definitions Undirected Graphical Models

Example: Gaussian Random Field

p(X) ∼ N(µ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp( −(1/2) (X − µ)^⊤ Σ^{−1} (X − µ) )

Multivariate Gaussian

The n × n covariance matrix Σ is positive definite (so its inverse exists)

Let Ω = Σ^{−1} be the precision matrix

x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0

When Ω_ij ≠ 0, there is an edge between x_i and x_j

(KDD 2012 Tutorial) Graphical Models 30 / 131
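A minimal numpy sketch of a chain-structured Gaussian random field x1 − x2 − x3 (the precision values are made up): Ω_13 = 0 encodes x1 ⟂ x3 given x2, yet the covariance Σ = Ω^{−1} is dense, so x1 and x3 are still marginally dependent.

import numpy as np

# Hypothetical chain GRF x1 - x2 - x3: Omega[0, 2] = 0 encodes x1 ⟂ x3 | x2
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])   # precision matrix (positive definite)
Sigma = np.linalg.inv(Omega)             # covariance matrix

print("Omega (zeros = missing edges):")
print(Omega)
print("Sigma (dense: every pair is marginally dependent):")
print(np.round(Sigma, 3))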



Definitions Undirected Graphical Models

Conditional Independence in Markov Random Fields

Two groups of variables A, B are conditionally independent given another group C, if:

A, B become disconnected by removing C and all edges involving C

[Figure: node groups A and B separated by the group C in the graph]

(KDD 2012 Tutorial) Graphical Models 31 / 131
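A small sketch of this separation test using plain breadth-first search (the graph, node names, and groups are made-up examples): remove C and its edges, then check whether any node of A can still reach any node of B.

from collections import deque

def separated(adj, A, B, C):
    """True if removing C (and all edges touching C) disconnects A from B."""
    blocked = set(C)
    seen = set(A) - blocked
    frontier = deque(seen)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen.isdisjoint(B)

# Hypothetical graph: a - c - b, plus an extra edge b - d
adj = {'a': ['c'], 'c': ['a', 'b'], 'b': ['c', 'd'], 'd': ['b']}
print(separated(adj, A={'a'}, B={'b'}, C={'c'}))   # True: c blocks the only path
print(separated(adj, A={'a'}, B={'b'}, C=set()))   # False: the path a - c - b remains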



Definitions Factor Graph

Outline

1 Overview

2 Definitions
Directed Graphical Models
Undirected Graphical Models
Factor Graph

3 Inference
Exact Inference
Markov Chain Monte Carlo
Belief Propagation
Mean Field Algorithm
Exponential Family and Variational Inference
Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 32 / 131


Definitions Factor Graph

Factor Graph

For both directed and undirected graphical models

Bipartite: edges between a variable node and a factor node

Factors represent computation

[Figure: an undirected clique on A, B, C with potential ψ(A,B,C), and a directed model over A, B, C; each is redrawn as a factor graph in which the variable nodes A, B, C connect to a single factor node f, with f = ψ(A,B,C) in the undirected case and f = P(A)P(B)P(C|A,B) in the directed case]

(KDD 2012 Tutorial) Graphical Models 33 / 131



Inference

Outline

1 Overview

2 Definitions
Directed Graphical Models
Undirected Graphical Models
Factor Graph

3 Inference
Exact Inference
Markov Chain Monte Carlo
Belief Propagation
Mean Field Algorithm
Exponential Family and Variational Inference
Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 34 / 131


Inference Exact Inference

Outline

1 Overview

2 Definitions
Directed Graphical Models
Undirected Graphical Models
Factor Graph

3 Inference
Exact Inference
Markov Chain Monte Carlo
Belief Propagation
Mean Field Algorithm
Exponential Family and Variational Inference
Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 35 / 131


Inference Exact Inference

Inference by Enumeration

Let X = (X_Q, X_E, X_O) for query, evidence, and other variables.

Infer P(X_Q | X_E)

By definition

P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = ( ∑_{X_O} P(X_Q, X_E, X_O) ) / ( ∑_{X_Q, X_O} P(X_Q, X_E, X_O) )

Summing an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms

(KDD 2012 Tutorial) Graphical Models 36 / 131
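A minimal sketch of inference by enumeration on a hypothetical binary chain x1 → x2 → x3 with made-up CPTs: the query is P(x1 | x3 = 1), the "other" variable x2 is summed out in the numerator, and both x1 and x2 are summed out in the denominator, exactly as in the formula above.

# Hypothetical chain x1 -> x2 -> x3, all binary, with made-up CPTs
P1 = {0: 0.7, 1: 0.3}                                         # P(x1)
P2 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}     # P(x2 | x1), key (x1, x2)
P3 = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}     # P(x3 | x2), key (x2, x3)

def joint(x1, x2, x3):
    return P1[x1] * P2[(x1, x2)] * P3[(x2, x3)]

# Query X_Q = {x1}, evidence x3 = 1, "other" X_O = {x2}: r^k = 2^1 terms per numerator sum
num = {x1: sum(joint(x1, x2, 1) for x2 in (0, 1)) for x1 in (0, 1)}
den = sum(num.values())
print({x1: num[x1] / den for x1 in (0, 1)})       # P(x1 | x3 = 1)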



Inference Exact Inference

Details of the summing problem

There are a bunch of “other” variables x_1, . . . , x_k

We sum over the r values each variable can take: ∑_{x_i = v_1}^{v_r}

This is exponential (r^k): ∑_{x_1 ... x_k}

We want ∑_{x_1 ... x_k} p(X)

For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^{m} f_j(X^{(j)})

Each factor f_j operates on a subset X^{(j)} ⊆ X

Idea to save computation: dynamic programming

(KDD 2012 Tutorial) Graphical Models 37 / 131



Inference Exact Inference

Eliminating a Variable

Rearrange the factors in ∑_{x_1 ... x_k} f_1^− · · · f_l^− f_{l+1}^+ · · · f_m^+ by whether x_1 ∈ X^{(j)} (the + factors involve x_1, the − factors do not)

Obviously equivalent: ∑_{x_2 ... x_k} f_1^− · · · f_l^− ( ∑_{x_1} f_{l+1}^+ · · · f_m^+ )

Introduce a new factor f_{m+1}^− = ( ∑_{x_1} f_{l+1}^+ · · · f_m^+ )

f_{m+1}^− contains the union of the variables in f_{l+1}^+ · · · f_m^+, except x_1

In fact, x_1 disappears altogether in ∑_{x_2 ... x_k} f_1^− · · · f_l^− f_{m+1}^−

Dynamic programming: compute f_{m+1}^− once, use it thereafter

Hope: f_{m+1}^− contains very few variables

Recursively eliminate other variables in turn

(KDD 2012 Tutorial) Graphical Models 38 / 131
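A minimal sketch of eliminating a single variable, with each factor stored as a (scope, table) pair where the table maps value tuples to numbers; this representation is just one convenient choice, not the tutorial's. The factors that mention the variable are multiplied and the variable is summed out, producing the new factor f_{m+1}^−.

from itertools import product

def eliminate(var, factors, domain=(0, 1)):
    """Sum out `var` from the product of the factors that mention it.
    A factor is (scope, table): scope is a tuple of variable names, and
    table maps value tuples (aligned with scope) to numbers."""
    keep = [f for f in factors if var not in f[0]]
    touch = [f for f in factors if var in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touch for v in scope} - {var}))
    new_table = {}
    for vals in product(domain, repeat=len(new_scope)):
        assign = dict(zip(new_scope, vals))
        total = 0.0
        for x in domain:                            # sum over the eliminated variable
            assign[var] = x
            prod = 1.0
            for scope, table in touch:              # product of the factors mentioning var
                prod *= table[tuple(assign[v] for v in scope)]
            total += prod
        new_table[vals] = total
    return keep + [(new_scope, new_table)]          # f_{m+1}^- replaces the "+" factors

# Toy use: eliminate A from f1(A) and f2(A, B)
f1 = (('A',), {(0,): 0.6, (1,): 0.4})
f2 = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate('A', [f1, f2]))    # one factor over ('B',) only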



Inference Exact Inference

Example: Chain Graph

A → B → C → D

Binary variables

Say we want P(D) = ∑_{A,B,C} P(A) P(B|A) P(C|B) P(D|C)

Let f_1(A) = P(A). Note f_1 is an array (from the CPD) of size two:
P(A = 0), P(A = 1)

f_2(A,B) is a table of size four:
P(B = 0|A = 0), P(B = 0|A = 1), P(B = 1|A = 0), P(B = 1|A = 1)

∑_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = ∑_{B,C} f_3(B,C) f_4(C,D) ( ∑_A f_1(A) f_2(A,B) )

(KDD 2012 Tutorial) Graphical Models 39 / 131



Inference Exact Inference

Example: Chain Graph

A → B → C → D

f_1(A) f_2(A,B) is an array of size four (matching A values):
P(A = 0) P(B = 0|A = 0), P(A = 1) P(B = 0|A = 1), P(A = 0) P(B = 1|A = 0), P(A = 1) P(B = 1|A = 1)

f_5(B) ≡ ∑_A f_1(A) f_2(A,B) reduces to an array of size two:
P(A = 0) P(B = 0|A = 0) + P(A = 1) P(B = 0|A = 1)
P(A = 0) P(B = 1|A = 0) + P(A = 1) P(B = 1|A = 1)

For this example, f_5(B) happens to be P(B)

∑_{B,C} f_3(B,C) f_4(C,D) f_5(B) = ∑_C f_4(C,D) ( ∑_B f_3(B,C) f_5(B) ), and so on

In the end, f_7(D) is represented by P(D = 0), P(D = 1)

(KDD 2012 Tutorial) Graphical Models 40 / 131
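A numeric sketch of the chain example with made-up CPTs for A → B → C → D: each step multiplies the factors that mention the variable being eliminated and sums it out, ending with f_7(D) = P(D).

# Made-up CPTs for the binary chain A -> B -> C -> D; dict keys are (parent, child) values
P_A  = {0: 0.6, 1: 0.4}
P_BA = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}   # P(B | A)
P_CB = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}   # P(C | B)
P_DC = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}   # P(D | C)

# f5(B) = sum_A f1(A) f2(A, B): 4 multiplications, 2 additions
f5 = {b: sum(P_A[a] * P_BA[(a, b)] for a in (0, 1)) for b in (0, 1)}
# f6(C) = sum_B f5(B) f3(B, C)
f6 = {c: sum(f5[b] * P_CB[(b, c)] for b in (0, 1)) for c in (0, 1)}
# f7(D) = sum_C f6(C) f4(C, D); here f7(D) = P(D)
f7 = {d: sum(f6[c] * P_DC[(c, d)] for c in (0, 1)) for d in (0, 1)}

print(f7, "sums to", sum(f7.values()))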



Inference Exact Inference

Example: Chain Graph

A → B → C → D

Computation for P(D) by elimination: 12 ×, 6 +

Enumeration: 48 ×, 14 +

Saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.

Saving depends more critically on the graph structure (tree width); it may still be intractable

(KDD 2012 Tutorial) Graphical Models 41 / 131



Inference Exact Inference

Handling Evidence

For evidence variables X_E, simply plug in their value e

Eliminate the variables X_O = X − X_E − X_Q

The final factor will be the joint f(X_Q) = P(X_Q, X_E = e)

Normalize to answer the query:

P(X_Q | X_E = e) = f(X_Q) / ∑_{X_Q} f(X_Q)

(KDD 2012 Tutorial) Graphical Models 42 / 131



Inference Exact Inference

Summary: Exact Inference

Common methods:

Enumeration
Variable elimination
Not covered: junction tree (aka clique tree)

Exact, but often intractable for large graphs

(KDD 2012 Tutorial) Graphical Models 43 / 131



Inference Markov Chain Monte Carlo

Outline

1 Overview

2 Definitions
Directed Graphical Models
Undirected Graphical Models
Factor Graph

3 Inference
Exact Inference
Markov Chain Monte Carlo
Belief Propagation
Mean Field Algorithm
Exponential Family and Variational Inference
Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 44 / 131


Inference Markov Chain Monte Carlo

Inference by Monte Carlo

Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1 . . . x_n}

p(X_Q = c_Q | X_E) = ∫ 1_{(x_Q = c_Q)} p(x_Q | X_E) dx_Q

If we can draw samples x_Q^{(1)}, . . . , x_Q^{(m)} ∼ p(x_Q | X_E), an unbiased estimator is

p(X_Q = c_Q | X_E) ≈ (1/m) ∑_{i=1}^{m} 1_{(x_Q^{(i)} = c_Q)}

The variance of the estimator decreases as O(1/m)

Inference reduces to sampling from p(x_Q | X_E)

We discuss two methods: forward sampling and Gibbs sampling

(KDD 2012 Tutorial) Graphical Models 45 / 131
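A quick sketch of the Monte Carlo estimator and its O(1/m) variance on a toy target, P(Z ≥ 1) for a standard normal Z (about 0.159), rather than a graphical model; the estimate is repeated many times so the shrinking spread is visible.

import random, statistics

def mc_estimates(m, trials=200):
    """Repeat an m-sample Monte Carlo estimate of P(Z >= 1), Z ~ N(0, 1), `trials` times."""
    ests = []
    for _ in range(trials):
        hits = sum(1 for _ in range(m) if random.gauss(0.0, 1.0) >= 1.0)
        ests.append(hits / m)
    return ests

random.seed(0)
for m in (100, 400, 1600):
    e = mc_estimates(m)
    print("m =", m, " mean ≈", round(statistics.mean(e), 3),
          " variance ≈", statistics.variance(e))    # variance shrinks roughly as 1/m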



Inference Markov Chain Monte Carlo

Forward Sampling: Example

[Figure: Bayesian network B → A ← E, A → J, A → M, with the CPTs below]

P(B) = 0.001, P(E) = 0.002
P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
P(J | A) = 0.9, P(J | ~A) = 0.05
P(M | A) = 0.7, P(M | ~A) = 0.01

To generate a sample X = (B, E, A, J, M):
1 Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1, else B = 0
2 Sample E ∼ Ber(0.002)
3 If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on for the other (B, E) cases
4 If A = 1, sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
5 If A = 1, sample M ∼ Ber(0.7), else M ∼ Ber(0.01)

(KDD 2012 Tutorial) Graphical Models 46 / 131



Inference Markov Chain Monte Carlo

Inference with Forward Sampling

Say the inference task is P(B = 1 | E = 1, M = 1)

Throw away all samples except those with (E = 1, M = 1)

p(B = 1 | E = 1, M = 1) ≈ (1/m) ∑_{i=1}^{m} 1_{(B^{(i)} = 1)}

where m is the number of surviving samples

Can be highly inefficient (note P(E = 1) is tiny)

Does not work for Markov Random Fields

(KDD 2012 Tutorial) Graphical Models 47 / 131
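A minimal sketch of forward sampling for the network on the previous slide plus the rejection step described here (the function name and sample count are arbitrary choices): samples with E ≠ 1 or M ≠ 1 are thrown away, and P(B = 1 | E = 1, M = 1) is the fraction of survivors with B = 1. Since P(E = 1) = 0.002, almost everything is rejected, which is the inefficiency noted above.

import random

P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)

def sample_once():
    """One forward sample (B, E, A, J, M), drawn in topological order."""
    B = int(random.random() < 0.001)
    E = int(random.random() < 0.002)
    A = int(random.random() < P_A1[(B, E)])
    J = int(random.random() < (0.9 if A else 0.05))
    M = int(random.random() < (0.7 if A else 0.01))
    return B, E, A, J, M

random.seed(0)
kept = [s for s in (sample_once() for _ in range(500_000)) if s[1] == 1 and s[4] == 1]
print(len(kept), "surviving samples out of 500000")
if kept:
    print("P(B=1 | E=1, M=1) ≈", sum(s[0] for s in kept) / len(kept))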



Inference Markov Chain Monte Carlo

Gibbs Sampler: Example P (B = 1 | E = 1, M = 1)

Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.

Directly sample from p(x_Q | X_E)

Works for both directed and undirected graphical models

Initialization:
Fix the evidence; randomly set the other variables
e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

[Figure: the same network as before, with the evidence E = 1 and M = 1 clamped]

(KDD 2012 Tutorial) Graphical Models 48 / 131



Inference Markov Chain Monte Carlo

Gibbs Update

For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ∼ P(x_i | X_{−i})

This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i))

For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children

P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y∈C(x_i)} P(y | Pa(y))

where Pa(x) are the parents of x, and C(x) the children of x.

For many graphical models the Markov blanket is small.

For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)

[Figure: the same network, with evidence E = 1 and M = 1 clamped]

(KDD 2012 Tutorial) Graphical Models 49 / 131
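A tiny sketch of the last bullet in Python: resampling B given the current state (E = 1, A = 0) only needs P(B) and P(A = 0 | B, E = 1), since B's Markov blanket here is its child A and A's other parent E.

P_B1 = 0.001                                                      # P(B = 1)
P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)

# Unnormalized weights for P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)
w1 = P_B1 * (1 - P_A1[(1, 1)])         # B = 1: 0.001 * 0.05
w0 = (1 - P_B1) * (1 - P_A1[(0, 1)])   # B = 0: 0.999 * 0.71
print("P(B=1 | E=1, A=0) =", w1 / (w0 + w1))   # ≈ 7e-05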



Inference Markov Chain Monte Carlo

Gibbs Update

Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1)

Starting from X^(1), sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2)

Move on to J, then repeat B, A, J, B, A, J . . .

Keep all later samples (after burn-in). P(B = 1 | E = 1, M = 1) is the fraction of those samples with B = 1.

[Figure: the same network, with evidence E = 1 and M = 1 clamped]

(KDD 2012 Tutorial) Graphical Models 50 / 131
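Putting the updates together, a minimal Gibbs sampler sketch for P(B = 1 | E = 1, M = 1): E and M stay clamped at their observed values while B, A, J are resampled in turn from their Markov-blanket conditionals; the iteration count and burn-in length are arbitrary choices, not part of the tutorial.

import random

P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
pB = 0.001
pJ1 = {1: 0.9, 0: 0.05}      # P(J=1 | A)
pM1 = {1: 0.7, 0: 0.01}      # P(M=1 | A)

def p_A(a, b, e):            # P(A = a | B = b, E = e)
    return P_A1[(b, e)] if a == 1 else 1 - P_A1[(b, e)]

def child_lik(a, J, M):      # P(J | A = a) P(M | A = a)
    return (pJ1[a] if J else 1 - pJ1[a]) * (pM1[a] if M else 1 - pM1[a])

def bern(p):
    return int(random.random() < p)

random.seed(0)
B, E, A, J, M = 0, 1, 0, 0, 1           # E = 1, M = 1 are clamped evidence
samples = []
for it in range(60_000):
    # Resample B from P(B | A, E) ∝ P(B) P(A | B, E)
    w1, w0 = pB * p_A(A, 1, E), (1 - pB) * p_A(A, 0, E)
    B = bern(w1 / (w0 + w1))
    # Resample A from P(A | B, E, J, M) ∝ P(A | B, E) P(J | A) P(M | A)
    w1, w0 = p_A(1, B, E) * child_lik(1, J, M), p_A(0, B, E) * child_lik(0, J, M)
    A = bern(w1 / (w0 + w1))
    # Resample J from P(J | A): J has no children, so its Markov blanket is just A
    J = bern(pJ1[A])
    if it >= 10_000:                    # discard a burn-in prefix, keep the rest
        samples.append(B)

print("P(B=1 | E=1, M=1) ≈", sum(samples) / len(samples))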


Inference Markov Chain Monte Carlo

Gibbs Example 2: The Ising Model

[Figure: a grid node x_s with its four neighbors A, B, C, D]

This is an undirected model with x_s ∈ {0, 1}:

p_θ(x) = (1/Z) exp( ∑_{s ∈ V} θ_s x_s + ∑_{(s,t) ∈ E} θ_st x_s x_t )

(KDD 2012 Tutorial) Graphical Models 51 / 131

Inference Markov Chain Monte Carlo

Gibbs Example 2: The Ising Model

[Figure: a grid node x_s with its four neighbors A, B, C, D]

The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models

p(x_s | x_{-s}) = p(x_s | x_{N(s)})

where N(s) is the set of neighbors of s.
The Gibbs update is

p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t ∈ N(s)} θ_st x_t)) + 1 )

(KDD 2012 Tutorial) Graphical Models 52 / 131
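
As a concrete illustration (not from the slides), here is a minimal Python sketch of one Gibbs sweep over an n-by-n grid Ising model using exactly the logistic update above. The grid size and the parameter values θ_s = 0, θ_st = 0.5 are made-up for the example:

import math, random

def gibbs_sweep(x, theta_s, theta_st):
    # one Gibbs sweep over an n-by-n grid of binary variables x[i][j] in {0, 1}
    n = len(x)
    for i in range(n):
        for j in range(n):
            # grid neighbors of site (i, j)
            nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < n and 0 <= j + dj < n]
            field = theta_s + theta_st * sum(x[a][b] for a, b in nbrs)
            p1 = 1.0 / (math.exp(-field) + 1.0)   # p(x_s = 1 | x_N(s))
            x[i][j] = 1 if random.random() < p1 else 0
    return x

# toy usage: 10x10 grid, illustrative parameter values
x = [[random.randint(0, 1) for _ in range(10)] for _ in range(10)]
for sweep in range(100):
    gibbs_sweep(x, theta_s=0.0, theta_st=0.5)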


Inference Markov Chain Monte Carlo

Gibbs Sampling as a Markov Chain

A Markov chain is defined by a transition matrix T(X′ | X).
Certain Markov chains have a stationary distribution π such that π = Tπ.
The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x′_i) | (X_{-i}, x_i)) = p(x′_i | X_{-i}) and stationary distribution p(X_Q | X_E).
But it takes time for the chain to reach its stationary distribution (to mix), and it can be difficult to assess whether mixing has occurred.
In practice, "burn in": discard X^(0), ..., X^(T).
Use all of X^(T+1), ... for inference (they are correlated); do not thin.

(KDD 2012 Tutorial) Graphical Models 53 / 131


Inference Markov Chain Monte Carlo

Collapsed Gibbs Sampling

In general, E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X^(i)) if X^(i) ~ p.
Sometimes X = (Y, Z) where Z admits closed-form operations. If so,

E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) ∑_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]   if Y^(i) ~ p(Y)

There is no need to sample Z: it is collapsed.
Collapsed Gibbs sampler: T_i((Y_{-i}, y′_i) | (Y_{-i}, y_i)) = p(y′_i | Y_{-i}).
Note p(y′_i | Y_{-i}) = ∫ p(y′_i, Z | Y_{-i}) dZ.

(KDD 2012 Tutorial) Graphical Models 54 / 131
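
A small numeric illustration of the collapsing idea (a made-up toy, not from the tutorial): let X = (Y, Z) with Y ~ Bernoulli(0.5) and Z | Y ~ Normal(Y, 1), and estimate E[Z]. Since E[Z | Y] = Y is available in closed form, we only sample Y and average the conditional expectations instead of also sampling Z:

import random

m = 10000
ys = [1 if random.random() < 0.5 else 0 for _ in range(m)]   # Y^(i) ~ p(Y)

# plain Monte Carlo: also sample Z | Y, then average f(X) = Z
plain = sum(random.gauss(y, 1.0) for y in ys) / m

# collapsed estimate: replace the inner sample by E[Z | Y] = Y (closed form)
collapsed = sum(ys) / m

print(plain, collapsed)   # both approach E[Z] = 0.5; the collapsed estimate has lower variance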


Inference Markov Chain Monte Carlo

Example: Collapsed Gibbs Sampling for LDA

[LDA plate diagram: hyperparameters α and β; document-level topic proportions θ; topic assignment z and word w for each of the N_d word positions in each of D documents; T topics]

Collapse θ, φ; the Gibbs update is

P(z_i = j | z_{-i}, w) ∝ [ (n^{(w_i)}_{-i,j} + β) / (n^{(·)}_{-i,j} + Wβ) ] · [ (n^{(d_i)}_{-i,j} + α) / (n^{(d_i)}_{-i,·} + Tα) ]

where W is the vocabulary size and T the number of topics, and

n^{(w_i)}_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
n^{(d_i)}_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
n^{(·)}_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
n^{(d_i)}_{-i,·}: length of document d_i, excluding the current position

(KDD 2012 Tutorial) Graphical Models 55 / 131
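
A minimal sketch (Python, not from the tutorial) of the update above for one token: given the count arrays, it forms the unnormalized probability of each topic and resamples z_i. The variable names (n_wt, n_dt, n_t, n_d, and so on) are invented for the illustration:

import random

def resample_topic(w_i, d_i, z_i, n_wt, n_dt, n_t, n_d, alpha, beta, W, T):
    # One collapsed Gibbs update P(z_i = j | z_-i, w) for a single token.
    # n_wt[w][j]: count of word w assigned to topic j
    # n_dt[d][j]: count of tokens in document d assigned to topic j
    # n_t[j]:     count of all tokens assigned to topic j
    # n_d[d]:     length of document d
    # remove the current token from all counts ("excluding the current position")
    n_wt[w_i][z_i] -= 1; n_dt[d_i][z_i] -= 1; n_t[z_i] -= 1; n_d[d_i] -= 1
    # unnormalized probability of each topic j
    weights = [(n_wt[w_i][j] + beta) / (n_t[j] + W * beta) *
               (n_dt[d_i][j] + alpha) / (n_d[d_i] + T * alpha)
               for j in range(T)]
    # sample a new topic proportionally to the weights
    r, acc, new_z = random.random() * sum(weights), 0.0, T - 1
    for j, wgt in enumerate(weights):
        acc += wgt
        if r < acc:
            new_z = j
            break
    # add the token back with its new assignment
    n_wt[w_i][new_z] += 1; n_dt[d_i][new_z] += 1; n_t[new_z] += 1; n_d[d_i] += 1
    return new_z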



Inference Markov Chain Monte Carlo

Summary: Markov Chain Monte Carlo

Forward sampling

Gibbs sampling

Collapsed Gibbs sampling

Not covered: block Gibbs, Metropolis-Hastings, etc.

Unbiased (after burn-in), but can have high variance

(KDD 2012 Tutorial) Graphical Models 56 / 131


Inference Belief Propagation

Outline

1 Overview
2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph
3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems
4 Parameter Learning
5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 57 / 131

Inference Belief Propagation

The Sum-Product Algorithm

Also known as belief propagation (BP)
Exact if the graph is a tree; otherwise known as "loopy BP" and approximate
The algorithm involves passing messages on the factor graph
Alternative view: variational approximation (more later)

(KDD 2012 Tutorial) Graphical Models 58 / 131


Inference Belief Propagation

Example: A Simple HMM

The Hidden Markov Model template (not a graphical model):

[Figure: two hidden states 1 and 2 with initial probabilities π_1 = π_2 = 1/2, self-transition probabilities 1/4 (state 1) and 1/2 (state 2), and emission distributions over (R, G, B): P(x | z = 1) = (1/2, 1/4, 1/4), P(x | z = 2) = (1/4, 1/2, 1/4)]

(KDD 2012 Tutorial) Graphical Models 59 / 131

Inference Belief Propagation

Example: A Simple HMM

Observing x_1 = R, x_2 = G, the directed graphical model is

[Figure: z_1 → z_2, with z_1 → x_1 = R and z_2 → x_2 = G]

The factor graph is

[Figure: f_1 — z_1 — f_2 — z_2, with f_1(z_1) = P(z_1) P(x_1 | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 | z_2)]

(KDD 2012 Tutorial) Graphical Models 60 / 131


Inference Belief Propagation

Messages

A message is a vector of length K, where K is the number of values x takes.
There are two types of messages:
1 µ_{f→x}: message from a factor node f to a variable node x; µ_{f→x}(i) is the ith element, i = 1, ..., K.
2 µ_{x→f}: message from a variable node x to a factor node f.

(KDD 2012 Tutorial) Graphical Models 61 / 131


Inference Belief Propagation

Leaf Messages

Assume a tree factor graph. Pick an arbitrary root, say z_2.
Start messages at the leaves.
If a leaf is a factor node f, µ_{f→x}(x) = f(x):

µ_{f1→z1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
µ_{f1→z1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8

If a leaf is a variable node x, µ_{x→f}(x) = 1.

[Factor graph f_1 — z_1 — f_2 — z_2 and HMM parameters as before]

(KDD 2012 Tutorial) Graphical Models 62 / 131


Inference Belief Propagation

Message from Variable to Factor

A node (factor or variable) can send out a message once all other incoming messages have arrived.
Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s. Then

µ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} µ_{f→x}(x)

In this example:
µ_{z1→f2}(z_1 = 1) = 1/4
µ_{z1→f2}(z_1 = 2) = 1/8

[Factor graph f_1 — z_1 — f_2 — z_2 and HMM parameters as before]

(KDD 2012 Tutorial) Graphical Models 63 / 131


Inference Belief Propagation

Message from Factor to Variable

Let x be in factor f_s, and let the other variables in f_s be x_{1:M}. Then

µ_{f_s→x}(x) = ∑_{x_1} ... ∑_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M µ_{x_m→f_s}(x_m)

In this example

µ_{f2→z2}(s) = ∑_{s′=1}^2 µ_{z1→f2}(s′) f_2(z_1 = s′, z_2 = s)
             = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)

We get µ_{f2→z2}(z_2 = 1) = 1/32 and µ_{f2→z2}(z_2 = 2) = 1/8.

[Factor graph f_1 — z_1 — f_2 — z_2 as before]

(KDD 2012 Tutorial) Graphical Models 64 / 131


Inference Belief Propagation

Up to Root, Back Down

The message has reached the root; now pass messages back down.

µ_{z2→f2}(z_2 = 1) = 1
µ_{z2→f2}(z_2 = 2) = 1

[Factor graph f_1 — z_1 — f_2 — z_2 and HMM parameters as before]

(KDD 2012 Tutorial) Graphical Models 65 / 131

Inference Belief Propagation

Keep Passing Down

µ_{f2→z1}(s) = ∑_{s′=1}^2 µ_{z2→f2}(s′) f_2(z_1 = s, z_2 = s′)
             = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)

We get
µ_{f2→z1}(z_1 = 1) = 7/16
µ_{f2→z1}(z_1 = 2) = 3/8

[Factor graph f_1 — z_1 — f_2 — z_2 and HMM parameters as before]

(KDD 2012 Tutorial) Graphical Models 66 / 131


Inference Belief Propagation

From Messages to Marginals

Once a variable has received all incoming messages, we compute its marginal as

p(x) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)

In this example (with element-wise products)

P(z_1 | x_1, x_2) ∝ µ_{f1→z1} · µ_{f2→z1} = (1/4, 1/8) · (7/16, 3/8) = (7/64, 3/64) ⇒ (0.7, 0.3)
P(z_2 | x_1, x_2) ∝ µ_{f2→z2} = (1/32, 1/8) ⇒ (0.2, 0.8)

One can also compute the marginal of the set of variables x_s involved in a factor f_s:

p(x_s) ∝ f_s(x_s) ∏_{x ∈ ne(f_s)} µ_{x→f_s}(x)

(KDD 2012 Tutorial) Graphical Models 67 / 131
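
The whole two-node example can be verified with a few lines of code. This is a minimal sketch (Python, not from the tutorial) that builds the two factors for x_1 = R, x_2 = G, passes the four messages exactly as above, and normalizes to recover the posteriors (0.7, 0.3) and (0.2, 0.8). The transition matrix is the one consistent with the message values on the preceding slides (P(z_2 = 1 | z_1 = 1) = 1/4, P(z_2 = 1 | z_1 = 2) = 1/2); that reading of the HMM diagram is an assumption:

# states are 1 and 2; observations indexed R=0, G=1, B=2
pi = {1: 0.5, 2: 0.5}
emit = {1: [0.5, 0.25, 0.25], 2: [0.25, 0.5, 0.25]}              # P(x | z)
trans = {(1, 1): 0.25, (1, 2): 0.75, (2, 1): 0.5, (2, 2): 0.5}   # P(z2 | z1), assumed from the diagram
R, G = 0, 1

f1 = {z1: pi[z1] * emit[z1][R] for z1 in (1, 2)}                 # f1(z1) = P(z1) P(x1=R | z1)
f2 = {(z1, z2): trans[(z1, z2)] * emit[z2][G]                    # f2(z1,z2) = P(z2|z1) P(x2=G|z2)
      for z1 in (1, 2) for z2 in (1, 2)}

mu_f1_z1 = f1                                                    # leaf factor message
mu_z1_f2 = mu_f1_z1                                              # z1 has only one other neighbor
mu_f2_z2 = {z2: sum(mu_z1_f2[z1] * f2[(z1, z2)] for z1 in (1, 2)) for z2 in (1, 2)}
mu_z2_f2 = {1: 1.0, 2: 1.0}                                      # root variable, pass back down
mu_f2_z1 = {z1: sum(mu_z2_f2[z2] * f2[(z1, z2)] for z2 in (1, 2)) for z1 in (1, 2)}

def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

p_z1 = normalize({z1: mu_f1_z1[z1] * mu_f2_z1[z1] for z1 in (1, 2)})  # -> {1: 0.7, 2: 0.3}
p_z2 = normalize(mu_f2_z2)                                            # -> {1: 0.2, 2: 0.8}
print(p_z1, p_z2)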


Inference Belief Propagation

Handling Evidence

Observing x = v, we can either absorb it into the factor (as we did), or set the messages µ_{x→f}(x) = 0 for all x ≠ v.
Observing X_E, multiplying the incoming messages at a node x ∉ X_E gives the joint (not p(x | X_E)):

p(x, X_E) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)

The conditional is easily obtained by normalization:

p(x | X_E) = p(x, X_E) / ∑_{x′} p(x′, X_E)

(KDD 2012 Tutorial) Graphical Models 68 / 131


Inference Belief Propagation

Loopy Belief Propagation

So far, we assumed a tree factor graph.
When the factor graph contains loops, messages are passed iteratively until convergence.
Convergence, however, may not happen.
Even so, in many cases loopy BP still works well empirically.

(KDD 2012 Tutorial) Graphical Models 69 / 131


Inference Mean Field Algorithm

Outline

1 Overview
2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph
3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems
4 Parameter Learning
5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 70 / 131

Inference Mean Field Algorithm

Example: The Ising Model

[Figure: grid of nodes x_s, x_t with node parameters θ_s and edge parameters θ_st]

The random variables x_s take values in {0, 1}:

p_θ(x) = (1/Z) exp( ∑_{s ∈ V} θ_s x_s + ∑_{(s,t) ∈ E} θ_st x_s x_t )

(KDD 2012 Tutorial) Graphical Models 71 / 131

Inference Mean Field Algorithm

The Conditional

[Figure: grid of nodes x_s, x_t with node parameters θ_s and edge parameters θ_st]

Markovian: the conditional distribution for x_s is

p(x_s | x_{-s}) = p(x_s | x_{N(s)})

where N(s) is the set of neighbors of s.
This reduces to (recall Gibbs sampling)

p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t ∈ N(s)} θ_st x_t)) + 1 )

(KDD 2012 Tutorial) Graphical Models 72 / 131


Inference Mean Field Algorithm

The Mean Field Algorithm for Ising Model

Gibbs sampling would draw x_s from

p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t ∈ N(s)} θ_st x_t)) + 1 )

Instead, let µ_s be the estimated marginal p(x_s = 1). The mean field update is

µ_s ← 1 / ( exp(−(θ_s + ∑_{t ∈ N(s)} θ_st µ_t)) + 1 )

The µ's are updated iteratively.
The mean field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).

(KDD 2012 Tutorial) Graphical Models 73 / 131
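
A minimal sketch (Python, not from the slides) of the mean field updates on the same kind of grid Ising model used in the Gibbs example; the grid size and parameter values are again illustrative. Each pass replaces every µ_s with the logistic of the field computed from the current neighbor estimates:

import math

def mean_field(n=10, theta_s=0.0, theta_st=0.5, n_iters=100):
    mu = [[0.5 for _ in range(n)] for _ in range(n)]   # initial marginals mu_s = p(x_s = 1)
    for _ in range(n_iters):
        for i in range(n):
            for j in range(n):
                nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= i + di < n and 0 <= j + dj < n]
                field = theta_s + theta_st * sum(mu[a][b] for a, b in nbrs)
                mu[i][j] = 1.0 / (math.exp(-field) + 1.0)   # coordinate update
    return mu

mu = mean_field()
print(mu[5][5])   # estimated marginal p(x_s = 1) at an interior site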


Inference Exponential Family and Variational Inference

Outline

1 Overview
2 Definitions
   Directed Graphical Models
   Undirected Graphical Models
   Factor Graph
3 Inference
   Exact Inference
   Markov Chain Monte Carlo
   Belief Propagation
   Mean Field Algorithm
   Exponential Family and Variational Inference
   Maximizing Problems
4 Parameter Learning
5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 74 / 131

Inference Exponential Family and Variational Inference

Exponential Family

Let φ(X) = (φ_1(X), ..., φ_d(X))^⊤ be d sufficient statistics, where φ_i : X → R.
Here X is the set of all nodes in the graphical model; φ_i(X) is sometimes called a feature function.
Let θ = (θ_1, ..., θ_d)^⊤ ∈ R^d be the canonical parameters.
The exponential family is the family of probability densities

p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )

(KDD 2012 Tutorial) Graphical Models 75 / 131


Inference Exponential Family and Variational Inference

Exponential Family

p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )

This is an inner product between the parameters θ and the sufficient statistics φ.

[Figure: the vector θ together with φ(x), φ(y), φ(z); the larger the inner product with θ, the higher the density, so p_θ(x) > p_θ(y) > p_θ(z)]

A is the log partition function,

A(θ) = log ∫ exp( θ^⊤ φ(x) ) ν(dx)

i.e., A = log Z.

(KDD 2012 Tutorial) Graphical Models 76 / 131


Inference Exponential Family and Variational Inference

Minimal vs. Overcomplete Models

The parameters for which the density is normalizable:

Ω = {θ ∈ R^d | A(θ) < ∞}

A minimal exponential family is one where the φ's are linearly independent.
An overcomplete exponential family is one where the φ's are linearly dependent:

∃ α ∈ R^d such that α^⊤ φ(x) = constant for all x

Both minimal and overcomplete representations are useful.

(KDD 2012 Tutorial) Graphical Models 77 / 131


Inference Exponential Family and Variational Inference

Exponential Family Example 1: Bernoulli

p(x) = β^x (1 − β)^{1−x} for x ∈ {0, 1} and β ∈ (0, 1). This does not look like an exponential family!
It can be rewritten as

p(x) = exp( x log β + (1 − x) log(1 − β) )

which is in exponential family form with φ_1(x) = x, φ_2(x) = 1 − x, θ_1 = log β, θ_2 = log(1 − β), and A(θ) = 0.
This representation is overcomplete: α_1 = α_2 = 1 makes α^⊤ φ(x) = 1 for all x.

(KDD 2012 Tutorial) Graphical Models 78 / 131


Exponential Family Example 1: Bernoulli

p(x) = exp (x log β + (1 − x) log(1 − β))

Can be further rewritten as

p(x) = exp (x θ − log(1 + exp(θ)))

Minimal exponential family with φ(x) = x, θ = log(β / (1 − β)), A(θ) = log(1 + exp(θ)).

Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family

But not all (e.g., the Laplace distribution).

(KDD 2012 Tutorial) Graphical Models 79 / 131
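A similar sketch for the minimal form (again with an arbitrary β), verifying that exp(xθ − A(θ)) recovers the Bernoulli probabilities:

```python
import numpy as np

beta = 0.3
theta = np.log(beta / (1.0 - beta))     # natural parameter (log-odds)
A = np.log(1.0 + np.exp(theta))         # log partition function

for x in (0, 1):
    p_minimal = np.exp(x * theta - A)
    p_direct = beta**x * (1 - beta)**(1 - x)
    assert np.isclose(p_minimal, p_direct)

print(theta, A)                          # about -0.847 and 0.357 for beta = 0.3
```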

Exponential Family Example 2: Ising Model

[Figure: two nodes xs, xt joined by an edge, with node parameter θs and edge parameter θst]

pθ(x) = exp ( ∑_{s∈V} θs xs + ∑_{(s,t)∈E} θst xs xt − A(θ) )

Binary random variables xs ∈ {0, 1}

d = |V| + |E| sufficient statistics: φ(x) = (…, xs, …, xs xt, …)⊤

This is a regular (Ω = R^d), minimal exponential family.

(KDD 2012 Tutorial) Graphical Models 80 / 131
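To make the notation concrete, here is a brute-force sketch on a hypothetical 3-node chain with made-up parameters (not from the slides); it evaluates θ⊤φ(x) and the log partition function A(θ) by enumeration, which is only feasible for tiny models:

```python
import itertools
import numpy as np

# Tiny Ising chain x0 - x1 - x2 with illustrative parameters
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
theta_node = {0: 0.5, 1: -0.2, 2: 0.1}
theta_edge = {(0, 1): 1.0, (1, 2): -0.5}

def score(x):
    """theta^T phi(x): weighted node and edge sufficient statistics."""
    s = sum(theta_node[v] * x[v] for v in V)
    s += sum(theta_edge[e] * x[e[0]] * x[e[1]] for e in E)
    return s

# Brute-force log partition function A(theta) = log sum_x exp(theta^T phi(x))
configs = list(itertools.product([0, 1], repeat=len(V)))
A = np.log(sum(np.exp(score(x)) for x in configs))

def p(x):
    return np.exp(score(x) - A)

print(A, sum(p(x) for x in configs))   # the probabilities sum to 1
```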

Exponential Family Example 3: Potts Model

[Figure: two nodes xs, xt joined by an edge, with node parameter θs and edge parameter θst]

Similar to the Ising model but generalizing to xs ∈ {0, . . . , r − 1}.

Indicator functions fsj(x) = 1 if xs = j and 0 otherwise, and fstjk(x) = 1 if xs = j ∧ xt = k, and 0 otherwise.

pθ(x) = exp ( ∑_{sj} θsj fsj(x) + ∑_{stjk} θstjk fstjk(x) − A(θ) )

(KDD 2012 Tutorial) Graphical Models 81 / 131

Exponential Family Example 3: Potts Model

[Figure: two nodes xs, xt joined by an edge, with node parameter θs and edge parameter θst]

pθ(x) = exp ( ∑_{sj} θsj fsj(x) + ∑_{stjk} θstjk fstjk(x) − A(θ) )

d = r|V| + r²|E|

Regular but overcomplete, because ∑_{j=0}^{r−1} fsj(x) = 1 for any s ∈ V and all x.

The Potts model is a special case where the parameters are tied: θstkk = α and θstjk = β for j ≠ k.

(KDD 2012 Tutorial) Graphical Models 82 / 131
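A small sketch of the indicator sufficient statistics (hypothetical r = 3 and illustrative α, β, not from the slides), showing the overcompleteness, the node indicators always sum to 1, and the Potts parameter tying:

```python
import numpy as np

r = 3  # hypothetical number of states per node

def f_node(x_s):
    """f_sj(x) for j = 0..r-1: one-hot indicator of the state of node s."""
    f = np.zeros(r)
    f[x_s] = 1.0
    return f

def f_edge(x_s, x_t):
    """f_stjk(x): r x r indicator matrix, 1 at (x_s, x_t)."""
    f = np.zeros((r, r))
    f[x_s, x_t] = 1.0
    return f

# Overcompleteness: the node indicators sum to 1 whatever the state
print([f_node(x).sum() for x in range(r)])                # [1.0, 1.0, 1.0]

# Potts tying: theta_stkk = alpha on the diagonal, beta off the diagonal
alpha, beta = 1.0, -0.5
theta_edge = np.full((r, r), beta)
np.fill_diagonal(theta_edge, alpha)
print((theta_edge * f_edge(1, 1)).sum(),                  # 1.0  (agreeing states)
      (theta_edge * f_edge(0, 2)).sum())                  # -0.5 (disagreeing states)
```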

Important Relationship

For sufficient statistics defined by indicator functions,

e.g., φsj(x) = fsj(x) = 1 if xs = j and 0 otherwise

the marginal can be obtained via the mean:

Eθ[φsj(x)] = P(xs = j)

Since inference is often about computing the marginal, in this case it is equivalent to computing the mean.

We will come back to this point when we discuss variational inference...

(KDD 2012 Tutorial) Graphical Models 83 / 131
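Continuing the hypothetical 3-node Ising chain from the earlier sketch, this check confirms by enumeration that the mean of an indicator feature equals the corresponding marginal:

```python
import itertools
import numpy as np

V, E = [0, 1, 2], [(0, 1), (1, 2)]
theta_node = {0: 0.5, 1: -0.2, 2: 0.1}
theta_edge = {(0, 1): 1.0, (1, 2): -0.5}

def score(x):
    return (sum(theta_node[v] * x[v] for v in V)
            + sum(theta_edge[e] * x[e[0]] * x[e[1]] for e in E))

configs = list(itertools.product([0, 1], repeat=len(V)))
Z = sum(np.exp(score(x)) for x in configs)
p = {x: np.exp(score(x)) / Z for x in configs}

s, j = 1, 1
mean_of_indicator = sum(p[x] * (1.0 if x[s] == j else 0.0) for x in configs)
marginal = sum(p[x] for x in configs if x[s] == j)
assert np.isclose(mean_of_indicator, marginal)
print(mean_of_indicator)   # E[phi_sj(x)] = P(x_s = j)
```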

Mean Parameters

Let p be any density (not necessarily in the exponential family).

Given sufficient statistics φ, the mean parameters µ = (µ1, . . . , µd)⊤ are

µi = Ep[φi(x)] = ∫ φi(x) p(x) dx

The set of mean parameters defines a convex set:

M = {µ ∈ R^d | ∃p s.t. Ep[φ(x)] = µ}

If µ(1), µ(2) ∈ M, there must exist p(1), p(2)

A convex combination of p(1), p(2) leads to another mean parameter in M

Therefore M is convex

(KDD 2012 Tutorial) Graphical Models 84 / 131
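A tiny numeric illustration of this convexity argument (a sketch with made-up distributions over {0, 1, 2}): the mean parameter of a mixture is the same convex combination of the mean parameters.

```python
import numpy as np

# phi(x) = (x, x^2); two arbitrary discrete distributions on {0, 1, 2}
xs = np.array([0.0, 1.0, 2.0])
phi = np.stack([xs, xs**2])                  # 2 x 3 matrix of sufficient statistics

p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.6, 0.1, 0.3])
lam = 0.25                                   # mixture weight

mu1, mu2 = phi @ p1, phi @ p2
mu_mix = phi @ (lam * p1 + (1 - lam) * p2)   # mean parameter of the mixture
assert np.allclose(mu_mix, lam * mu1 + (1 - lam) * mu2)
print(mu_mix)
```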

Example: The First Two Moments

Let φ1(x) = x, φ2(x) = x²

For any p (not necessarily Gaussian) on x, the mean parameters are µ = (µ1, µ2)⊤ = (E(x), E(x²))⊤.

Note V(x) = E(x²) − E²(x) = µ2 − µ1² ≥ 0 for any p

M is not R × R+ but rather the subset {(µ1, µ2) | µ2 ≥ µ1²}.

(KDD 2012 Tutorial) Graphical Models 85 / 131
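A quick empirical check (a sketch sampling from an arbitrary non-Gaussian distribution) that the pair (µ1, µ2) always lands in the region µ2 ≥ µ1²:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    samples = rng.standard_t(df=5, size=10_000)      # an arbitrary non-Gaussian p
    mu1, mu2 = samples.mean(), (samples**2).mean()
    assert mu2 >= mu1**2                              # the variance is nonnegative
    print(round(mu1, 3), round(mu2, 3))
```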

The Marginal Polytope

The marginal polytope is defined for discrete xs

Recall M = {µ ∈ R^d | µ = ∑_x φ(x) p(x) for some p}

p can be a point mass function on a particular x.

In fact any p is a convex combination of such point mass functions.

M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope.

(KDD 2012 Tutorial) Graphical Models 86 / 131

Marginal Polytope Example

Tiny Ising model: two nodes x1, x2 ∈ {0, 1} connected by an edge.

Minimal sufficient statistics φ(x1, x2) = (x1, x2, x1x2)⊤.

Only 4 different x = (x1, x2).

The marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}

The convex hull is a polytope inside the unit cube.

The three coordinates are the node marginals µ1 = P(x1 = 1), µ2 = P(x2 = 1) and the edge marginal µ12 = P(x1 = 1, x2 = 1), hence the name.

(KDD 2012 Tutorial) Graphical Models 87 / 131
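A sketch that enumerates the four vertices φ(x) and shows that any distribution over the four configurations yields a mean parameter inside their convex hull; the particular p used here is arbitrary:

```python
import itertools
import numpy as np

# Vertices of the marginal polytope for the tiny Ising model: phi(x) for all 4 configs
phi = lambda x1, x2: np.array([x1, x2, x1 * x2])
vertices = np.array([phi(*x) for x in itertools.product([0, 1], repeat=2)])
print(vertices)   # rows: (0,0,0), (0,1,0), (1,0,0), (1,1,1)

# Any distribution p over the 4 configurations gives mu = p @ vertices,
# a convex combination of the vertices, i.e. a point in the marginal polytope.
p = np.array([0.1, 0.2, 0.3, 0.4])
mu = p @ vertices
print(mu)         # (mu1, mu2, mu12) = (P(x1=1), P(x2=1), P(x1=1, x2=1))
```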

The Log Partition Function A

For any regular exponential family, A(θ) is convex in θ.

Strictly convex for a minimal exponential family.

Nice property: ∂A(θ)/∂θi = Eθ[φi(x)]

Therefore, ∇A(θ) = µ, the mean parameters of pθ.

(KDD 2012 Tutorial) Graphical Models 88 / 131
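A finite-difference check of ∂A/∂θ = Eθ[φ(x)] for the minimal Bernoulli family (a sketch; θ = 0.7 is an arbitrary test point):

```python
import numpy as np

# Minimal Bernoulli: A(theta) = log(1 + exp(theta)); check that dA/dtheta = E[x]
A = lambda t: np.log1p(np.exp(t))
theta = 0.7
eps = 1e-6
grad_fd = (A(theta + eps) - A(theta - eps)) / (2 * eps)   # finite-difference gradient
mean = 1.0 / (1.0 + np.exp(-theta))                       # E[x] = sigmoid(theta)
assert np.isclose(grad_fd, mean, atol=1e-6)
print(grad_fd, mean)
```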

Conjugate Duality

The conjugate dual function A∗ to A is defined as

A∗(µ) = sup_{θ∈Ω} θ⊤µ − A(θ)

Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

For any µ in the interior of M, let θ(µ) satisfy

Eθ(µ)[φ(x)] = ∇A(θ(µ)) = µ

Then A∗(µ) = −H(pθ(µ)), the negative entropy.

The dual of the dual gives back A: A(θ) = sup_{µ∈M} µ⊤θ − A∗(µ)

For all θ ∈ Ω, the supremum is attained uniquely at the µ ∈ M° (the interior of M) given by the moment matching condition µ = Eθ[φ(x)].

(KDD 2012 Tutorial) Graphical Models 89 / 131

Example: Conjugate Dual for Bernoulli

Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.

By definition

A∗(µ) = sup_{θ∈R} θµ − log(1 + exp(θ))

Taking the derivative and solving,

A∗(µ) = µ log µ + (1 − µ) log(1 − µ)

i.e., the negative entropy.

(KDD 2012 Tutorial) Graphical Models 90 / 131
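The same result checked numerically (a sketch using a dense grid over θ rather than the closed-form optimizer; µ = 0.3 is arbitrary):

```python
import numpy as np

mu = 0.3
thetas = np.linspace(-20, 20, 400_001)                       # dense grid over theta
objective = thetas * mu - np.log1p(np.exp(thetas))           # theta*mu - A(theta)
A_star_numeric = objective.max()
A_star_closed = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)  # negative entropy
print(A_star_numeric, A_star_closed)                         # both about -0.6109
assert np.isclose(A_star_numeric, A_star_closed, atol=1e-6)
```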

Inference with Variational Representation

A(θ) = sup_{µ∈M} µ⊤θ − A∗(µ)

The supremum is attained by µ = Eθ[φ(x)].

Want to compute the marginals P(xs = j)? They are the mean parameters µsj = Eθ[φsj(x)] under the standard overcomplete representation.

Want to compute the mean parameters µsj? They are the arg sup of the optimization problem above.

This variational representation is exact, not approximate

We will relax it next to derive loopy BP and mean field

(KDD 2012 Tutorial) Graphical Models 91 / 131
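A sketch of this representation for the Bernoulli case: maximizing µθ − A∗(µ) over a grid of µ recovers both A(θ) and the mean parameter (θ = 0.7 is an arbitrary test point):

```python
import numpy as np

theta = 0.7
mus = np.linspace(1e-6, 1 - 1e-6, 200_001)
A_star = mus * np.log(mus) + (1 - mus) * np.log(1 - mus)    # negative entropy
objective = mus * theta - A_star

A_variational = objective.max()
mu_hat = mus[objective.argmax()]
print(A_variational, np.log1p(np.exp(theta)))   # both about 1.1032: the value is A(theta)
print(mu_hat, 1 / (1 + np.exp(-theta)))         # the argmax is the mean parameter sigmoid(theta)
```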

The Difficulties with Variational Representation

A(θ) = sup_{µ∈M} µ⊤θ − A∗(µ)

Difficult to solve even though it is a convex problem

Two issues:

Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)

The dual function A∗(µ) usually does not admit an explicit form.

Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.

Next, we cast mean field and sum-product algorithms as variational approximations.

(KDD 2012 Tutorial) Graphical Models 92 / 131

The Mean Field Method as Variational Approximation

The mean field method replaces M with a simpler subset M(F) on which A∗(µ) has a closed form.

Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)

Set θi = 0 for every φi that involves an edge

The densities in this sub-family are all fully factorized:

pθ(x) = ∏_{s∈V} p(xs; θs)

(KDD 2012 Tutorial) Graphical Models 93 / 131

The Geometry of M(F)

Let M(F) be the mean parameters of the fully factorized sub-family. In general, M(F) ⊂ M

Recall M is the convex hull of the extreme points φ(x).

It turns out the extreme points satisfy φ(x) ∈ M(F).

Example:

The tiny Ising model x1, x2 ∈ {0, 1} with φ = (x1, x2, x1x2)⊤

The point mass distribution p(x = (0, 1)⊤) = 1 is realized as the limit of the sequence p(x) = exp(θ1x1 + θ2x2 − A(θ)) where θ1 → −∞ and θ2 → ∞.

This sequence is in F because θ12 = 0.

Hence the extreme point φ(x) = (0, 1, 0) is in M(F).

(KDD 2012 Tutorial) Graphical Models 94 / 131

The Geometry of M(F)

[Figure: M(F) as a nonconvex inner subset of the marginal polytope M; courtesy Nick Bridle]

Because the extreme points of M are in M(F), if M(F) were convex, we would have M = M(F).

But in general M(F) is a proper subset of M

Therefore, M(F) is a nonconvex inner subset of M

(KDD 2012 Tutorial) Graphical Models 95 / 131

The Mean Field Method as Variational Approximation

Recall the exact variational problem

A(θ) = sup_{µ∈M} µ⊤θ − A∗(µ)

attained by the solution to the inference problem µ = Eθ[φ(x)].

The mean field method simply replaces M with M(F):

L(θ) = sup_{µ∈M(F)} µ⊤θ − A∗(µ)

Obviously L(θ) ≤ A(θ).

The original solution µ∗ may not be in M(F)

Even if µ∗ ∈ M(F), we may hit a local maximum and not find it

Why bother? Because A∗(µ) = −H(pθ(µ)) has a very simple form on M(F)

(KDD 2012 Tutorial) Graphical Models 96 / 131

Page 351: Graphical Models - University of Wisconsin–Madisonpages.cs.wisc.edu/~jerryzhu/pub/ZhuKDD12.pdf · Graphical Models Xiaojin Zhu Department of Computer Sciences University of Wisconsin–Madison,

Inference Exponential Family and Variational Inference

Example: Mean Field for Ising Model

The mean parameters for the Ising model are the node and edge marginals: µ_s = p(x_s = 1), µ_st = p(x_s = 1, x_t = 1)

Fully factorized M(F) means no edges: µ_st = µ_s µ_t

For M(F), the dual function A*(µ) has the simple form

A*(µ) = ∑_{s∈V} −H(µ_s) = ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]

Thus the mean field problem is

L(θ) = sup_{µ ∈ M(F)} µ^⊤θ − ∑_{s∈V} ( µ_s log µ_s + (1 − µ_s) log(1 − µ_s) )
     = max_{(µ_1 ... µ_m) ∈ [0,1]^m} ∑_{s∈V} θ_s µ_s + ∑_{(s,t)∈E} θ_st µ_s µ_t + ∑_{s∈V} H(µ_s)

(KDD 2012 Tutorial) Graphical Models 97 / 131

Inference Exponential Family and Variational Inference

Example: Mean Field for Ising Model

L(θ) = max_{(µ_1 ... µ_m) ∈ [0,1]^m} ∑_{s∈V} θ_s µ_s + ∑_{(s,t)∈E} θ_st µ_s µ_t + ∑_{s∈V} H(µ_s)

Bilinear in µ, not jointly concave

But concave in a single dimension µ_s, fixing the others.

Iterative coordinate-wise maximization: fix µ_t for t ≠ s and optimize µ_s.

Setting the partial derivative w.r.t. µ_s to 0 yields:

µ_s = 1 / ( 1 + exp(−(θ_s + ∑_{(s,t)∈E} θ_st µ_t)) )

as we've seen before.

Caution: mean field converges to a local maximum depending on the initialization of µ_1 ... µ_m.

(KDD 2012 Tutorial) Graphical Models 98 / 131
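A minimal sketch of this coordinate-ascent mean field update in Python. The data layout (node parameters θ_s, edge parameters θ_st as a dict, an edge list) and the toy 3-node chain at the bottom are illustrative assumptions, not from the tutorial:

```python
import numpy as np

def mean_field_ising(theta_s, theta_st, edges, n_iters=100, tol=1e-6):
    """Coordinate-ascent mean field for an Ising model.

    theta_s : array of node parameters, shape (m,)
    theta_st: dict mapping edge (s, t) -> pairwise parameter
    edges   : list of undirected edges (s, t)
    Returns approximate node marginals mu_s = q(x_s = 1).
    """
    m = len(theta_s)
    nbrs = [[] for _ in range(m)]       # nbrs[s] = list of (t, theta_st)
    for (s, t) in edges:
        w = theta_st[(s, t)]
        nbrs[s].append((t, w))
        nbrs[t].append((s, w))

    mu = np.full(m, 0.5)                # illustrative initialization
    for _ in range(n_iters):
        max_change = 0.0
        for s in range(m):              # update mu_s with the others fixed
            field = theta_s[s] + sum(w * mu[t] for t, w in nbrs[s])
            new = 1.0 / (1.0 + np.exp(-field))
            max_change = max(max_change, abs(new - mu[s]))
            mu[s] = new
        if max_change < tol:            # stop when no coordinate moves much
            break
    return mu

# toy 3-node chain; parameter values chosen arbitrarily for illustration
mu = mean_field_ising(np.array([0.5, -0.2, 0.1]),
                      {(0, 1): 1.0, (1, 2): -0.5},
                      edges=[(0, 1), (1, 2)])
print(mu)
```

Different initializations of µ can converge to different local maxima, which is exactly the caution noted above.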

Inference Exponential Family and Variational Inference

The Sum-Product Algorithm as Variational Approximation

A(θ) = sup_{µ ∈ M} µ^⊤θ − A*(µ)

The sum-product algorithm makes two approximations:

it relaxes M to an outer set L

it replaces the dual A* with an approximation Ã*

Ã(θ) = sup_{µ ∈ L} µ^⊤θ − Ã*(µ)

(KDD 2012 Tutorial) Graphical Models 99 / 131

Inference Exponential Family and Variational Inference

The Outer Relaxation

For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals µ_{s;j} = p(x_s = j), µ_{st;jk} = p(x_s = j, x_t = k).

The marginal polytope is M = {µ | ∃p with marginals µ}.

Now consider τ ∈ R^d_+ satisfying "node normalization" and "edge-node marginal consistency" conditions:

∑_{j=0}^{r−1} τ_{s;j} = 1          ∀s ∈ V

∑_{k=0}^{r−1} τ_{st;jk} = τ_{s;j}    ∀s, t ∈ V, j = 0 ... r−1

∑_{j=0}^{r−1} τ_{st;jk} = τ_{t;k}    ∀s, t ∈ V, k = 0 ... r−1

Define L = {τ satisfying the above conditions}.

(KDD 2012 Tutorial) Graphical Models 100 / 131
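A small sketch (not from the tutorial; the function name, data layout, and toy numbers are assumptions) that checks whether candidate pseudomarginals τ satisfy the node-normalization and edge-node consistency conditions defining L:

```python
import numpy as np

def in_local_polytope(tau_node, tau_edge, edges, atol=1e-8):
    """Check whether pseudomarginals satisfy the constraints defining L.

    tau_node : dict s -> array of length r          (tau_{s;j})
    tau_edge : dict (s, t) -> array of shape (r, r) (tau_{st;jk})
    edges    : list of edges (s, t)
    """
    # node normalization: each tau_s is nonnegative and sums to 1
    for s, tau_s in tau_node.items():
        if np.any(tau_s < -atol) or not np.isclose(tau_s.sum(), 1.0, atol=atol):
            return False
    # edge-node consistency: row/column sums match the node marginals
    for (s, t) in edges:
        tau_st = tau_edge[(s, t)]
        if np.any(tau_st < -atol):
            return False
        if not np.allclose(tau_st.sum(axis=1), tau_node[s], atol=atol):
            return False
        if not np.allclose(tau_st.sum(axis=0), tau_node[t], atol=atol):
            return False
    return True

# toy check on a single edge with r = 2 states (numbers are illustrative)
tau_node = {0: np.array([0.6, 0.4]), 1: np.array([0.3, 0.7])}
tau_edge = {(0, 1): np.array([[0.2, 0.4], [0.1, 0.3]])}
print(in_local_polytope(tau_node, tau_edge, [(0, 1)]))  # True
```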

Inference Exponential Family and Variational Inference

The Outer Relaxation

If the graph is a tree then M = L

If the graph has cycles then M ⊂ L: L does not enforce other constraints that true marginals need to satisfy

Nice property: L is still a polytope, but much simpler than M.

[Figure: the local polytope L is an outer approximation containing the marginal polytope M; a point τ ∈ L may lie outside M.]

The first approximation in sum-product is to replace M with L.

(KDD 2012 Tutorial) Graphical Models 101 / 131

Inference Exponential Family and Variational Inference

Approximating A∗

Recall µ are node and edge marginals

If the graph is a tree, one can exactly reconstruct the joint probability

p_µ(x) = ∏_{s∈V} µ_{s;x_s} ∏_{(s,t)∈E} µ_{st;x_s x_t} / ( µ_{s;x_s} µ_{t;x_t} )

If the graph is a tree, the entropy of the joint distribution is

H(p_µ) = −∑_{s∈V} ∑_{j=0}^{r−1} µ_{s;j} log µ_{s;j} − ∑_{(s,t)∈E} ∑_{j,k} µ_{st;jk} log ( µ_{st;jk} / ( µ_{s;j} µ_{t;k} ) )

Neither holds for graphs with cycles.

(KDD 2012 Tutorial) Graphical Models 102 / 131

Inference Exponential Family and Variational Inference

Approximating A∗

Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:

H_Bethe(p_τ) = −∑_{s∈V} ∑_{j=0}^{r−1} τ_{s;j} log τ_{s;j} − ∑_{(s,t)∈E} ∑_{j,k} τ_{st;jk} log ( τ_{st;jk} / ( τ_{s;j} τ_{t;k} ) )

Note τ is not a true marginal, and H_Bethe is not a true entropy.

The second approximation in sum-product is to replace A*(τ) with −H_Bethe(p_τ).

(KDD 2012 Tutorial) Graphical Models 103 / 131
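A short sketch that evaluates H_Bethe for given node and edge pseudomarginals, following the formula above; the data layout and the toy single-edge example are illustrative assumptions:

```python
import numpy as np

def bethe_entropy(tau_node, tau_edge, edges, eps=1e-12):
    """Bethe entropy of node/edge pseudomarginals (tau assumed to be in L).

    tau_node : dict s -> array of length r          (tau_{s;j})
    tau_edge : dict (s, t) -> array of shape (r, r) (tau_{st;jk})
    """
    h = 0.0
    # node term: -sum_s sum_j tau_{s;j} log tau_{s;j}
    for s, tau_s in tau_node.items():
        h -= np.sum(tau_s * np.log(tau_s + eps))
    # edge term: -sum_(s,t) sum_jk tau_{st;jk} log(tau_{st;jk} / (tau_{s;j} tau_{t;k}))
    for (s, t) in edges:
        tau_st = tau_edge[(s, t)]
        outer = np.outer(tau_node[s], tau_node[t])
        h -= np.sum(tau_st * np.log((tau_st + eps) / (outer + eps)))
    return h

# toy example reusing the single-edge pseudomarginals from the earlier sketch
tau_node = {0: np.array([0.6, 0.4]), 1: np.array([0.3, 0.7])}
tau_edge = {(0, 1): np.array([[0.2, 0.4], [0.1, 0.3]])}
print(bethe_entropy(tau_node, tau_edge, [(0, 1)]))
```

On a tree this quantity equals the true entropy of the joint; on a loopy graph it is only an approximation.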

Inference Exponential Family and Variational Inference

The Sum-Product Algorithm as Variational Approximation

With these two approximations, we arrive at the variational problem

Ã(θ) = sup_{τ ∈ L} τ^⊤θ + H_Bethe(p_τ)

The optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.

The sum-product algorithm can be derived as an iterative fixed-point procedure that satisfies these optimality conditions.

At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound of A(θ)

τ may not correspond to a true marginal distribution

(KDD 2012 Tutorial) Graphical Models 104 / 131
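For concreteness, here is a minimal loopy belief propagation (sum-product) sketch on a pairwise MRF with a parallel message-update schedule. It is a generic implementation, not the tutorial's factor-graph formulation; the data layout, potential tables, and the 3-node toy cycle are illustrative assumptions:

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, edges, n_iters=50):
    """Loopy belief propagation on a pairwise MRF; returns node beliefs (the tau's)."""
    msgs = {}
    for (s, t) in edges:                       # one message per direction
        msgs[(s, t)] = np.ones(len(node_pot[t])) / len(node_pot[t])
        msgs[(t, s)] = np.ones(len(node_pot[s])) / len(node_pot[s])
    nbrs = {s: [] for s in node_pot}
    for (s, t) in edges:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def edge_matrix(s, t):
        # pairwise potential oriented so rows index x_s and columns index x_t
        return edge_pot[(s, t)] if (s, t) in edge_pot else edge_pot[(t, s)].T

    for _ in range(n_iters):
        new = {}
        for (s, t) in msgs:
            prod = node_pot[s].copy()
            for u in nbrs[s]:
                if u != t:
                    prod = prod * msgs[(u, s)]
            m = edge_matrix(s, t).T @ prod     # sum over x_s
            new[(s, t)] = m / m.sum()          # normalize for stability
        msgs = new

    beliefs = {}
    for s in node_pot:
        b = node_pot[s].copy()
        for u in nbrs[s]:
            b = b * msgs[(u, s)]
        beliefs[s] = b / b.sum()
    return beliefs

# toy 3-node cycle with 2 states; potentials chosen arbitrarily
node_pot = {0: np.array([1.0, 2.0]), 1: np.array([1.0, 1.0]), 2: np.array([2.0, 1.0])}
edge_pot = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (0, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(loopy_bp(node_pot, edge_pot, [(0, 1), (1, 2), (0, 2)]))
```

On a graph with cycles the returned beliefs are pseudomarginals in L, consistent with the discussion above.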

Inference Exponential Family and Variational Inference

Summary: Variational Inference

The sum-product algorithm (loopy belief propagation)

The mean field method

Not covered: Expectation Propagation

Efficient computation. But often unknown bias in solution.

(KDD 2012 Tutorial) Graphical Models 105 / 131

Inference Maximizing Problems

Outline

1 Overview

2 Definitions
  Directed Graphical Models
  Undirected Graphical Models
  Factor Graph

3 Inference
  Exact Inference
  Markov Chain Monte Carlo
  Belief Propagation
  Mean Field Algorithm
  Exponential Family and Variational Inference
  Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 106 / 131

Inference Maximizing Problems

Maximizing Problems

Recall the HMM example

[Figure: two-state HMM with initial probabilities π_1 = π_2 = 1/2, emission distributions P(x|z=1) = (1/2, 1/4, 1/4) and P(x|z=2) = (1/4, 1/2, 1/4) over the symbols R, G, B, and transition probabilities labeled 1/4 and 1/2 in the diagram.]

There are two senses of "best states" z_{1:N} given x_{1:N}:

1 So far we computed the marginal p(z_n | x_{1:N})

We can define "best" as z*_n = arg max_k p(z_n = k | x_{1:N})

However z*_{1:N} as a whole may not be the best

In fact z*_{1:N} can even have zero probability!

2 An alternative is to find

z*_{1:N} = arg max_{z_{1:N}} p(z_{1:N} | x_{1:N})

This finds the most likely state configuration as a whole

The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs

(KDD 2012 Tutorial) Graphical Models 107 / 131
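A tiny numeric illustration of the difference between the two senses of "best" (the joint distribution below is made up for illustration; it is not the tutorial's HMM):

```python
import numpy as np

# joint distribution over (z1, z2) with values in {1, 2}
p = np.array([[0.30, 0.35],   # p[z1-1, z2-1]
              [0.35, 0.00]])

# per-variable marginal maximizers
marg_best = (int(p.sum(axis=1).argmax()) + 1, int(p.sum(axis=0).argmax()) + 1)
# single most probable joint configuration
i, j = np.unravel_index(p.argmax(), p.shape)
joint_best = (int(i) + 1, int(j) + 1)

print(marg_best, p[marg_best[0] - 1, marg_best[1] - 1])    # (1, 1) with prob 0.30
print(joint_best, p[joint_best[0] - 1, joint_best[1] - 1]) # (1, 2) with prob 0.35
```

Here the marginally best configuration (1, 1) is not the jointly most probable one; with other numbers it can even have probability zero.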

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

Simple modification to the sum-product algorithm: replace ∑ with max in the factor-to-variable messages.

µ_{f_s→x}(x) = max_{x_1} ... max_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^{M} µ_{x_m→f_s}(x_m)

µ_{x_m→f_s}(x_m) = ∏_{f ∈ ne(x_m)\f_s} µ_{f→x_m}(x_m)

µ_{x_leaf→f}(x) = 1

µ_{f_leaf→x}(x) = f(x)

(KDD 2012 Tutorial) Graphical Models 108 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

As in sum-product, pick an arbitrary variable node x as the root

Pass messages up from leaves until they reach the root

Unlike sum-product, do not pass messages back from root to leaves

At the root, multiply incoming messages

p_max = max_x ∏_{f ∈ ne(x)} µ_{f→x}(x)

This is the probability of the most likely state configuration

(KDD 2012 Tutorial) Graphical Models 109 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

To identify the configuration itself, keep back pointers:

When creating the message

µ_{f_s→x}(x) = max_{x_1} ... max_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^{M} µ_{x_m→f_s}(x_m)

for each x value, we separately create M pointers back to the values of x_1, ..., x_M that achieve the maximum.

At the root, backtrack the pointers.

(KDD 2012 Tutorial) Graphical Models 110 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

[Figure: factor graph f_1 - z_1 - f_2 - z_2 with factors f_1(z_1) = P(z_1)P(x_1|z_1) and f_2(z_1, z_2) = P(z_2|z_1)P(x_2|z_2), together with the HMM parameters π_1 = π_2 = 1/2, P(x|z=1) = (1/2, 1/4, 1/4), P(x|z=2) = (1/4, 1/2, 1/4) over R, G, B.]

Message from leaf f_1:

µ_{f_1→z_1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
µ_{f_1→z_1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8

The second message:

µ_{z_1→f_2}(z_1 = 1) = 1/4
µ_{z_1→f_2}(z_1 = 2) = 1/8

(KDD 2012 Tutorial) Graphical Models 111 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

[Figure: same HMM factor graph and parameters as above.]

µ_{f_2→z_2}(z_2 = 1) = max_{z_1} f_2(z_1, z_2 = 1) µ_{z_1→f_2}(z_1)
                     = max_{z_1} P(z_2 = 1 | z_1) P(x_2 = G | z_2 = 1) µ_{z_1→f_2}(z_1)
                     = max(1/4 · 1/4 · 1/4, 1/2 · 1/4 · 1/8) = 1/64

Back pointer for z2 = 1: either z1 = 1 or z1 = 2

(KDD 2012 Tutorial) Graphical Models 112 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

[Figure: same HMM factor graph and parameters as above.]

The other element of the same message:

µ_{f_2→z_2}(z_2 = 2) = max_{z_1} f_2(z_1, z_2 = 2) µ_{z_1→f_2}(z_1)
                     = max_{z_1} P(z_2 = 2 | z_1) P(x_2 = G | z_2 = 2) µ_{z_1→f_2}(z_1)
                     = max(3/4 · 1/2 · 1/4, 1/2 · 1/2 · 1/8) = 3/32

Back pointer for z2 = 2: z1 = 1

(KDD 2012 Tutorial) Graphical Models 113 / 131

Inference Maximizing Problems

Intermediate: The Max-Product Algorithm

[Figure: same HMM factor graph and parameters as above.]

µ_{f_2→z_2} = ( 1/64 with back pointer z_1 = 1 or 2,  3/32 with back pointer z_1 = 1 )

At the root z_2:

max_{s=1,2} µ_{f_2→z_2}(s) = 3/32

z_2 = 2 → z_1 = 1

z*_{1:2} = arg max_{z_{1:2}} p(z_{1:2} | x_{1:2}) = (1, 2)

In this example, sum-product and max-product produce the same best sequence; in general they differ.

(KDD 2012 Tutorial) Graphical Models 114 / 131

Inference Maximizing Problems

From Max-Product to Max-Sum

The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow.

µ_{f_s→x}(x) = max_{x_1 ... x_M} [ log f_s(x, x_1, ..., x_M) + ∑_{m=1}^{M} µ_{x_m→f_s}(x_m) ]

µ_{x_m→f_s}(x_m) = ∑_{f ∈ ne(x_m)\f_s} µ_{f→x_m}(x_m)

µ_{x_leaf→f}(x) = 0

µ_{f_leaf→x}(x) = log f(x)

When at the root,

log p_max = max_x ∑_{f ∈ ne(x)} µ_{f→x}(x)

The back pointers are the same.

(KDD 2012 Tutorial) Graphical Models 115 / 131
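A minimal max-sum (Viterbi) sketch for the two-step HMM example above, working in log space with back pointers. The transition matrix is read off from the worked messages (P(z2=1|z1=1) = 1/4, P(z2=1|z1=2) = 1/2); the code layout itself is an illustrative assumption:

```python
import numpy as np

# Two-state HMM from the running example; observation alphabet (R, G, B).
pi = np.array([0.5, 0.5])                      # initial distribution
trans = np.array([[0.25, 0.75], [0.5, 0.5]])   # trans[i, j] = P(z'=j+1 | z=i+1)
emit = np.array([[0.5, 0.25, 0.25],            # P(x | z=1) over (R, G, B)
                 [0.25, 0.5, 0.25]])           # P(x | z=2)
obs = [0, 1]                                   # observed sequence R, G

def max_sum_hmm(pi, trans, emit, obs):
    """Max-sum (Viterbi) in log space with back pointers."""
    log_msg = np.log(pi) + np.log(emit[:, obs[0]])   # message into z_1
    back = []
    for x in obs[1:]:
        scores = log_msg[:, None] + np.log(trans) + np.log(emit[:, x])[None, :]
        back.append(scores.argmax(axis=0))           # best predecessor per state
        log_msg = scores.max(axis=0)
    state = int(log_msg.argmax())                    # best final state
    path = [state]
    for bp in reversed(back):                        # backtrack the pointers
        state = int(bp[state])
        path.append(state)
    path.reverse()
    return np.exp(log_msg.max()), [s + 1 for s in path]

p_max, z_star = max_sum_hmm(pi, trans, emit, obs)
print(p_max, z_star)   # 0.09375 = 3/32 and states [1, 2], matching the slides
```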

Parameter Learning

Outline

1 Overview

2 Definitions
  Directed Graphical Models
  Undirected Graphical Models
  Factor Graph

3 Inference
  Exact Inference
  Markov Chain Monte Carlo
  Belief Propagation
  Mean Field Algorithm
  Exponential Family and Variational Inference
  Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 116 / 131

Parameter Learning

Parameter Learning

Assume the graph structure is given

Learning in exponential family: estimate θ from iid data x1 . . .xn.

Principle: maximum likelihood

Distinguish two cases:

fully observed data: all dimensions of x are observed

partially observed data: some dimensions of x are unobserved

(KDD 2012 Tutorial) Graphical Models 117 / 131

Parameter Learning

Fully Observed Data

p_θ(x) = exp( θ^⊤φ(x) − A(θ) )

Given iid data x_1 ... x_n, the log likelihood is

ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ(x_i) = θ^⊤ ( (1/n) ∑_{i=1}^{n} φ(x_i) ) − A(θ) = θ^⊤µ̂ − A(θ)

µ̂ ≡ (1/n) ∑_{i=1}^{n} φ(x_i) is the mean parameter of the empirical distribution on x_1 ... x_n. Clearly µ̂ ∈ M.

Maximum likelihood: θ_ML = arg sup_{θ∈Ω} θ^⊤µ̂ − A(θ)

The solution is θ_ML = θ(µ̂), the exponential family density whose mean parameter matches µ̂.

When µ̂ is in the interior of M and φ is minimal, there is a unique maximum likelihood solution θ_ML.

(KDD 2012 Tutorial) Graphical Models 118 / 131
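A minimal sketch (assumed toy model and data, not from the tutorial) of fully observed maximum likelihood for a small binary model in which A(θ) can be computed by enumeration. It uses ∇A(θ) = E_θ[φ(x)], so gradient ascent on θ^⊤µ̂ − A(θ) drives the model moments toward µ̂:

```python
import itertools
import numpy as np

def features(x, edges):
    """Sufficient statistics phi(x): node values and edge products (Ising-style)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [x[s] * x[t] for s, t in edges]])

def log_partition_and_mean(theta, n_nodes, edges):
    """Exact A(theta) and E_theta[phi(x)] by enumerating all 2^n configurations."""
    phis = np.array([features(x, edges)
                     for x in itertools.product([0, 1], repeat=n_nodes)])
    scores = phis @ theta
    A = np.log(np.exp(scores).sum())
    probs = np.exp(scores - A)
    return A, probs @ phis

def fit_fully_observed(data, n_nodes, edges, lr=0.5, n_steps=2000):
    """Maximum likelihood by gradient ascent on theta^T mu_hat - A(theta)."""
    mu_hat = np.mean([features(x, edges) for x in data], axis=0)
    theta = np.zeros(n_nodes + len(edges))
    for _ in range(n_steps):
        _, mean_theta = log_partition_and_mean(theta, n_nodes, edges)
        theta += lr * (mu_hat - mean_theta)   # gradient of the log likelihood
    return theta, mu_hat

# toy data on a 3-node chain (values are illustrative)
edges = [(0, 1), (1, 2)]
data = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1]]
theta_ml, mu_hat = fit_fully_observed(data, 3, edges)
_, mean_at_ml = log_partition_and_mean(theta_ml, 3, edges)
print(np.round(mu_hat, 3), np.round(mean_at_ml, 3))  # model moments match mu_hat
```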

Parameter Learning

Partially Observed Data

Each item (x, z) where x observed, z unobserved

Full data (x1, z1) . . . (xn, zn), but we only observe x1 . . .xn

The incomplete likelihood ℓ(θ) = (1/n) ∑_{i=1}^{n} log p_θ(x_i), where p_θ(x_i) = ∫ p_θ(x_i, z) dz

Can be written as ℓ(θ) = (1/n) ∑_{i=1}^{n} A_{x_i}(θ) − A(θ)

New log partition function of p_θ(z | x_i), one per item:

A_{x_i}(θ) = log ∫ exp( θ^⊤φ(x_i, z′) ) dz′

Expectation-Maximization (EM) algorithm: lower bound A_{x_i}

(KDD 2012 Tutorial) Graphical Models 119 / 131

Parameter Learning

EM as Variational Lower Bound

Mean parameter realizable by any distribution on z while holding x_i fixed: M_{x_i} = {µ ∈ R^d | µ = E_p[φ(x_i, z)] for some p}

The variational definition: A_{x_i}(θ) = sup_{µ ∈ M_{x_i}} θ^⊤µ − A*_{x_i}(µ)

Trivial variational lower bound: A_{x_i}(θ) ≥ θ^⊤µ^i − A*_{x_i}(µ^i), ∀µ^i ∈ M_{x_i}

Lower bound L on the incomplete log likelihood:

ℓ(θ) = (1/n) ∑_{i=1}^{n} A_{x_i}(θ) − A(θ)
     ≥ (1/n) ∑_{i=1}^{n} ( θ^⊤µ^i − A*_{x_i}(µ^i) ) − A(θ)
     ≡ L(µ^1, ..., µ^n, θ)

(KDD 2012 Tutorial) Graphical Models 120 / 131

Parameter Learning

Exact EM: The E-Step

The EM algorithm is coordinate ascent on L(µ^1, ..., µ^n, θ).

In the E-step, maximize each µ^i:

µ^i ← arg max_{µ^i ∈ M_{x_i}} L(µ^1, ..., µ^n, θ)

Equivalently, arg max_{µ^i ∈ M_{x_i}} θ^⊤µ^i − A*_{x_i}(µ^i)

This is the variational representation of the mean parameter µ^i(θ) = E_θ[φ(x_i, z)]

The E-step is named after this E_θ[·] under the current parameters θ

(KDD 2012 Tutorial) Graphical Models 121 / 131

Parameter Learning

Exact EM: The M-Step

In the M-step, maximize over $\theta$ holding the $\mu$'s fixed:

$\theta \leftarrow \arg\max_{\theta \in \Omega} L(\mu^1, \ldots, \mu^n, \theta) = \arg\max_{\theta \in \Omega} \theta^\top \bar{\mu} - A(\theta)$, where $\bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mu^i$

The solution $\theta(\bar{\mu})$ satisfies $\mathbb{E}_{\theta(\bar{\mu})}[\phi(x)] = \bar{\mu}$

This is a standard fully observed maximum likelihood problem, hence the name M-step

(KDD 2012 Tutorial) Graphical Models 122 / 131
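
Continuing the same illustrative mixture, the M-step below is the moment-matching update: with the responsibilities held fixed, the new θ makes the model's expected sufficient statistics match their empirical averages. Again a sketch added here, not the tutorial's code.

```python
# Matching M-step for the two-component 1-D mixture above: with the
# responsibilities r (the mu^i) held fixed, the closed-form update matches
# E_theta[phi(x)] to the empirical averages.
import numpy as np

def m_step(x, r):
    n_k = r.sum(axis=0)                          # effective count per component
    pi = n_k / len(x)                            # new mixing weights
    means = (r * x[:, None]).sum(axis=0) / n_k   # new component means
    return pi, means

x = np.array([-2.0, -1.5, 1.0, 2.5])
r = np.array([[0.9, 0.1],                        # responsibilities from an E-step
              [0.8, 0.2],
              [0.2, 0.8],
              [0.1, 0.9]])
pi, means = m_step(x, r)
print(pi, means)                                 # alternate E- and M-steps until convergence
```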

Parameter Learning

Variational EM

For loopy graphs the E-step is often intractable: we cannot maximize

$\max_{\mu^i \in \mathcal{M}_{x_i}} \theta^\top \mu^i - A^*_{x_i}(\mu^i)$

Improve but not necessarily maximize the lower bound: "generalized EM"

The mean field method maximizes

$\max_{\mu^i \in \mathcal{M}_{x_i}(F)} \theta^\top \mu^i - A^*_{x_i}(\mu^i)$

up to a local maximum; recall $\mathcal{M}_{x_i}(F)$ is an inner approximation to $\mathcal{M}_{x_i}$

A mean field E-step therefore still improves the lower bound, so it leads to generalized EM

The sum-product algorithm does not lead to generalized EM: its relaxation is an outer approximation, so the lower-bound guarantee is lost

(KDD 2012 Tutorial) Graphical Models 123 / 131
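
As a sketch of what a mean field E-step can look like on a loopy graph, the snippet below runs naive mean field fixed-point updates for a small model with binary hidden variables z_s ∈ {0, 1} and pairwise potentials on a 4-cycle. This is an added illustration under those assumptions, not the tutorial's code; the fixed point q approximates the mean parameters E[z_s] within the tractable set M(F).

```python
# Naive mean field on a loopy graph: binary hidden variables z_s in {0,1},
# p(z) proportional to exp( sum_s theta_s z_s + sum_{(s,t)} theta_st z_s z_t ).
# Coordinate-ascent updates q_s <- sigmoid(theta_s + sum_t theta_st q_t)
# converge to approximate mean parameters q_s ~ E[z_s].
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(theta_node, edges, theta_edge, n_sweeps=100):
    q = np.full(len(theta_node), 0.5)                  # initial beliefs
    nbrs = {s: [] for s in range(len(theta_node))}
    for (s, t), w in zip(edges, theta_edge):
        nbrs[s].append((t, w))
        nbrs[t].append((s, w))
    for _ in range(n_sweeps):                          # coordinate ascent sweeps
        for s in range(len(q)):
            q[s] = sigmoid(theta_node[s] + sum(w * q[t] for t, w in nbrs[s]))
    return q                                           # approximate mean parameters

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]               # a 4-cycle (loopy graph)
q = mean_field(theta_node=np.array([0.5, -0.5, 0.5, -0.5]),
               edges=edges,
               theta_edge=np.array([1.0, 1.0, 1.0, 1.0]))
print(q)
```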

Structure Learning

Outline

1 Overview

2 Definitions
    Directed Graphical Models
    Undirected Graphical Models
    Factor Graph

3 Inference
    Exact Inference
    Markov Chain Monte Carlo
    Belief Propagation
    Mean Field Algorithm
    Exponential Family and Variational Inference
    Maximizing Problems

4 Parameter Learning

5 Structure Learning

(KDD 2012 Tutorial) Graphical Models 124 / 131

Structure Learning

Score-Based Structure Learning

Let $\mathcal{M}$ be the set of all allowed candidate features

Let $M \subseteq \mathcal{M}$ be a log-linear model structure:

$P(X \mid M, \theta) = \frac{1}{Z} \exp\left( \sum_{i \in M} \theta_i f_i(X) \right)$

A score for the model M can be $\max_\theta \ln P(\text{Data} \mid M, \theta)$

This score never decreases as M grows, so it needs regularization or a Bayesian treatment

M and $\theta$ are treated separately

(KDD 2012 Tutorial) Graphical Models 125 / 131
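
As an added illustration of the score-based idea (not from the tutorial), the sketch below runs greedy forward selection of pairwise features for a tiny, fully observed binary log-linear model, using a BIC-style penalty as the regularization the slide calls for. The partition function is computed by brute-force enumeration, and phi, fit_loglik, and bic_score are made-up helper names.

```python
# Greedy, score-based structure search for a toy log-linear model over three
# binary variables.  Candidate features are pairwise products f_ij(X) = X_i X_j;
# score(M) = max_theta log P(Data | M, theta) - (|M|/2) log n   (BIC-style penalty).
import numpy as np
from itertools import product

def phi(x, feat_set):
    return np.array([x[i] * x[j] for (i, j) in feat_set], dtype=float)

def fit_loglik(data, feat_set, n_steps=300, lr=0.5):
    """Fit theta by gradient ascent; return the maximized mean log-likelihood."""
    configs = np.array(list(product([0, 1], repeat=data.shape[1])))
    Phi = np.array([phi(c, feat_set) for c in configs])        # phi for all 2^p configurations
    emp = np.array([phi(x, feat_set) for x in data]).mean(axis=0)
    theta = np.zeros(len(feat_set))
    for _ in range(n_steps):
        logp = Phi @ theta
        probs = np.exp(logp - np.logaddexp.reduce(logp))       # model distribution
        theta += lr * (emp - probs @ Phi)                      # gradient = emp - E_theta[phi]
    return float(emp @ theta - np.logaddexp.reduce(Phi @ theta))

def bic_score(data, feat_set):
    n = len(data)
    return n * fit_loglik(data, feat_set) - 0.5 * len(feat_set) * np.log(n)

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=200)
x1 = np.where(rng.random(200) < 0.9, x0, 1 - x0)               # x1 tracks x0
x2 = rng.integers(0, 2, size=200)                              # x2 independent
data = np.column_stack([x0, x1, x2])

candidates = [(0, 1), (0, 2), (1, 2)]
chosen = []
while True:
    current = bic_score(data, chosen)
    gains = {f: bic_score(data, chosen + [f]) for f in candidates if f not in chosen}
    if not gains or max(gains.values()) <= current:
        break
    chosen.append(max(gains, key=gains.get))
print("selected features:", chosen)                            # typically just [(0, 1)]
```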

Structure Learning

Structure Learning for Gaussian Random Fields

Consider a p-dimensional multivariate Gaussian N(µ,Σ)

The graphical model has p nodes x1, . . . , xp

The edge between $x_i, x_j$ is absent if and only if $\Omega_{ij} = 0$, where $\Omega = \Sigma^{-1}$ is the precision matrix

Equivalently, $x_i, x_j$ are conditionally independent given all the other variables

(KDD 2012 Tutorial) Graphical Models 126 / 131

Structure Learning

Example

If we know $\Sigma = \begin{bmatrix} 14 & -16 & 4 & -2 \\ -16 & 32 & -8 & 4 \\ 4 & -8 & 8 & -4 \\ -2 & 4 & -4 & 5 \end{bmatrix}$

then $\Omega = \Sigma^{-1} = \begin{bmatrix} 0.1667 & 0.0833 & 0 & 0 \\ 0.0833 & 0.0833 & 0.0417 & 0 \\ 0 & 0.0417 & 0.2500 & 0.1667 \\ 0 & 0 & 0.1667 & 0.3333 \end{bmatrix}$

The corresponding graphical model structure is the chain x1 - x2 - x3 - x4: an edge is present exactly where $\Omega_{ij} \ne 0$.

(KDD 2012 Tutorial) Graphical Models 127 / 131
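
The example can be checked numerically; this short snippet (added for convenience) inverts the Σ above and reads the edges off the zero pattern of Ω.

```python
# Invert Sigma and read the graph off the zero pattern of Omega = Sigma^{-1}.
import numpy as np

Sigma = np.array([[ 14., -16.,   4.,  -2.],
                  [-16.,  32.,  -8.,   4.],
                  [  4.,  -8.,   8.,  -4.],
                  [ -2.,   4.,  -4.,   5.]])
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 4))

# An edge (i, j) is present exactly when Omega_ij != 0 (up to numerical noise).
edges = [(i + 1, j + 1) for i in range(4) for j in range(i + 1, 4)
         if abs(Omega[i, j]) > 1e-8]
print(edges)   # [(1, 2), (2, 3), (3, 4)]: the chain x1 - x2 - x3 - x4
```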

Structure Learning

Structure Learning for Gaussian Random Fields

Let the data be $X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma)$

The log likelihood is (up to a constant) $\frac{n}{2} \log|\Omega| - \frac{1}{2} \sum_{i=1}^{n} (X^{(i)} - \mu)^\top \Omega (X^{(i)} - \mu)$

The maximum likelihood estimate of $\Sigma$ is the sample covariance

$S = \frac{1}{n} \sum_i (X^{(i)} - \bar{X})(X^{(i)} - \bar{X})^\top$

where $\bar{X}$ is the sample mean

$S^{-1}$ is a poor estimate of $\Omega$ when n is small; it does not even exist when $n \le p$

(KDD 2012 Tutorial) Graphical Models 128 / 131
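
A small added illustration of the last point: after centering, S has rank at most n − 1, so with fewer samples than dimensions it is singular and S⁻¹ is not even defined.

```python
# With n < p the sample covariance S is rank-deficient, so S^{-1} does not exist.
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 5                                   # more dimensions than samples
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                        # center the data
S = (Xc.T @ Xc) / n                            # p x p sample covariance
print(np.linalg.matrix_rank(S))                # at most n - 1 = 4, far below p = 10
print(np.linalg.cond(S))                       # enormous: S is numerically singular
```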

Structure Learning

Structure Learning for Gaussian Random Fields

For centered data, minimize a regularized problem over positive definite $\Omega$ instead:

$-\log|\Omega| + \frac{1}{n} \sum_{i=1}^{n} X^{(i)\top} \Omega X^{(i)} + \lambda \sum_{i \ne j} |\Omega_{ij}|$

The $\ell_1$ penalty drives many $\Omega_{ij}$ to exactly zero, so the minimizer's zero pattern gives the graph structure

Known as the graphical lasso (glasso)

(KDD 2012 Tutorial) Graphical Models 129 / 131
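
A sketch of the glasso fit, assuming scikit-learn's GraphicalLasso estimator is available (its alpha plays the role of λ and should be tuned, e.g. with GraphicalLassoCV); the data are simulated from the chain example above purely for illustration.

```python
# Graphical lasso on data simulated from the chain example; edges are read
# off the nonzero off-diagonal entries of the estimated precision matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
Sigma = np.array([[ 14., -16.,   4.,  -2.],
                  [-16.,  32.,  -8.,   4.],
                  [  4.,  -8.,   8.,  -4.],
                  [ -2.,   4.,  -4.,   5.]])
X = rng.multivariate_normal(mean=np.zeros(4), cov=Sigma, size=500)

model = GraphicalLasso(alpha=0.05).fit(X)       # larger alpha -> sparser estimate
print(np.round(model.precision_, 3))            # compare its zero pattern with Omega above
```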

Structure Learning

Summary

Graphical model = joint distribution p(x1, . . . , xn)

Bayesian network or Markov random field
    conditional independence

Inference = $p(X_Q \mid X_E)$, in general $X_Q \cup X_E \subset \{x_1, \ldots, x_n\}$

exact, MCMC, variational

If p(x1, . . . , xn) not given, estimate it from data

parameter and structure learning

(KDD 2012 Tutorial) Graphical Models 130 / 131

Structure Learning

References

Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009.

Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.

Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

(KDD 2012 Tutorial) Graphical Models 131 / 131