Recursive Cavity Modeling for Estimation of Gaussian...
Recursive Cavity Modeling for
Estimation of Gaussian MRFs∗
Stochastic Systems Group
Jason K. Johnson
October 9, 2002
∗/mit/jasonj/Public/SSG-OCT9-02
Overview
• Background
– Graphical Models (MRFs)
– Exponential Families
– Gaussian MRFs
– Information Geometry and Projections
• Model-Thinning Projections
– Model Selection by greedy edge-removal procedure.
– Parameters optimized by Iterative Scaling.
• Recursive Cavity Modeling
– Nested Dissection
– Cavity Modeling
– Blanket Modeling
– Examples
Graphical Models∗
Undirected graph G = (V, E) with vertex set V and edge set E (unordered pairs of vertices).
Random variables x = (xi, i ∈ V) are said to be Markov w.r.t. G when
p(xA, xB|xS) = p(xA|xS) p(xB|xS)
for all A, B, S ⊂ V where S separates A from B.
Hammersley-Clifford, 71.† x is Markov w.r.t.
G if and only if p(x) factors according to G as
p(x) = (1/Z(ψ)) ∏c∈C ψc(xc)
with positive potential functions ψc over the cliques C of G and normalization constant Z(ψ).
The Markov structure of the random process x allows for compact specification of p(x) as a graphical model.
∗Lauritzen, 96; Jordan, 99.
†Grimmett, 73.
Example MRF
[Figure: 4-cycle graph on vertices x1, x2, x3, x4 with edges (1,2), (2,3), (3,4), (4,1).]
Graph Factorization
p(x) ∝ ψ1(x1)ψ2(x2)ψ3(x3)ψ4(x4)
ψ1,2(x1, x2)ψ2,3(x2, x3)
ψ3,4(x3, x4)ψ4,1(x4, x1)
Conditional Independence
p(x1,3|x2,4) = p(x1|x2,4)p(x3|x2,4)
p(x2,4|x1,3) = p(x2|x1,3)p(x4|x1,3)
Exponential Families∗
Specified by a base measure q(x) > 0 and a set of sufficient statistics t(x), both defined over some specified state-space X. We take X = Rn, so that the model is specified by a pdf of the form
f(x; θ) = q(x) exp{θ · t(x)− ϕ(θ)}
where the cumulant function ϕ(θ) is the log of the normalization constant,
ϕ(θ) = log ∫ q(x) exp{θ · t(x)} dx
Only consider admissible parameters Θ s.t. the pdf is normalizable: ϕ(θ) < ∞. The family is regular if Θ has non-empty interior. The statistics are minimal if the components of t(x) are linearly independent. Then a dual parameterization is provided by the moment coordinates η = Eθ{t(x)} over the set of achievable moments η(Θ).
∗Chentsov, 66; Barndorff-Nielsen, 78.
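The duality above can be checked numerically: a minimal sketch (hypothetical parameter values) verifying that the moment coordinates are the gradient of the cumulant function, η(θ) = ∂ϕ(θ)/∂θ, for the 1D Gaussian family with q(x) = 1 and t(x) = (x, x²):

```python
import numpy as np

# 1D Gaussian exponential family with q(x) = 1, t(x) = (x, x^2);
# admissible iff theta2 < 0.
def phi(theta):
    t1, t2 = theta
    # phi = log int exp(t1*x + t2*x^2) dx = -t1^2/(4 t2) + 0.5 log(pi/(-t2))
    return -t1**2 / (4*t2) + 0.5*np.log(np.pi / -t2)

def moments(theta):
    t1, t2 = theta
    mu = -t1 / (2*t2)          # mean
    var = -1.0 / (2*t2)        # variance
    return np.array([mu, var + mu**2])   # eta = (E[x], E[x^2])

theta = np.array([1.0, -0.5])  # hypothetical admissible point
eps = 1e-6
grad = np.array([(phi(theta + eps*e) - phi(theta - eps*e)) / (2*eps)
                 for e in np.eye(2)])
print(np.allclose(grad, moments(theta), atol=1e-5))  # True: eta = grad phi
```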
Gaussian Markov Random Fields
Consider Gaussian process x ∼ N (µ,Σ) with
mean vector µ = E{x} and covariance matrix
Σ = E{xx′} − µµ′.
Information Filter Form. Say that x ∼ N−1(h, J)
if
h = Σ−1µ
J = Σ−1
s.t. the density function is parameterized as
p(x) = exp{−(1/2) x′Jx + h′x − ϕ(h, J)}
where
ϕ(h, J) = (1/2){h′J−1h − log |J| + n log 2π}.
This is an exponential family model with
θ = (h,−J/2)
t(x) = (x, xx′)
η = (µ,Σ + µµ′)
ϕ(θ) = ϕ(h, J)
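A quick numerical check of the information-filter form (a sketch with a hypothetical 2-d model): converting (µ, Σ) to (h, J) and evaluating ϕ(h, J) as above should reproduce the usual Gaussian density:

```python
import numpy as np

# Information form N^{-1}(h, J) of a Gaussian N(mu, Sigma).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
Sigma = A @ A.T + 2*np.eye(2)          # a random SPD covariance
mu = np.array([1.0, -2.0])

J = np.linalg.inv(Sigma)               # J = Sigma^{-1}
h = J @ mu                             # h = Sigma^{-1} mu
n = len(h)
phi = 0.5*(h @ np.linalg.solve(J, h)
           - np.log(np.linalg.det(J)) + n*np.log(2*np.pi))

x = np.array([0.3, 0.7])
p_info = np.exp(-0.5*x @ J @ x + h @ x - phi)             # information form
p_mom = (np.exp(-0.5*(x - mu) @ J @ (x - mu))
         / np.sqrt((2*np.pi)**n * np.linalg.det(Sigma)))  # moment form
print(np.isclose(p_info, p_mom))  # True: phi(h, J) normalizes the density
```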
Example GMRF
[Figure: 3-node chain with node potentials ψ1, ψ2, ψ3 and edge potentials ψ1,2, ψ2,3.]
p(x) ∝ ψ1(x1)ψ2(x2)ψ3(x3)ψ1,2(x1, x2)ψ2,3(x2, x3)
ψ1(x1) = exp{−(1/2) x′1 J1,1 x1 + h′1 x1}
ψ2(x2) = exp{−(1/2) x′2 J2,2 x2 + h′2 x2}
ψ3(x3) = exp{−(1/2) x′3 J3,3 x3 + h′3 x3}
ψ1,2(x1, x2) = exp{−x′1 J1,2 x2}
ψ2,3(x2, x3) = exp{−x′2 J2,3 x3}
h = (h1, h2, h3)′,
J =
[ J1,1   J1,2   0    ]
[ J′1,2  J2,2   J2,3 ]
[ 0      J′2,3  J3,3 ]
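The chain structure can be verified numerically. A sketch (hypothetical scalar entries) showing that the zero in J gives x1 ⊥ x3 | x2, even though x1 and x3 are marginally correlated:

```python
import numpy as np

# 3-node chain GMRF: J is tridiagonal because only edges (1,2), (2,3) exist.
J = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.5, -0.6],
              [ 0.0, -0.6,  1.8]])
Sigma = np.linalg.inv(J)        # dense: marginal correlations do not vanish
print(abs(Sigma[0, 2]) > 1e-6)  # True: x1 and x3 are marginally correlated

# ...but conditionally uncorrelated given x2:
# cond. cov = Sigma13 - Sigma12 Sigma22^{-1} Sigma23
cond = Sigma[0, 2] - Sigma[0, 1]*Sigma[2, 1]/Sigma[1, 1]
print(np.isclose(cond, 0.0))    # True: x1 ⊥ x3 | x2, since J13 = 0
```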
Information Geometry∗
Based upon the Kullback-Leibler divergence†, a measure of contrast between probability distributions:
D(p‖q) = Ep{log [p(x)/q(x)]}
Bregman distance in θ based upon ϕ(θ),
D(θ∗‖θ) = ϕ(θ) − ϕ(θ∗) − ∇ϕ(θ∗) · (θ − θ∗)
Legendre transform ϕ∗(η) of ϕ(θ):
ϕ∗(η) = θ(η) · η − ϕ(θ(η))
“Slope transform”
η(θ) = ∂ϕ(θ)/∂θ,   θ(η) = ∂ϕ∗(η)/∂η
Convex bifunction in (η(p), θ(q)),
D(η‖θ) = ϕ∗(η) + ϕ(θ)− η · θ
∗Chentsov, 72; Csiszar, 75; Efron, 78; Amari, 01.
†Kullback and Leibler, 51.
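The bifunction form can be checked against the closed-form Gaussian KL. A sketch with hypothetical parameters, using the GMRF coordinates θ = (h, −J/2), η = (µ, Σ + µµ′) from the earlier slide (the matrix part of θ · t is a Frobenius inner product):

```python
import numpy as np

def phi(h, J):
    n = len(h)
    return 0.5*(h @ np.linalg.solve(J, h)
                - np.log(np.linalg.det(J)) + n*np.log(2*np.pi))

def kl_bifunction(mu0, S0, mu1, S1):
    """D(eta_p || theta_q) = phi*(eta_p) + phi(theta_q) - eta_p . theta_q."""
    J0, J1 = np.linalg.inv(S0), np.linalg.inv(S1)
    h0, h1 = J0 @ mu0, J1 @ mu1
    eta_mu, eta_S = mu0, S0 + np.outer(mu0, mu0)
    # phi*(eta) = theta(eta) . eta - phi(theta(eta)), evaluated at p
    phistar = h0 @ eta_mu + np.sum(-0.5*J0 * eta_S) - phi(h0, J0)
    return phistar + phi(h1, J1) - (h1 @ eta_mu + np.sum(-0.5*J1 * eta_S))

def kl_closed(mu0, S0, mu1, S1):
    """Standard closed-form KL between N(mu0,S0) and N(mu1,S1)."""
    n = len(mu0)
    J1 = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5*(np.trace(J1 @ S0) + d @ J1 @ d - n
                + np.log(np.linalg.det(S1)/np.linalg.det(S0)))

mu0, S0 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu1, S1 = np.array([0.5, 0.0]), np.array([[1.5, -0.2], [-0.2, 1.0]])
print(np.isclose(kl_bifunction(mu0, S0, mu1, S1),
                 kl_closed(mu0, S0, mu1, S1)))  # True
```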
Bregman distance∗
[Figure: convex cumulant function ϕ(θ), its tangent ϕ(θ; θ0) at θ0, and the Bregman distance D(θ0‖θ) as the gap between curve and tangent.]
∗Bregman, 67.
Triangle Relation
[Figure: tangents to ϕ(θ) at θ0 and θ1, illustrating D(θ0‖θ1), D(θ0‖θ2), D(θ1‖θ2), and the tangent gap ∆ · (θ2 − θ1) at θ2.]
D(θ0‖θ2) = D(θ0‖θ1)+D(θ1‖θ2)+(η1−η0)·(θ2−θ1)
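The triangle relation holds exactly in any exponential family; a numerical check on the 1D Gaussian family (hypothetical θ values, Bregman distance written out from ϕ and η):

```python
import numpy as np

def phi(t):
    # cumulant function of the 1D Gaussian family, t = (t1, t2), t2 < 0
    return -t[0]**2/(4*t[1]) + 0.5*np.log(np.pi/-t[1])

def eta(t):
    mu = -t[0]/(2*t[1]); var = -1/(2*t[1])
    return np.array([mu, var + mu**2])

def D(ta, tb):
    """Bregman distance D(ta||tb) = phi(tb) - phi(ta) - eta(ta).(tb - ta)."""
    return phi(tb) - phi(ta) - eta(ta) @ (tb - ta)

t0 = np.array([0.5, -1.0])
t1 = np.array([-0.3, -0.7])
t2 = np.array([1.0, -2.0])
lhs = D(t0, t2)
rhs = D(t0, t1) + D(t1, t2) + (eta(t1) - eta(t0)) @ (t2 - t1)
print(np.isclose(lhs, rhs))  # True: the triangle relation holds exactly
```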
Information Projections
Let F be a regular exponential family with min-
imal statistics t(x), exponential coordinates Θ,
and moment coordinates η(Θ).
M-projection. Let p ∈ F and let H ⊂ F be an e-flat submanifold. There exists a unique q∗ ∈ H satisfying the following equivalent conditions:
(i) D(p‖q∗) = inf_{q∈H} D(p‖q)
(ii) ∀q ∈ H : (η(p) − η(q∗)) · (θ(q) − θ(q∗)) = 0
(iii) ∀q ∈ H : D(p‖q) = D(p‖q∗) + D(q∗‖q)
We call q∗ = arg min_{q∈H} D(p‖q) the m-projection of p to H.
M-projection
[Figure: the m-projection q∗ of p onto an e-flat submanifold, shown in both Θ and η(Θ) coordinates, with the Pythagorean decomposition D(p‖q) = D(p‖q∗) + D(q∗‖q).]
∂D(p‖q)/∂θ(q) = η(q) − η(p)
Dual E-projection
E-projection. Let q ∈ F and let H′ ⊂ F be an m-flat submanifold. There exists a unique p∗ ∈ H′ satisfying the following equivalent conditions:
(i) D(p∗‖q) = inf_{p∈H′} D(p‖q)
(ii) ∀p ∈ H′ : (η(p) − η(p∗)) · (θ(q) − θ(p∗)) = 0
(iii) ∀p ∈ H′ : D(p‖q) = D(p‖p∗) + D(p∗‖q)
We call p∗ = arg min_{p∈H′} D(p‖q) the e-projection of q to H′.
Duality. Let H and H′ be I-orthogonal submanifolds such that there exists r in their intersection and
∀p ∈ H′, q ∈ H : (η(p) − η(r)) · (θ(q) − θ(r)) = 0
Then r is both the m-projection of p ∈ H′ to H and the e-projection of q ∈ H to H′.
E-projection
[Figure: the e-projection p∗ of q onto an m-flat submanifold, shown in both Θ and η(Θ) coordinates, with the decomposition D(p‖q) = D(p‖p∗) + D(p∗‖q).]
∂D(p‖q)/∂η(p) = θ(p) − θ(q)
Model Thinning
Let t(x) = (tH(x), t′H(x)), θ = (θH, θ′H) and
η = (ηH, η′H).
Objective. M-project p ∈ F to lower-order
exponential family,
H = {q ∈ F | θ′H(q) = 0}
Dual Problem. E-project q ∈ H to the m-flat submanifold
H′(p) = {r ∈ F | ηH(r) = ηH(p)}
The latter e-projection problem may be solved by iterative scaling techniques which adjust parameters θH(q) until ηH(q) = ηH(p) (moment matching).
For a GMRF x ∼ N−1(h, J), impose sparsity on J. Moment matching gives the classical covariance selection problem (Dempster, 72).
Iterative Scaling
Alternating e-projections onto a set of m-flat submanifolds converges to the e-projection onto their intersection (Csiszar, 75). Special case of the method of alternating Bregman projections (Bregman, 67).
Iterative Proportional Fitting.∗ The m-flat submanifolds impose marginal moment constraints specifying the marginal distribution p∗(xC):
ψ(xC) ← ψ(xC) × p∗(xC)/p(xC)
Covariance Selection.† Updates exponential parameters (hC, JC) to impose moment constraints (µ∗C, Σ∗C):
JC ← JC + (J∗C − ĴC)
hC ← hC + (h∗C − ĥC)
where (h∗C, J∗C) = ((Σ∗C)−1µ∗C, (Σ∗C)−1) is the target and (ĥC, ĴC) = (Σ−1C µC, Σ−1C) the current marginal information model.
∗Ireland and Kullback, 68.
†Speed and Kiiveri, 86.
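A sketch of covariance-selection iterative scaling (hypothetical target numbers): cycling the information-form updates over cliques {0,1} and {1,2} matches the clique marginals of a dense target while never touching the absent edge (0,2):

```python
import numpy as np

# Target (dense) Gaussian to be projected onto a 3-node chain structure.
Sig_t = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.5, 0.6],
                  [0.5, 0.6, 1.2]])   # hypothetical SPD target covariance
mu_t = np.array([1.0, 0.0, -1.0])

J = np.eye(3); h = np.zeros(3)        # initialize to a standard Gaussian
cliques = [[0, 1], [1, 2]]
for _ in range(200):                  # cycle e-projections to convergence
    for C in cliques:
        Sig = np.linalg.inv(J); mu = Sig @ h
        SC, ST = Sig[np.ix_(C, C)], Sig_t[np.ix_(C, C)]
        # add (target - current) marginal information models on the clique
        J[np.ix_(C, C)] += np.linalg.inv(ST) - np.linalg.inv(SC)
        h[C] += np.linalg.inv(ST) @ mu_t[C] - np.linalg.inv(SC) @ mu[C]

Sig = np.linalg.inv(J); mu = Sig @ h
print(np.isclose(J[0, 2], 0.0))                 # sparsity kept: no edge (0,2)
print(np.allclose(Sig[:2, :2], Sig_t[:2, :2]))  # clique marginals match
print(np.allclose(mu, mu_t))                    # means match
```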
Greedy Edge-Removal
Prunes edges from the graphical model by forcing selected off-diagonal entries of J to zero (m-projections implemented by iterative scaling techniques).
Selects weak interactions to prune according to the conditional mutual information (scalar-node case)
I(xi; xj | x∖ij) = −(1/2) log(1 − ρ²ij),   ρij = Ji,j / √(Ji,i Jj,j)
which gives a tractable lower-bound estimate of the KL divergence under m-projection.
Selects a batch K of the weakest edges to prune, satisfying
∑(i,j)∈K Ii;j < δ/|K|
Continues thinning until no weak interactions relative to δ remain. Related to the Akaike information criterion (Akaike, 74).
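The edge-scoring step can be sketched for scalar nodes (hypothetical J with one deliberately weak edge): the conditional mutual information is −½ log(1 − ρ²) with partial correlation ρ = Jij/√(Jii Jjj), and the weakest edge is selected for pruning:

```python
import numpy as np

J = np.array([[ 2.00, -0.60,  0.05],
              [-0.60,  2.20, -0.50],
              [ 0.05, -0.50,  1.90]])   # edge (0, 2) is deliberately weak

def cond_mi(J, i, j):
    """I(x_i; x_j | rest) = -0.5 log(1 - rho^2) for scalar nodes."""
    rho = J[i, j] / np.sqrt(J[i, i]*J[j, j])
    return -0.5*np.log(1 - rho**2)

edges = [(0, 1), (0, 2), (1, 2)]
scores = {e: cond_mi(J, *e) for e in edges}
weakest = min(scores, key=scores.get)
print(weakest)  # (0, 2): the weak interaction is selected for pruning
```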
Nested Dissection
[Figure: nested dissection of a grid by alternating separator cuts: (1) vertical cut, (2) horizontal cut, (3) vertical cut, (4) horizontal cut.]
Variable Elimination
Integrate over a subset Λ ⊂ V of the random variables:
p(x∖Λ) = ∫ p(x) dxΛ
Local parameter update in the (h, J) representation:
h∂Λ ← h∂Λ − J∂Λ,Λ (JΛ,Λ)−1 hΛ
J∂Λ ← J∂Λ − J∂Λ,Λ (JΛ,Λ)−1 JΛ,∂Λ
Eliminates vertices in the graphical model but adds “fill” edges between their neighbors. Only updates the local parameters and structure of the “boundary” ∂Λ of the subfield.
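The update above is the Schur complement of J with respect to the eliminated block. A numerical check (random SPD model, hypothetical index sets; here the retained variables play the role of ∂Λ) that it agrees with direct marginalization:

```python
import numpy as np

# Eliminating x_Lambda in information form must agree with marginalizing
# the moment-form density.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
J = A @ A.T + 4*np.eye(4)             # SPD precision on 4 variables
h = rng.standard_normal(4)

lam = [2, 3]                          # variables to eliminate (Lambda)
keep = [0, 1]                         # retained variables ("boundary")
Jkk, Jkl = J[np.ix_(keep, keep)], J[np.ix_(keep, lam)]
Jll = J[np.ix_(lam, lam)]
# local update: h' = h_k - J_kl J_ll^{-1} h_l,  J' = J_kk - J_kl J_ll^{-1} J_lk
h_new = h[keep] - Jkl @ np.linalg.solve(Jll, h[lam])
J_new = Jkk - Jkl @ np.linalg.solve(Jll, Jkl.T)

# marginal of x_keep computed directly in moment form
Sig = np.linalg.inv(J); mu = Sig @ h
print(np.allclose(np.linalg.inv(J_new), Sig[np.ix_(keep, keep)]))  # True
print(np.allclose(np.linalg.solve(J_new, h_new), mu[keep]))        # True
```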
18
Cavity Models (Initialization)
(1) Partial model of subfield (zero boundary).
(2) Elimination gives model of surface.
(3) Model thinning gives “cavity model”.
“Upwards” Cavity Modeling
(1) Initialization.
(2) Merge. (3) Eliminate.
(4) Thin.
“Downwards” Blanket Modeling
(1) Initialization.
(2) Merge. (3) Eliminate.
(4) Thin.
Conclusion
RCM appears to provide a powerful and flexible framework for tractable yet near-optimal computation in MRFs.
Much work remains to better characterize performance and explore promising extensions:
• Develop information geometry of RCM.
• Consider more general families of graphical
models.
• Employ alternative modeling techniques.
• Applications
– Model Identification
– Image Processing
– Data Compression and Coding
– Monte-Carlo Simulation
References
Akaike, 74. A new look at the statistical model identification. IEEE Trans. Auto. Control, AC-19:716-723.
Amari, 01. Information geometry of hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47(5):1701-1711.
Chentsov, 66. A systematic theory of exponential families. Theory of Prob. and Appl., 11.
Chentsov, 72. Statistical decision rules and optimal inference. AMS Trans. Math. Mono., v. 53 (reprint 82).
Barndorff-Nielsen, 78. Information and Exponential Families. John Wiley.
Bregman, 67. The relaxation method of finding the common point of convex sets. USSR Comp. Math. and Physics, 7:200-217.
Csiszar, 75. I-divergence geometry of probability distributions and minimization problems. Annals of Prob., 3(1):146-158.
Dempster, 72. Covariance Selection. Biometrics, 28(1):157-175.
Efron, 78. The geometry of exponential families. Annals of Stat., 6(2):362-376.
Grimmett, 73. A theorem about random fields. Bull. of London Math. Soc., 5:81-84.
Ireland and Kullback, 68. Contingency tables with given marginals. Biometrika, 55:179-188.
Jordan (editor), 99. Learning in Graphical Models. MIT Press.
Kullback and Leibler, 51. On information and sufficiency. Annals of Math. Stat., 22(1):79-86.
Lauritzen, 96. Graphical Models. Oxford University Press.
Speed and Kiiveri, 86. Gaussian Markov distributions over finite graphs. Annals of Stat., 14(1):138-150.