Recursive Cavity Modeling for Estimation of Gaussian...
Recursive Cavity Modeling for
Estimation of Gaussian MRFs∗
Stochastic Systems Group
Jason K. Johnson
October 9, 2002
∗/mit/jasonj/Public/SSG-OCT9-02
Overview
• Background
– Graphical Models (MRFs)
– Exponential Families
– Gaussian MRFs
– Information Geometry and Projections
• Model-Thinning Projections
– Model Selection by greedy edge-removal procedure.
– Parameters optimized by Iterative Scaling.
• Recursive Cavity Modeling
– Nested Dissection
– Cavity Modeling
– Blanket Modeling
– Examples
Graphical Models∗
Undirected graph G = (V, E) with vertex set V and edge set E (unordered pairs of vertices).
Random variables x = (xi, i ∈ V) are said to be Markov w.r.t. G when
p(xA, xB|xS) = p(xA|xS) p(xB|xS)
for all A, B, S ⊂ V where S separates A from B.
Hammersley-Clifford, 71.† x is Markov w.r.t.
G if and only if p(x) factors according to G as
p(x) = (1/Z(ψ)) ∏c∈C ψc(xc)
with positive potential functions ψc over the cliques C of G and normalization constant Z(ψ).
The Markov structure of the random process x allows for compact specification of p(x) as a graphical model.
∗Lauritzen, 96; Jordan, 99.
†Grimmett, 73.
Example MRF
[Figure: 4-cycle graph on vertices x1, x2, x3, x4 with edges (1,2), (2,3), (3,4), (4,1).]
Graph Factorization
p(x) ∝ ψ1(x1)ψ2(x2)ψ3(x3)ψ4(x4)
ψ1,2(x1, x2)ψ2,3(x2, x3)
ψ3,4(x3, x4)ψ4,1(x4, x1)
Conditional Independence
p(x1,3|x2,4) = p(x1|x2,4)p(x3|x2,4)
p(x2,4|x1,3) = p(x2|x1,3)p(x4|x1,3)
Exponential Families∗
Specified by a base measure q(x) > 0 and a set of sufficient statistics t(x), both defined over some specified state-space X. We take X = Rn, so that the model is specified by a pdf of the form
f(x; θ) = q(x) exp{θ · t(x)− ϕ(θ)}
where the cumulant function ϕ(θ) is the log of the normalization constant,
ϕ(θ) = log ∫ q(x) exp{θ · t(x)} dx
Only consider admissible parameters Θ s.t. the pdf is normalizable: ϕ(θ) < ∞. The family is regular if Θ has non-empty interior. The statistics are minimal if the components of t(x) are linearly independent. Then a dual parameterization is provided by the moment coordinates η = Eθ{t(x)} over the set of achievable moments η(Θ).
∗Chentsov, 66; Barndorff-Nielsen, 78.
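The duality above can be checked numerically: a minimal sketch (hypothetical parameter values) verifying that the moment coordinates are the gradient of the cumulant function, η(θ) = ∂ϕ(θ)/∂θ, for the 1D Gaussian family with q(x) = 1 and t(x) = (x, x²):

```python
import numpy as np

# 1D Gaussian exponential family with q(x) = 1, t(x) = (x, x^2);
# admissible iff theta2 < 0.
def phi(theta):
    t1, t2 = theta
    # phi = log int exp(t1*x + t2*x^2) dx = -t1^2/(4 t2) + 0.5 log(pi/(-t2))
    return -t1**2 / (4*t2) + 0.5*np.log(np.pi / -t2)

def moments(theta):
    t1, t2 = theta
    mu = -t1 / (2*t2)          # mean
    var = -1.0 / (2*t2)        # variance
    return np.array([mu, var + mu**2])   # eta = (E[x], E[x^2])

theta = np.array([1.0, -0.5])  # hypothetical admissible point
eps = 1e-6
grad = np.array([(phi(theta + eps*e) - phi(theta - eps*e)) / (2*eps)
                 for e in np.eye(2)])
print(np.allclose(grad, moments(theta), atol=1e-5))  # True: eta = grad phi
```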
Gaussian Markov Random Fields
Consider Gaussian process x ∼ N (µ,Σ) with
mean vector µ = E{x} and covariance matrix
Σ = E{xx′} − µµ′.
Information Filter Form. Say that x ∼ N−1(h, J)
if
h = Σ−1µ
J = Σ−1
s.t. the density function is parameterized as
p(x) = exp{−(1/2) x′Jx + h′x − ϕ(h, J)}
where
ϕ(h, J) = (1/2){h′J−1h − log |J| + n log 2π}.
This is an exponential family model with
θ = (h,−J/2)
t(x) = (x, xx′)
η = (µ,Σ + µµ′)
ϕ(θ) = ϕ(h, J)
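A quick numerical check of the information-filter form (a sketch with a hypothetical 2-d model): converting (µ, Σ) to (h, J) and evaluating ϕ(h, J) as above should reproduce the usual Gaussian density:

```python
import numpy as np

# Information form N^{-1}(h, J) of a Gaussian N(mu, Sigma).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
Sigma = A @ A.T + 2*np.eye(2)          # a random SPD covariance
mu = np.array([1.0, -2.0])

J = np.linalg.inv(Sigma)               # J = Sigma^{-1}
h = J @ mu                             # h = Sigma^{-1} mu
n = len(h)
phi = 0.5*(h @ np.linalg.solve(J, h)
           - np.log(np.linalg.det(J)) + n*np.log(2*np.pi))

x = np.array([0.3, 0.7])
p_info = np.exp(-0.5*x @ J @ x + h @ x - phi)             # information form
p_mom = (np.exp(-0.5*(x - mu) @ J @ (x - mu))
         / np.sqrt((2*np.pi)**n * np.linalg.det(Sigma)))  # moment form
print(np.isclose(p_info, p_mom))  # True: phi(h, J) normalizes the density
```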
Example GMRF
[Figure: 3-node chain with node potentials ψ1, ψ2, ψ3 and edge potentials ψ1,2, ψ2,3.]
p(x) ∝ ψ1(x1)ψ2(x2)ψ3(x3)ψ1,2(x1, x2)ψ2,3(x2, x3)
ψ1(x1) = exp{−(1/2) x′1 J1,1 x1 + h′1 x1}
ψ2(x2) = exp{−(1/2) x′2 J2,2 x2 + h′2 x2}
ψ3(x3) = exp{−(1/2) x′3 J3,3 x3 + h′3 x3}
ψ1,2(x1, x2) = exp{−x′1 J1,2 x2}
ψ2,3(x2, x3) = exp{−x′2 J2,3 x3}
h = (h1, h2, h3)′,
J =
[ J1,1   J1,2   0    ]
[ J′1,2  J2,2   J2,3 ]
[ 0      J′2,3  J3,3 ]
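The chain structure can be verified numerically. A sketch (hypothetical scalar entries) showing that the zero in J gives x1 ⊥ x3 | x2, even though x1 and x3 are marginally correlated:

```python
import numpy as np

# 3-node chain GMRF: J is tridiagonal because only edges (1,2), (2,3) exist.
J = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.5, -0.6],
              [ 0.0, -0.6,  1.8]])
Sigma = np.linalg.inv(J)        # dense: marginal correlations do not vanish
print(abs(Sigma[0, 2]) > 1e-6)  # True: x1 and x3 are marginally correlated

# ...but conditionally uncorrelated given x2:
# cond. cov = Sigma13 - Sigma12 Sigma22^{-1} Sigma23
cond = Sigma[0, 2] - Sigma[0, 1]*Sigma[2, 1]/Sigma[1, 1]
print(np.isclose(cond, 0.0))    # True: x1 ⊥ x3 | x2, since J13 = 0
```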
Information Geometry∗
Based upon the Kullback-Leibler divergence†, a measure of contrast between probability distributions:
D(p‖q) = Ep{log [p(x)/q(x)]}
Bregman distance in θ based upon ϕ(θ),
D(θ∗‖θ) = ϕ(θ) − ϕ(θ∗) − ∇ϕ(θ∗) · (θ − θ∗)
Legendre transform ϕ∗(η) of ϕ(θ):
ϕ∗(η) = θ(η) · η − ϕ(θ(η))
“Slope transform”
η(θ) = ∂ϕ(θ)/∂θ,   θ(η) = ∂ϕ∗(η)/∂η
Convex bifunction in (η(p), θ(q)),
D(η‖θ) = ϕ∗(η) + ϕ(θ)− η · θ
∗Chentsov, 72; Csiszar, 75; Efron, 78; Amari, 01.
†Kullback and Leibler, 51.
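The bifunction form can be checked against the closed-form Gaussian KL. A sketch with hypothetical parameters, using the GMRF coordinates θ = (h, −J/2), η = (µ, Σ + µµ′) from the earlier slide (the matrix part of θ · t is a Frobenius inner product):

```python
import numpy as np

def phi(h, J):
    n = len(h)
    return 0.5*(h @ np.linalg.solve(J, h)
                - np.log(np.linalg.det(J)) + n*np.log(2*np.pi))

def kl_bifunction(mu0, S0, mu1, S1):
    """D(eta_p || theta_q) = phi*(eta_p) + phi(theta_q) - eta_p . theta_q."""
    J0, J1 = np.linalg.inv(S0), np.linalg.inv(S1)
    h0, h1 = J0 @ mu0, J1 @ mu1
    eta_mu, eta_S = mu0, S0 + np.outer(mu0, mu0)
    # phi*(eta) = theta(eta) . eta - phi(theta(eta)), evaluated at p
    phistar = h0 @ eta_mu + np.sum(-0.5*J0 * eta_S) - phi(h0, J0)
    return phistar + phi(h1, J1) - (h1 @ eta_mu + np.sum(-0.5*J1 * eta_S))

def kl_closed(mu0, S0, mu1, S1):
    """Standard closed-form KL between N(mu0,S0) and N(mu1,S1)."""
    n = len(mu0)
    J1 = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5*(np.trace(J1 @ S0) + d @ J1 @ d - n
                + np.log(np.linalg.det(S1)/np.linalg.det(S0)))

mu0, S0 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu1, S1 = np.array([0.5, 0.0]), np.array([[1.5, -0.2], [-0.2, 1.0]])
print(np.isclose(kl_bifunction(mu0, S0, mu1, S1),
                 kl_closed(mu0, S0, mu1, S1)))  # True
```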
Bregman distance∗
[Figure: convex cumulant function ϕ(θ), its tangent ϕ(θ; θ0) at θ0, and the Bregman distance D(θ0‖θ) as the gap between curve and tangent.]
∗Bregman, 67.
Triangle Relation
[Figure: tangents to ϕ(θ) at θ0 and θ1, illustrating D(θ0‖θ1), D(θ0‖θ2), D(θ1‖θ2), and the tangent gap ∆ · (θ2 − θ1) at θ2.]
D(θ0‖θ2) = D(θ0‖θ1)+D(θ1‖θ2)+(η1−η0)·(θ2−θ1)
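The triangle relation holds exactly in any exponential family; a numerical check on the 1D Gaussian family (hypothetical θ values, Bregman distance written out from ϕ and η):

```python
import numpy as np

def phi(t):
    # cumulant function of the 1D Gaussian family, t = (t1, t2), t2 < 0
    return -t[0]**2/(4*t[1]) + 0.5*np.log(np.pi/-t[1])

def eta(t):
    mu = -t[0]/(2*t[1]); var = -1/(2*t[1])
    return np.array([mu, var + mu**2])

def D(ta, tb):
    """Bregman distance D(ta||tb) = phi(tb) - phi(ta) - eta(ta).(tb - ta)."""
    return phi(tb) - phi(ta) - eta(ta) @ (tb - ta)

t0 = np.array([0.5, -1.0])
t1 = np.array([-0.3, -0.7])
t2 = np.array([1.0, -2.0])
lhs = D(t0, t2)
rhs = D(t0, t1) + D(t1, t2) + (eta(t1) - eta(t0)) @ (t2 - t1)
print(np.isclose(lhs, rhs))  # True: the triangle relation holds exactly
```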
Information Projections
Let F be a regular exponential family with min-
imal statistics t(x), exponential coordinates Θ,
and moment coordinates η(Θ).
M-projection. Let p ∈ F and let H ⊂ F be an e-flat submanifold. There exists a unique q∗ ∈ H satisfying the following equivalent conditions:
(i) D(p‖q∗) = inf_{q∈H} D(p‖q)
(ii) ∀q ∈ H : (η(p) − η(q∗)) · (θ(q) − θ(q∗)) = 0
(iii) ∀q ∈ H : D(p‖q) = D(p‖q∗) + D(q∗‖q)
We call q∗ = arg min_{q∈H} D(p‖q) the m-projection of p to H.
M-projection
[Figure: the m-projection q∗ of p onto an e-flat submanifold, shown in both Θ and η(Θ) coordinates, with the Pythagorean decomposition D(p‖q) = D(p‖q∗) + D(q∗‖q).]
∂D(p‖q)/∂θ(q) = η(q) − η(p)
Dual E-projection
E-projection. Let q ∈ F and let H′ ⊂ F be an m-flat submanifold. There exists a unique p∗ ∈ H′ satisfying the following equivalent conditions:
(i) D(p∗‖q) = inf_{p∈H′} D(p‖q)
(ii) ∀p ∈ H′ : (η(p) − η(p∗)) · (θ(q) − θ(p∗)) = 0
(iii) ∀p ∈ H′ : D(p‖q) = D(p‖p∗) + D(p∗‖q)
We call p∗ = arg min_{p∈H′} D(p‖q) the e-projection of q to H′.
Duality. Let H and H′ be I-orthogonal submanifolds such that there exists r in their intersection and
∀p ∈ H′, q ∈ H : (η(p) − η(r)) · (θ(q) − θ(r)) = 0
Then r is both the m-projection of p ∈ H′ to H and the e-projection of q ∈ H to H′.
E-projection
[Figure: the e-projection p∗ of q onto an m-flat submanifold, shown in both Θ and η(Θ) coordinates, with the decomposition D(p‖q) = D(p‖p∗) + D(p∗‖q).]
∂D(p‖q)/∂η(p) = θ(p) − θ(q)
Model Thinning
Let t(x) = (tH(x), t′H(x)), θ = (θH, θ′H) and
η = (ηH, η′H).
Objective. M-project p ∈ F to lower-order
exponential family,
H = {q ∈ F | θ′H(q) = 0}
Dual Problem. E-project q ∈ H to the m-flat submanifold
H′(p) = {r ∈ F | ηH(r) = ηH(p)}
The latter e-projection problem may be solved by iterative scaling techniques which adjust parameters θH(q) until ηH(q) = ηH(p) (moment matching).
For a GMRF x ∼ N−1(h, J), impose sparsity on J. Moment matching gives the classical covariance selection problem (Dempster, 72).
Iterative Scaling
Alternating e-projections onto a set of m-flat submanifolds converges to the e-projection onto their intersection (Csiszar, 75). Special case of the method of alternating Bregman projections (Bregman, 67).
Iterative Proportional Fitting.∗ The m-flat submanifolds impose marginal moment constraints specifying the marginal distribution p∗(xC):
ψ(xC) ← ψ(xC) × p∗(xC)/p(xC)
Covariance Selection.† Updates exponential parameters (hC, JC) to impose moment constraints (µ∗C, Σ∗C):
JC ← JC + (J∗C − ĴC)
hC ← hC + (h∗C − ĥC)
where (h∗C, J∗C) = ((Σ∗C)−1µ∗C, (Σ∗C)−1) is the target and (ĥC, ĴC) = (Σ−1C µC, Σ−1C) the current marginal information model.
∗Ireland and Kullback, 68.
†Speed and Kiiveri, 86.
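A sketch of covariance-selection iterative scaling (hypothetical target numbers): cycling the information-form updates over cliques {0,1} and {1,2} matches the clique marginals of a dense target while never touching the absent edge (0,2):

```python
import numpy as np

# Target (dense) Gaussian to be projected onto a 3-node chain structure.
Sig_t = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.5, 0.6],
                  [0.5, 0.6, 1.2]])   # hypothetical SPD target covariance
mu_t = np.array([1.0, 0.0, -1.0])

J = np.eye(3); h = np.zeros(3)        # initialize to a standard Gaussian
cliques = [[0, 1], [1, 2]]
for _ in range(200):                  # cycle e-projections to convergence
    for C in cliques:
        Sig = np.linalg.inv(J); mu = Sig @ h
        SC, ST = Sig[np.ix_(C, C)], Sig_t[np.ix_(C, C)]
        # add (target - current) marginal information models on the clique
        J[np.ix_(C, C)] += np.linalg.inv(ST) - np.linalg.inv(SC)
        h[C] += np.linalg.inv(ST) @ mu_t[C] - np.linalg.inv(SC) @ mu[C]

Sig = np.linalg.inv(J); mu = Sig @ h
print(np.isclose(J[0, 2], 0.0))                 # sparsity kept: no edge (0,2)
print(np.allclose(Sig[:2, :2], Sig_t[:2, :2]))  # clique marginals match
print(np.allclose(mu, mu_t))                    # means match
```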
Greedy Edge-Removal
Prunes edges from the graphical model by forcing selected off-diagonal entries of J to zero (m-projections implemented by iterative scaling techniques).
Selects weak interactions to prune according to the conditional mutual information (scalar-node case)
I(xi; xj | x∖ij) = −(1/2) log(1 − ρ²ij),   ρij = Ji,j / √(Ji,i Jj,j)
which gives a tractable lower-bound estimate of the KL divergence under m-projection.
Selects a batch K of the weakest edges to prune, satisfying
∑(i,j)∈K Ii;j < δ/|K|
Continues thinning until no weak interactions relative to δ remain. Related to the Akaike information criterion (Akaike, 74).
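The edge-scoring step can be sketched for scalar nodes (hypothetical J with one deliberately weak edge): the conditional mutual information is −½ log(1 − ρ²) with partial correlation ρ = Jij/√(Jii Jjj), and the weakest edge is selected for pruning:

```python
import numpy as np

J = np.array([[ 2.00, -0.60,  0.05],
              [-0.60,  2.20, -0.50],
              [ 0.05, -0.50,  1.90]])   # edge (0, 2) is deliberately weak

def cond_mi(J, i, j):
    """I(x_i; x_j | rest) = -0.5 log(1 - rho^2) for scalar nodes."""
    rho = J[i, j] / np.sqrt(J[i, i]*J[j, j])
    return -0.5*np.log(1 - rho**2)

edges = [(0, 1), (0, 2), (1, 2)]
scores = {e: cond_mi(J, *e) for e in edges}
weakest = min(scores, key=scores.get)
print(weakest)  # (0, 2): the weak interaction is selected for pruning
```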
Nested Dissection
[Figure: nested dissection of a grid by alternating separator cuts: (1) vertical cut, (2) horizontal cut, (3) vertical cut, (4) horizontal cut.]
Variable Elimination
Integrate over a subset Λ ⊂ V of the random variables:
p(x∖Λ) = ∫ p(x) dxΛ
Local parameter update in the (h, J) representation:
h∂Λ ← h∂Λ − J∂Λ,Λ (JΛ,Λ)−1 hΛ
J∂Λ ← J∂Λ − J∂Λ,Λ (JΛ,Λ)−1 JΛ,∂Λ
Eliminates vertices in the graphical model but adds “fill” edges between their neighbors. Only updates the local parameters and structure of the “boundary” ∂Λ of the subfield.
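The update above is the Schur complement of J with respect to the eliminated block. A numerical check (random SPD model, hypothetical index sets; here the retained variables play the role of ∂Λ) that it agrees with direct marginalization:

```python
import numpy as np

# Eliminating x_Lambda in information form must agree with marginalizing
# the moment-form density.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
J = A @ A.T + 4*np.eye(4)             # SPD precision on 4 variables
h = rng.standard_normal(4)

lam = [2, 3]                          # variables to eliminate (Lambda)
keep = [0, 1]                         # retained variables ("boundary")
Jkk, Jkl = J[np.ix_(keep, keep)], J[np.ix_(keep, lam)]
Jll = J[np.ix_(lam, lam)]
# local update: h' = h_k - J_kl J_ll^{-1} h_l,  J' = J_kk - J_kl J_ll^{-1} J_lk
h_new = h[keep] - Jkl @ np.linalg.solve(Jll, h[lam])
J_new = Jkk - Jkl @ np.linalg.solve(Jll, Jkl.T)

# marginal of x_keep computed directly in moment form
Sig = np.linalg.inv(J); mu = Sig @ h
print(np.allclose(np.linalg.inv(J_new), Sig[np.ix_(keep, keep)]))  # True
print(np.allclose(np.linalg.solve(J_new, h_new), mu[keep]))        # True
```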
18
Cavity Models (Initialization)
(1) Partial model of subfield (zero boundary).
(2) Elimination gives model of surface.
(3) Model thinning gives “cavity model”.
“Upwards” Cavity Modeling
(1) Initialization.
(2) Merge. (3) Eliminate.
(4) Thin.
“Downwards” Blanket Modeling
(1) Initialization.
(2) Merge. (3) Eliminate.
(4) Thin.
Conclusion
RCM appears to provide a powerful and flexible framework for tractable yet near-optimal computation in MRFs.
Much work remains to better characterize performance and explore promising extensions:
• Develop information geometry of RCM.
• Consider more general families of graphical
models.
• Employ alternative modeling techniques.
• Applications
– Model Identification
– Image Processing
– Data Compression and Coding
– Monte-Carlo Simulation
References
Akaike, 74. A new look at the statistical model identification. IEEE Trans. Auto. Control, AC-19:716-723.
Amari, 01. Information geometry of hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47(5):1701-1711.
Chentsov, 66. A systematic theory of exponential families. Theory of Prob. and Appl., 11.
Chentsov, 72. Statistical decision rules and optimal inference. AMS Trans. Math. Mono., v. 53 (reprint 82).
Barndorff-Nielsen, 78. Information and Exponential Families. John Wiley.
Bregman, 67. The relaxation method of finding the common point of convex sets. USSR Comp. Math. and Physics, 7:200-217.
Csiszar, 75. I-divergence geometry of probability distributions and minimization problems. Annals of Prob., 3(1):146-158.
Dempster, 72. Covariance Selection. Biometrics, 28(1):157-175.
Efron, 78. The geometry of exponential families. Annals of Stat., 6(2):362-376.
Grimmett, 73. A theorem about random fields. Bull. of London Math. Soc., 5:81-84.
Ireland and Kullback, 68. Contingency tables with given marginals. Biometrika, 55:179-188.
Jordan (editor), 99. Learning in Graphical Models. MIT Press.
Kullback and Leibler, 51. On information and sufficiency. Annals of Math. Stat., 22(1):79-86.
Lauritzen, 96. Graphical Models. Oxford University Press.
Speed and Kiiveri, 86. Gaussian Markov distributions over finite graphs. Annals of Stat., 14(1):138-150.