MAP Inference - Tel Aviv University (haimk/pgm-seminar/MAP Inference.pdf)
Inference in Graphical Models
Given a Bayesian network or Markov network – a data structure that
represents a joint distribution compactly, in a factorized way.
We would like to answer queries using the distribution as our model.
Focused on the conditional probability query P(Y | E = e).
We’ve seen algorithms for exact and approximate inference.
Maximum a Posteriori (MAP)
Evidence: 𝐸 = 𝑒.
Query variables: the remaining variables Y = 𝒳 − E (where 𝒳 = {X1, …, Xn}).
Task: compute MAP(Y | E = e) = argmax_y P(Y = y | E = e).
Applications:
Message Decoding: most likely transmitted message.
Image segmentation: most likely segmentation.
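As a concrete illustration, a brute-force sketch of the MAP query over a tiny hypothetical joint (the distribution, the variable indexing, and the `map_query` helper are invented here, not from the slides):

```python
# Hypothetical joint distribution over binary X1, X2 (not from the slides).
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def map_query(P, evidence):
    """Brute-force MAP(Y | E=e): maximize the joint with evidence clamped.

    Conditioning only rescales by 1/P(e), so maximizing P(y, e) over the
    non-evidence variables yields argmax_y P(y | e)."""
    best, best_p = None, -1.0
    for x in P:
        if all(x[i] == v for i, v in evidence.items()):
            if P[x] > best_p:
                best, best_p = x, P[x]
    return best

print(map_query(P, {}))      # (1, 1) - the unconditional MAP assignment
print(map_query(P, {0: 0}))  # (0, 0) - most likely completion given X1 = 0
```

Enumeration is exponential in the number of variables; the algorithms in these slides exist precisely to avoid it.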
Marginal MAP Query
General form of MAP query.
Now we allow Y ⊆ 𝒳 − E.
Task: let W = 𝒳 − Y − E and compute
MAP(Y | E = e) := argmax_y P(Y = y | E = e) = argmax_y Σ_w P(Y = y, W = w | E = e)
Involves both maximization and summation – seems harder than maximization
or summation alone.
Computational Complexity
The following decision problem is NP-complete:
Given a Bayesian network ℬ = (G, P) and a probability p, decide if there is a joint assignment x such that P(x) > p.
The following decision problem for marginal MAP is complete for NP^PP:
Given a Bayesian network ℬ = (G, P), a subset of random variables Y ⊆ 𝒳, and a probability p, decide if there is an assignment y to Y such that P(y) > p.
Overview: Algorithms for MAP
Variable Elimination.
Message Passing (Belief Propagation).
Integer programming methods.
For some networks: graph-cut methods.
Combinatorial Search (will not be discussed).
Preliminaries
We consider a distribution P_Φ(𝒳) defined via a set of factors, with unnormalized density P̃_Φ.
Then equivalently we need to compute
x^MAP = argmax_x P_Φ(x) = argmax_x (1/Z) P̃_Φ(x) = argmax_x P̃_Φ(x)
Given evidence we aim to maximize P̃_Φ(Y, e) (we consider the unnormalized probability with the factors reduced by E = e).
Similarly for marginal MAP
Max Marginal
Let Y be a set of variables and ξ a full assignment. We denote by ξ⟨Y⟩ the assignment that ξ makes to the variables Y.
The max-marginal of a function f relative to a set of variables Y is
MaxMarg_f(y) = max_{ξ : ξ⟨Y⟩ = y} f(ξ)
for any assignment y ∈ Val(Y).
For example, MaxMarg_P(a) = max_b P(a, b) is a max-marginal of P over A.
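A minimal sketch of this definition (the function `f`, its values, and the `max_marg` helper are hypothetical):

```python
# A small function f over binary (A, B); the values are hypothetical.
f = {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 4.0, (1, 1): 2.0}

def max_marg(f, keep):
    """MaxMarg_f(y): max of f over all full assignments xi with xi<Y> = y,
    where Y is the variables at the positions listed in `keep`."""
    out = {}
    for xi, v in f.items():
        y = tuple(xi[i] for i in keep)
        out[y] = max(out.get(y, float('-inf')), v)
    return out

print(max_marg(f, keep=[0]))  # {(0,): 5.0, (1,): 4.0} - max-marginal over A
```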
Variable Elimination: a Simple Example.
Consider the Bayesian network A → B with CPDs
We want to compute max_{a,b} P(a, b) and argmax_{a,b} P(a, b).
We compute
max_{a,b} P(a, b) = max_a max_b P(a)P(b | a) = max_a { P(a) max_b P(b | a) } = max{0.4 · 0.9, 0.6 · 0.55} = 0.36
We now want to find the MAP assignment 𝑎∗, 𝑏∗ .
First, a* = argmax_a P(a) max_b P(b | a) = a0, attaining 0.4 · 0.9 = 0.36. (Why?)
Then, given a*, we can calculate b* = argmax_b P(b | a*) = b1, with P(b1 | a0) = 0.9. (Why?)
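The whole computation can be checked in a few lines. The CPD entries not stated above (P(b0 | a0), and which value of B attains 0.55 under a1) are assumptions here:

```python
# CPDs from the example, completed with complementary entries:
# P(a0) = 0.4, P(a1) = 0.6; P(b1|a0) = 0.9 and max_b P(b|a1) = 0.55
# (which value of B attains 0.55 under a1 is an assumption).
P_A = {'a0': 0.4, 'a1': 0.6}
P_B_given_A = {'a0': {'b0': 0.1, 'b1': 0.9},
               'a1': {'b0': 0.55, 'b1': 0.45}}

# Eliminate B: for each a, record max_b P(b|a) and the maximizing b.
tau = {a: max(P_B_given_A[a].values()) for a in P_A}
back = {a: max(P_B_given_A[a], key=P_B_given_A[a].get) for a in P_A}

# Maximize over A, then trace back to recover b*.
a_star = max(P_A, key=lambda a: P_A[a] * tau[a])
b_star = back[a_star]
map_value = P_A[a_star] * tau[a_star]
print(a_star, b_star, round(map_value, 4))  # a0 b1 0.36
```

The `back` dictionary is exactly the traceback information: it remembers, for each value of A, which value of B achieved the inner maximum.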
Variable Elimination
We have seen two procedures
Variable elimination.
Tracing back to get a joint assignment.
Variable elimination was possible due to the general fact
If X ∉ Scope[φ1] then max_X (φ1 · φ2) = φ1 · max_X φ2
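A quick numeric check of this fact, with hypothetical factors phi1(A) and phi2(A, X):

```python
# phi1 does not mention X; phi2 does. Values are hypothetical.
phi1 = {('a0',): 2.0, ('a1',): 3.0}
phi2 = {('a0', 0): 1.0, ('a0', 1): 4.0, ('a1', 0): 5.0, ('a1', 1): 2.0}

# Left side: build the product factor, then max out X.
lhs = {a: max(phi1[(a,)] * phi2[(a, x)] for x in (0, 1)) for a in ('a0', 'a1')}

# Right side: max out X first, then multiply by phi1.
rhs = {a: phi1[(a,)] * max(phi2[(a, x)] for x in (0, 1)) for a in ('a0', 'a1')}

assert lhs == rhs  # max_X (phi1 * phi2) == phi1 * max_X phi2
print(lhs)  # {'a0': 8.0, 'a1': 15.0}
```

This is the distributive property that lets variable elimination push each max inward past the factors that do not mention the eliminated variable.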
Max-product Variable Elimination
Analogous to sum-product variable-elimination.
Discrepancies:
Max replaces Sum.
Traceback procedure.
No query variables – all variables are eliminated.
Finding the Most Probable Assignment
The φ_{Xi}'s are the intermediate factors; the last one is the max-marginal over the final uneliminated variable.
The returned assignment is the MAP assignment.
VE Example
First elimination
max_{i,d,g,l} φ_I(i) φ_G(g, i, d) φ_D(d) φ_L(l, g) max_s φ_S(i, s)
Second elimination
max_{d,g,l} φ_D(d) φ_L(l, g) max_i φ_I(i) φ_G(g, i, d) τ1(i)
And so on…
VE Example Continued
Now we trace back:
g* = argmax_g ψ5(g)
l* = argmax_l ψ4(l, g*)
d* = argmax_d ψ3(g*, d)
i* = argmax_i ψ2(g*, i, d*)
s* = argmax_s ψ1(i*, s)
(𝑔∗, 𝑙∗, 𝑑∗, 𝑖∗, 𝑠∗) is the MAP assignment.
𝜏5(∅) is its probability.
Complexity of Max-Product VE
The analysis of the complexity of sum-product VE applies unchanged.
Traceback adds no extra asymptotic expense:
it is a linear traversal over the intermediate factors.
VE for Marginal MAP
Recall the marginal MAP problem
MAP(Y | E = e) = argmax_y P(Y = y | E = e) = argmax_y Σ_w P(Y = y, W = w | E = e)
Suggests a similar VE algorithm.
Marginal MAP VE – a Bad Example
We wish to compute
Must perform all summations before maximizations
Generally max_a Σ_b P(a, b) ≤ Σ_b max_a P(a, b) (not necessarily with equality).
After summing out the X_i's we are left with a factor that is exponential in n.
In contrast, the plain MAP query on this network can be performed in linear time.
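The inequality above is easy to verify numerically on a small hypothetical joint:

```python
# A hypothetical joint P(a, b) where the two orders of operations differ.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

max_sum = max(sum(P[(a, b)] for b in (0, 1)) for a in (0, 1))  # max_a sum_b
sum_max = sum(max(P[(a, b)] for a in (0, 1)) for b in (0, 1))  # sum_b max_a

# max_a sum_b P(a,b) <= sum_b max_a P(a,b), strictly here: 0.5 < 0.8
print(max_sum, sum_max)
```

The gap exists because swapping max and sum lets each summand pick its own maximizer, which is exactly why max and sum cannot be reordered freely in marginal MAP.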
Reminder: Sum-Product Belief Propagation Algorithm Properties
Each belief is the marginal of the cluster:
All pairs of adjacent cliques are calibrated
Reparameterization of the distribution
Max-Product Belief Propagation Properties
Each belief is the max-marginal of the cluster
All pairs of adjacent cliques are max-calibrated
Reparameterization of the distribution
Proofs are exactly the same (replace summation with maximization).
Max-Product Belief Propagation Example
Max marginals after running Belief Propagation on the clique tree
Decoding Max Marginals
Can we separately choose each value for a random variable based on local
beliefs?
No! Example: the XOR-like distribution P(x1, x2) = 0.4 if x1 = x2 and 0.1 otherwise.
This will work if there is a unique MAP assignment; equivalently, if each max-marginal over X_i has a unique maximal value.
In case of ties, we can introduce a slight random perturbation.
Otherwise, we can use traceback on the factors that correspond to the variable eliminations.
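The XOR example can be worked out directly: every local max-marginal ties, independent decoding can produce a low-probability assignment, and traceback repairs it.

```python
# The XOR distribution from the slide: probability 0.4 on each equal
# assignment and 0.1 on each unequal one.
P = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}

# Max-marginals: every single value of X1 ties at 0.4 ...
mm_x1 = {x1: max(P[(x1, x2)] for x2 in (0, 1)) for x1 in (0, 1)}
assert mm_x1 == {0: 0.4, 1: 0.4}

# ... so decoding each variable independently can mix maximizers:
# (0, 1) is consistent with both local beliefs but has probability 0.1.
# Traceback avoids this: fix x1 first, then maximize P(x1, .) over x2.
x1_star = 0  # either tied value works
x2_star = max((0, 1), key=lambda x2: P[(x1_star, x2)])
print(x1_star, x2_star)  # 0 0 - a true MAP assignment
```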
Local Optimality and MAP
We can also verify if an assignment is a MAP assignment.
An assignment 𝜉 is locally optimal if for each cluster 𝐶 the assignment to 𝐶 in
𝜉 maximizes its corresponding belief.
We will prove the following theorem:
Theorem. Let {𝛽𝑖 , 𝜇𝑖,𝑗} be the beliefs in a clique tree resulting from an execution
of max-product belief propagation. Then an assignment 𝜉∗ is locally optimal
relative to the beliefs {𝛽𝑖} if and only if it is the global MAP assignment.
Proof of Theorem
The “if” direction follows directly: a MAP assignment maximizes each max-marginal.
For the “only if” direction we will need the following lemma
Proof of Lemma
It is easy to see that (φ/ψ)(y*) = 1.
Consider an assignment y and let z = y⟨Z⟩.
Then either φ(y) = ψ(z) or φ(y) < ψ(z); hence (φ/ψ)(y) ≤ 1 = (φ/ψ)(y*).
Proof of Theorem (“only if” direction)
Select a root clique 𝑪𝑟.
For each clique 𝑖 ≠ 𝑟 let 𝜋 𝑖 be the parent of clique 𝑖 in the rooted tree.
We can rewrite the reparameterization equation as follows
By the lemma, 𝜉∗ optimizes each term in this product, therefore it is the MAP
assignment.
Reminder: Sum-Product Belief Propagation in Loopy Graphs
Applied sum-product message passing to loopy cluster graphs (same algorithm
with a slight modification).
We assumed a running intersection property for loopy clusters.
The algorithm is an approximation to the exact inference task
Convergence is not guaranteed.
At convergence, beliefs are calibrated but are not necessarily the marginals (they are pseudo-marginals).
Max-Product Belief Propagation in Loopy Graphs
Analogously to the clique tree case, we can derive a belief propagation algorithm for loopy cluster graphs.
The resulting beliefs will not generally be the max-marginals, but pseudo-max-marginals.
Decoding the Pseudo-Max-Marginals
An assignment that is locally optimal may not exist.
Example: a cluster graph with 3 clusters {A, B}, {B, C}, {A, C} and beliefs
Why do we want to look for locally optimal assignments?
A locally optimal assignment is guaranteed to have local-maximum properties (next slide).
How do we search for one if one exists?
This is a constraint satisfaction problem (NP hard). We can use CSP methods.
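Non-existence can be checked by brute force on the 3-cluster cycle. Since the belief table from the example is not reproduced here, the beliefs below are hypothetical ones in which every pairwise belief prefers its two variables to disagree, which is impossible simultaneously on a cycle of three binary variables:

```python
from itertools import product

# Hypothetical pseudo-max-marginals on clusters {A,B}, {B,C}, {A,C}:
# each pairwise belief scores disagreement higher than agreement.
beliefs = {('A', 'B'): {}, ('B', 'C'): {}, ('A', 'C'): {}}
for pair in beliefs:
    for u, v in product((0, 1), repeat=2):
        beliefs[pair][(u, v)] = 2.0 if u != v else 1.0

def locally_optimal(assignment):
    """True iff the assignment maximizes every cluster belief."""
    for (x, y), table in beliefs.items():
        if table[(assignment[x], assignment[y])] < max(table.values()):
            return False
    return True

hits = [dict(zip('ABC', xs)) for xs in product((0, 1), repeat=3)
        if locally_optimal(dict(zip('ABC', xs)))]
print(hits)  # [] - no assignment is locally optimal for all three clusters
```

This scan over all joint assignments is exactly the (here trivially small) constraint satisfaction problem mentioned above.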
MAP as a Linear Optimization Problem - Motivation
Use various optimization techniques.
Theoretical insights.
Connections to previous algorithms – more insights.
Preliminaries
Given a set of factors Φ = {φ_r : r ∈ R}, each with scope C_r.
We turn all products into summations by taking logs (assume the factors are positive).
Let n_r = |Val(C_r)|.
For each r ∈ R let {c_r^j : j = 1, …, n_r} be an enumeration of the different assignments to C_r.
Define coefficients η_r^j = log φ_r(c_r^j) for all r ∈ R, j = 1, …, n_r.
Define binary optimization variables {q(x_r^j) : r ∈ R, j = 1, …, n_r}, where q(x_r^j) = 1 if and only if the factor φ_r is assigned value c_r^j.
Let η and q be the corresponding vectors of dimension N = Σ_{r∈R} n_r.
We would like to maximize Σ_{r∈R} Σ_{j=1}^{n_r} η_r^j q(x_r^j) = ηᵀq.
Variables Example
Assume a pairwise MRF with three variables A, B, C and factors φ1(A, B), φ2(B, C), φ3(A, C).
Assume 𝐴, 𝐵 are binary valued and 𝐶 takes three values.
The optimization variables are
Values are enumerated lexicographically.
Integer Programming Formulation
For each factor, only a single assignment is selected.
Factors agree on their intersections.
There is a one-to-one mapping between assignments to q and legal assignments to the random variables.
Integer Programming Example
Returning to our previous example, the constraint that φ1(A, B) has a single assignment is Σ_{j=1}^{4} q(x_1^j) = 1.
The consistency constraints on φ1(A, B), φ2(B, C) are (the first for B = b1, the second for B = b2)
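Because legal assignments to q correspond one-to-one with assignments to the random variables, the integer program ηᵀq can be solved on this small example by scanning variable assignments. The factor values below are hypothetical:

```python
import math
from itertools import product

# The running example: binary A, B and ternary C with factors
# phi1(A,B), phi2(B,C), phi3(A,C). The factor values are invented.
phi1 = {(a, b): 1.0 + a + 2 * b for a, b in product((0, 1), (0, 1))}
phi2 = {(b, c): 1.0 + b * c for b, c in product((0, 1), (0, 1, 2))}
phi3 = {(a, c): 2.0 if a == (c % 2) else 1.0
        for a, c in product((0, 1), (0, 1, 2))}

# Each assignment to (A,B,C) corresponds to a 0/1 vector q that selects
# one entry per factor and is consistent on intersections, so eta^T q is
# just the sum of the selected log-factor values.
def objective(a, b, c):
    return (math.log(phi1[(a, b)]) + math.log(phi2[(b, c)])
            + math.log(phi3[(a, c)]))

best = max(product((0, 1), (0, 1), (0, 1, 2)), key=lambda x: objective(*x))
print(best)  # (0, 1, 2)
```

The LP relaxation on the next slide replaces this exhaustive scan with a polynomial-time solver over the relaxed q variables.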
Linear Programming Relaxation
Integer programming is NP-hard.
Relaxation: q_r^j ≥ 0 replaces q_r^j ∈ {0, 1} in the LP.
Solvable in polynomial time.
Solve the relaxed LP; if the solution is integral, you are done. Otherwise, try greedy approaches, randomized rounding, etc.
Incorporate additional constraints to the LP, perhaps getting a better
approximation.
Advanced techniques (like solving the dual).
Reminder: Log Linear Models
Log-linear model: P(x1, …, xn) = (1/Z) exp{−E(x1, …, xn)},
where E(x1, …, xn) = Σ_{i=1}^{k} w_i f_i(D_i) is the energy function.
In a MAP query our goal is to minimize the energy function.
We have an MRF with nodes 𝑋1, …𝑋𝑛 and set of edges ℰ.
We will consider the following variant, in which all variables are binary and the energy is E(x1, …, xn) = Σ_i ε_i(x_i) + Σ_{(i,j)∈ℰ} λ_{i,j} 1{x_i ≠ x_j}, with λ_{i,j} ≥ 0.
Inference Using Graph Cuts
Efficient solution for special classes of networks.
An example where sum-product inference and MAP inference have different
computational properties.
The Min Cut Problem
Let G = (V ∪ {s, t}, E) be a directed graph where each edge e ∈ E has a non-negative cost c(e).
A graph cut 𝒞 = (V_s, V_t) is a disjoint partition V ∪ {s, t} = V_s ∪ V_t such that s ∈ V_s and t ∈ V_t.
The cost of the cut is c(𝒞) = Σ_{v1∈V_s, v2∈V_t} c(v1, v2).
In the min-cut problem we wish to find a cut that achieves the minimal cost.
Can be solved in polynomial time.
The Reduction
We can assume WLOG that all the energy components are non-negative.
We construct the following graph G = (V ∪ {s, t}, E):
𝑉 contains a vertex for each random variable.
For each v_i ∈ V introduce an edge (v_i, t) with cost ε_i(0).
For each v_i ∈ V introduce an edge (s, v_i) with cost ε_i(1).
For each pair of variables X_i, X_j that are connected by an edge in the MRF, we introduce two edges (v_i, v_j), (v_j, v_i) with cost λ_{i,j} ≥ 0.
We map an assignment (x1, …, xn) to the cut 𝒞 = (V_s, V_t) such that x_i = 0 if and only if v_i ∈ V_s.
Correctness
Consider a cut 𝒞 = (V_s, V_t). If v_i ∈ V_s then x_i = 0, and we get a contribution of ε_i(0) to both the cost of the cut and the energy function.
The analogous argument holds when 𝑣𝑖 ∈ 𝑉𝑡.
The edge (v_i, v_j) contributes λ_{i,j} to the cost of the cut only if v_i and v_j are on opposite sides of the cut.
Conversely, the pair of random variables 𝑋𝑖 , 𝑋𝑗 contributes 𝜆𝑖,𝑗 to the energy
function only if 𝑋𝑖 ≠ 𝑋𝑗.
Hence, the cost of the cut is precisely the same as the energy of the
corresponding assignment.
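The reduction and the correctness argument can be sketched end to end: build the s-t graph, compute a min cut via max-flow (Edmonds-Karp here), and read the labeling off the source side. The unary energies `eps` and pairwise weights `lam` below are hypothetical.

```python
from collections import deque, defaultdict

# Hypothetical energies: eps[i] = (eps_i(0), eps_i(1)), lam[(i, j)] >= 0.
eps = {0: (1.0, 3.0), 1: (4.0, 1.0)}
lam = {(0, 1): 0.5}

def min_cut_assignment(eps, lam):
    """Build the s-t graph of the reduction and return the MAP labeling.

    Edge v_i -> t costs eps_i(0), edge s -> v_i costs eps_i(1); each MRF
    edge becomes two directed edges of cost lam_ij. The min s-t cut is
    found via Edmonds-Karp max-flow; x_i = 0 iff v_i ends on the s side."""
    s, t = 's', 't'
    cap = defaultdict(float)
    for i, (e0, e1) in eps.items():
        cap[(i, t)] += e0
        cap[(s, i)] += e1
    for (i, j), w in lam.items():
        cap[(i, j)] += w
        cap[(j, i)] += w
    adj = defaultdict(set)
    for (u, v) in list(cap):
        adj[u].add(v); adj[v].add(u)  # include reverse edges for residuals

    def bfs_path():
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 1e-12:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()) is not None:
        # Collect the augmenting path, find its bottleneck, push flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v)); v = parent[v]
        f = min(cap[e] for e in path)
        for (u, v) in path:
            cap[(u, v)] -= f
            cap[(v, u)] += f

    # Source side of the residual graph = V_s; those variables get x_i = 0.
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen and cap[(u, v)] > 1e-12:
                seen.add(v); q.append(v)
    return {i: 0 if i in seen else 1 for i in eps}

print(min_cut_assignment(eps, lam))  # {0: 0, 1: 1}
```

For these energies the minimum-energy assignment is x0 = 0, x1 = 1 (energy 1 + 1 + 0.5 = 2.5), which matches the cut of cost 2.5 the algorithm returns.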