MAP Inference

Alon Brutzkus

Inference in Graphical Models

Given a Bayesian network or Markov network – a data structure that represents a joint distribution compactly, in a factorized way.

We would like to answer queries using the distribution as our model.

Focused on the conditional probability query $P(Y \mid E = e)$.

We’ve seen algorithms for exact and approximate inference.

Maximum a Posteriori (MAP)

Evidence: $E = e$.

Query variables: $Y = \mathcal{X} - E$ (where $\mathcal{X} = \{X_1, \dots, X_n\}$).

Task: compute $\mathrm{MAP}(Y \mid E = e) = \arg\max_{y} P(Y = y \mid E = e)$.

Applications:

Message Decoding: most likely transmitted message.

Image segmentation: most likely segmentation.

MAP ≠ Max over Marginals

Two-node chain over variables $A$ and $B$, with CPDs (tables shown on the slide):

$\arg\max_{a,b} P(a, b) = (a^0, b^1)$, but

$\arg\max_{a} P(a) = a^1$.
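The CPD tables are not reproduced in this transcript; the following check uses, for concreteness, the numbers from the variable-elimination example later in the deck (the 0.45/0.55 split of $P(B \mid a^1)$ is an assumption). It confirms that maximizing the joint and maximizing each marginal separately give different answers.

```python
import numpy as np

# A -> B chain. CPD values assumed from the later VE example:
# P(a0)=0.4, P(a1)=0.6, P(B|a0)=(0.1, 0.9), and max_b P(b|a1)=0.55.
P_A = np.array([0.4, 0.6])                  # P(A)
P_B_given_A = np.array([[0.1, 0.9],         # P(B | a0)
                        [0.45, 0.55]])      # P(B | a1), split assumed

joint = P_A[:, None] * P_B_given_A          # P(A, B), shape (2, 2)

a_map, b_map = np.unravel_index(joint.argmax(), joint.shape)
print("argmax_{a,b} P(a,b):", (f"a{a_map}", f"b{b_map}"), joint.max())   # (a0, b1), 0.36

# Maximizing each marginal separately gives a different answer for A:
print("argmax_a P(a):", f"a{joint.sum(axis=1).argmax()}")                # a1
print("argmax_b P(b):", f"b{joint.sum(axis=0).argmax()}")
```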

Marginal MAP Query

General form of MAP query.

Now we allow $Y \subseteq \mathcal{X} - E$.

Task: Let $W = \mathcal{X} - Y - E$, compute

$\mathrm{MAP}(Y \mid E = e) := \arg\max_{y} P(Y = y \mid E = e) = \arg\max_{y} \sum_{w} P(Y = y, W = w \mid E = e)$

Involves both maximization and summation – seems harder than maximization or summation alone.
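A brute-force sketch of the definition on a made-up three-variable joint table (purely illustrative, not a model from the slides): sum out $W$, then maximize over $Y$.

```python
import numpy as np

# Hypothetical joint P(A, B, C), normalized; query Y = {A}, sum out W = {B, C}.
rng = np.random.default_rng(0)
P = rng.random((2, 3, 2))
P /= P.sum()

P_A = P.sum(axis=(1, 2))            # sum_w P(A = a, W = w)
a_mmap = P_A.argmax()               # marginal MAP for A
print("marginal MAP:", a_mmap, "with probability", P_A[a_mmap])

# Contrast with the joint MAP assignment, which maximizes without summing.
print("joint MAP:", np.unravel_index(P.argmax(), P.shape))
```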

Computational Complexity

The following decision problem is NP-complete:

Given a Bayesian network $\mathcal{B} = (G, P)$ and a probability $p$, decide if there is a joint assignment $x$ such that $P(x) > p$.

The following decision problem for marginal MAP is complete for $\mathrm{NP}^{\mathrm{PP}}$:

Given a Bayesian network $\mathcal{B} = (G, P)$, a subset of random variables $Y \subseteq \mathcal{X}$, and a probability $p$, decide if there is an assignment $y$ to $Y$ such that $P(y) > p$.

Overview: Algorithms for MAP

Variable Elimination.

Message Passing (Belief Propagation).

Integer programming methods.

For some networks: graph-cut methods.

Combinatorial Search (will not be discussed).

Preliminaries

We consider a distribution $P_\Phi(\mathcal{X})$ defined via a set of factors $\Phi$ and its unnormalized density $\tilde{P}_\Phi$.

Then equivalently we need to compute

$x^{\mathrm{map}} = \arg\max_x P_\Phi(x) = \arg\max_x \frac{1}{Z}\tilde{P}_\Phi(x) = \arg\max_x \tilde{P}_\Phi(x)$

Given evidence, we aim to maximize $\tilde{P}_\Phi(Y, e)$ (we consider the unnormalized probability with the factors reduced by $E = e$).

Similarly for marginal MAP.

Max Marginal

Let $Y$ be a set of variables and $\xi$ an assignment. We denote by $\xi\langle Y \rangle$ the values that $\xi$ assigns to the variables in $Y$.

The max-marginal of a function $f$ relative to a set of variables $Y$ is

$\mathrm{MaxMarg}_f(y) = \max_{\xi\langle Y\rangle = y} f(\xi)$

for any assignment $y \in \mathrm{Val}(Y)$.

For example, $\mathrm{MaxMarg}_P(a) = \max_b P(a, b)$ is a max-marginal of $P$ over $A$.
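In table form, the max-marginal is a row-wise maximum rather than a row-wise sum. The sketch below uses the two-variable joint implied by the next slide's example (with an assumed split of $P(B \mid a^1)$).

```python
import numpy as np

# Joint table P(A, B); rows index A, columns index B.
P = np.array([[0.04, 0.36],     # P(a0, b0), P(a0, b1)
              [0.27, 0.33]])    # P(a1, b0), P(a1, b1)

max_marg_A = P.max(axis=1)      # MaxMarg_P(a) = max_b P(a, b)  -> [0.36, 0.33]
sum_marg_A = P.sum(axis=1)      # ordinary marginal P(a)        -> [0.40, 0.60]
print(max_marg_A, sum_marg_A)
```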

Variable Elimination: a Simple Example.

Consider the Bayesian network $A \to B$ with CPDs $P(A)$ and $P(B \mid A)$ (tables shown on the slide).

We want to compute $\max_{a,b} P(a, b)$ and $\arg\max_{a,b} P(a, b)$.

We compute

$\max_{a,b} P(a, b) = \max_a \max_b P(a) P(b \mid a) = \max_a \{P(a) \max_b P(b \mid a)\} = \max\{0.4 \cdot 0.9,\; 0.6 \cdot 0.55\} = 0.36$

We now want to find the MAP assignment $(a^*, b^*)$.

First, $a^* = \arg\max_a \{P(a) \max_b P(b \mid a)\} = a^0$ (where $P(a^0) = 0.4$). (Why?)

Then, given $a^*$, we can calculate $b^* = \arg\max_b P(b \mid a^*) = b^1$ (where $P(b^1 \mid a^0) = 0.9$). (Why?)
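The same computation written out numerically (the CPD entries are read off the numbers above; the exact split of $P(B \mid a^1)$ into 0.45/0.55 is an assumption that does not affect the result):

```python
import numpy as np

P_A = np.array([0.4, 0.6])
P_B_given_A = np.array([[0.1, 0.9],
                        [0.45, 0.55]])

# Eliminate B first: tau(a) = max_b P(b | a), remembering the argmax for traceback.
tau = P_B_given_A.max(axis=1)            # [0.9, 0.55]
best_b_for_a = P_B_given_A.argmax(axis=1)

# Eliminate A: max_a P(a) * tau(a).
scores = P_A * tau                       # [0.36, 0.33]
a_star = scores.argmax()                 # a0
b_star = best_b_for_a[a_star]            # b1
print(a_star, b_star, scores[a_star])    # 0 1 0.36
```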

Variable Elimination

We have seen two procedures

Variable elimination.

Tracing back to get a joint assignment.

Variable elimination was possible due to the general fact:

If $X \notin \mathrm{Scope}[\phi_1]$ then $\max_X (\phi_1 \cdot \phi_2) = \phi_1 \cdot \max_X \phi_2$

Maximization as Operation on Factors
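The factor table for this slide is not in the transcript; as a substitute, here is a minimal sketch (with made-up factor values) of the operation itself: maximizing a variable out of a factor, while recording the argmax for the later traceback.

```python
import numpy as np

# A made-up factor over (X, Y); rows index X, columns index Y.
phi = np.array([[0.5, 2.0, 1.0],    # phi(x0, y0), phi(x0, y1), phi(x0, y2)
                [3.0, 0.2, 0.7]])   # phi(x1, *)

tau = phi.max(axis=0)          # new factor over Y: max_x phi(x, y) -> [3.0, 2.0, 1.0]
argmax_x = phi.argmax(axis=0)  # which x achieved each max, kept for traceback -> [1, 0, 0]
print(tau, argmax_x)
```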

Max-product Variable Elimination

Analogous to sum-product variable elimination.

Differences:

Max replaces Sum.

Traceback procedure.

No query variables – all variables are eliminated.

Finding the Most Probable Assignment

The $\phi_{X_i}$'s are the intermediate factors – the last one is the max-marginal over the final uneliminated variable.

The returned assignment is the MAP assignment.

VE Example

First elimination:

$\max_{i,d,g,l} \phi_I(i)\,\phi_G(g, i, d)\,\phi_D(d)\,\phi_L(l, g)\,\max_s \phi_S(i, s)$

Second elimination:

$\max_{d,g,l} \phi_D(d)\,\phi_L(l, g)\,\max_i \phi_I(i)\,\phi_G(g, i, d)\,\tau_1(i)$

And so on…

VE Example Continued

Now we trace back:

$g^* = \arg\max_g \psi_5(g)$

$l^* = \arg\max_l \psi_4(l, g^*)$

$d^* = \arg\max_d \psi_3(g^*, d)$

$i^* = \arg\max_i \psi_2(g^*, i, d^*)$

$s^* = \arg\max_s \psi_1(i^*, s)$

$(g^*, l^*, d^*, i^*, s^*)$ is the MAP assignment.

$\tau_5(\emptyset)$ is its probability.
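The student-network $\psi$ factors are not reproduced here, but the traceback pattern can be sketched on a smaller chain (the pairwise factors below are made up): each elimination step stores the maximizing value of the eliminated variable, and the trace walks back in reverse elimination order.

```python
import numpy as np

# Chain A - B - C with made-up pairwise factors.
phi_AB = np.array([[1.0, 3.0],
                   [2.0, 0.5]])        # phi_AB[a, b]
phi_BC = np.array([[0.2, 4.0],
                   [1.5, 1.0]])        # phi_BC[b, c]

# Eliminate C: tau1(b) = max_c phi_BC(b, c), remembering the argmax.
best_c = phi_BC.argmax(axis=1)
tau1 = phi_BC.max(axis=1)

# Eliminate B: psi2(a, b) = phi_AB(a, b) * tau1(b).
psi2 = phi_AB * tau1[None, :]
best_b = psi2.argmax(axis=1)
tau2 = psi2.max(axis=1)

# Traceback in reverse elimination order.
a_star = tau2.argmax()
b_star = best_b[a_star]
c_star = best_c[b_star]
print((a_star, b_star, c_star), "unnormalized value:", tau2[a_star])
```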

Complexity of Max-Product VE

The complexity analysis of sum-product VE applies unchanged.

Traceback – no extra asymptotic expense

Linear traversal over the intermediate factors.

VE for Marginal MAP

Recall the marginal MAP problem

$\mathrm{MAP}(Y \mid E = e) = \arg\max_{y} P(Y = y \mid E = e) = \arg\max_{y} \sum_{w} P(Y = y, W = w \mid E = e)$

Suggests a similar VE algorithm.

Example of VE for Marginal MAP

We want to compute

We perform the following operations:

Marginal MAP VE – a Bad Example

We wish to compute

Must perform all summations before maximizations

Generally, $\max_a \sum_b P(a, b) \le \sum_b \max_a P(a, b)$ (not necessarily equal).

After summing out the $X_i$'s we are left with a factor that is exponential in $n$.

In contrast, the (plain) MAP query on this network can be performed in linear time.
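A quick numeric sanity check of the inequality above on a random table (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 5))
P /= P.sum()                        # joint table P(A, B)

max_of_sum = P.sum(axis=1).max()    # max_a sum_b P(a, b)
sum_of_max = P.max(axis=0).sum()    # sum_b max_a P(a, b)
print(max_of_sum, "<=", sum_of_max, max_of_sum <= sum_of_max)
```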

Reminder: Sum-Product Belief Propagation Example

Reminder: Sum-Product Belief Propagation Algorithm

Reminder: Sum-Product Belief Propagation Algorithm Properties

Each belief is the marginal of the cluster:

All pairs of adjacent cliques are calibrated

Reparameterization of the distribution

Max-Product Belief Propagation

Max-Product Belief Propagation Properties

Each belief is the max-marginal of the cluster

All pairs of adjacent cliques are max-calibrated

Reparameterization of the distribution

Proofs are exactly the same (replace summation with maximization).
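A minimal sketch of a single max-product message and the resulting belief, on a two-clique tree with made-up potentials (not the example from the slides):

```python
import numpy as np

# Cliques C1 = {A, B} and C2 = {B, C} with sepset {B}.
psi1 = np.array([[1.0, 3.0],
                 [2.0, 0.5]])       # psi1[a, b]
psi2 = np.array([[0.2, 4.0],
                 [1.5, 1.0]])       # psi2[b, c]

# Message C1 -> C2: maximize A out of psi1 (sum-product would sum instead).
delta_12 = psi1.max(axis=0)         # delta_{1->2}(b)

# Belief at C2 after receiving the message: beta2(b, c) = psi2(b, c) * delta_12(b).
beta2 = psi2 * delta_12[:, None]
print("beta2 =", beta2)

# The max-marginal of B from beta2 matches maximizing the full product over A and C.
full = np.einsum('ab,bc->abc', psi1, psi2)
print(beta2.max(axis=1), full.max(axis=(0, 2)))
```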

Max-Product Belief Propagation Example

Markov network and clique tree (shown on the slide): cliques $\{A, B, D\}$ and $\{B, C, D\}$ with sepset $\{B, D\}$.

Max-Product Belief Propagation Example

Max marginals after running Belief Propagation on the clique tree

Decoding Max Marginals

Can we separately choose each value for a random variable based on local beliefs?

No! Example: the XOR-like distribution with $P(x_1 = x_2 = 0) = P(x_1 = x_2 = 1) = 0.4$ and $P(x_1 = 0, x_2 = 1) = P(x_1 = 1, x_2 = 0) = 0.1$.

This will work if there is a unique MAP assignment. Equivalently, each max-marginal over $X_i$ has a unique maximal value.

In case of ties, we can introduce a slight random perturbation.

Otherwise, we can use traceback on the factors that correspond to variable eliminations.
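A small check of the XOR example: the max-marginals are tied, so independent per-variable decoding can produce an assignment of probability 0.1 instead of 0.4.

```python
import numpy as np

# The XOR-like example: both "agree" assignments are tied MAP assignments.
P = np.array([[0.4, 0.1],    # P(x1=0, x2=0), P(x1=0, x2=1)
              [0.1, 0.4]])   # P(x1=1, x2=0), P(x1=1, x2=1)

# Max-marginals are tied for both variables ...
print(P.max(axis=1), P.max(axis=0))   # [0.4 0.4] [0.4 0.4]
# ... so decoding each variable separately can mix incompatible maximizers,
# e.g. (x1=0, x2=1), which has probability only 0.1.
print(P[0, 1])
```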

Local Optimality and MAP

We can also verify if an assignment is a MAP assignment.

An assignment $\xi$ is locally optimal if, for each cluster $C$, the assignment to $C$ in $\xi$ maximizes its corresponding belief.

We will prove the following theorem:

Theorem. Let $\{\beta_i, \mu_{i,j}\}$ be the beliefs in a clique tree resulting from an execution of max-product belief propagation. Then an assignment $\xi^*$ is locally optimal relative to the beliefs $\{\beta_i\}$ if and only if it is the global MAP assignment.

Proof of Theorem

The “if” direction follows directly – a MAP assignment maximizes each max-marginal.

For the “only if” direction we will need the following lemma

Proof of Lemma

It is easy to see that $\frac{\phi(y^*)}{\psi(y^*\langle Z\rangle)} = 1$.

Consider an assignment $y$ and let $z = y\langle Z\rangle$.

Then either $\phi(y) = \psi(z)$ or $\phi(y) < \psi(z)$; hence $\frac{\phi(y)}{\psi(z)} \le 1 = \frac{\phi(y^*)}{\psi(y^*\langle Z\rangle)}$.

Proof of Theorem (“only if” direction)

Select a root clique $C_r$.

For each clique $i \neq r$, let $\pi(i)$ be the parent of clique $i$ in the rooted tree.

We can rewrite the reparameterization equation as follows:

By the lemma, $\xi^*$ optimizes each term in this product, therefore it is the MAP assignment.


Reminder: Sum-Product Belief Propagation in Loopy Graphs

Applied sum-product message passing to loopy cluster graphs (same algorithm with a slight modification).

We assumed the running intersection property for the loopy cluster graph.

The algorithm is an approximation to the exact inference task.

Convergence is not guaranteed.

At convergence the beliefs are calibrated, but they are not necessarily the true marginals (they are pseudo-marginals).

Sum-Product Belief Propagation in Loopy Graphs Algorithm

Max-Product Belief Propagation in Loopy Graphs

Analogously to the clique tree case, we can derive a belief propagation algorithm for loopy cluster graphs.

The resulting beliefs will not generally be the max-marginals, but pseudo-max-marginals.

Decoding the Pseudo-Max-Marginals

An assignment that is locally optimal may not exist.

Example: a cluster graph with 3 clusters $\{A, B\}$, $\{B, C\}$, $\{A, C\}$ and the beliefs shown on the slide.

Why do we want to look for locally optimal assignments?

A locally optimal assignment is guaranteed to have local maximum properties – see the next slide.

How do we search for one if one exists?

This is a constraint satisfaction problem (NP-hard). We can use CSP methods.
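The belief tables from the slide are not reproduced here, but the phenomenon is easy to sketch with made-up beliefs: if every pairwise belief strictly prefers its two variables to disagree, no assignment over a triangle can be locally optimal.

```python
import itertools
import numpy as np

# Hypothetical pseudo-max-marginals for clusters {A,B}, {B,C}, {A,C}: value 2
# for the disagreeing assignments, value 1 for the agreeing ones.
belief = np.array([[1.0, 2.0],
                   [2.0, 1.0]])
clusters = {('A', 'B'): belief, ('B', 'C'): belief, ('A', 'C'): belief}

def locally_optimal(assignment):
    """True if the assignment maximizes every cluster belief."""
    return all(b[assignment[u], assignment[v]] == b.max()
               for (u, v), b in clusters.items())

candidates = [dict(zip('ABC', vals)) for vals in itertools.product([0, 1], repeat=3)]
print(any(locally_optimal(x) for x in candidates))   # False: no locally optimal assignment
```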

Strong Local Maximum

Definition of induced subgraph

Strong Local Maximum

Theorem on local optimality (proof not given)

MAP as a Linear Optimization Problem - Motivation

Use various optimization techniques.

Theoretical insights.

Connections to previous algorithms – more insights.

Preliminaries

Given a set of factors $\Phi = \{\phi_r : r \in R\}$, each with scope $C_r$.

We turn all products into summations by taking logs (we assume the factors are positive).

Let $n_r = |\mathrm{Val}(C_r)|$.

For each $r \in R$, let $\{c_r^j : j = 1, \dots, n_r\}$ be an enumeration of the different assignments to the scope of $\phi_r$.

Define coefficients $\eta_r^j = \log \phi_r(c_r^j)$ for all $r \in R$, $j = 1, \dots, n_r$.

Define binary optimization variables $\{q(x_r^j) : r \in R,\ j = 1, \dots, n_r\}$.

$q(x_r^j) = 1$ if and only if the factor $\phi_r$ is assigned the value $c_r^j$.

Let $\eta$ and $q$ be the corresponding vectors of dimension $N = \sum_{r \in R} n_r$.

We would like to maximize $\sum_{r \in R} \sum_{j=1}^{n_r} \eta_r^j\, q(x_r^j) = \eta^T q$.

Variables Example

Assume a pairwise MRF with three variables $A, B, C$ and factors $\phi_1(A, B)$, $\phi_2(B, C)$, $\phi_3(A, C)$.

Assume $A, B$ are binary valued and $C$ takes three values.

The optimization variables are $q(x_1^1), \dots, q(x_1^4)$ for $\phi_1(A, B)$, $q(x_2^1), \dots, q(x_2^6)$ for $\phi_2(B, C)$, and $q(x_3^1), \dots, q(x_3^6)$ for $\phi_3(A, C)$ – $N = 16$ in total.

Values are enumerated lexicographically.

Integer Programming Formulation

For each factor, only a single assignment is selected.

Factors agree on their intersection.

There is a one-to-one mapping between assignments to $q$ and legal assignments to the random variables.

Integer Programming Example

Returning to our previous example, the constraint that $\phi_1(A, B)$ has a single assignment is $\sum_{j=1}^{4} q(x_1^j) = 1$.

The consistency constraints on $\phi_1(A, B)$, $\phi_2(B, C)$ are (the first for $B = b^1$, the second for $B = b^2$):

$q(x_1^1) + q(x_1^3) = q(x_2^1) + q(x_2^2) + q(x_2^3)$

$q(x_1^2) + q(x_1^4) = q(x_2^4) + q(x_2^5) + q(x_2^6)$

Linear Programming Relaxation

Integer programming is NP-hard.

Relaxation - $q(x_r^j) \ge 0$ replaces $q(x_r^j) \in \{0, 1\}$ in the LP.

Solvable in polynomial time.

Solve the relaxed LP; if the solution is integer, you are done. Otherwise, try greedy approaches, randomized rounding, etc.

Incorporate additional constraints into the LP, perhaps getting a better approximation.

Advanced techniques (like solving the dual).
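A minimal end-to-end sketch of the relaxation for the triangle example above, solved with scipy.optimize.linprog (the factor values are random placeholders, and the flattening/ordering of assignments is an implementation choice, not the slides' enumeration):

```python
import numpy as np
from scipy.optimize import linprog

# Factors phi1(A,B), phi2(B,C), phi3(A,C) with A, B binary and C ternary.
rng = np.random.default_rng(0)
phi1, phi2, phi3 = rng.random((2, 2)) + 0.1, rng.random((2, 3)) + 0.1, rng.random((2, 3)) + 0.1
shapes = [phi1.shape, phi2.shape, phi3.shape]

offsets = np.cumsum([0] + [s[0] * s[1] for s in shapes])    # q = (q1, q2, q3), N = 16
N = offsets[-1]
eta = np.concatenate([np.log(phi).ravel() for phi in (phi1, phi2, phi3)])

def selector(r, mask):
    """Row selecting the entries of factor r whose assignments match `mask`."""
    row = np.zeros(N)
    row[offsets[r]:offsets[r + 1]] = mask.ravel()
    return row

A_eq, b_eq = [], []
for r, s in enumerate(shapes):               # each factor selects one assignment
    A_eq.append(selector(r, np.ones(s)))
    b_eq.append(1.0)

def agree(r1, mask1, r2, mask2):             # consistency on a shared variable value
    A_eq.append(selector(r1, mask1) - selector(r2, mask2))
    b_eq.append(0.0)

for b in range(2):                           # phi1 and phi2 agree on B
    m1 = np.zeros((2, 2)); m1[:, b] = 1
    m2 = np.zeros((2, 3)); m2[b, :] = 1
    agree(0, m1, 1, m2)
for c in range(3):                           # phi2 and phi3 agree on C
    m2 = np.zeros((2, 3)); m2[:, c] = 1
    m3 = np.zeros((2, 3)); m3[:, c] = 1
    agree(1, m2, 2, m3)
for a in range(2):                           # phi1 and phi3 agree on A
    m1 = np.zeros((2, 2)); m1[a, :] = 1
    m3 = np.zeros((2, 3)); m3[a, :] = 1
    agree(0, m1, 2, m3)

# linprog minimizes, so negate eta; the relaxation allows fractional q in [0, 1].
res = linprog(-eta, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print("LP value:", -res.fun, "integral:", np.allclose(res.x, np.round(res.x)))
```

If the reported solution is integral it decodes directly into a MAP assignment; otherwise rounding or the tightening techniques listed above would be applied.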

Linear Programming Formulation

Reminder: Log Linear Models

Log-linear model: $P(x_1, \dots, x_n) \propto \exp(-E(x_1, \dots, x_n))$,

where $E(x_1, \dots, x_n) = \sum_{i=1}^{k} w_i f_i(D_i)$ is the energy function.

In a MAP query our goal is to minimize the energy function.

We have an MRF with nodes $X_1, \dots, X_n$ and a set of edges $\mathcal{E}$.

We will consider the following variant (all variables are binary):

$E(x_1, \dots, x_n) = \sum_{i} \epsilon_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \lambda_{i,j}\, \mathbf{1}\{x_i \neq x_j\}$, with $\lambda_{i,j} \ge 0$.

Inference Using Graph Cuts

Efficient solution for special classes of networks.

An example where sum-product inference and MAP inference have different computational properties.

The Min Cut Problem

Let $G = (V \cup \{s, t\}, E)$ be a directed graph where each edge $e \in E$ has a non-negative cost $c(e)$.

A graph cut $\mathcal{C} = (V_s, V_t)$ is a disjoint partition of $V \cup \{s, t\} = V_s \cup V_t$ such that $s \in V_s$ and $t \in V_t$.

The cost of the cut is $c(\mathcal{C}) = \sum_{v_1 \in V_s,\, v_2 \in V_t} c(v_1, v_2)$.

In the min-cut problem we wish to find the cut that achieves the minimal cost.

Can be solved in polynomial time.

The Reduction

We can assume WLOG that all the energy components are non-negative.

We construct the following graph $G = (V \cup \{s, t\}, E)$:

$V$ contains a vertex $v_i$ for each random variable $X_i$.

For each $v_i \in V$, introduce an edge $(v_i, t)$ with cost $\epsilon_i(0)$.

For each $v_i \in V$, introduce an edge $(s, v_i)$ with cost $\epsilon_i(1)$.

For each pair of variables $X_i, X_j$ that are connected by an edge in the MRF, we introduce two edges $(v_i, v_j)$ and $(v_j, v_i)$, each with cost $\lambda_{i,j} \ge 0$.

We map an assignment $(x_1, \dots, x_n)$ to the cut $\mathcal{C} = (V_s, V_t)$ such that $x_i = 0$ if and only if $v_i \in V_s$.
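A sketch of the reduction on a tiny made-up binary MRF, using networkx's minimum_cut (the $\epsilon$ and $\lambda$ values are placeholders, not the slides' example); the minimum cut value matches the minimum energy found by brute force.

```python
import itertools
import networkx as nx

eps = {1: (0.0, 2.0), 2: (1.0, 0.0), 3: (3.0, 1.0)}   # eps[i] = (eps_i(0), eps_i(1))
lam = {(1, 2): 1.5, (2, 3): 0.5}                      # pairwise penalties, lambda >= 0

def energy(x):
    return (sum(eps[i][x[i]] for i in eps) +
            sum(l for (i, j), l in lam.items() if x[i] != x[j]))

G = nx.DiGraph()
for i, (e0, e1) in eps.items():
    G.add_edge(i, 't', capacity=e0)        # cutting (v_i, t) costs eps_i(0)
    G.add_edge('s', i, capacity=e1)        # cutting (s, v_i) costs eps_i(1)
for (i, j), l in lam.items():
    G.add_edge(i, j, capacity=l)
    G.add_edge(j, i, capacity=l)

cut_value, (source_side, _) = nx.minimum_cut(G, 's', 't')
x_cut = {i: 0 if i in source_side else 1 for i in eps}   # x_i = 0 iff v_i on the s side

# Check against brute force over all 2^n assignments.
best = min((dict(zip(eps, vals)) for vals in itertools.product([0, 1], repeat=len(eps))),
           key=energy)
print(cut_value, x_cut, energy(x_cut), energy(best))
```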

Correctness

Consider a cut $\mathcal{C} = (V_s, V_t)$. If $v_i \in V_s$ then $x_i = 0$, and we get a contribution of $\epsilon_i(0)$ both to the cost of the cut and to the energy function.

The analogous argument holds when $v_i \in V_t$.

The edge $(v_i, v_j)$ contributes $\lambda_{i,j}$ to the cost of the cut only if $v_i$ and $v_j$ are on opposite sides of the cut.

Conversely, the pair of random variables $X_i, X_j$ contributes $\lambda_{i,j}$ to the energy function only if $X_i \neq X_j$.

Hence, the cost of the cut is precisely the same as the energy of the corresponding assignment.

Example

Markov network and energy components (only non-zero displayed)

Reduction graph

(The network has nodes $X_1, X_2, X_3, X_4$; figures shown on the slide.)