Approximate Inference


Page 1: Approximate Inference


Approximate Inference

Slides by Nir Friedman

Page 2: Approximate Inference

When can we hope to approximate?

Two situations:

Highly stochastic distributions: “far” evidence is discarded.

“Peaked” distributions: improbable values are ignored.

Page 3: Approximate Inference

Stochasticity & Approximations

Consider a chain:

P(Xi+1 = t | Xi = t) = 1 − ε,  P(Xi+1 = f | Xi = f) = 1 − ε

X1 → X2 → X3 → … → Xn+1

Computing the probability of Xn+1 given X1, we get

Even # of flips:
P(Xn+1 = t | X1 = t) = Σ_{even k, 0 ≤ k ≤ n} C(n, k) ε^k (1 − ε)^(n − k) = (1 + (1 − 2ε)^n) / 2

Odd # of flips:
P(Xn+1 = f | X1 = t) = Σ_{odd k, 0 ≤ k ≤ n} C(n, k) ε^k (1 − ε)^(n − k) = (1 − (1 − 2ε)^n) / 2
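A minimal Python sketch (not part of the slides) that checks the even-flip sum against the closed form (1 + (1 − 2ε)^n)/2 for a few values of n and ε:

```python
from math import comb

def p_same(n, eps):
    """P(Xn+1 = t | X1 = t): sum over an even number k of flips in n steps."""
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(0, n + 1, 2))

for n in (5, 10, 20):
    for eps in (0.1, 0.3, 0.5):
        closed = (1 + (1 - 2 * eps)**n) / 2
        assert abs(p_same(n, eps) - closed) < 1e-12
        print(f"n={n:2d} eps={eps:.2f} P={closed:.4f}")  # tends to 0.5 as n grows
```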

Page 4: Approximate Inference

Plot of P(Xn = t | X1 = t)

[Plot: curves of P(Xn = t | X1 = t) versus ε from 0 to 0.5, for n = 5, 10, 20; all curves fall from 1 toward 0.5, faster for larger n.]

Page 5: Approximate Inference

Stochastic Processes

This behavior of the chain (a Markov process) is called mixing.

General Bayesian networks show similar behavior: if the probabilities are far from 0 and 1, then the effect of “far” evidence vanishes (and so it can be discarded in approximations).

Page 6: Approximate Inference

“Peaked” distributions

If the distribution is “peaked”, then most of the mass is on a few instances.

If we can focus on these instances, we can ignore the rest.

[Bar chart of probability (y-axis, 0 to 0.16) over instances (x-axis), illustrating a peaked distribution.]

Page 7: Approximate Inference

Global conditioning

[Network diagram with nodes A, B, C, D, E, I, J, K, L, M.]

Fixing the values of A and B:

P(m) = Σ_a Σ_b Σ_c Σ_d … Σ_l P(a, b, c, d, …, l, m)

Fixing values at the beginning of the summation can decrease the size of the tables formed by variable elimination; in this way space is traded for time. Special case: choose to fix a set of nodes that “break all loops”. This method is called cutset conditioning.

[The same network with A and B instantiated: the remaining nodes C, D, E, I, J, K, L, M form a simpler network, and the computation is repeated for each assignment (a, b).]

Page 8: Approximate Inference

Bounded conditioning


Fixing the values of A and B:

By examining only the probable assignments of A and B, we perform several simple computations instead of one complex one.

Page 9: Approximate Inference

Bounded conditioning

Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.

Search for highly probable assignments to A, B:
Option 1: select a, b with high P(a, b).
Option 2: select a, b with high P(a, b | e).

We need to search for such high-mass assignments, and that can be hard.

P(Y = y, e) ≈ Σ over probable (a, b) of P(Y = y, e | a, b) · P(a, b)
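A minimal Python sketch of this estimate (an illustration, not from the slides); `probable_pairs`, `joint_given`, and `prior` are assumed helpers: the probable assignments (a, b), an exact routine for P(Y = y, e | a, b) (easy once the cutset is fixed), and P(a, b).

```python
def bounded_conditioning(probable_pairs, joint_given, prior):
    """Approximate P(Y = y, e) by summing only over the probable (a, b) pairs."""
    return sum(joint_given(a, b) * prior(a, b) for a, b in probable_pairs)

def unexamined_mass(probable_pairs, prior):
    """Prior mass not examined; it bounds the error of the estimate (next slide)."""
    return 1.0 - sum(prior(a, b) for a, b in probable_pairs)
```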

Page 10: Approximate Inference

Bounded Conditioning

Advantages:
Combines exact inference within an approximation.
Continuous: more time can be used to examine more cases.
Bounds: the unexamined mass is used to compute error bars.

Possible problems:
P(a, b) is prior mass, not posterior mass. If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.

Unexamined mass: 1 − Σ over probable (a, b) of P(a, b)

Page 11: Approximate Inference

Network Simplifications

In these approaches, we try to replace the original network with a simpler one, such that the resulting network allows fast exact methods.

Page 12: Approximate Inference

Network Simplifications

Typical simplifications:
Remove parts of the network.
Remove edges.
Reduce the number of values (value abstraction).
Replace a sub-network with a simpler one (model abstraction).

These simplifications are often made with respect to the particular evidence and query.

Page 13: Approximate Inference

Stochastic Simulation

Suppose our goal is to compute the likelihood of the evidence, P(e), where e is an assignment to some variables in {X1,…,Xn}.

Assume that we can sample instances <x1,…,xn> according to the distribution P(x1,…,xn).

What is then the probability that a random sample <x1,…,xn> satisfies e?

Answer: simply P(e) which is what we wish to compute.

Each sample simulates the tossing of a biased coin with probability P(e) of “Heads”.

Page 14: Approximate Inference

Stochastic Sampling

Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate

P(e) ≈ #Heads / N = (1/N) Σ_{i=1..N} P(e | x[i])     (each term P(e | x[i]) is zero or one)

The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.

How many samples do we need to get a reliable estimation?

We will not discuss this issue here.
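A small Python sketch of this coin-flip view (an assumed example, not from the slides): `draw_instance` stands for sampling <x1,…,xn> from P, and `satisfies` checks whether a sample agrees with e.

```python
import random

def estimate_evidence(draw_instance, satisfies, n_samples=10_000):
    """Estimate P(e) as the fraction of samples ("heads") that satisfy e."""
    heads = sum(satisfies(draw_instance()) for _ in range(n_samples))
    return heads / n_samples

# Sanity check with a coin of known bias P(e) = 0.2.
random.seed(0)
print(estimate_evidence(lambda: random.random(), lambda x: x < 0.2))  # ~0.2
```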

Page 15: Approximate Inference

Sampling a Bayesian Network

If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?

Idea: sample according to the structure of the network. Write the distribution using the chain rule, and then sample each variable given its parents.

Page 16: Approximate Inference

Logic sampling

[Figure: the Burglary–Earthquake–Alarm–Call–Radio network, its CPTs (P(b) = 0.03, P(e) = 0.001; P(a | B, E) entries 0.98, 0.7, 0.4, 0.01; P(c | a) = 0.8, P(c | ¬a) = 0.05; P(r | e) = 0.3, P(r | ¬e) = 0.001), and a table of samples over B, E, A, C, R. Step 1: B is sampled from P(B); the entry 0.03 is highlighted.]

Page 17: Approximate Inference

Logic sampling

[Same network, CPTs, and sample table. Step 2: E is sampled from P(E); the entry 0.001 is highlighted.]

Page 18: Approximate Inference

Logic sampling

[Same network, CPTs, and sample table. Step 3: A is sampled from P(A | B, E) given the values already drawn for B and E; the entry 0.4 is highlighted.]

Page 19: Approximate Inference

Logic sampling

[Same network, CPTs, and sample table. Step 4: C is sampled from P(C | A); the entry 0.8 is highlighted.]

Page 20: Approximate Inference

Logic sampling

[Same network, CPTs, and sample table. Step 5: R is sampled from P(R | E); the entry 0.3 is highlighted.]

Page 21: Approximate Inference

Logic sampling

[Same network, CPTs, and sample table. The sample over B, E, A, C, R is now complete.]

Page 22: Approximate Inference

Logic Sampling

Let X1, …, Xn be an order of the variables consistent with the arc directions.

for i = 1, …, n do
    sample xi from P(Xi | pai)
    (Note: since Pai ⊆ {X1,…,Xi−1}, we have already assigned values to them)
return x1, …, xn
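A minimal Python sketch of this procedure (the representation is assumed, not given in the slides): `order` is a topological order of the variables, `parents[X]` lists X's parents, and `cpt[X]` maps a tuple of parent values to P(X = true | parents); all variables are binary.

```python
import random

def logic_sample(order, parents, cpt):
    """Sample a complete instance by the chain rule, parents before children."""
    x = {}
    for var in order:
        pa = tuple(x[p] for p in parents[var])   # parents are already assigned
        x[var] = random.random() < cpt[var][pa]  # True with prob P(var = t | pa)
    return x

# Toy Burglary/Earthquake/Alarm fragment (the pairing of the P(A | B, E)
# entries to parent values is an illustrative assumption).
parents = {"B": (), "E": (), "A": ("B", "E")}
cpt = {"B": {(): 0.03},
       "E": {(): 0.001},
       "A": {(True, True): 0.98, (True, False): 0.7,
             (False, True): 0.4, (False, False): 0.01}}
print(logic_sample(["B", "E", "A"], parents, cpt))
```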

Page 23: Approximate Inference

Logic Sampling

Sampling a complete instance is linear in the number of variables, regardless of the structure of the network.

However, if P(e) is small, we need many samples to get a decent estimate

Page 24: Approximate Inference

Can we sample from P(Xi | e)?

If the evidence e is in the roots of the Bayesian network, this is easy.

If the evidence is in the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the network, so leaf evidence is never consulted while sampling its ancestors.

[Diagram: a small network over nodes X, Z, R, B with an observed node A = a.]

Page 25: Approximate Inference

Likelihood Weighting

Can we ensure that all of our samples satisfy e?

One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.

For example, suppose we know Y = 1:
Sample X from P(X).
Then take Y = 1.

Is this a sample from P(X, Y | Y = 1)? NO.

[Diagram: X → Y.]

Page 26: Approximate Inference

Likelihood Weighting

Problem: these samples of X are from P(X).

Solution: penalize samples in which P(Y = 1 | X) is small.

We now sample as follows:
Let xi be a sample from P(X).
Let wi = P(Y = 1 | X = xi).


P(X = x | Y = 1) ≈ Σ_i wi · 1{xi = x} / Σ_i wi
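A minimal sketch of this weighted estimate (illustrative, not from the slides), assuming `samples` is a list of (xi, wi) pairs generated as above:

```python
def weighted_estimate(samples, x):
    """Estimate P(X = x | Y = 1): weight of samples with xi = x over total weight."""
    total = sum(w for _, w in samples)
    return sum(w for xi, w in samples if xi == x) / total
```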

Page 27: Approximate Inference

Likelihood Weighting

Let X1, …, Xn be an order of the variables consistent with the arc directions.

w = 1
for i = 1, …, n do
    if Xi = xi has been observed:
        w ← w · P(Xi = xi | pai)
    else:
        sample xi from P(Xi | pai)
return x1, …, xn, and w
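A minimal Python sketch of this procedure, reusing the same assumed `order`/`parents`/`cpt` representation as in the logic-sampling sketch; `evidence` maps observed variables to their (boolean) values.

```python
import random

def weighted_sample(order, parents, cpt, evidence):
    """Return a sample consistent with the evidence and its likelihood weight."""
    x, w = {}, 1.0
    for var in order:
        pa = tuple(x[p] for p in parents[var])
        p_true = cpt[var][pa]                 # P(var = true | parents)
        if var in evidence:                   # observed: fix the value, update weight
            x[var] = evidence[var]
            w *= p_true if x[var] else (1.0 - p_true)
        else:                                 # hidden: sample as in logic sampling
            x[var] = random.random() < p_true
    return x, w

# Usage: average the weights of samples that agree with a query value to
# estimate P(query | evidence), as in the weighted estimate above.
```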

Page 28: Approximate Inference

Likelihood Weighting

[Figure: the same Burglary–Earthquake–Alarm–Call–Radio network and CPTs, now with the values of A and R fixed as evidence and a Weight column added to the sample table. Step 1: B is sampled from P(B); the entry 0.03 is highlighted.]

Page 29: Approximate Inference

Likelihood Weighting

[Same setup. Step 2: E is sampled from P(E); the entry 0.001 is highlighted.]

Page 30: Approximate Inference

Likelihood Weighting

[Same setup. Step 3: A is observed, so instead of sampling it the weight is multiplied by the corresponding entry of P(A | B, E); the weight becomes 0.6.]

Page 31: Approximate Inference

Likelihood Weighting

[Same setup. Step 4: C is sampled from P(C | A); the entry 0.05 is highlighted. The weight is still 0.6.]

Page 32: Approximate Inference

Likelihood Weighting

[Same setup. Step 5: R is observed, so the weight is multiplied by the corresponding entry of P(R | E); the weight becomes 0.6 · 0.3.]

Page 33: Approximate Inference

Likelihood Weighting

Why does this make sense? When N is large, we expect to sample about N·P(X = x) samples with x[i] = x. Thus,

Σ_{i: x[i] = x} P(Y = 1 | X = x[i]) / Σ_i P(Y = 1 | X = x[i])
  ≈ N · P(X = x) · P(Y = 1 | X = x) / (N · Σ_x' P(X = x') · P(Y = 1 | X = x'))
  = P(X = x, Y = 1) / P(Y = 1)
  = P(X = x | Y = 1)

Page 34: Approximate Inference

Summary

Approximate inference is needed for large pedigrees. We have seen a few methods today; some fit genetic linkage analysis and some do not. There are many other approximation algorithms: variational methods, MCMC, and others.

In next semester's Bioinformatics project course (236524), we will offer projects that implement some of these approximation methods and embed them in the Superlink software.