
Page 1: Sampling Bayesian Networks

ICS 295, 2008

Page 2: Algorithm Tree

Page 3: Sampling Fundamentals

Given a set of variables X = {X1, X2, ..., Xn}, a joint probability distribution P(X), and some function g(X), we can compute the expected value of g(X):

E[g] = \int_{D(X)} g(x) p(x) dx

For a discrete distribution, the integral becomes a sum:

E[g] = \sum_{x \in D(X)} g(x) p(x)

Page 4: Sampling From P(X)

Given independent, identically distributed (iid) samples S^1, S^2, ..., S^T from P(X), it follows from the Strong Law of Large Numbers that:

\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(S^t) \to E[g]

A sample S^t is an instantiation: S^t = {x_1^t, x_2^t, ..., x_n^t}
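As a concrete illustration of the estimator above, here is a minimal Python sketch (not from the lecture; the Bernoulli(0.3) distribution and g(x) = x are illustrative assumptions):

```python
import random

def monte_carlo_expectation(sample_p, g, T=10000):
    """Estimate E[g(X)] by (1/T) * sum_t g(S_t), with S_t drawn iid from p."""
    return sum(g(sample_p()) for _ in range(T)) / T

# Toy example (assumed): X ~ Bernoulli(0.3) and g(x) = x, so E[g(X)] = 0.3.
sample_p = lambda: 1 if random.random() < 0.3 else 0
print(monte_carlo_expectation(sample_p, lambda x: x))  # close to 0.3 for large T
```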

Page 5: Sampling Challenge

It is hard to generate samples directly from P(X). Two trade-offs:

Generate independent samples from a proposal Q(X) (Forward Sampling, Likelihood Weighting, Importance Sampling); try to find a Q(X) close to P(X).

Generate dependent samples forming a Markov chain whose distribution P'(X) converges to P(X) (Metropolis, Metropolis-Hastings, Gibbs); try to reduce the dependence between samples.

Page 6: Markov Chain

A sequence of random values x^0, x^1, ..., defined on a finite state space D(X), is called a Markov chain if it satisfies the Markov property:

P(x^{t+1} = y | x^t = x, x^{t-1}, ..., x^0 = z) = P(x^{t+1} = y | x^t = x)

If P(x^{t+1} = y | x^t) does not change with t (the chain is time-homogeneous), it is often expressed as a transition function A(x, y), whose rows sum to one:

\sum_y A(x, y) = 1

(Liu, Ch. 12, p. 245)
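A minimal sketch of simulating such a chain from a transition function A(x, y); the two-state matrix is an arbitrary assumption whose rows sum to 1:

```python
import random

# Toy 2-state transition function A(x, y); each row sums to 1 (assumed numbers).
A = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.5, 1: 0.5}}

def step(x):
    """Sample x_{t+1} from the row A(x_t, .)."""
    r, acc = random.random(), 0.0
    for y, p in A[x].items():
        acc += p
        if r < acc:
            return y
    return y  # guard against floating-point round-off

x = 0                       # initial state x0, "forgotten" in the long run
visits = 0
for t in range(100000):
    x = step(x)
    visits += x
print(visits / 100000)      # ~ stationary probability of state 1 (1/6 here)
```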

Page 7: Markov Chain Monte Carlo

First, define a transition probability P(x^{t+1} = y | x^t). Pick an initial state x^0; the choice is usually not important because it becomes "forgotten". Generate samples x^1, x^2, ..., drawing each next value from P(X | x^t):

x^0 -> x^1 -> ... -> x^t -> x^{t+1} -> ...

If we choose a proper P(x^{t+1} | x^t), we can guarantee that the distribution represented by the samples x^0, x^1, ... converges to the target distribution \pi(X).

Page 8: Markov Chain Properties

Irreducibility, Periodicity, Recurrence, Reversibility, Ergodicity, Stationary Distribution

Page 9: Irreducible

A state x is said to be irreducible if, under the transition rule, the chain has nonzero probability of moving from x to any other state and then coming back in a finite number of steps. If one state is irreducible, then all states must be irreducible. (Liu, Ch. 12, p. 249, Def. 12.1.1)

Page 10: Aperiodic

A state x is aperiodic if the greatest common divisor of {n : A^n(x,x) > 0} is 1. If state x is aperiodic and the chain is irreducible, then every state must be aperiodic. (Liu, Ch. 12, pp. 249-250, Def. 12.1.1)

Page 11: Recurrence

A state x is recurrent if the chain returns to x with probability 1. State x is recurrent if and only if:

\sum_{n=0}^{\infty} p_{ii}^{(n)} = \infty

Let M(x) be the expected number of steps to return to state x. State x is positive recurrent if M(x) is finite. The recurrent states in a finite-state chain are positive recurrent.

Page 12: Ergodicity

A state x is ergodic if it is aperiodic and positive recurrent. If all states in a Markov chain are ergodic, then the chain is ergodic.

Page 13: Reversibility

A Markov chain is reversible if there is a \pi such that the detailed balance condition holds:

\pi(i) P(x^{t+1} = j | x^t = i) = \pi(j) P(x^{t+1} = i | x^t = j)

that is, \pi_i p_{ij} = \pi_j p_{ji}. For a reversible Markov chain, \pi is always a stationary distribution.

Page 14: Stationary Distribution

If the Markov chain is time-homogeneous, then the vector \pi(X) is a stationary distribution (also called the invariant or equilibrium distribution, or "fixed point") if its entries sum to 1 and satisfy:

\pi(x) = \sum_{y \in D(X)} \pi(y) A(y, x)

An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent; the distribution is then unique.

Page 15: Stationary Distribution in Finite State Space

In a finite state space, a stationary distribution always exists but may not be unique. If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is guaranteed to be unique, and A^n(x, y) = P(x^n = y | x^0 = x) converges to a rank-one matrix in which each row is the stationary distribution \pi.

Thus, the initial state x^0 is not important for convergence: it gets forgotten, and we start sampling from the target distribution. However, it is important how long it takes to forget it!

Page 16: Convergence Theorem

Given a finite-state Markov chain whose transition function is irreducible and aperiodic, A^n(x^0, y) converges to its invariant distribution \pi(y) geometrically in variation distance: there exist 0 < r < 1 and c > 0 such that

\| A^n(x, \cdot) - \pi \|_{var} \le c r^n

Page 17: Eigen-Value Condition

Convergence to the stationary distribution is driven by the eigenvalues of the matrix A(x, y): "The chain will converge to its unique invariant distribution if and only if matrix A's second-largest eigenvalue in modulus is strictly less than 1." Many proofs of convergence center on analyzing the second eigenvalue. (Liu, Ch. 12, p. 249)
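A small numeric check of this condition (an illustration, not part of the lecture): with numpy one can read off the stationary distribution as the left eigenvector for eigenvalue 1 and inspect the second-largest eigenvalue modulus. The transition matrix is the same toy example assumed earlier:

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.5, 0.5]])            # row-stochastic transition matrix (assumed)

evals, evecs = np.linalg.eig(A.T)     # left eigenvectors of A
i = np.argmin(np.abs(evals - 1.0))    # eigenvalue 1 -> stationary distribution
pi = np.real(evecs[:, i])
pi /= pi.sum()
print(pi)                             # [5/6, 1/6] for this matrix

second = np.sort(np.abs(evals))[-2]   # second-largest eigenvalue in modulus
print(second)                         # 0.4 < 1, so the chain converges geometrically
```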

Page 18: Convergence in Finite State Space

Assume a finite-state Markov chain is irreducible and aperiodic. The initial state x^0 is not important for convergence: it gets forgotten, and we start sampling from the target distribution. However, it is important how long it takes to forget it; this is known as the burn-in time. Since the first k states are not drawn exactly from \pi, they are often thrown away. Open question: how big should k be?

Page 19: Sampling in BN

Same idea: generate a set of T samples and estimate P(Xi | E) from them. Challenge: X is a vector, and P(X) is a huge distribution represented by the BN. We need to know:

How to generate a new sample?
How many samples T do we need?
How to estimate P(E = e) and P(Xi | e)?

Page 20: Sampling Algorithms

Forward Sampling
Gibbs Sampling (MCMC): Blocking, Rao-Blackwellised
Likelihood Weighting
Importance Sampling
Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

Page 21: Gibbs Sampling

A Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994). The transition probability equals the conditional distribution. Example: for a joint \pi(X, Y), alternate A(x^{t+1} | y^t) = P(x | y) and A(y^{t+1} | x^t) = P(y | x), producing the chain x^0, y^0, x^1, y^1, ...

Page 22: Gibbs Sampling for BN

Samples are dependent and form a Markov chain. We sample from P'(X | e), which converges to P(X | e). Convergence is guaranteed when all probabilities are positive. Methods to improve convergence: Blocking, Rao-Blackwellisation. Error bounds: lag-k autocovariance; multiple chains with Chebyshev's inequality.

Page 23: Gibbs Sampling (Pearl, 1988)

A sample t \in [1, 2, ...] is an instantiation of all variables in the network:

x^t = {X_1 = x_1^t, X_2 = x_2^t, ..., X_N = x_N^t}

Sampling process: fix the values of the observed variables e; instantiate the node values in sample x^0 at random; generate samples x^1, x^2, ..., x^T from P(x | e); compute posteriors from the samples.

Page 24: Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t by processing all variables in some order:

X_1 = x_1^{t+1}, sampled from P(x_1 | x_2^t, x_3^t, ..., x_N^t, e)
X_2 = x_2^{t+1}, sampled from P(x_2 | x_1^{t+1}, x_3^t, ..., x_N^t, e)
X_3 = x_3^{t+1}, sampled from P(x_3 | x_1^{t+1}, x_2^{t+1}, x_4^t, ..., x_N^t, e)
...
X_N = x_N^{t+1}, sampled from P(x_N | x_1^{t+1}, x_2^{t+1}, ..., x_{N-1}^{t+1}, e)

In short, for i = 1 to N:

X_i = x_i^{t+1}, sampled from P(x_i | x^t \ x_i, e)

Page 25: Gibbs Sampling (cont'd) (Pearl, 1988)

Important: P(x_i | x^t \ x_i) = P(x_i | markov_i^t). Given its Markov blanket (parents, children, and children's parents), X_i is independent of all other nodes. The Markov blanket is

M_i = pa(X_i) \cup ch(X_i) \cup \bigcup_{X_j \in ch(X_i)} pa(X_j)

and the conditional factorizes as

P(x_i | x^t \ x_i) \propto P(x_i | pa_i) \prod_{X_j \in ch_i} P(x_j | pa_j)

A small numeric sketch of this formula follows below.
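Here is a minimal sketch of this factorization on an assumed chain X -> Y -> Z with toy CPT numbers: the Markov blanket of Y is {X, Z}, so P(y | x, z) is proportional to P(y | x) P(z | y):

```python
# Chain X -> Y -> Z with binary variables; toy CPTs (assumed numbers).
P_y_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # P(Y=y | X=x)
P_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # P(Z=z | Y=y)

def blanket_conditional(x, z):
    """P(Y | X=x, Z=z), i.e. normalize P(y | x) * P(z | y) over y."""
    unnorm = {y: P_y_given_x[x][y] * P_z_given_y[y][z] for y in (0, 1)}
    total = sum(unnorm.values())
    return {y: v / total for y, v in unnorm.items()}

print(blanket_conditional(x=1, z=0))  # distribution of Y given its blanket
```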

Page 26: Ordered Gibbs Sampling Algorithm

Input: X, E. Output: T samples {x^t}.
Fix evidence E, then generate samples from P(X | E):
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     X_i <- sample x_i^t from P(X_i | markov^t \ X_i)

A runnable sketch of this loop is given below.
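A runnable sketch of this algorithm on a tiny assumed chain X1 -> X2 -> X3 with evidence X3 = 1; all CPT numbers are illustrative, not from the lecture:

```python
import random

# Toy CPTs (assumed): P(X1), P(X2 | X1), P(X3 | X2); evidence X3 = 1.
p1 = {0: 0.6, 1: 0.4}
p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def cond(i, x):
    """P(X_i | Markov blanket of X_i) for this chain, normalized over {0, 1}."""
    if i == 1:
        w = {v: p1[v] * p2[v][x[2]] for v in (0, 1)}          # P(x1) P(x2|x1)
    else:
        w = {v: p2[x[1]][v] * p3[v][x[3]] for v in (0, 1)}    # P(x2|x1) P(x3|x2)
    z = sum(w.values())
    return {v: w[v] / z for v in w}

def gibbs(T=5000):
    x = {1: random.randint(0, 1), 2: random.randint(0, 1), 3: 1}  # fix evidence
    samples = []
    for t in range(T):
        for i in (1, 2):                  # loop through the unobserved variables
            d = cond(i, x)
            x[i] = 0 if random.random() < d[0] else 1
        samples.append(dict(x))
    return samples

samples = gibbs()
print(sum(s[1] for s in samples) / len(samples))  # estimate of P(X1=1 | X3=1)
```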

Page 27: Answering Queries

Query: P(x_i | e) = ?

Method 1 (histogram estimator): count the fraction of samples in which X_i = x_i:

\hat{P}(X_i = x_i) = \frac{\#samples(X_i = x_i)}{T}

Method 2 (mixture estimator): average the conditional probability across samples:

\hat{P}(X_i = x_i) = \frac{1}{T} \sum_{t=1}^{T} P(X_i = x_i | markov^t \ X_i)

A comparison of the two estimators is sketched below.
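A small sketch contrasting the two estimators; the sampled values and conditionals are toy numbers standing in for one Gibbs run:

```python
# Toy per-sample data (assumed) for one variable X_i with T = 5 samples.
samples = [1, 0, 1, 1, 0]             # sampled values x_i^t
conds   = [0.7, 0.4, 0.8, 0.6, 0.5]   # P(X_i = 1 | markov^t \ X_i) at each step

T = len(samples)
histogram_est = sum(1 for s in samples if s == 1) / T   # Method 1: counting
mixture_est   = sum(conds) / T                          # Method 2: averaging
print(histogram_est, mixture_est)
```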

Page 28: Gibbs Sampling Example - BN

X = {X1, X2, ..., X9}, E = {X9}.

[Figure: nine-node Bayesian network over X1, ..., X9; diagram not preserved in the transcript]

Page 29: Gibbs Sampling Example - BN

Initial assignment: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0.

[Figure: the same network with the initial assignment]

Page 30: Gibbs Sampling Example - BN

X1 <- P(X1 | x2^0, ..., x8^0, x9), with E = {X9}.

[Figure: the same network, highlighting X1]

Page 31: Gibbs Sampling Example - BN

X2 <- P(X2 | x1^1, x3^0, ..., x8^0, x9), with E = {X9}.

[Figure: the same network, highlighting X2]

Page 32: Gibbs Sampling: Illustration

Page 33: Gibbs Sampling Example - Init

Initialize nodes with random values:
X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0

Initialize running sums:
SUM1 = 0, SUM2 = 0, SUM3 = 0, SUM4 = 0, SUM5 = 0, SUM6 = 0, SUM7 = 0, SUM8 = 0

Page 34: Gibbs Sampling Example - Step 1

Generate Sample 1:
Compute SUM1 += P(x1 | x2^0, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9); select and assign a new value X1 = x1^1.
Compute SUM2 += P(x2 | x1^1, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9); select and assign a new value X2 = x2^1.
Compute SUM3 += P(x3 | x1^1, x2^1, x4^0, x5^0, x6^0, x7^0, x8^0, x9); select and assign a new value X3 = x3^1.
...
At the end, we have the new sample: S1 = {x1^1, x2^1, ..., x8^1, x9}

Page 35: Gibbs Sampling Example - Step 2

Generate Sample 2:
Compute P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9); select and assign a new value X1 = x1^2; update SUM1 += P(x1 | x2^1, ..., x8^1, x9).
Compute P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9); select and assign a new value X2 = x2^2; update SUM2 += P(x2 | x1^2, x3^1, ..., x8^1, x9).
Compute P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1, x8^1, x9); select and assign a new value X3 = x3^2; update SUM3 += P(x3 | x1^2, x2^2, x4^1, ..., x8^1, x9).
...
New sample: S2 = {x1^2, x2^2, ..., x8^2, x9}

Page 36: Gibbs Sampling Example - Answering Queries

With T = 2 samples:
P(x1 | x9) = SUM1 / 2
P(x2 | x9) = SUM2 / 2
P(x3 | x9) = SUM3 / 2
P(x4 | x9) = SUM4 / 2
P(x5 | x9) = SUM5 / 2
P(x6 | x9) = SUM6 / 2
P(x7 | x9) = SUM7 / 2
P(x8 | x9) = SUM8 / 2

Page 37: Gibbs Convergence

The stationary distribution equals the target sampling distribution. MCMC converges to the stationary distribution if the network is ergodic. The chain is ergodic if all transition probabilities are positive:

for all states S_i, S_j: p_{ij} > 0

If there exist i, j such that p_{ij} = 0, then we may not be able to explore the full sampling space!

Page 38: Gibbs Sampling: Burn-In

We want to sample from P(X | E), but the starting point is random. Solution: throw away the first K samples, known as "burn-in". What is K? Hard to tell; use intuition. Alternative: draw the first sample's values from an approximation of P(x | e) (for example, run IBP first).

Page 39: Gibbs Sampling: Performance

+ Advantage: guaranteed to converge to P(X | E).
- Disadvantage: convergence may be slow.

Problems: samples are dependent, and the statistical variance is too big in high-dimensional problems.

Page 40: Gibbs: Speeding Convergence

Objectives:
1. Reduce dependence between samples (autocorrelation): skip samples; randomize the variable sampling order.
2. Reduce variance: Blocking Gibbs Sampling; Rao-Blackwellisation.

Page 41: Skipping Samples

Pick only every k-th sample (Geyer, 1992). This can reduce the dependence between samples, but it increases variance and wastes samples!

Page 42: Randomized Variable Order

Random Scan Gibbs Sampler: pick each next variable X_i for update at random with probability p_i, where \sum_i p_i = 1. (In the simplest case, the p_i are uniform.) In some instances this reduces variance (MacEachern and Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction").

Page 43: Blocking

Sample several variables together, as a block. Example: given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

x^{t+1} <- P(x | y^t, z^t) = P(x | w^t)
(y^{t+1}, z^{t+1}) = w^{t+1} <- P(w | x^{t+1})

+ Can improve convergence greatly when two variables are strongly correlated!
- The domain of the block variable grows exponentially with the number of variables in a block!

Page 44: Blocking Gibbs Sampling

Jensen, Kong, and Kjaerulff, 1993, "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems". Select a set of subsets E_1, E_2, E_3, ..., E_k such that:

E_i \subseteq X
\bigcup_i E_i = X
A_i = X \ E_i

Then sample from P(E_i | A_i).

Page 45: Rao-Blackwellisation

Do not sample all variables: sample a subset! Example: given three variables X, Y, Z, sample only X and Y, summing out Z. Given sample (x^t, y^t), compute the next sample:

x^{t+1} <- P(x | y^t)
y^{t+1} <- P(y | x^{t+1})

A sketch of this two-variable scheme follows below.
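A minimal sketch of this scheme with a toy 2x2 joint P(X, Y) in which Z has already been summed out (all numbers assumed):

```python
import random

# Toy joint P(X, Y) after Z has been summed out (assumed numbers).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def draw(dist):
    """Sample a value from a {value: probability} dict."""
    r, acc = random.random(), 0.0
    for v, p in dist.items():
        acc += p
        if r < acc:
            return v
    return v

def conditional(var, other):
    """P(X | Y=other) if var == 'x', else P(Y | X=other)."""
    w = ({v: joint[(v, other)] for v in (0, 1)} if var == 'x'
         else {v: joint[(other, v)] for v in (0, 1)})
    z = sum(w.values())
    return {v: w[v] / z for v in w}

x, y, hits = 0, 0, 0
for t in range(20000):
    x = draw(conditional('x', y))     # x^{t+1} ~ P(x | y^t)
    y = draw(conditional('y', x))     # y^{t+1} ~ P(y | x^{t+1})
    hits += x
print(hits / 20000)                   # ~ P(X=1) = 0.6 for this joint
```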

Page 46: Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces the variance!

Page 47: Blocking vs. Rao-Blackwellisation

For three variables X, Y, Z:

Standard Gibbs: P(x | y,z), P(y | x,z), P(z | x,y)  (1)
Blocking: P(x | y,z), P(y,z | x)  (2)
Rao-Blackwellised: P(x | y), P(y | x)  (3)

Var3 < Var2 < Var1 (Liu, Wong, and Kong, 1994, "Covariance structure of the Gibbs sampler...").

Page 48: Rao-Blackwellised Gibbs: Cutset Sampling

Select C \subseteq X (possibly a cycle-cutset), |C| = m. Fix evidence E. Initialize the cutset nodes with random values: for i = 1 to m, set C_i = c_i^0. Then, for t = 1 to T, generate samples: for i = 1 to m,

C_i = c_i^{t+1} <- P(c_i | c_1^{t+1}, ..., c_{i-1}^{t+1}, c_{i+1}^t, ..., c_m^t, e)

A structural sketch of this loop follows below.
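A structural sketch of this loop. The joint_via_BTE callback is hypothetical, standing in for exact inference (e.g., BTE) that returns P(C_i = v, rest of the cutset, e); it is an assumption used only to show the control flow:

```python
import random

def cutset_gibbs(domains, joint_via_BTE, T):
    """domains: {i: list of values of C_i}; returns T cutset samples.

    joint_via_BTE(i, v, rest) must return P(C_i = v, rest, e), e.g. by
    bucket-tree elimination over the non-cutset variables (hypothetical)."""
    c = {i: random.choice(vals) for i, vals in domains.items()}   # random c^0
    samples = []
    for t in range(T):
        for i in domains:                         # for i = 1 to m
            rest = {j: v for j, v in c.items() if j != i}
            w = {v: joint_via_BTE(i, v, rest) for v in domains[i]}
            z = sum(w.values())
            r, acc = random.random(), 0.0
            for v, p in w.items():                # sample c_i from w / z
                acc += p / z
                if r < acc:
                    break
            c[i] = v
        samples.append(dict(c))
    return samples

# Demo with a fake stand-in for BTE over two binary cutset variables:
fake_bte = lambda i, v, rest: 0.25 + 0.5 * v
print(cutset_gibbs({1: [0, 1], 2: [0, 1]}, fake_bte, T=3))
```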

Page 49: Cutset Sampling

Select a subset C = {C_1, ..., C_K} \subseteq X. A sample c^t, t \in [1, 2, ...], is an instantiation of C:

c^t = {C_1 = c_1^t, C_2 = c_2^t, ..., C_K = c_K^t}

Sampling process: fix the values of the observed variables e; generate sample c^0 at random; generate samples c^1, c^2, ..., c^T from P(c | e); compute posteriors from the samples.

Page 50: Cutset Sampling - Generating Samples

Generate sample c^{t+1} from c^t:

C_1 = c_1^{t+1}, sampled from P(c_1 | c_2^t, c_3^t, ..., c_K^t, e)
C_2 = c_2^{t+1}, sampled from P(c_2 | c_1^{t+1}, c_3^t, ..., c_K^t, e)
C_3 = c_3^{t+1}, sampled from P(c_3 | c_1^{t+1}, c_2^{t+1}, c_4^t, ..., c_K^t, e)
...
C_K = c_K^{t+1}, sampled from P(c_K | c_1^{t+1}, c_2^{t+1}, ..., c_{K-1}^{t+1}, e)

In short, for i = 1 to K:

C_i = c_i^{t+1}, sampled from P(c_i | c^t \ c_i, e)

Page 51: Rao-Blackwellised Gibbs: Cutset Sampling

How do we compute P(c_i | c^t \ c_i, e)? Compute the joint P(c_i, c^t \ c_i, e) for each value c_i \in D(C_i), then normalize:

P(c_i | c^t \ c_i, e) = \alpha P(c_i, c^t \ c_i, e)

Computation efficiency depends on the choice of C.

Page 52: Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C? Special case: C is a cycle-cutset, giving O(N) computation. General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed. Pick C wisely so as to minimize w; this leads to the notion of a w-cutset.

Page 53: w-cutset Sampling

C is a w-cutset of the network: a set of nodes such that, when C and E are instantiated, the adjusted induced width of the network is w. The complexity of exact inference is then bounded by w. The cycle-cutset is a special case.

Page 54: Cutset Sampling - Answering Queries

Query for a cutset node c_i \in C: P(c_i | e) = ? Same as Gibbs (a special case of w-cutset sampling), computed while generating sample t:

\hat{P}(c_i | e) = \frac{1}{T} \sum_{t=1}^{T} P(c_i | c^t \ c_i, e)

Query for any other node X_i: P(x_i | e) = ? Computed after generating sample t:

\hat{P}(x_i | e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i | c^t, e)

Page 55: Cutset Sampling Example

Initial cutset sample: c^0 = {x_2^0, x_5^0}, with evidence E = x_9.

[Figure: network over X1, ..., X9 with cutset C = {X2, X5}; diagram not preserved in the transcript]

Page 56: Cutset Sampling Example

Sample a new value for X2, starting from c^0 = {x_2^0, x_5^0}. For each value x_2' \in D(X_2), compute

P(x_2' | x_5^0, x_9) = \frac{BTE(x_2', x_5^0, x_9)}{\sum_{x_2''} BTE(x_2'', x_5^0, x_9)}

then sample x_2^1 <- P(x_2 | x_5^0, x_9).

Page 57: Cutset Sampling Example

Sample a new value for X5. For each value x_5' \in D(X_5), compute

P(x_5' | x_2^1, x_9) = \frac{BTE(x_2^1, x_5', x_9)}{\sum_{x_5''} BTE(x_2^1, x_5'', x_9)}

then sample x_5^1 <- P(x_5 | x_2^1, x_9), giving the new cutset sample c^1 = {x_2^1, x_5^1}.

Page 58: Cutset Sampling Example

Query P(x_2 | e) for sampling node X2, using three samples:

Sample 1: x_2^1 <- P(x_2 | x_5^0, x_9)
Sample 2: x_2^2 <- P(x_2 | x_5^1, x_9)
Sample 3: x_2^3 <- P(x_2 | x_5^2, x_9)

\hat{P}(x_2 | x_9) = \frac{1}{3} [ P(x_2 | x_5^0, x_9) + P(x_2 | x_5^1, x_9) + P(x_2 | x_5^2, x_9) ]

Page 59: Cutset Sampling Example

Query P(x_3 | e) for non-sampled node X3, using the samples c^1 = {x_2^1, x_5^1}, c^2 = {x_2^2, x_5^2}, c^3 = {x_2^3, x_5^3}:

\hat{P}(x_3 | x_9) = \frac{1}{3} [ P(x_3 | x_2^1, x_5^1, x_9) + P(x_3 | x_2^2, x_5^2, x_9) + P(x_3 | x_2^3, x_5^3, x_9) ]

Page 60: Gibbs: Error Bounds

Objectives: estimate the needed number of samples T and estimate the error. Methodology:

1 chain: use the lag-k autocovariance to estimate T.
M chains: use the standard sampling variance to estimate the error.

Page 61: Gibbs: Lag-k Autocovariance

Estimate P = P(x_i | e) with the mixture estimator and measure the lag-k autocovariance of the per-sample terms P_t = P(x_i | x^t \ x_i):

\hat{P} = \frac{1}{T} \sum_{t=1}^{T} P_t

\gamma(k) = \frac{1}{T} \sum_{t=1}^{T-k} (P_t - \hat{P})(P_{t+k} - \hat{P})

Var(\hat{P}) = \frac{1}{T} \left[ \gamma(0) + 2 \sum_{k=1}^{2\delta+1} \gamma(k) \right]

Page 62: Gibbs: Lag-k Autocovariance (cont'd)

Estimate the Monte Carlo variance:

Var(\hat{P}) = \frac{1}{T} \left[ \gamma(0) + 2 \sum_{k=1}^{2\delta+1} \gamma(k) \right]

Here, \delta is the smallest positive integer satisfying \hat{\gamma}(2\delta) + \hat{\gamma}(2\delta+1) > 0. Effective chain size:

\hat{T} = \frac{\hat{\gamma}(0)}{Var(\hat{P})}

In the absence of autocovariance, \hat{T} = T. A sketch of this computation follows below.
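A sketch of this machinery; the toy series P_t and the choice delta = 1 are assumptions for illustration:

```python
def autocov(p, k):
    """gamma(k) = (1/T) * sum_{t=1}^{T-k} (P_t - mean)(P_{t+k} - mean)."""
    T = len(p)
    m = sum(p) / T
    return sum((p[t] - m) * (p[t + k] - m) for t in range(T - k)) / T

def mc_variance(p, delta):
    """Var(P_hat) = (1/T) * (gamma(0) + 2 * sum_{k=1}^{2*delta+1} gamma(k))."""
    T = len(p)
    return (autocov(p, 0)
            + 2 * sum(autocov(p, k) for k in range(1, 2 * delta + 2))) / T

# Toy series of per-sample conditionals P_t (assumed numbers):
p = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51]
var = mc_variance(p, delta=1)
print(var, autocov(p, 0) / var)   # Monte Carlo variance and effective size T_hat
```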

Page 63: Gibbs: Multiple Chains

Generate M chains of size K. Each chain m produces an independent estimate P_m:

P_m = \hat{P}(x_i | e) = \frac{1}{K} \sum_{t=1}^{K} P(x_i | x^t \ x_i)

Treat the P_m as independent random variables and estimate P(x_i | e) as their average:

\hat{P} = \frac{1}{M} \sum_{m=1}^{M} P_m

Page 64: Gibbs: Multiple Chains (cont'd)

The {P_m} are independent random variables; therefore:

Var(\hat{P}) = \frac{S^2}{M}, where S^2 = \frac{1}{M-1} \sum_{m=1}^{M} (P_m - \hat{P})^2

and an error bound can be obtained from the Student's t statistic t_{\alpha/2, M-1}. A numeric sketch follows below.
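A numeric sketch of the multiple-chains estimate; the per-chain values and the 95% t quantile for M = 5 are toy assumptions:

```python
import math

Pm = [0.31, 0.28, 0.35, 0.30, 0.33]            # per-chain estimates P_m (toy)
M = len(Pm)
P = sum(Pm) / M                                # pooled estimate of P(x_i | e)
S2 = sum((p - P) ** 2 for p in Pm) / (M - 1)   # sample variance of {P_m}
stderr = math.sqrt(S2 / M)
t = 2.776                                      # t_{alpha/2, M-1}, alpha = 0.05, M = 5
print(P, (P - t * stderr, P + t * stderr))     # estimate with ~95% interval
```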

Page 65: Geman & Geman, 1984

Geman, S. and Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.

Introduced Gibbs sampling and placed it in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.

Page 66: Tanner & Wong, 1987

Tanner and Wong (1987): data augmentation; convergence results.

Page 67: Pearl, 1988

Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable, and the joint distribution is defined by the factorization of the network.

Page 68: Gelfand & Smith, 1990

Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.

Showed variance reduction from using the mixture estimator for posterior marginals.

Page 69: Neal, 1992

R. M. Neal, 1992. Connectionist learning of belief networks. Artificial Intelligence, v. 56, pp. 71-118.

Stochastic simulation in Noisy-OR networks.

Page 70: CPCS54 Test Results

MSE vs. #samples (left) and time (right). Ergodic network: |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4. Exact time = 30 sec using Cutset Conditioning.

[Plots: "CPCS54, n=54, |C|=15, |E|=3"; MSE vs. #samples (0-5000) and MSE vs. time (0-25 sec), comparing Cutset and Gibbs; curves not preserved in the transcript]

Page 71: CPCS179 Test Results

MSE vs. #samples (left) and time (right). Non-ergodic network (1 deterministic CPT entry): |X| = 179, |C| = 8, 2 <= D(Xi) <= 4, |E| = 35. Exact time = 122 sec using Loop-Cutset Conditioning.

[Plots: "CPCS179, n=179, |C|=8, |E|=35"; MSE vs. #samples (100-4000) and MSE vs. time (0-80 sec), comparing Cutset and Gibbs]

Page 72: CPCS360b Test Results

MSE vs. #samples (left) and time (right). Ergodic network: |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36. Exact time > 60 min using Cutset Conditioning; exact values obtained via Bucket Elimination.

[Plots: "CPCS360b, n=360, |C|=21, |E|=36"; MSE vs. #samples (0-1000) and MSE vs. time (1-60 sec), comparing Cutset and Gibbs]

Page 73: Random Networks

MSE vs. #samples (left) and time (right). |X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20. Exact time = 30 sec using Cutset Conditioning.

[Plots: "RANDOM, n=100, |C|=13, |E|=15-20"; MSE vs. #samples (0-1200) and MSE vs. time (0-11 sec), comparing Cutset and Gibbs]

Page 74: Coding Networks

MSE vs. time. Non-ergodic network: |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50. Sample the ergodic subspace U = {U1, U2, ..., Uk}. Exact time = 50 sec using Cutset Conditioning.

[Figure: coding network structure over nodes x1-x4, u1-u4, p1-p4, y1-y4, and plot "Coding Networks, n=100, |C|=12-14"; MSE vs. time (0-60 sec), comparing IBP, Gibbs, and Cutset]

Page 75: Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right). Non-ergodic network: |X| = 56, |C| = 5, 2 <= D(Xi) <= 11, |E| = 0. Exact time = 2 sec using Loop-Cutset Conditioning.

[Plots: "HailFinder, n=56, |C|=5, |E|=1"; MSE vs. time (1-10 sec) and MSE vs. #samples (0-1500) on a log scale, comparing Cutset and Gibbs]

Page 76: Non-Ergodic CPCS360b - MSE

MSE vs. time. Non-ergodic network: |X| = 360, |C| = 26, D(Xi) = 2. Exact time = 50 min using BTE.

[Plot: "cpcs360b, N=360, |E|=[20-34], w*=20, MSE" vs. time (0-1600 sec), comparing Gibbs, IBP, |C|=26 (fw=3), and |C|=48 (fw=2)]

Page 77: Non-Ergodic CPCS360b - MaxErr

[Plot: "cpcs360b, N=360, |E|=[20-34], MaxErr" vs. time (0-1600 sec), comparing Gibbs, IBP, |C|=26 (fw=3), and |C|=48 (fw=2)]

Page 78: Importance vs. Gibbs

Gibbs: draw x^t from \tilde{P}(x | e), where \tilde{P}(x | e) approaches P(x | e), and estimate

\hat{f} = \frac{1}{T} \sum_{t=1}^{T} f(x^t)

Importance: draw x^t from a proposal Q(x | e) and estimate

\hat{f} = \frac{1}{T} \sum_{t=1}^{T} f(x^t) w^t, where w^t = \frac{P(x^t)}{Q(x^t)}
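A side-by-side sketch of the two estimators (all distributions are toy assumptions); the importance estimate is demonstrated on a Bernoulli target with a uniform proposal:

```python
import random

def gibbs_estimate(f, xs):
    """Average f over (dependent) samples drawn from ~P(x | e)."""
    return sum(f(x) for x in xs) / len(xs)

def importance_estimate(f, xs, p, q):
    """Average f weighted by w = p(x) / q(x) over samples drawn from q."""
    return sum(f(x) * p(x) / q(x) for x in xs) / len(xs)

# Toy target p = Bernoulli(0.3) and proposal q = Bernoulli(0.5) (assumed):
p = lambda x: 0.3 if x == 1 else 0.7
q = lambda x: 0.5
xs = [1 if random.random() < 0.5 else 0 for _ in range(10000)]  # draws from q
print(importance_estimate(lambda x: x, xs, p, q))               # ~ E_p[X] = 0.3
```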