Probabilistic Graphical Models (CSE674, Chapter 20)

Learning with Approximate Inference

Sargur Srihari [email protected]


Topics

•  Learning MN parameters using Gradient Ascent and Belief Propagation
   – Difficulty with exact methods
•  Approximate methods – Approximate Inference
   •  Belief Propagation
   •  Sampling-based Learning
   •  MAP-based Learning
•  Ex: CRF for Protein Structure Prediction


Likelihood function for a Markov Network

•  The log-linear form of an MN with parameters θ is

   P(X1,…,Xn; θ) = (1/Z(θ)) exp{ Σ_{i=1..k} θ_i f_i(D_i) }

   where θ = {θ1,…,θk} are k parameters, each associated with a feature f_i defined over instances of D_i,
   and the partition function is defined as

   Z(θ) = Σ_ξ exp{ Σ_i θ_i f_i(ξ) }

•  We wish to find θ that maximizes the log-likelihood

   ℓ(θ : D) = Σ_i θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)
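As a concrete illustration of these formulas, here is a minimal Python sketch (not from the slides) that evaluates Z(θ) and ℓ(θ : D) by brute-force enumeration; the feature scopes, parameter values, and data instances are made-up assumptions, and enumeration is only feasible for toy models.

```python
import itertools
import math

# Hypothetical tiny model over three binary variables X = (A, B, C) with
# pairwise scopes D1 = {A, B}, D2 = {B, C}. Each feature fires when both
# variables in its scope equal 1. All names and data are illustrative only.
scopes = [(0, 1), (1, 2)]
features = [lambda xs, s=s: float(xs[s[0]] == 1 and xs[s[1]] == 1) for s in scopes]

def log_unnormalized(xs, theta):
    """log of exp{ sum_i theta_i f_i(xs) } for one complete assignment xs."""
    return sum(t * f(xs) for t, f in zip(theta, features))

def log_partition(theta, n_vars=3):
    """ln Z(theta), summing over all 2^n assignments (toy models only)."""
    return math.log(sum(math.exp(log_unnormalized(xs, theta))
                        for xs in itertools.product([0, 1], repeat=n_vars)))

def log_likelihood(theta, data):
    """l(theta : D) = sum_i theta_i sum_m f_i(xi[m]) - M ln Z(theta)."""
    M = len(data)
    feature_counts = [sum(f(xs) for xs in data) for f in features]
    return sum(t * c for t, c in zip(theta, feature_counts)) - M * log_partition(theta)

data = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]   # made-up training instances
print(log_likelihood([0.5, -0.2], data))
```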

Gradient-based Learning of an Undirected PGM

Goal: Determine the parameters θ of a Markov network over variables χ, given a data set D and k features f_i.

   P_θ(χ) = (1/Z(θ)) exp{ Σ_{i=1..k} θ_i f_i(D_i) },     Z(θ) = Σ_ξ exp{ Σ_i θ_i f_i(ξ) }

Gradient-ascent loop:
1.  Initialize θ.
2.  Run inference to compute Z(θ) and E_θ[f_i], where

    E_θ[f_i] = (1/Z(θ)) Σ_ξ f_i(ξ) exp{ Σ_j θ_j f_j(ξ) }

3.  Compute the gradient of the likelihood ℓ:

    ∂ℓ(θ; D)/∂θ_i = E_D[f_i(χ)] − E_θ[f_i],     ∇ℓ(θ) = dℓ(θ)/dθ = [∂ℓ(θ)/∂θ_1, …, ∂ℓ(θ)/∂θ_k]

4.  Update θ with step size η:   θ^{t+1} ← θ^t + η ∇ℓ(θ^t; D)
5.  If the optimum is reached (e.g., ||θ^{t+1} − θ^t|| ≤ δ), stop; otherwise return to step 2.

The method assumes that we are able to compute the partition function Z(θ), which is a sum of the unnormalized probabilities of all assignments ξ, and the two expectations E_D[f_i(χ)] and E_θ[f_i]. Both Z(θ) and E_θ[f_i] require inference to compute the probability of each ξ.
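The loop above can be made concrete with a small sketch; exact enumeration stands in for the inference step, and the model, features, data, step size η, and stopping threshold δ are illustrative assumptions rather than anything prescribed by the slides.

```python
import itertools, math

# Minimal gradient-ascent sketch with exact enumeration as the "inference" step.
scopes = [(0, 1), (1, 2)]
features = [lambda xs, s=s: float(xs[s[0]] == 1 and xs[s[1]] == 1) for s in scopes]
data = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]          # made-up training instances
all_xs = list(itertools.product([0, 1], repeat=3))

def expectations(theta):
    """Z(theta) and model expectations E_theta[f_i], by exact enumeration."""
    weights = [math.exp(sum(t * f(xs) for t, f in zip(theta, features))) for xs in all_xs]
    Z = sum(weights)
    e_model = [sum(w * f(xs) for w, xs in zip(weights, all_xs)) / Z for f in features]
    return Z, e_model

e_data = [sum(f(xs) for xs in data) / len(data) for f in features]   # E_D[f_i]
theta, eta, delta = [0.0, 0.0], 0.1, 1e-6
while True:
    _, e_model = expectations(theta)                       # run inference
    grad = [ed - em for ed, em in zip(e_data, e_model)]    # E_D[f_i] - E_theta[f_i]
    theta = [t + eta * g for t, g in zip(theta, grad)]     # theta <- theta + eta * grad
    if max(abs(eta * g) for g in grad) <= delta:           # stop when the update is tiny
        break
print(theta)
```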


Exact Inference: Belief Propagation

[Figure: the Markov network A—B—C—D—A (a), its cluster graph (b), and a clique tree over it (c).]

1. Gibbs distribution:

   P(A,B,C,D) = (1/Z) φ1(A,B) ⋅ φ2(B,C) ⋅ φ3(C,D) ⋅ φ4(D,A)

   where

   Z = Σ_{A,B,C,D} φ1(A,B) ⋅ φ2(B,C) ⋅ φ3(C,D) ⋅ φ4(D,A),     Z = 7,201,840

   Unnormalized distribution (product of the factors):

   P̃_Φ(A,B,C,D) = φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A)

2. Clique tree (triangulated): cliques C1 = {A,B,D} and C2 = {B,C,D}, with sepset S_{1,2} = {B,D}.

   Initial potentials (each ψ collects every factor whose scope lies within its cluster):

   ψ1(A,B,D) = φ1(A,B) φ4(D,A)
   ψ2(B,C,D) = φ2(B,C) φ3(C,D)

   Computing clique beliefs β_i and sepset beliefs μ_{i,j}:

   β1(A,B,D) = P̃_Φ(A,B,D) = Σ_C P̃_Φ(A,B,C,D)
      e.g., β1(a0,b0,d0) = 300,000 + 300,000 = 600,000

   μ_{1,2}(B,D) = Σ_{C1 − S_{1,2}} β1(C1) = Σ_A β1(A,B,D)
      e.g., μ_{1,2}(b0,d0) = 600,000 + 200 = 600,200

   β2(B,C,D) = P̃_Φ(B,C,D) = Σ_A P̃_Φ(A,B,C,D)
      e.g., β2(b0,c0,d0) = 300,000 + 100 = 300,100

   [Table omitted: all clique and sepset beliefs β1(A,B,D), β2(B,C,D), μ_{1,2}(B,D) for the Misconception example; see Koller & Friedman, Example 10.5 and Figure 10.6.]

3. Verifying inference (the clique tree invariant): the calibrated clique and sepset beliefs provide a reparameterization of the unnormalized measure,

   P̃_Φ(χ) = ∏_{i ∈ V_T} β_i(C_i) / ∏_{(i−j) ∈ E_T} μ_{i,j}(S_{i,j})

   e.g., P̃_Φ(a1,b0,c1,d0) = 100, and β1(a1,b0,d0) β2(b0,c1,d0) / μ_{1,2}(b0,d0) = (200 ⋅ 300,100) / 600,200 = 100.
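These numbers can be reproduced with a short sketch. The factor tables below are an assumption taken from the Misconception example in Koller & Friedman; with them the sketch recovers Z = 7,201,840 and the worked belief values above.

```python
import itertools

# Sketch verifying the clique-tree computations for the network A-B-C-D-A.
# Factor values assumed from Koller & Friedman's Misconception example.
phi1 = {(0,0): 30,  (0,1): 5,   (1,0): 1,   (1,1): 10}    # phi1(A,B)
phi2 = {(0,0): 100, (0,1): 1,   (1,0): 1,   (1,1): 100}   # phi2(B,C)
phi3 = {(0,0): 1,   (0,1): 100, (1,0): 100, (1,1): 1}     # phi3(C,D)
phi4 = {(0,0): 100, (0,1): 1,   (1,0): 1,   (1,1): 100}   # phi4(D,A)

def p_tilde(a, b, c, d):
    """Unnormalized measure P~(A,B,C,D) = phi1 * phi2 * phi3 * phi4."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

vals = [0, 1]
Z = sum(p_tilde(a, b, c, d) for a, b, c, d in itertools.product(vals, repeat=4))
print("Z =", Z)                                    # 7201840

# Clique and sepset beliefs for C1 = {A,B,D}, C2 = {B,C,D}, S12 = {B,D}
beta1 = {(a, b, d): sum(p_tilde(a, b, c, d) for c in vals)
         for a, b, d in itertools.product(vals, repeat=3)}
beta2 = {(b, c, d): sum(p_tilde(a, b, c, d) for a in vals)
         for b, c, d in itertools.product(vals, repeat=3)}
mu12 = {(b, d): sum(beta1[a, b, d] for a in vals)
        for b, d in itertools.product(vals, repeat=2)}

print(beta1[0, 0, 0], mu12[0, 0], beta2[0, 0, 0])  # 600000 600200 300100

# Clique tree invariant: P~(xi) = beta1 * beta2 / mu12
a, b, c, d = 1, 0, 1, 0
print(p_tilde(a, b, c, d), beta1[a, b, d] * beta2[b, c, d] / mu12[b, d])   # 100 100.0
```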

Example of Computing E_D[f_i(χ)]

•  χ = {A,B,C}; pairwise Markov network A—B—C
   – Variables are binary
   – Three clusters: C1 = {A,B}, C2 = {B,C}, C3 = {C,A}
•  Log-linear model with two features, both shared over all clusters:
   •  f00(x,y) = 1 if x = 0, y = 0, and 0 otherwise
   •  f11(x,y) = 1 if x = 1, y = 1, and 0 otherwise
•  Assume we have three data instances of (A,B,C):
   – (0,0,0) → clusters AB, BC, and CA all have f00(x,y) = 1
   – (0,1,0) → only cluster CA has f00(x,y) = 1
   – (1,0,0) → only cluster BC has f00(x,y) = 1
•  Unnormalized empirical feature counts, pooled over all clusters:

   E_P̃[f00] = (3 + 1 + 1)/3 = 5/3
   E_P̃[f11] = (0 + 0 + 0)/3 = 0

   These can be normalized using the counts over all features, i.e., (5/3) + 0.

   (Model form, as before: P_θ(χ) = (1/Z(θ)) exp{ Σ_{i=1..k} θ_i f_i(D_i) }.)
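A minimal sketch of this counting computation; the helper names and data layout are illustrative assumptions.

```python
# Reproduce the empirical (data) feature counts above.
# The three clusters share the same two features f00 and f11.
clusters = [(0, 1), (1, 2), (2, 0)]          # index pairs for AB, BC, CA
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]     # three instances of (A, B, C)

def f00(x, y): return 1 if (x, y) == (0, 0) else 0
def f11(x, y): return 1 if (x, y) == (1, 1) else 0

def empirical_count(feature):
    """Average over instances of the feature count, pooled over all clusters."""
    total = sum(feature(inst[i], inst[j]) for inst in data for i, j in clusters)
    return total / len(data)

print(empirical_count(f00))   # (3 + 1 + 1) / 3 = 5/3 ≈ 1.667
print(empirical_count(f11))   # 0.0
```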


Difficulty with Exact Methods

•  Exact parameter estimation methods assume the ability to compute
   1.  the partition function Z(θ), and
   2.  the expectations E_{Pθ}[f_i]
•  In many applications the structure of the network does not allow exact computation of these terms
   –  In image segmentation, for example, grid networks lead to exponential-size clusters for exact inference
•  (In the example above, the cluster graph AB—BC—CA plays the role of the clique tree, with overlapping factors.)


Discussion on Approximate Methods

1.  In this section: approximate inference
    –  Decouple inference from learning parameters
       •  Inference is a black box
    –  But the approximation may interfere with learning
       •  Non-convergence of inference can lead to oscillating estimates of the gradient and no learning convergence
2.  Next section: approximate objective functions
    –  Whose optimization doesn't require much inference
    –  Approximately optimizing the likelihood function can be reformulated as exactly optimizing an approximate objective


Approximate Inference

•  Learning with approximate inference methods:
   1.  Belief Propagation
   2.  MAP-based Learning


Approximate Inference: Belief Propagation

•  A popular approach to approximate inference is Belief Propagation and its variants
   –  An algorithm from this family would also be used for inference in the model resulting from the learning procedure
•  We should train with the same inference algorithm that will be used for querying the model
   –  A model trained with the same inference algorithm can be better than a model trained with exact inference!
•  BP is run within each iteration of gradient ascent to compute the expected feature counts E_{Pθ}[f_i]


Difficulty with Belief Propagation

•  In principle, BP is run in each gradient-ascent iteration
   –  to compute the expected feature counts E_{Pθ}[f_i] used in the gradient computation
•  Due to the family preservation property, each feature must be a subset of some cluster C_i in the cluster graph
   –  Hence, to compute E_{Pθ}[f_i] we can use the BP marginals over C_i
•  But BP often does not converge
   –  The derived marginals often oscillate
      •  The final results depend on where we stop
   –  As a result, the computed gradient is unstable
      •  This hurts the convergence properties of gradient ascent
      •  The problem is even more severe with line search


Convergent alternatives to Belief Propagation

•  We describe three methods:
   1.  Pseudo-moment matching
       •  Reformulates the task of learning with approximate inference as optimizing an alternative objective
   2.  Maximum entropy approximation (CAMEL)
       •  A general derivation that allows us to reformulate ML with BP as a unified optimization problem with an approximate objective
   3.  MAP-based learning
       •  Approximates the expected feature counts with their counts in the single MAP assignment of the current MN


Pseudo-moment Matching

•  Begin with an analysis of the fixed points of learning
   –  The converged BP beliefs must satisfy

      E_{β_i(C_i)}[f_{C_i}] = E_D[f_i(C_i)],   or equivalently   β_i(c_i) = P̂(c_i)

   –  The convergence point is a set of beliefs that match the data
•  Define, for each sepset S_{i,j} between C_i and C_j, the sepset belief μ_{i,j}(S_{i,j}) = P̂(S_{i,j}), and set

      φ_i ← β_i / μ_{i,j}

   –  We use this final set of potentials as the parameterization of the Markov network
   –  Example: C1 = {A,B,D}, C2 = {B,C,D}, S_{1,2} = {B,D}
•  This provides a closed-form solution for both inference and learning
   –  It cannot be used with parameter regularization, non-table factors, or CRFs
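A rough sketch of pseudo-moment matching on a two-clique tree, under the assumption that φ_i ← β_i/μ_{i,j} is instantiated by dividing each sepset belief into one adjacent clique; the data set below is made up.

```python
from collections import Counter
from itertools import product

# Clique tree C1={A,B,D} -- S12={B,D} -- C2={B,C,D}; made-up data over (A,B,C,D).
data = [(0,0,0,0), (0,0,1,0), (1,0,1,0), (1,1,1,1), (0,1,1,1), (1,1,0,1)]
M = len(data)

def marginal(indices):
    """Empirical marginal P_hat over the given variable indices."""
    counts = Counter(tuple(x[i] for i in indices) for x in data)
    return {key: counts.get(key, 0) / M for key in product([0, 1], repeat=len(indices))}

beta1 = marginal([0, 1, 3])          # P_hat(A,B,D)
beta2 = marginal([1, 2, 3])          # P_hat(B,C,D)
mu12 = marginal([1, 3])              # P_hat(B,D)

# Closed-form "learned" potentials: phi1 <- beta1, phi2 <- beta2 / mu12
phi1 = beta1
phi2 = {k: (beta2[k] / mu12[k[0], k[2]] if mu12[k[0], k[2]] > 0 else 0.0) for k in beta2}

# The product phi1(A,B,D) * phi2(B,C,D) reproduces a distribution whose clique
# marginals match the data; it is already normalized (no partition function needed).
p = {(a, b, c, d): phi1[a, b, d] * phi2[b, c, d] for a, b, c, d in product([0, 1], repeat=4)}
print(sum(p.values()))               # ~1.0
```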


BP and Maximum Entropy

•  A more general derivation allows us to reformulate maximum likelihood learning with belief propagation as a unified optimization problem with an approximate objective
•  This opens the door to the use of better approximation algorithms
•  Start with the dual of the maximum-likelihood problem, the Maximum-Entropy problem:

   Find Q(χ)
   Maximizing H_Q(χ)
   Subject to E_Q[f_i] = E_D[f_i],  i = 1,…,k

Tractable Max Entropy

•  Assume a cluster graph U consisting of a set of clusters {C_i} connected by sepsets S_{i,j}
•  Rather than optimizing maximum entropy over the space of all distributions Q, we optimize over beliefs in the cluster graph, using the factored entropy approximation

   H_Q(χ) ≈ Σ_{C_i ∈ U} H_{β_i}(C_i) − Σ_{(C_i−C_j) ∈ U} H_{μ_{i,j}}(S_{i,j})

   –  The approximation is exact when the cluster graph is a tree, and approximate otherwise

Approx-Maximum-Entropy:
   Find Q(χ)
   Maximizing Σ_{C_i ∈ U} H_{β_i}(C_i) − Σ_{(C_i−C_j) ∈ U} H_{μ_{i,j}}(S_{i,j})
   Subject to E_Q[f_i] = E_D[f_i],  i = 1,…,k
              Q ∈ Local(U)

•  The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning)


Example of Max Ent Learning

•  Pairwise Markov network A—B—C
   –  Variables are binary
   –  Three clusters: C1 = {A,B}, C2 = {B,C}, C3 = {C,A}
   –  Log-linear model with features (for x, y an instance of C_i):
      •  f00(x,y) = 1 if x = 0, y = 0, and 0 otherwise
      •  f11(x,y) = 1 if x = 1, y = 1, and 0 otherwise
   –  Three data instances of (A,B,C): (0,0,0), (0,1,0), (1,0,0)
•  Unnormalized feature counts, pooled over all clusters, are

   E_P̃[f00] = (3 + 1 + 1)/3 = 5/3
   E_P̃[f11] = (0 + 0 + 0)/3 = 0


CAMEL Optimization Problem

•  The optimization problem takes the form given above (the Approx-Maximum-Entropy problem), with two types of constraints:
   –  Type 1 constraints: E_Q[f_i] = E_D[f_i],  i = 1,…,k
   –  Type 2 constraints: the marginals come from the cluster-graph approximation (Q ∈ Local(U))


CAMEL Solutions

•  CAMEL optimization is a constrained maximization problem with
   –  linear constraints, and
   –  a non-concave objective
•  There are several solution algorithms; one of them is to
   –  introduce Lagrange multipliers for all constraints and optimize over the resulting new variables


Sampling-based Learning

•  We wish to maximize the log-likelihood

   ℓ(θ : D) = Σ_i θ_i ( Σ_m f_i(ξ[m]) ) − M ln Z(θ)

•  In the sampling-based approach, we use samples to estimate the partition function Z(θ) and use that estimate to perform gradient ascent


Expressing Z(θ) as an expectation

•  The partition function Z(θ) is a summation over an exponentially large space
   –  One approach to approximating this summation is to reformulate it as an expectation with respect to a distribution Q(χ):

      Z(θ) = Σ_ξ exp{ Σ_i θ_i f_i(ξ) }
           = Σ_ξ Q(ξ) ⋅ (1/Q(ξ)) exp{ Σ_i θ_i f_i(ξ) }
           = E_Q[ (1/Q(χ)) exp{ Σ_i θ_i f_i(χ) } ]

      since E_Q[g(χ)] = Σ_ξ Q(ξ) g(ξ)

   –  This is precisely the form of an importance sampling estimator


Importance sampling

•  To estimate E[f] under p(z), we draw samples {z^(m)} from a simpler proposal distribution q(z):

   E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/M) Σ_{m=1..M} (p(z^(m))/q(z^(m))) f(z^(m))

   (compare direct sampling from p:  E_p[f] ≈ (1/M) Σ_{m=1..M} f(z^(m)))

•  The samples are weighted by the ratios r_m = p(z^(m))/q(z^(m))
   –  Known as importance weights
   –  These correct the bias introduced by sampling from the wrong distribution
•  Unlike rejection sampling, all of the samples are retained
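A hedged one-dimensional illustration of the importance-sampling identity, with toy Gaussian densities that are not from the slides.

```python
import random, math

# Estimate E_p[f] using samples from a proposal q, weighting by p(z)/q(z).
random.seed(0)

def p(z): return math.exp(-0.5 * (z - 2.0) ** 2) / math.sqrt(2 * math.pi)   # target N(2,1)
def q(z): return math.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)           # proposal N(0,1)
def f(z): return z

M = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(M)]          # draw from q
weights = [p(z) / q(z) for z in samples]                      # importance weights r_m
estimate = sum(w * f(z) for w, z in zip(weights, samples)) / M
print(estimate)    # roughly E_p[Z] = 2 (noisy; variance grows as p and q diverge)
```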


Importance Sampling Estimator

•  We can approximate the partition function by generating samples from Q and correcting appropriately via the weights
•  We can simplify this expression by choosing Q to be P_{θ0} for some set of parameters θ0:

   Z(θ) = E_{P_{θ0}}[ Z(θ0) exp{ Σ_i θ_i f_i(χ) } / exp{ Σ_i θ_i^0 f_i(χ) } ]
        = Z(θ0) E_{P_{θ0}}[ exp{ Σ_i (θ_i − θ_i^0) f_i(χ) } ]


Samples to approximate ln Z(θ)

•  If we can sample instances ξ_1,…,ξ_K from P_{θ0}, we can approximate the log-partition function as

   ln Z(θ) ≈ ln( (1/K) Σ_{k=1..K} exp{ Σ_i (θ_i − θ_i^0) f_i(ξ_k) } ) + ln Z(θ0)

•  We can plug this approximation of ln Z(θ) into the log-likelihood and optimize it:

   (1/M) ℓ(θ : D) = Σ_i θ_i E_D[f_i(d_i)] − ln Z(θ)

   where E_D[f_i(d_i)] is the empirical expectation of f_i, i.e., its average in the data set
•  Note that ln Z(θ0) is a constant
   –  It can be ignored in the optimization, and the resulting expression is a simple function of θ that can be optimized using gradient ascent
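A small sketch of this sample-based approximation of ln Z(θ), on a toy model where exact enumeration is still possible so the estimate can be compared against the true value; the features, parameter settings, and sample size are illustrative assumptions.

```python
import itertools, math, random

random.seed(0)
scopes = [(0, 1), (1, 2)]
features = [lambda xs, s=s: float(xs[s[0]] == 1 and xs[s[1]] == 1) for s in scopes]
all_xs = list(itertools.product([0, 1], repeat=3))

def log_Z(theta):
    """Exact ln Z(theta) by enumeration (reference value for the toy model)."""
    return math.log(sum(math.exp(sum(t * f(xs) for t, f in zip(theta, features)))
                        for xs in all_xs))

theta0, theta = [0.2, -0.1], [1.0, 0.5]

# Draw K samples from P_theta0 (exact sampling over the 8 assignments here).
w = [math.exp(sum(t * f(xs) for t, f in zip(theta0, features))) for xs in all_xs]
samples = random.choices(all_xs, weights=w, k=5000)

# ln Z(theta) ~= ln( (1/K) sum_k exp{ sum_i (theta_i - theta_i^0) f_i(xi_k) } ) + ln Z(theta0)
ratios = [math.exp(sum((t - t0) * f(xs) for t, t0, f in zip(theta, theta0, features)))
          for xs in samples]
estimate = math.log(sum(ratios) / len(ratios)) + log_Z(theta0)
print(estimate, log_Z(theta))   # the estimate should be close to the exact value
```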


Gradient ascent + Sampling

•  Gradient ascent over θ relative to

   ln Z(θ) ≈ ln( (1/K) Σ_{k=1..K} exp{ Σ_i (θ_i − θ_i^0) f_i(ξ_k) } ) + ln Z(θ0)

   is equivalent to using an importance sampling estimator directly to approximate the expected counts in the gradient equation

   (∂/∂θ_i) (1/M) ℓ(θ : D) = E_D[f_i(χ)] − E_θ[f_i]

•  However, it is generally more useful to view such methods as exactly optimizing an approximate objective rather than as approximately optimizing the exact likelihood


Quality of importance sampling estimator

•  It depends on the difference between θ and θ0
•  The greater the difference, the larger the variance of the importance weights
•  Thus this type of approximation is reasonable only in a neighborhood surrounding θ0


How to use this approximation?

•  Iterate between two steps:
   1.  Use a sampling procedure to generate samples from the current parameter set θ^t
   2.  Use gradient ascent to find θ^{t+1} that improves the approximate log-likelihood based on the samples
•  We can then regenerate samples and repeat the process
   –  Because the samples are regenerated from the new distribution, we can hope that they come from a distribution not too far from the one we are currently optimizing, maintaining a reasonable approximation

MAP-based Learning

•  Another approach to inference within learning
   –  Approximate the expected feature counts with the counts in the single MAP assignment of the current MN
•  The approximate gradient at θ is

   E_D[f_i(χ)] − f_i(ξ^MAP(θ))

   –  where ξ^MAP(θ) = arg max_ξ P(ξ | θ) is the MAP assignment given the current set of parameters θ
•  This approach is also called Viterbi training
•  It is equivalent to exact optimization of the approximate objective

   (1/M) ℓ(θ : D) − ln P(ξ^MAP(θ) | θ)
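A minimal sketch of this MAP-based (Viterbi-style) gradient approximation, with the MAP assignment found by enumeration on a made-up toy model; all names and values are illustrative assumptions.

```python
import itertools

# E_theta[f_i] is replaced by the feature value at the single MAP assignment.
scopes = [(0, 1), (1, 2)]
features = [lambda xs, s=s: float(xs[s[0]] == 1 and xs[s[1]] == 1) for s in scopes]
data = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
all_xs = list(itertools.product([0, 1], repeat=3))

def score(xs, theta):
    """Unnormalized log-probability sum_i theta_i f_i(xs); enough to rank assignments."""
    return sum(t * f(xs) for t, f in zip(theta, features))

def approx_gradient(theta):
    xi_map = max(all_xs, key=lambda xs: score(xs, theta))        # MAP assignment (by enumeration)
    e_data = [sum(f(xs) for xs in data) / len(data) for f in features]
    return [ed - f(xi_map) for ed, f in zip(e_data, features)]   # E_D[f_i] - f_i(xi_MAP)

print(approx_gradient([0.5, -0.2]))
```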


Ex: CRFs for Protein Structure

•  Predict the 3-D structure of proteins
   –  Proteins are chains of residues, each containing one of 20 possible amino acids
      •  Amino acids are linked together into a common backbone structure onto which amino-acid-specific side chains are attached
   –  Predict the side-chain conformations given the backbone
   –  A full configuration consists of up to four angles, each of which takes on a continuous value
      •  Each angle is discretized into a small number (≈3) of rotamers
•  The optimization is addressed as MAP inference in a CRF