Sampling Methods - Bishop PRML Ch. 11 (vda.univie.ac.at/Teaching/ML/16s/LectureNotes/11...)

Sampling Methods
Bishop PRML Ch. 11

Alireza Ghane

Sampling Methods A. Ghane / T. Möller / G. Mori

Recall – Inference For General Graphs

• The junction tree algorithm is an exact inference method for arbitrary graphs
  • A particular tree structure defined over cliques of variables
  • Inference ends up being exponential in the maximum clique size
  • Therefore slow in many cases

• Sampling methods: represent the desired distribution with a set of samples; as more samples are used, we obtain a more accurate representation

Outline

Sampling

Rejection Sampling

Importance Sampling

Markov Chain Monte Carlo

Sampling

• The fundamental problem we address in this lecture is how to obtain samples from a probability distribution p(z)
  • This could be a conditional distribution p(z|e)
• We often wish to evaluate expectations such as

  $$E[f] = \int f(z)\, p(z)\, dz$$

  • e.g. the mean when f(z) = z
• For complicated p(z), this is difficult to do exactly; approximate it as

  $$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

  where $\{z^{(l)} \mid l = 1, \ldots, L\}$ are independent samples from p(z)

Sampling

[Figure: curves p(z) and f(z) plotted against z]

• Approximate

  $$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

  where $\{z^{(l)} \mid l = 1, \ldots, L\}$ are independent samples from p(z)
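The sample average above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the helper name `mc_expectation` and the choice of p(z) as a Gaussian with mean 2 and unit variance are assumptions made only for the example.

```python
import random

def mc_expectation(f, sampler, num_samples, rng):
    """Estimate E[f] = integral of f(z) p(z) dz by averaging f over draws from p."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler(rng))
    return total / num_samples

rng = random.Random(0)

# Assumed example distribution: p(z) is Gaussian with mean 2, standard deviation 1.
def draw_from_p(r):
    return r.gauss(2.0, 1.0)

mean_hat = mc_expectation(lambda z: z, draw_from_p, 20000, rng)               # f(z) = z
second_moment_hat = mc_expectation(lambda z: z * z, draw_from_p, 20000, rng)  # f(z) = z^2
print(mean_hat, second_moment_hat)
```

For this assumed p(z) the exact values are 2 and 5, so the printed estimates should land close to them, and the error shrinks as L grows.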

Bayesian Networks - Generating Fair Samples

[Figure: Bayesian network over Cloudy, Sprinkler, Rain and WetGrass with conditional probability tables P(C), P(S|C), P(R|C) and P(W|S,R); P(C) = .50. From Russell and Norvig, AIMA]

• How can we generate a fair set of samples from this BN?

Sampling from Bayesian Networks

• Sampling from discrete Bayesian networks with no observations is straightforward, using ancestral sampling

• The Bayesian network specifies a factorization of the joint distribution

  $$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{pa}(z_i))$$

• Sample in order: sample parents before children
  • Possible because the graph is a DAG

• Choose the value for $z_i$ from $p(z_i \mid \mathrm{pa}(z_i))$
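As a rough sketch, ancestral sampling for the Cloudy/Sprinkler/Rain/WetGrass network could look as follows. Only the values 0.5, 0.1, 0.99 and .01 appear in these slides; the remaining table entries, the placement of .01 in the WetGrass table, and the function name `ancestral_sample` are assumptions chosen for illustration.

```python
import random

# Conditional probability tables. P(C) = 0.5, P(S=t|C=t) = 0.1 and
# P(W=t|S=t,R=t) = 0.99 appear in the slides; all other entries are assumed.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(sprinkler = t | cloudy)
P_R = {True: 0.8, False: 0.2}                        # P(rain = t | cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(wetgrass = t | sprinkler, rain)

def ancestral_sample(rng):
    """Sample every variable after its parents, following the DAG order."""
    cloudy = rng.random() < P_C
    sprinkler = rng.random() < P_S[cloudy]
    rain = rng.random() < P_R[cloudy]
    wetgrass = rng.random() < P_W[(sprinkler, rain)]
    return {"cloudy": cloudy, "sprinkler": sprinkler, "rain": rain, "wetgrass": wetgrass}

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(20000)]

# Fraction of samples equal to one particular joint event.
event = {"cloudy": True, "sprinkler": False, "rain": True, "wetgrass": True}
freq_event = sum(s == event for s in samples) / len(samples)
print(freq_event)
```

Under the assumed tables the joint probability of the chosen event is 0.5 × 0.9 × 0.8 × 0.9 = 0.324, which the empirical frequency should approach.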

Sampling From Empty Network – Example

[Figure sequence: the Cloudy/Sprinkler/Rain/WetGrass network with its tables P(C), P(S|C), P(R|C) and P(W|S,R), shown over seven build steps as the variables are sampled in turn]

Ancestral Sampling

• This sampling procedure is fair: the fraction of samples with a particular value tends towards the joint probability of that value

• Define $S_{PS}(z_1, \ldots, z_n)$ to be the probability of generating the event $(z_1, \ldots, z_n)$
  • This is equal to $p(z_1, \ldots, z_n)$ due to the semantics of the Bayesian network

• Define $N_{PS}(z_1, \ldots, z_n)$ to be the number of times we generate the event $(z_1, \ldots, z_n)$ in N samples

  $$\lim_{N \to \infty} \frac{N_{PS}(z_1, \ldots, z_n)}{N} = S_{PS}(z_1, \ldots, z_n) = p(z_1, \ldots, z_n)$$

Sampling Marginals

[Figure: the Cloudy/Sprinkler/Rain/WetGrass network with its conditional probability tables]

• Note that this procedure can be applied to generate samples for marginals as well

• Simply discard the portions of a sample which are not needed

• e.g. For the marginal p(rain), the sample (cloudy = t, sprinkler = f, rain = t, wg = t) just becomes (rain = t)

• Still a fair sampling procedure

Sampling with Evidence

• What if we observe some values and want samples from p(z|e)?
• Naive method, logic sampling:
  • Generate N samples from p(z) using ancestral sampling
  • Discard those samples that do not have the correct evidence values

• e.g. For p(rain | cloudy = t, spr = t, wg = t), the sample (cloudy = t, spr = f, rain = t, wg = t) is discarded

• Generates fair samples, but wastes time
  • Many samples will be discarded when p(e) is low
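Logic sampling can be sketched as ancestral sampling followed by a filter on the evidence. In the sketch below the table entries other than 0.5, 0.1 and 0.99 are assumed, as are the names `logic_sampling` and the evidence set (spr = t, wg = t), which is smaller than the one in the example above.

```python
import random

# Tables: 0.5, 0.1 and 0.99 appear in the slides; the other entries are assumed.
P_S = {True: 0.1, False: 0.5}                        # P(spr = t | cloudy)
P_R = {True: 0.8, False: 0.2}                        # P(rain = t | cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(wg = t | spr, rain)

def ancestral_sample(rng):
    cloudy = rng.random() < 0.5
    spr = rng.random() < P_S[cloudy]
    rain = rng.random() < P_R[cloudy]
    wg = rng.random() < P_W[(spr, rain)]
    return {"cloudy": cloudy, "spr": spr, "rain": rain, "wg": wg}

def logic_sampling(evidence, num_samples, rng):
    """Keep only the ancestral samples that agree with every evidence value."""
    kept = []
    for _ in range(num_samples):
        sample = ancestral_sample(rng)
        if all(sample[name] == value for name, value in evidence.items()):
            kept.append(sample)
    return kept

rng = random.Random(0)
num_samples = 50000
kept = logic_sampling({"spr": True, "wg": True}, num_samples, rng)
keep_rate = len(kept) / num_samples                   # estimates p(e)
p_rain = sum(s["rain"] for s in kept) / len(kept)     # estimates p(rain = t | e)
print(keep_rate, p_rain)
```

With these assumed tables p(e) is about 0.278, so roughly 72% of the generated samples are thrown away; rarer evidence makes the waste correspondingly worse.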

Other Problems

• Continuous variables?
  • Gaussian okay, Box-Muller and other methods
  • More complex distributions?

• Undirected graphs (MRFs)?
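For reference, the basic trigonometric form of the Box-Muller transform turns two uniform random numbers into two independent standard Gaussian samples. The sketch below is one common way to write it; the function name `box_muller` is chosen here for illustration.

```python
import math
import random

def box_muller(rng):
    """Map two uniform draws to two independent N(0, 1) draws."""
    u1 = 1.0 - rng.random()          # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(0)
values = []
for _ in range(20000):
    a, b = box_muller(rng)
    values.extend([a, b])

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(mean, variance)
```

The printed mean and variance should be close to 0 and 1 respectively.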

Rejection Sampling

[Figure: curves p(z) and f(z) plotted against z]

• Consider the case of an arbitrary, continuous p(z)
  • How can we draw samples from it?
• Assume we can evaluate p(z) up to some normalization constant

  $$p(z) = \frac{1}{Z_p}\, \tilde{p}(z)$$

  where $\tilde{p}(z)$ can be efficiently evaluated (e.g. MRF)

Proposal Distribution

[Figure: unnormalized density p̃(z) lying under the scaled proposal kq(z), with the height kq(z0) marked at a point z0]

• Let's also assume that we have some simpler distribution q(z), called a proposal distribution, from which we can easily draw samples
  • e.g. q(z) is a Gaussian

• We can then draw samples from q(z) and use these
  • But these wouldn't be fair samples from p(z)?!

Comparison Function and Rejection

[Figure: unnormalized density p̃(z) lying under the scaled proposal kq(z), with the height kq(z0) marked at a point z0]

• Introduce a constant k such that kq(z) ≥ p̃(z) for all z
• Rejection sampling procedure:
  • Generate z0 from q(z)
  • Generate u0 from [0, kq(z0)] uniformly
  • If u0 > p̃(z0), reject the sample z0; otherwise keep it

• Original pairs (z0, u0) are uniform under the curve kq(z)
• Kept pairs are uniform under p̃(z) (the white region), hence the kept z0 are samples from p(z)
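The procedure can be sketched as below. The target p̃(z) = exp(-z²/2)(1 + sin²(3z)) and the standard Gaussian proposal are assumptions picked for the example because the bound kq(z) ≥ p̃(z) then holds exactly with k = 2√(2π); none of these choices come from the lecture.

```python
import math
import random

def p_tilde(z):
    """Assumed unnormalized target density."""
    return math.exp(-0.5 * z * z) * (1.0 + math.sin(3.0 * z) ** 2)

def q_pdf(z):
    """Proposal density: standard Gaussian."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Since 1 + sin^2 <= 2, p_tilde(z) <= 2 exp(-z^2/2) = K * q_pdf(z) for all z.
K = 2.0 * math.sqrt(2.0 * math.pi)

def rejection_sample(num_proposals, rng):
    kept = []
    for _ in range(num_proposals):
        z0 = rng.gauss(0.0, 1.0)                  # generate z0 from q(z)
        u0 = rng.uniform(0.0, K * q_pdf(z0))      # generate u0 uniformly on [0, k q(z0)]
        if u0 <= p_tilde(z0):                     # keep unless u0 > p_tilde(z0)
            kept.append(z0)
    return kept

rng = random.Random(0)
num_proposals = 20000
kept = rejection_sample(num_proposals, rng)
accept_rate = len(kept) / num_proposals
sample_mean = sum(kept) / len(kept)
print(accept_rate, sample_mean)
```

For this particular pair the acceptance probability works out to about 0.75, and the target is symmetric about zero, so the sample mean should be near 0.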

Rejection Sampling Analysis

• How likely are we to keep samples?
• The probability that a sample is accepted is:

  $$p(\text{accept}) = \int \left\{ \frac{\tilde{p}(z)}{k q(z)} \right\} q(z)\, dz = \frac{1}{k} \int \tilde{p}(z)\, dz$$

• Smaller k is better, subject to kq(z) ≥ p̃(z) for all z
  • If q(z) is similar to p̃(z), this is easier

• In high-dimensional spaces, the acceptance ratio falls off exponentially
• Finding a suitable k is challenging

Discretization

• Importance sampling is a sampling technique for computing expectations:

  $$E[f] = \int f(z)\, p(z)\, dz$$

• Could approximate using discretization over a uniform grid:

  $$E[f] \approx \sum_{l=1}^{L} f(z^{(l)})\, p(z^{(l)})$$

  • cf. Riemann sum

• Much wasted computation, exponential scaling in dimension
• Instead, again use a proposal distribution in place of the uniform grid

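Importance sampling

[Figure: curves p(z) and f(z) plotted against z]

• Approximate the expectation by sampling $z^{(l)}$ from q(z):

  $$E[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})$$

• The quantities $p(z^{(l)})/q(z^{(l)})$ are known as importance weights
  • They correct for the use of the wrong distribution q(z) in sampling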
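A minimal sketch of the weighted average, assuming a normalized target p(z) = N(0, 1), a wider proposal q(z) = N(0, 2²) and f(z) = z²; these distributions and the helper names are illustrative choices, not taken from the lecture.

```python
import math
import random

def normal_pdf(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, num_samples, rng):
    """Average f(z) * p(z) / q(z) over draws z from the proposal q."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 2.0)                                   # draw from q(z)
        weight = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 0.0, 2.0)  # importance weight p/q
        total += weight * f(z)
    return total / num_samples

rng = random.Random(0)
estimate = importance_estimate(lambda z: z * z, 20000, rng)
print(estimate)
```

The exact value of E[z²] under the assumed p(z) is 1. A proposal with heavier tails than the target, as here, keeps the weights bounded; a proposal that is too narrow can make the estimate very noisy.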

Likelihood Weighted Sampling

• Consider the case where we have evidence e and again desire an expectation over p(x|e)
• If we have a Bayesian network, we can use a particular type of importance sampling called likelihood weighted sampling:
  • Perform ancestral sampling
  • If a variable $z_i$ is in the evidence set, set its value rather than sampling

• Importance weights are:

  $$\frac{p(z^{(l)})}{q(z^{(l)})} = \frac{p(x, e)}{p(e)} \cdot \frac{1}{\prod_{z_i \notin e} p(z_i \mid \mathrm{pa}_i)} \propto \prod_{z_i \in e} p(z_i \mid \mathrm{pa}_i)$$

Likelihood Weighted Sampling Example

[Figure sequence: the Cloudy/Sprinkler/Rain/WetGrass network is processed variable by variable over five build steps; the weight starts at w = 1.0, becomes w = 1.0 × 0.1 when the first evidence variable is reached, and ends at w = 1.0 × 0.1 × 0.99 = 0.099]
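The weight bookkeeping in this example can be sketched as follows, assuming the evidence is sprinkler = t and wetgrass = t. Table entries other than 0.5, 0.1 and 0.99 are assumed, and the names `weighted_sample` and `posterior_prob` are hypothetical.

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # entries other than 0.99 are assumed

# Variables in ancestral order, each with P(variable = t | parents so far).
ORDER = [
    ("cloudy", lambda s: 0.5),
    ("sprinkler", lambda s: 0.1 if s["cloudy"] else 0.5),   # 0.5 is assumed
    ("rain", lambda s: 0.8 if s["cloudy"] else 0.2),        # both values assumed
    ("wetgrass", lambda s: P_W[(s["sprinkler"], s["rain"])]),
]

def weighted_sample(evidence, rng):
    """Clamp evidence variables, sample the rest, and accumulate the weight."""
    sample, weight = {}, 1.0
    for name, prob_true in ORDER:
        p = prob_true(sample)
        if name in evidence:
            sample[name] = evidence[name]
            weight *= p if evidence[name] else 1.0 - p   # multiply in p(z_i | pa_i)
        else:
            sample[name] = rng.random() < p
    return sample, weight

def posterior_prob(query, evidence, num_samples, rng):
    """Self-normalized weighted estimate of p(query = t | evidence)."""
    numerator = denominator = 0.0
    for _ in range(num_samples):
        sample, weight = weighted_sample(evidence, rng)
        denominator += weight
        if sample[query]:
            numerator += weight
    return numerator / denominator

rng = random.Random(0)
evidence = {"sprinkler": True, "wetgrass": True}
p_rain = posterior_prob("rain", evidence, 20000, rng)
print(p_rain)
```

Whenever cloudy and rain both come out true, the accumulated weight is 0.1 × 0.99 = 0.099, matching the slide sequence. Unlike logic sampling, no sample is discarded.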

Sampling Importance Resampling

• Note that importance sampling, e.g. likelihood weighted sampling, gives an approximation to an expectation, not samples

• But samples can be obtained using these ideas
• Sampling-importance-resampling uses a proposal distribution q(z) to generate samples
  • Unlike rejection sampling, no parameter k is needed

SIR - Algorithm

• The sampling-importance-resampling algorithm has two stages
• Sampling:
  • Draw samples $z^{(1)}, \ldots, z^{(L)}$ from the proposal distribution q(z)
• Importance resampling:
  • Put weights on the samples

    $$w_l = \frac{\tilde{p}(z^{(l)}) / q(z^{(l)})}{\sum_m \tilde{p}(z^{(m)}) / q(z^{(m)})}$$

  • Draw samples from the discrete set $z^{(1)}, \ldots, z^{(L)}$, picking $z^{(l)}$ with probability $w_l$

• This two-stage process is correct in the limit as L → ∞
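The two stages can be sketched as below. The unnormalized target p̃(z) = exp(-(z-1)²/2) and the proposal N(0, 2²) are assumptions for the example; the resampling step uses `random.Random.choices`, which draws with replacement according to the given weights.

```python
import math
import random

def p_tilde(z):
    """Assumed unnormalized target: a Gaussian bump centred at 1 with unit variance."""
    return math.exp(-0.5 * (z - 1.0) ** 2)

def q_pdf(z):
    """Proposal density N(0, 2^2)."""
    return math.exp(-0.5 * (z / 2.0) ** 2) / (2.0 * math.sqrt(2.0 * math.pi))

def sir(num_samples, rng):
    # Stage 1: draw from the proposal.
    zs = [rng.gauss(0.0, 2.0) for _ in range(num_samples)]
    # Stage 2: normalized importance weights, then resample with replacement.
    raw = [p_tilde(z) / q_pdf(z) for z in zs]
    total = sum(raw)
    weights = [r / total for r in raw]
    return rng.choices(zs, weights=weights, k=num_samples)

rng = random.Random(0)
resampled = sir(20000, rng)
mean = sum(resampled) / len(resampled)
variance = sum((z - mean) ** 2 for z in resampled) / len(resampled)
print(mean, variance)
```

For the assumed target the resampled points should have mean and variance both close to 1. Only ratios of p̃ values enter the weights, so the normalization constant of the target is never needed.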

Markov Chain Monte Carlo

• Markov chain Monte Carlo (MCMC) methods also use a proposal distribution to generate samples from another distribution

• Unlike the previous methods, we keep track of the samples generated, $z^{(1)}, \ldots, z^{(\tau)}$

• The proposal distribution depends on the current state: $q(z \mid z^{(\tau)})$
  • Intuitively, walking around in state space, each step depends only on the current state

Metropolis Algorithm

• Simple algorithm for walking around in state space:
  • Draw sample $z^* \sim q(z \mid z^{(\tau)})$
  • Accept the sample with probability

    $$A(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}\right)$$

  • If accepted, $z^{(\tau+1)} = z^*$; else $z^{(\tau+1)} = z^{(\tau)}$

• Note that if $z^*$ is more probable than $z^{(\tau)}$, it is always accepted
• Every iteration produces a sample
  • Though sometimes it's the same as the previous one
  • Contrast with rejection sampling

• The basic Metropolis algorithm assumes the proposal distribution is symmetric: $q(z_A \mid z_B) = q(z_B \mid z_A)$

Metropolis Example

[Figure: Metropolis sample path in two dimensions, with axis ticks running from 0 to 3]

• p(z) is an anisotropic Gaussian; the proposal distribution q(z) is an isotropic Gaussian

• Red lines show rejected moves, green lines show accepted moves
• As τ → ∞, the distribution of $z^{(\tau)}$ tends to p(z)
  • True if $q(z_A \mid z_B) > 0$
  • In practice, burn in the chain: collect samples only after some initial iterations
  • Only keep every M-th sample
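A minimal sketch of the Metropolis algorithm with burn-in and thinning, in the spirit of this example. The axis-aligned Gaussian target (standard deviations 1.5 and 0.5), the proposal step size, and the burn-in and thinning settings are all assumptions made for illustration.

```python
import math
import random

def log_p_tilde(z):
    """Log of an assumed unnormalized anisotropic Gaussian (std 1.5 and 0.5)."""
    return -0.5 * ((z[0] / 1.5) ** 2 + (z[1] / 0.5) ** 2)

def metropolis(num_iters, step, burn_in, thin, rng):
    z = (0.0, 0.0)
    kept, accepted = [], 0
    for t in range(num_iters):
        # Symmetric isotropic Gaussian proposal centred on the current state.
        z_star = (z[0] + rng.gauss(0.0, step), z[1] + rng.gauss(0.0, step))
        log_ratio = log_p_tilde(z_star) - log_p_tilde(z)
        # Accept with probability min(1, p_tilde(z*) / p_tilde(z)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            z = z_star
            accepted += 1
        # Every iteration yields a state; keep every thin-th one after burn-in.
        if t >= burn_in and (t - burn_in) % thin == 0:
            kept.append(z)
    return kept, accepted / num_iters

rng = random.Random(0)
samples, accept_rate = metropolis(60000, 0.5, 1000, 10, rng)
n = len(samples)
mean0 = sum(z[0] for z in samples) / n
mean1 = sum(z[1] for z in samples) / n
var0 = sum((z[0] - mean0) ** 2 for z in samples) / n
var1 = sum((z[1] - mean1) ** 2 for z in samples) / n
print(accept_rate, var0, var1)
```

The sample variances should approach 2.25 and 0.25 for this assumed target. Working with log densities avoids numerical underflow, and a rejected proposal still contributes the repeated current state to the chain.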

Metropolis Example - Graphical Model

[Figure: the Cloudy/Sprinkler/Rain/WetGrass network with its conditional probability tables]

• Consider running the Metropolis algorithm to draw samples from p(cloudy, rain | spr = t, wg = t)

• Define $q(z \mid z^{(\tau)})$ as: pick one of cloudy, rain uniformly at random, then reset its value uniformly at random

Metropolis Example

[Figure: four copies of the Cloudy/Sprinkler/Rain/WetGrass network]

• Walk around in this state space, keeping track of how many times each state occurs

Metropolis-Hastings Algorithm

• A generalization of the previous algorithm to asymmetric proposal distributions, known as the Metropolis-Hastings algorithm

• Accept a step with probability

  $$A(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)\, q(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)})\, q(z^* \mid z^{(\tau)})}\right)$$

• A sufficient condition for this algorithm to produce the correct distribution is detailed balance
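To illustrate the correction for an asymmetric proposal, the sketch below uses a multiplicative log-normal random walk on z > 0, for which q(z*|z) ≠ q(z|z*). The target p̃(z) = z² exp(-z) (an unnormalized Gamma density with mean 3) and all names and settings are assumptions for the example.

```python
import math
import random

def p_tilde(z):
    """Assumed unnormalized target on z > 0: Gamma with shape 3 and rate 1."""
    return z * z * math.exp(-z)

def q_pdf(z_new, z_old, sigma):
    """Log-normal proposal density q(z_new | z_old); not symmetric in its arguments."""
    d = (math.log(z_new) - math.log(z_old)) / sigma
    return math.exp(-0.5 * d * d) / (z_new * sigma * math.sqrt(2.0 * math.pi))

def metropolis_hastings(num_iters, sigma, rng):
    z = 1.0
    chain = []
    for _ in range(num_iters):
        z_star = z * math.exp(sigma * rng.gauss(0.0, 1.0))   # draw z* ~ q(. | z)
        ratio = (p_tilde(z_star) * q_pdf(z, z_star, sigma)) / (
            p_tilde(z) * q_pdf(z_star, z, sigma))
        if rng.random() < min(1.0, ratio):                   # Metropolis-Hastings acceptance
            z = z_star
        chain.append(z)
    return chain

rng = random.Random(0)
chain = metropolis_hastings(50000, 0.8, rng)
post_burn_in = chain[2000:]
mean = sum(post_burn_in) / len(post_burn_in)
print(mean)
```

The chain mean should settle near 3. Dropping the q terms from the ratio would turn this into plain Metropolis with an asymmetric proposal, which in general does not converge to the intended distribution.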

Gibbs Sampling

• A simple coordinate-wise MCMC method
• Given distribution $p(z) = p(z_1, \ldots, z_M)$, sample each variable in turn (either in pre-defined or random order)
  • Sample $z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})$
  • Sample $z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})$
  • ...
  • Sample $z_M^{(\tau+1)} \sim p(z_M \mid z_1^{(\tau+1)}, z_2^{(\tau+1)}, \ldots, z_{M-1}^{(\tau+1)})$

• These are easy if the Markov blanket is small, e.g. in an MRF with small cliques, and the conditionals have forms amenable to sampling

Gibbs Sampling - Example

Gibbs Sampling Example - Graphical Model

[Figure: the Cloudy/Sprinkler/Rain/WetGrass network with its conditional probability tables]

• Consider running Gibbs sampling on p(cloudy, rain | spr = t, wg = t)

• $q(z \mid z^{(\tau)})$: pick one of cloudy, rain, and reset its value according to p(cloudy | rain, spr, wg) (or p(rain | cloudy, spr, wg))

• This is often easy: we only need to look at the Markov blanket
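A sketch of Gibbs sampling for this example follows. It sweeps over cloudy and rain in a fixed order rather than picking one at random, and it computes each conditional from the full joint for brevity (terms outside the Markov blanket cancel in the ratio). Table entries other than 0.5, 0.1 and 0.99 are assumed.

```python
import random

P_S = {True: 0.1, False: 0.5}                        # P(spr = t | cloudy); 0.5 assumed
P_R = {True: 0.8, False: 0.2}                        # P(rain = t | cloudy); assumed
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # entries other than 0.99 assumed

def joint(cloudy, spr, rain, wg):
    """Joint probability from the Bayesian network factorization."""
    p = 0.5
    p *= P_S[cloudy] if spr else 1.0 - P_S[cloudy]
    p *= P_R[cloudy] if rain else 1.0 - P_R[cloudy]
    p *= P_W[(spr, rain)] if wg else 1.0 - P_W[(spr, rain)]
    return p

def gibbs(num_iters, burn_in, rng):
    state = {"cloudy": True, "rain": True}           # evidence spr = t, wg = t stays fixed
    counts = {}
    for t in range(num_iters):
        for var in ("cloudy", "rain"):
            on = dict(state, **{var: True})
            off = dict(state, **{var: False})
            p_on = joint(on["cloudy"], True, on["rain"], True)
            p_off = joint(off["cloudy"], True, off["rain"], True)
            # Resample var from its conditional given everything else.
            state[var] = rng.random() < p_on / (p_on + p_off)
        if t >= burn_in:
            key = (state["cloudy"], state["rain"])
            counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

rng = random.Random(0)
freqs = gibbs(40000, 1000, rng)
p_rain = sum(p for (cloudy, rain), p in freqs.items() if rain)
print(freqs, p_rain)
```

With the assumed tables, p(rain = t | spr = t, wg = t) is about 0.32, and the visit frequencies of the four (cloudy, rain) states approximate the conditional joint. Every proposed value is accepted, which is what distinguishes Gibbs sampling from the general Metropolis-Hastings step.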

Conclusion

• Readings: Ch. 11.1-11.3 (we skipped much of it)
• Sampling methods use proposal distributions to obtain samples from complicated distributions
• Different methods correct in different ways for the proposal distribution not matching the desired distribution
• In practice, effectiveness relies on having a good proposal distribution, one that matches the desired distribution well
