Tutorial on Bayesian Data Analysis

Tutorial on Bayesian Data Analysis Phil Gregory University of British Columbia Nov. 2011

Transcript of Tutorial on Bayesian Data Analysis

Page 1: Tutorial on Bayesian Data Analysis

Tutorial on Bayesian Data Analysis

Phil Gregory

University of British Columbia

Nov. 2011

Page 2: Tutorial on Bayesian Data Analysis

Chapters

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Hardcover ISBN 9780521841504 | Paperback ISBN 9780521150125

Resources and solutions: This title has free Mathematica-based support software available.

Introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

Page 3: Tutorial on Bayesian Data Analysis

Resources and solutions: www.cambridge.org/9780521150125

Book Preface (11 Kb)

Errata and Revisions (375 Kb)

Additional book examples with Mathematica 6 tutorial (5.2MB, nb)

Additional book examples with Mathematica 7 tutorial (5.4MB, nb)

Additional book examples with Mathematica 8 tutorial (5.4MB, nb)

Worked solutions for selected problems in Mathematica 7 (7.5MB, nb)

Mathematica Player: Free Interactive Player for ...

Go to Mathematica support material

Page 4: Tutorial on Bayesian Data Analysis

Outline (Lecture based on Chapter 3 of my book)

1. Occam’s Razor & Model Selection
2. A simple spectral line problem
   • Background (prior) information
   • Data
   • Hypothesis space of interest
   • Model Selection
   • Choice of prior
   • Likelihood calculation
   • Model selection results
   • Parameter estimation
   • Weak spectral line case
   • Additional parameters

3. Generalizations


Page 5: Tutorial on Bayesian Data Analysis

Occam’s Razor and Model Selection

Occam's razor is a principle attributed to the mediaeval philosopher William of Occam. The principle states that one should not make more assumptions than the minimum needed. It underlies all scientific modeling and theory building.

It cautions us to choose from a set of otherwise equivalent models (hypotheses) of a given phenomenon the simplest one.

In any given model, Occam's razor helps us to "shave off" those concepts, variables or constructs that are not really needed to explain the phenomenon.

It was previously thought to be only a qualitative principle. One of the great advantages of Bayesian analysis is that it allows for a quantitative evaluation of Occam’s razor.

Page 6: Tutorial on Bayesian Data Analysis

Occam’s Razor and Model Selection

Imagine comparing two models: M1 with a single parameter, θ, and M0 with θ fixed at some default value θ0 (so M0 has no free parameter). We will compare the two models by computing the ratio of their posterior probabilities, or odds (O10).

Expand the numerator and denominator with Bayes’ theorem:

O10 = p(M1|D,I) / p(M0|D,I)
    = [p(M1|I) p(D|M1,I) / p(D|I)] / [p(M0|I) p(D|M0,I) / p(D|I)]
    = [p(M1|I) / p(M0|I)] × [p(D|M1,I) / p(D|M0,I)]

posterior probability ratio = prior probability ratio × Bayes factor

Page 7: Tutorial on Bayesian Data Analysis

Occam’s Razor and Model Selection

Odds = p(M1|D,I) / p(M0|D,I) = [p(M1|I) / p(M0|I)] × [p(D|M1,I) / p(D|M0,I)]

The 2nd term is called the Bayes factor B10.

Suppose prior Odds = 1

p(D|M1,I) = ∫ dθ p(θ|M1,I) p(D|θ,M1,I)    ← Marginal likelihood for M1

B10 = p(D|M1,I) / p(D|M0,I) = marginal likelihood ratio (called the global likelihood ratio in my book)

In words: the marginal likelihood for a model is the weighted average likelihood for its parameter(s). The weighting function is the prior for the parameter.

Page 8: Tutorial on Bayesian Data Analysis

To develop our intuition about the Occam penalty we will carry out a back-of-the-envelope calculation for the Bayes factor.

Approximate the prior, p(θ|M1,I), by a uniform distribution of width ∆θ. Therefore p(θ|M1,I) = 1/∆θ.

p(D|M1,I) = ∫ dθ p(θ|M1,I) p(D|θ,M1,I) = (1/∆θ) ∫ dθ p(D|θ,M1,I)

Often the data provide us with more information about parameters than we had without the data, so that the likelihood function, p(D|θ,M1,I), will be much more “peaked” than the prior, p(θ|M1,I) .

Approximate the likelihood function, p(D|θ,M1,I), by a Gaussian distribution of characteristic width δθ.

p(D|M1,I) ≈ p(D|θ̂,M1,I) × δθ/∆θ

where p(D|θ̂,M1,I) is the maximum value of the likelihood.
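To see where the factor δθ/∆θ comes from, note that a likelihood peak of height p(D|θ̂,M1,I) and characteristic width δθ has area ≈ p(D|θ̂,M1,I) δθ, so

(1/∆θ) ∫ dθ p(D|θ,M1,I) ≈ p(D|θ̂,M1,I) × δθ/∆θ

where θ̂ denotes the value of θ that maximizes the likelihood.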

Page 9: Tutorial on Bayesian Data Analysis

Since model M0 has no free parameters the marginal likelihood is also the maximum likelihood, and there is no Occam factor.

p(D|M0,I) = p(D|θ0,M0,I)

Now pull the relevant equations together:

B10 = p(D|M1,I) / p(D|M0,I) = [p(D|θ̂,M1,I) / p(D|θ0,M0,I)] × δθ/∆θ

Occam penalty continued

Page 10: Tutorial on Bayesian Data Analysis

Occam penalty continued

B10 = [p(D|θ̂,M1,I) / p(D|θ0,M0,I)] × δθ/∆θ

The factor δθ/∆θ is called the Occam factor, Ωθ.

The maximum likelihood ratio in the first factor can never favor the simpler model because M1 contains it as a special case. However, since the posterior width, δθ, is narrower than the prior width, ∆θ, the Occam factor penalizes the complicated model for any “wasted” parameter space that gets ruled out by the data.

The Bayes factor will thus favor the more complicated model only if the likelihood ratio is large enough to overcome this Occam factor.
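As a purely illustrative example: if the prior range were ∆θ = 100 (in whatever units θ carries) and the data localized the parameter to δθ = 1, the Occam factor would be Ωθ = δθ/∆θ = 0.01, and the maximum likelihood ratio would have to exceed 100 before B10 favored the more complicated model M1.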

Suppose M1 had two parameters θ and φ. Then the Bayes factor would have two Occam factors

B10 = [p(D|θ̂, φ̂, M1, I) / p(D|θ0, φ0, M0, I)] × (δθ/∆θ) × (δφ/∆φ)
    = maximum likelihood ratio × Ωθ × Ωφ

Page 11: Tutorial on Bayesian Data Analysis

Occam’s razor continued

Note: parameter estimation is like model selection, only we are comparing models with the same complexity, so the Occam factors cancel out.

Warning: do not try to use a parameter estimation analysis to do model selection.

Page 12: Tutorial on Bayesian Data Analysis

Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.

Prior estimates of the line strength expected from the Orion nebula according to theory 1 range from 0.1 to 100 mK.

Page 13: Tutorial on Bayesian Data Analysis

Simple Spectral Line Problem

The predicted line shape has the form

T fi = T exp[−(νi − ν0)² / (2 σL²)]

where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency, νi, is in units of the spectrometer channel number and the line center frequency ν0 = 37.

[Figure: the line profile fi plotted against channel number.]

Page 14: Tutorial on Bayesian Data Analysis

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

Data

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent. The line center frequency ν0 = 37.

Page 15: Tutorial on Bayesian Data Analysis

Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum,

1) what do we conclude about the relative probabilities of the two competing theories

and 2) what is the posterior PDF for the model parameters?

Hypothesis space of interest for model selection part:

M0 ≡ “Model 0, no line exists”
M1 ≡ “Model 1, line exists”

M1 has one unknown parameter, the line temperature T.

M0 has no unknown parameters.

Page 16: Tutorial on Bayesian Data Analysis

To answer the model selection question, we compute the odds ratio (O10) of model M1 to model M0 .

Model selection

p(D|M1,I) = ∫ dT p(D, T | M1, I) = ∫ dT p(T|M1,I) p(D|M1,T,I)    (expanded with the product rule)

p(D|M1,I) is called the marginal (or global) likelihood of M1.

The marginal likelihood of a model is equal to the weighted average likelihood for its parameters.

Expand the numerator and denominator with Bayes’ theorem:

O10 = p(M1|D,I) / p(M0|D,I)
    = [p(M1|I) p(D|M1,I) / p(D|I)] / [p(M0|I) p(D|M0,I) / p(D|I)]
    = [p(M1|I) / p(M0|I)] × [p(D|M1,I) / p(D|M0,I)]

posterior probability ratio = prior probability ratio × Bayes factor

Page 17: Tutorial on Bayesian Data Analysis

Investigate two common choices of prior for T.

1. Uniform prior: p(T|M1,I) = 1/(Tmax − Tmin) for Tmin ≤ T ≤ Tmax

There is a problem with this prior if the range of T is large. In the current example Tmin = 0.1 and Tmax = 100 mK. Compare the probability that T lies in the upper decade of the prior range (10 to 100 mK) to the lowest decade (0.1 to 1 mK).
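For example, with the uniform prior p(T|M1,I) = 1/99.9 per mK:

p(10 ≤ T ≤ 100 | M1,I) = 90/99.9 ≈ 0.90
p(0.1 ≤ T ≤ 1 | M1,I) = 0.9/99.9 ≈ 0.009

so the uniform prior assigns roughly 100 times more probability to the top decade than to the bottom one, which is a strong statement about the scale of T rather than an expression of ignorance.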

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance or equal probability per decade. The Jeffreys prior, discussed next, has this scale invariant property.

Page 18: Tutorial on Bayesian Data Analysis

2. Jeffreys scale invariant prior

p(T|M1,I) dT = dT / [T ln(Tmax/Tmin)]

or equivalently

p(ln T|M1,I) d ln T = d ln T / ln(Tmax/Tmin)

∫ p(T|M1,I) dT over [0.1, 1]  =  ∫ p(T|M1,I) dT over [10, 100]    (equal probability per decade)
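Explicitly, for any decade [a, 10a] inside the prior range,

∫ dT / [T ln(Tmax/Tmin)] over [a, 10a] = ln 10 / ln(Tmax/Tmin) = 1/3

for Tmin = 0.1 and Tmax = 100 mK, independent of a, so each of the three decades carries the same prior probability.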

What if the lower bound on T includes zero? Another alternative is a modified Jeffreys prior of the form

p(T|M1,I) = 1/(T + T0) × 1/ln[(T0 + Tmax)/T0]

This prior behaves like a uniform prior for T < T0 and a Jeffreys prior for T > T0. Typically set T0 = noise level.

Page 19: Tutorial on Bayesian Data Analysis

Let di represent the measured data value for the ith channel of the spectrometer. According to model M1,

di = T fi + ei ,   where   fi = exp[−(νi − ν0)² / (2 σL²)],

and ei represents the error component in the measurement. Our prior information indicates that ei has a Gaussian distribution with σ = 1 mK.

Assuming M1 is true, then if it were not for the error ei, di would equal the model prediction T fi.

Let Ei ≡ “a proposition asserting that the ith error value is in the range ei to ei + dei.” If all the Ei are independent, then

p(E1, E2, …, EN | M1, I) = p(E1|M1,I) p(E2|M1,I) ⋯ p(EN|M1,I).

Page 20: Tutorial on Bayesian Data Analysis

Probability of getting a data value di a distance ei away from thepredicted value is proportional to the height of the Gaussian error curve at that location.

Page 21: Tutorial on Bayesian Data Analysis

From the prior information, we can write

Our final likelihood is given by

p(D|M1,T,I) = ∏i [1/(σ √(2π))] exp[−(di − T fi)² / (2σ²)]

            = (2π)^(−N/2) σ^(−N) exp[−(1/2) Σi (di − T fi)² / σ²]

where the product and sum run over the N = 64 channels.

The sum Σi (di − T fi)² / σ² in the exponent is the familiar χ² statistic used in least-squares.

Page 22: Tutorial on Bayesian Data Analysis

Maximum and Marginal likelihoods

Our final likelihood is given by

p(D|M1,T,I) = (2π)^(−N/2) σ^(−N) exp[−Σi (di − T fi)² / (2σ²)]

For the given data, the maximum value of the likelihood = 3.80 × 10^−37.

To compute the odds O10 we need the marginal likelihood of M1.

p(D|M1,I) = ∫ dT p(T|M1,I) p(D|M1,T,I)

A uniform prior for T yields p(D|M1,I) = 5.06 × 10^−39.

A Jeffreys prior for T yields p(D|M1,I) = 1.74 × 10^−38.

Page 23: Tutorial on Bayesian Data Analysis

Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters so we can write

p(D|M0,I) = (2π)^(−N/2) σ^(−N) exp[−Σi (di − 0)² / (2σ²)] = 3.26 × 10^−51

Model selection results

Bayes factor, uniform prior = 1.6 × 10^12

Bayes factor, Jeffreys prior = 5.3 × 10^12

The factor of 10^12 is so large that we are not terribly interested in whether the factor in front is 1.6 or 5.3. Thus the choice of prior is of little consequence when the evidence provided by the data for the existence of the line is as strong as this.
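The book's support software is Mathematica, but the model selection calculation above is easy to sketch numerically in any language. Below is a minimal Python sketch, assuming a simulated 64-channel spectrum: the actual data and the line width σL appear only in the figures, so both are placeholders here, and the printed numbers will differ from the quoted 1.6 × 10^12 and 5.3 × 10^12.

```python
import numpy as np

# Setup from the slides: 64 channels, sigma = 1 mK noise, line centred at channel 37.
# The line width sigma_L and the spectrum itself are placeholders (not given in the text).
N, sigma, nu0, sigma_L = 64, 1.0, 37.0, 2.0
nu = np.arange(1, N + 1)                                        # spectrometer channel numbers
f = np.exp(-(nu - nu0) ** 2 / (2 * sigma_L ** 2))               # line profile f_i
d = 3.0 * f + np.random.default_rng(1).normal(0.0, sigma, N)    # simulated spectrum, true T = 3 mK

def likelihood(T):
    """p(D | M1, T, I) for independent Gaussian errors."""
    return (2 * np.pi) ** (-N / 2) * sigma ** (-N) * np.exp(
        -0.5 * np.sum((d - T * f) ** 2) / sigma ** 2)

# Marginal likelihood p(D | M1, I) = integral of p(T | M1, I) * p(D | M1, T, I) dT,
# done here with a simple rectangle rule on a fine grid of T values.
Tmin, Tmax = 0.1, 100.0
T = np.linspace(Tmin, Tmax, 20000)
dT = T[1] - T[0]
like = np.array([likelihood(t) for t in T])

pD_M1_uniform = np.sum(like / (Tmax - Tmin)) * dT               # uniform prior for T
pD_M1_jeffreys = np.sum(like / (T * np.log(Tmax / Tmin))) * dT  # Jeffreys prior for T
pD_M0 = likelihood(0.0)                                         # M0: no line, no free parameter

print("B10, uniform prior: ", pD_M1_uniform / pD_M0)
print("B10, Jeffreys prior:", pD_M1_jeffreys / pD_M0)
```

With the real spectrum, the same integrals are what produce the Bayes factors quoted above.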

Page 24: Tutorial on Bayesian Data Analysis

Now that we have solved the model selection problem leading to a significant preference for M1, we would now like to compute p(T|D,M1, I), the posterior PDF for the signal strength.
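A minimal sketch of that step, continuing with the same assumed setup (placeholder data and line width) as the model-selection sketch: the posterior is just prior × likelihood, normalized over the prior range of T.

```python
import numpy as np

# Same assumed setup as the model-selection sketch (placeholder data and line width).
N, sigma, nu0, sigma_L = 64, 1.0, 37.0, 2.0
nu = np.arange(1, N + 1)
f = np.exp(-(nu - nu0) ** 2 / (2 * sigma_L ** 2))
d = 3.0 * f + np.random.default_rng(1).normal(0.0, sigma, N)

Tmin, Tmax = 0.1, 100.0
T = np.linspace(Tmin, Tmax, 20000)
# Likelihood on the grid; the (2 pi)^(-N/2) sigma^(-N) factor cancels in the normalization.
like = np.exp(-0.5 * np.sum((d - T[:, None] * f) ** 2, axis=1) / sigma ** 2)

for name, prior in [("uniform", np.full_like(T, 1.0 / (Tmax - Tmin))),
                    ("Jeffreys", 1.0 / (T * np.log(Tmax / Tmin)))]:
    post = prior * like
    post /= np.sum(post) * (T[1] - T[0])       # normalize p(T | D, M1, I)
    print(f"{name} prior: posterior peaks at T = {T[np.argmax(post)]:.2f} mK")
```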

Page 25: Tutorial on Bayesian Data Analysis

[Figures: (1) log10 odds versus log10 Tmax for the uniform and Jeffreys priors; (2) the spectral line data, signal strength (mK) versus channel number; (3) the posterior p(T|D,M1,I) versus line strength T for the uniform and Jeffreys priors.]

Page 26: Tutorial on Bayesian Data Analysis

How do our conclusions change when evidence for the line in the data is weaker?

All channels have IID Gaussian noise characterized by σ = 1 mK. The predicted line center frequency ν0 = 37.

Page 27: Tutorial on Bayesian Data Analysis

Weak line model selection results

For model M0, p(D|M0,I) = 1.13 × 10^−38.

A uniform prior for T yields p(D|M1,I) = 1.13 × 10^−38.

A Jeffreys prior for T yields p(D|M1,I) = 1.24 × 10^−37.

Bayes factor, uniform prior = 1.0

Bayes factor, Jeffreys prior = 11.

As expected, when the evidence provided by the data is much weaker, our conclusions can be strongly influenced by the choice of prior and it is a good idea to examine the sensitivity of the results by employing more than one prior.

Page 28: Tutorial on Bayesian Data Analysis

[Figures: (1) odds versus log10 Tmax for the uniform and Jeffreys priors; (2) the weak spectral line data, signal strength (mK) versus channel number; (3) the posterior p(T|D,M1,I) versus line strength T for the uniform and Jeffreys priors.]

Page 29: Tutorial on Bayesian Data Analysis

What if we were uncertain about the line center frequency?

Suppose our prior information only restricted the center frequency to the first 44 channels of the spectrometer. In this case ν0 becomes a nuisance parameter that we can marginalize.

New Bayes factor = 1.0, assuming a uniform prior for ν0 and a Jeffreys prior for T.

p(D|M1,I) = ∫ dν0 ∫ dT p(D, T, ν0 | M1, I)

          = ∫ dν0 ∫ dT p(T|M1,I) p(ν0|M1,I) p(D|M1, T, ν0, I)    (assumes independent priors)

Built into any Bayesian model selection calculation is an automatic and quantitative Occam’s razor that penalizes more complicated models. One can identify an Occam factor for each parameter that is marginalized. The size of any individual Occam factor depends on our prior ignorance in the particular parameter.
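As with the single-parameter case, this double integral can be sketched numerically. The following minimal Python sketch uses placeholder data (the slide's value of 1.0 came from the actual weak-line spectrum) and marginalizes over both T (Jeffreys prior) and ν0 (uniform over channels 1 to 44):

```python
import numpy as np

# Placeholder weak-line spectrum: 64 channels of sigma = 1 mK Gaussian noise.
# The actual data and the line width sigma_L are only shown in the figures, so both are assumed.
N, sigma, sigma_L = 64, 1.0, 2.0
nu = np.arange(1, N + 1)
d = np.random.default_rng(2).normal(0.0, sigma, N)

Tmin, Tmax = 0.1, 100.0
T = np.linspace(Tmin, Tmax, 2000)
nu0_grid = np.linspace(1.0, 44.0, 430)                   # prior range: channels 1 to 44
dT, dnu0 = T[1] - T[0], nu0_grid[1] - nu0_grid[0]

prior_T = 1.0 / (T * np.log(Tmax / Tmin))                # Jeffreys prior for T
prior_nu0 = 1.0 / (44.0 - 1.0)                           # uniform prior for nu0

# Constant likelihood factors (2 pi)^(-N/2) sigma^(-N) are dropped; they cancel in B10.
def marginal_over_T(n0):
    """Inner integral over T of p(T|M1,I) p(D|M1,T,nu0,I) for a trial line centre n0."""
    f = np.exp(-(nu - n0) ** 2 / (2 * sigma_L ** 2))
    like = np.exp(-0.5 * np.sum((d - T[:, None] * f) ** 2, axis=1) / sigma ** 2)
    return np.sum(prior_T * like) * dT

pD_M1 = prior_nu0 * np.sum([marginal_over_T(n0) for n0 in nu0_grid]) * dnu0
pD_M0 = np.exp(-0.5 * np.sum(d ** 2) / sigma ** 2)        # no line: T = 0, same constants dropped
print("Bayes factor B10 =", pD_M1 / pD_M0)
```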

Page 30: Tutorial on Bayesian Data Analysis

Marginal probability density function for line strength

[Figure: marginal PDF for the line strength (mK), with a uniform prior for the line frequency and a Jeffreys prior for the line strength, comparing the case ν0 = channel 37 with the case ν0 unknown.]

Page 31: Tutorial on Bayesian Data Analysis

Marginal probability density function for line strength

Modified Jeffreys prior:

p(T|M1,I) = 1/(T + T0) × 1/ln(1 + Tmax/T0)

Jeffreys prior:

p(T|M1,I) = 1/T × 1/ln(Tmax/Tmin)

Page 32: Tutorial on Bayesian Data Analysis

Marginal PDF for line center frequency

Marginal posterior PDF for the line frequency, where the linefrequency is expressed as a spectrometer channel number.

Page 33: Tutorial on Bayesian Data Analysis

Joint probability density function

[Figure: contour plot of the joint posterior PDF, line strength (mK) versus frequency channel (1 to 44), with contours at {90, 50, 20, 10, 5, 2, 1, 0.5}% of the peak.]

Page 34: Tutorial on Bayesian Data Analysis

Generalizations

Current model:

di = T exp[−(νi − ν0)² / (2 σL²)] + ei

More generally we can write

di = fi + ei

where

fi = Σα Aα gα(xi)     (α = 1, …, m)

specifies a linear model with m basis functions gα(xi), or

fi = Σα Aα gα(xi | θ)

specifies a model with m basis functions with an additional set of nonlinear parameters represented by θ.
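A small Python sketch of the linear case, with illustrative (assumed) basis functions: such a model is just a design matrix of basis functions evaluated at the data points. With Gaussian errors and uniform amplitude priors the posterior for the Aα peaks at the least-squares solution, while any nonlinear parameters θ (for example, the period of a periodic basis function) would instead have to be marginalized numerically, e.g. by Markov chain Monte Carlo.

```python
import numpy as np

# Sketch of the generalization d_i = f_i + e_i with f_i = sum_a A_a * g_a(x_i).
# The basis functions below (constant, linear trend, cosine) are illustrative choices only.
x = np.linspace(0.0, 10.0, 64)
G = np.column_stack([np.ones_like(x),                   # g_1(x) = 1
                     x,                                  # g_2(x) = x
                     np.cos(2 * np.pi * x / 5.0)])       # g_3(x), an assumed periodic basis

A_true = np.array([1.0, 0.5, 2.0])
d = G @ A_true + np.random.default_rng(0).normal(0.0, 1.0, x.size)   # simulated data

# With Gaussian errors and uniform priors on the amplitudes, the posterior for A
# peaks at the least-squares solution.
A_hat, *_ = np.linalg.lstsq(G, d, rcond=None)
print("recovered amplitudes:", A_hat)
```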