Prior distribution “All models are wrong…but some models are useful.” G.E.P. Box.




Principles of Prior Selection

The most important principle of prior selection is that your prior should represent the best knowledge that you have about the parameters of the problem before you look at the data.

Usually there is some information at your disposal. You know that the distance to the Moon is more than 10,000 km and less than 1,000,000 km, so you are justified in setting your prior to zero outside that range.

The population of fish in a lake cannot exceed the number of fish that would fill the entire volume of the lake.

It is unjustified to use default, ignorance, or other automatic priors if you have substantial information that can affect the answer.


Principles of Prior Selection

But sometimes you do not have substantial prior information! A number of principles have been used to choose a prior in that situation:

Group invariance arguments, maximum entropy arguments, and arguments from the Fisher information matrix.

These “ignorance” priors generally reproduce the results of a corresponding frequentist analysis (with of course a Bayesian interpretation), so results using them cannot be worse than a frequentist analysis.

Of course, if you have real information, you can use this information in a way that frequentists can’t.


Group Invariance

Here the idea is that if we know a priori that our prior knowledge should be invariant to the action of some underlying group, then we can choose our prior to respect that invariance

Example: In rolling a die or flipping a coin, if we have no reason to think that any side is favored, then we are saying that our state of knowledge about the roll of the die (or flip of the coin) is invariant to the action of the permutation group on 6 (or 2) objects. Thus, if we were to mix up the sides of the die or coin by applying an arbitrary element of the permutation group, we would not change our prior.

The prior is constant on each side of the die/coin


Group Invariance

Other examples:

If we have an angular parameter, we may find that our prior is invariant to the action of the rotation group. Then our prior would be constant in angle.

If we have a pair of angular parameters (e.g., latitude, longitude) and believe that our prior is invariant to arbitrary space rotations (O(3)), then our prior will be constant on constant solid angles, e.g., $d\Omega = d(\cos\theta)\,d\phi$, with $\theta$ the polar angle and $\phi$ the azimuth.

If our prior represents a location and is invariant to translations of the origin of the axes, then our prior is the (improper) flat prior. This is appropriate when physical considerations indicate such invariance.


Group Invariance

Other examples: A more interesting example is if we are measuring a positive quantity $\sigma$ such as a length, and our prior says that it should be invariant with respect to changes in the scale of the graduations of the ruler (e.g., it shouldn’t matter if our ruler is graduated in inches or centimeters; our results should be physically the same).

Then the prior is the (improper) Jeffreys prior $1/\sigma$.

This is the same as saying that the prior on $\tau = \log\sigma$ is flat (translations in $\log\sigma$ are equivalent to multiplications of $\sigma$ by an arbitrary scale factor).


Group Invariance

An example of the usefulness of this Jeffreys prior is given by Benford’s Law (misnamed—it was actually discovered by the astronomer Simon Newcomb).

If you consider a collection of measurements of some quantity that is measured on an arbitrary multiplicative scale, such as areas of political divisions, then the distribution of the first digits of these numbers is empirically found to be roughly logarithmic


Group Invariance

Example: Areas of 40 European countries (in some units)

Digit   Actual (count, share)   Predicted
1       10 (25%)                30%
2       7 (17.5%)               18%
3       6 (15%)                 12%
4       6 (15%)                 10%
5       3 (7.5%)                8%
6       2 (5%)                  7%
7       2 (5%)                  6%
8       1 (2.5%)                5%
9       3 (7.5%)                5%


Group Invariance

This does not indicate that Mother Nature has ten fingers!

We would expect that the distribution of first digits should not depend on what units we use, e.g., square km, hectares, square feet, whatever. Our knowledge is scale invariant.

The distribution should be invariant to transformations of the scale (renormalization) group $\sigma \to c\sigma$. So

$$d\ln(c\sigma) = \frac{d(c\sigma)}{c\sigma} = \frac{d\sigma}{\sigma} = d\ln\sigma$$


Group Invariance

Let $a = 10^k$ be the lower endpoint of a decade of numbers. The relative proportion of numbers with first digit $d$ in each decade (independent of decade) is given by

$$\int_{ad}^{a(d+1)} \frac{d\sigma}{\sigma} = \ln(d+1) - \ln(d) = \ln(1 + 1/d)$$

Evaluate this for all 9 first digits and add to get the normalization constant:

$$\ln(10) - \ln(1) = \ln(10)$$

The expected frequency of first digit $d$ is therefore

$$\frac{\ln(1 + 1/d)}{\ln(10)} = \log_{10}(1 + 1/d)$$
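The first-digit frequencies derived here, log10(1 + 1/d), are easy to tabulate. The following is a minimal sketch in Python; the function name `benford_first_digit_probs` is an illustrative choice, and the observed counts are simply the 40 country areas quoted in the table earlier in these slides.

```python
import math

def benford_first_digit_probs():
    """Return P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1.0 + 1.0 / d) for d in range(1, 10)}

# Leading-digit counts for the 40 areas quoted in the slides' table.
observed_counts = {1: 10, 2: 7, 3: 6, 4: 6, 5: 3, 6: 2, 7: 2, 8: 1, 9: 3}

probs = benford_first_digit_probs()
total = sum(observed_counts.values())

for d in range(1, 10):
    observed = observed_counts[d] / total
    print(f"{d}: observed {observed:6.1%}   predicted {probs[d]:6.1%}")
```

The predicted frequencies sum to one because the nine integrals together cover exactly one decade.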


Group Invariance

These examples illustrate that the existence of a natural symmetry group generates a natural prior, which is the distribution that is invariant under the action of the group.

When the group is compact this is the same as the (unique) Haar measure on the compact group.

When the group is not compact, the prior will be improper and can only be used if the posterior is proper. The prior will only be known up to a factor in this case.

• Here we are ignoring the “ends” of the scale, e.g., the size of the universe, the size of the Earth, etc. The symmetry group is only approximate!

• Some difficulties arise because there can be two different invariant measures on noncompact groups


Group Invariance

Things do get more complex when the group is not compact. As an example, consider the affine group (group of scale changes and offsets) with the multiplication law

$$XY = (x, s)(y, t) = (sy + x,\ st) = (z, w) = Z$$

$$x, y, z \in (-\infty, \infty), \qquad s, t, w \in (0, \infty)$$

This is easily verified to be a (non-Abelian) group:

The product of any two elements of the group is a new element of the group

There is an identity

Each element has an inverse

Associative law is satisfied
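The group properties listed for the affine group can be spot-checked numerically. This is a rough sketch, assuming the multiplication law (x, s)(y, t) = (sy + x, st). The helper names `mul`, `inv` and `close`, the identity (0, 1) and the inverse (-x/s, 1/s) are not stated on the slides; they are worked out here from that law.

```python
def mul(a, b):
    """Affine group product (x, s)(y, t) = (s*y + x, s*t)."""
    x, s = a
    y, t = b
    return (s * y + x, s * t)

IDENTITY = (0.0, 1.0)  # (0, 1) leaves every element unchanged

def inv(a):
    """Inverse of (x, s): solves (x, s)(y, t) = (0, 1), giving (-x/s, 1/s)."""
    x, s = a
    return (-x / s, 1.0 / s)

def close(a, b, tol=1e-12):
    """Compare two group elements up to floating-point error."""
    return abs(a[0] - b[0]) < tol and abs(a[1] - b[1]) < tol

a, b, c = (1.0, 2.0), (0.0, 3.0), (-2.5, 0.5)

print("closure (scale stays positive):", mul(a, b)[1] > 0)
print("identity:", close(mul(a, IDENTITY), a) and close(mul(IDENTITY, a), a))
print("inverse:", close(mul(a, inv(a)), IDENTITY) and close(mul(inv(a), a), IDENTITY))
print("associative:", close(mul(mul(a, b), c), mul(a, mul(b, c))))
print("non-Abelian:", not close(mul(a, b), mul(b, a)))
```

A check on a few elements is not a proof, but it catches sign or ordering slips in the multiplication law.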


Group Invariance

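Consider in particular a location parameter $x$ and a scale parameter $\sigma$. Consider transforming them by a fixed element $(l, s)$ of the affine group by multiplication on the right. We get

$$(x, \sigma)(l, s) = (x', \sigma') = (\sigma l + x,\ \sigma s)$$

The infinitesimal volume element transforms thusly:

$$dx'\,d\sigma' = \left|\det\begin{pmatrix} 1 & l \\ 0 & s \end{pmatrix}\right| dx\,d\sigma = s\,dx\,d\sigma$$

so that the invariant volume element (measure) is

$$\frac{dx'\,d\sigma'}{\sigma'} = \frac{s\,dx\,d\sigma}{s\sigma} = \frac{dx\,d\sigma}{\sigma}$$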


Group Invariance

This is the right-invariant Haar measure on the affine group. It can be taken as an appropriate prior on $x$ and $\sigma$.

Show that the left-invariant Haar measure on the affine group is

$$\frac{dx\,d\sigma}{\sigma^2}$$

Hint: Work from the multiplication law

$$(l, s)(x, \sigma) = (x', \sigma') = (sx + l,\ s\sigma)$$

where we have multiplied by the constant element on the left.
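One way to carry out this exercise, sketched under the assumption that the multiplication law is (x, s)(y, t) = (sy + x, st), with x the location and σ the scale parameter, follows the same Jacobian argument used for the right-invariant measure:

```latex
% Left multiplication by a fixed element (l, s):
%   x' = s x + l,   \sigma' = s \sigma
\begin{align*}
dx'\,d\sigma'
  &= \left|\det\begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}\right| dx\,d\sigma
   = s^{2}\,dx\,d\sigma, \\[4pt]
\frac{dx'\,d\sigma'}{\sigma'^{2}}
  &= \frac{s^{2}\,dx\,d\sigma}{s^{2}\sigma^{2}}
   = \frac{dx\,d\sigma}{\sigma^{2}} .
\end{align*}
```

The extra factor of s, compared with the right-invariant case, arises because left multiplication rescales x as well as σ.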


Group Invariance

There is some controversy as to whether the left or the right invariant Haar measure is the correct one to use as a prior. Berger favors the right invariant measure and Villegas the left invariant measure.

Berger says that use of the right invariant Haar measure avoids certain “marginalization paradoxes” as well as giving improved results in other situations

We note as well that Berger’s choice yields a prior that is the same as what we get if we multiply the flat prior on $x$ with the Jeffreys prior on $\sigma$. That would be appropriate if we believed that $x$ and $\sigma$ were independent.


Maximum Entropy

Another approach to prior selection suggested by the late E.T. Jaynes is to maximize the information entropy of the distribution, subject to constraints that you may happen to know.

The entropy is supposed to be a measure of how much information we lack to describe the distribution. That is, the larger the entropy, the less we know and can specify about the distribution


Maximum Entropy

Example: Suppose we wish to specify a binary number of n digits, and each number is equiprobable.

It takes exactly n binary bits to specify the number.

There are $N = 2^n$ numbers, running from $0$ to $2^n - 1$.

Note that $n = \log_2(N)$.

If we double the number of bits, we square the number of numbers that we can specify.

So the amount of missing information we could obtain when we learn the number is proportional to the number of bits it takes us to specify the missing information.


Maximum Entropy

Similarly, if we have two sets of numbers of size $M$ and $N$, and it takes $m$ and $n$ bits respectively to specify a given number in each set, then

$$I(MN) = k\log(MN) = k\log(M) + k\log(N) = I(M) + I(N) \propto m + n$$

where $k$ is an arbitrary constant.

This suggests that information should be logarithmic in the number of equiprobable cases.


Maximum Entropy

It is evident that we obtain more information if we learn that an improbable result is true than if we learn that a probable result is true. This suggests that information should be a function of probability: $H(p_1, p_2, \ldots, p_n)$.

Note that if we have $M$ equiprobable cases, $p_j = 1/M$ and

$$I(M) = k\log M = -k\log p_j = -k\sum_j p_j \log p_j \quad (\text{since all } p_j \text{ are equal}) \quad = H(p_1, p_2, \ldots, p_M)$$

Think of each term as the expected entropy for case $j$: the entropy $-k\log p_j$ times its probability $p_j$.


Maximum Entropy

Shannon proposed that a reasonable definition of the information potentially available in observing events of unequal probability $p_j$ would be

$$H(p_1, p_2, \ldots, p_M) = -k\sum_j p_j \log p_j$$

This is consistent with the equiprobable case, and also gives greater potential information gain to events of lower probability, since the entropy of a low probability case is $-k\log p_j$, which goes to infinity as $p_j$ goes to 0. But this happens only with probability $p_j$, so the expected entropy from this case is $-k\,p_j\log p_j$.


Maximum Entropy

Example: Suppose we have 8 equiprobable cases. Then $p_j = 1/8$. Under the definition (taking $k = 1$),

$$H(p_1, p_2, \ldots, p_8) = -\log(1/8) = \log(8)$$

Suppose we divide the 8 cases into $c_1 = \{1, 2\}$ and $c_2 = \{3, 4, 5, 6, 7, 8\}$, and suppose we learn first that one of these cases is true and then learn which of the subcases is true. I.e., we may learn that $c_1 = \{1, 2\}$ is true, and then learn that of these possibilities 2 is true. The information gained is exactly equivalent to learning outright which of the 8 cases is true, so if our proposal makes sense we should be able to make this all consistent. This suggests

$$H\left(\tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18\right) = H\left(\tfrac14, \tfrac34\right) + \tfrac14 H\left(\tfrac12, \tfrac12\right) + \tfrac34 H\left(\tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16\right)$$


Maximum Entropy

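Here’s why:

$$H\left(\tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18\right) = H\left(\tfrac14, \tfrac34\right) + \tfrac14 H\left(\tfrac12, \tfrac12\right) + \tfrac34 H\left(\tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16\right)$$

Left-hand side: information upon learning the answer outright.

First term on the right: information upon learning which of $c_1$ or $c_2$ is true.

Second term: information upon learning which of the 2 cases in $c_1$ is true, given that $c_1$ is true (but only observed with probability 1/4), so the expected information gain is 1/4 of the total.

Third term: information upon learning which of the 6 cases in $c_2$ is true, given that $c_2$ is true (but only observed with probability 3/4), so the expected information gain is 3/4 of the total.

The probability of an individual case factors in the same way, e.g., $p(1) = p(1 \mid c_1)\,p(c_1) = \tfrac12 \cdot \tfrac14 = \tfrac18$.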


Maximum Entropy

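Substituting into the proposed equation

$$H\left(\tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18, \tfrac18\right) = H\left(\tfrac14, \tfrac34\right) + \tfrac14 H\left(\tfrac12, \tfrac12\right) + \tfrac34 H\left(\tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16, \tfrac16\right)$$

we find

$$\log(8) = -\tfrac14\log\tfrac14 - \tfrac34\log\tfrac34 + \tfrac14\log(2) + \tfrac34\log(6)$$

Direct evaluation shows that this agrees.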
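The grouping identity for the 8-case example can be confirmed by direct evaluation. Below is a small sketch that takes k = 1 and natural logarithms; the function name `entropy` is an illustrative choice.

```python
import math

def entropy(probs, k=1.0):
    """Shannon entropy H = -k * sum(p * log p), skipping zero-probability cases."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

# Learning the answer outright among 8 equiprobable cases.
direct = entropy([1 / 8] * 8)

# Learning first which group (c1 with 2 cases, c2 with 6 cases) is true,
# then which case inside the group, weighted by the group probabilities.
staged = (
    entropy([1 / 4, 3 / 4])
    + (1 / 4) * entropy([1 / 2, 1 / 2])
    + (3 / 4) * entropy([1 / 6] * 6)
)

print(f"direct = {direct:.12f}")
print(f"staged = {staged:.12f}")
print(f"log(8) = {math.log(8):.12f}")
```

Both routes give log(8) up to floating-point rounding, which is the consistency requirement that motivates the form of H.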


Maximum Entropy

Thus we generalize:

$$H(p_1, p_2) = -p_1\log p_1 - p_2\log p_2$$

to

$$H(p_1, p_2, \ldots, p_N) = -\sum_k p_k\log p_k$$

and (for the continuous case) to

$$H(p(\theta)) = -\int p(\theta)\log p(\theta)\,d\theta$$


Maximum Entropy

Examples: Suppose we have a finite state space with $n$ possibilities, and have no additional knowledge. The principle of Maximum Entropy (MAXENT) suggests maximizing the information entropy, subject to the side constraint that the total probability be equal to 1. We can do this by introducing a Lagrange multiplier $\lambda$:

$$H = -\sum_k p_k\log p_k \quad \text{subject to} \quad \sum_k p_k = 1$$

$$S = H + \lambda\left(\sum_k p_k - 1\right)$$

$$\frac{\partial S}{\partial p_k} = -\log p_k - \frac{p_k}{p_k} + \lambda = 0$$

$$p_k = \exp(\lambda - 1) = \text{constant} = \frac{1}{n}$$


Maximum Entropy

Examples: What is the continuous distribution that maximizes the information entropy, given that the mean and variance are known?

$$\max\ H = -\int p(x)\log p(x)\,dx$$

subject to

$$\int p(x)\,dx = 1$$

$$\int x\,p(x)\,dx = \mu$$

$$\int (x - \mu)^2\,p(x)\,dx = \sigma^2$$


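Maximum Entropy

Introduce Lagrange multipliers $a$, $b$, $c$:

$$S = -\int p(x)\log p(x)\,dx + a\left(\int p(x)\,dx - 1\right) + b\left(\int x\,p(x)\,dx - \mu\right) + c\left(\int (x - \mu)^2\,p(x)\,dx - \sigma^2\right)$$

Taking the variation $\delta p(x)$,

$$\delta S = 0 = \int \left[-\log p(x) - 1 + a + bx + c(x - \mu)^2\right]\delta p(x)\,dx$$

$$0 = -\log p(x) - 1 + a + bx + c(x - \mu)^2 \quad\Rightarrow\quad p(x) \propto \exp\left(a + bx + c(x - \mu)^2\right)$$

Thus the distribution is Normal, and applying the constraints it is $N(\mu, \sigma^2)$.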


Maximum Entropy

Thus the normal distribution is the one that maximizes our uncertainty, given fixed mean and variance. This is one more reason why the normal distribution is so important: It tells us less about the data, given only that we know the mean and variance, than any other continuous distribution.

The information entropy of the normal distribution can be calculated to be

$$H = \log\left(\sigma\sqrt{2\pi e}\right)$$

Show this!
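As a numerical sanity check (not a substitute for the analytic derivation the slide asks for), the entropy integral of a normal density can be compared with the closed form log(σ√(2πe)). The sketch below uses a plain midpoint rule. The integration range, the value σ = 2, and the comparison against uniform and Laplace distributions of the same variance are illustrative assumptions that are not on the slides.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo, hi, n=20000):
    """Midpoint-rule estimate of -integral of p(x) log p(x) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * h)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total

sigma = 2.0
numeric = differential_entropy(lambda x: normal_pdf(x, 0.0, sigma), -12 * sigma, 12 * sigma)
closed_form = math.log(sigma * math.sqrt(2 * math.pi * math.e))

# Closed-form entropies of two other distributions with the same variance sigma^2.
uniform_entropy = math.log(sigma * math.sqrt(12))       # uniform of width sigma*sqrt(12)
laplace_entropy = 1 + math.log(math.sqrt(2) * sigma)    # Laplace with scale sigma/sqrt(2)

print(f"normal  (numeric)     : {numeric:.6f}")
print(f"normal  (closed form) : {closed_form:.6f}")
print(f"laplace (same var)    : {laplace_entropy:.6f}")
print(f"uniform (same var)    : {uniform_entropy:.6f}")
```

The normal entropy comes out largest of the three, as the maximum entropy argument requires.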


Maximum Entropy

An interesting example of using maximum entropy was given by Jaynes: The astronomer Wolf (the same one who invented sunspot numbers) had a pair of dice that he tossed repeatedly over the years, recording the outcomes. He obtained

        White die          Red die
Face    N       p          N       p
1       3246    0.16230    3407    0.17035
2       3449    0.17245    3631    0.18155
3       2897    0.14485    3176    0.15880
4       2841    0.14205    2916    0.14580
5       3635    0.18175    3448    0.17240
6       3932    0.19660    3422    0.17110


Maximum Entropy

The dice do not appear to be fair, and the white and red dice appear to be unfair in different ways.

Jaynes proposed the following physical causes:

The excavations for the spots made larger numbers lighter than smaller ones, causing them to come up more frequently

One axis was longer than the other two, causing the faces on its ends to be seen less frequently

(On the white die) there might have been a chip on the 2-3-6 corner, making these three faces come up more frequently


Maximum Entropy

Jaynes maximized the information entropy, subject to constraints that correspond to these physical situations. For example, he proposed that the deviation in the frequency with which a face would come up, relative to a fair die, was proportional to $(T - B)$, where $T$ is the number of spots on the top and $B$ the number of spots on the bottom when a particular face is up. The analysis also predicted that both dice were manufactured so that they were longer along the 3-4 axis.

Much later, the actual dice were found in the archives of Wolf’s observatory. Measurement of the dice confirms Jaynes’ analysis of the physical characteristics of the dice, including the suspected chip.


Maximum Entropy

Example of how Jaynes did this: Maximize the entropy

$$H = -\sum_k p_k\log p_k$$

subject to the constraints

$$0 \le p_k \le 1$$

$$\sum_k p_k = 1$$

$$\sum_k k\,p_k = 3.5983 \quad \text{(the value for the white die)}$$

(For a fair die the sum would be 3.5)


Maximum Entropy

Solution: Introduce Lagrange multipliers $\lambda$ and $\mu$. Find an extremum of

$$S = -\sum_k p_k\log p_k + \lambda\left(\sum_k p_k - 1\right) + \mu\left(\sum_k k\,p_k - 3.5983\right)$$

with solution

$$-\log p_k + (\lambda - 1) + \mu k = 0$$

$$p_k = \exp\left((\lambda - 1) + \mu k\right) \propto \exp(\mu k) \approx 1 + \mu k$$

for small $\mu$. Note that for a fair die all the p’s are equal, so $\mu$ would be zero. Thus we get a linear equation for $\mu$, and I get $\mu = 0.0382$. The exact solution is $\mu = 0.03373$.

Calculate the probabilities implied by this analysis of the white die. Compare with the actual probabilities.
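One possible way to carry out this calculation is sketched below. It assumes the maximum entropy solution has the exponential form p_k ∝ exp(μk), with μ denoting the multiplier on the mean constraint, and solves the constraint Σ k p_k = 3.5983 by bisection. The function names and the bracketing interval are illustrative choices; the observed frequencies are the white-die values from Wolf's table.

```python
import math

FACES = range(1, 7)

def maxent_probs(mu):
    """Normalized probabilities p_k proportional to exp(mu * k), k = 1..6."""
    weights = [math.exp(mu * k) for k in FACES]
    z = sum(weights)
    return [w / z for w in weights]

def mean_face(mu):
    """Expected number of spots under maxent_probs(mu)."""
    return sum(k * p for k, p in zip(FACES, maxent_probs(mu)))

def solve_mu(target, lo=-5.0, hi=5.0, tol=1e-12):
    """Bisection on mu; mean_face is monotonically increasing in mu."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_face(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = solve_mu(3.5983)
probs = maxent_probs(mu)
observed = [0.16230, 0.17245, 0.14485, 0.14205, 0.18175, 0.19660]  # white die

print(f"mu = {mu:.5f}")
for k, p, obs in zip(FACES, probs, observed):
    print(f"face {k}: maxent {p:.5f}   observed {obs:.5f}")
```

With only the mean constraint the fitted probabilities rise smoothly from face 1 to face 6, so they cannot reproduce the dip at faces 3 and 4 seen in the data.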


Maximum Entropy

The constraint implied by the “long axis” proposal would increase the frequencies of 1, 2, 5 and 6 by an amount $\epsilon$, while decreasing the frequencies of 3 and 4 by an amount $2\epsilon$. (The factor 2 comes in to maintain the normalization.)

This corresponds to a constraint

$$p_1 + p_2 + p_5 + p_6 - 2(p_3 + p_4) = 0.1393$$

What probabilities are obtained with both constraints on the white die? How do they agree with the observed frequencies?
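A hedged sketch of the two-constraint calculation follows. It assumes the maximum entropy solution takes the form p_k ∝ exp(μk + νf_k), where f = (1, 1, -2, -2, 1, 1) encodes the long-axis constraint and μ, ν are the two multipliers. It finds them with Newton's method, using the fact that the derivative of the expected constraint values with respect to the multipliers is their covariance matrix. The symbol ν, the function names and the iteration count are assumptions made for this example and do not appear on the slides.

```python
import math

K = [1, 2, 3, 4, 5, 6]        # number of spots on each face
F = [1, 1, -2, -2, 1, 1]      # long-axis contrast: +1 for faces 1,2,5,6 and -2 for 3,4
TARGETS = (3.5983, 0.1393)    # white-die values of sum(k p_k) and sum(f_k p_k)

def probs(mu, nu):
    """p_k proportional to exp(mu * k + nu * f_k)."""
    w = [math.exp(mu * k + nu * f) for k, f in zip(K, F)]
    z = sum(w)
    return [x / z for x in w]

def moments(p):
    """Means of k and f, plus their variances and covariance under p."""
    ek = sum(pi * k for pi, k in zip(p, K))
    ef = sum(pi * f for pi, f in zip(p, F))
    vkk = sum(pi * k * k for pi, k in zip(p, K)) - ek * ek
    vff = sum(pi * f * f for pi, f in zip(p, F)) - ef * ef
    vkf = sum(pi * k * f for pi, k, f in zip(p, K, F)) - ek * ef
    return ek, ef, vkk, vkf, vff

def solve(targets, iters=50):
    """Newton iteration: solve Cov * delta = (target - current mean)."""
    mu = nu = 0.0
    for _ in range(iters):
        ek, ef, vkk, vkf, vff = moments(probs(mu, nu))
        rk, rf = targets[0] - ek, targets[1] - ef
        det = vkk * vff - vkf * vkf
        mu += (vff * rk - vkf * rf) / det
        nu += (vkk * rf - vkf * rk) / det
    return mu, nu

mu, nu = solve(TARGETS)
fitted = probs(mu, nu)
observed = [0.16230, 0.17245, 0.14485, 0.14205, 0.18175, 0.19660]

for k, p, obs in zip(K, fitted, observed):
    print(f"face {k}: maxent {p:.5f}   observed {obs:.5f}")
```

Printing the fitted values next to the observed frequencies gives the comparison the slide asks for.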


Jeffreys Priors

Harold Jeffreys proposed another general procedure for picking priors. He suggested using

$$p_J(\theta) \propto |J(\theta)|^{1/2}$$

where

$$J(\theta) = -E\left[\frac{d^2\log p(x\mid\theta)}{d\theta^2}\right]$$

is the Fisher information matrix of $p(x\mid\theta)$. Here, $\theta$ may be a vector of parameters, and the expectation is taken over $x$.
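To make the definition concrete, here is a small sketch that evaluates the Fisher information exactly for a binomial likelihood by summing over all outcomes. The binomial model is an example choice rather than one worked on this slide. The sketch uses the expected squared score, which equals the negative expected second derivative under the usual regularity conditions that the binomial satisfies. The function names are illustrative.

```python
import math

def binomial_fisher_information(n, theta):
    """Exact E[(d/dtheta log p(x|theta))^2] for x ~ Binomial(n, theta)."""
    info = 0.0
    for x in range(n + 1):
        pmf = math.comb(n, x) * theta**x * (1 - theta) ** (n - x)
        score = x / theta - (n - x) / (1 - theta)  # derivative of the log-likelihood
        info += pmf * score**2
    return info

def jeffreys_unnormalized(n, theta):
    """Unnormalized Jeffreys prior |J(theta)|^(1/2)."""
    return math.sqrt(binomial_fisher_information(n, theta))

n = 10
for theta in (0.1, 0.3, 0.5):
    exact = n / (theta * (1 - theta))
    print(f"theta={theta}: J={binomial_fisher_information(n, theta):.6f} "
          f"(n/(theta(1-theta)) = {exact:.6f}), prior ~ {jeffreys_unnormalized(n, theta):.4f}")
```

The summed value agrees with n/(θ(1-θ)), so the unnormalized Jeffreys prior for this model is proportional to θ^(-1/2)(1-θ)^(-1/2).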


Jeffreys Priors

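The Jeffreys prior has the very nice property that it is invariant to parameterization changes. Let $\phi = h(\theta)$, so that $\theta = h^{-1}(\phi)$. Then

$$J(\phi) = -E\left[\frac{d^2\log p(x\mid\phi)}{d\phi^2}\right] = -E\left[\frac{d^2\log p(x\mid\theta = h^{-1}(\phi))}{d\theta^2}\right]\left(\frac{d\theta}{d\phi}\right)^2 = J(\theta)\left(\frac{d\theta}{d\phi}\right)^2$$

(The chain rule also produces a term proportional to the first derivative of $\log p$, but its expectation over $x$ is zero.) Therefore

$$|J(\phi)|^{1/2} = |J(\theta)|^{1/2}\left|\frac{d\theta}{d\phi}\right|$$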


Jeffreys Priors

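Example: $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$ where $\mu$ is unknown but the variance is known. Then

$$\log p(x\mid\mu) = -\frac{1}{2\sigma^2}\,n(\bar{x} - \mu)^2 + \text{constant}$$

$$J(\mu) = E\left[\frac{n}{\sigma^2}\right] = \text{constant}$$

$$p_J(\mu) \propto |J(\mu)|^{1/2} \propto \text{flat}$$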


Jeffreys Priors

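Example: $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$ where $\mu$ is known but the variance is unknown. Then, writing $s^2 = \sum_i (x_i - \mu)^2$,

$$\log p(x\mid\sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 + \text{constant} = -n\log\sigma - \frac{s^2}{2\sigma^2} + \text{constant}$$

$$J(\sigma) = -E\left[\frac{n}{\sigma^2} - \frac{3s^2}{\sigma^4}\right] = \frac{2n}{\sigma^2} \quad (\text{since } E[s^2] = n\sigma^2)$$

$$p_J(\sigma) \propto |J(\sigma)|^{1/2} \propto \frac{1}{\sigma}$$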


Jeffreys Priors

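Example: $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$ where both $\mu$ and $\sigma$ are unknown. Then, with $S_{xx} = \sum_i (x_i - \bar{x})^2$,

$$\log p(x\mid\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 + \text{constant} = -n\log\sigma - \frac{S_{xx} + n(\bar{x} - \mu)^2}{2\sigma^2} + \text{constant}$$

$$J(\mu, \sigma) = -E\begin{pmatrix} -\dfrac{n}{\sigma^2} & -\dfrac{2n(\bar{x} - \mu)}{\sigma^3} \\[8pt] -\dfrac{2n(\bar{x} - \mu)}{\sigma^3} & \dfrac{n}{\sigma^2} - \dfrac{3\sum_i (x_i - \mu)^2}{\sigma^4} \end{pmatrix} = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\[8pt] 0 & \dfrac{2n}{\sigma^2} \end{pmatrix}$$

$$|J(\mu, \sigma)| \propto \frac{1}{\sigma^4}$$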


Jeffreys Priors

Example: $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$ where both $\mu$ and $\sigma$ are unknown. This leads to the prior on $\mu$ and $\sigma$

$$p(\mu, \sigma) \propto \frac{1}{\sigma^2}$$

This is not what we would have expected, if we thought that the priors on $\mu$ and $\sigma$ should be independent! (It is what we would get if we used the left-invariant Haar measure, but this is rejected for other reasons.)


Jeffreys Priors

Jeffreys himself thought this result inconsistent with the previous two, and preferred the prior obtained by assuming independence of $\mu$ and $\sigma$ (the right-invariant Haar prior):

$$p(\mu, \sigma) = p(\mu)\,p(\sigma) \propto \frac{1}{\sigma}$$

This is the independence Jeffreys prior for this problem, and it is the one that we shall use (following Berger).


Jeffreys Priors

Discussion: Choosing “ignorance” priors is by no means easy or straightforward.

Arguments based on group symmetry seem the most secure from a logical point of view but depend on there being a natural symmetry

Maximum entropy is appealing, but the priors that result are not invariant to a change of variable, so they depend implicitly on a (subjective) judgement about a natural parameterization of the problem.


Jeffreys Priors

Jeffreys priors are also suspect, as the prior (and hence the posterior distribution) depends on the form of the sampling distribution of the data, and thus may violate the likelihood principle (LP).

In particular, the Jeffreys priors for binomial and negative binomial data are different!
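The difference can be seen in a short worked derivation. This is a sketch under the standard parameterization, in which θ is the success probability, the binomial fixes the number of trials n, and the negative binomial fixes the number of successes r and treats n as random. The notation is chosen for this example and is not taken from the slides.

```latex
% Binomial: n fixed, x successes, E[x] = n\theta
\begin{align*}
\log p(x\mid\theta) &= x\log\theta + (n-x)\log(1-\theta) + \text{const}, \\
J(\theta) &= E\!\left[\frac{x}{\theta^{2}} + \frac{n-x}{(1-\theta)^{2}}\right]
           = \frac{n}{\theta(1-\theta)}, \qquad
p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}.
\end{align*}

% Negative binomial: r successes fixed, n trials random, E[n] = r/\theta
\begin{align*}
\log p(n\mid\theta) &= r\log\theta + (n-r)\log(1-\theta) + \text{const}, \\
J(\theta) &= E\!\left[\frac{r}{\theta^{2}} + \frac{n-r}{(1-\theta)^{2}}\right]
           = \frac{r}{\theta^{2}(1-\theta)}, \qquad
p_J(\theta) \propto \theta^{-1}(1-\theta)^{-1/2}.
\end{align*}
```

The two likelihoods are proportional as functions of θ, yet the priors differ because the expectation in J is taken over different sample spaces.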

Finally, use of such priors will not take into account real prior information that we may have, so again we emphasize that if you have information you should use it and not one of these “uninformative” or “automatic” priors


Informed Vague Priors

Sometimes one may have real information that can lead to a prior, but the prior will still be “vague”, or spread out.

Example: If the sun were surrounded by a spherically symmetric distribution of stars, then the number of stars in a shell of width $dr$ is proportional to $r^2\,dr$. If we were estimating the distance to a star by some means, it would be appropriate to use this as a prior on the distance.

For many years astronomers failed to recognize this and thus a bias was built into the stellar distance scale, the so-called Lutz-Kelker bias, though Trumpler and Weaver were apparently aware of it in the 1930’s. The stars are on average farther away than their measured distances would indicate
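The direction of this bias can be illustrated with a toy grid calculation. Everything specific here is an assumption made for the sketch and is not taken from the slides: the distance is measured through a parallax (in arcseconds, so that distance in parsecs is 1/parallax) with Gaussian error, the observed value and its uncertainty are hypothetical, and the grid is truncated at 500 pc because the r² prior is improper.

```python
import math

def log_posterior(r, parallax_obs, sigma):
    """Unnormalized log posterior: r^2 prior times Gaussian parallax likelihood."""
    return 2.0 * math.log(r) - 0.5 * ((parallax_obs - 1.0 / r) / sigma) ** 2

parallax_obs = 0.010   # hypothetical measured parallax, arcsec
sigma = 0.001          # hypothetical 10% measurement error, arcsec

# Distance grid from 10 pc to 500 pc in steps of 0.01 pc (truncation is an assumption).
grid = [10.0 + 0.01 * i for i in range(49001)]

naive_distance = 1.0 / parallax_obs
posterior_mode = max(grid, key=lambda r: log_posterior(r, parallax_obs, sigma))

print(f"naive distance 1/parallax : {naive_distance:.2f} pc")
print(f"posterior mode (r^2 prior): {posterior_mode:.2f} pc")
```

The posterior mode lies beyond the naive inverse-parallax distance, because the r² prior says there are more stars at larger distances that could have produced the same measurement.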


Informed Vague Priors

In the galaxy, the density of some groups of stars falls off roughly exponentially with distance from the galactic plane. This suggests a prior of the form

$$p(r)\,dr \propto \exp\left(-r\,|\sin\beta|\,/\,|z_0|\right) r^2\,dr$$

where $\beta$ is the star’s latitude and $z_0$ is the scale height for the exponential falloff of density for the group of stars in question.


Conjugate priors

Sometimes constructing noninformative priors can be difficult.

We might not have any physical information to help us choose an informed prior, either.

In such cases, we prefer analytical convenience and choose a conjugate prior.

Let $F$ be a class of sampling distributions $p(y\mid\theta)$, and $P$ a class of prior distributions for $\theta$.

$P$ is a natural conjugate prior for $F$ if

$$p(\theta\mid y) \in P \quad \text{for all } p(\cdot\mid\theta) \in F \text{ and } p(\cdot) \in P$$


Conjugate priors

Computational convenience

They can be interpreted as additional data
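Both points can be seen in the Beta-binomial pair, used here as an illustrative example of a conjugate family (it is not worked on the slides). A Beta(a, b) prior combined with binomial data gives a Beta posterior whose parameters are the prior parameters plus the observed counts, so a and b act like previously seen successes and failures. The function names and the numbers are hypothetical.

```python
def beta_binomial_update(a, b, successes, failures):
    """Posterior Beta parameters after binomial data, starting from a Beta(a, b) prior."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (2.0, 2.0)                 # behaves like 2 prior successes and 2 prior failures
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print("prior mean    :", beta_mean(*prior))
print("posterior     :", posterior)
print("posterior mean:", beta_mean(*posterior))

# Updating in two batches gives the same posterior as updating once with all the data.
batch1 = beta_binomial_update(*prior, successes=4, failures=1)
batch2 = beta_binomial_update(*batch1, successes=3, failures=2)
print("sequential    :", batch2)
```

No integration is needed at any step, which is the computational convenience, and the posterior from sequential batches matches the all-at-once update.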