
Transcript of bayestalkBW


An Introduction to Bayesian Inference and Computation

Kevin Quinn
Department of Political Science and Center for Statistics and the Social Sciences
University of Washington
[email protected]


    What is Bayesian Inference?

Bayesian inference is a means of making rational probability statements about quantities of interest (observables, model parameters, functions of model parameters)


    Advantages of the Bayesian Approach

Answers the questions that researchers are really interested in, e.g. "What is the probability that ...?"

Natural way to combine information from multiple studies

Provides a formal method for combining prior qualitative information with observed quantitative information


More general way to deal with issues of model identification

Principled means to compare non-nested models (Bayes factors)

Allows one to fit very realistic (complicated) models


    Disadvantages of the Bayesian Approach

Often computationally more demanding than classical inference

At the moment no general-purpose canned software packages similar in ease of use to, say, Stata


    Requires either:

1. An elicitation and defense of real subjective probability distributions, or

2. Sensitivity analysis to show that the choice of subjective beliefs isn't determining one's inferences

Allows one to fit overly complicated (realistic) models


    Assumed Knowledge

    Basic understanding of random variables

    8th grade algebra

A bit of matrix algebra (matrix addition, multiplication, what an inverse matrix is)


    Outline

    1. Review and Foundational Issues

(a) Foundational Issues
(b) Likelihood Inference
(c) Bayesian Inference


    2. Bayesian Computation

(a) The Gibbs Sampling Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Linear Regression with Normal Disturbances
(b) The Metropolis-Hastings Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Logistic Regression


    Discrete and Continuous Random Variables

    A random variable is discrete if:

1. it takes a finite set of values, $\{x_1, x_2, \ldots, x_n\}$, or
2. it takes an infinite sequence of distinct values, $\{x_1, x_2, \ldots\}$

A random variable is continuous if it can take all values in a dense subset of the real numbers


Distribution Functions and Probability (Density) Functions of RVs

A distribution function $F_X(x)$ of a random variable $X$ is a non-decreasing function that gives the probability that $X < x$.

A probability function $f_X(x)$ of a discrete random variable $X$ gives the probability that $X = x$


A probability density function $f_X(\cdot)$ of a continuous random variable $X$ is the non-negative function that satisfies
$$F_X(x) = \int_{-\infty}^{x} f_X(z)\, dz$$

Note $f_X(x) \neq \Pr(X = x)$ if $X$ is continuous

To find $\Pr(X \in (a, b))$ we calculate
$$\int_a^b f_X(z)\, dz = F_X(b) - F_X(a).$$


Marginal, Conditional, and Joint Probability (Density) Functions

The marginal probability function $f_X(x)$ gives the probability that $X = x$ for all $x$

The joint probability function $f_{X,Y}(x, y)$ of two discrete random variables $X$ and $Y$ gives the probability that $X = x$ and $Y = y$ for all $x$ and $y$


If $X$ and $Y$ are discrete, then
$$f_X(x) = \sum_y f_{X,Y}(x, y)$$

The conditional probability function of two discrete random variables $X$ and $Y$ is given by:
$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$
where it is assumed that $f_Y(y) > 0$


It follows that:
$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y)$$
and
$$f_Y(y) = \frac{f_{X,Y}(x, y)}{f_{X|Y}(x \mid y)} = \sum_x f_{X,Y}(x, y)$$

If $X$ and $Y$ are independent then
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$$
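As a concrete illustration of these identities, the short sketch below builds a small joint probability table for two binary random variables and checks the marginalization and factorization formulas numerically (the particular probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical joint probability table f_{X,Y}(x, y): rows index Y, columns index X
joint = np.array([[0.10, 0.30],   # Y = 0
                  [0.25, 0.35]])  # Y = 1

f_X = joint.sum(axis=0)            # marginal of X: sum over y
f_Y = joint.sum(axis=1)            # marginal of Y: sum over x

# Conditional f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)
f_X_given_Y = joint / f_Y[:, None]

# Check the factorization f_{X,Y}(x, y) = f_{X|Y}(x | y) f_Y(y)
assert np.allclose(f_X_given_Y * f_Y[:, None], joint)
print("f_X =", f_X, " f_Y =", f_Y)
```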


Similar results hold for continuous random variables, where summation is replaced with integration


    Probability Models

    Consider an observed sample of data y

A probability model for y consists of 2 things:

1. An assumption about the probability distribution with density $f(y \mid \theta)$ that generated y
2. The set $\Theta$ of possible values of the model parameters $\theta$.

$f(y \mid \theta)$ is called the sampling density and is the joint density of all the observed y's.


When $f(y \mid \theta)$ is viewed as a function of $\theta$ for fixed y it is referred to as the likelihood function and is written $L(\theta \mid y)$.

The likelihood function can actually be any function of y and $\theta$ that is proportional to $f(y \mid \theta)$


Example 1: Linear Model with Normal Disturbances

Consider a set of measurements of household income (x) and household consumption (y) for n households at the same point in time

We believe that conditional on household income, household consumption follows a normal distribution and that the observations are independent

This is our distributional assumption


Conditional independence of the y's allows us to form the sampling density by taking a product of the marginal densities

We believe that consumption varies linearly in income

Both positive and negative relationships are theoretically possible

This determines the parameter space


Our probability model is:
$$f(y \mid \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} f_{\mathcal{N}}(y_i \mid \beta_0 + x_i \beta_1,\ \sigma^2)$$
$$\beta_0 \in \mathbb{R}, \quad \beta_1 \in \mathbb{R}, \quad \sigma^2 \in \mathbb{R}^{+}$$

Note that this is equivalent to the simple linear regression model with normal disturbances:
$$y_i = \beta_0 + x_i \beta_1 + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \quad i = 1, 2, \ldots, n$$
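As a quick illustration of this probability model, the sketch below simulates consumption data from it and evaluates the log of the sampling density (equivalently, the log-likelihood) at the generating parameter values; the parameter values, sample size, and income distribution are all made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical "true" parameters and income data
beta0, beta1, sigma2 = 1.0, 0.6, 0.25
n = 100
x = rng.uniform(10, 50, size=n)                                  # household income
y = beta0 + beta1 * x + rng.normal(0, np.sqrt(sigma2), size=n)   # household consumption

def log_likelihood(beta0, beta1, sigma2, y, x):
    """Log of the sampling density: sum of independent normal log-densities."""
    return norm.logpdf(y, loc=beta0 + beta1 * x, scale=np.sqrt(sigma2)).sum()

print(log_likelihood(beta0, beta1, sigma2, y, x))
```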


Example 2: Suppose we observe y, the number of successes in n independent Bernoulli trials, each with probability of success $\pi$

This implies that y follows the binomial distribution with sample size n and probability of success $\pi$; this is the distributional assumption

The binomial sample size is fixed at n and the parameter $\pi$ can be any number in the [0, 1] interval; this determines the parameter space

Thus our probability model is
$$p(y \mid n, \pi) = \binom{n}{y} \pi^y (1 - \pi)^{(n - y)}, \qquad \pi \in [0, 1]$$


Frequentist Inference Based on the Likelihood Function

Goal of frequentist (classical) inference is to devise estimators that have desirable properties in repeated sampling

Maximum Likelihood Estimator $\hat{\theta}_{ML}$ of $\theta$:
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta \mid y)$$


Under suitable regularity conditions, $\hat{\theta}_{ML}$ is:

Consistent
Asymptotically normally distributed
Invariant to reparameterization
Minimum variance unbiased

Problem: frequentist inference can't answer questions such as:

What is $\Pr(\theta > 0)$?
What is $\Pr(g(\theta) \in (a, b))$?
What is the probability that a particular model is correct?


    Bayesian Inference

Goal of Bayesian inference is to make probability statements about model parameters $\theta$, and/or functions of model parameters $g(\theta)$, given a probability model and observed data

In other words, we want to know $f(\theta \mid y)$

Note that our probability model is defined in terms of $f(y \mid \theta)$, which is not quite what we want


Bayes' Rule gives us a formula for calculating the posterior density of $\theta$ given y (denoted $f(\theta \mid y)$) from knowledge of the sampling density of y (denoted $f(y \mid \theta)$) and the prior density of $\theta$ (denoted $f(\theta)$):
$$f(\theta \mid y) = \frac{f(y \mid \theta)\, f(\theta)}{f(y)}$$

The function $f(\theta)$ plays a crucial role in transforming our observed data and probability model into probability statements about $\theta$

Since $f(\theta)$ doesn't depend on the observed data y, it represents the researcher's subjective a priori beliefs about the likely values of $\theta$


The fact that $f(\theta)$ is a subjective probability implies that $f(\theta \mid y)$ is a subjective probability

Note that since $f(y)$ is a constant for fixed y we can write:
$$f(\theta \mid y) \propto f(y \mid \theta)\, f(\theta)$$
In words, the posterior density is proportional to the sampling density times the prior density.


Once a probability model is formed and a prior specified, Bayesian inference proceeds by summarizing $f(\theta \mid y)$.

Interesting quantities include (but are not limited to):

The posterior mean of $\theta$:
$$E[\theta \mid y] = \int \theta\, f(\theta \mid y)\, d\theta$$

The posterior variance of $\theta$:
$$\mathrm{Var}[\theta \mid y] = \int (\theta - E[\theta \mid y])^2\, f(\theta \mid y)\, d\theta$$


A $100(1 - \alpha)\%$ credible set C, where C is chosen to satisfy:
$$1 - \alpha = \int_C f(\theta \mid y)\, d\theta$$


Example: Inference About a Binomial Proportion with a Beta Prior

Suppose we observe y = (0, 1, 0, 0, 0)

We believe $y_i \overset{iid}{\sim} \mathrm{Bernoulli}(\pi)$, $i = 1, \ldots, 5$

We want to make inferences about $\pi$

Let $y = \sum_{i=1}^{n} y_i = 1$ and $n = 5$


Under the assumption of Bernoulli sampling our sampling density is
$$f(y \mid \pi) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}$$

Let's assume our beliefs about $\pi$ can be represented by a $\mathrm{Beta}(\alpha, \beta)$ density
$$f(\pi) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \pi^{(\alpha - 1)} (1 - \pi)^{(\beta - 1)}$$


The mean of a beta random variable is $\frac{\alpha}{\alpha + \beta}$

The variance of a beta random variable is $\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$


Suppose we think that the y's we observe are realizations of fair coin tosses and we are moderately sure of this. This prior belief might be represented by assuming $\pi \sim \mathrm{Beta}(10, 10)$

This implies the posterior density of $\pi$ given y is
$$f(\pi \mid y) \propto \pi^y (1 - \pi)^{n - y}\, \pi^{(\alpha - 1)} (1 - \pi)^{(\beta - 1)} \propto \pi^{(y + \alpha - 1)} (1 - \pi)^{(n - y + \beta - 1)}$$


In other words,
$$\pi \mid y \sim \mathrm{Beta}(y + \alpha,\ n - y + \beta)$$
which, in this case, means
$$\pi \mid y \sim \mathrm{Beta}(11, 14)$$
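A minimal sketch, assuming scipy is available, of how the posterior summaries from the earlier slides (mean, variance, and a 95% equal-tailed credible interval, one of many valid credible sets) can be computed for this Beta(11, 14) posterior:

```python
from scipy.stats import beta

# Prior Beta(10, 10); data: y = 1 success out of n = 5 trials
alpha0, beta0 = 10, 10
y, n = 1, 5

posterior = beta(alpha0 + y, beta0 + n - y)   # Beta(11, 14)

post_mean = posterior.mean()                  # E[pi | y]
post_var = posterior.var()                    # Var[pi | y]
ci_95 = posterior.interval(0.95)              # equal-tailed 95% credible set

print(post_mean, post_var, ci_95)
```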

[Figure: density curves of $\pi$ on [0, 1]]


Now let's suppose our prior beliefs are that y is being generated whenever a 1 comes up on a fair 6-sided die

$\pi \sim \mathrm{Beta}(3.33, 16.67)$

$\pi \mid y \sim \mathrm{Beta}(4.33, 20.67)$

[Figure: density curves of $\pi$ on [0, 1]]


    Bayesian Computation

Goal of Bayesian inference is to summarize the posterior distribution $f(\theta \mid y)$


This involves integration:
$$E[\theta \mid y] = \int \theta\, f(\theta \mid y)\, d\theta$$
$$\mathrm{Var}[\theta \mid y] = \int (\theta - E[\theta \mid y])^2\, f(\theta \mid y)\, d\theta$$
$$1 - \alpha = \int_C f(\theta \mid y)\, d\theta$$

These integrals are typically not available in closed form, which motivates simulation-based (Markov chain Monte Carlo) methods for summarizing the posterior.

    The Gibbs Sampling Algorithm

Suppose we have a density $f(\theta_1, \ldots, \theta_k)$

We want to take a sample from the joint distribution of $\theta_1, \ldots, \theta_k$, but it is not easy to do this directly

Rather than trying to sample directly from $f(\theta_1, \ldots, \theta_k)$, we are going to construct a Markov chain whose stationary distribution is $f(\theta_1, \ldots, \theta_k)$


The Gibbs sampling algorithm works as follows:

initialize $\theta_2^{(0)}, \ldots, \theta_k^{(0)}$
for (i = 1 to M) {
    sample $\theta_1^{(i)}$ from $f_1(\theta_1 \mid \theta_2^{(i-1)}, \ldots, \theta_k^{(i-1)})$
    sample $\theta_2^{(i)}$ from $f_2(\theta_2 \mid \theta_1^{(i)}, \theta_3^{(i-1)}, \ldots, \theta_k^{(i-1)})$
    ...
    sample $\theta_k^{(i)}$ from $f_k(\theta_k \mid \theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)})$
    store $\theta_1^{(i)}, \ldots, \theta_k^{(i)}$
}


A Proof of Convergence for a Simple Joint Distribution [from Casella and George (1992)]

Consider

        X=0        X=1
Y=0     p1         p2         p1 + p2
Y=1     p3         p4         p3 + p4
        p1 + p3    p2 + p4    1.00


The joint probability function is just
$$\begin{bmatrix} f_{X,Y}(0, 0) & f_{X,Y}(1, 0) \\ f_{X,Y}(0, 1) & f_{X,Y}(1, 1) \end{bmatrix} = \begin{bmatrix} p_1 & p_2 \\ p_3 & p_4 \end{bmatrix}$$

The marginal probability function for X is
$$f_X = [\, f_X(0) \ \ f_X(1) \,] = [\, p_1 + p_3 \ \ \ p_2 + p_4 \,]$$


The full conditional probabilities are expressed by two matrices
$$A_{Y|X} = \begin{bmatrix} \frac{p_1}{p_1 + p_3} & \frac{p_3}{p_1 + p_3} \\[4pt] \frac{p_2}{p_2 + p_4} & \frac{p_4}{p_2 + p_4} \end{bmatrix}$$
and
$$A_{X|Y} = \begin{bmatrix} \frac{p_1}{p_1 + p_2} & \frac{p_2}{p_1 + p_2} \\[4pt] \frac{p_3}{p_3 + p_4} & \frac{p_4}{p_3 + p_4} \end{bmatrix}$$


Transitions are of the form $X_0 \rightarrow Y_1 \rightarrow X_1$

What are the transition probabilities from $X_0$ to $X_1$?

Since we have to go through $Y_1$ to get to $X_1$, the transition probabilities take the form
$$\Pr(X_1 = x_1 \mid X_0 = x_0) = \sum_{y_1} \Pr(X_1 = x_1 \mid Y_1 = y_1)\, \Pr(Y_1 = y_1 \mid X_0 = x_0)$$


Which means that the transition matrix for the X sequence is $A_{X|X} = A_{Y|X} A_{X|Y}$

The stationary distribution $f$ of the X sequence must satisfy
$$f A_{X|X} = f$$

It is easy to check that
$$f_X A_{X|X} = f_X A_{Y|X} A_{X|Y} = f_X$$


In other words, the marginal probability function of X is the stationary distribution of the X chain
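A small numerical check of this argument (with made-up values for p1 through p4), verifying that $f_X$ is stationary for the transition matrix $A_{X|X} = A_{Y|X} A_{X|Y}$:

```python
import numpy as np

# Hypothetical joint probabilities for the 2x2 table
p1, p2, p3, p4 = 0.1, 0.3, 0.25, 0.35

f_X = np.array([p1 + p3, p2 + p4])                      # marginal of X

# Rows of A_{Y|X} are P(Y = . | X = x); rows of A_{X|Y} are P(X = . | Y = y)
A_YgX = np.array([[p1 / (p1 + p3), p3 / (p1 + p3)],
                  [p2 / (p2 + p4), p4 / (p2 + p4)]])
A_XgY = np.array([[p1 / (p1 + p2), p2 / (p1 + p2)],
                  [p3 / (p3 + p4), p4 / (p3 + p4)]])

A_XgX = A_YgX @ A_XgY                                   # transition matrix of the X chain

# f_X is stationary: f_X A_{X|X} = f_X
print(f_X @ A_XgX, f_X)
assert np.allclose(f_X @ A_XgX, f_X)
```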


Example: Sampling from a Bivariate Normal Distribution

Consider
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix},\ \begin{bmatrix} \sigma^2_X & \sigma_{X,Y} \\ \sigma_{Y,X} & \sigma^2_Y \end{bmatrix} \right)$$

Let $\rho = \sigma_{X,Y} / \sqrt{\sigma^2_X \sigma^2_Y}$


Full conditionals:
$$X \mid Y = y \sim \mathcal{N}(m_1, s^2_1)$$
where
$$m_1 = \mu_X - \mu_Y(\sigma_{X,Y}/\sigma^2_Y) + y(\sigma_{X,Y}/\sigma^2_Y)$$
$$s^2_1 = \sigma^2_X (1 - \rho^2)$$


and
$$Y \mid X = x \sim \mathcal{N}(m_2, s^2_2)$$
where
$$m_2 = \mu_Y - \mu_X(\sigma_{X,Y}/\sigma^2_X) + x(\sigma_{X,Y}/\sigma^2_X)$$
$$s^2_2 = \sigma^2_Y (1 - \rho^2)$$


The Gibbs sampling algorithm for the bivariate normal distribution is:

initialize $y^{(0)}$
for (i = 1 to M) {
    sample $x^{(i)} \mid y^{(i-1)}$ from $\mathcal{N}(m_1, s^2_1)$
    sample $y^{(i)} \mid x^{(i)}$ from $\mathcal{N}(m_2, s^2_2)$
    store $x^{(i)}$ and $y^{(i)}$
}
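A minimal runnable sketch of this Gibbs sampler using the full conditionals above; the mean, variance, and covariance values are arbitrary choices for illustration:

```python
import numpy as np

def gibbs_bvn(mu_x, mu_y, sig2_x, sig2_y, sig_xy, M=5000, y0=0.0, seed=0):
    """Gibbs sampler for a bivariate normal via its two full conditionals."""
    rng = np.random.default_rng(seed)
    rho2 = sig_xy**2 / (sig2_x * sig2_y)
    s1 = np.sqrt(sig2_x * (1 - rho2))      # sd of X | Y = y
    s2 = np.sqrt(sig2_y * (1 - rho2))      # sd of Y | X = x
    draws = np.empty((M, 2))
    y = y0
    for i in range(M):
        m1 = mu_x + (sig_xy / sig2_y) * (y - mu_y)
        x = rng.normal(m1, s1)
        m2 = mu_y + (sig_xy / sig2_x) * (x - mu_x)
        y = rng.normal(m2, s2)
        draws[i] = x, y
    return draws

draws = gibbs_bvn(mu_x=1.0, mu_y=-1.0, sig2_x=1.0, sig2_y=2.0, sig_xy=0.9)
print(draws.mean(axis=0), np.cov(draws.T))   # should approximate the target moments
```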

Example: Linear Regression with Normal Disturbances

Consider the linear model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Assume $\beta$ is a priori independent of $\sigma^2$, so $p(\beta, \sigma^2) = p(\beta)\, p(\sigma^2)$

Priors:
$$\beta \sim \mathcal{N}(m, V)$$
$$\sigma^2 \sim \mathcal{IG}(\nu/2, \delta/2)$$

Posterior (ignoring constants of proportionality):


$$f(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-n/2} \exp\!\left( -\frac{(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)}{2} \right) \exp\!\left( -\frac{(\beta - m)' V^{-1} (\beta - m)}{2} \right) (\sigma^2)^{-(\nu/2 + 1)} \exp\!\left( -\frac{\delta/2}{\sigma^2} \right)$$

To find the full conditionals, drop terms that don't include the variable of interest


Full conditional for $\beta$:
$$f_\beta(\beta \mid \sigma^2, y) \propto \exp\!\left( -\frac{(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)}{2} \right) \exp\!\left( -\frac{(\beta - m)' V^{-1} (\beta - m)}{2} \right)$$


A bit of algebra reveals
$$f_\beta(\beta \mid \sigma^2, y) \propto \exp\!\left( -\frac{(\beta - \tilde{m})' \tilde{V}^{-1} (\beta - \tilde{m})}{2} \right)$$
where
$$\tilde{V} = \left[ X'(\sigma^2 I)^{-1} X + V^{-1} \right]^{-1}, \qquad \tilde{m} = \tilde{V} \left[ X'(\sigma^2 I)^{-1} y + V^{-1} m \right].$$


Full conditional for $\sigma^2$:
$$f_{\sigma^2}(\sigma^2 \mid \beta, y) \propto (\sigma^2)^{-n/2} \exp\!\left( -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right) (\sigma^2)^{-(\nu/2 + 1)} \exp\!\left( -\frac{\delta/2}{\sigma^2} \right)$$
$$\propto (\sigma^2)^{-(n/2 + \nu/2 + 1)} \exp\!\left( -\frac{(y - X\beta)'(y - X\beta) + \delta}{2\sigma^2} \right)$$


This is an inverse gamma density with shape parameter $(n + \nu)/2$ and scale parameter $\frac{(y - X\beta)'(y - X\beta) + \delta}{2}$

The intuition here is that $\nu$ is acting like an additional number of observations and $\delta$ is acting like an additional sum of squared errors.


The Gibbs sampling algorithm for the linear model is:

initialize $\sigma^{2(0)}$
for (i = 1 to M) {
    sample $\beta^{(i)}$ from $f_\beta(\beta \mid \sigma^{2(i-1)}, y)$
    sample $\sigma^{2(i)}$ from $f_{\sigma^2}(\sigma^2 \mid \beta^{(i)}, y)$
    store $\beta^{(i)}$ and $\sigma^{2(i)}$
}
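A compact sketch of this sampler built directly from the full conditionals above; the simulated data, prior hyperparameters (m, V, ν, δ), and number of iterations are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a hypothetical linear model
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sig2_true = np.array([1.0, 0.5]), 0.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sig2_true), size=n)

# Prior hyperparameters: beta ~ N(m, V), sigma^2 ~ IG(nu/2, delta/2)
m, V = np.zeros(k), 100.0 * np.eye(k)
nu, delta = 1.0, 1.0
V_inv = np.linalg.inv(V)

M = 5000
beta_draws, sig2_draws = np.empty((M, k)), np.empty(M)
sig2 = 1.0                                   # initialize sigma^2
for i in range(M):
    # Full conditional for beta: N(m_tilde, V_tilde)
    V_tilde = np.linalg.inv(X.T @ X / sig2 + V_inv)
    m_tilde = V_tilde @ (X.T @ y / sig2 + V_inv @ m)
    beta = rng.multivariate_normal(m_tilde, V_tilde)
    # Full conditional for sigma^2: IG((n + nu)/2, ((y - Xb)'(y - Xb) + delta)/2)
    resid = y - X @ beta
    shape = (n + nu) / 2.0
    scale = (resid @ resid + delta) / 2.0
    sig2 = scale / rng.gamma(shape)          # inverse gamma draw via 1 / gamma
    beta_draws[i], sig2_draws[i] = beta, sig2

print(beta_draws.mean(axis=0), sig2_draws.mean())
```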


    The Metropolis-Hastings Algorithm

Suppose we want to sample from $f(\theta \mid y)$


The Metropolis-Hastings Algorithm is:

initialize $\theta^{(0)}$
for (i in 1 to M) {
    sample $\theta^{(i-1)}_{can}$ from $q(\theta \mid \theta^{(i-1)})$
    set
    $$\theta^{(i)} = \begin{cases} \theta^{(i-1)}_{can} & \text{with probability } \alpha(\theta^{(i-1)}_{can}, \theta^{(i-1)}) \\ \theta^{(i-1)} & \text{with probability } 1 - \alpha(\theta^{(i-1)}_{can}, \theta^{(i-1)}) \end{cases}$$
    store $\theta^{(i)}$
}
where


$$\alpha(\theta_{can}, \theta) = \min\!\left\{ \frac{f(\theta_{can} \mid y)}{f(\theta \mid y)}\, \frac{q(\theta \mid \theta_{can})}{q(\theta_{can} \mid \theta)},\ 1 \right\}$$


Example: Sampling from a Bivariate Normal Distribution
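The transcript gives only the slide title here; as a hedged sketch, one common choice is a random-walk Metropolis-Hastings sampler, shown below for an arbitrary bivariate normal target (the target parameters and proposal scale are assumptions for illustration). With a symmetric proposal, the q terms cancel in the acceptance probability:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Hypothetical bivariate normal target
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9], [0.9, 2.0]])
target = multivariate_normal(mu, Sigma)

M, step = 10000, 0.8
theta = np.zeros(2)                                  # initialize theta^(0)
draws = np.empty((M, 2))
for i in range(M):
    cand = theta + rng.normal(scale=step, size=2)    # symmetric random-walk proposal
    alpha = min(1.0, np.exp(target.logpdf(cand) - target.logpdf(theta)))
    if rng.uniform() < alpha:                        # accept with probability alpha
        theta = cand
    draws[i] = theta

print(draws.mean(axis=0))                            # should approximate mu
```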


    Example: Logistic Regression
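The transcript ends with this slide title only; a minimal sketch of what a random-walk Metropolis-Hastings sampler for a logistic regression posterior could look like, under the assumption (mine, not the slides') of independent vague normal priors on the coefficients and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated logistic regression data (made up for illustration)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def log_post(beta, prior_sd=10.0):
    """Log posterior: Bernoulli log-likelihood plus independent N(0, prior_sd^2) priors."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return loglik + logprior

M, step = 20000, 0.1
beta = np.zeros(2)
draws = np.empty((M, 2))
for i in range(M):
    cand = beta + rng.normal(scale=step, size=2)     # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(cand) - log_post(beta):
        beta = cand
    draws[i] = beta

print(draws[M // 2:].mean(axis=0))                   # posterior means after burn-in
```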