Transcript of bayestalkBW
An Introduction to Bayesian Inference and
Computation
Kevin Quinn
Department of Political Science and Center for Statistics and the Social Sciences
University of Washington
What is Bayesian Inference?
Bayesian inference is a means of making rational probability statements about quantities of interest (observables, model parameters, functions of model parameters)
Advantages of the Bayesian Approach
Answers the questions that researchers are really interested in, e.g. What is the probability that ...?
Natural way to combine information from multiple studies
Provides a formal method for combining prior qualitative information with observed quantitative information
More general way to deal with issues of model identification
Principled means to compare non-nested models (Bayes factors)
Allows one to fit very realistic (complicated) models
Disadvantages of the Bayesian Approach
Often computationally more demanding than classical inference
At the moment, no general-purpose canned software packages similar in ease of use to, say, Stata
Requires either:
1. An elicitation and defense of real subjective probability distributions, or
2. Sensitivity analysis to show that the choice of subjective beliefs isn't determining one's inferences
Allows one to fit overly complicated (realistic) models
Assumed Knowledge
Basic understanding of random variables
8th grade algebra
A bit of matrix algebra (matrix addition, multiplication,
what an inverse matrix is)
Outline
1. Review and Foundational Issues
(a) Foundational Issues
(b) Likelihood Inference
(c) Bayesian Inference
2. Bayesian Computation
(a) The Gibbs Sampling Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Linear Regression with Normal Disturbances
(b) The Metropolis-Hastings Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Logistic Regression
Discrete and Continuous Random Variables
A random variable is discrete if:
1. it takes a finite set of values, {x_1, x_2, ..., x_n}, or
2. it takes an infinite sequence of distinct values, {x_1, x_2, ...}
A random variable is continuous if it can take all values in a dense subset of the real numbers
Distribution Functions and Probability
(Density) Functions of RVs
A distribution function F_X(x) of a random variable X is a non-decreasing function that gives the probability that X ≤ x.
A probability function f_X(x) of a discrete random variable X gives the probability that X = x
A probability density function f_X(·) of a continuous random variable X is the non-negative function that satisfies
F_X(x) = ∫_{-∞}^{x} f_X(z) dz
Note f_X(x) ≠ Pr(X = x) if X is continuous
To find Pr(X ∈ (a, b)) we calculate ∫_a^b f_X(z) dz = F_X(b) - F_X(a).
Marginal, Conditional, and Joint
Probability (Density) Functions
The marginal probability function f_X(x) gives the probability that X = x for all x
The joint probability function f_{X,Y}(x, y) of two discrete random variables X and Y gives the probability that X = x and Y = y for all x and y
If X and Y are discrete, then
f_X(x) = Σ_y f_{X,Y}(x, y)
The conditional probability function of two discrete random variables X and Y is given by:
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
where it is assumed that f_Y(y) > 0
It follows that:
f_{X,Y}(x, y) = f_{X|Y}(x|y) f_Y(y)
and
f_Y(y) = f_{X,Y}(x, y) / f_{X|Y}(x|y) = Σ_x f_{X,Y}(x, y)
If X and Y are independent then
f_{X,Y}(x, y) = f_X(x) f_Y(y)
Similar results hold for continuous random variables
where summation is replaced with integration
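As a small illustration of these identities, the following Python snippet checks them numerically for an arbitrary 2 × 3 joint table (the probabilities are made up):

```python
# A made-up 2x3 joint table: joint[i, j] = Pr(X = i, Y = j)
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

f_x = joint.sum(axis=1)        # marginal of X: sum the joint over y
f_y = joint.sum(axis=0)        # marginal of Y: sum the joint over x
f_x_given_y = joint / f_y      # f(x|y) = f(x, y)/f(y); each column sums to 1

# f(x, y) = f(x|y) f(y) recovers the joint table
assert np.allclose(f_x_given_y * f_y, joint)
print(f_x, f_y)
```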
Probability Models
Consider an observed sample of data y
A probability model for y consists of 2 things:
1. An assumption about the probability distribution with density f(y|θ) that generated y
2. The set Θ of possible values of the model parameters θ
f(y|θ) is called the sampling density and is the joint density of all the observed y's.
When f(y|θ) is viewed as a function of θ for fixed y it is referred to as the likelihood function and is written L(θ|y).
The likelihood function can actually be any function of y and θ that is proportional to f(y|θ)
Example 1: Linear Model with Normal Disturbances
Consider a set of measurements of household income (x) and household consumption (y) for n households at the same point in time
We believe that, conditional on household income, household consumption follows a normal distribution and that the observations are independent
This is our distributional assumption
Conditional independence of the y's allows us to form the sampling density by taking a product of the marginal densities
We believe that consumption varies linearly in income
Both positive and negative relationships are theoretically possible
This determines the parameter space
Our probability model is:
f(y | β_0, β_1, σ²) = ∏_{i=1}^{n} f_N(y_i | β_0 + x_i β_1, σ²)
β_0 ∈ R, β_1 ∈ R, σ² ∈ R_+
Note that this is equivalent to the simple linear regression model with normal disturbances:
y_i = β_0 + x_i β_1 + ε_i
ε_i ~ iid N(0, σ²), i = 1, 2, ..., n
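A minimal Python sketch of this model: it simulates data under illustrative parameter values (β_0 = 1, β_1 = 0.8, σ² = 0.5, all invented) and evaluates the log of the sampling density, i.e. the log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, beta0, beta1, sigma2 = 100, 1.0, 0.8, 0.5    # invented "true" values
x = rng.uniform(0, 10, size=n)                  # household income
y = beta0 + beta1 * x + rng.normal(0, np.sqrt(sigma2), size=n)

def log_lik(b0, b1, s2):
    # conditional independence: the joint density is a product, so the
    # log-likelihood is a sum of normal log densities
    return norm.logpdf(y, loc=b0 + b1 * x, scale=np.sqrt(s2)).sum()

print(log_lik(beta0, beta1, sigma2))
```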
This implies that y follows the binomial distribution with sample size n and probability of success π. This is the distributional assumption
The binomial sample size is fixed at n and the parameter π can be any number in the [0, 1] interval. This determines the parameter space
Thus our probability model is
p(y | n, π) = (n choose y) π^y (1 - π)^(n-y),  π ∈ [0, 1]
Frequentist Inference Based on the Likelihood Function
The goal of frequentist (classical) inference is to devise estimators that have desirable properties in repeated sampling
Maximum Likelihood Estimator θ̂_ML of θ:
θ̂_ML = arg max_θ L(θ|y)
Under suitable regularity conditions, θ̂_ML is:
  Consistent
  Asymptotically normally distributed
  Invariant to reparameterization
  Minimum variance unbiased
Problem: frequentist inference can't answer questions such as:
  What is Pr(θ > 0)?
  What is Pr(g(θ) ∈ (a, b))?
  What is the probability that a particular model is correct?
Bayesian Inference
The goal of Bayesian inference is to make probability statements about model parameters θ, and/or functions of model parameters g(θ), given a probability model and observed data
In other words, we want to know f(θ|y)
Note that our probability model is defined in terms of f(y|θ), which is not quite what we want
Bayes' Rule gives us a formula for calculating the posterior density of θ given y (denoted f(θ|y)) from knowledge of the sampling density of y (denoted f(y|θ)) and the prior density of θ (denoted f(θ)):
f(θ|y) = f(y|θ) f(θ) / f(y)
The function f(θ) plays a crucial role in transforming our observed data and probability model into probability statements about θ
Since f(θ) doesn't depend on the observed data y, it represents the researcher's subjective a priori beliefs about the likely values of θ
The fact that f(θ) is a subjective probability implies that f(θ|y) is a subjective probability
Note that since f(y) is a constant for fixed y we can write:
f(θ|y) ∝ f(y|θ) f(θ)
In words, the posterior density is proportional to the sampling density times the prior density.
Once a probability model is formed and a prior specified, Bayesian inference proceeds by summarizing f(θ|y).
Interesting quantities include (but are not limited to):
The posterior mean of θ:
E[θ|y] = ∫ θ f(θ|y) dθ
The posterior variance of θ:
Var[θ|y] = ∫ (θ - E[θ|y])² f(θ|y) dθ
A 100 × (1 - α)% credible set C, where C is chosen to satisfy:
1 - α = ∫_C f(θ|y) dθ
Example: Inference About a Binomial Proportion with a Beta Prior
Suppose we observe y = (0, 1, 0, 0, 0)
We believe y_i ~ iid Bernoulli(π), i = 1, ..., 5
We want to make inferences about π
Let y = Σ_{i=1}^{n} y_i = 1 and n = 5
Under the assumption of Bernoulli sampling our sampling density is
f(y|π) = (n choose y) π^y (1 - π)^(n-y)
Let's assume our beliefs about π can be represented by a Beta(α, β) density
f(π) = [Γ(α + β) / (Γ(α) Γ(β))] π^(α-1) (1 - π)^(β-1)
The mean of a beta random variable is α / (α + β)
The variance of a beta random variable is αβ / [(α + β)² (α + β + 1)]
Suppose we think that the y's we observe are realizations of fair coin tosses and we are moderately sure of this.
This prior belief might be represented by assuming π ~ Beta(10, 10)
This implies the posterior density of π given y is
f(π|y) ∝ π^y (1 - π)^(n-y) π^(α-1) (1 - π)^(β-1)
       = π^(y+α-1) (1 - π)^(n-y+β-1)
In other words,
π|y ~ Beta(y + α, n - y + β)
which, in this case, means
π|y ~ Beta(11, 14)
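As a sketch of how such posterior summaries can be computed, scipy.stats.beta handles the Beta(11, 14) posterior directly; note this also answers a question of the kind frequentist inference cannot, namely Pr(π > 0.5 | y):

```python
from scipy.stats import beta

post = beta(11, 14)           # pi | y ~ Beta(y + alpha, n - y + beta)
print(post.mean())            # posterior mean: 11/25 = 0.44
print(post.var())             # posterior variance
print(post.interval(0.95))    # central 95% credible interval
print(post.sf(0.5))           # Pr(pi > 0.5 | y)
```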
[Figure: density plot of π over [0, 1]]
[Figure: density plot of π over [0, 1]]
Now let's suppose our prior beliefs are that y is being generated whenever a 1 comes up on a fair 6-sided die
π ~ Beta(3.33, 16.67)
π|y ~ Beta(4.33, 20.67)
[Figure: density plot of π over [0, 1]]
[Figure: density plot of π over [0, 1]]
Bayesian Computation
The goal of Bayesian inference is to summarize the posterior distribution f(θ|y)
This involves integration:
E[θ|y] = ∫ θ f(θ|y) dθ
Var[θ|y] = ∫ (θ - E[θ|y])² f(θ|y) dθ
1 - α = ∫_C f(θ|y) dθ
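In all but the simplest models these integrals have no closed form. The Monte Carlo idea behind the algorithms that follow is to draw θ^(1), ..., θ^(M) from f(θ|y) and replace the integrals with sample averages. A minimal sketch, using the Beta(11, 14) posterior from the earlier example as a stand-in since it can be sampled directly:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(11, 14, size=100_000)          # draws from f(pi | y)

print(draws.mean())                             # approximates E[pi | y]
print(draws.var())                              # approximates Var[pi | y]
print(np.quantile(draws, [0.025, 0.975]))       # approximate 95% credible set
```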
The Gibbs Sampling Algorithm
Suppose we have a density f(θ_1, ..., θ_k)
We want to take a sample from the joint distribution of θ_1, ..., θ_k but it is not easy to do this directly
Rather than trying to sample directly from f(θ_1, ..., θ_k) we are going to construct a Markov chain whose stationary distribution is f(θ_1, ..., θ_k)
The Gibbs sampling algorithm works as follows:

initialize θ_2^(0), ..., θ_k^(0)
for (i = 1 to M) {
    sample θ_1^(i) from f_1(θ_1 | θ_2^(i-1), ..., θ_k^(i-1))
    sample θ_2^(i) from f_2(θ_2 | θ_1^(i), θ_3^(i-1), ..., θ_k^(i-1))
    ...
    sample θ_k^(i) from f_k(θ_k | θ_1^(i), ..., θ_{k-1}^(i))
    store θ_1^(i), ..., θ_k^(i)
}
A Proof of Convergence for a Simple Joint Distribution [From Casella and George (1992)]
Consider

          X = 0      X = 1
Y = 0     p1         p2         p1 + p2
Y = 1     p3         p4         p3 + p4
          p1 + p3    p2 + p4    1.00
The joint probability function is just

[ f_{X,Y}(0, 0)   f_{X,Y}(1, 0) ]   [ p1   p2 ]
[ f_{X,Y}(0, 1)   f_{X,Y}(1, 1) ] = [ p3   p4 ]

The marginal probability function for X is

f_X = [ f_X(0)   f_X(1) ] = [ p1 + p3   p2 + p4 ]
The full conditional probabilities are expressed by two matrices:

A_{Y|X} = [ p1/(p1 + p3)   p3/(p1 + p3) ]
          [ p2/(p2 + p4)   p4/(p2 + p4) ]

and

A_{X|Y} = [ p1/(p1 + p2)   p2/(p1 + p2) ]
          [ p3/(p3 + p4)   p4/(p3 + p4) ]
Transitions are of the form X0 → Y1 → X1
What are the transition probabilities from X0 to X1?
Since we have to go through Y1 to get to X1, the transition probabilities take the form
Pr(X1 = x1 | X0 = x0) = Σ_{y1} Pr(X1 = x1 | Y1 = y1) Pr(Y1 = y1 | X0 = x0)
Which means that the transition matrix for the X sequence is A_{X|X} = A_{Y|X} A_{X|Y}
The stationary distribution f_X of the X sequence must satisfy
f_X A_{X|X} = f_X
It is easy to check that
f_X A_{X|X} = f_X A_{Y|X} A_{X|Y} = f_X
In other words, the marginal probability function of X is the stationary distribution of the X chain
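A quick numeric check of this argument in Python, with arbitrary values p1, ..., p4 filling the 2 × 2 table:

```python
import numpy as np

p1, p2, p3, p4 = 0.2, 0.3, 0.1, 0.4    # arbitrary entries of the 2x2 table
joint = np.array([[p1, p2],            # rows index Y, columns index X
                  [p3, p4]])

f_x = joint.sum(axis=0)                # [p1 + p3, p2 + p4]
f_y = joint.sum(axis=1)                # [p1 + p2, p3 + p4]

A_y_given_x = (joint / f_x).T          # row x, column y: Pr(Y = y | X = x)
A_x_given_y = joint / f_y[:, None]     # row y, column x: Pr(X = x | Y = y)

A_xx = A_y_given_x @ A_x_given_y       # one transition X0 -> Y1 -> X1
assert np.allclose(f_x @ A_xx, f_x)    # f_X is stationary for the X chain
print(A_xx)
```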
Example: Sampling from a Bivariate Normal Distribution
Consider

(X, Y)' ~ N( (μ_X, μ_Y)',  [ σ_X²     σ_{X,Y} ]
                           [ σ_{Y,X}  σ_Y²    ] )

Let ρ = σ_{X,Y} / √(σ_X² σ_Y²)
Full conditionals:

X | Y = y ~ N(m_1, s_1²)

where

m_1 = μ_X - μ_Y (σ_{X,Y}/σ_Y²) + y (σ_{X,Y}/σ_Y²)
s_1² = σ_X² (1 - ρ²)
and

Y | X = x ~ N(m_2, s_2²)

where

m_2 = μ_Y - μ_X (σ_{X,Y}/σ_X²) + x (σ_{X,Y}/σ_X²)
s_2² = σ_Y² (1 - ρ²)
The Gibbs sampling algorithm for the bivariate normal distribution is:

initialize y^(0)
for (i = 1 to M) {
    sample x^(i) | y^(i-1) from N(m_1, s_1²)
    sample y^(i) | x^(i) from N(m_2, s_2²)
    store x^(i) and y^(i)
}
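A direct Python translation of this algorithm; the means, variances, and covariance below are illustrative values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y = 0.0, 0.0
s2_x, s2_y, s_xy = 1.0, 1.0, 0.8       # illustrative covariance entries
M = 10_000

x_draws, y_draws = np.empty(M), np.empty(M)
y = 0.0                                # initialize y^(0)
for i in range(M):
    # x^(i) | y^(i-1) ~ N(m1, s1^2), with s1^2 = s2_x - s_xy^2 / s2_y
    m1 = mu_x + (s_xy / s2_y) * (y - mu_y)
    x = rng.normal(m1, np.sqrt(s2_x - s_xy**2 / s2_y))
    # y^(i) | x^(i) ~ N(m2, s2^2)
    m2 = mu_y + (s_xy / s2_x) * (x - mu_x)
    y = rng.normal(m2, np.sqrt(s2_y - s_xy**2 / s2_x))
    x_draws[i], y_draws[i] = x, y

print(np.corrcoef(x_draws, y_draws)[0, 1])   # should be close to rho = 0.8
```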
Example: Linear Regression with Normal Disturbances

Model: y = Xβ + ε with ε ~ N(0, σ² I_n)
Assume β is a priori independent of σ², so p(β, σ²) = p(β) p(σ²)

Priors:

β ~ N(m, V)
σ² ~ IG(ν/2, δ/2)

Posterior (ignoring constants of proportionality):
f(β, σ²|y) ∝ (σ²)^(-n/2) exp( -(y - Xβ)' (σ² I_n)^(-1) (y - Xβ) / 2 )
             × exp( -(β - m)' V^(-1) (β - m) / 2 )
             × (σ²)^(-(ν/2 + 1)) exp( -(δ/2) / σ² )                      (2)

To find the full conditionals, drop terms that don't include the variable of interest
Full conditional for β:

f_β(β | σ², y) ∝ exp( -(y - Xβ)' (σ² I_n)^(-1) (y - Xβ) / 2 )
                × exp( -(β - m)' V^(-1) (β - m) / 2 )                    (3)
A bit of algebra reveals

f_β(β | σ², y) ∝ exp( -(β - m̃)' Ṽ^(-1) (β - m̃) / 2 )                    (4)

where

Ṽ = (X' (σ² I)^(-1) X + V^(-1))^(-1)
m̃ = Ṽ (X' (σ² I)^(-1) y + V^(-1) m)
Full conditional for σ²:

f_{σ²}(σ² | β, y) ∝ (σ²)^(-n/2) exp( -(y - Xβ)'(y - Xβ) / (2σ²) )
                   × (σ²)^(-(ν/2 + 1)) exp( -(δ/2) / σ² )
                 = (σ²)^(-(n/2 + ν/2 + 1)) exp( -[(y - Xβ)'(y - Xβ) + δ] / (2σ²) )   (5)
This is an inverse gamma density with shape parameter (n + ν)/2 and scale parameter [(y - Xβ)'(y - Xβ) + δ]/2
The intuition here is that ν is acting like an additional number of observations and δ is acting like an additional sum of squared errors.
The Gibbs sampling algorithm for the linear model is:

initialize σ²^(0)
for (i = 1 to M) {
    sample β^(i) from f_β(β | σ²^(i-1), y)
    sample σ²^(i) from f_{σ²}(σ² | β^(i), y)
    store β^(i) and σ²^(i)
}
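A Python sketch of this sampler on simulated data; the prior settings (m = 0, V = 100 I, ν = δ = 1), the data-generating values, and the chain length are illustrative choices, with V_star and m_star playing the roles of Ṽ and m̃ above:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data from y = X beta + eps
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(0, 0.7, n)

# priors: beta ~ N(m, V), sigma2 ~ IG(nu/2, delta/2)
m, V_inv = np.zeros(k), np.eye(k) / 100.0      # vague prior, V = 100 I
nu, delta = 1.0, 1.0

M = 5_000
sigma2 = 1.0                                   # initialize sigma2^(0)
draws = np.empty((M, k + 1))
for i in range(M):
    # full conditional for beta: N(m_star, V_star), eq. (4)
    V_star = np.linalg.inv(X.T @ X / sigma2 + V_inv)
    m_star = V_star @ (X.T @ y / sigma2 + V_inv @ m)
    beta = rng.multivariate_normal(m_star, V_star)
    # full conditional for sigma2: IG((n + nu)/2, (SSE + delta)/2), eq. (5);
    # an IG(a, b) draw is 1 over a Gamma(shape=a, scale=1/b) draw
    resid = y - X @ beta
    sigma2 = 1.0 / rng.gamma((n + nu) / 2.0, 2.0 / (resid @ resid + delta))
    draws[i] = (*beta, sigma2)

print(draws[1000:].mean(axis=0))               # posterior means after burn-in
```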
The Metropolis-Hastings Algorithm
Suppose we want to sample from f(θ|y)
The Metropolis-Hastings algorithm is:

initialize θ^(0)
for (i = 1 to M) {
    sample θ_can^(i-1) from q(θ | θ^(i-1))
    set θ^(i) = θ_can^(i-1)  with probability α(θ_can^(i-1), θ^(i-1))
        θ^(i) = θ^(i-1)      with probability 1 - α(θ_can^(i-1), θ^(i-1))
    store θ^(i)
}
where
α(θ_can, θ) = min{ [f(θ_can|y) / f(θ|y)] × [q(θ|θ_can) / q(θ_can|θ)], 1 }
Example: Sampling from a Bivariate Normal Distribution
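One concrete version of this example is a random-walk Metropolis sampler: with the symmetric proposal below, the q terms in α cancel. The target's means and covariance and the step size 0.5 are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8],
                                    [0.8, 1.0]]))

def log_target(t):
    d = t - mu
    return -0.5 * d @ Sigma_inv @ d          # log f, up to an additive constant

M = 10_000
theta = np.zeros(2)                          # initialize theta^(0)
draws = np.empty((M, 2))
n_accept = 0
for i in range(M):
    cand = theta + rng.normal(0, 0.5, size=2)       # symmetric proposal
    # accept with probability alpha = min(f(cand|y)/f(theta|y), 1)
    if np.log(rng.uniform()) < log_target(cand) - log_target(theta):
        theta, n_accept = cand, n_accept + 1
    draws[i] = theta

print(n_accept / M, np.corrcoef(draws.T)[0, 1])
```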
Example: Logistic Regression
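The transcript ends at this heading, so the following is only a hedged sketch of what such an example could look like: random-walk Metropolis for a logistic regression under a flat prior on β, so that f(β|y) is proportional to the likelihood. The simulated data and tuning constant are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data: Pr(y_i = 1) = 1 / (1 + exp(-x_i' beta))
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([-0.5, 1.0]))))

def log_post(beta):
    eta = X @ beta
    # flat prior: log posterior = Bernoulli log-likelihood (up to a constant)
    return np.sum(y * eta - np.logaddexp(0.0, eta))

M = 20_000
beta = np.zeros(2)                           # initialize beta^(0)
draws = np.empty((M, 2))
for i in range(M):
    cand = beta + rng.normal(0, 0.2, size=2)        # symmetric proposal
    if np.log(rng.uniform()) < log_post(cand) - log_post(beta):
        beta = cand
    draws[i] = beta

print(draws[5000:].mean(axis=0))             # posterior means after burn-in
```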