
Transcript of bayestalkBW


An Introduction to Bayesian Inference and Computation

Kevin Quinn
Department of Political Science and Center for Statistics and the Social Sciences
University of Washington
[email protected]


    What is Bayesian Inference?

Bayesian inference is a means of making rational probability statements about quantities of interest (observables, model parameters, functions of model parameters)


    Advantages of the Bayesian Approach

Answers the questions that researchers are really interested in, e.g. "What is the probability that ...?"

Natural way to combine information from multiple studies

Provides a formal method for combining prior qualitative information with observed quantitative information


More general way to deal with issues of model identification

Principled means to compare non-nested models (Bayes factors)

Allows one to fit very realistic (complicated) models


    Disadvantages of the Bayesian Approach

Often computationally more demanding than classical inference

At the moment no general-purpose canned software packages similar in ease of use to, say, Stata


    Requires either:

1. An elicitation and defense of real subjective probability distributions, or

2. Sensitivity analysis to show that the choice of subjective beliefs isn't determining one's inferences

Allows one to fit overly complicated (realistic) models


    Assumed Knowledge

    Basic understanding of random variables

    8th grade algebra

A bit of matrix algebra (matrix addition, multiplication, what an inverse matrix is)


    Outline

    1. Review and Foundational Issues

(a) Foundational Issues
(b) Likelihood Inference
(c) Bayesian Inference


    2. Bayesian Computation

(a) The Gibbs Sampling Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Linear Regression with Normal Disturbances
(b) The Metropolis-Hastings Algorithm
    Example: Sampling from a Bivariate Normal Distribution
    Example: Logistic Regression


    Discrete and Continuous Random Variables

    A random variable is discrete if:

1. it takes a finite set of values, $\{x_1, x_2, \ldots, x_n\}$, or
2. it takes an infinite sequence of distinct values, $\{x_1, x_2, \ldots\}$

A random variable is continuous if it can take all values in a dense subset of the real numbers


Distribution Functions and Probability (Density) Functions of RVs

A distribution function $F_X(x)$ of a random variable $X$ is a non-decreasing function that gives the probability that $X < x$.

A probability function $f_X(x)$ of a discrete random variable $X$ gives the probability that $X = x$


A probability density function $f_X(\cdot)$ of a continuous random variable $X$ is the non-negative function that satisfies
$$F_X(x) = \int_{-\infty}^{x} f_X(z)\, dz$$

Note $f_X(x) \neq \Pr(X = x)$ if $X$ is continuous

To find $\Pr(X \in (a, b))$ we calculate
$$\int_a^b f_X(z)\, dz = F_X(b) - F_X(a).$$


Marginal, Conditional, and Joint Probability (Density) Functions

The marginal probability function $f_X(x)$ gives the probability that $X = x$ for all $x$

The joint probability function $f_{X,Y}(x, y)$ of two discrete random variables $X$ and $Y$ gives the probability that $X = x$ and $Y = y$ for all $x$ and $y$


If $X$ and $Y$ are discrete, then
$$f_X(x) = \sum_y f_{X,Y}(x, y)$$

The conditional probability function of two discrete random variables $X$ and $Y$ is given by:
$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$
where it is assumed that $f_Y(y) > 0$


It follows that:
$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y)$$
and
$$f_Y(y) = \frac{f_{X,Y}(x, y)}{f_{X|Y}(x \mid y)} = \sum_x f_{X,Y}(x, y)$$

If $X$ and $Y$ are independent then
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$$
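As a concrete illustration of these identities, the short sketch below builds a small joint probability table for two binary random variables and checks the marginalization and factorization formulas numerically (the particular probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical joint probability table f_{X,Y}(x, y): rows index Y, columns index X
joint = np.array([[0.10, 0.30],   # Y = 0
                  [0.25, 0.35]])  # Y = 1

f_X = joint.sum(axis=0)            # marginal of X: sum over y
f_Y = joint.sum(axis=1)            # marginal of Y: sum over x

# Conditional f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)
f_X_given_Y = joint / f_Y[:, None]

# Check the factorization f_{X,Y}(x, y) = f_{X|Y}(x | y) f_Y(y)
assert np.allclose(f_X_given_Y * f_Y[:, None], joint)
print("f_X =", f_X, " f_Y =", f_Y)
```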


Similar results hold for continuous random variables, where summation is replaced with integration


    Probability Models

    Consider an observed sample of data y

A probability model for y consists of 2 things:

1. An assumption about the probability distribution with density $f(y \mid \theta)$ that generated y
2. The set $\Theta$ of possible values of the model parameters $\theta$.

$f(y \mid \theta)$ is called the sampling density and is the joint density of all the observed y's.


When $f(y \mid \theta)$ is viewed as a function of $\theta$ for fixed y it is referred to as the likelihood function and is written $L(\theta \mid y)$.

The likelihood function can actually be any function of y and $\theta$ that is proportional to $f(y \mid \theta)$


Example 1: Linear Model with Normal Disturbances

Consider a set of measurements of household income (x) and household consumption (y) for n households at the same point in time

We believe that conditional on household income, household consumption follows a normal distribution and that the observations are independent

This is our distributional assumption


Conditional independence of the y's allows us to form the sampling density by taking a product of the marginal densities

We believe that consumption varies linearly in income

Both positive and negative relationships are theoretically possible

This determines the parameter space


Our probability model is:
$$f(y \mid \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} f_{\mathcal{N}}(y_i \mid \beta_0 + x_i \beta_1,\ \sigma^2)$$
$$\beta_0 \in \mathbb{R}, \quad \beta_1 \in \mathbb{R}, \quad \sigma^2 \in \mathbb{R}^{+}$$

Note that this is equivalent to the simple linear regression model with normal disturbances:
$$y_i = \beta_0 + x_i \beta_1 + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \quad i = 1, 2, \ldots, n$$
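As a quick illustration of this probability model, the sketch below simulates consumption data from it and evaluates the log of the sampling density (equivalently, the log-likelihood) at the generating parameter values; the parameter values, sample size, and income distribution are all made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical "true" parameters and income data
beta0, beta1, sigma2 = 1.0, 0.6, 0.25
n = 100
x = rng.uniform(10, 50, size=n)                                  # household income
y = beta0 + beta1 * x + rng.normal(0, np.sqrt(sigma2), size=n)   # household consumption

def log_likelihood(beta0, beta1, sigma2, y, x):
    """Log of the sampling density: sum of independent normal log-densities."""
    return norm.logpdf(y, loc=beta0 + beta1 * x, scale=np.sqrt(sigma2)).sum()

print(log_likelihood(beta0, beta1, sigma2, y, x))
```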


Example 2: Suppose we observe y, the number of successes in n independent Bernoulli trials, each with probability of success $\pi$

This implies that y follows the binomial distribution with sample size n and probability of success $\pi$; this is the distributional assumption

The binomial sample size is fixed at n and the parameter $\pi$ can be any number in the [0, 1] interval; this determines the parameter space

Thus our probability model is
$$p(y \mid n, \pi) = \binom{n}{y} \pi^y (1 - \pi)^{(n - y)}, \qquad \pi \in [0, 1]$$


Frequentist Inference Based on the Likelihood Function

Goal of frequentist (classical) inference is to devise estimators that have desirable properties in repeated sampling

Maximum Likelihood Estimator $\hat{\theta}_{ML}$ of $\theta$:
$$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta \mid y)$$


Under suitable regularity conditions, $\hat{\theta}_{ML}$ is:

Consistent
Asymptotically normally distributed
Invariant to reparameterization
Minimum variance unbiased

Problem: frequentist inference can't answer questions such as:

What is $\Pr(\theta > 0)$?
What is $\Pr(g(\theta) \in (a, b))$?
What is the probability that a particular model is correct?


    Bayesian Inference

Goal of Bayesian inference is to make probability statements about model parameters $\theta$, and/or functions of model parameters $g(\theta)$, given a probability model and observed data

In other words, we want to know $f(\theta \mid y)$

Note that our probability model is defined in terms of $f(y \mid \theta)$, which is not quite what we want


Bayes' Rule gives us a formula for calculating the posterior density of $\theta$ given y (denoted $f(\theta \mid y)$) from knowledge of the sampling density of y (denoted $f(y \mid \theta)$) and the prior density of $\theta$ (denoted $f(\theta)$):
$$f(\theta \mid y) = \frac{f(y \mid \theta)\, f(\theta)}{f(y)}$$

The function $f(\theta)$ plays a crucial role in transforming our observed data and probability model into probability statements about $\theta$

Since $f(\theta)$ doesn't depend on the observed data y, it represents the researcher's subjective a priori beliefs about the likely values of $\theta$


The fact that $f(\theta)$ is a subjective probability implies that $f(\theta \mid y)$ is a subjective probability

Note that since $f(y)$ is a constant for fixed y we can write:
$$f(\theta \mid y) \propto f(y \mid \theta)\, f(\theta)$$
In words, the posterior density is proportional to the sampling density times the prior density.


Once a probability model is formed and a prior specified, Bayesian inference proceeds by summarizing $f(\theta \mid y)$.

Interesting quantities include (but are not limited to):

The posterior mean of $\theta$:
$$E[\theta \mid y] = \int \theta\, f(\theta \mid y)\, d\theta$$

The posterior variance of $\theta$:
$$\mathrm{Var}[\theta \mid y] = \int (\theta - E[\theta \mid y])^2\, f(\theta \mid y)\, d\theta$$


A $100(1 - \alpha)\%$ credible set C, where C is chosen to satisfy:
$$1 - \alpha = \int_C f(\theta \mid y)\, d\theta$$


Example: Inference About a Binomial Proportion with a Beta Prior

Suppose we observe y = (0, 1, 0, 0, 0)

We believe $y_i \overset{iid}{\sim} \mathrm{Bernoulli}(\pi)$, $i = 1, \ldots, 5$

We want to make inferences about $\pi$

Let $y = \sum_{i=1}^{n} y_i = 1$ and $n = 5$


Under the assumption of Bernoulli sampling our sampling density is
$$f(y \mid \pi) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}$$

Let's assume our beliefs about $\pi$ can be represented by a $\mathrm{Beta}(\alpha, \beta)$ density
$$f(\pi) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \pi^{(\alpha - 1)} (1 - \pi)^{(\beta - 1)}$$


The mean of a beta random variable is $\frac{\alpha}{\alpha + \beta}$

The variance of a beta random variable is $\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$


Suppose we think that the y's we observe are realizations of fair coin tosses and we are moderately sure of this. This prior belief might be represented by assuming $\pi \sim \mathrm{Beta}(10, 10)$

This implies the posterior density of $\pi$ given y is
$$f(\pi \mid y) \propto \pi^y (1 - \pi)^{n - y}\, \pi^{(\alpha - 1)} (1 - \pi)^{(\beta - 1)} \propto \pi^{(y + \alpha - 1)} (1 - \pi)^{(n - y + \beta - 1)}$$


In other words,
$$\pi \mid y \sim \mathrm{Beta}(y + \alpha,\ n - y + \beta)$$
which, in this case, means
$$\pi \mid y \sim \mathrm{Beta}(11, 14)$$
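A minimal sketch, assuming scipy is available, of how the posterior summaries from the earlier slides (mean, variance, and a 95% equal-tailed credible interval, one of many valid credible sets) can be computed for this Beta(11, 14) posterior:

```python
from scipy.stats import beta

# Prior Beta(10, 10); data: y = 1 success out of n = 5 trials
alpha0, beta0 = 10, 10
y, n = 1, 5

posterior = beta(alpha0 + y, beta0 + n - y)   # Beta(11, 14)

post_mean = posterior.mean()                  # E[pi | y]
post_var = posterior.var()                    # Var[pi | y]
ci_95 = posterior.interval(0.95)              # equal-tailed 95% credible set

print(post_mean, post_var, ci_95)
```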

[Figure: density curves of $\pi$ on [0, 1]]


Now let's suppose our prior beliefs are that y is being generated whenever a 1 comes up on a fair 6-sided die

$\pi \sim \mathrm{Beta}(3.33, 16.67)$

$\pi \mid y \sim \mathrm{Beta}(4.33, 20.67)$

[Figure: density curves of $\pi$ on [0, 1]]


    Bayesian Computation

Goal of Bayesian inference is to summarize the posterior distribution $f(\theta \mid y)$


This involves integration:
$$E[\theta \mid y] = \int \theta\, f(\theta \mid y)\, d\theta$$
$$\mathrm{Var}[\theta \mid y] = \int (\theta - E[\theta \mid y])^2\, f(\theta \mid y)\, d\theta$$
$$1 - \alpha = \int_C f(\theta \mid y)\, d\theta$$

These integrals are typically not available in closed form, which motivates simulation-based (Markov chain Monte Carlo) methods for summarizing the posterior.

    The Gibbs Sampling Algorithm

Suppose we have a density $f(\theta_1, \ldots, \theta_k)$

We want to take a sample from the joint distribution of $\theta_1, \ldots, \theta_k$, but it is not easy to do this directly

Rather than trying to sample directly from $f(\theta_1, \ldots, \theta_k)$, we are going to construct a Markov chain whose stationary distribution is $f(\theta_1, \ldots, \theta_k)$


The Gibbs sampling algorithm works as follows:

initialize $\theta_2^{(0)}, \ldots, \theta_k^{(0)}$
for (i = 1 to M) {
    sample $\theta_1^{(i)}$ from $f_1(\theta_1 \mid \theta_2^{(i-1)}, \ldots, \theta_k^{(i-1)})$
    sample $\theta_2^{(i)}$ from $f_2(\theta_2 \mid \theta_1^{(i)}, \theta_3^{(i-1)}, \ldots, \theta_k^{(i-1)})$
    ...
    sample $\theta_k^{(i)}$ from $f_k(\theta_k \mid \theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)})$
    store $\theta_1^{(i)}, \ldots, \theta_k^{(i)}$
}


A Proof of Convergence for a Simple Joint Distribution [from Casella and George (1992)]

Consider

        X=0        X=1
Y=0     p1         p2         p1 + p2
Y=1     p3         p4         p3 + p4
        p1 + p3    p2 + p4    1.00


The joint probability function is just
$$\begin{bmatrix} f_{X,Y}(0, 0) & f_{X,Y}(1, 0) \\ f_{X,Y}(0, 1) & f_{X,Y}(1, 1) \end{bmatrix} = \begin{bmatrix} p_1 & p_2 \\ p_3 & p_4 \end{bmatrix}$$

The marginal probability function for X is
$$f_X = [\, f_X(0) \ \ f_X(1) \,] = [\, p_1 + p_3 \ \ \ p_2 + p_4 \,]$$


The full conditional probabilities are expressed by two matrices
$$A_{Y|X} = \begin{bmatrix} \frac{p_1}{p_1 + p_3} & \frac{p_3}{p_1 + p_3} \\[4pt] \frac{p_2}{p_2 + p_4} & \frac{p_4}{p_2 + p_4} \end{bmatrix}$$
and
$$A_{X|Y} = \begin{bmatrix} \frac{p_1}{p_1 + p_2} & \frac{p_2}{p_1 + p_2} \\[4pt] \frac{p_3}{p_3 + p_4} & \frac{p_4}{p_3 + p_4} \end{bmatrix}$$


Transitions are of the form $X_0 \rightarrow Y_1 \rightarrow X_1$

What are the transition probabilities from $X_0$ to $X_1$?

Since we have to go through $Y_1$ to get to $X_1$, the transition probabilities take the form
$$\Pr(X_1 = x_1 \mid X_0 = x_0) = \sum_{y_1} \Pr(X_1 = x_1 \mid Y_1 = y_1)\, \Pr(Y_1 = y_1 \mid X_0 = x_0)$$


Which means that the transition matrix for the X sequence is $A_{X|X} = A_{Y|X} A_{X|Y}$

The stationary distribution $f$ of the X sequence must satisfy
$$f A_{X|X} = f$$

It is easy to check that
$$f_X A_{X|X} = f_X A_{Y|X} A_{X|Y} = f_X$$


In other words, the marginal probability function of X is the stationary distribution of the X chain
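A small numerical check of this argument (with made-up values for p1 through p4), verifying that $f_X$ is stationary for the transition matrix $A_{X|X} = A_{Y|X} A_{X|Y}$:

```python
import numpy as np

# Hypothetical joint probabilities for the 2x2 table
p1, p2, p3, p4 = 0.1, 0.3, 0.25, 0.35

f_X = np.array([p1 + p3, p2 + p4])                      # marginal of X

# Rows of A_{Y|X} are P(Y = . | X = x); rows of A_{X|Y} are P(X = . | Y = y)
A_YgX = np.array([[p1 / (p1 + p3), p3 / (p1 + p3)],
                  [p2 / (p2 + p4), p4 / (p2 + p4)]])
A_XgY = np.array([[p1 / (p1 + p2), p2 / (p1 + p2)],
                  [p3 / (p3 + p4), p4 / (p3 + p4)]])

A_XgX = A_YgX @ A_XgY                                   # transition matrix of the X chain

# f_X is stationary: f_X A_{X|X} = f_X
print(f_X @ A_XgX, f_X)
assert np.allclose(f_X @ A_XgX, f_X)
```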


Example: Sampling from a Bivariate Normal Distribution

Consider
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix},\ \begin{bmatrix} \sigma^2_X & \sigma_{X,Y} \\ \sigma_{Y,X} & \sigma^2_Y \end{bmatrix} \right)$$

Let $\rho = \sigma_{X,Y} / \sqrt{\sigma^2_X \sigma^2_Y}$


Full conditionals:
$$X \mid Y = y \sim \mathcal{N}(m_1, s^2_1)$$
where
$$m_1 = \mu_X - \mu_Y(\sigma_{X,Y}/\sigma^2_Y) + y(\sigma_{X,Y}/\sigma^2_Y)$$
$$s^2_1 = \sigma^2_X (1 - \rho^2)$$


and
$$Y \mid X = x \sim \mathcal{N}(m_2, s^2_2)$$
where
$$m_2 = \mu_Y - \mu_X(\sigma_{X,Y}/\sigma^2_X) + x(\sigma_{X,Y}/\sigma^2_X)$$
$$s^2_2 = \sigma^2_Y (1 - \rho^2)$$


The Gibbs sampling algorithm for the bivariate normal distribution is:

initialize $y^{(0)}$
for (i = 1 to M) {
    sample $x^{(i)} \mid y^{(i-1)}$ from $\mathcal{N}(m_1, s^2_1)$
    sample $y^{(i)} \mid x^{(i)}$ from $\mathcal{N}(m_2, s^2_2)$
    store $x^{(i)}$ and $y^{(i)}$
}
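A minimal runnable sketch of this Gibbs sampler using the full conditionals above; the mean, variance, and covariance values are arbitrary choices for illustration:

```python
import numpy as np

def gibbs_bvn(mu_x, mu_y, sig2_x, sig2_y, sig_xy, M=5000, y0=0.0, seed=0):
    """Gibbs sampler for a bivariate normal via its two full conditionals."""
    rng = np.random.default_rng(seed)
    rho2 = sig_xy**2 / (sig2_x * sig2_y)
    s1 = np.sqrt(sig2_x * (1 - rho2))      # sd of X | Y = y
    s2 = np.sqrt(sig2_y * (1 - rho2))      # sd of Y | X = x
    draws = np.empty((M, 2))
    y = y0
    for i in range(M):
        m1 = mu_x + (sig_xy / sig2_y) * (y - mu_y)
        x = rng.normal(m1, s1)
        m2 = mu_y + (sig_xy / sig2_x) * (x - mu_x)
        y = rng.normal(m2, s2)
        draws[i] = x, y
    return draws

draws = gibbs_bvn(mu_x=1.0, mu_y=-1.0, sig2_x=1.0, sig2_y=2.0, sig_xy=0.9)
print(draws.mean(axis=0), np.cov(draws.T))   # should approximate the target moments
```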

Example: Linear Regression with Normal Disturbances

Consider the linear model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Assume $\beta$ is a priori independent of $\sigma^2$, so $p(\beta, \sigma^2) = p(\beta)\, p(\sigma^2)$

Priors:
$$\beta \sim \mathcal{N}(m, V)$$
$$\sigma^2 \sim \mathcal{IG}(\nu/2, \delta/2)$$

Posterior (ignoring constants of proportionality):


$$f(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-n/2} \exp\!\left( -\frac{(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)}{2} \right) \exp\!\left( -\frac{(\beta - m)' V^{-1} (\beta - m)}{2} \right) (\sigma^2)^{-(\nu/2 + 1)} \exp\!\left( -\frac{\delta/2}{\sigma^2} \right)$$

To find the full conditionals, drop terms that don't include the variable of interest


Full conditional for $\beta$:
$$f_\beta(\beta \mid \sigma^2, y) \propto \exp\!\left( -\frac{(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)}{2} \right) \exp\!\left( -\frac{(\beta - m)' V^{-1} (\beta - m)}{2} \right)$$


A bit of algebra reveals
$$f_\beta(\beta \mid \sigma^2, y) \propto \exp\!\left( -\frac{(\beta - \tilde{m})' \tilde{V}^{-1} (\beta - \tilde{m})}{2} \right)$$
where
$$\tilde{V} = \left[ X'(\sigma^2 I)^{-1} X + V^{-1} \right]^{-1}, \qquad \tilde{m} = \tilde{V} \left[ X'(\sigma^2 I)^{-1} y + V^{-1} m \right].$$


Full conditional for $\sigma^2$:
$$f_{\sigma^2}(\sigma^2 \mid \beta, y) \propto (\sigma^2)^{-n/2} \exp\!\left( -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right) (\sigma^2)^{-(\nu/2 + 1)} \exp\!\left( -\frac{\delta/2}{\sigma^2} \right)$$
$$\propto (\sigma^2)^{-(n/2 + \nu/2 + 1)} \exp\!\left( -\frac{(y - X\beta)'(y - X\beta) + \delta}{2\sigma^2} \right)$$


This is an inverse gamma density with shape parameter $(n + \nu)/2$ and scale parameter $\frac{(y - X\beta)'(y - X\beta) + \delta}{2}$

The intuition here is that $\nu$ is acting like an additional number of observations and $\delta$ is acting like an additional sum of squared errors.


The Gibbs sampling algorithm for the linear model is:

initialize $\sigma^{2(0)}$
for (i = 1 to M) {
    sample $\beta^{(i)}$ from $f_\beta(\beta \mid \sigma^{2(i-1)}, y)$
    sample $\sigma^{2(i)}$ from $f_{\sigma^2}(\sigma^2 \mid \beta^{(i)}, y)$
    store $\beta^{(i)}$ and $\sigma^{2(i)}$
}
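A compact sketch of this sampler built directly from the full conditionals above; the simulated data, prior hyperparameters (m, V, ν, δ), and number of iterations are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a hypothetical linear model
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sig2_true = np.array([1.0, 0.5]), 0.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sig2_true), size=n)

# Prior hyperparameters: beta ~ N(m, V), sigma^2 ~ IG(nu/2, delta/2)
m, V = np.zeros(k), 100.0 * np.eye(k)
nu, delta = 1.0, 1.0
V_inv = np.linalg.inv(V)

M = 5000
beta_draws, sig2_draws = np.empty((M, k)), np.empty(M)
sig2 = 1.0                                   # initialize sigma^2
for i in range(M):
    # Full conditional for beta: N(m_tilde, V_tilde)
    V_tilde = np.linalg.inv(X.T @ X / sig2 + V_inv)
    m_tilde = V_tilde @ (X.T @ y / sig2 + V_inv @ m)
    beta = rng.multivariate_normal(m_tilde, V_tilde)
    # Full conditional for sigma^2: IG((n + nu)/2, ((y - Xb)'(y - Xb) + delta)/2)
    resid = y - X @ beta
    shape = (n + nu) / 2.0
    scale = (resid @ resid + delta) / 2.0
    sig2 = scale / rng.gamma(shape)          # inverse gamma draw via 1 / gamma
    beta_draws[i], sig2_draws[i] = beta, sig2

print(beta_draws.mean(axis=0), sig2_draws.mean())
```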


    The Metropolis-Hastings Algorithm

Suppose we want to sample from $f(\theta \mid y)$


The Metropolis-Hastings Algorithm is:

initialize $\theta^{(0)}$
for (i in 1 to M) {
    sample $\theta^{(i-1)}_{can}$ from $q(\theta \mid \theta^{(i-1)})$
    set
    $$\theta^{(i)} = \begin{cases} \theta^{(i-1)}_{can} & \text{with probability } \alpha(\theta^{(i-1)}_{can}, \theta^{(i-1)}) \\ \theta^{(i-1)} & \text{with probability } 1 - \alpha(\theta^{(i-1)}_{can}, \theta^{(i-1)}) \end{cases}$$
    store $\theta^{(i)}$
}
where


$$\alpha(\theta_{can}, \theta) = \min\!\left\{ \frac{f(\theta_{can} \mid y)}{f(\theta \mid y)}\, \frac{q(\theta \mid \theta_{can})}{q(\theta_{can} \mid \theta)},\ 1 \right\}$$


Example: Sampling from a Bivariate Normal Distribution
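The transcript gives only the slide title here; as a hedged sketch, one common choice is a random-walk Metropolis-Hastings sampler, shown below for an arbitrary bivariate normal target (the target parameters and proposal scale are assumptions for illustration). With a symmetric proposal, the q terms cancel in the acceptance probability:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Hypothetical bivariate normal target
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9], [0.9, 2.0]])
target = multivariate_normal(mu, Sigma)

M, step = 10000, 0.8
theta = np.zeros(2)                                  # initialize theta^(0)
draws = np.empty((M, 2))
for i in range(M):
    cand = theta + rng.normal(scale=step, size=2)    # symmetric random-walk proposal
    alpha = min(1.0, np.exp(target.logpdf(cand) - target.logpdf(theta)))
    if rng.uniform() < alpha:                        # accept with probability alpha
        theta = cand
    draws[i] = theta

print(draws.mean(axis=0))                            # should approximate mu
```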


    Example: Logistic Regression
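The transcript ends with this slide title only; a minimal sketch of what a random-walk Metropolis-Hastings sampler for a logistic regression posterior could look like, under the assumption (mine, not the slides') of independent vague normal priors on the coefficients and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated logistic regression data (made up for illustration)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def log_post(beta, prior_sd=10.0):
    """Log posterior: Bernoulli log-likelihood plus independent N(0, prior_sd^2) priors."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return loglik + logprior

M, step = 20000, 0.1
beta = np.zeros(2)
draws = np.empty((M, 2))
for i in range(M):
    cand = beta + rng.normal(scale=step, size=2)     # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(cand) - log_post(beta):
        beta = cand
    draws[i] = beta

print(draws[M // 2:].mean(axis=0))                   # posterior means after burn-in
```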