Hierarchical Bayes
Peter Lenk
Stephen M. Ross School of Business at the University of Michigan
September 2004
2
Outline
• Bayesian Decision Theory
• Simple Bayes and Shrinkage Estimates
• Hierarchical Bayes
• Numerical Methods
• Batting Averages
• HB Interaction Model
3
Bayesian Decision Model
[Diagram: the Decision Model links Actions, States, and Consequences. In the Inference Model, the Prior and Likelihood are combined by Bayes Theorem for Posterior Updating; Integration of the Loss Function against the posterior gives the result, the Optimal Decision: Marketing Action & Inference. The inference model and loss function are separable parts: don't mix them.]
4
Bayes Theorem
• Model for the data given parameters
– f(y|θ) where θ = unknown parameters
– E.g., Yi = µ + εi and θ = (µ, σ)
– Likelihood l(θ) = f(y1|θ) f(y2|θ) ⋯ f(yn|θ)
• Prior distribution of parameters p(θ)
• Update the prior
– p(θ|Data) = l(θ)p(θ)/f(y)
– f(y) = marginal distribution of the data
5
Easy Example:
• Estimate the mean of a normal distribution.
• Yi = µ + εi
• Error terms {εi} are iid normal
– Mean is zero
– Standard deviation of the error terms is σ
– Assume that σ is known
6
Conjugate Prior for Mean
• Prior distribution for µ is normal
– Prior mean is m0
– Prior variance is v0²
– Prior precision is 1/v0²
7
Posterior Distribution
• Observe n data points
• Posterior distribution is normal
– Mean is mn
– Variance is vn²

$$m_n = w\,\bar{y} + (1-w)\,m_0, \qquad w = \frac{n/\sigma^2}{n/\sigma^2 + 1/v_0^2}, \qquad 0 < w < 1$$

$$\frac{1}{v_n^2} = \frac{n}{\sigma^2} + \frac{1}{v_0^2}$$
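A minimal Python sketch of this conjugate update; the helper name normal_posterior and its argument names are my own, not from the slides:

```python
import numpy as np

def normal_posterior(ybar, n, sigma2, m0, v0_2):
    """Conjugate update for a normal mean with known variance.

    Posterior precision is the sum of the data precision n/sigma2
    and the prior precision 1/v0_2; the posterior mean is the
    precision-weighted average of ybar and m0.
    """
    data_prec = n / sigma2      # precision of the sample mean
    prior_prec = 1.0 / v0_2     # prior precision
    post_prec = data_prec + prior_prec
    w = data_prec / post_prec   # weight on the sample mean, 0 < w < 1
    mn = w * ybar + (1 - w) * m0
    vn_2 = 1.0 / post_prec
    return mn, vn_2
```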
8
Shrinkage Estimators
• Bayes estimators combine prior guesses with sample estimates.
• If the prior precision is larger than the sample precision (the prior has more information), put more weight on the prior mean.
• If the sample precision is larger than the prior precision (the sample has more information), put more weight on the sample average.
9
Example
• Y is normal with mean 10 and variance 16
• Normal prior for the population mean
– Mean = 5 & variance = 2
– Prior is informative and way off
• Data
– n = 5, average = 10.9, variance = 14.7
• Posterior is normal
– Mean = 7.4 and variance = 1.2
10
Prior & Posterior n=5
[Figure: prior, posterior, and likelihood densities for n = 5, with the prior mean, sample mean, and posterior mean marked.]
11
Prior & Posterior n=50
[Figure: prior, posterior, and likelihood densities for n = 50.]
12
Use Less Informative Prior
• Y is normal with mean 10 and variance 16
• Normal prior for the population mean
– Mean = 5 & variance = 10 instead of 2
– Prior is "flatter"
• Data
– n = 5, average = 10.9, variance = 14.7
• Posterior is normal
– Mean = 9.6 and variance = 2.3 (both examples are verified in the sketch below)
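Plugging the slides' numbers into the normal_posterior helper sketched earlier reproduces both posteriors; treating the sample variance 14.7 as the known error variance is an assumption that matches the slides' figures:

```python
# Informative prior (mean 5, variance 2): posterior mean ~7.4, variance ~1.2
print(normal_posterior(ybar=10.9, n=5, sigma2=14.7, m0=5.0, v0_2=2.0))

# Flatter prior (mean 5, variance 10): posterior mean ~9.6, variance ~2.3
print(normal_posterior(ybar=10.9, n=5, sigma2=14.7, m0=5.0, v0_2=10.0))
```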
13
Prior & Posterior n=5
[Figure: prior, posterior, and likelihood densities for n = 5 with the flatter prior.]
14
Prior & Posterior n=50
[Figure: prior, posterior, and likelihood densities for n = 50 with the flatter prior.]
15
Summary
• Prior has less effect as sample size increases
• Very informative priors give good results with smaller samples if prior information is correct
• If you really don’t know, then use “flatter” or less informative priors
16
What about Marketing?
• HB revolution in how we think about customers
17
Henry Ford: All Customers are the Same
18
Alfred Sloan: Several Common Preferences
19
Continuous Heterogeneity
20
Profit Maximization
21
It Can Get Wild!
22
HB Model for Weekly Spending
• Within-subject model: Yi,j = µi + εi,j and var(εi,j) = σi²
• Heterogeneity in mean weekly spending, or between-subjects model: µi = θ + δi and var(δi) = τ²
• Prior distribution: θ is N(u0, v0²)
• Variances are assumed known
23
Variances & Covariances
• Var(Yi,j | µi) = σi² (known µi)
• Var(Yi,j | θ) = τ² + σi² (unknown µi)
• Cov(Yi,j, Yi,k) = τ² for j not equal to k
• Observations from different subjects are independent
24
Precisions = 1/Variance
$$\Pr(\theta) = \frac{1}{v_0^2} \ \text{ is the prior precision}$$

$$\Pr(\mu_i\,|\,\theta) = \frac{1}{\tau^2}$$

$$\Pr(Y_{i,j}\,|\,\mu_i) = \frac{1}{\sigma_i^2} \quad\text{and}\quad \Pr(Y_{i,j}\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2}$$

$$\Pr(\bar{Y}_i\,|\,\mu_i) = \frac{n_i}{\sigma_i^2} \quad\text{and}\quad \Pr(\bar{Y}_i\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2/n_i}$$
25
Joint Distribution
$$P(Y,\mu,\theta) = \underbrace{h(\theta\,|\,u_0,v_0^2)}_{\text{Prior}}\ \prod_{i=1}^{N}\ \underbrace{g(\mu_i\,|\,\theta,\tau^2)}_{\text{Between Subjects}}\ \prod_{j=1}^{n_i}\ \underbrace{f(y_{i,j}\,|\,\mu_i,\sigma_i^2)}_{\text{Within Subjects}}$$
26
Bayes Theorem
$$P(\mu,\theta\,|\,Y) = \frac{P(Y,\mu,\theta)}{\int\!\!\int P(Y,\mu,\theta)\,d\mu\,d\theta} = \frac{P(Y,\mu,\theta)}{P(Y)} \propto P(Y,\mu,\theta)$$

The denominator P(Y) is constant because Y is fixed & known.
27
Bayes Estimator
• Posterior means are optimal under squared error loss: E(µi|Y) and E(θ|Y)
• Measure of accuracy is the posterior variance: var(µi|Y) and var(θ|Y)
28
Posterior Distribution of θ
• Normal distribution
• Posterior mean is uN
• Posterior variance is vN²
• Posterior precision is Pr(θ|Y) = 1/vN²
29
Posterior Precision of θ ("Pr" = Precision = 1/Variance)

$$\Pr(\theta\,|\,Y) = \Pr(\theta) + \sum_{i=1}^{N} \Pr(\bar{Y}_i\,|\,\theta)$$

$$\Pr(\theta) = \frac{1}{v_0^2} \quad\text{and}\quad \Pr(\bar{Y}_i\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2/n_i}$$
30
Posterior Mean of θ
$$u_N = w_0\,u_0 + \sum_{i=1}^{N} w_i\,\bar{Y}_i$$

$$w_0 = \frac{\Pr(\theta)}{\Pr(\theta\,|\,Y)} \quad\text{and}\quad w_i = \frac{\Pr(\bar{Y}_i\,|\,\theta)}{\Pr(\theta\,|\,Y)}$$
31
Updating of θ
• Prior mean u0 becomes posterior mean uN
• Prior variance v0² becomes posterior variance vN²
32
Posterior Mean of µi
$$E[\mu_i\,|\,Y] = \alpha_i\,\bar{Y}_i + (1-\alpha_i)\,u_N$$

$$\alpha_i = \frac{\Pr(\bar{Y}_i\,|\,\mu_i)}{\Pr(\mu_i\,|\,\theta) + \Pr(\bar{Y}_i\,|\,\mu_i)} = \frac{n_i/\sigma_i^2}{n_i/\sigma_i^2 + 1/\tau^2}$$
33
Between-Subject Heterogeneity in Mean Household Spending
[Figure: density of between-subject heterogeneity in mean household spending (x-axis: Spending, y-axis: Distribution).]
34
Between & Within Subjects Distributions
[Figure: heterogeneity density overlaid with the within-subject spending distributions for Subjects 1, 2, and 3.]
35
2 Observations per Subject
[Figure: heterogeneity density with two observations plotted for each of Subjects 1, 2, and 3.]
36
Subject Averages
[Figure: heterogeneity density with each subject's sample average marked.]
37
Pooled Estimate of Mean
[Figure: heterogeneity density and Subjects 1, 2, and 3, with the pooled estimate of population average spending marked.]
38
Sample Estimates
• Disaggregate estimate Ȳi of µi only uses the observations for subject i
– Super if 30 or more observations per subject
• Pooled or aggregate estimator Ȳ of θ
– Smaller sampling error
– Ignores individual differences
39
HB Shrinkage Estimator
• Take a combination of the individual average and the pooled average (see the sketch after this list):

$$w_i\,\bar{Y}_i + (1-w_i)\,\bar{Y}$$

• What are the correct weights?
• HB automatically gives optimal weights based on
– Prior variance of µi
– Number of observations for subject i
– Variance of past spending for subject i
– Number of subjects
– Amount of heterogeneity in household means
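A minimal sketch of this shrinkage, assuming the variance components τ² and σi² are treated as given rather than estimated; the function and array names are illustrative:

```python
import numpy as np

def hb_shrinkage(ybar_i, n_i, sigma2_i, tau2, u_N):
    """Shrink each subject's average toward the population mean u_N.

    The weight on the subject average is its data precision n_i/sigma2_i
    relative to the total precision n_i/sigma2_i + 1/tau2.
    """
    data_prec = n_i / sigma2_i
    alpha = data_prec / (data_prec + 1.0 / tau2)
    return alpha * ybar_i + (1 - alpha) * u_N

# Subjects with few observations are shrunk hardest toward u_N = 20.
ybar = np.array([35.0, 12.0, 22.0])
n = np.array([2, 2, 20])
print(hb_shrinkage(ybar, n, sigma2_i=16.0, tau2=9.0, u_N=20.0))
```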
40
Shrinkage Estimates
[Figure: heterogeneity density with the shrinkage estimates for Subjects 1, 2, and 3 pulled toward the population mean.]
41
20 Observations per Subject
[Figure: heterogeneity density and estimates for Subjects 1, 2, and 3 with 20 observations per subject; the estimates are shrunk less.]
42
Bayes & Shrinkage Estimates
• Bayes estimators automatically determine the optimal amount of shrinkage to minimize MSE for the true parameters and predictions
• Borrow strength from all subjects
• Trade off some bias for variance reduction
43
Good & Bad News
• Only simple models result in closed-form equations
• The models we use in marketing require numerical methods to compute posterior means, posterior standard deviations, predictions, and so on
44
Numerical Integration
• Compute posterior mean of function T(θ).
$$E[T(\theta)\,|\,y] = \int T(\theta)\,p(\theta\,|\,y)\,d\theta$$
45
T(x) and f(x)
[Figure: the function T(x) and the density f(x) plotted over [0, 1].]
46
Trapezoid Rule

[Figure: trapezoid-rule approximation to the area under T(x)f(x) over [0, 1].]
47
Grid Methods
• Very accurate with few function evaluations (sketch below)
• Need to know where the action is
• Does not scale well to higher dimensions
• You need to be very smart to make it work
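As an illustration of the grid approach, a sketch that approximates a posterior mean on [0, 1] with the trapezoid rule; the Beta(5, 15) "posterior" is just a stand-in density, not from the slides:

```python
import numpy as np
from scipy.stats import beta

# Grid on [0, 1]; the density is a Beta(5, 15) stand-in posterior.
x = np.linspace(0.0, 1.0, 501)
post = beta.pdf(x, 5, 15)

# E[T(theta)|y] = integral of T * posterior; here T(theta) = theta.
numer = np.trapz(x * post, x)
denom = np.trapz(post, x)        # should be ~1; guards against grid error
print(numer / denom)             # close to 5 / (5 + 15) = 0.25
```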
48
Monte Carlo
• Generate random draws θ1, θ2, …, θm from posterior distribution using a random number generator.
• What happened to the density of θ?
$$E[T(\theta)\,|\,y] \approx \frac{1}{m}\sum_{j=1}^{m} T(\theta_j)$$
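A minimal Monte Carlo sketch, again using a Beta stand-in posterior because it has a built-in generator; notice that the draws replace any evaluation of the density itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from the (stand-in) posterior; no density evaluation needed.
draws = rng.beta(5, 15, size=100_000)

# The sample average of T(theta_j) estimates E[T(theta)|y].
print(draws.mean())              # T(theta) = theta, ~0.25
print(np.mean(draws ** 2))       # T(theta) = theta^2
```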
49
Good & Bad News
• If your computer has a random number generator for the posterior distribution, Monte Carlo is a snap to do.
• Your computer almost never has the correct random number generator.
50
Importance Sampling
• Would like to sample from density f
• You have a good random number generator for the density g
• Importance sampling lets you generate random deviates from g to evaluate expectations with respect to f
• Generate φ1, …, φm from g
51
Importance Sampling Approximation
$$\int T(\theta)\,f(\theta)\,d\theta = \int T(\varphi)\,\frac{f(\varphi)}{g(\varphi)}\,g(\varphi)\,d\varphi \approx \sum_{i=1}^{m} T(\varphi_i)\,w_i$$

$$r(\varphi) \propto \frac{f(\varphi)}{g(\varphi)} \quad\text{and}\quad w_i = \frac{r(\varphi_i)}{\sum_{j=1}^{m} r(\varphi_j)}$$

When r equals f/g exactly (not just proportionally), the simple average (1/m) Σ T(φi) r(φi) works as well; the normalized weights wi absorb any unknown constant in f.
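A small sketch of this idea, assuming the target f is a Beta(5, 15) density known only up to a constant and the proposal g is uniform on (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Proposal g: uniform on (0, 1), so g(phi) = 1.
phi = rng.uniform(size=m)

# Unnormalized target f: Beta(5, 15) kernel with the constant dropped.
r = phi ** 4 * (1 - phi) ** 14          # r proportional to f / g

# Self-normalized importance-sampling estimate of E[theta].
w = r / r.sum()
print(np.sum(phi * w))                  # ~0.25
```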
52
Markov Chain Monte Carlo
• Extension of Monte Carlo
• Random draws are not independent
• The joint distribution f(β, φ) does not have a convenient random number generator
• The conditional distributions g(φ|β) and h(β|φ) are easy to generate from
53
Iterative Generation from Full Conditionals
• Start at φ0
• Generate β1 from h(β|φ0)
• Generate φ1 from g(φ|β1)
• …
• Generate βm+1 from h(β|φm)
• Generate φm+1 from g(φ|βm+1)
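A toy Gibbs sampler in exactly this pattern, assuming a bivariate normal target with correlation ρ, where both full conditionals are normal; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the bivariate normal target
m = 5_000
beta, phi = 0.0, 0.0           # start at phi_0 (and an arbitrary beta)
draws = np.empty((m, 2))

for t in range(m):
    # h(beta | phi): normal with mean rho*phi, variance 1 - rho^2
    beta = rng.normal(rho * phi, np.sqrt(1 - rho ** 2))
    # g(phi | beta): normal with mean rho*beta, variance 1 - rho^2
    phi = rng.normal(rho * beta, np.sqrt(1 - rho ** 2))
    draws[t] = beta, phi

print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # ~(0, 0) and ~0.8
```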
54
Baseball Example
• 90 MLB players in the 2000 season
• Observe at bats (AB) and batting averages (BA) in May
• Infer the distribution of batting averages across players
• Predict batting averages in October using data from May
55
Baseball Batting Averages
Player     May BA   May AB   October BA   October AB
Pena       .148     27       .262         263
Vizquel    .204     49       .260         542
Belle      .400     45       .317         546
Murray     .442     43       .323         436

The Cleveland Indians - 1995
56
Estimating a Probability
• n at bats in May
• X = number of hits
• p = batting average for the season
• X has a binomial distribution
– Mean np
– Variance np(1 − p)
57
Binomial Distribution
[Figure: binomial distribution of sample successes for n = 50, p = 0.3; mean = 15.00, STD = 3.24.]
58
Need Prior for Batting Average p
• 0 < p < 1
• Beta distribution is a popular choice
• It has two parameters: α and β
• Density is proportional to p^(α−1)(1−p)^(β−1)
• Prior mean = α/(α+β)
59
Beta Prior for p
$$f[p\,|\,\alpha,\beta] = Beta(p\,|\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1} \quad\text{for } 0 \le p \le 1$$

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx$$
60
Mean and Variance
$$E(p) = \frac{\alpha}{\alpha+\beta}$$

$$V(p) = \frac{E(p)\,[1-E(p)]}{\alpha+\beta+1}$$
61
Beta Distribution
[Figure: Beta density with alpha = 1, beta = 1; mean = 0.50.]
62
Beta Distribution
[Figure: Beta density with alpha = 5, beta = 1; mean = 0.83.]
63
Beta Distribution
[Figure: Beta density with alpha = 5, beta = 10; mean = 0.33.]
64
Beta Distribution
[Figure: Beta density with alpha = 0.5, beta = 2; mean = 0.20.]
65
Beta Distribution
[Figure: Beta density with alpha = 0.5, beta = 0.5; mean = 0.50.]
66
Bayes Theorem: Update prior for p after observing n and x
$$f[p\,|\,x,\alpha,\beta] = \frac{\Pr[x\,|\,p]\,f[p\,|\,\alpha,\beta]}{\int_0^1 \Pr[x\,|\,q]\,f[q\,|\,\alpha,\beta]\,dq} \propto \Pr[x\,|\,p]\,f[p\,|\,\alpha,\beta]$$

$$\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1} = Beta(p\,|\,\alpha+x,\ \beta+n-x)$$
67
Inference About p: Posterior is also Beta

Prior Parameters    Posterior Parameters
α                   α + x
β                   β + n − x
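A sketch of this conjugate update, using the numbers from the "p=Beta & x=Binomial" slide below (Beta(5, 15) prior, n = 50, x = 18); scipy's frozen distribution is used only to read off moments:

```python
from scipy.stats import beta

# Beta(5, 15) prior (mean 0.25); observe x = 18 hits in n = 50 at bats.
a0, b0, n, x = 5, 15, 50, 18

post = beta(a0 + x, b0 + n - x)      # posterior is Beta(23, 47)
print(post.mean())                   # ~0.33, between prior mean .25 and x/n = .36
print(post.std())
```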
68
Posterior Mean of p: It's another shrinkage estimator

$$E(p\,|\,x) = w_n\,\hat{p} + (1-w_n)\,\text{Prior Mean}$$

$$\hat{p} = \frac{x}{n} \quad\text{and}\quad w_n = \frac{n}{\alpha+\beta+n}$$
69
p=Beta & x=Binomial
[Figure: Beta(5, 15) prior (prior mean = 0.25) and the posterior after observing N = 50 and X = 18.]
70
Hierarchical Bayes Model
• Variation within batter i:
– Xi given pi has a binomial distribution
• Variation among batters:
– pi has a beta distribution with parameters α and β
• Prior distribution for α and β:
– Gamma (chi-square) distribution
71
Gamma Distribution
$$g(y) = G(y\,|\,r,s) = \frac{s^r\,y^{r-1}\,e^{-sy}}{\Gamma(r)} \quad\text{for } y > 0$$

$$E(Y) = \frac{r}{s} \quad\text{and}\quad V(Y) = E(Y)\,\frac{1}{s}$$
72
Gamma Distribution
[Figure: Gamma density with alpha = 1, beta = 0.1; mean = 10.00, STD = 10.00.]
73
Gamma Distribution
[Figure: Gamma density with alpha = 10, beta = 0.5; mean = 20.00, STD = 6.32.]
74
Specify Prior Parameters: r, s, u & v
• Priors: α is G(r, s) & β is G(u, v)
• E(α) = r/s and V(α) = E(α)/s
• s determines the variance relative to the mean
• I used s = 0.25, so the variance is four times larger than the mean
• Same for v
75
$$E[p] = E[E(p\,|\,\alpha,\beta)] = \frac{r}{r+u} = p_0$$

$$c = V[E(p\,|\,\alpha,\beta)] = \frac{p_0(1-p_0)}{r+u+1}$$

$$r = p_0\left[\frac{p_0(1-p_0)}{c} - 1\right] \quad\text{and}\quad u = (1-p_0)\left[\frac{p_0(1-p_0)}{c} - 1\right]$$
76
Specify Prior Parameters
• Guess a mean of all batting averages: p0 = 0.25
• Measure of my uncertainty of that guess: c = 0.01
• Parameter r = 4.4
• Parameter u = 13.3 (checked in the sketch below)
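A quick arithmetic check of these choices, including the implied Gamma prior means and standard deviations that reappear in the parameter-estimates table on a later slide:

```python
import numpy as np

p0, c, s = 0.25, 0.01, 0.25

k = p0 * (1 - p0) / c - 1            # common factor in r and u
r, u = p0 * k, (1 - p0) * k
print(r, u)                          # ~4.4 and ~13.3

# Implied Gamma prior means and stds for alpha and beta.
print(r / s, np.sqrt(r / s / s))     # ~17.8 and ~8.4
print(u / s, np.sqrt(u / s / s))     # ~53.2 and ~14.6
```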
77
MCMC for Batting Averages
• Need the full conditional for pi given α and β
– Beta distribution
• Need the full conditional for α and β given the pi
– Unknown distribution
– Use the Metropolis algorithm
78
MCMC: Full Conditionals for Player i Batting Average pi
$$f[p_i\,|\,x_i,\alpha,\beta] \propto \Pr[x_i\,|\,p_i]\,f(p_i\,|\,\alpha,\beta) \propto p_i^{\alpha+x_i-1}(1-p_i)^{\beta+n_i-x_i-1} = Beta(p_i\,|\,\alpha+x_i,\ \beta+n_i-x_i)$$
79
MCMC: Full Conditional for α and β
$$g(\alpha,\beta\,|\,x_1,\ldots,x_n,\,p_1,\ldots,p_n) \propto \left[\prod_{i=1}^{n} Beta(p_i\,|\,\alpha,\beta)\right] g(\alpha\,|\,r,s)\,g(\beta\,|\,u,v)$$
80
Metropolis Algorithm
• Want to generate θ from f
• Instead, generate a candidate value φ from g(·|θ)
– Density g can depend on θ
– E.g., random walk: φ = θ + δ
• With probability α(θ, φ) accept φ as the new value of θ
• With probability 1 − α(θ, φ) keep θ
81
Transition Probability
$$\alpha(\theta,\varphi) = \min\left\{\frac{f(\varphi)\,g(\theta\,|\,\varphi)}{f(\theta)\,g(\varphi\,|\,\theta)},\ 1\right\}$$

• f is the full conditional density of θ
• Ratios: do not need to know normalizing constants
• Usually compute α on the log scale
• Works if the densities are not zero
• Works better if g is close to f
82
Alpha and Beta vs Iteration
[Figure: trace plots of ALPHA and BETA over 2000 MCMC iterations.]
83
Posterior of Alpha
[Figure: histogram of the posterior draws of alpha, roughly spanning 15 to 41.]
84
Posterior of Beta
[Figure: histogram of the posterior draws of beta, roughly spanning 38 to 105.]
85
Parameter Estimates

         Prior     Posterior
α        17.8      26.2
(std)    (8.4)     (4.6)
β        53.2      68.2
(std)    (14.6)    (11.7)
86
Distribution of Batting Averages
[Figure: prior density and posterior (after May) of the distribution of batting averages.]
87
Prediction of Season Averages
         RMSE     MAPE
MLE      0.060    17.0%
Bayes    0.032    9.4%
88
Batting Averages: Bayes Shrinks MLE

[Figure: season 2000 batting averages plotted against May 2000 averages; the Bayes estimates are shrunk relative to the MLE.]
89
HB Conjoint
Lenk, DeSarbo, Green, Young (1996)
• Evaluated computer profiles on a 0-to-10 scale for "likelihood to purchase"
– 0 = Would not buy
– 10 = Would definitely buy
• Design
– 178 subjects
– 13 attributes with two levels each
– 20 profiles per subject
90
Attributes: Effect Coding
1. Hotline support
2. RAM
3. Screen size
4. CPU
5. Hard disk
6. Multimedia
7. Cache
8. Color
9. Retail store
10. Warranty
11. Software
12. Guarantee
13. Price
91
Subject-Covariates
• Female: 1 if female and 0 if male
• Years: # years of work experience
• Own: 1 if has a computer & 0 else
• Nerd: 1 if technical background & 0 else
• Apply: # of software applications
• Expert: self-reported expertise rating
92
Summary Statistics for Covariates
Variable   Mean    Std Dev   MIN   MAX
FEMALE     0.275   0.448     0     1
YEARS      4.416   2.369     1     18
OWN        0.876   0.330     0     1
NERD       0.275   0.448     0     1
APPLY      4.287   1.574     1     9
EXPERT     7.618   1.902     3     10
93
Interaction Model
• Within-subjects:

$$Y_i = X_i\beta_i + \varepsilon_i \quad\text{for } i = 1,\ldots,n, \qquad \varepsilon_i \sim N_{m_i}\!\left(0,\ \sigma^2 I_{m_i}\right)$$

• Between-subjects heterogeneity:

$$\beta_i = \Theta' z_i + \delta_i, \qquad \delta_i \sim N_p(0, \Delta)$$
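A sketch that simulates one subject from this two-level model; the dimensions and parameter values are illustrative, not from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
m_i, p, q = 20, 3, 2                # profiles, attributes, covariates

Theta = rng.normal(size=(q, p))     # covariate effects on partworths
Delta = np.eye(p) * 0.25            # between-subjects covariance
sigma = 1.0

# Between-subjects: subject i's partworths depend on covariates z_i.
z_i = rng.normal(size=q)
beta_i = Theta.T @ z_i + rng.multivariate_normal(np.zeros(p), Delta)

# Within-subjects: ratings of the m_i profiles in design matrix X_i.
X_i = rng.choice([-1.0, 1.0], size=(m_i, p))   # effects-coded design
Y_i = X_i @ beta_i + rng.normal(0, sigma, size=m_i)
print(Y_i[:5])
```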
94
Average over Posterior Means and Std Devs of Partworths Across Subjects

          PostMean   PostSTD               PostMean   PostSTD
CNST      4.757      1.404     Cache       0.031      0.461
HotLine   0.095      0.487     Color       0.026      0.371
RAM       0.347      0.467     Dstrbtn     0.078      0.378
ScrnSz    0.193      0.405     Wrrnty      0.124      0.392
CPU       0.392      0.646     Sftwr       0.196      0.399
HrdDsk    0.171      0.501     Grntee      0.112      0.427
MultMd    0.494      0.574     Price       -1.127     0.873
95
Impact of Covariates on Partworths
Columns: CNST, RAM, CPU, Dstrbtn, Wrrnty, Grntee, Price
CNST:    3.74, 0.52, -0.15, 0.05, -0.01, 0.03, -1.55
FEMALE:  -0.10, 0.06, 0.12, 0.40
YEARS:   -0.11
OWN:     -0.10, 0.17, 0.17, 0.20, -0.12, -0.20
NERD:    -0.27, 0.15, 0.16, -0.14
APPLY:   0.10
EXPERT:  0.17
96
Summary
• HB allows individual-level coefficients
• Two-level model
– Within subjects
– Between subjects (heterogeneity)
• HB shrinks unstable, subject-level estimators toward the population mean
97
Summary
• BDT provides an integrated framework for making decisions and inferences
• Good models consider all sources of uncertainty
• Good methods keep track of all sources of uncertainty
• Bayes does both