Hierarchical Bayes
Peter Lenk
Stephen M. Ross School of Business at the University of Michigan
September 2004
2
Outline
• Bayesian Decision Theory
• Simple Bayes and Shrinkage Estimates
• Hierarchical Bayes
• Numerical Methods
• Batting Averages
• HB Interaction Model
3
Bayesian Decision Model
[Diagram: the Decision Model links Actions, States, and Consequences. In the Inference Model, the Prior and Likelihood are combined by Bayes Theorem for Posterior Updating; Integration of the Loss Function against the posterior gives the result, the Optimal Decision: Marketing Action & Inference. The inference model and loss function are separable parts: don't mix them.]
4
Bayes Theorem
• Model for the data given parameters
– f(y|θ) where θ = unknown parameters
– E.g., Yi = µ + εi and θ = (µ, σ)
– Likelihood l(θ) = f(y1|θ) f(y2|θ) ⋯ f(yn|θ)
• Prior distribution of parameters p(θ)
• Update the prior
– p(θ|Data) = l(θ)p(θ)/f(y)
– f(y) = marginal distribution of the data
5
Easy Example:
• Estimate the mean of a normal distribution.
• Yi = µ + εi
• Error terms {εi} are iid normal
– Mean is zero
– Standard deviation of the error terms is σ
– Assume that σ is known
6
Conjugate Prior for Mean
• Prior distribution for µ is normal
– Prior mean is m0
– Prior variance is v0²
– Prior precision is 1/v0²
7
Posterior Distribution
• Observe n data points
• Posterior distribution is normal
– Mean is mn
– Variance is vn²

$$m_n = w\,\bar{y} + (1-w)\,m_0, \qquad w = \frac{n/\sigma^2}{n/\sigma^2 + 1/v_0^2}, \qquad 0 < w < 1$$

$$\frac{1}{v_n^2} = \frac{n}{\sigma^2} + \frac{1}{v_0^2}$$
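A minimal Python sketch of this conjugate update; the helper name normal_posterior and its argument names are my own, not from the slides:

```python
import numpy as np

def normal_posterior(ybar, n, sigma2, m0, v0_2):
    """Conjugate update for a normal mean with known variance.

    Posterior precision is the sum of the data precision n/sigma2
    and the prior precision 1/v0_2; the posterior mean is the
    precision-weighted average of ybar and m0.
    """
    data_prec = n / sigma2      # precision of the sample mean
    prior_prec = 1.0 / v0_2     # prior precision
    post_prec = data_prec + prior_prec
    w = data_prec / post_prec   # weight on the sample mean, 0 < w < 1
    mn = w * ybar + (1 - w) * m0
    vn_2 = 1.0 / post_prec
    return mn, vn_2
```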
8
Shrinkage Estimators
• Bayes estimators combine prior guesses with sample estimates.
• If the prior precision is larger than the sample precision (the prior has more information), put more weight on the prior mean.
• If the sample precision is larger than the prior precision (the sample has more information), put more weight on the sample average.
9
Example
• Y is normal with mean 10 and variance 16
• Normal prior for the population mean
– Mean = 5 & variance = 2
– Prior is informative and way off
• Data
– n = 5, average = 10.9, variance = 14.7
• Posterior is normal
– Mean = 7.4 and variance = 1.2
10
Prior & Posterior n=5
[Figure: prior, posterior, and likelihood densities for n = 5, with the prior mean, sample mean, and posterior mean marked.]
11
Prior & Posterior n=50
[Figure: prior, posterior, and likelihood densities for n = 50.]
12
Use Less Informative Prior
• Y is normal with mean 10 and variance 16
• Normal prior for the population mean
– Mean = 5 & variance = 10 instead of 2
– Prior is "flatter"
• Data
– n = 5, average = 10.9, variance = 14.7
• Posterior is normal
– Mean = 9.6 and variance = 2.3 (both examples are verified in the sketch below)
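Plugging the slides' numbers into the normal_posterior helper sketched earlier reproduces both posteriors; treating the sample variance 14.7 as the known error variance is an assumption that matches the slides' figures:

```python
# Informative prior (mean 5, variance 2): posterior mean ~7.4, variance ~1.2
print(normal_posterior(ybar=10.9, n=5, sigma2=14.7, m0=5.0, v0_2=2.0))

# Flatter prior (mean 5, variance 10): posterior mean ~9.6, variance ~2.3
print(normal_posterior(ybar=10.9, n=5, sigma2=14.7, m0=5.0, v0_2=10.0))
```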
13
Prior & Posterior n=5
[Figure: prior, posterior, and likelihood densities for n = 5 with the flatter prior.]
14
Prior & Posterior n=50
[Figure: prior, posterior, and likelihood densities for n = 50 with the flatter prior.]
15
Summary
• Prior has less effect as sample size increases
• Very informative priors give good results with smaller samples if prior information is correct
• If you really don’t know, then use “flatter” or less informative priors
16
What about Marketing?
• HB revolution in how we think about customers
17
Henry Ford: All Customers are the Same
18
Alfred Sloan: Several Common Preferences
19
Continuous Heterogeneity
20
Profit Maximization
21
It Can Get Wild!
22
HB Model for Weekly Spending
• Within-subject model: Yi,j = µi + εi,j and var(εi,j) = σi²
• Heterogeneity in mean weekly spending, or between-subjects model: µi = θ + δi and var(δi) = τ²
• Prior distribution: θ is N(u0, v0²)
• Variances are assumed known
23
Variances & Covariances
• Var(Yi,j | µi) = σi² (known µi)
• Var(Yi,j | θ) = τ² + σi² (unknown µi)
• Cov(Yi,j, Yi,k) = τ² for j not equal to k
• Observations from different subjects are independent
24
Precisions = 1/Variance
$$\Pr(\theta) = \frac{1}{v_0^2} \ \text{ is the prior precision}$$

$$\Pr(\mu_i\,|\,\theta) = \frac{1}{\tau^2}$$

$$\Pr(Y_{i,j}\,|\,\mu_i) = \frac{1}{\sigma_i^2} \quad\text{and}\quad \Pr(Y_{i,j}\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2}$$

$$\Pr(\bar{Y}_i\,|\,\mu_i) = \frac{n_i}{\sigma_i^2} \quad\text{and}\quad \Pr(\bar{Y}_i\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2/n_i}$$
25
Joint Distribution
$$P(Y,\mu,\theta) = \underbrace{h(\theta\,|\,u_0,v_0^2)}_{\text{Prior}}\ \prod_{i=1}^{N}\ \underbrace{g(\mu_i\,|\,\theta,\tau^2)}_{\text{Between Subjects}}\ \prod_{j=1}^{n_i}\ \underbrace{f(y_{i,j}\,|\,\mu_i,\sigma_i^2)}_{\text{Within Subjects}}$$
26
Bayes Theorem
$$P(\mu,\theta\,|\,Y) = \frac{P(Y,\mu,\theta)}{\int\!\!\int P(Y,\mu,\theta)\,d\mu\,d\theta} = \frac{P(Y,\mu,\theta)}{P(Y)} \propto P(Y,\mu,\theta)$$

The denominator P(Y) is constant because Y is fixed & known.
27
Bayes Estimator
• Posterior means are optimal under squared error loss: E(µi|Y) and E(θ|Y)
• Measure of accuracy is the posterior variance: var(µi|Y) and var(θ|Y)
28
Posterior Distribution of θ
• Normal distribution
• Posterior mean is uN
• Posterior variance is vN²
• Posterior precision is Pr(θ|Y) = 1/vN²
29
Posterior Precision of θ ("Pr" = Precision = 1/Variance)

$$\Pr(\theta\,|\,Y) = \Pr(\theta) + \sum_{i=1}^{N} \Pr(\bar{Y}_i\,|\,\theta)$$

$$\Pr(\theta) = \frac{1}{v_0^2} \quad\text{and}\quad \Pr(\bar{Y}_i\,|\,\theta) = \frac{1}{\tau^2 + \sigma_i^2/n_i}$$
30
Posterior Mean of θ
$$u_N = w_0\,u_0 + \sum_{i=1}^{N} w_i\,\bar{Y}_i$$

$$w_0 = \frac{\Pr(\theta)}{\Pr(\theta\,|\,Y)} \quad\text{and}\quad w_i = \frac{\Pr(\bar{Y}_i\,|\,\theta)}{\Pr(\theta\,|\,Y)}$$
31
Updating of θ
• Prior mean u0 becomes posterior mean uN
• Prior variance v0² becomes posterior variance vN²
32
Posterior Mean of µi
$$E[\mu_i\,|\,Y] = \alpha_i\,\bar{Y}_i + (1-\alpha_i)\,u_N$$

$$\alpha_i = \frac{\Pr(\bar{Y}_i\,|\,\mu_i)}{\Pr(\mu_i\,|\,\theta) + \Pr(\bar{Y}_i\,|\,\mu_i)} = \frac{n_i/\sigma_i^2}{n_i/\sigma_i^2 + 1/\tau^2}$$
33
Between-Subject Heterogeneity in Mean Household Spending
[Figure: density of between-subject heterogeneity in mean household spending (x-axis: Spending, y-axis: Distribution).]
34
Between & Within Subjects Distributions
[Figure: heterogeneity density overlaid with the within-subject spending distributions for Subjects 1, 2, and 3.]
35
2 Observations per Subject
[Figure: heterogeneity density with two observations plotted for each of Subjects 1, 2, and 3.]
36
Subject Averages
[Figure: heterogeneity density with each subject's sample average marked.]
37
Pooled Estimate of Mean
[Figure: heterogeneity density and Subjects 1, 2, and 3, with the pooled estimate of population average spending marked.]
38
Sample Estimates
• Disaggregate estimate Ȳi of µi only uses the observations for subject i
– Super if 30 or more observations per subject
• Pooled or aggregate estimator Ȳ of θ
– Smaller sampling error
– Ignores individual differences
39
HB Shrinkage Estimator
• Take a combination of the individual average and the pooled average (see the sketch after this list):

$$w_i\,\bar{Y}_i + (1-w_i)\,\bar{Y}$$

• What are the correct weights?
• HB automatically gives optimal weights based on
– Prior variance of µi
– Number of observations for subject i
– Variance of past spending for subject i
– Number of subjects
– Amount of heterogeneity in household means
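A minimal sketch of this shrinkage, assuming the variance components τ² and σi² are treated as given rather than estimated; the function and array names are illustrative:

```python
import numpy as np

def hb_shrinkage(ybar_i, n_i, sigma2_i, tau2, u_N):
    """Shrink each subject's average toward the population mean u_N.

    The weight on the subject average is its data precision n_i/sigma2_i
    relative to the total precision n_i/sigma2_i + 1/tau2.
    """
    data_prec = n_i / sigma2_i
    alpha = data_prec / (data_prec + 1.0 / tau2)
    return alpha * ybar_i + (1 - alpha) * u_N

# Subjects with few observations are shrunk hardest toward u_N = 20.
ybar = np.array([35.0, 12.0, 22.0])
n = np.array([2, 2, 20])
print(hb_shrinkage(ybar, n, sigma2_i=16.0, tau2=9.0, u_N=20.0))
```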
40
Shrinkage Estimates
[Figure: heterogeneity density with the shrinkage estimates for Subjects 1, 2, and 3 pulled toward the population mean.]
41
20 Observations per Subject
[Figure: heterogeneity density and estimates for Subjects 1, 2, and 3 with 20 observations per subject; the estimates are shrunk less.]
42
Bayes & Shrinkage Estimates
• Bayes estimators automatically determine the optimal amount of shrinkage to minimize MSE for the true parameters and predictions
• Borrow strength from all subjects
• Trade off some bias for variance reduction
43
Good & Bad News
• Only simple models result in closed-form equations
• The models we use in marketing require numerical methods to compute posterior means, posterior standard deviations, predictions, and so on
44
Numerical Integration
• Compute posterior mean of function T(θ).
$$E[T(\theta)\,|\,y] = \int T(\theta)\,p(\theta\,|\,y)\,d\theta$$
45
T(x) and f(x)
[Figure: the function T(x) and the density f(x) plotted over [0, 1].]
46
Trapezoid Rule

[Figure: trapezoid-rule approximation to the area under T(x)f(x) over [0, 1].]
47
Grid Methods
• Very accurate with few function evaluations (sketch below)
• Need to know where the action is
• Does not scale well to higher dimensions
• You need to be very smart to make it work
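As an illustration of the grid approach, a sketch that approximates a posterior mean on [0, 1] with the trapezoid rule; the Beta(5, 15) "posterior" is just a stand-in density, not from the slides:

```python
import numpy as np
from scipy.stats import beta

# Grid on [0, 1]; the density is a Beta(5, 15) stand-in posterior.
x = np.linspace(0.0, 1.0, 501)
post = beta.pdf(x, 5, 15)

# E[T(theta)|y] = integral of T * posterior; here T(theta) = theta.
numer = np.trapz(x * post, x)
denom = np.trapz(post, x)        # should be ~1; guards against grid error
print(numer / denom)             # close to 5 / (5 + 15) = 0.25
```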
48
Monte Carlo
• Generate random draws θ1, θ2, …, θm from posterior distribution using a random number generator.
• What happened to the density of θ?
$$E[T(\theta)\,|\,y] \approx \frac{1}{m}\sum_{j=1}^{m} T(\theta_j)$$
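A minimal Monte Carlo sketch, again using a Beta stand-in posterior because it has a built-in generator; notice that the draws replace any evaluation of the density itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from the (stand-in) posterior; no density evaluation needed.
draws = rng.beta(5, 15, size=100_000)

# The sample average of T(theta_j) estimates E[T(theta)|y].
print(draws.mean())              # T(theta) = theta, ~0.25
print(np.mean(draws ** 2))       # T(theta) = theta^2
```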
49
Good & Bad News
• If your computer has a random number generator for the posterior distribution, Monte Carlo is a snap to do.
• Your computer almost never has the correct random number generator.
50
Importance Sampling
• Would like to sample from density f
• You have a good random number generator for the density g
• Importance sampling lets you generate random deviates from g to evaluate expectations with respect to f
• Generate φ1, …, φm from g
51
Importance Sampling Approximation
$$\int T(\theta)\,f(\theta)\,d\theta = \int T(\varphi)\,\frac{f(\varphi)}{g(\varphi)}\,g(\varphi)\,d\varphi \approx \sum_{i=1}^{m} T(\varphi_i)\,w_i$$

$$r(\varphi) \propto \frac{f(\varphi)}{g(\varphi)} \quad\text{and}\quad w_i = \frac{r(\varphi_i)}{\sum_{j=1}^{m} r(\varphi_j)}$$

When r equals f/g exactly (not just proportionally), the simple average (1/m) Σ T(φi) r(φi) works as well; the normalized weights wi absorb any unknown constant in f.
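A small sketch of this idea, assuming the target f is a Beta(5, 15) density known only up to a constant and the proposal g is uniform on (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Proposal g: uniform on (0, 1), so g(phi) = 1.
phi = rng.uniform(size=m)

# Unnormalized target f: Beta(5, 15) kernel with the constant dropped.
r = phi ** 4 * (1 - phi) ** 14          # r proportional to f / g

# Self-normalized importance-sampling estimate of E[theta].
w = r / r.sum()
print(np.sum(phi * w))                  # ~0.25
```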
52
Markov Chain Monte Carlo
• Extension of Monte Carlo
• Random draws are not independent
• The joint distribution f(β, φ) does not have a convenient random number generator
• The conditional distributions g(φ|β) and h(β|φ) are easy to generate from
53
Iterative Generation from Full Conditionals
• Start at φ0
• Generate β1 from h(β|φ0)
• Generate φ1 from g(φ|β1)
• …
• Generate βm+1 from h(β|φm)
• Generate φm+1 from g(φ|βm+1)
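A toy Gibbs sampler in exactly this pattern, assuming a bivariate normal target with correlation ρ, where both full conditionals are normal; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the bivariate normal target
m = 5_000
beta, phi = 0.0, 0.0           # start at phi_0 (and an arbitrary beta)
draws = np.empty((m, 2))

for t in range(m):
    # h(beta | phi): normal with mean rho*phi, variance 1 - rho^2
    beta = rng.normal(rho * phi, np.sqrt(1 - rho ** 2))
    # g(phi | beta): normal with mean rho*beta, variance 1 - rho^2
    phi = rng.normal(rho * beta, np.sqrt(1 - rho ** 2))
    draws[t] = beta, phi

print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # ~(0, 0) and ~0.8
```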
54
Baseball Example
• 90 MLB players in the 2000 season
• Observe at bats (AB) and batting averages (BA) in May
• Infer the distribution of batting averages across players
• Predict batting averages in October using data from May
55
Baseball Batting Averages
Player     May BA   May AB   October BA   October AB
Pena       .148     27       .262         263
Vizquel    .204     49       .260         542
Belle      .400     45       .317         546
Murray     .442     43       .323         436

The Cleveland Indians - 1995
56
Estimating a Probability
• n at bats in May
• X = number of hits
• p = batting average for the season
• X has a binomial distribution
– Mean np
– Variance np(1 − p)
57
Binomial Distribution
[Figure: binomial distribution of sample successes for n = 50, p = 0.3; mean = 15.00, STD = 3.24.]
58
Need Prior for Batting Average p
• 0 < p < 1
• Beta distribution is a popular choice
• It has two parameters: α and β
• Density is proportional to p^(α−1)(1−p)^(β−1)
• Prior mean = α/(α+β)
59
Beta Prior for p
$$f[p\,|\,\alpha,\beta] = Beta(p\,|\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1} \quad\text{for } 0 \le p \le 1$$

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx$$
60
Mean and Variance
$$E(p) = \frac{\alpha}{\alpha+\beta}$$

$$V(p) = \frac{E(p)\,[1-E(p)]}{\alpha+\beta+1}$$
61
Beta Distribution
[Figure: Beta density with alpha = 1, beta = 1; mean = 0.50.]
62
Beta Distribution
[Figure: Beta density with alpha = 5, beta = 1; mean = 0.83.]
63
Beta Distribution
[Figure: Beta density with alpha = 5, beta = 10; mean = 0.33.]
64
Beta Distribution
[Figure: Beta density with alpha = 0.5, beta = 2; mean = 0.20.]
65
Beta Distribution
[Figure: Beta density with alpha = 0.5, beta = 0.5; mean = 0.50.]
66
Bayes Theorem: Update prior for p after observing n and x
$$f[p\,|\,x,\alpha,\beta] = \frac{\Pr[x\,|\,p]\,f[p\,|\,\alpha,\beta]}{\int_0^1 \Pr[x\,|\,q]\,f[q\,|\,\alpha,\beta]\,dq} \propto \Pr[x\,|\,p]\,f[p\,|\,\alpha,\beta]$$

$$\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1} = Beta(p\,|\,\alpha+x,\ \beta+n-x)$$
67
Inference About p: Posterior is also Beta

Prior Parameters    Posterior Parameters
α                   α + x
β                   β + n − x
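A sketch of this conjugate update, using the numbers from the "p=Beta & x=Binomial" slide below (Beta(5, 15) prior, n = 50, x = 18); scipy's frozen distribution is used only to read off moments:

```python
from scipy.stats import beta

# Beta(5, 15) prior (mean 0.25); observe x = 18 hits in n = 50 at bats.
a0, b0, n, x = 5, 15, 50, 18

post = beta(a0 + x, b0 + n - x)      # posterior is Beta(23, 47)
print(post.mean())                   # ~0.33, between prior mean .25 and x/n = .36
print(post.std())
```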
68
Posterior Mean of p: It's another shrinkage estimator

$$E(p\,|\,x) = w_n\,\hat{p} + (1-w_n)\,\text{Prior Mean}$$

$$\hat{p} = \frac{x}{n} \quad\text{and}\quad w_n = \frac{n}{\alpha+\beta+n}$$
69
p=Beta & x=Binomial
[Figure: Beta(5, 15) prior (prior mean = 0.25) and the posterior after observing N = 50 and X = 18.]
70
Hierarchical Bayes Model
• Variation within batter i:
– Xi given pi has a binomial distribution
• Variation among batters:
– pi has a beta distribution with parameters α and β
• Prior distribution for α and β:
– Gamma (chi-square) distribution
71
Gamma Distribution
$$g(y) = G(y\,|\,r,s) = \frac{s^r\,y^{r-1}\,e^{-sy}}{\Gamma(r)} \quad\text{for } y > 0$$

$$E(Y) = \frac{r}{s} \quad\text{and}\quad V(Y) = E(Y)\,\frac{1}{s}$$
72
Gamma Distribution
[Figure: Gamma density with alpha = 1, beta = 0.1; mean = 10.00, STD = 10.00.]
73
Gamma Distribution
[Figure: Gamma density with alpha = 10, beta = 0.5; mean = 20.00, STD = 6.32.]
74
Specify Prior Parameters: r, s, u & v
• Priors: α is G(r, s) & β is G(u, v)
• E(α) = r/s and V(α) = E(α)/s
• s determines the variance relative to the mean
• I used s = 0.25, so the variance is four times larger than the mean
• Same for v
75
$$E[p] = E[E(p\,|\,\alpha,\beta)] = \frac{r}{r+u} = p_0$$

$$c = V[E(p\,|\,\alpha,\beta)] = \frac{p_0(1-p_0)}{r+u+1}$$

$$r = p_0\left[\frac{p_0(1-p_0)}{c} - 1\right] \quad\text{and}\quad u = (1-p_0)\left[\frac{p_0(1-p_0)}{c} - 1\right]$$
76
Specify Prior Parameters
• Guess a mean of all batting averages: p0 = 0.25
• Measure of my uncertainty of that guess: c = 0.01
• Parameter r = 4.4
• Parameter u = 13.3 (checked in the sketch below)
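A quick arithmetic check of these choices, including the implied Gamma prior means and standard deviations that reappear in the parameter-estimates table on a later slide:

```python
import numpy as np

p0, c, s = 0.25, 0.01, 0.25

k = p0 * (1 - p0) / c - 1            # common factor in r and u
r, u = p0 * k, (1 - p0) * k
print(r, u)                          # ~4.4 and ~13.3

# Implied Gamma prior means and stds for alpha and beta.
print(r / s, np.sqrt(r / s / s))     # ~17.8 and ~8.4
print(u / s, np.sqrt(u / s / s))     # ~53.2 and ~14.6
```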
77
MCMC for Batting Averages
• Need the full conditional for pi given α and β
– Beta distribution
• Need the full conditional for α and β given the pi
– Unknown distribution
– Use the Metropolis algorithm
78
MCMC: Full Conditionals for Player i Batting Average pi
$$f[p_i\,|\,x_i,\alpha,\beta] \propto \Pr[x_i\,|\,p_i]\,f(p_i\,|\,\alpha,\beta) \propto p_i^{\alpha+x_i-1}(1-p_i)^{\beta+n_i-x_i-1} = Beta(p_i\,|\,\alpha+x_i,\ \beta+n_i-x_i)$$
79
MCMC: Full Conditional for α and β
$$g(\alpha,\beta\,|\,x_1,\ldots,x_n,\,p_1,\ldots,p_n) \propto \left[\prod_{i=1}^{n} Beta(p_i\,|\,\alpha,\beta)\right] g(\alpha\,|\,r,s)\,g(\beta\,|\,u,v)$$
80
Metropolis Algorithm
• Want to generate θ from f
• Instead, generate a candidate value φ from g(·|θ)
– Density g can depend on θ
– E.g., random walk: φ = θ + δ
• With probability α(θ, φ) accept φ as the new value of θ
• With probability 1 − α(θ, φ) keep θ
81
Transition Probability
$$\alpha(\theta,\varphi) = \min\left\{\frac{f(\varphi)\,g(\theta\,|\,\varphi)}{f(\theta)\,g(\varphi\,|\,\theta)},\ 1\right\}$$

• f is the full conditional density of θ
• Ratios: do not need to know normalizing constants
• Usually compute α on the log scale
• Works if the densities are not zero
• Works better if g is close to f
82
Alpha and Beta vs Iteration
[Figure: trace plots of ALPHA and BETA over 2000 MCMC iterations.]
83
Posterior of Alpha
[Figure: histogram of the posterior draws of alpha, roughly spanning 15 to 41.]
84
Posterior of Beta
[Figure: histogram of the posterior draws of beta, roughly spanning 38 to 105.]
85
Parameter Estimates

         Prior     Posterior
α        17.8      26.2
(std)    (8.4)     (4.6)
β        53.2      68.2
(std)    (14.6)    (11.7)
86
Distribution of Batting Averages
[Figure: prior density and posterior (after May) of the distribution of batting averages.]
87
Prediction of Season Averages
         RMSE     MAPE
MLE      0.060    17.0%
Bayes    0.032    9.4%
88
Batting Averages: Bayes Shrinks MLE

[Figure: season 2000 batting averages plotted against May 2000 averages; the Bayes estimates are shrunk relative to the MLE.]
89
HB Conjoint
Lenk, DeSarbo, Green, Young (1996)
• Evaluated computer profiles on a 0-to-10 scale for "likelihood to purchase"
– 0 = Would not buy
– 10 = Would definitely buy
• Design
– 178 subjects
– 13 attributes with two levels each
– 20 profiles per subject
90
Attributes: Effect Coding
1. Hotline support
2. RAM
3. Screen size
4. CPU
5. Hard disk
6. Multimedia
7. Cache
8. Color
9. Retail store
10. Warranty
11. Software
12. Guarantee
13. Price
91
Subject-Covariates
• Female: 1 if female and 0 if male
• Years: # years of work experience
• Own: 1 if has a computer & 0 else
• Nerd: 1 if technical background & 0 else
• Apply: # of software applications
• Expert: self-reported expertise rating
92
Summary Statistics for Covariates
Variable   Mean    Std Dev   MIN   MAX
FEMALE     0.275   0.448     0     1
YEARS      4.416   2.369     1     18
OWN        0.876   0.330     0     1
NERD       0.275   0.448     0     1
APPLY      4.287   1.574     1     9
EXPERT     7.618   1.902     3     10
93
Interaction Model
• Within-subjects:

$$Y_i = X_i\beta_i + \varepsilon_i \quad\text{for } i = 1,\ldots,n, \qquad \varepsilon_i \sim N_{m_i}\!\left(0,\ \sigma^2 I_{m_i}\right)$$

• Between-subjects heterogeneity:

$$\beta_i = \Theta' z_i + \delta_i, \qquad \delta_i \sim N_p(0, \Delta)$$
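A sketch that simulates one subject from this two-level model; the dimensions and parameter values are illustrative, not from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
m_i, p, q = 20, 3, 2                # profiles, attributes, covariates

Theta = rng.normal(size=(q, p))     # covariate effects on partworths
Delta = np.eye(p) * 0.25            # between-subjects covariance
sigma = 1.0

# Between-subjects: subject i's partworths depend on covariates z_i.
z_i = rng.normal(size=q)
beta_i = Theta.T @ z_i + rng.multivariate_normal(np.zeros(p), Delta)

# Within-subjects: ratings of the m_i profiles in design matrix X_i.
X_i = rng.choice([-1.0, 1.0], size=(m_i, p))   # effects-coded design
Y_i = X_i @ beta_i + rng.normal(0, sigma, size=m_i)
print(Y_i[:5])
```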
94
Average over Posterior Means and Std Devs of Partworths Across Subjects

          PostMean   PostSTD               PostMean   PostSTD
CNST      4.757      1.404     Cache       0.031      0.461
HotLine   0.095      0.487     Color       0.026      0.371
RAM       0.347      0.467     Dstrbtn     0.078      0.378
ScrnSz    0.193      0.405     Wrrnty      0.124      0.392
CPU       0.392      0.646     Sftwr       0.196      0.399
HrdDsk    0.171      0.501     Grntee      0.112      0.427
MultMd    0.494      0.574     Price       -1.127     0.873
95
Impact of Covariates on Partworths
Columns: CNST, RAM, CPU, Dstrbtn, Wrrnty, Grntee, Price
CNST:    3.74, 0.52, -0.15, 0.05, -0.01, 0.03, -1.55
FEMALE:  -0.10, 0.06, 0.12, 0.40
YEARS:   -0.11
OWN:     -0.10, 0.17, 0.17, 0.20, -0.12, -0.20
NERD:    -0.27, 0.15, 0.16, -0.14
APPLY:   0.10
EXPERT:  0.17
96
Summary
• HB allows individual-level coefficients
• Two-level model
– Within subjects
– Between subjects (heterogeneity)
• HB shrinks unstable, subject-level estimators toward the population mean
97
Summary
• BDT provides an integrated framework for making decisions and inferences
• Good models consider all sources of uncertainty
• Good methods keep track of all sources of uncertainty
• Bayes does both