Chonbuk National University · nlp.jbnu.ac.kr/PGM2019/slides_jbnu/BStat.pdf · 2019-03-20
Bayesian Statistics
Seung-Hoon Na
Chonbuk National University
Bayesian statistics
• Using the posterior distribution to summarize everything we know about a set of unknown variables
• Summarizing posterior distributions
– MAP estimation
– Credible intervals
MAP estimation
• MAP estimate: the posterior mode
– The most popular choice among point estimates of an unknown quantity
– Reduces to an optimization problem, for which efficient algorithms often exist
– Interpreted in non-Bayesian terms, the log prior acts as a regularizer
MAP estimation: Drawbacks
• No measure of uncertainty
– But, in many applications it is important to know how much one can trust a given estimate
• Overfitting
– If we don’t model the uncertainty in our parameters, then our predictive distribution will be overconfident
MAP estimation: Drawbacks
• The mode is an untypical point
– Choosing the mode as a summary of a posterior distribution is often a very poor choice, since the mode is usually quite untypical of the distribution, unlike the mean or median
In both cases, the mean provides a better summary of the given distribution than the mode.
MAP estimation: Drawbacks
• Not invariant to reparameterization
– To see this, reparameterize 𝑥 with 𝑦 = 𝑓(𝑥)
– MAP estimate for 𝒙: 𝑥̂ = argmax_𝑥 𝑝𝑥(𝑥)
– MAP estimate for 𝒚: the transformed density is 𝑝𝑦(𝑦) = 𝑝𝑥(𝑓⁻¹(𝑦)) |d𝑓⁻¹(𝑦)/d𝑦| (the Jacobian term)
After reparameterization, the MAP estimate differs from the one before: argmax_𝑦 𝑝𝑦(𝑦) ≠ 𝑓(𝑥̂)
MAP estimation: Drawbacks
• Not invariant to reparameterization
The mode of the transformed distribution is not equal to the transform of the original mode
MAP estimation: Drawbacks
• Not invariant to reparameterization: an example in the context of MAP estimation
– The Bernoulli distribution
– Prior:
– Parameterization 1:
– Parameterization 2:
The MAP estimate depends on the parameterization
Credible Intervals
• In addition to point estimates, a measure of confidence is often required
• 𝟏𝟎𝟎(𝟏 − 𝜶)% credible interval
– One of the standard measures of confidence in some (scalar) quantity 𝜃
• Central interval
– The specific credible interval with 𝛼/2 mass in each tail
In addition to point estimates, a measure of confidence is needed; the Bayesian credible interval is a point-based way of modeling uncertainty.
Central interval: the two tails outside the interval each carry 𝛼/2 of the probability mass (the central credible interval).
Credible Intervals
• Highest posterior density (HPD) regions
– A problem with central intervals
• There might be points outside the CI which have higher probability density.
– Definition of HPD (given 𝛼)
HPD region is sometimes called a highest density interval or HDI
Central interval vs. HPD region
(a) Central interval and (b) HPD region for a Beta(3,9) posterior.
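Both summaries can be approximated from posterior samples, e.g. for the Beta(3, 9) posterior of the figure (a minimal sketch; the sorting-based HPD search and all variable names are mine, not the textbook's code):

```python
import random

random.seed(1)
# Draw from the Beta(3, 9) posterior and sort once.
samples = sorted(random.betavariate(3, 9) for _ in range(100_000))
n = len(samples)
alpha = 0.05

# Central 95% interval: alpha/2 probability mass in each tail.
central = (samples[int(n * alpha / 2)], samples[int(n * (1 - alpha / 2)) - 1])

# 95% HPD interval: the narrowest interval containing (1 - alpha) of the mass.
k = int(n * (1 - alpha))
i = min(range(n - k), key=lambda j: samples[j + k] - samples[j])
hpd = (samples[i], samples[i + k])

print(central, hpd)
```

Because the HPD interval is by construction the shortest interval with the required mass, it is never wider than the central interval, which is the point of the figure.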
Central interval vs. HPD region
The HDI may not even be a connected region
Inference for a difference in proportions
• Two sellers offering an item for the same price
• Seller 1: 90 positive, 10 negative; Seller 2: 2 positive, 0 negative
• Who should you buy from?
(Figure: prior and posterior distributions)
Inference for a difference in proportions
Approximating 𝑝(𝛿 > 0|𝐷) by Monte Carlo, where 𝛿 = 𝜃1 − 𝜃2, gives 0.718 (textbook code):
– Sample from 𝑝(𝜃1|𝐷) and 𝑝(𝜃2|𝐷)
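The Monte Carlo approximation can be sketched with the standard library alone (a minimal sketch, not the textbook's code; the uniform Beta(1, 1) priors and the sample count are my choices):

```python
import random

random.seed(0)

# Uniform Beta(1, 1) priors, so the posteriors are Beta(1 + pos, 1 + neg):
# Seller 1: 90 positive, 10 negative -> Beta(91, 11)
# Seller 2:  2 positive,  0 negative -> Beta(3, 1)
S = 100_000
wins = 0
for _ in range(S):
    theta1 = random.betavariate(91, 11)  # draw from p(theta1|D)
    theta2 = random.betavariate(3, 1)    # draw from p(theta2|D)
    if theta1 > theta2:                  # event delta = theta1 - theta2 > 0
        wins += 1

p_delta_pos = wins / S
print(p_delta_pos)  # should come out close to the textbook value 0.718
```

Note that seller 2's posterior mean (0.75) is lower than seller 1's (about 0.89) and far more spread out, which is why seller 1 wins despite seller 2's perfect record.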
Bayesian model selection
• Model selection problem
– How should we choose the best one, among a set of models of different complexity?
• Cross validation: requires fitting each model K times
• Bayesian model selection
– Marginal likelihood (or integrated likelihood, or evidence)
When 𝑝(𝑚) is uniform, Bayesian model selection reduces to choosing the model with the highest marginal likelihood.
Unlike CV, Bayesian model selection does not require fitting each model K times.
Bayesian Occam’s razor
• MLE or MAP estimate
– Overfitting problem: models with more parameters will achieve higher likelihood
• But, when maximizing the marginal likelihood instead of the likelihood
– models with more parameters do not necessarily have higher marginal likelihood
– So, it can handle overfitting: the Bayesian Occam’s razor effect
Bayesian Occam’s razor effect: choosing the marginal likelihood (the likelihood integrated over the parameters) rather than the likelihood as the criterion naturally prevents overfitting.
How does the Bayesian Occam’s razor effect arise?
• 1) Marginal likelihood is like a leave-one-out CV estimate of the likelihood
• 2) Conservation of probability mass
– Complex models, which can predict many things, must spread their probability mass thinly, and hence will not obtain as large a probability for any given data set as simpler models
If the model is too complex, it overfits the early examples and predicts the remaining examples poorly.
• If the model is too complex, it can predict many diverse datasets, so its probability mass is spread thinly and any given dataset receives a low probability (thin along the y-axis).
• A simple model, in contrast, can predict only a limited set of datasets, so the probability mass on a particular dataset is relatively high.
The probabilities over all datasets sum to 1.
How does the Bayesian Occam’s razor effect arise?
• Conservation of probability mass
(Figure labels: complex model, simple model, the right model; the complex model's mass is visibly spread out.)
Bayesian Occam’s razor
• N = 5; green: true function, dashed: prediction
(panels: 𝑑 = 1 and 𝑑 = 2)
Bayesian Occam’s razor
• N = 5; green: true function, dashed: prediction
(panel: 𝑑 = 3)
Not enough data to justify a complex model, so the MAP model is 𝑑 = 1.
Bayesian Occam’s razor
• N = 30; green: true function, dashed: prediction
(panels: 𝑑 = 1 and 𝑑 = 2)
Bayesian Occam’s razor
• N = 30; green: true function, dashed: prediction
(panel: 𝑑 = 3)
𝑑 = 2: this is the right model
Bayesian Occam’s razor: Ridge regression
• Ridge regression
– Suppose we fit a degree-12 polynomial to 𝑁 = 21 data points
– MLE using least squares: the coefficient magnitudes become too large
– MAP: put a zero-mean Gaussian prior on the weights → ridge regression
The 𝜆𝐼 term adds a "ridge" to the main diagonal.
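In the one-feature, no-intercept case the ridge solution (𝐗ᵀ𝐗 + 𝜆𝐈)⁻¹𝐗ᵀ𝐲 collapses to a scalar, which makes the effect of 𝜆 easy to see (a sketch, not the textbook's code; the data values are made up):

```python
def ridge_1d(xs, ys, lam):
    # w = (sum x_i * y_i) / (sum x_i^2 + lambda): lambda is the "ridge" term
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w_ols = ridge_1d(xs, ys, lam=0.0)    # ordinary least squares (MLE)
w_map = ridge_1d(xs, ys, lam=10.0)   # shrunk toward zero by the Gaussian prior
print(w_ols, w_map)
```

Larger 𝜆 (a tighter zero-mean prior) always shrinks the weight toward zero, which is exactly the regularization view of the MAP estimate.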
Bayesian Occam’s razor: Ridge regression (MAP)
(Figure: a wiggly curve with ln 𝜆 = −20.135, i.e. 𝜆 = 1.8 × 10⁻⁹, which is almost the least-squares (MLE) fit, and a smooth curve with ln 𝜆 = −8.571, i.e. 𝜆 = 1.895 × 10⁻⁴.)
Bayesian Occam’s razor: Ridge regression
Bayesian Occam’s razor
𝑝(𝐷|𝜆) vs. log 𝜆 in polynomial ridge regression (degree = 14; 𝑁 = 21)
Model selection based on the marginal likelihood behaves similarly to cross validation.
Empirical Bayes
Bayesian model selection:
Empirical Bayes
• Instead of evaluating the evidence at a finite grid of values, use numerical optimization: 𝜆̂ = argmax_𝜆 𝑝(𝐷|𝜆)
This is called Empirical Bayes or Type II maximum likelihood.
Marginal Likelihood: 𝑝(𝐷|𝑚)
• For parameter inference in a fixed model 𝑚
– 𝑝(𝜃|𝐷,𝑚) = 𝑝(𝜃|𝑚) 𝑝(𝐷|𝜃,𝑚) / 𝑝(𝐷|𝑚)
• 𝑝(𝐷|𝑚) can be ignored as a normalization constant
• But, for comparing models, we need 𝑝(𝐷|𝑚)
– In general, computing 𝑝(𝐷|𝑚) is hard
– In the case that we have a conjugate prior, the computation can be easy
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Prior: 𝑝(𝜃) = 𝑞(𝜃)/𝑍₀, with unnormalized prior 𝑞(𝜃) and normalizer 𝑍₀
• Likelihood: 𝑝(𝐷|𝜃) = 𝑞(𝐷|𝜃)/𝑍ℓ
• Posterior: 𝑝(𝜃|𝐷) = 𝑞(𝐷|𝜃)𝑞(𝜃) / (𝑍₀𝑍ℓ 𝑝(𝐷|𝑚)) = 𝑞(𝜃|𝐷)/𝑍𝑁, where 𝑞(𝜃|𝐷) = 𝑞(𝐷|𝜃)𝑞(𝜃) is the unnormalized posterior
So 𝑝(𝐷|𝑚) = 𝑍𝑁 / (𝑍₀𝑍ℓ) is based on normalization constants
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Beta-binomial model
– Posterior: 𝑝(𝜃|𝐷) = 𝐵𝑒𝑡𝑎(𝜃|𝑎 + 𝑁1, 𝑏 + 𝑁0)
– Normalization constant (𝑍𝑁) of 𝑝(𝜃|𝐷): 𝐵(𝑎 + 𝑁1, 𝑏 + 𝑁0)
Beta function
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Beta-binomial model
The normalization constant (𝑍𝑁) of 𝑝 𝜃 𝐷 : 𝐵 𝑎 + 𝑁1, 𝑏 + 𝑁0
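In log space the beta-binomial marginal likelihood 𝑝(𝐷|𝑚) = 𝐵(𝑎 + 𝑁1, 𝑏 + 𝑁0)/𝐵(𝑎, 𝑏) is just two `lgamma` identities (a sketch; the helper names are mine):

```python
from math import lgamma, exp

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(a, b, N1, N0):
    # p(D|m) = B(a + N1, b + N0) / B(a, b) for the beta-binomial model
    return log_beta(a + N1, b + N0) - log_beta(a, b)

# Sanity check: uniform prior, one head and one tail:
# p(D) = integral of theta * (1 - theta) d theta = 1/6
print(exp(log_marginal_likelihood(1, 1, 1, 1)))
```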
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Dirichlet-multinoulli model
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Gaussian-Gaussian-Wishart model
BIC approximation to
log marginal likelihood
• Bayesian information criterion (BIC): BIC ≜ log 𝑝(𝐷|𝜃̂) − (dof/2) log 𝑁
• A penalized log likelihood
• BIC in linear regression
– Likelihood: log 𝑝(𝐷|𝜃̂) = −(𝑁/2) log(2𝜋𝜎̂²) − 𝑁/2
– BIC: −(𝑁/2) log 𝜎̂² − (dof/2) log 𝑁 (dropping constants)
dof = the number of degrees of freedom in the model
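The BIC score log 𝑝(𝐷|𝜃̂) − (dof/2) log 𝑁 fits in a one-line helper (a sketch; the log-likelihood values below are hypothetical, chosen only to show the penalty at work):

```python
from math import log

def bic_score(loglik, dof, N):
    # BIC = log p(D | theta_hat) - (dof / 2) * log N; higher is better
    return loglik - 0.5 * dof * log(N)

# Two hypothetical fits to N = 21 points: a richer model must improve the
# log likelihood by more than (extra dof / 2) * log N to win under BIC.
simple = bic_score(loglik=-40.0, dof=2, N=21)
complex_ = bic_score(loglik=-39.5, dof=13, N=21)
print(simple > complex_)
```

Here the 11 extra degrees of freedom cost about 16.7 nats of penalty but buy only 0.5 nats of likelihood, so BIC prefers the simple model.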
BIC approximation to log marginal likelihood
• BIC cost
• BIC cost in linear regression
• Or minimum description length (MDL) principle
– The score for a model in terms of how well it fits the data, minus how complex the model is to define
• Akaike information criterion
– Derived from a frequentist framework
• Cannot be interpreted as an approximation to the marginal likelihood
AIC's penalty is smaller than BIC's, so AIC may select more complex models than BIC does.
Model selection: Effect of the prior
• Marginal likelihood involves model averaging, so the prior plays an important role
• E.g.) Model selection for linear regression
– Prior: 𝑝(𝒘) = 𝑁(𝒘|𝟎, 𝛼⁻¹𝐈)
• Large 𝛼 → simple model; small 𝛼 → complex model
• Hierarchical Bayesian: when the prior is unknown
– Put a prior on the prior
– Marginal likelihood
Requires integrating out both 𝑤 and 𝛼 → computationally hard
Model selection: Empirical Bayes
• Hierarchical Bayesian: when the prior is unknown
– Approximation: optimize 𝛼 directly, 𝛼̂ = argmax_𝛼 𝑝(𝐷|𝛼)
Empirical Bayes (EB)
Computationally easier
Model selection: Bayes Factor
• Two models we are considering
– 𝑀0: the null hypothesis
– 𝑀1: the alternative hypothesis
• Bayes factor: the ratio of marginal likelihoods, 𝐵𝐹(0,1) = 𝑝(𝐷|𝑀0)/𝑝(𝐷|𝑀1)
– Convert the Bayes factor to a posterior over models
• When 𝑝(𝑀1) = 𝑝(𝑀0) = 0.5: 𝑝(𝑀0|𝐷) = 𝐵𝐹(0,1)/(1 + 𝐵𝐹(0,1))
Model selection: Bayes Factor
• Jeffreys’ scale of evidence for interpreting Bayes factors
Bayes Factor: An Example
• Testing if a coin is fair
–𝑀0: a fair coin with 𝜃 = 0.5
–𝑀1: a biased coin where 𝜃 ∈ [0, 1]
• Marginal likelihood
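For this example 𝑝(𝐷|𝑀0) = (1/2)^𝑁 and 𝑝(𝐷|𝑀1) = 𝐵(𝛼1 + 𝑁1, 𝛼0 + 𝑁0)/𝐵(𝛼1, 𝛼0), so the Bayes factor can be computed directly (a sketch; the function names are mine):

```python
from math import lgamma, exp, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bf01(N1, N0, a1=1.0, a0=1.0):
    # log Bayes factor BF(0,1) = log p(D|M0) - log p(D|M1)
    log_m0 = (N1 + N0) * log(0.5)                            # fair coin
    log_m1 = log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)   # biased coin
    return log_m0 - log_m1

# N = 5 tosses: 2 heads favours the fair coin (BF > 1),
# 0 heads favours the biased coin (BF < 1).
print(exp(log_bf01(2, 3)), exp(log_bf01(0, 5)))
```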
Bayes Factor: An Example
• 𝑁 = 5, 𝛼0 = 𝛼1 = 1; when the number of heads is 2 or 3, 𝑀0 is selected
(figure label: log10 𝑝(𝐷|𝑀0))
Bayes Factor: An Example
• BIC approximation
Uninformative Priors
• If we don’t have strong beliefs about what 𝜃 should be, it is common to use an uninformative or non-informative prior, and to “let the data speak for itself”.
• Haldane prior: Beta(0, 0), i.e. 𝑝(𝜃) ∝ 𝜃⁻¹(1 − 𝜃)⁻¹
• This is an improper prior (it doesn’t integrate to 1), but the posterior is proper
• Jeffreys priors
– If 𝑝(𝜙) is non-informative, then any reparameterization of the prior, such as 𝜃 = ℎ(𝜙) for some function ℎ, should also be non-informative.
A prior that remains non-informative under any reparameterization.
Jeffreys priors
• Fisher information: 𝐼(𝜃) ≜ −𝔼[ d² log 𝑝(𝑋|𝜃) / d𝜃² ]
– a measure of curvature of the expected negative log likelihood and hence a measure of stability of the MLE
Jeffreys priors: Derivation
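The derivation on this slide can be reconstructed from the standard change-of-variables argument (a sketch of the usual textbook steps):

```latex
% Jeffreys prior: p(\theta) \propto \sqrt{I(\theta)}.
% Claim: this choice is invariant to reparameterization.
% Let \phi be another parameterization with \theta = h(\phi). By the chain rule,
I(\phi) \;=\; I(\theta)\left(\frac{d\theta}{d\phi}\right)^{2}
\quad\Longrightarrow\quad
\sqrt{I(\phi)} \;=\; \sqrt{I(\theta)}\,\left|\frac{d\theta}{d\phi}\right|
% This is exactly the change-of-variables rule for densities,
%   p_\phi(\phi) = p_\theta(\theta)\,|d\theta/d\phi|,
% so \sqrt{I(\cdot)} transforms as a density does: the Jeffreys prior
% gives the same prior no matter which parameterization we start from.
```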
Jeffreys priors
• Bernoulli: 𝑋 ∼ Ber(𝜃)
– Score function: 𝑠(𝜃) = d log 𝑝(𝑥|𝜃)/d𝜃 = 𝑥/𝜃 − (1 − 𝑥)/(1 − 𝜃)
– Observed information: −d𝑠(𝜃)/d𝜃 = 𝑥/𝜃² + (1 − 𝑥)/(1 − 𝜃)²
– Fisher information: 𝐼(𝜃) = 𝔼[𝑥/𝜃² + (1 − 𝑥)/(1 − 𝜃)²] = 1/(𝜃(1 − 𝜃))
Jeffreys priors
• Bernoulli: 𝑝(𝜃) ∝ 𝐼(𝜃)^{1/2} = 𝜃^{−1/2}(1 − 𝜃)^{−1/2}, i.e. Beta(1/2, 1/2)
• Multinoulli: the Jeffreys prior is Dir(1/2, …, 1/2)
Mixtures of conjugate priors
• A mixture of conjugate priors is also conjugate
– The posterior can also be written as a mixture of conjugate distributions
– The posterior mixing weights are proportional to the prior weights times each component's marginal likelihood
Mixtures of conjugate priors: An Example
• Prior:
• Posterior:
– Data: 𝑁1 = 20, 𝑁0 = 10
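The update of the mixing weights can be sketched as follows; the mixture prior 0.5 Beta(20, 20) + 0.5 Beta(30, 10) is an illustrative assumption (the slide's own prior is not shown here), used with the data 𝑁1 = 20, 𝑁0 = 10:

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_mix_weights(prior, N1, N0):
    # prior: list of (weight, a, b). Each component's marginal likelihood is
    # B(a + N1, b + N0) / B(a, b); new weights are renormalized products.
    unnorm = [w * exp(log_beta(a + N1, b + N0) - log_beta(a, b))
              for (w, a, b) in prior]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

weights = posterior_mix_weights([(0.5, 20, 20), (0.5, 30, 10)], N1=20, N0=10)
print(weights)
```

With an empirical rate of 20/30, the component centered at 0.75 explains the data better than the one centered at 0.5, so its weight grows.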
Hierarchical Bayes
• Bayesian model
– Posterior: 𝑝(𝜽|𝐷)
– Prior: 𝑝(𝜽|𝜼)
– 𝜼 are the hyper-parameters
– How should we set 𝜼?
• Hierarchical Bayesian model
– Put a prior on the prior
– Also called a multi-level model
Hierarchical Bayes: An Example
• Modeling related cancer rates
– 𝑁𝑖: the number of people in various cities
– 𝑥𝑖: the number of people who died of cancer in these cities
– Assumption: 𝑥𝑖 ∼ Bin(𝑁𝑖, 𝜃𝑖)
– We want to estimate the cancer rates 𝜃𝑖
– Approach 1) Estimating them all separately suffers from the sparse-data problem
– Approach 2) Parameter tying: assume all the 𝜃𝑖 are the same
Hierarchical Bayes:
Modeling related cancer rates
• Approach 3)
– Assume that the 𝜃𝑖 are similar, but that there may be city-specific variations → hierarchical Bayes
– That is, 𝜃𝑖 ∼ Beta(𝑎, 𝑏)
– Infer the hyperparameters 𝜼 = (𝑎, 𝑏) from the data, e.g. based on Empirical Bayes
Hierarchical Bayes:
Modeling related cancer rates
Hierarchical Bayes:
Modeling related cancer rates
Hierarchical Bayes:
Modeling related cancer rates
• 95% credible interval
Empirical Bayes
• How to infer hyperparameters?
• Suppose we have a two-level model
– Need to marginalize out 𝜽 to obtain 𝑝(𝜼|𝐷), which is usually computationally hard
• Empirical Bayes: evidence procedure
– Approximate the posterior on the hyper-parameters with a point estimate
Empirical Bayes
• EB provides a computationally cheap approximation to inference in a multi-level hierarchical Bayesian model, just as we viewed MAP estimation as an approximation to inference in the one-level model 𝜽 → 𝐷.
Frequentist
Fully Bayesian
Empirical Bayes:
Beta-binomial model
– Maximizing this marginal likelihood wrt 𝑎, 𝑏:
https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf
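Minka's note linked above derives fixed-point updates for (𝑎, 𝑏); a cruder closed-form alternative, shown here only as an illustration, is moment matching on the observed group rates (the function and variable names are mine):

```python
def beta_moment_match(rates):
    # Fit Beta(a, b) by matching the sample mean and variance of the rates.
    # Valid only when the sample variance v satisfies v < m * (1 - m).
    n = len(rates)
    m = sum(rates) / n
    v = sum((r - m) ** 2 for r in rates) / n
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Hypothetical per-group rates (e.g. per-city cancer rates).
a, b = beta_moment_match([0.2, 0.25, 0.3, 0.35])
print(a, b)
```

By construction the fitted prior mean 𝑎/(𝑎 + 𝑏) equals the sample mean of the rates; the more tightly clustered the rates, the larger 𝑎 + 𝑏 (a more confident prior).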
Empirical Bayes:
Gaussian-Gaussian model
• Suppose we have data from multiple related groups
– 𝑥𝑖𝑗: the test score for student 𝑖 in school 𝑗, for 𝑗 = 1:𝐷, 𝑖 = 1:𝑁𝑗
– Want to estimate the mean score for each school, 𝜃𝑗
– Use a hierarchical Bayes model to handle the data-poor problem
• Joint distribution
𝜃𝑗 ∼ 𝑁(𝜇, 𝜏²), with 𝜼 = (𝜇, 𝜏)
Empirical Bayes: Gaussian-Gaussian model
• Joint distribution, given the estimate 𝜼̂ = (𝜇̂, 𝜏̂)
• The posterior:
The likelihood function is simplified using sufficient statistics.
C.f.) Simplify likelihood function
using sufficient statistics
• Because the MLE estimator and the Bayes estimator are functions of the sufficient statistics
http://people.missouristate.edu/songfengzheng/teaching/mth541/lecture%20notes/sufficient.pdf
http://www.stat.cmu.edu/~larry/=stat705/Lecture6.pdf
Empirical Bayes: Gaussian-Gaussian model
• The posterior (using Gaussian-related formulas):
– 𝐵𝑗 controls the degree of shrinkage towards the overall mean 𝜇̂, where 𝐵𝑗 = 𝜎𝑗²/(𝜎𝑗² + 𝜏̂²)
• The posterior mean, when 𝜎𝑗 = 𝜎: 𝜃̂𝑗 = 𝐵𝑗 𝜇̂ + (1 − 𝐵𝑗) 𝑥̄𝑗
Shrinkage: each group mean is shrunk toward the global mean.
Large sample size → small 𝜎𝑗² → small 𝐵𝑗 → less shrinkage.
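The shrinkage formula 𝜃̂𝑗 = 𝐵𝑗𝜇̂ + (1 − 𝐵𝑗)𝑥̄𝑗 with 𝐵𝑗 = 𝜎𝑗²/(𝜎𝑗² + 𝜏̂²) can be sketched as follows (all numbers are illustrative, not from the text):

```python
def shrunken_mean(xbar_j, sigma2_j, mu_hat, tau2_hat):
    # B_j = sigma_j^2 / (sigma_j^2 + tau^2): weight on the global mean mu_hat
    B_j = sigma2_j / (sigma2_j + tau2_hat)
    return B_j * mu_hat + (1.0 - B_j) * xbar_j

# A data-poor group (large sigma_j^2) is pulled strongly toward mu_hat;
# a data-rich group (small sigma_j^2) keeps its own sample mean.
poor = shrunken_mean(xbar_j=80.0, sigma2_j=100.0, mu_hat=50.0, tau2_hat=25.0)
rich = shrunken_mean(xbar_j=80.0, sigma2_j=1.0, mu_hat=50.0, tau2_hat=25.0)
print(poor, rich)
```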
c.f.) Apply linear Gaussian systems
• Likelihood: 𝑝(𝐷|𝜃𝑗, 𝜇̂, 𝜏̂) = 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²)
• Prior: 𝑝(𝜃𝑗|𝜇̂, 𝜏̂) = 𝑁(𝜃𝑗|𝜇̂, 𝜏̂²)
• Posterior: 𝑝(𝜃𝑗|𝐷, 𝜇̂, 𝜏̂) = 𝑁(𝜃𝑗|𝜇̃𝑗, 𝜎̃𝑗²), where
1/𝜎̃𝑗² = 1/𝜏̂² + 1/𝜎𝑗², i.e. 𝜎̃𝑗² = 𝜏̂²𝜎𝑗²/(𝜏̂² + 𝜎𝑗²)
𝜇̃𝑗 = 𝜎̃𝑗² ( 𝑥̄𝑗/𝜎𝑗² + 𝜇̂/𝜏̂² )
Empirical Bayes: Gaussian-Gaussian model
• Estimating 𝜼 = (𝜇, 𝜏) (case: 𝜎𝑗² = 𝜎²)
• Marginal likelihood: 𝑝(𝐷|𝜇, 𝜏) = ∏𝑗 𝑁(𝑥̄𝑗|𝜇, 𝜏² + 𝜎²)
• Estimating 𝜇 using the MLE for a Gaussian: 𝜇̂ = (1/𝐷) Σ𝑗 𝑥̄𝑗, the overall mean
c.f.) Apply linear Gaussian systems
• Likelihood: 𝑝(𝑥̄𝑗|𝜃𝑗, 𝜇̂, 𝜏̂) = 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²)
• Prior: 𝑝(𝜃𝑗|𝜇, 𝜏) = 𝑁(𝜃𝑗|𝜇, 𝜏²)
• Marginal: 𝑝(𝑥̄𝑗|𝜇̂, 𝜏̂) = ∫ 𝑁(𝜃𝑗|𝜇, 𝜏²) 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²) d𝜃𝑗 = 𝑁(𝑥̄𝑗|𝜇, 𝜏² + 𝜎𝑗²)
Empirical Bayes: Gaussian-Gaussian model
• Estimating the variance 𝜏² by moment matching: the empirical variance 𝑠² of the 𝑥̄𝑗 estimates 𝜏² + 𝜎², giving 𝜏̂² = max(0, 𝑠² − 𝜎²)
• Shrinkage factor: 𝐵̂ = 𝜎²/(𝜎² + 𝜏̂²)
Empirical Bayes: Gaussian-
Gaussian model
• Estimating 𝜼 = (𝜇, 𝜏) (case: the 𝜎𝑗² are different)
– No closed-form solution
– Instead, we need to use the EM algorithm or approximate inference
When 𝜎𝑗² differs across groups, the Empirical Bayes method for the Gaussian-Gaussian model has no closed-form solution, and the EM algorithm or approximate inference is needed.
Gaussian-Gaussian model: An Example
• Predicting baseball scores
– 𝑏𝑗: the number of hits for 𝐷 = 18 players, during 𝑇 = 45 games
– Assume 𝑏𝑗 ∼ Bin(𝑇, 𝜃𝑗)
– Want to estimate the 𝜃𝑗
– The MLE: 𝜃̂𝑗 = 𝑥𝑗 = 𝑏𝑗/𝑇
– How about an EB approach?
Gaussian-Gaussian model:
Predicting baseball scores
• EB approach (Gaussian shrinkage approach)
– To apply the Gaussian shrinkage approach, we require the likelihood to be Gaussian:
– But, 𝑣𝑎𝑟[𝑥𝑗] is not constant (cannot be used as 𝜎2)
𝑥𝑗 = 𝑏𝑗/𝑇
Gaussian-Gaussian model: Predicting baseball scores
• EB approach
– Apply a variance stabilizing transform to 𝑥𝑗 to better match the Gaussian assumption
• Variance stabilizing transformation: a function 𝑌 = 𝑓(𝑋) such that var[𝑌] is independent of 𝐸[𝑋] = 𝜇
Gaussian-Gaussian model: Predicting baseball scores
• EB approach
– Apply the variance stabilizing transform: 𝑦𝑗 = 𝑓(𝑥𝑗) = √𝑇 arcsin(2𝑥𝑗 − 1)
– Then, we have approximately: 𝑦𝑗 ∼ 𝑁(𝑓(𝜃𝑗), 1)
– Estimate 𝜇̂𝑗 using Gaussian shrinkage
– Then, transform back to get: 𝜃̂𝑗 = (sin(𝜇̂𝑗/√𝑇) + 1)/2
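A sketch of the transform and its inverse, with 𝑇 = 45 as in the example (the Gaussian shrinkage step itself is omitted here):

```python
from math import sqrt, asin, sin

T = 45  # at-bats per player

def f(x):
    # variance stabilizing transform: y = sqrt(T) * arcsin(2x - 1)
    return sqrt(T) * asin(2.0 * x - 1.0)

def f_inv(y):
    # back-transform: x = (sin(y / sqrt(T)) + 1) / 2
    return (sin(y / sqrt(T)) + 1.0) / 2.0

x = 18 / 45      # e.g. 18 hits in 45 at-bats
y = f(x)         # approximately N(f(theta), 1), suitable for Gaussian shrinkage
print(f_inv(y))  # the round trip recovers x
```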
Gaussian-Gaussian model: Predicting
baseball scores
Gaussian-Gaussian model: Predicting baseball scores
The shrinkage approach achieves roughly three times lower MSE than the MLE.