Chonbuk National University · nlp.jbnu.ac.kr/PGM2019/slides_jbnu/BStat.pdf · 2019-03-20
Bayesian Statistics
Seung-Hoon Na
Chonbuk National University
Bayesian statistics
• Using the posterior distribution to summarize everything we know about a set of unknown variables
• Summarizing posterior distributions
– MAP estimation
– Credible intervals
MAP estimation
• MAP estimate: the posterior mode
– The most popular choice among point estimates of an unknown quantity
– Reduces to an optimization problem, for which efficient algorithms often exist
– Interpreted in non-Bayesian terms, the log prior acts as a regularizer
MAP estimation: Drawbacks
• No measure of uncertainty
– But, in many applications it is important to know how much one can trust a given estimate
• Overfitting
– If we don’t model the uncertainty in our parameters, then our predictive distribution will be overconfident
MAP estimation: Drawbacks
• The mode is an untypical point
– Choosing the mode as a summary of a posterior distribution is often a very poor choice, since the mode is usually quite untypical of the distribution, unlike the mean or median
In both cases, the mean provides a better summary of the given distribution than the mode.
MAP estimation: Drawbacks
• Not invariant to reparameterization
– To see this, reparameterize 𝑥 with 𝑦 = 𝑓(𝑥)
– MAP estimate for 𝒙: 𝑥̂ = argmax_𝑥 𝑝𝑥(𝑥)
– MAP estimate for 𝒚: the transformed density is 𝑝𝑦(𝑦) = 𝑝𝑥(𝑓⁻¹(𝑦)) |d𝑓⁻¹(𝑦)/d𝑦| (the Jacobian term)
After reparameterization, the MAP estimate differs from the one before: argmax_𝑦 𝑝𝑦(𝑦) ≠ 𝑓(𝑥̂)
MAP estimation: Drawbacks
• Not invariant to reparameterization
The mode of the transformed distribution is not equal to the transform of the original mode
MAP estimation: Drawbacks
• Not invariant to reparameterization: an example in the context of MAP estimation
– The Bernoulli distribution
– Prior:
– Parameterization 1:
– Parameterization 2:
The MAP estimate depends on the parameterization
Credible Intervals
• In addition to point estimates, a measure of confidence is often required
• 𝟏𝟎𝟎(𝟏 − 𝜶)% credible interval
– One of the standard measures of confidence in some (scalar) quantity 𝜃
• Central interval
– The specific credible interval with 𝛼/2 mass in each tail
In addition to point estimates, a measure of confidence is needed; the Bayesian credible interval is a point-based way of modeling uncertainty.
Central interval: the two tails outside the interval each carry 𝛼/2 of the probability mass (the central credible interval).
Credible Intervals
• Highest posterior density (HPD) regions
– A problem with central intervals
• There might be points outside the CI which have higher probability density.
– Definition of HPD (given 𝛼)
HPD region is sometimes called a highest density interval or HDI
Central interval vs. HPD region
(a) Central interval and (b) HPD region for a Beta(3,9) posterior.
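Both summaries can be approximated from posterior samples, e.g. for the Beta(3, 9) posterior of the figure (a minimal sketch; the sorting-based HPD search and all variable names are mine, not the textbook's code):

```python
import random

random.seed(1)
# Draw from the Beta(3, 9) posterior and sort once.
samples = sorted(random.betavariate(3, 9) for _ in range(100_000))
n = len(samples)
alpha = 0.05

# Central 95% interval: alpha/2 probability mass in each tail.
central = (samples[int(n * alpha / 2)], samples[int(n * (1 - alpha / 2)) - 1])

# 95% HPD interval: the narrowest interval containing (1 - alpha) of the mass.
k = int(n * (1 - alpha))
i = min(range(n - k), key=lambda j: samples[j + k] - samples[j])
hpd = (samples[i], samples[i + k])

print(central, hpd)
```

Because the HPD interval is by construction the shortest interval with the required mass, it is never wider than the central interval, which is the point of the figure.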
Central interval vs. HPD region
The HDI may not even be a connected region
Inference for a difference in proportions
• Two sellers offering an item for the same price
• Seller 1: 90 positive, 10 negative; Seller 2: 2 positive, 0 negative
• Who should you buy from?
(Figure: prior and posterior distributions)
Inference for a difference in proportions
Approximating 𝑝(𝛿 > 0|𝐷) by Monte Carlo, where 𝛿 = 𝜃1 − 𝜃2, gives 0.718 (textbook code):
– Sample from 𝑝(𝜃1|𝐷) and 𝑝(𝜃2|𝐷)
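The Monte Carlo approximation can be sketched with the standard library alone (a minimal sketch, not the textbook's code; the uniform Beta(1, 1) priors and the sample count are my choices):

```python
import random

random.seed(0)

# Uniform Beta(1, 1) priors, so the posteriors are Beta(1 + pos, 1 + neg):
# Seller 1: 90 positive, 10 negative -> Beta(91, 11)
# Seller 2:  2 positive,  0 negative -> Beta(3, 1)
S = 100_000
wins = 0
for _ in range(S):
    theta1 = random.betavariate(91, 11)  # draw from p(theta1|D)
    theta2 = random.betavariate(3, 1)    # draw from p(theta2|D)
    if theta1 > theta2:                  # event delta = theta1 - theta2 > 0
        wins += 1

p_delta_pos = wins / S
print(p_delta_pos)  # should come out close to the textbook value 0.718
```

Note that seller 2's posterior mean (0.75) is lower than seller 1's (about 0.89) and far more spread out, which is why seller 1 wins despite seller 2's perfect record.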
Bayesian model selection
• Model selection problem
– How should we choose the best one, among a set of models of different complexity?
• Cross validation: requires fitting each model K times
• Bayesian model selection
– Marginal likelihood (or integrated likelihood, or evidence)
When 𝑝(𝑚) is uniform, Bayesian model selection reduces to choosing the model with the highest marginal likelihood.
Unlike CV, Bayesian model selection does not require fitting each model K times.
Bayesian Occam’s razor
• MLE or MAP estimate
– Overfitting problem: models with more parameters will achieve higher likelihood
• But, when maximizing the marginal likelihood instead of the likelihood
– models with more parameters do not necessarily have higher marginal likelihood
– So, it can handle overfitting: the Bayesian Occam’s razor effect
Bayesian Occam’s razor effect: choosing the marginal likelihood (the likelihood integrated over the parameters) rather than the likelihood as the criterion naturally prevents overfitting.
How does the Bayesian Occam’s razor effect arise?
• 1) Marginal likelihood is like a leave-one-out CV estimate of the likelihood
• 2) Conservation of probability mass
– Complex models, which can predict many things, must spread their probability mass thinly, and hence will not obtain as large a probability for any given data set as simpler models
If the model is too complex, it overfits the early examples and predicts the remaining examples poorly.
• If the model is too complex, it can predict many diverse datasets, so its probability mass is spread thinly and any given dataset receives a low probability (thin along the y-axis).
• A simple model, in contrast, can predict only a limited set of datasets, so the probability mass on a particular dataset is relatively high.
The probabilities over all datasets sum to 1.
How does the Bayesian Occam’s razor effect arise?
• Conservation of probability mass
(Figure labels: complex model, simple model, the right model; the complex model's mass is visibly spread out.)
Bayesian Occam’s razor
• N = 5; green: true function, dashed: prediction
(panels: 𝑑 = 1 and 𝑑 = 2)
Bayesian Occam’s razor
• N = 5; green: true function, dashed: prediction
(panel: 𝑑 = 3)
Not enough data to justify a complex model, so the MAP model is 𝑑 = 1.
Bayesian Occam’s razor
• N = 30; green: true function, dashed: prediction
(panels: 𝑑 = 1 and 𝑑 = 2)
Bayesian Occam’s razor
• N = 30; green: true function, dashed: prediction
(panel: 𝑑 = 3)
𝑑 = 2: this is the right model
Bayesian Occam’s razor: Ridge regression
• Ridge regression
– Suppose we fit a degree-12 polynomial to 𝑁 = 21 data points
– MLE using least squares: the coefficient magnitudes become too large
– MAP: put a zero-mean Gaussian prior on the weights → ridge regression
The 𝜆𝐼 term adds a "ridge" to the main diagonal.
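In the one-feature, no-intercept case the ridge solution (𝐗ᵀ𝐗 + 𝜆𝐈)⁻¹𝐗ᵀ𝐲 collapses to a scalar, which makes the effect of 𝜆 easy to see (a sketch, not the textbook's code; the data values are made up):

```python
def ridge_1d(xs, ys, lam):
    # w = (sum x_i * y_i) / (sum x_i^2 + lambda): lambda is the "ridge" term
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w_ols = ridge_1d(xs, ys, lam=0.0)    # ordinary least squares (MLE)
w_map = ridge_1d(xs, ys, lam=10.0)   # shrunk toward zero by the Gaussian prior
print(w_ols, w_map)
```

Larger 𝜆 (a tighter zero-mean prior) always shrinks the weight toward zero, which is exactly the regularization view of the MAP estimate.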
Bayesian Occam’s razor: Ridge regression (MAP)
(Figure: a wiggly curve with ln 𝜆 = −20.135, i.e. 𝜆 = 1.8 × 10⁻⁹, which is almost the least-squares (MLE) fit, and a smooth curve with ln 𝜆 = −8.571, i.e. 𝜆 = 1.895 × 10⁻⁴.)
Bayesian Occam’s razor: Ridge regression
Bayesian Occam’s razor
𝑝(𝐷|𝜆) vs. log 𝜆 in polynomial ridge regression (degree = 14; 𝑁 = 21)
Model selection based on the marginal likelihood behaves similarly to cross validation.
Empirical Bayes
Bayesian model selection:
Empirical Bayes
• Instead of evaluating the evidence at a finite grid of values, use numerical optimization: 𝜆̂ = argmax_𝜆 𝑝(𝐷|𝜆)
This is called Empirical Bayes or Type II maximum likelihood.
Marginal Likelihood: 𝑝(𝐷|𝑚)
• For parameter inference in a fixed model 𝑚
– 𝑝(𝜃|𝐷,𝑚) = 𝑝(𝜃|𝑚) 𝑝(𝐷|𝜃,𝑚) / 𝑝(𝐷|𝑚)
• 𝑝(𝐷|𝑚) can be ignored as a normalization constant
• But, for comparing models, we need 𝑝(𝐷|𝑚)
– In general, computing 𝑝(𝐷|𝑚) is hard
– In the case that we have a conjugate prior, the computation can be easy
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Prior: 𝑝(𝜃) = 𝑞(𝜃)/𝑍₀, with unnormalized prior 𝑞(𝜃) and normalizer 𝑍₀
• Likelihood: 𝑝(𝐷|𝜃) = 𝑞(𝐷|𝜃)/𝑍ℓ
• Posterior: 𝑝(𝜃|𝐷) = 𝑞(𝐷|𝜃)𝑞(𝜃) / (𝑍₀𝑍ℓ 𝑝(𝐷|𝑚)) = 𝑞(𝜃|𝐷)/𝑍𝑁, where 𝑞(𝜃|𝐷) = 𝑞(𝐷|𝜃)𝑞(𝜃) is the unnormalized posterior
So 𝑝(𝐷|𝑚) = 𝑍𝑁 / (𝑍₀𝑍ℓ) is based on normalization constants
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Beta-binomial model
– Posterior: 𝑝(𝜃|𝐷) = 𝐵𝑒𝑡𝑎(𝜃|𝑎 + 𝑁1, 𝑏 + 𝑁0)
– Normalization constant (𝑍𝑁) of 𝑝(𝜃|𝐷): 𝐵(𝑎 + 𝑁1, 𝑏 + 𝑁0)
Beta function
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Beta-binomial model
The normalization constant (𝑍𝑁) of 𝑝 𝜃 𝐷 : 𝐵 𝑎 + 𝑁1, 𝑏 + 𝑁0
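In log space the beta-binomial marginal likelihood 𝑝(𝐷|𝑚) = 𝐵(𝑎 + 𝑁1, 𝑏 + 𝑁0)/𝐵(𝑎, 𝑏) is just two `lgamma` identities (a sketch; the helper names are mine):

```python
from math import lgamma, exp

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(a, b, N1, N0):
    # p(D|m) = B(a + N1, b + N0) / B(a, b) for the beta-binomial model
    return log_beta(a + N1, b + N0) - log_beta(a, b)

# Sanity check: uniform prior, one head and one tail:
# p(D) = integral of theta * (1 - theta) d theta = 1/6
print(exp(log_marginal_likelihood(1, 1, 1, 1)))
```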
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Dirichlet-multinoulli model
Marginal Likelihood: 𝑝(𝐷|𝑚)
• Gaussian-Gaussian-Wishart model
BIC approximation to
log marginal likelihood
• Bayesian information criterion (BIC): BIC ≜ log 𝑝(𝐷|𝜃̂) − (dof/2) log 𝑁
• A penalized log likelihood
• BIC in linear regression
– Likelihood: log 𝑝(𝐷|𝜃̂) = −(𝑁/2) log(2𝜋𝜎̂²) − 𝑁/2
– BIC: −(𝑁/2) log 𝜎̂² − (dof/2) log 𝑁 (dropping constants)
dof = the number of degrees of freedom in the model
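The BIC score log 𝑝(𝐷|𝜃̂) − (dof/2) log 𝑁 fits in a one-line helper (a sketch; the log-likelihood values below are hypothetical, chosen only to show the penalty at work):

```python
from math import log

def bic_score(loglik, dof, N):
    # BIC = log p(D | theta_hat) - (dof / 2) * log N; higher is better
    return loglik - 0.5 * dof * log(N)

# Two hypothetical fits to N = 21 points: a richer model must improve the
# log likelihood by more than (extra dof / 2) * log N to win under BIC.
simple = bic_score(loglik=-40.0, dof=2, N=21)
complex_ = bic_score(loglik=-39.5, dof=13, N=21)
print(simple > complex_)
```

Here the 11 extra degrees of freedom cost about 16.7 nats of penalty but buy only 0.5 nats of likelihood, so BIC prefers the simple model.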
BIC approximation to log marginal likelihood
• BIC cost
• BIC cost in linear regression
• Or minimum description length (MDL) principle
– The score for a model in terms of how well it fits the data, minus how complex the model is to define
• Akaike information criterion
– Derived from a frequentist framework
• Cannot be interpreted as an approximation to the marginal likelihood
AIC's penalty is smaller than BIC's, so AIC may select more complex models than BIC does.
Model selection: Effect of the prior
• Marginal likelihood involves model averaging, so the prior plays an important role
• E.g.) Model selection for linear regression
– Prior: 𝑝(𝒘) = 𝑁(𝒘|𝟎, 𝛼⁻¹𝐈)
• Large 𝛼 → simple model; small 𝛼 → complex model
• Hierarchical Bayesian: when the prior is unknown
– Put a prior on the prior
– Marginal likelihood
Requires integrating out both 𝑤 and 𝛼 → computationally hard
Model selection: Empirical Bayes
• Hierarchical Bayesian: when the prior is unknown
– Approximation: optimize 𝛼 directly, 𝛼̂ = argmax_𝛼 𝑝(𝐷|𝛼)
Empirical Bayes (EB)
Computationally easier
Model selection: Bayes Factor
• Two models we are considering
– 𝑀0: the null hypothesis
– 𝑀1: the alternative hypothesis
• Bayes factor: the ratio of marginal likelihoods, 𝐵𝐹(0,1) = 𝑝(𝐷|𝑀0)/𝑝(𝐷|𝑀1)
– Convert the Bayes factor to a posterior over models
• When 𝑝(𝑀1) = 𝑝(𝑀0) = 0.5: 𝑝(𝑀0|𝐷) = 𝐵𝐹(0,1)/(1 + 𝐵𝐹(0,1))
Model selection: Bayes Factor
• Jeffreys’ scale of evidence for interpreting Bayes factors
Bayes Factor: An Example
• Testing if a coin is fair
–𝑀0: a fair coin with 𝜃 = 0.5
–𝑀1: a biased coin where 𝜃 ∈ [0, 1]
• Marginal likelihood
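For this example 𝑝(𝐷|𝑀0) = (1/2)^𝑁 and 𝑝(𝐷|𝑀1) = 𝐵(𝛼1 + 𝑁1, 𝛼0 + 𝑁0)/𝐵(𝛼1, 𝛼0), so the Bayes factor can be computed directly (a sketch; the function names are mine):

```python
from math import lgamma, exp, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bf01(N1, N0, a1=1.0, a0=1.0):
    # log Bayes factor BF(0,1) = log p(D|M0) - log p(D|M1)
    log_m0 = (N1 + N0) * log(0.5)                            # fair coin
    log_m1 = log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)   # biased coin
    return log_m0 - log_m1

# N = 5 tosses: 2 heads favours the fair coin (BF > 1),
# 0 heads favours the biased coin (BF < 1).
print(exp(log_bf01(2, 3)), exp(log_bf01(0, 5)))
```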
Bayes Factor: An Example
• 𝑁 = 5, 𝛼0 = 𝛼1 = 1; when the number of heads is 2 or 3, 𝑀0 is selected
(figure label: log10 𝑝(𝐷|𝑀0))
Bayes Factor: An Example
• BIC approximation
Uninformative Priors
• If we don’t have strong beliefs about what 𝜃 should be, it is common to use an uninformative or non-informative prior, and to “let the data speak for itself”.
• Haldane prior: Beta(0, 0), i.e. 𝑝(𝜃) ∝ 𝜃⁻¹(1 − 𝜃)⁻¹
• This is an improper prior (it doesn’t integrate to 1), but the posterior is proper
• Jeffreys priors
– If 𝑝(𝜙) is non-informative, then any reparameterization of the prior, such as 𝜃 = ℎ(𝜙) for some function ℎ, should also be non-informative.
A prior that remains non-informative under any reparameterization.
Jeffreys priors
• Fisher information: 𝐼(𝜃) ≜ −𝔼[ d² log 𝑝(𝑋|𝜃) / d𝜃² ]
– a measure of curvature of the expected negative log likelihood and hence a measure of stability of the MLE
Jeffreys priors: Derivation
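The derivation on this slide can be reconstructed from the standard change-of-variables argument (a sketch of the usual textbook steps):

```latex
% Jeffreys prior: p(\theta) \propto \sqrt{I(\theta)}.
% Claim: this choice is invariant to reparameterization.
% Let \phi be another parameterization with \theta = h(\phi). By the chain rule,
I(\phi) \;=\; I(\theta)\left(\frac{d\theta}{d\phi}\right)^{2}
\quad\Longrightarrow\quad
\sqrt{I(\phi)} \;=\; \sqrt{I(\theta)}\,\left|\frac{d\theta}{d\phi}\right|
% This is exactly the change-of-variables rule for densities,
%   p_\phi(\phi) = p_\theta(\theta)\,|d\theta/d\phi|,
% so \sqrt{I(\cdot)} transforms as a density does: the Jeffreys prior
% gives the same prior no matter which parameterization we start from.
```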
Jeffreys priors
• Bernoulli: 𝑋 ∼ Ber(𝜃)
– Score function: 𝑠(𝜃) = d log 𝑝(𝑥|𝜃)/d𝜃 = 𝑥/𝜃 − (1 − 𝑥)/(1 − 𝜃)
– Observed information: −d𝑠(𝜃)/d𝜃 = 𝑥/𝜃² + (1 − 𝑥)/(1 − 𝜃)²
– Fisher information: 𝐼(𝜃) = 𝔼[𝑥/𝜃² + (1 − 𝑥)/(1 − 𝜃)²] = 1/(𝜃(1 − 𝜃))
Jeffreys priors
• Bernoulli: 𝑝(𝜃) ∝ 𝐼(𝜃)^{1/2} = 𝜃^{−1/2}(1 − 𝜃)^{−1/2}, i.e. Beta(1/2, 1/2)
• Multinoulli: the Jeffreys prior is Dir(1/2, …, 1/2)
Mixtures of conjugate priors
• A mixture of conjugate priors is also conjugate
– The posterior can also be written as a mixture of conjugate distributions
– The posterior mixing weights are proportional to the prior weights times each component's marginal likelihood
Mixtures of conjugate priors: An Example
• Prior:
• Posterior:
– Data: 𝑁1 = 20, 𝑁0 = 10
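The update of the mixing weights can be sketched as follows; the mixture prior 0.5 Beta(20, 20) + 0.5 Beta(30, 10) is an illustrative assumption (the slide's own prior is not shown here), used with the data 𝑁1 = 20, 𝑁0 = 10:

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_mix_weights(prior, N1, N0):
    # prior: list of (weight, a, b). Each component's marginal likelihood is
    # B(a + N1, b + N0) / B(a, b); new weights are renormalized products.
    unnorm = [w * exp(log_beta(a + N1, b + N0) - log_beta(a, b))
              for (w, a, b) in prior]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

weights = posterior_mix_weights([(0.5, 20, 20), (0.5, 30, 10)], N1=20, N0=10)
print(weights)
```

With an empirical rate of 20/30, the component centered at 0.75 explains the data better than the one centered at 0.5, so its weight grows.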
Hierarchical Bayes
• Bayesian model
– Posterior: 𝑝(𝜽|𝐷)
– Prior: 𝑝(𝜽|𝜼)
– 𝜼 are the hyper-parameters
– How should we set 𝜼?
• Hierarchical Bayesian model
– Put a prior on the prior
– Also called a multi-level model
Hierarchical Bayes: An Example
• Modeling related cancer rates
– 𝑁𝑖: the number of people in various cities
– 𝑥𝑖: the number of people who died of cancer in these cities
– Assumption: 𝑥𝑖 ∼ Bin(𝑁𝑖, 𝜃𝑖)
– We want to estimate the cancer rates 𝜃𝑖
– Approach 1) Estimating them all separately suffers from the sparse-data problem
– Approach 2) Parameter tying: assume all the 𝜃𝑖 are the same
Hierarchical Bayes:
Modeling related cancer rates
• Approach 3)
– Assume that the 𝜃𝑖 are similar, but that there may be city-specific variations → hierarchical Bayes
– That is, 𝜃𝑖 ∼ Beta(𝑎, 𝑏)
– Infer the hyperparameters 𝜼 = (𝑎, 𝑏) from the data, e.g. based on Empirical Bayes
Hierarchical Bayes:
Modeling related cancer rates
Hierarchical Bayes:
Modeling related cancer rates
Hierarchical Bayes:
Modeling related cancer rates
• 95% credible interval
Empirical Bayes
• How to infer hyperparameters?
• Suppose we have a two-level model
– Need to marginalize out 𝜽 to obtain 𝑝(𝜼|𝐷), which is usually computationally hard
• Empirical Bayes: evidence procedure
– Approximate the posterior on the hyper-parameters with a point estimate
Empirical Bayes
• EB provides a computationally cheap approximation to inference in a multi-level hierarchical Bayesian model, just as we viewed MAP estimation as an approximation to inference in the one-level model 𝜽 → 𝐷.
Frequentist
Fully Bayesian
Empirical Bayes:
Beta-binomial model
– Maximizing this marginal likelihood wrt 𝑎, 𝑏:
https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf
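Minka's note linked above derives fixed-point updates for (𝑎, 𝑏); a cruder closed-form alternative, shown here only as an illustration, is moment matching on the observed group rates (the function and variable names are mine):

```python
def beta_moment_match(rates):
    # Fit Beta(a, b) by matching the sample mean and variance of the rates.
    # Valid only when the sample variance v satisfies v < m * (1 - m).
    n = len(rates)
    m = sum(rates) / n
    v = sum((r - m) ** 2 for r in rates) / n
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Hypothetical per-group rates (e.g. per-city cancer rates).
a, b = beta_moment_match([0.2, 0.25, 0.3, 0.35])
print(a, b)
```

By construction the fitted prior mean 𝑎/(𝑎 + 𝑏) equals the sample mean of the rates; the more tightly clustered the rates, the larger 𝑎 + 𝑏 (a more confident prior).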
Empirical Bayes:
Gaussian-Gaussian model
• Suppose we have data from multiple related groups
– 𝑥𝑖𝑗: the test score for student 𝑖 in school 𝑗, for 𝑗 = 1:𝐷, 𝑖 = 1:𝑁𝑗
– Want to estimate the mean score for each school, 𝜃𝑗
– Use a hierarchical Bayes model to handle the data-poor problem
• Joint distribution
𝜃𝑗 ∼ 𝑁(𝜇, 𝜏²), with 𝜼 = (𝜇, 𝜏)
Empirical Bayes: Gaussian-Gaussian model
• Joint distribution, given the estimate 𝜼̂ = (𝜇̂, 𝜏̂)
• The posterior:
The likelihood function is simplified using sufficient statistics.
C.f.) Simplify likelihood function
using sufficient statistics
• Because the MLE estimator and the Bayes estimator are functions of the sufficient statistics
http://people.missouristate.edu/songfengzheng/teaching/mth541/lecture%20notes/sufficient.pdf
http://www.stat.cmu.edu/~larry/=stat705/Lecture6.pdf
Empirical Bayes: Gaussian-Gaussian model
• The posterior (using Gaussian-related formulas):
– 𝐵𝑗 controls the degree of shrinkage towards the overall mean 𝜇̂, where 𝐵𝑗 = 𝜎𝑗²/(𝜎𝑗² + 𝜏̂²)
• The posterior mean, when 𝜎𝑗 = 𝜎: 𝜃̂𝑗 = 𝐵𝑗 𝜇̂ + (1 − 𝐵𝑗) 𝑥̄𝑗
Shrinkage: each group mean is shrunk toward the global mean.
Large sample size → small 𝜎𝑗² → small 𝐵𝑗 → less shrinkage.
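The shrinkage formula 𝜃̂𝑗 = 𝐵𝑗𝜇̂ + (1 − 𝐵𝑗)𝑥̄𝑗 with 𝐵𝑗 = 𝜎𝑗²/(𝜎𝑗² + 𝜏̂²) can be sketched as follows (all numbers are illustrative, not from the text):

```python
def shrunken_mean(xbar_j, sigma2_j, mu_hat, tau2_hat):
    # B_j = sigma_j^2 / (sigma_j^2 + tau^2): weight on the global mean mu_hat
    B_j = sigma2_j / (sigma2_j + tau2_hat)
    return B_j * mu_hat + (1.0 - B_j) * xbar_j

# A data-poor group (large sigma_j^2) is pulled strongly toward mu_hat;
# a data-rich group (small sigma_j^2) keeps its own sample mean.
poor = shrunken_mean(xbar_j=80.0, sigma2_j=100.0, mu_hat=50.0, tau2_hat=25.0)
rich = shrunken_mean(xbar_j=80.0, sigma2_j=1.0, mu_hat=50.0, tau2_hat=25.0)
print(poor, rich)
```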
c.f.) Apply linear Gaussian systems
• Likelihood: 𝑝(𝐷|𝜃𝑗, 𝜇̂, 𝜏̂) = 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²)
• Prior: 𝑝(𝜃𝑗|𝜇̂, 𝜏̂) = 𝑁(𝜃𝑗|𝜇̂, 𝜏̂²)
• Posterior: 𝑝(𝜃𝑗|𝐷, 𝜇̂, 𝜏̂) = 𝑁(𝜃𝑗|𝜇̃𝑗, 𝜎̃𝑗²), where
1/𝜎̃𝑗² = 1/𝜏̂² + 1/𝜎𝑗², i.e. 𝜎̃𝑗² = 𝜏̂²𝜎𝑗²/(𝜏̂² + 𝜎𝑗²)
𝜇̃𝑗 = 𝜎̃𝑗² ( 𝑥̄𝑗/𝜎𝑗² + 𝜇̂/𝜏̂² )
Empirical Bayes: Gaussian-Gaussian model
• Estimating 𝜼 = (𝜇, 𝜏) (case: 𝜎𝑗² = 𝜎²)
• Marginal likelihood: 𝑝(𝐷|𝜇, 𝜏) = ∏𝑗 𝑁(𝑥̄𝑗|𝜇, 𝜏² + 𝜎²)
• Estimating 𝜇 using the MLE for a Gaussian: 𝜇̂ = (1/𝐷) Σ𝑗 𝑥̄𝑗, the overall mean
c.f.) Apply linear Gaussian systems
• Likelihood: 𝑝(𝑥̄𝑗|𝜃𝑗, 𝜇̂, 𝜏̂) = 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²)
• Prior: 𝑝(𝜃𝑗|𝜇, 𝜏) = 𝑁(𝜃𝑗|𝜇, 𝜏²)
• Marginal: 𝑝(𝑥̄𝑗|𝜇̂, 𝜏̂) = ∫ 𝑁(𝜃𝑗|𝜇, 𝜏²) 𝑁(𝑥̄𝑗|𝜃𝑗, 𝜎𝑗²) d𝜃𝑗 = 𝑁(𝑥̄𝑗|𝜇, 𝜏² + 𝜎𝑗²)
Empirical Bayes: Gaussian-Gaussian model
• Estimating the variance 𝜏² by moment matching: the empirical variance 𝑠² of the 𝑥̄𝑗 estimates 𝜏² + 𝜎², giving 𝜏̂² = max(0, 𝑠² − 𝜎²)
• Shrinkage factor: 𝐵̂ = 𝜎²/(𝜎² + 𝜏̂²)
Empirical Bayes: Gaussian-
Gaussian model
• Estimating 𝜼 = (𝜇, 𝜏) (case: the 𝜎𝑗² are different)
– No closed-form solution
– Instead, we need to use the EM algorithm or approximate inference
When 𝜎𝑗² differs across groups, the Empirical Bayes method for the Gaussian-Gaussian model has no closed-form solution, and the EM algorithm or approximate inference is needed.
Gaussian-Gaussian model: An Example
• Predicting baseball scores
– 𝑏𝑗: the number of hits for 𝐷 = 18 players, during 𝑇 = 45 games
– Assume 𝑏𝑗 ∼ Bin(𝑇, 𝜃𝑗)
– Want to estimate the 𝜃𝑗
– The MLE: 𝜃̂𝑗 = 𝑥𝑗 = 𝑏𝑗/𝑇
– How about an EB approach?
Gaussian-Gaussian model:
Predicting baseball scores
• EB approach (Gaussian shrinkage approach)
– To apply the Gaussian shrinkage approach, we require the likelihood to be Gaussian:
– But, 𝑣𝑎𝑟[𝑥𝑗] is not constant (cannot be used as 𝜎2)
𝑥𝑗 = 𝑏𝑗/𝑇
Gaussian-Gaussian model: Predicting baseball scores
• EB approach
– Apply a variance stabilizing transform to 𝑥𝑗 to better match the Gaussian assumption
• Variance stabilizing transformation: a function 𝑌 = 𝑓(𝑋) such that var[𝑌] is independent of 𝐸[𝑋] = 𝜇
Gaussian-Gaussian model: Predicting baseball scores
• EB approach
– Apply the variance stabilizing transform: 𝑦𝑗 = 𝑓(𝑥𝑗) = √𝑇 arcsin(2𝑥𝑗 − 1)
– Then, we have approximately: 𝑦𝑗 ∼ 𝑁(𝑓(𝜃𝑗), 1)
– Estimate 𝜇̂𝑗 using Gaussian shrinkage
– Then, transform back to get: 𝜃̂𝑗 = (sin(𝜇̂𝑗/√𝑇) + 1)/2
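A sketch of the transform and its inverse, with 𝑇 = 45 as in the example (the Gaussian shrinkage step itself is omitted here):

```python
from math import sqrt, asin, sin

T = 45  # at-bats per player

def f(x):
    # variance stabilizing transform: y = sqrt(T) * arcsin(2x - 1)
    return sqrt(T) * asin(2.0 * x - 1.0)

def f_inv(y):
    # back-transform: x = (sin(y / sqrt(T)) + 1) / 2
    return (sin(y / sqrt(T)) + 1.0) / 2.0

x = 18 / 45      # e.g. 18 hits in 45 at-bats
y = f(x)         # approximately N(f(theta), 1), suitable for Gaussian shrinkage
print(f_inv(y))  # the round trip recovers x
```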
Gaussian-Gaussian model: Predicting
baseball scores
Gaussian-Gaussian model: Predicting baseball scores
The shrinkage approach achieves roughly three times lower MSE than the MLE.