Kevin Murphy UBC CS & Stats 9 February 2005
Why I am a Bayesian (and why you should become one, too)
or: Classical statistics considered harmful
Kevin Murphy, UBC CS & Stats
9 February 2005
Where does the title come from?
• “Why I am not a Bayesian”, Glymour, 1981
• “Why Glymour is a Bayesian”, Rosenkrantz, 1983
• “Why isn’t everyone a Bayesian?”, Efron, 1986
• “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001
Many other such philosophical essays…
Frequentist vs Bayesian
Frequentist:
• Prob = objective relative frequencies
• Params are fixed unknown constants, so cannot write e.g. P(θ = 0.5|D)
• Estimators should be good when averaged across many trials
Bayesian:
• Prob = degrees of belief (uncertainty)
• Can write P(anything|D)
• Estimators should be good for the available data
Source: “All of statistics”, Larry Wasserman
Outline
• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
The following slides are from Tenenbaum & Griffiths
Hypotheses in coin flipping
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
Describe processes by which D could be generated
D = HHTHT
(statistical models)
Hypotheses in coin flipping
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
Describe processes by which D could be generated
D = HHTHT
(generative models)
Representing generative models
• Graphical model notation – Pearl (1988), Jordan (1998)
• Variables are nodes, edges indicate dependency
• Directed edges show causal process of data generation
D = HHTHT (observed nodes d1 … d5)
[Diagram: unconnected nodes d1 d2 d3 d4 – Fair coin, P(H) = 0.5]
[Diagram: chain d1 → d2 → d3 → d4 – Markov model]
Models with latent structure
• Not all nodes in a graphical model need to be observed
• Some variables reflect latent structure, used in generating D but unobserved
D = HHTHT
[Diagram: chain of latent states s1 → s2 → s3 → s4 emitting d1 … d4 – Hidden Markov model]
[Diagram: parameter node p shared by d1 … d4 – P(H) = p]
How do we select the “best” model?
Bayes’ rule
P(h|d) = P(d|h) P(h) / Σ_{h′∈H} P(d|h′) P(h′)
(posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses H)
The origin of Bayes’ rule
• A simple consequence of using probability to represent degrees of belief
• For any two random variables:
P(A & B) = P(B) P(A|B)
P(A & B) = P(A) P(B|A)
⇒ P(B) P(A|B) = P(A) P(B|A)
⇒ P(A|B) = P(A) P(B|A) / P(B)
Why represent degrees of belief with probabilities?
• Good statistics – consistency, and worst-case error bounds
• Cox axioms – necessary to cohere with common sense
• “Dutch Book” + survival of the fittest – if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord
• Provides a theory of incremental learning – a common currency for combining prior knowledge and the lessons of experience
Hypotheses in Bayesian inference
• Hypotheses H refer to processes that could have generated the data D
• Bayesian inference provides a distribution over these hypotheses, given D
• P(D|H) is the probability of D being generated by the process identified by H
• Hypotheses H are mutually exclusive: only one process could have generated D
Coin flipping
• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p
Comparing two simple hypotheses
• Contrast simple hypotheses:
– H1: “fair coin”, P(H) = 0.5
– H2: “always heads”, P(H) = 1.0
• Bayes’ rule:
P(H|D) = P(D|H) P(H) / P(D)
• With two hypotheses, use the odds form
Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
posterior odds = Bayes factor (likelihood ratio) × prior odds
Data = HHTHT
D = HHTHT; H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5    P(H1) = 999/1000
P(D|H2) = 0        P(H2) = 1/1000
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] = infinity
Data = HHHHH
D = HHHHH; H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5    P(H1) = 999/1000
P(D|H2) = 1        P(H2) = 1/1000
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] ≈ 30
Data = HHHHHHHHHH
D = HHHHHHHHHH; H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10    P(H1) = 999/1000
P(D|H2) = 1         P(H2) = 1/1000
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] ≈ 1
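The three worked examples above can be checked with a short script (a sketch; the function name and the use of exact fractions are my own, not from the slides):

```python
from fractions import Fraction

def posterior_odds(n_flips, n_heads):
    """Posterior odds P(H1|D) / P(H2|D) for H1 = 'fair coin' vs
    H2 = 'always heads', with priors 999/1000 and 1/1000."""
    p_d_h1 = Fraction(1, 2) ** n_flips        # fair coin assigns (1/2)^N to any sequence
    p_d_h2 = 1 if n_heads == n_flips else 0   # all-heads coin: 1 if no tails, else 0
    prior_odds = Fraction(999, 1)             # (999/1000) / (1/1000)
    if p_d_h2 == 0:
        return float('inf')                   # a single tail rules out H2
    return float(p_d_h1 / p_d_h2 * prior_odds)

print(posterior_odds(5, 4))    # HHTHT: infinity
print(posterior_odds(5, 5))    # HHHHH: 999/32, about 31
print(posterior_odds(10, 10))  # ten heads: 999/1024, about 0.98
```

Note how the evidence for the all-heads coin overwhelms the strong prior only once the run of heads is long enough.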
Comparing simple and complex hypotheses
• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?
[Diagram: unconnected nodes d1 d2 d3 d4 – Fair coin, P(H) = 0.5]
vs.
[Diagram: parameter node p shared by d1 … d4 – P(H) = p]
Comparing simple and complex hypotheses
• P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
[Plot: probability of the data as a function of p]
Comparing simple and complex hypotheses
[Plot: probability of the data as a function of p; for HHHHH the best value is p = 1.0]
Comparing simple and complex hypotheses
[Plot: probability of the data as a function of p; for HHTHT the best value is p = 0.6]
Comparing simple and complex hypotheses
• P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
• How can we deal with this?
– frequentist: hypothesis testing
– information theorist: minimum description length
– Bayesian: just use probability theory!
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing P(D|H1) is easy:
P(D|H1) = 1/2^N
Compute P(D|H2) by averaging over p:
P(D|H2) = ∫ P(D|p) P(p) dp
Likelihood and prior
• Likelihood:
P(D|p) = p^NH (1−p)^NT
– NH: number of heads
– NT: number of tails
• Prior:
P(p) ∝ p^(FH−1) (1−p)^(FT−1) ?
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a set of previous experiences– strategy often used with neural networks
• e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
![Page 32: Kevin Murphy UBC CS & Stats 9 February 2005](https://reader035.fdocuments.us/reader035/viewer/2022062517/5681379c550346895d9f3f93/html5/thumbnails/32.jpg)
Likelihood and prior
• Likelihood:
P(D|p) = p^NH (1−p)^NT
– NH: number of heads
– NT: number of tails
• Prior:
P(p) ∝ p^(FH−1) (1−p)^(FT−1)
– FH: fictitious observations of heads
– FT: fictitious observations of tails
This is a Beta(FH, FT) distribution (FH, FT are pseudo-counts).
Posterior ∝ prior × likelihood
• Prior: P(p) ∝ p^(FH−1) (1−p)^(FT−1)
• Likelihood: P(D|p) = p^NH (1−p)^NT
• Posterior: P(p|D) ∝ p^(NH+FH−1) (1−p)^(NT+FT−1) – same form as the prior!
Conjugate priors
• Exist for many standard distributions – formula for exponential family conjugacy
• Define prior in terms of fictitious observations
• Beta is conjugate to Bernoulli (coin-flipping)
[Plots: Beta densities with FH = FT = 1 (uniform), FH = FT = 3, and FH = FT = 1000]
Normalizing constants
• Prior: P(p) = p^(FH−1) (1−p)^(FT−1) / B(FH, FT)
• Normalizing constant for the Beta distribution: B(FH, FT) = Γ(FH) Γ(FT) / Γ(FH + FT)
• Posterior: P(p|D) = p^(NH+FH−1) (1−p)^(NT+FT−1) / B(NH + FH, NT + FT)
• Hence the marginal likelihood is P(D) = B(NH + FH, NT + FT) / B(FH, FT)
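The marginal likelihood formula can be evaluated numerically; a sketch using log-gamma for stability (the function names are mine):

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(nh, nt, fh=1, ft=1):
    """P(D) = B(NH + FH, NT + FT) / B(FH, FT) for a Beta(FH, FT) prior."""
    return exp(log_beta_fn(nh + fh, nt + ft) - log_beta_fn(fh, ft))

# D = HHTHT under the uniform prior (FH = FT = 1):
print(marginal_likelihood(3, 2))  # integral of p^3 (1-p)^2 dp = 1/60, about 0.0167
print(0.5 ** 5)                   # fair-coin likelihood 1/32, about 0.0313
```

For HHTHT the simple hypothesis assigns more probability to the data than the averaged complex one, illustrating the Occam penalty discussed below.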
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing P(D|H1) (the likelihood for H1) is easy:
P(D|H1) = 1/2^N
Compute P(D|H2) by averaging over p, giving the marginal likelihood (“evidence”) for H2:
P(D|H2) = ∫ P(D|p) P(p) dp = B(NH + FH, NT + FT) / B(FH, FT)
Marginal likelihood for H1 and H2
[Plot: probability of the data as a function of p, for H1 and H2]
The marginal likelihood is an average over all values of p
Sensitivity to hyper-parameters
Bayesian model selection
• Simple and complex hypotheses can be compared directly using Bayes’ rule
– requires summing over latent variables
• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”
• Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)
Outline
• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?
Example: Belgian euro-coins
• A Belgian euro spun N=250 times came up heads X=140.
• “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002)
Source: MacKay, exercise 3.15
Classical hypothesis testing
• Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
• For classical analysis, we don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5
• Need a decision rule that maps data D to accept/reject of H0
• Define a scalar measure of deviance d(D) from the null hypothesis, e.g. NH or χ²
P-values
• Define the p-value of a threshold τ as p(τ) = P(d(D) > τ | H0)
• Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0
• Usually choose the rejection region R (i.e. the threshold τ) so that the false rejection rate of H0 is below the significance level α = 0.05
• Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
P-value for euro coins
• N = 250 trials, X = 140 heads
• P-value is “less than 7%”
• If N = 250 and X = 141, pval = 0.0497, so we could reject the null hypothesis at the significance level of 5%
• This does not mean P(H0|D) = 0.07!
n = 250; Pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)
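The MATLAB line above can be reproduced with an exact binomial CDF; a sketch in Python using only the standard library (function names are mine):

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def two_sided_pvalue(n, x):
    # P(X >= x) + P(X <= n - x), as in the MATLAB line above
    return (1 - binom_cdf(x - 1, n)) + binom_cdf(n - x, n)

print(two_sided_pvalue(250, 140))  # about 0.066, the "less than 7%" figure
print(two_sided_pvalue(250, 141))  # about 0.0497, just under the 5% level
```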
Bayesian analysis of euro-coin
• Assume P(H0) = P(H1) = 0.5
• Assume P(p) ~ Beta(α, α) under H1
• Setting α = 1 yields a uniform (non-informative) prior
Bayesian analysis of euro-coin
• If α = 1, the Bayes factor is B = P(D|H1) / P(D|H0) = [∫ p^140 (1−p)^110 dp] / (1/2)^250 ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased)
• By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased coin hypothesis
• Other priors yield similar results
• Bayesian analysis contradicts classical analysis
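The Bayes factor for the euro-coin data follows from the marginal-likelihood formula derived earlier; a sketch (the function name is mine):

```python
from math import lgamma, log, exp

def bayes_factor(nh, nt, alpha=1.0):
    """B = P(D|H1) / P(D|H0), where H1 has p ~ Beta(alpha, alpha)
    and H0 fixes p = 0.5."""
    log_beta = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    # log marginal likelihood under H1: log[ B(nh+a, nt+a) / B(a, a) ]
    log_p_d_h1 = log_beta(nh + alpha, nt + alpha) - log_beta(alpha, alpha)
    # log likelihood under H0: (nh + nt) * log(1/2)
    log_p_d_h0 = (nh + nt) * log(0.5)
    return exp(log_p_d_h1 - log_p_d_h0)

print(bayes_factor(140, 110))  # about 0.48: the data slightly favour H0
```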
Outline
• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?
– Violates likelihood principle
– Violates stopping rule principle
– Violates common sense
The likelihood principle
• In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as the tail probability P(d(D) ≥ d(D_obs) | H0)
• This principle can be proved from two simpler principles, called conditionality and sufficiency
Frequentist statistics violates the likelihood principle
• “The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961
Another example
• Suppose X ~ N(μ, σ²) with σ = 1; we observe x = 3
• Compare H0: μ = 0 with H1: μ > 0
• P-value = P(X ≥ 3 | H0) ≈ 0.001, so reject H0
• Bayesian approach: update P(μ|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1
When are P-values valid?
• Suppose X ~ N(μ, σ²); we observe X = x
• One-sided hypothesis test: H0: μ ≤ μ0 vs H1: μ > μ0
• If P(μ) ∝ 1 (a flat prior), then P(μ|x) = N(x, σ²), so P(H0|x) = P(μ ≤ μ0 | x) = Φ((μ0 − x)/σ)
• The P-value, P(X ≥ x | μ0) = 1 − Φ((x − μ0)/σ), is the same in this case, since the Gaussian is symmetric in its arguments
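This agreement is easy to verify numerically; a sketch assuming μ0 = 0, σ = 1, and the observed x = 3 from the previous slide:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

x, mu0, sigma = 3.0, 0.0, 1.0

# Classical one-sided p-value: P(X >= x | mu = mu0)
p_value = 1 - Phi((x - mu0) / sigma)

# Bayesian: a flat prior gives posterior N(x, sigma^2), so
# P(H0 | x) = P(mu <= mu0 | x) = Phi((mu0 - x) / sigma)
posterior_h0 = Phi((mu0 - x) / sigma)

print(p_value, posterior_h0)  # both about 0.00135
```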
Stopping rule principle
• Inferences you make should depend only on the observed data, not on the reasons why the data were collected.
• If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
• Follows from likelihood principle.
Frequentist statistics violates stopping rule principle
• Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias (Ph > Pt)?
• Let X = 9 heads be the observed random variable and N = 12 trials be a fixed constant. Define H0: Ph = 0.5. Then, at the 5% level, there is no significant evidence of bias: P(X ≥ 9 | H0) = 299/4096 ≈ 0.073
Frequentist statistics violates stopping rule principle
• Suppose instead the data were generated by tossing coins until we got 3 tails.
• Now the number of tails is a fixed constant and N = 12 is a random variable: P(N = n) = C(n−1, 2) Pt³ Ph^(n−3), since the first n−1 trials contain 2 tails and the last trial is always a tail.
• Now there is significant evidence of bias: P(N ≥ 12 | H0) = 67/2048 ≈ 0.033
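Both tail probabilities for this sequence (9 heads, 3 tails) can be computed exactly; a sketch:

```python
from math import comb

n = 12  # total tosses in D = HHHTHHHHTHHT: 9 heads, 3 tails

# Binomial sampling (N fixed): one-sided p-value P(#heads >= 9 | Ph = 0.5)
p_binomial = sum(comb(n, k) for k in range(9, n + 1)) / 2**n

# Negative-binomial sampling (stop at the 3rd tail): p-value
# P(N >= 12 | Ph = 0.5) = P(at most 2 tails in the first 11 tosses)
p_negbinomial = sum(comb(n - 1, k) for k in range(3)) / 2**(n - 1)

print(p_binomial)     # 299/4096, about 0.073: not significant at the 5% level
print(p_negbinomial)  # 67/2048, about 0.033: significant at the 5% level
```

Same data, two stopping rules, opposite conclusions.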
Ignoring stopping criterion can mislead classical estimators
• Let Xi ~ Bernoulli(θ)
• Max. lik. estimator: θ̂ = (number of heads) / (number of tosses)
• With N fixed, the MLE is unbiased: E[θ̂] = θ
• Toss a coin; if heads, stop, else toss a second coin. P(H) = θ, P(TH) = (1−θ)θ, P(TT) = (1−θ)².
• Now the MLE is biased!
• There are many classical rules for assessing significance when complex stopping rules are used.
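The bias under this two-stage stopping rule can be checked by enumerating the three possible outcomes (a sketch with exact arithmetic):

```python
from fractions import Fraction

def expected_mle(theta):
    """E[theta_hat] under the rule: toss once; stop on heads,
    otherwise toss a second coin.  Outcomes: H, TH, TT."""
    outcomes = [
        (theta, Fraction(1)),                   # 'H':  1 head / 1 toss
        ((1 - theta) * theta, Fraction(1, 2)),  # 'TH': 1 head / 2 tosses
        ((1 - theta) ** 2, Fraction(0)),        # 'TT': 0 heads / 2 tosses
    ]
    return sum(p * est for p, est in outcomes)

print(expected_mle(Fraction(1, 2)))  # 5/8, not 1/2: the MLE is biased upward
```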
Confidence intervals
• An interval (θmin(D), θmax(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ)
• This does not mean P(θ ∈ CI | D) = 0.95!
Source: MacKay, sec. 37.3
Example
• Draw 2 integers x1, x2 independently from P(x|θ) = 1/2 if x = θ, 1/2 if x = θ + 1, and 0 otherwise
• If θ = 39, we would expect the pairs (39,39), (39,40), (40,39), (40,40), each with probability 1/4
Example
• If θ = 39, we would expect (39,39), (39,40), (40,39), (40,40), each with probability 1/4
• Define the confidence interval as CI(D) = (xmin, xmin), where xmin = min(x1, x2)
• e.g. (x1,x2) = (40,39) gives CI = (39,39)
• 75% of the time, this CI will contain the true θ
CIs violate common sense
• If θ = 39, we would expect (39,39), (39,40), (40,39), (40,40), each with probability 1/4
• If (x1,x2) = (39,39), then CI = (39,39) at level 75%. But clearly P(θ = 39|D) = P(θ = 38|D) = 0.5
• If (x1,x2) = (39,40), then CI = (39,39), but clearly P(θ = 39|D) = 1.0
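Both the 75% coverage and the common-sense violation can be seen by enumerating the four equally likely datasets (a sketch):

```python
from itertools import product

theta = 39  # each draw is theta or theta + 1, with probability 1/2 each

def ci(x1, x2):
    """The slide's 75% confidence interval: (x_min, x_min)."""
    m = min(x1, x2)
    return (m, m)

# Enumerate the four equally likely datasets and count coverage
covered = 0
for x1, x2 in product([theta, theta + 1], repeat=2):
    lo, hi = ci(x1, x2)
    covered += (lo <= theta <= hi)

print(covered, "of 4 datasets covered")  # 3 of 4: 75% coverage
# Yet given D = (39, 40) we know theta = 39 with certainty, while given
# D = (39, 39) theta is 38 or 39 with probability 1/2 each.
```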
What’s wrong with the classical approach?
• Violates likelihood principle
• Violates stopping rule principle
• Violates common sense
What’s right about the Bayesian approach?
• Simple and natural
• Optimal mechanism for reasoning under uncertainty
• Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false
• Supports interesting (human-like) kinds of learning
Bayesian humor
• “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”