Molecular Medicine, summer semester 2016
Statistics for Molecular Medicine: Statistical inference and statistical tests
Barbara Kollerits & Claudia Lamina, Medical University Innsbruck, Division of Genetic Epidemiology
The principles of statistical testing:
Formulating hypotheses & test statistics & p-values
Formulating Hypothesis & Statistical Tests
Steps in conducting a statistical test:
■ Quantify the scientific problem from a clinical / biological perspective
■ Formulate the model assumptions (distribution of the variable of interest)
■ Formulate the problem as a statistical testing problem:
null hypothesis versus alternative hypothesis
■ Define the "error" you are willing to tolerate
■ Calculate the appropriate test statistic
■ Decide for or against the null hypothesis
Formulating Hypothesis & Statistical Tests
Hypothesis Formulation:
■ Null hypothesis H0: the conservative hypothesis you want to reject
■ Alternative hypothesis H1: the hypothesis you want to prove
■ Examples:
Scientific hypothesis: A new therapy is assumed to prevent myocardial infarctions (MI) in risk patients better than the old therapy.
Statistical hypothesis: H0: π_new ≥ π_old versus H1: π_new < π_old
with π_new / π_old: the proportion of patients experiencing an MI during the study under the new / old therapy
Scientific hypothesis: Women and men achieve equally good scores in the EMS-AT test.
Statistical hypothesis: H0: μ_men = μ_women versus H1: μ_men ≠ μ_women
with μ_men / μ_women: the mean scores for men / women
Formulating Hypothesis & Statistical Tests
Possible decisions in statistical tests:
■ Type I and Type II errors cannot be minimized simultaneously
■ Statistical tests are constructed such that the probability of a Type I error is not larger than the significance level α (typically set to 0.01 or 0.05)
Example:
■ Test the new MI therapy on patients at a significance level of 5%.
■ In reality, H0 is true and there is no difference between the therapies.
■ If the study is repeated 100 times on 100 different samples, the statistical test rejects the null hypothesis in at most 5 of the 100 tests.

                 Decide for H0       Decide for H1
Reality H0:      correct decision    wrong decision: Type I error (α)
Reality H1:      wrong decision:     correct decision: Power (1−β)
                 Type II error (β)
Formulating Hypothesis & Statistical Tests
The One-sample test of the mean: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
Example:
From a former population-based study, you know that the mean non-fasting cholesterol level was 230.
Now you have finished the measurements in your new sample. Since you want to conduct a study on cholesterol levels that is comparable to the old study, you test whether the mean in the new study equals the mean in the old study:
One-sample test of the mean: H0: μ = 230 vs. H1: μ ≠ 230
Assuming normal distribution and assuming H0 is true:
Test statistic (standardization): T = (x̄ − μ0) / σ · √n ~ N(0,1) under H0
If the test statistic is "too extreme", it is not very likely that H0 is true.
Reject H0 if T is "too extreme".
Formulating Hypothesis & Statistical Tests
■ You cannot avoid a Type I error, but you can control it, since it can only occur if H0 is true. Since the test statistic follows a known distribution under H0, the probability of each value given H0 can be calculated.
The distribution of the test statistic given H0 is N(0,1):
[Figure: standard normal density with the acceptance region between the critical values −z_{1−α/2} and z_{1−α/2} (area 1−α) and a rejection region of area α/2 in each tail]
■ All values falling in the rejection region (< −z_{1−α/2} or > z_{1−α/2}) do not support the null hypothesis (significance level α).
Formulating Hypothesis & Statistical Tests
The One-sample test of the mean revisited: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
Assuming normal distribution and assuming H0 is true:
Test statistic: T = (x̄ − μ0) / σ · √n ~ N(0,1)
Test decision: |T| > z_{1−α/2}: Reject H0 → the test is "significant" at level α
Formulating Hypothesis & Statistical Tests
■ Test decision: |T| > z_{1−α/2}: Reject H0 → the test is "significant" at level α
Attention: Rejection of H0 is not a decision for H1, but a decision against H0, since no distribution can be specified if H1 is true.
■ The (1−α/2)-quantiles can be found in tables or calculated by computers, e.g. the 97.5% quantile of the standard normal distribution, used for a 5% two-sided significance test, is ~1.96
■ As for confidence intervals: if σ is not known and the sample size is large enough, estimate σ by S
■ So far, a statistical test gives you a test statistic. If the test statistic is more extreme than the critical value, decide against the null hypothesis; otherwise, decide for the null hypothesis
→ a simple yes/no decision rule
■ This decision rule does not tell you the certainty / strength of the decision
■ Assume two two-sided z-tests at α = 5%: one gives T = 2, the other T = 2.6
■ Both test statistics are > z_{1−α/2} (~1.96) → both are "significant"
■ However, T = 2.6 is "more extreme" than T = 2, given the truth of the null hypothesis!
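The tail areas behind these two test statistics can be checked directly; a small sketch using scipy's standard normal distribution (an illustration, not part of the slides):

```python
# Two-sided p-value = both tail areas beyond |T| under N(0,1).
from scipy.stats import norm

p_T2 = 2 * norm.sf(2.0)    # sf = 1 - cdf (upper tail); ~0.046, just under 0.05
p_T26 = 2 * norm.sf(2.6)   # ~0.009, far below 0.05
print(p_T2, p_T26)
```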
Formulating Hypothesis & Statistical Tests
■ The p-value: the probability to observe an effect that is as extreme as the observed effect, or even more extreme, under the assumption that the null hypothesis (= no association) is true
■ For a two-sided test, the areas under the curve in both tails of the distribution function have to be added:
For T = 2: ~0.025 in each tail → p-value ~ 0.05
For T = 2.6: ~0.005 in each tail → p-value ~ 0.01
■ The p-value is a measure of the evidence against the null hypothesis.
Example:
A one-sample z-test comparing the sample mean to μ0 (H0: μ = μ0; H1: μ ≠ μ0) results in a test statistic T = 2.6, which corresponds to a p-value of 0.01.
A popular interpretation, but wrong:
"The probability that the sample mean is different from μ0 is 1%"
The sample mean does not have a probability: it differs from μ0 or it does not!
Correct interpretation:
"A random sample is drawn 100 times from the population of interest. The population mean is μ0 (= null hypothesis). At most 1 of the 100 experiments results in a test statistic |T| ≥ 2.6"
The randomness lies in the sample!
If p < α, reject H0 → the test decision can be based on T or p
Formulating Hypothesis & Statistical Tests
The smaller the p-value, the more certain you can be that the result is not only due to chance:
p-value 0.01: only in 1 of 100 experiments would you get such a result just by chance
p-value 0.001: only in 1 of 1000 experiments would you get such a result just by chance → really seldom
Formulating Hypothesis & Statistical Tests
■ The one-sample test of the mean revisited 2: Gauß-test (also called z-test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (x̄ − μ0) / σ · √n ~ N(0,1) (standard normal distribution)
■ Test decision: Reject H0 if one of the following applies:
|T| > z_{1−α/2}
p < α
μ0 not in the (1−α) confidence interval
These three statements are equivalent, and this holds in a similar manner for all parametric tests
Formulating Hypothesis & Statistical Tests
■ Example:
■ Situation: Compare the sample mean of total cholesterol (x̄ = 237.1) with a specified mean of total cholesterol (μ0 = 230)
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (237.1 − 230) / (σ/√n) = 6.3
■ Test decision: Reject H0, since each of the following applies:
|T| > z_{1−α/2}: 6.3 > 1.96 → reject H0
p < α: 2.97e-10 < 0.05 → reject H0
μ0 not in the 95% confidence interval: 230 not in [234.89, 239.31] → reject H0
Formulating Hypothesis & Statistical Tests
→ The sample mean is significantly different from the specified mean
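The three equivalent decision criteria above can be sketched as one small function. Note that the slides do not give n and σ for the cholesterol example, so the values passed below are purely assumed for illustration:

```python
# One-sample z-test returning all three equivalent decision criteria.
# xbar and mu0 follow the cholesterol example; sigma and n are NOT given
# in the slides -- they are assumptions chosen only for illustration.
import math
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, alpha=0.05):
    se = sigma / math.sqrt(n)
    T = (xbar - mu0) / se                    # test statistic
    z = norm.ppf(1 - alpha / 2)              # critical value (~1.96)
    p = 2 * norm.sf(abs(T))                  # two-sided p-value
    ci = (xbar - z * se, xbar + z * se)      # (1-alpha) confidence interval
    # the three criteria always agree:
    return abs(T) > z, p < alpha, not (ci[0] <= mu0 <= ci[1])

print(z_test(237.1, 230, sigma=35.0, n=1000))
```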
Formulating Hypothesis & Statistical Tests: Exercise
The manufacturer of a laboratory measurement device claims that one measurement takes 5 minutes on average. You want to test that statement with 10 measurements. (Mean = 5.3, standard deviation = 0.3)
What hypothesis do you want to test?
H0: versus H1:
Determine the test statistic: T =
What is the critical value (significance level 5%)?
Quantiles of the standard normal distribution:
p:    0.95   0.975   0.995
z_p:  1.64   1.96    2.58
Test decision:
The most common statistical tests:
Testing measures of location:
Compare 2 groups:
• Quantitative / continuous outcome, normal distribution: t-test
• Quantitative / continuous outcome, any other distribution: Wilcoxon test / Mann-Whitney U-test
• Qualitative / categorical outcome, expected frequency in each cell of the crosstable "high": Chi-square test
• Qualitative / categorical outcome, expected frequency "low": Fisher's exact test
Compare >2 groups:
• Quantitative / continuous outcome, normal distribution: Analysis of Variance (ANOVA)
• Quantitative / continuous outcome, any other distribution: Kruskal-Wallis test
• Qualitative / categorical outcome, expected frequency "high": Chi-square test
• Qualitative / categorical outcome, expected frequency "low": Fisher's exact test
The most common statistical tests
Testing frequencies in a crosstable:
Are the rows and columns independent of each other?
Testing measures of location:
Does the mean/median differ between groups?
Testing measures of location
The one-sample t-test (the "standard test" for mean comparisons):
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample, σ is not known
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (x̄ − μ0) / S · √n ~ t(n−1)
■ Test decision for a two-sided test: |T| > t_{n−1,1−α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n−1,1−α}: Reject H0
■ The Gauß-test is the variant of this test you would use if σ were known
■ For large n, the t-distribution approaches the standard normal distribution, so the t-test approximates the Gauß-test
■ Quantiles of the t-distribution are needed to decide for or against the null hypothesis
■ But in practice: statistical programs output p-values
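As a sketch of the formula above, a one-sample t-test computed once by hand and once with scipy's `ttest_1samp`, to show they agree. The sample values are invented for illustration, not taken from the slides:

```python
# One-sample t-test: slide formula vs. scipy, on a hypothetical sample.
import math
from scipy import stats

x = [232, 248, 241, 225, 239, 251, 228, 244]   # invented values
mu0 = 230

n = len(x)
xbar = sum(x) / n
s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
T = (xbar - mu0) / s * math.sqrt(n)            # formula from the slide

t_scipy, p_scipy = stats.ttest_1samp(x, mu0)   # two-sided by default
print(T, t_scipy, p_scipy)
```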
Testing measures of location
Comparing the normal and t-distribution; quantiles of the t-distribution: df (= n−1) | 0.95 | 0.975
1 6.314 12.706
2 2.920 4.303
3 2.353 3.182
4 2.132 2.776
5 2.015 2.571
6 1.943 2.447
7 1.895 2.365
8 1.860 2.306
9 1.833 2.262
10 1.812 2.228
11 1.796 2.201
12 1.782 2.179
13 1.771 2.160
14 1.761 2.145
15 1.753 2.131
16 1.746 2.120
17 1.740 2.110
18 1.734 2.101
19 1.729 2.093
20 1.725 2.086
21 1.721 2.080
22 1.717 2.074
23 1.714 2.069
24 1.711 2.064
25 1.708 2.060
26 1.706 2.056
27 1.703 2.052
28 1.701 2.048
29 1.699 2.045
30 1.697 2.042
120 1.658 1.980
∞ 1.645 1.960
0.95 quantile: critical value for a one-sided test at significance level 5%; e.g. n = 16: t0.95(15) = 1.753
0.975 quantile: critical value for a two-sided test at significance level 5%; e.g. n = 823: t0.975(822) ~ 1.96
The last row (∞) contains the quantiles of the standard normal distribution.
Testing measures of location
df (= n‐1) 0.95 0.975 0.99 0.995
1 6.314 12.706 31.821 63.657
2 2.920 4.303 6.965 9.925
3 2.353 3.182 4.541 5.841
4 2.132 2.776 3.747 4.604
5 2.015 2.571 3.365 4.032
6 1.943 2.447 3.143 3.707
7 1.895 2.365 2.998 3.499
8 1.860 2.306 2.896 3.355
9 1.833 2.262 2.821 3.250
10 1.812 2.228 2.764 3.169
11 1.796 2.201 2.718 3.106
12 1.782 2.179 2.681 3.055
13 1.771 2.160 2.650 3.012
14 1.761 2.145 2.624 2.977
15 1.753 2.131 2.602 2.947
16 1.746 2.120 2.583 2.921
17 1.740 2.110 2.567 2.898
18 1.734 2.101 2.552 2.878
19 1.729 2.093 2.539 2.861
20 1.725 2.086 2.528 2.845
21 1.721 2.080 2.518 2.831
22 1.717 2.074 2.508 2.819
23 1.714 2.069 2.500 2.807
24 1.711 2.064 2.492 2.797
25 1.708 2.060 2.485 2.787
26 1.706 2.056 2.479 2.779
27 1.703 2.052 2.473 2.771
28 1.701 2.048 2.467 2.763
29 1.699 2.045 2.462 2.756
30 1.697 2.042 2.457 2.750
120 1.658 1.980 2.358 2.617
∞ 1.645 1.960 2.326 2.576
Determine the following critical values:
One-sided or two-sided test | Significance level | n | Critical value
two | 0.05 | >> 120 |
one | 0.05 | >> 120 |
two | 0.05 | 15 |
one | 0.05 | 28 |
two | 0.01 | 10 |
one | 0.01 | 20 |
two | 0.01 | >> 120 |
Testing measures of location
The two-sample t-test for unpaired samples:
■ Situation: Compare the means (μ1, μ2) of two unpaired samples
■ Assumption: normal distribution of both samples, σ is not known
Here: equal σ assumed, but there are methods (Welch t-test) for unequal σ
■ Hypothesis: H0: μ1 = μ2 versus H1: μ1 ≠ μ2
■ Test statistic: T = (x̄1 − x̄2) / (s_p · √(1/n1 + 1/n2)) ~ t(n1+n2−2)
with the pooled variance s_p² = ((n1−1)·S1² + (n2−1)·S2²) / (n1+n2−2)
■ Test decision for a two-sided test: |T| > t_{n1+n2−2,1−α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n1+n2−2,1−α}: Reject H0
Testing measures of location
Example: A biotech company claims that their new biomarker XY can distinguish diseased from non-diseased. A pilot study on 10 diseased and 10 healthy persons (one measurement missing) gives the following results:

Lab parameter XY in Diseased | Lab parameter XY in Healthy
8.70  | 3.36
11.28 | 18.35
13.24 | 5.19
8.37  | 8.35
12.16 | 13.10
11.04 | 15.65
10.47 | 4.29
11.16 | 11.36
4.28  | 9.09
19.54 | (missing)
x̄:  11.024 | 9.86
S²: 15.227 | 27.038
T = (11.024 − 9.86) / √(20.78512 · (1/10 + 1/9)) = 0.556 ~ t(17)
with the pooled variance s_p² = (9 · 15.227 + 8 · 27.038) / 17 = 20.78512
Critical value of a t(17)-distribution at α = 5% (two-sided test) = 2.11
0.556 < 2.11 → XY does not differ significantly between diseased and non-diseased (p = 0.29)
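The hand calculation can be reproduced with scipy's `ttest_ind` (equal variances, as assumed above). Note that the healthy group has only 9 values because of the missing measurement, and that scipy reports the two-sided p-value, while the slide quotes the one-sided one:

```python
# Two-sample t-test on the biomarker data, pooled (equal) variances.
from scipy import stats

diseased = [8.70, 11.28, 13.24, 8.37, 12.16, 11.04, 10.47, 11.16, 4.28, 19.54]
healthy = [3.36, 18.35, 5.19, 8.35, 13.10, 15.65, 4.29, 11.36, 9.09]

t, p = stats.ttest_ind(diseased, healthy, equal_var=True)
print(t, p)   # t ~ 0.556 on 17 degrees of freedom
```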
Testing measures of location: Exercise
Patient ID Physically active Cholesterol level
1 0 195
2 0 159
3 0 166
4 0 244
5 0 169
6 0 168
7 0 222
8 0 238
9 0 216
10 0 180
11 1 146
12 1 145
13 1 147
14 1 208
15 1 182
16 1 145
17 1 187
18 1 206
19 1 218
20 1 161
Comparing the cholesterol levels between physically active (= 1) and inactive (= 0) patients:
Mean(inactive) =        Mean(active) =
What hypothesis do you want to test ?
H0: versus H1:
The following information is given:
S(inactive) = 31.94; S(active) = 29.27
Pooled S² = 938.48
Test-statistic T =
Critical value (two-sided test, α = 5%):
Test decision ?
Testing measures of location
The two-sample t-test for paired samples:
■ Situation: Compare the means of two paired samples, e.g. compare the means of a variable in the same patients before and after a treatment
■ Assumption: normal distribution of both samples, σ is not known
■ Hypothesis: H0: μ_before = μ_after versus H1: μ_before ≠ μ_after
Calculate d = x_before − x_after for each patient
→ new hypothesis: H0: the mean of the differences is 0 (μ_d = 0)
versus H1: the mean of the differences is ≠ 0 (μ_d ≠ 0)
→ same situation as the one-sample t-test
T-tests can also be used approximately for any distribution that is not too skewed.
Testing measures of location
Example: A wannabe health guru claims that he has invented the perfect weight-loss method. A pilot study on 10 obese individuals gives the following results:

ID | kg at baseline | kg after 6 months | Difference
1  | 108 | 90  | 18
2  | 97  | 97  | 0
3  | 88  | 91  | -3
4  | 120 | 111 | 9
5  | 98  | 94  | 4
6  | 95  | 91  | 4
7  | 87  | 82  | 5
8  | 85  | 77  | 8
9  | 99  | 103 | -4
10 | 134 | 127 | 7
x̄  | 101.1 | 96.3 | 4.8
S² | 242.767 | 209.122 | 41.07 (s = 6.41)
Paired t-test = one-sample t-test on the differences:
T = √10 · (4.8 − 0) / 6.41 = 2.368
t0.975(9) = 2.262 (two-sided) → H0 can be rejected (p = 0.042)
Since you want to prove that kg(before) > kg(after):
→ a one-sided test is more appropriate (more power)
t0.95(9) = 1.833, p = 0.021 (= 0.5 · two-sided p)
An unpaired t-test would have yielded a non-significant result: p = 0.24 (one-sided test)
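The same paired analysis can be run with scipy's `ttest_rel`; the one-sided alternative `'greater'` matches the claim kg(before) > kg(after):

```python
# Paired t-test on the weight-loss data, one-sided.
from scipy import stats

before = [108, 97, 88, 120, 98, 95, 87, 85, 99, 134]
after = [90, 97, 91, 111, 94, 91, 82, 77, 103, 127]

t, p = stats.ttest_rel(before, after, alternative='greater')
print(t, p)   # t ~ 2.368, one-sided p ~ 0.021
```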
Testing measures of location
Analysis of Variance (ANOVA)
■ Situation: Compare the means of k samples (k > 2)
■ Assumption: normal distribution in each group, σ1 = … = σk
■ Hypothesis: H0: μ1 = μ2 = … = μk versus H1: μi ≠ μj for at least one pair (i ≠ j): at least two of the means differ
How to construct the teststatistic ?
Idea from the two-sample t-test: relate the difference / variability between the groups to the variability within the groups
Testing measures of location
Analysis of Variance (ANOVA): Variance partitioning
■ There are k groups (j = 1,…,k)
■ X1j, X2j, …, Xnjj are the observed values of the variable of interest for the i = 1,…,nj patients in the jth group
■ x̄ is the overall mean of the variable, x̄j are the means within the groups
SST = Sum of Squares Total: SST = Σj Σi (xij − x̄)²
SSE = Sum of Squares Error (or SSR = Sum of Squares Residual) ~ variance "within": SSE = Σj Σi (xij − x̄j)²
SSM = Sum of Squares Model ~ variance "between": SSM = Σj nj · (x̄j − x̄)²
It holds: SST = SSM + SSE
[Figure: all observations xij plotted by group (Group 1, 2, 3) with group means x̄1, x̄2, x̄3; deviations from the group means make up SSE/SSR, deviations of the group means from the overall mean make up SSM]
Testing measures of location
Analysis of Variance (ANOVA): from sums of squares to mean squares:
Mean of Squares: MSM = SSM / (k−1) and MSE = SSE / (n−k)
(with n the total number of observations and k the number of groups)
Testing measures of location
■ Test statistic: F = MSM / MSE ~ F(k−1, n−k) under H0
■ Test decision: F > F_{k−1,n−k,1−α}: Reject H0
■ Since F is always positive, there are no one-sided tests
■ If H0 is rejected, you can tell that there are at least two groups which differ from each other significantly. You can't tell which groups differ!
→ perform pairwise t-tests after the overall F-test (see closed test procedure)
Example:
There are 3 different medications (Med1, Med2, Med3) which are intended to increase the HDL-cholesterol levels in patients
1. Perform an ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, you know that there is a difference
3. Test Med1 against Med2, Med1 against Med3, and Med2 against Med3
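The two-step procedure above can be sketched with scipy; the HDL values below are invented purely for illustration:

```python
# Overall F-test first, pairwise t-tests only if it is significant.
# The three groups of HDL values are hypothetical.
from itertools import combinations
from scipy import stats

groups = {
    "Med1": [42, 45, 41, 44, 43, 46],
    "Med2": [48, 51, 50, 47, 52, 49],
    "Med3": [55, 58, 54, 57, 56, 59],
}

F, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {F:.2f}, p = {p:.2g}")

if p < 0.05:   # step 2: pairwise comparisons only after a significant F-test
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        t, pt = stats.ttest_ind(a, b)
        print(f"{name_a} vs {name_b}: p = {pt:.2g}")
```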
Testing measures of location
All tests so far assumed a normally distributed variable → parametric tests
If the assumption does not hold → nonparametric tests
• Nonparametric counterpart of the t-test: Wilcoxon test / Wilcoxon rank-sum test / Mann-Whitney U-test (different names for the same test): testing if two independent samples come from the same distribution → testing equality of medians
• Nonparametric counterpart of ANOVA: Kruskal-Wallis test: testing equality of population medians between groups
Characteristics of nonparametric tests:
• Robust against outliers and skewed distributions
• However: parametric tests should be preferred over nonparametric tests, if appropriate, since they have higher power.
Testing measures of location
Two-sample test on equality of distributions: Wilcoxon test
■ Situation: Compare location measures of two unpaired samples X and Y if the assumptions of a t-test do not hold
■ Assumption: the form of the continuous distributions of X and Y is the same → a test on equality of distributions = a test on equality of the medians
■ Hypothesis: H0: x_med = y_med versus H1: x_med ≠ y_med
■ The test is based on ranks
■ What are ranks?
Example:
Original values: X = {1, 2, 4, 6, 9, 9, 11}, x_med = 6; Y = {1, 3, 4, 5, 6, 7, 8}, y_med = 5
■ Sort both samples into one list: 1, 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 11
■ Ranking (ties get the average rank): 1.5, 1.5, 3, 4, 5.5, 5.5, 7, 8.5, 8.5, 10, 11, 12.5, 12.5, 14
■ A test statistic is calculated from these ranks by the computer!
Efficiency of the Wilcoxon test (Kruskal-Wallis test) compared to the t-test (ANOVA):
■ To achieve the same power as a t-test/ANOVA, a higher sample size is needed!
■ The parametric tests are in general more powerful (if their assumptions are fulfilled)
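The ranking step and the test itself can be reproduced with scipy: `rankdata` assigns average ranks to ties, and `mannwhitneyu` computes the rank-based test:

```python
# Ranks of the combined sample from the example, then the test itself.
from scipy import stats

X = [1, 2, 4, 6, 9, 9, 11]
Y = [1, 3, 4, 5, 6, 7, 8]

ranks = stats.rankdata(sorted(X + Y))   # ties share the average rank
print(list(ranks))

u, p = stats.mannwhitneyu(X, Y, alternative='two-sided')
print(u, p)
```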
Testing measures of location
The most common statistical tests:
Testing frequencies
Testing frequencies
One-sample test on frequencies: the χ² goodness-of-fit test for categorical traits: compare the frequencies of a categorical variable to specified proportions
■ Example: The quality manager of a gummi bear factory wants to test the claim that the 5 flavours occur in equal proportions in each package.
→ proportions under H0: π1 = π2 = π3 = π4 = π5 = 1/5
■ Hypothesis: H0: P(X = i) = πi for all i versus H1: P(X = i) ≠ πi for at least one i, with i = 1,…,k (number of categories)
■ Idea: Compare the observed numbers (Oi) in each category with the expected numbers (Ei = n·πi) in each category
■ Assumption: n·πi ≥ 1 for all i & n·πi ≥ 5 for at least 80% of the categories
Test statistic: χ² = Σi (Oi − Ei)² / Ei ~ χ²(k−1) under H0
Test decision: χ² > χ²_{k−1,1−α}: Reject H0
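A sketch of the goodness-of-fit test with scipy's `chisquare`; the gummi bear counts below are invented for illustration:

```python
# Goodness-of-fit: hypothetical counts of the 5 flavours in a package
# of 100 gummi bears, tested against equal proportions 1/5.
from scipy import stats

observed = [25, 18, 22, 15, 20]            # invented counts, n = 100
expected = [sum(observed) / 5] * 5          # E_i = n * pi_i = 20 per flavour

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)   # chi2 = 2.9 on k-1 = 4 degrees of freedom
```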
Testing frequencies
Excursion: What are degrees of freedom (df)?
Definition:
The number of values that can vary freely.
Example:
The χ² goodness-of-fit test with k categories has (k−1) degrees of freedom.
Why?
There is the constraint that all proportions sum up to 1: π1 + π2 + … + πk = 1
→ π1, π2, …, πk−1 can be chosen freely; πk is then fixed
Example with real data: suppose you have three groups, then π1 = 0.20, π2 = 0.50, and π3 = 1 − 0.70 (= 0.30)
Testing frequencies
Two-sample test on frequencies: the χ² test of independence:
■ Situation: Compare the frequencies between two or more groups
Or: test if two categorical variables X (i = 1,…,k) and Y (j = 1,…,m) depend on each other
■ A possible scenario: Compare the numbers of smokers, ex-smokers, and never-smokers between men and women
→ All these situations can be grouped into contingency tables:

        Y: 1 … m    | Row sum
X: 1    h11 … h1m   | h1.
   2    h21 … h2m   | h2.
   :    :      :    | :
   k    hk1 … hkm   | hk.
Column sum: h.1 … h.m | n
Testing frequencies
Two-sample test on frequencies: the χ² test of independence:
■ Hypothesis: H0: X and Y are independent of each other
H1: X and Y are dependent on each other (are associated)
■ Assumption: expected frequencies ≥ 1 for all cells & expected frequencies ≥ 5 for at least 80% of the cells
→ none of the cells should have a very rare expectancy
→ if the assumption is not fulfilled → use Fisher's exact test
Idea to construct the test statistic: compare the observed numbers in each cell with the numbers expected if H0, and therefore independence of the two factor variables, is assumed
Testing frequencies
Table of observed numbers:

        Y: 1 … m    |
X: 1    h11 … h1m   | h1.
   2    h21 … h2m   | h2.
   :    :      :    | :
   k    hk1 … hkm   | hk.
        h.1 … h.m   | n

h1. … hk. and h.1 … h.m are the marginal sums

Example (observed numbers):
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457
Testing frequencies
Table of expected numbers:

        Y: 1 … m                |
X: 1    h1.·h.1/n … h1.·h.m/n   | h1.
   2    h2.·h.1/n … h2.·h.m/n   | h2.
   :    :              :        | :
   k    hk.·h.1/n … hk.·h.m/n   | hk.
        h.1       … h.m         | n

Expected number in each cell: (row sum · column sum) / total sum

Example (observed numbers):
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457

Expected number in the upper left cell: 722 · 261 / 1457 = 129.336
Testing frequencies
Compare the observed numbers Oij with the expected numbers Eij in every cell:
Test statistic: χ² = Σi Σj (Oij − Eij)² / Eij ~ χ²((k−1)·(m−1)) under H0
Test decision: χ² > χ²_{(k−1)(m−1),1−α}: Reject H0
Testing frequencies
Example:
Observed:
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457

Expected:
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 129.336        | 224.479   | 368.185      | 722
Women        | 131.664        | 228.521   | 374.815      | 735
Column total | 261            | 453       | 743          | 1457

χ² = (144 − 129.336)²/129.336 + (310 − 224.479)²/224.479 + (268 − 368.185)²/368.185
   + (117 − 131.664)²/131.664 + (143 − 228.521)²/228.521 + (475 − 374.815)²/374.815 = 121.9218
χ²_{0.95}((2−1)·(3−1)) = χ²_{0.95}(2) = 5.99; 121.9218 >> 5.99 → the test is significant (p = 3.3e-27)
→ the null hypothesis that gender and smoking status are independent can be rejected
→ the test itself does not tell you, however, whether men smoke more than women etc.
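The smoking example can be recomputed with scipy's `chi2_contingency`, which derives the expected numbers from the margins itself:

```python
# Chi-square test of independence on the observed smoking counts.
from scipy.stats import chi2_contingency

observed = [[144, 310, 268],    # men
            [117, 143, 475]]    # women

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof)        # ~121.92 on (2-1)*(3-1) = 2 degrees of freedom
print(expected[0][0])   # ~129.336, as computed by hand above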
Testing frequencies
Two-sample test on frequencies if the expected numbers in some cells are rare: Fisher's exact test
■ Situation: the assumptions of the χ² test do not hold (expected number in each cell ≥ 1 and expected number ≥ 5 in 80% of the cells)
■ Idea of the test: enumerate all tables that are possible with the observed marginal sums. Then count all the tables that are "more extreme", i.e. deviate at least as strongly from the null hypothesis as the observed table.
p = number of at-least-as-extreme tables / number of all tables
→ computer-intensive & not really solvable "by hand"
■ Test decision: p < α: Reject H0
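scipy's `fisher_exact` implements this for the 2×2 case (larger tables need other tools); the counts below are invented to mimic a table with small expected numbers:

```python
# Fisher's exact test on a small hypothetical 2x2 table.
from scipy.stats import fisher_exact

table = [[8, 2],
         [1, 5]]

odds_ratio, p = fisher_exact(table)
print(odds_ratio, p)
```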
Testing frequencies: Exercise
In an experiment with mice, you want to test the development of insulin resistance under three different diets (D1, D2, D3). After following up the mice for several months, you observe the following:

                      | D1 | D2 | D3 | Σ
insulin resistant     | 6  | 1  | 3  | 10
not insulin resistant | 4  | 9  | 7  | 20
Σ                     | 10 | 10 | 10 | 30

Calculate the expected numbers in each cell under the assumption of independence between diet and insulin resistance:

                      | D1 | D2 | D3 | Σ
insulin resistant     |    |    |    | 10
not insulin resistant |    |    |    | 20
Σ                     | 10 | 10 | 10 | 30

Which diet results in more insulin-resistant mice than expected, which in less?
Which statistical test should be performed?
The multiple testing problem
The situation:
■ Consider a dataset with 100 independent parameters which do not play a role in the etiology of the disease of interest (which you don't know, of course)
■ 100 statistical tests are performed with a significance level of α = 0.05
■ The tests are constructed such that at most 5 of the 100 tests reject the null hypothesis although it is true
→ You expect 5 tests to be significant just by chance!!!
The multiple testing problem
■ The probability to get at least one Type I error (one or more false discoveries) increases with the number of tests performed.
■ Family-wise error rate (the error rate for the complete family of tests performed): α* = 1 − (1−α)^k, with α being the comparison-wise error rate
→ The significance level has to be modified for multiple testing situations

k   | α* (α = 0.05)
1   | 0.05
5   | 0.226
10  | 0.401
100 | 0.994
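The formula α* = 1 − (1 − α)^k reproduces the table directly:

```python
# Family-wise error rate for k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:3d}: alpha* = {fwer:.3f}")
```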
The multiple testing problem
The Bonferroni correction method:
■ Control the comparison-wise error rate: reject H0 if p < α
■ Control the family-wise error rate (for k tests): reject H0 if p < α/k
This is equivalent to p_Bonferroni = p·k < α → advantage: simple
■ Problem: the Bonferroni correction increases the probability of a Type II error
→ the power to detect a true association is reduced → disadvantage: too conservative
■ Can be used if tests are dependent, but is then even more conservative

k   | α/k (α = 0.05)
1   | 0.05
5   | 0.01 (= 0.05/5)
10  | 0.005
100 | 0.0005
The multiple testing problem
Some modifications of the Bonferroni method: the Bonferroni-Holm method
■ You have a list of k p-values → sort it so that the minimal p-value (p(1)) comes first: p(1), p(2), p(3), …, p(k)
■ If p(1) ≥ α/k → stop and accept all null hypotheses
If p(1) < α/k → reject H0(1) and continue with the reduced list: p(2), p(3), …, p(k)
■ If p(2) ≥ α/(k−1) → stop and accept all remaining null hypotheses
If p(2) < α/(k−1) → reject H0(2) and continue with the reduced list: p(3), p(4), …, p(k)
■ Continue until the hypothesis with the smallest remaining p-value cannot be rejected → stop and accept all hypotheses that have not been rejected before
■ Less conservative than Bonferroni
■ Can also be used if tests are dependent
■ There are multiple other methods, e.g. the Hochberg method etc.
In R: function p.adjust()
The multiple testing problem
Example: Assume you have this vector of p-values (10 independent tests at α = 0.05):
Unadjusted p: 0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1
→ without any correction, all but the last (0.1) would be called "significant"
Bonferroni: α/k = 0.05/10 = 0.005 → all p-values smaller than 0.005 are significant: 0.0001, 0.001, 0.002
Bonferroni-Holm: first p-value (0.0001): 0.0001 < 0.05/10 = 0.005 → reject, continue!
Next p-value (0.001): 0.001 < 0.05/9 ≈ 0.0056 → reject, continue!
… until the p-value 0.007: 0.007 < 0.05/6 ≈ 0.0083 → reject, then stop, because the next p-value (0.01) is ≥ 0.05/5 = 0.01.
→ Holm rejects 0.0001, 0.001, 0.002, 0.005, and 0.007
All these methods are still too conservative if the tests are not independent (e.g. highly correlated parameters)!
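Both corrections from this example can be sketched in a few lines of plain Python (this mirrors what R's p.adjust() does for these two methods):

```python
# Bonferroni vs. Bonferroni-Holm on the slide's p-value vector.
alpha = 0.05
pvals = [0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1]
k = len(pvals)

# Bonferroni: one fixed threshold alpha / k for every test
bonf = [p for p in pvals if p < alpha / k]

# Holm: step down through the sorted list, relaxing the threshold each step
holm = []
for i, p in enumerate(sorted(pvals)):
    if p >= alpha / (k - i):
        break                 # stop; keep all remaining hypotheses
    holm.append(p)

print(len(bonf), len(holm))   # 3 rejections vs. 5 rejections
```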
Multiple Testing: Exercise
You conduct an experiment with knock-out mice and compare the mean values of 5 different parameters between wildtype (WT) and knockout (KO) mice.
In the following list, the p-values of all t-tests are given.

Parameter | Uncorrected p-value | Significant at α = 5%? | Significant after Bonferroni correction? | Significant after Bonferroni-Holm correction?
1 | 0.04   |  |  |
2 | 0.001  |  |  |
3 | 0.012  |  |  |
4 | 0.015  |  |  |
5 | 0.1    |  |  |

Significance level after Bonferroni correction? α =
Conclusion:
The multiple testing problem
The closed testing principle
■ For cases where a set of hypotheses can be tested simultaneously ("overall" or "omnibus" tests)
■ Suppose there are 3 hypotheses H1, H2, H3 to be tested:
Then H1 can be rejected at level α if all intersections including H1 can be rejected, i.e. H1 ∩ H2 ∩ H3, H1 ∩ H2, H1 ∩ H3, and H1 itself can all be rejected at level α
■ Example: There are 4 different medications (Med1, Med2, Med3, Med4) which are intended to lower the HDL-cholesterol levels in patients
1. Perform an ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, perform an ANOVA on all possible intersections
3. If an F-test on an intersection was significant, perform pairwise t-tests for all medications included in that significant F-test, etc.
■ Attention: If there are more than 3 groups, it is not sufficient to run the ANOVA first and then pairwise t-tests if the ANOVA was significant → also test all intersections in between
The multiple testing problem
■ Example: Testing the following hypotheses:
1. Step, ANOVA at significance level α: μ1 = μ2 = μ3 = μ4
2. Step, ANOVAs at significance level α for all intersections: μ1 = μ2 = μ3; μ1 = μ2 = μ4; μ1 = μ3 = μ4; μ2 = μ3 = μ4
3. Step, pairwise t-tests at significance level α: μ1 = μ2; μ1 = μ3; μ2 = μ3; μ1 = μ4; μ2 = μ4; μ3 = μ4
→ each step is only carried out for hypotheses whose intersections were significant; the whole procedure is then significant at the family-wise error rate α* = α
But: very complicated to perform if there are more than 3 groups
→ see e.g. Tukey's test (in R)