Molecular Medicine, summer semester 2016
Statistics for Molecular Medicine: Statistical inference and statistical tests
Barbara Kollerits & Claudia Lamina, Medical University Innsbruck, Division of Genetic Epidemiology
The principles of statistical testing:
Formulating hypotheses & test statistics & p-values
Formulating Hypothesis & Statistical Tests
Steps in conducting a statistical test:
■ Quantify the scientific problem from a clinical / biological perspective
■ Formulate the model assumptions (distribution of the variable of interest)
■ Formulate the problem as a statistical testing problem:
null hypothesis versus alternative hypothesis
■ Define the "error" you are willing to tolerate
■ Calculate the appropriate test statistic
■ Decide for or against the null hypothesis
Formulating Hypothesis & Statistical Tests
Hypothesis Formulation:
■ Null hypothesis H0: the conservative hypothesis you want to reject
■ Alternative hypothesis H1: the hypothesis you want to prove
■ Examples:
Scientific hypothesis: A new therapy is assumed to prevent myocardial infarctions (MI) in risk patients better than the old therapy.
Statistical hypothesis: H0: π_new ≥ π_old versus H1: π_new < π_old
with π_new / π_old: the proportion of patients experiencing an MI during the study under the new / old therapy
Scientific hypothesis: Women and men achieve equally good scores in the EMS-AT test.
Statistical hypothesis: H0: μ_men = μ_women versus H1: μ_men ≠ μ_women
with μ_men / μ_women: the mean scores for men / women
Formulating Hypothesis & Statistical Tests
Possible decisions in statistical tests:
■ Type I and Type II errors cannot be minimized simultaneously
■ Statistical tests are constructed such that the probability of a Type I error is not larger than the significance level α (typically set to 0.01 or 0.05)
Example:
■ Test the new MI therapy on patients at a significance level of 5%.
■ In reality, H0 is true and there is no difference between the therapies.
■ If the study is repeated 100 times on 100 different samples, the statistical test rejects the null hypothesis in at most 5 of the 100 tests.

                 Decide for H0       Decide for H1
Reality H0:      correct decision    wrong decision: Type I error (α)
Reality H1:      wrong decision:     correct decision: Power (1−β)
                 Type II error (β)
Formulating Hypothesis & Statistical Tests
The One-sample test of the mean: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
Example:
From a former population-based study, you know that the mean non-fasting cholesterol level was 230.
Now you have finished the measurements in your new sample. Since you want to conduct a study on cholesterol levels that is comparable to the old study, you test whether the mean in the new study equals the mean in the old study:
One-sample test of the mean: H0: μ = 230 vs. H1: μ ≠ 230
Assuming normal distribution and assuming H0 is true:
Test statistic (standardization): T = (x̄ − μ0) / σ · √n ~ N(0,1) under H0
If the test statistic is "too extreme", it is not very likely that H0 is true.
Reject H0 if T is "too extreme".
Formulating Hypothesis & Statistical Tests
■ You cannot avoid a Type I error, but you can control it, since it can only occur if H0 is true. Since the test statistic follows a known distribution under H0, the probability of each value given H0 can be calculated.
The distribution of the test statistic given H0 is N(0,1):
[Figure: standard normal density with the acceptance region between the critical values −z_{1−α/2} and z_{1−α/2} (area 1−α) and a rejection region of area α/2 in each tail]
■ All values falling in the rejection region (< −z_{1−α/2} or > z_{1−α/2}) do not support the null hypothesis (significance level α).
Formulating Hypothesis & Statistical Tests
The One-sample test of the mean revisited: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
Assuming normal distribution and assuming H0 is true:
Test statistic: T = (x̄ − μ0) / σ · √n ~ N(0,1)
Test decision: |T| > z_{1−α/2}: Reject H0 → the test is "significant" at level α
Formulating Hypothesis & Statistical Tests
■ Test decision: |T| > z_{1−α/2}: Reject H0 → the test is "significant" at level α
Attention: Rejection of H0 is not a decision for H1, but a decision against H0, since no distribution can be specified if H1 is true.
■ The (1−α/2)-quantiles can be found in tables or calculated by computers, e.g. the 97.5% quantile of the standard normal distribution, used for a 5% two-sided significance test, is ~1.96
■ As for confidence intervals: if σ is not known and the sample size is large enough, estimate σ by S
■ So far, a statistical test gives you a test statistic. If the test statistic is more extreme than the critical value, decide against the null hypothesis; otherwise, decide for the null hypothesis
→ a simple yes/no decision rule
■ This decision rule does not tell you the certainty / strength of the decision
■ Assume two two-sided z-tests at α = 5%: one gives T = 2, the other T = 2.6
■ Both test statistics are > z_{1−α/2} (~1.96) → both are "significant"
■ However, T = 2.6 is "more extreme" than T = 2, given the truth of the null hypothesis!
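The tail areas behind these two test statistics can be checked directly; a small sketch using scipy's standard normal distribution (an illustration, not part of the slides):

```python
# Two-sided p-value = both tail areas beyond |T| under N(0,1).
from scipy.stats import norm

p_T2 = 2 * norm.sf(2.0)    # sf = 1 - cdf (upper tail); ~0.046, just under 0.05
p_T26 = 2 * norm.sf(2.6)   # ~0.009, far below 0.05
print(p_T2, p_T26)
```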
Formulating Hypothesis & Statistical Tests
■ The p-value: the probability to observe an effect that is as extreme as the observed effect, or even more extreme, under the assumption that the null hypothesis (= no association) is true
■ For a two-sided test, the areas under the curve in both tails of the distribution function have to be added:
For T = 2: ~0.025 in each tail → p-value ~ 0.05
For T = 2.6: ~0.005 in each tail → p-value ~ 0.01
■ The p-value is a measure of the evidence against the null hypothesis.
Example:
A one-sample z-test comparing the sample mean to μ0 (H0: μ = μ0; H1: μ ≠ μ0) results in a test statistic T = 2.6, which corresponds to a p-value of 0.01.
A popular interpretation, but wrong:
"The probability that the sample mean is different from μ0 is 1%"
The sample mean does not have a probability: it differs from μ0 or it does not!
Correct interpretation:
"A random sample is drawn 100 times from the population of interest. The population mean is μ0 (= null hypothesis). At most 1 of the 100 experiments results in a test statistic |T| ≥ 2.6"
The randomness lies in the sample!
If p < α, reject H0 → the test decision can be based on T or p
Formulating Hypothesis & Statistical Tests
The smaller the p-value, the more certain you can be that the result is not only due to chance:
p-value 0.01: only in 1 of 100 experiments would you get such a result just by chance
p-value 0.001: only in 1 of 1000 experiments would you get such a result just by chance → really seldom
Formulating Hypothesis & Statistical Tests
■ The one-sample test of the mean revisited 2: Gauß-test (also called z-test)
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (x̄ − μ0) / σ · √n ~ N(0,1) (standard normal distribution)
■ Test decision: Reject H0 if one of the following applies:
|T| > z_{1−α/2}
p < α
μ0 not in the (1−α) confidence interval
These three statements are equivalent, and this holds in a similar manner for all parametric tests
Formulating Hypothesis & Statistical Tests
■ Example:
■ Situation: Compare the sample mean of total cholesterol (x̄ = 237.1) with a specified mean of total cholesterol (μ0 = 230)
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (237.1 − 230) / (σ/√n) = 6.3
■ Test decision: Reject H0, since each of the following applies:
|T| > z_{1−α/2}: 6.3 > 1.96 → reject H0
p < α: 2.97e-10 < 0.05 → reject H0
μ0 not in the 95% confidence interval: 230 not in [234.89, 239.31] → reject H0
Formulating Hypothesis & Statistical Tests
→ The sample mean is significantly different from the specified mean
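The three equivalent decision criteria above can be sketched as one small function. Note that the slides do not give n and σ for the cholesterol example, so the values passed below are purely assumed for illustration:

```python
# One-sample z-test returning all three equivalent decision criteria.
# xbar and mu0 follow the cholesterol example; sigma and n are NOT given
# in the slides -- they are assumptions chosen only for illustration.
import math
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, alpha=0.05):
    se = sigma / math.sqrt(n)
    T = (xbar - mu0) / se                    # test statistic
    z = norm.ppf(1 - alpha / 2)              # critical value (~1.96)
    p = 2 * norm.sf(abs(T))                  # two-sided p-value
    ci = (xbar - z * se, xbar + z * se)      # (1-alpha) confidence interval
    # the three criteria always agree:
    return abs(T) > z, p < alpha, not (ci[0] <= mu0 <= ci[1])

print(z_test(237.1, 230, sigma=35.0, n=1000))
```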
Formulating Hypothesis & Statistical Tests: Exercise
The manufacturer of a laboratory measurement device claims that one measurement takes 5 minutes on average. You want to test that statement with 10 measurements. (Mean = 5.3, standard deviation = 0.3)
What hypothesis do you want to test?
H0: versus H1:
Determine the test statistic: T =
What is the critical value (significance level 5%)?
Quantiles of the standard normal distribution:
p:    0.95   0.975   0.995
z_p:  1.64   1.96    2.58
Test decision:
The most common statistical tests:
Testing measures of location:
Compare 2 groups:
• Quantitative / continuous outcome, normal distribution: t-test
• Quantitative / continuous outcome, any other distribution: Wilcoxon test / Mann-Whitney U-test
• Qualitative / categorical outcome, expected frequency in each cell of the crosstable "high": Chi-square test
• Qualitative / categorical outcome, expected frequency "low": Fisher's exact test
Compare >2 groups:
• Quantitative / continuous outcome, normal distribution: Analysis of Variance (ANOVA)
• Quantitative / continuous outcome, any other distribution: Kruskal-Wallis test
• Qualitative / categorical outcome, expected frequency "high": Chi-square test
• Qualitative / categorical outcome, expected frequency "low": Fisher's exact test
The most common statistical tests
Testing frequencies in a crosstable:
Are the rows and columns independent of each other?
Testing measures of location:
Does the mean/median differ between groups?
Testing measures of location
The one-sample t-test (the "standard test" for mean comparisons):
■ Situation: Compare the sample mean (x̄) with a specified mean (μ0)
■ Assumption: normal distribution of the sample, σ is not known
■ Hypothesis: H0: μ = μ0 versus H1: μ ≠ μ0
■ Test statistic: T = (x̄ − μ0) / S · √n ~ t(n−1)
■ Test decision for a two-sided test: |T| > t_{n−1,1−α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n−1,1−α}: Reject H0
■ The Gauß-test is the variant of this test you would use if σ were known
■ For large n, the t-distribution approaches the standard normal distribution, so the t-test approximates the Gauß-test
■ Quantiles of the t-distribution are needed to decide for or against the null hypothesis
■ But in practice: statistical programs output p-values
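As a sketch of the formula above, a one-sample t-test computed once by hand and once with scipy's `ttest_1samp`, to show they agree. The sample values are invented for illustration, not taken from the slides:

```python
# One-sample t-test: slide formula vs. scipy, on a hypothetical sample.
import math
from scipy import stats

x = [232, 248, 241, 225, 239, 251, 228, 244]   # invented values
mu0 = 230

n = len(x)
xbar = sum(x) / n
s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
T = (xbar - mu0) / s * math.sqrt(n)            # formula from the slide

t_scipy, p_scipy = stats.ttest_1samp(x, mu0)   # two-sided by default
print(T, t_scipy, p_scipy)
```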
Testing measures of location
Comparing the normal and t-distribution; quantiles of the t-distribution: df (= n−1) | 0.95 | 0.975
1 6.314 12.706
2 2.920 4.303
3 2.353 3.182
4 2.132 2.776
5 2.015 2.571
6 1.943 2.447
7 1.895 2.365
8 1.860 2.306
9 1.833 2.262
10 1.812 2.228
11 1.796 2.201
12 1.782 2.179
13 1.771 2.160
14 1.761 2.145
15 1.753 2.131
16 1.746 2.120
17 1.740 2.110
18 1.734 2.101
19 1.729 2.093
20 1.725 2.086
21 1.721 2.080
22 1.717 2.074
23 1.714 2.069
24 1.711 2.064
25 1.708 2.060
26 1.706 2.056
27 1.703 2.052
28 1.701 2.048
29 1.699 2.045
30 1.697 2.042
120 1.658 1.980
∞ 1.645 1.960
0.95 quantile: critical value for a one-sided test at significance level 5%; e.g. n = 16: t0.95(15) = 1.753
0.975 quantile: critical value for a two-sided test at significance level 5%; e.g. n = 823: t0.975(822) ~ 1.96
The last row (∞) contains the quantiles of the standard normal distribution.
Testing measures of location
df (= n‐1) 0.95 0.975 0.99 0.995
1 6.314 12.706 31.821 63.657
2 2.920 4.303 6.965 9.925
3 2.353 3.182 4.541 5.841
4 2.132 2.776 3.747 4.604
5 2.015 2.571 3.365 4.032
6 1.943 2.447 3.143 3.707
7 1.895 2.365 2.998 3.499
8 1.860 2.306 2.896 3.355
9 1.833 2.262 2.821 3.250
10 1.812 2.228 2.764 3.169
11 1.796 2.201 2.718 3.106
12 1.782 2.179 2.681 3.055
13 1.771 2.160 2.650 3.012
14 1.761 2.145 2.624 2.977
15 1.753 2.131 2.602 2.947
16 1.746 2.120 2.583 2.921
17 1.740 2.110 2.567 2.898
18 1.734 2.101 2.552 2.878
19 1.729 2.093 2.539 2.861
20 1.725 2.086 2.528 2.845
21 1.721 2.080 2.518 2.831
22 1.717 2.074 2.508 2.819
23 1.714 2.069 2.500 2.807
24 1.711 2.064 2.492 2.797
25 1.708 2.060 2.485 2.787
26 1.706 2.056 2.479 2.779
27 1.703 2.052 2.473 2.771
28 1.701 2.048 2.467 2.763
29 1.699 2.045 2.462 2.756
30 1.697 2.042 2.457 2.750
120 1.658 1.980 2.358 2.617
∞ 1.645 1.960 2.326 2.576
Determine the following critical values:
One-sided or two-sided test | Significance level | n | Critical value
two | 0.05 | >> 120 |
one | 0.05 | >> 120 |
two | 0.05 | 15 |
one | 0.05 | 28 |
two | 0.01 | 10 |
one | 0.01 | 20 |
two | 0.01 | >> 120 |
Testing measures of location
The two-sample t-test for unpaired samples:
■ Situation: Compare the means (μ1, μ2) of two unpaired samples
■ Assumption: normal distribution of both samples, σ is not known
Here: equal σ assumed, but there are methods (Welch t-test) for unequal σ
■ Hypothesis: H0: μ1 = μ2 versus H1: μ1 ≠ μ2
■ Test statistic: T = (x̄1 − x̄2) / (s_p · √(1/n1 + 1/n2)) ~ t(n1+n2−2)
with the pooled variance s_p² = ((n1−1)·S1² + (n2−1)·S2²) / (n1+n2−2)
■ Test decision for a two-sided test: |T| > t_{n1+n2−2,1−α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n1+n2−2,1−α}: Reject H0
Testing measures of location
Example: A biotech company claims that their new biomarker XY can distinguish diseased from non-diseased. A pilot study on 10 diseased and 10 healthy persons (one measurement missing) gives the following results:

Lab parameter XY in Diseased | Lab parameter XY in Healthy
8.70  | 3.36
11.28 | 18.35
13.24 | 5.19
8.37  | 8.35
12.16 | 13.10
11.04 | 15.65
10.47 | 4.29
11.16 | 11.36
4.28  | 9.09
19.54 | (missing)
x̄:  11.024 | 9.86
S²: 15.227 | 27.038
T = (11.024 − 9.86) / √(20.78512 · (1/10 + 1/9)) = 0.556 ~ t(17)
with the pooled variance s_p² = (9 · 15.227 + 8 · 27.038) / 17 = 20.78512
Critical value of a t(17)-distribution at α = 5% (two-sided test) = 2.11
0.556 < 2.11 → XY does not differ significantly between diseased and non-diseased (p = 0.29)
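The hand calculation can be reproduced with scipy's `ttest_ind` (equal variances, as assumed above). Note that the healthy group has only 9 values because of the missing measurement, and that scipy reports the two-sided p-value, while the slide quotes the one-sided one:

```python
# Two-sample t-test on the biomarker data, pooled (equal) variances.
from scipy import stats

diseased = [8.70, 11.28, 13.24, 8.37, 12.16, 11.04, 10.47, 11.16, 4.28, 19.54]
healthy = [3.36, 18.35, 5.19, 8.35, 13.10, 15.65, 4.29, 11.36, 9.09]

t, p = stats.ttest_ind(diseased, healthy, equal_var=True)
print(t, p)   # t ~ 0.556 on 17 degrees of freedom
```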
Testing measures of location: Exercise
Patient ID Physically active Cholesterol level
1 0 195
2 0 159
3 0 166
4 0 244
5 0 169
6 0 168
7 0 222
8 0 238
9 0 216
10 0 180
11 1 146
12 1 145
13 1 147
14 1 208
15 1 182
16 1 145
17 1 187
18 1 206
19 1 218
20 1 161
Comparing the cholesterol levels between physically active (= 1) and inactive (= 0) patients:
Mean(inactive) =        Mean(active) =
What hypothesis do you want to test ?
H0: versus H1:
The following information is given:
S(inactive) = 31.94; S(active) = 29.27
Pooled S² = 938.48
Test-statistic T =
Critical value (two-sided test, α = 5%):
Test decision ?
Testing measures of location
The two-sample t-test for paired samples:
■ Situation: Compare the means of two paired samples, e.g. compare the means of a variable in the same patients before and after a treatment
■ Assumption: normal distribution of both samples, σ is not known
■ Hypothesis: H0: μ_before = μ_after versus H1: μ_before ≠ μ_after
Calculate d = x_before − x_after for each patient
→ new hypothesis: H0: the mean of the differences is 0 (μ_d = 0)
versus H1: the mean of the differences is ≠ 0 (μ_d ≠ 0)
→ same situation as the one-sample t-test
T-tests can also be used approximately for any distribution that is not too skewed.
Testing measures of location
Example: A wannabe health guru claims that he has invented the perfect weight-loss method. A pilot study on 10 obese individuals gives the following results:

ID | kg at baseline | kg after 6 months | Difference
1  | 108 | 90  | 18
2  | 97  | 97  | 0
3  | 88  | 91  | -3
4  | 120 | 111 | 9
5  | 98  | 94  | 4
6  | 95  | 91  | 4
7  | 87  | 82  | 5
8  | 85  | 77  | 8
9  | 99  | 103 | -4
10 | 134 | 127 | 7
x̄  | 101.1 | 96.3 | 4.8
S² | 242.767 | 209.122 | 41.07 (s = 6.41)
Paired t-test = one-sample t-test on the differences:
T = √10 · (4.8 − 0) / 6.41 = 2.368
t0.975(9) = 2.262 (two-sided) → H0 can be rejected (p = 0.042)
Since you want to prove that kg(before) > kg(after):
→ a one-sided test is more appropriate (more power)
t0.95(9) = 1.833, p = 0.021 (= 0.5 · two-sided p)
An unpaired t-test would have yielded a non-significant result: p = 0.24 (one-sided test)
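The same paired analysis can be run with scipy's `ttest_rel`; the one-sided alternative `'greater'` matches the claim kg(before) > kg(after):

```python
# Paired t-test on the weight-loss data, one-sided.
from scipy import stats

before = [108, 97, 88, 120, 98, 95, 87, 85, 99, 134]
after = [90, 97, 91, 111, 94, 91, 82, 77, 103, 127]

t, p = stats.ttest_rel(before, after, alternative='greater')
print(t, p)   # t ~ 2.368, one-sided p ~ 0.021
```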
Testing measures of location
Analysis of Variance (ANOVA)
■ Situation: Compare the means of k samples (k > 2)
■ Assumption: normal distribution in each group, σ1 = … = σk
■ Hypothesis: H0: μ1 = μ2 = … = μk versus H1: μi ≠ μj for at least one pair (i ≠ j): at least two of the means differ
How to construct the teststatistic ?
Idea from the two-sample t-test: relate the difference / variability between the groups to the variability within the groups
Testing measures of location
Analysis of Variance (ANOVA): Variance partitioning
■ There are k groups (j = 1,…,k)
■ X1j, X2j, …, Xnjj are the observed values of the variable of interest for the i = 1,…,nj patients in the jth group
■ x̄ is the overall mean of the variable, x̄j are the means within the groups
SST = Sum of Squares Total: SST = Σj Σi (xij − x̄)²
SSE = Sum of Squares Error (or SSR = Sum of Squares Residual) ~ variance "within": SSE = Σj Σi (xij − x̄j)²
SSM = Sum of Squares Model ~ variance "between": SSM = Σj nj · (x̄j − x̄)²
It holds: SST = SSM + SSE
[Figure: all observations xij plotted by group (Group 1, 2, 3) with group means x̄1, x̄2, x̄3; deviations from the group means make up SSE/SSR, deviations of the group means from the overall mean make up SSM]
Testing measures of location
Analysis of Variance (ANOVA): from sums of squares to mean squares:
Mean of Squares: MSM = SSM / (k−1) and MSE = SSE / (n−k)
(with n the total number of observations and k the number of groups)
Testing measures of location
■ Test statistic: F = MSM / MSE ~ F(k−1, n−k) under H0
■ Test decision: F > F_{k−1,n−k,1−α}: Reject H0
■ Since F is always positive, there are no one-sided tests
■ If H0 is rejected, you can tell that there are at least two groups which differ from each other significantly. You can't tell which groups differ!
→ perform pairwise t-tests after the overall F-test (see closed test procedure)
Example:
There are 3 different medications (Med1, Med2, Med3) which are intended to increase the HDL-cholesterol levels in patients
1. Perform an ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, you know that there is a difference
3. Test Med1 against Med2, Med1 against Med3, and Med2 against Med3
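The two-step procedure above can be sketched with scipy; the HDL values below are invented purely for illustration:

```python
# Overall F-test first, pairwise t-tests only if it is significant.
# The three groups of HDL values are hypothetical.
from itertools import combinations
from scipy import stats

groups = {
    "Med1": [42, 45, 41, 44, 43, 46],
    "Med2": [48, 51, 50, 47, 52, 49],
    "Med3": [55, 58, 54, 57, 56, 59],
}

F, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {F:.2f}, p = {p:.2g}")

if p < 0.05:   # step 2: pairwise comparisons only after a significant F-test
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        t, pt = stats.ttest_ind(a, b)
        print(f"{name_a} vs {name_b}: p = {pt:.2g}")
```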
Testing measures of location
All tests so far assumed a normally distributed variable → parametric tests
If the assumption does not hold → nonparametric tests
• Nonparametric counterpart of the t-test: Wilcoxon test / Wilcoxon rank-sum test / Mann-Whitney U-test (different names for the same test): testing if two independent samples come from the same distribution → testing equality of medians
• Nonparametric counterpart of ANOVA: Kruskal-Wallis test: testing equality of population medians between groups
Characteristics of nonparametric tests:
• Robust against outliers and skewed distributions
• However: parametric tests should be preferred over nonparametric tests, if appropriate, since they have higher power.
Testing measures of location
Two-sample test on equality of distributions: Wilcoxon test
■ Situation: Compare location measures of two unpaired samples X and Y if the assumptions of a t-test do not hold
■ Assumption: the form of the continuous distributions of X and Y is the same → a test on equality of distributions = a test on equality of the medians
■ Hypothesis: H0: x_med = y_med versus H1: x_med ≠ y_med
■ The test is based on ranks
■ What are ranks?
Example:
Original values: X = {1, 2, 4, 6, 9, 9, 11}, x_med = 6; Y = {1, 3, 4, 5, 6, 7, 8}, y_med = 5
■ Sort both samples into one list: 1, 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 11
■ Ranking (ties get the average rank): 1.5, 1.5, 3, 4, 5.5, 5.5, 7, 8.5, 8.5, 10, 11, 12.5, 12.5, 14
■ A test statistic is calculated from these ranks by the computer!
Efficiency of the Wilcoxon test (Kruskal-Wallis test) compared to the t-test (ANOVA):
■ To achieve the same power as a t-test/ANOVA, a higher sample size is needed!
■ The parametric tests are in general more powerful (if their assumptions are fulfilled)
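The ranking step and the test itself can be reproduced with scipy: `rankdata` assigns average ranks to ties, and `mannwhitneyu` computes the rank-based test:

```python
# Ranks of the combined sample from the example, then the test itself.
from scipy import stats

X = [1, 2, 4, 6, 9, 9, 11]
Y = [1, 3, 4, 5, 6, 7, 8]

ranks = stats.rankdata(sorted(X + Y))   # ties share the average rank
print(list(ranks))

u, p = stats.mannwhitneyu(X, Y, alternative='two-sided')
print(u, p)
```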
Testing measures of location
The most common statistical tests:
Testing frequencies
Testing frequencies
One-sample test on frequencies: the χ² goodness-of-fit test for categorical traits: compare the frequencies of a categorical variable to specified proportions
■ Example: The quality manager of a gummi bear factory wants to test the claim that the 5 flavours occur in equal proportions in each package.
→ proportions under H0: π1 = π2 = π3 = π4 = π5 = 1/5
■ Hypothesis: H0: P(X = i) = πi for all i versus H1: P(X = i) ≠ πi for at least one i, with i = 1,…,k (number of categories)
■ Idea: Compare the observed numbers (Oi) in each category with the expected numbers (Ei = n·πi) in each category
■ Assumption: n·πi ≥ 1 for all i & n·πi ≥ 5 for at least 80% of the categories
Test statistic: χ² = Σi (Oi − Ei)² / Ei ~ χ²(k−1) under H0
Test decision: χ² > χ²_{k−1,1−α}: Reject H0
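A sketch of the goodness-of-fit test with scipy's `chisquare`; the gummi bear counts below are invented for illustration:

```python
# Goodness-of-fit: hypothetical counts of the 5 flavours in a package
# of 100 gummi bears, tested against equal proportions 1/5.
from scipy import stats

observed = [25, 18, 22, 15, 20]            # invented counts, n = 100
expected = [sum(observed) / 5] * 5          # E_i = n * pi_i = 20 per flavour

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)   # chi2 = 2.9 on k-1 = 4 degrees of freedom
```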
Testing frequencies
Excursion: What are degrees of freedom (df)?
Definition:
The number of values that can vary freely.
Example:
The χ² goodness-of-fit test with k categories has (k−1) degrees of freedom.
Why?
There is the constraint that all proportions sum up to 1: π1 + π2 + … + πk = 1
→ π1, π2, …, πk−1 can be chosen freely; πk is then fixed
Example with real data: suppose you have three groups, then π1 = 0.20, π2 = 0.50, and π3 = 1 − 0.70 (= 0.30)
Testing frequencies
Two-sample test on frequencies: the χ² test of independence:
■ Situation: Compare the frequencies between two or more groups
Or: test if two categorical variables X (i = 1,…,k) and Y (j = 1,…,m) depend on each other
■ A possible scenario: Compare the numbers of smokers, ex-smokers, and never-smokers between men and women
→ All these situations can be grouped into contingency tables:

        Y: 1 … m    | Row sum
X: 1    h11 … h1m   | h1.
   2    h21 … h2m   | h2.
   :    :      :    | :
   k    hk1 … hkm   | hk.
Column sum: h.1 … h.m | n
Testing frequencies
Two-sample test on frequencies: the χ² test of independence:
■ Hypothesis: H0: X and Y are independent of each other
H1: X and Y are dependent on each other (are associated)
■ Assumption: expected frequencies ≥ 1 for all cells & expected frequencies ≥ 5 for at least 80% of the cells
→ none of the cells should have a very rare expectancy
→ if the assumption is not fulfilled → use Fisher's exact test
Idea to construct the test statistic: compare the observed numbers in each cell with the numbers expected if H0, and therefore independence of the two factor variables, is assumed
Testing frequencies
Table of observed numbers:

        Y: 1 … m    |
X: 1    h11 … h1m   | h1.
   2    h21 … h2m   | h2.
   :    :      :    | :
   k    hk1 … hkm   | hk.
        h.1 … h.m   | n

h1. … hk. and h.1 … h.m are the marginal sums

Example (observed numbers):
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457
Testing frequencies
Table of expected numbers:

        Y: 1 … m                |
X: 1    h1.·h.1/n … h1.·h.m/n   | h1.
   2    h2.·h.1/n … h2.·h.m/n   | h2.
   :    :              :        | :
   k    hk.·h.1/n … hk.·h.m/n   | hk.
        h.1       … h.m         | n

Expected number in each cell: (row sum · column sum) / total sum

Example (observed numbers):
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457

Expected number in the upper left cell: 722 · 261 / 1457 = 129.336
Testing frequencies
Compare the observed numbers Oij with the expected numbers Eij in every cell:
Test statistic: χ² = Σi Σj (Oij − Eij)² / Eij ~ χ²((k−1)·(m−1)) under H0
Test decision: χ² > χ²_{(k−1)(m−1),1−α}: Reject H0
Testing frequencies
Example:
Observed:
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 144            | 310       | 268          | 722
Women        | 117            | 143       | 475          | 735
Column total | 261            | 453       | 743          | 1457

Expected:
Gender       | Current smoker | Ex-smoker | Never smoker | Row total
Men          | 129.336        | 224.479   | 368.185      | 722
Women        | 131.664        | 228.521   | 374.815      | 735
Column total | 261            | 453       | 743          | 1457

χ² = (144 − 129.336)²/129.336 + (310 − 224.479)²/224.479 + (268 − 368.185)²/368.185
   + (117 − 131.664)²/131.664 + (143 − 228.521)²/228.521 + (475 − 374.815)²/374.815 = 121.9218
χ²_{0.95}((2−1)·(3−1)) = χ²_{0.95}(2) = 5.99; 121.9218 >> 5.99 → the test is significant (p = 3.3e-27)
→ the null hypothesis that gender and smoking status are independent can be rejected
→ the test itself does not tell you, however, whether men smoke more than women etc.
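The smoking example can be recomputed with scipy's `chi2_contingency`, which derives the expected numbers from the margins itself:

```python
# Chi-square test of independence on the observed smoking counts.
from scipy.stats import chi2_contingency

observed = [[144, 310, 268],    # men
            [117, 143, 475]]    # women

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof)        # ~121.92 on (2-1)*(3-1) = 2 degrees of freedom
print(expected[0][0])   # ~129.336, as computed by hand above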
Testing frequencies
Two-sample test on frequencies if the expected numbers in some cells are rare: Fisher's exact test
■ Situation: the assumptions of the χ² test do not hold (expected number in each cell ≥ 1 and expected number ≥ 5 in 80% of the cells)
■ Idea of the test: enumerate all tables that are possible with the observed marginal sums. Then count all the tables that are "more extreme", i.e. deviate at least as strongly from the null hypothesis as the observed table.
p = number of at-least-as-extreme tables / number of all tables
→ computer-intensive & not really solvable "by hand"
■ Test decision: p < α: Reject H0
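scipy's `fisher_exact` implements this for the 2×2 case (larger tables need other tools); the counts below are invented to mimic a table with small expected numbers:

```python
# Fisher's exact test on a small hypothetical 2x2 table.
from scipy.stats import fisher_exact

table = [[8, 2],
         [1, 5]]

odds_ratio, p = fisher_exact(table)
print(odds_ratio, p)
```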
Testing frequencies: Exercise
In an experiment with mice, you want to test the development of insulin resistance under three different diets (D1, D2, D3). After following up the mice for several months, you observe the following:

                      | D1 | D2 | D3 | Σ
insulin resistant     | 6  | 1  | 3  | 10
not insulin resistant | 4  | 9  | 7  | 20
Σ                     | 10 | 10 | 10 | 30

Calculate the expected numbers in each cell under the assumption of independence between diet and insulin resistance:

                      | D1 | D2 | D3 | Σ
insulin resistant     |    |    |    | 10
not insulin resistant |    |    |    | 20
Σ                     | 10 | 10 | 10 | 30

Which diet results in more insulin-resistant mice than expected, which in less?
Which statistical test should be performed?
The multiple testing problem
The situation:
■ Consider a dataset with 100 independent parameters which do not play a role in the etiology of the disease of interest (which you don't know, of course)
■ 100 statistical tests are performed with a significance level of α = 0.05
■ The tests are constructed such that at most 5 of the 100 tests reject the null hypothesis although it is true
→ You expect 5 tests to be significant just by chance!!!
The multiple testing problem
■ The probability to get at least one Type I error (one or more false discoveries) increases with the number of tests performed.
■ Family-wise error rate (the error rate for the complete family of tests performed): α* = 1 − (1−α)^k, with α being the comparison-wise error rate
→ The significance level has to be modified for multiple testing situations

k   | α* (α = 0.05)
1   | 0.05
5   | 0.226
10  | 0.401
100 | 0.994
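The formula α* = 1 − (1 − α)^k reproduces the table directly:

```python
# Family-wise error rate for k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:3d}: alpha* = {fwer:.3f}")
```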
The multiple testing problem
The Bonferroni correction method:
■ Control the comparison-wise error rate: reject H0 if p < α
■ Control the family-wise error rate (for k tests): reject H0 if p < α/k
This is equivalent to p_Bonferroni = p·k < α → advantage: simple
■ Problem: the Bonferroni correction increases the probability of a Type II error
→ the power to detect a true association is reduced → disadvantage: too conservative
■ Can be used if tests are dependent, but is then even more conservative

k   | α/k (α = 0.05)
1   | 0.05
5   | 0.01 (= 0.05/5)
10  | 0.005
100 | 0.0005
The multiple testing problem
Some modifications of the Bonferroni method: the Bonferroni-Holm method
■ You have a list of k p-values → sort it so that the minimal p-value (p(1)) comes first: p(1), p(2), p(3), …, p(k)
■ If p(1) ≥ α/k → stop and accept all null hypotheses
If p(1) < α/k → reject H0(1) and continue with the reduced list: p(2), p(3), …, p(k)
■ If p(2) ≥ α/(k−1) → stop and accept all remaining null hypotheses
If p(2) < α/(k−1) → reject H0(2) and continue with the reduced list: p(3), p(4), …, p(k)
■ Continue until the hypothesis with the smallest remaining p-value cannot be rejected → stop and accept all hypotheses that have not been rejected before
■ Less conservative than Bonferroni
■ Can also be used if tests are dependent
■ There are multiple other methods, e.g. the Hochberg method etc.
In R: function p.adjust()
The multiple testing problem
Example: Assume you have this vector of p-values (10 independent tests at α = 0.05):
Unadjusted p: 0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1
→ without any correction, all but the last (0.1) would be called "significant"
Bonferroni: α/k = 0.05/10 = 0.005 → all p-values smaller than 0.005 are significant: 0.0001, 0.001, 0.002
Bonferroni-Holm: first p-value (0.0001): 0.0001 < 0.05/10 = 0.005 → reject, continue!
Next p-value (0.001): 0.001 < 0.05/9 ≈ 0.0056 → reject, continue!
… until the p-value 0.007: 0.007 < 0.05/6 ≈ 0.0083 → reject, then stop, because the next p-value (0.01) is ≥ 0.05/5 = 0.01.
→ Holm rejects 0.0001, 0.001, 0.002, 0.005, and 0.007
All these methods are still too conservative if the tests are not independent (e.g. highly correlated parameters)!
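Both corrections from this example can be sketched in a few lines of plain Python (this mirrors what R's p.adjust() does for these two methods):

```python
# Bonferroni vs. Bonferroni-Holm on the slide's p-value vector.
alpha = 0.05
pvals = [0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1]
k = len(pvals)

# Bonferroni: one fixed threshold alpha / k for every test
bonf = [p for p in pvals if p < alpha / k]

# Holm: step down through the sorted list, relaxing the threshold each step
holm = []
for i, p in enumerate(sorted(pvals)):
    if p >= alpha / (k - i):
        break                 # stop; keep all remaining hypotheses
    holm.append(p)

print(len(bonf), len(holm))   # 3 rejections vs. 5 rejections
```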
Multiple Testing: Exercise
You conduct an experiment with knock-out mice and compare the mean values of 5 different parameters between wildtype (WT) and knockout (KO) mice.
In the following list, the p-values of all t-tests are given.

Parameter | Uncorrected p-value | Significant at α = 5%? | Significant after Bonferroni correction? | Significant after Bonferroni-Holm correction?
1 | 0.04   |  |  |
2 | 0.001  |  |  |
3 | 0.012  |  |  |
4 | 0.015  |  |  |
5 | 0.1    |  |  |

Significance level after Bonferroni correction? α =
Conclusion:
The multiple testing problem
The closed testing principle
■ For cases where a set of hypotheses can be tested simultaneously ("overall" or "omnibus" tests)
■ Suppose there are 3 hypotheses H1, H2, H3 to be tested:
Then H1 can be rejected at level α if all intersections including H1 can be rejected, i.e. H1 ∩ H2 ∩ H3, H1 ∩ H2, H1 ∩ H3, and H1 itself can all be rejected at level α
■ Example: There are 4 different medications (Med1, Med2, Med3, Med4) which are intended to lower the HDL-cholesterol levels in patients
1. Perform an ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, perform an ANOVA on all possible intersections
3. If an F-test on an intersection was significant, perform pairwise t-tests for all medications included in that significant F-test, etc.
■ Attention: If there are more than 3 groups, it is not sufficient to run the ANOVA first and then pairwise t-tests if the ANOVA was significant → also test all intersections in between
The multiple testing problem
■ Example: Testing the following hypotheses:
1. Step, ANOVA at significance level α: μ1 = μ2 = μ3 = μ4
2. Step, ANOVAs at significance level α for all intersections: μ1 = μ2 = μ3; μ1 = μ2 = μ4; μ1 = μ3 = μ4; μ2 = μ3 = μ4
3. Step, pairwise t-tests at significance level α: μ1 = μ2; μ1 = μ3; μ2 = μ3; μ1 = μ4; μ2 = μ4; μ3 = μ4
→ each step is only carried out for hypotheses whose intersections were significant; the whole procedure is then significant at the family-wise error rate α* = α
But: very complicated to perform if there are more than 3 groups
→ see e.g. Tukey's test (in R)