Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB...

14
THE STANDARD ERROR OF THE LAB SCIENTIST and other common statistical misconceptions in the scientific literature. Ingo Ruczinski Department of Biostatistics, Johns Hopkins University Some Quotes Instead of an outline, here are some quotes from scientific publications that we will have a closer look at: 95% confidence intervals (mean plus minus two standard deviations) the model predicted the data well (correlation coefficient R 2 = 0.85) we used the jackknife to estimate the error for future predictions

Transcript of Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB...

Page 1: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

THE STANDARD ERROROF THE LAB SCIENTIST

����� and other common statistical misconceptions in the scientific literature.

Ingo Ruczinski

Department of Biostatistics, Johns Hopkins University

Some Quotes

Instead of an outline, here are some quotes from scientific publications that we willhave a closer look at:

“ ����� 95% confidence intervals (mean plus minus two standard deviations) ����� ”

“ ����� the model predicted the data well (correlation coefficient R2 = 0.85) ����� ”

“ ����� we used the jackknife to estimate the error for future predictions ����� ”

Page 2: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Quote #1

“ ����� 95% confidence intervals (mean plus minus two standard deviations) ����� ”

1.96 × σ

σ

µ − 3σ µ − 2σ µ − σ µ µ + σ µ + 2σ µ + 3σ

The Normal Distribution N(µ,σ2)

Parameters and Statistics

A statistic is a numerical quantity derived from a sample to estimate an unknownparameter that describes some feature of the entire population.

For example, assume that the measurements taken in an experiment follow a nor-mal distribution

� ��������� ���, and assume that we carry out � independent exper-

iments, i. e. let�����

�����

�����be a random sample from

�.

��� �is the unknown population mean (a parameter).�� ������� ��� � is the sample mean (a statistic).

��� is the standard deviation of

�.

!�#" $ � � � � �&% � is the sample standard deviation, where$ � �&���'��� � � ��(� �

.

Note that

is not a standard error!

Page 3: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

The Standard Error

The standard error of a statistic is the standard deviation of its sampling distribution.

For example:� � ����� �� � � �

hence�� � ������� � � � � � and therefore the standard

error of��

is � � � .

A standard error itself is a parameter, not a statistic!

As the standard deviation of�

is often unknown, so is the standard error of��

, butin practice we can estimate it, for example by �se

� �� � � � � � .

In general, the standard error depends on the sample size: the larger the samplesize, the smaller the standard error.

This means that the term standard deviation in “ ����� 95% confidence intervals(mean plus minus two standard deviations) ����� ” better be referring to the sam-pling distribution, not the population.

But what about that factor 2?

Confidence Intervals

If the standard deviation

of�

is known, then ������� � � � ����� � % � .

We can obtain a 95% confidence interval for the population mean�

as� � � �� ������� ������� � � � � �� ! ����� ������� � � �#"

If

is unknown and we have to estimate it from the data as well, then ������$� � � �&% � � � .

The 95% confidence interval for�

is now�#� � �� � % � � ���� ����� � � � � � �� !'%

� � ��(� ����� � � � �#"

�3 4 5 6 7 8 9 10)+* ,.-/10 24345 4.30 3.18 2.78 2.57 2.45 2.36 2.31 2.26

Page 4: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Plotting Data (Confusion Part 1)

0

2

4

6

8

10

12

14

A B0

2

4

6

8

10

12

14

Guess what the data are!

A B

Plotting Data (Confusion Part 1)

0

2

4

6

8

10

12

14

A B0

2

4

6

8

10

12

14

A B

Page 5: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Plotting Data (Confusion Part 1)

0

2

4

6

8

10

12

14

A B0

2

4

6

8

10

12

14

A B

Plotting Data (Confusion Part 1)

0

2

4

6

8

10

12

14

A B0

2

4

6

8

10

12

14

A B

Page 6: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Reporting Uncertainty (Confusion Part 2)

� Results are frequently reported in the form ’mean plus minus standard error’,such as 7.4 (

�1.3).

� What is reported as the (estimated) standard error is often the sample standarddeviation (

, not

� � � ).

� The plus/minus notation can also mislead readers to believe 7.4 (�

1.3) is aconfidence interval.

� To allow others to correctly quantify uncertainty, it is also necessary to reportthe number of experiments that have been performed (for the

%-quantile and to

calculate an estimate for the standard error, if necessary).

An Example

Do chemically denatured proteins behave as random coils?

� The radius of gyration Rg of a protein is defined as the root mean square dis-tance from each atom of the protein to their centroid.

� For an ideal (infinitely thin) random-coil chain in a solvent, the average radiusof gyration of a random coil is a simple function of its length n: Rg � n0.5

� For an excluded volume polymer (a polymer with non-zero thickness and non-trivial interactions between monomers) in a solvent, the average radius of gyra-tion, we have Rg � n0.588 (Flory 1953).

��� The radius of gyration can be measured using small angle x-ray scattering.

Page 7: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

An Example

Length [residues]

Rg

[A°]

10 50 100100 500

10

20

30

40

50

60

708090

Confidence interval for the slope: [ 0.579 ; 0.635 ]

An Example

Length [residues]

Rg

[A°]

10 50 100100 500

10

20

30

40

50

60

708090

Page 8: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Variability

Length [residues]

Rg

[A°]

10 50 100100 500

10

20

30

40

50

60

708090

Variance Components

2

3

4

5

ln(k

f)

Wild type I28A I28L I28V V55A V55M V55T V55G

Page 9: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Variance Components

jl030

3b_1

1

jl030

3b_1

2

se03

25b_

04

se03

25b_

05

se03

25b_

06

56.G

u

56.G

u

56.G

u

56.G

u

56.G

u

jl030

9b_0

0

jl030

9b_0

1

jl030

9b_0

2

oc03

01b_

06

oc03

01b_

07

56.T

FE

56.T

FE

56.T

FE

56.T

FE

56.T

FE

ju03

30b_

07

ju03

30b_

08

se03

26b_

01

se03

26b_

02

se03

26b_

03

46.G

u

46.G

u

46.G

u

46.G

u

46.G

u

Variance Components

jl030

3b_1

1

jl030

3b_1

2

se03

25b_

04

se03

25b_

05

se03

25b_

06

56.G

u

56.G

u

56.G

u

56.G

u

56.G

u

jl030

9b_0

0

jl030

9b_0

1

jl030

9b_0

2

oc03

01b_

06

oc03

01b_

07

56.T

FE

56.T

FE

56.T

FE

56.T

FE

56.T

FE

ju03

30b_

07

ju03

30b_

08

se03

26b_

01

se03

26b_

02

se03

26b_

03

46.G

u

46.G

u

46.G

u

46.G

u

46.G

u

Page 10: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Quote #2

“ ����� the model predicted the data well (correlation coefficient R2 = 0.85) ����� ”

Correlation

4 6 8 10 12

2

4

6

8

X

Y

R2 = 0.56

Page 11: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Correlation

4 6 8 10 12

2

4

6

8

X

Y

R2 = 0.56

Correlation

4 6 8 10 12

2

4

6

8

X

Y

R2 = 0.85

Page 12: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Correlation vs Regression

� In a correlation setting we try to determine whether two random variables varytogether (covary).

� There is no ordering between those variables, and we do not try to explain oneof the variables as a function of the other.

� In regression settings we describe the dependence of one variable on the othervariable.

� There is an ordering of the variables, often called the dependent variable andthe independent variable.

Correlation vs Regression

The correlation coefficient of two jointly distributed random variables�

and � isdefined as � � cov

���!� � � � ��where cov

� �!� � � is the covariance between�

and � , and � and

�are their

respective standard deviations.

If�

and � follow a bivariate normal distribution with correlation ����� �� �� � ������ � �

� � ��� �� � � �� � � �� �

then � �� � � � � ��� � !�� � � � �� ���where

� � � � � � � � � � ��� � � � � � � � and � � �� � % � � � � .

Page 13: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Some Comments

� The sample (multiple) correlation coefficient in a regression setting is definedas the correlation between the observed values � and the fitted values

� fromthe regression model: R = cor( � � � )

� R2 is called the coefficient of determination: it is equal to the proportion of thevariability in � explained by the regression model.

� The notion “the higher R2 � the better the model” is simply wrong.

� Assuming we have an intercept in the (linear regression) model, the more pre-dictors we include in the model, the higher R2.

� However, there is a test for “significant” reductions in R2 (there is a one-to-onecorrespondence to the usual

%and � statistics).

� R2 tells us nothing about model violations.

Model Fit

�0 = 3.0,

�1 = 0.5, p-value (slope) = 0.002, R2 = 0.67, RSE = 1.24 (9 df).

0 5 10 15 20

0

2

4

6

8

10

12

0 5 10 15 20

0

2

4

6

8

10

12

0 5 10 15 20

0

2

4

6

8

10

12

0 5 10 15 20

0

2

4

6

8

10

12

Page 14: Johns Hopkins School of Public Health - THE STANDARD ERROR OF THE LAB SCIENTISTcfrangak/cominte/ruczinski.05.03... · 2005-04-01 · Department of Biostatistics, Johns Hopkins University

Experimental Design

0 1 2 3 4

0

2

4

6

some concentration

som

e ou

tcom

e

0 1 2 3 4

0

2

4

6

some concentration

som

e ou

tcom

e

0 1 2 3 4

0

2

4

6

some concentration

som

e ou

tcom

e

Standard error ratios for the slope:

1.65 1.41 1.00: :

In Conclusion: A Few Suggestions

� Take Karl Broman’s course “Statistics for Laboratory Scientists” (140.615/616).

� For your analysis, use tools that help you understand the data, and try to getan idea what all that statistical output from your program means.

� Avoid “black boxes” as much as possible. Plot the data.

� For more complicated quantitative projects, adopt a biostatistician.

� Keep recruiting people like Ray and Matthew. Cheers!