University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay...


University of Ottawa - Bio 4118 – Applied Biostatistics© Antoine Morin and Scott Findlay23-04-19 06:30

Data analysis project

• Proposal must be approved

• Strong suggestion to submit a draft before Dec. 2nd

• Description of question, H, prediction, protocol, data (1-2 pages)

• Show the data (1-5 graphs)

• Report analysis results, justify choices, hypotheses tested (1-5 pages)

• Statistical and biological interpretation (1-2 pages)

• Total: less than 10 pages.

Grading scheme

• Question, protocol, data, data copy (20%)

• Data examination, analysis, stat interpretation (60%)

• Biological conclusion (10%)

• Style (10%)

Simple linear regression

What regression analysis does

The simple regression model

Hypothesis testing in regression

Residual analysis

Inverse prediction, replicated regression and weighted regression

Regression caveats

Power considerations in simple linear regression

What regression does

• Fits a straight line through a cloud of data.

• Tests and quantifies the effect of an independent variable X on a dependent variable Y.

• Intensity of the effect is given by the slope (b) of the regression.

• The importance of the effect is given by the coefficient of determination (r²).

[Figure: scatterplot of Y against X with the fitted line; slope b = ΔY/ΔX]

Regression and correlation coefficients

• The slope b is estimated as:

$b = \dfrac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{N}(X_i-\bar{X})^2} = \dfrac{Cov(X,Y)}{s_X^2}$

• The correlation r is:

$r = \dfrac{Cov(X,Y)}{s_X s_Y}$

• So, $r = b\,\dfrac{s_X}{s_Y}$

• b = r if X and Y have the same variance…

• and if b = 0, r = 0 and vice versa.

How it does it

• by the method of least squares, which involves minimizing the sum of squared deviations between the observations and the regression line, i.e. minimizing the sum of squared residuals

• Squared deviation of an observation given by:

$\varepsilon_i^2 = (Y_i - \hat{Y}_i)^2$

• Residual: $\varepsilon_i = Y_i - \hat{Y}_i$

[Figure: Y against X, showing an observation $Y_i$, its estimate $\hat{Y}_i$ on the regression line, and the residual between them]
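
A minimal sketch of the least-squares calculation, in Python with the standard library only. The function name `least_squares` is illustrative, and the data are the 13 age and wing-length pairs as read from the example table later in these slides.

```python
from statistics import mean

def least_squares(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Age (days) and wing length (cm) of 13 birds, as read from the slides.
age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

a, b = least_squares(age, wing)

# Residual for each observation: observed minus estimated value.
residuals = [yi - (a + b * xi) for xi, yi in zip(age, wing)]
print(f"intercept = {a:.3f}, slope = {b:.3f}")
```

The slope comes out close to the b = .270 quoted in the power example, and the residuals sum to zero up to rounding, as they must for a least-squares line with an intercept.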

Regression or correlation?

• Correlation: degree of association between two variables X and Y; no causal relationship assumed!

• Regression: to predict the value of the dependent variable if the independent variable were changed; causal relationship assumed!

When do we use regression?

• Don’t use it to determine the strength of association between two variables.

• Do use it if you want to predict the value of Y given X.

[Figure: regression (Y against X) contrasted with correlation (X2 against X1)]

The simple regression model

• The regression model is:

$Y_i = \alpha + bX_i + \varepsilon_i$

• So, all simple regression models are described by 2 parameters, the intercept (α) and slope (b).

[Figure: Y against X with the fitted line, showing the intercept α, the slope b = ΔY/ΔX, and the residual $\varepsilon_i$ between the observed and expected $Y_i$ at $X_i$]

Assumptions

• Residuals are independent and normally distributed.

• The variance of the residuals is equal for all X (homoscedasticity).

• The relationship between Y and X is linear.

• There is no measurement error on X (Model I regression).

Measurement error

• Assumption of no error on X can be examined beforehand, and is almost invariably violated.

• Only of concern when measurement error is large relative to magnitude of X (say, > 10%).

• If assumption is invalid, then Model II regression is required.

Residual analysis I: independence

• Plot residuals against estimates, look for patterns.

[Figure: residuals plotted against estimates]

Residual analysis II: Normality

• Plot residuals against estimates; look for patterns.

• Do normal probability plot.

• Check with Lilliefors test.

[Figure: normal probability plots (NEDs against residuals) for normal and non-normal residuals; residuals plotted against estimates]

Residual analysis III: Homoscedasticity

• Plot residuals against estimates; look for patterns.

• Check with Levene’s test by grouping Y’s into several classes.

[Figure: residuals plotted against estimates, with the estimates divided into Groups 1 to 3]

Residual analysis IV: Linearity

• Plot residuals against estimates; look for patterns.

[Figure: Y plotted against X, and residuals plotted against estimates]

Robustness of regression with respect to violation of assumptions

Assumption         Robustness   Remark
Normality          Higher       Only if sample sizes are reasonably large (>10)
Independence       Low          But depends on strength of correlation
Homoscedasticity   Lower        Especially with smaller sample sizes
Linearity          Low          Make sure of this one!
No error on X      Higher       Error of < 10% is OK, otherwise use Model II

What to do when assumptions aren’t met

• Try transforming the data, but remember: (1) for some data, no transformation will work; (2) finding an appropriate transformation may not be easy.

• Use non-linear regression.

Transformations in regression

[Figure: weight (kg) against length (mm), on arithmetic axes and on log-log axes. Weight versus length in the cabezon (a fish), Scorpaenichthys marmoratus.]

Transformations in regression

[Figure: chirps/min against temperature (°C), on an arithmetic scale and with chirps/min on a log scale. Chirp rate as a function of temperature in males of the cricket Oecanthus fultoni.]

Transformations in regression

[Figure: millivolts against relative brightness (times), on an arithmetic scale and with brightness on a log scale. Electrical resistance as a function of illumination in cephalopod eyes.]

Age and size of, guess what?

*** Linear Model ***

Call: lm(formula = log10(FKLNGTH) ~ log10(AGE), data = Reg1dat, na.action = na.exclude)

Residuals:
      Min        1Q     Median       3Q     Max
 -0.08432  -0.01578  0.0006757  0.02111  0.0701

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept)   1.1991      0.0256  46.8737    0.0000
log10(AGE)    0.3343      0.0204  16.4124    0.0000

Residual standard error: 0.02832 on 73 degrees of freedom
Multiple R-Squared: 0.7868
F-statistic: 269.4 on 1 and 73 degrees of freedom, the p-value is 0
5 observations deleted due to missing values

Hypothesis testing I: partitioning the total sums of squares

Total SS = Model (Explained) SS + Unexplained (Error) SS

$\sum_{i=1}^{N}(Y_i-\bar{Y})^2 = \sum_{i=1}^{N}(\hat{Y}_i-\bar{Y})^2 + \sum_{i=1}^{N}(Y_i-\hat{Y}_i)^2$

Hypothesis testing I: partitioning the total sums of squares

• The mean squares are:

$MS_R = \dfrac{\sum_{i=1}^{N}(\hat{Y}_i-\bar{Y})^2}{1}, \qquad MS_e = \dfrac{\sum_{i=1}^{N}(Y_i-\hat{Y}_i)^2}{N-2}$

• So, if observed = expected, the regression SS equals the total SS and MSerror = 0.

• Calculate F = MSR/MSe and compare with F distribution with 1 and N - 2 df.

• H0: b = 0, i.e. the regression explains none of the variation in Y.
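
As a rough illustration of the partition, the following Python sketch computes the three sums of squares and the F ratio for the wing-length data read from the example table in these slides. The function name `regression_anova` is hypothetical, and no p-value is computed because the standard library has no F distribution.

```python
from statistics import mean

def regression_anova(x, y):
    """Return (SS_total, SS_model, SS_error, F) for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]

    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

    ms_r = ss_model / 1          # 1 df for the slope
    ms_e = ss_error / (n - 2)    # N - 2 df for the residuals
    return ss_total, ss_model, ss_error, ms_r / ms_e

age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

ss_total, ss_model, ss_error, f_ratio = regression_anova(age, wing)
print(f"F = {f_ratio:.1f} on 1 and {len(age) - 2} df")
```

The F ratio would then be compared with a tabulated critical value of F with 1 and N - 2 df.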

Standard error of the slope

• The standard error $s_b$ and 100(1 − α)% CIs of the slope are:

$s_b = \sqrt{\dfrac{MS_e}{\sum_{i=1}^{N}(X_i-\bar{X})^2}}, \qquad b \pm t_{\alpha(2),\,N-2}\; s_b$

• So, for fixed N, can decrease $s_b$ by expanding range of X values sampled.

[Figure: two scatterplots of Y against X, labelled “$s_b$ smaller” and “$s_b$ larger”]

Standard error of the intercept

• The standard error $s_\alpha$ of the intercept is:

$s_\alpha = \sqrt{MS_e\left[\dfrac{1}{N} + \dfrac{\bar{X}^2}{\sum_{i=1}^{N}(X_i-\bar{X})^2}\right]}$

• So, for fixed N, we can decrease $s_\alpha$ by expanding range of X values sampled.

[Figure: two scatterplots of Y against X, labelled “$s_\alpha$ smaller” and “$s_\alpha$ larger”]

Hypothesis testing II: testing model parameters

• Test each hypothesis by a t-test:

$t = \dfrac{\alpha}{s_\alpha}, \qquad t = \dfrac{b}{s_b}$

• Note: these are 2-tailed hypotheses!

[Figure: observed and expected lines of Y against X under H01: α = 0 (Y = 0 at X = 0) and under H02: b = 0]
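
A short Python sketch of the standard errors and t statistics for the slope and intercept, using the wing-length data read from the example table in these slides. The function name `parameter_tests` is illustrative; the t values would be compared with a tabulated critical t on N - 2 df, which is not computed here.

```python
from math import sqrt
from statistics import mean

def parameter_tests(x, y):
    """Return (s_b, s_a, t_b, t_a): standard errors and t values for slope and intercept."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ms_e = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

    s_b = sqrt(ms_e / sxx)                           # standard error of the slope
    s_a = sqrt(ms_e * (1 / n + x_bar ** 2 / sxx))    # standard error of the intercept
    return s_b, s_a, b / s_b, a / s_a

age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

s_b, s_a, t_b, t_a = parameter_tests(age, wing)
print(f"s_b = {s_b:.4f}, t_b = {t_b:.2f}; s_a = {s_a:.4f}, t_a = {t_a:.2f}")
```

In simple regression the square of the slope's t value equals the F ratio from the sums-of-squares partition, so the two tests of b = 0 agree.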

Hypothesis testing III: one-tailed hypotheses

• Biological theory predicts that Y should increase with X.

• So, H0: b ≤ 0 (one-tailed)

• Calculate: $t_b = \dfrac{b}{s_b}$

• Reject if $t_b$ > 0 and p (one-tailed) < α

[Figure: fitted lines of Y against X for which H0 is accepted and for which H0 is rejected]

Confidence intervals in regression

100(1 − α)% CI for estimated values:

$\hat{Y}_i \pm t_{\alpha(2),\,N-2}\sqrt{MS_e\left[\dfrac{1}{N} + \dfrac{(X_i-\bar{X})^2}{\sum_{j=1}^{N}(X_j-\bar{X})^2}\right]}$

100(1 − α)% CI for observations:

$\hat{Y}_i \pm t_{\alpha(2),\,N-2}\sqrt{MS_e\left[1 + \dfrac{1}{N} + \dfrac{(X_i-\bar{X})^2}{\sum_{j=1}^{N}(X_j-\bar{X})^2}\right]}$
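
The two intervals can be compared numerically with a sketch like the one below. The function name `ci_half_widths` is hypothetical, the data are the wing-length pairs read from the example table in these slides, and the critical value 2.201 (two-tailed α = .05, 11 df) is taken from a t table rather than computed.

```python
from math import sqrt
from statistics import mean

def ci_half_widths(x, y, x_new, t_crit):
    """Return (estimate, half-width for the estimated value, half-width for an observation)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ms_e = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

    core = 1 / n + (x_new - x_bar) ** 2 / sxx
    half_est = t_crit * sqrt(ms_e * core)        # CI for the estimated (mean) value
    half_obs = t_crit * sqrt(ms_e * (1 + core))  # CI for a single new observation
    return a + b * x_new, half_est, half_obs

age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

T_CRIT = 2.201  # assumed tabulated value of t for alpha(2) = .05 and 11 df
at_mean = ci_half_widths(age, wing, 10, T_CRIT)   # X equal to the sample mean
at_edge = ci_half_widths(age, wing, 17, T_CRIT)   # X at the edge of the sample
print(at_mean, at_edge)
```

Both half-widths are smallest at the mean of X and grow with distance from it, and the interval for observations is always the wider of the two.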

Confidence intervals in regression

• CI for observations is larger than CI for estimated values.

• CIs for both estimated values and observations increase with increasing distance between X value and mean of sample.

[Figure: confidence bands around the regression of Y on X, for observations and for estimates]

Outliers

• points that appear to lie well off the fitted line

• Issue 1: are “apparent” outliers really outliers?

• Issue 2: do they significantly affect the statistical conclusions?

[Figure: scatterplot of Y against X with two points marked “Outlier?”]

Outlier analysis I: Studentized residuals

• Plot Studentized residuals against estimated values.

• “Large” residuals are those with absolute value > 3.0.

• Such cases make large contributions to residual mean square of the regression.

[Figure: Studentized residuals (STUDENT, −4 to 4) plotted against LAGE (0.5 to 2.0)]

Outlier analysis II: Leverage

• Leverage measures the potential influence of the case on the regression line.

• Determined by X value only, so that points far from the mean have higher leverage.

• “Large” = anything greater than 4/N.

[Figure: leverage (LEVERAGE, 0.01 to 0.10) plotted against LAGE (0.5 to 2.0); sketch of Y against X contrasting points with small and large leverage]

Outlier analysis III: Cook’s distance

• Cook’s distance: measures both leverage and contribution to residual mean square, i.e. actual influence of a point.

• “Large” = anything greater than 1.

[Figure: Cook’s distance (COOK, 0.0 to 0.5) plotted against ESTIMATE (1.4 to 1.8); sketch of Y against X contrasting points with smaller and larger Cook’s distance]
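
The three diagnostics can be sketched in Python as below. This is an illustration under stated assumptions, not the calculation used by the package that produced the plots: the function name `outlier_diagnostics` is hypothetical, the residuals are internally Studentized, Cook's distance uses the common textbook form with p = 2 parameters, and the data are the wing-length pairs read from the example table in these slides.

```python
from math import sqrt
from statistics import mean

def outlier_diagnostics(x, y):
    """Return (leverage, studentized residuals, Cook's distances) for a simple regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ms_e = sum(e ** 2 for e in resid) / (n - 2)

    # Leverage depends on X only: points far from the mean of X score higher.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Internally Studentized residuals (assumed definition).
    student = [e / sqrt(ms_e * (1 - h)) for e, h in zip(resid, leverage)]
    # Cook's distance combines residual size and leverage (p = 2 parameters).
    cook = [t ** 2 * h / (2 * (1 - h)) for t, h in zip(student, leverage)]
    return leverage, student, cook

age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

leverage, student, cook = outlier_diagnostics(age, wing)
n = len(age)
# Apply the rules of thumb from the slides: |residual| > 3, leverage > 4/N, Cook's D > 1.
flagged = [i for i in range(n)
           if abs(student[i]) > 3 or leverage[i] > 4 / n or cook[i] > 1]
print(flagged)
```

The leverages of a simple regression sum to 2 (one per fitted parameter), which is a quick check that the calculation is wired correctly.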

Resolving outlier problems

• Do they have a significant effect on regression results?

• To determine, delete them, rerun analyses and compare results.

• Are slope and intercept estimates significantly affected, or do they still lie within the 95% CIs of the original estimates?

[Figure: regression lines of Y on X fitted with outliers in and with outliers out, for a case with no significant effect and a case with a significant effect]

The effects of outlier deletion

• Reduces sample size (N), thereby reducing power.

• Decreases MSe, so $s_b$ decreases, and power increases.

• If N is small, the former effect will probably outweigh the latter unless outliers are very aberrant.

[Figure: power (1 − β), from 0 to 1, compared for smaller versus larger N ($s_b$ fixed) and for larger versus smaller $s_b$ (N fixed)]

Inverse prediction

• Regression of Y on X, but want to predict X, given Y.

• Regression of X on Y not possible due to error in Y.

• e.g. calibration curves: want to predict concentration from reading, based on regression of reading on known solute concentrations.

[Figure: reading plotted against concentration, and concentration plotted against reading, where the error is in “X”]

Inverse prediction

• Regress Y on X.

• Generate predicted value of X given Y.

• Calculate 95% confidence limits for “X” estimate based on 95% confidence limits for “Y” estimate from standard regression.

[Figure: predicted “X” for a given Y, with lower and upper 95% limits]

Regression with replication

• When several Y’s are measured for each X.

• In this case, we can test the linearity assumption directly by testing the MS due to deviations from linearity over MS within groups.

Sums of squares ($\bar{Y}_i$ = mean of the Y’s measured at $X_i$; $\hat{Y}_i$ = regression estimate at $X_i$):

Within-group SS: $\sum_i\sum_j (Y_{ij}-\bar{Y}_i)^2$
Group SS: $\sum_i\sum_j (\bar{Y}_i-\bar{Y})^2$
Regression SS: $\sum_i\sum_j (\hat{Y}_i-\bar{Y})^2$
SS due to nonlinearity: $\sum_i\sum_j (\bar{Y}_i-\hat{Y}_i)^2$

[Figure: partition of the total SS into regression SS, SS due to nonlinearity and within-group SS, with brackets labelled Group SS and Error SS]

Weighted regression

• Used when our confidence in the values of individual observations varies, e.g. different measurement error, precision.

• In replicated designs, variance of Y for given X may vary among X’s, as may sample size (N).

• So, weight by N or inverse of sample variance.

Regression caveats I: causation

• A statistically significant regression of Y on X need not imply a causal relationship between the two.

• A non-significant linear regression need not imply the lack of a causal relationship if the causal relationship is non-linear.

[Figures: diagram relating Z, X and Y; plot of Y against X labelled “Accept linear H0”]

Regression caveats II: small samples

• Significant regressions can be obtained by chance, i.e. even when no (linear) causal relationship exists.

• This is especially true if sample sizes are small.

• So when doing multiple simple regressions, control the experiment-wise error rate ($\alpha_e$).

[Figure: Y against X; true regression (H0 accepted) and sample regression (H0 rejected)]

Regression caveats III: large samples

• When N is large, only very small regression coefficients are required to reject H0 (power is large).

• So, be careful of “overinterpreting” the observed relationship if R² is small.

[Figure: Y against X; true regression (H0 rejected but R² small)]

Regression caveats IV: extrapolation and interpolation

• Be careful when (1) predictions lie outside range of sample; (2) when predictions are for values where data are sparse.

[Figures: Y against X showing the estimated and true relations, and the predicted and true values relative to the observations]

The final word on extrapolation

In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-six miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing rod. And by the same token, any person can see that seven hundred and forty-two years from now, the lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen.

Mark Twain, Life on the Mississippi

Power and sample size in simple linear regression

• Because the correlation coefficient r and the regression coefficient b are closely related, i.e.

$r = b\,\dfrac{s_X}{s_Y}$

• … we can transform b to r and evaluate power using r.

Power and sample size in regression

• If we test H0: b = 0 with sample size n, we can determine 1 − β by calculating the z-transformed values for the critical value of the corresponding r (at specified α) ($z_\alpha$) and the sample regression coefficient b ($z_r$), and the one-tailed probability of the normal deviate:

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3}$

$z_r = 0.5\ln\left(\dfrac{1+r}{1-r}\right), \qquad z_\alpha = 0.5\ln\left(\dfrac{1+r_\alpha}{1-r_\alpha}\right)$

Power and sample size in regression

• Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. β.

• Power is then 1 − β.

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3}$

[Figure: probability distribution of Z, with $Z_{\beta(1)}$ marked]

Power and sample size in regression: an example

• Changes in wing length with age in a sample of 13 birds

Age (days)   Wing length (cm)
 3           1.4
 4           1.5
 5           2.2
 6           2.4
 8           3.1
 9           3.2
10           3.2
11           3.9
12           4.1
14           4.7
15           4.5
16           5.2
17           5.0

n = 13, ν = 11, b = .270, r = .987; $r_{.05(2),11}$ = .553

$z_\alpha = 0.5\ln\left(\dfrac{1.553}{.447}\right) = .623, \qquad z_r = 0.5\ln\left(\dfrac{1.987}{.013}\right) = 2.515$

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3} = (2.515 - .623)\sqrt{13-3} = 5.98$

$P(Z \ge 5.98) < .0001$

• So 1 − β = 1.00.
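
The power calculation can be sketched in a few lines of Python. The function name `power_from_r` is illustrative; `atanh(r)` is the same transformation as 0.5 ln((1 + r)/(1 − r)), and the critical values of r are taken from a table rather than computed.

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_from_r(r, r_crit, n):
    """Approximate power of the test of H0: no relationship, from the sample r.

    r_crit is the tabulated critical value of r at the chosen alpha and n - 2 df.
    """
    z_r = atanh(r)            # z-transform of the sample correlation
    z_alpha = atanh(r_crit)   # z-transform of the critical correlation
    z_beta = (z_r - z_alpha) * sqrt(n - 3)
    beta = 1 - NormalDist().cdf(z_beta)   # P(Z >= z_beta)
    return 1 - beta

# Regression example from the slides: r = .987, critical r = .553, n = 13.
power_regression = power_from_r(0.987, 0.553, 13)
# Correlation example from the slides: r = .870, critical r = .576, n = 12.
power_correlation = power_from_r(0.870, 0.576, 12)
print(round(power_regression, 4), round(power_correlation, 4))
```

The same function serves both the regression example here and the correlation example later in the slides, since the regression case is evaluated through r.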

Minimal sample size in regression

• Given desired power 1 − β, how large a sample is required to reject H0: b = 0 if it is false and the true regression coefficient is at least $b_0$?

• To do so, first calculate the correlation coefficient $\rho_0$ corresponding to $b_0$:

$\rho_0 = b_0\,\dfrac{s_X}{s_Y}$

[Figure: Y against X; observed line, line expected under H0: b = 0, and true regression ($b_0$); “Reject H0?”]

Minimal sample size in regression (cont’d)

• …then calculate:

$n_{min} = \left(\dfrac{Z_{\beta(1)} + Z_\alpha}{z_0}\right)^2 + 3, \qquad z_0 = 0.5\ln\left(\dfrac{1+\rho_0}{1-\rho_0}\right)$

[Figure: Y against X; observed line, line expected under H0: b = 0, and true regression ($b_0$); “Reject H0?”]

Minimal sample size: an example

• We want to reject H0: b = 0 99% of the time when $b_0$ > 0.2 and α(2) = .05.

• So β(1) = .01 and $Z_{\beta(1)} = 2.326$, $Z_{\alpha(2)} = 1.96$

• For $b_0$ = .20, we have...

n = 13, ν = 11, $s_X$ = 4.67, $s_Y$ = 1.28

$\rho_0 = b_0\,\dfrac{s_X}{s_Y} = 0.20\left(\dfrac{4.67}{1.28}\right) = .730$

Age (days)   Wing length (cm)
 3           1.4
 4           1.5
 5           2.2
 6           2.4
 8           3.1
 9           3.2
10           3.2
11           3.9
12           4.1
14           4.7
15           4.5
16           5.2
17           5.0

Minimal sample size (cont’d)

• So…

$z_0 = 0.5\ln\left(\dfrac{1.730}{.270}\right) = .929$

• …and

$n_{min} = \left(\dfrac{Z_{\beta(1)} + Z_\alpha}{z_0}\right)^2 + 3 = \left(\dfrac{2.326 + 1.96}{.929}\right)^2 + 3 = 24.3$

• So, a sample size of at least 25 should be used.

Correlation

The underlying principle of correlation analysis

Measuring the strength of a correlation

Assumptions

Confidence intervals and hypothesis testing

Comparing correlations

Non-parametric correlations

Power in correlation analysis

The underlying principle of correlation analysis

• Measures the extent to which two variables covary, in particular, the strength of the linear association between them.

• No implied causal relationship, therefore there is no distinction between dependent and independent variables.

[Figure: scatterplot of X2 against X1]

When do we use correlation?

• Do use it to determine the strength of association between two variables.

• Do not use it if you want to predict the value of X given Y, or vice versa.

[Figure: correlation (X2 against X1) contrasted with regression (Y against X)]

Simple linear correlation versus simple linear regression

• Calculations are the same.

• In correlation analysis, one must sample randomly both X and Y.

• Correlation deals with association (importance).

• Regression deals with prediction (intensity).

Lab example: fork length and round weight of sturgeon

• Since the two variables are not causally related, use correlation to measure strength of association.

[Figure: Length / Longueur (20 to 70) plotted against Weight / Poids (10 to 90)]

Regression: fork length and age of sturgeon

• The two variables are causally related.

• The relationship between the two provides an estimate of growth rates…

• ...and we can use the relationship to predict the size of sturgeon of a given age.

[Figure: Length / Longueur (20 to 70) plotted against Age (0 to 50)]

Measuring the strength of a correlation

• Test statistic is the product-moment correlation coefficient r.

$r = \dfrac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i-\bar{X})^2\,\sum_{i=1}^{N}(Y_i-\bar{Y})^2}} = \dfrac{Cov(X,Y)}{s_X s_Y}$

[Figure: scatterplot of X2 against X1]

Measuring the strength of a correlation

• r always lies between -1 and 1.

• r² is the coefficient of determination, which measures the proportion of the variance in X1 (or X2) “explained” by variation in X2 (or X1).

[Figure: scatterplots of X2 against X1 for r = 0.9, 0.5, 0, 0, −0.5 and −0.9]

Assumptions of correlation analysis I: Bivariate normality

• For each value of X1, X2 values are normally distributed, and vice versa.

[Figure: bivariate distributions for r = 0.8 and r = 0]

Assumptions of correlation analysis II: Homoscedasticity

• The variance of X1 is independent of the value of X2, and vice versa.

• But the variances of X1 and X2 need not be equal.

[Figure: scatterplots of X2 against X1 for homoscedastic and heteroscedastic data]

Assumptions of correlation analysis III: Linearity

• The relationship between X1 and X2 is linear.

[Figure: scatterplots of X2 against X1 for linear and nonlinear relationships]

Violation of assumptions: fork length and age of sturgeon

• Relationship between fork length and age appears non-linear.

• Variance in fork length appears to increase with age.

[Figure: Length / Longueur (20 to 70) plotted against Age (0 to 50)]

If parametric correlation assumptions aren’t met...

• Try transforming the data (e.g. log transform).

• Try a non-parametric correlation analysis.

[Figure: Log (Length/Taille) (1.3 to 1.8) plotted against Log (Age) (0.8 to 1.8)]

Confidence intervals for correlation coefficients

• Confidence limits for the z-transformed correlation are given by:

$z \pm t_{\alpha/2,\,\infty}\;\sigma_z, \qquad \sigma_z = \sqrt{\dfrac{1}{N-3}}$

• Convert back to untransformed CI by:

$r = \dfrac{e^{2z}-1}{e^{2z}+1}$

[Figure: scatterplots of X2 against X1 giving a smaller CI and a larger CI]
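
A minimal Python sketch of this interval, assuming the normal deviate is used for the critical value (t with infinite df). The function name `correlation_ci` is illustrative; `atanh` and `tanh` are the z-transformation and its back-transformation, and the example values r = .870 and N = 12 come from the power example later in the slides.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_ci(r, n, conf=0.95):
    """Confidence limits for a correlation via the z-transformation."""
    z = atanh(r)                       # 0.5 * ln((1 + r) / (1 - r))
    sigma_z = sqrt(1 / (n - 3))        # standard error of z
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    lower_z, upper_z = z - z_crit * sigma_z, z + z_crit * sigma_z
    # tanh(z) equals (e^(2z) - 1) / (e^(2z) + 1), the back-transformation.
    return tanh(lower_z), tanh(upper_z)

lower, upper = correlation_ci(0.870, 12)
print(f"95% CI for r: {lower:.3f} to {upper:.3f}")
```

The interval is asymmetric around r on the original scale, which is the reason for working on the z scale in the first place.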

Hypothesis testing I

• H0: ρ = 0

• Standard error of correlation coefficient given by:

$s_r = \sqrt{\dfrac{1-r^2}{N-2}}$

• Calculate $t = r / s_r$

• … and compare to t-distribution with N - 2 df.

[Figure: scatterplots of X2 against X1 with observed and expected relations, for cases where H0 is rejected and accepted]

Hypothesis testing II

• H0: ρ = $\rho_0$

• Transform r and $\rho_0$ to:

$z_r = 0.5\ln\left(\dfrac{1+r}{1-r}\right), \qquad z_{\rho_0} = 0.5\ln\left(\dfrac{1+\rho_0}{1-\rho_0}\right)$

• Calculate

$Z = \dfrac{z_r - z_{\rho_0}}{\sigma_z}, \qquad \sigma_z = \sqrt{\dfrac{1}{N-3}}$

• … and compare to the standard normal (Z) distribution.

[Figure: scatterplots of X2 against X1 with observed and expected relations, for cases where H0 is rejected and accepted]

Comparing 2 correlations

• H0: $\rho_1 = \rho_2$

• Transform $r_1$ and $r_2$ to:

$z_1 = 0.5\ln\left(\dfrac{1+r_1}{1-r_1}\right), \qquad z_2 = 0.5\ln\left(\dfrac{1+r_2}{1-r_2}\right)$

• Calculate

$Z = \dfrac{z_1 - z_2}{\sqrt{\dfrac{1}{N_1-3} + \dfrac{1}{N_2-3}}}$

• … and compare to Z distribution.

[Figure: scatterplots of X2 against X1 for two samples with correlations $r_1$ and $r_2$, for cases where H0 is rejected and accepted]
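
The comparison can be sketched in Python as follows. The function name `compare_two_correlations` is hypothetical, and the inputs are the two samples from the final example in these slides (r = .78 with n = 98, r = .84 with n = 95).

```python
from math import atanh, sqrt
from statistics import NormalDist

def compare_two_correlations(r1, n1, r2, n2):
    """Return (Z, two-tailed p) for H0: the two population correlations are equal."""
    z1, z2 = atanh(r1), atanh(r2)                  # z-transformed correlations
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))         # standard error of z1 - z2
    z_stat = (z1 - z2) / se
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_two_tailed

z_stat, p_value = compare_two_correlations(0.78, 98, 0.84, 95)
print(f"Z = {z_stat:.2f}, p = {p_value:.3f}")
```

For these two samples the difference is not significant at α = .05, which fits with the low power reported for the same comparison in the final example.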

Comparing multiple correlations

• H0: $\rho_i = \rho_j = \rho_k = \ldots$ based on $n_i, n_j, n_k, \ldots$ observations

• Z transform all $r_i$s to $z_i$s and calculate

$\chi^2 = \sum_{i=1}^{k}(n_i-3)\,z_i^2 - \dfrac{\left[\sum_{i=1}^{k}(n_i-3)\,z_i\right]^2}{\sum_{i=1}^{k}(n_i-3)}$

• … and compare to χ² distribution with df = k - 1.

[Figure: scatterplots of X2 against X1 for three samples with correlations $r_1$, $r_2$ and $r_3$, for cases where H0 is rejected and accepted]

Computing common correlations

• If H0: $\rho_i = \rho_j = \rho_k = \ldots$ is accepted, then each $r_i$ estimates the same (population) correlation ρ.

• To calculate ρ, first calculate weighted Z-score $z_w$:

$z_w = \dfrac{\sum_{i=1}^{k}(n_i-3)\,z_i}{\sum_{i=1}^{k}(n_i-3)}$

• Then back-transform to get

$r_w = \dfrac{e^{2z_w}-1}{e^{2z_w}+1}, \qquad \text{since } z_w = 0.5\ln\left(\dfrac{1+r_w}{1-r_w}\right)$

[Figure: scatterplot of X2 against X1 for three samples with correlations $r_1$, $r_2$ and $r_3$, where H0 is accepted]
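
A short Python sketch of the homogeneity statistic and the common correlation. The function name `common_correlation` is illustrative, and the two samples used are those of the final example in these slides; the χ² value would be compared with a tabulated critical value, which is not computed here.

```python
from math import atanh, tanh

def common_correlation(rs, ns):
    """Return (chi-square for homogeneity, weighted z, back-transformed common r)."""
    zs = [atanh(r) for r in rs]            # z-transform each correlation
    ws = [n - 3 for n in ns]               # weights n_i - 3
    sum_w = sum(ws)
    sum_wz = sum(w * z for w, z in zip(ws, zs))
    chi2 = sum(w * z ** 2 for w, z in zip(ws, zs)) - sum_wz ** 2 / sum_w
    z_w = sum_wz / sum_w                   # weighted mean of the z values
    return chi2, z_w, tanh(z_w)            # tanh back-transforms z to r

chi2, z_w, r_common = common_correlation([0.78, 0.84], [98, 95])
print(f"chi-square = {chi2:.3f} (df = 1), common r = {r_common:.3f}")
```

With only two samples the χ² statistic equals the square of the Z statistic from the two-correlation comparison, so the two tests give the same answer.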

Non-parametric correlations

• Use when one or more assumptions are not met.

• Essentially a parametric correlation of the ranks.

• Most common statistic is Spearman rank correlation.

$r_S = 1 - \dfrac{6\sum_{i=1}^{N}(R_{X_1,i} - R_{X_2,i})^2}{N^3 - N}$

[Figure: X2 plotted against X1, and rank of X2 plotted against rank of X1]
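
The Spearman formula can be sketched in Python as below. The helper names `average_ranks` and `spearman` are hypothetical, the five data pairs are made up for illustration, and the rank-difference formula is exact only when there are no ties (tied values get average ranks here, which makes the result approximate).

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x1, x2):
    """Spearman rank correlation from the sum of squared rank differences."""
    n = len(x1)
    r1, r2 = average_ranks(x1), average_ranks(x2)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n ** 3 - n)

# Made-up data with no ties: two adjacent pairs of ranks are swapped.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
r_s = spearman(x1, x2)
print(r_s)
```

Because only the ranks enter the calculation, any monotonic transformation of either variable leaves the statistic unchanged.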

Power and sample size in correlation

• If we test H0: ρ = 0 with sample size n, we can determine 1 − β by using the Z-transformation for critical values (for given α) of the true correlation ($z_\alpha$) and sample correlation r ($z_r$).

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3}$

[Figure: scatterplot of X2 against X1; probability distribution of Z]

Power and sample size in correlation

• Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. β.

• Power is then 1 − β.

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3}$

[Figure: scatterplot of X2 against X1; probability distribution of Z]

Power and sample size in correlation: an example

• Correlation of wing length and tail length of a sample of 12 birds

Wing length (cm)   Tail length (cm)
10.4               7.4
10.8               7.6
11.1               7.9
10.2               7.2
10.3               7.4
10.2               7.1
10.7               7.4
10.5               7.2
10.8               7.8
11.2               7.7
10.6               7.8
11.4               8.3

n = 12, ν = 10, r = .870; $z_r$ = 1.333, $r_{.05(2),10}$ = .576

$z_\alpha = 0.5\ln\left(\dfrac{1.576}{.424}\right) = .656, \qquad z_r = 0.5\ln\left(\dfrac{1.87}{.13}\right) = 1.333$

$Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3} = (1.333 - .656)\sqrt{12-3} = 2.03$

$P(Z \ge 2.03) = .0212$

• so 1 − β = 0.98

Minimal sample size

• Given desired power 1 − β, how large a sample is required to reject H0: ρ = 0 if it is false with a specified $\rho_0$?

• Calculate:

$n_{min} = \left(\dfrac{Z_{\beta(1)} + Z_\alpha}{z_0}\right)^2 + 3, \qquad z_0 = 0.5\ln\left(\dfrac{1+\rho_0}{1-\rho_0}\right)$

[Figure: scatterplots of X2 against X1 with observed and expected relations; “Reject H0?”]

Minimal sample size: an example

• We want to reject H0: ρ = 0 99% of the time when |ρ| > 0.5 and α(2) = .05.

• So β(1) = .01 and $Z_{\beta(1)} = 2.326$, $Z_{\alpha(2)} = 1.96$

• for ρ = .50, we have...

$z_0 = 0.5\ln\left(\dfrac{1.5}{.5}\right) = .549$

• Hence

$n_{min} = \left(\dfrac{Z_{\beta(1)} + Z_\alpha}{z_0}\right)^2 + 3 = 63.9$

• So, a sample size of at least 64 should be used.
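
The same arithmetic as a small Python sketch. The function name `min_sample_size` is illustrative, and the normal deviates are computed with `NormalDist.inv_cdf` rather than read from a table, so the result can differ from a hand calculation in the last decimal.

```python
from math import atanh, ceil
from statistics import NormalDist

def min_sample_size(rho0, power, alpha_two_tailed):
    """Smallest n expected to reject H0: rho = 0 with the given power when |rho| = rho0."""
    z_beta = NormalDist().inv_cdf(power)                      # one-tailed deviate for beta
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_tailed / 2)  # two-tailed deviate for alpha
    z0 = atanh(rho0)                                          # z-transform of rho0
    return ((z_beta + z_alpha) / z0) ** 2 + 3

n_min = min_sample_size(0.5, 0.99, 0.05)
print(f"n_min = {n_min:.1f}, so use at least {ceil(n_min)}")
```

The result is rounded up, since a fractional sample size below the computed value would fall short of the requested power.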

Power and sample size in comparing 2 correlations

• Power of a test for difference between two correlation coefficients is 1 − β, where β is one-tailed probability of:

$Z_{\beta(1)} = \dfrac{|z_1 - z_2|}{\sigma_{z_1-z_2}} - Z_\alpha, \qquad \sigma_{z_1-z_2} = \sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}$

[Figure: scatterplots of X2 against X1 for two samples with correlations $r_1$ and $r_2$, for cases where H0 is rejected and accepted]

An example

• What is power to detect a difference?

Statistic   Sample 1   Sample 2
r           .78        .84
n           98         95
z           1.045      1.221

$\sigma_{z_1-z_2} = 0.146$

$Z_\alpha = Z_{.05(2)} = 1.96$

$Z_{\beta(1)} = \dfrac{|z_1 - z_2|}{\sigma_{z_1-z_2}} - Z_\alpha = -0.76$

• From table of normal deviates,

$\beta = P(Z \ge -0.76) = 1 - P(Z \ge 0.76) = .78$

• So, power = 0.22