1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due...

60
1 Experimental Experimental Statistics Statistics - week 10 - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday

Transcript of 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due...

Page 1: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

1

Experimental StatisticsExperimental Statistics - week 10 - week 10Experimental StatisticsExperimental Statistics - week 10 - week 10

Chapter 11:

Linear Regression and Correlation

Note:

Homework Due Thursday

Page 2: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

2

( )ijk i j i k ik ijky

2-Factor with Repeated Measure -- Model

type subject within type

timetype by time interaction

NOTES: type and time are both fixed effects in the current example

- we say “subject is nested within type”

- Expected Mean Squares given on page 1032

Page 3: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

3

The GLM ProcedureDependent Variable: conc Sum ofSource DF Squares Mean Square F Value Pr > FModel 17 57720.50000 3395.32353 110.87 <.0001Error 32 980.00000 30.62500Corrected Total 49 58700.50000

R-Square Coeff Var Root MSE conc Mean 0.983305 6.978545 5.533986 79.30000

Source DF Type III SS Mean Square F Value Pr > F

type 1 40.50000 40.50000 1.32 0.2587subject(type) 8 3920.00000 490.00000 16.00 <.0001time 4 34288.00000 8572.00000 279.90 <.0001type*time 4 19472.00000 4868.00000 158.96 <.0001

2-Factor Repeated Measures – ANOVA Output

Page 4: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

4

2-factor Repeated Measures

Source Type III Expected Mean Square

type Var(Error) + 5 Var(subject(type)) + Q(type,type*time) subject(type) Var(Error) + 5 Var(subject(type)) time Var(Error) + Q(time,type*time) type*time Var(Error) + Q(type*time)

The GLM Procedure Tests of Hypotheses for Mixed Model Analysis of Variance

Dependent Variable: conc Source DF Type III SS Mean Square F Value Pr > F* type 1 40.500000 40.500000 0.08 0.7810 Error 8 3920.000000 490.000000Error: MS(subject(type))* This test assumes one or more other fixed effects are zero. Source DF Type III SS Mean Square F Value Pr > F subject(type) 8 3920.000000 490.000000 16.00 <.0001* time 4 34288 8572.000000 279.90 <.0001 type*time 4 19472 4868.000000 158.96 <.0001 Error: MS(Error) 32 980.000000 30.625000

Page 5: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

5

NOTE: Since time x type interaction is significant, and since these are fixed effects we DO NOT test main effects

– we compare cell means (using MSE)

.0252 2(30.6250)

(32) 2.042 7.1475 5

MSELSD t

.5 1 2 3 4 C 37 63 85 140 76 T 55 81 134 80 42

Cell Means

Page 6: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

6

The write-up related to the SAS output should be something like the following. Note, that even though we get a significant variance component due to subject(within group) I did not estimate the variance component itself. (I did not give this particular variance component estimation formula.)

Note also that since there is a significant interaction between the fixed effects type and time, we do not test the main effects.

Page 7: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

7

Dealing with Normality/Equal Variance Issues

Normalizing Transformations: - log

- square root

- Box-Cox transformations

1, 0

xw

Note: the normalizing transformations sometimes also produce variance stabilization

Page 8: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

8

Nonparametric “ANOVA”

Man-Whitney U – for comparing 2 samples

Kruskal-Wallis Test – for comparing >2 samples

Friedman’s Test – nonparametric alternative to randomized complete block/ 1-factor repeated measures design

Page 9: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

HistogramHistogram

• displays distribution of 1 variable

Scatter Diagram (Scatterplot)Scatter Diagram (Scatterplot)

• displays joint distribution of 2 variables

• plots data as “points” in the “x-y plane.”

Page 10: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

10

Page 11: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

11

Page 12: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Association Between Two Association Between Two VariablesVariables

– indicates that knowing one helps in predicting the other

Linear Association– our interest in this course– points “swarm” about a line

Correlation Analysis– measures the strength of linear association

Page 13: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

13

Page 14: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

14

(association)(association)

Page 15: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Regression AnalysisRegression Analysis

We want to predict the dependent variable

- response variable

using the independent variable

- explanatory variable

- predictor variable

DependentDependentVariableVariable

(Y)(Y)

Independent Variable (X)Independent Variable (X)

More than one independent variable – Multiple Regression

Page 16: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

16

11.7 Correlation Analysis11.7 Correlation Analysis11.7 Correlation Analysis11.7 Correlation Analysis

Page 17: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Correlation CoefficientCorrelation Coefficient

- measures - measures linearlinear association association

-1 0 +1

perfect no perfect

negative linear positiverelationship relationship relationship

Denoted or yxr r

Page 18: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Positive CorrelationPositive Correlation - - high values of one variable are associated with- - high values of one variable are associated with

high values of the other high values of the other

Examples:• - father’s height, son’s height

• - daily grade, final grade

• r = 0.93 for plot on the left

1 2 3 4 5 6 7 81 2 3 4 5 6 7 8

33

22

11

00

Page 19: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

19

EXAMS I and II

Page 20: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Negative CorrelationNegative Correlation - -- - high with low, low with highhigh with low, low with high

Examples: - car age, selling price

- days absent, final grade

• r = - 0.89 for plot shown here

1 2 3 4 5 6 71 2 3 4 5 6 7

44

33

22

11

00

Page 21: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

21

Page 22: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Zero CorrelationZero Correlation - - no linear relationship- - no linear relationship

Examples:• - height, IQ score

• r = 0.0 for plot here

1 2 3 4 5 6 71 2 3 4 5 6 7

55

44

33

22

11

00

Page 23: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

23

Page 24: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

24

-.75, 0, .5, .99-.75, 0, .5, .99

Page 25: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

25

Page 26: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

26

1

2 2

1 1

( )( )

( ) ( )

n

i ii

yx n n

i ii i

x x y y

r

x x y y

Calculating the Correlation Coefficient

Page 27: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

27

2

1

( )

n

xx ii

S x x

Notation:

2

1

( )

n

yy ii

S y y

1

( )( )

n

xy i ii

S x x y y

2 2

1 1

( ) /

n n

i ii i

y y n

So --

xyyx

xx yy

Sr

S S

2 2

1 1

( ) /

n n

i ii i

x x n

1 1 1

( )( ) /

n n n

i i i ii i i

x y x y n

Page 28: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

28

Study Time Exam (hours) Score

(X) (Y)

10 92

15 81

12 84

20 74

8 85

16 80

14 84

22 80

The data below are the study times and the test scores on an exam given over the material covered during the two weeks.

Find r

1

2

1

1

2

1

1

117

1,869

660

54,638

9,519

n

ii

n

ii

n

ii

n

ii

n

i ii

x

x

y

y

x y

Page 29: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

29

DATA one;INPUT time score;DATALINES;10 9215 8112 8420 748 8516 8014 8422 80;PROC CORR; Var score time; TITLE ‘Study Time by Score';RUN;PROC PLOT;PLOT time*score;RUN;PROC GPLOT;PLOT time*score;RUN;

Page 30: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

30

The CORR Procedure 2 Variables: score time

Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum score 8 82.50000 5.18239 660.00000 74.00000 92.00000 time 8 14.62500 4.74906 117.00000 8.00000 22.00000

Pearson Correlation Coefficients, N = 8 Prob > |r| under H0: Rho=0

score time score 1.00000 -0.77490 0.0239

time -0.77490 1.00000 0.0239

Study Time by Score

Page 31: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

31

Plot of score*time. Legend: A = 1 obs, B = 2 obs, etc.score ‚ ‚ 92 ˆ A ‚ 91 ˆ ‚ 90 ˆ ‚ 89 ˆ ‚ 88 ˆ ‚ 87 ˆ ‚ 86 ˆ ‚ 85 ˆ A ‚ 84 ˆ A A ‚ 83 ˆ ‚ 82 ˆ ‚ 81 ˆ A ‚ 80 ˆ A A ‚ 79 ˆ ‚ 78 ˆ ‚ 77 ˆ ‚ 76 ˆ ‚ 75 ˆ ‚ 74 ˆ A ‚ Šƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒˆƒƒ 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 time

Page 32: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

32

Page 33: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

33

0 : 0 : 0 aH H

Rejection Region

Test Statistic

2

- 2

1-

nt r

r

t > t2 or t < t/2 df n 2

Testing Statistical Significance of Correlation Coefficient

Page 34: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

34

Correlation Between Study Time Correlation Between Study Time and Scoreand Score

H0: There is No Correlation Between Study Time and ScoreHa: There is a Correlation Between Study Time and Score

Rejection Region

Test Statistic

Conclusion

P-value

Page 35: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

35

Page 36: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

36

• Correlation measures the strength of the linear relationship between two variables.

• Correlation requires that both variables be quantitative.

• r does not change when we change the units of measurement of x, y, or both.

• Correlation makes no distinction between explanatory and response variables.

• The correlation coefficient is not resistant to outliers.

Properties of Correlations

Page 37: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

37

The CORR Procedure

Pearson Correlation Coefficients, N = 20 Prob > |r| under H0: Rho=0 math reading math 1.00000 0.87207 <.0001 reading 0.87207 1.00000 <.0001

Math vs Reading Scores

Page 38: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

38

The CORR Procedure Pearson Correlation Coefficients, N = 20 Prob > |r| under H0: Rho=0 math reading math 1.00000 0.27198 0.2460 reading 0.27198 1.00000 0.2460

Math vs Reading Scores with Outlier

Page 39: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

39

Pearson Correlation Coefficients, N = 14 Prob > |r| under H0: Rho=0 math reading math 1.00000 -0.1973 0.5194 reading -0.1973 1.00000 0.5194

Page 40: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

40

Pearson Correlation Coefficients, N = 14 Prob > |r| under H0: Rho=0 math reading math 1.00000 0.53211 0.0502 reading 0.53211 1.00000 0.0502

Page 41: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

41

Divorce Rate (per 1000)

% in prison on Drug Offenses

Page 42: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

IMPORTANT NOTE:IMPORTANT NOTE:

Correlation Correlation DOES NOTDOES NOT Imply CausationImply Causation

• strong association between 2 variables is not enough to justify conclusions about cause and effect

• best way to get evidence that X causes Y is through a controlled experiment

Page 43: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

43

11.1-5 Regression Analysis11.1-5 Regression Analysis11.1-5 Regression Analysis11.1-5 Regression Analysis

Page 44: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

44

Page 45: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

45

Goal of Regression Analysis:

Predict Y from knowledge of X

For data such as the Father-Son data, it seems reasonable to assume a model of the form

| 0 1Y x x

i.e. the conditional means of Y given x follow a straight line

Page 46: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

46

Alternative mathematical expression for the “regression model”:

0 1y x 2' (0, )where the s (errors) are distributed N

2- i.e. all the errors have the same variance

In practice, we want to estimate this line from the data.

Page 47: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 48: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 49: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 50: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 51: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 52: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Page 53: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

Which line is “closest” to the Which line is “closest” to the points ?points ?

Page 54: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

0

2

4

6

8

10

12

14

0 2 4 6 8

Criterion for measuring “closeness” --- the sum of squared vertical distances from the points to the line

Regression (Least Squares) Line --- the line for which this sum-of-squared distance is a minimum

Page 55: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

55

Notation

Theoretical Model

Regression line

0 1y x

0 1ˆ ˆy x

0 1ˆ ˆ and are called "least-squares estimates"

Page 56: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

56

Data1 1

2 2

n n

x y

x y

x y

0 1ˆ ˆˆ i iy x

0 1ˆ ˆ i i iy x e

we write

where is called the th residualie i

ˆi.e. i i ie y y

Page 57: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

57

1

2 20 1

1 1

ˆ ˆ

ˆ ˆ( ) [ ( )]ˆ

0 and are chosen to minimize

n n

i i i ii i

y y y x

NOTE:

- this is a calculus problem

Page 58: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

58

11

2

1

( )( )ˆ

( )

n

i ixyi

nxx

ii

x x y yS

Sx x

Least Squares Estimates

0 1ˆ ˆy x

Computation Formula

1 1 11

2 2

1 1

( )( ) /ˆ

( ) /

n n n

i i i ii i i

n n

i ii i

x y x y n

x x n

Page 59: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

59

Study Time Exam (hours) Score

(X) (Y)

10 92

15 81

12 84

20 74

8 85

16 80

14 84

22 80

The data below are the study times and the test scores on an exam given over the material covered during the two weeks.

Find the equation of the regression line for prediction exam score from study time.

1

2

1

1

2

1

1

117

1,869

660

54,638

9,519

n

ii

n

ii

n

ii

n

ii

n

i ii

x

x

y

y

x y

Page 60: 1 Experimental Statistics - week 10 Chapter 11: Linear Regression and Correlation Note: Homework Due Thursday.

60

The GLM Procedure

Dependent Variable: score Sum ofSource DF Squares Mean Square F Value Pr > FModel 1 112.8883610 112.8883610 9.02 0.0239Error 6 75.1116390 12.5186065Corrected Total 7 188.0000000

R-Square Coeff Var Root MSE score Mean 0.600470 4.288684 3.538164 82.50000

Source DF Type I SS Mean Square F Value Pr > Ftime 1 112.8883610 112.8883610 9.02 0.0239Source DF Type III SS Mean Square F Value Pr > Ftime 1 112.8883610 112.8883610 9.02 0.0239

Standard Parameter Estimate Error t Value Pr > |t| Intercept 94.86698337 4.30408629 22.04 <.0001 time -0.84560570 0.28159265 -3.00 0.0239

PROC GLM;MODEL score=time;RUN;