
Describing the Relation Between Two Variables

Learning Objectives

1. Construct and interpret scatter diagrams
2. Compute and interpret correlation coefficients
3. Compute and interpret least-squares lines
4. Interpret residual plots

Scatter Diagrams; Correlation

Bivariate data are data in which two variables are measured on each individual. Often the purpose is

• to study the relationship between the two variables (the correlation problem), or
• to predict one variable using the other (the least-squares regression problem, also called the simple linear regression problem).

For bivariate problems, we call one variable the response variable, y (also called the dependent variable): the variable whose value can be explained or predicted based upon the value of the predictor variable, x (also called the independent variable).

Examples of Bivariate Data

1. What is the relationship between hand size and height? One needs to collect two variables (hand size and height) from each subject in a sample of n subjects. It is a bivariate study. We can conduct two studies: (a) to find out the relationship between Hand-size and Height; (b) to predict Height using Hand-size and investigate whether Hand-size is a good predictor of Height.

2. Is the weight of a car a good predictor of mileage? One needs to collect the weight and mileage of a sample of n cars in order to study this problem. We can conduct two studies: (a) to find out the relationship between Weight and Mileage; (b) to predict Mileage using Weight and investigate whether Weight is a good predictor of Mileage.

Real-Time Activity: Is Hand-size a good predictor of Height?

How do you measure Hand-size? Measure your Hand-size by measuring Hand Width and Hand Length as discussed in class. Now go to the Real-Time Online activity site at http://stat.cst.cmich.edu/statact/ then go to Data Entry and select Activity: Hand-Size.

Use the Activity Code: To be provided in class

Is Hand size a good predictor of height? Here are 20 cases from previous students.

Row  Gender  Length  Width  Height
  1  female    8.50   9.50    68.5
  2  female    8.40   9.00    68.0
  3  female    7.50   8.00    68.0
  4  female    7.25   8.00    68.0
  5  male      7.40   7.70    70.0
  6  male      7.50   8.75    71.0
  7  female    6.50   7.25    66.0
  8  male      8.00   7.00    68.0
  9  male      8.00   8.75    72.0
 10  male      8.75   9.50    76.8
 11  male      8.00   9.00    71.0
 12  female    6.00   7.50    62.0
 13  male      6.20  11.50    69.0
 14  female    6.50   7.50    61.5
 15  male      8.00   9.50    69.0
 16  female    7.00   8.00    69.0
 17  male      7.00   9.10    72.2
 18  female    6.50   7.50    63.0
 19  female    6.50   7.00    61.0
 20  female    7.25   7.50    63.5

How to demonstrate the relationship between hand length and height?

• Graphical method: scatter diagram.
• Numerical method: correlation coefficient and least-squares regression.

How can we demonstrate the relationship between Hand Length and Height?

Graphical method: a scatter diagram (scatter plot) shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram.

The predictor variable (independent variable) is plotted on the horizontal axis, and the response variable (dependent variable) is plotted on the vertical axis.

Note: Points are not connected in a scatter diagram.

For the example of using Hand Length to predict Height:

What is the response variable? Height

What is the predictor variable? Hand Length

[Figure: Scatter Plot of Height vs. Hand Length]

Scatter plot using Minitab: Go to Graph, choose Scatterplot, choose Simple, select the variable names, then click OK.

[Figure: Scatterplot of height vs hand_length. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]

(1) Is the relation positive?

(2) Is the relation strong?

[Figure: six scatterplots of Y (response) against X (predictor), labeled:
• Positive, perfectly correlated
• Positive, highly correlated
• Positive, moderately correlated
• Nonlinear, no correlation
• Scatterplot of C2 vs C1: nonlinear, positive correlation
• Scatterplot of C3 vs C1: nonlinear, positive correlation]

[Figure: five scatterplots of Y against X, labeled:
• Negative, perfectly correlated
• Negative, highly correlated
• Negative, moderately correlated
• Scatterplot of C6 vs C4: no correlation
• Scatterplot of C5 vs C4: nonlinear, negative correlation]

How can we quantify the correlation?

Numerical Method: Pearson Correlation

The linear correlation coefficient, or Pearson product moment correlation coefficient, is a measure of the strength of the linear relation between two quantitative variables.

NOTATION: We use

• the Greek letter ρ (rho) to represent the population correlation coefficient, and
• r to represent the sample correlation coefficient, where -1 ≤ r ≤ +1.

Properties of the Linear Correlation Coefficient

1. If r = +1 (or -1), there is a perfect positive (negative) linear relation between the two variables.

2. The closer r is to +1 (or -1), the stronger the evidence of positive (negative) association between the two variables.

3. If r is close to 0, there is no evidence of a linear relation between the two variables. Because the linear correlation coefficient is a measure of the linear relation, r close to 0 does not imply no relation, just no linear relation.
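To make property 3 concrete, here is a small Python sketch. The data set and the helper name `pearson_r` are made up for illustration and are not part of the class data; the function uses the usual sums-of-squares form of r. The points lie exactly on the curve y = x², a perfect nonlinear relation, yet r comes out as zero, while points on a straight line give r = 1.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs]
ys_line = [2 * x + 1 for x in xs]   # a perfect straight line, for contrast

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)
print(f"curve: r = {r_curve:.3f}, line: r = {r_line:.3f}")
```

This is why a scatter diagram should always be examined alongside r.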

[Figure: the six scatterplots of Y (response) against X (predictor) shown earlier, now annotated with their correlation coefficients:
• Positive, perfectly correlated: r = +1
• Positive, highly correlated: r = +.9
• Positive, moderately correlated: r = +.4
• Nonlinear, no correlation: r ≈ 0
• Scatterplot of C2 vs C1 and Scatterplot of C3 vs C1 (nonlinear, positive correlation): r = +.8 and r = +.6]

[Figure: the five scatterplots of Y against X shown earlier, now annotated with their correlation coefficients:
• Negative, perfectly correlated: r = -1.0
• Negative, highly correlated: r = -0.9
• Negative, moderately correlated: r = -0.4
• Scatterplot of C6 vs C4 (no correlation): r ≈ 0.0
• Scatterplot of C5 vs C4 (nonlinear, negative correlation): r = -0.8]

Important distinction between association and a cause-and-effect relation between two variables

Examples:

• A study shows there is a strong correlation between Math IQ and foot size for kindergarten children. Does this mean that foot size is the cause of Math IQ for kindergarten children?
• A study shows there is a positive correlation between CEO salary and stock price. Does this mean that CEO salary is the cause of stock price?

For each of the above examples, the relationship is not causal. There is a hidden variable that is related to both variables and is the cause. This hidden variable is often called a 'lurking variable'.

Can you identify the lurking variable for each case?

A strong relationship does not imply cause and effect!

How do we determine the Pearson correlation coefficient, r, for the data of hand size and height?

The following is the scatter plot of Height vs. Hand Length.

[Figure: Scatterplot of height vs hand_length. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]

Computing r

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} $$

where

$$ SS_{xx} = (n-1)\, s_x^2, \qquad SS_{yy} = (n-1)\, s_y^2, \qquad SS_{xy} = \sum xy - n\,\bar{x}\,\bar{y} $$

$s_x^2$ is the sample variance of the x variable. $s_y^2$ is the sample variance of the y variable.

Compute the linear correlation coefficient to quantify the relationship between Height and Hand Length, using the 20 cases (Row, Gender, length, width, height) listed earlier.

EXAMPLE Height & Hand Length: Compute the linear correlation coefficient.

Sample Statistics    N    Mean     Standard Deviation
Hand Length (X)      20   7.338    0.808
Height (Y)           20   67.875   4.056

Use Minitab: Go to Stat, Basic Statistics, Correlation, select Height and Hand-length, then click OK.

The computer result is: r = .668. The correlation is moderately high.
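For readers without Minitab, the same quantities can be checked with a few lines of Python. This is only a sketch: the variable names are our own, and the sums of squares are written in deviation form, which is algebraically the same as the formulas for SSxx, SSyy and SSxy. The data are the 20 hand-length and height values from the class table.

```python
import math

# Hand length (x) and height (y) for the 20 cases in the data table.
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

n = len(hand_length)
x_bar = sum(hand_length) / n
y_bar = sum(height) / n

# Sums of squares about the means.
ss_xx = sum((x - x_bar) ** 2 for x in hand_length)
ss_yy = sum((y - y_bar) ** 2 for y in height)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand_length, height))

s_x = math.sqrt(ss_xx / (n - 1))   # sample standard deviation of x
s_y = math.sqrt(ss_yy / (n - 1))   # sample standard deviation of y
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, r = {r:.3f}")
```

The results should agree with the sample statistics and the Minitab value of r up to rounding.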

Online Applet Activity: Visualizing correlation using scatter plots

Go to the site: http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html

To review the scatter plot and the correlation coefficient:

(a) Create a scatter plot using 10 pairs of data with near-zero correlation.
(b) Create a scatter plot with a nonlinear relation and near-zero correlation using 10 pairs of data.
(c) Create a scatter plot with a nonlinear relation and correlation near .8 using 10 pairs of data.
(d) Create a scatter plot using 10 pairs of data with near-zero correlation, and add one additional point that will greatly increase the correlation.
(e) Create a scatter plot using 10 pairs of data with correlation near one, and add one additional point that will greatly decrease the correlation.

Finding the Least-squares Regression Line

Recall: a mathematical line, y = mx + b

m is the slope: the change in y when x increases by one unit.

b is the intercept: the y-value when x = 0.

Examples:

(1) Graph the line y = 2x - 3 and determine the slope and intercept.

(2) Graph the line y = (-.5)x + 1 and determine the slope and intercept.

(3) Determine the line y = mx + b that passes through the two points (1, 5) and (3, 2).

Ans: First determine the slope: m = (y2-y1)/(x2-x1) = (2-5)/(3-1) = -1.5. The equation is y = (-1.5)x + b. To determine the intercept b, substitute a point, say (1, 5), into the equation: 5 = (-1.5)(1) + b, which gives b = 6.5. So the equation is y = (-1.5)x + 6.5.

(4) Exercise: Determine the line passing through (2, 3) and (4, 9).
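The two-step recipe in example (3), slope first and then intercept, can be written as a short Python function. The name `line_through` is a hypothetical helper for illustration; it is applied here only to the worked points (1, 5) and (3, 2).

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two points with different x."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    m = (y2 - y1) / (x2 - x1)   # slope: change in y over change in x
    b = y1 - m * x1             # intercept: solve y1 = m*x1 + b for b
    return m, b

m, b = line_through((1, 5), (3, 2))
print(m, b)  # -1.5 6.5
```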

How do we determine a line that can be used to predict Height using Hand Length?

That is, to determine a line

$$ \hat{y} = b_1 x + b_0 $$

where $b_1$ is the slope and $b_0$ is the intercept.

An intuitive approach: draw your best-guess line, pick two points on it, and then obtain the equation of the line from the two points you chose.

[Figure: Scatterplot of height vs hand_length. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]

Use Fathom to Demonstrate the Least-Squares Method

Predictions, Residuals, Sum of Squares of Residuals

Problem: How well can Hand_size predict Height?

Data: Hand_size_20cases

What is a Residual?

The difference between the observed value of y and the predicted value of y is the error, or residual. That is,

residual = observed - predicted

Notation:

$$ e_i = y_i - \hat{y}_i $$

The line is $\hat{y} = b_1 x + b_0$, where $b_1$ is the slope and $b_0$ is the intercept.

One way to determine the best line is to find $b_1$ and $b_0$ so that the sum of the squared residuals is the smallest.

[Figure: Scatterplot of height vs hand_length with the line ŷ = b1x + b0 drawn and the residuals e1, e2, ..., e20 marked. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]
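The least-squares idea can be tried numerically. The sketch below compares the sum of squared residuals for a "best guess" line with that of the least-squares line on the 20 hand-size cases. The guess line, drawn through two of the data points, is a hypothetical choice for illustration, and the closed-form slope SSxy/SSxx with intercept ȳ - b1·x̄ is the standard least-squares solution.

```python
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

def sum_squared_residuals(b1, b0):
    """Sum of (observed - predicted)^2 for the line y_hat = b1*x + b0."""
    return sum((y - (b1 * x + b0)) ** 2 for x, y in zip(hand_length, height))

# Hypothetical "best guess" line through two data points,
# (6.00, 62.0) and (8.75, 76.8).
guess_b1 = (76.8 - 62.0) / (8.75 - 6.00)
guess_b0 = 62.0 - guess_b1 * 6.00

# Least-squares line: slope SSxy / SSxx, intercept y_bar - slope * x_bar.
n = len(hand_length)
x_bar = sum(hand_length) / n
y_bar = sum(height) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand_length, height))
ss_xx = sum((x - x_bar) ** 2 for x in hand_length)
ls_b1 = ss_xy / ss_xx
ls_b0 = y_bar - ls_b1 * x_bar

sse_guess = sum_squared_residuals(guess_b1, guess_b0)
sse_ls = sum_squared_residuals(ls_b1, ls_b0)
print(f"guess line:         SSE = {sse_guess:.2f}")
print(f"least-squares line: SSE = {sse_ls:.2f}")
```

Any other choice of slope and intercept can be passed to `sum_squared_residuals`; none will give a smaller sum than the least-squares line.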

NOTE: If we replace r by the formula for computing r, the slope $b_1$ can be obtained by:

$$ b_1 = r\,\frac{s_y}{s_x} = \frac{SS_{xy}}{SS_{xx}} = \frac{SS_{xy}}{(n-1)\, s_x^2} $$

Predicting Height using Hand Length

Sample Statistics    N    Mean                 Standard Deviation
Hand Length (X)      20   7.338 ($\bar{x}$)    0.808 ($s_x$)
Height (Y)           20   67.875 ($\bar{y}$)   4.056 ($s_y$)

(a) Find the least-squares regression line:

$b_1 = r\, s_y / s_x$ = .668 (4.056)/(.808) = 3.353

$b_0 = \bar{y} - b_1 \bar{x}$ = 67.875 - (3.353)(7.338) = 43.27

The regression line is Height = 3.353(Hand Length) + 43.27

(b) Interpret the slope:

For each one-inch increase in Hand Length, the predicted Height increases by 3.353 inches.
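The hand calculation in (a) can be reproduced from the summary statistics alone. A minimal Python sketch, with variable names of our own choosing:

```python
# Summary statistics from the sample statistics table above.
r = 0.668
x_bar, s_x = 7.338, 0.808    # hand length: mean and standard deviation
y_bar, s_y = 67.875, 4.056   # height: mean and standard deviation

b1 = r * s_y / s_x           # slope
b0 = y_bar - b1 * x_bar      # intercept

print(f"Height = {b1:.3f}(Hand Length) + {b0:.2f}")
# Height = 3.353(Hand Length) + 43.27
```

Because r, the means and the standard deviations are already rounded, the coefficients can differ in the last digits from software output computed on the raw data.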

Use Minitab to obtain the regression line: Go to Stat, Regression, Fitted Line Plot, choose Y and X, then click OK.

[Figure: Fitted Line Plot, height = 43.29 + 3.351 hand_length. S = 3.10061, R-Sq = 44.6%, R-Sq(adj) = 41.6%. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]

Predicting Height using Hand Length

(c) Predict the height for individuals with Hand Length 7.5” and 6.5”, respectively.

(d) Draw the least-squares regression line on the scatter diagram of the data.

(e) Compute the residual, y - ŷ: In the data, there is an individual whose Hand Length is 7.5” and whose height is 68”. Find the residual when the model is used to predict this height. Do the same for (6.5”, 63”).

(f) Find the sum of the squared residuals.

(g) Does any other line yield a smaller sum of squared residuals?

Predicting Height using Hand Length: ŷ = 3.351x + 43.29

For each of the 20 cases (Row, Gender, length, width, height) listed earlier, can you find the predicted Height, ŷ, and the corresponding residual, e = y - ŷ?
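One way to fill in the predicted heights and residuals is a short loop. This Python sketch uses the fitted line ŷ = 3.351x + 43.29 and the 20 cases; the function name `predict` and the tuple layout of `rows` are our own choices.

```python
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

def predict(x):
    """Predicted height from the fitted line y_hat = 3.351x + 43.29."""
    return 3.351 * x + 43.29

rows = []   # one tuple per case: (row, length, height, predicted, residual)
print("Row  length  height  predicted  residual")
for i, (x, y) in enumerate(zip(hand_length, height), start=1):
    y_hat = predict(x)
    e = y - y_hat                      # residual = observed - predicted
    rows.append((i, x, y, y_hat, e))
    print(f"{i:>3}  {x:>6.2f}  {y:>6.1f}  {y_hat:>9.2f}  {e:>8.2f}")

sse = sum(row[4] ** 2 for row in rows)
print(f"sum of squared residuals = {sse:.2f}")
```

A negative residual means the line overpredicts that person's height; a positive residual means it underpredicts.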

Should the line be used to predict the Height when the Hand Length = 3” or = 10”?

Do not use a least-squares regression line to make predictions for X values far outside the scope of the model (in this case, the x variable ranges from 6” to 8.75”), because we can't be sure the linear relation continues to exist when hand length < 6” or > 8.75”.
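In code, this warning can be enforced by refusing to predict outside the observed range. The sketch below is one possible guard, not a standard routine; the function name and the choice to raise an error are our own assumptions. It is stricter than "far outside" and simply rejects anything beyond the observed 6” to 8.75”.

```python
X_MIN, X_MAX = 6.0, 8.75   # observed range of hand length in the 20 cases

def predict_height(hand_length):
    """Predict height from y_hat = 3.351x + 43.29, refusing to extrapolate."""
    if not X_MIN <= hand_length <= X_MAX:
        raise ValueError(
            f"hand length {hand_length} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the line should not be used here"
        )
    return 3.351 * hand_length + 43.29

print(round(predict_height(7.5), 2))
for x in (3, 10):
    try:
        predict_height(x)
    except ValueError as err:
        print(err)
```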

Diagnostics on the Least-squares Regression Line

Questions:

(1) How do I know if this is a 'good' model? That is, how much of the information in Height can be explained by Hand Length?

(2) Is there any unusual Height, that is, an outlier in Y, the response variable?

(3) Is there any unusual X value that may dramatically affect the model, that is, an influential case?

[Figure: Fitted Line Plot, height = 43.29 + 3.351 hand_length. S = 3.10061, R-Sq = 44.6%, R-Sq(adj) = 41.6%. Horizontal axis: hand_length (6.0 to 9.0); vertical axis: height (60 to 78).]

Suppose we are asked to predict Height using only the Height data, without any information about the relation between height and hand length. Our 'typical guess' is the average Height:

(1) Use the sample average of Height:

ȳ = 67.875”

When we have the Hand Length information and are asked to predict the Height of an individual whose hand length is 7”, we can apply the model we derived:

(2) Use the least-squares regression line:

ŷ = 3.351(7) + 43.29 = 66.75”

The difference between the two predictions, ŷ - ȳ, is the additional information explained by Hand Length: the explained deviation.

Total Deviation = Unexplained Deviation + Explained Deviation

$$ \underbrace{(y - \bar{y})}_{\text{T.D.}} = \underbrace{(y - \hat{y})}_{\text{U.D.}} + \underbrace{(\hat{y} - \bar{y})}_{\text{E.D.}} $$
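A quick numerical check of this decomposition, as a Python sketch. It uses case 16 from the data (hand length 7.0”, height 69.0”), the sample mean height 67.875” and the line ŷ = 3.351x + 43.29; the variable names are our own.

```python
y_bar = 67.875                 # sample mean height
x, y = 7.0, 69.0               # case 16: hand length 7.0, observed height 69.0
y_hat = 3.351 * x + 43.29      # prediction from the regression line

total = y - y_bar              # total deviation
unexplained = y - y_hat        # unexplained deviation (the residual)
explained = y_hat - y_bar      # explained deviation

print(f"total = {total:.3f}")
print(f"unexplained + explained = {unexplained:.3f} + {explained:.3f} "
      f"= {unexplained + explained:.3f}")
```

The identity holds for every case, because ŷ is simply added and subtracted.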

Total Deviation = Unexplained Deviation + Explained Deviation

Total Variation = Unexplained Variation + Explained Variation

Which is computed as follows:

$$ \sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 $$

Total Sum of Squares = Error Sum of Squares + Regression Sum of Squares

SS(Total) = SS(Error) + SS(Regression)

R²: Coefficient of Determination

$$ 1 = \frac{\text{SS(Total)}}{\text{SS(Total)}} = \frac{\text{SS(Error)}}{\text{SS(Total)}} + \frac{\text{SS(Regression)}}{\text{SS(Total)}} $$

$$ R^2 = \frac{\text{SS(Regression)}}{\text{SS(Total)}} \times 100\% = \left(1 - \frac{\text{SS(Error)}}{\text{SS(Total)}}\right) \times 100\% $$

• SS(Regression), the SS due to regression: the variation explained by the X variable.
• SS(Error), the sum of squared residuals: the variation due to error.

R² is the proportion of variation explained by the X variable. It is called the coefficient of determination: the percent of the variation in the response variable that the predictor variable can explain.

Where do I find the SS(Total), SS(Error) and SS(Regression)?

This information can be easily obtained from computer output.

The regression equation is: Height = 43.29 + 3.351 Hand_length

S = 3.10061   R-Sq = 44.6%   R-Sq(adj) = 41.6%

Analysis of Variance

Source       DF   SS        MS        F       P
Regression    1   139.469   139.469   14.51   0.001
Error        18   173.048     9.614
Total        19   312.518

SS(Total) = SS(Error) + SS(Regression)
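The three sums of squares in the output can also be computed directly from the data. The Python sketch below, with variable names of our own, fits the line to the 20 cases and then forms SS(Total), SS(Error) and SS(Regression); S is taken as the square root of MS(Error), which matches the DF and MS columns of the table.

```python
import math

hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

n = len(height)
x_bar = sum(hand_length) / n
y_bar = sum(height) / n

# Least-squares slope and intercept.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand_length, height))
ss_xx = sum((x - x_bar) ** 2 for x in hand_length)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
predicted = [b1 * x + b0 for x in hand_length]

ss_total = sum((y - y_bar) ** 2 for y in height)
ss_error = sum((y - p) ** 2 for y, p in zip(height, predicted))
ss_regression = sum((p - y_bar) ** 2 for p in predicted)

r_squared = ss_regression / ss_total
s = math.sqrt(ss_error / (n - 2))   # sqrt of MS(Error); Error DF = n - 2 = 18

print(f"SS(Regression) = {ss_regression:.3f}")
print(f"SS(Error)      = {ss_error:.3f}")
print(f"SS(Total)      = {ss_total:.3f}")
print(f"R-Sq = {100 * r_squared:.1f}%, S = {s:.5f}")
```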

The coefficient of determination R² is the percent of variation in the response variable that is explained by variation in the predictor variable.

R² = SS(Regression)/SS(Total) = 1 - SS(Error)/SS(Total)

To determine R² for the simple linear regression model, we can also simply square the linear correlation coefficient: R² = (r²)*100%.

[NOTE: This method does not work for regression equations that have more than 1 predictor variable.]

Determining the Coefficient of Determination for the model: Predicting Height using Hand Length

Find and interpret the coefficient of determination for the model of predicting Height using Hand Length:

R² = (139.469/312.518)*100% = 44.6%

OR: use r = .668

R² = (.668)²*100% = 44.6%

Hand Length can explain 44.6% of the variation in Height.

Some concept questions

Determine whether each of the following statements is true or false:

• If the Pearson coefficient r > 0, then the slope b1 > 0.
• If r = 0, it is possible that b1 > 0.
• If r < 0, then R² < 0.
• If R² = .64 and b1 = 2.35, then r = .8.
• If R² = .64 and b1 = -2.35, then r = .8.

NOTE:

(a) The slope must have the same sign as the correlation coefficient, r.
(b) R² cannot be negative.
(c) To compute r from R² for simple linear regression:

$$ r = \pm\sqrt{R^2} $$

r can be positive or negative, where the sign of r is the same as the sign of b1, the slope.
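The rule in the NOTE, take the square root of R² and attach the sign of the slope, can be sketched in Python. The function name `r_from_r_squared` is a hypothetical helper, and it applies to simple linear regression only.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r: |r| = sqrt(R^2), and r takes the sign of the slope b1."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R^2 must be between 0 and 1")
    if slope == 0:
        return 0.0                      # a flat line means no linear relation
    return math.copysign(math.sqrt(r_squared), slope)

print(r_from_r_squared(0.64, 2.35))
print(r_from_r_squared(0.64, -2.35))
```

The two calls differ only in the sign of the slope, so they return values of equal size and opposite sign.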

How do we know if the model is adequate?

• Is a linear model adequate?
• Are there any outliers in the response variable?
• Are there any influential cases?

All of these questions can be answered by analyzing the residuals, $e_i$.

Online Applet Activity: Visualizing the effects of outliers and influential cases using scatter plots

Go to the site: http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html

(a) Create a scatter plot of 10 cases with high positive correlation. Add one case at the far lower-right corner, and observe the change in the correlation coefficient and in the regression line (pay special attention to the change in the slope). What do you find?

(b) Create 10 pairs of data points with one in the upper-right corner, such that the rest show a high negative correlation and the regression line has almost zero slope. Delete the upper-right corner case, and observe the effect of deleting it on the slope and correlation.

(c) Create 10 pairs of cases with high negative correlation, plus one outlier case whose x-value is around the middle and whose y-value is much higher than the rest. Now delete the outlier case, and observe the change in the slope and correlation.

Useful Residual Plots for Model Diagnosis

1. The residuals vs. the order of the data: If the linear model is adequate, this plot looks random, with no identifiable pattern. A curved pattern indicates that the relationship between x and y is not linear.

[Figure: two residual plots, "Residuals Versus the Order of the Data (response is expor)" and "Scatterplot of residual vs order". One panel is annotated "No specific pattern. Model is adequate."; the other is annotated "A clear nonlinear pattern of the residuals vs. data order. The linear model is not adequate."]

2. Plot Residuals vs. Predicted Y

If the model is adequate, the residuals should show no specific pattern around the zero line. Two common problems can be identified from this plot:

(a) A curved pattern indicates that the relation between Y and X is nonlinear.
(b) Residuals that are far from zero indicate outliers in the response variable.

[Figure: two plots of residuals against fitted values, "Residuals Versus Predicted Y Values (response is Y)" and "Residuals Versus the Fitted Values (response is Y)". One panel is annotated "No unusual pattern. Adequate model."; the other is annotated "Nonlinear pattern. Model is not linear."]

3. A plot of residuals against the predictor variable may also reveal outliers. These values are easy to identify because the residual lies far from the rest of the plot.

[Figure: residual plot with one point annotated "An outlier case".]

The Effect of Influential Observations

An influential observation is one that has a disproportionate effect on the value of the slope and y-intercept in the least-squares regression equation.
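To see how large that effect can be, the Python sketch below refits the hand-size line after adding a single extra case. The added case (hand length 10.5”, height 60”) is hypothetical and is not part of the class data; it is chosen to sit far to the right of the other hand lengths and well below the trend. The helper name `least_squares` is also our own.

```python
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line for the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return b1, y_bar - b1 * x_bar

b1, b0 = least_squares(hand_length, height)

# Hypothetical extra case: unusually large x, unusually small y.
b1_new, b0_new = least_squares(hand_length + [10.5], height + [60.0])

print(f"20 cases:           slope = {b1:.3f}, intercept = {b0:.2f}")
print(f"with the extra case: slope = {b1_new:.3f}, intercept = {b0_new:.2f}")
```

One point out of 21 is enough to change the slope and intercept substantially, which is what "disproportionate" means here.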

Re-visit the Online Applet Activity: Visualizing the effects of outliers and influential cases using scatter plots

Open the activity worksheet:

Online Applet Activity-Outlier&InfluCases(Reg&Corr).

Work with your group to answer the questions asked in the worksheet. We will work on some of the problems in class, and your team will complete the rest afterward. It is due next class period.

If there are outliers or influential cases, how do we deal with them?

As with outliers, influential observations should be removed only if there is justification to do so.

When an influential observation occurs in a data set and its removal is not warranted, there are two courses of action:

(1) Collect more data so that additional points near the influential observation are obtained, or

(2) Use more advanced techniques such as transformations (for example, log transformations).

Activity: How well can Hand_length or Hand_width predict height?

Open the activity worksheet:

Activity-Regression-Hand-size

Work with your group to answer the questions asked in the worksheet.

We will work on some of the problems, and you will complete the rest after class.

It is due next class period.