
Correlation and Regression


Overview

Introduction

Scatter Plots

Correlation

Regression

Coefficient of Determination

Objectives of the topic

1. Draw a scatter plot for a set of ordered pairs.

2. Compute the correlation coefficient.

3. Test the hypothesis H0: ρ = 0.

4. Compute the equation of the regression line.

5. Compute the coefficient of determination.

6. Have a working idea of the concept of multiple regression (it is not examinable).

Introduction

Correlation is a statistical method used to determine whether a linear relationship exists between variables. In this course, reference will be made only to pairwise correlation, i.e. the correlation between only two variables.

Regression is a statistical method used to describe and estimate the nature of the relationship between variables, that is, whether it is positive or negative, linear or nonlinear.

If a linear relationship exists, regression is then used to estimate the equation of that linear relationship.

Correlation and regression therefore work hand in hand: they are complements, not substitutes.

Introduction to correlation and regression

The purpose of this topic is to answer these questions statistically:

1. Are two variables related? Is there a particular observable or logical relationship between the two variables? If one variable increases, what happens to the other variable? Does it increase or decrease? Consider sales and revenue; motivation and worker performance; exchange rate and imports; smoking and lung cancer; weight and blood pressure; interest rate and investment; study time and score.

2. If there is a relationship, what type of relationship exists? Is it positive or negative?

3. What is the strength of the linear relationship? Is it a weak relationship or a strong relationship?

4. What kind of predictions can be made from the relationship?

To answer questions 1, 2 and 3, the correlation coefficient is used. It is a numerical measure that determines whether two variables are related and how strong the relationship between them is.

To answer question 4, regression will be used.

Scatter Plots and Correlation

The first step in correlation is to form a conceptual or logical understanding of the relationship between the two variables. For example, what do you think is the relationship between exchange rate and imports?

The next step is to construct a scatter plot of the data to confirm the conceptual relationship.

A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.

Correlation is symmetric, i.e. it does not matter which of the two variables is treated as dependent and which as independent.

Example 1: Car Rental Companies

Construct a scatter plot for the data shown for car rental companies in the United States for a recent year.

Step 1: Draw and label the x and y axes, and denote one of the variables by X and the other by Y.

Step 2: Plot each point on the graph.

What do you think is the conceptual or logical relationship between the number of cars hired out and the sales revenue made? Positive? Negative? No relationship?

As more cars are hired out, does the sales revenue of the firm decrease or increase?

The scatter plot is shown in the following diagram.

Example 1: Car Rental Companies (scatter plot)

[Figure: scatter plot showing a positive relationship. x-axis: cars (in ten thousands), 10 to 60; y-axis: revenue (in billions), 0 to 8.]

Example 2: Absences/Final Grades

Construct a scatter plot for the data obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class.

Step 1: Draw and label the x and y axes, and denote one of the variables by X and the other by Y.

Step 2: Plot each point on the graph.

Example 2: Absences/Final Grades (scatter plot)

[Figure: scatter plot showing a negative relationship. x-axis: number of absences, 0 to 15; y-axis: final grade (%), 40 to 90.]

Correlation-interpretation

After the scatter plot, the next step is to calculate the correlation coefficient between the variables numerically and to interpret it.

The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.

There are several types of correlation coefficients. The two explained in this course are the Pearson product moment correlation coefficient (PPMCC) and Spearman's rank correlation coefficient.

The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.

The range of the correlation coefficient is from -1 to +1.

If there is a strong positive linear relationship between the variables, the value of r will be close to +1.

If there is a strong negative linear relationship between the variables, the value of r will be close to -1.

If r = 1, there is a perfect positive relationship between the variables.

If r = -1, there is a perfect negative relationship between the variables.

If r = 0, there is no linear relationship between the variables.

If 0 < r < 0.5, there is a weak positive relationship between the variables.

If 0.5 < r < 1, there is a strong positive relationship between the variables.

If -0.5 < r < 0, there is a weak negative relationship between the variables.

If -1 < r < -0.5, there is a strong negative relationship between the variables.
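The interpretation rules above can be sketched as a small helper. This is only an illustration: the function name `describe_correlation` is hypothetical, and because the slides do not say how to label a value of exactly 0.5 or -0.5, the sketch assumes such a value counts as strong.

```python
def describe_correlation(r: float) -> str:
    """Label r using the cut-offs from the slides (0.5 separates weak from strong)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    if abs(r) == 1:
        return f"perfect {direction} relationship"
    # Assumption: the slides leave |r| = 0.5 undefined; it is treated as strong here.
    strength = "strong" if abs(r) >= 0.5 else "weak"
    return f"{strength} {direction} relationship"


print(describe_correlation(0.3))   # weak positive relationship
print(describe_correlation(-0.8))  # strong negative relationship
```

The thresholds are a course convention rather than a universal rule, so the labels are a rough guide only.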


Pearson's Product Moment Correlation Coefficient (PPMCC)

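Correlation Coefficient (PPMCC)

The formula for the correlation coefficient (PPMCC) is given by

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

where n is the number of data pairs.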

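Correlation coefficient (PPMCC)

It can also be expressed in compact notation as

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

where

$$ S_{xy} = n\sum xy - \left(\sum x\right)\left(\sum y\right), \qquad S_{xx} = n\sum x^2 - \left(\sum x\right)^2, \qquad S_{yy} = n\sum y^2 - \left(\sum y\right)^2 $$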

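PPMCC: alternative/easier formula

Alternatively, the formula can be written as

$$ r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} $$

where $\bar{Y}$ and $\bar{X}$ are the means of Y and X respectively.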

Example 1: Car Rental Companies

Compute the correlation coefficient for the data in Example 1.

Company   Cars x (in 10,000s)   Income y (in billions)   xy        x²         y²
A         63.0                  7.0                      441.00    3969.00    49.00
B         29.0                  3.9                      113.10     841.00    15.21
C         20.8                  2.1                       43.68     432.64     4.41
D         19.1                  2.8                       53.48     364.81     7.84
E         13.4                  1.4                       18.76     179.56     1.96
F          8.5                  1.5                       12.75      72.25     2.25

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67

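Example 1: Car Rental Companies (calculation)

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

$$ r = \frac{6(682.77) - (153.8)(18.7)}{\sqrt{\left[6(5859.26) - (153.8)^2\right]\left[6(80.67) - (18.7)^2\right]}} $$

$$ r = 0.982 \quad \text{(strong positive relationship)} $$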
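The hand calculation for Example 1 can be checked with a short script. This is a minimal sketch of the sum-based PPMCC formula using only the standard library. The function name `pearson_r` is chosen for illustration and is not part of the course material.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r computed from the raw sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data from Example 1 (cars in 10,000s, income in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
print(round(pearson_r(cars, income), 3))  # 0.982
```

The sketch does not guard against a zero denominator, which occurs when all x values or all y values are identical.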

Example 2: Absences/Final Grades

Compute the correlation coefficient for the data in Example 2.

Student   Number of absences x   Final grade y (%)   xy     x²     y²
A          6                     82                  492     36    6,724
B          2                     86                  172      4    7,396
C         15                     43                  645    225    1,849
D          9                     74                  666     81    5,476
E         12                     58                  696    144    3,364
F          5                     90                  450     25    8,100
G          8                     78                  624     64    6,084

Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993

Alternative calculation

For Example 1, first calculate the means of X and Y: $\bar{X} = 25.6333$ and $\bar{Y} = 3.1167$.

$$ r = \frac{682.77 - 6(25.6333)(3.1167)}{\sqrt{\left(5859.26 - 6(25.6333)^2\right)\left(80.67 - 6(3.1167)^2\right)}} $$

$$ r = \frac{203.4222}{\sqrt{(1916.8636)(22.3871)}} = 0.9820 $$

Can you interpret the correlation coefficient? There is a strong positive correlation between car rentals and sales revenue.
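The mean-based version of the formula can be sketched in the same way. The function name `pearson_r_from_means` is an illustrative choice. The script keeps the means unrounded, so its intermediate values differ slightly from the hand calculation, which uses means rounded to four decimal places. The final coefficient agrees to three decimal places.

```python
from math import sqrt


def pearson_r_from_means(xs, ys):
    """Correlation coefficient using the means of X and Y instead of the raw sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    spread_x = sum(x * x for x in xs) - n * mean_x ** 2
    spread_y = sum(y * y for y in ys) - n * mean_y ** 2
    return numerator / sqrt(spread_x * spread_y)


# Data from Example 1 (cars in 10,000s, income in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
print(round(pearson_r_from_means(cars, income), 3))  # 0.982
```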

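Example 2: Absences/Final Grades (calculation)

Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993, n = 7

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

$$ r = \frac{7(3745) - (57)(511)}{\sqrt{\left[7(579) - (57)^2\right]\left[7(38{,}993) - (511)^2\right]}} $$

$$ r = -0.944 \quad \text{(strong negative relationship)} $$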
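For readers who want to follow the arithmetic for Example 2, the intermediate values can be written out as below. The numbers come directly from the sums given for this example. Only the rounding of the square root is approximate.

```latex
\begin{aligned}
r &= \frac{26215 - 29127}{\sqrt{(4053 - 3249)(272951 - 261121)}} \\
  &= \frac{-2912}{\sqrt{(804)(11830)}} \\
  &\approx \frac{-2912}{3084.04} \approx -0.944
\end{aligned}
```

The negative numerator is what gives the coefficient its negative sign; the denominator is always positive.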

Spearman's Rank Correlation Coefficient

Formula for Spearman's rank correlation coefficient

The formula is given by

$$ r_s = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)} $$

where n is the number of observations and d is the difference between the ranks of X and Y for each observation.

The interpretation of Spearman's correlation is the same as for the PPMCC.

STEP BY STEP APPROACH

1. Rank X and Y in ascending or descending order (use the same order for both variables).

2. Calculate the difference between the ranks, d.

3. Square each difference and calculate Σd².
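The steps above can be sketched in code for data without tied values. The helper names `ranks_ascending` and `spearman_rs` are illustrative. The sketch assumes ascending ranks (1 for the smallest value) and rejects ties, which need the averaging rule instead.

```python
def ranks_ascending(values):
    """Rank 1 goes to the smallest value. Assumes there are no tied values."""
    ordered = sorted(values)
    if len(set(ordered)) != len(ordered):
        raise ValueError("tied values need average ranks")
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(xs, ys):
    """Spearman's rank correlation from the sum of squared rank differences."""
    n = len(xs)
    rank_x = ranks_ascending(xs)  # step 1: rank both variables
    rank_y = ranks_ascending(ys)
    # steps 2 and 3: difference in ranks, squared and summed
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Data from Example 1 (cars in 10,000s, income in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
print(round(spearman_rs(cars, income), 4))  # 0.8857
```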

Example: step-by-step approach

Consider the data from Example 1.

Company   Cars (in 10,000s)   Income (in billions)
A         63                  7
B         29                  3.9
C         20.8                2.1
D         19.1                2.8
E         13.4                1.4
F         8.5                 1.5

Represent one of the variables by X and the other by Y. Remember that correlation is symmetric.

Rank the data in ascending order.

Find the difference in the ranks.

Calculate the correlation coefficient.

Solution

Company   Cars (X)   Income (Y)   Rank of X   Rank of Y   d     d²
A         63         7            6           6            0    0
B         29         3.9          5           5            0    0
C         20.8       2.1          4           3            1    1
D         19.1       2.8          3           4           -1    1
E         13.4       1.4          2           1            1    1
F         8.5        1.5          1           2           -1    1

Σd² = 4

Spearman's rank correlation coefficient is given as

$$ r_s = 1 - \frac{6(4)}{6\left(6^2 - 1\right)} = 0.8857 $$

Note: the value will not necessarily be the same as the PPMCC, but the interpretation is the same. In this case, there is a strong positive correlation between car rentals and revenue.

Special cases

How do we rank the data when two or more data points have the same rank?

When two scores tie in rank, both are given the mean of the two ranks they would occupy, and the next rank is skipped to keep n (the number of observations) consistent. For example, if two data points tied for 4th place, both would receive a rank of 4.5 ((4 + 5) ÷ 2), and the next data point would be ranked number 6.

If three points tied for 4th place, the three would receive a rank of 5 ((4 + 5 + 6) ÷ 3), and the next data point would be ranked number 7.
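The tie rule can be sketched as a ranking helper. The function name `average_ranks` and the sample scores are hypothetical and are chosen only to reproduce the two cases described above, a two-way and a three-way tie for 4th place.

```python
def average_ranks(values):
    """Ascending ranks (1 = smallest); tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1       # first rank position occupied by v
        count = ordered.count(v)           # how many observations share this value
        rank_of[v] = first + (count - 1) / 2  # mean of first .. first + count - 1
    return [rank_of[v] for v in values]


# Hypothetical scores: two values tie for 4th place, so each gets rank 4.5.
print(average_ranks([10, 20, 30, 40, 40, 60]))      # [1.0, 2.0, 3.0, 4.5, 4.5, 6.0]
# Three values tie for 4th place, so each gets rank 5 and the next gets rank 7.
print(average_ranks([10, 20, 30, 40, 40, 40, 70]))  # [1.0, 2.0, 3.0, 5.0, 5.0, 5.0, 7.0]
```

A quick check on any set of average ranks is that they still sum to n(n + 1) ÷ 2.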

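Example

Consider the following example, in which some values are tied.

Company   Cars (X)   Income (Y)   Rank of X   Rank of Y   d      d²
A         63         7            6           6            0     0
B         29         2.1          5           4            1     1
C         13.4       2.1          2.5         4           -1.5   2.25
D         19.1       2.1          4           4            0     0
E         13.4       1.4          2.5         1            1.5   2.25
F         8.5        1.5          1           2           -1     1

Σd² = 6.5

The two values of 13.4 tie for 2nd place in X, so each receives the rank (2 + 3) ÷ 2 = 2.5. The three incomes of 2.1 tie for 3rd place in Y, so each receives the rank (3 + 4 + 5) ÷ 3 = 4.

Solution

Spearman's rank correlation coefficient is given as

$$ r_s = 1 - \frac{6(6.5)}{6\left(6^2 - 1\right)} = 0.8143 $$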