Chapter 13 – Correlation and Simple Regression

Concise Managerial Statistics
KVANLI, PAVUR, KEELING

Slides prepared by Jeff Heyl, Lincoln University

©2006 Thomson/South-Western


Bivariate Data

[Figure 13.1: two scatter diagrams, panels (a) and (b). X axis: Income (thousands), 20 to 80. Y axis: Square footage (hundreds), 5 to 35.]


Coefficient of Correlation

The coefficient of correlation, r, measures the strength of the linear relationship between two variables.

$$
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}
  = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\sum x^2 - (\sum x)^2/n}\,\sqrt{\sum y^2 - (\sum y)^2/n}}
$$
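The definitional and computational forms of the numerator agree because of a standard algebraic identity. The expansion below is not on the slide; it is included only to show the missing step, using $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$.

```latex
\begin{aligned}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\,\bar{x}\,\bar{y} \\
  &= \sum xy - \frac{(\sum x)(\sum y)}{n} - \frac{(\sum x)(\sum y)}{n} + \frac{(\sum x)(\sum y)}{n} \\
  &= \sum xy - \frac{(\sum x)(\sum y)}{n}
\end{aligned}
```

Setting y = x in the same expansion gives the corresponding identity for each term under the square roots in the denominator.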


Coefficient of Correlation Properties

1. r ranges from -1.0 to 1.0

2. The larger |r| is, the stronger the linear relationship

3. The sign of r tells you whether the relationship between X and Y is a positive (direct) or a negative (inverse) relationship

4. r = 1 or r = -1 implies that a perfect linear pattern exists between the two variables, that is, they are perfectly correlated


Sum of Squares

$SS_X$ = sum of squares for X

$$ SS_X = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} $$

$SS_Y$ = sum of squares for Y

$$ SS_Y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} $$

$SCP_{XY}$ = sum of cross products for XY

$$ SCP_{XY} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

$$ r = \frac{SCP_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} $$
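The sums of squares and the cross product are simple to compute directly. The following is a minimal sketch in Python; the five-point sample is hypothetical (invented for illustration, not taken from the slides), and the variable names are assumptions.

```python
import math

# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Computational ("shortcut") forms of SS_X, SS_Y, and SCP_XY.
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
scp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# Definitional form of SS_X, computed as a cross-check.
x_bar = sum(x) / n
ss_x_def = sum((v - x_bar) ** 2 for v in x)

# Coefficient of correlation from the three quantities.
r = scp_xy / math.sqrt(ss_x * ss_y)
print(ss_x, ss_y, scp_xy, round(r, 4))
```

The shortcut forms avoid a second pass over the data to subtract the means, which is why they are the ones usually used for hand calculation.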


Scatter Diagram and Correlation Coefficient



Vertical Distances

[Figure: scatter diagram with a candidate Line L and the vertical distances $d_1, d_2, \dots, d_{10}$ from each point to the line. X axis: Income, 20 to 80. Y axis: Square footage.]


Least Squares Line

The least squares line is the line through the data that minimizes the sum of the squared vertical distances between the observations and the line

$$ \sum d^2 = d_1^2 + d_2^2 + d_3^2 + \dots + d_n^2 $$

$$ b_1 = \frac{SCP_{XY}}{SS_X} \qquad b_0 = \bar{y} - b_1\bar{x} $$
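The formulas for $b_1$ and $b_0$ can be sketched in a few lines of Python. The sample below is hypothetical and the function names are assumptions; the last lines nudge the coefficients to illustrate that any other line gives a larger sum of squared distances.

```python
def least_squares(x, y):
    """Return (b0, b1) for the fitted line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_distances(x, y, b0, b1):
    """Sum of squared vertical distances from the points to b0 + b1 * X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
best = sum_squared_distances(x, y, b0, b1)
# Nudging either coefficient away from the fit increases the sum of squares.
nudged = sum_squared_distances(x, y, b0 + 0.1, b1 - 0.05)
print(round(b0, 4), round(b1, 4), round(best, 4), round(nudged, 4))
```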


Least Squares Line

[Figure: least squares line $\hat{Y} = b_0 + b_1 X$ with distances $d_1$ and $d_2$ marked. At X = 50 the plot shows the observed Y and the fitted $\hat{Y}$; the distance is $Y - \hat{Y}$. X axis: Income. Y axis: Square footage.]


Sum of Squares of Error

$$ SSE = \sum d^2 = \sum (y - \hat{y})^2 = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$
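The two expressions for SSE can be checked against each other numerically. This is a small sketch on a hypothetical sample (not from the slides); the variable names are assumptions.

```python
# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Fitted line.
b1 = scp_xy / ss_x
b0 = y_bar - b1 * x_bar

# SSE from the residuals, and SSE from the shortcut formula.
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sse_shortcut = ss_y - scp_xy ** 2 / ss_x
print(round(sse_direct, 6), round(sse_shortcut, 6))
```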


Least Squares Line for Real Estate Data

[Figure: least squares line $\hat{Y} = 4.975 + .3539X$ for the real estate data. At X = 50 the observed value is Y = 20 and the fitted value is $\hat{Y} = 22.67$. X axis: Income. Y axis: Square footage.]


Assumptions for the Simple Regression Model

$$ Y = \beta_0 + \beta_1 X + e $$

1. The mean of each error component is zero

2. Each error component (random variable) follows an approximate normal distribution

3. The variance of the error component is the same for each value of X

4. The errors are independent of each other


Assumption 1 for the Simple Regression Model

[Figure: the line $Y = \beta_0 + \beta_1 X$ and the model $Y = \beta_0 + \beta_1 X + e$. Error distributions centered at 0 are drawn at X = 35 and X = 50, with means $\mu_{Y|35}$ and $\mu_{Y|50}$ on the line. X axis: Income. Y axis: Square footage.]


Violation of Assumption 3

[Figure: the line $Y = \beta_0 + \beta_1 X$ with error (e) distributions drawn at X = 35, 50, and 60. X axis: Income. Y axis: Square footage.]


Assumptions 1, 2, 3 for the Simple Regression Model

[Figure: error (e) distributions centered at 0 drawn at X = 35, 50, and 60. X axis: Income. Y axis: Square footage.]


Estimating the Error Variance, $\sigma_e^2$

$$ s^2 = \hat{\sigma}_e^2 = \text{estimate of } \sigma_e^2 = \frac{SSE}{n - 2} $$

where

$$ SSE = \sum (y - \hat{y})^2 = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$
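As a minimal sketch, the estimate $s^2 = SSE/(n-2)$ can be computed as follows. The sample is hypothetical and the function name `error_variance` is an assumption.

```python
import math


def error_variance(x, y):
    """Return s^2 = SSE / (n - 2) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sse = ss_y - scp_xy ** 2 / ss_x
    return sse / (n - 2)


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s2 = error_variance(x, y)
s = math.sqrt(s2)  # estimated standard deviation of the errors
print(round(s2, 4), round(s, 4))
```

The divisor is n - 2 rather than n because two parameters, $b_0$ and $b_1$, are estimated from the data.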


Three Possible Populations

[Figure: three populations plotted as Y against X: (a) $\beta_1 = 0$, (b) $\beta_1 > 0$, (c) $\beta_1 < 0$.]



Hypothesis Test on the Slope of the Regression Line

Two-Tailed Test

$H_o$: $\beta_1 = 0$ (X provides no information)

$H_a$: $\beta_1 \ne 0$ (X does provide information)

Test Statistic:

$$ t = \frac{b_1 - \beta_1}{s / \sqrt{SS_X}} = \frac{b_1 - \beta_1}{s_{b_1}} $$

reject $H_o$ if $|t| > t_{\alpha/2,\,n-2}$


Hypothesis Test on the Slope of the Regression Line

One-Tailed Test

Test Statistic:

$$ t = \frac{b_1}{s_{b_1}} $$

$H_o$: $\beta_1 \le 0$, $H_a$: $\beta_1 > 0$: reject $H_o$ if $t > t_{\alpha,\,n-2}$

$H_o$: $\beta_1 \ge 0$, $H_a$: $\beta_1 < 0$: reject $H_o$ if $t < -t_{\alpha,\,n-2}$
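The test statistic and the rejection rules can be sketched as below. The sample is hypothetical, the function names are assumptions, and the critical values are passed in as numbers read from a t table (here 3.182 for a two-tailed test and 2.353 for a one-tailed test, both at the .05 level with 3 df) rather than computed.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, s_b1, t) for testing that the slope is zero."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    s = math.sqrt((ss_y - scp_xy ** 2 / ss_x) / (n - 2))
    s_b1 = s / math.sqrt(ss_x)  # standard error of the slope
    return b1, s_b1, b1 / s_b1


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, s_b1, t = slope_t_statistic(x, y)

# Critical values taken from a t table for n - 2 = 3 df (assumed inputs).
reject_two_tailed = abs(t) > 3.182   # Ha: slope differs from 0
reject_upper_tail = t > 2.353        # Ha: slope is greater than 0
print(round(t, 4), reject_two_tailed, reject_upper_tail)
```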


t Curve with 8 df

[Figure: t curve with 8 df; the rejection region is marked at t = 1.860.]


Real Estate Example



Scatter Diagram

[Figure: scatter diagram with the least squares line $\hat{Y} = -.814 + .3526X$. X axis: Age, 12 to 60. Y axis: Liquid assets (% of annual income), 10 to 30.]



Scatter Diagram

$SS_X = 1268.67$, $\bar{x} = 43.667$

$SS_Y = 348.92$, $\bar{y} = 14.583$

$SCP_{XY} = 447.33$

$$ r = \frac{SCP_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} = .672 $$
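The summary statistics on this slide are enough to reproduce the slope, the intercept, and r. The sketch below does that arithmetic in Python; the variable names are assumptions, and the results can be compared with the line $\hat{Y} = -.814 + .3526X$ shown with the scatter diagram.

```python
import math

# Summary statistics as given on the slide.
ss_x = 1268.67
ss_y = 348.92
scp_xy = 447.33
x_bar = 43.667
y_bar = 14.583

b1 = scp_xy / ss_x                   # slope
b0 = y_bar - b1 * x_bar              # intercept
r = scp_xy / math.sqrt(ss_x * ss_y)  # coefficient of correlation

print(round(b1, 4), round(b0, 3), round(r, 3))
```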


Confidence Interval for $\beta_1$

The $(1 - \alpha) \cdot 100\%$ confidence interval for $\beta_1$ is

$$ b_1 - t_{\alpha/2,\,n-2}\, s_{b_1} \quad \text{to} \quad b_1 + t_{\alpha/2,\,n-2}\, s_{b_1} $$
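A minimal sketch of this interval follows, assuming a hypothetical sample and a critical value supplied from a t table (3.182 for 95% confidence with 3 df). The function name `slope_confidence_interval` is an assumption.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) limits: b1 -/+ t_crit * s_b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    s = math.sqrt((ss_y - scp_xy ** 2 / ss_x) / (n - 2))
    s_b1 = s / math.sqrt(ss_x)
    return b1 - t_crit * s_b1, b1 + t_crit * s_b1


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# t_crit = 3.182 is the table value for alpha/2 = .025 with n - 2 = 3 df.
lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(round(lower, 3), round(upper, 3))
```

Because this interval contains 0, the two-tailed test on the slope would not reject $H_o$ at the same level for this sample.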


Curvilinear Relationship

[Figure: scatter of Y against X showing a curvilinear pattern.]



Measuring the Strength of the Model

$$ r = \frac{SCP_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} $$

$H_o$: $\rho = 0$ (no linear relationship exists between X and Y)

$H_a$: $\rho \ne 0$ (a linear relationship does exist)

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
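The t statistic built from r can be sketched as below. On a hypothetical sample it is also compared with $b_1 / s_{b_1}$; the two agree numerically here, which is a known property of simple regression rather than something stated on the slide.

```python
import math

# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# t statistic computed from the correlation coefficient.
r = scp_xy / math.sqrt(ss_x * ss_y)
t_from_r = r / math.sqrt((1 - r ** 2) / (n - 2))

# t statistic computed from the slope and its standard error.
b1 = scp_xy / ss_x
s = math.sqrt((ss_y - scp_xy ** 2 / ss_x) / (n - 2))
t_from_slope = b1 / (s / math.sqrt(ss_x))

print(round(t_from_r, 4), round(t_from_slope, 4))
```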


Danger of Assuming Causality

A high statistical correlation does not imply causality

There are many situations when variables are highly correlated because a factor not being studied affects the variables being studied


Coefficient of Determination

$$ SSE = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$

$$ r^2 = \frac{(SCP_{XY})^2}{SS_X\,SS_Y} $$

$r^2$ = coefficient of determination

$$ r^2 = 1 - \frac{SSE}{SS_Y} $$

= percentage of explained variation in the dependent variable using the simple linear regression model
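Both expressions for $r^2$ give the same number, as the following sketch illustrates on a hypothetical sample (not from the slides; the variable names are assumptions).

```python
# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sse = ss_y - scp_xy ** 2 / ss_x

# Coefficient of determination, computed two ways.
r2_from_sums = scp_xy ** 2 / (ss_x * ss_y)
r2_from_sse = 1 - sse / ss_y
print(round(r2_from_sums, 4), round(r2_from_sse, 4))
```

For this sample the value 0.6 would be read as 60% of the variation in Y being explained by the fitted line.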


Total Variation, $SS_Y$

[Figure: a sample point (x, y), the fitted line $\hat{Y} = b_0 + b_1 X$, and the level $\bar{y}$, showing the deviations $y - \bar{y}$, $y - \hat{y}$, and $\hat{y} - \bar{y}$.]

$$ SS_Y = SSR + SSE $$

$$ SSR = \frac{(SCP_{XY})^2}{SS_X} $$
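The decomposition of the total variation can be verified numerically. In this sketch SSR is computed from the fitted values as $\sum (\hat{y} - \bar{y})^2$, which is the usual definition and is assumed here, and then compared with the shortcut formula. The sample is hypothetical.

```python
# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = scp_xy / ss_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Explained and unexplained variation, computed from the fitted values.
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr_shortcut = scp_xy ** 2 / ss_x
print(round(ss_y, 4), round(ssr, 4), round(sse, 4))
```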


Estimation and Prediction Using the Simple Linear Model

The least squares line can be used to estimate average values or predict individual values


Confidence Interval for $\mu_{Y|x_0}$

$(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu_{Y|x_0}$

$$ \hat{Y} - t_{\alpha/2,\,n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\,n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$

$$ s_{\hat{Y}} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$
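A minimal sketch of the confidence interval for the mean of Y at a given $x_0$ follows. The sample is hypothetical, the function name `mean_confidence_interval` is an assumption, and the critical value 3.182 is the t table entry for 95% confidence with 3 df.

```python
import math


def mean_confidence_interval(x, y, x0, t_crit):
    """Return (lower, upper) limits for the mean of Y at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((ss_y - scp_xy ** 2 / ss_x) / (n - 2))
    y_hat = b0 + b1 * x0
    s_y_hat = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - t_crit * s_y_hat, y_hat + t_crit * s_y_hat


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

at_mean = mean_confidence_interval(x, y, x0=3, t_crit=3.182)  # x0 equals x-bar
at_edge = mean_confidence_interval(x, y, x0=5, t_crit=3.182)
print([round(v, 3) for v in at_mean], [round(v, 3) for v in at_edge])
```

The interval is narrowest when $x_0 = \bar{x}$ and widens as $x_0$ moves away from the mean, because of the $(x_0 - \bar{x})^2$ term.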


Confidence and Prediction Intervals



95% Confidence Intervals

[Figure: least squares line $\hat{Y} = 4.975 + .3539X$ with upper and lower 95% confidence limits. Labeled values: $\bar{x} = 49.8$, and confidence limits 12.33 and 20.27. X axis: 20 to 70. Y axis: 5 to 35.]


Prediction Interval for $Y_{x_0}$

$$ \hat{Y} - t_{\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$

$$ s_Y^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X} \right] $$
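The prediction interval differs from the confidence interval for the mean only by the extra 1 under the square root. The sketch below computes both on a hypothetical sample so the difference in width is visible; the function name `intervals_at` and the critical value 3.182 (95%, 3 df, from a t table) are assumptions.

```python
import math


def intervals_at(x, y, x0, t_crit):
    """Return ((ci_lower, ci_upper), (pi_lower, pi_upper)) at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((ss_y - scp_xy ** 2 / ss_x) / (n - 2))
    y_hat = b0 + b1 * x0
    core = 1 / n + (x0 - x_bar) ** 2 / ss_x
    ci_half = t_crit * s * math.sqrt(core)      # mean of Y at x0
    pi_half = t_crit * s * math.sqrt(1 + core)  # individual Y at x0
    return (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

ci, pi = intervals_at(x, y, x0=3, t_crit=3.182)
print([round(v, 3) for v in ci], [round(v, 3) for v in pi])
```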


95% Confidence Intervals

[Figure: 95% confidence and prediction limits. Labeled values: $\bar{x} = 49.8$; prediction interval limits 8.17 and 24.43; confidence interval limits 12.33 and 20.27. X axis: 20 to 70. Y axis: 5 to 35.]


Checking Model Assumptions

1. The errors are normally distributed with a mean of zero

2. The variance of the errors remains constant. For example, you should not observe larger errors associated with larger values of X.

3. The errors are independent


Examination of Residuals

[Figure: residual plots of $Y - \hat{Y}$ against X, panels (a) and (b).]



Examination of Residuals

[Figure: residuals $Y - \hat{Y}$ plotted against Time, 1992 to 2001, around the zero line.]


Autocorrelation and the Durbin-Watson Statistic

Range from 0 to 4

Ideal value is 2

As DW decreases from 2, positive autocorrelation increases

As DW increases from 2, negative autocorrelation increases

$$ DW = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} $$
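The Durbin-Watson statistic is easy to compute from a list of residuals ordered in time. The sketch below uses two hypothetical residual series (invented for illustration) to show values below and above 2; the function name `durbin_watson` is an assumption.

```python
def durbin_watson(residuals):
    """DW = sum over t=2..T of (e_t - e_{t-1})^2, divided by sum of e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residual series, invented for illustration.
runs = [1, 1, 1, -1, -1, -1]          # long same-sign runs
alternating = [1, -1, 1, -1, 1, -1]   # sign flips every period

dw_runs = durbin_watson(runs)
dw_alternating = durbin_watson(alternating)
print(round(dw_runs, 3), round(dw_alternating, 3))
```

Residuals that stay on the same side of zero for several periods push DW below 2 (positive autocorrelation), while residuals that alternate in sign push it above 2 (negative autocorrelation).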





Checking for Outliers



Identifying Outlying Values

Outlying sample values can be found by calculating the sample leverage

$$ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_X} $$

$$ SS_X = \sum x^2 - (\sum x)^2 / n $$

A sample is considered an outlier if its leverage is greater than 4/n or 6/n
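The leverage calculation can be sketched as follows. The X values are hypothetical, with one deliberately extreme value, and the 4/n cutoff is passed in as a parameter since the slide offers both 4/n and 6/n; the function name `leverages` is an assumption.

```python
def leverages(x):
    """Return h_i = 1/n + (x_i - x_bar)^2 / SS_X for each x_i."""
    n = len(x)
    x_bar = sum(x) / n
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    return [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]


# Hypothetical X values with one extreme point, invented for illustration.
x = [1, 2, 3, 4, 20]
n = len(x)

h = leverages(x)
cutoff = 4 / n  # the slide's 4/n rule; 6/n is the alternative
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print([round(v, 3) for v in h], flagged)
```

Leverage depends only on the X values, so it identifies points that are unusual in X regardless of their Y values.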


Identifying Outlying Values

The standard deviation of the predicted Y value is

$$ s_{\hat{Y}} = s \sqrt{h_i} $$

The confidence interval is

$$ \hat{Y} - t_{\alpha/2,\,n-2}\, s \sqrt{h_i} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\,n-2}\, s \sqrt{h_i} $$

The prediction interval is

$$ \hat{Y} - t_{\alpha/2,\,n-2}\, s \sqrt{1 + h_i} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\,n-2}\, s \sqrt{1 + h_i} $$



Figure 13.27(a)



Figure 13.27(b)


Identifying Outlying Values

Unusually large or small values of the dependent variable (Y) can generally be detected using the sample standardized residuals

Estimated standard deviation of the ith residual:

$$ s \sqrt{1 - h_i} $$

$$ \text{Standardized residual} = \frac{Y_i - \hat{Y}_i}{s \sqrt{1 - h_i}} $$

An observation is thought to have an outlying value of Y if its standardized residual is > 2 or < -2
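As a minimal sketch, the standardized residuals can be computed from the fit and the leverages. The sample is hypothetical and the function name `standardized_residuals` is an assumption.

```python
import math


def standardized_residuals(x, y):
    """Return (Y_i - Yhat_i) / (s * sqrt(1 - h_i)) for each observation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    return [e / (s * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

std_res = standardized_residuals(x, y)
outlying_in_y = [i for i, r in enumerate(std_res) if abs(r) > 2]
print([round(r, 3) for r in std_res], outlying_in_y)
```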


Identifying Influential Observations

Cook's distance measure

$$ D_i = (\text{standardized residual})^2 \cdot \frac{1}{2} \cdot \frac{h_i}{1 - h_i} = \frac{(Y_i - \hat{Y}_i)^2}{2 s^2} \cdot \frac{h_i}{(1 - h_i)^2} $$

You may conclude the ith observation is influential if the corresponding $D_i$ measure > .8
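Cook's distance combines the residual and the leverage of each point. The sketch below evaluates both forms of $D_i$ on a hypothetical sample to show that they agree; the function name `cooks_distances` is an assumption.

```python
def cooks_distances(x, y):
    """Return two lists of D_i, one per form of the Cook's distance formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    # Form 1: (standardized residual)^2 * (1/2) * h / (1 - h)
    form1 = [(e * e / (s2 * (1 - hi))) * 0.5 * hi / (1 - hi) for e, hi in zip(resid, h)]
    # Form 2: residual^2 / (2 s^2) * h / (1 - h)^2
    form2 = [e * e / (2 * s2) * hi / (1 - hi) ** 2 for e, hi in zip(resid, h)]
    return form1, form2


# Hypothetical sample, invented for illustration (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

d1, d2 = cooks_distances(x, y)
influential = [i for i, d in enumerate(d1) if d > 0.8]
print([round(d, 4) for d in d1], influential)
```

In this small sample the first point has only a moderate standardized residual but high leverage, and that combination is what pushes its $D_i$ above .8.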


Leverages, Standardized Residuals, and Cook's Distance Measures



Summary of Figures 13.26 and 13.28

Point   Outlying in X Value   Outlying in Y Value     Influential Observation
        (h_i > .4)            (|stand. res.| > 2)     (D_i > .8)

A       No                    Yes                     No

B       No                    No                      No

C       Yes                   Yes                     Yes

Table 13.1


Engine Capacity and MPG

