COPYRIGHT © 2007 Thomson South-Western, a part of The Thomson Corporation. Thomson, the Star logo,…
©2006 Thomson/South-Western 1 Chapter 13 – Correlation and Simple Regression Slides prepared by...
©2006 Thomson/South-Western 1
Chapter 13 – Correlation and Simple Regression
Slides prepared by Jeff Heyl, Lincoln University
Concise Managerial Statistics
Kvanli, Pavur, Keeling
©2006 Thomson/South-Western 2
Bivariate Data
Figure 13.1: scatter diagrams, panels (a) and (b). X axis: Income (thousands); Y axis: Square footage (hundreds).
©2006 Thomson/South-Western 3
Coefficient of Correlation
The strength of the linear relationship between two variables is measured by the coefficient of correlation, r.
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\sum x^2 - (\sum x)^2/n}\,\sqrt{\sum y^2 - (\sum y)^2/n}} $$
©2006 Thomson/South-Western 4
Coefficient of Correlation Properties
1. r ranges from -1.0 to 1.0
2. The larger |r| is, the stronger the linear relationship
3. The sign of r tells you whether the relationship between X and Y is a positive (direct) or a negative (inverse) relationship
4. r = 1 or r = -1 implies that a perfect linear pattern exists between the two variables, that is, they are perfectly correlated
©2006 Thomson/South-Western 5
Sum of Squares
$SS_X$ = sum of squares for X
$$ SS_X = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} $$
$SS_Y$ = sum of squares for Y
$$ SS_Y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} $$
$SCP_{XY}$ = sum of cross products for XY
$$ SCP_{XY} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} $$
©2006 Thomson/South-Western 6
Sum of Squares
$$ r = \frac{SCP_{XY}}{\sqrt{SS_X\, SS_Y}} $$
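As a minimal sketch (not part of the slides), the computational formulas for $SS_X$, $SS_Y$, $SCP_{XY}$ and r translate directly into Python. The function names and the five-point data set are made up for illustration.

```python
import math

def sums_of_squares(x, y):
    """Return (SS_X, SS_Y, SCP_XY) from the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_y = sum(v * v for v in y) - sum_y ** 2 / n
    scp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return ss_x, ss_y, scp_xy

def correlation(x, y):
    """r = SCP_XY / sqrt(SS_X * SS_Y)."""
    ss_x, ss_y, scp_xy = sums_of_squares(x, y)
    return scp_xy / math.sqrt(ss_x * ss_y)

# Made-up data for illustration only (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sums_of_squares(x, y), correlation(x, y))
```

For this toy data the three sums work out to 10, 6 and 6 by hand, so r is 6 divided by the square root of 60.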
©2006 Thomson/South-Western 7
Scatter Diagram and Correlation Coefficient
Figure 13.2
©2006 Thomson/South-Western 8
Vertical Distances
Figure 13.3: vertical distances $d_1$ through $d_{10}$ from the data points to Line L. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 9
Least Squares Line
The least squares line is the line through the data that minimizes the sum of the squared vertical distances between the observations and the line
$$ \sum d^2 = d_1^2 + d_2^2 + d_3^2 + \dots + d_n^2 $$
$$ b_1 = \frac{SCP_{XY}}{SS_X}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$
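A short Python sketch of the slope and intercept formulas, assuming made-up data and illustrative function names (neither comes from the slides). The second helper evaluates $\sum d^2$ for any candidate line, so the fitted line can be compared with nearby lines.

```python
def least_squares_line(x, y):
    """Return (b0, b1) with b1 = SCP_XY / SS_X and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    return y_bar - b1 * x_bar, b1

def sum_squared_distances(x, y, b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(b0, b1, sum_squared_distances(x, y, b0, b1))
```

Shifting the intercept or slope away from the returned values should never lower the sum of squared distances, which is what "least squares" means.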
©2006 Thomson/South-Western 10
Least Squares Line
Figure 13.6: the fitted line $\hat{Y} = b_0 + b_1 X$ with vertical distances $d_1$ and $d_2$; Y and $\hat{Y}$ are marked for X = 50. Distance is $Y - \hat{Y}$. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 11
Sum of Squares of Error
$$ SSE = \sum d^2 = \sum (y - \hat{y})^2 $$
$$ SSE = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$
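The two expressions for SSE can be checked against each other numerically. The sketch below, with made-up data and an illustrative function name, computes SSE once from the residuals and once from the shortcut formula.

```python
def sse_two_ways(x, y):
    """Return (SSE from residuals, SSE from SS_Y - SCP_XY**2 / SS_X)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    shortcut = ss_y - scp_xy ** 2 / ss_x
    return direct, shortcut

# Made-up data for illustration only.
direct, shortcut = sse_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(direct, shortcut)
```

Both values should agree up to floating-point rounding; the shortcut avoids computing each predicted value.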
©2006 Thomson/South-Western 12
Least Squares Line for Real Estate Data
Figure 13.5: the fitted line $\hat{Y} = 4.975 + .3539X$; at X = 50 the figure marks Y = 20 and $\hat{Y} = 22.67$. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 13
Assumptions for the Simple Regression Model
$$ Y = \beta_0 + \beta_1 X + e $$
1. The mean of each error component is zero
2. Each error component (random variable) follows an approximate normal distribution
3. The variance of the error component is the same for each value of X
4. The errors are independent of each other
©2006 Thomson/South-Western 14
Assumption 1 for the Simple Regression Model
Figure 13.6: the line $Y = \beta_0 + \beta_1 X$ with the means $\mu_{Y|35}$ and $\mu_{Y|50}$ marked at X = 35 and X = 50; an individual value is $Y = \beta_0 + \beta_1 X + e$, with the errors e centered at 0. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 15
Violation of Assumption 3
Figure 13.7: error distributions e about the line $Y = \beta_0 + \beta_1 X$ at X = 35, 50, and 60. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 16
Assumptions 1, 2, 3 for the Simple Regression Model
Figure 13.8: error distributions e, each centered at 0, at X = 35, 50, and 60. X axis: Income; Y axis: Square footage.
©2006 Thomson/South-Western 17
Estimating the Error Variance, $\sigma_e^2$
$$ s^2 = \hat{\sigma}_e^2 = \text{estimate of } \sigma_e^2 = \frac{SSE}{n - 2} $$
where
$$ SSE = \sum (y - \hat{y})^2 = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$
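A tiny sketch of the estimate, assuming SSE has already been computed. The function name and the numbers in the example call are illustrative, not taken from the slides.

```python
import math

def error_std_dev(sse, n):
    """Return s = sqrt(SSE / (n - 2)), the estimated error standard deviation."""
    if n <= 2:
        raise ValueError("need more than two observations for n - 2 degrees of freedom")
    return math.sqrt(sse / (n - 2))

# Hypothetical values: SSE = 2.4 from n = 5 observations.
s = error_std_dev(2.4, 5)
print(s, s ** 2)
```

The divisor is n - 2 rather than n because two parameters, $b_0$ and $b_1$, are estimated from the same data.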
©2006 Thomson/South-Western 18
Three Possible Populations
Figure 13.9: (a) $\beta_1 = 0$; (b) $\beta_1 > 0$; (c) $\beta_1 < 0$
©2006 Thomson/South-Western 19
Hypothesis Test on the Slope of the Regression Line
Two-Tailed Test
$H_0: \beta_1 = 0$ (X provides no information)
$H_a: \beta_1 \ne 0$ (X does provide information)
Test Statistic:
$$ t = \frac{b_1 - \beta_1}{s / \sqrt{SS_X}} = \frac{b_1 - \beta_1}{s_{b_1}} $$
reject $H_0$ if $|t| > t_{\alpha/2,\, n-2}$
©2006 Thomson/South-Western 20
Hypothesis Test on the Slope of the Regression Line
One-Tailed Test
$H_0: \beta_1 \le 0$ versus $H_a: \beta_1 > 0$: reject $H_0$ if $t > t_{\alpha,\, n-2}$
$H_0: \beta_1 \ge 0$ versus $H_a: \beta_1 < 0$: reject $H_0$ if $t < -t_{\alpha,\, n-2}$
Test Statistic:
$$ t = \frac{b_1}{s_{b_1}} $$
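The test on the slope can be sketched in Python as follows. The data, function names, and the `alternative` labels are assumptions for illustration; the critical value is passed in by the caller (read from a t table with n - 2 degrees of freedom) rather than computed.

```python
import math

def slope_t_statistic(x, y):
    """Return t = b1 / s_b1 for H0: beta_1 = 0, where s_b1 = s / sqrt(SS_X)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    sse = ss_y - scp_xy ** 2 / ss_x
    s = math.sqrt(sse / (n - 2))
    return b1 / (s / math.sqrt(ss_x))

def reject_h0(t, t_crit, alternative="two-sided"):
    """Apply the rejection rule; t_crit is the table value chosen by the caller."""
    if alternative == "two-sided":   # caller passes t_{alpha/2, n-2}
        return abs(t) > t_crit
    if alternative == "greater":     # Ha: beta_1 > 0, caller passes t_{alpha, n-2}
        return t > t_crit
    if alternative == "less":        # Ha: beta_1 < 0, caller passes t_{alpha, n-2}
        return t < -t_crit
    raise ValueError("unknown alternative")

# Made-up data for illustration only; 2.353 is the table value t_{.05, 3}.
t = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(t, reject_h0(t, 2.353, "greater"))
```

Keeping the critical value as a parameter keeps the sketch free of external dependencies; a statistics library could supply it instead.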
©2006 Thomson/South-Western 21
t Curve with 8 df
Figure 13.10: t curve with the rejection region marked at t = 1.860
©2006 Thomson/South-Western 22
Real Estate Example
Figure 13.11
©2006 Thomson/South-Western 23
Real Estate Example
Figure 13.12
©2006 Thomson/South-Western 24
Real Estate Example
Figure 13.13
©2006 Thomson/South-Western 25
Real Estate Example
Figure 13.14
©2006 Thomson/South-Western 26
Scatter Diagram
Figure 13.15: scatter diagram with the fitted line $\hat{Y} = -.814 + .3526X$. X axis: Age; Y axis: Liquid assets (% of annual income).
©2006 Thomson/South-Western 27
Scatter Diagram
Figure 13.15
$SS_X = 1268.67$, $\bar{x} = 43.667$
$SS_Y = 348.92$, $\bar{y} = 14.583$
$SCP_{XY} = 447.33$
$$ r = \frac{SCP_{XY}}{\sqrt{SS_X\, SS_Y}} = .672 $$
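The summary statistics on this slide are enough to recompute r and the coefficients of the fitted line shown with Figure 13.15. The following sketch uses only those five published numbers; the variable names are illustrative.

```python
import math

# Summary statistics as given on the slide.
ss_x, ss_y, scp_xy = 1268.67, 348.92, 447.33
x_bar, y_bar = 43.667, 14.583

r = scp_xy / math.sqrt(ss_x * ss_y)   # coefficient of correlation
b1 = scp_xy / ss_x                    # slope of the least squares line
b0 = y_bar - b1 * x_bar               # intercept of the least squares line

print(round(r, 3), round(b1, 4), round(b0, 3))
```

The rounded results should agree with the slide's r = .672 and with the line $\hat{Y} = -.814 + .3526X$.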
©2006 Thomson/South-Western 28
Confidence Interval for $\beta_1$
The $(1 - \alpha) \cdot 100\%$ confidence interval for $\beta_1$ is
$$ b_1 - t_{\alpha/2,\, n-2}\, s_{b_1} \quad \text{to} \quad b_1 + t_{\alpha/2,\, n-2}\, s_{b_1} $$
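A minimal sketch of the interval, assuming $b_1$, $s_{b_1}$ and the table value $t_{\alpha/2,\, n-2}$ are already known. The function name and the example numbers are hypothetical.

```python
def slope_confidence_interval(b1, s_b1, t_crit):
    """Return (lower, upper) = b1 -/+ t_crit * s_b1."""
    half_width = t_crit * s_b1
    return b1 - half_width, b1 + half_width

# Hypothetical inputs: b1 = 0.6, s_b1 = 0.2828, and t_{.025, 3} = 3.182.
lower, upper = slope_confidence_interval(0.6, 0.2828, 3.182)
print(lower, upper)
```

An interval that contains 0 is consistent with failing to reject $H_0: \beta_1 = 0$ in the two-tailed test at the same $\alpha$.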
©2006 Thomson/South-Western 29
Curvilinear Relationship
Figure 13.16
©2006 Thomson/South-Western 30
Measuring the Strength of the Model
$$ r = \frac{SCP_{XY}}{\sqrt{SS_X\, SS_Y}} $$
$H_0: \rho = 0$ (no linear relationship exists between X and Y)
$H_a: \rho \ne 0$ (a linear relationship does exist)
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
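The test statistic for $\rho$ depends only on r and n. The sketch below uses an illustrative function name and made-up inputs.

```python
import math

def correlation_t(r, n):
    """Return t = r / sqrt((1 - r**2) / (n - 2)); undefined when |r| = 1 or n <= 2."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("need n > 2 and |r| < 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Made-up example: r = 6 / sqrt(60) from n = 5 observations.
t = correlation_t(6 / math.sqrt(60), 5)
print(t)
```

In simple linear regression this statistic is algebraically identical to $b_1 / s_{b_1}$, so testing $\rho = 0$ and testing $\beta_1 = 0$ lead to the same conclusion.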
©2006 Thomson/South-Western 31
Danger of Assuming Causality
A high statistical correlation does not imply causality
There are many situations when variables are highly correlated because a factor not being studied affects the variables being studied
©2006 Thomson/South-Western 32
Coefficient of Determination
$$ SSE = SS_Y - \frac{(SCP_{XY})^2}{SS_X} $$
$$ r^2 = \frac{(SCP_{XY})^2}{SS_X\, SS_Y} = 1 - \frac{SSE}{SS_Y} $$
$r^2$ = coefficient of determination = percentage of explained variation in the dependent variable using the simple linear regression model
©2006 Thomson/South-Western 33
Total Variation, $SS_Y$
Figure 13.17: a sample point (x, y) and the point $(x, \hat{y})$ on the line $\hat{Y} = b_0 + b_1 X$, with the deviations $y - \bar{y}$, $y - \hat{y}$, and $\hat{y} - \bar{y}$ marked relative to $\bar{y}$.
©2006 Thomson/South-Western 34
Total Variation, $SS_Y$
Figure 13.17
$$ SS_Y = SSR + SSE $$
$$ SSR = \frac{(SCP_{XY})^2}{SS_X} $$
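The decomposition can be verified numerically. In this sketch (made-up data, illustrative names) SSR comes from the formula above and SSE is computed independently from the residuals, so their sum can be compared with $SS_Y$.

```python
def variation_decomposition(x, y):
    """Return (SS_Y, SSR, SSE) with SSR = SCP_XY**2 / SS_X and SSE from residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    ssr = scp_xy ** 2 / ss_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return ss_y, ssr, sse

# Made-up data for illustration only.
ss_y, ssr, sse = variation_decomposition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = ssr / ss_y   # equals 1 - SSE / SS_Y
print(ss_y, ssr, sse, r_squared)
```

Dividing the decomposition by $SS_Y$ gives the coefficient of determination as the explained share of the total variation.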
©2006 Thomson/South-Western 35
Estimation and Prediction Using the Simple Linear Model
The least squares line can be used to estimate average values or predict individual values
©2006 Thomson/South-Western 36
$(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu_{Y|x_0}$
$$ \hat{Y} - t_{\alpha/2,\, n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\, n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$
$$ s_{\hat{Y}} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$
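A sketch of the interval for the mean of Y at $x_0$, assuming the fitted coefficients, s, and the table value $t_{\alpha/2,\, n-2}$ are supplied by the caller. All names and numbers in the example are hypothetical.

```python
import math

def mean_confidence_interval(x0, b0, b1, s, n, x_bar, ss_x, t_crit):
    """Return (lower, upper) for the mean of Y at x0."""
    y_hat = b0 + b1 * x0
    s_y_hat = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - t_crit * s_y_hat, y_hat + t_crit * s_y_hat

# Hypothetical fit: b0 = 2.2, b1 = 0.6, s**2 = 0.8, n = 5, x_bar = 3, SS_X = 10.
s = math.sqrt(0.8)
at_mean = mean_confidence_interval(3, 2.2, 0.6, s, 5, 3, 10, 3.182)
off_mean = mean_confidence_interval(4, 2.2, 0.6, s, 5, 3, 10, 3.182)
print(at_mean, off_mean)
```

Because of the $(x_0 - \bar{x})^2$ term, the interval is narrowest at $x_0 = \bar{x}$ and widens as $x_0$ moves away from it.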
©2006 Thomson/South-Western 37
Confidence and Prediction Intervals
Figure 13.18
©2006 Thomson/South-Western 38
Confidence and Prediction Intervals
Figure 13.19
©2006 Thomson/South-Western 39
Confidence and Prediction Intervals
Figure 13.20
©2006 Thomson/South-Western 40
95% Confidence Intervals
Figure 13.21: upper and lower confidence limits around the line $\hat{Y} = 4.975 + .3539X$, with $\bar{x} = 49.8$ marked; labeled limits are 20.27 (upper) and 12.33 (lower).
©2006 Thomson/South-Western 41
Prediction Interval for $Y_{x_0}$
$$ \hat{Y} - t_{\alpha/2,\, n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\, n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X}} $$
$$ s_Y^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_X} \right] $$
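A matching sketch for the prediction interval of an individual Y at $x_0$. The only change from the interval for the mean is the extra 1 under the square root. Names and example numbers are hypothetical.

```python
import math

def prediction_interval(x0, b0, b1, s, n, x_bar, ss_x, t_crit):
    """Return (lower, upper) for an individual Y value at x0."""
    y_hat = b0 + b1 * x0
    s_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred

# Hypothetical fit: b0 = 2.2, b1 = 0.6, s**2 = 0.8, n = 5, x_bar = 3, SS_X = 10.
lower, upper = prediction_interval(4, 2.2, 0.6, math.sqrt(0.8), 5, 3, 10, 3.182)
print(lower, upper)
```

The extra 1 accounts for the variation of an individual observation around its mean, so a prediction interval is always wider than the confidence interval for the mean at the same $x_0$.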
©2006 Thomson/South-Western 42
95% Confidence Intervals
Figure 13.22: prediction interval limits (24.43 and 8.17) and confidence interval limits (20.27 and 12.33), with $\bar{x} = 49.8$ marked.
©2006 Thomson/South-Western 43
Checking Model Assumptions
1. The errors are normally distributed with a mean of zero
2. The variance of the errors remains constant. For example, you should not observe larger errors associated with larger values of X.
3. The errors are independent
©2006 Thomson/South-Western 44
Examination of Residuals
Figure 13.23: residuals $Y - \hat{Y}$ plotted against X, panels (a) and (b)
©2006 Thomson/South-Western 45
Examination of Residuals
Figure 13.24: residuals $Y - \hat{Y}$ plotted against Time (1992 to 2001), with a reference line at 0
©2006 Thomson/South-Western 46
Autocorrelation and the Durbin-Watson Statistic
$$ DW = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} $$
Ranges from 0 to 4
Ideal value is 2
As DW decreases from 2, positive autocorrelation increases
As DW increases from 2, negative autocorrelation increases
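The statistic is simple to compute from a time-ordered list of residuals. This is a minimal sketch with an illustrative function name and made-up residual sequences.

```python
def durbin_watson(residuals):
    """DW = sum_{t=2..T} (e_t - e_{t-1})**2 / sum_{t=1..T} e_t**2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Made-up residual sequences for illustration only.
alternating = [1, -1, 1, -1]   # sign flips every period: negative autocorrelation
persistent = [1, 1, -1, -1]    # neighbors tend to agree: positive autocorrelation
print(durbin_watson(alternating), durbin_watson(persistent))
```

The alternating sequence gives a value above 2 and the persistent sequence a value below 2, matching the interpretation on the slide.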
©2006 Thomson/South-Western 47
Autocorrelation and the Durbin-Watson Statistic
Figure 13.25
©2006 Thomson/South-Western 48
Checking for Outliers
Figure 13.26
©2006 Thomson/South-Western 49
Identifying Outlying Values
Outlying sample values can be found by calculating the sample leverage
$$ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_X}, \qquad SS_X = \sum x^2 - (\sum x)^2 / n $$
A sample is considered an outlier if its leverage is greater than 4/n or 6/n
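The leverage rule can be sketched as below. The function names, the default cutoff multiplier of 4, and the data (which include one deliberately extreme X value) are assumptions for illustration.

```python
def leverages(x):
    """Return h_i = 1/n + (x_i - x_bar)**2 / SS_X for every observation."""
    n = len(x)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]

def high_leverage_points(x, multiplier=4):
    """Return indices whose leverage exceeds multiplier / n (4/n or 6/n on the slide)."""
    n = len(x)
    return [i for i, h in enumerate(leverages(x)) if h > multiplier / n]

# Made-up X values; the last one sits far from the rest.
x = [1, 2, 3, 4, 20]
print(leverages(x), high_leverage_points(x))
```

Leverage depends only on the X values, so it flags points that are unusual in X regardless of their Y values.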
©2006 Thomson/South-Western 50
Identifying Outlying Values
The standard deviation of the predicted Y value is
$$ s_{\hat{Y}} = s \sqrt{h_i} $$
The confidence interval is
$$ \hat{Y} - t_{\alpha/2,\, n-2}\, s \sqrt{h_i} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\, n-2}\, s \sqrt{h_i} $$
The prediction interval is
$$ \hat{Y} - t_{\alpha/2,\, n-2}\, s \sqrt{1 + h_i} \quad \text{to} \quad \hat{Y} + t_{\alpha/2,\, n-2}\, s \sqrt{1 + h_i} $$
©2006 Thomson/South-Western 51
Real Estate Example
Figure 13.27(a)
©2006 Thomson/South-Western 52
Real Estate Example
Figure 13.27(b)
©2006 Thomson/South-Western 53
Identifying Outlying Values
Unusually large or small values of the dependent variable (Y) can generally be detected using the sample standardized residuals
Estimated standard deviation of the ith residual
$$ s \sqrt{1 - h_i} $$
$$ \text{Standardized residual} = \frac{Y_i - \hat{Y}_i}{s \sqrt{1 - h_i}} $$
An observation is thought to have an outlying value of Y if its standardized residual is > 2 or < -2
©2006 Thomson/South-Western 54
Identifying Influential Observations
Cook's distance measure
$$ D_i = (\text{standardized residual})^2 \cdot \frac{1}{2} \cdot \frac{h_i}{1 - h_i} = \frac{(Y_i - \hat{Y}_i)^2}{2 s^2} \cdot \frac{h_i}{(1 - h_i)^2} $$
You may conclude the ith observation is influential if the corresponding $D_i$ measure is > .8
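Leverages, standardized residuals, and Cook's distance can be computed together, since each builds on the previous one. The sketch below uses made-up data and an illustrative function name; it assumes no observation has leverage equal to 1.

```python
import math

def influence_measures(x, y):
    """Return (leverages, standardized residuals, Cook's D) for a simple regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp_xy / ss_x
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    h = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    # Standardized residual: e_i / (s * sqrt(1 - h_i))
    std_res = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(residuals, h)]
    # Cook's D: (standardized residual)**2 * (1/2) * h_i / (1 - h_i)
    cooks_d = [r * r / 2 * hi / (1 - hi) for r, hi in zip(std_res, h)]
    return h, std_res, cooks_d

# Made-up data for illustration only.
h, std_res, cooks_d = influence_measures([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
influential = [i for i, d in enumerate(cooks_d) if d > 0.8]
print(h, std_res, cooks_d, influential)
```

A point can have a modest standardized residual and still be influential when its leverage is high, which is why the slide treats the three measures separately.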
©2006 Thomson/South-Western 55
Leverages, Standardized Residuals, and Cook's Distance Measures
Figure 13.28
©2006 Thomson/South-Western 56
Summary of Figures 13.26 and 13.28
Table 13.1
Point   Outlying in X Value (h_i > .4)   Outlying in Y Value (|stand. res.| > 2)   Influential Observation (D_i > .8)
A       No                               Yes                                       No
B       No                               No                                        No
C       Yes                              Yes                                       Yes
©2006 Thomson/South-Western 57
Engine Capacity and MPG
Figure 13.29
©2006 Thomson/South-Western 58
Engine Capacity and MPG
Figure 13.30
©2006 Thomson/South-Western 59
Engine Capacity and MPG
Figure 13.31
©2006 Thomson/South-Western 60
Engine Capacity and MPG
Figure 13.32
©2006 Thomson/South-Western 61
Engine Capacity and MPG
Figure 13.33