Linear Modelling: Simple Regression
10th of May 2018
R. Nicholls / D.-L. Couturier / M. Fernandes
Introduction:

ANOVA
• Used for testing hypotheses regarding differences between groups
• Considers the variation within and between groups

Regression
• Used for revealing and investigating relationships between input and output variables
• Models the data, extracting as much information as possible
Correlation:

How do we measure the strength of a linear relationship between variables?

[Scatter plot of y against x]
[Further example scatter plots of y against x, showing linear relationships of varying strength and direction]
Examples fall into three categories:
• Positively correlated
• Negatively correlated
• Uncorrelated

Two common measures:

Pearson's product-moment correlation coefficient:

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$

Coefficient of determination ($R^2$ value).
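As a quick check of these definitions, Pearson's r can be computed directly from its formula and compared with R's built-in cor(). The data below are simulated purely for illustration, not the lecture's dataset.

```r
# Pearson's product-moment correlation computed from its definition,
# checked against R's built-in cor(). Simulated illustrative data.
set.seed(1)
x <- runif(50, 0, 50)
y <- x + rnorm(50, sd = 8)

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual     # same value as cor(x, y)
r_manual^2   # the R-squared value of the simple regression of y on x
```

For simple linear regression, $R^2 = r^2$, which is why the slides report the two together (e.g. $0.931^2 \approx 0.866$).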
[Example scatter plots of y against x with their correlation statistics:
    r = 0.931,  R² = 0.866
    r = −0.949, R² = 0.901
    r = −0.060, R² = 0.004
    r = 0.106,  R² = 0.011]
Can I say whether my data are correlated? Is an observed correlation significant?

Use cor.test(x, y). For the strongly correlated example data:

data:  x and y
t = 17.613, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8802556 0.9602168
sample estimates:
      cor
0.9305923

For an apparently uncorrelated example:

data:  x and y
t = 1.5609, df = 48, p-value = 0.1251
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.06238066  0.46941403
sample estimates:
      cor
0.2197833
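A minimal sketch of how such output is produced; the simulated x and y below stand in for the lecture's data.

```r
# cor.test() performs the t-test for H0: true correlation = 0
# and reports a 95% confidence interval. Simulated illustrative data.
set.seed(2)
x <- runif(50, 0, 50)
y <- 0.9 * x + rnorm(50, sd = 6)

ct <- cor.test(x, y)
ct$estimate    # sample correlation
ct$statistic   # t, on n - 2 = 48 degrees of freedom
ct$p.value     # small p-value => the observed correlation is significant
ct$conf.int    # 95 percent confidence interval for the correlation
```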
Simple Regression:

Aims:
• To investigate linear correlation between two variables in more detail
• Be able to predict the response given a knowledge of the independent variable

Predictor variable: the independent variable, x
Response variable: the dependent variable, y

[Scatter plot of y against x with the fitted regression line]

For the i-th observation:

$y_i = \alpha + \beta x_i + \varepsilon_i$

where the $\varepsilon_i$ are the errors (residuals).
So how do we fit the regression line? Suppose we know parameter estimates $\hat{\alpha}$ and $\hat{\beta}$.

[Scatter plot showing an observation $y_i$ at $x_i$, its fitted value $\hat{y}_i$ on the line, and the residual $\varepsilon_i$ between them]

Observations: $y_i$
Fitted values: $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$
Residuals: $\varepsilon_i = y_i - \hat{y}_i$

Assume a Gaussian error model:

$\varepsilon_i \sim N(0, \sigma^2)$

so that each observation has density

$f(y_i \mid x_i; \alpha, \beta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \right)$
So how do we fit the regression line? Obtain estimates $\hat{\alpha}$ and $\hat{\beta}$ by maximising the likelihood of the parameters given the data:

$L(\alpha, \beta \mid x, y) = \prod_i f(y_i \mid x_i; \alpha, \beta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \right)$

Taking logs:

$\ln L(\alpha, \beta \mid x, y) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_i (y_i - \alpha - \beta x_i)^2$

We seek $\arg\max_{\alpha, \beta} L(\alpha, \beta \mid x, y)$.
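The maximisation can be checked numerically: the sketch below codes the Gaussian log-likelihood directly, hands it to optim(), and compares the resulting estimates with lm(). The data are simulated for illustration.

```r
# Numerically maximising the Gaussian log-likelihood and comparing
# the estimates of alpha and beta with lm(). Simulated data.
set.seed(3)
x <- runif(50, 0, 50)
y <- 2 + 0.9 * x + rnorm(50, sd = 4)

negloglik <- function(par) {
  alpha <- par[1]; beta <- par[2]; sigma <- exp(par[3])  # keeps sigma > 0
  -sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0, 0), negloglik, control = list(maxit = 5000))

fit$par[1:2]             # ML estimates of alpha and beta
unname(coef(lm(y ~ x)))  # least-squares estimates: effectively identical
```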
Optimal parameters: minimise the residual sum of squares. The first term of the log-likelihood does not involve $\alpha$ or $\beta$, so maximising $L(\alpha, \beta \mid x, y)$ is the same as minimising $\sum_i (y_i - \hat{y}_i)^2$. The Maximum Likelihood and Least Squares estimates are therefore equivalent (for the Gaussian error model).

Final answer:

$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$
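These closed-form estimates can be verified against lm() in a few lines, using the built-in trees data from the worked example that follows.

```r
# Closed-form least-squares estimates vs coef(lm()), on the trees data.
x <- trees$Girth
y <- trees$Volume

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

c(alpha_hat, beta_hat)   # -36.9435 and 5.0659, matching summary(lm(...))
unname(coef(lm(y ~ x)))
```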
Example: Predicting timber volume of felled black cherry trees

Response: y = Volume
Predictor: x = Girth

[Scatter plot of Volume against Girth with the fitted regression line]

> cor(trees$Volume, trees$Girth)
[1] 0.9671194
> m1 = lm(Volume ~ Girth, data = trees)
> summary(m1)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

[Histogram of the residuals, roughly symmetric about zero]

$\hat{\sigma} = 4.252$, so $\hat{\sigma}^2 = 18.1$. Under the Gaussian error model, roughly 95% of residuals should lie within $\pm 2\hat{\sigma} \approx \pm 8.5$.
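With the model fitted, the second stated aim (predicting the response) is a one-liner; predict() with interval = "prediction" also returns bounds consistent with the $\pm 2\hat{\sigma}$ rule of thumb. Girth = 15 is an arbitrary illustrative value.

```r
# Predicting volume for a tree of girth 15, with a 95% prediction interval.
m1 <- lm(Volume ~ Girth, data = trees)
predict(m1, newdata = data.frame(Girth = 15), interval = "prediction")
# fit is -36.9435 + 5.0659 * 15, i.e. about 39.0
```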
Linear Regression:

Assumptions:
1. Model is linear in parameters.
2. Gaussian error model.
3. Additive error model.
4. Independence of errors.
   No autocorrelation (i.e. when one observation depends on the last).
5. Homoscedasticity.
   Homogeneity/stability of the variance of the residuals.
Testing Assumptions: diagnostic plots

1. Residuals vs Fitted Values
   • Should not be related: no visible pattern
   • Mean residual = zero
   • Constant variance
   [Residuals vs Fitted plot for lm(Volume ~ Girth)]

2. Normal Quantile-Quantile Plot
   • Visual test for Normality
   • No strong trends/departures
   [Normal Q-Q plot of standardised residuals for lm(Volume ~ Girth)]

3. Scale-Location Plot
   • Test for homoscedasticity
   • Should be constant, ≈ 1, with no trend
   [Scale-Location plot for lm(Volume ~ Girth)]

4. Index Plot of Cook's Distance
   • Measures the influence of a particular observation
   • Extreme x-values: high leverage
   • May inform outlier rejection
   [Residuals vs Leverage plot with Cook's distance contours for lm(Volume ~ Girth)]
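The quantities behind the fourth plot can also be extracted numerically: hatvalues() gives the leverage of each observation and cooks.distance() its influence, shown here for the trees model from earlier.

```r
# Leverage and Cook's distance for each observation of the trees model.
m1 <- lm(Volume ~ Girth, data = trees)

lev  <- hatvalues(m1)       # leverage: extreme x-values score highest
cook <- cooks.distance(m1)  # influence of each observation on the fit

which.max(lev)    # observation 31, which has the most extreme girth
which.max(cook)   # the most influential observation
```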
Modelling Non-Linear Relationships

Linear models can be used to describe non-linear relationships.

Applying transformations to the response and/or predictor variables can be useful to:
• Linearise the data, i.e. make the relationship between the variables more linear.
• Stabilise the variance of the residuals, so that σ² doesn't depend on the independent variable.
• Normalise the distribution of the residuals.
Example: Stopping distance of cars versus speed (mph)

Response: y = dist
Predictor: x = speed

[Scatter plots and Residuals vs Fitted plots for three candidate models:
    lm(dist ~ speed):             R² = 0.651
    lm(sqrt(dist) ~ speed):       R² = 0.709
    lm(log(dist) ~ log(speed)):   R² = 0.733]

Call:
lm(formula = log(dist) ~ log(speed), data = cars)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00215 -0.24578 -0.02898  0.20717  0.88289 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.7297     0.3758  -1.941   0.0581 .  
log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4053 on 48 degrees of freedom
Multiple R-squared:  0.7331, Adjusted R-squared:  0.7276
F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15
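The three candidate models can be refitted in a few lines to reproduce the R² progression:

```r
# Refitting the three models for the cars data and extracting R-squared.
m_raw  <- lm(dist ~ speed, data = cars)
m_sqrt <- lm(sqrt(dist) ~ speed, data = cars)
m_log  <- lm(log(dist) ~ log(speed), data = cars)

sapply(list(raw = m_raw, sqrt = m_sqrt, loglog = m_log),
       function(m) summary(m)$r.squared)
# raw 0.651, sqrt 0.709, loglog 0.733
```

Note that each model explains variance on a different response scale, so these R² values are only a rough guide for comparison.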
Can you use simple regression to fit this model?

$y = \alpha x^{\beta} \varepsilon$

This is non-linear, with a multiplicative error model. Yes, so long as the error model is log-Normal: taking logs gives $\log y = \log \alpha + \beta \log x + \log \varepsilon$, which is linear in the parameters with additive Gaussian errors, and can be fitted with lm(log(dist) ~ log(speed)).

[Scatter plot of dist against speed with the back-transformed fitted curve]
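To see that the log-log fit really does recover a multiplicative model, one can simulate data of exactly this form with known parameters (0.5 and 1.6 below are arbitrary illustrative values) and fit it with lm():

```r
# Simulating y = a * x^b * eps with log-Normal eps, then recovering
# a and b by simple regression on the log scale. Illustrative values.
set.seed(8)
x   <- runif(100, 5, 25)
eps <- exp(rnorm(100, mean = 0, sd = 0.2))  # log-Normal: log(eps) Gaussian
y   <- 0.5 * x^1.6 * eps

fit <- lm(log(y) ~ log(x))
exp(unname(coef(fit)[1]))   # estimate of a, near 0.5
unname(coef(fit)[2])        # estimate of b, near 1.6
```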
Simple Regression in R:

R functions:

Correlation coefficients:
    plot(x,y)
    cor(x,y)
    cor.test(x,y)

Fitting and summarising a model:
    m1 = lm(y~x)
    abline(m1)
    summary(m1)

Inspecting the residuals:
    r1 = residuals(m1)
    hist(r1)

Diagnostic plots:
    plot(m1)

[The cor.test output, model summary, and four diagnostic plots are as shown on the earlier slides.]
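Putting the slide sequence together, a sketch of the complete workflow on the trees example:

```r
# End-to-end simple regression workflow from the slides, on the trees data.
x <- trees$Girth
y <- trees$Volume

plot(x, y)        # inspect the relationship
cor(x, y)         # correlation coefficient
cor.test(x, y)    # is the correlation significant?

m1 <- lm(y ~ x)   # fit the regression line
abline(m1)        # overlay it on the scatter plot
summary(m1)       # coefficients, R-squared, residual standard error

r1 <- residuals(m1)
hist(r1)          # residuals should look roughly Gaussian

par(mfrow = c(2, 2))
plot(m1)          # the four diagnostic plots
```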