Introduction to Linear Regression 2014 – Part 2 Handouts
Introduction to Linear regression analysis
Part 2
Model comparisons
ANOVA for regression
Total variation in Y (SSTotal) = variation explained by regression with X (SSRegression) + residual variation (SSResidual)
Full model
yi = β0 + β1xi + εi
• Unexplained variation in Y from full model = SSResidual = Σ(yi − ŷi)²
Reduced model (H0 true)
• Reduced model (H0: β1 = 0 true):
yi = β0 + εi
(mean and error)
• Unexplained variation in Y from reduced model = SSTotal = Σ(yi − ȳ)²
Model comparison
• Difference in unexplained variation between full and reduced models:
SSTotal − SSResidual = SSRegression
• Variation explained by including β1 in the model = Σ(ŷi − ȳ)²
Explained variation
• Proportion of variation in Y explained by linear relationship with X
• Termed r², the coefficient of determination:
r² = SSRegression / SSTotal
• r² is simply the square of the correlation coefficient (r) between X and Y.
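The partition and the r² identity above can be checked numerically. A minimal sketch with made-up illustrative data (numpy assumed available):

```python
# Minimal sketch (made-up data): the ANOVA partition for regression and
# the identity r^2 = SS_Regression / SS_Total = (correlation of X and Y)^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])   # illustrative values only

b1, b0 = np.polyfit(x, y, 1)                 # full model: slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)       # unexplained by the reduced model
ss_resid = np.sum((y - y_hat) ** 2)          # unexplained by the full model
ss_reg = ss_total - ss_resid                 # explained by adding beta_1

r2 = ss_reg / ss_total
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)                            # the two values agree
```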
[Scatterplots of Y1 and Y2 against X (X: 0–15000; Y: 0–500)]
Which is the better model??
Which is the better model??
[Scatterplot of Y1 against X]
Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085
Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450
X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009
Residual 1.81383E+05 24 7557.643448
Which is the better model??
[Scatterplot of Y2 against X]
Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781
Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482
X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864
Residual 3388.617362 3 1129.539121
Which is the better model??
[Scatterplots of Y1 and Y2 against X (X: 0–15000; Y: 0–500)]
Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551
Which is the better model??
95% Confidence bands (for slope)
[Scatterplots of Y1 and Y2 against X with 95% confidence bands]
Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551
Assumptions
Normality
Y should be normally distributed at each value of X:
– Boxplot of Y should be symmetrical – watch out for outliers and skewness
– Transformations often help
Homogeneity of variance
Variance (spread) of Y should be constant for each value of xi (homogeneity of variance):
– Very difficult to assess usually (for models with only one value of y per x).
[Diagram: distributions of Y at x1 and x2 around the regression line ŷi = β0 + β1xi]
Homogeneity of variance (continued)
– Spread of residuals should be even when plotted against xi or predicted yi’s
– Transformations often help
– Transformations that improve normality of Y will also usually make variance of Y more constant
Tests of Homogeneity of variance
[Plots: standardized residuals against X; residuals against predicted values (ESTIMATE); scatterplot of Y1 against X]
Independence
Values of yi are independent of each other:
– watch out for data that are a time series on the same experimental or sampling units
– should be considered at the design stage
Linearity
True population relationship between Y and X is linear:
– scatterplot of Y against X
– watch out for asymptotic or exponential patterns
– transformations of Y, or of Y and X, often help
– always look at residuals
EDA and regression diagnostics
• Check assumptions
• Check fit of model
• Warn about influential observations and outliers
EDA
• Boxplots of Y (and X): check for normality, outliers etc.
• Scatterplot of Y against X: check for linearity, homogeneity of variance, outliers etc.
[Anscombe (1973) data set: four scatterplots of Y against X]
For all four data sets: r² = 0.667, ŷ = 3.0 + 0.5x, t = 4.24, P = 0.002
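The point can be reproduced directly: fitting any of the four sets gives essentially the same line and r². A sketch using the first set (values transcribed from Anscombe 1973; numpy assumed):

```python
# Sketch reproducing the statistics quoted above for Anscombe's first
# data set (values transcribed from Anscombe 1973).
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(b0, 1), round(b1, 1), round(r2, 3))   # 3.0, 0.5, r^2 near 0.667
```

The lesson of the quartet is that identical summary statistics can hide very different relationships, which is why the plots (and residuals) must always be examined.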
Limited or weighted data smoothers (for data exploration – especially useful for model fitting)
• Nonparametric description of the relationship between Y and X
– unconstrained by a specific model structure
• Useful exploratory technique:
– is a linear model appropriate?
– are particular observations influential?
Limited or weighted data smoothers
• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through surrounding observations
• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– size of window (number of observations) determined by the smoothing parameter
Limited or weighted data smoothers
• Adjacent windows overlap
– resulting line is smooth
– smoothness controlled by the smoothing parameter (width of window)
• Any section of the line is robust to extreme values in other windows
Types of limited or weighted data smoothers (examples)
• Running (moving) means or medians:
– means or medians within each window
• Lo(w)ess:
– locally weighted regression scatterplot smoothing
– observations within a window weighted differently
– observations replaced by predicted values from a local regression line
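The simplest of these, a running mean, can be sketched in a few lines (function name and data are illustrative; `window` plays the role of the smoothing parameter):

```python
# Sketch of a running-mean smoother (names and data illustrative): each
# observation is replaced by the mean of the observations in a window
# centred on it along the X-axis; `window` is the smoothing parameter.
import numpy as np

def running_mean(x, y, window=3):
    order = np.argsort(x)                        # windows run along the X-axis
    xs = np.asarray(x, float)[order]
    ys = np.asarray(y, float)[order]
    half = window // 2
    smooth = np.array([ys[max(0, i - half): i + half + 1].mean()
                       for i in range(len(ys))])
    return xs, smooth

xs, smooth = running_mean([1, 2, 3, 4, 5], [1.0, 3.0, 2.0, 4.0, 3.0])
print(smooth)      # interior points are 3-point means; edge windows shrink
```

Widening `window` gives a smoother (but flatter) line; overlapping windows are what make the result continuous rather than step-like.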
Residuals – very useful for examining regression assumptions
• Difference between observed value and value predicted (fitted) by the model
• Residual for each observation: difference between the observed yi and the value of yi predicted by the linear regression equation, yi − ŷi
Studentised residuals
• residual / SE of residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions
Observations with a large residual (or studentised residual) are outliers from the fitted model.
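One common form of the "residual / SE" standardisation above is the internally studentised residual, eᵢ / (s·√(1 − hᵢᵢ)); a sketch on made-up data with one deliberate outlier (numpy assumed):

```python
# Sketch (made-up data, last point an outlier): internally studentised
# residuals, e_i / (s * sqrt(1 - h_ii)), where h_ii is the leverage and
# s the residual standard error of the fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 30.0])   # last y is a gross outlier

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)   # leverages
s = np.sqrt(np.sum(resid ** 2) / (n - 2))                       # residual SE
student = resid / (s * np.sqrt(1 - h))

print(student)       # the outlier gives the largest studentised residual
```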
Plot residuals against predicted yi
[Sketch: residuals scattered evenly between −SE and +SE across predicted yi]
• No pattern in residuals
• Indicates assumptions OK
• Even spread of Y around line
[Sketch: residuals forming a wedge that widens with predicted yi]
• Increasing spread of residuals, i.e. wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
• Uneven spread of Y around line
Violations of assumptions may not always lead to non-significant result
• Relationship between birth rate and Per capita income
[Scatterplot: birth rate per female (per year) against per capita income (thousands)]
Dep Var: BIRTHRATE N: 57 Multiple R: 0.759086 Squared multiple R: 0.576212
Adjusted squared multiple R: 0.568506 Standard error of estimate: 0.088099
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 0.580417 1 0.580417 74.781752 0.000000
Residual 0.426881 55 0.007761
Violations of assumptions may not always lead to non-significant result
• Perhaps consider transformation
[Scatterplot of birth rate against per capita income, and plot of residuals against predicted values (ESTIMATE)]
Violations of assumptions may not always lead to non-significant result
• Log transform of Per Capita Income
[Plot of residuals against predicted values (ESTIMATE) after the transformation, and scatterplot of birth rate against per capita income (log thousands)]
Comparison of models
[Scatterplot: birth rate per female (per year) against per capita income (thousands)]
Dep Var: BIRTHRATE N: 57 Multiple R: 0.759086 Squared multiple R: 0.576212
Adjusted squared multiple R: 0.568506 Standard error of estimate: 0.088099
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 0.580417 1 0.580417 74.781752 0.000000
Residual 0.426881 55 0.007761
Dep Var: BIRTHRATE N: 57 Multiple R: 0.894713 Squared multiple R: 0.800512
Adjusted squared multiple R: 0.796885 Standard error of estimate: 0.060444
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 0.806354 1 0.806354 220.705235 0.000000
Residual 0.200944 55 0.003654
[Scatterplot: birth rate per female (per year) against per capita income (log thousands)]
Other indicators
• Outliers
• Leverage
• Influence
Outliers
• Observations further from the fitted model than the remaining observations
– might be different from sample outliers seen in boxplots
• Large residual
[Sketch: scatterplot with one point lying far from the fitted line, labelled "outlier"]
Use a robust estimator.
Leverage
• How extreme an observation is for the X-variable
• Measures how much each xi influences its predicted ŷi
[Sketch: point at an extreme value of X, labelled "large leverage"]
Influence
• Cook’s D statistic:
– incorporates leverage and residual
– identifies observations with large influence on the estimated slope
– observations with D near or greater than 1 should be checked
[Sketch of three cases:]
• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual – an outlier
• Observation 3 is very influential (large Cook’s D) – also an outlier
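Cook's D combines the two diagnostics above; one standard formulation for simple linear regression, sketched on made-up data (numpy assumed):

```python
# Sketch (made-up data): Cook's D for simple linear regression,
# D_i = r_i^2 / p * h_ii / (1 - h_ii), with p = 2 fitted parameters and
# r_i the internally studentised residual.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])   # last point: extreme X (leverage)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 5.0])   # ...and off the trend of the rest

n, p = len(x), 2
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
s = np.sqrt(np.sum(resid ** 2) / (n - p))
r_stud = resid / (s * np.sqrt(1 - h))
cooks_d = r_stud ** 2 / p * h / (1 - h)

print(cooks_d)       # only the influential point has D well above 1
```

The high-leverage point need not have the largest raw residual (it drags the line towards itself), which is exactly why leverage and residual are combined.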
Transformations
• If Y (and error terms) is skewed:
– log or power transformation of Y
– improves homogeneity of variance
– can reduce influence of outliers
• If nonlinear relationship:
– linearize by transformation of Y and/or X
• Transformed variables must make biological sense
Robust regression
• Help with outliers in Y
– Least Absolute Deviation (LAD) regression: minimizes the sum of absolute residuals (instead of squared residuals)
– M (maximum likelihood) regression: many types, often based on iteratively reweighted least squares; not useful for issues with leverage; converges on OLS when assumptions are met
• Help with outliers in Y and X
– Least Median of Squares (LMS) regression: minimizes the median of the squared residuals
– Least Trimmed Squares (LTS) regression: uses a trimmed set of observations and operates on sums of squared residuals
– Rank regression: uses ranks of residuals rather than their values
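LAD regression can be sketched via iteratively reweighted least squares, one common way to fit it (made-up data; numpy assumed). A single gross Y outlier drags the OLS slope but leaves the LAD slope near the true value:

```python
# Sketch: LAD regression fitted by iteratively reweighted least squares
# (one common approach), on made-up data where a single Y outlier drags
# the OLS slope but leaves the LAD slope near the true value of 2.
import numpy as np

x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[9] += 50.0                                    # one gross outlier in Y

A = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(A, y, rcond=None)[0]     # start from the OLS fit
ols_slope = beta[1]

for _ in range(100):                            # IRLS: weight = 1 / |residual|
    w = 1.0 / np.maximum(np.abs(y - A @ beta), 1e-8)
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]

print(ols_slope, beta[1])   # OLS slope pulled upward; LAD slope stays near 2
```

Note the resistance is only to outliers in Y; a point with extreme X (high leverage) would still need one of the LMS/LTS-type methods above.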
Regression through origin
• Fit model: yi = β1xi + εi
• Problems:
– extrapolation below the smallest xi
– ANOVA partition of SS no longer additive
– r² difficult to interpret
• Always test H0: β0 = 0 first
• Recommendation: only fit through the origin if there is a compelling reason to do so (usually there is not)
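The no-intercept least-squares slope has a simple closed form, sketched below beside the ordinary two-parameter fit (made-up data; numpy assumed):

```python
# Sketch (made-up data): least-squares slope for a regression forced
# through the origin, beta_1 = sum(x*y) / sum(x^2), beside the ordinary
# two-parameter fit for comparison.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

b1_origin = np.sum(x * y) / np.sum(x ** 2)      # no-intercept slope
b1_full, b0_full = np.polyfit(x, y, 1)          # ordinary fit

# With no intercept the usual SS partition is taken about zero rather than
# about the mean of Y, which is why r^2 is hard to interpret here.
print(b1_origin, b1_full, b0_full)
```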