Class 6: Tuesday, Sep. 28 Section 2.4. Checking the assumptions of the simple linear regression...
Class 6: Tuesday, Sep. 28
• Section 2.4.
• Checking the assumptions of the simple linear regression model:
  – Residual plots
  – Normal quantile plots
• Outliers and influential observations
Checking the model
• The simple linear regression model is a great tool, but its answers will only be useful if it is the right model for the data. We need to check the assumptions before using the model.
• Assumptions of the simple linear regression model:
  1. Linearity: The mean of Y|X is a straight line.
  2. Constant variance: The standard deviation of Y|X is constant.
  3. Normality: The distribution of Y|X is normal.
  4. Independence: The observations are independent.
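These four assumptions can be made concrete by simulating data for which all of them hold. A minimal sketch (the parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
beta0, beta1, sigma = 5.0, 2.0, 1.5      # hypothetical intercept, slope, error SD

x = rng.uniform(0, 10, size=n)           # explanatory variable
# Linearity: the mean of Y|X is the straight line beta0 + beta1*x.
# Constant variance + normality: errors are N(0, sigma^2) at every x.
# Independence: each error is drawn independently.
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
```

A least-squares fit to data generated this way should recover the slope and intercept up to sampling error.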
Checking that the mean of Y|X is a straight line
1. Scatterplot: Look at whether the mean of Y given X appears to increase or decrease in a straight line.
[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot of Salary vs. Years of Experience)]
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption (scatterplot of Heart Disease Mortality vs. Wine Consumption)]
Residual Plot
• Residuals: Prediction error of using the regression to predict Y_i for observation i: res_i = Y_i - Ŷ_i, where Ŷ_i = β̂_0 + β̂_1 X_i.
• Residual plot: Plot with residuals on the y-axis and the explanatory variable (or some other variable) on the x-axis.
[Figure: residual plot of Residual vs. Wine Consumption]
[Figure: residual plot of Residual vs. Years of Experience]
• Residual Plot in JMP: After doing Fit Line, click red triangle next to Linear Fit and then click Plot Residuals.
• What should the residual plot look like if the simple linear regression model holds? Under the simple linear regression model, the residuals should have approximately a normal distribution with mean zero and a standard deviation that is the same for all X.
• Simple linear regression model: Residuals should appear as a “swarm” of randomly scattered points about zero. Ideally, you should not be able to detect any patterns. (Try not to read too much into these plots – you’re looking for gross departures from a random scatter).
• A pattern in which the residuals tend to be greater than zero (or less than zero) over a certain range of X indicates that the mean of Y|X is not a straight line.
res_i = Y_i - Ŷ_i = Y_i - (β̂_0 + β̂_1 X_i)
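The fitted coefficients and residuals can be computed directly by least squares. A sketch using numpy (the data values below are invented, not the actual wine-consumption data):

```python
import numpy as np

# Invented (x, y) pairs standing in for (Wine Consumption, Mortality).
x = np.array([2.0, 9.0, 18.0, 31.0, 45.0, 63.0, 76.0])
y = np.array([11.0, 9.5, 8.0, 7.2, 6.0, 4.1, 3.0])

b1, b0 = np.polyfit(x, y, 1)     # slope beta1-hat, intercept beta0-hat
y_hat = b0 + b1 * x              # fitted values Y-hat_i
res = y - y_hat                  # residuals res_i = Y_i - Y-hat_i
```

With an intercept in the model, least-squares residuals always sum to (numerically) zero and are uncorrelated with x, which is why a residual plot should be centered about zero.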
[Figure: Bivariate Fit of Mileage By Speed with linear fit Mileage = 23.266776 - 0.0012701 Speed, and residual plot of Residual vs. Speed]
[Figure: Data simulated from a simple linear regression model (Idealreg.JMP): Bivariate Fit of Y By X, and residual plot of Residual vs. X]
Checking Constant Variance
• Use residual plot of residuals vs. X to check constant variance assumption.
• Constant variance: Spread of residuals is similar for all ranges of X.
• Nonconstant variance: Spread of residuals is different for different ranges of X.
– Fan shaped plot: Residuals are increasing in spread as X increases
– Horn shaped plot: Residuals are decreasing in spread as X increases.
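A rough numerical version of this check is to compare the spread of the residuals over the lower and upper halves of the X range. A sketch on simulated fan-shaped data (the setup is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0, 100, size=200)
y = 10 + 0.5 * x + rng.normal(0, 1 + 0.05 * x)   # error SD grows with x

b1, b0 = np.polyfit(x, y, 1)
res = y - (b0 + b1 * x)

# Fan shape: residual spread in the upper half of the X range clearly
# exceeds the spread in the lower half.
low_spread = res[x < np.median(x)].std()
high_spread = res[x >= np.median(x)].std()
```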
[Figure: Data simulated from a simple linear regression model (Idealreg.JMP): Bivariate Fit of Y By X, and residual plot of Residual vs. X]
[Figure: Simulated data from a model with nonconstant variance: Bivariate Fit of Y By X, and residual plot of Residual vs. X]
[Figure: Name Game: Bivariate Fit of Proportion recalled By Position, and residual plot of Residual vs. Position]
Checking Normality
• If the distribution of Y|X is normal, then the residuals should have approximately a normal distribution.
• To check normality, make a histogram and a normal quantile plot of the residuals.
• In JMP, after using Fit Line, click the red triangle next to Linear Fit and click Save Residuals. Click Analyze, Distribution, put Residuals in Y, click OK, and then after the histogram appears, click the red triangle next to Residuals and click Normal Quantile Plot.
[Figure: Name Game: histogram and normal quantile plot of the residuals of Proportion recalled]
[Figure: Simulation from the simple linear regression model: histogram and normal quantile plot of the residuals of Y]
Normal Quantile Plot
• Section 1.3.
• Most useful tool for assessing normality.
• Plot of residuals (or whatever variable is being checked for normality) on the y-axis versus the z-score of the percentile of the data point.
• If the true distribution is normal, the normal quantile plot will be a straight line. Deviations from a straight line indicate that the distribution is not normal.
• The dotted red lines are “confidence bands.” If all the points lie inside the confidence bands, then we feel that the normality assumption is reasonable.
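The coordinates of a normal quantile plot can be computed by hand: sort the residuals and pair each with the z-score of its percentile. A sketch using only the standard library (the residual values are invented):

```python
import statistics

res = [-0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.9]   # illustrative residuals

n = len(res)
res_sorted = sorted(res)
# The percentile of the i-th smallest point is (i - 0.5)/n; convert it to
# a z-score with the inverse normal CDF.
z = [statistics.NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Plotting res_sorted against z gives the normal quantile plot; if the
# residuals are normal, the points fall near a straight line.
```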
Independence
• In a problem where the data are collected over time, plot the residuals vs. time.
• For the simple linear regression model, there should be no pattern in the residuals over time.
• A pattern in which the residuals are higher or lower in the early part of the data than in the later part indicates that the relationship between Y and X is changing over time and might indicate that there is a lurking variable.
• Lurking variable: A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
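A crude version of this time check is to compare the mean residual in the early and late parts of the series. A sketch on simulated data where a lurking variable shifts the relationship midway (the setup is invented):

```python
import numpy as np

rng = np.random.default_rng(2)

t = np.arange(40)                     # time order of the observations
x = rng.uniform(0, 10, size=40)
shift = np.where(t < 20, 0.0, 5.0)    # lurking variable kicks in at t = 20
y = 2 + 3 * x + shift + rng.normal(0, 1, size=40)

b1, b0 = np.polyfit(x, y, 1)          # the fit ignores the lurking variable
res = y - (b0 + b1 * x)

# A clear gap between the two halves signals dependence on time.
early = res[:20].mean()
late = res[20:].mean()
```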
Residual vs. Time Example
• The mathematics dept. at a large state university must plan the number of instructors required for large elementary courses and wants to predict enrollment in elementary math courses (y) based on the number of first-year students (x).
• Data in mathenroll.JMP.
• Residual plot vs. time in JMP: After Fit Y by X, Fit Line, click the red triangle next to Linear Fit and click Save Residuals. Then use Fit Y by X with Y = residuals and X = year.
Residual Plots
[Figure: residual plot of Residual vs. First year students, and plot of residuals (Math enrollment) vs. Year]
Analysis of Math Enrollment
• The residual plot versus time order indicates that there must be a lurking variable associated with time; in particular, there is a change in the relationship between y and x between 1997 and 1998.
• In fact, one of the schools in the university changed its program to require that entering students take another mathematics course beginning in 1998, increasing enrollment.
• Implication: Data from before 1998 should not be used to predict future math enrollment.
What to Do About Violations of the Simple Linear Regression Model
• Coming up in the future:
• Nonlinearity: Transformations (Chapter 2.6), Polynomial Regression (Chapter 11)
• Nonconstant variance: Transformations (Chapter 2.6)
• Nonnormality: Transformations (Chapter 2.6)
• Lack of independence: Incorporate time into multiple regression (Chapter 11), time series techniques (Stat 202)
Outliers and Influential Observations
• Outlier: Any really unusual observation.
• Outlier in the X direction (called a high leverage point): Has the potential to influence the regression line.
• Outlier in the direction of the scatterplot: An observation that deviates from the overall pattern of the relationship between Y and X. Typically has a residual that is large in absolute value.
• Influential observation: Point that if it is removed would markedly change the statistical analysis. For simple linear regression, points that are outliers in the x direction are often influential.
Housing Prices and Crime Rates
• A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenues from higher property values.
• The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. Data is in philacrimerate.JMP. House price = Average house price for sales during most recent year, Crime Rate=Rate of crimes per 1000 population.
[Figure: Bivariate Fit of HousePrice By CrimeRate (scatterplot with Gladwyne, Haverford, Phila,N, and Phila,CC labeled)]
Center City Philadelphia is a high leverage point. Gladwyne and Haverford are outliers in the direction of the scatterplot: their house prices are considerably higher than one would expect given their crime rates.
Which points are influential?
[Figure: Bivariate Fit of HousePrice By CrimeRate with three linear fits]
• All observations: HousePrice = 176629.41 - 576.90813 CrimeRate
• Without Center City Philadelphia: HousePrice = 225233.55 - 2288.6894 CrimeRate
• Without Gladwyne: HousePrice = 173116.43 - 567.74508 CrimeRate
Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.
Formal measures of leverage and influence
• Leverage: “Hat values” (JMP calls them hats).
• Influence: Cook’s Distance (JMP calls them Cook’s D Influence).
• To obtain them in JMP, click Analyze, Fit Model, put the Y variable in Y and the X variable in the Model Effects box. Click Run Model. After the model is fit, click the red triangle next to Response. Click Save Columns, then click Hats for leverages and Cook’s D Influences for Cook’s distances.
• To sort observations by Cook’s Distance or leverage, click Tables, Sort, and then put the variable you want to sort by in the By box.
[Figure: Distributions of Cook’s D Influence HousePrice (Haverford, Gladwyne, and Phila,CC labeled) and of leverage (h) HousePrice (Phila,CC labeled)]
Center City Philadelphia has both high influence (Cook’s Distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.
Rules of Thumb for High Leverage and High Influence
• High leverage: Any observation with a leverage (hat value) > (3 × # of coefficients in regression model)/n has high leverage, where # of coefficients in regression model = 2 for simple linear regression and n = number of observations.
• High influence: Any observation with a Cook’s Distance greater than 1 has high influence.
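For simple linear regression, both quantities have closed forms, so these rules of thumb can be checked directly. A sketch (the data are invented, with one extreme x value planted as a high-leverage point):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.8, 5.0])
n, p = len(x), 2                          # p = # of coefficients in the model

b1, b0 = np.polyfit(x, y, 1)
res = y - (b0 + b1 * x)

# Leverage (hat values) for simple linear regression.
h = 1.0 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

# Cook's distance.
s2 = (res ** 2).sum() / (n - p)           # mean squared error
cooks_d = res ** 2 / (p * s2) * h / (1 - h) ** 2

high_leverage = h > 3 * p / n             # rule of thumb: hat value > 3*2/n
high_influence = cooks_d > 1              # rule of thumb: Cook's D > 1
```

In this example only the planted point at x = 30 trips both thresholds; it drags the fitted line far from the pattern of the other eight points.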
What to Do About Suspected Influential Observations?
See flowchart handout.
Does removing the observation change the substantive conclusions?
• If not, can say something like “Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn’t change.”
• If removing the observation does change substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
  – If yes, omit the observation and proceed.
  – If no, does the observation have high leverage (outlier in the explanatory variable)?
    • If yes, omit the observation and proceed. Report that conclusions only apply to a limited range of the explanatory variable.
    • If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.
• General principle: Delete observations from the analysis sparingly – only when there is good cause (does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.
Summary
• Before using the simple linear regression model, we need to check its assumptions. Check linearity, constant variance, normality, and independence by using the scatterplot, residual plot, and normal quantile plot.
• Influential observations: observations that, if removed, would have a large influence on the fitted regression model. Examine influential observations, remove them only with cause (the observation belongs to a different population than the one being studied, or has high leverage), and explain why you deleted them.
• Next class: Lurking variables, causation (Sections 2.4, 2.5).