Lecture 6 Notes
Chapter 8. Regression Wisdom
Learning Outcomes
• Check for unusual observations such as high leverage points, and assess their influence on the fitted model
• Exercise proper caution when extrapolating
• Do not assume a strong predictive relationship implies causation
• Exercise caution when using average or restricted range data
Examining Residuals
• No residual analysis is complete without a display of the residuals to check that the linear model is appropriate.
• Residuals reveal subtleties that are not clear from the plot of original data.
• Residuals are additional details that confirm or refine our understanding.
• Residuals reveal violations of the regression conditions that require our attention.
• To examine the residuals further, look at both a histogram of the residuals (or a normal Q-Q plot of the residuals) and a scatterplot of the
residuals vs. the predictor variable.
Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How has the percentage of
people who smoke changed since the dangers became clear during the last half of the 20th century?
The scatterplot shows percentage of smokers among men 18-24 years of age, as estimated by surveys, from 1965 through 2009.
• The percent of men age 18–24 who are smokers decreased dramatically between 1965 and 1990, but the trend has not been consistent since then.
• The association between percent of men age 18–24 who smoke and year is very strong from 1965 to 1990, but is erratic after 1990.
• A linear model may not be appropriate for the trend in the percent of men age 18–24 who are smokers, because the relationship does not appear to be straight.
• The regression equation is:
male smoking % = 986.996 - 0.479 × Year
We expect almost all of the residuals to be within 3 standard deviations (3 × 4.17 = 12.51) of their mean, zero.
We will check this by looking at:
- Min and max values for the residuals
- The histogram of residuals (or boxplot of residuals)
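The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `years` and `pct` values are made-up placeholders, not the CDC survey data, and `fit_line` is a hypothetical helper written for this sketch.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Placeholder data (NOT the CDC survey values).
years = [1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005]
pct = [54.0, 45.0, 42.0, 36.0, 28.0, 27.0, 28.0, 29.0, 28.0]

b0, b1 = fit_line(years, pct)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(years, pct)]

# Least-squares residuals average to zero, so the check is |residual| <= 3 SD.
limit = 3 * stdev(residuals)
within = all(abs(e) <= limit for e in residuals)

print(f"min = {min(residuals):.2f}, max = {max(residuals):.2f}, 3 SD = {limit:.2f}")
print("all residuals within 3 SD of zero:", within)
```

The min/max comparison only flags extreme residuals; a histogram or boxplot is still needed to judge the shape of the distribution.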
Summary Statistics of Residuals
Checking the Normality Assumption about Residuals
Plot: Residuals vs. Predictor Variable (Year)
• The nonlinearity is more prominent in the residual plot.
• The residual points are not randomly scattered around the zero line; they are not evenly spread out.
• The residual points form a curved pattern.
• A linear regression model is not an appropriate model for these data.
Checking the Linearity and Constant Variances Assumptions about Residuals
When the residuals show a curved pattern (i.e., the relationship is curvilinear), re-express the data, for example with a
log transformation, to straighten the relationship, or add a curvature term (such as a quadratic term) to the regression model.
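As a minimal sketch of re-expression, the Python below fits a straight line to hypothetical data on the raw scale and again after taking logs. The data follow an exact exponential decay chosen only to make the curvature obvious; they are not the smoking data, and `fit_line` and `residuals` are helper names assumed for this illustration.

```python
import math
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residuals(x, y):
    """Observed minus predicted values from the least-squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical curved data: y = 100 * exp(-0.05 * x).
x = list(range(0, 45, 5))
y = [100 * math.exp(-0.05 * xi) for xi in x]

raw_res = residuals(x, y)                            # line fitted to y
log_res = residuals(x, [math.log(yi) for yi in y])   # line fitted to log(y)

print("raw-scale residuals:", [round(e, 2) for e in raw_res])
print("largest log-scale residual:", max(abs(e) for e in log_res))
```

On the raw scale the residuals are positive at both ends and negative in the middle, which is the curved pattern to look for; after the log re-expression they are essentially zero because log(y) is exactly linear in x for this made-up data.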
Percentage of Both Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How have the percentages of
men and women who smoke changed since the dangers became clear during the last half of the 20th century?
% Smokers (18 – 24 years of age) from 1965 through 2009: Not taking group into account
• The regression equation is:
smoking % = 953.31 - 0.46 × Year
Analysis of Residual Points
We need to account for two groups: males and females.
Scatterplot for % Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
• Smoking rates for both men and women in
the US have decreased significantly over the
time period from 1965 to 2009.
• Smoking rates are generally lower for women
than for men.
• The trend in the smoking rates for women
seems a bit straighter than the trend for men.
• The apparent curvature in the scatterplot for
the men could possibly be due to just a few
points, and not an indication of a serious
violation of the linearity condition.
Examining Residuals
• An examination of residuals often leads us to discover groups of observations that are different from the rest.
• A histogram of the residuals might show multiple modes.
• When we discover that there is more than one group in a regression, we may decide to analyze the groups
separately, using a different model for each group.
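The effect of ignoring groups can be sketched as follows. The numbers are hypothetical (two groups that each follow their own straight line), not the smoking data, and `fit_line` is a helper assumed for this illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: each group follows a different straight line.
years = [1970, 1980, 1990, 2000]
groups = {
    "men": [50.0, 44.0, 38.0, 32.0],     # falls 0.6 points per year
    "women": [34.0, 31.0, 28.0, 25.0],   # falls 0.3 points per year
}

# One pooled fit that ignores the group label.
pooled_x = years * 2
pooled_y = groups["men"] + groups["women"]
b0, b1 = fit_line(pooled_x, pooled_y)
pooled_res = [y - (b0 + b1 * x) for x, y in zip(pooled_x, pooled_y)]

# A separate fit for each group.
separate = {name: fit_line(years, y) for name, y in groups.items()}

print("pooled residuals, men:  ", [round(e, 2) for e in pooled_res[:4]])
print("pooled residuals, women:", [round(e, 2) for e in pooled_res[4:]])
print("separate slopes:", {k: round(v[1], 2) for k, v in separate.items()})
```

In the pooled fit every residual for one group is positive and every residual for the other is negative, which is the kind of clustering that would appear as two modes in a histogram of residuals.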
Outliers, Leverage, and Influence
• Any point that stands away from the others can be called an outlier and deserves your special attention.
• Outlying points can strongly influence a regression. Even a single point far from the body of the data can
dominate the analysis.
High Leverage:
• A data point has high leverage if its x-value is far from the mean of the x-values.
• Such a point has the potential to change the regression line.
• If the point lines up with the pattern of the other points, it may not change our estimate of the
regression line (it is a good idea to fit the model twice, with and without the point in question).
Influential Point:
• A data point is influential if omitting it from the statistical analysis changes the model enough to make a
meaningful difference.
• Influence depends on both leverage and the size of the residual.
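The advice to fit the model twice can be sketched as below. The data are hypothetical, and `fit_line` and `leverage` are helper names assumed for this illustration. The leverage values use the standard simple-regression formula h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which the slides do not state; they judge leverage by how far the x-value is from the mean.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def leverage(x):
    """Leverage of each point in a simple linear regression."""
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / len(x) + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical data: the last x-value is far from the others.
x = [1, 2, 3, 4, 5, 15]
cases = {
    "on pattern": [2, 4, 6, 8, 10, 30],   # last point follows y = 2x
    "off pattern": [2, 4, 6, 8, 10, 5],   # last point breaks the pattern
}

h = leverage(x)
slopes = {}
for label, y in cases.items():
    _, with_point = fit_line(x, y)
    _, without_point = fit_line(x[:-1], y[:-1])
    slopes[label] = (with_point, without_point)
    print(f"{label}: slope with = {with_point:.3f}, without = {without_point:.3f}")

print("leverage of last point:", round(h[-1], 3))
```

The last point has high leverage in both cases, but it is influential only when it departs from the pattern of the other points: the slope barely moves in the first case and changes substantially in the second.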
1. Not High Leverage, Not Influential, Large Residual
2. High Leverage, Not Influential, Small Residual
3. High Leverage, Influential, Not Necessarily Large Residual
Example of an Influential Observation
Relationship between murder rate and poverty level for 51 observations (the 50 states plus DC)
Note: DC lies far from the overall pattern of the data, in a different direction from the rest.
• r = 0.47
• The regression equation is: Murder Rate = -3.45 + 0.67 × Poverty Rate
• R-squared = 0.22 (22%)
Example of an Influential Observation
Relationship between murder rate and poverty level (including DC)
DC: poverty rate 17.7, murder rate 31.4
- The z-score of its poverty rate is (17.7 – 12.876)/3.086942 = 1.56, which is not far from the mean, so it is not a high-leverage point.
- It has the largest residual: 23.0131
Example of Omitting an Observation from Data
Examining the relationship between Murder rate and poverty level (excluding DC)
• r = 0.54
• The regression equation is: Murder Rate = -0.66 + 0.41 × Poverty Rate
• R-squared = 0.29 (29%)
DC (poverty rate 17.7, murder rate 31.4):
- It is somewhat influential: omitting it changes the slope from 0.67 to 0.41 and the intercept from -3.45 to -0.66.
Example of High Leverage Point BUT Not An Influential Observation
Relationship Between Percent of Births to Teen Moms and Poverty Rate (including Mississippi)
• r = 0.85
• Regression equation:
% births to teen moms = 1.39 + 0.70 × poverty rate
• R-squared = 0.71 (71%)
• Mississippi: (21, 17.1)
• Poverty rate has a mean of 12.84 and an SD of 3.06
• Z-score for Mississippi:
Z = (21 - 12.84)/3.06 = 2.67 (well above the mean poverty rate);
we treat this as a somewhat high-leverage point.
• The residual for this observation is 1.01
(a small residual).
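The arithmetic behind these diagnostics can be reproduced directly from the numbers on the slides. The sketch below uses two small helper functions, `z_score` and `residual`, whose names are assumed for illustration; the coefficients are the rounded values shown on the slides, so the results can differ slightly from the software output.

```python
def z_score(value, mean, sd):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / sd

def residual(x, y, intercept, slope):
    """Observed y minus the value predicted by the fitted line."""
    return y - (intercept + slope * x)

# Mississippi in the teen-birth example: (poverty rate, % teen moms) = (21, 17.1)
z_ms = z_score(21, 12.84, 3.06)
res_ms = residual(21, 17.1, 1.39, 0.70)

# DC in the murder-rate example: (poverty rate, murder rate) = (17.7, 31.4)
z_dc = z_score(17.7, 12.876, 3.086942)
res_dc = residual(17.7, 31.4, -3.45, 0.67)

print(f"Mississippi: z = {z_ms:.2f}, residual = {res_ms:.2f}")
print(f"DC:          z = {z_dc:.2f}, residual = {res_dc:.2f}")
```

Mississippi combines a large z-score with a small residual (high leverage, little influence), while DC combines a modest z-score with a very large residual. The DC residual computed from the rounded coefficients comes out close to, but not exactly, the reported 23.0131.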
Example of High Leverage Point BUT Not An Influential Observation
Relationship Between Percent of Births to Teen Moms and Poverty Rate (excluding Mississippi)
• r = 0.82
• Regression equation:
% births to teen moms = 1.64 + 0.67 × poverty rate
• R-squared = 0.67 (67%)
Mississippi (poverty rate 21, % teen moms 17.1):
- It is not an influential observation: omitting it barely changes the fitted line.
Restricted-range Problem
• When one of the variables is restricted (you only look at some of the values), the correlation can be surprisingly low.
• We will visit an example from the web, from David Lane: http://davidmlane.com/hyperstat/A68809.html
• The demo video is found here: http://onlinestatbook.com/2/describing_bivariate_data/restriction_demo.html
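A rough sketch of the restricted-range effect, separate from the linked demo: the data below are hypothetical (a linear trend plus a fixed, deterministic scatter pattern), and `correlation` is a helper written for this illustration.

```python
import math
from statistics import mean

def correlation(x, y):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = x plus a deterministic scatter between -15 and 15.
x = list(range(100))
y = [xi + 3 * ((xi * 37) % 11 - 5) for xi in x]

r_full = correlation(x, y)

# Restrict the range: keep only observations with x between 40 and 59.
kept = [(xi, yi) for xi, yi in zip(x, y) if 40 <= xi <= 59]
r_restricted = correlation([p[0] for p in kept], [p[1] for p in kept])

print(f"r over the full range:       {r_full:.2f}")
print(f"r over the restricted range: {r_restricted:.2f}")
```

The scatter around the line is the same size in both cases; only the spread of the x-values shrinks, and the correlation drops with it.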
Working with Summary Statistics
The first graph shows a strong, positive, linear association between weight (in pounds)
and height (in inches) for men.
The second graph shows that if, instead of data on individuals,
we had only the mean weight for each height value, we
would see an even stronger association.
• The means are less scattered than the individual points.
• This can give a false impression of how well a line
summarizes the data.
• We have a problem of overestimating or
underestimating.
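The effect of replacing individuals by their means can be sketched with hypothetical height and weight values (not the data in the graphs); `correlation` is a helper written for this illustration.

```python
import math
from statistics import mean

def correlation(x, y):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: three men at each height, scattered around a line.
heights = list(range(64, 75))
individuals = [(h, 5 * h - 180 + d) for h in heights for d in (-20, 0, 20)]

r_individuals = correlation([h for h, _ in individuals],
                            [w for _, w in individuals])

# Summarize: keep only the mean weight at each height.
mean_weights = [mean(w for hh, w in individuals if hh == h) for h in heights]
r_means = correlation(heights, mean_weights)

print(f"r for individuals: {r_individuals:.2f}")
print(f"r for mean weight at each height: {r_means:.2f}")
```

Averaging removes the person-to-person variation at each height, so the means fall much closer to a line than the individuals do. The fitted line may be similar, but the strength of the association is overstated.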