Lecture 6 Notes

Chapter 8. Regression Wisdom

Learning Outcomes

• Check for unusual observations such as high leverage points, and assess their influence on the fitted model

• Exercise proper caution when extrapolating

• Do not assume a strong predictive relationship implies causation

• Exercise caution when using average or restricted range data

Examining Residuals

• No regression analysis is complete without a display of the residuals to check that the linear model is appropriate.

• Residuals reveal subtleties that are not clear from the plot of original data.

• Residuals are additional details that confirm or refine our understanding.

• Residuals reveal violations of the regression conditions that require our attention.

• It is good practice to look at both a histogram of the residuals (or a normal Q-Q plot of the residuals) and a scatterplot of the residuals vs. the predictor variable in order to examine the residuals further.

Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009

The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How has the percentage of people who smoke changed since the danger became clear during the last half of the 20th century?

The scatterplot shows percentage of smokers among men 18-24 years of age, as estimated by surveys, from 1965 through 2009.

• The percent of men age 18–24 who are smokers decreased dramatically between 1965 and 1990, but the trend has not been consistent since then.

• The association between percent of men age 18–24 who smoke and year is very strong from 1965 to 1990, but is erratic after 1990.

• A linear model may not be appropriate for the trend in the percent of men age 18–24 who are smokers, because the relationship does not appear to be straight.

• The regression equation is:

male smoking % = 986.996 - 0.479 × Year

We expect almost all of the residuals to be within 3 standard deviations (3 × 4.17 = 12.51) of their mean, zero.

We will check this by looking at:

- Min and max values for the residuals

- The histogram of residuals (or boxplot of residuals)
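
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (not the CDC smoking series), NumPy is assumed to be available, and `residual_summary` is a hypothetical helper name rather than anything from the lecture.

```python
import numpy as np

def residual_summary(x, y):
    """Fit y = b0 + b1*x by least squares and summarize the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)              # polyfit returns slope first, then intercept
    resid = y - (b0 + b1 * x)
    # Residual standard deviation with n - 2 degrees of freedom
    s_e = np.sqrt(np.sum(resid**2) / (len(x) - 2))
    return {
        "min": float(resid.min()),
        "max": float(resid.max()),
        "mean": float(resid.mean()),
        "s_e": float(s_e),
        "within_3_sd": bool(np.all(np.abs(resid) <= 3 * s_e)),
    }

# Synthetic, made-up data for illustration only
x = np.arange(10)
y = np.array([52.0, 49.5, 44.0, 41.5, 36.0, 33.0, 31.5, 30.0, 30.5, 29.0])
summary = residual_summary(x, y)
print(summary)
```

The minimum and maximum residuals can then be compared with ±3 residual standard deviations, which is the same comparison the slide makes with 3 × 4.17.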

Summary Statistics of Residuals


Checking the Normality Assumption about Residuals


Plot: Residuals vs. Predictor Variable (Year)

• Nonlinearity is more prominent in the residual plot than in the original scatterplot.

• The residual points are not randomly scattered around the zero line; they are not evenly spread out.

• The residual points form a curved pattern.

• A linear regression model is not appropriate for these data.

Checking the Linearity and Constant Variances Assumptions about Residuals


When the residual plot is not straight (e.g., it shows a curvilinear relationship), re-express the data, for example with a log transformation, to linearize the relationship, or add a curvature term to the regression model.
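
As a rough sketch of the second remedy (adding a curvature term), the snippet below fits a straight line and then a line plus an x² term to the same curved data. The data are synthetic and deterministic, NumPy is assumed, and `sse` is a hypothetical helper; a log re-expression would be checked in the same way, by refitting and re-examining the residuals.

```python
import numpy as np

# Synthetic curved data: a quadratic trend plus a small alternating wiggle
x = np.arange(11, dtype=float)
wiggle = 0.5 * (-1.0) ** np.arange(11)
y = 2.0 + 0.5 * x + 0.3 * x**2 + wiggle

def sse(coeffs, x, y):
    """Sum of squared residuals for a fitted polynomial."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

line = np.polyfit(x, y, 1)     # straight-line model
curve = np.polyfit(x, y, 2)    # same model with an added curvature (x**2) term

resid_line = y - np.polyval(line, x)
sse_line = sse(line, x, y)
sse_curve = sse(curve, x, y)

print("straight-line residuals:", np.round(resid_line, 2))
print("SSE, straight line:", round(sse_line, 2))
print("SSE, with x^2 term:", round(sse_curve, 2))
```

In this example the straight-line residuals are positive at both ends and negative in the middle, which is the curved pattern a residual plot would show; the curvature term removes most of it.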

Percentage of Both Men and Women Smokers (18 – 24 years of age) from 1965 through 2009

The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How have the percentages of men and women who smoke changed since the danger became clear during the last half of the 20th century?

% Smokers (18 – 24 years of age) from 1965 through 2009: Not Taking Group into Account

• The regression equation is:

smoking % = 953.31 - 0.46 × Year

Analysis of Residual Points

We need to account for two groups: males and females.

Scatterplot for % Men and Women Smokers (18 – 24 years of age) from 1965 through 2009

• Smoking rates for both men and women in the US have decreased significantly over the time period from 1965 to 2009.

• Smoking rates are generally lower for women than for men.

• The trend in the smoking rates for women seems a bit straighter than the trend for men.

• The apparent curvature in the scatterplot for the men could possibly be due to just a few points, and not an indication of a serious violation of the linearity condition.

Examining Residuals

• An examination of residuals often leads us to discover groups of observations that are different from the rest.

• A histogram of the residuals might show multiple modes.

• When we discover that there is more than one group in a regression, we may decide to analyze the groups separately, using a different model for each group.
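
A minimal sketch of why groups show up in the residuals: when two groups follow different lines, a single pooled fit leaves one group with positive residuals and the other with negative ones, which is what a two-mode histogram reflects. The groups "A" and "B" and all numbers are hypothetical, and NumPy is assumed.

```python
import numpy as np

# Synthetic two-group data (hypothetical groups "A" and "B"), exact lines for clarity
x = np.arange(10, dtype=float)
y_a = 50.0 - 1.0 * x      # group A
y_b = 35.0 - 0.5 * x      # group B

# Pooled fit that ignores the grouping
x_all = np.concatenate([x, x])
y_all = np.concatenate([y_a, y_b])
slope_all, intercept_all = np.polyfit(x_all, y_all, 1)
resid_a = y_a - (intercept_all + slope_all * x)
resid_b = y_b - (intercept_all + slope_all * x)

# Separate fit for each group
slope_a, intercept_a = np.polyfit(x, y_a, 1)
slope_b, intercept_b = np.polyfit(x, y_b, 1)

print("pooled residuals, group A:", np.round(resid_a, 2))
print("pooled residuals, group B:", np.round(resid_b, 2))
print("separate slopes:", round(float(slope_a), 2), round(float(slope_b), 2))
```

Fitting each group on its own recovers the two different slopes that the pooled line averages away.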

Outliers, Leverage, and Influence

• Any point that stands away from the others can be called an outlier and deserves your special attention.

• Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.

High Leverage:

• A data point can be unusual if its x-value is far from the mean of the x-values.

• Such a point has the potential to change the regression line.

• If the point lines up with the pattern of the other points, then it may not change our estimate of the regression line (it is a good idea to fit the model twice, both with and without the point in question).

Influential Point:

• A data point is influential if omitting it from the statistical analysis changes the model enough to make a meaningful difference.

• Influence depends on both leverage and residual.
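
The "fit the model twice" advice can be sketched as follows. The leverage formula used here, h = 1/n + (x − x̄)² / Σ(x − x̄)², is the standard one for simple linear regression and is not stated on the slides; the data are synthetic, NumPy is assumed, and both function names are hypothetical.

```python
import numpy as np

def leverage(x):
    """Leverage of each point in simple linear regression."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / len(x) + dev**2 / np.sum(dev**2)

def slopes_with_and_without(x, y, idx):
    """Least-squares slope using all points, and with point idx removed."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope_all = np.polyfit(x, y, 1)[0]
    keep = np.arange(len(x)) != idx
    slope_drop = np.polyfit(x[keep], y[keep], 1)[0]
    return float(slope_all), float(slope_drop)

# Synthetic data: ten ordinary points near y = 2 + 0.5x, plus one point at x = 30
x = np.append(np.arange(1.0, 11.0), 30.0)
base_y = 2.0 + 0.5 * np.arange(1.0, 11.0) + 0.3 * (-1.0) ** np.arange(10)

y_on_line = np.append(base_y, 17.0)    # extreme point follows the pattern (2 + 0.5*30)
y_off_line = np.append(base_y, 5.0)    # extreme point departs from the pattern

h = leverage(x)
on_all, on_drop = slopes_with_and_without(x, y_on_line, 10)
off_all, off_drop = slopes_with_and_without(x, y_off_line, 10)

print("leverage of the x = 30 point:", round(float(h[-1]), 2))
print("on the line:  slope", round(on_all, 3), "vs", round(on_drop, 3), "without it")
print("off the line: slope", round(off_all, 3), "vs", round(off_drop, 3), "without it")
```

The x = 30 point has high leverage in both cases, but only the version that departs from the pattern changes the slope noticeably, so only that one is influential.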

1. Not High Leverage, Not Influential, Large Residual

2. High Leverage, Not Influential, Small Residual

3. High Leverage, Influential, Not Necessarily Large Residual

Example of an Influential Observation

Relationship between Murder Rate and Poverty Level for 51 States (including DC)

Note: DC is far from the rest of the data (the overall pattern) and lies in a different direction from the rest.



• r = 0.47

• The regression equation is: Murder Rate = -3.45 + 0.67 × Poverty Rate

• R-squared = 0.22 (22%)



Poverty rate: 17.7, Murder Rate: 31.4

- The z-score for this maximum poverty rate is (17.7 - 12.876)/3.086942 = 1.56, which is not far from the mean, so this is not a high leverage point.

- It has the largest residual: 23.0131
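
The arithmetic on this slide can be reproduced directly from the reported numbers. This is only a check using the rounded coefficients from the regression equation (-3.45 and 0.67), so the residual it gives (about 22.99) differs slightly from the reported 23.0131, which was presumably computed from unrounded coefficients.

```python
# Values reported on the slide for the point with poverty rate 17.7 and murder rate 31.4
poverty, murder = 17.7, 31.4
mean_poverty, sd_poverty = 12.876, 3.086942
intercept, slope = -3.45, 0.67          # rounded coefficients of the fitted line

z_score = (poverty - mean_poverty) / sd_poverty   # how unusual the x-value is
predicted = intercept + slope * poverty           # fitted murder rate at this poverty rate
residual = murder - predicted                     # observed minus predicted

print(f"z-score   = {z_score:.2f}")
print(f"predicted = {predicted:.2f}")
print(f"residual  = {residual:.2f}")
```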

Example of Omitting an Observation from Data

Examining the relationship between Murder rate and poverty level (excluding DC)

• r = 0.54

• The regression equation is: Murder Rate = -0.66 + 0.41 × Poverty Rate

• R-squared = 0.29 (29%)

Poverty rate: 17.7, Murder Rate: 31.4

- It is somewhat influential

Example of High Leverage Point BUT Not An Influential Observation

Relationship Between Percent of Births to Teen Moms and Poverty Rate (including the state Mississippi)

• r = 0.85

• Regression equation:

% births to teen moms = 1.39 + 0.70 × poverty rate

• R-squared = 0.71 (71%)

• State Mississippi: (21, 17.1)

• Poverty rate had a mean of 12.84 and an SD of 3.06

• Z-score for Mississippi: Z = (21 - 12.84)/3.06 = 2.67 (2.67 standard deviations above the mean poverty rate); we treat this as a somewhat high leverage point.

• The residual for this observation was 1.01 (a small residual).

Relationship Between Percent of Births to Teen Moms and Poverty Rate (excluding the state Mississippi)

• r = 0.82

• Regression equation:

% births to teen moms = 1.64 + 0.67 × poverty rate

• R-squared = 0.67 (67%)

Poverty rate: 21, % Teen Moms: 17.1

- It is not an influential observation
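
One way to compare the two examples is the relative change in slope when the point is omitted. The sketch below uses only the slopes reported on these slides, and it assumes that the second teen-births fit (slope 0.67) is the one with Mississippi omitted. The helper `relative_change` is hypothetical, and what counts as a "meaningful difference" remains a judgment call rather than a fixed cutoff.

```python
def relative_change(with_point, without_point):
    """Relative change in an estimate when the point is omitted."""
    return (without_point - with_point) / with_point

# Slopes reported on the slides: (with the point, without the point)
dc_change = relative_change(0.67, 0.41)   # murder rate vs. poverty rate, DC
ms_change = relative_change(0.70, 0.67)   # teen births vs. poverty rate, Mississippi

print(f"DC:          slope changes by {dc_change:.0%}")
print(f"Mississippi: slope changes by {ms_change:.0%}")
```

The slope drops by roughly 39% without DC but by only about 4% without Mississippi, which matches the conclusions "somewhat influential" and "not an influential observation".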

Restricted-range Problem

• When one of the variables is restricted (you only look at some of the values), the correlation can be surprisingly low.

• We will visit an example from the web, from David Lane: http://davidmlane.com/hyperstat/A68809.html

• The demo video is found here: http://onlinestatbook.com/2/describing_bivariate_data/restriction_demo.html
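
The restricted-range effect can also be seen in a small simulation. This sketch is not taken from the linked demo: it assumes NumPy, uses simulated data with a population correlation of about 0.7, and restricts the sample to x-values above 1 as an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
n = 5000
x = rng.normal(size=n)
# y is built so that the correlation between x and y is about 0.7
y = 0.7 * x + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only observations with x above 1
keep = x > 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print("full range r:      ", round(float(r_full), 2))
print("restricted range r:", round(float(r_restricted), 2))
```

The relationship between x and y is the same in both samples; only the range of x that we look at has changed.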


Working with Summary Statistics

A scatterplot of the individual data shows what appears to be a strong, positive, linear association between weight (in pounds) and height (in inches) for men.

If, instead of data on individuals, we only had the mean weight for each height value, we would see an even stronger association.

• The points are less scattered.

• This can give a false impression of how well a line summarizes the data.

• Because means vary less than individual values, we risk overestimating the strength of the association and underestimating the variability of individuals.
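
A short simulation illustrates the point about summary statistics. Everything here is assumed for illustration: the heights, the line weight = -200 + 5.5 × height, and the noise level are made up, NumPy is assumed, and the data are not the ones plotted in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed so the run is repeatable
# 50 simulated men at each height from 64 to 76 inches
heights = np.repeat(np.arange(64, 77), 50).astype(float)
# Simulated weights in pounds: a made-up linear trend plus individual scatter
weights = -200.0 + 5.5 * heights + rng.normal(scale=25.0, size=heights.size)

r_individuals = np.corrcoef(heights, weights)[0, 1]

# Replace the individuals with one mean weight per height value
unique_heights = np.unique(heights)
mean_weights = np.array([weights[heights == h].mean() for h in unique_heights])
r_means = np.corrcoef(unique_heights, mean_weights)[0, 1]

print("r for individuals:", round(float(r_individuals), 2))
print("r for mean weights:", round(float(r_means), 2))
```

Averaging removes most of the individual scatter, so the correlation based on means is much closer to 1 even though the underlying relationship is unchanged.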
