April 4, 2006 Lecture 11 Slide #1
Diagnostics for Multivariate Regression Analysis
• Homeworks
• Assumptions of OLS Revisited
• Visual Diagnostic Techniques
• Non-linearity
• Non-normality and Heteroscedasticity
• Outliers and Case Statistics
April 4, 2006 Lecture 11 Slide #2
Homework -- Set 1
• Test for the additional explanatory power of
attention to the temp change issue (c_4_31) when
modeling the certainty of temperature change
(c_4_32). Use the more complex model from the
prior lecture exercise (shown on Slide #5) as your
base of comparison.
• Discuss the theoretical meaning of your results.
April 4, 2006 Lecture 11 Slide #3
F-Testing a Nested Model

Simpler Model:
Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io)

More Complex Model:
Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io) + b7(c4_32_ct)

Given the models, K = 8, H = 1, and n = 2054. Calculating the RSS's involves running the two models and obtaining the RSS from each. For these models, RSS_K = 4397.96 and RSS_(K-H) = 4547.82. So:

F(H, n-K) = [(RSS_(K-H) - RSS_K) / H] / [RSS_K / (n - K)]
          = [(4547.82 - 4397.96) / 1] / [4397.96 / (2054 - 8)]
          = 149.86 / 2.15 = 69.70
Given that df1 = H (1) and df2 = n-K (2046), the p-value of the model improvement shown in Table A4.2 (pp. 351-353) is <0.001. So what?
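As a quick sketch, the F calculation above can be reproduced in a few lines of Python (the helper name `nested_f` is ours, not Stata's; the RSS figures are taken straight from the slide):

```python
# Recompute the slide's nested-model F-test. All numbers (RSS, n, K, H) come
# from the slide; this is just the arithmetic, not a re-analysis of the data.

def nested_f(rss_restricted, rss_full, h, n, k):
    """F = [(RSS_restricted - RSS_full) / H] / [RSS_full / (n - K)]."""
    return ((rss_restricted - rss_full) / h) / (rss_full / (n - k))

f_stat = nested_f(rss_restricted=4547.82, rss_full=4397.96, h=1, n=2054, k=8)
print(round(f_stat, 2))  # ~69.72 (the slide rounds the denominator to 2.15, giving 69.70)
```

The small discrepancy with the slide's 69.70 is purely rounding of the 2.15 denominator.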
April 4, 2006 Lecture 11 Slide #4
Homework -- Set 2
• Write a 1-page paper in which you answer the following question:
– Do male respondents have the same relationship between education (c5_1a) and income (c5_5a) as do female respondents? Control for other relevant X's.
• Be sure to:
– Specify your hypotheses
– Properly recode your variables
– Interpret your statistical tests
April 4, 2006 Lecture 11 Slide #5
Predicting Income with Slope and Intercept Dummies for Male Respondents
So education appears to have the same weight for men and women scientists.
. regress c5_5a c5_3_age c5_1a c5_4_gen gen_ed, beta

      Source |       SS       df       MS              Number of obs =    2079
-------------+------------------------------           F(  4,  2074) =   39.82
       Model |   619.80381     4  154.950953           Prob > F      =  0.0000
    Residual |  8069.81524  2074  3.89094274           R-squared     =  0.0713
-------------+------------------------------           Adj R-squared =  0.0695
       Total |  8689.61905  2078  4.18172235           Root MSE      =  1.9725

------------------------------------------------------------------------------
       c5_5a |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    c5_3_age |   .0205162   .0033118     6.19   0.000                 .1346944
       c5_1a |   .5854507   .1701134     3.44   0.001                 .1744906
    c5_4_gen |  -.1676275   .7010746    -0.24   0.811                -.0299619
      gen_ed |    .128906   .1871409     0.69   0.491                 .0930857
       _cons |   1.853485   .6426378     2.88   0.004                        .
------------------------------------------------------------------------------
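The `gen_ed` regressor above is a slope dummy: gender multiplied by education, alongside the intercept dummy `c5_4_gen`. A minimal synthetic sketch (numpy; invented data, not the survey) of how these terms enter the design matrix:

```python
import numpy as np

# Illustrative sketch (synthetic data): an intercept dummy (gender) shifts the
# regression line; a slope dummy (gender * education, the slide's gen_ed)
# lets education's effect differ by gender.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(25, 70, n)
educ = rng.uniform(1, 5, n)
gender = rng.integers(0, 2, n).astype(float)
gen_ed = gender * educ                         # slope-shift term

# True process: same education slope for both groups (both dummy effects = 0),
# mirroring the slide's finding of no gender difference.
y = 2.0 + 0.02 * age + 0.6 * educ + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), age, educ, gender, gen_ed])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["const", "age", "educ", "gender", "gen_ed"], b.round(3))))
```

With the interaction truly zero, the estimated `gen_ed` coefficient comes out near zero, just as its insignificant t-statistic does in the Stata output above.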
April 4, 2006 Lecture 11 Slide #6
But Wait…
• Diagnostics:
– VIF
– Multicollinearity swamps the model, especially given the small proportion of women scientists
• Tests for linearity and homoscedasticity also raise red flags
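The VIF red flag can be illustrated with a short sketch: VIF_k = 1/(1 - R_k²), where R_k² comes from regressing X_k on the other predictors. Synthetic data below (not the survey) with a built-in gender / gender-times-education pair mimics the collinearity problem:

```python
import numpy as np

# Hedged sketch of the VIF computation: a dummy and its interaction with
# another regressor are strongly collinear, inflating both VIFs.
rng = np.random.default_rng(1)
n = 1000
educ = rng.uniform(1, 5, n)
gender = rng.integers(0, 2, n).astype(float)
gen_ed = gender * educ  # highly collinear with gender

def vif(X, k):
    """VIF of column k of X (no constant column) via an auxiliary regression."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, k, axis=1)])
    xk = X[:, k]
    fitted = others @ np.linalg.lstsq(others, xk, rcond=None)[0]
    r2 = 1 - ((xk - fitted) ** 2).sum() / ((xk - xk.mean()) ** 2).sum()
    return 1 / (1 - r2)

X = np.column_stack([educ, gender, gen_ed])
print([round(vif(X, k), 1) for k in range(3)])  # gender and gen_ed VIFs are large
```

In Stata the equivalent one-liner after `regress` is `estat vif` (or `vif` in older versions).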
April 4, 2006 Lecture 11 Slide #7
Critical OLS Assumptions
• 1: Fixed X's
– The effect of X is constant. Differences in Yi's, given X, are due to variations in "error"
• 2: Errors cancel out
– 1 & 2 assure the independence of errors and X's. Results in unbiased estimations of ß's.
– "Unbiased" means, in the long run, sample estimates will center on the true population parameters:
– Efficiency:

Y_i = β0 + β1·X_i,1 + β2·X_i,2 + ... + β_(K-1)·X_i,(K-1) + ε_i
E[ε_i] = 0 for all i
E[b_k] = β_k
E[(b_k - β_k)²] < E[(a_k - β_k)²] for any other linear unbiased estimator a_k
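The "unbiased" claim above can be illustrated with a small Monte Carlo sketch (synthetic data; fixed X's held constant across draws, fresh errors with mean zero each draw):

```python
import numpy as np

# Illustrative check of unbiasedness: under assumptions 1-2 the OLS estimates
# b_k center on the true beta_k across repeated samples.
rng = np.random.default_rng(2)
beta = np.array([1.0, 2.0, -0.5])  # true beta0, beta1, beta2 (invented)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # fixed X's

estimates = []
for _ in range(2000):
    y = X @ beta + rng.normal(0, 1, 200)  # fresh errors each sample, E[e] = 0
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

print(np.mean(estimates, axis=0).round(2))  # close to [1.0, 2.0, -0.5]
```

Any single draw's b_k wobbles around its beta_k; only the average over many samples pins it down, which is exactly what "in the long run" means on the slide.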
April 4, 2006 Lecture 11 Slide #8
More Critical OLS Assumptions
• 3: Errors have constant variance (homoscedasticity):
Var[ε_i] = σ² for all i
• 4: Errors are uncorrelated with each other (no autocorrelation):
Cov[ε_i, ε_j] = 0 for all i ≠ j
• Implications of assumptions 1-4:
• Standard errors of the estimate are unbiased
• OLS is more efficient than any other linear unbiased estimator (Gauss-Markov Theorem)
• A matter of the degree to which 3 and 4 hold…
– As n-size increases, the stringency of assumptions 3 and 4 decreases (law of large numbers)
April 4, 2006 Lecture 11 Slide #9
Last Critical OLS Assumption
• 5: Errors are Normally Distributed:
ε_i ~ N(0, σ²) for all i
– Justifies use of t and F distributions
– Necessary for hypothesis tests and confidence intervals
– Makes OLS more efficient than any other unbiased estimator
April 4, 2006 Lecture 11 Slide #10
Assumptions of “Correct” Model Specification
• Y is a linear function of modeled X
variables
• No X’s are omitted that affect E[Y] and that
are correlated with included X’s
• All X’s in the model affect E[Y]
April 4, 2006 Lecture 11 Slide #11
Summary of Assumption Failures and their Implications

Problem            Biased b   Biased SE   Invalid t/F   Hi Var
Non-linear         Yes        Yes         Yes           ---
Omit relev. X      Yes        Yes         Yes           ---
Irrel. X           No         No          No            Yes
X meas. error      Yes        Yes         Yes           ---
Heterosced.        No         Yes         Yes           Yes
Autocorr.          No         Yes         Yes           Yes
X corr. error      Yes        Yes         Yes           ---
Non-normal err.    No         No          Yes           Yes
Multicollinearity  No         No          No            Yes
April 4, 2006 Lecture 11 Slide #12
How do we know if the assumptions have been met?
• Our data permit empirical tests for some assumptions, but not all:
– We can check for:
• Linearity
• Whether an X should be included
• Homoscedasticity
• Autocorrelation
• Normality
– We can't check for:
• Correlation between error and X's
• Mean error equals zero
• All relevant X's included
April 4, 2006 Lecture 11 Slide #13
So what do we do?
• Univariate analysis
– Y, X's -- look for skew, other possible problems with distributions
• Can identify possible outliers, adequacy of variance
[Figures: normal probability plot of c4_31_temp_change (Empirical P[i] = i/(N+1) vs. Normal F[(c4_31_tc - m)/s]) and a density histogram of c4_31_temp_change]
April 4, 2006 Lecture 11 Slide #14
Bi-Variate Scatterplots
• Detect non-linearities (curvilinearity)
• Heteroscedasticity (non-constant variance)
[Figure: scatterplot matrix of c4_31_temp_change, c4_1_ideology, c4_3_environment, c5_3_age, and c4_32_temp_certain]
April 4, 2006 Lecture 11 Slide #15
Residual vs. Predicted Y Plots
• Checks for:
• Curvilinearity (are curves apparent?)
• Heteroscedasticity (fan shapes? Also |e| plots)
• Non-normality (density appropriate?)
• Outliers (singles or clusters evident?)
[Figure: residuals vs. fitted values plot]
April 4, 2006 Lecture 11 Slide #16
Non-Linearity
• One of the signal failings of OLS
• Run an "ovtest"
– Use the "rhs" option to use powers of the right-hand-side variables

Ramsey RESET test using powers of the independent variables
    Ho: model has no omitted variables
    F(21, 2025) = 1.79
    Prob > F = 0.0147

• If non-linear relationships are suspected
– Look at the bivariate plots
– Use "acprplot" for each of the independent variables: an augmented component-plus-residual plot
• acprplot c5_3_age, mspline msopts(bands(7))
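A rough illustration of the RESET logic behind `ovtest, rhs`: refit with powers of the regressors added, then F-test whether those powers add explanatory power. This is a sketch of the idea on synthetic data (with a genuine quadratic term, so the test should reject), not Stata's exact implementation:

```python
import numpy as np

# RESET-style check: does adding x^2 and x^3 significantly reduce the RSS?
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 1, n)  # truly non-linear

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ b) ** 2).sum()

X_small = np.column_stack([np.ones(n), x])            # linear specification
X_big = np.column_stack([np.ones(n), x, x**2, x**3])  # plus rhs powers
h = X_big.shape[1] - X_small.shape[1]
f_stat = ((rss(X_small, y) - rss(X_big, y)) / h) / (rss(X_big, y) / (n - X_big.shape[1]))
print(round(f_stat, 1))  # large F: the linear specification is rejected
```

This is the same nested-model F machinery as Slide #3, with the power terms playing the role of the added regressors.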
[Figure: augmented component-plus-residual plot for c5_3_age, ages 20-100]
April 4, 2006 Lecture 11 Slide #17
Normality of Errors
• This is a critical assumption for OLS because it is required for:
– Hypothesis tests, confidence interval estimation
• Particularly sensitive with small samples
– Efficiency
• Non-normality will increase sample-to-sample variation
• Diagnostics:
• Plot the residuals
• Run the "hettest" (checks for heteroscedasticity)
– Age, environmental status, and (especially) certainty all appear to produce non-standard variance
• Then what?
• Use robust standard errors (the "robust" option):
– "regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_14_co c4_25_io c4_32_ct, beta robust"
• Transform the non-linear variables
– Logs are common with badly skewed dependent variables
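A sketch of what the ", robust" option changes: the coefficient estimates stay ordinary OLS, and only the standard errors are recomputed with a heteroscedasticity-consistent (Huber-White) sandwich. Synthetic data below; the HC1-style small-sample factor is one common convention, assumed here rather than taken from the slides:

```python
import numpy as np

# Classical vs. robust (sandwich) standard errors under heteroscedasticity.
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 0.2 + 0.2 * x, n)  # error variance grows with x

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# Classical (iid) SEs assume a single sigma^2; the sandwich uses e_i^2 per case
se_classic = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - 2))
meat = X.T @ (X * (e**2)[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - 2))
print(se_classic.round(4), se_robust.round(4))
```

The slope estimate is the same either way; only the inference changes, which is exactly what Stata's `regress ..., robust` does.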
April 4, 2006 Lecture 11 Slide #18
Influence Analysis
• Does any particular case substantially change the regression results?
• Can sometimes be spotted visually, but not always
– We use:

DFBETAS_ik = (b_k - b_k(i)) / (s_(i) / √RSS_k)

– Asks: by how many standard errors does b_k change when case i is removed?
• Measures the influence of case i on the kth estimated coefficient
– If DFBETA > 0, then case i pulls b_k up
– If DFBETA < 0, then case i pulls b_k down
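A hedged sketch of the DFBETAS computation by brute-force leave-one-out deletion (fine for small n; Stata's `dfbeta` uses a closed form). The scaling follows the slide's formula, with s_(i) the residual standard error with case i removed and RSS_k the residual sum of squares from regressing X_k on the other X's:

```python
import numpy as np

# Drop each case, refit, and scale the change in the slope coefficient.
rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)
x[0], y[0] = 4.0, -5.0  # plant one clearly influential case

X = np.column_stack([np.ones(n), x])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(X, y)
k = 1  # slope coefficient
rss_k = ((x - x.mean()) ** 2).sum()  # "other X's" here is just the constant

dfbetas = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = ols(X[mask], y[mask])
    e_i = y[mask] - X[mask] @ b_i
    s_i = np.sqrt((e_i @ e_i) / (n - 1 - X.shape[1]))
    dfbetas[i] = (b_full[k] - b_i[k]) / (s_i / np.sqrt(rss_k))

cutoff = 2 / np.sqrt(n)  # the size-adjusted cut-off from the next slide
print(np.abs(dfbetas[0]) > cutoff)  # the planted case exceeds the cut-off
```

Ordinary cases produce |DFBETAS| well under the cut-off; the planted case stands out by an order of magnitude.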
April 4, 2006 Lecture 11 Slide #19
Criteria for Influence Analyses
• External criteria (size-adjusted cut-off):
– If |DFBETA_ik| > 2/√n, then consider deleting
• Gets the top 5% of influential cases, given sample size
• Internal criteria:
– If a box-plot defines a "severe outlier", consider deleting that case
• Caution: Evaluate theory
– Consider possible modeling approaches (dummies?)
– Throwing away data is a last resort!
April 4, 2006 Lecture 11 Slide #20
Obtaining DFBETAS in Stata
• After running the regression, type "dfbeta"
– Saves the DFBETA for:
• Each parameter (including intercept)
• Each case
– Permits:
• Scatterplots, box plots, etc.
– graph box DFc5_3_age DFc5_1a DFc5_4_gen, legend(cols(3))
• Sorts, by size, to identify large-influence cases
April 4, 2006 Lecture 11 Slide #21
Dfbetas for First Three X's
. graph box c4_1_ide c4_3_env, legend(cols(2))

[Figure: box plots of c4_1_ideology and c4_3_environment, scale 0-10]
April 4, 2006 Lecture 11 Slide #22
Checking Outliers
• Sort the dfbetas identified as having outliers
– Use the education case (c5_1a)
– Then list the highest and lowest, with case id's
• Note that there appears to be no obvious pattern
– Suggests keeping the outliers in the model
April 4, 2006 Lecture 11 Slide #23
Homework
• Run diagnostics on a full model of predicted temperature changes
– Evaluate for all diagnosable assumptions, including:
• Multicollinearity
• Non-Linearity
• Normality of errors
• Influence
• Heteroscedasticity
• Prepare a brief (1-page) summary
April 4, 2006 Lecture 11 Slide #24
Take a Break...