April 4, 2006 Lecture 11 Slide #1
Diagnostics for Multivariate Regression Analysis
• Homeworks
• Assumptions of OLS Revisited
• Visual Diagnostic Techniques
• Non-linearity
• Non-normality and Heteroscedasticity
• Outliers and Case Statistics
April 4, 2006 Lecture 11 Slide #2
Homework -- Set 1
• Test for the additional explanatory power of
attention to the temp change issue (c_4_31) when
modeling the certainty of temperature change
(c_4_32). Use the more complex model from the
prior lecture exercise (shown on Slide #5) as your
base of comparison.
• Discuss the theoretical meaning of your results.
April 4, 2006 Lecture 11 Slide #3
F-Testing a Nested Model

Simpler Model:
Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io)

More Complex Model:
Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io) + b7(c4_32_ct)

Given the models, K = 8, H = 1, and n = 2054. Calculating the RSS's involves running the two models and obtaining the RSS from each. For these models, RSS_K = 4397.96 and RSS_(K-H) = 4547.82. So:

F(H, n-K) = [(RSS_(K-H) - RSS_K) / H] / [RSS_K / (n - K)]
          = [(4547.82 - 4397.96) / 1] / [4397.96 / (2054 - 8)]
          = 149.86 / 2.15 = 69.70
Given that df1 = H (1) and df2 = n-K (2046), the p-value of the model improvement shown in Table A4.2 (pp. 351-353) is <0.001. So what?
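As a quick sketch, the F calculation above can be reproduced in a few lines of Python (the helper name `nested_f` is ours, not Stata's; the RSS figures are taken straight from the slide):

```python
# Recompute the slide's nested-model F-test. All numbers (RSS, n, K, H) come
# from the slide; this is just the arithmetic, not a re-analysis of the data.

def nested_f(rss_restricted, rss_full, h, n, k):
    """F = [(RSS_restricted - RSS_full) / H] / [RSS_full / (n - K)]."""
    return ((rss_restricted - rss_full) / h) / (rss_full / (n - k))

f_stat = nested_f(rss_restricted=4547.82, rss_full=4397.96, h=1, n=2054, k=8)
print(round(f_stat, 2))  # ~69.72 (the slide rounds the denominator to 2.15, giving 69.70)
```

The small discrepancy with the slide's 69.70 is purely rounding of the 2.15 denominator.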
April 4, 2006 Lecture 11 Slide #4
Homework -- Set 2
• Write a 1-page paper in which you answer the following question:
– Do male respondents have the same relationship between education (c5_1a) and income (c5_5a) as do female respondents? Control for other relevant X's.
• Be sure to:
– Specify your hypotheses
– Properly recode your variables
– Interpret your statistical tests
April 4, 2006 Lecture 11 Slide #5
Predicting Income with Slope and Intercept Dummies for Male Respondents
So education appears to have the same weight for men and women scientists.
. regress c5_5a c5_3_age c5_1a c5_4_gen gen_ed, beta

      Source |       SS       df       MS              Number of obs =    2079
-------------+------------------------------           F(  4,  2074) =   39.82
       Model |   619.80381     4  154.950953           Prob > F      =  0.0000
    Residual |  8069.81524  2074  3.89094274           R-squared     =  0.0713
-------------+------------------------------           Adj R-squared =  0.0695
       Total |  8689.61905  2078  4.18172235           Root MSE      =  1.9725

------------------------------------------------------------------------------
       c5_5a |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    c5_3_age |   .0205162   .0033118     6.19   0.000                 .1346944
       c5_1a |   .5854507   .1701134     3.44   0.001                 .1744906
    c5_4_gen |  -.1676275   .7010746    -0.24   0.811                -.0299619
      gen_ed |    .128906   .1871409     0.69   0.491                 .0930857
       _cons |   1.853485   .6426378     2.88   0.004                        .
------------------------------------------------------------------------------
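The `gen_ed` regressor above is a slope dummy: gender multiplied by education, alongside the intercept dummy `c5_4_gen`. A minimal synthetic sketch (numpy; invented data, not the survey) of how these terms enter the design matrix:

```python
import numpy as np

# Illustrative sketch (synthetic data): an intercept dummy (gender) shifts the
# regression line; a slope dummy (gender * education, the slide's gen_ed)
# lets education's effect differ by gender.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(25, 70, n)
educ = rng.uniform(1, 5, n)
gender = rng.integers(0, 2, n).astype(float)
gen_ed = gender * educ                         # slope-shift term

# True process: same education slope for both groups (both dummy effects = 0),
# mirroring the slide's finding of no gender difference.
y = 2.0 + 0.02 * age + 0.6 * educ + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), age, educ, gender, gen_ed])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["const", "age", "educ", "gender", "gen_ed"], b.round(3))))
```

With the interaction truly zero, the estimated `gen_ed` coefficient comes out near zero, just as its insignificant t-statistic does in the Stata output above.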
April 4, 2006 Lecture 11 Slide #6
But Wait…
• Diagnostics:
– VIF
– Multicollinearity swamps the model, especially given the small proportion of women scientists
• Tests for linearity and homoscedasticity also raise red flags
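The VIF red flag can be illustrated with a short sketch: VIF_k = 1/(1 - R_k²), where R_k² comes from regressing X_k on the other predictors. Synthetic data below (not the survey) with a built-in gender / gender-times-education pair mimics the collinearity problem:

```python
import numpy as np

# Hedged sketch of the VIF computation: a dummy and its interaction with
# another regressor are strongly collinear, inflating both VIFs.
rng = np.random.default_rng(1)
n = 1000
educ = rng.uniform(1, 5, n)
gender = rng.integers(0, 2, n).astype(float)
gen_ed = gender * educ  # highly collinear with gender

def vif(X, k):
    """VIF of column k of X (no constant column) via an auxiliary regression."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, k, axis=1)])
    xk = X[:, k]
    fitted = others @ np.linalg.lstsq(others, xk, rcond=None)[0]
    r2 = 1 - ((xk - fitted) ** 2).sum() / ((xk - xk.mean()) ** 2).sum()
    return 1 / (1 - r2)

X = np.column_stack([educ, gender, gen_ed])
print([round(vif(X, k), 1) for k in range(3)])  # gender and gen_ed VIFs are large
```

In Stata the equivalent one-liner after `regress` is `estat vif` (or `vif` in older versions).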
April 4, 2006 Lecture 11 Slide #7
Critical OLS Assumptions
• 1: Fixed X's
– The effect of X is constant. Differences in Yi's, given X, are due to variations in "error"
• 2: Errors cancel out
– 1 & 2 assure the independence of errors and X's. Results in unbiased estimations of ß's.
– "Unbiased" means, in the long run, sample estimates will center on the true population parameters:
– Efficiency:

Y_i = β0 + β1·X_i,1 + β2·X_i,2 + ... + β_(K-1)·X_i,(K-1) + ε_i
E[ε_i] = 0 for all i
E[b_k] = β_k
E[(b_k - β_k)²] < E[(a_k - β_k)²] for any other linear unbiased estimator a_k
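The "unbiased" claim above can be illustrated with a small Monte Carlo sketch (synthetic data; fixed X's held constant across draws, fresh errors with mean zero each draw):

```python
import numpy as np

# Illustrative check of unbiasedness: under assumptions 1-2 the OLS estimates
# b_k center on the true beta_k across repeated samples.
rng = np.random.default_rng(2)
beta = np.array([1.0, 2.0, -0.5])  # true beta0, beta1, beta2 (invented)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # fixed X's

estimates = []
for _ in range(2000):
    y = X @ beta + rng.normal(0, 1, 200)  # fresh errors each sample, E[e] = 0
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

print(np.mean(estimates, axis=0).round(2))  # close to [1.0, 2.0, -0.5]
```

Any single draw's b_k wobbles around its beta_k; only the average over many samples pins it down, which is exactly what "in the long run" means on the slide.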
April 4, 2006 Lecture 11 Slide #8
More Critical OLS Assumptions
• 3: Errors have constant variance (homoscedasticity):
Var[ε_i] = σ² for all i
• 4: Errors are uncorrelated with each other (no autocorrelation):
Cov[ε_i, ε_j] = 0 for all i ≠ j
• Implications of assumptions 1-4:
• Standard errors of the estimate are unbiased
• OLS is more efficient than any other linear unbiased estimator (Gauss-Markov Theorem)
• A matter of the degree to which 3 and 4 hold…
– As n-size increases, the stringency of assumptions 3 and 4 decreases (law of large numbers)
April 4, 2006 Lecture 11 Slide #9
Last Critical OLS Assumption
• 5: Errors are Normally Distributed:
ε_i ~ N(0, σ²) for all i
– Justifies use of t and F distributions
– Necessary for hypothesis tests and confidence intervals
– Makes OLS more efficient than any other unbiased estimator
April 4, 2006 Lecture 11 Slide #10
Assumptions of “Correct” Model Specification
• Y is a linear function of modeled X
variables
• No X’s are omitted that affect E[Y] and that
are correlated with included X’s
• All X’s in the model affect E[Y]
April 4, 2006 Lecture 11 Slide #11
Summary of Assumption Failures and their Implications

Problem            Biased b   Biased SE   Invalid t/F   Hi Var
Non-linear         Yes        Yes         Yes           ---
Omit relev. X      Yes        Yes         Yes           ---
Irrel. X           No         No          No            Yes
X meas. error      Yes        Yes         Yes           ---
Heterosced.        No         Yes         Yes           Yes
Autocorr.          No         Yes         Yes           Yes
X corr. error      Yes        Yes         Yes           ---
Non-normal err.    No         No          Yes           Yes
Multicollinearity  No         No          No            Yes
April 4, 2006 Lecture 11 Slide #12
How do we know if the assumptions have been met?
• Our data permit empirical tests for some assumptions, but not all:
– We can check for:
• Linearity
• Whether an X should be included
• Homoscedasticity
• Autocorrelation
• Normality
– We can't check for:
• Correlation between error and X's
• Mean error equals zero
• All relevant X's included
April 4, 2006 Lecture 11 Slide #13
So what do we do?
• Univariate analysis
– Y, X's -- look for skew, other possible problems with distributions
• Can identify possible outliers, adequacy of variance
[Figures: normal probability plot of c4_31_temp_change (Empirical P[i] = i/(N+1) vs. Normal F[(c4_31_tc - m)/s]) and a density histogram of c4_31_temp_change]
April 4, 2006 Lecture 11 Slide #14
Bi-Variate Scatterplots
• Detect non-linearities (curvilinearity)
• Heteroscedasticity (non-constant variance)
[Figure: scatterplot matrix of c4_31_temp_change, c4_1_ideology, c4_3_environment, c5_3_age, and c4_32_temp_certain]
April 4, 2006 Lecture 11 Slide #15
Residual vs. Predicted Y Plots
• Checks for:
• Curvilinearity (are curves apparent?)
• Heteroscedasticity (fan shapes? Also |e| plots)
• Non-normality (density appropriate?)
• Outliers (singles or clusters evident?)
[Figure: residuals vs. fitted values plot]
April 4, 2006 Lecture 11 Slide #16
Non-Linearity
• One of the signal failings of OLS
• Run an "ovtest"
– Use the "rhs" option to use powers of the right-hand-side variables

Ramsey RESET test using powers of the independent variables
    Ho: model has no omitted variables
    F(21, 2025) = 1.79
    Prob > F = 0.0147

• If non-linear relationships are suspected
– Look at the bivariate plots
– Use "acprplot" for each of the independent variables: an augmented component-plus-residual plot
• acprplot c5_3_age, mspline msopts(bands(7))
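A rough illustration of the RESET logic behind `ovtest, rhs`: refit with powers of the regressors added, then F-test whether those powers add explanatory power. This is a sketch of the idea on synthetic data (with a genuine quadratic term, so the test should reject), not Stata's exact implementation:

```python
import numpy as np

# RESET-style check: does adding x^2 and x^3 significantly reduce the RSS?
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 1, n)  # truly non-linear

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ b) ** 2).sum()

X_small = np.column_stack([np.ones(n), x])            # linear specification
X_big = np.column_stack([np.ones(n), x, x**2, x**3])  # plus rhs powers
h = X_big.shape[1] - X_small.shape[1]
f_stat = ((rss(X_small, y) - rss(X_big, y)) / h) / (rss(X_big, y) / (n - X_big.shape[1]))
print(round(f_stat, 1))  # large F: the linear specification is rejected
```

This is the same nested-model F machinery as Slide #3, with the power terms playing the role of the added regressors.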
[Figure: augmented component-plus-residual plot for c5_3_age, ages 20-100]
April 4, 2006 Lecture 11 Slide #17
Normality of Errors
• This is a critical assumption for OLS because it is required for:
– Hypothesis tests, confidence interval estimation
• Particularly sensitive with small samples
– Efficiency
• Non-normality will increase sample-to-sample variation
• Diagnostics:
• Plot the residuals
• Run the "hettest" (checks for heteroscedasticity)
– Age, environmental status, and (especially) certainty all appear to produce non-standard variance
• Then what?
• Use robust standard errors (the "robust" option):
– "regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_14_co c4_25_io c4_32_ct, beta robust"
• Transform the non-linear variables
– Logs are common with badly skewed dependent variables
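A sketch of what the ", robust" option changes: the coefficient estimates stay ordinary OLS, and only the standard errors are recomputed with a heteroscedasticity-consistent (Huber-White) sandwich. Synthetic data below; the HC1-style small-sample factor is one common convention, assumed here rather than taken from the slides:

```python
import numpy as np

# Classical vs. robust (sandwich) standard errors under heteroscedasticity.
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 0.2 + 0.2 * x, n)  # error variance grows with x

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# Classical (iid) SEs assume a single sigma^2; the sandwich uses e_i^2 per case
se_classic = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - 2))
meat = X.T @ (X * (e**2)[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - 2))
print(se_classic.round(4), se_robust.round(4))
```

The slope estimate is the same either way; only the inference changes, which is exactly what Stata's `regress ..., robust` does.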
April 4, 2006 Lecture 11 Slide #18
Influence Analysis
• Does any particular case substantially change the regression results?
• Can sometimes be spotted visually, but not always
– We use:

DFBETAS_ik = (b_k - b_k(i)) / (s_(i) / √RSS_k)

– Asks: by how many standard errors does b_k change when case i is removed?
• Measures the influence of case i on the kth estimated coefficient
– If DFBETA > 0, then case i pulls b_k up
– If DFBETA < 0, then case i pulls b_k down
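A hedged sketch of the DFBETAS computation by brute-force leave-one-out deletion (fine for small n; Stata's `dfbeta` uses a closed form). The scaling follows the slide's formula, with s_(i) the residual standard error with case i removed and RSS_k the residual sum of squares from regressing X_k on the other X's:

```python
import numpy as np

# Drop each case, refit, and scale the change in the slope coefficient.
rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)
x[0], y[0] = 4.0, -5.0  # plant one clearly influential case

X = np.column_stack([np.ones(n), x])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(X, y)
k = 1  # slope coefficient
rss_k = ((x - x.mean()) ** 2).sum()  # "other X's" here is just the constant

dfbetas = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = ols(X[mask], y[mask])
    e_i = y[mask] - X[mask] @ b_i
    s_i = np.sqrt((e_i @ e_i) / (n - 1 - X.shape[1]))
    dfbetas[i] = (b_full[k] - b_i[k]) / (s_i / np.sqrt(rss_k))

cutoff = 2 / np.sqrt(n)  # the size-adjusted cut-off from the next slide
print(np.abs(dfbetas[0]) > cutoff)  # the planted case exceeds the cut-off
```

Ordinary cases produce |DFBETAS| well under the cut-off; the planted case stands out by an order of magnitude.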
April 4, 2006 Lecture 11 Slide #19
Criteria for Influence Analyses
• External criteria (size-adjusted cut-off):
– If |DFBETA_ik| > 2/√n, then consider deleting
• Gets the top 5% of influential cases, given sample size
• Internal criteria:
– If a box-plot defines a "severe outlier", consider deleting that case
• Caution: Evaluate theory
– Consider possible modeling approaches (dummies?)
– Throwing away data is a last resort!
April 4, 2006 Lecture 11 Slide #20
Obtaining DFBETAS in Stata
• After running the regression, type "dfbeta"
– Saves the DFBETA for:
• Each parameter (including intercept)
• Each case
– Permits:
• Scatterplots, box plots, etc.
– graph box DFc5_3_age DFc5_1a DFc5_4_gen, legend(cols(3))
• Sorts, by size, to identify large-influence cases
April 4, 2006 Lecture 11 Slide #21
Dfbetas for First Three X's
. graph box c4_1_ide c4_3_env, legend(cols(2))

[Figure: box plots of c4_1_ideology and c4_3_environment, scale 0-10]
April 4, 2006 Lecture 11 Slide #22
Checking Outliers
• Sort the dfbetas identified as having outliers
– Use the education case (c5_1a)
– Then list the highest and lowest, with case id's
• Note that there appears to be no obvious pattern
– Suggests keeping the outliers in the model
April 4, 2006 Lecture 11 Slide #23
Homework
• Run diagnostics on a full model of predicted temperature changes
– Evaluate for all diagnosable assumptions, including:
• Multicollinearity
• Non-Linearity
• Normality of errors
• Influence
• Heteroscedasticity
• Prepare a brief (1-page) summary
April 4, 2006 Lecture 11 Slide #24
Take a Break...