Topic 9: Remedies


Page 1: Topic 9: Remedies

Topic 9: Remedies

Page 2: Topic 9: Remedies

Outline

• Review diagnostics for residuals

• Discuss remedies

–Nonlinear relationship

–Nonconstant variance

–Non-Normal distribution

–Outliers

Page 3: Topic 9: Remedies

Diagnostics for residuals

• Look at residuals to find serious violations of the model assumptions

–nonlinear relationship

–nonconstant variance

–non-Normal errors

  • presence of outliers

  • a strongly skewed distribution

Page 4: Topic 9: Remedies

Recommendations for checking assumptions

• Plot Y vs X (is it a linear relationship?)

• Look at distribution of residuals

• Plot residuals vs X, time, or any other potential explanatory variable

• Use i=sm## (for example, i=sm50) in the symbol statement to get smoothed curves; larger numbers give smoother curves
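The checks listed above could be run along the following lines. This is a minimal sketch, assuming the toluca data set (variables hours and lotsize) used elsewhere in these notes; the output data set name diag, the variable names resid and pred, and the smoothing value 70 are illustrative choices, not taken from the course files.

```sas
/* Save residuals and predicted values from the simple linear fit */
proc reg data=toluca noprint;
  model hours=lotsize;
  output out=diag r=resid p=pred;   /* diag, resid, pred are assumed names */
run;

/* Y vs X with a smoothed curve: is the relationship linear? */
symbol1 v=circle i=sm70;
proc gplot data=diag;
  plot hours*lotsize;
run;

/* Residuals vs X with a smoothed curve and a reference line at zero */
proc gplot data=diag;
  plot resid*lotsize / vref=0;
run;
```

A roughly flat smooth around zero in the residual plot is what a linear relationship with constant variance would be expected to produce.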

Page 5: Topic 9: Remedies

Plots of Residuals

• Plot residuals vs

–Time (order)

–X or predicted value (b0+b1X)

• Look for

–nonrandom patterns

–outliers (unusual observations)

Page 6: Topic 9: Remedies

Residuals vs Order

• A pattern in the plot suggests dependent errors (lack of independence)

• The pattern is usually a linear or quadratic trend and/or cyclical

• If you are interested, read KNNL pp. 108-110
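A residuals-vs-order plot might be produced as sketched below. The sketch assumes the rows of toluca are stored in the order the observations were collected, which the notes do not state; the names diag, resid, and order are illustrative.

```sas
/* Save residuals from the fit */
proc reg data=toluca noprint;
  model hours=lotsize;
  output out=diag r=resid;
run;

/* Use the row number as the time order (assumes rows are in time order) */
data diag;
  set diag;
  order=_n_;
run;

/* Join the points so that trends or cycles are easier to see */
symbol1 v=circle i=join;
proc gplot data=diag;
  plot resid*order / vref=0;
run;
```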

Page 7: Topic 9: Remedies

Residuals vs X

• Can look for

–nonconstant variance

–nonlinear relationship

–outliers

–non-Normality of residuals (only a rough check)

Page 8: Topic 9: Remedies

Tests for Normality

• H0: data are an i.i.d. sample from a Normal population

• Ha: data are not an i.i.d. sample from a Normal population

• KNNL (p 115) suggest a correlation test that requires a table look-up

Page 9: Topic 9: Remedies

Tests for Normality

• We have several choices for a significance testing procedure

• Proc univariate with the normal option provides four

proc univariate normal;

• Shapiro-Wilk is a common choice
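In a regression setting the tests are applied to the residuals rather than to Y itself. A minimal sketch, assuming the toluca data set from these notes and illustrative names diag and resid:

```sas
/* Save the residuals from the regression */
proc reg data=toluca noprint;
  model hours=lotsize;
  output out=diag r=resid;
run;

/* The normal option requests the four tests of Normality;
   the qqplot statement adds a Normal quantile plot for a visual check */
proc univariate data=diag normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```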

Page 10: Topic 9: Remedies

Tests for Normality

Test                 Statistic           p Value

Shapiro-Wilk         W      0.978904     Pr < W       0.8626

Kolmogorov-Smirnov   D      0.09572      Pr > D      >0.1500

Cramer-von Mises     W-Sq   0.033263     Pr > W-Sq   >0.2500

Anderson-Darling     A-Sq   0.207142     Pr > A-Sq   >0.2500

All P-values > 0.05…Do not reject H0

Page 11: Topic 9: Remedies

Other tests for model assumptions

• Durbin-Watson test for serially correlated errors (KNNL p 114)

• Modified Levene test for homogeneity of variance (KNNL p 116-118)

• Breusch-Pagan test for homogeneity of variance (KNNL p 118)

• For SAS commands see topic9.sas
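The following is one possible way to obtain two of these tests; it is a sketch and not the contents of topic9.sas. It assumes the toluca data are in time order for the Durbin-Watson statistic, and it uses the Brown-Forsythe option of PROC GLM as a stand-in for the modified Levene test (both work with absolute deviations from the group medians). The cut point of 75 for splitting lot sizes into two groups and the names diag, resid, and group are illustrative assumptions.

```sas
/* Durbin-Watson statistic; residuals saved for the variance test */
proc reg data=toluca;
  model hours=lotsize / dw;
  output out=diag r=resid;
run;

/* Split the cases into low-X and high-X groups (cut point is illustrative) */
data diag;
  set diag;
  group = (lotsize > 75);
run;

/* Brown-Forsythe test of equal residual spread in the two groups */
proc glm data=diag;
  class group;
  model resid=group;
  means group / hovtest=bf;
run;
```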

Page 12: Topic 9: Remedies

Plots vs significance test

• Plots are more likely to suggest a remedy

• Significance tests results are very dependent on the sample size; with sufficiently large samples we can reject most null hypotheses

Page 13: Topic 9: Remedies

Default graphics with SAS 9.3

proc reg data=toluca;

  model hours=lotsize;

  id lotsize;

run;

Page 14: Topic 9: Remedies

Will discuss these diagnostics more in multiple regression

Provides rule of thumb limits

Questionable observation (30,273)

Page 15: Topic 9: Remedies

Additional summaries

• Rstudent: Studentized residual…almost all should be between ± 2

• Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential

• Cook's D: Influence of the ith case on all predicted values
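These summaries can also be written to a data set for closer inspection. A minimal sketch, assuming the toluca data set; the names infl, rstud, lev, and cd are illustrative, and the cutoff of 2 is the rule of thumb from the slide above.

```sas
/* Save studentized residuals, leverage, and Cook's D for each case */
proc reg data=toluca noprint;
  model hours=lotsize;
  output out=infl rstudent=rstud h=lev cookd=cd;
run;

/* List the cases whose studentized residual falls outside +/- 2 */
proc print data=infl;
  where abs(rstud) > 2;
  var lotsize hours rstud lev cd;
run;
```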

Page 18: Topic 9: Remedies

Lack of fit

• When we have repeat observations at different values of X, we can do a significance test for nonlinearity

• Browse through KNNL Section 3.7

• Details of approach discussed when we get to KNNL 17.9, p 762

• Basic idea is to compare two models

• Gplot with a smooth is a better (i.e., simpler) approach

Page 19: Topic 9: Remedies

SAS code and output

proc reg data=toluca;

  model hours=lotsize / lackfit;

run;

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F

Model               1           252378        252378    105.88   <.0001

Error              23            54825    2383.71562

  Lack of Fit       9            17245    1916.06954      0.71   0.6893

  Pure Error       14            37581    2684.34524

Corrected Total    24           307203
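As a check on how the lack-of-fit line is formed, the F statistic is the ratio of the lack-of-fit mean square to the pure-error mean square. In the notation assumed here (not defined on the slides), n is the number of observations and c is the number of distinct X values, so the degrees of freedom 9 and 14 in the output correspond to c - 2 and n - c with n = 25 and c = 11.

```latex
F^{*} = \frac{MSLF}{MSPE}
      = \frac{SSLF/(c-2)}{SSPE/(n-c)}
      = \frac{1916.07}{2684.35}
      \approx 0.71
```

With a p-value of 0.6893 there is no evidence against the linear model.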

Page 20: Topic 9: Remedies

Nonlinear relationships

• We can model many nonlinear relationships with linear models; some of these use several explanatory variables (i.e., multiple linear regression)

–Y = β0 + β1X + β2X² + e (quadratic)

–Y = β0 + β1log(X) + e
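Both models are linear in the β's, so they can be fit with PROC REG once the new explanatory variables are created in a data step. A minimal sketch using the toluca data purely for illustration; the names toluca2, lotsize2, and loglot are assumptions.

```sas
/* Create the squared and logged versions of X */
data toluca2;
  set toluca;
  lotsize2 = lotsize**2;
  loglot   = log(lotsize);
run;

/* Fit the quadratic model and the log(X) model in one PROC REG call */
proc reg data=toluca2;
  quad: model hours = lotsize lotsize2;
  logx: model hours = loglot;
run;
```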

Page 21: Topic 9: Remedies

Nonlinear Relationships

• Sometimes we can transform a nonlinear equation into a linear equation

• Consider Y = β0exp(β1X) + e

• Can form linear model using log

log(Y) = log(β0) + β1X + log(e)

• Note that we have changed our assumption about the error: the log model corresponds to a multiplicative error, Y = β0exp(β1X)e, rather than the additive error above

Page 22: Topic 9: Remedies

Nonlinear Relationship

• We can perform a nonlinear regression analysis

• KNNL Chapter 13

• SAS PROC NLIN
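A nonlinear fit of the exponential model with additive error might look like the sketch below. It borrows the a1 data set (variables age and plasma) from the Box-Cox example later in these notes; the parameter names b0 and b1 and their starting values are illustrative guesses, and PROC NLIN may need different starting values to converge.

```sas
/* Fit plasma = b0*exp(b1*age) + e directly by nonlinear least squares */
proc nlin data=a1;
  parms b0=15 b1=-0.2;            /* starting values are assumptions */
  model plasma = b0*exp(b1*age);
run;
```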

Page 23: Topic 9: Remedies

Nonconstant variance

• Sometimes we model the way in which the error variance changes

–may be linearly related to X

• We can then use a weighted analysis

• KNNL 11.1

• Use a weight statement in PROC REG
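One common way to build the weights is to model the spread of the residuals as a function of X and weight each case by the inverse of its estimated variance. The sketch below assumes the standard deviation is linear in X and uses the toluca data only for illustration; the data set and variable names (step1, step2, absr, shat, wt) are assumptions, and the weights are only sensible when the fitted values shat are positive.

```sas
/* Step 1: ordinary fit, keep the residuals */
proc reg data=toluca noprint;
  model hours=lotsize;
  output out=step1 r=resid;
run;

/* Step 2: regress |residual| on X to estimate the standard deviation */
data step1;
  set step1;
  absr=abs(resid);
run;
proc reg data=step1 noprint;
  model absr=lotsize;
  output out=step2 p=shat;
run;

/* Step 3: weight = 1 / estimated variance, then refit with a weight statement */
data step2;
  set step2;
  wt=1/(shat**2);
run;
proc reg data=step2;
  model hours=lotsize;
  weight wt;
run;
```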

Page 24: Topic 9: Remedies

Non-Normal errors

• Transformations often help

• Use a procedure that allows different distributions for the error term

–SAS PROC GENMOD

Page 25: Topic 9: Remedies

Generalized Linear Model

• Possible distributions of Y:

–Binomial (Y/N or percentage data)

–Poisson (Count data)

–Gamma (exponential)

–Inverse gaussian

–Negative binomial

–Multinomial

• Specify a link function for E(Y)
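In PROC GENMOD the distribution and the link are given as options on the model statement. A minimal sketch for count data with a log link; the data set counts and the variables y and x are hypothetical and do not come from these notes.

```sas
/* Poisson regression with a log link: log(E(Y)) = b0 + b1*x */
proc genmod data=counts;
  model y = x / dist=poisson link=log;
run;
```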

Page 26: Topic 9: Remedies

Ladder of Reexpression (transformations)

[Figure: ladder of powers p = -1.0, -0.5, 0.0, 0.5, 1.0, 1.5; the transformation is x^p]

Page 27: Topic 9: Remedies

Circle of Transformations

[Figure: circle of transformations on X and Y axes, with quadrants labeled "X up, Y up", "X up, Y down", "X down, Y up", and "X down, Y down"]

Page 28: Topic 9: Remedies

Box-Cox Transformations

• Also called power transformations

• These transformations adjust for non-Normality and nonconstant variance

• Y´ = Y^λ or Y´ = (Y^λ - 1)/λ

• In the second form, the limit as λ approaches zero is the (natural) log
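The limit can be verified by writing the power as an exponential and applying L'Hôpital's rule in λ (a short derivation added for completeness, assuming Y > 0):

```latex
\lim_{\lambda \to 0} \frac{Y^{\lambda} - 1}{\lambda}
  = \lim_{\lambda \to 0} \frac{e^{\lambda \log Y} - 1}{\lambda}
  = \lim_{\lambda \to 0} \frac{(\log Y)\, e^{\lambda \log Y}}{1}
  = \log Y
```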

Page 29: Topic 9: Remedies

Important Special Cases

• λ = 1, Y´ = Y^1, no transformation

• λ = .5, Y´ = Y^(1/2), square root

• λ = -.5, Y´ = Y^(-1/2), one over square root

• λ = -1, Y´ = Y^(-1) = 1/Y, inverse

• λ = 0, Y´ = (natural) log of Y

Page 30: Topic 9: Remedies

Box-Cox Details

• We can estimate λ by including it as a parameter in a non-linear model

• Y^λ = β0 + β1X + e

and using the method of maximum likelihood

• Details are in KNNL p 134-137

• SAS code is in boxcox.sas

Page 31: Topic 9: Remedies

Box-Cox Solution

• Standardized transformed Y is

–K1(Y^λ - 1) if λ ≠ 0

–K2log(Y) if λ = 0

where K2 = (∏ Yi)^(1/n) (the geometric mean)

and K1 = 1/(λ K2^(λ-1))

• Run regressions with X as explanatory variable

• The estimated λ minimizes SSE

Page 32: Topic 9: Remedies

Example

data a1;

  input age plasma @@;

cards;

0 13.44 0 12.84 0 11.91 0 20.09

0 15.60 1 10.11 1 11.38 1 10.28

1 8.96 1 8.59 2 9.83 2 9.00

2 8.65 2 7.85 2 8.88 3 7.94

3 6.01 3 5.14 3 6.90 3 6.77

4 4.86 4 5.10 4 5.67 4 5.75

4 6.23

;

Page 34: Topic 9: Remedies

Box Cox Procedure

*Procedure that will automatically find the Box-Cox transformation;

proc transreg data=a1;

model boxcox(plasma)=identity(age);

run;

Page 35: Topic 9: Remedies

Transformation Information for BoxCox(plasma)

Lambda R-Square Log Like

-2.50 0.76 -17.0444

-2.00 0.80 -12.3665

-1.50 0.83 -8.1127

-1.00 0.86 -4.8523 *

-0.50 0.87 -3.5523 <

0.00 + 0.85 -5.0754 *

0.50 0.82 -9.2925

1.00 0.75 -15.2625

1.50 0.67 -22.1378

2.00 0.59 -29.4720

2.50 0.50 -37.0844

< - Best Lambda

* - Confidence Interval

+ - Convenient Lambda

Page 36: Topic 9: Remedies

*The first part of the program gets the geometric mean;

data a2; set a1;

  lplasma=log(plasma);

proc univariate data=a2 noprint;

  var lplasma;

  output out=a3 mean=meanl;

Page 37: Topic 9: Remedies

*Compute the standardized transformed Y for each lambda (l) from -1 to 1;

data a4; set a2;

  if _n_ eq 1 then set a3;

  keep age yl l;

  k2=exp(meanl);

  do l = -1.0 to 1.0 by .1;

    if abs(l) < 1E-8 then yl=k2*log(plasma);

    else do;

      k1=1/(l*k2**(l-1));

      yl=k1*(plasma**l -1);

    end;

    output;

  end;

Page 38: Topic 9: Remedies

proc sort data=a4 out=a4; by l;

proc reg data=a4 noprint outest=a5;

  model yl=age;

  by l;

data a5; set a5;

  n=25; p=2;

  sse=(n-p)*(_rmse_)**2;

proc print data=a5;

  var l sse;

Page 39: Topic 9: Remedies

Obs     l        sse

  1   -1.0    33.9089

  2   -0.9    32.7044

  3   -0.8    31.7645

  4   -0.7    31.0907

  5   -0.6    30.6868

  6   -0.5    30.5596

  7   -0.4    30.7186

  8   -0.3    31.1763

  9   -0.2    31.9487

 10   -0.1    33.0552

Page 40: Topic 9: Remedies

symbol1 v=none i=join;

proc gplot data=a5;

  plot sse*l;

run;

Page 42: Topic 9: Remedies

data a1; set a1;

tplasma = plasma**(-.5);

tage = (age+.5)**(-.5);

symbol1 v=circle i=sm50;

proc gplot; plot tplasma*age;

proc sort; by tage;

proc gplot; plot tplasma*tage;

run;
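Once the transformed variables look linearly related, the regression can be refit on the transformed scale and the residuals checked again. A minimal sketch, assuming the data step above has been run so that a1 contains tplasma and tage; the names chk, resid, and pred are illustrative.

```sas
/* Refit on the transformed scale and save the residuals */
proc reg data=a1;
  model tplasma=tage;
  output out=chk r=resid p=pred;
run;

/* Re-check Normality of the residuals after the transformation */
proc univariate data=chk normal;
  var resid;
run;
```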

Page 45: Topic 9: Remedies

Background Reading

• Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested).

• Box-Cox transformation is in nknw132.sas

• Read sections 4.1, 4.2, 4.4, 4.5, and 4.6