
Anareg Week 10: Multicollinearity, Interesting Special Cases, Polynomial Regression

Multicollinearity

• The numerical analysis problem is that the matrix X’X is close to singular and is therefore difficult to invert accurately.
• The statistical problem is that there is too much correlation among the explanatory variables, and it is therefore difficult to determine the regression coefficients.

Multicollinearity (2)

• Solve the statistical problem and the numerical problem will also be solved.
• The statistical problem is more serious than the numerical problem.
• We want to refine a model that has redundancy in the explanatory variables, even if X’X can be inverted without difficulty.

Multicollinearity (3)

• Extreme cases can help us to understand the problem:
• If all X’s are uncorrelated, Type I SS and Type II SS will be the same; i.e., the contribution of each explanatory variable to the model will be the same whether or not the other explanatory variables are in the model.
• If there is a linear combination of the explanatory variables that is a constant (e.g., X1 = X2, so X1 - X2 = 0), then the Type II SS for the X’s involved will be zero.

The CS example: Y = gpa; the explanatory variables include hsm, hss, hse, satm, satv, and genderm.

Define sat = satm + satv. We will regress Y on sat, satm, and satv.
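A minimal SAS sketch of these two steps (cs and cs2 are hypothetical data set names; the course program cs4.sas presumably did something similar):

    data cs2;
       set cs;              /* cs: hypothetical name for the CS data set */
       sat = satm + satv;   /* sat is an exact linear combination */
    run;

    proc reg data=cs2;
       model gpa = sat satm satv;   /* X'X is singular: satv = sat - satm */
    run;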

Source            DF
Model              2
Error            221
Corrected Total  223

• Something is wrong
• dfM = 2, but there are 3 X’s

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

satv = sat - satm

Parameter   DF   Estimate   Std Error   t Value   Pr > |t|
Intercept    1     1.28       0.37        3.43     0.0007
sat          B    -0.00       0.00       -0.04     0.9684
satm         B     0.00       0.00        2.10     0.0365
satv         0     0          .           .        .

Extent of multicollinearity

• Our CS example had one explanatory variable equal to a linear combination of other explanatory variables.
• This is the most extreme case of multicollinearity and is detected by statistical software because (X’X) does not have an inverse.
• We are concerned with cases less extreme.

Effects of multicollinearity

• Regression coefficients are not well estimated and may be meaningless.
• The same holds for the standard errors of these estimates.
• Type I SS and Type II SS will differ.
• R2 and predicted values are usually OK.

Two separate problems

• Numerical accuracy: (X’X) is difficult to invert; need good software.
• Statistical problem: results are difficult to interpret; need a better model.

Polynomial regression

• We can fit linear, quadratic, cubic, etc. models by defining squares, cubes, etc. in a data step and using these as predictors in a multiple regression (see the sketch after this list).
• We can do this with more than one explanatory variable.
• When we do this, we generally create a multicollinearity problem.
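A minimal sketch of such a data step (mydata, x, and y are hypothetical names):

    data poly;
       set mydata;          /* mydata: hypothetical input data set */
       x2 = x*x;            /* quadratic term */
       x3 = x*x*x;          /* cubic term */
    run;

    proc reg data=poly;
       model y = x x2 x3;   /* third-order polynomial in x */
    run;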

Polynomial regression (2)

• We can remove the correlation between explanatory variables and their squares:
• Center (subtract the mean) before squaring.
• NKNW rescale by standardizing (subtract the mean and divide by the standard deviation).
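One way to see why centering helps: for a centered variable x (so E[x] = 0),

$$\mathrm{Cov}(x, x^2) = E[x^3] - E[x]\,E[x^2] = E[x^3],$$

which is zero whenever the design is symmetric about its mean, so the linear and quadratic terms become uncorrelated. (The power cell design below, with equally spaced levels, is symmetric in this sense.)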

Interaction models

• With several explanatory variables, we need to consider the possibility that the effect of one variable depends on the value of another variable (for example, the effect of charge rate on power cell life may differ at different temperatures, as in the NKNW example below).

Special cases

• One independent variable – second order
• One independent variable – third order
• Two independent variables – second order

One independent variable – second order

The regression model is

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i, \qquad x_i = X_i - \bar{X}.$$

The mean response,

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2,$$

is a parabola and is frequently called a quadratic response function. β0 represents the mean response of Y when x = 0; β1 is often called the linear effect coefficient, while β11 is called the quadratic effect coefficient.

One independent variable – third order

The regression model is

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i, \qquad x_i = X_i - \bar{X}.$$

The mean response is

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3.$$

Two independent variables – second order

The regression model is

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \varepsilon_i,$$

where $x_{i1} = X_{i1} - \bar{X}_1$ and $x_{i2} = X_{i2} - \bar{X}_2$.

The mean response,

$$E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2},$$

is the equation of a conic section. The coefficient β12 is often called the interaction effect coefficient.
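To see why β12 is called the interaction effect coefficient, differentiate the mean response above with respect to x_{i1}:

$$\frac{\partial E(Y_i)}{\partial x_{i1}} = \beta_1 + 2\beta_{11} x_{i1} + \beta_{12} x_{i2}.$$

When β12 ≠ 0, the effect of x1 on the mean response depends on the value of x2, which is exactly the sense in which the two variables interact.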

NKNW example, p. 330

• The response variable is the life (in cycles) of a power cell.
• The explanatory variables are charge rate (3 levels) and temperature (3 levels).
• This is a designed experiment.

Obs   cycles   chrate   temp
  1     150      0.6      10
  2      86      1.0      10
  3      49      1.4      10
  4     288      0.6      20
  5     157      1.0      20
  6     131      1.0      20
  7     184      1.0      20
  8     109      1.4      20
  9     279      0.6      30
 10     235      1.0      30
 11     224      1.4      30

Create new variables: chrate2 = chrate*chrate; temp2 = temp*temp; ct = chrate*temp. Then regress cycles on chrate, temp, chrate2, temp2, and ct (a sketch follows).
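A minimal SAS sketch of these steps (power and power2 are hypothetical data set names; the course used NKNW302.sas):

    data power2;
       set power;                /* power: hypothetical name for the data above */
       chrate2 = chrate*chrate;  /* quadratic terms */
       temp2   = temp*temp;
       ct      = chrate*temp;    /* interaction term */
    run;

    proc reg data=power2;
       model cycles = chrate temp chrate2 temp2 ct;
    run;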

Var        b        s(b)     t       Pr > |t|
Int      162.84    16.61    9.81     0.0002
Chrate   -55.83    13.22   -4.22     <0.01
Temp      75.50    13.22    5.71     <0.005
Chrate2   27.39    20.34    1.35     0.2359
Temp2    -10.61    20.34   -0.52     0.6244
ct        11.50    16.19    0.71     0.5092

b. ANOVA table

Source                          df      SS       MS
Regression                       5    55366    11073
  X1                             1    18704    18704
  X2 | X1                        1    34201    34201
  X1² | X1, X2                   1     1646     1646
  X2² | X1, X2, X1²              1      285      285
  X1X2 | X1, X2, X1², X2²        1      529      529
Error                            5     5240     1048
Total                           10    60606

Conclusion

• We have a multicollinearity problem.
• Let’s look at the correlations (use proc corr; a sketch follows the output below).
• There are some very high correlations:

r(chrate, chrate2) = 0.99103
r(temp, temp2) = 0.98609
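A minimal proc corr sketch that would produce these correlations (power2 is the hypothetical data set name used above):

    proc corr data=power2;
       var chrate temp chrate2 temp2 ct;   /* pairwise correlations among the predictors */
    run;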

A remedy

• We can remove the correlation between explanatory variables and their squares (see the sketch below):
• Center (subtract the mean) before squaring.
• NKNW rescale by standardizing (subtract the mean and divide by the standard deviation).
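A minimal sketch of the centering remedy in SAS. The means 1.0 (chrate) and 20 (temp) follow from the 11 observations listed earlier; power and power2c are hypothetical data set names:

    data power2c;
       set power;
       chratec  = chrate - 1.0;       /* center at the mean charge rate (1.0) */
       tempc    = temp - 20;          /* center at the mean temperature (20) */
       chratec2 = chratec*chratec;    /* squares of the centered variables */
       tempc2   = tempc*tempc;
       ctc      = chratec*tempc;      /* cross product of the centered variables */
    run;

    proc reg data=power2c;
       model cycles = chratec tempc chratec2 tempc2 ctc;
    run;

Because this design is symmetric about those means, the centered variables are uncorrelated with their squares, removing the r ≈ 0.99 correlations seen above.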

Last slide

• Read NKNW 7.6 to 7.7 and the problems on pp. 317–326.
• We used programs cs4.sas and NKNW302.sas to generate the output for today.
• Read NKNW 8.5 and Chapter 9.