Multicollinearity: an introductory example
A high-tech business wants to measure the effect of advertising on sales and to distinguish between traditional advertising (TV and newspapers) and advertising on the internet.
– Y : sales in $m
– X1: traditional advertising in $m
– X2: internet advertising in $m
Data: Sales3.sav
A matrix scatter plot of the data shows the pairwise relations; the sample correlations are:
Cor(y, x1) = 0.983
Cor(y, x2) = 0.986
Cor(x1, x2) = 0.990
x1 and x2 are strongly correlated, i.e. they share a substantial amount of common information:
x1 = α0 + α1x2 + ε
Regression output: using x1 only

Model summary
  R = .983   R² = .965   Adj. R² = .962   Std. Error of the Estimate = .9764

ANOVA
              SS        df   MS        F         Sig.
  Regression  265.466    1   265.466   278.438   .000
  Residual      9.534   10      .953
  Total       275.000   11

Coefficients
                                      B       Std. Error   t        Sig.
  (Constant)                          .885    .696          1.272   .232
  X1 (traditional advertising, $m)   2.254    .135         16.686   .000
Equivalent results are obtained when using x2 only.
Regression output: using x1 and x2

Model summary
  R = .987   R² = .974   Adj. R² = .968   Std. Error of the Estimate = .8916

ANOVA
              SS        df   MS        F         Sig.
  Regression  267.846    2   133.923   168.483   .000
  Residual      7.154    9      .795
  Total       275.000   11

Coefficients
                                      B       Std. Error   t       Sig.
  (Constant)                          1.992   .902         2.210   .054
  X1 (traditional advertising, $m)     .767   .868          .884   .400
  X2 (internet advertising, $m)       1.275   .737         1.730   .118
When both are entered, neither x1 nor x2 is significant any more, even though the overall fit is excellent.
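This behaviour is easy to reproduce. Below is a minimal simulation sketch in Python (numpy and statsmodels assumed; the data are synthetic, not Sales3.sav, so the numbers differ from the output above): each predictor is highly significant on its own, yet typically neither passes its individual t-test when both are entered, while the overall F-test remains significant.

```python
# Minimal sketch with synthetic data (not Sales3.sav): two strongly
# correlated predictors, each significant alone, neither significant together.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 12                                          # same n as in the slides
x2 = rng.uniform(1, 10, n)                      # "internet advertising"
x1 = 0.5 + x2 + rng.normal(0, 0.3, n)           # "traditional", Cor(x1, x2) ~ 0.99
y = 1.0 + 1.2 * x1 + 1.0 * x2 + rng.normal(0, 1.0, n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m12 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(m1.pvalues[1])    # x1 alone: highly significant
print(m12.pvalues[1:])  # together: individual t-tests typically inconclusive
print(m12.f_pvalue)     # while the overall F-test remains significant
```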
Multicollinearity
Multicollinearity exists when two or more of the independent variables are moderately or highly correlated with each other.
In practice, if independent variables are (highly) correlated, they contribute largely redundant information, which prevents isolating the effect of a single independent variable on y. Confusion is often the result.
High levels of multicollinearity: a) inflate the variance of the β estimates; b) may make the regression results misleading and confusing.
In the extreme case, if there exists perfect correlation among some of the independent variables, OLS estimates cannot be computed.
xi = α0 + α1·xj + … + αp·x(j+p) + ε,   with j + p < k and i ∉ {j, j+1, …, j+p}
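A quick illustration of this extreme case (a minimal sketch, assuming numpy; not part of the original slides): if one column of the design matrix is an exact linear combination of the others, the matrix is rank-deficient and the normal equations have no unique solution.

```python
# Sketch: perfect collinearity makes the design matrix rank-deficient.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
X = np.column_stack([np.ones(30), x1, x2, x1 + x2])  # last column is redundant
print(np.linalg.matrix_rank(X))  # 3, not 4: X'X is singular, no unique OLS fit
```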
Detecting Multicollinearity
The following are indicators of multicollinearity:
1. Significant correlations between pairs of independent variables in the model (sufficient but not necessary).
2. Nonsignificant t-tests for all (or nearly all) the individual β parameters when the F test for model adequacy H0: β1= β2 = … = βk = 0 is significant.
3. Signs opposite to those expected in the estimated parameters.
4. A variance inflation factor (VIF) for a β parameter greater than 10, where VIFi = 1/(1 − Ri²) and Ri² is the R² obtained by regressing xi on the other independent variables.
The VIFs can be calculated in SPSS by selecting “Collinearity diagnostics” in the “Statistics” options in the “Regression” dialog box.
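For readers working outside SPSS, here is a numpy-only sketch of that definition (an illustration, not SPSS's implementation):

```python
# Sketch: VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-square obtained by
# regressing predictor i on all the remaining predictors.
import numpy as np

def vifs(X):
    """X: (n, k) array of predictors, one column per independent variable."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        y = X[:, i]                                    # predictor i as response
        Z = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - rss / tss                           # R_i^2
        out[i] = 1.0 / (1.0 - r2)                      # VIF_i
    return out
```

For two predictors the formula reduces to 1/(1 − r²); with Cor(x1, x2) = 0.990 from the introductory example, both VIFs would be about 1/(1 − 0.990²) ≈ 50, far above the rule-of-thumb threshold of 10.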
A typical situation
Multicollinearity can arise when transforming variables, e.g. using both x1 and x1² in the regression equation when the range of values of x1 is limited.

  X     X²
  1.0    1.00
  1.2    1.44
  1.4    1.96
  1.6    2.56
  1.8    3.24
  2.0    4.00
  2.2    4.84
  2.4    5.76
  2.6    6.76
  2.8    7.84
  3.0    9.00
  3.2   10.24
  3.4   11.56
  3.6   12.96
  3.8   14.44
  4.0   16.00
[Scatter plot of X² versus X]

Cor(x, x²) = 0.987
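A quick numerical check of this figure (a sketch in Python; numpy assumed):

```python
# Reproduces Cor(x, x^2) ~ 0.987 for the limited range 1.0 .. 4.0 above.
import numpy as np

x = np.linspace(1.0, 4.0, 16)        # 1.0, 1.2, ..., 4.0, as in the table
print(np.corrcoef(x, x ** 2)[0, 1])  # ~0.987
```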
Remember: if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it; each variable then provides enough independent information for its contribution to be assessed.
If your main goal is explaining relationships, then multicollinearity may be a problem, because the measured effects can be misleading.
If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.
Some solutions to Multicollinearity
Get more data if you can.
Drop one or more of the correlated independent variables from the final model. A screening procedure like Stepwise regression may be helpful in determining which variable to drop.
If you keep all independent variables, be cautious in interpreting parameter values and keep predictions within the range of your data.
Use Ridge regression (we do not touch this subject in the course).
Some solutions to Multicollinearity (continued)
If the multicollinearity is introduced by the use of higher-order terms (e.g. using x and x², or x1, x2 and x1x2), use the independent variables as deviations from their means.
Example: suppose multicollinearity is present in E(Y) = β0 + β1x + β2x²
1) Compute x* = x − mean(x)
2) Run the regression E(Y) = β0 + β1x* + β2(x*)²
In most cases the multicollinearity is greatly reduced. Clearly, the β parameters of the new regression will have different values and meanings.
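Continuing the numerical sketch from above: for the grid 1.0, 1.2, …, 4.0 the centered variable x* is symmetric around zero, so the correlation between x* and (x*)² vanishes.

```python
# Centering: x* = x - mean(x). For the symmetric grid above, the odd moments
# of x* vanish, so Cor(x*, x*^2) drops to (numerically) zero.
import numpy as np

x = np.linspace(1.0, 4.0, 16)
xs = x - x.mean()
print(np.corrcoef(x, x ** 2)[0, 1])    # ~0.987 before centering
print(np.corrcoef(xs, xs ** 2)[0, 1])  # ~0 after centering
```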
Example: shipping costs (continued)
– Y : cost of shipment in dollars
– X1: package weight in pounds
– X2: distance shipped in miles
Model 1: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
Data: Express.sav
A company conducted a study to investigate the relationship between the cost of shipment and the variables that control the shipping charge: weight and distance. It is suspected that nonlinear effects may be present, so let us analyze the model above.
Matrix scatter-plot
A matrix scatter-plot shows at once the bivariate scatter plots for the selected variables; use it for preliminary screening.
In SPSS choose the “Matrix” option from the “Scatter/Dot” graph dialog and input the variables of interest.
Note the obvious quadratic relations for some pairs of variables, very close to linearity over this limited range.
The matrix is symmetric, so it is enough to look at the lower triangle.
Correlation matrix (N = 20 for all entries)

                    Weight   Distance   Cost     Weight²   Distance²   Weight×Dist.
  Weight of parcel  1        .182       .774**   .967**    .151        .820**
  Distance shipped  .182     1          .695**   .202      .980**      .633**
  Cost of shipment  .774**   .695**     1        .799**    .652**      .989**
  Weight squared    .967**   .202       .799**   1         .160        .821**
  Distance squared  .151     .980**     .652**   .160      1           .590**
  Weight×Distance   .820**   .633**     .989**   .821**    .590**      1

  **. Correlation is significant at the 0.01 level (2-tailed).
Each independent variable is individually strongly related to Y (cost).
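Outside SPSS, the same table can be obtained with pandas; a sketch, assuming the data have been exported from Express.sav to a CSV file (the file name and column names here are hypothetical):

```python
# Sketch: rebuild the correlation matrix with pandas. "express.csv" and the
# column names are hypothetical stand-ins for an export of Express.sav.
import pandas as pd

df = pd.read_csv("express.csv")            # columns: weight, distance, cost
df["weight_sq"] = df["weight"] ** 2
df["dist_sq"] = df["distance"] ** 2
df["weight_x_dist"] = df["weight"] * df["distance"]
print(df.corr().round(3))                  # compare with the table above
```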
Model 1: VIF statistics
A VIF statistic larger than 10 is usually considered an indicator of substantial collinearity.
Coefficients (Model 1)
                      B          Std. Error   t        Sig.   VIF
  (Constant)          .827       .702          1.178   .259
  Weight of parcel    -.609      .180         -3.386   .004   20.031
  Distance shipped    .004       .008           .503   .623   35.526
  Weight squared      .090       .020          4.442   .001   17.027
  Distance squared    1.507E-5   .000           .672   .513   28.921
  Weight×Distance     .007       .001         11.495   .000   12.618

All five VIFs exceed 10: substantial collinearity.
Model 2: using the independent variables as deviations from their means
Note: the problems of multicollinearity have disappeared.
Coefficients (Model 2)
              B          Std. Error   t        Sig.   VIF
  (Constant)  5.467      .216         25.252   .000
  X1star      1.263      .042         30.128   .000   1.087
  X2star      .038       .001         27.563   .000   1.081
  X1x2star    .007       .001         11.495   .000   1.095
  X1star2     .090       .020          4.442   .001   1.113
  x2star2     1.507E-5   .000           .672   .513   1.120
Note: R² (adjusted), the ANOVA table and the predictions are the same for the two models (check).
x2star2 is still not significant (Sig. = .513); it seems irrelevant and can be dropped.
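The "check" above is mechanical. A sketch with synthetic data (numpy assumed; Express.sav is not reproduced here): the centered model is a reparametrization of the raw model, so the fitted values, and hence R², the ANOVA table and the predictions, must coincide.

```python
# Sketch: centering is a reparametrization, so the raw and centered quadratic
# models give identical fitted values (synthetic data, not Express.sav).
import numpy as np

rng = np.random.default_rng(1)
n = 20
w = rng.uniform(1, 10, n)                    # "weight" (synthetic)
d = rng.uniform(50, 300, n)                  # "distance" (synthetic)
y = 1 + 0.5 * w + 0.02 * d + 0.01 * w * d + rng.normal(0, 1, n)

def design(a, b):
    """Full second-order design matrix: 1, a, b, ab, a^2, b^2."""
    return np.column_stack([np.ones(len(a)), a, b, a * b, a ** 2, b ** 2])

def fitted(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

same = np.allclose(fitted(design(w, d)),
                   fitted(design(w - w.mean(), d - d.mean())))
print(same)                                  # True: the predictions coincide
```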