Regression Analyses


Transcript of Regression Analyses

Page 1: Regression Analyses

Regression Analyses

Page 2: Regression Analyses

• Multiple IVs
• Single DV (continuous)
• Generalization of simple linear regression
• Y' = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk, where k is the number of predictors
• Find the solution where Sum(Y - Y')2 is minimized
• Do not confuse the size of the bs with importance for prediction
• Can standardize to get betas, which can help determine relative importance (sketched below)

Multiple Regression
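As an aside (not from the original slides): a minimal NumPy sketch of those two bullets, fitting the weights by ordinary least squares and then standardizing them to betas. The data and coefficient values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))   # three IVs
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)  # DV

# Add an intercept column and solve for b0..bk by least squares,
# i.e., the solution that minimizes Sum((Y - Y')^2).
X1 = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_pred = X1 @ b
print("raw weights b0..b3:", b.round(3))
print("SSE:", round(float(np.sum((y - y_pred) ** 2)), 2))

# Standardized weights: beta_j = b_j * sd(X_j) / sd(Y).
# Unlike the raw bs, these are comparable across IVs.
betas = b[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
print("betas:", betas.round(3))
```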

Page 3: Regression Analyses

• Prediction – allows prediction of change in the D.V. resulting from changes in the multiple I.V.s

• Explanation – enables explanation of the variate by assessing the relative contribution of each I.V. to the regression equation

• More efficient than multiple simple regression equations
  – Allows consideration of overlapping variance in the IVs

Why use Multiple Regression?

Page 4: Regression Analyses

When do you use Multiple Regression?

• When theoretical or conceptual justification exists for predicting or explaining the D.V. with the set of I.V.s

• D.V. is metric/continuous
  – If not, logistic regression or discriminant analysis

Page 5: Regression Analyses

[Venn diagram: circles for variance in Y, variance in X1, and variance in X2; overlapping areas a, b, and c mark shared variance; area e is the residual variance in Y]

Multiple Regression

Page 6: Regression Analyses

• DV is continuous and interval or ratio in scale
• Assumes multivariate normality for random IVs
• Assumes normal distributions and homogeneity of variance at each level of X for fixed IVs
• No error of measurement
• Correctly specified model
• Errors not correlated
• Expected mean of residuals is 0
• Homoscedasticity (error variance equal at all levels of X; see the residual checks below)
• Errors are independent/no autocorrelation (error for one score not correlated with error for another score)
• Residuals normally distributed

Assumptions
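A hedged sketch (mine, with simulated data) of how a few of these assumptions are commonly checked from the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

# Fit the regression and compute residuals.
X1 = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ b

print("mean of residuals (should be ~0):", round(float(resid.mean()), 4))

# Homoscedasticity: residual spread should be similar across levels of X.
lo, hi = resid[x < np.median(x)], resid[x >= np.median(x)]
print("resid SD at low vs high X:", round(float(lo.std()), 3), round(float(hi.std()), 3))

# No autocorrelation: the Durbin-Watson statistic is near 2
# when adjacent errors are uncorrelated.
dw = float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))
print("Durbin-Watson:", round(dw, 2))
```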

Page 7: Regression Analyses

Multiple regression represents the construction of a weighted linear combination of variables:

ŷi = A + B1X1,i + B2X2,i + ... + BkXk,i

The weights are derived to:

(a) Minimize the sum of the squared errors of prediction:

Σi (yi − ŷi)²

(b) Maximize the squared correlation (R2) between the original outcome variables and the predicted outcome variables based on the linear combination.

Page 8: Regression Analyses

[Scatterplot of Y against X with the fitted regression line ŷ = A + BX; the vertical distance y − y' from each point to the line is the error of prediction]

Page 9: Regression Analyses

Multiple R
• R is like r except it involves multiple predictors, and R cannot be negative
• R is the correlation between Y and Y', where Y' = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
• R2 tells us the proportion of variance accounted for (coefficient of determination)
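To make those definitions concrete, a small sketch (my own simulated data, not the slides') verifying that R is the correlation between Y and Y' and that R2 equals the proportion of variance accounted for:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_pred = X1 @ b

R = np.corrcoef(y, y_pred)[0, 1]          # correlation of Y with Y'
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("R:", R.round(3))
print("R^2 as squared correlation:", (R ** 2).round(3))
print("R^2 as variance accounted for:", (1 - ss_res / ss_tot).round(3))
```

The two R2 values agree: for least-squares regression with an intercept, the squared correlation between Y and Y' is exactly 1 − SSresidual/SStotal.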

Page 10: Regression Analyses

An example . . .

Y = Number of job interviews

X1 = GRE score

X2 = Years to complete Ph.D.

X3 = Number of publications

N = 500

Page 11: Regression Analyses

[Histogram of Interviews: Std. Dev = 2.88, Mean = 8.6, N = 500]

Page 12: Regression Analyses

[Histogram of GRE: Std. Dev = 103.72, Mean = 1296.8, N = 500]

Page 13: Regression Analyses

[Histogram of Publications: Std. Dev = 2.31, Mean = 4.3, N = 500]

Page 14: Regression Analyses

[Histogram of Years to Complete Degree: Std. Dev = 2.05, Mean = 6.1, N = 500]

Page 15: Regression Analyses

Descriptive Statistics

                           Mean      Std. Deviation   N
Interviews                 8.60      2.883            500
GRE                        1296.82   103.724          500
Publications               4.30      2.309            500
Years to Complete Degree   6.09      2.055            500

Page 16: Regression Analyses

Correlations (N = 500 for all variables)

Pearson Correlation        Interviews   GRE     Publications   Years to Complete Degree
Interviews                 1.000        .219    .677           -.375
GRE                        .219         1.000   .309           -.091
Publications               .677         .309    1.000          -.286
Years to Complete Degree   -.375        -.091   -.286          1.000

Sig. (1-tailed): p = .000 for every pair except GRE with Years to Complete Degree, p = .020.

Page 17: Regression Analyses

Predicting Interviews

[Venn diagram: variance in Interviews overlapping with variance in GRE, variance in Pubs, and variance in Time to Graduate; areas a through e mark shared variance; area f is the residual variance]

Page 18: Regression Analyses

Regression with SPSS

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT interviews
  /METHOD=ENTER yearstocomplete gre pubs
  /SCATTERPLOT=(*ZPRED, *ZRESID) .

(SPSS variable names cannot contain spaces; "yearstocomplete" here stands in for however the years-to-complete variable is actually named in the data file.)

From the Analyze menu:
• Choose Regression
• Choose Linear

Page 19: Regression Analyses

Model Summary

Model 1: R = .703, R Square = .494, Adjusted R Square = .491, Std. Error of the Estimate = 2.056
Change Statistics: R Square Change = .494, F Change = 161.558, df1 = 3, df2 = 496, Sig. F Change = .000; Durbin-Watson = 1.994

a. Predictors: (Constant), Years to Complete Degree, GRE, Publications
b. Dependent Variable: Interviews

The Std. Error of the Estimate is the standard deviation of the errors of prediction: the error that is minimized in the derivation of the regression weights.
R Square is the variance that is maximized in the derivation of the regression weights.

Page 20: Regression Analyses

ANOVA

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   2049.250         3     683.083       161.558   .000
Residual     2097.142         496   4.228
Total        4146.392         499

a. Predictors: (Constant), Years to Complete Degree, GRE, Publications
b. Dependent Variable: Interviews


The Residual Mean Square (4.228) is the variance of the errors of prediction: the error that is minimized in the derivation of the regression weights.

Page 21: Regression Analyses

Coefficients

Model 1                    B           Std. Error   Beta    t        Sig.   95% CI for B, Lower Bound
(Constant)                 6.576       1.219                5.395    .000   4.181
GRE                        3.029E-04   .001         .011    .325     .746   -.002
Publications               .771        .044         .617    17.685   .000   .685
Years to Complete Degree   -.277       .047         -.198   -5.926   .000   -.369

a. Dependent Variable: Interviews

B is the raw weight, b; Beta is the weight if the variables are standardized (β).

Page 22: Regression Analyses

[Coefficients table repeated from Page 21; the Sig. column gives the significance of the Beta weights]

Output from SPSS

Page 23: Regression Analyses

Multicollinearity
• Addition of many predictors increases likelihood of multicollinearity problems
• Using multiple indicators of the same construct without combining them in some fashion will definitely create multicollinearity problems
• Wreaks havoc with analysis
  – e.g., significant overall R2, but no variables in the equation significant
• Can mask or hide variables that have large and meaningful impacts on the DV

Page 24: Regression Analyses

Multicollinearity

Multicollinearity reflects redundancy in the predictor variables. When severe, the standard errors for the regression coefficients are inflated and the individual influence of predictors is harder to detect with confidence. When severe, the regression coefficient estimates are themselves highly correlated.

Coefficient Correlations

Correlations:
                           Years to Complete Degree   GRE          Publications
Years to Complete Degree   1.000                      .003         .272
GRE                        .003                       1.000        -.296
Publications               .272                       -.296        1.000

Covariances:
                           Years to Complete Degree   GRE          Publications
Years to Complete Degree   2.186E-03                  1.485E-07    5.549E-04
GRE                        1.485E-07                  8.705E-07    -1.204E-05
Publications               5.549E-04                  -1.204E-05   1.898E-03

a. Dependent Variable: Interviews

The diagonal of the covariance matrix is var(b) for each weight.

Page 25: Regression Analyses

Coefficients

Model 1                    B           Sig.   95% CI Lower   95% CI Upper   Zero-order   Partial   Part    Tolerance   VIF
(Constant)                 6.576       .000   4.181          8.971
GRE                        3.029E-04   .746   -.002          .002           .219         .015      .010    .905        1.105
Publications               .771        .000   .685           .856           .677         .622      .565    .838        1.194
Years to Complete Degree   -.277       .000   -.369          -.185          -.375        -.257     -.189   .918        1.089

a. Dependent Variable: Interviews

The tolerance for a predictor is the proportion of variance that it does not share with the other predictors. The variance inflation factor (VIF) is the inverse of the tolerance.
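That definition can be computed directly (a sketch of mine, not SPSS output): regress each predictor on the others, take tolerance = 1 − R2 of that regression, and invert it for the VIF.

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column of X: tolerance = 1 - R^2 from regressing that
    column on the remaining columns; VIF = 1 / tolerance."""
    n, k = X.shape
    stats = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        tol = 1 - r2
        stats.append((tol, 1 / tol))
    return stats

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=300)   # largely redundant with x1
x3 = rng.normal(size=300)
for tol, vif in tolerance_and_vif(np.column_stack([x1, x2, x3])):
    print(f"tolerance = {tol:.3f}, VIF = {vif:.2f}")
```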

Page 26: Regression Analyses

Remedies:
(1) Combine variables using factor analysis
(2) Use block entry
(3) Model specification (omit variables)
(4) Don't worry about it as long as the program will allow it to run (you don't have singularity, or perfect correlation)

Multicollinearity

Page 27: Regression Analyses

Incremental R2

• Changes in R2 that occur when adding IVs
• Indicates the proportion of variance in prediction that is provided by adding Z to the equation
• It is what Z adds in prediction after controlling for X in Z (partialling X out of Z)
• Total variance in Y can be broken up in different ways, depending on order of entry (which IVs controlled first)
• If you have multiple IVs, change in R2 strongly determined by intercorrelations and order of entry into the equation
• Later point of entry, less R2 available to predict (see the simulation below)
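A small simulation (mine, invented numbers) of that last point: when IVs are correlated, the R2 increment a variable earns depends on where it enters.

```python
import numpy as np

def r2(X, y):
    """R^2 from a least-squares regression of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
z = 0.7 * x + 0.7 * rng.normal(size=n)    # Z correlated with X
y = 0.5 * x + 0.5 * z + rng.normal(size=n)

r2_x  = r2(x[:, None], y)                 # X alone
r2_z  = r2(z[:, None], y)                 # Z alone
r2_xz = r2(np.column_stack([x, z]), y)    # both

print("R2 for Z entered first:", round(r2_z, 3))
print("R2 change for Z entered after X:", round(r2_xz - r2_x, 3))
# The later Z enters, the less R2 is left for it to claim.
```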

Page 28: Regression Analyses

Other Issues in Regression
• Suppressors (one IV correlated with the other IV but not with the DV; switches in sign)
• Empirical cross-validation
• Estimated cross-validation
• Dichotomization, trichotomization, median splits
  – Dichotomizing one variable reduces the max r to .798 (see the sketch below)
  – Cost of dichotomization is loss of 1/5 to 2/3 of real variance
  – Dichotomizing on more than one variable can increase Type I error and yet can reduce power as well!
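The .798 ceiling is easy to verify by simulation (a sketch of mine, assuming a normal variable split at its median):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200_000)
d = (x > np.median(x)).astype(float)   # median split into 0/1

# Correlation of a normal variable with its own dichotomized version
# approaches sqrt(2/pi) ~= .798, so any r involving the dichotomized
# variable is attenuated by roughly that factor.
print(round(float(np.corrcoef(x, d)[0, 1]), 3))   # ~0.798
```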

Page 29: Regression Analyses

Significance of Overall R2

• Tests: a + b + c + d + e + f against area g (error)

• Get this from a simultaneous regression or from last step of block or hierarchical entry.

• Other approaches may or may not give you an appropriate test of overall R2, depending upon whether all variables are kept or some omitted.

[Venn diagram: Y overlapping with X, W, and Z; areas a through f are variance in Y shared with the IVs; area g is the error variance]

Page 30: Regression Analyses

Significance of Incremental R2

Change in R2 tests: a + b + c against area d + e + f + g

At this step, the t test for the b weight of X is the same as the square root of the F test if you only enter one variable. It is a test of whether or not the area of a + b + c is significant as compared to area d + e + f + g.

Step 1: Enter X

[Same Venn diagram, with X entered at Step 1]

Page 31: Regression Analyses

Significance of Incremental R2

Change in R2 tests: d + e against area f + g

At this step, the t test for the b weight of X is a test of area a against area f + g and the t test for the b weight of W is a test of area d + e against area f + g.

Step 2: Enter W

[Same Venn diagram, with X and W entered]

Page 32: Regression Analyses

Significance of Incremental R2

Step 3: Enter Z
Change in R2 tests: f against g

[Same Venn diagram, with X, W, and Z entered]

At this step, the t test for b weight of X is a test of area a against area g, the t test for the b weight of W is a test of area e against area g, and the t test for the b weight of Z is a test of area f against area g. These are the significance tests for the IV effects from a simultaneous regression analysis. No IV gets “credit” for areas b, c, d in a simultaneous analysis.

Page 33: Regression Analyses

Hierarchical Regression
Significance of Incremental R2

[Same Venn diagram, with X, W, and Z entered]

Enter variables in hierarchical fashion to determine R2 for each effect. Test each effect against error variance after all variables have been entered. Assume we entered X, then W, then Z in a hierarchical fashion.
Tests for X: areas a + b + c against g
Tests for W: areas d + e against g
Tests for Z: area f against g

Page 34: Regression Analyses

Significance test for b or Beta

[Same Venn diagram, with X, W, and Z entered]

In the final equation, when we look at the t tests for our b weights we are looking at the following tests:
Tests for X: only area a against g
Tests for W: only area e against g
Tests for Z: only area f against g

That’s why incremental or effect R2 tests are more powerful.

Page 35: Regression Analyses

Methods of building regression equations

• Simultaneous: All variables entered at once
• Backward elimination (stepwise): Starts with full equation and eliminates IVs on the basis of significance tests
• Forward selection (stepwise): Starts with no variables and adds on the basis of increment in R2
• Hierarchical: Researcher determines order and enters each IV
• Block entry: Researcher determines order and enters multiple IVs in single blocks

Page 36: Regression Analyses

Simultaneous

[Venn diagram: Y overlapping with intercorrelated IVs X and Z and with W; area g is variance in Y shared by X and Z; area i is variance in Y shared uniquely with W]

Simultaneous: All variables entered at once
• Significance tests and R2 based on unique variance
• No variable "gets credit" for area g
• Variables with intercorrelations have less unique variance
• Variables X & Z together predict more than W
• Variable W might be significant; X & Z are not
• Betas are partialled, so the beta for W is larger than for X or Z

Page 37: Regression Analyses

Backward Elimination

[Same Venn diagram as above]

• Starts with full equation and eliminates IVs
• Gets rid of the least significant variable (probably X), then tests remaining vars to see if they are significant
• Keeps all remaining significant vars
• Capitalizes on chance
• Low cross-validation

Page 38: Regression Analyses

Forward Selection

[Same Venn diagram as above]

• Starts with no variables and adds IVs
• Adds most unique R2 or next most significant variable (probably W, because it gets credit for area i)
• Quits when remaining vars are not significant
• Capitalizes on chance
• Low cross-validation

Page 39: Regression Analyses

Hierarchical (Forced Entry)

[Same Venn diagram as above]

• Researcher determines order of entry for IVs
• Order based on theory, timing, or need for statistical control
• Less capitalization on chance
• Generally higher cross-validation
• Final model based on IVs of theoretical importance
• Order of entry determines which IV gets credit for area g

Page 40: Regression Analyses

Order of Entry
• Determining order of entry is crucial
• Stepwise capitalizes on chance and reduces cross-validation and stability of your prediction equation
  – Only useful to maximize prediction in a given sample
  – Can lose important variables
• Use the following:
  – Logic
  – Theory
  – Order of manipulations/treatments
  – Timing of measures
• Usefulness of the regression model is reduced as k (the number of IVs) approaches N (sample size)
  – Best to have at least a 15 to 1 ratio or more

Page 41: Regression Analyses

Interpreting b or β
• B or b is the raw regression weight; β (beta) is standardized (scale invariant)
• At a given step, the size of b or β is influenced by order of entry in a regression equation
  – Should be interpreted at entry step
• Once all variables are in the equation, the bs and βs will always be the same regardless of the order of entry
• Difficult to interpret b or β for main effects when an interaction is in the equation

Page 42: Regression Analyses

• We can code groups and use the codes to analyze data (e.g., 1 and 2 to represent females and males)
• Overall R2 and significance tests for the full equation will not change regardless of how we code (as long as the codes are orthogonal)
• Interpretation of intercept (a) and slope (b or beta weights) WILL change depending on coding
• We can use coding to capture effects of categorical variables

Regression: Categorical IVs

Page 43: Regression Analyses

Regression: Categorical IVs
• Total # codes needed is always # groups - 1
• Dummy coding
  – One group assigned 0s. b wts indicate mean difference of groups coded 1 compared to the group coded 0
• Effect coding
  – One group assigned -1s. b wts indicate mean difference of groups coded 1 to the grand mean
• All forms of coding give you the same overall R2 and significance tests for total R2
• Difference is in interpretation of b wts

Page 44: Regression Analyses

Dummy Coding

          White   Black   Asian   Hispanic
Dummy 1   0       1       0       0
Dummy 2   0       0       1       0
Dummy 3   0       0       0       1

• # dummy codes = # groups - 1
• Group that receives all zeros is the reference group
• Beta = comparison of the reference group to the group represented by 1
• Intercept in the regression equation is the mean of the reference group (checked in the sketch below)
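A quick check of those last two bullets (my own toy data, invented group means): with dummy codes, the intercept recovers the reference-group mean and each b is a mean difference from it.

```python
import numpy as np

rng = np.random.default_rng(6)
groups = np.repeat([0, 1, 2, 3], 50)   # four groups; group 0 is the reference
y = np.array([2.0, 3.0, 1.5, 4.0])[groups] + rng.normal(scale=0.1, size=200)

# k - 1 = 3 dummy codes; the reference group gets all zeros.
D = np.column_stack([(groups == g).astype(float) for g in (1, 2, 3)])
X1 = np.column_stack([np.ones(200), D])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept (~ reference-group mean of 2.0):", b[0].round(2))
print("b weights (~ group means minus 2.0):", b[1:].round(2))
```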

Page 45: Regression Analyses

Effect Coding

• # contrast codes = # groups - 1
• Group that receives all zeros in dummy coding now gets all -1s
• Beta = comparison of the group represented by 1 to the grand mean
• Intercept in the regression equation is the grand mean (see the sketch below)

           White   Black   Asian   Hispanic
Effect 1   -1      1       0       0
Effect 2   -1      0       1       0
Effect 3   -1      0       0       1
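And the matching check for effect coding (same toy data; equal group sizes assumed, so the grand mean is simply the mean of the group means): the intercept becomes the grand mean and each b compares its group to that mean.

```python
import numpy as np

rng = np.random.default_rng(7)
groups = np.repeat([0, 1, 2, 3], 50)   # group 0 gets all -1s
y = np.array([2.0, 3.0, 1.5, 4.0])[groups] + rng.normal(scale=0.1, size=200)

# Effect codes: 1 for the coded group, -1 for the reference group, else 0.
E = np.column_stack([
    np.where(groups == g, 1.0, np.where(groups == 0, -1.0, 0.0))
    for g in (1, 2, 3)
])
X1 = np.column_stack([np.ones(200), E])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept (~ grand mean of 2.625):", b[0].round(2))
print("b weights (~ group means minus grand mean):", b[1:].round(2))
```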

Page 46: Regression Analyses

Regression with Categorical IVs vs. ANOVA

• Provides the same results as t tests or ANOVA

• Provides additional information
  – Regression equation (line of best fit)
  – Useful for future prediction
  – Effect size (R2)
  – Adjusted R2

Page 47: Regression Analyses

Regression with Categorical Variables - Syntax

Step 1. Create k - 1 dummy variables
Step 2. Run regression analysis with the dummy variables as predictors

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT fiw
  /METHOD=ENTER msdum1 msdum2 msdum3 msdum4 msdum5 .

Page 48: Regression Analyses

Regression with Categorical Variables - Output

ANOVA

Model 1      Sum of Squares   df    Mean Square   F       Sig.
Regression   3.562            5     .712          1.562   .170
Residual     166.030          364   .456
Total        169.592          369

a. Predictors: (Constant), msdum5, msdum4, msdum2, msdum3, msdum1
b. Dependent Variable: fiw

Coefficients

Model 1      B       Std. Error   Beta    t        Sig.
(Constant)   2.134   .049                 43.669   .000
msdum1       .084    .078         .059    1.081    .280
msdum2       -.420   .260         -.084   -1.615   .107
msdum3       -.102   .162         -.033   -.631    .529
msdum4       -.634   .480         -.069   -1.321   .187
msdum5       -.127   .137         -.050   -.928    .354

a. Dependent Variable: fiw

Page 49: Regression Analyses

Adjusted R2

• There may be "overfitting" of the model, and R2 may be inflated
• Model may not cross-validate (shrinkage)
• More shrinkage with small samples (< 10-15 observations per IV)

Model Summary

Model 1: R = .145, R Square = .021, Adjusted R Square = .008, Std. Error of the Estimate = .67537

a. Predictors: (Constant), msdum5, msdum4, msdum2, msdum3, msdum1
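The Adjusted R Square above follows the standard shrinkage formula; here N = 370 cases and k = 5 predictors (so N - 1 = 369 and N - k - 1 = 364, matching the dfs in the ANOVA table on the previous page):

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1}
          = 1 - (1 - .021)\,\frac{369}{364} \approx .008
```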

Page 50: Regression Analyses

Example: Hierarchical Regression

Example: Number of children, hours in family work, and sex as predictors of family interfering with work

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT fiw
  /METHOD=ENTER numkids
  /METHOD=ENTER hrsfamil
  /METHOD=ENTER sex .

Page 51: Regression Analyses

Hierarchical Regression Output

Variables Entered/Removed

Model   Variables Entered   Variables Removed   Method
1       numkids             .                   Enter
2       hrsfamil            .                   Enter
3       sex                 .                   Enter

a. All requested variables entered.
b. Dependent Variable: fiw

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .067a   .004       .001                .66568                       .004              1.450      1     324   .229
2       .067b   .004       -.002               .66669                       .000              .014       1     323   .907
3       .222c   .049       .040                .65254                       .045              15.169     1     322   .000

a. Predictors: (Constant), numkids
b. Predictors: (Constant), numkids, hrsfamil
c. Predictors: (Constant), numkids, hrsfamil, sex
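For reference, Sig. F Change tests each increment with the standard F ratio for a change in R2. Plugging in the Step 3 values from the table reproduces the reported F Change of 15.169 up to rounding in R Square Change:

```latex
F_{change} = \frac{\Delta R^2 / df_1}{(1 - R^2_{full}) / df_2}
           = \frac{.045 / 1}{(1 - .049) / 322} \approx 15.2
```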

Page 52: Regression Analyses

Hierarchical Regression Output

Coefficients

Model 1      B       Std. Error   Beta    t        Sig.
(Constant)   2.025   .098                 20.673   .000
numkids      .028    .024         .067    1.204    .229

Model 2      B       Std. Error   Beta    t        Sig.
(Constant)   2.034   .126                 16.088   .000
numkids      .028    .025         .065    1.112    .267
hrsfamil     .000    .002         -.007   -.117    .907

Model 3      B       Std. Error   Beta    t        Sig.
(Constant)   2.433   .161                 15.144   .000
numkids      .038    .024         .088    1.541    .124
hrsfamil     .000    .002         -.002   -.036    .971
sex          -.285   .073         -.213   -3.895   .000

a. Dependent Variable: fiw

Page 53: Regression Analyses

Hierarchical Regression Output

[Coefficients table repeated from Page 52]

Simultaneous Regression Output

Coefficients

Model 1      B       Std. Error   Beta    t        Sig.
(Constant)   2.433   .161                 15.144   .000
numkids      .038    .024         .088    1.541    .124
hrsfamil     .000    .002         -.002   -.036    .971
sex          -.285   .073         -.213   -3.895   .000

a. Dependent Variable: fiw

Note that the simultaneous output is identical to the final step (Model 3) of the hierarchical analysis, since all variables are in the equation at that point.