Regression Analysis Multiple Regression [ Cross-Sectional Data ]


Learning Objectives

Explain the linear multiple regression model [for cross-sectional data]
Interpret linear multiple regression computer output
Explain multicollinearity
Describe the types of multiple regression models

Regression Modeling Steps

Define problem or question
Specify model
Collect data
Do descriptive data analysis
Estimate unknown parameters
Evaluate model
Use model for prediction

Simple vs. Multiple

Simple regression: $\beta$ represents the unit change in Y per unit change in X. It does not take into account any variable other than the single independent variable.

Multiple regression: $\beta_i$ represents the unit change in Y per unit change in $X_i$. It takes into account the effect of the other $\beta_i$s, and is called the "net regression coefficient."

Assumptions

Linearity - the Y variable is linearly related to the value of the X variable.
Independence of Error - the error (residual) is independent for each value of X.
Homoscedasticity - the variation around the line of regression is constant for all values of X.
Normality - the values of Y are normally distributed at each value of X.

Goal

Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.

Simple Regression

A statistical model that utilizes one quantitative independent variable "X" to predict the quantitative dependent variable "Y."

Multiple Regression

A statistical model that utilizes two or more quantitative and qualitative explanatory variables ($x_1, \ldots, x_P$) to predict a quantitative dependent variable Y.

Caution: have at least two quantitative explanatory variables (rule of thumb)

Multiple Regression Model

[Figure: multiple regression model with axes X1, X2, and Y, and error term e]

Hypotheses

H0: $\beta_1 = \beta_2 = \beta_3 = \cdots = \beta_P = 0$

H1: At least one regression coefficient is not equal to zero

Hypotheses (alternate format)

H0: $\beta_i = 0$

H1: $\beta_i \neq 0$

Types of Models

Positive linear relationship
Negative linear relationship
No relationship between X and Y
Positive curvilinear relationship
U-shaped curvilinear
Negative curvilinear relationship

Multiple Regression Models

[Diagram: types of multiple regression models]
Linear: Linear, Dummy Variable, Interaction
Non-Linear: Polynomial, Square Root, Log, Reciprocal, Exponential

Multiple Regression Equations

This is too complicated! You've got to be kiddin'!


Linear Model

Relationship between one dependent & two or more independent variables is a linear function

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P + \varepsilon$

Y: dependent (response) variable
$X_1, \ldots, X_P$: independent (explanatory) variables
$\beta_1, \ldots, \beta_P$: population slopes
$\beta_0$: population Y-intercept
$\varepsilon$: random error

Method of Least Squares

The straight line that best fits the data.

Determine the straight line for which the squared differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Y-hat) are, in total, as small as possible.
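The least-squares criterion can be written out as a formula. This is a sketch in the usual textbook notation, with $b_0, b_1, \ldots, b_P$ as the sample estimates of the population coefficients; the index $i$ over the $n$ observations is an assumption of this note rather than something shown on the slide.

```latex
\min_{b_0, b_1, \ldots, b_P} \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,
\qquad
\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_P X_{Pi}
```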

Measures of Variation

Explained variation (sum of squares due to regression)
Unexplained variation (error sum of squares)
Total sum of squares
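The three measures are linked by a standard decomposition. The abbreviations SSR, SSE, and SST follow the ones used later in the deck; the explicit sums are the usual textbook definitions and are added here as a reminder, not taken from the slide.

```latex
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{SST}
=
\underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR}
+
\underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE}
```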

Coefficient of Multiple Determination

When the null hypothesis is rejected, a relationship between Y and the X variables exists.
Strength measured by $R^2$ [several types]

Coefficient of Multiple Determination

$R^2_{Y.12 \ldots P}$
The proportion of the variation in Y that is explained by the set of explanatory variables selected.

Standard Error of the Estimate

$s_{Y.X}$
The measure of variability around the line of regression

Confidence interval estimates
» True mean: $\mu_{Y.X}$
» Individual: $\hat{Y}_i$

Interval Bands [from simple regression]

[Figure: interval bands around the fitted line $\hat{Y}_i = b_0 + b_1 X$, with $\bar{X}$ and a given X marked on the X axis]

Multiple Regression Equation

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_P x_P + \varepsilon$

where:
$\beta_0$ = y-intercept {a constant value}
$\beta_1$ = slope of Y with variable $x_1$, holding the effects of variables $x_2, x_3, \ldots, x_P$ constant
$\beta_P$ = slope of Y with variable $x_P$, holding all other variables' effects constant

Who is in Charge?

Mini-Case

Predict the consumption of home heating oil during January for homes located around Screne Lakes. Two explanatory variables are selected: average daily atmospheric temperature (°F) and the amount of attic insulation (inches).

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10

Mini-Case

Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature (°F) and amount of insulation in inches.

Mini-Case

What preliminary conclusions can home owners draw from the data?

What could a home owner expect heating oil consumption (in gallons) to be if the outside temperature is 15 °F when the attic insulation is 10 inches thick?

Multiple Regression Equation [mini-case]

Dependent variable: Gallons Consumed
----------------------------------------------------------------
                             Standard      T
Parameter      Estimate      Error         Statistic    P-Value
----------------------------------------------------------------
CONSTANT       562.151       21.0931       26.6509      0.0000
Insulation     -20.0123      2.34251       -8.54313     0.0000
Temperature    -5.43658      0.336216      -16.1699     0.0000
----------------------------------------------------------------
R-squared = 96.561 percent
R-squared (adjusted for d.f.) = 95.9879 percent
Standard Error of Est. = 26.0138
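The estimates in this output can be reproduced from the fifteen mini-case observations. The sketch below uses NumPy's `np.linalg.lstsq`, which is an assumption of this note: the slides do not say which statistics package produced the output, and the variable names are illustrative.

```python
import numpy as np

# Mini-case data as listed in the slides (15 homes).
oil = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
                dtype=float)
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
                      dtype=float)

# Design matrix: a column of ones for the intercept, then the two X variables.
X = np.column_stack([np.ones_like(temp), temp, insulation])

# Least-squares estimates: b0 (constant), b1 (temperature), b2 (insulation).
coef, *_ = np.linalg.lstsq(X, oil, rcond=None)
b0, b1, b2 = coef

fitted = X @ coef
sse = np.sum((oil - fitted) ** 2)        # unexplained variation
sst = np.sum((oil - oil.mean()) ** 2)    # total variation
r_squared = 1 - sse / sst

print(f"b0={b0:.3f}  b1={b1:.5f}  b2={b2:.4f}  R^2={r_squared:.5f}")
```

The printed values should agree with the Estimate column and the R-squared line of the output to the precision shown there.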

Multiple Regression Equation [mini-case]

$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$

where:
$x_1$ = temperature [degrees F]
$x_2$ = attic insulation [inches]

Multiple Regression Equation [mini-case]

$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$

Thus: for a home with zero inches of attic insulation and an outside temperature of 0 °F, 562.15 gallons of heating oil would be consumed.

[ caution .. data boundaries .. extrapolation ]

[Figure: Y against X; interpolation inside the relevant range of X, extrapolation on either side of it]

Multiple Regression Equation [mini-case]

$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$

For each one-degree (°F) increase in temperature, for a given amount of attic insulation, heating oil consumption drops 5.44 gallons.

For each one-inch increase in attic insulation, at a given temperature, heating oil consumption drops 20.01 gallons.

Multiple Regression Prediction [mini-case]

$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$

with $x_1$ = 15 °F and $x_2$ = 10 inches

$\hat{Y} = 562.15 - 5.44(15) - 20.01(10)$

= 280.45 gallons consumed
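The same prediction can be written as a small function, paired with a check on the data boundaries that the deck cautions about. This is a minimal sketch: the function names are hypothetical, and the ranges are simply the smallest and largest temperature and insulation values in the mini-case data.

```python
# Rounded coefficients from the fitted mini-case equation.
B0, B_TEMP, B_INS = 562.15, -5.44, -20.01

# Observed ranges in the mini-case data; anything outside is extrapolation.
TEMP_RANGE = (8, 73)   # degrees F
INS_RANGE = (3, 10)    # inches


def predict_oil(temp_f: float, insulation_in: float) -> float:
    """Predicted January heating oil consumption in gallons."""
    return B0 + B_TEMP * temp_f + B_INS * insulation_in


def in_relevant_range(temp_f: float, insulation_in: float) -> bool:
    """True when both inputs lie inside the observed data boundaries."""
    return (TEMP_RANGE[0] <= temp_f <= TEMP_RANGE[1]
            and INS_RANGE[0] <= insulation_in <= INS_RANGE[1])


gallons = predict_oil(15, 10)
print(round(gallons, 2), in_relevant_range(15, 10))
```

A call such as `in_relevant_range(0, 0)` returns `False`, which is the situation behind the caution attached to the intercept interpretation.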

Coefficient of Multiple Determination [mini-case]

$R^2_{Y.12} = 0.9656$

96.56 percent of the variation in heating oil can be explained by the variation in temperature and insulation.

Coefficient of Multiple Determination

Proportion of variation in Y 'explained' by all X variables taken together
$R^2_{Y.12} = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST}$
Never decreases when new X variable is added to model
– Only Y values determine SST
– Disadvantage when comparing models

Coefficient of Multiple Determination Adjusted

Proportion of variation in Y 'explained' by all X variables taken together
Reflects
– Sample size
– Number of independent variables
Smaller [more conservative] than $R^2_{Y.12}$
Used to compare models

Coefficient of Multiple Determination (adjusted)

$R^2_{adj \, Y.12 \ldots P}$

The proportion of the variation in Y that is explained by the set of independent [explanatory] variables selected, adjusted for the number of independent variables and the sample size.

Coefficient of Multiple Determination (adjusted) [Mini-Case]

$R^2_{adj} = 0.9599$

95.99 percent of the variation in heating oil consumption can be explained by the model, adjusted for the number of independent variables and the sample size.
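The adjustment can be checked with the standard formula $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - P - 1}$. The slides do not show this formula, so treat it as the usual textbook definition; the function name is hypothetical, and the inputs are the mini-case values (R-squared of 96.561 percent, n = 15 homes, P = 2 explanatory variables).

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Mini-case: R^2 = 0.96561, 15 homes, 2 explanatory variables.
r2_adj = adjusted_r2(0.96561, n=15, p=2)
print(f"{r2_adj:.4f}")
```

The result rounds to the 0.9599 reported for the mini-case.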

Coefficient of Partial Determination

Proportion of variation in Y 'explained' by variable $X_P$ holding all others constant
Must estimate separate models
Denoted $R^2_{Y1.2}$ in two X variables case
– Coefficient of partial determination of X1 with Y holding X2 constant
Useful in selecting X variables

Coefficient of Partial Determination [p. 878]

$R^2_{Y1.234 \ldots P}$

The coefficient of partial determination of variable Y with $x_1$, holding constant the effects of variables $x_2, x_3, x_4, \ldots, x_P$.

Coefficient of Partial Determination [Mini-Case]

$R^2_{Y1.2} = 0.9561$

For a fixed (constant) amount of insulation, 95.61 percent of the variation in heating oil can be explained by the variation in average atmospheric temperature. [p. 879]

Coefficient of Partial Determination [Mini-Case]

$R^2_{Y2.1} = 0.8588$

For a fixed (constant) temperature, 85.88 percent of the variation in heating oil can be explained by the variation in amount of insulation.
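Both partial coefficients can be cross-checked against the T statistics in the regression output without fitting separate models. The sketch relies on the identity $r^2 = t^2 / (t^2 + n - P - 1)$, which is a standard result and not something stated in the slides; the function name is hypothetical.

```python
def partial_r2_from_t(t_stat: float, n: int, p: int) -> float:
    """Coefficient of partial determination from a coefficient's t statistic."""
    df_error = n - p - 1              # error degrees of freedom
    return t_stat ** 2 / (t_stat ** 2 + df_error)


# T statistics from the mini-case output (n = 15 homes, P = 2 variables).
r2_temp = partial_r2_from_t(-16.1699, n=15, p=2)   # temperature, insulation held constant
r2_ins = partial_r2_from_t(-8.54313, n=15, p=2)    # insulation, temperature held constant
print(f"{r2_temp:.4f} {r2_ins:.4f}")
```

Up to rounding, the two values match the 0.9561 and 0.8588 quoted for the mini-case.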

Testing Overall Significance

Shows if there is a linear relationship between all X variables together & Y
Uses p-value
Hypotheses
– H0: $\beta_1 = \beta_2 = \cdots = \beta_P = 0$
» No linear relationship
– H1: At least one coefficient is not 0
» At least one X variable affects Y

Testing Model Portions

Examines the contribution of a set of X variables to the relationship with Y
Null hypothesis:
– Variables in set do not significantly improve the model when all other variables are included
Must estimate separate models
Used in selecting X variables

Diagnostic Checking

H0: retain or reject
If reject - {p-value ≤ 0.05}
$R^2_{adj}$
Correlation matrix
Partial correlation matrix

Multicollinearity

High correlation between X variables
Coefficients measure combined effect
Leads to unstable coefficients depending on X variables in model
Always exists; matter of degree
Example: Using both total number of rooms and number of bedrooms as explanatory variables in same model

Detecting Multicollinearity

Examine correlation matrix
– Correlations between pairs of X variables are higher than their correlations with the Y variable
Few remedies
– Obtain new sample data
– Eliminate one correlated X variable
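For the mini-case, the correlation matrix check can be sketched with NumPy's `np.corrcoef`. The choice of NumPy and the variable names are assumptions of this note; the data are the fifteen observations listed in the mini-case.

```python
import numpy as np

# Mini-case data as listed in the slides (15 homes).
oil = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
                dtype=float)
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
                      dtype=float)

# Rows and columns are ordered: oil (Y), temperature (X1), insulation (X2).
corr = np.corrcoef([oil, temp, insulation])

r_x1_x2 = corr[1, 2]   # correlation between the two X variables
r_y_x1 = corr[0, 1]    # correlation of Y with temperature
r_y_x2 = corr[0, 2]    # correlation of Y with insulation

print(np.round(corr, 3))
```

Here the correlation between temperature and insulation is close to zero, far weaker than either variable's correlation with oil consumption, so by the rule on this slide multicollinearity is not a concern in the mini-case.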

Evaluating Multiple Regression Model Steps

Examine variation measures
Do residual analysis
Test parameter significance
– Overall model
– Portions of model
– Individual coefficients
Test for multicollinearity


Dummy-Variable Regression Model

Involves categorical X variable with two levels
– e.g., female-male, employed-not employed, etc.
Variable levels coded 0 & 1
Assumes only intercept is different
– Slopes are constant across categories

Dummy-Variable Model Relationships

[Figure: Y against X1; two parallel lines with the same slope $b_1$, one with intercept $b_0$ and one with intercept $b_0 + b_2$, labelled Females and Males]
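The two parallel lines follow directly from substituting the 0 and 1 codes into a single fitted equation. In this sketch $X_2$ is taken to be the dummy variable and $b_0$, $b_1$, $b_2$ are the sample coefficients labelled in the figure; which category is coded 1 is not recoverable from the transcript, so none is assigned here.

```latex
\begin{aligned}
\hat{Y} &= b_0 + b_1 X_1 + b_2 X_2 \\
X_2 = 0: \quad \hat{Y} &= b_0 + b_1 X_1 \\
X_2 = 1: \quad \hat{Y} &= (b_0 + b_2) + b_1 X_1
\end{aligned}
```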

Dummy Variables

Permits use of qualitative data (e.g., seasonal, class standing, location, gender).

0, 1 coding (nominal data)

As part of diagnostic checking, incorporate outliers (i.e., large residuals) and influence measures.


Interaction Regression Model

Hypothesizes interaction between pairs of X variables
– Response to one X variable varies at different levels of another X variable
Contains two-way cross product terms
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$
Can be combined with other models, e.g. dummy variable models

Effect of Interaction

Given: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$

Without interaction term, effect of X1 on Y is measured by $\beta_1$

With interaction term, effect of X1 on Y is measured by $\beta_1 + \beta_3 X_2$
– Effect increases as $X_{2i}$ increases (when $\beta_3 > 0$)

Interaction Example

$Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$

With $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$
With $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$

Effect (slope) of $X_1$ on Y does depend on $X_2$ value

[Figure: Y (ticks 4, 8, 12) against X1 (ticks 0, 0.5, 1, 1.5), showing the two lines above]
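The example can be checked numerically. The sketch below hard-codes the slide's equation Y = 1 + 2X1 + 3X2 + 4X1X2; the function names are hypothetical and only serve to show how the intercept and the slope on X1 change with X2.

```python
def y(x1: float, x2: float) -> float:
    """The slide's example model with an interaction term."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def intercept_given_x2(x2: float) -> float:
    """Intercept of the Y-versus-X1 line at a fixed X2: 1 + 3*X2."""
    return 1 + 3 * x2


def slope_of_x1_given_x2(x2: float) -> float:
    """Slope of the Y-versus-X1 line at a fixed X2: 2 + 4*X2."""
    return 2 + 4 * x2


for x2 in (0, 1):
    print(f"X2={x2}: Y = {intercept_given_x2(x2)} + {slope_of_x1_given_x2(x2)}*X1")
```

For X2 = 0 this gives the line 1 + 2X1 and for X2 = 1 the line 4 + 6X1, the two lines in the example.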


Inherently Linear Models

Non-linear models that can be expressed in linear form
– Can be estimated by least squares in linear form
Require data transformation

Curvilinear Model Relationships

[Figure: panels of Y against X1 showing curvilinear relationships]

Logarithmic Transformation

$Y = \beta_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2 + \varepsilon$

[Figure: Y against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]

Square-Root Transformation

$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$

[Figure: Y against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]

Reciprocal Transformation

$Y_i = \beta_0 + \beta_1 \frac{1}{X_{1i}} + \beta_2 \frac{1}{X_{2i}} + \varepsilon_i$

[Figure: Y against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$, each approaching an asymptote]

Exponential Transformation

$Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$

[Figure: Y against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]
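Taking natural logs of both sides turns the exponential model into a linear one, $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$, which ordinary least squares can then estimate. The sketch below illustrates this on hypothetical, noise-free data: the coefficients 0.5, 0.2, and -0.1 and the X values are invented for illustration and do not come from the slides.

```python
import numpy as np

# Hypothetical inputs, invented for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Noise-free responses generated from Y = exp(0.5 + 0.2*X1 - 0.1*X2).
y = np.exp(0.5 + 0.2 * x1 - 0.1 * x2)

# Transform the response, then fit a linear model to ln(Y).
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(np.round(coef, 4))
```

Because the data contain no noise, the fit recovers the generating coefficients; with real data the estimates apply to the log scale and predictions need to be transformed back with `np.exp`.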

Overview

Explained the linear multiple regression model
Interpreted linear multiple regression computer output
Explained multicollinearity
Described the types of multiple regression models

Source of Elaborate Slides

Prentice Hall, Inc

Levine, et al., First Edition

Regression Analysis[Multiple Regression]

*** End of Presentation ***

Questions?