Mult reg

Multiple Regression

Goals

Implementation

Assumptions

Goals of Regression

Description

Inference

Prediction (Forecasting)

Examples

Why is there a need for more than one predictor variable?

As the examples given above show:

More than one variable influences a response variable.

Predictors may themselves be correlated.

What is the independent contribution of each variable to explaining the variation in the response variable?

Three fundamental aspects of linear regression

Model selection: What is the most parsimonious set of predictors that explains the most variation in the response variable?

Evaluation of assumptions: Have we met the assumptions of the regression model?

Model validation

The multiple regression model

Express a p-variable regression model as a series of equations

The equations, condensed into matrix form, give the familiar general linear model

The $\beta$ coefficients are known as partial regression coefficients

The p-variable Regression Model

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_p X_{pi} + \varepsilon_i$$

This model gives the expected value of Y conditional on the fixed values of $X_2, X_3, \ldots, X_p$, plus error

$\beta_1$: intercept

$\beta_2, \ldots, \beta_p$: partial regression slope coefficients

$\varepsilon_i$: residual term associated with the ith observation

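Matrix Representation

The regression model is best described as a system of equations:

$$
\begin{aligned}
Y_1 &= \beta_1 + \beta_2 X_{21} + \beta_3 X_{31} + \cdots + \beta_p X_{p1} + \varepsilon_1 \\
Y_2 &= \beta_1 + \beta_2 X_{22} + \beta_3 X_{32} + \cdots + \beta_p X_{p2} + \varepsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_1 + \beta_2 X_{2n} + \beta_3 X_{3n} + \cdots + \beta_p X_{pn} + \varepsilon_n
\end{aligned}
$$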

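We can re-write these equations:

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{21} & X_{31} & \cdots & X_{p1} \\
1 & X_{22} & X_{32} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{2n} & X_{3n} & \cdots & X_{pn}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

$$\underset{(n \times 1)}{Y} = \underset{(n \times p)}{X}\;\underset{(p \times 1)}{\beta} + \underset{(n \times 1)}{\varepsilon}$$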

Summary of Terms

$Y$ = $n \times 1$ column vector of observations for the response variable

$X$ = $n \times p$ matrix that holds the n observations on the p - 1 independent variables $X_2, \ldots, X_p$; the first column of 1's represents the intercept term, $\beta_1$

$\beta$ = $p \times 1$ column vector of unknown parameters $\beta_1, \beta_2, \ldots, \beta_p$, where $\beta_1$ is the intercept term and $\beta_2, \ldots, \beta_p$ are partial regression coefficients

$\varepsilon$ = $n \times 1$ column vector of residuals $\varepsilon_i$

A Partial Regression Model

Burst = 1.21 + 2.1 Femur Length – 0.25 Tail Length + 1.0 Toe Velocity

Response variable: Burst

Intercept: 1.21

Partial regression coefficients: 2.1, –0.25, 1.0

Predictor variables: Femur Length, Tail Length, Toe Velocity
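As a small illustration of how the partial coefficients are read, the fitted equation can be evaluated for one observation. The coefficients come from the equation above; the predictor values and the function name `predict_burst` are hypothetical, chosen only for this sketch.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = 1.21
COEFS = {"femur_length": 2.1, "tail_length": -0.25, "toe_velocity": 1.0}


def predict_burst(femur_length, tail_length, toe_velocity):
    """Return the predicted Burst value for one observation."""
    return (INTERCEPT
            + COEFS["femur_length"] * femur_length
            + COEFS["tail_length"] * tail_length
            + COEFS["toe_velocity"] * toe_velocity)


# Hypothetical predictor values (not from the source; units unspecified).
base = predict_burst(femur_length=2.0, tail_length=4.0, toe_velocity=1.5)

# Raise femur length by one unit while holding the other predictors fixed.
shifted = predict_burst(femur_length=3.0, tail_length=4.0, toe_velocity=1.5)

# The change in the prediction equals the partial coefficient for femur length.
print(base, shifted - base)
```

The difference between the two predictions is the femur-length coefficient, which is what "partial" means here: the effect of one predictor with the others held constant.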

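Assumption 1. Expected value of the residual vector is 0

$$
E(\varepsilon) = E\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$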

Assumption 2. There is no correlation between the ith and jth residual terms

$$E(\varepsilon_i \varepsilon_j) = 0, \quad i \neq j$$

Assumption 3. The residuals exhibit constant variance

$$E(\varepsilon \varepsilon') = \sigma^2 I$$

Assumption 4. Covariance between the X’s and residual terms is 0

Usually satisfied if the predictor variables are fixed and non-stochastic

$$\operatorname{cov}(X, \varepsilon) = 0$$

Assumption 5. The rank of the data matrix X is p, the number of columns, with p < n, the number of observations.

No exact linear relationships among the X variables (the assumption of no multicollinearity).

$$r(X) = p$$
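The rank condition can be checked numerically. The sketch below assumes NumPy and uses simulated data (the variable names and the constructed collinear column are illustrative, not from the slides): one design matrix has full column rank, the other contains a column that is an exact linear combination of two others.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)

# Design matrix: a leading column of 1s for the intercept, then the predictors.
X_ok = np.column_stack([np.ones(n), x2, x3])

# Appending 2*x2 - x3 creates an exact linear relationship among the columns.
X_bad = np.column_stack([np.ones(n), x2, x3, 2.0 * x2 - x3])

rank_ok = np.linalg.matrix_rank(X_ok)    # equals the number of columns
rank_bad = np.linalg.matrix_rank(X_bad)  # smaller than the number of columns

print(rank_ok, X_ok.shape[1])
print(rank_bad, X_bad.shape[1])
```

When the rank is smaller than the number of columns, X'X is singular and the OLS solution is not unique.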

If these assumptions hold…

Then the OLS estimators are in the class of unbiased linear estimators

They are also minimum variance estimators

What does this mean? It allows us to compute a number of statistics using OLS estimation

What does it mean to be BLUE?

An estimator $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$ iff it is:

Linear

Unbiased, i.e., $E(\hat{\beta}) = \beta$

Minimum variance in the class of all linear unbiased estimators

The unbiased and minimum variance properties mean that OLS estimators are efficient estimators

If one or more of the conditions are not met, then the OLS estimators are no longer BLUE

Does it matter?

Yes, it means we require an alternative method for characterizing the association between our Y and X variables

OLS Estimation

Sample-based counterpart to the population regression model:

$$Y = Xb + e$$

OLS requires choosing values of b such that the error sum-of-squares (SSE) is as small as possible.

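The Normal Equations

$$SSE = e'e = (Y - Xb)'(Y - Xb)$$

Need to differentiate with respect to the unknowns (b) and set the result to zero:

$$\frac{\partial\, SSE}{\partial b} = -2X'Y + 2X'Xb = 0$$

This yields p simultaneous equations in p unknowns, also known as the Normal Equations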

Matrix form of the Normal Equations

$$X'Xb = X'Y$$

The solution for the “b’s”

It should be apparent how to solve for the unknown parameters

Pre-multiply by the inverse of $X'X$:

$$(X'X)^{-1}X'Xb = (X'X)^{-1}X'Y$$

Solution Continued

From the properties of inverses we note that:

$$(X'X)^{-1}X'X = I$$

$$Ib = (X'X)^{-1}X'Y$$

$$b = (X'X)^{-1}X'Y$$

This is the fundamental outcome of OLS theory
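The closed-form solution can be sketched in a few lines of NumPy. The data are simulated and the "true" parameter values are made up for illustration. The function name `ols_fit` is an assumption of this sketch, and it solves the normal equations as a linear system rather than forming the inverse explicitly, which gives the same b with better numerical behaviour.

```python
import numpy as np


def ols_fit(X, Y):
    """Solve the normal equations X'X b = X'Y for b."""
    XtX = X.T @ X
    XtY = X.T @ Y
    # Equivalent to (X'X)^{-1} X'Y, without computing the inverse.
    return np.linalg.solve(XtX, XtY)


rng = np.random.default_rng(1)
n = 50
# First column of 1s is the intercept; the other two are simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])  # hypothetical population parameters
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

b = ols_fit(X, Y)
e = Y - X @ b  # sample residuals

print(b)
```

A useful check is that the residuals are orthogonal to every column of X (X'e = 0), which is just the normal equations rearranged.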

Assessment of “Goodness-of-Fit”

Use the $R^2$ statistic

It represents the proportion of variability in the response variable that is accounted for by the regression model

$0 \le R^2 \le 1$

A good fit of the model means that R-square will be close to one.

A poor fit means that R-square will be near 0.

$R^2$ – Multiple Coefficient of Determination

$$R^2 = 1 - \frac{(Y - \hat{Y})'(Y - \hat{Y})}{(Y - \bar{Y})'(Y - \bar{Y})}$$

Alternative Expressions

$$R^2 = 1 - \frac{SSE}{SST}$$

$$R^2 = \frac{SSR}{SST}$$
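The two alternative expressions for R-square can be compared on simulated data. This is a minimal sketch assuming NumPy and a model that includes an intercept. The data and coefficient values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(size=n)  # hypothetical model

b = np.linalg.solve(X.T @ X, X.T @ Y)  # OLS coefficients
Y_hat = X @ b                          # fitted values
Y_bar = Y.mean()

sse = np.sum((Y - Y_hat) ** 2)      # error sum-of-squares
sst = np.sum((Y - Y_bar) ** 2)      # total sum-of-squares
ssr = np.sum((Y_hat - Y_bar) ** 2)  # regression sum-of-squares

r2_from_sse = 1.0 - sse / sst
r2_from_ssr = ssr / sst

print(r2_from_sse, r2_from_ssr)
```

The two values agree because SST = SSR + SSE when the model contains an intercept.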

Critique of $R^2$ in Multiple Regression

$R^2$ is inflated by increasing the number of parameters in the model.

One should also analyze the residual values from the model (MSE)

Alternatively use the adjusted $R^2$

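Adjusted $R^2$

$$\bar{R}^2 = 1 - \frac{(Y - \hat{Y})'(Y - \hat{Y})/(n - p)}{(Y - \bar{Y})'(Y - \bar{Y})/(n - 1)}$$

For $p > 1$: $\bar{R}^2 < R^2$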

How does adjusted R-square work?

The total sum-of-squares is fixed, because it is independent of the number of variables

SSE never increases (and typically decreases) as the number of variables increases

$R^2$ is therefore artificially inflated by adding explanatory variables to the model

Use adjusted $R^2$ to compare different regressions

Adjusted $R^2$ takes into account the number of predictors in the model
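A quick simulation illustrates the point. The sketch below assumes NumPy; the helper `fit_r2` and the simulated data are illustrative only. It fits one model with a real predictor and a second model that adds a pure-noise column, then reports R-square and adjusted R-square for both.

```python
import numpy as np


def fit_r2(X, Y):
    """Return (R2, adjusted R2) for a design matrix X that includes the intercept column."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ b
    sse = resid @ resid
    sst = np.sum((Y - Y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return r2, adj_r2


rng = np.random.default_rng(3)
n = 30
x2 = rng.normal(size=n)
Y = 1.0 + 2.0 * x2 + rng.normal(size=n)  # hypothetical data-generating model
noise = rng.normal(size=n)               # predictor unrelated to Y

X_small = np.column_stack([np.ones(n), x2])
X_big = np.column_stack([np.ones(n), x2, noise])

r2_small, adj_small = fit_r2(X_small, Y)
r2_big, adj_big = fit_r2(X_big, Y)

print(r2_small, adj_small)
print(r2_big, adj_big)
```

R-square for the larger model can never be lower than for the smaller one, even though the added column carries no information. Adjusted R-square pays a degrees-of-freedom penalty for the extra parameter, so it can go down.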

Statistical Inference and Hypothesis Testing

Our goal may be: 1) hypothesis testing and 2) interval estimation

Hence we will need to impose distributional assumptions on the residuals

It turns out the probability distribution of the OLS estimators depends on the probability distribution of the residuals, $\varepsilon$.

Recount Assumptions

Normality of the residuals: this means the elements of b are normally distributed

The b’s are unbiased.

If these hold then we can perform several hypothesis tests.

ANOVA Approach

Decomposition of the total sum-of-squares into components relating to:

explained variance (regression)

unexplained variance (error)

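ANOVA Table

| Source of Variation | Sums-of-Squares | df | Mean Square | F-ratio |
|---|---|---|---|---|
| Regression | $b'X'Y - n\bar{Y}^2$ | p - 1 | $(b'X'Y - n\bar{Y}^2)/(p - 1)$ | MSR/MSE |
| Residual | $Y'Y - b'X'Y$ | n - p | $(Y'Y - b'X'Y)/(n - p)$ | |
| Total | $Y'Y - n\bar{Y}^2$ | n - 1 | | |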
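The entries of the ANOVA table can be computed directly from the matrix expressions. The following is a sketch assuming NumPy, with simulated data and a hypothetical helper named `anova_table`. It is meant to show the decomposition, not to replace a statistics package.

```python
import numpy as np


def anova_table(X, Y):
    """Compute the regression ANOVA quantities from the matrix formulas."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    correction = n * Y.mean() ** 2   # n * Ybar^2
    ssr = b @ X.T @ Y - correction   # regression sum-of-squares
    sse = Y @ Y - b @ X.T @ Y        # residual sum-of-squares
    sst = Y @ Y - correction         # total sum-of-squares
    msr = ssr / (p - 1)
    mse = sse / (n - p)
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "df": (p - 1, n - p, n - 1),
            "MSR": msr, "MSE": mse, "F": msr / mse}


rng = np.random.default_rng(4)
n = 35
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0, -1.5]) + rng.normal(size=n)  # hypothetical model

table = anova_table(X, Y)
for key, value in table.items():
    print(key, value)
```

The regression and residual sums-of-squares add up to the total, and their degrees of freedom add up to n - 1.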

Test of Null Hypothesis

Tests the null hypothesis:

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_p = 0$$

The null hypothesis is known as a joint or simultaneous hypothesis, because it compares the values of all $\beta_i$ simultaneously

This tests the overall significance of the regression model

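The F-test statistic and $R^2$ vary directly

$$F = \frac{(b'X'Y - n\bar{Y}^2)/(p - 1)}{(Y'Y - b'X'Y)/(n - p)}$$

$$F = \frac{SSR/(p - 1)}{SSE/(n - p)}$$

$$F = \frac{SSR/(p - 1)}{(SST - SSR)/(n - p)}$$

$$F = \frac{(SSR/SST)/(p - 1)}{(1 - SSR/SST)/(n - p)}$$

$$F = \frac{R^2/(p - 1)}{(1 - R^2)/(n - p)}$$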
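As a worked illustration of the last expression, take hypothetical values (not from the slides) of $R^2 = 0.60$, $n = 30$ observations and $p = 4$ parameters (an intercept plus three predictors):

$$F = \frac{R^2/(p - 1)}{(1 - R^2)/(n - p)} = \frac{0.60/3}{0.40/26} = \frac{0.20}{0.01538} \approx 13.0$$

This value would be compared with the tabulated F with (3, 26) df. Holding n and p fixed, a larger $R^2$ raises the numerator and shrinks the denominator, which is why F and $R^2$ move together.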

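Tests of Hypotheses of true $\beta$

Assume the regression coefficients are normally distributed

$$b \sim N\left(\beta,\ \sigma^2 [X'X]^{-1}\right)$$

$$\operatorname{cov}(b) = E\left[(b - \beta)(b - \beta)'\right] = \sigma^2 [X'X]^{-1}$$

The estimate of $\sigma^2$ is $s^2$:

$$s^2 = \frac{(Y - Xb)'(Y - Xb)}{n - p}$$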

Test Statistic

$$t = \frac{b_i - \beta_i}{s\sqrt{c_{ii}}}$$

where $c_{ii}$ is the element in the ith row and ith column of $[X'X]^{-1}$

Follows a t distribution with n – p df.

The 100(1 - $\alpha$)% confidence interval is obtained from

$$b_i \pm t_{\alpha/2;\, n-p}\; s\sqrt{c_{ii}}$$
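The standard errors, t statistics and confidence limits can be sketched as follows, assuming NumPy and simulated data. The function `coef_inference` is a hypothetical helper. The critical value is passed in by the caller, and here it is 2.042, the tabulated two-sided 5% value of t with 30 df, which matches n - p = 30 in this example.

```python
import numpy as np


def coef_inference(X, Y, t_crit):
    """Return b, standard errors, t statistics (H0: beta_i = 0) and CI limits."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    resid = Y - X @ b
    s2 = resid @ resid / (n - p)         # estimate of sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))  # s * sqrt(c_ii)
    t_stats = b / se                     # tests each beta_i against 0
    return b, se, t_stats, b - t_crit * se, b + t_crit * se


rng = np.random.default_rng(5)
n = 33  # with p = 3 parameters this leaves n - p = 30 df
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # hypothetical model

T_CRIT_30DF = 2.042  # from a standard t table, alpha = 0.05 two-sided
b, se, t_stats, lower, upper = coef_inference(X, Y, T_CRIT_30DF)

for i in range(len(b)):
    print(i + 1, b[i], se[i], t_stats[i], (lower[i], upper[i]))
```

A coefficient whose interval excludes 0 is the same coefficient whose |t| exceeds the critical value, so the two views give the same decision.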

Model Comparisons

Our interest is in parsimonious modeling

We seek a minimum set of X variables to predict variation in the Y response variable.

The goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.

Does leaving out one of the b’s significantly diminish the variance explained by the model?

Compare a saturated to an unsaturated model. Note there are many possible unsaturated models.

General Philosophy

Let SSE(r) designate the error sum-of-squares for the reduced model and SSE(f) that for the saturated (full) model

$SSE(r) \ge SSE(f)$

The saturated model will contain p parameters

The reduced model will contain k < p parameters

If we assume the errors are normally distributed with mean 0 and variance $\sigma^2$, then we can compare the two models.

Model Comparison

Compare the saturated model with the reduced model

Use the SSE terms as the basis for comparison

$$F = \frac{[SSE(r) - SSE(f)]/(p - k)}{SSE(f)/(n - p)}$$

Follows an F-distribution with (p – k), (n – p) df

If Fobs > Fcritical, we reject the reduced model as a parsimonious model

Hence, the $b_i$ must be included in the model
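The comparison of a full and a reduced model can be sketched as below, assuming NumPy and simulated data in which the dropped predictor really does matter. The helper names `sse` and `partial_f` are illustrative, and the critical value still has to be looked up in an F table for the returned degrees of freedom.

```python
import numpy as np


def sse(X, Y):
    """Error sum-of-squares from an OLS fit of Y on X."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return float(resid @ resid)


def partial_f(X_full, X_reduced, Y):
    """F statistic comparing a reduced model against the full model."""
    n, p = X_full.shape
    k = X_reduced.shape[1]
    sse_f = sse(X_full, Y)
    sse_r = sse(X_reduced, Y)
    f_obs = ((sse_r - sse_f) / (p - k)) / (sse_f / (n - p))
    return f_obs, (p - k, n - p)


rng = np.random.default_rng(6)
n = 40
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
Y = 1.0 + 2.0 * x2 + 2.0 * x3 + rng.normal(scale=0.5, size=n)  # hypothetical

X_full = np.column_stack([np.ones(n), x2, x3])  # p = 3 parameters
X_reduced = np.column_stack([np.ones(n), x2])   # k = 2 parameters (x3 dropped)

f_obs, dfs = partial_f(X_full, X_reduced, Y)
print(f_obs, dfs)
```

A large observed F relative to the tabulated value for those df indicates that dropping the predictor costs too much explained variation, so the reduced model is rejected.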

How Many Predictors to Retain? A short course in Model Selection

Several Options

Sequential Selection: Backward Selection, Forward Selection, Stepwise Selection

All possible subsets: MAXR, MINR, RSQUARE, ADJUSTED RSQUARE, CP

Sequential Methods

Forward, Stepwise, and Backward selection procedures

Entails “partialling-out” the predictor variables

Based on the partial correlation coefficient

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
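The partial correlation formula is easy to evaluate by hand or in code. The sketch below uses made-up correlation values. The function name `partial_corr` is an assumption of this example.

```python
import math


def partial_corr(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 partialled out."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))


# Hypothetical pairwise correlations.
r12_3 = partial_corr(r12=0.6, r13=0.5, r23=0.4)

# If variable 3 is uncorrelated with both others, nothing is partialled out.
unchanged = partial_corr(r12=0.6, r13=0.0, r23=0.0)

print(r12_3, unchanged)
```

With these numbers the numerator is 0.6 - 0.2 = 0.4 and the denominator is the square root of 0.75 × 0.84 = 0.63, giving a partial correlation of about 0.50.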

Forward Selection

A “build-up” procedure.

Add predictors until the “best” regression model is obtained

Outline of Forward Selection

1) No variables are included in the regression equation

2) Calculate correlations of all predictors with the dependent variable

3) Enter the predictor variable with the highest correlation into the regression model if its corresponding partial F-value exceeds a predetermined threshold

4) Calculate the regression equation with the predictor

5) Select the predictor variable with the highest partial correlation to enter next.

Forward Selection Continued

Compare the partial F-test value (called FH, also known as “F-to-enter”) to a predetermined tabulated F-value (called FC)

If FH > FC, include the variable with the highest partial correlation and return to step 5.

If FH < FC, stop and retain the regression equation as calculated
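The forward procedure can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the routine of any particular package: the function names, the simulated data and the threshold `f_enter = 4.0` are all hypothetical. At each step the candidate with the largest partial F is chosen, which is the same candidate as the one with the highest partial correlation, because the partial F is an increasing function of the squared partial correlation.

```python
import numpy as np


def sse(X, Y):
    """Error sum-of-squares from an OLS fit of Y on X."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return float(resid @ resid)


def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, c] for c in cols])


def forward_select(X, Y, f_enter):
    """X holds predictor columns only; returns column indices in entry order."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    while remaining:
        sse_cur = sse(design(X, selected), Y)
        best_j, best_f = None, -np.inf
        for j in remaining:
            trial = design(X, selected + [j])
            sse_new = sse(trial, Y)
            # Partial F ("F-to-enter") for adding predictor j to the current model.
            f_h = (sse_cur - sse_new) / (sse_new / (n - trial.shape[1]))
            if f_h > best_f:
                best_j, best_f = j, f_h
        if best_f <= f_enter:
            break  # FH does not exceed FC: stop and keep the current equation
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


rng = np.random.default_rng(7)
n = 60
X = rng.normal(size=(n, 4))
# Hypothetical truth: only columns 0 and 2 influence Y.
Y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

selected = forward_select(X, Y, f_enter=4.0)
print(selected)
```

In practice the threshold would come from an F table for the relevant degrees of freedom. A fixed value is used here only to keep the sketch self-contained.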

Backward Selection

A “deconstruction” approach

1) Begin with the saturated (full) regression model

2) Compute the drop in $R^2$ as a consequence of eliminating each predictor variable, and the partial F-test value; treat each variable as if it were the last to enter the regression equation

3) Compare the lowest partial F-test value (designated FL) to the critical value of F (designated FC)

a. If FL < FC, remove the variable, recompute the regression equation using the remaining predictor variables, and return to step 2.

b. If FL > FC, adopt the regression equation as calculated
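Backward selection can be sketched in the same style. Again this is a hedged illustration, assuming NumPy: the function names, the simulated data and the removal threshold `f_remove = 4.0` are hypothetical, and the critical value would normally be taken from an F table.

```python
import numpy as np


def sse(X, Y):
    """Error sum-of-squares from an OLS fit of Y on X."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return float(resid @ resid)


def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, c] for c in cols])


def backward_eliminate(X, Y, f_remove):
    """X holds predictor columns only; returns the indices that are retained."""
    n, m = X.shape
    kept = list(range(m))  # start from the full model
    while kept:
        full = design(X, kept)
        sse_full = sse(full, Y)
        mse_full = sse_full / (n - full.shape[1])
        # Partial F for each predictor, treated as if it were the last to enter.
        partial_f = {}
        for j in kept:
            others = [c for c in kept if c != j]
            partial_f[j] = (sse(design(X, others), Y) - sse_full) / mse_full
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_remove:
            break  # FL is not below FC: adopt the current equation
        kept.remove(weakest)
    return kept


rng = np.random.default_rng(7)
n = 60
X = rng.normal(size=(n, 4))
# Hypothetical truth: only columns 0 and 2 influence Y.
Y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

kept = backward_eliminate(X, Y, f_remove=4.0)
print(kept)
```

Forward and backward procedures need not arrive at the same subset when predictors are correlated, which is one reason to compare their results.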

Stepwise Selection

1) Calculate correlations of all predictors with the response variable. Select the predictor variable with the highest correlation. Regress Y on Xi. Retain the predictor if there is a significant F-test value.

2) Calculate partial correlations of all variables not in the equation with the response variable. Select the next predictor to enter as the one with the highest partial correlation. Call this predictor Xj.

3) Compute the regression equation with both Xi and Xj entered. Retain Xj if its partial F-value exceeds the tabulated F with (1, n-2-1) df.

4) Now determine whether Xi warrants retention. Compute its partial F-value as if Xj had been entered into the equation first. Retain Xi if its F-value exceeds the tabulated F value.

Stepwise Continued

5) Enter a new Xk variable. Compute the regression with three predictors. Compute partial F-values for Xi, Xj and Xk. Determine whether each should be retained by comparing the observed partial F with the critical F.

6) Retain the regression equation when no other predictor can be entered or removed from the model.

All possible subsets

All-subsets regression computes all possible 1-, 2-, 3-, … variable models and compares them using some optimality criterion.

Requires use of an optimality criterion, e.g., Mallows’ Cp

$$C_p = p + \frac{(s^2 - \hat{\sigma}^2)(n - p)}{\hat{\sigma}^2} \qquad (p = k + 1)$$

$s^2$ is the residual variance for the reduced model and $\hat{\sigma}^2$ is the residual variance for the full model

Mallows’ Cp

Measures total squared error

Choose the model where Cp ≈ p
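An all-subsets search with Mallows' Cp can be sketched as follows, assuming NumPy and simulated data. The helper names are hypothetical. Here p counts the intercept plus the k predictors in each subset, and the residual variance of the full model plays the role of the estimate of sigma squared.

```python
import numpy as np
from itertools import combinations


def resid_var(X, Y):
    """Residual variance SSE / (n - p) from an OLS fit of Y on X."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return float(resid @ resid) / (n - p)


def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.column_stack([np.ones(X.shape[0])] + [X[:, c] for c in cols])


def mallows_cp_all_subsets(X, Y):
    """X holds predictor columns only; returns {subset: Cp} for every non-empty subset."""
    n, m = X.shape
    sigma2_full = resid_var(design(X, range(m)), Y)
    results = {}
    for k in range(1, m + 1):
        for subset in combinations(range(m), k):
            p = k + 1  # predictors plus intercept
            s2 = resid_var(design(X, subset), Y)
            results[subset] = p + (s2 - sigma2_full) * (n - p) / sigma2_full
    return results


rng = np.random.default_rng(8)
n = 50
X = rng.normal(size=(n, 3))
# Hypothetical truth: only columns 0 and 1 influence Y.
Y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

cp = mallows_cp_all_subsets(X, Y)
for subset, value in sorted(cp.items(), key=lambda item: item[1]):
    print(subset, len(subset) + 1, round(value, 2))
```

For the full model Cp equals p by construction, so the criterion is informative only for the proper subsets. Subsets that omit an important predictor tend to show Cp well above p.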
