Learn About Multiple Regression
With Dummy Variables in SPSS
With Data From the General Social
Survey (2012)
© 2015 SAGE Publications, Ltd. All Rights Reserved.
This PDF has been generated from SAGE Research Methods Datasets.
Student Guide
Introduction
This dataset example introduces readers to multiple regression with dummy
variables. Multiple regression allows researchers to evaluate whether a
continuous dependent variable is a linear function of two or more independent
variables. When one (or more) of the independent variables is a categorical
variable, the most common method of properly including them in the model is
to code them as dummy variables. Dummy variables are dichotomous variables
coded as 1 to indicate the presence of some attribute and as 0 to indicate
the absence of that attribute. The multiple regression model is most commonly
estimated via ordinary least squares (OLS), and is sometimes called OLS
regression.
This example describes multiple regression with dummy variables, discusses
the assumptions underlying it, and shows how to estimate and interpret such
models. We use a subset of data from the 2012 General Social Survey
(http://www3.norc.org/GSS+Website/). It presents an analysis of whether a
person’s weight is a linear function of a number of attributes, including whether
or not the person is female and whether or not the person smokes cigarettes.
Weight, and particularly being overweight, is associated with a number of negative
health outcomes. Thus, results from an analysis like this could have
implications for individual behavior and public health policy.
What Is Multiple Regression With Dummy Variables?
Multiple regression expresses a dependent, or response, variable as a linear
function of two or more independent variables. Readers looking for a general
introduction to multiple regression should refer to the appropriate examples in
Sage Research Methods. This example focuses specifically on including dummy
variables among the independent variables in a multiple regression model.
Many times, an independent variable of interest is categorical. "Gender" might be
coded as Male or Female; "Region" might be coded as South, Northeast, Midwest,
and West. When there is no obvious order to the categories or when there are
three or more categories and differences between them are not all assumed to be
equal, such variables need to be coded as dummy variables for inclusion into a
regression model.
The number of dummy variables you will need to capture a categorical variable
will be one less than the number of categories. Thus, for gender, we only need
one dummy variable, maybe coded "1" for Female and "0" for Male. For region,
we would need three, which might look like this:
• northeast: coded "1" if from the Northeast and "0" otherwise.
• south: coded "1" if from the South and "0" otherwise.
• midwest: coded "1" if from the Midwest and "0" otherwise.
We always need one less than the number of categories because the last one
would be perfectly predicted by the others. For example, if we know that northeast,
south, and midwest all equal zero, then the observation must be from the West.
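Although this guide uses SPSS, the dummy-coding logic above can be sketched in a few lines of Python; the variable names and example data below are hypothetical:

```python
# Sketch of dummy-coding a four-category region variable,
# leaving out "West" as the reference category.
regions = ["Northeast", "South", "Midwest", "South", "West"]

def dummy_code(values, categories):
    """Return one 0/1 indicator list per listed category."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Only three dummies are needed for four categories; "West" is omitted.
dummies = dummy_code(regions, ["Northeast", "South", "Midwest"])
# A respondent whose three dummies all equal 0 must be from the West.
```

Note that the reference category is chosen by simply not constructing a dummy for it; every coefficient in the eventual regression is then interpreted relative to that omitted category.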
Multiple regression models are typically estimated via Ordinary Least Squares
(OLS). OLS produces estimates of the intercept and slopes that minimize the sum
of the squared differences between the observed values of the dependent variable
and the values predicted based on the regression model.
When computing formal statistical tests, it is customary to define the null
hypothesis (H0) to be tested. In multiple regression, the standard null hypothesis
is that each coefficient is equal to zero. The actual coefficient estimates will not
be exactly equal to zero in any particular sample of data simply due to random
chance in sampling. The t-tests conducted to test each coefficient are designed
to help us determine if the coefficient estimates are different enough from zero to
be declared statistically significant. "Different enough" is typically defined as a test
statistic with a level of statistical significance, or p-value, of less than 0.05. This
would lead us to reject the null hypothesis (H0) that a coefficient estimate equals
zero.
Estimating a Multiple Regression With Dummy Variables Model
To make this example easier to follow, we will focus on estimating a model with
just two independent variables, in this case, labeled X and D. Let’s further assume
that X is a continuous variable while D is a dummy variable coded 1 if the
observation has the characteristic associated with D and coded 0 if it does not.
The multiple regression model with two independent variables can be defined as
in Equation 1:
(1)
Yi = β0 + β1Xi + β2Di + εi
Where:
• Yi = individual values of the dependent variable
• Xi = individual values of the continuous independent variable
• Di = individual values of the dummy independent variable
• β0 = the intercept, or constant, associated with the regression line
• β1 = the coefficient operating on the continuous independent variable
• β2 = the coefficient operating on the dummy independent variable
• ε = the unmodeled random, or stochastic, component of the dependent
variable; often called the error term or the residual of the model.
Researchers have values for Yi, Xi, and Di in their datasets – they use OLS to
estimate values for β0, β1, and β2. The coefficients β1 and β2 are often called
partial slope coefficients, or partial regression coefficients, because they represent
the unique independent effect of the corresponding independent variable on the
dependent variable after accounting for, or controlling for, the effects of the other
independent variables in the model.
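The model in Equation 1 can be estimated outside SPSS as well. The following is a minimal sketch in Python using simulated data with known coefficients (all values here are illustrative, not from the GSS):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                 # continuous predictor X
d = rng.integers(0, 2, size=n)         # dummy predictor D
# Simulated outcome with known true coefficients (2.0, 1.5, 3.0)
y = 2.0 + 1.5 * x + 3.0 * d + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then X and D
X = np.column_stack([np.ones(n), x, d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta  # estimates of the intercept and the two slopes
```

With a large enough sample, the OLS estimates land close to the true values used in the simulation, which is a useful sanity check when learning the method.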
Equation 2 can be used to estimate the coefficient operating on the first
independent variable. The same equation can be rearranged to estimate ^β2 as well.
(2)
^β1 = [ ∑(xi)(yi) × ∑(di)² − ∑(di)(yi) × ∑(xi)(di) ] / [ ∑(xi)² × ∑(di)² − ( ∑(xi)(di) )² ]
Where:
• ^β1 = the estimated value of the coefficient operating on X
• yi = Yi −¯Y
• ¯Y = the sample mean of the dependent variable
• xi = Xi −¯X
• ¯X = the sample mean of the continuous independent variable
• di = Di −¯D
• ¯D = the sample mean of the dummy independent variable.
The numerator of Equation 2 is based on the product of deviations in X from its
mean and deviations of Y from its mean. The sum of these products will determine
whether the slope is positive, negative, or near zero. The numerator also accounts
for the shared association between D and Y as well as the correlation between
D and X. The denominator in Equation 2 adjusts the estimate of β1 to account
for how much variability there is in X and D. The result is that β1, which is the
marginal effect of X on Y, captures the unique or independent effect of X on Y
after accounting for the presence of D. The same logic applies to computing the
estimate of β2.
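Equation 2 can be verified numerically. The sketch below (simulated data, hypothetical values) computes ^β1 from the deviation-form sums and checks it against a full least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 1.0 + 2.0 * X - 1.0 * D + rng.normal(scale=0.4, size=n)

# Mean-deviation form used in Equation 2
x, d, y = X - X.mean(), D - D.mean(), Y - Y.mean()

beta1_hat = ((x * y).sum() * (d * d).sum() - (d * y).sum() * (x * d).sum()) / \
            ((x * x).sum() * (d * d).sum() - (x * d).sum() ** 2)

# Cross-check against a full least-squares fit of Y on [1, X, D]
M = np.column_stack([np.ones(n), X, D])
beta_ls, *_ = np.linalg.lstsq(M, Y, rcond=None)
```

The hand-computed ^β1 matches the second element of the least-squares solution to numerical precision, confirming that Equation 2 is simply the two-regressor OLS slope written out in sums.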
Once both β1 and β2 are computed, Equation 3 can be used to compute the value
for the intercept.
(3)
^β0 = ¯Y − ^β1¯X − ^β2¯D
Equation 3 is a simple way to estimate the intercept, β0. We can use this formula
because the OLS regression line always passes through the point defined by the
means of X, D, and Y.
Note that the formulas presented here are identical to the formulas used for
multiple regression generally. The presence of a dummy variable among the
independent variables does not change the math of OLS.
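Equation 3 can be checked the same way. In this sketch (simulated data, hypothetical values), the intercept recovered from the variable means and estimated slopes matches the intercept from a full fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 5.0 + 1.0 * X + 2.0 * D + rng.normal(scale=0.3, size=n)

# Full least-squares fit gives the intercept and both slopes
M = np.column_stack([np.ones(n), X, D])
(b0_ls, b1, b2), *_ = np.linalg.lstsq(M, Y, rcond=None)

# Equation 3: intercept from the sample means and the estimated slopes
b0_eq3 = Y.mean() - b1 * X.mean() - b2 * D.mean()
```

The two intercepts agree to numerical precision, illustrating that the fitted regression surface passes through the point of means.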
As noted above, β1 and β2 represent the marginal effect of X and D, respectively,
on the expected value of the dependent variable. That means that when X
increases by 1 unit, the expected value of Y will change by an amount equal to
β1. Similarly, when D increases by 1 unit, the expected value of Y will change by
an amount equal to β2.
For the continuous independent variable X, a 1-unit increase represents some
incremental increase in that variable. It might mean an increase of 1 dollar, 1
thousand dollars, 1 inch, 1 year, and so forth. In contrast, a 1-unit increase in
the dummy variable D represents shifting from the absence of some attribute (D
= 0) to the presence of that attribute (D = 1). Moving from 0 to 1 constitutes
the entire range of D. As a result, the estimate of β2 can be interpreted as the
mean difference between observations where D = 0 and observations where
D = 1 after accounting for the effects of the other independent variables in
the model. In that way, regression with dummy variables effectively conducts a
difference of means test for the dependent variable across the two categories
of the dummy independent variable in question while controlling for the other
independent variables in the model. Note that in this setting, the model assumes
equal variance in the dependent variable across the two groups defined by D.
Finally, if you have a categorical variable with more than two categories, you
would construct a dummy variable for each of those categories except one and
include all of those dummy variables in the regression. Returning to our earlier
example, if you thought your dependent variable was influenced by region, you
could include the three regional dummy variables we constructed – northeast,
south, and midwest – as independent variables in the model. The coefficient
estimate for each one would capture the difference between that particular region
and the region that was left out of the model, which in this case was the West.
Assessing Model Fit
The most common way of assessing how well a regression model fits the data
is to use a statistic called R-squared (also written R2). R-squared measures the
proportion of variance in the dependent variable that is explained by the set of
independent variables included in the model, and will always fall between 0 and
1. The formula for R-squared can be written many ways – we show one version in
Equation 4:
(4)
R2 = 1 − RSS / TSS
where:
• RSS = The Residual Sum of Squares, or ∑ εi²
• TSS = The Total Sum of Squares, or ∑ (Yi − ¯Y)²
R-squared can also be thought of as the square of the Pearson correlation
coefficient measuring the correlation between the actual values of the dependent
variable and the values of the dependent variable predicted by the regression
model.
Because R-squared will always increase if more independent variables are added,
many scholars prefer the Adjusted R-squared. This statistic adjusts the value of
R-squared downward based on how many independent variables are included in
the model. The formula is:
(5)
Adjusted R-Squared = 1 − [ RSS / (n − k − 1) ] / [ TSS / (n − 1) ]
Where:
• RSS and TSS are as before
• n = the sample size for the model
• k = the number of independent variables included in the model.
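Equations 4 and 5 are straightforward to compute by hand. The sketch below (simulated data, hypothetical values) fits a two-regressor model and evaluates both fit statistics:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 120, 2                       # sample size and number of predictors
X1 = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 1.0 + 0.8 * X1 + 1.5 * D + rng.normal(size=n)

M = np.column_stack([np.ones(n), X1, D])
beta, *_ = np.linalg.lstsq(M, Y, rcond=None)
resid = Y - M @ beta

rss = (resid ** 2).sum()            # Residual Sum of Squares
tss = ((Y - Y.mean()) ** 2).sum()   # Total Sum of Squares
r2 = 1 - rss / tss                  # Equation 4
adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))  # Equation 5
```

Because the adjustment multiplies the unexplained share of variance by (n − 1)/(n − k − 1) > 1, the Adjusted R-squared is always below R-squared whenever at least one predictor is in the model.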
Assumptions Behind the Method
Nearly every statistical test relies on some underlying assumptions, and how well
those assumptions hold depends on the data at hand. Different textbooks present the
assumptions of OLS regression in different ways, but we present them as follows:
• The dependent variable is a linear function of the independent variables.
• Values of the independent variables are fixed in repeated samples; most
critical here is that the independent variables are not correlated with the
residual.
• The expected mean of the residual equals zero.
• The variance of the residual is constant (i.e., homoskedastic).
• The individual residuals are independent of each other (i.e., not correlated
with each other).
• The residuals are distributed normally.
• There is no perfect collinearity among the independent variables.
If these assumptions hold, it can be shown that OLS produces the best linear
unbiased estimates of the coefficients in the model. OLS is also fairly robust to
moderate violations of these assumptions.
Illustrative Example: Modelling Weight in the 2012 General Social
Survey
This example explores whether a person’s weight can be modeled as a linear
function of a person’s height, age (and age squared), and family income as well
as two dummy variables: whether the person is female and whether the person is
a non-smoker. The primary research question guiding this example is:
Do non-smokers on average weigh more or less than smokers,
controlling for other factors?
We can also state this in the form of null hypotheses:
H0 = After accounting for the effect of height, age, gender, and income,
there is no difference in weight between non-smokers and smokers.
In order to keep the example manageable, we treat the remaining independent
variables as control variables and do not discuss them in great detail.
The Data
This example uses several variables from the 2012 General Social Survey:
• The respondent’s weight (rweight), measured in pounds (the dependent
variable).
• The respondent’s height (rheight), measured in inches.
• Whether the respondent is female (female), coded 1 = Yes and 0 = No.
• The respondent’s age (age), coded in years.
• The respondent’s age squared (age2), which is just age in years squared.
• The respondent’s family income (income), coded into categories from 1 to
25.
• Whether the respondent is a non-smoker (nosmoke), coded 1 = Yes and 0
= No.
The sample dataset includes 1351 respondents. The average weight of
respondents to the survey is just over 178 pounds, while the average height
is nearly 67 inches. Almost 55 percent of the respondents are female, with an
average age of almost 50 years old. The median income falls between $40,000
and $49,000 per year. Turning to the independent variable of interest, nearly 76
percent of respondents are non-smokers, leaving 24 percent who do smoke.
Analyzing the Data
Before producing the full regression model, it is a good idea to look more carefully
at the dependent variable. Figure 1 presents a histogram of the weight variable.
Figure 1: Histogram showing the distribution of respondent weight
measured in pounds, 2012 General Social Survey.
Figure 1 shows that the majority of values for weight fall near the mean of 178.
Very few respondents report weights below 100, but a substantial number of
respondents report weights of 200 to 250 pounds. A handful of respondents report
weights of 300 pounds or greater. The distribution shown in Figure 1 suggests
that most of the data is distributed reasonably close to normal, though there
is a positive skew driven mostly by a few outliers. Researchers might want to
explore whether the handful of cases with particularly large values for weight have
an undue influence on the results. We also recommend doing similar descriptive
analysis of each independent variable, but we leave that to readers so we can
move toward estimating the model itself.
Regression results are often presented in a table that reports the coefficient
estimates, standard errors, t-scores, and levels of statistical significance. Table
1 presents the results of regressing weight on the set of independent variables
described above.
Table 1: Results from a multiple regression model where respondent
weight is regressed on a number of factors, 2012 General Social Survey.
Variable        Coefficient   Standard Error   t-score   Sig.
Constant          −164.15         26.46         −6.20    .000
Height               4.59          0.36         12.81    .000
Female              −8.87          2.91         −3.05    .002
Age                  1.87          0.37          5.07    .000
Age Squared         −0.02          0.004        −5.25    .000
Family Income       −0.70          0.20         −3.50    .000
Non-Smoker          12.98          2.57          5.05    .000
Table 1 reports results for the full model, but we focus attention on the dummy
variable for being a non-smoker. The results in Table 1 report an estimate for the
coefficient operating on this variable of 12.98 that is statistically significant. This
means that a 1-unit increase in the non-smoker dummy variable is associated
with an average increase in weight of nearly 13 pounds. In other words, after
controlling for the effects of the other variables in the model, the average
difference in weight between smokers and non-smokers is nearly 13 pounds.
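This interpretation can be made concrete by plugging the Table 1 coefficients into the fitted equation. The respondent profile below is hypothetical; only the coefficients come from Table 1:

```python
# Coefficients taken from Table 1
coef = {"constant": -164.15, "height": 4.59, "female": -8.87,
        "age": 1.87, "age2": -0.02, "income": -0.70, "nosmoke": 12.98}

def predicted_weight(height, female, age, income, nosmoke):
    """Predicted weight in pounds from the Table 1 regression."""
    return (coef["constant"] + coef["height"] * height +
            coef["female"] * female + coef["age"] * age +
            coef["age2"] * age ** 2 + coef["income"] * income +
            coef["nosmoke"] * nosmoke)

# The same hypothetical respondent as a non-smoker vs. a smoker:
diff = predicted_weight(67, 1, 50, 17, 1) - predicted_weight(67, 1, 50, 17, 0)
```

Because every other term cancels, the difference equals the non-smoker coefficient itself (12.98 pounds), which is exactly the "mean difference controlling for the other variables" interpretation of a dummy coefficient.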
The remaining results in Table 1 conform to what we likely suspected in terms
of their direction, and all of the estimated partial slope coefficients reach
conventional levels of statistical significance. The R-squared for the model is
0.267, which means that about 26.7 percent of the variance in respondent weight
is explained by the independent variables in the model.
There are multiple diagnostic tests researchers might perform following the
estimation of a regression model to evaluate whether the model appears to violate
any of the OLS assumptions or whether there are other kinds of problems, such as
particularly influential cases. Describing all of these diagnostic tests is well beyond
the scope of this example.
Presenting Results
The results of a multiple regression can be presented as follows:
"We used a subset of data from the 2012 General Social Survey to test the
following null hypothesis:
H0 = After accounting for the effect of height, age, gender, and income,
there is no difference in weight between non-smokers and smokers.
The data include 1351 individual respondents, and the regression model includes
a number of control variables. Results presented in Table 1 show that there is
a positive and statistically significant relationship between weight and being a
non-smoker. Specifically, the results show that non-smokers on average weigh
nearly 13 pounds more than do smokers, controlling for the effects of the other
independent variables in the model. This result is statistically significant, meaning
that we should reject the null hypothesis of no difference. The remaining partial
slope coefficients estimated for this model are all in the expected direction and
are all statistically significant as well. The R-squared for the model is 0.267, which
means that about 26.7 percent of the variance in respondent weight is explained
by the independent variables in the model. Further diagnostic tests should be
explored to evaluate the robustness of these findings."
Review
Multiple regression allows researchers to model a continuous dependent variable
as a linear function of two or more independent variables. This example focused
specifically on the situation where one (or more) of those independent variables
is categorical and how to use dummy variables in response. This boils down to
testing the difference between the means of the dependent variable between the
two groups designated by the dummy variable in question while controlling for
the effects of the other independent variables in the model. Coefficients for a
multiple regression model are typically estimated via OLS. Rejecting or failing to
reject the null hypothesis that a given partial slope coefficient equals zero tells
us whether the dependent variable is a linear function of the independent
variable(s) in question. However, it does not say anything about whether there is
some other form of association between the dependent variable and any of the
independent variables. Two-way scatter plots comparing the dependent variable
to each independent variable can be useful for exploring more complicated
relationships, but only partially so because they only permit exploration of one
independent variable at a time.
You should know:
• What types of variables are suitable for multiple regression with dummy
variables.
• The basic assumptions underlying OLS regression.
• How to estimate and interpret a multiple regression model that includes
dummy variables.
• How to report the results of a multiple regression with dummy variables
model.
Your Turn
You can download this sample dataset along with a guide showing how to
estimate a multiple regression with dummy variables model using statistical
software. See if you can replicate the analysis presented here. Next, try estimating
the model separately for men and women.