Learn About Multiple Regression
With Dummy Variables in SPSS
With Data From the General Social
Survey (2012)
© 2015 SAGE Publications, Ltd. All Rights Reserved.
This PDF has been generated from SAGE Research Methods Datasets.
Student Guide
Introduction
This dataset example introduces readers to multiple regression with dummy
variables. Multiple regression allows researchers to evaluate whether a
continuous dependent variable is a linear function of two or more independent
variables. When one (or more) of the independent variables is a categorical
variable, the most common method of properly including them in the model is
to code them as dummy variables. Dummy variables are dichotomous variables
coded as 1 to indicate the presence of some attribute and as 0 to indicate
the absence of that attribute. The multiple regression model is most commonly
estimated via ordinary least squares (OLS), and is sometimes called OLS
regression.
This example describes multiple regression with dummy variables, discusses
the assumptions underlying it, and shows how to estimate and interpret such
models. We use a subset of data from the 2012 General Social Survey
(http://www3.norc.org/GSS+Website/). It presents an analysis of whether a
person’s weight is a linear function of a number of attributes, including whether
or not the person is female and whether or not the person smokes cigarettes.
Weight, and particularly being overweight, is associated with a number of negative
health outcomes. Thus, results from an analysis like this could have
implications for individual behavior and public health policy.
What Is Multiple Regression With Dummy Variables?
Multiple regression expresses a dependent, or response, variable as a linear
function of two or more independent variables. Readers looking for a general
introduction to multiple regression should refer to the appropriate examples in
Sage Research Methods. This example focuses specifically on including dummy
variables among the independent variables in a multiple regression model.
Many times, an independent variable of interest is categorical. "Gender" might be
coded as Male or Female; "Region" might be coded as South, Northeast, Midwest,
and West. When there is no obvious order to the categories or when there are
three or more categories and differences between them are not all assumed to be
equal, such variables need to be coded as dummy variables for inclusion into a
regression model.
The number of dummy variables you will need to capture a categorical variable
will be one less than the number of categories. Thus, for gender, we only need
one dummy variable, maybe coded "1" for Female and "0" for Male. For region,
we would need three, which might look like this:
• northeast: coded "1" if from the Northeast and "0" otherwise.
• south: coded "1" if from the South and "0" otherwise.
• midwest: coded "1" if from the Midwest and "0" otherwise.
We always need one less than the number of categories because the last one
would be perfectly predicted by the others. For example, if we know that northeast,
south, and midwest all equal zero, then the observation must be from the West.
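Although this guide uses SPSS, the dummy-coding logic above can be sketched in a few lines of Python; the variable names and example data below are hypothetical:

```python
# Sketch of dummy-coding a four-category region variable,
# leaving out "West" as the reference category.
regions = ["Northeast", "South", "Midwest", "South", "West"]

def dummy_code(values, categories):
    """Return one 0/1 indicator list per listed category."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Only three dummies are needed for four categories; "West" is omitted.
dummies = dummy_code(regions, ["Northeast", "South", "Midwest"])
# A respondent whose three dummies all equal 0 must be from the West.
```

Note that the reference category is chosen by simply not constructing a dummy for it; every coefficient in the eventual regression is then interpreted relative to that omitted category.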
Multiple regression models are typically estimated via Ordinary Least Squares
(OLS). OLS produces estimates of the intercept and slopes that minimize the sum
of the squared differences between the observed values of the dependent variable
and the values predicted based on the regression model.
When computing formal statistical tests, it is customary to define the null
hypothesis (H0) to be tested. In multiple regression, the standard null hypothesis
is that each coefficient is equal to zero. The actual coefficient estimates will not
be exactly equal to zero in any particular sample of data simply due to random
chance in sampling. The t-tests conducted to test each coefficient are designed
to help us determine if the coefficient estimates are different enough from zero to
be declared statistically significant. "Different enough" is typically defined as a test
statistic with a level of statistical significance, or p-value, of less than 0.05. This
would lead us to reject the null hypothesis (H0) that a coefficient estimate equals
zero.
Estimating a Multiple Regression With Dummy Variables Model
To make this example easier to follow, we will focus on estimating a model with
just two independent variables, in this case, labeled X and D. Let’s further assume
that X is a continuous variable while D is a dummy variable coded 1 if the
observation has the characteristic associated with D and coded 0 if it does not.
The multiple regression model with two independent variables can be defined as
in Equation 1:
(1)
Yi = β0 + β1Xi + β2Di + εi
Where:
• Yi = individual values of the dependent variable
• Xi = individual values of the continuous independent variable
• Di = individual values of the dummy independent variable
• β0 = the intercept, or constant, associated with the regression line
• β1 = the coefficient operating on the continuous independent variable
• β2 = the coefficient operating on the dummy independent variable
• ε = the unmodeled random, or stochastic, component of the dependent
variable; often called the error term or the residual of the model.
Researchers have values for Yi, Xi, and Di in their datasets – they use OLS to
estimate values for β0, β1, and β2. The coefficients β1 and β2 are often called
partial slope coefficients, or partial regression coefficients, because they represent
the unique independent effect of the corresponding independent variable on the
dependent variable after accounting for, or controlling for, the effects of the other
independent variables in the model.
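The model in Equation 1 can be estimated outside SPSS as well. The following is a minimal sketch in Python using simulated data with known coefficients (all values here are illustrative, not from the GSS):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                 # continuous predictor X
d = rng.integers(0, 2, size=n)         # dummy predictor D
# Simulated outcome with known true coefficients (2.0, 1.5, 3.0)
y = 2.0 + 1.5 * x + 3.0 * d + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then X and D
X = np.column_stack([np.ones(n), x, d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta  # estimates of the intercept and the two slopes
```

With a large enough sample, the OLS estimates land close to the true values used in the simulation, which is a useful sanity check when learning the method.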
Equation 2 can be used to estimate the coefficient operating on the first
independent variable. The same equation can be rearranged to estimate ^β2 as well.
(2)
^β1 = [ ∑(xi)(yi) × ∑(di)² − ∑(di)(yi) × ∑(xi)(di) ] / [ ∑(xi)² × ∑(di)² − ( ∑(xi)(di) )² ]
Where:
• ^β1 = the estimated value of the coefficient operating on X
• yi = Yi −¯Y
• ¯Y = the sample mean of the dependent variable
• xi = Xi −¯X
• ¯X = the sample mean of the continuous independent variable
• di = Di −¯D
• ¯D = the sample mean of the dummy independent variable.
The numerator of Equation 2 is based on the product of deviations in X from its
mean and deviations of Y from its mean. The sum of these products will determine
whether the slope is positive, negative, or near zero. The numerator also accounts
for the shared association between D and Y as well as the correlation between
D and X. The denominator in Equation 2 adjusts the estimate of β1 to account
for how much variability there is in X and D. The result is that β1, which is the
marginal effect of X on Y, captures the unique or independent effect of X on Y
after accounting for the presence of D. The same logic applies to computing the
estimate of β2.
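Equation 2 can be verified numerically. The sketch below (simulated data, hypothetical values) computes ^β1 from the deviation-form sums and checks it against a full least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 1.0 + 2.0 * X - 1.0 * D + rng.normal(scale=0.4, size=n)

# Mean-deviation form used in Equation 2
x, d, y = X - X.mean(), D - D.mean(), Y - Y.mean()

beta1_hat = ((x * y).sum() * (d * d).sum() - (d * y).sum() * (x * d).sum()) / \
            ((x * x).sum() * (d * d).sum() - (x * d).sum() ** 2)

# Cross-check against a full least-squares fit of Y on [1, X, D]
M = np.column_stack([np.ones(n), X, D])
beta_ls, *_ = np.linalg.lstsq(M, Y, rcond=None)
```

The hand-computed ^β1 matches the second element of the least-squares solution to numerical precision, confirming that Equation 2 is simply the two-regressor OLS slope written out in sums.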
Once both β1 and β2 are computed, Equation 3 can be used to compute the value
for the intercept.
(3)
^β0 = ¯Y − ^β1¯X − ^β2¯D
Equation 3 is a simple way to estimate the intercept, β0. We can use this formula
because the OLS regression line always passes through the point defined by the
means of X, D, and Y.
Note that the formulas presented here are identical to the formulas used for
multiple regression generally. The presence of a dummy variable among the
independent variables does not change the math of OLS.
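Equation 3 can be checked the same way. In this sketch (simulated data, hypothetical values), the intercept recovered from the variable means and estimated slopes matches the intercept from a full fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 5.0 + 1.0 * X + 2.0 * D + rng.normal(scale=0.3, size=n)

# Full least-squares fit gives the intercept and both slopes
M = np.column_stack([np.ones(n), X, D])
(b0_ls, b1, b2), *_ = np.linalg.lstsq(M, Y, rcond=None)

# Equation 3: intercept from the sample means and the estimated slopes
b0_eq3 = Y.mean() - b1 * X.mean() - b2 * D.mean()
```

The two intercepts agree to numerical precision, illustrating that the fitted regression surface passes through the point of means.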
As noted above, β1 and β2 represent the marginal effect of X and D, respectively,
on the expected value of the dependent variable. That means that when X
increases by 1 unit, the expected value of Y will change by an amount equal to
β1. Similarly, when D increases by 1 unit, the expected value of Y will change by
an amount equal to β2.
For the continuous independent variable X, a 1-unit increase represents some
incremental increase in that variable. It might mean an increase of 1 dollar, 1
thousand dollars, 1 inch, 1 year, and so forth. In contrast, a 1-unit increase in
the dummy variable D represents shifting from the absence of some attribute (D
= 0) to the presence of that attribute (D = 1). Moving from 0 to 1 constitutes
the entire range of D. As a result, the estimate of β2 can be interpreted as the
mean difference between observations where D = 0 and observations where
D = 1 after accounting for the effects of the other independent variables in
the model. In that way, regression with dummy variables effectively conducts a
difference of means test for the dependent variable across the two categories
of the dummy independent variable in question while controlling for the other
independent variables in the model. Note that in this setting, the model assumes
equal variance in the dependent variable across the two groups defined by D.
Finally, if you have a categorical variable with more than two categories, you
would construct a dummy variable for each of those categories except one and
include all of those dummy variables in the regression. Returning to our earlier
example, if you thought your dependent variable was influenced by region, you
could include the three regional dummy variables we constructed – northeast,
south, and midwest – as independent variables in the model. The coefficient
estimate for each one would capture the difference between that particular region
and the region that was left out of the model, which in this case was the West.
Assessing Model Fit
The most common way of assessing how well a regression model fits the data
is to use a statistic called R-squared (also written R2). R-squared measures the
proportion of variance in the dependent variable that is explained by the set of
independent variables included in the model, and will always fall between 0 and
1. The formula for R-squared can be written many ways – we show one version in
Equation 4:
(4)
R2 = 1 − RSS / TSS
where:
• RSS = The Residual Sum of Squares, or ∑ εi²
• TSS = The Total Sum of Squares, or ∑ (Yi − ¯Y)²
R-squared can also be thought of as the square of the Pearson correlation
coefficient measuring the correlation between the actual values of the dependent
variable and the values of the dependent variable predicted by the regression
model.
Because R-squared will always increase if more independent variables are added,
many scholars prefer the Adjusted R-squared. This statistic adjusts the value of
R-squared downward based on how many independent variables are included in
the model. The formula is:
(5)
Adjusted R-Squared = 1 − [ RSS / (n − k − 1) ] / [ TSS / (n − 1) ]
Where:
• RSS and TSS are as before
• n = the sample size for the model
• k = the number of independent variables included in the model.
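Equations 4 and 5 are straightforward to compute by hand. The sketch below (simulated data, hypothetical values) fits a two-regressor model and evaluates both fit statistics:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 120, 2                       # sample size and number of predictors
X1 = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)
Y = 1.0 + 0.8 * X1 + 1.5 * D + rng.normal(size=n)

M = np.column_stack([np.ones(n), X1, D])
beta, *_ = np.linalg.lstsq(M, Y, rcond=None)
resid = Y - M @ beta

rss = (resid ** 2).sum()            # Residual Sum of Squares
tss = ((Y - Y.mean()) ** 2).sum()   # Total Sum of Squares
r2 = 1 - rss / tss                  # Equation 4
adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))  # Equation 5
```

Because the adjustment multiplies the unexplained share of variance by (n − 1)/(n − k − 1) > 1, the Adjusted R-squared is always below R-squared whenever at least one predictor is in the model.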
Assumptions Behind the Method
Nearly every statistical test relies on some underlying assumptions, and how well
those assumptions hold depends on the data at hand. Different textbooks present the
assumptions of OLS regression in different ways, but we present them as follows:
• The dependent variable is a linear function of the independent variables.
• Values of the independent variables are fixed in repeated samples; most
critical here is that the independent variables are not correlated with the
residual.
• The expected mean of the residual equals zero.
• The variance of the residual is constant (i.e., homoskedastic).
• The individual residuals are independent of each other (i.e., not correlated
with each other).
• The residuals are distributed normally.
• There is no perfect collinearity among the independent variables.
If these assumptions hold, it can be shown that OLS produces the best linear
unbiased estimates of the coefficients in the model. OLS is also fairly robust to
moderate violations of these assumptions.
Illustrative Example: Modelling Weight in the 2012 General Social
Survey
This example explores whether a person’s weight can be modeled as a linear
function of a person’s height, age (and age squared), and family income as well
as two dummy variables: whether the person is female and whether the person is
a non-smoker. The primary research question guiding this example is:
Do non-smokers on average weigh more or less than smokers,
controlling for other factors?
We can also state this in the form of null hypotheses:
H0 = After accounting for the effect of height, age, gender, and income,
there is no difference in weight between non-smokers and smokers.
In order to keep the example manageable, we treat the remaining independent
variables as control variables and do not discuss them in great detail.
The Data
This example uses several variables from the 2012 General Social Survey:
• The respondent’s weight (rweight), measured in pounds (the dependent
variable).
• The respondent’s height (rheight), measured in inches.
• Whether the respondent is female (female), coded 1 = Yes and 0 = No.
• The respondent’s age (age), coded in years.
• The respondent’s age squared (age2), which is just age in years squared.
• The respondent’s family income (income), coded into categories from 1 to
25.
• Whether the respondent is a non-smoker (nosmoke), coded 1 = Yes and 0
= No.
The sample dataset includes 1351 respondents. The average weight of
respondents to the survey is just over 178 pounds, while the average height
is nearly 67 inches. Almost 55 percent of the respondents are female, with an
average age of almost 50 years old. The median income falls between $40,000
and $49,000 per year. Turning to the independent variable of interest, nearly 76
percent of respondents are non-smokers, leaving 24 percent who do smoke.
Analyzing the Data
Before producing the full regression model, it is a good idea to look more carefully
at the dependent variable. Figure 1 presents a histogram of the weight variable.
Figure 1: Histogram showing the distribution of respondent weight
measured in pounds, 2012 General Social Survey.
Figure 1 shows that the majority of values for weight fall near the mean of 178.
Very few respondents report weights below 100, but a substantial number of
respondents report weights of 200 to 250 pounds. A handful of respondents report
weights of 300 pounds or greater. The distribution shown in Figure 1 suggests
that most of the data is distributed reasonably close to normal, though there
is a positive skew driven mostly by a few outliers. Researchers might want to
explore whether the handful of cases with particularly large values for weight have
an undue influence on the results. We also recommend doing similar descriptive
analysis of each independent variable, but we leave that to readers so we can
move toward estimating the model itself.
Regression results are often presented in a table that reports the coefficient
estimates, standard errors, t-scores, and levels of statistical significance. Table
1 presents the results of regressing weight on the set of independent variables
described above.
Table 1: Results from a multiple regression model where respondent
weight is regressed on a number of factors, 2012 General Social Survey.
Variable        Coefficient   Standard Error   t-score   Sig.
Constant          −164.15         26.46         −6.20    .000
Height               4.59          0.36         12.81    .000
Female              −8.87          2.91         −3.05    .002
Age                  1.87          0.37          5.07    .000
Age Squared         −0.02          0.004        −5.25    .000
Family Income       −0.70          0.20         −3.50    .000
Non-Smoker          12.98          2.57          5.05    .000
Table 1 reports results for the full model, but we focus attention on the dummy
variable for being a non-smoker. The results in Table 1 report an estimate for the
coefficient operating on this variable of 12.98 that is statistically significant. This
means that a 1-unit increase in the non-smoker dummy variable is associated
with an average increase in weight of nearly 13 pounds. In other words, after
controlling for the effects of the other variables in the model, the average
difference in weight between smokers and non-smokers is nearly 13 pounds.
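This interpretation can be made concrete by plugging the Table 1 coefficients into the fitted equation. The respondent profile below is hypothetical; only the coefficients come from Table 1:

```python
# Coefficients taken from Table 1
coef = {"constant": -164.15, "height": 4.59, "female": -8.87,
        "age": 1.87, "age2": -0.02, "income": -0.70, "nosmoke": 12.98}

def predicted_weight(height, female, age, income, nosmoke):
    """Predicted weight in pounds from the Table 1 regression."""
    return (coef["constant"] + coef["height"] * height +
            coef["female"] * female + coef["age"] * age +
            coef["age2"] * age ** 2 + coef["income"] * income +
            coef["nosmoke"] * nosmoke)

# The same hypothetical respondent as a non-smoker vs. a smoker:
diff = predicted_weight(67, 1, 50, 17, 1) - predicted_weight(67, 1, 50, 17, 0)
```

Because every other term cancels, the difference equals the non-smoker coefficient itself (12.98 pounds), which is exactly the "mean difference controlling for the other variables" interpretation of a dummy coefficient.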
The remaining results in Table 1 conform to what we likely suspected in terms
of their direction, and all of the estimated partial slope coefficients reach
conventional levels of statistical significance. The R-squared for the model is
0.267, which means that about 26.7 percent of the variance in respondent weight
is explained by the independent variables in the model.
There are multiple diagnostic tests researchers might perform following the
estimation of a regression model to evaluate whether the model appears to violate
any of the OLS assumptions or whether there are other kinds of problems, such as
particularly influential cases. Describing all of these diagnostic tests is well beyond
the scope of this example.
Presenting Results
The results of a multiple regression can be presented as follows:
"We used a subset of data from the 2012 General Social Survey to test the
following null hypothesis:
H0 = After accounting for the effect of height, age, gender, and income,
there is no difference in weight between non-smokers and smokers.
The data include 1351 individual respondents, and the regression model includes
a number of control variables. Results presented in Table 1 show that there is
a positive and statistically significant relationship between weight and being a
non-smoker. Specifically, the results show that non-smokers on average weigh
nearly 13 pounds more than do smokers, controlling for the effects of the other
independent variables in the model. This result is statistically significant, meaning
that we should reject the null hypothesis of no difference. The remaining partial
slope coefficients estimated for this model are all in the expected direction and
are all statistically significant as well. The R-squared for the model is 0.267, which
means that about 26.7 percent of the variance in respondent weight is explained
by the independent variables in the model. Further diagnostic tests should be
explored to evaluate the robustness of these findings."
Review
Multiple regression allows researchers to model a continuous dependent variable
as a linear function of two or more independent variables. This example focused
specifically on the situation where one (or more) of those independent variables
is categorical and how to use dummy variables in response. This boils down to
testing the difference between the means of the dependent variable between the
two groups designated by the dummy variable in question while controlling for
the effects of the other independent variables in the model. Coefficients for a
multiple regression model are typically estimated via OLS. Rejecting or failing to
reject the null hypothesis that a given partial slope coefficient equals zero tells
us whether the dependent variable is a linear function of the independent
variable(s) in question. However, it does not say anything about whether there is
some other form of association between the dependent variable and any of the
independent variables. Two-way scatter plots comparing the dependent variable
to each independent variable can be useful for exploring more complicated
relationships, but only partially so because they only permit exploration of one
independent variable at a time.
You should know:
• What types of variables are suitable for multiple regression with dummy
variables.
• The basic assumptions underlying OLS regression.
• How to estimate and interpret a multiple regression model that includes
dummy variables.
• How to report the results of a multiple regression with dummy variables
model.
Your Turn
You can download this sample dataset along with a guide showing how to
estimate a multiple regression with dummy variables model using statistical
software. See if you can replicate the analysis presented here. Next, try estimating
the model separately for men and women.