Unit 11: Regression modeling in practice

© Judith D. Singer and Lindsay C. Page, Harvard Graduate School of Education


The S-030 roadmap: Where’s this unit in the big picture?

Unit 1: Introduction to simple linear regression
Unit 2: Correlation and causality
Unit 3: Inference for the regression model
Unit 4: Regression assumptions: Evaluating their tenability
Unit 5: Transformations to achieve linearity
Unit 6: The basics of multiple regression
Unit 7: Statistical control in depth: Correlation and collinearity
Unit 8: Categorical predictors I: Dichotomies
Unit 9: Categorical predictors II: Polychotomies
Unit 10: Interaction and quadratic effects
Unit 11: Regression modeling in practice

Stages of the course: Building a solid foundation; Mastering the subtleties; Adding additional predictors; Generalizing to other types of predictors and effects; Pulling it all together


In this unit, we’re going to learn about…

• Distinguishing between question predictors, covariates, and rival hypothesis predictors

• Mapping your research questions onto an analytic strategy

• What kinds of paths and feedback loops do you need?

• Alternative analytic approaches: which are sound, which are unwise?

• Which kinds of rival explanations can you examine and rule out?

• What caveats and limitations still remain?

• Constructing informative tables and figures

• Writing up your results


Automated model building strategies (& why you don’t want to use them)

Automated model building strategies

1. All possible subsets: all 2^k − 1 regression models that can be formed from k predictors

2. Forward selection: start with no predictors and sequentially add them so that each maximally increases the R² statistic at that step

3. Backwards elimination: start with all predictors and sequentially drop them so that each minimally decreases the R² statistic at that step

4. Stepwise regression (forward selection with backwards glances)

“All models are wrong, but some are useful” --George E.P. Box (1979)

“Far better an approximate answer to the right question…than an exact answer to the wrong question” --John W. Tukey (1962)

“The hallmark of good science is that it uses models and ‘theory’ but never believes them” --attributed to Martin Wilk in Tukey (1962)

Occam’s razor: entia non sunt multiplicanda praeter necessitatem. If two competing theories lead to the same predictions, the simpler one is better. --William of Occam (14th century)


Introducing the case study: Exploring public knowledge of current affairs

“Since the late 1980s, the emergence of 24-hour cable news as a dominant news source and the explosive growth of the internet have led to major changes in the American public’s news habits. But a new nationwide survey finds that the coaxial and digital revolutions and attendant changes in news audience behaviors have had little impact on how much Americans know about national and international affairs.

On average, today’s citizens are about as able to name their leaders, and are about as aware of major news events, as was the public nearly 20 years ago. The new survey includes questions that are either identical or roughly comparable to questions asked in the late 1980s and early 1990s. In 2007, somewhat fewer were able to name their governor, the vice president, and the president of Russia, but more respondents than in the earlier era gave correct answers to questions pertaining to national politics.”

Sample news knowledge items:

What job does Hillary Clinton currently hold?
A. Senator from New York
B. Secretary of State
C. Secretary of Health and Human Services
D. Ambassador to the United Nations

Thinking about the military effort in Afghanistan - do you happen to know if Barack Obama has decided to increase, decrease, or not substantially change the number of U.S. troops stationed in Afghanistan?
A. Increase
B. Decrease
C. Not substantially change

As far as you know, which foreign country holds the most U.S. government debt?
A. Japan
B. China
C. Canada
D. Saudi Arabia

Do you happen to know which political party has a majority in the U.S. House of Representatives?
A. Republicans
B. Democrats

What job does Timothy Geithner currently hold?
A. President Obama's Press Secretary
B. The CEO of CitiGroup
C. The Treasury Secretary
D. The House Majority Leader

Source: What Americans Know: 1989-2007 – Public Knowledge of Current Affairs Little Changed by News and Information Revolutions. Pew Research Center for the People and the Press, April 15, 2007.


How the study was conducted: Telephone interviews of national sample

Nationwide sample of 1,502 adults, 18 years of age or older, surveyed between February 1 and February 13 (2007).

Types of information collected:

(1) News knowledge items

(2) Demographics (Age, Gender)

(3) Level of education

(4) Political engagement

(5) Political ideology and affiliation

(6) Level of news exposure

(7) Sources of news

The danger and the opportunity of secondary data analysis:

This dataset contains far more information than we will be able to make sense of in a single analysis. It’s important to approach existing datasets with a specific question or set of questions, rather than going fishing for significant relationships!


Research question: Would we be better off as a “Colbert Nation?”

“The results of the new Pew Survey on News Consumption suggest that viewers of the "fake news" programs The Daily Show and The Colbert Report are more knowledgeable about current events than watchers of "real" cable news shows … as well as average consumers of NBC, ABC, Fox News, CNN, C-SPAN and daily newspapers.”

--Greg Mitchell, The Huffington Post

RQ: While there are significant differences in news knowledge between those who do and do not watch The Daily Show or The Colbert Report, these differences may be explained away by differences in background characteristics, levels of political engagement or other news consumption patterns between those who do and do not watch comedy news.

Hypothesis 1: Viewers of comedy news tend to be more highly educated, and this higher level of education serves to explain the difference in news knowledge.

Hypothesis 2: Viewers of comedy news consume more news from a variety of sources and are more politically engaged, in general. Thus, comedy news, per se, is not the source of their greater news knowledge.


A first look at the data

ID  Knowledge  ComedyNews  Age   Male  Educ  TotalNews  Engagement  Turnout
 1     50          0       44.4   0     14      57         79.5       55.0
 2     32          1       67.9   0      9      54         69.2       55.0
 3     45          0       45.7   0     12      43         61.0       46.4
 4     26          0       39.3   0     15      16         66.8       52.2
 5     79          0       59.5   1     15      48         80.8       52.2
 6     31          0       43.8   0     12      59         99.6       56.4
 7     22          0       29.6   1     10      57         37.4       56.0
 8     93          0       29.6   1     16      44         88.6       56.0
 9     23          0       45.1   0     12       6         82.4       57.0
10     41          0       18.7   1     11      10         87.3       42.2
11     23          0       47.5   0      9      37         67.0       52.2
12     35          0       80.7   0     12      23         68.5       52.2
13     44          0       18.3   1     10      28         69.4       49.0
14     97          0       79.1   1     17      50         78.2       58.6
15      5          0       91.3   0     11       5         84.5       58.6
16     54          0       32.5   0     13      75         97.8       58.6
17     79          0       37.7   0     15      60         83.8       58.6
18     58          0       30.8   0     16      60         78.3       58.6
19     96          0       49.0   0     15      51         92.4       57.8
20     93          0       68.0   1     14      85         66.1       48.8
21     80          1       58.9   1     15      59         54.9       48.8
22     22          0       50.0   0     12      16         75.2       53.0
23     79          1       51.9   1     16      58         83.4       53.0

• Knowledge: News knowledge scores cover the entire possible range (0 to 100)

• ComedyNews: 16 percent of those sampled watch comedy news

• Age: Ranges from 18 (minimum age to participate in survey) to 95 years of age

• Educ: Highest grade completed. The average person in the sample has almost 14 years of education

• TotalNews: Measure of exposure to news from all sources. Ranges from a low of 0 (no news consumption) to a high of 95 (out of a possible 100)

• Engagement: Index of political engagement based on several questions related to involvement with political issues and activities

• Turnout: Voter turnout in respondent’s county in the 2004 presidential race (ranges from 26 to 80 percent)

Among the other variables that we are not including: Ideology, Political Party, Blue State/Red State


How should you proceed when you have so many predictors?

What have you done for HW assignments and in class?

1. Describe the distributions of the outcome and predictors

2. Examine scatterplots of the outcome vs. each predictor, transforming as necessary (with supplemental residual plots to guide transformation)

3. Examine estimated correlation matrix to see what it foreshadows for model building

4. If there is a clear predictor for which to control statistically, examine the estimated partial correlation matrix to further foreshadow model building

5. Thoughtfully fit a series of MR models

6. Examine the series to select a “final” model that you believe best summarizes your findings

But with > 3 or 4 predictors, model building (step 5) becomes unwieldy…

Advice: Before doing any analysis, place your predictors into up to four conceptual groups based on a combination of substance/theory and their role in your statistical analyses

Question predictor(s)

Key control predictor(s)

Additional control predictor(s)

Rival hypothesis predictor(s)

Challenge: 7 predictors = 2^7 − 1 = 127 possible models (+ interactions!)

Predictors to classify: Comedy News, Education, Age, Gender, Total news consumption, Political engagement
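To make the size of the "all possible subsets" problem concrete, the count can be checked by brute force. This is only a sketch: the seven names below are taken from the column headings of the data listing, and treating exactly these seven as the candidate main effects is an assumption made for illustration.

```python
from itertools import combinations

# Assumed candidate predictors (names from the data listing's columns).
predictors = ["ComedyNews", "Age", "Male", "Educ",
              "TotalNews", "Engagement", "Turnout"]


def all_subsets(names):
    """Return every non-empty subset of names, smallest subsets first."""
    subsets = []
    for size in range(1, len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


models = all_subsets(predictors)
print(len(models))                # number of main-effects models to fit
print(2 ** len(predictors) - 1)   # closed-form count, 2^k - 1
```

Both lines print the same count. Adding quadratic terms or interactions enlarges the candidate set further, which is one reason the conceptual grouping of predictors is more practical than exhaustive search.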


Developing a taxonomy of fitted models (using the predictor classifications)

Question predictors: Comedy news

Key control predictors: Education

Additional control predictors: Age, Gender

Rival hypothesis predictors: Total news consumption, Political engagement

Strategy 1: Question predictors first

1. Start with your question predictors: after all, those are the variables in which you’re most interested

2. Add key control predictors assessing whether the effects change—probably keep the key control predictors in the model regardless

3. Add additional control predictors, keeping them in the model only as necessary

4. Check rival hypothesis predictors to see whether the effects of the question predictors remain

Strategy 2: Build a control model first

1. Start with the key control predictors: after all, you’re pretty confident they have a major effect that you need to remove

2. Add additional control predictors, keeping them in the model only as necessary

3. Add the question predictors seeing whether they have an effect over and above the control predictors

4. Check rival hypothesis predictors to see whether the effects of the question predictors remain

Strategy 1 is often the approach of choice because it focuses attention on the question predictors.

Strategy 2 is preferable when the effects of the control predictors are so well established that, beyond a first “peek,” it’s difficult to think about examining the question predictors uncontrolled.

Or some combination.

Don’t forget there’s a difference between how you do the analysis and how you report the results.


Let’s begin by examining the outcome: knowledge of the news

To compare knowledge levels between demographic groups, the sample was divided into roughly equal thirds on the basis of how many of the questions they answered correctly. About 35% were classified as the “High” knowledge group. About 31% were classified as having “Medium” levels of knowledge. [The remainder] were assigned to the “Low” knowledge group.

--Pew Research Center

Variable: Knowledge    Mean 57.09    Std Deviation 22.93

       Histogram                                     #   Boxplot
102.5+ ***                                           8      |
     . **************                               41      |
 92.5+ **********************                       66      |
     . ***************************                  80      |
 82.5+ ***********************************         104      |
     . ***************************************     117   +-----+
 72.5+ ***********************************         104   |     |
     . *******************************              93   |     |
 62.5+ *****************************                86   |     |
     . ************************************        108   *--+--*
 52.5+ ********************************             94   |     |
     . ******************************************  124   |     |
 42.5+ ***************************************     115   +-----+
     . ***********************************         104      |
 32.5+ *********************                        61      |
     . ******************                           54      |
 22.5+ *******************                          55      |
     . ************                                 36      |
 12.5+ ***********                                  32      |
     . *****                                        15      |
  2.5+ **                                            5      |
       ----+----+----+----+----+----+----+----+--

Plots vs. question and key control predictors

[Plots of Knowledge against ComedyNews (r=0.18***), Education (r=0.45***) and Age (r=0.22***)]

Observations

• Distribution of Knowledge is symmetric and mound-shaped

• Variation in Knowledge is similar for those who do (sd=21.3) and do not (sd=22.7) watch comedy news

• Relationship between Knowledge and Education appears linear and moderate in strength (r=0.45, p<0.0001)

• Relationship between Knowledge and Age is potentially non-linear

Decision: explore functional form of age


What functional form should we use for Age?

Distribution of Age:

       Histogram                                   #   Boxplot
 97.5+ *                                           1      |
     . ***                                         9      |
     . *******                                    26      |
     . **************                             54      |
     . ******************                         71      |
     . **************************                102      |
     . **************************                103      |
     . ******************************            120   +-----+
 57.5+ *********************************         130   |     |
     . ***************************************   154   *--+--*
     . ****************************************  160   |     |
     . ***********************************       138   |     |
     . ********************************          126   +-----+
     . ************************                   94      |
     . *******************                        76      |
     . ******************                         71      |
 17.5+ *****************                          67      |
       ----+----+----+----+----+----+----+----+

Linear Model: R² = 0.0469

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       42.96843            1.74098        24.68     <.0001
Age          1        0.27710            0.03224         8.60     <.0001

Quadratic Model: R² = 0.0874

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       12.66833            4.08709         3.10     0.0020
Age          1        1.61078            0.16653         9.67     <.0001
Age2         1       -0.01291            0.00158        -8.16     <.0001

“Dramatic differences emerge when the results are broken down by age. Young people know the least: Only 15% percent of 18-29 year-olds are among the most informed third of the public, compared with 43% of those ages 65 and older. But it is not these oldest respondents who know the most. Instead, it is people aged 50-64 who are slightly more likely to finish among the third of the sample who know the most…This difference likely is caused by the very different life circumstances of the two oldest age groups. Many of those 65 and older are retired from work, and health problems as well as lifestyle changes can disproportionately work to diminish the interest or ability of some in this generation to keep up with the news.” -– Pew Research Center
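One way to connect the quadratic model to the Pew description is to locate the age at which the fitted curve peaks. The sketch below uses only the three estimates reported for the quadratic model; the function name and the ages chosen for printing are illustrative.

```python
# Coefficients copied from the quadratic model output above.
b0, b_age, b_age2 = 12.66833, 1.61078, -0.01291


def fitted_knowledge(age):
    """Fitted news knowledge from the quadratic-in-age model."""
    return b0 + b_age * age + b_age2 * age ** 2


# A quadratic b0 + b1*x + b2*x^2 with b2 < 0 peaks at x = -b1 / (2*b2).
peak_age = -b_age / (2 * b_age2)

print(round(peak_age, 1))
for age in (20, 40, 60, 80):
    print(age, round(fitted_knowledge(age), 1))
```

The peak falls in the early sixties, which is broadly in line with Pew's observation that people aged 50-64 are the most likely to be in the top knowledge group. This is a description of the fitted curve, not a test of where the true maximum lies.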


Where to from here?

Next Steps: Which model building strategy makes the most sense given that…

• The effect of the question predictor (ComedyNews) is statistically significant but relatively modest (r=0.18, p<0.0001).

• The effect of one key control predictor (Education) is moderate (r=0.45, p<0.0001).

• The effect of a second control predictor (Age) is non-linear. Together, Age and Age2 explain 9 percent of the variation in news knowledge.

Decision:


Considering effects of control variables

Simple Correlations

Pearson Correlation Coefficients, N = 1502
Prob > |r| under H0: Rho=0

             Knowledge     Age        Education    Male
Knowledge     1.00000
Age           0.21666     1.00000
              <.0001
Education     0.44955     0.00646     1.00000
              <.0001      0.8023
Male          0.23878    -0.02234     0.03473     1.00000
              <.0001      0.3870      0.1785

• Individuals with more Education have higher levels of News Knowledge, on average.

• On average, men have a higher level of News Knowledge than women.

• No linear relationship between Education and Age. Actually, a modest non-linear (quadratic in age) relationship.

• Age and gender are not correlated, as we would expect. Neither are gender and Education.

Partial Correlations (partialling out Age and Age2)

Pearson Correlation Coefficients, N = 1502
Prob > |r| under H0: Rho=0

             Knowledge    Education    Male
Knowledge     1.00000
Education     0.44221     1.00000
              <.0001
Male          0.24853     0.03042     1.00000
              <.0001      0.2390

• Effect of Education persists even after partialling out Age and Age2.

• Effect of Male also remains in the partial correlations, as does the lack of relationship between gender and Education.

Education and gender remain important predictors of news knowledge, even after controlling for (or partialling out) the non-linear effect of age. Because gender and education are not correlated with each other but are correlated with news knowledge, they will both be important variables to include in our regression analysis. We will also want to consider interactions among these control variables.
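As a rough illustration of what "partialling out" does, a first-order partial correlation can be computed directly from the simple correlations. Note the assumption: this sketch controls only for the linear Age term, so it will not reproduce the partial correlations above, which remove both Age and Age2 (that requires the raw data).

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for a single z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simple correlations copied from the matrix above.
r_know_educ, r_know_age, r_educ_age = 0.44955, 0.21666, 0.00646
r_know_male, r_male_age = 0.23878, -0.02234

educ_given_age = partial_r(r_know_educ, r_know_age, r_educ_age)
male_given_age = partial_r(r_know_male, r_know_age, r_male_age)

print(round(educ_given_age, 3), round(male_given_age, 3))
```

Because Age is essentially uncorrelated with both Education and Male, removing its linear effect changes the correlations with Knowledge very little, which matches the qualitative message of the partial correlation matrix.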


Building the “control” model for the effects of news knowledge

“Final” control model: R² = 0.3231

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        8.45107           20.00031         0.42     0.6727
Age          1       -1.27406            0.79271        -1.61     0.1082
Age2         1        0.01565            0.00740         2.12     0.0345
Education    1        0.52522            1.49291         0.35     0.7250
Male         1       21.76688            5.81564         3.74     0.0002
Malexeduc    1       -0.80728            0.41017        -1.97     0.0492
Educxage     1        0.17842            0.05841         3.05     0.0023
Educxage2    1       -0.00176            0.00054006     -3.26     0.0011

Examining the effects of Age, Education and Gender and their interactions.

The results of the analysis showed that, as expected, age, education and gender accounted for a sizeable amount of the variance (32%) in news knowledge. Interestingly, we observe interactions between education and gender such that gender differences in news knowledge are larger among those with lower levels of education (after controlling for the linear and quadratic effects of age). In addition, we observe an interaction between education and the linear and quadratic terms for age. Our graphical representation of this result serves to illuminate the relationship uncovered by our model. Here, we observe that news knowledge differences by education are less pronounced among young adults and the elderly, whereas differences by level of education are comparatively greater among middle-aged adults.

[Figure: Fitted relationship between news knowledge and education and gender, controlling for age. News knowledge score plotted against Education (Years, 8 to 18), with separate lines for male and female.]

[Figure: Fitted relationship between news knowledge and education and age, controlling for gender. News knowledge score plotted against Age (Years, 15 to 95), with separate lines for < HS, HS grad, College grad and Masters.]


Results of fitting a taxonomy of multiple regression models predicting news knowledge among a random sample of 1502 adults in the US

Predictor          Model A               Model B               Model C

Intercept          55.303*** (0.633)     8.451 (20.000)        -1.707 (19.653)
Comedy News        11.441*** (1.605)                           10.376*** (1.328)
Age                                      -1.274 (0.793)        -0.902 (0.779)
Age2                                     0.016* (0.007)        0.013~ (0.007)
Education                                0.525 (1.493)         1.125 (1.466)
Male                                     21.767*** (5.816)     21.793*** (5.702)
Male*Education                           -0.807* (0.410)       -0.823* (0.402)
Education*Age                            0.178** (0.058)       0.152** (0.057)
Education*Age2                           -0.002** (0.001)      -0.002** (0.001)

R²                 0.0328                0.3231                0.3497
F                  50.83                 101.89                100.37
df                 (1, 1500)             (7, 1494)             (8, 1493)
p-value            <.0001                <.0001                <.0001

Cell entries are estimated regression coefficients (standard errors).
~ p<.10, *p<.05, **p<.01, ***p<.001

Examining the effects of the question predictor: Uncontrolled & controlled

Hypothesis 1: Viewers of comedy news tend to be more highly educated, and this higher level of education serves to explain the difference in news knowledge.

Before considering other controls, we estimate that comedy news viewers earn news knowledge scores that are an average of 11.4 points higher than those who do not watch comedy news (Model A). We hypothesized that this effect is, in part, due to differences in background characteristics by viewership; in particular, programs like The Daily Show attract a more highly educated audience, on average, and differences in educational background serve to explain much of the test score differential. While the relationship between education and viewership was borne out in the data, it was not as strong as we might have anticipated (r=0.056, p=0.0295). Rather, we observe that even after controlling for the effect of education and other key control variables, the main effect of comedy news viewership remains relatively stable (Model C).

A prescription for a better informed citizenry?

Maybe… but first we need to consider our rival hypotheses!
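The claim that the Comedy News effect "remains relatively stable" can be quantified from the taxonomy of Models A to C. The sketch below uses only values reported in that table; the incremental F statistic is the standard R²-change test for nested models, and the variable names are illustrative.

```python
# Values copied from the taxonomy table (Models A, B and C).
b_comedy_A, b_comedy_C, se_comedy_C = 11.441, 10.376, 1.328
r2_B, r2_C = 0.3231, 0.3497
n, k_C = 1502, 8          # sample size; number of predictors in Model C

# Share of the uncontrolled difference that disappears after adding controls.
attenuation = (b_comedy_A - b_comedy_C) / b_comedy_A

# Incremental F test for adding one predictor (Comedy News) to Model B.
df_resid_C = n - k_C - 1
f_change = ((r2_C - r2_B) / 1) / ((1 - r2_C) / df_resid_C)

# With a single added predictor, F should match the squared t statistic
# up to rounding in the reported values.
t_comedy_C = b_comedy_C / se_comedy_C

print(round(attenuation, 3), round(f_change, 1), round(t_comedy_C ** 2, 1))
```

Under these numbers the coefficient shrinks by roughly 9 percent when the control predictors are added, and the F and t² values agree to within rounding, which is a useful internal consistency check on a reported table.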


What about the rival hypothesis predictors?

Hypothesis 2: Viewers of comedy news consume more news from a variety of sources and are more politically engaged, in general. Thus, comedy news, per se, is not the source of their greater news knowledge.

[Scatterplots of Knowledge against the rival hypothesis predictors: r=0.33***, r=0.32***, r=0.19***]

• Political engagement is a moderately good predictor of average news knowledge (r=0.33, p<0.0001).

• Voter turnout in a respondent’s county has a weak linear relationship with news knowledge (r=0.19, p<0.0001).

• While the correlation coefficient between news knowledge and total news consumption suggests a moderate linear relationship (r=0.32), the scatterplot (and theory) indicate that a non-linear specification may be preferable.

Decision:


Considering a quadratic versus a log transformation for Total News Consumption

[Figure: Fitted News Knowledge Score (y-axis, 35 to 70) against Total News Consumption (x-axis, 0 to 100) under the quadratic and logarithm specifications.]

Reasons to prefer logarithm

• Theoretical: Log transformation corresponds to notion of diminishing marginal returns

• Practical: Parameter estimate from log transformation more readily interpretable

Similarities

• As shown above, the fitted lines are remarkably similar

• R² is similar for both: 0.1173 for log & 0.1217 for quadratic

Reason to prefer quadratic

• Practical: Including quadratic term in model allows us to formally test for a non-linear relationship.


Functional form of Total News Consumption: “Starting” a log transformation

Linear specification:

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1       43.91565            1.14560        38.33     <.0001
totalnews     1        0.29112            0.02209        13.18     <.0001

Variable       N     N Miss   Mean    Std Dev   Minimum   Maximum
totalnews     1502      0     45.24    25.37      0.00     95.00

Log transformation without a start:

L2totnews = log2(totalnews);

Variable       N     N Miss   Mean    Std Dev   Minimum   Maximum
L2totnews     1451     51      5.26     1.06      0.00      6.57

The 51 missing values arise because log(0) is undefined.

If some observations take on the value of zero on the raw scale, we must add a small adjustment to the variable prior to log transformation. Otherwise, we will have missing observations on the transformed scale. We call this adjustment “starting” a log transformation, which we do by adding an incremental value to all observations for that variable.

“Starting” a variable:

L2totalnews = log2(totalnews+1);

Variable       N     N Miss   Mean    Std Dev   Minimum   Maximum
L2totalnews   1502      0      5.13     1.37      0.00      6.58

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1       28.06170            2.15311        13.03     <.0001
L2totalnews    1        5.65306            0.40511        13.95     <.0001
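The effect of "starting" the variable, and the reading of a base-2 log coefficient as the change per doubling, can be sketched in a few lines. The helper names are illustrative; the intercept and slope are the estimates from the regression on L2totalnews above.

```python
import math


def started_log2(x, start=1.0):
    """log2 of (x + start); the start keeps x = 0 on the transformed scale."""
    return math.log2(x + start)


# Estimates copied from the regression of Knowledge on L2totalnews.
intercept = 28.06170
slope_per_doubling = 5.65306


def fitted_knowledge(totalnews):
    """Fitted news knowledge at a given raw total news consumption score."""
    return intercept + slope_per_doubling * started_log2(totalnews)


print(started_log2(0))              # 0.0 rather than a missing value
print(round(started_log2(95), 2))   # 6.58, the maximum in the summary table

# Doubling (totalnews + 1), e.g. from 20 to 40, raises the fitted value
# by exactly the slope.
print(round(fitted_knowledge(39) - fitted_knowledge(19), 3))
```

Because the start is added before taking logs, the "per doubling" interpretation applies to totalnews + 1 rather than to totalnews itself; the distinction matters most at very low consumption scores.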


What about the rival hypothesis predictors? News consumption and political engagement

R² = 0.4325

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1      -27.57830           18.55229        -1.49     0.1374
ComedyNews     1        7.24408            1.27068         5.70     <.0001
Age            1       -1.19026            0.72903        -1.63     0.1028
Age2           1        0.01409            0.00680         2.07     0.0384
Education      1        0.16643            1.37534         0.12     0.9037
Male           1       22.35841            5.34508         4.18     <.0001
MalexEduc      1       -0.84624            0.37706        -2.24     0.0250
EducxAge       1        0.16075            0.05367         2.99     0.0028
EducxAge2      1       -0.00158            0.00049604     -3.18     0.0015
Turnout        1        0.17671            0.05167         3.42     0.0006
Engagement     1        0.23259            0.02846         8.17     <.0001
L2totalnews    1        3.57360            0.34307        10.42     <.0001

• While attenuated somewhat, the effect of our question predictor remains upon the addition of our rival hypotheses.

• Both measures of political engagement are statistically significant. Controlling for all other variables in the model, individuals with high levels of political engagement and individuals living in areas with higher voter turnout have higher levels of news knowledge, on average.

• As we would expect, individuals who consume more news (from a greater variety of sources) have higher average levels of news knowledge.

What about other interactions with rival hypotheses?

It’s plausible that the effect of gender would interact with political engagement or with news consumption such that gender differences are smaller among men and women who are more politically engaged or who consume more news. These additional interactions were tested with the model above but were found not to be significant.


Final check and our “final” model

R² = 0.4353

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1      -30.32633           18.66522        -1.62     0.1044
ComedyNews     1       20.80208            8.36105         2.49     0.0130
Age            1       -1.08433            0.73263        -1.48     0.1391
Age2           1        0.01318            0.00682         1.93     0.0535
Education      1        0.22650            1.37494         0.16     0.8692
Male           1       21.89698            5.34360         4.10     <.0001
MalexEduc      1       -0.81002            0.37701        -2.15     0.0318
EducxAge       1        0.15969            0.05364         2.98     0.0030
EducxAge2      1       -0.00158            0.00049573     -3.19     0.0014
Turnout        1        0.17411            0.05164         3.37     0.0008
Engagement     1        0.23437            0.02845         8.24     <.0001
L2totalnews    1        3.58320            0.34275        10.45     <.0001
ComxAge        1       -0.70171            0.35886        -1.96     0.0507
ComxAge2       1        0.00772            0.00360         2.14     0.0323

Final check:

We tested for statistical interactions between our question predictor, ComedyNews, and all other predictors in the model. We found that after including in the model our rival hypothesis variables (in addition to all other control predictors), the Comedy News indicator interacted significantly with Age.
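Because ComedyNews now appears in three terms, its estimated effect is no longer a single number. The sketch below evaluates the viewer versus non-viewer difference implied by the "final" model at a few ages; only the three reported estimates involving ComedyNews are used, and the ages are chosen for illustration.

```python
# Coefficients copied from the "final" model output above.
b_comedy, b_com_age, b_com_age2 = 20.80208, -0.70171, 0.00772


def comedy_effect(age):
    """Estimated viewer vs. non-viewer difference in fitted knowledge at a given age."""
    return b_comedy + b_com_age * age + b_com_age2 * age ** 2


for age in (20, 40, 60, 80):
    print(age, round(comedy_effect(age), 1))

# Age at which the estimated difference is smallest (vertex of the parabola).
age_of_min_gap = -b_com_age / (2 * b_com_age2)
print(round(age_of_min_gap, 1))
```

The estimated gap is smallest in the mid-forties and grows toward both ends of the age range. These are point estimates only; the interaction terms have sizable standard errors, so the shape should be read cautiously.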

Next Steps:

1. Residual Plots – does it appear as though the assumptions underlying our model are reasonable?

2. Prototypical plots – which effects do we want to highlight through graphical display?

What predictors should we include in our “final” model???


Examining residuals from “final” model

       Histogram                                         #   Boxplot
 3.75+ *                                                 1      0
     .
     . *                                                 6      0
     . ***                                              14      |
     . *************                                    76      |
     . *************************                       150      |
     . *****************************************       246   +-----+
 0.25+ ********************************************    263   *-----*
     . **********************************************  276   |  +  |
     . ***************************************         233   +-----+
     . **********************                          129      |
     . ************                                     72      |
     . ****                                             24      |
     . **                                               10      0
-3.25+ *                                                 2      0
       ----+----+----+----+----+----+----+----+----+-
       * may represent up to 6 counts

n = 57 (3.8%)

Reasonably symmetric
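The "n = 57 (3.8%)" annotation appears to count residuals lying more than two units from zero; that reading is an assumption, but it can be checked against the histogram counts. The sketch below tallies the bins and compares the share with what a normal distribution would give.

```python
import math

# Bin midpoints and counts read off the residual histogram above.
residual_bins = {
    3.75: 1, 3.25: 0, 2.75: 6, 2.25: 14, 1.75: 76, 1.25: 150, 0.75: 246,
    0.25: 263, -0.25: 276, -0.75: 233, -1.25: 129, -1.75: 72,
    -2.25: 24, -2.75: 10, -3.25: 2,
}

total = sum(residual_bins.values())

# Bins are 0.5 wide, so a midpoint beyond +/-2 means the whole bin lies outside +/-2.
outside = sum(count for mid, count in residual_bins.items() if abs(mid) > 2)
share = outside / total

# Share of a standard normal distribution beyond +/-2, for comparison.
expected_share = math.erfc(2 / math.sqrt(2))

print(total, outside, round(share, 3), round(expected_share, 3))
```

The tally reproduces the annotated count, and the observed share is a little below the roughly 4.6 percent expected under normality, which is consistent with the "reasonably symmetric" reading and gives no sign of heavy tails.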


Contemplating a graph that displays the findings

\[
\begin{aligned}
\widehat{\text{Knowledge}} = {} & -30.326 + 20.802\,\text{ComedyNews} - 1.084\,\text{Age} + 0.013\,\text{Age}^2 + 0.227\,\text{Education} \\
& + 21.897\,\text{Male} + 0.174\,\text{Turnout} + 0.234\,\text{Engage} + 3.583\,\text{L2TotalNews} \\
& - 0.702\,\text{Comedy}\times\text{Age} + 0.008\,\text{Comedy}\times\text{Age}^2 + 0.160\,\text{Educ}\times\text{Age} \\
& - 0.002\,\text{Educ}\times\text{Age}^2 - 0.81\,\text{Male}\times\text{Educ}
\end{aligned}
\]

• Want to show the effect of ComedyNews: it’s my question predictor, but it is dichotomous

• ComedyNews interacts with Age and Age2: put Age on the x-axis

• Age and Age2 also interact with Education: construct fitted lines for different levels of Education

• Less interested in effects of gender, political engagement and total news consumption: set these at their means

[Scatterplot of news knowledge by age for those who watch comedy news]

[Scatterplot of news knowledge by age for those who do not watch comedy news]


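Computing fitted values to create prototypical plots

Step 1: Set Male, Turnout, Engage, and L2TotalNews to their means to control for their effects in plots

Variable       Mean
Male            0.49
Turnout        56.40
Engagement     73.41
L2totalnews     5.09

\[
\begin{aligned}
\widehat{\text{Knowledge}} = {} & -30.326 + 20.802\,\text{ComedyNews} - 1.084\,\text{Age} + 0.013\,\text{Age}^2 + 0.227\,\text{Education} \\
& + 21.897(0.49) + 0.174(56.40) + 0.234(73.41) + 3.583(5.09) \\
& - 0.702\,\text{Comedy}\times\text{Age} + 0.008\,\text{Comedy}\times\text{Age}^2 + 0.160\,\text{Educ}\times\text{Age} \\
& - 0.002\,\text{Educ}\times\text{Age}^2 - 0.81(0.49)\,\text{Educ} \\
= {} & 25.633 + 20.802\,\text{ComedyNews} - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170\,\text{Education} \\
& - 0.702\,\text{Comedy}\times\text{Age} + 0.008\,\text{Comedy}\times\text{Age}^2 + 0.160\,\text{Educ}\times\text{Age} - 0.002\,\text{Educ}\times\text{Age}^2
\end{aligned}
\]

Computing fitted values to create prototypical plots, II

Step 2: Create separate equations according to the ComedyNews indicator

For those who watch comedy news (ComedyNews = 1):

\[
\begin{aligned}
\widehat{\text{Knowledge}} = {} & 25.633 + 20.802(1) - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170\,\text{Education} \\
& - 0.702(1)\,\text{Age} + 0.008(1)\,\text{Age}^2 + 0.160\,\text{Educ}\times\text{Age} - 0.002\,\text{Educ}\times\text{Age}^2 \\
= {} & 46.435 - 1.786\,\text{Age} + 0.021\,\text{Age}^2 - 0.170\,\text{Education} + 0.160\,\text{Educ}\times\text{Age} - 0.002\,\text{Educ}\times\text{Age}^2
\end{aligned}
\]

For those who do not watch comedy news (ComedyNews = 0):

\[
\begin{aligned}
\widehat{\text{Knowledge}} = {} & 25.633 + 20.802(0) - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170\,\text{Education} \\
& - 0.702(0)\,\text{Age} + 0.008(0)\,\text{Age}^2 + 0.160\,\text{Educ}\times\text{Age} - 0.002\,\text{Educ}\times\text{Age}^2 \\
= {} & 25.633 - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170\,\text{Education} + 0.160\,\text{Educ}\times\text{Age} - 0.002\,\text{Educ}\times\text{Age}^2
\end{aligned}
\]

Computing fitted values to create prototypical plots, III

Step 3: Select prototypical values of education

12 Years = High School Graduate

16 Years = College Graduate

18 Years = Master’s degree

For those who watch comedy news:

\[
\begin{aligned}
\text{HS Grad:}\quad \widehat{\text{Knowledge}} &= 46.435 - 1.786\,\text{Age} + 0.021\,\text{Age}^2 - 0.170(12) + 0.160(12)\,\text{Age} - 0.002(12)\,\text{Age}^2 \\
&= 44.395 + 0.134\,\text{Age} - 0.003\,\text{Age}^2 \\
\text{College:}\quad \widehat{\text{Knowledge}} &= 46.435 - 1.786\,\text{Age} + 0.021\,\text{Age}^2 - 0.170(16) + 0.160(16)\,\text{Age} - 0.002(16)\,\text{Age}^2 \\
&= 43.715 + 0.774\,\text{Age} - 0.011\,\text{Age}^2 \\
\text{Masters:}\quad \widehat{\text{Knowledge}} &= 46.435 - 1.786\,\text{Age} + 0.021\,\text{Age}^2 - 0.170(18) + 0.160(18)\,\text{Age} - 0.002(18)\,\text{Age}^2 \\
&= 43.375 + 1.094\,\text{Age} - 0.015\,\text{Age}^2
\end{aligned}
\]

For those who do not watch comedy news:

\[
\begin{aligned}
\text{HS Grad:}\quad \widehat{\text{Knowledge}} &= 25.633 - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170(12) + 0.160(12)\,\text{Age} - 0.002(12)\,\text{Age}^2 \\
&= 23.593 + 0.836\,\text{Age} - 0.011\,\text{Age}^2 \\
\text{College:}\quad \widehat{\text{Knowledge}} &= 25.633 - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170(16) + 0.160(16)\,\text{Age} - 0.002(16)\,\text{Age}^2 \\
&= 22.913 + 1.476\,\text{Age} - 0.019\,\text{Age}^2 \\
\text{Masters:}\quad \widehat{\text{Knowledge}} &= 25.633 - 1.084\,\text{Age} + 0.013\,\text{Age}^2 - 0.170(18) + 0.160(18)\,\text{Age} - 0.002(18)\,\text{Age}^2 \\
&= 22.573 + 1.796\,\text{Age} - 0.023\,\text{Age}^2
\end{aligned}
\]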
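The three steps above can also be carried out programmatically rather than by hand. The sketch below is one possible approach, not the authors' procedure: it uses the unrounded estimates from the "final" model output and the reported sample means, and the function and dictionary names are illustrative. Because the hand-computed equations use coefficients rounded to three decimals (notably the Age2 interaction terms), curves from this function will differ somewhat from those equations at older ages.

```python
# Unrounded estimates copied from the "final" model output.
COEF = {
    "Intercept": -30.32633, "ComedyNews": 20.80208, "Age": -1.08433,
    "Age2": 0.01318, "Education": 0.22650, "Male": 21.89698,
    "MalexEduc": -0.81002, "EducxAge": 0.15969, "EducxAge2": -0.00158,
    "Turnout": 0.17411, "Engagement": 0.23437, "L2totalnews": 3.58320,
    "ComxAge": -0.70171, "ComxAge2": 0.00772,
}

# Sample means used to hold the remaining predictors constant (Step 1).
MEANS = {"Male": 0.49, "Turnout": 56.40, "Engagement": 73.41, "L2totalnews": 5.09}


def fitted_knowledge(age, education, comedy):
    """Fitted knowledge for a prototypical person, controls set to their means."""
    c, m = COEF, MEANS
    return (
        c["Intercept"] + c["ComedyNews"] * comedy
        + c["Age"] * age + c["Age2"] * age ** 2
        + c["Education"] * education
        + c["Male"] * m["Male"] + c["Turnout"] * m["Turnout"]
        + c["Engagement"] * m["Engagement"] + c["L2totalnews"] * m["L2totalnews"]
        + c["ComxAge"] * comedy * age + c["ComxAge2"] * comedy * age ** 2
        + c["EducxAge"] * education * age + c["EducxAge2"] * education * age ** 2
        + c["MalexEduc"] * m["Male"] * education
    )


# One curve per combination of education level and viewership (Steps 2 and 3).
for education in (12, 16, 18):
    for comedy in (1, 0):
        curve = [round(fitted_knowledge(age, education, comedy), 1)
                 for age in range(20, 91, 10)]
        print(education, comedy, curve)
```

Keeping the full model in one function makes it easy to change the prototypical values (for example, plotting men and women separately instead of holding Male at its mean) without re-deriving each reduced equation.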


Creating prototypical plots

[Plots: fitted News Knowledge Score (0 to 100) against Age (Years, 0 to 100), with separate lines for Masters, College grad and HS grad. One panel is labeled “For those who watch comedy news.”]


Creating prototypical plots, II

Figure 1. Panel of plots illustrating the fitted relationship between news knowledge and age by comedy news viewership status and level of education, controlling for gender, measures of political engagement and total news consumption (at their means) for Adult Americans (n=1502)

[Three panels, one each for HS Degree (Education = 12), College Degree (Education = 16) and Master’s Degree (Education = 18). Each plots News Knowledge Score (0 to 100) against Age (Years), with separate fitted lines for comedy news viewers and non-viewers.]

Summarizing our results

After controlling for education, gender, political engagement and total news consumption, there remains a significant, positive relationship between comedy news viewership and news knowledge, such that those who watch comedy news exhibit higher levels of news knowledge, on average. As Figure 1 illustrates, the effect of comedy news viewership on news knowledge is dependent on age. Across levels of education, we observe a general trend whereby the average difference between comedy news viewers and non-viewers is larger among older Americans. Whereas older Americans who do not watch comedy news have lower levels of news knowledge than their middle-aged counterparts, older Americans who watch comedy news perform as well as (if not better than) their middle-aged counterparts.

Does this mean that comedy news is a prescription for a better-informed citizenry? Maybe. While primarily intended as comedy entertainment, these shows do provide viewers with legitimate news and information, often very memorably! A causal effect is therefore plausible, especially given that the effect largely remains after considering our control and rival hypothesis variables. However, due to the observational nature of this analysis, we are not able to make the causal claim that watching comedy news leads to better news knowledge. Rather, comedy news viewership may relate to other unobserved characteristics that also relate to news knowledge, such that our observed effect of comedy news viewership may be the result of these unobserved characteristics.


How might I organize and present my results?

Four sets of evidence in a typical research presentation

1. Descriptive statistics: a table summarizing distributions (often by interesting subgroups)

2. Correlation matrix summarizing relationships among variables (sometimes with partials as well)

3. Selected regression results documenting key findings from the analysis (not every model you fit)

4. Prototypical plots summarizing the major findings (probably the plots we just constructed)

Don’t forget to distinguish between how you do the analysis and how you report the results

Helpful hints about presenting results

1. Decide on your key points: Your text, tables and displays (appropriately titled and organized) should support that argument

2. Think about your reader, not yourself: take the reader’s perspective and supply evidence that helps him/her evaluate your argument

3. Try out alternative displays and text: your first attempt is rarely your best

4. Writing up your results usually helps solidify—and often modify—your major argument, tables and graphs. Learn from writing; re-writing is essential

• Be very careful about causal language

• Make clear that the effects that you report estimate what happens “on average”

• Always specify what’s controlled (and implicitly what’s not) when making particular statements

• Be sure to illustrate the magnitude of effects by appropriately interpreting parameter estimates

A well-organized paper will likely include at least three sections:

(1) an introduction that presents an overview of your argument and the research question(s) to be examined

(2) an analysis section that describes what was done and why, as well as the results of these analyses

(3) a summary and conclusion section that discusses the interpretation, implications and limitations of your analyses


Table 1. Estimated means and sd’s by comedy news viewership

(with t-statistics testing for differences in means by viewership)

Table 1. Estimated means and standard deviations of news knowledge and predictors, by comedy news viewership status (with t-test for difference in means)

Variable                 Comedy News Non-Viewers   Comedy News Viewers   t
                         (n=1268)                  (n=234)
News Knowledge           55.30 (22.77)             66.74 (21.33)         -7.13***
Age (Years)              51.51 (17.98)             47.87 (17.33)         2.86**
Education (Years)        13.89 (2.41)              14.26 (2.37)          -2.18*
Male                     0.48 (0.50)               0.53 (0.50)           -1.16
County voter turnout     56.3 (8.83)               56.94 (8.01)          -1.03
Engagement               73.29 (16.71)             74.04 (17.38)         -0.63
Total news consumption   42.31 (24.80)             61.10 (22.44)         -10.81***

Cell entries are sample means and (standard deviations). *p<0.05; **p<0.01; ***p<0.001

Estimated mean news knowledge is 55.3 for non-viewers and 66.74 for viewers, a raw difference of about 11 points (half a standard deviation). The difference is statistically significant (t=-7.13, p<0.0001). Comedy news viewers are almost 4 years younger than non-viewers, on average (t=2.86, p=0.0043). Nevertheless, viewers have significantly more education than non-viewers (t=-2.18, p=0.0295).

In our sample, 53 percent of comedy news viewers and 48 percent of non-viewers are male. The proportion of males did not differ significantly by viewership status.

Measures of political engagement did not differ significantly by viewership status. On average, individuals in both groups lived in counties where approximately 56 percent of people voted in the 2004 presidential election and had similar levels of individual political engagement.

Comedy news viewers consume more news from a greater variety of sources than non-viewers (t=-11.38, p<0.0001).

Approximately 16 percent of the individuals in our sample watch comedy news.
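The t-statistics in Table 1 can be checked from the summary statistics alone. The sketch below assumes an equal-variance (pooled) two-sample t-test; the source does not say which variant was used, but under that assumption the News Knowledge row comes out very close to the reported -7.13. The function name is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t-statistic (equal-variance, pooled) from summary statistics."""
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    # Standard error of the difference in means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff


# News Knowledge row of Table 1: non-viewers first, viewers second.
t_knowledge = pooled_t(55.30, 22.77, 1268, 66.74, 21.33, 234)

if __name__ == "__main__":
    print(f"t = {t_knowledge:.2f}")
```

The sign is negative because the non-viewer mean is entered first, matching the table's convention.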


Table 2. Correlation matrix and partial correlation matrix controlling for Age (in linear and quadratic form), n=1502

             Knowledge        ComedyNews       Age       Education        Male           Turnout      Engagement
ComedyNews   0.18***/0.20***
Age          0.22***          -0.07**
Education    0.45***/0.44***  0.06*/0.06*      0.01
Male         0.24***/0.25***  0.03/0.03        -0.02     0.03/0.03
Turnout      0.12***/0.11***  0.03/0.03        0.02      0.06*/0.05*      0.02/0.02
Engagement   0.33***/0.29***  0.02/0.03        0.16***   0.25***/0.23***  -0.05*/-0.06*  0.06*/0.05*
L2TotalNews  0.34***/0.33***  0.21***/0.23***  0.13***   0.13***/0.13***  0.02/0.02      0.02/0.02    0.17***/0.16***

Cell entries are correlations / partial correlations (partialing out Age and Age²); cells involving Age show the correlation only. *p<0.05; **p<0.01; ***p<0.001
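As a rough check on how partialing works, the sketch below applies the standard first-order partial correlation formula to three correlations from Table 2. It controls for linear Age only, whereas the table partials out both Age and Age², so it is an approximation rather than a reproduction of the table's method. The function name is hypothetical.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Correlations taken from Table 2.
r_knowledge_comedy = 0.18   # Knowledge with ComedyNews
r_knowledge_age = 0.22      # Knowledge with Age
r_comedy_age = -0.07        # ComedyNews with Age

partial = partial_corr(r_knowledge_comedy, r_knowledge_age, r_comedy_age)

if __name__ == "__main__":
    print(f"Partial correlation (linear Age only) = {partial:.2f}")
```

The result rounds to 0.20: because viewers are slightly younger and age is positively related to knowledge, removing age nudges the viewership correlation up from 0.18.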


Table 3. Results of fitting a taxonomy of multiple regression models

Table 3. Results of fitting a taxonomy of multiple regression models predicting news knowledge among a random sample of 1502 American adults

                   Model A            Model B            Model C            Model D            Model E
Intercept          55.303*** (0.63)   8.451 (20.00)      -1.707 (19.65)     -27.578 (18.54)    -30.326 (18.67)
Comedy News        11.441*** (1.60)                      10.376*** (1.33)   7.244*** (1.27)    20.802* (8.36)
Age                                   -1.274 (0.79)      -0.902 (0.78)      -1.190 (0.73)      -1.084 (0.73)
Age²                                  0.016* (0.01)      0.013~ (0.01)      0.014* (0.01)      0.013~ (0.01)
Education                             0.525 (1.49)       1.125 (1.47)       0.166 (1.38)       0.227 (1.37)
Male                                  21.767** (5.82)    21.793** (5.70)    22.358*** (5.35)   21.897*** (5.34)
Male x Education                      -0.807* (0.41)     -0.823* (0.40)     -0.846* (0.38)     -0.810* (0.38)
Education x Age                       0.178** (0.06)     0.152** (0.06)     0.161** (0.05)     0.160** (0.05)
Education x Age²                      -0.002** (0.0005)  -0.002** (0.0005)  -0.002** (0.0005)  -0.002** (0.0005)
Comedy x Age                                                                                   -0.702~ (0.36)
Comedy x Age²                                                                                  0.008* (0.004)
Turnout                                                                     0.177** (0.05)     0.174** (0.05)
Engagement                                                                  0.233*** (0.03)    0.234*** (0.03)
Log2(TotalNews)                                                             3.574*** (0.34)    3.583*** (0.34)
R²                 0.0328             0.3231             0.3497             0.4325             0.4344
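One way to compare adjacent models in a taxonomy like Table 3 is the F-test on the change in R². The sketch below applies it to Models D and E, assuming that Model E differs from Model D only by the two Comedy-by-Age terms and therefore contains 13 predictors (counted from the table rows). The R² values are rounded as published, so the statistic is approximate, and the function name is hypothetical.

```python
def delta_r2_f(r2_reduced, r2_full, n, k_full, q):
    """F-statistic for the increase in R^2 when q predictors are added.

    n is the sample size; k_full is the number of predictors in the
    fuller model. The statistic has q and (n - k_full - 1) degrees of freedom.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# Model D vs. Model E: E adds Comedy x Age and Comedy x Age^2 (q = 2).
f_d_vs_e = delta_r2_f(0.4325, 0.4344, n=1502, k_full=13, q=2)

if __name__ == "__main__":
    print(f"F(2, {1502 - 13 - 1}) = {f_d_vs_e:.2f}")
```

Under these assumptions the two interaction terms together yield an F of about 2.5 on 2 and 1488 degrees of freedom, which falls short of the conventional 0.05 critical value of roughly 3.0 even though one term is individually flagged in the table. Because the inputs are rounded, treat this as an order-of-magnitude check rather than a formal test.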


Another example of model building: The Father Presence study

“A hierarchical linear regression analysis was conducted to determine the effects of fathers’ antisocial behavior and fathers’ presence on child antisocial behavior. Fathers’ antisocial behavior (r=.30, p<.001) and fathers’ presence (r=-.16, p<.001) were significantly correlated with child behavior problems.

At the first step, we asked whether fathers’ antisocial behavior and father presence independently predicted child behavior problems. The model was estimated as:

$\text{CHILDASB} = \beta_1 + \beta_2\,\text{DADASB} + \beta_3\,\text{DADHOME}$

Fathers’ antisocial behavior significantly predicted elevated levels of child antisocial behavior (slope = 0.32, p<0.001), but father presence did not when fathers’ antisocial behavior was controlled (slope = 1.80, p=.33).

At the second step, we asked whether the effect of father presence was moderated by fathers’ antisocial behavior. Thus, the interaction between fathers’ antisocial behavior and father presence was entered and the model was estimated as:

$\text{CHILDASB} = \beta_1 + \beta_2\,\text{DADASB} + \beta_3\,\text{DADHOME} + \beta_4\,\text{DADASB} * \text{DADHOME}$

The interaction was statistically significant (slope = .28, p<.001).

We conducted four additional analyses to test the robustness of the interaction between fathers’ antisocial behavior and father presence. First, we tested whether fathers’ antisocial behavior moderated the effect of father presence controlling for the presence of nonbiological father figures in the home. Second, we tested whether fathers’ antisocial behavior moderated the effect of father presence, controlling for maternal antisocial behavior. Third, we tested whether the interaction between fathers’ antisocial behavior and father presence predicted child behavior problems in the clinical range. Fourth, we tested whether fathers’ antisocial behavior moderated a more fine-grained measure of his involvement, such as his caretaking behavior.”
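A moderation result like the one in the quoted passage is easiest to read as a conditional slope. In the step-two model, with an intercept plus terms for DADASB, DADHOME, and their product, the slope for DADHOME at a given level of DADASB is the DADHOME coefficient plus the interaction coefficient times DADASB. The sketch below illustrates this. Only the interaction slope (.28) comes from the passage; the step-two DADHOME coefficient is not reported, so the value used here is purely hypothetical, as are the DADASB levels and the function name.

```python
def father_presence_slope(dadasb, b_dadhome, b_interaction):
    """Conditional slope of DADHOME at a given level of DADASB.

    For CHILDASB = b1 + b2*DADASB + b3*DADHOME + b4*DADASB*DADHOME,
    the slope for DADHOME is b3 + b4*DADASB.
    """
    return b_dadhome + b_interaction * dadasb


B_INTERACTION = 0.28       # interaction slope reported in the quoted passage
B_DADHOME_ASSUMED = -1.0   # HYPOTHETICAL: not reported in the source

if __name__ == "__main__":
    for level in (0, 5, 10):  # hypothetical DADASB values
        slope = father_presence_slope(level, B_DADHOME_ASSUMED, B_INTERACTION)
        print(f"DADASB = {level}: slope for DADHOME = {slope:.2f}")
```

Whatever the true main-effect coefficient is, a positive interaction means the slope for father presence becomes more positive (more child antisocial behavior) as fathers' own antisocial behavior increases.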


What’s the big takeaway from this unit?

• Be guided by the research questions
– Don’t go on fishing expeditions fitting all possible subsets and don’t rely on computers to select models for fitting
– No automated model selection routine can replace thoughtful model building strategies
– It’s wise to divide your predictors into substantive groupings and use those groupings to guide the analysis

• There is no single “right answer” or “right model”
– Different researchers may make different analytic decisions; hopefully, substantive findings about question predictors won’t change (but they can)
– Different researchers will choose to make different decisions about what information to present in a paper; hopefully, regardless of approach, there will be sufficient information to judge the soundness of the conclusions

• You can do data analysis!
– Think back to the beginning of the semester; you’ve all come a long way
– You can judge the soundness of a research presentation; don’t believe everything you read and be sure to read the methods section
– No matter how much you learn about data analysis, there’s always more to learn!


The S-052 Roadmap (Courtesy of John B. Willett)

Multiple Regression Analysis: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

• Do your residuals meet the required assumptions?
– Test for residual normality
– Use influence statistics to detect atypical datapoints

• If your residuals are not independent, replace OLS by GLS regression analysis
– Specify a Multilevel Model
– Are the data longitudinal? Use Individual growth modeling

• If time is a predictor, you need discrete-time survival analysis…

• If your outcome is categorical, you need to use…
– Discriminant Analysis
– Multinomial logistic regression analysis (polychotomous outcome)
– Binomial logistic regression analysis (dichotomous outcome)

• If you have more predictors than you can deal with,
– Create taxonomies of fitted models and compare them.
– Conduct a Principal Components Analysis
– Form composites of the indicators of any common construct.
– Use Cluster Analysis

• If your outcome vs. predictor relationship is non-linear,
– Transform the outcome or predictor
– Use non-linear regression analysis


Epilogue: Reflections on the semester

Course Goal: For you to learn how to use statistical methods to address RQs

• Solid foundation in regression modeling

• Solid understanding of assumptions

• Appreciation for the model’s flexibility

• Learn how to link methods and substance: how to think like an empirical researcher

• Learn how to communicate quantitative findings

Looking forward: You’ll be able to use statistical methods and evaluate the work of others

• You’ll never skip a methods section again

• You’ll never just believe someone’s findings without evaluating their methodology

• You’ll start thinking about how statistical methods might be used within your substantive arena to address important RQs


Thanks and Congratulations!

You, in September

You, now!

Hal Varian [Google’s chief economist] likes to say that the sexy job in the next ten years will be statisticians. After all, who would have guessed that computer engineers would be the cool job of the 90s? When every business has free and ubiquitous data, the ability to understand it and extract value from it becomes the complimentary scarce factor. It leads to intelligence, and the intelligent business is the successful business, regardless of its size. Data is the sword of the 21st century, those who wield it well, the Samurai.

--Jonathan Rosenberg (Googleblog)
