Class 18 – Thursday, Nov. 11 Omitted Variables Bias Specially Constructed Explanatory Variables...


Class 18 – Thursday, Nov. 11

• Omitted Variables Bias
• Specially Constructed Explanatory Variables
  – Interactions
  – Squared Terms for Curvature
  – Dummy variables for categorical variables (next class)
• I will e-mail you Homework 7 after class. It will be due next Thursday.


California Test Score Data

• The California Standardized Testing and Reporting (STAR) data set californiastar.JMP contains data on test performance, school characteristics and student demographic backgrounds from 1998-1999.

• Average Test Score is the average of the reading and math scores for a standardized test administered to 5th grade students.

• One interesting question: What would be the causal effect of decreasing the student-teacher ratio by one student per teacher?


Multiple Regression and Causal Inference

• Goal: Figure out what the causal effect on average test score would be of decreasing student-teacher ratio and keeping everything else in the world fixed.

• Lurking variable: A variable that is associated with both average test score and student-teacher ratio.

• In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to compare mean test scores among schools with different student-teacher ratios but the same values of the lurking variables.

• If we include all of the lurking variables in the multiple regression model, the coefficient on student-teacher ratio represents the change in the mean of test scores that is caused by a one unit increase in student-teacher ratio.


Omitted Variables Bias

• Schools with many English learners tend to have worse resources. The multiple regression, which shows how mean test score changes when student-teacher ratio changes but percent of English learners is held fixed, gives a better idea of the causal effect of the student-teacher ratio than the simple linear regression, which does not hold percent of English learners fixed.

• Omitted variables bias from omitting percentage of English learners = -2.28 - (-1.10) = -1.18.

Response: Average Test Score
Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                     698.93295    9.467491     73.82    <.0001
Student Teacher Ratio         -2.279808    0.479826     -4.75    <.0001

Response: Average Test Score
Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                     686.03225    7.411312     92.57    <.0001
Student Teacher Ratio         -1.101296    0.380278     -2.90    0.0040
Percent of English Learners   -0.649777    0.039343    -16.52    <.0001


Omitted Variables Bias: General Formula

• What happens if we omit a lurking variable from the regression?

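• Suppose we are interested in the causal effect of $x_1$ on $y$ and believe that there are lurking variables $x_2, \ldots, x_p$ and that

$E\{y \mid x_1, \ldots, x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \beta_p x_p$

• $\beta_1$ is the causal effect of $x_1$ on $y$. If we omit the lurking variable $x_p$, then the multiple regression will be estimating the coefficient $\beta_1^*$ as the coefficient on $x_1$, where

$E\{y \mid x_1, \ldots, x_{p-1}\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$

How different are $\beta_1^*$ and $\beta_1$?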


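Omitted Variables Bias Formula

• Suppose that

$E\{y \mid x_1, \ldots, x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \beta_p x_p$

$E\{y \mid x_1, \ldots, x_{p-1}\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$

$E\{x_p \mid x_1, \ldots, x_{p-1}\} = \delta_0 + \delta_1 x_1 + \cdots + \delta_{p-1} x_{p-1}$

• Then $\beta_1^* = \beta_1 + \beta_p \delta_1$.

• Formula tells us about direction and magnitude of bias from omitting a variable in estimating a causal effect.

• Formula also applies to least squares estimates, i.e., $\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\beta}_p \hat{\delta}_1$.

• Key point: In order for there to be omitted variable bias, the omitted variable must be associated with both the explanatory variable of interest and the response.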
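As a rough check of the omitted variables bias formula on least squares estimates, the sketch below simulates data and compares the change in the coefficient on x1 (regression without the omitted variable minus regression with it) to the product of the omitted variable's coefficient and the slope from regressing the omitted variable on x1. The data, the true coefficients, and the helper name `ols` are all made up for illustration; they are not from the California data.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients of y on the columns of X, with an intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated (hypothetical) data: x2 is a lurking variable associated with x1 and y.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

long_fit = ols(np.column_stack([x1, x2]), y)  # intercept, coef on x1, coef on x2
short_fit = ols(x1, y)                        # x2 omitted: intercept, coef on x1
aux_fit = ols(x1, x2)                         # omitted variable regressed on x1

bias = short_fit[1] - long_fit[1]     # change in the x1 coefficient from omitting x2
implied = long_fit[2] * aux_fit[1]    # (coef on x2) * (slope of x2 on x1)

print(bias, implied)  # the two numbers agree up to floating point error
```

Because x2 lowers y and rises with x1 in this simulation, the regression that omits x2 gives a coefficient on x1 that is too low, the same direction as in the test score example.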


Omitted Variables Bias Examples

• Would you expect the slope coefficient on X to be too high, too low or have no bias for the regression that omits the given variable?

• Y = Test Score, X= Number of Music Classes Taken, Omitted Variable = Student Ability

• Y = Salary, X = Gender (1=Female, 0=Male), Omitted Variable = Education


Key Warning About Multiple Regression

• Even if we have included many lurking variables in the multiple regression, we may have failed to include one or not have enough data to include one. There will then be omitted variables bias.

• The best way to study causal effects is to do a randomized experiment (coming up next week).


Specially Constructed Explanatory Variables

• Interaction variables

• Squared and higher polynomial terms for curvature

• Dummy variables for categorical variables.


Interaction

• Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).

• There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.

• To incorporate interaction in multiple regression model, we add the explanatory variable $(X_1 - \bar{X}_1) * (X_2 - \bar{X}_2)$. There is evidence of an interaction if the coefficient on $(X_1 - \bar{X}_1) * (X_2 - \bar{X}_2)$ is significant (t-test has p-value < .05).


An experiment to study how noise affects the performance of children tested second grade hyperactive children and a control group of second graders who were not hyperactive. One of the tasks involved solving math problems. The children solved problems under both high-noise and low-noise conditions. Here are the mean scores:

[Figure: bar chart of Mean Mathematics Score (vertical axis, 0 to 250) for the Control and Hyperactive groups, with separate bars for High Noise and Low Noise.]

Let Y = Mean Mathematics Score, X1 = Type of Child (0 = Control, 1 = Hyperactive), X2 = Type of Noise (0 = Low Noise, 1 = High Noise). There is an interaction between type of child and type of noise: Impact of increasing noise from low to high depends on the type of child.


Interaction variables in JMP

• To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight X1 in the Select Columns box and X2 in the Construct Model Effects box. Then click Cross in the Construct Model Effects box.

• JMP creates the explanatory variable $(X_1 - \bar{X}_1) * (X_2 - \bar{X}_2)$.
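Outside JMP, the same kind of centered cross-product column can be built by hand. The following is a minimal sketch in Python with NumPy; the function name `centered_cross` and the five data values are hypothetical and are used only to show the construction, not to reproduce any data set from class.

```python
import numpy as np

def centered_cross(x1, x2):
    """Return (x1 - mean(x1)) * (x2 - mean(x2)), the centered interaction column."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return (x1 - x1.mean()) * (x2 - x2.mean())

# Hypothetical values, chosen so the means are 10 and 60.
cars = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
speed = np.array([57.0, 59.0, 60.0, 61.0, 63.0])

cross = centered_cross(cars, speed)

# Design matrix: intercept, the two usual explanatory variables, then the interaction.
design = np.column_stack([np.ones(len(cars)), cars, speed, cross])
print(cross)
```

Centering each variable before multiplying changes the coefficients on the linear terms and the intercept but not the coefficient on the interaction column, so the t-test for interaction is the same either way.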


Interaction Example

• The number of car accidents on a stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling.

• A city alderman has decided to ask the county sheriff to provide her with statistics covering the last few years, with the intention of examining these data statistically so that she can introduce new speed laws that will reduce traffic accidents.

• accidents.JMP contains data for different time periods on the number of cars passing along the stretch of road, the average speed of the cars and the number of accidents during the time period.


Interactions in Accident Data

Response: Accidents
Parameter Estimates
Term                           Estimate     Std Error   t Ratio   Prob>|t|
Intercept                      -0.852117    7.314465     -0.12    0.9077
Cars                           0.4154531    0.136048      3.05    0.0035
Speed                          0.0644162    0.118519      0.54    0.5889
(Speed-60.0017)*(Cars-9.935)   1.0763228    0.087791     12.26    <.0001

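$\hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=65)$
$= [-0.852 + 0.415 \times 11 + 0.064 \times 66 + 1.076 \times (66 - 60.0017) \times (11 - 9.935)]$
$\quad - [-0.852 + 0.415 \times 11 + 0.064 \times 65 + 1.076 \times (65 - 60.0017) \times (11 - 9.935)]$
$= 0.064 \times (66 - 65) + 1.076 \times (66 - 65) \times (11 - 9.935) = 1.21$

$\hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=65)$
$= [-0.852 + 0.415 \times 8 + 0.064 \times 66 + 1.076 \times (66 - 60.0017) \times (8 - 9.935)]$
$\quad - [-0.852 + 0.415 \times 8 + 0.064 \times 65 + 1.076 \times (65 - 60.0017) \times (8 - 9.935)]$
$= 0.064 \times (66 - 65) + 1.076 \times (66 - 65) \times (8 - 9.935) = -2.02$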

Increases in speed have a worse impact on number of accidents when there are a large number of cars on the road than when there are a small number of cars on the road.
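The comparison of speed effects at 11 cars and at 8 cars can be reproduced with a few lines of Python. This is only a sketch: the coefficients and centering values are copied from the JMP parameter estimates for the accident data, while the function names `predicted_accidents` and `speed_effect` are made up for illustration.

```python
# Coefficients copied from the JMP parameter estimates for the accident data.
B0, B_CARS, B_SPEED, B_CROSS = -0.852117, 0.4154531, 0.0644162, 1.0763228
MEAN_SPEED, MEAN_CARS = 60.0017, 9.935  # centering values in the interaction term

def predicted_accidents(cars, speed):
    """Fitted mean number of accidents from the model with the centered interaction."""
    return (B0 + B_CARS * cars + B_SPEED * speed
            + B_CROSS * (speed - MEAN_SPEED) * (cars - MEAN_CARS))

def speed_effect(cars, low=65.0, high=66.0):
    """Change in fitted mean accidents when speed rises from low to high, cars fixed."""
    return predicted_accidents(cars, high) - predicted_accidents(cars, low)

print(round(speed_effect(11), 2))  # 1.21
print(round(speed_effect(8), 2))   # -2.02
```

The intercept and the Cars term cancel in each difference, which is why only the Speed coefficient and the interaction coefficient appear in the final line of the hand calculation.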


Notes on Interactions

• The need for interactions is not easily spotted with residual plots. It is best to try including an interaction term and see if it is significant.

• To understand better the multiple regression relationship when there is an interaction, it is useful to make an Interaction Plot. After Fit Model, click red triangle next to Response, click Factor Profiling and then click Interaction Plots.


Interaction Profiles

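[Figure: JMP interaction profile plots. Left panel: Accidents (vertical axis, -2 to 12) against Cars, with one line for Speed = 56.6 and one for Speed = 62.5. Right panel: Accidents against Speed, with one line for Cars = 7 and one for Cars = 12.6.]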

Plot on left displays E(Accidents|Cars, Speed=56.6) and E(Accidents|Cars, Speed=62.5) as a function of Cars. Plot on right displays E(Accidents|Cars=12.6, Speed) and E(Accidents|Cars=7, Speed) as a function of Speed. We can see that the impact of speed on Accidents depends critically on the number of cars on the road.


Fast Food Locations

• An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income and the mean age of children in the area. The data are in fastfoodchain.jmp.


Multivariate

Correlations
           Revenue   Income   Age
Revenue    1.0000    0.4355   0.3769
Income     0.4355    1.0000   0.0201
Age        0.3769    0.0201   1.0000

[Figure: Scatterplot Matrix of Revenue (900 to 1300), Income (20 to 35) and Age (5.0 to 15.0).]

The relationships between revenue and income and between revenue and age are quadratic. Members of relatively poor or relatively affluent households are less likely to eat at this chain’s restaurants, since the restaurants attract mostly middle-income customers. The quadratic relationship cannot be easily captured by a transformation: the curvature between y and x falls into two quadrants of the circle in Tukey’s Bulging Rule.


Squared Terms for Curvature

• To capture a quadratic relationship between X1 and Y, we add $(X_1 - \bar{X}_1) * (X_1 - \bar{X}_1)$ as an explanatory variable.

• To do this in JMP, add X1 to the model, then highlight X1 in the Select Columns box and highlight X1 in the Construct Model Effects box and click Cross.


Response: Revenue
Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                     1062.4317    72.9538      14.56    <.0001
Income                        5.4563847    2.162126      2.52    0.0202
Age                           1.6421762    5.413888      0.30    0.7648
(Income-24.2)*(Income-24.2)   -3.979104    0.570833     -6.97    <.0001
(Age-8.392)*(Age-8.392)       -4.112892    1.267459     -3.24    0.0041

The t-tests indicate strong evidence of curvature for both income and age. The curvature in age means that the impact of an extra year of age on mean revenue, for a fixed level of income, depends on the starting value of age.

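$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=10) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=9)$
$= 1.642 + (-4.113) \times [(10 - 8.392) \times (10 - 8.392) - (9 - 8.392) \times (9 - 8.392)] = -7.47$

$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=8) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=7)$
$= 1.642 + (-4.113) \times [(8 - 8.392) \times (8 - 8.392) - (7 - 8.392) \times (7 - 8.392)] = 8.98$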
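The two one-year changes in fitted revenue can be checked numerically. This is a small sketch using the Age coefficients copied from the parameter estimates for the model with squared terms; the function name `age_step_effect` is hypothetical.

```python
# Age coefficients copied from the JMP parameter estimates (model with squared terms).
B_AGE = 1.6421762      # coefficient on Age
B_AGE_SQ = -4.112892   # coefficient on (Age-8.392)*(Age-8.392)
MEAN_AGE = 8.392       # centering value used in the squared term

def age_step_effect(age):
    """Change in fitted mean revenue when Age goes from `age` to `age + 1`, Income fixed."""
    def centered_square(a):
        return (a - MEAN_AGE) ** 2
    return B_AGE + B_AGE_SQ * (centered_square(age + 1) - centered_square(age))

print(round(age_step_effect(9), 2))  # -7.47
print(round(age_step_effect(7), 2))  # 8.98
```

The intercept and both Income terms cancel because Income is held at the same value in each pair of predictions, so the size and even the sign of the effect of one more year of age depend only on where age starts.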


Notes on Squared Terms for Curvature

• If t-test for squared term $(X_1 - \bar{X}_1)^2$ has p-value <.05, indicating that there is curvature, then we keep the linear term $X_1$ in the model regardless of its p-value.

• Coefficients in model with squared terms for curvature are tricky to interpret. If we have explanatory variables $X_1$ and $(X_1 - \bar{X}_1)^2$ in the model, then we can’t keep $(X_1 - \bar{X}_1)^2$ fixed and change $X_1$.

• As with interactions, to better understand the multiple regression relationship when there is a squared term for curvature, a plot is useful. After Fit Model, click red triangle next to Response, click Factor Profiling and click Profiler. JMP shows a plot for each explanatory variable of how the mean of Y changes as the explanatory variable is increased and the other explanatory variables are held fixed at their mean value.


Prediction Profiler

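[Figure: JMP Prediction Profiler. Vertical axis: Revenue (781.028 to 1281), with predicted value 1208.257 ± 32.825. Left panel: Income (15.6 to 33.6), set at 24.2. Right panel: Age (3.4 to 14.9), set at 8.392.]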

Left hand plot is a plot of Mean Revenue for different levels of income when Age is held fixed at its mean value of 8.392. The 1208.257 +/- 32.825 is a confidence interval for the mean response at Income = 24.2, Age = 8.392.


Regression Model for Fast Food Chain Data

• Interactions and polynomial terms can be combined in a multiple regression model.

• Strong evidence of a quadratic relationship between revenue and age, revenue and income. Moderate evidence of an interaction between age and income.

Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                     921.11967    95.703        9.62    <.0001
Income                        9.3678491    2.743887      3.41    0.0029
Age                           6.2254725    5.472777      1.14    0.2695
(Income-24.2)*(Income-24.2)   -3.726129    0.542156     -6.87    <.0001
(Age-8.392)*(Age-8.392)       -3.868707    1.179054     -3.28    0.0039
(Age-8.392)*(Income-24.2)     1.9672682    0.944082      2.08    0.0509