Class 19: Tuesday, Nov. 16 Specially Constructed Explanatory Variables.

Date posted: 20-Dec-2015

Page 1: Class 19: Tuesday, Nov. 16 Specially Constructed Explanatory Variables.

Class 19: Tuesday, Nov. 16

• Specially Constructed Explanatory Variables

Page 2

Specially Constructed Explanatory Variables

• Interaction variables

• Squared and higher polynomial terms for curvature

• Dummy variables for categorical variables.

Page 3

Interaction Example

• The number of car accidents on a stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling.

• A city alderman has decided to ask the county sheriff for statistics covering the last few years, with the intention of examining these data statistically in order to introduce new speed laws that will reduce traffic accidents.

• accidents.JMP contains data for different time periods on the number of cars passing along the stretch of road, the average speed of the cars and the number of accidents during the time period.

• It seems plausible that the effect of increases in speed on accidents is greater when there are more cars on the road.

Page 4

Interaction

• Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).

• There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.

• To incorporate interaction in a multiple regression model, we add the explanatory variable $(X_1-\bar{X}_1)*(X_2-\bar{X}_2)$. There is evidence of an interaction if the coefficient on $(X_1-\bar{X}_1)*(X_2-\bar{X}_2)$ is significant (t-test has p-value < .05).
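For readers working outside JMP, the centered product term can be built by hand. The following is a minimal sketch, assuming plain Python lists as input; the function name `centered_interaction` and the toy values are hypothetical and are not taken from accidents.JMP.

```python
from statistics import mean


def centered_interaction(x1, x2):
    """Return (x1 - mean(x1)) * (x2 - mean(x2)) for each row."""
    m1, m2 = mean(x1), mean(x2)
    return [(a - m1) * (b - m2) for a, b in zip(x1, x2)]


# Hypothetical toy values, not taken from accidents.JMP.
cars = [8.0, 9.0, 10.0, 11.0, 12.0]
speed = [57.0, 63.0, 58.0, 62.0, 60.0]

interaction = centered_interaction(cars, speed)
for c, s, z in zip(cars, speed, interaction):
    print(f"Cars={c:4.1f}  Speed={s:4.1f}  centered product={z:5.1f}")
```

The resulting column would then be entered into the regression alongside the two original explanatory variables.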

Page 5

Interaction variables in JMP

• To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight $X_1$ in the Select Columns box and $X_2$ in the Construct Model Effects Box. Then click Cross in the Construct Model Effects Box.

• JMP creates the explanatory variable $(X_1-\bar{X}_1)*(X_2-\bar{X}_2)$

Page 6

Interactions in Accident Data

Response: Accidents
Parameter Estimates

Term                            Estimate    Std Error   t Ratio   Prob>|t|
Intercept                       -0.852117   7.314465    -0.12     0.9077
Cars                            0.4154531   0.136048    3.05      0.0035
Speed                           0.0644162   0.118519    0.54      0.5889
(Speed-60.0017)*(Cars-9.935)    1.0763228   0.087791    12.26     <.0001

$$\hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=11, \text{Speed}=65)$$
$$= [-0.852 + 0.415*11 + 0.064*66 + 1.076*(66-60.0017)*(11-9.935)] - [-0.852 + 0.415*11 + 0.064*65 + 1.076*(65-60.0017)*(11-9.935)]$$
$$= 0.064*(66-65) + 1.076*(66-65)*(11-9.935) = 1.21$$

$$\hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=66) - \hat{E}(\text{Accidents} \mid \text{Cars}=8, \text{Speed}=65)$$
$$= [-0.852 + 0.415*8 + 0.064*66 + 1.076*(66-60.0017)*(8-9.935)] - [-0.852 + 0.415*8 + 0.064*65 + 1.076*(65-60.0017)*(8-9.935)]$$
$$= 0.064*(66-65) + 1.076*(66-65)*(8-9.935) = -2.02$$

Increases in speed have a worse impact on number of accidents when there are a large number of cars on the road than when there are a small number of cars on the road.
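The dependence of the speed effect on the number of cars can be checked with a few lines of arithmetic. This sketch uses the rounded coefficients from the worked comparison (0.064 for Speed, 1.076 for the interaction, centering value 9.935 for Cars); the function name `speed_effect` is only illustrative.

```python
# Rounded coefficients as used in the worked comparison for the accident data.
B_SPEED = 0.064        # coefficient on Speed
B_INTERACTION = 1.076  # coefficient on (Speed-60.0017)*(Cars-9.935)
MEAN_CARS = 9.935      # centering value for Cars


def speed_effect(cars):
    """Estimated change in mean accidents per one-unit increase in Speed, at a given Cars."""
    return B_SPEED + B_INTERACTION * (cars - MEAN_CARS)


for cars in (8, 11):
    print(f"Cars={cars}: change per unit of Speed = {speed_effect(cars):.2f}")
```

Because the Speed centering value cancels in the difference, the effect of one extra unit of speed is a linear function of Cars alone.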

Page 7

Notes on Interactions

• The need for interactions is not easily spotted with residual plots. It is best to try including an interaction term and see if it is significant.

• To understand better the multiple regression relationship when there is an interaction, it is useful to make an Interaction Plot. After Fit Model, click red triangle next to Response, click Factor Profiling and then click Interaction Plots.

Page 8

Interaction Profiles

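[Figure: Interaction Profiles. Two panels with Accidents on the vertical axis (-2 to 12): one against Cars (7 to 12) with lines for Speed = 56.6 and Speed = 62.5, the other against Speed (57 to 63) with lines for Cars = 7 and Cars = 12.6.]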

Plot on left displays E(Accidents|Cars, Speed=56.6) and E(Accidents|Cars, Speed=62.5) as a function of Cars. Plot on right displays E(Accidents|Speed, Cars=12.6) and E(Accidents|Speed, Cars=7) as a function of Speed. We can see that the impact of speed on Accidents depends critically on the number of cars on the road.

Page 9

Aptitude-Treatment Interactions

• There is a large literature in education and psychology that investigates aptitude-treatment interactions: interactions between instructional strategies (more generally treatments) and aptitudes (more generally characteristics) of individuals.

• There is evidence that in general highly structured instructional strategies (e.g., high level of external control, well-defined sequences/components) seem to help students with low ability but hinder those with high ability, relative to low-structure instructional strategies.

Page 10

Examples of Interesting Interactions

• Y=Measure of psychological distress, X1=# of life events in last three years that are personal disruptions (e.g., death in the family), X2=socioeconomic status. Coefficient on X1 is positive, coefficient on X2 is negative and coefficient on $(X_1-\bar{X}_1)*(X_2-\bar{X}_2)$ is negative: subjects who possess greater resources in the form of higher SES are better able to withstand the mental stress of potentially traumatic life events.

• Y=Measure of depression, X1=Education, X2=Age. Coefficient on X1 is negative,

Page 11

Fast Food Locations

• An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income and the mean age of children in the area. Data in fastfoodchain.jmp.

Page 12

Multivariate

Correlations

           Revenue   Income    Age
Revenue    1.0000    0.4355    0.3769
Income     0.4355    1.0000    0.0201
Age        0.3769    0.0201    1.0000

[Figure: Scatterplot Matrix of Revenue (900 to 1300), Income (20 to 35) and Age (5.0 to 15.0).]

The relationships between revenue and income and between revenue and age are quadratic. Members of relatively poor or relatively affluent households are less likely to eat at this chain’s restaurants, since the restaurants attract mostly middle-income customers. The quadratic relationship cannot be easily captured by a transformation: the curvature between y and x falls into two quadrants of the circle in Tukey’s Bulging Rule.

Page 13

Squared Terms for Curvature

• To capture a quadratic relationship between X1 and Y, we add $(X_1-\bar{X}_1)*(X_1-\bar{X}_1)$ as an explanatory variable.

• To do this in JMP, add X1 to the model, then highlight X1 in the Select Columns box and highlight X1 in the Construct Model Effects box and click Cross.

Page 14

Response: Revenue
Parameter Estimates

Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     1062.4317   72.9538     14.56     <.0001
Income                        5.4563847   2.162126    2.52      0.0202
Age                           1.6421762   5.413888    0.30      0.7648
(Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
(Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041

The t-tests indicate strong evidence of curvature for both income and age. The curvature in age means that the impact of an extra year of age on mean revenue, for a fixed level of income, depends on the starting value of age.

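$$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=10) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=9)$$
$$= 1.642 + (-4.113)*[(10-8.392)*(10-8.392) - (9-8.392)*(9-8.392)] = -7.47$$

$$\hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=8) - \hat{E}(\text{Revenue} \mid \text{Income}=24.2, \text{Age}=7)$$
$$= 1.642 + (-4.113)*[(8-8.392)*(8-8.392) - (7-8.392)*(7-8.392)] = 8.98$$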
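The two one-year comparisons can be reproduced numerically. This is a small sketch using the rounded coefficients from the comparison (1.642 for Age, -4.113 for the squared term, centering value 8.392); the helper names `age_part` and `change` are illustrative only. Income is held fixed, so its terms cancel and are left out.

```python
# Rounded coefficients as used in the worked comparison for the fast food data.
B_AGE = 1.642      # coefficient on Age
B_AGE_SQ = -4.113  # coefficient on (Age-8.392)*(Age-8.392)
MEAN_AGE = 8.392   # centering value for Age


def age_part(age):
    """Contribution of the Age terms to estimated mean revenue."""
    return B_AGE * age + B_AGE_SQ * (age - MEAN_AGE) ** 2


def change(age_from, age_to):
    """Change in estimated mean revenue when Age moves, with Income held fixed."""
    return age_part(age_to) - age_part(age_from)


print(f"Age 9 -> 10: {change(9, 10):.2f}")
print(f"Age 7 -> 8:  {change(7, 8):.2f}")
```

The sign flips because the fitted parabola in Age peaks between the two comparisons.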

Page 15

Notes on Squared Terms for Curvature

• If t-test for squared term has p-value <.05, indicating that there is curvature, then we keep the linear term $X_1$ in the model regardless of its p-value.

• Coefficients in model with squared terms for curvature are tricky to interpret. If we have explanatory variables $X_1$ and $(X_1-\bar{X}_1)^2$ in the model, then we can’t keep $(X_1-\bar{X}_1)^2$ fixed and change $X_1$.

• As with interactions, to better understand the multiple regression relationship when there is a squared term for curvature, a plot is useful. After Fit Model, click red triangle next to Response, click Factor Profiling and click Profiler. JMP shows a plot for each explanatory variable of how the mean of Y changes as the explanatory variable is increased and the other explanatory variables are held fixed at their mean value.

Page 16

Prediction Profiler

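[Figure: Prediction Profiler. Revenue on the vertical axis (781.028 to 1281, predicted value 1208.257 ± 32.825), plotted against Income (15.6 to 33.6, set at 24.2) and against Age (3.4 to 14.9, set at 8.392).]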

Left hand plot is a plot of Mean Revenue for different levels of income when Age is held fixed at its mean value of 8.392. The 1208.257+/-32.825 is a confidence interval for the mean response at income=24.2, Age=8.392.
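The profiler's center value can be recovered from the fitted equation. The sketch below assumes the profiler is based on the squared-terms model reported earlier (without the Age-by-Income interaction); the function name `predicted_revenue` and the grid of income values are illustrative.

```python
# Parameter estimates from the squared-terms model for Revenue.
INTERCEPT = 1062.4317
B_INCOME = 5.4563847
B_AGE = 1.6421762
B_INCOME_SQ = -3.979104
B_AGE_SQ = -4.112892
MEAN_INCOME, MEAN_AGE = 24.2, 8.392


def predicted_revenue(income, age):
    """Estimated mean revenue from the model with centered squared terms."""
    return (INTERCEPT
            + B_INCOME * income
            + B_AGE * age
            + B_INCOME_SQ * (income - MEAN_INCOME) ** 2
            + B_AGE_SQ * (age - MEAN_AGE) ** 2)


# Trace the left-hand profile: vary Income while Age stays at its mean.
for income in (15.6, 20.0, 24.2, 28.0, 33.6):
    print(f"Income={income:5.1f}: {predicted_revenue(income, MEAN_AGE):8.2f}")
```

At Income=24.2 and Age=8.392 both squared terms vanish, which is why the center value depends only on the intercept and the two linear coefficients.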

Page 17

Regression Model for Fast Food Chain Data

• Interactions and polynomial terms can be combined in a multiple regression model.

• Strong evidence of a quadratic relationship between revenue and age, revenue and income. Moderate evidence of an interaction between age and income.

Parameter Estimates

Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     921.11967   95.703      9.62      <.0001
Income                        9.3678491   2.743887    3.41      0.0029
Age                           6.2254725   5.472777    1.14      0.2695
(Income-24.2)*(Income-24.2)   -3.726129   0.542156    -6.87     <.0001
(Age-8.392)*(Age-8.392)       -3.868707   1.179054    -3.28     0.0039
(Age-8.392)*(Income-24.2)     1.9672682   0.944082    2.08      0.0509

Page 18

Categorical variables

• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).

• How to use categorical variables as explanatory variables in regression analysis:
  – If the variable has two categories (e.g., sex (male/female), rain or not rain, snow or not snow), we have defined a variable that equals 1 for one of the categories and 0 for the other category.

Page 19

Predicting Emergency Calls to the AAA Club

Response: Calls

Summary of Fit

RSquare                       0.692384
RSquare Adj                   0.584719
Root Mean Square Error        1735.151
Mean of Response              4318.75
Observations (or Sum Wgts)    28

Parameter Estimates

Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175

Rain forecast=1 if rain is in forecast, 0 if not
Snow forecast=1 if snow is in forecast, 0 if not
Weekday=1 if weekday, 0 if not

Page 20

Comparing Toy Factory Managers

• An analysis has shown that the time required to complete a production run in a toy factory increases with the number of toys produced. Data were collected for the time required to process 20 randomly selected production runs as supervised by three managers (A, B and C). Data in toyfactorymanager.JMP.

• How do the managers compare?

Page 21

Marginal Comparison

• Marginal comparison could be misleading. We know that large production runs with more toys take longer than small runs with few toys. How can we be sure that Manager C has not simply been supervising very small production runs?

• Solution: Run a multiple regression in which we include size of the production run as an explanatory variable along with manager, in order to control for size of the production run.

[Figure: Oneway Analysis of Time for Run By Manager. Time for Run (150 to 300) plotted for Managers a, b and c.]

Page 22

Including Categorical Variable in Multiple Regression: Wrong Approach

• We could assign codes to the managers, e.g., Manager A=0, Manager B=1, Manager C=2.

Parameter Estimates

Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       211.92804   7.212609    29.38     <.0001
Run Size        0.2233844   0.029184    7.65      <.0001
Managernumber   -31.03612   3.056054    -10.16    <.0001

• This model says that for the same run size, Manager B is 31 minutes faster than Manager A and Manager C is 31 minutes faster than Manager B.

• This model restricts the difference between Manager A and B to be the same as the difference between Manager B and C; we have no reason to do this.

• If we use a different coding for Manager, we get different results, e.g., Manager B=0, Manager A=1, Manager C=2.

Parameter Estimates

Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        188.63636   12.73082    14.82     <.0001
Run Size         0.2103122   0.048921    4.30      <.0001
Managernumber2   -5.008207   5.122956    -0.98     0.3324

Manager A 5 min. faster than Manager B

Page 23

Including Categorical Variable in Multiple Regression: Right Approach

• Create an indicator (dummy) variable for each category.

• Manager[a] = 1 if Manager is A, 0 if Manager is not A

• Manager[b] = 1 if Manager is B, 0 if Manager is not B

• Manager[c] = 1 if Manager is C, 0 if Manager is not C
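Outside JMP, indicator columns of this kind are easy to build by hand. The following is a minimal sketch, assuming the categorical variable is held as a list of labels; the function name `indicator_columns` and the five example labels are hypothetical and are not the toyfactorymanager.JMP data.

```python
def indicator_columns(name, values):
    """Build one 0/1 column per category, labelled in the style Name[level]."""
    levels = sorted(set(values))
    return {
        f"{name}[{level}]": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical manager labels for five runs.
manager = ["a", "b", "c", "a", "c"]
dummies = indicator_columns("Manager", manager)
for label, column in dummies.items():
    print(label, column)
```

Each run has exactly one indicator equal to 1, so the indicator columns always sum to 1 across a row.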

Page 24

Response: Time for Run
Expanded Estimates (nominal factors expanded to all levels)

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    176.70882   5.658644    31.23     <.0001
Run Size     0.243369    0.025076    9.71      <.0001
Manager[a]   38.409663   3.005923    12.78     <.0001
Manager[b]   -14.65115   3.031379    -4.83     <.0001
Manager[c]   -23.75851   2.995898    -7.93     <.0001

• For a run size of 100, the estimated times for run of Managers A, B and C are:

$$\hat{E}(\text{Time} \mid \text{Runsize}=100, \text{Manager}=a) = 176.71 + 0.24*100 + 38.41*1 - 14.65*0 - 23.76*0$$
$$\hat{E}(\text{Time} \mid \text{Runsize}=100, \text{Manager}=b) = 176.71 + 0.24*100 + 38.41*0 - 14.65*1 - 23.76*0$$
$$\hat{E}(\text{Time} \mid \text{Runsize}=100, \text{Manager}=c) = 176.71 + 0.24*100 + 38.41*0 - 14.65*0 - 23.76*1$$

• For the same run size, Manager A is estimated to be on average 38.41-(-14.65)=53.06 minutes slower than Manager B and 38.41-(-23.76)=62.17 minutes slower than Manager C.
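The manager comparison can be reproduced from the Expanded Estimates. This sketch uses the unrounded estimates from the JMP output, so the individual times differ slightly from values computed with the rounded 0.24 slope; the name `estimated_time` is illustrative.

```python
# Expanded estimates for Response: Time for Run.
INTERCEPT = 176.70882
RUN_SIZE_SLOPE = 0.243369
MANAGER_EFFECT = {"a": 38.409663, "b": -14.65115, "c": -23.75851}


def estimated_time(run_size, manager):
    """Estimated mean time for a run of the given size under the given manager."""
    # Only the indicator for the run's own manager equals 1; the others are 0.
    return INTERCEPT + RUN_SIZE_SLOPE * run_size + MANAGER_EFFECT[manager]


times = {m: estimated_time(100, m) for m in MANAGER_EFFECT}
for m, t in times.items():
    print(f"Manager {m}: {t:.2f} minutes")
print(f"A minus B: {times['a'] - times['b']:.2f} minutes")
print(f"A minus C: {times['a'] - times['c']:.2f} minutes")
```

The differences between managers do not depend on the run size, because this model has no manager-by-run-size interaction.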

Page 25

Categorical Variables in Multiple Regression in JMP

• Make sure that the categorical variable is coded as nominal. To change coding, right click on column of variable, click Column Info and change Modeling Type to nominal.

• Use Fit Model and include the categorical variable in the multiple regression.

• After Fit Model, click red triangle next to Response and click Estimates, then Expanded Estimates (the initial output in JMP uses a different, more confusing coding of the dummy variables).

Page 26

Equivalence of Using One 0/1 Dummy Variable and Two 0/1 Dummy Variables when Categorical Variable has two categories

Parameter Estimates

Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175

Expanded Estimates (nominal factors expanded to all levels)

Term                  Estimate
Intercept             4321.7173
Average Temperature   -35.63182
Range                 133.30434
Rain forecast[0]      -214.8529
Rain forecast[1]      214.85294
Snow forecast[0]      -274.4002
Snow forecast[1]      274.40019
Weekday[0]            801.55002
Weekday[1]            -801.55
Sunday[0]             923.57625
Sunday[1]             -923.5762
Subzero[0]            -1928.8
Subzero[1]            1928.8002

Two models give equivalent predictions. The difference in mean number of emergency calls between a day with a rain forecast and a day without a rain forecast, holding all other variables fixed, is 429.71=214.85-(-214.85).