Statistics for Social and Behavioral Sciences Part IV: Causality Multivariate Regression Chapter 11...

Post on 24-Dec-2015


Statistics for Social and Behavioral Sciences

Part IV: Causality

Multivariate Regression

Chapter 11

Prof. Amine Ouazad

Movie Buzz

• Can we predict the success of a movie?

1. Avatar (2009) $760,505,847

2. Titanic (1997) $658,672,302

3. The Avengers (2012) $623,279,547

4. The Dark Knight (2008) $533,316,061

5. Star Wars: Episode I – The Phantom Menace (1999) $474,544,677

Data

• Box_mil = First run U.S. box office (Millions of $)

• MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.

• Budget = Production budget (Millions of $)

• Starpowr = Index of star power

• Sequel = 1 if movie is a sequel, 0 if not

• Action = 1 if action film, 0 if not

• Comedy = 1 if comedy film, 0 if not

• Animated = 1 if animated film, 0 if not

• Horror = 1 if horror film, 0 if not

• Addict = Trailer views at traileraddict.com

• Cmngsoon = Message board comments at comingsoon.net

• Fandango = Attention at fandango.com

• Cntwait3 = Percentage of Fandango votes that can't wait to see.

Statistics Course Outline

PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)

Four Steps of “Thinking Like a Statistician”. Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling. Biases: Nonresponse bias, Response bias, Sampling bias.

PART II. DESCRIBING DATA (Weeks 2-4)

Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule. Bivariate sample statistics: Correlation, Slope.

PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)

Estimating a parameter using sample statistics. Confidence intervals at 90%, 95%, 99%. Testing a hypothesis using the CI method and the t method.

PART IV. CORRELATION AND CAUSATION: TWO GROUPS, REGRESSION ANALYSIS (Weeks 10-14)

Multivariate regression now!

Coming up

• “Comparison of Two Groups”. Last week.

• “Univariate Regression Analysis”. Last Saturday, Section 9.5.

• “Association and Causality: Multivariate Regression”. Last Saturday, Chapter 10. Today and tomorrow, Chapter 11.

• “Randomized Experiments and ANOVA”. Wednesday, Chapter 12.

• “Robustness Checks and Wrap Up”. Last Thursday.

Outline

1. Multivariate regression

2. Interpreting coefficients: Ceteris Paribus

3. Standardized Coefficient

4. Multiple Correlation and R Squared

Next time: Multivariate regression: the F test (Continued)

Data: Variables

• y Box = First run U.S. box office ($)

• x1 MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.

• x2 Budget = Production budget ($Mil)

• x3 Starpowr = Index of star power

• x4 Sequel = 1 if movie is a sequel, 0 if not

• x5 Action = 1 if action film, 0 if not

• x6 Comedy = 1 if comedy film, 0 if not

• x7 Animated = 1 if animated film, 0 if not

• x8 Horror = 1 if horror film, 0 if not

• x9 Addict = Trailer views at traileraddict.com

• x10 Cmngsoon = Message board comments at comingsoon.net

• x11 Fandango = Attention at fandango.com

• x12 Cntwait3 = Percentage of Fandango votes that can't wait to see.

Multivariate Regression

• With variables x1, x2, …, x12.

• We are trying to get the true impact: b1 of variable x1 on y, b2 of variable x2 on y, …, b12 of variable x12 on y.

• True model: y = a + b1 x1 + b2 x2 + b3 x3 + … + b12 x12 + e

We would know the true coefficients if we observed the population of all possible movies.

• Instead we estimate a, b1, b2, …, b12 on the sample, by minimizing the sum of the squared prediction errors.

• With these estimates we can predict the success of a movie: predicted y = a + b1 x1 + b2 x2 + … + b12 x12
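The least-squares step can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the movie data set: the helper names `fit_ols` and `predict`, the three made-up variables, and the "true" coefficients are all assumptions chosen for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Return [a, b1, ..., bK], the values minimizing the sum of squared prediction errors."""
    # A leading column of ones makes the first coefficient the intercept a.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Predicted y = a + b1*x1 + ... + bK*xK for each row of X."""
    return coef[0] + X @ coef[1:]

# Synthetic sample (hypothetical, for illustration only).
rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
true_coef = np.array([5.0, 2.0, -1.0, 0.5])  # a, b1, b2, b3
y = true_coef[0] + X @ true_coef[1:] + rng.normal(scale=0.1, size=N)

coef = fit_ols(X, y)
sse = np.sum((y - predict(coef, X)) ** 2)  # the quantity being minimized
print("estimates:", coef)
print("sum of squared prediction errors:", sse)
```

Because the sample is finite, the estimates are close to the true coefficients but not equal to them, which is why the sampling distribution of each estimate matters.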

Multivariate Regression

Sampling Distribution of b3

• We only observe one coefficient estimate b3, because we have only one sample.

• But across all possible samples, the sampling distribution of b3 is bell-shaped.

• Hence we can design a test of H0: “ b3 = 0 ”.

Under H0, the t statistic (the estimate b3 divided by its standard error) follows a t distribution with N – (K + 1) degrees of freedom.

Hypothesis testing for H0 : “b3=0”

• Reject the null hypothesis at the 95% confidence level (5% significance level) if:

– The absolute value of the t statistic is greater than the t score with N – (K+1) degrees of freedom at 95%.

– Equivalently, if the p value is lower than 0.05.

There are as many null hypotheses as there are coefficients to estimate.

Here, there are 13: the intercept a and the 12 coefficients b1, …, b12.
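The t test for one coefficient can be sketched as follows. This is a hedged illustration on synthetic data: the function name `t_statistics`, the two made-up variables, and the hard-coded critical value `T_CRIT` (an approximate value read from a t table for about 100 degrees of freedom) are assumptions, not part of the course material.

```python
import numpy as np

def t_statistics(X, y):
    """Return OLS coefficients, standard errors, t statistics, and degrees of freedom."""
    N, K = X.shape
    design = np.column_stack([np.ones(N), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = N - (K + 1)                      # N minus the number of estimated coefficients
    sigma2 = resid @ resid / dof           # estimated variance of the error term
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return coef, se, coef / se, dof

# Synthetic sample: x1 truly matters (b1 = 2), x2 truly does not (b2 = 0).
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=N)

coef, se, t_stats, dof = t_statistics(X, y)
T_CRIT = 1.98  # approximate two-sided 5% critical value for ~100 degrees of freedom
for name, t in zip(["a", "b1", "b2"], t_stats):
    print(name, "t =", round(float(t), 2), "reject H0" if abs(t) > T_CRIT else "do not reject H0")
```

Statistical software reports the p value next to each t statistic, so in practice the comparison with a tabulated critical value is rarely done by hand.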

Outline

1. Multivariate regression

2. Interpreting coefficients: Ceteris Paribus

3. Standardized Coefficient

4. Multiple Correlation and R Squared

Next time: Multivariate regression (Continued)

Ceteris Paribus = “All other things equal”

• “All other things equal”, what is the impact of variable x3 on box office outcome in millions of $?

Increase in star power (variable x3), all other things equal. Keep x1, x2, x4, x5, x6, x7, x8, x9, x10, x11, x12 constant, and change x3.

[Figure: increase in x3 (star power)]

Ceteris Paribus = “All other things equal”

• “All other things equal”, what is the impact of variable x2 on box office outcome in millions of $?

Increase in budget (variable x2), all other things equal. Keep x1, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12 constant, and change x2.

[Figure: increase in x2 (budget) by 1 million $]

Reading the coefficients

• An increase in budget by 1 million $ leads to a rise in box office of 0.144 million $, all other things equal.

• All other things equal, an action movie has on average a box office outcome that is lower by 12 million $.

• An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = 0.3215 million $ increase in box office outcome.

We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.
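The arithmetic behind these readings is a coefficient multiplied by the change in its variable, holding everything else fixed. A small sketch, using the two coefficients quoted above; the function name `predicted_change` and the 10 million $ budget example are illustrative assumptions.

```python
# Coefficients quoted on the slide, in millions of $ per unit of the variable.
B_BUDGET = 0.144     # per extra million $ of production budget
B_CNTWAIT3 = 32.15   # per unit of cntwait3, which ranges from 0 to 1

def predicted_change(coefficient, delta_x):
    """Change in predicted box office when one variable moves by delta_x
    and all other variables stay constant (ceteris paribus)."""
    return coefficient * delta_x

budget_effect = predicted_change(B_BUDGET, 10)        # hypothetical +10 million $ of budget
cntwait_effect = predicted_change(B_CNTWAIT3, 0.01)   # +1 percentage point of cntwait3

print("budget +10 M$:", budget_effect, "M$")
print("cntwait3 +1 point:", cntwait_effect, "M$")
```

The unit of the variable matters: a change of 1 in cntwait3 would mean going from 0% to 100%, which is why the one-point change is entered as 0.01.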

Which coefficients are statistically significant?

• x1 MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG. ❏❏❏

• x2 Budget = Production budget ($Mil) ❏❏❏

• x3 Starpowr = Index of star power ❏❏❏

• x4 Sequel = 1 if movie is a sequel, 0 if not ❏❏❏

• x5 Action = 1 if action film, 0 if not ❏❏❏

• x6 Comedy = 1 if comedy film, 0 if not ❏❏❏

• x7 Animated = 1 if animated film, 0 if not ❏❏❏

• x8 Horror = 1 if horror film, 0 if not ❏❏❏

• x9 Addict = Trailer views at traileraddict.com ❏❏❏

• x10 Cmngsoon = Message board comments at comingsoon.net ❏❏❏

• x11 Fandango = Attention at fandango.com ❏❏❏

• x12 Cntwait3 = Percentage of Fandango votes that can't wait to see. ❏❏❏

Checkbox columns: At 10%, At 5%, At 1%.

Read the p value! Or compare the t statistic to the t score with N – 13 degrees of freedom.

With Budget

Without Budget

Budget and Can’t Wait to See the movie!

• Without budget among the variables, the popularity measure cntwait3 has a bigger impact than with budget included.

[Diagram nodes: Budget, Cntwait3, Other variables, Box office (box_mil)]

We know that Budget and Cntwait3 are correlated (an arrow in one direction, in the other, or both) because including Budget changes the coefficient of Cntwait3.
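The effect of leaving a correlated variable out can be reproduced on simulated data. This is a sketch under stated assumptions: the data are synthetic (not the movie data), the variable names only mimic the slide, and the "true" effects 1.0 and 0.5 and the 0.6 link between the two variables are made up for the illustration.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients [a, b1, ..., bK] with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
N = 5000
budget = rng.normal(size=N)
cntwait3 = 0.6 * budget + rng.normal(size=N)             # correlated with budget by construction
box = 1.0 * budget + 0.5 * cntwait3 + rng.normal(size=N)  # both variables affect box office

coef_with = ols(np.column_stack([budget, cntwait3]), box)[2]   # cntwait3 coefficient, budget included
coef_without = ols(cntwait3.reshape(-1, 1), box)[1]            # cntwait3 coefficient, budget left out

print("cntwait3 coefficient with budget:   ", round(float(coef_with), 3))
print("cntwait3 coefficient without budget:", round(float(coef_without), 3))
```

Without budget in the regression, the cntwait3 coefficient also picks up part of the budget effect, so it comes out larger. If the two variables were uncorrelated, dropping budget would leave the coefficient roughly unchanged.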

Outline

1. Multivariate regression

2. Interpreting coefficients: Ceteris Paribus

3. Standardized Coefficient

4. Multiple Correlation and R Squared

Next time: Multivariate regression (Continued)

Standardized Coefficient

We just saw:

• An increase in budget by 1 million $ leads to a rise in box office of 0.144 million $, all other things equal.

But is 1 million $ big? Is 0.144 million $ big?

• “A 1 standard deviation increase in x2 leads to a …. % standard deviation increase in y.”

• Standard deviation of x2 (budget): 42.9.

• Standard deviation of y (box office outcome): 17.5.

• Coefficient of budget: 0.144.

• Fill in the blank.
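One way to fill in the blank is to rescale the coefficient by the two standard deviations: standardized coefficient = b × SD(x) / SD(y). A short sketch with the numbers given above; the function name `standardized_coefficient` is an assumption for illustration.

```python
def standardized_coefficient(b, sd_x, sd_y):
    """Effect of a 1 standard deviation increase in x, measured in standard deviations of y."""
    return b * sd_x / sd_y

# Values quoted on the slide for budget (x2) and box office (y).
beta_budget = standardized_coefficient(b=0.144, sd_x=42.9, sd_y=17.5)
print(f"{beta_budget:.3f} standard deviations of y, i.e. about {100 * beta_budget:.0f}%")
```

With these inputs, a 1 standard deviation increase in budget goes with an increase of roughly 35% of a standard deviation in box office, all other things equal.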

Standardized Coefficient


Outline

1. Multivariate regression

2. Interpreting coefficients: Ceteris Paribus

3. Standardized Coefficient

4. Multiple Correlation and R Squared

Next time: Multivariate regression (Continued)

R Squared

• How good are we at predicting the success of a movie?

• The multiple correlation is 1 if we are absolutely correct in our predictions: ei = 0 for every movie.

• The multiple correlation is 0 if we do no better than taking the average: ei = yi – ȳ for every movie, where ȳ is the average box office.

R squared = ESS/TSS = 13356/18665 = 0.7156
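The ratio above can be checked directly, and the two extreme cases can be reproduced on toy numbers. In this sketch, ESS is read as the explained sum of squares and TSS as the total sum of squares (an assumption about the slide's notation), the multiple correlation is taken as the square root of R squared, and the helper `r_squared_from_data` and its toy inputs are hypothetical.

```python
import math

ESS = 13356   # explained sum of squares, from the slide
TSS = 18665   # total sum of squares, from the slide
r_squared = ESS / TSS
multiple_correlation = math.sqrt(r_squared)

def r_squared_from_data(y, y_hat):
    """R squared computed as 1 - RSS/TSS, which equals ESS/TSS for OLS with an intercept."""
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss

print("R squared:", round(r_squared, 4))
print("perfect predictions:", r_squared_from_data([1, 2, 3], [1, 2, 3]))
print("predicting the average:", r_squared_from_data([1, 2, 3], [2, 2, 2]))
```

Perfect predictions leave no residual, so R squared is 1. Predicting the average for every movie leaves the whole variation unexplained, so R squared is 0.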

Wrap up

• We can use a number of variables to explain a dependent variable.

• Multiple regression accounts for multiple causes.

• The coefficients minimize the sum of the squared residuals.

• Understand the t test and the p value.

• The coefficients should be understood “all other things equal” or “ceteris paribus”.

• The standardized coefficients express effects in terms of standard deviations.

• The R squared, between 0 and 100%, measures how accurate our predictions are.

Coming up:

• Schedule for next week:

• Chapter on “Association and Causality”, and “Multivariate Regression”.

• Make sure you come to sessions and recitations.

Sunday

Monday: Multivariate Regression

Tuesday: Multivariate Regression, the F test

Wednesday: Randomized Experiments and ANOVA

Thursday: Wrap up

Sessions, in the order listed:

Recitation, evening session, 7.30pm, West Administration 002

Usual class, 12.45pm, usual room

Evening session, 7.30pm, West Administration 001

Usual class, 12.45pm, usual room