ps5_fall+2015

9

Click here to load reader

description

ps5

Transcript of ps5_fall+2015

Page 1: ps5_fall+2015

Columbia University W3412

Department of Economics Fall 2015

Problem Set 5

Introduction to Econometrics

Profs. Seyhan Erden and Miikka Rokkanen

for all sections

Part I Consider the following results for a wage regression where lwage is the natural log of average

hourly earnings in US dollars, age is in years, female is a binary variable for gender, bachelor is

one for someone with bachelor degree and zero otherwise, femxbac and femxage are self

explanotary interaction variables.

Regression 1:

. reg lwage age female bachelor femxbac, r

Linear regression Number of obs = 15316

F( 4, 15311) = 861.64

Prob > F = 0.0000

R-squared = 0.1852

Root MSE = .50507

------------------------------------------------------------------------------

| Robust

lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0259277 .0014453 17.94 0.000 .0230947 .0287606

female | -.2257049 .010984 -20.55 0.000 -.2472348 -.2041749

bachelor | .3980052 .0114889 34.64 0.000 .3754857 .4205248

femxbac | .1082764 .0165844 6.53 0.000 .0757691 .1407838

_cons | 1.708492 .0434411 39.33 0.000 1.623342 1.793642

------------------------------------------------------------------------------

Regression 2:

. reg lwage female age bachelor femxage, r

Linear regression Number of obs = 15316

F( 4, 15311) = 840.78

Prob > F = 0.0000

R-squared = 0.1848

Root MSE = .5052

------------------------------------------------------------------------------

| Robust

lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

female | .331927 .0859364 3.86 0.000 .1634814 .5003726

age | .032957 .0019547 16.86 0.000 .0291256 .0367884

bachelor | .4437149 .0083212 53.32 0.000 .4274044 .4600254

femxage | -.0171776 .002895 -5.93 0.000 -.0228521 -.0115032

_cons | 1.481794 .0583113 25.41 0.000 1.367497 1.596092

------------------------------------------------------------------------------

Page 2: ps5_fall+2015

Using the appropriate regression please answer the following questions

(a) What is the estimated average wage difference between females with bachelor degrees

and males with bachelor degree? Explain your answer.

(b) What is the estimated average wage difference between females with bachelor degrees

and females without bachelor degree? Explain your answer.

(c) How would you test if there is a significance difference in wages of females with

bachelor degrees and males with bachelor degrees? Please write STATA commands

necessary, you can use any method you want, but you must write the null hypothesis first.

(d) How would you test if there is an intercept and if there is a slope difference in the two

estimated regression lines for males and females in regression 2? Write the null

hypothesis for each test.

(e) As people get older, does the wage gender gap widen? Draw a sketch graph (with wage

on vertical axis and age on horizontal axis) of what this result tells you.

(f) Is the coefficient of female in Regression 2, statistically significant at 1% significance

level? Please write the null hypothesis and calculate the test statistic before you answer

this question. Also make sure to give your reason for your answer.

Part II

1. What are the root causes of terrorism? Poverty? Repressive political regimes? Religious or

ethnic conflicts arising from heterogeneous populations?

In this problem set you will take a look at some empirical evidence on cross-country sources of

terrorism. Variables in the data set, terrorism.dta, are defined in Table 1. Note that, to do this

problem set, you will need to create (generate) some new variables, which are functions of the

variables in terrorism.dta.

Preliminary data analysis:

(a) Produce the scatterplot of ftmpop vs. gdppc.

(b) Generate the variables lnftmpop = log(ftmpop) vs. lngdppc = log(gdppc). Produce the

scatterplot of lnftmpop vs. lngdppc.

(c) Produce the scatterplot of lnftmpop v. lackpf.

(d) Using the scatterplots from (a) and (b), would you suggest using the variables (i) ftmpop

and gdppc or (ii) lnftmpop and lngdppc for modeling using linear regression?

(e) Using the scatterplot from (c), does the relation between lnftmpop and lackpf appear to be

linear or nonlinear? If nonlinear, what sort of nonlinear curve might you want to explore

(briefly explain)?

2. Estimate the regressions in Table 2 and fill in the empty entries. You may write in the entries

by hand or type them using the .doc electronic version of the table on the course Web site.

Note: for these regressions, only use the countries that have nonzero values of ftmpop.

Page 3: ps5_fall+2015

3. Use the results in Table 2 to answer the following questions.

(a) Using regression (1), test the hypothesis that the coefficient on lngdppc is zero, against

the alternative that it is nonzero, at the 5% significance level. Explain in words what the

coefficient means.

(b) Using regression (3), test the hypothesis that the coefficients on lngdppc and lngdppc2 are

both zero, against the alternative that one or the other coefficient is nonzero, at the 5%

significance level.

(c) Explain why the conclusions in (a) and (b) differ.

(d) Using regression (3), is there evidence that the relationship between lnftmpop and

lngdppc is nonlinear?

(e) Using regression (3), is there evidence that the relationship between lnftmpop and lackpf

is nonlinear?

(f) Using regression (5), test the null hypothesis (at the 5% significance level) that the

coefficients on the “other regional dummies” all are zero, against the alternative

hypothesis that at least one is nonzero. What is number of restrictions q in your test?

What is the critical value of your test?

(g) Using regression (4), discuss the evidence that ethnic diversity is associated with

increases in terrorism, holding constant GDP per capita, religious diversity, and a

measure of political freedoms? Interpret the sign of the coefficient on ethnic.

(h) Using regression (4), discuss the evidence that religious diversity is associated with

increases in terrorism, holding constant GDP per capita, ethnic diversity, and a measure

of political freedoms? Interpret the sign of the coefficient on religious.

(i) Using regression (4), test the hypothesis that the population coefficients on ethnic and

religion are both zero, against the alternative that one or the other coefficient is nonzero.

Explain in words what hypothesis you have tested, and what your conclusion is.

(j) Can you use regressions (3) and (4) to test the same hypothesis as in (i), this time using

the R2 formula for the homoskedasticity-only F statistic? Explain.

(k) Using regression (4), estimate the effect on lnftmpop of changing from lackpf = 7

(extremely limited political freedoms) to lackpf = 5 (some political freedoms), holding

constant the values of the other regressors in regression (4).

(l) Using regression (4), compute a 95% confidence interval for the effect you estimated in

part (k).

(m)Using regression (4), plot the estimated relationship between lnftmpop and lackpf (you

can do this in STATA or in Excel or by hand). At approximately what value of lackpf is

this relationship maximized?

(n) In words, briefly describe the relationship you found in part (m).

Page 4: ps5_fall+2015

Table 1

DATA DESCRIPTION, FILE: terrorism.dta

Variable Definition

Ftmpop Number of fatalities from terrorist incidents in the country, 1998-2004, per million population (U.S. State Department)

Evmpop Number of terrorist events in the country, 1998-2004, per million population (U.S. State Department)

Gdppc GDP per capita in the country (World Bank)

Lackpf Index of the lack of political freedoms (Freedom House), 1-7 scale, 7 = extremely limited political freedoms

language Index of linguistic fractionalization (0 to 1 scale, 0 = no fractionalization)

Ethnic Index of ethnic fractionalization (0 to 1 scale, 0 = no fractionalization)

Religion Index of religious fractionalization (0 to 1 scale, 0 = no fractionalization)

mideast, latinam, easteurope, africa, eastasia

= 1 if the country is in the indicated region, = 0 otherwise

STATA HINTS

STATA command Description generate lngdppc = ln(gdppc) creates a new variable lngdppc that is the natural

logarithm of gdppc. gen lngdppc2 = lngdppc* lngdppc creates a new variable that is the square of lgdppc reg yvar xvar1 xvar2 xvar3, robust

test xvar2 xvar3

This pair of commands first estimates a regression of yvar against xvar1, xvar2, and xvar3, then tests the joint hypothesis that the coefficients on xvar2 and xvar3 both equal zero, against the hypothesis that at least one does not, using heteroskedasticity-robust standard errors. Note: the test command always refers to the most recently run regression (or estimation) command.

Page 5: ps5_fall+2015

Table 2

Determinants of Terrorism (1) (2) (3) (4) (5)

Dependent variable: lnftmpop lnftmpop lnftmpop lnftmpop lnftmpop

Regressor:

lngdppc

( )

( )

( )

( )

( )

(lngdppc)2

__ __

( ) __ __

lackpf

__ ( )

( )

( )

( )

lackpf2

__ __

( )

( )

( )

ethnic __ __ __ ( )

( )

religion __ __ __ ( )

( )

Mideast

__ __ __ __ ( )

Other regional dummies (latinam, easteurope, africa, eastasia)?

No No No No Yes

Intercept

( )

( )

( )

( )

( )

F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero:

lngdppc, lngdppc2

__ __ ( )

__ __

lackpf, lackpf2

__ __ ( )

( )

( )

ethnic, religion __ __ __ ( )

( )

Other regional dummies __ __ __ __ ( )

Regression summary statistics 2R

R2

SER

n

Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated

coefficients, and p-values are given in parentheses under F- statistics. The F-statistics are

heteroskedasticity-robust. Coefficients are significant at the +10%, *5%, **1% significance

level. The “other regional dummies” included in regression (5) are latinam, easteurope, africa,

and eastasia (the omitted case is Western Europe combined with North America).

Page 6: ps5_fall+2015

4. Create a new binary variable higdppc, which equals one if gdppc is greater than or equal to

the median in the data set, and which equals zero otherwise; also create the interaction

variables hi_lackf = higdppclackpf and hi_lackf2 = higdppclackpf 2. Estimate the

regressions in Table 3 and fill in the empty entries. You may write in the entries by hand or

type them using the .doc electronic version of the table on the course Web site. Note: for all

calculations, only use the countries that have nonzero values of ftmpop and non-missing

values of gdppc.

5. Regression (6) produces two regression lines, one for higdppc = 0 and one for higdppc = 1.

Produce a scatterplot of lnftmpop vs. lackpf, showing the two regression lines. This can be

done either by producing two scatterplots, one for higdppc = 0 and one for higdppc = 1, or by

combining the two scatterplots into a single graph. Use regression (6) to write out the

estimated regression lines for the two groups (in slope-intercept form).

6. Use the results in Table 3 to answer the following questions.

(a) Is the difference between the two slopes plotted in the scatterplot of Question 5

statistically significantly different from zero at the 5% significance level? Explain.

(b) In a sentence or two, interpret the sign of the coefficients in regression (6) and the

scatterplot(s) in Question 5; that is, explain in everyday terms the findings shown in that

scatterplot.

(c) Using regression (7), test the hypothesis that the coefficients on higdppc lackpf and

higdppc lackpf 2 are zero, against the hypothesis that one or the other (or both) is

nonzero. State in words what the hypothesis is that you are testing.

(d) Using regression (7), test the hypothesis that the coefficients on lackpf2 and higdppc

lackpf2 are zero, against the hypothesis that one or the other (or both) is nonzero. State in

words what the hypothesis is that you are testing.

(e) Are the coefficients on the regional binary variables in regression (9) jointly statistically

significant at the 5% significance level? Explain.

7. One theory is that ethnic and religious diversity leads to strife and terrorism when economic

resources are poor, but if overall economic conditions are strong then ethnic and religious

diversity are more readily tolerated. Do regressions (8) and (9) support this theory? Explain.

8. Write a paragraph summarizing your findings about the relation between terrorist fatalities

and economic conditions, political freedoms, and ethnic and religious diversity. Your

discussion should be based on your results in Table 2 and Table 3 (that is, on the empirical

evidence, not your opinions), however you do not need to refer to specific regressions.

9. Assess the following threats to the internal validity of your analysis in Question 8.

Specifically, in your judgment are the following threats plausibly important in this

application? Why or why not? Be brief but specific.

(a) Omitted variable bias:

(b) Functional form misspecification:

(c) Errors-in-variables bias:

(d) Sample selection bias:

(e) Simultaneous causality bias:

Page 7: ps5_fall+2015

Table 3 Problem Set 5 Determinants of Terrorism, ctd. (6) (7) (8) (9)

Dependent variable: lnftmpop lnftmpop lnftmpop lnftmpop

Regressor:

higdppc

( )

( )

( )

( )

lackpf

( )

( )

( )

( )

lackpf2

__

( )

( )

( )

higdppc lackpf

( )

( )

__ __

higdppc lackpf2

__ ( )

__ __

ethnic __ __ ( )

( )

religion __ __ ( )

( )

higdppc ethnic __ __ ( )

( )

higdppc religion __ __ ( )

( )

Mideast

__ __ __ ( )

Other regional dummies (latinam, easteurope, africa, eastasia)?

No No No Yes

Intercept

( )

( )

( )

( )

F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero:

higdppc lackpf,

higdppc lackpf2

__ ( )

__ __

lackpf2, higdppc lackpf

2

__

( ) __ __

lackpf, lackpf2

__ ( )

( )

( )

higdppc ethnic, higdppc religion __ __ ( )

( )

ethnic, religion,

higdppc ethnic, higdppc religion

( )

( )

Other regional dummies __ __ __ ( )

Regression summary statistics: 2R

R2

SER

n Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated coefficients, and p-values are given in

parentheses under F- statistics. The F-statistics are heteroskedasticity-robust. Coefficients are individually statistically

significant at the +10%, *5%, **1% significance level. The “other regional dummies” included in regression (5) are latinam,

easteurope, africa, and eastasia (the omitted case is Western Europe combined with North America).

Page 8: ps5_fall+2015

Following questions will not be graded, they are for you to practice and will be discussed at

the recitation:

10. US states differ in the generosity of their welfare programs. We here wish to analyze which factors

play a role in the level of benefits across different states. The data set TANF2.dta contains data from

each of 49 states. The variables in the data set are given in the following table:

Table 3

DATA DESCRIPTION, FILE: TANF2.dta

Variable Definition tanfreal State’s real maximum benefit for single parent with three kids.

black Percentage of state’s population who are African Americans.

blue Dummy variable, equals 1 if state voted Democratic in 2004

presidential election.

mdinc State’s median income.

west

= 1 if state is in West

= 0 otherwise

south = 1 if state is in South.

= 0 otherwise.

midwest = 1 if state is in Midwest

= 0 otherwise

northeast = 1 if state is in Northeast

=0 otherwise

Use data set TANF2.dta to examine whether Midwest states differ in their welfare programs from other

states. To do this, we will use the following regression model:

tanfreal = β0 + β1 black + β2 blue + β3 midwest + β4 (black*midwest) + β5 (blue*midwest) + u

Here, black*midwest is the product of the regressors black and midwest and so forth.

(a) Write the null hypothesis to test whether there is a difference between the welfare programs of

Midwest states and all other states, explain.

(b) Construct new set of interaction regressors in STATA. Estimate the model above. Write your

answer as a regression equation with standard errors in parenthesis underneath each coefficient.

Perform the test for the null hypothesis in part (a) with a robust F-test. What is your conclusion?

(c) Introduce a new variable nonmidwest = 1 – midwest. That is nonmidwest = 1 if a state is not in the

Midwest and zero otherwise. Consider the following alternative regression model:

tanfreal = γ1 nonmidwest + γ2 (black*nonmidwest) + γ3 (blue*nonmidwest) + γ4 midwest

+ γ5 (black*midwest) + γ6 (blue*midwest ) + u

Write up the hypothesis of no differences in welfare programs in terms of γ1.... γ6

What is the relationship between the parameters γ1.... γ6 in this new model and β1.... β6

in the previous model? Estimate the model in STATA and write the result in usual regression

equation form with standard errors in parentheses underneath coefficients.

Page 9: ps5_fall+2015

(d) What happens if you include an intercept γ0 in the model in part (c)? Explain.

11. SW exercise 8.2

12. SW Exercise 8.10

13. SW Exercise 9.6

14. SW Empirical Exercise 8.2

15. SW Empirical Exercise 9.2