Econometrics Project


Prepared for Module CB9016

“Applied Econometrics”

by

Carlos Ferreira

Submitted on the 16th March, 2009

1a) Looking at the dataset first, we realise there's a variable that accounts for total production (out) and a host of variables giving quantities of inputs used. We also note that the total capital expenditure is not included as one variable, but as several (fert, fodd, mach and cap). Finally, the variables age, soilc and soils are not continuous, suggesting their usage as dummies.

To examine the possible direction and magnitude of the impact of each regressor on the regressand, we plotted each of the variable pairs. The plots revealed two cases that are consistently outliers. Although roughly in line with the expected regression curves, the two largest farms show very large outputs and very large input use compared with the rest of the sample, lying more than four standard deviations from the mean. As a result, and at the risk of over-reacting to a potentially small problem, we chose to eliminate these two cases from the analysis.
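The screening rule described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the procedure actually run for the project: the function name `flag_outliers`, the toy data and the choice to apply the four-standard-deviation cut-off to a single variable are all assumptions.

```python
import numpy as np

def flag_outliers(x, k=4.0):
    """Return a boolean mask: True where x lies more than k sample
    standard deviations away from the sample mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

# Hypothetical output values: 30 ordinary farms plus one very large farm.
rng = np.random.default_rng(0)
out = np.append(rng.normal(100.0, 10.0, size=30), 1000.0)

mask = flag_outliers(out)      # True only for the very large farm
trimmed = out[~mask]           # sample with the flagged case removed
```

Because the mean and standard deviation are themselves inflated by extreme cases, a rule of this kind is conservative in small samples.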

For the variable land, we expect a strong, positive and linear link with the output. The same applies to the variable labour, but in this case we expect the coefficient to be higher than the one for land. We also expect a large, positive relation between fertilizer and output. In the case of fodder, the plot shows that some farmers use it, while others don't. This might result in pronounced heteroscedasticity if fodder were included as a stand-alone variable in the model. The best way to use this variable is as part of a composite total capital variable. In the case of machinery, we expect a strong positive relation with output as well.

We created a total capital variable, tc:

$$tc_i = fert_i + fodd_i + mach_i + cap_i$$

The variable tc accounts for the total capital expenditure on the farm. Plotting tc against the output suggests a strong, positive relation.

As for the age of the farmer, the resulting plot suggests older farmers may obtain a larger output. The plots for clay soil and sandy soil don't show much difference between the two conditions for each variable. The charts, however, can't provide information concerning the interaction between them.

As a consequence of the above, we suggest an economic model in which there is a linear relation between the revenue obtained from the output and the quantity of the various inputs. We would also expect a diminishing marginal product of the various inputs: increasing any of them will increase revenue, but at a diminishing rate. The total revenue per farm (out) is the dependent variable (regressand), and the various inputs (land, labor, tc, age, soilc, soils) are the independent variables (regressors). The expected relations are all positive: as any of the independent variables increases, so will the revenue. We believe the resulting relation will be linear in the parameters, which means that the coefficient corresponding to each variable will be constant (a βi coefficient).

1b.i) The coefficients β1 and β5 represent the partial elasticity of output with respect to (respectively) the amount of land used and the amount of capital (cap) used.

We expect coefficient β1 to be positive, reflecting the fact that, ceteris paribus, the larger the amount of land, the larger the amount produced and, consequently, the larger the revenue. Likewise, we expect a positive sign for β5, reflecting the fact that more capital will probably lead to more production and larger revenue, ceteris paribus.

Concerning the respective magnitudes, the predictions are not so clear-cut. Both inputs will theoretically have diminishing returns, but our model fails to account for that, by calculating a constant elasticity whatever the amount of land or capital used. In a setting of modern agriculture production, it's probably easier to increase capital than to increase land, when the objective is obtaining a larger revenue. Because of this, we expect β1 to be smaller than β5.

1b.ii) Most of the problems we may find come from the possibility that the regression violates some of the assumptions of the Classical Linear Regression Model. One potential problem in this model is heteroscedasticity: the conditional variances of the error term being different. Some possible causes of heteroscedasticity in this case include differences in the precision of measurement methods (it is not clear that all farms take the same care and precision in recording their activities), outliers (as mentioned before, there is a small number of cases that can be considered outliers), incorrect specification of the regression model and wrong choice of functional form.

Another potential problem arising in this case is multicollinearity. Some of the variables might be functionally linked: for instance, capital and labour have the amount of land built into them, because the larger the amount of land, the more capital and labour employed, ceteris paribus.

We also consider there may be an omitted variable bias: the model excludes variables potentially important, regarding the quality of the soil.

Finally, it is not at all clear that the model is correctly specified, since we may be ignoring important variables, the functional form may not be the most adequate and some of the probabilistic assumptions about the variables may not be correct.

1c) We have estimated model A, and obtained the following results:

Coefficients:
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -244238.83      79844.4     -3.06    0.00245   **
land           -1179.65       545.93     -2.16    0.03159   *
lab               89.37        31.78      2.81    0.00528   **
fert             183.39         31.9      5.75   2.44e-08   ***
mach              164.9        20.69      7.97   4.48e-14   ***
cap              465.59        29.49     15.79    < 2e-16   ***

Residual standard error: 644700 on 269 degrees of freedom
Multiple R-squared: 0.8827, Adjusted R-squared: 0.8805
F-statistic: 404.8 on 5 and 269 DF, p-value: < 2.2e-16

All the coefficients in the model are significant, so the problem of a high R-squared with low t values doesn't apply. However, the problem did apply to an alternative dataset, where the two outliers were not eliminated. The remainder of question 1c refers to results and tests conducted for that alternative model A.

One of the possible reasons for a high R-squared but low t-values is a high degree of multicollinearity. The literature suggests that an R-squared of more than 0.8, combined with slope coefficients that are not statistically different from 0, could indicate a high degree of multicollinearity. In this case, only one coefficient is in that situation (the one for land), but we decided to test further for a high degree of multicollinearity.

To do so, we tested for high pairwise correlations among the regressors. The literature suggests multicollinearity could be an important issue if the zero-order correlations are higher than 0.8. Running these pairwise regressions yielded two values of R-squared larger than 0.8:
Y = land, X = fert: R2 = 0.81
Y = lab, X = fert: R2 = 0.81

Since the first criterion is too strong and the second is a sufficient but not a necessary condition, we decided to apply a third test and run the auxiliary regressions, regressing each Xi on the remaining X variables. To simplify the analysis, we follow Klein's rule of thumb, which states that multicollinearity might be a problem if the adjusted R-squared of any of these auxiliary regressions is larger than the adjusted R-squared of the overall regression. In this case, we obtained values of adjusted R-squared between 0.65 and 0.89, all smaller than the adjusted R-squared of the overall regression, so this test points to no meaningful multicollinearity.
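Klein's rule of thumb as described above can be sketched in a few lines. This is an illustrative Python/NumPy version run on synthetic data, not the code used for the project; the helper names (`r_squared`, `adj_r_squared`, `klein_flags`) and the data are hypothetical.

```python
import numpy as np

def r_squared(y, X):
    """R-squared from an OLS regression of y on X plus an intercept."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def adj_r_squared(y, X):
    """Adjusted R-squared, penalising the number of slope coefficients."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    return 1.0 - (1.0 - r_squared(y, X)) * (n - 1) / (n - k - 1)

def klein_flags(y, X):
    """Regress each column of X on the remaining columns and flag it when
    the auxiliary adjusted R-squared exceeds the overall adjusted R-squared."""
    X = np.asarray(X, dtype=float)
    overall = adj_r_squared(y, X)
    aux = [adj_r_squared(X[:, j], np.delete(X, j, axis=1))
           for j in range(X.shape[1])]
    return overall, aux, [a > overall for a in aux]

# Synthetic data (hypothetical): x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x3 + rng.normal(scale=1.0, size=200)

overall, aux, flags = klein_flags(y, X)   # x1 and x2 are flagged, x3 is not
```

The rule is only a heuristic: it compares two goodness-of-fit measures and carries no formal significance level.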

Overall, the first two tests point to potential multicollinearity (the first, arguably, not so much, since there is only one slope not significantly different from 0), while the third does not. The general impression is that multicollinearity could be an issue, but we choose not to act on it, instead investigating the possibility of another functional form that includes all capital-related variables, potentially curing the issue and better describing the data.

1d) The first possible solution is the one we adopted for model A: excluding the outliers from the analysis.

Another potential remedy for a significant multicollinearity is to drop one variable from the analysis – in the case of model A, the variable dropped would be the amount of fertilizer, part of the potentially multicollinear capital block, and highly correlated to both land and labour. However, that could induce a problem of omitted variable bias: the model could be incorrectly specified. Fertilizer is a theoretically important determinant of the total quantity produced and, consequently, of the total revenue obtained.

Besides, dropping the variable will result in an overestimation of the absolute value of the coefficients associated with labour, machinery and capital. Since the estimators, even under high multicollinearity, are BLUE, omitting this variable will bias the parameter estimates and consequently affect the values predicted by the regression. Another variable whose exclusion from the analysis could result in specification bias is fodder, which is capital-related.

We first estimated a model where land, labour and all capital-related variables are included (model A1):

Coefficients:
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -281415.32     61360.28     -4.59   6.93e-06   ***
land             549.86       437.67      1.26       0.21
lab               128.2        24.56      5.22   3.59e-07   ***
fert             192.12         24.5      7.84   1.06e-13   ***
mach             142.24        15.97      8.91    < 2e-16   ***
cap              128.23        33.42      3.84   0.000155   ***
fodd              85.91         6.26     13.73    < 2e-16   ***

Residual standard error: 495000 on 268 degrees of freedom
Multiple R-squared: 0.9311, Adjusted R-squared: 0.9296
F-statistic: 603.6 on 6 and 268 DF, p-value: < 2.2e-16

Model A1 provides a good fit, besides being in accordance with theory. However, the fact that four different variables account for capital expenditure could lead to unwanted complications and, potentially, multicollinearity (note that the coefficient for land is not significant). Since the units of all capital-related variables are the same, we can sum them to produce a total capital variable (tc) and test the model (A2):

Coefficients:
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -3.49E+005    6.36E+004     -5.49   9.14e-08   ***
land          2.24E+003    3.27E+002      6.86   4.75e-11   ***
lab           1.91E+002    2.22E+001      8.58   7.36e-16   ***
tc            9.67E+001    3.46E+000     27.91    < 2e-16   ***

Residual standard error: 523200 on 271 degrees of freedom
Multiple R-squared: 0.9222, Adjusted R-squared: 0.9213
F-statistic: 1070 on 3 and 271 DF, p-value: < 2.2e-16

An alternative is to treat the model as a short-run production model, taking revenue as a proxy for the quantity produced and using its logarithm as the regressand. The regressors would be the reciprocals of land, labour and total capital, giving a Logarithmic Reciprocal Model (model A3):

Coefficients:
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   1.57E+001    5.11E-002    306.04    < 2e-16   ***
I(1/land)    -3.61E+001    7.70E+000     -4.68   4.48e-06   ***
I(1/lab)     -1.65E+003    2.47E+002     -6.67   1.46e-10   ***
I(1/tc)      -4.49E+003    3.19E+002    -14.07    < 2e-16   ***

Residual standard error: 0.3421 on 271 degrees of freedom
Multiple R-squared: 0.8228, Adjusted R-squared: 0.8208
F-statistic: 419.4 on 3 and 271 DF, p-value: < 2.2e-16

From a purely statistical point of view, the better model is the one with the highest goodness of fit (adjusted R-squared). We note that we cannot compare models with different dependent variables: models A1 and A2 use out as the dependent variable, while model A3 uses log(out). As a result, we can only compare the adjusted R-squared of the first two models. The adjusted R-squared of A1 is marginally higher, so A1 would be the statistical choice.

From an economics point of view, and as long as we accept that revenue is a good representation of quantity produced and that the data pertain to the short run, we believe our short-run production function (A3) is the best suited. Although its R-squared is smaller than (and not directly comparable with) that of the linear functions, it makes sense to believe all three inputs have a diminishing marginal product. All the signs are in the expected direction, all the t values are statistically significant and the overall F statistic is also sufficiently high, so we consider the model significant overall.
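The diminishing marginal product implied by this functional form can be checked directly. The following is a sketch of the derivation for land, holding labour and total capital fixed; the symbols γ0 to γ3 are introduced here only for illustration and do not appear in the estimation output.

```latex
\begin{align*}
\ln(out_i) &= \gamma_0 + \gamma_1 \frac{1}{land_i} + \gamma_2 \frac{1}{lab_i}
              + \gamma_3 \frac{1}{tc_i} + u_i \\
\frac{\partial\, out}{\partial\, land} &= -\gamma_1 \frac{out}{land^{2}} \\
\frac{\partial^{2} out}{\partial\, land^{2}} &= \gamma_1 \frac{out}{land^{4}}
              \left(\gamma_1 + 2\, land\right)
\end{align*}
```

With γ1 < 0, as estimated, the marginal product of land is positive, and it is decreasing whenever land > −γ1/2. Using the point estimate of −36.1 from model A3, this holds for land above roughly 18 units of the land variable; below that level the fitted curve is convex.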

From an econometrics point of view, the chosen model should be the one that better describes reality and makes economic sense. Given what we have discussed above, we believe that is model A3.

1e) A Cobb-Douglas production function, in stochastic form, is expressed as:

$$Q_i = \beta_1 L_i^{\beta_2} K_i^{\beta_3} e^{u_i}$$

where Q=quantity produced, L=amount of labour used and K=amount of capital used. In the present case, we do not have the data concerning the quantity produced, only the revenue. With the market prices, we could transform it into the produced quantity; this means the quantity produced is proportional to the revenue, so we can use the revenue as quantity produced in our model.

Taking logs of both sides of the equation, we obtain:

$$\ln Q_i = \ln \beta_1 + \beta_2 \ln L_i + \beta_3 \ln K_i + u_i$$

Since fertilizer, fodder, machinery and capital are all capital-related variables, and in the same standardized unit, we shall add them, using the total capital variable described before (tc). The result of the linear regression, model B, is:

Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)       3.78         0.29      13.2    < 2e-16   ***
I(log(lab))       0.48         0.06      8.56   8.57e-16   ***
I(log(tc))        0.72         0.04     20.25    < 2e-16   ***

Residual standard error: 0.2808 on 272 degrees of freedom
Multiple R-squared: 0.8802, Adjusted R-squared: 0.8793
F-statistic: 998.9 on 2 and 272 DF, p-value: < 2.2e-16

From a statistical point of view, Model B has a high adjusted R-squared, making it a potentially good model. All coefficients have the expected signs and high t values, making them significant. Output has a constant elasticity of 0.48 with respect to labour, holding total capital constant (a 1% increase in the quantity of labour results in a 0.48% increase in revenue), and a constant elasticity of 0.72 with respect to total capital, holding the amount of labour constant.

In a Cobb-Douglas production function, the sum of the coefficients β2 and β3 tells us whether the production has constant, increasing or decreasing returns to scale, if it equals, is larger or is smaller than 1, respectively. In our estimation, the sum of the estimated coefficients is 1.20 (0.48+0.72). We test the hypothesis that the sum of the real coefficients equals 1:

H0: β2 + β3 = 1
H1: β2 + β3 ≠ 1

   Res. Df     RSS    Df   Sum of Sq      F      Pr(>F)
1      272   21.45
2      273   23.83    -1       -2.37   30.1   9.404e-08   ***
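The F statistic in this table can be reproduced from the two residual sums of squares: row 1 is the unrestricted model B and row 2 the model with the restriction imposed. The sketch below is illustrative Python, not the software output itself, and the function name `restriction_f` is hypothetical.

```python
def restriction_f(rss_restricted, rss_unrestricted, n_restrictions, df_unrestricted):
    """F statistic for linear restrictions, computed from the residual
    sums of squares of the restricted and unrestricted regressions."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / df_unrestricted
    return numerator / denominator

# Rounded values taken from the table above: one restriction, 272 residual
# degrees of freedom in the unrestricted model.
f_crs = restriction_f(23.83, 21.45, 1, 272)
```

With the rounded inputs the result is about 30.2, in line with the reported 30.1, which is based on unrounded sums of squares.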

A highly significant F value suggests we should reject H0 – the production does not enjoy constant returns to scale. From the sum of the estimated coefficients, we believe the production enjoys increasing returns to scale. To demonstrate this, we test a new hypothesis:

H0: β2 + β3 = 1.2
H1: β2 + β3 ≠ 1.2

   Res. Df     RSS    Df   Sum of Sq      F   Pr(>F)
1      272   21.45
2      273   21.45    -1           0   0.02      0.9

An F value that is not statistically significant means we cannot reject H0. The data are therefore consistent with β2 + β3 = 1.2; together with the rejection of β2 + β3 = 1, this indicates that production enjoys increasing returns to scale.
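The link between the sum of the exponents and returns to scale follows from scaling both inputs by a common factor λ > 0 (a symbol introduced here for illustration):

```latex
\begin{align*}
Q(\lambda L, \lambda K) &= \beta_1 (\lambda L)^{\beta_2} (\lambda K)^{\beta_3} e^{u}
  = \lambda^{\beta_2 + \beta_3}\, Q(L, K) \\
\lambda = 2,\ \beta_2 + \beta_3 = 1.20 &\;\Rightarrow\; 2^{1.20} \approx 2.30
\end{align*}
```

Under the point estimates, doubling both labour and total capital would therefore be associated with revenue rising by a factor of about 2.3 rather than 2.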

We then create model C by dividing all the variables in model B by the variable land and taking logs of the result. We decided to do this because, in a competitive setting, all producers are optimizing the level of input usage; as a result, all of them will be using a similar level of capital and labour per unit of land.

Coefficients:
                   Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            6.05         0.15     39.29     <2e-16   ***
I(log(lab/land))       0.04         0.05      0.71       0.48
I(log(tc/land))        0.74         0.04     21.07     <2e-16   ***

Residual standard error: 0.2844 on 272 degrees of freedom
Multiple R-squared: 0.6976, Adjusted R-squared: 0.6954
F-statistic: 313.7 on 2 and 272 DF, p-value: < 2.2e-16

Models B and C are quite similar in terms of functional form. The first question that must be asked is which makes more sense in economic theory. Model B only includes capital and labour as determinants of the quantity produced, whereas Model C accounts for the land as well. In an agricultural production setting, it makes sense to account for land, and model C does it.

We first test the models for heteroscedasticity. Since the visual method is not conclusive, we tried some more formal methods, starting with the Breusch-Pagan test. The results were the following:

data: B   BP = 21.4954, df = 2, p-value = 2.149e-05
data: C   BP = 21.3869, df = 2, p-value = 2.269e-05

We reject the null hypothesis of homoscedasticity. We proceeded to apply the Goldfeld-Quandt test:

data: B   GQ = 2.0398, df1 = 135, df2 = 134, p-value = 2.237e-05
data: C   GQ = 2.4099, df1 = 135, df2 = 134, p-value = 2.728e-07

Again, in both cases we reject the null hypothesis of homoscedasticity. We thus conclude that both regressions suffer from heteroscedasticity.
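For readers who want to see what the Breusch-Pagan statistic measures, the sketch below implements one common variant (the studentized, or Koenker, form) in Python with NumPy. It is an assumption that this variant corresponds to the one reported above; the function names and the synthetic data are hypothetical. The closed-form p-value applies only to 2 degrees of freedom, which is the case for models B and C.

```python
import math
import numpy as np

def ols_residuals(y, X):
    """Residuals from an OLS regression of y on X plus an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

def breusch_pagan_lm(y, X):
    """Studentized Breusch-Pagan statistic: n times the R-squared from
    regressing the squared OLS residuals on the regressors."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    e2 = ols_residuals(y, X) ** 2
    v = ols_residuals(e2, X)
    r2 = 1.0 - (v @ v) / np.sum((e2 - e2.mean()) ** 2)
    return len(y) * r2

def chi2_sf_df2(stat):
    """Upper-tail probability of a chi-squared variable with exactly
    2 degrees of freedom, which has the closed form exp(-stat / 2)."""
    return math.exp(-stat / 2.0)

# p-value implied by the BP statistic reported for model B.
p_model_b = chi2_sf_df2(21.4954)

# Synthetic example (hypothetical data): the error spread grows with x1.
rng = np.random.default_rng(2)
X = rng.uniform(1.0, 10.0, size=(500, 2))
y = 1.0 + X[:, 0] + X[:, 1] + rng.normal(size=500) * X[:, 0]
lm = breusch_pagan_lm(y, X)
```

Evaluating `chi2_sf_df2(21.4954)` gives approximately 2.149e-05, matching the p-value reported for model B.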

Next, we test for autocorrelation, using the Durbin-Watson test:

data: B   DW = 1.4271, p-value = 6.783e-07
data: C   DW = 1.1591, p-value = 1.041e-12

In both cases, the regressions seem to exhibit significant first-order autocorrelation. To understand how robust these results are, we then test for autocorrelation using the Breusch-Godfrey test for serial correlation of order 1:

data: B   LM test = 23.0077, df = 1, p-value = 1.614e-06
data: C   LM test = 51.0209, df = 1, p-value = 9.139e-13

Again, both models suffer from significant autocorrelation.

Finally, we tested both models for multicollinearity, using pairwise correlations and auxiliary regressions. We concluded, by both methods, that neither B nor C suffers from multicollinearity.
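The Durbin-Watson statistic used in the autocorrelation tests above has a simple definition, sketched here in Python with NumPy. The function name and the toy residual series are hypothetical and serve only to show how the statistic behaves.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson d: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Perfectly alternating residuals (extreme negative autocorrelation).
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])   # 12 / 4 = 3.0

# Slowly drifting residuals (positive autocorrelation) give d close to 0.
d_trending = durbin_watson([1.0, 1.1, 1.2, 1.3, 1.4])
```

By the usual approximation d ≈ 2(1 − ρ), the reported values of 1.43 and 1.16 correspond to first-order residual correlations of roughly 0.29 and 0.42 for models B and C respectively.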

Overall, we believe model B should be chosen. It is simpler and accounts for reality in a satisfactory way. Model C accounts for the influence of land, but it is not clear that this gives any advantage in the analysis, and it brings about a reduced predictive capacity.

Besides, model B gives us a useful measure of returns to scale in this setting.

In any case, model B gives an estimate of the percentage change in output for a percentage change in capital or labour, while model C gives the same prediction per unit of land. Both can be useful.

1f) In creating a dummy variable, we must make sure that, for any m conditions of the benchmark category, we create m-1 variables. Looking at the dataset first, we predict that the variables soilc and soils will be dummy variables. Either soil characteristic will have a different impact in the quantity produced (and, consequently, in the total farm revenue, ceteris paribus). However, there are also situations where the soil is both “clay” and “sandy”, and other situations where it is neither of those characteristics. As a result, the model will need one more dummy variable, to account for the situation where the soil is both; this means all

We begin by analysing the age of the farmer. The benchmark category has two different conditions: up to (and including) 40 years old; and over 40. So, we will create one dummy variable, age2, with two conditions:
0 if age <= 40
1 if age > 40

As for the type of soil, we identify 4 different conditions: clay soil; sandy soil; clay and sandy soil; and soil neither clay nor sandy. So, we will define three dummy variables: clay (1=yes, 0=no), sandy (1=yes, 0=no) and both (clay*sandy). This last variable is intended to give us a measure of the interaction between the types of soil.
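The construction of these dummies can be illustrated with a short Python/NumPy sketch. The four records below are hypothetical, and it is an assumption that age is available in years; in the actual dataset the variable may be coded differently.

```python
import numpy as np

# Hypothetical raw records: farmer age in years and two soil indicators.
age = np.array([35, 40, 41, 58])
clay = np.array([1, 0, 1, 0])
sandy = np.array([0, 1, 1, 0])

# age2 = 1 for farmers older than 40, 0 otherwise.
age2 = (age > 40).astype(int)

# Interaction dummy: 1 only when the soil is both clay and sandy.
both = clay * sandy
```

With the intercept capturing soil that is neither clay nor sandy, the three soil dummies cover the four conditions without falling into the dummy variable trap.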

With these variables defined, we estimate model D:

Coefficients:
                   Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            5.84         0.32     18.11     <2e-16   ***
I(log(lab/land))       0.02         0.05      0.47       0.64
I(log(tc/land))        0.76         0.03     22.32     <2e-16   ***
age                       0            0     -0.97       0.33
clay                   0.36         0.27      1.34       0.18
sandy                  0.16         0.27      0.59       0.56
I(clay * sandy)       -0.31         0.28     -1.14       0.26

Residual standard error: 0.2702 on 268 degrees of freedom
Multiple R-squared: 0.7311, Adjusted R-squared: 0.7251
F-statistic: 121.4 on 6 and 268 DF, p-value: < 2.2e-16

Model D, although significant overall and with a high adjusted R-squared, has only two significant coefficients: the intercept and total capital per hectare.

To finish, we compare models C and D using an anova. This is possible because both models have the same regressand and use the same sample. The models are significantly different (F = 8.3429, p = 2.345e-06).

To conclude, we believe model B is the one that best describes reality. Being a Cobb-Douglas production function, it accounts for both capital and labour as inputs for agricultural production. As we have shown, there are increasing returns to scale; if there were not, models B and C would be equivalent, with C accounting for the increase in production per hectare.

2a) We build an economic model to predict a farmer's decision on whether to join a water community: given the levels of a number of variables, the model predicts whether or not a specific farmer will join.

The economic model includes various factors that affect the probability of the farmer joining the water community. These factors can be socio-economic (education, age, gender), related to the cost of irrigation (total area farmed, percentage of total irrigated area and share of crops in total revenue), or related to the kind of irrigation used (furrow, sprinkler, flood, irrha, irrpc).

This is a model in which, if the combined effect of these factors is over a certain threshold, the probability of the farmer joining the water community equals 1 (he will join it); and, if it is below that threshold, the probability equals 0 (he won't join it).

2b) This model cannot be linear, because our dependent variable is qualitative, so OLS estimation would be meaningless. We therefore resort to a qualitative response model.

The probability of belonging to the water community depends on an implicit utility index (which we call Ui), calculating the utility one farmer obtains from belonging to the water community. This index is a linear function of the several variables discussed above: socio-economic, cost of irrigation and type of irrigation, such that:

Ui = β1 + β2*totalha + β3*crops + β4*furrow + β5*sprinkler + β6*flood + β7*furrow*sprinkler + β8*furrow*flood + β9*sprinkler*flood + β10*furrow*sprinkler*flood + β11*irrha + β12*irrpc + β13*gender + β14*age + β15*education + ui

Note that the irrigation variables (furrow, sprinkler and flood) are dummy variables; and that farmers can use just one of them, combine two different kinds of irrigation, combine all three kinds or use no irrigation at all. As a result, our benchmark variable (type of irrigation used) has eight categories and consequently we use seven different variables to account for all possibilities.

The probability of joining the water community (Pi) is a function of the implicit utility Ui. If Ui exceeds a certain threshold (we call it Ui*) the farmer will join the water community; otherwise, he will not:

$$P_i = P(\text{member} = 1 \mid U_i) = P(U_i^* \le U_i) = P(Z_i \le U_i) = F(U_i)$$

where Zi is a standard normal variable and F is the standard normal cumulative distribution function.
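To make the mapping from the utility index to a probability concrete, the sketch below evaluates the standard normal CDF (the probit case described above) and, for comparison, the logistic CDF used by the logit model. It relies only on Python's standard library; the function names and index values are hypothetical and are not estimates from this project.

```python
import math

def probit_probability(u):
    """Standard normal CDF evaluated at the index u, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def logit_probability(u):
    """Logistic CDF evaluated at the index u."""
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical index values: both functions map any real index into (0, 1).
for u in (-2.0, 0.0, 2.0):
    print(u, round(probit_probability(u), 4), round(logit_probability(u), 4))
```

Both functions return 0.5 at an index of zero and differ mainly in the tails, which is why the two models tend to give similar predictions.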

We can calculate Pi by three different methods: a Linear Probability Model (LPM), a logit model or a probit model. Since the LPM is plagued by problems (heteroscedastic errors and fitted probabilities that can fall outside the 0-1 range), we chose to estimate only the logit and the probit models.

Choosing between the logit and the probit model is a difficult task. Both estimates are quite similar, and produce similar predictions. In this case, we used the Akaike Information Criterion: the model with the smallest AIC was chosen. Again, the models were closely matched: the AIC for logit was 125.03, while the AIC for probit was 124.42. We chose the probit model as the best match.
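As a rough check on these figures, the AIC can be recovered from the estimation output. The sketch below assumes individual (ungrouped) binary data, for which the residual deviance D equals minus twice the maximised log-likelihood; k denotes the number of estimated coefficients, a symbol introduced here.

```latex
\begin{align*}
\mathrm{AIC} &= -2 \ln \hat{L} + 2k = D + 2k \\
\mathrm{AIC}_{\text{probit}} &= 94.42 + 2 \times 15 = 124.42
\end{align*}
```

The residual deviance of 94.42 and the 15 coefficients (intercept included) are those reported for the probit model in 2c.i.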

2c.i) We obtained the following estimates for the probit model

                           Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)              -1.63E+000    2.10E+000     -0.77       0.44
totalha                   6.50E-002    8.65E-002      0.75       0.45
crops                    -8.57E-004    1.81E-002     -0.05       0.96
furrow                    2.60E+000    7.18E-001      3.62   0.000293   ***
sprinkler                 1.79E+000    6.91E-001      2.59   0.009557   **
flood                     2.74E+000    8.17E-001      3.35   0.000806   ***
irrha                     2.88E-001    2.21E-001       1.3       0.19
irrpc                     2.99E-001    6.84E-001      0.44       0.66
gender                    4.22E-002    4.81E-001      0.09       0.93
age                      -9.27E-002    1.56E-001     -0.59       0.55
education                 4.25E-003    1.48E-001      0.03       0.98
furrow:sprinkler         -1.72E+000    8.62E-001        -2   0.045607   *
furrow:flood              2.34E+000    2.61E+002      0.01       0.99
sprinkler:flood           2.93E+000    1.76E+003         0          1
furrow:sprinkler:flood   -3.02E+000    1.86E+003         0          1

Null deviance: 166.67 on 248 degrees of freedom
Residual deviance: 94.42 on 234 degrees of freedom
AIC: 124.42
Number of Fisher Scoring iterations: 18

As the estimates show, only four variables have a significant impact on the probability of becoming a member of the water community: using furrow irrigation, using sprinkler irrigation, using flood irrigation, or using a combination of furrow and sprinkler irrigation.

Using furrow, sprinkler or flood irrigation each has a positive impact on the probability of joining the water community. By contrast, the negative interaction term means that farmers who use both furrow and sprinkler irrigation are less likely to become members than the two separate effects would suggest.

2c.ii) Analysis of the data suggests two possible paths for the water community to increase its number of associates. The first course of action is to contact more farmers who use only one irrigation method. The analysis shows that they are very likely to decide to join.

The other course of action is to investigate why farmers who use both furrow and sprinkler irrigation methods are less likely to join the water community. Perhaps there might be some underlying economic (or otherwise) explanation for this fact, and the water association could in some way devise strategies to counteract the reduced probability of joining by these farmers.

2c.iii) Generally, there are criticisms to make on the dataset. The first one concerns the size of some of the groups involved: there are only 14 women owners for a total sample size of 149 farmers; there are only two farmers who use both flood and sprinkler, but a larger number of farmers who use only furrow irrigation. The quality of our analysis would gain from a more balanced group size.

The data also ignore potential personal differences between the farmers at an attitudinal level. There are no data concerning farmers' personal preferences for belonging to a water community. Failure to account for these potential differences results in omitted variable bias: these variables could be underpinning the differences (or lack of differences) observed.
