Proportions as dependent variable - Maarten Buis
Transcript of Proportions as dependent variable - Maarten Buis
![Page 1: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/1.jpg)
Proportions as dependent variableMaarten L. Buis
Vrije Universiteit Amsterdam
Department of Social Research Methodology
http://home.fsw.vu.nl/m.buis
Proportions as dependent variable – p. 1/42
![Page 2: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/2.jpg)
Outline
Problems with using regress for proportions asdependent variable
Methods for dealing with a single proportion
Methods for dealing with multiple proportions
Caveat: Ecological Fallacy
Proportions as dependent variable – p. 2/42
![Page 3: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/3.jpg)
Example
Explaining proportion of Dutch city budgets spent onadministration and government with:
Size of budget (natural logarithm of budget in 10s ofmillions euros)
Average house price (in 100,000s of euros)
Population density (in 1000s of persons per squarekm)
Political orientation of city government (either no leftparties in city government, left parties are a minorityin city government, or left parties are a majority incity government)
Proportions as dependent variable – p. 3/42
![Page 4: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/4.jpg)
OLS results
b selntot -0.030 (0.002)houseval 0.013 (0.004)popdens 0.008 (0.002)noleft -0.001 (0.005)minorityleft -0.007 (0.004)constant 0.109 (0.008)R2 0.499
Proportions as dependent variable – p. 4/42
![Page 5: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/5.jpg)
Non linear effects due to floor
0
.1
.2
.3
.4
prop
ortio
n bu
dget
toad
min
istr
atio
n an
d go
vern
men
t
.5 1 5 10 50 100 500
budget in 10s of milions of euros
non−linearity due to floor
Proportions as dependent variable – p. 5/42
![Page 6: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/6.jpg)
Residuals versus fitted values
−.2
−.1
0
.1
.2
resi
dual
s
−.05 0 .05 .1 .15 .2
Fitted values
Residuals versus fitted values
Proportions as dependent variable – p. 6/42
![Page 7: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/7.jpg)
Floor
observed = fitted + residual
observed ≥ 0 (and ≤ 1)
fitted + residual ≥ 0
residual ≥ −fitted
Proportions as dependent variable – p. 7/42
![Page 8: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/8.jpg)
Residuals versus fitted values
−.2
−.1
0
.1
.2
resi
dual
s
−.05 0 .05 .1 .15 .2
Fitted values
heteroscedasticity due to floor
Proportions as dependent variable – p. 8/42
![Page 9: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/9.jpg)
Problems with regress
Impossible predictions.
Non-normal errors.
Heteroscedasticity.
Non-linear effects.
Proportions as dependent variable – p. 9/42
![Page 10: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/10.jpg)
Outline
Problems with using regress for proportions asdependent variable
Methods for dealing with a single proportion
Methods for dealing with multiple proportions
Caveat: Ecological Fallacy
Proportions as dependent variable – p. 10/42
![Page 11: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/11.jpg)
A solution: betafit
Assumes that the proportion follows a beta distribution.
The beta distribution is bounded between 0 and 1 (butdoes not include either 0 or 1).
The beta distribution models heteroscedasticity in sucha way that the variance is largest when the averageproportion is near 0.5.
Proportions as dependent variable – p. 11/42
![Page 12: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/12.jpg)
Two parameterizations
the conventional parametrization with two shapeparameters (α and β)
Corresponds to the formulas of the beta distributionin textbooks.Does not correspond to conventions of GeneralizedLinear Models where one models how the mean ofthe distribution of the dependent variable changesas the explanatory variables change.
the alternative parametrization with one location andone scale parameter (µ and φ)
Does not correspond to textbook formulas of thebeta distribution but does correspond to the GLMconvention.
Proportions as dependent variable – p. 12/42
![Page 13: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/13.jpg)
Two parameterizations
conventional parametrization
f(y|α, β) ∝ yα−1(y − 1)β−1
E(y) =α
α + β
V ar(y) =αβ
(α + β)2(α + β + 1)
alternative parametrization
f(y|µ, φ) ∝ yµφ−1(y − 1)(1−µ)φ−1
E(y) = µ
V ar(y) = µ(1 − µ)1
1 + φProportions as dependent variable – p. 13/42
![Page 14: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/14.jpg)
different µ fixed φ
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 5 and beta = 5mu = .5 and phi = 10, var = .091
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 4 and beta = 6mu = .4 and phi = 10, var = .061
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 3 and beta = 7mu = .3 and phi = 10, var = .039
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 2 and beta = 8mu = .2 and phi = 10, var = .023
Proportions as dependent variable – p. 14/42
![Page 15: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/15.jpg)
different φ fixed µ
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 2 and beta = 8mu = .2 and phi = 10, var = .023
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 4 and beta = 16mu = .2 and phi = 20, var = .012
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 8 and beta = 32mu = .2 and phi = 40, var = .006
dens
ity
0 .2 .4 .6 .8 1
y
alpha = 16 and beta = 64mu = .2 and phi = 80, var = .003
Proportions as dependent variable – p. 15/42
![Page 16: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/16.jpg)
Modeling the mean
We allow different cities to have different µs dependingon their values of the explanatory variables.
µi = f(b0 + b1x1i + b2x2i · · · )
The logistic transformation is used to ensure µi remainsbetween 0 and 1.
µi = eb0+b1x1i+b2x2i···
1+eb0+b1x1i+b2x2i···
which is the same as:
ln( µ1−µ) = b0 + b1x1i + b2x2i · · ·
Proportions as dependent variable – p. 16/42
![Page 17: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/17.jpg)
output of betafit
. betafit gov, mu(lntot houseval popdens noleft minorityleft ) nolog
ML fit of beta (mu, phi) Number of obs = 394
Wald chi2(5) = 473.19
Log likelihood = 887.97456 Prob > chi2 = 0.0000
------------------------------------------------------------
| Coef. se z P>|z| [ 95% CI ]
-------------+----------------------------------------------
lntot | -.3999 .0227 -17.58 0.000 -.4445 -.3553
houseval | .1138 .0385 2.96 0.003 .0384 .1892
popdens | .0830 .0216 3.85 0.000 .0408 .1253
noleft | .0185 .0445 0.42 0.677 -.0686 .1057
minorityleft | -.0080 .0450 -0.18 0.859 -.0962 .0802
_cons | -2.0545 .0707 -29.06 0.000 -2.1931 -1.9160
-------------+----------------------------------------------
/ln_phi | 4.7968 .0715 67.13 0.000 4.6568 4.9368
-------------+----------------------------------------------
phi | 121.1 8.6545 105.3 139.3
------------------------------------------------------------Proportions as dependent variable – p. 17/42
![Page 18: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/18.jpg)
interpretation using dbetafit
. dbetafit , at(noleft 0 minorityleft 0)
----------------------------------------------------------------
discrete | Min --> Max +-SD/2 +-1/2
change | coef. se coef. se coef. se
--------------+-------------------------------------------------
lntot |-.2116 .0122 -.0344 .002 -.033 .0019
houseval | .0291 .0105 .0037 .0013 .0093 .0032
popdens | .0447 .0133 .0063 .0016 .0068 .0018
noleft | .0015 .0037
minorityleft |-6.6e-04 .0037
----------------------------------------------------------------
Proportions as dependent variable – p. 18/42
![Page 19: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/19.jpg)
discrete changes in lntot
0
.05
.1
.15
.2
.25
pred
icte
d pr
opor
tion
−2 0 2 4 6
ln(total budget)
Proportions as dependent variable – p. 19/42
![Page 20: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/20.jpg)
marginal effects
----------------------------------------------------
Marginal | MFX at x Max MFX
Effects | coef. se coef. se
--------------+-------------------------------------
lntot |-.0328 .0019 -.1 .0057
houseval | .0093 .0032 .0284 .0096
popdens | .0068 .0018 .0208 .0054
----------------------------------------------------
Proportions as dependent variable – p. 20/42
![Page 21: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/21.jpg)
marginal effects of lntot
0
.2
.4
.6
.8
1
pred
icte
d pr
opor
tion
−10 −5 0 5
ln(total budget)
Proportions as dependent variable – p. 21/42
![Page 22: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/22.jpg)
Fractional logit
Although the implied variance in betafit makessense, it is still an assumption and some think it is toorestrictive.
The fractional logit has been proposed as an alternativeby Papke and Wooldridge (1996).
Fractional logit can handle proportions of exactly 0 or 1,unlike betafit.
This model can be estimated by typing: glm varlist,family(binomial) link(logit) robust.
Marginal effects like those from dbetafit can beobtained with mfx, predict(mu).
Proportions as dependent variable – p. 22/42
![Page 23: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/23.jpg)
Does it matter?
OLS betafit glm
dy/dx se dy/sx se dy/dx se
lntot -.0296 .0027 -.0328 .0019 -.0330 .0026
houseval .0135 .0051 .0093 .0032 .0105 .0036
popdens .0078 .0019 .0068 .0018 .0071 .0018
noleft∗ -.0010 .0056 .0015 .0037 .0008 .0046
minorityleft∗ -.0065 .0047 -.0007 .0037 -.0019 .0042∗ dy/dx is for discrete change of dummy variable from 0 to 1
Proportions as dependent variable – p. 23/42
![Page 24: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/24.jpg)
Outline
Problems with using regress for proportions asdependent variable
Methods for dealing with a single proportion
Methods for dealing with multiple proportions
Caveat: Ecological Fallacy
Proportions as dependent variable – p. 24/42
![Page 25: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/25.jpg)
Multiple proportions
Cities also spent money on other categories:
Safety (which includes public health, fire department,and the police department)
Education (mostly primary and secondary schools)
recreation (which includes sport facilities and culture)
social (which includes social work and some socialsecurity benefits)
urbanplanning (which includes roads and houses)
Proportions as dependent variable – p. 25/42
![Page 26: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/26.jpg)
Multiple proportions
The proportions spent on each category should remainbetween 0 and 1, and
the proportions should add up to 1.
The proportions could be modeled with separatebetafit models.
This would ensure the first condition is met, but
it would ignore the second condition.
Proportions as dependent variable – p. 26/42
![Page 27: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/27.jpg)
A solution: dirifit
Assumes that the proportions follow a Dirichletdistribution.
The Dirichlet distribution is the multivariategeneralization of the beta distribution.
It ensures that the proportions remain between 0 and 1,and that they add up to 1.
Proportions as dependent variable – p. 27/42
![Page 28: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/28.jpg)
Two parameterizations
the conventional parametrization with one shape parameters for eachproportion (α1, α2, . . ., αk)
Corresponds to the formulas of the Dirichlet distribution intextbooks.
Does not correspond to conventions of Generalized LinearModels where one models how the mean of the distribution of thedependent variable changes as the explanatory variables change.
the alternative parametrization with on location location parameter foreach proportion and one scale parameter (µ1, µ2, . . ., µk, and φ)
Does not correspond to textbook formulas of the Dirichletdistribution but does correspond to the GLM convention.
One location parameter is redundant:µ1 = 1 − (µ2 + µ3 + . . . + µk).
Proportions as dependent variable – p. 28/42
![Page 29: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/29.jpg)
Modeling the mean
We allow different cities to have different µjs dependingon their values of the explanatory variables.
The multinomial logistic transformation is used toensure the µjs remain between 0 and 1 and add up to1.
Proportions as dependent variable – p. 29/42
![Page 30: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/30.jpg)
output of dirifit
. dirifit gov-urban, mu(lntot houseval popdens noleft minorityleft ) nolog
----------------------------------------------------------------
| Coef. se z P>|z| [ 95% CI ]
-------------+--------------------------------------------------
mu2 |
lntot | .1445 .0406 3.56 0.000 .0649 .2240
houseval | -.0518 .0718 -0.72 0.471 -.1924 .0889
popdens | -.0700 .0390 -1.79 0.073 -.1465 .0065
noleft | .0817 .0827 0.99 0.323 -.0805 .2439
minorityleft | .1043 .0826 1.26 0.207 -.0577 .2662
_cons | .5274 .1318 4.00 0.000 .2690 .7858
-------------+--------------------------------------------------
mu3 |
lntot | .4123 .0423 9.74 0.000 .3293 .4952
<snip>
-------------+--------------------------------------------------
phi | 45.01 1.407 42.33 47.85
----------------------------------------------------------------
mu2 = safety mu4 = recreation mu6 = urbanplanning
mu3 = education mu5 = social base outcome = governingProportions as dependent variable – p. 30/42
![Page 31: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/31.jpg)
Marginal effects obtained with ddirifit
governing safety education recreation social urban
planning
lntot -.0320∗ -.0314∗ .0115∗ -.0067∗ .0265∗ .0321∗
houseval .0132∗ .0143∗ -.0321∗ .0065 -.0496∗ .0477∗
popdens .0074∗ .0009 -.0067 .0002 .0072 -.0090∗
noleft† .0006 .0161∗ -.0266∗ .0048 -.0168 .0219∗
minorityleft† -.0019 .0154 -.0164∗ .0085 -.0105 .0049† discrete change of dummy variable from 0 to 1∗ significant at 5% level
Proportions as dependent variable – p. 31/42
![Page 32: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/32.jpg)
Variance and covariance ofy in dirifit
The variance of yi is µi(1 − µi)1
1+φ
The covariance of yi and yj implicit in dirifit is−µiµj
11+φ
It depends on the means in a similar fashion as themultinomial distribution, and on a precision parameter φ.
Covariance is forced to be negative. This makes sensein that there is less room for other categories if thefraction in one category increases.
Proportions as dependent variable – p. 32/42
![Page 33: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/33.jpg)
Variance Covariance structure too restrictive?
Though the implied variances and covariances makesense, they do not have to be true.
Alternatives have been proposed for cases where thisstructure is violated.
For dirifit a multivariate normal model for logittransformed dependent variables has been proposed byAitcheson (2003).
Proportions as dependent variable – p. 33/42
![Page 34: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/34.jpg)
Variance Covariance structure too restrictive?
This model can be estimated by typing:
gen logity1 = logit(y1)
gen logity2 = logit(y2)
.
.
gen logityk = logit(yk)
mvreg logity1 - logityk = indepvars, corr
Proportions as dependent variable – p. 34/42
![Page 35: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/35.jpg)
Outline
Problems with using regress for proportions asdependent variable
Methods for dealing with a single proportion
Methods for dealing with multiple proportions
Caveat: Ecological Fallacy
Proportions as dependent variable – p. 35/42
![Page 36: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/36.jpg)
Ecological Fallacy
Sometimes one wants to study behavior of individualsbut one only has information on a aggregate level.
This aggregate information is often in the form ofproportions.
One might be tempted to use the methods discussedpreviously to analyze this data.
Example from Robinson (1950): Relationship betweenimmigrant status and literacy in the 1930 US census.
Proportions as dependent variable – p. 36/42
![Page 37: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/37.jpg)
Individual level analysis
illiterateimmigrant literate illiterate Totalnative born 96.72 3.28 100.00foreign born 90.75 9.25 100.00Total 95.87 4.13 100.00
Proportions as dependent variable – p. 37/42
![Page 38: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/38.jpg)
State level analysis
0
5
10
15
% il
liter
ate
0 10 20 30
% immigrants
Proportions as dependent variable – p. 38/42
![Page 39: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/39.jpg)
Ecological Fallacy
Aggregate level relationships can be completelydifferent from individual level relationships.
If it is remotely possible to use individual level data, doso!
If that is not possible start reading up on EcologicalInference. A good place to start is Gary King (1997)
Ecol package from Department of Political Science,Aarhus University, Denmark:http://www.ps.au.dk/stata/
Proportions as dependent variable – p. 39/42
![Page 40: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/40.jpg)
Summary (1)
The constraint that a proportion must remain between 0and 1 causes problems with regress.
betafit is one possible solution.
Multiple proportions have the additional constraint thatthey must add up to 1.
dirifit is one possible solution.
Proportions as dependent variable – p. 40/42
![Page 41: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/41.jpg)
Summary (2)
Both betafit and dirifit make assumptions aboutthe variance (covariance) structure of the dependentvariable that does make sense but that some find toorestrictive.
Fractional logit and multivariate regression have beenproposed as alternatives.
None of these techniques are appropriate for studyingindividual behavior from aggregate data.
Proportions as dependent variable – p. 41/42
![Page 42: Proportions as dependent variable - Maarten Buis](https://reader030.fdocuments.us/reader030/viewer/2022021208/62062e7c8c2f7b1730053029/html5/thumbnails/42.jpg)
References
Aitcheson, John. 2003. The Statistical Analysis of Compositional Data. Blackburn Press.
King, Gary. 1997. A solution to the Ecological Inference Problem, Reconstructing IndividualBehavior from Aggregate Data. Princeton University Press.
Papke, Leslie E. and Jeffrey M. Wooldridge. 1996. “Econometric Methods for FractionalResponse Variables with an Application to 401(k) Plan Participation Rates.” Journal ofApplied Econometrics 11(6):619–632.
Robinson, W.S. 1950. “Ecological Correlations and the Behavior of Individuals.” Amercian
Sociological Review 15(3):351–357.
Proportions as dependent variable – p. 42/42