The Stata Journal - med.mahidol.ac.thSubscriptions are available from StataCorp, 4905 Lakeway Drive,...

The Stata Journal

Volume 16 Number 1 2016

®

A Stata Press publicationStataCorp LPCollege Station, Texas

The Stata Journal

Editors

H. Joseph Newton

Department of Statistics

Texas A&M University

College Station, Texas

[email protected]

Nicholas J. Cox

Department of Geography

Durham University

Durham, UK

[email protected]

Associate Editors

Christopher F. Baum, Boston College

Nathaniel Beck, New York University

Rino Bellocco, Karolinska Institutet, Sweden, and

University of Milano-Bicocca, Italy

Maarten L. Buis, University of Konstanz, Germany

A. Colin Cameron, University of California–Davis

Mario A. Cleves, University of Arkansas for

Medical Sciences

William D. Dupont, Vanderbilt University

Philip Ender, University of California–Los Angeles

David Epstein, Columbia University

Allan Gregory, Queen’s University

James Hardin, University of South Carolina

Ben Jann, University of Bern, Switzerland

Stephen Jenkins, London School of Economics and

Political Science

Ulrich Kohler, University of Potsdam, Germany

Frauke Kreuter, Univ. of Maryland–College Park

Peter A. Lachenbruch, Oregon State University

Jens Lauritsen, Odense University Hospital

Stanley Lemeshow, Ohio State University

J. Scott Long, Indiana University

Roger Newson, Imperial College, London

Austin Nichols, Urban Institute, Washington DC

Marcello Pagano, Harvard School of Public Health

Sophia Rabe-Hesketh, Univ. of California–Berkeley

J. Patrick Royston, MRC Clinical Trials Unit,

London

Philip Ryan, University of Adelaide

Mark E. Schaffer, Heriot-Watt Univ., Edinburgh

Jeroen Weesie, Utrecht University

Ian White, MRC Biostatistics Unit, Cambridge

Nicholas J. G. Winter, University of Virginia

Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager

Lisa Gilmore

Stata Press Copy Editors

David Culwell, Shelbi Seiner, and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book

reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository

papers that link the use of Stata commands or programs to associated principles, such as those that will serve

as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go

“beyond the Stata manual” in explaining key features or uses of Stata that are of interest to intermediate

or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to

a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users

(e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers

analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could

be of interest or usefulness to researchers, especially in fields that are of practical importance but are not

often included in texts or other journals, such as the use of Stata in managing datasets, especially large

datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata

with topics such as extended examples of techniques and interpretation of results, simulations of statistical

concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behav-

ioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch),

Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com

http://www.stata-journal.com

Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone

979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

U.S. and Canada Elsewhere

Printed & electronic Printed & electronic

1-year subscription $115 1-year subscription $145



1-year student subscription $ 85 1-year student subscription $115

1-year institutional subscription $345 1-year institutional subscription $375



Electronic only Electronic only

1-year subscription $ 85 1-year subscription $ 85



1-year student subscription $ 55 1-year student subscription $ 55

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may

be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX

77845, USA, or emailed to [email protected].

®

Copyright c© 2016 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and

help files) are copyright c© by StataCorp LP. The contents of the supporting files (programs, datasets, and

help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy

or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,

as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.

This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,

fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting

files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,

or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,

incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote

free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, , Stata

Press, Mata, , and NetCourse are registered trademarks of StataCorp LP.

http://www.stata.com/bookstore/sj.html

http://www.stata.com/bookstore/sjj.html

http://www.stata-journal.com/archives.html

Volume 16 Number 1 2016

The Stata Journal

Articles and Columns 1

Announcement of the Stata Journal Editors’ Prize 2016 . . . . . . . . . . . . . . . . . . . . . . . . 1

16 and all that . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Regressions are commonly misinterpreted . . . . . . . . . . . . . . . . . . . . . . . . . D. C. Hoaglin 5

Regressions are commonly misinterpreted: Comments on the article . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. W. Hardin 23

Regressions are commonly misinterpreted: Comments on the article . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. S. Long and D. M. Drukker 25

Regressions are commonly misinterpreted: A rejoinder . . . . . . . . . . . . D. C. Hoaglin 30

Estimation of multivariate probit models via bivariate probit. . . . . . . . .J. Mullahy 37

diff: Simplifying the estimation of difference-in-differences treatment effects . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. M. Villa 52

mfpa: Extension of mfp using the ACD covariate transformation for enhancedparametric multivariable modeling . . . . . . . . . . . . . P. Royston and W. Sauerbrei 72

Quantifying the uptake of user-written commands over time . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Choodari-Oskooei and T. P. Morris 88

bireprob: An estimator for bivariate random-effects probit models . . . . . . A. Plum 96

conindex: Estimation of concentration indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .O. O’Donnell, S. O’Neill, T. Van Ourti, and B. Walsh 112

Estimating polling accuracy in multiparty elections using surveybias . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . K. Arzheimer and J. Evans 139

bicop: A command for fitting bivariate ordinal regressions with residual depen-dence characterized by a copula function and normal mixture marginals . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .M. Hernandez-Alava and S. Pudney 159

Features of the area under the receiver operating characteristic (ROC) curve. Agood practice.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . .D. Lora, I. Contador, J. F. Perez-Regadera, and A. Gomez de la Camara 185

Implementing factor models for unobserved heterogeneity in Stata . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .M. Sarzosa and S. Urzua 197

Speaking Stata: Truth, falsity, indication, and negation . . . . . . . . . . . . . . . N. J. Cox 229

Review of Michael N. Mitchell’s Stata for the Behavioral Sciences . . . P. B. Ender 237

A menu-driven facility for power and detectable-difference calculations in stepped-wedge cluster-randomized trials, erratum . . . . . . . .K. Hemming and A. Girling 243

Software Updates 244

The Stata Journal (2016)16, Number 1, pp. 1–2

Announcement of the Stata Journal Editors’Prize 2016

The editors of the Stata Journal are pleased to invite nominations for their 2016prize in accordance with the following rules. Nominations should be sent as privateemail to [email protected] by July 31, 2016.

1. The Stata Journal Editors’ Prize is awarded annually to one or more authors ofa specified paper or papers published in the Stata Journal in the previous threeyears.

2. The prize will consist of a framed certificate and an honorarium of U.S. $1,000,courtesy of the publisher of the Stata Journal. The prize may be awarded inperson at a Stata Conference or Stata Users Group meeting of the recipient’s orrecipients’ choice or as otherwise arranged.

3. Nominations for the prize in a given year will be requested in the Stata Journalin the first issue of each year and simultaneously through announcements on theStata Journal website and on Statalist. Nominations should be sent to the edi-tors by private email to [email protected] by July 31 in that year. Therecipient(s) will be announced in the Stata Journal in the last issue of each yearand simultaneously through announcements on the Stata Journal website and onStatalist.

4. Nominations should name the author(s) and one or more papers published in theStata Journal in the previous three years and explain why the work concerned isworthy of the prize. The precise time limits will be the annual volumes of the StataJournal, so that, for example, the prize for 2016 will be for work published in theannual volumes for 2013, 2014, or 2015. The rationale might include originality,depth, elegance, or unifying power of work; usefulness in cracking key problemsor allowing important new methodologies to be widely implemented; and clarityor expository excellence of the work. Comments on the excellence of the softwarewill also be appropriate when software was published with the paper(s). Nomina-tions might include evidence of citations or downloads or of impact either withinor outside the community of Stata users. These suggestions are indicative ratherthan exclusive, and any special or unusual merits of the work concerned may nat-urally be mentioned. Nominations may also mention, when relevant, any body oflinked work published in other journals or previously in the Stata Journal or StataTechnical Bulletin. Work on any or all of statistical analysis, data management,statistical graphics, and Stata or Mata programming may be nominated.

c© 2016 StataCorp LP gn0067

2 Announcement of the Stata Journal Editors’ Prize 2016

5. Nominations will be considered confidential both before and after award of theprize. Neither anonymous nor public nominations will be accepted. Authors maynot nominate themselves and so doing will exclude those authors from consider-ation. The editors of the Stata Journal may not be nominated. Employees ofStataCorp may not be nominated. Such exclusions apply to any person with suchstatus at any time between January 1 of the year in question and the announce-ment of the prize. The associate editors of the Stata Journal may be nominated.

6. The recipient(s) of the award will be selected by the editors of the Stata Journal,who reserve the right to take advice in confidence from appropriate persons, sub-ject to such persons not having been nominated themselves. The editors’ decisionis final and not open to discussion.

Previous awards of the Prize were to David Roodman (2012); Erik Thorlund Parnerand Per Kragh Andersen (2013); Roger Newson (2014); and Richard Williams (2015).For full details, please see Stata Journal 12: 571–574 (2012); Stata Journal 13: 669–671(2013); Stata Journal 14: 703–707 (2014); and Stata Journal 15: 901–904 (2015).

H. Joseph Newton and Nicholas J. CoxEditors, Stata Journal


16 and all that

With this issue, the Stata Journal starts its 16th year.

The number 16 is key in computing, so some small self-congratulations on our ownimpending birthday seem forgivable. One good way to summarize our progress to dateis thoroughly statistical, a graph of the number of pages in each annual volume (seefigure 1).

0

250

500

750

1000

1250

num

ber

of p

ages

2001 2005 2010 2015

Figure 1. Number of pages in each annual volume of the Stata Journal, 2001–2015. Thepage count for 2001 is that of the single issue multiplied by 4.

A detail here is that volumes match Western calendar years, but volume 1 was justa single issue. Hence, for 2001, the number of pages has been multiplied by 4. Weforgo any temptation to turn our favorite time-series modeling or forecasting techniquesloose upon the data, preferring to think that the data speak for themselves as showingsubstantial and successful growth. Nevertheless, the deeper aim of the Journal remainsnot increased volumes, so to speak, but to maintain publication of high-quality papersof all kinds that are helpful and interesting to the Stata user community. It is for readersto judge how far we have been successful and for us to emphasize that our leadership israther notional: it is the reviewers, the support team at StataCorp, and above all theauthors who do virtually all the hard work.

This issue carries one notable innovation compared with our previous practices,although one that merely echoes common practice in several other statistical journals.

The first substantive paper from David Hoaglin was invited by the Editors followinga series of provocative postings that he made on Statalist urging that regression with


4 16 and all that

multiple predictors is often misinterpreted. We knew (indeed hoped) that his paperwould prove controversial. The outcome of our review was a decision to allow alter-native points of view from other invitees, together with a reply from Hoaglin. Thecomplete exchange is printed here. As history and current practice bear ample witness,people can disagree strongly in good faith about how to do and how to think aboutstatistics. (According to one story, appealing even if possibly apocryphal, R. A. Fisherdefined variance as the attitude of one statistician to another.) The articles printed hereinclude frank and forthright discussion exposing different interpretations of regressionand differing attitudes that are both accurate and acceptable. We welcome suggestionsof other subjects for future miniature symposiums.

H. Joseph Newton and Nicholas J. CoxEditors, Stata Journal


Regressions are commonly misinterpreted

David C. HoaglinIndependent consultant

Sudbury, MA

[email protected]

Abstract. Much literature misinterprets results of fitting multivariable modelsfor linear regression, logistic regression, and other generalized linear models, aswell as for survival, longitudinal, and hierarchical regressions. For the leadingcase of multiple regression, regression coefficients can be accurately interpretedvia the added-variable plot. However, a common interpretation does not reflectthe way regression methods actually work. Additional support for the correct in-terpretation comes from examining regression coefficients in multivariate normaldistributions and from the geometry of least squares. To properly implement mul-tivariable models, one must be cautious when calculating predictions that averageover other variables, as in the Stata command margins.

Keywords: st0419, regression models, added-variable plot, multivariate normaldistribution, geometry of least squares, margins command

1 Introduction

Despite multiple regression’s long history and extensive literature, many articles andbooks are misleading in reporting and interpreting results of fitting regression models.The problems arise in reporting for ordinary least-squares regression, logistic regression,and other generalized linear models, as well as for survival, longitudinal, and hierarchicalregressions. Like many other statistical techniques, regression is susceptible to garden-variety forms of abuse, but its greater complexity leads to other less obvious misunder-standings. In what follows, I focus on a major way in which reports and applications ofregression analyses often mislead: interpretation of regression coefficients. The correctinterpretation is evident in the added-variable plot and the geometry of least squares,as well as from examining regression coefficients in multivariate normal distributions.The common interpretation, regarding the other predictors as held constant, does notaccurately reflect how multiple regression works. The misunderstanding in interpretingregression coefficients suggests caution in calculating predictions that average over othervariables and in other applications of the Stata command margins.

For perspective, the purposes of regression analyses include

• to get a summary;

• to exclude the effect of a variable that might confuse the issue;

• to measure the size of an effect through a regression coefficient;

c© 2016 StataCorp LP st0419

6 Regressions are commonly misinterpreted

• to try to discover an empirical law; and

• to make predictions.

Mosteller and Tukey (1977) discuss these and other purposes.

2 Equations for multiple regression

To discuss multiple regression, we need a little notation. One common way to write therelation between the response (or dependent variable) Y and the predictors X1, . . . , Xp

in multiple regression isY = β1X1 + · · ·+ βpXp + ε (1)

(usually X1 ≡ 1). This equation represents the underlying or population model; theregression coefficients β1, . . . , βp are unknown constants to be estimated from the data,and ε is chance variation (noise, disturbance, or error).

By definition, the regression of Y on a set of variables Z1, . . . , Zm (from whichpredictors may be derived) is the conditional expectation E(Y |Z1 = z1, . . . , Zm = zm).Here we allow the possibility that some of the predictors X1, . . . , Xp are functions of thesame underlying variable (as in a polynomial or a linear spline), and we assume thatany appropriate transformations of response and predictors have already been settled.I deliberately avoid referring to predictors as “independent variables”, because theyare generally not independent in any usual sense. It is difficult to choose an accurateterm that has broad appeal. Some people interpret “predictor” as implying causation.Mosteller and Tukey (1977) referred to the X variables as “carriers”, a term that seemsquite neutral.

In a multiple regression, the definition of each regression coefficient includes theset of other predictors in the equation; that is, their names are part of the definition.G. Udny Yule (1907) introduced a notation that makes the role of the other predictorsexplicit. For example, we would denote the coefficient of X2 in (1) by βy2·13...p. Thefirst subscript denotes the response variable, the second subscript denotes the predictorto which the coefficient is attached, and the subscripts after the · denote the otherpredictors. In less abbreviated form, (1) is

Y = βy1·2...pX1 + βy2·13...pX2 + · · ·+ βyp·1...p−1Xp + ε (2)

Each integer 1 through p is an index in the list of predictors. Sometimes, it may behelpful to use the names of the predictors, as in βgp100m,weight·1,displacement (for example,when comparing models that use the same number of predictors, selected from amongX1, . . . , Xp).

Fitting the multiple regression model in (1) to a set of data yields estimates b1, . . . , bpof the regression coefficients β1, . . . , βp. Under the usual assumptions, each b is anunbiased estimate of the corresponding β. We denote an observed value of Y by y, thecorresponding given values of X1, . . . , Xp by x1, . . . , xp, and the corresponding residualby e. Thus the fitted equation corresponding to (1) is

D. C. Hoaglin 7

y = b1x1 + · · ·+ bpxp + e

and the less abbreviated form corresponding to (2) is

y = by1·2...px1 + by2·13...px2 + · · ·+ byp·12...p−1xp + y·1...p

(now the notation for the residual, y·1...p, shows explicitly the predictors whose contri-butions have been removed).

Many presentations tend to use the same letters in models that involve different setsof other predictors, which makes it easy to overlook the role of the other predictors inthe definition of the coefficient of each predictor. For example, if 2x + 5t is a good fitto the data on y, then −3x + 5(t + x) is also a good fit to those data (it gives exactlythe same predicted values). In the first fit, 2 is the coefficient of x when t is the otherpredictor, whereas in the second fit, −3 is the coefficient of x when t + x is the otherpredictor. By manipulating the choice of the other predictor, I can make the coefficientof x have any value. Mosteller and Tukey (1977, chap. 13) provide instructive examples.

3 Interpretation of regression coefficients

As the notation suggests, βy2·13...p (for example) summarizes the relation between Yand X2 when X1, X3, . . . , Xp are the other predictors. More specifically, the interpre-tation of βy2·13...p (or β2 for short) is that it “tells us how Y responds to change inX2 after adjusting for simultaneous linear change in the other predictors in the data athand” (Tukey 1970, chap. 23). This way of stating the effect of X2 on Y is a directconsequence of the presence of the other predictors. Because the model describes theregression of Y on X1, X2, X3, . . . , Xp jointly, the coefficient of each predictor accountsfor the contributions of the other predictors; that is, it reflects the adjustment for thosepredictors. The interpretation includes “in the data at hand” because the nature of theadjustment depends on the relations among the predictors in the particular dataset.

The interpretation of a regression coefficient has a straightforward mathematicalderivation. Yule (1907) gives an elegant short proof. For the estimated coefficientβy2·13...p, the main idea is illustrated by the partial regression plot (also called the“added-variable plot”—for example, in the Stata postestimation command avplot afterregress), in which the vertical coordinate is the residual from the regression of Y onX1, X3, . . . , Xp,

y·13...p = y − (by1·3...px1 + by3·14...px3 + · · ·+ byp·13...p−1xp)

and the horizontal coordinate is the residual from the regression ofX2 onX1, X3, . . . , Xp,

x2·13...p = x2 − (b21·3...px1 + b23·14...px3 + · · ·+ b2p·13...p−1xp)

In the regression line (through the origin) for y·13...p on x2·13...p, the slope is by2·13...p(see Cook and Weisberg [1982], section 2.3.2). That is, in the multiple regression of


Y on X1, X2, X3, . . . , Xp, the coefficient of X2 summarizes the change in Y per unitincrease in X2 after adjusting for simultaneous linear change in X1, X3, . . . , Xp (in thedata at hand). Dempster (1969, 160–161) makes a similar point. The interpretation,which applies in the same way to the βs, is also clear from the geometry of least squares,as discussed in section 6.

I avoid the common usage “controlling for” in describing analyses of observationaldata because it suggests that the variables being “controlled for” are under some sort of“control” (for example, in the way they would be in a randomized controlled trial or ina designed experiment). Referring to a variable as “controlled” implies that it is beingheld constant. “Adjusting for” is more accurate and straightforward.

For a concrete example of interpreting regression coefficients, I use the data on theforeign cars in the 1978 auto dataset (accessed in Stata), with gallons per 100 miles asthe response variable and weight and displacement as the predictors.

. sysuse auto, clear(1978 Automobile Data)

. generate gp100m = 100/mpg

. label var gp100m "Gallons per 100 miles"

For the 22 foreign cars, the command graph matrix produces a scatterplot matrixof the 3 variables (figure 1). Gallons per 100 miles has a fairly strong linear relationwith weight and displacement, and the relation between weight and displacement

is even stronger.

D. C. Hoaglin 9

. graph matrix gp100m weight displacement if foreign==1

Gallonsper 100miles

Weight(lbs.)

Displacement(cu.in.)

2

4

6

8

2 4 6 8

2,000

3,000

4,000

2,000 3,000 4,000

50

100

150

50 100 150

Figure 1. Scatterplot matrix of gp100m, weight, and displacement for the foreign carsin the 1978 automobile data

The command regress produces the following results:

. regress gp100m weight displacement if foreign == 1

Source SS df MS Number of obs = 22F( 2, 19) = 23.86

Model 19.6704568 2 9.83522842 Prob > F = 0.0000Residual 7.83165119 19 .412192168 R-squared = 0.7152

Adj R-squared = 0.6853Total 27.502108 21 1.30962419 Root MSE = .64202

gp100m Coef. Std. Err. t P>|t| [95% Conf. Interval]

weight .0003964 .0010435 0.38 0.708 -.0017877 .0025805displacement .032282 .0181606 1.78 0.091 -.0057286 .0702925

_cons -.195738 .810741 -0.24 0.812 -1.892638 1.501162

The coefficients, t statistics, and p-values pertain to the contribution of their re-spective predictors after adjusting for the contributions of the other predictors. Simpleregression with weight as the predictor yields the expected result from the pattern infigure 1.


. regress gp100m weight if foreign == 1





weight .0021599 .0003406 6.34 0.000 .0014494 .0028703_cons -.6892425 .8017998 -0.86 0.400 -2.361768 .9832824

The added-variable plot in figure 2 (produced by the user-written command favplot,which can be downloaded from Statistical Software Components using the command ssc

install favplots and, among other features, allows the user to control the numberof decimal places displayed for b and t) shows the relation of gp100m to displacement

after regression on weight has been removed from each. For the line through the origin,the slope (0.0323) and the t statistic (1.78) are the same as those for displacement

in the multiple regression with weight and displacement as the predictors. Thus0.0323 gallons per 100 miles per cubic inch summarizes the relation between gp100m

and displacement after adjusting for simultaneous linear change in weight. The effectof the adjustment is noticeable. Compare the previous slope with that in the simpleregression with displacement as the predictor.

. favplot displacement, bformat(%7.4f) name(hoaglin_2, replace)

−1.

5−

1−

.50

.51

resi

dual

for

gp10

0m |

othe

r X

−10 0 10 20residual for displacement | other X

b = 0.0323 t = 1.78Displacement (cu. in.)

Figure 2. Added-variable plot for displacement in the regression of gp100m on weightand displacement

D. C. Hoaglin 11

. regress gp100m displacement if foreign == 1





displacement .0388401 .0055092 7.05 0.000 .0273482 .050332_cons -.00723 .6272315 -0.01 0.991 -1.315612 1.301152

For completeness, figure 3 shows the added-variable plot for weight. The adjustmentfor simultaneous linear change in displacement leaves little relation between gp100m

and weight.

. favplot weight, bformat(%7.6f) name(hoaglin_3, replace)

−1

−.5

0.5

1re

sidu

al fo

r gp

100m

| ot

her

X

−400 −200 0 200residual for weight | other X

b = 0.000396 t = 0.38Weight (lbs.)

Figure 3. Added-variable plot for weight in the regression of gp100m on weight anddisplacement

4 A common misinterpretation

In the equation

y = b1x1 + · · ·+ bpxp + e

an estimated regression coefficient (for example, b2) looks like an ordinary slope, but thereality is more complicated. A common approach interprets b2 as the average change


in Y for a 1-unit increase in X2 when the other Xs are held constant. A more carefulvariation recognizes that b2 is a slope of Y against X2, so it summarizes change in Y perunit change in X2. (Of course, when X2 is an indicator or “dummy” variable, only anincrease from 0 to 1 is possible.) Either way, the interpretation is incorrect. It does notreflect the way multiple regression works and should be abandoned. Usually the datawere not obtained with the other Xs held constant. And even when some or all otherXs can be held constant, the proper interpretation of b2 is the one given in section 3.

“Held constant” suggests that one can hold all other Xs fixed for any desired valueof X2. What one can actually do depends on the data. When the other Xs are heldconstant, even at their means, some changes in X2 could stray into a region of “predictorspace” that is not represented in the data. And when one of the predictors is dichoto-mous, its mean does not occur in the data. Technically, a point involving such a meanis not in “predictor space” (though it may be surrounded by points that are) becauseno data can be collected there. On the other hand, various designed experiments collectdata to study the effect of some variables when other variables are held constant. Box(1966) discusses examples of passive observation and active (designed) intervention andconcludes with the often-quoted remark, “To find out what happens to a system whenyou interfere with it you have to interfere with it (not just passively observe it)”.

The “held constant” interpretation is often justified with a mathematical derivationthat uses partial derivatives. If the model is

Y = β1X1 + · · ·+ βpXp + ε

then taking the partial derivative of Y with respect to X2 yields ∂Y/∂X2 = β2.

This “proof”, however, has two transparent flaws. First, the actual data are nowherein sight. The partial derivative of Y with respect to X2 is purely formal. Second, the“proof” is faux mathematics: its assumptions include a key part of the conclusion(holding the other Xs constant). In calculus, the partial derivative is defined by alimiting process that explicitly holds all the other Xs constant and specifies the constantvalues of those Xs. In general, however, if the data were consulted, they would oftensay that the other Xs cannot be held constant. For these reasons, taking the partialderivative of the regression function

Y = β1X1 + · · ·+ βpXp (+ε)

with respect to X2 cannot yield an interpretation of β2 (or b2, both of which alreadyreflect the presence of the other predictors). It indicates how the predicted value ofY would change if one could increase X2 without changing the values of the otherpredictors. Some such changes in X2 should generally be possible, because I haveassumed that the regression equation is a good fit to the given data, but the justifiablechanges are constrained by what the data can support.

We can better understand the situation by recognizing that two distinct purposesof regression are involved. Taking the partial derivative is one aspect of examining themodel’s use for prediction. Interpreting a coefficient is an aspect of summarizing the

D. C. Hoaglin 13

effect of that predictor. The partial derivative operates on the model as given, withoutinformation on the extent to which the coefficients reflect the contributions of the otherpredictors.

Prediction that extrapolates substantially beyond the region of predictor space cov-ered by the data is seldom appropriate. And, though less noticeable, interpolation atpoints that do not occur in the population may not be meaningful. In a particular ap-plication, the analyst must check that the data underlying the model support situationsin which the variable changes and other variables do not (at least approximately) andcheck that the variables can be handled in the same way as in applying the results of theanalysis. In the example, it is clear from figure 1 that if displacement is held constant,the data support changes in weight only over a narrow interval, and conversely.

As a simple example in which “held constant” makes no sense, suppose the datacome from the model

Y = β0 + β1x+ β2x2 + ε

Here the predictors are 1, x, and x2, and the subscripts on the βs correspond to thepowers of x. It is not possible to change x while holding x2 constant (except for thetrivial change from x to −x). This example may seem artificial, but analysts oftenmechanically add one or more squared terms to models to summarize nonlinearity inthe relation between Y and the predictors. I generally advise against this approachbecause one should not assume that the nonlinearity can be well approximated with aquadratic or higher-order polynomial. It is better to examine the nonlinearity with theaim of uncovering an appropriate functional form. It may be tempting to consider x2

and x3 as terms in a Taylor series for a functional relation between Y and x, but theappropriate function may not satisfy the conditions for such an approximation.

In another simple and fairly common example, one predictor uses the product of twoother predictors to express their interaction:

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε

It may be possible to hold x1x2 constant while changing x1, but then x2 must alsochange. And changing either x1 or x2 while holding the other constant will changex1x2.

Both of these examples involve functional dependence of a predictor on one or moreother predictors. In the generic regression model, if each of X3, . . . , Xp were a functionof X2, then ∂Y/∂X2 would have the following form:

∂Y

∂X2= β2 + β3

∂X3

∂X2+ · · ·+ βp

∂Xp

∂X2


(as before, X1 ≡ 1, so ∂X1/∂X2 ≡ 0). Within the limitations of a formal derivative,this gives the correct result for the two examples:

∂Y

∂x= β1 + 2β2x

∂Y

∂x1= β1 + β3x2

Usually, however, predictors are associated in the data, rather than functionally related.The data supply the information on these associations, and they are accounted for bythe interpretation discussed in section 3.

The preceding development applies also when the outcome and the linear predictorsare on different scales. In a generalized linear model, for example, the link function, g,relates μi = E(Yi) to the value of the linear predictor, η : ηi = g(μi). If h is the inverseof g, so that μi = h(ηi), instead of ∂Y/∂X2 we have

∂μ

∂X2=

dh

dη× ∂η

∂X2

Thus, for logistic regression, g(μ) = loge{μ/(1− μ)}, h(η) = 1/(1 + e−η), and dh/dη =e−η/(1 + e−η)2 = μ(1− μ). The coefficients and their interpretation are in the scale ofthe linear predictor.

5 Regression coefficients in multivariate normal distribu-tions

The interpretation discussed in section 3 also applies to regressions in multivariatenormal distributions. This interpretation emphasizes that the coefficients in the model

Y = β1X1 + β2X2 + · · ·+ βpXp + ε

reflect adjustment for simultaneous linear change in the other predictors. A multivariatenormal distribution differs from the usual multiple regression, where the predictors areassumed to be known constants, but the result is the same.

The usual parameters of a multivariate normal distribution are its vector of means(μ) and its covariance matrix (Σ). Here it suffices to take each mean equal to 0 andeach variance equal to 1 and to focus on the standardized trivariate normal distribution.Thus the three remaining parameters are the off-diagonal elements of Σ, which are thepairwise correlations, ρ12, ρ13, and ρ23. We denote the coordinate random variables byX1, X2, and X3.

We regard X3 as the response variable and X1 and X2 as the predictor variables.The regression of X3 on X1 and X2 is linear in X1 and X2 and can be written as

E (X3|X1, X2) = β31·2X1 + β32·1X2

D. C. Hoaglin 15

where βij·k is the partial regression coefficient for Xi on Xj when Xk is the otherpredictor. From the joint density of X1, X2, and X3 and the joint density of X1 andX2, it is straightforward to derive the conditional density of X3 given X1 and X2 andto verify that

β31·2 =ρ13 − ρ12ρ23

1− ρ212and β32·1 =

ρ23 − ρ12ρ131− ρ212

We arrive at the same expressions if we first adjust for the other predictor. The con-ditional distribution of X1 given X2 has mean ρ12X2 and variance 1 − ρ212, and theconditional distribution of X3 given X2 has mean ρ23X2 and variance 1 − ρ223. Thenthe regression of X3 − ρ23X2 on X1 − ρ12X2 has slope

cov(X1 − ρ12X2, X3 − ρ23X2)

var(X1 − ρ12X2)=

ρ13 − ρ12ρ231− ρ212

= β31·2

Similarly, the regression of X3 − ρ13X1 on X2 − ρ12X1 has slope β32·1. Thus the inter-pretation is that in the regression of X3 on X1 and X2, the coefficient β31·2 summarizesthe change in X3 per unit change in X1 after adjusting for simultaneous linear changein X2 (that is, after adjusting for the regressions of X3 and X1 on X2).

6 Geometry of least squares

We can also verify the interpretation in section 3 by examining the geometry of least-squares fitting.

Some books illustrate the step from simple regression to multiple regression with thethree-predictor model,

Yi = β1 + β2X2i + β3X3i + εi

which represents the data in three dimensions, withX2, X3, and Y as the axes, and showa plane whose slopes (β2 and β3) and intercept (β1) are estimated by minimizing thesum of squared vertical deviations. Here holding X3 constant (for example) correspondsto restricting the predicted values of Y to lie on the line formed by the intersection ofthe fitted plane and the plane perpendicular to the X3 axis at X3 = x3. When thatline is plotted in the X2 − Y plane, its slope is b2 and its intercept is b1 + b3x3. If X2

and X3 are correlated, the slope of the simple linear regression of Y on X2 will differfrom b2. The difference between the two slopes is a consequence of their definitions: b2reflects the adjustment for X3, and the slope in the simple regression does not. Thus itis important to look also at the plot of X3 versus X2. If the data indicate that a changein X2 should be accompanied by a corresponding change in X3, then the predicted valueb1 + b2x2 + b3x3 will change accordingly and will no longer lie on the line in the X2 −Yplane corresponding to the initial value of X3.

The geometry of obtaining the estimated coefficients (b1, b2, and b3) by using leastsquares involves a different representation applicable to any linear regression. Thus wereturn to the multiple regression with p predictors,

Y = β1X1 + · · ·+ βpXp + ε


in which we have n observations. In the customary matrix notation, y = (y1, . . . , yn)T

is the vector of data on Y , and the columns of the n× p matrix X contain the data onthe predictors (considered to be known), as follows:

y = Xβ + ε

If y contains the true values of Y (that is, ε = 0), then it lies in the subspace spannedby the columns of X (assumed to have dimension p) and is the linear combination ofthose columns with coefficients β1, . . . βp. The customary way to recover one of thosecoefficients (say, βp) is to change the basis for the subspace, subtracting from Xp thecomponent in the subspace spanned by X1, . . . , Xp−1 and thus replacing Xp as a basisvector by its component orthogonal to that subspace (suitably scaled). Then βp isthe projection of y on that new basis vector. In the language of multiple regression,βp is the slope from the regression (through the origin) of y on the residuals from theregression of Xp on X1, . . . , Xp−1 (that is, after adjusting for simultaneous linear changein those other predictors). We get the same βp by replacing y with the residuals fromthe regression of y on X1, . . . , Xp−1, so it is appropriate to state the interpretation ofβp in terms of adjusting both y and Xp.

In practice, ε �= 0, and y no longer lies in the subspace spanned by the columns ofX. The least-squares estimates, b, of the regression coefficients, β, minimize

n∑i=1

(yi − yi)2

which is the Euclidean distance from y to that subspace, yielding

y = Xb

To see that the interpretation of βp applies also to bp, we can obtain y by applying the“hat matrix”, H = X(XTX)−1XT , to y: y = Hy. We can then obtain bp from y inthe same way as we obtained βp above.

7 Implications for applications of the Stata commandmargins

The workings of multiple regression have important implications for use of the Statacommand margins, which calculates statistics “from predictions of a previously fit modelat fixed values of some covariates and averaging or otherwise integrating over the re-maining covariates” (StataCorp 2015, 1354). The analyst must demonstrate that theresulting combinations of values of the covariates are meaningful and supported by thedata.

To illustrate, we use example 1 in the PDF documentation for margins, which in-volves the regression of y on sex and group in an artificial 3,000-observation dataset.The cross-classification of the two predictors shows different distributions of males andfemales over the three groups.

D. C. Hoaglin 17

. webuse margex, clear(Artificial data for margins)

. tabulate group sex, column

Key

frequencycolumn percentage

sexgroup male female Total

1 215 984 1,19914.35 65.51 39.97

2 666 452 1,11844.46 30.09 37.27

3 617 66 68341.19 4.39 22.77

Total 1,498 1,502 3,000100.00 100.00 100.00

The regress command yields estimates of the coefficients for female, 2.group,3.group, and the constant.

. regress y i.sex i.group


Model 183866.077 3 61288.6923 Prob > F = 0.0000Residual 1207566.93 2996 403.059723 R-squared = 0.1321

Adj R-squared = 0.1313Total 1391433.01 2999 463.965657 Root MSE = 20.076

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

sexfemale 18.32202 .8930951 20.52 0.000 16.57088 20.07316

group2 8.037615 .913769 8.80 0.000 6.245937 9.8292933 18.63922 1.159503 16.08 0.000 16.36572 20.91272

_cons 53.32146 .9345465 57.06 0.000 51.48904 55.15388


With the default response option, margins calculates average adjusted predictions(AAPs), treating the sample as if every person were male (respectively, female) as follows:

. margins sex

Predictive margins Number of obs = 3000Model VCE : OLS

Expression : Linear prediction, predict()

Delta-methodMargin Std. Err. t P>|t| [95% Conf. Interval]

sexmale 60.56034 .5781782 104.74 0.000 59.42668 61.69401

female 78.88236 .5772578 136.65 0.000 77.7505 80.01422

Because the default is asobserved, the averaging in this linear regression correspondsto setting 2.group and 3.group at their means (0.3727 and 0.2277). The AAPs, 60.56and 78.88, are meaningful only if it is reasonable to consider an artificial person who is37.27% in group 2 and 22.77% in group 3 (and, hence, 39.97% in group 1) when data ony are available at only six points in “predictor space”, corresponding to {male, female}×{group 1, group 2, group 3}. Then it must be appropriate to use the same distributionover the three groups for both males and females. Because the data are artificial, I onlyobserve that the combined distribution (39.97%, 37.27%, 22.77%) differs noticeably fromthe distribution for males and the distribution for females shown in the cross-tabulation.The difference between the AAPs, 78.88−60.56 = 18.32, equals the regression coefficientfor female. Because the regression is linear in group, any distribution over the threegroups will, if used for both males and females, yield this same difference.

For an example not based on linear regression, I present one from Williams (2012).Using nhanes2f.dta (Second National Health and Nutrition Examination Survey),available from the StataCorp website, Williams (2012) fits a logistic regression model,

. webuse nhanes2f, clear

. logit diabetes black female age

(output omitted )

and uses margins to obtain adjusted predictions at six values of age with black andfemale at their means, as follows:

. margins, at(age=(20 30 40 50 60 70)) atmeans

(output omitted )

Williams (2012, 313) says, “According to these results, an average 70-year-old (whois again 0.105 black and 0.525 female) is almost 18 times as likely to have diabetesas an average 20-year-old (11.04% compared with 0.63%).” In practice, an analystshould explain why it is satisfactory to compare an artificial 70-year-old and an artificial20-year-old who are both 0.105 black and 0.525 female when data on diabetes areavailable at only four points in the “factor space”: (black, female) = (0, 0), (0, 1),(1, 0), and (1, 1). In nhanes2f.dta, the 20-year-olds (n = 244) are 0.123 black and

D. C. Hoaglin 19

0.578 female, and the 70-year-olds (n = 234) are 0.064 black and 0.500 female. Theoverall fractions may be a satisfactory combination for comparisons, but an analystshould first look at 20-year-olds’ and 70-year-olds’ predicted probabilities of diabetes ateach combination of black and female that actually appears in the data. The at()

option makes it easy to summarize the predicted probabilities of diabetes at a level ofdetail that is more relevant to individuals. (As Williams [2012] indicates, “These datawere collected in the 1980s. Rates of diabetes in the United States are much highernow.”) Thus 70-year-old nonblacks (of both sexes) were nearly 18 times as likely as 20-year-olds to have diabetes (9.60% compared with 0.54% for males and 11.02% comparedwith 0.63% for females), but the corresponding ratios for blacks were about 16. Theratio for black versus nonblack (of both sexes) was about 2 for 20-year-olds and about1.85 for 70-year-olds. And females (of both ages and both race categories) were roughly15% more likely than males to have diabetes. Of course, before embracing predictionsfrom a model, one should check how well it fits. In these data, no 20-year-olds haddiabetes, and the highest of the four rates for 70-year-olds was 11.11%.

. margins, at(age=(20 70) black=(0 1) female=(0 1))

Adjusted predictions Number of obs = 10335Model VCE : OIM

Expression : Pr(diabetes), predict()

1._at : black = 0female = 0age = 20









Delta-methodMargin Std. Err. z P>|z| [95% Conf. Interval]

_at1 .005399 .0009014 5.99 0.000 .0036324 .00716562 .0959674 .0071057 13.51 0.000 .0820404 .10989433 .0062957 .0010318 6.10 0.000 .0042735 .00831794 .1102392 .0073229 15.05 0.000 .0958865 .12459195 .0110063 .0020999 5.24 0.000 .0068904 .01512216 .1787334 .019682 9.08 0.000 .1401573 .21730957 .0128223 .0024099 5.32 0.000 .0080989 .01754568 .2025559 .0209683 9.66 0.000 .1614589 .243653

Setting other variables at their means or averaging over them is also part of cal-culating marginal effects, elasticity, and semielasticities—the response options dydx(),eyex(), dyex(), and eydx(). The logic underlying these options, however, is the sameas in the “held constant” interpretation of regression coefficients. Except for interde-pendencies that are made explicit by using factor-variable notation in the estimationcommand, the calculations for dydx() and the related options lead to interpretationsthat do not reflect the way multiple regression and other multipredictor analyses actu-ally work.

Although the Stata command margins (supported by marginsplot) offers greatpower and flexibility for studying predictions from many models, analysts should notmechanically average over other variables. It is essential to determine the region of “pre-dictor space” covered by the data and examine the associations among the predictors.

8 Many books give the incorrect interpretation

Many books mislead readers by using the “held constant” interpretation. The lowest-numbered page where I have seen this problem is page 2 of Vittinghoff et al. (2012), inan introductory example: “In a sense, multipredictor regression analysis allows us toexamine the effect of treatment aggressiveness while holding the other factors constant[italics original].”

Out of curiosity, I looked at books that I own by Stata Press that contain ma-terial related to multiple regression; these books were by Acock (2010), Kohler andKreuter (2012), Long and Freese (2006), Mitchell (2012), and Rabe-Hesketh and Skro-ndal (2012). All of them use the incorrect “held constant” interpretation.

Fortunately, some books use the correct general interpretation. These include thebooks by De Veaux, Velleman, and Bock (2012), Hastie, Tibshirani, and Friedman(2009), and Weisberg (2014).

D. C. Hoaglin 21

9 Conclusion

The interpretation of a coefficient as summarizing the relation between a change in Y andthe increase in that predictor after adjusting for simultaneous linear change in the otherpredictors in the data at hand is an important component of a proper understanding ofmultiple regression and other multipredictor methods. When one makes explicit the roleof the set of other predictors in the definition of each coefficient, this mathematicallyaccurate interpretation is a straightforward consequence of the presence of those otherpredictors in the model. Applied to the usual tables of estimated coefficients, it helps toclarify the meaning of the t statistics and p-values. It also suggests caution in makingpredictions and comparisons at combinations of predictor values that do not occurin the data. Appreciation of the proper interpretation should help to avoid commonmisunderstandings in various applications. However, a challenge is to overcome thebarrier created by many publications’ use of the “held constant” interpretation, whichhas no place in a proper understanding of multiple regression.

10 Acknowledgments

I thank Alan Agresti, Jeroan Allison, Nicholas J. Cox, Richard Goldstein, Frank E.Harrell, Jr., Stephen D. Kennedy, Terry Speed, and Paul F. Velleman for discussionsand comments on earlier versions. I also thank an anonymous reviewer for thoughtfuland extensive comments that substantially improved the presentation. Of course, I amresponsible for any opinions expressed.

11 ReferencesAcock, A. C. 2010. A Gentle Introduction to Stata. 3rd ed. College Station, TX: Stata

Press.

Box, G. E. P. 1966. Use and abuse of regression. Technometrics 8: 625–629.

Cook, R. D., and S. Weisberg. 1982. Residuals and Influence in Regression. New York:Chapman & Hall.

De Veaux, R. D., P. F. Velleman, and D. E. Bock. 2012. Stats: Data and Models. 3rded. Boston: Addison–Wesley.

Dempster, A. P. 1969. Elements of Continuous Multivariate Analysis. Reading, MA:Addison–Wesley.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning:Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station,TX: Stata Press.


Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent VariablesUsing Stata. 2nd ed. College Station, TX: Stata Press.

Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata.College Station, TX: Stata Press.

Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Coursein Statistics. Reading, MA: Addison–Wesley.

Rabe-Hesketh, S., and A. Skrondal. 2012. Multilevel and Longitudinal Modeling UsingStata. 3rd ed. College Station, TX: Stata Press.

StataCorp. 2015. Stata 14 Base Reference Manual. College Station, TX: Stata Press.

Tukey, J. W. 1970. Exploratory Data Analysis. Limited preliminary ed., vol. 2. Reading,MA: Addison–Wesley.

Vittinghoff, E., D. V. Glidden, S. C. Shiboski, and C. E. McCulloch. 2012. RegressionMethods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models.2nd ed. New York: Springer.

Weisberg, S. 2014. Applied Linear Regression. 4th ed. Hoboken, NJ: Wiley.

Williams, R. 2012. Using the margins command to estimate and interpret adjustedpredictions and marginal effects. Stata Journal 12: 308–331.

Yule, G. U. 1907. On the theory of correlation for any number of variables, treated bya new system of notation. Proceedings of the Royal Society of London, Series A 79:182–193.

About the author

David C. Hoaglin is an independent statistical consultant and an adjunct professor in theDepartment of Quantitative Health Sciences at the University of Massachusetts Medical School.He received a PhD in statistics from Princeton University in 1971. His current research interestsinclude meta-analysis, biostatistics, exploratory data analysis, and shapes of distributions. Heis an associate editor for Annals of Applied Statistics and a member of the editorial board forResearch Synthesis Methods.


Regressions are commonly misinterpreted:Comments on the article

James W. HardinUniversity of South Carolina

Department of Epidemiology and BiostatisticsColumbia, [email protected]

How much should we really care whether a description says “comparison” or“change”? Will it lead to a mistake if we say “held constant” instead of “all otherthings being equal” or “clamping the other variables”? Is there a single best phrasethat should be prescribed to all textbooks that discuss the interpretation of coefficientsin a regression model? The article by Dr. Hoaglin presents advice on interpretation ofregression models and criticism of some commonly applied interpretations.

The author states that 1) the correct interpretation of regression coefficients is ev-ident in added-variable plots; 2) the correct interpretation should be based on an ex-amination of coefficients in multivariate normal distributions and the geometry of leastsquares; and 3) the proper application of multivariable models requires caution in cal-culating predictions that average over other variables.

To motivate the discussion, the author uses a notation that places two pieces ofinformation in the subscript of regression parameters. The first part of the subscriptidentifies the outcome variable and the subscript of the associated covariate. The sec-ond part of the subscript identifies the concomitant covariates in the model. In mosttextbooks (and nearly all articles), this notation is simplified to identify the associatedcovariate only because the rest of the information is available in context.

Dr. Hoaglin advocates the phrase “adjusting for” instead of “controlling for” whenidentifying concomitant covariates in discussion of the interpretation of a particularcovariate of interest. I agree with the author’s assertion that “controlling for” couldimply that randomization rules were applied over those covariates in the collection ofdata.

To illustrate added-variable plots, the author uses Stata’s ubiquitous automobiledataset. In the example, two correlated covariates are specified in a linear regressionmodel of gallons per 100 miles: the total weight of the car in pounds (weight) and thecubic inch displacement of the car’s engine (displacement). The example illustrates thateven though each covariate has a strong relationship as evidenced in a scatterplot, nei-ther appears to be significantly associated with the outcome variable in a multivariableregression. Indeed, in the univariate models, each covariate is found to have a significantassociation with the outcome variable.

In sections 3 and 4, the author cautions that in most cases, independent variablestake on a limited set of values. As such, obtaining predicted values for which one


24 Regressions are commonly misinterpreted: Comments on the article

covariate is allowed to change while other covariates are fixed at their mean may not bemeaningful. Not only may mean values of certain covariates lack meaning, any specificcovariate pattern used in a prediction could be unrepresented in the data. In the vastmajority of cases in public health, predictions are obtained for groups, and the modelassumptions adequately address this given that it is the comparison of the predictionsthat is germane.

In section 4, Dr. Hoaglin cautions against the “holding other covariates constant”phrase when interpreting a coefficient associated with a variable that enters the modelin multiple ways either as part of an interaction or in an additional function form asis the case in polynomial models. All authors (of texts to which I had access in mybookshelf) go to great lengths to cover interpretation in such models, and none of themadvocate using this phrase in those instances.

The added-variable plot and the initial sections lead Dr. Hoaglin to favoring thephrase “the change in the outcome per unit increase in the covariate of interest afteradjusting for simultaneous linear change in the data at hand” over “the change inthe outcome per unit increase in the covariate of interest holding the other variablesconstant”. There is nothing I can say against this preference. Although accurate, theauthor’s phrase leaves me unsatisfied. But I would never use that phrase. It is dulland lifeless; it fails to illuminate or highlight, and it forces an awkward overly wordypresentation of something that, frankly, I do not mind leaving a little vague.

At worst, “held constant” is a placeholder with which we have become a little toocomfortable. That phrase is a nod toward the fitted model and the model’s underly-ing assumptions. Once the model is fit, and under the assumptions of that model, aresearcher can make calculations despite any lack of covariate pattern representation ofthe particular sample. If we cannot interpret those calculations in terms of the model,then what good is it?

Rather than what specific phrasing was used to describe coefficients and marginalmeans, I found the examples far more compelling for a different prescription: the inclu-sion of detailed tabulations, summaries, and regression plots. I contend that the “heldconstant” phrase is not as relevant as the inclusion of important contextual information.


Regressions are commonly misinterpreted:Comments on the article

J. Scott LongUniversity of Indiana

Departments of Sociology and StatisticsBloomington, IN

[email protected]

David M. DrukkerStataCorp

College Station, TX

[email protected]

Hoaglin claims that regression coefficients are commonly, perhaps usually, misinter-preted. Citing the preliminary edition of Tukey’s classic Exploratory Data Analysis(1970, chap. 23), Hoaglin argues that the correct interpretation of a regression coeffi-cient is that it “tells us how Y responds to change in X2 after adjusting for simultaneouslinear change in the other predictors in the data at hand”. He contrasts this with whathe views as the common misinterpretation of the coefficient as “the average change inY for a 1-unit increase in X2 when the other Xs are held constant”. He asserts thatthis interpretation is incorrect because “[i]t does not accurately reflect how multipleregression works”. We find that Hoaglin’s characterization of common practice is ofteninaccurate and that his narrow view of proper interpretation is too limiting to fully ex-ploit the potential of regression models. His article rehashes debates that were settledlong ago, confuses the estimator of an effect with what is estimated, ignores modernapproaches, and rejects a basic goal of applied research. In what follows, we assume theoutcome variable is y, that x is the predictor whose effect is being interpreted, and thatw represents one or more additional predictors in the model. Our examples are pur-posely simple, but the arguments would not change with more realistic specifications.Although our discussion is limited to linear regression, the ideas apply generally to theinterpretation of marginal effects, such as ∂π(x)/∂x, in nonlinear models such as logitor probit. Before explaining why we disagree with key points in Hoaglin’s argument,we note several points that we agree with:

1. The regression coefficient for x generally depends on the other predictors in themodel.

2. Using a regression to make predictions that are based off the support of the datais a mistake, and a data analyst must know the limitations of the data.

3. You cannot change x while holding x2 constant.

4. Covariation does not imply causality.

We disagree, however, that these points are commonly misunderstood. On other keyissues, we disagree with his arguments.

First, Hoaglin appears to confuse the interpretation of what is being estimated withan estimator for it. To clarify, let’s suppose that the function for the mean of y given



x and w is E(y|x,w) = β0 + βxx + βww. Suppose that the values β0, βx, and βw areknown. Then, βx = {E(y|x+ 1, w)− E(y|x,w)} is the effect of a unit increase in xholding w constant. We did not tell you how we know the values, so appeals to howwe learned the values are nonsensical. From this perspective, any consistent estimatorfor βx is a consistent estimator for the effect of a unit change in x holding w constant.Interpretation does not depend on how the consistent estimator was computed.

Second, Hoaglin objects to defining the effect of interest as a unit change to xwhile holding w constant. Yet much of applied science is about finding cases in whichchanging x while holding w fixed is a sensible effect. Although much literature dealswith limiting the discussion to sensible cases for interpretation, Hoaglin is arguing thatapplied scientists should not be looking for such ceteris paribus effects.

Third, we find the phrase “after adjusting for simultaneous linear change in the otherpredictors in the data at hand” to be misleading or confusing. The phrase is misleadingbecause we explicitly choose not to model any relationship between x and w when weestimate E(y|x,w). If we need to account for a change in w that results from a change inx, we should estimate the parameters of a model that allows for this structure. Considerthe following example to see how the phrase is misleading. Suppose y is wages and xindicates if a person completed a training program. What does it mean that βx tellsus how wages respond to a change in completing a training program after adjusting forsimultaneous linear change in the other predictors in the data at hand? We understandhow this language relates to the derivations provided by Hoaglin, but we do not findit helpful for interpreting the model. Because Hoaglin’s argument is based on workby Tukey (1977), we consulted him and Mosteller and Tukey (1977) for clarification,but we could not find this language in either.1 When Mosteller and Tukey (1977, 303)describe fitting by stages they write that “the least-squares coefficient of each carrier[that is, predictor] is the regression of the response on this carrier linearly adjusted forthis carrier’s costock [that is, other predictors in the regression]”. This describes howfitting by stages leads to the coefficients for multiple regression, but Mosteller and Tukey(1977) do not use this language when they interpret the coefficients later in the book.

Fourth, Hoaglin categorically dismisses as wrong any interpretation that uses thephrase “holding constant” or something similar as making unjustified claims regardingcausality and manipulability of variables. He writes, “ ‘Held constant’ is misleading andsuggests that one can hold all other Xs fixed for any desired value of X2. What one canactually do depends on the data”. Given the explosion of research on causal inference(see Morgan and Winship [2015], Imbens and Wooldridge [2009], and Berk [2004] forexcellent reviews), we do not think the phrase “holding constant” implies anything ofthe sort. The literature on causal inference precisely specifies the assumptions requiredto conclude that a change in a treatment variable causes a change in an outcome. Underthese conditions, the average treatment effect, which equals the regression coefficient inthe linear regression model, can reasonably be given a causal interpretation. If Hoaglinis rejecting this literature as naıve and misleading, he needs a much more detailed argu-

1. Hoaglin cites the unpublished version of Exploratory Data Analysis, while we consulted the pub-lished version.

J. S. Long and D. M. Drukker 27

ment to rebut work by Rubin and others that dates back at least to Cochran and Rubin(1973). Although researchers whose data do not support the conditions for causal infer-ence might draw causal conclusions, these conclusions are not based on the coefficientsalone. For example, after noting the perils of causal inference in observational data,Mosteller and Tukey (1977, 322–323) write, “The idea that increased schooling prob-ably increases the score is likely correct, though not because of these data”. Usingregression coefficients to support conclusions about causality is not the same as naivelyusing regression coefficients alone to make causal claims. Although Hoaglin reads agreat deal into the phrase “holding constant”, we do not think others do.

Fifth, suppose that a researcher is using regression for prediction or description.Hoaglin infers that if the researcher uses the phrase “holding other variables constant”,then he or she must not understand regression. We disagree. Even if we might prefera different language, we find the language to be clear and consistent with proper in-terpretation. Suppose y is the salary received by a faculty member and x is whetherthat person held a postdoctoral fellowship before joining the faculty. By saying “The ex-pected income of a faculty member who held a postdoctoral fellowship is βx greater thanthat of a faculty member with the same characteristics who did not have a postdoctoralfellowship”, we are not suggesting that when a scientist takes a postdoctoral fellow-ship, no other characteristics of that scientist change. We are reporting the value of{E(y|x = 1, w = w∗)− E(y|x = 0, w = w∗)} = βx.

2 If we write “If a scientist publishesone additional article, his or her salary would increase by βx”, we are reporting that{E(y|x = x∗ + 1, w = w∗)− E(y|x = x∗, w = w∗)} = βx. We could debate the modelspecification, functional form, whether one paper in mathematics means the same thingas one paper in biochemistry, and other things. However, our interpretation does not im-ply that we think we could magically give everyone another publication, that we believesimultaneously increasing every scientist’s number of publications by one would lead toan increase in everyone’s salary, or that changing a scientist’s productivity has no impacton his or her other characteristics. Long and Freese (2006), who Hoaglin criticizes forusing the phrase “holding constant”, understand these issues and wrote extensively onhow to use regression models to make predictions at substantively motivated locationsin the data and to address the practical difficulties in doing so in nonlinear models withmany independent variables. For an especially interesting and modern discussion ofparameter interpretation conditional on covariates or after averaging out the covariates,we also suggest Wooldridge (2010, chap. 2 and 21).

Weisberg’s (2014) excellent book on regression is one of the few that Hoaglin be-lieves correctly interprets regressions. Yet Weisberg (2014, 74–75) uses the phrase “withall other regressors in the model held fixed” and notes “[i]nterpreting a coefficient orits estimate as a rate of change given that other regressors are fixed assumes that theregressor can in fact be changed without affecting the other regressors in the mean func-tion and that the available data will apply when the predictor is so changed”. AlthoughWeisberg (2014) agrees with many of the qualifications that Hoaglin suggests, his use ofthe phrase “held fixed” is consistent with the proper understanding of regression. Sim-ilarly, Mosteller and Tukey (1977, 319) write, “If the x’s are not closely related, either

2. Here and later we assume that there is adequate support in the data to make the comparison.


functionally or statistically, we may be able to get away with interpreting bi as the ‘ef-fect of xi changing while the other x’s keep their same values.’ If we want to tap expertjudgment about the value of bi, some set of words like those in quotes may be the bestwe can use [emphasis added]”. For a regression of a test score on schooling and specialtraining, Mosteller and Tukey (1977, 322) use an interpretation that involves holdingother variables constant but add qualifications: “What we are not so justified in sayingis that if we took a random person who had, say, 9 years of schooling and gave him 2more years, he would, on the average, gain two points on the test. This is not becauseof the lack of reliability in determining the equation. We are thinking of this additionalschooling being given to many people. The difficulty arises because the data are notbased on an experiment”. We think that most people who interpret regressions usingobservational data would agree, even those who use the phrase “holding constant”.

Sixth, Hoaglin’s example confuses the distinction between marginal and conditionalpredictions computed by the margins command. With the asobserved option, thedefault if no option is given, margins computes predictions at the observed values ofthe predictors for each observation and then takes the average to make a marginalprediction. For regress, this is equivalent to the command predict yhat followedby summarize yhat. When margins uses the at() or atmeans option to specify thevalue of the predictors at which predictions are made, conditional predictions are be-ing made. In the linear model, margins, asobserved computes the mean predictionpredmeanof = 1/N

∑i E(y|x = xi, w = wi), while margins, atmeans computes the

prediction at the mean predatmean = E(y|x = x,w = w). Because the model is lin-ear, predmeanof = predatmean. This does not mean, as Hoaglin states, that margins,

asobserved is inappropriate with binary predictors because a 0/1 variable would beheld at a fractional value. Hoaglin has misunderstood the distinction between marginaland conditional predictions. Further details are given in Long and Freese (2014).

Seventh, the assumption of normality is not necessary for the derivation of regressioncoefficients in the section “Regression coefficients in multivariate normal distributions”;see Wooldridge (2010, chap. 4). His derivation mirrors the path analytic derivation ofregression coefficients used by Sewall Wright in the early 1900s.

We find Hoaglin’s proposed language of interpretation confusing. On the other hand,Hoaglin argues that anyone using the term “holding constant” misunderstands the mostbasic aspects of regression, thinks correlation implies causality, believes you can changex while holding x2 constant, and is unaware of the danger in using a sample of teensto predict what will happen to those in their 60s. We are not suggesting that all iswell in the use of regression models. Indeed, we find it a bit uncomfortable defendingcurrent practice! But we believe that Hoaglin’s article sets up a straw man that is easyto knock down. Mosteller and Tukey (1977, 320) write that “Regression is probably themost powerful technique we have for analyzing data. Correspondingly, it often seems totell us more of what we want to know than our data possibly could provide”. Keepingthis in mind will make you a more thoughtful and vigilant data analyst. Using thephrase “holding constant”, however, does not make you thoughtless and rash.

J. S. Long and D. M. Drukker 29

1 ReferencesBerk, R. A. 2004. Regression Analysis: A Constructive Critique. Thousand Oaks, CA:

Sage.

Cochran, W. G., and D. B. Rubin. 1973. Controlling bias in observational studies: Areview. Sankhya 35: 417–446.

Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometricsof program evaluation. Journal of Economic Literature 47: 5–86.


. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rded. College Station, TX: Stata Press.

Morgan, S. L., and C. Winship. 2015. Counterfactuals and Causal Inference: Methodsand Principles for Social Research. 2nd ed. New York: Cambridge University Press.


Tukey, J. W. 1970. Exploratory Data Analysis. Limited preliminary ed., vol. 2. Reading,MA: Addison–Wesley.

. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.


Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nded. Cambridge, MA: MIT Press.


Regressions are commonly misinterpreted: Arejoinder

David C. HoaglinIndependent consultant

Sudbury, MA

[email protected]

I thank James Hardin and Scott Long and David Drukker for their illuminatingcomments. I also thank the Editors for the opportunity to respond.

The commentaries point to several issues that need clarification. However, I firstemphasize that for the purpose of measuring the size of an effect, the correct generalinterpretation of a regression coefficient, as summarizing how Y responds to changein the corresponding predictor after adjusting for simultaneous linear change in theother predictors in the data (or population) at hand, is a straightforward result ofthe mathematics of multiple regression. Heuristically, because a regression summarizesthe relation of Y to the predictors, taken together, the coefficient of each predictoraccounts for the contributions of the other predictors. Neither commentary provides amathematical proof that contradicts this interpretation.

1 Hardin

James Hardin asks, “Will it lead to a mistake if we say ‘held constant’ instead of ‘allother things being equal’ or ‘clamping the other variables’?” “Held constant” is simplya shorter version of the other two phrases. As a general interpretation, all three makethe same mistake. All textbooks need not use the same phrase; they only need toexplain how multiple regression actually works.

In my article, I did not advise that the correct interpretation “should be based on”an examination of coefficients in multivariate normal distributions or on the geometryof least squares. Because that interpretation is an inherent feature, any way of lookingat the workings of multiple regression can reveal it. Those settings (in sections 5 and6) introduce two alternative approaches.

I agree that the second part of Yule’s notation may be omitted when the contextspecifies the other predictors. Textbooks invite confusion, however, when they use thesame symbols for the coefficients in a sequence of nested models, as in

Y = β1X1 + β2X2 + ε

Y = β1X1 + β2X2 + β3X3 + ε

Y = β1X1 + β2X2 + β3X3 + β4X4 + ε

The definitions of β1, β2, and β3 are not the same in the various models, and the notationshould reflect this in some way (at least when such sequences are first discussed).


D. C. Hoaglin 31

Nowhere in either section 3 or section 4 do I caution that “in most cases, indepen-dent variables take on a limited set of values.” The values that the predictors can takevary among applications, and “continuous” predictors are common. When making pre-dictions, analysts must consider possible discreteness of the variables. For comparisonsamong groups, the “model assumptions” may not adequately address situations in whichthe groups differ on some predictors. Some comparisons in public health overcome thisdifficulty by using standardization.

The two examples in section 4 play the role of counterexamples; they show that ingeneral, the “held constant” interpretation cannot be valid. I would expect textbooksto handle them correctly, and it is reassuring to hear that they do.

Because the correct interpretation actively keeps in view the adjustments for thecontributions of the other predictors, I find it far from “dull and lifeless”. It doesinvolve more words. One can omit those details—but only after giving readers anaccurate explanation. I welcome suggestions of alternative phrasing that accuratelyconveys the nature of multiple regression.

As my article explains, the regression model and its assumptions do not generallysupport the “held constant” interpretation. Researchers can calculate predicted val-ues for any valid combinations of values of the predictors, but then they are using theestimated coefficients for another of the purposes that I listed in section 1. The distinc-tion may at times be subtle, but the interpretation of a regression coefficient is part ofmeasuring the size of an effect.

2 Long and Drukker

I am sorry that Scott Long and David Drukker found my article so confusing. It isunfortunate that they criticize it for things that it neither says nor implies. In whatfollows, I try to dispel as much of their confusion as I can.

The proper interpretation of regression coefficients is no more limiting than multipleregression itself. Indeed, an accurate understanding of how multiple regression works isessential for exploiting the potential of regression models.

Also, the proper interpretation follows directly from a result known to econometri-cians as the Frisch–Waugh–Lovell theorem (discussed, for example, by Filoso [2013]).

Nowhere does my article confuse the estimator of an effect with what is estimated.

The responses below follow the commentary’s numbering of the areas of disagreement(First, Second, etc.).

First

A regression function is not an abstract object. (As I discuss in section 4 of my article,a common general “proof” of the incorrect “held constant” interpretation mistakenly

32 Regressions are commonly misinterpreted: A rejoinder

makes that assumption.) By writing E(y|x,w) = β0 + βxx + βww, Long and Drukkeracknowledge that the function involves a distribution (such as the distribution of εin Y = Xβ + ε or a multivariate normal distribution). The function allows them tocalculate the predicted values E(y|x0, w0) and E(y|x0 +1, w0) for any particular valuesx0 and w0 and then use the difference, βx, as an estimate of the effect of a change ofone unit in x with w held constant. (Because the function is linear, the actual values ofx and w play no role in that result; they can be arbitrary, even incompatible with thecontext of the model.) That process, however, is not the same as interpreting βx as partof E(y|x,w). Because the function summarizes the relation of y to x and w jointly, eachof βx and βw accounts for the contribution of the other variable. Those adjustmentsare already a feature of the coefficients before any predictions are calculated, and anyconsistent estimators of βx and βw will have this property. Thus the data or distributionfrom which β0, βx, and βw arose plays an essential role.

Second

If one wants to estimate the effect of making a unit change in x while holding w constant,one must have data in which values of x differ by one unit and w remains constant, sothat one can actually observe that effect. Designed experiments in applied science oftendo this. Nowhere do I suggest that applied scientists should not look for ceteris paribuseffects. It is essential for them to look in the right places.

Third

“[A]fter adjusting for simultaneous linear change in the other predictors in the data athand” is a mathematical fact. It may seem confusing if it is unfamiliar, but it is notmisleading. The data play an essential role.

When one fits a regression model for E(Y |x,w), what matters is the relation of Yto x and w; given the choice of x and w, the model accepts those predictors’ relationin the data. In section 2, I mentioned the possibility of functional relations amongthe predictors and assumed that those choices had been settled. Thus the process ofbuilding the model may have introduced relations between x and w. As an example,if the predictors contained in w include x2, they do so because the analyst considers xand x2 suitable predictors (after accounting for the contributions of other predictors).

In the example where y is wages and x indicates whether a person has completeda training program, βx summarizes the difference between those who have completeda training program and those who have not, accounting for differences between thetwo groups on the characteristics represented in w. If the data were collected in arandomized experiment, w should not differ (on average) between the groups. However,if the data arose from an observational study, w might differ between the groups, andan assessment of the effect of the training program should consider those differences (asβx does).

D. C. Hoaglin 33

In a footnote, Long and Drukker refer to the limited preliminary edition of Ex-ploratory Data Analysis as “unpublished”, but their reference list has the same entryas my article’s reference list. Though copies may be difficult to obtain now, the volumewas definitely published. I purchased my copy in a bookstore in 1970.

In this comment and others, Long and Drukker quote, selectively, Mosteller andTukey (1977). I wish they had read that book, particularly chapter 12 (“Regres-sion for Fitting”) and chapter 13 (“Woes of Regression Coefficients”), as a sourceof insight rather than, apparently, a source of quotations that would appear to con-tradict statements in my article. From many years of collaboration (for example,Hoaglin, Mosteller, and Tukey [1983, 1985, 1991]), I am confident that the views ofMosteller and Tukey (1977) do not conflict with my article. After all, they and I haveused the same underlying mathematics.

Mosteller and Tukey (1977) actually describe graphical fitting by stages at pages271–279; they develop the process algebraically on pages 303–305, partly “because itemphasizes what we have just said by observing that the least-squares coefficient of eachcarrier is the regression of the response on this carrier linearly adjusted for this carrier’scostock”. They end that section as follows (page 305): “When we want a statementto remember, we can condense a little [from writing out such a statement as ‘for eachcoefficient that we want to interpret’], and say: The (least-squares) coefficient of eachcarrier is the regression of the response, linearly adjusted for this carrier’s costock, onthis carrier also linearly adjusted for this carrier’s costock (where the costock containsall possibilities in the whole stock for which the corresponding carrier has coefficientzero).” (In graphical form, the statement is the added-variable plot.) Having presentedthe concept both graphically and algebraically and given readers the above statementto remember, Mosteller and Tukey (1977) did not need to repeat that language at eachsubsequent opportunity.

Fourth

My article focuses on the usual multiple regression. It makes no claims about causalmodels. However, as far as ordinary multiple regression is concerned, the extensivereview by Imbens and Wooldridge (2009) does not disagree with my article. It gener-ally does not mention the aspects of interpreting regression coefficients that my articlediscusses, and I do not recall seeing “held constant” or similar language anywhere in it.At several points, Imbens and Wooldridge (2009) talk about adjusting for covariates.Further, I found nothing in that article that I reject.

Fifth

As I mentioned earlier, making predictions is a distinct purpose from measuring thesize of an effect through a regression coefficient. Holding other predictors constant isnot part of the way multiple regression works, but the data (or population) at handmay support predictions that involve holding other predictors constant. In the firstexample, the presence of the other predictors (w) in the model allows the regression to


adjust for differences in characteristics between faculty members who held a postdoctoralfellowship before joining the faculty and those who did not hold a postdoctoral fellowship(that is, to produce a value of βx that accounts for those differences). Long and Drukkerappear to deny the existence of that adjustment and the role it has, logically andmathematically, in the interpretation of βx. In the second example, they do not saywhether the data are cross-sectional or longitudinal. If the data are cross-sectional, thepredictions involve a comparison between scientists who have published x∗ + 1 articlesand scientists who have published x∗ articles (both with w = w∗).

My article criticizes Long and Freese (2006). That edition (page 114) derives aninterpretation of the coefficient of a continuous predictor in a linear regression model bytaking a partial derivative, the fallacious approach discussed in section 4 of my article.It then states (page 115), “The distinguishing feature of interpretation in the LRM isthat the effect of a given change in an independent variable is the same regardless ofthe value of that variable at the start of its change and regardless of the level of theother variables in the model [emphasis original].” From the mathematics of the linearregression model, it is clear that this statement is incorrect. I have not read the 2014edition; however, judging by the comments of Long and Drukker on my article, it islikely to have the same interpretation as the 2006 edition. If so, Long and Freese havesome revising to do. As I have said about textbooks (Hoaglin 2014), “students shouldbe able to trust that the authors of their textbooks understand the methods they arewriting about.”

After Weisberg (2014) introduces the correct interpretation in an early discussionof multiple regression, I would not expect him to repeat the full wording on everysubsequent occasion.

Importantly, the sentences quoted from Mosteller and Tukey (1977, 319)—“If thex’s are not closely related, either functionally or statistically, we may be able to getaway with interpreting bi as the ‘effect of xi changing while the other x’s keep theirsame values.’ If we want to tap expert judgment about the value of bi, some set ofwords like those in quotes may be the best we can use”—appear in a section titled“Sometimes x’s can be ‘Held Constant’.” Thus it is clear that Mosteller and Tukey(1977) are discussing situations that are exceptions. Also, the second sentence can beread as describing the difficulty of obtaining expert judgment on a regression coefficient.

Sixth

It is inaccurate to say that my article misunderstands the distinction between marginaland conditional predictions, especially because (as Long and Drukker show, and as sec-tion 7 of my article mentions) the two yield the same result for a linear regression. Inthe first example in section 7, both approaches result in setting 2.group and 3.group

at their means. For predictions that treat every person in the sample as male and,separately, treat every person as female, my article does not say flatly that it is inap-propriate to consider artificial persons who are 39.97% in Group 1, 37.27% in Group 2,and 22.77% in Group 3. It is up to the analyst to justify such a choice.

D. C. Hoaglin 35

Seventh

One cannot derive regression coefficients in a multivariate normal distribution withoutassuming that one has a multivariate normal distribution. The point of section 5 ofmy article, clearly stated, is that the correct interpretation applies to the populationregression coefficients in that familiar family of distributions.

Summary

In each of the seven areas of disagreement that Long and Drukker comment on, the rootcause of the disagreement is confusion or misunderstanding on their part. I hope thatmy responses clarify the situation, and I urge them to take another, closer look at themathematics of multiple regression.

As a parting shot, Long and Drukker make several further claims, most of whichhave no credible basis in my article. Their lack of validity should be obvious. For therecord, however, I respond as follows:

1. As my article shows, as a general interpretation of a regression coefficient, “holdingconstant” is incompatible with the mathematics of multiple regression.

2. I do not think that correlation implies causality.

3. My article actually says (in section 4), “It is not possible to change x while holdingx2 constant (except for the trivial change from x to −x).”

4. In the example based on the NHANES data, section 7 emphasizes the differencesin the rates of diabetes between 70-year-olds and 20-year-olds. Nowhere does myarticle suggest “using a sample of teens to predict what will happen to those intheir sixties”.

3 ReferencesFiloso, V. 2013. Regression anatomy, revealed. Stata Journal 13: 92–106.

Hoaglin, D. C. 2014. Teaching of multiple regression should reflect the way it works. InProceedings of the 2014 Joint Statistical Meetings. Alexandria, VA: American Statis-tical Association.

Hoaglin, D. C., F. Mosteller, and J. W. Tukey, eds. 1983. Understanding Robust andExploratory Data Analysis. New York: Wiley.

. 1985. Exploring Data Tables, Trends, and Shapes. New York: Wiley.

. 1991. Fundamentals of Exploratory Analysis of Variance. New York: Wiley.

Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometricsof program evaluation. Journal of Economic Literature 47: 5–86.


Estimation of multivariate probit models viabivariate probit

John MullahyUniversity of Wisconsin–Madison

National University of Ireland Galwayand National Bureau of Economic Research

Madison, WI

[email protected]

Abstract. In this article, I suggest the utility of fitting multivariate probit modelsusing a chain of bivariate probit estimators. This approach is based on Stata’sbiprobit and suest commands and is driven by a Mata function, bvpmvp(). Idiscuss two potential advantages of the approach over the mvprobit command(Cappellari and Jenkins, 2003, Stata Journal 3: 278–294): significant reductionsin computation time and essentially unlimited dimensionality of the outcome set.Computation time is reduced because the approach does not rely on simulationmethods; unlimited dimensionality arises because only pairs of outcomes are con-sidered at each estimation stage. This approach provides a consistent estimatorof all the multivariate probit model’s parameters under the same assumptions re-quired for consistent estimation via mvprobit, and simulation exercises I providesuggest no loss of estimator precision relative to mvprobit.

Keywords: st0423, bvpmvp(), bvopmvop(), multivariate probit models, bivariateprobit

1 Introduction

In this article, I suggest the utility of fitting multivariate probit (MVP) models using achain of bivariate probit estimators. I demonstrate how this approach, based on Stata’sbiprobit and suest commands and driven by the Mata function bvpmvp(), affords twopotential advantages over the mvprobit command, that is, significant reductions in com-putation time and essentially unlimited dimensionality of the outcome set (mvprobit’slimit is M = 20 outcomes).1 Computation time is reduced because, unlike mvprobit,bvpmvp() does not rely on simulation methods; unlimited dimensionality arises be-cause only pairs of outcomes are considered at each estimation stage. Importantly, thisbvpmvp() approach provides a consistent estimator of all the MVP model’s parametersunder the same assumptions required for consistent estimation via mvprobit, and thesimulation exercises herein suggest no loss of estimator precision relative to mvprobit.

1. Stata/SE’s restriction that matsize cannot exceed 11,000 ultimately places a limit on the size ofthe parameter vector that can be estimated. All references to Stata herein are to Stata/SE 13.1.


38 Estimation of multivariate probit models via bivariate probit

This approach was inspired by the goal of embedding MVP estimation in a large-replication bootstrap exercise. The simulation results that I present in section 5 suggestthat the computation time saved by the bvpmvp() method relative to mvprobit canbe significant, while numerical differences in the respective point estimates and esti-mated standard errors are trivial. Because the potential applicability of MVP models isbroad, it is important in practice that such potential not be thwarted by computationalchallenges.

The remainder of the article is organized as follows. In section 2, I describe the MVP

model and, in section 3, the bvpmvp() method. In section 4, I present the comparisonempirical exercises and, in section 5, the comparative results. In section 6, I considerparallel issues involved in the estimation of multivariate ordered probit (MVOP) models,and in section 7, I finish with a summary.

2 The MVP model

The MVP model is typically specified as

y∗ij = xiβj + uij (1)

yij = 1(y∗ij > 0) (2)

ui = (ui1, . . . , uiM ) ∼ MVN(O,R) or y∗i = (yi1, . . . , yiM ) ∼ MVN(xiB,R) (3)

where i = 1, . . . , N indexes observations, j = 1, . . . ,M indexes outcomes, xi is a K-vector of exogenous covariates, the ui are assumed to be independent identically dis-tributed across i but correlated across j for any i, and MVN denotes the multivariatenormal distribution. (Henceforth, the i subscripts will be suppressed.) The standardnormalization sets the diagonal elements of R equal to 1 so that R is a correlation ma-trix with off-diagonal elements ρpq, {p, q} ∈ {1, . . . ,M}, p �= q.2 With standard full-rankconditions on the x’s and each |ρpq| < 1, B = (β1, . . . ,βM ) and R will be identifiedand estimable with sufficient sample variation in the x’s.

3 Estimation and inference

Estimation of the M -outcome MVP model using mvprobit requires simulation of theMVN probabilities (Cappellari and Jenkins 2003), with mvprobit computation time in-creasing in M , K, N , and D (simulation draws).3 However, all the parameters (B,R)can be estimated consistently using bivariate probit—implemented as Stata’s biprobit

2. This normalization rules out cases like heteroskedastic errors (Wooldridge 2010, sec. 15.7.4). Whilethis normalization is common—for instance, normalizing each univariate marginal to be a standardprobit—it is not the only possible normalization of the covariance matrix.

3. Specifically, in the empirical exercises reported below as well as in some other simulations notreported here, mvprobit computation time increases—trivially in K, essentially proportionatelyin D, slightly more than proportionately in N , and at a rate between 2M and 3M in M .Greene and Hensher (2010) suggest that MVP computation time would increase with 2M , butthe results obtained in the simulations here suggest a somewhat greater rate of increase.

J. Mullahy 39

command—while consistent inferences about all of these parameters are afforded viaStata’s suest command. Because the proposed approach proves significantly fasterin terms of computation time with no obvious disadvantages, this strategy may meritconsideration in applied work.

The key result for the proposed estimation strategy is that the multivariate normaldistribution is fully characterized by the mean vector xB and correlation matrix R.For present purposes, the key feature of the multivariate (conditional) normal distribu-tion F (y∗1 , . . . , Y

∗M |x) is that all of its bivariate marginals—F (y∗j , y

∗m|x)—are bivariate

normal with mean vectors and correlation matrices corresponding to the respective sub-matrices of xB and R (Rao 1973, 8a.2.10).

Under the normalization that the diagonal elements of R are all one, the B pa-rameters are identified using all M (conditional) univariate marginals F (y∗j |x); there isno need to appeal to the multivariate features of F (y∗1 , . . . , y

∗M |x) to identify B. The

0.5M(M − 1) bivariate marginals provide the additional information about the ρpq pa-rameters. As such, identifying the parameters of all the bivariate marginals impliesidentification4 of the parameters of the full multivariate joint distribution so that con-sistent estimation of all the bivariate marginal probit models Pr(yp = tp, yq = tq|x)provides consistent estimates of all the parameters (B,R) of the full MVP model forPr(y1 = t1, . . . , yM = tM |x) for tj ∈ {0, 1}, j = 1, . . . ,M .

3.1 Estimation via bivariate probit

The proposed approach, which can be implemented using the Mata function bvpmvp(),is as follows. First, corresponding to each possible outcome pair, 0.5M(M−1) bivariateprobit models are fit using biprobit, yielding one estimate5 of each ρpq and M − 1estimate of βj , where j = 1, . . . ,M . Each M − 1 estimate of βj is consistent becauseeach biprobit specification uses the same normalization on the relevant submatrices of

R. Each of these estimates (βp, βq, ρpq)b, where b = 1, . . . , 0.5M(M − 1), is stored andthen combined using Stata’s suest command, which provides a consistent estimate ofthe joint variance–covariance matrix of allM(M−1)(0.5+K) parameters estimated withthe 0.5M(M − 1) biprobit estimates. We denote this vector of parameter estimates

and its estimated variance–covariance matrix as α and Ω, respectively.6

Second, we compute the simple averages βjA = {1/(M − 1)}∑Mm=1m �=j

βjm. This gives

a k×M matrix of estimated averaged coefficients, denoted BA = (β1A, . . . , βMA). Be-cause a weighted average of consistent estimators is generally a consistent estimator,the resulting BA will be consistent for B. This averaging occurs because the B param-eters in the proposed approach are overidentified; that is, there are M − 1 consistentestimates of each βj , j = 1, . . . ,M . Some other rule could be used to compute oneconsistent estimate of each βj from among the M −1 candidates, but unless alternative

4. As discussed below, identification of all the bivariate marginals implies overidentification of B.5. biprobit directly estimates the inverse hyperbolic tangent of ρpq or 0.5ln{(1 + ρpq)/(1− ρpq)}.6. α and Ω are the suest-stored matrix results e(b) (a row vector) and e(V), respectively.


strategies could boast significant precision gains, computational simplicity recommendsthe simple average as an obvious solution. See the appendix for further discussion.

Finally, we let Q denote the 0.5M(M − 1) vector of the tanh−1(ρjk) estimatedin each biprobit specification, and we define the M{0.5(M − 1) + K} × 1 vector

Θ = [vec(BA)T , QT ]T . We define H as the M{0.5(M − 1) +K} ×M(M − 1)(0.5 +K)

averaging and selection matrix that maps α to Θ; that is, Θ = HαT ; the elements ofH are 1/(M − 1), 1, or 0.7 The estimated variance–covariance matrix of Θ, useful for

inference, is given by var(Θ) = HΩHT .

3.2 bvpmvp(): A Mata function to implement the proposed estima-tion approach

The function bvpmvp() returns the M{k + 0.5(M − 1)} × [M{k + 0.5(M − 1)} + 1]

matrix, whose first column is ΘTand whose remaining elements are the elements of the

M{k + 0.5(M − 1)} dimension-symmetric square matrix var(Θ). bvpmvp() takes sixarguments: 1) a string containing the names of the M outcomes; 2) a string containingthe names of the K − 1 nonconstant covariates; 3) a (possibly null) string containingany “if” conditions for estimation; 4) a scalar indicating whether to display the interimestimation results; 5) a scalar indicating the rounding level of presented results; and6) a scalar indicating whether to display the final results. For example,

bv1 = bvpmvp("y1 y2 y3 y4","x1 x2 x3 x4","if _n<=10000",0,.001,1)bv2 = bvpmvp(yn,xn,ic,0,.001,1)

bvpmvp()’s summary report displays the BA estimates, their estimated standard

errors, and the estimated correlation matrix R; an example is provided in exhibit 1. Ofcourse, suppressing these results may be useful, for instance, in simulation or bootstrap-ping exercises. The do-file containing the Mata code for bvpmvp() is available with thisarticle’s supplementary materials.

7. A general form of the H matrix is complicated to express concisely. For example, for M = 3 andK = 2, the 9× 15 H matrix, computed internally by bvpmvp(), is

H =

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

0.5 0 0 0 0 0.5 0 0 0 0 0 0 0 0 00 0.5 0 0 0 0 0.5 0 0 0 0 0 0 0 00 0 0.5 0 0 0 0 0 0 0 0.5 0 0 0 00 0 0 0.5 0 0 0 0 0 0 0 0.5 0 0 00 0 0 0 0 0 0 0.5 0 0 0 0 0.5 0 00 0 0 0 0 0 0 0 0.5 0 0 0 0 0.5 00 0 0 0 1 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 1 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 1

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

J. Mullahy 41

Exhibit 1: Sample output from bvpmvp() (N = 10000, M = 4, K = 5)

. matamata (type end to exit)

: yn="y1 y2 y3 y4"

: xn="x1 x2 x3 x4"

: ic="if _n<=10000"

: bv1=bvpmvp(yn,xn,ic,1,.001,1)

*********************************************** ** Multivariate Probit: Results ** ***********************************************

N. of Observations (from suest): 10000

Estimation Sample: if _n<=10000

Averaged Beta-Hat Point Estimates and Estimated Standard Errors

1 2 3 4 5

1 y1 y2 y3 y423 x1 .328 -.449 .315 .4574 (.045) (.046) (.045) (.046)56 x2 -.331 .562 .388 -.4417 (.045) (.046) (.045) (.046)89 x3 .32 -.398 -.321 -.452

10 (.045) (.046) (.045) (.046)1112 x4 -.392 .396 -.35 -.3513 (.045) (.046) (.045) (.045)1415 _cons .391 -.508 .321 -.45216 (.046) (.047) (.046) (.047)17


Estimated Correlation (Rho) Matrix and Estimated Standard Errors

1 2 3 4 5

1 y1 y2 y3 y423 y1 1 .331 .507 .2874 (.016) (.013) (.016)56 y2 .331 1 .342 .2037 (.016) (.016) (.017)89 y3 .507 .342 1 .309

10 (.013) (.016) (.016)1112 y4 .287 .203 .309 113 (.016) (.017) (.016)14

Cut & Paste Matrix, Averaged Beta-Hat Point Estimates

(.328 , -.449 , .315 , .457) \(-.331 , .562 , .388 , -.441) \(.32 , -.398 , -.321 , -.452) \(-.392 , .396 , -.35 , .45) \(.391 , -.508 , .321 , -.452)

Cut & Paste Matrix, Estimated Correlation Matrix

(1 , .331 , .507 , .287) \(.331 , 1 , .342 , .203) \(.507 , .342 , 1 , .309) \(.287 , .203 , .309 , 1)

: end

4 Simulation exercises

Here I present a simulation exercise to assess the relative performance of the pro-posed approach and the approach based on mvprobit. Three sample sizes (N = 2000,N = 10000, N = 50000) are considered. The data structure corresponding to (1)–(2)has either K = 5 or K = 9 covariates x (four or eight independently distributed uniformvariates plus a constant) and M = 8 binary outcomes yij (only four of which are used insome specifications) corresponding to latent y∗ij having cross-outcome correlations ρik

variously in (0.2, 1/√10, 0.5) for all j �= k; specifically, we have the following:

J. Mullahy 43

R =

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

110−0.5 10.5 10−0.5 1 (symm.)

10−0.5 0.2 10−0.5 10.5 10−0.5 0.5 10−0.5 1

10−0.5 0.2 10−0.5 0.2 10−0.5 10.5 10−0.5 0.5 10−0.5 0.5 10−0.5 1

10−0.5 0.2 10−0.5 0.2 10−0.5 0.2 10−0.5 1

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦For mvprobit, the draws() option was set both at 10 and 20. The simulations

are performed using Stata/SE 13.1 on an iMac 3.4GHz Intel Core i7 processor andOS X v10.8.8 The do-files containing the code used to generate the data and performthe simulations are available on request.

5 Simulation results

Key results of the simulations are summarized in tables 1–3. Table 1 displays theabsolute and relative computation times for mvprobit and bvpmvp() estimation acrossthe various combinations of the N , M , K, and D parameters. Enormous differences incomputation time are seen between the two estimation methods across all the differentparameter combinations (for reference, it may be useful to recall that there are 86,400seconds in one day). Tables 2 and 3 present a side-by-side comparison of the pointestimates of B and R obtained in one select specification (N = 10000, M = 4, K = 5).For both B and R, the differences between the mvprobit and bvpmvp() point estimatesand corresponding estimated standard errors are trivial.

8. The simulations set Stata’s matsize parameter at 600 for all specifications. In some preliminaryinvestigation, I observed that computation time for bvpmvp() increased significantly when matsize

was set much larger than necessary; this was not the case for mvprobit.


Table 1. Estimation time comparisons (in seconds)

Relative

Parameters Computation time difference

N M K D mvprobit bvpmvp() (ratio)

2,000

4

510 29

129

20 53 53

910 28

128

20 54 54

8

510 1,219

5244

20 2,041 408

910 1,036

8130

20 2,044 256

10,000

4

510 142

271

20 263 132

910 137

346

20 258 86

8

510 4,628

14331

20 10,469 748

910 4,669

19246

20 9,833 518

50,000

4

510 986

1282

20 1,937 161

910 995

1855

20 1,970 109

8

510 35,833

65551

20 72,406 1114

910 36,647

86426

20 73,204 851

Legend

N : Number of sample observations

M : Number of outcomes

K: Number of covariates (including constant term)

D: Number of draws for mvprobit

Note: Stata’s matsize parameter is set at 600 for all specifications.

J. Mullahy 45

Table 2. mvprobit and bvpmvp() comparison: point estimates, one example(N = 10000, M = 4, K = 5; estimated standard errors in parentheses)

mvprobit

Outcome Covariate (draws = 20) bvpmvp()

y1

x10.3265 0.3279

(0.0448) (0.0446)

x2−0.3301 −0.3314

(0.0447) (0.0447)

x30.3184 0.3198

(0.0447) (0.0449)

x4−0.3902 −0.3916

(0.0448) (0.0447)

Constant0.3901 0.3909

(0.0466) (0.0464)

y2

x1−0.4487 −0.4487

(0.0456) (0.0455)

x20.5624 0.5620

(0.0458) (0.0456)

x3−0.3998 −0.3977

(0.0457) (0.0457)

x40.4000 0.3961

(0.0456) (0.0457)

Constant−0.5086 −0.5079

(0.0474) (0.0474)

y3

x10.3102 0.3151

(0.0445) (0.0446)

x20.3846 0.3875

(0.0445) (0.0449)

x3−0.3188 −0.3206

(0.0446) (0.0447)

x4−0.3462 −0.3496

(0.0446) (0.0447)

Constant0.3230 0.3210

(0.0463) (0.0463)

y4

x10.4567 0.4573

(0.0455) (0.0457)

x2−0.4438 −0.4408

(0.0455) (0.0457)

x3−0.4489 −0.4516

(0.0456) (0.0457)

x40.4555 0.4499

(0.0456) (0.0453)

Constant−0.4552 −0.4524

(0.0472) (0.0472)


Table 3. mvprobit and bvpmvp() comparison: R point estimates, one example(N = 10000, M = 4, K = 5; estimated standard errors in parentheses)

mvprobit

R (draws = 20) bvpmvp()

ρ120.3190 0.3308

(0.0158) (0.0159)

ρ130.4942 0.5073

(0.0134) (0.0134)

ρ140.2766 0.2872

(0.0160) (0.0161)

ρ230.3356 0.3424

(0.0156) (0.0158)

ρ240.2000 0.2034

(0.0163) (0.0167)

ρ340.3059 0.3086

(0.0157) (0.0160)

We can see that using methods like bvpmvp() to fit MVP models merits considerationwhen reduced computation time is important.9

6 MVOP models

Analogous conceptual considerations arise in the context of MVOP models in which theobserved ordered outcomes are yoj ∈ {0, . . . , Gj} for finite integers Gj ≥ 1. MVOP

9. Note that these simulations paint a somewhat “worst-case” picture for mvprobit estimation. Thesimulations use mvprobit “out of the box”, that is, without specifying any options that might en-hance estimation speed (see the help file for mvprobit; also see Cappellari and Jenkins [2003, 2006]).For instance, specifying a smaller number of draws (for example, draws(3) or draws(5)) wouldclearly result in faster estimation times; however, any diminished performance of the mvprobit

estimator relative to the performance at a greater number of draws would be a potential considera-tion. Alternatively, using good starting values for R via mvprobit’s atrho0() option might also beexpected to result in faster estimation times. One such approach would involve two stages: 1) to fitthe full model using mvprobit with a small number of draws, for example, draws(1) or draws(2);and 2) to use the estimate of R thus obtained to provide starting values for a second mvprobit

estimation with a larger number of draws (for example, draws(10) or draws(20)) being specified.This approach—with draws(1) specified initially, followed by draws(10)—was examined in somesimulations. It was observed in this instance that the two-stage approach resulted in roughly a10% reduction in overall estimation time, due mainly to a smaller number of iterations (three ver-sus four) required for convergence in the second stage. This article also has not considered howestimation using the cmp command (Roodman 2011) to fit the MVP model would compare withthe bvpmvp() approach. I would like to thank Stephen Jenkins and an anonymous referee for theirinsights and suggestions on these matters.

J. Mullahy 47

modeling involves estimation of and inference about the parameters B and R as well asthe vector of category cutpoints, C (for each outcome yoj , there are Gj cutpoints thatdelineate the Gj + 1 categories).10

An estimation strategy fully analogous to bvpmvp() is not available because thebioprobit command (Sajaia 2008) does not permit postestimation prediction withthe score option, as required by suest. However, an alternative, fully consistent, andcomputationally efficient approach is available, as follows. First, fit M univariate or-dered probit models using Stata’s oprobit command, and store these estimates usingestimates store. This provides consistent estimates of the B and C parameters. Sec-ond, fit a chain of bivariate binary probit models using biprobit—as with bvpmvp()—and store these estimates using estimates store. This provides a consistent estimateof R.11 Note that any thresholds used to map the ordered yoij to their correspondingcoarsened binary outcomes should result in consistent estimates of R. biprobit usesthe rule that a nonbinary outcome is treated as zero for zero values and one otherwise;this is a convenient mapping that minimizes programming burden. Third, combine allthe estimates stored in these two steps by using suest. The estimates from suest

can then be used for inference. The do-file containing the Mata code for the functionbvopmvop() that implements this approach is available with this article’s supplementarymaterials.12 An example of bvopmvop() output is presented in exhibit 2.13

Exhibit 2: Sample output from bvopmvop() (N = 10000, M = 4, K = 5)

. matamata (type end to exit)

: yn="y1o y2o y3o y4o"

: xn="x1 x2 x3 x4"

: ic="if _n<=10000"

: bv2=bvopmvop(yn,xn,ic,1,.001,1)

******************************************************* ** Multivariate Ordered Probit: Results ** *******************************************************

N. of Observations (from suest): 10000

Estimation Sample: if _n<=10000

10. For the MVOP model, B will not contain a parameter for the constant term because this is absorbedinto the cutpoints C.

11. Note that this also provides consistent estimates of B, but these are unnecessary given thoseobtained in the first step.

12. bvopmvop() accommodates ordered outcomes having different numbers of cutpoints, includingmixed ordered and binary outcomes. The single cutpoint estimated in oprobit for binary out-comes is −1 times the corresponding constant term that would be estimated using probit.

13. The outcomes in this example are ordered versions yoj of the yj used in the earlier simulations inwhich the outcome value 2 is assigned if 1 ≤ y∗j ≤ 2 and 3 is assigned if y∗j > 2. Then, y2 combines

the top two categories, and y3 combines the top three categories (that is, y3 is the original binarymeasure). Thus the numbers of categories are G1 = 4, G2 = 3, G3 = 2, and G4 = 4.


Beta-Hat and Cutpoint Point Estimates and Estimated Standard Errors(Note: SEs are from suest ests.)

1 2 3 4 5

1 y1o y2o y3o y4o23 x1 .379 -.457 .316 .4644 (.038) (.043) (.045) (.043)56 x2 -.325 .53 .388 -.447 (.038) (.044) (.045) (.043)89 x3 .338 -.404 -.321 -.471

10 (.038) (.043) (.045) (.043)1112 x4 -.393 .397 -.348 .4513 (.038) (.043) (.045) (.043)1415 cut1 -.354 .485 -.319 .44716 (.04) (.045) (.046) (.045)1718 cut2 .356 1.379 -- 1.30519 (.04) (.047) (.047)2021 cut3 1.079 -- -- 2.1822 (.041) (.054)23

Estimated Correlation (Rho) Matrix and Estimated Standard Errors

1 2 3 4 5

1 y1o y2o y3o y4o23 y1o 1 .331 .507 .2874 (.016) (.013) (.016)56 y2o .331 1 .342 .2037 (.016) (.016) (.017)89 y3o .507 .342 1 .309

10 (.013) (.016) (.016)1112 y4o .287 .203 .309 113 (.016) (.017) (.016)14

J. Mullahy 49

Cut & Paste Matrix, Beta-Hat and Cutpoint Point Estimates

(.379 , -.457 , .316 , .464) \(-.325 , .53 , .388 , -.44) \(.338 , -.404 , -.321 , -.471) \(-.393 , .397 , -.348 , .45) \(-.354 , .485 , -.319 , .447) \(.356 , 1.379 , . , 1.305) \(1.079 , . , . , 2.18)

Cut & Paste Matrix, Estimated Correlation Matrix

(1 , .331 , .507 , .287) \(.331 , 1 , .342 , .203) \(.507 , .342 , 1 , .309) \(.287 , .203 , .309 , 1)

: end

7 Summary

In this article, I have presented a novel estimation strategy for consistent estimationof and inference about the parameters of MVP and MVOP models. The straightfor-ward implementation of these approaches using available Mata programs recommendstheir consideration in applied work, particularly in situations involving large numbersof outcomes (M) and large sample sizes (N) or in situations requiring repeated MVP

estimation (like bootstrapping exercises).

Note that the methods suggested here may prove useful in many but not all appli-cations of MVP models. Ultimately, the methods proposed—as well as the mvprobit

method—permit estimation of the joint conditional probability model Pr(y = k|x) forthe M vectors of outcomes y, all possible 2M vectors k = (km), km ∈ {0, 1}, andexogenous covariates x. As such, when these joint conditional probabilities are per sethe estimands of interest, when they are instrumentally of interest in the estimation ofother quantities (see Mullahy [2011] for discussion), or when reduced forms of struc-tural models are of interest, the approach suggested here may prove useful. However,in other MVN contexts with binary outcomes—for example, where endogenous ym areright-hand-side variables in the structural models for other latent y∗j —consistent esti-mation of the structural parameters will typically demand attention to the full jointprobability structure, not just its bivariate marginals.14

14. I thank an anonymous referee for emphasizing these points.


8 Acknowledgments

I thank Bill Greene, Stephen Jenkins, Joao Santos Silva, and an anonymous refereefor helpful comments on earlier drafts. Support for this article was provided by theNational Institute of Child Health and Human Development grant P2CHD047873 tothe University of Wisconsin–Madison’s Center for Demography and Ecology, by an Evi-dence for Action Grant from the Robert Wood Johnson Foundation, and by the RobertWood Johnson Foundation Health and Society Scholars program at the University ofWisconsin–Madison.

9 ReferencesCappellari, L., and S. P. Jenkins. 2003. Multivariate probit regression using simulatedmaximum likelihood. Stata Journal 3: 278–294.

. 2006. Calculation of multivariate normal probabilities by simulation, with ap-plications to maximum simulated likelihood estimation. Stata Journal 6: 156–189.

Greene, W. H., and D. A. Hensher. 2010. Modeling Ordered Choices: A Primer. Cam-bridge: Cambridge University Press.

Mullahy, J. 2011. Marginal effects in multivariate probit and kindred discrete andcount outcome models, with applications in health economics. NBER Working PaperNo. 17588, The National Bureau of Economic Research.http://www.nber.org/papers/w17588.

Rao, C. R. 1973. Linear Statistical Inference and Its Applications. 2nd ed. New York:Wiley.

Roodman, D. 2011. Fitting fully observed recursive mixed-process models with cmp.Stata Journal 11: 159–206.

Sajaia, Z. 2008. bioprobit: Stata module for bivariate ordered probit regression. Sta-tistical Software Components S456920, Department of Economics, Boston College.https://ideas.repec.org/c/boc/bocode/s456920.html.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nded. Cambridge, MA: MIT Press.

About the author

John Mullahy is a professor of health economics at the University of Wisconsin–Madison.

Additional remarks on combining biprobit estimates

In general, the optimal approach to combining such multiple estimates in the overi-dentified case is to use a minimum-distance estimator with an optimal weight matrix

J. Mullahy 51

(Wooldridge 2010, sec. 14.5). In the present context, this would amount to comput-

ing a weighted average for each point estimate; that is, βjkw =∑M

m=1m �=j

wjkwβjkw,

j = 1, . . . ,M , and k = 1, . . . ,K. However, implementing the minimum-distance ap-proach can be computationally challenging. For example, consider the simplest case,M = 3. Even in this instance, the optimal (variance-minimizing) weights are com-plicated functions of the estimates’ variances and covariances; suppressing the j, ksubscripts, for (p, q, r) ∈ (1, 2, 3), p �= q �= r, these optimal weights are

wr =wrn

wrd

wherewrn = σppσqq − σ2

pq − σqqσpr − σppσrq − σprσpq − σpqσrq

and

wrd = σppσrr + σrrσqq + σppσqq − σ2pr − σ2

pq − σ2rq

+ 2(σprσpq + σpqσrq + σprσrq − σppσrq − σrrσpq − σqqσpr)

and where σ•• are variances and covariances of the parameter estimates (the empiricalcounterpart, wr, would use σ••). The algebraic complexity of these weights increasesrapidly as M increases.

The additional computational complexity involved in implementing such a minimum-distance approach is unlikely to be beneficial (in terms of precision) unless the optimalwjkm were to diverge dramatically from 1/(M − 1). The simulations here suggest thatthis is unlikely to be the case. Generally, the optimal weights will diverge from theequiweighted case of 1/(M − 1) to the extent that the variances and covariances ofand between the parameter point estimates differ substantively across the (M − 1)estimates.15

For illustration, arbitrarily selecting the (M − 1) point estimates corresponding tothe parameter β11 (outcome y1, covariate x1) for the N = 10000, M = 8, and K = 5

specification, we find that the range of the 7 point estimates β11 is [0.3266, 0.3288], therange of the corresponding 7 estimated point-estimate variances is [0.001983, 0.001995],and the range of the 28 estimated point-estimate covariances is [0.001983, 0.001993].Therefore, it is unlikely that the optimal weights would diverge much from 1/(M − 1).

The ultimately important result is that at least insofar as the simulations here areconcerned, the differences between the mvprobit and bvpmvp() point estimates andestimated standard errors are inconsequentially small (see tables 2 and 3).

15. Bill Greene suggested that a computationally straightforward middle-ground weighting strategywould be, in essence, to ignore the cross-estimator covariances and compute the variance-matrix-weighted quantities, as follows:

βjv =

[∑M

m=1m �=j

{var

(βm

)}−1]−1

×∑M

m=1m �=j

{var

(βm

)}−1βm, j = 1, . . . ,M


diff: Simplifying the estimation ofdifference-in-differences treatment effects

Juan M. VillaGlobal Development Institute

University of ManchesterManchester, UK

[email protected]

Abstract. In this article, I present the features of the user-written command diff,which estimates difference-in-differences (DID) treatment effects. diff simplifiesthe DID analysis by allowing the conventional DID setting to be combined withother nonexperimental evaluation methods. The command is equipped with anattractive set of options: the single DID with covariates, the kernel propensity-scorematching DID, and the quantile DID. Specific options are included to obtain DID

estimation on a repeated cross-section setting and to test the general balancingproperties of the model. I illustrate the features of diff using a sample of thedataset from the pioneering implementation of DID by Card and Krueger (1994,American Economic Review 84: 772–793).

Keywords: st0424, diff, difference-in-differences, causal inference, kernel propensityscore, quantile treatment effects, nonexperimental methods, DID, QDID

1 Introduction

There is a growing body of literature using difference-in-differences (DID) treatmenteffects as a reliable nonexperimental evaluation method.1 DID estimation has beenwidely used when panel data or repeated cross-sections are available for interventionimpact assessments. A key aspect of DID is that it facilitates the causal inference anal-ysis of an intervention when time-invariant unobserved heterogeneity might confound acausal-effect analysis (Abadie 2005; Angrist and Pischke 2009). Different specificationsof the DID model can also account for observed heterogeneity and can incorporate othernonexperimental evaluation methods into the analysis.

Despite the availability of other plausible methods based on the existence of obser-vational data for nonexperimental causal inference (that is, matching methods, instru-mental variables, regression discontinuity, etc.), DID estimation offers an alternative byreaching unbiased results while accounting for time-invariant unobserved heterogeneity.Four elements are specific to the DID setting (see figure 1): the first one is the availabilityof a treated group and control group; the second is the existence of parallel paths in thepretreatment trends; the third is the clear time cutoff identifying when the treatmentstarts, so there is a before and after period; and the fourth is the assumption that, with-

1. According to https://scholar.google.com (accessed in April 2015), while the number of academicdocuments using DID was 136 in 2000, it had reached 2,990 in 2014.


J. M. Villa 53

out the treatment, the treated group would show a trend similar to that observed forthe control group. Thus the DID treatment effects are obtained when panel or repeatedcross-section data are available and a treatment has been administered.

Figure 1. Basic DID setting

Although the latest version of Stata is equipped with the command teffects, whichestimates the treatment effects on a cross-sectional basis, DID is based on the assessmentof an intervention’s impact on a given outcome variable in a before-and-after setting.While DID treatment effects are focused on comparing treated and control groups sharingcommon pretreatment trends, the options of the teffects command entail estimatingthe average treatment effects, with special focus on the nearest-neighbor matching ap-proach. Therefore, although existing nonexperimental evaluation methods can reachdifferent levels of internal and external validity (Dehejia 2013), the best method forevaluating a given intervention depends on the characteristics of the available data.

In this article, I present the user-written command diff, which estimates DID treat-ment effects. diff runs several types of DID estimation beyond basic single DID. diff isattractive because it combines the single DID with control covariates, advanced match-ing methods, and balancing-test analysis. By employing two-period panel data or re-peated cross-sections, diff joins the DID treatment-effects estimation with the kernelpropensity-score matching following Heckman, Ichimura, and Todd (1997, 1998), andBlundell and Dias (2009). This kernel propensity-score matching in diff follows thealgorithm of psmatch2 developed under a cross-sectional setting by Leuven and Sianesi(2003). diff also allows estimation of the DID treatment effects at different quantiles forthe kernel matching and repeated cross-sections options (Meyer, Viscusi, and Durbin1995). In this article, I provide details on implementing diff using a sample of thedataset from Card and Krueger’s (1994) pioneering article on the effects of a naturalexperiment consisting of a minimum-wage increase in the United States. Finally, Iexplain how the balancing properties can be tested when information on covariates isprovided.

54 Simplifying the estimation of difference-in-differences treatment effects

diff makes an important contribution to advancing the development of commandsdesigned for causal inference analysis in Stata. In addition to the existing commandsfor assessing the impact of interventions with data on a cross-section format, diff

extends the causal inference analysis for panel data with a before-and-after setting.For instance, pscore, psmatch2, and nnmatch (Abadie et al. 2004; Becker and Ichino2002; Leuven and Sianesi 2003) estimate the effects of interventions by using matchingtechniques; rd (Nichols 2007) is helpful for when the treatment is delivered according toan assignment variable with a clear cutoff selection threshold; and ivtreatreg (Cerulli2014) assists in the specification of an instrumental-variable approach accounting forunobserved heterogeneity with cross-sectional data. diff joins this family of user-written commands in Stata by providing an intuitive syntax and simplifying the casualinference analysis with binary treatments over time.

I divide this article into four sections. In section 2, I explain the equations behindthe estimation of the DID and the development of the diff command. In section 3, Ipresent the syntax for the command and options. In section 4, I provide an exampleusing diff on the Card and Krueger (1994) data.

2 A basic DID framework

The definition of DID treatment effects estimated by diff is based on the existenceof a pair of before-and-after periods, namely, one baseline (t = 0) and one follow-up(t = 1). The basic DID framework is dependent on the availability of two groups ofunits i, including a treated group to which the treatment is delivered (Zi = 1) and acontrol group to which the treatment is not delivered (Zi = 0). The treatment indicatorin the DID setting requires absence of any intervention in the baseline for either group(Di,t=0 = 0|Zi = 1, 0), and it requires the intervention to be positive for the treatedgroup in the follow-up (Di,t=1 = 1|Zi = 1). For a given outcome variable, Yit, thepopulation DID treatment effect is given by the difference in the outcome variable fortreated and control units before and after the intervention. The single DID setting isgiven by

DID = {E(Yit=1|Dit=1 = 1, Zi = 1)− E(Yit=1|Dit=1 = 0, Zi = 0)}− {E(Yit=0|Dit=0 = 0, Zi = 1)− E(Yit=0|Dit=0 = 0, Zi = 0)} (1)

This single DID can be combined with other nonexperimental evaluation methods.Additional control covariates are important when observed heterogeneity may confoundthe identification strategy. Given the features of DID estimation, observed covariatesshould be exempt from the effects of the treatment. Thus, if observable covariates (Xi)are available, they can be added into the analysis.

DID = {E(Yit=1|Dit=1 = 1, Zi = 1, Xi)− E(Yit=1|Dit=1 = 0, Zi = 0, Xi)}− {E(Yit=0|Dit=0 = 0, Zi = 1, Xi)− E(Yit=0|Dit=0 = 0, Zi = 0, Xi)} (2)

J. M. Villa 55

A complementary method to the DID treatment effect is the incorporation of kernelpropensity-score weights. Apart from the inclusion of control variables, observed co-variates can be used to estimate the propensity score (the likelihood of being treated)and to calculate kernel weights following Heckman, Ichimura, and Todd (1997, 1998).Instead of accounting for control variables, this method matches treated and controlunits according to their propensity score. Each treated unit is matched to the wholesample of control units instead of on a limited number of nearest neighbors. To begin,one obtains the propensity score (pi) for both groups.

pi = E(Zi = 1|Xi)

According to Heckman, Ichimura, and Todd (1997), the kernel matching is given bythe propensity score, given the covariates, which leads to the calculation of the kernelweights,

wi =K(

pi−pk

hn

)∑

K(

pi−pk

hn

) (3)

in which K(·) is the kernel function and hn is the selected bandwidth. The kernelweights are then introduced into (1) to obtain a kernel propensity-score matching DID

treatment effect as follows:

DID = {E(Yit=1|Dit=1 = 1, Zi = 1)− wi × E(Yit=1|Dit=1 = 0, Zi = 0)}− {E(Yit=0|Dit=0 = 0, Zi = 1)− wi × E(Yit=0|Dit=0 = 0, Zi = 0)} (4)

Now, to increase the internal validity of the DID estimand, one can restrict (4) to thecommon support of the propensity score for treated and control groups. The commonsupport is the overlapping region of the propensity for treated and control groups. Thissample of i units can be restricted to the region defined as

(i : pi ∈ [max{min(pi|Zi = 1),min(pi|Zi = 0)},min{max(pi|Zi = 1),min(pi|Zi = 0)}])

Complementarily, when treated and control units cannot be followed over the base-line and follow-up periods, the DID treatment effects can be estimated with repeatedcross-sections. This is very common when a treatment has been administered to cer-tain regional or demographic groups over several cross-sections. The kernel propensity-score matching with repeated cross-section DID treatment effects is specified followingBlundell and Dias (2009).

DID = {E(Yit=1|Dit=1 = 1, Zi = 1)− wcit=1 × E(Yit=1|Dit=1 = 0, Zi = 0)}

− wtit=0 × {E(Yit=0|Dit=0 = 0, Zi = 1)− wc

it=0 × E(Yit=0|Dit=0 = 0, Zi = 0)}

Here wcit=0 and wc

it=1 are the kernel weights for the control group in the baseline andfollow-up periods, respectively, while wt

it=0 is the kernel weight for the treated groupin the baseline period. The three sets of kernel weights are calculated independently


according to the estimated propensity score and do not require the panel structure ofthe units in the sample.

Finally, the balancing property of the treated and the control can be tested. Giventhe availability of observable covariates, it can be shown that in absence of the treatment,the outcome variable is orthogonal to the treatment indicator given the set of covariates.In other words, the balancing property can be tested in the baseline as

Yit=0⊥Zi|Xi (5)

Note that the balancing property is optional in the DID setting. The most importantassumption, which is not tested in this approach, is the complement of the parallel pathsof the outcome for the treated and the control groups. Given the availability of twoperiods in this analysis, this assumption cannot be tested here. For an extension of thistest, see Mora and Reggio (2012).

2.1 Estimation

To estimate the expected values in (1), we rely on linear regression for the single DID

analysis. The subsequent complementary introduction of control variables or kernelpropensity-score matching weights is similarly specified by linear regression. In thebasic framework, the estimation can be shown as follows:

outcome vari = β0+β1×period()i+β2×treated()i+β3×period()i×treated()i+ei

Here outcome vari is the outcome variable for each unit; period()i is a binary vari-able taking the value of 0 in the baseline and 1 in the follow-up periods; and treated()iis a binary variable indicating the treatment status for each unit, similar to Zi = 1.

The expected values in (1) are obtained from the interaction of the estimated coef-ficients. The estimated coefficients have the following interpretation:

• β0: the mean outcome of the control group at the baseline.

• β0 + β1: the mean outcome of the control group in the follow-up.

• β2: the single difference between the treated and the control groups at the baseline.

• β0 + β2: the mean outcome of the treated group at the baseline.

• β0 + β1 + β2 + β3: the mean outcome of the treated group in the follow-up.

• β3: the DID estimand.

3 The diff command

The diff command demands pooled data containing the treatment status of the treatedand control groups and the before-and-after period indicator. diff mainly simplifies

J. M. Villa 57

DID estimation by providing the output table with the estimated coefficients and theirinteractions.

3.1 Syntax

diff outcome var[if] [

in] [

weight], period(varname) treated(varname)[

cov(varlist) kernel id(varname) bw(#) ktype(kernel) rcs qdid(quantile)

pscore(varname) logit support addcov(varlist) cluster(varname) robust

bs reps(int) test report nostar export(filename)]

The command requires the specification of the outcome variable (outcome var) andallows the use of sampling weights.

The simplification of the diff command also consists of the arrangement of theregression coefficients in the output table. The number of observations, R-squared, thestandard errors, the t statistic (or the z statistic when standard errors are bootstrapped),and the p-value are also presented.

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTSNumber of observations in the DIFF-IN-DIFF: #

Baseline Follow-upControl: # # (total)Treated: # # (total)

(total) (total)

Outcome var. fte S. Err. t P>|t|

Baseline

Control β0

Treated β0 + β2

Diff (T-C) β2

Follow-up

Control β0 + β1

Treated β0 + β1 + β2 + β3

Diff (T-C) β2 + β3

Diff-in-Diff β3

R-square: #.##* Means and Standard Errors are estimated by linear regression**Inference: *** p<0.01; ** p<0.05; * p<0.1


3.2 Options

period(varname) specifies the binary period variable (0: baseline; 1: follow-up). Op-tion period() is required.

treated(varname) specifies the binary treatment variable (0: controls; 1: treated).Option treated() is required.

cov(varlist) allows the user to include control covariates in the model [Xi in (2)]. Thecoefficients of the variables in cov(varlist) are not displayed in the output table.They can be seen when option report is specified.

kernel performs the kernel propensity-score matching DID. This option generates thevariable weights, which contains the weights2 derived from the kernel propensity-score matching, as well as generates ps when the propensity score is not suppliedin pscore(varname), following Leuven and Sianesi (2003). This option requiresspecification of the id(varname) option except when the rcs option is also specified(under the repeated cross-section setting).3 Under a panel or cross-sectional setting,you can specify the support option with kernel to allow the estimation of the DID

on the common support.

id(varname) specifies the identification variable for each unit or individual when thedataset is composed of a panel of treated and control groups. Option kernel requiresid().

bw(#) specifies the supplied bandwidth of the kernel function. The default is bw(0.06).

ktype(kernel) specifies the type of the kernel function. The types are epanechnikov

(the default), gaussian, biweight, uniform, and tricube.

rcs indicates that the kernel is set for repeated cross-section. This option doesnot require option id(varname). Option rcs strongly assumes that covariates incov(varlist) do not vary over time.

qdid(quantile) performs the quantile difference-in-differences (QDID) estimation at thespecified quantile from 0.1 to 0.9 (quantile 0.5 performs the QDID at the median).This option may be combined with kernel and cov(). qdid() does not supportweights or robust standard errors. This option uses the Stata commands qreg forquantile nonlinear regressions and bsqreg for complementary bootstrapped standarderrors. See Angrist and Pischke (2009) for detailed information on quantile treat-ment effects and Meyer, Viscusi, and Durbin (1995) for an illustrative example.

pscore(varname) specifies the supplied propensity score.

2. These weights are 1 for treated units or individuals.3. See Blundell and Dias (2009) for further details on kernel propensity-score matching upon repeated

cross-sections.

J. M. Villa 59

logit specifies logit estimation of the propensity score. The default is probit estimation.The results of the probit estimation are used to predict the probability of beingtreated, known as the propensity score, and then to calculate the kernel matching,as in (3).

support performs diff on the common support of the propensity score given the optionkernel.

addcov(varlist) specifies additional covariates with those specified in the estimation ofthe propensity score.

cluster(varname) estimates clustered standard errors by the specified category invarname.

robust estimates robust standard errors following Stata’s sandwich-type estimation.

bs executes a bootstrapped estimation of standard errors.

reps(int) specifies the number of replications when the bs option is also specified. Thedefault is reps(50).

test performs a balancing t test of the difference in the means of the covariates betweenthe control and the treated groups in period() = 0. The option test combinedwith kernel performs the balancing t test with the weighted covariates; see [R] ttest.This option is one way to test (5).

report displays the inference of the included covariates or the estimation of the propen-sity score when option kernel is specified.

nostar removes the inference stars from the p-values.

export(filename) exports the output table into the working directory in a .csv file.See [D] cd for details.


3.3 Stored results

diff stores the following in r():

Scalarsr(mean c0) mean of output var of the control group in period()= 0r(mean t0) mean of output var of the treated group in period()= 0r(diff0) difference of the mean of output var between the treated and the

control groups in period t = 0r(mean c1) mean of output var of the control group in period()= 1r(mean t1) mean of output var of the treated group in period()= 1r(diff1) difference of the mean of output var between the treated and the

control groups in period()= 1r(did) DID treatment effectr(se c0) standard error of the mean of output var of the control group in period()= 0r(se t0) standard error of the mean of output var of the treated group in period()= 0r(se d0) standard error of the difference of output var between the treated and the

control groups in period()= 0r(se c1) standard error of the mean of output var of the control group in period()= 1r(se t1) standard error of the mean of output var of the treated group in period()= 1r(se d1) standard error of the difference of output var between the treated and the

control groups in period()= 1r(se dd) standard error of the DID

4 Example

To illustrate the use of diff, we use a downloadable dataset (included with the com-mand) with a sample of the data used by Card and Krueger (1994).4 The data are froma study by the authors on the impact of the increase in minimum wage in New Jersey(the treated group) on the employment level in the fast-food industry. This interven-tion took place in April 1992. They compare the changes in the number of employeesat fast-food restaurants in the treated group with those located in a neighboring state,Pennsylvania (the control or untreated group). They conducted a baseline survey inFebruary 1992 and a follow-up in November.

4. This dataset is provided for illustration only. It might not be suitable for all diff options.

J. M. Villa 61

The details of the variables in the dataset are as follows:

. use cardkrueger1994.dta(Sample dataset from Card and Krueger (1994))

. describe

Contains data from cardkrueger1994.dtaobs: 780 Sample dataset from Card and

Krueger (1994)vars: 8 12 Mar 2014 14:03size: 11,700

storage display valuevariable name type format label variable label

id int %8.0g Store IDt byte %8.0g Feb. 1992 = 0; Nov. 1992 = 1treated long %8.0g treated New Jersey = 1; Pennsylvania = 0fte float %9.0g Output: Full Time Employmentbk byte %8.0g Burger King == 1kfc byte %8.0g Kentucky Fried Chicken == 1roys byte %8.0g Roy Rogers == 1wendys byte %8.0g Wendy´s == 1

Sorted by: id t treated

With 780 observations, the number of units (or restaurants) is 314 and 76 in thetreated and the control groups (or states), respectively. The outcome variable is full-time employment (fte). Some covariates are defined as binary variables, indicatingwhether the observation belongs to a given fast-food restaurant. The basic statisticsare shown as follows:

. summarize id t treated fte bk kfc roys wendys

Variable Obs Mean Std. Dev. Min Max

id 780 247.2641 148.644 1 522t 780 .5 .5003208 0 1

treated 780 .8051282 .3963561 0 1fte 780 17.58109 9.095066 0 80bk 780 .4179487 .4935381 0 1

kfc 780 .2051282 .4040544 0 1roys 780 .2435897 .4295233 0 1

wendys 780 .1333333 .3401528 0 1


4.1 Single DID with no covariates

The single DID treatment-effects estimations with diff requires three variables: out-come, treated, and period. This basic estimation assumes that time-invariant unob-served heterogeneity exclusively contaminates the identification strategy. In absence ofcontrol covariates, the command is run as follows:

. diff fte, treated(treated) period(t)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTSNumber of observations in the DIFF-IN-DIFF: 780

Baseline Follow-upControl: 76 76 152Treated: 314 314 628

390 390


BaselineControl 20.013Treated 17.069Diff (T-C) -2.944 1.160 -2.54 0.011**

Follow-upControl 17.523Treated 17.518Diff (T-C) -0.005 1.160 -0.00 0.997

Diff-in-Diff 2.939 1.641 1.79 0.074*

R-square: 0.01- Means and Standard Errors are estimated by linear regression**Inference: *** p<0.01; ** p<0.05; * p<0.1

The baseline rows contain information on the mean outcome for each group as wellas each group’s single difference (−2.944 in this case). These estimators are presentedalong with their standard errors, t statistics, and p-values. The same information isdisplayed for the follow-up period. The last row is the DID treatment-effects estimand,implying an increase in the number of employees by 2.939. The p-value is accompaniedby a star, which indicates the statistical inference at different significance levels, asshown below the table (*** p < 0.01; ** p < 0.05; * p < 0.1). In this case, the DID

estimand is significant at the 10% level.

J. M. Villa 63

The parametric estimation of the standard errors could be problematic in certaincircumstances. Therefore, as an alternative, bootstrapped standard errors can be re-quested by adding the option bs.

. diff fte, treated(treated) period(t) bs rep(50)(running regress on estimation sample)

Bootstrap replications (50)1 2 3 4 5

.................................................. 50



390 390



Follow-upControl 17.523Treated 17.518Diff (T-C) -0.005 1.216 -0.00 0.997

Diff-in-Diff 2.939 1.768 1.66 0.096*

R-square: 0.01- Bootstrapped Standard Errors- Means and Standard Errors are estimated by linear regression**Inference: *** p<0.01; ** p<0.05; * p<0.1


4.2 Single DID with covariates

When control covariates are available in the dataset, diff allows them to be includedwith the option cov(varlist). In this case, the binary variables indicating the categoriesof the restaurants are provided.

. diff fte, treated(treated) period(t) cov(bk kfc roys)DIFFERENCE-IN-DIFFERENCES WITH COVARIATES



390 390



Follow-upControl 18.852Treated 19.452Diff (T-C) 0.600 1.052 0.57 0.569

Diff-in-Diff 2.939 1.485 1.98 0.048**


Option report displays the output table of the coefficients and statistics from thecov(varlist).

. diff fte, treated(treated) period(t) cov(bk kfc roys) reportDIFFERENCE-IN-DIFFERENCES WITH COVARIATES



390 390Report - Covariates and coefficients:

Variable(s) Coeff. Std. Err. t P>|t|

bk 0.850 0.925 0.918 0.359kfc -9.331 1.037 -8.997 0.000roys -1.054 1.003 -1.051 0.294

J. M. Villa 65




Diff-in-Diff 2.939 1.485 1.98 0.048**


4.3 Kernel propensity-score matching DID treatment effects

As mentioned above, the control covariates can be used to match treated and controlunits. This is possible with diff by adding the kernel option. Additionally, thekernel propensity-score matching DID can be estimated on the common support of thepropensity score, and the propensity score can be provided in option pscore(varname)if the user has previously executed its estimation. The syntax is

diff fte, treated(treated) period(t) cov(bk kfc roys) kernel id(id)

To view the first stage of the estimation of the propensity score, the user shouldsupply the report option.

. diff fte, treated(treated) period(t) cov(bk kfc roys) kernel id(id) reportKERNEL PROPENSITY SCORE MATCHING DIFFERENCE-IN-DIFFERENCES

Report - Propensity score estimation with probit commandAtention: _pscore is estimated at baseline

Iteration 0: log likelihood = -192.3521Iteration 1: log likelihood = -191.15937Iteration 2: log likelihood = -191.15777

Probit regression Number of obs = 390LR chi2(3) = 2.39Prob > chi2 = 0.4957

Log likelihood = -191.15777 Pseudo R2 = 0.0062

treated Coef. Std. Err. z P>|z| [95% Conf. Interval]

bk .1368372 .2190827 0.62 0.532 -.292557 .5662315kfc .3619436 .2549971 1.42 0.156 -.1378415 .8617288

roys .2448943 .2415265 1.01 0.311 -.228489 .7182775_cons .6744898 .1889629 3.57 0.000 .3041292 1.04485


Matching iterations...................................................................................> ..............................................................................> ..............................................................................> ..............................................................................DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTSNumber of observations in the DIFF-IN-DIFF: 780


390 390


BaselineControl 20.006Treated 17.069Diff (T-C) -2.937 0.959 -3.06 0.002***


Diff-in-Diff 3.088 1.357 2.28 0.023**


In the case of repeated cross-sections, diff should be typed as

. diff fte, treated(treated) period(t) cov(bk kfc roys) kernel rcsKERNEL PROPENSITY SCORE MATCHING DIFFERENCE-IN-DIFFERENCES

Repeated Cross Section - rcs optionMatching iterations: control group at base line...

................................................................................> ..............................................................................> ..............................................................................> ..............................................................................

Matching iterations: control group at follow up...................................................................................> ..............................................................................> ..............................................................................> ..............................................................................

Matching iterations: treated group at baseline...................................................................................> ..............................................................................> ..............................................................................> ..............................................................................

J. M. Villa 67



390 390


BaselineControl 20.006Treated 17.497Diff (T-C) -2.508 0.961 -2.61 0.009***


Diff-in-Diff 2.660 1.359 1.96 0.051*


4.4 QDID

It sometimes is useful to assess the effects of the intervention over the distribution of theoutcome variable. diff provides this option on the DID setting. Here one would like toknow whether the effect of the increase in minimum wage was stronger for restaurantswith a low or high number of full-time employees. The QDID is then obtained when theoption qdid(quantile) is specified. For example, estimating the treatment effects on themedian of the number of full-time employees requires the following syntax:

diff fte, treated(treated) period(t) qdid(0.50)


This specification might be combined with covariates as follows:

. diff fte, treated(treated) period(t) qdid(0.50) cov(bk kfc roys)QUANTILE DIFFERENCE-IN-DIFFERENCES WITH COVARIATES



390 390


BaselineControl 17.250Treated 17.250Diff (T-C) 0.000 0.996 0.00 1.000


Diff-in-Diff -0.000 1.412 -0.00 1.000

R-square: 0.15- Values are estimated at the .5 quantile**Inference: *** p<0.01; ** p<0.05; * p<0.1

By chance, when one accounts for covariates, at the 0.5 quantile, which is the same asthe median of the dependent variable, the value of the full-time employment variable(fte) is similar for control and treated units in the baseline and follow-up periods.Therefore, the result above indicates a DID effect of −1.407e-15 (very close to 0), but itis rounded to −0.000 because of the numbering format of the table.

As with the single DID, the QDID can be combined with the kernel option (also inrepeated cross-sections).

. diff fte, treated(treated) period(t) qdid(0.50) cov(bk kfc roys) kernel> id(id) reportKERNEL PROPENSITY SCORE MATCHING QUANTILE DIFFERENCE-IN-DIFFERENCES

Report - Propensity score estimation with probit commandAtention: _pscore is estimated at baseline


Probit regression Number of obs = 390LR chi2(3) = 2.39Prob > chi2 = 0.4957

Log likelihood = -191.15777 Pseudo R2 = 0.0062

treated Coef. Std. Err. z P>|z| [95% Conf. Interval]

bk .1368372 .2190827 0.62 0.532 -.292557 .5662315kfc .3619436 .2549971 1.42 0.156 -.1378415 .8617288

roys .2448943 .2415265 1.01 0.311 -.228489 .7182775_cons .6744898 .1889629 3.57 0.000 .3041292 1.04485

J. M. Villa 69

Matching iterations...................................................................................> ..............................................................................> ..............................................................................> ..............................................................................DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTSNumber of observations in the DIFF-IN-DIFF: 780


390 390


BaselineControl 17.000Treated 15.750Diff (T-C) -1.250 1.202 -1.04 0.299


Diff-in-Diff 2.250 1.695 1.33 0.185

R-square: 0.00- Values are estimated at the .5 quantile**Inference: *** p<0.01; ** p<0.05; * p<0.1

4.5 Balancing test

The balancing test for each covariate is obtained only at the baseline. The syntax issimilar to the one for DID with control covariates. The syntax for this option is

. diff fte, treated(treated) period(t) cov(bk kfc roys) testTWO-SAMPLE T TEST

Number of observations (baseline): 390Baseline Follow-up

Control: 76 - 76Treated: 314 - 314

390 -

t-test at period = 0:

Variable(s) Mean Control Mean Treated Diff. |t| Pr(|T|>|t|)

fte 20.013 17.069 -2.944 2.43 0.0155**bk 0.447 0.411 -0.037 0.58 0.5634kfc 0.158 0.217 0.059 1.14 0.2569roys 0.224 0.248 0.025 0.45 0.6533

*** p<0.01; ** p<0.05; * p<0.1


When combined with the kernel option, the covariates are weighted, and the dif-ferences are obtained by linear regression (this test is also suitable with repeated cross-sections).

. diff fte, treated(treated) period(t) cov(bk kfc roys) test id(id) kernelMatching iterations...

................................................................................> ..............................................................................> ..............................................................................> ..............................................................................TWO-SAMPLE T TEST

Number of observations (baseline): 390Baseline Follow-up

Control: 76 - 76Treated: 314 - 314

390 -

t-test at period = 0:

Weighted Variable(s) Mean Control Mean Treated Diff. |t| Pr(|T|>|t|)

fte 20.006 17.069 -2.937 2.79 0.0056***bk 0.465 0.411 -0.054 1.08 0.2797kfc 0.146 0.217 0.071 1.81 0.0704*roys 0.285 0.248 -0.036 0.81 0.4174

*** p<0.01; ** p<0.05; * p<0.1Attention: option kernel weighs variables in cov(varlist)Means and t-test are estimated by linear regression

5 Acknowledgments

I thank Kit Baum from Boston College for his valuable suggestions. I also thank at-tendees at the 2012 Stata Users Meeting Group in London, UK, for providing feedbackon a previous version of this command. David Card from the University of Califor-nia, Berkeley, as well as Vincenzo di Maro from The World Bank and Pablo Ibarraranfrom the Inter-American Development Bank, provided important suggestions in an earlystage of the development of the code. Monica Oviedo from Universitat Autonoma deBarcelona contributed with a review of some options of the diff code. I am gratefulto the Global Development Institute (formerly Brooks World Poverty Institute) andUnited Nations University World Institute for Development Economics Research fortheir research support. All the errors and omissions in the article are my own.

6 ReferencesAbadie, A. 2005. Semiparametric difference-in-differences estimators. Review of Eco-nomic Studies 72: 1–19.

Abadie, A., J. L. Herr, G. Imbens, and D. M. Drukker. 2004. nnmatch: Stata module tocompute nearest-neighbor bias-corrected estimators. Statistical Software ComponentsS439701, Department of Economics, Boston College.http://econpapers.repec.org/software/bocbocode/s439701.htm.

J. M. Villa 71

Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’sCompanion. Princeton, NJ: Princeton University Press.

Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based onpropensity scores. Stata Journal 2: 358–377.

Blundell, R., and M. C. Dias. 2009. Alternative approaches to evaluation in empiricalmicroeconomics. Journal of Human Resources 44: 565–640.

Card, D., and A. B. Krueger. 1994. Minimum wages and employment: A case study ofthe fast-food industry in New Jersey and Pennsylvania. American Economic Review84: 772–793.

Cerulli, G. 2014. ivtreatreg: A command for fitting binary treatment models withheterogeneous response to treatment and unobservable selection. Stata Journal 14:453–480.

Dehejia, R. 2013. The porous dialectic: Experimental and non-experimental methods indevelopment economics. WIDER Working Paper No. WP/2013/011, United NationsUniversity World Institute for Development Economics Research.http://www.wider.unu.edu/publications/working-papers/2013/en GB/wp2013-011/.

Heckman, J. J., H. Ichimura, and P. E. Todd. 1997. Matching as an econometricevaluation estimator: Evidence from evaluating a job training programme. Review ofEconomic Studies 64: 605–654.

. 1998. Matching as an econometric evaluation estimator. Review of EconomicStudies 65: 261–294.

Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobisand propensity score matching, common support graphing, and covariate imbalancetesting. Statistical Software Components S432001, Department of Economics, BostonCollege. https://ideas.repec.org/c/boc/bocode/s432001.html.

Meyer, B. D., W. K. Viscusi, and D. L. Durbin. 1995. Workers’ compensation andinjury duration: Evidence from a natural experiment. American Economic Review85: 322–340.

Mora, R., and I. Reggio. 2012. Treatment effect identification using alternative parallelassumptions. Working Paper 12-33, Universidad Carlos III de Madrid.

Nichols, A. 2007. rd: Stata module for regression discontinuity estimation. StatisticalSoftware Components S456888, Department of Economics, Boston College.https://ideas.repec.org/c/boc/bocode/s456888.html.

About the author

Juan M. Villa is a research fellow at the Global Development Institute at the University ofManchester, UK.

http://www.wider.unu.edu/publications/working-papers/2013/en_GB/wp2013-011/


mfpa: Extension of mfp using the ACDcovariate transformation for enhancedparametric multivariable modeling

Patrick RoystonMRC Clinical Trials UnitUniversity College London

London, UK

[email protected]

Willi SauerbreiCenter for Medical Biometry and Medical Informatics

Medical Center—University of FreiburgFreiburg, Germany

[email protected]

Abstract. In a recent article, Royston (2015, Stata Journal 15: 275–291) intro-duced the approximate cumulative distribution (ACD) transformation of a contin-uous covariate x as a route toward modeling a sigmoid relationship between x andan outcome variable. In this article, we extend the approach to multivariable mod-eling by modifying the standard Stata program mfp. The result is a new program,mfpa, that has all the features of mfp plus the ability to fit a new model for user-selected covariates that we call FP1(p1, p2). The FP1(p1, p2) model comprises thebest-fitting combination of a dimension-one fractional polynomial (FP1) functionof x and an FP1 function of ACD (x). We describe a new model-selection algorithmcalled function-selection procedure with ACD transformation, which uses signifi-cance testing to attempt to simplify an FP1(p1, p2) model to a submodel, an FP1

or linear model in x or in ACD (x). The function-selection procedure with ACD

transformation is related in concept to the FSP (FP function-selection procedure),which is an integral part of mfp and which is used to simplify a dimension-two (FP2)function. We describe the mfpa command and give univariable and multivariableexamples with real data to demonstrate its use.

Keywords: st0425, mfpa, mfp, continuous covariates, sigmoid function, ACD trans-formation, multivariable fractional polynomials, regression models

1 Introduction

Over the years, fractional polynomials (FPs) have steadily gained popularity as a toolfor flexible parametric modeling of regression relationships. A recent search in GoogleScholar (22 February 2016) yielded 1,181 citations of the original article by Roystonand Altman (1994). The multivariable fractional polynomials (MFP) method of multi-ple regression modeling (Sauerbrei and Royston 1999) simultaneously removes weaklyinfluential predictors and determines a suitable functional form (FP or linear) for con-tinuous predictors. MFP is implemented as the mfp command in Stata. Its appeal maylie in a combination of relative simplicity and familiarity (an extension of conventionalpolynomials) with added flexibility for representing nonlinear functional forms and usu-ally a low probability of introducing uninterpretable artifacts into the fitted functions.Furthermore, unlike splines—which have only a local interpretation of the fitted function(piecewise between knots)—FPs provide a curve with a global interpretation.


P. Royston and W. Sauerbrei 73

MFP extends backward elimination by systematically searching for improvement infit by modeling possible nonlinearity in the effects of continuous variables. The heartof MFP lies in modeling each continuous predictor using FP functions combined witha principled function-selection procedure (FSP) to yield a simplified functional form, ifappropriate. Each predictor is modeled univariately by this method, adjusted for theother predictors, within an overarching back-fitting algorithm that visits each predictorin turn.

Royston (2015) described an extension of univariate FP modeling via the so-calledapproximate cumulative distribution (ACD) covariate transformation. The ACD trans-formation is a smooth function that maps a continuous covariate, x, to an approxima-tion, ACD (x), of its distribution function. By construction, the distribution of ACD (x)in the sample is roughly uniform on (0, 1). FP modeling is then performed with thetransformed values ACD (x) instead of x as a predictor. Royston (2015) showed thatsuch an approach could successfully represent a sigmoid function of x, something a stan-dard FP function cannot do (Royston and Sauerbrei 2008, sec. 5.8.1). He went on todemonstrate that useful flexibility in functional form could be achieved by consideringboth x and a = ACD (x) simultaneously as independent predictors and applying the MFP

algorithm to x and a. To limit instability and overfitting, he suggested restricting themodels considered for x and a to FP1 functions. Royston (2015) also noted that modelsbased on ACD (x) may have other advantages in terms of interpretability of regressioncoefficients and resistance to the potential influence of extreme covariate observations.

In the present article, we take the modeling process further. We show how to selectoptimal FP1 functions for x and ACD (x) in a univariable context. We describe a modifiedversion of the FP FSP adapted to the x and ACD (x) approach. We then modify the MFP

algorithm to produce a new but closely related algorithm called MFPA, in which theFP FSP is replaced by the modified version (FSP with ACD transformation [FSPA]) justmentioned. MFPA may help with situations in which a sigmoid function is needed, whichMFP cannot provide. Also, as mentioned, MFPA may reduce the influence of extremecovariate values on a selected function.

The structure of the article is as follows. Section 2 describes how to select a univari-able model based on applying the FSPA to combinations of x and ACD (x). Section 3introduces MFPA as a modification of MFP. Section 4 gives examples of applying MFP

and MFPA to two real datasets. Section 5 describes mfpa, a new command that extendsthe standard mfp command by allowing the FSPA instead of the FSP to be applied toone or more of the candidate continuous predictors. Additionally, mfpa supports Stata’sfactor variables. Section 6 contains some final remarks.

2 Choosing a suitable functionIn this section, we propose a method to select a univariable model. We consider estima-tion with a single continuous predictor, x, combined with the preliminary transformationa = ACD (x). In section 3, we describe how the selected function can be used in an iter-ative multivariable modeling procedure, MFPA, that is closely related to MFP. We firstdefine the ACD transformation.

74 Extension to MFP

2.1 The ACD transformation

Let X be a continuous random variable to be considered as a covariate in some kindof regression model. We wish to approximate the empirical cumulative distributionfunction of a random sample x1, . . . , xn of n observations from the distribution of X.We define the ACD (·) transformation in several steps as follows. Let rank(xi) be the rankof xi, with ranks 1 and n denoting the lowest and highest sample values, respectively.Define

zi = Φ−1 {(rank (xi)− 0.5) /n}E (zi) = β0 + β1 (xi + s)

p

zi = E (zi) = β0 + β1 (xi + s)p

ACD (xi) = ai = Φ(zi)

where Φ (·) is the standard normal cumulative distribution function (normal() in Stata),Φ−1 (·) is its inverse (invnormal() in Stata), and p is the best-fitting estimate of p inan FP1 regression model E (zi) = β0 + β1 (xi + s)

p. Powers p are selected from the

set S = {−2,−1,−0.5, 0, 0.5, 1, 2, 3}. Ordinary least-squares regression of the zi on thevalues (xi + s)

pis used to estimate the parameters β0, β1, and p, with p = 0 meaning

log transformation. If any xi ≤ 0, then all the xi are shifted by a constant, s, chosento ensure that (xi + s) > 0 for all i; if all xi > 0, then s = 0. See, for example,Royston and Sauerbrei (2008, 84–85) for details of how s may be determined. In thefollowing, we assume that xi > 0 and s = 0 so that s can be ignored in the formulation.

An explanation of the rationale for the above approach is given in the section “TheACD transformation” in Royston (2015). Depictions of ACD (xi) when X has a normalor lognormal distribution are given in figure 1 in the section “Example 1: Simulateddistributions” of Royston (2015).

2.2 The model FP1(p1, p2) and some submodels

In an example analysis of the prognostic importance of tumor thickness in malignantmelanoma (Baade et al. 2015), Royston (2015) demonstrated that applying MFP toselect FP1 functions of x = tumor thickness and of a = ACD (x) simultaneously couldgive rise to a well-fitting function that a standard FP1 or FP2 function in x or in a couldnot match. The chosen function had a linear component in x and an FP1 component in a,with the latter being a sigmoid function of x. The result hinted that models comprisingFP functions of x and a might be of value in particular cases as an alternative to thestandard FP class.

In this section, we take the idea further and consider a four-parameter model class,β1x

p1 +β2ap2 , called FP1(p1, p2) and based on FP1 transformations of x and a. The aim

is to adapt to FP1(p1, p2) the FSP that, starting with the FP2 class, is used to determinea parsimonious FP function of x. Function selection needs to be done in a systematicand principled way. We address function selection in section 2.4.


First, we consider six models, M1–M6, each of which represents the best-fittingmodel within its respective class. They are potentially useful in deriving a more parsi-monious “final” model, aiming to reduce the risk of overfitting the most complex allowedfunction, M1 = FP1(p1, p2). M2–M6 are submodels of M1. The models are listed intable 1.

Table 1. Six submodels of FP1(p1, p2). A dot (.) indicates that the corresponding termis omitted.

Model Notation Function Comment

M1 FP1(p1, p2) β1xp1 + β2a

p2 The most complex allowed functionM2 FP1(p1, .) β1x

p1 Standard FP1 function of xM3 FP1(., p2) β2a

p2 Usually a singly or doubly asymptoticcurve in x

M4 FP1(1, .) β1x Linear reduction of model M2M5 FP1(., 1) β2a Linear reduction of model M3M6 FP1(., .) – Null model; x is omitted altogether

The models have been chosen to provide two nesting hierarchies that can be appliedfor model reduction: M1 ⊃ M2 ⊃ M4 ⊃ M6 and M1 ⊃ M3 ⊃ M5 ⊃ M6. For example,M1 ⊃ M2 means that M2 is nested in M1. These hierarchies are used to provide sets ofnested models for use in function selection (see section 2.4).

Plots of some of the functional forms available with models M1, M3, and M5 maybe seen in several of the figures in Royston (2015). Next, we consider estimation of theparameters of M1–M6.

2.3 Estimation

Models M2–M5 are conventional FP1 or linear models in x or in a. In univariablesettings, M6 is simply a constant. Powers p1 or p2 in M2 and M3 are estimated in theusual way by finding the corresponding values that maximize the likelihood in the setof power transformations S.

To estimate p1 and p2 in M1, one might consider applying MFP (with maximumallowed complexity FP1 functions) to x and a, treating them as though they were inde-pendent variables. However, because of the high collinearity of x and a, the approachmay produce a suboptimal fit; it does not always find the best values of p1 and p2.Instead, we systematically search all 8× 8 = 64 possible pairs (p1, p2) for the maximumlikelihood solution by fitting each of the FP1 models and finding the pair giving thehighest likelihood.

When p1 and p2 have been determined for M1–M5, models M1, M2, and M3 areconditionally linear and β1 and β2 are estimated by maximum likelihood in standardfashion.

76 Extension to MFP

2.4 Function-selection procedure FSPA

To select a suitable model among M1–M6 above, we need a systematic model-selectionprocedure akin to the FSP. Full details of the FSP are given by Royston and Sauerbrei(2008, 82–84). In summary, the FSP has three steps with the following characteristics:

1. The FSP is a closed test procedure that maintains the preselected nominal signifi-cance level (α1) for testing whether x is influential. The first test (FP2 versus null)achieves this. If FP2 is not a significantly better fit than null, then x is droppedand the procedure ends. Note that α1 is set by mfp’s option select(#), whosedefault value is 1, meaning that x is automatically selected and the procedure con-tinues to the function-selection stage. The α1 significance level is of course muchmore relevant to multivariable modeling than in the present context of functionselection for a single x.

2. Assuming x is deemed influential after the first test, the FSP is also a closed testprocedure that maintains a second preselected nominal significance level (α2) fortesting whether the functional form of the relation between x and the outcomeis nonlinear. The second test (FP2 versus linear) achieves this. If FP2 is not asignificantly better fit than linear, then a linear function of x is selected and theprocedure ends. Often, the significance levels α1 and α2 are taken as equal. Notethat α2 is set by mfp’s option alpha(#); the default is alpha(0.05).

3. If nonlinearity is found at the second step, a final test (FP2 versus FP1), also atthe α2 level, is applied to refine the selected function further. The procedure ends,selecting either an FP1 or an FP2 function.

Allowing ACD transformation, we can reproduce the main features of the FSP start-ing with FP1(p1, p2) as the most complex permitted function. We call the modifiedprocedure the FSPA. To enable testing, deviances (−2× log likelihood) for each of M1–M6 are first obtained, requiring 64(M1) + 8(M2) + 8(M3) + 1(M6) = 81 distinct modelfits. (Models M4 and M5 are already fit as special cases of FP1 models M2 and M3,respectively.) The FSPA then runs as follows.

1. Step 1 is identical to step 1 of the FSP except that M1 is tested against M6 (on4 degrees of freedom [d.f.]). This provides a closed test at the α1 level for x beinginfluential. If the test is nonsignificant, then drop x and end. Otherwise, continueto step 2.

2. Step 2 is identical to step 2 of the FSP except that M1 is tested against M4 (on3 d.f.). This provides a closed test at the α2 level for the functional form for xbeing nonlinear. If the test is nonsignificant, then accept a linear function for xand end. Otherwise, continue to step 3.

3. Step 3 is similar to step 3 of the FSP except that M1 is tested against M2 (on2 d.f.) and the procedure may continue. If the test is nonsignificant at the α2

level, then accept M2 and end. Otherwise, continue to step 4.


4. We now know that M1 is a significantly better fit than M2. However, it may bepossible to simplify M1 in the direction of the ACD model M3; therefore, M1 istested against M3 (on 2 d.f.). If the test is significant at the α2 level, then acceptM1 and end. Otherwise, continue to step 5.

5. Finally, M3 is tested against M5 (on 1 d.f.). If the test is significant at the α2

level, then accept M3 and end. Otherwise, accept M5 and end.

With the FSPA, depending on the choices of α1 and α2, we may obtain any of modelsM1–M6 as “final”. The ordered sequence of steps comprising the FSPA is designed toselect a linear or FP1 model if the fit of one of them is sufficient. Only if M1 is betterthan both M4 and M2 do M3 and M5 (ACD-based models) come into play. Thus theFSPA favors FP1 or linear functions in the sense that it will consider an ACD-based modelonly if a standard FP1 or linear model fails to fit as well as M1 does. The approachfollows the philosophy of MFP that an explanatory model should be as simple as possibleand that increased complexity should be adequately supported by an improved fit tothe data.

3 The MFP and MFPA algorithms

At each step of the MFP algorithm, the FSP is applied to each continuous covariatein turn to decide whether it is sufficiently influential (that is, significant at the α1

level) to remain in the model, and if so, to estimate its functional form (usually anFP2, FP1, or linear function). Categorical variables are also tested for inclusion instandard fashion. The models fit at each step are adjusted for all other currentlyselected candidate variables, whether continuous or categorical, retaining any FP orlinear functions if those have been selected so far. A cycle is defined as a complete tour,in a specified order, of all the candidate variables. The algorithm terminates when theselected functions or categorical variables do not change from one cycle to the next.Typically, MFP converges in about 2–4 cycles. Theoretically, MFP can oscillate betweentwo different solutions, but in practice such behavior is extremely rare. In section 6.3.2of Royston and Sauerbrei (2008), we illustrate further details of the algorithm in anexample.

The MFPA algorithm is identical to MFP except that the FSP is replaced with the FSPA

for any continuous variable(s) that the user wishes to assess using the ACD approach. Itis possible to specify ACD and hence the FSPA for any subset of the continuous predictors.In the mfpa program (described below in section 5), specifying which variables are tobe modeled with FP1(p1, p2) as the most complex permitted function of an x and thecorresponding a is done through the acd() option.

78 Extension to MFP

4 Examples

4.1 Example 1: A function with an asymptote

We use the well-known German breast cancer dataset (Schumacher et al. 1994), whichcan be loaded into Stata via the command webuse brcancer. The data are preparedfor survival analysis using the command stset rectime, failure(censrec).

We compare five functions selected for the effect of the strongest predictor (x5 =number of positive lymph nodes) in univariate Cox regression models, all adjusted forhormonal therapy (hormon). The models we consider for x5 are as follows:

1. FP2(p1, p2) for which the FSP selects (p1, p2) = (−2,−1) (that is, a quadraticfunction in x5−1).

2. A negative exponential model, that is, a linear function of exp (−0.12× x5) , assuggested by Sauerbrei and Royston (1999).

3. FP1(p1, p2), that is, model M1 without simplification, for which the maximumlikelihood estimate is (p1, p2) = (−0.5,−2).

4. FP1(p1, p2) with model simplification with the FSPA using α1 = α2 = 0.05, forwhich the selected powers are (p1, p2) = (., 3) (an instance of model M3; seetable 1).

5. A restricted cubic regression spline with 4 d.f. (Royston and Sauerbrei 2007b).

The fitted curves, depicting log relative-hazards fit by the Cox model, are shown infigure 1.


−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(a) FP2(−2, −1)

−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(b) Neg. exp.

−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(c) FP1(−0.5, −2)

−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(d) FP1(., 3)

−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(e) Spline (4 d.f.)

−0.5

0.0

0.5

1.0

1.5

0 10 20 30 40

(a)

(b)

(c)

(d)

(e)

(f) All functions

Log

rela

tive

haza

rd

Number of positive lymph nodes (x5)

Figure 1. Five fitted functions for x5 in the German breast cancer dataset. Graph (f)compares the functions shown individually in graphs (a) through (e).

The FP2(−2,−1), FP1(−0.5,−2), and spline curves are all nonmonotonic, with thespline curve exhibiting a maximum log relative-hazard at about 25 positive lymphnodes. Such nonmonotonicity is implausible for biologic reasons, because more posi-tive nodes should mean a higher risk of cancer recurrence. The negative exponentialand FP1(., 3) curves are closely similar and are by construction both monotonic. Thus,the FSPA provides a “good” model for x5 within the ACD-extended FP class withoutresorting to special nonlinear functions such as the negative exponential transforma-tion in figure 1(b). The FP2 function fits the data best, but the local minimum at twonodes conflicts with medical knowledge and is probably a result of overfitting the data.Sauerbrei and Royston (1999) therefore introduced the negative exponential transfor-mation as a possible pretransformation to provide a monotonic function.

As an illustration of the workings of the FSPA, table 2 shows the results of the varioustests on the deviances (minus twice the log partial likelihoods) for the six models M1–M6for x5, adjusted for hormon.

80 Extension to MFP

Table 2. Models (table 2a) and accompanying tests (table 2b) comprising the FSPA

when applied to x5 (number of positive lymph nodes) in the German breast cancerdata. All models are Cox regression, adjusted for hormon.

Table 2a.

ModelCode Description Deviance

M1 FP1(−0.5,−2) 3483.88M2 FP1(0.5, .) 3493.35M3 FP1(., 3) 3486.13M4 FP1(1, .) 3517.88M5 FP1(, .1) 3494.06M6 FP1(., .) 3567.53

Table 2b.

FSPA model comparisonsStep Comparison Dev. diff. p-value

1 M1 versus M6 83.65 <0.0012 M1 versus M4 34.00 <0.0013 M1 versus M2 9.47 0.0094 M1 versus M3 2.25 0.35 M3 versus M5 7.93 0.005

We see that M1 fits significantly better than all of M6 (P < 0.001), M4 (P < 0.001),and M2 (P = 0.009). At step 4 of the FSPA, the fit of M3 is not significantly worsethan that of M1 (P = 0.3), leading to provisional acceptance of M3 and to the finalcomparison at step 5 (M3 versus M5). Because M3 fits significantly better than M5(P = 0.005), M3 is finally selected.

Below, we show the output from mfpa, summarized in table 2, when fitting x5 andhormon:

. webuse brcancer, clear(German breast cancer data)

. stset rectime censrec

(output omitted )

. mfpa, select(0.05) acd(x5): stcox x5 hormon

Deviance for model with all terms untransformed = 3517.881, 686 observations

Variable Model (vs.) Deviance Dev diff. P Powers (vs.)

(A)x5 M6 M1 3567.530 83.650 0.000* . -.5,-2M4 3517.881 34.002 0.000* 1M2 3493.355 9.475 0.009* 0M3 3486.128 2.248 0.325 3M5 M3 3494.056 7.928 0.005* 1 3Final (M3) 3486.128 . 3

hormon null lin. 3496.724 10.596 0.001* . 1Final 3486.128 1


Fractional polynomial fitting algorithm converged after 1 cycle.

Transformations of covariates:

-> gen double IAx5__1 = Ax5^3-.125 if e(sample)

Final multivariable fractional polynomial model for _t

Variable Initial Finaldf Select Alpha Status df Powers

(A)x5 4 0.0500 0.0500 in 2 . 3hormon 1 0.0500 0.0500 in 1 1

(output omitted )

The Deviance column shows the deviance of each model in the Model column. Thedeviance difference between it and its comparator in the (vs.) column is shown in theDev diff. column, with the p-value in the P column. An asterisk indicates significanceat the alpha() level; here, the default setting alpha(0.05) was applied. The selectedFP powers in the FP1(p1, p2) models are shown in the Powers and corresponding (vs.)

columns.

The tests are applied from the top down, as described in section 2.4. As noted, thetests of M6, M4, and M2 versus M1 are all significant. M3 is provisionally selected andthen confirmed as the final model by the result of the fifth test. Model M3 has powers(., 3), that is, no term in x5 and one term comprising the cube of acd(x5).

4.2 Example 2: A multivariable model

We consider the so-called Boston housing dataset, in which the log median house pricein the Boston area is to be predicted from 13 housing- or environment-related variables,12 of which are continuous, in a dataset of size 506. Some of the continuous variables arestrongly correlated and some have a rather strange distribution. Difficulties in findinga suitable model have made it a dataset often used for comparing various modelingapproaches.

The data were analyzed in some detail by Royston and Sauerbrei (2008, 207–213).The selected MFP model is described in table 9.1 of that work. Ten of the 13 variableswere selected as significant at the 5% level; three of these (crim, rm, and dis) requiredFP2 functions and one (lstat) required an FP1 function. The remaining five continuousfunctions were selected as linear. The only categorical variable (chas) was selected.The explained variation (R2

a), adjusted for model dimension, was 0.827.

On applying mfpa to this dataset, we obtain eight predictors significant at the 5%level, all of them continuous. Of these, five have two FP1 powers and three are linear.The adjusted explained variation (R2

a) is 0.853.

Table 3 describes the selected models. It is interesting that the MFPA model has twofewer predictors, one additional parameter, and a higher explained variation than theMFP model.

82 Extension to MFP

Table 3. Boston housing data. Comparison of predictors and functions selected at the5% nominal significance level by the MFP and MFPA algorithms.

Covariate MFP MFPA Covariate MFP MFPA

crim 1, 2 0, 0.5 dis −2, 1 1,−2zn out out rad linear linearindus out out tax linear 0.5, 2chas∗ in out ptratio linear linearnox linear linear bk linear outrm 0.5, 0.5 −1, 3 lstat 0.5 0.5, 1age out out R2

a 0.827 0.853

∗Binary predictor

Figure 2 compares the partial predictors for the nine continuous predictors selectedby MFP and MFPA.

−1.0

−0.5

0.0

0.5

1.0

0 20 40 60 80 100crim

−1.0

−0.5

0.0

0.5

1.0

.4 .5 .6 .7 .8 .9nox

−1.0

−0.5

0.0

0.5

1.0

4 5 6 7 8 9rm

−1.0

−0.5

0.0

0.5

1.0

2 4 6 8 10 12dis

−1.0

−0.5

0.0

0.5

1.0

0 5 10 15 20 25rad

−1.0

−0.5

0.0

0.5

1.0

200 300 400 500 600 700tax

−1.0

−0.5

0.0

0.5

1.0

12 14 16 18 20 22ptratio

−1.0

−0.5

0.0

0.5

1.0

0 .2 .4 .6bk

−1.0

−0.5

0.0

0.5

1.0

0 10 20 30 40lstat

Par

tial p

redi

ctor

MFP MFPAACD()

Figure 2. Boston housing data. Fitted partial predictors for the MFP (solid lines)and MFPA (long dashes) models as well as ACD (short dashes) approximations to thecumulative distribution functions of the predictors.


Note that the ACD transformations of two predictors (crim and bk) are notablyskewed in distribution and that the remainder are more symmetrical.

At first glance, the differences between the fitted functions appear rather minor.However, the FP2(1, 2) function for crim (level of criminality in the local area) seemsinappropriate because it is nonmonotonic, whereas the FP1(0, 0.5) function is nearlymonotonic. The two functions for rm are both nonmonotonic but are subtly different.MFP selects bk, which evidently has a (very) weak effect, whereas MFPA omits it.

In terms of fit, figure 3 shows smoothed residuals for the MFP model.

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0crim

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0rm

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0dis

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0tax

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8bk

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0lstat

Sm

ooth

ed r

esid

uals

MFP model

Figure 3. Boston housing data. Smoothed residuals for partial predictors in the MFP

model.

Subjectively, some lack of fit is evident for crim, dis, and perhaps lstat.

84 Extension to MFP

Figure 4 shows smoothed residuals for the same predictors in the MFPA model.

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0crim

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0rm

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0dis

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0tax

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8bk

−0.4

−0.2

0.0

0.2

0.4

0.0 0.2 0.4 0.6 0.8 1.0lstat

Sm

ooth

ed r

esid

uals

MFPA model

Figure 4. Boston housing data. Smoothed residuals from the MFPA model for the partialpredictors in the MFP model.

Altogether the fit seems a little better, and the only predictor still exhibiting lack of fitis dis.

This example suggests that MFPA may uncover subtle nonlinearity missed by MFP

in difficult situations with unusual distributions and a potential influence of extremevalues. The overall predictive ability of the model may not be too different, but theinterpretation of the effects of individual predictors may change.

5 The mfpa command

5.1 Syntax

The syntax of mfpa is as follows:

mfpa[, acd(varlist) linadj(varlist) mfp options

]: regression cmd[

yvar1[yvar2

] ]xvarlist

[if] [

in] [

weight] [

, regression cmd options]


mfpa is identical to mfp except that it accepts factor variables in xvarlist and hastwo additional options, which are described below.

The standard postestimation commands fracpred and fracplot have been replacedwith xfracpred and xfracplot, respectively.

Note that the acd program must be installed before using mfpa. To install acd, typenet install st0339, from(http://www.stata-journal.com/software/sj14-2).

5.2 Description

mfpa selects the MFP model that best predicts the outcome variable from the right-hand-side variables in xvarlist.

mfpa provides some extensions to Stata’s mfp command:

1. mfpa supports factor variables, and

2. mfpa has two new options: linadj(varlist) to adjust linearly for variables invarlist, and acd(varlist) to optimize the fit for each xvar in varlist and its ACD

transformation.

As mentioned above, the mfp postestimation commands fracpred and fracplot

are replaced with xfracpred and xfracplot, respectively. The syntax is unchangedexcept that xfracplot has the additional option nopts, which suppresses plotting ofpartial residuals. Also provided with the software package for this article is xfracpoly,which extends the fracpoly command (which is no longer part of official Stata) bysupporting the use of factor variables in its xvarlist. The three xfrac* commands arebriefly documented in the mfpa help file under the heading Related commands.

5.3 Options

acd(varlist) creates the ACD transformation of each member of varlist. It also invokesthe FSPA to determine the best-fitting FP1(p1, p2) model, as described in section 2.4.For a given continuous predictor xvar, depending on the values of select(#) andalpha(#), mfpa simplifies the FP1(p1, p2) model to select one of the six submodelsdescribed in section 2.2. The variable representing the ACD transformation of xvar isnamed Axvar and is left behind in the workspace, together with FP transformation(s)of Axvar as appropriate.

linadj(varlist) adjusts linearly for members of varlist; that is, the members are includedin every model fit. This avoids the need for the more complicated and less efficientdf() and select() options to achieve the same result.

mfp options are any options appropriate to mfp.

regression command options are any options appropriate to the regression commandspecified in regression command.

86 Extension to MFP

5.4 Examples

webuse brcancer, clearstset rectime, failure(censrec) scale(365.24)mfpa, acd(x5): stcox x5mfpa, select(0.05): stcox x1 x2 x3 x4a x4b x5 x6 x7 hormonxfracplot x5mfpa, select(0.05) acd(x5 x6 x7): stcox x1 x2 x3 x4a x4b x5 x6 x7 hormonxfracplot x5

6 Comments

In this article, we introduced MFPA and the mfpa command, an extension of MFP and mfp

that supports the ACD transformation in the range of possible predictor transformations.If sigmoid relationships are relevant or expected, MFPA can be used instead of MFP.Our impression is that replacement of the FP2(p1, p2) family with the FP1(p1, p2) familydoes not sacrifice flexibility in functional form. The mathematical details of how thishappens merit further investigation. With the possibility of modeling singly or doublyasymptotic relationships, the FP1(p1, p2) family offers an attractive alternative to theFP2(p1, p2) family in some cases. However, its interpretability and transportability areless straightforward than those of MFP, and its properties remain to be explored ingreater detail and in more datasets.

The ACD transformation may provide a solution to the problem of influential co-variate observations. In the MFP context, we previously proposed the gδ (.) pretrans-formation (Royston and Sauerbrei 2007a), which works quite differently from ACD. Forany continuous x, the distribution of ACD (x) is by construction approximately uniform(0, 1). The extreme values of the uniform distribution are generally much less influen-tial in regression models than those of the original distribution of x. In the selectedfunctions of x5 in the German breast cancer dataset, the FSP selects a nonmonotonicFP2 function, which contradicts medical knowledge, whereas the FSPA chooses FP1(., 3),which fits the data well and, being guaranteed monotonic, makes more biologic sense.

In summary, publication of mfpa makes the command widely available to otherresearchers. We hope this will stimulate further research in this important topic area.

7 ReferencesBaade, P. D., P. Royston, P. H. Youl, M. A. Weinstock, A. Geller, and J. F. Aitken. 2015.Prognostic survival model for people diagnosed with invasive cutaneous melanoma.BMC Cancer 15: 27.

Royston, P. 2015. Tools for checking calibration of a Cox model in external validation:Prediction of population-averaged survival curves based on risk groups. Stata Journal15: 275–291.

Royston, P., and D. G. Altman. 1994. Regression using fractional polynomials of con-tinuous covariates: Parsimonious parametric modelling (with discussion). Journal ofthe Royal Statistical Society, Series C 43: 429–467.


Royston, P., andW. Sauerbrei. 2007a. Improving the robustness of fractional polynomialmodels by preliminary covariate transformation: A pragmatic approach. Computa-tional Statistics and Data Analysis 51: 4240–4253.

. 2007b. Multivariable modeling with cubic regression splines: A principledapproach. Stata Journal 7: 45–70.

. 2008. Multivariable Model-building: A Pragmatic Approach to RegressionAnalysis Based on Fractional Polynomials for Modelling Continuous Variables. Chich-ester, UK: Wiley.

Sauerbrei, W., and P. Royston. 1999. Building multivariable prognostic and diagnosticmodels: Transformation of the predictors by using fractional polynomials. Journal ofthe Royal Statistical Society, Series A 162: 71–94.

Schumacher, M., G. Bastert, H. Bojar, K. Hubner, M. Olschweski, W. Sauerbrei,C. Schmoor, C. Beyerle, R. L. A. Neumann, and H. F. Rauschecker. 1994. Ran-domized 2×2 trial evaluating hormonal treatment and the duration of chemotherapyin node-positive breast cancer patients. Journal of Clinical Oncology 12: 2086–2093.

About the authors

Patrick Royston is a medical statistician with more than 30 years of experience, with a stronginterest in biostatistical methods and in statistical computing and algorithms. He works largelyin methodological issues in the design and analysis of clinical trials and observational studies.He is currently focusing on alternative outcome measures in trials with a time-to-event outcome;on problems of model building and validation with survival data, including prognostic factorstudies and treatment-covariate interactions; on parametric modeling of survival data; and onnovel clinical trial designs.

Willi Sauerbrei has worked for more than 30 years as an academic biostatistician. He hasextensive experience of cancer research and a long-standing interest in modeling observationaldata. Topics of interest include variable and function selection, model stability, treatment-covariate interactions, time-dependent effects in survival analysis, meta-analysis, and reportingof research findings.

Royston and Sauerbrei have collaborated on regression methods using continuous predictors formore than two decades and have written a book (Royston and Sauerbrei 2008) on multivariablemodeling.


Quantifying the uptake of user-writtencommands over time

Babak Choodari-OskooeiHub for Trials Methodology Research

MRC Clinical Trials UnitUniversity College London

London, UK

[email protected]

Tim P. MorrisHub for Trials Methodology Research

MRC Clinical Trials UnitUniversity College London

andDepartment of Medical Statistics

London School of Hygiene and Tropical MedicineLondon, UK

[email protected]

Abstract. A major factor in the uptake of new statistical methods is the availabil-ity of user-friendly software implementations. One attractive feature of Stata isthat users can write their own commands and release them to other users via Sta-tistical Software Components at Boston College. Authors of statistical programsdo not always get adequate credit, because programs are rarely cited properly.There is no obvious measure of a program’s impact, but researchers are underincreasing pressure to demonstrate the impact of their work to funders. In ad-dition to encouraging proper citation of software, the number of downloads of auser-written package can be regarded as a measure of impact over time. In thisarticle, we explain how such information can be accessed for any month from July2007 and summarized using the new ssccount command.

Keywords: dm0086, ssccount, SSC, impact

1 Introduction

Many statisticians are paid to develop new methods, but implementing methods insoftware is not always recognized as a key part of this activity. A published articledetailing a new method is citeable, and citations can be tracked, providing funders andbosses with a measure of interest or relevance. There is no equivalent to an impactfactor or H-index for programs, which often go uncited by users. It is thus harder todemonstrate the value of time spent writing and testing programs. However, there areother indicators that can be used to demonstrate impact (Brueton et al. 2014).

c© 2016 StataCorp LP dm0086

B. Choodari-Oskooei and T. P. Morris 89

We regard the release of programs as an important factor in the uptake of newmethods (Pullenayegum et al. 2016). Historically, this appears to be supported by thefollowing:

• The Cox model was originally published in 1972 (Cox 1972), but it was notwidely used until implementations in Fortran by Richard Peto and colleaguesand Kalbfleisch and Prentice (1980).

• Multiple imputation was first conceived in 1978 (Rubin 1978) followed by a pe-riod of theoretical developments (Rubin 1987), but the widespread use now seen(Rezvan, Lee, and Simpson 2015) did not occur until the release of the R pack-age mice (van Buuren and Oudshoorn 2000) and the Stata package ice (Royston2004).

• Propensity-score matching was originally proposed in 1983 (Rosenbaum and Ru-bin 1983) and has gradually been applied more and more since the turn of themillennium, thanks in part to programs such as psmatch2 (Leuven and Sianesi2003).

Each new Stata release adds commands implementing recent methods, but it wouldbe unreasonable to expect StataCorp to keep on top of all the methodological develop-ments in statistics and implement them. Rather, the onus falls on methodologists toimplement their own methods and promote the software. Having written a program, auser can share it easily: a package of files can be submitted to the Statistical SoftwareComponents (SSC) repository at Boston College, and it can then be downloaded byothers by typing ssc install pkg name in Stata’s command line.

The ssc hot command returns the number of downloads in the previous month formost user-written packages on SSC. Many users might not know that they can obtainthe datasets that this command is based on for any month dating back to July 2007.These monthly datasets can then be linked.

In this article, we describe how to obtain data on monthly hits, and we introducethe ssccount command, which downloads the datasets for a specified time window.ssccount allows specification of certain packages and authors of interest, and it providesa graph plotting downloads over time. The number of downloads over time providesa useful—though imperfect—picture of how much a program is used, provided it hasbeen released on SSC. The ssccount command is thus one way for Stata programmersto demonstrate the uptake of their packages and evaluate the value of the time spentwriting them.

2 Methods

In this section, we discuss how to obtain the number of downloads of user-writtenstatistical packages, which can be regarded as a soft measure of impact, and we introducethe ssccount command, which can be used for this purpose.

90 Uptake of user-written commands

2.1 Statistics regarding uptake

The SSC archive is a well-known repository for user-written commands. The host site,RePEc services, tallies the individual file downloads whenever a user issues ssc install

pkg name to Stata. Typing ssc hot produces a list of the 10 (by default) most down-loaded packages for the previous month. This list consists of the top 10 rows of datafrom a file containing the downloads for all packages. A file is created for each monthand stored in the SSC archive, which goes back to July 2007.

Stata users can access this information from Stata by submitting the following com-mand:

. use http://repec.org/docs/sschotPxxx.dta

In this command, xxx corresponds to Stata’s monthly calendar (for example, xxx =570 is the “Stata internal form” value for July 2007 [typing display %tm 570 returns2007m7]). So to obtain the file containing the number of package downloads in July2007, you replace xxx with 570 in the above command. The number of package “hits”(downloads) reported can be noninteger because some users might have downloadedonly some of the files in a package. Some packages consist of many files, and not all areupdated each time.

The number of hits must be interpreted cautiously for these reasons:

1. The statistics appear to be limited to packages containing user-written ado-files.For example, graph schemes are not counted.

2. The data do not distinguish between the first download and the downloads of anupdate.

3. If a user downloads a command to two computers (say, one at work and one athome), this is counted as two hits.

Clearly, the precise number of hits should not be relied on too heavily—there ispotential for commands to look more impressive by releasing many incremental updatesinstead of a fully developed first version—but the information is useful.

On the other hand, citations in peer-reviewed articles are widely used as a measureof “impact”, but they have their own pitfalls. Simple citation counts are agnostic towhether citations were for positive, negative, or neutral reasons. Although we cannot tellprecisely what the spirit of a software download was, it seems plausible that downloadsare mainly for positive reasons.

3 The ssccount command

The ssccount command downloads datasets detailing monthly downloads of user-written commands from SSC for specified authors and packages, and it optionally plotsthe results.


3.1 Syntax

The syntax for the ssccount command is

ssccount[, from(month) to(month) author(author name) clear fillin(#)

graph package(pkg name) saving(filename, replace)]

where month is a calendar month in Stata’s %tm format.

3.2 Options

from(month) specifies the earliest month of data to download. This must be entered inStata’s %tm format (for example, January 2011 is specified by 2011m1). Specifying amonth before July 2007 (2007m7) will return an error because this is before recordsbegan. The default is from(2007m7).

to(month) specifies the latest month of data to download. As with from(), this mustbe entered in Stata’s %tm format (for example, January 2011 is specified by 2011m1).Specifying a month before July 2007 (2007m7) will return an error. The default isthe current month minus three months, which helps users avoid trying to downloaddatasets that do not yet exist, though one further month may be available. (Userscan check the latest available month by typing ssc hot.)

author(author name) specifies the name of the author whose packages are of interest.The names on SSC packages can be inconsistent. You do not have to get it exactlyright, as long as the name used contains what you specify in author(). The optionis not case sensitive, so specifying author(bloggs) is the same as author(BLOGGS)or anything in between, like author(BlOgGs).

clear specifies that the data in memory will be cleared. If saving() or clear is notspecified and you have data in memory, ssccount will exit with an error.

fillin(#) calls the fillin command (see [D] fillin). This option is used with plotswhen more than one author or package has been specified. It creates missing monthsto form a rectangular dataset and fills each one with # hits. Filling as missing (.)is allowed. The default is to not fill anything.

graph draws a simple graph of the month-by-month hits using twoway line and overlaysa smoothed trend using lowess. If the data contain multiple authors or packages,the graphs will be drawn by author and package.

package(pkg name) specifies the name of the package of interest. This may be useful ifan author has written multiple packages but a user is interested in one in particular.It can also be helpful if the author’s name is a substring of one or more other authors’names.

saving(filename, replace) specifies the downloaded data be saved as filename.dta.


3.3 Examples

To download the data on downloads (hits) for all SSC packages and save them to a filecalled allhits.dta, type

. ssccount, saving(allhits, replace)Looking to download 99 months of SSC files (Jul 2007 to Sep 2015).................................................................................> ..................file allhits.dta saved

This will append the various files; the appended dataset will be stored in allhits.dta.

Next, we look at the downloads of Royston’s (2004) ice command over time. Thepackage was first released as ice in April 2005 (after its earlier incarnations as mvis

and, briefly, mice). As noted earlier, the records in SSC begin in July 2007. Here is thecommand:

. ssccount, from(2007m7) to(2015m9) author(Royston) graph package(ice)> saving(icehits, replace)Looking to download 99 months of SSC files (Jul 2007 to Sep 2015)................................................................................> ..................file icehits.dta saved

100

200

300

400

500

Num

ber

of h

its

Jul 2

007

Jul 2

009

Jul 2

011

Jul 2

013

Jul 2

015

Date

Number of hits lowess npkghit mo

Figure 1. Plot showing the number of hits for ice, July 2007 to September 2015. Grayline: number of hits recorded each month; black curve: lowess-smoothed trend.

Here we have downloaded data for all packages from July 2007 to September 2015,and we kept the data if the author’s name contains Royston and the package is namedice. The resulting data are saved to icehits.dta, and the graph shown in figure 1 isproduced.


Note the reduction in hits from the end of 2012. This is presumably due to the re-lease of mi impute chained by StataCorp; users were likely directed to use mi impute

chained instead of ice because of the reassurance that comes with using an officialStata command. Further development of ice then became less necessary, so updateswere less frequent. There is a surprising sharp spike in ice hits during 2014 despite noupdates at the time. We speculate that the rise was due to an article critiquing multi-ple imputation by predictive mean matching (Morris, White, and Royston 2014), whichpraised the ice implementation and noted the serious shortcomings of mi impute pmm.

As a further example, we look at the uptake of the psmatch2 command. We useallhits.dta, which we previously downloaded. Downloading the datasets afresh is aslow process.

. use allhits, clear

. keep if lower(package) == "psmatch2"(180,402 observations deleted)

. sort mo

. twoway (line npkghit mo, lcolor(gs10)) (lowess npkghit mo, lp(l)),> ylabel(#6,format(%9.0f) angle(0)) xlabel(,angle(45)) yscale(r(0 .))> ytitle("Number of hits")

Figure 2 demonstrates that the psmatch2 command is much used, and, unlike ice,its use continues to increase despite the release of Stata’s official teffects command.

0

1000

2000

3000

4000

5000

Num

ber

of h

its

Jul 2

007

Jul 2

009

Jul 2

011

Jul 2

013

Jul 2

015

Date

Number of hits lowess npkghit mo

Figure 2. Plot of the number of hits for psmatch2, July 2007 to September 2015. Grayline: number of hits recorded each month; black curve: lowess-smoothed trend.


4 Closing remarks

Accessible software for implementing new statistical methods is obviously an importantfactor in the uptake of new methods. We have introduced a command, ssccount, thatcounts the monthly downloads of user-written packages stored in the SSC archive. Theprogram provides useful information on the extent of the use of such packages.

Some authors of commands put their packages on only personal or corporate web-sites, or they do this in addition to putting packages on SSC. The ability to keep trackof downloads makes the option of releasing packages exclusively on the SSC archive at-tractive. We hope the ssccount command is helpful for highlighting the packages withthe greatest uptake over time.

5 Acknowledgments

We thank Kit Baum for helping us understand where the datasets containing downloadsare stored. We are grateful to Patrick Royston and Roger Newson for their commentson a draft and earlier versions of the program and to Stephen Evans for information onthe history of the Cox model. We also thank the associate editor and an anonymousreviewer for their useful comments on the earlier version of this article. This workwas supported by the UK Medical Research Council (MRC) London Hub for TrialsMethodology Research grant number MC EX G0800814 (510636, MQEL).

6 ReferencesBrueton, V. C., C. L. Vale, B. Choodari-Oskooei, R. Jinks, and J. F. Tierney. 2014.Measuring the impact of methodological research: A framework and methods toidentify evidence of impact. Trials 15: 464.

Cox, D. R. 1972. Regression models and life-tables. Journal of the Royal StatisticalSociety, Series B 34: 187–220.

Kalbfleisch, J. D., and R. L. Prentice. 1980. The Statistical Analysis of Failure TimeData. New York: Wiley.

Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobisand propensity score matching, common support graphing, and covariate imbalancetesting. Statistical Software Components S432001, Department of Economics, BostonCollege. https://ideas.repec.org/c/boc/bocode/s432001.html.

Morris, T. P., I. R. White, and P. Royston. 2014. Tuning multiple imputation by predic-tive mean matching and local residual draws. BMC Medical Research Methodology14: 75.

Pullenayegum, E. M., R. W. Platt, M. Barwick, B. M. Feldman, M. Offringa, andL. Thabane. 2016. Knowledge translation in biostatistics: A survey of current prac-


tices, preferences, and barriers to the dissemination and uptake of new statisticalmethods. Statistics in Medicine 35: 805–818.

Rezvan, P. H., K. J. Lee, and J. A. Simpson. 2015. The rise of multiple imputation: Areview of the reporting and implementation of the method in medical research. BMCMedical Research Methodology 15: 30.

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score inobservational studies for causal effects. Biometrika 70: 41–55.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.

Rubin, D. B. 1978. Multiple imputations in sample surveys: A phenomenologicalBayesian approach to nonresponse. In Proceedings of the Survey Research Meth-ods Section of the American Statistical Association, 20–34.

. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

van Buuren, S., and C. G. M. Oudshoorn. 2000. Multivariate Imputation by ChainedEquations: MICE V1.0 User’s Manual. Leiden, The Netherlands: Netherlands Orga-nization for Applied Scientific Research.

About the authors

Babak Choodari-Oskooei is a statistician in the Hub for Trials Methodology Research at theMRC Clinical Trials Unit at University College London. He has a particular interest in survivalanalysis, clinical trials methodology, and research impact.

Tim P. Morris is a medical statistician interested in statistical methods to improve the designand analysis of randomized trials and meta-analyses and in the use of simulation studies. Heis a Stata enthusiast.


bireprob: An estimator for bivariaterandom-effects probit models

Alexander PlumOtto von Guericke University Magdeburg

Magdeburg, [email protected]

Abstract. I present the bireprob command, which fits a bivariate random-effectsprobit model. bireprob enables a researcher to estimate two (seemingly unrelated)nonlinear processes and to control for interrelations between their unobservables.The estimator uses quasirandom numbers (Halton draws) and maximum simulatedlikelihood to estimate the correlation between the error terms of both processes.The application of bireprob is illustrated in two examples: the first one usesartificial data, and the second one uses real data. Finally, in a simulation, the per-formance of the estimator is tested and compared with the official Stata commandxtprobit.

Keywords: st0426, bireprob, bivariate random-effects probit, maximum simulatedlikelihood, Halton draws

1 Introduction

When modeling a process (for example, the risk of becoming unemployed at a certaintime point), one must distinguish between two types of error terms: an individual-specific time-invariant error term and a time-specific shock. When applying a random-effects estimator, one assumes that the persistent unobservable difference between the in-dividuals is normally distributed. Two (seemingly unrelated) processes might be linkedwith each other by the correlation of their unobservables. For instance, an individualwho is more likely to become unemployed (because of constraints in his or her ability,for example) might also be more likely to live in a poor household, and the differencesbetween the individuals might be persistent over time. On the other hand, a time-specific shock that increases the risk of becoming unemployed might also increase therisk of becoming poor.1

In the past years, the number of journal articles accounting for correlation in theunobservables between two (seemingly unrelated) nonlinear processes has increased no-ticeably. Alessie, Hochguertel, and van Soest (2004) investigate the ownership dynam-ics of stocks and mutual funds; Devicienti and Poggi (2011) investigate the interrela-tion between poverty and social exclusion; Biewen (2009) and Ayllon (2015) investi-gate the dynamic relationship between unemployment and poverty; Stewart (2007) andKnabe and Plum (2013) investigate the interrelation between unemployment and low-pay; Miranda (2011) investigate the relationship between education and migration in

1. In example 2, I show that the bireprob command can be applied to any two-level equation system.


A. Plum 97

Mexico; Clark and Etile (2006) investigate the spousal correlation in smoking behav-ior and Haan and Myck (2009) investigate the interrelation between poor health andunemployment.

To analyze the relationship between two (seemingly unrelated) nonlinear processesin Stata, one can use Hole’s (2007) mixlogit command. However, mixlogit does notaccount for correlation in the time-specific shocks between two processes. In general, thisis not possible for multilevel logistic regressions. Another possibility is shown by Ayllon(2014), who uses the statistical tool aML with Stata. aML is a multilevel multiprocessorestimator that estimates the correlation of random-effects error terms for various levelsand different models. It is not restricted to a two-equation system. However, data needto be prepared carefully. With bireprob, a user-friendly estimator is presented thatestimates two (seemingly unrelated) nonlinear processes and accounts for the correlationin the time-specific and individual-specific error terms.

The remainder of this article is structured as follows: Section 2 presents the bivariaterandom-effects probit model. Section 3 explains why quasirandom numbers (Haltondraws) are used for simulation. Section 4 introduces the command bireprob. Sections 5and 6 present examples of the application of the bireprob command. Section 7 showsthe performance of bireprob in a simulation. The last section concludes.

2 The bivariate random-effects probit model

Assume that the observed binary outcome variables y1it and y2it are defined by thefollowing latent-response models:

y1it = 1 (x′1itβ1 + ν1it > 0)

y2it = 1 (x′2itβ2 + ν2it > 0)

The subscript i refers to the panel variable (for example, individual or firm) with i =1, . . . , N and t identifies the time point (for example, month or year) with t = 1, . . . , T .The dependent variable y1it is explained by the explanatory variables x1it, and thedependent variable y2it is explained by the explanatory variables x2it. Furthermore,νjit refer to the process-specific error terms with j ∈ (1, 2). It is assumed that νjitconsists of an individual-specific time-invariant error term αji and of a time-specificidiosyncratic shock ujit; thus νjit = αji + ujit:

y1it = 1 (x′1itβ1 + α1i + u1it > 0)

y2it = 1 (x′2itβ2 + α2i + u2it > 0)

Because of normalization of the error terms, it is assumed that the individual-specifictime-constant error terms are normally distributed, αj ∼ (0, σ2

αj), and that the id-

iosyncratic shocks are standard normally distributed, uj ∼ (0, 1). The ratio of thetime-constant individual-specific error term and composite error term is

λj = corr(νjit, νjis) =σ2αj

σ2νj

98 bireprob: An estimator for bivariate random-effects probit models

for t �= s.2 Furthermore, it is assumed that both processes are interrelated by thecorrelation of their error terms:

corr(ν1it, ν2is) =

{ρασα1

σα2+ ρu if s = t

ρασα1σα2

if s �= t

The individual likelihood function is the product of the joint probability of the observedbinary outcome variable {Pi(α1, α2)} and the joint density of the random-effects errorterms {f2(α1, α2;μα)},

Li =

∫α1

∫α2

Pi(α1, α2)f2(α1, α2;μα)dα1dα2

with μα referring to the covariance of the random-effects error terms (μα = ρασα1σα2

).Because it is assumed that the joint density of the random-effects error terms followsa bivariate normal distribution, the joint probability of the observed binary outcomevariables is

Pit(α1, α2) = Φ2 {k1 (x′1itβ1 + α1i) , k2 (x

′2itβ2 + α2i) , k1k2ρu}

with

kj =

{1 if yj = 1

−1 else

Φ2[·] is the bivariate normal cumulative distribution function. In general, the bivariatenormal cumulative distribution function takes the following form (Greene 2012),

Φ2(x1, x2, ρu) =

∫ x2

−∞

∫ x1

−∞φ2(z1, z2, ρu)dz1dz2

with the density

φ2(x1, x2, ρu) =e(−1/2)(x2

1+x22−2ρux1x2)/(1−ρ2

u)

2π(1− ρ2u)1/2

The sample likelihood now takes the following form:

L =N∏i=1

∫α1

∫α2

{T∏

t=1

Pit(α1, α2)

}f2(α1, α2;μα)dα1dα2 (1)

However, (1) cannot be solved analytically; therefore, the random-effects error termsmust be integrated out. Strategies such as applying (adaptive) Gaussian quadratureor simulation belong to the most common approaches. For simulation, draws fromrandom numbers are needed to simulate the bivariate normal distribution of the random-effects error terms. R uniformly distributed random draws rj on the interval [0,1) are

2. Note that in xtprobit, this ratio is labeled by ρ, whereas ρ refers in this model to the correlationof the error terms.

A. Plum 99

taken and then transformed by the inverse cumulative standard normal distributionαrj = Φ−1(rj) (see figure 1). Thereafter, the Cholesky decomposition of the variance–

covariance matrix of the bivariate normal distribution Σα = CC ′, with C being a lowertriangular matrix, is integrated into the routine and updated during each iteration. Themaximum simulated likelihood (MSL) is

MSL =N∏i=1

1

R

R∑r=1

{T∏

t=1

Pit(αr1, α

r2)

}The link between the transformed initial draws and the bivariate normally distributednumbers is

αr1 = σα1

αr1

αr2 = σα2

ρααr1 + σα2

√1− ρ2αα

r2

Because random numbers for the simulation are needed, quasirandom numbers areapplied. Quasirandom numbers are based on prime numbers and are also called Haltondraws. In section 3, I briefly introduce Halton draws and explain why they are appliedinstead of pseudorandom numbers.

Figure 1. Transformation of the random number

3 Halton draws

Stata offers two possibilities for generating uniformly distributed random numbers. Onepossibility is to generate pseudorandom numbers by using the runiform() function. An-other possibility is to generate quasirandom numbers such as Halton draws, which canbe generated by using the mdraws command (Cappellari and Jenkins 2006). Halton


draws are based on prime numbers and are often applied in the context of simulatedmaximum likelihood.3 The advantage of Halton draws is that, compared with pseudo-random numbers, they have certain characteristics that make them more appropriatein the context of MSL:

1. They exhibit better coverage of the normal distribution, especially in the case oflow numbers of observations (see figure 2).

2. Negatively correlated draws help to minimize the variance of the MSL maximand(Train 2009).

Therefore, for simulating bivariate normal distributions of the random-effects errorterms, the bireprob estimator uses Halton draws.

Figure 2. Coverage of different random-number generators

4 The bireprob command

The bireprob command fits a bivariate random-effects probit model that considerscorrelation in the random-effects error terms and in the idiosyncratic shocks. Note

3. For example, mixlogit (Hole 2007), redpace (Stewart 2006), petpoisson (Miranda 2012), or theHeckman estimator based on multivariate normal probabilities (Plum 2014).

A. Plum 101

that the mdraws command must be installed before using bireprob. bireprob checkswhether mdraws is installed and, if it is not, will exit with a note to install the missingpackage.

4.1 Syntax

bireprob depvar1 indepvars1 (depvar2 indepvars2)[if] [

in] [

, draws(#)

burn(#) primes(matname) from(matname) nosigma noalpha mutual]

4.2 Options

draws(#) specifies the number of Halton draws needed for the simulation of the randomeffects. The default is draws(10).

burn(#) specifies the number of initial elements of the Halton sequence to be droppedfor burn in. The default is burn(15). For details, see Cappellari and Jenkins (2006).

primes(matname) specifies a 1× 2 matrix matname containing the primes to be usedfor the Halton sequences. The numbers specified must be integers. If primes() isnot specified, the following primes are used: 2 and 3.

from(matname) lowers computational time by specifying a matrix matname that con-tains reasonable starting values for each equation. bireprob does not quietly fit arandom-effects probit model by using xtprobit.

nosigma specifies that the estimator not control for correlation in the idiosyncraticshock.

noalpha specifies that the estimator not control for correlation in the random-effectserror terms.

mutual specifies that the two dependent variables, y1 and y2, be mutually exclusive(for instance, the three labor market positions high paid, low paid, and unemployed[Stewart 2007]). When one applies this option, y2 is considered only if y1 = 0.bireprob checks whether both dependent variables are mutually exclusive and, ifthey are not, exits. When one applies this restriction, a notification will be displayed.

5 Example 1

In the first example, I use artificial data to introduce the bireprob command.

5.1 Constructing the dataset

At first, an artificial dataset that contains 500 individuals is constructed. An individualidentifier (id) is generated that is based on the consecutive number of the respectiveobservation; thus id = 1, . . . , 500.


. version 13

. local obs=500

. local per=5

. set obs òbsńumber of observations (_N) was 0, now 500

. set seed 987654321

. generate id=_n

The two dependent variables, y1it and y2it, are defined in the following way:

y1it = 1 (1.5x1 + α1i + u1it > 0)

y2it = 1 (−2x1 + 3x2 + α2i + u2it > 0)

The two random-effects error terms are standard normally distributed; hence, αji ∈(0, 1). They are negatively correlated with ρα = −0.3. The idiosyncratic shocks, whichare also standard normally distributed with uji ∈ (0, 1), are positively correlated withρu = 0.5. In the next step, the random-effects error terms are generated with the helpof the drawnorm command. Note that the variance–covariance matrix must be specifiedbefore applying drawnorm. Then, the dataset is expanded to a panel dataset with fivetime periods per individual (note that the number of time periods is defined in the local‘per’).

. matrix C = (1, -.3 \ -.3, 1)

. drawnorm re1 re2, n(òbs´) corr(C)

. expand `per´(2,000 observations created)

For each individual, the time-point identifier tper is generated. Moreover, the twoexplanatory variables x1 and x2 are generated. For defining the idiosyncratic shocksu1it and u2it, one again applies the drawnorm command. Thereafter, the two outcomevariables are generated; they become 1 if the value exceeds 0, and 0 otherwise.

. by id, sort: generate tper=_n

. generate x1=invnormal(runiform())

. generate x2=invnormal(runiform())

. matrix C = (1 , .5 \ .5 , 1)

. local obs=òbs´*`per´

. drawnorm u1 u2, n(òbs´) corr(C)

. sort id (tper)

. by id: generate y1=(1.5*x1 + re1 + u1>0)

. by id: generate y2=(-2*x1 + 3*x2 + re2 + u2>0)

5.2 Estimation

Before applying the bireprob command, one must use xtset to declare the panelvari-able and the timevariable. In this example, the panelvariable is id, and the timevariableis tper. The panel is strongly balanced; hence, each individual is observed for the same

A. Plum 103

number of time points. However, bireprob is not restricted to balanced panels and canalso be applied to unbalanced panels (see section 6).

Then, the bireprob command is applied. In this application, the first dependentvariable is y1, and the explanatory variable is x1. In parentheses, the first variable indi-cates the second dependent variable, which is y2, and x1 and x2 are used as explanatoryvariables. Furthermore, 50 Halton draws are chosen for the estimation (in section 5.4,I show how the results are affected by the number of Halton draws).

. xtset id tperpanel variable: id (strongly balanced)time variable: tper, 1 to 5

delta: 1 unit

. bireprob y1 x1 (y2 x1 x2), draws(50)Dependent variable (1st equation): y1Dependent variable (2nd equation): y2Explanatory variables (1st equation): x1Explanatory variables (2nd equation): x1 x2

Estimating 1st equation with xtprobit.

Estimating 2nd equation with xtprobit.

Generating 50 Halton draws with prime numbers 2 and 3. 15 Halton draws are> burned in.

Estimating a bivariate random-effects probit model

Iteration 0: log likelihood = -1731.9335Iteration 1: log likelihood = -1718.5778Iteration 2: log likelihood = -1718.5062Iteration 3: log likelihood = -1718.5062

Bivariate Random-effects Probit Model, 50 Halton draws

Number of obs = 2,500Wald chi2(1) = 484.82

Log likelihood = -1718.5062 Prob > chi2 = 0.0000

Coef. Std. Err. z P>|z| [95% Conf. Interval]

y1x1 1.502347 .0682307 22.02 0.000 1.368617 1.636077

_cons -.0869098 .059284 -1.47 0.143 -.2031042 .0292847

y2x1 -2.006314 .1310377 -15.31 0.000 -2.263143 -1.749485x2 2.918305 .1854261 15.74 0.000 2.554876 3.281733

_cons .0454579 .0645171 0.70 0.481 -.0809933 .171909

/logitlam_1 .053315 .0696122 0.77 0.444 -.0831224 .1897524/logitlam_2 -.0264849 .1163182 -0.23 0.820 -.2544644 .2014946

/atsiga -.2961899 .0932455 -3.18 0.001 -.4789478 -.113432/atsigu .443295 .1044243 4.25 0.000 .2386271 .6479629

alpha_1 1.112523 .1548903 7.18 0.000 .8468388 1.461561alpha_2 .9484087 .2206344 4.30 0.000 .6011392 1.496291

rho_alpha -.287822 .0855209 -3.37 0.001 -.4454005 -.112948rho_sigma .4163719 .0863207 4.82 0.000 .2341986 .570297


At the start of the estimation procedure, the bireprob command displays the de-pendent and the independent variables of the first and of the second equation. In thenext two steps, the bireprob command quietly fits a random-effects probit model foreach equation by using xtprobit. The estimated coefficients of the explanatory vari-ables and the variances of the random-effects error terms are used as starting valuesfor the bireprob command. As a starting value for ρα and ρu, 0 is chosen. Then, theHalton draws are generated, in this case 50 draws per individual. Because no primenumbers are defined, the prime numbers 2 and 3 are used. Moreover, the number ofHalton draws that should be burned in is not defined. Therefore, the default number ofinitial draws dropped per dimension, 15, is used. Finally, the bivariate random-effectsprobit model is fit.

Looking at the output table and comparing the coefficients with the true values,we can see that the estimated coefficients are close to the true values. In the four lastlines, the variances of the random effects and the correlation parameters are displayed.Referring to the variances of the random-effects error terms, σ2

α1= 1.11 and σ2

α2= 0.95,

we see that both are close to 1. Furthermore, a negative correlation parameter ofthe random-effects error terms is found, ρα = −0.29, and a positive correlation ofthe idiosyncratic shocks is found, ρu = 0.42. Moreover, all estimated coefficients aresignificantly different from 0 at the 1% level.

5.3 Predicted probabilities

I now show how to predict probabilities. In this example, we are interested in cal-culating the probability that y1 = 1 and y2 = 1 simultaneously when x1 = x2 = 1.Before calculating the predicted probabilities, we should note that the variances of thecomposite error terms are not standard normally distributed (σ2

νj�= 1); therefore, the

coefficients must be rescaled by√1/σ2

νj(Arulampalam 1999). In general, the predicted

probabilities that y1 = y2 = 1 are calculated as follows:

p = Φ2

{(x′1itβ1

)√ 1

σ2ν1

,(x′2itβ2

)√ 1

σ2ν2

, ρu

}

Note that the coefficients referring to the variance, σ2α1

and σ2α2, are included in

the estimator as the square root of their logarithm; thus ln(√

σ2αj

). The correlation

parameters, ρu and ρα, are included as the inverse hyperbolic tangent in the estimator.Furthermore, the variances of the idiosyncratic shocks are equal to 1; thus σ2

νj= σ2

αj+1.

The predicted probabilities are calculated with the nlcom command, and the probabilitythat both dependent variables are equal to 1 given that x1 = x2 = 1 is 0.67.

A. Plum 105

. nlcom (pred: binormal(> (([y1]_b[x1]*1 + [y1]_b[_cons])*sqrt(1/((exp(_b[/logitlam_1]))^2+1))),> (([y2]_b[x1]*1 + [y2]_b[x2]*1 + [y2]_b[_cons])*> sqrt(1/((exp(_b[/logitlam_2]))^2+1))), tanh(_b[/atsigu])))

pred: binormal( (([y1]_b[x1]*1 + [y1]_b[_cons])*> sqrt(1/((exp(_b[/logitlam_1]))^2+1))), (([y2]_b[x1]*1 +> [y2]_b[x2]*1 + [y2]_b[_cons])*sqrt(1/((exp(_b[/logitlam_2]))^2+1))),> tanh(_b[/atsigu]))


pred .6667429 .0215528 30.94 0.000 .6245003 .7089855

5.4 Sensitivity analysis

To illustrate how the number of Halton draws affects the estimation results, I repeatthe estimation with different numbers of Halton draws: I start at 25 draws and increasethe random numbers successively by an additional 5 draws until reaching 250 draws.The effect of the number of Halton draws on the simulated log likelihood, the estimatedvariances of the random effects (σ2

αj) and the correlation parameters (ρα, ρu) is shown in

figure 3. We can see that the simulated log likelihood changes only slightly depending onthe number of Halton draws and that the variances σ2

αjand the correlation parameters

ρα and ρu are also on the same level.

Figure 3. Sensitivity analysis


6 Example 2

In this example, I show how bireprob can be used with real data in the context of atwo-level equation system with mutually exclusive dependent variables. For the illus-tration, data about teachers’ evaluations of pupils’ behavior are used. These data werealso used by Haan and Uhlendorff (2006) and Hole (2007).4 There are three differenttypes of schools (tby), and the analysis focuses on whether there is some unobservedheterogeneity between those schools. The sample comprises 48 schools (scy3) and 1,313pupils. The school is the panel variable, and the pupils are treated as the time variable.

. use jspmix, clear

. tabulate tby, gen(y)

tby Freq. Percent Cum.

1 329 25.06 25.062 678 51.64 76.693 306 23.31 100.00

Total 1,313 100.00

. by scy3, sort: generate tper=_n

. xtset scy3 tperpanel variable: scy3 (unbalanced)time variable: tper, 1 to 85

delta: 1 unit

Because each student can be at only one type of school, this variable is mutuallyexclusive. If y1 = 1, the pupil is attending a school of the first category. If y1 = 0and y2 = 1, the pupil is attending a school of the second category. If y1 = y2 = 0, thepupil is attending a school of the third category. Therefore, when applying bireprob,we choose the mutual option to indicate that y2 is considered only if y1 = 0. For theestimation, we take 50 Halton draws. Furthermore, we control for correlation only inthe random effects. Thus we use the nosigma option. Following Haan and Uhlendorff(2006) and Hole (2007), we take a single explanatory variable: the gender of the pupil,sex.

4. In both articles, multinomial logistic regressions are applied.

A. Plum 107

. bireprob y1 sex (y2 sex), mutual nosigma draws(50)

(output omitted )

Bivariate Random-effects Probit Model, 50 Halton draws




y1sex -.4173062 .081013 -5.15 0.000 -.5760886 -.2585237

_cons -.5215657 .0840545 -6.21 0.000 -.6863096 -.3568219

y2sex -.3295871 .0867177 -3.80 0.000 -.4995506 -.1596237

_cons .7069166 .0740025 9.55 0.000 .5618744 .8519588

/logitlam_1 -.8405676 .1689955 -4.97 0.000 -1.171793 -.5093425/logitlam_2 -1.573497 .3180058 -4.95 0.000 -2.196777 -.950217

/atsiga .5123391 .3707035 1.38 0.167 -.2142265 1.238905

alpha_1 .1861625 .0629213 2.96 0.003 .0959829 .3610694alpha_2 .0429811 .0273365 1.57 0.116 .0123567 .1495037

rho_alpha .4717657 .2881987 1.64 0.102 -.2110084 .8451429

The results indicate that there is some evidence for correlation in the random effects;however, ρα is not significantly different from 0 at the 10% level.

7 Simulation

Finally, I test the performance of bireprob by a simulation and compare the results withthose of xtprobit. Again I use artificial data. To emphasize the necessity to controlfor correlation in the unobservables, I choose a dynamic model in which the currentoutcome depends on the outcome in the previous period. In the economic literature,an often-examined example is state dependence in unemployment (among others, seeArulampalam, Booth, and Taylor [2000]). Furthermore, the current outcome dependson the past outcome of the second dependent variable and vice versa. For example, whilethe first dependent variable is unemployment, the second dependent variable could bebad health. Past unemployment could significantly increase the risk of suffering frombad health. The same is true in the opposite direction: bad health not only increasesone’s risk of being affected by bad health in the future but also makes it more likely forone to become unemployed. In the simulation, the underlying model takes the followingstructure:5

y1it = 1 (1y1it−1 − 1y2it−1 + α1i + u1it > 0)

y2it = 1 (1y2it−1 − 1y1it−1 + α2i + u2it > 0)

Both random-effects error terms are standard normally distributed and positively cor-related with ρα = 0.7. Not controlling for correlation in the random effects would lead

5. The respective do-file can be found in the supplement.


to an overestimation of the variances of the random effects and to an underestimationof the lagged dependent variables’ coefficients.

The artificial dataset consists of 500 individuals observed for 5 subsequent timepoints. Therefore, the panel is strongly balanced. The outcome in the initial periodis randomly assigned; thus we do not have to control for the “initial conditions prob-lem” (Heckman 1981). The above equation system is estimated in total 100 times byxtprobit and by bireprob6 (with 100 Halton draws). In each round, a new randomdraw of the distribution of the random effects and the idiosyncratic shocks is generated.The mean over all 100 estimations of the coefficients and the standard errors can befound in table 1. The first column of table 1 shows that when one does not control forcorrelated random effects, the coefficients are on a much lower level in absolute terms.However, when one does control for correlated random effects, the coefficients are muchcloser to the true value.

Table 1. Simulation results

Coefficients xtprobit bireprob†

Coefficient Standard Coefficient Standarderror error

y1y1it−1 0.935�� 0.063 1.011 0.062y2it−1 −0.804�� 0.067 −1.000 0.070

y2y1it−1 −0.802�� 0.067 −0.997 0.070y2it−1 0.937�� 0.063 1.011� 0.062

σ2α1

1.123�� 1.001σ2α2

1.133�� 1.010ρα − 0.697Observations 100 100Log likelihood −2 451.452 −2 410.959†100 Halton draws with prime numbers 2 and 3.

�

Coefficient statistically significantly different from the true valueat the 0.10 level; �� at the 0.05 level; �� at the 0.01 level.

We can also conclude this when comparing the distribution of the coefficients infigure 4 (the solid vertical line refers to the true value). Furthermore, note that thevariances of the random effects are greater in the first model than when controllingfor the correlation between the random effects. Moreover, whether the means of thecoefficients and the variances are significantly different from the true values is tested.Referring to the xtprobit model, we see that every coefficient and variance is signifi-cantly different from the true value at the 1% level. However, in the bireprob model,

6. It is not controlled for correlation in the idiosyncratic shocks.

A. Plum 109

only the lagged dependent y2it−1 of the second equation is significantly different fromthe true value at the 10% level. For the remaining estimated coefficients and variances,no significant difference from the true value is detected.

Figure 4. Distribution of the estimated coefficients

8 Conclusion

In this article, I presented the bireprob command. bireprob fits a bivariate random-effects probit model and allows one to control for correlation in the random-effectserror terms and in the idiosyncratic shocks. The advantage of this estimator is thatcompared with existing estimators, such as mixlogit or aML, it requires no specificdata preparation. After presenting the command and its options, I gave two examplesand a simulation: the first example is based on artificial data, and the second exampleon real data. The simulation showed that not controlling for correlation in the randomeffects might cause biased estimation results. An open research task remains in loweringcomputational time.


9 Acknowledgments

I thank Sara Ayllon and an anonymous referee for helpful comments. Moreover, Iacknowledge the financial support of the German Research Foundation (DFG, projectKN 984/1-1).

10 ReferencesAlessie, R., S. Hochguertel, and A. van Soest. 2004. Ownership of stocks and mutualfunds: A panel data analysis. Review of Economics and Statistics 86: 783–796.

Arulampalam, W. 1999. A note on estimated coefficients in random effects probitmodels. Oxford Bulletin of Economics and Statistics 61: 597–602.

Arulampalam, W., A. L. Booth, and M. P. Taylor. 2000. Unemployment persistence.Oxford Economic Papers 52: 24–50.

Ayllon, S. 2014. From Stata to aML. Stata Journal 14: 342–362.

. 2015. Youth poverty, employment, and leaving the parental home in Europe.Review of Income and Wealth 61: 651–676.

Biewen, M. 2009. Measuring state dependence in individual poverty histories when thereis feedback to employment status and household composition. Journal of AppliedEconometrics 24: 1095–1116.

Cappellari, L., and S. P. Jenkins. 2006. Calculation of multivariate normal probabilitiesby simulation, with applications to maximum simulated likelihood estimation. StataJournal 6: 156–189.

Clark, A. E., and F. Etile. 2006. Don’t give up on me baby: Spousal correlation insmoking behaviour. Journal of Health Economics 25: 958–978.

Devicienti, F., and A. Poggi. 2011. Poverty and social exclusion: Two sides of the samecoin or dynamically interrelated processes? Applied Economics 43: 3549–3571.

Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: PrenticeHall.

Haan, P., and M. Myck. 2009. Dynamics of health and labor market risks. Journal ofHealth Economics 28: 1116–1125.

Haan, P., and A. Uhlendorff. 2006. Estimation of multinomial logit models with unob-served heterogeneity using maximum simulated likelihood. Stata Journal 6: 229–245.

Heckman, J. J. 1981. The incidental parameters problem and the problem of initialcondition in estimating a discrete time-discrete data stochastic process. In Struc-tural Analysis of Discrete Data with Econometric Applications, ed. C. F. Manski andD. McFadden, 179–195. Cambridge: MIT Press.

A. Plum 111

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.Stata Journal 7: 388–401.

Knabe, A., and A. Plum. 2013. Low-wage jobs—Springboard to high-paid ones? Labour27: 310–330.

Miranda, A. 2011. Migrant networks, migrant selection, and high school graduation inMexico. Research in Labor Economics 33: 263–306.

. 2012. petpoisson: Stata module to estimate an endogenous participation en-dogenous treatment poisson model by MSL. Statistical Software Components S457393,Department of Economics, Boston College.https://ideas.repec.org/c/boc/bocode/s457393.html.

Plum, A. 2014. Simulated multivariate random-effects probit models for unbalancedpanels. Stata Journal 14: 259–279.

Stewart, M. B. 2006. Maximum simulated likelihood estimation of random-effects dy-namic probit models with autocorrelated errors. Stata Journal 6: 256–272.

. 2007. The interrelated dynamics of unemployment and low-wage employment.Journal of Applied Econometrics 22: 511–531.

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:Cambridge University Press.

About the author

Alexander Plum is a research assistant at the Chair of Public Economics at the Otto vonGuericke University Magdeburg. His main research interests are labor economics and appliedeconometrics.


conindex: Estimation of concentration indices

Owen O’DonnellErasmus School of Economics

Erasmus University Rotterdam, the NetherlandsTinbergen Institute, the Netherlandsand University of Macedonia, Greece

Stephen O’NeillDepartment of Health Services Research and PolicyLondon School of Hygiene and Tropical Medicine, UK

[email protected]

Tom Van OurtiErasmus School of Economics

Erasmus University Rotterdam, the Netherlandsand Tinbergen Institute, the Netherlands

Brendan WalshDivision of Health Services Research and Management

School of Health Sciencesand City Health Economics Centre

City University London, UK

Abstract. Concentration indices are frequently used to measure inequality inone variable over the distribution of another. Most commonly, they are appliedto the measurement of socioeconomic-related inequality in health. We introducethe user-written command conindex, which provides point estimates and standarderrors of a range of concentration indices. The command also graphs concentrationcurves (and Lorenz curves) and performs statistical inference for the comparisonof inequality between groups. We offer an accessible introduction to the variousconcentration indices that have been proposed to suit different measurement scalesand ethical responses to inequality. We also demonstrate the command’s capabil-ities and syntax by analyzing wealth-related inequality in health and health carein Cambodia.

Keywords: st0427, conindex, inequality, rank-dependent indices, concentrationindex, health, health care

1 Introduction

Concentration indices measure inequality in one variable over the distribution of an-other variable (Kakwani 1977). They are a particularly popular choice for the mea-surement of socioeconomic-related health inequality (Wagstaff, Paci, and van Doorslaer


O. O’Donnell, S. O’Neill, T. Van Ourti, and B. Walsh 113

1991; O’Donnell et al. 2008), as is evident from the 9,220 entries in Google Scholar withthe keywords “concentration index” and “health”. In that case, the concentration indexcaptures the extent to which health differs across individuals ranked by some indica-tor of socioeconomic status. A variety of concentration indices have been proposed tosuit the measurement properties of the variable in which inequality is to be assessedand the assessor’s ethical response to inequality (Wagstaff, Paci, and van Doorslaer1991; Wagstaff 2002; Wagstaff 2005; Erreygers 2009b; Erreygers and Van Ourti 2011a,b;Erreygers, Clarke, and Van Ourti 2012).

We introduce the conindex command, which provides a simple unified means toestimate various concentration indices and their standard errors. It can graph the con-centration curves that underlie some of the indices and test for differences in inequalityacross groups. It can also measure cross-sectional inequality in a cardinal variable overobservations ranked by another variable that is at least ordinally measured. With re-peated cross-section or panel data, one can use conindex to compare inequality acrossperiods. One can also use it to estimate rank-dependent indices of univariate inequality,such as the Gini and generalized Gini.

Other user-written commands are available to calculate some rank-dependent in-equality indices. concindc (Chen 2007) computes the most standard version of the con-centration index for both individual and grouped data. The Lorenz curve and associatedindices of univariate inequality can be computed and decomposed with a range of com-mands: glcurve (van Kerm and Jenkins 2007), ineqerr (Jolliffe and Krushelnytskyy1999), ineqdeco (Jenkins 1999), and descogini (Lopez-Feldman 2005). The compara-tive advantage of conindex is that it estimates a battery of concentration indices, whichallows an analyst to select an index that fits the measurement properties of the vari-able of interest and is consistent with their normative principles concerning inequality.The indices are estimated by using the correspondence of each to a transformation ofthe covariance between the variable in which inequality is measured and the rank inthe distribution over which inequality is assessed. This so-called convenient covarianceapproach (Kakwani 1980; Jenkins 1988; Kakwani, Wagstaff, and van Doorslaer 1997)can be implemented with both individual and grouped data while taking account of thesample design.

Before explaining the command, we define the various inequality indices it can com-pute and offer some guidance regarding the context in which each index is suitable. Weillustrate the command’s features by analyzing wealth-related inequality in health andhealth care in Cambodia.

2 Standard and generalized concentration indices

The concentration curve is the bivariate analogue of the Lorenz curve. It plots the cu-mulative proportion of one variable against the cumulative proportion of the populationranked by another variable. To facilitate more concise exposition, we will mostly referto the variable of interest as health and the ranking variable as income. Income-relatedhealth inequality can be assessed by plotting the cumulative proportion of health across

114 conindex: Estimation of concentration indices

individuals ranked from poorest to richest. Unlike the Lorenz curve, the concentrationcurve may lie above the 45◦ line if health, or more likely a measure of ill health, is moreheavily concentrated among those with lower incomes (as in the hypothetical examplein figure 1). The concentration index is twice the area between the concentration curveand the 45◦ line, indicating no relationship between the two variables.1 It is defined as

C(h|y) = 2cov(hi, Ri)

h=

1

n

n∑i=1

{hi

h(2Ri − 1)

}(1)

where hi is the health variable in which inequality is measured, for example, health.20

.2.4

.6.8

1

Cum

ulat

ive

shar

e of

ill h

ealth

0 .2 .4 .6 .8 1

Fractional income rank

Figure 1. Hypothetical concentration curve

C ranges from (1− n)/n, maximal pro-poor inequality (that is, all health is concen-trated on the poorest individual), to (n− 1)/n, maximal pro-rich inequality.3

Equation (1) reveals that the concentration index can be interpreted as a weightedmean of (health) shares with the weights depending on the fractional (income) rank(2Ri − 1).4 The Gini coefficient measure of univariate inequality arises as a special caseof the concentration index when inequality is measured in the same variable that is used

1. A concentration index of 0 can arise either because health does not vary with income rank orbecause the concentration curve crosses the 45◦ line and pro-poor inequality in one part of theincome distribution is exactly offset by pro-rich inequality in another part of the distribution.

2. The fractional rank varies between 1/2n and 1− (1/2n) if there are no ties. In the case of ties, itequals the mean fractional rank of those individuals with the same value for yi.

3. If given a welfare interpretation, then C respects the principle of income-related health transfers—social welfare falls when the health of a lower-income individual is reduced, and the health of ahigher-income individual is raised by the same magnitude (Bleichrodt and van Doorslaer 2006).

4. C respects the principle of income-related health transfers. This is analogous to the principle oftransfers for G and requires that a health transfer from a low- to a high-income individual willlower social welfare (Bleichrodt and van Doorslaer 2006).


for ranking. This is true for all indices we discuss in the remainder of this article andimplies that conindex can be used to estimate univariate inequality.

The concentration index measures relative inequality and is invariant to equipro-portionate changes in the variable of interest (health). This relative invariance is oneextreme of the many normative positions one might take in measuring inequality (Kolm1976). At the other extreme, absolute invariance corresponds to an inequality measurethat is invariant to equal additions to health. Such a measure can be obtained throughmultiplication of the standard concentration index by the mean health leading to thegeneralized concentration index (Wagstaff, Paci, and van Doorslaer 1991).5 Multiplica-tion by the mean gives this parameter an important role in the assessment of absoluteinequality. When two distributions display the same level of relative inequality, the onewith the higher mean will correspond to greater absolute inequality. The generalizedconcentration index GC can be expressed as

GC(h|y) = 1

n

n∑i=1

{hi(2Ri − 1)}

and ranges between h {(1− n)/n} (maximal pro-poor) and h {(n− 1)/n} (maximal pro-rich).

3 Taking account of measurement scale

The standard and generalized concentration indices are not necessarily invariant, orequivariant, under transformations of the variable of interest that are permissible forthe level of measurement (that is, nominal, ordinal, cardinal, ratio, or fixed scale)(Erreygers and Van Ourti 2011a,b).6 Several variants of the standard and generalizedconcentration indices have been proposed for use with variables possessing differentmeasurement properties. We differentiate between measurement levels at which per-mitted transformations affect the value of an index and levels at which transformationsto different scales affect inequality orderings. Both have received attention (for example,Lambert and Zheng [2011]), but most applications focus on the former. We think thelatter issue is more important because it deals with whether one bivariate distribution isevaluated to display greater inequality than another, irrespective of an arbitrary scaling.

5. The graphical representation of the generalized concentration index corresponding to figure 1 is thegeneralized concentration curve. According to the generalized concentration dominance criterion(Shorrocks 1983), a distribution with a higher mean cannot be dominated by one with a lowermean. But the ordering of two distributions by their generalized concentration indices does notnecessarily correspond to their ordering by their means.

6. A function f(·) is invariant under transformation g(·) if f {g (x)} = f (x). A function is equivariantif f {g (x)} = |∂g/∂x| f (x).


3.1 Measurement level

In bivariate inequality measurement, an ordinal scale is sufficient for the variable thatis used for the ranking of individuals. Rank-dependent indices can then be deployed toquantify inequality in variables measured at three levels:7

• Fixed: the measurement scale is unique (or fixed) with zero point correspondingto a situation of complete absence, for example, number of visits to a hospitalwithin a given period.

• Ratio: the measurement scale is unique up to a proportional scaling factor withthe zero point corresponding to a situation of complete absence, for example, lifeexpectancy that could be measured in years, months, etc.

• Cardinal: the measurement scale is such that differences between values are mean-ingful but ratios are not, and the zero point is fixed arbitrarily, for example, tem-perature in Celsius or Fahrenheit or a health utility index.

For variables on a fixed scale, the standard and generalized concentration indicesquantify inequality in the attribute of fundamental interest. Both are appropriate,with the choice between them depending on whether one is concerned about relativeor absolute inequality. Changing the proportionality factor of a ratio-scaled variablewill affect the value of the generalized concentration index but not that of the standardconcentration index.8 The generalized concentration index should therefore be usedwith ratio-scaled data only when the variables compared in an inequality ordering aresubject to the same scaling factor.9 Only in this case can one be sure that the inequalityordering given by the index applied to the variable is informative of the ranking ofpopulations by inequality in the attribute of essential interest. Alternatively, becausethe generalized concentration index is equivariant under a proportional transformationof the variable, if one knows the differential scaling factors, then one can use them expost to make the indices comparable across populations.

When the variable of interest is cardinal, the standard concentration index is notnecessarily invariant to arbitrary retransformations of the variable.10 One can addressthis by using the modified concentration index MC (Erreygers and Van Ourti 2011a,b),

MC(h|y) = 1

n

n∑i=1

{hi

h− hmin(2Ri − 1)

}(2)

where hmin is the lower limit of hi and the index ranges between (1− n)/n and (n− 1)/n.Under ratio- or fixed-measurement scales (hmin = 0), (2) simplifies to the standardconcentration index in (1).

7. Measuring inequality in nominal and ordinal variables is not feasible using rank-dependent indices(Erreygers and Van Ourti 2011a,b).

8. Assuming that hi = βxi, one obtains C (h |y ) = 2cov(hi/h,Ri

)= 2cov (βxi/βx,Ri) = C (x |y )

and GC (h |y ) = 2cov (hi, Ri) = 2cov (βxi, Ri) = βGC(x |y ).9. We assume monotone transformations; hence, the proportional scaling factor is positive.

10. We restrict attention to positive linear transformations; that is, β must be positive in hi = α+βxi.


One should use the modified concentration index when comparing inequality in anattribute using a variable that is inconsistently cardinalized for different populationsor when comparing inequality in different cardinally scaled variables in the same pop-ulation. If the cardinalization is constant, then inequality orderings made using thestandard concentration index will be robust to the chosen cardinalization (although theindex values will depend on the specific cardinalization chosen). Nevertheless, we ad-vise that one also use the modified index in this case because it allows for an easierinterpretation—the range is always [(1− n)/n, (n− 1)/n].

There is no easy modification to ensure the invariance of the generalized concentra-tion index to retransformations of cardinal variables. However, provided the cardinal-ization adopted across populations or variables is the same, then the inequality orderingwill be robust to the chosen cardinalization.

3.2 Bounded variables

Variables with a finite upper limit, such as years in school, a (health) utility index, or anybinary indicator, complicate the measurement of inequality.11 For instance, boundedvariables can be represented either as attainments ai ∈

[amin, amax

]or as shortfalls from

the upper limit si = amax − ai. Erreygers (2009b) introduced the “mirror” property,which requires that the magnitude of measured inequality represented by the absolutevalue of an index should not depend on whether the index is computed over attainmentsor shortfalls; that is, I (a) = −I (s).12

The standard concentration index does not satisfy this condition, C (s) = (−a/s)C (a); hence, inequality in attainments do not mirror inequality in shortfalls exceptwhen a = s (Erreygers 2009b).13 Moreover, inequality orderings based on the stan-dard concentration index might depend on whether one uses shortfalls or attainments.More generally, the mirror condition is incompatible with the measurement of relativeinequality (Erreygers and Van Ourti 2011a,b; Lambert and Zheng 2011). One mustchoose between satisfaction of the mirror condition and satisfaction of relative inequal-ity invariance.14

The generalized concentration index satisfies the mirror condition, GC (s) =−GC (a). However, as noted in section 2, the value of this index is not invariant topermissible transformations of ratio-scaled and cardinal variables. Erreygers (2009b)proposed a modification of the generalized concentration index that corrects this defi-ciency:

11. For discussion of the issues, particularly in relation to the measurement of health in-equality, see Clarke et al. (2002); Wagstaff (2005); Erreygers (2009a,b,c); Wagstaff (2009);Erreygers and Van Ourti (2011a,b); Wagstaff (2011a,b); and Kjellsson and Gerdtham (2013a,b).

12. Lambert and Zheng (2011) suggested a weaker condition requiring that the inequality ordering ofpopulations by attainments is strictly the reverse of that by shortfalls.

13. The same holds for the modified concentration index.14. Bosmans (Forthcoming) shows how one can overcome the impossibility of satisfying both conditions

if one allows for different functional forms for the inequality index for attainments and the inequalityindex for shortfalls.


E(a|y) = 1

n

n∑i=1

{4ai

(amax − amin)(2Ri − 1)

}= −E(s|y) (3)

This index ranges between −1 and +1.

Wagstaff (2005) noted that the range of the standard concentration index dependson the mean of the bounded variable and suggested rescaling the standard concentrationindex to ensure that it always lies in the range [−1, 1]:15

W (a|y) = 1

n

n∑i=1

{(amax − amin)ai

(amax − a)(a− amin)(2Ri − 1)

}= −W (s|y)

This index satisfies the mirror condition and so cannot be in line with the relativeinvariance criterion. Neither does it satisfy an absolute invariance criterion. In fact, theindex is consistent with an inequality invariance condition consisting of a mixture ofinvariance with respect to proportionate changes in 1) attainments ai and 2) shortfallssi. This may be considered to have paradoxical implications (Erreygers and Van Ourti2011a,b), although Kjellsson and Gerdtham (2013b) argue that this invariance criterionis actually intuitive when one realizes that W can be written as the difference betweenthe standard concentration indices for attainments and shortfalls.

Unlike for unbounded variables, the precise scaling of bounded variables does notaffect the value of any rank-dependent inequality index provided that the boundingis considered. This is most easily understood from the realization that any boundedvariable can be retransformed into an indicator of the proportional deviation from theminimum value, bi = (ai − amin)/(amax − amin). This lies on the range [0, 1] and recordsonly “real” changes in the underlying attribute, not “nominal” ones due to the choiceof measurement scale. Under this transformation, the Erreygers and Wagstaff indicessimplify, respectively, to

E (b |y ) = (1/n)

n∑i=1

{4bi (2Ri − 1)}

and

W (b |y ) = (1/n)

n∑i=1

[{bi/(1− b)}b(2Ri − 1)]

3.3 Summing up

The main message of this section is not that the most appropriate inequality indexdepends on the measurement properties of the variable of interest but that those prop-erties partly determine the ethical choices one faces when quantifying inequality, which

15. Wagstaff (2005) focused on the binary case.


is intrinsically a normative exercise. When the variable of interest has an infinite up-per bound on a fixed scale, the main normative choice is between absolute and relativeinvariance. Matters are more complicated when the measurement scale is not unique.Applying the generalized concentration index to a ratio or cardinal variable requires oneto accept that the inequality ordering may depend on the scaling adopted. This canbe avoided for the relative inequality invariance criterion if one replaces the standardconcentration index with the modified one. When the variable has a finite upper bound,one should first choose between relative inequality invariance and the mirror condition.If one prioritizes the relative invariance criterion (in attainments or shortfalls), thenthe standard concentration index or its modified version can be used. When priorityis given to the mirror condition, one faces a choice between the Erreygers index, whichfocuses on absolute differences, and the Wagstaff index, which mixes concern for relativeinequalities in attainments and relative inequalities in shortfalls.

If one considers no index to be normatively superior to all others, then one can checkwhether the inequality orderings are consistent across indices. If they are, all is welland good. However, such robustness does not hold in general.

4 Incorporating alternative attitudes to inequality

As noted in section 2, the standard concentration index can be interpreted as a weightedmean of the variable of interest with each individual’s weight depending on its fractionalrank; that is, (2Ri − 1). This weight equals 0 for individuals with the median value of theranking variable16 and is negative (positive) for individuals below (above) the median.Presuming the ranking variable is income, the weight increases linearly from (1− n)/nfor the poorest individual to (n− 1)/n for the richest. This linearity is consistent with aparticular attitude toward inequality that need not command widespread support. Twoextensions based on nonlinear weighting schemes can represent a variety of alternativeethical positions. The first approach makes it possible to vary the weight put on thoseat the top relative to those at the bottom of the distribution of the ranking variable.We refer to it as “sensitivity to poverty” because it allows more (or less) weight tobe placed on the poorest individuals when income is used as the ranking variable. Thesecond approach allows more (or less) weight to be placed on the extremes of the rankingdistribution (for example, the very rich and very poor) vis-a-vis those in the middle.We term this approach “sensitivity to extremity”.

4.1 Extended concentration index: sensitivity to poverty

Kakwani (1980) and Yitzhaki (1983) proposed a flexible extension of the univariate Giniindex that incorporates a distributional sensitivity parameter v specifying the attitudetoward inequality within the weight defined by 1 − v (1−Ri)

v−1. Pereira (1998) and

Wagstaff (2002) suggested using the same weighting function in the context of the mea-

16. Ri for this individual is (2i− 1)/2n = [2 {(n+ 1)/2} − 1]/2n = 1/2; hence, the weight is 2 (1/2)−1 = 0.


surement of income-related health inequality. This results in an extended concentrationindex identical to the standard concentration index except for the weighting function:17

EC (h |y; v ) = 1

n

n∑i=1

[hi

h

{1− v (1−Ri)

v−1}]

(4)

The distributional sensitivity parameter must take a value greater than or equal to 1.Larger values place more weight on the poorest individuals (when income is the rankingvariable). The weighting function equals 0 for v = 1 and in that case gives an index of 0regardless of the distribution of h, while v = 2 yields the standard concentration index.The extended concentration index ranges between 1 − v and 1,18 which suggests anintuitive interpretation of v as the distance between the weight given to the poorest andthe richest individual. The weight given to the richest individual is always +1, whilethe weight given to the poorest individual becomes more negative for higher values of v.Hence, the weighting function in (4) is asymmetric around the individual with medianincome, unless v = 2. For v �= 2, the individual with median income does not have aweight of 0. The individual given a weight of 0 will have a lower income than the onewith the median income when v > 2.

4.2 Symmetric concentration index: Sensitivity to extremity

Erreygers, Clarke, and Van Ourti (2012) suggest extending the linear weighting func-tion of the concentration index in such a way that two conditions are satisfied: 1) theindividual with median income should play a pivotal role, obtaining a weight of zero, and2) the weights for the other individuals should be inversely symmetric around medianincome. The poorest and richest individual should have the same weight but with anopposite sign. The second poorest and second richest should have the same weight withopposite signs, and so on. Under these conditions, varying attitudes toward inequalityexpress one’s sensitivity to extremity, that is, whether one is concerned merely withdifferences in the variable of interest at the middle of the income distribution or withdifferences between the extremes of the income distribution.

An index that satisfies both conditions and allows for varying degrees of sensitivityto extremity depending on the value of β > 1, which is analogous to the parameter v inthe extended index, is19

17. When one uses a finite number of observations to calculate the extended concentration index andv �= 2, a small-sample bias arises. Erreygers, Clarke, and Van Ourti (2012) develop an alternativeway to calculate the extended concentration index (and its generalized version) to address thissmall-sample bias. Their approach is applied in the conindex command. We refer the reader tothe appendix in Erreygers, Clarke, and Van Ourti (2012) for more details.

18. EC ranges between 1− v and 1 when n → +∞. For a finite value of n, the lower and upper limitof EC are 1− v {(2n− 1)/2n}v−1 and 1− v (1/2n)v−1.

19. As with the extended concentration index, a small-sample bias arises when β �= 2 (see footnote 17).We have implemented the approach explained in the appendix of Erreygers, Clarke, and Van Ourti(2012) in the conindex command. Note that the approach is also used for the generalized version.


S (h |y;β ) = 1

n

n∑i=1

(hi

h

)⎡⎣β2β−2

{(Ri − 1

2

)2} β−2

2 (Ri − 1

2

)⎤⎦ (5)

If 1 < β < 2, more weight is placed on the middle of the income distribution, whereasfor β > 2, the extremes are weighted more at the expense of the middle. If β = 2,the symmetric index equals the standard concentration index. When β becomes verylarge, the symmetric index will be very similar to the range index, which is sensitiveonly to the difference between the upper and lower end of the income distribution.This corresponds to one of the earliest measures of health inequality (for example,Townsend and Davidson [1982]). The range of the symmetric index is [−(β/2), +(β/2)],which provides an intuitive interpretation of the β parameter as the absolute deviationbetween the weight given to the poorest and richest individual.20

The choice between the symmetric and extended indices is normative. The symmet-ric index gives equal weight (but with an opposite sign) to individuals that are equally farapart from the pivotal individual with median rank, while the extended index prioritizesthe lower regions of the ranking (income) distribution. Applied to income-related healthinequality, the symmetric index is increasingly sensitive to a change that raises the healthof a richer individual and reduces that of a poorer individual by an equal magnitude thefurther those individuals are from the pivotal individual. In contrast, the extended con-centration index will be increasingly sensitive the closer the location of such a “healthtransfer” to the bottom of the income distribution. Erreygers, Clarke, and Van Ourti(2012) argue that the symmetric index is more concerned about the association betweenincome and health, while the extended concentration index puts priority on the incomedistribution and only then analyzes health differences within the prioritized region ofthe income distribution.21

4.3 Generalizing the extended and symmetric indices

Erreygers, Clarke, and Van Ourti (2012) consider counterparts of the extended andsymmetric indices that satisfy the mirror condition. They refer to the resulting measuresas generalized indices because they satisfy an absolute inequality invariance criterionand define these on the transformed bounded variable bi = (ai − amin)/(amax − amin):

20. This range is entirely correct only when n → +∞. For a finite number of observations, the range

is

[β2β−2

[{(1− n)/2n}2

](β−2)/2 {(1− n)/2n} ;β2β−2[{(n− 1)/2n}2

](β−2)/2 {(n− 1)/2n}].

21. Equations (4) and (5) reveal that the symmetric and extended concentration indices consider healthshares and hence are sensitive to relative health differences. The “absolute” counterparts of theseindices have not explicitly been introduced by Erreygers, Clarke, and Van Ourti (2012) (or Pereira[1998] and Wagstaff [2002] in the case of the extended concentration index) but are trivially de-rived by replacing the health shares with the health levels. Similarly, the measurement scale ofunbounded variables is important for the extended and symmetric indices, but the discussion es-sentially mimics that in section 3.1. Modifications such as in sections 3.1 and 3.2 can be derived.Because these modifications are not integrated in the conindex command, we do not discuss theseindices explicitly.


GEC (b |y; v ) = 1

n

n∑i=1

(v

vv−1

v − 1bi

){1− v (1−Ri)

v−1}

(6)

GS (b |y;β ) = 1

n

n∑i=1

4bi

⎡⎣β2β−2

{(Ri − 1

2

)2} β−2

2 (Ri − 1

2

)⎤⎦ (7)

with v ≥ 1 and β > 1. For v = β = 2, both indices simplify to the Erreygers index (3).Erreygers, Clarke, and Van Ourti (2012) show that both indices always range between−1 and +1.22

5 Estimation and inference

Each of the rank-dependent inequality indices discussed above can be expressed as atransformation of the covariance between the variable of interest (hi) and the fractionalrank (Ri) of the ordering variable. For example, the standard concentration index istwice the covariance divided by the mean of the variable of interest [see equation (1)].Because the slope coefficient of a simple least-squares regression is the covariance dividedby the variance of the regressor, each inequality index can be obtained from a regressionof a transformation of the variable of interest on the rank. For example, the standardconcentration index is the least-squares estimate of α1 in the model

2σ2R

hhi = α0 + α1Ri + εi (8)

where σ2R is the variance of R and εi is an error term. The standard error of the least-

squares estimate of α1 serves as a standard error of the estimate of the concentrationindex.23

An advantage of this approach is that Stata readily allows for sampling weights aswell as robust and clustered standard errors. Appropriate rescalings of the dependentvariable lead to the other indices considered in section 3.24 Erreygers, Clarke, and Van

22. Extensions of these indices that simultaneously satisfy the mirror condition and the inequalityinvariance criterion underlying the Wagstaff index are not discussed, because these have not beenintroduced in the literature before. However, in principle, it would be feasible to derive such indices.

23. conindex does not consider the sampling variability of the estimate of the mean of the variable of in-terest used in constructing the dependent variable of the regression. Typically, this makes very littledifference to the standard error of an estimated concentration index. The command calculates thepopulation formula for σ2

R and not the sampling formula; that is, there is no degrees of freedom cor-rection. The command does not implement the approach of Kakwani, Wagstaff, and van Doorslaer(1997) to account for serial correlation because this approach has not been extended to also allowfor sample design issues such as clustering. For more details on these issues, see O’Donnell et al.(2008, chap. 8).

24. One should replace (2σ2R)/h in (8) by 2σ2

R for GC, by (2σ2R)/(h− hmin) for MC, by

(8σ2R)/(a− amin) for E, and by {2σ2

R

(amax − amin

)}/{(amax − a)(a− amin

)} for W .


Ourti (2012) do not provide standard errors for the extended and symmetric indices;therefore, standard errors for these indices are not reported.

A final note concerns ties in the ranking variable that arise when different observa-tions have the same value for the ranking variable. conindex accounts for this by cal-culating the fractional rank from the proportion of individuals with a given value of the

ranking variable (y), such that Ri = (∑n

i=1 swi)−1

[q (yi − 1) + 0.5 {q (yi)− q (yi − 1)}],where swi denotes the sampling weight of individual i and q (yi) =

∑nk=1 1 (yk ≤ yi) swk

equals the proportion of individuals with at least the value yi (Van Ourti 2004). Whileconindex automatically adjusts for ties in computing the point estimate, it purpose-fully does not do so in generating the standard error. This is because two individualswith the same value of the ranking variable may or may not be entirely independentobservations. In the former case, one should not correct the standard errors, but if theobservations are dependent (because they belong to the same household, for example),then the cluster() option should be used.25 The occurrence of ties in the rankingvariable is similar to the case of grouped data estimation of the standard concentrationindex. With grouped data, one row in the data matrix will include group mean values ofthe variable of interest and the ranking variable as well as the sample weight indicatingthe relative size of the group. One can apply conindex directly to such grouped data.

6 The conindex command

6.1 Syntax

The syntax for conindex is

conindex varname[if] [

in] [

weight] [

, rankvar(varname) robust

cluster(varname) truezero generalized bounded limits(#1 #2)

wagstaff erreygers v(#) beta(#) graph loud compare(varname)

keeprank(string) ytitle(string) xtitle(string)]

fweights, aweights, and pweights are allowed; see [U] 11.1.6 weight.

by varlist: is allowed and can be used to calculate indices for groups defined by multiplevariables.

6.2 Description

conindex computes a range of rank-dependent inequality indices, including the Gini co-efficient, the concentration index, the generalized (Gini) concentration index, the mod-ified concentration index, the Wagstaff and Erreygers normalized concentration indices

25. Where ties occur between observations in different clusters, clustered standard errors may be un-stable because they are obtained from a regression at the group level with groups defined by theunique values of the ranking variable.


for bounded variables, and the distributionally sensitive extended and symmetric con-centration indices (and their generalized versions). There is no default index. Optionsdefine the index to be computed. One can use the graph option to obtain (generalized)Lorenz and (generalized) concentration curves. The default axis labels can be replacedwith the xtitle(string) and ytitle(string) options.

For unbounded variables (that is, those with at least one infinite bound), thetruezero option should be specified if the variable of interest is ratio scale (or fixed)and has a zero lower limit, in which case the standard concentration index is calculated.If instead the variable of interest is cardinal (with the zero point fixed arbitrarily), thenthe theoretical lower limit must be specified using the limits(#) option, where # isthe minimum value. Note that one should not use the lowest value observed in the sam-ple if this does not correspond to the theoretical lower bound. Specifying this optionresults in calculation of the modified concentration index.

The generalized concentration index derives from specifying the generalized optionin conjunction with the truezero option.

For bounded variables (that is, those with both a finite lower and upper bound), thebounded option can be specified in conjunction with limits(#1 #2), where #1 and#2 denote the theoretical minimum and maximum values of the variable of interest. Theinequality indices are then calculated based on the standardized version of the variableof interest, h∗ = {(h−#1) / (#2 −#1)}, and hence will be scale invariant.

The normalized concentration indices proposed by Wagstaff (2005) and Erreygers(2009b) may be obtained by specifying the wagstaff and erreygers options, respec-tively, in conjunction with the bounded and limits(#1 #2) options.

When a ranking variable is not provided using the rankvar option, conindex de-faults to use varname to rank observations, leading to the calculation of unidimensionalinequality indices (for example, the Gini coefficient).

The extended concentration index, which allows for alternative attitudes to inequal-ity (Pereira 1998; Wagstaff 2002), is computed with the truezero and v(#) options,where # is the distributional sensitivity parameter. With v(2), the extended concen-tration index is equivalent to the standard concentration index.

The symmetric concentration index is obtained with the truezero and beta(#)

options. With beta(2), the symmetric concentration index is equivalent to the standardconcentration index.

The generalized version of the extended and symmetric concentration indices areobtained by combining the v() and beta() options with the truezero and generalized

options.

All indices are calculated using the so-called convenient covariance approach (Kak-wani 1980; Jenkins 1988; Kakwani, Wagstaff, and van Doorslaer 1997). Robust- andcluster-corrected standard errors can be obtained with the usual options. Standard er-rors for the extended and symmetric indices are not calculated by the current versionof conindex.


The value of an index can be compared across groups defined by a single variable(for example, urban), and the null of homogeneity tested using the compare() option.The prefix bysort varlist: can be used to calculate the indices for groups defined bymultiple variables (for example, urban and hhsize).

The fractional rank may be preserved using the keeprank(string) option, wherestring is the name given to the rank variable created.

6.3 Options

rankvar(varname) specifies the variable by which individuals are ranked. varnamemust be at least an ordinal variable. When a ranking variable is not provided usingthe rankvar() option, conindex defaults to using varname to rank observations,leading to the calculation of unidimensional inequality indices (for example, the Ginicoefficient).

robust requests Huber/White/sandwich standard errors.

cluster(varname) requests clustered standard errors that allow for intragroup corre-lation.

truezero declares that the variable of interest is ratio scaled (or fixed), leading tocomputation of the standard concentration index.

generalized requests the generalized concentration (Gini) index, measuring absoluteinequality. This option can be used only in conjunction with truezero.

bounded specifies that the dependent variable is bounded. This option must be used inconjunction with the limits() option.

limits(#1 #2) must be used to specify the theoretical minimum (#1) and maxi-mum (#2) for bounded variables. If the bounded and truezero options are notspecified, then limits(#1) should be used to specify the minimum value to obtainthe modified concentration index.

wagstaff in conjunction with bounded and limits(#1 #2) requests the Wagstaffindex.

erreygers in conjunction with bounded and limits(#1 #2) requests the Erreygersindex.

v(#) requests the extended concentration index be computed. This option can be usedonly in conjunction with truezero. With v(2), the standard concentration indexis computed. If the v(#), truezero, and generalized options are specified, oneobtains the generalized extended concentration index. In the latter case, with v(2),the extended concentration index simplifies to the Erreygers index.

beta(#) requests the symmetric concentration index be computed. This option can beused only in conjunction with truezero. With beta(2), the standard concentrationindex is computed. If the beta(#), truezero, and generalized options are speci-


fied, one obtains the generalized symmetric concentration index. In the latter case,with beta(2), the symmetric concentration index leads to the Erreygers index.

graph requests that a concentration curve be displayed. If no ranking variable is speci-fied, a Lorenz curve is produced. In conjunction with generalized, one obtains thegeneralized Lorenz or concentration curve.

loud shows the output from the regression used to generate the inequality indices.

compare(varname) computes indices specific to groups specified by varname. Two testsof the null hypothesis of equality of the index values across groups are produced:an F test that is valid in small samples but requires an assumption of equal vari-ances across groups (Chow 1960) and a z test that relaxes the assumption of equalvariances but is valid only in large samples (Clogg, Petkova, and Haritou 1995). Ifvarname is not binary, then only the F test is given.

keeprank(string) creates a new variable that contains the fractional ranks, where stringis the name of the variable to be created. When used in conjunction with thecompare() option, the variable string will contain the fractional rank for the fullsample, and the suffix k is added to string to indicate the fractional rank for group k.

ytitle(string) and xtitle(string) specify the titles to appear on the y and x axes,respectively.

6.4 Stored results

conindex stores the following in r():

Scalarsr(N) number of observationsr(Nunique) number of unique observations for rankvar()r(CI) concentration indexr(CIse) standard error of concentration indexr(SSE unrestricted) unrestricted sum of squared errors (with compare() option)r(SSE restricted) restricted sum of squared errors (with compare() option)r(F) F statistic for joint hypothesis that concentration index is same

for all groups (with compare() option)r(CI0) concentration index for group 0 (with compare() option if only

two groups)r(CI1) concentration index for group 1 (with compare() option if only

two groups)r(CIse0) standard error of concentration index for group 0 (with

compare() option if only two groups)r(CIse1) standard error of concentration index for group 1 (with

compare() option if only two groups)r(Diff) difference in concentration index between groups (with

compare() option if only two groups)r(Diffse) standard error of difference in concentration index between

groups (with compare() option if only two groups)r(z) Z statistic for hypothesis that concentration index is same for

both groups (with compare() option if only two groups)


7 conindex: Example applications

We illustrate the functionality of conindex through examples using data from the 2010Demographic and Health Survey (DHS) of Cambodia, which can be obtained fromhttp://www.dhsprogram.com/. The Cambodian DHS covers a representative sampleof women aged between 15 and 49. It asks each participating woman about her preg-nancies in the last 10 years and also collects information at the household level. Weconstruct a dataset of households to estimate inequality in the distribution of healthcare expenditures and a dataset of births to estimate inequality in infant mortality.Inequality in each variable is examined in relation to a wealth index (wealthindex)that is obtained from a principal components analysis of the households’ possession of abattery of assets and durables as well as housing materials (Filmer and Pritchett 2001).This index has an ordinal interpretation and is used as the ranking variable.

In the household dataset, we construct a measure of health care expenditure percapita (healthexp) by summing out-of-pocket medical spending in the last month acrossindividuals within the household and dividing by household size.26 This measure willserve as an example of an unbounded variable with a ratio scale. From the child dataset,we construct a binary indicator of infant mortality (u1mr) that indicates whether eachchild born during the last 10 years survived to its first birthday.27 Results below indicatethat around 6% of children die within a year of birth. Average per capita monthly healthexpenditure is about 12,000 riel (e2.40), but the median value of 0 riel and the highmaximum of 14,500,000 riel show that the distribution of health expenditures is rightskewed.

. summarize healthexp [aweight=sampweight_hh]

Variable Obs Weight Mean Std. Dev. Min Max

healthexp 15,667 75391.2524 12010.62 116693 0 1.45e+07

. summarize u1mr [aweight=sampweight]

Variable Obs Weight Mean Std. Dev. Min Max

u1mr 14,598 14588.3669 .0606938 .238776 0 1

Figure 2 shows that more than 8% of Cambodian children in the lowest wealthquintile group die before they reach their first birthday. This is more than three timesgreater than the rate of infant mortality in the richest wealth quintile group. The

26. For each ill or injured household member, the respondent was asked to state the costs expendedfor transportation and treatment for each visit to a health care provider (for up to three visits andwithout differentiating between outpatient and inpatient care). These costs were reported only forliving people who had been ill or injured during the last month and did not include costs incurredfor people who had died in the 30 days preceding the interview.

27. For the summary statistics of health care expenditures, this implies that we consider the individualas the ultimate unit of observation, even though expenditures are measured at the household level.For the concentration indices of health expenditures, it also implies that household size will influencethe fractional ranks. As shown by Edbert (1997) and illustrated by Decoster and Ooghe (2003)on income data, this has important normative implications in terms of the axioms of anonymityand principle of transfer (both of which would be violated if each household were weighted equallyindependently of its size). In practice, this means we report estimates based on 15,667 observations,but application of sampweight hh ensures these are representative of 75,391 individuals.


mortality rate declines, but not monotonically, in moving to higher wealth groups.Health expenditure rises from around 7,000 riel per capita in the lowest wealth quintilegroup to more than 22,500 riel in the top group.

. xtile wealthquint_hh = wealthindex [pweight=sampweight_hh], n(5)

. xtile wealthquint = wealthindex [pweight=sampweight], n(5)

. graph bar (mean) healthexp [pweight=sampweight_hh], over(wealthquint_hh)

. graph bar (mean) u1mr [pweight=sampweight], over(wealthquint)

Table 1 (see page 132) summarizes all the indices discussed below. The concentrationindex of 0.248 confirms that medical spending is heavily concentrated among better-offsample households identified by a higher position in the wealth index distribution. Aswell as the point estimate, conindex returns a cluster-adjusted standard error and ap-value for a test that the index equals 0. In this example, the null is strongly rejected(p < 0.001). conindex can be used to graph a concentration curve by adding the graphoption.28

Figure 2. Infant mortality rate (left panel) and mean health care expenditure per capita(right panel) over wealth quintiles in Cambodia DHS, 2010

. conindex healthexp [aweight=sampweight_hh], rankvar(wealthindex) truezero> cluster(PSU) graph ytitle(Cumulative share of healthexp)> xtitle(Rank of wealthindex)

Index: No. of obs. Index value Robust std. error p-value

CI 15667 .24786719 .07246288 0.0007

(Note: Std. error adjusted for 611 clusters in PSU)

28. When graphing, conindex defaults to the variable label or, if it is unavailable, the variable namewhen labeling the axis. This can be overridden by specifying the xtitle() and ytitle() optionsand specifying the desired axis labels inside the parentheses. When rankvar() is not specified,conindex draws the Lorenz curve. Note also that the generalized concentration (Lorenz) curve willbe drawn when the generalized option is also specified.


Figure 3 reveals that there is no ambiguity in the distribution of health expenditures.The concentration curve always lies below the diagonal, which indicates greater spendingby those ranked higher according to the wealth index.

0.2

.4.6

.81

Cum

ulat

ive

shar

e of

hea

lthex

p

0 .2 .4 .6 .8 1

Fractional Income rank

Figure 3. Concentration curve for out-of-pocket health care expenditure per capitaagainst wealth index rank, Cambodia (DHS, 2010)

Because health care expenditures are unbounded and measured on a ratio scale, thisestimate is robust to the proportionality factor arising from the choice of currency andcan be used to rank inequality in medical spending in Cambodia against inequalitiesin other ratio-scale variables (for example, food expenditures) or health expenditureinequality in other countries. If one prefers that the measure of inequality in medicalspending respect absolute invariance rather than relative invariance, then the general-ized concentration index can be requested by typing

. conindex healthexp [aweight=sampweight_hh], rankvar(wealthindex) generalized> truezero cluster(PSU)


Gen. CI 15667 2977.0381 870.32394 0.0007


This gives an estimate of around 2,977 riel, which is obviously sensitive to the propor-tionality factor and cannot be used directly to compare inequality in medical spendingacross countries with different currencies.29

The standard concentration index of infant mortality is negative, which indicatesthat infant deaths are concentrated among less wealthy households. The index for infantsurvival (u1sr) is correspondingly positive but differs greatly in absolute value from theindex for mortality, which confirms that the mirror property does not hold and reflectsimposition of relative invariance with respect to different variables (see also section 3.2).

29. The univariate Gini and generalized Gini index are obtained by omitting the rankvar() option inthe conindex command.


Given the standard concentration index is insensitive to a proportional transformationof the variable of interest, the value used to indicate presence of a characteristic, forexample, death = 1, is irrelevant provided the value used to indicate absence of thatcharacteristic is fixed at 0.30

. conindex u1mr [aweight=sampweight], rankvar(wealthindex) truezero cluster(PSU)


CI 14598 -.18890669 .02546028 0.0000


. conindex u1sr [aweight=sampweight], rankvar(wealthindex) truezero cluster(PSU)


CI 14598 .01220632 .00164513 0.0000


The generalized concentration indices of mortality and survival are equal in absolutevalue (see table 1), confirming that this index satisfies the mirror condition when it isapplied to a binary variable. This is because the generalized concentration index fora binary variable equals one-fourth of the Erreygers index, which possesses the mirrorproperty. But the generalized concentration index does not satisfy this condition ingeneral. The Erreygers index is computed by specifying the erreygers option, alongwith two further options that indicate the variable is bounded and how it is coded.

. conindex u1mr [aweight=sampweight], rankvar(wealthindex) erreygers bounded> limits(0 1) cluster(PSU)


Erreygers norm. CI 14598 -.04586189 .00618113 0.0000


. conindex u1sr [aweight=sampweight], rankvar(wealthindex) erreygers bounded> limits(0 1) cluster(PSU)


Erreygers norm. CI 14598 .04586189 .00618113 0.0000


TheWagstaff index of infant mortality, which as explained above has different norma-tive underpinnings, is computed by simply specifying wagstaff in place of erreygers.

30. Consult Erreygers and Van Ourti (2011a,b) and Wagstaff (2011a,b) for some discussion on thisissue.


. conindex u1mr [aweight=sampweight], rankvar(wealthindex) wagstaff bounded> limits(0 1) cluster(PSU)


Wagstaff norm. CI 14598 -.20111301 .02710541 0.0000


. conindex u1sr [aweight=sampweight], rankvar(wealthindex) wagstaff bounded> limits(0 1) cluster(PSU)


Wagstaff norm. CI 14598 .20111301 .02710541 0.0000


The value of the Wagstaff index is close to that of the standard concentration indexbecause the prevalence of infant deaths, at 6.1%, is close to 0; thus the index placesgreater weight on relative invariance with respect to presence of the characteristic (here,death) and so comes closer to the normative principle imposed by the standard concen-tration index. If the prevalence were 50%, then the Wagstaff index would give equalweight to relative invariance in attainments and shortfalls, which coincides with abso-lute invariance. In that case, its value would equal that of the Erreygers index (seeKjellsson and Gerdtham [2013b] for more discussion).


Tab

le1.

Concen

tratio

nindices

estimated

from

theCambodian

DHS,

2010

Health

expenditu

re(h

ealth

exp)

Infantmorta

lity(u

1mr)

Infantsu

rvival(u

1sr)

Standard

concen

tratio

nindex

(C)

0.2479

(0.0725)

−0.1889

(0.0255)

0.0122

(0.0016)

Gen

eralized

concen

tratio

nindex

( GC)

2,977

(870)

−0.0115

(0.0015)

0.0115

(0.0015)

Errey

gers

index

(E)

−0.0459

(0.0062)

0.0459

(0.0062)

Wagsta

ffindex

(W)

−0.2011

(0.0271)

0.2011

(0.0271)

v,β

1.5

51.5

51.5

5

Exten

ded

concen

tratio

nindex

( smrm

EC(v))

0.1696

.

0.3818

.

−0.1201

.

−0.3187

.

0.0078

.

0.0206

.

Symmetric

concen

tratio

nindex

( SC(β

))0.2057

.

0.3943

.

−0.1683

.

−0.2492

.

0.0109

.

0.0161

.

Gen

eralized

exten

ded

concen

tra-

tionindex

( GEC(v))

−0.0492

.

−0.0362

.

0.0492

.

0.0362

.

Gen

eralized

symmetric

concen

tra-

tionindex

( GSC(β

))−0.0409

.

−0.0605

.

0.0409

.

0.0605

.

Note:

Robust

standard

errors

thatacco

untforclu

steringatthelev

eloftheprim

ary

samplin

gunit

(PSU)are

inbrackets.

Theprim

ary

samplin

gunits

corresp

ondto

villa

ges

intheDHS.


The bottom panel of table 1 presents estimates of concentration indices with al-ternative attitudes to inequality to those underlying the standard concentration index.Setting the parameter v of the extended concentration index to 1.5 places relativelymore weight on those residing in wealthier households, while setting the parameter to 5gives more weight to the poorer observations. A value of 2 corresponds to the weightingimplicit in the standard concentration index and so would result in an estimate equalto that of C. We use the same values for the β parameter of the symmetric index,where β = 1.5 corresponds to the case where more weight is placed on the middle ofthe wealth distribution, while β = 5 corresponds to a case where the extremes of thewealth distribution are more heavily weighted.31

The indices are computed as follows:

conindex varlist [aweight=sampweight], rankvar(wealthindex) truezero v(#) ///cluster(PSU)

conindex varlist [aweight=sampweight], rankvar(wealthindex) truezero ///beta(#) cluster(PSU)

We emphasize that little can be learned from comparing extended indices computedfor different values of v.32 Rather, one might check whether an inequality orderingacross populations is robust to the choice of the value of v (β). If it is not, then aconclusion that a variable is more unequally distributed in one population than anotherneeds to be made conditional on an explicit attitude toward inequality.

Generalized extended and symmetric indices are computed by simply adding thegeneralized option to the command lines immediately above. As is clear from theestimates in table 1, these indices satisfy the mirror condition.33

conindex varlist [aweight= sampweight], rankvar(wealthindex) generalized ///truezero v(#) cluster(PSU)

conindex varlist [aweight= sampweight], rankvar(wealthindex) generalized ///truezero beta(#) cluster(PSU)

conindex allows estimates of all inequality indices to be compared across groupsdefined by a binary or categorical variable, and it tests the null of equality acrossgroups. This is done by including the compare() option. For example, to comparewealth-related inequality in infant mortality across urban and rural locations, we canuse

31. There is no particular reason to choose the same values for v and β. Our reason for doing so isthat both v and β can be interpreted as the distance between the weights given to the least andmost wealthy individual. See also sections 4.1 and 4.2.

32. For instance, while their examples do not occur in the illustration in this article,Erreygers, Clarke, and Van Ourti (2012) report some empirical examples where initially pro-poorinequality reverses into pro-rich inequality when v is increased. The same reversal might alsohappen for the symmetric index.

33. When v = β = 2, both indices are equal to the Erreygers index.


. conindex u1mr [aweight=sampweight], rankvar(wealthindex) erreygers bounded> limits(0 1) cluster(PSU) compare(urban)




For groups:

CI for group 1: urban = 0




CI for group 2: urban = 1




Test for stat. significant differences with Ho: diff=0 (assuming equal variances)

F-stat = .66985873 p-value= 0.4131

Test for stat. significant differences with Ho: diff=0 (large sample assumed)

Diff. = .00115295 Std. err. = .01240129 z-stat = 0.09 p-value = 0.9259

The index estimated from the combined sample is displayed first. Then, the group-specific estimates are given. There is a significant concentration of infant mortalityamong the least wealthy in both rural and urban locations. The point estimates suggestthat the degree of inequality is greatest in rural areas, but the difference with urbanareas is small. Both tests fail to reject the null hypothesis that the index is the same inrural and urban locations.

8 Concluding remarks

This article introduced the user-written Stata command conindex, which calculatesrank-dependent inequality indices while offering a great deal of flexibility in consideringmeasurement scale and alternative attitudes to inequality. Estimation and inferenceis via a regression approach that can allow for sampling design, misspecification andgrouped data and for testing for differences in inequality across populations.


Concentration indices are frequently used, particularly for the measurement of in-equality in health by socioeconomic status. The indices estimated for different regions,periods, or groups could also be included in regression analyses as control variables.We hope that the greatly reduced computational cost offered by conindex will affordresearchers the time to give greater consideration to their choice of index, ensuring thatthe one selected is appropriate for the scale of measurement and consistent with thenormative position they are prepared to defend.

9 Acknowledgments

Owen O’Donnell and Tom Van Ourti acknowledge support from the National Instituteon Aging under grant R01AG037398. We thank Ellen Van de Poel for assistance withthe DHS data. The usual caveats apply, and all remaining errors are our responsibility.

10 ReferencesBleichrodt, H., and E. van Doorslaer. 2006. A welfare economics foundation for healthinequality measurement. Journal of Health Economics 25: 945–957.

Bosmans, K. Forthcoming. Consistent comparisons of attainment and shortfall inequal-ity: A critical examination. Health Economics .

Chen, Z. A. 2007. concindc: Stata module to calculate concentration index with bothindividual and grouped data. Statistical Software Components S456802, Departmentof Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s456802.html.

Chow, G. C. 1960. Tests of equality between sets of coefficients in two linear regressions.Econometrica 28: 591–605.

Clarke, P. M., U.-G. Gerdtham, M. Johannesson, K. Bingefors, and L. Smith. 2002.On the measurement of relative and absolute income-related health inequality. SocialScience and Medicine 55: 1923–1928.

Clogg, C. C., E. Petkova, and A. Haritou. 1995. Statistical methods for comparingregression coefficients between models. American Journal of Sociology 100: 1261–1293.

Decoster, A., and E. Ooghe. 2003. Weighting with individuals, equivalent individualsor not weighting at all. Does it matter empirically? In Inequality, Welfare andPoverty: Theory and Measurement, ed. Y. Amiel and J. A. Bishop, vol. 9, 173–190.Amsterdam: JAI.

Edbert, U. 1997. Social welfare when needs differ: An axiomatic approach. Economica64: 233–244.

Erreygers, G. 2009a. Can a single indicator measure both attainment and shortfallinequality? Journal of Health Economics 28: 885–893.


. 2009b. Correcting the concentration index. Journal of Health Economics 28:504–515.

. 2009c. Correcting the concentration index: A reply to Wagstaff. Journal ofHealth Economics 28: 521–524.

Erreygers, G., P. Clarke, and T. Van Ourti. 2012. “Mirror, mirror, on the wall, who inthis land is fairest of all?”—Distributional sensitivity in the measurement of socioe-conomic inequality of health. Journal of Health Economics 31: 257–270.

Erreygers, G., and T. Van Ourti. 2011a. Putting the cart before the horse. A commenton Wagstaff on inequality measurement in the presence of binary variables. HealthEconomics 20: 1161–1165.

. 2011b. Measuring socioeconomic inequality in health, health care and healthfinancing by means of rank-dependent indices: A recipe for good practice. Journal ofHealth Economics 30: 685–694.

Filmer, D., and L. H. Pritchett. 2001. Estimating wealth effects without expendituredata—or tears: An application to educational enrollments in states of India. Demog-raphy 38: 115–132.

Jenkins, S. P. 1988. Calculating income distribution indices from micro-data. NationalTax Journal 41: 139–142.

. 1999. ineqdeco: Stata module to calculate inequality indices with decompositionby subgroup. Statistical Software Components S366002, Department of Economics,Boston College. https://ideas.repec.org/c/boc/bocode/s366002.html.

Jolliffe, D., and B. Krushelnytskyy. 1999. sg115: Bootstrap standard errors for indices ofinequality. Stata Technical Bulletin 51: 28–32. Reprinted in Stata Technical BulletinReprints, vol. 9, pp. 191–196. College Station, TX: Stata Press.

Kakwani, N., A. Wagstaff, and E. van Doorslaer. 1997. Socioeconomic inequalities inhealth: Measurement, computation, and statistical inference. Journal of Econometrics77: 87–103.

Kakwani, N. C. 1977. Measurement of tax progressivity: An international comparison.Economic Journal 87: 71–80.

. 1980. Income Inequality and Poverty: Methods of Estimation and Policy Ap-plications. New York: Oxford University Press.

Kjellsson, G., and U.-G. Gerdtham. 2013a. Lost in translation: Rethinking the in-equality equivalence criteria for bounded health variables. In Research on EconomicInequality, Volume 21: Health and Inequality, ed. P. R. Dias and O. O’Donnell, 3–32.Bingley, UK: Emerald.

. 2013b. On correcting the concentration index for binary variables. Journal ofHealth Economics 32: 659–670.


Kolm, S.-C. 1976. Unequal inequalities. I. Journal of Economic Theory 12: 416–442.

Lambert, P., and B. Zheng. 2011. On the consistent measurement of attainment andshortfall inequality. Journal of Health Economics 30: 214–219.

Lopez-Feldman, A. 2005. descogini: Stata module to perform Gini decomposition byincome source. Statistical Software Components S456001, Department of Economics,Boston College. https://ideas.repec.org/c/boc/bocode/s456001.html.

O’Donnell, O., E. van Doorslaer, A. Wagstaff, and M. Lindelow. 2008. Analyzing HealthEquity Using Household Survey Data: A Guide to Techniques and Their Implemen-tation. Washington, DC: The International Bank for Reconstruction and Develop-ment/The World Bank.

Pereira, J. A. 1998. Inequality in infant mortality in Portugal, 1971–1991. In De-velopments in Health Economics and Public Policy, Volume 6: Health, the MedicalProfession, and Regulation, ed. P. Zweifel, 75–93. Dordrecht: Kluwer Academic Pub-lishers.

Shorrocks, A. F. 1983. Ranking income distributions. Econometrica 50: 3–17.

Townsend, P., and N. Davidson, eds. 1982. Inequalities in Health: The Black Report.Harmondsworth: Penguin.

van Kerm, P., and S. P. Jenkins. 2007. Software Updates: Generalized Lorenz curvesand related graphs: Update for Stata 7. Stata Journal 7: 280.

Van Ourti, T. 2004. Measuring horizontal inequity in Belgian health care using aGaussian random effects two part count data model. Health Economics 13: 705–724.

Wagstaff, A. 2002. Inequality aversion, health inequalities and health achievement.Journal of Health Economics 21: 627–641.

. 2005. The bounds of the concentration index when the variable of interest isbinary, with an application to immunization inequality. Health Economics 14: 429–432.

. 2009. Correcting the concentration index: A comment. Journal of HealthEconomics 28: 516–520.

. 2011a. The concentration index of a binary outcome revisited. Health Economics20: 1155–1160.

. 2011b. Reply to Guido Erreygers and Tom Van Ourti’s comment on ‘Theconcentration index of a binary outcome revisited’. Health Economics 20: 1166–1168.

Wagstaff, A., P. Paci, and E. van Doorslaer. 1991. On the measurement of inequalitiesin health. Social Science and Medicine 33: 545–557.

Yitzhaki, S. 1983. On an extension of the Gini inequality index. International EconomicReview 24: 617–628.


About the authors

Owen O’Donnell is a professor of applied economics in the Erasmus School of Economics atErasmus University Rotterdam, a research fellow of the Tinbergen Institute, and an associateprofessor at the University of Macedonia (Greece).

Stephen O’Neill (corresponding author) is a research fellow in health economics in the Depart-ment of Health Services Research and Policy at the London School of Hygiene and TropicalMedicine.

Tom Van Ourti is a professor of applied health economics in the Erasmus School of Economicsat Erasmus University Rotterdam and a research fellow of the Tinbergen Institute.

Brendan Walsh is a research fellow in health economics in the School of Health Sciences andthe City Health Economics Centre at City University London.


Estimating polling accuracy in multipartyelections using surveybias

Kai ArzheimerJohannes Gutenberg University

Mainz, [email protected]

Jocelyn EvansUniversity of Leeds

Leeds, UK

[email protected]

Abstract. Any rigorous discussion of bias in opinion surveys requires a scalarmeasure of survey accuracy. Martin, Traugott, and Kennedy (2005, Public Opin-ion Quarterly 69: 342–369) propose such a measure A for the two-party case, andArzheimer and Evans (2014, Political Analysis 22: 31–44) demonstrate how mea-sures A′

i, B, and Bw for the more common multiparty case can be derived. Wedescribe the commands surveybias, surveybiasi, and surveybiasseries, whichenable the fast computation of these binomial and multinomial measures of biasin opinion surveys. While the examples are based on pre-election surveys, themethodology applies to any multinomial variable whose true distribution in thepopulation is known (for example, through census data).

Keywords: st0428, surveybias, surveybiasi, surveybiasseries, multinomial vari-ables, surveys, survey bias

1 Introduction

Pre-election polls are a vital feature of political life in all democratic societies. Theyinform the strategic choices of politicians and potential donors and are at the heart of themedia coverage during campaigns. But their predictive qualities are often the subject ofheated debates. Pre-election polls suffer not only from sampling error, social desirabilityeffects, bandwagon and underdog effects, and genuine swings in the political moodbut also from “house effects”: pollster-specific bias introduced by particular samplingframes, sponsorship effects, secret weighting formulas, or even political allegiances ofthe firms involved (see Weisberg [2005] for a comprehensive and systematic account ofpotential sources of bias).

Any rigorous discussion of bias in opinion surveys presupposes a scalar measureof their accuracy. In a two-party setting, assessing the quality of pre-election surveysis relatively straightforward once the election results are in. Building on early workby Mosteller et al. (1949), Martin, Traugott, and Kennedy (2005) have developed anaccuracy measure A that is based on odds ratios and therefore comparable across surveystaken at different times and even during different campaigns.

Most democracies, however, feature multiparty systems. In these instances, calcu-lating A involves case-specific decisions that will quickly lead to inconsistencies, leavingapplied researchers with an awkward choice between ad hockery and eyeballing theirdata.


140 Polling accuracy in multiparty elections

In a recent bid to overcome this problem, Arzheimer and Evans (2014) have shownhow the Martin, Traugott, and Kennedy (2005) approach can be generalized to yielda multinomial accuracy measure B that can be applied to the multiparty case. Inthis article, we summarize the main statistical properties of B and then describe theuser-written commands surveybias, surveybiasi, and surveybiasseries, which es-timate B and several related measures, along with their variances and covariances. Wealso introduce a new approach to the estimation of the variance–covariance matricesthat is simpler and considerably faster than the numerical approximation outlined inArzheimer and Evans (2014).

While B and the user-written commands are presented here in the context of pre-election polls, they are applicable to any survey that samples a multinomial variablewhose distribution in the population is known. The latest versions of surveybias,surveybiasi, and surveybiasseries can always be found on the Statistical SoftwareComponents (SSC) archive and can be installed by typing ssc install surveybias,

replace.

2 Assessing bias in surveys

2.1 The Martin, Traugott, and Kennedy (2005) approach

Mosteller et al. (1949) were among the first scholars who systematically studied bias inpre-election polls. In their study of the 1948 polling disaster where polls famously gavevictory to Thomas Dewey over Harry S. Truman, they developed eight different scalarmeasures that aimed at describing the extent of discrepancies between polls and electionresults. Some of these measures were widely used in the industry well into the 1990s,although they were problematic in several ways (Martin, Traugott, and Kennedy 2005,344–377).

These earlier measures were largely based on percentage point differences, whereasMartin, Traugott, and Kennedy (2005, 350) propose to measure accuracy as a loggedodds-ratio. More specifically, they define

A = ln

(rdRD

)where R/D is the ratio of actual votes for the Republicans and Democrats, respectively,and r/d is the ratio of support for the two parties in a given pre-election poll.

Looking at the odds of the two major parties takes undecided respondents andnonvoters out of the equation consistently, while calculating the ratio of the odds focuseson whether the advantage of the winning party was adequately reflected in the survey.Finally, taking the log makes the measure symmetric around zero (no bias).

A is a huge improvement over the older measures. But its core advantage—isolatingthe relative strength of the two major parties—becomes a weakness in multiparty elec-tions. While support for third parties is indeed negligible in the United States, most

K. Arzheimer and J. Evans 141

other democracies feature three or more relevant parties, and coalition government (of-ten excluding the biggest party) is the rule. In cases of extreme multipartyism, theremay even be more than two relatively large parties, rendering the concept of “majorparties” a pointless one.

Martin, Traugott, and Kennedy (2005) as well as Durand (2008) hint at possiblesolutions for this problem, but they do not develop a full generalization of A. The nextsection presents a brief outline of the approach taken by Arzheimer and Evans (2014),who also provide a full derivation and simulation studies and discuss potential problemsand alternative approaches.

2.2 A generalization of the Martin, Traugott, and Kennedy (2005)approach for the multiparty case

To generalize the Martin, Traugott, and Kennedy (2005) measure for the k party case,Arzheimer and Evans (2014, sec. 2.2) define p as a vector of proportions p1, p2, . . . , pkof support for party i in a given poll and v as a vector of proportions v1, v2, . . . , vkof voters for the respective party in the election. Depending on the research question,(self-declared) nonvoters can be either excluded from the analysis following the leadof Martin, Traugott, and Kennedy (2005) or coded separately as a pseudoparty. Usingthis new notation, we see that A becomes

A = ln

(p1

p2

v1

v2

)= ln

(p1

1−p1

v1

1−v1

)(1)

in the two-party case. From (1), a straightforward definition of a “party-specific” mea-sure A′

i of polling accuracy follows. For the ith of k parties, A′i is

A′i = ln

(pi

1−pi

vi

1−vi

)= ln

⎛⎝ pi∑kj=1 pj

vi∑kj=1 vj

⎞⎠ for j �= i (2)

Let a be the vector of k party-specific measures of bias A′1, A

′2, . . . , A

′k.

1

A′i retains the interpretation of A, and for the two-party case, the absolute values

of A′i and A are identical. Positive values indicate that a poll overestimates support for

party i, whereas negative numbers show that the poll is biased against i. If the poll isin perfect agreement with the result of the actual election, all A′

is are zero.

Whereas A′i captures party-specific bias, applied researchers will also be interested

in overall measures of (in)accuracy. Arzheimer and Evans (2014) therefore propose acomposite measureB, which is simply the average of the absolute values of the individualA′

is, as well as a weighed alternative measure Bw, which additionally considers theparties’ respective electoral shares.

1. The prime was dropped from the vector because it could be confused with the symbol for transpo-sition.


B =

∑ki=1 |A′

i|k

Bw =k∑

i=1

vi × |A′i|

Again, B and |A| are identical for the two-party case.

Taking absolute values before averaging is necessary because positive and negativebias components would otherwise cancel each other out, but it results in some upwardbias because B and Bw are distributed folded normal. Simulations show, however, thatthis bias is small across a wide range of applied settings (Arzheimer and Evans 2014,sec. 2.4).

Calculating these measures of bias for a given poll is a simple algebraic exercise.But survey samples can rarely be treated as fixed; therefore, standard errors for A′

isare required. To see how an analytical estimator of the A′

is and their standard errorscan be derived, consider the answers to the voting intention question in a pre-electionsurvey, which can be modeled as draws from a multinomial distribution of stated pref-erences, their divergence from the observed electoral preferences in the ballot being thesystematic bias we model. For k different categories of voting intentions, this distri-bution can be described by parameter θ = θ1, θ2, . . . , θk.

2 In the absence of anyexplanatory variables, the sample proportions p1, p2, . . . , pk provide the maximumlikelihood estimate (MLE) for θ: θ1 = p1, θ2 = p2, . . . , θk = pk.

3 The asymptoticvariance–covariance matrix of the estimates being

Σθ =

⎛⎜⎜⎜⎜⎜⎜⎜⎝

θ1(1−θ1)n − θ1θ2

n · · · − θ1θkn

− θ2θ1n

θ2(1−θ2)n · · · − θ2θk

n

......

. . ....

− θkθ1n − θkθ2

n · · · θk(1−θk)n

⎞⎟⎟⎟⎟⎟⎟⎟⎠the probability density Prθ(x) of the estimates is asymptotically multivariate normal:

Prθ(x) =1√

(2π)k|Σθ|exp

{−1

2(x− θ)TΣ−1

θ(x− θ)

}

Recalling the definition of a as the vector of k party-specific measures of bias, wesee that (2) can then be interpreted as a function r(x) = y that maps points in the

2. Strictly speaking, the distribution can be characterized by a vector of length k − 1 because theprobabilities must sum to unity.

3. This has an important implication: because any A′i depends on vi (which is a constant) and pi [see

equation (2)] and because pi is the MLE for θi, calculating A′i according to (2) gives the MLE for

bias with respect to party i in a given survey.


parameter space of θ onto the parameter space of a. Because this mapping is one toone, a corresponding inverse function r−1(y) = x maps a back to θ. It can be found byrearranging (2):

pi =vie

A′i

vieA′i − vi + 1

Using the change of variables formula approach, we see that the probability densityof Pra(y) (that is, the multivariate normal distribution of the estimates for a) is givenby

Pra(y) = Prθ{r−1(y)}det (Jr−1)

where Jr−1 is the Jacobian matrix of first partial derivatives of the inverse function.

Instead of integrating over this distribution to find its variance–covariance matrix,we can approximate Σa by pre- and postmultiplying Σθ with the Jacobian of r(x) [thatis, equation (2)]:

Σa ≈ JrΣθJrT

This approximation is known to work well for many nonlinear transformations suchas r(x) (Tellinghuisen 2001). Because the calculation of A′

i involves only a single em-pirical quantity (pi) and a single constant (vi), Jr has a particularly simple form. Eachdiagonal element ji,i equals 1/{pi × (1− pi)}, and all off-diagonal elements are zero.

Approximating standard errors for B and Bw is also possible but will be misleadingbecause their sampling distributions are based on the (weighted) sum of a multivari-ate folded-normal distribution (Arzheimer and Evans 2014). Tests of the null hypoth-esis of no overall bias should thus be based solely on χ2 or G2 goodness-of-fit tests(Cressie and Read 1989), which give the probability of p’s deviation from v. Bothstatistics and the associated p-values are calculated by surveybias, assuming that thedata come from a simple random probability sample. For complex variance estima-tors, surveybias instead reports the result of the equivalent Wald test (Greene 2012,155–161) that all A′

is are jointly zero. See section 3.4 for examples.

Calculating a from the data and using the analytical approximation to estimateits variance–covariance matrix is fast even when the number of categories is large. Itis therefore the default option in recent versions of surveybias. If one wants to usepweights or complex variance estimators such as the bootstrap, jackknife, survey, or theclustered sandwich estimator, a second method is available. This second approach usesStata’s proportion ([R] proportion) command, which estimates θ and Σθ from thedata while accounting for complex data structures and weights.4 a is then calculatedfrom θ with (2), and the variance–covariance matrix is once more approximated by pre-and postmultiplying with the Jacobian. This second approach is somewhat slower, but

4. We are grateful to the anonymous reviewer, who pointed us toward proportion.


in most applications, the difference will be negligible. For testing purposes, the use ofproportion can be enforced with the prop option.5

The package consists of three separate ado-files. The main command is surveybias.It computes the A′

is, B, and Bw as well as standard errors and statistical tests from avariable held in memory and additional information about the true distribution of therespective variable in the population. surveybias is complemented by surveybiasi,an immediate command that makes these calculations based on information typed asarguments on the command line. By using surveybiasi, one can produce estimates ofpolling accuracy from published margins when the raw data are not available.

surveybiasseries takes this idea one step further. In the aftermath of an election,researchers will often want to compare polling accuracy across time and firms, butcommercial pollsters tend to make their raw data available for secondary analysis onlyafter some cooling-off period, if at all. surveybiasseries calculates accuracy measuresfrom a dataset of published margins, where each row represents the headline findingsfrom a single survey. surveybiasseries stores the accuracy measures as new variablesin the dataset so that it is easy to model polling accuracy as a function of variables suchas duration and timing of field work, sample size, or polling company.

3 The surveybias command

3.1 Syntax

surveybias varname[if] [

in] [

weight], popvalues(numlist)

[verbose prop

vce(cluster clustvar | bootstrap | jackknife) svy subpop(varname)

level(#)]

bootstrap, by, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix com-mands.

fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.

3.2 Description

surveybias compares the distribution of a categorical variable varname in the datasetwith its true distribution in the population. This true distribution is submitted to thecommand as a numlist in popvalues(numlist).

5. Previous versions of the package exploited the relationship between a set of A′is and the coefficients

of a corresponding empty (constant only) multinomial logit model. They relied on a numericalmethod to approximate Σa (Arzheimer and Evans 2014), calling mlogit ([R] mlogit) and combin-ing its results with nlcom ([R] nlcom). As of version 1.4, this code has been removed from thepackage. For compatibility with older scripts, numerical has been retained as a hidden option thatnow activates estimation with proportion, followed by the Jacobian approximation. Apart from arounding error, results are identical to the old numerical method, but computation is much faster.


The values of varname must be strictly positive integers, but there are no otherrestrictions such as consecutive numbering placed on the values. Thus one should ensurethat the order of categories in varname and popvalues() matches. If in doubt, oneshould use either the verbose option or the inspect command ([D] inspect).

Many countries publish electoral counts for subnational units (provinces, regions,etc.) or even for subgroups of the electorate (women, senior citizens, etc.). If therespective identifying variables are present in the dataset, these subsamples can beselected via the if and in qualifiers so that the accuracy of group-specific predictionsmay be assessed. Standard errors will be based on the size of the reduced sample. Whenusing the survey estimator, one should specify subpopulations with the subpop() optioninstead.

Typed without arguments, surveybias replays the results of a previous surveybiascomputation. One can alter the desired confidence level while replaying the results withthe level() option.

3.3 Options

popvalues(numlist) is required for specifying the true distribution of varname in thepopulation. The full numlist syntax is supported, although normally users will enterjust an ordinary list of values. Its elements may be specified as counts, percentages,or relative frequencies because the list is internally rescaled so that its elements sumup to unity.

verbose displays the numeric values of varname with their labels and frequencies, mak-ing it easier to verify that the sequences of population and sample values match.

prop manually switches to the method relying on proportion ([R] proportion).

vce(cluster clustvar | bootstrap | jackknife) requests complex variance estimators.

svy instructs surveybias to respect survey characteristics of the data. This requiresthat the survey design variables be identified using svyset ([SVY] svyset).

subpop(varname) identifies the subpopulation variable for use with the survey estima-tor.

level(#) specifies the confidence level, as a percentage, for confidence intervals; see[R] level.


3.4 Remarks and examples

surveybias estimates k A′is, B, and Bw for variables with two or more discrete cate-

gories.6

Example

. use onefrenchsurvey

. surveybias vote, popvalues(28.6 27.18 17.9 9.13 11.1 2.31 1.15 1.79 0.8)

vote Coef. Std. Err. z P>|z| [95% Conf. Interval]

A´Hollande -.0757639 .0697397 -1.09 0.277 -.2124512 .0609233Sarkozy .0477294 .0689193 0.69 0.489 -.0873499 .1828087

LePen -.0559812 .0823209 -0.68 0.496 -.2173271 .1053648Bayrou .3057213 .0953504 3.21 0.001 .1188379 .4926047

Melenchon -.0058251 .0988715 -0.06 0.953 -.1996096 .1879594Joly -.0913924 .2154899 -0.42 0.671 -.5137449 .33096

Poutou -.8802476 .4482915 -1.96 0.050 -1.758883 -.0016125DupontAigna -.5349338 .3031171 -1.76 0.078 -1.129032 .0591648

other .1841789 .3177577 0.58 0.562 -.4386147 .8069724

BB .2424193 . . . . .

B_w .0965423 . . . . .

Ho: no biasDegrees of freedom: 8Chi-square (Pearson) = 18.695468Pr (Pearson) = .01657592Chi-square (LR) = 19.540804Pr (LR) = .01222022

Ten candidates ran in the first round of the French presidential election in 2012, butonly two of them would progress to the runoff. While surveybias can handle variableswith many categories, requesting estimates for small parties increases the computationalburden, may lead to numerically unstable estimates, and is often of little substantiveinterest. Therefore, in onefrenchsurvey.dta—a poll taken a couple of weeks beforethe actual election—support for the two lowest-ranking candidates has been recoded toa generic “other” category. The first-round results, which serve as a yardstick for theaccuracy of the poll, are submitted in popvalues().

The top panel lists the A′is for the first eight candidates and the “other” category

alongside their standard errors, z- and p-values, and confidence intervals. By conven-tional standards (p ≤ 0.05), only two of these values are significantly different from 0:support for Francois Bayrou was overestimated (A′

4 = 0.31), while support for PhilippePoutou was underestimated (A′

7 = −0.88).

6. surveybias will issue a warning if the number of categories exceeds 12 but will proceed nonetheless.Previous versions of surveybias restricted the permissible number of categories to 12 or fewer.


Poutou was the little-known candidate for the tiny New Anticapitalist Party. Whilethe odds of his support were underestimated by a considerable margin, the case ofBayrou is more interesting. Bayrou, a center–right candidate, ran in the previous2007 election and came in third with a respectable result of almost 19%, surprisingmany political observers. In 2012, when he ran for a new party that he had foundedimmediately after the 2007 election, his vote effectively halved. But this is not fullyreflected in the poll, which overestimates the odds of his support by roughly a third[exp(0.31) ≈ 1.35]. This could be due to (misguided) bandwagon effects, sampling bias,or political weighting of the poll by the company.

The lower panel of the output lists B and Bw. B, the unweighted average of theA′

is absolute values, is much higher than Bw. This is because the estimates for all themajor candidates with the exception of Bayrou were reasonably good. While support forPoutou and also for Dupont-Aignan was underestimated by large factors, Bw heavilydiscounts these differences because they are of little practical relevance unless one isinterested specifically in splinter parties.

As outlined in section 2.2, Bs (and Bws) sampling distribution is nonnormal. There-fore, surveybias performs additional χ2 tests based on the Pearson and the likelihood-ratio formulas, whose results are listed below the main table. Both tests agree that thenull hypothesis of no bias is indeed rejected by the data.

surveybias is by no means restricted to analyzing electoral behavior. It can beapplied to any categorical variable measured in a survey whose distribution in the pop-ulation is known. Levels of educational attainment are a case in point.

Example

Various reforms and variations at the state level not withstanding, there is a stricthierarchy of schools and school-leaving qualifications in Germany. Historically, mostpupils would leave school after 9 years and would be awarded a “Hauptschulabschluss”,whereas a smaller proportion would proceed to the “Realschulabschluss” (awarded after10 years) or even the “Abitur” (the qualification required to enter German universities,awarded after 12 or 13 years). Over the last couple of decades, however, the number ofpupils educated to the Abitur level has risen sharply. The true distribution of school-leaving qualifications in the population is known from census data and so can serve asa yardstick for assessing bias.

Normally, more educated voters are more likely to participate in opinion surveysand are therefore overrepresented in the survey. But this pattern is not reflected in thepre-election wave of the German Longitudinal Election Study (GLES).


. use gles-preelection

. surveybias educ, popvalues(4 36.1 30.5 29)

educ Coef. Std. Err. z P>|z| [95% Conf. Interval]

A´noqualifica -.5113485 .1461372 -3.50 0.000 -.7977721 -.2249249

upto9yrs .0115364 .0469027 0.25 0.806 -.0803912 .10346410yrs .308362 .0466371 6.61 0.000 .2169549 .3997691

12yrs+ -.290088 .0532528 -5.45 0.000 -.3944617 -.1857144

BB .2803337 . . . . .

B_w .203609 . . . . .

Ho: no biasDegrees of freedom: 3Chi-square (Pearson) = 63.802082Pr (Pearson) = 9.048e-14Chi-square (LR) = 65.206022Pr (LR) = 4.532e-14

On the contrary, respondents with 12 or more years of schooling are clearly under-represented, while there is no appreciable bias for respondents with 9 years of schooling,and respondents with 10 years of schooling are actually overrepresented. Only themisrepresentation of the small group of school dropouts is in line with expectations.

There are some possible reasons for this unusual type of bias. One is the generationalgap in educational attainment. Younger voters are much more likely to hold Abiturqualifications and are also more mobile and less likely to have a landline connection;hence, it is more difficult for interviewers to contact them.

But another plausible and perhaps more interesting reason is the complex design ofthe GLES: the GLES is a multistage survey that deliberately oversamples respondentsfrom the former East Germany (GDR) to account for persistent attitudinal, social, andeconomic differences between Germany’s Eastern and Western regions. In the GDR,the Communists phased out the Hauptschulabschluss and instead promoted a 10-yearcurriculum. At the same time, they limited access to the Abitur. Thus the distributionof school-leaving qualifications in the former East Germany still differs markedly fromthe West.

surveybias supports Stata’s survey estimator, so it is possible to use the weightssupplied by the GLES team as well as the information on the primary sampling unit andstratification to see whether this reduces the apparent bias.


. quietly svyset vnvpoint [pweight=w_ipfges_1], strata(distost)

. surveybias educ, popvalues(4 36.1 30.5 29) svyUsing survey characteristics of your data

educ Coef. Std. Err. z P>|z| [95% Conf. Interval]

A´noqualifica -.2203508 .2777376 -0.79 0.428 -.7647065 .3240049

upto9yrs .0665091 .0780089 0.85 0.394 -.0863856 .219403810yrs .0202821 .0657158 0.31 0.758 -.1085185 .1490827

12yrs+ -.0596029 .0943501 -0.63 0.528 -.2445258 .12532

BB .0916863 . . . . .

B_w .0565208 . . . . .

Ho: no biasDegrees of freedom: 3Chi-square (Wald) = 1.3289546Pr (Wald) = .72226926

Incorporating the information on the design of the survey massively reduces theestimates for bias. The A′

is for the three major groups are now small, while the A′i is

roughly halved, and none of them differs significantly from zero. B, the estimate for theoverall bias, drops to one-third of the original figure of 0.28, while its weighted version,Bw, is reduced even further from 0.20 to 0.06 because it considers the size of the “noqualification” group.

With complex variance estimators, simple goodness-of-fit tests are not appropriate.They are replaced by the equivalent Wald test of the null hypothesis that all A′

is (and,by implication, the overall measures B and Bw) jointly equal zero. At three degrees offreedom, this hypothesis cannot be rejected.

4 The surveybiasi command

4.1 Syntax

surveybiasi, popvalues(numlist) samplevalues(numlist) n(#)[prop

level(#)]

4.2 Description

surveybiasi is an immediate command that compares the distribution of a categoricalvariable in a survey with its true distribution in the population. Both distributions needto be specified with the popvalues() and samplevalues() options.


4.3 Options

popvalues(numlist) is required for specifying the true distribution of varname in thepopulation. The full numlist syntax is supported, which may be convenient forhypothetical calculations, although users will normally enter just an ordinary listof values. Its elements may be specified in terms of counts, of percentages, or ofrelative frequencies because the list is internally rescaled so that its elements sumup to unity.

samplevalues(numlist) is required for specifying the true distribution of varname inthe sample. The full numlist syntax is supported. Its elements may be specified interms of counts, of percentages, or of relative frequencies because the list is internallyrescaled so that its elements sum up to unity.

n(#) is required for specifying the sample size.


level(#) specifies the confidence level, as a percentage, for confidence intervals; see[R] level.


surveybiasi estimates k A′is, B, and Bw for categorical variables when raw data are

not available.

Example

A week before the 2012 election for the U.S. House of Representatives, 563 likelyvoters were polled for CBS and The New York Times. Of these, 46% said they wouldvote for the Republican candidate in their district; 48% said they would vote for theDemocratic candidate. Another 3% said it would depend, and another 2% said theywere unsure or refused to answer the question. In the example, these 5% are treated as“other”. Because of a rounding error, the numbers do not exactly add up to 100, butsurveybiasi takes care of the necessary rescaling.

In the actual election, the Republicans won 47.6%, and the Democrats won 48.8% ofthe popular vote, with the rest going to third-party candidates. Given the small samplesize and the close match between survey and electoral counts, it is not surprising thatthere is no evidence for statistically or substantively significant bias in this poll.


. surveybiasi, popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)

catvar Coef. Std. Err. z P>|z| [95% Conf. Interval]

A´1 -.0455767 .0845014 -0.54 0.590 -.2111965 .12004312 -.0126154 .0843287 -0.15 0.881 -.1778965 .15266583 .3537155 .1924563 1.84 0.066 -.0234919 .7309229

BB .1373025 . . . . .

B_w .0405846 . . . . .

Ho: no biasDegrees of freedom: 2Chi-square (Pearson) = 3.4542892Pr (Pearson) = .17779136Chi-square (LR) = 3.0856308Pr (LR) = .21377838

An alternative approach is to follow Martin, Traugott, and Kennedy (2005) and ig-nore third-party voters, undecided respondents, and refusals. This requires minimal ad-justments: n is now 535 because the analytical sample size is reduced by 5%, while thefigures representing the “other” category can simply be dropped. Again surveybiasi

internally rescales the values accordingly.

. surveybiasi, popvalues(47.6 48.8) samplevalues(46 48) n(535)

catvar Coef. Std. Err. z P>|z| [95% Conf. Interval]

A´1 -.0176621 .0864871 -0.20 0.838 -.1871738 .15184952 .0176621 .0864871 0.20 0.838 -.1518495 .1871738

BB .0176621 . . . . .

B_w .0176621 . . . . .

Ho: no biasDegrees of freedom: 1Chi-square (Pearson) = .0417056Pr (Pearson) = .83818198Chi-square (LR) = .0417092Pr (LR) = .83817509

Under this two-party scenario, A′1 is identical to Martin, Traugott, and Kennedy’s

(2005) original A (and all other estimates are identical to As absolute value). Its negativesign points to the (tiny) anti-Republican bias in this poll, which is of course even lesssignificant than in the previous example.


5 The surveybiasseries command

5.1 Syntax

surveybiasseries[if] [

in], samplevariables(varlist) nvar(varname)

generate(newvarstub)[missasnull popvalues(numlist)

popvariables(varlist) prop descriptivenames]

5.2 Description

surveybiasseries estimates accuracy measures from a dataset of survey margins. Eachobservation represents a single poll. For each survey, the distribution of some categor-ical variable is given by a series of variables specified in samplevariables(). Thedistribution can be expressed in terms of absolute frequencies, relative frequencies, orpercentages. Information on the true distribution can be specified either directly on thecommand line with the popvalues() option or as another series of variables specified inthe popvariables() option. Either popvalues() or popvariables() must be given,but not both. Moreover, another variable whose name is passed to the command in thenvar() option must hold the respective sample sizes. The if and in qualifiers can beused to restrict the analysis to a subgroup of surveys.

The command leaves behind a series of new variables whose names are based onthe stub submitted in the generate() option, which is required. In these variables,surveybiasseries stores the complete information that would be generated by theequivalent series of surveybias commands: B, Bw, one A′

i per category, standarderrors for each of these, Pearson and likelihood-based χ2 values, and the accompanying p-values. With the descriptivenames option, the names of the category-specific variablesare derived from the names of the respective sample variables; otherwise, this is just aconsecutive number.

5.3 Options

samplevariables(varlist) is required for specifying the information on the observeddistribution of some categorical variable in a series of samples.

nvar(varname) is required for specifying the variable that holds the information on thenumber of observations in each sample.

generate(newvarstub) is required for specifying the stub for the new variables to becreated, which will hold the A′

is, B, Bw, and other statistics.

missasnull requests that any missing values in the sample and population variables berecoded to zero. While this is often handy, missasnull changes the original dataand is not reversible.


popvalues(numlist) specifies the true distribution of varname in the population. Thefull numlist syntax is supported, which may be convenient for hypothetical calcula-tions, although users will normally enter just an ordinary list of values. Its elementsmay be specified in terms of counts, of percentages, or of relative frequencies becausethe list is internally rescaled so that its elements sum up to unity.

popvariables(varlist) specifies the information on the true distribution of some cate-gorical variable in the population.


descriptivenames generates descriptive names for variables that will hold the As andBs from the names of the sample variables.


surveybiasseries estimates k A′is, B, and Bw for categorical variables in a series of

surveys.

Example

The website http://www.wahlrecht.de publishes a wealth of information on Germanelectoral law, including a series of margins from pre-election polls going back to thelate 1990s. Building on this remarkable resource, our dataset of German pre-electionsurveys contains margins for 152 nationwide polls that were conducted by 6 leadingpollsters between January 2013 (when the exact date of the Parliamentary electionwas agreed on between the parties and then officially announced) and mid-September(the week immediately before the election). Two main parties, the Christian Democrats(CDU/CSU) and the Social Democrats (SPD), contested this election. The smaller Liberalparty (FDP), Greens (Die Gruenen), a left-wing party (Die Linke), and a range of smallparties, coded here as “other”, ran as well.

Obviously, one cannot expect the early polls to provide accurate predictions of thefinal result because voting intentions would not be firm so far from election day; pollsconducted during spring and early summer will reflect the waxing and waning popularityof parties that is due to events and campaign effects. But these effects can be modeled,and once they are considered, the relatively large number of cases (individual polls)enables one to assess the general reliability of individual polling houses and to identifyany party-specific polling problems that similarly afflict all pollsters.

surveybiasseries calculates the accuracy measures for the 152 surveys with asingle command.7

7. If the number of surveys is large, coarse parallelization with the user-written parallel commandis convenient and provides gains in speed that are nearly proportional to the number of cores.


. use german-pre-election-polls

. quietly surveybiasseries, samplevariables(cducsu spd linke gruene fdp other)> nvar(n) popvalues(41.5 25.7 8.6 8.4 4.8 10.9) generate(g) descriptivenames

Once the accuracy measures have been estimated, assessing and modeling bias is straight-forward.

. summarize gaprime* gbw, separator(6)

Variable Obs Mean Std. Dev. Min Max

gaprimecdu~u 152 -.0261784 .0811378 -.1905943 .3416225gaprimespd 152 -.001119 .1027209 -.2053933 .3520881

gaprimelinke 152 -.1953139 .18804 -.8156653 .165164gaprimegru~e 152 .5237401 .1552392 .0744723 .80248

gaprimefdp 152 -.0958617 .2510729 -.9055073 .3996237gaprimeother 152 -.3627553 .243908 -1.376225 .1074434

gbw 152 .1586219 .0616449 .0630823 .389022

On average, the polls measured support for the two major parties with little bias.Moreover, the A′s for these parties also have small standard deviations, which meansthat their final vote share was consistently well predicted.

Bias is stronger for the smaller parties, whose measured support displayed morefluctuation throughout the campaign. This is most pronounced for the Greens, whosefinal result was considerably and rather consistently overestimated by the polls. Thisis not necessarily a sign of any methodological problems: after a strong start in thecampaign, the party presented a platform that called for a comprehensive ecologicaltax hike. This proved almost universally unpopular, and the party’s support declinedmarkedly. Thus the relatively high figure seems to reflect some true change of opinionover the course of the campaign.

Finally, the last line shows the average estimate of Bw, which seems rather highcompared with the French case (see section 3.4). However, the French surveys weretaken the week preceding the election, whereas the German surveys cover a much longertime span.

A simple linear model of overall bias can be constructed by regressing Bw on timeto the election (measured in days) and a set of dummy variables that represent the sixdifferent major polling companies. Because Bw is biased away from zero and becausethis bias is more pronounced in smaller samples, sample size should also be included inthe model.


. regress gbw timetoelection n company1-company5

Source SS df MS Number of obs = 152F(7, 144) = 60.00

Model .427316633 7 .061045233 Prob > F = 0.0000Residual .146498119 144 .001017348 R-squared = 0.7447

Adj R-squared = 0.7323Total .573814752 151 .003800098 Root MSE = .0319

gbw Coef. Std. Err. t P>|t| [95% Conf. Interval]

timetoelection .0004322 .0000335 12.92 0.000 .0003661 .0004983n 9.51e-06 9.81e-06 0.97 0.334 -9.89e-06 .0000289

company1 -.1089721 .0120095 -9.07 0.000 -.1327098 -.0852345company2 -.1265327 .0093036 -13.60 0.000 -.1449221 -.1081434company3 -.1494146 .0121243 -12.32 0.000 -.1733792 -.1254501company4 -.1524052 .0142219 -10.72 0.000 -.1805159 -.1242945company5 -.1306604 .0133531 -9.79 0.000 -.1570537 -.104267

_cons .2081095 .0154356 13.48 0.000 .1775998 .2386192

The results show that sample size does not have any effect on bias8 and that biasdeclines over time. More importantly, there are remarkable differences between thecompanies: each of the first five companies does significantly better (produces lessbiased results) than “Forschungsgruppe Wahlen” (the reference category) when time toelection is controlled for. The average difference in the expected Bw is about 0.14.

This is not necessarily what one would expect. Forschungsgruppe Wahlen is a highlyrespected company with roots in academia, and its polls are conducted on behalf of oneof Germany’s biggest public broadcasters. But unlike the other pollsters, Forschungs-gruppe releases two different series of headline results: their raw (although presumablydesign-weighted) data and a model-based “projection”, which factors in party identifi-cation and long-term trends. In our dataset, we have used the former. The fact thatthe other companies are consistently closer to the final result than Forschungsgruppesuggests that they do not publish raw survey results but rather the product of somemodel-based weighting—something that they do not publicize.

Given the length of the observation span, a linear trend for time is somewhat im-plausible and could be misleading. Using the mfp command, we therefore ran a secondmodel that applies a fractional polynomial transformation to the time variable (outputnot shown). The (−2 0.5) transformation provides the best fit and results in a func-tional form that makes substantive sense: bias declines nearly linearly over most of thecampaign, then drops quickly over the last few days immediately before the election(see figure 1). But finding a more adequate functional form does not substantively alterthe estimates of the house effects: Forschungsgruppe performs somewhat worse thanthe other five.

8. Several plausible nonlinear transformations (square, log, etc.) yield virtually identical results.


0.1

.2.3

.4P

artia

l pre

dict

or+

resi

dual

of g

bw

0100200300timetoelection

Fractional Polynomial (−2 .5),adjusted for covariates

Figure 1. Effect of time to election on overall bias (Bw)

A similar modeling strategy can also be applied to party-specific bias. In recentyears, pollsters have clashed in the media and even in court over the issue of measur-ing support for the Social Democrats. Forsa has accused Infratest dimap of overre-porting SPD support for commercial and political reasons. Other observers claim thatForsa—founded by Manfred Gullner, who is a friend of former party leader and chan-cellor Gerhard Schroder and who later became embroiled in the intraparty dispute overSchroder’s welfare reforms—is trying to hurt the Social Democrats by systematicallyunderreporting SPD support.

While it seems impossible to resolve this dispute, surveybiasseries makes model-ing the extent of bias in favor or against the SPD trivially easy. On page 154, we sawthat on average, bias against or in favor of the SPD (A′

2) is negligible but that there isvariation to start with. Because A′

2 (unlike Bw) is normally distributed, there is no needto include sample size in the model. We therefore model bias in the estimate of SocialDemocratic support as a function of time to the election (again allowing for nonlineareffects) and house effects, pitting Infratest dimap (company2) and Forsa (company4)against the other four companies.


. mfp: regress gaprimespd timetoelection company2 company4

(output omitted )

Source SS df MS Number of obs = 152F(4, 147) = 49.40

Model .913620509 4 .228405127 Prob > F = 0.0000Residual .679668875 147 .004623598 R-squared = 0.5734

Adj R-squared = 0.5618Total 1.59328938 151 .010551585 Root MSE = .068

gaprimespd Coef. Std. Err. t P>|t| [95% Conf. Interval]

Itime__1 -.1039413 .0214838 -4.84 0.000 -.1463984 -.0614842Itime__2 .3111809 .0530732 5.86 0.000 .2062958 .416066company2 .0033631 .0136761 0.25 0.806 -.0236641 .0303903company4 -.158217 .0134935 -11.73 0.000 -.1848832 -.1315508

_cons .0265307 .0086185 3.08 0.002 .0094986 .0435628

Deviance: -390.967.

Here the (0 0.5) transformation provides the best fit. This functional form is J-shaped (see figure 2): bias in favor of the SPD declined over the course of the campaignbut rose sharply in the last few polls taken immediately before the election. Controllingfor time, we see that the four companies that are treated as the reference point performedwell. While their average bias of 0.027 is significantly different from 0, this number issmall in absolute terms.

−.2

0.2

.4P

artia

l pre

dict

or+

resi

dual

of g

aprim

espd

0100200300timetoelection

Fractional Polynomial (0 .5),adjusted for covariates

Figure 2. Effect of time to election on overall bias in favor of the Social Democrats (A′2)

The coefficient for Infratest dimap is statistically indistinguishable from zero. Putdifferently, with respect to the estimate of the SPD vote, there is no evidence that thepolls conducted by Infratest differ in any way from those produced by the other fourcompanies. The estimate for Forsa, on the other hand, is statistically and substantivelysignificant. While five companies, including Infratest dimap, got the SPD vote right


on average, Forsa consistently underestimated support for the Social Democrats by aconsiderable margin.

6 Conclusion

Martin, Traugott, and Kennedy (2005) derive a useful scalar measure A for survey ac-curacy, but its use is confined to two-party systems. Arzheimer and Evans (2014) gen-eralize their ideas to the more common multiparty case, making them applicable to amuch wider array of political systems. Our suite of user-written commands providesthe means to efficiently compute their measures A′

i, B, and Bw, along with standarderrors and statistical tests. While surveybias fits situations where one has access tothe raw data, surveybiasi estimates polling accuracy from published margins alone.Published margins are also sufficient for surveybiasseries, which is particularly suitedfor researchers who, perhaps in the wake of an electoral campaign, want to assess theaccuracy of many surveys at once.

7 ReferencesArzheimer, K., and J. Evans. 2014. A new multinomial accuracy measure for pollingbias. Political Analysis 22: 31–44.

Cressie, N., and T. R. C. Read. 1989. Pearson’s X2 and the loglikelihood ratio statisticG2: A comparative review. International Statistical Review 57: 19–43.

Durand, C. 2008. The polls of the 2007 French presidential campaign: Were lessonslearned from the 2002 catastrophe? International Journal of Public Opinion Research20: 275–298.


Martin, E. A., M. W. Traugott, and C. Kennedy. 2005. A review and proposal for anew measure of poll accuracy. Public Opinion Quarterly 69: 342–369.

Mosteller, F., H. Hyman, P. J. McCarthy, E. S. Marks, and D. B. Truman. 1949. ThePre-Election Polls of 1948: Report to the Committee on Analysis of Pre-ElectionPolls and Forecasts. New York: Social Science Research Council.

Tellinghuisen, J. 2001. Statistical error propagation. Journal of Physical Chemistry105: 3917–3921.

Weisberg, H. F. 2005. The Total Survey Error Approach: A Guide to the New Scienceof Survey Research. Chicago: University of Chicago Press.

About the authors

Kai Arzheimer is a professor of political science at the University of Mainz in Mainz, Germany.

Jocelyn Evans is professor of politics at the University of Leeds in Leeds, UK.


bicop: A command for fitting bivariate ordinalregressions with residual dependence

characterized by a copula function and normalmixture marginals

Monica Hernandez-AlavaSchool of Health and Related Research (ScHARR)

Health Economics and Decision ScienceUniversity of Sheffield

Sheffield, UK

[email protected]

Stephen PudneyInstitute for Social and Economic Research (ISER)

University of EssexColchester, UK

[email protected]

Abstract. In this article, we describe a new Stata command, bicop, for fittinga model consisting of a pair of ordinal regressions with a flexible residual distri-bution, with each marginal distribution specified as a two-part normal mixture,and stochastic dependence governed by a choice of copula functions. The bicop

command generalizes the existing biprobit and bioprobit commands, which as-sume a bivariate normal residual distribution. We present and explain the bicop

estimation command and the available postestimation commands using data onfinancial well-being from the UK Understanding Society Panel Survey.

Keywords: st0429, bicop, bivariate ordinal regression, copula, mixture model

1 Introduction

We are often interested in modeling the joint distribution of two observed measuresconditional on a set of observed covariates. For example, income and wealth are twostrongly related aspects of economic welfare that should, arguably, be studied jointly;drinking and smoking, particularly when combined, have important health implicationsand should thus be studied jointly; and joint analysis of different domains of satisfactionhas been used in “happiness” research. Methodological issues also often take this formand ask how two alternative measures of the same theoretical concept may be related.

Frequently, the indicators concerned are coarse binary or ordinal measures ratherthan direct observations on the relevant theoretical concepts, and this naturally suggestsusing a pair of correlated ordinal probit or logit regressions. Stata already provides the


160 Generalized bivariate ordinal regression

command biprobit for the case of a pair of binary indicators and the user-written com-mand bioprobit (Sajaia 2008) for the more general ordinal case. However, biprobitand bioprobit are based on the assumption of joint normality, which may be hard todefend. In many applications, the influence of observed covariates has a pronouncednonnormal distributional shape, and there is no compelling reason to assume that thefactors we cannot observe conform to normality when the factors we can observe donot. Moreover, the linear form of stochastic dependence implied by bivariate normalitymay be unduly restrictive: there is no reason why the nature and degree of dependenceshould not vary across different parts of the population.

Models of this type are not distribution free, and misspecification of the joint residualdistribution may cause significant bias in the estimated coefficients of the covariatesand may give a distorted picture of stochastic dependence. We developed the bicop

command as a method of estimating a more general specification of the bivariate ordinalmodel, using mixtures to allow for nonnormality and copula representations to allow forcomplex forms of dependence.

The article is organized as follows: in section 2, we give an overview of the generalizedbivariate ordinal regression model and the approach we use to allow for nonnormality inthe residual distribution. In section 3, we discuss two hypothesis tests that are relevantto bicop. In section 4, we explain the predictors that are provided postestimation. Insection 5, we describe the bicop syntax and options, including the syntax for predict.In section 6, we conclude with an empirical example using the bicop command.

2 The generalized bivariate ordinal regression model

The generalized bivariate ordinal regression model is

Y ∗i1 = Xi1β1 + Ui (1)

Y ∗i2 = Xi2β2 + Vi (2)

where Y ∗i1 and Y ∗

i2 are latent variables, Xi1 and Xi2 are row vectors of covariates, andβ1 and β2 are conformable column vectors of coefficients. Ui and Vi are unobservedresiduals that may be stochastically dependent and nonnormal. The covariate vectorsXi1 and Xi2 may contain the same or different variables.

The observable counterparts of Y ∗i1 and Y ∗

i2 are generated by the threshold-crossingconditions

Yij = r iff Γrj ≤ Y ∗ij < Γr+1j r = 1, . . . , Rj and j = 1, 2

where Rj is the number of categories of Yij and Γrj are threshold parameters, withΓ1j = −∞ and ΓRjj = +∞. (Note that in practice, the Yij do not have to be scored as1, 2, 3, . . . ; bicop will work, whatever numerical values are used to index outcomes—only their ordering matters.)

M. Hernandez-Alava and S. Pudney 161

The likelihood function requires evaluation of the probability that (Y ∗i1, Y

∗i2) falls in a

rectangle corresponding to the observed values of (Yi1, Yi2). For given parameter values,that probability can be computed using the joint distribution function F (Ui, Vi), whichallows the likelihood to be maximized numerically. However, if the assumed form forF (Ui, Vi) is incorrect, the probabilities in the likelihood function will be misspecified,and the (pseudo) maximum likelihood estimator will be inconsistent. This means thatthe standard approach using a bivariate normal form for F (., .) is potentially vulnerableto bias. On the other hand, a full nonparametric specification for F (., .) would becomplicated and unlikely to provide reliable estimates except in large samples, so anintermediate degree of flexibility is desirable.

The model specification is based on a copula representation of the joint distributionof the residuals U and V . A bivariate copula is any function c(u, v) : [0, 1]2 → [0, 1] thatis (weakly) increasing and satisfies c(u, 0) = c(0, v) = 0, c(u, 1) = u, and c(1, v) = v forall u, v ∈ [0, 1]. By adding a parameter θ governing the stochastic dependence of U andV , we can write the joint residual distribution function as

F (U, V ) = c{Fu(U), Fv(V ); θ}

where Fu(U) ≡ F (U,+∞) and Fv(V ) ≡ F (+∞, V ) are the marginal distribution func-tions of U and V . The bicop command generalizes the standard bivariate normal modelin the following ways:

• Marginals: bicop allows the marginal distributions Fu(.) and Fv(.) to be specifiedas mixtures of two normal components. For Fu(.),

Fu(u) = πuΦ

(u− μu1

σu1

)+ (1− πu)Φ

(u− μu2

σu2

)(3)

where πu is the mixing probability, and (μu1, μu2) and (σu1, σu2) are location anddispersion parameters constrained to satisfy the mean and variance normalizationsπuμu1 + (1− πu)μu2 ≡ 0 and πu

(σ2u1 + μ2

u1

)+ (1− πu)

(σ2u2 + μ2

u2

)= 1. A similar

specification can be used for Fv(.). These normal mixtures can capture variousdistributional shapes, especially those involving skewness or bimodality.

The bicop command performs the optimization with respect to ln {πu/(1− πu)}rather than πu, but both values are reported in the output. In the Stata outputlog, the mixing parameters πu, (1−πu), μu1, μu2, σ

2u1, and σ2

u2 are labeled pi u 1,pi u 2, mean u 1, mean u 2, var u 1, and var u 2 for (1) and, analogously, pi v 1,pi v 2, mean v 1, mean v 2, var v 1, and var v 2 for (2).1

• Dependence: The bicop command offers the following six forms as options:

– Independent: c(u, v) = uv.

1. The auxiliary parameters that are optimized during estimation are also written to the output log,with labels /pu1, /mu2, /su2, /pv1, /mv2, and /sv2. These parameters are transformations of themixing parameters and can be ignored when interpreting the output of the model.


– Gaussian: c(u, v) = Φ{Φ−1(u),Φ−1(v); θ

}, where Φ(., .; θ) is the distribution

function of the bivariate normal with correlation coefficient −1 ≤ θ ≤ 1, andΦ−1(.) is the inverse of the univariate N(0, 1) distribution function.

– Clayton: c(u, v) ={max

(u−θ + v−θ − 1, 0

)}−1/θfor 0 < θ ≤ ∞ and

c(u, v) = uv for θ = 0.

– Frank: −(1/θ) ln

{1 +

(e−θu−1)(e−θv−1)e−θ−1

}for θ �= 0 and c(u, v) = uv for

θ = 0.

– Gumbel: exp[−{(− lnu)θ + (− ln v)θ

}1/θ]for θ ≥ 1.

– Joe: 1− {(1− u)θ + (1− v)θ − (1− u)θ(1− v)θ}1/θ

for θ ≥ 1.

These copulas can represent various dependence structures. The Gaussian and theFrank copulas are similar in that both allow for positive and negative dependence, anddependence is symmetric in both tails. However, compared with the Gaussian copula,the Frank copula exhibits weaker dependence in the tails, and dependence is strongest inthe middle of the distribution. In contrast, the Clayton, Gumbel, and Joe copulas do notallow for negative dependence, and dependence in the tails is asymmetric. The Claytoncopula exhibits strong left-tail dependence and relatively weak right-tail dependence.Thus, if two variables are strongly correlated at low values but not so correlated at highvalues, then the Clayton copula is a good choice. The Gumbel and Joe copulas displaythe opposite pattern with weak left-tail dependence and strong right-tail dependence.The right-tail dependence is stronger in the Joe copula than in the Gumbel, and thusthe Joe copula is closer to the opposite of the Clayton copula.

bicop maximizes the likelihood with respect to an unrestricted constantδ ∈ [−∞,+∞], with θ related to δ in the following ways:

θ =

⎧⎪⎪⎨⎪⎪⎩tanh(δ) Gaussianeδ Claytonδ Frankeδ + 1 Gumbel, Joe

The output from bicop reports both δ (labeled as /depend) and θ.

Both mixture and copula models can be difficult to fit in some circumstances (seeMcLachlan and Peel [2000] on the former and Trivedi and Zimmer [2005] on the latter).Two distinct problems await the unwary. Nonconvergence of the likelihood optimizer of-ten occurs in copula models, typically for some choices of copula function but not others.The problem occurs when the chosen copula function does a poor job of representingthe pattern of dependence between the two residuals, and it can often be resolved byswitching to a different copula function; we see an example of this in section 6, whereconvergence cannot be achieved for the Gumbel and Joe copulas. Poor starting valuescan also cause nonconvergence; restarting the optimizer from a different point in theparameter space will work in some cases.


Another possible reason for nonconvergence is local nonidentification of the mixtureparameters. For the normal mixture (3), the parameter πu is not identified at interiorpoints in the parameter space where μu1 = μu2 and σu1 = σu2. Boundary problems alsoarise because μu1, σu1 are not identified when πu = 0, nor are μu2, σu2 identified whenπu = 1. All three regions correspond to a pure N(0, 1) distribution.2 Consequently,if either of the marginal distributions is approximately normal, identification will beweak and nonconvergence a likely result. These cases usually become evident if the logand trace options are used to display current parameter values during optimization.When this occurs, the relevant marginal can be respecified as an unmixed normal in asubsequent run.

Related to this last type of nonconvergence problem is the problem of testing for theappropriate number of mixture components. Standard likelihood-ratio tests of H0 : U ∼N(0, 1) or V ∼ N(0, 1) against a two-component normal mixture do not work correctlyin this nonregular context (Titterington, Smith, and Makov 1985, 154), and we are notaware of any alternative formal procedure that is entirely satisfactory.

The problem of multiple optima is less obvious than nonconvergence—and, there-fore, more dangerous. The existence of multiple optima poses problems for likelihoodmaximization in many mixture models and should be assumed to be a potential pit-fall. The bicop command offers the standard Stata optimization options for startingvalues (see [R] maximize), and the application in section 6 provides an example of arecommended starting-values strategy.

3 Hypothesis tests

Two hypothesis tests may be of special interest in particular applications of bicop. Oneis the hypothesis test of conditional independence: Y1 � Y2|X1, X2, which holds if andonly if c(u, v) = uv for all u, v ∈ [0, 1]. This independence condition is equivalent toθ = 0 for the Gaussian, Clayton, and Frank copulas and θ = 1 for the Gumbel and Joecopulas. For the Gaussian and Frank copulas, this involves a regular likelihood-ratioor Wald test, which can be done in the usual way. For these copula functions, bicopproduces a Wald test automatically. For the Clayton, Gumbel, and Joe functions, thenull hypothesis is on the boundary of the parameter space, and the likelihood-ratioand Wald tests are not valid (see Chernoff [1954] and Andrews [2001]). Because thesecopulas are a natural choice in applications only where we are confident of positivedependence, bicop does not produce an automatic test in these cases. Instead, if thetest is required, the user could fit the model unrestrictedly using the Clayton, Gumbel,or Joe copula, repeat estimation while imposing independence by specifying the copulac = uv, and then construct the usual statistic of minus twice the log-likelihood ratio. Thecomplication here is that the test statistic has a nonstandard limiting distribution, thatis, χ2 [a 50:50 mixture of a degenerate probability mass at zero and a χ2(1) distribution].This amounts to performing a standard χ2(1) likelihood-ratio test and then halving thep-value (see Chernoff [1954]).

2. The variance of the distribution is normalized to 1 for identification purposes in an ordered probitmodel.


The second special hypothesis test of interest in some applications of bicop is thehypothesis of equal coefficients, H0 : β1 = β2, which will normally arise when X1 andX2 contain the same variables. This null hypothesis arises naturally when Y1 and Y2 areinterpreted as alternative measures of the same concept—for example, they might beresponses to the same survey questions, repeated with different response scales. A testcan be performed easily using the standard Stata command test, which implementsthe Wald test, but for convenience, bicop does the test automatically. If X1 and X2

are different, the test is made on the coefficients of any variables that are common toboth.

4 Prediction

The bicop command allows the usual Stata prediction options postestimation, throughthe evaluation of the linear indices Xi1β1 and Xi2β2, the associated prediction standarderrors, and the probabilities of specific outcomes for (Yi1, Yi2) conditional on the co-variates (Xi1, Xi2). However, bicop additionally has options for conditional prediction.These can be used, for instance, to convert (or “map” or “cross-walk”) a measurementscale represented by the dependent variable Yi1 into another scale represented by Yi2.Following the use of bicop, the predict command can convert a measurement scale byconstructing estimates of the distribution of one dependent variable conditional on theobserved outcome for the other. For example,

Pr(Yi2 = s|Yi1 = r,Xi1, Xi2) =Pr(Yi1 = r, Yi2 = s|Xi1, Xi2)∑R2

s=1 Pr(Yi1 = r, Yi2 = s|Xi1, Xi2)

where r ∈ [1, R1] and s ∈ [1, R2] are specified levels for the two outcomes.

5 Command syntax

5.1 bicop

Syntax

There are two forms of the syntax:

X1 and X2 contain the same covariates

bicop depvar1 depvar2[indepvars

] [if] [

in] [

weight] [

, syntax1 options]

X1 and X2 contain different covariates

bicop (equation1) (equation2)[if] [

in] [

weight] [

, syntax2 options]


syntax1 options and syntax2 options are as listed in the Options section below.

equation1 and equation2 are specified as

([eqname:

]depvar

[=] [

indepvars] [

, offset(varname)])

pweights, fweights, and iweights are allowed; see [U] 11.1.6 weight.

Description

bicop is a user-written command that fits a generalized bivariate ordinal regressionmodel using maximum likelihood estimation. It is implemented as an lf1 ml evaluator.The model involves a pair of latent regression equations, each with a standard threshold-crossing condition to generate ordinal observed dependent variables. The bivariateresidual distribution is specified to have marginals, each with the form of a two-partnormal mixture, and a choice of copula functions to represent the pattern of dependencebetween the two residuals.

Options

Options common to both syntax 1 and syntax 2 are the following:

mixture(mixturetype) specifies the marginal distribution of each residual. There are fivechoices for mixturetype: none specifies that each marginal distribution be N(0, 1);mix1 specifies that the residual from equation 1 has a two-part normal mixturedistribution but that the residual from equation 2 be N(0, 1); mix2 specifies N(0, 1)for equation 1 and a normal mixture for equation 2; both allows each residual tohave a different normal mixture distribution; and equal specifies that both residualshave the same normal mixture distribution. The default is mixture(none).

copula(copulatype) specifies the copula function to be used to control the pattern ofstochastic dependence of the two residuals. There are six choices for copulatype:indep, which specifies the special form c(u, v) = uv, gaussian, clayton, frank,gumbel, and joe. The default is copula(gaussian). Note that if both mixture()

and copula() are omitted, the bicop command produces the same results as the ex-isting bioprobit and (if both dependent variables are binary) biprobit commands.

constraints(numlist) applies specified linear constraints; see [R] constraint.

collinear retains collinear variables. Usually, there is no reason to leave collinearvariables in place, and doing so would cause the estimation to fail because of matrixsingularity. However, in some constrained cases, the model may be fully identifieddespite the collinearity. The collinear option then allows estimation to occur,leaving the equations with collinear variables intact. This option is seldom used.


vce(vcetype) specifies how to estimate the variance–covariance matrix corresponding tothe parameter estimates. The supported options are oim, opg, robust, and cluster.The current version of the command does not allow bootstrap or jackknife esti-mators. See [R] vce option.

level(#) sets the significance level to be used for confidence intervals; see [R] level.

from(init specs), where init specs is either matname, the name of a matrix containingthe starting values, or matname, copy | skip. The copy suboption specifies that theinitialization vector be copied into the initial-value vector by position rather thanby name, and the skip suboption specifies that any irrelevant parameters found inthe specified initialization vector be ignored. Poor values in from() may lead toconvergence problems.

search(spec) specifies whether ml’s ([R] ml) initial search algorithm is used. spec maybe on or off.

repeat(#) specifies the number of random attempts to be made to find a better initial-value vector. This option should be used in conjunction with search().

maximize options specifies the maximization options; maximize options are difficult,technique(algorithm spec), iterate(#),

[no]log, trace, gradient, showstep,

hessian, showtolerance, tolerance(#), ltolerance(#), gtolerance(#),nrtolerance(#), and nonrtolerance; see [R] maximize.

Additional options for syntax 1 only are as follows:

offset1(varname) specifies an offset variable for the first equation.

offset2(varname) specifies an offset variable for the second equation.

5.2 predict

Syntax

predict varname[if] [

in] [

, predicttype outcome(r,s)]

Description

Following bicop, the predict command can be used to construct several alternativepredictions. The predictions include the linear indices Xi1β1 and Xi2β2 and corre-sponding standard errors; probabilities of the form Pr(Yij = r|Xij) or Pr(Yi1 = r, Yi2 =s|Xi1, Xi2); and conditional probabilities of the form Pr(Yij = r|Yik = s,Xi1, Xi2).

Options

predicttype specifies the type of prediction required. If predicttype is xb1 or xb2, thevariable varname is constructed as Xi1β1 or Xi2β2, respectively. Set predicttype to


std1 or std2 to construct varname as the corresponding prediction standard error.If predicttype is pr, the prediction is calculated as a probability Pr(Yi1 = r|Xij),Pr(Yi2 = r|Xij), or Pr(Yi1 = r, Yi2 = s|Xi1, Xi2) with r and s specified by theoutcome() option. The predicttypes pcond1 and pcond2 specify the conditionalprobabilities Pr(Yi1 = r|Yi2 = s,Xi1, Xi2) or Pr(Yi2 = s|Yi1 = r,Xi1, Xi2), respec-tively, with r and s supplied by outcome().

outcome(r,s) specifies the outcome levels to be used in predicting probabilities for Yi1

and Yi2. The possibilities for predicttype and outcome(r,s) are as follows:

Option Predicted probability

pr outcome(r, . ) Pr(Yi1 = r|Xi1)

pr outcome( . ,s) Pr(Yi2 = s|Xi2)

pr outcome(r,s) Pr(Yi1 = r, Yi2 = s|Xi1, Xi2)

pcond1 outcome(r,s) Pr(Yi1 = r|Yi2 = s,Xi1, Xi2)

pcond2 outcome(r,s) Pr(Yi2 = s|Yi1 = r,Xi1, Xi2)

6 An illustrative application: Financial well-being

We now show how to use the bicop command to model bivariate ordinal data. Ourexample uses data from Understanding Society: the UK Household Longitudinal Survey(UKHLS). See Knies (2015) for a detailed description of the survey. The main UKHLS

sample began in 2009 with approximately 30,000 households. Interviewing proceedscontinuously through the year with households interviewed annually, but each wavetakes two years to complete and thus overlaps with the preceding and succeeding waves.We use a simple dataset comprising a cross-section of 5,482 individual respondentsdrawn from the calendar years 2011–2012. The dataset is supplied to users with thebicop code.

We analyze the responses to the following two questions about financial well-being(FWB), and we construct the variables Y1 and Y2 as the corresponding five-level andthree-level ordinal indicators, both recoded to give scales increasing in current or ex-pected FWB (see Pudney [2011] for discussion and analysis of this FWB measure).

• “How well would you say you yourself are managing financially these days? Wouldyou say you are . . . ” [1. Living comfortably 2. Doing alright 3. Just about gettingby 4. Finding it quite difficult 5. or finding it very difficult?].

• “Looking ahead, how do you think you will be financially a year from now, willyou be . . . ” [1. Better off 2. Worse off than you are now 3. or about the same?].


Three binary explanatory covariates distinguish people who are female, homeowners,and unemployed or long-term sick and disabled.3

The following code fits all six copula models with the mixture(none) option. TheClayton copula clearly provides the best likelihood fit. Note that the Gumbel estimateis a boundary solution with θ ≈ 1; thus it is also identical to the Joe estimate and theresult produced by the copula(indep) option (neither of which are reproduced here).The superior fit of the Clayton model and failure of the Gumbel and Joe models todetect any dependence suggest a pattern of strong dependence in the left tail of theresidual distribution but not in the right tail.

. use ukhlsfwb

. local maxll=minfloat()

. foreach cop in gaussian frank clayton gumbel joe indep {2. local xvars female homeowner unempsick3. bicop finnow finfut `xvars´, copula(`cop´)4. estimates store `cop´5. if e(ll)>`maxll´&e(converged) {6. local maxll=e(ll)7. local bestcop="`cop´"8. matrix bestb=e(b)9. }10. }

3. A more substantial application with 10 explanatory variables can be found in an earlier version ofthis paper (Hernandez-Alava and Pudney 2015). We cannot make that dataset publicly availablebecause of respondent confidentiality, but the full UKHLS data files are obtainable on applicationto the UK Data Archive (Study Number 6614) athttp://discover.ukdataservice.ac.uk/catalogue/?sn=6614&type=Data%20catalogue.


LogL for independent ordered probit model -13062.773

initial: log likelihood = -16992.008rescale: log likelihood = -15050.038rescale eq: log likelihood = -13062.146Iteration 0: log likelihood = -13062.146Iteration 1: log likelihood = -13062.145Iteration 2: log likelihood = -13062.145

Generalized bivariate ordinal regression model (copula: gaussian, mixture: none)




finnowfemale -.1549272 .0296466 -5.23 0.000 -.2130335 -.096821

homeowner .5237826 .0303863 17.24 0.000 .4642266 .5833386unempsick -.7196592 .0399321 -18.02 0.000 -.7979247 -.6413936

finfutfemale -.046568 .0313308 -1.49 0.137 -.1079753 .0148393

homeowner -.2102546 .0320044 -6.57 0.000 -.2729822 -.147527unempsick -.1461849 .0419871 -3.48 0.000 -.2284782 -.0638916

/cuteq1_1 -1.592359 .0394148 -40.40 0.000 -1.669611 -1.515108/cuteq1_2 -.9077473 .0343043 -26.46 0.000 -.9749824 -.8405122/cuteq1_3 .0811928 .0326669 2.49 0.013 .0171667 .1452188/cuteq1_4 1.056313 .0348781 30.29 0.000 .9879537 1.124673/cuteq2_1 -1.054656 .0360324 -29.27 0.000 -1.125278 -.9840339/cuteq2_2 .475085 .0343894 13.81 0.000 .407683 .5424871

/depend .0179149 .015992 1.12 0.263 -.0134287 .0492586

theta .017913 .0159868

Wald test of equality of coefficients chi2(df = 3)= 521.974 [p-value=0.000]Wald test of independence chi2(df = 1)= 1.255 [p-value=0.263]



initial: log likelihood = -13132.429rescale: log likelihood = -13132.429rescale eq: log likelihood = -13062.443Iteration 0: log likelihood = -13062.443Iteration 1: log likelihood = -13062.442

Generalized bivariate ordinal regression model (copula: frank, mixture: none)




finnowfemale -.1547791 .0296449 -5.22 0.000 -.2128821 -.0966761


finfutfemale -.0465239 .0313291 -1.49 0.138 -.1079278 .0148801



/depend .0770508 .0947965 0.81 0.416 -.1087471 .2628486

theta .0770508 .0947965




initial: log likelihood = -17203.534rescale: log likelihood = -15145.713rescale eq: log likelihood = -13101.382Iteration 0: log likelihood = -13101.382Iteration 1: log likelihood = -13051.968Iteration 2: log likelihood = -13051.923Iteration 3: log likelihood = -13051.923

Generalized bivariate ordinal regression model (copula: clayton, mixture: none)




finnowfemale -.1589312 .0296393 -5.36 0.000 -.2170231 -.1008392


finfutfemale -.0499395 .0313101 -1.59 0.111 -.1113061 .0114272



/depend -2.53765 .228445 -11.11 0.000 -2.985394 -2.089906

theta .0790519 .018059




initial: log likelihood = -19774.602rescale: log likelihood = -15862.755rescale eq: log likelihood = -13330.654Iteration 0: log likelihood = -13330.654Iteration 1: log likelihood = -13067.223Iteration 2: log likelihood = -13062.777Iteration 3: log likelihood = -13062.773Iteration 4: log likelihood = -13062.773

Generalized bivariate ordinal regression model (copula: gumbel, mixture: none)




finnowfemale -.1548403 .0296463 -5.22 0.000 -.2129459 -.0967347


finfutfemale -.0465534 .0313303 -1.49 0.137 -.1079598 .0148529



/depend -38.4 . . . . .

theta 1 .

Wald test of equality of coefficients chi2(df = 3)= 514.359 [p-value=0.000]



initial: log likelihood = -16915.294rescale: log likelihood = -15450.545rescale eq: log likelihood = -13207.919Iteration 0: log likelihood = -13207.919Iteration 1: log likelihood = -13066.981Iteration 2: log likelihood = -13062.777Iteration 3: log likelihood = -13062.773Iteration 4: log likelihood = -13062.773

Generalized bivariate ordinal regression model (copula: joe, mixture: none)




finnowfemale -.1548403 .0296463 -5.22 0.000 -.2129459 -.0967347


finfutfemale -.0465534 .0313303 -1.49 0.137 -.1079598 .0148529



/depend -38.4 . . . . .

theta 1 .





Generalized bivariate ordinal regression model (copula: indep, mixture: none)




finnowfemale -.1548403 .0296463 -5.22 0.000 -.2129459 -.0967347


finfutfemale -.0465534 .0313303 -1.49 0.137 -.1079598 .0148529




. estimates stats _all

Akaike´s information criterion and Bayesian information criterion

Model Obs ll(null) ll(model) df AIC BIC

gaussian 5,482 . -13062.15 13 26150.29 26236.21frank 5,482 . -13062.44 13 26150.88 26236.8

clayton 5,482 . -13051.92 13 26129.85 26215.77gumbel 5,482 . -13062.77 12 26149.55 26228.86

joe 5,482 . -13062.77 12 26149.55 26228.86indep 5,482 . -13062.77 12 26149.55 26228.86

Note: N=Obs used in calculating BIC; see [R] BIC note.

Now using the preferred Clayton copula, we allow for the same nonnormal distribu-tion in both residuals, using the mixture(equal) option, and we check for local optimaby running the optimizer from 10 randomly perturbed starting points. We generatethese random points over a region with ln θ ∈ [−3, 1]; ln {πu/(1− πu)} ∈ [−2, 2];μu2 ∈[−1, 1];σ2

u2 ∈ [0, 2].


. quietly bicop finnow finfut `xvars´, copula(`bestcop´) mixture(equal)> iterate(25)

. local k=e(k)-3 // position of /depend in parameter vector

. local k1=`k´+1 // position of /pu1

. local k2=`k´+2 // position of /mu2

. local k3=`k´+3 // position of /su2

. local nstarts=10 // no. of random starts

. local nits=7 // no. iterations from each start

. set seed 22246

. matrix bequal=e(b)

. matrix maxpar=bequal

. local maxll=e(ll)

. matrix ttt=bequal

. forvalues r=1/`nstarts´ {2. quietly {3. matrix ttt[1,`k´]=4*runiform()-3 // start value for /depend4. matrix ttt[1,`k1´]=4*(runiform()-0.5) // start value for /pu15. matrix ttt[1,`k2´]=2*(runiform()-0.5) // start value for /mu26. matrix ttt[1,`k3´]=2*runiform() // start value for /su27. capture bicop finnow finfut `xvars´, copula(`bestcop´) mixture(equal)

> from(ttt) log iterate(`nits´) search(off)8. local retcode=_rc9. if e(ll)>`maxll´&`retcode´==0 {10. matrix maxpar=e(b)11. local maxll=e(ll)12. }13. noisily display "Replication... " `r´ ": logL = " e(ll) " best so far =

> " `maxll´14. }15. }

Replication... 1: logL = -13047.235 best so far = -13047.235Replication... 2: logL = -769989.06 best so far = -13047.235Replication... 3: logL = -13047.243 best so far = -13047.235Replication... 4: logL = -13054.104 best so far = -13047.235Replication... 5: logL = -13047.781 best so far = -13047.235Replication... 6: logL = -13047.235 best so far = -13047.235Replication... 7: logL = -13048.043 best so far = -13047.235Replication... 8: logL = -13048.43 best so far = -13047.235Replication... 9: logL = -769989.06 best so far = -13047.235Replication... 10: logL = -13047.484 best so far = -13047.235


. bicop finnow finfut `xvars´, copula(`bestcop´) mixture(equal) from(maxpar)> iterate(50)LogL for independent ordered probit model -13062.773


Generalized bivariate ordinal regression model (copula: clayton, mixture: equal)




finnowfemale -.1684891 .0294299 -5.73 0.000 -.2261707 -.1108075


finfutfemale -.0594949 .0304459 -1.95 0.051 -.1191677 .0001779



/depend -2.508975 .2246021 -11.17 0.000 -2.949187 -2.068763/pu1 1.71723 .7078914 2.43 0.015 .3297886 3.104672/mu2 .4726607 .1257883 3.76 0.000 .2261201 .7192013/su2 .5318347 .1646931 3.23 0.001 .2090421 .8546273

theta .0813516 .0182717pi_u_1 .8477717 .0913568pi_u_2 .1522283 .0913568

mean_u_1 -.0848723 .059337mean_u_2 .4726607 .1257883var_u_1 1.081455 .0422284var_u_2 .2828481 .175179


. matrix bequal=e(b)

. estimates store clayton_equ

. estimates stats clayton clayton_equ



clayton 5,482 . -13051.92 13 26129.85 26215.77clayton_equ 5,482 . -13047.23 16 26126.47 26232.22



The estimated residual distribution is a mixture of a dominant component(pi u 1=0.85) centered close to zero (mean u 1=-0.08), with a secondary(pi u 2=0.15), less dispersed (var u 2=0.28) component centered above zero(mean u 2=0.47).

However, the evidence for nonnormality in the marginal residual distributions isnot strong. The Akaike information criterion (AIC) favors the model with equal mix-ture marginals over the model with normal marginals, while the Bayesian informationcriterion (BIC), which penalizes model complexity more heavily, gives the opposite re-sult. The following code shows a procedure for plotting the fitted mixture density incomparison with the standard N(0, 1) density. To do this, we recover the transformedparameters composing θ and all the mixing parameters from the matrix returned ine(extpar). The resulting plot is shown in figure 1, which reveals a negatively skewedmixture distribution.

. matrix mixparams=e(extpar)

. matrix list mixparams

mixparams[1,7]theta pi_u_1 pi_u_2 mean_u_1 mean_u_2 var_u_1

r1 .08135157 .84777173 .15222827 -.08487228 .4726607 1.0814547

var_u_2r1 .28284813

. matrix pu1=mixparams[1,"pi_u_1"]

. scalar pu1 = pu1[1,1]

. matrix pu2=mixparams[1,"pi_u_2"]

. scalar pu2 = pu2[1,1]

. matrix mu1=mixparams[1,"mean_u_1"]

. scalar mu1 = mu1[1,1]

. matrix mu2=mixparams[1,"mean_u_2"]

. scalar mu2 = mu2[1,1]

. matrix su1=mixparams[1,"var_u_1"]

. scalar su1 = sqrt(su1[1,1])

. matrix su2=mixparams[1,"var_u_2"]

. scalar su2 = sqrt(su2[1,1])

. twoway (function pu1*normalden(x,mu1,su1)+pu2*normalden(x,mu2,su2),> range(-3 3) lpattern(solid) lcolor(black)) (function normalden(x),> range(-3 3) lpattern(longdash) lcolor(black)),> graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white)> ifcolor(white)) legend(col(2) label(1 "Mixture") label(2 "N(0,1)"))> xtitle(" ") xscale(titlegap(2)) xlabel(-3(1)3)


0.1

.2.3

.4y

−3 −2 −1 0 1 2 3

Mixture N(0,1)

Figure 1. Estimated normal mixture density for the Clayton model residuals

We now allow for different distributional forms in the two residuals by using theoption mixture(both) and again using multiple starting values. Here convergence isnot achieved by using the default initial values but by restarting the optimization fromrandom points, although the estimated mixture is poorly determined. A likelihood-ratiotest against the equal-marginals specification gives a marginal result (Pr = 0.0871), andthere is conflict between the AIC and the BIC, with the AIC favoring these estimates andthe BIC favoring the equal-mixtures model.

. local k4=`k´+4 // position of /pv1

. local k5=`k´+5 // position of /mv2

. local k6=`k´+6 // position of /sv2

. matrix a=bequal[1,`k1´..`k3´] // initial values for mixing parameters for V

. matrix colnames a= pv1:_cons mv2:_cons sv2:_cons

. matrix b0=bequal,a

. quietly bicop finnow finfut `xvars´, copula(`bestcop´) mixture(both)> iterate(25)

. quietly matrix maxpar=e(b)

. quietly local maxll=e(ll)

. set seed 22246

. matrix ttt=b0

. forvalues r=1/`nstarts´ {2. quietly {3. matrix ttt[1,`k´]=4*runiform()-3 // start value for /depend4. matrix ttt[1,`k1´]=4*(runiform()-0.5) // start value for /pu15. matrix ttt[1,`k2´]=2*(runiform()-0.5) // start value for /mu26. matrix ttt[1,`k3´]=2*runiform() // start value for /su27. matrix ttt[1,`k4´]=4*(runiform()-0.5) // start value for /pv18. matrix ttt[1,`k5´]=2*(runiform()-0.5) // start value for /mv29. matrix ttt[1,`k6´]=2*runiform() // start value for /sv2


10. capture bicop finnow finfut `xvars´, copula(`bestcop´) mixture(both)> from(ttt) log iterate(`nits´) search(off)11. local retcode=_rc12. if e(ll)>`maxll´&`retcode´==0 {13. matrix maxpar=e(b)14. local maxll=e(ll)15. }16. noisily display "Replication... " `r´ ": logL = " e(ll) " best so far =

> " `maxll´17. }18. }

Replication... 1: logL = -769989.06 best so far = -13043.952Replication... 2: logL = -769989.06 best so far = -13043.952Replication... 3: logL = -298620.15 best so far = -13043.952Replication... 4: logL = -13044.151 best so far = -13043.952Replication... 5: logL = -769989.06 best so far = -13043.952Replication... 6: logL = -769989.06 best so far = -13043.952Replication... 7: logL = -769989.06 best so far = -13043.952Replication... 8: logL = -769989.06 best so far = -13043.952Replication... 9: logL = -13046.226 best so far = -13043.952Replication... 10: logL = -769989.06 best so far = -13043.952

. bicop finnow finfut `xvars´, copula(`bestcop´) mixture(both) from(maxpar)> iterate(50) search(off)LogL for independent ordered probit model -13062.773

(output omitted )

Generalized bivariate ordinal regression model (copula: clayton, mixture: both)




finnowfemale -.1659407 .0297574 -5.58 0.000 -.2242643 -.1076172


finfutfemale -.008524 2.094358 -0.00 0.997 -4.11339 4.096342

homeowner -.0196506 4.821656 -0.00 0.997 -9.469922 9.430621unempsick -.0031946 .7761486 -0.00 0.997 -1.524418 1.518029

/cuteq1_1 -1.643005 .045037 -36.48 0.000 -1.731276 -1.554734/cuteq1_2 -.9138487 .0375694 -24.32 0.000 -.9874833 -.8402141/cuteq1_3 .1137748 .0354499 3.21 0.001 .0442942 .1832554/cuteq1_4 1.060686 .0348293 30.45 0.000 .9924221 1.12895/cuteq2_1 .2728202 33.24207 0.01 0.993 -64.88043 65.42607/cuteq2_2 .4271055 4.731994 0.09 0.928 -8.847433 9.701643

/depend -2.504887 .2234805 -11.21 0.000 -2.9429 -2.066873/pu1 -2.228777 1.243522 -1.79 0.073 -4.666035 .2084806/mu2 .1549464 .1383337 1.12 0.263 -.1161827 .4260755/su2 .9007953 .0681421 13.22 0.000 .7672393 1.034351/pv1 -1.702438 .590641 -2.88 0.004 -2.860073 -.5448025/mv2 .4089723 .6733854 0.61 0.544 -.9108388 1.728783/sv2 .068307 16.76491 0.00 0.997 -32.79031 32.92692


theta .0816849 .018255pi_u_1 .0971959 .1091176pi_u_2 .9028041 .1091176

mean_u_1 -1.43922 .5266053mean_u_2 .1549464 .1383337var_u_1 .4571569 2.1114var_u_2 .8114322 .1227641pi_v_1 .1541472 .0770112pi_v_2 .8458528 .0770112

mean_v_1 -2.244156 3.605769mean_v_2 .4089723 .6733854var_v_1 .5076687 17.47826var_v_2 .0046658 2.29032


. matrix bunequal=e(b)

. estimates store clayton_both

. estimates stats clayton_equ clayton_both



clayton_equ 5,482 . -13047.23 16 26126.47 26232.22clayton_both 5,482 . -13043.95 19 26125.9 26251.48


. lrtest clayton_equ clayton_both

Likelihood-ratio test LR chi2(3) = 6.57(Assumption: clayton_equ nested in clayton_both) Prob > chi2 = 0.0871

Next, to demonstrate the second form of the bicop syntax, we revert to the optionmixture(equal) and refit the model with the marginally insignificant gender effectdropped from equation 2. Except for the scaling of the coefficients in equation 2, theresults change little. Again the AIC and BIC are in conflict over whether this is thebest-fitting model.


. local xvars1 female homeowner unempsick

. local xvars2 homeowner unempsick

. bicop (finnow=`xvars1´) (finfut=`xvars2´), copula(`bestcop´) mixture(equal)> from(bequal, skip) iterate(50) search(off)LogL for independent ordered probit model -13063.877

Iteration 0: log likelihood = -13052.337Iteration 1: log likelihood = -13049.144Iteration 2: log likelihood = -13049.134Iteration 3: log likelihood = -13049.134

Generalized bivariate ordinal regression model (copula: clayton, mixture: equal)




finnowfemale -.1651425 .0294366 -5.61 0.000 -.2228371 -.1074479


finfuthomeowner -.2079559 .0312757 -6.65 0.000 -.2692553 -.1466566unempsick -.1139064 .0417335 -2.73 0.006 -.1957025 -.0321102


/depend -2.520257 .2268808 -11.11 0.000 -2.964936 -2.075579/pu1 1.760666 .7653936 2.30 0.021 .260522 3.26081/mu2 .4740137 .1367034 3.47 0.001 .2060798 .7419475/su2 .5423894 .1763895 3.07 0.002 .1966723 .8881065

theta .0804389 .01825pi_u_1 .853293 .0958151pi_u_2 .146707 .0958151

mean_u_1 -.0814973 .0611763mean_u_2 .4740137 .1367034var_u_1 1.076078 .0428315var_u_2 .2941863 .1913436


. estat ic



. 5,482 . -13049.13 15 26128.27 26227.41



To show the differences in results that can follow from using bicop rather thanbioprobit, we now use the predict command to construct predictions for expectationsof the change in FWB conditional on current reported FWB. These are sample meansof estimates of Pr(Y2 = s|Y1 = r,Xi). The following code computes the predictions forthe Gaussian model and the equal-mixtures Clayton specification for s = 1 (expectedworsening of FWB) and s = 3 (expected improvement) and all r = 1, . . . , 5, summarizingthe relationship by plotting them against r.

. generate tee=_n if _n<=5(5,477 missing values generated)

. foreach c in clayton_equ gaussian {2. generate up`c´=.3. generate down`c´=.4. forvalues t=1/5 {5. quietly {6. estimates restore `c´7. capture drop tmp*8. predict tmp if e(sample), pcond2 outcome(`t´,3)9. predict tmp1 if e(sample), pcond2 outcome(`t´,1)10. summarize tmp, meanonly11. replace up`c´=r(mean) if tee==`t´12. summarize tmp1, meanonly13. replace down`c´=r(mean) if tee==`t´14. }15. }16. }

(5,482 missing values generated)(5,482 missing values generated)(5,482 missing values generated)(5,482 missing values generated)

. drop tmp*

. line upgaussian upclayton tee, graphregion(fcolor(white) ilcolor(white)> icolor(white) lcolor(white) ifcolor(white)) msymbol(none) xtick(1(1)5)> xtitle("Current financial wellbeing") xscale(titlegap(2)) xlabel(1(1)5)> ytitle("Pr(better)") yscale(titlegap(5)) lpattern(solid longdash)> lcolor(black black)> legend(col(2) label(1 "Bivariate ordered probit") label(2 "Generalized model"))

. line downgaussian downclayton tee, graphregion(fcolor(white) ilcolor(white)> icolor(white) lcolor(white) ifcolor(white)) msymbol(none) xtick(1(1)5)> xtitle("Current financial wellbeing") xscale(titlegap(2)) xlabel(1(1)5)> ytitle("Pr(worse)") yscale(titlegap(5)) lpattern(solid longdash)> lcolor(black black) legend(col(2) label(1 "Bivariate ordered probit")> label(2 "Generalized model"))

Figures 2 and 3 show these plots. The most striking feature is that the generalizedbicop model suggests considerably more pessimistic expectations conditional on a lowcurrent level of FWB, particularly for the expectation of further worsening. Note thatthe data come from a period of government austerity targeted particularly on welfarerecipients following a deep recession, so these pessimistic predictions are not implausible.


.2.2

2.2

4.2

6.2

8

Pr(

bette

r)

1 2 3 4 5

Current financial well−being

Bivariate ordered probit Generalized model

Figure 2. Predicted probability of expectation of better FWB conditional on currentFWB

.16

.18

.2.2

2.2

4.2

6

Pr(

wor

se)

1 2 3 4 5

Current financial well−being

Bivariate ordered probit Generalized model

Figure 3. Predicted probability of expectation of worse FWB conditional on currentFWB

The source of the difference is the different patterns of dependence built into theClayton and Gaussian copulas: the former model implies strong positive dependence inonly the left tail (low actual and anticipated FWB), whereas the latter implies uniformdependence. Although the Clayton model used to generate the plot also allows for adeparture from normality in each residual, in this particular application, the form ofthe marginals makes much less difference to the properties of the fitted model than thechoice of copula.

7 Acknowledgments

We thank the editor and an anonymous referee for helpful comments and suggestions.This work was supported by the Medical Research Council under grant MR/L022575/1.It uses data from the Understanding Society survey administered by ISER, Universityof Essex, funded by the Economic and Social Research Council. Pudney acknowledgesfurther ESRC funding through the UK Centre for Longitudinal Studies and the Centrefor Micro-Social Change (grants RES-586-47-0002 and RES-518-28-5001). The viewsexpressed in this article, and any errors or omissions, are those of the authors only.


8 ReferencesAndrews, D. W. K. 2001. Testing when a parameter is on the boundary of the maintainedhypothesis. Econometrica 69: 683–734.

Chernoff, H. 1954. On the distribution of the likelihood ratio. Annals of MathematicalStatistics 25: 573–578.

Hernandez-Alava, M., and S. Pudney. 2015. bicop: A Stata command for fittingbivariate ordinal regressions with residual dependence characterised by a copulafunction and normal mixture marginals. Understanding Society Working PaperNo. 2015-02, Economic & Social Research Council.http://www.understandingsociety.ac.uk/research/publications/working-paper/understanding-society/2015-02.pdf.

Knies, G., ed. 2015. Understanding Society: The UK Household Longitudinal StudyWaves 1–5 User Manual. Version 1.1. Colchester, UK: Economic & Social ResearchCouncil. https://www.understandingsociety.ac.uk/d/218/6614 UserManual Wave1to5 v1.1.pdf?1446134584.

McLachlan, G., and D. Peel. 2000. Finite Mixture Models. New York: Wiley.

Pudney, S. 2011. Perception and retrospection: The dynamic consistency of responsesto survey questions on wellbeing. Journal of Public Economics 95: 300–310.

Sajaia, Z. 2008. bioprobit: Stata module for bivariate ordered probit regression. Sta-tistical Software Components S456920, Department of Economics, Boston College.https://ideas.repec.org/c/boc/bocode/s456920.html.

Titterington, D. M., A. F. M. Smith, and U. E. Makov. 1985. Statistical Analysis ofFinite Mixture Distributions. New York: Wiley.

Trivedi, P. K., and D. M. Zimmer. 2005. Copula modeling: An introduction for practi-tioners. Foundations and Trends in Econometrics 1: 1–111.

About the authors

Monica Hernandez-Alava is an applied microeconometrician in the health economics and deci-sion science section in ScHARR at the University of Sheffield in Sheffield, UK.

Steve Pudney is a professor of economics at ISER at the University of Essex in Essex, UK.


Features of the area under the receiveroperating characteristic (ROC) curve. A good

practice.

David LoraClinical Research Unit (imas12)

Hospital Universitario 12 de Octubreand CIBER de Epidemiologıa y Salud Publica (CIBERESP)

Madrid, [email protected]

Israel ContadorDepartment of Basic Psychology, Psychobiology and Methodology of

Behavioral SciencesUniversity of Salamanca

Salamanca, Spain

Jose F. Perez-RegaderaDepartment of Radiation OncologyHospital Universitario 12 de Octubre

Madrid, Spain

Agustın Gomez de la CamaraClinical Research Unit (imas12)

Hospital Universitario 12 de Octubreand CIBER de Epidemiologıa y Salud Publica (CIBERESP)

Madrid, Spain

Abstract. The area under the receiver operating characteristic (ROC) curve is ameasure of discrimination ability used in diagnostic and prognostic research. TheROC plot is usually represented without additional information about decisionthresholds used to generate the graph. In our article, we show that adding at leastone or more informative cutoff points on the ROC graph facilitates the character-ization of the test and the evaluation of the discriminatory capacities, which canresult in more informed medical decisions. We use the rocreg and rocregplot

commands.

Keywords: st0430, receiver operating characteristic (ROC) curve, area under theROC curve, cervix cancer, diagnostic test, discrimination, prognostic models, ro-creg, rocregplot


186 Features of the area under the receiver operating characteristic curve

1 Introduction

The receiver operating characteristic (ROC) area represents the probability that in a spe-cific diagnostic test or prognostic model, a randomly chosen diseased subject is rankedwith greater suspicion than a randomly chosen nondiseased subject (Hanley and Mc-Neil 1982). The area under the ROC curve is a measure of discrimination ability usedin diagnostic tests and prognostic models. The discriminatory capacity corresponds tothe area below the curve. The ROC area can be graphed by varying the cutoff pointsused to determine which values of the clinical procedure will be considered abnormaland then plotting the resulting true-positive rate (sensitivity) against the correspondingfalse-positive rate (1− specificity) (DeLong, DeLong, and Clarke-Pearson 1988). How-ever, decision thresholds are usually not displayed on the ROC plot, though they areknown and used to generate the graph (Zweig and Campbell 1993). In this article, weshow that the addition of cutoff point information enables one to 1) locate the deci-sion threshold with respect to its sensitivity and specificity and 2) show its importanceregarding the other thresholds. This information, which is useful in making medicaldecisions (Royston, Altman, and Sauerbrei 2006), facilitates the characterization of thetest as well as the evaluation of the discriminatory capacity when using the ROC plot(Moons et al. 2015).

2 Example

To demonstrate the importance of the description of the area under the ROC curve,we generated a random dataset of 200 patients in the early stages of cervical cancerwith three variables: the lymph node metastasis (LNM) status (40 out of 200) and thepredicted probabilities of two preoperative prognostic models for the identification ofLNM in early cervical cancer. The first model incorporated age, tumor size by magneticresonance imaging (MRI), and LNM assessed by positron emission tomography and com-puted tomography variables; the second model incorporated age, tumor size by MRI, andLNM assessed by MRI and squamous cell carcinoma antigen. The sample was formed bycomparing the patients’ data with the information given in Kim et al. (2014).

. set obs 200number of observations (_N) was 0, now 200

. set seed 1556

. generate LNM=1 if _n<=40(160 missing values generated)

. generate predprob=0.01*runiform() if _n<=1(199 missing values generated)

. replace predprob=0.01+(0.05-0.01)*runiform() if _n<=3 & _n>1(2 real changes made)




D. Lora, I. Contador, J. F. Perez-Regadera, and A. Gomez de la Camara 187


. generate predprob2=0.15*runiform() if _n<=1(199 missing values generated)

. replace predprob2=0.15+(0.25-0.15)*runiform() if _n<=3 & _n>1(2 real changes made)





. replace LNM=0 if _n> 40(160 real changes made)

. replace predprob=0.01*runiform() if _n<=86 & _n>40(46 real changes made)






. replace predprob2=0.15*runiform() if _n<=86 & _n>40(46 real changes made)






Using the rocreg command, we estimate the area under the ROC curve for theclassifier based on model 1 with 95% confidence intervals (CI) and store the true-positiverate (sensitivity) and false-positive rate (1 − specificity) values in variables for eachclassifier point, roc predprob and fpr predprob, respectively.


. rocreg LNM predprob, nodots

Bootstrap results Number of obs = 200Replications = 1,000

Nonparametric ROC estimation

Control standardization: empiricalROC method : empirical

Area under the ROC curve

Status : LNMClassifier: predprob

Observed BootstrapAUC Coef. Bias Std. Err. [95% Conf. Interval]

.8335938 .0012301 .0341697 .7666225 .900565 (N).7586384 .8939733 (P).7467949 .8849932 (BC)

We use the rocregplot command to draw the ROC curve for the first model (fig-ure 1).

. rocregplot, plot1opts(msymbol(none) lcolor(black))> scheme(s1color) rlopts(lcolor(black))> ylabel(0(0.25)1, angle(0) format(%3.2f)) ytitle("Sensitivity")> xlabel(0(0.25)1, format(%3.2f)) xtitle("1 - Specificity")> legend( col(1) order(1) label(1 "ROC AUC= 0.83 [0.77;0.90]"))

0.00

0.25

0.50

0.75

1.00

Sen

sitiv

ity

0.00 0.25 0.50 0.75 1.001 − Specificity

ROC AUC= 0.83 [0.77;0.90]

Figure 1. ROC curve for model 1

This curve is depicted by consecutive values from the test (that is, predicted proba-bilities from the prognostic model) to classify an individual as positive (LNM ill = valueis above the threshold) or negative (free of LNM = values are low or under the selectedthreshold) in comparison with a gold-standard criterion. An increase of the cutoff point


would decrease sensitivity and increase specificity. We have added the model’s discrimi-natory capacity—0.83 (95% CI [0.77; 0.90])—to the plot using the legend() option. Wecan also characterize and enhance the model by displaying cutoff point information onthe plot. To add the predicted probabilities, we first list the observations with predictedprobabilities close to 1%, 5%, and 10%. The observations with predicted probabilitiesclosest to 1%, 5%, and 10% are indicated with an arrow.

. sort predprob

. list if (predprob > 0.009 & predprob < 0.013)

LNM predprob predpr~2 _roc_p~b _fpr_p~b

45. 0 .0095514 .0902184 .975 .7312546. 0 .0095826 .0040212 .975 .72547. 0 .0096395 .1032082 .975 .71875 <--48. 0 .0120601 .2063762 .975 .712549. 0 .0128457 .184004 .975 .70625



85. 0 .0490097 .2001811 .925 .4937586. 0 .0498729 .228905 .925 .487587. 0 .0499744 .2237535 .925 .48125 <--88. 0 .0567622 .3069373 .925 .475



102. 0 .0936247 .3390107 .875 .40625 <--103. 0 .1064184 .2700449 .85 .4

Now, we use rocregplot to redraw the ROC curve. We add captions at the sensitivityand 1 − specificity of the indicated observations to the plot using the text() option(figure 2).

. rocregplot, plot1opts(msymbol(none) lcolor(black))> scheme(s1color) rlopts(lcolor(black))> ylabel(0(0.25)1, angle(0) format(%3.2f)) ytitle("Sensitivity")> xlabel(0(0.25)1, format(%3.2f)) xtitle("1 - Specificity")> text(0.99 0.69 "1%", size(small) placement(n))> text(0.94 0.44 "5%", size(small) placement(n))> text(0.89 0.37 "10%", size(small) placement(n))> text(0.98 0.72 "x", size(large) placement(c))> text(0.93 0.48 "x", size(large) placement(c))> text(0.88 0.41 "x", size(large) placement(c))> legend( col(1) order(1) label(1 "ROC AUC= 0.83 [0.77;0.90]"))


1%5%

10%

xx

x

0.00

0.25

0.50

0.75

1.00

Sen

sitiv

ity

0.00 0.25 0.50 0.75 1.001 − Specificity

ROC AUC= 0.83 [0.77;0.90]

Figure 2. ROC curve for model 1 with predicted probabilities closest to 1%, 5%, and10%

The result is that we show the predicted probabilities of developing LNM with theassociated sensitivity and specificity values in the ROC plot. Further, we can describethe sensitivity and 95% CI with respect to its 1 − specificity using the roc() option(figure 3) and describe the false-positive rate and 95% CI for given true-positive valuesusing the invroc() option in the rocreg command (figure 4):

. rocreg LNM predprob, nodots roc(0.71875 0.48125 0.40625)




ROC curve


Observed BootstrapROC Coef. Bias Std. Err. [95% Conf. Interval]

.71875 .975 -.0002608 .0249877 .9260249 1.023975 (N).9166667 1 (P).9090909 1 (BC)

.48125 .925 -.0044098 .0487009 .829548 1.020452 (N).8101673 1 (P)

.8 1 (BC).40625 .875 -.0159269 .0697107 .7383696 1.01163 (N)

.7105263 .9725976 (P)

.7333333 .9767442 (BC)


. rocregplot, btype(p) plot1opts(msymbol(none) lcolor(black))> scheme(s1color) rlopts(lcolor(black))> ylabel(0(0.25)1, angle(0) format(%3.2f)) ytitle("Sensitivity")> xlabel(0(0.25)1, format(%3.2f)) xtitle("1 - Specificity")> legend( col(1) order(1) label(1 "ROC AUC= 0.83 [0.77;0.90]"))

0.00

0.25

0.50

0.75

1.00

Sen

sitiv

ity

0.00 0.25 0.50 0.75 1.001 − Specificity

ROC AUC= 0.83 [0.77;0.90]

Figure 3. ROC curve for model 1 showing sensitivity and 95% CI with respect to its1− specificity

. rocreg LNM predprob, nodots invroc(0.875 0.925 0.975)




False-positive rate


Observed BootstrapinvROC Coef. Bias Std. Err. [95% Conf. Interval]

.875 .40625 .0025914 .0776321 .254094 .558406 (N).1911568 .5642543 (P)

.1875 .5616438 (BC).925 .45625 .0346563 .1047699 .2509048 .6615952 (N)

.3393398 .8106061 (P).3125 .64375 (BC)

.975 .6 .0644339 .1507743 .3044879 .8955121 (N).4086708 .8700964 (P).3806452 .8543046 (BC)


. rocregplot, plot1opts(msymbol(none) lcolor(black))> scheme(s1color) rlopts(lcolor(black))> ylabel(0(0.25)1, angle(0) format(%3.2f)) ytitle("Sensitivity")> xlabel(0(0.25)1, format(%3.2f)) xtitle("1 - Specificity")> legend( col(1) order(1) label(1 "ROC AUC= 0.83 [0.77;0.90]"))

0.00

0.25

0.50

0.75

1.00

Sen

sitiv

ity

0.00 0.25 0.50 0.75 1.001 − Specificity

ROC AUC= 0.83 [0.77;0.90]

Figure 4. ROC curve for model 1 showing false-positive rate and 95% CI for giventrue-positive values

The risk thresholds of the prognostic model allow us to see the effects of select-ing different cutoff points on medical decisions (Royston, Altman, and Sauerbrei 2006).Further, we can display the ROC curves of different classifiers in the same figure, but weshould remember that even when two classifiers agree in sensitivity and specificity, theywill not necessarily have the same cutoff point values. Now, we show an example withthe classifiers for the first and second models. We use rocreg to calculate the sensitivityand 1−specificity values for both models and then list the predicted probabilities wherethe sensitivity and 1− specificity values agree.


. rocreg LNM predprob predprob2, nodots




Area under the ROC curve



.8335938 -.0010747 .0368254 .7614172 .9057703 (N).7531488 .8963632 (P).7459636 .8940851 (BC)

Status : LNMClassifier: predprob2


.8339062 -.0006906 .0357851 .7637688 .9040437 (N).7543176 .8938994 (P).752333 .891875 (BC)

Ho: All classifiers have equal AUC values.Ha: At least one classifier has a different AUC value.

P-value: .9741677 Test based on bootstrap (N) assumptions.

. list if _roc_predprob==_roc_predprob2 & _fpr_predprob==_fpr_predprob2

LNM predprob predpr~2 _roc_p~b _fpr_p~b _roc_p~2 _fpr_p~2

12. 0 .002632 .0291194 1 .93125 1 .93125133. 0 .2548581 .4659569 .75 .2375 .75 .2375

. rocreg LNM predprob predprob2, nodots roc(0.2375)




ROC curve



.2375 .75 -.0050337 .0736913 .6055677 .8944323 (N).6 .8787879 (P)

.6136364 .8857143 (BC)


Status : LNMClassifier: predprob2


.2375 .75 -.0151284 .087794 .5779269 .9220731 (N).5357346 .8837209 (P).5833333 .90625 (BC)

Ho: All classifiers have equal ROC values.Ha: At least one classifier has a different ROC value.

Test based on bootstrap (N) assumptions.

ROC P-value

.2375 1

At a sensitivity of 0.75 and 1−specificity of approximately 0.25, the classifiers agree.We use rocregplot to plot the ROC curves of both classifiers and caption the differingcutoff point values (figure 5).

. rocregplot, scheme(s1color) rlopts(lcolor(black))> plot1opts(msymbol(none) lcolor(black))> plot2opts(msymbol(none) lcolor(black))> ylabel(0(0.25)1, angle(0) format(%3.2f)) ytitle("Sensitivity")> xlabel(0(0.25)1, format(%3.2f)) xtitle("1 - Specificity")> text(0.75 0.19 "25%", size(small) placement(n) color("dknavy"))> text(0.70 0.29 "47%", size(small) placement(n) color("maroon"))> legend( col(1) order(1 2)> label(1 "ROC AUC - Model 1= 0.83 [0.77;0.90]")> label(2 "ROC AUC - Model 2= 0.83 [0.77;0.90]"))

25%47%

0.00

0.25

0.50

0.75

1.00

Sen

sitiv

ity

0.00 0.25 0.50 0.75 1.001 − Specificity

ROC AUC − Model 1= 0.83 [0.77;0.90]ROC AUC − Model 2= 0.83 [0.77;0.90]

Figure 5. ROC curves for models 1 and 2


In this example, the layout of the thresholds allows us to define what predictedprobability value is appropriate to define patients as “low risk” and “nonlow risk” ofdeveloping LNM. This will help us to evaluate, for example, the therapeutic value oflymphadenectomy in “nonlow-risk” patients in clinical trials, as in the role of sentinellymph node biopsy and the prognostic value of metastatic nodal resection in this groupand not in the entire population (Kim et al. 2014). That is, it will help us to identifythose persons who may benefit from being included in a clinical trial, which lets usbetter understand disease mechanisms and select the best treatment options.

3 Conclusion

Adding at least one or more informative cutoff points on the ROC graph or textualdescription of a prognostic model enables one to better characterize the test as well asevaluate the strategic discriminatory capacities. This good practice will help improveour knowledge about the diagnostic and prognostic uses of clinical procedures.

4 ReferencesDeLong, E. R., D. M. DeLong, and D. L. Clarke-Pearson. 1988. Comparing the areas un-der two or more correlated receiver operating characteristic curves: A nonparametricapproach. Biometrics 44: 837–845.

Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiveroperating characteristic (ROC) curve. Radiology 143: 29–36.

Kim, D.-Y., S.-H. Shim, S.-O. Kim, S.-W. Lee, J.-Y. Park, D.-S. Suh, J.-H. Kim, Y.-M.Kim, Y.-T. Kim, and J.-H. Nam. 2014. Preoperative nomogram for the identificationof lymph node metastasis in early cervical cancer. British Journal of Cancer 110:34–41.

Moons, K. G. M., D. G. Altman, J. B. Reitsma, J. P. A. Ioannidis, P. Macaskill, E. W.Steyerberg, A. J. Vickers, D. F. Ransohoff, and G. S. Collins. 2015. Transparentreporting of a multivariable prediction model for individual prognosis or diagnosis(TRIPOD): Explanation and elaboration. Annals of Internal Medicine 162: W1–W73.

Royston, P., D. G. Altman, and W. Sauerbrei. 2006. Dichotomizing continuous predic-tors in multiple regression: A bad idea. Statistics in Medicine 25: 127–141.

Zweig, M. H., and G. Campbell. 1993. Receiver-operating characteristic (ROC) plots: Afundamental evaluation tool in clinical medicine. Clinical Chemistry 39: 561–577.

About the authors

David Lora is a biostatistician at the Department of Clinical Research, Clinical Trials, ClinicalEpidemiology and Bioinformatics, Hospital Universitario 12 de Octubre, Madrid, Spain. Heis also a researcher at the Spanish Epidemiology and Public Health Consortium. His researchinterests lie in the intersection of computer science and statistical methods with applicationsin the areas of prognosis and cancer research.


Israel Contador, PhD, is a professor of psychology at the Department of Basic Psychology,Psychobiology and Methodology of Behavioral Sciences, University of Salamanca, Spain. Heis particularly interested in clinical neuropsychology, and his research covers topics such as theprognosis of mild cognitive impairment and the early detection of dementia.

Jose F. Perez-Regadera, PhD, is the head of the Department of Radiation Oncology, HospitalUniversitario 12 de Octubre, Madrid, Spain. He is an associate professor in the Department ofRadiation Oncology, Complutense University, Madrid, Spain. His research focuses primarilyon gynaecological cancer, neuro-oncology, and brachytherapy.

Agustın Gomez de la Camara, PhD, is a senior investigator at and the head of the Depart-ment of Clinical Research, Clinical Trials, Clinical Epidemiology and Bioinformatics, HospitalUniversitario 12 de Octubre, Madrid, Spain. He is a professor at the Department of Medicineat Madrid University. He is a member of the executive group of the Spanish Clinical ResearchNetwork and the European Clinical Research Network and a senior researcher at the SpanishEpidemiology and Public Health Consortium. His research focuses on the efficacy of medicaland public health interventions and diagnostic and prognostic models in chronic diseases.


Implementing factor models for unobservedheterogeneity in Stata

Miguel SarzosaPurdue UniversityWest Lafayette, IN

[email protected]

Sergio UrzuaUniversity of Maryland

College Park, MD

and National Bureau of Economic ResearchCambridge, MA

[email protected]

Abstract. We introduce a new command, heterofactor, for the maximum likeli-hood estimation of models with unobserved heterogeneity, including a Roy model.heterofactor fits models with up to four latent factors and allows the unobservedheterogeneity to follow general distributions. Our command differs from Stata’ssem command in that it does not rely on the linearity of the structural equationsand distributional assumptions for identification of the unobserved heterogeneity.It uses the estimated distributions to numerically integrate over the unobservedfactors in the outcome equations by using a mixture of normals in a Gauss–Hermitequadrature. heterofactor delivers consistent estimates, including the unobservedfactor loadings, in a variety of model structures.

Keywords: st0431, heterofactor, unobserved heterogeneity, factor models, Roymodel, maximum likelihood, numerical integration

1 Introduction

Unobserved heterogeneity has become a particularly relevant topic in modern appliedmicroeconomics (Keane and Wolpin 1997; Cameron and Heckman 1998, 2001; Carneiro,Hansen, and Heckman 2003; Heckman, Stixrud, and Urzua 2006; Urzua 2008; Sarzosaand Urzua 2015). However, its adequate analysis requires structural models often tai-lored to the needs of each particular research project. This reflects the fact that theresearch community lacks the tools for the systematic inclusion of unobserved hetero-geneity in practical analyses. However, advances in computational capability have fa-cilitated the estimation of structural models so that it is now conceivable to run someof these models in standard computers.

In this article, we discuss the implementation of factor models when estimatingstructural equations in the presence of unobserved heterogeneity and a new command,heterofactor, for fitting such models. Our routines allow the calculation of consistentestimates, including the loadings of the unobserved factors, in multiple structures. Thestructural models that we address are related to the model used in Carneiro, Hansen,and Heckman (2003) and Heckman, Stixrud, and Urzua (2006) and first introduced byJoreskog and Goldberger (1972) and Cameron and Heckman (1998, 2001). The mostsalient feature of these models is their factor structure that provides a parsimonious


198 Implementing factor models for unobserved heterogeneity in Stata

specification to identify unobserved heterogeneity and its effects on the outcomes ofinterest. Unlike Stata’s sem command, our command does not rely on the linearity ofthe structural equations and distributional assumptions for identification. Instead, thedistributions of the unobserved factors are identified nonparametrically following thecontributions of Kotlarski (1967).

As shown below, the structural models we refer to here have a variety of applica-tions. Recently, the treatment-effect literature has embraced these models not onlybecause they provide a method for estimating treatment effects depending on the levelof unobservables but also because controlling for unobserved heterogeneity allows thesimulation of counterfactuals. This method can also be used to estimate the parametersof a measurement system that contains unobserved attributes. In particular, this settingcould relate to the skills literature, where cognitive and noncognitive skills are unob-servable characteristics of individuals that influence their decisions and outcomes laterin life (see Heckman et al. [2011]; Prada and Urzua [2013]; Sarzosa and Urzua [2015]).

This article is organized as follows. In section 2, we review the factor model struc-ture and discuss the mechanisms that allow us to identify key parameters. In sec-tion 3, we discuss the implementation of our estimation routines, including the syntaxof heterofactor. In section 4, we provide examples using both simulated data and datafrom the National Longitudinal Survey of Youth (NLSY79). In section 5, we conclude.

2 Factor model estimation

The type of structural model that our command can handle can be described as aset of measurement systems that are linked by a factor structure. This is the typeof model considered by Hansen, Heckman, and Mullen (2004), Heckman, Stixrud, andUrzua (2006), Heckman and Navarro (2007), and Sarzosa and Urzua (2015). In a gen-eral setup, suppose we face the following linear system,

Y = XY βY +UY

where Y is an M × 1 vector of outcome variables, XY is a matrix with all observablecontrols, and UY is a vector that contains the unobservables for each outcome equationwith a factor structure of the form UY = ΛYΘ+ eY. Hence, we can expand the linearsystem to

Y = XY βY +ΛYΘ+ eY (1)

where Θ is a q × 1 vector that contains the q dimensions of unobserved heterogeneity(that is, q latent factors), ΛY is anM×q matrix that contains the factor loadings for eachtype of unobserved heterogeneity, and eY is a vector of error terms with distributionsfeym (·) for every m = 1, . . . ,M . We assume that eY ⊥ (Θ,XY ) and also that eyi ⊥ eyj

for i, j = 1, . . . ,M and i �= j. Furthermore, we assume the vector Θ has the associateddistribution fθ (·). Hence, the econometrician does not observe the actual value of Θfor each observation. Instead, he or she knows or estimates the distributions from whichthey are drawn.

M. Sarzosa and S. Urzua 199

The measurement system (2) can be used to identify the components of matrixΛY, albeit under very stringent constraints and assumptions (Aakvik, Heckman, andVytlacil 2000). As indicated by Carneiro, Hansen, and Heckman (2003), the estimationsthat come from the factor structure will gain interpretability and will require fewerrestrictions for its identification if a measurement system—also linked by the samefactor structure—is adjoined to the system (1). This system can be used to identifythe distributional parameters of the unobserved factors. This adjoined measurementsystem would have the form

T = XTβT +ΛTΘ+ eT (2)

where T is an L× 1 vector of measurements (for example, test scores), XT is a matrixwith all observable controls for each measurement, and ΛT is an L×q matrix that holdsthe loadings of the q unobserved factors. Again we assume that (Θ,XT ) ⊥ eT and thatall the elements of the L × 1 vector eT are mutually independent and have associateddistributions feh (·) for every h = 1, . . . , L.1

2.1 Identification of the adjunct measurement system

heterofactor can handle up to four factors. However, for presentation purposes, wewill demonstrate estimation using a two-factor model.2 In the two-factor model, (1)becomes

Y = XY βY +αY,AθA +αY,BθB + eY (3)

and (2) becomes

T = XTβT +αT,AθA +αT,BθB + eT (4)

To explain how the parameters of the adjunct measurement system (4) are identified,let’s focus on the matrix Cov (T |XT ), whose elements in the diagonal are of the form

Cov (Ti, Ti |XT ) =(αTi,A

)2σ2θA + αTi,AαTi,BσθAθB +

(αTi,B

)2σ2θB + σ2

eTi (5)

and in the off diagonal are of the form

Cov (Ti, Tj |XT ) = αTi,AαTj ,Aσ2θA +

(αTi,AαTj ,B + αTi,BαTj ,A

)σθAθB

+ αTi,BαTj ,Bσ2θB (6)

As it is, the model is underidentified (Carneiro, Hansen, and Heckman 2003). There-fore, identification requires some assumptions. First, we need θA ⊥ θB , so σθAθB = 0 in

1. For the maximum likelihood procedure we describe below, we assume feh (·) are normal distribu-tions. This is a relatively mild assumption because these come from the idiosyncratic variationthat remains after controlling for observed controls and unobserved heterogeneity.

2. The extension to three and four factors is straightforward.


(5) and (6).3 The second assumption relates to the minimum number of measurementswe need to have per factor. Notice that the diagonal elements of Cov (T |XT ) havethe variances of the idiosyncratic errors, while the off-diagonal elements do not. Hence,once we identify the rest of the model parameters, the diagonals will identify σ2

eThfor

h = 1, . . . , L. Then, following Carneiro, Hansen, and Heckman (2003), we can use the{L(L − 1)}/2 off-diagonal elements to identify the variances of the factors and theirassociated factor loadings. If we let k be the number of factors we are using in themodel—in the present example, k = 2—then we have k × L loadings. It should thenfollow that

L (L− 1)

2≥ Lk + k thus

L (L− 1)

2 (L+ 1)≥ k

In our example where k = 2, this restriction tells us that L ≥ 6. That is, we needat least six test scores to identify the parameters of the measurement system with twofactors.

The next step for identification is to acknowledge that latent factors have no metricor scale of their own. Hence, we need to normalize to unity one loading per factor, andthe estimation of all other loadings should be interpreted as relative to those used asnumeraire.4 To incorporate this into our notation, we expand (4) into k blocks of sizemκ such that

∑κ mκ = L. That way, without loss of generality, we set the first loading

in the first equation in each block to one. In our example, we get two blocks, a and b.That is, we write (4) as

Ta = XTaβTa

+αTa,AθA +αTa,BθB + eTa

Tb = XT bβT b

+αTb,AθA +αTb,BθB + eTb

3. Using higher moments of the distributions, Heckman and Navarro (2007) show that identificationcan be achieved even if the factor independence assumption is relaxed. Also, Sarzosa (2015) showsthat models with correlated factors can be identified if additional restrictions are imposed on thefactor loadings structure.

4. These normalizations reduce by k the number of parameters to estimate. Hence, L, the numberof measurements needed, is given by {L (L− 1)}/2 ≥ Lk + k − k, which simplifies to L ≥ 2k + 1.Therefore, the presence of two factors in (3) implies that there should be at least five measures in(4). Throughout the routines in this article, we will assume that we have at least 3k measurements.


with αTa1 ,A = 1 and αT b

1 ,B = 1, where Tκ1 indicates the first test in block κ and Tκ

i

indicates all tests different from the first one in block κ. Then the off-diagonal elementsof the Cov (T |XT ) matrix follow one of the following cases,

Cov(T a1 , T

bi |XT

)= αT b

i ,Aσ2θA + αTa

1 ,BαT bi ,Bσ2

θB (7)

Cov(T ai , T

bi |XT

)= αTa

i ,AαT bi ,Aσ2

θA + αTai ,BαT b

i ,Bσ2θB

Cov(T a1 , T

b1 |XT

)= αT b

1 ,Aσ2θA + αTa

1 ,Bσ2θB (8)

Cov(T ai , T

b1 |XT

)= αTa

i ,AαT b1 ,Aσ2

θA + αTai ,Bσ2

θB (9)

Cov (Tκ1 , T

κi |XT ) = αTκ

i ,Aσ2θA + αTκ

1 ,BαTκi ,Bσ2

θB (10)

Cov(Tκi , T

κj |XT

)= αTκ

i ,AαTκj ,Aσ2

θA + αTκi ,BαTκ

j ,Bσ2θB (11)

for κ = {a, b} and i �= j. These elements show that we cannot identify σ2θA and σ2

θB

and the loadings without further restrictions. Carneiro, Hansen, and Heckman (2003)suggest that the first restrictions should be αTa

1 ,B = 0, αTa2 ,B = 0, and αTa

3 ,B = 0. Thatis, the first three tests in the first block can be affected by only the first factor. Then

Cov (T a1 , T

a2 |XT ) = αTa

2 ,Aσ2θA

Cov (T a1 , T

a3 |XT ) = αTa

3 ,Aσ2θA

Cov (T a2 , T

a3 |XT ) = αTa

2 ,AαTa3 ,Aσ2

θA

Then

Cov (T a2 , T

a3 |XT )

Cov (T a1 , T

a2 |XT )

= αTa3 ,A,

Cov (T a2 , T

a3 |XT )

Cov (T a1 , T

a3 |XT )

= αTa2 ,A,

Cov (T a2 , T

κi |XT )

Cov (T a1 , T

a2 |XT )

= αTκi ,A

and hence, we identify σ2θA from

Cov (T a1 , T

a3 |XT ) =

Cov (T a2 , T

a3 |XT )

Cov (T a1 , T

a2 |XT )

σ2θA

Identification of the loadings and variances associated with the subsequent factorsrequires fewer restrictions. Note that under the assumption of αTa

1 ,B = 0, (7) and (8)become

Cov(T a1 , T

bi |XT

)= αT b

i ,Aσ2θA and Cov

(T a1 , T

b1 |XT

)= αT b

1 ,Aσ2θA

respectively. Given that we already know σ2θA , we can identify all the loadings associated

with the first factor in all the subsequent blocks. This allows us to use (9), (10), and

(11) when κ = b to identify σ2θB and αT b,B because we already know the first part of

the right-hand side of those expressions.


Finally, having identified all the parameters from the off-diagonal elements of theCov (T |XT ) matrix, we can identify the parameters in the diagonal. From (5) and therestrictions we have imposed, we find that the typical diagonal element of Cov (T |XT )is

Cov (Ti, Ti |XT ) =(αTi,K

)2σ2θK + σ2

eTi

for K = {A,B}. Given that we have already identified the first part of the right-handside of this equation, we can use the diagonal elements to identify σ2

eTi.

Now that we have identified all the loadings, factor variances, and measurementresidual variances and that we know that the means of θA, θB , and eT are finite—in fact, they are equal to zero because we allow the measurement system (4) to haveintercepts—we can invoke the Kotlarski Theorem (Kotlarski 1967)5 to use the manifestvariables T to nonparametrically identify the distributions of fθA (·) and fθB (·).

2.2 Loadings structures in the measurement system

We have shown that identification requires some restrictions in the loadings structure.The more general structure requires one normalization per factor and the first threemeasurements of the first block to be affected only by the first factor. In our examplewith two factors and using three measurements per block, the loadings structure can berepresented as

ΛT =

⎡⎢⎢⎢⎢⎢⎢⎣αT1,A αT1,B

αT2,A αT2,B

αT3,A αT3,B

αT4,A αT4,B

αT5,A αT5,B

αT6,A αT6,B

⎤⎥⎥⎥⎥⎥⎥⎦ =

⎡⎢⎢⎢⎢⎢⎢⎣αT1,A 0αT2,A 01 0

αT4,A αT4,B

αT5,A αT5,B

αT6,A 1

⎤⎥⎥⎥⎥⎥⎥⎦ (12)

Provided that the loadings structure fulfills the required restrictions, the choice ofstructure depends entirely on the available data. The triangular structure (12) allowsfor a block of measures that depend on both factors. For instance, grades and educationachievement scores depend not only on a cognitive factor but also on a noncognitiveone.

5. The Kotlarski Theorem states that if there are three independent random variables, eT1, eT2

, andθ, and we define T1 = θ + eT1

and T2 = θ + eT2, the joint distribution of (T1,T2) determines the

distributions of eT1, eT2

, and θ, up to one normalization. Given that we have already identified allthe loadings, we can write (4) in terms of Tτ = θ + eTτ by dividing both sides by the loading. Seemore details in Carneiro, Hansen, and Heckman (2003).


If data permit, the researcher can use a more restrictive loadings structure in whichonly one factor affects each block of measurements. It will take the following form:

ΛT =

⎡⎢⎢⎢⎢⎢⎢⎣αT1,A αT1,B

αT2,A αT2,B

αT3,A αT3,B

αT4,A αT4,B

αT5,A αT5,B

αT6,A αT6,B

⎤⎥⎥⎥⎥⎥⎥⎦ =

⎡⎢⎢⎢⎢⎢⎢⎣αT1,A 0αT2,A 01 00 αT4,B

0 αT5,B

0 1

⎤⎥⎥⎥⎥⎥⎥⎦ (13)

This type of loadings structure will increase the speed of estimation because it re-quires the estimation of fewer parameters.

2.3 Estimation

We fit the model (4) using maximum likelihood estimation. The likelihood is

L =

N∏i=1

∫ ∫ {fe1

(XT1

, T1, ζA, ζB

)× · · · × feL(XTL

, TL, ζA, ζB

) }dFθA

(ζA)dFθB

(ζB)

where we integrate over the distributions of the factors because of their unobservablenature, obtaining βT , αT,A, αT,B , FθA(·), and FθB (·). All the integrals are calcu-lated numerically using a Gauss–Hermite quadrature within a mixture of normals (Judd1998). This guarantees the flexibility required to appropriately re-create the unobserveddistributions in the estimation. Our routine does not impose normality on FθA (·) andFθB (·). Instead, it assumes they are distributed according to mixtures of two normaldistributions. Therefore, we estimate the distributional parameters of the normals andthe mixing probability. This way, we can identify an array of possible functional formsfor FθA (·) and FθB (·).

Having identified the distributional parameters of FθA (·) and FθB (·) from (4), wecan proceed to fit model (3). The likelihood function here is

L =N∏i=1

∫ ∫ {fey1

(XY1

, Y1, ζA, ζB

)× · · · × feyM(XYM

, YM , ζA, ζB) }

dFθA

(ζA)dFθB

(ζB)

This maximum likelihood estimation will yield βY , αY,A, and αY,B .6

6. In this two-step procedure, we use a limited-information maximum likelihood and correct thevariance–covariance matrix of the second stage, incorporating the estimated variance–covariancematrix and gradient of the first stage (Greene 2012).


Also, the two steps presented above can be joined and calculated in one likelihoodof the form

L =N∏i=1

∫ ∫ {fey1

(XY1

, Y1, ζA, ζB

)× · · · × feyM(XYM

, YM , ζA, ζB)

×fe1(XT1

, T1, ζA, ζB

)× · · · × feL(XTL

, TL, ζA, ζB

) }dFθA

(ζA)dFθB

(ζB)

However, the two-step procedure is less computationally burdensome, especially if weare fitting a model with two or more factors.7

2.4 The treatment-effect setting: A Roy model

In this subsection, we discuss the special case of model (3), where there is a binarytreatment (for example, to go to college) and a later outcome (for example, wagesearned at age 30). This is one of the settings where the factor structure has receivedmore attention (Heckman, Stixrud, and Urzua 2006; Urzua 2008; Heckman et al. 2011;Prada and Urzua 2013). The advantage that the factor structure has here is that poten-tial outcomes are separable in observables and unobservables. That is, conditional onθ and XY , potential outcomes are independent because any selection on unobservablesis already accounted for.8 This allows researchers to simulate observationally identicalcounterfactuals, permitting the calculation of treatment parameters like average treat-ment effect, average treatment effect for the treated, and average treatment effect onthe untreated, for every level of the unobserved heterogeneity.

Consider a model of potential outcomes inspired by the Roy model (Roy 1951).Individuals must choose between two sectors such as treated and not treated or highschool and college. The choice is based on the decision model

D = �(XDβYD + αYD,AθA + αYD,BθB + eD > 0

)where � (A) denotes an indicator function that takes a value of 1 if A is true. ThenD is the binary treatment variable, and XD represents a set of exogenous observablevariables. Depending on the selected sector (that is, D = 1 or D = 0), individualswill experience different outcomes. We denote these potential outcomes by Y1 and Y0,respectively. Y1 can represent, for instance, the wages earned at age 30 by a collegegraduate, while Y0 represents the wages earned at age 30 by a person that did not goto college. Therefore, in a treatment-effect setting, the system of equations (3) willrepresent both potential outcomes and the choice equation. That is, Y = (Y1, Y0, D)

′.

Here the system would be

7. For one-factor models, (3) and (4) become Y = XY βY +αY θ+ eY and T = XT βT +αTθ+ eT,respectively. The likelihood function would be

L =N∏i=1

∫ ∫ {fey1

(XY1

, Y1, ζ)× · · · × feyM

(XYM

, YM , ζ)

×fe1(XT1

, T1, ζ)× · · · × feL

(XTL

, TL, ζ) }

dFθ (ζ)

8. Recall that eyi ⊥ eyj for i, j = 1, . . . ,M and i �= j.


Y1 =

{XY β

Y1 + αY1,AθA + αY1,BθB + eY1 if D = 1

0 if D = 0(14)

Y0 =

{XY β

Y0 + αY0,AθA + αY0,BθB + eY0 if D = 0

0 if D = 1(15)

D = �(XDβYD + αYD,AθA + αYD,BθB + eD > 0

)(16)

The second-step likelihood function is given by

L =

N∏i=1

∫ ∫ {fY0

(XY , Y0, ζ

A, ζB)1−D

fY1(XY , Y1, ζ

A, ζB)D

×fD(XD, YD, ζA, ζB

) }dFθA

(ζA)dFθB

(ζB)

Thus we obtain different parameter values for each potential outcome. That is, themeasures of the effects of observable and unobservable features on the outcome differdepending on D.

2.5 Probit and normal regressions with unobserved heterogeneity

An especial case related to the one presented above is one in which the vector of outcomesis composed by only the choice or treatment decision. That is, vector Y = D, meaningthere are no potential-outcome equations in the second step. The outcome equation tobe estimated is (17). The complete likelihood would be

L =N∏i=1

∫ ∫ {fD

(XD, YD, ζA, ζB

)×fe1

(XT1

, T1, ζA, ζB

)× · · · × feL(XTL

, TL, ζA, ζB

) }dFθA

(ζA)dFθB

(ζB)

(17)

This structure should be interpreted as a probit of D on XD that allows for unobservedheterogeneity.

Similarly, we may have a case where there is no choice or treatment equation, andthere is only one outcome Y . Then the outcome vectorY = Y and the outcome equationof interest will be

Y = XY βY + αY,AθA + αY,BθB + eY

Here the complete likelihood would be

L =

N∏i=1

∫ ∫ {fey

(XY , Y, ζ

A, ζB)

×fe1(XT1

, T1, ζA, ζB

)× · · · × feL(XTL

, TL, ζA, ζB

) }dFθA

(ζA)dFθB

(ζB)

(18)

This structure should be interpreted like a normal regression of Y on XY that allowsfor unobserved heterogeneity.


3 The heterofactor command

3.1 Syntax

The syntax of the command is as follows:

heterofactor depvar varlist X[if] [

in], scores(varlist T)

indvarsc(varlist Q)[treatind(varname D) instrum(varlist Z) exp1(varlist)

exp2(varlist) exp3(varlist) exp4(varlist) factors(#) fdistonly

scndstponly choiceonly nochoice triangular numf1tests(#)

numf2tests(#) nodes(#) twostep initialreg nohats sigmamixt11(#)

sigmamixt12(#) sigmamixt21(#) sigmamixt22(#) sigmamixt31(#)

sigmamixt32(#) sigmamixt41(#) sigmamixt42(#) mumixt1(#)

mumixt2(#) mumixt3(#) mumixt4(#) mixtprob1(#) mixtprob2(#)

mixtprob3(#) mixtprob4(#) st2(#) st3(#) st4(#) st5(#) st6(#)

st9(#) st12(#) resvar2(varname) resvar3(varname) resvar4(varname)

resvar5(varname) resvar6(varname) resvar9(varname) resvar12(varname)

firstloads(matname) firstgrad(matname) firstvarmat(matname)

level(#) maximize options]

heterofactor is implemented for Stata 11 by using the d0 estimator of ml. Alllikelihood routines are coded in Mata. These commands share the same features of mostof the Stata estimation commands that use maximum likelihood, including access to thelast estimation results and the options for the maximization process (see [R]maximize).

3.2 Options

scores(varlist T) specifies the variables that contain the scores of the measurementsystem [that is, vector T in (4)]. There must be at least three variables specifiedin varlist T per factor. Users may specify more than three variables per factorfor models with one or two factors. If the model has three or four factors, usersmust specify three variables for the third and fourth factors. The order of varlist Tmatters. Users must list variables in blocks, where each block should be affectedby the same factor or factors. Identification requires one loading normalization perfactor. Thus the loadings of the last test score in each block will be normalized. Forinstance, if the model chosen has four factors and varlist T contains exactly threevariables per factor, then the loadings of the third, sixth, ninth, and twelfth variablewill be normalized to one. This arrangement is somewhat different if the triangularoption is specified. In that case, the factor structure is the one presented in (12).That is, if f is the number of factors, the first f−1 sets of three measures provided invarlist T should depend on only one factor each, while the last set of three measureswill be affected by all factors. scores() is required.


indvarsc(varlist Q) specifies the observed variables that affect all test-score regressions[that is, XT in (4)]. varlist Q can be the same as varlist X, but users must specifyboth. There is no limit for the number of variables that can be specified in varlist Q.If users want to specify different controls for each set of three measures, the exp1(),exp2(), exp3(), and exp4() options should be used. indvarsc() is required.

treatind(varname D) specifies the choice variable when there is a choice equationin the model. varname D represents variable D in (16). varname D indicates theassignment to treatment, and it needs to be a binary variable.

instrum(varlist Z) specifies the observed variables that affect the binary choice equation[that is, XD in (16)].

exp1(varlist), exp2(varlist), exp3(varlist), and exp4(varlist) include more controlsin each set of three measures in addition to those specified in varlist Q, which arecommon to all. Users can add regressors that are believed to affect only one set andnot the other ones.

factors(#) specifies the number of factors used in the model. # can be any integerbetween 1 and 4. The default is factors(1).

fdistonly specifies to estimate only the first step. Stata will estimate only the factors’distribution parameters and factor loadings on the test scores. No outcome equationwill be estimated. However, depvar and varlist X should be provided even if theyare not going to be used.

scndstponly specifies to estimate only the second step. Stata will estimate only theoutcome equations. No factor-distribution identification takes place. If scndstponlyis specified, all the parameters that describe the factors’ distributions Fθ (·), theresiduals of varlist T, and their variances and loadings should be provided by usersthrough additional options. This option is useful if users did the first step beforeand now need only to estimate a new set of outcome equations based on the samefactors.

choiceonly specifies that the model to be estimated in the second step include only achoice equation. That is, it will estimate only the equation described by varlist Dand varlist Z. No other outcomes are estimated, including the potential-outcomeequations (14) and (15). The estimation using this option should be interpreted asrunning a probit estimation allowing for the presence of unobserved heterogeneity.

nochoice specifies that the outcome equations in the model not include the binarytreatment equation. It indicates to Stata that the model is not of the treatment-effect nature described in subsection 2.4. This likelihood is described in (18) andhas a unique outcome equation, Y = XY β

Y + αY,AθA + αY,BθB + eY . This shouldbe interpreted as a linear regression allowing for the presence of unobserved hetero-geneity.

triangular indicates that the measurement system in the first step has a triangularloading structure. If triangular is specified, the structure assumed for the mea-surement system is one that has one block of scores that depends on all factors; the


other blocks of scores depend on only one factor each. This option is valid only forthe two-factor case. Note that this option increases the computational time neededfor calculation. If triangular is not specified, the loading structure assumed is theone presented in (13).

numf1tests(#) and numf2tests(#) specify the number of tests used in each block ofvarlist T. These options should be specified only if the number of tests is differentfrom three. For instance, if the user lists seven variables in varlist T, numf1tests(4)and numf2tests(3) are specified to indicate that the first four variables are in thefirst block and the last three variables are in the second block.

nodes(#) defines the number of points used in the Gauss–Hermite quadrature forintegration. The number defined can be either 4 or 10. While using 10 nodesprovides more accuracy, integrating with 4 nodes is faster.

twostep divides estimation into two parts: the factor-identification part (4) and theoutcome-equations part (3). If the factor(#) option is specified with # > 1,twostep is assumed. If the factor(1) option is specified, twostep is not used.

initialreg specifies to calculate initial values using ordinary least-squares regressionof each equation separately. These initial values are different from the ones providedby Stata in the absence of the initialreg option.

nohats specifies that estimated factor values θ not be saved in the data. This optionspeeds the command execution, especially in big datasets.

Sometimes, the user needs to fit several models that use the same factor structure.In that case, the user needs to run the first step only once and can save time by runningall the required models using only the second step. To do so, the user needs to specifyscndstponly and the parameters that describe the distributions, the residuals, thevariance–covariance matrix, and the gradient of the first stage. The distributions Fθ (·)of the factors are obtained using a mixture of two normals. To fully describe eachfactor’s distribution, we need the standard deviation, the mean of each of the normals,and the weight (probability) with which the two normals are combined.

If scondstponly is specified, the user needs to provide the parameters that describethe distributions. The parameters should be provided using the following:

sigmamixt11(#), sigmamixt12(#), sigmamixt21(#), sigmamixt22(#),sigmamixt31(#), sigmamixt32(#), sigmamixt41(#), and sigmamixt42(#) spec-ify the standard deviations of the two distributions used in the mixture of normalsthat describe the distribution of the first and specify the standard deviations of thetwo distributions used in the mixture of the first, second, third, and fourth factors.Given the transformations done in the code to ensure the parameters remain in thevalid range, the user needs to provide the natural logarithm of the actual standarddeviations. That is, if the standard deviation of the first normal of the first mixtureis σ11 = 1, the sigmamixt11(0) option should be specified. Note that the values dis-played in the output in the first stage are untransformed, and they must be providedwith these options.


mumixt1(#), mumixt2(#), mumixt3(#), and mumixt4(#) specify the mean of thefirst part of the mixture of each factor. The factor is centered at zero, so themean of the second part of the mixture can be obtained from the equation �μ1 +(1− �)μ2 = 0, where � is the probability used to combine the mixtures, provided byexp(mixtprob1(#))/{1 + exp(mixtprob1(#))} for factor 1, exp{mixtprob2(#)}/{1 + exp(mixtprob2(#))} for factor 2, and so on.

mixtprob1(#), mixtprob2(#), mixtprob3(#), and mixtprob4(#) specify the prob-ability used to combine the two normal distributions into the mixture of normalsfor the distribution of each factor. As with standard deviations, the value inmixtprob#(#) is the logit transformation of the actual mixing probability.

When scndstponly is specified, the user also needs to specify some of the variableswhere the estimated residuals of varlist T are stored and their variances. This is doneusing the following:

st2(#), st3(#), st4(#), st5(#), st6(#), st9(#), and st12(#) allow the userto provide the standard deviations of the residuals specified in resvar#(). Thesevariances are given in the first step. They should also be provided using the loga-rithmic transformation. st2(#), st4(#), and st5(#) have to be provided only inthe two-factor case. These options should be used only when scndstponly has beenspecified.

resvar2(varname), resvar3(varname), resvar4(varname), resvar5(varname),resvar6(varname), resvar9(varname), and resvar12(varname) contain theresiduals of the test equations’ estimations. resvar2(varname) andresvar3(varname) refer to the second to last and last variable of the first block oftests. resvar4(varname), resvar5(varname), and resvar6(varname) refer to thethird to last, the second to last, and last variable of the second block of tests.resvar9(varname) and resvar12(varname) refer to the last variable of the thirdblock and the last variable of the fourth block. These residuals are given by thefirst step under the names res2, res3, res4, res5, res6, res9, andres12. resvar2(varname), resvar4(varname), and resvar5(varname) have to

be provided only in the two-factor case. These options should be used only whenscndstponly has been specified.

When scndstponly is specified, the user also needs to specify some of the matricesreported in the first stage to correct the standard errors of the second stage becausethere was a previous step in the estimation. This is done using the following options:

firstloads(matname) provides the name of the matrix where the loadings of the firststage are stored.

firstgrad(matname) provides the name of the matrix where the gradient of the firststage is stored.

firstvarmat(matname) provides the name under which the variance–covariance ma-trix of the first stage is stored.


level(#) specifies the confidence level, as a percentage, for confidence intervals of thecoefficients. The default is level(95) or as set by set level; see [R] level.

maximize options: difficult, iterate(#),[no]log, trace, gradient, showstep,

hessian, shownrtolerance, tolerance(#), ltolerance(#), gtolerance(#),nrtolerance(#), nonrtolerance; see [R] maximize. These options are seldomused.

3.3 Further remarks

1. heterofactor typically requires relatively large samples, especially if it is used ina setting with more than one factor. The structural model is estimating not onlyseveral parameters (that is, βY , αY,A, αY,B , βT , αT,A, αT,B) but also the distribu-

tions of unobservable attributes Fθ (·).

2. heterofactor is computationally demanding because of the nonparametric waythe unobserved factors’ distributions are estimated. Numerical integration in acomplex likelihood function, together with the numerical calculation of the gradi-ent and Hessian during optimization, puts pressure on the computational resourcesavailable. Thus the estimation time increases with sample size, the number of ob-servable controls, and the number of nodes used in the Gauss–Hermite quadraturefor the numerical integration. For instance, it took 302.51 seconds for a MacBookPro with 3.1 GHz Intel Core i7 and 16 GB memory to estimate the one-factorexample presented in section 4.1. The same machine took 2155.38 seconds to fitthe two-factor model presented in section 4.1.

3. There are trade-offs between estimation time and precision and smoothness andconcavity of the likelihood function. Using larger samples and more nodes in-creases precision but also increases estimation time. Analogously, using moreobservable controls increases smoothness and concavity in the likelihood function.This is because the factors are being estimated from the residuals left after con-trolling for the observed variables. Therefore, a “cleaner” residual leads to aneasier estimation of the factors and thus a smoother likelihood to maximize. How-ever, more observable controls implies higher dimensions of the Hessian of thelikelihood.

4. As explained in subsection 3.1, the estimated standard deviations and mixingprobabilities are in transformed terms. This is done to avoid the optimizationroutine using unfeasible values. In particular, standard deviations should alwaysbe positive, and the mixing probabilities should always be in the (0, 1) interval.Therefore, the standard deviations are transformed using the exponential function,and the mixing probabilities are transformed using a logit function. Thus, if s isthe number provided by the estimation results for the standard deviations, thenthe actual standard deviation value is σ = exp (s). If the number provided bythe estimation results for the mixing probability is p, then the actual mixingprobability value is � = exp (p) /{1 + exp (p)}.


5. Here are some practical recommendations for using heterofactor:

a. When doing the first exploratory analyses, users should try using few in-tegration nodes. This will decrease estimation time but will give a verywell-informed indication on how the estimations will look.

b. Given that the likelihood function is complex, convergence can be difficult.Recall that more control variables (sensible and informative) facilitate con-vergence. When convergence has been elusive, users are encouraged to use allthe available tools in maximum likelihood estimation to improve the chancesof convergence (see [R] maximize). For instance, Stata’s maximum likeli-hood option difficult can be helpful in this case. heterofactor also offersthe initialreg option, which provides a set of initial values different fromthe ones provided by Stata. As in any complicated likelihood, convergencemight depend on the initial values. Trying with different sets of initial valuesis encouraged when convergence is difficult.

6. heterofactor requires the matdelrc command (Cox 1999), which can be down-loaded by typing search matdelrc in the Command window.

7. The heterofactor routines are written in Mata and thus compiled in a librarycalled lheterofactor.mlib (see [M-3] mata mlib). The library must be placedin a folder where Stata will look for it. However, before you call the libraryfor the first time, you must type mata mlib index in the Mata prompt. See[M-3] mata mlib for details.

8. heterofactor creates the following variables every time it runs the first stage:

a. res#: the estimated residuals for each variable in varlist T (that is, res =T−XTβ

T ), where # is given according to the order in varlist T

b. mixt#: the random draws of the estimated distributions of θ, which providea way to explore the shape of the distributions, where # represents the factornumber


3.4 Stored results

heterofactor saves numerous results in ereturn. The ones produced during the firststage are crucial because they will be used in a future second-stage estimation, if needed.For instance, for a two-factor model, the main stored results after the first stage are thefollowing:

Scalarse(N) number of observationse(sf11) standard deviation of first normal used in mixture defining first factore(sf12) standard deviation of second normal used in mixture defining first factore(mu11) mean of first normal used in mixture defining first factore(p1) mixing probability for mixture of normals defining first factore(sf21) standard deviation of first normal used in mixture defining second factore(p2) mixing probability for mixture of normals defining second factore(mu21) mean of first normal used in mixture defining second factore(sf22) standard deviation of second normal used in mixture defining second factor

Matricese(b) coefficient vectore(ilog) iteration loge(gradient) gradient vectore(V) variance–covariance matrix of the estimatorse(g11) gradient vector of the first stepe(V11) variance–covariance matrix of the first stepe(sT2) standard deviation of residuals of the second block of testse(aT2) factor loadings of the second block of testse(sT1) standard deviation of residuals of the first block of testse(aT1) factor loadings of the first block of testse(coeff T6) coefficient vector of test 6e(coeff T5) coefficient vector of test 5e(coeff T4) coefficient vector of test 4e(coeff T3) coefficient vector of test 3e(coeff T2) coefficient vector of test 2e(coeff T1) coefficient vector of test 1

Functionse(sample) marks estimation sample

e(sf11), e(sf12), e(mu11), e(p1), e(sf21), e(p2), e(mu21), and e(sf22) pro-vide the distributional parameters of the two factors. e(g11) and e(V11) provide thegradient and the variance–covariance of the parameters in the first stage. e(aT1) ande(aT2) are matrices that collect the loadings associated with each block in varlist T.e(sT1) and e(sT2) are matrices that collect the variances of the estimated residuals foreach block in varlist T. Finally, e(coeff T#) are vectors that collect the coefficients ofthe observable controls for each variable in varlist T (that is, βT ).


The main results stored after a second stage are the following:

Scalarse(N) number of observationse(av1) loading of first factor on the choice equatione(av2) loading of second factor on the choice equatione(sY0) standard deviation of residuals of Y0 equatione(a01) loading of first factor on Y0 equatione(a02) loading of second factor on Y0 equatione(a12) loading of second factor on Y1 equatione(a11) loading of first factor on Y1 equatione(sY1) standard deviation of residuals of Y1 equation

Matricese(b) coefficient vectore(ilog) iteration loge(gradient) gradient vectore(V) variance–covariance matrix of the estimatorse(coeff Y1) coefficient vector of equation Y1

e(coeff Y0) coefficient vector of equation Y0

e(coeff D) coefficient vector of choice equation

Functionse(sample) marks estimation sample

e(av1) and e(av2) are scalars that collect the loadings of each factor in the choiceequation. e(a01), e(a02), e(a12), and e(a11) are the scalars that store the loadingsof each factor for the outcome equations when D = 0 and D = 1. Likewise, e(sY0)and e(sY1) are scalars that store the variance of the residual of the outcome equations.Matrices e(coeff Y1), e(coeff Y0), and e(coeff D) collect the coefficients of theobservable controls for the outcome equations when D = 0 and D = 1 and for thechoice equation, respectively (that is, βY0 , βY1 , and βD).

heterofactor estimates multiple equations; it is not compatible with the predict

postestimation command. Instead, it provides users with vectors stored in e(.) tocreate the predicted values of the desired equations. (See [P] matrix score for detailson how vectors can be used to create variables with predicted values.)

4 Examples

In this section, we illustrate the heterofactor command using both simulated and realdata (that is, the NLSY79). We use simulated data as a benchmark for the precision ofthe estimates in different structures.

4.1 Examples with simulated data

We present three examples using simulated data, all of which use the treatment-effectstructure. First, we present a one-factor model. Then, we present a two-factor model as-suming a loadings structure as in (13). Finally, we present a two-factor model assuminga triangular loadings structure as in (12).


To show how to empirically recover the parameters from this model, we present thefollowing parameterization:

θA ∼ 0.3N (0, 1) + 0.7N (−0.428, 0.387)

θB ∼ 0.5N (0, 1) + 0.5N (−0.5, 0.5)

(eT , eY , X, Z,Q) ∼ N (0, 1)

T1 = 0.1 + 0.1Q+ 1.1θA + e1

T2 = 0.5 + 0.1Q+ 1.4θA + e2

T3 = 0.4 + 0.3Q+ θA + e3

T4 = 0.3 + 0.11Q+ 3θB + e4

T5 = 0.4 + 0.21Q+ 1.6θB + e5

T6 = 0.1 + 0.31Q+ θB + e6

T7 = 0.3 + 0.11Q+ 3.1θA + 3θB + e7

T8 = 0.4 + 0.21Q+ 1.2θA + 1.6θB + e8

T9 = 0.1 + 0.31Q+ 2θA + θB + e9

D =

{1 if 0.5Z + θA + eD > 0

0 otherwise

Y1 = 2 + 2X + 2θA + eY 1

Y0 = 1.5 +X + θA + eY 0

D2 =

{1 if 0.5Z + θA + θB + eD2

> 0

0 otherwise

Y2,1 = 2 + 2X + 2θA + 2θB + eY 12

Y2,0 = 1.5 +X + θA + θB + eY 02

The results were produced using Stata/MP 8 14.1. Your results may vary if you areusing a different flavor (that is, Stata/SE, Stata/MP 2, etc.) of Stata. We create ourdata using the following:

. set seed 12345

. set obs 5000obs was 0, now 5000

. generate u1=runiform()

. generate u2=runiform()

. generate f1 = rnormal()*sqrt(1)+1 if u1<0.3(3486 missing values generated)

. replace f1 = rnormal()*0.622269-0.42857143 if u1>=0.3(3486 real changes made)

. generate f2 = invnormal(runiform())*sqrt(1) + 0.5 if u2<0.5(2536 missing values generated)


. replace f2 = invnormal(runiform())*0.70710678 -0.5 if u2>=0.5(2536 real changes made)

. drop u?

. generate X=rnormal()

. generate Q=rnormal()

. generate Z=rnormal()

. generate uv=rnormal()

. generate u1=rnormal()

. generate u0=rnormal()

. forvalues i=1/12{2. generate e`i´=rnormal()3. }

. generate t1=0.1 +0.1 *Q +1.1*f1+e1

. generate t2=0.5 +0.1 *Q +1.4*f1+e2

. generate t3=0.4 +0.3 *Q + f1+e3

. generate t4=0.3 +0.11*Q +3 *f2 +e7

. generate t5=0.4 +0.21*Q +1.6*f2 +e8

. generate t6=0.1 +0.31*Q + f2 +e9

. generate t7=0.3 +0.11*Q +3.1*f1 + 3*f2 +e7

. generate t8=0.4 +0.21*Q +1.2*f1 + 1.6*f2 +e8

. generate t9=0.1 +0.31*Q +2*f1 + f2 +e9

. generate D=(0.5*Z + f1 + uv>0)

. generate Y11=2 +2*X + 2*f1 + u1

. generate Y10=1.5 + X + f1 + u0

. generate Y1=D*Y11 + (1-D)*Y10

. generate d2=(0.5*Z + f1 + f2 + uv>0)

. generate Y21=2 +2*X + 2*f1 + 2*f2 + u1

. generate Y20=1.5 + X + f1 + f2 + u0

. generate Y2=d2*Y21 + (1-d2)*Y20

One-factor model

Here we present a case where the system is described by only one factor, as in footnote 7.The command is


. heterofactor Y1 X, treatind(D) instrum(Z) scores(t1 t2 t3) indvarsc(Q)> factors(1) difficult initialregEstimating Initial Values VectorRunning Factor Model

Iteration 0: log likelihood = -40882.22 (not concave)Iteration 1: log likelihood = -38557.065 (not concave)

(output omitted )





D1Z .5068783 .0238641 21.24 0.000 .4601055 .5536512

_cons -.0237893 .0251811 -0.94 0.345 -.0731434 .0255648

Y11xw 1.975667 .0276118 71.55 0.000 1.921549 2.029785

_cons 1.997345 .042107 47.44 0.000 1.914817 2.079873

Y10xw 1.015075 .020827 48.74 0.000 .9742545 1.055895

_cons 1.468498 .0315469 46.55 0.000 1.406667 1.530329

t1x .0998038 .01589 6.28 0.000 .0686601 .1309476

_cons .0861379 .0206553 4.17 0.000 .0456543 .1266216

t2x .0997367 .0170335 5.86 0.000 .0663516 .1331219

_cons .5030196 .0242277 20.76 0.000 .4555342 .5505049

t3x .302106 .0157761 19.15 0.000 .2711855 .3330266

_cons .4174916 .0196586 21.24 0.000 .3789615 .4560217

/a1 2.122694 .0449329 47.24 0.000 2.034627 2.210761/a0 1.043721 .0433614 24.07 0.000 .9587339 1.128708/av 1.019103 .0409113 24.91 0.000 .9389189 1.099288

/aT1 1.129494 .0242495 46.58 0.000 1.081966 1.177022/aT2 1.482229 .0290633 51.00 0.000 1.425266 1.539192/sig1 .0351039 .0258972 1.36 0.175 -.0156537 .0858614/sig0 .0098692 .0170994 0.58 0.564 -.023645 .0433833

/sigT1 -.0006658 .0120529 -0.06 0.956 -.0242892 .0229575/sigT2 -.0093154 .0145453 -0.64 0.522 -.0378236 .0191928/sigT3 .0210075 .0114337 1.84 0.066 -.001402 .0434171/sigf1 .0030812 .033485 0.09 0.927 -.0625482 .0687106/sigf2 -.4976842 .0340155 -14.63 0.000 -.5643534 -.431015

/p1 -.666261 .1045168 -6.37 0.000 -.8711102 -.4614118/mu1 .8015842 .0598374 13.40 0.000 .684305 .9188635

Done Estimating Factor Model


In this output, /a1 and /a0 indicate the factor loadings for the equation of Y1

and Y0, respectively. Likewise, /av indicates the estimand of the factor loading in thechoice equation, while /aT1 and /aT2 are the factor loadings for measures T1 and T2,respectively. Note that the reported standard deviations (that is, /sigf1 and /sigf2)and the mixture-combining probability (that is, /p1) are transformed. To retrieve theactual values, we need to transform them back as follows:

. display exp(_b[sigf1:_cons])

.90781715

. display exp(_b[sigf2:_cons])

.62749649

. display invlogit(_b[p1:_cons])

.24802025

The command provides a random draw from the estimated factor distribution underthe name mixt. That is, the program creates a variable for the user to plot the distri-bution that results from the estimated mixture of normals. Here we use this variable toshow the accuracy of the estimation by comparing it with the true distribution of θA.

. kdensity mixt, addplot(kdensity f1) scheme(sj)> legend(order(2 1) label(1 "Estimated factor") label(2 "True factor"))> xtitle("")

0.1

.2.3

.4.5

Den

sity

−4 −2 0 2 4

True FactorEstimated Factor

kernel = epanechnikov, bandwidth = 0.1466

Kernel density estimate

Figure 1. Actual and estimated factor 1

Two-factor model

Here we present a two-factor model assuming the loadings structure presented in (13).We use measures T1 to T6. The output will be divided into three parts: one part for theestimation of the first factor’s distribution, one for the estimation of the second factor’s


distribution, and one for the estimation of the outcomes and choice equations. We fitthe model using the following command:

. heterofactor Y2 X, treat(d2) instrum(Z) scores(t1 t2 t3 t4 t5 t6) indvarsc(Q)> factors(2) initialreg difficult nohatsEstimating Initial Values VectorRunning Factor Model

Twostep option specifiedStep: 1

Factor: 1


(output omitted )

Iteration 9: log likelihood = -25333.818Iteration 10: log likelihood = -25333.818




t1Q .1106338 .0196421 5.63 0.000 .072136 .1491316

_cons .0934989 .0209643 4.46 0.000 .0524096 .1345882

t2Q .1140862 .0227646 5.01 0.000 .0694685 .1587039

_cons .512667 .0246802 20.77 0.000 .4642947 .5610394

t3Q .3116778 .0188027 16.58 0.000 .2748252 .3485303

_cons .4240101 .0199136 21.29 0.000 .3849802 .46304

/aT11 1.131651 .0266184 42.51 0.000 1.079479 1.183822/aT21 1.500883 .0346451 43.32 0.000 1.43298 1.568786

/sigT1 .0054918 .0143887 0.38 0.703 -.0227096 .0336931/sigT2 -.0210072 .0212032 -0.99 0.322 -.0625647 .0205502/sigT3 .0273248 .0127516 2.14 0.032 .0023321 .0523174/sigf11 -.0382022 .0866956 -0.44 0.659 -.2081224 .131718/sigf12 -.4788531 .04016 -11.92 0.000 -.5575653 -.4001409

/p1 -1.000756 .2626884 -3.81 0.000 -1.515616 -.4858966/mu1 1.043295 .2033173 5.13 0.000 .6448008 1.44179


Factor: 2


(output omitted )





t4Q .1498471 .0447385 3.35 0.001 .0621612 .237533

_cons .2250401 .0443057 5.08 0.000 .1382024 .3118777

t5Q .2303199 .0266435 8.64 0.000 .1780997 .2825401

_cons .3583129 .026471 13.54 0.000 .3064307 .4101952

t6Q .3066849 .0202724 15.13 0.000 .2669518 .3464181

_cons .0861055 .0202484 4.25 0.000 .0464193 .1257916

/aT42 2.887153 .0466271 61.92 0.000 2.795765 2.97854/aT52 1.564409 .0270325 57.87 0.000 1.511426 1.617392

/sigT4 .0762297 .029669 2.57 0.010 .0180796 .1343799/sigT5 -.0180211 .0153144 -1.18 0.239 -.0480368 .0119946/sigT6 .0060149 .0111457 0.54 0.589 -.0158304 .0278601/sigf21 -.1622406 .0272878 -5.95 0.000 -.2157236 -.1087575/sigf22 -.3072059 .0274006 -11.21 0.000 -.36091 -.2535018

/p2 -.625175 .1279547 -4.89 0.000 -.8759617 -.3743883/mu2 .8941899 .0585462 15.27 0.000 .7794415 1.008938


Second Stage: Estimation


(output omitted )





d2Z .5071589 .0269401 18.83 0.000 .4543573 .5599605

_cons -.0062522 .0246106 -0.25 0.799 -.054488 .0419836

Y21X 1.968805 .0306787 64.17 0.000 1.908676 2.028934

_cons 2.009102 .0417183 48.16 0.000 1.927335 2.090868

Y20X 1.017017 .0221343 45.95 0.000 .9736349 1.0604

_cons 1.457208 .0335403 43.45 0.000 1.39147 1.522946

/a11 2.084713 .0370919 56.20 0.000 2.012014 2.157412/a12 1.910914 .0361597 52.85 0.000 1.840042 1.981786/a01 1.066335 .0420311 25.37 0.000 .9839559 1.148715/a02 .9777517 .0296384 32.99 0.000 .9196615 1.035842/av1 1.034998 .0443362 23.34 0.000 .9481003 1.121895/av2 .9655901 .0363247 26.58 0.000 .8943949 1.036785/aT21 1.484669 .0199244 74.51 0.000 1.445618 1.52372/aT42 2.884038 .0252152 114.38 0.000 2.834617 2.933459/aT52 1.56073 .0176811 88.27 0.000 1.526076 1.595384/sig1 .0673842 .0321999 2.09 0.036 .0042735 .1304948/sig0 .001 .0196342 0.05 0.959 -.0374824 .0394823

When you fit a model with two factors, the output includes an extra digit to identifythe factor it is referring to. For instance, in the second-stage estimation, /a11 indicatesthe loading of the first factor in the equation for Y1, and /a12 indicates the loadingof the second factor in the same equation. That is, αY1,A and αY1,B . Similarly, /a01indicates αY0,A, and /a02 indicates αY0,B . To show the accuracy of our estimates ofFθB (ζ), we plot it together with the true factor in figure 2.


. kdensity mixt2, addplot(kdensity f2) scheme(sj)> legend(label(1 "Estimated Factor") label(2 "True Factor")) xtitle("")

0.1

.2.3

.4D

ensi

ty

−4 −2 0 2 4

Estimated FactorTrue Factor



Figure 2. Actual and estimated factor 2 using structure (13)

Two factors—Triangular loadings structure

Now, we run a model that assumes that the measurement system (4) has a triangularloadings structure as in (12). Note that the estimation of the system that is affected bythe first factor is exactly the same as in subsection 4.1. Therefore, we omit that part ofthe output.


. heterofactor Y2 X, treat(d2) instrum(Z) scores(t1 t2 t3 t7 t8 t9) indvarsc(Q)> factors(2) triangular initialreg difficult nohatsEstimating Initial Values VectorRunning Factor Model

Twostep option specifiedStep: 1

Factor: 1

(output omitted )

Factor: 2


(output omitted )





t7Q .1775851 .0435664 4.08 0.000 .0921966 .2629736

_cons .2291283 .0483811 4.74 0.000 .134303 .3239536

t8Q .2378664 .0253233 9.39 0.000 .1882337 .2874991

_cons .3582543 .0275752 12.99 0.000 .3042079 .4123007

t9Q .329379 .0221729 14.86 0.000 .2859209 .3728371

_cons .0928807 .0236972 3.92 0.000 .0464349 .1393264

/aT41 3.239198 .0510511 63.45 0.000 3.139139 3.339256/aT51 1.229221 .0290314 42.34 0.000 1.17232 1.286121/aT61 2.060524 .0270328 76.22 0.000 2.007541 2.113508/aT42 2.909617 .0534597 54.43 0.000 2.804838 3.014396/aT52 1.552329 .0324726 47.80 0.000 1.488684 1.615974/aT11 1.118679 .0179589 62.29 0.000 1.08348 1.153878/aT21 1.481497 .0191889 77.21 0.000 1.443887 1.519107

/sigT4 -.0069295 .0359718 -0.19 0.847 -.0774329 .0635739/sigT5 .0012072 .0156359 0.08 0.938 -.0294387 .0318531/sigT6 .0145307 .0132415 1.10 0.272 -.0114222 .0404837/sigf21 -.1259575 .0345348 -3.65 0.000 -.1936446 -.0582704/sigf22 -.3364971 .0361957 -9.30 0.000 -.4074393 -.2655549

/p2 -.3397048 .1032454 -3.29 0.001 -.5420622 -.1373475/mu2 .7849042 .0503796 15.58 0.000 .6861619 .8836465


Second Stage: Estimation


(output omitted )





d2Z .4985125 .025586 19.48 0.000 .4483649 .5486602

_cons -.0167719 .0231616 -0.72 0.469 -.0621677 .028624

Y21X 1.993162 .0229973 86.67 0.000 1.948088 2.038236

_cons 1.97525 .029863 66.14 0.000 1.91672 2.033781

Y20X 1.023553 .0200909 50.95 0.000 .984175 1.06293

_cons 1.456468 .0269527 54.04 0.000 1.403642 1.509294

/a11 2.068437 .0380411 54.37 0.000 1.993878 2.142996/a12 1.91965 .0301661 63.64 0.000 1.860526 1.978775/a01 1.050172 .035932 29.23 0.000 .9797465 1.120597/a02 .9953106 .0291753 34.11 0.000 .9381281 1.052493/av1 1.002611 .0376846 26.61 0.000 .928751 1.076472/av2 .9590321 .0351233 27.30 0.000 .8901916 1.027873/aT21 1.480445 .019431 76.19 0.000 1.442361 1.51853/aT41 3.22221 .0483981 66.58 0.000 3.127352 3.317069/aT42 2.893978 .0283124 102.22 0.000 2.838487 2.949469/aT51 1.217317 .0278293 43.74 0.000 1.162773 1.271862/aT52 1.54802 .0195086 79.35 0.000 1.509784 1.586256/aT61 2.053388 .0256322 80.11 0.000 2.00315 2.103626/sig1 .0266306 .0182996 1.46 0.146 -.0092359 .0624972/sig0 -.0118307 .0150991 -0.78 0.433 -.0414245 .017763


. kdensity mixt2, addplot(kdensity f2) scheme(sj)> legend(label(1 "Estimated Factor") label(2 "True Factor")) xtitle("")

0.1

.2.3

.4D

ensi

ty

−4 −2 0 2 4

Estimated FactorTrue Factor



Figure 3. Actual and estimated factor 2 using triangular structure (12)

Again, to show the accuracy of our estimates of FθB (ζ) in this more complicatedsetting, we plot it together with the actual factor in figure 3.

4.2 Example using the NLSY79

In this section, we present an example using real data from NLSY79. The dataset is widelyused by the research community (see, for instance, Heckman, Stixrud, and Urzua [2006],Urzua [2008], Prada and Urzua [2013]). In our example, we fit a Roy model where theendogenous choice is whether the person went to college by age 25 and the potentialoutcomes are the log of earnings by age 30. The adjunct measurement system comprisesthe armed services vocational aptitude battery tests recorded during the participants’teenage years. The observable controls used in the test equations are race and mother’seducation. This last control is also used in the college enrollment decision. Finally, inthe earning equations, we control for race and experience.

. use nlsyforfactor.dta, clear

. heterofactor lnincome blackwhite ExperienceF Experience2, treat(HR_5)> instrum(HGC_MOTHER) scores(stASVAB_6 stASVAB_10 stASVAB_8)> indvarsc(blackwhite HGC_MOTHER)Running Factor Model

initial: log likelihood = -23908.226alternative: log likelihood = -19308.277rescale: log likelihood = -19308.277rescale eq: log likelihood = -12760.811Iteration 0: log likelihood = -12760.811 (not concave)


(output omitted )


Number of obs = 2188Wald chi2(1) = 152.51



HR_5HGC_MOTHER .2440448 .0197617 12.35 0.000 .2053125 .2827771

_cons -3.977122 .2557827 -15.55 0.000 -4.478447 -3.475797

lnincome1blackwhite .0545235 .1332389 0.41 0.682 -.2066198 .3156669ExperienceF .0582035 .0129243 4.50 0.000 .0328723 .0835347Experience2 -.000468 .000138 -3.39 0.001 -.0007385 -.0001975

_cons 1.627223 .3299871 4.93 0.000 .9804607 2.273986

lnincome0blackwhite .3306208 .0602065 5.49 0.000 .2126183 .4486233ExperienceF .0629075 .0069841 9.01 0.000 .049219 .0765961Experience2 -.0003744 .0000741 -5.05 0.000 -.0005197 -.0002291

_cons .4799472 .1643739 2.92 0.004 .1577802 .8021143

stASVAB_6blackwhite .6175045 .0523823 11.79 0.000 .5148372 .7201719HGC_MOTHER .0933385 .0072596 12.86 0.000 .0791098 .1075671

_cons -1.579435 .0983605 -16.06 0.000 -1.772218 -1.386652


_cons -1.345903 .0993812 -13.54 0.000 -1.540687 -1.15112


_cons -1.285871 .105387 -12.20 0.000 -1.492425 -1.079316

/a1 .2686718 .0924776 2.91 0.004 .087419 .4499247/a0 .2871036 .0478504 6.00 0.000 .1933186 .3808887/av 1.829669 .1091898 16.76 0.000 1.61566 2.043677

/aT1 1.064874 .0456752 23.31 0.000 .975352 1.154396/aT2 1.663054 .0630574 26.37 0.000 1.539464 1.786644/sig1 -.3036387 .0300436 -10.11 0.000 -.3625231 -.2447543/sig0 -.2164866 .0175662 -12.32 0.000 -.2509156 -.1820575

/sigT1 -.4138092 .0171057 -24.19 0.000 -.4473358 -.3802827/sigT2 -1.311971 .076692 -17.11 0.000 -1.462285 -1.161657/sigT3 -.2876287 .0162171 -17.74 0.000 -.3194136 -.2558439/sigf1 -1.624681 .1030396 -15.77 0.000 -1.826635 -1.422727/sigf2 -1.172679 .0643634 -18.22 0.000 -1.298829 -1.046529

/p1 -.6149592 .1085965 -5.66 0.000 -.8278045 -.4021139/mu1 .5898844 .0333342 17.70 0.000 .5245506 .6552181

Done Estimating Factor Model


The results of this example indicate that people with higher levels of latent abilityare more likely to go to college and to earn more.

5 Conclusions

Models of unobserved heterogeneity are becoming increasingly popular. However, theirimplementation is difficult and often tailored to the needs of each particular project.In this article, we presented code that can fit many models whose common feature isthat they are systems of equations with latent-factor structures. Our code is flexibleenough to incorporate different features of the data while keeping the distributionalassumptions to the minimum. Although these models are computationally demanding,most estimations can be done using personal computers.

6 Acknowledgments

We thank Maria Prada for all of her contributions, especially for the development ofthe triangular loadings structure. We also thank Ricardo Espinoza, the Stata reviewer,as well as Koji Miyamoto, Katarzyna Kubacka, and the rest of the OECD-ESP team fortheir useful comments. All mistakes are ours.

7 ReferencesAakvik, A., J. J. Heckman, and E. J. Vytlacil. 2000. Treatment effects for discreteoutcomes when responses to treatment vary among observationally identical persons:An application to Norwegian vocational rehabilitation programs. NBER TechnicalWorking Paper No. 262. http://www.nber.org/papers/t0262.

Cameron, S. V., and J. J. Heckman. 1998. Life cycle schooling and dynamic selectionbias: Models and evidence for five cohorts of American males. Journal of PoliticalEconomy 106: 262–333.

. 2001. The dynamics of educational attainment for black, Hispanic, and whitemales. Journal of Political Economy 109: 455–499.

Carneiro, P., K. T. Hansen, and J. J. Heckman. 2003. 2001 Lawrence R. Klein lecture:Estimating distributions of treatment effects with an application to the returns toschooling and measurement of the effects of uncertainty on college choice. Interna-tional Economic Review 44: 361–422.

Cox, N. J. 1999. dm69: Further new matrix commands. Stata Technical Bulletin 50: 5–9. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 28–34. College Station,TX: Stata Press.



Hansen, K. T., J. J. Heckman, and K. J. Mullen. 2004. The effect of schooling andability on achievement test scores. Journal of Econometrics 121: 39–98.

Heckman, J. J., J. E. Humphries, S. Urzua, and G. Veramendi. 2011. Theeffects of educational choices on labor market, health, and social out-comes. Working Paper No. 2011-002, Human Capital and Economic Opportu-nity: A Global Working Group. http://humcap.uchicago.edu/RePEc/hka/wpaper/HHUV 2010 effect-edu-choice.pdf.

Heckman, J. J., and S. Navarro. 2007. Dynamic discrete choice and dynamic treatmenteffects. Journal of Econometrics 136: 341–396.

Heckman, J. J., J. Stixrud, and S. Urzua. 2006. The effects of cognitive and noncognitiveabilities on labor market outcomes and social behavior. Journal of Labor Economics24: 411–482.

Joreskog, K. G., and A. S. Goldberger. 1972. Factor analysis by generalized least squares.Psychometrika 37: 243–260.

Judd, K. L. 1998. Numerical Methods in Economics. Cambridge, MA: MIT Press.

Keane, M. P., and K. I. Wolpin. 1997. The career decisions of young men. Journal ofPolitical Economy 105: 473–522.

Kotlarski, I. I. 1967. On characterizing the gamma and the normal distribution. PacificJournal of Mathematics 20: 69–76.

Prada, M. F., and S. Urzua. 2013. One size does not fit all: The role of vocationalability on college attendance and labor market outcomes. Working Paper.http://lacer.lacea.org/bitstream/handle/123456789/48629/lacea 2013 rolevocational ability.pdf.

Roy, A. D. 1951. Some thoughts on the distribution of earnings. Oxford EconomicPapers 3: 135–146.

Sarzosa, M. 2015. The dynamic consequences of bullying on skill accumulation.http://krannert.purdue.edu/faculty/msarzosa/Research/DynBullying.pdf.

Sarzosa, M., and S. Urzua. 2015. Bullying among adolescents: The role of cognitiveand noncognitive skills. NBER Working Paper No. 21631, The National Bureau ofEconomic Research. http://www.nber.org/papers/w21631.

Urzua, S. 2008. Racial labor market gaps: The role of abilities and schooling choices.Journal of Human Resources 43: 919–971.

About the authors

Sergio Urzua is an associate professor of economics at the University of Maryland. His primaryresearch focuses on the role of cognitive ability, noncognitive skills, and early health status asdeterminants of schooling, labor market, and adult social behaviors. His research in applied


econometrics is mainly concerned with the estimation and identification of selection modelswith unobserved heterogeneity.

Miguel Sarzosa is an assistant professor of economics at Purdue University. His research focuseson the estimation of the effects that cognitive and noncognitive skills have on social behaviors,specifically, the effect that skill endowments have on in-school victimization and workplacediscrimination.


Speaking Stata: Truth, falsity, indication, andnegation

Nicholas J. CoxDepartment of Geography

Durham UniversityDurham, UK

[email protected]

Abstract. Many problems in Stata call for selection of observations according totrue or false conditions, indicator variables flagging the same, groupwise calcula-tions, or a prearranged sort order. The example of finding the first (earliest) andlast nonmissing value in panel or longitudinal data is used to explain and explorethese devices and how they may be used together. Negating an indicator variablehas the special virtue that selected observations may be sorted easily to the top ofthe dataset.

Keywords: dm0087, true or false, logical, Boolean, indicator variable, dummyvariable, sort, by, panel data, longitudinal data, programming, data management

1 Introduction: True and false, in Stata and otherwise

The title may hint at a miniature philosophical treatise, but the topic is eminentlypractical. We start with these fundamentals:

1. When fed an argument, Stata takes nonzero values as true and zero values asfalse. Thus 42 in if 42 would count as true, and 0 in if 0 would count as false,however unlikely it may be that anybody would write either in Stata. Stata wouldalways try to do something given if 42, but it would never try that given if 0.

2. When producing results for true-or-false evaluations, Stata returns 1 for true and0 for false. Thus the expression x > 42 would return 1 if and only if x was in-deed greater than 42, and 0 otherwise. (It is a side issue—but one that bitesoften enough to deserve a big flag—to emphasize in this example that any nu-meric missing value does count as greater than 42.) Similarly, the function callmissing(x) will return 1 if and only if x is missing, and 0 otherwise. “Indicatorvariable” is a common term for variables that take on values 1 or 0. “Dummyvariable” is another common term, especially for those who first met the idea ina regression course.

3. It follows that using 1 and 0 for true and false is at root just a convention, al-though a convention that appears supremely natural and helpful, especially withknowledge of those tricks possible with 1s and 0s. Thus adding lots of 0s and 1s isprecisely how to count how many observations have the condition marked by 1s.

c© 2016 StataCorp LP dm0087

230 Speaking Stata: Truth, falsity, indication, and negation

Such (0, 1) values are often labeled Booleans or logicals. However, you can, ifyou wish, use other conventions too. The easiest useful alternative is to use −1for true and 0 for false. That last simple idea is the least standard detail in thiscolumn, so readers broadly familiar with the territory here may want to focus onthat point alone.

George Boole (1815–1864) was a British mathematician who became a professor atQueen’s College, Cork in Ireland (now University College Cork). He is best rememberedfor his works in logic and probability that lie behind the term “Boolean”. His majorbook was Boole (1854). An earlier, shorter book (Boole 1847) is among those piecescollected in Boole (1952). Hailperin (1986) revisited this work from a more modernperspective. MacHale (1985, slightly revised 2014) gives a full-length biography thatcovers his personal life and his family, which remains distinguished to the present, as wellas mathematical contributions. See also Iverson (1962) for a now classical discussionof logical values in computing; Knuth (2011) for an authoritative survey of zeroes andones in programming; and Gregg (1998) for other material on Boolean algebra, circuitdesign, and the logic of sets.

The next section of this column uses a slightly tricky problem to illustrate the useof indicator variables and also their negations. The last section sketches some moregeneral advice for programmers.

2 Illustration: First and last nonmissing values

Here is a concrete problem. A panel dataset defined by identifiers and a time variable isspeckled with missing values. Your job is to find the first (earliest) and last nonmissingvalues for a particular variable. A sandbox for this problem could be the Grunfeld paneldataset, messed up randomly for the purpose. I set the seed for random numbers forreproducibility. Here I use Stata 14.1. If you are using a version before 14, your resultswill differ slightly, but the principles are unaffected.

. webuse grunfeld

. set seed 2803

. replace kstock = . if runiform() < 0.2(35 real changes made, 35 to missing)

The Grunfeld dataset includes 200 observations, 20 companies each for 10 years, allnonmissing, so we expect about 200 × 0.2 = 40 values to be set to missing by such acall.

Now to the problem. If you know about collapse supporting identification of firstand last nonmissing values, please set that aside. We should not need to destroy adataset to find some of its contents. If you know that there are user-written egen

functions to do this, or indeed of any other canned solution, please set that aside also.We seek a solution from first principles.

N. J. Cox 231

The problem would be easy if there were no missing values. We just need to sortthe data into the right order and identify the first and last values using subscripts toidentify the observations needed.

. bysort company (year): generate first = kstock[1]

. by company: generate last = kstock[_N]

The workhorse here is by:. bysort company: bundles two operations into one, tofirst sort on company and then to carry out calculations separately for each group ofobservations. As in elementary algebra, the operation on the inside, sort, is carried outfirst. In this case, each group of observations is for a single company defining a separatepanel. If observations are already sorted by group (here by distinct values of company),the sort element is unnecessary but harmless. If observations are not already sorted,the sort element is essential.

Under the aegis of by:, we can sort on other variables as well as the group identifier.Here that also is essential. Sorting on year within each panel allows the first value tobe identified using subscript 1 and the last to be identified using subscript N, whichunder by: is the number of observations in each panel and hence also the subscript ofthe last observation in each panel. (Easy: if there are 10 observations in a panel, 10 isalso the subscript of the last.) If this is unfamiliar, then you may also want to look formanual sections discussing by: or go to Cox (2002).

The parentheses ( ) around year are part of the syntax. The variables beforethe parentheses, just company in this example, define the groups for which separatecalculations are needed. The variables inside the parentheses, just year here, definethe order of observations within the group when that is important. If the order withinthe group is of no consequence, no variables need be included within parentheses. If wewere calculating a group total or mean, the sort order would not matter.

We will come back to what is before and what is inside the parentheses, because itis a handle helping with all kinds of calculations in Stata that in other software wouldoften call for nested loops.

That code is not yet the solution to our problem, because we do have missing values.The first or last values in different panels could be missing. We must not just hope thatis not so. But this is a good start, and the rest of the solution is to add a twist to keepmissing values out of the way.

The best device is to create an indicator variable for missing values:

. generate ismissing = missing(kstock)

. bysort company (ismissing year): generate first = kstock[1]

Let’s go through that more slowly. The new variable ismissing is 1 if kstock is missingand 0 otherwise. Within each panel, we sort first on ismissing, so all the observationswith missing values on kstock with value 1 on ismissing get sorted after the otherswith value 0 on ismissing. Then, within such blocks, we sort on year. So the firstvalue of stock within each panel should be the first nonmissing value, so long as there


is at least one nonmissing value. At worst, all the values for a panel on kstock will bemissing, and then calculation will return missing as the first nonmissing value, whichseems fair enough.

As before, the order of variables is crucial: first company, then ismissing, thenyear. A different order would usually produce a different sort order for the dataset anda different answer, probably wrong.

As explained, the parentheses () control exactly how observations are sorted.

• A paraphrase of the last Stata command above is this: within blocks of observa-tions defined by different values of company—sorted internally first by ismissing

and then by year—find the first value of kstock.

• That is different from this: within blocks of observations defined by differentvalues of company and ismissing—sorted internally by year—find the first valueof kstock.

• Both are different from this: within blocks of observations defined by differentvalues of company, ismissing, and by year, find the first value of kstock.

Which syntax you want is crucially dependent on the problem, but the parentheses arethere to make the distinction you need. Getting it right is the tricky part of using by:.Getting it wrong a few times and thinking it through with small-sample datasets whereyou can see results easily and quickly is the way to learn how to get it right.

So far, so good, but what about the last nonmissing values? Because we sortedmissings to the end of each panel to keep them out of the way, the last value for eachpanel will certainly be missing on kstock even if there is only one missing value in eachpanel. The solution is to flip the sort order around. This is where negation can be usedto solve the problem.

. replace ismissing = -ismissing

Negation (note the minus sign - in the command just given) flips the 1s all to −1 andleaves the 0s untouched. Now, we can re-sort with the changed variable.

. bysort company (ismissing year): generate last = kstock[ N]

The observations with missing values for kstock (with ismissing −1) now always comebefore those with nonmissing values (with ismissing unchanged at 0). Within thosetwo subsets, we sort on year. Thus the last nonmissing value should come last, and wecan pick it up using the subscript [ N].

As before, if all the data for a panel are missing, then the result would also bemissing, and again that is fair enough.

N. J. Cox 233

By the way, why is the code not like this?

. generate ismissing = missing(kstock)

. bysort ismissing company (year): generate first = kstock[1]

. by ismissing company (year): generate last = kstock[_N]

The problem with that code is that missings on stock all map to missings on the newvariables. That is often awkward. We could clean up afterward, but for most purposes,such code creates as many problems as it solves.

If you are interested in how we could clean up, here is one way.

. bysort company (first): replace first = first[1]

. bysort company (last): replace last = last[1]

This is similar logic. Within panels, we sort on the variable of interest, either first

or last. Nonmissing values of interest will then be in the first observation of eachpanel and can be copied to all observations in the panel. We do not need conditions onreplace such as if missing(first) or if missing(last) because there is no loss inoverwriting nonmissing values with the same nonmissing values.

Naturally, you could argue that the observations with missings are just useless andmight as well be dropped. That is true if all variables of interest are missing in thoseobservations; otherwise, it would result in throwing away data that might be useful. (Arecent column of mine, Cox (2015), introduced a program, missings, helpful for suchproblems.)

To recap the problem, let’s gather all the commands for the recommended solution:

. webuse grunfeld, clear

. set seed 2803

. replace kstock = . if runiform() < 0.2

. gen ismissing = missing(kstock)

. bysort company (ismissing year): gen first = kstock[1]

. replace ismissing = -ismissing

. bysort company (ismissing year): gen last = kstock[_N]

We will not list the results here, partly because of space but mostly because you cantry this out for yourself. But we should check on the results:

. count if missing(first, last)

. bysort company (first): assert first[1] == first[_N]

. bysort company (last): assert last[1] == last[_N]

The first check is whether any result is missing for the new variables. In principle,there could be a missing result if any panel was all missing. Because there are none, weneed not take that further. If you have random numbers different from mine, then yourresults could be different.

The second and third checks are whether results are constant within panels. If wesort on first (similarly on last) within panels, then any different values would beshaken apart. Here the result of assert would be an error message if the assertion were


not true: literally, no news is good news. Know that there was no such message. Formore on assert, see any or all of its help, its manual entry, and Gould (2003).

3 Invitation: Some wider uses

Let’s close with some general advice, especially for programmers. A common earlydevice in programs is writing

. marksample touse

which creates an indicator variable with value 1 for observations to be used and withvalue 0 otherwise. That meaning explains the conventional name, touse, meaning(if you did not spot it) “to use”. In practice, that variable is temporary, referred tothereafter as ‘touse’, an incidental detail here.

Such a variable permits all kinds of useful stuff, often starting with

. count if `touse´

to determine if there are not any observations to apply a command to, in which caseyour program should probably bail out now, even if a zero count is good news for somepurpose. (Perhaps the purpose of the command is to look for something unwanted, soa zero count is indeed good news.)

Sometimes, it is simpler to throw out observations you do not want to work with,provided that the dataset has been saved or preserved or is not of long-term utility:

. keep if `touse´

Often the main point is just to control which part of the dataset is used, say, for a graphor some statistical analysis, so the qualifier if ‘touse’ could be common in a program.

None of the uses mentioned so far in this section is affected by negating ‘touse’

so that its values are −1 and 0. We know −1 is not 0, hence true, and 0 manifestlyremains 0 and false. But why would you do that? The main reason is whenever sortingthe observations you want to the top of the dataset makes anything easier.

For example, I find that list is frequently a good way to output results. Its excel-lent subvarname option provides an easy way to label columns. Its separation optionsseparator() and sepby() are often useful. It is smart on your behalf about spacingand boxing. And there are other advantages besides: see, for example, Harrison (2006).

Once I have counted how many rows will be in the table, the output is then oftenarranged by a single command of the form list . . . in 1/whatever. The observationnumbers will often be relevant to the displayed results; if not, they can always besuppressed.

N. J. Cox 235

That is a small trick, but one that I have found helpful in programming. Indeed,working interactively as well, it can be useful whenever the observations of most interestcome first in the dataset.

4 Conclusion

What makes a language practical to learn and to use? Often it is that a small numberof key concepts allow a large number of problems to be solved directly and efficiently.

In this column, we have looked at a bundle of key Stata concepts that very muchbelong together: indicator variables, by: for groupwise calculations, and deliberate anddelicate control of sort order to enable exactly what you want. A particular twist isthat negating an indicator can be useful too: logical values of −1 remain true andimmediately allow a sort order that can be as or more convenient than the standardorder in which true values follow false.

5 ReferencesBoole, G. 1847. The Mathematical Analysis of Logic, Being an Essay Towards a Calculusof Deductive Reasoning. Cambridge: Macmillan, Barclay, and Macmillan.

. 1854. An Investigation of the Laws of Thought, on Which are Founded theMathematical Theories of Logic and Probabilities. London: Walton and Maberley.

. 1952. Studies in Logic and Probability. London: Watts.

Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

. 2015. Speaking Stata: A set of utilities for managing missing values. StataJournal 15: 1174–1185.

Gould, W. 2003. Stata tip 3: How to be assertive. Stata Journal 3: 448.

Gregg, J. R. 1998. Ones and Zeros: Understanding Boolean Algebra, Digital Circuits,and the Logic of Sets. Piscataway, NJ: IEEE Press.

Hailperin, T. 1986. Boole’s Logic and Probability. A Critical Exposition from the Stand-point of Contemporary Algebra, Logic and Probability Theory. Amsterdam: North-Holland.

Harrison, D. A. 2006. Stata tip 34: Tabulation by listing. Stata Journal 6: 425–427.

Iverson, K. E. 1962. A Programming Language. New York: Wiley.

Knuth, D. E. 2011. The Art of Computer Programming, Volume 4A: CombinatorialAlgorithms, Part 1. Upper Saddle River, NJ: Addison–Wesley.

MacHale, D. 1985. George Boole: His Life and Work. Dublin: Boole Press.


. 2014. The Life and Work of George Boole: A Prelude to the Digital Age. Cork,Ireland: Cork University Press.

About the author

Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,postings, FAQs, and programs to the Stata user community. He has also coauthored 15 com-mands in official Stata. He was an author of several inserts in the Stata Technical Bulletin andis an editor of the Stata Journal. His “Speaking Stata” articles on graphics from 2004 to 2013have been collected as Speaking Stata Graphics (College Station, TX: Stata Press, 2014).


Review of Michael N. Mitchell’s Stata for theBehavioral Sciences

Philip B. EnderCulver City, CA

[email protected]

Abstract. In this article, I review Stata for the Behavioral Sciences by Michael N.Mitchell (2015 [Stata Press]).

Keywords: gn0069, book review, behaviorial sciences, ANOVA, analysis of variance,experimental design

1 Introduction

You know that warm comfortable feeling you get when you return home after a longabsence? That is the feeling I got while reading Michael N. Mitchell’s Stata for theBehavioral Sciences. The feeling was due to the familiarity of the material on analysisof variance (ANOVA) and experimental design covered in the book. Reading this booktook me back to my student days when I struggled with ANOVA problem sets and classresearch projects. The familiar material is one reason I thoroughly enjoyed readingMitchell’s book. The other reason I enjoyed it so much is Mitchell’s clear writing styleand copious, detailed examples.

As you can tell from the previous paragraph, the subject matter of this book pri-marily covers ANOVA and experimental design. Some may think that this definition ofmethods for the behavioral sciences is a bit narrow, but it is consistent with the waythe topic was taught back in the day. Interestingly, before Stata 11, many social andbehavioral researchers felt that Stata was difficult to use for ANOVA and experimentaldesign. This was not because Stata could not do these types of analyses but becauseStata lacked easy-to-use postestimation convenience commands for ANOVA. This beganto change when Stata 11 introduced factor variables and the margins command. Then,Stata 12 followed with the contrast and marginsplot commands, which put Stata ona par with other statistical packages used in the behavioral sciences.

Before I get into the meat of this review, I need to deal with a couple of details. First,in the spirit of full disclosure, I want to say that Michael Mitchell was the person whohired me to work in the Statistical Consulting Group at the University of California–Los Angeles (UCLA) in 1999. In fact, the first assignment Michael gave me was to learnStata. We worked closely for many years before Michael left UCLA to pursue otherendeavors.

Second, I need to address the elephant in the room, namely, the title of the book—Stata for the Behavioral Sciences. Does this title mean that the book will be of interestonly to researchers in the behavioral sciences? No, of course not. Many disciplines


238 Review of Michael N. Mitchell’s Stata for the Behavioral Sciences

other than the behavioral sciences use ANOVA methods. Beyond that, any researcherthat uses categorical predictors with more than two groups will find useful material inthis book. Finally, anyone wanting more details on topics such as factor variables andthe margins, contrast, or marginsplot command will find this book a treasure troveof information.

2 Content

After a preface in which Mitchell provides his motivation for writing the book anddescribes his background as a statistical consultant at UCLA, the text is divided into23 chapters, which, in turn, are collected into 5 sections. I will resist the temptationto present in-depth reviews of each of the 23 chapters; that would be rather long andrepetitive. I do not use the word “repetitive” as a criticism. Rather, it is part of thenature of ANOVA analyses; that is, similar tasks are performed for each of the differentANOVA designs.

So rather than providing in-depth coverage of each of the chapters, I will revieweach of the five sections and highlight material that readers may find most useful—likea tour guide on a seven-countries-in-six-days grand tour. Or, in this case, 23 chaptersin 5 sections.

2.1 Warming up (chapters 1–3)

The first stop on our tour is a section titled “Warming up”, and like a warm-up before along run, it is intended to ease you into ANOVA and experimental design before gettinginto more substantial material.

The book does not start with a formal tutorial on Stata but rather lists severalreasons why one would want to use Stata. This is followed by a chapter on how tosummarize and describe ANOVA-type datasets. Finally, there is a chapter on inferentialstatistics covering single-sample and two-sample hypothesis tests of means and propor-tions. No big highlights in this section. Remember, it is just a warm-up.

2.2 Between-subjects ANOVA models (chapters 4–11)

The next stop on our tour delves into between-subjects designs. With between-subjectmodels, the observations in each group or cell are independent of the observationsin the other groups or cells. Introductory statistics courses often label these designsusing terms like “one-way ANOVA”, “two-way ANOVA”, or “factorial ANOVA”. Theseintroductory courses will also usually include material on analysis of covariance (thatis, ANOVA designs that include a continuous predictor).

A long time ago, back in my student days, textbooks such as Winer’s StatisticalPrinciples in Experimental Design (1962) or Kirk’s Experimental Design: Procedures forthe Behavioral Sciences (1968) devoted many pages to manual computational formulas

P. B. Ender 239

for various ANOVA designs. Back then, we thought that computing the omnibus ANOVA

was the hardest part of the analysis. Today, all of that computational drudgery is takencare of using the anova command. We now realize the real work comes after the anovacommand when we address specific hypotheses about group differences or work on thedecomposition of complex interactions. In some respects, the anova command itselfbecomes almost the least interesting part of the data analysis.

This is where Mitchell’s detailed explanations of the inner workings of margins,contrast, and marginsplot come into play. Through numerous examples, he showshow you can estimate the group means, test contrasts among the group means, ordecompose interactions into simple effects and simple main effects.

I wish to highlight two chapters in this section. The first is chapter 10, “Superchargeyour analysis of variance (via regression)”. Here Mitchell demonstrates that by usingregression to do ANOVA, one can analyze complex survey data, use robust standarderrors to compensate for heterogeneity, and do full robust regression (using the rreg

command); he even dips into quantile regression. Early on as a student, I learned thatANOVA and regression were two sides of the same coin. They each estimate the sameunderlying model but present their results differently. Sadly, many students today donot know that there is any connection between ANOVA and regression. Mitchell makesthis connection clear.

Another highlight of this section is chapter 11 on power analysis. Here Mitchelldetails the complexities of the power command. This is a topic that many of the olderANOVA books either gloss over to some degree or discuss with almost illegible powergraphs.

2.3 Repeated measures and longitudinal designs (chapters 12–13)

The third stop on our tour delves into repeated measures and longitudinal designs.If the previous section was concerned with between-subject designs, then this sectiondeals with within-subject designs. In the past, these designs might have gone by variousnames, such as “randomized block design” for one-way within-subject design or “ran-domized block factorial” when there are two or more crossed within-subject factors. Ifthere is a mixture of between- and within-subject factors, the model might be called a“split-plot factorial”. Traditional statistics packages dealt with these designs by havingthe data in wide form, incorporating both multivariate and univariate estimation. Thebiggest downside to this approach is that if a subject is missing even one of the repeatedmeasures, then all the data for that subject have to be discarded.

Instead of using anova to analyze repeated-measures designs, Mitchell has chosen analternative approach of analyzing the data using linear mixed models (using the mixedcommand). This approach is becoming more common. You may even encounter theterm “repeated-measures mixed models” for these types of analyses in the literature.

These mixed models use data in the long form and are much more tolerant of miss-ing observations. They allow one to use all the available data for each subject. Ad-


ditionally, the mixed-models approach allows for a greater variety of within-subjectcovariance structures. Traditional repeated-measures ANOVA allowed only for unstruc-tured and compound symmetry (exchangeable) covariance structures, while mixed mod-els allow for independent, autoregressive (ar #), moving-average (ma #), banded,Toeplitz (toeplitz), or exponential structures in addition to unstructured andexchangeable.

One complication for the mixed-model approach is that it uses large-sample max-imum likelihood estimation. The large-sample approach is evident from the z andchi-squared statistics displayed in the results. Much of behavioral sciences researchhas relatively small sample sizes. Thus the p-values associated with z and chi-squaredstatistics are biased downward.

Small-sample mixed models were an issue with Stata before version 14, which intro-duced the dfmethod option. The term “dfmethod” stands for the degrees of freedommethod, that is, the method used to approximate the denominator degrees of freedomthat are necessary to obtain p-values for t and F . Stata 14 provides five differentmethods for estimating denominator degrees of freedom: residual, repeated, anova,Satterthwaite (satterthwaite), and Kenward–Rogers (krogers). Unsurprisingly, notall statisticians approve of this approach for dealing with small samples in mixed models.Mitchell provides an example of a small-sample repeated measures using the repeatedmethod with dfmethod in mixed.

2.4 Regression models (chapters 14–19)

The next stop on the tour examines traditional regression analyses. Earlier, in chap-ter 10, Mitchell described how to supercharge ANOVA models using regression. In thissection, he details traditional regression analysis using continuous predictors. Highlightsof this section include presenting regression results, tools for model building, regressiondiagnostics, and power analysis for regression. The discussion on regression in this sec-tion is very good, but it is not as extensive as that found in Mitchell’s (2012) earlierbook, Interpreting and Visualizing Regression Models Using Stata.

2.5 Stata overview (chapters 20–23)

Our final stop on this tour is the section titled “Stata overview”, which will probably notbe of great interest to most experienced Stata users. The sights here will appeal mainlyto the neophyte Stata user, but even an experienced user may find some gems. Aftera generic review of Stata’s estimation commands, Mitchell moves on to postestimationcommands, with a section highlighting features of the margins command. Also, do notmiss the gallery of marginsplot graphs covering a number of common scenarios.

P. B. Ender 241

Next comes a short side trip through some data management commands, includinghow to read data into Stata and a review of the reshape command. Reshaping is fre-quently called upon when analyzing repeated-measures ANOVA data. I should mentionthat Mitchell (2010) has also authored Data Management Using Stata: A PracticalHandbook, which provides more extensive coverage of data management in Stata.

The tour ends with 41 common SPSS commands. Mitchell gives the Stata equivalentfor each SPSS command along with a worked example. This material is specificallydesigned to help users who are moving from SPSS to Stata. SPSS is widely taughtand used in the social and behavioral sciences. Many of these users do not know theadvantages of using Stata and how it can facilitate the analysis of complex experimentaldesigns.

3 Strengths and weaknesses

I have already alluded to one of the strengths of this book, namely, the numerous, clearlydescribed examples. These examples lead you through progressively more complexdesigns. Mitchell presents many easy-to-understand data analysis scenarios. These aresimilar to the questions that in my experience, researchers do ask. Each scenario comeswith its own dataset, which Mitchell describes and then analyzes.

Another strength is that the inclusion of topics not typically found in ANOVA books,such as power and regression, sets this book apart from other ANOVA books that I haveused.

When it comes to weaknesses, I can identify only minor improvements or omissions.I am sure that most of what I consider omissions came after careful consideration aboutthe length of the book. No single book can include all the information on this extensivetopic. But here are some areas that could be improved if space were not an issue.

Exercises or problem sets would be a nice addition if one wanted to use the book asa text for a first course in ANOVA.

Classical ANOVA books, such as Kirk (1968) and Winer (1962), cover many moreANOVA designs than Mitchell does. There are a multitude of nested designs, lattice de-signs, Latin square designs, Graeco-Latin square designs, designs with group-interactionconfounding, treatment-interaction confounding, etc. These designs came primarily outof agricultural research and, even back in my student days, were not very common inbehavioral sciences. However, it could be useful to see examples of a couple of thesemore esoteric designs in the book.

I would have liked more examples of repeated-measures designs with small samplesto highlight the different dfmethod options. The only example in the book uses therepeated option. And even though I prefer using linear mixed models for repeatedmeasures, it would have been informative to see how the results compare with analyzingthe same data using the anova command.


4 Conclusion

Back in his UCLA days, Mitchell used to describe stat packages as statistical tool boxes.The more tools you know how to use well, the faster and easier the process of analyz-ing data becomes. While Stata for the Behavioral Sciences does not break any newmethodological or statistical ground, it does provide solid coverage on useful ANOVA

tools that may not be well known to everyone. The material in the book is accessibleto beginners, yet it still can provide useful information for many advanced users. I wishthat this book had been available years ago because I would have gladly recommendedit to some of our Stat Consulting clients. And just maybe, reading this book may temptsome of the SPSS users to give Stata a try.

5 ReferencesKirk, R. E. 1968. Experimental Design: Procedures for the Behavioral Sciences. Thou-sand Oaks, CA: Brooks/Cole.

Mitchell, M. N. 2010. Data Management Using Stata: A Practical Handbook. CollegeStation, TX: Stata Press.

. 2012. Interpreting and Visualizing Regression Models Using Stata. CollegeStation, TX: Stata Press.

. 2015. Stata for the Behavioral Sciences. College Station, TX: Stata Press.

Winer, B. J. 1962. Statistical Principles in Experimental Design. New York: McGraw–Hill.

About the author

Phil Ender is a psychologist who took too many statistics courses. He taught statistics andresearch methods courses for UCLA’s Graduate School of Education and Information Studies.He also worked for 15 years as a consultant for the UCLA Statistical Consulting Group. Heretired from UCLA in 2015.

The Stata Journal (2016)16, Number 1, p. 243

A menu-driven facility for power anddetectable-difference calculations in

stepped-wedge cluster-randomized trials,erratum

Karla HemmingUniversity of Birmingham

Birmingham, UK

[email protected]

Alan GirlingUniversity of Birmingham

Birmingham, UK

[email protected]

In the print and electronic versions of Hemming and Girling (2014, Stata Journal 14:363–380), the equation in the fifth line of text in section 2 appeared as

“. . . such that σ2/2n is the variance of the estimated . . . ”

but should have appeared as

“. . . such that 2σ2/n is the variance of the estimated . . . ”

There is no change to the software.

c© 2016 StataCorp LP st0341 1

The Stata Journal (2016)16, Number 1, p. 244

Software Updates

dm0078 1: newspell: Easy management of complex spell data. H. Kroger. Stata Journal15: 155–172.

In previous versions of the program, the newspell combine command did not com-bine two types of spells if one completely overlapped the other. This bug has nowbeen fixed. In addition, the dataset is now sorted as specified in the sort() option.

st0146 1: Error-correction–based cointegration tests for panel data. D. Persyn andJ. Westerlund. Stata Journal 8: 232–241.

This update fixes an issue that caused the program to fail with an “invalid operator”r(198) error message on recent versions of Stata. A new option, mg, has been addedthat shows the mean group estimator results. This is similar to the xtpmg command,while allowing for different lead and lag lengths.

st0390 1: Generalized maximum entropy estimation of discrete choice models. P. Corraland M. Terbish. Stata Journal 15: 512–522.

The program has been reorganized so that Mata code is now, for greater ease ofuse, included with the ado-file, rather than in separate do-files, as first published.Results are unaffected.

st0393 1: Estimating almost-ideal demand systems with endogenous regressors.S. Lecocq and J.-M. Robin. Stata Journal 15: 554–573.

The following changes have been made, echoed as needed in the help files:

1. Attempting to run the aidsills command without any valid observations nowresults in an error message.

2. In the aidsills command, an error has been corrected in the calculation of thevariance–covariance matrix of the estimator.

3. In the aidsills elas postestimation command, an error has been corrected inthe calculation of price elasticities for the quadratic version of the almost-idealdemand system. Note that the formulas given in the paper were correct.

4. The postestimation command aidsills vif has been added to calculate centeredvariance inflation factors for the independent variables specified in the demandequations, and in the instrumental regression(s), if any.

c© 2016 StataCorp LP up0050

The Stata Journal - med.mahidol.ac.thSubscriptions are available from StataCorp, 4905 Lakeway Drive,...

Documents

Transcript of The Stata Journal - med.mahidol.ac.thSubscriptions are available from StataCorp, 4905 Lakeway Drive,...