
Statistical Science 2010, Vol. 25, No. 3, 289–310. DOI: 10.1214/10-STS330. © Institute of Mathematical Statistics, 2010

To Explain or to Predict?
Galit Shmueli

Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.

Key words and phrases: Explanatory modeling, causality, predictive modeling, predictive power, statistical strategy, data mining, scientific research.

1. INTRODUCTION

Looking at how statistical models are used in different scientific disciplines for the purpose of theory building and testing, one finds a range of perceptions regarding the relationship between causal explanation and empirical prediction. In many scientific fields such as economics, psychology, education, and environmental science, statistical models are used almost exclusively for causal explanation, and models that possess high explanatory power are often assumed to inherently possess predictive power. In fields such as natural language processing and bioinformatics, the focus is on empirical prediction with only a slight and indirect relation to causal explanation. And yet in other research fields, such as epidemiology, the emphasis on causal explanation versus empirical prediction is more mixed. Statistical modeling for description, where the purpose is to capture the data structure parsimoniously, and which is the most commonly developed within the field of statistics, is not commonly used for theory building and testing in other disciplines. Hence, in this article I

Galit Shmueli is Associate Professor of Statistics, Department of Decision, Operations and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, USA (e-mail: [email protected]).

focus on the use of statistical modeling for causal explanation and for prediction. My main premise is that the two are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different. This article aims to fill a critical void: to tackle the distinction between explanatory modeling and predictive modeling.

Clearing the current ambiguity between the two is critical not only for proper statistical modeling, but more importantly, for proper scientific usage. Both explanation and prediction are necessary for generating and testing theories, yet each plays a different role in doing so. The lack of a clear distinction within statistics has created a lack of understanding in many disciplines of the difference between building sound explanatory models versus creating powerful predictive models, as well as confusing explanatory power with predictive power. The implications of this omission and the lack of clear guidelines on how to model for explanatory versus predictive goals are considerable for both scientific research and practice and have also contributed to the gap between academia and practice.

I start by defining what I term explaining and predicting. These definitions are chosen to reflect the distinct scientific goals that they are aimed at: causal explanation and empirical prediction, respectively. Explanatory modeling and predictive modeling reflect the process of using data and statistical (or data mining) methods for explaining or predicting, respectively. The term modeling is intentionally chosen over models to highlight the entire process involved, from goal definition, study design, and data collection to scientific use.

1.1 Explanatory Modeling

In many scientific fields, and especially the social sciences, statistical methods are used nearly exclusively for testing causal theory. Given a causal theoretical model, statistical models are applied to data in order to test causal hypotheses. In such models, a set of underlying factors that are measured by variables X are assumed to cause an underlying effect, measured by variable Y. Based on collaborative work with social scientists and economists, on an examination of some of their literature, and on conversations with a diverse group of researchers, I conjecture that, whether statisticians like it or not, the type of statistical models used for testing causal hypotheses in the social sciences are almost always association-based models applied to observational data. Regression models are the most common example. The justification for this practice is that the theory itself provides the causality. In other words, the role of the theory is very strong and the reliance on data and statistical modeling is strictly through the lens of the theoretical model. The theory–data relationship varies in different fields. While the social sciences are very theory-heavy, in areas such as bioinformatics and natural language processing the emphasis on a causal theory is much weaker. Hence, given this reality, I define explaining as causal explanation and explanatory modeling as the use of statistical models for testing causal explanations.

To illustrate how explanatory modeling is typically done, I describe the structure of a typical article in a highly regarded journal in the field of Information Systems (IS). Researchers in the field of IS usually have training in economics and/or the behavioral sciences. The structure of articles reflects the way empirical research is conducted in IS and related fields.

The example used is an article by Gefen, Karahanna and Straub (2003), which studies technology acceptance. The article starts with a presentation of the prevailing relevant theory(ies):

Online purchase intentions should be explained in part by the technology acceptance model (TAM). This theoretical model is at present a preeminent theory of technology acceptance in IS.

The authors then proceed to state multiple causal hypotheses (denoted H1, H2, . . . in Figure 1, right panel), justifying the merits for each hypothesis and grounding it in theory. The research hypotheses are given in terms of theoretical constructs rather than measurable variables. Unlike measurable variables, constructs are abstractions that “describe a phenomenon of theoretical interest” (Edwards and Bagozzi, 2000) and can be observable or unobservable. Examples of constructs in this article are trust, perceived usefulness (PU), and perceived ease of use (PEOU). Examples of constructs used in other fields include anger, poverty, well-being, and odor. The hypotheses section will often include a causal diagram illustrating the hypothesized causal relationship between the constructs (see Figure 1, left panel). The next step is construct operationalization, where a bridge is built between theoretical constructs and observable measurements, using previous literature and theoretical justification. Only after the theoretical component is completed, and measurements are justified and defined, do researchers proceed to the next step where data and statistical modeling are introduced alongside the statistical hypotheses, which are operationalized from the research hypotheses. Statistical inference will lead to “statistical conclusions” in terms of effect sizes and statistical significance in relation to the causal hypotheses. Finally, the statistical conclusions are converted into research conclusions, often accompanied by policy recommendations.

FIG. 1. Causal diagram (left) and partial list of stated hypotheses (right) from Gefen, Karahanna and Straub (2003).

In summary, explanatory modeling refers here to the application of statistical models to data for testing causal hypotheses about theoretical constructs. Whereas “proper” statistical methodology for testing causality exists, such as designed experiments or specialized causal inference methods for observational data [e.g., causal diagrams (Pearl, 1995), discovery algorithms (Spirtes, Glymour and Scheines, 2000), probability trees (Shafer, 1996), and propensity scores (Rosenbaum and Rubin, 1983; Rubin, 1997)], in practice association-based statistical models, applied to observational data, are most commonly used for that purpose.

1.2 Predictive Modeling

I define predictive modeling as the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations. In particular, I focus on nonstochastic prediction (Geisser, 1993, page 31), where the goal is to predict the output value (Y) for new observations given their input values (X). This definition also includes temporal forecasting, where observations until time t (the input) are used to forecast future values at time t + k, k > 0 (the output). Predictions include point or interval predictions, prediction regions, predictive distributions, or rankings of new observations. Predictive model is any method that produces predictions, regardless of its underlying approach: Bayesian or frequentist, parametric or nonparametric, data mining algorithm or statistical model, etc.

1.3 Descriptive Modeling

Although not the focus of this article, a third type of modeling, which is the most commonly used and developed by statisticians, is descriptive modeling. This type of modeling is aimed at summarizing or representing the data structure in a compact manner. Unlike explanatory modeling, in descriptive modeling the reliance on an underlying causal theory is absent or incorporated in a less formal way. Also, the focus is at the measurable level rather than at the construct level. Unlike predictive modeling, descriptive modeling is not aimed at prediction. Fitting a regression model can be descriptive if it is used for capturing the association between the dependent and independent variables rather than for causal inference or for prediction. We mention this type of modeling to avoid confusion with causal-explanatory and predictive modeling, and also to highlight the different approaches of statisticians and nonstatisticians.

1.4 The Scientific Value of Predictive Modeling

Although explanatory modeling is commonly used for theory building and testing, predictive modeling is nearly absent in many scientific fields as a tool for developing theory. One possible reason is the statistical training of nonstatistician researchers. A look at many introductory statistics textbooks reveals very little in the way of prediction. Another reason is that prediction is often considered unscientific. Berk (2008) wrote, “In the social sciences, for example, one either did causal modeling econometric style or largely gave up quantitative work.” From conversations with colleagues in various disciplines it appears that predictive modeling is often valued for its applied utility, yet is discarded for scientific purposes such as theory building or testing. Shmueli and Koppius (2010) illustrated the lack of predictive modeling in the field of IS. Searching the 1072 papers published in the two top-rated journals Information Systems Research and MIS Quarterly between 1990 and 2006, they found only 52 empirical papers with predictive claims, of which only seven carried out proper predictive modeling or testing.

Even among academic statisticians, there appears to be a divide between those who value prediction as the main purpose of statistical modeling and those who see it as unacademic. Examples of statisticians who emphasize predictive methodology include Akaike (“The predictive point of view is a prototypical point of view to explain the basic activity of statistical analysis” in Findley and Parzen, 1998), Deming (“The only useful function of a statistician is to make predictions” in Wallis, 1980), Geisser (“The prediction of observables or potential observables is of much greater relevance than the estimate of what are often artificial constructs-parameters,” Geisser, 1975), Aitchison and Dunsmore (“prediction analysis. . . is surely at the heart of many statistical applications,” Aitchison and Dunsmore, 1975) and Friedman (“One of the most common and important uses for data is prediction,” Friedman, 1997). Examples of those who see it as unacademic are Kendall and Stuart (“The Science of Statistics deals with the properties of populations. In considering a population of men we are not interested, statistically speaking, in whether some particular individual has brown eyes or is a forger, but rather in how many of the individuals have brown eyes or are forgers,” Kendall and Stuart, 1977) and more recently Parzen (“The two goals in analyzing data. . . I prefer to describe as “management” and “science.” Management seeks profit. . . Science seeks truth,” Parzen, 2001). In economics there is a similar disagreement regarding “whether prediction per se is a legitimate objective of economic science, and also whether observed data should be used only to shed light on existing theories or also for the purpose of hypothesis seeking in order to develop new theories” (Feelders, 2002).

Before proceeding with the discrimination between explanatory and predictive modeling, it is important to establish prediction as a necessary scientific endeavor beyond utility, for the purpose of developing and testing theories. Predictive modeling and predictive testing serve several necessary scientific functions:

1. Newly available large and rich datasets often contain complex relationships and patterns that are hard to hypothesize, especially given theories that exclude newly measurable concepts. Using predictive modeling in such contexts can help uncover potential new causal mechanisms and lead to the generation of new hypotheses. See, for example, the discussion between Gurbaxani and Mendelson (1990, 1994) and Collopy, Adya and Armstrong (1994).

2. The development of new theory often goes hand in hand with the development of new measures (Van Maanen, Sorensen and Mitchell, 2007). Predictive modeling can be used to discover new measures as well as to compare different operationalizations of constructs and different measurement instruments.

3. By capturing underlying complex patterns and relationships, predictive modeling can suggest improvements to existing explanatory models.

4. Scientific development requires empirically rigorous and relevant research. Predictive modeling enables assessing the distance between theory and practice, thereby serving as a “reality check” to the relevance of theories.¹ While explanatory power provides information about the strength of an underlying causal relationship, it does not imply its predictive power.

¹Predictive models are advantageous in terms of negative empiricism: a model either predicts accurately or it does not, and this can be observed. In contrast, explanatory models can never be confirmed and are harder to contradict.

5. Predictive power assessment offers a straightforward way to compare competing theories by examining the predictive power of their respective explanatory models.

6. Predictive modeling plays an important role in quantifying the level of predictability of measurable phenomena by creating benchmarks of predictive accuracy (Ehrenberg and Bound, 1993). Knowledge of unpredictability is a fundamental component of scientific knowledge (see, e.g., Taleb, 2007). Because predictive models tend to have higher predictive accuracy than explanatory statistical models, they can give an indication of the potential level of predictability. A very low predictability level can lead to the development of new measures, new collected data, and new empirical approaches. An explanatory model that is close to the predictive benchmark may suggest that our understanding of that phenomenon can only be increased marginally. On the other hand, an explanatory model that is very far from the predictive benchmark would imply that there are substantial practical and theoretical gains to be had from further scientific development.

For a related, more detailed discussion of the value of prediction to scientific theory development see the work of Shmueli and Koppius (2010).

1.5 Explaining and Predicting Are Different

In the philosophy of science, it has long been debated whether explaining and predicting are one or distinct. The conflation of explanation and prediction has its roots in philosophy of science literature, particularly the influential hypothetico-deductive model (Hempel and Oppenheim, 1948), which explicitly equated prediction and explanation. However, as later became clear, the type of uncertainty associated with explanation is of a different nature than that associated with prediction (Helmer and Rescher, 1959). This difference highlighted the need for developing models geared specifically toward dealing with predicting future events and trends such as the Delphi method (Dalkey and Helmer, 1963). The distinction between the two concepts has been further elaborated (Forster and Sober, 1994; Forster, 2002; Sober, 2002; Hitchcock and Sober, 2004; Dowe, Gardner and Oppy, 2007). In his book Theory Building, Dubin (1969, page 9) wrote:

Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding. It will be argued that these are separate goals [. . . ] I will not, however, conclude that they are either inconsistent or incompatible.

Herbert Simon distinguished between “basic science” and “applied science” (Simon, 2001), a distinction similar to explaining versus predicting. According to Simon, basic science is aimed at knowing (“to describe the world”) and understanding (“to provide explanations of these phenomena”). In contrast, in applied science, “Laws connecting sets of variables allow inferences or predictions to be made from known values of some of the variables to unknown values of other variables.”

Why should there be a difference between explaining and predicting? The answer lies in the fact that measurable data are not accurate representations of their underlying constructs. The operationalization of theories and constructs into statistical models and measurable data creates a disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level.

To convey this disparity more formally, consider a theory postulating that construct 𝒳 causes construct 𝒴, via the function ℱ, such that 𝒴 = ℱ(𝒳). ℱ is often represented by a path model, a set of qualitative statements, a plot (e.g., a supply and demand plot), or mathematical formulas. Measurable variables X and Y are operationalizations of 𝒳 and 𝒴, respectively. The operationalization of ℱ into a statistical model f, such as E(Y) = f(X), is done by considering ℱ in light of the study design (e.g., numerical or categorical Y; hierarchical or flat design; time series or cross-sectional; complete or censored data) and practical considerations such as standards in the discipline. Because ℱ is usually not sufficiently detailed to lead to a single f, often a set of f models is considered. Feelders (2002) described this process in the field of economics. In the predictive context, we consider only X, Y and f.

The disparity arises because the goal in explanatory modeling is to match f and ℱ as closely as possible for the statistical inference to apply to the theoretical hypotheses. The data X, Y are tools for estimating f, which in turn is used for testing the causal hypotheses. In contrast, in predictive modeling the entities of interest are X and Y, and the function f is used as a tool for generating good predictions of new Y values. In fact, we will see that even if the underlying causal relationship is indeed 𝒴 = ℱ(𝒳), a function other than f(X) and data other than X might be preferable for prediction.

The disparity manifests itself in different ways. Four major aspects are:

Causation–Association: In explanatory modeling f represents an underlying causal function, and X is assumed to cause Y. In predictive modeling f captures the association between X and Y.

Theory–Data: In explanatory modeling, f is carefully constructed based on ℱ in a fashion that supports interpreting the estimated relationship between X and Y and testing the causal hypotheses. In predictive modeling, f is often constructed from the data. Direct interpretability in terms of the relationship between X and Y is not required, although sometimes transparency of f is desirable.

Retrospective–Prospective: Predictive modeling is forward-looking, in that f is constructed for predicting new observations. In contrast, explanatory modeling is retrospective, in that f is used to test an already existing set of hypotheses.

Bias–Variance: The expected prediction error for a new observation with value x, using a quadratic loss function,² is given by Hastie, Tibshirani and Friedman (2009, page 223):

EPE = E{Y − f̂(x)}²
    = E{Y − f(x)}² + {E(f̂(x)) − f(x)}² + E{f̂(x) − E(f̂(x))}²    (1)
    = Var(Y) + Bias² + Var(f̂(x)).

Bias is the result of misspecifying the statistical model f. Estimation variance (the third term) is the result of using a sample to estimate f. The first term is the error that results even if the model is correctly specified and accurately estimated. The above decomposition reveals a source of the difference between explanatory and predictive modeling: In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision. This point is illustrated in the Appendix, showing that the “wrong” model can sometimes predict better than the correct one.

²For a binary Y, various 0–1 loss functions have been suggested in place of the quadratic loss function (Domingos, 2000).
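To make the bias–variance tradeoff in (1) concrete, the following is a minimal simulation sketch (not the paper's Appendix derivation): it compares the average out-of-sample squared prediction error of the correctly specified two-predictor regression with that of an underspecified model that drops a weak, collinear predictor. All variable names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_epe(n=30, reps=2000, b=(1.0, 0.15), rho=0.9, sigma=1.0):
    """Average squared prediction error on a new observation for the
    'true' two-predictor model vs. an intentionally underspecified
    model that omits the weak predictor x2 (illustrative setup)."""
    err_full, err_reduced = [], []
    for _ in range(reps):
        # correlated predictors
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = b[0] * x1 + b[1] * x2 + sigma * rng.normal(size=n)
        # one new observation to predict
        x1n = rng.normal(); x2n = rho * x1n + np.sqrt(1 - rho**2) * rng.normal()
        yn = b[0] * x1n + b[1] * x2n + sigma * rng.normal()
        # full model: y ~ x1 + x2 (the correct specification)
        Xf = np.column_stack([np.ones(n), x1, x2])
        beta_f = np.linalg.lstsq(Xf, y, rcond=None)[0]
        err_full.append((yn - np.array([1, x1n, x2n]) @ beta_f) ** 2)
        # reduced model: y ~ x1 (biased but lower estimation variance)
        Xr = np.column_stack([np.ones(n), x1])
        beta_r = np.linalg.lstsq(Xr, y, rcond=None)[0]
        err_reduced.append((yn - np.array([1, x1n]) @ beta_r) ** 2)
    return np.mean(err_full), np.mean(err_reduced)

# with a weak, collinear x2 the reduced model can achieve the lower average error
print(simulate_epe())
```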


The four aspects impact every step of the modeling process, such that the resulting f is markedly different in the explanatory and predictive contexts, as will be shown in Section 2.

1.6 A Void in the Statistics Literature

The philosophical explaining/predicting debate has not been directly translated into statistical language in terms of the practical aspects of the entire statistical modeling process.

A search of the statistics literature for discussion of explaining versus predicting reveals a lively discussion in the context of model selection, and in particular, the derivation and evaluation of model selection criteria. In this context, Konishi and Kitagawa (2007) wrote:

There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the ‘true model.’

The literature on this topic is vast, and we do not intend to cover it here, although we discuss the major points in Section 2.6.

The focus on prediction in the field of machine learning and by statisticians such as Geisser, Aitchison and Dunsmore, Breiman and Friedman, has highlighted aspects of predictive modeling that are relevant to the explanatory/prediction distinction, although they do not directly contrast explanatory and predictive modeling.³

The prediction literature raises the importance of evaluating predictive power using holdout data, and the usefulness of algorithmic methods (Breiman, 2001b). The predictive focus has also led to the development of inference tools that generate predictive distributions. Geisser (1993) introduced “predictive inference” and developed it mainly in a Bayesian context. “Predictive likelihood” (see Bjornstad, 1990) is a likelihood-based approach to predictive inference, and Dawid’s prequential theory (Dawid, 1984) investigates inference concepts in terms of predictability. Finally, the bias–variance aspect has been pivotal in data mining for understanding the predictive performance of different algorithms and for designing new ones.

³Geisser distinguished between “[statistical] parameters” and “observables” in terms of the objects of interest. His distinction is closely related, but somewhat different from our distinction between theoretical constructs and measurements.

Another area in statistics and econometrics that focuses on prediction is time series. Methods have been developed specifically for testing the predictability of a series [e.g., random walk tests or the concept of Granger causality (Granger, 1969)], and evaluating predictability by examining performance on holdout data. The time series literature in statistics is dominated by extrapolation models such as ARIMA-type models and exponential smoothing methods, which are suitable for prediction and description, but not for causal explanation. Causal models for time series are common in econometrics (e.g., Song and Witt, 2000), where an underlying causal theory links constructs, which lead to operationalized variables, as in the cross-sectional case. Yet, to the best of my knowledge, there is no discussion in the statistics time series literature regarding the distinction between predictive and explanatory modeling, aside from the debate in economics regarding the scientific value of prediction.

To conclude, the explanatory/predictive modeling distinction has been discussed directly in the model selection context, but not in the larger context. Areas that focus on developing predictive modeling such as machine learning and statistical time series, and “predictivists” such as Geisser, have considered prediction as a separate issue, and have not discussed its principal and practical distinction from causal explanation in terms of developing and testing theory. The goal of this article is therefore to examine the explanatory versus predictive debate from a statistical perspective, considering how modeling is used by nonstatistician scientists for theory development.

The remainder of the article is organized as follows. In Section 2, I consider each step in the modeling process in terms of the four aspects of the predictive/explanatory modeling distinction: causation–association, theory–data, retrospective–prospective and bias–variance. Section 3 illustrates some of these differences via two examples. A discussion of the implications of the predict/explain conflation, conclusions, and recommendations are given in Section 4.

2. TWO MODELING PATHS

In the following I examine the process of statistical modeling through the explain/predict lens, from goal definition to model use and reporting. For clarity, I broke down the process into a generic set of steps, as depicted in Figure 2. In each step I point out differences in the choice of methods, criteria, data, and information to consider when the goal is predictive versus explanatory. I also briefly describe the related statistics literature. The conceptual and practical differences invariably lead to a difference between a final explanatory model and a predictive one, even though they may use the same initial data. Thus, a priori determination of the main study goal as either explanatory or predictive⁴ is essential to conducting adequate modeling. The discussion in this section assumes that the main research goal has been determined as either explanatory or predictive.

FIG. 2. Steps in the statistical modeling process.

2.1 Study Design and Data Collection

Even at the early stages of study design and data collection, issues of what and how much data to collect, according to what design, and which collection instrument to use are considered differently for prediction versus explanation. Consider sample size. In explanatory modeling, where the goal is to estimate the theory-based f with adequate precision and to use it for inference, statistical power is the main consideration. Reducing bias also requires sufficient data for model specification testing. Beyond a certain amount of data, however, extra precision is negligible for purposes of inference. In contrast, in predictive modeling, f itself is often determined from the data, thereby requiring a larger sample for achieving lower bias and variance. In addition, more data are needed for creating holdout datasets (see Section 2.2). Finally, predicting new individual observations accurately, in a prospective manner, requires more data than retrospective inference regarding population-level parameters, due to the extra uncertainty.

A second design issue is sampling scheme. For instance, in the context of hierarchical data (e.g., sampling students within schools) Afshartous and de Leeuw (2005) noted, “Although there exists an extensive literature on estimation issues in multilevel models, the same cannot be said with respect to prediction.”

⁴The main study goal can also be descriptive.

Examining issues of sample size, sample allocation, and multilevel modeling for the purpose of “predicting a future observable y∗j in the Jth group of a hierarchical dataset,” they found that allocation for estimation versus prediction should be different: “an increase in group size n is often more beneficial with respect to prediction than an increase in the number of groups J . . . [whereas] estimation is more improved by increasing the number of groups J instead of the group size n.” This relates directly to the bias–variance aspect. A related issue is the choice of f in relation to sampling scheme. Afshartous and de Leeuw (2005) found that for their hierarchical data, a hierarchical f, which is more appropriate theoretically, had poorer predictive performance than a nonhierarchical f.

A third design consideration is the choice between experimental and observational settings. Whereas for causal explanation experimental data are greatly preferred, subject to availability and resource constraints, in prediction sometimes observational data are preferable to “overly clean” experimental data, if they better represent the realistic context of prediction in terms of the uncontrolled factors, the noise, the measured response, etc. This difference arises from the theory–data and prospective–retrospective aspects. Similarly, when choosing between primary data (data collected for the purpose of the study) and secondary data (data collected for other purposes), the classic criteria of data recency, relevance, and accuracy (Patzer, 1995) are considered from a different angle. For example, a predictive model requires the secondary data to include the exact X, Y variables to be used at the time of prediction, whereas for causal explanation different operationalizations of the constructs 𝒳, 𝒴 may be acceptable.

In terms of the data collection instrument, whereas in explanatory modeling the goal is to obtain a reliable and valid instrument such that the data obtained represent the underlying construct adequately (e.g., item response theory in psychometrics), for predictive purposes it is more important to focus on the measurement quality and its meaning in terms of the variable to be predicted.


Finally, consider the field of design of experiments: two major experimental designs are factorial designs and response surface methodology (RSM) designs. The former is focused on causal explanation in terms of finding the factors that affect the response. The latter is aimed at prediction—finding the combination of predictors that optimizes Y. Factorial designs employ a linear f for interpretability, whereas RSM designs use optimization techniques and estimate a nonlinear f from the data, which is less interpretable but more predictively accurate.⁵

2.2 Data Preparation

We consider two common data preparation operations: handling missing values and data partitioning.

2.2.1 Handling missing values. Most real datasets consist of missing values, thereby requiring one to identify the missing values, to determine the extent and type of missingness, and to choose a course of action accordingly. Although a rich literature exists on data imputation, it is monopolized by an explanatory context. In predictive modeling, the solution strongly depends on whether the missing values are in the training data and/or the data to be predicted. For example, Sarle (1998) noted:

If you have only a small proportion of cases with missing data, you can simply throw out those cases for purposes of estimation; if you want to make predictions for cases with missing inputs, you don’t have the option of throwing those cases out.

Sarle further listed imputation methods that are useful for explanatory purposes but not for predictive purposes and vice versa. One example is using regression models with dummy variables that indicate missingness, which is considered unsatisfactory in explanatory modeling, but can produce excellent predictions. The usefulness of creating missingness dummy variables was also shown by Ding and Simonoff (2010). In particular, whereas the classic explanatory approach is based on the Missing-At-Random, Missing-Completely-At-Random or Not-Missing-At-Random classification (Little and Rubin, 2002), Ding and Simonoff (2010) showed that for predictive purposes the important distinction is whether the missingness depends on Y or not. They concluded:

⁵I thank Douglas Montgomery for this insight.

In the context of classification trees, the relationship between the missingness and the dependent variable, rather than the standard missingness classification approach of Little and Rubin (2002). . . is the most helpful criterion to distinguish different missing data methods.

Moreover, missingness can be a blessing in a predictive context, if it is sufficiently informative of Y (e.g., missingness in financial statements when the goal is to predict fraudulent reporting).
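As a simple illustration of the missingness-dummy idea, the sketch below augments a predictor matrix with 0/1 indicators of missingness before any predictive model is fit. The toy data and the mean fill-in are assumptions for illustration only; nothing here reproduces Ding and Simonoff's (2010) analysis.

```python
import numpy as np

def add_missingness_dummies(X):
    """Return the predictor matrix with missing values filled in by
    column means and appended 0/1 missingness indicators, so a
    predictive model can exploit informative missingness."""
    miss = np.isnan(X).astype(float)                      # 1 where the value is missing
    X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    return np.hstack([X_filled, miss])                    # filled predictors + indicators

# hypothetical predictor matrix with two columns and some missing entries
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])
print(add_missingness_dummies(X))
```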

Finally, a completely different approach for handling missing data for prediction, mentioned by Sarle (1998) and further developed by Saar-Tsechansky and Provost (2007), considers the case where to-be-predicted observations are missing some predictor information, such that the missing information can vary across different observations. The proposed solution is to estimate multiple “reduced” models, each excluding some predictors. When predicting an observation with missingness on a certain set of predictors, the model that excludes those predictors is used. This approach means that different reduced models are created for different observations. Although useful for prediction, it is clearly inappropriate for causal explanation.
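A rough sketch of this “reduced models” strategy is given below: one least-squares model is fit for every subset of predictors, and a new observation is scored by the model that excludes its missing predictors. It is only meant to convey the idea under simplifying assumptions (few predictors, at least one observed predictor per case); it is not the algorithm of Saar-Tsechansky and Provost (2007).

```python
import numpy as np
from itertools import combinations

def fit_reduced_models(X, y):
    """Fit one least-squares model per subset of predictors,
    keyed by the tuple of included column indices."""
    p = X.shape[1]
    models = {}
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xs = np.column_stack([np.ones(len(y)), X[:, cols]])
            models[cols] = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return models

def predict_with_missing(models, x_new):
    """Score a new observation with the reduced model that excludes
    whichever predictors are missing in that observation."""
    cols = tuple(i for i in range(len(x_new)) if not np.isnan(x_new[i]))
    beta = models[cols]
    return beta[0] + x_new[list(cols)] @ beta[1:]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
models = fit_reduced_models(X, y)
print(predict_with_missing(models, np.array([0.4, np.nan, 1.2])))
```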

2.2.2 Data partitioning. A popular solution for avoiding overoptimistic predictive accuracy is to evaluate performance not on the training set, that is, the data used to build the model, but rather on a holdout sample which the model “did not see.” The creation of a holdout sample can be achieved in various ways, the most commonly used being a random partition of the sample into training and holdout sets. A popular alternative, especially with scarce data, is cross-validation. Other alternatives are resampling methods, such as bootstrap, which can be computationally intensive but avoid “bad partitions” and enable predictive modeling with small datasets.

Data partitioning is aimed at minimizing the combined bias and variance by sacrificing some bias in return for a reduction in sampling variance. A smaller sample is associated with higher bias when f is estimated from the data, which is common in predictive modeling but not in explanatory modeling. Hence, data partitioning is useful for predictive modeling but less so for explanatory modeling. With today’s abundance of large datasets, where the bias sacrifice is practically small, data partitioning has become a standard preprocessing step in predictive modeling.
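The sketch below illustrates the basic partitioning workflow: a random training/holdout split, a model fit on the training data only, and performance compared on both sets. The 70/30 split and the simulated data are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_holdout_split(X, y, holdout_frac=0.3):
    """Randomly partition the sample into training and holdout sets."""
    idx = rng.permutation(len(y))
    cut = int(len(y) * (1 - holdout_frac))
    tr, ho = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[ho], y[ho]

def rmse(beta, X, y):
    """Root mean squared prediction error of a linear model with intercept."""
    return np.sqrt(np.mean((y - np.column_stack([np.ones(len(y)), X]) @ beta) ** 2))

X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
Xtr, ytr, Xho, yho = train_holdout_split(X, y)
beta = np.linalg.lstsq(np.column_stack([np.ones(len(ytr)), Xtr]), ytr, rcond=None)[0]
# performance on the data that "gave birth" to the model vs. unseen data
print("training RMSE:", rmse(beta, Xtr, ytr))
print("holdout  RMSE:", rmse(beta, Xho, yho))
```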


In explanatory modeling, data partitioning is less common because of the reduction in statistical power. When used, it is usually done for the retrospective purpose of assessing the robustness of f. A rarer yet important use of data partitioning in explanatory modeling is for strengthening model validity, by demonstrating some predictive power. Although one would not expect an explanatory model to be optimal in terms of predictive power, it should show some degree of accuracy (see discussion in Section 4.2).

2.3 Exploratory Data Analysis

Exploratory data analysis (EDA) is a key initial step in both explanatory and predictive modeling. It consists of summarizing the data numerically and graphically, reducing their dimension, and “preparing” for the more formal modeling step. Although the same set of tools can be used in both cases, they are used in a different fashion. In explanatory modeling, exploration is channeled toward the theoretically specified causal relationships, whereas in predictive modeling EDA is used in a more free-form fashion, supporting the purpose of capturing relationships that are perhaps unknown or at least less formally formulated.

One example is how data visualization is carried out. Fayyad, Grinstein and Wierse (2002, page 22) contrasted “exploratory visualization” with “confirmatory visualization”:

Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. . . In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. . . In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined.

Hence, interactivity, which supports exploration across a wide and sometimes unknown terrain, is very useful for learning about measurement quality and associations that are at the core of predictive modeling, but much less so in explanatory modeling, where the data are visualized through the theoretical lens.

A second example is numerical summaries. In a predictive context, one might explore a wide range of numerical summaries for all variables of interest, whereas in an explanatory model, the numerical summaries would focus on the theoretical relationships. For example, in order to assess the role of a certain variable as a mediator, its correlation with the response variable and with other covariates is examined by generating specific correlation tables.

A third example is the use of EDA for assessing assumptions of potential models (e.g., normality or multicollinearity) and exploring possible variable transformations. Here, too, an explanatory context would be more restrictive in terms of the space explored.

Finally, dimension reduction is viewed and used differently. In predictive modeling, a reduction in the number of predictors can help reduce sampling variance. Hence, methods such as principal components analysis (PCA) or other data compression methods that are even less interpretable (e.g., singular value decomposition) are often carried out initially. They may later lead to the use of compressed variables (such as the first few components) as predictors, even if those are not easily interpretable. PCA is also used in explanatory modeling, but for a different purpose. For questionnaire data, PCA and exploratory factor analysis are used to determine the validity of the survey instrument. The resulting factors are expected to correspond to the underlying constructs. In fact, the rotation step in factor analysis is specifically aimed at making the factors more interpretable. Similarly, correlations are used for assessing the reliability of the survey instrument.
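For the predictive use of dimension reduction, here is a minimal sketch of PCA via the singular value decomposition; the leading component scores Z can then be used as (possibly uninterpretable) predictors in place of the original X. The synthetic data and the choice of three components are illustrative assumptions.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Compress the predictors to scores on their leading principal
    components, computed from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T      # scores on the leading components

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
Z = pca_scores(X, n_components=3)        # use Z as predictors in place of X
print(Z.shape)                           # (50, 3)
```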

2.4 Choice of Variables

The criteria for choosing variables differ markedly in explanatory versus predictive contexts.

In explanatory modeling, where variables are seen as operationalized constructs, variable choice is based on the role of the construct in the theoretical causal structure and on the operationalization itself. A broad terminology related to different variable roles exists in various fields: in the social sciences—antecedent, consequent, mediator and moderator⁶ variables; in pharmacology and medical sciences—treatment and control variables; and in epidemiology—exposure and confounding variables. Carte and Craig (2003) mentioned that explaining moderating effects has become an important scientific endeavor in the field of Management Information Systems. Another important term common in economics is endogeneity or “reverse causation,” which results in biased parameter estimates. Endogeneity can occur due to different reasons. One reason is incorrectly omitting an input variable, say Z, from f when the causal construct 𝒵 is assumed to cause 𝒳 and 𝒴. In a regression model of Y on X, the omission of Z results in X being correlated with the error term. Winkelmann (2008) gave the example of a hypothesis that health insurance (𝒳) affects the demand for health services (𝒴). The operationalized variables are “health insurance status” (X) and “number of doctor consultations” (Y). Omitting an input measurement Z for “true health status” (𝒵) from the regression model f causes endogeneity because X can be determined by Y (i.e., reverse causation), which manifests as X being correlated with the error term in f. Endogeneity can arise due to other reasons such as measurement error in X. Because of the focus in explanatory modeling on causality and on bias, there is a vast literature on detecting endogeneity and on solutions such as constructing instrumental variables and using models such as two-stage-least-squares (2SLS). Another related term is simultaneous causality, which gives rise to special models such as Seemingly Unrelated Regression (SUR) (Zellner, 1962). In terms of chronology, a causal explanatory model can include only “control” variables that take place before the causal variable (Gelman et al., 2003). And finally, for reasons of model identifiability (i.e., given the statistical model, each causal effect can be identified), one is required to include main effects in a model that contains an interaction term between those effects. We note this practice because it is not necessary or useful in the predictive context, due to the acceptability of uninterpretable models and the potential reduction in sampling variance when dropping predictors (see, e.g., the Appendix).

⁶“A moderator variable is one that influences the strength of a relationship between two other variables, and a mediator variable is one that explains the relationship between the two other variables” (from http://psych.wisc.edu/henriques/mediator.html).

In predictive modeling, the focus on association rather than causation, the lack of ℱ, and the prospective context, mean that there is no need to delve into the exact role of each variable in terms of an underlying causal structure. Instead, criteria for choosing predictors are quality of the association between the predictors and the response, data quality, and availability of the predictors at the time of prediction, known as ex-ante availability. In terms of ex-ante availability, whereas chronological precedence of X to Y is necessary in causal models, in predictive models not only must X precede Y, but X must be available at the time of prediction. For instance, explaining wine quality retrospectively would dictate including barrel characteristics as a causal factor. The inclusion of barrel characteristics in a predictive model of future wine quality would be impossible if at the time of prediction the grapes are still on the vine. See the eBay example in Section 3.2 for another example.

2.5 Choice of Methods

Considering the four aspects of causation–association, theory–data, retrospective–prospective and bias–variance leads to different choices of plausible methods, with a much larger array of methods useful for prediction. Explanatory modeling requires interpretable statistical models f that are easily linked to the underlying theoretical model ℱ. Hence the popularity of statistical models, and especially regression-type methods, in many disciplines. Algorithmic methods such as neural networks or k-nearest-neighbors, and uninterpretable nonparametric models, are considered ill-suited for explanatory modeling.

In predictive modeling, where the top priority is generating accurate predictions of new observations and f is often unknown, the range of plausible methods includes not only statistical models (interpretable and uninterpretable) but also data mining algorithms. A neural network algorithm might not shed light on an underlying causal mechanism ℱ or even on f, but it can capture complicated associations, thereby leading to accurate predictions. Although model transparency might be important in some cases, it is of secondary importance: “Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why” (Breiman, 2001b).

Breiman (2001b) accused the statistical community of ignoring algorithmic modeling:

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models.

From the explanatory/predictive view, algorithmic modeling is indeed very suitable for predictive (and descriptive) modeling, but not for explanatory modeling.

Some methods are not suitable for prediction from the retrospective–prospective aspect, especially in time series forecasting. Models with coincident indicators, which are measured simultaneously, are such a class. An example is the model Airfare_t = f(OilPrice_t), which might be useful for explaining the effect of oil price on airfare based on a causal theory, but not for predicting future airfare because the oil price at the time of prediction is unknown. For prediction, alternative models must be considered, such as using a lagged OilPrice variable, or creating a separate model for forecasting oil prices and plugging its forecast into the airfare model. Another example is the centered moving average, which requires the availability of data during a time window before and after a period of interest, and is therefore not useful for prediction.
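A small sketch of the lagged-predictor alternative is shown below, on synthetic series; the one-period lag and the variable names are assumptions for illustration, not the article's model.

```python
import numpy as np

def make_lagged_design(oil_price, airfare, lag=1):
    """Regress airfare_t on oil_price_{t-lag}, so that the predictor
    is already known at the time the forecast is made."""
    X = np.column_stack([np.ones(len(airfare) - lag), oil_price[:-lag]])
    y = airfare[lag:]
    return X, y

rng = np.random.default_rng(4)
oil = np.cumsum(rng.normal(size=120)) + 60              # synthetic oil-price series
fare = 200 + 2 * np.roll(oil, 1) + rng.normal(size=120) # airfare driven by lagged oil price
X, y = make_lagged_design(oil, fare, lag=1)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
forecast_next = beta[0] + beta[1] * oil[-1]             # uses only data known today
print(forecast_next)
```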

Lastly, the bias–variance aspect raises two classes of methods that are very useful for prediction, but not for explanation. The first is shrinkage methods such as ridge regression, principal components regression, and partial least squares regression, which “shrink” predictor coefficients or even eliminate them, thereby introducing bias into f, for the purpose of reducing estimation variance. The second class of methods, which “have been called the most influential development in Data Mining and Machine Learning in the past decade” (Seni and Elder, 2010, page vi), are ensemble methods such as bagging (Breiman, 1996), random forests (Breiman, 2001a), boosting⁷ (Schapire, 1999), variations of those methods, and Bayesian alternatives (e.g., Brown, Vannucci and Fearn, 2002). Ensembles combine multiple models to produce more precise predictions by averaging predictions from different models, and have proven useful in numerous applications (see the Netflix Prize example in Section 3.1).
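To illustrate how shrinkage trades bias for lower estimation variance, here is a minimal ridge regression sketch using the closed-form estimator on deliberately collinear synthetic predictors; the lambda values and data are arbitrary, and no claim is made that ridge wins for any particular dataset.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator: shrinks coefficients toward zero,
    accepting some bias in exchange for lower estimation variance."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p = 80, 20
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))  # highly collinear predictors
y = X @ rng.normal(size=p) + rng.normal(size=n)
Xtr, ytr, Xho, yho = X[:40], y[:40], X[40:], y[40:]

for lam in (0.0, 1.0, 10.0):              # lam = 0 is ordinary least squares
    beta = ridge_fit(Xtr, ytr, lam)
    err = np.sqrt(np.mean((yho - Xho @ beta) ** 2))
    print(f"lambda={lam:5.1f}  holdout RMSE={err:.3f}")
```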

2.6 Validation, Model Evaluation and Model Selection

Choosing the final model among a set of models, validating it, and evaluating its performance, differ markedly in explanatory and predictive modeling. Although the process is iterative, I separate it into three components for ease of exposition.

2.6.1 Validation. In explanatory modeling, validation consists of two parts: model validation validates that f adequately represents ℱ, and model fit validates that f fits the data {X, Y}. In contrast, validation in predictive modeling is focused on generalization, which is the ability of f to predict new data {Xnew, Ynew}.

Methods used in explanatory modeling for model validation include model specification tests such as the popular Hausman specification test in econometrics (Hausman, 1978), and construct validation techniques such as reliability and validity measures of survey questions and factor analysis. Inference for individual coefficients is also used for detecting over- or under-specification. Validating model fit involves goodness-of-fit tests (e.g., normality tests) and model diagnostics such as residual analysis. Although indications of lack of fit might lead researchers to modify f, modifications are made carefully in light of the relationship with ℱ and the constructs 𝒳, 𝒴.

⁷Although boosting algorithms were developed as ensemble methods, “[they can] be seen as an interesting regularization scheme for estimating a model” (Bühlmann and Hothorn, 2007).

In predictive modeling, the biggest danger to generalization is overfitting the training data. Hence validation consists of evaluating the degree of overfitting, by comparing the performance of f on the training and holdout sets. If performance is significantly better on the training set, overfitting is implied.

Not only is the large context of validation markedly different in explanatory and predictive modeling, but so are the details. For example, checking for multicollinearity is a standard operation in assessing model fit. This practice is relevant in explanatory modeling, where multicollinearity can lead to inflated standard errors, which interferes with inference. Therefore, a vast literature exists on strategies for identifying and reducing multicollinearity, variable selection being one strategy. In contrast, for predictive purposes “multicollinearity is not quite as damning” (Vaughan and Berry, 2005). Makridakis, Wheelwright and Hyndman (1998, page 288) distinguished between the role of multicollinearity in explaining versus its role in predicting:

Multicollinearity is not a problem unless either (i) the individual regression coefficients are of interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y, without the influence of the other explanatory variables. Multicollinearity will not affect the ability of the model to predict.
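
The point in the quotation can be illustrated with a small simulation (an illustrative sketch assuming numpy and scikit-learn, not taken from the cited sources): with two nearly identical predictors, the individual coefficient estimates vary wildly across resamples, while out-of-sample accuracy barely changes.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 200
# x2 is almost a copy of x1: severe multicollinearity.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=1.0, size=n)

# New observations with the same collinear structure, for prediction.
z = rng.normal(size=100)
X_new = np.column_stack([z, z])
y_new = X_new.sum(axis=1) + rng.normal(size=100)

coefs, mses = [], []
for _ in range(200):
    idx = rng.choice(n, size=n, replace=True)          # resample to mimic repeated studies
    fit = LinearRegression().fit(X[idx], y[idx])
    coefs.append(fit.coef_)
    mses.append(mean_squared_error(y_new, fit.predict(X_new)))

print("std of individual coefficients:", np.std(coefs, axis=0))  # large: inference suffers
print("std of out-of-sample MSE:", np.std(mses))                 # small: prediction is stable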

Another example is the detection of influential observations. While classic methods are aimed at detecting observations that are influential in terms of model estimation, Johnson and Geisser (1983) proposed a method for detecting influential observations in terms of their effect on the predictive distribution.

2.6.2 Model evaluation. Consider two performance aspects of a model: explanatory power and predictive power. The top priority in terms of model performance in explanatory modeling is assessing explanatory power, which measures the strength of relationship indicated by f. Researchers report R2-type values and statistical significance of overall F-type statistics to indicate the level of explanatory power.

In contrast, in predictive modeling, the focus is on predictive accuracy or predictive power, which refer to the performance of f on new data. Measures of predictive power are typically out-of-sample metrics or their in-sample approximations, which depend on the type of required prediction. For example, predictions of a binary Y could be binary classifications (Y = 0, 1), predicted probabilities of a certain class [P(Y = 1)], or rankings of those probabilities. The latter are common in marketing and personnel psychology. These three different types of predictions would warrant different performance metrics. For example, a model can perform poorly in producing binary classifications but adequately in producing rankings. Moreover, in the context of asymmetric costs, where costs are heftier for some types of prediction errors than others, alternative performance metrics are used, such as the "average cost per predicted observation."
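
As an illustrative sketch (assuming scikit-learn; the 5-to-1 cost ratio is invented purely for illustration), the same fitted classifier can be evaluated as a classifier, as a probability estimator, as a ranker, and under asymmetric costs:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]     # predicted probabilities P(Y = 1)
label = (prob >= 0.5).astype(int)        # binary classifications

print("accuracy (classifications):", accuracy_score(y_te, label))
print("log loss (probabilities):", log_loss(y_te, prob))
print("AUC (ranking quality):", roc_auc_score(y_te, prob))

# Asymmetric costs: suppose a false negative is 5 times as costly as a false positive.
tn, fp, fn, tp = confusion_matrix(y_te, label).ravel()
avg_cost = (1 * fp + 5 * fn) / len(y_te)   # "average cost per predicted observation"
print("average cost per predicted observation:", avg_cost)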

A common misconception in various scientific fields is that predictive power can be inferred from explanatory power. However, the two are different and should be assessed separately. While predictive power can be assessed for both explanatory and predictive models, explanatory power is not typically possible to assess for predictive models because of the lack of F and an underlying causal structure. Measures such as R2 and F would indicate the level of association, but not causation.

Predictive power is assessed using metrics computed from a holdout set or using cross-validation (Stone, 1974; Geisser, 1975). Thus, a major difference between explanatory and predictive performance metrics is the data from which they are computed. In general, measures computed from the data to which the model was fitted tend to be overoptimistic in terms of predictive accuracy: "Testing the procedure on the data that gave it birth is almost certain to overestimate performance" (Mosteller and Tukey, 1977). Thus, the holdout set serves as a more realistic context for evaluating predictive power.
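
The mechanics can be sketched as follows (assuming scikit-learn; data are synthetic): the in-sample R2 is computed from the very data used for fitting, whereas cross-validation evaluates each observation with a model fitted without it.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=20.0, random_state=3)

model = LinearRegression().fit(X, y)
print("in-sample R^2:", model.score(X, y))                 # optimistic

cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold cross-validated R^2:", cv_r2.mean())         # closer to new-data performance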

2.6.3 Model selection. Once a set of models f1, f2, . . . has been estimated and validated, model selection pertains to choosing among them. Two main differentiating aspects are the data–theory and bias–variance considerations. In explanatory modeling, the models are compared in terms of explanatory power, and hence the popularity of nested models, which are easily compared. Stepwise-type methods, which use overall F statistics to include and/or exclude variables, might appear suitable for achieving high explanatory power. However, optimizing explanatory power in this fashion conceptually contradicts the validation step, where variable inclusion/exclusion and the structure of the statistical model are carefully designed to represent the theoretical model. Hence, proper explanatory model selection is performed in a constrained manner. In the words of Jaccard (2001):

Trimming potentially theoretically meaningful variables is not advisable unless one is quite certain that the coefficient for the variable is near zero, that the variable is inconsequential, and that trimming will not introduce misspecification error.

A researcher might choose to retain a causal covariate which has a strong theoretical justification even if it is statistically insignificant. For example, in medical research, a covariate that denotes whether a person smokes or not is often present in models for health conditions, whether it is statistically significant or not.8 In contrast to explanatory power, statistical significance plays a minor or no role in assessing predictive performance. In fact, it is sometimes the case that removing inputs with small coefficients, even if they are statistically significant, results in improved prediction accuracy (Greenberg and Parks, 1997; Wu, Harris and McAuley, 2007, and see the Appendix). Stepwise-type algorithms are very useful in predictive modeling as long as the selection criteria rely on predictive power rather than explanatory power.

8I thank Ayala Cohen for this example.

As mentioned in Section 1.6, the statistics literature on model selection includes a rich discussion on the difference between finding the "true" model and finding the best predictive model, and on criteria for explanatory model selection versus predictive model selection. A popular predictive metric is the in-sample Akaike Information Criterion (AIC). Akaike derived the AIC from a predictive viewpoint, where the model is not intended to accurately infer the "true distribution," but rather to predict future data as accurately as possible (see, e.g., Berk, 2008; Konishi and Kitagawa, 2007). Some researchers distinguish between AIC and the Bayesian information criterion (BIC) on this ground. Sober (2002) concluded that AIC measures predictive accuracy while BIC measures goodness of fit:

In a sense, the AIC and the BIC provide estimates of different things; yet, they almost always are thought to be in competition. If the question of which estimator is better is to make sense, we must decide whether the average likelihood of a family [=BIC] or its predictive accuracy [=AIC] is what we want to estimate.

Similarly, Dowe, Gardner and Oppy (2007) contrasted the two Bayesian model selection criteria Minimum Message Length (MML) and Minimum Expected Kullback–Leibler Distance (MEKLD). They concluded,

If you want to maximise predictive accuracy, you should minimise the expected KL distance (MEKLD); if you want the best inference, you should use MML.

Kadane and Lazar (2004) examined a variety of model selection criteria from a Bayesian decision-theoretic point of view, comparing prediction with explanation goals.

Even when using predictive metrics, the fashion in which they are used within a model selection process can degrade their adequacy, yielding overoptimistic assessments of predictive performance. Berk (2008) described the case where

statistical learning procedures are often applied several times to the data with one or more tuning parameters varied. The AIC may be computed for each. But each AIC is ignorant about the information obtained from prior fitting attempts and how many degrees of freedom were expended in the process. Matters are even more complicated if some of the variables are transformed or recoded. . . Some unjustified optimism remains.
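
As a small illustration of these criteria in practice (a sketch assuming the statsmodels library and synthetic data, not an analysis from the paper), candidate linear models can be compared by AIC and BIC, keeping in mind Berk's caveat that repeated tuning on the same data erodes the criteria's guarantees:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
X_full = rng.normal(size=(n, 5))
y = 2 * X_full[:, 0] + 1 * X_full[:, 1] + rng.normal(scale=2.0, size=n)

# Candidate models with an increasing number of predictors.
for k in range(1, 6):
    X = sm.add_constant(X_full[:, :k])
    res = sm.OLS(y, X).fit()
    print(f"{k} predictors: AIC={res.aic:.1f}, BIC={res.bic:.1f}")
# AIC is oriented toward predictive accuracy; BIC penalizes model size more heavily.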

2.7 Model Use and Reporting

Given all the differences that arise in the modeling process, the resulting predictive model would obviously be very different from a resulting explanatory model in terms of the data used ({X, Y}), the estimated model f, and explanatory power and predictive power. The use of f would also greatly differ.

As illustrated in Section 1.1, explanatory models in the context of scientific research are used to derive "statistical conclusions" using inference, which in turn are translated into scientific conclusions regarding F, X, Y and the causal hypotheses. With a focus on theory, causality, bias and retrospective analysis, explanatory studies are aimed at testing or comparing existing causal theories. Accordingly, the statistical section of explanatory scientific papers is dominated by statistical inference.

In predictive modeling, f is used to generate predictions for new data. We note that generating predictions from f can range in the level of difficulty, depending on the complexity of f and on the type of prediction generated. For example, generating a complete predictive distribution is easier using a Bayesian approach than the predictive likelihood approach.

In practical applications, the predictions might be the final goal. However, the focus here is on predictive modeling for supporting scientific research, as was discussed in Section 1.2. Scientific predictive studies and articles therefore emphasize data, association, bias–variance considerations, and prospective aspects of the study. Conclusions pertain to theory-building aspects such as new hypothesis generation, practical relevance, and predictability level. Whereas explanatory articles focus on theoretical constructs and unobservable parameters and their statistical section is dominated by inference, predictive articles concentrate on the observable level, with predictive power and its comparison across models being the core.

3. TWO EXAMPLES

Two examples are used to broadly illustrate the differences that arise in predictive and explanatory studies. In the first I consider a predictive goal and discuss what would be involved in "converting" it to an explanatory study. In the second example I consider an explanatory study and what would be different in a predictive context. See the work of Shmueli and Koppius (2010) for a detailed example "converting" the explanatory study of Gefen, Karahanna and Straub (2003) from Section 1 into a predictive one.

3.1 Netflix Prize

Netflix is the largest online DVD rental service in the United States. In an effort to improve their movie recommendation system, in 2006 Netflix announced a contest (http://netflixprize.com), making public a huge dataset of user movie ratings. Each observation consisted of a user ID, a movie title, and the rating that the user gave this movie. The task was to accurately predict the ratings of movie-user pairs for a test set such that the predictive accuracy improved upon Netflix's recommendation engine by at least 10%. The grand prize was set at $1,000,000. The 2009 winner was a composite of three teams, one of them from the AT&T research lab (see Bell, Koren and Volinsky, 2010). In their 2008 report, the AT&T team, who also won the 2007 and 2008 progress prizes, described their modeling approach (Bell, Koren and Volinsky, 2008).

Let me point out several operations and choices described by Bell, Koren and Volinsky (2008) that highlight the distinctive predictive context. Starting with sample size, the very large sample released by Netflix was aimed at allowing the estimation of f from the data, reflecting the absence of a strong theory. In the data preparation step, with relation to missingness that is predictively informative, the team found that "the information on which movies each user chose to rate, regardless of specific rating value" turned out to be useful. At the data exploration and reduction step, many teams including the winners found that the noninterpretable Singular Value Decomposition (SVD) data reduction method was key in producing accurate predictions: "It seems that models based on matrix-factorization were found to be most accurate." As for choice of variables, supplementing the Netflix data with information about the movie (such as actors, director) actually decreased accuracy: "We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models." In terms of choice of methods, their solution was an ensemble of methods that included nearest-neighbor algorithms, regression models, and shrinkage methods. In particular, they found that "using increasingly complex models is only one way of improving accuracy. An apparently easier way to achieve better accuracy is by blending multiple simpler models." And indeed, more accurate predictions were achieved by collaborations between competing teams who combined predictions from their individual models, such as the winners' combined team. All these choices and discoveries are very relevant for prediction, but not for causal explanation. Although the Netflix contest is not aimed at scientific advancement, there is clearly scientific value in the predictive models developed. They tell us about the level of predictability of online user ratings of movies, and the implicated usefulness of the rating scale employed by Netflix. The research also highlights the importance of knowing which movies a user does not rate. And importantly, it sets the stage for explanatory research.
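
To give a flavor of this kind of noninterpretable data reduction, here is a toy sketch (assuming numpy; the ratings matrix is invented, and real collaborative filtering handles the mostly-missing entries far more carefully) in which a low-rank SVD approximation of a user-by-movie matrix yields predicted ratings without any construct that "explains" preference:

import numpy as np

rng = np.random.default_rng(5)
# Toy user-by-movie ratings on a 1-5 scale (real data would be mostly missing).
ratings = rng.integers(1, 6, size=(20, 15)).astype(float)

# Center, factorize, and keep only the k strongest singular components.
mean = ratings.mean()
U, s, Vt = np.linalg.svd(ratings - mean, full_matrices=False)
k = 3
approx = mean + U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# "Predicted" rating of user 0 for movie 7, from the low-rank structure alone.
print(round(approx[0, 7], 2))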

Let us consider a hypothetical goal of explaining movie preferences. After stating causal hypotheses, we would define constructs that link user behavior and movie features X to user preference Y, with a careful choice of F. An operationalization step would link the constructs to measurable data, and the role of each variable in the causality structure would be defined. Even if using the Netflix dataset, supplemental covariates that capture movie features and user characteristics would be absolutely necessary. In other words, the data collected and the variables included in the model would be different from the predictive context. As to methods and models, data compression methods such as SVD, heuristic-based predictive algorithms which learn f from the data, and the combination of multiple models would be considered inappropriate, as they lack interpretability with respect to F and the hypotheses. The choice of f would be restricted to statistical models that can be used for inference, and would directly model issues such as the dependence between records for the same customer and for the same movie. Finally, the model would be validated and evaluated in terms of its explanatory power, and used to conclude about the strength of the causal relationship between various user and movie characteristics and movie preferences. Hence, the explanatory context leads to a completely different modeling path and final result than the predictive context.

It is interesting to note that most competing teams had a background in computer science rather than statistics. Yet, the winning team combines the two disciplines. Statisticians who see the uniqueness and importance of predictive modeling alongside explanatory modeling have the capability of contributing to scientific advancement as well as achieving meaningful practical results (and large monetary awards).

3.2 Online Auction Research

The following example highlights the differences between explanatory and predictive research in online auctions. The predictive approach also illustrates the utility in creating new theory in an area dominated by explanatory modeling.

Online auctions have become a major player in providing electronic commerce services. eBay (www.eBay.com), the largest consumer-to-consumer auction website, enables a global community of buyers and sellers to easily interact and trade. Empirical research of online auctions has grown dramatically in recent years. Studies using publicly available bid data from websites such as eBay have found many divergences of bidding behavior and auction outcomes compared to ordinary offline auctions and classical auction theory. For instance, according to classical auction theory (e.g., Krishna, 2002), the final price of an auction is determined by a priori information about the number of bidders, their valuation, and the auction format. However, final price determination in online auctions is quite different. Online auctions differ from offline auctions in various ways such as longer duration, anonymity of bidders and sellers, and low barriers of entry. These and other factors lead to new bidding behaviors that are not explained by auction theory. Another important difference is that the total number of bidders in most online auctions is unknown until the auction closes.

Empirical research in online auctions has concentrated in the fields of economics, information systems and marketing. Explanatory modeling has been employed to learn about different aspects of bidder behavior in auctions. A survey of empirical explanatory research on auctions was given by Bajari and Hortacsu (2004). A typical explanatory study relies on game theory to construct F, which can be done in different ways. One approach is to construct a "structural model," which is a mathematical model linking the various constructs. The major construct is "bidder valuation," which is the amount a bidder is willing to pay, and is typically operationalized using his observed placed bids. The structural model and operationalized constructs are then translated into a regression-type model [see, e.g., Sections 5 and 6 in Bajari and Hortacsu (2003)]. To illustrate the use of a statistical model in explanatory auction research, consider the study by Lucking-Reiley et al. (2007) who used a dataset of 461 eBay coin auctions to determine the factors affecting the final auction price. They estimated a set of linear regression models where Y = log(Price) and X included auction characteristics (the opening bid, the auction duration, and whether a secret reserve price was used), seller characteristics (the number of positive and negative ratings), and a control variable (book value of the coin). One of their four reported models was of the form

log(Price) = β0 + β1 log(BookValue) + β2 log(MinBid) + β3 Reserve + β4 NumDays + β5 PosRating + β6 NegRating + ε.

The other three models, or "model specifications," included a modified set of predictors, with some interaction terms and an alternate auction duration measurement. The authors used a censored-Normal regression for model estimation, because some auctions did not receive any bids and therefore the price was truncated at the minimum bid. Typical explanatory aspects of the modeling are:

Choice of variables: Several issues arise from the causal-theoretical context. First is the exclusion of the number of bidders (or bids) as a determinant due to endogeneity considerations, where although it is likely to affect the final price, "it is endogenously determined by the bidders' choices." To verify endogeneity the authors report fitting a separate regression of Y = Number of bids on all the determinants. Second, the authors discuss operationalization challenges that might result in bias due to omitted variables. In particular, the authors discuss the construct of "auction attractiveness" (X) and their inability to judge measures such as photos and verbal descriptions to operationalize attractiveness.

Model validation: The four model specifications are used for testing the robustness of the hypothesized effect of the construct "auction length" across different operationalized variables such as the continuous number of days and a categorical alternative.

Model evaluation: For each model, its in-sample R2 is used for determining explanatory power.

Model selection: The authors report the four fitted regression models, including both significant and insignificant coefficients. Retaining the insignificant covariates in the model is for matching f with F.

Model use and reporting: The main focus is on inference for the β's, and the final conclusions are given in causal terms. ("A seller's feedback ratings. . . have a measurable effect on her auction prices. . . when a seller chooses to have her auction last for a longer period of days [sic], this significantly increases the auction price on average.")

Although online auction research is dominated by explanatory studies, there have been a few predictive studies developing forecasting models for an auction's final price (e.g., Jank, Shmueli and Wang, 2008; Jap and Naik, 2008; Ghani and Simmons, 2004; Wang, Jank and Shmueli, 2008; Zhang, Jank and Shmueli, 2010). For a brief survey of online auction forecasting research see the work of Jank and Shmueli (2010, Chapter 5). From my involvement in several of these predictive studies, let me highlight the purely predictive aspects that appear in this literature:


Choice of variables: If prediction takes place before or at the start of the auction, then obviously the total number of bids or bidders cannot be included as a predictor. While this variable was also omitted in the explanatory study, the omission was due to a different reason, that is, endogeneity. However, if prediction takes place at time t during an ongoing auction, then the number of bidders/bids present at time t is available and useful for predicting the final price. Even more useful is the time series of the number of bidders from the start of the auction until time t as well as the price curve until time t (Bapna, Jank and Shmueli, 2008).

Choice of methods: Predictive studies in online auctions tend to learn f from the data, using flexible models and algorithmic methods (e.g., CART, k-nearest neighbors, neural networks, functional methods and related nonparametric smoothing-based methods, Kalman filters and boosting; see, e.g., Chapter 5 in Jank and Shmueli, 2010). Many of these are not interpretable, yet have proven to provide high predictive accuracy.

Model evaluation: Auction forecasting studies evaluate predictive power on holdout data. They report performance in terms of out-of-sample metrics such as MAPE and RMSE, and compare these against other predictive models and benchmarks (a minimal sketch of such a holdout evaluation follows this list).
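
A minimal sketch of such a holdout evaluation (assuming scikit-learn; the data are an invented stand-in for auction attributes, not the datasets used in these studies):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

def mape(actual, pred):
    return np.mean(np.abs((actual - pred) / actual)) * 100

# Invented stand-in: predictors observed by time t, final price as the target.
X, price = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=6)
price = price - price.min() + 10            # keep prices positive so MAPE is well defined
X_tr, X_te, p_tr, p_te = train_test_split(X, price, test_size=0.3, random_state=6)

for name, model in [("CART", DecisionTreeRegressor(max_depth=5, random_state=6)),
                    ("k-NN", KNeighborsRegressor(n_neighbors=10))]:
    pred = model.fit(X_tr, p_tr).predict(X_te)
    print(f"{name}: RMSE={rmse(p_te, pred):.2f}, MAPE={mape(p_te, pred):.1f}%")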

Predictive models for auction price cannot provide direct causal explanations. However, by producing high-accuracy price predictions they shed light on new potential variables that are related to price and on the types of relationships that can be further investigated in terms of causality. For instance, a construct that is not directly measurable but that some predictive models are apparently capturing is competition between bidders.

4. IMPLICATIONS, CONCLUSIONS AND SUGGESTIONS

4.1 The Cost of Indiscrimination to Scientific Research

Currently, in many fields, statistical modeling is used nearly exclusively for causal explanation. The consequence of neglecting to include predictive modeling and testing alongside explanatory modeling is losing the ability to test the relevance of existing theories and to discover new causal mechanisms. Feelders (2002) commented on the field of economics: "The pure hypothesis testing framework of economic data analysis should be put aside to give more scope to learning from the data. This closes the empirical cycle from observation to theory to the testing of theories on new data." The current accelerated rate of social, environmental, and technological changes creates a burning need for new theories and for the examination of old theories in light of the new realities.

A common practice due to the indiscrimination of explanation and prediction is to erroneously infer predictive power from explanatory power, which can lead to incorrect scientific and practical conclusions. Colleagues from various fields confirmed this fact, and a cursory search of their scientific literature brings up many examples. For instance, in ecology an article intending to predict forest beetle assemblages infers predictive power from explanatory power ["To study. . . predictive power, . . . we calculated the R2"; "We expect predictabilities with R2 of up to 0.6" (Muller and Brandl, 2009)]. In economics, an article entitled "The predictive power of zero intelligence in financial markets" (Farmer, Patelli and Zovko, 2005) infers predictive power from a high R2 value of a linear regression model. In epidemiology, many studies rely on in-sample hazard ratios estimated from Cox regression models to infer predictive power, reflecting an indiscrimination between description and prediction. For instance, Nabi et al. (2010) used hazard ratio estimates and statistical significance "to compare the predictive power of depression for coronary heart disease with that of cerebrovascular disease." In information systems, an article on "Understanding and predicting electronic commerce adoption" (Pavlou and Fygenson, 2006) incorrectly compared the predictive power of different models using in-sample measures ("To examine the predictive power of the proposed model, we compare it to four models in terms of R2 adjusted"). These examples are not singular, but rather they reflect the common misunderstanding of predictive power in these and other fields.

Finally, a consequence of omitting predictive modeling from scientific research is also a gap between research and practice. In an age where empirical research has become feasible in many fields, the opportunity to bridge the gap between methodological development and practical application can be easier to achieve through the combination of explanatory and predictive modeling.

Finance is an example where practice is concerned with prediction whereas academic research is focused on explaining. In particular, there has been a reliance on a limited number of models that are considered pillars of research, yet have proven to perform very poorly in practice. For instance, the CAPM model and more recently the Fama–French model are regression models that have been used for explaining market behavior for the purpose of portfolio management, and have been evaluated in terms of explanatory power (in-sample R2 and residual analysis) and not predictive accuracy.9 More recently, researchers have begun recognizing the distinction between in-sample explanatory power and out-of-sample predictive power (Goyal and Welch, 2007), which has led to a discussion of predictability magnitude and a search for predictively accurate explanatory variables (Campbell and Thompson, 2005). In terms of predictive modeling, the Chief Actuary of the Financial Supervisory Authority of Sweden commented in 1999: "there is a need for models with predictive power for at least a very near future. . . Given sufficient and relevant data this is an area for statistical analysis, including cluster analysis and various kind of structure-finding methods" (Palmgren, 1999). While there has been some predictive modeling using genetic algorithms (Chen, 2002) and neural networks (Chakraborty and Sharma, 2007), it has been performed by practitioners and nonfinance academic researchers and outside of the top academic journals.

9Although in their paper Fama and French (1993) did split the sample into two parts, they did so for purposes of testing the sensitivity of model estimates rather than for assessing predictive accuracy.

In summary, the omission of predictive modeling for theory development results not only in academic work becoming irrelevant to practice, but also in creating a barrier to achieving significant scientific progress, which is especially unfortunate as data become easier to collect, store and access.

In the opposite direction, in fields that focus on predictive modeling, the reason for omitting explanatory modeling must be sought. A scientific field is usually defined by a cohesive body of theoretical knowledge, which can be tested. Hence, some form of testing, whether empirical or not, must be a component of the field. In areas such as bioinformatics, where there is little theory and an abundance of data, predictive models are pivotal in generating avenues for causal theory.

4.2 Explanatory and Predictive Power: Two Dimensions

I have polarized explaining and predicting in this article in an effort to highlight their fundamental differences. However, rather than considering them as extremes on some continuum, I consider them as two dimensions.10,11 Explanatory power and predictive accuracy are different qualities; a model will possess some level of each.

10Similarly, descriptive models can be considered as a third dimension, where yet different criteria are used for assessing the strength of the descriptive model.

11I thank Bill Langford for the two-dimensional insight.

A related controversial question arises: must an explanatory model have some level of predictive power to be considered scientifically useful? And equally, must a predictive model have sufficient explanatory power to be scientifically useful? For instance, some explanatory models that cannot be tested for predictive accuracy yet constitute scientific advances are Darwinian evolution theory and string theory in physics. The latter produces currently untestable predictions (Woit, 2006, pages x–xii). Conversely, there exist predictive models that do not properly "explain" yet are scientifically valuable. Galileo, in his book Two New Sciences, proposed a demonstration to determine whether light was instantaneous. According to Mackay and Oldford (2000), Descartes gave the book a scathing review:

The substantive criticisms are generally directed at Galileo's not having identified the causes of the phenomena he investigated. For most scientists at this time, and particularly for Descartes, that is the whole point of science.

Similarly, consider predictive models that are based on a wrong explanation yet are considered scientifically and practically valuable. One well-known example is Ptolemaic astronomy, which until recently was used for nautical navigation but is based on a theory proven to be wrong long ago. While such examples are extreme, in most cases models are likely to possess some level of both explanatory and predictive power.

Considering predictive accuracy and explanatory power as two axes on a two-dimensional plot would place different models (f), aimed either at explanation or at prediction, on different areas of the plot. The bi-dimensional approach implies that: (1) In terms of modeling, the goal of a scientific study must be specified a priori in order to optimize the criterion of interest; and (2) In terms of model evaluation and scientific reporting, researchers should report both the explanatory and predictive qualities of their models. Even if prediction is not the goal, the predictive qualities of a model should be reported alongside its explanatory power so that it can be fairly evaluated in terms of its capabilities and compared to other models. Similarly, a predictive model might not require causal explanation in order to be scientifically useful; however, reporting its relation to causal theory is important for purposes of theory building. The availability of information on a variety of predictive and explanatory models along these two axes can shed light on both predictive and causal aspects of scientific phenomena. The statistical modeling process, as depicted in Figure 2, should include "overall model performance" in terms of both predictive and explanatory qualities.

4.3 The Cost of Indiscrimination to the Field of Statistics

Dissolving the ambiguity surrounding explanatory versus predictive modeling is important for advancing our field itself. Recognizing that statistical methodology has focused mainly on inference indicates an important gap to be filled. While our literature contains predictive methodology for model selection and predictive inference, there is scarce statistical predictive methodology for other modeling steps, such as study design, data collection, data preparation and EDA, which present opportunities for new research. Currently, the predictive void has been taken up by the fields of machine learning and data mining. In fact, the differences, and some would say rivalry, between the fields of statistics and data mining can be attributed to their different goals of explaining versus predicting even more than to factors such as data size. While statistical theory has focused on model estimation, inference, and fit, machine learning and data mining have concentrated on developing computationally efficient predictive algorithms and tackling the bias–variance trade-off in order to achieve high predictive accuracy.

Sharpening the distinction between explanatory and predictive modeling can raise a new awareness of the strengths and limitations of existing methods and practices, and might shed light on current controversies within our field. One example is the disagreement in survey methodology regarding the use of sampling weights in the analysis of survey data (Little, 2007). Whereas some researchers advocate using weights to reduce bias at the expense of increased variance, and others disagree, might not the answer be related to the final goal?

Another ambiguity that can benefit from an explanatory/predictive distinction is the definition of parsimony. Some claim that predictive models should be simpler than explanatory models: "Simplicity is relevant because complex families often do a bad job of predicting new data, though they can be made to fit the old data quite well" (Sober, 2002). The same argument was given by Hastie, Tibshirani and Friedman (2009): "Typically the more complex we make the model, the lower the bias but the higher the variance." In contrast, some predictive models in practice are very complex,12 and indeed Breiman (2001b) commented: "in some cases predictive models are more complex in order to capture small nuances that improve predictive accuracy." Zellner (2001) used the term "sophisticatedly simple" to define the quality of a "good" model. I would suggest that the definitions of parsimony and complexity are task-dependent: predictive or explanatory. For example, an "overly complicated" model in explanatory terms might prove "sophisticatedly simple" for predictive purposes.

4.4 Closing Remarks and Suggestions

The consequences from the explanatory/predictive distinction lead to two proposed actions:

1. It is our responsibility to be aware of how statistical models are used in research outside of statistics, why they are used in that fashion, and in response to develop methods that support sound scientific research. Such knowledge can be gained within our field by inviting scientists from different disciplines to give talks at statistics conferences and seminars, and by requiring graduate students in statistics to read and present research papers from other disciplines.

2. As a discipline, we must acknowledge the difference between explanatory, predictive and descriptive modeling, and integrate it into statistics education of statisticians and nonstatisticians, as early as possible but most importantly in "research methods" courses. This requires creating written materials that are easily accessible and understandable by nonstatisticians. We should advocate both explanatory and predictive modeling, clarify their differences and distinctive scientific and practical uses, and disseminate tools and knowledge for implementing both. One particular aspect to consider is advocating a more careful use of terms such as "predictors," "predictions" and "predictive power," to reduce the effects of terminology on incorrect scientific conclusions.

12I thank Foster Provost from NYU for this observation.


Awareness of the distinction between explanatory and predictive modeling, and of the different scientific functions that each serves, is essential for the progress of scientific knowledge.

APPENDIX: IS THE "TRUE" MODEL THE BEST PREDICTIVE MODEL? A LINEAR REGRESSION EXAMPLE

Consider F to be the true function relating constructs X and Y and let us assume that f is a valid operationalization of F. Choosing an intentionally biased function f∗ in place of f is clearly undesirable from a theoretical–explanatory point of view. However, we will show that f∗ can be preferable to f from a predictive standpoint.

To illustrate this, consider the statistical model f(x) = β1x1 + β2x2 + ε, which is assumed to be correctly specified with respect to F. Using data, we obtain the estimated model f, which has the properties

(2)  Bias = 0,

(3)  Var(f(x)) = Var(x1β1 + x2β2) = σ²x′(X′X)⁻¹x,

where x is the vector x = [x1, x2]′, and X is the design matrix based on both predictors. Combining the squared bias with the variance gives

(4)  EPE = E(Y − f(x))² = σ² + 0 + σ²x′(X′X)⁻¹x = σ²(1 + x′(X′X)⁻¹x).

In comparison, consider the estimated underspecified form f∗(x) = γ1x1. The bias and variance here are given by Montgomery, Peck and Vining (2001, pages 292–296):

Bias = x1γ1 − (x1β1 + x2β2) = x1(x1′x1)⁻¹x1′(x1β1 + x2β2) − (x1β1 + x2β2),

Var(f∗(x)) = x1 Var(γ1) x1 = σ²x1(x1′x1)⁻¹x1.

Combining the squared bias with the variance gives

(5)  EPE = (x1(x1′x1)⁻¹x1′x2β2 − x2β2)² + σ²(1 + x1(x1′x1)⁻¹x1′).

Although the bias of the underspecified model f∗(x) is larger than that of f(x), its variance can be smaller, and in some cases so small that the overall EPE will be lower for the underspecified model. Wu, Harris and McAuley (2007) showed the general result for an underspecified linear regression model with multiple predictors. In particular, they showed that the underspecified model that leaves out q predictors has a lower EPE when the following inequality holds:

(6)  qσ² > β2′X2′(I − H1)X2β2.

This means that the underspecified model produces more accurate predictions, in terms of lower EPE, in the following situations:

• when the data are very noisy (large σ);
• when the true absolute values of the left-out parameters (in our example β2) are small;
• when the predictors are highly correlated; and
• when the sample size is small or the range of left-out variables is small.

The bottom line is nicely summarized by Hagerty and Srinivasan (1991): "We note that the practice in applied research of concluding that a model with a higher predictive validity is 'truer,' is not a valid inference. This paper shows that a parsimonious but less true model can have a higher predictive validity than a truer but less parsimonious model."
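
The conditions above can be illustrated with a small simulation (a sketch assuming numpy; the parameter values are arbitrary choices that satisfy those conditions, not values from the cited papers): with a small, noisy sample, highly correlated predictors, and a small left-out coefficient, the underspecified model attains a lower estimated EPE than the correctly specified one.

import numpy as np

rng = np.random.default_rng(7)
n, sigma, beta1, beta2 = 10, 10.0, 1.0, 0.2   # small n, large noise, small left-out effect
reps = 2000
err_full = np.zeros(reps)
err_under = np.zeros(reps)

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)        # highly correlated predictors
    mu = beta1 * x1 + beta2 * x2                       # true mean of Y
    y = mu + rng.normal(scale=sigma, size=n)

    X = np.column_stack([x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]           # correctly specified model
    g = np.linalg.lstsq(x1[:, None], y, rcond=None)[0] # underspecified: leaves out x2

    err_full[r] = np.mean((X @ b - mu) ** 2)           # estimation error of the fitted surface
    err_under[r] = np.mean((x1 * g[0] - mu) ** 2)

# EPE at the design points is sigma^2 plus the average estimation error.
print("EPE, correctly specified:", sigma**2 + err_full.mean())
print("EPE, underspecified:     ", sigma**2 + err_under.mean())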

ACKNOWLEDGMENTS

I thank two anonymous reviewers, the associate editor, and editor for their suggestions and comments which improved this manuscript. I express my gratitude to many colleagues for invaluable feedback and fruitful discussion that have helped me develop the explanatory/predictive argument presented in this article. I am grateful to Otto Koppius (Erasmus) and Ravi Bapna (U Minnesota) for familiarizing me with explanatory modeling in Information Systems, for collaboratively pursuing prediction in this field, and for tireless discussion of this work. I thank Ayala Cohen (Technion), Ralph Snyder (Monash), Rob Hyndman (Monash) and Bill Langford (RMIT) for detailed feedback on earlier drafts of this article. Special thanks to Boaz Shmueli and Raquelle Azran for their meticulous reading and discussions of the manuscript. And special thanks for invaluable comments and suggestions go to Murray Aitkin (U Melbourne), Yoav Benjamini (Tel Aviv U), Smarajit Bose (ISI), Saibal Chattopadhyay (IIMC), Ram Chellapah (Emory), Etti Doveh (Technion), Paul Feigin (Technion), Paulo Goes (U Arizona), Avi Goldfarb (Toronto U), Norma Hubele (ASU), Ron Kenett (KPA Inc.), Paul Lajbcygier (Monash), Thomas Lumley (U Washington), David Madigan (Columbia U), Isaac Meilejson (Tel Aviv U), Douglas Montgomery (ASU), Amita Pal (ISI), Don Poskitt (Monash), Foster Provost (NYU), Saharon Rosset (Tel Aviv U), Jeffrey Simonoff (NYU) and David Steinberg (Tel Aviv U).

REFERENCES

AFSHARTOUS, D. and DE LEEUW, J. (2005). Prediction in multilevel models. J. Educ. Behav. Statist. 30 109–139.

AITCHISON, J. and DUNSMORE, I. R. (1975). Statistical Prediction Analysis. Cambridge Univ. Press. MR0408097

BAJARI, P. and HORTACSU, A. (2003). The winner's curse, reserve prices and endogenous entry: Empirical insights from eBay auctions. Rand J. Econ. 3 329–355.

BAJARI, P. and HORTACSU, A. (2004). Economic insights from internet auctions. J. Econ. Liter. 42 457–486.

BAPNA, R., JANK, W. and SHMUELI, G. (2008). Price formation and its dynamics in online auctions. Decision Support Systems 44 641–656.

BELL, R. M., KOREN, Y. and VOLINSKY, C. (2008). The BellKor 2008 solution to the Netflix Prize.

BELL, R. M., KOREN, Y. and VOLINSKY, C. (2010). All together now: A perspective on the Netflix Prize. Chance 23 24.

BERK, R. A. (2008). Statistical Learning from a Regression Perspective. Springer, New York.

BJORNSTAD, J. F. (1990). Predictive likelihood: A review. Statist. Sci. 5 242–265. MR1062578

BÜHLMANN, P. and HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505. MR2420454

BREIMAN, L. (1996). Bagging predictors. Mach. Learn. 24 123–140. MR1425957

BREIMAN, L. (2001a). Random forests. Mach. Learn. 45 5–32.

BREIMAN, L. (2001b). Statistical modeling: The two cultures. Statist. Sci. 16 199–215. MR1874152

BROWN, P. J., VANNUCCI, M. and FEARN, T. (2002). Bayes model averaging with selection of regressors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 519–536. MR1924304

CAMPBELL, J. Y. and THOMPSON, S. B. (2005). Predicting excess stock returns out of sample: Can anything beat the historical average? Harvard Institute of Economic Research Working Paper 2084.

CARTE, T. A. and CRAIG, J. R. (2003). In pursuit of moderation: Nine common errors and their solutions. MIS Quart. 27 479–501.

CHAKRABORTY, S. and SHARMA, S. K. (2007). Prediction of corporate financial health by artificial neural network. Int. J. Electron. Fin. 1 442–459.

CHEN, S.-H., ED. (2002). Genetic Algorithms and Genetic Programming in Computational Finance. Kluwer, Dordrecht.

COLLOPY, F., ADYA, M. and ARMSTRONG, J. (1994). Principles for examining predictive validity—the case of information systems spending forecasts. Inform. Syst. Res. 5 170–179.

DALKEY, N. and HELMER, O. (1963). An experimental application of the Delphi method to the use of experts. Manag. Sci. 9 458–467.

DAWID, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach. J. Roy. Statist. Soc. Ser. A 147 278–292. MR0763811

DING, Y. and SIMONOFF, J. (2010). An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11 131–170.

DOMINGOS, P. (2000). A unified bias–variance decomposition for zero–one and squared loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence 564–569. AAAI Press, Austin, TX.

DOWE, D. L., GARDNER, S. and OPPY, G. R. (2007). Bayes not bust! Why simplicity is no problem for Bayesians. Br. J. Philos. Sci. 58 709–754. MR2375767

DUBIN, R. (1969). Theory Building. The Free Press, New York.

EDWARDS, J. R. and BAGOZZI, R. P. (2000). On the nature and direction of relationships between constructs. Psychological Methods 5 155–174.

EHRENBERG, A. and BOUND, J. (1993). Predictability and prediction. J. Roy. Statist. Soc. Ser. A 156 167–206.

FAMA, E. F. and FRENCH, K. R. (1993). Common risk factors in stock and bond returns. J. Fin. Econ. 33 3–56.

FARMER, J. D., PATELLI, P. and ZOVKO, I. I. (2005). The predictive power of zero intelligence in financial markets. Proc. Natl. Acad. Sci. USA 102 2254–2259.

FAYYAD, U. M., GRINSTEIN, G. G. and WIERSE, A. (2002). Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco, CA.

FEELDERS, A. (2002). Data mining in economic science. In Dealing with the Data Flood 166–175. STT/Beweton, Den Haag, The Netherlands.

FINDLEY, D. Y. and PARZEN, E. (1998). A conversation with Hirotugu Akaike. In Selected Papers of Hirotugu Akaike 3–16. Springer, New York. MR1486823

FORSTER, M. (2002). Predictive accuracy as an achievable goal of science. Philos. Sci. 69 S124–S134.

FORSTER, M. and SOBER, E. (1994). How to tell when simpler, more unified, or less ad-hoc theories will provide more accurate predictions. Br. J. Philos. Sci. 45 1–35. MR1277464

FRIEDMAN, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1 55–77.

GEFEN, D., KARAHANNA, E. and STRAUB, D. (2003). Trust and TAM in online shopping: An integrated model. MIS Quart. 27 51–90.

GEISSER, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70 320–328.

GEISSER, S. (1993). Predictive Inference: An Introduction. Chapman and Hall, London. MR1252174

GELMAN, A., CARLIN, J. B., STERN, H. S. and RUBIN, D. B. (2003). Bayesian Data Analysis, 2nd ed. Chapman & Hall/CRC, New York/Boca Raton, FL. MR1385925

GHANI, R. and SIMMONS, H. (2004). Predicting the end-price of online auctions. In International Workshop on Data Mining and Adaptive Modelling Methods for Economics and Management, Pisa, Italy.

GOYAL, A. and WELCH, I. (2007). A comprehensive look at the empirical performance of equity premium prediction. Rev. Fin. Stud. 21 1455–1508.

GRANGER, C. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 424–438.


GREENBERG, E. and PARKS, R. P. (1997). A predictive approach to model selection and multicollinearity. J. Appl. Econom. 12 67–75.

GURBAXANI, V. and MENDELSON, H. (1990). An integrative model of information systems spending growth. Inform. Syst. Res. 1 23–46.

GURBAXANI, V. and MENDELSON, H. (1994). Modeling vs. forecasting—the case of information systems spending. Inform. Syst. Res. 5 180–190.

HAGERTY, M. R. and SRINIVASAN, S. (1991). Comparing the predictive powers of alternative multiple regression models. Psychometrika 56 77–85. MR1115296

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR1851606

HAUSMAN, J. A. (1978). Specification tests in econometrics. Econometrica 46 1251–1271. MR0513692

HELMER, O. and RESCHER, N. (1959). On the epistemology of the inexact sciences. Manag. Sci. 5 25–52.

HEMPEL, C. and OPPENHEIM, P. (1948). Studies in the logic of explanation. Philos. Sci. 15 135–175.

HITCHCOCK, C. and SOBER, E. (2004). Prediction versus accommodation and the risk of overfitting. Br. J. Philos. Sci. 55 1–34.

JACCARD, J. (2001). Interaction Effects in Logistic Regression. SAGE Publications, Thousand Oaks, CA.

JANK, W. and SHMUELI, G. (2010). Modeling Online Auctions. Wiley, New York.

JANK, W., SHMUELI, G. and WANG, S. (2008). Modeling price dynamics in online auctions via regression trees. In Statistical Methods in eCommerce Research. Wiley, New York. MR2414052

JAP, S. and NAIK, P. (2008). BidAnalyzer: A method for estimation and selection of dynamic bidding models. Marketing Sci. 27 949–960.

JOHNSON, W. and GEISSER, S. (1983). A predictive view of the detection and characterization of influential observations in regression analysis. J. Amer. Statist. Assoc. 78 137–144. MR0696858

KADANE, J. B. and LAZAR, N. A. (2004). Methods and criteria for model selection. J. Amer. Statist. Assoc. 99 279–290. MR2061890

KENDALL, M. and STUART, A. (1977). The Advanced Theory of Statistics 1, 4th ed. Griffin, London.

KONISHI, S. and KITAGAWA, G. (2007). Information Criteria and Statistical Modeling. Springer, New York. MR2367855

KRISHNA, V. (2002). Auction Theory. Academic Press, San Diego, CA.

LITTLE, R. J. A. (2007). Should we use the survey weights to weight? JPSM Distinguished Lecture, Univ. Maryland.

LITTLE, R. J. A. and RUBIN, D. B. (2002). Statistical Analysis with Missing Data. Wiley, New York. MR1925014

LUCKING-REILEY, D., BRYAN, D., PRASAD, N. and REEVES, D. (2007). Pennies from eBay: The determinants of price in online auctions. J. Indust. Econ. 55 223–233.

MACKAY, R. J. and OLDFORD, R. W. (2000). Scientific method, statistical method, and the speed of light. Working Paper 2000-02, Dept. Statistics and Actuarial Science, Univ. Waterloo. MR1847825

MAKRIDAKIS, S. G., WHEELWRIGHT, S. C. and HYNDMAN, R. J. (1998). Forecasting: Methods and Applications, 3rd ed. Wiley, New York.

MONTGOMERY, D., PECK, E. A. and VINING, G. G. (2001). Introduction to Linear Regression Analysis. Wiley, New York. MR1820113

MOSTELLER, F. and TUKEY, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.

MULLER, J. and BRANDL, R. (2009). Assessing biodiversity by remote sensing in mountainous terrain: The potential of lidar to predict forest beetle assemblages. J. Appl. Ecol. 46 897–905.

NABI, J., KIVIMÄKI, M., SUOMINEN, S., KOSKENVUO, M. and VAHTERA, J. (2010). Does depression predict coronary heart disease and cerebrovascular disease equally well? The health and social support prospective cohort study. Int. J. Epidemiol. 39 1016–1024.

PALMGREN, B. (1999). The need for financial models. ERCIM News 38 8–9.

PARZEN, E. (2001). Comment on statistical modeling: The two cultures. Statist. Sci. 16 224–226. MR1874152

PATZER, G. L. (1995). Using Secondary Data in Marketing Research: United States and Worldwide. Greenwood Publishing, Westport, CT.

PAVLOU, P. and FYGENSON, M. (2006). Understanding and predicting electronic commerce adoption: An extension of the theory of planned behavior. MIS Quart. 30 115–143.

PEARL, J. (1995). Causal diagrams for empirical research. Biometrika 82 669–709. MR1380809

ROSENBAUM, P. and RUBIN, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70 41–55. MR0742974

RUBIN, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Ann. Intern. Med. 127 757–763.

SAAR-TSECHANSKY, M. and PROVOST, F. (2007). Handling missing features when applying classification models. J. Mach. Learn. Res. 8 1625–1657.

SARLE, W. S. (1998). Prediction with missing inputs. In JCIS 98 Proceedings (P. Wang, ed.) II 399–402. Research Triangle Park, Durham, NC.

SENI, G. and ELDER, J. F. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery). Morgan and Claypool, San Rafael, CA.

SHAFER, G. (1996). The Art of Causal Conjecture. MIT Press, Cambridge, MA.

SCHAPIRE, R. E. (1999). A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence 1401–1406. Stockholm, Sweden.

SHMUELI, G. and KOPPIUS, O. R. (2010). Predictive analytics in information systems research. MIS Quart. To appear.

SIMON, H. A. (2001). Science seeks parsimony, not simplicity: Searching for pattern in phenomena. In Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple 32–72. Cambridge Univ. Press. MR1932928

SOBER, E. (2002). Instrumentalism, parsimony, and the Akaike framework. Philos. Sci. 69 S112–S123.

SONG, H. and WITT, S. F. (2000). Tourism Demand Modelling and Forecasting: Modern Econometric Approaches. Pergamon Press, Oxford.

SPIRTES, P., GLYMOUR, C. and SCHEINES, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675


STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36 111–147. MR0356377

TALEB, N. (2007). The Black Swan. Penguin Books, London.

VAN MAANEN, J., SORENSEN, J. and MITCHELL, T. (2007). The interplay between theory and method. Acad. Manag. Rev. 32 1145–1154.

VAUGHAN, T. S. and BERRY, K. E. (2005). Using Monte Carlo techniques to demonstrate the meaning and implications of multicollinearity. J. Statist. Educ. 13 online.

WALLIS, W. A. (1980). The Statistical Research Group, 1942–1945. J. Amer. Statist. Assoc. 75 320–330. MR0577363

WANG, S., JANK, W. and SHMUELI, G. (2008). Explaining and forecasting online auction prices and their dynamics using functional data analysis. J. Business Econ. Statist. 26 144–160. MR2420144

WINKELMANN, R. (2008). Econometric Analysis of Count Data, 5th ed. Springer, New York. MR2148271

WOIT, P. (2006). Not Even Wrong: The Failure of String Theory and the Search for Unity in Physical Law. Jonathan Cape, London. MR2245858

WU, S., HARRIS, T. and MCAULEY, K. (2007). The use of simplified or misspecified models: Linear case. Canad. J. Chem. Eng. 85 386–398.

ZELLNER, A. (1962). An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. J. Amer. Statist. Assoc. 57 348–368. MR0139235

ZELLNER, A. (2001). Keep it sophisticatedly simple. In Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple 242–261. Cambridge Univ. Press. MR1932939

ZHANG, S., JANK, W. and SHMUELI, G. (2010). Real-time forecasting of online auctions via functional k-nearest neighbors. Int. J. Forecast. 26 666–683.