to explain or to predict? - arxiv · to explain or to predict? galit shmueli abstract. statistical...

23
arXiv:1101.0891v1 [stat.ME] 5 Jan 2011 Statistical Science 2010, Vol. 25, No. 3, 289–310 DOI: 10.1214/10-STS330 c Institute of Mathematical Statistics, 2010 To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and descrip- tion. In many disciplines there is near-exclusive use of statistical mod- eling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statis- tical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to re- veal the practical implications of the distinction to each step in the modeling process. Key words and phrases: Explanatory modeling, causality, predictive modeling, predictive power, statistical strategy, data mining, scientific research. 1. INTRODUCTION Looking at how statistical models are used in dif- ferent scientific disciplines for the purpose of the- ory building and testing, one finds a range of per- ceptions regarding the relationship between causal explanation and empirical prediction. In many sci- entific fields such as economics, psychology, educa- tion, and environmental science, statistical models are used almost exclusively for causal explanation, and models that possess high explanatory power are often assumed to inherently possess predictive power. In fields such as natural language processing and bioinformatics, the focus is on empirical predic- tion with only a slight and indirect relation to causal Galit Shmueli is Associate Professor of Statistics, Department of Decision, Operations and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, USA (e-mail: [email protected]). This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 3, 289–310. This reprint differs from the original in pagination and typographic detail. explanation. And yet in other research fields, such as epidemiology, the emphasis on causal explanation versus empirical prediction is more mixed. Statisti- cal modeling for description, where the purpose is to capture the data structure parsimoniously, and which is the most commonly developed within the field of statistics, is not commonly used for theory building and testing in other disciplines. Hence, in this article I focus on the use of statistical mod- eling for causal explanation and for prediction. My main premise is that the two are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling pro- cess and on its consequences. Although not explic- itly stated in the statistics methodology literature, applied statisticians instinctively sense that predict- ing and explaining are different. This article aims to fill a critical void: to tackle the distinction between explanatory modeling and predictive modeling. Clearing the current ambiguity between the two is critical not only for proper statistical modeling, but more importantly, for proper scientific usage. Both explanation and prediction are necessary for gener- ating and testing theories, yet each plays a differ- ent role in doing so. The lack of a clear distinction 1

Upload: others

Post on 31-May-2020

31 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

arX

iv:1

101.

0891

v1 [

stat

.ME

] 5

Jan

201

1

Statistical Science

2010, Vol. 25, No. 3, 289–310DOI: 10.1214/10-STS330c© Institute of Mathematical Statistics, 2010

To Explain or to Predict?Galit Shmueli

Abstract. Statistical modeling is a powerful tool for developing andtesting theories by way of causal explanation, prediction, and descrip-tion. In many disciplines there is near-exclusive use of statistical mod-eling for causal explanation and the assumption that models with highexplanatory power are inherently of high predictive power. Conflationbetween explanation and prediction is common, yet the distinctionmust be understood for progressing scientific knowledge. While thisdistinction has been recognized in the philosophy of science, the statis-tical literature lacks a thorough discussion of the many differences thatarise in the process of modeling for an explanatory versus a predictivegoal. The purpose of this article is to clarify the distinction betweenexplanatory and predictive modeling, to discuss its sources, and to re-veal the practical implications of the distinction to each step in themodeling process.

Key words and phrases: Explanatory modeling, causality, predictivemodeling, predictive power, statistical strategy, data mining, scientificresearch.

1. INTRODUCTION

Looking at how statistical models are used in dif-ferent scientific disciplines for the purpose of the-ory building and testing, one finds a range of per-ceptions regarding the relationship between causalexplanation and empirical prediction. In many sci-entific fields such as economics, psychology, educa-tion, and environmental science, statistical modelsare used almost exclusively for causal explanation,and models that possess high explanatory powerare often assumed to inherently possess predictivepower. In fields such as natural language processingand bioinformatics, the focus is on empirical predic-tion with only a slight and indirect relation to causal

Galit Shmueli is Associate Professor of Statistics,Department of Decision, Operations and InformationTechnologies, Robert H. Smith School of Business,University of Maryland, College Park, Maryland 20742,USA (e-mail: [email protected]).

This is an electronic reprint of the original articlepublished by the Institute of Mathematical Statistics inStatistical Science, 2010, Vol. 25, No. 3, 289–310. Thisreprint differs from the original in pagination andtypographic detail.

explanation. And yet in other research fields, suchas epidemiology, the emphasis on causal explanationversus empirical prediction is more mixed. Statisti-cal modeling for description, where the purpose isto capture the data structure parsimoniously, andwhich is the most commonly developed within thefield of statistics, is not commonly used for theorybuilding and testing in other disciplines. Hence, inthis article I focus on the use of statistical mod-eling for causal explanation and for prediction. Mymain premise is that the two are often conflated, yetthe causal versus predictive distinction has a largeimpact on each step of the statistical modeling pro-cess and on its consequences. Although not explic-itly stated in the statistics methodology literature,applied statisticians instinctively sense that predict-ing and explaining are different. This article aims tofill a critical void: to tackle the distinction betweenexplanatory modeling and predictive modeling.Clearing the current ambiguity between the two is

critical not only for proper statistical modeling, butmore importantly, for proper scientific usage. Bothexplanation and prediction are necessary for gener-ating and testing theories, yet each plays a differ-ent role in doing so. The lack of a clear distinction

1

Page 2: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

2 G. SHMUELI

within statistics has created a lack of understand-ing in many disciplines of the difference betweenbuilding sound explanatory models versus creatingpowerful predictive models, as well as confusing ex-planatory power with predictive power. The impli-cations of this omission and the lack of clear guide-lines on how to model for explanatory versus pre-dictive goals are considerable for both scientific re-search and practice and have also contributed to thegap between academia and practice.I start by defining what I term explaining and

predicting. These definitions are chosen to reflectthe distinct scientific goals that they are aimed at:causal explanation and empirical prediction, respec-tively. Explanatory modeling and predictive model-ing reflect the process of using data and statistical(or data mining) methods for explaining or predict-ing, respectively. The term modeling is intentionallychosen over models to highlight the entire process in-volved, from goal definition, study design, and datacollection to scientific use.

1.1 Explanatory Modeling

In many scientific fields, and especially the socialsciences, statistical methods are used nearly exclu-sively for testing causal theory. Given a causal theo-retical model, statistical models are applied to datain order to test causal hypotheses. In such mod-els, a set of underlying factors that are measuredby variables X are assumed to cause an underlyingeffect, measured by variable Y . Based on collabora-tive work with social scientists and economists, onan examination of some of their literature, and onconversations with a diverse group of researchers, Iconjecture that, whether statisticians like it or not,the type of statistical models used for testing causalhypotheses in the social sciences are almost alwaysassociation-based models applied to observational

data. Regression models are the most common ex-ample. The justification for this practice is that thetheory itself provides the causality. In other words,the role of the theory is very strong and the relianceon data and statistical modeling are strictly throughthe lens of the theoretical model. The theory–datarelationship varies in different fields. While the so-cial sciences are very theory-heavy, in areas such asbioinformatics and natural language processing theemphasis on a causal theory is much weaker. Hence,given this reality, I define explaining as causal ex-planation and explanatory modeling as the use ofstatistical models for testing causal explanations.To illustrate how explanatory modeling is typi-

cally done, I describe the structure of a typical arti-cle in a highly regarded journal in the field of Infor-mation Systems (IS). Researchers in the field of ISusually have training in economics and/or the be-havioral sciences. The structure of articles reflectsthe way empirical research is conducted in IS andrelated fields.The example used is an article by Gefen, Kara-

hanna and Straub (2003), which studies technologyacceptance. The article starts with a presentation ofthe prevailing relevant theory(ies):

Online purchase intensions should be ex-plained in part by the technology accep-tance model (TAM). This theoretical modelis at present a preeminent theory of tech-nology acceptance in IS.

The authors then proceed to state multiple causalhypotheses (denoted H1,H2, . . . in Figure 1, rightpanel), justifying the merits for each hypothesis andgrounding it in theory. The research hypotheses aregiven in terms of theoretical constructs rather thanmeasurable variables. Unlike measurable variables,

Fig. 1. Causal diagram (left) and partial list of stated hypotheses (right) from Gefen, Karahanna and Straub (2003).

Page 3: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 3

constructs are abstractions that “describe a phe-nomenon of theoretical interest” (Edwards andBagozzi, 2000) and can be observable or unobserv-able. Examples of constructs in this article are trust,perceived usefulness (PU), and perceived ease of use(PEOU). Examples of constructs used in other fieldsinclude anger, poverty, well-being, and odor. The hy-potheses section will often include a causal diagramillustrating the hypothesized causal relationship be-tween the constructs (see Figure 1, left panel). Thenext step is construct operationalization, where abridge is built between theoretical constructs andobservable measurements, using previous literatureand theoretical justification. Only after the theoret-ical component is completed, and measurements arejustified and defined, do researchers proceed to thenext step where data and statistical modeling are in-troduced alongside the statistical hypotheses, whichare operationalized from the research hypotheses.Statistical inference will lead to “statistical conclu-sions” in terms of effect sizes and statistical sig-nificance in relation to the causal hypotheses. Fi-nally, the statistical conclusions are converted intoresearch conclusions, often accompanied by policyrecommendations.In summary, explanatory modeling refers here to

the application of statistical models to data for test-ing causal hypotheses about theoretical constructs.Whereas “proper” statistical methodology for test-ing causality exists, such as designed experimentsor specialized causal inference methods for observa-tional data [e.g., causal diagrams (Pearl, 1995), dis-covery algorithms (Spirtes, Glymour and Scheines,2000), probability trees (Shafer, 1996), and propen-sity scores (Rosenbaum and Rubin, 1983; Rubin,1997)], in practice association-based statistical mod-els, applied to observational data, are most com-monly used for that purpose.

1.2 Predictive Modeling

I define predictive modeling as the process of ap-plying a statistical model or data mining algorithmto data for the purpose of predicting new or futureobservations. In particular, I focus on nonstochasticprediction (Geisser, 1993, page 31), where the goalis to predict the output value (Y ) for new observa-tions given their input values (X). This definitionalso includes temporal forecasting, where observa-tions until time t (the input) are used to forecastfuture values at time t+ k, k > 0 (the output). Pre-dictions include point or interval predictions, pre-diction regions, predictive distributions, or rankings

of new observations. Predictive model is any methodthat produces predictions, regardless of its underly-ing approach: Bayesian or frequentist, parametric ornonparametric, data mining algorithm or statisticalmodel, etc.

1.3 Descriptive Modeling

Although not the focus of this article, a thirdtype of modeling, which is the most commonly usedand developed by statisticians, is descriptive mod-eling. This type of modeling is aimed at summariz-ing or representing the data structure in a compactmanner. Unlike explanatory modeling, in descriptivemodeling the reliance on an underlying causal the-ory is absent or incorporated in a less formal way.Also, the focus is at the measurable level rather thanat the construct level. Unlike predictive modeling,descriptive modeling is not aimed at prediction. Fit-ting a regression model can be descriptive if it isused for capturing the association between the de-pendent and independent variables rather than forcausal inference or for prediction. We mention thistype of modeling to avoid confusion with causal-explanatory and predictive modeling, and also tohighlight the different approaches of statisticians andnonstatisticians.

1.4 The Scientific Value of Predictive Modeling

Although explanatory modeling is commonly usedfor theory building and testing, predictive modelingis nearly absent in many scientific fields as a toolfor developing theory. One possible reason is thestatistical training of nonstatistician researchers. Alook at many introductory statistics textbooks re-veals very little in the way of prediction. Anotherreason is that prediction is often considered unsci-entific. Berk (2008) wrote, “In the social sciences,for example, one either did causal modeling econo-metric style or largely gave up quantitative work.”From conversations with colleagues in various disci-plines it appears that predictive modeling is oftenvalued for its applied utility, yet is discarded for sci-entific purposes such as theory building or testing.Shmueli and Koppius (2010) illustrated the lack ofpredictive modeling in the field of IS. Searching the1072 papers published in the two top-rated journalsInformation Systems Research and MIS Quarterlybetween 1990 and 2006, they found only 52 empiricalpapers with predictive claims, of which only sevencarried out proper predictive modeling or testing.

Page 4: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

4 G. SHMUELI

Even among academic statisticians, there appearsto be a divide between those who value prediction asthe main purpose of statistical modeling and thosewho see it as unacademic. Examples of statisticianswho emphasize predictive methodology includeAkaike (“The predictive point of view is a proto-typical point of view to explain the basic activity ofstatistical analysis” in Findley and Parzen, 1998),Deming (“The only useful function of a statisticianis to make predictions” in Wallis, 1980), Geisser(“The prediction of observables or potential observ-ables is of much greater relevance than the estimateof what are often artificial constructs-parameters,”Geisser, 1975), Aitchison and Dunsmore (“predic-tion analysis. . . is surely at the heart of many statis-tical applications,” Aitchison and Dunsmore, 1975)and Friedman (“One of the most common and im-portant uses for data is prediction,” Friedman, 1997).Examples of those who see it as unacademic areKendall and Stuart (“The Science of Statistics dealswith the properties of populations. In considering apopulation of men we are not interested, statisticallyspeaking, in whether some particular individual hasbrown eyes or is a forger, but rather in how manyof the individuals have brown eyes or are forgers,”Kendall and Stuart, 1977) and more recently Parzen(“The two goals in analyzing data. . . I prefer todescribe as “management” and “science.” Manage-ment seeks profit. . . Science seeks truth,” Parzen,2001). In economics there is a similar disagreementregarding “whether prediction per se is a legitimateobjective of economic science, and also whether ob-served data should be used only to shed light on ex-isting theories or also for the purpose of hypothesisseeking in order to develop new theories” (Feelders,2002).Before proceeding with the discrimination between

explanatory and predictive modeling, it is impor-tant to establish prediction as a necessary scientificendeavor beyond utility, for the purpose of devel-oping and testing theories. Predictive modeling andpredictive testing serve several necessary scientificfunctions:

1. Newly available large and rich datasets often con-tain complex relationships and patterns that arehard to hypothesize, especially given theories thatexclude newly measurable concepts. Using pre-dictive modeling in such contexts can help un-cover potential new causal mechanisms and leadto the generation of new hypotheses. See, for ex-ample, the discussion between Gurbaxani and

Mendelson (1990, 1994) and Collopy, Adya andArmstrong (1994).

2. The development of new theory often goes handin hand with the development of new measures(Van Maanen, Sorensen and Mitchell, 2007). Pre-dictive modeling can be used to discover newmeasures as well as to compare different oper-ationalizations of constructs and different mea-surement instruments.

3. By capturing underlying complex patterns andrelationships, predictive modeling can suggest im-provements to existing explanatory models.

4. Scientific development requires empirically rig-orous and relevant research. Predictive model-ing enables assessing the distance between theoryand practice, thereby serving as a “reality check”to the relevance of theories.1 While explanatorypower provides information about the strengthof an underlying causal relationship, it does notimply its predictive power.

5. Predictive power assessment offers a straightfor-ward way to compare competing theories by ex-amining the predictive power of their respectiveexplanatory models.

6. Predictive modeling plays an important role inquantifying the level of predictability of measur-able phenomena by creating benchmarks of pre-dictive accuracy (Ehrenberg and Bound, 1993).Knowledge of un-predictability is a fundamen-tal component of scientific knowledge (see, e.g.,Taleb, 2007). Because predictive models tend tohave higher predictive accuracy than explanatorystatistical models, they can give an indication ofthe potential level of predictability. A very lowpredictability level can lead to the developmentof new measures, new collected data, and newempirical approaches. An explanatory model thatis close to the predictive benchmark may suggestthat our understanding of that phenomenon canonly be increased marginally. On the other hand,an explanatory model that is very far from thepredictive benchmark would imply that there aresubstantial practical and theoretical gains to behad from further scientific development.

For a related, more detailed discussion of the valueof prediction to scientific theory development see thework of Shmueli and Koppius (2010).

1Predictive models are advantageous in terms of negativeempiricism: a model either predicts accurately or it does not,and this can be observed. In contrast, explanatory models cannever be confirmed and are harder to contradict.

Page 5: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 5

1.5 Explaining and Predicting Are Different

In the philosophy of science, it has long been de-bated whether explaining and predicting are one ordistinct. The conflation of explanation and predic-tion has its roots in philosophy of science literature,particularly the influential hypothetico-deductivemodel (Hempel and Oppenheim, 1948), which ex-plicitly equated prediction and explanation. How-ever, as later became clear, the type of uncertaintyassociated with explanation is of a different naturethan that associated with prediction (Helmer andRescher, 1959). This difference highlighted the needfor developing models geared specifically toward deal-ing with predicting future events and trends such asthe Delphi method (Dalkey and Helmer, 1963). Thedistinction between the two concepts has been fur-ther elaborated (Forster and Sober, 1994; Forster,2002; Sober, 2002; Hitchcock and Sober, 2004; Dowe,Gardner and Oppy, 2007). In his book Theory Build-ing, Dubin (1969, page 9) wrote:

Theories of social and human behavior ad-dress themselves to two distinct goals ofscience: (1) prediction and (2) understand-ing. It will be argued that these are sep-arate goals [. . . ] I will not, however, con-clude that they are either inconsistent orincompatible.

Herbert Simon distinguished between “basic science”and “applied science” (Simon, 2001), a distinctionsimilar to explaining versus predicting. Accordingto Simon, basic science is aimed at knowing (“to de-scribe the world”) and understanding (“to provideexplanations of these phenomena”). In contrast, inapplied science, “Laws connecting sets of variablesallow inferences or predictions to be made from knownvalues of some of the variables to unknown values ofother variables.”Why should there be a difference between explain-

ing and predicting? The answer lies in the fact thatmeasurable data are not accurate representationsof their underlying constructs. The operationaliza-tion of theories and constructs into statistical mod-els and measurable data creates a disparity betweenthe ability to explain phenomena at the conceptuallevel and the ability to generate predictions at themeasurable level.To convey this disparity more formally, consider

a theory postulating that construct X causes con-struct Y , via the function F , such that Y = F(X ).

F is often represented by a path model, a set of

qualitative statements, a plot (e.g., a supply anddemand plot), or mathematical formulas. Measur-able variables X and Y are operationalizations ofX and Y , respectively. The operationalization of Finto a statistical model f , such as E(Y ) = f(X), isdone by considering F in light of the study design(e.g., numerical or categorical Y ; hierarchical or flatdesign; time series or cross-sectional; complete orcensored data) and practical considerations such asstandards in the discipline. Because F is usually notsufficiently detailed to lead to a single f , often a set

of f models is considered. Feelders (2002) describedthis process in the field of economics. In the predic-tive context, we consider only X, Y and f .The disparity arises because the goal in explana-

tory modeling is to match f and F as closely aspossible for the statistical inference to apply to thetheoretical hypotheses. The data X, Y are tools forestimating f , which in turn is used for testing thecausal hypotheses. In contrast, in predictive mod-eling the entities of interest are X and Y , and the

function f is used as a tool for generating good pre-dictions of new Y values. In fact, we will see thateven if the underlying causal relationship is indeedY = F(X ), a function other than f(X) and dataother than X might be preferable for prediction.The disparity manifests itself in different ways.

Four major aspects are:

Causation–Association: In explanatory modeling f

represents an underlying causal function, and X

is assumed to cause Y . In predictive modeling f

captures the association between X and Y .Theory–Data: In explanatory modeling, f is care-fully constructed based on F in a fashion thatsupports interpreting the estimated relationshipbetween X and Y and testing the causal hypothe-ses. In predictive modeling, f is often constructedfrom the data. Direct interpretability in terms ofthe relationship between X and Y is not required,

although sometimes transparency of f is desirable.Retrospective–Prospective: Predictive modeling isforward-looking, in that f is constructed for pre-dicting new observations. In contrast, explanatorymodeling is retrospective, in that f is used to testan already existing set of hypotheses.

Bias–Variance: The expected prediction error for anew observation with value x, using a quadratic

Page 6: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

6 G. SHMUELI

loss function,2 is given by Hastie, Tibshirani andFriedman (2009, page 223)

EPE = E{Y − f(x)}2

= E{Y − f(x)}2 + {E(f (x))− f(x)}2

(1)+E{f(x)−E(f(x))}2

=Var(Y ) +Bias2 +Var(f(x)).

Bias is the result of misspecifying the statisticalmodel f . Estimation variance (the third term) isthe result of using a sample to estimate f . Thefirst term is the error that results even if the modelis correctly specified and accurately estimated. Theabove decomposition reveals a source of the differ-ence between explanatory and predictive model-ing: In explanatory modeling the focus is on mini-mizing bias to obtain the most accurate represen-tation of the underlying theory. In contrast, pre-dictive modeling seeks to minimize the combina-tion of bias and estimation variance, occasionallysacrificing theoretical accuracy for improved em-pirical precision. This point is illustrated in theAppendix, showing that the “wrong” model cansometimes predict better than the correct one.

The four aspects impact every step of the mod-eling process, such that the resulting f is markedlydifferent in the explanatory and predictive contexts,as will be shown in Section 2.

1.6 A Void in the Statistics Literature

The philosophical explaining/predicting debate hasnot been directly translated into statistical languagein terms of the practical aspects of the entire statis-tical modeling process.A search of the statistics literature for discussion

of explaining versus predicting reveals a lively dis-cussion in the context of model selection, and in par-ticular, the derivation and evaluation of model selec-tion criteria. In this context, Konishi and Kitagawa(2007) wrote:

There may be no significant difference be-tween the point of view of inferring thetrue structure and that of making a pre-diction if an infinitely large quantity of

2For a binary Y , various 0–1 loss functions have been sug-gested in place of the quadratic loss function (Domingos,2000).

data is available or if the data are noise-less. However, in modeling based on a fi-nite quantity of real data, there is a signifi-cant gap between these two points of view,because an optimal model for predictionpurposes may be different from one ob-tained by estimating the ‘true model.’

The literature on this topic is vast, and we do not in-tend to cover it here, although we discuss the majorpoints in Section 2.6.The focus on prediction in the field of machine

learning and by statisticians such as Geisser, Aitchi-son and Dunsmore, Breiman and Friedman, has high-lighted aspects of predictive modeling that are rel-evant to the explanatory/prediction distinction, al-though they do not directly contrast explanatoryand predictive modeling.3 The prediction literatureraises the importance of evaluating predictive powerusing holdout data, and the usefulness of algorith-mic methods (Breiman, 2001b). The predictive fo-cus has also led to the development of inferencetools that generate predictive distributions. Geisser(1993) introduced “predictive inference” and devel-oped it mainly in a Bayesian context. “Predictivelikelihood” (see Bjornstad, 1990) is a likelihood-basedapproach to predictive inference, and Dawid’s pre-quential theory (Dawid, 1984) investigates inferenceconcepts in terms of predictability. Finally, the bias–variance aspect has been pivotal in data mining forunderstanding the predictive performance of differ-ent algorithms and for designing new ones.Another area in statistics and econometrics that

focuses on prediction is time series. Methods havebeen developed specifically for testing the predictabil-ity of a series [e.g., random walk tests or the conceptof Granger causality (Granger, 1969)], and evalu-ating predictability by examining performance onholdout data. The time series literature in statis-tics is dominated by extrapolation models such asARIMA-type models and exponential smoothing meth-ods, which are suitable for prediction and descrip-tion, but not for causal explanation. Causal modelsfor time series are common in econometrics (e.g.,Song and Witt, 2000), where an underlying causaltheory links constructs, which lead to operational-ized variables, as in the cross-sectional case. Yet, to

3Geisser distinguished between “[statistical] parameters”and “observables” in terms of the objects of interest. His dis-tinction is closely related, but somewhat different from ourdistinction between theoretical constructs and measurements.

Page 7: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 7

the best of my knowledge, there is no discussion inthe statistics time series literature regarding the dis-tinction between predictive and explanatory model-ing, aside from the debate in economics regardingthe scientific value of prediction.To conclude, the explanatory/predictive model-

ing distinction has been discussed directly in themodel selection context, but not in the larger con-text. Areas that focus on developing predictive mod-eling such as machine learning and statistical timeseries, and “predictivists” such as Geisser, have con-sidered prediction as a separate issue, and have notdiscussed its principal and practical distinction fromcausal explanation in terms of developing and test-ing theory. The goal of this article is therefore toexamine the explanatory versus predictive debatefrom a statistical perspective, considering how mod-eling is used by nonstatistician scientists for theorydevelopment.The remainder of the article is organized as fol-

lows. In Section 2, I consider each step in the model-ing process in terms of the four aspects of the predic-tive/explanatory modeling distinction: causation–association, theory–data, retrospective–prospectiveand bias–variance. Section 3 illustrates some of thesedifferences via two examples. A discussion of the im-plications of the predict/explain conflation, conclu-sions, and recommendations are given in Section 4.

2. TWO MODELING PATHS

In the following I examine the process of statisti-cal modeling through the explain/predict lens, fromgoal definition to model use and reporting. For clar-ity, I broke down the process into a generic set ofsteps, as depicted in Figure 2. In each step I pointout differences in the choice of methods, criteria,data, and information to consider when the goal ispredictive versus explanatory. I also briefly describethe related statistics literature. The conceptual andpractical differences invariably lead to a differencebetween a final explanatory model and a predic-tive one, even though they may use the same initialdata. Thus, a priori determination of the main studygoal as either explanatory or predictive4 is essentialto conducting adequate modeling. The discussion inthis section assumes that the main research goal hasbeen determined as either explanatory or predictive.

4The main study goal can also be descriptive.

2.1 Study Design and Data Collection

Even at the early stages of study design and datacollection, issues of what and how much data tocollect, according to what design, and which col-lection instrument to use are considered differentlyfor prediction versus explanation. Consider samplesize. In explanatory modeling, where the goal is toestimate the theory-based f with adequate precisionand to use it for inference, statistical power is themain consideration. Reducing bias also requires suf-ficient data for model specification testing. Beyonda certain amount of data, however, extra precisionis negligible for purposes of inference. In contrast,in predictive modeling, f itself is often determinedfrom the data, thereby requiring a larger samplefor achieving lower bias and variance. In addition,more data are needed for creating holdout datasets(see Section 2.2). Finally, predicting new individ-ual observations accurately, in a prospective man-ner, requires more data than retrospective inferenceregarding population-level parameters, due to theextra uncertainty.A second design issue is sampling scheme. For in-

stance, in the context of hierarchical data (e.g., sam-pling students within schools) Afshartous and deLeeuw (2005) noted, “Although there exists an ex-tensive literature on estimation issues in multilevelmodels, the same cannot be said with respect to pre-diction.” Examining issues of sample size, sample al-location, and multilevel modeling for the purpose of“predicting a future observable y∗j in the Jth groupof a hierarchial dataset,” they found that allocationfor estimation versus prediction should be different:“an increase in group size n is often more benefi-cial with respect to prediction than an increase inthe number of groups J . . . [whereas] estimation ismore improved by increasing the number of groupsJ instead of the group size n.” This relates directlyto the bias–variance aspect. A related issue is thechoice of f in relation to sampling scheme. Afshar-tous and de Leeuw (2005) found that for their hierar-chical data, a hierarchical f , which is more appropri-ate theoretically, had poorer predictive performancethan a nonhierarchical f .A third design consideration is the choice between

experimental and observational settings. Whereasfor causal explanation experimental data are greatlypreferred, subject to availability and resource con-straints, in prediction sometimes observational dataare preferable to “overly clean” experimental data, if

Page 8: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

8 G. SHMUELI

Fig. 2. Steps in the statistical modeling process.

they better represent the realistic context of predic-tion in terms of the uncontrolled factors, the noise,the measured response, etc. This difference arisesfrom the theory–data and prospective–retrospectiveaspects. Similarly, when choosing between primarydata (data collected for the purpose of the study)and secondary data (data collected for other pur-poses), the classic criteria of data recency, relevance,and accuracy (Patzer, 1995) are considered froma different angle. For example, a predictive modelrequires the secondary data to include the exactX, Y variables to be used at the time of prediction,whereas for causal explanation different operational-izations of the constructs X ,Y may be acceptable.In terms of the data collection instrument, whereas

in explanatory modeling the goal is to obtain a re-liable and valid instrument such that the data ob-tained represent the underlying construct adequately(e.g., item response theory in psychometrics), forpredictive purposes it is more important to focus onthe measurement quality and its meaning in termsof the variable to be predicted.Finally, consider the field of design of experiments:

two major experimental designs are factorial designsand response surface methodology (RSM) designs.The former is focused on causal explanation in termsof finding the factors that affect the response. Thelatter is aimed at prediction—finding the combina-tion of predictors that optimizes Y . Factorial designsemploy a linear f for interpretability, whereas RSMdesigns use optimization techniques and estimate anonlinear f from the data, which is less interpretablebut more predictively accurate.5

2.2 Data Preparation

We consider two common data preparation opera-tions: handling missing values and data partitioning.

2.2.1 Handling missing values Most real datasetsconsist of missing values, thereby requiring one toidentify the missing values, to determine the extent

5I thank Douglas Montgomery for this insight.

and type of missingness, and to choose a course ofaction accordingly. Although a rich literature ex-ists on data imputation, it is monopolized by anexplanatory context. In predictive modeling, the so-lution strongly depends on whether the missing val-ues are in the training data and/or the data to bepredicted. For example, Sarle (1998) noted:

If you have only a small proportion ofcases with missing data, you can simplythrow out those cases for purposes of es-timation; if you want to make predictionsfor cases with missing inputs, you don’thave the option of throwing those casesout.

Sarle further listed imputation methods that areuseful for explanatory purposes but not for predic-tive purposes and vice versa. One example is us-ing regression models with dummy variables thatindicate missingness, which is considered unsatisfac-tory in explanatory modeling, but can produce ex-cellent predictions. The usefulness of creating miss-ingness dummy variables was also shown by Dingand Simonoff (2010). In particular, whereas the clas-sic explanatory approach is based on the Missing-At-Random, Missing-Completely-At-Random or Not-Missing-At-Random classification (Little and Ru-bin, 2002), Ding and Simonoff (2010) showed thatfor predictive purposes the important distinction iswhether the missingness depends on Y or not. Theyconcluded:

In the context of classification trees, therelationship between the missingness andthe dependent variable, rather than thestandard missingness classification approachof Little and Rubin (2002). . . is the mosthelpful criterion to distinguish different miss-ing data methods.

Moreover, missingness can be a blessing in a pre-dictive context, if it is sufficiently informative of Y(e.g., missingness in financial statements when thegoal is to predict fraudulent reporting).

Page 9: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 9

Finally, a completely different approach for han-dling missing data for prediction, mentioned by Sarle(1998) and further developed by Saar-Tsechanskyand Provost (2007), considers the case where to-be-predicted observations are missing some predic-tor information, such that the missing informationcan vary across different observations. The proposedsolution is to estimate multiple “reduced” models,each excluding some predictors. When predicting anobservation with missingness on a certain set of pre-dictors, the model that excludes those predictors isused. This approach means that different reducedmodels are created for different observations. Al-though useful for prediction, it is clearly inappro-priate for causal explanation.

2.2.2 Data partitioning A popular solution foravoiding overoptimistic predictive accuracy is toevaluate performance not on the training set, thatis, the data used to build the model, but rather on aholdout sample which the model “did not see.” Thecreation of a holdout sample can be achieved in var-ious ways, the most commonly used being a randompartition of the sample into training and holdoutsets. A popular alternative, especially with scarcedata, is cross-validation. Other alternatives are re-sampling methods, such as bootstrap, which canbe computationally intensive but avoid “bad par-titions” and enable predictive modeling with smalldatasets.Data partitioning is aimed at minimizing the com-

bined bias and variance by sacrificing some bias inreturn for a reduction in sampling variance. A smallersample is associated with higher bias when f is esti-mated from the data, which is common in predictivemodeling but not in explanatory modeling. Hence,data partitioning is useful for predictive modelingbut less so for explanatory modeling. With today’sabundance of large datasets, where the bias sacrificeis practically small, data partitioning has become astandard preprocessing step in predictive modeling.In explanatory modeling, data partitioning is less

common because of the reduction in statistical power.When used, it is usually done for the retrospectivepurpose of assessing the robustness of f . A rareryet important use of data partitioning in explana-tory modeling is for strengthening model validity,by demonstrating some predictive power. Althoughone would not expect an explanatory model to beoptimal in terms of predictive power, it should showsome degree of accuracy (see discussion in Section4.2).

2.3 Exploratory Data Analysis

Exploratory data analysis (EDA) is a key initialstep in both explanatory and predictive modeling.It consists of summarizing the data numerically andgraphically, reducing their dimension, and “prepar-ing” for the more formal modeling step. Althoughthe same set of tools can be used in both cases,they are used in a different fashion. In explanatorymodeling, exploration is channeled toward the the-oretically specified causal relationships, whereas inpredictive modeling EDA is used in a more free-formfashion, supporting the purpose of capturing rela-tionships that are perhaps unknown or at least lessformally formulated.One example is how data visualization is carried

out. Fayyad, Grinstein and Wierse (2002, page 22)contrasted “exploratory visualization” with “confir-matory visualization”:

Visualizations can be used to explore data,to confirm a hypothesis, or to manipu-late a viewer. . . In exploratory visualiza-tion the user does not necessarily knowwhat he is looking for. This creates a dy-namic scenario in which interaction is crit-ical. . . In a confirmatory visualization, theuser has a hypothesis that needs to betested. This scenario is more stable andpredictable. System parameters are oftenpredetermined.

Hence, interactivity, which supports explorationacross a wide and sometimes unknown terrain, isvery useful for learning about measurement qualityand associations that are at the core of predictivemodeling, but much less so in explanatory model-ing, where the data are visualized through the the-oretical lens.A second example is numerical summaries. In a

predictive context, one might explore a wide rangeof numerical summaries for all variables of inter-est, whereas in an explanatory model, the numericalsummaries would focus on the theoretical relation-ships. For example, in order to assess the role of acertain variable as a mediator, its correlation withthe response variable and with other covariates isexamined by generating specific correlation tables.A third example is the use of EDA for assess-

ing assumptions of potential models (e.g., normalityor multicollinearity) and exploring possible variabletransformations. Here, too, an explanatory context

Page 10: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

10 G. SHMUELI

would be more restrictive in terms of the space ex-plored.Finally, dimension reduction is viewed and used

differently. In predictive modeling, a reduction in thenumber of predictors can help reduce sampling vari-ance. Hence, methods such as principal componentsanalysis (PCA) or other data compression methodsthat are even less interpretable (e.g., singular valuedecomposition) are often carried out initially. Theymay later lead to the use of compressed variables(such as the first few components) as predictors,even if those are not easily interpretable. PCA isalso used in explanatory modeling, but for a differ-ent purpose. For questionnaire data, PCA and ex-ploratory factor analysis are used to determine thevalidity of the survey instrument. The resulting fac-tors are expected to correspond to the underlyingconstructs. In fact, the rotation step in factor anal-ysis is specifically aimed at making the factors moreinterpretable. Similarly, correlations are used for as-sessing the reliability of the survey instrument.

2.4 Choice of Variables

The criteria for choosing variables differ markedlyin explanatory versus predictive contexts.In explanatory modeling, where variables are seen

as operationalized constructs, variable choice is basedon the role of the construct in the theoretical causalstructure and on the operationalization itself. A broadterminology related to different variable roles existsin various fields: in the social sciences—antecedent,consequent, mediator and moderator6 variables; inpharmacology and medical sciences—treatment andcontrol variables; and in epidemiology—exposure andconfounding variables. Carte and Craig (2003) men-tioned that explaining moderating effects has be-come an important scientific endeavor in the field ofManagement Information Systems. Another impor-tant term common in economics is endogeneity or“reverse causation,” which results in biased param-eter estimates. Endogeneity can occur due to dif-ferent reasons. One reason is incorrectly omittingan input variable, say Z, from f when the causalconstruct Z is assumed to cause X and Y . In a re-gression model of Y on X , the omission of Z results

6“A moderator variable is one that influences thestrength of a relationship between two other vari-ables, and a mediator variable is one that explainsthe relationship between the two other variables” (fromhttp://psych.wisc.edu/henriques/mediator.html).

in X being correlated with the error term. Winkel-mann (2008) gave the example of a hypothesis thathealth insurance (X ) affects the demand for healthservices Y . The operationalized variables are “healthinsurance status” (X) and “number of doctor con-sultations” (Y ). Omitting an input measurementZ for “true health status” (Z) from the regressionmodel f causes endogeneity because X can be de-termined by Y (i.e., reverse causation), which man-ifests as X being correlated with the error term inf . Endogeneity can arise due to other reasons suchas measurement error in X . Because of the focusin explanatory modeling on causality and on bias,there is a vast literature on detecting endogeneityand on solutions such as constructing instrumen-tal variables and using models such as two-stage-least-squares (2SLS). Another related term is simul-taneous causality, which gives rise to special mod-els such as Seemingly Unrelated Regression (SUR)(Zellner, 1962). In terms of chronology, a causal ex-planatory model can include only “control” vari-ables that take place before the causal variable (Gel-man et al., 2003). And finally, for reasons of modelidentifiability (i.e., given the statistical model, eachcausal effect can be identified), one is required toinclude main effects in a model that contains an in-teraction term between those effects. We note thispractice because it is not necessary or useful in thepredictive context, due to the acceptability of unin-terpretable models and the potential reduction insampling variance when dropping predictors (see,e.g., the Appendix).In predictive modeling, the focus on association

rather than causation, the lack of F , and the prospec-tive context, mean that there is no need to delve intothe exact role of each variable in terms of an under-lying causal structure. Instead, criteria for choosingpredictors are quality of the association between thepredictors and the response, data quality, and avail-ability of the predictors at the time of prediction,known as ex-ante availability. In terms of ex-anteavailability, whereas chronological precedence of Xto Y is necessary in causal models, in predictivemodels not only must X precede Y , but X must beavailable at the time of prediction. For instance, ex-plaining wine quality retrospectively would dictateincluding barrel characteristics as a causal factor.The inclusion of barrel characteristics in a predic-tive model of future wine quality would be impossi-ble if at the time of prediction the grapes are stillon the vine. See the eBay example in Section 3.2 foranother example.

Page 11: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 11

2.5 Choice of Methods

Considering the four aspects of causation–associa-tion, theory–data, retrospective–prospective and bias–variance leads to different choices of plausible meth-ods, with a much larger array of methods usefulfor prediction. Explanatory modeling requires inter-pretable statistical models f that are easily linked tothe underlying theoretical model F . Hence the pop-ularity of statistical models, and especially regression-type methods, in many disciplines. Algorithmic meth-ods such as neural networks or k-nearest-neighbors,and uninterpretable nonparametric models, are con-sidered ill-suited for explanatory modeling.In predictive modeling, where the top priority is

generating accurate predictions of new observationsand f is often unknown, the range of plausible meth-ods includes not only statistical models (interpretableand uninterpretable) but also data mining algorithms.A neural network algorithm might not shed light onan underlying causal mechanism F or even on f ,but it can capture complicated associations, therebyleading to accurate predictions. Although modeltransparency might be important in some cases, it isof secondary importance: “Using complex predictorsmay be unpleasant, but the soundest path is to gofor predictive accuracy first, then try to understandwhy” (Breiman, 2001b).Breiman (2001b) accused the statistical commu-

nity of ignoring algorithmic modeling:

There are two cultures in the use of statis-tical modeling to reach conclusions fromdata. One assumes that the data are gen-erated by a given stochastic data model.The other uses algorithmic models andtreats the data mechanism as unknown.The statistical community has been com-mitted to the almost exclusive use of datamodels.

From the explanatory/predictive view, algorithmicmodeling is indeed very suitable for predictive (anddescriptive) modeling, but not for explanatory mod-eling.Some methods are not suitable for prediction

from the retrospective–prospective aspect, especiallyin time series forecasting. Models with coincidentindicators, which are measured simultaneously, aresuch a class. An example is the model Airfare t =f(OilPricet), which might be useful for explainingthe effect of oil price on airfare based on a causal

theory, but not for predicting future airfare becausethe oil price at the time of prediction is unknown.For prediction, alternative models must be consid-ered, such as using a lagged OilPrice variable, or cre-ating a separate model for forecasting oil prices andplugging its forecast into the airfare model. Anotherexample is the centered moving average, which re-quires the availability of data during a time windowbefore and after a period of interest, and is thereforenot useful for prediction.Lastly, the bias–variance aspect raises two classes

of methods that are very useful for prediction, butnot for explanation. The first is shrinkage methodssuch as ridge regression, principal components re-gression, and partial least squares regression, which“shrink” predictor coefficients or even eliminate them,thereby introducing bias into f , for the purpose ofreducing estimation variance. The second class ofmethods, which “have been called the most influ-ential development in Data Mining and MachineLearning in the past decade” (Seni and Elder, 2010,page vi), are ensemble methods such as bagging(Breiman, 1996), random forests (Breiman, 2001a),boosting7 (Schapire, 1999), variations of those meth-ods, and Bayesian alternatives (e.g., Brown, Van-nucci and Fearn, 2002). Ensembles combine multiplemodels to produce more precise predictions by av-eraging predictions from different models, and haveproven useful in numerous applications (see the Net-flix Prize example in Section 3.1).

2.6 Validation, Model Evaluation and Model

Selection

Choosing the final model among a set of models,validating it, and evaluating its performance, differmarkedly in explanatory and predictive modeling.Although the process is iterative, I separate it intothree components for ease of exposition.

2.6.1 Validation In explanatory modeling, valida-tion consists of two parts: model validation validatesthat f adequately represents F , and model fit vali-dates that f fits the data {X,Y }. In contrast, vali-dation in predictive modeling is focused on general-ization, which is the ability of f to predict new data{Xnew, Ynew}.

7Although boosting algorithms were developed as ensem-ble methods, “[they can] be seen as an interesting regulariza-tion scheme for estimating a model” (Bohlmann and Hothorn,2007).

Page 12: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

12 G. SHMUELI

Methods used in explanatory modeling for modelvalidation include model specification tests such asthe popular Hausman specification test in econo-metrics (Hausman, 1978), and construct validationtechniques such as reliability and validity measuresof survey questions and factor analysis. Inferencefor individual coefficients is also used for detect-ing over- or underspecification. Validating model fitinvolves goodness-of-fit tests (e.g., normality tests)and model diagnostics such as residual analysis. Al-though indications of lack of fit might lead researchersto modify f , modifications are made carefully inlight of the relationship with F and the constructsX , Y .In predictive modeling, the biggest danger to gen-

eralization is overfitting the training data. Hencevalidation consists of evaluating the degree of over-

fitting, by comparing the performance of f on thetraining and holdout sets. If performance is signifi-cantly better on the training set, overfitting is im-plied.Not only is the large context of validation markedly

different in explanatory and predictive modeling,but so are the details. For example, checking formulticollinearity is a standard operation in assess-ing model fit. This practice is relevant in explana-tory modeling, where multicollinearity can lead toinflated standard errors, which interferes with infer-ence. Therefore, a vast literature exists on strategiesfor identifying and reducing multicollinearity, vari-able selection being one strategy. In contrast, forpredictive purposes “multicollinearity is not quite asdamning” (Vaughan and Berry, 2005). Makridakis,Wheelwright and Hyndman (1998, page 288) dis-tinguished between the role of multicollinearity inexplaining versus its role in predicting:

Multicollinearity is not a problem unlesseither (i) the individual regression coeffi-cients are of interest, or (ii) attempts aremade to isolate the contribution of oneexplanatory variable to Y, without the in-fluence of the other explanatory variables.Multicollinearity will not affect the abilityof the model to predict.

Another example is the detection of influential ob-servations. While classic methods are aimed at de-tecting observations that are influential in terms ofmodel estimation, Johnson and Geisser (1983) pro-posed a method for detecting influential observa-tions in terms of their effect on the predictive dis-tribution.

2.6.2 Model evaluation Consider two performanceaspects of a model: explanatory power and predic-tive power. The top priority in terms of model per-formance in explanatory modeling is assessing ex-planatory power, which measures the strength of re-lationship indicated by f . Researchers report R2-type values and statistical significance of overall F -type statistics to indicate the level of explanatorypower.In contrast, in predictive modeling, the focus is on

predictive accuracy or predictive power, which referto the performance of f on new data. Measures ofpredictive power are typically out-of-sample metricsor their in-sample approximations, which depend onthe type of required prediction. For example, predic-tions of a binary Y could be binary classifications(Y = 0,1), predicted probabilities of a certain class

[P (Y = 1)], or rankings of those probabilities. Thelatter are common in marketing and personnel psy-chology. These three different types of predictionswould warrant different performance metrics. Forexample, a model can perform poorly in producingbinary classifications but adequately in producingrankings. Moreover, in the context of asymmetriccosts, where costs are heftier for some types of pre-diction errors than others, alternative performancemetrics are used, such as the “average cost per pre-dicted observation.”A common misconception in various scientific fields

is that predictive power can be inferred from ex-planatory power. However, the two are different andshould be assessed separately. While predictive powercan be assessed for both explanatory and predictivemodels, explanatory power is not typically possibleto assess for predictive models because of the lackof F and an underlying causal structure. Measuressuch as R2 and F would indicate the level of asso-ciation, but not causation.Predictive power is assessed using metrics com-

puted from a holdout set or using cross-validation(Stone, 1974; Geisser, 1975). Thus, a major differ-ence between explanatory and predictive performancemetrics is the data from which they are computed. Ingeneral, measures computed from the data to whichthe model was fitted tend to be overoptimistic interms of predictive accuracy: “Testing the proce-dure on the data that gave it birth is almost certainto overestimate performance” (Mosteller and Tukey,1977). Thus, the holdout set serves as a more real-istic context for evaluating predictive power.

Page 13: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 13

2.6.3 Model selection Once a set of models f1, f2,. . . has been estimated and validated, model selec-tion pertains to choosing among them. Two maindifferentiating aspects are the data–theory and bias–variance considerations. In explanatory modeling,the models are compared in terms of explanatorypower, and hence the popularity of nested models,which are easily compared. Stepwise-type methods,which use overall F statistics to include and/or ex-clude variables, might appear suitable for achiev-ing high explanatory power. However, optimizingexplanatory power in this fashion conceptually con-tradicts the validation step, where variable inclu-sion/exclusion and the structure of the statisticalmodel are carefully designed to represent the theo-retical model. Hence, proper explanatory model se-lection is performed in a constrained manner. In thewords of Jaccard (2001):

Trimming potentially theoretically mean-ingful variables is not advisable unless oneis quite certain that the coefficient for thevariable is near zero, that the variable isinconsequential, and that trimming willnot introduce misspecification error.

A researcher might choose to retain a causal co-variate which has a strong theoretical justificationeven if is statistically insignificant. For example, inmedical research, a covariate that denotes whethera person smokes or not is often present in modelsfor health conditions, whether it is statistically sig-nificant or not.8 In contrast to explanatory power,statistical significance plays a minor or no role inassessing predictive performance. In fact, it is some-times the case that removing inputs with small coef-ficients, even if they are statistically significant, re-sults in improved prediction accuracy (Greenbergand Parks, 1997; Wu, Harris and McAuley, 2007,and see the Appendix). Stepwise-type algorithmsare very useful in predictive modeling as long asthe selection criteria rely on predictive power ratherthan explanatory power.As mentioned in Section 1.6, the statistics liter-

ature on model selection includes a rich discussionon the difference between finding the “true” modeland finding the best predictive model, and on cri-teria for explanatory model selection versus predic-tive model selection. A popular predictive metric isthe in-sample Akaike Information Criterion (AIC).

8I thank Ayala Cohen for this example.

Akaike derived the AIC from a predictive viewpoint,where the model is not intended to accurately inferthe “true distribution,” but rather to predict futuredata as accurately as possible (see, e.g., Berk, 2008;Konishi and Kitagawa, 2007). Some researchers dis-tinguish between AIC and the Bayesian informationcriterion (BIC) on this ground. Sober (2002) con-cluded that AIC measures predictive accuracy whileBIC measures goodness of fit:

In a sense, the AIC and the BIC provideestimates of different things; yet, they al-most always are thought to be in compe-tition. If the question of which estimatoris better is to make sense, we must decidewhether the average likelihood of a family[=BIC] or its predictive accuracy [=AIC]is what we want to estimate.

Similarly, Dowe, Gardner and Oppy (2007) con-trasted the two Bayesian model selection criteriaMinimum Message Length (MML) and MinimumExpected Kullback–Leibler Distance (MEKLD).They concluded,

If you want to maximise predictive accu-racy, you should minimise the expectedKL distance (MEKLD); if you want thebest inference, you should use MML.

Kadane and Lazar (2004) examined a variety of modelselection criteria from a Bayesian decision–theoreticpoint of view, comparing prediction with explana-tion goals.Even when using predictive metrics, the fashion in

which they are used within a model selection processcan deteriorate their adequacy, yielding overopti-mistic predictive performance. Berk (2008) describedthe case where

statistical learning procedures are oftenapplied several times to the data with oneor more tuning parameters varied. TheAIC may be computed for each. But eachAIC is ignorant about the information ob-tained from prior fitting attempts and howmany degrees of freedom were expendedin the process. Matters are even more com-plicated if some of the variables are trans-formed or recoded. . . Some unjustified op-timism remains.

Page 14: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

14 G. SHMUELI

2.7 Model Use and Reporting

Given all the differences that arise in the mod-eling process, the resulting predictive model wouldobviously be very different from a resulting explana-tory model in terms of the data used ({X,Y }), the

estimated model f , and explanatory power and pre-dictive power. The use of f would also greatly differ.As illustrated in Section 1.1, explanatory models

in the context of scientific research are used to de-rive “statistical conclusions” using inference, whichin turn are translated into scientific conclusions re-garding F ,X,Y and the causal hypotheses. Witha focus on theory, causality, bias and retrospectiveanalysis, explanatory studies are aimed at testing orcomparing existing causal theories. Accordingly thestatistical section of explanatory scientific papers isdominated by statistical inference.In predictive modeling f is used to generate pre-

dictions for new data. We note that generating pre-dictions from f can range in the level of difficulty,depending on the complexity of f and on the typeof prediction generated. For example, generating acomplete predictive distribution is easier using aBayesian approach than the predictive likelihood ap-proach.In practical applications, the predictions might be

the final goal. However, the focus here is on pre-dictive modeling for supporting scientific research,as was discussed in Section 1.2. Scientific predictivestudies and articles therefore emphasize data, asso-ciation, bias–variance considerations, and prospec-tive aspects of the study. Conclusions pertain totheory-building aspects such as new hypothesis gen-eration, practical relevance, and predictability level.Whereas explanatory articles focus on theoreticalconstructs and unobservable parameters and theirstatistical section is dominated by inference, predic-tive articles concentrate on the observable level, withpredictive power and its comparison across modelsbeing the core.

3. TWO EXAMPLES

Two examples are used to broadly illustrate thedifferences that arise in predictive and explanatorystudies. In the first I consider a predictive goal anddiscuss what would be involved in “converting” itto an explanatory study. In the second example Iconsider an explanatory study and what would bedifferent in a predictive context. See the work ofShmueli and Koppius (2010) for a detailed example

“converting” the explanatory study of Gefen, Kara-hanna and Straub (2003) from Section 1 into a pre-dictive one.

3.1 Netflix Prize

Netflix is the largest online DVD rental servicein the United States. In an effort to improve theirmovie recommendation system, in 2006 Netflix an-nounced a contest (http://netflixprize.com), mak-ing public a huge dataset of user movie ratings. Eachobservation consisted of a user ID, a movie title, andthe rating that the user gave this movie. The taskwas to accurately predict the ratings of movie-userpairs for a test set such that the predictive accu-racy improved upon Netflix’s recommendation en-gine by at least 10%. The grand prize was set at$ 1,000,000. The 2009 winner was a composite ofthree teams, one of them from the AT&T researchlab (see Bell, Koren and Volinsky, 2010). In their2008 report, the AT&T team, who also won the 2007and 2008 progress prizes, described their modelingapproach (Bell, Koren and Volinsky, 2008).Let me point out several operations and choices

described by Bell, Koren and Volinsky (2008) thathighlight the distinctive predictive context. Start-ing with sample size, the very large sample releasedby Netflix was aimed at allowing the estimation off from the data, reflecting the absence of a strongtheory. In the data preparation step, with relationto missingness that is predictively informative, theteam found that “the information on which movieseach user chose to rate, regardless of specific rat-ing value” turned out to be useful. At the data ex-ploration and reduction step, many teams includingthe winners found that the noninterpretable Sin-gular Value Decomposition (SVD) data reductionmethod was key in producing accurate predictions:“It seems that models based on matrix-factorizationwere found to be most accurate.” As for choice ofvariables, supplementing the Netflix data with infor-mation about the movie (such as actors, director)actually decreased accuracy: “We should mentionthat not all data features were found to be useful.For example, we tried to benefit from an extensiveset of attributes describing each of the movies inthe dataset. Those attributes certainly carry a sig-nificant signal and can explain some of the user be-havior. However, we concluded that they could nothelp at all for improving the accuracy of well tunedcollaborative filtering models.” In terms of choice of

Page 15: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 15

methods, their solution was an ensemble of meth-ods that included nearest-neighbor algorithms, re-gression models, and shrinkage methods. In partic-ular, they found that “using increasingly complexmodels is only one way of improving accuracy. Anapparently easier way to achieve better accuracy isby blending multiple simpler models.” And indeed,more accurate predictions were achieved by collab-orations between competing teams who combinedpredictions from their individual models, such as thewinners’ combined team. All these choices and dis-coveries are very relevant for prediction, but not forcausal explanation. Although the Netflix contest isnot aimed at scientific advancement, there is clearlyscientific value in the predictive models developed.They tell us about the level of predictability of on-line user ratings of movies, and the implicated use-fulness of the rating scale employed by Netflix. Theresearch also highlights the importance of knowingwhich movies a user does not rate. And importantly,it sets the stage for explanatory research.Let us consider a hypothetical goal of explain-

ing movie preferences. After stating causal hypothe-ses, we would define constructs that link user be-havior and movie features X to user preference Y ,with a careful choice of F . An operationalizationstep would link the constructs to measurable data,and the role of each variable in the causality struc-ture would be defined. Even if using the Netflixdataset, supplemental covariates that capture moviefeatures and user characteristics would be absolutelynecessary. In other words, the data collected andthe variables included in the model would be differ-ent from the predictive context. As to methods andmodels, data compression methods such as SVD,heuristic-based predictive algorithms which learn f

from the data, and the combination of multiple mod-els would be considered inappropriate, as they lackinterpretability with respect to F and the hypothe-ses. The choice of f would be restricted to statisticalmodels that can be used for inference, and woulddirectly model issues such as the dependence be-tween records for the same customer and for thesame movie. Finally, the model would be validatedand evaluated in terms of its explanatory power, andused to conclude about the strength of the causal re-lationship between various user and movie charac-teristics and movie preferences. Hence, the explana-tory context leads to a completely different modelingpath and final result than the predictive context.

It is interesting to note that most competing teamshad a background in computer science rather thanstatistics. Yet, the winning team combines the twodisciplines. Statisticians who see the uniqueness andimportance of predictive modeling alongside explana-tory modeling have the capability of contributing toscientific advancement as well as achieving meaning-ful practical results (and large monetary awards).

3.2 Online Auction Research

The following example highlights the differencesbetween explanatory and predictive research in on-line auctions. The predictive approach also illus-trates the utility in creating new theory in an areadominated by explanatory modeling.Online auctions have become a major player in

providing electronic commerce services. eBay (www.eBay.com), the largest consumer-to-consumer auc-tion website, enables a global community of buy-ers and sellers to easily interact and trade. Empir-ical research of online auctions has grown dramat-ically in recent years. Studies using publicly avail-able bid data from websites such as eBay have foundmany divergences of bidding behavior and auctionoutcomes compared to ordinary offline auctions andclassical auction theory. For instance, according toclassical auction theory (e.g., Krishna, 2002), thefinal price of an auction is determined by a prioriinformation about the number of bidders, their val-uation, and the auction format. However, final pricedetermination in online auctions is quite different.Online auctions differ from offline auctions in vari-ous ways such as longer duration, anonymity of bid-ders and sellers, and low barriers of entry. These andother factors lead to new bidding behaviors that arenot explained by auction theory. Another importantdifference is that the total number of bidders in mostonline auctions is unknown until the auction closes.Empirical research in online auctions has concen-

trated in the fields of economics, information sys-tems and marketing. Explanatory modeling has beenemployed to learn about different aspects of bidderbehavior in auctions. A survey of empirical explana-tory research on auctions was given by Bajari andHortacsu (2004). A typical explanatory study relieson game theory to construct F , which can be donein different ways. One approach is to construct a“structural model,” which is a mathematical modellinking the various constructs. The major constructis “bidder valuation,” which is the amount a bidder

Page 16: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

16 G. SHMUELI

is willing to pay, and is typically operationalized us-ing his observed placed bids. The structural modeland operationalized constructs are then translatedinto a regression-type model [see, e.g., Sections 5and 6 in Bajari and Hortacsu (2003)]. To illustratethe use of a statistical model in explanatory auc-tion research, consider the study by Lucking-Reileyet al. (2007) who used a dataset of 461 eBay coinauctions to determine the factors affecting the finalauction price. They estimated a set of linear regres-sion models where Y = log(Price) and X includedauction characteristics (the opening bid, the auc-tion duration, and whether a secret reserve pricewas used), seller characteristics (the number of pos-itive and negative ratings), and a control variable(book value of the coin). One of their four reportedmodels was of the form

log(Price) = β0 + β1 log(BookValue)

+ β2 log(MinBid) + β3Reserve

+ β4NumDays + β5PosRating

+ β6NegRating + ε.

The other three models, or “model specifications,”included a modified set of predictors, with some in-teraction terms and an alternate auction durationmeasurement. The authors used a censored-Normalregression for model estimation, because some auc-tions did not receive any bids and therefore the pricewas truncated at the minimum bid. Typical explana-tory aspects of the modeling are:

Choice of variables: Several issues arise from thecausal-theoretical context. First is the exclusionof the number of bidders (or bids) as a determi-nant due to endogeneity considerations, where al-though it is likely to affect the final price, “it isendogenously determined by the bidders’ choices.”To verify endogeneity the authors report fittinga separate regression of Y = Number of bids onall the determinants. Second, the authors discussoperationalization challenges that might result inbias due to omitted variables. In particular, theauthors discuss the construct of “auction attrac-tiveness” (X ) and their inability to judge mea-sures such as photos and verbal descriptions tooperationalize attractiveness.

Model validation: The four model specifications areused for testing the robustness of the hypothesizedeffect of the construct “auction length” across dif-ferent operationalized variables such as the contin-uous number of days and a categorical alternative.

Model evaluation: For each model, its in-sample R2

is used for determining explanatory power.Model selection: The authors report the four fittedregression models, including both significant andinsignificant coefficients. Retaining the insignifi-cant covariates in the model is for matching f

with F .Model use and reporting: The main focus is on in-ference for the β’s, and the final conclusions aregiven in causal terms. (“A seller’s feedback rat-ings. . . have a measurable effect on her auctionprices. . . when a seller chooses to have her auctionlast for a longer period of days [sic], this signifi-cantly increases the auction price on average.”)

Although online auction research is dominated byexplanatory studies, there have been a few predic-tive studies developing forecasting models for anauction’s final price (e.g., Jank, Shmueli and Wang,2008; Jap and Naik, 2008; Ghani and Simmons, 2004;Wang, Jank and Shmueli, 2008; Zhang, Jank andShmueli, 2010). For a brief survey of online auc-tion forecasting research see the work of Jank andShmueli (2010, Chapter 5). From my involvement inseveral of these predictive studies, let me highlightthe purely predictive aspects that appear in this lit-erature:

Choice of variables: If prediction takes place beforeor at the start of the auction, then obviously thetotal number of bids or bidders cannot be includedas a predictor. While this variable was also omit-ted in the explanatory study, the omission was dueto a different reason, that is, endogeneity. How-ever, if prediction takes place at time t during anongoing auction, then the number of bidders/bidspresent at time t is available and useful for predict-ing the final price. Even more useful is the timeseries of the number of bidders from the start ofthe auction until time t as well as the price curveuntil time t (Bapna, Jank and Shmueli, 2008).

Choice of methods: Predictive studies in online auc-tions tend to learn f from the data, using flexi-ble models and algorithmic methods (e.g., CART,k-nearest neighbors, neural networks, functionalmethods and related nonparametric smoothing-based methods, Kalman filters and boosting (see,e.g., Chapter 5 in Jank and Shmueli, 2010). Manyof these are not interpretable, yet have proven toprovide high predictive accuracy.

Page 17: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 17

Model evaluation: Auction forecasting studies eval-uate predictive power on holdout data. They re-port performance in terms of out-of-sample met-rics such as MAPE and RMSE, and are comparedagainst other predictive models and benchmarks.

Predictive models for auction price cannot providedirect causal explanations. However, by producinghigh-accuracy price predictions they shed light onnew potential variables that are related to price andon the types of relationships that can be furtherinvestigated in terms of causality. For instance, aconstruct that is not directly measurable but thatsome predictive models are apparently capturing iscompetition between bidders.

4. IMPLICATIONS, CONCLUSIONS AND

SUGGESTIONS

4.1 The Cost of Indiscrimination to

Scientific Research

Currently, in many fields, statistical modeling isused nearly exclusively for causal explanation. Theconsequence of neglecting to include predictive mod-eling and testing alongside explanatory modeling islosing the ability to test the relevance of existingtheories and to discover new causal mechanisms.Feelders (2002) commented on the field of economics:“The pure hypothesis testing framework of economicdata analysis should be put aside to give more scopeto learning from the data. This closes the empiricalcycle from observation to theory to the testing oftheories on new data.” The current accelerated rateof social, environmental, and technological changescreates a burning need for new theories and for theexamination of old theories in light of the new real-ities.A common practice due to the indiscrimination

of explanation and prediction is to erroneously in-fer predictive power from explanatory power, whichcan lead to incorrect scientific and practical conclu-sions. Colleagues from various fields confirmed thisfact, and a cursory search of their scientific litera-ture brings up many examples. For instance, in ecol-ogy an article intending to predict forest beetle as-semblages infers predictive power from explanatorypower [“To study. . . predictive power, . . . we calcu-lated the R2”; “We expect predictabilities with R2

of up to 0.6” (Muller and Brandl, 2009)]. In eco-nomics, an article entitled “The predictive powerof zero intelligence in financial markets” (Farmer,

Patelli and Zovko, 2005) infers predictive power froma high R2 value of a linear regression model. In epi-demiology, many studies rely on in-sample hazardratios estimated from Cox regression models to inferpredictive power, reflecting an indiscrimination be-tween description and prediction. For instance, Nabiet al. (2010) used hazard ratio estimates and statis-tical significance “to compare the predictive powerof depression for coronary heart disease with that ofcerebrovascular disease.” In information systems, anarticle on “Understanding and predicting electroniccommerce adoption” (Pavlou and Fygenson, 2006)incorrectly compared the predictive power of differ-ent models using in-sample measures (“To examinethe predictive power of the proposed model, we com-pare it to four models in terms of R2 adjusted”).These examples are not singular, but rather theyreflect the common misunderstanding of predictivepower in these and other fields.Finally, a consequence of omitting predictive mod-

eling from scientific research is also a gap betweenresearch and practice. In an age where empirical re-search has become feasible in many fields, the op-portunity to bridge the gap between methodologicaldevelopment and practical application can be easierto achieve through the combination of explanatoryand predictive modeling.Finance is an example where practice is concerned

with prediction whereas academic research is focusedon explaining. In particular, there has been a re-liance on a limited number of models that are con-sidered pillars of research, yet have proven to per-form very poorly in practice. For instance, the CAPMmodel and more recently the Fama–French modelare regression models that have been used for ex-plaining market behavior for the purpose of portfoliomanagement, and have been evaluated in terms ofexplanatory power (in-sample R2 and residual anal-ysis) and not predictive accuracy.9 More recently,researchers have begun recognizing the distinctionbetween in-sample explanatory power and out-of-sample predictive power (Goyal and Welch, 2007),which has led to a discussion of predictability magni-tude and a search for predictively accurate explana-tory variables (Campbell and Thompson, 2005). Interms of predictive modeling, the Chief Actuary of

9Although in their paper Fama and French (1993) did splitthe sample into two parts, they did so for purposes of testingthe sensitivity of model estimates rather than for assessingpredictive accuracy.

Page 18: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

18 G. SHMUELI

the Financial Supervisory Authority of Sweden com-mented in 1999: “there is a need for models withpredictive power for at least a very near future. . .Given sufficient and relevant data this is an area forstatistical analysis, including cluster analysis andvarious kind of structure-finding methods” (Palm-gren, 1999). While there has been some predictivemodeling using genetic algorithms (Chen, 2002) andneural networks (Chakraborty and Sharma, 2007),it has been performed by practitioners and nonfi-nance academic researchers and outside of the topacademic journals.In summary, the omission of predictive modeling

for theory development results not only in academicwork becoming irrelevant to practice, but also increating a barrier to achieving significant scientificprogress, which is especially unfortunate as data be-come easier to collect, store and access.In the opposite direction, in fields that focus on

predictive modeling, the reason for omitting explana-tory modeling must be sought. A scientific field isusually defined by a cohesive body of theoreticalknowledge, which can be tested. Hence, some formof testing, whether empirical or not, must be a com-ponent of the field. In areas such as bioinformat-ics, where there is little theory and an abundanceof data, predictive models are pivotal in generatingavenues for causal theory.

4.2 Explanatory and Predictive Power:

Two Dimensions

I have polarized explaining and predicting in thisarticle in an effort to highlight their fundamentaldifferences. However, rather than considering themas extremes on some continuum, I consider themas two dimensions.10,11 Explanatory power and pre-dictive accuracy are different qualities; a model willpossess some level of each.A related controversial question arises: must an

explanatory model have some level of predictive powerto be considered scientifically useful? And equally,must a predictive model have sufficient explanatorypower to be scientifically useful? For instance, someexplanatory models that cannot be tested for pre-dictive accuracy yet constitute scientific advancesare Darwinian evolution theory and string theory

10Similarly, descriptive models can be considered as a thirddimension, where yet different criteria are used for assessingthe strength of the descriptive model.

11I thank Bill Langford for the two-dimensional insight.

in physics. The latter produces currently untestablepredictions (Woit, 2006, pages x–xii). Conversely,there exist predictive models that do not properly“explain” yet are scientifically valuable. Galileo, inhis book Two New Sciences, proposed a demonstra-tion to determine whether light was instantaneous.According to Mackay and Oldford (2000), Descartesgave the book a scathing review:

The substantive criticisms are generallydirected at Galileo’s not having identifiedthe causes of the phenomena he investi-gated. For most scientists at this time,and particularly for Descartes, that is thewhole point of science.

Similarly, consider predictive models that are basedon a wrong explanation yet scientifically and prac-tically they are considered valuable. One well-knownexample is Ptolemaic astronomy, which until recentlywas used for nautical navigation but is based on atheory proven to be wrong long ago. While such ex-amples are extreme, in most cases models are likelyto possess some level of both explanatory and pre-dictive power.Considering predictive accuracy and explanatory

power as two axes on a two-dimensional plot wouldplace different models (f ), aimed either at expla-nation or at prediction, on different areas of theplot. The bi-dimensional approach implies that: (1)In terms of modeling, the goal of a scientific studymust be specified a priori in order to optimize thecriterion of interest; and (2) In terms of model eval-uation and scientific reporting, researchers shouldreport both the explanatory and predictive qualitiesof their models. Even if prediction is not the goal,the predictive qualities of a model should be re-ported alongside its explanatory power so that itcan be fairly evaluated in terms of its capabilitiesand compared to other models. Similarly, a predic-tive model might not require causal explanation inorder to be scientifically useful; however, reportingits relation to causal theory is important for pur-poses of theory building. The availability of infor-mation on a variety of predictive and explanatorymodels along these two axes can shed light on bothpredictive and causal aspects of scientific phenom-ena. The statistical modeling process, as depictedin Figure 2, should include “overall model perfor-mance” in terms of both predictive and explanatoryqualities.

Page 19: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 19

4.3 The Cost of Indiscrimination to the

Field of Statistics

Dissolving the ambiguity surrounding explanatoryversus predictive modeling is important for advanc-ing our field itself. Recognizing that statisticalmethodology has focused mainly on inference indi-cates an important gap to be filled. While our lit-erature contains predictive methodology for modelselection and predictive inference, there is scarce sta-tistical predictive methodology for other modelingsteps, such as study design, data collection, datapreparation and EDA, which present opportunitiesfor new research. Currently, the predictive void hasbeen taken up the field of machine learning and datamining. In fact, the differences, and some would sayrivalry, between the fields of statistics and data min-ing can be attributed to their different goals of ex-plaining versus predicting even more than to factorssuch as data size. While statistical theory has fo-cused on model estimation, inference, and fit, ma-chine learning and data mining have concentratedon developing computationally efficient predictivealgorithms and tackling the bias–variance trade-offin order to achieve high predictive accuracy.Sharpening the distinction between explanatory

and predictive modeling can raise a new awarenessof the strengths and limitations of existing meth-ods and practices, and might shed light on currentcontroversies within our field. One example is thedisagreement in survey methodology regarding theuse of sampling weights in the analysis of surveydata (Little, 2007). Whereas some researchers advo-cate using weights to reduce bias at the expense ofincreased variance, and others disagree, might notthe answer be related to the final goal?Another ambiguity that can benefit from an ex-

planatory/predictive distinction is the definition ofparsimony. Some claim that predictive models shouldbe simpler than explanatory models: “Simplicity isrelevant because complex families often do a bad jobof predicting new data, though they can be madeto fit the old data quite well” (Sober, 2002). Thesame argument was given by Hastie, Tibshirani andFriedman (2009): “Typically the more complex wemake the model, the lower the bias but the higherthe variance.” In contrast, some predictive modelsin practice are very complex,12 and indeed Breiman

12I thank Foster Provost from NYU for this observation.

(2001b) commented: “in some cases predictive mod-els are more complex in order to capture small nu-ances that improve predictive accuracy.” Zellner(2001) used the term “sophisticatedly simple” to de-fine the quality of a “good” model. I would suggestthat the definitions of parsimony and complexity aretask-dependent: predictive or explanatory. For ex-ample, an “overly complicated” model in explana-tory terms might prove “sophisticatedly simple” forpredictive purposes.

4.4 Closing Remarks and Suggestions

The consequences from the explanatory/predictivedistinction lead to two proposed actions:

1. It is our responsibility to be aware of how statisti-cal models are used in research outside of statis-tics, why they are used in that fashion, and inresponse to develop methods that support soundscientific research. Such knowledge can be gainedwithin our field by inviting scientists from differ-ent disciplines to give talks at statistics confer-ences and seminars, and to require graduate stu-dents in statistics to read and present researchpapers from other disciplines.

2. As a discipline, we must acknowledge the differ-ence between explanatory, predictive and descrip-tive modeling, and integrate it into statistics ed-ucation of statisticians and nonstatisticians, asearly as possible but most importantly in “re-search methods” courses. This requires creatingwritten materials that are easily accessible andunderstandable by nonstatisticians. We shouldadvocate both explanatory and predictive mod-eling, clarify their differences and distinctive sci-entific and practical uses, and disseminate toolsand knowledge for implementing both. One par-ticular aspect to consider is advocating a morecareful use of terms such as “predictors,” “pre-dictions” and “predictive power,” to reduce theeffects of terminology on incorrect scientific con-clusions.

Awareness of the distinction between explanatoryand predictive modeling, and of the different scien-tific functions that each serve, is essential for theprogress of scientific knowledge.

APPENDIX: IS THE “TRUE” MODEL THE

BEST PREDICTIVE MODEL? A LINEAR

REGRESSION EXAMPLE

Consider F to be the true function relating con-structs X and Y and let us assume that f is a valid

Page 20: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

20 G. SHMUELI

operationalization of F . Choosing an intentionallybiased function f∗ in place of f is clearly undesir-able from a theoretical–explanatory point of view.However, we will show that f∗ can be preferable tof from a predictive standpoint.To illustrate this, consider the statistical model

f(x) = β1x1 + β2x2 + ε which is assumed to be cor-rectly specified with respect to F . Using data, weobtain the estimated model f , which has the prop-erties

Bias = 0,(2)

Var(f(x)) = Var(x1β1 + x2β2)(3)

= σ2x′(X ′X)−1x,

where x is the vector x= [x1, x2]′, and X is the de-

sign matrix based on both predictors. Combiningthe squared bias with the variance gives

EPE =E(Y − f(x))2

= σ2 +0+ σ2x′(X ′X)−1x(4)

= σ2(1 + x′(X ′X)−1x).

In comparison, consider the estimated underspec-ified form f∗(x) = γ1x1. The bias and variance hereare given by Montgomery, Peck and Vining (2001,pages 292–296):

Bias = x1γ1 − (x1β1 + x2β2)

= x1(x′

1x1)−1x′1(x1β1 + x2β2)

− (x1β1 + x2β2),

Var(f∗(x)) = x1Var(γ1)x1 = σ2x1(x′

1x1)−1x1.

Combining the squared bias with the variance gives

EPE = (x1(x′

1x1)−1x′1x2β2 − x2β2)

2

(5)+ σ2(1 + x1(x

1x1)−1x′1).

Although the bias of the underspecified model f∗(x)is larger than that of f(x), its variance can be smaller,and in some cases so small that the overall EPE willbe lower for the underspecified model. Wu, Harrisand McAuley (2007) showed the general result for anunderspecified linear regression model with multiplepredictors. In particular, they showed that the un-derspecified model that leaves out q predictors hasa lower EPE when the following inequality holds:

qσ2 > β′

2X′

2(I −H1)X2β2.(6)

This means that the underspecified model producesmore accurate predictions, in terms of lower EPE,in the following situations:

• when the data are very noisy (large σ);• when the true absolute values of the left-out pa-

rameters (in our example β2) are small;• when the predictors are highly correlated; and• when the sample size is small or the range of left-

out variables is small.

The bottom line is nicely summarized by Hagertyand Srinivasan (1991): “We note that the practicein applied research of concluding that a model witha higher predictive validity is “truer,” is not a validinference. This paper shows that a parsimonious butless true model can have a higher predictive validitythan a truer but less parsimonious model.”

ACKNOWLEDGMENTS

I thank two anonymous reviewers, the associateeditor, and editor for their suggestions and com-ments which improved this manuscript. I expressmy gratitude to many colleagues for invaluable feed-back and fruitful discussion that have helped medevelop the explanatory/predictive argument pre-sented in this article. I am grateful to Otto Koppius(Erasmus) and Ravi Bapna (U Minnesota) for fa-miliarizing me with explanatory modeling in Infor-mation Systems, for collaboratively pursuing pre-diction in this field, and for tireless discussion ofthis work. I thank Ayala Cohen (Technion), RalphSnyder (Monash), Rob Hyndman (Monash) and BillLangford (RMIT) for detailed feedback on earlierdrafts of this article. Special thanks to Boaz Shmueliand Raquelle Azran for their meticulous reading anddiscussions of the manuscript. And special thanksfor invaluable comments and suggestions go to Mur-ray Aitkin (U Melbourne), Yoav Benjamini (TelAviv U), Smarajit Bose (ISI), Saibal Chattopad-hyay (IIMC), Ram Chellapah (Emory), Etti Doveh(Technion), Paul Feigin (Technion), Paulo Goes(U Arizona), Avi Goldfarb (Toronto U), Norma Hubele(ASU), Ron Kenett (KPA Inc.), Paul Lajbcygier(Monash), Thomas Lumley (U Washington), DavidMadigan (Columbia U), Isaac Meilejson (Tel AvivU), Douglas Montgomery (ASU), Amita Pal (ISI),Don Poskitt (Monash), Foster Provost (NYU), Sa-haron Rosset (Tel Aviv U), Jeffrey Simonoff (NYU)and David Steinberg (Tel Aviv U).

REFERENCES

Afshartous, D. and de Leeuw, J. (2005). Prediction inmultilevel models. J. Educ. Behav. Statist. 30 109–139.

Page 21: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 21

Aitchison, J. and Dunsmore, I. R. (1975). Statistical Pre-diction Analysis. Cambridge Univ. Press. MR0408097

Bajari, P. and Hortacsu, A. (2003). The winner’s curse, re-serve prices and endogenous entry: Empirical insights fromebay auctions. Rand J. Econ. 3 329–355.

Bajari, P. and Hortacsu, A. (2004). Economic insightsfrom internet auctions. J. Econ. Liter. 42 457–486.

Bapna, R., Jank, W. and Shmueli, G. (2008). Price forma-tion and its dynamics in online auctions. Decision SupportSystems 44 641–656.

Bell, R. M., Koren, Y. and Volinsky, C. (2008). TheBellKor 2008 solution to the Netflix Prize.

Bell, R. M., Koren, Y. and Volinsky, C. (2010). All to-gether now: A perspective on the netflix prize. Chance 23

24.Berk, R. A. (2008). Statistical Learning from a Regression

Perspective. Springer, New York.Bjornstad, J. F. (1990). Predictive likelihood: A review.

Statist. Sci. 5 242–265. MR1062578Bohlmann, P. and Hothorn, T. (2007). Boosting al-

gorithms: Regularization, prediction and model fitting.Statist. Sci. 22 477–505. MR2420454

Breiman, L. (1996). Bagging predictors. Mach. Learn. 24

123–140. MR1425957Breiman, L. (2001a). Random forests. Mach. Learn. 45 5–32.Breiman, L. (2001b). Statistical modeling: The two cultures.

Statist. Sci. 16 199–215. MR1874152Brown, P. J., Vannucci, M. and Fearn, T. (2002). Bayes

model averaging with selection of regressors. J. R. Stat.Soc. Ser. B Stat. Methodol. 64 519–536. MR1924304

Campbell, J. Y. and Thompson, S. B. (2005). Predictingexcess stock returns out of sample: Can anything beat thehistorical average? Harvard Institute of Economic ResearchWorking Paper 2084.

Carte, T. A. and Craig, J. R. (2003). In pursuit of moder-ation: Nine common errors and their solutions. MIS Quart.27 479–501.

Chakraborty, S. and Sharma, S. K. (2007). Prediction ofcorporate financial health by artificial neural network. Int.J. Electron. Fin. 1 442–459.

Chen, S.-H., Ed. (2002). Genetic Algorithms and GeneticProgramming in Computational Finance. Kluwer, Dor-drecht.

Collopy, F., Adya, M. and Armstrong, J. (1994).Principles for examining predictive–validity—the case ofinformation-systems spending forecasts. Inform. Syst. Res.5 170–179.

Dalkey, N. and Helmer, O. (1963). An experimental appli-cation of the delphi method to the use of experts. Manag.Sci. 9 458–467.

Dawid, A. P. (1984). Present position and potential devel-opments: Some personal views: Statistical theory: The pre-quential approach. J. Roy. Statist. Soc. Ser. A 147 278–292.MR0763811

Ding, Y. and Simonoff, J. (2010). An investigation of miss-ing data methods for classification trees applied to binaryresponse data. J. Mach. Learn. Res. 11 131–170.

Domingos, P. (2000). A unified bias–variance decompositionfor zero–one and squared loss. In Proceedings of the Seven-

teenth National Conference on Artificial Intelligence 564–

569. AAAI Press, Austin, TX.

Dowe, D. L., Gardner, S. and Oppy, G. R. (2007). Bayes

not bust! Why simplicity is no problem for Bayesians. Br.

J. Philos. Sci. 58 709–754. MR2375767

Dubin, R. (1969). Theory Building. The Free Press, New

York.

Edwards, J. R. and Bagozzi, R. P. (2000). On the nature

and direction of relationships between constructs. Psycho-

logical Methods 5 2 155–174.

Ehrenberg, A. and Bound, J. (1993). Predictability and

prediction. J. Roy. Statist. Soc. Ser. A 156 167–206.

Fama, E. F. and French, K. R. (1993). Common risk factors

in stock and bond returns. J. Fin. Econ. 33 3–56.

Farmer, J. D., Patelli, P. and Zovko, I. I. A. A. (2005).

The predictive power of zero intelligence in financial mar-

kets. Proc. Natl. Acad. Sci. USA 102 2254–2259.

Fayyad, U. M., Grinstein, G. G. and Wierse, A. (2002).

Information Visualization in Data Mining and Knowledge

Discovery. Morgan Kaufmann, San Francisco, CA.

Feelders, A. (2002). Data mining in economic science. In

Dealing with the Data Flood 166–175. STT/Beweton, Den

Haag, The Netherlands.

Findley, D. Y. and Parzen, E. (1998). A conversation with

Hirotsugo Akaike. In Selected Papers of Hirotugu Akaike

3–16. Springer, New York. MR1486823

Forster, M. (2002). Predictive accuracy as an achievable

goal of science. Philos. Sci. 69 S124–S134.

Forster, M. and Sober, E. (1994). How to tell when sim-

pler, more unified, or less ad-hoc theories will provide

more accurate predictions. Br. J. Philos. Sci. 45 1–35.

MR1277464

Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the

curse-of-dimensionality. Data Mining and Knowledge Dis-

covery 1 55–77.

Gefen, D., Karahanna, E. and Straub, D. (2003). Trust

and TAM in online shopping: An integrated model. MIS

Quart. 27 51–90.

Geisser, S. (1975). The predictive sample reuse method with

applications. J. Amer. Statist. Assoc. 70 320–328.

Geisser, S. (1993). Predictive Inference: An Introduction.

Chapman and Hall, London. MR1252174

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D.

B. (2003). Bayesian Data Analysis, 2nd ed. Chapman &

Hall/CRC New York/Boca Raton, FL. MR1385925

Ghani, R. and Simmons, H. (2004). Predicting the end-price

of online auctions. In International Workshop on Data

Mining and Adaptive Modelling Methods for Economics

and Management, Pisa, Italy.

Goyal, A. and Welch, I. (2007). A comprehensive look at

the empirical performance of equity premium prediction.

Rev. Fin. Stud. 21 1455–1508.

Granger, C. (1969). Investigating causal relations by econo-

metric models and cross-spectral methods. Econometrica

37 424–438.

Greenberg, E. and Parks, R. P. (1997). A predictive ap-

proach to model selection and multicollinearity. J. Appl.

Econom. 12 67–75.

Page 22: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

22 G. SHMUELI

Gurbaxani, V. and Mendelson, H. (1990). An integrativemodel of information systems spending growth. Inform.Syst. Res. 1 23–46.

Gurbaxani, V. and Mendelson, H. (1994). Modeling vs.forecasting—the case of information-systems spending. In-form. Syst. Res. 5 180–190.

Hagerty, M. R. and Srinivasan, S. (1991). Comparing thepredictive powers of alternative multiple regression models.Psychometrika 56 77–85. MR1115296

Hastie, T., Tibshirani, R. and Friedman, J. H. (2009).The Elements of Statistical Learning: Data Mining, In-ference, and Prediction, 2nd ed. Springer, New York.MR1851606

Hausman, J. A. (1978). Specification tests in econometrics.Econometrica 46 1251–1271. MR0513692

Helmer, O. and Rescher, N. (1959). On the epistemologyof the inexact sciences. Manag. Sci. 5 25–52.

Hempel, C. and Oppenheim, P. (1948). Studies in the logicof explanation. Philos. Sci. 15 135–175.

Hitchcock, C. and Sober, E. (2004). Prediction versus ac-commodation and the risk of overfitting. Br. J. Philos. Sci.55 1–34.

Jaccard, J. (2001). Interaction Effects in Logistic Regres-sion. SAGE Publications, Thousand Oaks, CA.

Jank, W. and Shmueli, G. (2010). Modeling Online Auc-tions. Wiley, New York.

Jank, W., Shmueli, G. and Wang, S. (2008). Modelingprice dynamics in online auctions via regression trees. InStatistical Methods in eCommerce Research. Wiley, NewYork. MR2414052

Jap, S. and Naik, P. (2008). Bidanalyzer: A method for esti-mation and selection of dynamic bidding models. Market-ing Sci. 27 949–960.

Johnson, W. and Geisser, S. (1983). A predictive view ofthe detection and characterization of influential observa-tions in regression analysis. J. Amer. Statist. Assoc. 78

137–144. MR0696858Kadane, J. B. and Lazar, N. A. (2004). Methods and crite-

ria for model selection. J. Amer. Statist. Soc. 99 279–290.MR2061890

Kendall, M. and Stuart, A. (1977). The Advanced Theoryof Statistics 1, 4th ed. Griffin, London.

Konishi, S. and Kitagawa, G. (2007). Information Criteriaand Statistical Modeling. Springer, New York. MR2367855

Krishna, V. (2002). Auction Theory. Academic Press, SanDiego, CA.

Little, R. J. A. (2007). Should we use the survey weightsto weight? JPSM Distinguished Lecture, Univ. Maryland.

Little, R. J. A. and Rubin, D. B. (2002). Statistical Anal-ysis with Missing Data. Wiley, New York. MR1925014

Lucking-Reiley, D., Bryan, D., Prasad, N. andReeves, D. (2007). Pennies from ebay: The determinantsof price in online auctions. J. Indust. Econ. 55 223–233.

Mackay, R. J. and Oldford, R. W. (2000). Scientificmethod, statistical method, and the speed of light. Work-ing Paper 2000-02, Dept. Statistics and Actuarial Science,Univ. Waterloo. MR1847825

Makridakis, S. G., Wheelwright, S. C. and Hyndman,

R. J. (1998). Forecasting: Methods and Applications, 3rded. Wiley, New York.

Montgomery, D., Peck, E. A. and Vining, G. G. (2001).

Introduction to Linear Regression Analysis. Wiley, New

York. MR1820113

Mosteller, F. and Tukey, J. W. (1977). Data Analysis

and Regression. Addison-Wesley, Reading, MA.

Muller, J. and Brandl, R. (2009). Assessing biodiversity

by remote sensing in mountainous terrain: The potential

of lidar to predict forest beetle assemblages. J. Appl. Ecol.

46 897–905.

Nabi, J., Kivimaki, M., Suominen, S., Koskenvuo, M.

and Vahtera, J. (2010). Does depression predict coronary

heart diseaseand cerebrovascular disease equally well? The

health and social support prospective cohort study. Int. J.

Epidemiol. 39 1016–1024.

Palmgren, B. (1999). The need for financial models. ERCIM

News 38 8–9.

Parzen, E. (2001). Comment on statistical modeling: The

two cultures. Statist. Sci. 16 224–226. MR1874152

Patzer, G. L. (1995). Using Secondary Data in Marketing

Research: United States and Worldwide. Greenwood Pub-

lishing, Westport, CT.

Pavlou, P. and Fygenson, M. (2006). Understanding and

predicting electronic commerce adoption: An extension of

the theory of planned behavior. Mis Quart. 30 115–143.

Pearl, J. (1995). Causal diagrams for empirical research.

Biometrika 82 669–709. MR1380809

Rosenbaum, P. and Rubin, D. B. (1983). The central role

of the propensity score in observational studies for causal

effects. Biometrika 70 41–55. MR0742974

Rubin, D. B. (1997). Estimating causal effects from large

data sets using propensity scores. Ann. Intern. Med. 127

757–763.

Saar-Tsechansky, M. and Provost, F. (2007). Handling

missing features when applying classification models. J.

Mach. Learn. Res. 8 1625–1657.

Sarle, W. S. (1998). Prediction with missing inputs. In JCIS

98 Proceedings (P. Wang, ed.) II 399–402. Research Trian-

gle Park, Durham, NC.

Seni, G. and Elder, J. F. (2010). Ensemble Methods in Data

Mining: Improving Accuracy Through Combining Predic-

tions (Synthesis Lectures on Data Mining and Knowledge

Discovery). Morgan and Claypool, San Rafael, CA.

Shafer, G. (1996). The Art of Causal Conjecture. MIT Press,

Cambridge, MA.

Schapire, R. E. (1999). A brief introduction to boosting. In

Proceedings of the Sixth International Joint Conference on

Artificial Intelligence 1401–1406. Stockholm, Sweden.

Shmueli, G. andKoppius, O. R. (2010). Predictive analytics

in information systems research. MIS Quart. To appear.

Simon, H. A. (2001). Science seeks parsimony, not simplic-

ity: Searching for pattern in phenomena. In Simplicity, In-

ference and Modelling: Keeping it Sophisticatedly Simple

32–72. Cambridge Univ. Press. MR1932928

Sober, E. (2002). Instrumentalism, parsimony, and the

Akaike framework. Philos. Sci. 69 S112–S123.

Song, H. and Witt, S. F. (2000). Tourism Demand Mod-

elling and Forecasting: Modern Econometric Approaches.

Pergamon Press, Oxford.

Page 23: To Explain or to Predict? - arXiv · To Explain or to Predict? Galit Shmueli Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal

TO EXPLAIN OR TO PREDICT? 23

Spirtes, P., Glymour, C. and Scheines, R. (2000). Cau-sation, Prediction, and Search, 2nd ed. MIT Press, Cam-bridge, MA. MR1815675

Stone, M. (1974). Cross-validatory choice and assesment ofstatistical predictions (with discussion). J. Roy. Statist.Soc. Ser. B 39 111–147. MR0356377

Taleb, N. (2007). The Black Swan. Penguin Books, London.Van Maanen, J., Sorensen, J. and Mitchell, T. (2007).

The interplay between theory and method. Acad. Manag.Rev. 32 1145–1154.

Vaughan, T. S. and Berry, K. E. (2005). Using MonteCarlo techniques to demonstrate the meaning and implica-tions of multicollinearity. J. Statist. Educ. 13 online.

Wallis, W. A. (1980). The statistical research group, 1942–1945. J. Amer. Statist. Assoc. 75 320–330. MR0577363

Wang, S., Jank, W. and Shmueli, G. (2008). Explainingand forecasting online auction prices and their dynamicsusing functional data analysis. J. Business Econ. Statist.26 144–160. MR2420144

Winkelmann, R. (2008). Econometric Analysis of CountData, 5th ed. Springer, New York. MR2148271

Woit, P. (2006). Not Even Wrong: The Failure of StringTheory and the Search for Unity in Physical Law. JonathanCope, London. MR2245858

Wu, S., Harris, T. and McAuley, K. (2007). The use ofsimplified or misspecified models: Linear case. Canad. J.Chem. Eng. 85 386–398.

Zellner, A. (1962). An efficient method of estimating seem-ingly unrelated regression equations and tests for aggrega-tion bias. J. Amer. Statist. Assoc. 57 348–368. MR0139235

Zellner, A. (2001). Keep it sophisticatedly simple. In Sim-plicity, Inference and Modelling: Keeping It SophisticatedlySimple 242–261. Cambridge Univ. Press. MR1932939

Zhang, S., Jank, W. and Shmueli, G. (2010). Real-timeforecasting of online auctions via functional k-nearestneighbors. Int. J. Forecast. 26 666–683.