
Vol. 33, No. 2, March–April 2014, pp. 188–205
ISSN 0732-2399 (print), ISSN 1526-548X (online)
http://dx.doi.org/10.1287/mksc.2013.0825
© 2014 INFORMS

Model Selection Using Database Characteristics: Developing a Classification Tree for Longitudinal Incidence Data

Eric M. Schwartz
Stephen M. Ross School of Business, University of Michigan, Ann Arbor, Michigan 48109, [email protected]

Eric T. Bradlow, Peter S. Fader
The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
{[email protected], [email protected]}

When managers and researchers encounter a data set, they typically ask two key questions: (1) Which model (from a candidate set) should I use? And (2) if I use a particular model, when is it likely to work well for my business goal? This research addresses those two questions and provides a rule, i.e., a decision tree, for data analysts to portend the winning model before having to fit any of them for longitudinal incidence data. We characterize data sets based on managerially relevant (and easy-to-compute) summary statistics, and we use classification techniques from machine learning to provide a decision tree that recommends when to use which model. By doing the legwork of obtaining this decision tree for model selection, we provide a time-saving tool to analysts. We illustrate this method for a common marketing problem (i.e., forecasting repeat purchasing incidence for a cohort of new customers) and demonstrate the method's ability to discriminate among an integrated family of a hidden Markov model (HMM) and its constrained variants. We observe a strong ability for data set characteristics to guide the choice of the most appropriate model, and we observe that some model features (e.g., the back-and-forth migration between latent states) are more important to accommodate than are others (e.g., the inclusion of an "off" state with no activity). We also demonstrate the method's broad potential by providing a general recipe for researchers to replicate this kind of model classification task in other managerial contexts (outside of repeat purchasing incidence data and the HMM framework).

Keywords: model selection; machine learning; data science; business intelligence; hidden Markov models; classification tree; random forest; posterior predictive model checking; hierarchical Bayesian methods; forecasting

History: Received: August 2, 2011; accepted: September 4, 2013; Preyas Desai served as the editor-in-chief for this article. Published online in Articles in Advance January 17, 2014.

1. Introduction

The explosion in technology-enabled data collection has changed the focus of marketing modelers away from aggregated data at the store or market level toward more granular, panel-oriented data structures and associated statistical methodologies. Companies have reduced their reliance on rolled-up data provided by syndicated vendors (e.g., IRI, Nielsen) and now build more of their analytics around customer-level longitudinal patterns that they can obtain from their own internal operations. But while this increased reliance on "site-centric" data (Zheng et al. 2011) offers a number of meaningful benefits to the firm, it also comes with some potential costs to the researcher.

First, site-centric data provide a detailed description of each customer's stream of purchases (and other actions that the firm can measure directly), but such data often lack information about marketing variables, competitive tactics, and other potential drivers of the behavior(s) of interest (Donkers et al. 2007, Schweidel et al. 2008) that are typically provided by a third-party firm and are often difficult to link to purchase data. Thus, many firms are focusing their decision-making efforts around the flow of incidence activities, i.e., the timing and nature of each transaction, which is very rich but also quite different from the inputs used in more traditional marketing-mix models (Hanssens et al. 2005).

Second, this detailed stream of incidence actions can be characterized by an ever-larger swath of mathematical models. That is, increased granularity comes with the potential for increased model complexity and hence a more difficult model selection problem than faced by previous generations of researchers, who often relied on relatively standard model specifications (Cooper and Nakanishi 1988, Wittink et al. 1988) that were sufficient for the relatively standard data structures made available by a small set of third-party data providers.


connected to each other but still very flexible. The models we consider are the HMM and three different constrained variants of it (including the BG/BB model). Since they are part of an integrated family, they offer an opportunity to detect when each underlying model component (in this case, the presence of an absorbing state and/or the need for a no-purchase state) is worth including or turning off. This provides added insight to the analyst about the nature of customer dynamics, above and beyond simplified model implementation and improved model performance.

For this context (i.e., repeat-transaction incidence data, the HMM and its constrained variants), we do all of the legwork for the analyst. We run an array of constrained and unconstrained HMMs on dozens of synthetic data sets, generated to broadly represent the kinds of patterns that are likely to occur in real-world settings. Although this is computationally expensive initially (a high up-front cost for us as the researchers), it yields significant savings for the downstream user: the manager simply follows our advice, selects the most appropriate model given the nature of her data set, and runs it (the winner) rather than the entire class of models.

The focal managerial criterion we use to select among models is the forecast error for each cohort's purchases, so the winning model has the minimum mean absolute error in a holdout period. We use well-established machine-learning methods known as classification and regression trees (CART) and random forests to derive general rules to suggest which model to use under different circumstances, based entirely on observed (and managerially meaningful) patterns in the customer-base data. The database characteristics that turn out to be most important (in our setting) include the nature of the decline in cohort-level sales over time as well as purchase concentration (e.g., the 80:20 rule) across customers.

Since the development of the decision tree is our key contribution, the structure of the paper centers around it. There are three ingredients for the classification approach, and we devote a section of the paper to each one: the candidate models in §2, the database characteristics in §3, and the performance criterion to determine the winning model for each data set in §4. Putting these three ingredients together, we create the decision tree in §5 and focus on its interpretation, validation, and managerial implications.

Although we perform our analysis for a specific data/modeling context (albeit an important one in today's marketing environment), the same basic recipe developed here can be applied to many other settings. Thus, we formalize our approach as a more general methodology in Appendix B, using the same ingredients outlined above: a set of models, database characteristics, and a selection criterion (i.e., performance or error measure with loss function). We now begin the process of laying out these elements to build our decision tree.

2. Which Models to Consider? The HMM and Its Constrained Variants

The decision tree recommends which model to use for a given data set, but we have to provide a consideration set of models: the HMM and its constrained variants. Why do we consider this class? First, they are appropriate for this popular context of understanding and projecting repeat-purchase patterns of a cohort of customers using longitudinal incidence data (Liechty et al. 2003, Montgomery et al. 2004, Montoya et al. 2010, Netzer et al. 2008, Schweidel et al. 2011). Second, these models cover a wide range of underlying "stories" of customer behavior, leading to different observable data patterns. This helps us achieve the goal of the paper: establish the link between data set-level summaries and model performance.

Third, these models form an integrated family; that is, each model is a constrained or unconstrained version of another in the set. Some of these are established yet seemingly unrelated models, such as "buy till you die" and latent-class models, among others. But they are all special cases of the HMM. These connections have only been partially explored, and in an ad hoc manner, in the previous literature, as we discuss below (Netzer et al. 2008, Schweidel et al. 2011). However, the extra insight that we provide is that the four variants of the HMM that we consider are described by two model components, each with two levels (as seen in the 2 × 2 framework discussed in Table 1). So the decision tree not only recommends a specific model but also emphasizes the presence or absence of more general model components, thereby adding more insight and comparability across data sets.

The unconstrained HMM used here has two states, and the within-state purchase likelihoods are repeated Bernoulli trials by individuals who can begin the calibration period in state 1 or state 2. We allow for unobserved continuous heterogeneity for the within-state purchase propensities as well as the between-state transition probabilities. Formally stated, we let $y_{it} = 1$ for day $t$ if customer $i$ purchased and $y_{it} = 0$ otherwise.^1 Then,

$$y_{it} \sim \begin{cases} \text{Bernoulli}(p_{1i}) & \text{if in state 1 } (Z_{it} = 1), \\ \text{Bernoulli}(p_{2i}) & \text{if in state 2 } (Z_{it} = 2), \end{cases} \qquad (1)$$

^1 Without loss of generality, we use "day" to refer to the unit of discrete time and "purchase" as the observed behavior of interest. It could be, instead, for example, viewing online videos or not in a given week, donating or not in a given quarter, etc.


Table 1: Nested Model Relationships Among the HMM and Its Constrained Variants

Backward and forward transitions, state 2 is "cold" (HMM):
  $p_i = (p_{1i}, p_{2i})$ and $\Pi_i = \begin{pmatrix} 1-\theta_{12i} & \theta_{12i} \\ \theta_{21i} & 1-\theta_{21i} \end{pmatrix}$

Backward and forward transitions, state 2 is "off" (On and off):
  $p_i = (p_{1i}, 0)$ and $\Pi_i = \begin{pmatrix} 1-\theta_{12i} & \theta_{12i} \\ \theta_{21i} & 1-\theta_{21i} \end{pmatrix}$

Only forward transitions, state 2 is "cold" (Hot then cold):
  $p_i = (p_{1i}, p_{2i})$ and $\Pi_i = \begin{pmatrix} 1-\theta_{12i} & \theta_{12i} \\ 0 & 1 \end{pmatrix}$

Only forward transitions, state 2 is "off" (BG/BB):
  $p_i = (p_{1i}, 0)$ and $\Pi_i = \begin{pmatrix} 1-\theta_{12i} & \theta_{12i} \\ 0 & 1 \end{pmatrix}$

Notes. The rows and columns illustrate how constraints on two model components lead to different models. The $p_i$ and $\Pi_i$ are the individual-level within-state transaction probabilities and between-state transition probability matrix, respectively.

where the latent-state variable, $Z_{it}$, indicates which state a customer occupies on each day. The individual-level parameters of the HMM are the within-state propensities, $p_i$, and the transition probability matrix, $\Pi_i$. That is,

$$p_i = (p_{1i}, p_{2i}) \quad \text{and} \quad \Pi_i = \begin{pmatrix} 1-\theta_{12i} & \theta_{12i} \\ \theta_{21i} & 1-\theta_{21i} \end{pmatrix}. \qquad (2)$$

We let the initial state membership be a population-level parameter, and any individual can start in state 1 with probability $\pi_1$ or state 2 with probability $1 - \pi_1$. We assume independent beta distributions to allow for heterogeneity across individuals for the components of $p_i$ and $\Pi_i$.^2 Specifically, the prior distributions used are

$$p_{1i} \sim \text{beta}(\mu_{p_1}, \phi_{p_1}), \quad p_{2i} \sim \text{beta}(\mu_{p_2}, \phi_{p_2}), \quad \theta_{12i} \sim \text{beta}(\mu_{12}, \phi_{12}), \quad \theta_{21i} \sim \text{beta}(\mu_{21}, \phi_{21}), \qquad (3)$$

where $\mu = a/(a+b)$ is the mean and $\phi = 1/(a+b+1)$ is the polarization index of the beta distribution with shape parameters $a$ and $b$ (Sabavala and Morrison 1977). In general, for $S \geq 2$ states, each row $r$ of the transition probability matrix is a vector $(\theta_{r1,i}, \ldots, \theta_{rS,i}) \sim \text{Dirichlet}(\alpha_{r1}, \ldots, \alpha_{rS})$.^3 We distinguish between states by referring to state 1 as having a within-state propensity at least as large as that of state 2 for each individual (i.e., $p_{1i} \geq p_{2i}$ for all $i$). This prevents the label-switching problem known to exist with latent-state models (Stephens 2000).

We highlight the off-diagonal entries of the transition probability matrix, since $\theta_{12i}$ denotes the probability an individual moves forward (state 1 to 2) and $\theta_{21i}$ represents the probability an individual moves backward (state 2 to 1). Not allowing backward transitions is equivalent to making state 2 absorbing.

^2 The initial state membership probability is assumed to be a population-level parameter as a result of the definition of a cohort of customers acquired at the same time. Additionally, the results are robust to using a logit-normal for heterogeneity on all individual parameters and for allowing correlations among them.
^3 We use highly uninformative hyperpriors on the population-level parameters of the beta (or Dirichlet) distributions. For more details about the distributions used in the sampling procedure, see Appendix A.
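To make this generative story concrete, the following is a minimal sketch (our own illustration, not the authors' code) of simulating a cohort from Equations (1)-(3). The function names and the conversion from the (mean, polarization) parameterization to beta shape parameters are ours, and for brevity the sketch does not enforce the $p_{1i} \geq p_{2i}$ labeling restriction.

```python
import numpy as np

def beta_shapes(mu, phi):
    """Convert a beta mean mu = a/(a+b) and polarization phi = 1/(a+b+1)
    (Sabavala and Morrison 1977) into shape parameters (a, b)."""
    s = 1.0 / phi - 1.0                          # s = a + b
    return mu * s, (1.0 - mu) * s

def simulate_cohort(N, T, mu, phi, pi1, rng):
    """Simulate an N x T binary incidence array y[i, t] from the two-state
    HMM. mu and phi are dicts keyed by 'p1', 'p2', 'th12', 'th21'; setting
    mu['p2'] = 0 shuts off state-2 purchasing, and mu['th21'] = 0 makes
    state 2 absorbing, which yields the constrained variants of Table 1."""
    y = np.zeros((N, T), dtype=int)
    for i in range(N):
        par = {}
        for k in ('p1', 'p2', 'th12', 'th21'):
            if mu[k] == 0.0:                     # component shut off
                par[k] = 0.0
            else:
                a, b = beta_shapes(mu[k], phi[k])
                par[k] = rng.beta(a, b)          # individual-level draw
        p = (par['p1'], par['p2'])               # within-state propensities
        z = 0 if rng.random() < pi1 else 1       # initial state (state 1 -> index 0)
        for t in range(T):
            y[i, t] = rng.random() < p[z]        # Bernoulli(p_z) purchase
            flip = par['th12'] if z == 0 else par['th21']
            if rng.random() < flip:              # off-diagonal transition
                z = 1 - z
    return y
```

Because the constrained variants of Table 1 differ only in which components are shut off, the same simulator covers all four models.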

Given this formulation of the unconstrained HMM, the three nested models emerge as we constrain either or both model components. To start, when we apply both constraints to all individuals, such that there is an "off" state ($p_{2i} = 0$) and backward transitions are prohibited ($\theta_{21i} = 0$), the "buy till you die" BG/BB model emerges.^4

Then, as we think about how the BG/BB model and HMM differ along these two dimensions, we can consider each of those dimensions separately (i.e., either $p_{2i} = 0$ or $\theta_{21i} = 0$). These constraints determine the two dimensions of Table 1. When applying each of the two constraints separately, different models emerge (the off-diagonal cells of Table 1), and each tells a distinct story of customer behavior.

When only $p_{2i} = 0$, the "on and off" model (OF) emerges. Consumers can make back-and-forth transitions between an "on" state of activity and an "off" state of inactivity. Like the HMM, customers can make backward transitions; yet like the BG/BB model, when in the off state, customers have no chance of activity. This kind of model has been explored in papers on Markov-modulated Poisson processes (Ma and Büschken 2011).

Alternatively, when only $\theta_{21i} = 0$, we get the "hot then cold" model (HC). At any time, customers can be either in a hot state (higher propensity to purchase) or a cold state (purchasing is less likely but still possible). Like the BG/BB model, once the customer reaches the cold state, she remains there (no backward transitions); and like the HMM, in the cold state, purchasing is possible. The hot-then-cold ordering is informed by the prevalence of customer attrition or, at least, the slowing down of transactions (in aggregate) that is common in most cohort-level data sets.^5 Such behavior appears in queuing theory models, such as phase-type distributions (Bladt and Neuts 2003, O'Cinneide 1990), and in marketing models (Fader et al. 2004, Schweidel and Fader 2009).

^4 Unlike the BG/BB as in Fader et al. (2010), which assumes that all individuals start in the "alive" state, in our BG/BB specification we allow individuals to start in either state, according to initial state probability vector $\pi$. The model utilized here is more general.



Past literature has noted how the latent-class model (Kamakura and Russell 1989) is a special case of an HMM ($\theta_{12i} = \theta_{21i} = 0$), and other work often utilizes a nested model with a death state (Netzer et al. 2008, Schweidel et al. 2011). However, the other links among the HMM and its constrained models (e.g., BG/BB, HC, OF) that we consider have not been documented in full detail as an integrated framework with the 2 × 2 structure as described here.

Viewing the HMM and its constrained variants as an integrated family provides an opportunity to detect when (i.e., for which types of data sets) each model component is worth including. One may initially (but erroneously) think that the nested structure would guarantee that the more flexible HMM would perform at least as well as any of its constrained versions (with one or both model components shut off) on all model-performance criteria. But this is not guaranteed in practice. We illustrate that when forecasting repeat transactions out of sample, the more general model does not always beat its nested versions, and hence there is value in the decision tree provided in this paper.

The decision tree answers our key question: For what kinds of database characteristics does each model perform best? To perform this classification, we need a range of different data sets generated from the 2 × 2 framework. We generate 64 synthetic data sets, each with T = 30 weeks of data in calibration (and 30 for holdout) and N = 500 customers, with considerable variation by simulating them from unconstrained and constrained versions of the HMM (i.e., to capture each of the submodels as well as the full unconstrained HMM) with a generous range of population-level parameters:

$$\begin{aligned} &\mu_{p_1} \in \{0.05, 0.50\}, \qquad \mu_{p_2} \in \{0.00, 0.10\}, \\ &\mu_{12} \in \{0.10, 0.35\}, \qquad \mu_{21} \in \{0.00, 0.25\}, \\ &\phi_{p_1} = \phi_{p_2} = \phi_{12} = \phi_{21} \in \{0.10, 0.45\}, \qquad \pi_1 \in \{0.50, 1.00\}. \end{aligned} \qquad (4)$$

We discuss these synthetic data sets and the variability across them in the next section. However, after creating these data sets, we put aside the data-generating process and describe them entirely by easy-to-compute and managerially relevant database characteristics, which we now cover in detail.

^5 For this reason, we do not consider a separate "cold then hot" model, although the general HMM and OF specifications allow individual-level purchasing to speed up over time.
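Under our reading of Equation (4), in which a single polarization value is shared by the four beta distributions within each data set, the 64 data sets arise as the 2^6 combinations sketched below; the sketch reuses simulate_cohort() from the sketch in §2, and the details are our assumption rather than the authors' code.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(2014)
datasets = []
for mu_p1, mu_p2, mu12, mu21, phi, pi1 in product(
        (0.05, 0.50), (0.00, 0.10),      # mu_p1, mu_p2
        (0.10, 0.35), (0.00, 0.25),      # mu_12, mu_21
        (0.10, 0.45), (0.50, 1.00)):     # shared phi, pi_1
    mu = {'p1': mu_p1, 'p2': mu_p2, 'th12': mu12, 'th21': mu21}
    # 30 calibration periods plus 30 holdout periods, N = 500 customers.
    datasets.append(simulate_cohort(500, 60, mu, {k: phi for k in mu}, pi1, rng))
assert len(datasets) == 64               # 2^6 combinations
```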

3. Selecting Database Characteristics

The decision tree is a tool that predicts which specification is likely to be the winning model by only looking at summary statistics of a particular database. Just as we need a set of reasonable models from which to choose, we also need a set of database characteristics to drive the choice process. But which database characteristics should we consider? We illustrate our process of identifying relevant database characteristics by returning to one of the opening examples, repeat purchasing for data set A. Before running any models, analysts frequently examine two typical displays of a cohort's purchasing behavior: a cross-sectional histogram of customer-level transactions and a longitudinal tracking plot of cohort-level purchases over time. These two graphs appear in Figure 2.

What are the key features of each graph? We want to choose summaries that are both managerially relevant and easy to compute directly from these aggregate plots. We identify four summaries that offer a fairly complete characterization of each plot. For the histogram, we propose summaries to capture the nature of the head and the tail of the distribution as well as its central tendency. For the tracking plot, we focus on the early and late trends in purchasing as the cohort ages and the trend's overall "bowed" shape, as well as the overall variability over time.

More specifically, Table 2 contains a listing of these measures, which we will use in our subsequent empirical analysis. Although there is not an exact science to selecting these measures, we choose them here to represent central tendency (e.g., average frequency), higher moments (e.g., top percentile, purchase concentration 80:20-type rule), and trend behavior (e.g., steepness, shape, trend variability). We do not claim this list to be comprehensive, but these values vary widely and in systematic ways across the data sets generated by the HMM and its constrained versions.

The variation in these measures across databases is essential: it allows us to explicitly show the range of empirical patterns we consider here and is required to obtain a meaningful classification tree linking these summaries to the model selection process. We illustrate some of this variation in the values of these summary statistics for data sets A, B, and C (see Table 3). It is interesting to see how the data sets are indistinguishable on some dimensions (e.g., frequency), quite distinct from one another on others (e.g., penetration), and occasionally exhibit pairwise similarities (e.g., late trend for data sets B and C).

In most empirical settings, we think about the amount of information as being related to the number of observations within a data set. But in this setting, each data set is reduced to a single observation described along multiple dimensions, i.e., the database characteristics described above.


Figure 2: The Observed Database Characteristics Arise Naturally from Plots That Managers Typically Examine When Deciding Which Model(s) to Run

[Figure: a histogram of the number of transactions per customer (days 1 to 30), annotated with penetration, frequency, concentration, and the top 5% level; and a tracking plot of the number of transactions per day, annotated with the early trend, late trend, trend Gini, and trend variability.]

Note. The histogram (left) shows how the number of transactions varies across customers in the observation period, and the tracking plot (right) shows incremental transactions of the cohort over the same period.

Thus, we construct a "data set of data sets," a collection of 64 simulated data sets reflecting variation along the summary statistics and representing real-world data sets (Fader et al. 2010, Netzer et al. 2008, Schweidel et al. 2011). Specifically, we generate data sets from all possible combinations of the parameter values noted in §2, which allows us to reflect both the structural variation and the natural randomness that arises from simulating the purchases.

Table 2: Database Characteristics That Capture Features of a Longitudinal Incidence Data Set and Can Be Computed from Summary Plots (e.g., Histogram and Tracking Plot)

Frequency: How many active days of transactions are there per customer?
Penetration: How many unique customers have made at least one transaction?
Concentration: How is activity spread out among customers (i.e., what fraction of all transactions was made by the top 20% of customers)?
Top 5% level: How much are the most active customers purchasing (i.e., what level of transactions is the cutoff for the top 5% of most active customers)?
Early trend: What is the trend in the first half of the calibration period (i.e., drop from first to middle day as a percentage of the size of the customer base)?
Late trend: What is the trend in the second half of the calibration period (i.e., drop from middle to last day as a percentage of the size of the customer base)?
Trend Gini: How much does the actual curve deviate from a line connecting the first and last days (i.e., how much area is there below the trend line and the curve, as a percentage of the levels of the line, à la the Gini coefficient)?
Trend variability: How much day-to-day variation is present in the calibration period (i.e., standard deviation of incremental sales)?
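As a sketch of how these measures could be computed from an N x T calibration-period incidence array, the following is one plausible reading of Table 2; the paper's exact operational definitions, sign conventions, and scalings may differ.

```python
import numpy as np

def database_characteristics(y):
    """Summaries of Table 2 from an N x T binary incidence array y.
    Under our sign convention, declining trends come out negative."""
    N, T = y.shape
    per_cust = y.sum(axis=1)                 # active days per customer
    daily = y.sum(axis=0).astype(float)      # incremental sales per day
    ranked = np.sort(per_cust)[::-1]
    mid = T // 2
    # Straight line joining the first and last days of the tracking plot.
    line = np.linspace(daily[0], daily[-1], T)
    return {
        'frequency': per_cust.mean(),
        'penetration': (per_cust > 0).mean() * 100,
        'concentration': ranked[: max(1, N // 5)].sum() / per_cust.sum() * 100,
        'top 5% level': np.percentile(per_cust, 95),   # cutoff for top 5%
        'early trend': (daily[mid] - daily[0]) / N * 100,
        'late trend': (daily[-1] - daily[mid]) / N * 100,
        # "Bowedness": area between the line and the curve, relative to the
        # area under the line (positive when the curve sags below the line).
        'trend Gini': (line - daily).sum() / line.sum() * 100,
        'trend variability': daily.std() / N * 100,
    }
```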

Once the data sets are created, the true values of the population-level parameters are no longer taken into consideration.

Figure 3 shows the large variability along the values of the database characteristics across the simulated data sets. For instance, nearly half of the data sets have penetration rates between 40% and 70%. About 40% of them have a very steep declining trend (steeper than a drop in transactions equivalent to 15% of the cohort size), whereas others show some growth in purchases for the cohort over time. Thus, we believe that by selecting and creating data sets in this way, we will have avoided biasing our classification results to favor any particular model specification.

To ensure that our chosen characteristics are explaining most of the meaningful variation across the collection of data sets, we ran a principal components analysis and an exploratory factor analysis on an even larger set of summary statistics beyond the ones described earlier. We do not present the detailed results but note a few highlights. The principal components analysis indicates that 99% of the measured variation across the 64 data sets can be captured by six independent components. The loadings of the principal components analysis and the loadings of the exploratory factor analysis (with five, six, and seven factors) all point to a very similar set of summary statistics, such as central tendency, concentration, and variation over time.

Table 3: Database Characteristics for the Three Highlighted Data Sets (%)

Characteristic       Data set A   Data set B   Data set C
Frequency                11           12           11
Penetration              48           83           68
Concentration            84           55           63
Top 5% level             63           40           43
Early trend             -17          -29           -7
Late trend               -3            1            0
Trend shape              27           46           13
Trend variability         4            6            2


Figure 3: Histograms Summarizing the Variability for Each Database Characteristic Across All 64 Data Sets

[Figure: eight histograms, one per characteristic (frequency, penetration, concentration, top 5% level, early trend, late trend, trend Gini, and trend variation), showing the spread of each measure across the 64 simulated data sets.]


We also recognize that a number of these database characteristics are naturally correlated with each other. Some measures are quite independent (e.g., late trend and penetration, r = 0.01), but other pairs have correlations that are large and significant (e.g., average frequency and penetration, r = 0.86). Although this kind of multicollinearity could be a serious problem in a typical regression-like model, it does not affect the classification tree and random forest approach since they are nonparametric methods designed specifically for (sequential) variable selection (Breiman 2001a, Breiman et al. 1984).

4. Assessing Model Performance

The final ingredient that goes into the classification tree is a rule for declaring a winning model for a given data set. Here, we select a winner based on each model's ability to predict an important managerial quantity that is widely used for purchasing data because of its link to customer lifetime value and other profit measures: aggregate incremental sales over a holdout period. Specifically, we select an error measure that summarizes the time series of discrepancies between the model and the observed sales for each Markov chain Monte Carlo (MCMC) "world." We will look at the variability of the errors across worlds and also average the errors across the worlds to obtain a measure of the model's error for that data set that integrates over the posterior uncertainty. The error measure we use, mean absolute error (MAE), assumes a linear loss function and is frequently used for time-series data. In the more general formulation of this procedure (see Appendix B), one can select any managerial quantity (replacing out-of-sample aggregate sales over time) and error measure with a different loss function (to replace MAE). Although we present results using MAE for our context, our classification tree results are robust to alternative common summary error measures (e.g., mean absolute percent error and root mean squared error).^6

Formally stated, we quantify performance as the degree to which the model-based posterior predictive distribution of out-of-sample aggregate sales is outlying with respect to the quantity's observed value. We assume the posterior distribution has been obtained using standard MCMC procedures (detailed in Appendix A), yielding posterior draws $g = 1, \ldots, G$. For data set $k$, $y^{\text{obs}}_{kt}$ is the number of observed incremental transactions at time period $t$, and $\tilde{y}^{(g)}_{kmt}$ is one replicate from model $m$'s corresponding posterior predictive distribution for that quantity (i.e., incremental transactions). Then for each posterior replicate $g$, we compute the mean absolute error:

$$d^{(g)}_{km} = \frac{1}{T} \sum_{t=1}^{T} \left| \tilde{y}^{(g)}_{kmt} - y^{\text{obs}}_{kt} \right|. \qquad (5)$$

^6 We note that our choice of error measure for model selection is in contrast to commonly used likelihood-based summary criteria, such as the Bayesian information criterion (BIC) and deviance information criterion (DIC) (Montgomery et al. 2004, Montoya et al. 2010, Netzer et al. 2008, Schweidel et al. 2011). We use an empirical quantity for model selection since many scholars caution against using purely likelihood-based measures (Gelman and Rubin 1995), especially for latent-state models, such as the HMM and its variants, because one must face issues with unstable estimators, computation of the posterior distribution, and correction factors of the log-marginal likelihood (Chib 1995, Lenk 2009, Newton and Raftery 1994, Spiegelhalter et al. 2002).
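In code, Equation (5) reduces to one line per posterior draw. A minimal sketch, with array names of our own choosing:

```python
import numpy as np

def posterior_mae(y_rep, y_obs):
    """Per-draw MAE d_km^(g) of Equation (5).
    y_rep: (G, T) posterior predictive holdout sales paths for one
    data set-model pair; y_obs: (T,) observed holdout sales."""
    return np.abs(y_rep - y_obs[None, :]).mean(axis=1)   # shape (G,)
```

Averaging the G resulting values gives the single error number used to declare a winner, while their spread feeds the vote-based analysis of §5.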


We will use the values of $d^{(g)}_{km}$ in two ways. On the one hand, we will examine the average posterior MAE for all four models on each data set to determine a single winner per data set. On the other hand, to provide a more nuanced set of findings, we characterize the full posterior uncertainty of the MAE by computing the probability that each model has the lowest value (i.e., the proportion of times each model is the winner across the $G$ posterior replicates), because we would not want to overly penalize a model that is a close second, for instance. We also use the latter directly in our classification tree.

Now, armed with a set of models (the HMM and its constrained variants), the in-sample database characteristics for each data set (see Table 2), and an error measure (out-of-sample MAE), we have all of the ingredients for the decision tree, which is described next.

5. When to Use Which Model? A Classification Tree

We classify data sets to reveal how we can select the model with the best out-of-sample error by only using in-sample database characteristics. This enables us to answer the paper's central question: Given a data set's summary statistics, which model best fits the data?

The winning model, $m^{\text{Winner}}_{gk}$, for data set $k$ and posterior world $g$ is determined by identifying the model with the minimum error $d^{(g)}_{km}$ among all $M$ models:

$$m^{\text{Winner}}_{gk} = \operatorname*{argmin}_{m = 1, \ldots, M} \; d^{(g)}_{km}. \qquad (6)$$

We use a classification tree to relate the identity of the winning model, $m^{\text{Winner}}_{gk}$, to the vector of database characteristics, $Y^{\text{obs}}_k$. Given the performance of all $M$ models across all $K$ data sets and $G$ posterior replicates, we explain variation in model performance (i.e., which model wins) as a function of the observed summaries of that data set. Stated formally, we capture this relationship as follows:

$$m^{\text{Winner}}_{gk} = \text{Tree}(Y^{\text{obs}}_k), \qquad (7)$$

where the function Tree denotes the classification tree predicting the winning model $m^{\text{Winner}}_{gk}$ for each of the data sets $k = 1, \ldots, K$ and posterior worlds $g = 1, \ldots, G$.

The classification tree provides cutoff values of the data set-level summary statistics to place entire data sets into "buckets." This classifies data sets in an easy-to-interpret manner. Each bucket of data sets has a similar profile of data set-level summary statistics and similar patterns of model performance. Therefore, when a new data set is encountered, it can be classified using this decision rule to identify which of the models will likely be most suited for it. This allows us to uncover relationships between observed patterns in the data and model fit that are easy to interpret while avoiding the need to make any additional assumptions about functional form or error distributions common to ordinary regression models.

Additionally, our classification tree approach goes one step further because it also reflects the natural parameter and model uncertainty. We reflect that uncertainty since our Bayesian modeling approach provides the full posterior distribution of performance for each data set-model pair. As a result, each case to be classified is unique to a particular posterior draw from a model run on a data set. This means that the data used to construct the classification tree contain $G = 100$ model-based replicates of the $K = 64$ observed data sets. By using $G$ replicates of each set of observed data set summaries (independent variables), we allow for $G$ different values of errors from each model-data set pair; hence, each data set has a distribution of different winning models (dependent variables) and therefore receives an appropriate number of "votes."
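A sketch of this construction, using scikit-learn's CART implementation as a stand-in for the authors' tree software; the random arrays are placeholders for the real characteristics and Equation (5) errors, and the leaf-size setting is our own choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
K, M, G = 64, 4, 100
X_chars = rng.random((K, 8))          # placeholder for the real characteristics
mae = rng.random((K, M, G))           # placeholder for Equation (5) output

winners = mae.argmin(axis=1)          # (K, G): winning model per data set/draw
X = np.repeat(X_chars, G, axis=0)     # each data set contributes G = 100 cases
y = winners.reshape(-1)               # 6,400 labels in {0: BG/BB, ..., 3: HMM}
tree = DecisionTreeClassifier(min_samples_leaf=200).fit(X, y)
```

Each data set thus casts G "votes," one per posterior world, rather than a single hard label.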

5.1. Classification Tree

The classification tree in Figure 4 can be easily read by starting at the top and following a series of "if then" decisions down to a terminal node at the bottom of each branch. These terminal nodes represent a group of data sets with the same observed summary statistic branch values (predictor variables). Each node has a recommended winning model but also displays the within-node winning percentages for each model (based on the number of posterior worlds in which each model had the lowest forecast error). Note that the $N$ values in the tree sum to 6,400 cases, reflecting the use of 100 posterior replicates for each of the 64 simulated data sets.

Four database characteristics were selected by the classification tree's sequential variable selection algorithm as being diagnostic: early trend, late trend, concentration, and trend Gini (trend shape).

The early and late trend statistics reflect the change in transactions over each half of the calibration period (15 days in each half), expressed as a percentage of the total customer base (500 customers). The classification tree partitions early trend into three levels: very steep (steeper than a drop in daily transactions equivalent to 11% of the customer base), moderately steep (a drop between 1% and 11%), and relatively flat or positive (a slope that is more positive than -1%). Late trend is partitioned into two levels, which we label moderately steep (a drop steeper than 3% of the total customer base) and relatively flat or positive (a slope that is more positive than -3%).


Figure 4: The Classification Tree

[Figure: the estimated tree. Internal nodes split on early trend (cutoffs of -11%, -9%, and -1%), late trend (cutoff of -3%), concentration (cutoff of 82%), and trend Gini (cutoff of 5%). The terminal nodes, with case counts and posterior winning percentages for (BG/BB, HC, OF, HMM), are: BG/BB, N = 700, [0.46, 0.26, 0.07, 0.21]; HC, N = 200, [0.28, 0.70, 0.01, 0.02]; BG/BB, N = 1,500, [0.40, 0.18, 0.16, 0.26] (data set A); HMM, N = 200, [0.28, 0.27, 0.09, 0.37]; OF or HMM, N = 1,700, [0.02, 0.07, 0.47, 0.44]; HMM, N = 1,300, [0.14, 0.09, 0.32, 0.45] (data set B); and OF, N = 800, [0.08, 0.05, 0.55, 0.32] (data set C).]

Notes. The tree is estimated on 6,400 cases, using 100 posterior samples for each of the 64 data sets. For any data set, the tree should be read from top to bottom: the ovals represent the partitions, the rectangles indicate the terminal nodes, and the listed model is the recommended one for that particular combination of database characteristics. Also listed is a vector summarizing the posterior winning percentage for all four models (left to right: BG/BB, HC, OF, HMM). The highlighted data sets A, B, and C appear where they are best classified.

Next, the split for concentration has a remarkable resemblance to the 80:20 rule. A data set is either highly concentrated (more than 82% of purchases are made by the top 20% of customers) or not highly concentrated.

The trend Gini summary statistic reflects the shape of the curve. How much does the actual curve deviate from a line connecting the first and last days of the calibration period (i.e., how much area is there between the curve and the trend line)? In other words, this measures the degree to which the curve is "bowed." The variable is split into a less bowed shape (close to linear, with value less than 5%) and a more bowed shape (value greater than 5%). Negative values indicate that there is more area between the curve and the trend line sitting above the trend line than below it.

We illustrate the use of the tree by returning to our three introductory data sets. Recall that their database characteristics were shown in Table 3. We can trace how the tree classifies these data sets to illustrate exactly how a manager can use our decision tree. For instance, data set A exhibits a sales pattern that is downward sloping early on (steeper than -1%) and not strongly downward sloping later on (equal to -3%), and where more than 82% of the purchases are made by the top 20% of customers. Thus, the classification tree recommends that it would be best modeled using the BG/BB model, since that model provides the best out-of-sample forecast for 40% of the posterior replicates associated with the 15 different data sets that have similar values of database characteristics. (And indeed, the BG/BB model does provide the best forecast for data set A, as we show in Tables 4–6.)


Table 4: For Each Data Set and Each Model's Posterior Draw, the Out-of-Sample Forecast Is Generated from the Posterior Predictive Distribution and MAE Is Computed

Model     Data set A   Data set B   Data set C
BG/BB        4.37*        9.71        14.13
HC           4.91        10.16        17.29
OF           6.52        10.20         7.09*
HMM          5.13         8.35*        9.44

Notes. The posterior mean of the MAE values for each model-data set pair is shown here. The lowest error value (i.e., winning model) for each data set is marked with an asterisk.


It is interesting to note the internal consistency of the tree. In particular, the precise value of the late trend for data set A is exactly the classification tree's cutoff value (-3%). So even if the data set's late trend were just slightly less than that cutoff, the data set would still fall into a node dominated by the BG/BB model (i.e., in the leftmost terminal node of the tree, the BG/BB model is the best-performing model in 46% of posterior replicates).

For data sets with a declining early trend and a flat or increasing late trend, but without a high purchase concentration (fewer than 82% of the purchases made by the top 20% of customers), a different pattern emerges. Data sets B and C are two such examples, so they fall into two terminal nodes in this part of the tree. Data sets represented in this part of the tree show a strong need to allow for back-and-forth transitions (HMM and OF). But within the back-and-forth pair, there is less certainty about which one wins.

Further splitting the data sets by trend Gini (trend shape) over the calibration period and by early trend one more time allows the analyst to better discriminate when each model is likely to perform better.

Table 5: For Each Data Set and Each Model's Posterior Draw, the Out-of-Sample Forecast Is Generated from the Posterior Predictive Distribution

Posterior distribution of MAE (%)

           Data set A          Data set B          Data set C
Model     25%  50%  75%       25%   50%   75%      25%   50%   75%
BG/BB     3.7  4.2* 4.8       8.3   9.6  11.0     11.9  14.0  16.2
HC        4.0  4.6  5.5       8.9  10.1  11.4     15.4  17.3  19.1
OF        4.9  6.2  7.8       8.6  10.0  11.6      6.3   7.0*  7.8
HMM       4.1  4.7  5.8       7.0   8.1*  9.5      7.7   9.0  10.7

Notes. MAE is computed for each replicate data set. The posterior quantiles (25%, 50%, and 75%) of the values for each data set are shown here. The lowest value (i.e., winning model) for each data set is marked with an asterisk.

Table 6: Posterior Probabilities of Each Model Winning Are Computed as the Proportion of Replicated Data Sets (e.g., MCMC Worlds) in Which Each Model Has the Lowest MAE

Posterior probability of model winning (%)

Data set   BG/BB   HC   OF   HMM
A            45    27    6    22
B            20    11   16    53
C             1     0   81    18

Note. These winning percentages illustrate the uncertainty in declaring a winner.

For data sets with a more bowed shape (trend Gini greater than or equal to 5%) and a very steep early trend (steeper than -9%), such as data set B, the HMM wins with 45% of the votes versus the OF with 32%. However, for others with a moderately steep early trend (between a -1% and -9% drop) and a more bowed shape, such as data set C, the OF wins with 55% of the votes versus the HMM with 32%. Data sets B and C are therefore best classified by the HMM and OF, respectively.
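For concreteness, the branches just traced can be transcribed as a small rule. This is a partial, hand-transcribed sketch covering only the paths quoted in the text (the full tree in Figure 4 contains further splits); trends are in percent, with declines negative:

```python
def recommend(early_trend, late_trend, concentration, trend_gini):
    """Return the recommended model for the branches discussed in the text,
    or None for branches that should be read off Figure 4 directly."""
    if early_trend < -1 and late_trend <= -3 and concentration >= 82:
        return 'BG/BB'                       # e.g., data set A
    if early_trend < -1 and late_trend > -3 and concentration < 82:
        if trend_gini >= 5:                  # "more bowed" tracking curve
            # Very steep early declines favor the HMM; moderately steep
            # ones favor the OF (data sets B and C, respectively).
            return 'HMM' if early_trend < -9 else 'OF'
    return None
```

Applied to the Table 3 values, recommend(-17, -3, 84, 27) returns 'BG/BB', recommend(-29, 1, 55, 46) returns 'HMM', and recommend(-7, 0, 63, 13) returns 'OF', matching the classifications of data sets A, B, and C.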

The split on trend shape (trend Gini) and an additional split on early trend should be intuitive because the HMM is a more general model than the OF. As a result, the HMM can generate a wider range of patterns across data sets than the OF can because of the extra model flexibility (e.g., the state 2 purchase probability is not necessarily zero). To understand this, keep in mind the patterns common to the data sets in this part of the tree: not highly concentrated purchasing and a flat or increasing late trend. On the one hand, for less bow-shaped curves, the OF has difficulty capturing a nearly linear pattern since the off state induces a moderately steep early drop. On the other hand, for markedly bow-shaped curves, the OF also has difficulty capturing both the very steep early declining trend and the flat or increasing later trend. Capturing such an interaction among database characteristics is an advantage that CART methods have over traditional linear regression approaches.

5.2. Uncertainty in Model Performance

As we take a deeper dive into particular branches of the decision tree, we examine the uncertainty in model performance. That is, although the model with the lowest error is declared the winner, we describe how one can use "votes" for each winning model by utilizing the full posterior from the Bayesian model output.

As an illustration, we return to data sets A, B, and C to look at the comparative performance of the HMM and its variants from the 2 × 2 framework. The average performance seen in the plots of Figure 5 is quantified in Table 4.

We summarize each model's performance for a data set using MAE averaged over the posterior


Figure 6: Illustrations of the Range of Variability in Model Prediction for Data Set A

[Figure: four panels (BG/BB, HC, OF, and HMM), each plotting the number of transactions over days 0 to 60.]

Notes. The observed data (solid line) are fit by each model. Each model's posterior predictive mean (dashed line) and its full distribution from 1,000 posterior draws (light gray shading) are displayed.

measure of the hit rate of the classification tree. The hit rate is the number of times the tree recommends a model that is, in fact, the best model to use on that data set.

Figure 7: Posterior Distribution of Out-of-Sample MAE for the Three Highlighted Data Sets

[Figure: three panels (data sets A, B, and C), each showing the posterior density of MAE for the BG/BB, HC, OF, and HMM models.]

Notes. These are densities of error measures, so smaller values indicate better model fit. When the densities overlap a great deal (e.g., data set A), there is not a clear winning model. When one density stands out as better than the others (e.g., data set C), there is a clear winning model.

Averaged across all models and iterations, the hit rate is 46%. We put this hit rate in context by noting that, from a purely operational standpoint,


the tree allows the analyst to run one model instead of four. In other words, by reducing the work of an analyst by 75%, the tree makes a recommendation of which model to use that is about twice as good as randomly guessing (a 25% hit rate) among the four models. This hit rate also fares well when compared with tougher comparative yardsticks, such as the proportional chance criterion and the maximum chance criterion (Morrison 1969), which yield benchmark hit rates of 28% and 35%, respectively. The latter metric is often hard to beat in a discriminant analysis setting. It assesses how much better our classifications are compared with using the most common actual winner (in this case, the HMM) every time. Thus the decision tree clearly offers some improvements over that simple (but often effective) approach.

However, this measure is purely an in-sample one: it uses the same 64 data sets for calibration and classification purposes, so it may be subject to overfitting.

We next describe a procedure, random forests, that will allow us to reflect the uncertain nature of the tree itself and its application to holdout data.

5.4. Assessing the Predictive Value of the Tree Using Random Forests: Out of Sample

To assess the value of the tree through another lens, we turn to another machine-learning method closely related to CART known as random forests (Breiman 2001a, Liaw and Wiener 2002). Although the single classification tree we have described above takes into account the parameter and model uncertainty, it does not take into account uncertainty in the structure of the classification tree itself. The random forest captures extra variation around the classification. This requires many classification trees, so the random forest algorithm grows many trees (hence, the "forest").

What is special about the random forest is that it has a built-in monitoring system to make sure it produces predictions that are validated on a holdout set and that utilize important predictor variables. Both aspects of prediction prevent overfitting (Breiman 2001a, Liaw and Wiener 2002). Fortunately, the random forest algorithm has a built-in cross-validation procedure calculating an n-fold cross-validation, where the holdout sample size, n, is typically about 1/3 of the cases (Breiman 2001a). The holdout misclassification rate, in the language of machine learning, is called the "out-of-bag" error rate, or the "generalized error" rate, since it is intuitively similar to cross-validation error, which indicates the ability of the predictive model to generalize to cases outside of the given data set.
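A sketch of this check follows; scikit-learn's random forest exposes the same out-of-bag machinery, though the paper used Breiman's algorithm via the randomForest package of Liaw and Wiener (2002), so the settings below are stand-ins. X and y are the 6,400 predictor rows and winner labels from the tree sketch in §5.

```python
from sklearn.ensemble import RandomForestClassifier

# oob_score=True evaluates each tree on the cases left out of its
# bootstrap sample, giving the holdout (out-of-bag) hit rate directly.
forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=0).fit(X, y)
print(f"out-of-bag hit rate: {forest.oob_score_:.2f}")   # cf. 48% in Table 7
```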

The random forest out-of-sample error rates broken down by each model are in Table 7, and the hit rate across all four models is 48%. This closely matches the in-sample hit rate using one classification tree.

Table 7: Out-of-Sample Classification for Each of the 6,400 Cases (Data Set-World Pairs) from the Random Forest

Actual \ Recommended   BG/BB    HC    HMM    OF   Hit rate (%)
BG/BB                    840   151    215   114        64
HC                       372   247    145   154        27
HMM                      430    40  1,056   677        48
OF                       259     7    742   951        49

Notes. Rows indicate which model actually fits the data best. Columns indicate which model was recommended by the random forest's classification for that data set, using a 2/3 sample for calibration (in sample) and a 1/3 sample for validation (out of sample). The hit rate is the proportion of each type of data set correctly classified out of sample (i.e., the diagonal entries divided by the row sums).

It is encouraging to see that, even when a data set is not used for calibration, it can be classified correctly with a high level of accuracy.

Looking more carefully at the classification tree and random forest results, several distinctive patterns arise. It is clear that the BG/BB and HMM models have substantially higher hit rates than do the other two models. It also seems that each of these polar opposite models (at least in terms of parameters and complexity) can serve as effective representatives to characterize the entire family of HMM models covered here.

This result raises the question about which of the two constraints/dimensions associated with our 2 × 2 framework is more important to capture: the presence of an off state or the existence of an absorbing state. A closer inspection of Table 7 clearly reveals the answer: classifying whether the data require an absorbing-state model or a back-and-forth model is much more informative than the presence of an off state. There is a high degree of confusion between the BG/BB model and HC, and likewise for HMM and OF, but relatively little confusion between the BG/BB model and OF or between HMM and HC. In Table 8, we aggregate the classifications across this single dimension and see incredibly high hit rates (62% and 89%) when we ignore the presence or absence of the off state.

We have explored the predictive value of the decision tree, so it is natural to ask what is driving its good predictive ability. To better understand the drivers of our strong classification capabilities, we now analyze the database characteristics' diagnostic value.

Table 8: Combined Cases of Data Sets and Classifications into Models with Absorbing States (BG/BB and HC) and Back-and-Forth Transitions (OF and HMM)

Actual \ Recommended   Absorbing   Back-and-forth   Hit rate (%)
Absorbing                  1,383            855          62
Back-and-forth               466          3,696          89

Notes. That is, by ignoring the presence or absence of an off/death state, the hit rates are quite high. Like Table 7, these are out-of-sample classifications, so the hit rate is the proportion of cases correctly classified.



5.5. Which Database Characteristics Are Most Diagnostic?

The output of the random forests uncovers which variables are most important in explaining classification success. This not only validates the decision tree obtained via CART but also quantifies variable importance. Variable importance in random forests is a measure of the average improvement in prediction accuracy of a tree when this variable is included (and its values are intact) compared with when this variable's values are meaningless (arbitrarily permuted across observations).
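A sketch of this permutation measure, using scikit-learn's permutation_importance as a stand-in for the randomForest importance computation; forest, X, and y come from the sketch in §5.4, and the characteristic names are the Table 2 labels.

```python
from sklearn.inspection import permutation_importance

char_names = ['frequency', 'penetration', 'concentration', 'top 5% level',
              'early trend', 'late trend', 'trend Gini', 'trend variability']
# For each characteristic: shuffle its column, remeasure accuracy, and
# report the average drop; larger drops mean more diagnostic variables.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(char_names, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f'{name:>18s}  {imp:.3f}')
```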

Figure 8 displays each database characteristic's variable importance. This analysis confirms what we see in the classification tree: early trend, late trend, trend Gini (trend shape), and concentration are the four most important variables, and they are clearly separated from the others. Among the four, however, late trend is the most important. This makes intuitive sense because the models differ in their ability to generate decreasing or increasing patterns in aggregate sales over time. For instance, for a data set with a strongly increasing late trend, the BG/BB model and HC, because of their absorbing state, would not be able to capture it at all. This provides more evidence that even before running any models, an analyst could use these easy-to-compute database characteristics to refine the decision about which model is likely to perform best.

Figure 8: Relative Importance of Each Database Characteristic (Predictor Variable) Used in the Classification Trees Obtained from the Random Forest

[Figure: a dot plot of variable importance, ordered from most to least important: late trend, concentration, early trend, trend Gini, frequency, trend variability, reach, and top 5% level.]

Notes. The most important variables are those that provide the largest increase in out-of-sample (out-of-bag) classification hit rate, averaged across all trees in the forest. The late trend, early trend, and concentration are clearly the three most important, confirmed by appearing in the classification tree obtained via CART methods. The next most important variable is trend Gini (trend shape), which also appears in the classification tree.

5.6. How Much Value Does the Classification Tree Add?

What does the analyst gain by using our decision tree? From the above discussion, we find the decision tree nearly doubles the hit rate compared with uninformed guessing about which of the four models to run. And much of the remaining error rate is associated with the relatively unimportant distinction between the presence or absence of an off state.

But although this information helps ease the task of choosing the right model, it also tells us how well an analyst would do using the decision tree compared with running all four models for every data set. So it is reasonable to ask: How much error would the analyst suffer by using only one model for all data sets? After all, this is the starting point for many analysts. Suppose the analyst only used the BG/BB model for all data sets she encountered. How poorly would she have performed? We can compare the error incurred to the average performance if she always used the truly winning model for each data set. Table 9 summarizes this analysis.

Running the BG/BB model on all data sets yields an error 52% worse than using the true winning model (average MAE = 9.52 versus 6.28). An analyst would do better by running only the HMM, which yields an error 21% worse than using the winning model (average MAE = 7.63). By contrast, using the model recommended by the decision tree for each data set is the best option because it greatly reduces error, to only 12% worse than the winning model (average MAE = 7.05). That is, in terms of relative error to the best model, only using the HMM is 75% worse than using the decision tree. So using the decision tree is a win-win: it requires 75% less effort and helps the analyst avoid a 75% increase in relative error.

The value of the decision tree is even greater if we look beyond average performance and consider the worst-case scenario of model performance. When examining the variability in performance, the 95% level of error for using any single model can be quite

Table 9: Absolute and Relative Benefits of Using the Classification Tree over Running Each Model for All Data Sets

                       BG/BB     HC     OF    HMM   Tree   Winner
Mean MAE                9.52   9.19   8.14   7.63   7.05     6.28
% worse than winner      52%    46%    30%    21%    12%
95% MAE                22.34  18.97  16.70  15.63  13.00    11.57
% worse than winner      93%    64%    44%    35%    12%

Notes. The MAE values reflect the performance of running each model for all data sets compared with following the tree's recommendation (Tree) and always selecting the model with the best out-of-sample fit for each data set (Winner). The percentages illustrate the loss compared with the best-fitting model (Winner).


danger in choosing models that are overly complex and excessively customized to every different data set. And third, we believe strongly in exploring and learning from the underlying patterns that are driving the observed data patterns. This kind of data science not only will help analysts create and choose better models but also will help managers make better tactical decisions to create and extract more value from their customer relationships.

Acknowledgments
This research was supported by a grant from Amazon Web Services, which funded the use of the Elastic Computing Cloud. The first author thanks the Wharton Risk Management and Decision Processes Center for its support through the Russell Ackoff Doctoral Student Fellowships, as well as Eva Ascarza, Michael Braun, Oded Netzer, and Kenneth Shirley for their useful comments. The authors all thank Yao Zhang, who contributed to a related working paper, "Children of the HMM: Tracking Longitudinal Behavior at Hulu.com," which was presented at the 2010 Marketing Science Conference.

Appendix A. Hierarchical Bayes Sampler Details
We provide the computational details for the models that we ran. We provide the details of the sampler for the general HMM with S states. It can be constrained to yield the two-state HMM and each of its nested models, as described in §2.

The MCMC procedure generates draws from the joint posterior:

$$\big[\{Z_{it}\}, \{p_i\}, \{\Theta_i\}, \pi, a_p, b_p, \alpha \mid Y\big] \propto \prod_{i=1}^{I} \Big\{ \prod_{t=1}^{T} [Y_{it} \mid Z_{it}, p_i]\,[Z_{it} \mid Z_{i,t-1}, \Theta_i] \Big\}\, [p_i \mid a_p, b_p]\,[\Theta_i \mid \alpha] \times [\pi]\,[a_p, b_p]\,[\alpha], \quad \text{(A1)}$$

with constants ($I$ individuals and $T$ time periods), individual-level parameters ($p_i$ and $\Theta_i$), and population-level parameters ($a_p$, $b_p$, $\alpha$, and $\pi$).

The procedure obtains these draws by alternating between the following conditional distributions:

$$[Z_i \mid Y_i, \Theta_i, p_i], \quad [p_i \mid Y_i, Z_i, a_p, b_p], \quad [\Theta_i \mid Z_i, \alpha], \quad [a_p, b_p \mid p], \quad [\alpha \mid \Theta], \quad [\pi \mid Z]. \quad \text{(A2)}$$

For each entry in the data set of data sets described in §3, we estimate all four models from the 2 × 2 framework. We use 64 data sets, 4 models per data set, and 2 chains per model, resulting in 512 independent MCMC chains. We run each chain for at least 50,000 iterations, depending on the convergence criterion for that given model and chain. This requires more than 1,000 days of computing time on a single core. Instead of using only one core, we distributed the computational task to take advantage of its parallel structure. On Amazon's Elastic Computing Cloud, we used 64 nodes with eight cores per node for 48 hours (24,576 core-hours). We ran each of the 512 MCMC chains on its own core, so we finished running all the models in two days. This was funded by a grant from Amazon Web Services.

Each model is estimated using a version of the MCMC sampler for the HMM with the relevant components switched off. The code is available from the authors upon request. For each chain, and for each pair of chains for each model, we perform a set of within-chain diagnostics for convergence and computation of effective sample size, as well as across-chain diagnostics for post-convergence mixing, all now recommended as standard practice (Gelman and Rubin 1992, Gelman et al. 2004, Geweke 1992, Plummer et al. 2006, Raftery and Lewis 1992).

The draws of the model parameters and latent states, $Z$, have been obtained using a data-augmented Gibbs sampler (Tanner and Wong 1987) with an embedded Metropolis-Hastings step. Below we describe how each subset of parameters was drawn from its corresponding conditional distribution in the MCMC procedure.

Step 1. Generate $Z_i = (Z_{i1}, \ldots, Z_{iT})$. The customer's latent-state sequence is drawn via the forward-backward algorithm. The latent states are sampled starting at $t = T$ and moving backward, based on probabilities defined recursively starting at $t = 1$ and moving forward using dynamic programming. For the case of $S = 2$, given the observed outcome at $t$ and the probability of being in either state at $t - 1$, each element $\omega_{itk}$ is a sum of the two elements from $t - 1$ weighted by the corresponding transition probabilities. Then the probability of drawing state $k$ is

$$\Pr(Z_{it} = k \mid Y_i, \Theta_i, p_i) = \frac{\omega_{itk}}{\omega_{it1} + \cdots + \omega_{itS}},$$
$$\omega_{it1} = p_{1i}^{y_{it}} (1 - p_{1i})^{1 - y_{it}} \big(\omega_{i,t-1,1}\,\theta_{11i} + \cdots + \omega_{i,t-1,S}\,\theta_{S1i}\big),$$
$$\vdots$$
$$\omega_{itS} = p_{Si}^{y_{it}} (1 - p_{Si})^{1 - y_{it}} \big(\omega_{i,t-1,1}\,\theta_{1Si} + \cdots + \omega_{i,t-1,S}\,\theta_{SSi}\big). \quad \text{(A3)}$$

Once the sequences from $1, \ldots, T$ are drawn for all individuals, conditioning on those sampled latent states as if they were data (i.e., data augmentation) simplifies the subsequent conditional distributions. Hence, we define a vector, $N_i$, where each element counts the number of periods an individual spent in each latent state, $N_{ij} = \sum_{t=1}^{T} 1\{Z_{it} = j\}$. We also define a matrix, $N_i$, where each entry $(j, k)$ counts the number of transitions made between each pair of latent states, $N_{ijk} = \sum_{t=1}^{T} 1\{Z_{i,t-1} = j\}\, 1\{Z_{it} = k\}$.
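To make the recursion concrete, the following is a minimal, self-contained sketch of this forward-filtering, backward-sampling draw for a single customer (Python/NumPy; the code and all names in it are our illustration, not the authors' implementation):

    import numpy as np

    def draw_latent_states(y, p, Theta, pi, rng):
        """Sample one customer's latent path Z_{i1},...,Z_{iT} by forward
        filtering and backward sampling. y: length-T 0/1 outcomes; p:
        state-specific purchase probabilities (length S); Theta: S x S
        transition matrix; pi: initial-state distribution."""
        T, S = len(y), len(p)
        # Bernoulli emission likelihoods f(y_t | Z_t = k) = p_k^y (1-p_k)^(1-y)
        like = np.where(y[:, None] == 1, p[None, :], 1.0 - p[None, :])
        # Forward pass: omega[t, k] is proportional to f(y_1..y_t, Z_t = k)
        omega = np.zeros((T, S))
        omega[0] = pi * like[0]
        omega[0] /= omega[0].sum()                    # normalize for stability
        for t in range(1, T):
            omega[t] = like[t] * (omega[t - 1] @ Theta)
            omega[t] /= omega[t].sum()
        # Backward pass: draw Z_T, then Z_{T-1}, ..., Z_1
        Z = np.zeros(T, dtype=int)
        Z[T - 1] = rng.choice(S, p=omega[T - 1])
        for t in range(T - 2, -1, -1):
            probs = omega[t] * Theta[:, Z[t + 1]]     # omega_t(j) * theta_{j, Z_{t+1}}
            Z[t] = rng.choice(S, p=probs / probs.sum())
        return Z

Conditioning on the sampled path, the occupancy counts $N_{ij}$ and transition counts $N_{ijk}$ are then simple tabulations of $Z$.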

Step 2. Generate $p_i = (p_{1i}, \ldots, p_{Si})$. The customer's purchase probability vector is sampled directly from a beta distribution. The use of independent beta priors for each probability yields a beta posterior distribution (since the likelihoods have no covariates). For state $k$, the prior and posterior are

$$p_{ki} \sim \text{beta}(a_{pk}, b_{pk}),$$
$$p_{ki} \mid Y_i, Z_i, a_p, b_p \sim \text{beta}\Big(a_{pk} + \sum_{t=1}^{T} y_{it}\, 1\{Z_{it} = k\},\; b_{pk} + N_{ik} - \sum_{t=1}^{T} y_{it}\, 1\{Z_{it} = k\}\Big), \quad \text{(A4)}$$

where $a_{pk}$ and $b_{pk}$ are the shape parameters of the beta distribution and $\mu_{pk}$ and $\phi_{pk}$ are the mean and polarization index, as defined in §2. To ensure $p_{1i} \leq p_{2i}$, we use rejection sampling of the whole vector.
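A sketch of this conjugate draw, including the rejection step that enforces the ordering of the purchase probabilities (again our illustrative Python; the max_tries safeguard is our own addition):

    import numpy as np

    def draw_purchase_probs(y, Z, a_p, b_p, rng, max_tries=10_000):
        """Draw p_i = (p_{1i},...,p_{Si}) from its beta full conditional,
        rejecting draws that violate the ordering p_{1i} <= ... <= p_{Si}."""
        S = len(a_p)
        successes = np.array([y[Z == k].sum() for k in range(S)])   # sum_t y_it 1{Z_it = k}
        exposures = np.array([(Z == k).sum() for k in range(S)])    # N_ik
        for _ in range(max_tries):
            p = rng.beta(a_p + successes, b_p + exposures - successes)
            if np.all(np.diff(p) >= 0):          # accept only ordered vectors
                return p
        raise RuntimeError("no ordered draw found; check the posterior")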


Step 3. Generate $\Theta_i$. The customer's transition probability matrix must have its rows sum to 1, so each row is a multinomial probability vector. Using independent Dirichlet priors on each row yields Dirichlet posteriors. For the probabilities of moving from state $j$ to each of the states $1, \ldots, S$, the prior and posterior are

$$\theta_{j,1:S,i} \sim \text{Dirichlet}(\alpha_{j,1:S}),$$
$$\theta_{j,1:S,i} \mid Z_i, \alpha_j \sim \text{Dirichlet}\big(\alpha_{j1} + N_{ij1}, \ldots, \alpha_{jS} + N_{ijS}\big), \quad \text{(A5)}$$

where $1{:}S$ indexes a vector of parameters and where $\alpha_j$ is the vector of Dirichlet shape parameters, which can be summarized by the mean probability vector $\mu_j = \alpha_j / \sum_{k=1}^{S} \alpha_{jk}$ and the polarization index $\phi_j = 1 / (1 + \sum_{k=1}^{S} \alpha_{jk})$.

Step 4. Generate $\pi$. The initial latent-state membership probability vector depends on the latent states across all individuals at $t = 1$. Defining $N_{1k} = \sum_{i=1}^{I} 1\{Z_{i1} = k\}$, the uniform hyperprior and the posterior are

$$\pi \sim \text{Dirichlet}(1, \ldots, 1),$$
$$\pi \mid Z \sim \text{Dirichlet}(1 + N_{11}, \ldots, 1 + N_{1S}). \quad \text{(A6)}$$
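Both Dirichlet updates (Steps 3 and 4) are conjugate and can be drawn directly; a minimal illustrative sketch (function and variable names are ours):

    import numpy as np

    def draw_transition_matrix(N_trans, alpha, rng):
        """Draw each row of Theta_i from its Dirichlet full conditional.
        N_trans[j, k] holds the transition counts N_ijk; alpha[j] holds the
        Dirichlet shape parameters alpha_{j,1:S} for row j."""
        S = N_trans.shape[0]
        return np.vstack([rng.dirichlet(alpha[j] + N_trans[j]) for j in range(S)])

    def draw_initial_probs(Z_first, S, rng):
        """Draw pi from its Dirichlet full conditional under the uniform
        Dirichlet(1,...,1) hyperprior; Z_first holds the sampled states at
        t = 1 across all I customers."""
        N1 = np.bincount(Z_first, minlength=S)   # N_1k = # of customers starting in state k
        return rng.dirichlet(1.0 + N1)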

Step 5. Generate $(a_p, b_p)$. There is a highly uninformative hyperprior on the shape parameters of each beta distribution characterizing the heterogeneity of state-specific purchase propensities. For state $k$, the hyperprior and posterior are

$$f(a_{pk}, b_{pk}) \propto (a_{pk} + b_{pk})^{-5/2},$$
$$f(a_{pk}, b_{pk} \mid p_k) \propto L_{\text{beta}}(p_{k1}, \ldots, p_{kI})\, (a_{pk} + b_{pk})^{-5/2}, \quad \text{(A7)}$$

where $L_{\text{beta}}$ is the beta density function, and the prior distribution proportional to $(a + b)^{-5/2}$ is recommended by Gelman et al. (2004). That prior is uniform on the beta distribution mean $a/(a + b)$ and considered weakly informative on the polarization index $(1 + a + b)^{-1}$. Since the posterior has no closed-form expression, we use a Metropolis-Hastings step with a log-normal proposal density. Its tuning parameter, or variance, is set to 0.05 to obtain an appropriate acceptance probability.
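Because (A7) has no closed form, each $(a_{pk}, b_{pk})$ pair is updated with a Metropolis-Hastings step. The sketch below shows one way to implement it (our illustrative Python, not the authors' code); the log-normal random walk is multiplicative on the original scale, so the acceptance ratio includes a Jacobian correction:

    import numpy as np
    from scipy.stats import beta as beta_dist

    def mh_update_shape_params(a, b, p_k, rng, step_var=0.05):
        """One MH update of (a_pk, b_pk) given the vector p_k of state-k
        purchase probabilities across customers, under the prior
        f(a, b) proportional to (a + b)^(-5/2)."""
        def log_post(a_, b_):
            return beta_dist.logpdf(p_k, a_, b_).sum() - 2.5 * np.log(a_ + b_)
        # Log-normal random-walk proposals with variance 0.05 on the log scale
        a_prop = a * np.exp(rng.normal(0.0, np.sqrt(step_var)))
        b_prop = b * np.exp(rng.normal(0.0, np.sqrt(step_var)))
        # Jacobian of the log-normal proposal adds log(a_prop/a) + log(b_prop/b)
        log_accept = (log_post(a_prop, b_prop) - log_post(a, b)
                      + np.log(a_prop / a) + np.log(b_prop / b))
        if np.log(rng.uniform()) < log_accept:
            return a_prop, b_prop
        return a, b

The same scheme, with the Dirichlet likelihood in place of the beta likelihood and the prior exponent $-(2S + 1)/2$, serves for Step 6 below.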

Step 6. Generate $\alpha$. The Dirichlet distribution shape parameters $\alpha_j = (\alpha_{j1}, \ldots, \alpha_{jS})$ are generated by a generalization of the procedure used to generate the shape parameters of the beta distributions. For state $j$ (i.e., row $j$ of the transition probability matrix), the hyperprior and posterior are

$$f(\alpha_{j,1:S}) \propto \Big( \sum_{k=1}^{S} \alpha_{jk} \Big)^{-(2S+1)/2},$$
$$f(\alpha_{j,1:S} \mid \theta_{j,1:S,1}, \ldots, \theta_{j,1:S,I}) \propto L_{\text{Dirichlet}}(\theta_{j,1:S,1}, \ldots, \theta_{j,1:S,I})\, \Big( \sum_{k=1}^{S} \alpha_{jk} \Big)^{-(2S+1)/2}, \quad \text{(A8)}$$

where $L_{\text{Dirichlet}}$ is the Dirichlet density function. Again, the prior distribution proportional to $(\alpha_{j1} + \cdots + \alpha_{jS})^{-(2S+1)/2}$ is a generalization, for the Dirichlet shape parameters, of the prior used for the beta shape parameters (Everson and Bradlow 2002). Since the posterior has no closed-form expression, we use a Metropolis-Hastings step with a log-normal proposal density. Its tuning parameter, or variance, is set to 0.05 to obtain an appropriate acceptance probability.

Appendix B. General Recipe: Developing a Decision Tree for Model Selection Using Database Characteristics
Although we perform our analysis for a specific data/modeling context, the same basic recipe developed in this paper can be applied to many other settings. We formalize this recipe as a general method for model evaluation and selection involving three basic ingredients: the set of candidate models, the database characteristics, and the performance criterion. This enables the analyst to answer the questions "Which model should I use for this data set?" and "Given a data set, how well will a given model perform?"

Step 1. Selecting models. Our procedure supposes that the analyst has a consideration set of models $1, \ldots, M$. The models can all be run on data sets with the same structure. We also suppose that model-based simulation can be done via Monte Carlo or, if needed, Markov chain Monte Carlo. Our procedure assumes that an MCMC sampler has been run to obtain draws $g = 1, \ldots, G$ from each model's joint posterior distribution, $f(\theta_m \mid Y^{\text{obs}})$.

Step 2. Choosing database characteristics. We characterize the database with a set of summary statistics. These should be (1) easy to compute, (2) managerially relevant, and (3) largely comprehensive and mutually exclusive. Formally stated, we denote these data set-level summary statistics as a covariate vector computed from data set $k$. They are to be computed before running any models on the $k$th data set, which itself is denoted by $Y^{\text{obs}}_k$. These are the independent variables of interest in the eventual classification.
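For the repeat-purchase incidence setting of this paper, such a covariate vector might contain statistics like penetration, concentration, and early/late trends in aggregate incidence. The sketch below computes a few characteristics of this flavor from an I x T binary matrix; the precise definitions are our illustrative simplifications, not the paper's exact formulas:

    import numpy as np

    def dataset_characteristics(Y):
        """Summarize an I x T binary incidence matrix Y with a handful of
        easy-to-compute statistics (illustrative definitions only)."""
        I, T = Y.shape
        per_period = Y.mean(axis=0)      # aggregate incidence by period
        half = T // 2
        top = np.sort(Y.sum(axis=1))[::-1][: max(I // 5, 1)].sum()
        return {
            "penetration": (Y.sum(axis=1) > 0).mean(),       # share ever active
            "concentration": top / max(Y.sum(), 1),          # activity share of top 20%
            "early_trend": np.polyfit(np.arange(half), per_period[:half], 1)[0],
            "late_trend": np.polyfit(np.arange(T - half), per_period[half:], 1)[0],
        }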

Step 3. Determining the performance criterion. Assessing model performance for model selection is an important step that should be driven by the business goal. We use an empirical validation approach via posterior predictive distributions. We generate data, $Y^{(g)}_m$, from the model-based predictive distribution, $f(Y_m \mid Y^{\text{obs}})$, where $g$ indexes replications $1, \ldots, G$.

For performance-measure feature $s$, we summarize the generated data by $T_s(Y^{(g)}_m)$. We quantify performance as model errors: the degree to which the model-based posterior predictive distribution of feature $s$ is outlying with respect to the feature's observed value, $T_s(Y^{\text{obs}}_k)$.

Let $D$ denote a loss function along a single dimension, that is, the distance between the draws from the posterior predictive distribution and the single observed value of a feature of a data set. This distance, $d_{mks}$, summarizes model $m$'s performance on data set $k$ in terms of feature $s$, utilizing all replicates $g = 1, \ldots, G$:

$$d_{mks} = D\big(T_s(Y^{(1)}_{mk}), \ldots, T_s(Y^{(G)}_{mk});\; T_s(Y^{\text{obs}}_k)\big). \quad \text{(B1)}$$

The choice of the function $D$ should depend on the desired feature.
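For instance, if $D$ is absolute error applied replicate by replicate, with the winner then declared per replicate as formalized in Equation (B2) below, the computation might look like this (illustrative Python; function names are ours):

    import numpy as np

    def model_errors(T_reps, T_obs):
        """Per-replicate loss: absolute error between each posterior predictive
        replicate of feature s (rows of T_reps, shape (G, ...)) and the
        observed feature value T_obs. Returns a length-G vector."""
        return np.abs(T_reps - T_obs).reshape(len(T_reps), -1).mean(axis=1)

    def winning_model(errors_by_model):
        """Given {model_name: length-G error vector}, return the winning
        model label for each posterior replicate g."""
        names = list(errors_by_model)
        stacked = np.vstack([errors_by_model[m] for m in names])   # M x G
        return [names[j] for j in stacked.argmin(axis=0)]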

Regardless of which performance metric and error measure is chosen, a single metric is obtained for each posterior replicate. For each replicate, we select the model with the lowest error and consider it the winning model, which is the nominal categorical outcome variable to be classified.

Step 4. Classifying data sets by relating model performance to observed database characteristics. Putting those three ingredients together, we create the decision tree to infer the relationship between which model is best (outcome) and database characteristics (predictors). We formalize the classification as its own predictive tool. The independent variables of interest are the data set-level summary statistics computed before running any models on


the $k$th data set, which is denoted by $Y^{\text{obs}}_k$. These values are the predictors of model performance. The error measure, $d^{(g)}_{ksm}$, as defined above, is on a continuous scale. For classification purposes, however, the dependent variable should be an indicator of the winning model $m^{\text{Winner}}_{(g)ks}$, a nominal categorical variable with $M$ levels. For each data set $k$, feature $s$, and posterior replicate $g$,

$$m^{\text{Winner}}_{(g)ks} = \underset{m = 1, \ldots, M}{\arg\min}\; d^{(g)}_{ksm}. \quad \text{(B2)}$$

We use a classification tree to relate the identity of the winning model, $m^{\text{Winner}}_k$, to the database characteristics. Given the performance of all $M$ models across all $K$ data sets for feature $s$, we explain variation in model performance (i.e., which model wins) as a function of the observed summaries of that data set. Stated formally, we capture this relationship as follows:

$$m^{\text{Winner}}_{(g)k} = \text{Tree}\big(Y^{\text{obs}}_k\big), \quad \text{(B3)}$$

where Tree denotes the classification tree, applied to the summary statistics of $Y^{\text{obs}}_k$, predicting the winning model $m^{\text{Winner}}_{(g)k}$ for each of the data sets $k = 1, \ldots, K$ and replicates $g = 1, \ldots, G$. The results will show which data set-level summaries are associated with differences in performance across the models for the feature of interest. The exact same setup used for CART methods can be used for implementing random forests. The same basic relationships are uncovered, but different methods are used.
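As one concrete implementation of this final step, the sketch below fits a CART-style classification tree and a random forest with scikit-learn. The feature matrix X stacks the summary-statistic vectors across (data set, replicate) pairs, and y holds the winning-model labels; this is our illustrative code, not the authors' implementation (note also that scikit-learn's default importances are impurity based rather than the permutation-based out-of-bag measure described earlier):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.ensemble import RandomForestClassifier

    def fit_model_selector(X, y, feature_names):
        """Fit a classification tree and a random forest that predict the
        winning model from database characteristics."""
        tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
        forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                        random_state=0).fit(X, y)
        print(export_text(tree, feature_names=feature_names))  # decision rules
        print("OOB hit rate:", forest.oob_score_)              # out-of-bag accuracy
        print(dict(zip(feature_names, forest.feature_importances_)))
        return tree, forest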

References
Bladt M, Neuts MF (2003) Matrix-exponential distributions: Calculus and interpretations via flows. Stochastic Models 19(1):113–124.
Breiman L (2001a) Random forests. Machine Learn. 45(1):5–32.
Breiman L (2001b) Statistical modeling: The two cultures. Statist. Sci. 16(3):199–231.
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees (Chapman & Hall, New York).
Chib S (1995) Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90(432):1313–1321.
Chipman HA, George EI, McCulloch RE (1998) Bayesian CART model search. J. Amer. Statist. Assoc. 93(443):935–948.
Cooper LG, Nakanishi M (1988) Market-Share Analysis: Evaluating Competitive Marketing Effectiveness (Kluwer Academic Publishers, Boston).
Donkers B, Verhoef PC, de Jong M (2007) Modeling CLV: A test of competing models in the insurance industry. Quant. Marketing Econom. 5(2):163–190.
Everson PJ, Bradlow ET (2002) Bayesian inference for the beta-binomial distribution via polynomial expansions. J. Comput. Graphic. Statist. 11(1):202–207.
Fader PS, Hardie BGS, Huang CY (2004) A dynamic changepoint model for new product sales forecasting. Marketing Sci. 23(1):50–65.
Fader PS, Hardie BGS, Shang J (2010) Customer-base analysis in a discrete-time noncontractual setting. Marketing Sci. 29(6):1086–1108.
Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Statist. Sci. 7(4):457–472.
Gelman A, Rubin DB (1995) Avoiding model selection in Bayesian social research. Sociol. Methodol. 25:165–173.
Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian Data Analysis, 2nd ed. (Chapman & Hall, New York).
Geweke J (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior moments. Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics, Vol. 4 (Oxford University Press, Oxford, UK), 169–193.
Hanssens DM, Leeflang PSH, Wittink DR (2005) Market response models and marketing practice. Appl. Stochastic Models Bus. Indust. 21(4/5):423–434.
Hartmann W, Manchanda P, Nair H, Bothner M, Dodds P, Godes D, Hosanagar K, Tucker C (2008) Modeling social interactions: Identification, empirical methods and policy implications. Marketing Lett. 19(4):287–304.
Kamakura WA, Russell GJ (1989) A probabilistic choice model for market segmentation and elasticity structure. J. Marketing Res. 26(4):379–390.
Lenk PJ (2009) Simulation pseudo-bias correction to the harmonic mean estimator of integrated likelihoods. J. Comput. Graphic. Statist. 18(4):941–960.
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22.
Liechty JC, Pieters R, Wedel M (2003) Global and local covert visual attention: Evidence from a Bayesian hidden Markov model. Psychometrika 68(4):519–541.
Ma S, Büschken J (2011) Counting your customers from an "always a share" perspective. Marketing Lett. 22(3):243–257.
Montgomery AL, Li S, Srinivasan K, Liechty JC (2004) Modeling online browsing and path analysis using clickstream data. Marketing Sci. 23(4):579–595.
Montoya R, Netzer O, Jedidi K (2010) Dynamic allocation of pharmaceutical detailing and sampling. Marketing Sci. 29(5):909–924.
Morrison DG (1969) On the interpretation of discriminant analysis. J. Marketing Res. 6(2):156–163.
Netzer O, Lattin JM, Srinivasan V (2008) A hidden Markov model of customer relationship dynamics. Marketing Sci. 27(2):185–204.
Newton MA, Raftery AE (1994) Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. Ser. B 56(1):3–48.
O'Cinneide CA (1990) Characterization of phase-type distributions. Comm. Statist. Stochastic Models 6(1):1–57.
Plummer M, Best N, Cowles K, Vines K (2006) CODA: Convergence diagnosis and output analysis for MCMC. R News 6(1):7–11.
Raftery AE, Lewis SM (1992) How many iterations in the Gibbs sampler? Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics, Vol. 4 (Oxford University Press, Oxford, UK), 763–773.
Sabavala DJ, Morrison DG (1977) A model of TV show loyalty. J. Advertising Res. 17(6):35–43.
Schmittlein DC, Morrison DG, Colombo RA (1987) Counting your customers: Who are they and what will they do next? Management Sci. 33(1):1–24.
Schweidel DA, Fader PS (2009) Dynamic changepoints revisited: An evolving process model of new product sales. Internat. J. Res. Marketing 26(2):119–124.
Schweidel DA, Bradlow ET, Fader PS (2011) Portfolio dynamics for customers of a multiservice provider. Management Sci. 57(3):471–486.
Schweidel DA, Fader PS, Bradlow ET (2008) Understanding subscriber retention within and across cohorts using limited information. J. Marketing 72(1):82–94.
Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A (2002) Bayesian measures of model complexity and fit. J. Roy. Statist. Soc. Ser. B 64(4):583–639.
Stephens M (2000) Dealing with label switching in mixture models. J. Roy. Statist. Soc. Ser. B 62(4):795–809.
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82(398):528–540.
Wittink DR, Addona MJ, Hawkes WJ, Porter JC (1988) SCAN*PRO: The estimation, validation and use of promotional effects based on scanner data. Internal paper, Cornell University, Ithaca, NY.
Zheng Z, Fader P, Padmanabhan B (2011) From business intelligence to competitive intelligence: Inferring competitive measures using augmented site-centric data. Inform. Systems Res. 23(3):698–720.