Harbingers of Failureand Success - Columbia … approach that yields interpretable consumer-level...

54
Harbingers of Failure...and Success Chaoqun Chen Assistant Professor Marketing Department Cox School of Business Southern Methodist University 6214 Bishop Blvd. Dallas, TX 75275 [email protected] Eric T. Anderson Hartmarx Professor of Marketing Marketing Department Kellogg School of Management Northwestern University 2001 Sheridan Rd. Evanston, IL 60611 [email protected] Blakeley B. McShane Associate Professor Marketing Department Kellogg School of Management Northwestern University 2001 Sheridan Rd. Evanston, IL 60611 [email protected] Author note: Correspondence concerning this manuscript should be addressed to Chaoqun Chen. Supplementary materials for this manuscript are available online. The order of au- thors other than the first was determined by alphabetical order. Acknowledgements: NA Financial Disclosure: NA

Transcript of Harbingers of Failureand Success - Columbia … approach that yields interpretable consumer-level...

Harbingers of Failure...and Success

Chaoqun ChenAssistant Professor

Marketing DepartmentCox School of Business

Southern Methodist University6214 Bishop Blvd.Dallas, TX [email protected]

Eric T. AndersonHartmarx Professor of Marketing

Marketing DepartmentKellogg School of Management

Northwestern University2001 Sheridan Rd.Evanston, IL 60611

[email protected]

Blakeley B. McShaneAssociate Professor

Marketing DepartmentKellogg School of Management

Northwestern University2001 Sheridan Rd.Evanston, IL 60611

[email protected]

Author note: Correspondence concerning this manuscript should be addressed to ChaoqunChen. Supplementary materials for this manuscript are available online. The order of au-thors other than the first was determined by alphabetical order.

Acknowledgements: NA

Financial Disclosure: NA

Harbingers of Failure...and Success

Abstract

We extend the work of Anderson et al. (2015) who find evidence of “harbingers offailure”–consumers who tend to purchase new products that are destined to meet adoomed fate–along four lines: (i) we replicate their findings in a dataset that coversover 400 U.S. retailers and a wide range of product categories, (ii) we develop a novelsemi-parametric approach that yields interpretable consumer-level estimates and im-proved predictive accuracy, (iii) we characterize harbingers of failure showing that theyare wealthier, have more children and larger family size, and shop at warehouse clubs,and (iv) we investigate potential mechanisms that explain the harbingers of failurephenomenon finding that harbingers of failure are more variety-seeking.

We find evidence for not only harbingers of failure but also harbingers of success(i.e., customers who tend to purchase new products that are likely to succeed) withthe former making up 44% of consumers in our data and the latter making up theremaining 56%. Further, were all early sales of a new product to harbingers of failureas opposed to harbingers of success, the probability that the new product remains inthe market two (four) years after introduction is five (seven) percentage points lower;thus, sales to harbingers have a powerful impact–especially when considered in tandemwith the high rate of new product failure.

Keywords: new products, harbinger, failure, methodology.

1

1 Introduction

New product development is the major driver of growth for consumer packaged goods (CPG)

firms (IRI, 2013). However, despite their large amount of investment in new product devel-

opment, the failure rate of new products introduced to market is as high as 75% (Schneider

and Hall, 2011). Moreover, even new products that exceed their expected first-year sales

tend not to last very long in markets.

That new product ideas, which have passed numerous tests from the idea generation

phase all the way through to the commercialization phase, fail at such high rates vexes both

practitioners and academics. Particularly surprising are the results from the market testing

phase as sales in the market might reasonably be thought to provide information about

future product performance. Indeed, new product forecasting tools typically assume that

early sales portend future success.

Recently, Anderson et al. (2015) have challenged this conventional wisdom by demonstrat-

ing not all early sales do in fact portend future success. Instead, certain consumers–so-called

“harbingers of failure”–tend to purchase new products that are destined to meet a doomed

fate. This suggests firms should pay attention not only to how much their new products are

selling but also to whom they are selling.

In this paper, we extend the work of Anderson et al. (2015) along four lines. First,

Anderson et al. (2015) possess data from only a single, national retail chain and it is thus

unclear whether their results generalize to a broader set of retailers. In contrast, we possess

data that covers over 400 U.S. retailers and a wide range of product categories and replicate

and extend the basic findings of Anderson et al. (2015) in this broader context.

Second, the empirical approach of Anderson et al. (2015) is rather ad hoc in several

respects, including (i) requiring that new products be arbitrarily classified as “successes”

or “failures” by the researcher rather than allowing for an objective, continuous measure of

product success and (ii) requiring that consumers be arbitrarily grouped into four segments

by the researcher rather than allowing for a individual consumer-level estimates. In contrast,

2

we develop a novel semi-parametric approach that treats product success in a continuous

manner and yields both interpretable consumer-level (here household-level) estimates and

improved predictive accuracy; we view this model as one of the major contributions of this

paper and note that it can easily be (and is) generalized to accommodate cross-category

effects and yields much improved predictive performance relative to the model of Anderson

et al. (2015) and other competing models.

Third, Anderson et al. (2015) possess only transaction-level data thus limiting their

ability to understand who the demographic profile of harbingers of failure. In contrast,

we possess rich, household-level demographic data that we can tie to our household-level

estimates thereby allowing us to characterize harbingers.

Finally, Anderson et al. (2015) provide only a limited investigation of potential mech-

anisms that explain the harbingers of failure phenomenon, suggesting that they have un-

representative tastes so that the products they choose do not match the preference of the

mass market and thus ultimately fail. In contrast, we develop several alternative hypotheses

that purport to explain what makes a consumer a harbinger of failure: Do they search less

such that they choose poor quality products that ultimately fail? Do they fail to be opinion

leaders who drive the word of mouth necessary for new product success? Are they more

innovative thus adopting new products at a rate that far outpaces that of more typical con-

sumers? Or are they more variety seeking thus failing to drive the repeat purchase necessary

for new product success?

To preview our results, we find evidence for not only harbingers of failure but also

harbingers of success (i.e., customers who tend to purchase new products that are likely

to succeed) with the former making up 44% of consumers in our data and the latter making

up the remaining 56%. Further, were all early sales of a new product to harbingers of failure

as opposed to harbingers of success, the probability that the new product remains in the

market two (four) years after introduction is five (seven) percentage points lower; thus, sales

to harbingers have a powerful impact–especially when considered in tandem with the high

3

rate of new product failure.

We also find that harbingers of failure tend to be wealthier and to have more children

and larger family size. Further, harbingers’ purchase behavior consistently predicts new

product success across multiple categories; for example, if a consumer who has purchased

many newly introduced bakery products that have ultimately gone on to fail also purchases a

newly introduced beauty product, this portends ill for the newly introduced beauty product.

In addition to looking at the behavior of harbingers across categories, we also examine their

behavior across retail formats. Harbingers of failure spend highly at mass merchandisers and

warehouse clubs, whereas harbingers of success spend highly at drug stores and traditional

grocery stores; thus, although from the perspective of manufacturers harbingers of failure

portend new product failure, from the perspective of retailers in particular mass merchan-

disers and warehouse clubs they are an important source of revenue. Finally, our results

suggest that harbingers of failure are more variety-seeking.

The remainder of this paper is organized as follows. In the next section, we review

related literature. In Section 3 we describe our data and in Section 4 we replicate the

results of Anderson et al. (2015) using our more extensive data. Next, we discuss our

more general model in Section 5 and present results from it in Section 6. In Section 7, we

investigate potential mechanisms that explain the harbingers of failure phenomenon. Finally,

we conclude with a brief discussion in Section 8.

2 Literature review

Successful new products are the growth engine for CPG firms (IRI, 2013). Despite the

strategic importance of developing innovative new products, the failure rate of new CPG

products has been high for decades. For example, Crawford (1977) summarizes more than

ten different sources that cite failure rates ranging from 40% to 90% for CPG products.

The high failure rate of new products has led many researchers to develop theories as to

why new products succeed or fail. Crawford (1977) argues that improved marketing research

4

could address many of the eight factors that he believes are linked to a high failure rate among

new products. He then offers a series of nine hypotheses as to why improved market research

may not yield better outcomes. We believe that our research on harbingers of failure and

success introduces a new hypothesis that has not been previously considered. That is, our

research shows that customers vary in their innate preferences and can be classified into

those that systematically purchase new products that ultimately go on to fail or succeed.

We agree with Crawford (1977) that improved consumer insight can address this issue and

believe that our model offers one such approach.

While better information can obviously reduce failure rates, there are many factors that

influence the success or failure of a new product. As discussed in Anderson et al. (2015), one

factor that contributes to success or failure is how managers make decisions. Escalated com-

mitment (Boulding et al., 1997; Brockner and Rubin, 1985; Brockner, 1992), an inability to

integrate information (Biyalogorsky et al., 2006), and distortions in management incentives

(Simester and Zhang, 2010) have all been offered as explanations for the high rate of failure

of new innovations.

A second broad factor that influences success or failure is organizational structure. Both

theoretical and empirical research has shown that the integration of marketing, sales, and

research and development are critical for new product success (see for example Ayers et al.

(1997) and Ernst et al. (2010)). In a study of hundreds of new product launches in Japan,

Song and Parry (1997) show that a firm’s ability to share information throughout the orga-

nization is a key success factor. Research by Sethi and Iqbal (2008) suggests a link between

managerial learning and organizational structure. In particular, they show that a rigid

stage-gate development process can impede learning and this effect is more pronounced in

turbulent markets.

A third factor is technical skills. While management decision making and organizational

factors clearly play a role, Calantone et al. (1996) examine hundreds of new product launches

in China and the U.S. and show that technical resources and skills are critical for new product

5

success.

The strategic importance of new products has led to the development of models that

can be used to understand those factors that predict success or failure. One of the earliest

models was developed by Fourt and Woodlock (1960) and focused on modeling the trial and

repeat purchase behavior of customers. These so-called “trial-repeat” models were further

developed and enhanced by subsequent researchers, including Massy (1969), Eskin (1973),

Eskin and Malec (1976), Silk and Urban (1978), and Pringle et al. (1982). Steenkamp

and Gielens (2003) provide an extensive analysis across many categories that illustrates the

complex interactions between consumer characteristics and marketing actions that influence

trial. A core finding of these research papers is that success follows from both attracting a

large base of buyers and then encouraging them to repeat purchase. In our research, this

finding is consistent with the behavior of harbingers of success; in contrast, repeat purchases

among harbingers of failure may signal failure (Anderson et al., 2015).

In addition to trial-repeat models, researchers have developed many other approaches to

predict new product success. Moe and Fader (2003) show how prelaunch sales of music, which

typically occurs three to five weeks before the official launch, is predictive of overall music

sales. Neelamegham and Chintagunta (1999) show how domestic movie sales are predictive

of international movie sales. Garber et al. (2004) utilize the geographical distribution of

sales to predict the overall success of a new product launch. Finally, research by Calantone

and Cooper (1981) argue that success stems from the integration of multiple factors that

they refer to as scenarios; they analyze more than 200 industrial new product launches to

develop a taxonomy of scenarios that are correlated with success. Our model contributes to

these approaches by developing a novel way of predicting success that utilizes cross-category

data.

These models are part of a broader literature in marketing on predicting outcomes. Some

of the early work in this literature includes predictive model validation (Ryans, 1976), statis-

tics for model fit (Rust and Schmittlein, 1985), and methods for model selection (Bunn,

6

1979). More recently, techniques developed in machine learning have been applied to predic-

tive and other tasks in marketing applications (Cui and Curry, 2005; Dzyabura and Hauser,

2011). Our paper contributes to this literature by offering a novel methodology for predicting

new product success.

Finally, our work relates to a new and growing literature on harbingers of failure. For

instance, consider Simester et al. (2017) who use data from a two retailers to show that

zip codes that tend to have a high proportion of harbingers of failure for one retailer also

tend to have a high proportion of harbingers of failure for the other retailer; they also

analyze geographic customer movements to show that the harbinger trait is a stable customer

characteristic. Our research provides convergent support for both of these findings.

3 Data

Our principal dataset is the IRI U.S. consumer panel dataset which contains the transactions

records and demographics of 103,168 unique households from 2006 to 2009. The data is highly

comprehensive in that households report their shopping trips to over 400 major retailers

that sell products across eight grand categories (bakery, dairy, deli, edible, frozen, general

merchandise, health-beauty-care (HBC), and non-edible).

In addition, we also possess data that indicates when a product was first scanned and last

scanned in the national market through 2013. This allows us to identify new products that

were introduced to the market during the 2006 - 2009 period as well as their product lifetime

(i.e., the number of weeks sold in the national market). As a new product introduction

typically involves multiple versions or several different flavors and sizes, we restrict ourselves

to a subset of independent Universal Product Codes (UPCs) to avoid duplication in our data;

specifically, for each brand in a given new product group, we pick the UPC that was first

introduced to the market as the primary UPC. Further, as seasonal and holiday products

necessarily appear on shelves for only a limited period of time and thus their short lifetimes

are not indicative of product failure, we exclude these new products from our analysis. This

7

leaves us 47,370 independent new products.

[Table 1 about here.]

In Table 1, we present several summary statistics that describe our 47,370 new products.

We focus on on product lifetime as it will serve as our objective, continuous measure of

product success. We note that although lifetime is based on when a new product was first

scanned, a given new product may not be released in all geographic markets simultaneously;

further, even if a manufacturer or retailer discontinues a given new product at a given point

in time, sales may persist in subsequent periods due to inventory. Consequently, the new

product lifetime observed in our data may be longer than expected; however, as the definition

applies equally to all products, it should not affect the relative product lifetime observed in

our data. We note that 19,025 (40%) of our 47,370 new products were still being sold beyond

2013; consequently, the product lifetimes of these products are (right) censored. Nonetheless,

product lifetime varies substantially across the 60% of products which are uncensored.

We also present summary statistics for a variety of other variables in Table 1. As can

be seen, the majority of new products are relatively inexpensive, associated with national

brands, seldom promoted, and have comparably few unit sales in the first twenty-six weeks

after introduction.

[Figure 1 about here.]

For exploratory purposes, we compare the revenue of relatively short-lived versus rela-

tively long-lived new products in the top panel of Figure 1 (although we do not have access

to national sales data, the large number of household in our panel data provides a reasonable

approximation to revenue relative to category). In the figure, we consider only new products

that remained on shelves for at least one year, classifying those that remained for less (more)

than four years as short-lived (long-lived). The smooth curves provide the fit of a generalized

additive model with degree of smoothness estimated from the data separately for short-lived

8

and long-lived new products. As can be seen, the revenue of long-lived new products grows

rapidly in the first fifteen weeks after introduction and remains relatively stable thereafter;

on the other hand, the revenue of short-lived new products declines from the start.

As the difference in revenue between the two curves in the top panel of Figure 1 reflects

both differences in (i) adoption rates in the early phases and (ii) repeat purchase rates in

the later phases, we decompose these curves into the two components shown in the bottom

two panels of the figure. As can be seen, long-lived new products have both higher adoption

and repeat purchase rates relative to short-lived new products.

4 Replication of Anderson et al. (2015)

In this section, we replicate the results of Anderson et al. (2015), which were based on a

single, national retail chain, using our more extensive data. In particular, we apply exactly

the same methodology as Anderson et al. (2015) and find evidence for both harbingers of

failure as well as harbingers of success.

The methodology of Anderson et al. (2015) involves four steps. First, two quantities are

arbitrarily defined: new products are classified as successes or failures based on whether or

not their lifetimes exceed some threshold chosen by the researcher and an “initial evaluation

period” used to assess the early performance of a new product is defined. We choose four

years as our threshold for new product success versus failure; this implies an average failure

rate of 39% which is conservative as compared to the failure rate of 75% documented in

Schneider and Hall (2011) (to obtain a failure rate of 75% would require a threshold of six

years). We also define the first twenty-six weeks after introduction as the initial evaluation

period. Given this, we let Ti denote the lifetime of new product i in weeks, and we define

yi = 1(Ti > 208) as our new product success indicator variable and xi,h as the number of

units of new product i purchased by household h in the initial evaluation period. We also

note two facts that follow from these two definitions: (i) all new products with censored

product lifetimes had lifetimes in excess of four years and thus censoring has no impact on

9

any results presented in this section and (ii) 81,195 households purchased one or more new

products in the initial evaluation period of twenty-six weeks and thus impact results in this

section and in the remainder of this manuscript.

[Table 2 about here.]

Second, the data is split by product into three datasets: a calibration dataset, an in-

sample (or estimation) dataset, and an out-of-sample dataset. We follow Anderson et al.

(2015) and split our datasets by year of new product introduction; in particular, we use

all 25,957 new products introduced in 2006 and 2008 as our calibration dataset, a random

sample of 17,130 (80%) new products introduced in 2007 and 2009 as our in-sample dataset,

and the remaining 4,283 (20%) new products introduced in 2007 and 2009 as our out-of-

sample dataset (Anderson et al. (2015) conduct model evaluation only in-sample and thus

lack a separate out-of-sample dataset). We illustrate our notation and datasets in Table

2 and note that, in the remainder of this manuscript, the subscript i1 always indexes new

products in the calibration dataset, the subscript i2 always indexes new products in the

in-sample dataset, and the subscript i3 always indexes new products in the out-of-sample

dataset

Third, the calibration dataset is used to compute ah, the so-called “flop affinity” of each

household h. Formally, flop affinity is defined as

ah =

∑i1

1(yi1 = 0)1(xi1,h > 0)∑i1

1(xi1,h > 0)

which is the fraction of new products in the calibration dataset purchased by household h

that are classified as failures (i.e., “flops”). Then, households are grouped into four equal-

sized flop affinity segments based on the quartiles (a1, a2, a3) of the distribution of ah.

Finally, the logistic regression model

logit(pi) = β0 +4∑j=1

βjSi,j + β5Si (1)

10

where pi = P(yi = 1); Si,j =∑

h 1(aj−1 < ah ≤ aj)xi,h is the total sales of new product

i among households in flop affinity segment j in the initial evaluation period and where

a0 = −ε and a5 = 1 for any ε ∈ R+; and Si is the total sales of new product i among

households for which flop affinity is undefined (i.e., households that did not purchase any

new products in the calibration dataset years 2006 and 2008 as well as the households that

were present in the data in only the in-sample and out-of-sample dataset years 2007 and

2009). The model is fit using each product i2 in the in-sample dataset and evaluated using

each product i3 in the out-of-sample dataset.

The performance of the model in Equation (1) is compared with that of a more typical

new product forecasting model in which sales to all households are treated equally, namely

the logistic regression model

logit(pi) = β0 + β1

(4∑j=1

Si,j + Si

)= β0 + β1Si (2)

where Si gives the total sales of new product i among all households in the initial evaluation

period.

[Table 3 about here.]

We present results from the benchmark model (Equation (2)) and the model of Anderson

et al. (2015) (Equation (1)) respectively in the first two columns of Table 3. The positive

coefficient of total sales S in the benchmark is consistent with the conventional wisdom that

early sales portend future success. However, the results of the model in Anderson et al.

(2015) yield a more nuanced interpretation: while the coefficients for sales to the first two

flop affinity segments are positive the coefficients to the last two are negative. In other words,

sales to households with low flop affinity portend new product success while sales to those

with high flop affinity portent new product failure. Further, our model fit statistics (log

likelihood (LL) and area under the receiver operating characteristic curve (AUC), in-sample

11

and out-of-sample) show the model of Anderson et al. (2015) outperforms the benchmark

model.

To test the robustness of this result, we consider two additional models that generalize the

model of Anderson et al. (2015), in particular by successively adding three product covariates

(price, private label indicator, and promotion frequency) and category effects (fixed effects for

each of the eight categories; random effects for each of the 291 subcategories) to the model.

These results for these models are presented in the third and fourth columns of Table 3

respectively. As can be seen, the principal results remain unchanged: sales to households

with low (high) flop affinity portend new product success (failure).

In sum, we replicate the principal results of Anderson et al. (2015) and find evidence of

harbingers of failure using our more extensive data; we also find evidence of harbingers of

success.

For a further comparison to the model of Anderson et al. (2015), we also fit logistic

regression model to the data treating the success of new products i2 in the in-sample dataset

as binary as in Anderson et al. (2015) and this section but treating the βh as in Equation

4 of the next section. Again, we find that household that purchase many short-lived new

products are more likely to be harbingers of failure; see Appendix A.1 for details.

5 Model

In this section, we introduce novel methodology that addresses a number of limitations of

the model of Anderson et al. (2015). First, rather than requiring that new products be

arbitrarily classified as successes or failures and using yi as the measure of product success,

we use an objective, continuous measure of product success, namely the product lifetime Ti.

Second, rather than requiring that households be arbitrarily grouped into four flop affinity

segments such that all households in a given segment are constrained to have the same effect

on new product success, we allow for individual household-level effects. Third, and perhaps

most subtly, rather than requiring that flop affinity (and thus the household-level effects) be

12

based merely on the fraction of new product purchased that are classified as failures (and

thus, for example, treating households that purchase one new product thusly classified out

of two total the same as households that purchase ten new products thusly classified out of

twenty total), we use a more general measure.

As a first step towards relaxing the first limitation, rather than employing a logistic

regression model as in Anderson et al. (2015), we employ a survival model, in particular a

Cox proportional hazards model (Cox, 1972), that models the product lifetime Ti. In its

most general form, our model is given by

λ(Ti) = λ0(Ti) exp

(β0 +

H∑h=1

βhxi,h + βH+1Si

)(3)

where λ(Ti) is the hazard function for product i with lifetime Ti and λ0(Ti) is the baseline

hazard function. As before, the model is fit using each product i2 in the in-sample dataset

and evaluated using each product i3 in the out-of-sample dataset.

Within the hazards model framework, the analogue of the benchmark model of Equation

(2) involves constraining βh = β for h ∈ {1, ..., H + 1} in Equation (3) such that sales to

all households are treated equally and estimating β using each product i2 in the in-sample

dataset. Similarly, the analogue of the Anderson et al. (2015) model of Equation (1) involves

constraining the βh for h ∈ {1, ..., H} to follow a step function with three steps where the

location of the steps are based on the quartiles of the distribution of ah and the levels of

the steps are estimated using each product i2 in the in-sample dataset; this in tandem with

Equation (3) yields the hazards model analogue of the model of Anderson et al. (2015).

While this moves toward relaxing the first limitation discussed above (i.e., in that the

dependent variable is the objective, continuous measure product lifetime Ti rather than the

arbitrary classification yi), it does not fully relax it as flop affinity ah still requires that new

products be arbitrarily classified as successes or failures; further, it does not relax either the

second or the third limitations.

13

Consequently, we take an alternative approach that fully relaxes all three limitations.

Key to our approach is recognizing that flop affinity can be written as

ah =

∑i1

1(yi1 = 0)1(xi1,h > 0)∑i1

1(xi1,h > 0)=

∑i1

1(Ti1 ≤ 208)1(xi1,h > 0)∑i1

1(xi1,h > 0)=

∑i1g1(Ti1)g2(xi1,h)

g3(xxxh)

where xxxh is a vector containing the xi1,h and (i) g1(x) = 1(x ≤ 208); (ii) g2(x) = 1(x > 0);

and (iii) g3(xxx) =∑

i 1(xi > 0). Given this, we alter g1, g2, and g3 to create a generalized

flop affinity which will serve as our βh. In particular, we set

βh =

∑i1w(Ti1)1(xi1,h > 0)(∑i1

1(xi1,h > 0))γ (4)

in which case (i) g1 = w is a flexible function estimated from the in-sample data using splines

(specifically thin plate regression splines (Wood, 2003)); (ii) g2 is as above; and (iii) g3 is

as above but exponentiated by γ ≥ 0. This model thus relaxes all three limitations of the

model of Anderson et al. (2015): (i) product lifetimes are treated continuously by both λ

and w; (ii) household effects βh are at the individual-level and are flexibly determined by w;

and (iii) the number of new products purchased impacts the household-level effects via γ.

Of particular note are two key differences in the manner in which the βh are estimated

by the model of Anderson et al. (2015) and this model. First, rather than weighting each

new product purchased in the calibration set in a binary manner via 1(Ti1 ≤ 208) as in

Anderson et al. (2015), this model weights each in a continuous manner via w, a function

of our objective, continuous measure of product success, namely the product lifetime Ti;

consequently, we hereafter refer to w as our weight function. Second, rather than imposing

a rigid functional form for the βh and estimating it in part from the calibration dataset and

in part from the in-sample dataset as in Anderson et al. (2015) (i.e., the locations of the step

function are estimated using (only) each product i1 in the calibration dataset while the levels

of the steps are estimated using each product i2 in the in-sample dataset), this model allows

a flexible functional form for the βh (via w and γ) and estimates it in a more principled

14

manner using only the in-sample dataset; consequently, it is semi-parametric not only in the

usual Cox proportional hazards sense but also in the sense that βh has both parametric and

non-parametric components.

The censoring of new products is naturally accommodated in the hazards model frame-

work; consequently, censoring of each product i2 in the in-sample dataset and each product

i3 in the out-of-sample dataset poses no difficulty for our model. However, censoring of each

product i1 in the calibration dataset does pose a problem as formally Ti1 (and thus w(Ti1))

is not defined for censored products. In our principal analysis presented in the main text

of this manuscript, we simply assume Ti1 is equal to our the ultimate date in our dataset

(i.e., the last week 2013); this necessarily underestimates Ti1 as we know Ti1 does in fact

fall beyond this date resulting in a conservative assumption provided that longer product

lifetimes do indeed portend product success (i.e., w is non-increasing). In an additional

analysis presented in Appendix A.2, we fully model the censoring of each product i1 in the

calibration dataset.

Estimation of our model parameters w and γ proceeds as follows. Conditional on γ, we

estimate w via penalized maximum likelihood using the gam function of the mgcv package in

R (Wood, 2011). We then conduct a grid search over γ to obtain the optimum w and γ.

6 Results

6.1 Model Evaluation

To validate our proposed approach, we evaluate our model against three competitor models.

The first two are the hazard framework analogue of the benchmark model and the Anderson

et al. (2015) model discussed in Section 5; we label these models “Benchmark” and “Ander-

son” respectively. We also consider a competing approach that models the household-level

effects βh as a linear function of a vector of demographic variables dddh such that βh = dddh ·ααα;

we label this model “Demographics” and note it is a particularly relevant competitor model

because managers often in practice use demographic variables for segmentation and target-

15

ing. We note that dddh is composed of variables indicating household income, household size,

age of female and male head of household, and indicators for households that contain a single

child, contain two or more children, consist only of a female, and consist of only of a male1.

We evaluate our model specifications using two metrics, the partial log likelihood (PLL)

and the integrated area under the receiver operating characteristic curve (IAUC); these

metrics are the respective survival model analogues of LL and AUC used in Table 3. To

define PLL, we let Zi = β0 +∑

h βhxi,h + βH+1Si such that λ(Ti) = λ0(Ti) exp(Zi). Then,

the partial log likelihood is given by

∑i:Ci=0

Zi − log∑

j:Tj≥Ti

exp(Zj)

where Ci is a binary variable indicating that the lifetime of product i is right censored (i.e.,

is still being sold through 2013).

To define IAUC, we note that Zi can be used to predict whether or not product i fails

by time t, in particular by thresholding Zi at some value. By varying this value, one can

obtain specificity and sensitivity–and thus the receiver operating characteristic curve and

the AUC–as for any binary classifier (see Chambless and Diao (2006) for details). IAUC is

then defined as this AUC, which is a function of time t, averaged over all values of t.

[Table 4 about here.]

[Figure 2 about here.]

We present our model evaluation results in Table 4. As can be seen, our proposed

approach outperforms the alternative models on products i2 in the in-sample dataset and

products i3 in the out-of-sample dataset. We also plot the out-of-sample AUC across time–

the average of which is IAUC–in Figure 2. Again, our proposed approach outperforms the

alternative models.

16

6.2 Principal Results

[Figure 3 about here.]

We present our estimate of the weight function w in Figure 3. As can be seen, the

estimated weight function is decreasing in the product lifetime and, importantly, is positive

(negative) for sufficiently short (long) lifetimes. This implies that households that purchases

many short-lived new products associate with higher failure risk; in other words, consistent

with the results of Anderson et al. (2015), such households are more likely to be harbingers

of failure.

[Figure 4 about here.]

We present our estimates of the βh (computed via Equation (4)) in Figure 4. 44% of

households have positive βh thus implying that 44% (56%) of households are harbingers of

failure (success); further, 31% and 44% (27% and 40%) of households have positive and

negative βh with 95% (99%) intervals that do not overlap zero respectively.

[Figure 5 about here.]

Our estimates of w and the βh naturally raise the question of the impact of harbingers of

success and failure on new product lifetimes. In Figure 5, we plot the Kaplan-Meier estimate

of the baseline survival probability along with how that estimates changes were all sales of

a given new product to harbingers of success versus harbingers of failure. As can be seen,

the impact can be large. For example, the baseline survival probability of a new product

two (four) years after is 0.82 (0.59); were all sales to harbingers of success, it would rise to

0.84 (0.62) but were all sales to harbingers of failure it would fall to 0.79 (0.55). In general,

the difference between the survival probability estimates when all sales are to harbingers of

success versus failure is larger than 0.05.

Before proceeding and as discussed above, we note we present a further comparison to the

model of Anderson et al. (2015) in Appendix A.1 and additional analysis that fully models

the censoring of each product i1 in the calibration dataset in Appendix A.2.

17

6.3 Covariate Results

[Table 5 about here.]

To describe harbingers of success and failure in terms of demographic variables, we regress

the βh on various variables and present results in the first column of Table 5 (the second

column of this table will be discussed in Section 7.2). As can be seen, households with higher

income, more children, larger family size, and with only a female head of household are more

likely to be harbingers of failure (i.e., have βh > 0).

[Figure 6 about here.]

To evaluate the profitability of harbingers of success and failure to retailers, we present

their revenue contribution to a set of retailers in Figure 6. As an be seen, harbingers

of failure are not necessarily a customer segment to avoid for warehouse clubs and mass

merchandisers; indeed, they account for more than 50% of the revenue of these retailers

despite accounting for only 44% of the population. On the other hand, harbingers of success

spend disproportionately at channels such as drug stores and grocery stores. Thus, although

from the perspective of manufacturers harbingers of failure portend new product failure,

from the perspective of retailers in particular mass merchandisers and warehouse clubs they

are an important source of revenue.

6.4 Cross-category Results

Thus far, our model has treated all new products identically. In particular, the impact of

the lifetime of new product i1 purchased in the calibration dataset on the hazard rate of

new product i2 in the in-sample dataset (or product i3 in the out-of-sample dataset) does

not vary by the category of the products. As a robustness test, we now consider a simple

extension of our model that allows for differential impact by category. In particular, instead

of estimating a single βh for each household, we estimate two in order to account for whether

or not new products i1 and j match in category for products i1 in the calibration datasets

18

and j in the in-sample or out-of-sample datasets; specifically, we replace Equation (4) by

βh,j =

∑i1w1(c(i1)=c(j))(Ti1)1(xi1,h > 0)(∑

i11(xi1,h > 0)

)γwhich amounts to reparameterizing the weight function w as

wc(i1),c(j)(Ti1) = w1(Ti1) · 1(c(i1) = c(j)) + w0(Ti1) · 1(c(i1) 6= c(j))

where c(i) gives the category of product i.

[Figure 7 about here.]

We present our results in Figure 7. As in Figure 3, the weight functions are positive

(negative) for sufficiently short (long) lifetimes. This again implies that a household that

purchases many short-lived new products is more likely to be a harbinger of failure and, as

indicated in the right panel, this holds even when the product is the calibration dataset is

from a different category than the product in the in-sample or out-of-sample dataset.

We note we present a more sophisticated version of this cross-category analysis in Ap-

pendix A.3.

7 Mechanism

7.1 Hypotheses and Implications

Our main results show that the purchase of a given new product by a household that has

bought relatively short-lived new products in the past portends the given new product is

more likely to also be relatively short-lived. One might argue that this relationship is driven

by correlation among products. For instance, it seems reasonable that relatively short-lived

new products may share common attributes that are objectively inferior and that this is

especially likely to be the case within a given category; further, it seems reasonable that

19

consumers are likely to purchase similar products within a given category due to category-

specific preference.

Alternatively, this relationship may be driven by the consistent household-specific behav-

ior, regardless of whether or not the products various household purchase are similar. For

instance, a household with unrepresentative tastes will tend to purchase unpopular products

even if those products have no common attributes at all.

Importantly, this distinction has different implications for new product development.

In particular, the former explanation suggests that firms should learn from attributes of

previously-launched products in order to predict new product success while the latter expla-

nation suggests firms should learn from the purchase behavior of individual households (and

harbingers of success and failure in particular).

We argue the former explanation, while clearly not unimportant, is simply incomplete.

In particular, if this and only this explanation holds, we would expect strong correlation

among products within category but very little across categories. However, as demonstrated

in Section 6.4, purchases by harbingers of success and failure are predictive of new product

lifetime not only for new products within the same category but also for new products across

different categories. Hence, it seems unreasonable to hold that correlation among products

is the sole driver of our results. Indeed, it would appear that traits that drive common

household behavior across different categories must play some role in driving our results. In

the following, we hypothesize several traits that might explain why purchases by harbingers

of success and failure are predictive of new product lifetime; we then test them with both

our principal data as well as survey data to be discussed below.

Hypothesis 1: Harbingers of failure have unrepresentative tastes.

Anderson et al. (2015) suggest that harbingers of failure have unrepresentative tastes so that

the products they purchase do not match the preference of the mass market (and thus new

products they purchase are likely to ultimately fail). As evidence, they find that harbingers

of failure are more likely to purchase niche (existing) products. We conduct a similar analysis

20

with our data to investigate whether harbingers of failure are more likely to purchase less

popular (existing) products.

Hypothesis 2: Harbingers of failure search less.

An alternative explanation emphasizes the role of consumer search. Assuming that short-

lived new products are inferior to other products and that discerning good products from

bad products requires market knowledge, harbingers of failure may be likely to purchase

poor new products due to a lack of information. Consequently, under this hypothesis, we

would expect that harbingers of failure search less, and thus for example visit fewer stores

and pay higher prices.

Hypothesis 3: Harbingers of failure are not opinion leaders.

Literature on new product diffusion (Godes and Mayzlin, 2004; Chevalier and Mayzlin, 2006;

Bell and Song, 2007; Nair et al., 2010) maintains that word of mouth generated by early

adopters of new products is an important driver of whether or not these products achieve

fast penetration as well as, ultimately, long-run success. If harbingers of failure either (i) are

not early adopters of new products or (ii) do not generate word of mouth that is influential

enough to convince other consumers, this portends poorly for these products. To investigate

this explanation partially with our principal data, we examine whether harbingers of failure

are early or late adopters; we then use additional survey data to directly assess their opinion

leadership.

Hypothesis 4: Harbingers of failure are more innovative.

Because managers are often eager to launch new products, the psychological effort exerted by

and behavioral changes required of consumers in adopting new products are often overlooked

(Gourville, 2006). If early adopters are those who are more innovative–that is, if the adoption

of new products requires lower psychological effort and behavioral change for these consumers

as compared to more typical ones–this early adoption will not necessarily indicate that other

consumers will accept new products as quickly and easily as they do. To investigate this

explanation, we again examine whether harbingers of failure are early or late adopters.

21

Before proceeding, we note the differing implications of hypotheses three and four as

related to our examination of them with our principal data (though not our survey data).

If, as per Hypothesis 3, harbingers of failure are not opinion leaders, this suggests they

may not be early adopters (although it need not preclude it). On the other hand, if, as

per Hypothesis 4, harbingers of failure are more innovative, this suggests they are in fact

early adopters. Thus, these contrasting implications allow us to distinguish, at least in part

with our principal data, hypotheses three and four; our survey data, which provides direct

measures, allows for a more conclusive investigation.

Hypothesis 5: Harbingers of failure are more variety-seeking.

Strong early sales are believed to portend future success of new products. However, this is

only the case if initial trial leads to repeat purchase. If the consumers who contribute to early

sales are variety-seeking, then they are less likely to purchase the same new product again.

We investigate this hypothesis by testing whether harbingers of failure tend to purchase more

different brands within a category.

7.2 Results from Principal Data

In this section, we examine our five hypothesis in light of our principal data considered thus

far. The behavioral variables we use to test each hypothesis are, respectively, (i) unrepre-

sentative tastes as measured by the average popularity rank of existing products purchased

by a household across categories, (ii) store search as measured by the average number of

chains visited by a household per week and price search as measured by the frequency of

obtaining price discounts, (iii) opinion leadership as measured by the average adoption lag

in time between the product launch and first purchase by a household, (iv) innovativeness

as measured by the same, and (v) variety-seeking as measured by the average number of

brands purchased by a household per category.

To investigate these relationships, we use the procedure discussed in Section 6.3, that is,

regressing the βh on the various behavioral variables. We present results in the second column

of Table 5. The results show that (i) purchasing less popular products (i.e., those with higher

22

popularity rank), (ii) larger adoption lag (i.e., late adoption), and (iii) larger variety-seeking

are all associated with larger βh (where positive βh implies a household is a harbinger of

failure). Hence, the results from our principal data are consistent with hypotheses one,

three, and five.

7.3 Results from Survey Data

While the behavioral patterns discuss in the prior subsection are suggestive of why purchase

of a given new product by a household that has bought relatively short-lived new products in

the past portends the given new product is more likely to also be relatively short-lived, they

do not directly assess constructs associated with our hypotheses. Therefore, we augment the

analysis with a survey in which we directly measure these constructs.

Our survey subjects were 280 members of the Qualtrics online panel. Subjects were first

presented with a mix of thirty relatively short-lived and relatively long-lived products from

our principal data; they were asked to recall how many times they purchased these products

in the past and to rate their hypothetical purchase intention for the product on a one-to-five

integer scale. Subjects were then asked to rate how much they liked (again, on a one-to-five

integer scale) a set of four products pre-classified as popular and nice as our measure of

unrepresentative taste. We then measured search intensity using the procedure of Beatty

and Smith (1987), opinion leadership using the procedure of Goldsmith et al. (2003), inno-

vativeness using the procedure of Baumgartner and Steenkamp (1996), and variety-seeking

using the procedure of Van Trijp and Steenkamp (1992). Finally, the survey concluded with

a set of demographic questions.

Using Equation 4 and our estimates of w and γ from Section 6.2, we calculate βh for each

of our 280 subjects in two ways: first using purchase recall to define 1(xi1,h > 0) and second

using purchase intent of four or five to define it and summing across the thirty products

examined in the survey. Because the short-lived products did not remain in the market

for very long, purchase recall for these products was typically zero resulting in very little

variation among the βh calculated in the first manner; consequently, we prefer the second

23

manner.

[Table 6 about here.]

We present the correlation of each of our five behavioral variables with the estimates

of βh discussed in the prior paragraph in Table 6. Due to the little variation among the

βh calculated in the first manner, there are no statistically significant correlations with the

variables; however, when βh is calculated in the second manner, the results show (i) larger

opinion leadership, (ii) larger innovativeness, and (iii) larger variety-seeking are all associated

with larger βh (where positive βh implies a household is a harbinger of failure). Hence, the

results from our principal data are consistent with hypotheses four and five (we note the

result on opinion leadership does not match the sign predicted by hypothesis three).

Collectively, the results from our principal data and survey data suggest the variety-

seeking explanation appears to have the most support as an explanation for why the purchase

of a given new product by a household that has bought relatively short-lived new products

in the past portends the given new product is more likely to also be relatively short-lived.

This coheres with the results presented in Figure 1: harbingers of failure are more variety-

seeking and therefore make fewer repeated purchases resulting in a lower repurchase rate for

relatively short-lived new products as shown in the lower right panel of the figure.

8 Discussion

In this paper, we have extended the work of Anderson et al. (2015) along several lines.

First, we have replicated their findings in a dataset that covers over 400 U.S. retailers and

a wide range of product categories. Second, we have developed a novel semi-parametric

approach that treats product success in a continuous manner and yields both interpretable

consumer-level estimates and improved predictive accuracy; our model shows inter alia that

the purchase of a given new product by a household that has bought relatively short-lived new

products in the past portends the given new product is more likely to also be relatively short-

lived and that this holds even across categories. Third, we have characterized harbingers

24

of failure using our rich, household-level demographic data showing that they are wealthier,

have more children and larger family size, and shop at warehouse clubs. Finally, we have

investigated potential mechanisms that explain the harbingers of failure phenomenon finding

that harbingers of failure are more variety-seeking.

Our novel methodology also illustrates that the new product purchase behavior of house-

holds is an informative but noisy, positive signal of success. That is, slightly more than half

(56%) of households in our sample are indeed harbingers of success. This is consistent with

decades of research that has found a positive correlation between trial, repeat purchase, and

new product success. Perhaps more illuminating is that nearly half (44%) of households

are harbingers of failure. Our new methodology, which allows for individual-level estimates,

allows us to recover this metric. The size of this segment is likely to be a surprise to many

academics and managers.

Our findings also have important managerial implications. Contrary to the conventional

wisdom that early sales portend future success, firms should pay attention not only to how

much their new products are selling but also to whom they are selling. Consequently, the

use of individual-level rather than aggregate-level data can lead to improve the predictive

performance of new product forecasting models. Indeed, managers with rich individual-

level information can potentially incorporate our methodology and insights–in particular our

individual-level estimates βh–not only after launch but at earlier stages in the new product

development process.

25

References

Anderson, E. T., Lin, S., Simester, D., and Tucker, C. (2015). Harbingers of failure. Journal

of Marketing Research 52, 5, 580–592.

Ayers, D., Dahlstrom, R., and Skinner, S. J. (1997). An exploratory investigation of organi-

zational antecedents to new product success. Journal of Marketing Research 107–116.

Baumgartner, H. and Steenkamp, J.-B. E. (1996). Exploratory consumer buying behavior:

Conceptualization and measurement. International Journal of Research in Marketing 13,

2, 121–137.

Beatty, S. E. and Smith, S. M. (1987). External search effort: An investigation across several

product categories. Journal of consumer research 83–95.

Bell, D. R. and Song, S. (2007). Neighborhood effects and trial on the internet: Evidence

from online grocery retailing. Quantitative Marketing and Economics 5, 4, 361–400.

Biyalogorsky, E., Boulding, W., and Staelin, R. (2006). Stuck in the past: Why managers

persist with new product failures. Journal of Marketing 70, 2, 108–121.

Boulding, W., Morgan, R., and Staelin, R. (1997). Pulling the plug to stop the new product

drain. Journal of Marketing research 164–176.

Brockner, J. (1992). The escalation of commitment to a failing course of action: Toward

theoretical progress. Academy of management Review 17, 1, 39–61.

Brockner, J. and Rubin, J. Z. (1985). Entrapment in escalating conflicts. Social Psychological

Analysis .

Bunn, D. W. (1979). The synthesis of predictive models in marketing research. Journal of

Marketing Research 280–283.

26

Calantone, R. and Cooper, R. G. (1981). New product scenarios: Prospects for success. The

Journal of Marketing 48–60.

Calantone, R. J., Schmidt, J. B., and Song, X. M. (1996). Controllable factors of new product

success: A cross-national comparison. Marketing Science 15, 4, 341–358.

Chambless, L. E. and Diao, G. (2006). Estimation of time-dependent area under the roc

curve for long-term risk prediction. Statistics in medicine 25, 20, 3474–3486.

Chevalier, J. A. and Mayzlin, D. (2006). The effect of word of mouth on sales: Online book

reviews. Journal of marketing research 43, 3, 345–354.

Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal

Statistical Society 34, 187–220.

Crawford, C. M. (1977). Marketing research and the new product failure rate. The Journal

of Marketing 51–61.

Cui, D. and Curry, D. (2005). Prediction in marketing using the support vector machine.

Marketing Science 24, 4, 595–615.

Dzyabura, D. and Hauser, J. R. (2011). Active machine learning for consideration heuristics.

Marketing Science 30, 5, 801–819.

Ernst, H., Hoyer, W. D., and Rubsaamen, C. (2010). Sales, marketing, and research-and-

development cooperation across new product development stages: implications for success.

Journal of Marketing 74, 5, 80–92.

Eskin, G. J. (1973). Dynamic forecasts of new product demand using a depth of repeat

model. Journal of Marketing Research 115–129.

Eskin, G. J. and Malec, J. (1976). A model for estimating sales potential prior to the test

market. In Proceeding 1976 Fall Educators Conference, Series No, vol. 39, 230–233.

27

Fourt, L. A. and Woodlock, J. W. (1960). Early prediction of market success for new grocery

products. The Journal of Marketing 31–38.

Garber, T., Goldenberg, J., Libai, B., and Muller, E. (2004). From density to destiny: Using

spatial dimension of sales data for early prediction of new product success. Marketing

Science 23, 3, 419–428.

Godes, D. and Mayzlin, D. (2004). Using online conversations to study word-of-mouth

communication. Marketing Science 23, 4, 545–560.

Goldsmith, R. E., Flynn, L. R., and Goldsmith, E. B. (2003). Innovative consumers and

market mavens. Journal of Marketing theory and practice 11, 4, 54–65.

Gourville, J. T. (2006). Eager sellers and stony buyers. Harvand Business Review 99–106.

IRI (2013). 2012 iri new product pacesetters .

Massy, W. F. (1969). Forecasting the demand for new convenience products. Journal of

Marketing Research 405–412.

Moe, W. and Fader, P. (2003). Using advance purchase orders to forecast new product sales

(vol 21, pg 347, 2002). Marketing Science 22, 1, 146–146.

Nair, H. S., Manchanda, P., and Bhatia, T. (2010). Asymmetric social interactions in physi-

cian prescription behavior: The role of opinion leaders. Journal of Marketing Research

47, 5, 883–895.

Neelamegham, R. and Chintagunta, P. (1999). A bayesian model to forecast new product

performance in domestic and international markets. Marketing Science 18, 2, 115–136.

Pringle, L. G., Wilson, R. D., and Brody, E. I. (1982). News: A decision-oriented model for

new product analysis and forecasting. Marketing science 1, 1, 1–29.

28

Rust, R. T. and Schmittlein, D. C. (1985). A bayesian cross-validated likelihood method

for comparing alternative specifications of quantitative models. Marketing Science 4, 1,

20–40.

Ryans, A. B. (1976). Evaluating aggregated predictions from models of consumer choice

behavior. Journal of Marketing Research 333–338.

Schneider, J. and Hall, J. (2011). Why most product launches fail. Harvard business review

21–23.

Sethi, R. and Iqbal, Z. (2008). Stage-gate controls, learning failure, and adverse effect on

novel new products. Journal of Marketing 72, 1, 118–134.

Silk, A. J. and Urban, G. L. (1978). Pre-test-market evaluation of new packaged goods: A

model and measurement methodology. Journal of marketing Research 171–191.

Simester, D., Tucker, C., and Yang, C. (2017). The surprising breadth of harbingers of

failure. Working paper .

Simester, D. and Zhang, J. (2010). Why are bad products so hard to kill? Management

Science 56, 7, 1161–1179.

Song, X. M. and Parry, M. E. (1997). The determinants of japanese new product successes.

Journal of marketing research 64–76.

Steenkamp, J.-B. E. and Gielens, K. (2003). Consumer and market drivers of the trial

probability of new consumer packaged goods. Journal of Consumer Research 30, 3, 368–

384.

Van Trijp, H. C. and Steenkamp, J.-B. E. (1992). Consumers’ variety seeking tendency

with respect to foods: measurement and managerial implications. European Review of

Agricultural Economics 19, 2, 181–195.

29

Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society:

Series B (Statistical Methodology) 65, 1, 95–114.

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood

estimation of semiparametric generalized linear models. Journal of the Royal Statistical

Society (B) 73, 1, 3–36.

30

Notes

1Household income is a categorical variable consisting of thirteen unique categories; we treat this variable

linearly using the values one through thirteen. Household size is a categorical variable consisting of indicators

for households of size one through seven as well as an indicator for households of size eight or more; we treat

this variable linearly using the values one through eight. Age of female and male head of household are

categorical variables consisting of seven unique categories; we treat this variable linearly using the values

one through seven.

31

A Additional Analyses

A.1 Replication of Anderson et al. (2015)

[Figure 8 about here.]

For a further comparison to the model of Anderson et al. (2015), we fit a logistic regression

model to the data treating the success of new products i2 in the in-sample dataset as binary

as in Anderson et al. (2015) and Section 4 but treating the βh as in Equation 4. In particular,

we use the definition of new product success used in Section 4 (i.e., yi = 1(Ti > 208)) and

fit the model

logit(pi) = β0 +∑h

βhxi,h + βH+1Si (5)

where where pi = P(yi = 1) and βh is as in Equation (4).

We present our estimate of the weight function w in Figure 8. As can be seen, the

estimated weight function is increasing in the product lifetime and, importantly, is negative

(positive) for sufficiently short (long) lifetimes. This implies that a household that purchases

many short-lived new products is more likely to be a harbinger of failure and is consistent

with the results of Anderson et al. (2015).

A.2 Principal Results

To fully model the censoring of each product i1 in the calibration dataset, we allow the

weight function w to vary depending upon whether each product i1 in the calibration dataset

is censored or uncensored. Specifically, we replace w in Equation (4) with wCi1(Ti1) where

Ci1 is a binary variable indicating that the lifetime of product i1 is right censored such

that we now estimate two separate weight functions, one for censored products and one for

uncensored products.

Upon fitting this model, the estimated weight function w1 for censored products was

roughly constant. Consequently, we refit the model constraining it to be a constant; model fit

statistics indicated this resulted in no loss in performance so we proceed with the constrained

32

model2.

[Figure 9 about here.]

Figure 9 shows that, consistent with our principal results in Section 6.2, the estimated

weight function for uncensored products is positive for short lifetimes but negative for long

lifetimes; this again implies that a household that purchases many short-lived new products

is more likely to be a harbinger of failure. It also shows the weight function for censored

products is negative which implies that a household that purchases new products still being

sold is less likely to be a harbinger of failure.

A.3 Cross-category Results

To more fully model cross-category effects, we extend the analysis discussed in Section 6.4

to allow the household effects to account not only for whether or not new products i1 and j

match in category for products i1 in the calibration datasets and j in the in-sample or out-

of-sample datasets but also for the respective categories of products i1 and j; specifically, we

replace Equation (4) by

βh,i1,j =

∑i1wc(i1),c(j)(Ti1)1(xi1,h > 0)(∑

i11(xi1,h > 0)

)γby reparameterizing the weight function w as

wc(i1),c(j)(Ti1) = w1(Ti1) · 1(c(i1) = c(j)) + w0(Ti1) · 1(c(i1) 6= c(j)) + uc(i1),c(j)(Ti1)

where c(i) gives the category of product i and uc(i1),c(j)(Ti1) that accounts for the respective

categories of products i1 and j.

Because data for many category pairs is relatively sparse, our estimation of the u differs

from that of the w. Specifically, while we still use the gam function of the mgcv package

in R (Wood, 2011), the u are treated as random effects while the w are, as they have been

throughout, treated as fixed effects.

33

[Figure 10 about here.]

We present our results in Figure 10. As in Figures 3 and 7, the weight functions are

positive for sufficiently short lifetimes. This again implies that a household that purchases

many short-lived new products is more likely to be a harbinger of failure and, as indicated

in the right panel, this holds even when the product is the calibration dataset is from a

different category than the product in the in-sample or out-of-sample dataset.

34

List of Tables

1 Summary Statistics. Product lifetime as it will serve as our objective, con-tinuous measure of product success and varies substantially across the 60%of products which are uncensored. The majority of new products are rela-tively inexpensive, associated with national brands, seldom promoted, andhave comparably few unit sales in the first twenty-six weeks after introduction. 36

2 Notation and Datasets. Ti gives the lifetime of new product i in weeks andxi,h gives the number of units of new product i purchased by household h inthe initial evaluation period (i.e., first twenty-six weeks after introduction).The calibration dataset consists of all new products introduced in 2006 and2008, the in-sample dataset consists of a random sample of 80% of new prod-ucts introduced in 2007 and 2009, and the out-of-sample dataset consists ofthe remaining 20% of new products introduced in 2007 and 2009. In the re-mainder of this manuscript, the subscript i1 always indexes new products inthe calibration dataset, the subscript i2 always indexes new products in thein-sample dataset, and the subscript i3 always indexes new products in theout-of-sample dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3 Replication of Anderson et al. (2015). The model presented in the first columnis the benchmark new product forecasting model and the model presented inthe second column is that of Anderson et al. (2015). The models presented inthe third and fourth columns generalize the model of Anderson et al. (2015)by successively adding three product covariates (price, private label indicator,and promotion frequency) and category effects (fixed effects for each of theeight categories; random effects for each of the 291 subcategories). The cellsin the upper right subtable give coefficient estimates (estimated standard er-rors). LL denotes log likelihood and AUC denotes the area under the receiveroperating characteristic curve. . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Model Evaluation. We evaluate our proposed approach against three alterna-tive models: the hazard framework analogue of the benchmark model and theAnderson et al. (2015) model as well as one that models the household-leveleffects as a linear function of a vector of demographic variables. Our proposedapproach outperforms the alternative models on products i2 in the in-sampledataset and products i3 in the out-of-sample dataset. . . . . . . . . . . . . . 39

5 Covariate Results. The model in column one (two) is a regression of the βhon demographic variables (demographic variables and behavioral variables).The second column of this table will be discussed in Section 7.2. Coefficientestimates are presented on the z-score scale such that they indicate the effectof a one standard deviation change in the covariate. . . . . . . . . . . . . . . 40

6 Correlation of βh and Survey Behavioral Variables. The results in the first(second) column calculate βh using purchase recall (purchase intention). Largeropinion leadership, innovativeness, and variety-seeking are all associated withlarger βh using purchase intention. The standard error of all correlations is0.06. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

35

Variable Mean SD 25% 50% 75%Lifetime (weeks) 172.8 90.9 100 167 238Price ($) 6.2 9.4 2.0 3.9 7.0Private label (binary) 0.2 0.4 0.0 0.0 0.0Promotion frequency (%) 0.1 0.2 0.0 0.0 0.2First twenty-six week sales (units) 44.4 274.2 1.0 3.0 14.0

Table 1: Summary Statistics. Product lifetime as it will serve as our objective, continuousmeasure of product success and varies substantially across the 60% of products which areuncensored. The majority of new products are relatively inexpensive, associated with na-tional brands, seldom promoted, and have comparably few unit sales in the first twenty-sixweeks after introduction.

36

DatasetNew Product Data Household Purchase DataProduct ID Lifetime HH 1 HH 2 . . . HH h . . . . . . HH H

1. Calibration(2006, 2008; 100%)

......

...... . . .

... . . ....

i1 Ti1 xi1,1 xi1,2 . . . xi1,h . . . xi1,H...

......

... . . .... . . .

...

2. In-sample(2007, 2009; 80%)

......

...... . . .

... . . ....

i2 Ti2 xi2,1 xi2,2 . . . xi2,h . . . xi2,H...

......

... . . .... . . .

...

3. Out-of-sample(2007, 2009; 20%)

......

...... . . .

... . . ....

i3 Ti3 xi3,1 xi3,1 . . . xi3,h . . . xi3,H...

......

... . . .... . . .

...

Table 2: Notation and Datasets. Ti gives the lifetime of new product i in weeks and xi,hgives the number of units of new product i purchased by household h in the initial evaluationperiod (i.e., first twenty-six weeks after introduction). The calibration dataset consists ofall new products introduced in 2006 and 2008, the in-sample dataset consists of a randomsample of 80% of new products introduced in 2007 and 2009, and the out-of-sample datasetconsists of the remaining 20% of new products introduced in 2007 and 2009. In the remainderof this manuscript, the subscript i1 always indexes new products in the calibration dataset,the subscript i2 always indexes new products in the in-sample dataset, and the subscript i3always indexes new products in the out-of-sample dataset.

37

VariableModels

Model 1 Model 2 Model 3 Model 4Intercept 0.3521∗∗∗ 0.3483∗∗∗ 0.4180∗∗∗ 0.1606

(0.0159) (0.0160) (0.0242) (0.2646)

S 0.0001∗∗∗

(0.00002)

S.,1 0.0021∗∗∗ 0.0020∗∗∗ 0.0017∗∗

(0.0008) (0.0008) (0.0008)

S.,2 0.0019∗∗∗ 0.0020∗∗∗ 0.0015∗∗∗

(0.0003) (0.0003) (0.0003)

S.,3 -0.0017∗∗∗ −0.0017∗∗∗ −0.0013∗∗∗

(0.0003) (0.0003) (0.0003)

S.,4 −0.0011∗∗ −0.0011∗∗ −0.0008∗

(0.0005) (0.0005) (0.0005)

S −0.00002 0.0001 0.0002(0.0008) (0.0008) (0.0008)

Price 0.0012 −0.0052∗∗

(0.0019) (0.0021)

Private label −0.0274 0.0263(0.0377) (0.0406)

Promotion −0.6153∗∗∗ −0.5747∗∗∗

(0.0789) (0.0826)Category Effects No No No YesObservations 17,130 17,130 17,130 17,130LL (in-sample) −11,585 −11,531 −11,500 −11,129LL (out-of-sample) -2900 -2886 -2879 -2795AUC (in-sample) 0.5059 0.5734 0.5566 0.6700AUC (out-of-sample) 0.5134 0.5680 0.5507 0.6291

∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 3: Replication of Anderson et al. (2015). The model presented in the first column isthe benchmark new product forecasting model and the model presented in the second columnis that of Anderson et al. (2015). The models presented in the third and fourth columnsgeneralize the model of Anderson et al. (2015) by successively adding three product covariates(price, private label indicator, and promotion frequency) and category effects (fixed effectsfor each of the eight categories; random effects for each of the 291 subcategories). The cells inthe upper right subtable give coefficient estimates (estimated standard errors). LL denoteslog likelihood and AUC denotes the area under the receiver operating characteristic curve.38

ModelIn-sample Out-of-sample

PLL IAUC PLL IAUCBenchmark -92,006 0.464 -19,522 0.463Demographics -91,994 0.499 -19,520 0.499Anderson -91,945 0.503 -19,505 0.502Proposed -91,941 0.510 -19,505 0.510

Table 4: Model Evaluation. We evaluate our proposed approach against three alternativemodels: the hazard framework analogue of the benchmark model and the Anderson et al.(2015) model as well as one that models the household-level effects as a linear function of avector of demographic variables. Our proposed approach outperforms the alternative modelson products i2 in the in-sample dataset and products i3 in the out-of-sample dataset.

39

VariableModels

Model 1 Model 2Intercept −0.0003∗∗∗ −0.0003∗∗∗

(0.00002) (0.00002)

Income 0.0002∗∗∗ 0.0002∗∗∗

(0.00003) (0.00003)

Size 0.0001∗ 0.0001∗

(0.00004) (0.00004)

Single child 0.0003∗∗∗ 0.0003∗∗∗

(0.00003) (0.00003)

Two+ children 0.0004∗∗∗ 0.0004∗∗∗

(0.00004) (0.00004)

Female head age 0.0001∗ 0.00005(0.00005) (0.00005)

Male head age −0.0001 −0.0001∗

(0.0001) (0.0001)

Only a female 0.0001∗∗∗ 0.0001∗∗∗

(0.00004) (0.00005)

Only a male 0.0001 0.00002(0.00004) (0.00004)

Popularity rank 0.0002∗∗∗

(0.00002)

Store search −0.00003(0.00003)

Price search 0.00002(0.00002)

Adoption lag 0.0001∗∗∗

(0.00002)

No. brands per category 0.0001∗∗∗

(0.00003)

∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 5: Covariate Results. The model in column one (two) is a regression of the βh ondemographic variables (demographic variables and behavioral variables). The second columnof this table will be discussed in Section 7.2. Coefficient estimates are presented on the z-score scale such that they indicate the effect of a one standard deviation change in thecovariate.

40

Hypothesis Purchase Recall Purchase Intention1. Unrepresentative taste 0.053 −0.0312. Search 0.098 0.0623. Opinion leadership −0.032 0.110∗

4. Innovativeness −0.090 0.217∗∗∗

5. Variety-seeking −0.024 0.113∗

∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table 6: Correlation of βh and Survey Behavioral Variables. The results in the first (second)column calculate βh using purchase recall (purchase intention). Larger opinion leadership,innovativeness, and variety-seeking are all associated with larger βh using purchase intention.The standard error of all correlations is 0.06.

41

List of Figures

1 Performance of Short-lived and Long-lived New Products Over Time. Thesmooth curves are fit separately for relatively short-lived (lifetime betweenone and four years) and relatively long-lived (lifetime greater than four years)new products using a generalized additive model with the degree of smooth-ness estimated from the data. The revenue of long-lived new products growsrapidly in the first fifteen weeks after introduction and remains relatively sta-ble thereafter while the revenue of short-lived new products declines from thestart; long-lived new products have both higher adoption and repeat purchaserates relative to short-lived new products. . . . . . . . . . . . . . . . . . . . 44

2 Out-of-Sample AUC Across Time. Our proposed approach outperforms thealternative models on products i3 in the out-of-sample dataset. . . . . . . . . 45

3 Estimated Weight Function w. The estimated weight function is decreasingin the product lifetime and, importantly, is positive (negative) for sufficientlyshort (long) lifetimes. This implies that a household that purchases manyshort-lived new products is more likely to be a harbinger of failure. . . . . . 46

4 Estimates of βh. 44% of households have positive βh thus implying that 44%(56%) of households are harbingers of failure (success). . . . . . . . . . . . . 47

5 Effects of Harbingers on Survival Probability. The solid line provides theKaplan-Meier estimator of the underlying baseline survival probability. Thedashed lines provides average survival probability were all sales to harbingersof success and failure. In general, the difference between the survival prob-ability estimates when all sales are to harbingers of success versus failure islarger than 0.05. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6 Retailer Revenue Contribution of Harbingers of Failure. Harbingers of failureaccount for more than 50% of the revenue of particular mass merchandisersand warehouse clubs despite accounting for only 44% of the population. . . . 49

7 Cross-category Estimated Weight Functions w1 and w0. Both weight functionsare positive (negative) for sufficiently short (long) lifetimes. This implies thata household that purchases many short-lived new products is more likely tobe a harbinger of failure and, as indicated in the bottom panel, this holdseven when the product is the calibration dataset is from a different categorythan the product in the in-sample or out-of-sample dataset. . . . . . . . . . . 50

8 Logistic Regression Estimated Weight Function w. The estimated weightfunction is increasing in the product lifetime and, importantly, is negative(postive) for sufficiently short (long) lifetimes. This implies that a householdthat purchases many short-lived new products is more likely to be a harbingerof failure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

42

9 Estimated Weight Function w in the Full Censoring Model. The top (bot-tom) panel displays the weight function associated uncensored (censored) newproducts in the calibration set. The estimated weight function for uncensoredproducts is positive for short lifetimes but negative for long lifetimes; thisagain implies that a household that purchases many short-lived new productsis more likely to be a harbinger of failure. It also shows the weight function forcensored products is negative which implies that a household that purchasesnew products still being sold is less likely to be a harbinger of failure. . . . . 52

10 Cross-category Estimated Weight Functions w1 and w0. Both weight functionsare positive for sufficiently short lifetimes. This implies that a household thatpurchases many short-lived new products is more likely to be a harbinger offailure and, as indicated in the bottom panel, this holds even when the productis the calibration dataset is from a different category than the product in thein-sample or out-of-sample dataset. . . . . . . . . . . . . . . . . . . . . . . . 53

43

Repeat Purchase Rate (%)

Number of New Consumers

Revenue Relative to Category Average

0 10 20 30 40 50

0.25

0.30

0.35

0.40

1.6

2.0

2.4

2.8

0.75

1.00

1.25

1.50

Product Lifetime

New products

Long−lived

Short−lived

Figure 1: Performance of Short-lived and Long-lived New Products Over Time. The smoothcurves are fit separately for relatively short-lived (lifetime between one and four years) andrelatively long-lived (lifetime greater than four years) new products using a generalized addi-tive model with the degree of smoothness estimated from the data. The revenue of long-livednew products grows rapidly in the first fifteen weeks after introduction and remains rela-tively stable thereafter while the revenue of short-lived new products declines from the start;long-lived new products have both higher adoption and repeat purchase rates relative toshort-lived new products.

44

0.47

0.48

0.49

0.50

0.51

0 100 200 300 400

Week

AU

C

Model

Benchmark

Demographics

Anderson

Proposed

Figure 2: Out-of-Sample AUC Across Time. Our proposed approach outperforms the alter-native models on products i3 in the out-of-sample dataset.

45

−0.02

0.00

0.02

0.04

0.06

0 100 200 300 400

Product Lifetime

Wei

ght F

unct

ion

Figure 3: Estimated Weight Function w. The estimated weight function is decreasing in theproduct lifetime and, importantly, is positive (negative) for sufficiently short (long) lifetimes.This implies that a household that purchases many short-lived new products is more likelyto be a harbinger of failure.

46

0

20

40

60

−0.01 0.00 0.01 0.02 0.03

βh

Den

sity

Figure 4: Estimates of βh. 44% of households have positive βh thus implying that 44% (56%)of households are harbingers of failure (success).

47

0.4

0.6

0.8

1.0

0 100 200 300

Product Lifetime

Sur

viva

l Pro

babi

lity

Baseline

All success

All failure

Figure 5: Effects of Harbingers on Survival Probability. The solid line provides the Kaplan-Meier estimator of the underlying baseline survival probability. The dashed lines providesaverage survival probability were all sales to harbingers of success and failure. In general,the difference between the survival probability estimates when all sales are to harbingers ofsuccess versus failure is larger than 0.05.

48

Drug store B

Drug store A

Mass merchandiser C

Grocery C

Grocery B

Grocery A

Limit Assort A

Warehouse club B

Mass merchandiser B

Mass merchandiser A

Warehouse club A

0% 20% 40% 60%

Revenue Contribution

Ret

aile

r

Figure 6: Retailer Revenue Contribution of Harbingers of Failure. Harbingers of failureaccount for more than 50% of the revenue of particular mass merchandisers and warehouseclubs despite accounting for only 44% of the population.

49

Cross−Category

Same Category

0 100 200 300 400

−0.025

0.000

0.025

0.050

−0.025

0.000

0.025

0.050

Product Lifetime

Wei

ght F

unct

ion

Figure 7: Cross-category Estimated Weight Functions w1 and w0. Both weight functions arepositive (negative) for sufficiently short (long) lifetimes. This implies that a household thatpurchases many short-lived new products is more likely to be a harbinger of failure and, asindicated in the bottom panel, this holds even when the product is the calibration datasetis from a different category than the product in the in-sample or out-of-sample dataset.

50

−0.03

−0.02

−0.01

0.00

0 100 200 300 400

Product Lifetime

Wei

ght F

unct

ion

Figure 8: Logistic Regression Estimated Weight Function w. The estimated weight functionis increasing in the product lifetime and, importantly, is negative (postive) for sufficientlyshort (long) lifetimes. This implies that a household that purchases many short-lived newproducts is more likely to be a harbinger of failure.

51

Censored

Uncensored

0 100 200 300 400

−0.04

−0.02

0.00

0.02

0.04

0.06

−0.04

−0.02

0.00

0.02

0.04

0.06

Product Lifetime

Wei

ght F

unct

ion

Figure 9: Estimated Weight Function w in the Full Censoring Model. The top (bottom)panel displays the weight function associated uncensored (censored) new products in thecalibration set. The estimated weight function for uncensored products is positive for shortlifetimes but negative for long lifetimes; this again implies that a household that purchasesmany short-lived new products is more likely to be a harbinger of failure. It also showsthe weight function for censored products is negative which implies that a household thatpurchases new products still being sold is less likely to be a harbinger of failure.

52

Cross−Category

Same Category

0 100 200 300 400

0.000

0.025

0.050

0.000

0.025

0.050

Product Lifetime

Wei

ght F

unct

ion

Figure 10: Cross-category Estimated Weight Functions w1 and w0. Both weight functionsare positive for sufficiently short lifetimes. This implies that a household that purchasesmany short-lived new products is more likely to be a harbinger of failure and, as indicatedin the bottom panel, this holds even when the product is the calibration dataset is from adifferent category than the product in the in-sample or out-of-sample dataset.

53