MKTG 5963 – Section 513

Term Project: ABC Catalog Company

Final Report

Team Number One:

Antonio Zuniga, Cynthia Araly

Durongkadej, Isarin

Methapatara, Chinnatat

Somraj, Shilpa

Zarate Tellez, Ana

Spring 2014

Table of Contents

EXECUTIVE SUMMARY
  OBJECTIVES
  ISSUES
  FINDINGS
    Data
    RFM and most valuable customers
    Offer, product, and channel performance
    Customer segments and profiles
    Future revenue forecast
  RECOMMENDATIONS
INTRODUCTION
  BACKGROUND
  OBJECTIVES
  PROJECT TIMELINE AND APPROACHES
  REPORT ORGANIZATION
DATA
  UNDERSTANDING THE DATA
    Shipping duration
    Backorder frequency
  EXTERNAL DATA
  NEW VARIABLES
  TARGET VARIABLES
  DATA IMPUTATION AND TRANSFORMATION
    Deal with outliers
    Deal with missing value
    Transform variables
  DATA EXPLORATION AND BUSINESS IMPLICATIONS
    Profitable Regions
    Impact of Customer Age
    Products frequently returned
    Return on Marketing Investment
    Effectiveness of catalog offers by product categories
    Repeat purchase by catalog offer and product categories
CUSTOMER SEGMENTATION AND PROFILE
  RFM ANALYSIS
    Exploring R, F and M indexes
    Exploring RFM results
    Findings and implications
  MOST VALUABLE CUSTOMERS BY REGIONS
  CUSTOMER CLUSTERING AND PROFILING
    Objective and Goals
    Inputs
    Segments and profile
    Findings and Implications
PREDICTIVE MODELS
  MODEL TARGET
  DATA PARTITION
  VARIABLE REDUCTION AND SELECTION
    Variable recoding
  DECISION TREE
    Model implementation
    Model results
  REGRESSION
    Model Implementation
    Model Results
    Findings
  NEURAL NETWORK
    Model Implementation
    Model result
    Findings
  COMPARISON OF MODELS’ PERFORMANCE
  SCORING
    Scoring performance
SUMMARY
  FINDINGS AND RECOMMENDATIONS
    Segmentation and Clustering
    Predictive Modeling
  LIMITATIONS
APPENDIX
  APPENDIX A
    Data exploration on the net profit and its associations of the input variables
  APPENDIX B
    The effectiveness of catalog offers by product categories by total order quantities
    The effectiveness of catalog offers by product categories by total profit
  APPENDIX C
    The results of hierarchical clustering
    The results of clustering and profiling with four clusters
    The results of clustering and profiling with eight clusters
    The results of clustering and profiling with twelve clusters
  APPENDIX D
    Data exploration and business implication
  APPENDIX E
    Customer segmentation and profile
  APPENDIX F
    Data imputation and transformation
  APPENDIX G
    Predictive model

Executive Summary

Objectives

The objective is to help a specialty multi-channel catalog company, ABC Catalog Company (referred to as “ABC” henceforth), improve its revenue and market share. Achieving this objective requires evaluating what drives ABC’s revenues and profits across customer segments, products, and marketing offers.

Issues

ABC lacks a good understanding of its customers. As a result, it cannot design well-targeted marketing campaigns to increase its market share, and it struggles to retain current and previous customers. Performance could improve dramatically if ABC directed its emphasis and marketing effort appropriately and made heavy but judicious use of statistical methods, together with computer software, on its existing data.

Findings

Data

The transaction data was provided and has no severe issues; approximately 4% of values were missing. Some data anomalies were observed. Skewness was noted for variables that deal with cost, price, quantity, and revenue. A new target variable denoting net profit was created as a more accurate performance metric for ABC’s profitability. In addition, customer demographics were brought in from census data.

RFM and most valuable customers

The ABC customer base consists primarily of two types: a large number of one-time customers and a loyal customer base. From the CRM perspective, RFM analysis shows that most of ABC’s one-time customers are also low-profit generators. Further analysis shows that ABC’s most valuable customers are mostly located in the Pacific Coast, Mid-Atlantic, and Southeast, prefer the telephone marketing channel, and have low household income on average.

Offer, product, and channel performance

ABC’s offer programs are not effective at bringing customers back: the number of repeat purchases was extremely low for every offer and every product category. Overall, product category T was the most popular (highest number of orders), most demanded (largest total order quantity), and most profitable (largest total profit), followed by product categories C, E, and P, respectively. In total, ABC customers used catalog offers the most. The four most effective catalog offers were those with LMB, GFP, GFB, and LCB as the suffix of their offer IDs. Among all offers, the web offer channel (WEB) had the highest number of orders.

Customer segments and profiles

Customer segmentation was performed based on the buying behaviors and preferences of customers. Because customers were very similar to one another, it was difficult to create perfectly distinct customer segments. Experiments were performed with 4, 8, and 12 clusters, as recommended by the statistical criteria of a hierarchical clustering method. The optimal solution had 8 clusters, mostly differentiated by RFM code. Four of the eight clusters had a high response rate to catalog offers. Product preference was not clearly distinct across clusters, and 1-digit ZIP codes of 5, 7, and 9 were somewhat meaningful in profiling customer segments.

Future revenue forecast

Of the three models (decision tree, regression, and neural network), the decision tree predicts future revenue best, with the lowest error. Five variables are important for predicting revenue: ZIP code, payment method, offer range, product category, and RFM. In terms of implementation, the model was scored on new data outside the SAS environment. The scoring result showed a small revenue prediction error, which indicates that the model is flexible enough to be applied to data outside the SAS environment.

Recommendations

Increase catalog promotion and train personnel for telephone marketing to increase the share of valuable customers.

All products need better marketing from the CRM perspective. ABC needs to put more emphasis on customer retention, and it must improve the customer buying and customer service experience. To better understand customers, a survey of their perceptions and preferences is a good start.

The most profitable regions are the Pacific Coast (5), Heartland (7), and the South (9). ABC should therefore focus its efforts on increasing market share in those areas.

The return on investment is higher for products in the lower cost range. ABC should therefore maintain and promote lower-end products, since that is where its profits come from.

The decision tree suggested four recommendation groups for different price ranges. The first price range ($55-95) is the most important, accounting for 90 percent of the company’s total revenue. Within this price range, ABC should focus on five product categories (B, P, C, T, and X), sell them in bundles, and offer customers a promotion when they pay with American Express, Visa, or MasterCard.

Initially, ABC might market to only one segment to test whether the segmentation is accurate. The recommended segment to start with is Segment 7, because it contains the most valuable customer group, the majority of whom are female and located in the north-central and central US. ABC should therefore make the catalog more feminine for those areas and run cross-sale promotions.

After deploying a new marketing campaign, ABC should track whether it is successful and whether modifications to the campaign or the segmentation are needed.

Introduction

Background

ABC is a specialty multi-channel catalog company with a strong web presence. The company sources different consumer products from manufacturers and then designs catalogs to display an assortment of those products. Unlike a traditional retail store, the products are not displayed physically in a store; instead, customers are sent catalogs, and their orders are fulfilled through online shopping. ABC wants to improve its revenue and grow its customer base. Our team has been assigned the task of analyzing what drives ABC’s revenue and profit.

Objectives

The overall analysis needs to be divided into addressable sub-components to make the proper

recommendations on the most effective marketing strategies to achieve higher profits and gain market

share. Hence, the sub-objectives for this project are:

To analyze product, offer, and channel-buying patterns

To measure performance and estimate return on the marketing investment (Marketing ROI)

To examine purchase patterns through contact histories and detailed product purchase information

To identify seasonal sales/service quality trends

To build customer segmentation

To identify the most valuable customers and predict future revenues

Project timeline and approaches

The CRISP-DM methodology was used, with the eventual goal of predicting future revenues and suggesting marketing strategies. The overall project was split into three phases, with goals associated with each step in the CRISP-DM process, as follows:

Phase 1 – data preparation and exploration

Phase 2 – data audit and descriptive analysis

Phase 3 – predictive analysis

The primary tools used in this project were SAS Enterprise Miner 12.1 and Microsoft Excel. The detailed procedures and the timeline of the project are shown in Figure 1.

| Task | Start | Finish | Duration |
|---|---|---|---|
| Understanding the project descriptions | 1/28/2014 | 2/1/2014 | 5d |
| Writing Phase 1 report | 2/1/2014 | 2/3/2014 | 3d |
| Understanding business nature and determining business issues and goals | 1/28/2014 | 2/6/2014 | 10d |
| Exploring, scrubbing and consolidating data for Phase 1 | 1/28/2014 | 2/6/2014 | 10d |
| Determining meaningful inputs. Auditing and observing data quality, e.g., distribution and missing values | 1/31/2014 | 2/16/2014 | 17d |
| Examining input associations and company performance from multiple perspectives | 2/3/2014 | 2/18/2014 | 16d |
| Preparing data, i.e., data imputation and transformation | 2/24/2014 | 3/1/2014 | 6d |
| Writing Phase 2.1 report | 2/25/2014 | 3/3/2014 | 7d |
| Conducting RFM analysis | 3/3/2014 | 3/8/2014 | 6d |
| Associating RFM codes to other inputs and determining meaningful marketing strategies | 3/10/2014 | 3/18/2014 | 9d |
| Creating customer segmentation and profiling the segments | 3/10/2014 | 3/25/2014 | 16d |
| Writing Phase 2.2 report | 3/18/2014 | 3/31/2014 | 14d |
| Conducting analysis on marketing return on investment, and product returns | 3/31/2014 | 4/5/2014 | 6d |
| Conducting analysis on interactions between offer codes and product categories across the years, and repeat purchase | 3/31/2014 | 4/15/2014 | 16d |
| Performing variable reduction and selection and creating meaningful inputs for predictive models related to business goals | 3/31/2014 | 4/4/2014 | 5d |
| Building predictive models | 3/31/2014 | 4/17/2014 | 18d |
| Comparing and scoring data | 4/24/2014 | 4/25/2014 | 2d |
| Writing final report | 4/14/2014 | 4/28/2014 | 15d |

CRISP-DM methodology phases covered: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.

Figure 1: Detailed project tasks and timeline

Report organization

The report is organized as follows. Section Data describes the data understanding, external data, business-related analysis, and business implications. Section Customer segmentation and profile presents the RFM analysis, customer clustering and profiling, and the findings and related business implications. Section Predictive models discusses variable reduction and selection and the predictive models, and Section Summary provides a summary of the project, including findings and recommendations. SAS Enterprise Miner outputs, graphs, charts, and further analyses are contained in the Appendix.

Data

Understanding the data

From the business perspective (step one in the CRISP-DM process), it is essential to understand the existing system before preparing the data or building a model. Hence, service quality and catalog response rates were inspected. Service quality can increase the number of repeat customers, order quantities, and eventually profits for ABC. Shipping is a key component of service quality. To explore shipping quality, we need to analyze the shipping duration, backorder frequency, and the number of returned or cancelled orders.

Shipping duration

Shipping duration, or lead time, can be one of the potential determinants of repeat customers. A good shipping experience will most likely result in a repeat order from the same customer. Hence, the difference between ORDER_DATE and SHIP_DATE can be used to monitor shipping time, reduce delays, and in turn improve customer satisfaction.
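As an illustration of how shipping duration could be derived from ORDER_DATE and SHIP_DATE, the following minimal Python sketch computes the lead time per order. The sample records, the ORDER_ID field, and the 7-day threshold are assumptions made for the example and are not taken from the ABC data.

```python
from datetime import date
from statistics import mean

# Hypothetical order records; only ORDER_DATE and SHIP_DATE are fields named in the report.
orders = [
    {"ORDER_ID": 1, "ORDER_DATE": date(2010, 11, 1), "SHIP_DATE": date(2010, 11, 4)},
    {"ORDER_ID": 2, "ORDER_DATE": date(2010, 11, 2), "SHIP_DATE": date(2010, 11, 12)},
    {"ORDER_ID": 3, "ORDER_DATE": date(2010, 12, 20), "SHIP_DATE": date(2010, 12, 21)},
]


def shipping_days(order):
    """Lead time in calendar days between order placement and shipment."""
    return (order["SHIP_DATE"] - order["ORDER_DATE"]).days


lead_times = [shipping_days(o) for o in orders]
average_lead_time = mean(lead_times)

# Flag orders slower than an illustrative threshold (an assumption, not a report value).
SLOW_THRESHOLD_DAYS = 7
slow_orders = [o["ORDER_ID"] for o in orders if shipping_days(o) > SLOW_THRESHOLD_DAYS]
```

Flagging slow orders in this way would let the lead time be related to whether the same customer orders again.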

Backorder frequency

Backorder frequency is a key indicator of service quality and describes the effectiveness of logistics and supply chain management, especially under seasonal demand. If an item is frequently unavailable, customers might turn to other catalog or online shopping avenues to obtain the product in time. Hence, we determined the total number of days that customers need to wait for an item to be shipped if it is not available at the time of purchase. We concluded that the relationship between BO_DATE and SHIP_DATE demonstrates seasonality.
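One possible way to look for this seasonality is to average the backorder wait (SHIP_DATE minus BO_DATE) by the calendar month of the backorder. The sketch below is only an illustration: the sample rows are invented, and grouping by month is an assumed choice rather than the procedure used in the project.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical backordered items; BO_DATE and SHIP_DATE are the report's field names.
backorders = [
    {"BO_DATE": date(2010, 11, 28), "SHIP_DATE": date(2010, 12, 12)},
    {"BO_DATE": date(2010, 12, 5), "SHIP_DATE": date(2010, 12, 23)},
    {"BO_DATE": date(2010, 12, 10), "SHIP_DATE": date(2010, 12, 20)},
    {"BO_DATE": date(2011, 6, 3), "SHIP_DATE": date(2011, 6, 7)},
]


def mean_wait_by_month(rows):
    """Average number of days between backorder and shipment, per backorder month."""
    waits = defaultdict(list)
    for row in rows:
        wait_days = (row["SHIP_DATE"] - row["BO_DATE"]).days
        waits[row["BO_DATE"].month].append(wait_days)
    return {month: mean(days) for month, days in sorted(waits.items())}


wait_by_month = mean_wait_by_month(backorders)
```

In this made-up sample, the waits for November and December backorders are longer than the June wait, which is the kind of pattern a seasonal effect would produce.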

Identifying trends in order cancellations and returns across product lines, offers, and regions can help ABC improve service quality in those areas, if required.

External data

The given data offers information about customer orders, quantities ordered, price, offer, channel, and so on. However, to segment the customers better, we need demographic information. To this end, we augmented the given dataset with external demographic information, matched on ZIP code, from American FactFinder. (Reference: http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml)
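The ZIP-code match can be pictured as a left join of the transactions onto a census lookup table. The following Python sketch shows the idea under stated assumptions: the ZIP and ORDER_ID field names, the sample rows, and all census values are hypothetical, and the actual merge in the project was not necessarily done this way.

```python
# Hypothetical transaction rows; the ZIP and ORDER_ID field names are assumptions.
transactions = [
    {"ORDER_ID": 1, "ZIP": "10001"},
    {"ORDER_ID": 2, "ZIP": "77002"},
    {"ORDER_ID": 3, "ZIP": "99999"},  # no census match in this toy lookup
]

# Hypothetical census lookup keyed by ZIP code; the values are made up for illustration.
census_by_zip = {
    "10001": {"FEMALE_PCT_NUM": 51.2, "MEAN_HOUSEHOLD_INC_NUM": 98000, "TOT_POP": 21000},
    "77002": {"FEMALE_PCT_NUM": 38.5, "MEAN_HOUSEHOLD_INC_NUM": 87000, "TOT_POP": 16000},
}

CENSUS_FIELDS = ("FEMALE_PCT_NUM", "MEAN_HOUSEHOLD_INC_NUM", "TOT_POP")


def add_demographics(rows, census):
    """Left join: every transaction is kept; unmatched ZIP codes get None values."""
    enriched = []
    for row in rows:
        demo = census.get(row["ZIP"], {})
        enriched.append({**row, **{field: demo.get(field) for field in CENSUS_FIELDS}})
    return enriched


enriched = add_demographics(transactions, census_by_zip)
```

ZIP codes are kept as strings so that leading zeros (for example, 070) are preserved during the match.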

New variables

New variables were created to accommodate the external data in the existing dataset. The new variables are described in Table 1.

Table 1: The purpose and description of new variables corresponding to the census data

| Variable | Description | Purpose |
|---|---|---|
| FEMALE_PCT_NUM | Percentage of females in each ZIP code | To explore if there is any trend pertaining to gender |
| MALE_PCT_NUM | Percentage of males in each ZIP code | To explore if there is any trend pertaining to gender |
| MEAN_HOUSEHOLD_INC_NUM | Average household income in each ZIP code | ... wealth of household |
| YOUNG_AMT_15_TO_24 | Population aged 15-24 | ... different age ranges |
| MID_AMT_25_TO_44 | Population aged 25-44 | ... different age ranges |
| OLD_AMT_45_TO_64 | Population aged 45-64 | ... different age ranges |
| RET_AMT_65_OR_ABOVE | Population aged 65 or above | ... different age ranges |
| NO_TEL_NUM | Number of households with no telephone | To exclude any ZIP code that has a high volume of these variables |
| UNEMP_PCT_NUM | Percentage of unemployed population | To exclude any ZIP code that has a high volume of these variables |
| TOT_POP | Total population in each ZIP code | For comparison purposes |

Target variables

A product with a high revenue margin does not necessarily have a high profit margin. In other words, although its unit price is high, it may also be associated with a high product cost. Thus, net revenue is not an accurate target with respect to profitability. Instead, net profit, denoted by NET_PROFIT, is a more accurate representation, and it is calculated as follows.

NET_PROFIT = NET_QUANTITY × (PRICE_PER_UNIT - COST_PER_UNIT).
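The formula can be checked with a few toy transactions. The short Python sketch below applies it directly; the quantities, prices, and costs are invented for illustration only.

```python
def net_profit(net_quantity, price_per_unit, cost_per_unit):
    """NET_PROFIT = NET_QUANTITY x (PRICE_PER_UNIT - COST_PER_UNIT)."""
    return net_quantity * (price_per_unit - cost_per_unit)


# Illustrative values only, not taken from the ABC data.
profitable = net_profit(2, 50.0, 30.0)      # price above cost
loss_making = net_profit(1, 20.0, 25.0)     # price below cost gives a negative profit
fully_returned = net_profit(0, 50.0, 30.0)  # net quantity of zero after returns
```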

Most of the time, net profit is positive, but it can also be negative if the profit margin is negative. We observed many transactions with negative profit, and many products that on average have a negative profit margin. More interestingly, there were also many products that never generated any profit (because all units were returned) over the entire horizon of the data set. The histogram of net profit by transaction is presented in Figure 2.

Figure 2: Histogram of net profit by transactions

The net profit and the associations of the other inputs with it were thoroughly studied. The results are presented in Appendix A.

Data imputation and transformation

Deal with outliers

Any observation with a variable value more than 5 standard deviations from the mean was filtered out. This removed 6,188 observations, or 3.54 percent of the original data.
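A filter of this kind can be sketched in a few lines of Python. The example below assumes the cutoff is measured from each variable's mean using the sample standard deviation; the data are synthetic, and the exact settings used in SAS Enterprise Miner may differ.

```python
from statistics import mean, stdev


def filter_extremes(rows, fields, max_sd=5.0):
    """Keep rows whose value in every listed field lies within max_sd
    sample standard deviations of that field's mean."""
    limits = {}
    for field in fields:
        values = [row[field] for row in rows]
        mu, sd = mean(values), stdev(values)
        limits[field] = (mu - max_sd * sd, mu + max_sd * sd)
    return [
        row for row in rows
        if all(limits[f][0] <= row[f] <= limits[f][1] for f in fields)
    ]


# Synthetic data: 100 ordinary values plus one extreme value.
rows = [{"NET_REVENUE": float(v)} for v in range(1, 101)]
rows.append({"NET_REVENUE": 10_000.0})

kept = filter_extremes(rows, ["NET_REVENUE"])
removed_share = 1 - len(kept) / len(rows)
```

Only the single extreme value is dropped here, so the removed share is 1 out of 101 observations.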

The quality of the data is better after filtering the extreme values. Kurtosis and skewness are much lower, indicating that the extreme values had been distorting the distributions. For more information, please see Appendix D.

Deal with missing value

We replaced missing values with zero for ret_quantity and ret_revenue, because a missing value in these two variables indicates “no return” rather than unknown data. Replacing them with zero is useful for further analysis. We then imputed missing values for the other variables with the mean (for interval variables) or the mode (for categorical variables).
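The imputation rules above can be sketched as follows. This is a minimal Python illustration: ret_quantity and ret_revenue come from the report, while the CHANNEL field name and all sample values are assumptions.

```python
from statistics import mean, mode

# Hypothetical rows; None marks a missing value. CHANNEL is an assumed field name.
rows = [
    {"ret_quantity": None, "ret_revenue": None, "PRICE_PER_UNIT": 20.0, "CHANNEL": "WEB"},
    {"ret_quantity": 1, "ret_revenue": 20.0, "PRICE_PER_UNIT": None, "CHANNEL": "WEB"},
    {"ret_quantity": None, "ret_revenue": None, "PRICE_PER_UNIT": 40.0, "CHANNEL": None},
    {"ret_quantity": None, "ret_revenue": None, "PRICE_PER_UNIT": 30.0, "CHANNEL": "PHONE"},
]


def impute(rows, zero_fields, interval_fields, categorical_fields):
    """Fill missing values: zero for return fields, mean for interval
    fields, and mode for categorical fields."""
    fill = {field: 0 for field in zero_fields}
    for field in interval_fields:
        fill[field] = mean(r[field] for r in rows if r[field] is not None)
    for field in categorical_fields:
        fill[field] = mode(r[field] for r in rows if r[field] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


clean = impute(
    rows,
    zero_fields=["ret_quantity", "ret_revenue"],
    interval_fields=["PRICE_PER_UNIT"],
    categorical_fields=["CHANNEL"],
)
```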

Transform variables

First, we used the max normal transformation to determine which method should be used to transform each variable. For more information, please see Appendix D.

We then transformed the variables based on the max normal result. After transformation, the skewness and kurtosis are closer to those of a normal distribution. Consequently, the data is ready for further analysis. For more information, please see Appendix D.
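The idea of choosing the transformation that makes a variable look most normal can be approximated as shown below. This Python sketch is an assumption-laden stand-in: it picks, from a small candidate set, the transformation with the smallest absolute skewness, which may differ from the criterion that the max normal option in SAS Enterprise Miner applies. The candidate set and the synthetic data are also assumptions.

```python
import math


def skewness(values):
    """Moment-based (population) skewness."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumed candidate transformations for non-negative, right-skewed variables.
CANDIDATES = {
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "log": lambda v: math.log(v + 1.0),
}


def most_normal_transform(values):
    """Return the candidate name with the smallest absolute skewness, plus all scores."""
    scores = {
        name: abs(skewness([f(v) for v in values]))
        for name, f in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


# Synthetic right-skewed values standing in for a revenue-like variable.
revenue = [math.exp(0.5 * k) for k in range(12)]
best, scores = most_normal_transform(revenue)
```

For this synthetic, exponentially growing series the log transformation gives the least skewed result, which matches the common practice of log-transforming cost, price, and revenue variables.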

Data Exploration and business implications

Profitable Regions

Prioritizing regions for marketing efforts is important. We analyzed the ZIP codes to understand the key regions for ABC. The top ten most profitable 3-digit ZIP codes are 100, 600, 770, 606, 117, 926, 945, 070, 334, and 300, which are in the states of NY, IL, TX, CA, NJ, FL, and GA. These results are expected, as these states as a whole are well equipped, have large populations, and have more buying power.

Figure 3: Top-ten most profitable areas by 3-digit ZIP code
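A ranking like the one in Figure 3 can be produced by summing net profit over the first three digits of the ZIP code. The Python sketch below illustrates this with invented rows; the ZIP field name and the profit values are assumptions.

```python
from collections import defaultdict


def top_zip3_by_profit(rows, n=10):
    """Sum NET_PROFIT by 3-digit ZIP prefix and return the n largest totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["ZIP"][:3]] += row["NET_PROFIT"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical transactions; ZIP codes are strings so leading zeros survive.
rows = [
    {"ZIP": "10001", "NET_PROFIT": 40.0},
    {"ZIP": "10025", "NET_PROFIT": 25.0},
    {"ZIP": "60614", "NET_PROFIT": 30.0},
    {"ZIP": "07030", "NET_PROFIT": -5.0},
    {"ZIP": "77002", "NET_PROFIT": 50.0},
]

top = top_zip3_by_profit(rows, n=3)
```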

Impact of Customer Age

We segmented the population into four age groups to better understand the company's customers. The distribution for each of the four age ranges is right skewed. Comparing each age range with net revenue, we observe that customers in all four age ranges are most likely to purchase products with net revenue between 0 and 250. Based on frequency, people in the 15 to 24 age group and those who are 65 and above are the most loyal customers. In all age segments, most of ABC’s revenue is generated through the least expensive products. Please refer to Appendix D.

Across age categories, the impact of the channels is very similar: Web is the most effective channel with >50%, Phone represents >40%, and Mail represents slightly more than 3%. Hence the preferred channel is Web, and any improvement made to it will most likely have the same effect in all age ranges.

Products frequently returned

It is essential for ABC to understand which products have frequent or numerous returns. Further analysis is needed to either reduce the number of returns or contact the manufacturer about potential product-related issues. Frequent returns for a certain product might hurt the overall brand value of ABC; hence it is critical to address this issue. From our analysis, we noted that products with a product category ID of “C,” “E,” and “P” have a relatively higher number of returns than the other products. Please refer to Appendix D.

Return on Marketing Investment

To estimate the return on marketing investment, we split the cost into five bins for the $0 to $100 range and three bins for the range from $101 to the maximum cost. Since the actual marketing cost is unknown, we considered three marketing cost scenarios: 5%, 10%, and 12% of the mean cost. Note that, irrespective of the actual marketing cost, the profit to marketing cost ratio decreases as we move toward the higher cost range and increases as the marketing cost percentage decreases, as shown in Table 2.

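Table 2: The profit to marketing cost ratio at 5%, 10%, and 12% marketing costs

| Min. cost | Max. cost | Profit | Net quantity | Return quantity | Profit / cost | Return % | Profit to marketing cost ratio at 5% marketing | Profit to marketing cost ratio at 10% marketing | Profit to marketing cost ratio at 12% marketing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 19 | 65755 | 2034 | 1.86 | 3.1% | 37.2 | 18.6 | 15.5 |
| 21 | 40 | 35 | 49984 | 2652 | 1.15 | 5.3% | 22.9 | 11.5 | 9.6 |
| 41 | 60 | 51 | 21393 | 1470 | 1.01 | 6.9% | 20.2 | 10.1 | 8.4 |
| 61 | 80 | 68 | 11601 | 785 | 0.96 | 6.8% | 19.2 | 9.6 | 8.0 |
| 81 | 100 | 83 | 4978 | 367 | 0.91 | 7.4% | 18.3 | 9.1 | 7.6 |
| 101 | 200 | 109 | 7992 | 655 | 0.72 | 8.2% | 14.5 | 7.2 | 6.0 |
| 201 | 300 | 148 | 2402 | 212 | 0.59 | 8.8% | 11.8 | 5.9 | 4.9 |
| 301 | 337.2 | 133 | 83 | 6 | 0.42 | 7.2% | 8.3 | 4.2 | 3.5 |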
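The ratios in Table 2 are consistent with dividing the profit-to-cost ratio by the assumed marketing rate, since the marketing cost is taken as a percentage of the cost. The Python sketch below reproduces the first row of the table under that reading; the function name is a hypothetical helper, and small differences in other rows would come from rounding of the published profit-to-cost values.

```python
def profit_to_marketing_ratio(profit_per_cost, marketing_rate):
    """If marketing cost = marketing_rate x cost, then
    profit / marketing cost = (profit / cost) / marketing_rate."""
    return profit_per_cost / marketing_rate


MARKETING_RATES = (0.05, 0.10, 0.12)

# First row of Table 2: profit / cost = 1.86 for the $0 to $20 cost bin.
first_row = [round(profit_to_marketing_ratio(1.86, rate), 1) for rate in MARKETING_RATES]
```

Because the rate is in the denominator, a lower assumed marketing percentage always gives a higher ratio, which is the pattern described for Table 2.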

Effectiveness of catalog offers by product categories

We started by looking at the yearly number of catalog transactions from 2004 to 2011 by product category; the results are presented in Figure 4. Catalog offers are those whose OFFER_ID ends in one of the following three-letter codes: EHB, ESB, FDB, GFB, GFP, HLB, LCB, LCR, LMB, LSB, LWB, MSB, OFM, OTB, PCB, PO0, REQ, SMB, SPB, TOY, VRA, VRD, and WNB.


Figure 4: Total number of transactions for catalog offers by product categories over years

Observation 1: From 2004 to 2010, orders came in consistently for OFFER_IDs EHB, GFB, GFP, HLB, LCB, LCR and LMB (see footnote 2), although orders for OFFER_ID LCR only arrived from 2007 to 2010. In 2011, there were no orders at all for those OFFER_IDs. Instead, more orders were made through other OFFER_IDs, such as SMB, SPB and WNB.

This might have a business reason; for example, the company might have discontinued those offer codes after 2010 and made more use of the other codes. Alternatively, there might have been a problem with those OFFER_IDs in 2011 that resulted in zero incoming orders in that year.

Observation 2: From 2004 to 2010, the distribution of the number of orders by product category within the same OFFER_ID was very similar from year to year: product category T was the most popular, followed by product categories C, E and P, respectively. However, in 2011, when the number of orders for product category T dropped significantly, product category P became the most popular, followed by product categories C, E and H, respectively.

2 Besides OFFER_IDs EHB, GFB, GFP, HLB, LCB, LCR and LMB, OFFER_IDs SMB, LSB, PCB, SPB and WNB seem to be the second-best group of OFFER_IDs, most of which stood out in 2011.

The dramatic decline in orders for product category T may be because (1) the category has a short product life cycle, (2) there was an issue with the offers for product category T during 2011, (3) the company's marketing strategy shifted towards other product categories and away from product category T, or (4) there was an issue on the supply side of product category T.

From Figure 4, OFFER_IDs GFB, GFP, LCB and LMB were clearly the top four catalog OFFER_IDs with respect to the number of orders placed. We henceforth refer to them as the four most effective catalog OFFER_IDs. Next, we observe how the number of orders by product category and OFFER_ID varied over time; the graphs for the four most effective catalog OFFER_IDs are shown in Figure 5.

Figure 5: Total number of transactions for the four most effective catalog offers by product categories

over years

Note that the raw input data for 2004 and 2011 do not cover the entire year, so results for those years are excluded from the following observations.

Observation 3: It can be seen that in these top-four most effective catalog OFFER_IDs, product category

T, C, E and P consistently were top-four product categories with respect to the number of placed orders

over the years. This observation conveys similar insights as Observation 2.

Observation 4: The OFFER_ID LMB and GFP seemed to have the best performance, where OFFER_ID

LMB slightly outperformed OFFER_ID GFP, which had significant drops in 2009 and 2010.

Observation 5: In general, the number of orders for product category T dropped significantly in 2009 and fluctuated with relatively large variation, approximately between 300 and 700 orders. The number of orders for the other product categories also fluctuated, but with much smaller variation.

Observation 6: After the drop in 2009, the total number of orders particularly for product category T

increased for OFFER_ID LCB and LMB, stayed almost constant for OFFER_ID GFB, and kept decreasing

for OFFER_ID GFP in 2010.

A possible explanation for Observation 6 is as follows. The increase could be a sign of product category T bouncing back from the drops of the previous two years, and if so, it could also imply repeat purchases. For the OFFER_ID that continued to drop, and the one that did not increase, customers might have switched from those OFFER_IDs to the OFFER_IDs whose total number of orders increased. In addition, for the other product categories whose total number of orders increased from one year to the next, this also suggests a possibility of repeat purchases.

To evaluate how effective the offers were, an analysis of the number of orders alone may not be enough to validate the findings. Typically, the performance of an offer is also reflected in the size and the corresponding profit of the orders. Such an analysis was performed, and the results and observations are given in Appendix B.

One of the company goals states that “contact histories and detailed product purchase information support excellent analysis of repeat purchase from the same catalog and the same product categories year to year”. Based on the results so far, we cannot confidently conclude whether repeat purchases occurred. We can only say that, overall, the product categories that were popular, in high demand and very profitable remained so over time, so there is a chance that repeat purchases occurred. The analysis in this section would have to be extended with a customer dimension in order to conclude whether customers made repeat purchases and, if so, which product categories and OFFER_IDs were repeatedly purchased and how.

Repeat purchase by catalog offer and product categories

This analysis specifically examined repeat purchases of the same product categories through the same catalog OFFER_IDs from year to year. We first define a repeat purchase.

Definition: For every unique OFFER_ID and product category combination, the number of repeat

purchases in year X is defined as the number of purchases in year X if and only if there is at least one

purchase in year X-1.

From the definition, repeat purchases only take two consecutive years into consideration, and the number of repeat purchases can be as large as the total number of purchases within a given year. Continuing from the previous section, the analysis in this section is conducted for the four most effective catalog OFFER_IDs, namely GFB, GFP, LCB and LMB, to examine how many repeat purchases occurred in a particular year and in which product categories.
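The definition above can be sketched in Python as follows. This is only an illustration: the tuple layout and the function name are hypothetical, and tracking purchases per customer is our assumption, based on the customer counts reported in this section.

```python
from collections import Counter


def repeat_purchases(orders):
    """Count repeat purchases per (offer_id, category, year).

    orders: iterable of (customer_id, offer_id, category, year) tuples,
    one tuple per purchase (hypothetical layout).
    A purchase in year X counts as a repeat purchase only if the same
    customer bought the same offer/category combination in year X-1.
    """
    per_key_year = Counter(orders)  # purchases per customer/offer/category/year
    repeats = Counter()
    for (cust, offer, cat, year), n in per_key_year.items():
        if per_key_year.get((cust, offer, cat, year - 1), 0) > 0:
            repeats[(offer, cat, year)] += n
    return repeats


sample = [
    ("c1", "GFB", "T", 2007),
    ("c1", "GFB", "T", 2008),
    ("c1", "GFB", "T", 2008),
    ("c2", "GFB", "T", 2008),  # no purchase in 2007, so not a repeat
    ("c1", "GFB", "C", 2009),  # different category, so not a repeat
]
print(dict(repeat_purchases(sample)))
```

In the sample, only the two 2008 purchases by customer c1 on category T count as repeat purchases, which illustrates why the repeat count can reach, but never exceed, the total number of purchases in a year.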

For OFFER_ID GFB, out of 6,070 customers who made at least one order from 2004 to 2011, only 37 customers made repeat purchases. One of them made repeat purchases in two consecutive years, 2008 and 2009, on product category T, whereas two of them made repeat purchases twice on different product categories. The numbers of repeat purchases in each year by product category are shown in Figure 6. The labels inside the stacked bar chart read as follows: the first letter is the product category, the second number is the total number of repeat purchases, and the last number is the percentage of repeat purchases out of the total number of orders for that particular OFFER_ID, product category and year combination.

Figure 6: The number of repeat purchase by product categories for OFFER_ID GFB across years

The percentage of repeat purchases in every product category is very small, with a maximum of 2.84% for product category T in 2007. Overall, product category T has the largest total number of repeat purchases across the years. Also, product categories A, B, D, F, G, I, J, L, M and O had no repeat purchases at all.

For OFFER_ID GFP, out of 7,203 customers who made at least one purchase from 2004 to 2011, only 38 customers made repeat purchases. Two of them made repeat purchases twice on two different product categories, and none of the customers made consecutive repeat purchases using OFFER_ID GFP. The numbers of repeat purchases for OFFER_ID GFP by year for the different product categories are shown in Figure 7.


Figure 7: The number of repeat purchase by product categories for OFFER_ID GFP across years

As with OFFER_ID GFB, the percentage of repeat purchases in every product category is very small, with a maximum of 3.05% for product category T in 2006. Overall, product category T had the largest total number of repeat purchases across the years. Also, product categories A, D, F, G, I, J, M, O and S had no repeat purchases at all.

For OFFER_ID LCB, out of 5,947 customers who made at least one purchase from 2004 to 2011, 23 customers made repeat purchases. Four of them made repeat purchases twice on two different product categories, and none of the customers made consecutive repeat purchases using OFFER_ID LCB. The numbers of repeat purchases for OFFER_ID LCB by year for the different product categories are presented in Figure 8.

Figure 8: The number of repeat purchase by product categories for OFFER_ID LCB across years

Again, the percentage of repeat purchases in every product category is very small, with a maximum of 2.20% for product category X in 2006. Unlike for the other three OFFER_IDs, product category T does not account for the majority of the repeat purchases. Instead, product categories T, E and C have roughly equal total numbers of repeat purchases across the years. In addition, product categories A, B, D, F, G, H, I, J, K, M, O and S had no repeat purchases at all.


For the last OFFER_ID, LMB, out of 9,020 customers, 36 customers made repeat purchases. One of them made repeat purchases twice on two different product categories, and one of them made consecutive repeat purchases in 2009 and 2010 on each of product categories P and T. The numbers of repeat purchases for OFFER_ID LMB by year for the different product categories are presented in Figure 9.

Figure 9: The number of repeat purchase by product categories for OFFER_ID LMB across years

The observations are very similar to those for the other three OFFER_IDs. In particular, the percentage of repeat purchases in every product category is very small, with a maximum of 3.13% for product category M in 2010, which corresponds to only one repeat purchase. Overall, product category T had the largest total number of repeat purchases across the years. Also, product categories A, B, D, G, I, J and O had no repeat purchases at all.

Given the total number of orders in Figure 10, the total number and percent of repeat purchases over the

years on aggregate are shown in Figure 11.

Figure 10: Total number of orders by product categories over the years for OFFER_ID GFB, GFP, LCB and LMB

          PRODUCT CATEGORIES
OFFER_ID  A      B      C      D      E      F      G      H      I      J      K      L      M      O      P      S      T      X
GFB       187    461    1,468  48     1,174  455    12     727    1      34     463    395    45     100    1,242  289    2,493  756
GFP       245    618    1,837  33     1,408  639    25     957    0      49     503    349    40     130    1,425  300    2,754  808
LCB       204    433    1,722  16     1,274  544    5      647    0      51     499    381    59     113    1,250  328    2,257  524
LMB       410    620    2,435  26     1,797  773    7      932    0      91     684    490    87     154    1,864  519    3,516  774


Figure 11: Total number of repeat purchase during 2004-2011 for OFFER_ID GFB, GFP, LCB and LMB

by product categories

Findings and implications

Overall, product category T dominated the total number of repeat purchases; however, the percentages of repeat purchases are all very small, even for the four most effective OFFER_IDs. This reveals some issues in the way the company runs its business from a CRM perspective.

The company clearly has a problem in getting its customers to make repeat purchases from year to year using the same OFFER_IDs. The company must segment its customers better, and target each segment with a specially designed offer or set of offers. For example, the company could provide a special discount or superior customer support to repeat buyers. The company should create a contact list and provide closer customer support on recently purchased products by periodically contacting customers to ask about their experience with those products.

The company should also conduct a survey to measure satisfaction with products and customer support, and to obtain feedback about buying factors as well as buying experiences with the company. This would lead to better customer segmentation, so that specific offers for different customer segments could be designed more appropriately.

The company must find a way, for example by developing new offers or modifying existing ones, to be more attractive to its existing customers. As shown in the previous section, the performance of every OFFER_ID currently needs to be improved, and none of the OFFER_IDs is significantly superior to the others. It is important for the company to create distinct customer segments such that customers differ across segments in nature, needs, buying behavior or perception, but are similar within the same segment. For example, one segment might contain customers who are in the industry and therefore care little about price but are highly concerned with quality; a more attractive offer for this group might be no discount on the product but an extended warranty at a discounted price. Other customers might have low income and thus care more about price than quality. The offer for this group should provide a special discount on the retail price, and perhaps an extended warranty at extra cost as an option.

Customer segmentation and profile

RFM analysis

RFM analysis is commonly used as a customer segmentation tool, where R, F and M are indexes representing the recency, frequency and monetary value of customers, respectively. The given transaction data were aggregated by customer ID, and for each customer ID new variables were created to denote the last date of purchase, the total number of purchase orders and the total profit over all purchases, as shown in Table 3.

Table 3: The variables for R, F and M

Variable Names    Corresponding Indexes
DAY_LAST_ORDER    R
CNT_ORDER         F
TOTAL_PROFIT      M
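The aggregation into the three variables of Table 3 can be sketched in Python as follows. The record layout and function name are hypothetical, each input record is assumed to be one order, and DAY_LAST_ORDER is assumed to be the number of days between the last purchase and a reference date; the original work was done in SAS, so this is an illustration only.

```python
from datetime import date


def aggregate_rfm(transactions, as_of):
    """Aggregate (customer_id, order_date, profit) records per customer.

    Assumptions: one record per order; DAY_LAST_ORDER is measured in days
    between the customer's last order and the reference date `as_of`.
    """
    last_order, cnt_order, total_profit = {}, {}, {}
    for customer_id, order_date, profit in transactions:
        prev = last_order.get(customer_id)
        if prev is None or order_date > prev:
            last_order[customer_id] = order_date
        cnt_order[customer_id] = cnt_order.get(customer_id, 0) + 1
        total_profit[customer_id] = total_profit.get(customer_id, 0.0) + profit
    return {
        cid: {
            "DAY_LAST_ORDER": (as_of - last_order[cid]).days,  # R
            "CNT_ORDER": cnt_order[cid],                       # F
            "TOTAL_PROFIT": total_profit[cid],                 # M
        }
        for cid in last_order
    }


sample = [
    ("a", date(2011, 8, 1), 10.0),
    ("a", date(2010, 1, 1), 5.0),
    ("b", date(2011, 8, 31), 7.5),
]
print(aggregate_rfm(sample, as_of=date(2011, 8, 31)))
```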

In addition, the new variables shown in Table 4 were also created to represent the total net profit and

number of purchases made over the last purchase year, i.e., 2011. Note that the order data information in

2011 is available through August 31, 2011.

Table 4: The variables for total profit and number of orders in the last year

Variable Names         Description
LAST_YEAR_TOT_PROF     The total profit generated by a particular customer over the last year
LAST_YEAR_CNT_ORDER    The total number of purchase orders made by a particular customer over the last year


Exploring R, F and M indexes

Figure 12: Histogram of numbers of purchases

Figure 13: Histogram of numbers of days from last purchase

Figure 14: Histogram of total customer profit

The distribution of the number of orders (frequency) per customer is shown in Figure 12. Close to 60% of customers made only a single order over the entire time horizon. In addition, about 20% and 10% of customers made two and three orders, respectively. Figure 13 shows the distribution of the number of days between the last date in the dataset (31 August 2011) and the last purchase date of each customer. These data are more spread out than the frequency data. The distribution of total profit per customer is presented in Figure 14. Total profit is concentrated approximately between $0 and $100.

Exploring RFM results

When we attempted to generate five bins each for the R, F and M variables, SAS Enterprise Miner created five bins for the R and M variables but only three bins for the F variable, because the majority of customers ordered only once. As a result, a total of seventy-five RFM groups (5 × 3 × 5) were generated. The top three RFM groups were 132, 555 and 131, which accounted for 4.36%, 4.21% and 3.96%, respectively. The percentage for every RFM group is given by the bar chart in Figure 15.


Figure 15: The frequency of the RFM codes

Findings and implications

From the CRM perspective, the company's customers can be summarized by RFM group as follows. Most customers are one-time customers, represented by “3” in the second digit of the RFM code, and they are also low-profit generators, represented by “1” or “2” in the third digit of their RFM codes. This might point to an existing or potential problem of customers being dissatisfied with their purchase, product quality or support, which keeps them from making repeat and future purchases with the company.3 The company should therefore put emphasis on improving customer experience, and on persuading customers or providing incentives in order to retain them.

Although the majority of the customers are one-time customers, there are also loyal customers, represented by “555” in their RFM code. This suggests that these customers have had a satisfying or positive experience with the company. The company should find ways to encourage them to spread the word to their family members and friends. Together with improvements in services and market share, this could lead to higher numbers of both new and returning customers in the future.

Most valuable customers by regions

To explore the most valuable customers, that is, those who are very profitable, order frequently and ordered recently, we used the RFM code and selected customers with an RFM of “555”. Since a customer can have multiple transactions, we counted the unique customers with an RFM of “555” in each state. The SAS code used is given in Appendix: MVC_SAS_Code.
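The counting step described above can be illustrated with the following Python sketch. It is not the SAS code from the appendix; the record layout and function name are hypothetical, and regions are taken to be the first digit of the ZIP code, as in the region list in this section.

```python
from collections import defaultdict


def count_mvc_by_region(transactions, target_rfm="555"):
    """Count unique customers with the target RFM code per 1-digit ZIP region.

    transactions: iterable of (customer_id, zip_code, rfm_code) records
    (hypothetical layout). A customer with several transactions is
    counted once per region.
    """
    customers = defaultdict(set)
    for customer_id, zip_code, rfm_code in transactions:
        if rfm_code == target_rfm:
            customers[zip_code[0]].add(customer_id)
    return {region: len(ids) for region, ids in customers.items()}


sample = [
    ("c1", "90210", "555"),
    ("c1", "90210", "555"),  # same customer, second transaction
    ("c2", "94105", "555"),
    ("c3", "30301", "555"),
    ("c4", "10001", "132"),  # not a "555" customer
]
print(count_mvc_by_region(sample))
```

Using a set per region is what prevents customers with multiple transactions from being counted more than once.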

Understanding where the most valuable customers live is of utmost importance for ABC in adjusting its marketing efforts. From the graph in Appendix: MVC_regions, we deduce that the three “Most Valuable Regions” for any future marketing offers are

1) Pacific coast (1-digit ZIP code – 9)
2) Southeast (1-digit ZIP code – 3)
3) Mid-Atlantic (1-digit ZIP code – 1)

3 This circumstance was also previously pointed out in more detail in the Findings and implications section of Repeat purchase by catalog offer and product categories.

Figure 16: Total number of transactions by 1-digit ZIP codes and channels

Although their order changes, these remain the top three regions for the “Most Profitable Customers” and the “Most Frequent Customers” as well.

Since the same customer can place different orders through different channels, we explored the preferred channel at the transaction level rather than the customer level. We found that most west coast transactions were made via the website, whereas the other regions purchased primarily via phone and mail. However, when we analyze the “Most Valuable Transactions,” phone, followed by web and mail, seem to be the preferred channels.

In addition, the Most Valuable Customers of ABC Catalog Company are not affluent; their income lies towards the lower end of the range, closer to the mean household income. The shaded region in Figure 17 represents the income group of the customers. Note that in this case, the majority belong to groups with a “Transformed: Imputed MEAN_HOUSEHOLD_INC_NUM” in the range of 0.10 to 0.29.

Figure 17: Total number of transactions by 1-digit ZIP codes and income groups

Figure 18: Histogram of transformed variable for average household income


Customer clustering and profiling

Objective and Goals

As shown by the previous analyses, the company clearly has a problem in creating an effective offer program and in retaining customers. One of the likely main causes is poor customer segmentation and a lack of understanding of customer characteristics. In the following sections, we address these issues by analyzing customer buying behaviors and generating customer segments with corresponding profiles.

Each group of customers should differ clearly, to some extent, in both buying behaviors and demographics. The ultimate goal is to create distinct groups of customers and to understand what is common within and across the groups. Specifically, we would like to observe purchasing behaviors and preferences within each group, answer the following questions, and compare the answers across the groups.

1) Are they more likely to return or cancel an order? – Higher returns lower the profits and thus affect business revenue.
2) How often are orders placed?
3) What is the average total profit? – The target is to increase profits.
4) Which product categories do they purchase? – Low purchase counts for a product category imply changes to inventory control or to the deal with the manufacturer. For this aspect, the variables are the counts of purchases for every product category.
5) What are the primary offer types? To keep this simple, the variable OFFER_DESC (catalog, web, etc.) was used to differentiate the groups.
6) What are the most common customer RFM codes? To reduce the number of RFM levels, a new RFM scheme, denoted by the RFM_NEW variable, was generated. The value of RFM_NEW varies from 1 to 8 (see footnote 4).

Apart from that, we would like to add the common demographics to each of the customer segments; for

instance,

1) The average household income to identify the target customers.

2) 1-digit ZIP code to specify the target areas.

3) The average male/female percentage to examine if the behavior changes across gender.

Inputs

In line with these goals, new variables were created and transformed as shown in Table 5. Because the majority of observations take an extreme value, the data can be considered response style data. Response style data significantly affect the performance of a clustering procedure and therefore need a proper transformation. Pagolu and Chakraborty (2011)5 showed that applying a double standardization transformation to response style data yields the best clustering performance, so we applied this transformation to our data, as indicated by the “DSTDZ” prefix in the variable names.
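As an illustration of the transformation, the following Python sketch standardizes each observation (row) first and then each variable (column). The row-then-column order, the use of the population standard deviation, and the handling of constant rows or columns are our assumptions; the exact settings used in the analysis are not stated here.

```python
from statistics import mean, pstdev


def _standardize(values):
    """Return z-scores; a constant sequence maps to all zeros (assumption)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def double_standardize(rows):
    """Standardize within each row, then within each column."""
    row_std = [_standardize(row) for row in rows]
    col_std = [_standardize(list(col)) for col in zip(*row_std)]
    return [list(row) for row in zip(*col_std)]  # transpose back to rows


data = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [5.0, 5.0, 5.0],  # constant row, e.g. a customer with a flat profile
]
for row in double_standardize(data):
    print([round(v, 3) for v in row])
```

The row step removes differences in overall level and spread between customers, so that the column step and the subsequent clustering respond to the shape of each customer's profile rather than its magnitude.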

4 The mapping scheme of the RFM_NEW variable is discussed in the Variable recoding section.

5 Pagolu and Chakraborty (2011), “Eliminating Response Style Segments in Survey Data via Double Standardization Before Clustering,” SAS Global Forum paper, p165.


Table 5: The raw and transformed variables used in creating and profiling customer segments

Variable Name               Transformed Variable Name   Role        Description
TOTAL_RETN_QTY              DSTDZ_TOTAL_RETN_QTY        Base        Total return quantity over the entire time horizon
CNT_ORDER                   LOG_CNT_ORDER               Base        Total number of orders over the entire time horizon
TOTAL_PROFIT                LOG_TOTAL_PROFIT            Base        Total profit generated by the customer over the entire time horizon
RFM_NEW                     -                           Base        The transformed RFM code of the customers
new_CNT_PROD_A              DSTDZ_new_CNT_PROD_A        Base        Total number of orders on product category A over the entire horizon
new_CNT_PROD_B              DSTDZ_new_CNT_PROD_B        Base        Total number of orders on product category B over the entire horizon
new_CNT_PROD_C              DSTDZ_new_CNT_PROD_C        Base        Total number of orders on product category C over the entire horizon
new_CNT_PROD_D              DSTDZ_new_CNT_PROD_D        Base        Total number of orders on product category D over the entire horizon
new_CNT_PROD_E              DSTDZ_new_CNT_PROD_E        Base        Total number of orders on product category E over the entire horizon
new_CNT_PROD_F              DSTDZ_new_CNT_PROD_F        Base        Total number of orders on product category F over the entire horizon
new_CNT_PROD_G              DSTDZ_new_CNT_PROD_G        Base        Total number of orders on product category G over the entire horizon
new_CNT_PROD_H              DSTDZ_new_CNT_PROD_H        Base        Total number of orders on product category H over the entire horizon
new_CNT_PROD_I              DSTDZ_new_CNT_PROD_I        Base        Total number of orders on product category I over the entire horizon
new_CNT_PROD_J              DSTDZ_new_CNT_PROD_J        Base        Total number of orders on product category J over the entire horizon
new_CNT_PROD_K              DSTDZ_new_CNT_PROD_K        Base        Total number of orders on product category K over the entire horizon
new_CNT_PROD_L              DSTDZ_new_CNT_PROD_L        Base        Total number of orders on product category L over the entire horizon
new_CNT_PROD_M              DSTDZ_new_CNT_PROD_M        Base        Total number of orders on product category M over the entire horizon
new_CNT_PROD_O              DSTDZ_new_CNT_PROD_O        Base        Total number of orders on product category O over the entire horizon
new_CNT_PROD_P              DSTDZ_new_CNT_PROD_P        Base        Total number of orders on product category P over the entire horizon
new_CNT_PROD_S              DSTDZ_new_CNT_PROD_S        Base        Total number of orders on product category S over the entire horizon
new_CNT_PROD_T              DSTDZ_new_CNT_PROD_T        Base        Total number of orders on product category T over the entire horizon
new_CNT_PROD_X              DSTDZ_new_CNT_PROD_X        Base        Total number of orders on product category X over the entire horizon
CNT_OFFER_AFFILIATE         DSTDZ_CNT_OFFER_AFFILIATE   Base        Total number of orders whose OFFER_DESC is Affiliate
CNT_OFFER_CANADIAN          DSTDZ_CNT_OFFER_CANADIAN    Base        Total number of orders whose OFFER_DESC is Canadian
CNT_OFFER_CATALOG           DSTDZ_CNT_OFFER_CATALOG     Base        Total number of orders whose OFFER_DESC is Catalog
CNT_OFFER_EMAIL             DSTDZ_CNT_OFFER_EMAIL       Base        Total number of orders whose OFFER_DESC is Email
CNT_OFFER_EMP_ORDER         DSTDZ_CNT_OFFER_EMP_ORDER   Base        Total number of orders whose OFFER_DESC is Employee Order
CNT_OFFER_INT               DSTDZ_CNT_OFFER_INT         Base        Total number of orders whose OFFER_DESC is International
CNT_OFFER_ONLINE_CAT        DSTDZ_CNT_OFFER_ONLINE_CAT  Base        Total number of orders whose OFFER_DESC is Online Catalog
CNT_OFFER_PRINT_AD          DSTDZ_CNT_OFFER_PRINT_AD    Base        Total number of orders whose OFFER_DESC is Print Ad
CNT_OFFER_WEB               DSTDZ_CNT_OFFER_WEB         Base        Total number of orders whose OFFER_DESC is Web
IMP_MEAN_HOUSEHOLD_INC_NUM  -                           Descriptor  The average income of the household in customer’s ZIP code
STATE_1                     -                           Descriptor  The first digit of customer’s ZIP code
IMP_FEMALE_PCT_NUM          -                           Descriptor  The percentage of female in customer’s ZIP code

Segments and profile

We ran the clustering procedure with multiple combinations of clustering method and internal standardization, and found that Range internal standardization with the Ward clustering method performed the best.6 Based on statistical recommendation criteria, we chose to create 4, 8 and 12 customer segments and profiled them accordingly; the detailed clustering and profiling results are given in Appendix C. We found the profile of the eight-cluster solution to be the most meaningful. It is presented in Figure 19, and Table 6 summarizes the profile story.

6 By “performed the best”, we mean that the clustering procedure follows the hierarchical structure, with no or very few observations joining the clusters late in the algorithm. The results of the clustering procedure can be found in Appendix C.


Figure 19: Customer profile of the eight-cluster solution

Table 6: Customer segment characteristics

Segment Number   Characteristics

1   The customers in this segment are recent and profitable, but have a low number of transactions with the company. They respond most to affiliate and email offers, and specifically to product categories H and M. Most of them live on the west coast and in the southern parts of the country.

2   Most of the customers are very recent and have many transactions with the company, but they are not profitable. They like to buy product categories C, K, P and T through the catalog, and they mostly live in the eastern, northern and southern parts of the country. However, this segment has a high rate of product returns.

3   In this segment, customers like to buy products through the website and catalog, especially product category T. Most of them buy more than one product. Customers in this segment are also highly interested in product categories C, E, F and P. Like segment 2, this segment also has a high rate of product returns.

4   This is one of the most active segments in terms of product variety and offers responded to. The customers in this segment mostly respond to every offer channel except catalog. They respond to almost all of the company’s product categories, with categories D, G and I as the top three. They have not had a transaction with the company for a long time and have no repeat purchases. Most of them are from the northern part of the country.

5   The customers in this segment are responsive to many of the company’s offer channels, except affiliate, catalog, email and website. They also have a big wallet, and are from the west coast area.

6   In this segment, the customers are mostly responsive to catalog offers and to product categories E, H and T. They used to be frequent and high-value-transaction customers.

7   The most valuable customer group is in this segment. They love product category P and respond the most to catalog offers. They are mostly female and live in the central north and south of the country.

8   The customers in this group are very recent, and they mostly receive product information via email. The product categories they are interested in are A, B, G, L, J, M and O. Most of them live in the northern part of the country, and they have a high rate of product returns.

Findings and Implications

From a technical point of view, we found that creating segments from the given transactional data is very difficult. Specifically, the majority of customers are one-time purchasers, so there is no implied pattern by which to differentiate them based on buying behaviors and preferences. In addition, one-time buyers also create a response style data issue, which requires a proper transformation to alleviate the impact on the clustering procedure. Despite these limitations, the clustering procedure was able to create meaningful clusters. We created profiles for all of the cluster solutions, and eventually chose the eight-cluster segmentation.

From a prediction point of view, the RFM_NEW variable is, as expected, the most important variable in predicting the cluster of an observation, for every cluster. That is because the RFM_NEW variable has eight levels, which equals the number of segments. As a result, there is a one-to-one mapping between the levels of the RFM_NEW variable and the segment numbers.

Previous sections pointed out that the company was not able to create an effective offer program, perhaps because the company lacked an understanding of the nature of its customers. As a result, the dominant group of customers is one-time buyers. Starting from the initial profile shown in Table 6, the company should generate meaningful promotions on particular combinations of offer and product category based on common group demographics, and send them to its customers, especially those who made only a few transactions, to encourage future purchases. Sample recommendations are given in Table 7.

Table 7: Sample recommendations for customer segments

Segment Number   Recommendations

1   Use affiliate and email offers on product categories H and M on the west coast and in the southern part of the country.

2   Use a catalog offer on product categories C, K, P and T in the eastern, northern and southern parts of the country, with a product return option.

3   Use website and catalog offers on product categories T, C, E, F and P, with a product return option.

4   Use a non-catalog offer, mainly on product categories D, G and I, in the northern part of the country.

5   Use a variety of offer types, EXCEPT affiliate, catalog, email and website, on the west coast.

6   Use a catalog offer on product categories E, H and T that provides some advantages to frequent buyers.

7   Use a catalog offer on product category P, especially female products, in the central north and south of the country, and also offer a loyalty reward program.

8   Use an email offer on product categories A, B, G, L, J, M and O in the northern part of the country, with a product return option.

Predictive models

Model target

The target variable for building the predictive models was the NET_REVENUE variable (see footnote 7).

Data partition

The random seed of 57005 was used, and the proportion of the data partition between Train and Validation

was set to 50/50.
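The partition described above can be illustrated with a minimal sketch. It assumes a simple random (non-stratified) split, which may differ from what the Data Partition node in SAS EM actually does; the function name `partition` and the use of record ids are hypothetical. Only the seed (57005) and the 50/50 proportion come from the report.

```python
import random

def partition(ids, seed=57005, train_fraction=0.5):
    """Shuffle record ids with a fixed seed and split them into train/validation.

    Assumption: a plain random split; this is not SAS EM's own algorithm.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical example: 1,000 record ids split 50/50.
train_ids, valid_ids = partition(range(1000))
```

Fixing the seed means every team member who reruns the split obtains the same Train and Validation sets, so ASE values remain comparable across experiments.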

Variable reduction and selection

A large number of input variables can create a dimensionality problem, which limits the ability to fit a model to noisy data. Reducing the number of input variables is therefore key to avoiding the dimensionality problem and building a better model. In this case, interval and categorical variables were treated separately. For the interval variables, variable reduction was handled by the variable clustering method, which groups highly correlated variables into the same cluster while keeping the correlation between clusters low. For the categorical variables, a decision tree was used to find the most important variables with respect to a specific target, which in this case was net revenue.

According to the variable clustering result (please see Appendix G), for most clusters a single variable was chosen as the representative of its cluster. However, for some clusters more than one variable was chosen because of the business objective. For instance, even though count order is the most important variable in cluster 3 based on the 1 - R-square ratio, we also included offer range to see the effect of a particular promotional range on the target variable.
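As a small illustration of the 1 - R-square ratio mentioned above, the sketch below uses the usual variable clustering definition: (1 - R-square with the variable's own cluster) divided by (1 - R-square with the nearest other cluster). A smaller ratio indicates a better cluster representative. The function name and the numeric inputs are hypothetical and are not taken from Appendix G.

```python
def one_minus_r2_ratio(r2_own, r2_nearest):
    """Return (1 - R^2 own cluster) / (1 - R^2 nearest other cluster).

    Smaller values mean the variable is well explained by its own cluster
    and poorly explained by the next closest cluster.
    """
    if not (0.0 <= r2_own <= 1.0 and 0.0 <= r2_nearest < 1.0):
        raise ValueError("R-square values must lie in [0, 1], nearest below 1")
    return (1.0 - r2_own) / (1.0 - r2_nearest)

# Hypothetical values: two candidate variables in the same cluster.
candidate_a = one_minus_r2_ratio(r2_own=0.90, r2_nearest=0.20)  # about 0.125
candidate_b = one_minus_r2_ratio(r2_own=0.60, r2_nearest=0.30)  # about 0.571
best = "a" if candidate_a < candidate_b else "b"
```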

Unlike variable clustering, the decision tree is a supervised method, which selects variables based on a specific target variable. The decision tree algorithm provides the logworth of each variable with respect to the target. The higher the logworth, the greater the importance of that variable. For instance, in the logworth result (please see Appendix G), RFM had the highest logworth, which means it was the most important variable for explaining net_revenue. As with the interval variables, some variables that had low importance with respect to the target variable were still included in the model because of the business objective.
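For readers unfamiliar with logworth, the sketch below shows the standard definition, logworth = -log10(p-value) of the split test. The p-values used here are hypothetical and do not come from Appendix G; the variable names are only borrowed for illustration.

```python
import math

def logworth(p_value):
    """Logworth of a split: -log10 of the p-value of its significance test."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)

# Hypothetical p-values for three categorical inputs.
p_values = {"RFM": 1e-12, "PAY_METHOD": 1e-4, "CHANNEL": 0.03}
ranked = sorted(p_values, key=lambda name: logworth(p_values[name]), reverse=True)
```

Because the transform is monotone, ranking variables by logworth is the same as ranking them by increasing p-value; the log scale simply makes very small p-values easier to compare.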

The following table shows the final pool of variables from the variable selection and reduction process that could be used as inputs for building the predictive models.

Table 8: The selected input variables for building predictive models

Interval                 Categorical
1. COST_PER_UNIT         1. RFM
2. PRICE_PER_UNIT        2. RETN_DATE
3. RETN_REVENUE          3. SHIP_DATE
4. QUANTITY              4. PRODUCT_CATEGORY_ID
5. SHIP_QUANTITY         5. ORDER_DATE
6. CNT_ORDER             6. PAY_METHOD
7. DAY_LAST_ORDER        7. CHANNEL
8. RETN_QTY              8. OFFER_DESC
9. OFFER_VALID_RANGE     9. ZIP
10. CANCEL_QUANTITY

7 We selected to use the NET_REVENUE variable as our single target variable because of the ease of grading on the performance of the models.

Variable recoding

The interval variables had already passed the data audit, including transformation and filtering, but some of the categorical variables needed to be recoded to reduce their number of levels. Among the categorical variables chosen, RFM and ZIP were the two with too many levels: 75 and 17,876 levels, respectively.

For business relevance and model simplicity, the RFM variable was recoded into a new variable, RFM_NEW. Each of the R, F and M components of RFM was recoded to high (H) if its value was between 1 and 3, and to low (L) if its value was 4 or 5. The RFM_NEW variable takes its values according to the mapping scheme in Table 9. As a result, the number of levels of the RFM code decreased from 75 in the RFM variable to 8 in the RFM_NEW variable.

Table 9: The mapping scheme from the RFM to RFM_NEW variables

Mapping Scheme
RFM (H/L pattern)    RFM_NEW
LLL                  1
LLH                  2
LHL                  3
LHH                  4
HLL                  5
HLH                  6
HHL                  7
HHH                  8
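The recoding rule and the mapping in Table 9 can be sketched as follows. This is an illustration only: it assumes the R, F and M scores are integers from 1 to 5 and that the three letters of each pattern are in R, F, M order; the function names are hypothetical.

```python
# Table 9 mapping from H/L pattern to RFM_NEW.
PATTERN_TO_RFM_NEW = {
    "LLL": 1, "LLH": 2, "LHL": 3, "LHH": 4,
    "HLL": 5, "HLH": 6, "HHL": 7, "HHH": 8,
}

def level(score):
    """Recode one RFM component: 1-3 becomes 'H', 4-5 becomes 'L'."""
    if 1 <= score <= 3:
        return "H"
    if 4 <= score <= 5:
        return "L"
    raise ValueError(f"score out of range: {score}")

def rfm_new(r, f, m):
    """Map the three component scores to the eight-level RFM_NEW code."""
    # Assumption: the pattern letters are ordered R, F, M.
    return PATTERN_TO_RFM_NEW[level(r) + level(f) + level(m)]

codes = [rfm_new(1, 1, 1), rfm_new(5, 5, 5), rfm_new(2, 4, 1)]
```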

The ZIP variable was also recoded by taking the first character of the ZIP code, which is intended to represent regions, and the result was denoted by ZIP1. The number of levels decreased from 17,876 in the ZIP variable to 33 in the ZIP1 variable.

Decision tree

Model implementation

The following three changes were made before building the model to get the best model performance with respect to ASE.

Table 10: The modifications made to the default settings on Decision Tree node

Property                                      Default     Change to
Splitting Rule: Interval Target Criterion     ProbF       Variance
Maximum Depth                                 6           8
Subtree: Assessment Measure                   Decision    Average Square Error

Model results

With the default settings, the decision tree built a model with an average square error (ASE) of 26.69 on the validation data. Only two variables were included in that model, price per unit and net quantity, which are not very meaningful for building a marketing strategy. The challenge in building this model is the trade-off between the accuracy of the model and results meaningful enough for the company to deploy. For instance, some models include only price, cost, and quantity, which are highly correlated with the target (net revenue); this makes the ASE low but carries no meaning from a marketing perspective. However, when these variables were taken out of the model, the ASE got worse. After trial and error over many scenarios, the winning model had an ASE of only 3.95. It includes the following seven variables, which are important for predicting revenue and meaningful for marketing implementation.

1) Price per unit

2) Net quantity

3) Zip

4) Payment method

5) Product category

6) Offer range

7) RFM

As expected, price is the most important factor for predicting revenue, followed by net quantity. Interestingly, the most important of the categorical variables is payment method. There are 19 rules that are meaningful to discuss. Please see Appendix G for more information.

Regression

Model Implementation

Since the regression model is sensitive to non-normally distributed data, variables that are very skewed or kurtotic were transformed, and the transformed versions were used as inputs.

The Regression model was built under different scenarios to find the best one:

1) Model selection method and criteria: To cover as many scenarios as possible, Stepwise, Forward and Backward were used as the selection methods, with Validation Error (ASE) as the selection criterion. The Entry and Stay significance levels were each set to 1 to give the model a broad scope when selecting variables during the selection process. The maximum number of steps was set to 10.

2) Interactions: We allowed the model to include all two-factor interactions for class variables, as well as all possible polynomial terms.

3) Variable selection: Variable selection is important for the trade-off between meaningful results and accuracy. For example, a model may have a very low ASE but include variables that cannot be interpreted, in which case the model cannot be implemented in the real world. Therefore, discretionary variable selection might be applied in some cases.


Model Results

After we experimented with many scenarios based on the property mentioned above, we got the optimal

model which included the variables shown below (please see Appendix G).

1) Cost per unit * Retn quantity

2) Cost per unit * Quantity

3) Cost per unit * Ship quantity

4) Price per unit * Net quantity

5) Cost per unit

6) Offer discount

7) Price per unit

The model’s ASE is 34.55 in this case. Even though 34.55 ASE is not the lowest ASE we got among all

the scenarios, the variables selected are interpretable when predicting the target Net Revenue.

Findings

The estimates show that an increase in the cost per unit * quantity term drives net revenue down, whereas a one-unit increase in the price per unit * quantity sold term increases net revenue by $666.2. Also, some offers, such as Affiliate, Canadian, Catalog and International, have a positive relationship with net revenue, whereas other offers, such as Email, Employee order, Online Catalog and Print ad, do not have a positive impact on revenue.

Product categories H and I appear to be the ones that drive net revenue up. A one-unit increase for these products is associated with an increase in net revenue of 0.59 and 19.55, respectively (please see Appendix G).

Neural network

Model Implementation

Step1: If the neural network is run with the pool of variables chosen in the variable selection and reduction section, the number of estimated weights is 441, which is very large. So, variable reduction should be done again before running the neural network. In this case, we used stepwise regression to pick the variables before fitting the neural network model.

Step2: With the variables obtained from Step 1, the neural network was run with a maximum of 1000 iterations.

Step3: Use the Metadata node to change the target to P_Net_revenue, which is the predicted value of the target variable produced by the neural network.

Step4: Use a decision tree to explain the neural network model (a surrogate model).

Model result

Variables selected by the stepwise regression model:

1) NET_QUANTITY

2) PRICE_PER_UNIT

3) PRODUCT_CATEGORY_ID

4) QUANTITY

5) RFM_NEW


With the variables above and some changes to the default neural network properties, the neural network built a model with an average square error (ASE) of 22 on the validation data, converging after 134 iterations (see Neural_Network_3 for the results). However, it is very difficult to explain the findings directly from the neural network model. Therefore, a decision tree was used to explain the neural network result.

From the decision tree result, as expected, price is the most important factor for predicting revenue, followed by net quantity (see Appendix G). Interestingly, the most important of the categorical variables is RFM. There are 35 rules out of 200 that are meaningful to discuss. Please see Appendix G for more information.

Findings

Customer specific (RFM)

At prices over $300, customers with RFM codes 2, 4, 6 and 8 (the codes whose third component is H in Table 9) appear to be good customers. On the other hand, at prices lower than $300, customers with RFM codes 1, 3, 5 and 7 appear to be good candidates.

Product Category

Product categories P, T, and H are important for explaining the revenue of the company in most price ranges (please see Appendix G). The important product categories can be summarized by the four price ranges below.

Price Range ($) Product Category

<100 B, E, C, T, P, S, F, L, H, X, 0, A

130 – 187 B, E, C, T, P, F, H, K

220 – 250 T, P, H

400 E, C, T, P, F, H, X, 0, K

Comparison of models’ performance

In this section, we identified the best of the three predictive models examined in the previous sections, based on their performance in predicting net revenue as measured by the validation average squared error (ASE). ASE was chosen because it measures the average squared difference between the target and its estimate. The goal is to minimize the ASE, and the model with the minimum ASE value has the smallest prediction error. Figure 20 presents the results from the Model Comparison node.
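For reference, ASE is the mean of the squared differences between actual and predicted target values. The sketch below computes it and picks the model with the lowest value; the data and the model names `model_a` and `model_b` are hypothetical and are not results from this project.

```python
def average_squared_error(actual, predicted):
    """ASE = (1/N) * sum((actual_i - predicted_i)^2)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical validation targets and predictions from two models.
actual = [10.0, 20.0, 30.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0],   # ASE = (4 + 4 + 9) / 3
    "model_b": [10.0, 21.0, 29.0],   # ASE = (0 + 1 + 1) / 3
}
ase_by_model = {name: average_squared_error(actual, pred)
                for name, pred in predictions.items()}
best_model = min(ase_by_model, key=ase_by_model.get)
```

Because the errors are squared, ASE is expressed in squared target units and penalizes a few large misses more heavily than many small ones.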

Figure 20: The comparison results based on the ASE value

Under the average square error criterion, the decision tree model was chosen because it has the lowest ASE value on the validation data; the neural network model and the regression model ranked second and third, respectively.


Scoring

Scoring performance

Since the decision tree model performed best among the three models, it was selected to score the data. Figure 21 shows the results of the scoring.

Figure 21: The scoring results on the score data performed by the decision tree model

The scoring output is quite satisfactory. In terms of mean net revenue, the difference between the validation and scoring data is only $5.33, which is a difference of around 7%. This result suggests that the model generalizes reasonably well to new data.

Summary

This project was undertaken for ABC Catalog Company, a specialty multi-channel catalog company with a strong web presence. We started with the objective of helping ABC improve its revenue and market share. Since the data provided contained no information about the actual products or customer-level demographics, we supplemented it with external census data to better understand ABC's customers and business. The CRISP-DM methodology was used, with the goal of eventually predicting future revenues and suggesting marketing strategies. We examined purchase patterns, marketing ROI, and seasonal trends. Finally, we used the data to build a customer segmentation and were able to predict future revenues for ABC through predictive modeling.

The predictive models that we used are decision tree, regression, and neural network. The decision tree is a rule-based model, whereas regression and neural network are parametric models. Before running a model, variable selection and reduction were performed as appropriate so that only input variables important to the target variable (i.e., NET_REVENUE) and to the business goals were kept. Then, the data was partitioned into training and validation data. Each model was built on the training data and assessed on the validation data to evaluate how well it generalizes. For each model, experiments were then performed with various property configurations to find the configuration with the lowest prediction error. In this work, the measure of model error is the average squared error (ASE).

Findings and Recommendations

Segmentation and Clustering

From the technical point of view, we found that creating segments with the given company transactional data is very difficult. Specifically, the majority of customers are one-time purchasers, and therefore there is no implied pattern in differentiating them based on buying behaviors and preferences. In addition, one-time buyers also create an issue of response-style data, which needs proper transformation to alleviate the impact on the clustering procedure. Despite this limitation, the clustering procedure was able to create meaningful clusters. For all of them we created profiles, and eventually chose the eight-cluster segmentation, whose profile is most meaningful to the team.

From the prediction point of view, the RFM_NEW variable is, as expected, the most important variable for predicting the cluster of an observation, and this holds for every cluster. That is because the RFM_NEW variable has eight levels, which equals the number of segments. As a result, there is a one-to-one mapping between the levels of the RFM_NEW variable and the segment numbers.

One of the business issues pointed out earlier is that ABC was not able to create an effective offer program, perhaps because the company lacked an understanding of the nature of its customers. As a result, the dominant group of customers is one-time buyers. Using the initial profiles developed in this work, the company should start generating meaningful promotions on particular combinations of offer and product category based on common group demographics, and send them to its customers, especially those who have made only a few transactions, to encourage a future purchase with the company.

Predictive Modeling

Given the high variance of the target variable (NET_REVENUE), the ASE values of all three models, which range from 3.95 to 34.55, are acceptable. The variables selected by each model are summarized in the following table.

Decision Tree       Regression                        Neural Network
Price per unit      Cost per unit * Retn quantity     Net quantity
Net quantity        Cost per unit * Quantity          Price per unit
Zip                 Cost per unit * Ship quantity     Product category id
Payment method      Price per unit * Net quantity     Quantity
Product category    Cost per unit                     RFM
Offer range         Offer description
RFM                 Price per unit

Recommendations in this section are an amalgamation of the data exploration, RFM, most valuable

customer, and model results. Note that from the model comparison, decision tree model delivers the best

performance so the model recommendations are purely based on the decision tree model.

The model results clearly show which variables are the most influential at each level. Please refer to Appendix G for more information. Based on price, there are four groups, each with its own distinct characteristics. The following table summarizes each group's price range and important factors.

Group Price range ($) Important Factors

1 55-95 Product type, Payment method

2 100-140 Offer range

3 160-220 Payment method

4 320-460 Product type

Group1: Price range 55-95


This group is the most important group because 70 percent of ABC’s revenue came from the product with

the price lower than $90. From the model’s results, this group contains three significant findings: product

type demand, quantity purchased, and credit card merchant specificity.

Product specific: With the price range from 55 to 70 dollars, ABC company should concentrate on these

ten product types: B, E, P, F, C, J, T, X, H, and A. However, 10 products out of 19 products might be too

large for the marketing budget. To be more specific and limit the marketing budget, product type B, P, C,

T, and X should be a marketing priority.

Bundle strategy: In general, ninety percent of ABC's customers buy only one product, but in this price range customers mostly bought more than one product. ABC might capture a bigger share of customer wallet by encouraging customers to buy more than one product at a time through a bundling strategy.

Co-promotion with credit card issuers: American Express, Visa, and MasterCard are the credit card issuers that affect revenue significantly. ABC might encourage more sales by cooperating with these three issuers. For example, there could be a discount offer for customers who buy a product using one of these credit cards.

Final recommendation: In the price range 55-95, ABC should focus on five products: B, P, C, T, and X. ABC should sell these five products in bundles and give customers a promotion if they use American Express, Visa, or MasterCard to buy a product.

Group2: Price range 100-140


Geographical specific: Two regions, identified by ZIP code, have a significant effect on revenue: ZIP codes beginning with zero and ZIP codes beginning with four. ZIP codes beginning with zero correspond mainly to the New England region, and those beginning with four cover four states: Indiana, Michigan, Ohio, and Kentucky.

Offer range: In this price range there is a significant difference between giving customers an offer range of six months and one of one year. With a six-month offer range, customers tend to buy a higher number of products than with a one-year range. A possible reason is that customers fear they will not have a chance to buy the product under the particular offer in the future, so they buy more than they necessarily need at that time.

Final recommendation: ABC should offer only six-month promotions in the areas whose ZIP codes begin with zero or four.

Group3: Price range 160-220


Co-promotion with credit card issuers: ABC could follow a co-promotion strategy with credit card issuers similar to the one suggested for group one, but including Discover card as well.

Outbound sales: There is a significant relationship between RFM code 8 and payment by personal check in this price range. From the previous analysis, most of the customers who pay by personal check use the phone channel. Therefore, instead of waiting for customers to read the catalog and call in, ABC might conduct outbound sales by calling customers on the RFM code 8 list.

Final recommendation: ABC company should promote outbound sales and co-promotion with credit card

issuers.

Group4: Price range 320-460



Product specific: The products that customers in this price range are interested in are F, H, T, X, and C. Therefore, focusing on these products is key to successful marketing.

Limitations

SAS Program

SAS EM has a limitation regarding synchronization. Since the project is large, it is very difficult for one person to carry out every process in one diagram, so the work was divided among the team members. At the end, however, some of the work needed to be combined into one diagram, and SAS EM has no ability to merge diagrams. The work therefore had to be re-created in a single diagram, which was redundant effort.

Clarification of the data/variables

Some variables have no clear meaning. Even when we found a relationship or something interesting, we could not fully understand it or write much about it. For example, for product category ID, we do not know what A or B is. If we knew, we might be able to draw more ideas from the results and write up the report in a more meaningful way.

Company interaction

During the project, we came up against non-textbook questions that need the company's opinion to clarify. For instance, in the model building phase, we might have excluded some input variables that are important for the company but not for the model. Also, if we had been given some ballpark numbers, such as marketing cost per year and marketing cost as a percentage of sales, it would have helped a lot in the recommendation section. So, it would be valuable to interact with a real company representative at least once per phase.


Appendix

Appendix A

Data exploration on the net profit and its associations of the input variables

The input variables can be ideally divided into two categories, which are numerical (or interval) and

categorical (or nominal or binary) variables. The associations can also be categorized in the same manner.

We begin this section with the variable worth of all input variables, followed by correlations between our

target variables and the interval input variables.

Variable worth

Figure 22: Variable worth with respect to net profit

The variable worth shows the worth of an individual input variable in predicting the target variable. Figure

22 shows that NET_REVENUE has the highest worth value. This is intuitive as NET_REVENUE is the

variable which is very highly related to the target as it was used as a primary variable in the NET_PROFIT

formula. Recall that NET_REVENUE is equal to NET_QUANTITY × PRICE_PER_UNIT. The other two

variables that were used in the NET_PROFIT formula are PRICE_PER_UNIT and COST_PER_UNIT,

which unsurprisingly have the second and third highest worth values.

Pearson correlation

Figure 23: Pearson correlation coefficient of interval variables with respect to net profit


Pearson correlation shows linear relationships between the target and interval input variables. Intuitively,

NET_REVENUE, PRICE_PER_UNIT, COST_PER_UNIT, NET_QUANTITY are amongst the variables

whose correlations are positive and relatively high. The only standout input variable that is negatively

correlated with the target variable is CANCEL_QUANTITY.

Spearman correlation

Figure 24: Spearman correlation coefficient of interval variables with respect to net profit

Spearman correlation is rank-based, so it captures monotonic relationships, both linear and non-linear, between the target and the interval input variables. It can be observed that NET_QUANTITY and RETN_QTY are more positively correlated under this measure, and the negative correlations of CANCEL_QUANTITY and RETN_REVENUE are stronger.
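The difference between the two coefficients can be seen in a small sketch. Spearman correlation is computed here as the Pearson correlation of the ranks (with average ranks for ties). The data are hypothetical (y is the cube of x) and are not taken from the ABC data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
```

In this example the Spearman coefficient is essentially 1 because the relationship is perfectly monotonic, while the Pearson coefficient is lower because the relationship is not linear.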

In the following, the relationships between the target and categorical input variables as well as insights are

demonstrated.

Total net profit by product categories

Figure 25: Percent contribution to the net profit by product categories


Figure 25 shows that product category E generated the highest profit (approximately $1.06 M over the entire horizon, or about 15% of the total), followed by P, T, C, and so on in a counter-clockwise direction. The highest-profit category does not necessarily have the largest total quantity sold or the most orders. As presented in Figure 26, product category T is the most popular category in terms of the total number of orders and total quantity sold, and the largest in terms of the total number of SKUs. Another factor that contributes significantly to the total net profit of a product category is its average per-unit profit. As shown in Figure 27, the average per-unit profit of product category E is about ten dollars higher than that of product categories T and C, and about seven dollars higher than that of product category P. This is the reason why product category E is the most profitable.

Figure 26: Total number of orders, sold quantity and SKUs by product categories

Figure 27: Average unit profit by product categories


Net profit by SKUs

Figure 28: Top 10 most profitable SKUs

Figure 28 presents the top 10 most profitable products and their corresponding total net profit. As mentioned earlier, the total net profit is strongly influenced by the average per-unit profit. Figure 29 shows the associations between average per-unit profit and total net profit by product number. It can be seen that most of the average per-unit profits are not high; several are higher than 1,000 dollars, but those products did not generate correspondingly high total profit. Therefore, the core profitability of the company comes from selling large volumes of inexpensive products.

Figure 29: Scatter plot matrix of average unit profit, total profit by SKUs (matrix variables: AVG_UNIT_PROF_BY_PROD, PRODUCT_NO, TOT_PROF_BY_PROD)

Net profit by 3-digit ZIP codes

The ZIP code information in the data set is provided at the 5-digit level, which is too granular to analyze easily and draw insights from. We aggregated the data to the 3-digit ZIP code level and report the geographical areas, by 3-digit ZIP code, that generated large net profit. As seen in Figure 30, the ten most profitable 3-digit ZIP codes are 100, 600, 770, 606, 117, 926, 945, 070, 334, and 300, which are in the states of NY, IL, TX, CA, NJ, FL and GA. The results are expected, as these states are large in terms of population and have more buying power.
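The aggregation to 3-digit ZIP level can be sketched as below. The records and the function names are hypothetical; the sketch assumes ZIP codes are stored as strings so that leading zeros (as in 070) are preserved.

```python
from collections import defaultdict

def profit_by_zip3(records):
    """Sum net profit by the first three characters of the ZIP code."""
    totals = defaultdict(float)
    for zip_code, net_profit in records:
        totals[str(zip_code).strip()[:3]] += net_profit
    return dict(totals)

def top_zip3(records, n=10):
    """Return the n most profitable 3-digit ZIP prefixes, highest first."""
    totals = profit_by_zip3(records)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical (ZIP, net profit) records, not taken from the ABC data.
records = [("10001", 50.0), ("10023", 25.0), ("60614", 40.0), ("07030", 10.0)]
top_two = top_zip3(records, n=2)
```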

Figure 30: Top 10 most profitable 3-digit ZIP codes

Net profit by Customer IDs

Figure 31 shows the top 20 buyers based on their total net profit. It can be observed that the customer whose ID is 10031582924 is the most profitable customer and is a potential candidate for the most valuable customer. Also, there are 13 customers whose purchases resulted in more than 2,500 dollars of profit over the entire time horizon. These top customers drive much of the company's profit, so the company should value them, maintain a good relationship with them, and try to retain them.


Figure 31: Top 20 most profitable customers

Next, we explore the seasonality pattern of customer orders and its relationship with the target variable, i.e., net profit, which is presented in Figure 32.

Figure 32: Patterns of net profit and customer orders in date/time scale

It is clear that the purchase pattern of customers and the resulting net profit are seasonal. To see the variation of all activities by month, the entire time horizon was grouped by month number, i.e., from 1 (January) to 12 (December), and the plot is given in Figure 33.


Figure 33: Number of occurrences of activities by month numbers

We can see that the numbers of customer orders and shipped orders are very high in the last part of the year, from September through December. As a result, the numbers of returned orders and cancelled orders turn out to be higher in December and January than in other months.

Appendix B

The effectiveness of catalog offers by product categories by total order quantities

To evaluate how effective the offers were, an analysis of the number of orders alone may not be enough to reach a conclusion. The performance of the offers can also be inferred from how large the orders were, which depends on how the order quantities were distributed within those orders. So, in this section, we take the same approach to perform a similar analysis, this time based on the order quantities. We again started by looking at the total order quantity from catalogs from 2004 to 2011, presented in the following figures. The total order quantities were grouped by the last three letters of the OFFER_IDs and by product category.


Compared to the results in the previous section, one can envision the similarity of the results in this section

as the majority of the orders contained the order quantity of one. The observations in this section can then

be partially implied from the observations in the previous section, and for those observations that imply

one another, the justification is similar, and thus not mentioned in this section.

Observation 7: From 2004 to 2010, there were consistently incoming order quantities for OFFER_IDs EHB, GFB, GFP, HLB, LCB, LCR and LMB (see footnote 8), although order quantities for OFFER_ID LCR occurred only from 2007 to 2010. In 2011, there were no order quantities at all for those OFFER_IDs. Instead, there were more order quantities for the other OFFER_IDs (see footnote 9).

Observation 8: From year 2004 to 2010, the pattern of the order quantity distribution by product categories within the same OFFER_ID seemed to be very similar over years, where product category T had the largest total order quantity, followed by product category C, E and P, respectively. However, in 2011 when the total order quantity of product category T significantly dropped, the product category P became the product with the largest total order quantity, followed by product category C, E and H, respectively (see footnote 10).

8 Besides OFFER_ID EHB, GFB, GFP, HLB, LCB, LCR and LMB, OFFER_ID SMB, LSB, PCB SPB and WNB seem to be the second group of the best OFFER_IDs, of which most were standing out in 2011.

9 Observation 6 and 1 can imply one another as many of order quantities in the orders were one.

OFFER_IDs GFB, GFP, LCB and LMB were the top four OFFER_IDs with respect to order quantities. Again, we want to see not only how the order quantities were distributed by product category within the same OFFER_ID, but also how those numbers changed over time; both are shown in the following figures for the four most effective OFFER_IDs.

Note that for 2004 and 2011 the raw input data were not provided for the entire year, so those years are not analyzed in the following. Compared with the figures in the previous section, where the analysis was based on the number of orders, the figures above look almost identical. The following observations can therefore again be inferred from the observations made in the previous section.

Observation 9: It can be seen that within these top-four most effective OFFER_IDs, product category T, C,

E and P consistently were top-four product categories with respect to the total order quantities. This

observation conveys similar information, therefore validating the previous observation.


LMB slightly outperformed OFFER_ID GFM, which had significant drops in 2009 and 2010.

10 Observation 7 and 2 can imply one another as many of order quantities in the orders were one.


Observation 11: In general, the total order quantity for product category T dropped significantly in 2009 and fluctuated over a relatively wide range, approximately between 300 and 750 units. The total order quantities of the other product categories also fluctuated, but with much smaller variation. Compared to Observation 5, the variation range for each product category was slightly wider.

Observation 12: After the drop in 2009, the total order quantity for product category T in particular increased for OFFER_IDs LCB and LMB, stayed almost constant for OFFER_ID GFB, and kept decreasing for OFFER_ID GFP in 2010.

The effectiveness of catalog offers by product categories by total profit

Compared with the analysis based on the number of orders placed by OFFER_ID, the analysis based on order quantities by OFFER_ID does not yield many additional results or conclusions. To further evaluate the effectiveness of the catalog OFFER_IDs, we conducted a similar analysis based on total profit, because different product categories may have different profit margins and may therefore lead to a different conclusion. As in the previous two analyses, we started by looking at the total profit generated from catalogs from 2004 to 2011, presented in the following figures. The total profits were aggregated by the last three letters of the OFFER_IDs and by product category.
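The aggregation described above (total profit by the last three letters of the OFFER_ID and by product category) can be sketched as follows. This is only a minimal illustration: the record layout, the sample OFFER_ID strings such as "08LMB", the function name and all profit values are hypothetical and are not taken from the ABC Catalog data.

```python
from collections import defaultdict

# Hypothetical order lines: (year, OFFER_ID, product category, profit in dollars).
# The field layout and every value here are illustrative assumptions.
order_lines = [
    (2008, "08LMB", "T", 120.0),
    (2008, "08LMB", "C", 45.5),
    (2008, "08GFB", "T", 80.0),
    (2009, "09LMB", "T", 60.0),
    (2009, "09LMB", "T", 15.0),
]

def total_profit_by_offer(lines):
    """Sum profit per (year, last three letters of OFFER_ID, product category)."""
    totals = defaultdict(float)
    for year, offer_id, category, profit in lines:
        offer_group = offer_id[-3:]  # e.g. "08LMB" -> "LMB"
        totals[(year, offer_group, category)] += profit
    return dict(totals)

totals = total_profit_by_offer(order_lines)
for key in sorted(totals):
    print(key, totals[key])
```

Replacing the profit field with a count of one per order, or with the order quantity, would give the corresponding totals used in the two earlier analyses.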


Observation 13: From 2004 to 2010, profits were consistently generated by OFFER_IDs EHB, GFB, GFP, HLB, LCB, LCR and LMB, although OFFER_ID LCR generated profit only from 2007 to 2010. In 2011, no profits at all were generated by those OFFER_IDs; instead, profits came from the other OFFER_IDs.

Observation 14: From 2004 to 2010, the distribution of profit by product category within the same OFFER_ID was very similar from year to year: product category T was the most profitable, followed by product categories C, E and P, respectively. In 2004, however, product category H was apparently the most profitable because of its large profit margin, although its total order quantity was slightly smaller than that of product category C for OFFER_ID LMB. In addition, in 2011, when the total profit of product category T dropped significantly, product category P became the most profitable product, followed by product categories C, E and H, respectively.

As in the previous two analyses, based on the number of orders placed and the total order quantity, OFFER_IDs GFB, GFP, LCB and LMB were evidently the top four OFFER_IDs with respect to total profit. We want to see not only how the total profits were distributed by product category within the same OFFER_ID, but also how they changed over time; both are presented in the following figures for the four most effective OFFER_IDs.


Note that for 2004 and 2011, the raw input data were not provided for the entire years so their results will

not be analyzed in the following analysis.

Observation 15: Within these top four OFFER_IDs, product categories T, C, E and P were consistently the top four product categories with respect to total profit.

Observation 16: Among the top four product categories, the profit margin seemed to be relatively low for product category T, relatively high for product category E, and moderate for product categories C and P.

To verify Observation 14, one can see that the gap in total profit between product category T and the other product categories is narrower than the corresponding gap in total order quantity. In addition, at some points the total profit generated by product category T even fell below that of other product categories, such as C for OFFER_ID LCB in 2010, although the order quantity of product category T was much larger at that time. Also, the total profit of product category E occasionally jumped above that of the other product categories, even though at those data points the total order quantity of product category E was lower than those of the other product categories.
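The reasoning above, in which a category with a larger order quantity can still generate less total profit, can be illustrated with a small calculation. The sketch below uses profit per unit as a simple stand-in for profit margin; the quantities and profits are made-up numbers chosen only to show the effect and are not values from the report's figures.

```python
# Hypothetical yearly totals for two product categories within one OFFER_ID.
# Quantities and profits are made-up numbers chosen only to illustrate the effect.
totals = {
    "T": {"quantity": 600, "profit": 9000.0},
    "E": {"quantity": 200, "profit": 10000.0},
}

def profit_per_unit(entry):
    """Average profit earned per unit ordered (a rough proxy for profit margin)."""
    return entry["profit"] / entry["quantity"]

margins = {category: profit_per_unit(entry) for category, entry in totals.items()}

# The category leading on quantity is not the one leading on profit.
by_quantity = max(totals, key=lambda c: totals[c]["quantity"])
by_profit = max(totals, key=lambda c: totals[c]["profit"])
print(margins, by_quantity, by_profit)
```

In this toy case category T leads on quantity while category E leads on profit, which is the pattern the paragraph describes.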


LMB slightly outperformed OFFER_ID GFM, which had significant drops in 2009 and 2010.


Observation 18: In general, the total profit of product category T dropped significantly in 2009 and fluctuated within a relatively wide range, approximately between $8,000 and $24,000. The total profits of the other product categories also fluctuated, but with much smaller variation.

Appendix C

The results of hierarchical clustering

The results of the clustering procedure using Range as the internal standardization and Ward as the clustering method are shown below.

The number of clusters suggested by the CCC was forty-eight. Overall, this model is the best among all the models that we examined with different settings and properties. As shown in the cluster history in the figure below, in the last twenty steps very few (i.e., three) observations joined the clusters late, but not very late, i.e., when the number of clusters was between sixteen and twenty. This number is the smallest among the models examined. In addition, as suggested by the other statistics, we chose to further examine the segmentations with four, eight and twelve clusters.
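Range standardization and Ward's method, the two settings named above, can be illustrated with a small self-contained sketch. It is not the software procedure that produced the results in this appendix; the customer rows, the function names and the choice of three clusters are all illustrative assumptions. Range standardization rescales each variable to [0, 1], and Ward's method repeatedly merges the pair of clusters whose merge increases the within-cluster sum of squares the least.

```python
from itertools import combinations

def range_standardize(rows):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns
    return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a["members"]), len(b["members"])
    d2 = sum((p - q) ** 2 for p, q in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * d2

def merge(a, b):
    """Combine two clusters; the new centroid is the size-weighted mean."""
    na, nb = len(a["members"]), len(b["members"])
    centroid = [(na * p + nb * q) / (na + nb)
                for p, q in zip(a["centroid"], b["centroid"])]
    return {"members": a["members"] + b["members"], "centroid": centroid}

def ward_clusters(rows, k):
    """Agglomerate singletons until k clusters remain (naive search, small data only)."""
    clusters = [{"members": [i], "centroid": list(r)} for i, r in enumerate(rows)]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]

# Hypothetical customers: (recency in days, number of orders, total spend).
rows = [[10, 5, 500], [12, 6, 520], [300, 1, 40],
        [310, 1, 35], [150, 3, 200], [155, 3, 210]]
segments = ward_clusters(range_standardize(rows), 3)
print(sorted(segments))
```

Without the standardization step, the spend column would dominate the distances simply because its values are numerically larger, which is the usual motivation for rescaling before clustering.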


The results of clustering and profiling with four clusters

The cluster distances projected onto two-dimensional space and the pie chart showing the frequency and percentage of observations in every cluster are presented below.

The importance of variables is shown in the figure below.

(See peak; the solution is "+1".) The potential solutions for the number of clusters are 1+1=2, 4+1=5, 7+1=8, 9+1=10, 11+1=12, 13+1=14, 15+1=16 and 18+1=19.

(See peak.) The potential solution for the number of clusters is 2.

(See jump.) The potential solutions for the number of clusters are 2 and 4.
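The "peak, then +1" reading in the first annotation above can be expressed as a short routine. The sketch assumes a statistic recorded for 1, 2, 3, ... clusters and treats every local maximum as a peak; the function name and the statistic values are hypothetical, chosen so that the peaks fall at 1, 4 and 7 clusters, mirroring the first three solutions listed above.

```python
def peak_plus_one(values):
    """Return candidate cluster counts: position of each local peak, plus one.

    values[i] is the statistic observed at (i + 1) clusters.
    """
    candidates = []
    for i in range(len(values)):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if values[i] > left and values[i] > right:
            candidates.append((i + 1) + 1)  # peak position, then "+1"
    return candidates

# Hypothetical statistic values for 1..8 clusters (not taken from the report).
stat = [9.0, 3.0, 2.5, 6.0, 1.0, 1.2, 4.0, 0.8]
print(peak_plus_one(stat))  # [2, 5, 8]
```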


The figures below present the cross-tab tables showing the distribution of the frequencies by the

RFM_NEW variable, and STATE_1 in all customer segments.
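A cross-tab of this kind is a count of customers for every combination of cluster and category value. A minimal sketch follows, assuming one record per customer; the tuple layout, the state codes and the RFM_NEW values are hypothetical.

```python
from collections import Counter

# Hypothetical customers: (cluster, RFM_NEW, STATE_1). Values are illustrative only.
customers = [
    (1, 8, "CA"), (1, 8, "NY"), (1, 4, "CA"),
    (2, 3, "TX"), (2, 3, "TX"), (2, 7, "CA"),
]

def crosstab(rows, column):
    """Count customers per (cluster, value of the chosen column)."""
    return Counter((row[0], row[column]) for row in rows)

rfm_by_cluster = crosstab(customers, 1)
state_by_cluster = crosstab(customers, 2)
print(rfm_by_cluster)
print(state_by_cluster)
```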


The following figure presents the averages of the numerical variables overall and by clusters.
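The comparison of overall averages with per-cluster averages can be sketched as below. The record layout (cluster, revenue, order quantity) and all values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cluster, revenue, order quantity). Illustrative values only.
records = [(1, 100.0, 2), (1, 140.0, 4), (2, 30.0, 1), (2, 50.0, 1)]

def averages(rows):
    """Column means of the numeric fields (everything after the cluster id)."""
    return [mean(col) for col in zip(*(r[1:] for r in rows))]

# Group the records by their cluster id.
by_cluster = defaultdict(list)
for r in records:
    by_cluster[r[0]].append(r)

overall = averages(records)
per_cluster = {c: averages(rows) for c, rows in by_cluster.items()}
print(overall, per_cluster)
```

A cluster whose averages sit well above or below the overall averages is the one that a profile figure would highlight.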


The matrix of variable importance in predicting observations belonging to a particular cluster is presented

below.

The segment profile index is presented in the figure below.

The results of clustering and profiling with eight clusters




The results of clustering and profiling with twelve clusters





Appendix D

Data exploration and business implication

Customer_age_impact

Product_Return


Appendix E

Customer segmentation and profile

US 1-digit zip map

MVC_SAS_Code

MVC_Regions

Appendix F

Data imputation and transformation

Imputation_Transformation_1


Appendix G

Predictive model

Clustering_1


LogWorth_1


Decision_Tree_1

Price:Unit   Net quantity   zip 1   Payment_method   Category   Offer range   RFM_NEW

1

56-62 >=2 B,E,P,F,C,J,T,X

56-62 >=2 H

62-72 >=2 B,P,C,T,A

62-72 >=2 X

82-94 >=2 AX

82-94 >=2 VI,MC

2

101-108 1 <366

101-108 1 >366

127-142 >=2 0

127-142 >=2 4 <182

127-142 >=2 4 >182

3

162-175 >=1 AX,VI,MC,DI

162-175 >=1 PC 8

175-187 >=1 AX,VI,MC,DI

175-187 >=1 PC

187-227 >=2 VI

4

322-345 >=1 F,H,T

322-345 >=1 X

444-464 >=1 C,H,X

Decision_Tree_2


Regression_3.1


Regression_3.1_Estimates


Neural_Network_1

No.  Price:Unit  Net quantity  Category                 RFM_NEW    Revenue
1    <26         <1.5          -                        8,3,7      25
2    26-47       1             -                        4,1,5      42
3    26-47       1             -                        8,3,7      35
4    <47         >1.5          B,E,C,T,P,S,F,L,H,X,O,A  4,1,5,7    97
5    <47         >=1.5         B,E,C,T,P,S,F,L,H,X,O,A  8,3        88
6    <34         >=1.5         K                        -          55
7    34-47       >=1.5         K                        -          78
8    47-64       >=1.5         -                        4,1,6,5,2  111
9    47-64       >=1.5         -                        8          99
10   64-87       >=1.5         -                        4,6,2      125
11   64-87       >=1.5         -                        8          110
12   87-106      >=1.5         -                        4,2        142
13   87-106      >=1.5         -                        8          126
14   106-124     >=1.5         -                        4          161
15   106-124     >=1.5         -                        8          141
16   132-141     >=0.5         T,P,F,H,X                -          131
17   132-141     >=0.5         B,E,C,K,L,A              -          134
18   152-167     >=0.5         B,E,C,X                  -          158
19   152-167     >=0.5         T,K,P,F,L,H,0            -          155
20   167-172     >=0.5         B,E,C,D                  -          170
21   167-172     >=0.5         T,K,P,S,F,H              -          168
22   172-187     >=0.5         B,C,X,A,D                -          184
23   172-187     >=0.5         E,T,K,P,S,F,L,H,M,G      -          181
24   227-234     >=0.5         E,C,T,L                  -          246
25   227-234     >=0.5         P,H,0                    -          243
26   234-247     >=0.5         F                        -          232
27   234-247     >=0.5         T,P,H                    -          255
28   284-309     >=0.5         -                        4,6,2      318
29   284-309     >=0.5         -                        8          323
30   309-345     >=0.5         -                        4,6,2      339
31   309-345     >=0.5         -                        8          346
32   345-374     >=0.5         -                        4,2        354
33   345-374     >=0.5         -                        8,6        365
34   394-427     >=0.5         E,C,T,P,F,H,X,0          -          399
35   394-427     >=0.5         K                        -          390


Neural_Network_2

Price range  Product
<47          B,E,C,T,P,S,F,L,H,X,O,A
<47          B,E,C,T,P,S,F,L,H,X,O,A
<34          K
34-47        K
132-141      T,P,F,H,X
132-141      B,E,C,K,L,A
152-167      B,E,C,X
152-167      T,K,P,F,L,H,0
167-172      B,E,C,D
167-172      T,K,P,S,F,H
172-187      B,C,X,A,D
172-187      E,T,K,P,S,F,L,H,M,G
227-234      E,C,T,L
227-234      P,H,0
234-247      F
234-247      T,P,H
394-427      E,C,T,P,F,H,X,0
394-427      K

Importance   B  E  C  T  P  S  F  L  H  X  0  A  K  D  M  G
Total Count  6  8  8  9  9  4  8  6  9  6  5  4  7  2  1  1

Each Price Range count
<100         2  2  2  2  2  2  2  2  2  2  2  2  2  0  0  0
130-187      4  4  4  4  4  2  4  3  4  3  1  2  4  2  1  1
220-250      0  1  1  2  2  0  1  1  2  0  1  0  0  0  0  0
400          0  1  1  1  1  0  1  0  1  1  1  0  1  0  0  0
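The Total Count row and the per-price-range counts above can be reproduced by counting how many rules mention each product category. The sketch below transcribes the eighteen rules from the list above; it assumes that the letter "O" in the two "<47" rules denotes the same category as "0", and that the four price bands correspond to consecutive groups of rules in the order listed. The function and variable names are hypothetical.

```python
from collections import Counter

# Rules transcribed from the Neural_Network_2 list: (price range, product categories).
rules = [
    ("<47", "B,E,C,T,P,S,F,L,H,X,O,A"), ("<47", "B,E,C,T,P,S,F,L,H,X,O,A"),
    ("<34", "K"), ("34-47", "K"),
    ("132-141", "T,P,F,H,X"), ("132-141", "B,E,C,K,L,A"),
    ("152-167", "B,E,C,X"), ("152-167", "T,K,P,F,L,H,0"),
    ("167-172", "B,E,C,D"), ("167-172", "T,K,P,S,F,H"),
    ("172-187", "B,C,X,A,D"), ("172-187", "E,T,K,P,S,F,L,H,M,G"),
    ("227-234", "E,C,T,L"), ("227-234", "P,H,0"),
    ("234-247", "F"), ("234-247", "T,P,H"),
    ("394-427", "E,C,T,P,F,H,X,0"), ("394-427", "K"),
]
COLUMNS = "B E C T P S F L H X 0 A K D M G".split()

def tally(rule_subset):
    """Number of rules in which each product category appears, in column order."""
    counts = Counter(
        cat.replace("O", "0")  # assumption: "O" and "0" are the same category
        for _, cats in rule_subset
        for cat in cats.split(",")
    )
    return [counts[c] for c in COLUMNS]

total_count = tally(rules)
# Assumed grouping of rules into the four price bands shown in the table.
bands = {
    "<100": tally(rules[0:4]),
    "130-187": tally(rules[4:12]),
    "220-250": tally(rules[12:16]),
    "400": tally(rules[16:18]),
}
print(total_count)
for name, row in bands.items():
    print(name, row)
```

Under these assumptions the tallies agree with the counts reported in the table.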


Neural_Network_3

Neural_Network_4