MKTG 5963 – Section 513
Term Project: ABC Catalog Company
Final Report
Team Number One:
Antonio Zuniga, Cynthia Araly
Durongkadej, Isarin
Methapatara, Chinnatat
Somraj, Shilpa
Zarate Tellez, Ana
Spring 2014
Table of Contents
EXECUTIVE SUMMARY
  OBJECTIVES
  ISSUES
  FINDINGS
    Data
    RFM and most valuable customers
    Offer, product, and channel performance
    Customer segments and profiles
    Future revenue forecast
  RECOMMENDATIONS
INTRODUCTION
  BACKGROUND
  OBJECTIVES
  PROJECT TIMELINE AND APPROACHES
  REPORT ORGANIZATION
DATA
  UNDERSTANDING THE DATA
    Shipping duration
    Backorder frequency
  EXTERNAL DATA
  NEW VARIABLES
  TARGET VARIABLES
  DATA IMPUTATION AND TRANSFORMATION
    Deal with outliers
    Deal with missing value
    Transform variables
  DATA EXPLORATION AND BUSINESS IMPLICATIONS
    Profitable Regions
    Impact of Customer Age
    Products frequently returned
    Return on Marketing Investment
    Effectiveness of catalog offers by product categories
    Repeat purchase by catalog offer and product categories
CUSTOMER SEGMENTATION AND PROFILE
  RFM ANALYSIS
    Exploring R, F and M indexes
    Exploring RFM results
    Findings and implications
  MOST VALUABLE CUSTOMERS BY REGIONS
  CUSTOMER CLUSTERING AND PROFILING
    Objective and Goals
    Inputs
    Segments and profile
    Findings and Implications
PREDICTIVE MODELS
  MODEL TARGET
  DATA PARTITION
  VARIABLE REDUCTION AND SELECTION
    Variable recoding
  DECISION TREE
    Model implementation
    Model results
  REGRESSION
    Model Implementation
    Model Results
    Findings
  NEURAL NETWORK
    Model Implementation
    Model result
    Findings
  COMPARISON OF MODELS’ PERFORMANCE
  SCORING
    Scoring performance
SUMMARY
  FINDINGS AND RECOMMENDATIONS
    Segmentation and Clustering
    Predictive Modeling
  LIMITATIONS
APPENDIX
  APPENDIX A
    Data exploration on the net profit and its associations of the input variables
  APPENDIX B
    The effectiveness of catalog offers by product categories by total order quantities
    The effectiveness of catalog offers by product categories by total profit
  APPENDIX C
    The results of hierarchical clustering
    The results of clustering and profiling with four clusters
    The results of clustering and profiling with eight clusters
    The results of clustering and profiling with twelve clusters
  APPENDIX D
    Data exploration and business implication
  APPENDIX E
    Customer segmentation and profile
  APPENDIX F
    Data imputation and transformation
  APPENDIX G
    Predictive model
Executive Summary
Objectives
The objective is to help ABC Catalog Company (henceforth “ABC”), a specialty multi-channel catalog
company, improve its revenue and market share. Achieving this objective requires evaluating what drives
ABC’s revenues and profits across customer segments, products, and marketing offers.
Issues
ABC lacks a good understanding of its customers. As a result, it is poorly placed to develop an accurate
marketing campaign to increase its market share, and it struggles to retain current and previous customers.
Performance could improve dramatically if ABC directed its emphasis and marketing effort along the right
path and made heavy but judicious use of statistical methods and software on its existing data.
Findings
Data
The transaction data was provided and has no severe issues; approximately 4% of values were missing. Some
data anomalies were observed. Skewness was noted for variables that deal with cost, price, quantity, and
revenue. A new target variable denoting net profit was created as a more accurate metric of ABC’s
profitability. In addition, customer demographics were brought in from census data.
RFM and most valuable customers
ABC’s customer base consists primarily of two types: a large number of one-time customers and a loyal
customer base. From the CRM perspective, and by leveraging RFM analysis, we deduce that most of ABC’s
one-time customers are also low-profit generators. Further analysis shows that ABC’s most valuable
customers are mostly located in the Pacific Coast, Mid-Atlantic, and Southeast regions, prefer the
telephone marketing channel, and have low household income on average.
Offer, product, and channel performance
ABC’s offer programs are not effective at bringing customers back: the number of repeat purchases was
extremely low for every offer and every product category. Overall, product category T was the most popular
(highest number of orders), most demanded (largest total order quantity), and most profitable (largest
total generated profit), followed by product categories C, E and P, respectively. In total, ABC customers
utilized catalog offers the most. The four most effective catalog offers were those with LMB, GFP, GFB and
LCB as the suffix of their offer IDs. Among all offers, the web offer channel (WEB) had the highest number
of orders.
Customer segments and profiles
Customer segmentation was performed based on the buying behaviors and preferences of customers. Because
customers were very similar, it was difficult to create perfectly distinct customer segments. Experiments
were performed with 4, 8 and 12 clusters, as recommended by the statistical criteria of a hierarchical
clustering method. The optimal solution had 8 clusters, differentiated mostly by the RFM code. Four of the
eight clusters had a high response rate to catalog offers. Product preference did not clearly distinguish
the segments, and 1-digit ZIP codes 5, 7 and 9 were somewhat meaningful in profiling customer segments.
Future revenue forecast
Of the three models (decision tree, regression, and neural network), the decision tree is the best at
predicting future revenue, with the lowest error. Five variables are important for predicting revenue:
ZIP code, payment method, offer range, product category, and RFM. In terms of implementation, the model
was scored on new data outside the SAS environment; the scoring result showed a small revenue prediction
error, which indicates that the model can be applied to data outside the SAS environment.
Recommendations
• Increase catalog promotion and train personnel for telephone marketing to increase the share of
  valuable customers.
• All of the products need better marketing from the CRM perspective. ABC needs to put more emphasis on
  customer retention, and improving the customer buying and customer service experience is necessary. To
  better understand customers, conducting a survey of their perceptions and preferences is a very good
  start.
• The most profitable regions are the Pacific Coast (5), Heartland (7) and the South (9). Therefore, ABC
  should focus its efforts on increasing market share in those areas.
• The return on investment is higher for products in the lower cost range. Therefore, ABC needs to look
  after and/or promote lower-end products, since that is where its profits come from.
• The decision tree suggested four recommendation groups for different price ranges. However, the first
  price range ($55-95) is the most important, accounting for 90 percent of the company’s total revenue.
  Within this price range, ABC should focus on five product categories (B, P, C, T, and X), sell them in
  bundles, and offer a promotion to customers who pay with American Express, Visa, or MasterCard.
• Initially, ABC might market to only one segment to test whether the segmentation is accurate. The
  recommended segment to start with is Segment 7 because it contains the most valuable customer group, in
  which the majority is female and located in the north-central and central US. Therefore, ABC should make
  the catalog more feminine for those areas and run cross-sell promotions.
• After deploying a new marketing campaign, ABC should track and verify whether it is successful and
  whether modifications to the campaign or the segmentation are needed.
Introduction
Background
ABC is a specialty multi-channel catalog company with a strong web presence. The company sources
different consumer products from manufacturer(s) and then designs catalogs to display an assortment of
such products. Unlike in a traditional retail store, the products are not displayed physically; instead,
customers are sent catalogs, and orders are fulfilled through online shopping. ABC wants to improve its
revenue and its customer base. Our team has been assigned the task of analyzing what drives ABC’s
revenue/profit.
Objectives
The overall analysis needs to be divided into addressable sub-components to make the proper
recommendations on the most effective marketing strategies to achieve higher profits and gain market
share. Hence, the sub-objectives for this project are:
• To analyze product, offer, and channel-buying patterns
• To measure performance and estimate return on the marketing investment (Marketing ROI)
• To examine purchase patterns through contact histories and detailed product purchase information
• To identify seasonal sales/service quality trends
• To build customer segmentation
• To identify the most valuable customers and predict future revenues
Project timeline and approaches
The CRISP-DM methodology was used, with the eventual goal of predicting future revenues and suggesting
marketing strategies. The overall project was split into three phases, with goals associated with each of
the steps in the CRISP-DM process, as follows:
Phase 1 – data preparation and exploration
Phase 2 – data audit and descriptive analysis
Phase 3 – predictive analysis
The primary tools used in this project were SAS Enterprise Miner 12.1 and Microsoft Excel. The detailed
procedures and the timeline of the project are shown in Figure 1.
| Task name | Start | Finish | Duration |
|---|---|---|---|
| Understanding the project descriptions | 1/28/2014 | 2/1/2014 | 5d |
| Understanding business nature and determining business issues and goals | 1/28/2014 | 2/6/2014 | 10d |
| Exploring, scrubbing and consolidating data for Phase 1 | 1/28/2014 | 2/6/2014 | 10d |
| Determining meaningful inputs. Auditing and observing data quality, e.g., distribution and missing values | 1/31/2014 | 2/16/2014 | 17d |
| Writing Phase 1 report | 2/1/2014 | 2/3/2014 | 3d |
| Examining input associations and company performance from multiple perspectives | 2/3/2014 | 2/18/2014 | 16d |
| Preparing data, i.e., data imputation and transformation | 2/24/2014 | 3/1/2014 | 6d |
| Writing Phase 2.1 report | 2/25/2014 | 3/3/2014 | 7d |
| Conducting RFM analysis | 3/3/2014 | 3/8/2014 | 6d |
| Associating RFM codes to other inputs and determining meaningful marketing strategies | 3/10/2014 | 3/18/2014 | 9d |
| Creating customer segmentation and profiling the segments | 3/10/2014 | 3/25/2014 | 16d |
| Writing Phase 2.2 report | 3/18/2014 | 3/31/2014 | 14d |
| Performing variable reduction and selection and creating meaningful inputs for predictive models related to business goals | 3/31/2014 | 4/4/2014 | 5d |
| Conducting analysis on marketing return on investment, and product returns | 3/31/2014 | 4/5/2014 | 6d |
| Conducting analysis on interactions between offer codes and product categories across the years, and repeat purchase | 3/31/2014 | 4/15/2014 | 16d |
| Building predictive models | 3/31/2014 | 4/17/2014 | 18d |
| Writing final report | 4/14/2014 | 4/28/2014 | 15d |
| Comparing and scoring data | 4/24/2014 | 4/25/2014 | 2d |
[Gantt chart: weekly timeline from 1/26/2014 to 4/20/2014, with each task mapped to its CRISP-DM phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment)]
Figure 1: Project detail tasks and time line
Report organization
The report is organized as follows. Data understanding, external data, business-related analysis, and
business implications are described in Section Data. Section Customer segmentation and profile presents
RFM analysis, customer clustering and profiling, and the findings and related business implications.
Variable reduction and selection and the predictive models are discussed in Section Predictive models, and
Section Summary provides a summary of the project, including findings and recommendations. SAS
Enterprise Miner outputs, graphs, charts, and further analyses are contained in the Appendix.
Data
Understanding the data
From the business perspective (step one of the CRISP-DM process), it is essential to understand the
existing system before preparing the data or building a model. Hence, service quality and catalog response
rates were inspected. Service quality can increase the number of repeat customers, order quantities, and
eventually profits for ABC. Shipping is a key component of service quality. To explore shipping quality,
we need to analyze the shipping duration, backorder frequency, and the number of returned or cancelled
orders.
Shipping duration
Shipping duration, or lead time, is a potential determinant of repeat purchases: a good shipping
experience will most likely result in a repeat order from the same customer. Hence, the interval between
ORDER_DATE and SHIP_DATE can be used to monitor shipping time, reduce delays, and in turn improve
customer satisfaction.
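As a minimal sketch of how this lead time could be derived, assuming each order record carries ORDER_DATE and SHIP_DATE as calendar dates (the sample records and the ORDER_ID field below are hypothetical, not ABC data):

```python
from datetime import date
from statistics import mean

# Hypothetical order records; only ORDER_DATE and SHIP_DATE are field names from the report.
orders = [
    {"ORDER_ID": 1, "ORDER_DATE": date(2010, 11, 2), "SHIP_DATE": date(2010, 11, 5)},
    {"ORDER_ID": 2, "ORDER_DATE": date(2010, 11, 3), "SHIP_DATE": date(2010, 11, 10)},
    {"ORDER_ID": 3, "ORDER_DATE": date(2010, 12, 1), "SHIP_DATE": date(2010, 12, 2)},
]


def shipping_days(order):
    """Lead time in whole days between order placement and shipment."""
    return (order["SHIP_DATE"] - order["ORDER_DATE"]).days


lead_times = [shipping_days(order) for order in orders]
average_lead_time = mean(lead_times)
```

Averaging the lead time per customer, product line, or month would then show where delays concentrate.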
Backorder frequency
Backorder frequency is a key indicator of service quality and reflects the effectiveness of logistics and
supply chain management, especially under seasonal demand. If an item is frequently unavailable, customers
might look for other catalog or online shopping avenues to obtain the product in time. Hence, we
determined the total number of days that customers need to wait for an item to be shipped if it is not
available at the time of purchase. We concluded that the relationship between BO_DATE and SHIP_DATE
demonstrates seasonality.
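One way to look for this seasonality is to average the backorder wait by calendar month. The sketch below assumes BO_DATE and SHIP_DATE are available as dates for each backordered line; the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical backordered lines; BO_DATE and SHIP_DATE are the report's field names.
backorders = [
    {"BO_DATE": date(2010, 11, 20), "SHIP_DATE": date(2010, 12, 4)},
    {"BO_DATE": date(2010, 11, 25), "SHIP_DATE": date(2010, 12, 5)},
    {"BO_DATE": date(2010, 6, 1), "SHIP_DATE": date(2010, 6, 4)},
]


def mean_wait_by_month(rows):
    """Average days from backorder to shipment, keyed by backorder month (1-12)."""
    waits = defaultdict(list)
    for row in rows:
        waits[row["BO_DATE"].month].append((row["SHIP_DATE"] - row["BO_DATE"]).days)
    return {month: mean(days) for month, days in sorted(waits.items())}


wait_by_month = mean_wait_by_month(backorders)
```

Months with markedly longer average waits would point to periods where supply does not keep up with seasonal demand.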
Identifying trends in order cancellations and returns across product lines, offers, and regions can help
ABC improve service quality in those areas, if required.
External data
The given data offers information about customer orders, quantities ordered, price, offer, channel, and so
on. However, to segment the customers better, we need demographic information. To this end, we augmented
the given dataset with external demographic information, matched on ZIP code, from American FactFinder.
(Reference: http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml)
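The matching step can be pictured as a left join on ZIP code. This is only an illustrative sketch: the ZIP field name, the lookup structure, and all values are assumptions, and the actual merge was presumably done in SAS or Excel.

```python
# Hypothetical census lookup keyed by 5-digit ZIP code; values are illustrative only.
census_by_zip = {
    "10001": {"MEAN_HOUSEHOLD_INC_NUM": 98000.0, "TOT_POP": 21000},
    "60601": {"MEAN_HOUSEHOLD_INC_NUM": 121000.0, "TOT_POP": 12000},
}

# Hypothetical transactions; the "ZIP" field name is an assumption.
transactions = [
    {"ORDER_ID": 1, "ZIP": "10001"},
    {"ORDER_ID": 2, "ZIP": "99999"},  # no census match
]


def add_demographics(rows, lookup):
    """Left-join census fields onto each transaction; unmatched ZIP codes get None."""
    fields = sorted({name for record in lookup.values() for name in record})
    enriched_rows = []
    for row in rows:
        match = lookup.get(row["ZIP"], {})
        enriched_rows.append({**row, **{name: match.get(name) for name in fields}})
    return enriched_rows


enriched = add_demographics(transactions, census_by_zip)
```

Keeping unmatched transactions with empty demographic fields, rather than dropping them, preserves the original order counts.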
New variables
New variables were created to accommodate the external data in the existing dataset. The new variables
are described in Table 1.
Table 1: The purpose and description of new variables corresponding to the census data
| Variable | Description | Purpose |
|---|---|---|
| FEMALE_PCT_NUM | Percentage of females in each ZIP code | To explore if there is any trend pertaining to gender |
| MALE_PCT_NUM | Percentage of males in each ZIP code | To explore if there is any trend pertaining to gender |
| MEAN_HOUSEHOLD_INC_NUM | Average household income in each ZIP code | To explore if there is any trend pertaining to wealth of household |
| YOUNG_AMT_15_TO_24 | Population aged 15-24 | To explore if there is any trend pertaining to different age ranges |
| MID_AMT_25_TO_44 | Population aged 25-44 | To explore if there is any trend pertaining to different age ranges |
| OLD_AMT_45_TO_64 | Population aged 45-64 | To explore if there is any trend pertaining to different age ranges |
| RET_AMT_65_OR_ABOVE | Population aged 65 or above | To explore if there is any trend pertaining to different age ranges |
| NO_TEL_NUM | Number of households with no telephone | To exclude any ZIP code that has a high volume of these variables |
| UNEMP_PCT_NUM | Percentage of unemployed population | To exclude any ZIP code that has a high volume of these variables |
| TOT_POP | Total population in each ZIP code | To use for comparison purposes |
Target variables
A product with high revenue margin does not necessarily have high profit margin. In other words, although
its unit price is high, it may also be associated with high product cost. Thus, net revenue is not an accurate
target with respect to profitability. Instead, net profit, denoted by NET_PROFIT, is an accurate
representation, which can be calculated as follows.
NET_PROFIT = NET_QUANTITY × (PRICE_PER_UNIT - COST_PER_UNIT).
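The formula can be checked with a short worked example. The numbers below are illustrative and not taken from ABC's data; they simply show that the sign of NET_PROFIT follows the sign of the unit margin.

```python
def net_profit(net_quantity, price_per_unit, cost_per_unit):
    """NET_PROFIT = NET_QUANTITY x (PRICE_PER_UNIT - COST_PER_UNIT)."""
    return net_quantity * (price_per_unit - cost_per_unit)


# Illustrative transactions (made-up values).
profitable = net_profit(net_quantity=3, price_per_unit=25.0, cost_per_unit=10.0)
loss_making = net_profit(net_quantity=2, price_per_unit=8.0, cost_per_unit=12.0)
```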
Most of the time, net profit is positive, but it is negative when the profit margin is negative. We
observed many transactions with negative profit and many products that, on average, have a negative
profit margin. More interestingly, many products never generated any profit (because all units were
returned) over the entire horizon of the data set. The histogram of net profit by transaction is
presented in Figure 2.
Figure 2: Histogram of net profit by transactions
Net profit and its associations with the other inputs were studied thoroughly. The results are presented
in Appendix A.
Data imputation and transformation
Deal with outliers
Observations in which any variable lies more than 5 standard deviations from its mean were filtered out.
This removed 6,188 observations, or 3.54 percent of the original data.
The quality of the data is better after filtering the extreme values. The kurtosis and skewness are much
lower, indicating that those extreme values affected the quality of the data. For more information,
please see Appendix D.
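The filtering rule can be sketched as follows. This is a simplified stand-in for the SAS Enterprise Miner filter, assuming the sample standard deviation is used; the NET_REVENUE sample values are invented for illustration.

```python
from statistics import mean, stdev


def filter_outliers(rows, columns, max_sd=5.0):
    """Keep rows whose values all lie within max_sd standard deviations of the column mean."""
    stats = {}
    for column in columns:
        values = [row[column] for row in rows]
        stats[column] = (mean(values), stdev(values))
    kept = []
    for row in rows:
        within_limits = all(
            sd == 0 or abs(row[column] - mu) <= max_sd * sd
            for column, (mu, sd) in stats.items()
        )
        if within_limits:
            kept.append(row)
    return kept


# Illustrative data: 99 ordinary transactions and one extreme value.
sample = [{"NET_REVENUE": 50.0 + (i % 5)} for i in range(99)] + [{"NET_REVENUE": 5000.0}]
filtered = filter_outliers(sample, ["NET_REVENUE"])
```

Note that a single extreme value inflates the standard deviation itself, so the rule only removes points that are extreme relative to a large body of ordinary observations.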
Deal with missing value
We replaced missing values with zero for ret_quantity and ret_revenue, because a missing value in these
two variables indicates “no return” rather than unknown data. Replacing them with zero is useful for
further analysis. We then imputed the missing values of the other variables with the mean (for interval
variables) or the mode (for categorical variables).
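A minimal sketch of this imputation scheme, assuming missing values are represented as None. The ret_quantity and ret_revenue names come from the report; the CHANNEL column and the sample rows are hypothetical.

```python
from statistics import mean, mode

ZERO_FILL = ("ret_quantity", "ret_revenue")  # a missing value here means "no return"


def impute(rows, interval_cols, categorical_cols):
    """Return copies of rows with missing values (None) filled in."""
    fill = {column: 0 for column in ZERO_FILL}
    for column in interval_cols:
        fill[column] = mean([r[column] for r in rows if r[column] is not None])
    for column in categorical_cols:
        fill[column] = mode([r[column] for r in rows if r[column] is not None])
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in row.items()}
        for row in rows
    ]


# Illustrative rows (made-up values; CHANNEL is a hypothetical column name).
raw = [
    {"ret_quantity": None, "ret_revenue": None, "PRICE_PER_UNIT": 10.0, "CHANNEL": "WEB"},
    {"ret_quantity": 1, "ret_revenue": 20.0, "PRICE_PER_UNIT": None, "CHANNEL": "WEB"},
    {"ret_quantity": None, "ret_revenue": None, "PRICE_PER_UNIT": 30.0, "CHANNEL": None},
]
imputed = impute(raw, interval_cols=["PRICE_PER_UNIT"], categorical_cols=["CHANNEL"])
```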
Transform variables
First, we used the max normal transformation method to determine which transformation should be applied
to each variable. For more information, please see Appendix D.
Then, we transformed the variables according to the max normal result. After transformation, the
skewness and kurtosis are closer to those of a normal distribution. Consequently, the data is ready for
further analysis. For more information, please see Appendix D.
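The idea of choosing the transformation that makes a variable look most normal can be illustrated with a simplified proxy. The sketch below is not SAS Enterprise Miner's procedure: it assumes a small candidate set (identity, square root, log) and picks the candidate with the smallest absolute skewness, using invented right-skewed values.

```python
import math


def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical candidate set for non-negative inputs.
CANDIDATES = {
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "log": lambda v: math.log(v + 1.0),
}


def most_normal_transform(values):
    """Pick the candidate whose output has the smallest absolute skewness."""
    scores = {
        name: abs(skewness([transform(v) for v in values]))
        for name, transform in CANDIDATES.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Illustrative right-skewed values, similar in shape to cost or revenue variables.
right_skewed = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]
best_name, skew_scores = most_normal_transform(right_skewed)
```

For strongly right-skewed inputs such as these, the log transformation typically wins, which is consistent with the skewness noted for cost, price, quantity, and revenue variables.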
Data Exploration and business implications
Profitable Regions
Prioritizing regions for marketing efforts is important. We analyzed the ZIP codes to understand the key
regions for ABC. The ten most profitable 3-digit ZIP code areas are 100, 600, 770, 606, 117, 926, 945,
070, 334, and 300, which are in the states of NY, IL, TX, CA, NJ, FL and GA. The results are expected, as
these states as a whole are well equipped, are large in terms of population, and have more buying power.
Figure 3: Top-ten most profitable areas by 3-digit ZIP code
Impact of Customer Age
We segmented the population into four age groups to better understand the company’s customers. For all
four age ranges, the plot is right skewed. When comparing each age range with net revenue, we observe
that in all four age ranges people are most likely to purchase products with net revenue between 0 and
250. Based on frequency, people aged 15 to 24 and those aged 65 and above are the most loyal customers.
In all age segments, most of ABC’s revenue is generated by the least expensive products. Please refer to
Appendix D.
Across age categories, the impact of the channels is very similar: Web is the most effective channel with
more than 50% of orders, Phone represents more than 40%, and Mail represents just over 3%. Hence, the
preferred channel is Web, and any improvement made to it will most likely perform similarly in all age
ranges.
Products frequently returned
It is essential for ABC to understand which products have frequent or high numbers of returns. Further
analysis is needed either to reduce the number of returns or to contact the manufacturer about potential
product-related issues. Frequent returns of a certain product might affect the overall brand value of
ABC; hence it is critical to address this issue. From our analysis, we noted that products with a product
category ID of “C,” “E,” and “P” have a relatively higher number of returns than the other products.
Please refer to Appendix D.
Return on Marketing Investment
To estimate the return on marketing investment, we split the cost into five bins for the $0 to $100
range and then into three bins from $101 to the maximum cost. Since the marketing cost incurred is
unknown, we considered three levels of marketing cost: 5%, 10%, and 12% of the mean cost. Note that,
irrespective of the actual marketing cost, the profit to marketing cost ratio decreases as we move
towards the higher cost range and increases as the percent marketing cost decreases, as shown in Table 2.
Table 2: The profit to marketing cost ratio at 5%, 10% and 12% marketing costs
| Min. cost | Max. cost | Profit | Net quantity | Return quantity | Profit / cost | Return % | Ratio at 5% marketing | Ratio at 10% marketing | Ratio at 12% marketing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 19 | 65755 | 2034 | 1.86 | 3.1% | 37.2 | 18.6 | 15.5 |
| 21 | 40 | 35 | 49984 | 2652 | 1.15 | 5.3% | 22.9 | 11.5 | 9.6 |
| 41 | 60 | 51 | 21393 | 1470 | 1.01 | 6.9% | 20.2 | 10.1 | 8.4 |
| 61 | 80 | 68 | 11601 | 785 | 0.96 | 6.8% | 19.2 | 9.6 | 8.0 |
| 81 | 100 | 83 | 4978 | 367 | 0.91 | 7.4% | 18.3 | 9.1 | 7.6 |
| 101 | 200 | 109 | 7992 | 655 | 0.72 | 8.2% | 14.5 | 7.2 | 6.0 |
| 201 | 300 | 148 | 2402 | 212 | 0.59 | 8.8% | 11.8 | 5.9 | 4.9 |
| 301 | 337.2 | 133 | 83 | 6 | 0.42 | 7.2% | 8.3 | 4.2 | 3.5 |
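The last three columns of Table 2 follow from the profit / cost column: if marketing cost is a share s of the mean cost, then profit / (s × cost) = (profit / cost) / s. A short check of this arithmetic, using the profit / cost values of the $0-20 and $201-300 bins:

```python
MARKETING_SHARES = (0.05, 0.10, 0.12)  # marketing cost as a share of mean cost


def profit_to_marketing_ratios(profit_per_cost, shares=MARKETING_SHARES):
    """Profit / (share x cost) = (profit / cost) / share, rounded to one decimal."""
    return [round(profit_per_cost / share, 1) for share in shares]


# Profit / cost values taken from Table 2.
lowest_cost_bin = profit_to_marketing_ratios(1.86)  # $0-20 bin
higher_cost_bin = profit_to_marketing_ratios(0.59)  # $201-300 bin
```

Recomputing from the rounded profit / cost column can differ from the table by 0.1 in a few cells, presumably because the table was computed from unrounded values.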
Effectiveness of catalog offers¹ by product categories
We started by looking at the yearly number of catalog transactions from 2004 to 2011 by product category;
the results are presented in Figure 4.
¹ The last three letters of the OFFER_ID codes that are catalog offers are EHB, ESB, FDB, GFB, GFP, HLB,
LCB, LCR, LMB, LSB, LWB, MSB, OFM, OTB, PCB, PO0, REQ, SMB, SPB, TOY, VRA, VRD and WNB.
Figure 4: Total number of transactions for catalog offers by product categories over years
Observation 1: From 2004 to 2010, orders arrived consistently for OFFER_IDs EHB, GFB, GFP, HLB, LCB, LCR
and LMB², although orders under OFFER_ID LCR arrived only from 2007 to 2010. In 2011, there were no
orders at all for those OFFER_IDs. Instead, more orders came from other OFFER_IDs, such as SMB, SPB and
WNB.
This might occur for a business reason; for example, the company might stop those offer codes after 2010,
and utilize more the other codes. There might be a problem with those OFFER_IDs in 2011, which
eventually resulted in zero incoming orders in that year.
Observation 2: From 2004 to 2010, the distribution of the number of orders across product categories
within the same OFFER_ID was very similar from year to year: product category T was the most popular,
followed by product categories C, E and P, respectively. However, in 2011, when the number of orders for
product category T dropped significantly, product category P became the most popular, followed by product
categories C, E and H, respectively.
The dramatic drop in the number of orders for product category T may have occurred because (1) the
category might have a short product life cycle, (2) there might have been an issue with the offers for
product category T during 2011, (3) the company might have changed its marketing strategy to promote the
other product categories more and product category T less, or (4) there might have been an issue on the
supply side of product category T.
Footnote 2: Besides OFFER_ID EHB, GFB, GFP, HLB, LCB, LCR and LMB, OFFER_ID SMB, LSB, PCB, SPB and WNB
seem to be the second-best group of OFFER_IDs, most of which stood out in 2011.
From Figure 4, OFFER_IDs GFB, GFP, LCB and LMB were clearly the top four catalog OFFER_IDs by number of
orders placed. We henceforth refer to them as the four most effective catalog OFFER_IDs. Next, we examine
how the number of orders by product category and OFFER_ID varied over time; the graphs for the four most
effective catalog OFFER_IDs are shown in Figure 5.
Figure 5: Total number of transactions for the four most effective catalog offers by product categories
over years
Note that the raw input data for 2004 and 2011 do not cover the entire years, so those two years are
excluded from the following observations.
Observation 3: In the four most effective catalog OFFER_IDs, product categories T, C, E and P were
consistently the top four product categories by number of orders placed over the years. This observation
conveys insights similar to those of Observation 2.
Observation 4: OFFER_IDs LMB and GFP seemed to perform best, with LMB slightly outperforming GFP, which
dropped significantly in 2009 and 2010.
Observation 5: In general, the number of orders for product category T dropped significantly in 2009 and
fluctuated with relatively large variation, roughly between 300 and 700 orders. The numbers of orders for
the other product categories also fluctuated, but with much smaller variation.
Observation 6: After the drop in 2009, the total number of orders for product category T in 2010 increased
for OFFER_IDs LCB and LMB, stayed almost constant for OFFER_ID GFB, and kept decreasing for OFFER_ID GFP.
A possible explanation for Observation 6 is as follows. The increase could be a sign that product
category T was bouncing back from the drops of the previous two years and, if so, could also imply repeat
purchases. For the OFFER_ID that continued to drop and the one that did not increase, customers might have
switched to the OFFER_IDs whose total number of orders increased. In addition, for the other product
categories whose total number of orders increased from one year to the next, the increase also suggests
possible repeat purchases.
To evaluate how effective the offers were, analyzing only the number of orders may not be enough to
validate the findings. Typically, the performance of an offer is also reflected in the size and the
corresponding profit of the orders. We performed such an analysis; the results and observations are given
in Appendix B.
One of the company’s stated goals is that “contact histories and detailed product purchase information
support excellent analysis of repeat purchase from the same catalog and the same product categories year
to year”. However, we are not fully confident in concluding whether or not repeat purchases occurred. From
the results in this section, we can only say that, overall, the product categories that were popular, in
high demand and very profitable remained so, which leaves room for repeat purchases to occur. The analysis
would have to be extended with a customer dimension in order to conclude whether customers made repeat
purchases and, if so, which product categories and OFFER_IDs were repeatedly purchased and how.
Repeat purchase by catalog offer and product categories
This analysis specifically examined repeat purchases of the same product categories using the same
catalog OFFER_IDs from year to year. We define a repeat purchase as follows.
Definition: For every unique OFFER_ID and product category combination, the number of repeat purchases in
year X is the number of purchases in year X, provided that there is at least one purchase in year X-1
(and zero otherwise).
From this definition, we can observe that it only considers purchases in two consecutive years, and that
the number of repeat purchases can be as large as the total number of purchases within a given year.
Continuing from the previous section, the analysis in this section covers the four most effective catalog
OFFER_IDs, namely GFB, GFP, LCB and LMB, to examine how many repeat purchases occurred in a particular
year and in which product categories.
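The definition of a repeat purchase can be sketched in a few lines of Python. This is an illustration only, not the procedure used to produce the figures in this report: the record layout (customer ID, OFFER_ID, product category, year) and the function name are assumptions, and tracking purchases per customer is one reading of the definition, chosen because the results are reported by customer.

```python
from collections import Counter


def count_repeat_purchases(transactions):
    """Count repeat purchases per (OFFER_ID, product category, year).

    `transactions` is an iterable of (customer_id, offer_id, category, year)
    tuples; this layout is assumed for illustration.
    """
    # Number of purchases per customer, offer, category and year.
    purchases = Counter(transactions)

    repeats = Counter()
    for (customer, offer, category, year), n in purchases.items():
        # Purchases in year X count as repeats only if the same customer
        # bought the same category through the same offer in year X-1.
        if (customer, offer, category, year - 1) in purchases:
            repeats[(offer, category, year)] += n
    return repeats


# Hypothetical sample data.
sample = [
    ("c1", "GFB", "T", 2007),
    ("c1", "GFB", "T", 2008),
    ("c1", "GFB", "T", 2008),
    ("c2", "GFB", "T", 2008),  # no purchase in 2007, so not a repeat
    ("c1", "GFB", "C", 2009),  # different category, so not a repeat
]
repeat_counts = count_repeat_purchases(sample)
print(repeat_counts[("GFB", "T", 2008)])
```

In the sample, both 2008 purchases of customer c1 count as repeats, which shows how the number of repeat purchases in a year can approach the total number of purchases in that year.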
For OFFER_ID GFB, out of 6,070 customers who made at least one order from 2004 to 2011, only 37 customers
made repeat purchases. One of them made repeat purchases in two consecutive years, 2008 and 2009, on
product category T, whereas two of them made repeat purchases twice on different product categories. The
numbers of repeat purchases in each year by product category are shown in Figure 6. The labels inside the
stacked bar chart are read as follows. The first letter is the product category, the second number is the
total number of repeat purchases, and the last number is the percentage of repeat purchases out of the
total number of orders for that particular OFFER_ID, product category and year combination.
Figure 6: The number of repeat purchase by product categories for OFFER_ID GFB across years
It is apparent that the percentage of repeat purchases in every product category is very small, with a
maximum of 2.84% for product category T in 2007. Overall, product category T seems to have the largest
total number of repeat purchases across the years. Also, product categories A, B, D, F, G, I, J, L, M and
O had no repeat purchases at all.
For OFFER_ID GFP, out of 7,203 customers who made at least one purchase from 2004 to 2011, only 38
customers made repeat purchases. Two of them made repeat purchases twice, on two different product
categories, and none of the customers made repeat purchases in consecutive years using OFFER_ID GFP. The
numbers of repeat purchases for OFFER_ID GFP by year and product category are shown in Figure 7.
Figure 7: The number of repeat purchase by product categories for OFFER_ID GFP across years
As with OFFER_ID GFB, the percentage of repeat purchases in every product category is very small, with a
maximum of 3.05% for product category T in 2006. Overall, product category T had the largest total number
of repeat purchases across the years. Also, product categories A, D, F, G, I, J, M, O and S had no repeat
purchases at all.
For OFFER_ID LCB, out of 5,947 customers who made at least one purchase from 2004 to 2011, 23 customers
made repeat purchases. Four of them made repeat purchases twice, on two different product categories, and
none of the customers made repeat purchases in consecutive years using OFFER_ID LCB. The numbers of repeat
purchases for OFFER_ID LCB by year and product category are presented in Figure 8.
Figure 8: The number of repeat purchase by product categories for OFFER_ID LCB across years
Again, the percentage of repeat purchases in every product category is very small, with a maximum of
2.20% for product category X in 2006. Unlike the other three OFFER_IDs, product category T does not
account for the majority of the repeat purchases. Instead, product categories T, E and C have roughly
equal total numbers of repeat purchases across the years. In addition, product categories A, B, D, F, G,
H, I, J, K, M, O and S had no repeat purchases at all.
For the last OFFER_ID, LMB, out of 9,020 customers, 36 customers made repeat purchases. One of them made
repeat purchases twice, on two different product categories, and one made repeat purchases in two
consecutive years, 2009 and 2010, on product categories P and T. The numbers of repeat purchases for
OFFER_ID LMB by year and product category are presented in Figure 9.
Figure 9: The number of repeat purchase by product categories for OFFER_ID LMB across years
The observations are very similar to those for the other three OFFER_IDs. In particular, the percentage
of repeat purchases in every product category is very small, with a maximum of 3.13% for product category
M in 2010, which corresponds to only one repeat purchase. Overall, product category T had the largest
total number of repeat purchases across the years. Also, product categories A, B, D, G, I, J, and O had
no repeat purchases at all.
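The customer counts reported for the four OFFER_IDs can also be expressed as repeat-customer rates. The sketch below is a simple illustration using only the counts stated in the text; the function name is hypothetical. Note that these are customer-level rates, which are not the same quantity as the order-level percentages shown in the figure labels.

```python
# (customers with at least one order, customers with repeat purchases),
# as reported in the text for each OFFER_ID.
CUSTOMERS = {
    "GFB": (6070, 37),
    "GFP": (7203, 38),
    "LCB": (5947, 23),
    "LMB": (9020, 36),
}


def repeat_customer_rate(total_customers, repeat_customers):
    """Share of customers who made a repeat purchase, in percent."""
    return 100.0 * repeat_customers / total_customers


rates = {
    offer: round(repeat_customer_rate(total, repeats), 2)
    for offer, (total, repeats) in CUSTOMERS.items()
}

for offer, rate in rates.items():
    print(offer, rate)
```

Under these counts, fewer than 1% of the customers of each OFFER_ID made a repeat purchase.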
Given the total number of orders in Figure 10, the total number and percent of repeat purchases over the
years on aggregate are shown in Figure 11.
Figure 10: Total number of orders by product categories over the years for OFFER_ID GFB, GFP, LCB
and LMB
          PRODUCT CATEGORIES
OFFER_ID     A     B     C     D     E     F     G     H     I     J     K     L     M     O     P     S     T     X
GFB        187   461 1,468    48 1,174   455    12   727     1    34   463   395    45   100 1,242   289 2,493   756
GFP        245   618 1,837    33 1,408   639    25   957     0    49   503   349    40   130 1,425   300 2,754   808
LCB        204   433 1,722    16 1,274   544     5   647     0    51   499   381    59   113 1,250   328 2,257   524
LMB        410   620 2,435    26 1,797   773     7   932     0    91   684   490    87   154 1,864   519 3,516   774
Figure 11: Total number of repeat purchase during 2004-2011 for OFFER_ID GFB, GFP, LCB and LMB
by product categories
Findings and implications
Overall, product category T dominated the total number of repeat purchases; however, all of the repeat
purchase percentages are very small, even for the four most effective OFFER_IDs. This reveals some issues
in the way the company runs its business from a CRM perspective.
The company clearly has a problem getting its customers to make repeat purchases from year to year using
the same OFFER_IDs. The company must segment its customers better and target each segment with a specially
designed offer or set of offers. For example, the company could provide a special discount or superior
customer support to repeat buyers. The company should also create a contact list and provide closer
customer support for recently purchased products by periodically contacting customers to ask about their
experience with those products.
The company should also conduct a survey to measure satisfaction with its products and customer support,
and to obtain feedback about buying factors as well as buying experiences with the company. This would
lead to better customer segmentation, so that specific offers for different customer segments could be
designed more appropriately.
The company must find ways, for example developing new offers or modifying existing ones, to be more
attractive to its existing customers. As shown in the previous section, the performance of every OFFER_ID
currently needs to be improved, and none of the OFFER_IDs is significantly superior to the others. It is
important for the company to create distinct customer segments whose customers differ across groups in
nature, needs, buying behavior, or perception, but are similar within the same group. For example, one
segment might contain industry customers who do not care much about price but are highly concerned with
quality; a more attractive offer for this group might be no discount on the product and an extended
warranty at a discounted price. Other customers might have low income and thus care more about price than
quality. The offer for this group should provide a special discount on the retail price, and perhaps an
extended warranty as an option at extra cost.
Customer segmentation and profile
RFM analysis
RFM analysis is commonly used as a customer segmentation tool, where R, F and M are indexes representing
the recency, frequency and monetary value of customers, respectively. The given transaction data were
aggregated by customer ID, and for each customer ID new variables were created to denote the last date of
purchase, the total number of purchase orders, and the total profit over all purchases, as shown in
Table 3.
Table 3: The variables for R, F and M
Variable Names      Corresponding Indexes
DAY_LAST_ORDER      R
CNT_ORDER           F
TOTAL_PROFIT        M
In addition, the new variables shown in Table 4 were created to represent the total net profit and the
number of purchases made over the last purchase year, i.e., 2011. Note that the 2011 order data are
available only through August 31, 2011.
Table 4: The variables for total profit and number of orders in the last year
Variable Names         Description
LAST_YEAR_TOT_PROF     The total profit generated by a particular customer over the last year
LAST_YEAR_CNT_ORDER    The total number of purchase orders made by a particular customer over the last year
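The aggregation behind Tables 3 and 4 can be sketched as follows. This Python sketch is only an illustration of the idea, not the SAS code used for the report: the input layout (one row per order with a customer ID, an order date and a profit) and the function name are assumptions, while the output keys reuse the variable names from the two tables.

```python
from datetime import date

LAST_YEAR = 2011  # last purchase year in the data


def aggregate_rfm(transactions):
    """Aggregate (customer_id, order_date, profit) rows per customer."""
    summary = {}
    for customer_id, order_date, profit in transactions:
        row = summary.setdefault(
            customer_id,
            {
                "DAY_LAST_ORDER": order_date,
                "CNT_ORDER": 0,
                "TOTAL_PROFIT": 0.0,
                "LAST_YEAR_TOT_PROF": 0.0,
                "LAST_YEAR_CNT_ORDER": 0,
            },
        )
        row["DAY_LAST_ORDER"] = max(row["DAY_LAST_ORDER"], order_date)  # R
        row["CNT_ORDER"] += 1                                           # F
        row["TOTAL_PROFIT"] += profit                                   # M
        if order_date.year == LAST_YEAR:
            row["LAST_YEAR_TOT_PROF"] += profit
            row["LAST_YEAR_CNT_ORDER"] += 1
    return summary


# Hypothetical sample data.
sample = [
    ("c1", date(2009, 5, 1), 20.0),
    ("c1", date(2011, 3, 15), 35.5),
    ("c2", date(2010, 11, 30), 12.0),
]
rfm = aggregate_rfm(sample)
print(rfm["c1"])
```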
Exploring R, F and M indexes
Figure 12: Histogram of numbers of purchases
Figure 13: Histogram of numbers of days from last purchase
Figure 14: Histogram of total customer profit
The distribution of the number of orders (frequency) per customer is shown in Figure 12. Close to 60% of
customers made only a single order over the entire time horizon. In addition, about 20% and 10% of
customers made two and three orders, respectively. Figure 13 shows the distribution of the number of days
between the last date in the dataset (August 31, 2011) and the last purchase date of each customer. These
data are clearly more spread out than the frequency data. The distribution of total profit per customer is
presented in Figure 14. Total profit is concentrated approximately between $0 and $100.
Exploring RFM results
When asked to generate five bins each for the R, F and M variables, SAS Enterprise Miner created five
bins for the R and M variables, but only three bins for the F variable because the majority of customers
ordered only once. As a result, a total of seventy-five RFM groups were generated. The top three RFM
groups were 132, 555 and 131, which accounted for 4.36%, 4.21% and 3.96%, respectively. The percentage for
every RFM group is given by the bar chart in Figure 15.
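The collapse from five bins to three for the F variable can be illustrated with a small Python sketch. How SAS Enterprise Miner actually forms its bins is not described in this report; the nearest-rank quantile rule, the function names and the order counts below are all assumptions chosen only to show how heavily tied values can merge quantile bins.

```python
import math
from bisect import bisect_left


def quantile_cut_points(values, n_bins=5):
    """Distinct nearest-rank quantile cut points (assumed binning rule)."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        index = math.ceil(k * len(ordered) / n_bins) - 1
        cut = ordered[index]
        if cut not in cuts:  # tied cut points collapse into one
            cuts.append(cut)
    return cuts


def bin_index(value, cuts):
    """0-based bin of `value`: bin i holds values in (cuts[i-1], cuts[i]]."""
    return bisect_left(cuts, value)


# Hypothetical order counts: 60% one order, 20% two, 10% three, 10% five.
order_counts = [1] * 60 + [2] * 20 + [3] * 10 + [5] * 10

f_cuts = quantile_cut_points(order_counts)
n_f_bins = len(f_cuts) + 1
n_rfm_groups = 5 * n_f_bins * 5  # five R bins and five M bins

print(f_cuts, n_f_bins, n_rfm_groups)
```

With these hypothetical counts, three of the four quintile cut points fall on the values 1 and 2, so only three F bins remain, and 5 x 3 x 5 gives the seventy-five RFM groups.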
Figure 15: The frequency of the RFM codes
Findings and implications
From a CRM perspective, the company’s customers can be summarized by RFM group as follows. Most customers
are one-time customers, represented by “3” in the second digit of the RFM code, and they are also
low-profit generators, represented by “1” or “2” in the third digit of their RFM codes. This might point to
an existing as well as potential problem of customers being dissatisfied with their purchase, product
quality or support, which keeps them from making repeat and future purchases with the company
(footnote 3). The company should therefore emphasize improving the customer experience, and persuading
customers or providing incentives in order to retain them.
Although the majority of the customers are one-time customers, there are also loyal customers, represented
by “555” in their RFM code. This suggests that these customers have had a satisfying or positive
experience with the company. The company should find ways to encourage them to spread the word to their
family members and friends. Together with improvements in services and the company’s market share, this
will lead to higher numbers of both new and returning customers in the future.
Most valuable customers by regions
In order to explore the most valuable customers, who are very profitable, order frequently, and ordered
recently, we used the RFM codes and looked for an RFM of “555”. Since a customer can have multiple
transactions, we counted the unique number of customers with an RFM of “555” for each state. The SAS code
used is given in Appendix: MVC_SAS_Code.
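The de-duplication step matters because a customer with several transactions would otherwise be counted several times. The Python sketch below illustrates the idea only; it is not the SAS code in the appendix, the input layout (customer ID, ZIP code, RFM code) and the function name are assumptions, and it groups by 1-digit ZIP code, the regional unit used in the rest of this section.

```python
from collections import defaultdict


def count_mvc_by_region(transactions):
    """Count unique customers with RFM code "555" per 1-digit ZIP region.

    `transactions` holds (customer_id, zip_code, rfm_code) tuples; the
    field layout is assumed for illustration.
    """
    customers = defaultdict(set)
    for customer_id, zip_code, rfm_code in transactions:
        if rfm_code == "555":
            customers[zip_code[0]].add(customer_id)  # a set drops duplicates
    return {region: len(ids) for region, ids in customers.items()}


# Hypothetical sample data: customer c1 appears in two transactions.
sample = [
    ("c1", "94105", "555"),
    ("c1", "94105", "555"),
    ("c2", "90210", "555"),
    ("c3", "30301", "555"),
    ("c4", "10001", "132"),
]
mvc_counts = count_mvc_by_region(sample)
print(mvc_counts)
```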
Understanding which areas the most valuable customers belong to is of utmost importance for ABC in
adjusting its marketing efforts. From the graph in Appendix: MVC_regions, we deduce that the three “Most
Valuable Regions” for any future marketing offers are
1) Pacific coast (1-digit ZIP code – 9)
2) Southeast (1-digit ZIP code – 3)
3) Mid-Atlantic (1-digit ZIP code – 1)
Footnote 3: This circumstance was also previously pointed out in more detail in the Findings and
implications section of Repeat purchase by catalog offer and product categories.
Figure 16: Total number of transactions by 1-digit ZIP codes and channels
Though the order changes, these remain the top three regions for “Most Profitable Customers” and the
“Most Frequent Customers” as well.
Since the same customer can order through different channels for different orders, we explored the
preferred channel at the transaction level instead of the customer level. We found that most of the west
coast purchased via the website, while other regions purchased primarily via phone and mail. However, when
we analyze the “Most Valuable Transactions,” phone, followed by web and mail, seems to be the preferred
channel.
In addition, the Most Valuable Customers of ABC Catalog Company are not affluent; their income is towards
the lower end, closer to the mean household income. The shaded region in Figure 17 represents the income
group of the customers. Note that in this case, the majority belong to groups with a
“Transformed: Imputed MEAN_HOUSEHOLD_INC_NUM” in the range of 0.10 to 0.29.
Figure 17: Total number of transactions by 1-digit ZIP codes and income groups
Figure 18: Histogram of transformed variable for average household income
Customer clustering and profiling
Objective and Goals
As shown by the previous analyses, the company clearly has a problem creating an effective offer program
and retaining its customers. One possible main cause is poor customer segmentation and a lack of
understanding of customer characteristics. In the following sections, we address these issues by analyzing
customer buying behaviors and generating customer segments with corresponding profiles.
The groups of customers should differ clearly, at least to some extent, in both buying behavior and
demographics. The ultimate goal is to create distinct groups of customers and understand the common traits
within and across the groups. Specifically, we would like to observe the purchasing behaviors and
preferences within each group, answer the following questions, and compare the answers across the groups.
1) Are they more likely to return or cancel orders? Higher returns lower profits, which affects
business revenue.
2) How often are orders placed?
3) What is the average total profit? The target is to increase profits.
4) Which product categories do they purchase? Product categories with few purchases imply changes to
inventory control or to the deals with manufacturers. For this aspect, the variables are the purchase
counts for every product category.
5) What are the primary offer types? To keep this simple, the variable OFFER_DESC (catalog, web,
etc.) was used to differentiate the groups.
6) What are the most common customer RFM codes? To reduce the number of RFM levels, a new RFM
scheme, denoted by the RFM_NEW variable, was generated. The value of RFM_NEW varies from 1 to 8
(footnote 4).
Apart from that, we would like to add common demographics to each of the customer segments; for
instance,
1) The average household income, to identify the target customers.
2) The 1-digit ZIP code, to specify the target areas.
3) The average male/female percentage, to examine whether behavior changes across gender.
Inputs
According to the goals, new variables were created and transformed, as shown in Table 5. Because the
majority of observations take an extreme value, the data can be considered response style data. Response
style data significantly affect the performance of the clustering procedure and therefore need a proper
transformation. Pagolu and Chakraborty (2011) (footnote 5) showed that a double standardization
transformation of response style data yields the best clustering performance, so we applied this
transformation to our data, as indicated by the “DSTDZ” prefix in the variable names.
Footnote 4: The mapping scheme of the RFM_NEW variable is discussed in the Variable recoding section.
Footnote 5: Pagolu and Chakraborty (2011), “Eliminating Response Style Segments in Survey Data via Double
Standardization Before Clustering,” SAS Global Forum paper, p165.
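As we understand it, double standardization standardizes the values within each observation (row) first and then standardizes each variable (column) across observations. The Python sketch below illustrates that reading; it is not the transformation code used for this report, the function names are hypothetical, and the use of the population standard deviation and the handling of constant rows are assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers; constant lists map to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def double_standardize(rows):
    """Standardize within each row, then within each column."""
    by_row = [standardize(row) for row in rows]
    columns = [standardize(list(col)) for col in zip(*by_row)]
    return [list(row) for row in zip(*columns)]


# Hypothetical purchase counts: one row per customer, one column per
# product category.
counts = [
    [1, 0, 0],
    [0, 2, 1],
    [3, 1, 0],
]
transformed = double_standardize(counts)
for row in transformed:
    print([round(v, 3) for v in row])
```

The row step removes each customer's overall level, so that customers are compared on the relative mix of what they bought rather than on how much they bought in total.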
Table 5: The raw and transformed variables used in creating and profiling customer segments
Variable Name               Transformed Variable Name   Role        Description
TOTAL_RETN_QTY              DSTDZ_TOTAL_RETN_QTY        Base        Total return quantity over the entire time horizon
CNT_ORDER                   LOG_CNT_ORDER               Base        Total number of orders over the entire time horizon
TOTAL_PROFIT                LOG_TOTAL_PROFIT            Base        Total profit generated by the customer over the entire time horizon
RFM_NEW                     -                           Base        The transformed RFM code of the customers
new_CNT_PROD_A              DSTDZ_new_CNT_PROD_A        Base        Total number of orders on product category A over the entire horizon
new_CNT_PROD_B              DSTDZ_new_CNT_PROD_B        Base        Total number of orders on product category B over the entire horizon
new_CNT_PROD_C              DSTDZ_new_CNT_PROD_C        Base        Total number of orders on product category C over the entire horizon
new_CNT_PROD_D              DSTDZ_new_CNT_PROD_D        Base        Total number of orders on product category D over the entire horizon
new_CNT_PROD_E              DSTDZ_new_CNT_PROD_E        Base        Total number of orders on product category E over the entire horizon
new_CNT_PROD_F              DSTDZ_new_CNT_PROD_F        Base        Total number of orders on product category F over the entire horizon
new_CNT_PROD_G              DSTDZ_new_CNT_PROD_G        Base        Total number of orders on product category G over the entire horizon
new_CNT_PROD_H              DSTDZ_new_CNT_PROD_H        Base        Total number of orders on product category H over the entire horizon
new_CNT_PROD_I              DSTDZ_new_CNT_PROD_I        Base        Total number of orders on product category I over the entire horizon
new_CNT_PROD_J              DSTDZ_new_CNT_PROD_J        Base        Total number of orders on product category J over the entire horizon
new_CNT_PROD_K              DSTDZ_new_CNT_PROD_K        Base        Total number of orders on product category K over the entire horizon
new_CNT_PROD_L              DSTDZ_new_CNT_PROD_L        Base        Total number of orders on product category L over the entire horizon
new_CNT_PROD_M              DSTDZ_new_CNT_PROD_M        Base        Total number of orders on product category M over the entire horizon
new_CNT_PROD_O              DSTDZ_new_CNT_PROD_O        Base        Total number of orders on product category O over the entire horizon
new_CNT_PROD_P              DSTDZ_new_CNT_PROD_P        Base        Total number of orders on product category P over the entire horizon
new_CNT_PROD_S              DSTDZ_new_CNT_PROD_S        Base        Total number of orders on product category S over the entire horizon
new_CNT_PROD_T              DSTDZ_new_CNT_PROD_T        Base        Total number of orders on product category T over the entire horizon
new_CNT_PROD_X              DSTDZ_new_CNT_PROD_X        Base        Total number of orders on product category X over the entire horizon
CNT_OFFER_AFFILIATE         DSTDZ_CNT_OFFER_AFFILIATE   Base        Total number of orders whose OFFER_DESC is Affiliate
CNT_OFFER_CANADIAN          DSTDZ_CNT_OFFER_CANADIAN    Base        Total number of orders whose OFFER_DESC is Canadian
CNT_OFFER_CATALOG           DSTDZ_CNT_OFFER_CATALOG     Base        Total number of orders whose OFFER_DESC is Catalog
CNT_OFFER_EMAIL             DSTDZ_CNT_OFFER_EMAIL       Base        Total number of orders whose OFFER_DESC is Email
CNT_OFFER_EMP_ORDER         DSTDZ_CNT_OFFER_EMP_ORDER   Base        Total number of orders whose OFFER_DESC is Employee Order
CNT_OFFER_INT               DSTDZ_CNT_OFFER_INT         Base        Total number of orders whose OFFER_DESC is International
CNT_OFFER_ONLINE_CAT        DSTDZ_CNT_OFFER_ONLINE_CAT  Base        Total number of orders whose OFFER_DESC is Online Catalog
CNT_OFFER_PRINT_AD          DSTDZ_CNT_OFFER_PRINT_AD    Base        Total number of orders whose OFFER_DESC is Print Ad
CNT_OFFER_WEB               DSTDZ_CNT_OFFER_WEB         Base        Total number of orders whose OFFER_DESC is Web
IMP_MEAN_HOUSEHOLD_INC_NUM  -                           Descriptor  The average income of the household in the customer’s ZIP code
STATE_1                     -                           Descriptor  The first digit of the customer’s ZIP code
IMP_FEMALE_PCT_NUM          -                           Descriptor  The percentage of females in the customer’s ZIP code
Segments and profile
We ran the clustering procedure with multiple combinations of clustering method and internal
standardization, and found that Range internal standardization with the Ward clustering method performed
the best (footnote 6). Based on statistical recommendation criteria, we chose to create 4, 8, and 12
customer segments and profiled them accordingly; the detailed clustering and profiling results are given
in Appendix C. We found the profile of the eight-cluster solution to be the most meaningful. It is
presented in Figure 19, and Table 6 summarizes the profile story.
Footnote 6: By “performed the best”, we mean that the clustering procedure follows the hierarchical
structure, with no or very few observations joining the clusters late in the algorithm. The results of the
clustering procedure can be found in Appendix C.
Figure 19: Customer profile of the eight-cluster solution
Table 6: Customer segment characteristics
Segment
Number   Characteristics
1        The customers in this segment are recent and profitable, but have a low number of
         transactions with the company. They respond most to affiliate and email offers, and
         specifically to product categories H and M. Most of them live on the west coast and in the
         southern part of the country.
2        Most of the customers are very recent and have many transactions with the company,
         but they are not profitable. They like to buy product categories C, K, P, and T through the
         catalog and, on average, they live in the eastern, northern, and southern parts of the
         country. However, this segment has a high rate of product returns.
3        Customers in this segment like to buy products from the website and the catalog, especially
         product category T. Most of them buy more than one product. They are also highly
         interested in product categories C, E, F, and P. Like segment 2, this segment has a high
         rate of product returns.
4        This is one of the most active segments in terms of product variety and offers responded to.
         The customers in this segment respond to almost every offer channel except catalog. They
         respond to almost all of the company’s product categories, with categories D, G, and I as
         the top three. They have not had a transaction with the company for a long time and have
         no repeat purchases. Most of them are from the northern part of the country.
5        The customers in this segment are responsive to many company offer channels, except
         affiliate, catalog, email, and website. They also have a big wallet, and are from the west
         coast area.
6        The customers in this segment are mostly responsive to catalog offers and to product
         categories E, H, and T. They used to be frequent and high-value-transaction customers.
7        The most valuable customer group is in this segment. They love product category P and
         respond the most to catalog offers. They are mostly female and live in the central north
         and south of the country.
8        The customers in this group are very recent, and they mostly receive product information
         via email. The product categories they are interested in are A, B, G, L, J, M, and O. They
         mostly live in the northern part of the country and have a high rate of product returns.
Findings and Implications
From a technical point of view, we found that creating segments from the given company transactional
data is very difficult. Specifically, the majority of customers are one-time purchasers, and therefore
there is no implied pattern for differentiating them based on buying behaviors and preferences. In
addition, one-time buyers create an issue of response style data, which needs a proper transformation to
alleviate the impact on the clustering procedure. Despite this limitation, the clustering procedure was
able to create meaningful clusters. We created profiles for all of them and eventually chose the
eight-cluster segmentation.
From the prediction point of view, as expected, the RFM_NEW variable is the most important variable for
predicting the cluster of an observation in every cluster. That is because the RFM_NEW variable has eight
levels, which equals the number of segments. As a result, there is a one-to-one mapping between the levels
of the RFM_NEW variable and the segment numbers.
Previous sections pointed out that the company was not able to create an effective offer program, perhaps
because the company lacked an understanding of the nature of its customers. As a result, the dominant
group of customers is one-time buyers. Using the initial profile shown in Table 6, the company should
start generating meaningful promotions for particular combinations of offer and product category based on
common group demographics, and send them to its customers, especially those who made only a few
transactions, to encourage future purchases with the company. Sample recommendations are given in Table 7.
Table 7: Sample recommendations for customer segments
Segment
Number   Recommendations
1        Use affiliate and email offers on product categories H and M on the west coast and in the
         southern part of the country.
2        Use a catalog offer on product categories C, K, P and T in the eastern, northern and
         southern parts of the country, with a product return option.
3        Use website and catalog offers on product categories T, C, E, F and P, with a product
         return option.
4        Use a non-catalog offer, mainly on product categories D, G and I, in the northern part of
         the country.
5        Use a variety of offer types, except affiliate, catalog, email and website, on the west
         coast.
6        Use a catalog offer on product categories E, H and T that provides some advantages to
         frequent buyers.
7        Use a catalog offer on product category P, especially female products, in the central north
         and south of the country, and also offer a loyalty reward program.
8        Use an email offer on product categories A, B, G, L, J, M and O in the northern part of
         the country, with a product return option.
Predictive models
Model target
The target variable for building the predictive models was the NET_REVENUE variable.7
Data partition
The random seed of 57005 was used, and the proportion of the data partition between Train and Validation
was set to 50/50.
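The partition step can be illustrated outside SAS Enterprise Miner with a minimal Python sketch. It assumes the records fit in a list; the function name partition and the sample rows are hypothetical, and Python's random generator will not reproduce the exact split that was obtained in the project with seed 57005.

```python
import random


def partition(rows, seed=57005, train_fraction=0.5):
    """Shuffle a copy of rows with a fixed seed and split it into train/validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the transaction records (hypothetical data).
rows = list(range(10))
train, validation = partition(rows)
print(len(train), len(validation))
```

Fixing the seed matters mainly for repeatability: every team member who reruns the split gets the same Train and Validation sets.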
Variable reduction and selection
A large number of input variables can create a dimensionality problem, which limits the ability to fit a
model to noisy data. Reducing the number of input variables is therefore key to avoiding the dimensionality
problem and building a better model. In this case, interval and categorical variables were treated
separately. For the interval variables, variable reduction was handled by the variable clustering method,
which groups highly correlated variables into the same cluster while keeping the correlation between
clusters low. For the categorical variables, a decision tree was used to find the most important variables
with respect to the target, which in this case was the net revenue.
According to the clustering results (please see Appendix G), one variable was generally chosen as the
representative of each cluster. However, some clusters had more than one variable chosen because of the
business objective. For instance, in cluster 3, even though count order is the most important variable in
the cluster based on the 1-R-square ratio, we also included offer range to see the effect of a particular
promotional range on the target variable.
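As a rough illustration of the 1-R-square ratio used to pick cluster representatives, the sketch below assumes the usual definition, (1 - R-square with own cluster) / (1 - R-square with next closest cluster), where a lower ratio indicates a better representative. The R-square values are made up for illustration and are not taken from Appendix G.

```python
def one_minus_r2_ratio(r2_own, r2_nearest):
    """Return (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)


# Hypothetical R-square values for two variables in one cluster (not from the report).
candidates = {
    "CNT_ORDER": {"r2_own": 0.90, "r2_nearest": 0.10},
    "OFFER_VALID_RANGE": {"r2_own": 0.60, "r2_nearest": 0.20},
}

ratios = {name: one_minus_r2_ratio(**r2) for name, r2 in candidates.items()}
representative = min(ratios, key=ratios.get)  # lowest ratio represents the cluster
print(ratios, representative)
```

A variable with a higher ratio can still be kept for business reasons, as was done with offer range.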
Unlike variable clustering, the decision tree is a supervised method that selects variables based on a
specific target variable. The decision tree algorithm provides the logworth of each variable with respect
to the target; the higher the logworth, the more important the variable. For instance, in the logworth
results (please see Appendix G), RFM had the highest logworth, which means it was the most important
variable for explaining NET_REVENUE. As with the interval variables, some variables that had low
importance with respect to the target were still included in the model because of the business objective.
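For readers unfamiliar with logworth, the sketch below assumes the standard definition, logworth = -log10(p-value) of the split test, so a smaller p-value gives a larger logworth. The p-values are invented for illustration; the actual values are in Appendix G.

```python
import math


def logworth(p_value):
    """Logworth of a split: -log10 of the p-value of its significance test."""
    return -math.log10(p_value)


# Hypothetical p-values for three categorical inputs (not from the report).
p_values = {"RFM": 1e-40, "PAY_METHOD": 1e-12, "CHANNEL": 1e-6}

ranked = sorted(p_values, key=lambda name: logworth(p_values[name]), reverse=True)
for name in ranked:
    print(name, round(logworth(p_values[name]), 1))
```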
The following table shows the final pool of variables produced by the variable selection and reduction
process, which could be used as inputs for building the predictive models.
Table 8: The selected input variables for building predictive models
Interval                  Categorical
1. COST_PER_UNIT          1. RFM
2. PRICE_PER_UNIT         2. RETN_DATE
3. RETN_REVENUE           3. SHIP_DATE
4. QUANTITY               4. PRODUCT_CATEGORY_ID
5. SHIP_QUANTITY          5. ORDER_DATE
6. CNT_ORDER              6. PAY_METHOD
7. DAY_LAST_ORDER         7. CHANNEL
8. RETN_QTY               8. OFFER_DESC
9. OFFER_VALID_RANGE      9. ZIP
10. CANCEL_QUANTITY
7 We selected to use the NET_REVENUE variable as our single target variable because of the ease of grading on the
performance of the models.
Variable recoding
The interval variables had already passed the data audit (transformation and filtering), but some of the
categorical variables needed to be recoded to reduce their number of levels. Among the categorical
variables chosen, RFM and ZIP were the two with too many levels: 75 and 17,876 levels, respectively.
For business reasons and for the simplicity of the model, the variable RFM was recoded into a new variable,
RFM_NEW. Each of the R, F and M components of the RFM code was recoded to high (H) if its value was
between 1 and 3, and to low (L) if its value was 4 or 5. The RFM_NEW variable takes its values according
to the mapping scheme in Table 9. As a result, the number of levels decreased from 75 in the RFM variable
to 8 in the RFM_NEW variable.
Table 9: The mapping scheme from the RFM to RFM_NEW variables
Mapping Scheme
LLL 1
LLH 2
LHL 3
LHH 4
HLL 5
HLH 6
HHL 7
HHH 8
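The recoding rule and the mapping in Table 9 can be sketched in a few lines of Python. This is only an illustration: it assumes the RFM code is stored as a three-digit string such as "145" (R, F, M in that order, each from 1 to 5), and the function names are hypothetical.

```python
# Mapping scheme from Table 9: H/L pattern of (R, F, M) -> RFM_NEW level.
RFM_NEW_LEVELS = {
    "LLL": 1, "LLH": 2, "LHL": 3, "LHH": 4,
    "HLL": 5, "HLH": 6, "HHL": 7, "HHH": 8,
}


def high_or_low(score):
    """Recode one R, F or M score: 1-3 becomes 'H', 4-5 becomes 'L'."""
    if 1 <= score <= 3:
        return "H"
    if 4 <= score <= 5:
        return "L"
    raise ValueError(f"score out of range: {score}")


def rfm_new(rfm_code):
    """Map an RFM code (assumed format: three digits, e.g. '145') to RFM_NEW."""
    pattern = "".join(high_or_low(int(digit)) for digit in rfm_code)
    return RFM_NEW_LEVELS[pattern]


print(rfm_new("111"), rfm_new("145"), rfm_new("555"))
```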
The variable ZIP was also recoded by taking the first digit of the zip code, which refers to a region, and
the result was denoted by ZIP1. The number of levels decreased from 17,876 in the ZIP variable to 33 in
the ZIP1 variable.
Decision tree
Model implementation
The following three changes were made before building the model to get the best model performance with
respect to ASE.
Table 10: The modifications made to the default settings on Decision Tree node
Property                                      Default     Change to
Splitting Rule: Interval Target Criterion     ProbF       Variance
Maximum Depth                                 6           8
Subtree: Assessment Measure                   Decision    Average Square Error
Model results
With the default settings, the decision tree built a model with an average square error (ASE) of 26.69 on
the validation data. Only two variables were included in that model, price per unit and net quantity,
which are not very meaningful for building a marketing strategy. The challenge in building this model is
the trade-off between the accuracy of the model and a meaningful result that the company could
successfully deploy. For instance, some models include only price, cost, and quantity, which are highly
correlated with the target (net revenue). This makes the ASE low, but the model has no meaning from a
marketing standpoint. However, when these variables were taken out of the model, the ASE got worse.
After trial and error over many scenarios, the winning model had an ASE of only 3.95. It includes the
following seven variables, which are important for predicting revenue and meaningful for marketing
implementation.
1) Price per unit
2) Net quantity
3) Zip
4) Payment method
5) Product category
6) Offer range
7) RFM
As expected, price is the most important factor in predicting revenue, followed by net quantity.
Interestingly, the most important of the categorical variables is payment method. There are 19
rules that are meaningful to discuss. Please see Appendix G for more information.
Regression
Model Implementation
Since the regression model is sensitive to non-normally distributed data, transformed versions of the
variables that are very skewed or kurtotic were used as inputs.
The Regression model was built under different scenarios to find the best one:
1) Model selection method and criteria: To cover as many scenarios as possible, Stepwise, Forward and
Backward were used as selection methods, with Validation Error (ASE) as the selection criterion.
The Entry and Stay significance levels were each set to 1 to give the model a broad scope when
selecting variables. The maximum number of Steps was set to 10.
2) Interactions: We allowed the model to include all two-factor interactions for class variables, as well
as all possible polynomial terms.
3) Variable selection: Variable selection is important for the trade-off between a meaningful result
and accuracy. For example, a model may have a very low ASE while the variables it includes cannot
be interpreted, in which case the model cannot be implemented in the real world. Therefore,
discretionary variable selection might be applied in some cases.
Model Results
After experimenting with many scenarios based on the properties mentioned above, we selected the
model that included the variables shown below (please see Appendix G).
1) Cost per unit * Retn quantity
2) Cost per unit * Quantity
3) Cost per unit * Ship quantity
4) Price per unit * Net quantity
5) Cost per unit
6) Offer discount
7) Price per unit
The model’s ASE is 34.55. Even though this is not the lowest ASE we obtained among all the scenarios,
the variables selected are interpretable when predicting the target, Net Revenue.
Findings
The estimates show that an increase in the cost per unit multiplied by the quantity drives net revenue
down, whereas a one-unit increase in the price per unit * the quantity sold increases net revenue by
$666.2. Also, some offers such as Affiliate, Canadian, Catalog and International have a positive
relationship with Net Revenue, whereas other offers such as Email, Employee order, Online Catalog and
Print ad do not have a positive impact on revenue.
Product categories H and I seem to be the ones that drive net revenue up. The increases in Net Revenue
associated with a one-unit increase for these products are 0.59 and 19.55, respectively (please see
Appendix G).
Neural network
Model Implementation
Step1: If the neural network is run with the pool of variables selected in the variable selection and
reduction section, the number of estimated weights is 441, which is very large. So, variable reduction
should be done again before running the neural network. In this case, we used stepwise
regression to pick the variables before fitting the neural network model.
Step2: With the variables from Step 1, the neural network was run with up to 1000 iterations.
Step3: Use the Metadata node to change the target to P_Net_revenue, which is the predicted value of the
target variable produced by the neural network.
Step4: Use a decision tree to explain the neural network model (surrogate model).
Model result
Variables selected by the stepwise regression model:
1) NET_QUANTITY
2) PRICE_PER_UNIT
3) PRODUCT_CATEGORY_ID
4) QUANTITY
5) RFM_NEW
With the variables above, and after changing some default properties of the neural network, the neural
network built a model with an average square error (ASE) of 22 on the validation data, taking 134
iterations to converge (see Neural_Network_3 for the results). However, it is very difficult to explain the
findings directly from the neural network model. Therefore, a decision tree was used to explain the
neural network result.
In the decision tree result, as expected, price is the most important factor in predicting revenue,
followed by net quantity (see Appendix G). Interestingly, the most important of the categorical
variables is RFM. There are 35 rules out of 200 that are meaningful to discuss. Please see Appendix
G for more information.
Findings
Customer specific (RFM)
For prices over $300, customers with RFM codes 2, 4, 6, and 8 seem to be good customers. On the other
hand, for prices lower than $300, customers with RFM codes 1, 3, 5, and 7 seem to be good candidates.
Product Category
Product categories P, T, and H are important in explaining the revenue of the company in most price ranges
(please see Appendix G). The important product categories can be grouped into the four price ranges
below.
Price Range ($) Product Category
<100 B, E, C, T, P, S, F, L, H, X, 0, A
130 – 187 B, E, C, T, P, F, H, K
220 – 250 T, P, H
400 E, C, T, P, F, H, X, 0, K
Comparison of models’ performance
In this section, we identify the best of the three predictive models examined in the previous sections by
comparing their prediction performance on net revenue, based on the validation average squared error
(ASE). ASE was chosen because it measures how far the estimates are from the target. The goal is to
minimize the ASE; the model with the minimum ASE value has the smallest prediction error. Figure 20
presents the results from the Model Comparison node.
Figure 20: The comparison results based on the ASE value
With the average square error criterion, the decision tree model was chosen as it has the lowest ASE value
on the validation data, whereas the neural network model and the regression model were second and third,
respectively.
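The comparison rule can be written out as a short Python sketch. The ASE function follows the usual definition (the mean of the squared differences between target and estimate); the three validation ASE values are the ones reported in the previous sections, and the dictionary and function names are illustrative only.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between target values and model estimates."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Validation ASE values reported in this report for the three models.
validation_ase = {
    "Decision tree": 3.95,
    "Neural network": 22.0,
    "Regression": 34.55,
}

# The winner is simply the model with the smallest validation ASE.
best_model = min(validation_ase, key=validation_ase.get)
ranking = sorted(validation_ase, key=validation_ase.get)
print(best_model, ranking)
```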
Scoring
Scoring performance
Since the decision tree model is the winner among the three models, it was selected to score the data.
Figure 21 demonstrates the results of the scoring.
Figure 21: The scoring results on the score data performed by the decision tree model
From the scoring output, the result is quite satisfactory. The difference in mean net revenue between the
validation and scoring data is only $5.33, which is a difference of around 7%. The result suggests that
the model generalizes well to new data.
Summary
This project was undertaken for ABC Catalog Company, a specialty multi-channel catalog company with a
strong web presence. We started with the objective of helping ABC improve its revenue and market share.
Since there was no information about the actual products or customer-level demographics in the data
provided, we supplemented it with external census data to better understand ABC’s customers and
business. The CRISP-DM methodology was used with the goal of eventually predicting future revenues and
suggesting marketing strategies. We examined purchase patterns, marketing ROI, and seasonal trends.
Eventually, we leveraged the data to build a customer segmentation and were able to predict future
revenues for ABC through predictive modeling.
The predictive models that we used are decision tree, regression, and neural network. Decision tree is a
rule-based model, whereas regression and neural network are parametric models. Before running a
model, variable selection and reduction were performed as appropriate in order to keep only input
variables that are important for the target variable (i.e., NET_REVENUE) and the business goals. Then, the
data was partitioned into training and validation data. Each model was built on the training data and
assessed on the validation data to evaluate how well it generalizes. For each model, experiments were
then performed with various property configurations to find the configuration with the lowest predictive
error. In this work, the indicator of model error is the average squared error (ASE).
Findings and Recommendations
Segmentation and Clustering
From the technical point of view, we found that creating segments with the given company transactional
data is very difficult. Specifically, the majority of customers are one-time purchasers, and therefore
there is no implied pattern for differentiating them based on buying behaviors and preferences. In
addition, one-time buyers also create an issue of response style data, which needs proper transformation
to alleviate the impact on the clustering procedure. Despite the limitation, the clustering procedure was
able to create meaningful clusters. For all of those we created profiles, and eventually chose the
eight-cluster segmentation, whose profile is most meaningful to the team.
From the prediction point of view, the RFM_NEW variable is, as expected, the most important variable in
predicting the cluster of an observation, for every cluster. That is because the RFM_NEW variable has eight
levels, which is equal to the number of segments. As a result, there is a one-to-one mapping between levels
of the RFM_NEW variable and the segment numbers.
One of the business issues pointed out earlier is that ABC was not able to create an effective offer
program, perhaps because the company lacked an understanding of the nature of its customers. As a
result, the dominant group of customers is one-time buyers. Using the initial profiles developed in this
work, the company should start designing meaningful promotions for particular combinations of offer and
product category based on common group demographics, and send them to its customers, especially those
who have made only a few transactions, to encourage future purchases with the company.
Predictive Modeling
Given the high variance of the target variable (NET_REVENUE), all three models have ASE values in an
acceptable range, from 3.9 to 34. The variables selected by each model are summarized in the
following table.
Decision Tree       Regression                        Neural Network
Price per unit      Cost per unit * Retn quantity     Net quantity
Net quantity        Cost per unit * Quantity          Price per unit
Zip                 Cost per unit * Ship quantity     Product category id
Payment method      Price per unit * Net quantity     Quantity
Product category    Cost per unit                     RFM
Offer range         Offer description
RFM                 Price per unit
Recommendations in this section are an amalgamation of the data exploration, RFM, most valuable
customer, and model results. Note that in the model comparison the decision tree model delivers the best
performance, so the model recommendations are based purely on the decision tree model.
From the model results, it is clear which variables are the most influential at each level. Please
refer to Appendix G for more information. Based on price, we have four groups with their own distinct
characteristics. The following table summarizes each group’s characteristics and price range.
Group Price range ($) Important Factors
1 55-95 Product type, Payment method
2 100-140 Offer range
3 160-220 Payment method
4 320-460 Product type
Group1: Price range 55-95
This group is the most important group because 70 percent of ABC’s revenue came from products priced
lower than $90. From the model’s results, this group yields three significant findings: product
type demand, quantity purchased, and credit card issuer specificity.
Product specific: In the price range from 55 to 70 dollars, ABC should concentrate on these ten product
types: B, E, P, F, C, J, T, X, H, and A. However, 10 product types out of 19 might be too many for the
marketing budget. To be more specific and limit the marketing budget, product types B, P, C, T, and X
should be the marketing priority.
Bundle strategy: Generally, ninety percent of ABC’s customers buy only one product, but in this price
range customers mostly bought more than one product. ABC might try to gain a bigger share of customer
wallet by encouraging customers to buy more than one product at a time with a bundle strategy.
Co-promotion with credit card issuers: American Express, Visa, and MasterCard are the credit card issuers
that affect revenue significantly. ABC might encourage more sales by cooperating with these three credit
card issuers. For example, there could be a discount offer for customers who buy a product using one of
these credit cards.
Final recommendation: In the price range 55-95, ABC should focus on five products: B, P, C, T, and X.
ABC should sell these five products in bundles and give customers a promotion if they use American
Express, Visa, or MasterCard to buy a product.
Group2: Price range 100-140
Geographical specific: Two regions, identified by zip code, have a significant effect on revenue: zip
codes beginning with zero and zip codes beginning with four. Zip codes beginning with zero cover the New
England region, and those beginning with four cover four states: Indiana, Michigan, Ohio, and Kentucky.
Offer range: In this price range there is a significant difference between giving customers an offer range
of 6 months and one of one year. With a 6-month offer range, customers tend to buy more products than
with a one-year range. The reason might be that customers are afraid they will not have a chance to buy
the product with the particular offer in the future, so they buy more than they necessarily need at that
time.
Final recommendation: ABC should use only the 6-month promotion offer in the areas whose zip codes
begin with zero and four.
Group3: Price range 160-220
Co-promotion with credit card issuers: ABC could follow a co-promotion strategy with credit card issuers
similar to the one suggested for group one, but including Discover card as well.
Outbound sales: There is a significant relationship between RFM code 8 and payment by personal check in
this price range. From the previous analysis, most of the customers who pay by personal check use the
phone channel. Therefore, instead of waiting for customers to read the catalog and call in, ABC might
do outbound sales by calling customers on the RFM code 8 list.
Final recommendation: ABC should promote outbound sales and co-promotion with credit card
issuers.
Group4: Price range 320-460
Product specific: The products that customers in this price range are interested in are F, H, T, X, and C.
Therefore, focusing on these products is key to successful marketing.
Limitations
SAS Program
SAS EM has a synchronization limitation. Since the project is large, it is very difficult for one person
to do every process in one diagram, so the work was divided among the team members. However, at the end,
some of the work needed to be combined in one diagram. SAS EM has no ability to combine diagrams, so the
work needed to be re-created in one diagram, which is redundant.
Clarification of the data/variables
Some variables have no clear meaning. Even when we found a relationship or something interesting, we
could not understand it fully or write much about it. For example, for product category ID, we do not
know what A or B is. If we knew, we might be able to get more ideas from the results and write up the
report in a more meaningful way.
Company interaction
During the project, we came up with non-textbook questions that need the company’s opinion to clarify.
For instance, in the model building phase, we might have excluded some input variables that are
important for the company but not for the model. Also, if we had had some ballpark numbers, such as
marketing cost per year and marketing cost as a percentage of sales, it would have helped a lot in the
recommendation section. So, it would be great if we could interact with a real company representative at
least once per phase.
Appendix
Appendix A
Data exploration on the net profit and its associations with the input variables
The input variables can be divided into two categories: numerical (or interval) and categorical (or
nominal or binary) variables. The associations can be categorized in the same manner.
We begin this section with the variable worth of all input variables, followed by correlations between our
target variable and the interval input variables.
Variable worth
Figure 22: Variable worth with respect to net profit
The variable worth shows the worth of an individual input variable in predicting the target variable.
Figure 22 shows that NET_REVENUE has the highest worth value. This is intuitive, as NET_REVENUE is very
highly related to the target: it was used as a primary variable in the NET_PROFIT formula. Recall that
NET_REVENUE is equal to NET_QUANTITY × PRICE_PER_UNIT. The other two variables that were used in the
NET_PROFIT formula are PRICE_PER_UNIT and COST_PER_UNIT, which unsurprisingly have the second and third
highest worth values.
Pearson correlation
Figure 23: Pearson correlation coefficient of interval variables with respect to net profit
Pearson correlation shows linear relationships between the target and interval input variables. Intuitively,
NET_REVENUE, PRICE_PER_UNIT, COST_PER_UNIT, NET_QUANTITY are amongst the variables
whose correlations are positive and relatively high. The only standout input variable that is negatively
correlated with the target variable is CANCEL_QUANTITY.
Spearman correlation
Figure 24: Spearman correlation coefficient of interval variables with respect to net profit
Spearman correlation measures monotonic relationships, whether linear or non-linear, between
the target and interval input variables. It can be observed that NET_QUANTITY and RETN_QTY are
more positively correlated than under Pearson correlation, and the negative correlations of
CANCEL_QUANTITY and RETN_REVENUE are more pronounced.
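The difference between the two coefficients can be seen on a toy example. The sketch below is a plain-Python illustration, not the procedure used to produce Figures 23 and 24: Spearman correlation is computed as the Pearson correlation of the ranks (with average ranks for ties), and the data are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

On data like this, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1, which is the kind of gap seen between Figures 23 and 24.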
In the following, the relationships between the target and categorical input variables as well as insights are
demonstrated.
Total net profit by product categories
Figure 25: Percent contribution to the net profit by product categories
Figure 25 shows that product category E generated the highest profit (approximately $1.06 M over the
entire horizon, or about 15% of the total), followed by P, T, C, and so on in a counter-clockwise
direction. The highest-profit category does not necessarily have the largest total quantity sold or the
most orders. As presented in Figure 26, product category T is the most popular product category in terms
of the total number of orders and total sold quantity, and the largest in terms of the total number of
SKUs. Another factor that significantly contributes to the total net profit for a particular product
category is its average per-unit profit. As shown in Figure 27, the average per-unit profit of product
category E is about ten dollars higher than that of product categories T and C, and about seven dollars
higher than that of product category P. This is the reason why product category E is the most profitable
product category.
Figure 26: Total number of orders, sold quantity and SKUs by product categories
Figure 27: Average unit profit by product categories
Net profit by SKUs
Figure 28: Top 10 most profitable SKUs
Figure 28 presents the top 10 most profitable products and their corresponding total net profit. As
mentioned earlier, the total net profit is hugely influenced by the average per-unit profit. Figure 29
exhibits the associations between average per-unit profit and total net profit by product number. It can
be seen that most of the average per-unit profits are not high. Several are higher than 1,000 dollars, but
those products did not generate correspondingly high total profit. Therefore, the core profitability of
the company comes from selling large volumes of inexpensive products.
Figure 29: Scatter plot matrix of average unit profit, total profit by SKUs (plotted variables:
AVG_UNIT_PROF_BY_PROD, PRODUCT_NO, TOT_PROF_BY_PROD)
Net profit by 3-digit ZIP codes
The ZIP code information in the data set is provided at the 5-digit level, which is too granular and
therefore difficult to analyze and draw insights from. We aggregated the data to the 3-digit ZIP code
level, and provide results for the geographical areas, based on 3-digit ZIP codes, that generated large
net profit. As seen in Figure 30, the ten most profitable 3-digit ZIP codes are 100, 600, 770, 606, 117,
926, 945, 070, 334, and 300, which are in the states of NY, IL, TX, CA, NJ, FL and GA. The results are
expected, as these states are large in terms of population and have more buying power.
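The aggregation to the 3-digit level can be sketched as follows. This is an illustrative Python version, not the procedure used in the project; the function name and the sample records are hypothetical, and ZIP codes read as numbers are assumed to have lost their leading zeros and are padded back to five digits.

```python
from collections import defaultdict


def top_zip3_by_profit(records, n=10):
    """Sum net profit by 3-digit ZIP prefix and return the n largest totals."""
    totals = defaultdict(float)
    for zip_code, net_profit in records:
        zip5 = str(zip_code).zfill(5)  # restore leading zeros, e.g. 7302 -> "07302"
        totals[zip5[:3]] += net_profit
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up (ZIP, net profit) pairs for illustration only.
records = [
    ("10001", 120.0),
    ("10022", 80.0),
    ("60614", 150.0),
    ("07030", 40.0),
    (7302, 10.0),
]
top = top_zip3_by_profit(records, n=2)
print(top)
```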
Figure 30: Top 10 most profitable 3-digit ZIP codes
Net profit by Customer IDs
Figure 31 shows the top 20 buyers based on their corresponding total net profit. It can be observed that
the customer whose ID is 10031582924 is the most profitable customer and is a potential candidate for the
most valuable customer. Also, there are 13 customers whose purchase(s) resulted in more than 2,500
dollars of profit over the entire time horizon. These top customers usually drive the company’s profit, so
the company should always treat them well, maintain a good relationship with them, and try to retain them.
Figure 31: Top 20 most profitable customers
Next, we explore the seasonality pattern of customer orders and its relationship with the target variable,
i.e., net profit, which is presented in Figure 32.
Figure 32: Patterns of net profit and customer orders in date/time scale
It is clear that the purchase pattern of customers and the resulting net profit are seasonal. To see the
variation of all actions or processes by month, the entire time horizon was grouped by month
number, i.e., from 1 (January) to 12 (December), and the plot is given in Figure 33.
Figure 33: Number of occurrences of activities by month numbers
We can see that the numbers of customer orders and shipped orders are very high in the last four months
of the year, from September through December. As a result, the numbers of returned orders and cancelled
orders turn out to be higher in December and January than in other months.
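Grouping activity by month number, as done for Figure 33, can be sketched in Python as below. The function name and the sample dates are hypothetical; the same counting would apply to ship, return and cancel dates.

```python
from collections import Counter
from datetime import date


def occurrences_by_month(dates):
    """Count events per month number, returned as a list for months 1..12."""
    counts = Counter(d.month for d in dates)
    return [counts.get(month, 0) for month in range(1, 13)]


# Made-up order dates spanning two years; years are pooled by month number.
order_dates = [
    date(2009, 12, 1),
    date(2010, 12, 15),
    date(2010, 11, 20),
    date(2011, 1, 3),
]
monthly = occurrences_by_month(order_dates)
print(monthly)
```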
Appendix B
The effectiveness of catalog offers by product categories by total order quantities
To evaluate how effective the offers were, an analysis of the number of orders alone may not be enough to
draw a conclusion. The performance of the offers also depends on how large the orders were, that is, on
how the order quantities were distributed within those orders. So, in this section, we take the same
approach and perform a similar analysis, this time based on the order quantities. We again started by
looking at the total order quantity from catalogs from year 2004 to 2011, presented in the following
figures. The total order quantities were grouped by the last three letters of the OFFER_IDs and by
product category.
Compared to the results in the previous section, the results in this section are similar because the
majority of the orders had an order quantity of one. The observations in this section can then
be partially implied from the observations in the previous section; for observations that imply
one another, the justification is similar and thus not repeated in this section.
Observation 7: From year 2004 to 2010, there were consistently incoming order quantities for OFFER_IDs
EHB, GFB, GFP, HLB, LCB, LCR and LMB (see footnote 8), where order quantities for OFFER_ID LCR occurred
only from 2007 to 2010. In 2011, there were no order quantities at all for those OFFER_IDs. Instead,
there were more order quantities for the other OFFER_IDs (see footnote 9).
Observation 8: From year 2004 to 2010, the pattern of the order quantity distribution by product category
within the same OFFER_ID seemed to be very similar over the years, where product category T had the
largest total order quantity, followed by product categories C, E and P, respectively. However, in 2011,
when the total order quantity of product category T dropped significantly, product category P became the
product category with the largest total order quantity, followed by product categories C, E and H,
respectively (see footnote 10).
8 Besides OFFER_ID EHB, GFB, GFP, HLB, LCB, LCR and LMB, OFFER_ID SMB, LSB, PCB, SPB and WNB
seem to be the second group of the best OFFER_IDs, most of which stood out in 2011.
9 Observation 6 and 1 can imply one another as many of the order quantities in the orders were one.
Apparently, OFFER_IDs GFB, GFP, LCB and LMB were the top four OFFER_IDs with respect to order
quantities. Again, we want to see not only how the order quantities were distributed by product category
within the same OFFER_ID over time, but also how those numbers changed over time, which is shown in the
following figures for the four most effective OFFER_IDs.
Note that for 2004 and 2011, the raw input data were not provided for the entire years, so their results
are not analyzed in the following. Compared with the figures provided in the previous section, where the
analysis was based on the number of orders, the figures above look almost identical. The following
observations can therefore again be implied by the observations made in the previous section.
Observation 9: It can be seen that within these top four most effective OFFER_IDs, product categories T,
C, E and P were consistently the top four product categories with respect to total order quantities. This
observation conveys similar information to, and therefore validates, the previous observation.
Observation 10: OFFER_IDs LMB and GFP seemed to have the best performance, where OFFER_ID
LMB slightly outperformed OFFER_ID GFP, which had significant drops in 2009 and 2010.
10 Observation 7 and 2 can imply one another as many of order quantities in the orders were one.
Observation 11: In general, the order quantities for product category T dropped significantly in
2009, and fluctuated with a relatively large variation, in the range of approximately 300 to 750.
The total order quantities of the other product categories also fluctuated, but with
much smaller variation. Compared to Observation 5, the variation range for each product category was
slightly wider.
Observation 12: After the drop in 2009, the total order quantities particularly for product category T
increased for OFFER_ID LCB and LMB, stayed almost constant for OFFER_ID GFB, and kept decreasing
for OFFER_ID GFP in 2010.
The effectiveness of catalog offers by product categories by total profit
Beyond the analysis based on the number of orders placed by OFFER_ID, the analysis of order quantities
by OFFER_ID does not yield many additional results or conclusions. To further evaluate the effectiveness
of the catalog OFFER_IDs, we conducted a similar analysis based on total profit, because different product
categories may have different profit margins and may therefore lead to a different conclusion. As in the
previous two analyses, we started by looking at the total profit generated from catalogs from 2004 to 2011,
presented in the following figures. Total profits were aggregated by the last three letters of the OFFER_ID
and by product category.
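The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the OFFER_ID strings (such as "08LMB") and all profit values are hypothetical, and the report's own figures were not necessarily produced this way. The optional year filter mirrors the report's choice to leave the partial years 2004 and 2011 out of the trend analysis.

```python
from collections import defaultdict

# Hypothetical order lines: (year, offer_id, product_category, profit).
# The values are made up for illustration and are not taken from the report.
ORDER_LINES = [
    (2008, "08LMB", "T", 12.50),
    (2008, "08LMB", "C", 8.00),
    (2008, "08GFP", "T", 10.00),
    (2009, "09LMB", "T", 6.25),
    (2009, "09GFP", "E", 15.00),
    (2011, "11XYZ", "P", 9.00),  # partial year, excluded below
]

PARTIAL_YEARS = {2004, 2011}  # years with incomplete raw data


def total_profit_by_offer_and_category(lines, skip_years=frozenset()):
    """Sum profit per (year, last three letters of OFFER_ID, category)."""
    totals = defaultdict(float)
    for year, offer_id, category, profit in lines:
        if year in skip_years:
            continue
        offer_key = offer_id[-3:]  # group by the last three letters
        totals[(year, offer_key, category)] += profit
    return dict(totals)


profit_table = total_profit_by_offer_and_category(ORDER_LINES, PARTIAL_YEARS)
for key in sorted(profit_table):
    print(key, profit_table[key])
```

Replacing the profit field with a constant 1 or with the order quantity would give the order-count and order-quantity views used in the two earlier analyses.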
Observation 13: From 2004 to 2010, profits were consistently generated by OFFER_IDs EHB, GFB, GFP,
HLB, LCB, LCR and LMB, although OFFER_ID LCR generated profit only from 2007 to 2010. In 2011,
none of those OFFER_IDs generated any profit; instead, profit came from other OFFER_IDs.
Observation 14: From 2004 to 2010, the distribution of profit by product category within the same
OFFER_ID was very similar from year to year: product category T seemed to be the most profitable,
followed by product categories C, E and P, respectively. However, in 2004 product category H was
apparently the most profitable because of its large profit margin, even though its total order quantity was
slightly smaller than that of product category C for OFFER_ID LMB. In addition, in 2011, when the total
profit of product category T dropped significantly, product category P became the most profitable,
followed by product categories C, E and H, respectively.
As in the previous two analyses, which were based on the number of orders placed and the total order
quantity, OFFER_IDs GFB, GFP, LCB and LMB were evidently the top four OFFER_IDs with respect to total
profit generated. We want to see not only how total profits were distributed by product category within the
same OFFER_ID, but also how those profits changed over time; both are presented in the following figures
for the four most effective OFFER_IDs.
Note that for 2004 and 2011 the raw input data did not cover the entire year, so those years are excluded
from the following analysis.
Observation 15: Within these four most effective OFFER_IDs, product categories T, C, E and P were
consistently the top four product categories with respect to total profit generated.
Observation 16: Among the top four product categories, the profit margin seemed to be relatively low for
product category T, relatively high for product category E, and moderate for product categories C and P.
To verify Observation 14, one can see that the gap in total profit between product category T and the other
product categories is narrower than the corresponding gap in order quantity. In addition, at some points the
total profit generated by product category T even fell below that of other product categories, such as C for
OFFER_ID LCB in 2010, although the order quantity of product category T was much larger at that time.
Also, the total profit of product category E occasionally jumped above that of the other product categories,
even though at those data points the total order quantity of product category E was lower than those of the
other product categories.
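The reasoning above, where a category with more units can still generate less profit, comes down to comparing profit per unit. The following sketch makes that arithmetic concrete. All numbers are hypothetical and chosen only to reproduce the pattern described; average profit per unit is used here as a stand-in for profit margin, which is an assumption about how the margin is defined.

```python
# Hypothetical totals for one OFFER_ID and year: category -> (units, profit in $).
# These numbers are illustrative only and do not come from the report's data.
TOTALS = {
    "T": (600, 9000.0),
    "C": (300, 9600.0),
    "E": (150, 7500.0),
}


def profit_per_unit(totals):
    """Return the average profit per unit ordered for each category."""
    return {
        cat: profit / units
        for cat, (units, profit) in totals.items()
        if units > 0
    }


margins = profit_per_unit(TOTALS)
by_quantity = sorted(TOTALS, key=lambda c: TOTALS[c][0], reverse=True)
by_profit = sorted(TOTALS, key=lambda c: TOTALS[c][1], reverse=True)

print("profit per unit:", margins)
print("ranked by quantity:", by_quantity)
print("ranked by profit:", by_profit)
```

In this made-up example category T leads on quantity but not on profit, because its profit per unit is the lowest of the three.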
Observation 17: OFFER_IDs LMB and GFP seemed to have the best performance, with OFFER_ID LMB
slightly outperforming OFFER_ID GFP, which had significant drops in 2009 and 2010.
Observation 18: In general, the total profit of product category T dropped significantly in 2009 and
fluctuated over a relatively wide range, between approximately $8,000 and $24,000. The total profits of the
other product categories also fluctuated, but with much smaller variation.
Appendix C
The results of hierarchical clustering
The results of the clustering procedure that used Range as the internal standardization and Ward as the
clustering method are shown below.
The number of clusters suggested by the CCC was forty-eight. Overall, this model is the best among all the
models that we examined with different settings and properties. As shown in the cluster history in the
figure below, in the last twenty steps very few (i.e., three) observations joined the clusters late, but not
very late, i.e., when the number of clusters was between sixteen and twenty. This number is the smallest
among the models examined. In addition, as suggested by the other statistics, we chose to further examine
the segmentation with four, eight and twelve clusters.
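For readers unfamiliar with the two settings named above, the sketch below shows one common reading of them: range standardization rescales each variable to [0, 1] as (x - min) / (max - min), and Ward's method repeatedly merges the pair of clusters whose merge least increases the within-cluster sum of squares. This is a naive pure-Python illustration with hypothetical customer data, not the software procedure used for the report, and the exact definition of Range in that software is assumed rather than confirmed.

```python
def range_standardize(rows):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(x - lo) / span if span else 0.0 for x, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def centroid(points):
    """Component-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist_sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist_sq


def ward_clusters(points, k):
    """Naive agglomerative clustering: repeatedly merge the cheapest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical customers: (recency in days, order count, total spend in $).
customers = [(10, 12, 900.0), (15, 10, 850.0), (200, 2, 80.0), (220, 1, 40.0), (90, 5, 300.0)]
scaled = range_standardize(customers)
segments = ward_clusters(scaled, 2)
print([len(s) for s in segments])
```

Without the rescaling step, the spend column would dominate the distance calculation simply because its values are numerically larger.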
The results of clustering and profiling with four clusters
The cluster distances projected onto two-dimensional space and the pie chart showing the frequency and
percentage of observations in each cluster are presented below.
The importance of variables is shown in the figure below.
(See peak and solution is “+1”) The potential solutions for the number of clusters are 1+1=2, 4+1=5, 7+1=8, 9+1=10, 11+1=12, 13+1=14, 15+1=16 and 18+1=19.
(See peak) The potential solutions for the number of clusters are 2.
(See jump) The potential solutions for the number of clusters are 2, and 4.
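The "peak, then +1" reading in the first annotation can be expressed as a small helper: find each local peak of a statistic plotted against the number of clusters and report the peak position plus one as a candidate solution. The sketch below assumes only that rule; the statistic values are hypothetical, and it considers interior peaks only, so a peak at the first or last position would need separate handling.

```python
def peak_plus_one(stat_by_k):
    """Return k + 1 for every interior local peak of a statistic indexed by k."""
    ks = sorted(stat_by_k)
    candidates = []
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if stat_by_k[k] > stat_by_k[prev_k] and stat_by_k[k] > stat_by_k[next_k]:
            candidates.append(k + 1)
    return candidates


# Hypothetical statistic values keyed by number of clusters (not from the report).
example_stat = {1: 5.0, 2: 9.0, 3: 4.0, 4: 11.0, 5: 3.0, 6: 2.5, 7: 6.0, 8: 1.0}
print(peak_plus_one(example_stat))
```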
The figures below present cross-tabulations showing the distribution of frequencies by the RFM_NEW and
STATE_1 variables in all customer segments.
The following figure presents the averages of the numerical variables overall and by clusters.
The matrix of variable importance in predicting observations belonging to a particular cluster is presented
below.
The segment profile index is presented in the figure below.
The results of clustering and profiling with eight clusters
The cluster distances projected onto two-dimensional space and the pie chart showing the frequency and
percentage of observations in each cluster are presented below.
The importance of variables is shown in the figure below.
The figures below present cross-tabulations showing the distribution of frequencies by the RFM_NEW and
STATE_1 variables in all customer segments.
The following figure presents the averages of the numerical variables overall and by clusters.
The matrix of variable importance in predicting observations belonging to a particular cluster is presented
below.
The segment profile index is presented in the figure below.
The results of clustering and profiling with twelve clusters
The cluster distances projected onto two-dimensional space and the pie chart showing the frequency and
percentage of observations in each cluster are presented below.
The importance of variables is shown in the figure below.
The figures below present cross-tabulations showing the distribution of frequencies by the RFM_NEW and
STATE_1 variables in all customer segments.
The following figure presents the averages of the numerical variables overall and by clusters.
The matrix of variable importance in predicting observations belonging to a particular cluster is presented
below.
The segment profile index is presented in the figure below.
Appendix D
Data exploration and business implication
Customer_age_impact
Product_Return
Appendix E
Customer segmentation and profile
US 1-digit zip map MVC_SAS_Code
MVC_Regions
Appendix F
Data imputation and transformation
Imputation_Transformation_1
Imputation_Transformation_2
Imputation_Transformation_3
Appendix G
Predictive model
Clustering_1
LogWorth_1
Decision_Tree_1
Price:Unit  Net quantity  zip 1  Payment_method  Category  Offer range  RFM_NEW
1
56-62 >=2 B,E,P,F,C,J,T,X
56-62 >=2 H
62-72 >=2 B,P,C,T,A
62-72 >=2 X
82-94 >=2 AX
82-94 >=2 VI,MC
2
101-108 1 <366
101-108 1 >366
127-142 >=2 0
127-142 >=2 4 <182
127-142 >=2 4 >182
3
162-175 >=1 AX,VI,MC,DI
162-175 >=1 PC 8
175-187 >=1 AX,VI,MC,DI
175-187 >=1 PC
187-227 >=2 VI
4
322-345 >=1 F,H,T
322-345 >=1 X
444-464 >=1 C,H,X
Decision_Tree_2
Regression_3.1
Regression_3.1_Estimates
Neural_Network_1
No.  Price:Unit  Net quantity  Category  RFM_NEW  Revenue
1  <26  <1.5  -  8,3,7  25
2  26-47  1  -  4,1,5  42
3  26-47  1  -  8,3,7  35
4  <47  >1.5  B,E,C,T,P,S,F,L,H,X,O,A  4,1,5,7  97
5  <47  >=1.5  B,E,C,T,P,S,F,L,H,X,O,A  8,3  88
6  <34  >=1.5  K  -  55
7  34-47  >=1.5  K  -  78
8  47-64  >=1.5  -  4,1,6,5,2  111
9  47-64  >=1.5  -  8  99
10  64-87  >=1.5  -  4,6,2  125
11  64-87  >=1.5  -  8  110
12  87-106  >=1.5  -  4,2  142
13  87-106  >=1.5  -  8  126
14  106-124  >=1.5  -  4  161
15  106-124  >=1.5  -  8  141
16  132-141  >=0.5  T,P,F,H,X  -  131
17  132-141  >=0.5  B,E,C,K,L,A  -  134
18  152-167  >=0.5  B,E,C,X  -  158
19  152-167  >=0.5  T,K,P,F,L,H,0  -  155
20  167-172  >=0.5  B,E,C,D  -  170
21  167-172  >=0.5  T,K,P,S,F,H  -  168
22  172-187  >=0.5  B,C,X,A,D  -  184
23  172-187  >=0.5  E,T,K,P,S,F,L,H,M,G  -  181
24  227-234  >=0.5  E,C,T,L  -  246
25  227-234  >=0.5  P,H,0  -  243
26  234-247  >=0.5  F  -  232
27  234-247  >=0.5  T,P,H  -  255
28  284-309  >=0.5  -  4,6,2  318
29  284-309  >=0.5  -  8  323
30  309-345  >=0.5  -  4,6,2  339
31  309-345  >=0.5  -  8  346
32  345-374  >=0.5  -  4,2  354
33  345-374  >=0.5  -  8,6  365
34  394-427  >=0.5  E,C,T,P,F,H,X,0  -  399
35  394-427  >=0.5  K  -  390
Neural_Network_2
Price range  Product  Importance B E C T P S F L H X 0 A K D M G
<47 B,E,C,T,P,S,F,L,H,X,O,A
<47 B,E,C,T,P,S,F,L,H,X,O,A
<34 K
34-47 K
132-141 T,P,F,H,X
132-141 B,E,C,K,L,A
152-167 B,E,C,X
152-167 T,K,P,F,L,H,0
167-172 B,E,C,D
167-172 T,K,P,S,F,H
172-187 B,C,X,A,D
172-187 E,T,K,P,S,F,L,H,M,G
227-234 E,C,T,L
227-234 P,H,0
234-247 F
234-247 T,P,H
394-427 E,C,T,P,F,H,X,0
394-427 K
Total Count 6 8 8 9 9 4 8 6 9 6 5 4 7 2 1 1
Each Price Range count
<100 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0
130-187 4 4 4 4 4 2 4 3 4 3 1 2 4 2 1 1
220-250 0 1 1 2 2 0 1 1 2 0 1 0 0 0 0 0
400 0 1 1 1 1 0 1 0 1 1 1 0 1 0 0 0
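The Total Count row can be reproduced by counting how many of the listed rules mention each product category. The sketch below does this for the product lists copied from the rows above. It assumes that the letter "O" in the two "<47" rows denotes the same category as the "0" column, which is an interpretation rather than something the table states.

```python
# Product lists copied from the Neural_Network_2 rows, in their original order.
RULE_ROWS = [
    ("<47", "B,E,C,T,P,S,F,L,H,X,O,A"),
    ("<47", "B,E,C,T,P,S,F,L,H,X,O,A"),
    ("<34", "K"),
    ("34-47", "K"),
    ("132-141", "T,P,F,H,X"),
    ("132-141", "B,E,C,K,L,A"),
    ("152-167", "B,E,C,X"),
    ("152-167", "T,K,P,F,L,H,0"),
    ("167-172", "B,E,C,D"),
    ("167-172", "T,K,P,S,F,H"),
    ("172-187", "B,C,X,A,D"),
    ("172-187", "E,T,K,P,S,F,L,H,M,G"),
    ("227-234", "E,C,T,L"),
    ("227-234", "P,H,0"),
    ("234-247", "F"),
    ("234-247", "T,P,H"),
    ("394-427", "E,C,T,P,F,H,X,0"),
    ("394-427", "K"),
]

# Column order of the importance matrix header.
COLUMNS = "B E C T P S F L H X 0 A K D M G".split()


def count_products(rows):
    """Count, per product category, the number of rules that mention it."""
    counts = dict.fromkeys(COLUMNS, 0)
    for _price_range, products in rows:
        for code in products.split(","):
            code = "0" if code == "O" else code  # assumption: "O" and "0" are one category
            counts[code] += 1
    return [counts[c] for c in COLUMNS]


total_count = count_products(RULE_ROWS)
print(dict(zip(COLUMNS, total_count)))
```

Restricting `RULE_ROWS` to the rows of one price band before counting gives the per-band rows of the table in the same way.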
Neural_Network_3
Neural_Network_4