Eindhoven University of Technology
MASTER
Boosted tree learning for balanced item recommendation in online retail
Dikker, J.
Award date:2017
Link to publication
Master thesis
Boosted tree learning for balanced item recommendation
in online retail
Jelle Dikker
0953780
Business Information Systems
October 6, 2017
Supervisors:
dr. Y. Zhang, Eindhoven University of Technology
dr. V. Menkovski, Eindhoven University of Technology
prof.dr.ir. U. Kaymak, Eindhoven University of Technology
S. Coenraad Msc., Building Blocks B.V.
Contents
List of Figures 4
List of Tables 5
1 Introduction 8
1.1 Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Kaggle coupon purchase prediction . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Literature survey 12
2.1 Online retail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Recommender systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Conventional collaborative filtering . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Machine learning algorithms . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 The long tail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Methodology & case studies 19
3.1 Problem formulation and overview . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Data preparation and feature calculation . . . . . . . . . . . . . . . . . . . . . 26
3.4 Hold-out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Model training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Calculate weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.9 Case study: tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9.1 Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9.2 Data preparation and feature calculation . . . . . . . . . . . . . . . . 36
3.9.3 Grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Results 38
4.1 Random selection algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Ponpare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Summary of insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Implementation 52
5.1 Used software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Quality aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Conclusion and discussion 54
6.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Implications for research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Managerial implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References 58
Appendices 61
A Data exploration 61
B Feature calculation 65
C Experiment results 66
List of Figures
3.1 Overview of recommendation process . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Methodology overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 An overview of the available data. . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Views and purchases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Views and purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 Pearson correlation for coupon properties . . . . . . . . . . . . . . . . . . . . 23
3.7 Purchases and views per coupon ID . . . . . . . . . . . . . . . . . . . . . . . 23
3.8 Catalog price and discount price . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.9 Catalog price and number of purchases . . . . . . . . . . . . . . . . . . . . . . 25
3.10 Display period and number of purchases . . . . . . . . . . . . . . . . . . . . . 25
3.11 Number of purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.12 Schematic overview of use of interaction data . . . . . . . . . . . . . . . . . . 28
3.13 Age distribution of bookings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.14 Number of bookings per accommodation . . . . . . . . . . . . . . . . . . . . . 37
4.1 Purchases per coupon in test set . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Precision/Recall for all items in the Ponpare dataset . . . . . . . . . . . . . . 39
4.3 Precision/Recall for head items in the Ponpare dataset . . . . . . . . . . . . . 39
4.4 Precision/Recall for tail items in the Ponpare dataset . . . . . . . . . . . . . 40
4.5 Coverage for head items in the Ponpare dataset . . . . . . . . . . . . . . . . . 40
4.6 Coverage for tail items in the Ponpare dataset . . . . . . . . . . . . . . . . . . 41
4.7 Coverage for all items in the Ponpare dataset . . . . . . . . . . . . . . . . . . 41
4.8 Gini-index for different list lengths in the Ponpare dataset . . . . . . . . . . . 42
4.9 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 42
4.10 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 43
4.11 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 43
4.12 Gini-index and F1 score in the Ponpare dataset . . . . . . . . . . . . . . . . . 44
4.13 Purchases per coupon in test set . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.14 Precision/Recall for all items in the tour operator dataset . . . . . . . . . . . 45
4.15 Precision/Recall for head items in the tour operator dataset . . . . . . . . . . 46
4.16 Precision/Recall for tail items in the tour operator dataset . . . . . . . . . . . 46
4.17 Coverage for head items in the tour operator dataset . . . . . . . . . . . . . . 47
4.18 Coverage for tail items in the tour operator dataset . . . . . . . . . . . . . . . 47
4.19 Coverage for all items in the tour operator dataset . . . . . . . . . . . . . . . 48
4.20 Gini-index for different list lengths in the tour operator dataset . . . . . . . . 48
4.21 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 49
4.22 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 49
4.23 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 50
4.24 Gini-index and F1 score in the tour operator dataset . . . . . . . . . . . . . . 50
A.1 Number of purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Tables
2.1 Example user-item matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Views for example user and coupon . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Example views for constructing genre feature . . . . . . . . . . . . . . . . . . 29
3.3 Example views for constructing prefecture feature . . . . . . . . . . . . . . . . 29
3.4 Example view with genre and prefecture features . . . . . . . . . . . . . . . . 29
3.5 Overview of parameters in grid . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Values after parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Results of the gridsearch with 10-fold Cross Validation . . . . . . . . . . . . . 31
3.8 The confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9 Overview of used parameters for tour operator . . . . . . . . . . . . . . . . . 37
3.10 Results grid search tour operator . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.1 Overview of available attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2 Views per user and coupon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3 Catalog price, discount price, listing frequency, purchase frequency and display period per coupon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.4 P-values for Pearson correlation test coupon properties . . . . . . . . . . . . . 64
A.5 Pearson correlation coefficient for coupon properties . . . . . . . . . . . . . . 64
B.1 Overview of used features for Ponpare . . . . . . . . . . . . . . . . . . . . . . 65
C.1 Pairwise p-values for Ponpare dataset (listlength=5) . . . . . . . . . . . . . . 67
C.2 Pairwise p-values for tour operator dataset (listlength=5) . . . . . . . . . . . 67
Abstract
Modern customers in online retail demand personalized offerings, which increase customer satisfaction and revenue for e-retailers. Recommender systems fulfill the need for systems capable of delivering these personalized offerings.
Online retailers often observe a long tail in their revenue distribution. Current recommender systems focus mainly on accuracy or related metrics and therefore provide recommendations involving mostly popular items. One of the open problems in recommender systems is delivering a more balanced recommendation set that involves less popular items more often.
This research proposes boosted tree learning to deliver such balanced recommendations. The approach uses XGBoost, a popular implementation of boosted tree learning. The weights attached to data points are adapted in order to emphasize less popular items during model training.
Experiments are performed on two datasets from online retailers: a Japanese coupon website and a tour operator. The results are evaluated using evaluation metrics for general recommendation, as well as metrics specific to balanced item recommendation.
Experimental results show that the developed approach can successfully be used for recommendation in online retail. Furthermore, they show that delivering balanced item recommendations is possible, but comes at the cost of lower general predictive performance.
Acknowledgements
With the submission of this master's thesis, an exciting period comes to an end. During the
past couple of months I have gained many new insights. Also, this marks the end of my
time as a student. Throughout the years I have enjoyed my studies in Groningen and later
Eindhoven a lot.
Many people contributed to the exciting time during the thesis project. First of all, I had
the pleasure to be able to conduct the project at Building Blocks in Tilburg. I would like
to thank all the people at Building Blocks for the energetic and inspiring environment and
Sander for the great guidance during the project.
For me it has been a great pleasure to work together with my supervisor at Eindhoven
University of Technology, Yingqian Zhang. Yingqian, thank you for all the advice and
feedback in all stages of the project. I enjoyed the many discussions we had on Wednesdays
and I learned a lot throughout the process.
Finally, family and friends have always been very important to me throughout my entire
studies. I would like to thank them for their great support throughout this period.
1 Introduction
This project was carried out as a Master's thesis project for the Master's degree in Business Information Systems at Eindhoven University of Technology. The project was performed at Building Blocks B.V.
This report describes the application of gradient boosted tree learning for recommendation in the online retail domain. The main contribution is the development of an approach that achieves a more equal division of recommendations over the different items in the result set. To this end, an algorithm is developed and applied to two datasets: the Ponpare coupon purchase prediction contest dataset and the dataset of a large tour operator. The results are evaluated using different criteria for the evaluation of recommendation systems.
The Ponpare dataset is a publicly available dataset that has been used for a data science competition in the past; the tour operator dataset comes from one of the clients of Building Blocks. This chapter explains the problem context and the background of Building Blocks, followed by an introduction to the datasets used. This leads to the research questions and the methodology, both of which are briefly introduced in this chapter as well.
1.1 Building Blocks
Building Blocks is a data science consultancy firm. The company has customers in three
domains, namely: insurance, retail and travel. It provides these customers with accurate
consumer predictions in order to enhance their business.
The services Building Blocks offers its clients are built around three pillars: customer profiling, product profiling and a third pillar, optimization, which combines aspects of the first two. Examples of customer profiling services include loyalty estimation and customer segmentation, whereas product services include, for instance, product segmentation and complement estimation. Finally, planning, pricing, recommendation, assortment and promotions combine knowledge from both domains and therefore belong to the third pillar.
Building Blocks expects client demand for product recommendation to online customers in the future. Therefore, Building Blocks is looking to develop a recommender system for clients in online retail. Currently the company uses recommender systems for events, such as exhibitions and festivals, offering customers interesting programme elements, but it does not have a recommender system for the e-commerce domain.
Many of Building Blocks' clients have a revenue distribution with a so-called long tail: a small number of products accounts for a relatively large part of sales. However, existing recommender systems aim at overall prediction performance and mostly recommend the most popular products. This means a large part of the item catalogue is often not used in recommendations. Because the clients of Building Blocks want to incorporate these products in their recommendations as well, there is a need for more research into delivering more balanced recommendations.
1.2 Kaggle coupon purchase prediction
Kaggle was founded in 2010 as a platform for predictive modelling and analytics. Nowadays, it has a community of more than 536,000 users and regularly hosts competitions on predictive modelling and analytics. Furthermore, the portal contains discussion forums and space to share code, all to foster the sharing of knowledge.
The coupon purchase prediction contest on Kaggle asks participants to predict which coupons users of the website Ponpare will buy (Kaggle). Ponpare is Japan's biggest coupon website, offering coupons in the style of Groupon and many others. These coupons can include discounted yoga lessons, gourmet sushi or concerts. The goal of the competition was to predict 10 purchases for every user for the week following the period covered by the competition data.
As the coupon purchase prediction dataset contains many interesting features that could also be present in datasets of potential Building Blocks clients, this dataset was selected for this study. Furthermore, the knowledge base around the competition provides valuable insights into successful approaches. However, the competition only uses overall predictive accuracy as its criterion and does not intend to recommend less popular items. Therefore, the approaches from this competition will be used and adapted in order to make more balanced recommendations.
The competition took place from July until September 2015 and attracted 1076 teams that submitted one or more entries. The training data available to participants consists of one year of coupon purchase data.
1.3 Tour operator
The tour operator offers package holidays for the leisure holiday market in The Netherlands. Customers can browse the offerings and book their holiday on the tour operator's website. The tour operator offers a variety of destinations and types of accommodation, each with a different target audience. For instance, some accommodations cater to single travellers, while others are more suitable for families. Furthermore, the holiday offerings have different price ranges and are situated in different regions. The tour operator would like to make more personalized offerings to its customers and is therefore looking to develop a recommender system. As the tour operator experiences differences in popularity between accommodations, it is looking for ways to promote less popular accommodations.
1.4 Research questions
Following the problem context explained above, the main research question can be defined as follows:
Main research question: How to build a recommender system that provides bal-
anced item recommendations in online retail?
In order to answer the main research question, several research questions need to be answered.
Research question 1: Which criteria can be used to evaluate balanced item recommendation?
The first research question relates to the criteria that can be used to evaluate recommender
systems in general, as well as balanced item recommendation more specifically. This is
essential in evaluating the performance of the developed models.
Research question 2: How to adapt the existing algorithm such that it will deliver
balanced recommendations?
The second research question addresses which algorithms exist for recommendation, and their advantages and disadvantages. Furthermore, it explores the possibility of adapting these algorithms such that items from the long tail can be recommended.
Research question 3: How can the developed algorithm be applied in the online
retail industry by Building Blocks?
The third research question refers to the practical application of the algorithm. A data mining pipeline should be developed, as well as tools for Building Blocks to use the developed algorithm in practice.
1.5 Methodology
In order to build a recommendation system which is capable of delivering a balanced rec-
ommendation set, an appropriate model needs to be developed. In order to do this, it is
important to determine which criteria can be used to evaluate recommendation systems as
well as recommendation of long tail items. This is addressed in research question 1.
Secondly, an algorithm needs to be selected and adapted with respect to the research goal. A literature study is used to identify the most suitable algorithms for this task, as well as possible modifications to existing algorithms, and such a modification is proposed. This completes research question 2.
Thirdly, experiments are performed on two datasets from the online retail domain: the Ponpare dataset and the tour operator dataset. The Ponpare dataset concerns a coupon sales website; the tour operator sells package holidays. A data mining workflow is developed such that it can also be used by Building Blocks for other clients. This answers research question 3.
Finally, the results are evaluated using the established criteria, to answer the main research
question of this project.
1.6 Thesis outline
After this introduction, in which the problem and research questions have been presented, the remainder of this thesis is structured as follows: chapter 2 introduces recommender systems and the current state of research, including some open problems. This is followed by chapter 3, which discusses the pursued approach and the case studies. Chapter 4 elaborates on the results of both case studies. Finally, chapter 5 describes application in the online retail industry and chapter 6 contains the discussion and conclusions.
2 Literature survey
This section presents the results of the literature survey that has been carried out. The domain of online retail is addressed, as well as recommender systems, algorithms for recommendation, XGBoost and recommendation for the long tail.
2.1 Online retail
Online retailing is the sale of goods and services through the internet. With the rise of the
internet, e-retailing has become more and more popular among consumers. In 2016, 74% of
all customers in The Netherlands purchased services or goods online (Eurostat, 2017).
Important drivers of e-retail, compared to traditional retail, are reach and efficiency: online retailers are able to serve a larger geographic area at any time of the day, which allows them to operate more efficiently (Grefen, 2015).
Key to the success of online retailers is customer loyalty, also described as the intent to purchase repeatedly (Chiu et al., 2014). Modern customers demand personalized content in real time during their shopping experience. This requires systems that can tailor content in real time in order to deliver the customer the desired experience.
Therefore, the current biggest challenge in online retail from a customer journey perspective is dealing with the crisis of immediacy, defined by Parise et al. (2016) as "how to meet consumers' need to receive content, expertise, and personalized solutions in real time during their shopping experience". Customers demand the right information at the right place and the right time. Recommender systems fulfill the demand for systems that provide customers with personalized information at the right time, which is critical in the current e-retail landscape.
2.2 Recommender systems
The success of the world wide web has made a large amount of information available to anyone connected to the internet. However, because of this abundance of available information, it has become harder for users to select the right items. Appropriate systems are needed to support users in selecting relevant information from the internet.
Recommender systems can be used to help users navigate through a vast amount of items or documents in scenarios where users do not know exactly what they are looking for and hence do not want to formulate an explicit query. Instead of searching, this is approached as a browsing scenario, in which the user does not issue an explicit query expressing his or her information need but does want to be pointed in the right direction by the system (Baeza-Yates et al., 1999).
However, offering users a successful experience is not easy. When having to navigate through large numbers of items, users are subject to various phenomena such as choice stress. Moreover, research shows that choice stress can counteract the increased attractiveness of a large result set over a smaller one and hence reduce the choice satisfaction of the user (Bollen et al., 2010).
Recommender systems have existed for a long time and can be used in many scenarios to point the user in the right direction. One of the earliest projects using a recommender system was GroupLens, a recommender system for news articles (Resnick et al., 1994).
Over the years, recommender systems have become more popular and have spread to various domains. For instance, YouTube uses one to recommend videos to users (Davidson et al., 2010) and Google News uses a recommender system to recommend news items (Das et al., 2007), but many more examples exist.
Moreover, the use of recommender systems in online retail is not a novelty. Arguably the most famous, and one of the first, examples in this domain is the engine Amazon uses to recommend products to customers (Linden et al., 2003). However, many more companies active in e-commerce use recommender systems to recommend relevant products to their users.
Recommender systems generate business value and are used in a wide variety of business models. For instance, they can generate value by offering users a positive experience, but also by recommending items users would otherwise not have found and bought (Gomez-Uribe and Hunt, 2015). An e-retailer or e-marketplace can use a recommender system to recommend products; an e-integrator can use it to perform mass customization towards its customers (Grefen, 2015).
Recommender systems typically use knowledge from domains such as human-computer interaction and information retrieval, but more importantly use algorithms that can be considered data mining algorithms. Data mining is a set of techniques for turning large amounts of data into valuable information (Han et al., 2012).
2.3 Algorithms
Several techniques have been used in recommender systems over the past years. Shi et al. (2014) propose a categorization of recommendation techniques consisting of conventional collaborative filtering, which includes memory- and model-based approaches. These approaches use only the user-item matrix, as explained later. In addition, collaborative filtering using alternative sources uses other information, such as social network information, user-contributed information or interaction information.
Other authors, such as Bobadilla et al. (2013), categorize techniques into content-based filtering, demographic filtering, collaborative filtering and hybrid filtering. Here, content-based filtering includes algorithms utilizing item properties, demographic filtering utilizes user properties, collaborative filtering utilizes user-item interactions and hybrid filtering uses combinations of these three. In the remainder of this chapter, the basic collaborative filtering algorithm is explained and several machine learning approaches to the problem are discussed.
2.3.1 Conventional collaborative filtering
Collaborative filtering refers to making predictions about a user's preferences by collecting preferences from many users. It is based on the assumption that if users share a subset of similar preferences, they are likely to also have similar preferences for other (unseen) items. This method does not take any information about the items or users into account except for their history. Neighborhood-based algorithms make use of a matrix of N users and M items, whose cells contain r_ij, the rating user i has given item j. Ratings can be on a scale in the case of explicit ratings, but can also be binary, for instance in the case of visits or purchases. An example of a user-item matrix is given in Table 2.1; a question mark denotes an unknown rating.

Table 2.1: Example user-item matrix

         The Shawshank Redemption   The Godfather   The Dark Knight
Alice    3                          1               5
Bob      4                          ?               5
Carol    5                          3               3
Dave     3                          2               5
In conventional collaborative filtering, the predicted rating for a new user is determined as the weighted average of the ratings of neighboring users. The weight corresponds to a notion of similarity, for instance the Pearson coefficient or cosine similarity. This means that users who are more similar to the target user have a larger share in the final predicted rating. Equation 2.1 displays the predicted rating for user i and item j using conventional collaborative filtering. This is an example of user-based collaborative filtering, where similarity between users is calculated; item-based collaborative filtering instead makes use of similarity between items.

R_{ij} = \frac{1}{C} \sum_{k \in Z_i} \mathrm{sim}(i, k) \, R_{kj}    (2.1)

In this equation, Z_i is the set of k neighboring users of user i, sim(i, k) is the similarity between users i and k, and C is a normalizing constant. The similarity function could for instance be the cosine similarity of the rating vectors \vec{i} and \vec{k} of users i and k respectively:

\mathrm{sim}(i, k) = \cos(\vec{i}, \vec{k})    (2.2)
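As an illustration, Equations 2.1 and 2.2 can be applied to the example user-item matrix of Table 2.1 to predict Bob's missing rating. The sketch below is not part of the thesis's implementation; it assumes the constant C is chosen as the sum of similarities (so the prediction stays on the rating scale) and that cosine similarity is computed over co-rated items only.

```python
import numpy as np

# Ratings from Table 2.1 (rows: Alice, Bob, Carol, Dave; columns:
# The Shawshank Redemption, The Godfather, The Dark Knight).
# Bob's unknown rating is represented by np.nan.
R = np.array([
    [3.0, 1.0, 5.0],
    [4.0, np.nan, 5.0],
    [5.0, 3.0, 3.0],
    [3.0, 2.0, 5.0],
])

def cosine_sim(a, b):
    """Cosine similarity (Eq. 2.2), computed over co-rated items only."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[mask], b[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, i, j):
    """User-based prediction (Eq. 2.1): a similarity-weighted average of
    the neighbours' ratings for item j, with C the sum of similarities."""
    total, C = 0.0, 0.0
    for k in range(R.shape[0]):
        if k == i or np.isnan(R[k, j]):
            continue
        s = cosine_sim(R[i], R[k])
        total += s * R[k, j]
        C += s
    return total / C

print(round(predict(R, 1, 1), 2))  # Bob's predicted rating for The Godfather: 1.98
```

The prediction is pulled towards Carol's low rating less strongly than towards Alice's and Dave's, because Alice and Dave are slightly more similar to Bob on the co-rated items.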
Conventional collaborative filtering does not scale well, as it involves computing the similarity to many neighboring users or items. Model-based approaches instead use a model to predict a rating for a user, which resolves these issues. Equation 2.3 contains the general form of model-based collaborative filtering.

f(p_i, q_j) \mapsto R_{ij}, \quad i = 1, 2, \ldots, M, \; j = 1, 2, \ldots, N    (2.3)

In this equation, p_i and q_j are the model parameters for user i and item j, and f is a function which maps parameters to known ratings (Shi et al., 2014). This can be a model using matrix factorization or singular value decomposition (SVD), where the matrix is reduced to a model using mathematical techniques in order to make new predictions.
2.3.2 Machine learning algorithms
Model-based CF can be extended to incorporate not only ratings, but also user and item properties. In the following section, some machine learning techniques that can be used for recommendation are discussed. However, several more data mining techniques exist that could also serve the purpose of recommendation.
First of all, clustering refers to grouping users and/or items into clusters based on the user-item matrix and their properties. Several clustering techniques exist, such as k-means clustering. Sarwar et al. (2002) propose a clustering algorithm which splits all users into a number of clusters using bisecting k-means, after which, for a given user, the conventional collaborative filtering algorithm is used to determine the predicted rating. The authors report a decrease in accuracy but also a decrease in computational expense, as there is no need to compute similarities over the entire matrix. Hence, clustering can be used to overcome the scalability issues of conventional collaborative filtering.
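The clustering idea can be sketched as follows. This is an illustrative simplification, not the thesis's method: it uses plain k-means on a hypothetical toy matrix rather than bisecting k-means on real data, and predicts with the cluster mean instead of full within-cluster collaborative filtering.

```python
import numpy as np

# Toy user-item matrix: the first three users prefer item 0,
# the last three prefer item 1 (hypothetical ratings, 1-5).
R = np.array([
    [5, 1], [4, 2], [5, 2],
    [1, 5], [2, 4], [1, 4],
], dtype=float)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the rows of X. Bisecting k-means (Sarwar et al.,
    2002) instead applies 2-means recursively, but the idea of
    restricting the neighbourhood to a cluster is the same."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each user to the nearest cluster center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned users.
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels

labels = kmeans(R, k=2)
# Predict user 0's rating for item 1 from users in the same cluster only:
same_cluster = labels == labels[0]
print(round(R[same_cluster, 1].mean(), 2))  # → 1.67
```

Only the users in the same cluster contribute to the prediction, which is exactly where the computational saving over full-matrix similarity computation comes from.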
Secondly, it is possible to apply dimensionality reduction techniques. For instance, Singular Value Decomposition (SVD) can be used to factorize the user-item matrix. Sarwar et al. (2000) demonstrate that for an m by n matrix, SVD reduces the prediction computation to O(m + n), compared to O(m^2) for conventional collaborative filtering, while performing comparably. However, updating the SVD is expensive. This resolves some of the scalability issues of conventional collaborative filtering, but is not suitable for a context where frequent updates of the user-item matrix are desired.
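The factorization step can be sketched as follows with a truncated SVD on a toy rating matrix; the matrix values and the chosen rank are illustrative only.

```python
# Hedged sketch: rank-k SVD approximation of a user-item rating matrix,
# in the spirit of Sarwar et al. (2000). The toy matrix is invented.
import numpy as np

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 0.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                           # keep the k largest singular values
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank rating estimate
print(R_hat.round(2))
```

The low-rank matrix R_hat fills in unobserved entries, which is what makes SVD usable as a rating predictor; recomputing U, s and Vt after every matrix update is the expensive part noted above.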
Thirdly, neural networks are discussed. A neural network typically consists of several layers of nodes, interconnected in a networked structure. Because of this structure, neural networks are able to capture complex relationships, and they show promising results for recommendation. For instance, Wei et al. (2017) deployed them for recommending cold-start items on the Netflix dataset: the authors used a neural network to learn item features and applied these features in a collaborative filtering setting using SVD. They report a 5% lower RMSE compared to using only collaborative filtering with SVD.
Fourthly, factorization machines combine Support Vector Machines (SVMs) and factorization models, uniting the advantages of both (Rendle, 2010). Factorization machines are able to handle sparse data in linear time and also show promising results when combining several data sources in an e-commerce environment. Geuens (2015) demonstrates that factorization machines using interaction, user and item data as feature vectors outperform conventional collaborative filtering, which only uses interaction data. The author reports an increase in recall of more than 100% for small selection sizes in an e-commerce scenario. However, in that paper the algorithm has not been compared to algorithms other than collaborative filtering.
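The second-order factorization machine model of Rendle (2010) scores a feature vector as y(x) = w0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, where each feature i has a low-dimensional factor vector v_i. A minimal sketch, with all weights invented for illustration:

```python
# Hedged sketch of the factorization-machine prediction equation (Rendle, 2010).
# All parameter values below are made up for the demonstration.

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for feature vector x."""
    linear = sum(w[i] * x[i] for i in range(len(x)))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            # interaction weight is the dot product of the factor vectors
            dot = sum(V[i][f] * V[j][f] for f in range(len(V[i])))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

x = [1.0, 0.0, 1.0]                        # sparse, one-hot style input
w0, w = 0.5, [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]   # 2-dimensional factor vectors
print(fm_predict(x, w0, w, V))             # -> 2.9
```

Factorizing the pairwise weights is what lets the model estimate interactions between features that never co-occur in the sparse training data.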
Fifthly, Classification and Regression Trees (CART) are decision trees that can be used for classification and regression. Multiple trees can be combined in a random forest or through boosted tree learning. Random forests have, for instance, been successful in a contest on predicting purchase behaviour (Myklatun et al., 2015). For the recommender systems challenge 2015, the authors combined probabilistic modelling to construct features with a random forest algorithm on the interaction data. The data consisted of click sessions for users, and the challenge was to predict the purchase of items during a given session. The authors' approach was the best out of 850 teams.
2.4 The long tail
One of the most important business drivers for e-retailers is increased reach (Grefen, 2015),
as the ability to offer a larger product catalogue is one of the advantages of online retail
over physical stores. Revenue distributions in e-commerce therefore often have a long tail,
where most of the revenue is in a large amount of relatively unpopular products from the
long-tail. As there are not many data points present for these products, typically recom-
mender systems experience difficulties incorporating these products in recommendations.
Furthermore, typically recommender system focus only on general recommendation perfor-
mance and hence, do not take recommendation of less popular items into account. However,
there is a business need to be able to deliver more balanced recommendations and hence,
recommend tail items as well.
Several approaches exist to deliver more balanced recommendations. For instance, clustering has been proposed. Park and Tuzhilin (2008) split items into head and tail items and apply clustering to the tail to estimate ratings for tail items, before applying other algorithms to the entire item set. They report an improvement in Root Mean Squared Error (RMSE), but their algorithm does not scale well.
Also, Valcarce et al. (2016) promote tail items to reduce overstock by making use of relevance modelling. The task is inverted: instead of making a recommendation for each user, a recommendation is created for each item using similar items. Their approach obtains better results than neighborhood-based approaches on reference datasets for movie recommendation. The reason is that it is difficult to compute a neighborhood for tail items, as typically few data points exist for these items.
Steck (2011) conducts experiments with a matrix factorization approach optimizing recall on the Netflix movie dataset. The author reports outperforming a matrix factorization approach optimizing RMSE, as well as an approach using SVD and a best-seller list, in terms of recall. Moreover, after performing user experiments the author concludes that in a movie recommendation scenario only a small bias towards less popular items is appreciated by users. Finally, the author remarks that recommendation accuracy for tail items generally decreases, as variance and noise increase towards the long tail.
In conclusion, several approaches to delivering balanced item recommendations have been tried. However, finding appropriate algorithms to promote less popular items is still an open challenge. Furthermore, more research can be done on user behaviour, as the acceptance of biased recommendations has not yet been studied in online retail.
2.5 XGBoost
When inspecting the approaches pursued by participants in the Kaggle coupon purchase prediction contest, it becomes clear that XGBoost, an implementation of gradient tree boosting (Chen and Guestrin, 2016b), performs very well: it is used by two of the top-3 solutions in the contest. Furthermore, it has been used in many other machine learning contests. For instance, of the 29 competitions hosted on Kaggle in 2015, 17 winning solutions made use of XGBoost, and every team in the top 10 of the KDDcup 2015 used it. Additionally, it must be remarked that neural networks, as well as ensemble methods combining neural networks and XGBoost, also obtained good results in 10 competitions, but are not as popular as XGBoost.
However, all literature mentioning XGBoost focuses on predictive accuracy and/or scalability. For instance, the authors of XGBoost obtained excellent results using boosted trees on several benchmark datasets (Chen and Guestrin, 2016a). They report a processing time per tree four times faster than existing tree boosting implementations, and slightly improved performance in terms of AUC.
Furthermore, no existing work mentions using XGBoost to balance recommendations over items. As XGBoost obtains excellent results in the recommendation of coupons, this algorithm is chosen to be adapted in order to obtain balanced recommendations.
2.6 Conclusion
Modern customers in online retail demand personalized offerings, which increase customer satisfaction and revenue for e-retailers. Recommender systems fulfil the need for systems capable of delivering these personalized offerings. Recommender systems have a long history, and the domain can be classified along the steps in the recommendation process. Several classes of algorithms exist for recommendation. Collaborative filtering is one of the first, and several extensions to it exist. Extended model-based algorithms form the most promising class: here, several data mining techniques are adapted to work with user-item interaction data as well as user and item properties.
Revenue distributions in online retail often show a long tail, as reach is one of the important business drivers for online retail. However, recommender systems often focus on general predictive performance and are not able to incorporate less popular items in their recommendations. Hence, there is a need for systems capable of delivering recommendations balanced over the different items.
XGBoost shows excellent results in recommendation as well as in other data mining problems. However, no results are known of adapting XGBoost to produce a balanced recommendation set. This research will therefore investigate whether the XGBoost algorithm can be adapted to do so. It thereby addresses the need for appropriate algorithms specific to the e-commerce domain that deliver balanced item recommendations.
3 Methodology & case studies
In order to answer the research questions, experiments will be performed on the Ponpare and tour operator case studies. This chapter describes the research methodology and its application to both cases.
As discussed in the previous chapter, XGBoost has proven to perform well in this context in terms of accuracy-related metrics, but it has not been used for balanced item recommendation. In this research, XGBoost will therefore be adapted for balanced recommendation, and the behaviour of the algorithm will be evaluated with respect to criteria other than accuracy and related metrics alone.
3.1 Problem formulation and overview
The recommendation problem focuses on predicting, for every user i, a number n of relevant items j. A relevant item is denoted with 1 and a non-relevant item with 0; hence, the problem is a case of binary classification.
To arrive at a classification score for every user-item combination (i, j), each combination is assigned a score by a classification model, for which XGBoost is used. This yields a probability between 0 and 1 for every combination. Only the top n items with the highest score per user i are converted to 1; all other items for this user receive a 0. This leads to a recommendation list per user, which will be evaluated for all users. An overview of this process can be found in Figure 3.1.
Figure 3.1: Overview of recommendation process
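The top-n step of this process can be sketched as follows; the probabilities below are invented, whereas in the actual pipeline they would come from the trained XGBoost classifier.

```python
# Minimal sketch of the top-n selection in Figure 3.1: per user, the n
# highest-scoring items become 1 and everything else 0.
import heapq
from collections import defaultdict

def top_n_recommendations(scores, n):
    """scores: {(user, item): probability} -> {user: set of n recommended items}."""
    per_user = defaultdict(list)
    for (user, item), p in scores.items():
        per_user[user].append((p, item))
    # nlargest on (probability, item) pairs keeps the n best items per user
    return {u: {item for _, item in heapq.nlargest(n, pairs)}
            for u, pairs in per_user.items()}

scores = {("u1", "a"): 0.9, ("u1", "b"): 0.2, ("u1", "c"): 0.7,
          ("u2", "a"): 0.1, ("u2", "b"): 0.4}
print(top_n_recommendations(scores, 2))  # u1 -> {'a', 'c'}, u2 -> {'a', 'b'}
```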
The methodology is based on the CRISP-DM cycle (Chapman et al., 1999), slightly adapted for the context of this research. Data understanding involves various explorative visualizations and statistics to discover interesting aspects of the data; the results of this phase can be found in the first part of this chapter.
In order to adapt XGBoost for balanced item recommendation, different weights will be used during XGBoost training, resulting in different models. These models will be compared with respect to several evaluation criteria. Here, a high-level overview of the process is given; the different steps are explained in greater detail in the remainder of this chapter. Figure 3.2 displays an overview of the activities taking place after the data understanding phase. The first activity is data preparation and feature calculation. After computing the features, a hold-out set is taken, which will be used for final evaluation. A grid search using k-fold cross validation is performed on the training data to determine the best parameter set for XGBoost. The weights are calculated on the entire training set and then used to train the model with the parameter set resulting from the grid search. Finally, the resulting models are evaluated.
Figure 3.2: Methodology overview
3.2 Data understanding
Data exploration is done to get an overview of the available data and to gain preliminary insights on which to base the approach.
Firstly, the data structure is explained. Figure 3.3 contains an overview of the different data sources, and Table A.1 provides an overview of the available features in each source. The user table contains user features such as UserId, registration date, sex, residence area, and possibly a withdraw date in case a user has deregistered. The coupon table contains coupon information such as a description and genre; furthermore, price information, display duration, information regarding the validity of the coupon and the location of the shop are available.
Figure 3.3: An overview of the available data.
Views is a user-coupon combination enriched with additional information about the interaction, such as date, session ID and a flag denoting whether the interaction led to a purchase. Purchase is a user-coupon combination as well, containing additional information such as the number of items purchased, purchase date and area. Finally, some information about the location of the shop offering the coupon is available in the table 'Coupon Area'.
Unique coupons In order to get an impression of purchase frequency for users and coupons, we start with the purchases table and join it with the coupon properties in the coupon list. Firstly, all transactions are grouped on coupon ID to aggregate the number of purchases per coupon.
Secondly, coupon listings are counted. After examining the different coupon features, the combination (discount price, catalog price, capsule text, genre) is used to group similar coupons that have been listed multiple times; this combination is denoted as a unique coupon. This reduces the initial 19,368 coupons to 10,803 unique coupons that have been listed one or more times. Furthermore, the display period of a unique coupon can be obtained by adding the display periods for the different periods in which the coupon was listed.
Thirdly, the revenue per unique coupon can be obtained. This is done by multiplying the number of purchases by the discount price, which is the price a coupon has been sold for.
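The aggregation just described can be sketched as follows; the records and field names are illustrative stand-ins for the actual purchase table.

```python
# Hedged sketch of the unique-coupon aggregation: purchases are grouped on
# the key (discount price, catalog price, capsule text, genre) and purchase
# counts and revenue are derived. The example records are invented.
from collections import defaultdict

purchases = [
    {"key": (1000, 2000, "Dinner", "Food"), "items": 2},
    {"key": (1000, 2000, "Dinner", "Food"), "items": 1},
    {"key": (3000, 6000, "Spa", "Relaxation"), "items": 1},
]

stats = defaultdict(lambda: {"purchases": 0, "revenue": 0})
for p in purchases:
    stats[p["key"]]["purchases"] += p["items"]
    # revenue = number of purchases * discount price (the selling price)
    stats[p["key"]]["revenue"] += p["items"] * p["key"][0]

print(dict(stats))
```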
Coupon listings and purchases Table A.3 contains an overview of catalog price, discount price, listing frequency, purchase frequency and display period per coupon. The typical coupon is listed once, and the median number of purchases per coupon is 6. This indicates that there is typically little purchasing interaction per coupon. Moreover, the median number of days a coupon is displayed (listed) on the Ponpare website is 4. As the coupon offering changes continuously and coupons are typically replaced fewer than three times and do not return afterwards, the model should learn relationships based on coupon properties instead of static products.
Figure 3.4: Views and purchases — (a) for all coupons; (b) for coupon 1076
Views Figure 3.6 contains a visual representation of Pearson correlation scores for several coupon properties and the number of views and purchases. Furthermore, Table A.5 contains the scores and Table A.4 the corresponding p-values. A positive score depicts a positive correlation and a negative score a negative correlation.
Figure 3.5: Views and purchases per genre
Here it can be seen that a correlation exists between the display period and the number of purchases and views, which can be explained by the fact that more visibility leads to more purchases. Furthermore, the catalog price is correlated with the price rate and the discount price, which means that a higher catalog price in general implies a higher discount.
It can be derived from Table A.2 that the median number of views per user is 32 and the median number of views per coupon is 43. Hence, although view counts are higher than purchase counts, there are typically not many view data points available per user and coupon: a large number of coupons and users have only a few views.
Additionally, although coupons are typically viewed more often than purchased, it is important to remark that views appear to follow the same pattern as purchases. This can clearly be seen in Figure 3.4a, where the views and purchases for all coupons are shown, and in Figure 3.4b, where an example coupon is displayed. Figure 3.6 confirms that a high number of views correlates with a high number of purchases. Finally, this relation can also be seen when combining user-coupon views: typically, a user views a coupon a number of times before making a purchase. The same phenomenon applies to genre and location: a high number of views in a specific genre denotes a high probability of purchasing an item from this genre, and a high number of views in a specific area denotes a high probability of the user making a purchase in this area as well.
Figure 3.7a displays the number of purchases for all coupons, sorted by their respective number of purchases. It is important to note that a small number of coupons contributes most of the total revenue. The distribution of views, shown in Figure 3.7b, follows a similar shape.
Figure 3.6: Pearson correlation for coupon properties
Figure 3.7: Purchases and views per coupon ID — (a) purchases per coupon; (b) views per coupon
Coupon properties The following section visualizes the distribution of view and (derived) purchase attributes over the entire coupon set, as well as a smaller subset for visualization purposes. This step is executed to gain insight into combinations of different properties. Figure 3.8 displays the catalog price and discount price for the entire set of coupons, showing that coupons have a similar discount rate over the entire spectrum.
Figure 3.8: Catalog price and discount price
Figure 3.9 displays the catalog price and the number of sales for all coupons; it can be concluded that coupons with a lower price attract more sales in general. Figure 3.10 shows the display period, which is the number of days a coupon is visible on the Ponpare website, against the number of sales. Here it becomes apparent that, in general, more visibility correlates with more sales. A similar effect can be observed when comparing visibility to revenue. Moreover, similar figures can be obtained when comparing listings to both revenue and sales, as the number of times a coupon has been listed determines the number of days it is visible. However, it is important to remark that the display period is logically not available at the time of recommending and will therefore not be used in prediction.
User properties In order to explore differences in purchasing behaviour, different user properties are compared with coupon properties. Figure 3.11a shows the number of purchases for different age groups per sex and genre. Figure 3.11b shows the number of purchases per sex and genre over time (per week). These figures only display the top 2 most popular genres; the remainder can be found in Figures A.1a and A.1b. It becomes clear, for instance, that the hotel genre is more popular among older customers and the food genre is relatively popular among male customers. Many other similar relationships can be discovered.
Figure 3.9: Catalog price and number of purchases
Figure 3.10: Display period and number of purchases
When taking a closer look at genre popularity per individual user, an interesting phenomenon can be observed as well. All user purchases are grouped by genre, and for every user the most popular genre is taken. The number of purchases in this genre is then divided by the user's total purchases. This gives a mean value of 0.68, which means that on average, users make 68% of their purchases in their favourite genre. When users with 10 or fewer purchases are excluded, the mean declines to 54%. For the entire population, purchases are spread much more evenly over the genres, with the most popular genre receiving 30% of all purchases.
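The favourite-genre statistic can be computed as sketched below; the purchase lists are invented for illustration.

```python
# Hedged sketch of the favourite-genre share: for each user, the fraction of
# purchases in their most-bought genre. The toy purchase log is invented.
from collections import Counter

user_purchases = {
    "u1": ["Food", "Food", "Hotel", "Food"],   # 3 of 4 in the favourite genre
    "u2": ["Hotel", "Leisure"],                # 1 of 2 in the favourite genre
}

shares = {}
for user, genres in user_purchases.items():
    counts = Counter(genres)
    # share of purchases in the single most popular genre for this user
    shares[user] = counts.most_common(1)[0][1] / len(genres)

mean_share = sum(shares.values()) / len(shares)
print(shares, mean_share)  # {'u1': 0.75, 'u2': 0.5} 0.625
```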
In conclusion, user behaviour differs across age and sex groups. Moreover, users show a preference for certain item genres. This should be taken into account by properly encoding features.
Figure 3.11: Number of purchases per genre — (a) per age group; (b) per week
3.3 Data preparation and feature calculation
The data preparation step of the CRISP-DM cycle describes the phase where the data is
preprocessed such that the information can be used for predictive modelling.
Firstly, as Ponpare is a Japanese coupon website, all descriptions, genre information and information about the prefecture (place) of users and coupons need to be translated from Japanese to English. This is done with a dictionary available in the Kaggle repository: a CSV file containing Japanese-to-English translations for the area names, prefectures, genres and descriptions present in the dataset. The dictionary is used to replace the values in the respective columns with the corresponding translation.
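The replacement step amounts to a simple lookup, as sketched below; the mapping stands in for the CSV dictionary shipped with the Kaggle data, and the entries are illustrative.

```python
# Hedged sketch of the dictionary-based translation step. The mapping below
# is an invented stand-in for the Japanese-to-English CSV dictionary.
jp_to_en = {"グルメ": "Food", "東京都": "Tokyo"}  # illustrative entries

records = [{"genre": "グルメ", "pref": "東京都"}]
for rec in records:
    for col in ("genre", "pref"):
        rec[col] = jp_to_en.get(rec[col], rec[col])  # leave unknown values as-is

print(records)  # [{'genre': 'Food', 'pref': 'Tokyo'}]
```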
Secondly, the dataset spans exactly one year. Therefore, it was opted to convert each date to both a day and a week number, counting from the first day in the dataset. These values can then be used to split the dataset and compute features based on periods, as will be explained later.
Thirdly, all coupon, purchase and user IDs are stored in the database as hash values. Therefore, in all tables containing these hashes, they are converted to integer values according to the same mapping for coupon, purchase and user IDs, respectively.
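A consistent hash-to-integer conversion can be sketched as follows; the hash strings are invented placeholders.

```python
# Hedged sketch of the hash-to-integer conversion: every distinct hash maps
# to the same integer wherever it occurs. The example IDs are invented.

def build_id_map(hashes):
    """Assign consecutive integers to hashes in first-seen order."""
    mapping = {}
    for h in hashes:
        mapping.setdefault(h, len(mapping))
    return mapping

user_hashes = ["d9c0a7b3", "a1f44e02", "d9c0a7b3"]
id_map = build_id_map(user_hashes)
encoded = [id_map[h] for h in user_hashes]
print(encoded)  # [0, 1, 0]
```

Applying the same mapping across all tables is what keeps a user's rows joinable after the conversion.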
Fourthly, it is important to remark that XGBoost is capable of handling missing data by giving every split a default direction. For every split, the total gain of directing all missing values to the left is calculated and compared with the gain of directing them to the right, which yields the default direction. Therefore, no rows with missing values for individual features have been removed. However, a small fraction of views could not be linked to coupon properties, because the coupon did not occur in the coupon table. Out of the original 2,833,180 views, 315,974 have been removed, leaving 2,517,206 views.
Finally, it is important to note that a user-coupon combination often has multiple views in the database, as can be seen in Table 3.1. All views for a user-coupon combination are reduced to a single row: in case of a purchase, the date of the purchase is taken as the date for the record; otherwise, the date of the last view is used. Duplicate user-coupon pairs are removed. In the example, the purchase record (flag 1, shown in bold in the original) is maintained and all other records are removed.

Table 3.1: Views for example user and coupon

PURCHASE FLG  I DATE           COUPON ID  USER ID
0             1-7-2011 17:07   402        22598
0             1-7-2011 22:51   402        22598
1             1-7-2011 22:52   402        22598
0             1-7-2011 22:54   402        22598

Several steps are undertaken to encode features appropriately for use by XGBoost. The first part of this section therefore focuses on simple encoding of static features, whereas the second part introduces calculated and combined features as well as features derived from interaction data.
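The deduplication rule can be sketched as follows; the rows mirror Table 3.1, and the field names are illustrative.

```python
# Hedged sketch of the view deduplication: all rows for a (user, coupon)
# pair collapse to one row - the purchase row if present, else the last view.

def deduplicate(views):
    kept = {}
    for row in views:  # rows are assumed to be in chronological order
        key = (row["user_id"], row["coupon_id"])
        prev = kept.get(key)
        # purchase rows, once stored, are never replaced by later views
        if prev is None or prev["purchase_flg"] == 0:
            kept[key] = row
    return list(kept.values())

views = [
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 0, "date": "1-7-2011 17:07"},
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 1, "date": "1-7-2011 22:52"},
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 0, "date": "1-7-2011 22:54"},
]
print(deduplicate(views))  # keeps the purchase row of 22:52
```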
The time of viewing yields several features for every user-coupon combination in the dataset. First of all, the moment at which the interaction took place yields a weekday and a time of day; both are encoded as categorical variables using one-hot encoding, with the time of day in 4 bins. Moreover, for every user, the registration date and the time of viewing yield a value representing the number of days the user has been registered.
Secondly, XGBoost is only capable of handling numerical data, so categorical data should be encoded using one-hot encoding. The genre of a coupon is encoded this way, as is the sex of a user. Age is left unaltered, as are the price, discount price and price rate of an item.
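The one-hot step can be sketched as follows; the category list and feature row are illustrative, not the full sets from the data.

```python
# Minimal one-hot sketch for categorical columns such as genre and sex.
# The genre list below is a shortened, illustrative example.

def one_hot(value, categories):
    """Return a 0/1 indicator list over a fixed category order."""
    return [1 if value == c else 0 for c in categories]

genres = ["Food", "Hotel", "Leisure"]
row = {"genre": "Hotel", "age": 34, "discount_price": 1000}
# numeric columns (age, prices) are passed through unaltered
features = one_hot(row["genre"], genres) + [row["age"], row["discount_price"]]
print(features)  # [0, 1, 0, 34, 1000]
```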
Users as well as coupons have a location in the dataset. For every combination of item genre, item location and user location: if the genre indicates that an item can be shipped (genres other, gift, lesson or delivery), and hence location does not matter, the attribute is set to 1, otherwise to 0. Furthermore, if the user location-item location pair occurs in the previous purchase log for goods that cannot be shipped, the attribute is also set to 1.
Finally, the features derived from interaction data are explained; here, interaction data refers to historic view data. The dataset spans 168,996 purchases from 01-07-2011 until 23-6-2012, and coupons are generally only active for a limited number of days. It is important to remark that, because the data spans only one year of transactions, purchasing behaviour is assumed not to change over time and hence no concept drift takes place. This does not hold in practice, but given the scope of this research the assumption is followed. Figure 3.12 displays a schematic overview of the addition of features based on interaction data.
Figure 3.12: Schematic overview of use of interaction data
Moreover, because no predefined application scenario is available, a few assumptions have to be made about the moment of recommendation and, therefore, about which interaction data is available at prediction time. The decision has been made to base the construction of features on 12 periods of one month each. The first month is only used to construct interaction-related features for month 2; from month 2 up to the last month, the time-dependent features are computed over the previous month. This avoids using data which, in a realistic scenario, would not be available for prediction due to processing delay, or because it would lie in the future.
The first example of a feature using interaction data stems from the observation that similar coupons are sometimes placed on the website again, as explained in Section 3.2. Therefore, to get an impression of user interest in a coupon, similar coupons are looked up by coupon key (discount price, catalog price, capsule text, genre), and the number of purchases for this key in the previous month is added as a feature.
Moreover, user preference over time is captured in two ways: for prefecture and for genre. This is done by grouping the visits by user, month and genre, as can be seen in Table 3.2. The visits are counted and divided by the total number of visits made by the user in that month, leading to a probability for every genre per month, denoted as prob g in the table. These values are added to records for this user-genre combination in the following month. Hence, a relatively high number of visits for a genre in the previous month by a user leads to a high score for this feature. In a similar manner, a score is computed for prefecture (location). These features are added to the views for the following month, as can be seen in Table 3.4. Finally, an overview of all used features can be found in Table B.1.
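The genre-preference feature can be computed as sketched below; the visit log reproduces the 18 visits of the user in Table 3.2 in an invented, simplified form.

```python
# Hedged sketch of the genre-preference feature (prob_g): per user and month,
# visits per genre divided by the user's total visits that month.
from collections import Counter, defaultdict

# invented visit log: 18 visits of user u1 in month 1, as in Table 3.2
visits = [("u1", 1, "Food")] * 12 + [("u1", 1, "Leisure")] * 3 \
       + [("u1", 1, "Delivery")] * 2 + [("u1", 1, "Relaxation")]

counts = Counter(visits)                  # (user, month, genre) -> visit count
totals = defaultdict(int)
for (user, month, _), n in counts.items():
    totals[(user, month)] += n            # total visits per user and month

prob_g = {k: n / totals[(k[0], k[1])] for k, n in counts.items()}
print(round(prob_g[("u1", 1, "Food")], 6))  # 0.666667
```

The resulting probabilities would then be joined onto the views of the following month, as in Table 3.4; the prefecture feature prob p follows the same pattern with genre replaced by prefecture.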
Table 3.2: Example views for constructing genre feature

USER ID  genre       month  visits  totalvisits  prob g
1        Delivery    1      2       18           0.111111
1        Food        1      12      18           0.666667
1        Leisure     1      3       18           0.166667
1        Relaxation  1      1       18           0.055556

Table 3.3: Example views for constructing prefecture feature

USER ID  pref      month  visits  totalvisits  prob p
1        Gunma     1      1       18           0.055556
1        Kanagawa  1      1       18           0.055556
1        Saitama   1      1       18           0.055556
1        Tokyo     1      15      18           0.833333

Table 3.4: Example view with genre and prefecture features

USER ID  month  genre     pref   prob g    prob p
1        2      Delivery  Tokyo  0.111111  0.833333
1        2      Hotel     Tokyo  0         0.833333
1        2      Hotel     Tokyo  0         0.833333
1        2      Delivery  Tokyo  0.111111  0.833333
1        2      Other     Tokyo  0         0.833333
3.4 Hold-out
As explained in the remainder of this section, the XGBoost model will be trained using different weighting schemes. Furthermore, XGBoost has several parameters, as shown in Table 3.5. Therefore, both the XGBoost parameters and the most appropriate weighting scheme need to be determined reliably.
K-fold cross validation is often referred to as the gold standard: the data is split into k folds, and in each of the k iterations a different fold is used as evaluation set. The average error over the k folds is then taken as an estimator of the true error. This gives a more reliable estimate because the variance resulting from dividing the data into train and test sets is reduced.
In general, a grid search trains the model with different parameter sets on the data; the resulting models are then compared with respect to defined performance criteria in order to choose the best parameters. However, tuning the parameters and the weights cannot be combined in a single grid search, because it is not possible to incorporate instance weights in the SKlearn grid search. Therefore, nested cross validation would be the most appropriate alternative: for each of the k folds dividing the data into train and test sets, determine the best parameter set using cross validation with a grid search, train the model using the different weights, and continue to the next fold.
However, even for a limited grid with three possible values for each of two parameters and k = 10, this would already amount to 3 * 3 * 10 = 90 model trainings per fold, and therefore 900 fits in total. Because this is not feasible due to time constraints, a different combination of validation strategies is chosen: first, a significant proportion of the data is held out by splitting on day. All views before day 250 end up in the training set, which consists of 969,109 views; the hold-out consists of views on day 250 or later and includes 515,208 views. This hold-out is used as the final validation set for the different weight strategies, and the training set is used to determine parameters with k-fold cross validation in a grid search.
Because the data is highly unbalanced, the training set is balanced using undersampling: out of the 905,000 negative samples in the training set, 63,400 are randomly sampled, so that the final training set consists of 63,400 positive and 63,400 negative samples, a 1:1 ratio.
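The undersampling step can be sketched as follows; the sample sizes here are small illustrative stand-ins for the actual 63,400 positives and 905,000 negatives.

```python
# Hedged sketch of the undersampling step: all positives are kept and an
# equally sized random subset of negatives is drawn for a 1:1 class ratio.
import random

random.seed(42)  # fixed seed so the sketch is reproducible
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(100)]

sampled_neg = random.sample(negatives, len(positives))  # without replacement
training_set = positives + sampled_neg
print(len(training_set))  # 10
```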
3.5 Grid search
In order to determine the parameters, a grid search is performed. The procedure is as follows: the training data is divided into k = 10 folds. For every fold, the kth fold functions as validation set and the remainder as training data. For every possible combination of parameters a model is fit and the ROC-AUC is calculated. The mean AUC score over all folds gives an estimate of the performance for the different parameter settings, and the best setting is chosen.
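The procedure above can be sketched in a self-contained way; the scoring stub below stands in for fitting XGBoost on a fold and computing ROC-AUC, and is deliberately rigged so that the combination found best in Table 3.7 wins.

```python
# Hedged sketch of grid search with k-fold cross validation. The score
# function is a stub replacing the real fit-and-evaluate step.
import itertools

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(n, k, grid, score):
    """Return the parameter combination with the best mean fold score."""
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        mean = sum(score(params, fold) for fold in kfold_indices(n, k)) / k
        if best is None or mean > best[1]:
            best = (params, mean)
    return best

grid = {"eta": [0.09, 0.12, 0.15], "max_depth": [5, 6, 7]}
# Stub: pretend eta=0.09, max_depth=7 generalises best (cf. Table 3.7).
stub = lambda p, fold: -abs(p["eta"] - 0.09) - abs(p["max_depth"] - 7)
best_params, _ = grid_search(1000, 10, grid, stub)
print(best_params)  # {'eta': 0.09, 'max_depth': 7}
```

In the actual experiments, the score function would train an XGBoost model on the nine training folds and return the AUC on the held-out fold.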
Table 3.5 contains an overview of the parameters that can be tuned for XGBoost. The number of trees refers to the number of trees to grow, eta to the learning rate η, and max depth to the maximum depth of a tree. Usually the number of trees is chosen between 100 and 1000 and held fixed so that the best values for the other parameters can be found; in this case it is set to 200. The learning rate and maximum depth define the model complexity and also control overfitting. Here, it was opted to do a grid search with depths (5, 6, 7) and eta (0.09, 0.12, 0.15), based on recommended values in the documentation.
Subsample refers to row sampling: setting it to 0.8 means that every single tree is grown on 80% of the data. Colsample bytree refers to column sampling, with a value of 0.8 indicating that every tree is grown on 80% of the columns. Both can also be used to control overfitting, as they reduce fitting in every iteration by including only a fraction of the rows and columns, respectively. In this case they are left fixed at values from the XGBoost documentation, but they could also be tuned, for instance by taking them into account in the grid search.

Table 3.5: Overview of parameters in grid

name              description       value/range in grid search
n estimators      number of trees   200
eta               learning rate     (0.09, 0.12, 0.15)
subsample         row sampling      0.8
colsample bytree  column sampling   0.8
max depth         max tree depth    (5, 6, 7)
min child weight  min leaf weight   1

The results of the grid search with 10-fold cross validation are shown in Table 3.7. As the configuration with a learning rate of 0.09 and a depth of 7 obtains the highest mean AUC over the 10 folds, this configuration is chosen for training the model on the entire training set. Table 3.6 contains an overview of the objective and parameters used.
Table 3.6: Values after parameter tuning

Name             | Description       | Value
objective        | the learning task | binary:logistic
eval_metric      | evaluation metric | auc
n_estimators     | number of trees   | 200
eta              | learning rate     | 0.09
subsample        | row sampling      | 0.8
colsample_bytree | column sampling   | 0.8
max_depth        | max tree depth    | 7
min_child_weight | min leaf weight   | 1
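The tuned configuration can also be written down as a parameter dictionary. The names below follow the scikit-learn-style XGBoost API, in which eta is exposed as learning_rate; this is a sketch of the configuration, not the thesis code itself.

```python
# Configuration of Table 3.6 as an XGBoost-style parameter dictionary
# (sklearn-API parameter names assumed).
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "n_estimators": 200,
    "learning_rate": 0.09,   # eta
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "max_depth": 7,
    "min_child_weight": 1,
}
```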
Table 3.7: Results of the grid search with 10-fold cross validation

learning rate | max depth | n_estimators | rank | µ AUC  | σ
0.09          | 7         | 200          | 1    | 0.7356 | 0.0042
0.12          | 7         | 200          | 2    | 0.7352 | 0.0050
0.15          | 7         | 200          | 3    | 0.7343 | 0.0048
0.12          | 6         | 200          | 4    | 0.7343 | 0.0045
0.15          | 6         | 200          | 5    | 0.7343 | 0.0043
0.15          | 5         | 200          | 6    | 0.7336 | 0.0048
0.09          | 6         | 200          | 7    | 0.7331 | 0.0042
0.12          | 5         | 200          | 8    | 0.7318 | 0.0049
0.09          | 5         | 200          | 9    | 0.7301 | 0.0049
3.6 Model training
XGBoost is a popular implementation of gradient tree boosting, as described by Chen and
Guestrin (2016b). It builds on the gradient boosting algorithm originally developed by
Friedman (2002), altered slightly for use in XGBoost. As is the case with all boosted tree
learning, XGBoost uses an ensemble of trees; in the case of XGBoost, these are CART trees.
An overview of how the CART trees are grown looks as follows:
for b rounds:
1. Grow the tree greedily to the maximum depth, according to the objective:
(a) until depth d is reached:
i. find the best splitting point
ii. assign values to the two new leaves
2. Prune the tree to remove nodes with negative gain
Every tree is grown to maximum depth by iteratively adding splits until the maximum
depth d is reached. Splits are added by sorting all values for every feature and calculating
the gain for every split point. Then, the best splits per feature are compared and the best
split is chosen.
The split gain is derived from the objective, which consists of a loss function and a
regularization function. The loss function penalizes prediction error; the regularization
penalizes complex trees. The objective at round t looks as follows:

\sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + C \qquad (3.1)
Here, L denotes the loss function, which in this case is logistic loss, and Ω is a function
penalizing complex trees, which reduces overfitting. The first and second derivatives of the
loss function are taken for a Taylor approximation. Furthermore, the objective function is
transformed to loss per leaf instead of per data point. This leads to the following equation
for the gain:
\mathrm{Gain} = \frac{1}{2}\left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad (3.2)
Here, I is the set of all indices of data points assigned to the node, and I_L and I_R are
the sets of data points assigned to the candidate new left and right child nodes. g_i and h_i
denote the gradient and Hessian of the objective respectively, which depend on the chosen
loss function. The intuition is that a split has a higher gain if it contributes more to the
objective, which is minimizing loss.
After deciding on the best split point, the leaf value o can also be easily obtained:

o_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \qquad (3.3)
Finally, the tree is pruned backwards to remove splits with a negative gain. Moreover, for
each node, all points with a missing value are sent in the direction which yields the largest
total gain. This gives every node a default direction for handling missing values.
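As a sketch of how Equations 3.2 and 3.3 are evaluated for a single candidate split, the gain and leaf-value computations could look as follows. The function names are illustrative and not part of XGBoost.

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node I into I_L / I_R (Equation 3.2)."""
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)

    def score(gs, hs):
        # (sum of gradients)^2 / (sum of Hessians + lambda)
        return gs.sum() ** 2 / (hs.sum() + lam)

    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h)) - gamma

def leaf_value(g, h, lam=1.0):
    """Optimal leaf output o for the points in a leaf (Equation 3.3)."""
    return -np.sum(g) / (np.sum(h) + lam)
```

In the real algorithm these quantities are evaluated for every sorted split candidate of every feature, and the split with the largest gain is kept.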
As explained earlier, the problem at hand is a case of binary classification, where 1 denotes
a purchase and 0 denotes no purchase. Therefore, the XGBoost objective is set to binary
classification, which is a logistic regression where the output of the model denotes a
probability, according to the documentation. After every iteration the validation data is
used to calculate the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC),
based on a threshold value of 0.5. If the ROC-AUC has not improved after 10 iterations, the
algorithm is stopped and the best performing ensemble is selected as the final model.
3.7 Calculate weights
Tree boosting produces multiple classification or regression trees in sequence, leading to
an ensemble of trees as the final model. In instance weighting, the initial weights are
adjusted to emphasize certain instances in the growing of the trees. The weights travel
along with the instances for the duration of the algorithm and will therefore influence the
different trees that together form the final ensemble.
The weighted loss function for XGBoost looks as follows (Chen and Guestrin, 2016b):

\sum_{i=1}^{n} \frac{1}{2} w_i \big( f_t(x_i) - g_i / w_i \big)^2 + \Omega(f_t) + \text{constant} \qquad (3.4)
In this function w_i represents the weight attached to an instance. In the original algorithm
the weight is equal for every data point. In the case of weighted training, however, the rows
influence the gain according to their weight. Therefore, for all candidate splits the gain
will be weighted, leading to different split points.
Ting (2002) describes that weights can be successfully applied to grow cost-sensitive trees
in a multiclass classification scenario. A similar intuition is followed here: less popular
items are emphasized by changing their weights. The expectation is that if these items are
emphasized during the growing of trees, the resulting trees will yield a higher score for
similar less popular items and hence recommend more items from the long tail, which would
contribute to achieving the research goal.
In this case w(i) represents the weight for record i, v_i the number of views for item i,
p_i the number of purchases for item i, and n the total number of records.
Taking 1 divided by the count of views or purchases, a low frequency in the training set
corresponds to a high weight and vice versa. This provides the first two weight definitions,
as can be seen in Equations 3.5 and 3.6:

w(i) = \frac{1}{v_i} \quad (3.5) \qquad\qquad w(i) = \frac{1}{p_i} \quad (3.6)
Moreover, weights can also be defined inspired by the inverse document frequency, as is
popular in information retrieval. In this case the document frequency is replaced by the
number of views or purchases, which results in the weights in Equations 3.7 and 3.8. Finally,
the differences can be amplified by squaring the counts, as displayed in Equations 3.9
and 3.10.

w(i) = \log(n / v_i) \quad (3.7) \qquad\qquad w(i) = \log(n / p_i) \quad (3.8)

w(i) = \frac{1}{v_i^2} \quad (3.9) \qquad\qquad w(i) = \frac{1}{p_i^2} \quad (3.10)
After applying a weighting scheme, the weights are scaled such that the total of all weights
equals the number of records. This is done by counting all records and taking the sum of all
weights, which yields a scaling factor that is then applied to all weights.
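The weighting schemes of Equations 3.5–3.10 and the subsequent rescaling can be sketched as follows. The function name and the scheme labels are illustrative, not part of the thesis code.

```python
import numpy as np

def instance_weights(counts, scheme="inv"):
    """Per-record weights from item view/purchase counts (Eqs. 3.5-3.10).

    scheme: 'inv' -> 1/c, 'idf' -> log(n/c), 'inv2' -> 1/c^2
    """
    c = np.asarray(counts, dtype=float)
    n = len(c)
    if scheme == "inv":
        w = 1.0 / c
    elif scheme == "idf":
        w = np.log(n / c)
    elif scheme == "inv2":
        w = 1.0 / c ** 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # rescale so that the weights sum to the number of records (Section 3.7)
    return w * n / w.sum()
```

The resulting array can then be passed as per-instance weights when training, so that low-frequency items receive more emphasis during tree growing.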
3.8 Evaluation
Gunawardana and Shani (2009) indicate that in scenarios where the cutoff value n is not
clear, performance can be calculated over a range of values for n. In this case n is chosen
from 1 to 10 and the list performances are calculated over this range. Furthermore, this is
done for head items, tail items and all items. The items are split into head and tail subsets
by applying a split at 50% of sales volume.
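The 50%-of-volume split can be sketched as follows; the helper name is hypothetical and the items are ordered by popularity so that the head is the smallest top set covering at least half of the purchases.

```python
def head_tail_split(purchases):
    """Split items into head and tail at 50% of total sales volume.

    purchases: dict mapping item id -> number of purchases.
    """
    total = sum(purchases.values())
    head, covered = [], 0
    for item, count in sorted(purchases.items(), key=lambda kv: -kv[1]):
        if covered >= total / 2:
            break
        head.append(item)
        covered += count
    tail = [item for item in purchases if item not in head]
    return head, tail

head, tail = head_tail_split({"a": 6, "b": 2, "c": 1, "d": 1})
```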
The first metric to be computed is catalogue coverage, which is nothing more than the
number of unique items present in the recommendation lists L of length j, I_L^j, of all
users U, as a fraction of all items I. In case the catalogue consists of 5 items and only
3 items are present in the recommendation lists of all users, the catalogue coverage equals
3/5. The coverage of tail items is considered most important, as the coverage of head items
is expected to be high regardless.

\mathrm{coverage}_c = \frac{\left| \bigcup_{j=1,\dots,n} I_L^j \right|}{|I|} \qquad (3.11)
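Equation 3.11 translates directly into code; a minimal sketch with an illustrative helper name, reproducing the 5-item example from the text:

```python
def catalogue_coverage(recommendation_lists, catalogue):
    """Fraction of catalogue items appearing in any user's list (Eq. 3.11)."""
    recommended = set()
    for rec_list in recommendation_lists:
        recommended.update(rec_list)
    return len(recommended & set(catalogue)) / len(catalogue)

# 3 distinct items recommended out of a 5-item catalogue -> 3/5
cov = catalogue_coverage([["a", "b"], ["b", "c"]], ["a", "b", "c", "d", "e"])
```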
Second is the Gini index. Shani and Gunawardana (2011) use the Gini index to measure the
concentration of recommendations over different items. Equation 3.12 contains the Gini
index. The Gini index is 1 in case of maximum inequality, which occurs when one item
receives all recommendations. It becomes 0 in case of maximum equality, when all items get
the same recommendation frequency. Hence, a lower Gini index represents a more equal
distribution of recommendations.

G = \frac{\sum_{i=1}^{I} (2i - I - 1)\, z_i}{(I - 1) \sum_{i=1}^{I} z_i} \qquad (3.12)
In this case I is the number of items in the result set, i the index of an item and z_i the
number of recommendations per item. After taking all items in the user predictions, the
positive predictions are grouped by coupon id, which leads to a list of coupon ids and
counts. As the Gini index requires a sorted list, the item list is first sorted on number of
recommendations, after which the Gini index is calculated according to the equation.
Thirdly, several metrics are computed for conventional classification performance. For every
list length n, a confusion matrix is calculated. The true positives are the intersection of
predicted positives and actual positives, the false positives the intersection of predicted
positives and actual negatives, the false negatives the intersection of predicted negatives
and actual positives, and the true negatives the intersection of predicted negatives and
actual negatives. These counts are used to compute precision, recall and the F1 score.
Table 3.8: The confusion matrix

                | predicted positive | predicted negative
actual positive | TP                 | FN
actual negative | FP                 | TN
These scores for different list lengths n are used to construct precision-recall curves.
This is done for all items, as well as for the head and tail subsets into which the items
are split.
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3.13)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3.14)

F1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (3.15)
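Computed from sets of predicted and actual positives, Equations 3.13–3.15 look as follows (illustrative helper, with the usual guards against empty denominators):

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 (Equations 3.13-3.15) from sets of
    predicted and actual positives."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1
```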
Finally, it is important to assess the significance of the obtained results. McNemar's test
considers two classifiers, f_a and f_b. For every example in the test set, it is recorded
whether it is misclassified by both f_a and f_b, by f_a only (n_{01}), by f_b only (n_{10}),
or by neither. After classifying all records into a category it is easy to compute χ², as can
be seen in Equation 3.16:

\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} \qquad (3.16)
The χ² value can then be used to determine p-values using a chi-squared distribution with
one degree of freedom. It is important to note that this test gives significance scores for
pairwise comparisons; therefore, the results should only be used to compare two models, or
should be corrected for multiple comparisons.
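Equation 3.16 and the corresponding p-value can be sketched with the standard library, using the identity that the survival function of a chi-squared distribution with one degree of freedom equals erfc(√(x/2)). The function name is illustrative.

```python
import math

def mcnemar_p_value(n01, n10):
    """McNemar's test (Equation 3.16): p-value for the difference
    between two classifiers, with the continuity correction."""
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # survival function of chi-squared(1): P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))
```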
3.9 Case study: tour operator
Apart from the Ponpare data introduced earlier, the algorithm is also tested on another
data set, which concerns purchase and view data of a tour operator. This section describes
the business context of this case study, some data characteristics and the features used.
The data concerns transaction data from a tour operator, which offers package holidays for
the leisure holiday market in The Netherlands. Customers can browse the offerings on the
tour operator's website and book their holiday there.
The tour operator offers a variety of destinations and types of accommodation, each with a
different target audience. For instance, some accommodations cater to single travellers,
while others are more suitable for families. Furthermore, the holiday offerings have
different price ranges and are situated in different regions. The objective in this case is
to recommend users suitable holiday products, taking these characteristics into account.
3.9.1 Data understanding
The data consists of accommodation and user combinations. Item properties consist of
AccommodationCountry, AccommodationCity, StarRating, Childfriendly, Only-adult and
Average-rating. User properties consist of InfantCount, AdultCount, ChildCount and Age
of the main booker.
Figure 3.13 contains an overview of the age distributions of bookings per country. In this
case only the three most important countries are listed, with a further separation for
accommodations where no children are allowed. The figure makes clear that accommodations
where no children are allowed are popular among customers in the age bin 60-70. Figure 3.14
displays the number of bookings versus the average customer rating and the star category.
Upon inspection it appears that most accommodations are 3, 4 or 5 star accommodations. Most
of the customer ratings are within the range between 7 and 9, and the four most popular
accommodations have a rating above 8.
Figure 3.13: Age distribution of bookings
3.9.2 Data preparation and feature calculation
Unlike in the Ponpare scenario, the data contains at most one record per user-item
combination, so it is not necessary to aggregate on user-item. The records contain some
properties derived from the user side and some properties that stem from the item side,
which can be used as predictors. Furthermore, the records contain a BookedIndicator, which
is 1 in case of a booking and 0 otherwise.
In this case we decide to construct no features denoting preference, as this data is not
available, but simply encode the data for use by XGBoost.
Firstly, the accommodation rating is converted from a categorical to a decimal value. A 3*
rating is converted to 3.0, a 3+ rating to 3.5 and so on. Furthermore, dummies are created
for AccommodationCity, ChildFriendly and ‘Only-adult’.
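The rating conversion can be sketched as follows; the helper name is hypothetical:

```python
def star_rating_to_decimal(rating):
    """Convert a categorical star rating to a decimal value:
    '3*' -> 3.0, '3+' -> 3.5, '4*' -> 4.0, and so on."""
    base = float(rating[:-1])
    return base + 0.5 if rating.endswith("+") else base
```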
The data is split such that 66% of the data ends up in the training set and 33% in the
hold-out set. The split ensures that each accommodation ends up in either the training or
the validation data. This yields a training set of 217 420 records and a hold-out set of
108 710 records. Because the data is unbalanced, undersampling is applied to the training
data, yielding 17 471 positive instances and 17 741 negatives in the final training set.
3.9.3 Grid search
On the training part, an initial parameter set is determined using stratified k-fold cross
validation with k = 10. The grid search is performed with the same grid values for the
parameters as in the Ponpare case. The results of the grid search with 10-fold cross
validation can be found in Table 3.10.
(a) versus customer rating (b) versus star rating
Figure 3.14: Number of bookings per accommodation
Table 3.9: Overview of used parameters for tour operator

Name             | Description     | Value
n_estimators     | number of trees | 200
eta              | learning rate   | 0.15
subsample        | row sampling    | 0.8
colsample_bytree | column sampling | 0.8
max_depth        | max tree depth  | 7
min_child_weight | min leaf weight | 1
Table 3.10: Results grid search tour operator

learning rate | max depth | n_estimators | rank | µ AUC   | σ
0.15          | 7         | 200          | 1    | 0.86031 | 0.00560
0.12          | 7         | 200          | 2    | 0.85655 | 0.00539
0.15          | 6         | 200          | 3    | 0.85475 | 0.00487
0.09          | 7         | 200          | 4    | 0.85197 | 0.00549
0.12          | 6         | 200          | 5    | 0.85066 | 0.00428
0.15          | 5         | 200          | 6    | 0.84698 | 0.00575
0.09          | 6         | 200          | 7    | 0.84509 | 0.00572
0.12          | 5         | 200          | 8    | 0.84273 | 0.00527
0.09          | 5         | 200          | 9    | 0.83663 | 0.00548
4 Results
This chapter describes the results obtained on both the Ponpare dataset and the tour
operator dataset. First, the random selection baseline is explained, after which results
are presented for both datasets. Finally, the insights that can be drawn from these results
are discussed.
4.1 Random selection algorithm
In order to compare the approaches against a baseline, a random selection algorithm is used.
The approach counts the positive instances in the test set and assigns the same number of
positives to random indices in the test set, drawn from a uniform distribution, leaving all
other records negative. This keeps the positive rate the same as in the actual test set.
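A sketch of this baseline; the function name is illustrative:

```python
import numpy as np

def random_selection(y_true, seed=0):
    """Random baseline: place as many positive predictions as the test set
    contains actual positives, at uniformly drawn indices."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    prediction = np.zeros(len(y_true), dtype=int)
    positives = rng.choice(len(y_true), size=int(y_true.sum()), replace=False)
    prediction[positives] = 1
    return prediction
```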
4.2 Ponpare
The distribution of coupons in the hold-out set, which can be seen in Figure 4.1, is examined
first. 4989 coupons appear in the hold-out set, which are purchased 35005 times altogether.
A split is made at 50% of all purchases, which results in a head of 381 items and a tail of
4608 items; these will from now on be referred to as ‘head’ and ‘tail’ items.

Figure 4.1: Purchases per coupon in test set

Figure 4.2 displays the precision-recall curve for all items in the Ponpare dataset. It
becomes evident that all weighting schemes outperform a random selection. However, all
models are close to each other, including the variant with equal instance weights, which is
labeled as ‘noweights’.
Figure 4.2: Precision/Recall for all items in the Ponpare dataset
Figure 4.3: Precision/Recall for head items in the Ponpare dataset
Figure 4.2, Figure 4.3 and Figure 4.4 contain the precision-recall curves for all items, the
head items and the tail items in the Ponpare dataset, respectively.
For the head and tail subsets, it becomes clear that performance in terms of precision and
recall is much worse for the tail than for the head subset.
Additionally, it is interesting to see that for the entire dataset, ‘idflike-views’ (3.7),
‘idflike-purchases’ (3.8) and ‘noweights’ are all close to each other and perform best. The
four remaining models are also close to each other and form the group of worst performing
models, with ‘inv-purchase-count2’ (3.10) and ‘inv-view-count2’ (3.9) performing the worst
for the largest part of the plot.

Figure 4.4: Precision/Recall for tail items in the Ponpare dataset
Finally, it must be remarked that the ordering of the models for all items and for the head
items subset seems similar. However, it is remarkable that for head items some of the worst
performing models perform worse than the random selection algorithm.
Figure 4.5: Coverage for head items in the Ponpare dataset
Figures 4.5, 4.6 and 4.7 show the coverage of the head, tail and all items in the Ponpare
set, respectively. In the different figures, it can be seen that the numbers of unique items
do not differ much between the models; however, the coverage is always higher for random
selection than for the other models.

Figure 4.6: Coverage for tail items in the Ponpare dataset

Figure 4.7: Coverage for all items in the Ponpare dataset

Figure 4.8 shows the Gini-index for different lengths of recommendation lists. Once again,
the different models are close to each other. Furthermore,
it is important to note that a random selection obviously yields the most equal distribution.
However, it is also important to see that some of the different models clearly show a lower
Gini-index than ‘noweights’. Finally, it is important to note that ‘inv purchase count2’(3.10)
shows the lowest Gini-index of all models and that the models get closer as the list length
increases.
Figures 4.9, 4.10 and 4.11 show the F1 scores for different list lengths on the Ponpare set.
Here it can be seen that the F1 score is rather low in the beginning, as a lot of positives
are missed, but improves for longer lists as the recall increases faster than the precision
decreases.

Figure 4.8: Gini-index for different list lengths in the Ponpare dataset

Figure 4.9: F1 score for different list lengths in the Ponpare dataset

Figure 4.12 shows the Gini index and F1 score for Ponpare, which confirms that, in general,
a lower Gini-index corresponds to a lower F1 score. In this case, the lines run from right
to left, starting by raising the Gini index while maintaining a roughly equal F1 score, and
moving towards a lower Gini index and F1 score towards the end of the lists.
Finally, Table C.1 shows the p-values for pairwise comparisons of the algorithms. When
comparing the constructed models against the default scenario, ‘noweights’, it becomes clear
that the difference is not significant for ‘idflike-purchases’ and ‘idflike-views’. For all
other algorithms, however, the difference is significant.

Figure 4.10: F1 score for different list lengths in the Ponpare dataset

Figure 4.11: F1 score for different list lengths in the Ponpare dataset
4.3 Tour operator
In the remainder of this section the results for the tour operator are discussed. The
distribution of purchases over coupons in the test set can be seen in Figure 4.13. When
this plot is compared to the distribution of coupons in the Ponpare set, it becomes clear
that this distribution is less equal. In this case, the validation set consists of 3217
accommodations, with a total of 8659 bookings. The validation part is split in two parts
according to bookings, which means that the head of 117 accommodations is responsible for
the first half of all bookings and the remaining 3100 accommodations are responsible for
the second half.

Figure 4.12: Gini-index and F1 score in the Ponpare dataset

Figure 4.13: Purchases per coupon in test set
Figure 4.14 displays the precision-recall curve for all items in the tour operator dataset.
It can be seen that random selection performs worst for all items. Three models clearly
perform best, with similar performance: ‘noweights’, ‘idflike-views’ (3.7) and
‘idflike-purchases’ (3.8). The models ‘inv-view-count’ (3.5) and ‘inv-purchase-count’ (3.6)
lag behind this group of three. The worst performing models are ‘inv-purchase-count2’ (3.10)
and ‘inv-view-count2’ (3.9).
Figure 4.15 displays the performance of the different models on the head items. Here, it can
be seen that for short list lengths, the random selection outperforms the different models.
For longer lengths, however, the models outperform the random selection. Figure 4.16 shows
the precision-recall curve for the tail items. Here it is clear that all models except the
squared models perform very similarly, and better than the squared models and the random
selection, which performs worst.
It is remarkable that for both the head and the tail items, the models as well as the random
selection obtain a rather high recall. If even a random selection achieves a high recall,
the number of items per user in the test set could be less than 10 in many cases, leading
to almost exclusively positive predictions and hence a high recall score.
Figures 4.19, 4.17 and 4.18 contain the coverage for different recommendation list lengths.
Figure 4.14: Precision/Recall for all items in the tour operator dataset
Here, it is clear that the coverage for head items is 100% and that for tail items the random
selection has the highest coverage.
Figure 4.20 shows the Gini-index for all models on the tour operator dataset. Here, it is
clear that the random selection obtains the lowest Gini-index, as the resulting item
recommendation set is the most equal. The remaining models are relatively close; however,
‘inv-purchase-count2’ (3.10), which performed worst of all models, obtains the lowest
Gini-index of those models, although for higher list lengths it becomes as unequal as the
other models. Figures 4.21, 4.22 and 4.23 show the F1 scores for the head subset, tail
subset and all items.
Figure 4.24 displays the Gini index and the F1 score for the different models. This shows
that a lower Gini index in general corresponds to a lower F1 score. Furthermore, it is
remarkable that ‘inv-view-count2’ in this scenario follows a pattern that is partly similar
to that of the random model; hence, this model performs particularly poorly in terms of
precision and recall.

Figure 4.15: Precision/Recall for head items in the tour operator dataset

Figure 4.16: Precision/Recall for tail items in the tour operator dataset
Finally, Table C.2 shows the p-values for pairwise comparisons of the algorithms. Here too,
it is clear that ‘noweights’ differs significantly from all models except ‘idflike-purchases’
and ‘idflike-views’.
Figure 4.17: Coverage for head items in the tour operator dataset
Figure 4.18: Coverage for tail items in the tour operator dataset
4.4 Summary of insights
It is clear that all models perform very similarly to each other. In order to investigate
the reason behind this, it is important to recall the XGBoost algorithm and where the
weights impact the learned models. At every iteration, a tree is formed, which decides upon
the best splits by calculating a weighted gain according to the loss function, which in this
case is logistic loss. In the case of altered weights, the gain per possible split is a
weighted average: errors on cases with a high weight are penalized more heavily. Hence,
single trees are steered towards performing well on cases with a higher weight.
Figure 4.19: Coverage for all items in the tour operator dataset
Figure 4.20: Gini-index for different list lengths in the tour operator dataset
When evaluating the generated trees, AUC is used, which is an unweighted metric for
classification error. The similarity of all models might thus be explained by the fact that
different weights lead to different trees, while the evaluation function in XGBoost is
unweighted and hence steers the ensemble in the direction of general prediction quality.
Furthermore, the relatively small difference with random selection can most likely be
explained by two reasons. First, in the case of Ponpare only a few features were
constructed, and in the case of the tour operator dataset no features were constructed at
all; it is evident that trees benefit from constructing more features. Second, in the tour
operator test set, users have a mean of 10.98 viewed items per user. Hence, for some users
items will be classified as positive independent of score, which also explains the good
performance of random selection. This could be solved by adding more user-item combinations.

Figure 4.21: F1 score for different list lengths in the tour operator dataset

Figure 4.22: F1 score for different list lengths in the tour operator dataset
When comparing the performance of the different models in terms of precision and recall, a
few things become clear as well. ‘idflike-purchases’, ‘idflike-views’ and ‘noweights’ are
among the best performing models in terms of precision and recall in both cases.
Furthermore, it appears that a less equal division of weights typically means worse
performance in terms of precision-recall. This was expected, as the evaluation data is
biased towards purchases of popular (head) items. Therefore, it could be interesting to
research the proposed weights in a scenario involving users, to see if the altered
recommendation system leads to different user behaviour.

Figure 4.23: F1 score for different list lengths in the tour operator dataset

Figure 4.24: Gini-index and F1 score in the tour operator dataset
Both weights that square the counts, namely ‘inv-view-count2’ and ‘inv-purchase-count2’,
perform consistently among the worst models. This could indicate that setting extreme
weights does not enable the single trees to learn relevant relationships for purchases and
hence leads to worse precision and recall. This can be caused by the fact that further down
the tail more noise and variance exist, which makes it difficult to extract meaningful
splits.
When comparing precision-recall and the Gini-index for the different models, it appears
possible to use weights to obtain a more equal recommendation set. However, better
performance in terms of precision and recall means a less equal distribution (a higher
Gini-index). This can be seen in the results for both datasets.
Concluding, the results show that this approach makes it possible to deliver a more
balanced recommendation set, although this typically comes at the cost of lower precision
and recall. Furthermore, the constructed models consistently outperform a random selection
in terms of precision and recall.
5 Implementation
This chapter discusses the usage of the developed approach in the online retail industry.
First, an overview of the technologies used in the implementation is given; second, the
benefits for retailers are discussed.
5.1 Used software
For the developed pipeline only open source software has been used.
• Python Python is a general-purpose programming language with automatic memory
management and support for both functional and object-oriented programming. Python is an
open source project, ships with a large standard library and supports many third-party
packages. The entire project pipeline runs on Python 3.6.
• Pandas Pandas is a Python library containing many functionalities for working with
labeled datasets, such as statistical operations, data operations like joining (merging)
tables, handling missing data, and group-by, pivoting and reshaping functionalities
(McKinney, 2011). This project uses the Pandas library for many data operations, as well as
some preprocessing steps.
• Jupyter Notebooks Jupyter Notebooks are documents containing both code, for instance
Python, and rich text. This means a notebook contains the analysis as code snippets as well
as the results by means of visualizations, tables, etc. The Jupyter Notebook app has been
used for all parts of this project.
• SKlearn Scikit-learn is a library containing tools for data mining and data analysis.
It contains many classification and regression algorithms, as well as tools for model
selection such as grid search, cross validation and evaluation modules. This project uses
sklearn for grid search using k-fold cross validation, and for some of its evaluation
modules.
• Numpy Numpy is a package for scientific computing in Python and contains, amongst
others, an N-dimensional array object and linear algebra functions. Pandas makes use of
Numpy arrays. This project uses Numpy through Pandas, but also uses some Numpy functions
for elementwise operations on arrays.
• Scipy Scipy is a package for mathematics, science and engineering. In this project
Scipy is used for instance to calculate the Pearson correlation coefficient.
• Matplotlib Matplotlib is a plotting package for Python. Matplotlib makes it possible
to display for instance bar charts, histograms, scatterplots etc. All visualizations in
this project are made using Matplotlib. For some plots the package Seaborn is used
as well.
• XGBoost XGBoost is an implementation of gradient tree boosting and is discussed
earlier.
• KNIME KNIME is an open source data analytics, reporting and integration platform.
In this project it has been used for some statistical analysis in the data exploration
phase.
The usage of standard open-source components means the developed solution is easy to
implement, as only an environment capable of running Python is needed. Furthermore, because
only open-source software is used, no license costs are involved.
The developed pipeline can easily be adapted for new clients of Building Blocks, in a
similar manner as has been done for the tour operator. The model tuning capabilities, as
well as training with different weights and evaluation, can be used directly by executing
the existing Python script. However, data preprocessing steps are specific to the case at
hand and should therefore be changed for new datasets. Furthermore, manual analysis of the
evaluation results is needed to select the most appropriate weighting scheme.
5.2 Quality aspects
• Flexibility The developed approach can be used in many different scenarios where
recommendation of items to a user is needed. The proposed approach can be used for
personalized offerings through many different channels, such as e-mail marketing,
advertisement or product recommendation in a webshop; many more examples are possible in
scenarios where personalized product advice is needed. Once the connection with the data is
set up, only the preprocessing steps need to be altered.
Secondly, the developed pipeline can be used by many different retailers. Different
products in different domains mean different product properties and hence, different
consumer behaviour. Therefore, the model selection and parameter tuning capabilities
of the developed pipeline are important to make sure the most appropriate model is
selected for different retailers.
Thirdly, with small adjustments the developed work enables retail clients to incorporate
objectives other than popularity. For instance, margin can be incorporated in the
prediction to recommend more items with a higher margin, and many other objectives
can be incorporated in a similar manner.
• Scalability The packages and algorithms used can easily be run on large (virtual)
machines for large numbers of users. The only requirement is a container or virtual
machine able to run Python and XGBoost. The current configuration uses a virtual
machine with a 4-core processor and 8 GB of memory. This is important for Building
Blocks and its clients.
Summarizing, it becomes possible to make relevant personalized offerings to large groups
of customers on different channels. As discussed in chapter 2, this is essential in order to
attract customers and be successful in the current online retail industry.
6 Conclusion and discussion
This chapter discusses the implications of this research for academia and practice, describes
its limitations, and concludes the research.
6.1 Research questions
The main research question is recalled:
Main research question: How to build a recommender system that provides bal-
anced item recommendations in online retail?
In order to answer the main research question, all research questions will be addressed first.
Research question 1: Which criteria can be used to evaluate balanced item rec-
ommendation?
Recommending products to users is a binary classification problem. Chapter 3 describes var-
ious accuracy-related metrics that are used to evaluate recommender system performance.
The classification metrics that have been used include precision, recall and the F1 measure.
Apart from general recommender system performance, several metrics are used to evalu-
ate balanced item recommendation. Firstly, the items in the result set are split into subsets
based on the number of purchases, so that performance can be compared across these subsets.
For instance, precision, recall and F1 score, as well as coverage, are computed separately for
head and tail items. Furthermore, the distribution of recommendations over all products is
evaluated by computing the Gini-index over the set of recommendations.
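The Gini-index computation over a set of recommendations can be sketched as follows; this is a minimal illustration using a standard textbook formulation (the function name is hypothetical, and this is not necessarily the exact implementation used in this project):

```python
def gini_index(recommendation_counts):
    """Gini-index over how often each item was recommended.

    Returns 0.0 when every item is recommended equally often and
    approaches 1.0 when recommendations concentrate on a few items.
    """
    counts = sorted(recommendation_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    # Standard formulation via the rank-weighted sum of sorted counts.
    rank_weighted = sum(i * c for i, c in enumerate(counts, start=1))
    return (2.0 * rank_weighted) / (n * total) - (n + 1.0) / n

print(gini_index([10, 10, 10, 10]))          # 0.0 (perfectly equal)
print(round(gini_index([0, 0, 0, 40]), 2))   # 0.75 (highly concentrated)
```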
Research question 2: How to adapt the existing algorithm such that it will deliver
balanced recommendations?
This research proposes using XGBoost for recommendation, as described in chapter 3. The
approach involves altering weights connected to data points in order to deliver a more
balanced recommendation set. This approach is tested on two case studies, namely a
coupon website and a tour operator.
The results show that XGBoost can be used for recommendation in online retail and that
the approach outperforms a random selection algorithm in terms of precision and recall.
Furthermore, the approach of applying different instance-weighting methods turns out to be
successful in delivering a more equal division of recommendations over the item set. However,
in line with other research, delivering a more balanced recommendation set comes at the cost
of lower precision and recall.
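As an illustration of the instance-weighting idea (a hypothetical sketch, not the project's code), weights inversely proportional to an item's occurrence count can be built as follows:

```python
from collections import Counter

def inverse_count_weights(item_ids, power=1.0):
    """Weight each training instance inversely to the popularity
    (occurrence count) of its item, so that tail items carry more
    influence during boosted-tree training."""
    counts = Counter(item_ids)
    return [1.0 / (counts[item] ** power) for item in item_ids]

# Item 'a' occurs three times, 'b' once: each 'b' instance ends up
# weighted three times as heavily as each 'a' instance.
weights = inverse_count_weights(["a", "a", "a", "b"])
```

The resulting per-instance weights can then be passed to XGBoost, for example via the `weight` argument of `xgboost.DMatrix` or the `sample_weight` argument of its scikit-learn wrapper's `fit` method.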
Research question 3: How can the developed algorithm be applied in the online
retail industry by Building Blocks?
The developed algorithm can be applied using Python and various open source Python
packages, as discussed in chapter 5. The developed solution pipeline is flexible and scalable.
To conclude with the main research question, this thesis presents a feasible approach in which
boosted trees with instance weighting are applied to recommend items from the long tail.
The research shows that boosted trees can be used for recommendation, and that adapting
instance weights results in a more equal division of recommendations over the different
items.
6.2 Implications for research
First of all, this project shows that gradient boosted trees can successfully be used for rec-
ommender systems in online retail. The recommendation system outperforms a random
selection; however, more benchmarks are needed to see how it compares to other algo-
rithms.
The second important implication for research is that instance weighting in boosted tree
learning indeed yields a more equal recommendation set compared to not applying any
instance weighting. This is important because a more equal recommendation set means
more exposure for less popular items. This result can clearly be seen on both the tour
operator and the Ponpare datasets. It is an important addition, because this approach has
not been used before to address balanced item recommendation.
The third implication is that better performance in terms of precision and recall comes with
a less equal distribution of recommendations over the items. This strongly suggests that
choosing a model yielding a more equal set of recommendations means sacrificing perfor-
mance in terms of precision and recall. Hence, for applications where an equal distribution
of recommendations is desired, concessions need to be made in terms of precision and recall.
6.3 Managerial implications
As discussed in chapter 2, customer loyalty is key to the success of online retailers, and
modern customers demand tailored content in real time to fulfill their needs in their online
shopping experience.
The presented system is able to give personalized recommendations in real time for a large
number of users and items and is hence scalable. This is important to Building Blocks,
as many of their online retailing clients maintain a large catalogue of items and serve a
large number of customers, and previous approaches did not scale well.
Furthermore, the presented approach can be applied in a variety of domains. The underlying
user behaviour differs between, for instance, purchasing coupons and booking a package
holiday. Since the clients of Building Blocks come from various domains, it is important that
the presented approach can also be applied to other domains in order to serve potential
future clients as well.
Finally, using a recommender system that takes into account the performance of tail items
enables retail clients to dispose of excess inventory and thereby improve their bottom-line
performance.
In order to do so, it is possible to construct weights based on extended cost models, which
give a better representation of the costs involved in recommending items. For instance,
inventory costs, as well as the cost of other items not being purchased, might be taken into
account to obtain a more appropriate estimate of costs and hence a more appropriate
product mix in the recommendations of the e-retailer.
6.4 Future research
This approach could also be extended to take into account other forms of costs, such as
inventory cost, or to take margin into account in a similar fashion as has now been done
with purchases and views. Hence, research could be done to determine whether the
proposed approach can be extended to address other business requirements as well.
Additionally, more research should be done into the relationship between general recommen-
dation performance and equality. Experiments can be done both within the e-commerce
domain and in other domains to determine this relationship.
Furthermore, it is recommended to compare the performance of the developed models
against algorithms other than the random selection used in this research. Existing collabo-
rative filtering approaches, such as collaborative filtering combined with clustering, as well
as other machine learning methods, could serve as benchmarks. For instance, random forests
also allow for weighted learning. This could give insight into how the developed method
relates to other algorithms, both in general recommendation performance and in delivering
equal recommendations.
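As a sketch of such a benchmark (illustrative only, with synthetic data; not an experiment from this thesis), scikit-learn's random forest accepts per-instance weights in the same spirit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # synthetic user-item features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic purchase labels

# Up-weight one class, analogous to the instance weighting applied
# to the boosted trees in this thesis.
weights = np.where(y == 1, 2.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
predictions = clf.predict(X[:5])
```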
Moreover, a custom evaluation function could be used in the XGBoost algorithm, so that
intermediate models during training are evaluated not only on general predictive
performance, but also on how balanced their recommendations are. This could lead to
different final models.
Finally, it is recommended to test the algorithm live, involving users. This could give insight
into both the business impact and the user experience of the proposed approaches. For
instance, it could be verified whether the proposed approach indeed leads to a more equal
division of sales over all items, which would allow the company to reduce overstock.
Moreover, it could give insight into how users perceive the proposed changes.
6.5 Limitations
Firstly, the initial parameters are tuned using an accuracy-related metric, AUC, whereas
different measures, such as the Gini-index, are used for the final evaluation. This is an impor-
tant limitation, as the best parameters for equal recommendations could differ from the
best parameters under AUC. Currently, the XGBoost wrapper class used with scikit-learn's
grid search does not support this, and it would therefore need to be extended.
Secondly, the composition of the train and test sets for the tour operator makes evaluating
recommendations as a ranked list difficult, as the mean number of items per user (10.98)
is only slightly higher than the maximum list length of 10. This means that towards the
end of the list there are often few items to choose from, and hence the different algorithms
give very similar results. The number of possible items per user in the Ponpare set, on the
other hand, is considerably larger, at 34 items per user.
Adding random combinations to the tour operator dataset was not feasible, as this might
encourage the model to learn relationships based on the underlying generating algorithm
instead of user behaviour. Therefore, the recommendation is to also test these algorithms
on a dataset with more user-item combinations.
References
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, and Others. Modern information retrieval,
volume 463. ACM press New York, 1999.
J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez. Recommender systems survey.
Knowledge-Based Systems, 46:109–132, 2013. ISSN 09507051. doi: 10.1016/j.knosys.
2013.03.012. URL http://dx.doi.org/10.1016/j.knosys.2013.03.012.
Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. Understanding
Choice Overload in Recommender Systems. In Proceedings of the Fourth ACM Conference
on Recommender Systems, RecSys ’10, pages 63–70, New York, NY, USA, 2010. ACM.
ISBN 978-1-60558-906-0. doi: 10.1145/1864708.1864724. URL http://doi.acm.org/10.
1145/1864708.1864724.
Pete Chapman, Julian Clinton, Thomas Khabaza, Thomas Reinartz, and Rüdiger Wirth.
The CRISP-DM Process Model. The CRISP-DM Consortium, 310(C), 1999.
Tianqi Chen and Carlos Guestrin. XGBoost: Reliable Large-scale Tree Boosting System.
arXiv, pages 1–6, 2016a. ISSN 0146-4833. doi: 10.1145/2939672.2939785.
Tianqi Chen and Carlos Guestrin. XGBoost. Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages
785–794, 2016b. doi: 10.1145/2939672.2939785. URL http://dl.acm.org/citation.
cfm?doid=2939672.2939785.
Chao Min Chiu, Eric T G Wang, Yu Hui Fang, and Hsin Yi Huang. Understanding cus-
tomers’ repeat purchase intentions in B2C e-commerce: The roles of utilitarian value,
hedonic value and perceived risk. Information Systems Journal, 24(1):85–114, 2014. ISSN
13501917. doi: 10.1111/j.1365-2575.2012.00407.x.
AS Das, M Datar, A Garg, and S Rajaram. Google news personalization: scalable online
collaborative filtering. Proceedings of the 16th international conference on, pages 271–280,
2007. ISSN 1595936548. doi: 10.1145/1242572.1242610. URL http://portal.acm.org/
citation.cfm?id=1242610.
James Davidson, Blake Livingston, Dasarathi Sampath, Benjamin Liebald, Junning Liu,
Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, and Mike Lambert.
The YouTube video recommendation system. Proceedings of the fourth ACM conference
on Recommender systems - RecSys ’10, page 293, 2010. ISSN 1605589063. doi: 10.1145/
1864708.1864770.
Eurostat. Digital economy and society statistics - households and individuals,
2017. URL http://ec.europa.eu/eurostat/statistics-explained/index.php/Digital_economy_and_society_statistics_-_households_and_individuals.
Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Anal-
ysis, 38(4):367–378, 2002.
Stijn Geuens. Factorization Machines for Hybrid Recommendation Systems Based on Be-
havioral , Product , and Customer Data. In the 2015 ACM conference on Recommender
systems, RecSys 2015, number Umr 9221, pages 379–382, 2015. ISBN 9781450336925.
doi: 10.1145/2792838.2796542.
Carlos A. Gomez-Uribe and Neil Hunt. The Netflix Recommender System. ACM Trans-
actions on Management Information Systems, 6(4):1–19, 2015. ISSN 2158656X. doi:
10.1145/2843948. URL http://dl.acm.org/citation.cfm?id=2869770.2843948.
P Grefen. Beyond E-Business. Routledge, 2015. ISBN 9781315754697. doi: 10.4324/
9781315754697. URL http://www.tandfebooks.com/isbn/9781315754697.
Asela Gunawardana and Guy Shani. A Survey of Accuracy Evaluation Metrics of Recom-
mendation Tasks. The Journal of Machine Learning Research, 10:2935–2962, 2009. ISSN
15324435. doi: 10.1145/1577069.1755883.
Jiawei Han, Micheline Kamber, and Jian Pei. Introduction. Elsevier, 2012. ISBN
9780123814791. doi: 10.1016/B978-0-12-381479-1.00001-0. URL http://linkinghub.
elsevier.com/retrieve/pii/B9780123814791000010.
Kaggle. Coupon Purchase Prediction. URL https://www.kaggle.com/c/coupon-purchase-prediction.
Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item
collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003. ISSN 10897801. doi:
10.1109/MIC.2003.1167344.
Wes McKinney. pandas: a foundational Python library for data analysis and statistics.
Python for High Performance and Scientific Computing, pages 1–9, 2011.
Oyvind H Myklatun, Thorstein K Thorrud, Hai Nguyen, Helge Langseth, and Anders Kofod-
Petersen. Probability-based Approach for Predicting E-commerce Consumer Behaviour
Using Sparse Session Data. Proceedings of the 2015 International ACM Recommender
Systems Challenge, pages 5:1—-5:4, 2015. doi: 10.1145/2813448.2813514. URL http:
//doi.acm.org/10.1145/2813448.2813514.
Salvatore Parise, Patricia J. Guinan, and Ron Kafka. Solving the crisis of immediacy:
How digital technology can transform the customer experience. Business Horizons, 59
(4):411–420, 2016. ISSN 00076813. doi: 10.1016/j.bushor.2016.03.004. URL http:
//dx.doi.org/10.1016/j.bushor.2016.03.004.
Yoon-Joo Park and Alexander Tuzhilin. The long tail of recommender systems and how to
leverage it. Proceedings of the 2008 ACM conference on Recommender systems RecSys
08, page 11, 2008. ISSN 03029743. doi: 10.1145/1454008.1454012. URL http://portal.
acm.org/citation.cfm?doid=1454008.1454012.
Steffen Rendle. Factorization machines. Proceedings - IEEE International Conference on
Data Mining, ICDM, pages 995–1000, 2010. ISSN 15504786. doi: 10.1109/ICDM.2010.127.
Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl.
GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of
the 1994 ACM conference on Computer supported cooperative work, pages 175–186, 1994.
ISSN 00027863. doi: 10.1145/192844.192905.
Badrul M Sarwar, George Karypis, Joseph A Konstan, and John T Riedl. Application of
Dimensionality Reduction in Recommender System - A Case Study. Architecture, 1625:
264–8, 2000. ISSN 15533514. doi: 10.1.1.38.744. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.29.8381&rep=rep1&type=pdf.
Badrul M Sarwar, George Karypis, Joseph Konstan, and John Riedl. Recommender Systems
for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering.
Communications, 50(12):158–167, 2002. ISSN 09254773. doi: 10.1.1.4.6985. URL http://grouplens.org/papers/pdf/sarwar_cluster.pdf.
Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender
systems handbook, pages 257–297. Springer, 2011.
Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative Filtering beyond the User-Item
Matrix : A Survey of the State of the Art and Future Challenges. ACM Computing Surveys
(CSUR), 47(1):1–45, 2014. ISSN 03600300. doi: http://dx.doi.org/10.1145/2556270.
Harald Steck. Item popularity and recommendation accuracy. In Proceedings of the
fifth ACM conference on Recommender systems - RecSys ’11, page 125, 2011. ISBN
9781450306836. doi: 10.1145/2043932.2043957. URL http://dl.acm.org/citation.
cfm?doid=2043932.2043957.
Kai Ming Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Trans-
actions on Knowledge and Data Engineering, 14(3):659–665, 2002.
Daniel Valcarce, Javier Parapar, and Álvaro Barreiro. Item-based relevance modelling of
recommendations for getting rid of long tail products. Knowledge-Based Systems, 103:
41–51, 2016. ISSN 09507051. doi: 10.1016/j.knosys.2016.03.021.
Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. Collaborative filtering and
deep learning based recommendation system for cold start items. Expert Systems with
Applications, 69:1339–1351, 2017. ISSN 09574174. doi: 10.1016/j.eswa.2016.09.040.
Appendices
A Data exploration
Table A.1: Overview of available attributes
Table        Attribute                    Description                               Type
User         USER ID hash                 User ID                                   VARCHAR2
             REG DATE                     Registered date                           DATE
             SEX ID                       Gender                                    CHAR
             AGE                          Age                                       NUMBER
             WITHDRAW DATE                Unregistered date                         DATE
             PREF NAME                    Residential prefecture                    VARCHAR2
Coupon       CAPSULE TEXT                 Capsule text                              VARCHAR2
             GENRE NAME                   Category name                             VARCHAR2
             PRICE RATE                   Discount rate                             NUMBER
             CATALOG PRICE                List price                                NUMBER
             DISCOUNT PRICE               Discount price                            NUMBER
             DISPFROM                     Sales release date                        DATE
             DISPEND                      Sales end date                            DATE
             DISPPERIOD                   Sales period (days)                       NUMBER
             VALIDFROM                    The term of validity starts               DATE
             VALIDEND                     The term of validity ends                 DATE
             VALIDPERIOD                  Validity period (days)                    NUMBER
             USABLE DATE MON              Is available on Monday                    CHAR
             USABLE DATE TUE              Is available on Tuesday                   CHAR
             USABLE DATE WED              Is available on Wednesday                 CHAR
             USABLE DATE THU              Is available on Thursday                  CHAR
             USABLE DATE FRI              Is available on Friday                    CHAR
             USABLE DATE SAT              Is available on Saturday                  CHAR
             USABLE DATE SUN              Is available on Sunday                    CHAR
             USABLE DATE HOLIDAY          Is available on holiday                   CHAR
             USABLE DATE BEFORE HOLIDAY   Is available on the day before a holiday  CHAR
             large area name              Large area name of shop location          VARCHAR2
             ken name                     Prefecture name of shop                   VARCHAR2
             small area name              Small area name of shop location          VARCHAR2
             COUPON ID hash               Coupon ID                                 VARCHAR2
View         PURCHASE FLG                 Purchased flag                            NUMBER
             PURCHASEID hash              Purchase ID                               VARCHAR2
             I DATE                       View date                                 DATE
             PAGE SERIAL                                                            VARCHAR2
             REFERRER hash                Referrer                                  VARCHAR2
             VIEW COUPON ID hash          Browsed coupon ID                         VARCHAR2
             USER ID hash                 User ID                                   VARCHAR2
             SESSION ID hash              Session ID                                VARCHAR2
Purchase     ITEM COUNT                   Purchased item count                      NUMBER
             I DATE                       Purchase date                             DATE
             SMALL AREA NAME              Small area name                           VARCHAR2
             PURCHASEID hash              Purchase ID                               VARCHAR2
             USER ID hash                 User ID                                   VARCHAR2
             COUPON ID hash               Coupon ID                                 VARCHAR2
Coupon Area  SMALL AREA NAME              Small area name                           VARCHAR2
             PREF NAME                    Listed prefecture name                    VARCHAR2
             COUPON ID                    Coupon ID                                 VARCHAR2
Table A.2: Views per user and coupon
                 User     Coupon
Minimum          1.0      1.0
Smallest         1.0      1.0
Lower Quartile   9.0      15.0
Median           32.0     43.0
Upper Quartile   135.0    101.0
Largest          324.0    230.0
Maximum          3629.0   14779.0
Table A.3: Catalog price, discount price, listing frequency, purchase frequency and display period per coupon.

                 Catalog Price  Discount Price  Listed  Purchased  Dispperiod  Revenue
Minimum          1.0            0.0             1.0     1.0        0.0         0.0
Smallest         1.0            0.0             1.0     1.0        0.0         0.0
Lower Quartile   3570.0         1490.0          1.0     2.0        3.0         6560.0
Median           6615.0         2580.0          1.0     6.0        4.0         15920.0
Upper Quartile   13400.0        4500.0          2.0     15.0       6.0         39680.0
Largest          28000.0        9000.0          3.0     34.0       10.0        89250.0
Maximum          680000.0       100000.0        123.0   5760.0     422.0       1627500.0
Figure A.1: Number of purchases per genre: (a) per age group; (b) per week.
Table A.4: P-values for Pearson correlation test, coupon properties

                 CATALOG PRICE  DISCOUNT PRICE  DISPPERIOD  PRICE RATE  VALIDPERIOD  purchases  views
CATALOG PRICE    0.0000000
DISCOUNT PRICE   0.0000000      0.0000000
DISPPERIOD       0.0000000      0.0230514       0.0000000
PRICE RATE       0.0000000      0.0000001       0.0000000   0.0000000
VALIDPERIOD      0.0001213      0.7137261       0.0000000   0.1065052   0.0000000
purchases        0.0000000      0.0000000       0.0000000   0.0000000   0.855523     0.0000000
views            0.0000000      0.3743773       0.0000000   0.2601731   0.609746     0.0000000  0.0000000
Table A.5: Pearson correlation coefficient for coupon properties

                 CATALOG PRICE  DISCOUNT PRICE  DISPPERIOD  PRICE RATE  VALIDPERIOD  purchases  views
CATALOG PRICE    1
DISCOUNT PRICE   0.842671       1
DISPPERIOD       0.057761       0.016311        1
PRICE RATE       0.280378       -0.038462       0.13658     1
VALIDPERIOD      0.033364       0.003185        0.108647    -0.014015   1
purchases        -0.074574      -0.082232       0.258665    0.042046    -0.001584    1
views            -0.039301      -0.006376       0.300955    -0.008082   -0.004432    0.801952   1
B Feature calculation
Table B.1: Overview of used features for Ponpare
name               description
pricerate          discount percentage (%)
catalog price      catalogue price (Yen)
discount price     discount price (Yen)
validperiod        validity (days)
usable date        weekdays on which the coupon can be used (one-hot)
age                user age (years)
locexists          whether a user-coupon location appeared in the purchase log of last month (boolean)
days registration  days since the user registered (days)
timeofday          time of day of the view (one-hot)
dayofweek          day of the week of the view (one-hot)
genre              coupon genre (one-hot)
sex                user sex (one-hot)
samepref           whether user and coupon have the same prefecture (boolean)
prob g             genre popularity for the user in the previous month
prob p             prefecture popularity for the user in the previous month
keypop             coupon key popularity in the previous month
C Experiment results
Table C.1: Pairwise p-values for Ponpare dataset (list length = 5)

                     RANDOM    idflike    idflike   inv purchase  inv purchase  inv view  inv view  no
                               purchases  views     count         count2        count     count2    weights
RANDOM               -         0          0         0             0.0000234     0         0         0
idflike purchases    0         -          0.307324  0.000137      0             0.005252  0.004475  0.823039
idflike views        0         0.307324   -         0.000006      0             0.000298  0.000723  0.526758
inv purchase count   0         0.000137   0.000006  -             0.006238      0.205285  0.969553  0.000139
inv purchase count2  0.000023  0          0         0.006238      -             0.000266  0.033445  0
inv view count       0         0.005252   0.000298  0.205285      0.0002658     -         0.37132   0.004083
inv view count2      0         0.004475   0.000723  0.969553      0.0334447     0.37132   -         0.002849
no weights           0         0.823039   0.526758  0.000139      0             0.004083  0.002849  -
Table C.2: Pairwise p-values for tour operator dataset (list length = 5)

                     RANDOM    idflike    idflike   inv purchase  inv purchase  inv view  inv view  no
                               purchases  views     count         count2        count     count2    weights
RANDOM               -         0          0         0             0             0         0.000001  0
idflike purchases    0         -          0.217161  0.000165      0             0.01604   0         0.412034
idflike views        0         0.217161   -         0.000003      0             0.000792  0         0.725891
inv purchase count   0         0.000165   0.000003  -             0             0.226908  0         0.000012
inv purchase count2  0         0          0         0             -             0         0.000216  0
inv view count       0         0.01604    0.000792  0.226908      0             -         0         0.001355
inv view count2      0.000001  0          0         0             0.000216      0         -         0
no weights           0         0.412034   0.725891  0.000012      0             0.001355  0         -