Eindhoven University of Technology
MASTER
Boosted tree learning for balanced item recommendation in online retail
Dikker, J.
Award date:2017
Link to publication
Master thesis
Boosted tree learning for balanced item recommendation
in online retail
Jelle Dikker
0953780
Business Information Systems
October 6, 2017
Supervisors:
dr. Y. Zhang, Eindhoven University of Technology
dr. V. Menkovski, Eindhoven University of Technology
prof.dr.ir. U. Kaymak, Eindhoven University of Technology
S. Coenraad Msc., Building Blocks B.V.
Contents
List of Figures 4
List of Tables 5
1 Introduction 8
1.1 Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Kaggle coupon purchase prediction . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Literature survey 12
2.1 Online retail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Recommender systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Conventional collaborative filtering . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Machine learning algorithms . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 The long tail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Methodology & case studies 19
3.1 Problem formulation and overview . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Data preparation and feature calculation . . . . . . . . . . . . . . . . . . . . . 26
3.4 Hold-out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Model training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Calculate weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.9 Case study: tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9.1 Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9.2 Data preparation and feature calculation . . . . . . . . . . . . . . . . 36
3.9.3 Grid search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Results 38
4.1 Random selection algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Ponpare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Tour operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Summary of insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Implementation 52
5.1 Used software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Quality aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Conclusion and discussion 54
6.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Implications for research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Managerial implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References 58
Appendices 61
A Data exploration 61
B Feature calculation 65
C Experiment results 66
List of Figures
3.1 Overview of recommendation process . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Methodology overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 An overview of the available data. . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Views and purchases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Views and purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 Pearson correlation for coupon properties . . . . . . . . . . . . . . . . . . . . 23
3.7 Purchases and views per coupon ID . . . . . . . . . . . . . . . . . . . . . . . 23
3.8 Catalog price and discount price . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.9 Catalog price and number of purchases . . . . . . . . . . . . . . . . . . . . . . 25
3.10 Display period and number of purchases . . . . . . . . . . . . . . . . . . . . . 25
3.11 Number of purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.12 Schematic overview of use of interaction data . . . . . . . . . . . . . . . . . . 28
3.13 Age distribution of bookings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.14 Number of bookings per accommodation . . . . . . . . . . . . . . . . . . . . . 37
4.1 Purchases per coupon in test set . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Precision/Recall for all items in the Ponpare dataset . . . . . . . . . . . . . . 39
4.3 Precision/Recall for head items in the Ponpare dataset . . . . . . . . . . . . . 39
4.4 Precision/Recall for tail items in the Ponpare dataset . . . . . . . . . . . . . 40
4.5 Coverage for head items in the Ponpare dataset . . . . . . . . . . . . . . . . . 40
4.6 Coverage for tail items in the Ponpare dataset . . . . . . . . . . . . . . . . . . 41
4.7 Coverage for all items in the Ponpare dataset . . . . . . . . . . . . . . . . . . 41
4.8 Gini-index for different list lengths in the Ponpare dataset . . . . . . . . . . . 42
4.9 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 42
4.10 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 43
4.11 F1 score for different list lengths in the Ponpare dataset . . . . . . . . . . . . 43
4.12 Gini-index and F1 score in the Ponpare dataset . . . . . . . . . . . . . . . . . 44
4.13 Purchases per coupon in test set . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.14 Precision/Recall for all items in the tour operator dataset . . . . . . . . . . . 45
4.15 Precision/Recall for head items in the tour operator dataset . . . . . . . . . . 46
4.16 Precision/Recall for tail items in the tour operator dataset . . . . . . . . . . . 46
4.17 Coverage for head items in the tour operator dataset . . . . . . . . . . . . . . 47
4.18 Coverage for tail items in the tour operator dataset . . . . . . . . . . . . . . . 47
4.19 Coverage for all items in the tour operator dataset . . . . . . . . . . . . . . . 48
4.20 Gini-index for different list lengths in the tour operator dataset . . . . . . . . 48
4.21 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 49
4.22 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 49
4.23 F1 score for different list lengths in the tour operator dataset . . . . . . . . . 50
4.24 Gini-index and F1 score in the tour operator dataset . . . . . . . . . . . . . . 50
A.1 Number of purchases per genre . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Tables
2.1 Example user-item matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Views for example user and coupon . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Example views for constructing genre feature . . . . . . . . . . . . . . . . . . 29
3.3 Example views for constructing prefecture feature . . . . . . . . . . . . . . . . 29
3.4 Example view with genre and prefecture features . . . . . . . . . . . . . . . . 29
3.5 Overview of parameters in grid . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Values after parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Results of the gridsearch with 10-fold Cross Validation . . . . . . . . . . . . . 31
3.8 The confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9 Overview of used parameters for tour operator . . . . . . . . . . . . . . . . . 37
3.10 Results grid search tour operator . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.1 Overview of available attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2 Views per user and coupon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.3 Catalog price, discount price, listing frequency, purchase frequency and display period per coupon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.4 P-values for Pearson correlation test coupon properties . . . . . . . . . . . . . 64
A.5 Pearson correlation coefficient for coupon properties . . . . . . . . . . . . . . 64
B.1 Overview of used features for Ponpare . . . . . . . . . . . . . . . . . . . . . . 65
C.1 Pairwise p-values for Ponpare dataset (listlength=5) . . . . . . . . . . . . . . 67
C.2 Pairwise p-values for tour operator dataset (listlength=5) . . . . . . . . . . . 67
Abstract
Modern customers in online retail demand personalized offerings, which increase customer satisfaction and revenue for e-retailers. Recommender systems fulfill the need for systems capable of delivering these personalized offerings.
Online retailers often observe a long tail in their revenue distribution. Current recommender systems focus mainly on accuracy or related metrics and therefore provide recommendations involving mostly popular items. One of the open problems in recommender systems is delivering a more balanced recommendation set that involves less popular items more often.
This research proposes boosted tree learning to deliver such balanced recommendations. The approach uses XGBoost, a popular implementation of boosted tree learning. The weights attached to data points are adapted in order to emphasize less popular items during model training.
Experiments are performed on two datasets from online retailers: a Japanese coupon website and a tour operator. The results are evaluated using evaluation metrics for general recommendation, as well as metrics specific to balanced item recommendation.
Experimental results show that the developed approach can successfully be used for recommendation in online retail. Furthermore, they show that delivering balanced item recommendations is possible, but comes at the cost of lower general predictive performance.
Acknowledgements
With the submission of this master's thesis, an exciting period comes to an end. During the
past couple of months I have gained many new insights. Also, this marks the end of my
time as a student. Throughout the years I have enjoyed my studies in Groningen and later
Eindhoven a lot.
Many people contributed to the exciting time during the thesis project. First of all, I had
the pleasure to be able to conduct the project at Building Blocks in Tilburg. I would like
to thank all the people at Building Blocks for the energetic and inspiring environment and
Sander for the great guidance during the project.
For me it has been a great pleasure to work together with my supervisor at Eindhoven
University of Technology, Yingqian Zhang. Yingqian, thank you for all the advice and
feedback in all stages of the project. I enjoyed the many discussions we had on Wednesdays
and I learned a lot throughout the process.
Finally, family and friends have always been very important to me throughout my entire
studies. I would like to thank them for their great support throughout this period.
1 Introduction
This project was carried out as a Master's thesis project for the Master's degree in Business Information Systems at Eindhoven University of Technology. The project was performed at Building Blocks B.V.
This report describes the application of gradient boosted tree learning for recommendation in the online retail domain. The main contribution is the development of an approach that achieves a more equal division of recommendations over the different items in the result set. To this end, an algorithm is developed and applied to two datasets: the Ponpare coupon purchase prediction contest dataset and the dataset of a large tour operator. The results are evaluated using different criteria for the evaluation of recommendation systems.
The Ponpare dataset is a publicly available dataset that has been used for a data science competition in the past; the tour operator dataset comes from one of the clients of Building Blocks. This chapter explains the problem context and the background of Building Blocks, followed by an introduction to the datasets used. This leads to the research questions and the methodology, both of which are briefly introduced in this chapter as well.
1.1 Building Blocks
Building Blocks is a data science consultancy firm. The company has customers in three
domains, namely: insurance, retail and travel. It provides these customers with accurate
consumer predictions in order to enhance their business.
The services Building Blocks offers its clients are built around three pillars: customer profiling, product profiling and a third pillar, optimization, which combines aspects of the first two. Examples of customer profiling services include loyalty estimation and customer segmentation, whereas product services include, for instance, product segmentation and complement estimation. Finally, planning, pricing, recommendation, assortment and promotions combine knowledge from both domains and therefore belong to the third pillar.
Building Blocks expects client demand for product recommendation to online customers in the future. Therefore, Building Blocks is looking to develop a recommender system for clients in online retail. Currently the company uses recommender systems for events, such as exhibitions and festivals, offering customers interesting programme elements, but it does not have a recommender system for the e-commerce domain.
Many of Building Blocks' clients have a revenue distribution with a so-called long tail: a small number of products accounts for a relatively large part of sales. However, existing recommender systems aim at overall prediction performance and mostly recommend the most popular products. This means a large part of the item catalogue is often not used in recommendations. Because the clients of Building Blocks want to incorporate these products in their recommendations as well, there is a need for more research into delivering more balanced recommendations.
1.2 Kaggle coupon purchase prediction
Kaggle was founded in 2010 as a platform for predictive modelling and analytics. Nowadays, it has a community of more than 536,000 users and regularly hosts competitions on predictive modelling and analytics. Furthermore, the portal contains discussion forums and space to share code, all to foster the sharing of knowledge.
The coupon purchase prediction contest on Kaggle asks participants to predict which coupons users of the website Ponpare will buy (Kaggle). Ponpare is Japan's biggest coupon website, offering coupons in the style of Groupon and many others. These coupons can include discounted yoga lessons, gourmet sushi or concerts. The goal of the competition was to predict 10 purchases for every user for the week following the period covered by the competition data.
As the coupon purchase prediction dataset contains many interesting features that could also be present in datasets of potential Building Blocks clients, this dataset was selected for this study. Furthermore, the knowledge base around the competition provides valuable insights into successful approaches. However, the competition only uses overall predictive accuracy as its criterion and does not intend to recommend less popular items. Therefore, the approaches from this competition will be used and adapted in order to make more balanced recommendations.
The competition took place from July until September 2015 and attracted 1076 teams that submitted one or more entries. The training data available to participants consists of one year of coupon purchase data.
1.3 Tour operator
The tour operator offers package holidays for the leisure holiday market in The Netherlands. Customers can browse the offerings and book their holiday on the tour operator's website. The tour operator offers a variety of destinations and types of accommodation, each with a different target audience. For instance, some accommodations cater to single travellers, while others are more suitable for families. Furthermore, the holiday offerings have different price ranges and are situated in different regions. The tour operator would like to make more personalized offerings to its customers and is therefore looking to develop a recommender system. As the tour operator experiences differences in popularity between accommodations, it is looking for ways to promote less popular accommodations.
1.4 Research questions
Following the problem context explained above, the main research question can be defined as follows:
Main research question: How to build a recommender system that provides bal-
anced item recommendations in online retail?
In order to answer the main research question, several research questions need to be answered.
Research question 1: Which criteria can be used to evaluate balanced item recommendation?
The first research question relates to the criteria that can be used to evaluate recommender
systems in general, as well as balanced item recommendation more specifically. This is
essential in evaluating the performance of the developed models.
Research question 2: How to adapt the existing algorithm such that it will deliver
balanced recommendations?
The second research question addresses which algorithms exist for recommendation, and their advantages and disadvantages. Furthermore, it explores the possibility of adapting these algorithms such that items from the long tail can be recommended.
Research question 3: How can the developed algorithm be applied in the online
retail industry by Building Blocks?
The third research question refers to the practical application of the algorithm. A data mining pipeline should be developed, as well as tools for Building Blocks to use the developed algorithm in practice.
1.5 Methodology
In order to build a recommendation system which is capable of delivering a balanced rec-
ommendation set, an appropriate model needs to be developed. In order to do this, it is
important to determine which criteria can be used to evaluate recommendation systems as
well as recommendation of long tail items. This is addressed in research question 1.
Secondly, an algorithm needs to be selected and adapted with respect to the research goal. A literature study is used to identify the most suitable algorithms for this task, as well as possible modifications to existing algorithms, and such a modification is proposed. This completes research question 2.
Thirdly, experiments are performed on two datasets from the online retail domain: the Ponpare dataset and the tour operator dataset. The Ponpare dataset concerns a coupon sales website; the tour operator sells package holidays. A data mining workflow is developed such that it can also be used by Building Blocks for other clients. This answers research question 3.
Finally, the results are evaluated using the established criteria, to answer the main research
question of this project.
1.6 Thesis outline
After this introduction, in which the problem and research questions have been presented, the remainder of this thesis is structured as follows: chapter 2 introduces recommender systems and the current state of research, including some open problems. This is followed by chapter 3, which discusses the pursued approach and the case studies. Chapter 4 elaborates on the results of both case studies. Finally, chapter 5 describes application in the online retail industry and chapter 6 contains the discussion and conclusions.
2 Literature survey
This section presents the results of the literature survey that has been carried out. The domain of online retail is addressed, as well as recommender systems, algorithms for recommendation, XGBoost and recommendation for the long tail.
2.1 Online retail
Online retailing is the sale of goods and services through the internet. With the rise of the
internet, e-retailing has become more and more popular among consumers. In 2016, 74% of
all customers in The Netherlands purchased services or goods online (Eurostat, 2017).
Important drivers of e-retail, compared to traditional retail, are reach and efficiency: online retailers are able to serve a larger geographic area at any time of the day, which allows them to operate more efficiently (Grefen, 2015).
Key to the success of online retailers is customer loyalty, also described as the intent to purchase repeatedly (Chiu et al., 2014). Modern customers demand personalized content in real time during their shopping experience. This requires systems that can tailor content in real time in order to deliver the customer the desired experience.
Therefore, the current biggest challenge in online retail from a customer journey perspective is dealing with the crisis of immediacy, defined by Parise et al. (2016) as "how to meet consumers' need to receive content, expertise, and personalized solutions in real time during their shopping experience". Customers demand the right information at the right place and the right time. Recommender systems fulfill the demand for systems that provide customers with personalized information at the right time, which is critical in the current e-retail landscape.
2.2 Recommender systems
The success of the world wide web has made a large amount of information available to anyone connected to the internet. However, because of this abundance of available information, it has become harder for users to select the right items. Appropriate systems are needed to support users in selecting relevant information from the internet.
Recommender systems can be used to help users navigate through a vast amount of items or documents in scenarios where users do not know exactly what they are looking for and hence do not want to formulate an explicit query. Instead of searching, this is approached as a browsing scenario, in which the user does not issue an explicit query expressing his or her information need but does want to be pointed in the right direction by the system (Baeza-Yates et al., 1999).
However, offering users a successful experience is not easy. When having to navigate through large numbers of items, users are subject to various phenomena such as choice stress. Moreover, research shows that choice stress can counteract the increased attractiveness of a large result set over a smaller one and hence reduce the choice satisfaction of the user (Bollen et al., 2010).
Recommender systems have existed for a long time and can be used in many scenarios to point the user in the right direction. One of the earliest projects using a recommender system was GroupLens, a recommender system for news articles (Resnick et al., 1994).
Over the years, recommender systems have become more popular and have spread to various domains. For instance, YouTube uses one to recommend videos to users (Davidson et al., 2010) and Google News uses a recommender system to recommend news items (Das et al., 2007), but many more examples exist.
Moreover, the use of recommender systems in online retail is not a novelty. Arguably the most famous, and one of the first, examples in this domain is the engine Amazon uses to recommend products to customers (Linden et al., 2003). However, many more companies active in e-commerce use recommender systems to recommend relevant products to their users.
Recommender systems generate business value and are used in a wide variety of business models. For instance, they can generate value by offering users a positive experience, but also by recommending items users would otherwise not have found and bought (Gomez-Uribe and Hunt, 2015). An e-retailer or e-marketplace can use a recommender system to recommend products; an e-integrator can use it to perform mass customization towards its customers (Grefen, 2015).
Recommender systems typically use knowledge from domains such as human-computer interaction and information retrieval, but more importantly use algorithms that can be considered data mining algorithms. Data mining is a set of techniques for turning large amounts of data into valuable information (Han et al., 2012).
2.3 Algorithms
Several techniques have been used in recommender systems over the past years. Shi et al. (2014) propose a categorization of recommendation techniques consisting of conventional collaborative filtering, which includes memory- and model-based approaches. These approaches use only the user-item matrix, as explained later. In addition, collaborative filtering using alternative sources uses other information, such as social network information, user-contributed information or interaction information.
Other authors, such as Bobadilla et al. (2013), categorize techniques into content-based filtering, demographic filtering, collaborative filtering and hybrid filtering. Here, content-based filtering includes algorithms utilizing item properties, demographic filtering utilizes user properties, collaborative filtering utilizes user-item interactions and hybrid filtering uses combinations of these three. In the remainder of this chapter, the basic collaborative filtering algorithm is explained and several machine learning approaches to the problem are discussed.
2.3.1 Conventional collaborative filtering
Collaborative filtering refers to making predictions about a user's preferences by collecting preferences from many users. It is based on the assumption that if users share a subset of similar preferences, they are likely to also have similar preferences for other (unseen) items. This method does not take any information about the items or users into account except for their history. Neighborhood-based algorithms make use of a matrix of N users and M items, whose cells contain r_ij, the rating user i has given item j. Ratings can be on a scale in the case of explicit ratings, but can also be binary, for instance in the case of visits or purchases. An example of a user-item matrix is given in Table 2.1; a question mark denotes an unknown rating.

Table 2.1: Example user-item matrix

         The Shawshank Redemption   The Godfather   The Dark Knight
Alice    3                          1               5
Bob      4                          ?               5
Carol    5                          3               3
Dave     3                          2               5
In conventional collaborative filtering, the predicted rating for a new user is determined as the weighted average of the ratings of neighboring users. The weight corresponds to a notion of similarity, for instance the Pearson coefficient or cosine similarity. This means that users who are more similar to the target user have a larger share in the final predicted rating. Equation 2.1 displays the predicted rating for user i and item j using conventional collaborative filtering. This is an example of user-based collaborative filtering, where similarity between users is calculated; item-based collaborative filtering instead makes use of similarity between items.

R_{ij} = \frac{1}{C} \sum_{k \in Z_i} \mathrm{sim}(i, k) \, R_{kj}    (2.1)

In this equation, Z_i is the set of k neighboring users of user i, sim(i, k) is the similarity between users i and k, and C is a normalizing constant. The similarity function could for instance be the cosine similarity of the rating vectors \vec{i} and \vec{k} of users i and k respectively:

\mathrm{sim}(i, k) = \cos(\vec{i}, \vec{k})    (2.2)
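As an illustration, Equations 2.1 and 2.2 can be applied to the example user-item matrix of Table 2.1 to predict Bob's missing rating. The sketch below is not part of the thesis's implementation; it assumes the constant C is chosen as the sum of similarities (so the prediction stays on the rating scale) and that cosine similarity is computed over co-rated items only.

```python
import numpy as np

# Ratings from Table 2.1 (rows: Alice, Bob, Carol, Dave; columns:
# The Shawshank Redemption, The Godfather, The Dark Knight).
# Bob's unknown rating is represented by np.nan.
R = np.array([
    [3.0, 1.0, 5.0],
    [4.0, np.nan, 5.0],
    [5.0, 3.0, 3.0],
    [3.0, 2.0, 5.0],
])

def cosine_sim(a, b):
    """Cosine similarity (Eq. 2.2), computed over co-rated items only."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[mask], b[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, i, j):
    """User-based prediction (Eq. 2.1): a similarity-weighted average of
    the neighbours' ratings for item j, with C the sum of similarities."""
    total, C = 0.0, 0.0
    for k in range(R.shape[0]):
        if k == i or np.isnan(R[k, j]):
            continue
        s = cosine_sim(R[i], R[k])
        total += s * R[k, j]
        C += s
    return total / C

print(round(predict(R, 1, 1), 2))  # Bob's predicted rating for The Godfather: 1.98
```

The prediction is pulled towards Carol's low rating less strongly than towards Alice's and Dave's, because Alice and Dave are slightly more similar to Bob on the co-rated items.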
Conventional collaborative filtering does not scale well, as it involves computing the similarity to many neighboring users or items. Model-based approaches instead use a model to predict a rating for a user, which resolves these issues. Equation 2.3 contains the general form of model-based collaborative filtering.

f(p_i, q_j) \mapsto R_{ij}, \quad i = 1, 2, \ldots, M, \; j = 1, 2, \ldots, N    (2.3)

In this equation, p_i and q_j are the model parameters for user i and item j, and f is a function which maps parameters to known ratings (Shi et al., 2014). This can be a model using matrix factorization or singular value decomposition (SVD), where the matrix is reduced to a model using mathematical techniques in order to make new predictions.
2.3.2 Machine learning algorithms
Model-based CF can be extended to incorporate not only ratings, but also user and item properties. In the following section, some machine learning techniques that can be used for recommendation are discussed. However, several more data mining techniques exist that could also serve the purpose of recommendation.
First of all, clustering refers to grouping users and/or items into clusters based on the user-item matrix and their properties. Several clustering techniques exist, such as k-means clustering. Sarwar et al. (2002) propose a clustering algorithm which splits all users into a number of clusters using bisecting k-means, after which, for a given user, the conventional collaborative filtering algorithm is used to determine the predicted rating. The authors report a decrease in accuracy but also a decrease in computational expense, as there is no need to compute similarities over the entire matrix. Hence, clustering can be used to overcome the scalability issues of conventional collaborative filtering.
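The clustering idea can be sketched as follows. This is an illustrative simplification, not the thesis's method: it uses plain k-means on a hypothetical toy matrix rather than bisecting k-means on real data, and predicts with the cluster mean instead of full within-cluster collaborative filtering.

```python
import numpy as np

# Toy user-item matrix: the first three users prefer item 0,
# the last three prefer item 1 (hypothetical ratings, 1-5).
R = np.array([
    [5, 1], [4, 2], [5, 2],
    [1, 5], [2, 4], [1, 4],
], dtype=float)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the rows of X. Bisecting k-means (Sarwar et al.,
    2002) instead applies 2-means recursively, but the idea of
    restricting the neighbourhood to a cluster is the same."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each user to the nearest cluster center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned users.
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels

labels = kmeans(R, k=2)
# Predict user 0's rating for item 1 from users in the same cluster only:
same_cluster = labels == labels[0]
print(round(R[same_cluster, 1].mean(), 2))  # → 1.67
```

Only the users in the same cluster contribute to the prediction, which is exactly where the computational saving over full-matrix similarity computation comes from.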
Secondly, it is possible to apply dimensionality reduction techniques. For instance, Singular Value Decomposition (SVD) can be used to factorize the user-item matrix. Sarwar et al. (2000) demonstrate that for an m by n matrix, SVD reduces the prediction computation to O(m + n), compared to O(m^2) for conventional collaborative filtering, while performing comparably. However, updating the SVD is expensive. This resolves some of the scalability issues of conventional collaborative filtering, but is not suitable for a context where frequent updates of the user-item matrix are desired.
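The factorization step can be sketched as follows with a truncated SVD on a toy rating matrix; the matrix values and the chosen rank are illustrative only.

```python
# Hedged sketch: rank-k SVD approximation of a user-item rating matrix,
# in the spirit of Sarwar et al. (2000). The toy matrix is invented.
import numpy as np

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 0.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                           # keep the k largest singular values
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank rating estimate
print(R_hat.round(2))
```

The low-rank matrix R_hat fills in unobserved entries, which is what makes SVD usable as a rating predictor; recomputing U, s and Vt after every matrix update is the expensive part noted above.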
Thirdly, neural networks are discussed. A neural network typically consists of several layers of nodes, interconnected in a networked structure. Because of this structure, neural networks are able to capture complex relationships, and they show promising results for recommendation. For instance, Wei et al. (2017) deployed them for recommending cold-start items on the Netflix dataset: the authors used a neural network to learn item features and applied these features in a collaborative filtering setting using SVD. They report a 5% lower RMSE compared to using only collaborative filtering with SVD.
Fourthly, factorization machines combine Support Vector Machines (SVMs) and factorization models, uniting the advantages of both (Rendle, 2010). Factorization machines are able to handle sparse data in linear time and also show promising results when combining several data sources in an e-commerce environment. Geuens (2015) demonstrates that factorization machines using interaction, user and item data as feature vectors outperform conventional collaborative filtering, which only uses interaction data. The author reports an increase in recall of more than 100% for small selection sizes in an e-commerce scenario. However, in that paper the algorithm has not been compared to algorithms other than collaborative filtering.
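The second-order factorization machine model of Rendle (2010) scores a feature vector as y(x) = w0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, where each feature i has a low-dimensional factor vector v_i. A minimal sketch, with all weights invented for illustration:

```python
# Hedged sketch of the factorization-machine prediction equation (Rendle, 2010).
# All parameter values below are made up for the demonstration.

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for feature vector x."""
    linear = sum(w[i] * x[i] for i in range(len(x)))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            # interaction weight is the dot product of the factor vectors
            dot = sum(V[i][f] * V[j][f] for f in range(len(V[i])))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

x = [1.0, 0.0, 1.0]                        # sparse, one-hot style input
w0, w = 0.5, [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]   # 2-dimensional factor vectors
print(fm_predict(x, w0, w, V))             # -> 2.9
```

Factorizing the pairwise weights is what lets the model estimate interactions between features that never co-occur in the sparse training data.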
Fifthly, Classification and Regression Trees (CART) are decision trees that can be used for classification and regression. Multiple trees can be combined in a random forest or through boosted tree learning. Random forests have, for instance, been successful in a contest on predicting purchase behaviour (Myklatun et al., 2015). For the recommender systems challenge 2015, the authors combined probabilistic modelling to construct features with a random forest algorithm on the interaction data. The data consisted of click sessions for users, and the challenge was to predict the purchase of items during a given session. The authors' approach was the best out of 850 teams.
2.4 The long tail
One of the most important business drivers for e-retailers is increased reach (Grefen, 2015),
as the ability to offer a larger product catalogue is one of the advantages of online retail
over physical stores. Revenue distributions in e-commerce therefore often have a long tail,
where most of the revenue is in a large amount of relatively unpopular products from the
long-tail. As there are not many data points present for these products, typically recom-
mender systems experience difficulties incorporating these products in recommendations.
Furthermore, typically recommender system focus only on general recommendation perfor-
mance and hence, do not take recommendation of less popular items into account. However,
there is a business need to be able to deliver more balanced recommendations and hence,
recommend tail items as well.
Several approaches exist to deliver more balanced recommendations. For instance, clustering has been proposed. Park and Tuzhilin (2008) split items into head and tail items and apply clustering to the tail to estimate ratings for tail items, before applying other algorithms to the entire item set. They report an improvement in Root Mean Squared Error (RMSE), but their algorithm does not scale well.
Also, Valcarce et al. (2016) promote tail items to reduce overstock by making use of relevance modelling. The task is inverted: instead of making a recommendation for each user, a recommendation is created for each item using similar items. Their approach obtains better results than neighborhood-based approaches on reference datasets for movie recommendation. The reason is that it is difficult to compute a neighborhood for tail items, as typically few data points exist for these items.
Steck (2011) conducts experiments with a matrix factorization approach optimizing recall on the Netflix movie dataset. The author reports outperforming a matrix factorization approach optimizing RMSE, as well as an approach using SVD and a best-seller list, in terms of recall. Moreover, after performing user experiments the author concludes that in a movie recommendation scenario only a small bias towards less popular items is appreciated by users. Finally, the author remarks that recommendation accuracy for tail items generally decreases, as variance and noise increase towards the long tail.
In conclusion, several approaches to delivering balanced item recommendations have been tried. However, finding appropriate algorithms to promote less popular items is still an open challenge. Furthermore, more research can be done on user behaviour, as the acceptance of biased recommendations has not yet been studied in online retail.
2.5 XGBoost
When inspecting the approaches pursued by participants in the Kaggle coupon purchase prediction contest, it becomes clear that XGBoost, an implementation of gradient tree boosting (Chen and Guestrin, 2016b), performs very well: it is used by two of the top-3 solutions in the contest. Furthermore, it has been used in many other machine learning contests. For instance, of the 29 competitions hosted on Kaggle in 2015, 17 winning solutions made use of XGBoost, and every team in the top 10 of the KDDcup 2015 used it. Additionally, it must be remarked that neural networks, as well as ensemble methods combining neural networks and XGBoost, also obtained good results in 10 competitions, but are not as popular as XGBoost.
However, all literature mentioning XGBoost focuses on predictive accuracy and/or scalability. For instance, the authors of XGBoost obtained excellent results using boosted trees on several benchmark datasets (Chen and Guestrin, 2016a). They report a processing time per tree four times faster than existing tree boosting implementations, and slightly improved performance in terms of AUC.
Furthermore, no existing work mentions using XGBoost to balance recommendations over items. As XGBoost obtains excellent results in the recommendation of coupons, this algorithm is chosen to be adapted in order to obtain balanced recommendations.
2.6 Conclusion
Modern customers in online retail demand personalized offerings, which increase customer satisfaction and revenue for e-retailers. Recommender systems fulfil the need for systems capable of delivering these personalized offerings. Recommender systems have a long history, and the domain can be classified along the steps in the recommendation process. Several classes of algorithms exist for recommendation. Collaborative filtering is one of the first, and several extensions to it exist. Extended model-based algorithms form the most promising class: here, several data mining techniques are adapted to work with user-item interaction data as well as user and item properties.
Revenue distributions in online retail often show a long tail, as reach is one of the important business drivers for online retail. However, recommender systems often focus on general predictive performance and are not able to incorporate less popular items in their recommendations. Hence, there is a need for systems capable of delivering recommendations balanced over the different items.
XGBoost shows excellent results in recommendation as well as in other data mining problems. However, no results are known of adapting XGBoost to produce a balanced recommendation set. This research will therefore investigate whether the XGBoost algorithm can be adapted to do so. It thereby addresses the need for appropriate algorithms specific to the e-commerce domain that deliver balanced item recommendations.
3 Methodology & case studies
In order to answer the research questions, experiments will be performed on the Ponpare and tour operator case studies. This chapter describes the research methodology and its application to both cases.
As discussed in the previous chapter, XGBoost has proven to perform well in this context in terms of accuracy-related metrics, but it has not been used for balanced item recommendation. In this research, XGBoost will therefore be adapted for balanced recommendation, and the behaviour of the algorithm will be evaluated with respect to criteria other than accuracy and related metrics alone.
3.1 Problem formulation and overview
The recommendation problem focuses on predicting, for every user i, a number n of relevant items j. A relevant item is denoted with 1 and a non-relevant item with 0; hence, the problem is a case of binary classification.
To arrive at a classification score for every user-item combination (i, j), each combination is assigned a score by a classification model, for which XGBoost is used. This yields a probability between 0 and 1 for every combination. Only the top n items with the highest score per user i are converted to 1; all other items for this user receive a 0. This leads to a recommendation list per user, which will be evaluated for all users. An overview of this process can be found in Figure 3.1.
Figure 3.1: Overview of recommendation process
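The top-n step of this process can be sketched as follows; the probabilities below are invented, whereas in the actual pipeline they would come from the trained XGBoost classifier.

```python
# Minimal sketch of the top-n selection in Figure 3.1: per user, the n
# highest-scoring items become 1 and everything else 0.
import heapq
from collections import defaultdict

def top_n_recommendations(scores, n):
    """scores: {(user, item): probability} -> {user: set of n recommended items}."""
    per_user = defaultdict(list)
    for (user, item), p in scores.items():
        per_user[user].append((p, item))
    # nlargest on (probability, item) pairs keeps the n best items per user
    return {u: {item for _, item in heapq.nlargest(n, pairs)}
            for u, pairs in per_user.items()}

scores = {("u1", "a"): 0.9, ("u1", "b"): 0.2, ("u1", "c"): 0.7,
          ("u2", "a"): 0.1, ("u2", "b"): 0.4}
print(top_n_recommendations(scores, 2))  # u1 -> {'a', 'c'}, u2 -> {'a', 'b'}
```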
The methodology is based on the CRISP-DM cycle (Chapman et al., 1999), slightly adapted for the context of this research. Data understanding involves various explorative visualizations and statistics to discover interesting aspects of the data; the results of this phase can be found in the first part of this chapter.
In order to adapt XGBoost for balanced item recommendation, different weights will be used during XGBoost training, resulting in different models. These models will be compared with respect to several evaluation criteria. Here, a high-level overview of the process is given; the different steps are explained in greater detail in the remainder of this chapter. Figure 3.2 displays an overview of the activities taking place after the data understanding phase. The first activity is data preparation and feature calculation. After computing the features, a hold-out set is taken, which will be used for final evaluation. A grid search using k-fold cross validation is performed on the training data to determine the best parameter set for XGBoost. The weights are calculated on the entire training set and then used to train the model with the parameter set resulting from the grid search. Finally, the resulting models are evaluated.
Figure 3.2: Methodology overview
3.2 Data understanding
Data exploration is done to get an overview of the available data and to gain preliminary insights on which to base the approach.
Firstly, the data structure is explained. Figure 3.3 contains an overview of the different data sources, and Table A.1 provides an overview of the available features in each source. The user table contains user features such as UserId, registration date, sex, residence area, and possibly a withdraw date in case a user has deregistered. The coupon table contains coupon information such as a description and genre; furthermore, price information, display duration, information regarding the validity of the coupon and the location of the shop are available.
Figure 3.3: An overview of the available data.
Views is a user-coupon combination enriched with additional information about the interaction, such as date, session ID and a flag denoting whether the interaction led to a purchase. Purchase is a user-coupon combination as well, containing additional information such as the number of items purchased, purchase date and area. Finally, some information about the location of the shop offering the coupon is available in the table 'Coupon Area'.
Unique coupons In order to get an impression of purchase frequency for users and coupons, we start with the purchases table and join it with the coupon properties in the coupon list. Firstly, all transactions are grouped on coupon ID to aggregate the number of purchases per coupon.
Secondly, coupon listings are counted. After examining the different coupon features, the combination (discount price, catalog price, capsule text, genre) is used to group similar coupons that have been listed multiple times; this combination is denoted as a unique coupon. This reduces the initial 19,368 coupons to 10,803 unique coupons that have been listed one or more times. Furthermore, the display period of a unique coupon can be obtained by adding the display periods for the different periods in which the coupon was listed.
Thirdly, the revenue per unique coupon can be obtained. This is done by multiplying the number of purchases by the discount price, which is the price a coupon has been sold for.
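The aggregation just described can be sketched as follows; the records and field names are illustrative stand-ins for the actual purchase table.

```python
# Hedged sketch of the unique-coupon aggregation: purchases are grouped on
# the key (discount price, catalog price, capsule text, genre) and purchase
# counts and revenue are derived. The example records are invented.
from collections import defaultdict

purchases = [
    {"key": (1000, 2000, "Dinner", "Food"), "items": 2},
    {"key": (1000, 2000, "Dinner", "Food"), "items": 1},
    {"key": (3000, 6000, "Spa", "Relaxation"), "items": 1},
]

stats = defaultdict(lambda: {"purchases": 0, "revenue": 0})
for p in purchases:
    stats[p["key"]]["purchases"] += p["items"]
    # revenue = number of purchases * discount price (the selling price)
    stats[p["key"]]["revenue"] += p["items"] * p["key"][0]

print(dict(stats))
```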
Coupon listings and purchases Table A.3 contains an overview of catalog price, discount price, listing frequency, purchase frequency and display period per coupon. The typical coupon is listed once, and the median number of purchases per coupon is 6. This indicates that there is typically little purchasing interaction per coupon. Moreover, the median number of days a coupon is displayed (listed) on the Ponpare website is 4. As the coupon offering changes continuously and coupons are typically replaced fewer than three times and do not return afterwards, the model should learn relationships based on coupon properties instead of static products.
Figure 3.4: Views and purchases — (a) for all coupons; (b) for coupon 1076
Views Figure 3.6 contains a visual representation of Pearson correlation scores for several coupon properties and the number of views and purchases. Furthermore, Table A.5 contains the scores and Table A.4 the corresponding p-values. A positive score depicts a positive correlation and a negative score a negative correlation.
Figure 3.5: Views and purchases per genre
Here it can be seen that a correlation exists between the display period and the number of purchases and views, which can be explained by the fact that more visibility leads to more purchases. Furthermore, the catalog price is correlated with the price rate and the discount price, which means that a higher catalog price in general implies a higher discount.
It can be derived from Table A.2 that the median number of views per user is 32 and the median number of views per coupon is 43. Hence, although view counts are higher than purchase counts, there are typically not many view data points available per user and coupon: a large number of coupons and users have only a few views.
Additionally, although coupons are typically viewed more often than purchased, it is important to remark that views appear to follow the same pattern as purchases. This can clearly be seen in Figure 3.4a, where the views and purchases for all coupons are shown, and in Figure 3.4b, where an example coupon is displayed. Figure 3.6 confirms that a high number of views correlates with a high number of purchases. Finally, this relation can also be seen when combining user-coupon views: typically, a user views a coupon a number of times before making a purchase. The same phenomenon applies to genre and location: a high number of views in a specific genre denotes a high probability of purchasing an item from this genre, and a high number of views in a specific area denotes a high probability of the user making a purchase in this area as well.
Figure 3.7a displays the number of purchases for all coupons, sorted by their respective number of purchases. It is important to note that a small number of coupons contributes most of the total revenue. The distribution of views, shown in Figure 3.7b, follows a similar shape.
Figure 3.6: Pearson correlation for coupon properties
Figure 3.7: Purchases and views per coupon ID — (a) purchases per coupon; (b) views per coupon
Coupon properties The following section visualizes the distribution of view and (derived) purchase attributes over the entire coupon set, as well as a smaller subset for visualization purposes. This step is executed to gain insight into combinations of different properties. Figure 3.8 displays the catalog price and discount price for the entire set of coupons, showing that coupons have a similar discount rate over the entire spectrum.
Figure 3.8: Catalog price and discount price
Figure 3.9 displays the catalog price and the number of sales for all coupons; it can be concluded that coupons with a lower price attract more sales in general. Figure 3.10 shows the display period, which is the number of days a coupon is visible on the Ponpare website, against the number of sales. Here it becomes apparent that, in general, more visibility correlates with more sales. A similar effect can be observed when comparing visibility to revenue. Moreover, similar figures can be obtained when comparing listings to both revenue and sales, as the number of times a coupon has been listed determines the number of days it is visible. However, it is important to remark that the display period is logically not available at the time of recommending and will therefore not be used in prediction.
User properties In order to explore differences in purchasing behaviour, different user properties are compared with coupon properties. Figure 3.11a shows the number of purchases for different age groups per sex and genre. Figure 3.11b shows the number of purchases per sex and genre over time (per week). These figures only display the top 2 most popular genres; the remainder can be found in Figures A.1a and A.1b. It becomes clear, for instance, that the hotel genre is more popular among older customers and the food genre is relatively popular among male customers. Many other similar relationships can be discovered.
Figure 3.9: Catalog price and number of purchases
Figure 3.10: Display period and number of purchases
When taking a closer look at genre popularity per individual user, an interesting phenomenon can be observed as well. All user purchases are grouped by genre, and for every user the most popular genre is taken. The number of purchases in this genre is then divided by the user's total purchases. This gives a mean value of 0.68, which means that on average, users make 68% of their purchases in their favourite genre. When users with 10 or fewer purchases are excluded, the mean declines to 54%. For the entire population, purchases are spread much more evenly over the genres, with the most popular genre receiving 30% of all purchases.
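The favourite-genre statistic can be computed as sketched below; the purchase lists are invented for illustration.

```python
# Hedged sketch of the favourite-genre share: for each user, the fraction of
# purchases in their most-bought genre. The toy purchase log is invented.
from collections import Counter

user_purchases = {
    "u1": ["Food", "Food", "Hotel", "Food"],   # 3 of 4 in the favourite genre
    "u2": ["Hotel", "Leisure"],                # 1 of 2 in the favourite genre
}

shares = {}
for user, genres in user_purchases.items():
    counts = Counter(genres)
    # share of purchases in the single most popular genre for this user
    shares[user] = counts.most_common(1)[0][1] / len(genres)

mean_share = sum(shares.values()) / len(shares)
print(shares, mean_share)  # {'u1': 0.75, 'u2': 0.5} 0.625
```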
In conclusion, user behaviour differs across age and sex groups. Moreover, users show a preference for certain item genres. This should be taken into account by properly encoding features.
Figure 3.11: Number of purchases per genre — (a) per age group; (b) per week
3.3 Data preparation and feature calculation
The data preparation step of the CRISP-DM cycle describes the phase where the data is
preprocessed such that the information can be used for predictive modelling.
Firstly, as Ponpare is a Japanese coupon website, all descriptions, genre information and information about the prefecture (place) of users and coupons need to be translated from Japanese to English. This is done with a dictionary available in the Kaggle repository: a CSV file containing Japanese-to-English translations for the area names, prefectures, genres and descriptions present in the dataset. The dictionary is used to replace the values in the respective columns with the corresponding translation.
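The replacement step amounts to a simple lookup, as sketched below; the mapping stands in for the CSV dictionary shipped with the Kaggle data, and the entries are illustrative.

```python
# Hedged sketch of the dictionary-based translation step. The mapping below
# is an invented stand-in for the Japanese-to-English CSV dictionary.
jp_to_en = {"グルメ": "Food", "東京都": "Tokyo"}  # illustrative entries

records = [{"genre": "グルメ", "pref": "東京都"}]
for rec in records:
    for col in ("genre", "pref"):
        rec[col] = jp_to_en.get(rec[col], rec[col])  # leave unknown values as-is

print(records)  # [{'genre': 'Food', 'pref': 'Tokyo'}]
```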
Secondly, the dataset spans exactly one year. Therefore, it was opted to convert each date to both a day and a week number, counting from the first day in the dataset. These values can then be used to split the dataset and compute features based on periods, as will be explained later.
Thirdly, all coupon, purchase and user IDs are stored in the database as hash values. Therefore, in all tables containing these hashes, they are converted to integer values according to the same mapping for coupon, purchase and user IDs, respectively.
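A consistent hash-to-integer conversion can be sketched as follows; the hash strings are invented placeholders.

```python
# Hedged sketch of the hash-to-integer conversion: every distinct hash maps
# to the same integer wherever it occurs. The example IDs are invented.

def build_id_map(hashes):
    """Assign consecutive integers to hashes in first-seen order."""
    mapping = {}
    for h in hashes:
        mapping.setdefault(h, len(mapping))
    return mapping

user_hashes = ["d9c0a7b3", "a1f44e02", "d9c0a7b3"]
id_map = build_id_map(user_hashes)
encoded = [id_map[h] for h in user_hashes]
print(encoded)  # [0, 1, 0]
```

Applying the same mapping across all tables is what keeps a user's rows joinable after the conversion.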
Fourthly, it is important to remark that XGBoost is capable of handling missing data by giving every split a default direction. For every split, the total gain of directing all missing values to the left is calculated and compared with the gain of directing them to the right, which yields the default direction. Therefore, no rows with missing values for individual features have been removed. However, a small fraction of views could not be linked to coupon properties, because the coupon did not occur in the coupon table. Out of the original 2,833,180 views, 315,974 have been removed, leaving 2,517,206 views.
Finally, it is important to note that a user-coupon combination often has multiple views in the database, as can be seen in Table 3.1. All views for a user-coupon combination are reduced to a single row: in case of a purchase, the date of the purchase is taken as the date for the record; otherwise, the date of the last view is used. Duplicate user-coupon pairs are removed. In the example, the purchase record (flag 1, shown in bold in the original) is maintained and all other records are removed.

Table 3.1: Views for example user and coupon

PURCHASE FLG  I DATE           COUPON ID  USER ID
0             1-7-2011 17:07   402        22598
0             1-7-2011 22:51   402        22598
1             1-7-2011 22:52   402        22598
0             1-7-2011 22:54   402        22598

Several steps are undertaken to encode features appropriately for use by XGBoost. The first part of this section therefore focuses on simple encoding of static features, whereas the second part introduces calculated and combined features as well as features derived from interaction data.
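The deduplication rule can be sketched as follows; the rows mirror Table 3.1, and the field names are illustrative.

```python
# Hedged sketch of the view deduplication: all rows for a (user, coupon)
# pair collapse to one row - the purchase row if present, else the last view.

def deduplicate(views):
    kept = {}
    for row in views:  # rows are assumed to be in chronological order
        key = (row["user_id"], row["coupon_id"])
        prev = kept.get(key)
        # purchase rows, once stored, are never replaced by later views
        if prev is None or prev["purchase_flg"] == 0:
            kept[key] = row
    return list(kept.values())

views = [
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 0, "date": "1-7-2011 17:07"},
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 1, "date": "1-7-2011 22:52"},
    {"user_id": 22598, "coupon_id": 402, "purchase_flg": 0, "date": "1-7-2011 22:54"},
]
print(deduplicate(views))  # keeps the purchase row of 22:52
```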
The time of viewing yields several features for every user-coupon combination in the dataset. First of all, the moment at which the interaction took place yields a weekday and a time of day; both are encoded as categorical variables using one-hot encoding, with the time of day in 4 bins. Moreover, for every user, the registration date and the time of viewing yield a value representing the number of days the user has been registered.
Secondly, XGBoost is only capable of handling numerical data, so categorical data should be encoded using one-hot encoding. The genre of a coupon is encoded this way, as is the sex of a user. Age is left unaltered, as are the price, discount price and price rate of an item.
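The one-hot step can be sketched as follows; the category list and feature row are illustrative, not the full sets from the data.

```python
# Minimal one-hot sketch for categorical columns such as genre and sex.
# The genre list below is a shortened, illustrative example.

def one_hot(value, categories):
    """Return a 0/1 indicator list over a fixed category order."""
    return [1 if value == c else 0 for c in categories]

genres = ["Food", "Hotel", "Leisure"]
row = {"genre": "Hotel", "age": 34, "discount_price": 1000}
# numeric columns (age, prices) are passed through unaltered
features = one_hot(row["genre"], genres) + [row["age"], row["discount_price"]]
print(features)  # [0, 1, 0, 34, 1000]
```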
Users as well as coupons have a location in the dataset. For every combination of item genre, item location and user location: if the genre indicates that an item can be shipped (genres other, gift, lesson or delivery), and hence location does not matter, the attribute is set to 1, otherwise to 0. Furthermore, if the user location-item location pair occurs in the previous purchase log for goods that cannot be shipped, the attribute is also set to 1.
Finally, the features derived from interaction data are explained; here, interaction data refers to historic view data. The dataset spans 168,996 purchases from 01-07-2011 until 23-6-2012, and coupons are generally only active for a limited number of days. It is important to remark that, because the data spans only one year of transactions, purchasing behaviour is assumed not to change over time and hence no concept drift takes place. This does not hold in practice, but given the scope of this research the assumption is followed. Figure 3.12 displays a schematic overview of the addition of features based on interaction data.
Figure 3.12: Schematic overview of use of interaction data
Moreover, because no predefined application scenario is available, a few assumptions have to be made about the moment of recommendation and, therefore, about which interaction data is available at prediction time. The decision has been made to base the construction of features on 12 periods of one month each. The first month is only used to construct interaction-related features for month 2; from month 2 up to the last month, the time-dependent features are computed over the previous month. This avoids using data which, in a realistic scenario, would not be available for prediction due to processing delay, or because it would lie in the future.
The first example of a feature using interaction data stems from the observation that similar coupons are sometimes placed on the website again, as explained in Section 3.2. Therefore, to get an impression of user interest in a coupon, similar coupons are looked up by coupon key (discount price, catalog price, capsule text, genre), and the number of purchases for this key in the previous month is added as a feature.
Moreover, user preference over time is captured in two ways: for prefecture and for genre. This is done by grouping the visits by user, month and genre, as can be seen in Table 3.2. The visits are counted and divided by the total number of visits made by the user in that month, leading to a probability for every genre per month, denoted as prob g in the table. These values are added to records for this user-genre combination in the following month. Hence, a relatively high number of visits for a genre in the previous month by a user leads to a high score for this feature. In a similar manner, a score is computed for prefecture (location). These features are added to the views for the following month, as can be seen in Table 3.4. Finally, an overview of all used features can be found in Table B.1.
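The genre-preference feature can be computed as sketched below; the visit log reproduces the 18 visits of the user in Table 3.2 in an invented, simplified form.

```python
# Hedged sketch of the genre-preference feature (prob_g): per user and month,
# visits per genre divided by the user's total visits that month.
from collections import Counter, defaultdict

# invented visit log: 18 visits of user u1 in month 1, as in Table 3.2
visits = [("u1", 1, "Food")] * 12 + [("u1", 1, "Leisure")] * 3 \
       + [("u1", 1, "Delivery")] * 2 + [("u1", 1, "Relaxation")]

counts = Counter(visits)                  # (user, month, genre) -> visit count
totals = defaultdict(int)
for (user, month, _), n in counts.items():
    totals[(user, month)] += n            # total visits per user and month

prob_g = {k: n / totals[(k[0], k[1])] for k, n in counts.items()}
print(round(prob_g[("u1", 1, "Food")], 6))  # 0.666667
```

The resulting probabilities would then be joined onto the views of the following month, as in Table 3.4; the prefecture feature prob p follows the same pattern with genre replaced by prefecture.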
Table 3.2: Example views for constructing genre feature

USER ID  genre       month  visits  totalvisits  prob g
1        Delivery    1      2       18           0.111111
1        Food        1      12      18           0.666667
1        Leisure     1      3       18           0.166667
1        Relaxation  1      1       18           0.055556

Table 3.3: Example views for constructing prefecture feature

USER ID  pref      month  visits  totalvisits  prob p
1        Gunma     1      1       18           0.055556
1        Kanagawa  1      1       18           0.055556
1        Saitama   1      1       18           0.055556
1        Tokyo     1      15      18           0.833333

Table 3.4: Example view with genre and prefecture features

USER ID  month  genre     pref   prob g    prob p
1        2      Delivery  Tokyo  0.111111  0.833333
1        2      Hotel     Tokyo  0         0.833333
1        2      Hotel     Tokyo  0         0.833333
1        2      Delivery  Tokyo  0.111111  0.833333
1        2      Other     Tokyo  0         0.833333
3.4 Hold-out
As explained in the remainder of this section, the XGBoost model will be trained using different weighting schemes. Furthermore, XGBoost has several parameters, as shown in Table 3.5. Therefore, both the XGBoost parameters and the most appropriate weighting scheme need to be determined reliably.
K-fold cross validation is often referred to as the gold standard: the data is split into k folds, and in each of the k iterations a different fold is used as evaluation set. The average error over the k folds is then taken as an estimator of the true error. This gives a more reliable estimate because the variance resulting from dividing the data into train and test sets is reduced.
In general, a grid search trains the model with different parameter sets on the data; the resulting models are then compared with respect to defined performance criteria in order to choose the best parameters. However, tuning the parameters and the weights cannot be combined in a single grid search, because it is not possible to incorporate instance weights in the SKlearn grid search. Therefore, nested cross validation would be the most appropriate alternative: for each of the k folds dividing the data into train and test sets, determine the best parameter set using cross validation with a grid search, train the model using the different weights, and continue to the next fold.
However, even for a limited grid with three possible values for each of two parameters and k = 10, this would already amount to 3 * 3 * 10 = 90 model trainings per fold, and therefore 900 fits in total. Because this is not feasible due to time constraints, a different combination of validation strategies is chosen: first, a significant proportion of the data is held out by splitting on day. All views before day 250 end up in the training set, which consists of 969,109 views; the hold-out consists of views on day 250 or later and includes 515,208 views. This hold-out is used as the final validation set for the different weight strategies, and the training set is used to determine parameters with k-fold cross validation in a grid search.
Because the data is highly unbalanced, the training set is balanced using undersampling: out of the 905,000 negative samples in the training set, 63,400 are randomly sampled, so that the final training set consists of 63,400 positive and 63,400 negative samples, a 1:1 ratio.
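The undersampling step can be sketched as follows; the sample sizes here are small illustrative stand-ins for the actual 63,400 positives and 905,000 negatives.

```python
# Hedged sketch of the undersampling step: all positives are kept and an
# equally sized random subset of negatives is drawn for a 1:1 class ratio.
import random

random.seed(42)  # fixed seed so the sketch is reproducible
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(100)]

sampled_neg = random.sample(negatives, len(positives))  # without replacement
training_set = positives + sampled_neg
print(len(training_set))  # 10
```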
3.5 Grid search
In order to determine the parameters, a grid search is performed. The procedure is as follows: the training data is divided into k = 10 folds. For every fold, the kth fold functions as validation set and the remainder as training data. For every possible combination of parameters a model is fit and the ROC-AUC is calculated. The mean AUC score over all folds gives an estimate of the performance for the different parameter settings, and the best setting is chosen.
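The procedure above can be sketched in a self-contained way; the scoring stub below stands in for fitting XGBoost on a fold and computing ROC-AUC, and is deliberately rigged so that the combination found best in Table 3.7 wins.

```python
# Hedged sketch of grid search with k-fold cross validation. The score
# function is a stub replacing the real fit-and-evaluate step.
import itertools

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(n, k, grid, score):
    """Return the parameter combination with the best mean fold score."""
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        mean = sum(score(params, fold) for fold in kfold_indices(n, k)) / k
        if best is None or mean > best[1]:
            best = (params, mean)
    return best

grid = {"eta": [0.09, 0.12, 0.15], "max_depth": [5, 6, 7]}
# Stub: pretend eta=0.09, max_depth=7 generalises best (cf. Table 3.7).
stub = lambda p, fold: -abs(p["eta"] - 0.09) - abs(p["max_depth"] - 7)
best_params, _ = grid_search(1000, 10, grid, stub)
print(best_params)  # {'eta': 0.09, 'max_depth': 7}
```

In the actual experiments, the score function would train an XGBoost model on the nine training folds and return the AUC on the held-out fold.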
Table 3.5 contains an overview of the parameters that can be tuned for XGBoost. The number of trees refers to the number of trees to grow, eta to the learning rate η, and max depth to the maximum depth of a tree. Usually the number of trees is chosen between 100 and 1000 and held fixed so that the best values for the other parameters can be found; in this case it is set to 200. The learning rate and maximum depth define the model complexity and also control overfitting. Here, it was opted to do a grid search with depths (5, 6, 7) and eta (0.09, 0.12, 0.15), based on recommended values in the documentation.
Subsample refers to row sampling: setting it to 0.8 means that every single tree is grown on 80% of the data. Colsample bytree refers to column sampling, with a value of 0.8 indicating that every tree is grown on 80% of the columns. Both can also be used to control overfitting, as they reduce fitting in every iteration by including only a fraction of the rows and columns, respectively. In this case they are left fixed at values from the XGBoost documentation, but they could also be tuned, for instance by taking them into account in the grid search.

Table 3.5: Overview of parameters in grid

name              description       value/range in grid search
n estimators      number of trees   200
eta               learning rate     (0.09, 0.12, 0.15)
subsample         row sampling      0.8
colsample bytree  column sampling   0.8
max depth         max tree depth    (5, 6, 7)
min child weight  min leaf weight   1

The results of the grid search with 10-fold cross validation are shown in Table 3.7. As the configuration with a learning rate of 0.09 and a depth of 7 obtains the highest mean AUC over the 10 folds, this configuration is chosen for training the model on the entire training set. Table 3.6 contains an overview of the objective and parameters used.
Table 3.6: Values after parameter tuning

Name             | Description       | Value
objective        | the learning task | binary:logistic
eval_metric      | evaluation metric | auc
n_estimators     | number of trees   | 200
eta              | learning rate     | 0.09
subsample        | row sampling      | 0.8
colsample_bytree | column sampling   | 0.8
max_depth        | max tree depth    | 7
min_child_weight | min leaf weight   | 1
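The tuned configuration can also be written down as a parameter dictionary. The names below follow the scikit-learn-style XGBoost API, in which eta is exposed as learning_rate; this is a sketch of the configuration, not the thesis code itself.

```python
# Configuration of Table 3.6 as an XGBoost-style parameter dictionary
# (sklearn-API parameter names assumed).
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "n_estimators": 200,
    "learning_rate": 0.09,   # eta
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "max_depth": 7,
    "min_child_weight": 1,
}
```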
Table 3.7: Results of the grid search with 10-fold cross validation

learning rate | max depth | n_estimators | rank | µ AUC  | σ
0.09          | 7         | 200          | 1    | 0.7356 | 0.0042
0.12          | 7         | 200          | 2    | 0.7352 | 0.0050
0.15          | 7         | 200          | 3    | 0.7343 | 0.0048
0.12          | 6         | 200          | 4    | 0.7343 | 0.0045
0.15          | 6         | 200          | 5    | 0.7343 | 0.0043
0.15          | 5         | 200          | 6    | 0.7336 | 0.0048
0.09          | 6         | 200          | 7    | 0.7331 | 0.0042
0.12          | 5         | 200          | 8    | 0.7318 | 0.0049
0.09          | 5         | 200          | 9    | 0.7301 | 0.0049
3.6 Model training
XGBoost is a popular implementation of gradient tree boosting, as described by Chen and
Guestrin (2016b). It builds on the gradient boosting algorithm originally developed by
Friedman (2002), altered slightly for use in XGBoost. As is the case with all boosted tree
learning, XGBoost uses an ensemble of trees; in the case of XGBoost, these are CART trees.
An overview of how the CART trees are grown looks as follows:
for b rounds:
1. Grow the tree greedily to the maximum depth, according to the objective:
(a) until depth d is reached:
i. find the best splitting point
ii. assign values to the two new leaves
2. Prune the tree to remove nodes with negative gain
Every tree is grown to maximum depth by iteratively adding splits until the maximum
depth d is reached. Splits are added by sorting all values for every feature and calculating
the gain for every split point. Then, the best splits per feature are compared and the best
split is chosen.
The split gain is derived from the objective, which consists of a loss function and a
regularization function. The loss function penalizes prediction error; the regularization
penalizes complex trees. The objective at round t looks as follows:

\sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + C \qquad (3.1)
Here, L denotes the loss function, which in this case is logistic loss, and Ω is a function
penalizing complex trees, which reduces overfitting. The first and second derivatives of the
loss function are taken for a Taylor approximation. Furthermore, the objective function is
transformed to loss per leaf instead of per data point. This leads to the following equation
for the gain:
\mathrm{Gain} = \frac{1}{2}\left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad (3.2)
Here, I is the set of all indices of data points assigned to the node, and I_L and I_R are
the sets of data points assigned to the candidate new left and right child nodes. g_i and h_i
denote the gradient and Hessian of the objective respectively, which depend on the chosen
loss function. The intuition is that a split has a higher gain if it contributes more to the
objective, which is minimizing loss.
After deciding on the best split point, the leaf value o can also be easily obtained:

o_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \qquad (3.3)
Finally, the tree is pruned backwards to remove splits with a negative gain. Moreover, for
each node, all points with a missing value are sent in the direction which yields the largest
total gain. This gives every node a default direction for handling missing values.
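As a sketch of how Equations 3.2 and 3.3 are evaluated for a single candidate split, the gain and leaf-value computations could look as follows. The function names are illustrative and not part of XGBoost.

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node I into I_L / I_R (Equation 3.2)."""
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)

    def score(gs, hs):
        # (sum of gradients)^2 / (sum of Hessians + lambda)
        return gs.sum() ** 2 / (hs.sum() + lam)

    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h)) - gamma

def leaf_value(g, h, lam=1.0):
    """Optimal leaf output o for the points in a leaf (Equation 3.3)."""
    return -np.sum(g) / (np.sum(h) + lam)
```

In the real algorithm these quantities are evaluated for every sorted split candidate of every feature, and the split with the largest gain is kept.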
As explained earlier, the problem at hand is a case of binary classification, where 1 denotes
a purchase and 0 denotes no purchase. Therefore, the XGBoost objective is set to binary
classification, which is a logistic regression where the output of the model denotes a
probability, according to the documentation. After every iteration the validation data is
used to calculate the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC),
based on a threshold value of 0.5. If the ROC-AUC has not improved after 10 iterations, the
algorithm is stopped and the best performing ensemble is selected as the final model.
3.7 Calculate weights
Tree boosting produces multiple classification or regression trees in sequence, leading to
an ensemble of trees as the final model. In instance weighting, the initial weights are
adjusted to emphasize certain instances in the growing of the trees. The weights travel
along with the instances for the duration of the algorithm and will therefore influence the
different trees that together form the final ensemble.
The weighted loss function for XGBoost looks as follows (Chen and Guestrin, 2016b):

\sum_{i=1}^{n} \frac{1}{2} w_i \big( f_t(x_i) - g_i / w_i \big)^2 + \Omega(f_t) + \text{constant} \qquad (3.4)
In this function w_i represents the weight attached to an instance. In the original algorithm
the weight is equal for every data point. In the case of weighted training, however, the rows
influence the gain according to their weight. Therefore, for all candidate splits the gain
will be weighted, leading to different split points.
Ting (2002) describes that weights can be successfully applied to grow cost-sensitive trees
in a multiclass classification scenario. A similar intuition is followed here: less popular
items are emphasized by changing their weights. The expectation is that if these items are
emphasized during the growing of trees, the resulting trees will yield a higher score for
similar less popular items and hence recommend more items from the long tail, which would
contribute to achieving the research goal.
In this case w(i) represents the weight for record i, v_i the number of views for item i,
p_i the number of purchases for item i, and n the total number of records.
Taking 1 divided by the count of views or purchases, a low frequency in the training set
corresponds to a high weight and vice versa. This provides the first two weight definitions,
as can be seen in Equations 3.5 and 3.6:

w(i) = \frac{1}{v_i} \quad (3.5) \qquad\qquad w(i) = \frac{1}{p_i} \quad (3.6)
Moreover, weights can also be defined inspired by the inverse document frequency, as is
popular in information retrieval. In this case the document frequency is replaced by the
number of views or purchases, which results in the weights in Equations 3.7 and 3.8. Finally,
the differences can be amplified by squaring the counts, as displayed in Equations 3.9
and 3.10.

w(i) = \log(n / v_i) \quad (3.7) \qquad\qquad w(i) = \log(n / p_i) \quad (3.8)

w(i) = \frac{1}{v_i^2} \quad (3.9) \qquad\qquad w(i) = \frac{1}{p_i^2} \quad (3.10)
After applying a weighting scheme, the weights are scaled such that the total of all weights
equals the number of records. This is done by counting all records and taking the sum of all
weights, which yields a scaling factor that is then applied to all weights.
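The weighting schemes of Equations 3.5–3.10 and the subsequent rescaling can be sketched as follows. The function name and the scheme labels are illustrative, not part of the thesis code.

```python
import numpy as np

def instance_weights(counts, scheme="inv"):
    """Per-record weights from item view/purchase counts (Eqs. 3.5-3.10).

    scheme: 'inv' -> 1/c, 'idf' -> log(n/c), 'inv2' -> 1/c^2
    """
    c = np.asarray(counts, dtype=float)
    n = len(c)
    if scheme == "inv":
        w = 1.0 / c
    elif scheme == "idf":
        w = np.log(n / c)
    elif scheme == "inv2":
        w = 1.0 / c ** 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # rescale so that the weights sum to the number of records (Section 3.7)
    return w * n / w.sum()
```

The resulting array can then be passed as per-instance weights when training, so that low-frequency items receive more emphasis during tree growing.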
3.8 Evaluation
Gunawardana and Shani (2009) indicate that in scenarios where the cutoff value n is not
clear, performance can be calculated over a range of values for n. In this case n is chosen
from 1 to 10 and the list performances are calculated over this range. Furthermore, this is
done for head items, tail items and all items. The items are split into head and tail subsets
by applying a split at 50% of sales volume.
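The 50%-of-volume split can be sketched as follows; the helper name is hypothetical and the items are ordered by popularity so that the head is the smallest top set covering at least half of the purchases.

```python
def head_tail_split(purchases):
    """Split items into head and tail at 50% of total sales volume.

    purchases: dict mapping item id -> number of purchases.
    """
    total = sum(purchases.values())
    head, covered = [], 0
    for item, count in sorted(purchases.items(), key=lambda kv: -kv[1]):
        if covered >= total / 2:
            break
        head.append(item)
        covered += count
    tail = [item for item in purchases if item not in head]
    return head, tail

head, tail = head_tail_split({"a": 6, "b": 2, "c": 1, "d": 1})
```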
The first metric to be computed is catalogue coverage, which is nothing more than the
number of unique items present in the recommendation lists L of length j, I_L^j, of all
users U, as a fraction of all items I. In case the catalogue consists of 5 items and only
3 items are present in the recommendation lists of all users, the catalogue coverage equals
3/5. The coverage of tail items is considered most important, as the coverage of head items
is expected to be high regardless.

\mathrm{coverage}_c = \frac{\left| \bigcup_{j=1,\dots,n} I_L^j \right|}{|I|} \qquad (3.11)
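Equation 3.11 translates directly into code; a minimal sketch with an illustrative helper name, reproducing the 5-item example from the text:

```python
def catalogue_coverage(recommendation_lists, catalogue):
    """Fraction of catalogue items appearing in any user's list (Eq. 3.11)."""
    recommended = set()
    for rec_list in recommendation_lists:
        recommended.update(rec_list)
    return len(recommended & set(catalogue)) / len(catalogue)

# 3 distinct items recommended out of a 5-item catalogue -> 3/5
cov = catalogue_coverage([["a", "b"], ["b", "c"]], ["a", "b", "c", "d", "e"])
```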
Second is the Gini index. Shani and Gunawardana (2011) use the Gini index to measure the
concentration of recommendations over different items. Equation 3.12 contains the Gini
index. The Gini index is 1 in case of maximum inequality, which occurs when one item
receives all recommendations. It becomes 0 in case of maximum equality, when all items get
the same recommendation frequency. Hence, a lower Gini index represents a more equal
distribution of recommendations.

G = \frac{\sum_{i=1}^{I} (2i - I - 1)\, z_i}{(I - 1) \sum_{i=1}^{I} z_i} \qquad (3.12)
In this case I is the number of items in the result set, i the index of an item and z_i the
number of recommendations per item. After taking all items in the user predictions, the
positive predictions are grouped by coupon id, which leads to a list of coupon ids and
counts. As the Gini index requires a sorted list, the item list is first sorted on number of
recommendations, after which the Gini index is calculated according to the equation.
Thirdly, several metrics are computed for conventional classification performance. For every
list length n, a confusion matrix is calculated. The true positives are the intersection of
predicted positives and actual positives, the false positives the intersection of predicted
positives and actual negatives, the false negatives the intersection of predicted negatives
and actual positives, and the true negatives the intersection of predicted negatives and
actual negatives. These counts are used to compute precision, recall and the F1 score.
Table 3.8: The confusion matrix

                | predicted positive | predicted negative
actual positive | TP                 | FN
actual negative | FP                 | TN
These scores for different list lengths n are used to construct precision-recall curves.
This is done for all items, as well as for the head and tail subsets into which the items
are split.
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3.13)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3.14)

F1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (3.15)
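Computed from sets of predicted and actual positives, Equations 3.13–3.15 look as follows (illustrative helper, with the usual guards against empty denominators):

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 (Equations 3.13-3.15) from sets of
    predicted and actual positives."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1
```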
Finally, it is important to assess the significance of the obtained results. McNemar's test
considers two classifiers, f_a and f_b. For every example in the test set, it is recorded
whether it is misclassified by both f_a and f_b, by f_a only (n_{01}), by f_b only (n_{10}),
or by neither. After classifying all records into a category it is easy to compute χ², as can
be seen in Equation 3.16:

\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} \qquad (3.16)
The χ² value can then be used to determine p-values using a chi-squared distribution with
one degree of freedom. It is important to note that this test gives significance scores for
pairwise comparisons; therefore, the results should only be used to compare two models, or
should be corrected for multiple comparisons.
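Equation 3.16 and the corresponding p-value can be sketched with the standard library, using the identity that the survival function of a chi-squared distribution with one degree of freedom equals erfc(√(x/2)). The function name is illustrative.

```python
import math

def mcnemar_p_value(n01, n10):
    """McNemar's test (Equation 3.16): p-value for the difference
    between two classifiers, with the continuity correction."""
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # survival function of chi-squared(1): P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))
```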
3.9 Case study: tour operator
Apart from the Ponpare data introduced earlier, the algorithm is also tested on another
data set, which concerns purchase and view data of a tour operator. This section describes
the business context of this case study, some data characteristics and the features used.
The data concerns transaction data from a tour operator, which offers package holidays for
the leisure holiday market in The Netherlands. Customers can browse the offerings on the
tour operator's website and book their holiday there.
The tour operator offers a variety of destinations and types of accommodation, each with a
different target audience. For instance, some accommodations cater to single travellers,
while others are more suitable for families. Furthermore, the holiday offerings have
different price ranges and are situated in different regions. The objective in this case is
to recommend users suitable holiday products, taking these characteristics into account.
3.9.1 Data understanding
The data consists of accommodation and user combinations. Item properties consist of
AccommodationCountry, AccommodationCity, StarRating, Childfriendly, Only-adult and
Average-rating. User properties consist of InfantCount, AdultCount, ChildCount and Age
of the main booker.
Figure 3.13 contains an overview of the age distributions of bookings per country. In this
case only the three most important countries are listed, with a further separation for
accommodations where no children are allowed. The figure makes clear that accommodations
where no children are allowed are popular among customers in the age bin 60-70. Figure 3.14
displays the number of bookings versus the average customer rating and the star category.
Upon inspection it appears that most accommodations are 3, 4 or 5 star accommodations. Most
of the customer ratings are within the range between 7 and 9, and the four most popular
accommodations have a rating above 8.
Figure 3.13: Age distribution of bookings
3.9.2 Data preparation and feature calculation
Unlike in the Ponpare scenario, the data contains at most one record per user-item
combination, so it is not necessary to aggregate on user-item. The records contain some
properties derived from the user side and some properties that stem from the item side,
which can be used as predictors. Furthermore, the records contain a BookedIndicator, which
is 1 in case of a booking and 0 otherwise.
In this case we decide to construct no features denoting preference, as this data is not
available, but simply encode the data for use by XGBoost.
Firstly, the accommodation rating is converted from a categorical to a decimal value. A 3*
rating is converted to 3.0, a 3+ rating to 3.5 and so on. Furthermore, dummies are created
for AccommodationCity, ChildFriendly and ‘Only-adult’.
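The rating conversion can be sketched as follows; the helper name is hypothetical:

```python
def star_rating_to_decimal(rating):
    """Convert a categorical star rating to a decimal value:
    '3*' -> 3.0, '3+' -> 3.5, '4*' -> 4.0, and so on."""
    base = float(rating[:-1])
    return base + 0.5 if rating.endswith("+") else base
```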
The data is split such that 66% of the data ends up in the training set and 33% in the
hold-out set. The split ensures that each accommodation ends up in either the training or
the validation data. This yields a training set of 217 420 records and a hold-out set of
108 710 records. Because the data is unbalanced, undersampling is applied to the training
data, yielding 17 471 positive instances and 17 741 negatives in the final training set.
3.9.3 Grid search
On the training part, an initial parameter set is determined using stratified k-fold cross
validation with k = 10. The grid search is performed with the same grid values for the
parameters as in the Ponpare case. The results of the grid search with 10-fold cross
validation can be found in Table 3.10.
(a) versus customer rating (b) versus star rating
Figure 3.14: Number of bookings per accommodation
Table 3.9: Overview of used parameters for tour operator

Name             | Description     | Value
n_estimators     | number of trees | 200
eta              | learning rate   | 0.15
subsample        | row sampling    | 0.8
colsample_bytree | column sampling | 0.8
max_depth        | max tree depth  | 7
min_child_weight | min leaf weight | 1
Table 3.10: Results grid search tour operator

learning rate | max depth | n_estimators | rank | µ AUC   | σ
0.15          | 7         | 200          | 1    | 0.86031 | 0.00560
0.12          | 7         | 200          | 2    | 0.85655 | 0.00539
0.15          | 6         | 200          | 3    | 0.85475 | 0.00487
0.09          | 7         | 200          | 4    | 0.85197 | 0.00549
0.12          | 6         | 200          | 5    | 0.85066 | 0.00428
0.15          | 5         | 200          | 6    | 0.84698 | 0.00575
0.09          | 6         | 200          | 7    | 0.84509 | 0.00572
0.12          | 5         | 200          | 8    | 0.84273 | 0.00527
0.09          | 5         | 200          | 9    | 0.83663 | 0.00548
4 Results
This chapter describes the results obtained on both the Ponpare dataset and the tour
operator dataset. First, the random selection baseline is explained, after which results
are presented for both datasets. Finally, the insights that can be drawn from these results
are discussed.
4.1 Random selection algorithm
In order to compare the approaches against a baseline, a random selection algorithm is used.
The approach counts the positive instances in the test set and assigns the same number of
positives to random indices in the test set, drawn from a uniform distribution, leaving all
other records negative. This keeps the positive rate the same as in the actual test set.
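A sketch of this baseline; the function name is illustrative:

```python
import numpy as np

def random_selection(y_true, seed=0):
    """Random baseline: place as many positive predictions as the test set
    contains actual positives, at uniformly drawn indices."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    prediction = np.zeros(len(y_true), dtype=int)
    positives = rng.choice(len(y_true), size=int(y_true.sum()), replace=False)
    prediction[positives] = 1
    return prediction
```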
4.2 Ponpare
The distribution of coupons in the hold-out set, which can be seen in Figure 4.1, is examined
first. 4989 coupons appear in the hold-out set, which are purchased 35005 times altogether.
A split is made at 50% of all purchases, which results in a head of 381 items and a tail of
4608 items; these will from now on be referred to as ‘head’ and ‘tail’ items.

Figure 4.1: Purchases per coupon in test set

Figure 4.2 displays the precision-recall curve for all items in the Ponpare dataset. It
becomes evident that all weighting schemes outperform a random selection. However, all
models are close to each other, including the variant with equal instance weights, which is
labeled as ‘noweights’.
Figure 4.2: Precision/Recall for all items in the Ponpare dataset
Figure 4.3: Precision/Recall for head items in the Ponpare dataset
Figure 4.2, Figure 4.3 and Figure 4.4 contain the precision-recall curves for all items, the
head items and the tail items in the Ponpare dataset, respectively.
For the head and tail subsets, it becomes clear that performance in terms of precision and
recall is much worse for the tail than for the head subset.
Additionally, it is interesting to see that for the entire dataset, ‘idflike-views’ (3.7),
‘idflike-purchases’ (3.8) and ‘noweights’ are all close to each other and perform best. The
four remaining models are also close to each other and form the group of worst performing
models, with ‘inv-purchase-count2’ (3.10) and ‘inv-view-count2’ (3.9) performing the worst
for the largest part of the plot.

Figure 4.4: Precision/Recall for tail items in the Ponpare dataset
Finally, it must be remarked that the ordering of the models for all items and for the head
items subset seems similar. However, it is remarkable that for head items some of the worst
performing models perform worse than the random selection algorithm.
Figure 4.5: Coverage for head items in the Ponpare dataset
Figures 4.5, 4.6 and 4.7 show the coverage of the head, tail and all items in the Ponpare
set, respectively. In the different figures, it can be seen that the numbers of unique items
do not differ much between the models; however, the coverage is always higher for random
selection than for the other models.

Figure 4.6: Coverage for tail items in the Ponpare dataset

Figure 4.7: Coverage for all items in the Ponpare dataset

Figure 4.8 shows the Gini-index for different lengths of recommendation lists. Once again,
the different models are close to each other. Furthermore,
it is important to note that a random selection obviously yields the most equal distribution.
However, it is also important to see that some of the different models clearly show a lower
Gini-index than ‘noweights’. Finally, it is important to note that ‘inv purchase count2’(3.10)
shows the lowest Gini-index of all models and that the models get closer as the list length
increases.
Figures 4.9, 4.10 and 4.11 show the F1 scores for different list lengths on the Ponpare set.
Here it can be seen that the F1 score is rather low in the beginning, as a lot of positives
are missed, but improves for longer lists as the recall increases faster than the precision
decreases.

Figure 4.8: Gini-index for different list lengths in the Ponpare dataset

Figure 4.9: F1 score for different list lengths in the Ponpare dataset

Figure 4.12 shows the Gini index and F1 score for Ponpare, which confirms that, in general,
a lower Gini-index corresponds to a lower F1 score. In this case, the lines run from right
to left, starting by raising the Gini index while maintaining a roughly equal F1 score, and
moving towards a lower Gini index and F1 score towards the end of the lists.
Finally, Table C.1 shows the p-values for pairwise comparisons of the algorithms. When
comparing the constructed models against the default scenario, ‘noweights’, it becomes clear
that the difference is not significant for ‘idflike-purchases’ and ‘idflike-views’. For all
other algorithms, however, the difference is significant.

Figure 4.10: F1 score for different list lengths in the Ponpare dataset

Figure 4.11: F1 score for different list lengths in the Ponpare dataset
4.3 Tour operator
In the remainder of this section the results for the tour operator are discussed. The
distribution of purchases over coupons in the test set can be seen in Figure 4.13. When
this plot is compared to the distribution of coupons in the Ponpare set, it becomes clear
that this distribution is less equal. In this case, the validation set consists of 3217
accommodations, with a total of 8659 bookings. The validation part is split in two parts
according to bookings, which means that the head of 117 accommodations is responsible for
the first half of all bookings and the remaining 3100 accommodations are responsible for
the second half.

Figure 4.12: Gini-index and F1 score in the Ponpare dataset

Figure 4.13: Purchases per coupon in test set
Figure 4.14 displays the precision-recall curve for all items in the tour operator dataset.
It can be seen that random selection performs worst for all items. Three models clearly
perform best, with similar performance: ‘noweights’, ‘idflike-views’ (3.7) and
‘idflike-purchases’ (3.8). The models ‘inv-view-count’ (3.5) and ‘inv-purchase-count’ (3.6)
lag behind this group of three. The worst performing models are ‘inv-purchase-count2’ (3.10)
and ‘inv-view-count2’ (3.9).
Figure 4.15 displays the performance of the different models on the head items. Here, it can
be seen that for short list lengths, the random selection outperforms the different models.
For longer lengths, however, the models outperform the random selection. Figure 4.16 shows
the precision-recall curve for the tail items. Here it is clear that all models except the
squared models perform very similarly, and better than the squared models and the random
selection, which performs worst.
It is remarkable that for both the head and the tail items, the models as well as the random
selection obtain a rather high recall. If even a random selection achieves a high recall,
the number of items per user in the test set could be less than 10 in many cases, leading
to almost exclusively positive predictions and hence a high recall score.
Figures 4.19, 4.17 and 4.18 contain the coverage for different recommendation list lengths.
Figure 4.14: Precision/Recall for all items in the tour operator dataset
Here, it is clear that the coverage for head items is 100% and that for tail items the random
selection has the highest coverage.
Figure 4.20 shows the Gini-index for all models on the tour operator dataset. Here, it is
clear that the random selection obtains the lowest Gini-index, as the resulting item
recommendation set is the most equal. The remaining models are relatively close; however,
‘inv-purchase-count2’ (3.10), which performed worst of all models, obtains the lowest
Gini-index of those models, although for higher list lengths it becomes as unequal as the
other models. Figures 4.21, 4.22 and 4.23 show the F1 scores for the head subset, tail
subset and all items.
Figure 4.24 displays the Gini index and the F1 score for the different models. This shows
that a lower Gini index in general corresponds to a lower F1 score. Furthermore, it is
remarkable that ‘inv-view-count2’ in this scenario follows a pattern that is partly similar
to that of the random model; hence, this model performs particularly poorly in terms of
precision and recall.

Figure 4.15: Precision/Recall for head items in the tour operator dataset

Figure 4.16: Precision/Recall for tail items in the tour operator dataset
Finally, Table C.2 shows the p-values for pairwise comparisons of the algorithms. Here too,
it is clear that ‘noweights’ differs significantly from all models except ‘idflike-purchases’
and ‘idflike-views’.
Figure 4.17: Coverage for head items in the tour operator dataset
Figure 4.18: Coverage for tail items in the tour operator dataset
4.4 Summary of insights
It is clear that all models perform very similarly to each other. In order to investigate
the reason behind this, it is important to recall the XGBoost algorithm and where the
weights impact the learned models. At every iteration, a tree is formed, which decides upon
the best splits by calculating a weighted gain according to the loss function, which in this
case is logistic loss. In the case of altered weights, the gain per possible split is a
weighted average: errors on cases with a high weight are penalized more heavily. Hence,
single trees are steered towards performing well on cases with a higher weight.
Figure 4.19: Coverage for all items in the tour operator dataset
Figure 4.20: Gini-index for different list lengths in the tour operator dataset
When evaluating the generated trees, AUC is used, which is an unweighted metric for
classification error. The similarity of all models might thus be explained by the fact that
different weights lead to different trees, while the evaluation function in XGBoost is
unweighted and hence steers the ensemble in the direction of general prediction quality.
Furthermore, the relatively small difference with random selection can most likely be
explained by two reasons. First, in the case of Ponpare only a few features were
constructed, and in the case of the tour operator dataset no features were constructed at
all; it is evident that trees benefit from constructing more features. Second, in the tour
operator test set, users have a mean of 10.98 viewed items per user. Hence, for some users
items will be classified as positive independent of score, which also explains the good
performance of random selection. This could be solved by adding more user-item combinations.

Figure 4.21: F1 score for different list lengths in the tour operator dataset

Figure 4.22: F1 score for different list lengths in the tour operator dataset
When comparing the performance of the different models in terms of precision and recall, a
few things become clear as well. ‘idflike-purchases’, ‘idflike-views’ and ‘noweights’ are
among the best performing models in terms of precision and recall in both cases.
Furthermore, it appears that a less equal division of weights typically means worse
performance in terms of precision-recall. This was expected, as the evaluation data is
biased towards purchases of popular (head) items. Therefore, it could be interesting to
research the proposed weights in a scenario involving users, to see if the altered
recommendation system leads to different user behaviour.

Figure 4.23: F1 score for different list lengths in the tour operator dataset

Figure 4.24: Gini-index and F1 score in the tour operator dataset
Both weights that square the counts, namely ‘inv-view-count2’ and ‘inv-purchase-count2’,
perform consistently among the worst models. This could indicate that setting extreme
weights does not enable the single trees to learn relevant relationships for purchases and
hence leads to worse precision and recall. This can be caused by the fact that further down
the tail more noise and variance exist, which makes it difficult to extract meaningful
splits.
When comparing precision-recall and the Gini-index for the different models, it appears
possible to use weights to obtain a more equal recommendation set. However, better
performance in terms of precision and recall means a less equal distribution (a higher
Gini-index). This can be seen in the results for both datasets.
Concluding, the results show that this approach makes it possible to deliver a more
balanced recommendation set, although this typically comes at the cost of lower precision
and recall. Furthermore, the constructed models consistently outperform a random selection
in terms of precision and recall.
5 Implementation
This chapter discusses the usage of the developed approach in the online retail industry.
First, an overview of the technologies used in the implementation is given; second, the
benefits for retailers are discussed.
5.1 Used software
For the developed pipeline only open source software has been used.
• Python Python is a general-purpose programming language with automatic memory
management and support for both functional and object-oriented programming. Python is an
open source project, ships with a large standard library and supports many third-party
packages. The entire project pipeline runs on Python 3.6.
• Pandas Pandas is a Python library containing many functionalities for working with
labeled datasets, such as statistical operations, data operations like joining (merging)
tables, handling missing data, and group-by, pivoting and reshaping functionalities
(McKinney, 2011). This project uses the Pandas library for many data operations, as well as
some preprocessing steps.
• Jupyter Notebooks Jupyter Notebooks are documents containing both code, for instance
Python, and rich text. This means a notebook contains the analysis as code snippets as well
as the results by means of visualizations, tables, etc. The Jupyter Notebook app has been
used for all parts of this project.
• SKlearn Scikit-learn is a library containing tools for data mining and data analysis.
It contains many classification and regression algorithms, as well as tools for model
selection such as grid search, cross validation and evaluation modules. This project uses
sklearn for grid search using k-fold cross validation, and for some of its evaluation
modules.
• Numpy Numpy is a package for scientific computing in Python and contains, amongst
others, an N-dimensional array object and linear algebra functions. Pandas makes use of
Numpy arrays. This project uses Numpy through Pandas, but also uses some Numpy functions
for elementwise operations on arrays.
• Scipy Scipy is a package for mathematics, science and engineering. In this project
Scipy is used for instance to calculate the Pearson correlation coefficient.
• Matplotlib Matplotlib is a plotting package for Python. Matplotlib makes it possible
to display for instance bar charts, histograms, scatterplots etc. All visualizations in
this project are made using Matplotlib. For some plots the package Seaborn is used
as well.
• XGBoost XGBoost is an implementation of gradient tree boosting and is discussed
earlier.
• KNIME KNIME is an open source data analytics, reporting and integration platform.
In this project it has been used for some statistical analysis in the data exploration
phase.
The usage of standard open-source components means the developed solution is easy to
implement, as only an environment capable of running Python is needed. Furthermore, because
only open-source software is used, no license costs are involved.
The developed pipeline can easily be adapted for new clients of Building Blocks, in a
similar manner as has been done for the tour operator. The model tuning capabilities, as
well as training with different weights and evaluation, can be used directly by executing
the existing Python script. However, data preprocessing steps are specific to the case at
hand and should therefore be changed for new datasets. Furthermore, manual analysis of the
evaluation results is needed to select the most appropriate weighting scheme.
5.2 Quality aspects
• Flexibility The developed approach can be used in many different scenarios where
recommendation of items to a user is needed. The proposed approach can be used for
personalized offerings through many different channels, such as e-mail marketing,
advertisement or product recommendation in a webshop; many more examples are possible in
scenarios where personalized product advice is needed. Once the connection with the data is
set up, only the preprocessing steps need to be altered.
Secondly, the developed pipeline can be used by many different retailers. Different
products in different domains mean different product properties and hence, different
consumer behaviour. Therefore, the model selection and parameter tuning capabilities
of the developed pipeline are important to make sure the most appropriate model is
selected for different retailers.
Thirdly, with small adjustments the developed work enables retail clients to incorporate
objectives other than popularity. For instance, margin can be incorporated in the
prediction to recommend more items with a higher margin, and many other objectives
can be incorporated in a similar manner.
• Scalability The packages and algorithms used can easily be run on large (virtual)
machines for large numbers of users. The only requirement is a container or virtual
machine able to run Python and XGBoost. The current configuration uses a virtual
machine with a 4-core processor and 8 GB of memory. This is important for Building
Blocks and its clients.
Summarizing, it becomes possible to make relevant personalized offerings to large groups
of customers on different channels. As discussed in chapter 2, this is essential in order to
attract customers and be successful in the current online retail industry.
6 Conclusion and discussion
This chapter discusses the implications of this research for academia and practice, describes
its limitations, and concludes the research.
6.1 Research questions
The main research question is recalled:
Main research question: How to build a recommender system that provides bal-
anced item recommendations in online retail?
In order to answer the main research question, all research questions will be addressed first.
Research question 1: Which criteria can be used to evaluate balanced item rec-
ommendation?
Recommending products to users is a binary classification problem. Chapter 3 describes var-
ious accuracy-related metrics that are used to evaluate recommender system performance.
The classification metrics that have been used include precision, recall and the F1 measure.
Apart from general recommender system performance, several metrics are used to evalu-
ate balanced item recommendation. Firstly, the items in the result set are split into subsets
based on the number of purchases, so that performance can be compared across these subsets.
For instance, precision, recall and F1 score, as well as coverage, are computed separately for
head and tail items. Furthermore, the distribution of recommendations over all products is
evaluated by computing the Gini-index over the set of recommendations.
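The Gini-index computation over a set of recommendations can be sketched as follows; this is a minimal illustration using a standard textbook formulation (the function name is hypothetical, and this is not necessarily the exact implementation used in this project):

```python
def gini_index(recommendation_counts):
    """Gini-index over how often each item was recommended.

    Returns 0.0 when every item is recommended equally often and
    approaches 1.0 when recommendations concentrate on a few items.
    """
    counts = sorted(recommendation_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    # Standard formulation via the rank-weighted sum of sorted counts.
    rank_weighted = sum(i * c for i, c in enumerate(counts, start=1))
    return (2.0 * rank_weighted) / (n * total) - (n + 1.0) / n

print(gini_index([10, 10, 10, 10]))          # 0.0 (perfectly equal)
print(round(gini_index([0, 0, 0, 40]), 2))   # 0.75 (highly concentrated)
```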
Research question 2: How to adapt the existing algorithm such that it will deliver
balanced recommendations?
This research proposes using XGBoost for recommendation, as described in chapter 3. The
approach involves altering weights connected to data points in order to deliver a more
balanced recommendation set. This approach is tested on two case studies, namely a
coupon website and a tour operator.
The results show that XGBoost can be used for recommendation in online retail and that
the approach outperforms a random selection algorithm in terms of precision and recall.
Furthermore, the approach of applying different instance-weighting methods turns out to be
successful in delivering a more equal division of recommendations over the item set. However,
in line with other research, delivering a more balanced recommendation set comes at the cost
of lower precision and recall.
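As an illustration of the instance-weighting idea (a hypothetical sketch, not the project's code), weights inversely proportional to an item's occurrence count can be built as follows:

```python
from collections import Counter

def inverse_count_weights(item_ids, power=1.0):
    """Weight each training instance inversely to the popularity
    (occurrence count) of its item, so that tail items carry more
    influence during boosted-tree training."""
    counts = Counter(item_ids)
    return [1.0 / (counts[item] ** power) for item in item_ids]

# Item 'a' occurs three times, 'b' once: each 'b' instance ends up
# weighted three times as heavily as each 'a' instance.
weights = inverse_count_weights(["a", "a", "a", "b"])
```

The resulting per-instance weights can then be passed to XGBoost, for example via the `weight` argument of `xgboost.DMatrix` or the `sample_weight` argument of its scikit-learn wrapper's `fit` method.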
Research question 3: How can the developed algorithm be applied in the online
retail industry by Building Blocks?
The developed algorithm can be applied using Python and various open source Python
packages, as discussed in chapter 5. The developed solution pipeline is flexible and scalable.
To conclude with the main research question, this thesis presents a feasible approach in which
boosted trees with instance weighting are applied to recommend items from the long tail.
The research shows that boosted trees can be used for recommendation, and that adapting
instance weights results in a more equal division of recommendations over the different
items.
6.2 Implications for research
First of all, this project shows that gradient boosted trees can successfully be used for rec-
ommender systems in online retail. The recommendation system outperforms a random
selection; however, more benchmarks are needed to see how it compares to other algo-
rithms.
The second important implication for research is that instance weighting in boosted tree
learning indeed yields a more equal recommendation set compared to not applying any
instance weighting. This is important because a more equal recommendation set means
more exposure for less popular items. This result can clearly be seen on both the tour
operator and the Ponpare datasets. It is an important addition, because this approach has
not been used before to address balanced item recommendation.
The third implication is that better performance in terms of precision and recall comes with
a less equal distribution of recommendations over the items. This strongly suggests that
choosing a model yielding a more equal set of recommendations means sacrificing perfor-
mance in terms of precision and recall. Hence, for applications where an equal distribution
of recommendations is desired, concessions need to be made in terms of precision and recall.
6.3 Managerial implications
As discussed in chapter 2, customer loyalty is key to the success of online retailers, and
modern customers demand tailored content in real time to fulfill their needs in their online
shopping experience.
The presented system is able to give personalized recommendations in real time for a large
number of users and items and is hence scalable. This is important to Building Blocks,
as many of their online retailing clients maintain a large catalogue of items and serve a
large number of customers, and previous approaches did not scale well.
Furthermore, the presented approach can be applied in a variety of domains. The underlying
user behaviour differs between, for instance, purchasing coupons and booking a package
holiday. Since the clients of Building Blocks come from various domains, it is important that
the presented approach can also be applied to other domains in order to serve potential
future clients as well.
Finally, using a recommender system that takes into account the performance of tail items
enables retail clients to dispose of excess inventory and thereby improve their bottom-line
performance.
In order to do so, it is possible to construct weights based on extended cost models, which
give a better representation of the costs involved in recommending items. For instance,
inventory costs, as well as the cost of other items not being purchased, might be taken into
account to obtain a more appropriate estimate of costs and hence a more appropriate
product mix in the recommendations of the e-retailer.
6.4 Future research
This approach could also be extended to take into account other forms of costs, such as
inventory cost, or to take margin into account in a similar fashion as has now been done
with purchases and views. Hence, research could be done to determine whether the
proposed approach can be extended to address other business requirements as well.
Additionally, more research should be done into the relationship between general recommen-
dation performance and equality. Experiments can be done both within the e-commerce
domain and in other domains to determine this relationship.
Furthermore, it is recommended to compare the performance of the developed models
against algorithms other than the random selection used in this research. Existing collabo-
rative filtering approaches, such as collaborative filtering combined with clustering, as well
as other machine learning methods, could serve as benchmarks. For instance, random forests
also allow for weighted learning. This could give insight into how the developed method
relates to other algorithms, both in general recommendation performance and in delivering
equal recommendations.
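As a sketch of such a benchmark (illustrative only, with synthetic data; not an experiment from this thesis), scikit-learn's random forest accepts per-instance weights in the same spirit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # synthetic user-item features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic purchase labels

# Up-weight one class, analogous to the instance weighting applied
# to the boosted trees in this thesis.
weights = np.where(y == 1, 2.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)
predictions = clf.predict(X[:5])
```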
Moreover, a custom evaluation function could be used in the XGBoost algorithm, so that
intermediate models during training are evaluated not only on general predictive
performance, but also on how balanced their recommendations are. This could lead to
different final models.
Finally, it is recommended to test the algorithm live, involving users. This could give insight
into both the business impact and the user experience of the proposed approaches. For
instance, it could be verified whether the proposed approach indeed leads to a more equal
division of sales over all items, which would allow the company to reduce overstock.
Moreover, it could give insight into how users perceive the proposed changes.
6.5 Limitations
Firstly, the initial parameters are tuned using an accuracy-related metric, AUC, whereas
different measures, such as the Gini-index, are used for the final evaluation. This is an impor-
tant limitation, as the best parameters for equal recommendations could differ from the
best parameters under AUC. Currently, the XGBoost wrapper class used with scikit-learn's
grid search does not support this, and it would therefore need to be extended.
Secondly, the composition of the train and test sets for the tour operator makes evaluating
recommendations as a ranked list difficult, as the mean number of items per user (10.98)
is only slightly higher than the maximum list length of 10. This means that towards the
end of the list there are often few items to choose from, and hence the different algorithms
give very similar results. The number of possible items per user in the Ponpare set, on the
other hand, is considerably larger, at 34 items per user.
Adding random combinations to the tour operator dataset was not feasible, as this might
encourage the model to learn relationships based on the underlying generating algorithm
instead of user behaviour. Therefore, the recommendation is to also test these algorithms
on a dataset with more user-item combinations.
References
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, and Others. Modern information retrieval,
volume 463. ACM press New York, 1999.
J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez. Recommender systems survey.
Knowledge-Based Systems, 46:109–132, 2013. ISSN 09507051. doi: 10.1016/j.knosys.
2013.03.012. URL http://dx.doi.org/10.1016/j.knosys.2013.03.012.
Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. Understanding
Choice Overload in Recommender Systems. In Proceedings of the Fourth ACM Conference
on Recommender Systems, RecSys ’10, pages 63–70, New York, NY, USA, 2010. ACM.
ISBN 978-1-60558-906-0. doi: 10.1145/1864708.1864724. URL http://doi.acm.org/10.
1145/1864708.1864724.
Pete Chapman, Julian Clinton, Thomas Khabaza, Thomas Reinartz, and Rüdiger Wirth.
The CRISP-DM Process Model. The CRISP-DM Consortium, 310(C), 1999.
Tianqi Chen and Carlos Guestrin. XGBoost: Reliable Large-scale Tree Boosting System.
arXiv, pages 1–6, 2016a. ISSN 0146-4833. doi: 10.1145/2939672.2939785.
Tianqi Chen and Carlos Guestrin. XGBoost. Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages
785–794, 2016b. doi: 10.1145/2939672.2939785. URL http://dl.acm.org/citation.
cfm?doid=2939672.2939785.
Chao Min Chiu, Eric T G Wang, Yu Hui Fang, and Hsin Yi Huang. Understanding cus-
tomers’ repeat purchase intentions in B2C e-commerce: The roles of utilitarian value,
hedonic value and perceived risk. Information Systems Journal, 24(1):85–114, 2014. ISSN
13501917. doi: 10.1111/j.1365-2575.2012.00407.x.
AS Das, M Datar, A Garg, and S Rajaram. Google news personalization: scalable online
collaborative filtering. Proceedings of the 16th international conference on, pages 271–280,
2007. ISSN 1595936548. doi: 10.1145/1242572.1242610. URL http://portal.acm.org/
citation.cfm?id=1242610.
James Davidson, Blake Livingston, Dasarathi Sampath, Benjamin Liebald, Junning Liu,
Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, and Mike Lambert.
The YouTube video recommendation system. Proceedings of the fourth ACM conference
on Recommender systems - RecSys ’10, page 293, 2010. ISSN 1605589063. doi: 10.1145/
1864708.1864770.
Eurostat. Digital economy and society statistics - households and individuals,
2017. URL http://ec.europa.eu/eurostat/statistics-explained/index.php/Digital_economy_and_society_statistics_-_households_and_individuals.
Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Anal-
ysis, 38(4):367–378, 2002.
Stijn Geuens. Factorization Machines for Hybrid Recommendation Systems Based on Be-
havioral , Product , and Customer Data. In the 2015 ACM conference on Recommender
systems, RecSys 2015, number Umr 9221, pages 379–382, 2015. ISBN 9781450336925.
doi: 10.1145/2792838.2796542.
Carlos A. Gomez-Uribe and Neil Hunt. The Netflix Recommender System. ACM Trans-
actions on Management Information Systems, 6(4):1–19, 2015. ISSN 2158656X. doi:
10.1145/2843948. URL http://dl.acm.org/citation.cfm?id=2869770.2843948.
P Grefen. Beyond E-Business. Routledge, 2015. ISBN 9781315754697. doi: 10.4324/
9781315754697. URL http://www.tandfebooks.com/isbn/9781315754697.
Asela Gunawardana and Guy Shani. A Survey of Accuracy Evaluation Metrics of Recom-
mendation Tasks. The Journal of Machine Learning Research, 10:2935–2962, 2009. ISSN
15324435. doi: 10.1145/1577069.1755883.
Jiawei Han, Micheline Kamber, and Jian Pei. Introduction. Elsevier, 2012. ISBN
9780123814791. doi: 10.1016/B978-0-12-381479-1.00001-0. URL http://linkinghub.
elsevier.com/retrieve/pii/B9780123814791000010.
Kaggle. Coupon Purchase Prediction. URL https://www.kaggle.com/c/coupon-purchase-prediction.
Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item
collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003. ISSN 10897801. doi:
10.1109/MIC.2003.1167344.
Wes McKinney. pandas: a foundational Python library for data analysis and statistics.
Python for High Performance and Scientific Computing, pages 1–9, 2011.
Oyvind H Myklatun, Thorstein K Thorrud, Hai Nguyen, Helge Langseth, and Anders Kofod-
Petersen. Probability-based Approach for Predicting E-commerce Consumer Behaviour
Using Sparse Session Data. Proceedings of the 2015 International ACM Recommender
Systems Challenge, pages 5:1—-5:4, 2015. doi: 10.1145/2813448.2813514. URL http:
//doi.acm.org/10.1145/2813448.2813514.
Salvatore Parise, Patricia J. Guinan, and Ron Kafka. Solving the crisis of immediacy:
How digital technology can transform the customer experience. Business Horizons, 59
(4):411–420, 2016. ISSN 00076813. doi: 10.1016/j.bushor.2016.03.004. URL http:
//dx.doi.org/10.1016/j.bushor.2016.03.004.
Yoon-Joo Park and Alexander Tuzhilin. The long tail of recommender systems and how to
leverage it. Proceedings of the 2008 ACM conference on Recommender systems RecSys
08, page 11, 2008. ISSN 03029743. doi: 10.1145/1454008.1454012. URL http://portal.
acm.org/citation.cfm?doid=1454008.1454012.
Steffen Rendle. Factorization machines. Proceedings - IEEE International Conference on
Data Mining, ICDM, pages 995–1000, 2010. ISSN 15504786. doi: 10.1109/ICDM.2010.127.
Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl.
GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of
the 1994 ACM conference on Computer supported cooperative work, pages 175–186, 1994.
ISSN 00027863. doi: 10.1145/192844.192905.
Badrul M Sarwar, George Karypis, Joseph A Konstan, and John T Riedl. Application of
Dimensionality Reduction in Recommender System - A Case Study. Architecture, 1625:
264–8, 2000. ISSN 15533514. doi: 10.1.1.38.744. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.29.8381&rep=rep1&type=pdf.
Badrul M Sarwar, George Karypis, Joseph Konstan, and John Riedl. Recommender Systems
for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering.
Communications, 50(12):158–167, 2002. ISSN 09254773. doi: 10.1.1.4.6985. URL http://grouplens.org/papers/pdf/sarwar_cluster.pdf.
Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender
systems handbook, pages 257–297. Springer, 2011.
Yue Shi, Martha Larson, and Alan Hanjalic. Collaborative Filtering beyond the User-Item
Matrix : A Survey of the State of the Art and Future Challenges. ACM Computing Surveys
(CSUR), 47(1):1–45, 2014. ISSN 03600300. doi: http://dx.doi.org/10.1145/2556270.
Harald Steck. Item popularity and recommendation accuracy. In Proceedings of the
fifth ACM conference on Recommender systems - RecSys ’11, page 125, 2011. ISBN
9781450306836. doi: 10.1145/2043932.2043957. URL http://dl.acm.org/citation.
cfm?doid=2043932.2043957.
Kai Ming Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Trans-
actions on Knowledge and Data Engineering, 14(3):659–665, 2002.
Daniel Valcarce, Javier Parapar, and Álvaro Barreiro. Item-based relevance modelling of
recommendations for getting rid of long tail products. Knowledge-Based Systems, 103:
41–51, 2016. ISSN 09507051. doi: 10.1016/j.knosys.2016.03.021.
Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. Collaborative filtering and
deep learning based recommendation system for cold start items. Expert Systems with
Applications, 69:1339–1351, 2017. ISSN 09574174. doi: 10.1016/j.eswa.2016.09.040.
Appendices
A Data exploration
Table A.1: Overview of available attributes
Table        Attribute                    Description                               Type
User         USER ID hash                 User ID                                   VARCHAR2
             REG DATE                     Registered date                           DATE
             SEX ID                       Gender                                    CHAR
             AGE                          Age                                       NUMBER
             WITHDRAW DATE                Unregistered date                         DATE
             PREF NAME                    Residential prefecture                    VARCHAR2
Coupon       CAPSULE TEXT                 Capsule text                              VARCHAR2
             GENRE NAME                   Category name                             VARCHAR2
             PRICE RATE                   Discount rate                             NUMBER
             CATALOG PRICE                List price                                NUMBER
             DISCOUNT PRICE               Discount price                            NUMBER
             DISPFROM                     Sales release date                        DATE
             DISPEND                      Sales end date                            DATE
             DISPPERIOD                   Sales period (days)                       NUMBER
             VALIDFROM                    The term of validity starts               DATE
             VALIDEND                     The term of validity ends                 DATE
             VALIDPERIOD                  Validity period (days)                    NUMBER
             USABLE DATE MON              Is available on Monday                    CHAR
             USABLE DATE TUE              Is available on Tuesday                   CHAR
             USABLE DATE WED              Is available on Wednesday                 CHAR
             USABLE DATE THU              Is available on Thursday                  CHAR
             USABLE DATE FRI              Is available on Friday                    CHAR
             USABLE DATE SAT              Is available on Saturday                  CHAR
             USABLE DATE SUN              Is available on Sunday                    CHAR
             USABLE DATE HOLIDAY          Is available on holiday                   CHAR
             USABLE DATE BEFORE HOLIDAY   Is available on the day before a holiday  CHAR
             large area name              Large area name of shop location          VARCHAR2
             ken name                     Prefecture name of shop                   VARCHAR2
             small area name              Small area name of shop location          VARCHAR2
             COUPON ID hash               Coupon ID                                 VARCHAR2
View         PURCHASE FLG                 Purchased flag                            NUMBER
             PURCHASEID hash              Purchase ID                               VARCHAR2
             I DATE                       View date                                 DATE
             PAGE SERIAL                                                            VARCHAR2
             REFERRER hash                Referrer                                  VARCHAR2
             VIEW COUPON ID hash          Browsed coupon ID                         VARCHAR2
             USER ID hash                 User ID                                   VARCHAR2
             SESSION ID hash              Session ID                                VARCHAR2
Purchase     ITEM COUNT                   Purchased item count                      NUMBER
             I DATE                       Purchase date                             DATE
             SMALL AREA NAME              Small area name                           VARCHAR2
             PURCHASEID hash              Purchase ID                               VARCHAR2
             USER ID hash                 User ID                                   VARCHAR2
             COUPON ID hash               Coupon ID                                 VARCHAR2
Coupon Area  SMALL AREA NAME              Small area name                           VARCHAR2
             PREF NAME                    Listed prefecture name                    VARCHAR2
             COUPON ID                    Coupon ID                                 VARCHAR2
Table A.2: Views per user and coupon
                 User     Coupon
Minimum          1.0      1.0
Smallest         1.0      1.0
Lower Quartile   9.0      15.0
Median           32.0     43.0
Upper Quartile   135.0    101.0
Largest          324.0    230.0
Maximum          3629.0   14779.0
Table A.3: Catalog price, discount price, listing frequency, purchase frequency and display period per coupon.

                 Catalog Price  Discount Price  Listed  Purchased  Dispperiod  Revenue
Minimum          1.0            0.0             1.0     1.0        0.0         0.0
Smallest         1.0            0.0             1.0     1.0        0.0         0.0
Lower Quartile   3570.0         1490.0          1.0     2.0        3.0         6560.0
Median           6615.0         2580.0          1.0     6.0        4.0         15920.0
Upper Quartile   13400.0        4500.0          2.0     15.0       6.0         39680.0
Largest          28000.0        9000.0          3.0     34.0       10.0        89250.0
Maximum          680000.0       100000.0        123.0   5760.0     422.0       1627500.0
Figure A.1: Number of purchases per genre: (a) per age group; (b) per week.
Table A.4: P-values for Pearson correlation test, coupon properties

                 CATALOG PRICE  DISCOUNT PRICE  DISPPERIOD  PRICE RATE  VALIDPERIOD  purchases  views
CATALOG PRICE    0.0000000
DISCOUNT PRICE   0.0000000      0.0000000
DISPPERIOD       0.0000000      0.0230514       0.0000000
PRICE RATE       0.0000000      0.0000001       0.0000000   0.0000000
VALIDPERIOD      0.0001213      0.7137261       0.0000000   0.1065052   0.0000000
purchases        0.0000000      0.0000000       0.0000000   0.0000000   0.855523     0.0000000
views            0.0000000      0.3743773       0.0000000   0.2601731   0.609746     0.0000000  0.0000000
Table A.5: Pearson correlation coefficient for coupon properties

                 CATALOG PRICE  DISCOUNT PRICE  DISPPERIOD  PRICE RATE  VALIDPERIOD  purchases  views
CATALOG PRICE    1
DISCOUNT PRICE   0.842671       1
DISPPERIOD       0.057761       0.016311        1
PRICE RATE       0.280378       -0.038462       0.13658     1
VALIDPERIOD      0.033364       0.003185        0.108647    -0.014015   1
purchases        -0.074574      -0.082232       0.258665    0.042046    -0.001584    1
views            -0.039301      -0.006376       0.300955    -0.008082   -0.004432    0.801952   1
B Feature calculation
Table B.1: Overview of used features for Ponpare
name               description
pricerate          discount percentage (%)
catalog price      catalogue price (Yen)
discount price     discount price (Yen)
validperiod        validity (days)
usable date        weekdays on which the coupon can be used (one-hot)
age                user age (years)
locexists          whether a user-coupon location appeared in the purchase log of last month (boolean)
days registration  days since the user registered (days)
timeofday          time of day of the view (one-hot)
dayofweek          day of the week of the view (one-hot)
genre              coupon genre (one-hot)
sex                user sex (one-hot)
samepref           whether user and coupon have the same prefecture (boolean)
prob g             genre popularity for the user in the previous month
prob p             prefecture popularity for the user in the previous month
keypop             coupon key popularity in the previous month
C Experiment results
Table C.1: Pairwise p-values for Ponpare dataset (list length = 5)

                     RANDOM    idflike    idflike   inv purchase  inv purchase  inv view  inv view  no
                               purchases  views     count         count2        count     count2    weights
RANDOM               -         0          0         0             0.0000234     0         0         0
idflike purchases    0         -          0.307324  0.000137      0             0.005252  0.004475  0.823039
idflike views        0         0.307324   -         0.000006      0             0.000298  0.000723  0.526758
inv purchase count   0         0.000137   0.000006  -             0.006238      0.205285  0.969553  0.000139
inv purchase count2  0.000023  0          0         0.006238      -             0.000266  0.033445  0
inv view count       0         0.005252   0.000298  0.205285      0.0002658     -         0.37132   0.004083
inv view count2      0         0.004475   0.000723  0.969553      0.0334447     0.37132   -         0.002849
no weights           0         0.823039   0.526758  0.000139      0             0.004083  0.002849  -
Table C.2: Pairwise p-values for tour operator dataset (list length = 5)

                     RANDOM    idflike    idflike   inv purchase  inv purchase  inv view  inv view  no
                               purchases  views     count         count2        count     count2    weights
RANDOM               -         0          0         0             0             0         0.000001  0
idflike purchases    0         -          0.217161  0.000165      0             0.01604   0         0.412034
idflike views        0         0.217161   -         0.000003      0             0.000792  0         0.725891
inv purchase count   0         0.000165   0.000003  -             0             0.226908  0         0.000012
inv purchase count2  0         0          0         0             -             0         0.000216  0
inv view count       0         0.01604    0.000792  0.226908      0             -         0         0.001355
inv view count2      0.000001  0          0         0             0.000216      0         -         0
no weights           0         0.412034   0.725891  0.000012      0             0.001355  0         -