Deep Neural Networks for Context Aware Personalized Music Recommendation

IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING,SECOND CYCLE, 30 CREDITS

, STOCKHOLM SWEDEN 2017

Deep Neural Networks for Context Aware Personalized Music RecommendationA Vector of Curation

OKTAY BAHCECI

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Deep Neural Networks forContext Aware PersonalizedMusic Recommendation

A Vector of Curation

OKTAY BAHCECI

Master in Computer ScienceDate: June 26, 2017Supervisor: Hedvig KjellströmExaminer: Patric JensfeltSwedish title: Djupa Neurala Nätverk för KontextberoendePersonaliserad MusikrekommendationSchool of Computer Science and Communication

iii

AbstractInformation Filtering and Recommender Systems have been used andhas been implemented in various ways from various entities since thedawn of the Internet, and state-of-the-art approaches rely on MachineLearning and Deep Learning in order to create accurate and personal-ized recommendations for users in a given context. These models requirebig amounts of data with a variety of features such as time, locationand user data in order to find correlations and patterns that other classi-cal models such as matrix factorization and collaborative filtering cannot.This thesis researches, implements and compares a variety of models withthe primary focus of Machine Learning and Deep Learning for the taskof music recommendation and do so successfully by representing the taskof recommendation as a multi-class extreme classification task with 100000 distinct labels. By comparing fourteen di�erent experiments, all im-plemented models successfully learn features such as time, location, userfeatures and previous listening history in order to create context-awarepersonalized music predictions, and solves the cold start problem by us-ing user demographic information, where the best model being capableof capturing the intended label in its top 100 list of recommended itemsfor more than 1/3 of the unseen data in an o�ine evaluation, when eval-uating on randomly selected examples from the unseen following week.

iv

SammanfattningInformationsfiltrering och rekommendationssystem har använts och im-plementerats på flera olika sätt från olika enheter sedan gryningen avInternet, och moderna tillvägagångssätt beror på Maskininlärrning samtDjupinlärning för att kunna skapa precisa och personliga rekommenda-tioner för användare i en given kontext. Dessa modeller kräver data i storamängder med en varians av kännetecken såsom tid, plats och användarda-ta för att kunna hitta korrelationer samt mönster som klassiska modellersåsom matris faktorisering samt samverkande filtrering inte kan. Dettaexamensarbete forskar, implementerar och jämför en mängd av modellermed fokus på Maskininlärning samt Djupinlärning för musikrekommen-dation och gör det med succé genom att representera rekommendations-problemet som ett extremt multi-klass klassifikationsproblem med 100000 unika klasser att välja utav. Genom att jämföra fjorton olika expe-riment, så lär alla modeller sig kännetäcken såsom tid, plats, användar-kännetecken och lyssningshistorik för att kunna skapa kontextberoendepersonaliserade musikprediktioner, och löser kallstartsproblemet genomanvändning av användares demografiska kännetäcken, där den bästa mo-dellen klarar av att fånga målklassen i sin rekommendationslista medlängd 100 för mer än 1/3 av det osedda datat under en o�ine evalue-ring, när slumpmässigt valda exempel från den osedda kommande veckanevalueras.

v

AcknowledgementsEver since I can remember, I have had a major passion for music andcomputers, and have been waiting for the moment to be able to combinethese interests in order to maximize my potential. I would like to start bythanking Spotify for giving me the chance to do what I love to do, afterpresenting an idea I had been thinking of for years. Not only did you letme transform this idea into a truly successful, production sized projectand into something that holds great value, but provided me with all thetools to do so. I would like to thank my Spotify mentor Marcus Isakssonfor his help and for guiding me in the right directions, and I want to givethanks to Hedvig Kjellström for being my university supervisor. Fur-thermore, I want to give thanks to my university. Apart from giving mea great education and letting me excel in what I love to do, you let meteach and act as an ambassador for years, and gave me multiple oppor-tunities of a lifetime that I never thought were possible. I want to thankall the companies that I have had the chance of working for throughoutmy education, that has shaped me into the engineer I always wanted tobe. I want to thank my friends from Sweden and from California for allyour love and support. I would like to thank my relatives and cousins forgiving me support throughout the good and the bad. Finally, I wouldlike to thank my mom, Cemile Bahceci for all her love and support. Youare the strongest person I know and the coolest woman in tech there is.You have shown me that it is possible to get whatever you want in lifewith hard work and with a positive mindset.

vi

NotationTo simplify reading, the following notation will be used and referred tothroughout this work

vc

play context embeddingsb

c

play context biasesv

t

track a�nity embeddingsv

ci

city embeddingsv

co

country embeddingsv

curation

vector of curation, ranked vector containing the top playcontexts for a user

up

user platform constantu

g

user gender constantu

a

user age constanttd

time of day constanttw

time of week constantq query representationu contextual user representation vectorV10k

vocabulary with 10 000 play contextsV100k

vocabulary with 100 000 play contexts

Contents

1 Introduction 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Problem Definition and Objective . . . . . . . . . . . . . 41.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 41.5 Sustainability, Ethics, and Societal Aspects . . . . . . . . 51.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . 61.7 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Related Work 82.1 Recommender Systems . . . . . . . . . . . . . . . . . . . 82.2 Information Filtering . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Collaborative Filtering . . . . . . . . . . . . . . . 92.3 Content-based Recommendation Systems . . . . . . . . . 102.4 Context-aware Recommendation Systems . . . . . . . . . 10

2.4.1 Matrix Factorization . . . . . . . . . . . . . . . . 112.4.2 Factorization Machines . . . . . . . . . . . . . . . 12

2.5 Hybrid Recommendation Systems . . . . . . . . . . . . . 122.6 Evaluation of Recommendation Systems . . . . . . . . . 12

3 Background 143.1 Vector Representation of Words . . . . . . . . . . . . . . 14

3.1.1 Embedding . . . . . . . . . . . . . . . . . . . . . 153.1.2 Word2Vec . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . 163.3 Feed Forward Neural Networks . . . . . . . . . . . . . . 17

3.3.1 Single-Layer Perceptron . . . . . . . . . . . . . . 173.3.2 Multilayer Perceptron . . . . . . . . . . . . . . . 18

3.4 Convolutional Neural Networks . . . . . . . . . . . . . . 19

vii

viii CONTENTS

3.4.1 Convolution . . . . . . . . . . . . . . . . . . . . . 203.4.2 Rectified Linear Unit . . . . . . . . . . . . . . . . 213.4.3 Exponential Linear Unit . . . . . . . . . . . . . . 213.4.4 Pooling . . . . . . . . . . . . . . . . . . . . . . . 22

3.5 Recurrent Neural Networks . . . . . . . . . . . . . . . . . 223.5.1 LSTM . . . . . . . . . . . . . . . . . . . . . . . . 22

3.6 Deep Neural Networks . . . . . . . . . . . . . . . . . . . 233.6.1 Backpropagation . . . . . . . . . . . . . . . . . . 23

3.7 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . 243.7.1 Regularization Techniques . . . . . . . . . . . . . 253.7.2 Optimization Techniques . . . . . . . . . . . . . . 273.7.3 Momentum . . . . . . . . . . . . . . . . . . . . . 283.7.4 Adagrad . . . . . . . . . . . . . . . . . . . . . . . 283.7.5 Challenges . . . . . . . . . . . . . . . . . . . . . . 29

4 Data 304.1 Spotify . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . 31

4.2.1 Scio . . . . . . . . . . . . . . . . . . . . . . . . . 314.2.2 Data Pipeline . . . . . . . . . . . . . . . . . . . . 31

4.3 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 314.3.1 Play Contexts . . . . . . . . . . . . . . . . . . . . 324.3.2 Listening History . . . . . . . . . . . . . . . . . . 324.3.3 User Data . . . . . . . . . . . . . . . . . . . . . . 334.3.4 Metadata . . . . . . . . . . . . . . . . . . . . . . 344.3.5 Training and Evaluation Data . . . . . . . . . . . 34

4.4 Feature Engineering and Representation . . . . . . . . . 34

5 Method 365.1 Recommendation Represented as Classification . . . . . . 36

5.1.1 Classifier E�ciency . . . . . . . . . . . . . . . . . 375.2 Model Architecture . . . . . . . . . . . . . . . . . . . . . 38

5.2.1 Scoring Function . . . . . . . . . . . . . . . . . . 385.2.2 Diverse and Unlimited Features . . . . . . . . . . 385.2.3 Weights and Priors . . . . . . . . . . . . . . . . . 405.2.4 Batch Training and Normalization . . . . . . . . . 40

5.3 Network Layer and Embedding Dimensions . . . . . . . . 405.4 Vocabulary Dimension . . . . . . . . . . . . . . . . . . . 415.5 Hyperparameters and Tuning . . . . . . . . . . . . . . . 41

CONTENTS ix

5.5.1 Loss Function . . . . . . . . . . . . . . . . . . . . 415.5.2 Optimizer . . . . . . . . . . . . . . . . . . . . . . 42

5.6 The Vector of Curation . . . . . . . . . . . . . . . . . . . 435.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . 43

5.7.1 TensorFlow . . . . . . . . . . . . . . . . . . . . . 435.7.2 Data and Feature representation . . . . . . . . . . 445.7.3 Training . . . . . . . . . . . . . . . . . . . . . . . 44

5.8 Baseline Algorithm . . . . . . . . . . . . . . . . . . . . . 445.8.1 Context-Aware Popularity Based Heuristic . . . . 45

5.9 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465.9.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . 465.9.2 Reciprocal Rank . . . . . . . . . . . . . . . . . . 475.9.3 Top-K . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Results 486.1 Baseline Heuristic . . . . . . . . . . . . . . . . . . . . . . 48

6.1.1 Mean Accuracy . . . . . . . . . . . . . . . . . . . 496.1.2 Mean Reciprocal Rank . . . . . . . . . . . . . . . 49

6.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.3.1 Base Model . . . . . . . . . . . . . . . . . . . . . 516.4 Track Vectors . . . . . . . . . . . . . . . . . . . . . . . . 526.5 Going Deeper . . . . . . . . . . . . . . . . . . . . . . . . 576.6 DEEPER . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6.6.1 DEEPESTELU . . . . . . . . . . . . . . . . . . . 596.7 Best Models . . . . . . . . . . . . . . . . . . . . . . . . . 60

7 Discussion 617.1 Analysis of Results . . . . . . . . . . . . . . . . . . . . . 617.2 Baseline Heuristic . . . . . . . . . . . . . . . . . . . . . . 617.3 Baseline Model . . . . . . . . . . . . . . . . . . . . . . . 627.4 Track Vectors . . . . . . . . . . . . . . . . . . . . . . . . 637.5 TVEMBD3AG . . . . . . . . . . . . . . . . . . . . . . . 647.6 Deeper Models . . . . . . . . . . . . . . . . . . . . . . . 667.7 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 67

8 Conclusions 688.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 688.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . 70

x CONTENTS

A Personas 79

B Model Predictions for Personas 81

Chapter 1

Introduction

This chapter is an introduction to the area and problems this thesis con-cerns itself with. An overview of the background of the project and thesisis described, together with an introduction of the company the projectwas conducted at. The motivation for the research, its potential and ap-plications are presented with the fundamental and important problemsthat are integral for the area at hand. Furthermore, the problem thisresearch project is concerned with is defined with the assumptions, ob-jectives and goals that are central to the project. A section is dedicatedto the methodology and the reasoning behind the chosen approach. Thefinal section is dedicated to an outline where the reader is given a highlevel overview of the structure of this thesis.

1.1 BackgroundThe research for this thesis has been conducted at Spotify. Spotify is amusic, podcast and video streaming service that was launched in Octoberof 2008 in Stockholm, Sweden. Spotify is a world-wide industry leadingcompany in music streaming services, with over 100 million active usersas of June 2016, has over 50 million paying subscribers as of March 2017,together with its availability in 60 markets in the world. It is one ofthe biggest music streaming services in the world, with the most payingsubscribers, compared to its competitors such as YouTube, Google Music,Apple Music, SoundCloud, Amazon Prime, Pandora and Tidal.

Spotify is available and has clients on multiple platforms such as com-puters, smartphones, tablets, home entertainment systems, cars, gamingsystems among other systems, making it reachable for a multitude of use

1

2 CHAPTER 1. INTRODUCTION

cases [81].With their reach, status and availability, Spotify has enough and the

right data to perform a Deep Learning project such as the one this thesisconcerns itself with.

1.2 MotivationThe era of Big Data is here. Information in the form of computer in-terpretable data is parsed and processed from one endpoint to anotheracross a wide range of platforms and devices around the world, and muchof this information is stored in servers as data that take di�erent forms,either to fetch and present to users at a certain point in time, or in orderto process the data for the purpose of data mining. With the availabilityof huge amounts of data and content accessible to users, the explorationof the data becomes di�cult due the number of choices at hand, whichcreates a problem for all parties involved. The content creators havea hard time to get their work to relevant users, the users have a hardtime finding this content and the company providing the service wherethe content resides on are faced with the problem of providing the rightcontent for the right users, and in many times, are forced to prioritizethe most popular content for all users.

Recommendation and Recommender Systems (RS) concerns them-selves with the foundations of these challenges. In order to provide accu-rate recommendations for a given individual, the process at hand must beanalyzed for context-aware and personalized recommendations to prevailover blind recommendation that do not take any features regarding theuser into account.

Providing personalized and context-aware recommendations is a chal-lenging task that many entities deal with and prioritize today. Compa-nies want to better their services by providing context-aware personalizedrecommendations to their users for a multitude of reasons, such as theexploration of their data, prioritization or awareness.

Classical approaches such as Collaborative Filtering (CF) and MatrixFactorization (MF) have previously prevailed this area of interest andsome systems depend on Ensemble Methods (EM) to create more accu-rate and personalized recommendations for their users. Many of thesemethods are not able to accurately create context-aware personalizedrecommendations, or do so by classical approaches that are not able totake many features into account.

CHAPTER 1. INTRODUCTION 3

Deep Learning has become a popular approach in a wide range ofareas such as image recognition, natural language processing, automaticspeech recognition and biomedical informatics and seems to prevail inclassification tasks. Therefore, it is of interest to see if di�erent type ofnetwork architectures can be applied to the problem of recommendationin order to create context-aware personalized recommendations by usingdeep learning.

It is of interest to see investigate if it is possible to represent theproblem of recommendation as a multiclass classification problem, inorder to analyze what kind of deep learning networks are applicable forcreating multi-feature context-aware personalized recommendations.

Research in the area of RS has been lacking, possibly due to the inteland advantages it beholds, however, some promising research exist thatgives motivation to using Deep Learning as a tool for creating recom-mendations.

Given the previous research in the area, this thesis will focus on dif-ferent approaches of deep Artificial Neural Networks (ANN) in order toapply them for the problem of creating a recommender system that iscapable of using a multitude of features in order to provide context-awarepersonalized recommendations.


1.3 Problem Definition and ObjectiveThis thesis addresses the problem of creating context-aware personalizedmusic recommendations. A context-aware system refers to a system thatis capable of parsing and understanding as much information as possible,or as much data that is needed to consider the multiple hypothesis thatmay arise for a given situation, in order to understand and predict thebest recommendation. Further, a personalized system is a system that iscapable of using features of a given user in order to create a reflection ofthe users liking and personality, such as taste or preference. To attackthis problem in a scientific manner, one needs to have a hypothesis ofan expected outcome. Given a user U , context C and time T the goal ofthis thesis is to outperform the classical approaches of recommendationsuch as CF or MF with the use of deep learning. Therefore, the researchquestions that this thesis will investigate are the following.

Hypothesis I

Deep learning can be used in order to give

personalized and context aware recommendations

for a given context C, user U at a certain point in time T

Hypothesis II

Deep learning approaches can outperform a classical heuristic approach

in order to give personalized and context aware recommendations.

1.4 LimitationsThe primary focus of this thesis will be predicting the play context ofa given users intent at a given time, where a play context is defined bya Spotify URI string. The play context can represent a playlist, artist,album, radio or track. This work will primarily be dealing with predict-ing a play context with respect to time as an important feature and will


limit itself to only predicting more popular and predictable play contextsa user would like to listen to for a given situation. This will be done bylimiting the number of play contexts available for prediction, and willnot be focusing on single-track predictions, but instead the play contextsthat define set of songs. There exists historical data of what contexta user has listened to previously at a certain point in time as well asspecific historical song data, but this work limits itself to only predictthe play context since it provides the user with great value and service.Previous projects and implemented services at Spotify has already doneweekly predictions and predicts specific tracks for a given user once aweek through a feature called Discover Weekly, or predictions on newtracks a user would like through another feature called Release Radar.Previous work at Spotify has also attacked the problem of predicting aset of tracks for a context, with a feature known as Playlist Extender,and with another feature named Daily Mix. By using historical data,one could argue that it is possible to predict a certain song that is appro-priate for the users features, but would however need many more featuresto work somewhat accurately, and is in general a very di�cult task toaccomplish and is therefore out of scope for this thesis.

1.5 Sustainability, Ethics, and Societal AspectsComputer Science and Machine Learning engineers have the responsibil-ity to take the sustainability, ethical and societal aspects into account intheir work and be aware of their choices and their creations. This sectiondiscusses these aspects and their relation to this thesis.

In the area of Artificial Intelligence and Machine Learning, it is ofimportance to envision and develop systems in such a way that it willbenefit rather than harm humanity as a whole. Systems shall not becreated for causing harm on others, but should be crafted in such a wayto assist and help human beings. For keeping a sustainable progress in thefield, researches should be open with their findings and contribute to thetree of knowledge. This has to be done in order to ensure a sustainableand progressive future with societies around the world immersed withthis knowledge and power. In the area of Computer Science and MachineLearning, we are faced with a multitude of ethical dilemmas in our work.With recommender and information filtering systems, this responsibilityis arguably even greater, due to the nature of the task and the majorimpact these systems have on users that use them. A recommender


system that learns through time with the user’s history as a featuremight have a possibility of tailoring the user around its system, insteadof functioning the other way around as intended. The choice of featureslearned are important as well, whereas a feature such as the gender ofthe user might lead to the model making generalizations that do notapply to all and do not take minorities into considerations. There alsoexist possibilities of manipulating the recommender system to learn toprioritize items that do not apply to all, but are prioritized for material orpopular gain. This work takes these ethical considerations into accountand does its best to minimize these sources of error.

1.6 MethodologyThe approach for this thesis project is to use a data engineering and sci-entific approach to success. In order to succeed at recommendations, it isof importance to study the available information and features thoroughly.One needs to consider social and environmental aspects, as well as takingresponsibility to what data the recommender system is exposed to. Thisthesis investigates the process of recommending music, which in itself hasa foundation that needs to be explored, understood and exploited. Thisfoundation concerns itself in the patterns that music has and how onepiece of content is related to another.

Starting o�, the data will be explored and the relevant features will bechosen and extracted. The choice of features will depend on multiple hy-pothesis based on behavioural understanding and the authors previousunderstanding and knowledge about music. Continuing, feature engi-neering will be conducted. Features will be selected and transformed asneeded to for the model to operate as desired, and unique architecturesof di�erent deep learning models will be analyzed and compared.

1.7 Thesis OutlineThis thesis is structured as follows.

The Related Work chapter introduces classical approaches in the re-search area of recommender systems. This chapter contains the mostsuccessful and popular approaches in both research and enterprise appli-cations for creating recommendations.

The Background chapter is dedicated to Artificial Neural Networks,


the cornerstone of deep learning and other network-type structural learn-ing networks are described with examples of recently applied areas andsuccessful applications are presented.

The Data chapter presents and discusses the form of the data thisproject used to train models with, and a high level description of thedata is given and analyzed.

In the Method chapter, the approach for ensuring the success for thisthesis is described. The way of learning the necessary information isdescribed. A baseline algorithm, or a classical approach to attack thistask is formally presented. The metrics that are needed for evaluation ofthe results from the experiments are presented and their relevance to thetask is described. Architectures and approaches considered are describedand finally, one section is dedicated to the feature engineering requiredfor the models to learn as intended.

The Results chapter presents the performance of the baseline algo-rithm and of the di�erent architectures as well as the performance of theconducted experiments and models.

In the Discussion chapter, the reader is provided an analysis of theresults with reflection over the outcome of these. Following, the finalchapter, namely Conclusions, the conclusion of the outcome from theresults and research project is presented and suggested improvementsand paths for further research in the area is provided.

Chapter 2

Related Work

In this chapter, various of previous approaches to recommendation andrecommender systems are presented. The first section is dedicated to theintroduction of what a recommender system is and an overview of thesesystems. Following, content-based and context-aware recommendationare discussed thoroughly and the most common and popular techniquesare presented, such as matrix factorization and factorization machines.One section is dedicated to user-based recommendation, namely collab-orative filtering. To get the best of both worlds, many turn to a hybridrecommendation system approach. With these, signals from both type ofrecommendation systems give the final output of the choice of item thatis being recommended, presented in the hybrid recommendation systemschapter. A final chapter is dedicated on the evaluation of performancefor these models and the many di�erent metrics researchers tend to usewhen evaluating these.

2.1 Recommender SystemsRecommender systems (RS) are tools and techniques for creating sug-gestions of items to users. A recommender system is a technique thatinvestigates an information process and relate to a decision-making pro-cess in what type of content a user would for example, like to read, listento or watch, or what item to purchase. The recommender system actsas an information filtering system, manipulating data, utilizes personal-ization and context-awareness, or acts without it with popularity basedtechniques.

An item is a general term, and depends on what context the recom-

8

CHAPTER 2. RELATED WORK 9

mender system is implemented in. An item is the physical or non-physicalrepresentation of content or information holder, such as a book or a disk,such as a CD, but can even be a wearable item such as a piece of clothingor a tool such as a charger for a computer.

Content is any desirable piece of information that a user wants to ex-plore, or could potentially benefit from considering for a given situation.

With the enormous amounts of selectable content and overwhelmingnumber of choices of items in a service, a recommender system is fun-damental for exploration. Most services today feature a recommendersystem, in combination with a method for searching for specific desireditems or content through search.

2.2 Information FilteringInformation filtering is the process of creating rules and regulations ondata. Given a dataset, a filter takes the dataset and transforms it intoanother, as specified by the filter. Information filtering is commonly usedto provide insight and delivery of requested information per request. Thefilter is applied per query, or in the form as a recommendation. Whenthe filter is applied in the form of a recommendation, the informationis often personalized to the user at hand and the information system isdenoted as a recommender system.

There are several approaches to filtering information and creatingrecommendations. Some systems leverages personalized features of theuser such as age, gender and location. Other systems leverage previoushistory such as previous purchases and likes of content that requires userinteraction with the items in the system. Multiple approaches to infor-mation filtering exist and some of them such as collaborative filtering,content-based recommendation systems and context-aware recommenda-tion systems are presented in the following sections.

2.2.1 Collaborative FilteringCollaborative Filtering (CF) is an information based filtering system thatleverages user interacted items in order to create recommendations forother users. Two major types of collaborative filtering methods are com-monly used [13], namely memory-based and model-based, among manyothers [49]. In memory-based collaborative filtering, the model take useof a rating matrix, which contains entries of all users interacted items

10 CHAPTER 2. RELATED WORK

and operate over the all users. Memory-based CF methods are com-monly implemented as neighborhood-based methods. Examples of theseare the Pearson correlation coe�cient [1] jaccard similarity [22] and thecosine similarity [49]. These methods predict new recommendations byassuming that if users have similar ratings on some items, they will havesimilar ratings for other items as well.

User-based CF methods identifies users that are similar to other usersby averaging the ratings of similar users [6], whereas item-based CF iden-tifies items that are similar to an item and estimates a rating based onthe average ratings of similar items [8].

In order for collaborative filtering methods to work, the user has topreviously interacted with at least two or more items in the system,thus creating a restriction known as the cold start problem [9] that manyother methods su�er from due to the sparsity of the vectors at handduring computation [12, 24]. It is one of the most successful approachesto building recommender systems [32] and has previously been success-fully implemented in previous recommender systems by Amazon [13] andSpotify [52].

2.3 Content-based Recommendation SystemsContent-based recommender systems use the preference of the user andfocuses on previously interacted items in order to create new recommen-dations. These type of systems has algorithms constructed intended tolearn the simalarity between items in order to suggest new item recom-mendations for the user, by focusing on the features of the item itself.Investigating the items contents and comparing them has shown promisein the area of recommender systems [9, 22, 23]. Previous research in thearea has been successful at extracting latent features from items, suchas the ones from music by investigating audio features [17]. Successfulapproaches of extracting latent features from raw audio data include sev-eral ones that use Convolutional Neural Networks, discussed thoroughlyin the next chapter [60, 62].

2.4 Context-aware Recommendation SystemsContext-aware recommendation systems (CARS) uses contextual infor-mation about the user such as age and gender among other features, or


contextual information about the item for the recommender model inorder to create relevant recommendations for the context at hand. Pre-vious research has tried to extract user context from di�erent sources[15, 35, 39]. Among other features, time has been used as context inprevious research [25], as this data is usually available in datasets andcan be incorporated directly to the model without inference. This thesiswill focus on context-aware recommendation and will investigate itemand context features in order to investigate its accuracy for personalizedrecommendations, discussed in detail in the following chapters.

2.4.1 Matrix FactorizationMatrix Factorization (MF) is a popular technique for producing productrecommendations and has been shown to be superior to other classicalapproaches that uses nearest-negihbor techniques and allows to predictthe relation between two categorical variables [28]. In 2009, Netflix an-nounced a competition to be held in the area of recommender systems, inwhich they would give the best prediction algorithm that performed 10%better than their own recommender system with a prize of one millionUS dollars [26]. BellKor’s Pragmatic Chaos, the combined team ofBellKor, Pragmatic Theory and BigChaos was named as the prize win-ner in September of 2009, and matrix factorization was a big part of thesuccess for their implementation [27, 30, 33].

As an extension to CF and neighborhood techniques, the matrix fac-torization approach has shown promise. The CF and neighborhood takea local approach to ratings for item similarity, whereas the factorizationapproach has a global view, where the idea of MF is to decompose usersand items into a set of latent factors.

The main idea behind matrix factorization models is to map bothusers and items to a joint latent factor space of dimensionality f , so thatuser-item interactions are modeled as inner products in the defined space.Each item i exist in a vector v œ Rf and each user u is represented in avector q

i

œ Rf .The elements of p

u

where u is a user measures the extent of interestthe user has in items that are on the corresponding factors which can bepositive or negative. The dot product captures the interaction betweenuser u and item i, which is the representation of the user’s overall interestin the given item [28].

12 CHAPTER 2. RELATED WORK

2.4.2 Factorization MachinesFactorization Machines (FM) are an ensamble, generic approach thatmimic most factorization models by feature engineering. They havebeen previously found to be applicable and very suitable for the taskof context-aware personalised recommendations, and has shown greatpromise to solve the cold-start problem at the same time [73]. FM’s areused by engineering of the features for input, and combines the superior-ity of factorization models in estimating interactions between categoricalvariables of large domains [40, 45, 51]. Implementations of FM’s usesstochastic gradient descent and alternating least-squares optimization,Bayesian inference using Markov Chains among other techniques [51].

2.5 Hybrid Recommendation SystemsA variety of recommender techniques and approaches has been presentedin the previous sections and many of them are good at either collaborativefiltering approaches, or at giving knowledge-based recommendations. Inmany cases, a hybrid approach is desired where the strength of both sys-tems is combined in order to create better overall recommendations thattake both user interactions and content information into account [10]. Byusing di�erent recommender models and combining the results, previouswork has shown that hybrids work well, especially when combining twocomponents of di�ering strengths, such as from CF and content-basedsystems [20]. Some empirical studies has implemented hybrid systemsand adapted the choice of methods being used for recommendation de-pending on user behaviour [7].

2.6 Evaluation of Recommendation SystemsIn the area of recommendation, there are multiple challenges when tryingto evaluate the performance of the systems at hand. Di�erent metricsneed to be used depending on what the recommender task is and what therecommender system is trying to predict. Studies in the area of recom-mendation systems has been lacking in using similar metrics, evaluationprotocols and evaluation criteria [47] but previous research has primaryfocused on metrics such as precision and recall or Relative OperatingCharacteristic (ROC) [14].

The following metrics are used thourghout di�erent research projects:


• AccuracyThe accuracy is the degree of closeness of measurements of a quan-tity to a quantity’s true value. In recommender systems, the accu-racy of a prediction is defined as the first item being correct in theprediction for the given user. In information systems, it is oftenmeasured with precision-recall measures.

• Precision-RecallIn the fields of pattern recognition and information retrieval, pre-cision is defined as a fraction of retrieved, relevant items. Theprecision is a way to evaluate the recommender systems ability toprovide recmomendations that are relevant to the user. The recallis the fraction of relevant items that are retrieved and is a mea-surement of how well the recommendations cover the range of theusers’ taste.

• Receiver operating characteristic (ROC) curvesA ROC curve plot is commonly used with binary classifiers, whereit describes diagnostic ability of these. ROC curves are commonlyused to compare the true positive rate to the false positive rate.

• Utility measuresUtility measures try to measure the usefulness or relevance of therecommended items with respect to their position in a specific or-dering.

• Reciprocal rankThe reciprocal rank, often measured with a mean, is a measure-ment that evaluates any process that produces a list of responsesthat is ordered by probability of correctness. In the area of recom-mendation, the reciprocal rank is a utility measure that measureshow well the recommendation is given a list of recommended items[44].

Chapter 3

Background

This thesis will focus on building a recommender model with the use ofdeep learning. The foundation of deep learning are artificial neural net-works. The theory and architecture behind them are highly relevant forthe understanding of other concepts of deep learning, and the applicationof other type of network structures in the area are of importance. Thischapter and the following sections will introduce concepts and theoriesthat are of importance for this thesis and for the implementation of theintended recommender system.

3.1 Vector Representation of WordsThe areas of ML and AI have been invested in, and has made importantcontributions to the area of Natural Language Processing (NLP). Onesignificant contribution is Word2Vec, a model to learn vector representa-tion of words, known as word embeddings.

Humans can recognize vision, sound, and language from the encodedraw data representation, such as an image, audio or text, however, NLPsystems often encode words by representing them as atomic symbols, bygiving them a unique discrete id. The word man could be given the rep-resentation Word43 and woman given the representation Word32. Thesetype of models fail to find patterns that it finds for woman and connectthem to man, such as it is a human [56]. These type of representationslead to sparsity in the data, where a lot more data might be necessaryfor learning a generalization with a statistical model.

One suggested solution to this problem is to use Vector Space Model(VSM) in order to represent, or embed words in a continuous vector space

14

CHAPTER 3. BACKGROUND 15

where words with similar semantic meaning are mapped to nearby pointsin space.

Previous work [71] have successfully used embeddings in order to trainDeep Neural Networks for the task of recommendation, for the purposeof learning context similarity among di�erent type of embedded itemssuch as learning country embeddings, where the ultimate goal is to usethis continuous representation in order to map similar items into similarregions. These embeddings have di�erent dimensions, but can easily betransformed to two or three dimensions for visualisation purposes witht-distributed stochastic neighbor embedding (t-SNE).

3.1.1 EmbeddingThe mathematical definition of an embedding is that is an instance ofsome mathematical structure that is contained within another instance,such as a group or a subgroup. An object A is embedded into anotherobject B when the embedding is given by an injective and structurepreserving map such as f : A æ B. Multiple, non-dependent embeddingsof A in B are possible, where an example of this is the situation existwith the real, rational, integers and natural numbers as seen in 3.1.

ℝ ℚ ℤ ℕ

Figure 3.1: the Real numbers R include the rational Q, which include theintegers Z, which include the natural numbers N, an embedding wherethe natural numbers are the canonical, or standard embedding.

16 CHAPTER 3. BACKGROUND

3.1.2 Word2VecWord2Vec is a predictive model for learning word embeddings from rawtext and is either implemented through the Continuous Bag-of-Wordsmodel (CBOW) or with the Skip-Gram model. The skip-gram modelpredicts source context-words from the target words, while CBOW pre-dicts target words given the source context words [56]. Most practicalimplementations use the CBOW representation, since CBOW is severaltimes faster to train than the skip-gram and gets slightly better accuracyfor frequent words [85].

3.2 Artificial Neural NetworksAn Artificial Neural Network (ANN) is a system based on a networktype structure that is an ensemble of neural units, or artificial neurons. Alarge collection of neurons are connected by edges to many other neurons,where these edges control the activation state of each neuron. The goalof training the network is to learn a function f based on its inputs [34].

There exist previous successful approaches to use neural networkstructures for the task of recommendation. Many of these approachesinclude combined approaches with the ones of collaborative filtering, es-pecially ones that were published before the many deep learning break-throughs. The biggest one being the deep learning breakthrough withthe introduction of the Restricted Boltzmann Machine [31]. In [16], theimportance of neural networks techniques and their involvement in therecommendation process is highlighted together with colllaborative fil-tering for personalization. In [11], a hybrid recommender system thatcombines CF with a self-organiziing map neural network is presented,which not only satisfies the predictability of the CF algorithm in Grou-pLens, but improves the scalability and performance of a traditional CFtechnique. [21] Uses a hybrid approach and uses trained neural networksrepresenting individual user preferences for content filtering on Movie-Lens data that yields high accuracy predictions. Other research in thearea focuses on deep learning approaches and will be discussed thoroughlyin a later section.


Figure 3.2: An artificial neural network

3.3 Feed Forward Neural NetworksA Feed-Forward Neural Network (FFN) is one of the simplest kinds ofneural networks and is an ANN that has input layers, hidden layers andoutput layers, where the progression of information flows in only oneforward direction, from the input layers, through the hidden layers tothe output layers. In this type of network structure, the connectionsbetween the units do not form a cycle and there exist no loops in thenetwork.

3.3.1 Single-Layer PerceptronThe single-layer perceptron network is the simplest kind of neural net-work. This network consists of a single layer of output nodes, where theinputs are fed directly to the output with a vector of weights and makeup for the simplest form of a feed-forward network. In each node of thenetwork the sum of the products of the input weights are calculated.

The most basic activation function is the perceptron activation func-tion and works like the following. If the calculated value is above 0, theneuron fires and changes its value to 1, otherwise deactivates and changes


its value to -1 and finally outputs this value [50].The drawback this activation function has is that it can only find

linearly separable decision boundaries. For finding other decision bound-aries, we turn to other activation functions such as the logistic function.

f(x) = 11 + e≠x

(3.1)

This function is commonly known as the sigmoid function, and is ableto compute continuous output, defined in 3.1 and visually representedin 3.3. This activation function is usually used in multi-layer neuralnetworks and is used for finding non-linear decision boundaries [2].

Figure 3.3: the Logistic Curve

When creating an architecture of neural networks, it is of importanceto choose an activation function that is well suited for the learning task athand. In the following sections, we will examine the hyperbolic tangent,softmax activation function, as well as the rectified linear unit activationfunction.

3.3.2 Multilayer PerceptronA multilayer perceptron network is a fully connected feedforward neuralnetwork and is a natural extension of the standard, linear single-layerperceptron. It is constructed of at least three layers, the input layer, atleast one hidden layer and the output layer. In contrast to the single-layer perceptron, this network has a nonlinear activation function thatcan find nonlinear decision boundaries. Multilayer perceptron networksuses backpropagation for training the network.


Figure 3.4: the Hyperbolic Tangent

The two main activation functions used for a multilayer perceptronnetwork are both sigmoids, functions that are shaped in the form of anS. These are the logistic function, described in the previous chapter, aswell as the hyperbolic tangent function, described in equation 3.2.

y(x) = tanh(x) (3.2)

3.4 Convolutional Neural NetworksThe Convolutional Neural Network (CNN) is a specialized type of feed-forward artificial neural network and is used for processing data that hasa known pattern, or grid-like topology. These type of networks uses threeimportant ideas that can help a machine learning system, namely sparseinteractions, parameter sharing and equivariant representations and isnot restricted to one variable input size [72], because of the nature of theconvolution operation. This section describes these types of neural netsand discusses the previous and possible applications in the area of musicrecommendation.

CNNs are commonly used with pattern matching with images whichhas a two dimensional representation, and are useful when dealing withtime-series data as it can be represented as a one dimensional image.These type of networks has been very successful in practical applicationssuch as in computer vision and in the field of robotics, and possibly oneof the most successful neural network approach for practical applications


overall.Previous research at Spotify has shown that it is possible to use CNNs

for processing on input mel-spectrograms generated from audio for de-tecting complex and invariant features in the audio itself. The authorapplies di�erent architectures of the network and succeedes with a com-bination of convolutional, Rectified Linear Unit (ReLU) layer and max-pooling layers. This architecture allowed multiple patterns to be detectedsuch as bass drum sounds and vibrato singing [62]. These results showsanother application area for CNNs and puts importance of the patternsthat exist in music and similarities between items.

Figure 3.5: One of the architectures for the CNN used for pattern detec-tion in audio mel-spectograms [62]

3.4.1 ConvolutionThe convolution in convolutional neural networks is referring to the math-ematical convolution operation. The convolution is an operation on twofunctions of a real-valued argument. It is defined as the integral of theproduct of the two functions, after one of the functions is reversed andshifted. Commonly, the first input given to the convolution is known asthe input, whereas the second argument is the kernel, finally the outputis known as the feature mapping of these arguments. The definition ofconvolution is described as a weighted average of the function f(t) at themoment t where the weighting is given by g(≠t) shifted by amount t. Ast changes, the weighting function highlight parts of the input function,


and blends one function with another.

3.4.2 Rectified Linear UnitThe Rectified Linear Unit (ReLU) layer is one of the most common non-linear activation layers that are used in convolutional networks and is oneof the most popular activation function used in deep neural networks [66].

Figure 3.6: the Rectified Linear Unit (ReLU)

The rectifier is defined as f(x) = max(0, x), where x is the input tothe neuron. In the first stage, the linear, convolutional layer produces aset of linear activations which is passed thereon to the second layer, theReLU layer for further investigation of non-linearities and is sometimesreferred to the detector stage [72]. The input from the ReLU layer ispassed thenon to the third stage in the convolutional layer, namely thepooling layer.

3.4.3 Exponential Linear UnitThe exponential linear unit (ELU) is an alternative to the ReLU activa-tion function and is claimed to speed up learning with DNNs and lead tohigher classification accuracies. Its motivated use is to speedup learningby avoiding a bias shift that ReLU is predisposed to. In contrast to Re-LUs, ELUs have negative values, which allows them to push mean unitactivations closer to zero which speeds up learning. This technique is sim-ilar to batch normalization, but is claimed to have lower computationalcomplexity.


Equation 3.3 defines the ELU where a is a hyperparameter to betuned and a Ø 0 is a constraint.

f(x) =Y]

[x if x Ø 0a(ex ≠ 1) otherwise

(3.3)

ELU networks produced competitive results on ImageNet, where it ledto faster learning in comparison to a ReLU network with the same archi-tecture. ELU networks has currently best published results on CIFAR-100, and claims faster learning and significantly better generalizationperformance than ReLUs on networks with more than 5 layers [63].

3.4.4 PoolingIn the third stage of the convolution process, a pooling function is ap-plied to transform the output of the previous layer further. The poolingfunction replace the output of one layer at a specified point by a sum-mary statistic of the nearby output points. A common pooling operationis the max pooling operation which outputs the maximum output in arectangular area. Other pooling functions exist, such as averages or theL2 norm of neighbours, but are less commonly used [72].

3.5 Recurrent Neural NetworksRecurrent neural networks or RNNs, are a family of neural networks ca-pable of processing sequential data. RNNs are another type of specializedneural network much like CNNs, and are specialized to process sequencialdata, sequences, such as time series data, with variable lengths such as avector of values x1, ..., x

n

. These type of networks produce an output ateach timestep and has recurrent connections between the hidden units inthe network.

3.5.1 LSTMThe introduction of the self-loop to produce paths where the gradient canflow for long durations was the core contribution of the initial long short-term memory (LSTM) model [5]. The LSTM network has been verysuccessful in practical applications for speech recognition [55], machinetranslation [61], handwriting recognition [54] and image captioning [67] to


name a few. In the area of recommendation, LSTM approaches have beene�ective in learning temporal recommendation models [75] and predictinguser perference based on their reviews with data from the Yelp DataChallenge [68] and in session based recommendations [64].

3.6 Deep Neural NetworksA deep neural network (DNN) is an ANN with many hidden layers be-tween its input and output layers. As a multilayered neural network, theDNN is capable of modelling complex non-linear relationships [72].

Deep neural networks has recently become popular after pioneeringkey research events in the area. Until the year of 2006, DNNs wereunder-performing those of shallow neural networks. Breakthrough papersfrom Hinton, Bengio and LeCun in 2006 and 2007 changes this. Hintonpublished an article that solved the vanishing gradient problem, a problemconcerning the training process of deep neural networks when calculatingthe gradient in early layers of the network. With the introduction ofthe RBM [31] and another that introduced the Deep Belief Net. Otherimportant pioneering work include Greedy Layer-Wise Training of DeepNetworks [18] and Scaling learning algorithms towards AI [19].

Another big reason why the era of deep learning has begun now hasto do with progressions in the area of computing hardware. Since DNNsusually take very long to train, recent advances in the area of comput-ing hardware, especially contributions from Graphical Processing Unit(GPU) manufacturers such as NVIDIA [77] has reduced the time to prop-erly train these networks. Powerful hardware to compute on, togetherwith techniques from distributed computing that has allowed for paral-lelization and parallel training [76] has allowed for the area to rise inrecent years. Software libraries such as Keras [79] and TensorFlow [82]have been specifically created for machine learning and deep learningtasks and crafted in such a way to allow parallel, GPU training.

3.6.1 BackpropagationThe backpropagation, or backprop also known as Backpropagation ThroughTime (BPTT) is a common technique used to train ANNs and DNNs to-gether with optimization methods such as gradient descent. Backprop-agation is an algorithm that repeats propagation and weight updates,calculating the gradient backwards [4]. In deep or recurrent networks,


Figure 3.7: A deeper neural network

back-propagated error signals can either shrink rapidly, or grow out ofbounds, a behaviour known as the vanishing or exploding gradient. Thisproblem that was unsolved until Hinton’s contributions in 2006 with theDeep Boltzmann Machine [31] which allowed to the networks to be prop-erly trained.

3.7 Deep LearningPrevious sections have introduced most commonly known artificial neuralnetworks and discussed a range of practical implementations to real-worldproblems, including those that are intended for the task of recommen-dation, or solutions that can aid the recommendation process by findinglatent features that lie in the data the system of recommendation exist


in. In this section we dive deeper into some mentioned concepts that aremore relevant when faced with a deep learning task in contrast to a muchsimpler neural net.

Deep Learning is the task of using deep neural networks for creatingsolutions for tasks such as speech recognition [36, 37, 48]. Using deeplearning has had great success for pedestrian detection [58] and imagesegmentation [53], among other fields. Following, we discuss approachesto recommendation in depth.

Previous research in the area of Deep Learning and recommender sys-tems is slim and is a relatively unexplored area. Some research in thearea does exist, and gives motivation for the task this thesis concernsitself with. [71] describe two deep learning models that are used for thetask of recommendation, one for candidate generation and another forranking. Early stages of their networks mimicked the ones of matrixfactorization and is described to be a non-linear generalization of fac-torization techniques, much like factorization machines try to generalize[40].

They embed features with similar methods as to those of Word2Vecwith word embeddings and features are engineered. The final featurerepresentation is passed into several networks with di�erent sized ReLUlayers for candidate generation and is thereafter passed to another net-work with similar architecture to assign scores for labels using logisticregression.

Some previous research primarily focusing on music recommendationwith the use of deep learning exist. In [60] the authors approach theproblem by using deep convolutional neural networks, investigate andevaluate data from the Million Song Dataset, a popular dataset in thedata science community [41]. With the use of CNNs they find latentfactors from music audio. They show that the network’s predicted latentfactors produces sensible recommendations and use weighted matrix fac-torization in order to learn latent factor representations of all users anditems [60] in the Taste Profile Subset from Echonest [41].

3.7.1 Regularization TechniquesMachine Learning algorithms are crafted with the intention of becominga generalization of the data that it gets trained on. A central problem inthis area is therefore not just generalizing on the training data, but onnew, unseen inputs. Models can underfit or overfit the training data, and


in many cases we use di�erent techniques to make sure that our modelreaches an optimal capacity. Regularization is the art of crafting themodel in such way that it reaches optimal capacity.

Many regularization strategies exist where some add extra restric-tions on the parameter values, but in the context of deep learning mostregularization aims to trade increased bias for reduced variance by usingregularization estimators [72].

Bias-Variance tradeo�

The bias-variance tradeo� is another central problem in the area of ML.It is the problem of minimizing two sources of error simultaneously thatprevent supervised learning algorithms from generalizing beyond theirtraining set. The bias is the error from flawed assumptions the learningalgorithm assumes. High bias can cause a learning algorithm to notrecognize relevant relations and correlations between features and targetoutputs, which leads to underfitting. The variance is an error from thesensitivity to small inconsistencies that exist in the training set, whichcan cause overfitting, where the model is generalizing random noise inthe training data [3, 29].

Dataset Augmentation

Obtaining and training the deep learning model on more data is one ofthe best way to generalize the model. In many situations, this can be adi�cult task, but for most ML tasks such as classifications, it is easy tocreate more fabricated data, with techniques such as boostrap aggregatingor boosting [38, 72].

Early Stopping

Early stopping is a technique that has been useful in the area of deeplearning. In many situations, it is possible to investigate the model’saccuracy through time and see a common pattern of the model’s tendancyto overfit. Stopping the training earlier than intended can be a usefultechnique to prevent this from happening and can therefore obtain amodel with a better validation set error with hopefully better test seterror [38].


3.7.2 Optimization TechniquesDeep learning algorithms usually has some optimization technique im-plemented for creating a more general and correct approximation of theprediction function, such as stochastic gradient descent optimization [43].Mathematical optimization is the task of selecting the best element ofsome function f(x), either by minimizing or maximizing it, by changingthe value of x. When we have a function that is minimized, it is alsocommonly known as the cost function, loss function or error function,which represents an error that we would like to minimize in order toobtain higher accuracy.

Gradient Descent

Gradient descent is the most popular optimization algorithm in the areaof deep learning for training deep neural networks. The technique has itsfoundations in Calculus, the study of change in functions.

Figure 3.8: A visual interpretation of gradient descent [72]

In other words, it is the representation of how the change of value ofx will behave and what to output to expect, f(x + ‘) ¥ f(x) + ‘f Õ(x),which tells how to make a small change of x in order to get a potentialimprovement of y. By changing x incrementally with the opposite sign ofthe derivative, we can reduce our cost function f(x), and by doing this,we are applying the technique of gradient descent.


Figure 3.9: Di�erent types of saddle points [72]

Given a functions inputs, we minimize the function by using gradientdescent. By using an iterative approach, gradient descent aims to take astep in the negative direction of the gradient.

In deep learning, we optimize functions that have many local minimathat are not optimal and multiple saddle points. When the input ismultidimensional we settle for a very low f value, but not necessarily themost minimal [72].

3.7.3 MomentumThe momentum refers to stochastic gradient descent with the use ofmomentum, where the update ”w is remembered at each iteration andthe next update is determined as a convex combination of the gradientand the previous update. Similar to momentum, Nesterov’s AcceleratedGradient (NAG) is a first-order optimization method with a guaranteeof better convergence rate than gradient descent in certain situations,where NAG achieves a global convergence rate of O( 1

T

2 ) after T steps, incontrast to the O( 1

T

) of gradient descent [59].

3.7.4 AdagradAdagrad, or Adaptive Gradient Descent is a modified stochastic gradientdescent algorithm that has per-parameter learning rate. Adagrad allowincrease of the learning rate for more sparse parameters and decreasesthe learning rates for less sparse parameters. Using the technique ofadaptive learning improves convergence performances over the ones ofstandard stochastic gradient descent where the data is sparse and sparseparameters are more informative than others [42].


3.7.5 ChallengesConstructing a generalized deep learning architecture that fits the train-ing data well is many times referred to as more of an art than science [72].However, the challenges that prevails the area include problems duringtraining, lower layers not getting trained well due to gradient errors, thevanishing or exploding gradient, lack of data and overfitting. Consider-ing this, there exist guidance from pioneers in the area, such as the onesfrom Yoshua Bengio and Andrew Ng [46, 74]. A few of these consist ofbut are not limited to the following techniques.

• Normalize inputs to the deep learning algorithm

• Engineer the architecture and choose the correct network for thelearning task

• Optimization tricks such as gradient tricks and early stopping

• Check if the model is powerful enough to overfit, if it is, turn toregularization, if not, change the model structure or make it larger.

Chapter 4

Data

For any machine learning task, it is of importance to investigate whatkind of data exist and what features are available for processing.

This chapter investigates the Spotify backend infrastructure, how thisdata was collected and some exploratory analysis of the data is presented.

4.1 SpotifyThe data used for this thesis project was extracted from Spotify’s bigdata cluster residing at Google Bigcloud.

With over 100 million active users [81], over 30 million songs witharound 20 thousand new songs per day and over 2 billion playlists cre-ated by curators and users, Spotify is a very data-centered company.Spotify has a 2500+ node Hadoop cluster that logs 50TB per day andruns over 10 thousands of jobs per day. For many years, Spotify hasrelied on Apache Spark, a popular engine for large-scale data processing.In recent years, Spotify has been moving over to Google Bigcloud anduses Dataflow with Scala [80]. Dataflow is a unified programming modeland provides a managed service for developing and executing di�erentdata processing patterns such as ETL, batch computation and continu-ous computation. With Cloud Dataflow, engineers do not have to focuson resource management and performance optimization [78].

Dataflow provides a unified batch and streaming model, and Spotifyis able to use the Google Cloud Platform (GCS) ecosystem, namely Big-query, Bigtable, Datastore and Pubsub. Scala, an acronym for ScalableLanguage, is a pure-bred object-oriented language, as well as a functionallanguage, which is a natural fit for data [80].

30

CHAPTER 4. DATA 31

4.2 Data CollectionThe data was primarily fetched with Scala with the open source APIScio, built by Spotify in order to construct and deploy Big Data jobs toGoogle BigCloud. Initial exploratory analysis of the data was done withSQL by using Google BigQuery and by selecting data from the desiredtables holding the information needed.

4.2.1 ScioScio is an open source Scala API for Apache Beam and Google CloudDataflow, developed and managed by Spotify employees. With Scio, thedeveloper is able to use query tables with SQL for a given project throughBigQuery. The result of the query is initially mapped to a case classwhich uses the SQL column names for initial reference. Furthermore,the result can be transformed and manipulated in any desired way. Theresult of the transformations on the data can be stored on Google CloudStorage in any form desired, such as a textfile or dumped to a BigQuerytable.

In the following code snippet, adapted from the code for the base-line, an initial SQL query is written for obtaining all rows that matchthe WHERE clause, since this table contains entries that are not validURLs for linking or recommendation purposes. Further, the top 100 playcontexts for each country and hour is extracted.

4.2.2 Data PipelineThe final data pipeline process several billions of rows of data, distributedon a hundred 8-core machines, and is responsible for creating the training,test, evaluation sets of data.

4.3 Data AnalysisBefore the data was chosen and extracted, careful consideration wentthrough the choice of what data could hold necessary value for patternfinding. Each table was initially queried through SQL queries for analysisand exploration.

32 CHAPTER 4. DATA

4.3.1 Play ContextsPrevious chapters and sections have mentioned play contexts withoutproper definition. The initial chapter of this work mention the limita-tions of this thesis and mentions that this work will restrict itself to notpredicting a set of songs, but play contexts instead. Let us revisit andexpand this definition.

A play context in the listening history table and on the Spotify plat-form represent and are encoded as a Spotify URI that is a collection ofn tracks, where n Ø 1, such as

spotify:user:professoroaks:playlist:5XZNhrbvaOvD6BORMavxmn

repesenting a playlist. The corresponding HTTP link for this playlist ishttps : //open.spotify.com/user/professoroaks/playlist/

5XZNhrbvaOvD6BORMavxmn.As presented, the alternate form of the URI is a HTTP string that

is almost identical, but adapted for the web. Users are given the optionto share either the URI or HTTP representation for any play context onthe platform, where the URI can be pasted in the search field, whereasthe link redirects the user to the application when explored through aweb browser. The majority of all play contexts that have this kind ofrepresentation and public play context URIs can be representing a playcontext that is defined to be a playlist, artist, album, track or radio. Ourgoal is to retrieve the 100, 000 most popular play contexts and build avocabulary V100k

based on these by assigning each play context a uniqueid. The play contexts œ V100k

will be the labels the model aims to rankduring learning and predict for a given set of unique features.

4.3.2 Listening HistoryThe listening history table was considered first. This table arguablycontains the most important features for this task. This table containsevery listened song on the service, data expressing how much of this songwas streamed along with other numerical features. For each row, thistable expresses which user this event belongs to, what device the songwas listened through and what play context the user was playing thesong in, among other features.

After exploration of this table for a given day of training data andafter filtering out irrelevant play context representations that are out of

CHAPTER 4. DATA 33

scope and posses unpredictable contexts, such as free search, we estimatethe sum of the top 100, 000 play contexts, representing the most listenedplay contexts to be roughly half of the sum of the total play contexts.Further calculations show that the top 10, 000 contexts roughly captures40% of this sum, from which we can deduce the following.

• The top 100, 000 play contexts do not only capture a wide varietyof all available play contexts, but nearly half of the sum of theoccurrences of all play contexts.

• Since the sum of the top 10, 000 play contexts is not very far fromthe top 100, 000, the bigger tail of the top 100, 000 mostly representlistening history from personal playlists, or less popular contexts.

Further calculations confirm this. The least popular play contextswhere the number of appearances are < 1, 000 represent all play contextsnot captured by the top 100, 000. Creating a vocabulary of all playcontexts available implies a huge vocabulary and a majority of noise thatwill imply di�culties for model generalization and will potentially aftermany epochs, provide recommendations that only fit a subset of usersand minor boost in metrics. These presented numbers do vary fromday to day, but with a very small margin. Obviously, previous contextsfrom one day will be knocked out from the top for the next day, due tonew releases and trends, but the presented fractions remains the same.Therefore, we chose to restrict our vocabulary to the top 100, 000 playcontexts for the week of training data for experiments.

Each user’s distinct play context history is created as an ordered set ofitems, beginning with the first seen occurrence in the chosen time frame.The history is empty to begin with and gets built up with respect to timeand order occurrence, with a limit of one hundred in size. For user’s withmore than a hundred distinct play contexts, the first seen play contextis removed and the latest is appended as the last item in the list forchronological representation. Additionally, another ordered set of itemsis created, representing the number of occurrences of the play context.Entry i in this set corresponds to entry i in the play context history set,and describes the number of songs the user listened to in play context i.

4.3.3 User DataInformation about the user was considered next. Multiple tables had tobe queried in order to fetch the desired data, holding features such as

34 CHAPTER 4. DATA

gender, age and user country, important for solving the cold start problemthat many recommender systems su�er from.

Furthermore, user a�nity to previously interacted items in the plat-form catalogue is considered. Data for each user’s top 100, most playedtracks exist in another table and is extracted. Another vector is cre-ated and joined with metadata for mapping purposes and the sum of thecounts for all tracks are mapped to lookup files for weight initializationfor embedding lookups.

4.3.4 MetadataThe metadata for all discussed content exist in di�erent tables and aregathered sequentially, joined with the encrypted or decrypted strings forthe play contexts for extracting the titles for visualization and identifi-cation purposes during the development of the models.

4.3.5 Training and Evaluation DataThe chosen training and testing data were based on one week of dataeach. The training set beholds an entire week of data from the listeninghistory table, in the date interval, in Y Y ≠ MM ≠ DD format, from2017 ≠ 04 ≠ 03 to 2017 ≠ 04 ≠ 09. The evaluation data, the unseen dataintended for metric calculation and prediction is extracted from the weekthereafter, in the date interval 2017 ≠ 04 ≠ 10 to 2017 ≠ 04 ≠ 16.

4.4 Feature Engineering and RepresentationFor any machine learning task, feature engineering might have to beconsidered in order to successfully train a statistical model. Featureengineering concerns itself with brainstorming or testing new featuresand deciding what features to create and later on use for the desiredmodel. This section deals with specifics regarding feature engineering forthe features in question.

Initially, the most popular 100, 000 playing contexts for the trainingweek are extracted through a SQL query, sorted by number of apperancesin the dataset, and is mapped to a .tsv file for Tensorboard visualizationand pipeline optimization purposes. Because the intention is to learnembeddings, each play context is mapped to a distinct numerical id for

CHAPTER 4. DATA 35

vocabulary creation and for reference lookup purposes. During the con-struction of the user history vector, out of vocabulary occurrences of playcontexts are mapped to a default, out of vocabulary id.

The features are divided between the regular division between cate-gorical and ordinal, or numerical features. The categorical features suchas platform or country all vary in length, some are binary and some havethousands of possible values.

The majority of features are engineered. Categorical features such ascountry are mapped to numerical values for ease of numerical computa-tion. These mappings are saved for further visualizations with Tensor-Board and for embedding lookups. The unix timestamp is transformedto express a numerical floating point value between [0, 1) for both timeof day and week. The gender of the user is mapped to a numerical valuein the range [0, 2], where 1 represents an unknown gender. A limit of 100for the listening history length is set for each user, and respectively itscorresponding weight vector representing the number of songs listened toin each context, due to the enormous size of the data and for faster train-ing. In practice, multiple other user and content features are selected,engineered and prepared for training.

Chapter 5

Method

Being able to measure the performance of the desired and sophisticatedmodel that accurately models the information process and compare itto the simpler baseline is of importance, in order to provide credibility,proof and motivation for future research and future implementation areasto continue to investigate similar or better models as well as contributingto the tree of knowledge.

This chapter presents the methodology of this thesis and presents theapproaches of constructing the state-of-the-art deep learning architec-tures. The architectures are presented with reasoning behind the choicesof representation for the features and the engineering behind them thatlead the sophisticated model to success are presented.

5.1 Recommendation Represented as Classifi-cation

In the chosen vocabulary of 100, 000 play contexts, one play context orlabel needs to be predicted correctly given the users contextual features.We use an o�ine evaluation and compare the model predictions withevaluation examples from a week of listening history that has alreadyhappened. Therefore, the true label we aim to predict is the label ex-pressing the play context the user chose to listen to. Labels should appearin relevant order, preferably in the top of the recommended list of playcontexts for a boost in metric performances given an evaluation exam-ple. This task therefore implies a task of extreme multiclass classificationproblem with a very high number of labels with di�erent classes to rank

36

CHAPTER 5. METHOD 37

for each evaluation example given its features and one single true labelto compare to.

The prediction problem can be transferred to and represented as ac-curately classifying a single specific play context p

t

at time td

and tw

among hundred thousand of classes, from the corpus of play contextsV100k

based on a user U with its contextual features describing it. Eachuser u represents a high-dimensional embedding with its contextual fea-tures describing it, denoted as u. The history embeddings and tracka�nity embeddings represent a mapping from sparse vectors into a densevector in RN.

The task the deep neural network is faced with is to learn the givenrepresentation of user as a function of the user’s listening history andcontextual features that are relevant for distinguishing play contexts inthe corpus, with softmax as the ideal classifier.

It is worth mentioning that many recommendation engines solves oneproblem and hands the output to another for ranking. Some recommen-dation engines use further models for ranking purposes, which is mayimprove performance, but is out of scope for this thesis.

5.1.1 Classifier E�ciencyIn order to train the proposed classifier with hundred thousand labels toassign probabilities to e�ciently, one cannot rely on traditional softmaxto assign a probability for all of them given one positive example, dueto the big number of labels and the time it will take to do this duringtraining. Therefore, sampled softmax is used. For each positive exam-ple, the cross-entropy is minimized for the true label and n number ofsampled negative labels, with experiments ranging from 50 to thousandsof negative samples. Di�erent sampling methods are used, with someexperiments using probability priors assigned for each example. Thesampling technique provides e�ciency and great speedup over classicalsoftmax and hierarchical softmax during training and is therefore prefer-able. Sampled Noise Contrastive Estimation (NCE) loss has been used totrain the best model for comparison, with directly observable di�erencein behaviour from the ones of softmax. In a perfect world with unlimitedcomputing power and access to resources, one would like to consult alllabels for exact precision and prediction accuracy, which is unfortunatelynot the situation at this point in time.

38 CHAPTER 5. METHOD

5.2 Model ArchitectureHigh dimensional embeddings are learned for fixed vocabularies, inspiredby continuous bag of words language models and in [71]. These em-beddings are thereafter fed into a fully connected feed-forward neuralnetwork with ReLU activation as the main non-linear activation func-tion together with the rest of the features. 5.1 presents the architectureof the proposed model.

The model is trained in batches, with and without batch normal-ization for comparison. The variable length sequence of play contextlistening history v

c

, as well as the corresponding occurrences for theseare represented as sparse inputs and get mapped to dense representationsas required by the feed-forward network. The approach of feeding theaverage of the embedding results are compared with the sum and squareroot, with averaging outperforming the other combiner tactics.

5.2.1 Scoring FunctionThe proposed scoring function uv

c

+ bc

is a dot product of the userquery representation u and the play context embeddings v

c

plus a globalbias based on the logarithm of total number of occurrences of each playcontext in the vocabulary, e�ectively assigning a score for every playcontext in the vocabulary for each user query vector u.

5.2.2 Diverse and Unlimited FeaturesUnlike other techniques for generalization such as the ones of matrixfactorization techniques, the inputs of the ones for neural networks haveno restriction and can be added with ease, making these type of modelsadvantageous in several ways. The users longtime and top a�nity toprevious tracks on the platform, v

t

are embedded and averaged, whichafter averaging represents a summary of dense embeddings.

User demographic features, city and country features are embeddedas well and concatenated with the rest of the query. These features areimportant for solving cold start and gives the model information aboutnew users on the platform, and are important for context-aware andpersonalized recommendations that essentially behaves like collaborativefiltering. The city embeddings v

ci

and country embeddings vco

and theirconcatenation to the rest of the query can be seen in in 5.1. The user


Figure 5.1: Deep Personalized Context-Aware Model

platform, user gender and user age features get represented as scalars,get normalized and transformed into vectors in order to match the di-mensions of the rest of the feature representations, represented as userplatform u

p

, the user gender ug

and the user age ua

. The time featureis engineered and transformed to numerical representations. These fea-tures can be directly concatenated and thereafter fed into the networkwith the rest of the query. Unlike other methods [69] that rely on hashingfor embeddings that ultimately create collisions with fixed vocabulariesand creates shared underlying embeddings, all embeddings in this modelhave distinct embedding vectors. The fully connected layers with di�er-ent dimensionalities enables the network to base the predictions on deepconnections between the features.


5.2.3 Weights and PriorsWeights are assigned for listening history v

c

and for initial play contextsclass assignments as prior probabilities. Each play context history vectorfor a given user has its corresponding count vector which represents thenumber of digested items in the given play context. These are normalizedin di�erent tactics, where the ones for listening history are normalized bytaking the square root of each item. Initial biases, b

c

for the predictionlabels are assigned the logarithm of total the number of occurrences inthe selected top play context vocabulary.

5.2.4 Batch Training and NormalizationFor speedup in training, examples are represented and trained in batches.This provides a great speedup during training, up to 10x speedup inpractice when training on CPUs. Batch normalization as suggested by[65] has been used for some models where previous implementations fordi�erent learning tasks have measured and shown great boosts in modelaccuracy [70].

5.3 Network Layer and Embedding DimensionsAdding multiple features to the model and experimenting with di�erentnumber of layers and levels of depths are common approaches to increasemodel accuracy and these have been taken into account. One of thefirst candidate models rely only on a few set of features and features arecreated and added to new models sequentially for evaluation purposes.Initially, one feedforward network with 128 dimensions is considered andanalyzed through live evaluation during model training for models with-out batch normalization, due to implementational reasons. Experimentswith models that has further depth imply a richer pattern finding in-terface and is compared to other models with smaller dimensions. Sev-eral fully-connected feed forward networks are connected together with apyramid type structure with classical, recommended dimensions that areall © 0 (mod 2), namely 64, 128, 256, 512 as seen in 5.1. The dimensionfor the embedding size for the user representation u uses the same em-bedding size as the track a�nity v

t

and play context embeddings. Thisdimension started at an arbitrary value of 64, and later models took thissize into account and increased this size further, in order to represent the


embeddings in even higher dimensions.

5.4 Vocabulary DimensionThere exist millions of available unique play contexts on the platform.Due to this big number, the vocabulary is limited to the most popu-lar, 100, 000 play contexts, each representing a unique label. Classifyingand learning to rank 100, 000 labels is a di�cult task. Due to the highamount of labels, we are not able to use classical softmax and have torely on sampled softmax that tries to mimic the true softmax in order toaccomplish this task e�ciently. With a smaller set of labels, one can usethe true softmax that arguably leads to higher accuracy and rank scores.With no previous system to compare to on the platform and due to theshort lifetime of the project that has not allowed for live A/B testing, themodels are evaluated in a strict, o�ine manner. The models are trainedusing this limited vocabulary and their performances are compared usingselected metrics.

5.5 Hyperparameters and TuningIt is of importance to consider what hyperparameters to choose fromwhen training a model since many models rely on correct knobturningfor ensuring the success of not only generalization but accuracy of thetrained model. The data and input to the network plays a tremendousrole in model generalization success and pioneers refer to the size of thedata and feature engineering as the most important sources for this. If ahuman is able to see and recommend training examples with the featuresavailable, the machine should be able to do this as well. Therefore, atremendous amount of time was dedicated to considering data, featureengineering and creating features with an expressive representation.

Considering this, some hyperparameter tuning has been experimentedwith regarding the choice of optimizer, loss function and learning rates.

5.5.1 Loss FunctionSampled softmax loss was considered as the primary loss function, withnumber of negative sample classes such as 64, 128 & 1024 and by usingone class as the true class when compared with the false classes. Initially,


sampled NCE loss was consulted and evaluated during training, but wasquickly replaced with sampled softmax loss with 64 negative samples.

Softmax

The softmax function is a generalization of the logistic function. Softmaxsqueezes a K-dimensional vector z of an arbitrary number of real valuesinto a K-dimensional vector represented as ‡(z) of real values in therange (0, 1) that add up to 1, ultimately creating a class of probabilitieswith no restriction in the amount of cases for evaluation as defined in5.1.

‡(z)j

= ezj

qK

k=1 ezkfor j = 1, . . . , K. (5.1)

Noise Contrastive Estimation (NCE) loss

Noise Contrastive Estimation (NCE) loss is an estimation method thathas shown great promise in learning word embeddings, where the ba-sic idea is to train a logistic regression classifier to discriminate betweensamples from the data distribution and samples from the noise distribu-tion, based on the ratio of probabilities of the sample under the model aswell as the noise distribution, as described in [57]. NCE and the sampledsoftmax approach are very similar techniques where both aim to mimicthe true softmax, but have di�erent approaches when trying to do so.

5.5.2 OptimizerGradient descent is the most popular and main optimizer of choice fortraining neural networks. AdamOptimizer, another popular choice thatTensorFlow provides that allows for learning rate decay out of the boxwas consulted but not used due to implementational drawbacks. Learn-ing rate decay was applied to the gradient descent optimizer for someexperiments. Adagrad optimization and Momentum optimization wassuccessfully applied to some experiments and is compared to experimentswith the same model architecture, but trained with gradient descent.


5.6 The Vector of CurationThe Vector of Curation, v

curation

is defined as the resulting vector based ofthe entire play context vocabulary of 100, 000 items, scored as describedby the similarity function and sorted in relevant order for a given userrepresentation u. Where play context pc denotes play contexts in theV100k

vocabulary, entry pc1 denotes the most relevant play context andentry pc

n

denotes the least relevant playcontext, and n = |V100k

| for useru, the vector of curation is defined in 5.2.

vcuration

= [pc1, pc2, ...pcn≠1, pc

n

] (5.2)

For all metrics, only the first 100 items of the vector is consulted forevaluation due to the extent of the size of the vector and due to therelevance of the first 100 items.

5.7 ImplementationPrevious sections have introduced how the architecture of the proposedmodel work theoretically and conceptually. In this section, the toolsavailable for practical implementation is evaluated and the final choicesare presented.

Out of all the available learning APIs for numerical computation us-ing data flow grahps and programming languages to implement thesenetworks in, such as scikit-learn, Ca�e and Theano, the programminglanguage Python and TensorFlow was chosen. This was a natural choice,because of the ease of implementation with the chosen language, and be-cause of not only the rapid growth, but because of the maturity and opensource contributions to the TensorFlow platform.

5.7.1 TensorFlowTensorFlow is an open source software library for numerical computationlibrary using data flow grahps and is the most popular ML platform forimplementing neural networks today. TensorFlow is developed by theGoogle Brain team and was initially released November 9, 2015. Tensor-Flow is under rapid development with 475+ non-Google contributors tothe stable 1.0 release and have big support from the community, with over5, 500 repositories with TensorFlow in the title. There exist big support


from users on platforms such as Stack Overflow, where over 5, 000 ques-tions have been answered to date. The use of TensorFlow in ML classesaround the world is growing, where universities such as Toronto, Berkeley& Stanford incorporating TensorFlow into their laboratory courses [83].TensorFlow uses the term Tensor in tutorials and documentation, and isreferring to the following definition, a mathematical object analogous tobut more general than a vector, represented by an array of componentsthat are functions of the coordinates of a space.

5.7.2 Data and Feature representationTensorFlow supports a wide variety of file formats for feature represen-tation and mapping, supports and has extended tutorials and documen-tation for reading files in csv format, the most popular format for opensource datasets. The user is able to use batching, shu�e the data and de-fine the number of epochs out of the box. TensorFlow recommends usersto encode their data with the standard file format, namely encoded asTFRecords [84]. This representation encodes the data as an Example andstu�s the data within a protocol bu�er, a fileformat created by Google.This format allows for data compression, which in turn makes the size ofthe training data on disk smaller and is preferable. Therefore, all trainingdata in the data generation pipeline is transformed into protocol bu�ers,supporting float, integer and bytearray values and is thenon compressedusing the ZLIB compression type. TensorFlow has an extensive API thatsupports a variety of operations, and most importantly embeddings andsparse tensors.

5.7.3 TrainingThe models were initially trained locally during development but were allfinally trained on Google Cloud via their platform Cloud Machine Learn-ing Engine (MLEngine). MLEngine supports any TensorFlow modelsand supports logging, which was used for logging number of examplestrained, current model loss value and for evaluation during training.

5.8 Baseline AlgorithmIn order to be able to measure the performance of a model, one needs toconsult a baseline algorithm to get a sense of what kind of improvement


the model is capable of giving. The chosen baseline is a context-aware andpersonalized popularity based heuristic that is relevant for the context-aware recommendation problem at hand. This baseline was chosen dueto its simplicity, and it is of interest to compare a simple heuristic likethis with the state-of-the-art approach. The performance of the base-line model is presented and compared to other models in the followingchapter.

5.8.1 Context-Aware Popularity Based HeuristicThe chosen baseline algorithm is a context-aware popularity based heuris-tic and works like the following. As input, we have a dataset of users,listening context, time and country of listening patterns. With this in-formation, we aggregate the context, country and hour of how manyusers listen to this particular context at this given hour and in the se-lected country, and therefore create a list of ascending values where thetop value is the most popular context, followed by the next most pop-ular and so on. Each user’s listening context in the same situation iscompared to these predictions in the same order and the metrics arecalculated thereafter. The pseudocode for the heuristic is presented in 1.

Algorithm 1 Baseline Recommendation HeuristicData: user id, play context, country, hourResult: (country, hour, list(top 100 contexts sorted by popularity))foreach country, hour do

count each occurence of play contextendforeach (country, hour, count) pair do

key by (country, hour)endforeach ((country, hour), count) do

take top(100) by keyend

Where one input training example is given in the following format,where user id is fabricated and only intended as an example.

user_0, spotify:album:1YVf1gCvpUKr0YGtgFVmfm, SE, 2


Following, each user in the respective country and hour are recom-mended the same list of items. The chosen metrics, namely Accuracyand Mean Reciprocal Rank are calculated thereafter.

5.9 MetricsPrevious metric models used for the task of recommendation has beenpresented in the background chapter under section 2.6, Evaluation ofRecommendation Systems and their occurrence inconsistencies in previ-ous work in the area of recommender systems have been mentioned. Inmany cases, online evaluation is consulted for recommender systems, butthe limitation of time and resources for this project have only allowedfor o�ine evaluation. Three metrics have been chosen that are suitedfor this learning task, namely accuracy, mean reciprocal rank and top k.In total, two weeks of data is collected in order to train and evaluatethe system. The metrics are measured on a random sample of 100, 000examples from one week of data, where the model’s prediction of the top100 items is compared to the true value of the user’s listening context, asdescribed in the evaluation set. These chosen metrics imply a very strictevaluation and unfortunately do not capture the strength and relevancethe recommendations the model provides, but example recommendationswill later be referenced in the appendix section of this work. The rankof the item x, which represents the label for the play context pc the userchose to listen to in the predicted vector of top 100 play contexts v

curation

for the user is defined in 5.3.

rank(x) =Y]

[position of pc if pc œ v

curation

0 if ”œ vcuration

(5.3)

5.9.1 AccuracyGiven a list of recommended items where the top item is the most relevantprediction for a given user the accuracy is a measurement of how correctthe prediction is for a given user. The accuracy is 1 if the predictionis correct and 0 otherwise. This metric is calculated for the baselineand for the models for comparison and presentation of how much betterthe models prediction is compared to the suggested baseline heuristic.The accuracy is calculated for the same set of data and users and themean accuracy is measured, together with the mean reciprocal rank. The


accuracy given item x, and its corresponding rank in a ranked list of itemsis defined in 5.6.

accuracy(x) =Y]

[x if rank(x) = 10 otherwise

(5.4)

5.9.2 Reciprocal RankTogether with the accuracy that presents an instant hit recommendationand how accurate the predictions are the first time, we consult the recip-rocal rank to see how well the predictions are overall to get a sense of howfar o� the predictions are from reality. The reciprocal rank predicts howfar o� the prediction is from reality given a top number of items to pre-dict upon. It evaluates a list of possible responses to a sample of queries,ordered by the probability of the correctness. The mean reciprocal rankis the average of reciprocal ranks of results for a sample of queries and isdefined by

The definition of the reciprocal rank for item x in the ranked list ofitems v

curation

is found in 5.5.

rr(x) = 1rank(x) (5.5)

5.9.3 Top-KThe previously introduced metrics are both two very ruthless metricsto use for the prediction task at hand, with only one true label outof the 100, 000 in the selected vocabulary of top contexts and withoutemphasis on ranking. In most applications of recommender systems andinformation retrieval systems such as search engines, many items canbe considered relevant and go through a ranking process, which is notpossible at this time. Therefore, in order to present the strength andpotential of the developed model, the Top-K metric is measured. Similarto Precision at K, it is of interest to analyze if the desired label exist inthe top k where k œ {5, 10, 20, 30, ..., 100} of predictions as given by andsorted by the similarity function. The definition of Top ≠ K for item xin the ranked list of items v

curation

is found in 5.6.

topk

(x) =Y]

[1 if rank(x) Æ k

0 otherwise(5.6)

Chapter 6

Results

The previous chapters have been dedicated to finding relevant informa-tion sources for ensuring the success of the construction of a deep neuralnetwork engineered for the task of recommendation and using state-of-the-art techniques to do so. Two chapters back, the data available and itsengineering was described in depth. The previous chapter described thereasoning behind the architectural choices and the optional structure ofdi�erent modules of the model was presented. This chapter is dedicatedto present the results of the di�erent proposed models in the previouschapter and aims to enlight the reader with the findings of the strengthof the proposed models. The Vectors of Curation v

curation

, given by twomodels for personas with di�erent features can be found in the appendixsection of this thesis.

6.1 Baseline HeuristicThis section presents the baseline heuristic performance of the gathereddata. The context-aware recommendation heuristic baseline is presentedand described in the previous chapter. For each of the baseline heuris-tic and model metric tables, the Class column denotes the metric classbeing evaluated, the Ex. column denotes the number of examples evalu-ated for the respective metric class, MA refers to Mean Accuracy, MRRdenotes the Mean Reciprocal Rank and the T5, T10, T50, T100 en-tries describe the Top at K metric. The baseline heuristic metrics arepresented in table 6.1.

48

CHAPTER 6. RESULTS 49

Class Ex. MA MRR T5 T10 T50 T100

all 100000 2.04% 0.044 7.21% 9.71% 17.59% 22.47%

Table 6.1: Baseline Heuristic Metrics

6.1.1 Mean AccuracyThe accuracy was calculated by taking the most significant, top 100 playcontexts for each hour and country pairs for the week of training dataand evaluated on all users for the evaluation week, as described in theprevious chapter. These were used as a base for the evaluation week, theweek thereafter.

6.1.2 Mean Reciprocal RankThe reciprocal rank was calculated in the same data pipeline and on thesame set of data as the accuracy and the mean reciprocal rank yielded avalue of 0.044.

6.2 ModelsIn this section, trained models are presented and given a nickname for fu-ture reference. All models was implemented with Tensorflow and trainedon Google Cloud with MLEngine as described in the previous chapter.All models were trained in batches of size 64. All models were trainedfor several epochs on the training data until convergence.

All trained models with relevant parameters can be found in 6.2 andtheir respective parameters and hyperparameters were chosen and addedas recommended from previous research, whereas exhaustive search or hy-perparameter optimization is not applicable for a large scale project withenormous amounts of training data. Random hyperparameter searchcould have been used using a subset of the entire training data but dueto the nature of this investigation and the interest to see how di�erenttrained models behave in comparison to each other, this method was notconsulted.

The names in 6.2 uses short names for document fitting reasons. TheExperiment column describes a shortened name of the experiment. Theentries in the dims column denote the decreasing number series where

50 CHAPTER 6. RESULTS

every entry between ... denote the previous entry divided by two to savespace for table entries. For further clarification and as a concrete exam-ple, the entry for DEEPESTELU with notation 8192->...->512 in itsdims column represents a deep network with 6 fully-connected networkswith respective output layer dimensions of 8192, 4092, 2048, 1024, 512,starting with the first network of an output dimension of 8192 that givesits output as the input to the next network, etc, with a final networkthat has an output layer of dimension 256 to match the embedding rep-resentation size. The Loss fn column entry describes the loss functionused for the experiment, either Softmax or NCE. The Embd columndescribes the embedding size of the user representation u, the same em-bedding size the play context and track vector embeddings share. TheOpt column describes which optimizer being used, one of GD, GradientDescent, A, Adagrad or MN, Momentum with Nesterov Acceleration.Entries in column LR describes the initial learning rate for training andfinally, the TV column describes if the track vector feature v

t

is used inthe experiment or withheld from the model.

Experiment Dims Loss fn Embd Opt LR TV

Base Model 64 Softmax 64 GD 0.10 NoBM2 512æ256æ128 Softmax 64 GD 0.10 NoBM2AG 512æ256æ128 Softmax 64 A 0.10 NoTrack Vectors 128 Softmax 64 GD 0.10 YesTVDIMMN 1024æ...æ128 Softmax 64 MN 0.10 YesTVEMBD2 256æ128 Softmax 128 GD 0.10 YesTVEMBD3 1024æ512æ256 Softmax 128 GD 0.10 YesTVEMBD3AG 1024æ512æ256 Softmax 128 A 0.01 YesTVEMBD3NCE 1024æ512æ256 NCE 128 A 0.01 YesTVEMBD3BATCHN 1024æ512æ256 Softmax 128 A 0.01 YesTVEMBD3ELU 1024æ512æ256 Softmax 128 A 0.01 YesDEEP 2048æ...æ128 Softmax 512 A 0.50 YesDEEPER 4096æ...æ512 Softmax 256 A 0.20 YesDEEPESTELU 8192æ...æ512 Softmax 256 A 0.01 Yes

Table 6.2: All trained models with respective parameters and hyperpa-rameters


6.3 ExperimentsThis section presents each experiment, the reason why the experimentwas conducted and why the chosen parameters in the experiment waschosen. Unlike the metrics for the baseline heuristic, it is of interest toanalyze each class of play contexts and how well the model performs onthese, namely albums, artists, playlists and radio. Therefore, each modelevaluates not only all examples with all classes, but analyses how well itperforms on each of the classes individually.

6.3.1 Base ModelInitially, a base model was created with a single feed-forward neuralnetwork with matching output dimensions to match the embedding sizeof the contextual user representation vector u, where the feature trackvectors was withheld from the classifier, represented as v

t

in the modelarchitecture as seen in 5.1. The base model was trained with gradientdescent optimization with an initial learning rate of 0.1 and an embeddingsize of 64. The base model converged with a loss value of 2.071506 andits performance on the evaluation set is presented in 6.3 with addedmetric classes for album, artist, playlists and radio. The simple basemodel outperformed the baseline heuristic on all metrics unexpectedly.


all 100000 3.00% 0.065 9.04% 13.53% 29.28% 37.84%album 21984 1.65% 0.038 1.10% 1.72% 4.37% 5.88%artist 20113 1.23% 0.031 0.78% 1.39% 3.90% 5.46%playlist 54640 3.90% 0.082 6.36% 9.32% 19.24% 24.49%radio 3263 8.00% 0.166 0.80% 1.10% 1.77% 2.00%

Table 6.3: Base model performance on evaluation data

BM2 - Base Model with Multiple Networks

The base model was extended further and multiple fully-connected net-works were added, that enable the model to base predictions on deepconnections between latent features. As described in 6.2 and visuallyrepresented in 5.1 with the user representation u, this model takes this


representation of all features as input and has a corresponding input sizeto fit this representation, with an output size of 512. The outputs fromthe first network is passed then on to the next network with a matchinginput size and an output size of 256. The next to last network takes theseinputs and outputs a dimension of 128 and passes on its outputs to thefinal network that has a matching output size of the embedding size of64.

This models performance on the evaluation set is presented in 6.4 andconverged with a loss value of 2.136182. This model outperformed thebaseline heuristic on all metrics unexpectedly, but did not outperformthe previous baseline model.



Table 6.4: BM2 model performance on evaluation data

BM2AG - Base Model with Multiple Networks and AdaptiveGradient Descent Optimization

Given the added complexity of adding layers and by implication, fur-ther gradients to calculate, and considering our sparse vectors, adaptivegradient descent was considered for an improvement of convergence per-formance. Adaptive learning rates allow per-parameter learning ratesand allows for an increase of learning rate for our sparse parameters.This model converged faster than the base model and the applied learn-ing descent did what it promised to do. BM2AG converged with a lossvalue of 2.235332 and its respective performance on the evaluation datacan be found in 6.5.

6.4 Track VectorsGiven the baseline models performance that already outperform the baslineheurstic, further additions of features to the model can be consulted in




Table 6.5: BM2AG model performance on evaluation data

hopes of improving model performances on the evaluation data. All fur-ther experiments add the track vector feature, represented as v

t

in themodel architecture as shown in 5.1.

With a convergence loss value of 1.890431, two fully connected net-works and with the addition of v

t

and otherwise same configurationsas the base model, the performance of the track vectors model on theevaluation data can be found in 6.6.



Table 6.6: Track Vectors model performance on evaluation data

TVDIMMN - Track Vectors with Multiple Networks and Mo-mentum Nesterov Optimization

Given the performance of the Track Vectors model, further fully con-nected networks was added to the Track Vectors configuration with a newoptimizer, namely Momentum optimization with the use of Nesterov’sAccelerated Gradient, with an initial learning rate of 0.1 and momentumof 0.9 as suggested in [59]. TVDIMMN was stopped early after severaldays of training with a loss value of 3.901461, and its performance onthe evaluation data can be found in 6.7.




Table 6.7: TVDIMMN model performance on evaluation data

TVEMBD2 - Track Vectors with Multiple Networks and In-creased Embedding Size

After unfortunate performances and lack of room for optimization forthe Track Vectors model, an increase of the embedding dimension wasconsidered due to the nature of the sparse feature t

v

and its enormousvocabulary of available items. The dimension of the embedding size wasincreased to 128 and two fully connected networks were added with out-put dimensions of 256, 128 with another final fully connected of, onceagain 128 dimensions. The new architecture of TVEMBD2 lived up toexpectations and outperformed the two previous models after the ad-dition of the sparse feature t

v

and show great potential and room foroptimization. TVEMBD2 converged with a loss value of 1.860276 andits performance is presented in 6.8.



Table 6.8: TVEMBD2 model performance on evaluation data

TVEMBD3 - Track Vectors with Multiple Networks and In-creased Embedding Size

After the success of increase of number of networks and an increase of theembedding dimension, a new initial network with output dimension of


1024 was added to the TVEMBD2 architecture in hopes of finding supple-mentary deep connections that would boost the performance on the eval-uation data, once again, with success for many metric values. TVEMBD3was stopped early and converged with a loss value of 1.919513 and itsperformance is presented in 6.9.



Table 6.9: TVEMBD3 model performance on evaluation data

TVEMBD3AG - Track Vectors with Multiple Networks, In-creased Embedding Size and Adaptive Gradient Descent Opti-mization

After the success of previous experiments and of TVEMBD3, it is ofinterest to see if adaptive gradient descent can be applied to find theconvergence loss value faster than previously, and hopefully find a newoptimum while doing so. With a smaller initial learning rate of 0.01, dueto the increase of model complexity, TVEMBD3AG converged with a lossvalue of 1.890277. Performance of TVEMBD3AG can be found in 6.10.The promise from previous research of using adaptive learning for findinga better minima (and doing so faster) for the function under optimizationthat is estimated during learning was held and TVEMBD3AG marks thebest model after the addition of the sparse feature t

v

.

TVEMBD3NCE - Track Vectors with Multiple Networks andIncreased Embedding Size using NCE Loss

TVEMBD3AG marked best performance after the addition of the sparsefeature t

v

. It is of interest to analyse if another loss function with di�er-ent behavior, but otherwise with the exact same network configurationscan optimize the function that is estimated. Therefore, TVEMBD3NCEconsults the NCE Loss function as primary loss function. With a con-vergence loss value of 2.869891, the performance of TVEMBD3NCE




Table 6.10: TVEMBD3AG model performance on evaluation data

on the evaluation data can be analysed in 6.11. Interestingly enough,the sampled softmax approach and NCE loss approaches almost performidentically, with sampled softmax outperforming NCE loss with a minormargin as seen in 6.11 and 6.10.



Table 6.11: TVEMBD3NCE model performance on evaluation data

TVEMBD3BATCHN - Track Vectors with Multiple Networksand Increased Embedding Size with Batch Normalization

With TVEMBD3AG as a winner in the battle of loss functions, batchnormalization is considered as described in section 5.2.1. The perfor-mance of the TVEMBD3 model configuration with the addition of batchnormalization on the layers is presented in 6.12 and converged with a lossvalue of 1.974833. Unfortunately, the addition of batch normalizationfailed to improve the performance of the TVEMBD3 model, with evenTVEMBD3NCE outperforming it. Therefore, batch normalization wasnot consulted in further experiments.




Table 6.12: TVEMBD3BATCHN model performance on evaluation data

TVEMBD3ELU - Track Vectors with Further Networks andIncreased Embedding Size with ELU Activation

With the promise of a better activation function than standard ReLU ac-tivation used throughout all previous experiments, we consult the ELUactivation function in hopes for optimization of the function being es-timated by our deep embedding network model. TVEMBD3ELU con-verged with a loss value of 3.900022 with much slower training and (!)worse performance than the corresponding model with the same architec-ture as best model TVEMBD3AG with ReLU activation, exact oppositeof the promise the authors make in its presentation [63]. TVEMBD3ELUperformance is presented in 6.13.



Table 6.13: TVEMBD3ELU model performance on evaluation data

6.5 Going DeeperAfter finding the best model, namely TVEMBD3AG, it is of interest togo deeper in order to find further deep connections that hopefully booststhe performance on the evaluation data. This section goes through three


models with very large architectures and analyses the adaptive gradientdescents performance when handed a bigger initial learning rate.

The first deeper model, as described in 6.2, DEEP was trained with aninitial learning rate of 0.5. DEEP’s initial network input layer is matchedto the query representation size and has an output layer of 2048, whichis handed to the next network with an input layer of 2048 and output1024, and so on, until the output dimension of 128. This experiment wasa learning lesson since the packed representation output of 128 is matchedto a final network with output of the embedding size of 512 and does notfollow the previous success of the pyramid type network structure andtraining was therefore stopped early at a loss value of 2.714534. Themodel performance for DEEP is presented in 6.14.



Table 6.14: DEEP model performance on evaluation data

6.6 DEEPERThe previous section presents the initial attempt of going deeper, a newexperiment is conducted with model DEEPER. This time, the initial net-work input layer is matched to the query representation size once again,but has an output layer of 4096, which is handed to the next networkwith an input layer of 4096 and output 2048, and so on, until the finaloutput dimension of 256. As expected, DEEPER took longer to trainthan previous models without this much depth and was stopped earlywhere the converged loss value at stoptime was observed at 3.893327.The performance of DEEPER is described in 6.15.




Table 6.15: DEEPER model performance on evaluation data

6.6.1 DEEPESTELUWith the unfortunate results from the previous deeper configurationsand the lack of performance of the previous model with ELU activa-tion, namely TVEMBD3ELU, we conduct a final experiment with a newmodel, the deepest model with ELU activation. As mentioned in [63],ELU networks perform best with 5 layers or more, and the configurationof DEEPESTELU reflect this using 6 fully connected networks. To as-sist the adaptive gradient descent optimization, the initial learning ratefor DEEPESTELU is set to 0.01. Once again, this network trained ex-tremely slowly and was stopped early after four days of training. Asseen in 6.16, DEEPESTELU performed unfortunately mediocrely on theevaluation set, converging at the loss value around 1.954330.



Table 6.16: DEEPESTELU model performance on evaluation data


6.7 Best ModelsOut of the fourteen experiments, the best models for each experimentclass, namely without track vectors, with track vectors and deeper modelscan be found in 6.17.

Metric Class Model Examples MA MRR Top 5 Top 10 Top 50 Top 100

all BM2AG 100000 3.07% 0.066 9.20% 13.73% 29.08% 37.71%all TVEMBD3AG 100000 3.13% 0.061 8.65% 12.41% 25.54% 33.05%all DEEPESTELU 100000 2.39% 0.048 6.68% 9.75% 19.86% 26.15%

album BM2AG 21984 1.35% 0.035 1.03% 1.67% 4.23% 5.76%album TVEMBD3AG 21984 1.17% 0.03 0.89% 1.42% 3.29% 4.47%album DEEPESTELU 21984 0.44% 0.019 0.77% 1.17% 2.19% 2.94%

artist BM2AG 20113 1.05% 0.030 0.80% 1.37% 3.78% 5.38%artist TVEMBD3AG 20113 1.01% 0.020 0.57% 0.99% 2.83% 4.12%artist DEEPESTELU 20113 0.07% 0.006 0.11% 0.28% 1.27% 2.18%

playlist BM2AG 54640 4.19% 0.085 6.49% 9.50% 19.29% 24.55%playlist TVEMBD3AG 54640 4.18% 0.079 6.20% 8.76% 17.62% 22.46%playlist DEEPESTELU 54640 3.66% 0.006 4.70% 6.92% 14.63% 19.10%

radio BM2AG 3263 8.34% 0.173 0.89% 1.19% 1.78% 2.02%radio TVEMBD3AG 3263 11.86% 0.201 0.98% 1.25% 1.79% 2.00%radio DEEPESTELU 3263 8.43% 0.197 1.09% 1.38% 1.77% 1.93%

Table 6.17: Comparison of best models performance on evaluation data

Chapter 7

Discussion

In this chapter, we aim to analyze and reflect over the performance ofthe presented models in the previous chapter.

7.1 Analysis of ResultsThe previous chapter went over 14 experiments where every possible hy-perparameter and parameters for optimization for deep neural networkswere conducted with previous research in the area and scientific founda-tions to motivate their conduction. With the big number of experiments,we discuss the baseline heuristic, model baseline, track vectors model andthe deeper models and their respective experiments.

7.2 Baseline HeuristicThe baseline heuristic performed surprisingly well on the evaluation datasetas seen in 6.1, even though it did not consider any personal features ofthe user and can be considered an impressive heuristic. The baselineheuristic was able to recommend the intended play context with rank 12.04% of the time with a value of 0.044 in reciprocal rank, both two veryrespectable values for a simple heuristic. By recommending the mostpopular items for each country and hour, the baseline was able to catchthe intended play context for each user in its top 100 recommended itemsfor 22.47% of the entire evaluation dataset. This implies that a chunk ofusers are indeed generic and does not dive deep into the Spotify catalog,but settles for premade playlists and easy to find play contexts, such as

61

62 CHAPTER 7. DISCUSSION

promoted ones by the service. The interesting question here is if thecurrent way to explore content on the platform has anything to with thisvalue, or if its just general user behaviour.

7.3 Baseline ModelThe first conducted experiment was the most successful one that didnot consider the track vectors feature v

t

, seen in 5.1. By only using asimple configuration of a single neural network to match the embeddingsize of 64, the baseline model outperformed the baseline heuristic by along shot on all metrics. In 6.3, we can see that the baseline model hasa performance of 3% on mean accuracy, much better than the baselineheuristic, but perhaps a low value overall. Once again, the end goal wasnot to build a genie prediction machine that is able to act as a per-personfuture choice predictor, but instead learn a generalization on all users andshould be considered with this in mind. The baseline model did performwell for the mean reciprocal rank as well with a value of 0.065, beatingthe baseline heuristic’s corresponding value of 0.044. The increase of thenumber of correct rankings in the predictions the baseline model is beingable to do, as seen with the with top 5, top 10, top 50 and top 100 metricsmakes sense, where the baseline model is able to catch the intended labelwith a record of 37.84% with the top 100 metric. This value expressesthat the system developed performs pretty well without concentrationon the ranking of the items and gives much potential for further rankingof items. With additional pipelines after the model prediction phase, orwith a ranking model, the accuracy metric has a potential to be boosted.

With BM2 and addition of several fully connected networks to themodel, it is interesting to see that metrics su�ered on all classes. Onewould think that the deeper connections that were added would adda boost to the metrics, but unfortunately this is not what happened, asseen in 6.4. This can be due to several di�erent things. Due to the addedcomplexity of the model, the minima BM2 search for during optimizationcould have been missed and a BM2 could have been stuck in a localminima, and missed the local minima by always using the same learningrate of 0.1, without adaptive gradient optimization. With BM2’s slightlylarger loss value of 0.06, BM2 did not reach the Baseline Model’s lossvalue of 2.071506 and perhaps, with further training time, it could havereached this value.

BM2AG solved the previous problems BM2 su�ered from by taking

CHAPTER 7. DISCUSSION 63

use of adaptive gradient descent, recommended optimization when deal-ing with sparse features. The only sparse feature the baseline model deltwith was the play context embeddings, v

c

as described in 5.1. Adaptivegradient descent did indeed help our function under approximation withmuch better metrics than BM2 and did even perform better on the meanaccuracy metric with a value of 3.07% than the baseline model’s recordvalue of 3.00% as seen in 6.5, but surprisingly with another, and greaterloss value than the other two models of 2.235332.

7.4 Track VectorsThe track vectors model introduced another sparse feature, v

t

with abigger embedding dimension due to the addition of the feature and itssparsity, otherwise with exact same model architecture as the BaselineModel. This sparse feature added further complexity to the model and forthe function under estimation. Not only did this feature create complex-ity due to its sparseness, out of all vocabulary sizes, it took the biggestvocabulary size of 1, 000, 000. Due to the extreme size of the catalog,this feature represents user a�nities for certain tracks and have manydi�erent combinations. It is di�cult for the network to learn the progressof the play contexts for users, and with the introduction of this feature,it had to take this a�nity into account.

The initial base track vectors model’s metrics performed decently incomparison to the baseline model and the baseline heuristic consider-ing the addition of the sparse feature v

t

. Converging at a loss value of1.890431, the track vector’s model marked the lowest loss value up tothat point in comparison to the previous baseline models and it seemedlike the addition of the new feature was not only a good decision but afeature that could potentially assist boost in metric performance.

Due to the results with the track vectors model and the potentialperformances the new feature could provide to the model predictions,the TVDIMN experiment analyzed the use of additional networks withfurther depth and a new optimization technique, Momentum optimiza-tion with Nesterov’s accelerated gradient. This model failed to optimizeits function under estimation and was stopped early after several days oftraining, where it converged at a loss value of 3.901461. It would havebeen of interest to analyse the exact same model architecture with theprevious successful adaptive gradient descent technique, namely the ada-grad optimizer, but with the restriction of time for the project and the


unfortunate results of this model, a new embedding size was consideredand optimized instead.

With TVEMBD2, a new embedding size for the embedded featureswas considered at a size of 128. To start, TVEMBD2 considers threefully connected networks with output dimensions of 256, 128 and 128that were trained with gradient descent optimization to see what po-tential this new embedding dimension could give. The increase of thenew embedding size was successful and the foundation of the reasoningof the TVEMBD2 experiment worked out: by increasing the embeddingsize, there is more room to represent the added sparse feature v

t

. Asseen in 6.8, TVEMBD2 performed well on the evaluation dataset andshowed room for improvement. Converging at a loss value of 1.860276,TVEMBD2 showed potential for further optimization that TVEMBD3and all other models took into account.

The increase of the embedding size worked out with experimentTVEMBD2 and TVEMBD3 continued on this pattern, and to see whatfurther insights the deeper connections could provide, TVEMBD3 addedfurther depth, with an initial network with output dimension 1024, onceagain with gradient descent optimization and same initial learning ratevalue of 0.1. These configurations were chosen correctly to improveTVEMBD2, but did not outperform TVEMBD2 on the metric perfor-mance, and converged with a loss value of 1.919513, and after thisexperiment, adaptive gradient descent was consulted to see if this deeperarchitecture could be optimized with the use of adaptive gradient descent.

7.5 TVEMBD3AGTVEMBD3AG (Track Vectors with Multiple Networks, Increased Em-bedding Size and Adaptive Gradient Descent Optimization), inheritedthe TVEMBD3 model architecture and used adaptive gradient descentwith success, with a smaller learning rate value of 0.01 and marked thebest model after applying the sparse feature v

t

. This came as surprise af-ter experiment BM2AG and after analyzing previous research in the area,converging at a loss value of 1.890277, a slightly lower loss value thanTVEMBD2, but did outperform TVEMBD2 on metrics mean accuracyand mean reciprocal rank to name a few. By looking at TVEMBD3AGperformance on metrics in 6.10 and by comparing it to the metric perfor-mance of TVEMBD2 in 6.8 that the deeper network performed much bet-


ter on the radio class in particular and found other deep connections thatboosted the mean accuracy score for the playlist class. TVEMBD3AGdid not do as good on the T100 metric, and further experiments suchas TVEMBD3NCE, TVEMBD3BATCHN and TVEMBD3ELU tried toimprove the TVEMBD3AG model without success as a final experiment.

Interestingly, model TVEMBD3NCE performed almost as good asTVEMBD3AG by using the NCE loss function instead of sampled soft-max, using the same number of negative sampled classes. TVEMBD3NCEconverged at a loss value around 2.869891, and since this approach didnot boost any metric performances, it was interesting to see how similarthese two approaches performed on the evaluation dataset. Once again,the radio class’ deeeper connections was found by TVEMBD3NCE butseeing as the radio class is the minority class, this improvement was notgood enough to continue on the NCE loss track.

With TVEMBD3BATCHN, batch normalization was considered.Batch normalization has shown promise in di�erent applications, andsince the network is trained in batches of size 64, it only made senseto experiment with this technique. TVEMBD3BATCHN convergencevalue of 1.974833 was higher than expected and did not outperformTVEMBD3AG unfortunately. By looking at the metrics in 6.12 we cansee that the model didn’t score as well on the radio class as previouschanges to TVEMBD3AG, something that the larger loss value might beexpressing. It was an unfortunate surprise that TVEMBD3BATCHN didnot hold its promise that batch normalization would increase accuracy asmentioned in [65], perhaps longer training would have helped this modeloptimize TVEMBD3AG and over multiple epochs on the training data,but would not have been able to be compared in the same way with othermodels by getting more training time.

TVEMBD3ELU was the first experiment that took use of the ELUactivation function and did not perform well, or better than ReLU as de-scribed in its introduction paper [63]. TVEMBD3ELU was stopped withan observed converged loss value of 3.900022. By this point it had notbeen trained on as many examples as previous models due to being ableto train on fewer examples per second. As seen in 6.13, TVEMBD3ELUdid not perform better than TVEMBD3AG nor did it on other previ-ous experiments by trying to optimize the TVEMBD3AG model and thecontinued attempt to optimize metric performance for TVEMBD3AGended with TVEMBD3ELU. With longer training time for all models,perhaps these deeper ones would have performed better on the evalua-


tion data, but due to the restriction of time for model training and themultiple days it would have taken for the more complex models, this wasnot considered.

7.6 Deeper ModelsThree experiments were conducted to analyze the performance of modelsthat went much, much deeper than previous models with both failureand success. The first model, DEEP did well on the radio class andperformed much better than other models on this class in particularwith a whopping accuracy of 20.44% as seen in 6.14. It can therefore beconcluded that this model did find deep connections for this class, butfailed to generalize on other classes and overestimated a minority class.As seen in the model architecture table 6.2, DEEP went from output 128to input of 128 and output of 512, an unfortunate mistake in architectureand can be the reason to its performance on the metrics, where the packedrepresentation was lost from the packed representation of 128 to 512.However, for all classes, the DEEP model did get a respectable metricvalue of 15.31% on the T100 metric for all classes.

The DEEPER model outperformed DEEP in general and respectedthe pyramid type structure it shared with all other experimented modelsand did not overestimate the radio class like DEEP did. Once again,this deeper model failed to outperform any other experiment with sim-ilar architecture except for the number of fully connected networks andincreased depth. By analyzing the loss value of 3.893327, this comes tono surprise, where other better performing models all share a loss valuebelow 2.5.

DEEPESTELU was the most successful very deep model out of thethree, converging at a loss value of 1.954330. DEEPESTELU did out-perform all the baseline heuristic metrics, and by analyzing the specificmetric classes in 6.16, it is apparent once again that it found deeper con-nection for the radio class where it performed well on the mean accuracymetric for radio as well as playlist. Overall, the DEEPESTELU modelperformed mediocrely in comparison to the other classes and the ELUactivation did somewhat of a good job with assist from adaptive gradientdescent and a low initial learning rate value of 0.01.


7.7 LimitationsThe biggest limitation of this project was the limitation of time. Spotifyprovided all the resources and data this project needed in order to trainand analyze the experiments and thanks to this, two good models, onewithout the sparse track vectors feature v

t

, namely BM2AG and onethat took this feature into consideration, namely TVEMBD3AG wasconstructed that both performed well on the evaluation data set, whichcould not have been done otherwise. The limitation of time limited thenumber of experiments, the time to train the more complex models andthe time to conduct hyperparameter search. Given the extent of thiswork, these would have to be described in another piece of work in orderfurther analyze and dive deeper into performances.

Furthermore, the limitation of time did not allow for live A/B testing.In the area of recommender systems and search engines, live A/B testingis fundamental to analyze the e�ect of the new models performance onusers. Instead, this work has been limited to o�ine metric evaluationby looking at two weeks, training on one week and predicting for thenext, unseen week. With live A/B testing, the real impact of the modelsrecommendations on users could have been analyzed and described.

Chapter 8

Conclusions

This work has presented a central problem in the area of informationgathering, information filtering and recommender systems and has pre-sented a successful approach for creating context-aware personalized mu-sic recommendations, where the proposed models creates a vector of itemsrelevant for any given user that only require a minimal set of features forrelevant and accurate recommendations, ultimately creating a vector ofcuration for each user.

8.1 SummaryIn the introduction chapter, this work formulated two hypothesis, namely

Hypothesis I: Deep learning can be used in order to give personalizedand context aware recommendations for a given context C, user U at acertain point in time T and

Hypothesis II: Deep learning approaches can outperform a classi-cal heuristic approach in order to give personalized and context awarerecommendations.

By researching the previous work in the area and state-of-the-art ap-plications, analyzing the data available, choosing the correct features,transforming and feature engineering these features and experimentingwith deep learning architectures that use previous scientific research inthe area as foundation, this work has both verified the hypothesis andaccepts the verified hypothesis as proven. Therefore, this work does notonly conclude that deep learning can in fact be used to give personalizedand context aware recommendations, but can outperform the classicalheuristic approach with a great margin on all metrics in an o�ine eval-

68

CHAPTER 8. CONCLUSIONS 69

uation.Furthermore, this work adds seven additional insights to this specific

application of Deep Learning. These insights express a summary fromthe outcome of the conducted experiments throughout this thesis and donot make any promises or proofs whatsoever, but can be seen as aid forfuture research in the area that concerns itself with constructing similararchitectures for future learning tasks similar to this one.

Insight I: Further depth in the output layers in the networks didallow for finding deep connections between latent variables, as seen inexperiments BM2AG, TVEMBD2 and TVEMBD3AG.

Insight II: The use of adaptive gradient descent did help optimizingthe convex function under estimation, that has features with large sparsefeature spaces, as seen with experiments BM2AG and TVEMBD3AG incomparison to experiments BM2 and TVEMBD3.

Insight III: Adaptive gradient descent performed better with aninitial lower value with sparse features where the depth of the networksare increased, as seen in experiments BM2AG and TVEMBD3AG.

Insight IV: Adaptive gradient descent, namely Adagrad optimiza-tion is preferred to Momentum optimization with Nesterov’s acceleratedgradient when optimizing the convex function under estimation, as seenwith experiments TVDIMN and TVDIM3AG.

Insight V: The softmax classifier was preferred to the NCE loss clas-sifier for this learning task, with as seen with experiments TVEMBD3AGand TVEMBD3NCE.

Insight VI: The proposed ELU activation function in [63] did notspeed up learning or accuracies for this application, nor deeper feed-forward neural networks with greater depth and with sparse features,even when aided adaptive gradient descent, seen with the experimentsTVEMBD3AG and TVEMBD3ELU, DEEPER and DEEPESTELU.

Insight VII: Using batch normalization as a normalizer function in anetwork type structure like the one this thesis has presented took longerto train in contrast to a model with exact same architecture, as seenwith experiments TVEMBD3AG and TVEMBD3BATCHN and did notprovide a boost in metric accuracy.

70 CHAPTER 8. CONCLUSIONS

8.2 Future ResearchIn this thesis we have used many di�erent models and network architec-tures for the task of recommending with deep learning using embeddingsand feed-forward neural networks. Further research could extend suchmodel to improve the performance when additional features and moreimportantly when sparse features are added for learning and extend theproposed recommender system to an ensemble of network models thatanalyze di�erent kinds of data. It might be of interest to analyze coverart to find latent features for the corresponding item with a convolutionalneural network and generate tags as a feature for the item itself that is be-ing evaluated for recommendation. As previous research has shown, it ispossible to analyze the audio mel-spectograms using convolutional neuralnetworks, additional item based features that could potentially aid therecommender system. Using an RNN or LSTM, it is possible to analyzethe relevant words for an item, or lyrics of a track to predict which itemare similar to which, thus making the recommender model take more fea-tures into account when basing the recommendations for a certain user.A top tier pipeline of models such as this proposed one could also imple-ment a model being responsible for the ranking of the items for each user,making the model take user features and item features by using implicitand explicit signals. Such model would take many possible relevant signalinto account and truly present the user with ranked, context-aware andpersonalized recommendations that make the most out the catalogue forthe user. At the same time, such system would match content creatorswith potential fans of their content, increase their status and revenue,and would by implication do the same for the service provider at hand.

Bibliography

[1] I Lawrence and Kuei Lin. “A concordance correlation coe�cient toevaluate reproducibility”. In: Biometrics (1989), pp. 255–268 (cit.on p. 10).

[2] Donald F Specht. “Probabilistic neural networks”. In: Neural net-works 3.1 (1990), pp. 109–118 (cit. on p. 18).

[3] Stuart Geman, Elie Bienenstock, and René Doursat. “Neural net-works and the bias/variance dilemma”. In: Neural computation 4.1(1992), pp. 1–58 (cit. on p. 26).

[4] Martin Riedmiller and Heinrich Braun. “A direct adaptive methodfor faster backpropagation learning: The RPROP algorithm”. In:Neural Networks, 1993., IEEE International Conference on. IEEE.1993, pp. 586–591 (cit. on p. 23).

[5] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term mem-ory”. In: Neural computation 9.8 (1997), pp. 1735–1780 (cit. onp. 22).

[6] John S Breese, David Heckerman, and Carl Kadie. “Empirical anal-ysis of predictive algorithms for collaborative filtering”. In: Proceed-ings of the Fourteenth conference on Uncertainty in artificial intel-ligence. Morgan Kaufmann Publishers Inc. 1998, pp. 43–52 (cit. onp. 10).

[7] Thomas Tran and Robin Cohen. “Hybrid recommender systems forelectronic commerce”. In: Proc. Knowledge-Based Electronic Mar-kets, Papers from the AAAI Workshop, Technical Report WS-00-04, AAAI Press. 2000 (cit. on p. 12).

[8] G.D. Linden, J.A. Jacobi, and E.A. Benson. Collaborative rec-ommendations using item-to-item similarity mappings. US Patent6,266,649. July 2001. url: https://www.google.com/patents/US6266649 (cit. on p. 10).

71

https://www.google.com/patents/US6266649

https://www.google.com/patents/US6266649

72 BIBLIOGRAPHY

[9] Loren Terveen and Will Hill. “Beyond recommender systems: Help-ing people help each other”. In: HCI in the New Millennium 1.2001(2001), pp. 487–509 (cit. on p. 10).

[10] Robin Burke. “Hybrid recommender systems: Survey and experi-ments”. In: User modeling and user-adapted interaction 12.4 (2002),pp. 331–370 (cit. on p. 12).

[11] Meehee Lee, Pyungseok Choi, and Yongtae Woo. “A hybrid recom-mender system combining collaborative filtering with neural net-work”. In: International Conference on Adaptive Hypermedia andAdaptive Web-Based Systems. Springer. 2002, pp. 531–534 (cit. onp. 16).

[12] Andrew I Schein et al. “Methods and metrics for cold-start rec-ommendations”. In: Proceedings of the 25th annual internationalACM SIGIR conference on Research and development in informa-tion retrieval. ACM. 2002, pp. 253–260 (cit. on p. 10).

[13] Greg Linden, Brent Smith, and Jeremy York. “Amazon. com rec-ommendations: Item-to-item collaborative filtering”. In: IEEE In-ternet computing 7.1 (2003), pp. 76–80 (cit. on pp. 9, 10).

[14] Jonathan L Herlocker et al. “Evaluating collaborative filtering rec-ommender systems”. In: ACM Transactions on Information Sys-tems (TOIS) 22.1 (2004), pp. 5–53 (cit. on p. 12).

[15] Han-Saem Park, Ji-Oh Yoo, and Sung-Bae Cho. “A context-awaremusic recommendation system using fuzzy bayesian networks withutility theory”. In: International Conference on Fuzzy Systems andKnowledge Discovery. Springer. 2006, pp. 970–979 (cit. on p. 11).

[16] Charalampos Vassiliou et al. “A recommender system frameworkcombining neural networks & collaborative filtering”. In: Proceed-ings of the 5th WSEAS international conference on Instrumenta-tion, measurement, circuits and systems. World Scientific, Engi-neering Academy, and Society (WSEAS). 2006, pp. 285–290 (cit.on p. 16).

[17] Kazuyoshi Yoshii et al. “Hybrid Collaborative and Content-basedMusic Recommendation Using Probabilistic Model with LatentUser Preferences.” In: ISMIR. Vol. 6. 2006, 7th (cit. on p. 10).

[18] Yoshua Bengio et al. “Greedy layer-wise training of deep networks”.In: Advances in neural information processing systems 19 (2007),p. 153 (cit. on p. 23).

BIBLIOGRAPHY 73

[19] Yoshua Bengio, Yann LeCun, et al. “Scaling learning algorithmstowards AI”. In: Large-scale kernel machines 34.5 (2007), pp. 1–41(cit. on p. 23).

[20] Robin Burke. “Hybrid web recommender systems”. In: The adap-tive web. Springer, 2007, pp. 377–408 (cit. on p. 12).

[21] Christina Christakou, Spyros Vrettos, and Andreas Stafylopatis. “Ahybrid movie recommender system based on neural networks”. In:International Journal on Artificial Intelligence Tools 16.05 (2007),pp. 771–792 (cit. on p. 16).

[22] Abhinandan S. Das et al. “Google News Personalization: ScalableOnline Collaborative Filtering”. In: Proceedings of the 16th Interna-tional Conference on World Wide Web. WWW ’07. Ban�, Alberta,Canada: ACM, 2007, pp. 271–280. isbn: 978-1-59593-654-7. doi:10.1145/1242572.1242610. url: http://doi.acm.org.focus.lib.kth.se/10.1145/1242572.1242610 (cit. on p. 10).

[23] Michael J Pazzani and Daniel Billsus. “Content-based recommen-dation systems”. In: The adaptive web. Springer, 2007, pp. 325–341(cit. on p. 10).

[24] Xuan Nhat Lam et al. “Addressing cold-start problem in recom-mendation systems”. In: Proceedings of the 2nd international con-ference on Ubiquitous information management and communica-tion. ACM. 2008, pp. 208–211 (cit. on p. 10).

[25] Linas Baltrunas and Xavier Amatriain. “Towards time-dependantrecommendation based on implicit feedback”. In: Workshop on context-aware recommender systems (CARS’09). 2009 (cit. on p. 11).

[26] Congratulations! 2009. url: http://www.netflixprize.com/index.html (cit. on p. 11).

[27] Yehuda Koren. “The bellkor solution to the netflix grand prize”.In: Netflix prize documentation 81 (2009), pp. 1–10 (cit. on p. 11).

[28] Yehuda Koren, Robert Bell, and Chris Volinsky. “Matrix factor-ization techniques for recommender systems”. In: Computer 42.8(2009) (cit. on p. 11).

[29] Hugo Larochelle et al. “Exploring strategies for training deep neu-ral networks”. In: Journal of Machine Learning Research 10.Jan(2009), pp. 1–40 (cit. on p. 26).

https://doi.org/10.1145/1242572.1242610

http://doi.acm.org.focus.lib.kth.se/10.1145/1242572.1242610

http://doi.acm.org.focus.lib.kth.se/10.1145/1242572.1242610

http://www.netflixprize.com/index.html

http://www.netflixprize.com/index.html

74 BIBLIOGRAPHY

[30] Martin Piotte and Martin Chabbert. “The pragmatic theory so-lution to the netflix grand prize”. In: Netflix prize documentation(2009) (cit. on p. 11).

[31] Ruslan Salakhutdinov and Geo�rey E Hinton. “Deep BoltzmannMachines.” In: AISTATS. Vol. 1. 2009, p. 3 (cit. on pp. 16, 23, 24).

[32] Xiaoyuan Su and Taghi M Khoshgoftaar. “A survey of collabo-rative filtering techniques”. In: Advances in artificial intelligence2009 (2009), p. 4 (cit. on p. 10).

[33] Andreas Töscher, Michael Jahrer, and Robert M Bell. “The bigchaossolution to the netflix grand prize”. In: Netflix prize documentation(2009), pp. 1–52 (cit. on p. 11).

[34] B Yegnanarayana. Artificial neural networks. PHI Learning Pvt.Ltd., 2009 (cit. on p. 16).

[35] Aaron Beach et al. “Fusing mobile, sensor, and social data to fullyenable context-aware computing”. In: Proceedings of the EleventhWorkshop on Mobile Computing Systems & Applications. ACM.2010, pp. 60–65 (cit. on p. 11).

[36] George Dahl, Abdel-rahman Mohamed, Geo�rey E Hinton, et al.“Phone recognition with the mean-covariance restricted Boltzmannmachine”. In: Advances in neural information processing systems.2010, pp. 469–477 (cit. on p. 25).

[37] Li Deng et al. “Binary coding of speech spectrograms using a deepauto-encoder.” In: Interspeech. Citeseer. 2010, pp. 1692–1695 (cit.on p. 25).

[38] Dumitru Erhan et al. “Why does unsupervised pre-training helpdeep learning?” In: Journal of Machine Learning Research 11.Feb(2010), pp. 625–660 (cit. on p. 26).

[39] Dongjoo Lee et al. “Exploiting contextual information from eventlogs for personalized recommendation”. In: Computer and Infor-mation Science 2010. Springer, 2010, pp. 121–139 (cit. on p. 11).

[40] Ste�en Rendle. “Factorization machines”. In: Data Mining (ICDM),2010 IEEE 10th International Conference on. IEEE. 2010, pp. 995–1000 (cit. on pp. 12, 25).

[41] Thierry Bertin-Mahieux et al. “The Million Song Dataset”. In: Pro-ceedings of the 12th International Conference on Music InformationRetrieval (ISMIR 2011). 2011 (cit. on p. 25).

BIBLIOGRAPHY 75

[42] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradi-ent methods for online learning and stochastic optimization”. In:Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159 (cit. on p. 28).

[43] Jiquan Ngiam et al. “On optimization methods for deep learning”.In: Proceedings of the 28th International Conference on MachineLearning (ICML-11). 2011, pp. 265–272 (cit. on p. 27).

[44] David Martin Powers. “Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation”. In:(2011) (cit. on p. 13).

[45] Ste�en Rendle et al. “Fast context-aware recommendations withfactorization machines”. In: Proceedings of the 34th internationalACM SIGIR conference on Research and development in Informa-tion Retrieval. ACM. 2011, pp. 635–644 (cit. on p. 12).

[46] Yoshua Bengio. “Practical recommendations for gradient-based train-ing of deep architectures”. In: Neural networks: Tricks of the trade.Springer, 2012, pp. 437–478 (cit. on p. 29).

[47] Pedro G Campos, Fernando Dıez, and Iván Cantador. “A Perfor-mance Comparison of Time-Aware Recommendation Models”. In:(2012) (cit. on p. 12).

[48] Geo�rey Hinton et al. “Deep neural networks for acoustic modelingin speech recognition: The shared views of four research groups”.In: IEEE Signal Processing Magazine 29.6 (2012), pp. 82–97 (cit.on p. 25).

[49] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. “A compara-tive study of collaborative filtering algorithms”. In: arXiv preprintarXiv:1205.3193 (2012) (cit. on pp. 9, 10).

[50] Franco P Preparata and Michael Shamos. Computational geometry:an introduction. Springer Science & Business Media, 2012 (cit. onp. 18).

[51] Ste�en Rendle. “Factorization machines with libfm”. In: ACM Trans-actions on Intelligent Systems and Technology (TIST) 3.3 (2012),p. 57 (cit. on p. 12).

[52] Erik Bernhardsson. Collaborative Filtering at Spotify. Jan. 2013.url: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 (cit. on p. 10).

http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

76 BIBLIOGRAPHY

[53] Clement Farabet et al. “Learning hierarchical features for scenelabeling”. In: IEEE transactions on pattern analysis and machineintelligence 35.8 (2013), pp. 1915–1929 (cit. on p. 25).

[54] Alex Graves. “Generating sequences with recurrent neural net-works”. In: arXiv preprint arXiv:1308.0850 (2013) (cit. on p. 22).

[55] Alex Graves, Abdel-rahman Mohamed, and Geo�rey Hinton. “Speechrecognition with deep recurrent neural networks”. In: Acoustics,speech and signal processing (icassp), 2013 ieee international con-ference on. IEEE. 2013, pp. 6645–6649 (cit. on p. 22).

[56] Tomas Mikolov et al. “Distributed representations of words andphrases and their compositionality”. In: Advances in neural infor-mation processing systems. 2013, pp. 3111–3119 (cit. on pp. 14,16).

[57] Andriy Mnih and Koray Kavukcuoglu. “Learning word embeddingse�ciently with noise-contrastive estimation”. In: Advances in neu-ral information processing systems. 2013, pp. 2265–2273 (cit. onp. 42).

[58] Pierre Sermanet et al. “Pedestrian detection with unsupervisedmulti-stage feature learning”. In: Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition. 2013, pp. 3626–3633 (cit. on p. 25).

[59] Ilya Sutskever et al. “On the importance of initialization and mo-mentum in deep learning.” In: ICML (3) 28 (2013), pp. 1139–1147(cit. on pp. 28, 53).

[60] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen.“Deep content-based music recommendation”. In: Advances in neu-ral information processing systems. 2013, pp. 2643–2651 (cit. onpp. 10, 25).

[61] Minh-Thang Luong et al. “Addressing the rare word problem inneural machine translation”. In: arXiv preprint arXiv:1410.8206(2014) (cit. on p. 22).

[62] Recommending music on Spotify with deep learning. Aug. 2014.url: http://benanne.github.io/2014/08/05/spotify-cnns.html (cit. on pp. 10, 20).

http://benanne.github.io/2014/08/05/spotify-cnns.html

http://benanne.github.io/2014/08/05/spotify-cnns.html

BIBLIOGRAPHY 77

[63] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter.“Fast and accurate deep network learning by exponential linearunits (elus)”. In: arXiv preprint arXiv:1511.07289 (2015) (cit. onpp. 22, 57, 59, 65, 69).

[64] Balázs Hidasi et al. “Session-based recommendations with recur-rent neural networks”. In: arXiv preprint arXiv:1511.06939 (2015)(cit. on p. 23).

[65] Sergey Io�e and Christian Szegedy. “Batch normalization: Accel-erating deep network training by reducing internal covariate shift”.In: arXiv preprint arXiv:1502.03167 (2015) (cit. on pp. 40, 65).

[66] Yann LeCun, Yoshua Bengio, and Geo�rey Hinton. “Deep learn-ing”. In: Nature 521.7553 (2015), pp. 436–444 (cit. on p. 21).

[67] Kelvin Xu et al. “Show, Attend and Tell: Neural Image CaptionGeneration with Visual Attention.” In: ICML. Vol. 14. 2015, pp. 77–81 (cit. on p. 22).

[68] David Zhan Liu and Gurbir Singh. “A Recurrent Neural NetworkBased Recommendation System”. In: (2015) (cit. on p. 23).

[69] Heng-Tze Cheng et al. “Wide & Deep Learning for RecommenderSystems”. In: CoRR abs/1606.07792 (2016). url: http://arxiv.org/abs/1606.07792 (cit. on p. 39).

[70] Tim Cooijmans et al. “Recurrent batch normalization”. In: arXivpreprint arXiv:1603.09025 (2016) (cit. on p. 40).

[71] Paul Covington, Jay Adams, and Emre Sargin. “Deep Neural Net-works for YouTube Recommendations”. In: Proceedings of the 10thACM Conference on Recommender Systems. New York, NY, USA,2016 (cit. on pp. 15, 25, 38).

[72] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learn-ing. http://www.deeplearningbook.org. MIT Press, 2016 (cit.on pp. 19, 21–23, 26–29).

[73] Takuya Kitazawa. “Incremental Factorization Machines for Per-sistently Cold-starting Online Item Recommendation”. In: arXivpreprint arXiv:1607.02858 (2016) (cit. on p. 12).

[74] Andrew Ng. Nuts and bolts of applying deep learning. 2016. url:https://www.bayareadlschool.org/ (cit. on p. 29).

http://arxiv.org/abs/1606.07792

http://arxiv.org/abs/1606.07792

http://www.deeplearningbook.org

https://www.bayareadlschool.org/

78 BIBLIOGRAPHY

[75] Yang Song, Ali Mamdouh Elkahky, and Xiaodong He. “Multi-ratedeep learning for temporal recommendation”. In: Proceedings ofthe 39th International ACM SIGIR conference on Research andDevelopment in Information Retrieval. ACM. 2016, pp. 909–912(cit. on p. 23).

[76] Wei Wang et al. “Deep Learning at Scale and at Ease”. In: ACMTrans. Multimedia Comput. Commun. Appl. 12.4s (Nov. 2016),69:1–69:25. issn: 1551-6857. doi: 10.1145/2996464. url: http://doi.acm.org/10.1145/2996464 (cit. on p. 23).

[77] Deep Learning. Jan. 2017. url: https://developer.nvidia.com/deep-learning (cit. on p. 23).

[78] Google. Dataflow. 2017. url: https : / / cloud . google . com /dataflow/ (visited on 05/01/2017) (cit. on p. 30).

[79] Keras: Deep Learning library for Theano and TensorFlow. 2017.url: https://keras.io/ (cit. on p. 23).

[80] Neville Li. Scio - A Scala API for Google Cloud Dataflow ApacheBeam. 2017. url: https://www.slideshare.net/sinisalyh/scio-a-scala-api-for-google-cloud-dataflow-apache-beam(visited on 05/01/2017) (cit. on p. 30).

[81] Spotify. About. 2017. url: https://press.spotify.com/us/about/ (visited on 03/24/2015) (cit. on pp. 2, 30).

[82] TensorFlow. About TensorFlow. 2017. url: https://www.tensorflow.org/ (visited on 01/24/2017) (cit. on p. 23).

[83] Tensorflow. Keynote (TensorFlow Dev Summit 2017). 2017. url:https://www.youtube.com/watch?v=4n1AHvDvVvw (cit. on p. 44).

[84] Tensorflow. Reading data. 2017. url: https://www.tensorflow.org/programmers_guide/reading_data (cit. on p. 44).

[85] Tensorflow. Vector Representation of Words. 2017. url: https://www.tensorflow.org/tutorials/word2vec (cit. on p. 16).

https://doi.org/10.1145/2996464

http://doi.acm.org/10.1145/2996464

http://doi.acm.org/10.1145/2996464

https://developer.nvidia.com/deep-learning

https://developer.nvidia.com/deep-learning

https://cloud.google.com/dataflow/

https://cloud.google.com/dataflow/

https://keras.io/

https://www.slideshare.net/sinisalyh/scio-a-scala-api-for-google-cloud-dataflow-apache-beam

https://www.slideshare.net/sinisalyh/scio-a-scala-api-for-google-cloud-dataflow-apache-beam

https://press.spotify.com/us/about/

https://press.spotify.com/us/about/

https://www.tensorflow.org/

https://www.tensorflow.org/

https://www.youtube.com/watch?v=4n1AHvDvVvw

https://www.tensorflow.org/programmers_guide/reading_data

https://www.tensorflow.org/programmers_guide/reading_data

https://www.tensorflow.org/tutorials/word2vec

https://www.tensorflow.org/tutorials/word2vec

Appendix A

Personas

In this appendix, I present some fabricated personas with selected fea-tures, together with some personas that reflects personalities of my friends,and one persona reflecting my own features for model demonstration pur-poses. The personas were selected with diversity in mind and in orderto show demographic recommendations as well as recommendations forregions that the service do not exist in as of yet, such as India and Russia.Four personas were created for demonstrating how the models deals withcold start recommendations, where N/A denote no prior information ofthe feature, not available. The Vectors of Curation v

curation

given by thetwo best trained models can be found in Appendix B.

Persona Country City Age Gender Platform Day Hour

Elin Sweden Stockholm 19 female webplayer fri 20:00Armando United States San Luis Obispo 21 male playstation sat 2:00Nate United States San Luis Obispo 22 male android sun 4:00Oktay Sweden Stockholm 23 male ios sat 14:00Shpetim United Kingdom London 23 male windows wed 13:00Maisie United Kingdom London 23 female windows wed 13:00Paulina Sweden Tokyo 24 female android mon 18:00Matz Norway San Francisco 25 male android fri 22:00Dmitry United States New York 25 male webplayer tue 23:00Isabel France Paris 28 female windows fri 20:00Jökull Iceland Reykjavik 30 male ios fri 22:00Ranjit United States Los Angeles 32 male android fri 22:00Alan United Kingdom Cheshire 41 male webplayer fri 22:00Cemile Turkey Istanbul 50 female mac os x wed 19:00Karl Sweden Uppsala 65 male sonos tue 20:00

Table A.1: Personas with respective features

79

80 APPENDIX A. PERSONAS

Persona Play Context Listening History Top Tracks

Elin N/A N/A

Armando

Xxxtentacion (artist),Lil Uzi Vert (artist),

Daddy Yankee (artist),Lil Yachty (artist),

Romeo Santos (artist)

Migos - Bad and Boujee,Lil Uzi Vert - You Was Right,

Future - Low Life,Kyle - iSpy

Nate

Illenium (artist),Marshmello (artist),

Slushii (artist),NGHTMRE (artist)

Illenium - Afterlife,Diplo - Be Right There

Oktay

Flume (artist),Aaliyah (artist),Banks (artist),

KAYTRANADA (artist),HAIM (artist),

Disclosure (artist),Good Music Cruel Summer (album),

Vulfpeck (artist)

Migos - Bad and Boujee,Rihanna - Work,

Flume - Numb & Getting Colder,Grimes - Go,

James Blake - Overgrown,deadmau5 - Strobe (Club Edit),

Hardwell - Spaceman

Shpetim N/A N/A

Maisie N/A N/A

Paulina

Porter Robinson (artist),Girls’ Generation (artist),

Kyary Pamyu Pamyu (artist),Purity Ring (artist),

WEDNESDAY CAMPANELLA (artist)

Porter Robinson - Shelter

Matz

Norsk sommermusikk (playlist),Lido (artist),Arif (artist),

Aminé (artist)

Aminé - Caroline

Dmitry Timati (artist) N/A

IsabelJames Blake (artist),

Live Lounge and Acoustic (playlist),Lartiste (artist)

N/A

Jökull Emmsjé Gauti (artist),Young Thug (artist)

Young Thug - Digits,Young Thug - Best Friend

Ranjit Arjit Singh (playlist),A.R. Rahman (artist) A.R. Rahman - Reality

Alan

Ludwig van Beethoven (artist),Johann Sebastian Bach (artist),

Wolfgang Amadeus Mozart (artist),This Is: Beethoven (playlist)

N/A

Cemile Ahmet Kaya (artist),Sezen Aksu (artist) N/A

Karl N/A N/A

Table A.2: Listening History and Top Tracks for personas in A.1

Appendix B

Model Predictions for Personas

In this appendix, the Vectors of Curation vcuration

given by models BM2AGand TVEMBD3AG as described in 6.2 for each of the personas found inAppendix A and tables A.1 are presented. The corresponding listeninghistory, and their top tracks model TVEMD3AG use for predictionsare found in A.2. Due to the limit of space in the document, only themost relevant top 30 recommendations will be presented for each modeland persona.

81

82 APPENDIX B. MODEL PREDICTIONS FOR PERSONAS


Elin Sweden Stockholm 19 female webplayer fri 20:00

Table B.1: Features that demonstrate a female cold start user.

Rank Play Context Type

1 Sweden Top 50 playlist2 Discover Weekly radio3 Hits idag playlist4 Sommarhits 2017 playlist5 Ambient Nature Sleep Sounds playlist6 New Music Friday Sweden playlist7 Global Top 50 playlist8 Lugnt & Skönt playlist9 Stockholm <3 playlist10 Get Your Hits Together playlist11 ÷ album12 Today’s Top Hits playlist13 Håkan Hellström artist14 Ed Sheeran artist15 Soft Sommar playlist16 Melodifestivalen - Årets bidrag playlist17 Zara Larsson artist18 Classical World: Sweden playlist19 Made in Sweden playlist20 More Life album21 Melodifestivalen 2017 album22 So Good album23 Liiit playlist24 Vår-Chill playlist25 Thunder Sleep Sounds playlist26 Hits Just Nu playlist27 Musik för arbetsdagen playlist28 Chill Hits playlist29 Life is good playlist30 Year in Music radio

Table B.2: BM2AG Recommendations for Elin A.1.

APPENDIX B. MODEL PREDICTIONS FOR PERSONAS 83


Elin Sweden Stockholm 19 female webplayer fri 20:00

Table B.3: Features that demonstrate a female cold start user.


1 Sweden Top 50 playlist2 Global Top 50 playlist3 Discover Weekly radio4 Hits idag playlist5 Sommarhits 2017 playlist6 Lugnt & Skönt playlist7 Stockholm <3 playlist8 Ed Sheeran artist9 Ord och inga visor playlist10 Today’s Top Hits playlist11 More Life album12 New Music Friday Sweden playlist13 Power Gaming playlist14 ÷ album15 Drake artist16 Förfest! playlist17 Viva Latino-Top 50 playlist18 RapCaviar playlist19 Soft Sommar playlist20 Made in Sweden playlist21 Melodifestivalen 2017 album22 Zara Larsson artist23 Release Radar radio24 Sign of the Times album25 Vår-Chill playlist26 Get Your Hits Together playlist27 Peaceful Piano playlist28 Melodifestivalen - Årets bidrag playlist29 Chill Hits playlist30 Dinner with Friends playlist

Table B.4: TVEMBD3AG Recommendations for Elin A.1.



Armando United States San Luis Obispo 21 male playstation sat 2:00

Table B.5: Features of my friend Armando, likes hip hop & latino music.


1 Lil Uzi Vert artist2 More Life album3 Future artist4 Discover Weekly radio5 RapCaviar playlist6 Kodak Black artist7 FUTURE album8 Young Thug artist9 Heatstroke album10 Drake artist11 Get Turnt playlist12 Migos artist13 NAV album14 XO TOUR Llif3 album15 The Chainsmokers artist16 Xxxtentacion artist17 The Weeknd artist18 New Music Friday playlist19 A Boogie Wit da Hoodie artist20 HUMBLE. album21 Travis Scott artist22 Logic artist23 United States Top 50 playlist24 Rae Sremmurd artist25 21 Savage artist26 ALL-AMERIKKKAN BADA$$ album27 Kanye West artist28 HNDRXX album29 Joey Bada$$ artist30 Kendrick Lamar artist

Table B.6: BM2AG Recommendations for Armando A.1.



Armando United States San Luis Obispo 21 male playstation sat 2:00

Table B.7: Features of my friend Armando.


1 Nicky Jam artist2 RapCaviar playlist3 Bruno Mars artist4 New Music Friday playlist5 Today’s Top Hits playlist6 Memories...Do Not Open album7 Enrique Iglesias artist8 Marc Anthony artist9 Global Top 50 playlist10 Juanes artist11 Daddy Yankee artist12 Prince Royce artist13 United States Top 50 playlist14 Latin Urban Gaming playlist15 US Latin Top 50 playlist16 Viva Latino-Top 50 playlist17 Romeo Santos artist18 J Balvin artist19 Shakira artist20 Don Omar artist21 Kendrick Lamar artist22 Migos artist23 Pitbull artist24 Esenciales playlist25 F�änix album26 United States Viral 50 playlist27 Discover Weekly radio28 Despacito (Featuring Daddy Yankee) album29 Maluma artist30 Kyle artist

Table B.8: TVEMBD3AG Recommendations for Armando A.1.



Nate United States San Luis Obispo 22 male android sun 4:00

Table B.9: Nate, demonstrating SoundCloud music a�nity.


1 Heatstroke album2 Memories...Do Not Open album3 Discover Weekly radio4 NGHTMRE artist5 The Chainsmokers artist6 Release Radar radio7 Marshmello artist8 Porter Robinson artist9 Los Amsterdam album10 NAV album11 More Life album12 FUTURE album13 The One album14 Illenium artist15 Future artist16 Lil Uzi Vert artist17 Boombox Cartel artist18 Today’s Top Hits playlist19 Slushii artist20 Say Less album21 Now Or Never album22 Ookay artist23 HNDRXX album24 Said the Sky artist25 Byte album26 The Weeknd artist27 Seven Lions artist28 ALL-AMERIKKKAN BADA$$ album29 Twinbow album30 San Holo artist

Table B.10: BM2AG Recommendations for Nate A.1.



Nate United States San Luis Obispo 22 male android sun 4:00

Table B.11: Features of my friend Nate.


1 Memories...Do Not Open album2 Release Radar radio3 Discover Weekly radio4 DJ Snake artist5 Martin Garrix artist6 electroNOW playlist7 Major Lazer artist8 Today’s Top Hits playlist9 Galantis artist10 United States Top 50 playlist11 Twinbow album12 Marshmello artist13 New Music Friday playlist14 Dillon Francis artist15 Kungs artist16 Bruno Mars artist17 Go Hard playlist18 Above & Beyond artist19 RapCaviar playlist20 Lookas artist21 Run Up album22 Slushii artist23 Laidback Luke artist24 Future artist25 Bassjackers artist26 Radio Station Artist Alok radio27 24K Magic album28 Calvin Harris artist29 Here Comes The Night album30 Jax Jones artist

Table B.12: TVEMBD3AG Recommendations for Nate A.1.



Oktay Sweden Stockholm 23 male ios sat 14:00

Table B.13: Oktay, my features, demonstrating mixed genre a�nities.


1 Discover Weekly radio2 Release Radar radio3 Year in Music radio4 More Life album5 Heatstroke album6 Sweden Top 50 playlist7 New Music Friday Sweden playlist8 Lykke Li artist9 Lana Del Rey artist10 Global Top 50 playlist11 Erik Lundin artist12 Memories...Do Not Open album13 Frank Ocean artist14 The Weeknd artist15 The Life Of Pablo album16 Niki & The Dove artist17 Drake artist18 Andromeda (feat. D.R.A.M.) album19 Future Islands artist20 So Good album21 Zara Larsson artist22 I See You album23 Disclosure artist24 Cherrie artist25 FUTURE album26 Gorillaz artist27 M.I.A. artist28 The Knife artist29 Håkan Hellström artist30 Flume artist

Table B.14: BM2AG Recommendations for Oktay A.1.



Oktay Sweden Stockholm 23 male ios sat 14:00

Table B.15: Features for Oktay.


1 Memories...Do Not Open album2 ALL-AMERIKKKAN BADA$$ album3 Global Top 50 playlist4 Release Radar radio5 Sweden Top 50 playlist6 Now Or Never album7 Discover Weekly radio8 Andromeda (feat. D.R.A.M.) album9 Flume artist10 Sign of the Times album11 Gorillaz artist12 Kendrick Lamar artist13 Yung Lean artist14 Stockholm <3 playlist15 Xxxtentacion artist16 Zara Larsson artist17 HUMBLE. album18 Sommarhits 2017 playlist19 Drake artist20 Say Less album21 Tove Lo artist22 More Life album23 Future artist24 $uicideBoy$ artist25 Clean Bandit artist26 Skin album27 New Music Friday Sweden playlist28 So Good album29 Stormzy artist30 Nirvana artist

Table B.16: TVEMBD3AG Recommendations for Oktay A.1.



Shpetim United Kingdom London 23 male windows wed 13:00

Table B.17: Shpetim, demonstrating a male cold start user in the UK.


1 Discover Weekly radio2 United Kingdom Top 50 playlist3 More Life album4 Hot Hits UK playlist5 ÷ album6 Today’s Top Hits playlist7 Global Top 50 playlist8 Year in Music radio9 RapCaviar playlist10 Release Radar radio11 Grime Shutdown playlist12 UK House Music playlist13 Gang Signs & Prayer album14 Massive Dance Hits playlist15 Heatstroke album16 FUTURE album17 New Music Friday UK playlist18 This Is: Drake playlist19 Album Radio Station ÷ radio20 All Night Dance Party playlist21 Stormzy artist22 Memories...Do Not Open album23 Power Gaming playlist24 Ed Sheeran artist25 Drake artist26 This Is: Ed Sheeran playlist27 Feel Good Friday playlist28 Shape of You album29 A Perfect Day playlist30 Indie Party playlist

Table B.18: BM2AG Recommendations for Shpetim A.1.



Shpetim United Kingdom London 23 male windows wed 13:00

Table B.19: Features of my friend Shpetim.


1 Hot Hits UK playlist2 United Kingdom Top 50 playlist3 Discover Weekly radio4 Global Top 50 playlist5 Massive Dance Hits playlist6 New Music Friday UK playlist7 Today’s Top Hits playlist8 All Night Dance Party playlist9 More Life album10 Grime Shutdown playlist11 Music For Concentration playlist12 A Perfect Day playlist13 ÷ album14 Ed Sheeran artist15 Release Radar radio16 UK House Music playlist17 Power Gaming playlist18 Massive Dance Classics playlist19 Drake artist20 Happy Pop Hits playlist21 You Can Do It playlist22 #MondayMotivation playlist23 Sign of the Times album24 Top Gaming Tracks playlist25 The Great British Breakfast playlist26 All Gold 80s playlist27 Feel Good Friday playlist28 spotify:station:genre:pop radio29 Gang Signs & Prayer album30 This Is: Drake playlist

Table B.20: TVEMBD3AG Recommendations for Shpetim A.1.



Maisie United Kingdom London 23 female windows wed 13:00

Table B.21: Maisie, demonstrating corresponding female cold start user.


1 Discover Weekly radio2 United Kingdom Top 50 playlist3 Hot Hits UK playlist4 ÷ album5 Today’s Top Hits playlist6 More Life album7 Global Top 50 playlist8 Year in Music radio9 Beauty and the Beast album10 New Music Friday UK playlist11 This Is: Ed Sheeran playlist12 Happy Pop Hits playlist13 Release Radar radio14 Massive Dance Hits playlist15 Independent Ladies playlist16 Ed Sheeran artist17 A Perfect Day playlist18 #MondayMotivation playlist19 The Great British Breakfast playlist20 All Night Dance Party playlist21 All Gold 80s playlist22 Album Radio Station ÷ radio23 Feel Good Friday playlist24 Easy 00s playlist25 Get Home Happy! playlist26 Disney Hits playlist27 UK House Music playlist28 Topsify UK Top 50 playlist29 Heatstroke album30 You Can Do It playlist

Table B.22: BM2AG Recommendations for Maisie A.1.



Maisie United Kingdom London 23 female windows wed 13:00

Table B.23: Features of my friend Maisie.


1 Hot Hits UK playlist2 United Kingdom Top 50 playlist3 Discover Weekly radio4 Global Top 50 playlist5 New Music Friday UK playlist6 Massive Dance Hits playlist7 Today’s Top Hits playlist8 All Night Dance Party playlist9 ÷ album10 Music For Concentration playlist11 A Perfect Day playlist12 Ed Sheeran artist13 More Life album14 Sign of the Times album15 Happy Pop Hits playlist16 Release Radar radio17 #MondayMotivation playlist18 Grime Shutdown playlist19 You Can Do It playlist20 This Is: Ed Sheeran playlist21 Massive Dance Classics playlist22 The Great British Breakfast playlist23 Feel Good Friday playlist24 Every UK Number One: 2017 playlist25 UK House Music playlist26 This Is How We Do playlist27 Radio Station Genre Pop radio28 Hot 50 | UK playlist29 Viva Latino-Top 50 playlist30 Drake artist

Table B.24: TVEMBD3AG Recommendations for Maisie A.1.



Paulina Sweden Tokyo 24 female android mon 18:00

Table B.25: Paulina, a female from Sweden living in Tokyo.


1 Discover Weekly radio2 Year in Music radio3 Sweden Top 50 playlist4 Gorillaz artist5 Heatstroke album6 Release Radar radio7 Global Top 50 playlist8 Lana Del Rey artist9 Red Velvet artist10 Solen artist11 BTS artist12 Håkan Hellström artist13 New Music Friday Sweden playlist14 Zara Larsson artist15 Coldplay artist16 SKAM playlist17 The Weeknd artist18 Beauty and the Beast album19 Humanz playlist20 K-Pop Daebak playlist21 Sia artist22 Best of 2016: K-Pop playlist23 Vår-Chill playlist24 Soft Sommar playlist25 SHINee artist26 Monsta X artist27 Melodifestivalen - Årets bidrag playlist28 Hits idag playlist29 Block B artist30 More Life album

Table B.26: BM2AG Recommendations for Paulina A.1.



Paulina Sweden Tokyo 24 female android mon 18:00

Table B.27: Features of my friend Paulina.


1 Suicide Squad: The Album album2 Skrillex artist3 Beast Mode playlist4 Scary Monsters and Nice Sprites album5 Purple Lamborghini (with Rick Ross) album6 Knife Party artist7 Recess album8 Album Radio Station Purple Lamborghini radio9 Discover Weekly radio10 Fast and Furious 8 (Original Soundtrack) playlist11 Skrillex and Diplo present Jack U album12 Bad and Boujee (feat. Lil Uzi Vert) album13 FIYAH album14 Gorillaz artist15 Waiting album16 The Best Of DMX album17 Year in Music radio18 Bangarang EP album19 KSI artist20 Eminem artist21 Iggy Azalea artist22 will.i.am artist23 Power Gaming playlist24 RapCaviar playlist25 Surface album26 Panda album27 Swalla (feat. Nicki Minaj & Ty Dolla $ign) album28 Black Barbies album29 Marshmello artist30 Kendrick Lamar artist

Table B.28: TVEMBD3AG Recommendations for Paulina A.1.



Matz Norway San Francisco 25 male android fri 22:00

Table B.29: Matz, from Norway but living in SF.


1 Memories...Do Not Open album2 Global Top 50 playlist3 Rich Chigga artist4 Discover Weekly radio5 Today’s Top Hits playlist6 Heatstroke album7 Coldplay artist8 The Chainsmokers artist9 Up album10 RAC artist11 Release Radar radio12 It’s a Hit! playlist13 HONNE artist14 It’s a Hit! playlist15 Beauty and the Beast album16 Bruno Mars artist17 Radio Station Artist Rich Chigga artist18 This Is: The Chainsmokers playlist19 Seventeen album20 Marshmello artist21 Yellow Claw artist22 FKJ artist23 Moana album24 Horses album25 Territory album26 Chill Hits playlist27 Los Amsterdam album28 FATE NUMBER FOR album29 Battle Cry album30 Let Me Out (feat. Mavis Staples & Pusha T) album

Table B.30: BM2AG Recommendations for Matz A.1.



Matz Norway San Francisco 25 male android fri 22:00

Table B.31: Altered features of my friend Matz.


1 ÷ album2 Global Top 50 playlist3 Memories...Do Not Open album4 This Is: The Chainsmokers playlist5 Today’s Top Hits playlist6 This Is: Ed Sheeran playlist7 The Chainsmokers artist8 Happy Hits! playlist9 Ed Sheeran artist10 Chance The Rapper artist11 Artist Radio Station Ed Sheeran radio12 It’s a Hit! playlist13 Coloring Book album14 This Is: Chance The Rapper playlist15 Rich Chigga artist16 The One album17 Chill Hits playlist18 Indonesia Top 50 playlist19 Bruno Mars artist20 Album Radio Station Coloring Book radio21 24K Magic album22 Album Radio Station ÷ radio23 Beauty and the Beast album24 La La Land album25 How Far I’ll Go album26 To Pimp A Butterfly album27 Release Radar radio28 Album Radio Station HUMBLE. radio29 Album Radio Station iSpy (feat. Lil Yachty) radio30 Fresh Eyes album

Table B.32: TVEMBD3AG Recommendations for Matz A.1.



Dmitry United States New York 25 male webplayer tue 23:00

Table B.33: Dmitry, with Russian music taste living in NY.


1 Heatstroke album2 Timati artist3 L’one artist4 Switched Flows album5 Mot artist6 Maks Korzh artist7 FUTURE album8 Lindsey Stirling artist9 Deep Focus playlist10 Olympus album11 Basta artist12 Discover Weekly radio13 electroNOW playlist14 Brick God album15 Kristina Si artist16 Drake artist17 Grebz artist18 Power Gaming playlist19 Today’s Top Hits playlist20 Éxitos México playlist21 RapCaviar playlist22 Bianka artist23 Weekly Buzz playlist24 Global Top 50 playlist25 Gameday playlist26 electroNOW playlist27 US Latin Top 50 playlist28 Oxxxymiron artist29 College Dropout album30 Monatik artist

Table B.34: BM2AG Recommendations for Dmitry A.1



Isabel France Paris 28 female windows fri 20:00

Table B.35: Isabel, french with indie music taste.


1 Discover Weekly radio2 Maître Gims artist3 Drake artist4 Release Radar radio5 PNL artist6 More Life album7 Global Top 50 playlist8 Hits du Moment playlist9 Jul artist10 The Weeknd artist11 Jain artist12 Disclosure artist13 FUTURE album14 Memories...Do Not Open album15 Booba artist16 Vianney artist17 Starboy album18 Petit Biscuit artist19 Nekfeu artist20 Rihanna artist21 Beyoncé artist22 France Top 50 playlist23 Sia artist24 New Music Friday France playlist25 Ed Sheeran artist26 Her artist27 Dans la légende album28 The Blaze artist29 Jamiroquai artist30 Comme si album

Table B.36: BM2AG Recommendations for Isabel A.1.



Isabel France Paris 28 female windows fri 20:00

Table B.37: Isabel, french with indie music taste.


1 Apéro ! playlist2 Hits du Moment playlist3 Global Top 50 playlist4 PNL artist5 Fin de journée playlist6 Au calme playlist7 Cyborg album8 Nekfeu artist9 Au coin du feu playlist10 Dans la légende album11 France Top 50 playlist12 Vianney artist13 Lamomali album14 Ed Sheeran artist15 Jul artist16 Drake artist17 KeBlack artist18 Top Hits Mix playlist19 Memories...Do Not Open album20 Alors on danse playlist21 Fréquence Hits playlist22 Punchlineurs playlist23 New Music Friday France playlist24 Lacrim artist25 Loïc Nottet artist26 Force & Honneur album27 Discover Weekly radio28 Booba artist29 More Life album30 Shape of You album

Table B.38: TVEMBD3AG Recommendations for Isabel A.1.



Jökull Iceland Reykjavik 30 male ios fri 22:00

Table B.39: Jökull, Icelander with hiphop music taste.


1 Smellir dagsins playlist2 Fullir Vasar album3 Sautjándi nóvember album4 Partí playlist5 Slappa af playlist6 Album Radio Station Fullir Vasar radio7 Aron Can artist8 New Music Friday Iceland playlist9 Neinei album10 Bubbi Morthens artist11 Sturla Atlas artist12 Album Radio Station Neinei album13 Emmsjé Gauti artist14 Söngvakeppnin 2017 album15 101 Nights album16 Discover Weekly radio17 Global Top 50 playlist18 Smellir dagsins playlist19 Íslenskt rapp playlist20 Lil Uzi Vert artist21 Year in Music radio22 Release Radar radio23 Future artist24 $uicideBoy$ artist25 good kid, m.A.A.d city album26 Yung Lean artist27 The Life Of Pablo album28 100 íslensk barnalög album29 100 íslenskir sumarsmellir album30 Vince Staples artist

Table B.40: BM2AG Recommendations for Jökull A.1.



Jökull Iceland Reykjavik 30 male ios fri 22:00

Table B.41: Features for Jökull.


1 Broken Heart playlist2 Memories...Do Not Open album3 Look At Me! album4 Xxxtentacion artist5 Panic! At The Disco artist6 Global Top 50 playlist7 Kodak Black artist8 Lil Uzi Vert artist9 Eminem artist10 Rich Chigga artist11 Death Of A Bachelor album12 Life Sucks playlist13 Blurryface album14 This Is: All Time Low playlist15 Machine Gun Kelly artist16 $uicideBoy$ artist17 Help album18 G-Eazy artist19 iSpy (feat. Lil Yachty) album20 Young, Wild & Free (feat. Bruno Mars) album21 My Chemical Romance artist22 This Is: Panic! At The Disco playlist23 Twenty One Pilots artist24 Discover Weekly radio25 21 Savage artist26 Lil Pump artist27 Today’s Top Hits playlist28 Cum Cake album29 This Is: Twenty One Pilots playlist30 Pink Season album

Table B.42: TVEMBD3AG Recommendations for Jökull A.1.



Ranjit United States Los Angeles 32 male android fri 22:00

Table B.43: Ranjit living in LA with Indian music taste.


1 Discover Weekly radio2 Today’s Top Hits playlist3 More Life album4 Bollywood Party playlist5 Bollywood Top 50 playlist6 Best of Bollywood: Arijit Singh album7 FUTURE album8 This Is: Arijit Singh playlist9 US Latin Top 50 playlist10 Bollywood Party playlist11 This Is: Drake playlist12 Album Radio Station Kaatru Veliyidai radio13 Yours Truly Arijit album14 Get Turnt playlist15 electroNOW playlist16 Arijit Singh artist17 Memories...Do Not Open album18 Release Radar radio19 Heatstroke album20 RapCaviar playlist21 United States Top 50 playlist22 Global Top 50 playlist23 bollywood 2016 Updated playlist24 Year in Music radio25 Arijit Singh - Ultimate Love Songs album26 Starboy album27 Ae Dil Hai Mushkil album28 Album Radio Station Shape of You radio29 50 Glorious Musical Years (The Complete Works) album30 Viva Latino-Top 50 playlist

Table B.44: BM2AG Recommendations for Ranjit A.1



Ranjit United States Los Angeles 32 male android fri 22:00

Table B.45: Features for Ranjit.


1 Maná artist2 Marco Antonio Solís artist3 Rompiendo Fronteras album4 Alejandro Fernandez artist5 Alejandro Sanz artist6 Luis Miguel artist7 Vicente Fernandez artist8 El Fantasma artist9 Andrea Bocelli artist10 Eros Ramazzotti artist11 Album Radio Station Antonio Aguilar artist12 Global Top 50 playlist13 A State Of Trance Episode 807 album14 A State Of Trance Episode 808 album15 Coldplay artist16 Pepe Aguilar artist17 Alfredo Olivas artist18 Baila Reggaeton playlist19 Iconos album20 "Marc Anthony ""El Cantante"" OST" album21 Ricardo Arjona artist22 UB40 artist23 Lionel Richie artist24 More Life album25 Album Radio Station Vicente Fernandez Para Siempre radio26 Armin van Buuren artist27 Santana artist28 Bob Marley & The Wailers artist29 Discover Weekly radio30 Hoy, Mañana y Siempre album

Table B.46: TVEMBD3AG Recommendations for Ranjit A.1



Alan United Kingdom Cheshire 41 male webplayer fri 22:00

Table B.47: Alan Turing, my favorite Computer Scientist.


1 Discover Weekly radio2 Ludwig van Beethoven artist3 ÷ album4 Classical Essentials playlist5 Year in Music radio6 Spring Classical playlist7 x album8 This Is: JS Bach playlist9 This Is: Mozart playlist10 Today’s Top Hits playlist11 This Is: Beethoven playlist12 This Is: Mozart playlist13 Top Classical playlist14 Frédéric Chopin artist15 Global Top 50 playlist16 Hamilton album17 Release Radar radio18 Wolfgang Amadeus Mozart artist19 Antonio Vivaldi artist20 Bruno Mars artist21 Sergei Rachmanino� artist22 Piano 100: Spotify Picks playlist23 Ed Sheeran artist24 Dmitri Shostakovich artist25 TROLLS (Original Motion Picture Soundtrack) album26 Twenty One Pilots artist27 Johann Sebastian Bach artist28 Shape of You album29 Moana album30 Imagine Dragons artist

Table B.48: BM2AG Recommendations for Alan A.1



Cemile Turkey Istanbul 50 female mac os x wed 19:00

Table B.49: Cemile, Kurdish woman living in Istanbul.


1 Arkada Çalsın playlist2 Discover Weekly radio3 Sezen Aksu artist4 Sütliman playlist5 Hot Hits Türkiye playlist6 Turkey Top 50 playlist7 Yolun Açık Olsun playlist8 Biraz Pop Biraz Sezen album9 Türkçe Rap playlist10 Sıla artist11 New Music Friday Türkiye playlist12 Gülümse album13 Ne�eli Günler playlist14 Selda Ba�can artist15 Türkçe Pop playlist16 Tarkan artist17 Müslüm Gürses artist18 Love Crimes album19 Sertab Erener artist20 Sen A�lama album21 Candan Erçetin artist22 Turkey Viral 50 playlist23 Aura album24 Zeki Müren artist25 Feridun Düza�aç artist26 A�k Tesadüfleri Sever album27 Yansın Geceler album28 �brahim Tatlıses artist29 Global Top 50 playlist30 �ebnem Ferah artist

Table B.50: BM2AG Recommendations for Cemile A.1.



Karl Sweden Uppsala 65 male sonos tue 20:00

Table B.51: Karl, retired Swedish man from my hometown.


1 Discover Weekly radio2 Sweden Top 50 playlist3 Hits idag playlist4 Sommarhits 2017 playlist5 Lugnt & Skönt playlist6 Classical World: Sweden playlist7 Stockholm <3 playlist8 Global Top 50 playlist9 New Music Friday Sweden playlist10 Melodifestivalen - Årets bidrag playlist11 Made in Sweden playlist12 Dinner with Friends playlist13 Today’s Top Hits playlist14 Chill Hits playlist15 Get Your Hits Together playlist16 Year in Music radio17 ÷ album18 Release Radar radio19 PEACE playlist20 Ed Sheeran artist21 Melodifestivalen 2017 album22 Ambient Nature Sleep Sounds playlist23 Piano Dinner playlist24 Zara Larsson artist25 Soft Sommar playlist26 Power Gaming playlist27 Life is good playlist28 Classic Acoustic playlist29 Bara ballader playlist30 Digster Lugna Hits playlist

Table B.52: BM2AG Recommendations for Karl A.1.

www.kth.se

Deep Neural Networks for Context Aware Personalized Music Recommendation

Documents

Transcript of Deep Neural Networks for Context Aware Personalized Music Recommendation