UNIVERSITY OF TECHNOLOGY SYDNEY
Faculty of Engineering and Information Technology
Scalable Factorization Model to Discover Implicit
and Explicit Similarities Across Domains
by
Duc Minh Quan Do
A Thesis Submitted for the Degree of
Doctor of Philosophy
Sydney, Australia
2018
UNIVERSITY OF TECHNOLOGY SYDNEY
SCHOOL OF SOFTWARE
The undersigned hereby certify that they have read this thesis entitled “Scalable
Factorization Model to Discover Implicit and Explicit Similarities Across
Domains” by Duc Minh Quan Do and that in their opinions it is fully adequate,
in scope and in quality, as a thesis for the degree of Doctor of Philosophy.
Date:
Principal Supervisor:
Dr. Wei Liu
Certificate of Original Authorship
I, Duc Minh Quan Do, declare that this thesis is submitted in fulfilment of the
requirements for the award of Doctor of Philosophy, in the School of Software, Faculty
of Engineering and Information Technology at the University of Technology Sydney.
This thesis is wholly my own work unless otherwise referenced or acknowledged. In
addition, I certify that all information sources and literature used are indicated in
the thesis. This document has not been submitted for qualifications at any other
academic institution. This research is supported by the Commonwealth Scientific
and Industrial Research Organisation (CSIRO) scholarship.
Date: 15/09/2018
Signature of Author:
Production Note: Signature removed prior to publication.
Acknowledgements
I am especially indebted to Dr. Wei Liu, who has provided continuous support,
advice and invaluable comments as I pursued my research goals. As my principal
supervisor, he has guided me more than I could ever give him credit for here. Many
thanks are also due to my co-supervisor, Dr. Fang Chen, for the many useful
discussions I have had with her.
I am grateful to everyone with whom I have had the pleasure of discussing this work.
Each of the members of my Candidature Assessment Committee has provided me
with a great deal of professional feedback about scientific research. This work would
not have been possible
without the financial support of the Commonwealth Scientific and Industrial Re-
search Organisation Scholarship (formerly National ICT Australia Scholarship) and
the UTS - International Research Scholarship (IRS).
Nobody has been more important to me in the pursuit of this thesis than the
members of my family. I would like to thank my parents, whose love and guidance
are with me in whatever I pursue and wherever I am. They are the ultimate role
models. Most importantly, I am grateful to my loving and supportive wife, Yen,
and my wonderful daughter, Ellen, for constant inspiration, patience, and faith.
For Ellen
My love for you will last forever.
Contents
Certificate iii
Acknowledgments iv
Dedication v
List of Figures xi
List of Tables xiv
List of Publications xv
Abbreviation xvi
Notation xvii
Abstract xix
1 Introduction 1
1.1 The research problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 The improper sharing of explicit similarities among coupled
datasets across domains reduces recommendation accuracy . . 3
1.1.2 Coupled datasets across domains also share implicit
similarities that provide other insights into their relationships 5
1.1.3 Joint analysis of heterogeneous datasets is costly . . . . . . . . 7
1.2 Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Knowledge contributions . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Research Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 A new objective function to enable each dataset to have its
own discriminative factor on the coupled mode, capturing
the actual explicit similarities across domains . . . . . . . . . 14
1.5.2 A novel algorithm to discover implicit similarities in
non-coupled mode and align them across domains . . . . . . . 16
1.5.3 A matrix factorization-based model to utilize both explicit
and implicit similarities for cross-domain recommendation
accuracy improvement . . . . . . . . . . . . . . . . . . . . . . 17
1.5.4 A scalable factorization model based on the Spark framework
to scale up the factorization process to the number of tensors,
tensor modes, tensor dimensions and billions of observations . 18
1.6 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Literature Review and Background 24
2.1 Data format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 Rating matrix (utility matrix) . . . . . . . . . . . . . . . . . . 25
2.1.2 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.3 Coupled datasets . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 Matrix Tri-Factorization . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Tensor Factorization . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Cross-domain Recommendation Systems . . . . . . . . . . . . . . . . . 32
2.3.1 Collective Matrix Factorization . . . . . . . . . . . . . . . . . 32
2.3.2 Coupled Matrix Tensor Factorization . . . . . . . . . . . . . . 33
2.3.3 CodeBook Transfer . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.4 Cluster-Level Latent Factor Model . . . . . . . . . . . . . . . 36
2.4 Factorization Methodologies . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Distributed Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Deep learning based recommendation systems . . . . . . . . . . . . . . 39
2.7 Research gaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Explicit Similarity Discovery 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 ASTen: the proposed Accurate Coupled Tensor Factorization model . 46
3.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.1 Data used in our experiments . . . . . . . . . . . . . . . . . . 50
3.4.2 Performance metric . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Contribution and Summary . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Implicit Similarity Discovery 58
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 HISF: the proposed Hidden Implicit Similarities Factorization Model . 61
4.2.1 Sharing common and preserving domain-specific coupled
latent variables to utilize explicit similarities . . . . . . . . . . 62
4.2.2 Aligning implicit similarities in non-coupled latent clusters
across domains . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Extension to three or more matrices . . . . . . . . . . . . . . . . . . . 76
4.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4.1 Data for the experiments . . . . . . . . . . . . . . . . . . . . . 79
4.4.2 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.3 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Contributions and Summary . . . . . . . . . . . . . . . . . . . . . . . 88
5 Scalable Multimodal Factorization 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 SMF: the proposed Scalable Multimodal Factorization . . . . . . . . . 93
5.2.1 SMF on Apache Spark . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Scaling up to K tensors . . . . . . . . . . . . . . . . . . . . . 102
5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.3.2 Convergence Speed . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 Contribution and Summary . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Conclusion 114
6.1 Research questions and contributions . . . . . . . . . . . . . . . . . . . 115
6.2 Future research directions . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.1 Investigating explicit and implicit similarities in imbalanced
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.2 Extending the use of explicit and implicit similarities to high
dimensional tensors . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.3 Extending the proposed factorization model to handle online
ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.4 Investigating the use of explicit and implicit similarities in
Factorization Machines . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
List of Figures
1.1 An example of implicit similarities . . . . . . . . . . . . . . . . . . . . 5
1.2 The research questions and their corresponding contributions . . . . . 13
2.1 An example of a movie rating matrix . . . . . . . . . . . . . . . . . . 26
2.2 An example of a tensor . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 An example of coupled rating matrices from Netflix and MovieLens
websites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 An example of a coupled matrix tensor from MovieLens website . . . 28
2.5 CANDECOMP/ PARAFAC (CP) decomposition . . . . . . . . . . . 31
2.6 Joint analysis of a coupled matrix tensor . . . . . . . . . . . . . . . . 34
2.7 Distributed factorization algorithms . . . . . . . . . . . . . . . . . . . 38
2.8 Multi-view deep neural network for cross-domain recommendation of
two datasets that have the same users. In this case, users of
both datasets share the same features of the left-most network. . . . . 40
3.1 Mean squared errors of test cases with synthetic data . . . . . . . . . 53
3.2 Mean squared error of factorizing the MovieLens dataset . . . . . . . 54
3.3 Mean squared error of factorizing Yahoo! Music dataset . . . . . . . . 55
4.1 The proposed factorization model to discover and share implicit
similarities across domains . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Matrix factorization of X(1) as a clustering method . . . . . . . . . . 64
4.3 Matrix factorization of X(2) as a clustering method . . . . . . . . . . 65
4.4 Possible cases for matching user clusters of X(1) and X(2) . . . . . . . 66
4.5 An illustration of how the centroid of a cluster is computed . . . . . . 68
4.6 Generated ratings of two domains X(1) and X(2) . . . . . . . . . . . . 69
4.7 An illustration of how well the proposed cluster alignment method
works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Tested mean RMSEs of ABS NSW and ABS VIC datasets under
different values of the common row parameter (c) in the coupled
factor of HISF with rank r = 11 . . . . . . . . . . . . . . . . . . . . . 84
4.9 Tested mean RMSEs of ABS NSW and BOCSAR Crime datasets
under different values of the common row parameter (c) in the
coupled factor of HISF with rank r = 11 . . . . . . . . . . . . . . . . 85
4.10 Tested mean RMSEs of Amazon dataset under different values of
the common row parameter (c) in the coupled factor of HISF-N
with rank r = 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Tensor slices for updating each row of the factors when a mode-3
tensor is coupled with a matrix in their first modes . . . . . . . . . . 95
5.2 An example of how to divide coupled matrix and tensor into
non-overlapping blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Observation scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Machine scalability with 100M synthetic dataset . . . . . . . . . . . . 107
5.5 Factorization speed with MovieLens . . . . . . . . . . . . . . . . . . . 107
5.6 Factorization speed with Netflix . . . . . . . . . . . . . . . . . . . . . 108
5.7 Factorization speed with Yahoo! Music . . . . . . . . . . . . . . . . . 108
5.8 Coupled factorization speed with MovieLens . . . . . . . . . . . . . . 108
5.9 Coupled factorization speed with Yahoo! Music . . . . . . . . . . . . 109
5.10 Benchmark of different optimization methods . . . . . . . . . . . . . 110
List of Tables
1 Symbols and their descriptions . . . . . . . . . . . . . . . . . . . . . . xviii
1.1 Comparison of existing algorithms for recommendation . . . . . . . . 9
3.1 Ground truth distributions of the factor matrices in the synthetic data 51
4.1 Characteristics of ABS census data on New South Wales and
Victoria states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Characteristics of Amazon datasets on books, movies and electronics 80
4.3 Mean and standard deviation of tested RMSE on ABS New South
Wales and Victoria data with different algorithms . . . . . . . . . . . 81
4.4 Mean and standard deviation of tested RMSE on ABS NSW
demography and BOCSAR NSW crime data with different algorithms 84
4.5 Mean and standard deviation of tested RMSE on Amazon book,
movie and electronics data with different algorithms . . . . . . . . . . 86
5.1 Data for experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Accuracy of each algorithm on the real-world datasets . . . . . . . . . 109
5.3 Accuracy of predicting missing entries on real-world datasets with
different optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
List of Publications
Below is the list of journal and conference papers associated with my Ph.D. research:
1. Quan Do, Wei Liu, Fan Jin and Dacheng Tao, “Unveiling Hidden Implicit
Similarities for Cross-Domain Recommendation,” IEEE Transactions on Knowl-
edge and Data Engineering (TKDE) (Under review).
2. Quan Do and Wei Liu, “Scalable Multimodal Factorization for Learning from
Very Big Data,” in Multimodal Analytics for Next-Generation Big Data Tech-
nologies and Applications, Springer (To appear).
3. Quan Do, Wei Liu and Fang Chen, “Discovering both Explicit and Implicit
Similarities for Cross-Domain Recommendation,” in Proceedings of the 2017
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD),
pp. 618-630, May 23-26, 2017.
4. Quan Do and Wei Liu, “ASTen: an Accurate and Scalable Approach to
Coupled Tensor Factorization,” in Proceedings of the 2016 International Joint
Conference on Neural Networks (IJCNN), pp. 99-106, Jul. 24-29, 2016.
Abbreviation
ALS Alternating Least Squares. 14
CBT Code Book Transfer. 6, 18
CF Collaborative Filtering. x, 1, 11
CLFM Cluster-Level Latent Factor Model. 6, 8, 18
CMF Collective Matrix Factorization. 2, 8, 13, 17
CMTF Coupled Matrix Tensor Factorization. 5, 6, 8, 11, 13–17
GD Gradient Descent. 14
GPU Graphics Processing Unit. 15
MF Matrix Factorization. 11, 12
NCG Nonlinear Conjugate Gradient. 14
NSW New South Wales. 1, 4
RMSE Root Mean Squared Error. 6
TF Tensor Factorization. 11, 12, 14, 15
Nomenclature and Notation
A rating matrix from n users for m items is denoted by a boldface capital, e.g.,
X. Each row of the matrix is a vector of a user’s ratings for all items while each
column is a vector of ratings from all users for a specific item. Vectors are denoted
by boldface lowercases, e.g., u. A boldface capital or lowercase with indices in its
subscript denotes an entry of a matrix or a vector, respectively. Table 1 lists
all other symbols used throughout this thesis.
Table 1 : Symbols and their descriptions

Symbol                           Description
X(i)                             Rating matrix from the i-th dataset
U(i)                             The first-dimension factor of X(i)
V(0)                             Common parts of the coupled factors
V(i)                             Domain-specific parts of the coupled factor of X(i)
S(i)                             Weighting factor of X(i)
A^T                              Transpose of A
A^†                              Moore-Penrose pseudo-inverse of A
I                                The identity matrix
‖A‖                              Frobenius norm of A
n, m, p                          Dimension lengths
c                                Number of common clusters in the coupled factors
r                                Rank of the decomposition
Ω_X                              Number of observations in X
∂/∂x                             Partial derivative with respect to x
L                                Loss function
λ                                Regularization parameter
×                                Multiplication
x, x, X, X                       A scalar, a vector, a matrix and a tensor
N                                Number of modes of a tensor
M                                Number of machines
K                                Number of tensors
T                                Number of iterations
I_1 × I_2 × · · · × I_N          Dimensions of the N-mode tensor X
|Ω|, X_{i_1,i_2,...,i_N}         Observed data size of X and its entries
X(n)                             The n-th mode of X
X(n)_{i_n}                       Slice i_n of X(n): all entries X(n)_{∗,...,∗,i_n,∗,...,∗}
U(n)                             The n-th mode factor of X
u(n)_{i_n}                       The i_n-th row of factor U(n)
V(2)                             The 2nd mode factor of Y
v(2)_{j_2}                       The j_2-th row of factor V(2): all entries V(2)_{∗,j_2}
U_1, U_2, ..., U_K               Factors of tensors X_1, X_2, ..., X_K
I_1 × I_2 × · · · × I_{N_K}      Dimensions of the N_K-mode tensor X_K
|Ω|_K, X_K{i_1,i_2,...,i_{N_K}}  Observed data size of X_K and its entries
Abstract
E-commerce businesses increasingly depend on recommendation systems to intro-
duce personalized services and products to their target customers. Achieving ac-
curate recommendations requires a sufficient understanding of user preferences and
item characteristics. Given the current innovations on the Web, coupled datasets
are abundantly available across domains. An analysis of these datasets can provide
a broader knowledge to understand the underlying relationship between users and
items. This thorough understanding results in more collaborative filtering power
and leads to a higher recommendation accuracy.
However, how to effectively use this knowledge for recommendation is still a
challenging problem. In this research, we propose to exploit both explicit and
implicit similarities extracted from latent factors across domains with matrix tri-
factorization. On the coupled dimensions, common parts of the coupled factors
across domains are shared among them. At the same time, their domain-specific
parts are preserved. We show that such a configuration of both common and domain-
specific parts benefits cross-domain recommendations significantly. Moreover, on the
non-coupled dimensions, we propose to use the middle factor of the tri-factorization
to match closely related clusters across datasets and to align the matched ones,
transferring cross-domain implicit similarities and further improving the recommendations.
Furthermore, when dealing with data coupled from different sources, the scalabil-
ity of the analytical method is another significant concern. We design a distributed
factorization model that can scale up as the observed data across domains increases.
Our data parallelism, based on Apache Spark, enables the model to incur minimal
communication cost. Also, the model is equipped with an optimized solver that
converges faster. We demonstrate that these key features stabilize our model's
performance when the data grows.
Validated on real-world datasets, our developed model outperforms the existing
algorithms regarding recommendation accuracy and scalability. These empirical
results illustrate the potential of our research in exploiting both explicit and implicit
similarities across domains for improving recommendation performance.
Chapter 1
Introduction
E-commerce providers usually offer a wide range of products. On the one hand, this
massive product selection meets a variety of different consumer needs and tastes. On
the other hand, browsing through a long product list to find products matching one's
preferences is not a user-friendly task for any consumer. Automatic matching between
product properties and consumer interest allows companies to introduce products
and services of interest to consumers. Systems with a capability to recommend
products to each particular user based on user preferences are called personalized
recommendation systems (Koren & Bell 2011). They enrich the user experience,
enhance user satisfaction and eventually lead to more sales. Realizing that they
can provide a competitive advantage, a large number of providers have been em-
ploying recommendation systems to analyze consumers’ past behaviors to provide
personalized product recommendations (Koren et al. 2009).
Recommendation systems have increased in their importance and popularity
among product providers (Zhang 2014). Two fundamental techniques are widely
chosen for developing personalized recommendation systems: the content-based ap-
proach (Lops et al. 2011) and the collaborative filtering (CF)-based approach (Koren
& Bell 2011). The former focuses on the information of users or items for making
recommendations whereas the latter is based on the latent similarities between user
interests and the item characteristics to predict items in which specific users would
be interested. This research focuses on improving CF-based recommendations.
To provide accurate recommendations, CF-based methods require a thorough
understanding of the latent similarities between the user preferences and the item
properties (Pan et al. 2011). This understanding can only be obtained when there
is sufficient user feedback (ratings, likes, activities, etc.). Having adequate user
feedback is especially critical here as this is the only information CF-based methods
use for making recommendations (Pan et al. 2010). In many cases, a business does
not have sufficient ratings; improving its recommendation performance then becomes
a significant problem.
The problem of a lack of user ratings can be overcome by exploiting related
information from other domains (Wei et al. 2016, Yang et al. 2015, Liu et al. 2015,
Iwata & Koh 2015, Jing et al. 2014, Zhao et al. 2013, Hu et al. 2013, Tang et al.
2012, Tan et al. 2014, Wang et al. 2016, Hsu et al. 2018, Wu et al. 2017, Zhang
et al. 2017). Given the recent innovations on the Internet and social media, many
cross-domain datasets are publicly available (Chen et al. 2013, Li & Lin 2014, Pan
et al. 2010, Jiang et al. 2015). Finding a closely related dataset from another domain
is easily done these days. For example, user ratings can be found for the same set
of movies from both MovieLens and Netflix websites. Thus, they can be jointly
used to understand user preferences and item characteristics better. Acar, Kolda &
Dunlavy (2011), Singh & Gordon (2008), Li et al. (2009a), Bhargava et al. (2015)
proposed the use of correlated datasets across domains as the extra information to
overcome the problems of insufficient ratings.
The joint analysis of different datasets across domains provides a deeper under-
standing of their underlying relationship (Acar, Kolda & Dunlavy 2011). However,
there are two problems which need to be addressed. The first problem is how to
accurately discover the exact correlation among sources and exploit them to gain
a deeper understanding of the relationships between users and items. This under-
standing will help to provide more accurate recommendations. Furthermore, mining
these abundant cross-domain datasets incurs a very heavy cost in terms of compu-
tation, communication, and storage. This cost leads to the second problem of how
to scale up the data analysis. Solving all these issues is the primary focus of this
research.
1.1 The research problem
This section lists the main issues investigated in this thesis and presents the
research questions.
1.1.1 The improper sharing of explicit similarities among coupled datasets
across domains reduces recommendation accuracy
Coupled datasets are those with one dimension in common (Acar, Kolda &
Dunlavy 2011). For example, one dataset contains user ratings for a list of movies
on the MovieLens website, and another includes the ratings of a different user base
on the Netflix website for the same list of movies as that of MovieLens. Both contain
the ratings of the same list of movies. Thus, they are coupled in their movie dimen-
sion. As they have one dimension in common, they explicitly share some similarities
in their coupled dimension. For example, the same movies on the MovieLens and
Netflix websites have some common characteristics. The joint analysis of coupled
datasets to utilize these explicit similarities has been an exciting topic in different
research communities such as collaborative filtering (Zheng et al. 2010, Loni et al.
2014, Wang et al. 2012a), community detection (Lin et al. 2009) and link prediction
(Chen et al. 2017, Wang et al. 2013, Viswanath et al. 2009, Dunlavy et al. 2011,
Perozzi et al. 2016).
Coupled datasets across domains can provide additional insights into the coupled
dimension. For example, ratings in the MovieLens dataset can be an extra source of
information about movies that can be beneficial to the understanding of movies in
Netflix data and vice versa. Thus, the joint analysis of the coupled datasets across
domains would provide a better understanding of their underlying structures (Kemp
et al. 2006, Yang et al. 2015). This thorough understanding helps to provide more
appropriate recommendations to the users. In this case, the aim is to simultane-
ously learn the rating behaviors from both domains’ observed values to predict their
missing entries with high accuracy. But how to utilize the rating information in one
domain to help to predict unknowns in another one and vice versa still needs to be
investigated.
Existing coupled analysis methods use the same coupled factor to collaborate
between datasets. Collective matrix factorization (CMF) (Singh & Gordon 2008)
and its extension, coupled matrix tensor factorization (CMTF) (Acar, Kolda &
Dunlavy 2011), suggested that both datasets share the same factor in
their coupled mode. Gao et al. (2013) and Li et al. (2009a) assumed cross-domain
datasets would have identical latent rating patterns, captured in the middle factor
of the matrix tri-factorization. Although sharing the same factor across domains is
effective to some extent, cross-domain datasets may also possess characteristics that
are unique to their domains. For example, the MovieLens website allows ratings with half-star
increments whereas Netflix only allows full-star ratings. Thus, sharing the same
factor across domains is unlikely to capture the exact correlations among them,
reducing its effectiveness in achieving higher recommendation accuracy.
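To make this shared-factor assumption concrete, consider a mode-3 tensor X coupled
with a matrix Y in their first mode. The CMTF objective of Acar, Kolda & Dunlavy
(2011) then takes, up to constant scaling, the form

    f(A, B, C, V) = ‖X − [[A, B, C]]‖² + ‖Y − A Vᵀ‖²,

where [[A, B, C]] denotes the CP reconstruction of X and the single factor A serves as
the coupled-mode factor of both X and Y. It is precisely this identical A that the
following chapters argue is too restrictive when the domains differ.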
Figure 1.1 : An example of the implicit correlation between income and crime rate.
(a) Percentage of high-income families per local government area (LGA); (b) number
of "break and enter dwelling" incidents per LGA. LGAs in New South Wales (NSW)
with more high-income families have fewer break and enter dwelling incidents. The
data in (a) is from the Australian Bureau of Statistics and in (b) from the NSW
Bureau of Crime Statistics and Research.
1.1.2 Coupled datasets across domains also share implicit similarities
that provide other insights into their relationships
In addition to the aforementioned explicit similarities, cross-domain datasets are
hypothesized to have other implicit correlations in their remaining dimensions. For
instance, MovieLens and Netflix datasets are also correlated in their user dimen-
sions. This intuition comes from consumer segmentation which is a popular concept
in marketing. Consumers can be grouped by their behaviors: for example, "tech
leaders" are a group of consumers with a strong desire to own new smartphones with
the latest technologies, whereas another group only changes to a new phone when
the current one is out of order. This idea has two implications. Firstly, different
users can be grouped if they have similar preferences, i.e.,
users with similar movie tastes on the MovieLens website can be grouped together
and so can those on Netflix. Secondly, related groups across domains may have sim-
ilar behaviors. For example, action movie fans on MovieLens and those on Netflix
are fond of watching action movies. Thus, these implicit similarities can provide
other insights to understand the relationship between users and items if they are
exploited properly.
The implicit similarities in the non-coupled dimension will have a significant
impact in improving recommendation accuracy. For example, a re-examination of
the MovieLens and Netflix datasets shows that although there is no direct user match
between the two datasets, groups of users on the MovieLens website and related ones
on the Netflix website share similar preferences. Thus, it is possible to use the group
behaviors of one dataset to enrich the understanding of the corresponding groups in
another dataset. Furthermore, this type of correlation also implicitly exists in many
other scenarios in the real world. For example, suburbs with a large number of
high-income families may be correlated with a lower crime rate as shown in Figure
1.1, or users with an interest in action books may also like related movie genres
(e.g., action films), or even though users on Amazon and Walmart are different,
those sharing similar interests may share similar behaviors in relation to the same
set of products. Thus, using both explicit and implicit similarities correctly will
have potential applications to many real-world scenarios.
Implicit similarities have the potential to improve recommendation accuracy.
However, different approaches of joint analysis of cross-domain datasets (Pan 2016)
use only explicit similarities as a bridge to collaborate among datasets. Although
sharing these explicit similarities was shown to be effective in improving recom-
mendation, there are rich implicit features that remain unused and have great
potential to provide even more appropriate recommendations.
1.1.3 Joint analysis of heterogeneous datasets is costly
Recent innovations on the Internet and social media have brought many op-
portunities as well as challenges to research communities. On the one hand, they
have made increasingly many gigabytes of matrices and high-order tensors available
(Ermis et al. 2015). An analysis of their explicit and implicit similarities provides us
with a deeper understanding of the underlying relationships. On the other hand, mining
these huge datasets incurs significant computation and communication costs and
occupies a large amount of memory. Traditional algorithms, such as CMTF (Acar, Kolda &
Dunlavy 2011), are intractably slow or quickly run out of memory. The former is
because they iteratively factorize the full coupled tensors into factors many times;
the latter is because the full coupled tensors cannot be loaded into the local mem-
ory of any typical computer. These challenges have been a motivation for many
researchers.
Both efficient computational methods and scalable works (Zhu et al. 2016, Sael
et al. 2015) have been proposed to speed up the factorization. Whereas concurrent
processing using CPU cores (Papalexakis et al. 2014) or GPUs’ massively paral-
lel architecture (Zou et al. 2015) enhances processing speed, it does not solve the
problem of insufficient local memory to store the whole big data. Other MapRe-
duce distributed models (Shin & Kang 2014, Beutel et al. 2014, Kang et al. 2012)
overcome the memory problem by keeping large files in a distributed file system.
They also improve computational speed by performing the factorization process in
parallel with many different computing nodes.
Computing in parallel allows factors to be updated faster, yet the factorization
faces higher data communication cost if it is not well designed. The first critical
weakness of MapReduce algorithms is that whenever a computing node needs data to
process, the data must be transferred from an isolated distributed file system to the
node (Beutel et al. 2014, Kang et al. 2012, Jeon et al. 2016). The iterative nature of
tensor factorization requires data and factors to be distributed over and over again,
incurring enormous communication overhead: because each of the T iterations must
re-transfer the data, doubling the tensor size makes the communication cost 2T times
that of a single pass over the original tensor. This diminished performance leads to
the second disadvantage, which is their low scala-
bility. Thus, improving the scalability of the analysis of datasets across domains is
another problem that needs to be addressed.
In light of the above research issues, this thesis aims to address the following
three research questions:
Q1. Is it possible to propose appropriate methods of sharing explicit similarities
between cross-domain datasets to understand their actual relationship?
Q2. How to share implicit similarities in non-coupled dimensions across domains to
improve recommendation accuracy?
Q3. How to improve the scalability of the factorization process such that it is able
to scale up to a different number of coupled tensors, tensor modes, tensor
dimensions and billions of observations?
1.2 Thesis
The proposed method is the first scalable factorization model to use both explicit
and implicit similarities across domains for cross-domain recommendation perfor-
mance improvement.
Table 1.1 : Comparison of existing algorithms for recommendation. The features
that an algorithm supports are checked. Only the proposed method has all the
features.

Algorithm                            Explicit       Implicit       Scalability
                                     Similarities   Similarities
CMTF (Acar, Kolda & Dunlavy 2011)    ✓              ×              ×
SALS (Shin et al. 2017)              ×              ×              ✓
SCouT (Jeon et al. 2016)             ✓              ×              ✓
The proposed method                  ✓              ✓              ✓
Coupled datasets across domains are strongly correlated. The most apparent
relationship can be seen in the coupled dimension. As the datasets have the same
coupled dimension, they share some direct properties of the coupled dimension. For
example, as the movie ratings on the MovieLens and Netflix websites are of the
same list of movies, a film on MovieLens can be matched with its identical one
on Netflix. It is worth noting that these matched movies have precisely the same
properties. Therefore, both websites directly share some common characteristics
in the movie dimension. These direct correlations on the coupled dimension, or
explicit similarities, can be mutually used to enrich our understanding of the
underlying structure of each dataset.
In addition to the explicit similarities, coupled datasets across domains also
have indirect relationships, called implicit similarities. For instance, as users of
the MovieLens and Netflix websites are different, there is no direct match or sharing
between any two particular users across the two sites. However, some of the users
on the MovieLens website are fans of action movies whereas some of the Netflix
users love to watch action movies. These action movie fans are not the same users
across domains. However, they share some common behaviors, e.g., they are likely
to rate action movies highly. Thus, such indirect or hidden similarities may exist
between two related groups of users (e.g., action movie fans, sci-fi movie fans, etc.).
Sharing these implicit similarities provides other rich insights, in addition to the
aforementioned explicit similarities, to better understand the relationships between
users and items. This thorough understanding allows the company to provide more
appropriate recommendations to its users.
The joint analysis of coupled datasets across domains allows us to use both
explicit and implicit similarities to improve recommendation accuracy. However,
mining them often incurs heavy computation, communication and storage costs.
This problem is because data is now generated at tremendous rates. Consequently,
improving the scalability of the factorization process is not an option, but a crucial
requirement. The scalability of a method is its capability to scale up its operations
as the data increases. This means the method has the capability to complete the
computation within a reasonable amount of time. Also, it implies the ability to add
more hardware resources to improve the performance of the analysis. At the same
time, a hardware failure does not prevent the method from performing its operation
and may only reduce its performance. Any method that is not able to scale up will
be in trouble when analyzing large-scale datasets.
Several algorithms have been proposed for the joint analysis of coupled datasets
across domains. Table 1.1 compares these in terms of their capabilities for using explicit
similarities, discovering implicit similarities and scaling up to large-scale datasets.
Existing methods support either one or two of the three features. CMTF (Acar,
Kolda & Dunlavy 2011) only uses explicit similarities in the coupled factors. SALS
(Shin et al. 2017) scales up the factorization process, but it does not support multiple
datasets. SCouT (Jeon et al. 2016) improves the scalability of CMTF. Nevertheless,
it exploits the same explicit similarities just as CMTF does. The lack of a scalable
method with the ability to exploit both the explicit and implicit similarities across
coupled datasets motivates this research. Section 1.3 provides some background of
these existing methods and Section 1.4 introduces the contributions of this thesis
by conducting this research.
1.3 Background
Utilizing similarities across domains has attracted enormous research effort (Zhang,
Yuan, Lian, Xie & Ma 2016, Zhang, Xiong, Kong & Zhu 2016). Some of the com-
monly used algorithms are discussed in this section.
Acar, Kolda & Dunlavy (2011) introduced Coupled Matrix Tensor Factorization
(CMTF) as joint analysis of explicit similarities between a matrix and a tensor cou-
pled in one dimension to improve recommendation accuracy. The authors assumed
both datasets would explicitly share a common factor in the coupled dimension.
Thus, they formulated this identical factor in a coupled loss function. Even though
CMTF provides a deeper knowledge of the underlying structure of the data, it has
three main drawbacks. Firstly, it only uses explicit relationships in the coupled di-
mension. Coupled datasets across domains may also have some implicit similarities
that can be additional resources to deepen the understanding of the actual relation-
ship in the data. Secondly, the assumption that coupled datasets share identical
coupled factors is unrealistic. Even though the coupled datasets may be strongly
correlated, they may also have unique features from their domains. Hence, forcing
them to share identical information may lose the domain-specific characteristics.
Finally, the analysis is performed on a local machine. When the size of the input
matrix and tensor becomes bigger than the size of the machine’s memory, CMTF
fails. Subsequent works have only focused on the latter issue.
As an attempt to resolve the scalability of CMTF, Jeon et al. (2016) imple-
mented a MapReduce-based distributed algorithm, called SCouT. In a nutshell,
SCouT divides huge data into small parts and concurrently factorizes them with
several computing nodes in a cluster. Following the MapReduce framework, SCouT
stores data files in distributed file servers. As a result, SCouT requires pieces of
data to be transferred from the distributed file servers to each computing node for
every iteration. This data transmission cost, in the case of transferring many ter-
abytes to all computing nodes over iterations, even surpasses the time saved from
parallel processing. Hence, this weakness reduces the robustness and effectiveness of
SCouT. An algorithm minimizing this communication is, therefore, a better solution
for scaling up as the observed data increases.
In an attempt to overcome the weakness of the MapReduce framework, Shin et al.
(2017) introduced an optimization to reduce the repeated redistribution of data. The
authors’ idea was to cache data in local disks of computing nodes. This data caching
reduced the communication overhead significantly as data was only transferred from
local disks to memory for each iteration. Nevertheless, this communication can be
reduced even more. As data was stored on disks, reading it to memory for each
access takes time, especially for huge datasets and many iterations. Furthermore,
the authors' proposed algorithm worked with a single dataset only, lacking the ability
to use similarities across domains for cross-domain recommendation.
The lack of a scalable algorithm with the capability of utilizing both explicit
and implicit similarities for cross-domain recommendation motivates us to conduct
this research. The proposed model is the only one which can effectively scale up its
analysis to use both explicit and implicit similarities across domains for cross-domain
recommendation performance improvement.
1.4 Knowledge contributions
To investigate the above research questions, the research in this thesis makes four
knowledge contributions to the data mining research community. Figure 1.2 shows
the relationship between the research questions and the knowledge contributions of
this thesis. Details of each contribution are discussed in Section 1.5.
[Figure 1.2 maps the three research questions (Q1: how to share explicit similarities;
Q2: how to share implicit similarities; Q3: how to improve the scalability?) to the
four contributions (#1: utilize explicit similarities; #2: discover implicit similarities;
#3: exploit both explicit and implicit similarities; #4: scale up factorization).]

Figure 1.2 : The research questions and their corresponding contributions.
Contribution #1. A new objective function to enable each dataset to have its own
discriminative factor on the coupled mode, capturing the actual explicit simi-
larities across domains;
Contribution #2. A novel algorithm to discover implicit similarities in non-coupled
mode and align them across domains;
Contribution #3. A matrix factorization-based model to utilize both explicit and
implicit similarities for cross-domain recommendation accuracy improvement;
Contribution #4. A scalable factorization model based on the Spark framework
to scale up the factorization process to the number of tensors, tensor modes,
tensor dimensions and billions of observations.
1.5 Research Methods
This section briefly introduces the research methods to be implemented to in-
vestigate the research questions.
1.5.1 A new objective function to enable each dataset to have its own
discriminative factor on the coupled mode, capturing the actual
explicit similarities across domains
The goal of this research is to accurately recommend items that a particular
user may like. Recommending the right products to the right consumers requires a
thorough understanding of user preferences and item characteristics. This require-
ment can be addressed as a result of recent innovations on the Internet and social
media where many datasets, coupled in one dimension, from different sources are
available. As the coupled datasets have one dimension in common, they share some
explicit similarities that can be used effectively to better understand the underly-
ing relationships between users and items, resulting in the provision of more useful
recommendations.
Coupled datasets have strong correlations on their coupled dimension. For in-
stance, the same action movies on the MovieLens and Netflix websites share some
common characteristics. However, each domain also has some unique properties. For
example, the MovieLens allows ratings from 0.5 to 5 with 0.5 increments whereas
the Netflix only enables 1 to 5 ratings with 1 increases. Thus, there are scenarios
where action movie fans on the MovieLens rate action movies with 3.5, 4, 4.5 or
5 stars while those on the Netflix rate them with 4 or 5 stars. Due to this scale
difference across sites, existing models that assume coupled datasets share the same
coupled factor or the same parameters on their coupled dimension are unlikely to
capture the actual differences. A method is proposed to better capture the true
explicit similarities across domains to improve recommendation accuracy.
Suppose cross-domain datasets X and Y are coupled in their first dimension,
popular joint factorization algorithms assume that they share the same features in
the coupled dimension. For example, CMF (Singh & Gordon 2008) and CMTF
(Acar, Kolda & Dunlavy 2011) assume that the first dimension of X shares a com-
mon low-rank subspace with the first dimension of Y. A basis for this low-rank
subspace is expressed by the identical latent factors in the coupled dimensions (cou-
pled factors) of X and Y in the coupled loss function. Admittedly, the first factor
of X highly correlates with the first factor of Y, yet they are unequal in many
real-world data and applications. Thus, forcing them to share the same coupled
factor may reduce the accuracy of factorization, leading to a lower recommendation
performance.
Sharing the same coupled factors as proposed by the existing algorithms is hy-
pothesized to reduce the accuracy of the joint factorization. However, this perfor-
mance reduction is not the only issue. By using an identical coupled factor for
cross-domain datasets, the final result optimizes either of them, not both. It may
approximate X well and lose Y’s decomposition accuracy, or vice versa. Hence, this
problem is addressed by allowing each dataset across domains to have its unique
factor even in the coupled dimension. Moreover, a new coupled loss function is
proposed where different coupled factors are regularized to be as close as possi-
ble. These different, yet closely related, coupled factors better capture the true
relationship between cross-domain datasets, optimizing the factorization of every
dataset without sacrificing any accuracy. The proposed model is benchmarked with
commonly used algorithms that can use the explicit similarities across domains for
recommendation, such as CMTF (Acar, Kolda & Dunlavy 2011), CLFM (Gao et al.
2013) and CBT (Li et al. 2009a). For a fair comparison, each model is applied to
the same publicly available datasets. Root mean squared error (RMSE) is used as
a metric for benchmarking the proposed idea.
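As a concrete sketch of the proposed coupled loss (the exact objective and its
optimization are developed in Chapter 3), let A_X and A_Y denote separate
coupled-mode factors for a tensor X and a matrix Y coupled in their first mode.
Instead of forcing A_X = A_Y as CMTF does, one plausible form of the loss
regularizes the two factors towards each other:

    L = ‖X − [[A_X, B, C]]‖² + ‖Y − A_Y Vᵀ‖² + λ ‖A_X − A_Y‖²,

where λ controls how strongly the two coupled factors are pulled together: a very
large λ recovers the identical-factor assumption, while a finite λ lets each dataset
keep its domain-specific deviations on the coupled mode.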
1.5.2 A novel algorithm to discover implicit similarities in non-coupled
mode and align them across domains
Cross-domain datasets not only have explicit similarities in the coupled dimen-
sion, but they also share implicit ones in the non-coupled dimension. Different
approaches have been proposed to perform a joint analysis of coupled datasets (Pan
2016). However, all of the existing algorithms use explicit similarities as a bridge
to collaborate among datasets. Although these explicit similarities showed their ef-
fectiveness in improving recommendation, there are still rich implicit features that
were not used but have great potential to further improve the recommendation. The
fact that non-coupled dimensions in the aforementioned example of the MovieLens
and Netflix datasets contain non-overlapping users prevents direct knowledge shar-
ing in their non-coupled factors. However, their latent behaviors are correlated and
should be shared. These latent behaviors can be captured in low-rank factors by
matrix tri-factorization. As factorization is equivalent to spectral clustering (Ding
et al. 2006), different users with similar preferences are grouped in non-coupled
user factors. Building on this concept, latent clusters in these non-coupled fac-
tors are hypothesized to have a close relationship. Therefore, correlated clusters in
non-coupled factors are aligned to be as close as possible. This idea matches the
fundamental concept of CF in the sense that similar user groups who rate similarly
will continue to do so.
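The alignment step can be sketched in a few lines of Python (an illustrative sketch
only: the cluster assignment by argmax and the centroid definition over reconstructed
ratings are simplifying assumptions, and Chapter 4 gives the exact formulation):

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_centroids(U, X_hat):
    """Assign each user to the cluster of its largest latent weight (the
    argmax of its row in the non-coupled factor U) and summarize each
    cluster by the mean reconstructed rating vector of its members."""
    labels = U.argmax(axis=1)
    r, m = U.shape[1], X_hat.shape[1]
    return np.vstack([X_hat[labels == k].mean(axis=0)
                      if np.any(labels == k) else np.zeros(m)
                      for k in range(r)])

def match_clusters(C1, C2):
    """Match the clusters of two domains one-to-one by minimizing the total
    Euclidean distance between their centroids (the Hungarian method)."""
    cost = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

Because both domains rate the same set of items, the centroids live in a comparable
space; the matched pairs can then be pulled towards each other by a regularization
term during the joint factorization.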
This aim can be achieved by delivering a factorization model that can exploit
the implicit similarities across domains for recommendations. In the case of the afore-
mentioned movie rating matrices on MovieLens and Netflix websites, they contain
preferences of different users for the same set of movies. Even though there is no
direct user matching between them, some of them may share hidden behaviors that
can be utilized to improve recommendation accuracy. The performance of the pro-
posed algorithm with implicit similarities exploitation is measured in comparison
with that of other widely used methods using only explicit similarities, including
CMF (Singh & Gordon 2008), CST (Pan et al. 2011), CBT (Li et al. 2009a) and
CLFM (Gao et al. 2013). In this event, RMSE is also used as the metric.
1.5.3 A matrix factorization-based model to utilize both explicit and im-
plicit similarities for cross-domain recommendation accuracy im-
provement
This research proposes a cross-domain recommender as the first algorithm uti-
lizing both explicit and implicit similarities between datasets across sources for per-
formance improvement. One of the key hypotheses, extended from CMF (Singh
& Gordon 2008) where both datasets have the same factor in their coupled mode,
is that two datasets across domains also possess their own specific patterns. The
proposed idea is to find a way to combine these unique patterns into the common fac-
tor. One plausible solution is to allow the coupled factors to have both common and
domain-specific parts. In addition, another key hypothesis for implicit similarities
is that they may exist in non-coupled factors. Thus, the proposed method utilizes
both the explicit similarities in the coupled factors and the implicit similarities in
the non-coupled factors to improve recommendation performance.
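Using the notation of Table 1, one plausible way to write the resulting model for two
coupled rating matrices (the exact objective, including the cluster-alignment terms
for the implicit similarities, is derived in Chapter 4) is

    L = Σ_{i=1,2} ‖X(i) − U(i) S(i) [V(0); V(i)]ᵀ‖² + regularization terms,

where the coupled factor of domain i stacks the c common rows V(0), shared across
domains to carry the explicit similarities, on top of its domain-specific rows V(i),
while aligning the latent clusters of the non-coupled factors U(i) transfers the
implicit similarities.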
Validated on real-world datasets, the proposed idea outperforms the current
cross-domain recommendation methods by more than two times. Furthermore, the
more interesting observation is that both explicit and implicit similarities between
datasets help to better suggest unknown information from cross-domain sources.
1.5.4 A scalable factorization model based on the Spark framework to
scale up the factorization process to the number of tensors, tensor
modes, tensor dimensions and billions of observations
As businesses grow, they reach more users and eventually collect more ratings.
Having more data opens a new opportunity for them to provide more accurate
recommendations. At the same time, it is also a challenge as they need to analyze an
increasing amount of data to understand more deeply the underlying relationships
between users and items. To accommodate this massive increase, not only the
ability to handle this big data but also the capability to finish the analysis within
a reasonable time are necessary. Therefore, an efficient and scalable method is a
crucial requirement of any recommendation system.
Both computationally efficient methods (He et al. 2016, Rennie & Srebro 2005,
Liu & Shang 2013, Wang, Tung, Smola & Anandkumar 2015) and scalable work
(Yang et al. 2017, Acar, Dunlavy, Kolda & Mrup 2011, Park et al. 2017) have been
proposed to speed up the factorization. Furthermore, other researchers attempted to
use hardware power to enhance processing speed. Papalexakis et al. (2014) presented
a method using multiprocessors for coupled matrix tensor factorization. Zou et al.
(2015) proposed to take advantage of GPUs' massively parallel architecture to speed
up tensor factorization. As these methods are performed on a local machine, they
do not solve the problem of insufficient memory when they have to handle huge
datasets.
To overcome the limit of local memory, MapReduce-based factorization mod-
els (Beutel et al. 2014, Kang et al. 2012, Shin & Kang 2014, Jeon et al. 2016)
were introduced. They can keep the large files in a distributed file system which
was designed to be expanded easily by adding more storage. Furthermore, these
MapReduce-based algorithms improve computational speed by having many nodes
compute in parallel. Even though distributed computing allows factors to be up-
dated faster, MapReduce-based models require data to be transferred from the iso-
lated distributed file system to the computing node when it needs to process this
data. The iterative nature of tensor factorization requires data and factors to be
distributed over and over again, incurring huge communication overhead.
This research proposes the first data parallelism algorithm that incurs minimal
communication cost. In particular, the proposed method is designed to cache data
in memory in parallel so that no data communication is needed for each iteration.
This design makes it a lightning-fast and scalable tensor factorization algorithm
whose performance does not dramatically degrade as the data increases. Also, the
proposed method is capable of scaling up to different numbers of input datasets,
their dimensions, and billions of observations. The proposed method's processing
speed is measured in comparison with SCouT (Jeon et al. 2016) and SALS (Shin
& Kang 2014), which are the fastest scalable coupled matrix tensor factorization
and tensor factorization algorithms, respectively. Moreover, a thorough analysis of
the scalability of the proposed model is also performed. To this end, the proposed
algorithm is compared against its baselines in case the data grows to billions of
observations.
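The core of the design can be sketched in a few lines of PySpark (a minimal,
illustrative sketch: the file path, block count and id assumptions are hypothetical,
and the full block-partitioned, multi-tensor design appears in Chapter 5). The
observations are partitioned once and cached in executor memory, so each iteration
only broadcasts the small factor matrices rather than reshuffling the raw data:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-sketch").getOrCreate()
sc = spark.sparkContext

r, lam, T = 8, 0.1, 10   # rank, regularization and iteration count (illustrative)

# Parse (user, item, rating) triples, hash-partition them once, group the
# observations of each row index, and cache the result in executor memory.
# After this point the raw observations never cross the network again;
# user and item ids are assumed to be 0-based and contiguous.
triples = (sc.textFile("hdfs:///ratings.csv")          # path is illustrative
             .map(lambda l: l.split(","))
             .map(lambda t: (int(t[0]), (int(t[1]), float(t[2])))))
by_user = triples.partitionBy(64).groupByKey().cache()
by_item = (triples.map(lambda t: (t[1][0], (t[0], t[1][1])))
                  .partitionBy(64).groupByKey().cache())

n_users = by_user.keys().max() + 1
n_items = by_item.keys().max() + 1
U = np.random.rand(n_users, r)
V = np.random.rand(n_items, r)

def ls_row(obs, F):
    """Closed-form alternating-least-squares update for a single factor row,
    given that row's observations and the opposite factor F."""
    idx = np.array([j for j, _ in obs])
    y = np.array([v for _, v in obs])
    A = F[idx]
    return np.linalg.solve(A.T @ A + lam * np.eye(r), A.T @ y)

for _ in range(T):
    Vb = sc.broadcast(V)          # only the small factors are shipped
    for i, row in by_user.mapValues(
            lambda obs: ls_row(list(obs), Vb.value)).collectAsMap().items():
        U[i] = row
    Ub = sc.broadcast(U)
    for j, row in by_item.mapValues(
            lambda obs: ls_row(list(obs), Ub.value)).collectAsMap().items():
        V[j] = row

Because each executor keeps its cached partitions of the observations across
iterations, the per-iteration communication reduces to broadcasting the factor
matrices, which is the property that lets the factorization scale as the observed
data grows.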
1.6 Significance
The proposed factorization model enriches the research community with a new
way of feature sharing between coupled datasets, leading to more accurate recom-
mendations. Even though coupled datasets have strong correlations in the coupled
dimension, forcing different datasets to have the same factor on their coupled di-
mension is unrealistic in many real-world applications, which is detrimental to the
overall accuracy of the factorization. The research extends the CMTF model by as-
suming coupled datasets do not have common factors even on the shared dimension.
Instead, it enables each dataset to have discriminative coupled factors and constrains
the coupled factors to be as close as possible. This idea has two advantages. Firstly,
it properly shares the explicit similarities in coupled dimensions across domains.
Secondly, it optimizes the factorization of every single dataset without sacrificing
accuracy for any of the coupled datasets. Experiments with real-world datasets cou-
pled in one dimension illustrate that the proposed model exploits explicit similarities
better than existing models to improve recommendation performance.
Furthermore, another key contribution of this research relates to implicit simi-
larities. The fact that non-coupled dimensions in the MovieLens and Netflix ex-
ample contain non-overlapping users prevents direct knowledge sharing in their
non-coupled factors. However, their latent behaviors are correlated and should be
shared. These hidden behaviors can be captured in low-rank factors by matrix tri-
factorization. As factorization is equivalent to spectral clustering (Ding et al. 2006,
Sachan & Srivastava 2013), different users with similar preferences are grouped in
non-coupled user factors. Building on this concept, latent clusters in these non-
coupled factors are hypothesized to have a close relationship. Therefore, correlated
clusters in non-coupled factors are aligned to be as close as possible. This idea
matches the fundamental concept of CF in the sense that similar user groups who
rate similarly will continue to do so. As a result, the developed algorithm is the
first factorization model that utilizes not only the explicit similarities but also the
implicit ones across domains for recommendation accuracy improvement.
In addition, the developed algorithm benefits businesses, considering that their
users are generating a massive amount of data today. This fast
data generation rate demands a fast and scalable data analytic method. Thus, a
novel distributed model that exhibits robust data parallelism is proposed. It enables
factors to be decomposed concurrently while minimizing data transmission overhead. As
a result, the proposed algorithm is the only one that scales up well in relation to
the number of tensors coupled in one or more modes, tensor modes, tensor dimen-
sions and billions of observations. Moreover, the research also benefits the research
community in two aspects. Firstly, it presents a closed-form optimization solution
which not only converges faster but also achieves higher accuracy. Experiments
with real-world datasets confirm the quality of the proposed solution. Secondly,
this research provides a theoretical complexity analysis of the proposed algorithm in
computation, communication and space aspects as well as some empirical evidence
of its fastest convergence in comparison with existing algorithms.
1.7 Thesis organization
This thesis is organized as follows:
• Chapter 1 introduces the research problems, research questions, contributions
and their significance.
• Chapter 2 presents preliminary concepts and previous work related to the
research topics. The background of matrix factorization, tensor factorization,
and coupled tensor matrix factorization is briefly summarized. Next, different
optimization methods such as gradient descent and alternating least squares
are explained in detail. Furthermore, this chapter discusses different meth-
ods using similarities across domains for recommendation including the joint
analysis of coupled datasets and transfer learning. Also, different distributed
approaches for scaling up factorization processes are reviewed and compared.
• Chapter 3 proposes an algorithm to exploit explicit similarities across do-
mains. It assumes coupled datasets share different but closely similar coupled
factors. The proposed algorithm is described in detail starting with the mo-
tivation to introduce this idea, followed by its technical aspects, then several
experiments to show its performance, and a summary of its knowledge contri-
butions.
• Chapter 4 explains a method to discover implicit similarities and use them to
improve cross-domain recommendation performance. Specifically, this chap-
ter presents a method to find related groups across non-coupled factors and
align them to share the implicit similarities across domains. Also, how to use
both explicit and implicit similarities is presented. Extensive experiments are
conducted, and their results are reported in this chapter to demonstrate the
advantages of the proposed method. Finally, this chapter is concluded with
some knowledge contributions of the proposed algorithm.
• Chapter 5 describes a scalable model for speeding up the factorization process
when dealing with big data inputs. This chapter presents the distributed data
design and the closed-form optimization to improve computational and time
complexity. Thorough experiments are also discussed to benchmark the scalability
of the proposed algorithm in terms of the observed data size, the number of
computing nodes and the number of input datasets. Also, a brief comparison
on recommendation accuracy is reported to conclude the knowledge contribu-
tions of this research.
• Chapter 6 concludes the thesis and summarizes the work in a broader context.
Furthermore, future directions of the research are also described here.
Chapter 2
Literature Review and Background
This chapter reviews different aspects of personalized recommendation systems re-
lated to this research. There are two primary entities in personalized recommen-
dation systems: users and items. Items can be products such as movies, songs,
websites, etc., in product recommendation, or other users in the friend recommendation
problem. Users are those the systems want to provide recommendations to. The
primary purpose is to predict a user’s preference for a particular item so that the
systems can provide an appropriate recommendation strategy.
As recommendation systems analyze user preferences for different items, this
chapter first introduces the concepts of the rating matrix and the rating tensor,
which capture user preferences in recommendation systems, in Section 2.1. Section
2.2 discusses collaborative filtering (CF) based recommendation systems and presents
matrix factorization (MF) and its extension, tensor factorization (TF). Section 2.3
then reviews methods that utilize datasets across domains for higher recommendation
accuracy; two main approaches, the joint analysis of multiple datasets and transfer
learning between cross-domain ones, are described. Alternating least squares (ALS)
and gradient descent optimization methods for finding the factors in MF and TF are
discussed in Section 2.4. As data grows, different algorithms have been proposed to
scale up factorization processes; Section 2.5 discusses these distributed models, and
Section 2.6 reviews deep learning based recommendation systems. Finally, this chapter
highlights a few research gaps in Section 2.7.
2.1 Data format
This section presents the rating matrix and the rating tensor as ways to represent
data in recommendation systems.
2.1.1 Rating matrix (utility matrix)
Users and items are the two primary entities of recommendation systems (Konstan
2004). Users may or may not provide feedback on different items, and different
websites may use different kinds of feedback. For example, users of Facebook
may click "thumbs-up" to like a post, while Amazon users rate items they bought from
1 to 5 stars on the Amazon website. For those who do, the feedback, which reflects their
degree of preference for the items (Zhang et al. 2008), can be assigned a value, or
rating, corresponding to a user-item pair. All of these ratings, including the missing
ones, form a matrix with users in one dimension and items in the other. This
matrix is called a rating matrix (or a utility matrix) whose observed entries are the
ratings the users provided. An example of a rating matrix is shown in Figure 2.1.
In the sample rating matrix in Figure 2.1, there are six users and seven movies.
Entries with star symbols represent the ratings users provided for the respective movies.
Ratings range from one star (dislike) to five stars (like very much). Blank entries are not yet
rated. Recommendation systems are built to utilize the observed ratings in order
to predict these missing entries (Resnick & Varian 1997). They then recommend
movies with high predicted ratings to the users.
Generally, there are many users and many items, and a particular user typically
rates only a few items. Thus, the number of observed ratings is much smaller than
the number of missing ones.
Figure 2.1 : An example of a rating matrix of a movie recommendation system. Users rate movies from one to five stars. Blank entries are missing ratings, as the users have not rated them yet. The recommendation system has to predict them.
In other words, the rating matrix is often sparse, with just
a few entries having values (Toscher et al. 2008).
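To make this representation concrete, the following minimal sketch builds a small sparse rating matrix in Python with numpy; the user and movie counts match Figure 2.1, but the particular (user, movie, rating) triples are illustrative placeholders, not values from the figure.

```python
# A small sparse rating matrix: 6 users x 7 movies, as in Figure 2.1.
# The (user, movie, rating) triples below are illustrative placeholders.
import numpy as np

n_users, n_movies = 6, 7
R = np.zeros((n_users, n_movies))  # 0 marks a missing (unobserved) rating

observed = [(0, 0, 5), (0, 3, 4), (1, 2, 3), (2, 5, 4), (3, 1, 2), (4, 6, 5)]
for u, m, r in observed:
    R[u, m] = r

sparsity = 1 - len(observed) / R.size
print(f"observed: {len(observed)}, sparsity: {sparsity:.0%}")  # sparsity: 86%
```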
2.1.2 Tensor
As data evolves over time, recommendation systems are likely to acquire
additional entities. For instance, ratings can be collected by weekday, so the
systems have seven rating matrices from Monday to Sunday, as shown in Figure
2.2a. Such data can be naturally represented by a tensor (Itskov 2009).
A tensor is defined as a multidimensional array (Kolda & Bader 2009). It is
often specified by its mode (a.k.a. order or way), which is the number of dimensions.
Specifically, a mode-1 tensor is a vector and a matrix is a mode-2 tensor; a mode-3 or
higher-order tensor is often simply called a tensor. In Figure 2.2b, the movie ratings
by weekdays are put in a mode-3 tensor. Similar to the rating matrix, the tensor
is often sparse.
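As an illustration, the mode-3 rating tensor of Figure 2.2b can be sketched the same way as the rating matrix above; the entries are again placeholders.

```python
# A mode-3 rating tensor of user x movie x weekday, as in Figure 2.2b.
# The (user, movie, weekday, rating) quadruples are illustrative placeholders.
import numpy as np

T = np.zeros((6, 7, 7))            # 6 users, 7 movies, 7 weekdays; 0 = missing
for u, m, d, r in [(0, 0, 1, 5), (1, 2, 6, 3), (2, 5, 0, 4)]:
    T[u, m, d] = r

print("mode:", T.ndim, "shape:", T.shape)  # mode: 3 shape: (6, 7, 7)
```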
Figure 2.2 : An example of a rating tensor of mode-3. Movies are rated by users for each weekday from one to five stars. A rating is represented by a three-dimensional tensor of user-by-item-by-weekday. (a) Movie ratings by weekdays; (b) a mode-3 tensor.
2.1.3 Coupled datasets
Recent innovations in the Internet and social media have made many closely
related datasets available. As a result, it is possible to find rating matrices across
domains having one dimension in common. For example, MovieLens and Netflix
websites each published a dataset of their user ratings on some movies. Although
users on MovieLens and those on Netflix are different, they may rate the
same list of movies. In other words, these datasets have the movie dimension in common
and are said to be coupled in their movie dimension.
Figure 2.3 : An example of coupled rating matrices X(1) and X(2) from the Netflix and MovieLens websites, respectively. Blank entries are unobserved ratings. X(1) and X(2) contain ratings of different users for the same set of movies; they are said to be coupled in the movie dimension.
Besides the two coupled matrices above, a coupled matrix and a tensor can sometimes be found.
Figure 2.4 : An example of a coupled matrix tensor from the MovieLens dataset. Movie ratings are captured in a mode-3 tensor X of users by movies by weekdays. Additional information forms a matrix Y of users by user profiles and a matrix Z of movies by genres. Tensor X and matrix Y are coupled in their user mode; tensor X and matrix Z are coupled in their movie mode.
For example, the MovieLens dataset (Harper & Konstan 2015)
includes ratings from users on movies over a period of time. This information can
be represented in the form of a three-dimensional tensor X of users by movies by
weekdays whose entries are ratings. Besides, MovieLens also captures user identities
and categorizes movies into different genres. This additional information forms a
matrix Y of users by user profiles and a matrix Z of movies by genres. More
interestingly, the first dimension of X is correlated with the first dimension of Y,
and the second mode of X has a relationship with the first dimension of Z. Figure
2.4 visualizes this relationship. In this case, X is said to be coupled with Y in
its first mode, and joined with Z in its second mode.
2.2 Recommendation Systems
Recommendation systems have gained importance and popularity among
product providers. Two fundamental techniques are widely chosen for developing
personalized recommendation systems: the content-based approach (Pazzani & Billsus
2007) and the collaborative filtering (CF)-based approach (Schafer et al. 2007, Ekstrand
et al. 2011). The former focuses on information about users or items for making
recommendations, whereas the latter is based on latent similarities (Gao et al. 2012,
Menon & Elkan 2011) between user interests and item characteristics
for predicting the items specific users would be interested in. As this research focuses
on improving CF-based recommendations, this section discusses key techniques for
CF-based recommendation systems.
2.2.1 Matrix Factorization
The basic idea of CF-based recommendations is that they rely on latent similarities
among users and items for making recommendations. They analyze
past user preferences to identify new items for which users are likely to have similar preferences
(Hu et al. 2008). Koren et al. (2009) pioneered the application of Matrix Factorization
(MF) to movie rating prediction. Observed movie ratings in the form of a user-movie
matrix were decomposed into low-rank matrices, called latent factors or simply factors.
$$\mathbf{X} \approx \mathbf{U}\mathbf{V}^T$$
where X ∈ R^{n×m} is the rating matrix of n users by m items, U ∈ R^{n×r} is the
user factor (the factor in the user dimension), V ∈ R^{m×r} is the item factor, and r is the
rank of the factorization.
The factors U and V can be found by minimizing the following loss function:

$$L = \frac{1}{2}\left\| \mathbf{U}\mathbf{V}^T - \mathbf{X} \right\|^2 + \frac{\lambda}{2}\left( \left\|\mathbf{U}\right\|^2 + \left\|\mathbf{V}\right\|^2 \right) \qquad (2.1)$$

where the second term is the squared L2 regularization term used to prevent overfitting.
Once the latent factors U and V are found, a matrix multiplication of them is then
performed to predict the missing entries of the rating matrix X (Kiraly et al. 2015).
The performance of MF was demonstrated in the Netflix competition (Bell & Koren
2007), where Koren (2009) achieved the highest accuracy for movie rating prediction
with it.
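A minimal sketch of this idea follows, assuming stochastic gradient descent over the observed entries of the loss in Eq. (2.1); the rank, learning rate and regularization weight are illustrative choices, not values from the literature above.

```python
# Minimal MF by stochastic gradient descent on observed entries, per Eq. (2.1).
# Hyperparameters (r, eta, lam) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 7, 3
obs = [(0, 0, 5.0), (0, 3, 4.0), (1, 2, 3.0), (2, 5, 4.0), (3, 1, 2.0), (4, 6, 5.0)]

U = 0.1 * rng.standard_normal((n, r))   # user factor
V = 0.1 * rng.standard_normal((m, r))   # item factor
eta, lam = 0.01, 0.1

for _ in range(2000):
    for i, j, v in obs:
        e = U[i] @ V[j] - v             # residual on one observed rating
        U[i], V[j] = U[i] - eta * (e * V[j] + lam * U[i]), \
                     V[j] - eta * (e * U[i] + lam * V[j])

# Predict a missing entry by the dot product of the learned factors
print(f"predicted rating for (user 0, item 1): {U[0] @ V[1]:.2f}")
```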
2.2.2 Matrix Tri-Factorization
Unlike MF, which factorizes a rating matrix into two factors, matrix tri-factorization
decomposes the input rating matrix into three factors. When the given matrix is
complete, decomposing it into factors can be done by Singular Value Decomposition.
However, when it is incomplete, computing its exact decomposition is an
intractable task (Kolda & Bader 2009). Thus, a more efficient and feasible approach
is to approximate the incomplete X of n users by m items as a matrix product
of U ∈ R^{n×r}, S ∈ R^{r×r} and V^T ∈ R^{r×m}:

$$\mathbf{X} \approx \mathbf{U}\mathbf{S}\mathbf{V}^T$$

where r is the rank of the factorization, U^T U = I and V^T V = I.
In this case, U is the user factor, V is the item factor and S is the weight between
U and V. These factors can be found by minimizing the following:

$$L = \frac{1}{2}\left\| \mathbf{U}\mathbf{S}\mathbf{V}^T - \mathbf{X} \right\|^2 + \frac{\lambda}{2}\left( \left\|\mathbf{U}\right\|^2 + \left\|\mathbf{S}\right\|^2 + \left\|\mathbf{V}\right\|^2 \right) \qquad (2.2)$$
2.2.3 Tensor Factorization
Matrix Factorization (MF), a methodology that decomposes a big matrix
into two much lower-dimensional factors, proved its effectiveness in the Netflix Prize
competition (Koren 2009). Given some observed user ratings for a set of movies,
Netflix challenged the research community to predict the unknown ratings.
Figure 2.5 : Tensor factorization following the CANDECOMP/PARAFAC (CP) model for a mode-3 tensor X, which is decomposed into three low-rank factors U^(1), U^(2), and U^(3).
The winner achieved the most accurate movie rating predictions by representing the
Netflix dataset as a rating matrix of n users by m movies, factorizing it into two
low-rank factors, and then predicting the missing entries from these factors (Koren
et al. 2009). Since then, MF has become a new trend and has been extended to
multi-mode, high-dimensional and sparse big data, with the goal of capturing the
underlying low-dimensional matrices, the so-called low-rank factors.
Karatzoglou et al. (2010) and Wang et al. (2012b) introduced CF-based tensor
factorization (TF) for a flexible and generic integration of contextual information.
As an extension of MF, TF factorizes a multidimensional array, a so-called tensor,
into its latent factors to capture the underlying low-rank structures (Fang & Pan
2014, Jiang et al. 2014). Following CANDECOMP/ PARAFAC (CP) decomposition
model (Harshman 1970), TF expresses a mode-p tensor as a sum of a finite number
of rank-one components (as shown in Figure 2.5), formulated as:
$$L\left(\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(p)}\right) = \left\| \llbracket \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(p)} \rrbracket - \mathbf{X} \right\|^2$$

where X ∈ R^{I_1×I_2×···×I_p} is a mode-p tensor, its p rank-r factors are U^(l) ∈ R^{I_l×r} for all l ∈ [1, p], and the Kruskal operator is defined entry-wise as $\llbracket \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(p)} \rrbracket_{i_1,i_2,\ldots,i_p} = \sum_{k=1}^{r} \prod_{l=1}^{p} \mathbf{U}^{(l)}_{i_l,k}$.
Along with TF, researchers have addressed three key problems of TF:

• How to achieve high recommendation accuracy?

• How to factorize the input tensors? and

• Given huge data, how can this computation be done in a reasonable time?

The following sections discuss these key issues of TF.
2.3 Cross-domain Recommendation Systems
As the input rating matrix (or tensor) is sparse in nature, leveraging closely re-
lated datasets has been proposed to improve recommendation accuracy (Lahat et al.
2015a). This trend has recently been forecast to continue for the foreseeable future (Liu
et al. n.d., 2012). Two major approaches have been widely applied: joint analysis
of multiple datasets (Acar, Kolda & Dunlavy 2011, Gao et al. 2013, Gemulla et al.
2011) and transfer learning (Pan et al. 2011, Li et al. 2009a). The following subsec-
tions introduce widely used algorithms for cross-domain recommendation systems
related to this thesis.
2.3.1 Collective Matrix Factorization
Singh & Gordon (2008) proposed a joint analysis of two matrices coupled in one
of their modes. As they have one dimension in common, they are likely to share
some common characteristics in the coupled mode. Thus, the authors introduced
the collective matrix factorization (CMF) algorithm to use these explicit similarities
between them to overcome the sparsity of the input rating matrices. To this end,
the authors assumed both datasets have a common low-rank subspace in their
coupled dimension. Supposing X_1 and X_2 are coupled in their first mode, the authors
modeled CMF with a coupled loss function:

$$L = \left\| \mathbf{U}\mathbf{V}_1^T - \mathbf{X}_1 \right\|^2 + \left\| \mathbf{U}\mathbf{V}_2^T - \mathbf{X}_2 \right\|^2 \qquad (2.3)$$
where common U represents explicit similarities shared between two datasets and
regularization terms are omitted for simplification.
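A minimal sketch of the CMF objective in Eq. (2.3) follows, with two toy matrices coupled in their first mode and a single shared factor U; all values are random placeholders standing in for learned factors.

```python
# CMF coupled loss, Eq. (2.3): X1 and X2 share the same first-mode factor U.
import numpy as np

rng = np.random.default_rng(0)
n, m1, m2, r = 6, 7, 5, 3
X1, X2 = rng.random((n, m1)), rng.random((n, m2))

U = rng.random((n, r))                     # shared (coupled) factor
V1, V2 = rng.random((m1, r)), rng.random((m2, r))

loss = (np.linalg.norm(U @ V1.T - X1) ** 2
        + np.linalg.norm(U @ V2.T - X2) ** 2)
print(f"coupled loss: {loss:.3f}")
```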
The concept of explicit similarities as the common factor has been widely used.
Bhargava et al. (2015) proposed that location, activity and time together would
provide a complete picture of users; thus, different data sources were modeled to
have common coupled factors to fuse the explicit similarities among them. Transfer by
Collective Factorization (TCF) (Pan et al. 2011) was proposed to use ratings and
binary like/dislike feedback from the same users and items. TCF assumed that both the user
and item factors would be the same between the inputs. Joint Matrix Factorization
(JMF) (Shi et al. 2013) leveraged 5-star ratings and an item-item similarity matrix
collected from auxiliary data; the two datasets were joined via the same item factor.
There are also several ideas extending CMF's coupled loss function. Weighted
Non-negative Matrix Co-Tri-Factorization (WNMCTF) (Yoo & Choi 2009) extended
CMF's idea to non-negative matrix tri-factorization (Lee & Seung 2000).
2.3.2 Coupled Matrix Tensor Factorization
In the case of a coupled matrix and tensor as introduced in Section 2.1.3, the first
dimension of X is correlated with the first dimension of Y, and the second mode of
X has a relationship with the first dimension of Z. This side information, Y and Z,
if jointly analyzed with the primary data X, helps to deepen our understanding
of the underlying patterns in the data and to improve the accuracy of the tensor
decomposition.

Acar, Kolda & Dunlavy (2011) extended CMF to the joint analysis of a matrix
and a tensor. They introduced coupled matrix tensor factorization (CMTF), whose
loss function was defined as

$$L = \left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|^2 + \left\| \mathbf{U}\mathbf{A}^T - \mathbf{Y} \right\|^2 \qquad (2.4)$$
Figure 2.6 : Joint factorization of a coupled matrix tensor. X is a tensor of ratings made by users for movies on weekdays. Matrices Y and Z represent user information and movie genre, respectively. The movie rating tensor X is therefore coupled with the user information matrix Y in the 'user' mode, and joined with the movie category matrix Z in the 'movie' mode. X is factorized as a sum of low-rank vectors u, v and w; matrix Y is decomposed as a sum of u and a; and a sum of v and b is an approximation of matrix Z. Note that X and Y share the same user factor U whereas X and Z share the same movie factor V.
where $\llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket_{i,j,k} = \sum_{f=1}^{r} \mathbf{U}_{i,f}\,\mathbf{V}_{j,f}\,\mathbf{W}_{k,f}$, U, V and W are factors of X, and
U and A are factors of Y. It is worth noting that U is the common factor of both
X and Y. In this case, regularization terms are again omitted for simplification.

Figure 2.6 illustrates the case of a coupled matrix tensor factorization of X, Y
and Z. X is factorized as a sum of outer products of the low-rank vectors u, v and w; matrix Y is
decomposed as a sum of outer products of u and a; and a sum of outer products of v and b approximates
matrix Z.
The idea of having a common factor has two potential issues. Firstly, it assumes
that the first mode of X shares a common low-rank subspace with the first mode of
Y. A basis for that low-rank subspace is expressed by the identical latent factor U
of X and Y. Admittedly, the first factor of X highly correlates with the first factor
of Y, yet they are unequal in many real-world data and applications. Due to this
fact, we hypothesize that forcing them to be the same would reduce the accuracy of
this factorization. More importantly, by using identical U for both X and Y, the
best final result optimizes either X or Y, not both. It may approximate X well and
lose Y decomposition’s accuracy, or vice versa. This issue is especially critical when
we want to find the best approximation of every tensor correlated with the others.
2.3.3 CodeBook Transfer
Aiming to improve recommendation in one domain by utilizing latent rating
patterns from another domain, Li et al. (2009a) designated one dataset as a source domain
X_src and the other as a target domain X_tgt. In this research, the authors
employed matrix tri-factorization to decompose the source X_src into user,
item and weighting factors:

$$\mathbf{X}_{src} \approx \mathbf{U}_{src}\mathbf{S}_{src}\mathbf{V}_{src}^T \qquad (2.5)$$

The weighting factor S_src defines the rating patterns of the users for the items
and is used as the codebook to be transferred from X_src to X_tgt. Thus, X_tgt becomes

$$\mathbf{X}_{tgt} \approx \mathbf{U}_{tgt}\mathbf{S}_{src}\mathbf{V}_{tgt}^T \qquad (2.6)$$
This explicit knowledge transferred from the source improved the accuracy of
recommendation in the target domain. An extension, named the Rating-Matrix
Generative Model (RMGM) (Li et al. 2009b), combined the steps of extracting the
codebook and transferring it. Where there are more than two datasets, Moreno
et al. (2012) introduced multiple explicit codebooks shared among them, with each
codebook assigned a different weight to better utilize the correlations. Even though
codebook transfer works on any datasets, with or without common users or common
items, it is only effective when all the datasets have the same rating patterns. This
condition limits its applicability to the broader range of available data.
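A minimal sketch of the transfer in Eqs. (2.5)-(2.6) follows, assuming the factors have already been learned (random placeholders stand in for them here): the source codebook S_src is reused to reconstruct the target domain.

```python
# Codebook transfer, Eqs. (2.5)-(2.6): reuse the source rating-pattern matrix
# S_src when reconstructing the target domain. Factors are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
r = 3
U_src, S_src, V_src = rng.random((8, r)), rng.random((r, r)), rng.random((10, r))
X_src_hat = U_src @ S_src @ V_src.T      # source approximation, Eq. (2.5)

# The target has different users and items but borrows the codebook S_src
U_tgt, V_tgt = rng.random((12, r)), rng.random((9, r))
X_tgt_hat = U_tgt @ S_src @ V_tgt.T      # target approximation, Eq. (2.6)
print(X_src_hat.shape, X_tgt_hat.shape)  # (8, 10) (12, 9)
```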
2.3.4 Cluster-Level Latent Factor Model
The assumption that two datasets from different domains have the same rating
patterns is unrealistic in practice. They may share some common patterns while
possessing their own characteristics. This motivated Gao et al. (2013) to propose
CLFM for cross-domain recommendation. Specifically, the authors partitioned the
rating patterns across domains into common and domain-specific parts:

$$\mathbf{X}_1 \approx \mathbf{U}_1\left[\mathbf{S}_0 \,|\, \mathbf{S}_1\right]\mathbf{V}_1^T, \qquad \mathbf{X}_2 \approx \mathbf{U}_2\left[\mathbf{S}_0 \,|\, \mathbf{S}_2\right]\mathbf{V}_2^T \qquad (2.7)$$

where S_0 ∈ R^{r1×c} contains the common patterns, S_1, S_2 ∈ R^{r1×(r2−c)} are the domain-specific parts, and c is the number of common columns.
This model allows CLFM to learn only the shared latent space S_0, which has two
advantages. Firstly, as S_0 captures the similar rating patterns across domains, it
helps to overcome the sparsity of each dataset. Secondly, the domain-specific S_1 and
S_2 contain each domain's discriminative characteristics. As a result, the diversity of ratings in
each domain is preserved, improving recommendation performance.
2.4 Factorization Methodologies
A large number of different methodologies have been proposed to optimize TF
and CMTF. The most popular one is Alternating Least Squares (ALS) (Kolda &
Bader 2009). In a nutshell, ALS solves the least-squares problem for one factor while fixing
the others, and does this iteratively, alternating over the factors
until the algorithm converges. Shin & Kang (2014) followed
ALS, yet updated a subset of a factor’s columns at a time. ALS based algorithms are
computationally efficient, yet may converge slowly with sparse data (Acar, Kolda &
Dunlavy 2011).
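A minimal sketch of the ALS idea follows for the two-factor MF of Section 2.2.1 on a small dense matrix; lam is an illustrative ridge term, and a sparse rating matrix would restrict each least-squares solve to the observed entries.

```python
# One ALS loop for MF: solve for U with V fixed, then for V with U fixed.
import numpy as np

rng = np.random.default_rng(0)
n, m, r, lam = 6, 7, 3, 0.1
X = rng.random((n, m))                      # toy dense "rating" matrix
U, V = rng.random((n, r)), rng.random((m, r))

for _ in range(20):
    # U <- argmin ||X - U V^T||^2 + lam ||U||^2  (closed-form ridge solution)
    U = np.linalg.solve(V.T @ V + lam * np.eye(r), V.T @ X.T).T
    # V <- argmin ||X - U V^T||^2 + lam ||V||^2
    V = np.linalg.solve(U.T @ U + lam * np.eye(r), U.T @ X).T

print(f"reconstruction error: {np.linalg.norm(U @ V.T - X):.3f}")
```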
Gradient descent (GD) based optimization, such as stochastic gradient descent,
is an alternative to ALS. Starting with initial factor values, GD refines them
by iterating a stochastic difference equation (Gemulla et al. 2011). On the one
hand, GD is simple to implement. On the other hand, choosing good optimization
parameters such as the learning rate is not straightforward (Le et al. 2011); a learning
rate is usually decided based on experiments. Another approach is a backtracking
search for the step giving the maximum decrease on the data. CMTF (Acar, Kolda & Dunlavy 2011)
used Nonlinear Conjugate Gradient (NCG) with a line search to find an optimal
descent direction. Nevertheless, backtracking search is computationally heavy and
is therefore not suitable for large-scale datasets.
2.5 Distributed Factorization
Although ALS and GD have proved their effectiveness in optimizing matrix factorization,
tensor factorization, and coupled matrix tensor factorization, they are
only practical for small data. As applications of TF and CMTF usually deal with many
gigabytes of data, researchers have focused on developing distributed algorithms. The
nature of TF and CMTF requires the same computation to be done on different sets
of data. Consequently, several levels of data parallelism have been proposed.
Data parallelism in a multiprocessor system divides big tasks into many identical
subtasks; each subtask is performed on a separate processor. Turbo-SMT by Papalexakis
et al. (2014) addressed this direction by sampling sparse coupled tensors into sev-
eral tiny coupled tensors, concurrently decomposing them into factors using Matlab
Parallel ToolBox, and then merging the resulting factors. Another approach that followed
this direction was GPUTensor (Zou et al. 2015), which utilized the multiprocessors in
a GPU for factor computation. Even though these methods improved factorization
speed significantly, they performed their tasks in a single machine (albeit one with multiple
processors or the many cores of a powerful GPU). Thus, they would experience "out of memory"
errors if the data is too big to be loaded into the local machine.
Figure 2.7 : Distributed factorization algorithms on a data-computation coordinate. Plotted algorithms: CMTF_OPT (CTF, Acar et al. 2011), Turbo-SMT (CTF, Papalexakis et al. 2014), ScouT (CTF, Jeon et al. 2016), FlexiFact (CTF, Beutel et al. 2014), SALS (TF, Shin et al. 2015), GPUTensor (TF, Zou et al. 2015), and GigaTensor (TF, Kang et al. 2012). The x-axis represents the level of data distribution, from data located in a centralized memory or file server on the left to data distributed to computing nodes' memory on the right. The y-axis captures the level of distributed computation, from algorithms processed in a local machine to ones run on a distributed cluster (bottom to top).
Distributed data parallelism scales better with the data size. This level makes
use of distributed processors to do calculations. Moreover, the big data files are
often stored in a distributed file system which can theoretically store big files of any
size. ScouT proposed by Jeon et al. (2016), FlexiFact by Beutel et al. (2014) and
GigaTensor by Kang et al. (2012) followed this approach by defining factorization as
MapReduce processes. For any calculation to be done on a distributed processor,
the corresponding part of the data in the distributed file system needs to be transmitted
to that processor. This process repeats across the algorithms' iterations, incurring
heavy communication overhead. Shin & Kang (2014) introduced SALS to overcome this
MapReduce framework's weakness by caching data on computing nodes' local disks.
SALS reduced communication overhead significantly. Yet this communication can
be reduced even more. As data is stored on disks, reading it to memory for each
access takes time, especially for huge datasets and many iterations.
All the algorithms with different levels of data parallelism are put in an x-y co-
ordinate as in Figure 2.7. Naturally, there is no algorithm in quadrant IV as data
computed in a single machine is normally located locally. Algorithms in quadrant
III perform calculations within a local system with local data. As the whole data is
located in local memory, these algorithms run into trouble as the data size increases.
Those in quadrant II are distributed algorithms in which data is centralized
in a distributed file server. These algorithms scale quite well as the data grows.
Nevertheless, centralized data can be a significant issue: as the data is stored on a
server separate from the computing nodes, it must be transmitted to the computing
nodes for each calculation. Communication overhead is therefore one of the
most significant disadvantages. SALS (Shin & Kang 2014) in quadrant I overcame
this massive communication overhead by caching data on local disks, thus distributing
the data to the computing nodes. This data distribution can still be improved
further with in-memory caching.
2.6 Deep learning based recommendation systems
Recent approaches take advantage of deep learning to capture and learn similarities
and latent relationships between users and items. Several deep networks
have been introduced for collaborative filtering (Karatzoglou & Hidasi 2017, Wang,
Wang & Yeung 2015, He et al. 2017). For cross-domain recommendation
systems, Elkahky et al. (2015) proposed a multi-view deep learning
approach (DSSM) in which the users and items of each dataset are fed through two
neural networks, which map them into semantic vectors. In this
approach, the relationships between users and items are defined as the cosine similarity
of their corresponding semantic vectors. In addition, the common dimension
among datasets shares the same network. In the example of Figure 2.8, where X(1)
and X(2) have the same users, the users of both datasets are fed to the same network to learn
its parameters.
Figure 2.8 : Multi-view deep neural network for cross-domain recommendation with two datasets that have the same users. In this case, the users of both datasets share the features of the left-most network.
All the proposed deep learning based methods assume that datasets across
domains have identical knowledge of the common dimension. In fact, cross-domain
datasets often have different characteristics: they may share some
common properties while also possessing their own unique ones. Methods that lack
the ability to capture this fact cannot achieve the best recommendation
accuracy.
2.7 Research gaps
This chapter reviews the existing algorithms in relation to the three research
questions of this thesis. Firstly, how to share the actual explicit similarities between
cross-domain datasets to understand their relationships. Secondly, how to exploit
the implicit similarities in non-coupled dimensions across domains to improve rec-
ommendation accuracy further. Lastly, how to scale up the factorization process to
different numbers of tensors coupled in one or more modes, tensor modes, tensor
dimensions and billions of observations. We propose several algorithms in this thesis
to fill the research gaps described below.
The current methods of utilizing explicit similarities across domains fail to share
the actual correlation in the coupled dimension. For the joint analysis of cross-domain
datasets or transfer learning between them, the existing models assume coupled
datasets to share a common coupled factor. Admittedly, the coupled factors are
highly correlated with each other, yet they are unequal in many real-world data and
applications. Thus, forcing them to be the same would reduce the recommendation
accuracy. However, this performance reduction is not the only issue. By using an
identical coupled factor for cross-domain datasets, the final result optimizes either
of them, not both. It may approximate X well and lose Y’s decomposition accuracy,
or vice versa.
Furthermore, none of the existing algorithms takes into account the implicit similarities
that exist in many applications. Cross-domain datasets not only have explicit
similarities in the coupled dimension, but they also share implicit ones in the non-
coupled dimension. Different approaches have been proposed to perform a joint
analysis of coupled datasets (Pan 2016). However, all of the existing algorithms
use explicit similarities as a bridge to collaborate among datasets. Although these
explicit similarities showed their effectiveness in improving recommendation, there
are still rich implicit features that were not used but have great potential to further
improve the recommendation.
Last but not least, the current distributed factorization methods still incur high
communication and computation costs. MapReduce-based algorithms were introduced
to perform the factorization process in parallel on many computing nodes.
Even though distributed computing allows factors to be updated faster, MapReduce-
based models require data to be transferred from the isolated distributed file system
to the computing node when it needs to process this data. The iterative nature
of tensor factorization requires data and factors to be distributed over and over
again, incurring huge communication overhead. A scalable factorization model with
minimal overhead will scale up well as the data grows.
Motivated by the above literature gaps, this thesis investigates the three research
questions and proposes a scalable factorization model to share implicit and explicit
similarities across domains for higher recommendation accuracy.
Chapter 3
Explicit Similarity Discovery
This chapter encapsulates contribution #1 of this thesis. It is an extended descrip-
tion of the following publication:
Quan Do and Wei Liu, “ASTen: an Accurate and Scalable Approach to Coupled
Tensor Factorization,” in Proceedings of the 2016 International Joint Conference on
Neural Networks (IJCNN), pp. 99-106, Jul. 24-29, 2016.
3.1 Introduction
This chapter addresses the problem of sharing the actual explicit similarities
in the coupled mode of coupled datasets. Conventional methods such as CMTF
(Acar, Kolda & Dunlavy 2011) assume a coupled matrix and tensor have identical
factors on their coupled modes, which is detrimental to the overall accuracy of the
factorization. There are two problems with this assumption. The first issue is that
they assume the coupled mode of the tensor shares a common low-rank subspace
with the coupled mode of the matrix. A basis for this low-rank subspace is expressed
by the same latent coupled factor shared between them. Admittedly, their coupled
factors (in their coupled dimensions) are highly correlated, yet they are unequal in
many real-world data and applications. Due to this fact, we hypothesize that forcing
them to be the same reduces the accuracy of this factorization. More importantly,
by using an identical coupled factor for both datasets, the best final result optimizes
only one of them, not both. It may approximate the tensor well and lose the accuracy
of the matrix decomposition, or vice versa. This problem is especially critical when
the aim is to find the best approximation of both the tensor and matrix which are
correlated to one another.
The second problem could come from an assumption that the error of approx-
imating the tensor and that of decomposing the matrix contribute equally to the
final loss of the model. Clearly, this is not typically the case, as the size of the tensor
and that of the matrix often differ significantly. The loss of factorizing the larger-sized
one usually outweighs that of decomposing the smaller one. This loss, hence,
reduces the precision of predicting the missing entries in the smaller-sized tensor. In
other words, the traditional loss function does not optimize both the matrix and the
tensor simultaneously. It sacrifices the accuracy of decomposing the smaller-sized
one to better approximate the larger tensor.
This chapter proposes an algorithm to solve these two weaknesses. Unlike ex-
isting algorithms with the traditional objective function which forces the coupled
modes among datasets to share identical coupled factors, the proposed model de-
fines a new objective function which can be optimized with respect to every single
tensor and matrix. This new function enables each dataset to have its own dis-
criminative factor on the coupled mode and regularizes the coupled factors to be
as close as possible to share their explicit similarities. Consequently, optimizing
the proposed objective function produces the lowest decomposition error rate. In
other words, the proposed method is capable of accurately sharing the explicit
similarities, achieving the accurate approximation of every matrix and tensor of the
coupled datasets. Also, this chapter provides a theoretical proof and experimental
evidence that the proposed algorithm converges to an optimum.
The chapter describes ASTen in Section 3.2 by introducing a new objective
function to better share the explicit similarities in the coupled factors. Section
3.3 explains a method to optimize the function. A theoretical proof of the model’s
convergence is also presented. Moreover, Section 3.4 reveals some empirical evidence
to demonstrate the convergence of the method. Finally, the key contributions of the
proposed idea to utilize the explicit similarities across domains are summarized in
Section 3.5.
3.2 ASTen: the proposed Accurate Coupled Tensor Factor-
ization model
This section explains the proposed ASTen, a gradient descent (GD) based al-
gorithm, to more accurately solve CMTF. Although it focuses on discussing mode-3
tensors, all discussions and the algorithm generalize well to higher mode ones. Sup-
pose X ∈ Rn×m×p is a mode-3 tensor. It is obvious that X has at most three coupled
matrices or tensors. Without loss of generality, the case where X and a matrix
Y ∈ R^{n×q} are coupled in their first mode is discussed here; the second and
third additional matrices or tensors can be handled in the same way.
Unlike traditional CMTF algorithms, which use Equation (2.4) as their loss function,
ASTen decomposes X and Y into U, V, W, A and B, where U, V and W are
factors of X and A and B are factors of Y. ASTen expresses the relationship
between U and A, which are jointly correlated, through the new third term in the
proposed loss function:
$$\begin{aligned}
L(\mathbf{U},\mathbf{V},\mathbf{W},\mathbf{A},\mathbf{B})
&= \frac{\left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|_F^2}{\Omega_X}
 + \frac{\left\| \mathbf{A}\mathbf{B}^T - \mathbf{Y} \right\|_F^2}{\Omega_Y}
 + \frac{\left\| \mathbf{U} - \mathbf{A} \right\|_F^2}{\Omega_U} \\
&= \frac{\sum_{i,j,k}^{n,m,p}\left(\mathbf{X}_{i,j,k} - \sum_{l=1}^{r}\mathbf{U}_{i,l}\mathbf{V}_{j,l}\mathbf{W}_{k,l}\right)^2}{\Omega_X}
 + \frac{\sum_{i,j}^{n,q}\left(\mathbf{Y}_{i,j} - \sum_{l=1}^{r}\mathbf{A}_{i,l}\mathbf{B}_{j,l}\right)^2}{\Omega_Y}
 + \frac{\sum_{i}^{n}\sum_{l=1}^{r}\left(\mathbf{U}_{i,l} - \mathbf{A}_{i,l}\right)^2}{\Omega_U}
\end{aligned} \qquad (3.1)$$
where ΩX, ΩY and ΩU denote the size of X, Y and the coupled factor U (A has
the same size as U), respectively. In case of sparse tensors, they are the number of
observed elements.
Equation (3.1) overcomes the two weaknesses of the conventional loss function
(2.4) proposed in the literature. First, Equation (3.1) optimizes X with respect
to U, V, W and the correlated factor A in Y. At the same time, it minimizes Y
approximation error with the optimal A, B and the coupled factor U in X. On
top of that, Equation (3.1) also captures the size difference of X and Y where the
error of X and that of Y are divided by their sizes, respectively. This normalization
ensures the size difference does not have any influence on the distribution of the loss
of X and that of Y to the total decomposition error. As a result, ASTen optimizes
both X and Y without sacrificing the accuracy of either one of the two.
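A minimal sketch of evaluating the loss in Eq. (3.1) on dense toy data follows; for sparse inputs, each sum would run over the observed entries only, with each Ω being the number of observations.

```python
# The normalized ASTen loss of Eq. (3.1) on small dense placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q, r = 4, 5, 6, 3, 2
X, Y = rng.random((n, m, p)), rng.random((n, q))
U, V, W = rng.random((n, r)), rng.random((m, r)), rng.random((p, r))
A, B = rng.random((n, r)), rng.random((q, r))

X_hat = np.einsum('if,jf,kf->ijk', U, V, W)
loss = (np.sum((X_hat - X) ** 2) / X.size      # tensor term, scaled by its size
        + np.sum((A @ B.T - Y) ** 2) / Y.size  # matrix term, scaled by its size
        + np.sum((U - A) ** 2) / U.size)       # ties U to A without equating them
print(f"L = {loss:.4f}")
```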
3.3 Optimization
ASTen updates U (the same for V, W, A and B) following this stochastic
difference equation:
$$\mathbf{U}^{t+1} = \mathbf{U}^{t} - \eta^{t}\,\frac{\partial L}{\partial \mathbf{U}^{t}} \qquad (3.2)$$

where ∂L/∂U^t is the partial derivative of the loss function L(U,V,W,A,B) with respect
to U^t, η^t is a decreasing step size, and t + 1 denotes the current iteration while t denotes
the previous one.
ASTen updates one entry of a factor matrix at a time, for example, updating U
at row i and column f . Thus, the partial derivative with respect to each ui,f needs
to be computed and can be derived from (3.1):
$$\frac{\partial L}{\partial u^{t}_{i,f}} = -\frac{2}{\Omega_X}\sum_{j,k}^{m,p}\left[\left(x_{i,j,k} - \sum_{l=1}^{f-1} u^{t+1}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l} - \sum_{l=f}^{r} u^{t}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l}\right)v^{t}_{j,f}\,w^{t}_{k,f}\right] + \frac{2}{\Omega_U}\left(u^{t}_{i,f} - a^{t}_{i,f}\right) \qquad (3.3)$$
Equation (3.3) shows that an update for u_{i,f} at a row i and a particular column
f ∈ [1, r] (where r is the rank) depends on the whole slice X_{i,∗,∗} of X, on V and W
from the previous iteration, and on its paired factor a_{i,f}. It is also worth mentioning
that ASTen updates factor matrices column by column, i.e., it updates the first
column of U, then the second column, and so on. Doing so enables ASTen to use
the newly updated values of the first column to update the second column, and the
updated first and second columns to continue updating the third column. This process
is captured in the term u^{t+1}_{i,l} in the first part of (3.3).
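As a concrete, hypothetical rendering of Eq. (3.3), the sketch below computes the gradient for a single entry u_{i,f} on dense arrays; grad_u, its argument names, and the dense slicing are illustrative assumptions (ASTen itself operates on sparse observed entries), with U_new holding the columns already updated in the current sweep.

```python
# Entry-wise gradient of Eq. (3.3) for one u[i, f], dense-data sketch.
# U_new holds columns 0..f-1 already updated this sweep; U_old the rest.
import numpy as np

def grad_u(X, U_new, U_old, V, W, A, i, f, omega_X, omega_U):
    # residual over the slice X[i, :, :], mixing new (l < f) and old (l >= f) columns
    recon = (np.einsum('l,jl,kl->jk', U_new[i, :f], V[:, :f], W[:, :f])
             + np.einsum('l,jl,kl->jk', U_old[i, f:], V[:, f:], W[:, f:]))
    resid = X[i] - recon
    g = -2.0 / omega_X * np.sum(resid * np.outer(V[:, f], W[:, f]))
    return g + 2.0 / omega_U * (U_old[i, f] - A[i, f])
```

Sweeping such updates over i and f for each factor, with analogous helpers for the other factors, yields the column-by-column loops of Algorithm 1 below.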
In a similar way, the partial derivative with respect to the non-coupled factors,
such as V, is given by

$$\frac{\partial L}{\partial v^{t}_{j,f}} = -\frac{2}{\Omega_X}\sum_{i,k}^{n,p}\left[\left(x_{i,j,k} - \sum_{l=1}^{f-1} u^{t}_{i,l}\,v^{t+1}_{j,l}\,w^{t}_{k,l} - \sum_{l=f}^{r} u^{t}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l}\right)u^{t}_{i,f}\,w^{t}_{k,f}\right] \qquad (3.4)$$

Note that the partial derivative with respect to the coupled factor a_{i,f} is similar to
(3.3), and the partial derivatives with respect to the non-coupled factors w_{k,f} and b_{j,f}
can be derived in the same way as (3.4). This optimization is shown in Algorithm 1.
Algorithm 1: ASTen updates the factors U, V, W, A and B using equations (3.2), (3.3) and (3.4). The factors that best approximate X and Y are the ones we want to produce.

Input : X, Y, E
Output: U, V, W, A, B

Randomly initialize U, V, W, A, B
Initialize L with a small number
repeat
    PreL = L
    for f = 1 to r do
        for i = 1 to n do: u^{t+1}_{i,f} ← u^t_{i,f} − η^t ∂L/∂u^t_{i,f}
        for j = 1 to m do: v^{t+1}_{j,f} ← v^t_{j,f} − η^t ∂L/∂v^t_{j,f}
        for k = 1 to p do: w^{t+1}_{k,f} ← w^t_{k,f} − η^t ∂L/∂w^t_{k,f}
        for i = 1 to n do: a^{t+1}_{i,f} ← a^t_{i,f} − η^t ∂L/∂a^t_{i,f}
        for j = 1 to q do: b^{t+1}_{j,f} ← b^t_{j,f} − η^t ∂L/∂b^t_{j,f}
    end for
    Compute L following equation (3.1)
until (PreL − L)/PreL < E
Theorem 1. The proposed ASTen as presented in Algorithm 1 converges.
Proof. The second partial derivative with respect to U is derived by
$$\frac{\partial^2 L}{\partial \left(u^{t}_{i,f}\right)^2} = \frac{2}{\Omega_X}\sum_{j,k}^{m,p}\left(v^{t}_{j,f}\,w^{t}_{k,f}\right)^2 + \frac{2}{\Omega_U} > 0$$
As the second partial derivative of L with respect to U is larger than 0, L is
a convex function with respect to U. Following the same logic, L is also convex
with respect to V, W, A and B. It is known that GD with a suitable step size converges
to the optimum of a convex function (Boyd & Vandenberghe 2004). As a consequence, ASTen
with the proposed objective function defined in (3.1) converges.
3.4 Performance Evaluation
The performance of the new loss function (3.1) proposed in ASTen is com-
pared with the traditional loss function (2.4) (i.e., ASTen with the traditional loss
function) in terms of factorization accuracy. The goal of conducting a series of
experiments is to assess:
1. how accurate the factorization of ASTen is when each tensor is allowed to
have its own discriminative factors and
2. how fast ASTen can achieve this precision with the proposed parallel model.

To better validate ASTen, several synthetic datasets are generated in which the
relationship between the coupled factors is controlled. In addition, while conducting the
experiments, empirical evidence (on top of the above theoretical proof) is presented to
illustrate the correctness of the proposed objective function. All the experiments are
executed on a cluster of 2× 2.8GHz Intel Xeon CPUs, each with 10 cores and 256GB
DDR3 RAM.
3.4.1 Data used in our experiments
In the experiments, a synthetic dataset and two real-world datasets are used to
validate the proposed algorithm. This section explains how the data is generated and
processed.
Table 3.1 : Ground truth distributions of the factor matrices in the synthetic data.

Factor | Test case #1     | Test case #2
U      | normal(0.5, 0.5) | normal(0.5, 0.5)
V      | random(0, 1)     | random(0, 1)
W      | random(0, 1)     | random(0, 1)
A      | normal(1.5, 0.5) | U * random(0, 2) + random(0, 1)
B      | random(0, 1)     | random(0, 1)
1. Synthetic data To generate the coupled tensor and matrix, factors U, V, W
and B of a pre-specified dimension and a predefined rank r are randomly generated.
Specifically, V, W and B are randomly generated from 0 to 1, and U is drawn from
a normal distribution with both mean and standard deviation equal to 0.5. The remaining
factor, A, which is coupled with U, is generated to have a correlation with U. A
summary of how the data is synthesized is given in Table 3.1. Two different
approaches to generating the ground truth A are tested:

test case #1) A is drawn from a normal distribution (mean = 1.5, standard deviation = 0.5),
and

test case #2) A equals U multiplied by a random coefficient plus some noise.

By doing so, the synthetic data has characteristics similar to those of real-world
data, where the coupled factors have a relationship yet are unequal. Unique data
points (i, j, k) are then randomly selected and X_{i,j,k} is computed as $\sum_{f=1}^{r} u_{i,f}\,v_{j,f}\,w_{k,f}$.
Y is computed the same way from A and B. X and Y are finally normalized
to [0, 1].
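A minimal sketch of generating the test case #2 factors from Table 3.1 follows; for simplicity it computes X and Y densely, whereas the experiments sample a subset of unique data points as the observed entries.

```python
# Synthetic coupled data, test case #2 of Table 3.1: A is a noisy transform of U.
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q, r = 100, 100, 100, 100, 5
U = rng.normal(0.5, 0.5, (n, r))                        # normal(0.5, 0.5)
V, W, B = (rng.random((d, r)) for d in (m, p, q))       # random(0, 1)
A = U * rng.uniform(0, 2, (n, r)) + rng.random((n, r))  # correlated with U, yet unequal

X = np.einsum('if,jf,kf->ijk', U, V, W)                 # coupled tensor
Y = A @ B.T                                             # coupled matrix
X = (X - X.min()) / (X.max() - X.min())                 # normalize to [0, 1]
Y = (Y - Y.min()) / (Y.max() - Y.min())
print(X.shape, Y.shape)  # (100, 100, 100) (100, 100)
```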
2. MovieLens data The MovieLens dataset (Harper & Konstan 2015) includes
ratings from 943 users for 1,682 movies. It is compiled into a tensor X of (users, movies,
weekdays) whose entries are ratings, a matrix Y of (users, user profiles) and a matrix Z
of (movies, genres). Matrix Y has size 943 by 83, where a user is specified
by gender (0 or 1), is grouped into one of 61 age groups, and has one of 21
occupations. Matrix Z categorizes the 1,682 movies into 19 different genres; one movie
belongs to one or more genres. Finally, X is a tensor of 943 by 1,682 by 7. The values
of X's entries are 0.2, 0.4, 0.6, 0.8 and 1, which are equivalent to 1-5 star ratings. In
this chapter, the experiments tested all the algorithms with 80,000 known ratings,
together with Y and Z of 2,159 and 2,893 observed entries, respectively.
3. Yahoo! Music data The Yahoo! Music dataset∗ represents a snapshot of user
preferences for various songs. For this experiment, a tensor X of song information
and a matrix Y of user ratings are used. X categorizes 136,736 songs by 20,543
artists and 9,442 genres; each song is performed by one artist and belongs to one
genre. Entries of Y capture the ratings of 23,179 users for 136,736 songs, and their values
are 0.2, 0.4, 0.6, 0.8 and 1 which are equivalent to 1-5 ratings. There are 8,846,899
observed ratings in total, along with 136,736 observed nonzeros of X.
3.4.2 Performance metric
The input coupled tensors are decomposed into their low-rank factors with different
algorithms, and their decomposition accuracies are compared. In other words,
this experiment evaluates how well these factors approximate the original coupled
tensors. For instance, suppose U, V, W, A and B are the factors of X and Y, respectively,
as described in test case #1 above. How well the extracted factors
approximate X and Y is quantified with the mean squared error (MSE), defined by:
∗Yahoo! Music R1 dataset: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
Figure 3.1 : Mean squared errors of a) test case #1 and b) test case #2 with synthetic data. In both cases, X is a mode-3 tensor of size 100 × 100 × 100 and Y is a matrix of size 100 × 100. There are 80,000 known data points of X and 10,000 known elements of Y. ASTen with the new loss function reduces MSE by 60% (in test case #1) and 74% (in test case #2) compared with other algorithms using the traditional loss function.
$$\mathrm{MSE} = \frac{\left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|^2}{\Omega_X} + \frac{\left\| \mathbf{Y} - \mathbf{A}\mathbf{B}^T \right\|^2}{\Omega_Y}$$

where the first term is the MSE of the X approximation and the second term is that of Y.
In the case of factorizing an individual tensor X or matrix Y, only the first or the
second term is used to calculate the MSE.
Figure 3.2 : Mean squared error of factorizing the MovieLens dataset, where X is a mode-3 tensor of size 943 × 1,682 × 7, Y is a matrix of size 943 × 83, and Z is a matrix of size 1,682 × 19. We can observe that ASTen with the new loss function reduces the MSE of each tensor. In contrast, other algorithms with the conventional loss function sacrifice the accuracy of Y and Z to optimize X.
Similarly, when more than two tensors are coupled, just as for the MovieLens dataset,
the MSEs of all the corresponding tensors are added.
3.4.3 Results
The results of the experiments consistently show that ASTen enhances the
accuracy of coupled matrix tensor factorization. They also provide evidence that
the proposed new loss function converges.

Figure 3.1 and Figure 3.2 reveal that ASTen with the new loss function
significantly outperforms the algorithm with the traditional loss function defined in
equation (2.4). In particular, in Figure 3.1a, ASTen with the new loss function not
only reduces the approximation error of X by 76%, but also reduces the Y
decomposition loss by about 92%. In comparison, the algorithm with the traditional
loss function gives up Y decomposition accuracy to improve the X factorization by just 16%.
The same phenomenon is also observed in test case #2, as shown in Figure 3.1b.
Again, ASTen with the new loss function improves both X and Y by 74% and
79% respectively, while the traditional loss function can only improve the factorization
of X by 13%, with no improvement in the Y decomposition.
Figure 3.3 : Mean squared error of factorizing the Yahoo! Music dataset, where X is a mode-3 tensor of size 136,736 × 20,543 × 9,442 and Y is a matrix of size 136,736 × 23,179. Again, we can see that ASTen with the new loss function optimizes the MSE of every tensor. On the contrary, other algorithms with the conventional loss function give up the accuracy of X to better reduce the MSE of Y.
The performance of ASTen with the new loss function is consistent when it is
applied to both the MovieLens and the Yahoo! Music datasets. Figure 3.2 shows
that ASTen outperforms the algorithms with the traditional loss function defined in
equation (2.4), reducing CMTF's overall mean squared error by 35%. In particular,
while achieving a better approximation of X than the algorithms with the traditional loss
function, ASTen also reduces the decomposition loss of both Y and Z by 62% and 35%,
respectively. In contrast, the algorithms with the traditional loss function inflate the
error of the Y decomposition 20-fold and that of the Z factorization 36-fold, yet still
achieve a relatively less accurate approximation of X.
The result of the experiment on the Yahoo! Music dataset, as presented in Figure
3.3, illustrates that the size differences of the input datasets have a strong influence
on the accuracy of CMTF using the traditional loss function. CMTF with the
traditional loss function optimizes Y, which has more nonzero entries, but does not
decompose X equally well. On the contrary, the proposed loss function in ASTen
helps to further improve the accuracy of both X and Y under the same stopping
condition as CMTF with the traditional function.
3.5 Contribution and Summary
This chapter addresses contribution #1 of this thesis by proposing a novel objec-
tive function to share the actual explicit similarities across datasets and an algorithm
to optimize the low-rank factors. The new objective function is designed to optimize
every single tensor and matrix of the coupled datasets. Differing from the existing
algorithms with a traditional objective function which forces coupled modes be-
tween the coupled matrix and tensor to have identical factors, ASTen enables each
of them to have its discriminative factor on the coupled mode. Due to the different
nature of coupled factors across datasets in real-world applications, the new loss
function enables ASTen to capture and share the actual explicit similarities in the
coupled factors. Thus, it is capable of finding the accurate approximation of every
tensor. As a result, it achieves up to 75% error reduction for coupled matrix tensor
factorization.
Furthermore, this chapter provides a theoretical proof and conducts extensive
experiments to show that the proposed algorithm converges to an optimum. Exper-
iments on both real and synthetic datasets demonstrate that the proposed ASTen
outperforms the existing algorithms in utilizing explicit similarities to improve rec-
ommendation accuracy.
Chapter 4
Implicit Similarity Discovery
This chapter encapsulates contributions #2 and #3 of this thesis. It is an extended
description of the following publications:
Quan Do, Wei Liu, Fan Jin and Dacheng Tao, “Unveiling Hidden Implicit Simi-
larities for Cross-Domain Recommendation,” IEEE Transactions on Knowledge and
Data Engineering (TKDE) (Under review).
Quan Do, Wei Liu and Fang Chen, “Discovering both Explicit and Implicit
Similarities for Cross-Domain Recommendation,” in Proceedings of the 2017 Pacific-
Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 618-630,
May 23-26, 2017.
4.1 Introduction
As discussed in Chapter 3, coupled datasets across domains possess explicit sim-
ilarities. In the case of the rating matrices on the Netflix and MovieLens websites,
they both contain user preferences for the same set of items. Thus, explicit fea-
tures in the identical item dimension are conventionally used to undertake coupled
learning between datasets (Acar, Kolda & Dunlavy 2011, Shin et al. 2017, Singh
& Gordon 2008). In addition to these explicit similarities, cross-domain datasets
are likely to have other implicit correlations in their remaining dimensions. This
intuition comes from consumer segmentation which is a popular concept in market-
ing. Consumers can be grouped by their behaviors, for example, “tech leaders” is
a group of consumers with a strong desire to own new smartphones with the latest
technologies and there is another user group which only changes to a new phone
when the current one is out of order. Although users on the Netflix and MovieLens
websites are different, users with similar interests may share similar behaviors in
relation to the same set of items. This type of correlation implicitly exists in many
other scenarios in the real world, for example, suburbs with a large number of high-
income families may be correlated with a lower crime rate as shown in Figure 1.1,
or users with an interest in action books may also like related movie genres (e.g.,
action films), or even though users on Amazon and Walmart are different, those
sharing similar interests may share similar behaviors in relation to the same set of
products. Thus, using implicit similarities correctly in addition to explicit ones has
a strong potential to assist in understanding more deeply the relationship between
user preferences and item characteristics. This understanding can be used to decide
appropriate recommendation strategies for each particular user.
Different approaches have been proposed to perform a joint analysis of multiple
datasets (Pan 2016). Some researchers proposed joint factorization (Singh & Gordon
2008, Acar, Kolda & Dunlavy 2011) while others suggested transfer learning (Li et al.
2009a, Gao et al. 2013, Pan et al. 2010). However, all of the existing algorithms use
explicit similarities as a bridge to collaborate among datasets. The popular collective
matrix factorization (CMF) (Singh & Gordon 2008) jointly analyzes datasets by
assuming them to have an identical low-rank factor in their coupled dimension. In
this case, the shared identical factor captures the explicit similarities across domains.
Li et al. (2009a) suggest correlated datasets share explicit hidden rating patterns.
The similarities between rating patterns are then transferred from one to another
dataset. Gao et al. (2013) extend Li et al. (2009a)’s idea to include unique patterns
in each dataset. Pan et al. (2010) regularize factors of the target user-by-item
matrix with those, called principal coordinates, from the user profile and the item
information matrices. Although these explicit similarities showed their effectiveness
in improving recommendation, there are still rich implicit features that were not
used but have great potential to further improve the recommendation.
Motivated by this literature gap, this chapter proposes an algorithm to discover
the implicit similarities in non-coupled dimensions of the cross-domain datasets.
The fact that non-coupled dimensions in the aforementioned example of the Movie-
Lens and the Netflix coupled datasets contain non-overlapping users prevents direct
knowledge sharing in their non-coupled factors. However, their latent behaviors are
correlated and should be shared. These hidden behaviors can be captured in low-
rank factors by matrix tri-factorization. As factorization is equivalent to spectral
clustering (Ding et al. 2006, Huang et al. 2008, Papalexakis et al. 2013), different
users with similar preferences are grouped in non-coupled user factors. Developed
on this concept, the proposed algorithm hypothesizes that latent clusters in these
non-coupled factors may have a close relationship. Therefore, it aligns correlated
clusters in non-coupled factors to be as close as possible. This idea matches the
fundamental concept of CF in the sense that similar user groups who rate similarly
will continue to do so.
In addition, this chapter presents the first algorithm to utilize both explicit and
implicit similarities across datasets. Both of them provide a thorough understand-
ing of the underlying relationship between user preferences and item characteristics.
This understanding helps to recommend appropriate items to each user, enhancing
the performance of cross-domain recommendation. Validated on real-world datasets,
the proposed approach outperforms the existing algorithms by more than two times
in terms of recommendation accuracy. These results show the significance of explicit
and implicit similarities in improving the performance of cross-domain recommen-
dation.
The chapter describes how implicit similarities are discovered in Section 4.2.2, along with the challenge of aligning them across non-coupled factors; the proposed method to align them is also introduced in Section 4.2.2. Furthermore, Section 4.2.1 discusses explicit similarities, and a method to utilize both explicit and implicit similarities is detailed in Section 4.2.3. Section 4.3 presents a way to extend the proposed idea to handle more than two coupled datasets. An extensive evaluation with real-world datasets is discussed in Section 4.4. Finally, Section 4.5 summarizes the key contributions of the proposed idea to discover implicit similarities and to use them to achieve more accurate cross-domain recommendations.
4.2 HISF: the proposed Hidden Implicit Similarities Factorization Model
This section introduces how explicit and implicit similarities are exploited among
different datasets. It first explains a case where X(1) is a rating matrix of n users by
m items and X(2) is another one of p users by m items. They are coupled in their
second dimension. An extension to more matrices will then be discussed in the next section. Compared to existing models, HISF makes two principal improvements.
Firstly, it discovers and aligns similar user clusters in the non-coupled user factors
across domains. Secondly, it shares common parts and preserves domain-specific
parts of the coupled latent variables (item factors).
4.2.1 Sharing common and preserving domain-specific coupled latent variables to utilize explicit similarities
Even though X(1) and X(2) contain the same m items, it is unreasonable to force
them to share the same low-rank item factors (i.e., V(1) equals V(2)). The fact that
both matrices capture ratings of the same items indicates that they are strongly
correlated. Nevertheless, they also have their own unique features characterized
by their domain. For this reason, both common and unique parts are included in
item factors to better capture correlations of the coupled dimension among different
datasets. Thus, the coupled loss function is proposed to be:
$$\mathcal{L} = \left\| X^{(1)} - U^{(1)} S^{(1)} \begin{bmatrix} V^{(0)} \\ V^{(1)} \end{bmatrix}^T \right\|^2 + \left\| X^{(2)} - U^{(2)} S^{(2)} \begin{bmatrix} V^{(0)} \\ V^{(2)} \end{bmatrix}^T \right\|^2 + \lambda\theta$$
where $V^{(0)} \in \mathbb{R}^{m \times c}$ is the common part, $V^{(1)}$ and $V^{(2)} \in \mathbb{R}^{m \times (r-c)}$ are domain-specific parts, $r$ is the rank of the decomposition, $c$ is the number of common rows such that $1 \le c \le r$, and $\theta$ is the squared L2 regularization term. These common $V^{(0)}$ and domain-specific $V^{(1)}$, $V^{(2)}$ parts in the coupled factors are illustrated in Figure 4.1.
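To make this formulation concrete, below is a minimal NumPy sketch of the coupled loss with a shared part V0 and domain-specific parts V1 and V2; the NaN masking of unobserved entries and the choice of θ as the sum of squared Frobenius norms of all factors are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def coupled_loss(X1, X2, U1, S1, U2, S2, V0, V1, V2, lam):
    """Coupled loss with common part V0 and domain-specific parts V1, V2.

    X1 (n x m) and X2 (p x m) are rating matrices; np.nan marks
    unobserved entries (an assumption of this sketch).
    U1 (n x r), U2 (p x r) are user factors; S1, S2 (r x r) are weights.
    V0 (m x c) is the common item part; V1, V2 (m x (r - c)) are specific.
    """
    Va = np.hstack([V0, V1])            # item factor of domain 1 (m x r)
    Vb = np.hstack([V0, V2])            # item factor of domain 2 (m x r)
    R1 = X1 - U1 @ S1 @ Va.T            # residuals of domain 1
    R2 = X2 - U2 @ S2 @ Vb.T            # residuals of domain 2
    sse = np.nansum(R1 ** 2) + np.nansum(R2 ** 2)   # observed entries only
    theta = sum(np.sum(F ** 2) for F in (U1, S1, U2, S2, V0, V1, V2))
    return sse + lam * theta
```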
4.2.2 Aligning implicit similarities in non-coupled latent clusters across
domains
Besides the explicit similarities in common parts of coupled factors (as described
above), the non-coupled dimension of datasets from different sources also possess
a strong relationship. Even though users in X(1) and X(2) are different users, they
may have similar behaviors which can be grouped by latent factors. For instance,
different users with similar preferences on sci-fi movies can be grouped as sci-fi
fans; sci-fi fans have similar behaviors across domains. This intuition inspires us to
Figure 4.1: The proposed factorization model to discover and share implicit similarities across two rating matrices X(1) and X(2). They are coupled in their item dimensions. The proposed algorithm decomposes X(1) into U(1), S(1), the common V(0) and its specific V(1). At the same time, X(2) is factorized into U(2), S(2), the common V(0) and its specific V(2). Note that the proposed model matches clusters of users with similar interests in non-coupled factors U(1) and U(2) by aligning correlated clusters (captured in their columns) as close as possible.
hypothesize that non-coupled user factors share closely related latent user clusters.
Thus, HISF first discovers cluster matchings between user clusters in U(1) and those
in U(2). Based on these matchings, centroids of their correlated clusters are then
regularized to be as close as possible. Figure 4.1 demonstrates a case of user cluster alignment between U(1) and U(2).
In the following subsections, a few examples are used to explain how challenging
it is to match user clusters across domains and how clustering can be achieved with
matrix tri-factorization. Then the proposed solution is introduced, which first finds related user clusters across domains and then aligns them together.
Factorization as a clustering method
Ding et al. (2006) proved that matrix tri-factorization is equivalent to spectral clustering such that U and V contain clusters of users and those of items, respectively. S captures weights between user clusters and item clusters.

Figure 4.2: Matrix factorization as a clustering method. Suppose X(1) contains ratings of n users for m items and there are two user groups represented by their rating similarities (brown circles and green triangles). When X(1) is factorized with matrix tri-factorization, these two user groups are captured in columns of the user factor. Two possible cases can happen: a) users with brown circles are in the first column and those with green triangles are in the second column of U(1); or b) users with brown circles are in the second column and those with green triangles are in the first column of U′(1).
An example is now created where X(1) contains ratings of two groups of users
with similar preferences as shown in Figure 4.2: one group giving brown circle
ratings and another one giving green triangle ratings. When X(1) is factorized,
these two user groups are captured in columns of the user factor, i.e., one column
of the user factor contains users with brown circles and another one captures users
with green triangles. This example illustrates that factorization can cluster users
based on their preferences. However, it does not provide any information on whether
the first column is the group of users with brown circles or green triangles as both
$U^{(1)} \times S^{(1)} \times V^{(1)T}$ and $U'^{(1)} \times S'^{(1)} \times V'^{(1)T}$ are equally good solutions.
The weighting factor S(1) or S′(1) shows how much a user cluster interacts with
the item clusters. For example, the first row of S(1) in Figure 4.2 indicates that
users with brown circles strongly interact with the first item cluster and have no
interaction with the second one.
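This ambiguity is easy to reproduce. The short NumPy check below (with arbitrary random factors) shows that swapping the columns of U, together with the matching permutation of S, reconstructs exactly the same matrix, so both orderings are equally good solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((6, 2))                    # user factor with two clusters
S = rng.random((2, 2))                    # cluster-interaction weights
V = rng.random((4, 2))                    # item factor
X_hat = U @ S @ V.T                       # reconstruction

P = np.array([[0.0, 1.0], [1.0, 0.0]])    # permutation swapping two columns
U_perm = U @ P                            # column-permuted user factor
S_perm = P.T @ S                          # permute S's rows to compensate
print(np.allclose(X_hat, U_perm @ S_perm @ V.T))   # True
```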
Figure 4.3: Suppose X(2) is obtained from another domain where different users rate the same set of items as that of X(1). X(2) also has two user groups having similar preferences to those in X(1): circle and triangle preferences.
A challenge to align similar clusters across domains
Suppose there is another dataset from a different domain, X(2), with ratings from p different users on the same set of m items, as shown in Figure 4.3. By factorizing X(2), matrix tri-factorization can group users with blue circle preferences and those with yellow triangle preferences in the columns of user factor U(2) or U′(2).

Although the behaviors of users in X(1) and those in X(2) are highly correlated, finding a match between user clusters in X(1) and X(2) is not an obvious task for two reasons. Firstly, users in X(1) and X(2) are different. This means that there is no information about the users or their order in the rating matrices. Secondly, as the alternating least squares method (Hu et al. 2008) is often used to find the factors, the factors are initialized randomly. Therefore, a particular user cluster may never be fixed in a particular column of the user factor. In other words, there is no way to ensure whether users with brown circles are captured in the first or the second column of the user factor of X(1). In the same way, users with blue circles can be captured in the first or the second column of the user factor of X(2). Therefore, there are four possible cases for matching the user clusters of X(1) and X(2), as demonstrated in Figure 4.4. In the event the input datasets have r clusters, the situation is more complex as there will be $r^2$ possible cases.
Figure 4.4: Possible cases for matching user clusters of X(1) and X(2): user cluster alignment between X(1) in Figure 4.2 and X(2) in Figure 4.3. There are four possible cases: in a) and d), the user cluster in the first column of U(1) matches the user cluster in the first column of U(2), and the user cluster in the second column of U(1) matches the user cluster in the second column of U(2); in b) and c), the user cluster in the first column of U(1) matches the user cluster in the second column of U(2), and the user cluster in the second column of U(1) matches the user cluster in the first column of U(2). These cases show the challenge of determining how clusters in U(1) and U(2) are aligned at each iteration.
Matching correlated clusters across domains
To find matches of user clusters across domains, let's revisit the intuition about similar user groups, with the assumption that users have similar behavior on the common item set. When user clusters are extracted by matrix tri-factorization as discussed in Section 4.2.2, item factors V(1) and V(2) have common and different columns. These columns of V define new low-dimensional coordinates (a linear transformation) of the column-wise (item side) information of X. The common columns of V(1) and V(2) thus provide new coordinates of the common item information of X(1) and X(2); each of these new coordinates (each common column of V(1) and V(2)) defines a cluster of items (a linear transformation of the original item side of X) having the same characteristics (e.g., sci-fi movies or comedy movies). The rows of S(1) and S(2) are the values (weights) of the different user clusters (captured in the columns of U(1) and U(2), respectively) in these common new coordinates. If a user cluster in X(1) and another one in X(2) are geometrically close in the new coordinates, these user clusters have similar interests in items of the same characteristics. Hence, matching the S(1) and S(2) weights that are geometrically close in the new coordinates defined by V(1) and V(2) can discover matching user clusters.
This principle is used to find matches of user clusters across domains. A user cluster $U^{(1)}_{*,i}$ matches $U^{(2)}_{*,j}$ if $S^{(1)}_{i,*}$ is the closest to $S^{(2)}_{j,*}$. Thus, the Euclidean distances between the rows of S(1) and those of S(2) are compared, and the user clusters whose distances are the shortest are matched.
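A small NumPy sketch of this matching step is given below. The restriction to the first c (common) coordinates and the greedy pairing of the globally closest remaining rows are illustrative assumptions; the thesis only specifies that clusters with the shortest Euclidean distances between the corresponding rows of S(1) and S(2) are matched.

```python
import numpy as np

def match_clusters(S1, S2, c):
    """Match user clusters across domains by comparing rows of S1 and S2.

    The first c columns are taken as the coordinates defined by the common
    item part V0 (an assumption).  Returns match[i] = j, pairing cluster i
    of domain 1 with cluster j of domain 2.
    """
    diff = S1[:, None, :c] - S2[None, :, :c]
    dist = np.sqrt((diff ** 2).sum(axis=2))       # pairwise row distances
    match = {}
    while len(match) < min(S1.shape[0], S2.shape[0]):
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        match[i] = j
        dist[i, :] = np.inf                       # both clusters are now taken
        dist[:, j] = np.inf
    return match
```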
Aligning correlated clusters
Once matches between user clusters across domains are found, they are aligned to be as close as possible in the optimization process. This alignment enforces that similar users in the latent feature space rate new items similarly. Here, HISF is proposed to regularize the difference between the centroids of the closely related user clusters across U(1) and U(2).
$$\mathcal{L} = \left\| X^{(1)} - U^{(1)} S^{(1)} \begin{bmatrix} V^{(0)} \\ V^{(1)} \end{bmatrix}^T \right\|^2 + \left\| X^{(2)} - U^{(2)} S^{(2)} \begin{bmatrix} V^{(0)} \\ V^{(2)} \end{bmatrix}^T \right\|^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.1)$$
where $m^{(1)T}_l$ denotes the row vector of the $l$-th user cluster's centroid in U(1) and $m^{(2)T}_*$ denotes the row vector of the matched user cluster's centroid in U(2); an example of how $m^{(1)T}_l$ and $m^{(2)T}_*$ are computed is shown in Figure 4.5.
Figure 4.5: An illustration of how the centroid of a cluster is computed. U captures two user clusters: the first three users are in the first cluster and the last two are in the second one. x and y denote the first and the second column of U, respectively, so that $m_1^T = \left(\frac{u_1.x + u_2.x + u_3.x}{3}, \frac{u_1.y + u_2.y + u_3.y}{3}\right)$ and $m_2^T = \left(\frac{u_4.x + u_5.x}{2}, \frac{u_4.y + u_5.y}{2}\right)$.

When the user clusters across domains are matched as in Figure 4.4a and Figure 4.4d, then:
$$\sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 = \sqrt{\left(m^{(1)}_1.x - m^{(2)}_1.x\right)^2 + \left(m^{(1)}_1.y - m^{(2)}_1.y\right)^2} + \sqrt{\left(m^{(1)}_2.x - m^{(2)}_2.x\right)^2 + \left(m^{(1)}_2.y - m^{(2)}_2.y\right)^2}$$
However, if the user clusters across domains are matched as in Figure 4.4b and Figure 4.4c, the order of the columns in $m^{(2)T}_*$ has to be adjusted to match the order of the columns in $m^{(1)T}_l$ such that

$$\sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 = \sqrt{\left(m^{(1)}_1.x - m^{(2)}_2.y\right)^2 + \left(m^{(1)}_1.y - m^{(2)}_2.x\right)^2} + \sqrt{\left(m^{(1)}_2.x - m^{(2)}_1.y\right)^2 + \left(m^{(1)}_2.y - m^{(2)}_1.x\right)^2}$$
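The centroid computation and the alignment penalty can be sketched as follows; assigning each user to the cluster (column) with its largest factor value is an illustrative assumption, since the thesis does not pin down the assignment rule, and every cluster is assumed non-empty.

```python
import numpy as np

def centroids(U, r):
    """Centroid of each of the r user clusters captured in U's columns.
    A user is assigned to the column with its largest value (assumption)."""
    labels = U.argmax(axis=1)
    return np.vstack([U[labels == l].mean(axis=0) for l in range(r)])

def alignment_penalty(U1, U2, match, r):
    """Sum of distances between matched centroids, following the expanded
    form above; match comes from the cluster-matching step."""
    m1, m2 = centroids(U1, r), centroids(U2, r)
    return sum(np.linalg.norm(m1[l] - m2[match[l]]) for l in range(r))
```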
To elaborate how this alignment works, Algorithm 2 is run with synthetic data in Figure 4.6, where X(1) and X(2) each has six unique users who rate the same list of six movies (three sci-fi and three comedy movies). Some users in X(1) and X(2) like the sci-fi genre and provide high ratings for sci-fi movies; some others prefer the comedy category and rate comedy movies five stars. Thus, users in each domain can be grouped by their preferences: one cluster of sci-fi fans and another one of comedy fans. Furthermore, each group of users in X(1) shares its implicit preference with a corresponding group of users in X(2). As a result, when Algorithm 2 factorizes X(1) and X(2) with rank r = 2, each user cluster is captured in a column of the user factors. In addition, their user clusters are aligned so that those with similar preferences are close to each other.

Figure 4.6: Generated ratings of two domains X(1) and X(2). Blank entries are unobserved ratings. Although users in X(1) and X(2) are different, they share implicit similarities in their preferences: those who like the sci-fi genre rate sci-fi movies highly and those who are comedy fans do so for comedy movies.
The resulting user clusters of several iterations are then plotted in an x-y coordinate system as in Figure 4.7, where the x-axis is the first column of U(1) and the y-axis is its second column. If the first and the second column of U(2) match the first and the second column of U(1) respectively, x is the first column and y is the second column of U(2). Otherwise, the order of the columns in U(2) is changed before plotting its users so that matching columns between U(1) and U(2) are aligned, i.e., y is the first and x is the second column of U(2).
Initially, the algorithm randomizes the user factors, so the users are scattered randomly in the x-y space. Iteration 1 starts grouping users into different clusters and aligning them.

Figure 4.7: An illustration of how well the proposed cluster alignment method works. Sci-fi fans and comedy fans from X(1) and X(2) in Figure 4.6 are captured in user factors U(1) and U(2). Circles are users captured in the first column of U, and triangles are those captured in the second column of U. As U(1) and U(2) are randomly initialized, users are scattered at first. User clusters are formed and aligned over iterations. From iteration 2, two user clusters in X(1) are clearly separated and gradually aligned with those in X(2). There is a change in cluster matching in iteration 3: the first column of U(1) is matched with the second column of U(2) and vice versa. From then on, centroids of clusters are aligned to be as close as possible, from iteration five till the last iteration.

After iteration 2, users with similar behaviors in U(1) and those in
U(2) are formed, and it can be seen that the first and the second user clusters of
X(1) are aligned with the first and the second user clusters of X(2), respectively. In
other words, sci-fi fans in the first column of U(1) (blue circles) are matched with
those in the first column of U(2) (red circles) whereas comedy fans in the second
column of U(1) (blue triangles) are matched with those in the second column of
U(2) (red triangles). Iteration 3 makes a correction to the cluster alignment: sci-fi fans, now in the second column of U(2), are matched with those in the first column of U(1) (blue circles), whereas comedy fans, now in the first column of U(2) (red circles), are matched with those in the second column of U(1) (blue triangles). In this case, the order of the columns of U(2) is corrected so that users in U(1) and U(2) are in the same coordinates. In the following iterations, centroids of corresponding clusters in X(1) and X(2) are adjusted to be close, as observed in iteration 5's result and the last iteration's.
4.2.3 Optimization
Equation (4.1) is not a convex function. However, it is convex with respect to one factor when the others are fixed. Therefore, the alternating least squares framework is used to take turns optimizing one factor in function (4.1) while fixing the others, as shown in Algorithm 2. Moreover, as the data for updating rows of the factors are independent, the proposed algorithm computes the factors in a row-wise manner instead of full matrix operations so that the computation can be done in parallel with multiple CPU cores or a distributed system. To this end, Equation (4.1) is rewritten as follows:
$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(1)T}_i S^{(1)} \begin{bmatrix} v^{(0)}_j \\ v^{(1)}_j \end{bmatrix} \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(2)T}_k S^{(2)} \begin{bmatrix} v^{(0)}_j \\ v^{(2)}_j \end{bmatrix} \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.2)$$
Solving U(1) and U(2)
Let $v^{(01)}_j = S^{(1)} \begin{bmatrix} v^{(0)}_j \\ v^{(1)}_j \end{bmatrix}$ and $v^{(02)}_j = S^{(2)} \begin{bmatrix} v^{(0)}_j \\ v^{(2)}_j \end{bmatrix}$; then Equation (4.2) becomes

$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(1)T}_i v^{(01)}_j \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(2)T}_k v^{(02)}_j \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.3)$$
By fixing all $v_j$, Equation (4.3) is a convex function with respect to $u^{(1)T}_i$. As a result, $u^{(1)T}_i$ is optimal when the partial derivative of $\mathcal{L}$ with respect to it is set to zero.
$$\frac{\partial \mathcal{L}}{\partial u^{(1)T}_i} = -2 \sum_{j}^{m} \left( X^{(1)}_{i,j} - u^{(1)T}_i v^{(01)}_j \right) v^{(01)T}_j + 2 \left( u^{(1)T}_i - b^T \right) + 2 \lambda u^{(1)T}_i = -2 x^{(1)T}_{i,*} V^{(01)} + 2 u^{(1)T}_i V^{(01)T} V^{(01)} + 2 u^{(1)T}_i - 2 b^T + 2 \lambda u^{(1)T}_i$$

where $b^T = -m^{(1)T}_l + m^{(2)T}_* + u^{(1)T}_i$, $l$ is the cluster user $i$ belongs to, and $x^{(1)T}_{i,*}$ is a row vector of all observed $X^{(1)}_{i,j}, \forall j \in [1, m]$.
By setting $\frac{\partial \mathcal{L}}{\partial u^{(1)T}_i} = 0$, the updating rule for $u^{(1)T}_i$ can be obtained:

$$u^{(1)T}_i = \left( V^{(01)T} V^{(01)} + (\lambda + 1) I \right)^{+} \left( x^{(1)T}_{i,*} V^{(01)} + b^T \right) \quad (4.4)$$
In the same way, the optimal $u^{(2)T}_k$ can be derived from:

$$u^{(2)T}_k = \left( V^{(02)T} V^{(02)} + (\lambda + 1) I \right)^{+} \left( x^{(2)T}_{k,*} V^{(02)} + b^T \right) \quad (4.5)$$

where $b^T = -m^{(1)T}_l + m^{(2)T}_* + u^{(2)T}_k$, $l$ is the cluster matched with the one user $k$ belongs to, and $x^{(2)T}_{k,*}$ is a row vector of all observed $X^{(2)}_{k,j}, \forall j \in [1, m]$. $I$ is the identity matrix.
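As an illustration, a row-wise update following Equation (4.4) can be written as the minimal NumPy sketch below; the argument names are hypothetical, and the pseudo-inverse is replaced by a direct solve under the assumption that the system is nonsingular.

```python
import numpy as np

def update_user_row(x_obs, V01_obs, m1_l, m2_star, u_old, lam):
    """One row update of U^(1) following Equation (4.4) (a sketch).

    x_obs:   user i's observed ratings.
    V01_obs: |obs| x r matrix whose rows are v_j^(01) = S^(1) [v_j^(0); v_j^(1)]
             for the items j that user i rated.
    m1_l:    centroid of the cluster user i belongs to (in U^(1)).
    m2_star: centroid of the matched cluster in U^(2).
    u_old:   current value of u_i^(1), used to form b.
    """
    r = V01_obs.shape[1]
    b = -m1_l + m2_star + u_old                     # b as defined above
    lhs = V01_obs.T @ V01_obs + (lam + 1.0) * np.eye(r)
    rhs = V01_obs.T @ x_obs + b
    return np.linalg.solve(lhs, rhs)                # assumes lhs nonsingular
```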
Solving common V(0)
Let $u^{(1)T}_i = \left[ u^{(10)}_i \mid u^{(11)}_i \right]^T S^{(1)}$ and $u^{(2)T}_k = \left[ u^{(20)}_k \mid u^{(22)}_k \right]^T S^{(2)}$, where $u^{(10)T}_i, u^{(20)T}_k \in \mathbb{R}^{1 \times c}$ and $u^{(11)T}_i, u^{(22)T}_k \in \mathbb{R}^{1 \times (r-c)}$; then Equation (4.1) can be rewritten as:

$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j - u^{(11)T}_i v^{(1)}_j \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j - u^{(22)T}_k v^{(2)}_j \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta$$
Similar to the case of U(1) and U(2), $v^{(0)}_j$ is now optimized while fixing all other parameters. Again, by setting the partial derivative of $\mathcal{L}$ with respect to $v^{(0)}_j$ to zero, the optimal value of $v^{(0)}_j$ can be achieved.
$$\frac{\partial \mathcal{L}}{\partial v^{(0)}_j} = -2 \sum_{i}^{n} \left( Y^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j \right) u^{(10)}_i - 2 \sum_{k}^{p} \left( Y^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j \right) u^{(20)}_k + 2 \lambda v^{(0)}_j = -2 U^{(1)T} y^{(1)}_{*,j} + 2 U^{(1)T} U^{(1)} v^{(0)}_j - 2 U^{(2)T} y^{(2)}_{*,j} + 2 U^{(2)T} U^{(2)} v^{(0)}_j + 2 \lambda v^{(0)}_j$$

where $Y^{(1)}_{i,j} = X^{(1)}_{i,j} - u^{(11)T}_i v^{(1)}_j$ and $Y^{(2)}_{k,j} = X^{(2)}_{k,j} - u^{(22)T}_k v^{(2)}_j$; $y^{(1)}_{*,j}$ and $y^{(2)}_{*,j}$ are column vectors of all observed $Y^{(1)}_{i,j}, \forall i \in [1, n]$ and $Y^{(2)}_{k,j}, \forall k \in [1, p]$, respectively.
The updating rule for $v^{(0)}_j$ is:

$$v^{(0)}_j = \left( U^{(1)T} U^{(1)} + U^{(2)T} U^{(2)} + \lambda I \right)^{+} \left( U^{(1)T} y^{(1)}_{*,j} + U^{(2)T} y^{(2)}_{*,j} \right) \quad (4.6)$$
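A corresponding sketch of the common item-row update in Equation (4.6) is given below; the residual vectors and the c-column common parts of the transformed user factors are assumed to be precomputed as defined above.

```python
import numpy as np

def update_common_item_row(y1, U10, y2, U20, lam):
    """One common row update of V^(0) following Equation (4.6) (a sketch).

    y1, y2:   residuals Y_{*,j} over the users who rated item j in each
              domain, with the domain-specific parts already subtracted.
    U10, U20: corresponding rows of the common parts of the transformed
              user factors (|obs| x c each).
    """
    c = U10.shape[1]
    lhs = U10.T @ U10 + U20.T @ U20 + lam * np.eye(c)
    rhs = U10.T @ y1 + U20.T @ y2
    return np.linalg.solve(lhs, rhs)                # assumes lhs nonsingular
```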
Solving domain-specific V(1) and V(2)
Similar operations are now performed with respect to $v^{(1)}_j$ and $v^{(2)}_j$:

$$\frac{\partial \mathcal{L}}{\partial v^{(1)}_j} = -2 U^{(1)T} z^{(1)}_{*,j} + 2 U^{(1)T} U^{(1)} v^{(1)}_j + 2 \lambda v^{(1)}_j = 0$$

where $Z^{(1)}_{i,j} = X^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j$ and $Z^{(2)}_{k,j} = X^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j$; $z^{(1)}_{*,j}$ and $z^{(2)}_{*,j}$ are column vectors of all observed $Z^{(1)}_{i,j}, \forall i \in [1, n]$ and $Z^{(2)}_{k,j}, \forall k \in [1, p]$, respectively.

Then the updating rule for $v^{(1)}_j$ is derived:

$$v^{(1)}_j = \left( U^{(1)T} U^{(1)} + \lambda I \right)^{+} U^{(1)T} z^{(1)}_{*,j} \quad (4.7)$$
Analogously to optimizing $v^{(1)}_j$, the optimal $v^{(2)}_j$ is obtained by:

$$v^{(2)}_j = \left( U^{(2)T} U^{(2)} + \lambda I \right)^{+} U^{(2)T} z^{(2)}_{*,j} \quad (4.8)$$
Solving weighting factor S(1) and S(2)
Let

$$s^{(1)T} = \left( S^{(1)}_{1,1}, S^{(1)}_{2,1}, \ldots, S^{(1)}_{r,1}, S^{(1)}_{1,2}, S^{(1)}_{2,2}, \ldots, S^{(1)}_{r,2}, \ldots, S^{(1)}_{1,r}, S^{(1)}_{2,r}, \ldots, S^{(1)}_{r,r} \right)$$

and

$$a^T = \left( U^{(1)}_{i,1} V^{(1)}_{j,1}, U^{(1)}_{i,2} V^{(1)}_{j,1}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,1}, \; U^{(1)}_{i,1} V^{(1)}_{j,2}, U^{(1)}_{i,2} V^{(1)}_{j,2}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,2}, \; \ldots, \; U^{(1)}_{i,1} V^{(1)}_{j,r}, U^{(1)}_{i,2} V^{(1)}_{j,r}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,r} \right)$$

Equation (4.2) then becomes:

$$\mathcal{L} = \sum_{i,j}^{n,m} \left\| X^{(1)}_{i,j} - s^{(1)T} a \right\|^2 + \lambda \left\| s^{(1)T} \right\|^2 + \text{const}$$

where const collects the remaining regularization terms.

The optimal $s^{(1)T}$ is achieved when $\frac{\partial \mathcal{L}}{\partial s^{(1)T}} = 0$:

$$\frac{\partial \mathcal{L}}{\partial s^{(1)T}} = -2 x^{(1)T}_{*,*} A + 2 s^{(1)T} A^T A + 2 \lambda s^{(1)T} = 0 \;\Leftrightarrow\; s^{(1)T} = \left( A^T A + \lambda I \right)^{-1} \left( x^{(1)T}_{*,*} A \right) \quad (4.9)$$

where $x^{(1)T}_{*,*}$ contains the observed $X^{(1)}_{i,j}$; $x^{(1)T}_{*,*} \in \mathbb{R}^{1 \times \Omega_{X^{(1)}}}$ and $A \in \mathbb{R}^{\Omega_{X^{(1)}} \times (r \times r)}$.
In an analogous way, the update rule for $s^{(2)T}$ is:

$$s^{(2)T} = \left( A^T A + \lambda I \right)^{-1} \left( x^{(2)T}_{*,*} A \right) \quad (4.10)$$

Algorithm 2: HISF: utilizing both explicit and implicit similarities from two matrices
Input: X(1), X(2), E
Output: U(1), S(1), V(0), V(1), U(2), S(2), V(2)
1  Randomly initialize all factors
2  Initialize L with a small number
3  repeat
4      PreL = L
5      Find matches between clusters in U(1) and U(2)
6      Solve U(1) by (4.4)
7      Solve U(2) by (4.5)
8      Solve common V(0) by (4.6)
9      Solve domain-specific V(1) by (4.7)
10     Solve domain-specific V(2) by (4.8)
11     Solve S(1) by (4.9)
12     Solve S(2) by (4.10)
13     Compute L following (4.1)
14 until (PreL − L)/PreL < E
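The vectorized S update of Equation (4.9) maps neatly onto a Kronecker product: with s = vec(S) in column-major order, each observed entry contributes a row a^T = kron(V_j, U_i). A minimal NumPy sketch (dense storage with np.nan marking unobserved entries, both assumptions of this sketch) is:

```python
import numpy as np

def update_S(X, U, V, lam, r):
    """Solve Equation (4.9) for S via its vectorization (a sketch).

    X: n x m with np.nan for unobserved entries; U: n x r; V: m x r,
    where V stacks the common and domain-specific item parts.
    Each observed (i, j) contributes a row a^T = kron(V[j], U[i]), so
    X_ij ~ s^T a with s = vec(S) in column-major order.
    """
    idx = np.argwhere(~np.isnan(X))
    A = np.stack([np.kron(V[j], U[i]) for i, j in idx])   # |Omega| x r^2
    x = X[~np.isnan(X)]                                   # observed values
    s = np.linalg.solve(A.T @ A + lam * np.eye(r * r), A.T @ x)
    return s.reshape((r, r), order="F")                   # un-vectorize
```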
4.3 Extension to three or more matrices
More than two correlated matrices can be found in many cases. For example, Amazon has ratings from different domains, e.g., Books, Movies and TV, Electronics, Digital Music, etc. These rating matrices may have a close relationship that can help to collaboratively improve recommendation accuracy. In this case, suppose there are three correlated matrices X(1), X(2) and X(3). They are coupled in their second dimension, i.e., X(1) is a rating matrix from n users for m items, X(2) is another rating matrix from k users for the same m items and X(3) is another rating matrix from l users for the same m items.

Algorithm 3: HISF-N: utilizing both explicit and implicit similarities from q matrices
Input: X(1), X(2), ..., X(q), E
Output: U(1), ..., U(q), S(1), ..., S(q), V(0), V(1), ..., V(q)
1  Randomly initialize all factors
2  Initialize L with a small number
3  repeat
4      PreL = L
5      Find matches among clusters in U(1), U(2), ..., U(q)
6      for i ∈ {1, ..., q} do
7          Solve U(i) while fixing all other factors
8          Solve S(i) while fixing all other factors
9          Solve V(i) while fixing all other factors
10     Solve common V(0) while fixing all other factors
11     Compute L following (4.11)
12 until (PreL − L)/PreL < E

The idea above can be extended to utilize the common parts of the coupled factors of the three matrices (explicit similarities) and to align clusters among the non-coupled factors (implicit similarities). Thus, the following extension is proposed for utilizing explicit and implicit similarities among three or more matrices:
$$\mathcal{L} = \sum_{i=1}^{3} \left\| X^{(i)} - U^{(i)} S^{(i)} \begin{bmatrix} V^{(0)} \\ V^{(i)} \end{bmatrix}^T \right\|^2 + sim_{implicit} + \lambda\theta \quad (4.11)$$

where $sim_{implicit}$ is the regularization term of implicit similarities across domains and is defined by:

$$sim_{implicit} = \sum_{f=1}^{r} \left\| m^{(1)T}_f - m^{(2)T}_* \right\|^2 + \sum_{f=1}^{r} \left\| m^{(2)T}_f - m^{(3)T}_* \right\|^2 + \sum_{f=1}^{r} \left\| m^{(3)T}_f - m^{(1)T}_* \right\|^2$$
Updating rules for optimizing U(1), U(2), U(3), S(1), S(2), S(3), V(1), V(2) and V(3) can be derived similarly to the case of two matrices in Section 4.2.3. Moreover, following the same derivations as for the common parts of two matrices in Section 4.2.3, an updating rule for $v^{(0)}_j$, which is the common part of the three matrices, can be reached:

$$v^{(0)}_j = \left( U^{(1)T} U^{(1)} + U^{(2)T} U^{(2)} + U^{(3)T} U^{(3)} + \lambda I \right)^{+} \left( U^{(1)T} y^{(1)}_{*,j} + U^{(2)T} y^{(2)}_{*,j} + U^{(3)T} y^{(3)}_{*,j} \right) \quad (4.12)$$
By optimizing Equation (4.11), both explicit similarities (in the form of the common V(0) parts) and implicit similarities (in the form of aligned user groups in U(1), U(2) and U(3)) among X(1), X(2) and X(3) are leveraged. The utilization of correlations from four or more matrices can easily be extended in the same way. Therefore, the proposed method does not limit itself to a certain number of correlated matrices.
4.4 Experiments and Analysis
The proposed HISF is evaluated in comparison with existing algorithms, includ-
ing CMF (Singh & Gordon 2008), CST (Pan et al. 2010), CBT (Li et al. 2009a)
and CLFM (Gao et al. 2013). This evaluation thoroughly studies two test cases:
one with two matrices and another one with three matrices. The goal is to evalu-
ate how well these algorithms suggest unknown information based on the observed
cross-domain ratings. For this purpose, they are compared based on the commonly
used root mean squared error (RMSE) metric.
$$RMSE = \sqrt{\frac{\sum_{i,j}^{n,m} \left( U_i \times S \times V^T_j - X_{i,j} \right)^2}{\Omega_X}}$$
where ΩX is the number of observations of X.
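For reference, the RMSE over observed entries can be computed as in the short sketch below (np.nan marking unobserved entries is an assumption of this sketch):

```python
import numpy as np

def rmse(X, U, S, V):
    """RMSE over the observed entries of X (np.nan marks unobserved)."""
    err = (U @ S @ V.T - X)[~np.isnan(X)]
    return np.sqrt(np.mean(err ** 2))
```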
Table 4.1: Dimension and number of known entries for training, validation and testing of census data on New South Wales (NSW) (X(1)) and Victoria (VIC) (X(2)) states as well as crime statistics of NSW (X(3)).

Characteristics   X(1)           X(2)          X(3)
Dimension         154 × 7,889    81 × 7,889    154 × 62
Training          91,069         47,900        661
Validation        4,793          2,521         34
Testing           23,965         12,605        173
4.4.1 Data for the experiments
Three publicly available datasets are used for the experiments. Their character-
istics are summarized in Table 4.1 and Table 4.2.
Dataset #1
The Australian Bureau of Statistics (ABS)∗ publishes comprehensive census data for the New South Wales (NSW) and Victoria (VIC) states. The dataset for NSW comprises populations and family profiles of 154 areas, so-called "local government areas" (LGA). They are formed into a matrix X(1) (LGA by population and family profile) for NSW and another matrix X(2) for VIC. 10% of the data is randomly selected, of which 80% is used for training and the remaining 20% for testing.
Dataset #2
The Bureau of Crime Statistics and Research (BOCSAR)† provides a statistical record of criminal incidents within the 154 LGAs of New South Wales. There are 62 specific crime types. 10% of the data are randomly selected to form a matrix X(3) of (LGA, crime types). 80% of X(3) is used for training and the rest is for testing.

∗ABS: http://www.abs.gov.au/websitedbs/censushome.nsf/home/datapacks
†BOCSAR: http://www.bocsar.nsw.gov.au/Pages/bocsar_crime_stats/bocsar_crime_stats.aspx
Table 4.2: Dimension and number of known entries for training, validation and testing of Amazon datasets on books (X(4)), movies (X(5)) and electronics (X(6)).

Characteristics   X(4)            X(5)            X(6)
Dimension         5,000 × 5,000   5,000 × 5,000   5,000 × 5,000
Training          158,907         94,665          41,126
Validation        8,363           4,982           2,164
Testing           18,585          11,071          4,809
Dataset #3
Three matrices of ratings for books, movies and electronics are extracted from the Amazon website (He & McAuley 2016). The book data contains ratings from 305,475 users on 888,057 books; the movie data contains ratings from the same 305,475 users on 128,097 movies and TV programs; the electronics data is of the same users on 196,894 items. All ratings are from 1 to 5. For this experiment, the data is constructed as follows:

- The same sub-sampling approach as in (Pan et al. 2010) is first adopted by randomly extracting $10^4 \times 10^4$ dense rating matrices from these three matrices. Three sub-matrices of 5,000 × 5,000 each are then taken, as summarized in Table 4.2. All sub-matrices share the same users, but no common items.

- All ratings are normalized by 5 so that their values range from 0.2 to 1.
4.4.2 Experimental settings
All algorithms factorize the input datasets with different ranks. Each algorithm was run five times; the mean and standard deviation of the results are reported in the next section. Furthermore, small changes across consecutive iterations indicate an algorithm's convergence; thus, the algorithms are stopped when the change is less than $10^{-5}$.
Table 4.3: Mean and standard deviation of tested RMSE on ABS NSW and ABS VIC data with different algorithms. For CST, when X(1) is the target, X(2) is used as auxiliary data and vice versa. Best results for each rank are in bold. The Hotelling's T-squared tests row presents the p-value of the Hotelling's T-squared test between each algorithm and the proposed HISF.

Rank  Dataset        CMF              CBT              CLFM             CST              HISF
5     ABS NSW X(1)   0.0226 ±0.0026   0.0839 ±0.0002   0.0838 ±0.0002   0.0248 ±0.0005   0.0132 ±0.0002
      ABS VIC X(2)   0.0364 ±0.0031   0.0844 ±0.0003   0.0845 ±0.0004   0.0271 ±0.0015   0.0266 ±0.0030
7     ABS NSW X(1)   0.0222 ±0.0009   0.0836 ±0.0004   0.0842 ±0.0006   0.0244 ±0.0005   0.0131 ±0.0003
      ABS VIC X(2)   0.0428 ±0.0020   0.0845 ±0.0004   0.0849 ±0.0004   0.0288 ±0.0025   0.0239 ±0.0025
9     ABS NSW X(1)   0.0241 ±0.0011   0.0841 ±0.0002   0.0848 ±0.0009   0.0231 ±0.0009   0.0143 ±0.0004
      ABS VIC X(2)   0.0476 ±0.0040   0.0852 ±0.0003   0.0848 ±0.0003   0.0283 ±0.0024   0.0221 ±0.0019
11    ABS NSW X(1)   0.0265 ±0.0026   0.0846 ±0.0007   0.0841 ±0.0007   0.0223 ±0.0006   0.0143 ±0.0003
      ABS VIC X(2)   0.0501 ±0.0029   0.0858 ±0.0005   0.0851 ±0.0007   0.0267 ±0.0020   0.0242 ±0.0015
13    ABS NSW X(1)   0.0237 ±0.0024   0.0851 ±0.0002   0.0850 ±0.0005   0.0222 ±0.0010   0.0151 ±0.0004
      ABS VIC X(2)   0.0489 ±0.0032   0.0860 ±0.0003   0.0852 ±0.0003   0.0290 ±0.0026   0.0227 ±0.0015
15    ABS NSW X(1)   0.0229 ±0.0029   0.0853 ±0.0005   0.0847 ±0.0005   0.0219 ±0.0007   0.0150 ±0.0000
      ABS VIC X(2)   0.0514 ±0.0041   0.0862 ±0.0002   0.0854 ±0.0006   0.0285 ±0.0023   0.0215 ±0.0005
Hotelling's T² p-value  1.15×10⁻⁶     3.97×10⁻¹⁷       1.13×10⁻¹⁸       4.91×10⁻⁸        -
4.4.3 Empirical results
Three scenarios are tested, as follows:
Case #1. Latent demographic profile similarities and latent LGA groups
similarities can help to collaboratively suggest unknown information in
these states
Matrices X(1) and X(2) are used as described in Section 4.4.1; they are from different LGAs of two states. Nevertheless, both of them are ratings for the same demographic categories. They share some common explicit demography similarities as well as implicit latent LGA similarities. They are tested to assess how well both explicit similarities in the demography dimension and implicit ones in the LGA dimension collaboratively suggest unknown information.
Table 4.3 shows the mean and standard deviation of the RMSE of all algorithms on the tested ABS data for the New South Wales (X(1)) and Victoria (X(2)) states. Both CBT and CLFM, which assume the two states have similar demography patterns in a latent sense, clearly perform the worst. The results demonstrate that these explicit similarities (in the form of latent demography patterns) do not fully capture the correlation nature of the two datasets in this case. Thus, they do not help CBT and CLFM to improve their performance significantly. CMF applies another approach to take advantage of explicit correlations between NSW state's population and family profile and those of VIC state. Specifically, CMF's assumption of the same population and family profile factor between NSW and VIC helps improve its performance over that of CBT and CLFM. CST allows a more flexible utilization of explicit correlations between NSW's population and family profile and those of VIC state than CMF does. As a result, CST achieves slightly higher accuracy than CMF in recommending NSW's and VIC's missing information.
Nevertheless, the prediction accuracy can be improved even further, as illustrated with the proposed idea of explicit and implicit similarity discovery. Utilizing them helps the proposed HISF achieve about two times higher accuracy compared with CMF, and up to 47% improvement for NSW and up to 25% for VIC compared to CST. These impressive results are achieved when the numbers of common columns are 3, 5, 5, 7, 7 and 7 for decomposition ranks of 5, 7, 9, 11, 13 and 15, respectively. This means that the common parts together with the domain-specific parts better capture the true correlation nature of the datasets. These explicit similarities together with implicit similar group alignments allow better knowledge leveraging between datasets, thus improving recommendation accuracy.
To confirm the statistical significance of the proposed method, Hotelling's T-squared tests (Hotelling 1931), the multivariate version of the t-test in univariate statistics, are performed. The objective is to validate whether the proposed algorithm differs significantly from the baselines. The multivariate Hotelling's T-squared test is used here because each population involves observations from two variables: NSW (ABS NSW X(1)) and VIC (ABS VIC X(2)). For testing the null hypothesis that each pair of algorithms (CMF vs. HISF, CBT vs. HISF, CLFM vs. HISF and CST vs. HISF) has identical mean RMSE vectors, let

H0: population mean RMSEs are identical for all of the variables (μ_NSW1 = μ_NSW2 and μ_VIC1 = μ_VIC2)

H1: at least one pair of these means is different (μ_NSW1 ≠ μ_NSW2 or μ_VIC1 ≠ μ_VIC2)

Because all p-values are smaller than α (0.05), the null hypothesis is rejected. Therefore, the observed difference between the baselines and the proposed algorithm is statistically significant.
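The test itself is standard; a textbook two-sample Hotelling's T-squared sketch in NumPy/SciPy is shown below, where each sample holds one algorithm's per-run RMSE vectors (e.g., the NSW and VIC columns over the five runs). This is a generic implementation, not the thesis's own code.

```python
import numpy as np
from scipy import stats

def hotelling_t2(sample1, sample2):
    """Two-sample Hotelling's T-squared test.

    sample1, sample2: (runs x variables) arrays of RMSEs.
    Returns the T^2 statistic and the p-value via the F-distribution form.
    """
    n1, p = sample1.shape
    n2, _ = sample2.shape
    d = sample1.mean(axis=0) - sample2.mean(axis=0)
    S_pool = ((n1 - 1) * np.cov(sample1, rowvar=False)
              + (n2 - 1) * np.cov(sample2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S_pool, d)
    f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    p_value = stats.f.sf(f_stat, p, n1 + n2 - p - 1)
    return t2, p_value
```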
Case #2. Latent LGA similarities and latent similarities between de-
mography and crime can help to collaboratively suggest unknown crime
and state information
The advantages of both explicit and implicit similarities are further confirmed
in Table 4.4. In this case, they are applied to other cross domains: ABS NSW
demography (X(1)) and NSW Crime (X(3)). These datasets have explicit similarities
in their LGA latent factors. At the same time, implicit similarities in demography
profile and criminal behaviors are also utilized. The proposed HISF, leveraging both similarities, outperforms the existing algorithms. It is worth noting here that the performance of CST, in this case, is worse than that of CMF (about two times worse for NSW and a bit lower for VIC). The results suggest that the flexibility of utilizing explicit similarities does not work here. On the contrary, the proposed idea of utilizing both explicit and implicit similarities achieves more stable and better results (demonstrated in Tables 4.3 and 4.4).
Similar Hotelling’s T squared tests are perfomed as of the case #1. Again, as
84
Table 4.4 : Mean and standard deviation of tested RMSE on ABS NSW demographyand BOCSAR NSW crime data with different algorithms. For CST, when X(1) isthe target, X(3) is used as an auxiliary data and vice versa. Best results for eachrank are in bold. The Hotelling’s T squared tests row presents p-value of Hotelling’sT squared tests between each algorithm and the proposed HISF.
Dataset Rank CMF CBT CLFM CST HISF
5Demography X(1) 0.0209 ±0.0016 0.0840 ±0.0001 0.0840 ±0.0001 0.0304 ±0.0080 0.0174 ±0.0015
Crime X(3) 0.2796 ±0.0204 0.3411 ±0.0035 0.3422 ±0.0071 0.3216 ±0.0052 0.2697 ±0.0073
7Demography X(1) 0.0223 ±0.0024 0.0840 ±0.0002 0.0855 ±0.0006 0.0324 ±0.0040 0.0143 ±0.0004
Crime X(3) 0.2907 ±0.0265 0.3432 ±0.0021 0.3912 ±0.0188 0.3440 ±0.0055 0.2716 ±0.0029
9Demography X(1) 0.0199 ±0.0027 0.0838 ±0.0002 0.0850 ±0.0008 0.0337 ±0.0050 0.0143 ±0.0003
Crime X(3) 0.2813 ±0.0261 0.3562 ±0.0134 0.3722 ±0.0249 0.3434 ±0.0087 0.2648 ±0.0058
11Demography X(1) 0.0212 ±0.0049 0.0839 ±0.0001 0.0843 ±0.0004 0.0402 ±0.0073 0.0146 ±0.0003
Crime X(3) 0.2689 ±0.0143 0.3539 ±0.0061 0.3712 ±0.0199 0.3495 ±0.0051 0.2618 ±0.0012
13Demography X(1) 0.0194 ±0.0022 0.0837 ±0.0001 0.0837 ±0.0003 0.0383 ±0.0113 0.0149 ±0.0003
Crime X(3) 0.2700 ±0.0150 0.3481 ±0.0070 0.3500 ±0.0135 0.3599 ±0.0024 0.2623 ±0.0024
15Demography X(1) 0.0173 ±0.0014 0.0835 ±0.0001 0.0834 ±0.0002 0.0503 ±0.0084 0.0149 ±0.0002
Crime X(3) 0.2647 ±0.0031 0.3485 ±0.0038 0.3580 ±0.0099 0.3610 ±0.0011 0.2625 ±0.0015
Hotelling’s T squared tests 8.22×10−4 1.37×10−15 2.51×10−15 1.38×10−6 -
C
HISF
(a)
C
HISF
(b)
Figure 4.8 : Tested mean RMSEs under different values of the common row pa-rameter (c) in the coupled factor of HISF with r ank r = 11. a) Results on ABSNSW dataset; b) Results on ABS VIC dataset. CMF and CBT do not have cparameter, thus, their results are used as references. Lower is better. RMSE ofHISF outperforms its competitors when c equals 7.
all p-values are smaller than α (0.05) here, the null hypothesis is rejected. It is
therefore convincing to conclude that the mean RMSEs between the baselines and
the proposed idea differ significantly.
Figures 4.8 and 4.9 show how HISF works with different values of the c parameter. These figures illustrate the case where the decomposition rank equals 11. All the curves of HISF have an identical trend. As the number of common rows of explicit similarities (the c parameter) increases, recommendation accuracy improves until it reaches its minimum RMSE. Then, the performance degrades as c increases further. This pattern confirms the significance of preserving the domain-specific parts of each dataset. In other words, common and domain-specific parts better capture the correlation nature of datasets across domains, thus achieving higher recommendation accuracy. Moreover, when c equals the rank 11, HISF and CMF utilize the same explicit similarities (identical coupled factors). Yet, HISF uses an extra correlation from implicit similarities. This additional knowledge enables HISF to achieve a lower RMSE compared to CMF (Figure 4.8b). All of these suggest the advantages of both explicit and implicit similarities in the joint analysis of two matrices.

Figure 4.9: Tested mean RMSEs under different values of the common row parameter (c) in the coupled factor of HISF with rank r = 11. a) Results on ABS NSW demography; b) Results on BOCSAR NSW Crime. CMF and CBT do not have the c parameter, thus their results are used as references. Lower is better. RMSE of HISF outperforms its competitors when c equals 9.
Table 4.5: Mean and standard deviation of tested RMSE on book, movie and electronics data with different algorithms. CST is not applied here as it does not support two or more principal coordinates on one factor. Best results for each rank are in bold. The Hotelling's T-squared tests row presents the p-value of the Hotelling's T-squared test between each algorithm and the proposed HISF.

Rank  Dataset            CMF              CBT              CLFM             HISF-N
5     Books X(4)         0.2169 ±0.0005   0.2187 ±0.0026   0.2161 ±0.0023   0.1951 ±0.0005
      Movies X(5)        0.2273 ±0.0022   0.3342 ±0.0039   0.3474 ±0.0063   0.2212 ±0.0010
      Electronics X(6)   0.2375 ±0.0017   0.4206 ±0.0036   0.4596 ±0.0066   0.2642 ±0.0011
7     Books X(4)         0.2170 ±0.0009   0.3068 ±0.0031   0.3059 ±0.0024   0.1977 ±0.0010
      Movies X(5)        0.2279 ±0.0009   0.4368 ±0.0031   0.4337 ±0.0055   0.2213 ±0.0007
      Electronics X(6)   0.2348 ±0.0009   0.6607 ±0.0135   0.6389 ±0.0116   0.2497 ±0.0016
9     Books X(4)         0.2185 ±0.0010   0.3163 ±0.0004   0.3150 ±0.0026   0.2001 ±0.0005
      Movies X(5)        0.2302 ±0.0010   0.4661 ±0.0046   0.4594 ±0.0065   0.2218 ±0.0008
      Electronics X(6)   0.2354 ±0.0019   0.7098 ±0.0232   0.6909 ±0.0175   0.2411 ±0.0019
11    Books X(4)         0.2219 ±0.0014   0.3207 ±0.0044   0.3204 ±0.0021   0.2012 ±0.0004
      Movies X(5)        0.2319 ±0.0021   0.4865 ±0.0019   0.4795 ±0.0043   0.2247 ±0.0005
      Electronics X(6)   0.2417 ±0.0010   0.7390 ±0.0090   0.7223 ±0.0123   0.2420 ±0.0018
13    Books X(4)         0.2244 ±0.0018   0.3267 ±0.0020   0.3291 ±0.0027   0.2021 ±0.0010
      Movies X(5)        0.2349 ±0.0025   0.5014 ±0.0057   0.4908 ±0.0028   0.2263 ±0.0009
      Electronics X(6)   0.2422 ±0.0012   0.7695 ±0.0118   0.7684 ±0.0155   0.2410 ±0.0022
15    Books X(4)         0.2260 ±0.0014   0.3303 ±0.0024   0.3353 ±0.0025   0.2022 ±0.0008
      Movies X(5)        0.2365 ±0.0018   0.5132 ±0.0044   0.5070 ±0.0065   0.2270 ±0.0009
      Electronics X(6)   0.2447 ±0.0013   0.8067 ±0.0126   0.7688 ±0.0122   0.2417 ±0.0024
Hotelling's T² p-value   2.80×10⁻⁷        1.15×10⁻⁵        1.11×10⁻⁶        -
Case #3. Latent users’ taste similarities when buying books, movies
and electronics devices and latent item group similarities can help to
collaboratively improve recommendation accuracy
Joint analysis of all three rating matrices from the Amazon website, X(4) for books, X(5) for movies and X(6) for electronics, is studied. All of them are explicitly from the same users. Nevertheless, their hidden similarities in personal preferences and items' characteristics can also help to better suggest missing information. This case assesses how these explicit together with implicit similarities can help to collaboratively improve the recommendation accuracy of each.
Table 4.5 summarizes the performance of CMF, CBT, CLFM and the proposed algorithm on the Books, Movies and Electronics datasets from Amazon. CST is not applied due to its limit of one auxiliary dataset for one factor; in this case, two side datasets for the user dimension would be needed. The results on these three datasets are consistent with those for the two matrices above. In particular, CBT and CLFM, which assume shared rating patterns among the three, perform the worst. CMF improves accuracy considerably in comparison with CBT and CLFM. Nevertheless, its performance is once more outperformed by the proposed ideas. The results in this case demonstrate that the idea can be generalized to the situation of more than two datasets.

Hotelling's T-squared tests are again performed for three variables, i.e., Books, Movies and Electronics, between each baseline and HISF-N to confirm the statistical significance of the proposed method. For each pair (CMF vs. HISF-N, CBT vs. HISF-N and CLFM vs. HISF-N), let

H0: population mean RMSEs are identical for all of the variables (μ_Books1 = μ_Books2 and μ_Movies1 = μ_Movies2 and μ_Electronics1 = μ_Electronics2)

H1: at least one pair of these means is different (μ_Books1 ≠ μ_Books2 or μ_Movies1 ≠ μ_Movies2 or μ_Electronics1 ≠ μ_Electronics2)
The Hotelling’s T squared tests this time also result in very small p-values. This
once again confirms the performance between them is significantly different. Thus,
it can be concluded that their observed difference is convincingly significant.
Figure 4.10 shows how HISF-N works with different values of the c parameter. Two conclusions can be drawn by observing the figures. Firstly, the domain-specific parts are quite substantial in addition to the common parts. Secondly, the proposed model reduces X(4)'s RMSE the most, then X(5)'s, then X(6)'s, in comparison with the other algorithms. This can be explained by their observed data sizes. The proposed model, which optimizes the loss function (4.11), gives more preference to optimizing the domain with more data while preserving comparable performance on the other domains.

Figure 4.10: Tested mean RMSEs of the Amazon dataset under different values of the common row parameter (c) in the coupled factor of HISF-N with rank r = 11. a) Results on Amazon Books; b) Results on Amazon Movies; c) Results on Amazon Electronics. CMF and CBT do not have the c parameter, thus their results are used as references. Lower is better. RMSE of HISF outperforms its competitors when c equals 7.
4.5 Contributions and Summary
This chapter addresses contributions #2 and #3 of this thesis by proposing a
novel algorithm to discover implicit similarities between datasets across domains. It
presents an idea to discover non-coupled factors’ latent similarities and makes use
of them to further improve the recommendation. Specifically, on the non-shared di-
mension, the middle matrix of the tri-factorization is proposed to match the unique
factors. Based on the found matches, HISF aligns the matched unique factors to
further transfer cross-domain implicit similarities and thus improve the recommen-
dation.
The proposed algorithm is significant in two respects. Firstly, e-commerce businesses are increasingly dependent on recommendation systems to introduce personalized services and products to targeted customers. Providing useful recommendations requires sufficient knowledge about user preferences and product (item) characteristics. Given the current abundance of available data across domains, HISF achieves a thorough understanding of the relationship between users and items by exploiting the implicit similarities among them. Discovering and utilizing them can bring in more collaborative filtering power and lead to higher recommendation accuracy. Secondly, HISF is the first factorization method using both explicit and implicit similarities to enhance the performance of cross-domain recommendation. Validated on real-world datasets, the proposed approach outperforms existing algorithms by more than two times in terms of recommendation accuracy. The empirical results also encourage us to extend the idea to a generalized model which enables joint analysis of the explicit and implicit similarities from multiple correlated datasets. The advantages of the ideas suggest both similarities have a significant impact on improving the performance of cross-domain recommendation.
Chapter 5
Scalable Multimodal Factorization
This chapter encapsulates contribution #4 of this thesis. It is an extended description of the following publication:
Quan Do and Wei Liu, “Scalable Multimodal Factorization for Learning from
Very Big Data,” in Multimodal Analytics for Next-Generation Big Data Technologies
and Applications, Springer. (To appear)
5.1 Introduction
Recent technological advances in data acquisition have brought new opportunities as well as new challenges (Lahat et al. 2015b) to research communities. Many new acquisition methods and sensors enable researchers to acquire multiple modes of information about the real world. This multimodal data can be naturally and efficiently represented by a multi-way structure, called a tensor, which can be analyzed to extract the underlying meaning of the observed data (Baltruaitis et al. 2018, Farias et al. 2016). The increasing availability of multiple modalities, captured in correlated tensors, provides greater opportunities to examine a complete picture of all the data patterns, as discussed in Chapters 3 and 4.
The joint analysis of multimodal tensor data generated from different sources
provides a deeper understanding of the data’s underlying structure (Bhargava et al.
2015, Diao et al. 2014). However, processing this massive amount of correlated
data incurs a very heavy cost in terms of computation, communication, and storage.
Traditional methods which operate on a local machine such as coupled matrix tensor
factorization (CMTF) (Acar, Kolda & Dunlavy 2011) are either intractably slow or
memory insufficient. The former issue is because they iteratively compute factors
on the full coupled tensors many times; the latter is due to the fact that the full
coupled tensors cannot be loaded into a typical machine’s local memory. Both
computationally efficient methods and scalable work have been proposed to speed
up the factorization of multimodal data. Whereas concurrent processing using CPU
cores (Papalexakis et al. 2014, 2012) or GPUs’ massively parallel architecture (Zou
et al. 2015) computed faster, it did not solve the problem of insufficient local memory
to store a large amount of observed data. Other MapReduce distributed models
(Beutel et al. 2014, Jeon et al. 2016, Sun et al. 2010, Shin & Kang 2014) overcame
memory problems by keeping the large files in a distributed file system. They also
improved computational speed by having many different computing nodes processed
in parallel.
Computing in parallel allows factors to be updated faster (Liavas & Sidiropoulos 2015), yet the factorization process faces a higher data communication cost if it is not well designed. One critical weakness of MapReduce algorithms is that when a node needs data to be processed, the data is transferred from an isolated distributed file system to the node (Shi et al. 2015). The iterative nature of factorization requires data and factors to be distributed over and over again, incurring a huge communication overhead. If the tensor size is doubled, the algorithms' performance is 2T times worse (T is the number of iterations). This cost is one of the disadvantages of the MapReduce methods and a cause of their low scalability.
This chapter describes an even more scalable multimodal factorization (SMF)
to improve the performance of MapReduce-based factorization algorithms as the
observed data becomes larger. The aforementioned deficiencies of MapReduce-based
algorithms can be overcome by minimizing the data transmission between computing
nodes and choosing a fast convergence optimization. The chapter describes SMF in
Section 5.2 in two parts: the first explains the observations behind processing by
blocks and caching data on computing nodes as well as provides a theoretical analysis
of the optimization process and the second part shows how SMF can be scaled up
to an unlimited input of multimodal data. The advantages of this method in terms
of minimal communication cost and scaling up capability are essential features of
any technique for dealing with multimodal big data. Also, Section 5.3 demonstrates
how it works by performing several tests with real-world multimodal data to evaluate
its scalability, its convergence speed, its accuracy and performance using different
optimization methods. Finally, Section 5.4 summarizes the primary contributions
of the proposed idea in achieving the scalability of factorization algorithms.
5.2 SMF: the proposed Scalable Multimodal Factorization
This section introduces SMF for the joint analysis of several N-mode tensors with one or more modes in common. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be a mode-N tensor. $\mathcal{X}$ has at most N coupled matrices or tensors. Without loss of generality, this section first explains a case where $\mathcal{X}$ and another matrix $Y \in \mathbb{R}^{I_1 \times J_2}$ are coupled in their first modes. The joint analysis of more than two tensors is discussed in Section 5.2.2. Based on the coupled matrix tensor factorization of $\mathcal{X}$ and $Y$ whose first modes are correlated, as in Section 2.3.2, SMF decomposes $\mathcal{X}$ into $U^{(1)} \in \mathbb{R}^{I_1 \times r}$, $U^{(2)} \in \mathbb{R}^{I_2 \times r}$, ..., $U^{(N)} \in \mathbb{R}^{I_N \times r}$ and $Y$ into $U^{(1)} \in \mathbb{R}^{I_1 \times r}$ and $V^{(2)} \in \mathbb{R}^{J_2 \times r}$, where $U^{(1)}$ is the common factor and $r$ is the decomposition rank.
$$\mathcal{L} = \left\| [\![ U^{(1)}, U^{(2)}, \ldots, U^{(N)} ]\!] - \mathcal{X} \right\|^2 + \left\| [\![ U^{(1)}, V^{(2)} ]\!] - Y \right\|^2 \quad (5.1)$$
Observation 1. Approximating each row of one factor while fixing the other factors
reduces the complexity of CMTF.
Let $U^{(-k)} = [\![ U^{(1)}, \ldots, U^{(k-1)}, U^{(k+1)}, \ldots, U^{(N)} ]\!]$. Based on Observation 1, instead of finding U(1), U(2), ..., U(N) and V(2) that minimize the loss function in Equation (5.1), the problem can be formulated as optimizing every single row $u^{(1)}_{i_1}$ of the coupled factor

$$\mathcal{L} = \sum_{i_2,\ldots,i_N} \left\| \sum_{f=1}^{r} u^{(1)}_{i_1,f} \times U^{(-1)}_{i_2,\ldots,i_N,f} - X^{(1)}_{i_1,i_2,\ldots,i_N} \right\|^2 + \sum_{j_2} \left\| \sum_{f=1}^{r} u^{(1)}_{i_1,f} \times V^{(2)}_{j_2,f} - Y^{(1)}_{i_1,j_2} \right\|^2 \quad (5.2)$$
minimizing each row $u^{(n)}_{i_n}$ of a non-coupled factor $U^{(n)}$ $(n > 1)$

$$\mathcal{L} = \sum_{i_1,\ldots,i_N} \left\| \sum_{f=1}^{r} u^{(n)}_{i_n,f} \times U^{(-n)}_{i_1,\ldots,i_N,f} - X^{(n)}_{i_1,i_2,\ldots,i_N} \right\|^2 \quad (5.3)$$

where $u^{(n)}_{i_n}$ is the variable the proposed model wants to find while fixing the other terms and $X_{i_1,i_2,\ldots,i_N}$ are the observed entries of $\mathcal{X}$,
and minimizing $v^{(2)}_{j_2}$ of a non-coupled factor $V^{(2)}$

$$\mathcal{L} = \sum_{i_1} \left\| \sum_{f=1}^{r} U^{(1)}_{i_1,f} \times v^{(2)}_{j_2,f} - Y^{(2)}_{i_1,j_2} \right\|^2 \quad (5.4)$$

where $v^{(2)}_{j_2}$ is the variable the proposed algorithm wants to find while fixing the other terms and $Y_{i_1,j_2}$ are the observed entries of $Y$.
According to Equation (5.2), computing a row $u^{(1)}_{i_1}$ of the coupled factor while fixing the other factors requires the observed entries of $\mathcal{X}$ and those of $Y$ that are located in slices $X^{(1)}_{i_1}$ and $Y^{(1)}_{i_1}$, respectively. Figure 5.1 illustrates these tensor slices for calculating each row of any factor. Similarly, Equation (5.3) suggests a slice $X^{(n)}_{i_n}$ for updating a corresponding row $u^{(n)}_{i_n}$, and Equation (5.4) suggests $Y^{(2)}_{j_2}$ for updating a corresponding row $v^{(2)}_{j_2}$.

Figure 5.1: Tensor slices for updating each row of U(1), U(2), U(3), and V(2) when the input tensors are a mode-3 tensor X coupled with a matrix Y in their first mode: (a) coupled slices in the first mode of both X(1) and Y(1) required for updating a row of U(1); (b) a slice in the second mode of X(2) for updating a row of U(2); (c) a slice in the third mode of X(3) for updating a row of U(3); (d) a slice in the second mode of Y(2) for updating a row of V(2).
Definition 1. Two slices $X^{(n)}_i$ and $X^{(n)}_{i'}$ are independent if and only if $\forall x \in X^{(n)}_i$, $\forall x' \in X^{(n)}_{i'}$ and $i \neq i'$, then $x \neq x'$.
Observation 2. Row updates for each factor as in Equation (5.2), (5.3) and (5.4)
require independent tensor slices; each of these non-overlapping parts can be pro-
cessed in parallel.
Figure 5.1a shows that $X^{(1)}_i, Y^{(1)}_i$ for updating $U^{(1)}_i$ and $X^{(1)}_{i'}, Y^{(1)}_{i'}$ for updating $U^{(1)}_{i'}$ are non-overlapping $\forall i, i' \in [1, I_1]$ and $i \neq i'$. Consequently, all rows of U(1) are independent and can be executed concurrently. The same parallel updates apply to all rows of U(2), ..., U(N) and V(2).
5.2.1 SMF on Apache Spark
This section discusses distributed SMF for large-scale datasets.
Observation 3. The most critical performance bottleneck of any distributed CMTF
algorithm is transferring a large-scale dataset to computing nodes at each iteration.
As with observation 2, optimizing rows of factors can be done in parallel with
distributed nodes; each one needs a tensor slice and other fixed factors. Existing dis-
tributed algorithms, such as FlexiFact (Beutel et al. 2014), GigaTensor (Kang et al.
2012) and SCouT (Jeon et al. 2016), store input tensors in a distributed file system.
Computing any factor requires the corresponding data to be transferred to process-
ing nodes. Because of the iterative nature of the CP model, this large-scale tensor
distribution repeats, causing a heavy communication overhead. SMF eliminates this
huge data transmission cost by robustly caching the required data in memory of the
processing nodes.
SMF partitions input tensors and localizes them in the computing nodes’ mem-
ory. It is based on Apache Spark because Spark natively supports local data caching
with its resilient distributed datasets (RDD) (Zaharia et al. 2010). In a nutshell, an
RDD is a collection of data partitioned across computational nodes. Any transfor-
mation or operation (map, reduce, foreach, ...) on an RDD is done in parallel. As
the data partition is located in the processing nodes’ memory, revisiting this data
many times over the algorithm’s iterations does not incur any communication over-
head. SMF designs RDD variables and chooses the optimization method carefully
to maximize RDD’s capability.
Figure 5.2: Dividing the coupled matrix and tensor into non-overlapping blocks: (a) coupled blocks CB(1) ← (B1(1), B2(1)) in the first mode of both X and Y for updating U(1); (b) blocks in the second mode of X for updating U(2) and those in the second mode of Y for V(2); (c) blocks in the third mode of X for U(3). All blocks are independent and can be processed concurrently.
Block processing
SMF processes blocks of slices to enhance efficiency. As observed in Figure 5.1, a coupled slice $(X^{(1)}_{i_1}, Y^{(1)}_{i_1})$ is required for updating a row of $U^{(1)}_{i_1}$. Slices $X^{(2)}_{i_2}$, $X^{(3)}_{i_3}$, and $Y^{(2)}_{j_2}$ are for updating $U^{(2)}_{i_2}$, $U^{(3)}_{i_3}$, and $V^{(2)}_{j_2}$, respectively. On one hand, it is possible to work separately on every one of the $I_1$ slices $X^{(1)}_{i_1}$ and $Y^{(1)}_{i_1}$, the $I_2$ slices $X^{(2)}_{i_2}$, the $I_3$ slices $X^{(3)}_{i_3}$ and the $J_2$ slices $Y^{(2)}_{j_2}$. On the other hand, dividing data into too many small parts is not a wise choice, as the time needed for job scheduling may exceed the computational time. Thus, merging several slices into non-overlapping blocks and working on them in parallel increases efficiency. An example of this grouping is presented in Figure 5.2.

Figure 5.2a shows coupled blocks CB(1) created by merging the corresponding (B1(1), B2(1)) in the first mode of both X and Y for updating U(1). Blocks in the second mode of X for updating U(2) and those in the second mode of Y for V(2) are illustrated in Figure 5.2b. Figure 5.2c shows blocks in the third mode of X for U(3). All blocks are independent and can be processed concurrently.
N copies of an N-mode tensor caching

SMF caches N copies of X and two copies of Y (one per mode) in memory. As observed in Figure 5.2, blocks of CB(1) are used for updating the first mode factor U(1); blocks of B1(2) and B2(2) are used for the second mode factors U(2) and V(2), respectively; blocks of B1(3) are used for the third mode factor U(3). Thus, to totally eliminate data transmission, all of these blocks should be cached. This duplicated caching needs more memory, yet it does not require a huge memory extension as more processing nodes can be added.
Function createRDD
Input: X-filename, Y-filename, N = 3, number of blocks d
Output: cached CB(1), B1(2), ..., B1(N), B2(2)
1  RDD1 ← read(X-filename)
2  RDD2 ← read(Y-filename)
3  foreach mode n ∈ [1, N] do
4      B1(n) ← RDD1.map(emit⟨in, (i1, ..., iN, value)⟩)
5          .groupByKey()
6          .partitionBy(d)
7          .cache()
8  foreach mode n ∈ [1, 2] do
9      B2(n) ← RDD2.map(emit⟨in, (i1, i2, value)⟩)
10         .groupByKey()
11         .partitionBy(d)
12         .cache()
13 CB(1) ← B1(1).join(B2(1)).cache()
A pseudo code for creating these copies as RDD variables in Apache Spark is in
Function createRDD(). Lines 1 and 2 create two RDDs of strings (i1, ..., iN , value)
for X and Y. The entries of RDDs are automatically partitioned across processing
nodes. These strings are converted into N key-value pairs of 〈in, (i1, ..., iN , value)〉,one for each mode, in lines 4 and 9. Lines 5 and 10 group the results into slices,
as illustrated in Figure 5.1. These slices are then merged into blocks (lines 6 and
99
11) to be cached in working nodes (lines 7 and 12). Coupled blocks are created by
joining corresponding blocks of X^{(1)} and Y^{(1)} in line 13. It is worth noting that
these transformations (lines 4 to 7 and lines 9 to 12) are performed concurrently on
the parts of the RDDs located in each working node.
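To make the caching scheme concrete, here is a minimal Apache Spark (Scala) sketch of Function createRDD(). The entry format, key layout, partition count, and storage level are illustrative assumptions for this sketch rather than the exact SMF implementation.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CreateRddInput {
  // Hypothetical entry type: the coordinates (i1, ..., iN) plus the value.
  type Entry = (Array[Int], Double)

  // Parse a whitespace-separated line "i1 ... iN value" (assumed format).
  private def parse(n: Int)(line: String): Entry = {
    val parts = line.trim.split("\\s+")
    (parts.take(n).map(_.toInt), parts(n).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SMF-createRDD"))
    val d = 64 // number of blocks per mode (illustrative)
    val N = 3  // modes of the main tensor X

    val rdd1 = sc.textFile("X-filename").map(parse(N)) // line 1
    val rdd2 = sc.textFile("Y-filename").map(parse(2)) // line 2

    // Lines 3-7: key each entry of X by its mode-n index, group the
    // entries into slices, merge the slices into d cached blocks.
    val b1 = (1 to N).map { n =>
      rdd1.map { case (idx, v) => (idx(n - 1), (idx, v)) }
        .groupByKey()
        .partitionBy(new HashPartitioner(d))
        .persist(StorageLevel.MEMORY_ONLY)
    }

    // Lines 8-12: the same layout for both modes of the matrix Y.
    val b2 = (1 to 2).map { n =>
      rdd2.map { case (idx, v) => (idx(n - 1), (idx, v)) }
        .groupByKey()
        .partitionBy(new HashPartitioner(d))
        .persist(StorageLevel.MEMORY_ONLY)
    }

    // Line 13: coupled blocks join the first-mode blocks of X and Y.
    val cb1 = b1(0).join(b2(0)).persist(StorageLevel.MEMORY_ONLY)
    cb1.count() // force materialization of all cached blocks
  }
}

Note that the groupByKey()/partitionBy() pair mirrors lines 5-6 of the pseudocode; in practice the two shuffles could be fused by passing the partitioner directly to groupByKey().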
Optimized solver
SMF uses a closed form solution for solving each row of the factors. This opti-
mizer not only converges faster but also helps to achieve higher accuracy.
Theorem 1. The optimal row u^{(1)}_{i_1} of a coupled factor U^{(1)} is computed by

    u^{(1)}_{i_1} = (A^T A + C^T C)† (A^T b_{i_1} + C^T d_{i_1})        (5.5)

where b_{i_1} is a column vector of all observed entries of X^{(1)}_{i_1}; d_{i_1} is a column vector of all observed entries of Y^{(1)}_{i_1}; A and C consist of all U^{(-1)}_{i_2,...,i_N} and V^{(2)}_{j_2} corresponding to the observed b_{i_1} and d_{i_1}, respectively.
Proof. Equation (5.2) can be written for each row of U^{(1)} as:

    L = Σ_{i_1} ‖A u^{(1)}_{i_1} − b_{i_1}‖² + Σ_{i_1} ‖C u^{(1)}_{i_1} − d_{i_1}‖²

Let x be the optimal u^{(1)}_{i_1}; then x can be derived by setting the derivative of L with respect to x to zero. Hence,

    L = (Ax − b_{i_1})^T (Ax − b_{i_1}) + (Cx − d_{i_1})^T (Cx − d_{i_1})
      = x^T A^T A x − 2 b^T_{i_1} A x + b^T_{i_1} b_{i_1} + x^T C^T C x − 2 d^T_{i_1} C x + d^T_{i_1} d_{i_1}

    ⇔ ∂L/∂x = 2 A^T A x − 2 A^T b_{i_1} + 2 C^T C x − 2 C^T d_{i_1} = 0
    ⇔ (A^T A + C^T C) x = A^T b_{i_1} + C^T d_{i_1}
    ⇔ x = (A^T A + C^T C)† (A^T b_{i_1} + C^T d_{i_1})
Theorem 2. The optimal row u^{(n)}_{i_n} of a non-coupled factor U^{(n)} is computed by

    u^{(n)}_{i_n} = (A^T A)† A^T b_{i_n}        (5.6)

where b_{i_n} is a column vector of all observed entries of X^{(n)}_{i_n}; A consists of all rows of the other factors U^{(-n)} corresponding to the observed b_{i_n}.

Proof. Similar to the proof of Theorem 1, u^{(n)}_{i_n} minimizes Equation (5.3) when the derivative with respect to it is zero:

    ∂L/∂u^{(n)}_{i_n} = 2 A^T A u^{(n)}_{i_n} − 2 A^T b_{i_n} = 0
    ⇔ u^{(n)}_{i_n} = (A^T A)† A^T b_{i_n}
Performing the pseudo-inversions in Equations (5.5) and (5.6) is expensive. Nevertheless, as (A^T A + C^T C) and A^T A are small square matrices in R^{r×r}, a more efficient approach is to use Cholesky decomposition to compute the pseudo-inverse and solve for u^{(n)}_{i_n}, as in Algorithm 4. At each iteration, SMF first broadcasts all the newly updated factors to all processing nodes. Then each factor is computed. While SMF updates each factor of each tensor sequentially, each row of a factor is computed in parallel by either updateFactor() (lines 10 and 12) or updateCoupledFactor() (line 7). These two functions are processed concurrently by different computing nodes with their cached data blocks. These steps are iterated to update the factors until the algorithm converges.
Algorithm 4: SMF with data parallelism
Input : cached CB^{(1)}, B1^{(2)}, ..., B1^{(N)}, B2^{(2)}, E
Output: U^{(1)}, ..., U^{(N)}, V^{(2)}
1   Initialize L by a small number
2   Randomly initialize all factors
3   repeat
4       Broadcast all factors
5       PreL = L
6       // coupled factor
7       U^{(1)} ← updateCoupledFactor(CB^{(1)}, 1)
8       // non-coupled factors U
9       foreach mode n ∈ [2, N] do
10          U^{(n)} ← updateFactor(B1^{(n)}, n)
11      // non-coupled factor V
12      V^{(2)} ← updateFactor(B2^{(2)}, 2)
13      Compute L following Equation (5.1)
14  until (PreL − L)/PreL < E
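To make the optimized solver concrete, the following self-contained Scala sketch performs the closed-form row solve of Equations (5.5) and (5.6) by factorizing the small r×r Gram matrix with a Cholesky decomposition and back-substituting. It assumes the Gram matrix is positive definite (the theorems use the pseudo-inverse † for the general case); all names here are illustrative, not the thesis's code.

object RowSolver {
  /** Solve (A^T A + C^T C) x = A^T b + C^T d for one coupled row (Eq. 5.5),
    * via a Cholesky factorization of the small r×r Gram matrix. For a
    * non-coupled row (Eq. 5.6), pass a zero matrix as ctc. */
  def solveCoupledRow(ata: Array[Array[Double]], ctc: Array[Array[Double]],
                      rhs: Array[Double]): Array[Double] = {
    val r = rhs.length
    // G = A^T A + C^T C (assumed symmetric positive definite)
    val g = Array.tabulate(r, r)((i, j) => ata(i)(j) + ctc(i)(j))
    // Cholesky: G = L L^T with L lower triangular
    val l = Array.ofDim[Double](r, r)
    for (i <- 0 until r; j <- 0 to i) {
      var s = g(i)(j)
      for (k <- 0 until j) s -= l(i)(k) * l(j)(k)
      l(i)(j) = if (i == j) math.sqrt(s) else s / l(j)(j)
    }
    // Forward substitution: L y = rhs
    val y = Array.ofDim[Double](r)
    for (i <- 0 until r) {
      var s = rhs(i)
      for (k <- 0 until i) s -= l(i)(k) * y(k)
      y(i) = s / l(i)(i)
    }
    // Back substitution: L^T x = y
    val x = Array.ofDim[Double](r)
    for (i <- r - 1 to 0 by -1) {
      var s = y(i)
      for (k <- i + 1 until r) s -= l(k)(i) * x(k)
      x(i) = s / l(i)(i)
    }
    x
  }
}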
Theorem 3. The computational complexity of Algorithm 4 is

    O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M) (N_k + r) r + T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).
Proof. An N-mode tensor requires finding N factors. A factor is updated in either lines 1-4 of the function updateFactor() or lines 1-6 of updateCoupledFactor(). Lines 1-4 prepare A, compute A^T A and A^T b_i, and perform Cholesky decompositions. Lines 1-6 double the A preparation and the A^T A, A^T b_i computations. Computing A requires (|Ω|/M)(N − 1) r operations, while A^T A and A^T b_i take (|Ω|/M)(r − 1) r each. The Cholesky decomposition of I_n/M matrices in R^{r×r} is O((I_n/M) r³). So updating a factor requires O((|Ω|/M)(N + r) r + (I_n/M) r³); all factors of K tensors take

    O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M)(N_k + r) r + Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).

These steps may iterate T times. Therefore, the computational complexity of Algorithm 4 is

    O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M)(N_k + r) r + T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).
Theorem 4. The communication complexity of Algorithm 4 is O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).

Proof. At each iteration, Algorithm 4 broadcasts Σ_{k=1}^{K} N_k factors to M machines (line 4). As broadcasting in Apache Spark is done using the BitTorrent technique, each broadcast to M machines takes O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ). So the total of T broadcasts requires O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).
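As a small illustration of the per-iteration cost that Theorem 4 counts, the following Spark (Scala) sketch re-broadcasts stand-in factor matrices once per iteration; the sizes and the no-op update are placeholders, not the SMF code.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastFactors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("smf-broadcast"))
    val r = 10
    // Stand-in factor matrices: one row block of I_n x r per mode.
    var factors = Array.fill(3)(Array.fill(1000, r)(math.random))
    for (iter <- 1 to 5) {
      // Line 4 of Algorithm 4: ship the current factors to every executor
      // once per iteration (the O(sum I_kn r) term of Theorem 4). Spark's
      // default broadcast is BitTorrent-style (TorrentBroadcast).
      val bcast = sc.broadcast(factors)
      // ... map jobs would read bcast.value locally here ...
      factors = factors.map(identity) // placeholder for the row updates
      bcast.destroy()                 // free executor copies before re-broadcasting
    }
    sc.stop()
  }
}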
Theorem 5. The space complexity of Algorithm 4 is

    O( Σ_{k=1}^{K} |Ω_k| N_k / M + Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).

Proof. Each computing node stores blocks of tensor data and all the factors. Firstly, N_k copies of the |Ω_k|/M observations of the k-th tensor need to be stored on each node, requiring O(|Ω_k| N_k / M). So K tensors take O( Σ_{k=1}^{K} |Ω_k| N_k / M ). Secondly, storing all the factors in each node requires O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ). Therefore, the space complexity is O( Σ_{k=1}^{K} |Ω_k| N_k / M + Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).
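For instance, with |Ω| = 1B observations, N = 3 modes and M = 9 machines (the Synthetic (4) setting on the test cluster of Section 5.3), the first term alone amounts to roughly 3 × 10⁹ / 9 ≈ 3.3 × 10⁸ cached observation copies per node; this is why the duplicated caching described above stays feasible only if nodes can be added as the data grows.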
Function updateFactor(B, n)
Output: U^{(n)}
1   B.map(
2       ⟨i, (i_1, ..., i_N, b_i)⟩ ← B_i
3       A ← U^{(−n)}
4       Compute u^{(n)}_i by Equation (5.6)
5   )
6   Collect the result and merge into U^{(n)}
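A rough Spark (Scala) rendering of updateFactor() is sketched below, reusing the RowSolver sketch from earlier. The block and entry types, the broadcast layout of the factors, and the dense construction of A are all illustrative assumptions; for a coupled mode, the same map would additionally build C and d_i from the B2 part of the coupled block, per Equation (5.5).

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object UpdateFactor {
  type Entry = (Array[Int], Double)            // (coordinates, value), as in CreateRddInput
  type Factors = Array[Array[Array[Double]]]   // factors(m)(index) = row of U^{(m+1)}

  /** One sweep of updateFactor(B, n): each block solves the rows it owns
    * via the closed form (5.6) and the driver merges the collected rows
    * into U^{(n)}. Assumes A^T A is positive definite (enough observed
    * entries per row); the thesis uses a pseudo-inverse in general. */
  def updateFactor(block: RDD[(Int, Iterable[Entry])], n: Int,
                   bcast: Broadcast[Factors], r: Int): Map[Int, Array[Double]] = {
    block.map { case (in, entries) =>
      val factors = bcast.value
      // A: one row per observed entry, the elementwise product of the
      // other factors' rows at that entry's coordinates (CP model).
      val a = entries.map { case (idx, _) =>
        val row = Array.fill(r)(1.0)
        for (m <- factors.indices if m != n - 1; f <- 0 until r)
          row(f) *= factors(m)(idx(m))(f)
        row
      }.toArray
      val b = entries.map(_._2).toArray
      // Normal equations (A^T A) u = A^T b, solved by RowSolver's Cholesky.
      val ata = Array.tabulate(r, r)((p, q) => a.map(row => row(p) * row(q)).sum)
      val atb = Array.tabulate(r)(p => a.zip(b).map { case (row, v) => row(p) * v }.sum)
      val noC = Array.ofDim[Double](r, r) // zero coupled term for a non-coupled mode
      (in, RowSolver.solveCoupledRow(ata, noC, atb))
    }.collect().toMap // "Collect the result and merge into U^{(n)}"
  }
}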
5.2.2 Scaling up to K tensors
The implementation of Algorithm 5 supports K N-mode tensors. In this case, the K
tensors have (K−1) coupled blocks CB^{(1)}, ..., CB^{(K−1)}. The algorithm checks which
mode of the main tensor is the coupled mode and applies the updateCoupledFactor()
function with the corresponding coupled blocks.
Function updateCoupledFactor(CB, n)
Output: U^{(n)}
1   CB.map(
2       ⟨i, (i_1, ..., i_N, b_i)⟩ ← B1_i
3       A ← U^{(−n)}
4       ⟨i, (i_1, j_2, d_i)⟩ ← B2_i
5       C ← V^{(2)}
6       Compute u^{(n)}_i by Equation (5.5)
7   )
8   Collect the result and merge into U^{(n)}
5.3 Performance Evaluation
SMF was implemented in Scala and tested on Apache Spark 1.6.0∗ with the
Yarn scheduler† from Hadoop 2.7.1. The experiments compare the performance
of SMF with existing distributed algorithms to assess the following questions. 1)
How scalable is SMF with respect to the number of observations and the number
of machines? 2) How fast does SMF converge? 3) What level of accuracy does
SMF achieve? and 4) How does the closed form solution perform compared to the
widely used gradient-based methods?
All the experiments were executed on a cluster of 9 nodes, each having 2.8GHz
CPU with 8 cores and 32GB RAM. Since SALS (Shin & Kang 2014) and SCouT
(Jeon et al. 2016) were shown to be significantly better than FlexiFact (Beutel et al.
2014), comparisons with SALS and SCouT are included and the FlexiFact results are
discarded. CMTF-OPT (Acar, Kolda & Dunlavy 2011) was also run on one of the
nodes. Publicly available 22.5M (i.e., 22.5 million observations) movie ratings with
movie genre information in MovieLens (Harper & Konstan 2015), 100M Netflix's
movie ratings‡, and 718M song ratings coupled with song-artist-album information
from the Yahoo! Music dataset§ are used. All ratings are from 0.2 to 1, equivalent
to 1 to 5 stars. When evaluated as a missing value completion (rating recommendation)
problem, about 80% of the observed data was used for training and the rest for
testing. The details of the datasets are summarized in Table 5.1.

∗ Apache Spark: http://spark.apache.org/
† Yarn scheduler: https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
‡ Netflix's movie ratings dataset: http://www.netflixprize.com/
§ Yahoo! Research Webscope's Music User Ratings of Musical Artists datasets: http://research.yahoo.com/

Algorithm 5: SMF for K tensors, where the first mode of X_1 is coupled with the first mode of X_2, ..., and the (K−1)-th mode of X_1 is joined with the first mode of X_K
Input : cached CB^{(1)}, ..., CB^{(K−1)}, B1^{(2)}, ..., B1^{(N_1)}, ..., BK^{(2)}, ..., BK^{(N_K)}, E
Output: U1^{(1)}, ..., U1^{(N_1)}, ..., UK^{(2)}, ..., UK^{(N_K)}
1   Initialize L by a small number
2   Randomly initialize all factors
3   repeat
4       Broadcast all factors
5       PreL = L
6       foreach tensor k ∈ [1, K] do
7           if k is the main tensor then
8               foreach mode n ∈ [1, N_k] do
9                   if n is a coupled mode then
10                      Uk^{(n)} ← updateCoupledFactor(CB^{(n)}, n)
11                  else
12                      Uk^{(n)} ← updateFactor(Bk^{(n)}, n)
13          else
14              foreach mode n ∈ [2, N_k] do
15                  Uk^{(n)} ← updateFactor(Bk^{(n)}, n)
16      Compute L following Equation (5.1)
17  until (PreL − L)/PreL < E
Table 5.1 : Data for experiments

Dataset         Tensor   |Ω|train   |Ω|test   I1        I2          I3
MovieLens       X        18M        4.5M      34,208    247,753     7
                Y        649K       -         34,208    19          -
Netflix         X        80M        20M       480,189   17,770      2,182
Yahoo! Music    Y        700M       18M       136,736   1,823,179   -
                X        136K       -         136,736   20,543      9,442
Synthetic (1)   X        1M         -         100K      100K        100K
                Y        100        -         100K      100K        -
Synthetic (2)   X        10M        -         100K      100K        100K
                Y        1K         -         100K      100K        -
Synthetic (3)   X        100M       -         100K      100K        100K
                Y        10K        -         100K      100K        -
Synthetic (4)   X        1B         -         100K      100K        100K
                Y        100K       -         100K      100K        -
5.3.1 Scalability
To validate the scalability of SMF, four synthetic datasets are generated with
different observation densities as summarized in Table 5.1. The scalability of SMF is
measured with respect to the number of observations and machines.
A. Observation scalability
Figure 5.3 compares the observation scalability of SALS, CMTF-OPT and SMF (in
the case of TF of the main tensor X; Figure 5.3a) and of SCouT, CMTF-OPT and
SMF (for CMTF of the tensor X and the additional matrix Y; Figure 5.3b). As
shown in Figure 5.3a, the performance of SALS is similar to SMF's when the number
of observations is between 1M and 10M. However, SALS performs worse as the
observed data size becomes larger. Specifically, when the observed data increases
10x (i.e., 10 times) from 100M to 1B, SALS's running time per iteration grows
10.36x, 151% of SMF's slowdown rate. As for CMTF, SMF significantly outperforms
SCouT, being 73x faster. CMTF-OPT achieves similar performance on the 1M
dataset, but it runs out of memory when dealing with the larger datasets.

Figure 5.3 : Observation scalability. SMF scales up better as the number of known data increases. In (a) only X is factorized; SMF is 4.17x and 2.76x faster than SALS in the case of 1B and 100M observations, respectively. In (b) both X and Y are jointly analyzed; SMF consistently outperforms SCouT at a rate of over 70x in all test cases. In both cases, CMTF-OPT runs out of memory for datasets larger than 1M.
B. Machine scalability
The increase in the speed of each algorithm as more computational power is
added to the cluster is measured. Synthetic (3) with 100M observations is used in
this test. The speedup rate is calculated by normalizing the time T_3 each algorithm
takes on three machines by its time T_M on M machines, i.e., speedup(M) = T_3 / T_M
(in this test M is 6 and 9). In general, SMF speeds up at a rate similar to SALS
and at a much higher rate than SCouT.
5.3.2 Convergence Speed
This section investigates how fast SMF converges, benchmarking it against both
SALS and SCouT on the three real-world datasets. As observed in Figures 5.5,
5.6 and 5.7, as the tensor size grows from MovieLens (Figure 5.5) to Netflix
(Figure 5.6) and to Yahoo! Music (Figure 5.7), the advantages of SMF over SALS
increase. SMF eliminates all data streaming from local disk to memory, which
improves its efficiency significantly, especially for large-scale data. Specifically,
SMF outperforms SALS by 3.8x in the case of 700M observations (Yahoo! Music)
and by 2x for 80M observations (Netflix). This result, in combination with the fact
that it is 4.17x faster than SALS on the 1B synthetic dataset, strongly suggests
that SMF is the fastest tensor factorization algorithm for large-scale datasets.

Figure 5.4 : Machine scalability with the 100M synthetic dataset. In (a) only X is factorized. In (b) both X and Y are jointly analyzed. SMF speeds up at a rate similar to SALS and at a much higher rate than SCouT.

Figure 5.5 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in MovieLens.
While Figures 5.5, 5.6 and 5.7 show single tensor factorization results, Figures
5.8 and 5.9 provide further empirical evidence that SMF is able to perform
lightning-fast coupled matrix tensor factorization for the joint analysis of
heterogeneous datasets. In this case, only MovieLens and Yahoo! Music are used,
as the Netflix dataset does not have side information. SMF surpasses SCouT, the
currently fastest CMTF algorithm, by 17.8x on the Yahoo! Music dataset.

Figure 5.6 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Netflix.

Figure 5.7 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Yahoo! Music.

Figure 5.8 : Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in MovieLens.

Figure 5.9 : Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in Yahoo! Music.

Table 5.2 : Accuracy of each algorithm on the real-world datasets. Decomposed factors are used to predict missing entries. The accuracy is measured with RMSE on the test sets.

Algorithm   TF                                   CMTF
            MovieLens   Netflix   Yahoo          MovieLens   Yahoo
SALS        0.1695      0.1751    0.2396         -           -
SCouT       -           -         -              0.7110      0.7365
SMF         0.1685      0.1749    0.2352         0.1676      0.2349
5.3.3 Accuracy
In addition to having the fastest convergence speed, SMF also recovers missing
entries with the highest accuracy. Table 5.2 lists all the prediction results on the test
sets. Note that SALS does not support CMTF and SCouT does not support TF. In
this test, SALS is almost as good as SMF for missing entry recovery while SCouT
performs much worse. This also shows the power of using coupled information in
the factorization processes.
5.3.4 Optimization
Different optimizers are benchmarked in this section. Instead of computing each
row of a factor in closed form, as in line 4 of updateFactor() or line 6 of
updateCoupledFactor(), two gradient-based optimizers are used to update it:
nonlinear conjugate gradient with the Moré-Thuente line search (Moré & Thuente
1994), called SMF-NCG, and gradient descent, called SMF-GD. All the optimizers
stop when ε < 10⁻⁴.
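For contrast with the closed form, a gradient-descent row update in the spirit of the SMF-GD variant might look like the following Scala sketch; the fixed learning rate, tolerance and iteration cap are assumed values, not the thesis's settings.

object GdRowUpdate {
  /** Gradient-descent solve of min_u ||A u - b||^2 given precomputed
    * A^T A and A^T b: a sketch of a gradient-based row update. */
  def solveRow(ata: Array[Array[Double]], atb: Array[Double],
               eta: Double = 1e-3, eps: Double = 1e-4,
               maxIters: Int = 100000): Array[Double] = {
    val r = atb.length
    val u = Array.fill(r)(0.0)
    var it = 0
    var gradNorm = Double.MaxValue
    while (gradNorm > eps && it < maxIters) {
      // gradient of ||Au - b||^2 is 2 (A^T A u - A^T b)
      val grad = Array.tabulate(r) { p =>
        2.0 * ((0 until r).map(q => ata(p)(q) * u(q)).sum - atb(p))
      }
      for (p <- 0 until r) u(p) -= eta * grad(p)
      gradNorm = math.sqrt(grad.map(g => g * g).sum)
      it += 1
    }
    u
  }
}

Unlike the one-shot Cholesky solve, this inner loop must itself iterate to the tolerance, which is consistent with the slower convergence observed for SMF-GD below.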
Figure 5.10 : Benchmark of different optimization methods for (a) MovieLens, (b) Netflix and (c) Yahoo! Music datasets. In all cases, SMF-CF quickly reaches the lowest RMSE.
These results are compared with the closed form solution (hereafter called SMF-CF),
as displayed in Figure 5.10. The results demonstrate that the closed form solution
generally converges to the lowest RMSE faster than conventional gradient-based
optimizers. The results in Table 5.3 also confirm that SMF-CF achieves the highest
precision on the test sets.

Table 5.3 : Accuracy of predicting missing entries on real-world datasets with different optimizers. SMF's optimized solver (SMF-CF) achieves the lowest tested RMSE under the same stopping condition.

Optimizer   MovieLens   Netflix   Yahoo
SMF-CF      0.1672      0.1749    0.2352
SMF-NCG     0.1783      0.1904    0.2368
SMF-GD      0.1716      0.1765    0.2387
5.4 Contribution and Summary
This chapter addresses contribution #4 of this thesis by proposing a novel scalable
factorization algorithm and demonstrating its high impact on the cross-domain
dataset factorization problem. Given large-scale datasets, existing distributed
algorithms for the joint analysis of multi-dimensional data generated from multiple
sources decompose them on several computing nodes following the MapReduce
paradigm. Improving the performance of MapReduce-based factorization algorithms
as the observed data grows larger is a prerequisite for any cross-domain learning
from big data. This setting requires an even more efficient solution that not only
reduces communication overhead but also optimizes the factors faster.
This chapter introduces the SMF algorithm to analyze coupled big datasets across
domains. It has two key features that enable large-scale factorization of the coupled
datasets. Firstly, the SMF design, based on Apache Spark, eliminates the huge
data transmission overhead, giving it the smallest communication cost. Secondly,
its optimized solver converges faster. This design, together with the optimized
solver, stabilizes the algorithm's performance as the data size increases. As a result,
SMF remains exceptionally efficient as the data grows.
Extensive experiments show that SMF scales the best, compared to the other
existing methods, with respect to the number of tensors, observations, and ma-
chines. They also demonstrate SMF’s effectiveness in terms of convergence speed
and accuracy on real-world datasets. When dealing with one billion observed entries,
SMF outperforms the currently fastest coupled matrix tensor factorization and ten-
sor factorization by 17.8 and 3.8 times, respectively. Compellingly, SMF achieves
this speed with the highest accuracy. All these advantages suggest SMF design
principles should be the building blocks for large-scale multimodal factorization
methods.
Chapter 6
Conclusion
E-commerce businesses are increasingly dependent on recommendation systems to
introduce personalized services and products to targeted customers. Providing useful
recommendations requires sufficient knowledge about user preferences and product
(item) characteristics. Given the current abundance of available data across do-
mains, achieving a thorough understanding of the relationship between users and
items can result in more collaborative filtering power and lead to higher recommen-
dation accuracy.
However, how to effectively utilize different types of knowledge obtained across
domains is still a challenging problem for the data mining research community. Cur-
rent research in cross-domain recommendation only uses explicit similarities across
domains to improve recommendation accuracy. Moreover, all the existing algorithms
assume coupled datasets across domains share identical coupled factors, losing the
capability to capture the actual explicit similarities among them.
The joint analysis of datasets from multiple domains can provide additional in-
sights into the coupled dimension. Nevertheless, performing this joint factorization
of the coupled cross-domain datasets often incurs very heavy costs in terms of
computation, communication, and storage if the proposed methods are not well designed.
One critical weakness of MapReduce-based algorithms is that whenever a node needs
data to process, the data must be transferred from an isolated distributed file system
to the node. The iterative nature of tensor factorization requires data and factors to
be distributed over and over again, incurring enormous communication overhead.
6.1 Research questions and contributions
Motivated by these gaps, this thesis proposes several ideas to improve the
performance of cross-domain recommendation. Chapter 2 discusses the main factors
that limit the performance of the existing algorithms, as summarized below.
• Existing joint factorization models assume coupled datasets share identical
coupled factors, failing to capture the actual explicit similarities across do-
mains.
• Current algorithms use only explicit similarities for joint analysis or transfer
learning from cross-domain coupled datasets. They fail to exploit the implicit
similarities among them.
• MapReduce-based distributed factorization algorithms incur a huge communi-
cation overhead, reducing their effectiveness in handling large-scale datasets.
To overcome these problems, this research investigates the following questions:
Q1. Is it possible to propose appropriate methods of sharing explicit similarities
between cross-domain datasets to understand their actual relationship?
This question is answered by proposing a new objective function to better
utilize explicit similarities in chapter 3.
Q2. How to share implicit similarities in non-coupled dimensions across domains to
improve recommendation accuracy?
This question is answered by proposing a novel algorithm to discover and ex-
ploit implicit similarities in section 4.2.2. A method to combine both implicit
and explicit similarities is proposed in section 4.2.1 to improve recommenda-
tion performance.
Q3. How to improve the scalability of the factorization process such that it is able
to scale up to a different number of coupled tensors, tensor modes, tensor
dimensions and billions of observations?
A scalable factorization model is introduced to answer this question in chapter
5. The proposed model significantly reduces communication overhead, scaling
up better compared to existing distributed algorithms.
By investigating these research questions, this thesis makes four knowledge con-
tributions, summarized below.
Contribution #1. A new objective function to enable each dataset to
have its discriminative factor on the coupled mode, capturing the actual
explicit similarities across domains
Section 3.2 describes the proposed objective function to optimize with respect to
every single tensor and matrix. Differing from algorithms with a traditional objective
function which forces shared modes among tensors to have identical factors, the
proposed method enables each tensor to have its own discriminative factor on the
coupled mode and regularizes them to be as close as possible. Thus, it is capable of
sharing the accurate explicit similarities among them, improving recommendation
accuracy. In addition, a theoretical proof and experimental evidence confirm that
the algorithm converges to an optimum. Experiments on both real and synthetic
datasets demonstrate that the proposed algorithm outperforms the current state-of-
the-art algorithms in predicting missing entries for recommendations.
Contribution #2. A novel algorithm to discover implicit similarities
in non-coupled mode and align them across domains
Section 4.2.2 explains a method to discover implicit similarities from latent fac-
tors across domains based on matrix tri-factorization. The method captures the
implicit similarities on the non-coupled dimension. To this end, it uses the middle
matrix of the tri-factorization to match the unique factors. Based on the identified
matches, it aligns the matched unique factors to transfer cross-domain implicit sim-
ilarities. The empirical results demonstrate that these implicit similarities provide
other insights into the underlying structure. Thus, utilizing them effectively helps
to improve cross-domain recommendations.
Contribution #3. A matrix factorization-based model to utilize both
explicit and implicit similarities for cross-domain recommendation accu-
racy improvement
Section 4.2.1 presents a different way to share explicit similarities across do-
mains, and the use of both similarities improves recommendation performance. This
is based on the fact that coupled datasets which share the same coupled dimension
indicate a strong correlation in relation to the coupled factors. Nevertheless, they
also have their unique features characterized by their domain. For this reason,
both common and unique parts are included in the coupled factors to better cap-
ture explicit similarities among different datasets. Moreover, the proposed method
also utilizes the implicit similarities on the non-coupled dimension. This research
is the first to propose the transfer of both explicit and implicit knowledge in cou-
pled and non-coupled dimensions and thus further improves the recommendation.
Validated on real-world datasets, the proposed approach outperforms the state-of-the-art algorithms by more than two times in terms of recommendation accuracy.
These empirical results confirm the potential of utilizing both explicit and implicit
similarities for making cross-domain recommendations.
Contribution #4. A scalable factorization model based on the Spark
framework to scale up the factorization process to the number of tensors,
tensor modes, tensor dimensions and billions of observations
Section 5.2 describes a scalable factorization method to improve the performance
of MapReduce-based algorithms as the observed data becomes larger. As data grows,
it requires an even more efficient solution, especially for reducing the communication
overhead. The proposed distributed lightning-fast and scalable algorithm incurs the
smallest communication overhead compared to all methods proposed in the litera-
ture. It is equipped with an optimized solver which reduces the overall time complex-
ity. These key features stabilize the proposed method’s performance when the data
size increases, as confirmed by experiments with 1 billion known entries. Vali-
dated on real-world datasets, the proposed method outperforms the state-of-the-art
distributed tensor factorization and coupled matrix tensor factorization algorithms
by 3.8 and 17.8 times, respectively. Furthermore, the more interesting observation
is that it achieves this fast decomposition with the highest accuracy on the test sets.
6.2 Future research directions
This section discusses a few research plans to extend the proposed methods in
this thesis and to overcome their limitations.
6.2.1 Investigating explicit and implicit similarities in imbalanced datasets
The proposed methods in this thesis discover and use both explicit and implicit
similarities in coupled datasets with balanced ratings. However, there are many
cases where the ratings are imbalanced. For example, people are likely to provide
feedback once they have had a negative experience with a product or service. Thus,
the number of negative ratings may be much higher than the number of positive ones.
Investigating this imbalance issue by extending the proposed methods will be an
essential contribution.
6.2.2 Extending the use of explicit and implicit similarities to high di-
mensional tensors
The proposed methods using explicit and implicit similarities across domains
work with rating matrices. As data is currently being generated at an unprecedented
speed, a growing number of high dimensional tensors are becoming available. The joint analysis
of these high dimensional tensors will help to provide a thorough understanding
of the underlying structure, thus improving recommendation accuracy. Discovering
the explicit and implicit similarities between them is the first step. Designing an
algorithm to extend the methods proposed in this thesis which can handle high
dimensional tensors will significantly enrich the data mining research community.
6.2.3 Extending the proposed factorization model to handle online rat-
ings
This thesis presents a scalable factorization method to scale up the factorization
process on offline coupled datasets. For e-commerce websites, ratings are provided in
real time. Updating the factors with online ratings as they arrive will help achieve
accurate recommendations that reflect the new ratings entering the system. Developing
a scalable online factorization method will benefit an increasing number of businesses.
6.2.4 Investigating the use of explicit and implicit similarities in Factor-
ization Machines
Factorization machines (FMs) (Rendle 2010) are more general models than MF.
Instead of using user and item indexes as in MF, FMs deal with both user and item
features. It will be interesting to analyze the user and item features to discover the
explicit and implicit similarities among these features. Doing so would provide the
data mining community with new factorization machines that can handle coupled
datasets.
6.3 Conclusion
In summary, this thesis enriches the data mining research community by propos-
ing algorithms to effectively utilize different types of knowledge obtained across
domains. Also, it improves the scalability of the factorization process by developing
a factorization model capable of scaling up to a different number of tensors, ten-
sor modes, tensor dimensions and billions of observations. The proposed methods
are thus applicable to discover explicit and implicit similarities across large-scale
datasets for cross-domain recommendations.
Bibliography
Acar, E., Dunlavy, D. M., Kolda, T. G. & Mørup, M. (2011), 'Scalable tensor factor-
izations for incomplete data', Chemometrics and Intelligent Laboratory Systems
106(1), 41–56. Multiway and Multiset Data Analysis.
Acar, E., Kolda, T. G. & Dunlavy, D. M. (2011), All-at-once optimization for cou-
pled matrix and tensor factorizations, in 'KDD Workshop on Mining and Learning
with Graphs (arXiv:1105.3422v1)’.
Baltrušaitis, T., Ahuja, C. & Morency, L. (2018), 'Multimodal machine learning:
A survey and taxonomy’, IEEE Transactions on Pattern Analysis and Machine
Intelligence pp. 1–1.
Bell, R. M. & Koren, Y. (2007), ‘Lessons from the netflix prize challenge’, SIGKDD
Explor. Newsl. 9(2), 75–79.
Beutel, A., Talukdar, P. P., Kumar, A., Faloutsos, C., Papalexakis, E. E. & Xing,
E. P. (2014), Flexifact: Scalable flexible factorization of coupled tensors on
hadoop, in ‘Proceedings of the SIAM International Conference on Data Mining’,
SDM’14, pp. 109–117.
Bhargava, P., Phan, T., Zhou, J. & Lee, J. (2015), Who, what, when, and where:
Multi-dimensional collaborative recommendations using tensor factorization on
sparse user-generated data, in ‘Proceedings of the 24th International Conference
on World Wide Web’, WWW’15, pp. 130–140.
Boyd, S. & Vandenberghe, L. (2004), Convex optimization, Cambridge University
Press.
Chen, B., Li, F., Chen, S., Hu, R. & Chen, L. (2017), ‘Link prediction based on
non-negative matrix factorization’, PLOS ONE 12, 1–18.
Chen, W., Hsu, W. & Lee, M. L. (2013), Making recommendations from multiple
domains, in ‘Proceedings of the 19th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining’, KDD’13, pp. 892–900.
Diao, Q., Qiu, M., Wu, C.-Y., Smola, A. J., Jiang, J. & Wang, C. (2014), Jointly
modeling aspects, ratings and sentiments for movie recommendation (jmars), in
‘Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD’14, pp. 193–202.
Ding, C., Li, T., Peng, W. & Park, H. (2006), Orthogonal nonnegative matrix t-
factorizations for clustering, in ‘Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining’, KDD’06, pp. 126–
135.
Dunlavy, D. M., Kolda, T. G. & Acar, E. (2011), ‘Temporal link prediction using
matrix and tensor factorizations’, ACM Trans. Knowl. Discov. Data 5(2), 10:1–
10:27.
Ekstrand, M. D., Riedl, J. T. & Konstan, J. A. (2011), ‘Collaborative filtering
recommender systems’, Found. Trends Hum.-Comput. Interact. 4(2), 81–173.
Elkahky, A. M., Song, Y. & He, X. (2015), A multi-view deep learning approach for
cross domain user modeling in recommendation systems, in ‘Proceedings of the
24th International Conference on World Wide Web’, WWW ’15, pp. 278–288.
Ermis, B., Acar, E. & Cemgil, A. T. (2015), ‘Link prediction in heterogeneous
data via generalized coupled tensor factorization’, Data Min. Knowl. Discov.
29(1), 203–236.
Fang, X. & Pan, R. (2014), Fast dtt: A near linear algorithm for decomposing a
tensor into factor tensors, in ‘Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining’, KDD’14, pp. 967–976.
Farias, R. C., Cohen, J. E. & Comon, P. (2016), ‘Exploring multimodal data fu-
sion through joint decompositions with flexible couplings’, IEEE Transactions on
Signal Processing 64(18), 4830–4844.
Gao, S., Denoyer, L. & Gallinari, P. (2012), Link prediction via latent factor block-
model, in ‘Proceedings of the 21st International Conference on World Wide Web’,
WWW’12, pp. 507–508.
Gao, S., Luo, H., Chen, D., Li, S., Gallinari, P. & Guo, J. (2013), Cross-domain
recommendation via cluster-level latent factor model, in ‘Proceedings, Part II,
of the European Conference on Machine Learning and Knowledge Discovery in
Databases - Volume 8189’, ECML PKDD’13, pp. 161–176.
Gemulla, R., Nijkamp, E., Haas, P. J. & Sismanis, Y. (2011), Large-scale matrix
factorization with distributed stochastic gradient descent, in ‘Proceedings of the
17th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining’, KDD’11, pp. 69–77.
Harper, F. M. & Konstan, J. A. (2015), ‘The movielens datasets: History and
context’, ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19.
Harshman, R. A. (1970), ‘Foundations of the parafac procedure: Models and con-
ditions for an "explanatory" multi-modal factor analysis', UCLA working papers
in phonetics 16, 1–84.
He, R. & McAuley, J. (2016), Ups and downs: Modeling the visual evolution of
fashion trends with one-class collaborative filtering, in ‘Proceedings of the 25th
International Conference on World Wide Web’, WWW’16, pp. 507–517.
He, X., Liao, L., Zhang, H., Nie, L., Hu, X. & Chua, T.-S. (2017), Neural collab-
orative filtering, in ‘Proceedings of the 26th International Conference on World
Wide Web’, WWW ’17, pp. 173–182.
He, X., Zhang, H., Kan, M.-Y. & Chua, T.-S. (2016), Fast matrix factorization for
online recommendation with implicit feedback, in ‘Proceedings of the 39th Inter-
national ACM SIGIR Conference on Research and Development in Information
Retrieval’, SIGIR’16, pp. 549–558.
Hotelling, H. (1931), 'The generalization of Student's ratio', The Annals of Mathe-
matical Statistics.
Hsu, C., Yeh, M. & Lin, S. (2018), ‘A general framework for implicit and explicit
social recommendation’, IEEE Transactions on Knowledge and Data Engineering
pp. 1–1.
Hu, L., Cao, J., Xu, G., Cao, L., Gu, Z. & Zhu, C. (2013), Personalized recom-
mendation via cross-domain triadic factorization, in 'Proceedings of the 22nd
International Conference on World Wide Web’, WWW’13, pp. 595–606.
Hu, Y., Koren, Y. & Volinsky, C. (2008), Collaborative filtering for implicit feedback
datasets, in ‘Proceedings of the 2008 Eighth IEEE International Conference on
Data Mining’, ICDM’08, pp. 263–272.
Huang, H., Ding, C., Luo, D. & Li, T. (2008), Simultaneous tensor subspace selection
and clustering: The equivalence of high order svd and k-means clustering, in
‘Proceedings of the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD ’08, pp. 327–335.
Itskov, M. (2009), Tensor Algebra and Tensor Analysis for Engineers: With Appli-
cations to Continuum Mechanics, 2nd edn, Springer Publishing Company, Incor-
porated.
Iwata, T. & Koh, T. (2015), Cross-domain recommendation without shared users
or items by sharing latent vector distributions, in ‘Proceedings of the Eighteenth
International Conference on Artificial Intelligence and Statistics’, Vol. 38 of Pro-
ceedings of Machine Learning Research, pp. 379–387.
Jeon, B., Jeon, I., Sael, L. & Kang, U. (2016), Scout: Scalable coupled matrix-
tensor factorization - algorithm and discoveries, in ‘2016 IEEE 32nd International
Conference on Data Engineering’, ICDE’16, pp. 811–822.
Jiang, M., Cui, P., Chen, X., Wang, F., Zhu, W. & Yang, S. (2015), ‘Social recom-
mendation with cross-domain transferable knowledge’, IEEE Trans. on Knowl.
and Data Eng. 27(11), 3084–3097.
Jiang, M., Cui, P., Wang, F., Xu, X., Zhu, W. & Yang, S. (2014), Fema: Flexible
evolutionary multi-faceted analysis for dynamic behavioral pattern discovery, in
‘Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD ’14, pp. 1186–1195.
Jing, H., Liang, A., Lin, S. & Tsao, Y. (2014), A transfer probabilistic collective
factorization model to handle sparse data in collaborative filtering, in ‘2014 IEEE
International Conference on Data Mining (ICDM)', ICDM'14, pp. 250–
259.
Kang, U., Papalexakis, E., Harpale, A. & Faloutsos, C. (2012), Gigatensor: Scaling
tensor analysis up by 100 times - algorithms and discoveries, in ‘Proceedings of
the 18th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining’, KDD’12, pp. 316–324.
Karatzoglou, A., Amatriain, X., Baltrunas, L. & Oliver, N. (2010), Multiverse rec-
ommendation: N-dimensional tensor factorization for context-aware collaborative
filtering, in ‘Proceedings of the Fourth ACM Conference on Recommender Sys-
tems’, RecSys’10, pp. 79–86.
Karatzoglou, A. & Hidasi, B. (2017), Deep learning for recommender systems, in
‘Proceedings of the Eleventh ACM Conference on Recommender Systems’, RecSys
’17, pp. 396–397.
Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. (2006),
Learning systems of concepts with an infinite relational model, in ‘Proceedings
of the 21st National Conference on Artificial Intelligence - Volume 1’, AAAI’06,
AAAI Press, pp. 381–388.
Kiraly, F. J., Theran, L. & Tomioka, R. (2015), ‘The algebraic combinatorial ap-
proach for low-rank matrix completion’, J. Mach. Learn. Res. 16(1), 1391–1436.
Kolda, T. G. & Bader, B. W. (2009), ‘Tensor decompositions and applications’,
SIAM Rev. 51(3), 455–500.
Konstan, J. A. (2004), ‘Introduction to recommender systems: Algorithms and eval-
uation’, ACM Trans. Inf. Syst. 22(1), 1–4.
Koren, Y. (2009), Collaborative filtering with temporal dynamics, in ‘Proceedings
of the 15th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining’, KDD’09, pp. 447–456.
Koren, Y. & Bell, R. (2011), Advances in Collaborative Filtering, Springer US,
pp. 145–186.
Koren, Y., Bell, R. & Volinsky, C. (2009), ‘Matrix factorization techniques for rec-
ommender systems’, Computer 42(8), 30–37.
Lahat, D., Adali, T. & Jutten, C. (2015a), ‘Multimodal data fusion: An overview of
methods, challenges, and prospects’, Proceedings of the IEEE 103(9), 1449–1477.
Lahat, D., Adali, T. & Jutten, C. (2015b), ‘Multimodal data fusion: An overview of
methods, challenges, and prospects’, Proceedings of the IEEE 103(9), 1449–1477.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B. & Ng, A. Y. (2011), On
optimization methods for deep learning, in ‘Proceedings of the 28th International
Conference on International Conference on Machine Learning’, ICML’11, pp. 265–
272.
Lee, D. D. & Seung, H. S. (2000), Algorithms for non-negative matrix factoriza-
tion, in ‘Proceedings of the 13th International Conference on Neural Information
Processing Systems’, NIPS’00, pp. 535–541.
Li, B., Yang, Q. & Xue, X. (2009a), Can movies and books collaborate?: Cross-
domain collaborative filtering for sparsity reduction, in ‘Proceedings of the 21st
International Jont Conference on Artifical Intelligence’, IJCAI’09, pp. 2052–2057.
Li, B., Yang, Q. & Xue, X. (2009b), Transfer learning for collaborative filtering via a
rating-matrix generative model, in ‘Proceedings of the 26th Annual International
Conference on Machine Learning’, ICML’09, pp. 617–624.
Li, C.-Y. & Lin, S.-D. (2014), Matching users and items across domains to im-
prove the recommendation quality, in ‘Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining’, KDD’14,
pp. 801–810.
Liavas, A. P. & Sidiropoulos, N. D. (2015), ‘Parallel algorithms for constrained tensor
factorization via alternating direction method of multipliers’, IEEE Transactions
on Signal Processing 63(20), 5450–5463.
Lin, Y.-R., Sun, J., Castro, P., Konuru, R., Sundaram, H. & Kelliher, A. (2009),
Metafac: Community discovery via relational hypergraph factorization, in ‘Pro-
ceedings of the 15th ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining’, KDD’09, pp. 527–536.
Liu, W., Chan, J., Bailey, J., Leckie, C. & Ramamohanarao, K. (2013), Mining
labelled tensors by discovering both their common and discriminative subspaces,
in ‘Proceedings of the 2013 SIAM International Conference on Data Mining’,
pp. 614–622.
Liu, W., Kan, A., Chan, J., Bailey, J., Leckie, C., Pei, J. & Kotagiri, R. (2012),
On compressing weighted time-evolving graphs, in ‘Proceedings of the 21st ACM
International Conference on Information and Knowledge Management’, CIKM’12,
pp. 2319–2322.
Liu, Y.-F., Hsu, C.-Y. & Wu, S.-H. (2015), Non-linear cross-domain collaborative fil-
tering via hyper-structure transfer, in ‘Proceedings of the 32Nd International Con-
ference on International Conference on Machine Learning - Volume 37’, ICML’15,
pp. 1190–1198.
Liu, Y. & Shang, F. (2013), ‘An efficient matrix factorization method for tensor
completion’, IEEE Signal Processing Letters 20(4), 307–310.
Loni, B., Shi, Y., Larson, M. & Hanjalic, A. (2014), Cross-domain collaborative
filtering with factorization machines, in ‘Proceedings of the 36th European Con-
ference on IR Research on Advances in Information Retrieval - Volume 8416’,
ECIR 2014, pp. 656–661.
Lops, P., de Gemmis, M. & Semeraro, G. (2011), Content-based Recommender Sys-
tems: State of the Art and Trends, Springer US, pp. 73–105.
Menon, A. K. & Elkan, C. (2011), Link prediction via matrix factorization, in ‘Pro-
ceedings of the 2011 European Conference on Machine Learning and Knowledge
Discovery in Databases - Volume Part II’, ECML PKDD’11, pp. 437–452.
Moré, J. J. & Thuente, D. J. (1994), 'Line search algorithms with guaranteed suffi-
cient decrease’, ACM Trans. Math. Softw. 20(3), 286–307.
Moreno, O., Shapira, B., Rokach, L. & Shani, G. (2012), Talmud: Transfer learning
for multiple domains, in ‘Proceedings of the 21st ACM International Conference
on Information and Knowledge Management’, CIKM’12, pp. 425–434.
Pan, W. (2016), ‘A survey of transfer learning for collaborative recommendation
with auxiliary data’, Neurocomput. 177(C), 447–453.
Pan, W., Liu, N. N., Xiang, E. W. & Yang, Q. (2011), Transfer learning to predict
missing ratings via heterogeneous user feedbacks, in ‘Proceedings of the Twenty-
Second International Joint Conference on Artificial Intelligence - Volume Volume
Three’, IJCAI’11, pp. 2318–2323.
Pan, W., Xiang, E. W., Liu, N. N. & Yang, Q. (2010), Transfer learning in collabo-
rative filtering for sparsity reduction, in ‘Proceedings of the Twenty-Fourth AAAI
Conference on Artificial Intelligence’, AAAI’10, pp. 230–235.
Papalexakis, E. E., Faloutsos, C., Mitchell, T. M., Talukdar, P. P., Sidiropoulos,
N. D. & Murphy, B. (2014), Turbo-smt: Accelerating coupled sparse matrix-
tensor factorizations by 200x, in ‘Proceedings of the 2014 SIAM International
Conference on Data Mining’, SDM’14, pp. 118–126.
Papalexakis, E. E., Faloutsos, C. & Sidiropoulos, N. D. (2012), Parcube: Sparse
parallelizable tensor decompositions, in ‘Proceedings of the 2012 European Con-
ference on Machine Learning and Knowledge Discovery in Databases’, ECML
PKDD’12, pp. 521–536.
Papalexakis, E. E., Sidiropoulos, N. D. & Bro, R. (2013), ‘From k-means to higher-
way co-clustering: Multilinear decomposition with sparse latent factors’, Trans.
Sig. Proc. 61(2), 493–506.
Park, N., Oh, S. & Kang, U. (2017), Fast and scalable distributed boolean tensor
factorization, in ‘2017 IEEE 33rd International Conference on Data Engineering
(ICDE)’, ICDE’17, pp. 1071–1082.
Pazzani, M. J. & Billsus, D. (2007), The adaptive web, Springer-Verlag, Berlin,
Heidelberg, chapter Content-based Recommendation Systems, pp. 325–341.
Perozzi, B., Schueppert, M., Saalweachter, J. & Thakur, M. (2016), When recom-
mendation goes wrong: Anomalous link discovery in recommendation networks, in
'Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD’16, pp. 569–578.
Rendle, S. (2010), Factorization machines, in ‘Proceedings of the 2010 IEEE Inter-
national Conference on Data Mining’, ICDM’10, pp. 995–1000.
Rennie, J. D. M. & Srebro, N. (2005), Fast maximum margin matrix factorization
for collaborative prediction, in 'Proceedings of the 22nd International Conference
on Machine Learning’, ICML’05, pp. 713–719.
Resnick, P. & Varian, H. R. (1997), ‘Recommender systems’, Commun. ACM
40(3), 56–58.
Sachan, M. & Srivastava, S. (2013), Collective matrix factorization for co-clustering,
in 'Proceedings of the 22nd International Conference on World Wide Web',
WWW’13, pp. 93–94.
Sael, L., Jeon, I. & Kang, U. (2015), ‘Scalable tensor mining’, Big Data Res. 2(2), 82–
86.
Schafer, J. B., Frankowski, D., Herlocker, J. & Sen, S. (2007), The adaptive
web, Springer-Verlag, Berlin, Heidelberg, chapter Collaborative Filtering Rec-
ommender Systems, pp. 291–324.
Shi, J., Qiu, Y., Minhas, U. F., Jiao, L., Wang, C., Reinwald, B. & Özcan, F.
(2015), ‘Clash of the titans: Mapreduce vs. spark for large scale data analytics’,
Proc. VLDB Endow. 8(13), 2110–2121.
Shi, Y., Larson, M. & Hanjalic, A. (2013), ‘Mining contextual movie similarity with
matrix factorization for context-aware recommendation’, ACM Trans. Intell. Syst.
Technol. 4(1), 16:1–16:19.
Shin, K. & Kang, U. (2014), Distributed methods for high-dimensional and large-
scale tensor factorization, in ‘2014 IEEE International Conference on Data Min-
ing’, ICDM’14, pp. 989–994.
Shin, K., Sael, L. & Kang, U. (2017), ‘Fully scalable methods for distributed tensor
factorization’, IEEE Trans. on Knowl. and Data Eng. 29(1), 100–113.
Singh, A. P. & Gordon, G. J. (2008), Relational learning via collective matrix fac-
torization, in ‘Proceedings of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining’, KDD’08, pp. 650–658.
Sun, Z., Li, T. & Rishe, N. (2010), Large-scale matrix factorization using mapreduce,
in ‘2010 IEEE International Conference on Data Mining Workshops’, pp. 1242–
1248.
Tan, S., Bu, J., Qin, X., Chen, C. & Cai, D. (2014), ‘Cross domain recommendation
based on multi-type media fusion’, Neurocomput. 127.
Tang, J., Wu, S., Sun, J. & Su, H. (2012), Cross-domain collaboration recommen-
dation, in ‘Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining’, KDD’12, pp. 1285–1293.
Töscher, A., Jahrer, M. & Legenstein, R. (2008), Improved neighborhood-based al-
gorithms for large-scale recommender systems, in 'Proceedings of the 2nd KDD
Workshop on Large-Scale Recommender Systems and the Netflix Prize Competi-
tion’, NETFLIX ’08, pp. 4:1–4:6.
Viswanath, B., Mislove, A., Cha, M. & Gummadi, K. P. (2009), On the evolution
of user interaction in Facebook, in 'Proceedings of the 2nd ACM Workshop on
Online Social Networks’, WOSN ’09, pp. 37–42.
Wang, B., Ester, M., Liao, Y., Bu, J., Zhu, Y., Guan, Z. & Cai, D. (2016), The
million domain challenge: Broadcast email prioritization by cross-domain recom-
mendation, in 'Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining’, KDD’16, pp. 1895–1904.
Wang, H., Wang, N. & Yeung, D.-Y. (2015), Collaborative deep learning for recom-
mender systems, in 'Proceedings of the 21st ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining’, KDD ’15, pp. 1235–1244.
Wang, J. J.-Y., Bensmail, H. & Gao, X. (2013), ‘Multiple graph regularized non-
negative matrix factorization’, Pattern Recogn. 46(10), 2840–2847.
Wang, Y., Liu, Y. & Yu, X. (2012a), Collaborative filtering with aspect-based opin-
ion mining: A tensor factorization approach, in ‘2012 IEEE 12th International
Conference on Data Mining’, ICDM’12, pp. 1152–1157.
Wang, Y., Liu, Y. & Yu, X. (2012b), Collaborative filtering with aspect-based opin-
ion mining: A tensor factorization approach, in ‘2012 IEEE 12th International
Conference on Data Mining’, ICDM’12, pp. 1152–1157.
Wang, Y., Tung, H.-Y., Smola, A. & Anandkumar, A. (2015), Fast and guaran-
teed tensor decomposition via sketching, in ‘Proceedings of the 28th International
Conference on Neural Information Processing Systems - Volume 1’, NIPS’15, MIT
Press, pp. 991–999.
Wei, Y., Zheng, Y. & Yang, Q. (2016), Transfer knowledge between cities, in ‘Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining’, KDD’16, pp. 1905–1914.
Wu, F., Yuan, Z. & Huang, Y. (2017), ‘Collaboratively training sentiment classifiers
for multiple domains’, IEEE Transactions on Knowledge and Data Engineering
29(7), 1370–1383.
Yang, D., He, J., Qin, H., Xiao, Y. & Wang, W. (2015), A graph-based recommenda-
tion across heterogeneous domains, in ‘Proceedings of the 24th ACM International
on Conference on Information and Knowledge Management’, CIKM’15, pp. 463–
472.
Yang, F., Shang, F., Huang, Y., Cheng, J., Li, J., Zhao, Y. & Zhao, R. (2017),
‘Lftf: A framework for efficient tensor analytics at scale’, Proc. VLDB Endow.
10(7), 745–756.
Yoo, J. & Choi, S. (2009), Weighted nonnegative matrix co-tri-factorization for
collaborative prediction, in ‘Proceedings of the 1st Asian Conference on Machine
Learning: Advances in Machine Learning’, ACML’09, pp. 396–411.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. (2010), Spark:
Cluster computing with working sets, in 'Proceedings of the 2nd USENIX Con-
ference on Hot Topics in Cloud Computing’, HotCloud’10, pp. 10–10.
Zhang, F., Yuan, N. J., Lian, D., Xie, X. & Ma, W.-Y. (2016), Collaborative knowl-
edge base embedding for recommender systems, in 'Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining’,
KDD’16, pp. 353–362.
Zhang, J., Chow, C. & Xu, J. (2017), ‘Enabling kernel-based attribute-aware matrix
factorization for rating prediction’, IEEE Transactions on Knowledge and Data
Engineering 29(4), 798–812.
Zhang, L., Zhang, K. & Li, C. (2008), A topical pagerank based algorithm for recom-
mender systems, in ‘Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval’, SIGIR’08,
pp. 713–714.
Zhang, Y. (2014), Browser-oriented universal cross-site recommendation and expla-
nation based on user browsing logs, in ‘Proceedings of the 8th ACM Conference
on Recommender Systems’, RecSys’14, pp. 433–436.
Zhang, Y., Xiong, Y., Kong, X. & Zhu, Y. (2016), Netcycle: Collective evolution in-
ference in heterogeneous information networks, in 'Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining’,
KDD’16, pp. 1365–1374.
Zhao, L., Pan, S. J., Xiang, E. W., Zhong, E., Lu, Z. & Yang, Q. (2013), Active
transfer learning for cross-system recommendation, in ‘Proceedings of the 27th
AAAI Conference on Artificial Intelligence’, AAAI’13, pp. 1205–1211.
Zheng, V. W., Cao, B., Zheng, Y., Xie, X. & Yang, Q. (2010), Collaborative filtering
meets mobile recommendation: A user-centered approach, in ‘Proceedings of the
Twenty-Fourth AAAI Conference on Artificial Intelligence’, AAAI’10, pp. 236–
241.
Zhu, L., Guo, D., Yin, J., Steeg, G. V. & Galstyan, A. (2016), ‘Scalable tempo-
ral latent space inference for link prediction in dynamic social networks’, IEEE
Transactions on Knowledge and Data Engineering 28(10), 2765–2777.
Zou, B., Li, C., Tan, L. & Chen, H. (2015), ‘Gputensor: Efficient tensor factorization
for context-aware recommendations’, Inf. Sci. 299(C), 159–177.