UNIVERSITY OF TECHNOLOGY SYDNEY
Faculty of Engineering and Information Technology
Scalable Factorization Model to Discover Implicit
and Explicit Similarities Across Domains
by
Duc Minh Quan Do
A Thesis Submitted for the Degree of
Doctor of Philosophy
Sydney, Australia
2018
UNIVERSITY OF TECHNOLOGY SYDNEY
SCHOOL OF SOFTWARE
The undersigned hereby certify that they have read this thesis entitled “Scalable
Factorization Model to Discover Implicit and Explicit Similarities Across
Domains” by Duc Minh Quan Do and that in their opinions it is fully adequate,
in scope and in quality, as a thesis for the degree of Doctor of Philosophy.
Date:
Principal Supervisor:
Dr. Wei Liu
Certificate of Original Authorship
I, Duc Minh Quan Do, declare that this thesis is submitted in fulfilment of the
requirements for the award of Doctor of Philosophy, in the School of Software, Faculty
of Engineering and Information Technology at the University of Technology Sydney.
This thesis is wholly my own work unless otherwise referenced or acknowledged. In
addition, I certify that all information sources and literature used are indicated in
the thesis. This document has not been submitted for qualifications at any other
academic institution. This research is supported by the Commonwealth Scientific
and Industrial Research Organisation (CSIRO) scholarship.
Date: 15/09/2018
Signature of Author:
Production Note: Signature removed prior to publication.
Acknowledgements
I am especially indebted to Dr. Wei Liu, who has provided continuous support,
advice and invaluable comments as I pursued my research goals. As my principal
supervisor, he has guided me more than I could ever give him credit for here. Many
thanks are also due to my co-supervisor, Dr. Fang Chen, for the many useful
discussions I have had with her.
I am grateful to everyone with whom I have had the pleasure of discussing this work.
Each of the members of my Candidature Assessment Committee has provided me
with a great deal of professional feedback about scientific research. This work would
not have been possible
without the financial support of the Commonwealth Scientific and Industrial Re-
search Organisation Scholarship (formerly National ICT Australia Scholarship) and
the UTS - International Research Scholarship (IRS).
Nobody has been more important to me in the pursuit of this thesis than the
members of my family. I would like to thank my parents, whose love and guidance
are with me in whatever I pursue and wherever I am. They are the ultimate role
models. Most importantly, I am grateful to my loving and supportive wife, Yen,
and my wonderful daughter, Ellen, for constant inspiration, patience, and faith.
For Ellen
My love for you will last forever.
Contents
Certificate iii
Acknowledgments iv
Dedication v
List of Figures xi
List of Tables xiv
List of Publications xv
Abbreviation xvi
Notation xvii
Abstract xix
1 Introduction 1
1.1 The research problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 The improper sharing of explicit similarities among coupled
datasets across domains reduces recommendation accuracy . . 3
1.1.2 Coupled datasets across domains also share implicit
similarities that provide other insights into their relationships 5
1.1.3 Joint analysis of heterogeneous datasets is costly . . . . . . . . 7
1.2 Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Knowledge contributions . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Research Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 A new objective function to enable each dataset to have its
own discriminative factor on the coupled mode, capturing
the actual explicit similarities across domains . . . . . . . . . 14
1.5.2 A novel algorithm to discover implicit similarities in
non-coupled mode and align them across domains . . . . . . . 16
1.5.3 A matrix factorization-based model to utilize both explicit
and implicit similarities for cross-domain recommendation
accuracy improvement . . . . . . . . . . . . . . . . . . . . . . 17
1.5.4 A scalable factorization model based on the Spark framework
to scale up the factorization process to the number of tensors,
tensor modes, tensor dimensions and billions of observations . 18
1.6 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Literature Review and Background 24
2.1 Data format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 Rating matrix (utility matrix) . . . . . . . . . . . . . . . . . . 25
2.1.2 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.3 Coupled datasets . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 Matrix Tri-Factorization . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Tensor Factorization . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Cross-domain Recommendation Systems . . . . . . . . . . . . . . . . . 32
2.3.1 Collective Matrix Factorization . . . . . . . . . . . . . . . . . 32
2.3.2 Coupled Matrix Tensor Factorization . . . . . . . . . . . . . . 33
2.3.3 CodeBook Transfer . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.4 Cluster-Level Latent Factor Model . . . . . . . . . . . . . . . 36
2.4 Factorization Methodologies . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Distributed Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Deep learning based recommendation systems . . . . . . . . . . . . . . 39
2.7 Research gaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Explicit Similarity Discovery 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 ASTen: the proposed Accurate Coupled Tensor Factorization model . 46
3.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.1 Data used in our experiments . . . . . . . . . . . . . . . . . . 50
3.4.2 Performance metric . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Contribution and Summary . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Implicit Similarity Discovery 58
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 HISF: the proposed Hidden Implicit Similarities Factorization Model . 61
4.2.1 Sharing common and preserving domain-specific coupled
latent variables to utilize explicit similarities . . . . . . . . . . 62
4.2.2 Aligning implicit similarities in non-coupled latent clusters
across domains . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Extension to three or more matrices . . . . . . . . . . . . . . . . . . . 76
4.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4.1 Data for the experiments . . . . . . . . . . . . . . . . . . . . . 79
4.4.2 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.3 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Contributions and Summary . . . . . . . . . . . . . . . . . . . . . . . 88
5 Scalable Multimodal Factorization 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 SMF: the proposed Scalable Multimodal Factorization . . . . . . . . . 93
5.2.1 SMF on Apache Spark . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Scaling up to K tensors . . . . . . . . . . . . . . . . . . . . . 102
5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.3.2 Convergence Speed . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 Contribution and Summary . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Conclusion 114
6.1 Research questions and contributions . . . . . . . . . . . . . . . . . . . 115
6.2 Future research directions . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.1 Investigating explicit and implicit similarities in imbalanced
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.2 Extending the use of explicit and implicit similarities to high
dimensional tensors . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.3 Extending the proposed factorization model to handle online
ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.4 Investigating the use of explicit and implicit similarities in
Factorization Machines . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
List of Figures
1.1 An example of implicit similarities . . . . . . . . . . . . . . . . . . . . 5
1.2 The research questions and their corresponding contributions . . . . . 13
2.1 An example of a movie rating matrix . . . . . . . . . . . . . . . . . . 26
2.2 An example of a tensor . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 An example of coupled rating matrices from Netflix and MovieLens
websites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 An example of a coupled matrix tensor from MovieLens website . . . 28
2.5 CANDECOMP/ PARAFAC (CP) decomposition . . . . . . . . . . . 31
2.6 Joint analysis of a coupled matrix tensor . . . . . . . . . . . . . . . . 34
2.7 Distributed factorization algorithms . . . . . . . . . . . . . . . . . . . 38
2.8 Multi-view deep neural network for cross-domain recommendation of
two datasets that have the same users. In this case, users of
both datasets share the same features of the left-most network. . . . . 40
3.1 Mean squared errors of test cases with synthetic data . . . . . . . . . 53
3.2 Mean squared error of factorizing the MovieLens dataset . . . . . . . 54
3.3 Mean squared error of factorizing Yahoo! Music dataset . . . . . . . . 55
4.1 The proposed factorization model to discover and share implicit
similarities across domains . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Matrix factorization of X(1) as a clustering method . . . . . . . . . . 64
4.3 Matrix factorization of X(2) as a clustering method . . . . . . . . . . 65
4.4 Possible cases for matching user clusters of X(1) and X(2) . . . . . . . 66
4.5 An illustration of how the centroid of a cluster is computed . . . . . . 68
4.6 Generated ratings of two domains X(1) and X(2) . . . . . . . . . . . . 69
4.7 An illustration of how well the proposed cluster alignment method
works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Tested mean RMSEs of ABS NSW and ABS VIC datasets under
different values of the common row parameter (c) in the coupled
factor of HISF with rank r = 11 . . . . . . . . . . . . . . . . . . . . . 84
4.9 Tested mean RMSEs of ABS NSW and BOCSAR Crime datasets
under different values of the common row parameter (c) in the
coupled factor of HISF with rank r = 11 . . . . . . . . . . . . . . . . 85
4.10 Tested mean RMSEs of Amazon dataset under different values of
the common row parameter (c) in the coupled factor of HISF-N
with rank r = 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Tensor slices for updating each row of the factors when a mode-3
tensor is coupled with a matrix in their first modes . . . . . . . . . . 95
5.2 An example of how to divide coupled matrix and tensor into
non-overlapping blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Observation scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Machine scalability with 100M synthetic dataset . . . . . . . . . . . . 107
5.5 Factorization speed with MovieLens . . . . . . . . . . . . . . . . . . . 107
5.6 Factorization speed with Netflix . . . . . . . . . . . . . . . . . . . . . 108
5.7 Factorization speed with Yahoo! Music . . . . . . . . . . . . . . . . . 108
5.8 Coupled factorization speed with MovieLens . . . . . . . . . . . . . . 108
5.9 Coupled factorization speed with Yahoo! Music . . . . . . . . . . . . 109
5.10 Benchmark of different optimization methods . . . . . . . . . . . . . 110
List of Tables
1 Symbols and their descriptions . . . . . . . . . . . . . . . . . . . . . . xviii
1.1 Comparison of existing algorithms for recommendation . . . . . . . . 9
3.1 Ground truth distributions of the factor matrices in the synthetic data 51
4.1 Characteristics of ABS census data on New South Wales and
Victoria states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Characteristics of Amazon datasets on books, movies and electronics 80
4.3 Mean and standard deviation of tested RMSE on ABS New South
Wales and Victoria data with different algorithms . . . . . . . . . . . 81
4.4 Mean and standard deviation of tested RMSE on ABS NSW
demography and BOCSAR NSW crime data with different algorithms 84
4.5 Mean and standard deviation of tested RMSE on Amazon book,
movie and electronics data with different algorithms . . . . . . . . . . 86
5.1 Data for experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Accuracy of each algorithm on the real-world datasets . . . . . . . . . 109
5.3 Accuracy of predicting missing entries on real-world datasets with
different optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
List of Publications
Below is the list of journal and conference papers associated with my Ph.D. research:
1. Quan Do, Wei Liu, Fan Jin and Dacheng Tao, “Unveiling Hidden Implicit
Similarities for Cross-Domain Recommendation,” IEEE Transactions on Knowl-
edge and Data Engineering (TKDE) (Under review).
2. Quan Do and Wei Liu, “Scalable Multimodal Factorization for Learning from
Very Big Data,” in Multimodal Analytics for Next-Generation Big Data Tech-
nologies and Applications, Springer (To appear).
3. Quan Do, Wei Liu and Fang Chen, “Discovering both Explicit and Implicit
Similarities for Cross-Domain Recommendation,” in Proceedings of the 2017
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD),
pp. 618-630, May 23-26, 2017.
4. Quan Do and Wei Liu, “ASTen: an Accurate and Scalable Approach to
Coupled Tensor Factorization,” in Proceedings of the 2016 International Joint
Conference on Neural Networks (IJCNN), pp. 99-106, Jul. 24-29, 2016.
Abbreviation
ALS Alternating Least Squares. 14
CBT Code Book Transfer. 6, 18
CF Collaborative Filtering. x, 1, 11
CLFM Cluster-Level Latent Factor Model. 6, 8, 18
CMF Collective Matrix Factorization. 2, 8, 13, 17
CMTF Coupled Matrix Tensor Factorization. 5, 6, 8, 11, 13–17
GD Gradient Descent. 14
GPU Graphics Processing Unit. 15
MF Matrix Factorization. 11, 12
NCG Nonlinear Conjugate Gradient. 14
NSW New South Wales. 1, 4
RMSE Root Mean Squared Error. 6
TF Tensor Factorization. 11, 12, 14, 15
Nomenclature and Notation
A rating matrix from n users for m items is denoted by a boldface capital, e.g.,
X. Each row of the matrix is a vector of a user’s ratings for all items while each
column is a vector of ratings from all users for a specific item. Vectors are denoted
by boldface lowercases, e.g., u. A boldface capital or lowercase with indices in its
subscript denotes an entry of a matrix or a vector, respectively. Table 1 lists
all other symbols used throughout this thesis.
Table 1 : Symbols and their descriptions

Symbol                           Description
X(i)                             Rating matrix from the i-th dataset
U(i)                             The first-dimension factor of X(i)
V(0)                             Common parts of the coupled factors
V(i)                             Domain-specific parts of the coupled factor of X(i)
S(i)                             Weighting factor of X(i)
A^T                              Transpose of A
A^†                              Moore-Penrose pseudo-inverse of A
I                                The identity matrix
‖A‖                              Frobenius norm of A
n, m, p                          Dimension lengths
c                                Number of common clusters in the coupled factors
r                                Rank of the decomposition
Ω_X                              Number of observations in X
∂/∂x                             Partial derivative with respect to x
L                                Loss function
λ                                Regularization parameter
×                                Multiplication
x, x, X, X                       A scalar, a vector, a matrix and a tensor
N                                Number of modes of a tensor
M                                Number of machines
K                                Number of tensors
T                                Number of iterations
I_1 × I_2 × · · · × I_N          Dimensions of the N-mode tensor X
|Ω|, X_{i_1,i_2,...,i_N}         Observed data size of X and its entries
X(n)                             The n-th mode of X
X(n)_{i_n}                       Slice i_n of X(n): all entries X(n)_{∗,...,∗,i_n,∗,...,∗}
U(n)                             The n-th mode factor of X
u(n)_{i_n}                       The i_n-th row of factor U(n)
V(2)                             The 2nd mode factor of Y
v(2)_{j_2}                       The j_2-th row of factor V(2): all entries V(2)_{∗,j_2}
U_1, U_2, ..., U_K               Factors of tensors X_1, X_2, ..., X_K
I_1 × I_2 × · · · × I_{N_K}      Dimensions of the N_K-mode tensor X_K
|Ω|_K, X_K{i_1,i_2,...,i_{N_K}}  Observed data size of X_K and its entries
Abstract
E-commerce businesses increasingly depend on recommendation systems to intro-
duce personalized services and products to their target customers. Achieving ac-
curate recommendations requires a sufficient understanding of user preferences and
item characteristics. Given the current innovations on the Web, coupled datasets
are abundantly available across domains. An analysis of these datasets can provide
a broader knowledge to understand the underlying relationship between users and
items. This thorough understanding results in more collaborative filtering power
and leads to a higher recommendation accuracy.
However, how to effectively use this knowledge for recommendation is still a
challenging problem. In this research, we propose to exploit both explicit and
implicit similarities extracted from latent factors across domains with matrix tri-
factorization. On the coupled dimensions, common parts of the coupled factors
across domains are shared among them. At the same time, their domain-specific
parts are preserved. We show that such a configuration of both common and domain-
specific parts benefits cross-domain recommendations significantly. Moreover, on the
non-coupled dimensions, we propose to use the middle factor of the tri-factorization
to match closely related clusters across datasets and to align the matched ones,
transferring cross-domain implicit similarities and further improving the recommendations.
Furthermore, when dealing with data coupled from different sources, the scalabil-
ity of the analytical method is another significant concern. We design a distributed
factorization model that can scale up as the observed data across domains increases.
Our data parallelism, based on Apache Spark, enables the model to incur minimal
communication cost. Also, the model is equipped with an optimized solver that
converges faster. We demonstrate that these key features stabilize our model's
performance when the data grows.
Validated on real-world datasets, our developed model outperforms the existing
algorithms regarding recommendation accuracy and scalability. These empirical
results illustrate the potential of our research in exploiting both explicit and implicit
similarities across domains for improving recommendation performance.
Chapter 1
Introduction
E-commerce providers usually offer a wide range of products. On the one hand, this
massive product selection meets a variety of different consumer needs and tastes. On
the other hand, browsing through a long product list to find products matching one's
preferences is not a user-friendly task for any consumer. Automatic matching between
product properties and consumer interest allows companies to introduce products
and services of interest to consumers. Systems with a capability to recommend
products to each particular user based on user preferences are called personalized
recommendation systems (Koren & Bell 2011). They enrich the user experience,
enhance user satisfaction and eventually lead to more sales. Realizing that they
can provide a competitive advantage, a large number of providers have been em-
ploying recommendation systems to analyze consumers’ past behaviors to provide
personalized product recommendations (Koren et al. 2009).
Recommendation systems have increased in their importance and popularity
among product providers (Zhang 2014). Two fundamental techniques are widely
chosen for developing personalized recommendation systems: the content-based ap-
proach (Lops et al. 2011) and the collaborative filtering (CF)-based approach (Koren
& Bell 2011). The former focuses on the information of users or items for making
recommendations whereas the latter is based on the latent similarities between user
interests and the item characteristics to predict items in which specific users would
be interested. This research focuses on improving CF-based recommendations.
To provide accurate recommendations, CF-based methods require a thorough
understanding of the latent similarities between the user preferences and the item
properties (Pan et al. 2011). This understanding can only be obtained when there
is sufficient user feedback (ratings, likes, activities, etc.). Having adequate user
feedback is especially critical here as this is the only information CF-based methods
use for making recommendations (Pan et al. 2010). In many cases, a business does
not have sufficient ratings; improving its recommendation performance then becomes
a significant problem.
The problem of a lack of user ratings can be overcome by exploiting related
information from other domains (Wei et al. 2016, Yang et al. 2015, Liu et al. 2015,
Iwata & Koh 2015, Jing et al. 2014, Zhao et al. 2013, Hu et al. 2013, Tang et al.
2012, Tan et al. 2014, Wang et al. 2016, Hsu et al. 2018, Wu et al. 2017, Zhang
et al. 2017). Given the recent innovations on the Internet and social media, many
cross-domain datasets are publicly available (Chen et al. 2013, Li & Lin 2014, Pan
et al. 2010, Jiang et al. 2015). Finding a closely related dataset from another domain
is easily done these days. For example, user ratings can be found for the same set
of movies from both MovieLens and Netflix websites. Thus, they can be jointly
used to understand user preferences and item characteristics better. Acar, Kolda &
Dunlavy (2011), Singh & Gordon (2008), Li et al. (2009a), Bhargava et al. (2015)
proposed the use of correlated datasets across domains as the extra information to
overcome the problems of insufficient ratings.
The joint analysis of different datasets across domains provides a deeper under-
standing of their underlying relationship (Acar, Kolda & Dunlavy 2011). However,
there are two problems which need to be addressed. The first problem is how to
accurately discover the exact correlation among sources and exploit them to gain
a deeper understanding of the relationships between users and items. This under-
standing will help to provide more accurate recommendations. Furthermore, mining
these abundant cross-domain datasets incurs a very heavy cost in terms of compu-
tation, communication, and storage. This cost leads to the second problem of how
to scale up the data analysis. Solving all these issues is the primary focus of this
research.
1.1 The research problem
This section lists the main issues investigated in this thesis and presents the
research questions.
1.1.1 The improper sharing of explicit similarities among coupled datasets
across domains reduces recommendation accuracy
Coupled datasets are those with one dimension in common (Acar, Kolda &
Dunlavy 2011). For example, one dataset contains user ratings for a list of movies
on the MovieLens website, and another includes the ratings of a different user base
on the Netflix website for the same list of movies as that of MovieLens. Both contain
the ratings of the same list of movies. Thus, they are coupled in their movie dimen-
sion. As they have one dimension in common, they explicitly share some similarities
in their coupled dimension. For example, the same movies on the MovieLens and
Netflix websites have some common characteristics. The joint analysis of coupled
datasets to utilize these explicit similarities has been an exciting topic in different
research communities such as collaborative filtering (Zheng et al. 2010, Loni et al.
2014, Wang et al. 2012a), community detection (Lin et al. 2009) and link prediction
(Chen et al. 2017, Wang et al. 2013, Viswanath et al. 2009, Dunlavy et al. 2011,
Perozzi et al. 2016).
Coupled datasets across domains can provide additional insights into the coupled
dimension. For example, ratings in the MovieLens dataset can be an extra source of
information about movies that can be beneficial to the understanding of movies in
Netflix data and vice versa. Thus, the joint analysis of the coupled datasets across
domains would provide a better understanding of their underlying structures (Kemp
et al. 2006, Yang et al. 2015). This thorough understanding helps to provide more
appropriate recommendations to the users. In this case, the aim is to simultane-
ously learn the rating behaviors from both domains’ observed values to predict their
missing entries with high accuracy. But how to utilize the rating information in one
domain to help to predict unknowns in another one and vice versa still needs to be
investigated.
Existing coupled analysis methods use the same coupled factor to collaborate
between datasets. Collective matrix factorization (CMF) (Singh & Gordon 2008)
and its extension, coupled matrix tensor factorization (CMTF) (Acar, Kolda &
Dunlavy 2011), suggested that both datasets share the same factor in
their coupled mode. Gao et al. (2013) and Li et al. (2009a) assumed cross-domain
datasets would have identical latent rating patterns, captured in the middle factor
of the matrix tri-factorization. Although sharing the same factor across domains is
effective to some extent, cross-domain datasets may also possess characteristics that
are unique to their domains. For example, the MovieLens website allows ratings with half-star
increments whereas Netflix only allows full-star ratings. Thus, sharing the same
factor across domains is unlikely to capture the exact correlations among them,
reducing its effectiveness in achieving higher recommendation accuracy.
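To make this shared-factor assumption concrete, consider a mode-3 tensor X coupled
with a matrix Y in their first mode. The CMTF objective of Acar, Kolda & Dunlavy
(2011) then takes, up to constant scaling, the form

    f(A, B, C, V) = ‖X − [[A, B, C]]‖² + ‖Y − A Vᵀ‖²,

where [[A, B, C]] denotes the CP reconstruction of X and the single factor A serves as
the coupled-mode factor of both X and Y. It is precisely this identical A that the
following chapters argue is too restrictive when the domains differ.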
Figure 1.1 : An example of the implicit correlation between income and crime rate.
(a) Percentage of high-income families per local government area (LGA); (b) number
of "break and enter dwelling" incidents per LGA. LGAs in New South Wales (NSW)
with more high-income families have fewer break and enter dwelling incidents. The
data in (a) is from the Australian Bureau of Statistics and in (b) from the NSW
Bureau of Crime Statistics and Research.
1.1.2 Coupled datasets across domains also share implicit similarities
that provide other insights into their relationships
In addition to the aforementioned explicit similarities, cross-domain datasets are
hypothesized to have other implicit correlations in their remaining dimensions. For
instance, MovieLens and Netflix datasets are also correlated in their user dimen-
sions. This intuition comes from consumer segmentation which is a popular concept
in marketing. Consumers can be grouped by their behaviors: for example, "tech
leaders" are a group of consumers with a strong desire to own new smartphones with
the latest technologies, whereas another group only changes to a new phone when
the current one is out of order. This idea has two implications. Firstly, different
users can be grouped if they have similar preferences, i.e.,
users with similar movie tastes on the MovieLens website can be grouped together
and so can those on Netflix. Secondly, related groups across domains may have sim-
ilar behaviors. For example, action movie fans on MovieLens and those on Netflix
are fond of watching action movies. Thus, these implicit similarities can provide
other insights to understand the relationship between users and items if they are
exploited properly.
The implicit similarities in the non-coupled dimension will have a significant
impact in improving recommendation accuracy. For example, a re-examination of
the MovieLens and Netflix datasets shows that although there is no direct user match
between the two datasets, groups of users on the MovieLens website and related ones
on the Netflix website share similar preferences. Thus, it is possible to use the group
behaviors of one dataset to enrich the understanding of the corresponding groups in
another dataset. Furthermore, this type of correlation also implicitly exists in many
other scenarios in the real world. For example, suburbs with a large number of
high-income families may be correlated with a lower crime rate as shown in Figure
1.1, or users with an interest in action books may also like related movie genres
(e.g., action films), or even though users on Amazon and Walmart are different,
those sharing similar interests may share similar behaviors in relation to the same
set of products. Thus, using both explicit and implicit similarities correctly will
have potential applications to many real-world scenarios.
Implicit similarities have the potential to improve recommendation accuracy.
However, different approaches of joint analysis of cross-domain datasets (Pan 2016)
use only explicit similarities as a bridge to collaborate among datasets. Although
sharing these explicit similarities was shown to be effective in improving recom-
mendation, there are rich implicit features that remain unused and have great
potential to provide even more appropriate recommendations.
1.1.3 Joint analysis of heterogeneous datasets is costly
Recent innovations on the Internet and social media have brought many op-
portunities as well as challenges to research communities. On the one hand, they
have made increasingly many gigabytes of matrices and high-order tensors available
(Ermis et al. 2015). An analysis of their explicit and implicit similarities provides us
with a deeper understanding of the underlying relationships. On the other hand, mining
these huge datasets incurs significant computation and communication costs and
occupies a large amount of memory. Traditional algorithms, such as CMTF (Acar, Kolda &
Dunlavy 2011), are intractably slow or quickly run out of memory. The former is
because they iteratively factorize the full coupled tensors into factors many times;
the latter is because the full coupled tensors cannot be loaded into the local mem-
ory of any typical computer. These challenges have been a motivation for many
researchers.
Both efficient computational methods and scalable works (Zhu et al. 2016, Sael
et al. 2015) have been proposed to speed up the factorization. Whereas concurrent
processing using CPU cores (Papalexakis et al. 2014) or GPUs’ massively paral-
lel architecture (Zou et al. 2015) enhances processing speed, it does not solve the
problem of insufficient local memory to store the whole big data. Other MapRe-
duce distributed models (Shin & Kang 2014, Beutel et al. 2014, Kang et al. 2012)
overcome the memory problem by keeping large files in a distributed file system.
They also improve computational speed by performing the factorization process in
parallel with many different computing nodes.
Computing in parallel allows factors to be updated faster, yet the factorization
faces higher data communication cost if it is not well designed. The first critical
weakness of MapReduce algorithms is that whenever a computing node needs data to
process, the data must be transferred from an isolated distributed file system to the
node (Beutel et al. 2014, Kang et al. 2012, Jeon et al. 2016). The iterative nature of
tensor factorization requires data and factors to be distributed over and over again,
incurring enormous communication overhead: because each of the T iterations must
re-transfer the data, doubling the tensor size makes the communication cost 2T times
that of a single pass over the original tensor. This diminished performance leads to
the second disadvantage, which is their low scala-
bility. Thus, improving the scalability of the analysis of datasets across domains is
another problem that needs to be addressed.
In light of the above research issues, this thesis aims to address the following
three research questions:
Q1. Is it possible to propose appropriate methods of sharing explicit similarities
between cross-domain datasets to understand their actual relationship?
Q2. How to share implicit similarities in non-coupled dimensions across domains to
improve recommendation accuracy?
Q3. How to improve the scalability of the factorization process such that it is able
to scale up to a different number of coupled tensors, tensor modes, tensor
dimensions and billions of observations?
1.2 Thesis
The proposed method is the first scalable factorization model to use both explicit
and implicit similarities across domains for cross-domain recommendation perfor-
mance improvement.
Table 1.1 : Comparison of existing algorithms for recommendation. The features
that an algorithm supports are checked. Only the proposed method has all the
features.

Algorithm                            Explicit       Implicit       Scalability
                                     Similarities   Similarities
CMTF (Acar, Kolda & Dunlavy 2011)    ✓              ×              ×
SALS (Shin et al. 2017)              ×              ×              ✓
SCouT (Jeon et al. 2016)             ✓              ×              ✓
The proposed method                  ✓              ✓              ✓
Coupled datasets across domains are strongly correlated. The most apparent
relationship can be seen in the coupled dimension. As the datasets have the same
coupled dimension, they share some direct properties of the coupled dimension. For
example, as the movie ratings on the MovieLens and Netflix websites are of the
same list of movies, a film on MovieLens can be matched with its identical one
on Netflix. It is worth noting that these matched movies have precisely the same
properties. Therefore, both websites directly share some common characteristics
in the movie dimension. These direct correlations on the coupled dimension, or
explicit similarities, can be mutually used to enrich our understanding of the
underlying structure of each dataset.
In addition to the explicit similarities, coupled datasets across domains also
have indirect relationships, called implicit similarities. For instance, as users of
the MovieLens and Netflix websites are different, there is no direct match or sharing
between any two particular users across the two sites. However, some of the users
on the MovieLens website are fans of action movies whereas some of the Netflix
users love to watch action movies. These action movie fans are not the same users
across domains. However, they share some common behaviors, e.g., they are likely
to rate action movies highly. Thus, such indirect or hidden similarities may exist
between two related groups of users (e.g., action movie fans, sci-fi movie fans, etc.).
Sharing these implicit similarities provides other rich insights, in addition to the
aforementioned explicit similarities, to better understand the relationships between
users and items. This thorough understanding allows the company to provide more
appropriate recommendations to its users.
The joint analysis of coupled datasets across domains allows us to use both
explicit and implicit similarities to improve recommendation accuracy. However,
mining them often incurs heavy computation, communication and storage costs.
This problem is because data is now generated at tremendous rates. Consequently,
improving the scalability of the factorization process is not an option, but a crucial
requirement. The scalability of a method is its capability to scale up its operations
as the data increases. This means the method has the capability to complete the
computation within a reasonable amount of time. Also, it implies the ability to add
more hardware resources to improve the performance of the analysis. At the same
time, a hardware failure does not prevent the method from performing its operation
and may only reduce its performance. Any method that is not able to scale up will
be in trouble when analyzing large-scale datasets.
Several algorithms have been proposed for the joint analysis of coupled datasets
across domains. Table 1.1 compares these in terms of their capabilities for using explicit
similarities, discovering implicit similarities and scaling up to large-scale datasets.
Existing methods support either one or two of the three features. CMTF (Acar,
Kolda & Dunlavy 2011) only uses explicit similarities in the coupled factors. SALS
(Shin et al. 2017) scales up the factorization process, but it does not support multiple
datasets. SCouT (Jeon et al. 2016) improves the scalability of CMTF. Nevertheless,
it exploits the same explicit similarities just as CMTF does. The lack of a scalable
method with the ability to exploit both the explicit and implicit similarities across
coupled datasets motivates this research. Section 1.3 provides some background of
these existing methods and Section 1.4 introduces the contributions of this thesis
by conducting this research.
1.3 Background
Utilizing similarities across domains has attracted enormous research effort (Zhang,
Yuan, Lian, Xie & Ma 2016, Zhang, Xiong, Kong & Zhu 2016). Some of the com-
monly used algorithms are discussed in this section.
Acar, Kolda & Dunlavy (2011) introduced Coupled Matrix Tensor Factorization
(CMTF) as joint analysis of explicit similarities between a matrix and a tensor cou-
pled in one dimension to improve recommendation accuracy. The authors assumed
both datasets would explicitly share a common factor in the coupled dimension.
Thus, they formulated this identical factor in a coupled loss function. Even though
CMTF provides a deeper knowledge of the underlying structure of the data, it has
three main drawbacks. Firstly, it only uses explicit relationships in the coupled di-
mension. Coupled datasets across domains may also have some implicit similarities
that can be additional resources to deepen the understanding of the actual relation-
ship in the data. Secondly, the assumption that coupled datasets share identical
coupled factors is unrealistic. Even though the coupled datasets may be strongly
correlated, they may also have unique features from their domains. Hence, forcing
them to share identical information may lose the domain-specific characteristics.
Finally, the analysis is performed on a local machine. When the size of the input
matrix and tensor becomes bigger than the size of the machine’s memory, CMTF
fails. Subsequent works have only focused on the latter issue.
As an attempt to resolve the scalability of CMTF, Jeon et al. (2016) imple-
mented a MapReduce-based distributed algorithm, called SCouT. In a nutshell,
SCouT divides huge data into small parts and concurrently factorizes them with
several computing nodes in a cluster. Following the MapReduce framework, SCouT
stores data files in distributed file servers. As a result, SCouT requires pieces of
data to be transferred from the distributed file servers to each computing node for
every iteration. This data transmission cost, in the case of transferring many ter-
abytes to all computing nodes over iterations, even surpasses the time saved from
parallel processing. Hence, this weakness reduces the robustness and effectiveness of
SCouT. An algorithm minimizing this communication is, therefore, a better solution
for scaling up as the observed data increases.
In an attempt to overcome the weakness of the MapReduce framework, Shin et al.
(2017) introduced an optimization to reduce the repeated redistribution of data. The
authors’ idea was to cache data in local disks of computing nodes. This data caching
reduced the communication overhead significantly as data was only transferred from
local disks to memory for each iteration. Nevertheless, this communication can be
reduced even more. As data was stored on disks, reading it to memory for each
access takes time, especially for huge datasets and many iterations. Furthermore,
the authors' proposed algorithm worked with a single dataset only, lacking the ability
to use similarities across domains for cross-domain recommendation.
The lack of a scalable algorithm with the capability of utilizing both explicit
and implicit similarities for cross-domain recommendation motivates us to conduct
this research. The proposed model is the only one which can effectively scale up its
analysis to use both explicit and implicit similarities across domains for cross-domain
recommendation performance improvement.
1.4 Knowledge contributions
To investigate the above research questions, the research in this thesis makes four
knowledge contributions to the data mining research community. Figure 1.2 shows
the relationship between the research questions and the knowledge contributions of
this thesis. Details of each contribution are discussed in Section 1.5.
[Figure 1.2 maps the three research questions (Q1: how to share explicit similarities;
Q2: how to share implicit similarities; Q3: how to improve the scalability?) to the
four contributions (#1: utilize explicit similarities; #2: discover implicit similarities;
#3: exploit both explicit and implicit similarities; #4: scale up factorization).]

Figure 1.2 : The research questions and their corresponding contributions.
Contribution #1. A new objective function to enable each dataset to have its own
discriminative factor on the coupled mode, capturing the actual explicit simi-
larities across domains;
Contribution #2. A novel algorithm to discover implicit similarities in non-coupled
mode and align them across domains;
Contribution #3. A matrix factorization-based model to utilize both explicit and
implicit similarities for cross-domain recommendation accuracy improvement;
Contribution #4. A scalable factorization model based on the Spark framework
to scale up the factorization process to the number of tensors, tensor modes,
tensor dimensions and billions of observations.
1.5 Research Methods
This section briefly introduces the research methods to be implemented to in-
vestigate the research questions.
1.5.1 A new objective function to enable each dataset to have its own
discriminative factor on the coupled mode, capturing the actual
explicit similarities across domains
The goal of this research is to accurately recommend items that a particular
user may like. Recommending the right products to the right consumers requires a
thorough understanding of user preferences and item characteristics. This require-
ment can be addressed as a result of recent innovations on the Internet and social
media where many datasets, coupled in one dimension, from different sources are
available. As the coupled datasets have one dimension in common, they share some
explicit similarities that can be used effectively to better understand the underly-
ing relationships between users and items, resulting in the provision of more useful
recommendations.
Coupled datasets have strong correlations on their coupled dimension. For in-
stance, the same action movies on the MovieLens and Netflix websites share some
common characteristics. However, each domain also has some unique properties. For
example, the MovieLens allows ratings from 0.5 to 5 with 0.5 increments whereas
the Netflix only enables 1 to 5 ratings with 1 increases. Thus, there are scenarios
where action movie fans on the MovieLens rate action movies with 3.5, 4, 4.5 or
5 stars while those on the Netflix rate them with 4 or 5 stars. Due to this scale
difference across sites, existing models that assume coupled datasets share the same
coupled factor or the same parameters on their coupled dimension are unlikely to
capture the actual differences. A method is proposed to better capture the true
explicit similarities across domains to improve recommendation accuracy.
Suppose cross-domain datasets X and Y are coupled in their first dimension,
popular joint factorization algorithms assume that they share the same features in
the coupled dimension. For example, CMF (Singh & Gordon 2008) and CMTF
(Acar, Kolda & Dunlavy 2011) assume that the first dimension of X shares a com-
mon low-rank subspace with the first dimension of Y. A basis for this low-rank
subspace is expressed by the identical latent factors in the coupled dimensions (cou-
pled factors) of X and Y in the coupled loss function. Admittedly, the first factor
of X highly correlates with the first factor of Y, yet they are unequal in many
real-world data and applications. Thus, forcing them to share the same coupled
factor may reduce the accuracy of factorization, leading to a lower recommendation
performance.
Sharing the same coupled factors as proposed by the existing algorithms is hy-
pothesized to reduce the accuracy of the joint factorization. However, this perfor-
mance reduction is not the only issue. By using an identical coupled factor for
cross-domain datasets, the final result optimizes either of them, not both. It may
approximate X well and lose Y’s decomposition accuracy, or vice versa. Hence, this
problem is addressed by allowing each dataset across domains to have its unique
factor even in the coupled dimension. Moreover, a new coupled loss function is
proposed where different coupled factors are regularized to be as close as possi-
ble. These different, yet closely related, coupled factors better capture the true
relationship between cross-domain datasets, optimizing the factorization of every
dataset without sacrificing any accuracy. The proposed model is benchmarked with
commonly used algorithms that can use the explicit similarities across domains for
recommendation, such as CMTF (Acar, Kolda & Dunlavy 2011), CLFM (Gao et al.
2013) and CBT (Li et al. 2009a). For a fair comparison, each model is applied to
the same publicly available datasets. Root mean squared error (RMSE) is used as
a metric for benchmarking the proposed idea.
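As a concrete sketch of the proposed coupled loss (the exact objective and its
optimization are developed in Chapter 3), let A_X and A_Y denote separate
coupled-mode factors for a tensor X and a matrix Y coupled in their first mode.
Instead of forcing A_X = A_Y as CMTF does, one plausible form of the loss
regularizes the two factors towards each other:

    L = ‖X − [[A_X, B, C]]‖² + ‖Y − A_Y Vᵀ‖² + λ ‖A_X − A_Y‖²,

where λ controls how strongly the two coupled factors are pulled together: a very
large λ recovers the identical-factor assumption, while a finite λ lets each dataset
keep its domain-specific deviations on the coupled mode.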
1.5.2 A novel algorithm to discover implicit similarities in non-coupled
mode and align them across domains
Cross-domain datasets not only have explicit similarities in the coupled dimen-
sion, but they also share implicit ones in the non-coupled dimension. Different
approaches have been proposed to perform a joint analysis of coupled datasets (Pan
2016). However, all of the existing algorithms use explicit similarities as a bridge
to collaborate among datasets. Although these explicit similarities showed their ef-
fectiveness in improving recommendation, there are still rich implicit features that
were not used but have great potential to further improve the recommendation. The
fact that non-coupled dimensions in the aforementioned example of the MovieLens
and Netflix datasets contain non-overlapping users prevents direct knowledge shar-
ing in their non-coupled factors. However, their latent behaviors are correlated and
should be shared. These latent behaviors can be captured in low-rank factors by
matrix tri-factorization. As factorization is equivalent to spectral clustering (Ding
et al. 2006), different users with similar preferences are grouped in non-coupled
user factors. Building on this concept, latent clusters in these non-coupled fac-
tors are hypothesized to have a close relationship. Therefore, correlated clusters in
non-coupled factors are aligned to be as close as possible. This idea matches the
fundamental concept of CF in the sense that similar user groups who rate similarly
will continue to do so.
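The alignment step can be sketched in a few lines of Python (an illustrative sketch
only: the cluster assignment by argmax and the centroid definition over reconstructed
ratings are simplifying assumptions, and Chapter 4 gives the exact formulation):

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_centroids(U, X_hat):
    """Assign each user to the cluster of its largest latent weight (the
    argmax of its row in the non-coupled factor U) and summarize each
    cluster by the mean reconstructed rating vector of its members."""
    labels = U.argmax(axis=1)
    r, m = U.shape[1], X_hat.shape[1]
    return np.vstack([X_hat[labels == k].mean(axis=0)
                      if np.any(labels == k) else np.zeros(m)
                      for k in range(r)])

def match_clusters(C1, C2):
    """Match the clusters of two domains one-to-one by minimizing the total
    Euclidean distance between their centroids (the Hungarian method)."""
    cost = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

Because both domains rate the same set of items, the centroids live in a comparable
space; the matched pairs can then be pulled towards each other by a regularization
term during the joint factorization.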
This aim can be achieved by delivering a factorization model that can exploit
the implicit similarities across domains for recommendations. In the case of the afore-
mentioned movie rating matrices on MovieLens and Netflix websites, they contain
preferences of different users for the same set of movies. Even though there is no
direct user matching between them, some of them may share hidden behaviors that
can be utilized to improve recommendation accuracy. The performance of the pro-
posed algorithm with implicit similarities exploitation is measured in comparison
with that of other widely used methods using only explicit similarities, including
CMF (Singh & Gordon 2008), CST (Pan et al. 2011), CBT (Li et al. 2009a) and
CLFM (Gao et al. 2013). In this event, RMSE is also used as the metric.
1.5.3 A matrix factorization-based model to utilize both explicit and im-
plicit similarities for cross-domain recommendation accuracy im-
provement
This research proposes a cross-domain recommender as the first algorithm uti-
lizing both explicit and implicit similarities between datasets across sources for per-
formance improvement. One of the key hypotheses, extended from CMF (Singh
& Gordon 2008) where both datasets have the same factor in their coupled mode,
is that two datasets across domains also possess their own specific patterns. The
proposed idea is to find a way to combine these unique patterns into the common fac-
tor. One plausible solution is to allow the coupled factors to have both common and
domain-specific parts. In addition, another key hypothesis for implicit similarities
is that they may exist in non-coupled factors. Thus, the proposed method utilizes
both the explicit similarities in the coupled factors and the implicit similarities in
the non-coupled factors to improve recommendation performance.
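Using the notation of Table 1, one plausible way to write the resulting model for two
coupled rating matrices (the exact objective, including the cluster-alignment terms
for the implicit similarities, is derived in Chapter 4) is

    L = Σ_{i=1,2} ‖X(i) − U(i) S(i) [V(0); V(i)]ᵀ‖² + regularization terms,

where the coupled factor of domain i stacks the c common rows V(0), shared across
domains to carry the explicit similarities, on top of its domain-specific rows V(i),
while aligning the latent clusters of the non-coupled factors U(i) transfers the
implicit similarities.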
Validated on real-world datasets, the proposed idea outperforms the current
cross-domain recommendation methods by more than two times. Furthermore, the
more interesting observation is that both explicit and implicit similarities between
datasets help to better suggest unknown information from cross-domain sources.
1.5.4 A scalable factorization model based on the Spark framework to
scale up the factorization process to the number of tensors, tensor
modes, tensor dimensions and billions of observations
As businesses grow, they reach more users and eventually collect more ratings.
Having more data opens a new opportunity for them to provide more accurate
recommendations. At the same time, it is also a challenge as they need to analyze an
increasing amount of data to understand more deeply the underlying relationships
between users and items. To accommodate this massive increase, not only the
ability to handle this big data but also the capability to finish the analysis within
a reasonable time are necessary. Therefore, an efficient and scalable method is a
crucial requirement of any recommendation system.
Both computationally efficient methods (He et al. 2016, Rennie & Srebro 2005,
Liu & Shang 2013, Wang, Tung, Smola & Anandkumar 2015) and scalable work
(Yang et al. 2017, Acar, Dunlavy, Kolda & Mrup 2011, Park et al. 2017) have been
proposed to speed up the factorization. Furthermore, other researchers attempted to
use hardware power to enhance processing speed. Papalexakis et al. (2014) presented
a method using multiprocessors for coupled matrix tensor factorization. Zou et al.
(2015) proposed to take advantage of GPUs' massively parallel architecture to speed
up tensor factorization. As these methods are performed on a local machine, they
do not solve the problem of insufficient memory when they have to handle huge
datasets.
To overcome the limit of local memory, MapReduce-based factorization mod-
els (Beutel et al. 2014, Kang et al. 2012, Shin & Kang 2014, Jeon et al. 2016)
were introduced. They can keep the large files in a distributed file system which
was designed to be expanded easily by adding more storage. Furthermore, these
MapReduce-based algorithms improve computational speed by having many nodes
compute in parallel. Even though distributed computing allows factors to be up-
dated faster, MapReduce-based models require data to be transferred from the iso-
lated distributed file system to the computing node when it needs to process this
data. The iterative nature of tensor factorization requires data and factors to be
distributed over and over again, incurring huge communication overhead.
This research proposes the first data parallelism algorithm that incurs minimal
communication cost. In particular, the proposed method is designed to cache data
in memory in parallel so that no data communication is needed for each iteration.
This design makes it a lightning-fast and scalable tensor factorization algorithm
whose performance does not dramatically degrade as the data increases. Also, the
proposed method is capable of scaling up to different numbers of input datasets,
their dimensions, and billions of observations. The proposed method's processing
speed is measured in comparison with SCouT (Jeon et al. 2016) and SALS (Shin
& Kang 2014), which are the fastest scalable coupled matrix tensor factorization
and tensor factorization algorithms, respectively. Moreover, a thorough analysis of
the scalability of the proposed model is also performed. To this end, the proposed
algorithm is compared against its baselines in case the data grows to billions of
observations.
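The core of the design can be sketched in a few lines of PySpark (a minimal,
illustrative sketch: the file path, block count and id assumptions are hypothetical,
and the full block-partitioned, multi-tensor design appears in Chapter 5). The
observations are partitioned once and cached in executor memory, so each iteration
only broadcasts the small factor matrices rather than reshuffling the raw data:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-sketch").getOrCreate()
sc = spark.sparkContext

r, lam, T = 8, 0.1, 10   # rank, regularization and iteration count (illustrative)

# Parse (user, item, rating) triples, hash-partition them once, group the
# observations of each row index, and cache the result in executor memory.
# After this point the raw observations never cross the network again;
# user and item ids are assumed to be 0-based and contiguous.
triples = (sc.textFile("hdfs:///ratings.csv")          # path is illustrative
             .map(lambda l: l.split(","))
             .map(lambda t: (int(t[0]), (int(t[1]), float(t[2])))))
by_user = triples.partitionBy(64).groupByKey().cache()
by_item = (triples.map(lambda t: (t[1][0], (t[0], t[1][1])))
                  .partitionBy(64).groupByKey().cache())

n_users = by_user.keys().max() + 1
n_items = by_item.keys().max() + 1
U = np.random.rand(n_users, r)
V = np.random.rand(n_items, r)

def ls_row(obs, F):
    """Closed-form alternating-least-squares update for a single factor row,
    given that row's observations and the opposite factor F."""
    idx = np.array([j for j, _ in obs])
    y = np.array([v for _, v in obs])
    A = F[idx]
    return np.linalg.solve(A.T @ A + lam * np.eye(r), A.T @ y)

for _ in range(T):
    Vb = sc.broadcast(V)          # only the small factors are shipped
    for i, row in by_user.mapValues(
            lambda obs: ls_row(list(obs), Vb.value)).collectAsMap().items():
        U[i] = row
    Ub = sc.broadcast(U)
    for j, row in by_item.mapValues(
            lambda obs: ls_row(list(obs), Ub.value)).collectAsMap().items():
        V[j] = row

Because each executor keeps its cached partitions of the observations across
iterations, the per-iteration communication reduces to broadcasting the factor
matrices, which is the property that lets the factorization scale as the observed
data grows.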
1.6 Significance
The proposed factorization model enriches the research community with a new
way of feature sharing between coupled datasets, leading to more accurate recom-
mendations. Even though coupled datasets have strong correlations in the coupled
dimension, forcing different datasets to have the same factor on their coupled di-
mension is unrealistic in many real-world applications, which is detrimental to the
overall accuracy of the factorization. The research extends the CMTF model by as-
suming coupled datasets do not have common factors even on the shared dimension.
Instead, it enables each dataset to have discriminative coupled factors and constrains
the coupled factors to be as close as possible. This idea has two advantages. Firstly,
it properly shares the explicit similarities in coupled dimensions across domains.
Secondly, it optimizes the factorization of every single dataset without sacrificing
accuracy for any of the coupled datasets. Experiments with real-world datasets cou-
pled in one dimension illustrate that the proposed model exploits explicit similarities
better than existing models to improve recommendation performance.
Furthermore, another key contribution of this research relates to implicit simi-
larities. The fact that non-coupled dimensions in the MovieLens and Netflix ex-
ample contain non-overlapping users prevents direct knowledge sharing in their
non-coupled factors. However, their latent behaviors are correlated and should be
shared. These hidden behaviors can be captured in low-rank factors by matrix tri-
factorization. As factorization is equivalent to spectral clustering (Ding et al. 2006,
Sachan & Srivastava 2013), different users with similar preferences are grouped in
non-coupled user factors. Building on this concept, latent clusters in these non-
coupled factors are hypothesized to have a close relationship. Therefore, correlated
clusters in non-coupled factors are aligned to be as close as possible. This idea
matches the fundamental concept of CF in the sense that similar user groups who
rate similarly will continue to do so. As a result, the developed algorithm is the
first factorization model that utilizes not only the explicit similarities but also the
implicit ones across domains for recommendation accuracy improvement.
In addition, the developed algorithm benefits businesses, considering that their
users are generating a massive amount of data today. This fast
data generation rate demands a fast and scalable data analytic method. Thus, a
novel distributed model that exhibits robust data parallelism is proposed. It enables
factors to be decomposed concurrently while minimizing data transmission overhead. As
a result, the proposed algorithm is the only one that scales up well in relation to
the number of tensors coupled in one or more modes, tensor modes, tensor dimen-
sions and billions of observations. Moreover, the research also benefits the research
community in two aspects. Firstly, it presents a closed-form optimization solution
which not only converges faster but also achieves higher accuracy. Experiments
with real-world datasets confirm the quality of the proposed solution. Secondly,
this research provides a theoretical complexity analysis of the proposed algorithm in
computation, communication and space aspects as well as some empirical evidence
of its fastest convergence in comparison with existing algorithms.
1.7 Thesis organization
This thesis is organized as follows:
• Chapter 1 introduces the research problems, research questions, contributions
and their significance.
• Chapter 2 presents preliminary concepts and previous work related to the
research topics. The background of matrix factorization, tensor factorization,
and coupled tensor matrix factorization is briefly summarized. Next, different
optimization methods such as gradient descent and alternating least squares
are explained in detail. Furthermore, this chapter discusses different meth-
ods using similarities across domains for recommendation including the joint
analysis of coupled datasets and transfer learning. Also, different distributed
approaches for scaling up factorization processes are reviewed and compared.
• Chapter 3 proposes an algorithm to exploit explicit similarities across do-
mains. It assumes coupled datasets share different but closely similar coupled
factors. The proposed algorithm is described in detail starting with the mo-
tivation to introduce this idea, followed by its technical aspects, then several
experiments to show its performance, and a summary of its knowledge contri-
butions.
• Chapter 4 explains a method to discover implicit similarities and use them to
improve cross-domain recommendation performance. Specifically, this chap-
ter presents a method to find related groups across non-coupled factors and
align them to share the implicit similarities across domains. Also, how to use
both explicit and implicit similarities is presented. Extensive experiments are
conducted, and their results are reported in this chapter to demonstrate the
advantages of the proposed method. Finally, this chapter is concluded with
some knowledge contributions of the proposed algorithm.
• Chapter 5 describes a scalable model for speeding up the factorization process
when dealing with big data inputs. This chapter presents the distributed data
design and the closed-form optimization to improve computational and time
complexity. Thorough experiments are also discussed to benchmark the scalability
of the proposed algorithm in terms of the observed data size, the number of
computing nodes and the number of input datasets. Also, a brief comparison
on recommendation accuracy is reported to conclude the knowledge contribu-
tions of this research.
• Chapter 6 concludes the thesis and summarizes the work in a broader context.
Furthermore, future directions of the research are also described here.
Chapter 2
Literature Review and Background
This chapter reviews different aspects of personalized recommendation systems re-
lated to this research. There are two primary entities in personalized recommen-
dation systems: users and items. Items can be products such as movies, songs,
websites, etc., in product recommendation, or other users in the friend recommendation
problem. Users are those the systems want to provide recommendations to. The
primary purpose is to predict a user’s preference for a particular item so that the
systems can provide an appropriate recommendation strategy.
As recommendation systems analyze user preferences for different items, this
chapter first introduces the concepts of the rating matrix and the rating tensor,
which capture user preferences in recommendation systems, in Section 2.1. Section
2.2 discusses collaborative filtering (CF) based recommendation systems and presents
matrix factorization (MF) and its extension, tensor factorization (TF). Section 2.3
then reviews methods that utilize datasets across domains for higher recommendation
accuracy; two main approaches, the joint analysis of multiple datasets and transfer
learning between cross-domain ones, are described. Alternating least squares (ALS)
and gradient descent optimization methods for finding the factors in MF and TF are
discussed in Section 2.4. As data grows, different algorithms have been proposed to
scale up factorization processes; Section 2.5 discusses these distributed models, and
Section 2.6 reviews deep learning based recommendation systems. Finally, this chapter
highlights a few research gaps in Section 2.7.
2.1 Data format
This section presents the rating matrix and the rating tensor as ways to represent
data in recommendation systems.
2.1.1 Rating matrix (utility matrix)
Users and items are the two primary entities of recommendation systems (Konstan
2004). Users may or may not provide feedback on different items, and different
websites may use different kinds of feedback. For example, users of Facebook
may click "thumbs-up" to like a post, while Amazon users rate items they bought from
1 to 5 stars on the Amazon website. For those who do, the feedback, which reflects their
degree of preference for the items (Zhang et al. 2008), can be assigned a value, or
rating, corresponding to a user-item pair. All of these ratings, including the missing
ones, form a matrix with users in one dimension and items in the other. This
matrix is called a rating matrix (or a utility matrix) whose observed entries are the
ratings the users provided. An example of a rating matrix is shown in Figure 2.1.
In the sample rating matrix in Figure 2.1, there are six users and seven movies.
Entries with star symbols represent the ratings users provided for the respective movies.
Ratings range from one star (dislike) to five stars (like very much). Blank entries are not yet
rated. Recommendation systems are built to utilize the observed ratings in order
to predict these missing entries (Resnick & Varian 1997). They then recommend
movies with high predicted ratings to the users.
Generally, there are many users and many items, and a particular user typically
rates only a few items. Thus, the number of observed ratings is much smaller than
the number of missing ones.
Figure 2.1 : An example of a rating matrix of a movie recommendation system. Users rate movies from one to five stars. Blank entries are missing ratings, as the users have not rated them yet. The recommendation system has to predict them.
In other words, the rating matrix is often sparse, with just
a few entries having values (Toscher et al. 2008).
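To make this representation concrete, the following minimal sketch builds a small sparse rating matrix in Python with numpy; the user and movie counts match Figure 2.1, but the particular (user, movie, rating) triples are illustrative placeholders, not values from the figure.

```python
# A small sparse rating matrix: 6 users x 7 movies, as in Figure 2.1.
# The (user, movie, rating) triples below are illustrative placeholders.
import numpy as np

n_users, n_movies = 6, 7
R = np.zeros((n_users, n_movies))  # 0 marks a missing (unobserved) rating

observed = [(0, 0, 5), (0, 3, 4), (1, 2, 3), (2, 5, 4), (3, 1, 2), (4, 6, 5)]
for u, m, r in observed:
    R[u, m] = r

sparsity = 1 - len(observed) / R.size
print(f"observed: {len(observed)}, sparsity: {sparsity:.0%}")  # sparsity: 86%
```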
2.1.2 Tensor
As data evolves over time, recommendation systems are likely to acquire
additional entities. For instance, ratings can be collected by weekday, so the
systems have seven rating matrices from Monday to Sunday, as shown in Figure
2.2a. Such data can be naturally represented by a tensor (Itskov 2009).
A tensor is defined as a multidimensional array (Kolda & Bader 2009). It is
often specified by its mode (a.k.a. order or way), which is the number of dimensions.
Specifically, a mode-1 tensor is a vector and a matrix is a mode-2 tensor; a mode-3 or
higher-order tensor is often simply called a tensor. In Figure 2.2b, the movie ratings
by weekdays are put in a mode-3 tensor. Similar to the rating matrix, the tensor
is often sparse.
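As an illustration, the mode-3 rating tensor of Figure 2.2b can be sketched the same way as the rating matrix above; the entries are again placeholders.

```python
# A mode-3 rating tensor of user x movie x weekday, as in Figure 2.2b.
# The (user, movie, weekday, rating) quadruples are illustrative placeholders.
import numpy as np

T = np.zeros((6, 7, 7))            # 6 users, 7 movies, 7 weekdays; 0 = missing
for u, m, d, r in [(0, 0, 1, 5), (1, 2, 6, 3), (2, 5, 0, 4)]:
    T[u, m, d] = r

print("mode:", T.ndim, "shape:", T.shape)  # mode: 3 shape: (6, 7, 7)
```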
Figure 2.2 : An example of a rating tensor of mode-3. Movies are rated by users for each weekday from one to five stars. A rating is represented by a three-dimensional tensor of user-by-item-by-weekday. (a) Movie ratings by weekdays; (b) a mode-3 tensor.
2.1.3 Coupled datasets
Recent innovations in the Internet and social media have made many closely
related datasets available. As a result, it is possible to find rating matrices across
domains having one dimension in common. For example, MovieLens and Netflix
websites each published a dataset of their user ratings on some movies. Although
users on MovieLens and those on Netflix are different, they may rate the
same list of movies. In other words, these datasets have the movie dimension in common
and are said to be coupled in their movie dimension.
Figure 2.3 : An example of coupled rating matrices X(1) and X(2) from the Netflix and MovieLens websites, respectively. Blank entries are unobserved ratings. X(1) and X(2) contain ratings of different users for the same set of movies; they are said to be coupled in the movie dimension.
Besides the two coupled matrices above, a coupled matrix and a tensor can sometimes be found.
Figure 2.4 : An example of a coupled matrix tensor from the MovieLens dataset. Movie ratings are captured in a mode-3 tensor X of users by movies by weekdays. Additional information forms a matrix Y of users by user profiles and a matrix Z of movies by genres. Tensor X and matrix Y are coupled in their user mode; tensor X and matrix Z are coupled in their movie mode.
For example, the MovieLens dataset (Harper & Konstan 2015)
includes ratings from users on movies over a period of time. This information can
be represented in the form of a three-dimensional tensor X of users by movies by
weekdays whose entries are ratings. Besides, MovieLens also captures user identities
and categorizes movies into different genres. This additional information forms a
matrix Y of users by user profiles and a matrix Z of movies by genres. More
interestingly, the first dimension of X is correlated with the first dimension of Y,
and the second mode of X has a relationship with the first dimension of Z. Figure
2.4 visualizes this relationship. In this case, X is said to be coupled with Y in
its first mode, and joined with Z in its second mode.
2.2 Recommendation Systems
Recommendation systems have gained importance and popularity among
product providers. Two fundamental techniques are widely chosen for developing
personalized recommendation systems: the content-based approach (Pazzani & Billsus
2007) and the collaborative filtering (CF)-based approach (Schafer et al. 2007, Ekstrand
et al. 2011). The former focuses on information about users or items for making
recommendations, whereas the latter is based on latent similarities (Gao et al. 2012,
Menon & Elkan 2011) between user interests and item characteristics
for predicting the items specific users would be interested in. As this research focuses
on improving CF-based recommendations, this section discusses key techniques for
CF-based recommendation systems.
2.2.1 Matrix Factorization
The basic idea of CF-based recommendations is that they rely on latent similarities
among users and items for making recommendations. They analyze
past user preferences to identify new items for which users are likely to have similar preferences
(Hu et al. 2008). Koren et al. (2009) pioneered the application of Matrix Factorization
(MF) to movie rating prediction. Observed movie ratings in the form of a user-movie
matrix were decomposed into low-rank matrices, called latent factors or simply factors.
$$\mathbf{X} \approx \mathbf{U}\mathbf{V}^T$$
where X ∈ R^{n×m} is the rating matrix of n users by m items, U ∈ R^{n×r} is the
user factor (the factor in the user dimension), V ∈ R^{m×r} is the item factor, and r is the
rank of the factorization.
The factors U and V can be found by minimizing the following loss function:

$$L = \frac{1}{2}\left\| \mathbf{U}\mathbf{V}^T - \mathbf{X} \right\|^2 + \frac{\lambda}{2}\left( \left\|\mathbf{U}\right\|^2 + \left\|\mathbf{V}\right\|^2 \right) \qquad (2.1)$$

where the second term is the squared L2 regularization term used to prevent overfitting.
Once the latent factors U and V are found, a matrix multiplication of them is then
performed to predict the missing entries of the rating matrix X (Kiraly et al. 2015).
The performance of MF was demonstrated in the Netflix competition (Bell & Koren
2007), where Koren (2009) achieved the highest accuracy for movie rating prediction
with it.
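A minimal sketch of this idea follows, assuming stochastic gradient descent over the observed entries of the loss in Eq. (2.1); the rank, learning rate and regularization weight are illustrative choices, not values from the literature above.

```python
# Minimal MF by stochastic gradient descent on observed entries, per Eq. (2.1).
# Hyperparameters (r, eta, lam) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 7, 3
obs = [(0, 0, 5.0), (0, 3, 4.0), (1, 2, 3.0), (2, 5, 4.0), (3, 1, 2.0), (4, 6, 5.0)]

U = 0.1 * rng.standard_normal((n, r))   # user factor
V = 0.1 * rng.standard_normal((m, r))   # item factor
eta, lam = 0.01, 0.1

for _ in range(2000):
    for i, j, v in obs:
        e = U[i] @ V[j] - v             # residual on one observed rating
        U[i], V[j] = U[i] - eta * (e * V[j] + lam * U[i]), \
                     V[j] - eta * (e * U[i] + lam * V[j])

# Predict a missing entry by the dot product of the learned factors
print(f"predicted rating for (user 0, item 1): {U[0] @ V[1]:.2f}")
```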
2.2.2 Matrix Tri-Factorization
Unlike MF, which factorizes a rating matrix into two factors, matrix tri-factorization
decomposes the input rating matrix into three factors. When the given matrix is
complete, decomposing it into factors can be done by Singular Value Decomposition.
However, when it is incomplete, computing its exact decomposition is an
intractable task (Kolda & Bader 2009). Thus, a more efficient and feasible approach
is to approximate the incomplete X of n users by m items as a matrix product
of U ∈ R^{n×r}, S ∈ R^{r×r} and V^T ∈ R^{r×m}:

$$\mathbf{X} \approx \mathbf{U}\mathbf{S}\mathbf{V}^T$$

where r is the rank of the factorization, U^T U = I and V^T V = I.
In this case, U is the user factor, V is the item factor and S is the weight between
U and V. These factors can be found by minimizing the following:

$$L = \frac{1}{2}\left\| \mathbf{U}\mathbf{S}\mathbf{V}^T - \mathbf{X} \right\|^2 + \frac{\lambda}{2}\left( \left\|\mathbf{U}\right\|^2 + \left\|\mathbf{S}\right\|^2 + \left\|\mathbf{V}\right\|^2 \right) \qquad (2.2)$$
2.2.3 Tensor Factorization
Matrix Factorization (MF), a methodology that decomposes a big matrix
into two much lower-dimensional factors, proved its effectiveness in the Netflix Prize
competition (Koren 2009). Given some observed user ratings for a set of movies,
Netflix challenged the research community to predict the unknown ratings.
Figure 2.5 : Tensor factorization following the CANDECOMP/PARAFAC (CP) model for a mode-3 tensor X, which is decomposed into three low-rank factors U^(1), U^(2), and U^(3).
The winner achieved the most accurate movie rating predictions by representing the
Netflix dataset as a rating matrix of n users by m movies, factorizing it into two
low-rank factors, and then predicting the missing entries from these factors (Koren
et al. 2009). Since then, MF has become a new trend and has been extended to
multi-mode, high-dimensional and sparse big data, with the goal of capturing the
underlying low-dimensional matrices, the so-called low-rank factors.
Karatzoglou et al. (2010) and Wang et al. (2012b) introduced CF-based tensor
factorization (TF) for a flexible and generic integration of contextual information.
As an extension of MF, TF factorizes a multidimensional array, a so-called tensor,
into its latent factors to capture the underlying low-rank structures (Fang & Pan
2014, Jiang et al. 2014). Following CANDECOMP/ PARAFAC (CP) decomposition
model (Harshman 1970), TF expresses a mode-p tensor as a sum of a finite number
of rank-one components (as shown in Figure 2.5), formulated as:
$$L\left(\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(p)}\right) = \left\| \llbracket \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(p)} \rrbracket - \mathbf{X} \right\|^2$$

where X ∈ R^{I_1×I_2×···×I_p} is a mode-p tensor, its p rank-r factors are U^(l) ∈ R^{I_l×r} for all l ∈ [1, p], and the Kruskal operator is defined entry-wise as $\llbracket \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(p)} \rrbracket_{i_1,i_2,\ldots,i_p} = \sum_{k=1}^{r} \prod_{l=1}^{p} \mathbf{U}^{(l)}_{i_l,k}$.
Along with TF, researchers have addressed three key problems of TF:

• How to achieve high recommendation accuracy?

• How to factorize the input tensors? and

• Given huge data, how can this computation be done in a reasonable time?

The following sections discuss these key issues of TF.
2.3 Cross-domain Recommendation Systems
As the input rating matrix (or tensor) is sparse in nature, leveraging closely re-
lated datasets has been proposed to improve recommendation accuracy (Lahat et al.
2015a). This trend has recently been forecast to continue for the foreseeable future (Liu
et al. n.d., 2012). Two major approaches have been widely applied: joint analysis
of multiple datasets (Acar, Kolda & Dunlavy 2011, Gao et al. 2013, Gemulla et al.
2011) and transfer learning (Pan et al. 2011, Li et al. 2009a). The following subsec-
tions introduce widely used algorithms for cross-domain recommendation systems
related to this thesis.
2.3.1 Collective Matrix Factorization
Singh & Gordon (2008) proposed a joint analysis of two matrices coupled in one
of their modes. As they have one dimension in common, they are likely to share
some common characteristics in the coupled mode. Thus, the authors introduced
the collective matrix factorization (CMF) algorithm to use these explicit similarities
between them to overcome the sparsity of the input rating matrices. To this end,
the authors assumed both datasets have a common low-rank subspace in their
coupled dimension. Supposing X_1 and X_2 are coupled in their first mode, the authors
modeled CMF with a coupled loss function:

$$L = \left\| \mathbf{U}\mathbf{V}_1^T - \mathbf{X}_1 \right\|^2 + \left\| \mathbf{U}\mathbf{V}_2^T - \mathbf{X}_2 \right\|^2 \qquad (2.3)$$
where common U represents explicit similarities shared between two datasets and
regularization terms are omitted for simplification.
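A minimal sketch of the CMF objective in Eq. (2.3) follows, with two toy matrices coupled in their first mode and a single shared factor U; all values are random placeholders standing in for learned factors.

```python
# CMF coupled loss, Eq. (2.3): X1 and X2 share the same first-mode factor U.
import numpy as np

rng = np.random.default_rng(0)
n, m1, m2, r = 6, 7, 5, 3
X1, X2 = rng.random((n, m1)), rng.random((n, m2))

U = rng.random((n, r))                     # shared (coupled) factor
V1, V2 = rng.random((m1, r)), rng.random((m2, r))

loss = (np.linalg.norm(U @ V1.T - X1) ** 2
        + np.linalg.norm(U @ V2.T - X2) ** 2)
print(f"coupled loss: {loss:.3f}")
```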
The concept of explicit similarities as the common factor has been widely used.
Bhargava et al. (2015) proposed that location, activity and time together would
provide a complete picture of users; thus, different data sources were modeled to
have common coupled factors to fuse the explicit similarities among them. Transfer by
Collective Factorization (TCF) (Pan et al. 2011) was proposed to use ratings and
binary like/dislike feedback from the same users and items. TCF assumed that both the user
and item factors would be the same between the inputs. Joint Matrix Factorization
(JMF) (Shi et al. 2013) leveraged 5-star ratings and an item-item similarity matrix
collected from auxiliary data; the two datasets were joined via the same item factor.
There are also several ideas extending CMF's coupled loss function. Weighted
Non-negative Matrix Co-Tri-Factorization (WNMCTF) (Yoo & Choi 2009) extended
CMF's idea to non-negative matrix tri-factorization (Lee & Seung 2000).
2.3.2 Coupled Matrix Tensor Factorization
In the case of a coupled matrix and tensor as introduced in Section 2.1.3, the first
dimension of X is correlated with the first dimension of Y, and the second mode of
X has a relationship with the first dimension of Z. This side information, Y and Z,
if jointly analyzed with the primary data X, helps to deepen our understanding
of the underlying patterns in the data and to improve the accuracy of the tensor
decomposition.

Acar, Kolda & Dunlavy (2011) extended CMF to the joint analysis of a matrix
and a tensor. They introduced coupled matrix tensor factorization (CMTF), whose
loss function was defined as

$$L = \left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|^2 + \left\| \mathbf{U}\mathbf{A}^T - \mathbf{Y} \right\|^2 \qquad (2.4)$$
Figure 2.6 : Joint factorization of a coupled matrix tensor. X is a tensor of ratings made by users for movies on weekdays. Matrices Y and Z represent user information and movie genre, respectively. The movie rating tensor X is therefore coupled with the user information matrix Y in the 'user' mode, and joined with the movie category matrix Z in the 'movie' mode. X is factorized as a sum of low-rank vectors u, v and w; matrix Y is decomposed as a sum of u and a; and a sum of v and b is an approximation of matrix Z. Note that X and Y share the same user factor U whereas X and Z share the same movie factor V.
where $\llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket_{i,j,k} = \sum_{f=1}^{r} \mathbf{U}_{i,f}\,\mathbf{V}_{j,f}\,\mathbf{W}_{k,f}$, U, V and W are factors of X, and
U and A are factors of Y. It is worth noting that U is the common factor of both
X and Y. In this case, regularization terms are again omitted for simplification.

Figure 2.6 illustrates the case of a coupled matrix tensor factorization of X, Y
and Z. X is factorized as a sum of outer products of the low-rank vectors u, v and w; matrix Y is
decomposed as a sum of outer products of u and a; and a sum of outer products of v and b approximates
matrix Z.
The idea of having a common factor has two potential issues. Firstly, it assumes
that the first mode of X shares a common low-rank subspace with the first mode of
Y. A basis for that low-rank subspace is expressed by the identical latent factor U
of X and Y. Admittedly, the first factor of X highly correlates with the first factor
of Y, yet they are unequal in many real-world data and applications. Due to this
fact, we hypothesize that forcing them to be the same would reduce the accuracy of
this factorization. More importantly, by using identical U for both X and Y, the
best final result optimizes either X or Y, not both. It may approximate X well and
lose Y decomposition’s accuracy, or vice versa. This issue is especially critical when
we want to find the best approximation of every tensor correlated with the others.
2.3.3 CodeBook Transfer
Aiming to improve recommendation in one domain by utilizing latent rating
patterns from another domain, Li et al. (2009a) designated one dataset as a source domain
X_src and the other as a target domain X_tgt. In this research, the authors
employed matrix tri-factorization to decompose the source X_src into user,
item and weighting factors:

$$\mathbf{X}_{src} \approx \mathbf{U}_{src}\mathbf{S}_{src}\mathbf{V}_{src}^T \qquad (2.5)$$

The weighting factor S_src defines the rating patterns of the users for the items
and is used as the codebook to be transferred from X_src to X_tgt. Thus, X_tgt becomes

$$\mathbf{X}_{tgt} \approx \mathbf{U}_{tgt}\mathbf{S}_{src}\mathbf{V}_{tgt}^T \qquad (2.6)$$
This explicit knowledge transferred from the source improved the accuracy of
recommendation in the target domain. An extension, named the Rating-Matrix
Generative Model (RMGM) (Li et al. 2009b), combined the steps of extracting the
codebook and transferring it. Where there are more than two datasets, Moreno
et al. (2012) introduced multiple explicit codebooks shared among them, with each
codebook assigned a different weight to better utilize the correlations. Even though
codebook transfer works on any datasets, with or without common users or common
items, it is only effective when all the datasets have the same rating patterns. This
condition limits its applicability to the broader range of available data.
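A minimal sketch of the transfer in Eqs. (2.5)-(2.6) follows, assuming the factors have already been learned (random placeholders stand in for them here): the source codebook S_src is reused to reconstruct the target domain.

```python
# Codebook transfer, Eqs. (2.5)-(2.6): reuse the source rating-pattern matrix
# S_src when reconstructing the target domain. Factors are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
r = 3
U_src, S_src, V_src = rng.random((8, r)), rng.random((r, r)), rng.random((10, r))
X_src_hat = U_src @ S_src @ V_src.T      # source approximation, Eq. (2.5)

# The target has different users and items but borrows the codebook S_src
U_tgt, V_tgt = rng.random((12, r)), rng.random((9, r))
X_tgt_hat = U_tgt @ S_src @ V_tgt.T      # target approximation, Eq. (2.6)
print(X_src_hat.shape, X_tgt_hat.shape)  # (8, 10) (12, 9)
```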
2.3.4 Cluster-Level Latent Factor Model
The assumption that two datasets from different domains have the same rating
patterns is unrealistic in practice. They may share some common patterns while
possessing their own characteristics. This motivated Gao et al. (2013) to propose
CLFM for cross-domain recommendation. Specifically, the authors partitioned the
rating patterns across domains into common and domain-specific parts:

$$\mathbf{X}_1 \approx \mathbf{U}_1\left[\mathbf{S}_0 \,|\, \mathbf{S}_1\right]\mathbf{V}_1^T, \qquad \mathbf{X}_2 \approx \mathbf{U}_2\left[\mathbf{S}_0 \,|\, \mathbf{S}_2\right]\mathbf{V}_2^T \qquad (2.7)$$

where S_0 ∈ R^{r1×c} contains the common patterns, S_1, S_2 ∈ R^{r1×(r2−c)} are the domain-specific parts, and c is the number of common columns.
This model allows CLFM to learn only the shared latent space S_0, which has two
advantages. Firstly, as S_0 captures the similar rating patterns across domains, it
helps to overcome the sparsity of each dataset. Secondly, the domain-specific S_1 and
S_2 contain each domain's discriminative characteristics. As a result, the diversity of ratings in
each domain is preserved, improving recommendation performance.
2.4 Factorization Methodologies
A large number of different methodologies have been proposed to optimize TF
and CMTF. The most popular one is Alternating Least Squares (ALS) (Kolda &
Bader 2009). In a nutshell, ALS solves the least-squares problem for one factor while fixing
the others, and does this iteratively, alternating over the factors
until the algorithm converges. Shin & Kang (2014) followed
ALS, yet updated a subset of a factor’s columns at a time. ALS based algorithms are
computationally efficient, yet may converge slowly with sparse data (Acar, Kolda &
Dunlavy 2011).
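A minimal sketch of the ALS idea follows for the two-factor MF of Section 2.2.1 on a small dense matrix; lam is an illustrative ridge term, and a sparse rating matrix would restrict each least-squares solve to the observed entries.

```python
# One ALS loop for MF: solve for U with V fixed, then for V with U fixed.
import numpy as np

rng = np.random.default_rng(0)
n, m, r, lam = 6, 7, 3, 0.1
X = rng.random((n, m))                      # toy dense "rating" matrix
U, V = rng.random((n, r)), rng.random((m, r))

for _ in range(20):
    # U <- argmin ||X - U V^T||^2 + lam ||U||^2  (closed-form ridge solution)
    U = np.linalg.solve(V.T @ V + lam * np.eye(r), V.T @ X.T).T
    # V <- argmin ||X - U V^T||^2 + lam ||V||^2
    V = np.linalg.solve(U.T @ U + lam * np.eye(r), U.T @ X).T

print(f"reconstruction error: {np.linalg.norm(U @ V.T - X):.3f}")
```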
Gradient descent (GD) based optimization, such as stochastic gradient descent,
is an alternative to ALS. Starting with initial factor values, GD refines them
by iterating a stochastic difference equation (Gemulla et al. 2011). On the one
hand, GD is simple to implement. On the other hand, choosing good optimization
parameters such as the learning rate is not straightforward (Le et al. 2011); a learning
rate is usually decided based on experiments. Another approach is a backtracking
search for the step giving the maximum decrease on the data. CMTF (Acar, Kolda & Dunlavy 2011)
used Nonlinear Conjugate Gradient (NCG) with a line search to find an optimal
descent direction. Nevertheless, backtracking search is computationally heavy and
is therefore not suitable for large-scale datasets.
2.5 Distributed Factorization
Although ALS and GD have proved their effectiveness in optimizing matrix factorization,
tensor factorization, and coupled matrix tensor factorization, they are
only practical for small data. As applications of TF and CMTF usually deal with many
gigabytes of data, researchers have focused on developing distributed algorithms. The
nature of TF and CMTF requires the same computation to be done on different sets
of data. Consequently, several levels of data parallelism have been proposed.
Data parallelism in a multiprocessor system divides big tasks into many identical
subtasks; each subtask is performed on a separate processor. Turbo-SMT by Papalexakis
et al. (2014) addressed this direction by sampling sparse coupled tensors into sev-
eral tiny coupled tensors, concurrently decomposing them into factors using Matlab
Parallel ToolBox, and then merging the resulting factors. Another approach that followed
this direction was GPUTensor (Zou et al. 2015), which utilized the multiprocessors in
a GPU for factor computation. Even though these methods improved factorization
speed significantly, they performed their tasks in a single machine (albeit one with multiple
processors or the many cores of a powerful GPU). Thus, they would experience "out of memory"
errors if the data is too big to be loaded into the local machine.
Figure 2.7 : Distributed factorization algorithms on a data-computation coordinate. Plotted algorithms: CMTF_OPT (CTF, Acar et al. 2011), Turbo-SMT (CTF, Papalexakis et al. 2014), ScouT (CTF, Jeon et al. 2016), FlexiFact (CTF, Beutel et al. 2014), SALS (TF, Shin et al. 2015), GPUTensor (TF, Zou et al. 2015), and GigaTensor (TF, Kang et al. 2012). The x-axis represents the level of data distribution, from data located in a centralized memory or file server on the left to data distributed to computing nodes' memory on the right. The y-axis captures the level of distributed computation, from algorithms processed in a local machine to ones run on a distributed cluster (bottom to top).
Distributed data parallelism scales better with the data size. This level makes
use of distributed processors to do calculations. Moreover, the big data files are
often stored in a distributed file system which can theoretically store big files of any
size. ScouT proposed by Jeon et al. (2016), FlexiFact by Beutel et al. (2014) and
GigaTensor by Kang et al. (2012) followed this approach by defining factorization as
MapReduce processes. For any calculation to be done on a distributed processor,
the corresponding part of the data in the distributed file system needs to be transmitted
to that processor. This process repeats across the algorithms' iterations, incurring
heavy communication overhead. Shin & Kang (2014) introduced SALS to overcome this
MapReduce framework's weakness by caching data on computing nodes' local disks.
SALS reduced communication overhead significantly. Yet this communication can
be reduced even more. As data is stored on disks, reading it to memory for each
access takes time, especially for huge datasets and many iterations.
All the algorithms with different levels of data parallelism are put in an x-y co-
ordinate as in Figure 2.7. Naturally, there is no algorithm in quadrant IV as data
computed in a single machine is normally located locally. Algorithms in quadrant
III perform calculations within a local system with local data. As the whole data is
located in local memory, these algorithms run into trouble as the data size increases.
Those in quadrant II are distributed algorithms in which data is centralized
in a distributed file server. These algorithms scale quite well as the data grows.
Nevertheless, centralized data can be a significant issue: as the data is stored on a
server separate from the computing nodes, it must be transmitted to the computing
nodes for each calculation. Communication overhead is therefore one of the
most significant disadvantages. SALS (Shin & Kang 2014) in quadrant I overcame
this massive communication overhead by caching data on local disks, thus distributing
the data to the computing nodes. This data distribution can still be improved
further with in-memory caching.
2.6 Deep learning based recommendation systems
Recent approaches take advantage of deep learning to capture and learn similarities
and latent relationships between users and items. Several deep networks
have been introduced for collaborative filtering (Karatzoglou & Hidasi 2017, Wang,
Wang & Yeung 2015, He et al. 2017). For cross-domain recommendation
systems, Elkahky et al. (2015) proposed a multi-view deep learning
approach (DSSM) in which the users and items of each dataset are fed through two
neural networks, which map them into semantic vectors. In this
approach, the relationships between users and items are defined as the cosine similarity
of their corresponding semantic vectors. In addition, the common dimension
among datasets shares the same network. In the example of Figure 2.8, where X(1)
and X(2) have the same users, the users of both datasets are fed to the same network to learn
its parameters.
Figure 2.8 : Multi-view deep neural network for cross-domain recommendation with two datasets that have the same users. In this case, the users of both datasets share the features of the left-most network.
All the proposed deep learning based methods assume that datasets across
domains have identical knowledge of the common dimension. In fact, cross-domain
datasets often have different characteristics: they may share some
common properties while also possessing their own unique ones. Methods that lack
the ability to capture this fact cannot achieve the best recommendation
accuracy.
2.7 Research gaps
This chapter reviews the existing algorithms in relation to the three research
questions of this thesis. Firstly, how to share the actual explicit similarities between
cross-domain datasets to understand their relationships. Secondly, how to exploit
the implicit similarities in non-coupled dimensions across domains to improve rec-
ommendation accuracy further. Lastly, how to scale up the factorization process to
different numbers of tensors coupled in one or more modes, tensor modes, tensor
dimensions and billions of observations. We propose several algorithms in this thesis
to fill the research gaps described below.
The current methods of utilizing explicit similarities across domains fail to share
the actual correlation in the coupled dimension. For the joint analysis of cross-domain
datasets or transfer learning between them, the existing models assume coupled
datasets to share a common coupled factor. Admittedly, the coupled factors are
highly correlated with each other, yet they are unequal in many real-world data and
applications. Thus, forcing them to be the same would reduce the recommendation
accuracy. However, this performance reduction is not the only issue. By using an
identical coupled factor for cross-domain datasets, the final result optimizes either
of them, not both. It may approximate X well and lose Y’s decomposition accuracy,
or vice versa.
Furthermore, none of the existing algorithms takes into account the implicit similarities
that exist in many applications. Cross-domain datasets not only have explicit
similarities in the coupled dimension, but they also share implicit ones in the non-
coupled dimension. Different approaches have been proposed to perform a joint
analysis of coupled datasets (Pan 2016). However, all of the existing algorithms
use explicit similarities as a bridge to collaborate among datasets. Although these
explicit similarities showed their effectiveness in improving recommendation, there
are still rich implicit features that were not used but have great potential to further
improve the recommendation.
Last but not least, the current distributed factorization methods still incur high
communication and computation costs. MapReduce-based algorithms were introduced
to perform the factorization process in parallel on many computing nodes.
Even though distributed computing allows factors to be updated faster, MapReduce-
based models require data to be transferred from the isolated distributed file system
to the computing node when it needs to process this data. The iterative nature
of tensor factorization requires data and factors to be distributed over and over
again, incurring huge communication overhead. A scalable factorization model with
minimal overhead will scale up well as the data grows.
Motivated by the above literature gaps, this thesis investigates the three research
questions and proposes a scalable factorization model to share implicit and explicit
similarities across domains for higher recommendation accuracy.
Chapter 3
Explicit Similarity Discovery
This chapter encapsulates contribution #1 of this thesis. It is an extended descrip-
tion of the following publication:
Quan Do and Wei Liu, “ASTen: an Accurate and Scalable Approach to Coupled
Tensor Factorization,” in Proceedings of the 2016 International Joint Conference on
Neural Networks (IJCNN), pp. 99-106, Jul. 24-29, 2016.
3.1 Introduction
This chapter addresses the problem of sharing the actual explicit similarities
in the coupled mode of coupled datasets. Conventional methods such as CMTF
(Acar, Kolda & Dunlavy 2011) assume a coupled matrix and tensor have identical
factors on their coupled modes, which is detrimental to the overall accuracy of the
factorization. There are two problems with this assumption. The first issue is that
they assume the coupled mode of the tensor shares a common low-rank subspace
with the coupled mode of the matrix. A basis for this low-rank subspace is expressed
by the same latent coupled factor shared between them. Admittedly, their coupled
factors (in their coupled dimensions) are highly correlated, yet they are unequal in
many real-world data and applications. Due to this fact, we hypothesize that forcing
them to be the same reduces the accuracy of this factorization. More importantly,
by using an identical coupled factor for both datasets, the best final result optimizes
only one of them, not both. It may approximate the tensor well and lose the accuracy
of the matrix decomposition, or vice versa. This problem is especially critical when
the aim is to find the best approximation of both the tensor and matrix which are
correlated to one another.
The second problem could come from an assumption that the error of approx-
imating the tensor and that of decomposing the matrix contribute equally to the
final loss of the model. Clearly, this is not typically the case, as the size of the tensor
and that of the matrix often differ significantly. The loss of factorizing the larger-sized
one usually outweighs that of decomposing the smaller one. This loss, hence,
reduces the precision of predicting the missing entries in the smaller-sized tensor. In
other words, the traditional loss function does not optimize both the matrix and the
tensor simultaneously. It sacrifices the accuracy of decomposing the smaller-sized
one to better approximate the larger tensor.
This chapter proposes an algorithm to solve these two weaknesses. Unlike ex-
isting algorithms with the traditional objective function which forces the coupled
modes among datasets to share identical coupled factors, the proposed model de-
fines a new objective function which can be optimized with respect to every single
tensor and matrix. This new function enables each dataset to have its own dis-
criminative factor on the coupled mode and regularizes the coupled factors to be
as close as possible to share their explicit similarities. Consequently, optimizing
the proposed objective function produces the lowest decomposition error rate. In
other words, the proposed method is capable of accurately sharing the explicit
similarities, achieving the accurate approximation of every matrix and tensor of the
coupled datasets. Also, this chapter provides a theoretical proof and experimental
evidence that the proposed algorithm converges to an optimum.
The chapter describes ASTen in Section 3.2 by introducing a new objective
function to better share the explicit similarities in the coupled factors. Section
3.3 explains a method to optimize the function. A theoretical proof of the model’s
convergence is also presented. Moreover, Section 3.4 reveals some empirical evidence
to demonstrate the convergence of the method. Finally, the key contributions of the
proposed idea to utilize the explicit similarities across domains are summarized in
Section 3.5.
3.2 ASTen: the proposed Accurate Coupled Tensor Factor-
ization model
This section explains the proposed ASTen, a gradient descent (GD) based al-
gorithm, to more accurately solve CMTF. Although it focuses on discussing mode-3
tensors, all discussions and the algorithm generalize well to higher mode ones. Sup-
pose X ∈ Rn×m×p is a mode-3 tensor. It is obvious that X has at most three coupled
matrices or tensors. Without loss of generality, the case where X and a matrix
Y ∈ R^{n×q} are coupled in their first mode is discussed here; the second and
third additional matrices or tensors can be handled in the same way.
Unlike traditional CMTF algorithms, which use Equation (2.4) as their loss function,
ASTen decomposes X and Y into U, V, W, A and B, where U, V and W are
factors of X and A and B are factors of Y. ASTen expresses the relationship
between U and A, which are jointly correlated, through the new third term in the
proposed loss function:
$$\begin{aligned}
L(\mathbf{U},\mathbf{V},\mathbf{W},\mathbf{A},\mathbf{B})
&= \frac{\left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|_F^2}{\Omega_X}
 + \frac{\left\| \mathbf{A}\mathbf{B}^T - \mathbf{Y} \right\|_F^2}{\Omega_Y}
 + \frac{\left\| \mathbf{U} - \mathbf{A} \right\|_F^2}{\Omega_U} \\
&= \frac{\sum_{i,j,k}^{n,m,p}\left(\mathbf{X}_{i,j,k} - \sum_{l=1}^{r}\mathbf{U}_{i,l}\mathbf{V}_{j,l}\mathbf{W}_{k,l}\right)^2}{\Omega_X}
 + \frac{\sum_{i,j}^{n,q}\left(\mathbf{Y}_{i,j} - \sum_{l=1}^{r}\mathbf{A}_{i,l}\mathbf{B}_{j,l}\right)^2}{\Omega_Y}
 + \frac{\sum_{i}^{n}\sum_{l=1}^{r}\left(\mathbf{U}_{i,l} - \mathbf{A}_{i,l}\right)^2}{\Omega_U}
\end{aligned} \qquad (3.1)$$
where ΩX, ΩY and ΩU denote the size of X, Y and the coupled factor U (A has
the same size as U), respectively. In case of sparse tensors, they are the number of
observed elements.
Equation (3.1) overcomes the two weaknesses of the conventional loss function
(2.4) proposed in the literature. First, Equation (3.1) optimizes X with respect
to U, V, W and the correlated factor A in Y. At the same time, it minimizes Y
approximation error with the optimal A, B and the coupled factor U in X. On
top of that, Equation (3.1) also captures the size difference of X and Y where the
error of X and that of Y are divided by their sizes, respectively. This normalization
ensures the size difference does not have any influence on the distribution of the loss
of X and that of Y to the total decomposition error. As a result, ASTen optimizes
both X and Y without sacrificing the accuracy of either one of the two.
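A minimal sketch of evaluating the loss in Eq. (3.1) on dense toy data follows; for sparse inputs, each sum would run over the observed entries only, with each Ω being the number of observations.

```python
# The normalized ASTen loss of Eq. (3.1) on small dense placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q, r = 4, 5, 6, 3, 2
X, Y = rng.random((n, m, p)), rng.random((n, q))
U, V, W = rng.random((n, r)), rng.random((m, r)), rng.random((p, r))
A, B = rng.random((n, r)), rng.random((q, r))

X_hat = np.einsum('if,jf,kf->ijk', U, V, W)
loss = (np.sum((X_hat - X) ** 2) / X.size      # tensor term, scaled by its size
        + np.sum((A @ B.T - Y) ** 2) / Y.size  # matrix term, scaled by its size
        + np.sum((U - A) ** 2) / U.size)       # ties U to A without equating them
print(f"L = {loss:.4f}")
```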
3.3 Optimization
ASTen updates U (the same for V, W, A and B) following this stochastic
difference equation:
$$\mathbf{U}^{t+1} = \mathbf{U}^{t} - \eta^{t}\,\frac{\partial L}{\partial \mathbf{U}^{t}} \qquad (3.2)$$

where ∂L/∂U^t is the partial derivative of the loss function L(U,V,W,A,B) with respect
to U^t, η^t is a decreasing step size, and t + 1 denotes the current iteration while t denotes
the previous one.
ASTen updates one entry of a factor matrix at a time, for example, updating U
at row i and column f . Thus, the partial derivative with respect to each ui,f needs
to be computed and can be derived from (3.1):
$$\frac{\partial L}{\partial u^{t}_{i,f}} = -\frac{2}{\Omega_X}\sum_{j,k}^{m,p}\left[\left(x_{i,j,k} - \sum_{l=1}^{f-1} u^{t+1}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l} - \sum_{l=f}^{r} u^{t}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l}\right)v^{t}_{j,f}\,w^{t}_{k,f}\right] + \frac{2}{\Omega_U}\left(u^{t}_{i,f} - a^{t}_{i,f}\right) \qquad (3.3)$$
Equation (3.3) shows that an update for u_{i,f} at a row i and a particular column
f ∈ [1, r] (where r is the rank) depends on the whole slice X_{i,∗,∗} of X, on V and W
from the previous iteration, and on its paired factor a_{i,f}. It is also worth mentioning
that ASTen updates factor matrices column by column, i.e., it updates the first
column of U, then the second column, and so on. Doing so enables ASTen to use
the newly updated values of the first column to update the second column, and the
updated first and second columns to continue updating the third column. This process
is captured in the term u^{t+1}_{i,l} in the first part of (3.3).
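As a concrete, hypothetical rendering of Eq. (3.3), the sketch below computes the gradient for a single entry u_{i,f} on dense arrays; grad_u, its argument names, and the dense slicing are illustrative assumptions (ASTen itself operates on sparse observed entries), with U_new holding the columns already updated in the current sweep.

```python
# Entry-wise gradient of Eq. (3.3) for one u[i, f], dense-data sketch.
# U_new holds columns 0..f-1 already updated this sweep; U_old the rest.
import numpy as np

def grad_u(X, U_new, U_old, V, W, A, i, f, omega_X, omega_U):
    # residual over the slice X[i, :, :], mixing new (l < f) and old (l >= f) columns
    recon = (np.einsum('l,jl,kl->jk', U_new[i, :f], V[:, :f], W[:, :f])
             + np.einsum('l,jl,kl->jk', U_old[i, f:], V[:, f:], W[:, f:]))
    resid = X[i] - recon
    g = -2.0 / omega_X * np.sum(resid * np.outer(V[:, f], W[:, f]))
    return g + 2.0 / omega_U * (U_old[i, f] - A[i, f])
```

Sweeping such updates over i and f for each factor, with analogous helpers for the other factors, yields the column-by-column loops of Algorithm 1 below.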
In a similar way, the partial derivative with respect to the non-coupled factors,
such as V, is given by

$$\frac{\partial L}{\partial v^{t}_{j,f}} = -\frac{2}{\Omega_X}\sum_{i,k}^{n,p}\left[\left(x_{i,j,k} - \sum_{l=1}^{f-1} u^{t}_{i,l}\,v^{t+1}_{j,l}\,w^{t}_{k,l} - \sum_{l=f}^{r} u^{t}_{i,l}\,v^{t}_{j,l}\,w^{t}_{k,l}\right)u^{t}_{i,f}\,w^{t}_{k,f}\right] \qquad (3.4)$$

Note that the partial derivative with respect to the coupled factor a_{i,f} is similar to
(3.3), and the partial derivatives with respect to the non-coupled factors w_{k,f} and b_{j,f}
can be derived in the same way as (3.4). This optimization is shown in Algorithm 1.
Algorithm 1: ASTen updates the factors U, V, W, A and B using equations (3.2), (3.3) and (3.4). The factors that best approximate X and Y are the ones we want to produce.

Input : X, Y, E
Output: U, V, W, A, B

Randomly initialize U, V, W, A, B
Initialize L with a small number
repeat
    PreL = L
    for f = 1 to r do
        for i = 1 to n do: u^{t+1}_{i,f} ← u^t_{i,f} − η^t ∂L/∂u^t_{i,f}
        for j = 1 to m do: v^{t+1}_{j,f} ← v^t_{j,f} − η^t ∂L/∂v^t_{j,f}
        for k = 1 to p do: w^{t+1}_{k,f} ← w^t_{k,f} − η^t ∂L/∂w^t_{k,f}
        for i = 1 to n do: a^{t+1}_{i,f} ← a^t_{i,f} − η^t ∂L/∂a^t_{i,f}
        for j = 1 to q do: b^{t+1}_{j,f} ← b^t_{j,f} − η^t ∂L/∂b^t_{j,f}
    end for
    Compute L following equation (3.1)
until (PreL − L)/PreL < E
Theorem 1. The proposed ASTen as presented in Algorithm 1 converges.
Proof. The second partial derivative with respect to U is derived by
$$\frac{\partial^2 L}{\partial \left(u^{t}_{i,f}\right)^2} = \frac{2}{\Omega_X}\sum_{j,k}^{m,p}\left(v^{t}_{j,f}\,w^{t}_{k,f}\right)^2 + \frac{2}{\Omega_U} > 0$$
As the second partial derivative of L with respect to U is larger than 0, L is
a convex function with respect to U. Following the same logic, L is also convex
with respect to V, W, A and B. It is known that GD with a suitable step size converges
to the optimum of a convex function (Boyd & Vandenberghe 2004). As a consequence, ASTen
with the proposed objective function defined in (3.1) converges.
3.4 Performance Evaluation
The performance of the new loss function (3.1) proposed in ASTen is com-
pared with the traditional loss function (2.4) (i.e., ASTen with the traditional loss
function) in terms of factorization accuracy. The goal of conducting a series of
experiments is to assess:
1. how accurate the factorization of ASTen is when each tensor is allowed to
have its own discriminative factors and
2. how fast ASTen can achieve this precision with the proposed parallel model.

To better validate ASTen, several synthetic datasets are generated in which the
relationship between the coupled factors is controlled. In addition, while conducting the
experiments, empirical evidence (on top of the above theoretical proof) is presented to
illustrate the correctness of the proposed objective function. All the experiments are
executed on a cluster of 2× 2.8GHz Intel Xeon CPUs, each with 10 cores and 256GB
DDR3 RAM.
3.4.1 Data used in our experiments
In the experiments, a synthetic dataset and two real-world datasets are used to
validate the proposed algorithm. This section explains how the data is generated and
processed.
Table 3.1 : Ground truth distributions of the factor matrices in the synthetic data.

Factor | Test case #1     | Test case #2
U      | normal(0.5, 0.5) | normal(0.5, 0.5)
V      | random(0, 1)     | random(0, 1)
W      | random(0, 1)     | random(0, 1)
A      | normal(1.5, 0.5) | U * random(0, 2) + random(0, 1)
B      | random(0, 1)     | random(0, 1)
1. Synthetic data To generate the coupled tensor and matrix, factors U, V, W
and B of a pre-specified dimension and a predefined rank r are randomly generated.
Specifically, V, W and B are randomly generated from 0 to 1, and U is drawn from
a normal distribution with both mean and standard deviation equal to 0.5. The remaining
factor, A, which is coupled with U, is generated to have a correlation with U. A
summary of how the data is synthesized is given in Table 3.1. Two different
approaches to generating the ground truth A are tested:

test case #1) A is drawn from a normal distribution (mean = 1.5, standard deviation = 0.5),
and

test case #2) A equals U multiplied by a random coefficient plus some noise.

By doing so, the synthetic data has characteristics similar to those of real-world
data, where the coupled factors have a relationship yet are unequal. Unique data
points (i, j, k) are then randomly selected and X_{i,j,k} is computed as $\sum_{f=1}^{r} u_{i,f}\,v_{j,f}\,w_{k,f}$.
Y is computed the same way from A and B. X and Y are finally normalized
to [0, 1].
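A minimal sketch of generating the test case #2 factors from Table 3.1 follows; for simplicity it computes X and Y densely, whereas the experiments sample a subset of unique data points as the observed entries.

```python
# Synthetic coupled data, test case #2 of Table 3.1: A is a noisy transform of U.
import numpy as np

rng = np.random.default_rng(0)
n, m, p, q, r = 100, 100, 100, 100, 5
U = rng.normal(0.5, 0.5, (n, r))                        # normal(0.5, 0.5)
V, W, B = (rng.random((d, r)) for d in (m, p, q))       # random(0, 1)
A = U * rng.uniform(0, 2, (n, r)) + rng.random((n, r))  # correlated with U, yet unequal

X = np.einsum('if,jf,kf->ijk', U, V, W)                 # coupled tensor
Y = A @ B.T                                             # coupled matrix
X = (X - X.min()) / (X.max() - X.min())                 # normalize to [0, 1]
Y = (Y - Y.min()) / (Y.max() - Y.min())
print(X.shape, Y.shape)  # (100, 100, 100) (100, 100)
```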
2. MovieLens data The MovieLens dataset (Harper & Konstan 2015) includes
ratings from 943 users for 1,682 movies. It is compiled into a tensor X of (users, movies,
weekdays) whose entries are ratings, a matrix Y of (users, user profiles) and a matrix Z
of (movies, genres). Matrix Y has size 943 by 83, where a user is specified
by gender (0 or 1), is grouped into one of 61 age groups, and has one of 21
occupations. Matrix Z categorizes the 1,682 movies into 19 different genres; one movie
belongs to one or more genres. Finally, X is a tensor of 943 by 1,682 by 7. The values
of X's entries are 0.2, 0.4, 0.6, 0.8 and 1, which are equivalent to 1-5 star ratings. In
this chapter, the experiments tested all the algorithms with 80,000 known ratings,
together with Y and Z of 2,159 and 2,893 observed entries, respectively.
3. Yahoo! Music data The Yahoo! Music dataset∗ represents a snapshot of user
preferences for various songs. For this experiment, a tensor X of song information
and a matrix Y of user ratings are used. X categorizes 136,736 songs by 20,543
artists and 9,442 genres; each song is performed by one artist and belongs to one
genre. Entries of Y capture the ratings of 23,179 users for 136,736 songs, and their values
are 0.2, 0.4, 0.6, 0.8 and 1 which are equivalent to 1-5 ratings. There are 8,846,899
observed ratings in total, along with 136,736 observed nonzeros of X.
3.4.2 Performance metric
The input coupled tensors are decomposed into their low-rank factors with different
algorithms, and their decomposition accuracies are compared. In other words,
this experiment evaluates how well these factors approximate the original coupled
tensors. For instance, suppose U, V, W, A and B are the factors of X and Y, respectively,
as described in test case #1 above. How well the extracted factors
approximate X and Y is quantified with the mean squared error (MSE), defined by:
∗Yahoo! Music R1 dataset: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
Figure 3.1 : Mean squared errors of a) test case #1 and b) test case #2 with synthetic data. In both cases, X is a mode-3 tensor of size 100 × 100 × 100 and Y is a matrix of size 100 × 100. There are 80,000 known data points of X and 10,000 known elements of Y. ASTen with the new loss function reduces MSE by 60% (in test case #1) and 74% (in test case #2) compared with other algorithms using the traditional loss function.
$$\mathrm{MSE} = \frac{\left\| \llbracket \mathbf{U},\mathbf{V},\mathbf{W} \rrbracket - \mathbf{X} \right\|^2}{\Omega_X} + \frac{\left\| \mathbf{Y} - \mathbf{A}\mathbf{B}^T \right\|^2}{\Omega_Y}$$

where the first term is the MSE of the X approximation and the second term is that of Y.
In the case of factorizing an individual tensor X or matrix Y, only the first or the
second term is used to calculate the MSE.
Figure 3.2 : Mean squared error of factorizing the MovieLens dataset, where X is a mode-3 tensor of size 943 × 1,682 × 7, Y is a matrix of size 943 × 83, and Z is a matrix of size 1,682 × 19. We can observe that ASTen with the new loss function reduces the MSE of each tensor. In contrast, other algorithms with the conventional loss function sacrifice the accuracy of Y and Z to optimize X.
Similarly, when more than two tensors are coupled, just as for the MovieLens dataset,
the MSEs of all the corresponding tensors are added.
3.4.3 Results
The results of the experiments consistently show that ASTen enhances the
accuracy of coupled matrix tensor factorization. They also provide evidence that
the proposed new loss function converges.

Figure 3.1 and Figure 3.2 reveal that ASTen with the new loss function
significantly outperforms the algorithm with the traditional loss function defined in
equation (2.4). In particular, in Figure 3.1a, ASTen with the new loss function not
only reduces the approximation error of X by 76%, but also reduces the Y
decomposition loss by about 92%. In comparison, the algorithm with the traditional
loss function gives up Y decomposition accuracy to improve the X factorization by just 16%.
The same phenomenon is also observed in test case #2, as shown in Figure 3.1b.
Again, ASTen with the new loss function improves both X and Y by 74% and
79% respectively, while the traditional loss function can only improve the factorization
of X by 13%, with no improvement in the Y decomposition.
Figure 3.3 : Mean squared error of factorizing the Yahoo! Music dataset, where X is a mode-3 tensor of size 136,736 × 20,543 × 9,442 and Y is a matrix of size 136,736 × 23,179. Again, we can see that ASTen with the new loss function optimizes the MSE of every tensor. On the contrary, other algorithms with the conventional loss function give up the accuracy of X to better reduce the MSE of Y.
The performance of ASTen with the new loss function is consistent when it is
applied to both the MovieLens and the Yahoo! Music datasets. Figure 3.2 shows
that ASTen outperforms the algorithms with the traditional loss function defined in
equation (2.4), reducing CMTF's overall mean squared error by 35%. In particular,
while achieving a better approximation of X than the algorithms with the traditional loss
function, ASTen also reduces the decomposition loss of both Y and Z by 62% and 35%,
respectively. In contrast, the algorithms with the traditional loss function inflate the
error of the Y decomposition 20-fold and that of the Z factorization 36-fold, yet still
achieve a relatively less accurate approximation of X.
The result of the experiment on the Yahoo! Music dataset, as presented in Figure
3.3, illustrates that the size differences of the input datasets have a strong influence
on the accuracy of CMTF using the traditional loss function. CMTF with the
traditional loss function optimizes Y, which has more nonzero entries, but does not
decompose X equally well. On the contrary, the proposed loss function in ASTen
helps to further improve the accuracy of both X and Y under the same stopping
condition as CMTF with the traditional function.
3.5 Contribution and Summary
This chapter addresses contribution #1 of this thesis by proposing a novel objec-
tive function to share the actual explicit similarities across datasets and an algorithm
to optimize the low-rank factors. The new objective function is designed to optimize
every single tensor and matrix of the coupled datasets. Differing from the existing
algorithms with a traditional objective function which forces coupled modes be-
tween the coupled matrix and tensor to have identical factors, ASTen enables each
of them to have its discriminative factor on the coupled mode. Due to the different
nature of coupled factors across datasets in real-world applications, the new loss
function enables ASTen to capture and share the actual explicit similarities in the
coupled factors. Thus, it is capable of finding the accurate approximation of every
tensor. As a result, it achieves up to 75% error reduction for coupled matrix tensor
factorization.
Furthermore, this chapter provides a theoretical proof and conducts extensive
experiments to show that the proposed algorithm converges to an optimum. Exper-
iments on both real and synthetic datasets demonstrate that the proposed ASTen
outperforms the existing algorithms in utilizing explicit similarities to improve rec-
ommendation accuracy.
Chapter 4
Implicit Similarity Discovery
This chapter encapsulates contributions #2 and #3 of this thesis. It is an extended
description of the following publications:
Quan Do, Wei Liu, Fan Jin and Dacheng Tao, “Unveiling Hidden Implicit Simi-
larities for Cross-Domain Recommendation,” IEEE Transactions on Knowledge and
Data Engineering (TKDE) (Under review).
Quan Do, Wei Liu and Fang Chen, “Discovering both Explicit and Implicit
Similarities for Cross-Domain Recommendation,” in Proceedings of the 2017 Pacific-
Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 618-630,
May 23-26, 2017.
4.1 Introduction
As discussed in Chapter 3, coupled datasets across domains possess explicit sim-
ilarities. In the case of the rating matrices on the Netflix and MovieLens websites,
they both contain user preferences for the same set of items. Thus, explicit fea-
tures in the identical item dimension are conventionally used to undertake coupled
learning between datasets (Acar, Kolda & Dunlavy 2011, Shin et al. 2017, Singh
& Gordon 2008). In addition to these explicit similarities, cross-domain datasets
are likely to have other implicit correlations in their remaining dimensions. This
intuition comes from consumer segmentation which is a popular concept in market-
ing. Consumers can be grouped by their behaviors, for example, “tech leaders” is
a group of consumers with a strong desire to own new smartphones with the latest
technologies and there is another user group which only changes to a new phone
when the current one is out of order. Although users on the Netflix and MovieLens
websites are different, users with similar interests may share similar behaviors in
relation to the same set of items. This type of correlation implicitly exists in many
other scenarios in the real world, for example, suburbs with a large number of high-
income families may be correlated with a lower crime rate as shown in Figure 1.1,
or users with an interest in action books may also like related movie genres (e.g.,
action films), or even though users on Amazon and Walmart are different, those
sharing similar interests may share similar behaviors in relation to the same set of
products. Thus, using implicit similarities correctly in addition to explicit ones has
a strong potential to assist in understanding more deeply the relationship between
user preferences and item characteristics. This understanding can be used to decide
appropriate recommendation strategies for each particular user.
Different approaches have been proposed to perform a joint analysis of multiple
datasets (Pan 2016). Some researchers proposed joint factorization (Singh & Gordon
2008, Acar, Kolda & Dunlavy 2011) while others suggested transfer learning (Li et al.
2009a, Gao et al. 2013, Pan et al. 2010). However, all of the existing algorithms use
explicit similarities as a bridge to collaborate among datasets. The popular collective
matrix factorization (CMF) (Singh & Gordon 2008) jointly analyzes datasets by
assuming them to have an identical low-rank factor in their coupled dimension. In
this case, the shared identical factor captures the explicit similarities across domains.
Li et al. (2009a) suggest correlated datasets share explicit hidden rating patterns.
The similarities between rating patterns are then transferred from one to another
dataset. Gao et al. (2013) extend Li et al. (2009a)’s idea to include unique patterns
in each dataset. Pan et al. (2010) regularize factors of the target user-by-item
matrix with those, called principal coordinates, from the user profile and the item
information matrices. Although these explicit similarities showed their effectiveness
in improving recommendation, there are still rich implicit features that were not
used but have great potential to further improve the recommendation.
Motivated by this literature gap, this chapter proposes an algorithm to discover
the implicit similarities in non-coupled dimensions of the cross-domain datasets.
The fact that non-coupled dimensions in the aforementioned example of the Movie-
Lens and the Netflix coupled datasets contain non-overlapping users prevents direct
knowledge sharing in their non-coupled factors. However, their latent behaviors are
correlated and should be shared. These hidden behaviors can be captured in low-
rank factors by matrix tri-factorization. As factorization is equivalent to spectral
clustering (Ding et al. 2006, Huang et al. 2008, Papalexakis et al. 2013), different
users with similar preferences are grouped in non-coupled user factors. Developed
on this concept, the proposed algorithm hypothesizes that latent clusters in these
non-coupled factors may have a close relationship. Therefore, it aligns correlated
clusters in non-coupled factors to be as close as possible. This idea matches the
fundamental concept of CF in the sense that similar user groups who rate similarly
will continue to do so.
In addition, this chapter presents the first algorithm to utilize both explicit and
implicit similarities across datasets. Both of them provide a thorough understand-
ing of the underlying relationship between user preferences and item characteristics.
This understanding helps to recommend appropriate items to each user, enhancing
the performance of cross-domain recommendation. Validated on real-world datasets,
the proposed approach outperforms the existing algorithms by more than two times
in terms of recommendation accuracy. These results show the significance of explicit
and implicit similarities in improving the performance of cross-domain recommen-
dation.
The chapter describes how implicit similarities are discovered in Section 4.2.2, along with the challenge of aligning them across non-coupled factors; the proposed method to align them is also introduced in Section 4.2.2. Furthermore, Section 4.2.1 discusses explicit similarities, and a method to utilize both explicit and implicit similarities is detailed in Section 4.2.3. Section 4.3 presents a way to extend the proposed idea to handle more than two coupled datasets. An extensive evaluation with real-world datasets is discussed in Section 4.4. Finally, Section 4.5 summarizes the key contributions of the proposed idea to discover implicit similarities and to use them to achieve more accurate cross-domain recommendations.
4.2 HISF: the proposed Hidden Implicit Similarities Factorization Model
This section introduces how explicit and implicit similarities are exploited among
different datasets. It first explains a case where X(1) is a rating matrix of n users by
m items and X(2) is another one of p users by m items. They are coupled in their
second dimension. An extension to more matrices will then be discussed in the next section. Compared to existing models, HISF makes two principal improvements.
Firstly, it discovers and aligns similar user clusters in the non-coupled user factors
across domains. Secondly, it shares common parts and preserves domain-specific
parts of the coupled latent variables (item factors).
4.2.1 Sharing common and preserving domain-specific coupled latent variables to utilize explicit similarities
Even though X(1) and X(2) contain the same m items, it is unreasonable to force
them to share the same low-rank item factors (i.e., V(1) equals V(2)). The fact that
both matrices capture ratings of the same items indicates that they are strongly
correlated. Nevertheless, they also have their own unique features characterized
by their domain. For this reason, both common and unique parts are included in
item factors to better capture correlations of the coupled dimension among different
datasets. Thus, the coupled loss function is proposed to be:
$$\mathcal{L} = \left\| X^{(1)} - U^{(1)} S^{(1)} \begin{bmatrix} V^{(0)} \\ V^{(1)} \end{bmatrix}^T \right\|^2 + \left\| X^{(2)} - U^{(2)} S^{(2)} \begin{bmatrix} V^{(0)} \\ V^{(2)} \end{bmatrix}^T \right\|^2 + \lambda\theta$$
where $V^{(0)} \in \mathbb{R}^{m \times c}$ is the common part, $V^{(1)}$ and $V^{(2)} \in \mathbb{R}^{m \times (r-c)}$ are domain-specific parts, $r$ is the rank of the decomposition, $c$ is the number of common rows such that $1 \le c \le r$, and $\theta$ is the squared L2 regularization term. These common $V^{(0)}$ and domain-specific $V^{(1)}$, $V^{(2)}$ parts in the coupled factors are illustrated in Figure 4.1.
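To make this formulation concrete, below is a minimal NumPy sketch of the coupled loss with a shared part V0 and domain-specific parts V1 and V2; the NaN masking of unobserved entries and the choice of θ as the sum of squared Frobenius norms of all factors are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def coupled_loss(X1, X2, U1, S1, U2, S2, V0, V1, V2, lam):
    """Coupled loss with common part V0 and domain-specific parts V1, V2.

    X1 (n x m) and X2 (p x m) are rating matrices; np.nan marks
    unobserved entries (an assumption of this sketch).
    U1 (n x r), U2 (p x r) are user factors; S1, S2 (r x r) are weights.
    V0 (m x c) is the common item part; V1, V2 (m x (r - c)) are specific.
    """
    Va = np.hstack([V0, V1])            # item factor of domain 1 (m x r)
    Vb = np.hstack([V0, V2])            # item factor of domain 2 (m x r)
    R1 = X1 - U1 @ S1 @ Va.T            # residuals of domain 1
    R2 = X2 - U2 @ S2 @ Vb.T            # residuals of domain 2
    sse = np.nansum(R1 ** 2) + np.nansum(R2 ** 2)   # observed entries only
    theta = sum(np.sum(F ** 2) for F in (U1, S1, U2, S2, V0, V1, V2))
    return sse + lam * theta
```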
4.2.2 Aligning implicit similarities in non-coupled latent clusters across
domains
Besides the explicit similarities in common parts of coupled factors (as described
above), the non-coupled dimension of datasets from different sources also possess
a strong relationship. Even though users in X(1) and X(2) are different users, they
may have similar behaviors which can be grouped by latent factors. For instance,
different users with similar preferences on sci-fi movies can be grouped as sci-fi
fans; sci-fi fans have similar behaviors across domains. This intuition inspires us to
Figure 4.1: The proposed factorization model to discover and share implicit similarities across two rating matrices X(1) and X(2). They are coupled in their item dimensions. The proposed algorithm decomposes X(1) into U(1), S(1), the common V(0) and its specific V(1). At the same time, X(2) is factorized into U(2), S(2), the common V(0) and its specific V(2). Note that the proposed model matches clusters of users with similar interests in non-coupled factors U(1) and U(2) by aligning correlated clusters (captured in their columns) as close as possible.
hypothesize that non-coupled user factors share closely related latent user clusters.
Thus, HISF first discovers cluster matchings between user clusters in U(1) and those
in U(2). Based on these matchings, centroids of their correlated clusters are then
regularized to be as close as possible. Figure 4.1 demonstrates a case of user cluster alignment between U(1) and U(2).
In the following subsections, a few examples are used to explain how challenging
it is to match user clusters across domains and how clustering can be achieved with
matrix tri-factorization. Then the proposed solution is introduced, which first finds related user clusters across domains and then aligns them together.
Factorization as a clustering method
Ding et al. (2006) proved that matrix tri-factorization is equivalent to spectral clustering such that U and V contain clusters of users and those of items, respectively. S captures weights between user clusters and item clusters.

Figure 4.2: Matrix factorization as a clustering method. Suppose X(1) contains ratings of n users for m items and there are two user groups represented by their rating similarities (brown circles and green triangles). When X(1) is factorized with matrix tri-factorization, these two user groups are captured in columns of the user factor. Two possible cases can happen: a) users with brown circles are in the first column and those with green triangles are in the second column of U(1); or b) users with brown circles are in the second column and those with green triangles are in the first column of U′(1).
An example is now created where X(1) contains ratings of two groups of users
with similar preferences as shown in Figure 4.2: one group giving brown circle
ratings and another one giving green triangle ratings. When X(1) is factorized,
these two user groups are captured in columns of the user factor, i.e., one column
of the user factor contains users with brown circles and another one captures users
with green triangles. This example illustrates that factorization can cluster users
based on their preferences. However, it does not provide any information on whether
the first column is the group of users with brown circles or green triangles as both
$U^{(1)} \times S^{(1)} \times V^{(1)T}$ and $U'^{(1)} \times S'^{(1)} \times V'^{(1)T}$ are equally good solutions.
The weighting factor S(1) or S′(1) shows how much a user cluster interacts with
the item clusters. For example, the first row of S(1) in Figure 4.2 indicates that
users with brown circles strongly interact with the first item cluster and have no
interaction with the second one.
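This ambiguity is easy to reproduce. The short NumPy check below (with arbitrary random factors) shows that swapping the columns of U, together with the matching permutation of S, reconstructs exactly the same matrix, so both orderings are equally good solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((6, 2))                    # user factor with two clusters
S = rng.random((2, 2))                    # cluster-interaction weights
V = rng.random((4, 2))                    # item factor
X_hat = U @ S @ V.T                       # reconstruction

P = np.array([[0.0, 1.0], [1.0, 0.0]])    # permutation swapping two columns
U_perm = U @ P                            # column-permuted user factor
S_perm = P.T @ S                          # permute S's rows to compensate
print(np.allclose(X_hat, U_perm @ S_perm @ V.T))   # True
```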
Figure 4.3: Suppose X(2) is obtained from another domain where different users rate the same set of items as that of X(1). X(2) also has two user groups having similar preferences to those in X(1): circle and triangle preferences.
A challenge to align similar clusters across domains
Suppose there is another dataset from a different domain, X(2), with ratings from p different users on the same set of m items, as shown in Figure 4.3. By factorizing X(2), matrix tri-factorization can group users with blue circle preferences and those with yellow triangle preferences in the columns of user factor U(2) or U′(2).

Although the behaviors of users in X(1) and those in X(2) are highly correlated, finding a match between user clusters in X(1) and X(2) is not an obvious task for two reasons. Firstly, users in X(1) and X(2) are different. This means that there is no information about the users or their order in the rating matrices. Secondly, as the alternating least squares method (Hu et al. 2008) is often used to find the factors, the factors are initialized randomly. Therefore, a particular user cluster may never be fixed in a particular column of the user factor. In other words, there is no way to ensure whether users with brown circles are captured in the first or the second column of the user factor of X(1). In the same way, users with blue circles can be captured in the first or the second column of the user factor of X(2). Therefore, there are four possible cases for matching the user clusters of X(1) and X(2), as demonstrated in Figure 4.4. In the event the input datasets have r clusters, the situation is more complex as there will be $r^2$ possible cases.
Figure 4.4: Possible cases for matching user clusters of X(1) and X(2): user cluster alignment between X(1) in Figure 4.2 and X(2) in Figure 4.3. There are four possible cases: in a) and d), the user cluster in the first column of U(1) matches the user cluster in the first column of U(2), and the user cluster in the second column of U(1) matches the user cluster in the second column of U(2); in b) and c), the user cluster in the first column of U(1) matches the user cluster in the second column of U(2), and the user cluster in the second column of U(1) matches the user cluster in the first column of U(2). These cases show the challenge of determining how clusters in U(1) and U(2) are aligned at each iteration.
Matching correlated clusters across domains
To find matches of user clusters across domains, let's revisit the intuition about similar user groups, with the assumption that users have similar behavior on the common item set. When user clusters are extracted by matrix tri-factorization as discussed in Section 4.2.2, item factors V(1) and V(2) have common and different columns. These columns of V define new low-dimensional coordinates (a linear transformation) of the column-wise (item side) information of X. The common columns of V(1) and V(2) thus provide new coordinates of the common item information of X(1) and X(2); each of these new coordinates (each common column of V(1) and V(2)) defines a cluster of items (a linear transformation of the original item side of X) having the same characteristics (e.g., sci-fi movies or comedy movies). The rows of S(1) and S(2) are the values (weights) of the different user clusters (captured in the columns of U(1) and U(2), respectively) in these common new coordinates. If a user cluster in X(1) and another one in X(2) are geometrically close in the new coordinates, these user clusters have similar interests in items of the same characteristics. Hence, matching the S(1) and S(2) weights that are geometrically close in the new coordinates defined by V(1) and V(2) can discover matching user clusters.
This principle is used to find matches of user clusters across domains. A user cluster $U^{(1)}_{*,i}$ matches $U^{(2)}_{*,j}$ if $S^{(1)}_{i,*}$ is the closest to $S^{(2)}_{j,*}$. Thus, the Euclidean distances between the rows of S(1) and those of S(2) are compared, and the user clusters whose distances are the shortest are matched.
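A small NumPy sketch of this matching step is given below. The restriction to the first c (common) coordinates and the greedy pairing of the globally closest remaining rows are illustrative assumptions; the thesis only specifies that clusters with the shortest Euclidean distances between the corresponding rows of S(1) and S(2) are matched.

```python
import numpy as np

def match_clusters(S1, S2, c):
    """Match user clusters across domains by comparing rows of S1 and S2.

    The first c columns are taken as the coordinates defined by the common
    item part V0 (an assumption).  Returns match[i] = j, pairing cluster i
    of domain 1 with cluster j of domain 2.
    """
    diff = S1[:, None, :c] - S2[None, :, :c]
    dist = np.sqrt((diff ** 2).sum(axis=2))       # pairwise row distances
    match = {}
    while len(match) < min(S1.shape[0], S2.shape[0]):
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        match[i] = j
        dist[i, :] = np.inf                       # both clusters are now taken
        dist[:, j] = np.inf
    return match
```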
Aligning correlated clusters
Once matches between user clusters across domains are found, they are aligned to be as close as possible in the optimization process. This alignment enforces that similar users in the latent feature space rate new items similarly. Here, HISF is proposed to regularize the difference between the centroids of the closely related user clusters across U(1) and U(2).
$$\mathcal{L} = \left\| X^{(1)} - U^{(1)} S^{(1)} \begin{bmatrix} V^{(0)} \\ V^{(1)} \end{bmatrix}^T \right\|^2 + \left\| X^{(2)} - U^{(2)} S^{(2)} \begin{bmatrix} V^{(0)} \\ V^{(2)} \end{bmatrix}^T \right\|^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.1)$$
where $m^{(1)T}_l$ denotes the row vector of the $l$-th user cluster's centroid in U(1) and $m^{(2)T}_*$ denotes the row vector of the matched user cluster's centroid in U(2); an example of how $m^{(1)T}_l$ and $m^{(2)T}_*$ are computed is shown in Figure 4.5.
Figure 4.5: An illustration of how the centroid of a cluster is computed. U captures two user clusters: the first three users are in the first cluster and the last two are in the second one. x and y denote the first and the second column of U, respectively, so that $m_1^T = \left(\frac{u_1.x + u_2.x + u_3.x}{3}, \frac{u_1.y + u_2.y + u_3.y}{3}\right)$ and $m_2^T = \left(\frac{u_4.x + u_5.x}{2}, \frac{u_4.y + u_5.y}{2}\right)$.

When the user clusters across domains are matched as in Figure 4.4a and Figure 4.4d, then:
$$\sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 = \sqrt{\left(m^{(1)}_1.x - m^{(2)}_1.x\right)^2 + \left(m^{(1)}_1.y - m^{(2)}_1.y\right)^2} + \sqrt{\left(m^{(1)}_2.x - m^{(2)}_2.x\right)^2 + \left(m^{(1)}_2.y - m^{(2)}_2.y\right)^2}$$
However, if the user clusters across domains are matched as in Figure 4.4b and Figure 4.4c, the order of the columns in $m^{(2)T}_*$ has to be adjusted to match the order of the columns in $m^{(1)T}_l$ such that

$$\sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 = \sqrt{\left(m^{(1)}_1.x - m^{(2)}_2.y\right)^2 + \left(m^{(1)}_1.y - m^{(2)}_2.x\right)^2} + \sqrt{\left(m^{(1)}_2.x - m^{(2)}_1.y\right)^2 + \left(m^{(1)}_2.y - m^{(2)}_1.x\right)^2}$$
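The centroid computation and the alignment penalty can be sketched as follows; assigning each user to the cluster (column) with its largest factor value is an illustrative assumption, since the thesis does not pin down the assignment rule, and every cluster is assumed non-empty.

```python
import numpy as np

def centroids(U, r):
    """Centroid of each of the r user clusters captured in U's columns.
    A user is assigned to the column with its largest value (assumption)."""
    labels = U.argmax(axis=1)
    return np.vstack([U[labels == l].mean(axis=0) for l in range(r)])

def alignment_penalty(U1, U2, match, r):
    """Sum of distances between matched centroids, following the expanded
    form above; match comes from the cluster-matching step."""
    m1, m2 = centroids(U1, r), centroids(U2, r)
    return sum(np.linalg.norm(m1[l] - m2[match[l]]) for l in range(r))
```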
To elaborate how this alignment works, Algorithm 2 is run with synthetic data in Figure 4.6, where X(1) and X(2) each has six unique users who rate the same list of six movies (three sci-fi and three comedy movies). Some users in X(1) and X(2) like the sci-fi genre and provide high ratings for sci-fi movies; some others prefer the comedy category and rate comedy movies five stars. Thus, users in each domain can be grouped by their preferences: one cluster of sci-fi fans and another one of comedy fans. Furthermore, each group of users in X(1) shares its implicit preference with a corresponding group of users in X(2). As a result, when Algorithm 2 factorizes X(1) and X(2) with rank r = 2, each user cluster is captured in a column of the user factors. In addition, their user clusters are aligned so that those with similar preferences are close to each other.

Figure 4.6: Generated ratings of two domains X(1) and X(2). Blank entries are unobserved ratings. Although users in X(1) and X(2) are different, they share implicit similarities in their preferences: those who like the sci-fi genre rate sci-fi movies highly and those who are comedy fans do so for comedy movies.
The resulting user clusters of several iterations are then plotted in an x-y coordinate system as in Figure 4.7, where the x-axis is the first column of U(1) and the y-axis is its second column. If the first and the second column of U(2) match the first and the second column of U(1) respectively, x is the first column and y is the second column of U(2). Otherwise, the order of the columns in U(2) is changed before plotting its users so that matching columns between U(1) and U(2) are aligned, i.e., y is the first and x is the second column of U(2).
Initially, the algorithm randomizes the user factors, so the users are scattered randomly in the x-y space. Iteration 1 starts grouping users into different clusters and aligning them.

Figure 4.7: An illustration of how well the proposed cluster alignment method works. Sci-fi fans and comedy fans from X(1) and X(2) in Figure 4.6 are captured in user factors U(1) and U(2). Circles are users captured in the first column of U, and triangles are those captured in the second column of U. As U(1) and U(2) are randomly initialized, users are scattered at first. User clusters are formed and aligned over iterations. From iteration 2, two user clusters in X(1) are clearly separated and gradually aligned with those in X(2). There is a change in cluster matching in iteration 3: the first column of U(1) is matched with the second column of U(2) and vice versa. From then on, centroids of clusters are aligned to be as close as possible, from iteration five till the last iteration.

After iteration 2, users with similar behaviors in U(1) and those in
U(2) are formed, and it can be seen that the first and the second user clusters of
X(1) are aligned with the first and the second user clusters of X(2), respectively. In
other words, sci-fi fans in the first column of U(1) (blue circles) are matched with
those in the first column of U(2) (red circles) whereas comedy fans in the second
column of U(1) (blue triangles) are matched with those in the second column of
U(2) (red triangles). Iteration 3 makes a correction to the cluster alignment: sci-fi fans, now in the second column of U(2), are matched with those in the first column of U(1) (blue circles), whereas comedy fans, now in the first column of U(2) (red circles), are matched with those in the second column of U(1) (blue triangles). In this case, the order of the columns of U(2) is corrected so that users in U(1) and U(2) are in the same coordinates. In the following iterations, centroids of corresponding clusters in X(1) and X(2) are adjusted to be close, as observed in iteration 5's result and the last iteration's.
4.2.3 Optimization
Equation (4.1) is not a convex function. However, it is convex with respect to one factor when the others are fixed. Therefore, the alternating least squares framework is used to take turns optimizing one factor in function (4.1) while fixing the others, as shown in Algorithm 2. Moreover, as the data for updating rows of the factors are independent, the proposed algorithm computes the factors in a row-wise manner instead of full matrix operations so that the computation can be done in parallel with multiple CPU cores or a distributed system. To this end, Equation (4.1) is rewritten as follows:
$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(1)T}_i S^{(1)} \begin{bmatrix} v^{(0)}_j \\ v^{(1)}_j \end{bmatrix} \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(2)T}_k S^{(2)} \begin{bmatrix} v^{(0)}_j \\ v^{(2)}_j \end{bmatrix} \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.2)$$
Solving U(1) and U(2)
Let $v^{(01)}_j = S^{(1)} \begin{bmatrix} v^{(0)}_j \\ v^{(1)}_j \end{bmatrix}$ and $v^{(02)}_j = S^{(2)} \begin{bmatrix} v^{(0)}_j \\ v^{(2)}_j \end{bmatrix}$; then Equation (4.2) becomes

$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(1)T}_i v^{(01)}_j \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(2)T}_k v^{(02)}_j \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta \quad (4.3)$$
By fixing all $v_j$, Equation (4.3) is a convex function with respect to $u^{(1)T}_i$. As a result, $u^{(1)T}_i$ is optimal when the partial derivative of $\mathcal{L}$ with respect to it is set to zero.
$$\frac{\partial \mathcal{L}}{\partial u^{(1)T}_i} = -2 \sum_{j}^{m} \left( X^{(1)}_{i,j} - u^{(1)T}_i v^{(01)}_j \right) v^{(01)T}_j + 2 \left( u^{(1)T}_i - b^T \right) + 2 \lambda u^{(1)T}_i = -2 x^{(1)T}_{i,*} V^{(01)} + 2 u^{(1)T}_i V^{(01)T} V^{(01)} + 2 u^{(1)T}_i - 2 b^T + 2 \lambda u^{(1)T}_i$$

where $b^T = -m^{(1)T}_l + m^{(2)T}_* + u^{(1)T}_i$, $l$ is the cluster user $i$ belongs to, and $x^{(1)T}_{i,*}$ is a row vector of all observed $X^{(1)}_{i,j}, \forall j \in [1, m]$.
By setting $\frac{\partial \mathcal{L}}{\partial u^{(1)T}_i} = 0$, the updating rule for $u^{(1)T}_i$ can be obtained:

$$u^{(1)T}_i = \left( V^{(01)T} V^{(01)} + (\lambda + 1) I \right)^{+} \left( x^{(1)T}_{i,*} V^{(01)} + b^T \right) \quad (4.4)$$
In the same way, the optimal $u^{(2)T}_k$ can be derived from:

$$u^{(2)T}_k = \left( V^{(02)T} V^{(02)} + (\lambda + 1) I \right)^{+} \left( x^{(2)T}_{k,*} V^{(02)} + b^T \right) \quad (4.5)$$

where $b^T = -m^{(1)T}_l + m^{(2)T}_* + u^{(2)T}_k$, $l$ is the cluster matched with the one user $k$ belongs to, and $x^{(2)T}_{k,*}$ is a row vector of all observed $X^{(2)}_{k,j}, \forall j \in [1, m]$. $I$ is the identity matrix.
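As an illustration, a row-wise update following Equation (4.4) can be written as the minimal NumPy sketch below; the argument names are hypothetical, and the pseudo-inverse is replaced by a direct solve under the assumption that the system is nonsingular.

```python
import numpy as np

def update_user_row(x_obs, V01_obs, m1_l, m2_star, u_old, lam):
    """One row update of U^(1) following Equation (4.4) (a sketch).

    x_obs:   user i's observed ratings.
    V01_obs: |obs| x r matrix whose rows are v_j^(01) = S^(1) [v_j^(0); v_j^(1)]
             for the items j that user i rated.
    m1_l:    centroid of the cluster user i belongs to (in U^(1)).
    m2_star: centroid of the matched cluster in U^(2).
    u_old:   current value of u_i^(1), used to form b.
    """
    r = V01_obs.shape[1]
    b = -m1_l + m2_star + u_old                     # b as defined above
    lhs = V01_obs.T @ V01_obs + (lam + 1.0) * np.eye(r)
    rhs = V01_obs.T @ x_obs + b
    return np.linalg.solve(lhs, rhs)                # assumes lhs nonsingular
```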
Solving common V(0)
Let $u^{(1)T}_i = \left[ u^{(10)}_i \mid u^{(11)}_i \right]^T S^{(1)}$ and $u^{(2)T}_k = \left[ u^{(20)}_k \mid u^{(22)}_k \right]^T S^{(2)}$, where $u^{(10)T}_i, u^{(20)T}_k \in \mathbb{R}^{1 \times c}$ and $u^{(11)T}_i, u^{(22)T}_k \in \mathbb{R}^{1 \times (r-c)}$; then Equation (4.1) can be rewritten as:

$$\mathcal{L} = \sum_{i,j}^{n,m} \left( X^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j - u^{(11)T}_i v^{(1)}_j \right)^2 + \sum_{k,j}^{p,m} \left( X^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j - u^{(22)T}_k v^{(2)}_j \right)^2 + \sum_{l=1}^{r} \left\| m^{(1)T}_l - m^{(2)T}_* \right\|^2 + \lambda\theta$$
Similar to the case of U(1) and U(2), $v^{(0)}_j$ is now optimized while fixing all other parameters. Again, by setting the partial derivative of $\mathcal{L}$ with respect to $v^{(0)}_j$ to zero, the optimal value of $v^{(0)}_j$ can be achieved.
$$\frac{\partial \mathcal{L}}{\partial v^{(0)}_j} = -2 \sum_{i}^{n} \left( Y^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j \right) u^{(10)}_i - 2 \sum_{k}^{p} \left( Y^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j \right) u^{(20)}_k + 2 \lambda v^{(0)}_j = -2 U^{(1)T} y^{(1)}_{*,j} + 2 U^{(1)T} U^{(1)} v^{(0)}_j - 2 U^{(2)T} y^{(2)}_{*,j} + 2 U^{(2)T} U^{(2)} v^{(0)}_j + 2 \lambda v^{(0)}_j$$

where $Y^{(1)}_{i,j} = X^{(1)}_{i,j} - u^{(11)T}_i v^{(1)}_j$ and $Y^{(2)}_{k,j} = X^{(2)}_{k,j} - u^{(22)T}_k v^{(2)}_j$; $y^{(1)}_{*,j}$ and $y^{(2)}_{*,j}$ are column vectors of all observed $Y^{(1)}_{i,j}, \forall i \in [1, n]$ and $Y^{(2)}_{k,j}, \forall k \in [1, p]$, respectively.
The updating rule for $v^{(0)}_j$ is:

$$v^{(0)}_j = \left( U^{(1)T} U^{(1)} + U^{(2)T} U^{(2)} + \lambda I \right)^{+} \left( U^{(1)T} y^{(1)}_{*,j} + U^{(2)T} y^{(2)}_{*,j} \right) \quad (4.6)$$
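A corresponding sketch of the common item-row update in Equation (4.6) is given below; the residual vectors and the c-column common parts of the transformed user factors are assumed to be precomputed as defined above.

```python
import numpy as np

def update_common_item_row(y1, U10, y2, U20, lam):
    """One common row update of V^(0) following Equation (4.6) (a sketch).

    y1, y2:   residuals Y_{*,j} over the users who rated item j in each
              domain, with the domain-specific parts already subtracted.
    U10, U20: corresponding rows of the common parts of the transformed
              user factors (|obs| x c each).
    """
    c = U10.shape[1]
    lhs = U10.T @ U10 + U20.T @ U20 + lam * np.eye(c)
    rhs = U10.T @ y1 + U20.T @ y2
    return np.linalg.solve(lhs, rhs)                # assumes lhs nonsingular
```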
Solving domain-specific V(1) and V(2)
Similar operations are now performed with respect to $v^{(1)}_j$ and $v^{(2)}_j$:

$$\frac{\partial \mathcal{L}}{\partial v^{(1)}_j} = -2 U^{(1)T} z^{(1)}_{*,j} + 2 U^{(1)T} U^{(1)} v^{(1)}_j + 2 \lambda v^{(1)}_j = 0$$

where $Z^{(1)}_{i,j} = X^{(1)}_{i,j} - u^{(10)T}_i v^{(0)}_j$ and $Z^{(2)}_{k,j} = X^{(2)}_{k,j} - u^{(20)T}_k v^{(0)}_j$; $z^{(1)}_{*,j}$ and $z^{(2)}_{*,j}$ are column vectors of all observed $Z^{(1)}_{i,j}, \forall i \in [1, n]$ and $Z^{(2)}_{k,j}, \forall k \in [1, p]$, respectively.

Then the updating rule for $v^{(1)}_j$ is derived:

$$v^{(1)}_j = \left( U^{(1)T} U^{(1)} + \lambda I \right)^{+} U^{(1)T} z^{(1)}_{*,j} \quad (4.7)$$
Analogously to optimizing $v^{(1)}_j$, the optimal $v^{(2)}_j$ is obtained by:

$$v^{(2)}_j = \left( U^{(2)T} U^{(2)} + \lambda I \right)^{+} U^{(2)T} z^{(2)}_{*,j} \quad (4.8)$$
Solving weighting factor S(1) and S(2)
Let

$$s^{(1)T} = \left( S^{(1)}_{1,1}, S^{(1)}_{2,1}, \ldots, S^{(1)}_{r,1}, S^{(1)}_{1,2}, S^{(1)}_{2,2}, \ldots, S^{(1)}_{r,2}, \ldots, S^{(1)}_{1,r}, S^{(1)}_{2,r}, \ldots, S^{(1)}_{r,r} \right)$$

and

$$a^T = \left( U^{(1)}_{i,1} V^{(1)}_{j,1}, U^{(1)}_{i,2} V^{(1)}_{j,1}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,1}, \; U^{(1)}_{i,1} V^{(1)}_{j,2}, U^{(1)}_{i,2} V^{(1)}_{j,2}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,2}, \; \ldots, \; U^{(1)}_{i,1} V^{(1)}_{j,r}, U^{(1)}_{i,2} V^{(1)}_{j,r}, \ldots, U^{(1)}_{i,r} V^{(1)}_{j,r} \right)$$

Equation (4.2) then becomes:

$$\mathcal{L} = \sum_{i,j}^{n,m} \left\| X^{(1)}_{i,j} - s^{(1)T} a \right\|^2 + \lambda \left\| s^{(1)T} \right\|^2 + \text{const}$$

where const collects the remaining regularization terms.

The optimal $s^{(1)T}$ is achieved when $\frac{\partial \mathcal{L}}{\partial s^{(1)T}} = 0$:

$$\frac{\partial \mathcal{L}}{\partial s^{(1)T}} = -2 x^{(1)T}_{*,*} A + 2 s^{(1)T} A^T A + 2 \lambda s^{(1)T} = 0 \;\Leftrightarrow\; s^{(1)T} = \left( A^T A + \lambda I \right)^{-1} \left( x^{(1)T}_{*,*} A \right) \quad (4.9)$$

where $x^{(1)T}_{*,*}$ contains the observed $X^{(1)}_{i,j}$; $x^{(1)T}_{*,*} \in \mathbb{R}^{1 \times \Omega_{X^{(1)}}}$ and $A \in \mathbb{R}^{\Omega_{X^{(1)}} \times (r \times r)}$.
In an analogous way, the update rule for $s^{(2)T}$ is:

$$s^{(2)T} = \left( A^T A + \lambda I \right)^{-1} \left( x^{(2)T}_{*,*} A \right) \quad (4.10)$$

Algorithm 2: HISF: utilizing both explicit and implicit similarities from two matrices
Input: X(1), X(2), E
Output: U(1), S(1), V(0), V(1), U(2), S(2), V(2)
1  Randomly initialize all factors
2  Initialize L with a small number
3  repeat
4      PreL = L
5      Find matches between clusters in U(1) and U(2)
6      Solve U(1) by (4.4)
7      Solve U(2) by (4.5)
8      Solve common V(0) by (4.6)
9      Solve domain-specific V(1) by (4.7)
10     Solve domain-specific V(2) by (4.8)
11     Solve S(1) by (4.9)
12     Solve S(2) by (4.10)
13     Compute L following (4.1)
14 until (PreL − L)/PreL < E
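The vectorized S update of Equation (4.9) maps neatly onto a Kronecker product: with s = vec(S) in column-major order, each observed entry contributes a row a^T = kron(V_j, U_i). A minimal NumPy sketch (dense storage with np.nan marking unobserved entries, both assumptions of this sketch) is:

```python
import numpy as np

def update_S(X, U, V, lam, r):
    """Solve Equation (4.9) for S via its vectorization (a sketch).

    X: n x m with np.nan for unobserved entries; U: n x r; V: m x r,
    where V stacks the common and domain-specific item parts.
    Each observed (i, j) contributes a row a^T = kron(V[j], U[i]), so
    X_ij ~ s^T a with s = vec(S) in column-major order.
    """
    idx = np.argwhere(~np.isnan(X))
    A = np.stack([np.kron(V[j], U[i]) for i, j in idx])   # |Omega| x r^2
    x = X[~np.isnan(X)]                                   # observed values
    s = np.linalg.solve(A.T @ A + lam * np.eye(r * r), A.T @ x)
    return s.reshape((r, r), order="F")                   # un-vectorize
```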
4.3 Extension to three or more matrices
More than two correlated matrices can be found in many cases. For example, Amazon has ratings from different domains, e.g., Books, Movies and TV, Electronics, Digital Music, etc. These rating matrices may have a close relationship that can help to collaboratively improve recommendation accuracy. In this case, suppose there are three correlated matrices X(1), X(2) and X(3). They are coupled in their second dimension, i.e., X(1) is a rating matrix from n users for m items, X(2) is another rating matrix from k users for the same m items and X(3) is another rating matrix from l users for the same m items.

Algorithm 3: HISF-N: utilizing both explicit and implicit similarities from q matrices
Input: X(1), X(2), ..., X(q), E
Output: U(1), ..., U(q), S(1), ..., S(q), V(0), V(1), ..., V(q)
1  Randomly initialize all factors
2  Initialize L with a small number
3  repeat
4      PreL = L
5      Find matches among clusters in U(1), U(2), ..., U(q)
6      for i ∈ {1, ..., q} do
7          Solve U(i) while fixing all other factors
8          Solve S(i) while fixing all other factors
9          Solve V(i) while fixing all other factors
10     Solve common V(0) while fixing all other factors
11     Compute L following (4.11)
12 until (PreL − L)/PreL < E

The idea above can be extended to utilize the common parts of the coupled factors of the three matrices (explicit similarities) and to align clusters among the non-coupled factors (implicit similarities). Thus, the following extension is proposed for utilizing explicit and implicit similarities among three or more matrices:
$$\mathcal{L} = \sum_{i=1}^{3} \left\| X^{(i)} - U^{(i)} S^{(i)} \begin{bmatrix} V^{(0)} \\ V^{(i)} \end{bmatrix}^T \right\|^2 + sim_{implicit} + \lambda\theta \quad (4.11)$$

where $sim_{implicit}$ is the regularization term of implicit similarities across domains and is defined by:

$$sim_{implicit} = \sum_{f=1}^{r} \left\| m^{(1)T}_f - m^{(2)T}_* \right\|^2 + \sum_{f=1}^{r} \left\| m^{(2)T}_f - m^{(3)T}_* \right\|^2 + \sum_{f=1}^{r} \left\| m^{(3)T}_f - m^{(1)T}_* \right\|^2$$
Updating rules for optimizing U(1), U(2), U(3), S(1), S(2), S(3), V(1), V(2) and V(3) can be derived similarly to the case of two matrices in Section 4.2.3. Moreover, following the same derivations as for the common parts of two matrices in Section 4.2.3, an updating rule for $v^{(0)}_j$, which is the common part of the three matrices, can be reached:

$$v^{(0)}_j = \left( U^{(1)T} U^{(1)} + U^{(2)T} U^{(2)} + U^{(3)T} U^{(3)} + \lambda I \right)^{+} \left( U^{(1)T} y^{(1)}_{*,j} + U^{(2)T} y^{(2)}_{*,j} + U^{(3)T} y^{(3)}_{*,j} \right) \quad (4.12)$$
By optimizing Equation (4.11), both explicit similarities (in the form of the common V(0) parts) and implicit similarities (in the form of aligned user groups in U(1), U(2) and U(3)) among X(1), X(2) and X(3) are leveraged. The utilization of correlations from four or more matrices can easily be extended in the same way. Therefore, the proposed method does not limit itself to a certain number of correlated matrices.
4.4 Experiments and Analysis
The proposed HISF is evaluated in comparison with existing algorithms, includ-
ing CMF (Singh & Gordon 2008), CST (Pan et al. 2010), CBT (Li et al. 2009a)
and CLFM (Gao et al. 2013). This evaluation thoroughly studies two test cases:
one with two matrices and another one with three matrices. The goal is to evalu-
ate how well these algorithms suggest unknown information based on the observed
cross-domain ratings. For this purpose, they are compared based on the commonly
used root mean squared error (RMSE) metric.
$$RMSE = \sqrt{\frac{\sum_{i,j}^{n,m} \left( U_i \times S \times V^T_j - X_{i,j} \right)^2}{\Omega_X}}$$
where ΩX is the number of observations of X.
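For reference, the RMSE over observed entries can be computed as in the short sketch below (np.nan marking unobserved entries is an assumption of this sketch):

```python
import numpy as np

def rmse(X, U, S, V):
    """RMSE over the observed entries of X (np.nan marks unobserved)."""
    err = (U @ S @ V.T - X)[~np.isnan(X)]
    return np.sqrt(np.mean(err ** 2))
```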
Table 4.1: Dimension and number of known entries for training, validation and testing of census data on New South Wales (NSW) (X(1)) and Victoria (VIC) (X(2)) states as well as crime statistics of NSW (X(3)).

Characteristics   X(1)           X(2)          X(3)
Dimension         154 × 7,889    81 × 7,889    154 × 62
Training          91,069         47,900        661
Validation        4,793          2,521         34
Testing           23,965         12,605        173
4.4.1 Data for the experiments
Three publicly available datasets are used for the experiments. Their character-
istics are summarized in Table 4.1 and Table 4.2.
Dataset #1
The Australian Bureau of Statistics (ABS)∗ publishes comprehensive census data for the New South Wales (NSW) and Victoria (VIC) states. The dataset for NSW comprises populations and family profiles of 154 areas, so-called "local government areas" (LGA). They are formed into a matrix X(1) (LGA by population and family profile) for NSW and another matrix X(2) for VIC. 10% of the data is randomly selected, of which 80% is used for training and the remaining 20% for testing.
Dataset #2
The Bureau of Crime Statistics and Research (BOCSAR)† provides a statistical record of criminal incidents within the 154 LGAs of New South Wales. There are 62 specific crime types. 10% of the data are randomly selected to form a matrix X(3) of (LGA, crime types). 80% of X(3) is used for training and the rest is for testing.

∗ABS: http://www.abs.gov.au/websitedbs/censushome.nsf/home/datapacks
†BOCSAR: http://www.bocsar.nsw.gov.au/Pages/bocsar_crime_stats/bocsar_crime_stats.aspx
Table 4.2: Dimension and number of known entries for training, validation and testing of Amazon datasets on books (X(4)), movies (X(5)) and electronics (X(6)).

Characteristics   X(4)            X(5)            X(6)
Dimension         5,000 × 5,000   5,000 × 5,000   5,000 × 5,000
Training          158,907         94,665          41,126
Validation        8,363           4,982           2,164
Testing           18,585          11,071          4,809
Dataset #3
Three matrices of ratings for books, movies and electronics are extracted from the Amazon website (He & McAuley 2016). The book data contains ratings from 305,475 users on 888,057 books; the movie data contains ratings from the same 305,475 users on 128,097 movies and TV programs; the electronics data is of the same users on 196,894 items. All ratings are from 1 to 5. For this experiment, the data is constructed as follows:

- The same sub-sampling approach as in (Pan et al. 2010) is first adopted by randomly extracting $10^4 \times 10^4$ dense rating matrices from these three matrices. Three sub-matrices of 5,000 × 5,000 each are then taken, as summarized in Table 4.2. All sub-matrices share the same users, but no common items.

- All ratings are normalized by 5 so that their values range from 0.2 to 1.
4.4.2 Experimental settings
All algorithms factorize the input datasets with different ranks. Each algorithm was run five times; the mean and standard deviation of the results are reported in the next section. Furthermore, small changes across consecutive iterations indicate an algorithm's convergence; thus, the algorithms are stopped when the change is less than $10^{-5}$.
Table 4.3: Mean and standard deviation of tested RMSE on ABS NSW and ABS VIC data with different algorithms. For CST, when X(1) is the target, X(2) is used as auxiliary data and vice versa. Best results for each rank are in bold. The Hotelling's T-squared tests row presents the p-value of the Hotelling's T-squared test between each algorithm and the proposed HISF.

Rank  Dataset        CMF              CBT              CLFM             CST              HISF
5     ABS NSW X(1)   0.0226 ±0.0026   0.0839 ±0.0002   0.0838 ±0.0002   0.0248 ±0.0005   0.0132 ±0.0002
      ABS VIC X(2)   0.0364 ±0.0031   0.0844 ±0.0003   0.0845 ±0.0004   0.0271 ±0.0015   0.0266 ±0.0030
7     ABS NSW X(1)   0.0222 ±0.0009   0.0836 ±0.0004   0.0842 ±0.0006   0.0244 ±0.0005   0.0131 ±0.0003
      ABS VIC X(2)   0.0428 ±0.0020   0.0845 ±0.0004   0.0849 ±0.0004   0.0288 ±0.0025   0.0239 ±0.0025
9     ABS NSW X(1)   0.0241 ±0.0011   0.0841 ±0.0002   0.0848 ±0.0009   0.0231 ±0.0009   0.0143 ±0.0004
      ABS VIC X(2)   0.0476 ±0.0040   0.0852 ±0.0003   0.0848 ±0.0003   0.0283 ±0.0024   0.0221 ±0.0019
11    ABS NSW X(1)   0.0265 ±0.0026   0.0846 ±0.0007   0.0841 ±0.0007   0.0223 ±0.0006   0.0143 ±0.0003
      ABS VIC X(2)   0.0501 ±0.0029   0.0858 ±0.0005   0.0851 ±0.0007   0.0267 ±0.0020   0.0242 ±0.0015
13    ABS NSW X(1)   0.0237 ±0.0024   0.0851 ±0.0002   0.0850 ±0.0005   0.0222 ±0.0010   0.0151 ±0.0004
      ABS VIC X(2)   0.0489 ±0.0032   0.0860 ±0.0003   0.0852 ±0.0003   0.0290 ±0.0026   0.0227 ±0.0015
15    ABS NSW X(1)   0.0229 ±0.0029   0.0853 ±0.0005   0.0847 ±0.0005   0.0219 ±0.0007   0.0150 ±0.0000
      ABS VIC X(2)   0.0514 ±0.0041   0.0862 ±0.0002   0.0854 ±0.0006   0.0285 ±0.0023   0.0215 ±0.0005
Hotelling's T² p-value  1.15×10⁻⁶     3.97×10⁻¹⁷       1.13×10⁻¹⁸       4.91×10⁻⁸        -
4.4.3 Empirical results
Three scenarios are tested, as follows:
Case #1. Latent demographic profile similarities and latent LGA groups
similarities can help to collaboratively suggest unknown information in
these states
Matrices X(1) and X(2) are used as described in Section 4.4.1; they are from different LGAs of two states. Nevertheless, both of them are ratings for the same demographic categories. They share some common explicit demography similarities as well as implicit latent LGA similarities. They are tested to assess how well both explicit similarities in the demography dimension and implicit ones in the LGA dimension collaboratively suggest unknown information.
Table 4.3 shows the mean and standard deviation of the RMSE of all algorithms on the tested ABS data for the New South Wales (X(1)) and Victoria (X(2)) states. Both CBT and CLFM, which assume the two states have similar demography patterns in a latent sense, clearly perform the worst. The results demonstrate that these explicit similarities (in the form of latent demography patterns) do not fully capture the correlation nature of the two datasets in this case. Thus, they do not help CBT and CLFM to improve their performance significantly. CMF applies another approach to take advantage of explicit correlations between NSW state's population and family profile and those of VIC state. Specifically, CMF's assumption of the same population and family profile factor between NSW and VIC helps improve its performance over that of CBT and CLFM. CST allows a more flexible utilization of explicit correlations between NSW's population and family profile and those of VIC state than CMF does. As a result, CST achieves slightly higher accuracy than CMF in recommending NSW's and VIC's missing information.
Nevertheless, the prediction accuracy can be improved even further, as illustrated with the proposed idea of explicit and implicit similarity discovery. Utilizing them helps the proposed HISF achieve about two times higher accuracy compared with CMF, and up to 47% improvement for NSW and up to 25% for VIC compared to CST. These impressive results are achieved when the numbers of common columns are 3, 5, 5, 7, 7 and 7 for decomposition ranks of 5, 7, 9, 11, 13 and 15, respectively. This means that the common parts together with the domain-specific parts better capture the true correlation nature of the datasets. These explicit similarities together with implicit similar group alignments allow better knowledge leveraging between datasets, thus improving recommendation accuracy.
To confirm the statistical significance of the proposed method, Hotelling's T-squared tests (Hotelling 1931), the multivariate version of the t-test in univariate statistics, are performed. The objective is to validate whether the proposed algorithm differs significantly from the baselines. The multivariate Hotelling's T-squared test is used here because each population involves observations from two variables: NSW (ABS NSW X(1)) and VIC (ABS VIC X(2)). For testing the null hypothesis that each pair of algorithms (CMF vs. HISF, CBT vs. HISF, CLFM vs. HISF and CST vs. HISF) has identical mean RMSE vectors, let

H0: population mean RMSEs are identical for all of the variables (μ_NSW1 = μ_NSW2 and μ_VIC1 = μ_VIC2)

H1: at least one pair of these means is different (μ_NSW1 ≠ μ_NSW2 or μ_VIC1 ≠ μ_VIC2)

Because all p-values are smaller than α (0.05), the null hypothesis is rejected. Therefore, the observed difference between the baselines and the proposed algorithm is statistically significant.
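The test itself is standard; a textbook two-sample Hotelling's T-squared sketch in NumPy/SciPy is shown below, where each sample holds one algorithm's per-run RMSE vectors (e.g., the NSW and VIC columns over the five runs). This is a generic implementation, not the thesis's own code.

```python
import numpy as np
from scipy import stats

def hotelling_t2(sample1, sample2):
    """Two-sample Hotelling's T-squared test.

    sample1, sample2: (runs x variables) arrays of RMSEs.
    Returns the T^2 statistic and the p-value via the F-distribution form.
    """
    n1, p = sample1.shape
    n2, _ = sample2.shape
    d = sample1.mean(axis=0) - sample2.mean(axis=0)
    S_pool = ((n1 - 1) * np.cov(sample1, rowvar=False)
              + (n2 - 1) * np.cov(sample2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S_pool, d)
    f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    p_value = stats.f.sf(f_stat, p, n1 + n2 - p - 1)
    return t2, p_value
```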
Case #2. Latent LGA similarities and latent similarities between de-
mography and crime can help to collaboratively suggest unknown crime
and state information
The advantages of both explicit and implicit similarities are further confirmed
in Table 4.4. In this case, they are applied to other cross domains: ABS NSW
demography (X(1)) and NSW Crime (X(3)). These datasets have explicit similarities
in their LGA latent factors. At the same time, implicit similarities in demography
profile and criminal behaviors are also utilized. The proposed HISF, leveraging both similarities, outperforms the existing algorithms. It is worth noting here that the performance of CST, in this case, is worse than that of CMF (about two times worse for NSW and a bit lower for VIC). The results suggest that the flexibility of utilizing explicit similarities does not work here. On the contrary, the proposed idea of utilizing both explicit and implicit similarities achieves more stable and better results (demonstrated in Tables 4.3 and 4.4).
Similar Hotelling’s T squared tests are perfomed as of the case #1. Again, as
84
Table 4.4 : Mean and standard deviation of tested RMSE on ABS NSW demographyand BOCSAR NSW crime data with different algorithms. For CST, when X(1) isthe target, X(3) is used as an auxiliary data and vice versa. Best results for eachrank are in bold. The Hotelling’s T squared tests row presents p-value of Hotelling’sT squared tests between each algorithm and the proposed HISF.
Dataset Rank CMF CBT CLFM CST HISF
5Demography X(1) 0.0209 ±0.0016 0.0840 ±0.0001 0.0840 ±0.0001 0.0304 ±0.0080 0.0174 ±0.0015
Crime X(3) 0.2796 ±0.0204 0.3411 ±0.0035 0.3422 ±0.0071 0.3216 ±0.0052 0.2697 ±0.0073
7Demography X(1) 0.0223 ±0.0024 0.0840 ±0.0002 0.0855 ±0.0006 0.0324 ±0.0040 0.0143 ±0.0004
Crime X(3) 0.2907 ±0.0265 0.3432 ±0.0021 0.3912 ±0.0188 0.3440 ±0.0055 0.2716 ±0.0029
9Demography X(1) 0.0199 ±0.0027 0.0838 ±0.0002 0.0850 ±0.0008 0.0337 ±0.0050 0.0143 ±0.0003
Crime X(3) 0.2813 ±0.0261 0.3562 ±0.0134 0.3722 ±0.0249 0.3434 ±0.0087 0.2648 ±0.0058
11Demography X(1) 0.0212 ±0.0049 0.0839 ±0.0001 0.0843 ±0.0004 0.0402 ±0.0073 0.0146 ±0.0003
Crime X(3) 0.2689 ±0.0143 0.3539 ±0.0061 0.3712 ±0.0199 0.3495 ±0.0051 0.2618 ±0.0012
13Demography X(1) 0.0194 ±0.0022 0.0837 ±0.0001 0.0837 ±0.0003 0.0383 ±0.0113 0.0149 ±0.0003
Crime X(3) 0.2700 ±0.0150 0.3481 ±0.0070 0.3500 ±0.0135 0.3599 ±0.0024 0.2623 ±0.0024
15Demography X(1) 0.0173 ±0.0014 0.0835 ±0.0001 0.0834 ±0.0002 0.0503 ±0.0084 0.0149 ±0.0002
Crime X(3) 0.2647 ±0.0031 0.3485 ±0.0038 0.3580 ±0.0099 0.3610 ±0.0011 0.2625 ±0.0015
Hotelling’s T squared tests 8.22×10−4 1.37×10−15 2.51×10−15 1.38×10−6 -
C
HISF
(a)
C
HISF
(b)
Figure 4.8 : Tested mean RMSEs under different values of the common row pa-rameter (c) in the coupled factor of HISF with r ank r = 11. a) Results on ABSNSW dataset; b) Results on ABS VIC dataset. CMF and CBT do not have cparameter, thus, their results are used as references. Lower is better. RMSE ofHISF outperforms its competitors when c equals 7.
all p-values are smaller than α (0.05) here, the null hypothesis is rejected. It is
therefore convincing to conclude that the mean RMSEs between the baselines and
the proposed idea differ significantly.
Figures 4.8 and 4.9 show how HISF works with different values of the c parameter. These figures illustrate the case where the decomposition rank equals 11. All the curves of HISF have an identical trend. As the number of common rows of explicit similarities (the c parameter) increases, recommendation accuracy improves until it reaches its minimum RMSE. Then, the performance degrades as c increases further. This pattern confirms the significance of preserving the domain-specific parts of each dataset. In other words, common and domain-specific parts better capture the correlation nature of datasets across domains, thus achieving higher recommendation accuracy. Moreover, when c equals the rank 11, HISF and CMF utilize the same explicit similarities (identical coupled factors). Yet, HISF uses an extra correlation from implicit similarities. This additional knowledge enables HISF to achieve a lower RMSE compared to CMF (Figure 4.8b). All of these suggest the advantages of both explicit and implicit similarities in the joint analysis of two matrices.

Figure 4.9: Tested mean RMSEs under different values of the common row parameter (c) in the coupled factor of HISF with rank r = 11. a) Results on ABS NSW demography; b) Results on BOCSAR NSW Crime. CMF and CBT do not have the c parameter, thus their results are used as references. Lower is better. RMSE of HISF outperforms its competitors when c equals 9.
Table 4.5: Mean and standard deviation of tested RMSE on book, movie and electronics data with different algorithms. CST is not applied here as it does not support two or more principal coordinates on one factor. Best results for each rank are in bold. The Hotelling's T-squared tests row presents the p-value of the Hotelling's T-squared test between each algorithm and the proposed HISF.

Rank  Dataset            CMF              CBT              CLFM             HISF-N
5     Books X(4)         0.2169 ±0.0005   0.2187 ±0.0026   0.2161 ±0.0023   0.1951 ±0.0005
      Movies X(5)        0.2273 ±0.0022   0.3342 ±0.0039   0.3474 ±0.0063   0.2212 ±0.0010
      Electronics X(6)   0.2375 ±0.0017   0.4206 ±0.0036   0.4596 ±0.0066   0.2642 ±0.0011
7     Books X(4)         0.2170 ±0.0009   0.3068 ±0.0031   0.3059 ±0.0024   0.1977 ±0.0010
      Movies X(5)        0.2279 ±0.0009   0.4368 ±0.0031   0.4337 ±0.0055   0.2213 ±0.0007
      Electronics X(6)   0.2348 ±0.0009   0.6607 ±0.0135   0.6389 ±0.0116   0.2497 ±0.0016
9     Books X(4)         0.2185 ±0.0010   0.3163 ±0.0004   0.3150 ±0.0026   0.2001 ±0.0005
      Movies X(5)        0.2302 ±0.0010   0.4661 ±0.0046   0.4594 ±0.0065   0.2218 ±0.0008
      Electronics X(6)   0.2354 ±0.0019   0.7098 ±0.0232   0.6909 ±0.0175   0.2411 ±0.0019
11    Books X(4)         0.2219 ±0.0014   0.3207 ±0.0044   0.3204 ±0.0021   0.2012 ±0.0004
      Movies X(5)        0.2319 ±0.0021   0.4865 ±0.0019   0.4795 ±0.0043   0.2247 ±0.0005
      Electronics X(6)   0.2417 ±0.0010   0.7390 ±0.0090   0.7223 ±0.0123   0.2420 ±0.0018
13    Books X(4)         0.2244 ±0.0018   0.3267 ±0.0020   0.3291 ±0.0027   0.2021 ±0.0010
      Movies X(5)        0.2349 ±0.0025   0.5014 ±0.0057   0.4908 ±0.0028   0.2263 ±0.0009
      Electronics X(6)   0.2422 ±0.0012   0.7695 ±0.0118   0.7684 ±0.0155   0.2410 ±0.0022
15    Books X(4)         0.2260 ±0.0014   0.3303 ±0.0024   0.3353 ±0.0025   0.2022 ±0.0008
      Movies X(5)        0.2365 ±0.0018   0.5132 ±0.0044   0.5070 ±0.0065   0.2270 ±0.0009
      Electronics X(6)   0.2447 ±0.0013   0.8067 ±0.0126   0.7688 ±0.0122   0.2417 ±0.0024
Hotelling's T² p-value   2.80×10⁻⁷        1.15×10⁻⁵        1.11×10⁻⁶        -
Case #3. Latent users’ taste similarities when buying books, movies
and electronics devices and latent item group similarities can help to
collaboratively improve recommendation accuracy
Joint analysis of all three rating matrices from the Amazon website, X(4) for books, X(5) for movies and X(6) for electronics, is studied. All of them are explicitly from the same users. Nevertheless, their hidden similarities in personal preferences and items' characteristics can also help to better suggest missing information. This case assesses how these explicit together with implicit similarities can help to collaboratively improve the recommendation accuracy of each.
Table 4.5 summarizes the performance of CMF, CBT, CLFM and the proposed algorithm on the Books, Movies and Electronics datasets from Amazon. CST is not applied due to its limit of one auxiliary dataset for one factor; in this case, two side datasets for the user dimension would be needed. The results on these three datasets are consistent with those for the two matrices above. In particular, CBT and CLFM, which assume shared rating patterns among the three, perform the worst. CMF improves accuracy considerably in comparison with CBT and CLFM. Nevertheless, its performance is once more outperformed by the proposed ideas. The results in this case demonstrate that the idea can be generalized to the situation of more than two datasets.

Hotelling's T-squared tests are again performed for three variables, i.e., Books, Movies and Electronics, between each baseline and HISF-N to confirm the statistical significance of the proposed method. For each pair (CMF vs. HISF-N, CBT vs. HISF-N and CLFM vs. HISF-N), let

H0: population mean RMSEs are identical for all of the variables (μ_Books1 = μ_Books2 and μ_Movies1 = μ_Movies2 and μ_Electronics1 = μ_Electronics2)

H1: at least one pair of these means is different (μ_Books1 ≠ μ_Books2 or μ_Movies1 ≠ μ_Movies2 or μ_Electronics1 ≠ μ_Electronics2)
The Hotelling’s T squared tests this time also result in very small p-values. This
once again confirms the performance between them is significantly different. Thus,
it can be concluded that their observed difference is convincingly significant.
Figure 4.10 shows how HISF-N works with different values of the c parameter. Two conclusions can be drawn by observing the figures. Firstly, the domain-specific parts are quite substantial in addition to the common parts. Secondly, the proposed model reduces X(4)'s RMSE the most, then X(5)'s, then X(6)'s, in comparison with the other algorithms. This can be explained by their observed data sizes. The proposed model, which optimizes the loss function (4.11), gives more preference to optimizing the domain with more data while preserving comparable performance on the other domains.

Figure 4.10: Tested mean RMSEs of the Amazon dataset under different values of the common row parameter (c) in the coupled factor of HISF-N with rank r = 11. a) Results on Amazon Books; b) Results on Amazon Movies; c) Results on Amazon Electronics. CMF and CBT do not have the c parameter, thus their results are used as references. Lower is better. RMSE of HISF outperforms its competitors when c equals 7.
4.5 Contributions and Summary
This chapter addresses contributions #2 and #3 of this thesis by proposing a
novel algorithm to discover implicit similarities between datasets across domains. It
presents an idea to discover non-coupled factors’ latent similarities and makes use
of them to further improve the recommendation. Specifically, on the non-shared di-
mension, the middle matrix of the tri-factorization is proposed to match the unique
factors. Based on the found matches, HISF aligns the matched unique factors to
further transfer cross-domain implicit similarities and thus improve the recommen-
dation.
The proposed algorithm is significant in two respects. Firstly, e-commerce businesses are increasingly dependent on recommendation systems to introduce personalized services and products to targeted customers. Providing useful recommendations requires sufficient knowledge about user preferences and product (item) characteristics. Given the current abundance of available data across domains, HISF achieves a thorough understanding of the relationship between users and items by exploiting the implicit similarities among them. Discovering and utilizing them can bring in more collaborative filtering power and lead to higher recommendation accuracy. Secondly, HISF is the first factorization method using both explicit and implicit similarities to enhance the performance of cross-domain recommendation. Validated on real-world datasets, the proposed approach outperforms existing algorithms by more than two times in terms of recommendation accuracy. The empirical results also encourage us to extend the idea to a generalized model which enables joint analysis of the explicit and implicit similarities from multiple correlated datasets. The advantages of the ideas suggest both similarities have a significant impact on improving the performance of cross-domain recommendation.
Chapter 5
Scalable Multimodal Factorization
This chapter encapsulates contribution #4 of this thesis. It is an extended description of the following publication:
Quan Do and Wei Liu, “Scalable Multimodal Factorization for Learning from
Very Big Data,” in Multimodal Analytics for Next-Generation Big Data Technologies
and Applications, Springer. (To appear)
5.1 Introduction
Recent technological advances in data acquisition have brought new opportunities as well as new challenges (Lahat et al. 2015b) to research communities. Many new acquisition methods and sensors enable researchers to acquire multiple modes of information about the real world. This multimodal data can be naturally and efficiently represented by a multi-way structure, called a tensor, which can be analyzed to extract the underlying meaning of the observed data (Baltruaitis et al. 2018, Farias et al. 2016). The increasing availability of multiple modalities, captured in correlated tensors, provides greater opportunities to examine a complete picture of all the data patterns, as discussed in Chapters 3 and 4.
The joint analysis of multimodal tensor data generated from different sources
provides a deeper understanding of the data’s underlying structure (Bhargava et al.
2015, Diao et al. 2014). However, processing this massive amount of correlated
data incurs a very heavy cost in terms of computation, communication, and storage.
Traditional methods which operate on a local machine such as coupled matrix tensor
factorization (CMTF) (Acar, Kolda & Dunlavy 2011) are either intractably slow or
memory insufficient. The former issue is because they iteratively compute factors
on the full coupled tensors many times; the latter is due to the fact that the full
coupled tensors cannot be loaded into a typical machine’s local memory. Both
computationally efficient methods and scalable work have been proposed to speed
up the factorization of multimodal data. Whereas concurrent processing using CPU
cores (Papalexakis et al. 2014, 2012) or GPUs’ massively parallel architecture (Zou
et al. 2015) computed faster, it did not solve the problem of insufficient local memory
to store a large amount of observed data. Other MapReduce distributed models
(Beutel et al. 2014, Jeon et al. 2016, Sun et al. 2010, Shin & Kang 2014) overcame
memory problems by keeping the large files in a distributed file system. They also
improved computational speed by having many different computing nodes processed
in parallel.
Computing in parallel allows factors to be updated faster (Liavas & Sidiropoulos 2015), yet the factorization process faces a higher data communication cost if it is not well designed. One critical weakness of MapReduce algorithms is that when a node needs data to be processed, the data is transferred from an isolated distributed file system to the node (Shi et al. 2015). The iterative nature of factorization requires data and factors to be distributed over and over again, incurring a huge communication overhead. If the tensor size is doubled, the algorithms' performance is 2T times worse (T is the number of iterations). This cost is one of the disadvantages of the MapReduce methods and a cause of their low scalability.
This chapter describes an even more scalable multimodal factorization (SMF)
to improve the performance of MapReduce-based factorization algorithms as the
observed data becomes larger. The aforementioned deficiencies of MapReduce-based
algorithms can be overcome by minimizing the data transmission between computing
nodes and choosing a fast convergence optimization. The chapter describes SMF in
Section 5.2 in two parts: the first explains the observations behind processing by
blocks and caching data on computing nodes as well as provides a theoretical analysis
of the optimization process and the second part shows how SMF can be scaled up
to an unlimited input of multimodal data. The advantages of this method in terms
of minimal communication cost and scaling up capability are essential features of
any technique for dealing with multimodal big data. Also, Section 5.3 demonstrates
how it works by performing several tests with real-world multimodal data to evaluate
its scalability, its convergence speed, its accuracy and performance using different
optimization methods. Finally, Section 5.4 summarizes the primary contributions
of the proposed idea in achieving the scalability of factorization algorithms.
5.2 SMF: the proposed Scalable Multimodal Factorization
This section introduces SMF for the joint analysis of several N-mode tensors with one or more modes in common. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be a mode-N tensor. $\mathcal{X}$ has at most N coupled matrices or tensors. Without loss of generality, this section first explains a case where $\mathcal{X}$ and another matrix $Y \in \mathbb{R}^{I_1 \times J_2}$ are coupled in their first modes. The joint analysis of more than two tensors is discussed in Section 5.2.2. Based on the coupled matrix tensor factorization of $\mathcal{X}$ and $Y$ whose first modes are correlated, as in Section 2.3.2, SMF decomposes $\mathcal{X}$ into $U^{(1)} \in \mathbb{R}^{I_1 \times r}$, $U^{(2)} \in \mathbb{R}^{I_2 \times r}$, ..., $U^{(N)} \in \mathbb{R}^{I_N \times r}$ and $Y$ into $U^{(1)} \in \mathbb{R}^{I_1 \times r}$ and $V^{(2)} \in \mathbb{R}^{J_2 \times r}$, where $U^{(1)}$ is the common factor and $r$ is the decomposition rank.
$$\mathcal{L} = \left\| [\![ U^{(1)}, U^{(2)}, \ldots, U^{(N)} ]\!] - \mathcal{X} \right\|^2 + \left\| [\![ U^{(1)}, V^{(2)} ]\!] - Y \right\|^2 \quad (5.1)$$
Observation 1. Approximating each row of one factor while fixing the other factors
reduces the complexity of CMTF.
Let $U^{(-k)} = [\![ U^{(1)}, \ldots, U^{(k-1)}, U^{(k+1)}, \ldots, U^{(N)} ]\!]$. Based on Observation 1, instead of finding U(1), U(2), ..., U(N) and V(2) that minimize the loss function in Equation (5.1), the problem can be formulated as optimizing every single row $u^{(1)}_{i_1}$ of the coupled factor

$$\mathcal{L} = \sum_{i_2,\ldots,i_N} \left\| \sum_{f=1}^{r} u^{(1)}_{i_1,f} \times U^{(-1)}_{i_2,\ldots,i_N,f} - X^{(1)}_{i_1,i_2,\ldots,i_N} \right\|^2 + \sum_{j_2} \left\| \sum_{f=1}^{r} u^{(1)}_{i_1,f} \times V^{(2)}_{j_2,f} - Y^{(1)}_{i_1,j_2} \right\|^2 \quad (5.2)$$
minimizing each row $u^{(n)}_{i_n}$ of a non-coupled factor $U^{(n)}$ $(n > 1)$

$$\mathcal{L} = \sum_{i_1,\ldots,i_N} \left\| \sum_{f=1}^{r} u^{(n)}_{i_n,f} \times U^{(-n)}_{i_1,\ldots,i_N,f} - X^{(n)}_{i_1,i_2,\ldots,i_N} \right\|^2 \quad (5.3)$$

where $u^{(n)}_{i_n}$ is the variable the proposed model wants to find while fixing the other terms and $X_{i_1,i_2,\ldots,i_N}$ are the observed entries of $\mathcal{X}$,
and minimizing $v^{(2)}_{j_2}$ of a non-coupled factor $V^{(2)}$

$$\mathcal{L} = \sum_{i_1} \left\| \sum_{f=1}^{r} U^{(1)}_{i_1,f} \times v^{(2)}_{j_2,f} - Y^{(2)}_{i_1,j_2} \right\|^2 \quad (5.4)$$

where $v^{(2)}_{j_2}$ is the variable the proposed algorithm wants to find while fixing the other terms and $Y_{i_1,j_2}$ are the observed entries of $Y$.
According to Equation (5.2), computing a row $u^{(1)}_{i_1}$ of the coupled factor while fixing the other factors requires the observed entries of $\mathcal{X}$ and those of $Y$ that are located in slices $X^{(1)}_{i_1}$ and $Y^{(1)}_{i_1}$, respectively. Figure 5.1 illustrates these tensor slices for calculating each row of any factor. Similarly, Equation (5.3) suggests a slice $X^{(n)}_{i_n}$ for updating a corresponding row $u^{(n)}_{i_n}$, and Equation (5.4) suggests $Y^{(2)}_{j_2}$ for updating a corresponding row $v^{(2)}_{j_2}$.

Figure 5.1: Tensor slices for updating each row of U(1), U(2), U(3), and V(2) when the input tensors are a mode-3 tensor X coupled with a matrix Y in their first mode: (a) coupled slices in the first mode of both X(1) and Y(1) required for updating a row of U(1); (b) a slice in the second mode of X(2) for updating a row of U(2); (c) a slice in the third mode of X(3) for updating a row of U(3); (d) a slice in the second mode of Y(2) for updating a row of V(2).
Definition 1. Two slices $X^{(n)}_i$ and $X^{(n)}_{i'}$ are independent if and only if $\forall x \in X^{(n)}_i$, $\forall x' \in X^{(n)}_{i'}$ and $i \neq i'$, then $x \neq x'$.
Observation 2. Row updates for each factor as in Equation (5.2), (5.3) and (5.4)
require independent tensor slices; each of these non-overlapping parts can be pro-
cessed in parallel.
Figure 5.1a shows that $X^{(1)}_i, Y^{(1)}_i$ for updating $U^{(1)}_i$ and $X^{(1)}_{i'}, Y^{(1)}_{i'}$ for updating $U^{(1)}_{i'}$ are non-overlapping $\forall i, i' \in [1, I_1]$ and $i \neq i'$. Consequently, all rows of U(1) are independent and can be executed concurrently. The same parallel updates apply to all rows of U(2), ..., U(N) and V(2).
5.2.1 SMF on Apache Spark
This section discusses distributed SMF for large-scale datasets.
Observation 3. The most critical performance bottleneck of any distributed CMTF
algorithm is transferring a large-scale dataset to computing nodes at each iteration.
As with observation 2, optimizing rows of factors can be done in parallel with
distributed nodes; each one needs a tensor slice and other fixed factors. Existing dis-
tributed algorithms, such as FlexiFact (Beutel et al. 2014), GigaTensor (Kang et al.
2012) and SCouT (Jeon et al. 2016), store input tensors in a distributed file system.
Computing any factor requires the corresponding data to be transferred to process-
ing nodes. Because of the iterative nature of the CP model, this large-scale tensor
distribution repeats, causing a heavy communication overhead. SMF eliminates this
huge data transmission cost by robustly caching the required data in memory of the
processing nodes.
SMF partitions input tensors and localizes them in the computing nodes’ mem-
ory. It is based on Apache Spark because Spark natively supports local data caching
with its resilient distributed datasets (RDD) (Zaharia et al. 2010). In a nutshell, an
RDD is a collection of data partitioned across computational nodes. Any transfor-
mation or operation (map, reduce, foreach, ...) on an RDD is done in parallel. As
the data partition is located in the processing nodes’ memory, revisiting this data
many times over the algorithm’s iterations does not incur any communication over-
head. SMF designs RDD variables and chooses the optimization method carefully
to maximize RDD’s capability.
Figure 5.2: Dividing the coupled matrix and tensor into non-overlapping blocks: (a) coupled blocks CB(1) ← (B1(1), B2(1)) in the first mode of both X and Y for updating U(1); (b) blocks in the second mode of X for updating U(2) and those in the second mode of Y for V(2); (c) blocks in the third mode of X for U(3). All blocks are independent and can be processed concurrently.
Block processing
SMF processes blocks of slices to enhance efficiency. As observed in Figure 5.1, a coupled slice $(X^{(1)}_{i_1}, Y^{(1)}_{i_1})$ is required for updating a row of $U^{(1)}_{i_1}$. Slices $X^{(2)}_{i_2}$, $X^{(3)}_{i_3}$, and $Y^{(2)}_{j_2}$ are for updating $U^{(2)}_{i_2}$, $U^{(3)}_{i_3}$, and $V^{(2)}_{j_2}$, respectively. On one hand, it is possible to work separately on every one of the $I_1$ slices $X^{(1)}_{i_1}$ and $Y^{(1)}_{i_1}$, the $I_2$ slices $X^{(2)}_{i_2}$, the $I_3$ slices $X^{(3)}_{i_3}$ and the $J_2$ slices $Y^{(2)}_{j_2}$. On the other hand, dividing data into too many small parts is not a wise choice, as the time needed for job scheduling may exceed the computational time. Thus, merging several slices into non-overlapping blocks and working on them in parallel increases efficiency. An example of this grouping is presented in Figure 5.2.

Figure 5.2a shows coupled blocks CB(1) created by merging the corresponding (B1(1), B2(1)) in the first mode of both X and Y for updating U(1). Blocks in the second mode of X for updating U(2) and those in the second mode of Y for V(2) are illustrated in Figure 5.2b. Figure 5.2c shows blocks in the third mode of X for U(3). All blocks are independent and can be processed concurrently.
N copies of an N-mode tensor caching

SMF caches N copies of X and two copies of Y (one per mode) in memory. As observed in Figure 5.2, blocks of CB(1) are used for updating the first mode factor U(1); blocks of B1(2) and B2(2) are used for the second mode factors U(2) and V(2), respectively; blocks of B1(3) are used for the third mode factor U(3). Thus, to totally eliminate data transmission, all of these blocks should be cached. This duplicated caching needs more memory, yet it does not require a huge memory extension as more processing nodes can be added.
Function createRDD
Input: X-filename, Y-filename, N = 3, number of blocks d
Output: cached CB(1), B1(2), ..., B1(N), B2(2)
1  RDD1 ← read(X-filename)
2  RDD2 ← read(Y-filename)
3  foreach mode n ∈ [1, N] do
4      B1(n) ← RDD1.map(emit⟨in, (i1, ..., iN, value)⟩)
5          .groupByKey()
6          .partitionBy(d)
7          .cache()
8  foreach mode n ∈ [1, 2] do
9      B2(n) ← RDD2.map(emit⟨in, (i1, i2, value)⟩)
10         .groupByKey()
11         .partitionBy(d)
12         .cache()
13 CB(1) ← B1(1).join(B2(1)).cache()
A pseudo code for creating these copies as RDD variables in Apache Spark is in
Function createRDD(). Lines 1 and 2 create two RDDs of strings (i1, ..., iN , value)
for X and Y. The entries of RDDs are automatically partitioned across processing
nodes. These strings are converted into N key-value pairs of 〈in, (i1, ..., iN , value)〉,one for each mode, in lines 4 and 9. Lines 5 and 10 group the results into slices,
as illustrated in Figure 5.1. These slices are then merged into blocks (lines 6 and
99
11) to be cached in working nodes (lines 7 and 12). Coupled blocks are created by
joining corresponding blocks of X^{(1)} and Y^{(1)} in line 13. It is worth noting that
these transformations (lines 4 to 7 and lines 9 to 12) are performed concurrently on
the parts of the RDDs located in each working node.
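To make the caching scheme concrete, here is a minimal Apache Spark (Scala) sketch of Function createRDD(). The entry format, key layout, partition count, and storage level are illustrative assumptions for this sketch rather than the exact SMF implementation.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CreateRddInput {
  // Hypothetical entry type: the coordinates (i1, ..., iN) plus the value.
  type Entry = (Array[Int], Double)

  // Parse a whitespace-separated line "i1 ... iN value" (assumed format).
  private def parse(n: Int)(line: String): Entry = {
    val parts = line.trim.split("\\s+")
    (parts.take(n).map(_.toInt), parts(n).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SMF-createRDD"))
    val d = 64 // number of blocks per mode (illustrative)
    val N = 3  // modes of the main tensor X

    val rdd1 = sc.textFile("X-filename").map(parse(N)) // line 1
    val rdd2 = sc.textFile("Y-filename").map(parse(2)) // line 2

    // Lines 3-7: key each entry of X by its mode-n index, group the
    // entries into slices, merge the slices into d cached blocks.
    val b1 = (1 to N).map { n =>
      rdd1.map { case (idx, v) => (idx(n - 1), (idx, v)) }
        .groupByKey()
        .partitionBy(new HashPartitioner(d))
        .persist(StorageLevel.MEMORY_ONLY)
    }

    // Lines 8-12: the same layout for both modes of the matrix Y.
    val b2 = (1 to 2).map { n =>
      rdd2.map { case (idx, v) => (idx(n - 1), (idx, v)) }
        .groupByKey()
        .partitionBy(new HashPartitioner(d))
        .persist(StorageLevel.MEMORY_ONLY)
    }

    // Line 13: coupled blocks join the first-mode blocks of X and Y.
    val cb1 = b1(0).join(b2(0)).persist(StorageLevel.MEMORY_ONLY)
    cb1.count() // force materialization of all cached blocks
  }
}

Note that the groupByKey()/partitionBy() pair mirrors lines 5-6 of the pseudocode; in practice the two shuffles could be fused by passing the partitioner directly to groupByKey().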
Optimized solver
SMF uses a closed form solution for solving each row of the factors. This opti-
mizer not only converges faster but also helps to achieve higher accuracy.
Theorem 1. The optimal row u^{(1)}_{i_1} of a coupled factor U^{(1)} is computed by

    u^{(1)}_{i_1} = (A^T A + C^T C)† (A^T b_{i_1} + C^T d_{i_1})        (5.5)

where b_{i_1} is a column vector of all observed entries of X^{(1)}_{i_1}; d_{i_1} is a column vector of all observed entries of Y^{(1)}_{i_1}; A and C consist of all U^{(-1)}_{i_2,...,i_N} and V^{(2)}_{j_2} corresponding to the observed b_{i_1} and d_{i_1}, respectively.
Proof. Equation (5.2) can be written for each row of U^{(1)} as:

    L = Σ_{i_1} ‖A u^{(1)}_{i_1} − b_{i_1}‖² + Σ_{i_1} ‖C u^{(1)}_{i_1} − d_{i_1}‖²

Let x be the optimal u^{(1)}_{i_1}; then x can be derived by setting the derivative of L with respect to x to zero. Hence,

    L = (Ax − b_{i_1})^T (Ax − b_{i_1}) + (Cx − d_{i_1})^T (Cx − d_{i_1})
      = x^T A^T A x − 2 b^T_{i_1} A x + b^T_{i_1} b_{i_1} + x^T C^T C x − 2 d^T_{i_1} C x + d^T_{i_1} d_{i_1}

    ⇔ ∂L/∂x = 2 A^T A x − 2 A^T b_{i_1} + 2 C^T C x − 2 C^T d_{i_1} = 0
    ⇔ (A^T A + C^T C) x = A^T b_{i_1} + C^T d_{i_1}
    ⇔ x = (A^T A + C^T C)† (A^T b_{i_1} + C^T d_{i_1})
Theorem 2. The optimal row u^{(n)}_{i_n} of a non-coupled factor U^{(n)} is computed by

    u^{(n)}_{i_n} = (A^T A)† A^T b_{i_n}        (5.6)

where b_{i_n} is a column vector of all observed entries of X^{(n)}_{i_n}; A consists of all rows of the other factors U^{(-n)} corresponding to the observed b_{i_n}.

Proof. Similar to the proof of Theorem 1, u^{(n)}_{i_n} minimizes Equation (5.3) when the derivative with respect to it is zero:

    ∂L/∂u^{(n)}_{i_n} = 2 A^T A u^{(n)}_{i_n} − 2 A^T b_{i_n} = 0
    ⇔ u^{(n)}_{i_n} = (A^T A)† A^T b_{i_n}
Performing the pseudo-inversions in Equations (5.5) and (5.6) is expensive. Nevertheless, as (A^T A + C^T C) and A^T A are small square matrices in R^{r×r}, a more efficient approach is to use Cholesky decomposition to compute the pseudo-inverse and solve for u^{(n)}_{i_n}, as in Algorithm 4. At each iteration, SMF first broadcasts all the newly updated factors to all processing nodes. Then each factor is computed. While SMF updates each factor of each tensor sequentially, each row of a factor is computed in parallel by either updateFactor() (lines 10 and 12) or updateCoupledFactor() (line 7). These two functions are processed concurrently by different computing nodes with their cached data blocks. These steps are iterated to update the factors until the algorithm converges.
Algorithm 4: SMF with data parallelism
Input : cached CB^{(1)}, B1^{(2)}, ..., B1^{(N)}, B2^{(2)}, E
Output: U^{(1)}, ..., U^{(N)}, V^{(2)}
1   Initialize L by a small number
2   Randomly initialize all factors
3   repeat
4       Broadcast all factors
5       PreL = L
6       // coupled factor
7       U^{(1)} ← updateCoupledFactor(CB^{(1)}, 1)
8       // non-coupled factors U
9       foreach mode n ∈ [2, N] do
10          U^{(n)} ← updateFactor(B1^{(n)}, n)
11      // non-coupled factor V
12      V^{(2)} ← updateFactor(B2^{(2)}, 2)
13      Compute L following Equation (5.1)
14  until (PreL − L)/PreL < E
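To make the optimized solver concrete, the following self-contained Scala sketch performs the closed-form row solve of Equations (5.5) and (5.6) by factorizing the small r×r Gram matrix with a Cholesky decomposition and back-substituting. It assumes the Gram matrix is positive definite (the theorems use the pseudo-inverse † for the general case); all names here are illustrative, not the thesis's code.

object RowSolver {
  /** Solve (A^T A + C^T C) x = A^T b + C^T d for one coupled row (Eq. 5.5),
    * via a Cholesky factorization of the small r×r Gram matrix. For a
    * non-coupled row (Eq. 5.6), pass a zero matrix as ctc. */
  def solveCoupledRow(ata: Array[Array[Double]], ctc: Array[Array[Double]],
                      rhs: Array[Double]): Array[Double] = {
    val r = rhs.length
    // G = A^T A + C^T C (assumed symmetric positive definite)
    val g = Array.tabulate(r, r)((i, j) => ata(i)(j) + ctc(i)(j))
    // Cholesky: G = L L^T with L lower triangular
    val l = Array.ofDim[Double](r, r)
    for (i <- 0 until r; j <- 0 to i) {
      var s = g(i)(j)
      for (k <- 0 until j) s -= l(i)(k) * l(j)(k)
      l(i)(j) = if (i == j) math.sqrt(s) else s / l(j)(j)
    }
    // Forward substitution: L y = rhs
    val y = Array.ofDim[Double](r)
    for (i <- 0 until r) {
      var s = rhs(i)
      for (k <- 0 until i) s -= l(i)(k) * y(k)
      y(i) = s / l(i)(i)
    }
    // Back substitution: L^T x = y
    val x = Array.ofDim[Double](r)
    for (i <- r - 1 to 0 by -1) {
      var s = y(i)
      for (k <- i + 1 until r) s -= l(k)(i) * x(k)
      x(i) = s / l(i)(i)
    }
    x
  }
}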
Theorem 3. The computational complexity of Algorithm 4 is

    O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M) (N_k + r) r + T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).
Proof. An N-mode tensor requires finding N factors. A factor is updated in either lines 1-4 of the function updateFactor() or lines 1-6 of updateCoupledFactor(). Lines 1-4 prepare A, compute A^T A and A^T b_i, and perform Cholesky decompositions. Lines 1-6 double the A preparation and the A^T A, A^T b_i computations. Computing A requires (|Ω|/M)(N − 1) r operations, while A^T A and A^T b_i take (|Ω|/M)(r − 1) r each. The Cholesky decomposition of I_n/M matrices in R^{r×r} is O((I_n/M) r³). So updating a factor requires O((|Ω|/M)(N + r) r + (I_n/M) r³); all factors of K tensors take

    O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M)(N_k + r) r + Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).

These steps may iterate T times. Therefore, the computational complexity of Algorithm 4 is

    O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (|Ω_k|/M)(N_k + r) r + T Σ_{k=1}^{K} Σ_{n=1}^{N_k} (I_{kn}/M) r³ ).
Theorem 4. The communication complexity of Algorithm 4 is O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).

Proof. At each iteration, Algorithm 4 broadcasts Σ_{k=1}^{K} N_k factors to M machines (line 4). As broadcasting in Apache Spark is done using the BitTorrent technique, each broadcast to M machines takes O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ). So the total of T broadcasts requires O( T Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).
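As a small illustration of the per-iteration cost that Theorem 4 counts, the following Spark (Scala) sketch re-broadcasts stand-in factor matrices once per iteration; the sizes and the no-op update are placeholders, not the SMF code.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastFactors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("smf-broadcast"))
    val r = 10
    // Stand-in factor matrices: one row block of I_n x r per mode.
    var factors = Array.fill(3)(Array.fill(1000, r)(math.random))
    for (iter <- 1 to 5) {
      // Line 4 of Algorithm 4: ship the current factors to every executor
      // once per iteration (the O(sum I_kn r) term of Theorem 4). Spark's
      // default broadcast is BitTorrent-style (TorrentBroadcast).
      val bcast = sc.broadcast(factors)
      // ... map jobs would read bcast.value locally here ...
      factors = factors.map(identity) // placeholder for the row updates
      bcast.destroy()                 // free executor copies before re-broadcasting
    }
    sc.stop()
  }
}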
Theorem 5. The space complexity of Algorithm 4 is

    O( Σ_{k=1}^{K} |Ω_k| N_k / M + Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).

Proof. Each computing node stores blocks of tensor data and all the factors. Firstly, N_k copies of the |Ω_k|/M observations of the k-th tensor need to be stored on each node, requiring O(|Ω_k| N_k / M). So K tensors take O( Σ_{k=1}^{K} |Ω_k| N_k / M ). Secondly, storing all the factors in each node requires O( Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ). Therefore, the space complexity is O( Σ_{k=1}^{K} |Ω_k| N_k / M + Σ_{k=1}^{K} Σ_{n=1}^{N_k} I_{kn} r ).
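For instance, with |Ω| = 1B observations, N = 3 modes and M = 9 machines (the Synthetic (4) setting on the test cluster of Section 5.3), the first term alone amounts to roughly 3 × 10⁹ / 9 ≈ 3.3 × 10⁸ cached observation copies per node; this is why the duplicated caching described above stays feasible only if nodes can be added as the data grows.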
Function updateFactor(B, n)
Output: U^{(n)}
1   B.map(
2       ⟨i, (i_1, ..., i_N, b_i)⟩ ← B_i
3       A ← U^{(−n)}
4       Compute u^{(n)}_i by Equation (5.6)
5   )
6   Collect the result and merge into U^{(n)}
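A rough Spark (Scala) rendering of updateFactor() is sketched below, reusing the RowSolver sketch from earlier. The block and entry types, the broadcast layout of the factors, and the dense construction of A are all illustrative assumptions; for a coupled mode, the same map would additionally build C and d_i from the B2 part of the coupled block, per Equation (5.5).

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object UpdateFactor {
  type Entry = (Array[Int], Double)            // (coordinates, value), as in CreateRddInput
  type Factors = Array[Array[Array[Double]]]   // factors(m)(index) = row of U^{(m+1)}

  /** One sweep of updateFactor(B, n): each block solves the rows it owns
    * via the closed form (5.6) and the driver merges the collected rows
    * into U^{(n)}. Assumes A^T A is positive definite (enough observed
    * entries per row); the thesis uses a pseudo-inverse in general. */
  def updateFactor(block: RDD[(Int, Iterable[Entry])], n: Int,
                   bcast: Broadcast[Factors], r: Int): Map[Int, Array[Double]] = {
    block.map { case (in, entries) =>
      val factors = bcast.value
      // A: one row per observed entry, the elementwise product of the
      // other factors' rows at that entry's coordinates (CP model).
      val a = entries.map { case (idx, _) =>
        val row = Array.fill(r)(1.0)
        for (m <- factors.indices if m != n - 1; f <- 0 until r)
          row(f) *= factors(m)(idx(m))(f)
        row
      }.toArray
      val b = entries.map(_._2).toArray
      // Normal equations (A^T A) u = A^T b, solved by RowSolver's Cholesky.
      val ata = Array.tabulate(r, r)((p, q) => a.map(row => row(p) * row(q)).sum)
      val atb = Array.tabulate(r)(p => a.zip(b).map { case (row, v) => row(p) * v }.sum)
      val noC = Array.ofDim[Double](r, r) // zero coupled term for a non-coupled mode
      (in, RowSolver.solveCoupledRow(ata, noC, atb))
    }.collect().toMap // "Collect the result and merge into U^{(n)}"
  }
}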
5.2.2 Scaling up to K tensors
The implementation of Algorithm 5 supports K N-mode tensors. In this case, the K
tensors have (K−1) coupled blocks CB^{(1)}, ..., CB^{(K−1)}. The algorithm checks which
mode of the main tensor is the coupled mode and applies the updateCoupledFactor()
function with the corresponding coupled blocks.
Function updateCoupledFactor(CB, n)
Output: U^{(n)}
1   CB.map(
2       ⟨i, (i_1, ..., i_N, b_i)⟩ ← B1_i
3       A ← U^{(−n)}
4       ⟨i, (i_1, j_2, d_i)⟩ ← B2_i
5       C ← V^{(2)}
6       Compute u^{(n)}_i by Equation (5.5)
7   )
8   Collect the result and merge into U^{(n)}
5.3 Performance Evaluation
SMF was implemented in Scala and tested on Apache Spark 1.6.0∗ with the
Yarn scheduler† from Hadoop 2.7.1. The experiments compare the performance
of SMF with existing distributed algorithms to assess the following questions. 1)
How scalable is SMF with respect to the number of observations and the number
of machines? 2) How fast does SMF converge? 3) What level of accuracy does
SMF achieve? and 4) How does the closed form solution perform compared to the
widely used gradient-based methods?
All the experiments were executed on a cluster of 9 nodes, each having 2.8GHz
CPU with 8 cores and 32GB RAM. Since SALS (Shin & Kang 2014) and SCouT
(Jeon et al. 2016) were shown to be significantly better than FlexiFact (Beutel et al.
2014), comparisons with SALS and SCouT are included and the FlexiFact results are
discarded. CMTF-OPT (Acar, Kolda & Dunlavy 2011) was also run on one of the
nodes. Publicly available 22.5M (i.e., 22.5 million observations) movie ratings with
movie genre information in MovieLens (Harper & Konstan 2015), 100M Netflix's
movie ratings‡, and 718M song ratings coupled with song-artist-album information
from the Yahoo! Music dataset§ are used. All ratings are from 0.2 to 1, equivalent
to 1 to 5 stars. When evaluated as a missing value completion (rating recommendation)
problem, about 80% of the observed data was used for training and the rest for
testing. The details of the datasets are summarized in Table 5.1.

∗ Apache Spark: http://spark.apache.org/
† Yarn scheduler: https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
‡ Netflix's movie ratings dataset: http://www.netflixprize.com/
§ Yahoo! Research Webscope's Music User Ratings of Musical Artists datasets: http://research.yahoo.com/

Algorithm 5: SMF for K tensors, where the first mode of X_1 is coupled with the first mode of X_2, ..., and the (K−1)-th mode of X_1 is joined with the first mode of X_K
Input : cached CB^{(1)}, ..., CB^{(K−1)}, B1^{(2)}, ..., B1^{(N_1)}, ..., BK^{(2)}, ..., BK^{(N_K)}, E
Output: U1^{(1)}, ..., U1^{(N_1)}, ..., UK^{(2)}, ..., UK^{(N_K)}
1   Initialize L by a small number
2   Randomly initialize all factors
3   repeat
4       Broadcast all factors
5       PreL = L
6       foreach tensor k ∈ [1, K] do
7           if k is the main tensor then
8               foreach mode n ∈ [1, N_k] do
9                   if n is a coupled mode then
10                      Uk^{(n)} ← updateCoupledFactor(CB^{(n)}, n)
11                  else
12                      Uk^{(n)} ← updateFactor(Bk^{(n)}, n)
13          else
14              foreach mode n ∈ [2, N_k] do
15                  Uk^{(n)} ← updateFactor(Bk^{(n)}, n)
16      Compute L following Equation (5.1)
17  until (PreL − L)/PreL < E
Table 5.1 : Data for experiments

Dataset         Tensor   |Ω|train   |Ω|test   I1        I2          I3
MovieLens       X        18M        4.5M      34,208    247,753     7
                Y        649K       -         34,208    19          -
Netflix         X        80M        20M       480,189   17,770      2,182
Yahoo! Music    Y        700M       18M       136,736   1,823,179   -
                X        136K       -         136,736   20,543      9,442
Synthetic (1)   X        1M         -         100K      100K        100K
                Y        100        -         100K      100K        -
Synthetic (2)   X        10M        -         100K      100K        100K
                Y        1K         -         100K      100K        -
Synthetic (3)   X        100M       -         100K      100K        100K
                Y        10K        -         100K      100K        -
Synthetic (4)   X        1B         -         100K      100K        100K
                Y        100K       -         100K      100K        -
5.3.1 Scalability
To validate the scalability of SMF, four synthetic datasets are generated with
different observation densities as summarized in Table 5.1. The scalability of SMF is
measured with respect to the number of observations and machines.
A. Observation scalability
Figure 5.3 compares the observation scalability of SALS, CMTF-OPT and SMF (in
the case of TF of the main tensor X; Figure 5.3a) and of SCouT, CMTF-OPT and
SMF (for CMTF of the tensor X and the additional matrix Y; Figure 5.3b). As
shown in Figure 5.3a, the performance of SALS is similar to SMF's when the number
of observations is between 1M and 10M. However, SALS performs worse as the
observed data size becomes larger. Specifically, when the observed data increases
10x (i.e., 10 times) from 100M to 1B, SALS's running time per iteration grows
10.36x, 151% of SMF's slowdown rate. As for CMTF, SMF significantly outperforms
SCouT, being 73x faster. CMTF-OPT achieves similar performance on the 1M
dataset, but it runs out of memory when dealing with the larger datasets.

Figure 5.3 : Observation scalability. SMF scales up better as the number of known data increases. In (a) only X is factorized; SMF is 4.17x and 2.76x faster than SALS in the case of 1B and 100M observations, respectively. In (b) both X and Y are jointly analyzed; SMF consistently outperforms SCouT at a rate of over 70x in all test cases. In both cases, CMTF-OPT runs out of memory for datasets larger than 1M.
B. Machine scalability
The increase in the speed of each algorithm as more computational power is
added to the cluster is measured. Synthetic (3) with 100M observations is used in
this test. The speedup rate is calculated by normalizing the time T_3 each algorithm
takes on three machines by its time T_M on M machines, i.e., speedup(M) = T_3 / T_M
(in this test M is 6 and 9). In general, SMF speeds up at a rate similar to SALS
and at a much higher rate than SCouT.
5.3.2 Convergence Speed
This section investigates how fast SMF converges, benchmarking it against both
SALS and SCouT on the three real-world datasets. As observed in Figures 5.5,
5.6 and 5.7, as the tensor size grows from MovieLens (Figure 5.5) to Netflix
(Figure 5.6) and to Yahoo! Music (Figure 5.7), the advantages of SMF over SALS
increase. SMF eliminates all data streaming from local disk to memory, which
improves its efficiency significantly, especially for large-scale data. Specifically,
SMF outperforms SALS by 3.8x in the case of 700M observations (Yahoo! Music)
and by 2x for 80M observations (Netflix). This result, in combination with the fact
that it is 4.17x faster than SALS on the 1B synthetic dataset, strongly suggests
that SMF is the fastest tensor factorization algorithm for large-scale datasets.

Figure 5.4 : Machine scalability with the 100M synthetic dataset. In (a) only X is factorized. In (b) both X and Y are jointly analyzed. SMF speeds up at a rate similar to SALS and at a much higher rate than SCouT.

Figure 5.5 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in MovieLens.
While Figures 5.5, 5.6 and 5.7 show single tensor factorization results, Figures
5.8 and 5.9 provide further empirical evidence that SMF is able to perform
lightning-fast coupled matrix tensor factorization for the joint analysis of
heterogeneous datasets. In this case, only MovieLens and Yahoo! Music are used,
as the Netflix dataset does not have side information. SMF surpasses SCouT, the
currently fastest CMTF algorithm, by 17.8x on the Yahoo! Music dataset.

Figure 5.6 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Netflix.

Figure 5.7 : Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Yahoo! Music.

Figure 5.8 : Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in MovieLens.

Figure 5.9 : Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in Yahoo! Music.

Table 5.2 : Accuracy of each algorithm on the real-world datasets. Decomposed factors are used to predict missing entries. The accuracy is measured with RMSE on the test sets.

Algorithm   TF                                   CMTF
            MovieLens   Netflix   Yahoo          MovieLens   Yahoo
SALS        0.1695      0.1751    0.2396         -           -
SCouT       -           -         -              0.7110      0.7365
SMF         0.1685      0.1749    0.2352         0.1676      0.2349
5.3.3 Accuracy
In addition to having the fastest convergence speed, SMF also recovers missing
entries with the highest accuracy. Table 5.2 lists all the prediction results on the test
sets. Note that SALS does not support CMTF and SCouT does not support TF. In
this test, SALS is almost as good as SMF for missing entry recovery while SCouT
performs much worse. This also shows the power of using coupled information in
the factorization processes.
5.3.4 Optimization
Different optimizers are benchmarked in this section. Instead of computing each
row of a factor in closed form, as in line 4 of updateFactor() or line 6 of
updateCoupledFactor(), two gradient-based optimizers are used to update it:
nonlinear conjugate gradient with the Moré-Thuente line search (Moré & Thuente
1994), called SMF-NCG, and gradient descent, called SMF-GD. All the optimizers
stop when ε < 10⁻⁴.
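For contrast with the closed form, a gradient-descent row update in the spirit of the SMF-GD variant might look like the following Scala sketch; the fixed learning rate, tolerance and iteration cap are assumed values, not the thesis's settings.

object GdRowUpdate {
  /** Gradient-descent solve of min_u ||A u - b||^2 given precomputed
    * A^T A and A^T b: a sketch of a gradient-based row update. */
  def solveRow(ata: Array[Array[Double]], atb: Array[Double],
               eta: Double = 1e-3, eps: Double = 1e-4,
               maxIters: Int = 100000): Array[Double] = {
    val r = atb.length
    val u = Array.fill(r)(0.0)
    var it = 0
    var gradNorm = Double.MaxValue
    while (gradNorm > eps && it < maxIters) {
      // gradient of ||Au - b||^2 is 2 (A^T A u - A^T b)
      val grad = Array.tabulate(r) { p =>
        2.0 * ((0 until r).map(q => ata(p)(q) * u(q)).sum - atb(p))
      }
      for (p <- 0 until r) u(p) -= eta * grad(p)
      gradNorm = math.sqrt(grad.map(g => g * g).sum)
      it += 1
    }
    u
  }
}

Unlike the one-shot Cholesky solve, this inner loop must itself iterate to the tolerance, which is consistent with the slower convergence observed for SMF-GD below.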
Figure 5.10 : Benchmark of different optimization methods for (a) MovieLens, (b) Netflix and (c) Yahoo! Music datasets. In all cases, SMF-CF quickly reaches the lowest RMSE.
These results are compared with the closed form solution (hereafter called SMF-CF),
as displayed in Figure 5.10. The results demonstrate that the closed form solution
generally converges to the lowest RMSE faster than conventional gradient-based
optimizers. The results in Table 5.3 also confirm that SMF-CF achieves the highest
precision on the test sets.

Table 5.3 : Accuracy of predicting missing entries on real-world datasets with different optimizers. SMF's optimized solver (SMF-CF) achieves the lowest tested RMSE under the same stopping condition.

Optimizer   MovieLens   Netflix   Yahoo
SMF-CF      0.1672      0.1749    0.2352
SMF-NCG     0.1783      0.1904    0.2368
SMF-GD      0.1716      0.1765    0.2387
5.4 Contribution and Summary
This chapter addresses contribution #4 of this thesis by proposing a novel scalable
factorization algorithm and demonstrating its high impact on the cross-domain
dataset factorization problem. Given large-scale datasets, existing distributed
algorithms for the joint analysis of multi-dimensional data generated from multiple
sources decompose them on several computing nodes following the MapReduce
paradigm. Improving the performance of MapReduce-based factorization algorithms
as the observed data grows larger is a prerequisite for any cross-domain learning
from big data. This setting requires an even more efficient solution that not only
reduces communication overhead but also optimizes the factors faster.
This chapter introduces the SMF algorithm to analyze coupled big datasets across
domains. It has two key features that enable large-scale factorization of the coupled
datasets. Firstly, the SMF design, based on Apache Spark, eliminates the huge
data transmission overhead, giving it the smallest communication cost. Secondly,
its optimized solver converges faster. This design, together with the optimized
solver, stabilizes the algorithm's performance as the data size increases. As a result,
SMF remains exceptionally efficient as the data grows.
Extensive experiments show that SMF scales the best, compared to the other
existing methods, with respect to the number of tensors, observations, and ma-
chines. They also demonstrate SMF’s effectiveness in terms of convergence speed
and accuracy on real-world datasets. When dealing with one billion observed entries,
SMF outperforms the currently fastest coupled matrix tensor factorization and ten-
sor factorization by 17.8 and 3.8 times, respectively. Compellingly, SMF achieves
this speed with the highest accuracy. All these advantages suggest SMF design
principles should be the building blocks for large-scale multimodal factorization
methods.
Chapter 6
Conclusion
E-commerce businesses are increasingly dependent on recommendation systems to
introduce personalized services and products to targeted customers. Providing useful
recommendations requires sufficient knowledge about user preferences and product
(item) characteristics. Given the current abundance of available data across do-
mains, achieving a thorough understanding of the relationship between users and
items can result in more collaborative filtering power and lead to higher recommen-
dation accuracy.
However, how to effectively utilize different types of knowledge obtained across
domains is still a challenging problem for the data mining research community. Cur-
rent research in cross-domain recommendation only uses explicit similarities across
domains to improve recommendation accuracy. Moreover, all the existing algorithms
assume coupled datasets across domains share identical coupled factors, losing the
capability to capture the actual explicit similarities among them.
The joint analysis of datasets from multiple domains can provide additional in-
sights into the coupled dimension. Nevertheless, performing this joint factorization
of the coupled cross-domain datasets often incurs very heavy costs in terms of
computation, communication, and storage if the proposed methods are not well designed.
One critical weakness of MapReduce-based algorithms is that whenever a node needs
data to process, the data must be transferred from an isolated distributed file system
to the node. The iterative nature of tensor factorization requires data and factors to
be distributed over and over again, incurring enormous communication overhead.
6.1 Research questions and contributions
Motivated by these gaps, this thesis proposes several ideas to improve the
performance of cross-domain recommendation. Chapter 2 discusses the main factors
that limit the performance of the existing algorithms, as summarized below.
• Existing joint factorization models assume coupled datasets share identical
coupled factors, failing to capture the actual explicit similarities across do-
mains.
• Current algorithms use only explicit similarities for joint analysis or transfer
learning from cross-domain coupled datasets. They fail to exploit the implicit
similarities among them.
• MapReduce-based distributed factorization algorithms incur a huge communi-
cation overhead, reducing their effectiveness in handling large-scale datasets.
To overcome these problems, this research investigates the following questions:
Q1. Is it possible to propose appropriate methods of sharing explicit similarities
between cross-domain datasets to understand their actual relationship?
This question is answered by proposing a new objective function to better
utilize explicit similarities in chapter 3.
Q2. How to share implicit similarities in non-coupled dimensions across domains to
improve recommendation accuracy?
This question is answered by proposing a novel algorithm to discover and ex-
ploit implicit similarities in section 4.2.2. A method to combine both implicit
and explicit similarities is proposed in section 4.2.1 to improve recommenda-
tion performance.
Q3. How to improve the scalability of the factorization process such that it is able
to scale up to a different number of coupled tensors, tensor modes, tensor
dimensions and billions of observations?
A scalable factorization model is introduced to answer this question in chapter
5. The proposed model significantly reduces communication overhead, scaling
up better compared to existing distributed algorithms.
By investigating these research questions, this thesis makes four knowledge con-
tributions, summarized below.
Contribution #1. A new objective function to enable each dataset to
have its discriminative factor on the coupled mode, capturing the actual
explicit similarities across domains
Section 3.2 describes the proposed objective function to optimize with respect to
every single tensor and matrix. Differing from algorithms with a traditional objective
function which forces shared modes among tensors to have identical factors, the
proposed method enables each tensor to have its own discriminative factor on the
coupled mode and regularizes them to be as close as possible. Thus, it is capable of
sharing the accurate explicit similarities among them, improving recommendation
accuracy. In addition, a theoretical proof and experimental evidence confirm that
the algorithm converges to an optimum. Experiments on both real and synthetic
datasets demonstrate that the proposed algorithm outperforms the current state-of-
the-art algorithms in predicting missing entries for recommendations.
Contribution #2. A novel algorithm to discover implicit similarities
in non-coupled mode and align them across domains
Section 4.2.2 explains a method to discover implicit similarities from latent fac-
tors across domains based on matrix tri-factorization. The method captures the
implicit similarities on the non-coupled dimension. To this end, it uses the middle
matrix of the tri-factorization to match the unique factors. Based on the identified
matches, it aligns the matched unique factors to transfer cross-domain implicit sim-
ilarities. The empirical results demonstrate that these implicit similarities provide
other insights into the underlying structure. Thus, utilizing them effectively helps
to improve cross-domain recommendations.
Contribution #3. A matrix factorization-based model to utilize both
explicit and implicit similarities for cross-domain recommendation accu-
racy improvement
Section 4.2.1 presents a different way to share explicit similarities across do-
mains, and the use of both similarities improves recommendation performance. This
is based on the fact that coupled datasets which share the same coupled dimension
indicate a strong correlation in relation to the coupled factors. Nevertheless, they
also have their unique features characterized by their domain. For this reason,
both common and unique parts are included in the coupled factors to better cap-
ture explicit similarities among different datasets. Moreover, the proposed method
also utilizes the implicit similarities on the non-coupled dimension. This research
is the first to propose the transfer of both explicit and implicit knowledge in cou-
pled and non-coupled dimensions and thus further improves the recommendation.
Validated on real-world datasets, the proposed approach outperforms the state-of-the-art algorithms by more than two times in terms of recommendation accuracy.
These empirical results confirm the potential of utilizing both explicit and implicit
similarities for making cross-domain recommendations.
Contribution #4. A scalable factorization model based on the Spark
framework to scale up the factorization process to the number of tensors,
tensor modes, tensor dimensions and billions of observations
Section 5.2 describes a scalable factorization method to improve the performance
of MapReduce-based algorithms as the observed data becomes larger. As data grows,
it requires an even more efficient solution, especially for reducing the communication
overhead. The proposed distributed lightning-fast and scalable algorithm incurs the
smallest communication overhead compared to all methods proposed in the litera-
ture. It is equipped with an optimized solver which reduces the overall time complex-
ity. These key features stabilize the proposed method’s performance when the data
size increases, as confirmed by experiments with 1 billion known entries. Vali-
dated on real-world datasets, the proposed method outperforms the state-of-the-art
distributed tensor factorization and coupled matrix tensor factorization algorithms
by 3.8 and 17.8 times, respectively. Furthermore, the more interesting observation
is that it achieves this fast decomposition with the highest accuracy on the test sets.
6.2 Future research directions
This section discusses a few research plans to extend the proposed methods in
this thesis and to overcome their limitations.
6.2.1 Investigating explicit and implicit similarities in imbalanced datasets
The proposed methods in this thesis discover and use both explicit and implicit
similarities in coupled datasets with balanced ratings. However, there are many
cases where the ratings are imbalanced. For example, people are likely to provide
feedback once they have had a negative experience with a product or service. Thus,
the number of negative ratings may be much higher than the number of positive ones.
Investigating this imbalance issue by extending the proposed methods will be an
essential contribution.
6.2.2 Extending the use of explicit and implicit similarities to high di-
mensional tensors
The proposed methods using explicit and implicit similarities across domains
work with rating matrices. As data is currently being generated at an unprecedented
speed, a growing number of high dimensional tensors are becoming available. The joint analysis
of these high dimensional tensors will help to provide a thorough understanding
of the underlying structure, thus improving recommendation accuracy. Discovering
the explicit and implicit similarities between them is the first step. Designing an
algorithm to extend the methods proposed in this thesis which can handle high
dimensional tensors will significantly enrich the data mining research community.
6.2.3 Extending the proposed factorization model to handle online rat-
ings
This thesis presents a scalable factorization method to scale up the factorization
process on offline coupled datasets. For e-commerce websites, ratings are provided in
real time. Updating the factors with online ratings as they arrive will help achieve
accurate recommendations that reflect the new ratings entering the system. Developing
a scalable online factorization method will benefit an increasing number of businesses.
6.2.4 Investigating the use of explicit and implicit similarities in Factor-
ization Machines
Factorization machines (FMs) (Rendle 2010) are more general models than MF.
Instead of using user and item indexes as in MF, FMs deal with both user and item
features. It will be interesting to analyze the user and item features to discover the
explicit and implicit similarities among these features. Doing so would provide the
data mining community with new factorization machines that can handle coupled
datasets.
6.3 Conclusion
In summary, this thesis enriches the data mining research community by propos-
ing algorithms to effectively utilize different types of knowledge obtained across
domains. Also, it improves the scalability of the factorization process by developing
a factorization model capable of scaling up to a different number of tensors, ten-
sor modes, tensor dimensions and billions of observations. The proposed methods
are thus applicable to discover explicit and implicit similarities across large-scale
datasets for cross-domain recommendations.
Bibliography
Acar, E., Dunlavy, D. M., Kolda, T. G. & Mørup, M. (2011), 'Scalable tensor factor-
izations for incomplete data', Chemometrics and Intelligent Laboratory Systems
106(1), 41–56. Multiway and Multiset Data Analysis.
Acar, E., Kolda, T. G. & Dunlavy, D. M. (2011), All-at-once optimization for cou-
pled matrix and tensor factorizations, in 'KDD Workshop on Mining and Learning
with Graphs (arXiv:1105.3422v1)’.
Baltrušaitis, T., Ahuja, C. & Morency, L. (2018), 'Multimodal machine learning:
A survey and taxonomy’, IEEE Transactions on Pattern Analysis and Machine
Intelligence pp. 1–1.
Bell, R. M. & Koren, Y. (2007), ‘Lessons from the netflix prize challenge’, SIGKDD
Explor. Newsl. 9(2), 75–79.
Beutel, A., Talukdar, P. P., Kumar, A., Faloutsos, C., Papalexakis, E. E. & Xing,
E. P. (2014), Flexifact: Scalable flexible factorization of coupled tensors on
hadoop, in ‘Proceedings of the SIAM International Conference on Data Mining’,
SDM’14, pp. 109–117.
Bhargava, P., Phan, T., Zhou, J. & Lee, J. (2015), Who, what, when, and where:
Multi-dimensional collaborative recommendations using tensor factorization on
sparse user-generated data, in ‘Proceedings of the 24th International Conference
on World Wide Web’, WWW’15, pp. 130–140.
Boyd, S. & Vandenberghe, L. (2004), Convex optimization, Cambridge University
Press.
Chen, B., Li, F., Chen, S., Hu, R. & Chen, L. (2017), ‘Link prediction based on
non-negative matrix factorization’, PLOS ONE 12, 1–18.
Chen, W., Hsu, W. & Lee, M. L. (2013), Making recommendations from multiple
domains, in ‘Proceedings of the 19th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining’, KDD’13, pp. 892–900.
Diao, Q., Qiu, M., Wu, C.-Y., Smola, A. J., Jiang, J. & Wang, C. (2014), Jointly
modeling aspects, ratings and sentiments for movie recommendation (jmars), in
‘Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD’14, pp. 193–202.
Ding, C., Li, T., Peng, W. & Park, H. (2006), Orthogonal nonnegative matrix t-
factorizations for clustering, in ‘Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining’, KDD’06, pp. 126–
135.
Dunlavy, D. M., Kolda, T. G. & Acar, E. (2011), ‘Temporal link prediction using
matrix and tensor factorizations’, ACM Trans. Knowl. Discov. Data 5(2), 10:1–
10:27.
Ekstrand, M. D., Riedl, J. T. & Konstan, J. A. (2011), ‘Collaborative filtering
recommender systems’, Found. Trends Hum.-Comput. Interact. 4(2), 81–173.
Elkahky, A. M., Song, Y. & He, X. (2015), A multi-view deep learning approach for
cross domain user modeling in recommendation systems, in ‘Proceedings of the
24th International Conference on World Wide Web’, WWW ’15, pp. 278–288.
Ermis, B., Acar, E. & Cemgil, A. T. (2015), ‘Link prediction in heterogeneous
data via generalized coupled tensor factorization’, Data Min. Knowl. Discov.
29(1), 203–236.
Fang, X. & Pan, R. (2014), Fast dtt: A near linear algorithm for decomposing a
tensor into factor tensors, in ‘Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining’, KDD’14, pp. 967–976.
Farias, R. C., Cohen, J. E. & Comon, P. (2016), ‘Exploring multimodal data fu-
sion through joint decompositions with flexible couplings’, IEEE Transactions on
Signal Processing 64(18), 4830–4844.
Gao, S., Denoyer, L. & Gallinari, P. (2012), Link prediction via latent factor block-
model, in ‘Proceedings of the 21st International Conference on World Wide Web’,
WWW’12, pp. 507–508.
Gao, S., Luo, H., Chen, D., Li, S., Gallinari, P. & Guo, J. (2013), Cross-domain
recommendation via cluster-level latent factor model, in ‘Proceedings, Part II,
of the European Conference on Machine Learning and Knowledge Discovery in
Databases - Volume 8189’, ECML PKDD’13, pp. 161–176.
Gemulla, R., Nijkamp, E., Haas, P. J. & Sismanis, Y. (2011), Large-scale matrix
factorization with distributed stochastic gradient descent, in ‘Proceedings of the
17th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining’, KDD’11, pp. 69–77.
Harper, F. M. & Konstan, J. A. (2015), ‘The movielens datasets: History and
context’, ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19.
Harshman, R. A. (1970), ‘Foundations of the parafac procedure: Models and con-
ditions for an "explanatory" multi-modal factor analysis', UCLA working papers
in phonetics 16, 1–84.
He, R. & McAuley, J. (2016), Ups and downs: Modeling the visual evolution of
fashion trends with one-class collaborative filtering, in ‘Proceedings of the 25th
International Conference on World Wide Web’, WWW’16, pp. 507–517.
He, X., Liao, L., Zhang, H., Nie, L., Hu, X. & Chua, T.-S. (2017), Neural collab-
orative filtering, in ‘Proceedings of the 26th International Conference on World
Wide Web’, WWW ’17, pp. 173–182.
He, X., Zhang, H., Kan, M.-Y. & Chua, T.-S. (2016), Fast matrix factorization for
online recommendation with implicit feedback, in ‘Proceedings of the 39th Inter-
national ACM SIGIR Conference on Research and Development in Information
Retrieval’, SIGIR’16, pp. 549–558.
Hotelling, H. (1931), 'The generalization of Student's ratio', The Annals of Mathe-
matical Statistics.
Hsu, C., Yeh, M. & Lin, S. (2018), ‘A general framework for implicit and explicit
social recommendation’, IEEE Transactions on Knowledge and Data Engineering
pp. 1–1.
Hu, L., Cao, J., Xu, G., Cao, L., Gu, Z. & Zhu, C. (2013), Personalized recom-
mendation via cross-domain triadic factorization, in 'Proceedings of the 22nd
International Conference on World Wide Web’, WWW’13, pp. 595–606.
Hu, Y., Koren, Y. & Volinsky, C. (2008), Collaborative filtering for implicit feedback
datasets, in ‘Proceedings of the 2008 Eighth IEEE International Conference on
Data Mining’, ICDM’08, pp. 263–272.
Huang, H., Ding, C., Luo, D. & Li, T. (2008), Simultaneous tensor subspace selection
and clustering: The equivalence of high order svd and k-means clustering, in
‘Proceedings of the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD ’08, pp. 327–335.
Itskov, M. (2009), Tensor Algebra and Tensor Analysis for Engineers: With Appli-
cations to Continuum Mechanics, 2nd edn, Springer Publishing Company, Incor-
porated.
Iwata, T. & Koh, T. (2015), Cross-domain recommendation without shared users
or items by sharing latent vector distributions, in ‘Proceedings of the Eighteenth
International Conference on Artificial Intelligence and Statistics’, Vol. 38 of Pro-
ceedings of Machine Learning Research, pp. 379–387.
Jeon, B., Jeon, I., Sael, L. & Kang, U. (2016), Scout: Scalable coupled matrix-
tensor factorization - algorithm and discoveries, in ‘2016 IEEE 32nd International
Conference on Data Engineering’, ICDE’16, pp. 811–822.
Jiang, M., Cui, P., Chen, X., Wang, F., Zhu, W. & Yang, S. (2015), ‘Social recom-
mendation with cross-domain transferable knowledge’, IEEE Trans. on Knowl.
and Data Eng. 27(11), 3084–3097.
Jiang, M., Cui, P., Wang, F., Xu, X., Zhu, W. & Yang, S. (2014), Fema: Flexible
evolutionary multi-faceted analysis for dynamic behavioral pattern discovery, in
‘Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD ’14, pp. 1186–1195.
Jing, H., Liang, A., Lin, S. & Tsao, Y. (2014), A transfer probabilistic collective
factorization model to handle sparse data in collaborative filtering, in ‘2014 IEEE
International Conference on Data Mining (ICDM)', ICDM'14, pp. 250–
259.
Kang, U., Papalexakis, E., Harpale, A. & Faloutsos, C. (2012), Gigatensor: Scaling
tensor analysis up by 100 times - algorithms and discoveries, in ‘Proceedings of
the 18th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining’, KDD’12, pp. 316–324.
Karatzoglou, A., Amatriain, X., Baltrunas, L. & Oliver, N. (2010), Multiverse rec-
ommendation: N-dimensional tensor factorization for context-aware collaborative
filtering, in ‘Proceedings of the Fourth ACM Conference on Recommender Sys-
tems’, RecSys’10, pp. 79–86.
Karatzoglou, A. & Hidasi, B. (2017), Deep learning for recommender systems, in
‘Proceedings of the Eleventh ACM Conference on Recommender Systems’, RecSys
’17, pp. 396–397.
Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. (2006),
Learning systems of concepts with an infinite relational model, in ‘Proceedings
of the 21st National Conference on Artificial Intelligence - Volume 1’, AAAI’06,
AAAI Press, pp. 381–388.
Kiraly, F. J., Theran, L. & Tomioka, R. (2015), ‘The algebraic combinatorial ap-
proach for low-rank matrix completion’, J. Mach. Learn. Res. 16(1), 1391–1436.
Kolda, T. G. & Bader, B. W. (2009), ‘Tensor decompositions and applications’,
SIAM Rev. 51(3), 455–500.
Konstan, J. A. (2004), ‘Introduction to recommender systems: Algorithms and eval-
uation’, ACM Trans. Inf. Syst. 22(1), 1–4.
Koren, Y. (2009), Collaborative filtering with temporal dynamics, in ‘Proceedings
of the 15th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining’, KDD’09, pp. 447–456.
Koren, Y. & Bell, R. (2011), Advances in Collaborative Filtering, Springer US,
pp. 145–186.
Koren, Y., Bell, R. & Volinsky, C. (2009), ‘Matrix factorization techniques for rec-
ommender systems’, Computer 42(8), 30–37.
Lahat, D., Adali, T. & Jutten, C. (2015a), ‘Multimodal data fusion: An overview of
methods, challenges, and prospects’, Proceedings of the IEEE 103(9), 1449–1477.
Lahat, D., Adali, T. & Jutten, C. (2015b), ‘Multimodal data fusion: An overview of
methods, challenges, and prospects’, Proceedings of the IEEE 103(9), 1449–1477.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B. & Ng, A. Y. (2011), On
optimization methods for deep learning, in ‘Proceedings of the 28th International
Conference on International Conference on Machine Learning’, ICML’11, pp. 265–
272.
Lee, D. D. & Seung, H. S. (2000), Algorithms for non-negative matrix factoriza-
tion, in ‘Proceedings of the 13th International Conference on Neural Information
Processing Systems’, NIPS’00, pp. 535–541.
Li, B., Yang, Q. & Xue, X. (2009a), Can movies and books collaborate?: Cross-
domain collaborative filtering for sparsity reduction, in ‘Proceedings of the 21st
International Jont Conference on Artifical Intelligence’, IJCAI’09, pp. 2052–2057.
Li, B., Yang, Q. & Xue, X. (2009b), Transfer learning for collaborative filtering via a
rating-matrix generative model, in ‘Proceedings of the 26th Annual International
Conference on Machine Learning’, ICML’09, pp. 617–624.
Li, C.-Y. & Lin, S.-D. (2014), Matching users and items across domains to im-
prove the recommendation quality, in ‘Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining’, KDD’14,
pp. 801–810.
Liavas, A. P. & Sidiropoulos, N. D. (2015), ‘Parallel algorithms for constrained tensor
factorization via alternating direction method of multipliers’, IEEE Transactions
on Signal Processing 63(20), 5450–5463.
Lin, Y.-R., Sun, J., Castro, P., Konuru, R., Sundaram, H. & Kelliher, A. (2009),
Metafac: Community discovery via relational hypergraph factorization, in ‘Pro-
ceedings of the 15th ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining’, KDD’09, pp. 527–536.
Liu, W., Chan, J., Bailey, J., Leckie, C. & Ramamohanarao, K. (2013), Mining
labelled tensors by discovering both their common and discriminative subspaces,
in ‘Proceedings of the 2013 SIAM International Conference on Data Mining’,
pp. 614–622.
Liu, W., Kan, A., Chan, J., Bailey, J., Leckie, C., Pei, J. & Kotagiri, R. (2012),
On compressing weighted time-evolving graphs, in ‘Proceedings of the 21st ACM
International Conference on Information and Knowledge Management’, CIKM’12,
pp. 2319–2322.
Liu, Y.-F., Hsu, C.-Y. & Wu, S.-H. (2015), Non-linear cross-domain collaborative fil-
tering via hyper-structure transfer, in ‘Proceedings of the 32Nd International Con-
ference on International Conference on Machine Learning - Volume 37’, ICML’15,
pp. 1190–1198.
Liu, Y. & Shang, F. (2013), ‘An efficient matrix factorization method for tensor
completion’, IEEE Signal Processing Letters 20(4), 307–310.
Loni, B., Shi, Y., Larson, M. & Hanjalic, A. (2014), Cross-domain collaborative
filtering with factorization machines, in ‘Proceedings of the 36th European Con-
ference on IR Research on Advances in Information Retrieval - Volume 8416’,
ECIR 2014, pp. 656–661.
Lops, P., de Gemmis, M. & Semeraro, G. (2011), Content-based Recommender Sys-
tems: State of the Art and Trends, Springer US, pp. 73–105.
Menon, A. K. & Elkan, C. (2011), Link prediction via matrix factorization, in ‘Pro-
ceedings of the 2011 European Conference on Machine Learning and Knowledge
Discovery in Databases - Volume Part II’, ECML PKDD’11, pp. 437–452.
Moré, J. J. & Thuente, D. J. (1994), 'Line search algorithms with guaranteed suffi-
cient decrease’, ACM Trans. Math. Softw. 20(3), 286–307.
Moreno, O., Shapira, B., Rokach, L. & Shani, G. (2012), Talmud: Transfer learning
for multiple domains, in ‘Proceedings of the 21st ACM International Conference
on Information and Knowledge Management’, CIKM’12, pp. 425–434.
Pan, W. (2016), ‘A survey of transfer learning for collaborative recommendation
with auxiliary data’, Neurocomput. 177(C), 447–453.
Pan, W., Liu, N. N., Xiang, E. W. & Yang, Q. (2011), Transfer learning to predict
missing ratings via heterogeneous user feedbacks, in ‘Proceedings of the Twenty-
Second International Joint Conference on Artificial Intelligence - Volume Volume
Three’, IJCAI’11, pp. 2318–2323.
Pan, W., Xiang, E. W., Liu, N. N. & Yang, Q. (2010), Transfer learning in collabo-
rative filtering for sparsity reduction, in ‘Proceedings of the Twenty-Fourth AAAI
Conference on Artificial Intelligence’, AAAI’10, pp. 230–235.
Papalexakis, E. E., Faloutsos, C., Mitchell, T. M., Talukdar, P. P., Sidiropoulos,
N. D. & Murphy, B. (2014), Turbo-smt: Accelerating coupled sparse matrix-
tensor factorizations by 200x, in ‘Proceedings of the 2014 SIAM International
Conference on Data Mining’, SDM’14, pp. 118–126.
Papalexakis, E. E., Faloutsos, C. & Sidiropoulos, N. D. (2012), Parcube: Sparse
parallelizable tensor decompositions, in ‘Proceedings of the 2012 European Con-
ference on Machine Learning and Knowledge Discovery in Databases’, ECML
PKDD’12, pp. 521–536.
Papalexakis, E. E., Sidiropoulos, N. D. & Bro, R. (2013), ‘From k-means to higher-
way co-clustering: Multilinear decomposition with sparse latent factors’, Trans.
Sig. Proc. 61(2), 493–506.
Park, N., Oh, S. & Kang, U. (2017), Fast and scalable distributed boolean tensor
factorization, in ‘2017 IEEE 33rd International Conference on Data Engineering
(ICDE)’, ICDE’17, pp. 1071–1082.
Pazzani, M. J. & Billsus, D. (2007), The adaptive web, Springer-Verlag, Berlin,
Heidelberg, chapter Content-based Recommendation Systems, pp. 325–341.
Perozzi, B., Schueppert, M., Saalweachter, J. & Thakur, M. (2016), When recom-
mendation goes wrong: Anomalous link discovery in recommendation networks, in
'Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining’, KDD’16, pp. 569–578.
Rendle, S. (2010), Factorization machines, in ‘Proceedings of the 2010 IEEE Inter-
national Conference on Data Mining’, ICDM’10, pp. 995–1000.
Rennie, J. D. M. & Srebro, N. (2005), Fast maximum margin matrix factorization
for collaborative prediction, in 'Proceedings of the 22nd International Conference
on Machine Learning’, ICML’05, pp. 713–719.
Resnick, P. & Varian, H. R. (1997), ‘Recommender systems’, Commun. ACM
40(3), 56–58.
Sachan, M. & Srivastava, S. (2013), Collective matrix factorization for co-clustering,
in 'Proceedings of the 22nd International Conference on World Wide Web',
WWW’13, pp. 93–94.
Sael, L., Jeon, I. & Kang, U. (2015), ‘Scalable tensor mining’, Big Data Res. 2(2), 82–
86.
Schafer, J. B., Frankowski, D., Herlocker, J. & Sen, S. (2007), The adaptive
web, Springer-Verlag, Berlin, Heidelberg, chapter Collaborative Filtering Rec-
ommender Systems, pp. 291–324.
Shi, J., Qiu, Y., Minhas, U. F., Jiao, L., Wang, C., Reinwald, B. & Özcan, F.
(2015), ‘Clash of the titans: Mapreduce vs. spark for large scale data analytics’,
Proc. VLDB Endow. 8(13), 2110–2121.
Shi, Y., Larson, M. & Hanjalic, A. (2013), ‘Mining contextual movie similarity with
matrix factorization for context-aware recommendation’, ACM Trans. Intell. Syst.
Technol. 4(1), 16:1–16:19.
Shin, K. & Kang, U. (2014), Distributed methods for high-dimensional and large-
scale tensor factorization, in ‘2014 IEEE International Conference on Data Min-
ing’, ICDM’14, pp. 989–994.
Shin, K., Sael, L. & Kang, U. (2017), ‘Fully scalable methods for distributed tensor
factorization’, IEEE Trans. on Knowl. and Data Eng. 29(1), 100–113.
Singh, A. P. & Gordon, G. J. (2008), Relational learning via collective matrix fac-
torization, in ‘Proceedings of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining’, KDD’08, pp. 650–658.
Sun, Z., Li, T. & Rishe, N. (2010), Large-scale matrix factorization using mapreduce,
in ‘2010 IEEE International Conference on Data Mining Workshops’, pp. 1242–
1248.
Tan, S., Bu, J., Qin, X., Chen, C. & Cai, D. (2014), ‘Cross domain recommendation
based on multi-type media fusion’, Neurocomput. 127.
Tang, J., Wu, S., Sun, J. & Su, H. (2012), Cross-domain collaboration recommen-
dation, in ‘Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining’, KDD’12, pp. 1285–1293.
Töscher, A., Jahrer, M. & Legenstein, R. (2008), Improved neighborhood-based al-
gorithms for large-scale recommender systems, in 'Proceedings of the 2nd KDD
Workshop on Large-Scale Recommender Systems and the Netflix Prize Competi-
tion’, NETFLIX ’08, pp. 4:1–4:6.
Viswanath, B., Mislove, A., Cha, M. & Gummadi, K. P. (2009), On the evolution
of user interaction in Facebook, in 'Proceedings of the 2nd ACM Workshop on
Online Social Networks’, WOSN ’09, pp. 37–42.
Wang, B., Ester, M., Liao, Y., Bu, J., Zhu, Y., Guan, Z. & Cai, D. (2016), The
million domain challenge: Broadcast email prioritization by cross-domain recom-
mendation, in 'Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining’, KDD’16, pp. 1895–1904.
Wang, H., Wang, N. & Yeung, D.-Y. (2015), Collaborative deep learning for recom-
mender systems, in 'Proceedings of the 21st ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining’, KDD ’15, pp. 1235–1244.
Wang, J. J.-Y., Bensmail, H. & Gao, X. (2013), ‘Multiple graph regularized non-
negative matrix factorization’, Pattern Recogn. 46(10), 2840–2847.
Wang, Y., Liu, Y. & Yu, X. (2012a), Collaborative filtering with aspect-based opin-
ion mining: A tensor factorization approach, in ‘2012 IEEE 12th International
Conference on Data Mining’, ICDM’12, pp. 1152–1157.
Wang, Y., Liu, Y. & Yu, X. (2012b), Collaborative filtering with aspect-based opin-
ion mining: A tensor factorization approach, in ‘2012 IEEE 12th International
Conference on Data Mining’, ICDM’12, pp. 1152–1157.
Wang, Y., Tung, H.-Y., Smola, A. & Anandkumar, A. (2015), Fast and guaran-
teed tensor decomposition via sketching, in ‘Proceedings of the 28th International
Conference on Neural Information Processing Systems - Volume 1’, NIPS’15, MIT
Press, pp. 991–999.
Wei, Y., Zheng, Y. & Yang, Q. (2016), Transfer knowledge between cities, in ‘Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining’, KDD’16, pp. 1905–1914.
Wu, F., Yuan, Z. & Huang, Y. (2017), ‘Collaboratively training sentiment classifiers
for multiple domains’, IEEE Transactions on Knowledge and Data Engineering
29(7), 1370–1383.
Yang, D., He, J., Qin, H., Xiao, Y. & Wang, W. (2015), A graph-based recommenda-
tion across heterogeneous domains, in ‘Proceedings of the 24th ACM International
on Conference on Information and Knowledge Management’, CIKM’15, pp. 463–
472.
Yang, F., Shang, F., Huang, Y., Cheng, J., Li, J., Zhao, Y. & Zhao, R. (2017),
‘Lftf: A framework for efficient tensor analytics at scale’, Proc. VLDB Endow.
10(7), 745–756.
Yoo, J. & Choi, S. (2009), Weighted nonnegative matrix co-tri-factorization for
collaborative prediction, in ‘Proceedings of the 1st Asian Conference on Machine
Learning: Advances in Machine Learning’, ACML’09, pp. 396–411.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. (2010), Spark:
Cluster computing with working sets, in 'Proceedings of the 2nd USENIX Con-
ference on Hot Topics in Cloud Computing’, HotCloud’10, pp. 10–10.
Zhang, F., Yuan, N. J., Lian, D., Xie, X. & Ma, W.-Y. (2016), Collaborative knowl-
edge base embedding for recommender systems, in 'Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining’,
KDD’16, pp. 353–362.
Zhang, J., Chow, C. & Xu, J. (2017), ‘Enabling kernel-based attribute-aware matrix
factorization for rating prediction’, IEEE Transactions on Knowledge and Data
Engineering 29(4), 798–812.
Zhang, L., Zhang, K. & Li, C. (2008), A topical pagerank based algorithm for recom-
mender systems, in ‘Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval’, SIGIR’08,
pp. 713–714.
Zhang, Y. (2014), Browser-oriented universal cross-site recommendation and expla-
nation based on user browsing logs, in ‘Proceedings of the 8th ACM Conference
on Recommender Systems’, RecSys’14, pp. 433–436.
Zhang, Y., Xiong, Y., Kong, X. & Zhu, Y. (2016), Netcycle: Collective evolution in-
ference in heterogeneous information networks, in 'Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining’,
KDD’16, pp. 1365–1374.
Zhao, L., Pan, S. J., Xiang, E. W., Zhong, E., Lu, Z. & Yang, Q. (2013), Active
transfer learning for cross-system recommendation, in ‘Proceedings of the 27th
AAAI Conference on Artificial Intelligence’, AAAI’13, pp. 1205–1211.
Zheng, V. W., Cao, B., Zheng, Y., Xie, X. & Yang, Q. (2010), Collaborative filtering
meets mobile recommendation: A user-centered approach, in ‘Proceedings of the
Twenty-Fourth AAAI Conference on Artificial Intelligence’, AAAI’10, pp. 236–
241.
Zhu, L., Guo, D., Yin, J., Steeg, G. V. & Galstyan, A. (2016), ‘Scalable tempo-
ral latent space inference for link prediction in dynamic social networks’, IEEE
Transactions on Knowledge and Data Engineering 28(10), 2765–2777.
Zou, B., Li, C., Tan, L. & Chen, H. (2015), ‘Gputensor: Efficient tensor factorization
for context-aware recommendations’, Inf. Sci. 299(C), 159–177.